The "Loading Data" section of this URL:

https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python

explains how to load a dataset:

1. Load the data in the same way.
2. Clean the data if necessary.
3. Use plotly to draw whatever plots you think are needed to understand the data.
4. Use the classification methods seen so far to classify the target. Which one gives the best results?
5. Try to beat your own score by changing the features the algorithms use.

In [1]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()

In [3]:
X = cancer.data
y = cancer.target

### Data study

In [4]:
import pandas as pd

In [5]:
df = pd.DataFrame(X, columns=cancer.feature_names)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [10]:
print(df.shape)
for col in df:
    print(f'{col} unique values', df[col].nunique())

(569, 30)
mean radius unique values 456
mean texture unique values 479
mean perimeter unique values 522
mean area unique values 539
mean smoothness unique values 474
mean compactness unique values 537
mean concavity unique values 537
mean concave points unique values 542
mean symmetry unique values 432
mean fractal dimension unique values 499
radius error unique values 540
texture error unique values 519
perimeter error unique values 533
area error unique values 528
smoothness error unique values 547
compactness error unique values 541
concavity error unique values 533
concave points error unique values 507
symmetry error unique values 498
fractal dimension error unique values 545
worst radius unique values 457
worst texture unique values 511
worst perimeter unique values 514
worst area unique values 544
worst smoothness unique values 411
worst compactness unique values 529
worst concavity unique values 539
worst concave points unique values 492
worst symmetry unique values 500
worst f

In [12]:
pd.Series(y).nunique()

2

Every feature column is numeric (float64), while the target takes only two distinct values, so it is categorical. Predicting it is therefore a binary classification problem, and we need classification models.
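As a quick check on the target, the sketch below maps the numeric codes to the class names that ship with the dataset (`cancer.target_names`) and counts each class. The variable names `labels` and `class_counts` are illustrative only and are not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Map the numeric target codes (0, 1) to their names from the dataset itself
code_to_name = {code: str(name) for code, name in enumerate(cancer.target_names)}
labels = pd.Series(cancer.target).map(code_to_name)

# Count how many samples fall in each class
class_counts = labels.value_counts()
print(class_counts)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading a plain accuracy score later on.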

### Data visualisation

In [13]:
import plotly.express as px

In [16]:
hist_y = pd.DataFrame(y)

In [24]:
fig = px.histogram(hist_y) 
fig.show()

### Modelling

In [29]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split 

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [31]:
# Create a pipeline

# Any classifier works as a placeholder here. The grid search swaps it out
# as it tries each candidate, but the pipeline needs one to start with.
pipe = Pipeline(steps=[('classifier', RandomForestClassifier())])


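# Note: in recent scikit-learn versions the default solver ('lbfgs') does not
# support the 'l1' penalty, so those candidates fail to fit and GridSearchCV
# scores them as NaN (its default error_score).
logistic_params = {
    'classifier': [LogisticRegression()],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__C': np.logspace(0, 4, 10)
    }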

random_forest_params = {
    'classifier': [RandomForestClassifier()],
    'classifier__n_estimators': [10, 100, 1000],
    'classifier__max_features': [1, 2, 3]
    }

to_test = np.arange(1, 6)

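# Note: 'degree' only affects the 'poly' kernel, which is not in the kernel
# list below, and 'coef0' only affects 'poly' and 'sigmoid'. Both still
# multiply the number of candidates that get fitted.
svm_params = {
    'classifier': [svm.SVC()],
    'classifier__kernel': ('linear', 'rbf', 'sigmoid'),
    'classifier__C': [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'classifier__degree': to_test,
    'classifier__coef0': [-10., -1., 0., 0.1, 0.5, 1, 10, 100],
    'classifier__gamma': ('scale', 'auto')
    }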

# Hyperparameter tuning
# Create space of candidate learning algorithms and their hyperparameters
search_space = [
    logistic_params,
    random_forest_params,
    svm_params
    ]
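To get a feel for how much work this search space represents, the sketch below counts the candidates in each grid with `ParameterGrid`, which expands a grid the same way `GridSearchCV` does. The dictionary `grids` and the names `sizes` and `total` are illustrative; the grid values are copied from the cell above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

# Same grids as in the notebook, keyed by a label for reporting
grids = {
    "logistic": {
        "classifier": [LogisticRegression()],
        "classifier__penalty": ["l1", "l2"],
        "classifier__C": np.logspace(0, 4, 10),
    },
    "random_forest": {
        "classifier": [RandomForestClassifier()],
        "classifier__n_estimators": [10, 100, 1000],
        "classifier__max_features": [1, 2, 3],
    },
    "svm": {
        "classifier": [SVC()],
        "classifier__kernel": ("linear", "rbf", "sigmoid"),
        "classifier__C": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        "classifier__degree": np.arange(1, 6),
        "classifier__coef0": [-10.0, -1.0, 0.0, 0.1, 0.5, 1, 10, 100],
        "classifier__gamma": ("scale", "auto"),
    },
}

# Number of hyperparameter combinations per grid and overall
sizes = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
total = sum(sizes.values())
print(sizes)   # {'logistic': 20, 'random_forest': 9, 'svm': 1920}
print(total)   # 1949
```

With 10-fold cross-validation, each candidate is fitted 10 times, so the search runs 19,490 fits before the final refit. Almost all of them come from the SVM grid, and dropping `degree` (ignored by the kernels listed) would cut that grid from 1920 to 384 candidates.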

In [32]:
%%time

cv = RepeatedKFold(n_splits=10, n_repeats=1, random_state=1)
# Create grid search 
clf = GridSearchCV(estimator=pipe, param_grid=search_space, cv=cv, verbose=0, n_jobs=-1)

# Fit grid search
best_model = clf.fit(X_train, y_train)

# View best model
separator = "\n############################\n"
print(separator)
print("best estimator:", best_model.best_estimator_.get_params()['classifier'])
print(separator)
print("clf.best_params_", clf.best_params_)
print(separator)
# Mean cross-validated score of the best_estimator
print("clf.best_score", clf.best_score_)


############################

best estimator: RandomForestClassifier(max_features=2)

############################

clf.best_params_ {'classifier': RandomForestClassifier(max_features=2), 'classifier__max_features': 2, 'classifier__n_estimators': 100}

############################

clf.best_score 0.9691787439613527



In [34]:
best_model.best_estimator_.score(X_test,y_test)

0.9385964912280702
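For the last step of the exercise, one thing worth trying is standardising the features before a scale-sensitive model such as an SVM. The sketch below is an assumption about what might help, not the notebook's method: it adds a `StandardScaler` step in front of an RBF `SVC` with default-like settings and scores it on the same train/test split. The names `scaled_svc` and `test_accuracy` are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = datasets.load_breast_cancer()

# Same split as the notebook (test_size=0.2, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=2
)

# Scale inside the pipeline so the scaler is fitted on training data only
scaled_svc = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
scaled_svc.fit(X_train, y_train)

test_accuracy = scaled_svc.score(X_test, y_test)
print(f"Scaled SVC test accuracy: {test_accuracy:.4f}")
```

Comparing this number with the test score above shows whether scaling helps on this split. The same `scaler` step could also be added to `pipe` so that every candidate in the grid search benefits from it.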