# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the Scikit-Learn library.

Contents:

0. An end-to-end Scikit-learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm/model for our problems
3. Fit the model/algorithm/estimator and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

## 0. An end-to-end Scikit-learn workflow

In [2]:
# 1. Get the data ready
import pandas as pd
import numpy as np
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
# Create X (features matrix)
X = heart_disease.drop("target", axis = 1)

# Create y (labels)
y = heart_disease["target"]

In [4]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier

# Convention: name the estimator clf (classifier) or model
clf = RandomForestClassifier()

# We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [5]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [6]:
clf.fit(X_train, y_train);

In [7]:
# Make a prediction (this raises an error: predict expects a 2D array with the same features as X)
y_label = clf.predict(np.array([0,2,3,4]))

ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
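
The error above comes from passing a 1D array to `predict`. Below is a minimal sketch of the reshape it suggests, using synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the heart disease data). The names `X_demo`, `demo_clf` and `single_sample` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 4 features (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=42)
demo_clf.fit(X_demo, y_demo)

# A single sample is 1D, shape (4,); predict expects 2D, shape (n_samples, n_features)
single_sample = np.array([0, 2, 3, 4])
single_pred = demo_clf.predict(single_sample.reshape(1, -1))  # reshaped to (1, 4)
print(single_pred.shape)
```

The sample also needs as many values as the model has features, so for the heart disease model a four-value array would still fail after reshaping.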

In [8]:
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1], dtype=int64)

In [9]:
# 4. Evaluate the model on the training data and testing data
clf.score(X_train, y_train)


1.0

In [10]:
clf.score(X_test, y_test)

0.819672131147541

In [11]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.78      0.81      0.79        26
           1       0.85      0.83      0.84        35

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



In [12]:
confusion_matrix(y_test, y_preds)

array([[21,  5],
       [ 6, 29]], dtype=int64)
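
As a quick check, the headline numbers in the classification report can be recomputed by hand from this confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and uses hypothetical variable names.

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels, columns are predictions
cm = np.array([[21, 5],
               [6, 29]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision_1 = tp / (tp + fp)      # of the predicted 1s, how many are truly 1
recall_1 = tp / (tp + fn)         # of the true 1s, how many were found

print(f"accuracy={accuracy:.2f}, precision={precision_1:.2f}, recall={recall_1:.2f}")
```

These come out as 0.82, 0.85 and 0.83, matching the accuracy row and the class 1 row of the report.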

In [13]:
accuracy_score(y_test, y_preds)

0.819672131147541

In [14]:
# 5. Improve a model
# Try different values of n_estimators (one of the hyperparameters, a.k.a. the dials on an ML model)
np.random.seed(42)

for i in range(10, 100, 10):
    print(f"trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators = i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")

trying model with 10 estimators...
Model accuracy on test set: 73.77%

trying model with 20 estimators...
Model accuracy on test set: 81.97%

trying model with 30 estimators...
Model accuracy on test set: 77.05%

trying model with 40 estimators...
Model accuracy on test set: 78.69%

trying model with 50 estimators...
Model accuracy on test set: 81.97%

trying model with 60 estimators...
Model accuracy on test set: 80.33%

trying model with 70 estimators...
Model accuracy on test set: 81.97%

trying model with 80 estimators...
Model accuracy on test set: 80.33%

trying model with 90 estimators...
Model accuracy on test set: 81.97%



In [15]:
# 6. Save a model and load it
import pickle 

# Use a context manager so the file is closed after writing
with open("random_forest_model_1.pkl", "wb") as f:  # write binary
    pickle.dump(clf, f)

In [16]:
with open("random_forest_model_1.pkl", "rb") as f:  # read binary
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.819672131147541
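
An alternative to `pickle` is `joblib`, which is installed alongside scikit-learn. The sketch below is one possible way to use it, on a synthetic stand-in model and a temporary directory (both assumptions for illustration; the file name `demo_model.joblib` is hypothetical).

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_clf, model_path)
    loaded_clf = joblib.load(model_path)

# The loaded model should make the same predictions as the original
same_predictions = (loaded_clf.predict(X_demo) == demo_clf.predict(X_demo)).all()
print(same_predictions)
```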

## 1. Getting data ready to be used with ML

Three main things to do:
1. Split the data into features and labels (usually 'X' and 'y')
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (a.k.a. feature encoding)

In [17]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [18]:
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

In [19]:
# Split the data into training and test sets

from sklearn.model_selection import train_test_split
# train_test_split returns the splits in this order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [20]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [21]:
X.shape[0] * 0.8

242.4
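
303 * 0.8 = 242.4 is not a whole number of rows. The shapes above show how the split resolves this: 61 rows go to the test set and the remaining 242 to training. The sketch below reproduces that on dummy arrays of the same shape (the arrays and variable names are placeholders, not the real data) and spells out the return order of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with the same shape as the heart disease data (303 rows, 13 features)
X_demo = np.zeros((303, 13))
y_demo = np.zeros(303)

# Return order is: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```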

### 1.1 Make sure it's all numerical

In [22]:
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [23]:
len(car_sales)

1000

In [24]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [25]:
# Split into X/y
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# Split into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [26]:
# Build ML model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

In [None]:
X.head()

In [None]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] #Doors can be considered as categorical
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder = "passthrough")
# ColumnTransformer applies the one-hot encoder to the categorical features and
# passes the remaining columns through unchanged (remainder = "passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
pd.DataFrame(transformed_X).head() 
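
To see what the transformer produces, here is a minimal sketch on a tiny hand-made frame with the same kinds of columns as `car_sales` (the three rows and the names `demo` and `demo_transformer` are made up for illustration). Each distinct category becomes its own 0/1 column, and the passthrough column is appended at the end.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical frame with the same kinds of columns as car_sales
demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 4],
    "Odometer (KM)": [35431, 192714, 84714],
})

demo_transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
)
encoded = demo_transformer.fit_transform(demo)

# The result may be a sparse matrix or a dense array; convert to dense for display
encoded = encoded.toarray() if sparse.issparse(encoded) else np.asarray(encoded)

# 2 makes + 2 colours + 2 door counts + 1 passthrough column
print(encoded.shape)
```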

In [None]:
# alternative to one-hot-encoder: dummy variables
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies.head()

In [None]:
# Doors is numeric, so get_dummies left it as is; casting it to str to check:
car_sales2 = car_sales.copy()
car_sales2["Doors"] = car_sales2["Doors"].astype(str)

In [None]:
car_sales2.dtypes

In [None]:
dummies2 = pd.get_dummies(car_sales2[["Make","Colour","Doors"]])
dummies2.head()
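
Casting to `str` is one way to get `Doors` encoded. Another option, sketched below on a small made-up frame, is the `columns` parameter of `pd.get_dummies`, which encodes the listed columns regardless of dtype. The frame and variable names are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Doors": [4, 5, 3],
})

# Default: only object/category columns are encoded, the integer Doors column is kept as is
default_dummies = pd.get_dummies(demo)

# Naming the columns explicitly encodes Doors as well, without changing its dtype first
explicit_dummies = pd.get_dummies(demo, columns=["Make", "Doors"])

print(list(default_dummies.columns))
print(list(explicit_dummies.columns))
```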

In [None]:
# Let's refit the model
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train), model.score(X_test, y_test)

### 1.2 What if there were missing values?

1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether

In [27]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.head(10)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
5,Honda,Red,42652.0,4.0,23883.0
6,Toyota,Blue,163453.0,4.0,8473.0
7,Honda,White,,4.0,20306.0
8,,White,130538.0,4.0,9374.0
9,Honda,Blue,51029.0,4.0,26683.0


In [28]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

## Side notes from my own checks (can most likely be skipped)

In [29]:
# To use pd.get_dummies on Doors it has to be categorical first, but .astype(str) would turn the NaNs into strings too, hence:
car_sales_missing["Doors"] = car_sales_missing["Doors"].astype("category")

In [30]:
car_sales_missing_dummy = pd.get_dummies(car_sales_missing[["Make", "Colour", "Doors"]])
car_sales_missing_dummy.isna().sum()
# Careful: by default get_dummies adds no column for missing values; rows with NaN become all zeros, so isna() finds nothing

Make_BMW        0
Make_Honda      0
Make_Nissan     0
Make_Toyota     0
Colour_Black    0
Colour_Blue     0
Colour_Green    0
Colour_Red      0
Colour_White    0
Doors_3.0       0
Doors_4.0       0
Doors_5.0       0
dtype: int64
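
The zeros above do not mean the missing values were handled. A minimal sketch on a three-row made-up frame shows what happens to a NaN by default and with `dummy_na=True` (the frame and variable names are hypothetical).

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Make": ["Honda", np.nan, "BMW"]})

# Default behaviour: the NaN row gets no column of its own and becomes all zeros
plain = pd.get_dummies(demo)

# dummy_na=True adds an extra indicator column for missing values
with_na = pd.get_dummies(demo, dummy_na=True)

print(plain)
print(with_na)
```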

---

### Dealing with NaN errors
#### 1. Fill missing data with Pandas

In [31]:
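# Assign the result back instead of calling fillna(inplace = True) on a single column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)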

In [32]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

If the label is missing, the row might just as well be deleted.

In [33]:
# remove rows with missing price value
car_sales_missing.dropna(inplace=True)
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [34]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [35]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] #Doors can be considered as categorical
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder = "passthrough")
# ColumnTransformer applies the one-hot encoder to the categorical features and
# passes the remaining columns through unchanged (remainder = "passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

Worth remembering: the NaNs were removed here because `OneHotEncoder` couldn't work with NaNs in older scikit-learn versions (this is no longer the case).

#### 2. Fill missing values with Scikit-learn

In [36]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [37]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [38]:
# Drop the rows with no labels
car_sales_missing.dropna(subset = ["Price"], inplace = True)

In [39]:
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [40]:
# Split into X and y
X = car_sales_missing.drop(axis = 1, labels =  "Price")
y = car_sales_missing["Price"]

# Split data into train and test sets
from sklearn.model_selection import train_test_split
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                     y,
                                                     test_size = 0.2)

In [41]:
X_train.isna().sum()

Make             35
Colour           38
Odometer (KM)    36
Doors            38
dtype: int64

In [42]:
X_train.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
986,Honda,White,71934.0,4.0
297,Toyota,Red,162665.0,4.0
566,Honda,White,42844.0,4.0
282,Honda,White,195829.0,4.0
109,Honda,Blue,219217.0,4.0


In [43]:
# fill missing values with scikit learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# fill categorical values with 'missing' and numerical with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value = "missing") # if value is constant we have to pass fill_value that will be used to fill empty values
door_imputer = SimpleImputer(strategy = "constant", fill_value = 4) # constant because the filling value remains the same for all defined columns
num_imputer = SimpleImputer(strategy = "mean")

# Define columns
cat_features  = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features), # name of imputer, imputer to be used on the specified features
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)
  

In [44]:
# Get transformed data array back into data frame format
car_sales_filled_train = pd.DataFrame(filled_X_train,
                                     columns = ["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled_test = pd.DataFrame(filled_X_test,
                                     columns = ["Make", "Colour", "Doors", "Odometer (KM)"])
# check missing data in training data set
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64
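
The imputer above is fitted on the training set and only applied to the test set, so the fill values come from training data alone. Here is a minimal sketch of that behaviour with a single numeric column (the odometer values and variable names are made up for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical odometer readings with gaps
train_km = np.array([[10_000.0], [20_000.0], [np.nan]])
test_km = np.array([[np.nan], [100_000.0]])

demo_imputer = SimpleImputer(strategy="mean")
filled_train = demo_imputer.fit_transform(train_km)  # learns the mean (15,000) from train only
filled_test = demo_imputer.transform(test_km)        # reuses the training mean on the test set

print(filled_test.ravel())
```

The missing test value is filled with 15,000 (the training mean) and is not influenced by the 100,000 in the test set.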

In [45]:
# Make categorical features into numerical
cat_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one hot",
                                  one_hot,
                                  cat_features)],
                                remainder = "passthrough")

# Transform the train and test sets separately (fit on train only)
transformed_X_train = transformer.fit_transform(car_sales_filled_train)
transformed_X_test = transformer.transform(car_sales_filled_test)
transformed_X_train.toarray()

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 7.19340e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.62665e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 4.28440e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.96225e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.33117e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.50582e+05]])

In [46]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

# Make sure to use the transformed (filled and one-hot encoded) X data
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.21229043336119102
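
For regressors, `score` returns the coefficient of determination (R^2), so the value above is not an accuracy. The sketch below checks this against `r2_score` on synthetic data from `make_regression` (a stand-in for illustration, not the car sales data; all variable names are hypothetical).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_tr, y_tr)

# Both lines compute R^2 on the test set
score_from_model = demo_model.score(X_te, y_te)
score_from_metric = r2_score(y_te, demo_model.predict(X_te))
print(np.isclose(score_from_model, score_from_metric))
```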

## 2. Choosing the right estimator/algorithm for our problem

Scikit-learn uses the word 'estimator' as another term for a machine learning model or algorithm.

* Classification - predicting whether a sample is one thing or another
* Regression - predicting a number

Step 1 - Check the Scikit-learn ML map https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html


### 2.1 Picking a machine learning model for a regression problem

In [73]:
# Import Boston housing dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
boston;
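
If `load_boston` is unavailable in your scikit-learn version, a bundled regression dataset can stand in for experimenting with the same steps. The sketch below uses `load_diabetes`, which is a different dataset, so none of the numbers in this notebook will be reproduced by it. The names `diabetes` and `diabetes_df` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# load_diabetes ships with scikit-learn, so no download is needed
diabetes = load_diabetes()
diabetes_df = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
diabetes_df["target"] = diabetes["target"]

print(diabetes_df.shape)
```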

In [68]:
boston_df = pd.DataFrame(boston["data"], columns = boston["feature_names"])
X = boston_df.copy()  # copy, so adding "target" to boston_df later does not also add it to X
boston_df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


In [61]:
y = pd.Series(boston["target"])
boston_df["target"] = y
boston_df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,22.0


In [67]:
# How many samples:
len(boston_df)

506

![](../Pictures/ml_map.png)

Following the map, we end up at "few features should be important". We have 13 features and don't know yet whether that's the case, so let's answer "no" for now, which leads to `Ridge` (ridge regression) or `SVR(kernel='linear')`.

The image above is interactive on the Scikit-learn page and links to the documentation page of each estimator.

In [70]:
# Let's try Ridge Regression model
from sklearn.linear_model import Ridge

# set up random seed for reproducibility 
np.random.seed(42)

# Create the data 
display(X, y)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Length: 506, dtype: float64

In [89]:
# Split into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate Ridge model
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the ridge model on test data
model.score(X_test, y_test)

0.760685177585062

How do we improve this score?

What if Ridge wasn't working?

Refer back to map https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [97]:
# Let's try random forest regressor
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Split data into train and test sets
X_train, X_test,y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate Regressor model
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

#Evaluate random forest regressor 
rf.score(X_test, y_test)

0.8654448653350507

In [98]:
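# Note: model (Ridge) was fit on an earlier, different split, so this is not a like-for-like comparison
model.score(X_test, y_test), rf.score(X_test, y_test)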

(0.7021035461175933, 0.8654448653350507)
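
For a fair comparison, both models should be fitted and scored on the same split. A minimal sketch of that pattern, on synthetic data from `make_regression` (a stand-in, not the Boston data, so the scores say nothing about which model is better there; all names are hypothetical):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Boston data)
X_demo, y_demo = make_regression(n_samples=300, n_features=13, noise=15.0, random_state=42)

# One split, shared by both models
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit every model on the same training set and score it on the same test set
demo_scores = {name: est.fit(X_tr, y_tr).score(X_te, y_te) for name, est in demo_models.items()}
print(demo_scores)
```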

### 2.2 Choosing an estimator for a classification problem

... let's go to the map https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [118]:
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [119]:
len(heart_disease)

303

In [123]:
# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

Consulting the map, it says to try `LinearSVC`.

In [143]:
# set up random seed
np.random.seed(42)

# divide data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


# Import LinearSVC estimator class 
from sklearn.svm import LinearSVC

# Instantiate LinearSVC
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.score(X_train, y_train), clf.score(X_test, y_test)



(0.7933884297520661, 0.8688524590163934)

In [144]:
# The fit above raised a warning: SVMs are not scale invariant, so normalising the data is highly recommended

# set up random seed
np.random.seed(42)

# divide data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Normalisation (min-max scaling)
# Note: only X_train is scaled here; X_test is left unscaled
from sklearn import preprocessing
x = X_train.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
X_train = pd.DataFrame(x_scaled, columns = X_train.columns)
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,0.270833,1.0,0.333333,0.265306,0.378753,0.0,0.5,0.649123,0.0,0.000000,1.0,0.00,0.666667
1,0.604167,1.0,0.000000,0.571429,0.321016,0.0,0.0,0.201754,1.0,0.142857,1.0,0.00,1.000000
2,0.354167,1.0,0.666667,0.571429,0.230947,0.0,0.5,0.517544,0.0,0.642857,0.5,0.00,0.666667
3,0.541667,0.0,0.333333,0.418367,0.274827,0.0,0.0,0.640351,0.0,0.250000,0.5,0.00,0.666667
4,0.645833,1.0,0.000000,0.234694,0.228637,1.0,0.5,0.631579,1.0,0.250000,1.0,0.50,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,0.437500,1.0,0.666667,0.469388,0.235566,0.0,0.5,0.657895,0.0,0.107143,0.5,0.25,1.000000
238,0.458333,1.0,0.666667,0.000000,0.221709,0.0,0.5,0.578947,1.0,0.000000,1.0,0.25,1.000000
239,0.833333,1.0,1.000000,0.673469,0.237875,1.0,0.0,0.377193,0.0,0.017857,0.5,0.25,0.666667
240,0.354167,1.0,0.000000,0.265306,0.272517,0.0,0.0,0.491228,0.0,0.142857,1.0,0.00,1.000000


In [145]:
clf = LinearSVC()
clf.fit(X_train, y_train)
clf.score(X_train, y_train), clf.score(X_test, y_test)

(0.859504132231405, 0.47540983606557374)

In [159]:
# Standardisation
from sklearn.preprocessing import StandardScaler

# set up random seed
np.random.seed(42)

# divide data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
X_train_scaled = pd.DataFrame(X_scaled, columns = X_train.columns)
X_train_scaled

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,-1.356798,0.722504,0.008099,-0.616856,0.914034,-0.383301,0.843133,0.532781,-0.676632,-0.920864,0.953905,-0.689701,-0.509048
1,0.385086,0.722504,-0.971891,1.169491,0.439527,-0.383301,-1.046109,-1.753582,1.477907,-0.193787,0.953905,-0.689701,1.178480
2,-0.921327,0.722504,0.988089,1.169491,-0.300704,-0.383301,0.843133,-0.139679,-0.676632,2.350982,-0.694988,-0.689701,-0.509048
3,0.058483,-1.384075,0.008099,0.276318,0.059921,-0.383301,-1.046109,0.487950,-0.676632,0.351521,-0.694988,-0.689701,-0.509048
4,0.602822,0.722504,-0.971891,-0.795490,-0.319684,2.608918,0.843133,0.443119,1.477907,0.351521,0.953905,1.333421,1.178480
...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,-0.485856,0.722504,0.988089,0.574042,-0.262744,-0.383301,0.843133,0.577611,-0.676632,-0.375556,-0.694988,0.321860,1.178480
238,-0.376988,0.722504,0.988089,-2.165023,-0.376625,-0.383301,0.843133,0.174136,1.477907,-0.920864,0.953905,0.321860,1.178480
239,1.582631,0.722504,1.968079,1.764940,-0.243763,2.608918,-1.046109,-0.856969,-0.676632,-0.829979,-0.694988,0.321860,-0.509048
240,-0.921327,0.722504,-0.971891,-0.616856,0.040941,-0.383301,-1.046109,-0.274171,-0.676632,-0.193787,0.953905,-0.689701,1.178480


In [161]:
clf = LinearSVC()
clf.fit(X_train_scaled, y_train)
clf.score(X_train_scaled, y_train), clf.score(X_test, y_test)

(0.859504132231405, 0.5081967213114754)

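For a binary problem, a test score of around 0.5 is no better than guessing. Note that both scaled models above were scored on the unscaled `X_test`, which most likely explains the drop.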
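
One way to make sure the test data goes through the same scaling as the training data is a pipeline. The following is a minimal sketch on synthetic data from `make_classification` (a stand-in, not the heart disease data, so the scores are not comparable to the ones above; all names are hypothetical).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data with 13 features (hypothetical, not the heart disease data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, n_clusters_per_class=1,
    class_sep=2.0, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only, then applies
# the same fitted scaling whenever it predicts or scores
scaled_svc = make_pipeline(StandardScaler(), LinearSVC())
scaled_svc.fit(X_tr, y_tr)

train_score = scaled_svc.score(X_tr, y_tr)
test_score = scaled_svc.score(X_te, y_te)
print(train_score, test_score)
```

Without a pipeline, the equivalent is to call `scaler.transform(X_test)` with the scaler fitted on `X_train` before scoring.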

In [164]:
# import the randomforestclassifier 
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

# make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# instantiate Random Forest Classifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# evaluate
rf.score(X_test, y_test)

0.8524590163934426

Nice.

There's a tidbit in the ML community:
1. If you're working with structured data, chances are high that ensemble methods will perform well
2. If you're working with unstructured data, use deep learning or transfer learning