# Scikit-learn Basics

This notebook demonstrates some of the most important and useful functions of the scikit-learn library.

Steps covered:
1. Importing libraries
2. Getting the data ready
3. Choosing the right estimator/algorithm for the problem
4. Fitting the model/algorithm and using it to make predictions
5. Evaluating the model
6. Improving the model
7. Saving and loading the trained model

In [6]:
# 1. Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [7]:
# 2. Getting data ready
dataset = pd.read_csv("heart-disease.csv")

X = dataset.drop('target', axis=1)    # Features matrix
y = dataset['target']    # Labels

In [8]:
# 3. Choose the right model/algorithm
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [16]:
# 4. Fit the model/algorithm and use it to make predictions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fit the model on the training data
clf.fit(X_train, y_train)
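
The split above is random, so the scores later in the notebook change from run to run. A minimal sketch of a reproducible split follows. It assumes synthetic data from `make_classification` as a stand-in for the heart-disease table (303 rows and 13 features are assumptions), and the seed 42 and the `stratify` argument are illustrative choices, not part of the original cell.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 303 rows, 13 features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# A fixed random_state makes the shuffle deterministic;
# stratify keeps the class proportions similar in both parts
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

# Both calls return identical train/test arrays
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(split_a[1].shape)  # shape of the test features
```

Passing `random_state` to `RandomForestClassifier` as well would make the fitted model itself repeatable.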

In [17]:
y_pred = clf.predict(X_test)
y_pred

array([1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 0])

In [18]:
# 5. Evaluating the model
clf.score(X_train, y_train)    # Mean accuracy on the training data

1.0

In [19]:
clf.score(X_test, y_test)    # Mean accuracy on the held-out test data

0.8289473684210527
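
A training accuracy of 1.0 next to a test accuracy of about 0.83 suggests the forest fits the training rows almost perfectly, and a single split gives only one noisy estimate of how it generalizes. Cross-validation averages over several splits. The sketch below assumes synthetic stand-in data and an arbitrary choice of 5 folds and 50 trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption), not the heart-disease table
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy value per fold; each fold is held out once
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(cv_scores)
print(cv_scores.mean())
```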

In [21]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.81      0.82        37
           1       0.82      0.85      0.84        39

    accuracy                           0.83        76
   macro avg       0.83      0.83      0.83        76
weighted avg       0.83      0.83      0.83        76



In [23]:
print(confusion_matrix(y_test, y_pred))

[[30  7]
 [ 6 33]]
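
The numbers in the classification report can be recomputed by hand from this confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so for class 1 the matrix unpacks into true negatives, false positives, false negatives and true positives. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true labels, columns = predictions
cm = np.array([[30, 7],
               [6, 33]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # correct predictions / all predictions
precision_1 = tp / (tp + fp)         # how many predicted 1s are truly 1
recall_1 = tp / (tp + fn)            # how many true 1s were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_1:.3f} recall={recall_1:.3f} f1={f1_1:.3f}")
```

The accuracy works out to 63/76, the same value that `clf.score(X_test, y_test)` reports.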


In [24]:
accuracy_score(y_test, y_pred)

0.8289473684210527

In [26]:
# 6. Improving the model
# Try different values of n_estimators

for i in range(10, 100, 10):
    print(f"Trying model with {i} estimator...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test)}")
    print()

Trying model with 10 estimator...
Model accuracy on test set: 0.7894736842105263

Trying model with 20 estimator...
Model accuracy on test set: 0.7763157894736842

Trying model with 30 estimator...
Model accuracy on test set: 0.8157894736842105

Trying model with 40 estimator...
Model accuracy on test set: 0.8421052631578947

Trying model with 50 estimator...
Model accuracy on test set: 0.8289473684210527

Trying model with 60 estimator...
Model accuracy on test set: 0.8026315789473685

Trying model with 70 estimator...
Model accuracy on test set: 0.8157894736842105

Trying model with 80 estimator...
Model accuracy on test set: 0.8026315789473685

Trying model with 90 estimator...
Model accuracy on test set: 0.8157894736842105
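
The loop above compares settings by their test-set score, and each forest is seeded differently, so part of the variation between runs is likely noise. One hedged alternative is to select `n_estimators` by cross-validation on the training data only and touch the test set once at the end. The sketch below assumes synthetic stand-in data and a small, arbitrary grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Each candidate is scored with 5-fold cross-validation on the training part
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50, 90]},
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the refitted best model on the test part
```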



In [27]:
# 7. Save and load the trained model
import pickle

# pickle.dump expects an open binary file object, not a file name
with open("random_forest_classfier.pk", "wb") as f:
    pickle.dump(clf, f)    # Saves the model as a binary file
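
The cell above covers saving only. A minimal sketch of a full save and load round trip follows. The throwaway model, the synthetic data and the temporary file name `demo_model.pkl` are all hypothetical and used for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Throwaway model on synthetic data (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Hypothetical path inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

# Save: open the file in binary write mode and dump the estimator
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load: open the file in binary read mode and restore the estimator
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# The restored model makes the same predictions as the original
same_predictions = np.array_equal(model.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.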

# Dealing with a dataset that has missing values

In [38]:
dataset = pd.read_csv("../DATA/car-sales-extended-missing-data.csv")
dataset

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


# Fill the missing data with pandas

Option 1

In [75]:
# Fill the Make column
dataset.fillna({"Make":"missing"}, inplace=True)

# Fill the Colour column
dataset.fillna({"Colour":"missing"}, inplace=True)

# Fill the Odometer column with its mean
dataset.fillna({"Odometer (KM)": dataset["Odometer (KM)"].mean()}, inplace=True)

# Fill the doors column
dataset.fillna({"Doors":4}, inplace=True)

# Remove the rows with a missing Price value
dataset.dropna(inplace=True)
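
The same filling logic can be checked on a small frame where the result is easy to verify by eye. The four rows below are made-up toy data, and the fills are combined into one `fillna` call with a dictionary that maps each column to its fill value.

```python
import numpy as np
import pandas as pd

# Made-up toy rows (assumption), with one gap per column
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Odometer (KM)": [35431.0, np.nan, 192714.0, 84714.0],
    "Doors": [4.0, 3.0, np.nan, 4.0],
    "Price": [15323.0, 5716.0, 19943.0, np.nan],
})

# One dictionary: column name -> value used for its missing entries
filled = toy.fillna({
    "Make": "missing",
    "Odometer (KM)": toy["Odometer (KM)"].mean(),
    "Doors": 4,
})

# Rows without a target value cannot be used for training, so drop them
filled = filled.dropna(subset=["Price"])

print(filled)
print(filled.isna().sum())
```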

In [76]:
# Creating X and y
X = dataset.drop("Price", axis=1)
y = dataset["Price"]

In [81]:
# Turn textual data into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

textual_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, textual_features)], remainder="passthrough")

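# Note: fitting on `dataset` (not `X`) also passes the Price column through,
# which is why it appears as the last column of the output below
X = transformer.fit_transform(dataset)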
X

array([[1.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [1.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [1.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [1.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [1.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

# Fill the missing data with scikit-learn

Option 2

In [83]:
dataset = pd.read_csv("../DATA/car-sales-extended-missing-data.csv")
dataset.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [85]:
dataset.dropna(subset=["Price"], inplace=True)
dataset.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [106]:
# Split X and y
X = dataset.drop("Price", axis=1)
y = dataset["Price"]

X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

In [109]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing', Doors with 4, and numerical values with the mean
text_values = SimpleImputer(strategy="constant", fill_value="missing")
door_values = SimpleImputer(strategy="constant", fill_value=4)
num_values = SimpleImputer(strategy="mean")

# Define columns
text_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer
imputer = ColumnTransformer([
    ("text_values", text_values, text_features),
    ("door_values", door_values, door_features),
    ("num_values", num_values, num_features)
])

X = imputer.fit_transform(X)
X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [110]:
X = pd.DataFrame(X, columns=["Make", "Colour", "Doors", "Odometer (KM)"])
X.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64
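
The imputed frame still contains text columns, so it needs one-hot encoding before a model can use it. One way to chain imputation, encoding and a model is a `Pipeline`, which also lets the imputers be fitted on training rows only. The sketch below is an assumption-laden illustration rather than the notebook's own method: the six rows are made-up toy data, and `RandomForestRegressor` is chosen here because Price is numeric.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy rows (assumption); np.nan marks the gaps, as pd.read_csv would
toy_X = pd.DataFrame({
    "Make": pd.Series(["Honda", "BMW", np.nan, "Toyota", "Nissan", "Honda"], dtype=object),
    "Colour": pd.Series(["White", "Blue", "White", np.nan, "Blue", "Black"], dtype=object),
    "Doors": [4.0, 5.0, 4.0, np.nan, 3.0, 4.0],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 154365.0, 181577.0, 84714.0],
})
toy_y = pd.Series([15323.0, 19943.0, 28343.0, 13434.0, 14043.0, 9500.0])

# Text columns: fill with "missing", then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Doors: fill with 4, then treat the door count as a category
doors = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Odometer: fill with the column mean
numeric = SimpleImputer(strategy="mean")

preprocess = ColumnTransformer([
    ("categorical", categorical, ["Make", "Colour"]),
    ("doors", doors, ["Doors"]),
    ("numeric", numeric, ["Odometer (KM)"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# fit() learns the fill values and categories, then trains the regressor
model.fit(toy_X, toy_y)
predictions = model.predict(toy_X)
print(predictions.shape)
```

On real data, `model.fit` would be called on a training split and `model.predict` or `model.score` on a separate test split.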