# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful
Scikit-Learn library.

0. An end-to-end Scikit-Learn workflow.
1. Getting the data ready.
2. Choosing the right estimator/algorithm for our problem.
3. Fitting the model/algorithm and using it to make predictions on our data.
4. Evaluating the model.
5. Improving a model.
6. Saving and loading a trained model.
7. Putting it all together!

In [1]:
import numpy as np
import pandas as pd

# 0. An end-to-end Scikit-Learn workflow

In [2]:
# 1. Get the data ready
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# Create X (feature matrix)
X = heart_disease.drop("target", axis=1)

# Create Y (labels)
Y = heart_disease["target"]

In [4]:
# 2. Choose the model for predictions
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

# Keeping the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [5]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [6]:
clf.fit(x_train,y_train);

In [7]:
# Making a prediction on a 1D array (fails: predict() expects a 2D array)
y_label = clf.predict(np.array([0,3,2,4]));



ValueError: Expected 2D array, got 1D array instead:
array=[0. 3. 2. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
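The error above arises because scikit-learn estimators expect a 2D array of shape `(n_samples, n_features)`. The following is a minimal sketch of the fix for a single sample, assuming a small synthetic dataset from `make_classification` with four features (the names `X_demo`, `y_demo` and `demo_clf` are illustrative and not part of the notebook). Note that the array must also have as many features as the model was trained on, which is 13 for the heart-disease data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 100 samples, 4 features.
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

single_sample = np.array([0, 3, 2, 4])           # shape (4,): 1D, rejected by predict()
single_sample_2d = single_sample.reshape(1, -1)  # shape (1, 4): one sample, four features

prediction = demo_clf.predict(single_sample_2d)
print(single_sample_2d.shape, prediction.shape)
```

`reshape(1, -1)` means "one row, and as many columns as needed", which is the shape for a single sample.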

In [8]:
y_preds = clf.predict(x_test)
y_preds

array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0], dtype=int64)

In [9]:
y_test

145    1
104    1
268    0
266    0
197    0
      ..
280    0
132    1
204    0
259    0
255    0
Name: target, Length: 61, dtype: int64

In [10]:
# 4. Evaluate the model on the training data and the testing data
clf.score(x_train,y_train)

1.0

In [11]:
clf.score(x_test,y_test)

0.7049180327868853

In [12]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.69      0.64      0.67        28
           1       0.71      0.76      0.74        33

    accuracy                           0.70        61
   macro avg       0.70      0.70      0.70        61
weighted avg       0.70      0.70      0.70        61



In [13]:
print(confusion_matrix(y_test, y_preds))

[[18 10]
 [ 8 25]]


In [14]:
print(accuracy_score(y_test, y_preds))

0.7049180327868853
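The accuracy, precision and recall reported above can all be derived from the confusion matrix. As a small worked sketch, using only the four counts printed above (rows are true labels, columns are predicted labels):

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[18, 10],
               [8, 25]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of the samples predicted as 1, how many were truly 1
recall_1 = tp / (tp + fn)         # of the samples that are truly 1, how many were found

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} recall={recall_1:.2f}")
```

This gives an accuracy of 43/61 (about 0.7049), and a precision of 0.71 and recall of 0.76 for class 1, matching the classification report.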


In [15]:
# 5. Improving the model 
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying with {i} estimators.......")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train,y_train)
    print(f"Model accuracy score on the test : {clf.score(x_test,y_test) * 100:.2f}% ")

Trying with 10 estimators.......
Model accuracy score on the test : 68.85% 
Trying with 20 estimators.......
Model accuracy score on the test : 75.41% 
Trying with 30 estimators.......
Model accuracy score on the test : 77.05% 
Trying with 40 estimators.......
Model accuracy score on the test : 78.69% 
Trying with 50 estimators.......
Model accuracy score on the test : 75.41% 
Trying with 60 estimators.......
Model accuracy score on the test : 77.05% 
Trying with 70 estimators.......
Model accuracy score on the test : 70.49% 
Trying with 80 estimators.......
Model accuracy score on the test : 78.69% 
Trying with 90 estimators.......
Model accuracy score on the test : 73.77% 


In [16]:
# 6. Save a model and load it
import pickle

# Use a context manager so the file is closed after writing
with open("random_forest_model01.pkl", "wb") as f:
    pickle.dump(clf, f)

In [17]:
# Load the saved model back and score it
with open("random_forest_model01.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test,y_test)

0.7377049180327869
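A quick way to check that a saved model survives the round trip is to compare its predictions before and after loading. The sketch below assumes synthetic data and a temporary directory, so it does not touch `random_forest_model01.pkl`; the file name `demo_model.pkl` and the `demo_*` variables are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), not the heart-disease dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)

    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from sources you trust, and preferably with the same scikit-learn version that wrote them.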

# 1. Getting our data ready to be used with machine learning

Three main things we have to do:
   1. Split the data into features and labels (usually `X` & `Y`).
   2. Fill (impute) or disregard missing values.
   3. Convert non-numerical values to numerical values (also called feature encoding).

In [18]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [19]:
X = heart_disease.drop("target", axis=1)

In [20]:
# Splitting the data into training and testing
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [21]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
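With `test_size=0.2`, 61 of the 303 rows go to the test set and the remaining 242 to the training set. Because the split is random, the rows chosen change on every run unless a seed is fixed. A minimal sketch, assuming stand-in arrays with the same shape as the heart-disease data (`X_demo` and `y_demo` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays (assumption) with the same shape as the heart-disease data.
X_demo = np.arange(303 * 13).reshape(303, 13)
y_demo = np.arange(303) % 2

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)

# The same random_state gives the same rows in the same order.
identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(identical)
```

Passing `random_state` directly to `train_test_split` is an alternative to calling `np.random.seed()` before the split.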

### 1.1 Make sure it's all numerical

In [22]:
car_sales = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [23]:
# Create X and Y
X = car_sales.drop("Price", axis=1)
Y = car_sales["Price"]

# Splitting the data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

In [24]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train,y_train)
model.score(x_test, y_test)

ValueError: could not convert string to float: 'Toyota'

In [None]:
## Turn categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")
tranformed_X = transformer.fit_transform(X)
tranformed_X

In [None]:
p = pd.DataFrame(tranformed_X)
p

In [25]:
# Let's refit the model 
np.random.seed(42)
x_train, x_test, y_train, y_test = train_test_split(tranformed_X, Y, test_size=0.2)

model.fit(x_train,y_train)

NameError: name 'tranformed_X' is not defined

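The cell above fails here with a `NameError` because the encoding cells were not run; once they are run, it raises an error about `NaN` values instead. Please ignore this for the time being.
The main idea here is to convert the non-numerical values into numerical values.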
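To make the encoding idea concrete, here is a small sketch on a made-up table. It uses `pd.get_dummies`, a pandas alternative to the `OneHotEncoder` used in this notebook; the `toy_cars` data is invented for illustration, and `Doors` is left as a plain number here rather than being encoded.

```python
import pandas as pd

# Made-up rows (assumption), shaped like the car-sales data.
toy_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Honda"],
    "Colour": ["White", "Blue", "White", "Blue"],
    "Doors": [4, 5, 4, 4],
})

# Each category becomes its own 0/1 column; non-listed columns pass through unchanged.
encoded = pd.get_dummies(toy_cars, columns=["Make", "Colour"], dtype=int)
print(encoded)
```

Every row now contains only numbers, which is what the estimators require.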

### 1.2 What if there were missing values?
   1. Fill them with some values (also known as imputation).
   2. Remove the samples with missing data altogether.

In [26]:
# Importing car sales missing data
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [27]:
# Counting the nan
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [28]:
# Creating  X & Y
X = car_sales_missing.drop("Price", axis=1)
Y = car_sales_missing["Price"]

In [29]:
# Converting into numeric
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                                 remainder="passthrough")
transformed_X = transformer.fit_transform(X)
tranformed_X  # typo for transformed_X, which is why the NameError below is raised

NameError: name 'tranformed_X' is not defined

### Option 1: Filling the missing data with pandas

In [30]:
car_sales_missing["Doors"].value_counts()

4.0    811
5.0     75
3.0     64
Name: Doors, dtype: int64

In [31]:
# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with the most common value (4)
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [32]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [33]:
# Price is the column we want to predict, so remove the rows that have no price
car_sales_missing.dropna(inplace=True)

In [34]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [35]:
len(car_sales_missing)


950

In [36]:
# Creating  X & Y
X = car_sales_missing.drop("Price", axis=1)
Y = car_sales_missing["Price"]

In [37]:
# Converting into numeric
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                                 remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [38]:
x = pd.DataFrame(transformed_X)
x

Unnamed: 0,0
0,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
1,"(0, 0)\t1.0\n (0, 6)\t1.0\n (0, 13)\t1.0\n..."
2,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
3,"(0, 3)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
4,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 11)\t1.0\n..."
...,...
945,"(0, 3)\t1.0\n (0, 5)\t1.0\n (0, 12)\t1.0\n..."
946,"(0, 4)\t1.0\n (0, 9)\t1.0\n (0, 11)\t1.0\n..."
947,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 12)\t1.0\n..."
948,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
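The output above is hard to read because the encoder returned a SciPy sparse matrix, and `pd.DataFrame` placed each sparse row into a single cell. A minimal sketch of one way to inspect such a result, assuming a tiny made-up `Make` column (`toy_cars` is illustrative): convert it to a dense array with `.toarray()` first.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up column (assumption), not the real car-sales data.
toy_cars = pd.DataFrame({"Make": ["Honda", "BMW", "Toyota", "Honda"]})

encoder = OneHotEncoder()
encoded_sparse = encoder.fit_transform(toy_cars[["Make"]])  # sparse matrix by default

# Densify before building the DataFrame; label columns with the learned categories.
dense_df = pd.DataFrame(encoded_sparse.toarray(), columns=encoder.categories_[0])

print(sparse.issparse(encoded_sparse))
print(dense_df)
```

Densifying is fine for inspection on small data; the estimators used in this notebook accept the sparse matrix directly, as the later fitting cell shows.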


### Option 2: Filling missing values using Scikit-Learn

In [39]:
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [40]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [41]:
car_sales_missing.dropna(subset=["Price"],inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [42]:
# Split into features (X) and labels (Y)
X = car_sales_missing.drop("Price", axis=1)
Y = car_sales_missing["Price"]

In [43]:
# Filling missing values with Scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define the columns each imputer applies to
cat_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer =  ColumnTransformer([
    ("cat_imputer",cat_imputer, cat_features),
    ("door_imputer",door_imputer,door_features),
    ("num_imputer",num_imputer,num_features)
])

# Transform the data
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [44]:
car_sales_filled = pd.DataFrame(filled_X,
                               columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0
...,...,...,...,...
945,Toyota,Black,4.0,35820.0
946,missing,White,3.0,155144.0
947,Nissan,Blue,4.0,66604.0
948,Honda,White,4.0,215883.0


In [45]:
## Categories into numerical
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [46]:
# Fitting it into the model 
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(transformed_X, Y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.21990196728583944
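In the cells above, the imputer is fitted on all rows before the train/test split, so the fill values (such as the mean odometer reading) are computed with the test rows included. A common alternative, which is not what this notebook does, is to split first and learn the fill values from the training rows only. A minimal sketch on made-up single-feature data (`x_train_demo` and `x_test_demo` are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data (assumption): odometer readings with gaps.
x_train_demo = np.array([[10_000.0], [30_000.0], [np.nan], [50_000.0]])
x_test_demo = np.array([[np.nan], [200_000.0]])

imputer = SimpleImputer(strategy="mean")
x_train_filled = imputer.fit_transform(x_train_demo)  # learns the mean from training rows only
x_test_filled = imputer.transform(x_test_demo)        # reuses that mean, no refitting

print(imputer.statistics_)
print(x_test_filled.ravel())
```

The missing test value is filled with the training mean (30000), so the large test reading of 200000 has no influence on it.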

# 2. Choosing the right estimator/algorithm for our problem

Scikit-Learn uses "estimator" as another term for a machine learning model or algorithm.
 * Classification - predicting whether a sample is one thing or another
 * Regression - predicting a number

### 2.1 Picking a machine learning model for regression problem.

In [47]:
# Import the diabetes dataset
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
diabetes;

In [48]:
diabetes_df = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
diabetes_df["target"] = pd.Series(diabetes["target"])
diabetes_df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0


In [49]:
# Number of samples
len(diabetes_df)

442

In [50]:
# Following Scikit-Learn's "choosing the right estimator" map, we selected the model type

In [51]:
# Let's try the Ridge regression model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
np.random.seed(42)

#Create the data
X = diabetes_df.drop("target",axis=1)
Y = diabetes_df["target"]

# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Instantiate and fit the model
model = Ridge()
model.fit(x_train,y_train)

# Check the score of the Ridge Model on test data
model.score(x_test,y_test)

0.41915292635986545

#### To improve the score, we will try an ensemble model (a combination of several models)

In [52]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
np.random.seed(42)

#Create the data
X = diabetes_df.drop("target",axis=1)
Y = diabetes_df["target"]

# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Instantiate and fit the model
rf = RandomForestRegressor()
rf.fit(x_train,y_train)

# Check the score of the RandomForestRegressor on test data
rf.score(x_test,y_test)

0.42417421495301266
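When comparing estimators, it helps to evaluate them on exactly the same split. The sketch below loops over both models with a fixed `random_state`; the `candidates` dictionary and the variable names are illustrative choices, and the scores it prints may differ from the ones above depending on the seed and library version.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X_diab, y_diab = load_diabetes(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X_diab, y_diab, test_size=0.2, random_state=42)

# Illustrative set of estimators to compare on the same split.
candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

r2_scores = {}
for name, estimator in candidates.items():
    estimator.fit(x_tr, y_tr)
    r2_scores[name] = estimator.score(x_te, y_te)  # R^2 on the held-out rows

for name, value in r2_scores.items():
    print(f"{name}: {value:.3f}")
```

A single split can favour one model by chance, so small differences like the one above are worth confirming with cross-validation.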

### 2.2 Choosing an estimator for a classification problem

In [53]:
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Consulting the estimator map, it suggests trying `LinearSVC`.

In [54]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
np.random.seed(42)

# Create the features (X) and labels (Y)
X = heart_disease.drop("target",axis=1)
Y = heart_disease["target"]

#Split the data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC()
clf.fit(x_train,y_train)

# Evaluate the LinearSVC
clf.score(x_test,y_test)



0.8688524590163934

In [55]:
heart_disease["target"].value_counts()

1    165
0    138
Name: target, dtype: int64

In [56]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
np.random.seed(42)

# Create the features (X) and labels (Y)
X = heart_disease.drop("target",axis=1)
Y = heart_disease["target"]

#Split the data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(x_train,y_train)

# Evaluate the RandomForestClassifier
clf.score(x_test,y_test)

0.8524590163934426

Tidbit:

1. If you have structured data, use ensemble methods.
2. If you have unstructured data, use deep learning or transfer learning.

# 3. Fit the model/algorithm on our data and use it to make predictions.

### 3.1 Fitting model to the data

In [57]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
np.random.seed(42)

X = heart_disease.drop("target",axis=1)
Y = heart_disease["target"]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()

# Fit the model to the data (training the machine learning model)
clf.fit(x_train,y_train)

# Evaluate Classifier (use the patterns the model has learned)
clf.score(x_test,y_test)

0.8524590163934426

### 3.2 Make predictions using a machine learning model

Two ways to make predictions:
1. `predict()`
2. `predict_proba()`

In [58]:
# Using a trained model to make predictions
clf.predict(np.array([1,2,3,4])) ## Doesn't work



ValueError: Expected 2D array, got 1D array instead:
array=[1. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [59]:
x_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2


In [60]:
clf.predict(x_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [61]:
np.array(y_test)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

In [62]:
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(x_test)
np.mean(y_preds == y_test)

0.8524590163934426

In [63]:
clf.score(x_test, y_test)

0.8524590163934426

In [64]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8524590163934426

Make predictions with `predict_proba()`

In [65]:
# predict_proba() returns probabilities of a classification label
clf.predict_proba(x_test[:5])

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

In [66]:
clf.predict(x_test[:5])

array([0, 1, 1, 0, 1], dtype=int64)
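For a random forest classifier, `predict()` returns the class with the highest probability from `predict_proba()`. A minimal sketch of that relationship, assuming synthetic data in place of the heart-disease set (`X_demo`, `y_demo` and `demo_clf` are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

probabilities = demo_clf.predict_proba(X_demo[:5])  # shape (5, 2), each row sums to 1
labels = demo_clf.predict(X_demo[:5])

# Column i of predict_proba corresponds to demo_clf.classes_[i].
labels_from_proba = demo_clf.classes_[np.argmax(probabilities, axis=1)]

print(probabilities)
print(labels, labels_from_proba)
```

This is why the row `[0.49, 0.51]` above is predicted as class 1: the second column is slightly larger, even though the model is far from certain.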

In [67]:
x_test[:5]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2


`predict()` can also be used with regression models

In [68]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = diabetes_df.drop("target", axis=1)
Y = diabetes_df["target"]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

model = RandomForestRegressor().fit(x_train,y_train)

# Make prediction
y_preds = model.predict(x_test)

In [69]:
y_preds[:5]

array([ 97.42,  77.89, 106.2 , 223.03, 157.32])

In [70]:
np.array(y_test[:5])

array([ 84.,  89., 108., 306., 103.])

In [71]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

48.73123595505618
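Mean absolute error is the average of the absolute differences between predictions and true values. As a worked sketch, here it is computed by hand for only the first five predictions and targets shown above, so the result differs from the full test-set value:

```python
import numpy as np

# First five predictions and true values shown above.
y_pred_sample = np.array([97.42, 77.89, 106.2, 223.03, 157.32])
y_true_sample = np.array([84.0, 89.0, 108.0, 306.0, 103.0])

absolute_errors = np.abs(y_true_sample - y_pred_sample)
mae_sample = absolute_errors.mean()

print(absolute_errors)
print(mae_sample)
```

The five absolute errors are 13.42, 11.11, 1.8, 82.97 and 54.32, giving a mean of about 32.72 for this small sample.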

# 4. Evaluating a machine learning model

Three ways to evaluate the model:
1. The estimator's `score` method.
2. The `scoring` parameter.
3. Problem-specific metric functions.

### 4.1 Evaluating a model with the score method.

In [72]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = heart_disease.drop("target", axis=1)
Y = heart_disease["target"]

x_train, x_test, y_train, y_test = train_test_split(X ,Y, test_size=0.2)

model = RandomForestClassifier().fit(x_train,y_train)

In [73]:
model.score(x_test, y_test)

0.8524590163934426

In [74]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = diabetes_df.drop("target", axis=1)
Y = diabetes_df["target"]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

model = RandomForestRegressor().fit(x_train, y_train)

In [75]:
model.score(x_test, y_test)

0.5797031356362126
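For a classifier, `score()` returns mean accuracy; for a regressor it returns the coefficient of determination, R². The sketch below implements R² by hand on made-up numbers to show how to read it (the helper `r2` and the `y_true_demo` array are illustrative, not scikit-learn names):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2(y_true_demo, y_true_demo)                      # exact predictions
baseline = r2(y_true_demo, np.full(4, y_true_demo.mean()))  # always predict the mean
worse = r2(y_true_demo, y_true_demo[::-1])                  # predictions in reversed order

print(perfect, baseline, worse)
```

Perfect predictions give 1.0, always predicting the mean gives 0.0, and a model worse than the mean gives a negative value (here -3.0).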

### 4.2 Evaluating a model using the `scoring` parameter

In [76]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = heart_disease.drop("target", axis=1)
Y = heart_disease["target"]

x_train, x_test, y_train, y_test = train_test_split(X ,Y, test_size=0.2)

clf = RandomForestClassifier().fit(x_train,y_train)

In [77]:
clf.score(x_test,y_test)

0.8852459016393442

In [78]:
cross_val_score(clf, X, Y, cv=5)

array([0.83606557, 0.90163934, 0.86885246, 0.81666667, 0.78333333])

In [79]:
np.random.seed(0)

# Single training and test split score
clf_single_score = clf.score(x_test,y_test)

# Take the mean of 5-fold cross-validation score 
clf_cross_val_score = np.mean(cross_val_score(clf,X,Y,cv=5))

# Comparing the two
clf_single_score, clf_cross_val_score

(0.8852459016393442, 0.8248087431693989)
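To see what `cross_val_score` does with `cv=5`, the sketch below repeats it by hand: the data is cut into five folds, and a fresh copy of the model is trained on four of them and scored on the fifth. It assumes synthetic data and a fixed `random_state` so that the manual and library results can be compared; for classifiers, an integer `cv` uses stratified folds, hence `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(demo_clf)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)

print(manual_scores, manual_scores.mean())
print(library_scores, library_scores.mean())
```

Because every sample is used for testing exactly once, the mean of the five scores is usually a steadier estimate than a single train/test split. Note that `cross_val_score` refits copies of the estimator, so any earlier `fit` call on the model passed in is not reused.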

For a classifier, the default scoring metric is mean accuracy, the same value returned by `clf.score()`.

In [81]:
# Scoring parameter set to None by default
cross_val_score(clf, X, Y, cv=5, scoring=None)

array([0.80327869, 0.90163934, 0.85245902, 0.78333333, 0.73333333])
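With `scoring=None`, `cross_val_score` falls back to the estimator's own `score` method, which is accuracy for classifiers. Passing a string selects a different metric. A minimal sketch, assuming synthetic data and a fixed `random_state` so the runs are comparable (all `demo` names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=0)

default_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring=None)
accuracy_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="accuracy")
precision_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="precision")
recall_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="recall")

print(np.allclose(default_scores, accuracy_scores))
print(precision_scores.mean(), recall_scores.mean())
```

In the notebook cells above, the scores change between runs because the forest has no fixed `random_state`; fixing it, as done here, makes the default and `"accuracy"` results identical.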