## ScikitLearn Notebook

In [1]:
what_were_learning = '''1. Getting data ready
                        2. Choosing a machine learning model
                        3. Fitting a model to the data and making predictions
                        4. Evaluating model predictions
                        5. Improving model predictions
                        6. Saving & Loading models'''

Outline : <br>
1. Getting the data ready <br>
    - 1.1 Making the data numerical<br>
    - 1.2 Handling missing data<br>
        - a. Fill missing values with pandas <br>
        - b. Fill missing values with scikit-Learn
        
<br>

2. Choosing the right estimator/algorithm<br>
    - 2.1 Choosing an estimator for Regression problem
    - 2.2 Choosing an estimator for Classification problem

<br>

3. Fitting the model and use it to make predictions<br>
(do this for both classification & regression)
    - 3.1 Fit the model
    - 3.2 Evaluating the model
        - for classification
            - a. `predict()`
            - b. `predict_proba()`
        - for regression
            - a. `predict()`
            - b. Mean Absolute Error

<br>

4. Evaluating a Machine Learning Model
    1. Estimator's built-in `score()` method<br>
    2. The `scoring` parameter<br>
    3. Problem-specific metric functions<br>
        - For classification problems 
            - a. Accuracy
            - b. ROC Curve
            - c. Confusion matrix
            - d. Classification Reports

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

#### 1. Getting our data ready to be used with machine learning

Three main things we have to do:<br>
    1. Split the data into features and labels (usually 'X' and 'y')<br>
    2. Filling (also called imputing) or disregarding missing values<br>
    3. Converting non-numerical values to numeric values (also known as encoding)<br>

In [3]:
heart_disease = pd.read_csv("/home/hp1/Documents/College/Coding/Machine Learning/zero_to_mastery_course/csv/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [4]:
X = heart_disease.drop("target", axis=1)
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [5]:
y = heart_disease["target"]
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

In [6]:
type(y)

pandas.core.series.Series

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [8]:
X_test

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
66,51,1,2,100,222,0,1,143,1,1.2,1,0,2
16,58,0,2,120,340,0,1,172,0,0.0,2,0,2
245,48,1,0,124,274,0,0,166,0,0.5,1,0,3
118,46,0,1,105,204,0,1,172,0,0.0,2,0,2
153,66,0,2,146,278,0,0,152,0,0.0,1,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
147,60,0,3,150,240,0,1,171,0,0.9,2,0,2
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2
24,40,1,3,140,199,0,1,178,1,1.4,2,0,3
194,60,1,2,140,185,0,0,155,0,3.0,1,0,2


In [9]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

#### 1.1 Making the data numerical

In [10]:
car_sales = pd.read_csv("/home/hp1/Documents/College/Coding/Machine Learning/zero_to_mastery_course/csv/car-sales-extended.csv")

In [11]:
car_sales.shape

(1000, 5)

In [12]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [13]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [14]:
car_sales["Doors"].value_counts()
#we will considering Doors as a categorical attribute as it only has 3 types of values, that are '3,4&5'

4    856
5     79
3     65
Name: Doors, dtype: int64

In [15]:
#Split into X & y
X= car_sales.drop("Price",axis=1 )
y = car_sales["Price"]

#split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [16]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 4), (200, 4), (800,), (200,))

In [17]:
#turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categories_feature = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,categories_feature)], remainder="passthrough")

transformed_X= transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [18]:
transformed_X = pd.DataFrame(transformed_X)
type(transformed_X)

pandas.core.frame.DataFrame

In [19]:
transformed_X.value_counts()

0    1    2    3    4    5    6    7    8    9    10   11   12      
0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  10217.0     1
     1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  136279.0    1
                                                            116986.0    1
                                                            117907.0    1
                                                            120283.0    1
                                                                       ..
     0.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  230314.0    1
                                                            230908.0    1
                                                            232912.0    1
                                                            234051.0    1
1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  201190.0    1
Length: 1000, dtype: int64

In [20]:
X.head(10)

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3
5,Honda,Red,42652,4
6,Toyota,Blue,163453,4
7,Honda,White,43120,4
8,Nissan,White,130538,4
9,Honda,Blue,51029,4


In [21]:
transformed_X.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,42652.0
6,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,163453.0
7,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,43120.0
8,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,130538.0
9,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,51029.0


In [22]:
#another way of doing the same thing
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies.head()

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0


In [23]:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()

In [24]:
#lets refit the model
# np.random.seed(30)
X_train,X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

In [25]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 13), (200, 13), (800,), (200,))

In [26]:

clf.fit(X_train,y_train)
clf.score(X_train, y_train)

0.8893302437872022

In [27]:
y_predicted = clf.predict(X_test)
y_predicted


array([ 7290.62, 13481.02, 12292.65, 17719.25, 20031.15, 10847.04,
       18164.12, 12029.99,  7619.17, 16236.56, 15231.53, 17587.76,
       10891.27,  9289.44, 23219.89, 31077.38, 21304.36, 11259.12,
        9148.47, 12322.77, 25527.14, 21815.  ,  9846.17, 10852.47,
       10382.76, 19117.47, 16814.19, 23281.5 , 14292.89, 15926.2 ,
        8350.03, 13087.14, 15522.86,  7479.31, 11416.29, 23308.32,
       18502.56, 11012.8 , 13684.46, 21780.46, 15114.58,  9917.41,
        8495.32, 13534.46, 26215.14, 13964.1 , 13695.67, 18099.54,
       10898.77, 13320.24, 29339.97, 27024.4 , 11112.39, 14073.05,
       10913.29, 13644.01, 26443.78, 22863.54, 11994.83, 17873.17,
       23242.38, 22391.26,  9120.84, 15988.72, 13223.84, 12846.25,
       14772.18, 15406.23, 10116.49, 15729.15,  7983.25, 21866.25,
       11179.06, 13434.61, 18625.  , 16663.69, 12040.32,  9427.03,
       23011.94,  9446.78,  9163.66, 47885.42, 18403.57, 13770.24,
       10678.9 , 10826.36, 14702.14, 21859.25, 20842.38, 13997

In [28]:
clf.score(X_test, y_test)

0.3613422945550251

In [29]:
# np.random.seed(14)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators")
    clf = RandomForestRegressor(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set : {clf.score(X_test, y_test)}\n")

Trying model with 10 estimators
Model accuracy on test set : 0.28895291675245216

Trying model with 20 estimators
Model accuracy on test set : 0.3200189951395338

Trying model with 30 estimators
Model accuracy on test set : 0.35932464231016925

Trying model with 40 estimators
Model accuracy on test set : 0.3722951890901365

Trying model with 50 estimators
Model accuracy on test set : 0.3402907827221576

Trying model with 60 estimators
Model accuracy on test set : 0.3482666892898284

Trying model with 70 estimators
Model accuracy on test set : 0.3601915099365257

Trying model with 80 estimators
Model accuracy on test set : 0.35440472997332295

Trying model with 90 estimators
Model accuracy on test set : 0.3603536652327862



#### 1.2 Dealing with missing values
Two ways:<br>
    1. Fill them with some value (also known as Imputation) <br>
    2. Remove the samples with missing data altogether

In [30]:
car_sales_missing = pd.read_csv("/home/hp1/Documents/College/Coding/Machine Learning/zero_to_mastery_course/csv/car-sales-extended-missing-data.csv")

In [31]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [32]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [33]:
#Create X & y
X = car_sales_missing.drop("Price", axis=1)
X.head()


Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431.0,4.0
1,BMW,Blue,192714.0,5.0
2,Honda,White,84714.0,4.0
3,Toyota,White,154365.0,4.0
4,Nissan,Blue,181577.0,3.0


In [34]:
y = car_sales_missing["Price"]
y.head()

0    15323.0
1    19943.0
2    28343.0
3    13434.0
4    14043.0
Name: Price, dtype: float64

In [35]:
#turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categories_feature = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,categories_feature)], remainder="passthrough")

transformed_X= transformer.fit_transform(X)
transformed_X



##### OneHotEncoder couldnt handle missing values in versions below (0.23). i have this piece of code here to explain the same.
##### Here, it won't give us any error. we're good to go

<1000x16 sparse matrix of type '<class 'numpy.float64'>'
	with 4000 stored elements in Compressed Sparse Row format>

##### Option 1: Fill missing data with pandas

In [36]:
car_sales_missing["Doors"].value_counts()

4.0    811
5.0     75
3.0     64
Name: Doors, dtype: int64

In [37]:
#fill the "Make" column
car_sales_missing["Make"].fillna("missing", inplace=True)

#fill the "Colour" column
car_sales_missing["Colour"].fillna("missing", inplace=True)

#fill the "Odometer (KM)" column
car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean , inplace=True)

#fill the "Doors" column
car_sales_missing["Doors"].fillna(4, inplace=True)



In [38]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,missing,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [39]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [40]:
#remove rows with missing price value
car_sales_missing.dropna(inplace=True)

In [41]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [42]:
len(car_sales_missing)

950

In [43]:
#get new X & y for the missing values
X= car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [44]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categories_feature = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,categories_feature)], remainder="passthrough")

transformed_X= transformer.fit_transform(car_sales_missing)
transformed_X

array([[0.0, 1.0, 0.0, ..., 0.0, 35431.0, 15323.0],
       [1.0, 0.0, 0.0, ..., 1.0, 192714.0, 19943.0],
       [0.0, 1.0, 0.0, ..., 0.0, 84714.0, 28343.0],
       ...,
       [0.0, 0.0, 1.0, ..., 0.0, 66604.0, 31570.0],
       [0.0, 1.0, 0.0, ..., 0.0, 215883.0, 4001.0],
       [0.0, 0.0, 0.0, ..., 0.0, 248360.0, 12732.0]], dtype=object)

##### Option 2 : Filling the missing values with Scikit-Learn

In [45]:
car_sales_missing = pd.read_csv("/home/hp1/Documents/College/Coding/Machine Learning/zero_to_mastery_course/csv/car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [46]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [47]:
#drop the rows which have the missing price values
car_sales_missing.dropna(subset=["Price"],inplace=True)
len(car_sales_missing)

950

In [48]:
#we've lost some of the missing Doors, colours values because they might be overlapping with the missing Price column
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [49]:
#Split into X & y
X=car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [50]:
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

In [51]:
#fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

#fill the categorical values with 'missing' and numerical values with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
numerical_imputer = SimpleImputer(strategy="mean")

In [52]:
#define columns
cat_features = ["Make", "Colour"]
door_features = ["Doors"]
numerical_features = ["Odometer (KM)"]

In [53]:
#create an imputer (something that fills the missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_features),
    ("numerical_imputer",numerical_imputer ,numerical_features)

])

In [54]:
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [55]:
car_sales_filled = pd.DataFrame(filled_X,columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0
...,...,...,...,...
945,Toyota,Black,4.0,35820.0
946,missing,White,3.0,155144.0
947,Nissan,Blue,4.0,66604.0
948,Honda,White,4.0,215883.0


In [56]:
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [57]:
#Now, that we dont have any missing values. Let's convert our dataframe into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categories_feature = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,categories_feature)], remainder="passthrough")

transformed_X= transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [58]:
#Now that we've got our data as numbers and it has no missing values. Let's fit the model
np.random.seed(13)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train,y_test = train_test_split(transformed_X,y,test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.24708595018025115

#### 2. Choosing the right estimators/algorithms for your problem

- Sklearn refers to machine learning models, algorithms as estimators
- Classification problem - predicting a category (eg, heart disease or not)
    - Sometimes you'll see `clf` (short for classifier) used as a classification estimator
- Regression problem - predicting a number (selling price of a car)
<br><br>

If you're working on a machine learning problem and looking to use Sk-Learn and not sure what model you should use, refer : https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

<img src= "/home/hp1/Documents/College/Coding/Machine Learning/zero_to_mastery_course/Scikit-Learn Library/images/ml_map.png"/>

#### 2.1 Choosing the right estimator for a Regression Problem

In [59]:
from sklearn.datasets import fetch_california_housing
california = fetch_california_housing()

In [60]:
california

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [61]:
california_df = pd.DataFrame(california['data'], columns= california['feature_names'])
california_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [62]:
california_df["MedHouseVal"] = california["target"]
california_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [63]:
from sklearn.model_selection import train_test_split
#Import algorithm
from sklearn.linear_model import Ridge


#set up random seed
np.random.seed(43)

#creating features and target variables
X = california_df.drop("MedHouseVal",axis=1)
y = california_df["MedHouseVal"]

#split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#instantiate and fit an algorithm
model = Ridge()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.6176737722553941

What if `Ridge` didnt work out and fit our needs?<br>
We could always try a different model <br>
How about we try an ensemble model (an ensemble model is a combination of smaller models to try and make better predictions than just a single model)<br>
Refer documentation for more



In [64]:
#lets try ensemble methods
from sklearn.ensemble import RandomForestRegressor
np.random.seed(10)
#create data
X = california_df.drop("MedHouseVal",axis=1)
y = california_df["MedHouseVal"]

#split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#create a random forest model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.811322083536415

#### 2.2 Choosing an estimator for a classification problem

In [65]:
heart_disease.head(10)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [66]:
heart_disease.shape

(303, 14)

##### Consulting the map and it says to try  `Linear SVC`

In [67]:
#Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

#setup random seed
np.random.seed(13)

#make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#split the data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

#Instantiate LinearSVC
clf = LinearSVC(dual=False)     
clf.fit(X_train, y_train)

#Evaluating the LinearSVC
clf.score(X_test, y_test)

0.819672131147541

In [68]:
heart_disease["target"].value_counts()

1    165
0    138
Name: target, dtype: int64

##### Let's try ensemble methods for classification - `SVC Ensemble Classifier`

In [69]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(13)

#make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate the model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

#evaluate the model
clf.score(X_test, y_test)


0.7540983606557377

Tidbit : <br>
    1. If you have structured data, use ensemble methods<br>
    2. If you have unstructured data, use deep learning or transfer learning<br>

### 3. Fit the model/ algorithm on our data and use it to make predictions
Fitting the model for classification
#### 3.1 Fitting the model
##### a. Fitting the model for classification
Different names for - <br>
- `X` = features, feature variables, data
- `y` = labels, target, target variables

In [70]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(13)

#make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate the model
clf = RandomForestClassifier()

#Fit the model to the data (training the machine learning model)
clf.fit(X_train, y_train)

#evaluate the model (using the patterns the model has learnt )
clf.score(X_test, y_test)


0.7540983606557377

In [71]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


### Random Forest model deep dive

These resources will help you understand what's happening inside the Random Forest models we've been using.

* [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
* [Random Forest Wikipedia (simple version)](https://simple.wikipedia.org/wiki/Random_forest)
* [Random Forests in Python](http://blog.yhat.com/posts/random-forests-in-python.html) by yhat
* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen

#### 3.2 Make predictions on data
2 ways to make predictions: 
* `predict()`
* `predict_prob()`

#### making predictions using predict()

In [72]:
# Use trained models to make predictions
y_preds = clf.predict(X_test)                   #predict y

In [73]:
#compare the predicted value of y with the actual value of y
np.mean(y_preds == y_test)          #this value is the same as clf.score(X_test, y_test)

0.7540983606557377

In [74]:
clf.score(X_test, y_test)

0.7540983606557377

In [75]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.7540983606557377

#### making predictions with predict_proba()


In [76]:
clf.predict_proba(X_test[:5])               #predicts the labels with the probabilty

array([[0.67, 0.33],
       [0.06, 0.94],
       [0.1 , 0.9 ],
       [0.22, 0.78],
       [0.53, 0.47]])

In [77]:
#try predict() on the same data
clf.predict(X_test[:5])

array([0, 1, 1, 1, 0])

##### Fitting data for regression model
`predict()` can be used on regression models as well

In [78]:
california_df.head(20)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
5,4.0368,52.0,4.761658,1.103627,413.0,2.139896,37.85,-122.25,2.697
6,3.6591,52.0,4.931907,0.951362,1094.0,2.128405,37.84,-122.25,2.992
7,3.12,52.0,4.797527,1.061824,1157.0,1.788253,37.84,-122.25,2.414
8,2.0804,42.0,4.294118,1.117647,1206.0,2.026891,37.84,-122.26,2.267
9,3.6912,52.0,4.970588,0.990196,1551.0,2.172269,37.84,-122.25,2.611


In [79]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(10)
#create data
X = california_df.drop("MedHouseVal",axis=1)
y = california_df["MedHouseVal"]

#split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#create a random forest model
model = RandomForestRegressor()

#fitting the model
model.fit(X_train, y_train)

#evaluating the model
model.score(X_test, y_test)

0.811322083536415

In [80]:
y_preds = model.predict(X_test)
y_preds[:5]

array([2.13002  , 3.8265603, 2.25399  , 1.4706   , 1.12384  ])

In [81]:
np.array(y_preds)

array([2.13002  , 3.8265603, 2.25399  , ..., 1.3883   , 0.97625  ,
       4.7736867])

#### Regression Evaluation Metrics
    
- Mean Absolute Error - The average difference between the predicted values and the true values

In [82]:
#compare the predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)
#this means the average difference between the actual value and the predicted value is 0.32

0.32363581913759704

### 4. Evaluating a Machine Learning Model
Three ways to evaluate Scikit-Learn models/estimators:<br>
    1. Estimator's built-in `score()` method<br>
    2. The `scoring` parameter<br>
    3. Problem-specific metric functions<br>

##### 4.1 Evaluating a model with the `score` method
- Using the score() method on Classification problem

In [83]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(13)

#make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate the model
clf = RandomForestClassifier()

#Fit the model to the data
clf.fit(X_train, y_train)

RandomForestClassifier()

In [84]:
#evaluate the model score() method
clf.score(X_test, y_test)

0.7540983606557377

- Using the score() method on Regression Problem

In [94]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(10)
#create data
X = california_df.drop("MedHouseVal",axis=1)
y = california_df["MedHouseVal"]

#split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#create a random forest model
model = RandomForestRegressor()

#fitting the model
model.fit(X_train, y_train)

RandomForestRegressor()

In [95]:
#evaluating the model- score() method

#The default score() evaluation metric is r_squared for regression algorithms
#Highest = 1.0, lowest = 0.0
model.score(X_test, y_test)

0.811322083536415

##### 4.2 Evaluating a model using the `scoring` parameter
ON CLASSIFICATION PROBLEMS
- Uses cross validation (5 fold)

<img src = "/home/hp1/Documents/College/Coding/Machine Learning/zero_to_mastery_course/Scikit-Learn Library/images/cross-validation.png"/>

In [97]:
#adding some classification code
from sklearn.ensemble import RandomForestClassifier
np.random.seed(13)

#make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate the model
clf = RandomForestClassifier()

#Fit the model to the data
clf.fit(X_train, y_train)

RandomForestClassifier()

In [98]:
#evaluate using the score() function
clf.score(X_test, y_test)

0.7540983606557377

In [100]:
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X, y)      #it takes 5 folds as default and thus produces 5 different evaluation scores

array([0.85245902, 0.91803279, 0.81967213, 0.85      , 0.76666667])

In [101]:
#Comparing the score() and mean of cross_val_score()
clf_single_score = clf.score(X_test, y_test)
clf_cross_val_score = np.mean(cross_val_score(clf, X, y))

#compare the two
clf_single_score, clf_cross_val_score

#Cross validation is always accurate.It's reliable unlike score()

(0.7540983606557377, 0.8182513661202186)

#### 4.2.1 Classification model evaluation metrics
- 1. Accuracy
- 2. Area Under ROC curve
- 3. Confusion matrix
- 4. Classification report

In [116]:
from sklearn.model_selection import cross_val_score

#adding some classification code
from sklearn.ensemble import RandomForestClassifier
np.random.seed(13)

#make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate the model
clf = RandomForestClassifier()

#Fit the model to the data
clf.fit(X_train, y_train)
cross_val_score_classification = cross_val_score(clf,X,y)
cross_val_score_classification

array([0.81967213, 0.8852459 , 0.81967213, 0.81666667, 0.76666667])

In [107]:
np.mean(cross_val_score_classification)

0.8281967213114754

##### Area under ROC (Receiver Operating Characteristic) Curve (AUC/ROC)
- Area Under Curve (AUC)
- ROC Curve
    - ROC Curves are a comparision of a model's true positive rate(tpr) and false positive rate(fpr)
        * True positive = predicted value is 1 | Actual value is 1
        * False positive = predicted value is 1 | Actual value is 0
        * True negative = predicted value is 0 | Actual value is 0
        * False negative = predicted value is 0 | Actual value is 1


In [114]:
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.2)
# clf.fit(X_train, y_train)

In [117]:
from sklearn.metrics import roc_curve

#making predictions with probabilities
y_probs = clf.predict_proba(X_test)

y_probs[:10]

array([[0.67, 0.33],
       [0.06, 0.94],
       [0.1 , 0.9 ],
       [0.22, 0.78],
       [0.53, 0.47],
       [0.87, 0.13],
       [0.73, 0.27],
       [0.53, 0.47],
       [0.34, 0.66],
       [0.46, 0.54]])

In [119]:
#ROC curve is only a comparision of positive data, extract the second column of y_probs (index=1)
y_probs_positive = y_probs[:,1]
y_probs_positive[:10]

array([0.33, 0.94, 0.9 , 0.78, 0.47, 0.13, 0.27, 0.47, 0.66, 0.54])

In [121]:
fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)
fpr

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03846154,
       0.03846154, 0.07692308, 0.07692308, 0.15384615, 0.15384615,
       0.19230769, 0.19230769, 0.23076923, 0.23076923, 0.26923077,
       0.30769231, 0.30769231, 0.38461538, 0.38461538, 0.57692308,
       0.57692308, 0.57692308, 0.61538462, 0.73076923, 0.92307692,
       1.        ])