# Introduction to Scikit-Learn (sklearn)
This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we're going to cover:


1. Getting the data ready
2. Choosing the right estimator/algorithm for our problems
3. Fitting the model/algorithm and using it to make predictions on our data
4. Evaluating a model
5. Improving a model
6. Saving and loading a trained model
7. Putting it all together!

In [38]:
cover = ['1. Getting the data ready',
         '2. Choosing the right estimator/algorithm for our problems',
         '3. Fitting the model/algorithm and using it to make predictions on our data',
         '4. Evaluating a model',
         '5. Improving a model',
         '6. Saving and loading a trained model',
         '7. Putting it all together!']

## 1. Getting our data ready to be used with machine learning

Three main things we have to do:
1. Split the data into features and labels (usually `x` & `y`)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
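
As a rough illustration of the three steps above, here is a minimal sketch on a tiny made-up table. The `toy` frame, its column values and the `toy_*` names are invented for this example and are not part of the car-sales data used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the car-sales dataset)
toy = pd.DataFrame({
    "Colour": pd.Series(["Red", "Blue", np.nan, "Red"], dtype="object"),
    "Doors": [4.0, np.nan, 4.0, 2.0],
    "Price": [10000.0, 12000.0, 9000.0, 15000.0],
})

# 1. Split into features (x) and labels (y)
toy_x = toy.drop("Price", axis=1)
toy_y = toy["Price"]

# 2. Impute missing values: a constant for text, the mean for numbers
colour_imputer = SimpleImputer(strategy="constant", fill_value="missing")
colour_filled = colour_imputer.fit_transform(toy_x[["Colour"]])
doors_filled = SimpleImputer(strategy="mean").fit_transform(toy_x[["Doors"]])

# 3. Encode the text column as numbers (one 0/1 column per category)
colour_encoded = OneHotEncoder().fit_transform(colour_filled).toarray()

print(colour_encoded.shape)  # (4, 3): Blue, Red, missing
print(doors_filled.ravel())
```

The notebook below does the same thing on the real data, but routes the columns through a `ColumnTransformer` instead of handling them one at a time.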

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Import data
car_sales = pd.read_csv("data/car-sales-extended-missing-data.csv")

car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [12]:
# See the missing data
car_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [9]:
# There are 50 NaN values in the Price column. Since Price is what we want to predict, we first delete the rows in which Price is NaN
car_sales.dropna(subset='Price',inplace=True)
car_sales.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [13]:
car_sales.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

# Method 1

## Here we will create a model

In [84]:


# Import all the necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Remove the Na values from target data
car_sales.dropna(subset='Price',inplace=True)

#Split the data into x & y
x = car_sales.drop("Price",axis=1)
y = car_sales["Price"]

# Now we fill the missing values in x so that the later steps do not raise errors
# Filling missing values is called imputation

# Fill categorical values with "missing" & numerical values with the mean

cat_features = ["Make","Colour"]
cat_imputer = SimpleImputer(strategy="constant",fill_value="missing")

door_features = ["Doors"]
door_imputer = SimpleImputer(strategy="constant",fill_value=4) # 4 because most cars have 4 doors

num_features = ["Odometer (KM)"]
num_imputer = SimpleImputer(strategy="mean")

# Create the imputer (something that fills missing values)

# ColumnTransformer([(),(),()]) takes a list of tuples. Syntax of each tuple:
# ("name to identify the step", transformer_to_apply, list of columns the transformer is applied to)

imputer = ColumnTransformer([('cat_imputer',cat_imputer,cat_features),
                             ('door_imputer',door_imputer,door_features),
                             ('num_imputer',num_imputer,num_features)])

filled_x = imputer.fit_transform(x)

# The transformed columns come back as an array (in transformer order), so we convert it back into a DataFrame
x = pd.DataFrame(filled_x,columns=["Make","Colour","Doors","Odometer (KM)"])


# Now we will convert all the non-numeric data types into numbers
# Make a list of categorical features
catagorical_l = ["Make","Colour","Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',one_hot,catagorical_l)],remainder='passthrough')

# Fit & transform the data
transformed_x = transformer.fit_transform(x)

#Split further into train set & test set
x_train,x_test,y_train,y_test = train_test_split(transformed_x,y,test_size=0.2)
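
A small sketch of the `(name, transformer, columns)` tuples on invented data. It is meant to show why the column names passed to `pd.DataFrame` in the cell above are listed in transformer order rather than in the order of the original frame. The `toy` frame and the step names are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; note that "B" comes before "A" in the frame
toy = pd.DataFrame({"B": [1.0, np.nan, 3.0], "A": [np.nan, 5.0, 7.0]})

# Each tuple is (name, transformer, columns)
ct = ColumnTransformer([
    ("a_imputer", SimpleImputer(strategy="constant", fill_value=0), ["A"]),
    ("b_imputer", SimpleImputer(strategy="mean"), ["B"]),
])

filled = ct.fit_transform(toy)

# Output columns follow the order of the tuples ("A" then "B"), not the frame
filled_df = pd.DataFrame(filled, columns=["A", "B"])
print(filled_df)
```

Columns that are not listed in any tuple are dropped by default, which is why the one-hot encoding step above passes `remainder='passthrough'`.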



### MODEL_1 [RandomForestRegressor()]

In [76]:
# Define random seed
np.random.seed(42)

#Define the model name
model1_1 = RandomForestRegressor()

#Fit the model
model1_1.fit(x_train,y_train)

# Make predictions with the fitted model
y_preds1_1 = model1_1.predict(x_test)

In [68]:
y_preds1_1[:10] # Regressor models do not have predict_proba

array([ 9715.8       , 11968.87      , 13550.48202381, 10005.56      ,
       16076.62504224, 13872.26      , 13489.92      , 15557.08      ,
       14319.77      , 22813.43      ])

#### Evaluation of model

In [64]:
from sklearn.model_selection import cross_val_score

def RG_Ev_model(model):
    """Print the test-set score and the mean 5-fold cross-validation score (both R^2)."""
    score = model.score(x_test,y_test)
    cv_scores = cross_val_score(model,transformed_x,y,cv=5)
    print(f"Score : {score*100:.2f}%")
    print(f"Cross Validation Score : {np.mean(cv_scores)*100:.2f}%")


# The same evaluation written out without the helper:
# model1_1_score = model1_1.score(x_test,y_test)
# model1_1_cv = cross_val_score(model1_1,transformed_x,y,cv=5)

# print(f"Score : {model1_1_score*100:.2f}%")
# print(f"Cross Validation Score : {np.mean(model1_1_cv)*100:.2f}%") # We will consider the cross-validation score if the difference is not too high

RG_Ev_model(model1_1)

Score : 32.03%
Cross Validation Score : 21.87%
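
To make the cross-validation score less of a black box, here is a hedged sketch of what `cross_val_score` with `cv=5` does for a regressor: it fits on four folds and scores (R^2) on the held-out fold, five times. The data is synthetic and `LinearRegression` is swapped in only because it is deterministic; the `*_demo` names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, invented for this sketch
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# What cross_val_score does for a regressor with cv=5
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

# The same thing by hand: fit on 4 folds, score (R^2) on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

print(np.allclose(cv_scores, manual_scores))  # True
print(f"Mean CV R^2: {np.mean(cv_scores):.2f}")
```

Because every sample is used for testing exactly once, the mean of the five scores depends less on one lucky or unlucky split than a single `train_test_split` score does.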


####  Regression model evaluation metrics
Model evaluation metrics documentation - https://scikit-learn.org/1.5/modules/model_evaluation.html 

The ones we're going to cover are:

1. R^2 (pronounced r-squared), or coefficient of determination
2. Mean absolute error (MAE)
3. Mean squared error (MSE)
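
As a worked example of the three metrics listed above, the sketch below computes them by hand in plain Python on four made-up values. The numbers are invented for illustration; in the notebook itself the scikit-learn functions do this work.

```python
# Made-up true values and predictions
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 11.0]

n = len(y_true_demo)
mean_true = sum(y_true_demo) / n
errors = [t - p for t, p in zip(y_true_demo, y_pred_demo)]

# MAE: average size of the errors, in the same units as the target
mae = sum(abs(e) for e in errors) / n

# MSE: average squared error, which punishes large errors more
mse = sum(e ** 2 for e in errors) / n

# R^2: 1 - (squared error of the model / squared error of always predicting the mean)
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true_demo)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")   # MAE: 1.00
print(f"MSE: {mse:.2f}")   # MSE: 1.50
print(f"R^2: {r2:.2f}")    # R^2: 0.70
```

An R^2 of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets.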

In [80]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error

def RG_Ev_mat(yt,yp):
    """Print MAE, MSE and R^2 for true values yt and predictions yp."""
    mae = float(mean_absolute_error(yt,yp))
    mse = float(mean_squared_error(yt,yp))
    r2  = float(r2_score(yt,yp))
    print(f"mean absolute error:{mae}")
    print(f"mean squared error:{mse}")
    print(f"r2 score:{r2*100:.2f}%")

RG_Ev_mat(y_test,y_preds1_1)

mean absolute error:5642.8748041238
mean squared error:49334132.09996635
r2 score:32.03%


#### 5. IMPROVING A MODEL

First predictions = baseline predictions
First model = baseline model

From a data perspective:
* Could we collect more data? (generally, the more data, the better)
* Could we improve our data?

From a model perspective:
* Is there a better model we could use?
* Could we improve the current model?

Hyperparameters vs. Parameters

Parameters = patterns the model finds in the data

Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns
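
A minimal sketch of the difference, on synthetic data invented for this example: hyperparameters are passed in before fitting, while parameters (for a random forest, the fitted trees) only exist after `fit`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for this sketch
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=40)

# Hyperparameters: chosen by us before fitting
forest = RandomForestRegressor(n_estimators=5, max_depth=3, random_state=0)
print(forest.get_params()["n_estimators"])   # 5

# Parameters: learned from the data during fit (here, the fitted trees)
forest.fit(X_demo, y_demo)
print(len(forest.estimators_))               # 5
print(forest.feature_importances_.shape)     # (2,)
```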

Three ways to adjust hyperparameters:
1. By hand
2. Randomly with RandomizedSearchCV
3. Exhaustively with GridSearchCV
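
To give a feel for the difference between the random and the exhaustive search, the sketch below counts candidates for a small hypothetical search space (`demo_grid` is made up for this example). `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers that enumerate and sample such combinations.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical small search space
demo_grid = {"n_estimators": [10, 100, 500],
             "max_depth": [None, 5, 10],
             "min_samples_split": [2, 4]}

# An exhaustive search tries every combination: 3 * 3 * 2 = 18
all_combos = list(ParameterGrid(demo_grid))
print(len(all_combos))  # 18

# A randomized search tries only n_iter sampled combinations
sampled = list(ParameterSampler(demo_grid, n_iter=5, random_state=42))
print(len(sampled))     # 5
```

With cross-validation, each candidate is fitted once per fold, so the exhaustive search here would cost 18 x 5 = 90 fits at `cv=5`, against 25 for the randomized one.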

##### 5.2 Hyperparameter tuning with RandomizedSearchCV

Let's make 3 sets: training, validation and test.
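
One possible way to get three sets is to call `train_test_split` twice, as sketched below on dummy arrays. The 70/15/15 sizes and the `*_demo` names are assumptions for this example, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 samples, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First carve off the test set (15 samples), then split the rest
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=15, random_state=42)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=15, random_state=42)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 70 15 15
```

Note that `RandomizedSearchCV` with `cv=5` builds its own validation folds from the training data, so an explicit validation set is mainly needed when tuning by hand.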

In [82]:
model1_1.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We're going to try and adjust:

* `max_depth`
* `max_features`
* `min_samples_leaf`
* `min_samples_split`
* `n_estimators`

In [100]:
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV

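# Note: "auto" is not accepted by recent scikit-learn releases (removed in 1.3), so candidates
# that sample it fail to fit, as the log below shows; 1.0 is the equivalent setting for regressors.
grid = {'n_estimators': [10,20,30,40,50,100,200,300,500,1000,1200],
        'max_features': ["auto",'log2', 'sqrt'],
        'max_depth': [None,5,10,20,30],
        'min_samples_leaf': [1,2,4,5],
        'min_samples_split': [2,4,6],}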

np.random.seed(42)

model = RandomForestRegressor(n_jobs=1)

model_GS = RandomizedSearchCV(estimator=model,
                                   param_distributions=grid,
                                   n_iter=50,
                                   cv=5,
                                   verbose=2)

model_GS.fit(x_train,y_train);

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END max_depth=20, max_features=log2, min_samples_leaf=5, min_samples_split=2, n_estimators=1000; total time=   2.0s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=5, min_samples_split=2, n_estimators=1000; total time=   2.1s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=5, min_samples_split=2, n_estimators=1000; total time=   2.0s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=5, min_samples_split=2, n_estimators=1000; total time=   1.8s
[CV] END max_depth=20, max_features=log2, min_samples_leaf=5, min_samples_split=2, n_estimators=1000; total time=   1.7s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=20; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=20; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_esti

90 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\avk24\Conda\ztm_ML\env\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\avk24\Conda\ztm_ML\env\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "C:\Users\avk24\Conda\ztm_ML\env\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\avk24\Conda\ztm_ML\env\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constrain
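
The log above includes candidates with `max_features=auto`. In scikit-learn 1.3 and later, `"auto"` is no longer an accepted value for `max_features` on `RandomForestRegressor`, which is the likely reason for the 90 failed fits; `1.0` gives the behaviour `"auto"` used to have for regressors. The sketch below checks this on synthetic data (the `*_demo` and `fixed` names are invented for the example).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for this sketch
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=30)

# On scikit-learn >= 1.3 this raises a parameter validation error;
# older releases accept it (with a deprecation warning from 1.1 on)
try:
    RandomForestRegressor(n_estimators=5, max_features="auto").fit(X_demo, y_demo)
    print('"auto" is accepted by this scikit-learn version')
except ValueError as err:
    print("Rejected:", type(err).__name__)

# 1.0 means "use all features", which is also the regressor default
fixed = RandomForestRegressor(n_estimators=5, max_features=1.0, random_state=0)
fixed.fit(X_demo, y_demo)
print(len(fixed.estimators_))  # 5
```

Replacing `"auto"` with `1.0` in the search grid should let all 250 fits run, although the best parameters found may then differ from the ones shown below.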

In [101]:
model_GS.best_params_

{'n_estimators': 1000,
 'min_samples_split': 4,
 'min_samples_leaf': 1,
 'max_features': 'log2',
 'max_depth': 5}
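
Once the search has finished, the tuned model still needs to be checked on data it has not seen. The sketch below shows one way to do that on synthetic data with a deliberately tiny, hypothetical grid: `best_params_` holds the winning combination, `best_score_` its mean cross-validation score, and `best_estimator_` is the model refitted on the whole training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data, invented for this sketch
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.2, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical, deliberately small search space
demo_grid = {"n_estimators": [10, 20, 50], "max_depth": [None, 3, 5]}

search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                            param_distributions=demo_grid,
                            n_iter=4, cv=3, random_state=42)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Mean CV R^2 of best candidate: {search.best_score_:.2f}")
print(f"Held-out test R^2: {search.best_estimator_.score(X_te, y_te):.2f}")
```

Comparing the held-out score of the tuned model with the baseline score is what tells us whether the tuning actually helped.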

In [92]:
from sklearn.metrics import accuracy_score,recall_score,f1_score,precision_score

def evaluation_preads(y_true,y_preds):
    '''
    Performs an evaluation comparison of y_true labels vs. y_preds labels
    for a classification model.
    '''
    accuracy = accuracy_score(y_true,y_preds)
    precision = precision_score(y_true,y_preds)
    recall  = recall_score(y_true,y_preds)
    f1 = f1_score(y_true,y_preds)

    metric_dict = {'accuracy':round(accuracy,2),
                   'precision':round(precision,2),
                   'recall':round(recall,2),
                   'f1':round(f1,2)}

    print(f"Accuracy :{accuracy*100:.2f}%")
    print(f"Precision :{precision:.2f}")
    print(f"recall :{recall:.2f}")
    print(f"f1:{f1:.2f}")

    return metric_dict
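
The helper above applies to classification labels, not to the continuous price predictions used earlier in this notebook. As a worked illustration of what the four metrics measure, the sketch below computes them by hand in plain Python on made-up binary labels (all names and values are invented for this example).

```python
# Made-up binary labels
y_true_demo = [1, 0, 1, 1, 0, 1]
y_pred_demo = [1, 0, 0, 1, 1, 1]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are right
precision = tp / (tp + fp)          # share of predicted positives that are right
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy :{accuracy*100:.2f}%")   # Accuracy :66.67%
print(f"Precision :{precision:.2f}")      # Precision :0.75
print(f"Recall :{recall:.2f}")            # Recall :0.75
print(f"F1 :{f1:.2f}")                    # F1 :0.75
```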