# Introduction to scikit-learn (sklearn)
This notebook demonstrates the most useful functions of the beautiful Scikit-Learn library.

## What we're going to cover:

#### 0. Scikit-Learn end-to-end workflow.
#### 1. Getting the data ready.
#### 2. Choose the right estimator/algorithm for our problem.
#### 3. Fit the model/algorithm and use it to make predictions on our data.
#### 4. Evaluate a model.
#### 5. Improve a model.
#### 6. Save and load a trained model.
#### 7. Putting it all together.







## 0. Scikit-learn end-to-end workflow.

In [44]:
# 1. Get the data ready.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# The data is already preprocessed (cleaning and feature engineering have been done).
heart_disease = pd.read_csv("data/heart.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [45]:
# Create X (feature matrix)
X = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

In [46]:
heart_disease.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [47]:
heart_disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [48]:
# 2. Choose the right estimator and hyperparameters.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# Default hyperparameters
clf.get_params()


{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [49]:
# 3. Fit the model on the training data.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
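
The split above is random, so the scores reported later change from run to run. A minimal sketch, assuming a small synthetic dataset (`X_demo` and `y_demo` are illustrative names, not part of this notebook), of how `random_state` makes the split repeatable and `stratify` keeps the class balance in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, balanced binary labels.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = np.array([0, 1] * 50)

# The same random_state gives an identical split on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
same_split = np.array_equal(split_a[1], split_b[1])

# stratify=y_demo keeps the 50/50 class ratio inside the test set.
test_class_counts = np.bincount(y_te)
```

Whether to stratify is a judgment call; it tends to matter most on small or imbalanced datasets.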

In [50]:
clf.fit(X_train,y_train)

RandomForestClassifier()

In [51]:
# Make predictions
# y_labels = clf.predict(np.array([0, 2, 3, 4]))
# This does not work: predict() expects a 2-D array with the same feature columns as the training data.
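
The shape requirement can be checked on a small example. The sketch below assumes a toy model with four features (`demo_clf` and the data are hypothetical); it shows a 1-D array being rejected and the same values being accepted once reshaped to one row:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 50 samples with 4 features each.
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 4)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

new_sample = np.array([0.1, 0.2, 0.3, 0.4])

# A 1-D array is rejected: predict() expects shape (n_samples, n_features).
try:
    demo_clf.predict(new_sample)
    rejected_1d = False
except ValueError:
    rejected_1d = True

# One row with 4 columns matches the layout of the training data.
prediction = demo_clf.predict(new_sample.reshape(1, -1))
```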

In [52]:
y_preds = clf.predict(X_test)
y_preds

array([1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0], dtype=int64)

In [55]:
np.array([y_test])

array([[1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,
        1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
        1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]], dtype=int64)

In [53]:
# 4. Evaluate our model.
# On the training data
clf.score(X_train, y_train)

1.0

In [54]:
np.mean(y_preds==y_test)

0.8360655737704918

In [56]:
clf.predict_proba(X_test[:5])

array([[0.02, 0.98],
       [0.5 , 0.5 ],
       [0.04, 0.96],
       [0.09, 0.91],
       [0.17, 0.83]])
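
Each row of `predict_proba` holds one probability per class, and `predict` returns the class with the highest one. A small sketch on synthetic data (the dataset and `demo_clf` are assumptions for illustration) that makes the relationship explicit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical binary classification data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

proba = demo_clf.predict_proba(X_demo[:5])   # shape (5, 2); each row sums to 1
labels = demo_clf.predict(X_demo[:5])

# Picking the most probable class per row reproduces predict().
labels_from_proba = demo_clf.classes_[np.argmax(proba, axis=1)]
```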

In [25]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

classification_report(y_test, y_preds)

'              precision    recall  f1-score   support\n\n           0       0.72      0.88      0.79        26\n           1       0.90      0.74      0.81        35\n\n    accuracy                           0.80        61\n   macro avg       0.81      0.81      0.80        61\nweighted avg       0.82      0.80      0.80        61\n'

In [26]:
confusion_matrix(y_test, y_preds)

array([[23,  3],
       [ 9, 26]], dtype=int64)

In [27]:
accuracy_score(y_test, y_preds)

0.8032786885245902
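
The numbers in the report can be recovered by hand from the confusion matrix. The sketch below reuses the matrix printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[23, 3],
               [9, 26]])

# Accuracy: correct predictions (the diagonal) over all predictions, (23 + 26) / 61.
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions per predicted class (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions per true class (row sums).
recall = np.diag(cm) / cm.sum(axis=1)
```

Rounded to two decimals, these give precision 0.72 and 0.90 and recall 0.88 and 0.74, which match the classification report, and the accuracy matches `accuracy_score`.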

In [31]:
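# 5. Improve the model.
# Try different values of the n_estimators hyperparameter.
np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimator....")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    # Score against the true labels (y_test). Scoring against y_preds, as in the
    # recorded run below, only measures agreement with the earlier model's predictions.
    print(f"Model accuracy on test set:{clf.score(X_test,y_test)*100:.2f}%")
    print("")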

Trying model with 10 estimator....
Model accuracy on test set:85.25%

Trying model with 20 estimator....
Model accuracy on test set:95.08%

Trying model with 30 estimator....
Model accuracy on test set:95.08%

Trying model with 40 estimator....
Model accuracy on test set:93.44%

Trying model with 50 estimator....
Model accuracy on test set:90.16%

Trying model with 60 estimator....
Model accuracy on test set:91.80%

Trying model with 70 estimator....
Model accuracy on test set:95.08%

Trying model with 80 estimator....
Model accuracy on test set:91.80%

Trying model with 90 estimator....
Model accuracy on test set:95.08%



In [32]:
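# In the recorded run above, the models with 20, 30, 70 and 90 estimators each reached 95.08%.
# Those figures were computed against y_preds rather than y_test, so they overstate the true test accuracy.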
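
Choosing `n_estimators` by looking at test-set scores also leaks information from the test set into the model choice. One common alternative, sketched here on synthetic stand-in data (not the heart-disease data), is to compare candidates with cross-validation on the training set and touch the test set only once at the end:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical stand-in for the heart-disease data (13 features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

cv_results = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation on the training set only; the test set stays untouched.
    cv_results[n] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

# Refit the best candidate on all training data, then evaluate once on the test set.
best_n = max(cv_results, key=cv_results.get)
final_model = RandomForestClassifier(n_estimators=best_n, random_state=42).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)   # compared against the true labels
```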

In [33]:
# 6. Save the trained model and load it again.
import pickle

# A context manager closes the file even if pickling fails.
with open("Random_classifier_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [39]:
# Load the saved model
with open("Random_classifier_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.7704918032786885
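
A quick way to confirm that saving and loading preserved the model is to compare predictions before and after the round trip. The sketch below assumes a toy model and a temporary file (`demo_classifier.pkl` is a hypothetical name):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy model to save.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_classifier.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)
    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
```

Pickled models are generally only safe to load with the same scikit-learn version that created them, and pickle files from untrusted sources should not be loaded at all.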

# 1. TRANSFORMING NON-NUMERICAL DATA TO NUMERICAL DATA

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("data/Car_sales.csv")
df.head()

Unnamed: 0,Manufacturer,Model,Sales_in_thousands,__year_resale_value,Vehicle_type,Price_in_thousands,Engine_size,Horsepower,Wheelbase,Width,Length,Curb_weight,Fuel_capacity,Fuel_efficiency,Latest_Launch,Power_perf_factor
0,Acura,Integra,16.919,16.36,Passenger,21.5,1.8,140.0,101.2,67.3,172.4,2.639,13.2,28.0,2/2/2012,58.28015
1,Acura,TL,39.384,19.875,Passenger,28.4,3.2,225.0,108.1,70.3,192.9,3.517,17.2,25.0,6/3/2011,91.370778
2,Acura,CL,14.114,18.225,Passenger,,3.2,225.0,106.9,70.6,192.0,3.47,17.2,26.0,1/4/2012,
3,Acura,RL,8.588,29.725,Passenger,42.0,3.5,210.0,114.6,71.4,196.6,3.85,18.0,22.0,3/10/2011,91.389779
4,Audi,A4,20.397,22.255,Passenger,23.99,1.8,150.0,102.6,68.2,178.0,2.998,16.4,27.0,10/8/2011,62.777639


In [2]:
df.columns

Index(['Manufacturer', 'Model', 'Sales_in_thousands', '__year_resale_value',
       'Vehicle_type', 'Price_in_thousands', 'Engine_size', 'Horsepower',
       'Wheelbase', 'Width', 'Length', 'Curb_weight', 'Fuel_capacity',
       'Fuel_efficiency', 'Latest_Launch', 'Power_perf_factor'],
      dtype='object')

In [3]:
df.isna().sum()/df.shape[0]*100

Manufacturer            0.000000
Model                   0.000000
Sales_in_thousands      0.000000
__year_resale_value    22.929936
Vehicle_type            0.000000
Price_in_thousands      1.273885
Engine_size             0.636943
Horsepower              0.636943
Wheelbase               0.636943
Width                   0.636943
Length                  0.636943
Curb_weight             1.273885
Fuel_capacity           0.636943
Fuel_efficiency         1.910828
Latest_Launch           0.000000
Power_perf_factor       1.273885
dtype: float64

In [4]:
df.dropna(subset=['Price_in_thousands'],inplace=True)

In [5]:
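# Fill the remaining gaps in numeric columns with the column mean.
# numeric_only=True skips text columns, which newer pandas versions cannot average.
df.fillna(df.mean(numeric_only=True),inplace=True)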
df.isna().sum()

Manufacturer           0
Model                  0
Sales_in_thousands     0
__year_resale_value    0
Vehicle_type           0
Price_in_thousands     0
Engine_size            0
Horsepower             0
Wheelbase              0
Width                  0
Length                 0
Curb_weight            0
Fuel_capacity          0
Fuel_efficiency        0
Latest_Launch          0
Power_perf_factor      0
dtype: int64
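
The same mean-filling can be done with scikit-learn's `SimpleImputer`, which remembers the means it learned so they can be reused on new data. A minimal sketch on a hypothetical four-row miniature of the car-sales table (the values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature of the car-sales table, with gaps in the numeric columns.
cars = pd.DataFrame({
    "Manufacturer": ["Acura", "Audi", "BMW", "Buick"],
    "Curb_weight": [2.6, np.nan, 3.4, 3.0],
    "Fuel_efficiency": [28.0, 27.0, np.nan, 23.0],
})

num_features = ["Curb_weight", "Fuel_efficiency"]
num_imputer = SimpleImputer(strategy="mean")

# fit_transform learns each column's mean and fills the gaps with it.
cars[num_features] = num_imputer.fit_transform(cars[num_features])
```

In a train/test workflow, a reasonable design is to call `fit` on the training rows only and `transform` on the test rows, so that test data does not influence the fill values.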

In [6]:
"""#Filling missing values using scikit learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

##Fill categorical values with "missing" and numerical values with mean.
num_imputer = SimpleImputer(strategy='mean')


##Define features.
num_features = ['__year_resale_value','Curb_weight','Fuel_efficiency']


##Create an imputer.
imputer = ColumnTransformer([
    ("num_imputer",num_imputer,num_features)
    
])

##Transform data
filled_X = imputer.fit_transform(X)
filled_X"""



In [7]:
#from sklearn.preprocessing import OneHotEncoder
#from sklearn.compose import ColumnTransformer

#categorical_feature = ['Manufacturer']
#one_hot = OneHotEncoder()
#transformer = ColumnTransformer([('one_hot',one_hot,categorical_feature)])
#transformed_X = transformer.fit_transform(df_filled)
#transformed_X

In [8]:
dummies = pd.get_dummies(df['Manufacturer'])
dummies.head()

Unnamed: 0,Acura,Audi,BMW,Buick,Cadillac,Chevrolet,Chrysler,Dodge,Ford,Honda,...,Oldsmobile,Plymouth,Pontiac,Porsche,Saab,Saturn,Subaru,Toyota,Volkswagen,Volvo
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
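
`pd.get_dummies` is convenient for a one-off table, but it cannot be re-applied to new data with the same columns. A sketch of the scikit-learn equivalent, `OneHotEncoder`, on a hypothetical four-row column (the data is invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column.
cars = pd.DataFrame({"Manufacturer": ["Acura", "Audi", "Acura", "BMW"]})

encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a sparse matrix; toarray() makes it dense for display.
encoded = encoder.fit_transform(cars[["Manufacturer"]]).toarray()
encoded_df = pd.DataFrame(encoded, columns=encoder.categories_[0])

# A manufacturer unseen during fit becomes an all-zero row instead of an error.
unseen = encoder.transform(pd.DataFrame({"Manufacturer": ["Volvo"]})).toarray()
```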


In [9]:
resulted_df = pd.concat([dummies.reset_index(drop=True), df.reset_index(drop=True)], axis=1)
resulted_df.isna().sum()
new_df = resulted_df.drop(['Model','Vehicle_type','Manufacturer','Latest_Launch'], axis=1)
new_df.columns

Index(['Acura', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet', 'Chrysler',
       'Dodge', 'Ford', 'Honda', 'Hyundai', 'Infiniti', 'Jaguar', 'Jeep',
       'Lexus', 'Lincoln', 'Mercedes-B', 'Mercury', 'Mitsubishi', 'Nissan',
       'Oldsmobile', 'Plymouth', 'Pontiac', 'Porsche', 'Saab', 'Saturn',
       'Subaru', 'Toyota', 'Volkswagen', 'Volvo', 'Sales_in_thousands',
       '__year_resale_value', 'Price_in_thousands', 'Engine_size',
       'Horsepower', 'Wheelbase', 'Width', 'Length', 'Curb_weight',
       'Fuel_capacity', 'Fuel_efficiency', 'Power_perf_factor'],
      dtype='object')

In [10]:
X = new_df.drop(['Price_in_thousands'],axis=1)
y_1 = df['Price_in_thousands']
y=y_1.astype(int)*1000

In [11]:
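# Let's fit the model.
# Note: price is a continuous quantity, so a classifier treats every distinct price as its own class.
# With so few examples per class, a low accuracy is expected here.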
np.random.seed(42)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2)


model = RandomForestClassifier()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.12903225806451613
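
An accuracy of about 13% is what one would expect when a classifier is asked to predict a continuous quantity. The sketch below uses synthetic data (the price formula is an assumption made up for illustration, not derived from the car-sales file) to contrast a classifier on truncated prices with a `RandomForestRegressor` on the raw prices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical data: price depends on engine size and horsepower plus noise.
rng = np.random.RandomState(42)
engine_size = rng.uniform(1.0, 6.0, size=300)
horsepower = 40 * engine_size + rng.normal(0, 10, size=300)
price = 5000 * engine_size + 50 * horsepower + rng.normal(0, 1000, size=300)

X_demo = np.column_stack([engine_size, horsepower])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, price, test_size=0.2, random_state=42)

# Classifier on prices truncated to whole thousands: each class has few examples.
clf_demo = RandomForestClassifier(random_state=42)
clf_demo.fit(X_tr, (y_tr // 1000).astype(int))
clf_accuracy = clf_demo.score(X_te, (y_te // 1000).astype(int))

# Regressor on the raw prices: score() returns R^2 instead of accuracy.
reg_demo = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
reg_r2 = reg_demo.score(X_te, y_te)
```

Accuracy and R^2 are different metrics, so the two numbers are not directly comparable; the point is only that the regressor is the estimator type that matches a numeric target.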

# 2. CHOOSING THE RIGHT ESTIMATOR/ALGORITHM.


###### SCIKIT-LEARN REFERS TO MODELS/ALGORITHMS AS ESTIMATORS
1. CLASSIFICATION - Predicting whether a sample is one thing or another.

2. REGRESSION - Predicting a number.

3. Check out the sklearn model map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## 2.1. Picking a machine learning model for a regression problem.

In [76]:
# Import the Boston housing dataset.
# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version.
from sklearn.datasets import load_boston

boston = load_boston()
boston

{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3

In [77]:
boston_df=pd.DataFrame(boston["data"],columns=boston["feature_names"])
boston_df["target"]=pd.Series(boston["target"])
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2
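
In environments where `load_boston` is not available, the same DataFrame-building pattern works with any bundled regression dataset. The sketch below swaps in `load_diabetes`, which is an assumption made for illustration and is not the data used in this notebook, so the scores that follow would differ:

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Swapped-in dataset (assumption): load_diabetes ships with scikit-learn, no download needed.
diabetes = load_diabetes()

# Same pattern as above: features into columns, target appended as an extra column.
diabetes_df = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
diabetes_df["target"] = pd.Series(diabetes["target"])

n_rows, n_cols = diabetes_df.shape
```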


In [78]:
# Sample size
len(boston_df)

506

In [79]:
from sklearn.linear_model import Ridge

np.random.seed(42)

# Separate the data into features and labels.
X = boston_df.drop(["target"],axis=1)
y = boston_df["target"]

# Split the data into train and test sets.
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2)

In [80]:
# Instantiate the Ridge model
model = Ridge()

model.fit(X_train,y_train)

Ridge()

In [81]:
#Checking score on test data
md = model.score(X_test,y_test)

In [82]:
# Instantiate an SVM regressor (SVR uses the RBF kernel by default, not a linear one)
from sklearn import svm
np.random.seed(42)
regr = svm.SVR()
regr.fit(X_train,y_train)

SVR()

In [83]:
#Checking score on test data
regr.score(X_test,y_test)

0.27948125010200286

In [84]:
# Let's try the random forest regressor
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)

rand_regr = RandomForestRegressor()
rand_regr.fit(X_train,y_train)

RandomForestRegressor()

In [85]:
#Checking score
rf = rand_regr.score(X_test,y_test)
rf

0.8922527442109116

In [86]:
print(f"RandomForestRegressor (R^2 score: {rf*100:.2f}%) clearly outperforms the Ridge model (R^2 score: {md*100:.2f}%).")

RandomForestRegressor (R^2 score: 89.23%) clearly outperforms the Ridge model (R^2 score: 66.62%).
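
Comparing several estimators is easier when they are fitted and scored in one loop on the same split. A minimal sketch on synthetic stand-in data (so the scores will not match the Boston results above; the names `candidates` and `r2_scores` are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Hypothetical stand-in data with a linear signal plus noise.
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "Ridge": Ridge(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

# score() on a regressor returns R^2 on the given data.
r2_scores = {name: est.fit(X_tr, y_tr).score(X_te, y_te) for name, est in candidates.items()}
best_name = max(r2_scores, key=r2_scores.get)
```

Which estimator wins depends on the data: on this linear toy problem Ridge should do very well, whereas the random forest came out ahead on the housing data above.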


Tidbit:

    1. If you have structured data, try ensemble methods first.
    
    2. If you have unstructured data (e.g. video, audio, text), use deep learning or transfer learning.

In [96]:
# Make predictions using predict()
y_pred=rand_regr.predict(X_test)
y_pred

array([22.839, 30.676, 16.317, 23.51 , 16.819, 21.374, 19.358, 15.62 ,
       21.091, 21.073, 20.047, 19.297,  8.611, 21.398, 19.378, 25.453,
       19.187,  8.538, 46.132, 14.536, 24.728, 23.996, 14.509, 23.847,
       14.363, 14.796, 21.126, 13.663, 19.535, 21.29 , 19.449, 23.393,
       29.3  , 20.338, 14.596, 15.594, 33.835, 19.123, 20.915, 24.376,
       19.286, 29.61 , 46.108, 19.428, 22.653, 13.676, 15.035, 24.321,
       18.689, 28.821, 21.107, 33.811, 16.502, 25.779, 44.922, 21.982,
       15.416, 32.032, 22.596, 20.296, 25.611, 33.916, 28.134, 18.551,
       26.745, 17.568, 13.992, 23.195, 29.022, 15.663, 21.074, 27.426,
       10.06 , 21.569, 21.952,  7.084, 19.905, 46.154, 11.274, 12.981,
       21.288, 12.562, 19.561,  9.392, 20.76 , 27.283, 15.383, 23.399,
       23.628, 17.617, 21.68 ,  8.019, 19.616, 18.714, 22.592, 19.786,
       41.733, 12.79 , 12.726, 13.119, 20.603, 23.902])

In [89]:
# First ten true values as a column vector, for comparison with y_pred
test = np.array(y_test[:10])
test.reshape(-1,1)

array([[23.6],
       [32.4],
       [13.6],
       [22.8],
       [16.1],
       [20. ],
       [17.8],
       [14. ],
       [19.6],
       [16.8]])

In [97]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test,y_pred)

2.0395392156862746
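
Mean absolute error is the average of the absolute differences between predictions and true values. The sketch below recomputes it by hand for the first five test samples only, using the values printed in the cells above, and checks the result against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First five true values and predictions, copied from the outputs above.
y_true_demo = np.array([23.6, 32.4, 13.6, 22.8, 16.1])
y_pred_demo = np.array([22.839, 30.676, 16.317, 23.51, 16.819])

# MAE = mean(|y_true - y_pred|)
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)
```

For these five samples the error is about 1.33; over the whole test set it is the 2.04 reported above, meaning predictions are off by roughly 2 units of the target on average.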