## Introduction to Scikit Learn (Machine Learning Library)

This notebook demonstrates some of the most useful functions of the scikit-learn (sklearn) machine learning library.

Topics to cover:

1. An end-to-end sklearn workflow
2. Getting the data ready
3. Choosing the right estimator/algorithm for our problem
4. Fitting the model/algorithm and using it to make predictions
5. Evaluating the model
6. Improving the model
7. Saving and loading a trained model
8. Putting it all together

In [96]:
what_were_covered = [ 
    '1. end-to-end sklearn workflow',
    '2. Getting the data ready',
    '3. Choose the right estimator/algorithem for our problems',
    '4. fit he model/algorithem and use it to make predictions',
    '5. Evaluating the model',
    '6. Improve the model',
    '7. save and load a train model',
    '8. putting it all together']

In [97]:
what_were_covered


['1. end-to-end sklearn workflow',
 '2. Getting the data ready',
 '3. Choose the right estimator/algorithem for our problems',
 '4. fit he model/algorithem and use it to make predictions',
 '5. Evaluating the model',
 '6. Improve the model',
 '7. save and load a train model',
 '8. putting it all together']

In [98]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### 0. End-to-End sklearn Workflow

In [99]:
# 1. Get the data ready
import pandas as pd
heart_disease = pd.read_csv('heart-disease.csv')

In [100]:
x = heart_disease.drop('target', axis=1)
y = heart_disease['target']

In [101]:
# 2. Choose the right estimator/algorithm for our problem

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()


In [102]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [103]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [104]:
clf.fit(x_train, y_train);

In [105]:
# make a prediction
import numpy as np
y_preds = clf.predict(x_test)
y_preds

array([0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1], dtype=int64)

In [106]:
y_test

166    0
12     1
182    0
282    0
177    0
      ..
20     1
298    0
5      1
88     1
82     1
Name: target, Length: 61, dtype: int64

In [107]:
# 4. Evaluate the model on the training data and the test data

clf.score(x_train, y_train)

1.0

In [108]:
clf.score(x_test, y_test)

0.8360655737704918

In [109]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.85      0.79      0.81        28
           1       0.83      0.88      0.85        33

    accuracy                           0.84        61
   macro avg       0.84      0.83      0.83        61
weighted avg       0.84      0.84      0.84        61



In [110]:
confusion_matrix(y_test, y_preds)

array([[22,  6],
       [ 4, 29]], dtype=int64)

In [111]:
accuracy_score(y_test, y_preds)

0.8360655737704918
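
The accuracy above can be recovered by hand from the confusion matrix: correct predictions sit on the diagonal, rows are the true labels, and columns are the predicted labels. A small sketch using the matrix values printed above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix values copied from the output above
cm = np.array([[22, 6],
               [4, 29]])

# Accuracy = correct predictions (diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall for class 1 = true positives / all actual positives (row 1)
recall_1 = cm[1, 1] / cm[1].sum()

# Precision for class 1 = true positives / all predicted positives (column 1)
precision_1 = cm[1, 1] / cm[:, 1].sum()

print(f"accuracy={accuracy:.4f}, recall_1={recall_1:.2f}, precision_1={precision_1:.2f}")
```

These values should line up with `accuracy_score` and with the class 1 row of the classification report.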

In [112]:
# 5. Improve the model
# Try different values of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on test set: {clf.score(x_test, y_test) * 100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 78.69%

Trying model with 20 estimators...
Model accuracy on test set: 77.05%

Trying model with 30 estimators...
Model accuracy on test set: 83.61%

Trying model with 40 estimators...
Model accuracy on test set: 80.33%

Trying model with 50 estimators...
Model accuracy on test set: 81.97%

Trying model with 60 estimators...
Model accuracy on test set: 83.61%

Trying model with 70 estimators...
Model accuracy on test set: 83.61%

Trying model with 80 estimators...
Model accuracy on test set: 83.61%

Trying model with 90 estimators...
Model accuracy on test set: 81.97%



In [113]:
# 6. Save a model and load it
import pickle

# Use a context manager so the file is closed after writing
with open("random_forst_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [114]:
# Use a context manager so the file is closed after reading
with open("random_forst_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test)

0.819672131147541

## Getting Data Ready

Three main things we have to do:
* Split the data into features and labels (usually 'X' and 'y')
* Fill (impute) or remove missing values
* Convert non-numerical data into numerical data

In [115]:
# Getting data ready

heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [116]:
X = heart_disease.drop('target', axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [117]:
y = heart_disease['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

Split the data into training and test sets

In [118]:
## Split the features and labels into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) 

In [119]:
X_train.shape, X_test.shape

((242, 13), (61, 13))

In [120]:
y_train.shape, y_test.shape

((242,), (61,))

In [121]:
heart_disease.shape

(303, 14)

In [122]:
242+61

303
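
The 242/61 split follows from `test_size=0.2`: the test share is rounded up, so 0.2 × 303 = 60.6 becomes 61 test rows and the remaining 242 go to training. A minimal sketch with placeholder data of the same length (the array `rows` and the `random_state` value are illustrative, not part of the notebook):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder rows standing in for the 303 samples (illustrative only)
rows = np.arange(303)

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=0)

# The test share is rounded up; training gets whatever is left
expected_test = math.ceil(0.2 * len(rows))
print(len(train_rows), len(test_rows), expected_test)
```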

### 1.1 Make sure it's all numeric

In [123]:
car_sales = pd.read_csv('car-sales-extended.csv')

In [124]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [125]:
len(car_sales)

1000

In [126]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [127]:
## split into X/y, feature and label

X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

In [128]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [129]:
y.head()

0    15323
1    19943
2    28343
3    13434
4    14043
Name: Price, dtype: int64

In [130]:
## Split into Training and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [132]:
## Build machine learning model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

As the error shows, the model cannot be trained on string values. They have to be converted into numbers first.

In [133]:
## converting string values to numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

object_feature = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, object_feature)], remainder='passthrough')

transformer_X = transformer.fit_transform(X)
transformer_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [134]:
## lets put it into dataframe

pd.DataFrame(transformer_X).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0


In [135]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [136]:
car_sales['Make'].value_counts()

Toyota    398
Honda     304
Nissan    198
BMW       100
Name: Make, dtype: int64

In [137]:
car_sales.Colour.value_counts()

White    407
Blue     321
Black     99
Red       94
Green     79
Name: Colour, dtype: int64

In [138]:
car_sales['Make'].unique()

array(['Honda', 'BMW', 'Toyota', 'Nissan'], dtype=object)

In [139]:
car_sales.Colour.unique()

array(['White', 'Blue', 'Red', 'Green', 'Black'], dtype=object)

In [140]:
car_sales.Doors.value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

Counting the unique values of the three columns gives 4 for Make, 5 for Colour, and 3 for Doors, or 12 categories in total. When sklearn's OneHotEncoder converts these categorical columns into numbers, it creates one 0/1 column for each category, which gives 12 columns. The remaining Odometer (KM) column is passed through unchanged, so the transformed data has 13 columns (0 to 12), as shown in the DataFrame above.
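
To make the column count concrete, here is a minimal sketch on a tiny made-up frame (the rows and the name `toy` are illustrative, not the car-sales data). The number of encoded columns should equal the total number of unique categories across the encoded columns:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative frame (made-up rows, not the car-sales data)
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Honda"],
    "Doors": [4, 5, 4, 3],
})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy).toarray()

# One output column per unique category: 3 makes + 3 door counts
n_categories = sum(len(cats) for cats in encoder.categories_)
print(encoded.shape, n_categories)
```

Each row contains exactly one 1 per original column, which is why every row of `encoded` sums to 2 here.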

In [141]:
len(transformer_X)

1000

In [142]:
## Alternative: converting object data into numbers with the help of pandas

dummies = pd.get_dummies(car_sales[['Make', 'Colour', 'Doors']])
dummies.head()

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
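
Note that Doors stays a single numeric column in the output above, because `pd.get_dummies` by default only expands non-numeric (object, string, or category) columns. A small sketch, on made-up rows, of passing the `columns` argument so that Doors is expanded as well:

```python
import pandas as pd

# Made-up rows for illustration only
toy_cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Doors": [4, 5, 3],
})

# Default: only the non-numeric column 'Make' is expanded
default_dummies = pd.get_dummies(toy_cars)

# Explicit columns: 'Doors' is expanded too
all_dummies = pd.get_dummies(toy_cars, columns=["Make", "Doors"])

print(list(default_dummies.columns))
print(list(all_dummies.columns))
```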


In [143]:
## let's refit the machine learning model

np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformer_X, y, test_size=0.2)

model.fit(X_train, y_train)

RandomForestRegressor()

In [144]:
model.score(X_test, y_test)

0.3235867221569877

## 1.2 What if there are missing values?

1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether
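
The two options can be sketched on a tiny made-up frame (the rows and variable names are illustrative assumptions, not the car-sales data):

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column
toy_missing = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 154365.0],
})

# Option 1: imputation, fill gaps with a placeholder or the column mean
filled = toy_missing.copy()
filled["Make"] = filled["Make"].fillna("missing")
filled["Odometer (KM)"] = filled["Odometer (KM)"].fillna(filled["Odometer (KM)"].mean())

# Option 2: drop every row that has at least one missing value
dropped = toy_missing.dropna()

print(len(filled), len(dropped))
```

Imputation keeps all rows at the cost of inserting estimated values, while dropping keeps only fully observed rows.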

In [145]:
# Import the dataset that contains missing values

car_sales_missing = pd.read_csv('car-sales-extended-missing-data.csv')

In [146]:
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [147]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [148]:
car_sales_missing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           951 non-null    object 
 1   Colour         950 non-null    object 
 2   Odometer (KM)  950 non-null    float64
 3   Doors          950 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 39.2+ KB


In [149]:
car_sales_missing.shape

(1000, 5)

In [150]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [151]:
X = car_sales_missing.drop('Price', axis=1)
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431.0,4.0
1,BMW,Blue,192714.0,5.0
2,Honda,White,84714.0,4.0
3,Toyota,White,154365.0,4.0
4,Nissan,Blue,181577.0,3.0


In [152]:
y = car_sales_missing['Price']
y.head()

0    15323.0
1    19943.0
2    28343.0
3    13434.0
4    14043.0
Name: Price, dtype: float64

In [153]:
# let's convert our data into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

object_feature = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, object_feature)], remainder='passthrough')

transformer_X = transformer.fit_transform(X)
transformer_X

ValueError: Input contains NaN

In [154]:
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


## Option 1. Fill Missing values with Pandas

In [155]:
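# Assign the result back instead of calling fillna(..., inplace=True) on a
# single column, which newer pandas versions warn about
car_sales_missing['Make'] = car_sales_missing['Make'].fillna('missing')

car_sales_missing['Colour'] = car_sales_missing['Colour'].fillna('missing')

car_sales_missing['Odometer (KM)'] = car_sales_missing['Odometer (KM)'].fillna(car_sales_missing['Odometer (KM)'].mean())

car_sales_missing['Doors'] = car_sales_missing['Doors'].fillna(4)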

In [156]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [157]:
## Remove the rows with a missing value in the Price column

car_sales_missing.dropna(inplace=True)

In [158]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [159]:
len(car_sales_missing)

950

In [160]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [161]:
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [162]:
# let's convert our data into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

object_feature = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, object_feature)], remainder='passthrough')

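# Note: this passes the whole DataFrame, so the Price label is passed through
# as the last output column below; pass X instead to encode the features only
transformer_X = transformer.fit_transform(car_sales_missing)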
transformer_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [163]:
pd.DataFrame(transformer_X).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,35431.0,15323.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0,19943.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,84714.0,28343.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,154365.0,13434.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0,14043.0


## Option 2. Fill Missing values with scikit-learn (sklearn)

In [164]:
## Filling missing values with sklearn

car_missing = pd.read_csv('car-sales-extended-missing-data.csv')
car_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [165]:
car_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [166]:
car_missing.dropna(subset=['Price'], inplace=True)
car_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [167]:
X = car_missing.drop('Price', axis=1)
y = car_missing['Price']

In [168]:
# fill missing values with sklearn

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' and numerical values with mean
cat_impute = SimpleImputer(strategy='constant', fill_value='missing')
door_impute = SimpleImputer(strategy='constant', fill_value=4)
num_impute = SimpleImputer(strategy='mean')

cat_column = ['Make', 'Colour']
door_column = ['Doors']
num_column = ['Odometer (KM)']

In [169]:
# creating imputer 

imputer = ColumnTransformer([
    ('cat_impute', cat_impute, cat_column),
    ('door_impute', door_impute, door_column),
    ('num_impute', num_impute, num_column)
])

# Transform the data

filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [182]:
car_filled = pd.DataFrame(filled_X, 
                         columns = ['Make', 'Colour', 'Doors', 'Odometer (KM)'])

car_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4,35431
1,BMW,Blue,5,192714
2,Honda,White,4,84714
3,Toyota,White,4,154365
4,Nissan,Blue,3,181577


In [183]:
# Converting into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

object_feature = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, object_feature)], remainder='passthrough')

transformer_X = transformer.fit_transform(car_filled)
transformer_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [185]:
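# transformer_X is a sparse matrix here, so each row shows up as one sparse object;
# transformer_X.toarray() would give a dense array with one column per feature
pd.DataFrame(transformer_X).head()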

Unnamed: 0,0
0,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
1,"(0, 0)\t1.0\n (0, 6)\t1.0\n (0, 13)\t1.0\n..."
2,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
3,"(0, 3)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
4,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 11)\t1.0\n..."
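
The frame above has a single column because `transformer_X` is a SciPy sparse matrix at this point, and pandas appears to wrap each sparse row as one object. A minimal sketch, on made-up rows, of converting a sparse encoder output to a dense array with `.toarray()` before building the DataFrame:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only
toy = pd.DataFrame({"Colour": ["White", "Blue", "White", "Red"]})

# OneHotEncoder returns a SciPy sparse matrix by default
sparse_encoded = OneHotEncoder().fit_transform(toy)

# Densify first so each encoded column becomes its own DataFrame column
dense_df = pd.DataFrame(sparse_encoded.toarray())

print(sparse.issparse(sparse_encoded), dense_df.shape)
```

Densifying is fine for small data like this; for very wide encodings the sparse form saves memory and can be passed to the model directly, as the next cell does.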


In [189]:
# let's fit the model

np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformer_X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.21990196728583944

In [190]:
what_were_covered

['1. end-to-end sklearn workflow',
 '2. Getting the data ready',
 '3. Choose the right estimator/algorithem for our problems',
 '4. fit he model/algorithem and use it to make predictions',
 '5. Evaluating the model',
 '6. Improve the model',
 '7. save and load a train model',
 '8. putting it all together']

## 3. Choosing the right algorithm and using it to make predictions

Scikit-learn uses the term estimator for a machine learning model or algorithm.

* Classification = predicting whether a sample is one thing or another (a category)
* Regression = predicting a number
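
As a rough illustration of the difference, the sketch below fits one classifier and one regressor on synthetic data (the data from `make_classification` and `make_regression`, and all variable names, are assumptions for illustration). Both estimators share the same `fit`/`predict` interface; only the kind of prediction differs:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data (an assumption for illustration only)
X_cls, y_cls = make_classification(n_samples=100, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=100, n_features=5, random_state=0)

# Both estimators share the same fit/predict interface
classifier = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_cls, y_cls)
regressor = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_reg, y_reg)

class_preds = classifier.predict(X_cls[:5])   # discrete class labels
number_preds = regressor.predict(X_reg[:5])   # continuous values

print(class_preds.dtype, number_preds.dtype)
```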

<img src='sklearn-ml-map-cheatsheet-heart-disease-linear-svc.png'>

## Picking a machine learning model for a regression problem

In [194]:
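# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run
from sklearn.datasets import load_boston

boston = load_boston()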

In [197]:
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['target'] = pd.Series(boston['target'])
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [198]:
len(boston_df)

506

In [203]:
# let's try Ridge model

from sklearn.linear_model import Ridge

np.random.seed(42)

X = boston_df.drop('target', axis=1)
y = boston_df['target']

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model

model = Ridge()
model.fit(X_train, y_train)

# Check the score (R^2 for regressors) of the Ridge model on the Boston test set

model.score(X_test, y_test)

0.6662221670168522

How do we improve this score?

What if Ridge wasn't the right model for our dataset?
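
One way to approach these questions is to fit more than one estimator on the same split and compare their test scores. A hedged sketch, using synthetic data from `make_regression` in place of the Boston dataset (the data, the candidate list, and the hyperparameters are all illustrative assumptions, so the numbers say nothing about the Boston results):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features (illustrative only)
X_syn, y_syn = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Candidate estimators to compare on the same split
candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

test_scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    test_scores[name] = estimator.score(X_te, y_te)  # R^2 on the test set
    print(f"{name}: R^2 = {test_scores[name]:.3f}")
```

Which estimator scores higher depends on the data, so the comparison has to be rerun on the actual dataset.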