# Train | Test split procedure

0. Clean and adjust data as necessary for X and y
1. Split data into Train/Test sets for both X and y
2. Fit/Train the scaler on X Train Data only, then scale X Train Data
3. Scale X Test Data with the same fitted scaler
4. Create model
5. Fit/Train model on X Train Data
6. Evaluate model on X Test Data (by creating predictions and comparing to y_test)
7. Adjust parameters as necessary and repeat steps 5 and 6
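
The steps above can be sketched end to end as follows. This is only a minimal illustration: it assumes synthetic data from `make_regression` as a stand-in for a cleaned data set, and the names `X_train_s` and `X_test_s` are chosen here for the scaled arrays.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 0: synthetic stand-in for cleaned features X and target y (an assumption)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Step 1: split both X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Steps 2-3: fit the scaler on the training data only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Steps 4-5: create the model and fit it on the scaled training data
model = Ridge(alpha=1.0).fit(X_train_s, y_train)

# Step 6: evaluate on the scaled test data
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_s)))
print(f"test RMSE: {rmse:.3f}")
```

Step 7 would change `alpha` and rerun steps 4 to 6.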

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [2]:
df=pd.read_csv("Advertising.csv")

In [3]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [4]:
#  separating features (X) and target (y) before the train/test split
X=df.drop('sales',axis=1)

In [5]:
y=df['sales']

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)

# TV, radio and newspaper may be measured on different scales, so we standardize the features to make their coefficients comparable

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
scaler = StandardScaler()

In [10]:
scaler.fit(X_train)

In [11]:
X_train=scaler.transform(X_train)

In [12]:
X_test=scaler.transform(X_test)
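
A small check of what fitting the scaler on the training data only means in practice. This sketch assumes random data in place of the advertising features; the array names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test features drawn from the same distribution
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(140, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(60, 3))

# The scaler learns mean and standard deviation from the training data only
scaler = StandardScaler()
scaler.fit(X_train)

X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0))  # approximately 0 for every column
print(X_train_s.std(axis=0))   # approximately 1 for every column
print(X_test_s.mean(axis=0))   # near 0, but generally not exactly 0
```

The test set is transformed with the training statistics, so no information from the test rows reaches the scaler.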

In [13]:
#  Let's begin with Ridge regression

In [14]:
from sklearn.linear_model import Ridge

In [15]:
model=Ridge(alpha=100)   # regularization strength (hyperparameter) alpha = 100

In [16]:
model.fit(X_train,y_train)

In [17]:
y_pred=model.predict(X_test)

In [18]:
from sklearn.metrics import mean_squared_error

In [19]:
RMSE=np.sqrt(mean_squared_error(y_test,y_pred))

In [20]:
RMSE

2.7095711448556075

In [22]:
#  Now we can adjust the hyperparameter to try to get better performance

In [23]:
model_two= Ridge(alpha=1)

In [26]:
model_two.fit(X_train,y_train)

In [27]:
y_pred=model_two.predict(X_test)

In [28]:
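# note: this is the MSE, whereas the alpha=100 run above reported the RMSE; take np.sqrt of this value to compare the two
mean_squared_error(y_test,y_pred)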

2.3190215794287514
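
The two runs above report different metrics (RMSE for alpha=100, MSE for alpha=1), which makes them hard to compare directly. A possible way to keep the metric consistent is to loop over the alpha values and compute RMSE for each. The sketch below assumes synthetic data and a dictionary named `rmse_by_alpha`, neither of which is part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the advertising data set)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Same metric (RMSE) for every alpha, so the numbers are directly comparable
rmse_by_alpha = {}
for alpha in (100, 1):
    model = Ridge(alpha=alpha).fit(X_train_s, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test_s))
    rmse_by_alpha[alpha] = float(np.sqrt(mse))

for alpha, rmse in rmse_by_alpha.items():
    print(f"alpha={alpha}: RMSE={rmse:.3f}")
```

Like the cells above, this still picks alpha by looking at the test set.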

# train_test_split using scikit-learn
# Advantages:
 1. Easy to use
 2. A single one-line call
# Disadvantages:
 1. We have to create a new model by hand for every new value of alpha we want to check.
 2. The error we get above is not necessarily the best achievable; another value of alpha might perform better.
 3. In model two we adjusted the hyperparameter based on model one's performance on the test set, so this is not a fair evaluation:
    we can no longer report a performance metric on truly "unseen" data.
 # So, in the next step we hold a little bit of data aside that the model is never adjusted to, to get a true and fair evaluation of the model's capability
 

# Train_Validation_Test_Split

Idea ->
 1. We separate the test data from the given data set and do not perform predictions on it until we have adjusted the hyperparameters. The rest of the data is split into a train set and a validation set.

 2. We train the model on the train set, validate it on the validation set, adjust the hyperparameter, and repeat this process until the performance is good enough.

 3. The test data is for final evaluation only. We can't adjust hyperparameters after seeing it; it is just used for reporting.
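
The idea can be sketched as a small selection loop: pick the alpha with the lowest validation error, then touch the test set exactly once. This is an illustration under assumptions: the data is synthetic, and `candidate_alphas`, `val_mse` and `best_alpha` are names introduced here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 70% train, then the remaining 30% is halved into validation and test
X_train, X_other, y_train, y_other = train_test_split(
    X, y, test_size=0.3, random_state=101
)
X_val, X_test, y_val, y_test = train_test_split(
    X_other, y_other, test_size=0.5, random_state=101
)

scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Tune alpha on the validation set only
candidate_alphas = (0.1, 1, 10, 100)
val_mse = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train_s, y_train)
    val_mse[alpha] = mean_squared_error(y_val, model.predict(X_val_s))

best_alpha = min(val_mse, key=val_mse.get)

# Final, one-time evaluation on the held-out test set
final_model = Ridge(alpha=best_alpha).fit(X_train_s, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test_s))
print(f"best alpha: {best_alpha}, test MSE: {test_mse:.3f}")
```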

In [29]:
#  implementation -> just call train_test_split twice

In [30]:
X_train, X_other, y_train, y_other = train_test_split(X,y,test_size=0.3,random_state=101)

In [32]:
#  test_size=0.5 of (0.3 of all data) -> 50% of (30% of all data) -> 15% of all data each for validation and test
X_eval, X_test, y_eval, y_test = train_test_split(X_other,y_other,test_size=0.5,random_state=101)

In [33]:
len(df)

200

In [34]:
len(X_train)

140

In [35]:
len(X_eval)

30

In [36]:
len(X_test)

30
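
The sizes above follow from the two split fractions: 200 rows give 140 for training, and the remaining 60 are halved. A quick way to verify that the three parts do not overlap, assuming row indices in place of the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices stand in for the 200 rows of the data set
rows = np.arange(200)

train_rows, other_rows = train_test_split(rows, test_size=0.3, random_state=101)
val_rows, test_rows = train_test_split(other_rows, test_size=0.5, random_state=101)

print(len(train_rows), len(val_rows), len(test_rows))  # 140 30 30

# Every row lands in exactly one of the three parts
all_rows = np.concatenate([train_rows, val_rows, test_rows])
print(len(set(all_rows.tolist())) == len(rows))  # True
```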

In [44]:
#  scaling the data -> note that we also have to scale the validation and test data
#  why are we scaling? -> to put all features on the same scale, so that the coefficients are comparable
#  and the ridge penalty treats every feature equally

In [45]:
scaler=StandardScaler()

In [46]:
scaler.fit(X_train)

In [47]:
X_train=scaler.transform(X_train)

In [48]:
X_test=scaler.transform(X_test)

In [49]:
X_eval=scaler.transform(X_eval)

In [50]:
#  now fitting the model
from sklearn.linear_model import Ridge

In [51]:
model=Ridge(alpha=100)

In [52]:
model.fit(X_train,y_train)

In [63]:
#  predicting validation set
eval_pred=model.predict(X_eval)

In [64]:
mean_squared_error(y_eval,eval_pred)

7.320101458823867

In [65]:
# trying another value of alpha
model_two=Ridge(alpha=1)

In [66]:
model_two.fit(X_train,y_train)

In [67]:
new_eval_pred=model_two.predict(X_eval)

In [68]:
mean_squared_error(y_eval,new_eval_pred)

2.383783075056986

In [61]:
#  Is this a better and fair performance estimate? It is better, but not a fair one, since we adjusted the hyperparameter
# according to the previous result on this same validation set

In [62]:
#  Now assume we are satisfied with model_two; we predict on the test set for the final evaluation that gets reported

In [69]:
y_final_test_pred=model_two.predict(X_test)

In [71]:
mean_squared_error(y_test,y_final_test_pred)

2.254260083800518

# cross_val_score function   
 K-fold splitting: the larger the value of k, the more computation is needed. The maximum value of k equals the number of rows in the train set (leave-one-out).
 
 # K-fold: K-1 folds for training and 1 fold for validating
 
 Advantage: the model is trained on every portion of the data at some point, and validated on every portion of the data at some point
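
To make the fold mechanics concrete, the sketch below uses scikit-learn's `KFold` on ten hypothetical rows and counts how often each row is used for validation. The array `X_demo` and the counter `times_validated` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 10
X_demo = np.arange(n_rows).reshape(-1, 1)  # hypothetical data: ten rows, one feature

# With k=5, each split trains on 4 folds and validates on the remaining fold
times_validated = np.zeros(n_rows, dtype=int)
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(X_demo)):
    times_validated[val_idx] += 1
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on rows {val_idx}")

# With k equal to the number of rows, every validation fold holds a single row
loo_fold_sizes = [len(val_idx) for _, val_idx in KFold(n_splits=n_rows).split(X_demo)]
print(loo_fold_sizes)
```

Each row ends up in a validation fold exactly once, which is the advantage described above.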

----
## Cross Validation with cross_val_score

----

<img src="grid_search_cross_validation.png">

----

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)

In [74]:
from sklearn.preprocessing import StandardScaler

In [75]:
scaler=StandardScaler()

In [76]:
scaler.fit(X_train)

In [77]:
X_train=scaler.transform(X_train)

In [78]:
X_test=scaler.transform(X_test)

In [79]:
#  fitting the model

In [80]:
model=Ridge(alpha=100)

In [81]:
from sklearn.model_selection import cross_val_score

In [82]:
scores=cross_val_score(model,X_train,y_train,scoring='neg_mean_squared_error',cv=5)

In [83]:
scores

array([ -9.32552967,  -4.9449624 , -11.39665242,  -7.0242106 ,
        -8.38562723])

In [84]:
#  overall error: mean MSE across the 5 folds (the scores are negated MSE, hence abs)
abs(scores.mean())

8.215396464543607

In [86]:
#  using another version of alpha
model=Ridge(alpha=1)

In [87]:
scores=cross_val_score(model,X_train,y_train,scoring='neg_mean_squared_error',cv=5)

In [88]:
scores

array([-3.15513238, -1.58086982, -5.40455562, -2.21654481, -4.36709384])

In [89]:
abs(scores.mean())

3.344839296530695
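
The `neg_mean_squared_error` scorer returns negated MSE values so that a higher score is always better; flipping the sign recovers the usual MSE. A hedged sketch of comparing several alpha values this way, assuming synthetic data and a dictionary named `cv_mse`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
X_train_s = StandardScaler().fit_transform(X_train)

# Mean cross-validated MSE per alpha, with the sign flipped back to positive
cv_mse = {}
for alpha in (100, 1):
    scores = cross_val_score(
        Ridge(alpha=alpha), X_train_s, y_train,
        scoring="neg_mean_squared_error", cv=5,
    )
    cv_mse[alpha] = -scores.mean()

for alpha, mse in cv_mse.items():
    print(f"alpha={alpha}: mean CV MSE={mse:.3f}")
```

The test set is not used at all during this comparison.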

In [90]:
# cross_val_score fits clones of the model, so model itself is still unfitted; we have to fit it before predicting
model.fit(X_train,y_train)

In [91]:
y_final_test_pred=model.predict(X_test)

In [92]:
mean_squared_error(y_test,y_final_test_pred)

2.3190215794287514
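
The need for the explicit `fit` call can be checked directly: `cross_val_score` works on clones of the estimator, so the object passed in has no learned coefficients afterwards. A minimal check, assuming synthetic data and the illustrative flags `fitted_before` and `fitted_after`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=0)

model = Ridge(alpha=1)
cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)

# The estimator we passed in was cloned for each fold and is still unfitted
fitted_before = hasattr(model, "coef_")

model.fit(X, y)
fitted_after = hasattr(model, "coef_")

print(fitted_before, fitted_after)  # False True
```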

# cross_validate function
 : allows us to view multiple performance metrics from cross-validation on a model and to see how much time fitting and scoring took

In [94]:
# create X and y
X=df.drop('sales',axis=1)
y=df['sales']

#  train test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=101)

# scaling the data
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)

In [95]:
# now cross validation
from sklearn.model_selection import cross_validate

In [96]:
model=Ridge(alpha=100)

In [97]:
scores=cross_validate(model,X_train,y_train,scoring=['neg_mean_squared_error','neg_mean_absolute_error'],cv=10)

In [98]:
scores

{'fit_time': array([0.00160599, 0.        , 0.00050473, 0.        , 0.00099993,
        0.        , 0.        , 0.        , 0.        , 0.        ]),
 'score_time': array([0.        , 0.00100851, 0.        , 0.00099874, 0.0010004 ,
        0.0010004 , 0.00099802, 0.00100017, 0.00100112, 0.0009985 ]),
 'test_neg_mean_squared_error': array([ -6.06067062, -10.62703078,  -3.99342608,  -5.00949402,
         -9.14179955, -13.08625636,  -3.83940454,  -9.05878567,
         -9.05545685,  -5.77888211]),
 'test_neg_mean_absolute_error': array([-1.8102116 , -2.54195751, -1.46959386, -1.86276886, -2.52069737,
        -2.45999491, -1.45197069, -2.37739501, -2.44334397, -1.89979708])}

In [99]:
scores=pd.DataFrame(scores)

In [102]:
scores  # more information than cross_val_score

Unnamed: 0,fit_time,score_time,test_neg_mean_squared_error,test_neg_mean_absolute_error
0,0.001606,0.0,-6.060671,-1.810212
1,0.0,0.001009,-10.627031,-2.541958
2,0.000505,0.0,-3.993426,-1.469594
3,0.0,0.000999,-5.009494,-1.862769
4,0.001,0.001,-9.1418,-2.520697
5,0.0,0.001,-13.086256,-2.459995
6,0.0,0.000998,-3.839405,-1.451971
7,0.0,0.001,-9.058786,-2.377395
8,0.0,0.001001,-9.055457,-2.443344
9,0.0,0.000998,-5.778882,-1.899797


In [103]:
scores.mean()

fit_time                        0.000311
score_time                      0.000801
test_neg_mean_squared_error    -7.565121
test_neg_mean_absolute_error   -2.083773
dtype: float64
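
Both error columns above are negated, so their means are negative. A sketch of reading the `cross_validate` result and flipping the signs back, assuming synthetic data and the names `results`, `mean_mse` and `mean_mae`:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=140, n_features=3, noise=10.0, random_state=0)

results = cross_validate(
    Ridge(alpha=100), X, y,
    scoring=["neg_mean_squared_error", "neg_mean_absolute_error"],
    cv=10,
)

# One entry per fold for timings and for each requested metric
print(sorted(results.keys()))

# Flip the sign to recover the usual positive error values
mean_mse = -results["test_neg_mean_squared_error"].mean()
mean_mae = -results["test_neg_mean_absolute_error"].mean()
print(f"mean CV MSE: {mean_mse:.3f}, mean CV MAE: {mean_mae:.3f}")
```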

In [104]:
#  now try to improve our performance
model=Ridge(alpha=1)

In [105]:
scores=cross_validate(model,X_train,y_train,scoring=['neg_mean_squared_error','neg_mean_absolute_error'],cv=10)

In [106]:
scores=pd.DataFrame(scores)

In [107]:
scores.mean()

fit_time                        0.000329
score_time                      0.000331
test_neg_mean_squared_error    -3.323018
test_neg_mean_absolute_error   -1.308467
dtype: float64

In [108]:
#  finally, fit the model on the full training set and evaluate it once on the test set
model.fit(X_train,y_train)

In [109]:
y_final_test_predict=model.predict(X_test)

In [110]:
mean_squared_error(y_test,y_final_test_predict)

2.3190215794287514