# Homework 12

https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning

* Implement mini-batch functionality to train a regressor.
    - (Optional) If you want to do this in a pipeline, see: https://koaning.github.io/tokenwiser/api/pipeline.html

* Save the model, load it again, and test it on `X_test`.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [5]:
DATA_URL = 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/car_prices/car_prices.csv'

def test_df():
    """Return a fixed 5,000-row test sample as (X, y)."""
    df = pd.read_csv(DATA_URL, low_memory=False)
    df = df.sample(5000, random_state=100).reset_index(drop=True)

    y = df['sellingprice']
    X = df.drop('sellingprice', axis=1)

    return X, y

def partial_df():
    """Yield an endless stream of random 100-row batches from the full data set."""
    df = pd.read_csv(DATA_URL, low_memory=False)

    while True:
        yield df.sample(100).reset_index(drop=True)

gen = partial_df()
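
`partial_df` loads the whole CSV into memory and then samples from it at random. If the file were too large for that, one option would be to stream it in consecutive chunks with the `chunksize` argument of `pd.read_csv`. The sketch below is only an illustration: `chunked_batches` is a hypothetical helper, and the tiny in-memory CSV stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file; a real path or URL would go here.
csv_text = "x,sellingprice\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

def chunked_batches(source, batch_size):
    """Yield consecutive DataFrames of at most batch_size rows (one pass over the file)."""
    for chunk in pd.read_csv(source, chunksize=batch_size):
        yield chunk.reset_index(drop=True)

batches = list(chunked_batches(io.StringIO(csv_text), batch_size=4))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

Unlike random sampling, this visits every row exactly once per pass, so the generator ends when the file is exhausted.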

In [6]:
X_test, y_test = test_df()

In [7]:
# Each call returns a fresh random sample of 100 rows (not a sequential slice).
next(gen)

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,seller,mmr,sellingprice,saledate
0,2014,Honda,Civic,LX,Coupe,automatic,2hgfg3b5xeh516466,nv,4.1,10891.0,gray,gray,"ahfc/honda lease trust/hvt, inc.",14550,14300,Fri May 22 2015 05:00:00 GMT-0700 (PDT)
1,2012,smart,fortwo,passion coupe,Hatchback,automatic,wmeej3ba0ck570956,wi,4.9,8944.0,red,gray,mercedes-benz financial services,6250,6000,Wed Jun 03 2015 03:00:00 GMT-0700 (PDT)
2,2013,Mercedes-Benz,E-Class,E350 Sport,sedan,automatic,wddhf5kb9da698742,ca,3.4,49095.0,white,black,the hertz corporation,24600,24300,Tue Jun 16 2015 05:30:00 GMT-0700 (PDT)
3,2009,Nissan,Sentra,2.0,sedan,automatic,3n1ab61e89l620559,nc,2.9,93182.0,black,gray,boyd chevrolet buick gmc,4550,6100,Mon Jun 08 2015 02:00:00 GMT-0700 (PDT)
4,2006,Nissan,Sentra,SE-R,Sedan,automatic,3n1ab51d16l566115,ga,1.9,105779.0,silver,black,bmw alphera/alphera financial services,2375,2100,Thu Feb 26 2015 02:00:00 GMT-0800 (PST)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2012,Dodge,Avenger,SE,Sedan,automatic,1c3cdzab9cn220996,ca,4.2,20206.0,white,black,citi remarketing & recovery services,10300,10000,Thu Feb 19 2015 04:00:00 GMT-0800 (PST)
96,2012,Toyota,RAV4,Base,SUV,automatic,2t3zf4dv5cw124364,oh,4.3,13978.0,brown,tan,nassief auto group inc,16250,16100,Tue Feb 10 2015 01:30:00 GMT-0800 (PST)
97,2011,Nissan,Rogue,SV,SUV,automatic,jn8as5mv2bw310124,il,3.9,44183.0,—,black,nissan-infiniti lt,14650,15600,Tue Jan 20 2015 02:00:00 GMT-0800 (PST)
98,2014,Chrysler,Town and Country,Touring,Minivan,automatic,2c4rc1bg4er395540,fl,4,30306.0,white,tan,pv holding inc/gdp,18450,20100,Mon Mar 02 2015 04:30:00 GMT-0800 (PST)


# Preprocessing

In [8]:
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Split the feature columns by dtype so each group gets its own preprocessing.
numerical = list(X_test.select_dtypes(include='number').columns)
categorical = list(X_test.select_dtypes(include=['object', 'category']).columns)


In [9]:
n_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])


In [10]:
c_pipeline = Pipeline([
    ('simpleimputer', SimpleImputer(strategy='most_frequent')),
    ('onehotencoder', OneHotEncoder(drop='if_binary', handle_unknown='ignore'))
])

In [11]:
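# Combine both sub-pipelines (ColumnTransformer was already imported above).
pipeline = ColumnTransformer([
    ('numerical', n_pipeline, numerical),
    ('categorical', c_pipeline, categorical)
])
# Fit the preprocessing once on X_test so that every mini-batch is transformed
# with the same statistics and the same one-hot columns.
pipeline.fit(X_test)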
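
Because the preprocessing is fitted on the 5,000 test rows while the batches are sampled from the full data set, a batch can contain category values the encoder never saw. With `handle_unknown='ignore'`, `OneHotEncoder` encodes such values as an all-zero row instead of raising an error. The toy values below are hypothetical and only illustrate that behavior.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy column (hypothetical values) standing in for a categorical feature such as 'make'.
seen = [['Honda'], ['Nissan'], ['Toyota']]
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(seen)

# 'Honda' was seen during fit; 'Tesla' was not, so its row is all zeros.
encoded = encoder.transform([['Honda'], ['Tesla']]).toarray()
print(encoded)
```

This keeps the number of output columns fixed, which `partial_fit` needs: every batch must have the same feature width as the first one.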

# Implement a mini batch functionality to train a regressor.

In [12]:
Regressor = SGDRegressor(loss='squared_error', random_state=100, learning_rate='constant', eta0=0.01)


In [13]:
import warnings
warnings.filterwarnings("ignore")

n_rounds = 100          # outer passes
batches_per_round = 60  # mini-batches per pass (6,000 batches of 100 rows in total)

for _ in range(n_rounds):
    for _ in range(batches_per_round):
        # Draw a random 100-row batch and separate the target column.
        batch_data = next(gen)
        X_SGD = batch_data.drop('sellingprice', axis=1)
        Y_SGD = batch_data['sellingprice']
        # Apply the already-fitted preprocessing, then update the model incrementally.
        X_SGD_Pre = pipeline.transform(X_SGD)
        X_SGD_Pre = np.nan_to_num(X_SGD_Pre)
        Regressor.partial_fit(X_SGD_Pre, Y_SGD)

# Predict on the held-out test set with the trained regressor.
X_SGD_test = pipeline.transform(X_test)
X_SGD_test = np.nan_to_num(X_SGD_test)
y_prediction = Regressor.predict(X_SGD_test)


In [14]:
print(X_SGD_test)

  (0, 0)	-0.4742961437221523
  (0, 1)	0.5629169986629211
  (0, 8)	1.0
  (0, 440)	1.0
  (0, 810)	1.0
  (0, 1086)	1.0
  (0, 1114)	1.0
  (0, 2437)	1.0
  (0, 6146)	1.0
  (0, 6160)	1.0
  (0, 6195)	1.0
  (0, 6213)	1.0
  (0, 6564)	1.0
  (0, 7789)	1.0
  (0, 9070)	1.0
  (1, 0)	0.2670232431521104
  (1, 1)	-0.615146664421381
  (1, 19)	1.0
  (1, 228)	1.0
  (1, 756)	1.0
  (1, 1077)	1.0
  (1, 1114)	1.0
  (1, 5038)	1.0
  (1, 6146)	1.0
  (1, 6183)	1.0
  :	:
  (4998, 1087)	1.0
  (4998, 1114)	1.0
  (4998, 1444)	1.0
  (4998, 6142)	1.0
  (4998, 6189)	1.0
  (4998, 6207)	1.0
  (4998, 6213)	1.0
  (4998, 6792)	1.0
  (4998, 7889)	1.0
  (4998, 9398)	1.0
  (4999, 0)	1.0083426300263731
  (4999, 1)	-0.9418176011012634
  (4999, 8)	1.0
  (4999, 268)	1.0
  (4999, 815)	1.0
  (4999, 1087)	1.0
  (4999, 1114)	1.0
  (4999, 3404)	1.0
  (4999, 6121)	1.0
  (4999, 6170)	1.0
  (4999, 6200)	1.0
  (4999, 6213)	1.0
  (4999, 6829)	1.0
  (4999, 7907)	1.0
  (4999, 9277)	1.0


In [15]:
print(y_prediction)

[14536.38501481 16761.60611585 13026.5627159  ... 18897.78011844
 16604.79594297 15772.56297734]


# Save model, load the model again and test it on X_test

In [19]:
import joblib

joblib.dump(Regressor, 'regressor.joblib')
r_JOB = joblib.load('regressor.joblib')
X_job_Test = pipeline.transform(X_test)
y_pred = r_JOB.predict(X_job_Test)
# For a regressor, score() returns the coefficient of determination (R^2).
s = r_JOB.score(X_job_Test, y_test)
print("R^2 score of the model using SGDRegressor:", s)

R^2 score of the model using SGDRegressor: 0.8265994351553551


* `joblib.dump()` serializes the fitted regressor (`Regressor`) to the file `regressor.joblib`.
* `joblib.load()` reads that file back, and the restored model is assigned to `r_JOB`.
* `pipeline.transform()` preprocesses `X_test`, and the result is stored in `X_job_Test`.
* `r_JOB.predict()` produces the predictions `y_pred` from `X_job_Test`.
* `r_JOB.score()` evaluates the restored model on `X_job_Test` against the true prices `y_test`. For a regressor, `score()` returns the coefficient of determination (R²), not a classification accuracy. The value is stored in `s` and printed.
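
Only the regressor is written to disk above, so the fitted `pipeline` must still be in memory when the model is reloaded. In a fresh session it can help to persist the preprocessing and the model together. The sketch below shows one possible way with synthetic data. The dictionary keys and the file name `bundle.joblib` are arbitrary choices, and a `StandardScaler` stands in for the full `ColumnTransformer`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical) in place of the car-price features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 2))
y = 3.0 * X[:, 0] - X[:, 1]

scaler = StandardScaler().fit(X)
model = SGDRegressor(random_state=100)
model.partial_fit(scaler.transform(X), y)

# Store both fitted objects in a single file.
path = os.path.join(tempfile.mkdtemp(), 'bundle.joblib')
joblib.dump({'preprocess': scaler, 'model': model}, path)

# Reload and predict using only what came out of the file.
bundle = joblib.load(path)
restored = bundle['model'].predict(bundle['preprocess'].transform(X))
original = model.predict(scaler.transform(X))
print(np.allclose(restored, original))
```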

The reloaded model reaches an R² of about 0.83 on `X_test`.
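
The number returned by `score()` for a regressor such as `SGDRegressor` is the coefficient of determination, R² = 1 - SS_res / SS_tot. As a small illustration of that definition, the helper below (`r2_score_manual`, a hypothetical name) computes it directly with NumPy on toy values.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0]
perfect = r2_score_manual(y_true, y_true)                # exact predictions
baseline = r2_score_manual(y_true, [20.0, 20.0, 20.0])   # always predict the mean
print(perfect, baseline)
```

A value of 1 means the predictions match exactly, and 0 means the model does no better than always predicting the mean of `y_true`.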