## 4.0 Capstone Project Two: Allstate Purchase Prediction - Pre-Processing and Training Data Development<a id='2_Exploratory_Data_Analysis'></a>
**Submitted By:** Amit Kukreja

## Objectives<a id='2.2_EDA_Objectives'></a>

1) Split the data into training and test sets, save them as separate CSV files, and use only the training data going forward.

2) Make a dataframe with one row per customer in the training set that includes the data from the first two shopping points, with the final shopping point as the target.

3) Write a function (or, even better, a class) that can take in any number of shopping points and make a dataframe with that number of shopping points.

4) Create dummy or indicator features for categorical variables.

5) Standardize the magnitude of numeric features using a scaler.

6) Create a class that performs the pre-processing steps and stores the dataframes and scaler as attributes.




In [1]:
from pathlib import Path

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
from pandas_profiling.utils.cache import cache_file
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sb_utils import save_file
from collections import defaultdict
from collections import Counter


In [4]:
# Let's read the wide dataframe created in the data wrangling stage into a dataframe object
df_wide = pd.read_csv("WIP_data/df_horizontal_expand_ver3.csv")

df_wide.head()

Unnamed: 0,customer_ID,shopping_pt,record_type,day,time,state,location,group_size,homeowner,car_age,...,C_previous_13,duration_previous_13,A_13,B_13,C_13,D_13,E_13,F_13,G_13,cost_13
0,10000000,9,1,0,12:07,IN,10001,2,0,2,...,,,,,,,,,,
1,10000005,6,1,3,09:09,NY,10006,1,0,10,...,,,,,,,,,,
2,10000007,8,1,4,14:26,PA,10008,1,0,11,...,,,,,,,,,,
3,10000013,4,1,4,09:31,WV,10014,2,1,3,...,,,,,,,,,,
4,10000014,6,1,1,17:50,MO,10015,1,0,5,...,,,,,,,,,,


In [59]:
# Split the dataframe into training and test sets, with the test set being 20% of the observations.
# We stratify on the shopping point column so that both sets contain a similar proportion of every
# shopping point. Stratifying is important here because some shopping points, e.g. 12 & 13, have very few data points.

df_train, df_test = train_test_split(df_wide, test_size = 0.2, random_state = 123, stratify = df_wide['shopping_pt'])


In [60]:
df_train.shape, df_test.shape

((77607, 259), (19402, 259))

In [61]:
# Let's check proportion of different shopping pts in train dataset
count_vals_tr = pd.DataFrame(df_train['shopping_pt'].value_counts().sort_values(ascending=False)).reset_index()
count_vals_tr.columns = ['shopping_pt', '#']
count_vals_tr['%'] = np.round(count_vals_tr['#'] * 100 / np.sum(count_vals_tr['#']),2)
count_vals_tr

Unnamed: 0,shopping_pt,#,%
0,7,14872,19.16
1,8,13798,17.78
2,6,12498,16.1
3,9,9588,12.35
4,5,9015,11.62
5,4,6401,8.25
6,10,4857,6.26
7,3,4455,5.74
8,11,1703,2.19
9,12,380,0.49


In [63]:
# Let's check proportion of different shopping pts in test dataset
count_vals_te = pd.DataFrame(df_test['shopping_pt'].value_counts().sort_values(ascending=False)).reset_index()
count_vals_te.columns = ['shopping_pt', '#']
count_vals_te['%'] = np.round(count_vals_te['#'] * 100 / np.sum(count_vals_te['#']),2)
count_vals_te


Unnamed: 0,shopping_pt,#,%
0,7,3718,19.16
1,8,3450,17.78
2,6,3125,16.11
3,9,2397,12.35
4,5,2254,11.62
5,4,1600,8.25
6,10,1214,6.26
7,3,1113,5.74
8,11,426,2.2
9,12,95,0.49


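The proportions of the different shopping points are nearly identical in the training and test datasets (the percentages differ by at most 0.01), as expected from the stratified split.
Now let's save both and use only the training dataset in the next stage.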
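
To illustrate why `stratify` matters for the rare shopping points, here is a minimal sketch on synthetic data. The toy frame and its class sizes are made up for illustration and are not taken from the project data; only the `train_test_split` arguments mirror the call used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_wide: 1,000 customers with an imbalanced 'shopping_pt'
toy = pd.DataFrame({
    "customer_ID": np.arange(1000),
    "shopping_pt": [3] * 600 + [7] * 350 + [12] * 50,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["shopping_pt"]
)

# Share of each shopping point in the two splits, in percent
train_pct = toy_train["shopping_pt"].value_counts(normalize=True).sort_index() * 100
test_pct = toy_test["shopping_pt"].value_counts(normalize=True).sort_index() * 100
print(pd.DataFrame({"train_%": train_pct, "test_%": test_pct}))
```

Without `stratify`, a class as small as the toy `shopping_pt == 12` group could end up over- or under-represented in the test set purely by chance.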


In [64]:
datapath = "WIP_data"

save_file(df_train, 'training_data.csv', datapath)
save_file(df_test, 'test_data.csv', datapath)


A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "WIP_data\training_data.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "WIP_data\test_data.csv"


In [2]:
# Now we import only the training data into a dataframe

df_wide = pd.read_csv("WIP_data/training_data.csv")

df_wide.head()

Unnamed: 0,customer_ID,shopping_pt,record_type,day,time,state,location,group_size,homeowner,car_age,...,C_previous_13,duration_previous_13,A_13,B_13,C_13,D_13,E_13,F_13,G_13,cost_13
0,10109793,9,1,4,14:44,CO,13320,1,1,3,...,,,,,,,,,,
1,10002231,8,1,5,13:36,OH,10601,1,0,13,...,,,,,,,,,,
2,10150024,8,1,4,15:33,OH,10081,3,1,20,...,,,,,,,,,,
3,10003949,8,1,4,09:13,FL,10302,2,1,7,...,,,,,,,,,,
4,10103809,10,1,2,11:06,FL,14844,1,1,9,...,,,,,,,,,,


In [3]:
df_wide.shape

(77607, 259)

In [27]:
# The training data has the entire shopping history of each customer.
# We want to predict the final product vectors (the target) from the customer's early shopping history,
# so we define a class that holds the entire shopping history and has a built-in method to extract
# any number of shopping points.

class QuoteHistory:
    #class to contain customer data and extract appropriate quote history
    
    def pass_data(self, dataframe):
        # initialize the shopping_history object with customer data across all shopping points
        self.data = dataframe
        
    def get_history(self, how = 'first2', quote_nos=[]):
        
        customer_data = ['group_size', 'homeowner', 'car_age', 'car_value', 'risk_factor', 'age_oldest', 'age_youngest', \
                 'married_couple', 'C_previous', 'duration_previous']
        product_vectors = ['A', 'B', 'C','D','E','F','G', 'cost']

        def hist_extract(quotes=[1,2]):
            
            df_temp = self.data[self.data['shopping_pt'] > np.max(quotes)]\
                                  [['customer_ID','shopping_pt','state', 'A', 'B', 'C','D','E','F','G', 'cost']]
            
            vector_cols = ['customer_ID']+[x+'_'+str(y) for x in product_vectors for y in quotes]
            customer_data_cols = ['customer_ID']+[x+'_'+str(np.max(quotes)) for x in customer_data]

            df_temp = df_temp.merge(self.data[customer_data_cols], on='customer_ID', how='left', suffixes=["",""])
            df_temp = df_temp.merge(self.data[vector_cols], on='customer_ID', how='left', suffixes=["",""])
            
            return df_temp

  
        if how == 'first2':
                                
            return hist_extract([1,2])
        
        elif how == 'first3':
            return hist_extract([1,2,3])
        
        elif how == 'specific' and quote_nos != []:
            
            return hist_extract(quote_nos)
        
        elif how == 'last2':
                
            # NOTE: assumes quote columns are named '<vector>_<n>' (e.g. 'A_3'), as in hist_extract above
            df_temp = self.data.iloc[:, 0:25].copy()
            
            for index, row in self.data.iterrows():
                quote_second_last = self.data.loc[index, 'shopping_pt'] - 2
                quote_last = self.data.loc[index, 'shopping_pt'] - 1

                # Copy each product vector and the cost from the last two quotes before the purchase
                for feature in product_vectors:
                    df_temp.loc[index, feature+'_2nd_last'] = self.data.loc[index, feature+"_"+str(quote_second_last)]
                    df_temp.loc[index, feature+'_last'] = self.data.loc[index, feature+"_"+str(quote_last)]
            
            return df_temp
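
The core idea of `hist_extract` can be sketched on a toy wide frame. This is a simplified illustration, not the class above: `extract_history` is a hypothetical helper, the toy columns are invented, and it selects columns directly instead of merging, which assumes `customer_ID` is unique per row.

```python
import pandas as pd

# Toy wide frame: one row per customer. 'A' and 'cost' are the purchased values,
# 'A_<n>' / 'cost_<n>' are the values quoted at shopping point n (None if never reached).
wide = pd.DataFrame({
    "customer_ID": [1, 2, 3],
    "shopping_pt": [3, 4, 2],  # shopping point at which the purchase happened
    "A": [1, 0, 2],
    "cost": [650, 600, 700],
    "A_1": [0, 0, 2], "A_2": [1, 1, None], "A_3": [None, 0, None],
    "cost_1": [640, 610, 700], "cost_2": [650, 605, None], "cost_3": [None, 600, None],
})

def extract_history(wide, quotes, vectors=("A", "cost")):
    """Keep customers who purchased after the last requested quote and return
    their targets plus the requested quote columns (hypothetical helper)."""
    base_cols = ["customer_ID", "shopping_pt"] + list(vectors)
    quote_cols = [f"{v}_{q}" for v in vectors for q in quotes]
    keep = wide["shopping_pt"] > max(quotes)
    return wide.loc[keep, base_cols + quote_cols].reset_index(drop=True)

first2 = extract_history(wide, quotes=[1, 2])
print(first2)
```

Customer 3 purchased at shopping point 2, so there is no purchase after quote 2 to predict and the row is filtered out, which is the same effect as the `shopping_pt > np.max(quotes)` condition in the class.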



In [5]:
df_2 = QuoteHistory()
df_2.pass_data(df_wide)

df_hist_first2 = df_2.get_history(how='first2')
df_hist_first2.head()

Unnamed: 0,customer_ID,shopping_pt,state,A,B,C,D,E,F,G,...,D_1,D_2,E_1,E_2,F_1,F_2,G_1,G_2,cost_1,cost_2
0,10109793,9,CO,1,1,3,3,0,2,1,...,3,3,0,0,2,2,1,1,656,656
1,10002231,8,OH,0,0,1,3,0,0,3,...,1,3,0,0,2,0,3,3,598,557
2,10150024,8,OH,1,1,2,3,0,2,3,...,3,3,0,0,3,3,3,3,617,617
3,10003949,8,FL,1,1,2,2,1,2,3,...,2,2,1,1,2,2,4,3,647,675
4,10103809,10,FL,1,1,1,3,1,1,3,...,3,3,1,1,1,1,4,3,637,617


In [7]:
df_hist_first2.columns

Index(['customer_ID', 'shopping_pt', 'state', 'A', 'B', 'C', 'D', 'E', 'F',
       'G', 'cost', 'group_size_2', 'homeowner_2', 'car_age_2', 'car_value_2',
       'risk_factor_2', 'age_oldest_2', 'age_youngest_2', 'married_couple_2',
       'C_previous_2', 'duration_previous_2', 'A_1', 'A_2', 'B_1', 'B_2',
       'C_1', 'C_2', 'D_1', 'D_2', 'E_1', 'E_2', 'F_1', 'F_2', 'G_1', 'G_2',
       'cost_1', 'cost_2'],
      dtype='object')

In [17]:
datapath = "WIP_data"

save_file(df_hist_first2, 'training_data_with_first2_quotes.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "WIP_data\training_data_with_first2_quotes.csv"


In [18]:
df_hist_2_3_4 = df_2.get_history(how='specific', quote_nos=[2,3,4])

#df_last2.columns

#df_2.get_history
df_hist_2_3_4.head()


Unnamed: 0,customer_ID,shopping_pt,state,A,B,C,D,E,F,G,...,E_4,F_2,F_3,F_4,G_2,G_3,G_4,cost_2,cost_3,cost_4
0,10109793,9,CO,1,1,3,3,0,2,1,...,0.0,2,2,2.0,1,1,1.0,656,656,656.0
1,10002231,8,OH,0,0,1,3,0,0,3,...,0.0,0,0,0.0,3,3,3.0,557,564,564.0
2,10150024,8,OH,1,1,2,3,0,2,3,...,0.0,3,3,2.0,3,3,2.0,617,617,624.0
3,10003949,8,FL,1,1,2,2,1,2,3,...,1.0,2,2,2.0,3,3,3.0,675,675,675.0
4,10103809,10,FL,1,1,1,3,1,1,3,...,1.0,1,1,1.0,3,3,3.0,617,617,617.0


In [19]:
df_hist_2_3_4.columns

Index(['customer_ID', 'shopping_pt', 'state', 'A', 'B', 'C', 'D', 'E', 'F',
       'G', 'cost', 'group_size_4', 'homeowner_4', 'car_age_4', 'car_value_4',
       'risk_factor_4', 'age_oldest_4', 'age_youngest_4', 'married_couple_4',
       'C_previous_4', 'duration_previous_4', 'A_2', 'A_3', 'A_4', 'B_2',
       'B_3', 'B_4', 'C_2', 'C_3', 'C_4', 'D_2', 'D_3', 'D_4', 'E_2', 'E_3',
       'E_4', 'F_2', 'F_3', 'F_4', 'G_2', 'G_3', 'G_4', 'cost_2', 'cost_3',
       'cost_4'],
      dtype='object')

In [20]:
datapath = "WIP_data"

save_file(df_hist_2_3_4, 'training_data_with_quotes_2_3_4.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "WIP_data\training_data_with_quotes_2_3_4.csv"


In [21]:
df_train_first2 = pd.read_csv('WIP_data/training_data_with_first2_quotes.csv')
df_train_first2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77607 entries, 0 to 77606
Data columns (total 37 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   customer_ID          77607 non-null  int64  
 1   shopping_pt          77607 non-null  int64  
 2   state                77607 non-null  object 
 3   A                    77607 non-null  int64  
 4   B                    77607 non-null  int64  
 5   C                    77607 non-null  int64  
 6   D                    77607 non-null  int64  
 7   E                    77607 non-null  int64  
 8   F                    77607 non-null  int64  
 9   G                    77607 non-null  int64  
 10  cost                 77607 non-null  int64  
 11  group_size_2         77607 non-null  int64  
 12  homeowner_2          77607 non-null  int64  
 13  car_age_2            77607 non-null  int64  
 14  car_value_2          77607 non-null  int64  
 15  risk_factor_2        77607 non-null 

In [22]:
df_train_first2.head()

Unnamed: 0,customer_ID,shopping_pt,state,A,B,C,D,E,F,G,...,D_1,D_2,E_1,E_2,F_1,F_2,G_1,G_2,cost_1,cost_2
0,10109793,9,CO,1,1,3,3,0,2,1,...,3,3,0,0,2,2,1,1,656,656
1,10002231,8,OH,0,0,1,3,0,0,3,...,1,3,0,0,2,0,3,3,598,557
2,10150024,8,OH,1,1,2,3,0,2,3,...,3,3,0,0,3,3,3,3,617,617
3,10003949,8,FL,1,1,2,2,1,2,3,...,2,2,1,1,2,2,4,3,647,675
4,10103809,10,FL,1,1,1,3,1,1,3,...,3,3,1,1,1,1,4,3,637,617


We see that all the product vector columns are stored as integers. These features are categorical, because the customer chooses among discrete options for each product vector, so they must be converted to a categorical type, i.e. object.
Similarly, some customer info features take discrete values but are stored as integers. These are:

 - group_size_2: Takes values (1, 2, 3, 4)

 - homeowner_2: Takes values (0, 1)

 - car_value_2: Takes values (1 to 9)

 - risk_factor_2: Takes values (0, 1, 2, 3, 4)

 - married_couple_2: Takes values (0, 1)

 - C_previous_2: Takes values (1, 2, 3, 4)

The above features, along with the product vectors, need to be converted to categorical.

The features that take continuous values are cost, car_age_2, age_oldest_2, age_youngest_2, and duration_previous_2. Only duration_previous_2 is stored as float; the rest are stored as integers, so we shall convert them to float as well.

While 'shopping_pt' takes values 1 to 13, we shall keep it as a numerical variable because it has a 'distance' quality: customers with larger shopping points travel more 'distance' in their purchase process.
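
The conversions described above can be sketched on a small made-up frame. The column names follow the notebook, but the values and the reduced column set are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "shopping_pt": [9, 8, 10],
    "group_size_2": [1, 3, 2],
    "homeowner_2": [1, 0, 1],
    "car_age_2": [3, 13, 9],
    "duration_previous_2": [2.0, 9.0, 3.0],
    "A_1": [1, 0, 1],
})

num_features = ["shopping_pt", "car_age_2", "duration_previous_2"]
catg_features = [c for c in toy.columns if c not in num_features]

# Discrete features become 'object', continuous ones become 'float'
dtype_map = {c: "object" for c in catg_features}
dtype_map.update({c: "float" for c in num_features})
toy = toy.astype(dtype_map)
print(toy.dtypes)

# get_dummies only expands the object columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Casting the discrete columns to `object` is what makes `pd.get_dummies` treat them as categories instead of leaving them as plain integers.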

In [8]:
# To pre-process our dataset, we define a class. This class will hold the dataset and perform all the operations necessary
# to make the dataset ready for modelling stage.

class PreProcess:
    # Class to hold a quote-history dataset and make it ready for modelling
    
    def dataset(self, dataframe):
        # Store the quote-history dataframe produced by QuoteHistory.get_history
        self.data = dataframe
        
    def transform(self, quote_nos=[1,2], build_scaler = True, scaler = StandardScaler()):
        
        # To transform the dataframe and make it ready for modelling, we need to:
        # 1) re-encode vectors C and G as 0 to 3 and vector D as (0,1,2), as XGBoost doesn't accept class labels starting at 1
        # 2) add new columns: a) no. of vector changes made at each quote b) cost difference between consecutive quotes
        # 3) dummy encode the categorical features
        # 4) scale the numeric features
        
        # Let's define a function to count changes made by each customer for given dataframe, vectors and shopping points
        # returns a dataframe with customer_IDs and corresponding changes at each step

        def changes_by_shopping_pt(dataframe, product_vectors, shp_pt):
    
            cust_tracker = dataframe[['customer_ID']]

            for idx, quote in enumerate(shp_pt):
                if idx >= 1:
        
                    cols_to_select = ['customer_ID']
                    for vector in product_vectors:
                        cols_to_select.append(vector+"_"+str(shp_pt[idx-1]))
                        cols_to_select.append(vector+"_"+str(shp_pt[idx]))
        
                    cols_to_compare = dataframe[dataframe['shopping_pt'] >= quote][cols_to_select]
                    cols_to_sum = []

                    for vector in product_vectors:
    
                        # count how many customers changed vector from previous quote to current quote
                        cols_to_compare[str(quote)+vector] = np.where(cols_to_compare[vector+"_"+str(shp_pt[idx])] \
                                                       != cols_to_compare[vector+"_"+str(shp_pt[idx-1])], 1, 0)
        
                        cols_to_sum.append(str(quote)+vector)
        
                    cols_to_compare['changes_step_'+str(quote)] = cols_to_compare[cols_to_sum].sum(axis=1)
    
                    cust_tracker = cust_tracker.merge(cols_to_compare[['customer_ID', 'changes_step_'+str(quote)]], on = 'customer_ID', how='left')
        
            return cust_tracker

        def change_vector_encoding(quote_nos):
            change_vectors = ['C', 'D', 'G']
            new_encoding = {1:0, 2:1, 3:2, 4:3}
            for vect in change_vectors:
                self.data[vect] = self.data[vect].map(lambda x: new_encoding[x])
                for quote in quote_nos:
                    self.data[vect+'_'+str(quote)] = self.data[vect+'_'+str(quote)].map(lambda x: new_encoding[x])
            
            return self.data
            
        self.data = change_vector_encoding(quote_nos)
        
        self.data = self.data.merge(
            changes_by_shopping_pt(self.data, ['A', 'B', 'C','D','E','F','G'], quote_nos), on='customer_ID', how='left')
        
        for idx, quote in enumerate(quote_nos):
            if idx >= 1:
                self.data['cost_diff_step_'+str(quote)] = self.data['cost_'+str(quote_nos[idx])] - self.data['cost_'+str(quote_nos[idx-1])]
        
        num_features = ['shopping_pt'] + ['cost_'+str(quote) for quote in quote_nos] \
                        +['car_age_'+str(quote_nos[-1])]+ ['age_oldest_'+str(quote_nos[-1])] \
                        + ['age_youngest_'+str(quote_nos[-1])] + ['duration_previous_'+str(quote_nos[-1])] \
                        + ['cost_diff_step_'+str(quote) for idx, quote in enumerate(quote_nos) if idx >= 1] \
                        + ['changes_step_'+str(quote) for quote in quote_nos[1:]]
        
        catg_features = [x for x in self.data.columns if x not in num_features]

        self.data[catg_features] = self.data[catg_features].astype('object')
        self.data[num_features] = self.data[num_features].astype('float')
        
        # From self.data, we drop the customer_ID and target (final product vector and cost) columns to create 
        # our features dataset X. y holds the seven target vectors; the final policy cost is not part of y.
        
        X = self.data.drop(columns=['customer_ID', 'A','B','C','D','E','F','G','cost'])
        y = self.data[['A','B','C','D','E','F','G']]
        
        # Next, we do dummy encoding for the categorical columns and scale the numeric columns
        
        X = pd.get_dummies(X, drop_first=True)
        
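        # We store the scaler as an attribute of our class: fit a new one on this data,
        # or reuse an already fitted scaler (e.g. the training scaler when transforming test data)
        if build_scaler:
            self.scaler = StandardScaler().fit(X[num_features])
        else:
            self.scaler = scaler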
        
        X[num_features] = self.scaler.transform(X[num_features])
        
        # Return the feature matrix and the targets; the scaler stays available as self.scaler
        
        return X, y
        
        


In [24]:
pp = PreProcess()
pp.dataset(df_train_first2)

X, y = pp.transform(quote_nos=[1,2])

X.shape, y.shape

((77607, 94), (77607, 7))
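
Two of the engineered columns in `X` are `changes_step_<n>` and `cost_diff_step_<n>`. A minimal sketch of how they are derived, using a made-up frame with only two product vectors (the real class uses all seven and works through merges):

```python
import pandas as pd

vectors = ["A", "B"]
quotes = pd.DataFrame({
    "customer_ID": [1, 2, 3],
    "A_1": [0, 1, 2], "A_2": [0, 2, 1],
    "B_1": [1, 1, 0], "B_2": [1, 0, 0],
    "cost_1": [600, 650, 700], "cost_2": [600, 640, 715],
})

prev_cols = [f"{v}_1" for v in vectors]
curr_cols = [f"{v}_2" for v in vectors]

# Compare the two quotes option by option; .to_numpy() avoids alignment on column labels
changed = quotes[curr_cols].to_numpy() != quotes[prev_cols].to_numpy()
quotes["changes_step_2"] = changed.sum(axis=1)

# Cost difference between consecutive quotes
quotes["cost_diff_step_2"] = quotes["cost_2"] - quotes["cost_1"]
print(quotes[["customer_ID", "changes_step_2", "cost_diff_step_2"]])
```

Customer 2 changes both options between the quotes and customer 3 changes only `A`, so their change counts are 2 and 1.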

In [9]:
X.head(7)

Unnamed: 0,shopping_pt,car_age_2,age_oldest_2,age_youngest_2,duration_previous_2,cost_1,cost_2,changes_step_2,cost_diff_step_2,state_AR,...,F_1_3,F_2_1,F_2_2,F_2_3,G_1_1,G_1_2,G_1_3,G_2_1,G_2_2,G_2_3
0,1.072089,-0.901461,-0.00378,0.129103,-0.636138,0.464913,0.467906,-0.871664,-0.043014,0,...,0,0,1,0,0,0,0,0,0,0
1,0.571683,0.824464,-1.209101,-1.073002,0.862673,-0.677346,-1.614463,0.90345,-1.338349,0,...,0,0,0,0,0,1,0,0,1,0
2,0.571683,2.032611,-0.061176,-1.359217,-0.636138,-0.303158,-0.352421,-0.871664,-0.043014,0,...,1,0,0,1,0,1,0,0,1,0
3,0.571683,-0.211091,-0.692535,-0.615057,-0.636138,0.287666,0.867553,-0.279959,0.841605,0,...,0,0,1,0,0,0,1,0,1,0
4,1.572496,0.134094,-1.036912,-0.901272,-0.422023,0.090725,-0.352421,0.311745,-0.674885,0,...,0,1,0,0,0,0,1,0,1,0
5,-1.930351,-0.211091,1.718107,1.846396,1.290905,-1.760522,-0.015876,3.270269,2.800404,0,...,0,1,0,0,0,0,1,1,0,0
6,0.071276,-0.211091,1.086749,1.216722,-0.850254,0.701243,0.594111,-0.279959,-0.232575,0,...,0,0,1,0,0,1,0,0,1,0


In [11]:
y['G'].value_counts()

1    30258
2    24656
0    16106
3     6587
Name: G, dtype: int64
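
The counts above show `G` taking values 0 to 3 after the re-encoding step. A small sketch of that mapping and its inverse follows; the sample values are invented, and the inverse mapping is an addition here for turning predictions back into the original option codes.

```python
import pandas as pd

g_original = pd.Series([2, 1, 4, 3, 2], name="G")

# Same mapping as new_encoding in change_vector_encoding
new_encoding = {1: 0, 2: 1, 3: 2, 4: 3}
g_encoded = g_original.map(new_encoding)

# The inverse mapping recovers the original option codes, e.g. after prediction
decoding = {new: old for old, new in new_encoding.items()}
g_decoded = g_encoded.map(decoding)

print(pd.DataFrame({"original": g_original, "encoded": g_encoded, "decoded": g_decoded}))
```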

In [10]:
y.head()

Unnamed: 0,A,B,C,D,E,F,G
0,1,1,2,2,0,2,0
1,0,0,0,2,0,0,2
2,1,1,1,2,0,2,2
3,1,1,1,1,1,2,2
4,1,1,0,2,1,1,2


In [11]:
pp.scaler

StandardScaler()

In [12]:
datapath2 = "Transformed_data"

save_file(X, 'pre_processed_training_data_with_quotes_1_2.csv', datapath2)
save_file(y, 'training_data_target_columns.csv', datapath2)


A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "Transformed_data\pre_processed_training_data_with_quotes_1_2.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "Transformed_data\training_data_target_columns.csv"


In [14]:
# Let's also save the pre_processor object

save_file(pp, 'pre_processor.pkl', datapath2)


Writing file.  "Transformed_data\pre_processor.pkl"


In [26]:
# Now let's load the test data, get the first two quotes and pre-process it, while using the train data scaler

df_test = pd.read_csv("WIP_data/test_data.csv")

df_test.head()


Unnamed: 0,customer_ID,shopping_pt,record_type,day,time,state,location,group_size,homeowner,car_age,...,C_previous_13,duration_previous_13,A_13,B_13,C_13,D_13,E_13,F_13,G_13,cost_13
0,10129104,5,1,3,09:41,FL,10127,2,1,10,...,,,,,,,,,,
1,10095371,7,1,3,13:54,WV,10209,1,1,3,...,,,,,,,,,,
2,10032679,8,1,3,12:44,MO,10143,2,0,7,...,,,,,,,,,,
3,10025603,5,1,4,13:11,NY,13524,1,1,1,...,,,,,,,,,,
4,10122799,8,1,1,18:29,NY,10558,1,0,1,...,,,,,,,,,,


In [28]:
# Extract 1st two quotes of test data
qh = QuoteHistory()
qh.pass_data(df_test)

df_test_first2 = qh.get_history(how='first2')
df_test_first2.head()


Unnamed: 0,customer_ID,shopping_pt,state,A,B,C,D,E,F,G,...,D_1,D_2,E_1,E_2,F_1,F_2,G_1,G_2,cost_1,cost_2
0,10129104,5,FL,1,1,3,3,0,1,3,...,3,3,0,0,1,1,3,3,606,606
1,10095371,7,WV,2,0,1,1,0,2,1,...,1,2,0,0,0,2,4,1,596,675
2,10032679,8,MO,2,1,3,3,1,2,3,...,3,3,1,1,1,2,3,3,614,617
3,10025603,5,NY,0,1,4,3,1,0,2,...,3,3,1,1,0,0,2,2,599,590
4,10122799,8,NY,2,1,3,3,1,0,2,...,3,3,1,1,0,0,4,2,724,717


In [31]:
df_test_first2.shape

(19402, 37)

In [29]:
datapath = "Test_data"

save_file(df_test_first2, 'test_data_with_first2_quotes.csv', datapath)


Directory Test_data was created.
Writing file.  "Test_data\test_data_with_first2_quotes.csv"


In [30]:
# Load the pre-processor training data object
import pickle

with open('Transformed_data/pre_processor.pkl', 'rb') as f:
    pp_train_first2 = pickle.load(f)

type(pp_train_first2)


__main__.PreProcess
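
The loaded object resolves to `__main__.PreProcess`, so unpickling it only works in a session where the `PreProcess` class is already defined. A lighter alternative, sketched here as an assumption and not as what the notebook does, is to persist only the fitted scaler. The arrays are made-up numbers, and the round trip goes through memory instead of a file.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numeric features: two columns, three training rows, one test row
train_num = np.array([[600.0, 3.0], [650.0, 10.0], [700.0, 5.0]])
test_num = np.array([[650.0, 6.0]])

# Fit on training data only
scaler = StandardScaler().fit(train_num)

# Round-trip through pickle, as would happen between notebooks
restored = pickle.loads(pickle.dumps(scaler))

# The test row is scaled with the training mean and standard deviation
test_scaled = restored.transform(test_num)
print(test_scaled)
```

Because the test row equals the training column means, it scales to zeros, which confirms that the restored scaler carries the training statistics.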

In [32]:
# Pre-process test data, but use the scaler from training data

pp_test = PreProcess()
pp_test.dataset(df_test_first2)

X, y = pp_test.transform(quote_nos=[1,2], build_scaler = False, scaler = pp_train_first2.scaler)

X.shape, y.shape

((19402, 94), (19402, 7))
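
Both feature matrices have 94 columns here, but that is not guaranteed when `pd.get_dummies` is run separately on training and test data: a category missing from one of them produces a different set of dummy columns. One possible safeguard, shown on made-up frames, is to reindex the test matrix onto the training columns.

```python
import pandas as pd

train_raw = pd.DataFrame({"state": ["FL", "NY", "OH"], "cost_2": [617, 675, 557]})
test_raw = pd.DataFrame({"state": ["NY", "FL"], "cost_2": [606, 590]})  # no 'OH' rows

X_train = pd.get_dummies(train_raw, drop_first=True)
X_test = pd.get_dummies(test_raw, drop_first=True)

# Force the test matrix onto the training columns:
# dummies missing from the test data are filled with 0, extra ones are dropped
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test_aligned.columns))
```

Note that `drop_first=True` drops whichever category sorts first in each frame, so this alignment is only safe when the baseline category (here `FL`) is present in both.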

In [33]:
X.head()

Unnamed: 0,shopping_pt,car_age_2,age_oldest_2,age_youngest_2,duration_previous_2,cost_1,cost_2,changes_step_2,cost_diff_step_2,state_AR,...,F_1_3,F_2_1,F_2_2,F_2_3,G_1_1,G_1_2,G_1_3,G_2_1,G_2_2,G_2_3
0,-0.929538,0.306686,1.718107,1.331208,-1.06437,-0.519793,-0.583795,-0.871664,-0.043014,0,...,0,1,0,0,0,1,0,0,1,0
1,0.071276,-0.901461,-1.151705,-1.015758,-1.06437,-0.716734,0.867553,2.08686,2.452875,0,...,0,0,1,0,0,0,1,0,0,0
2,0.571683,-0.211091,0.742371,0.587048,-1.06437,-0.36224,-0.352421,-0.279959,0.051766,0,...,0,0,1,0,0,1,0,0,1,0
3,-0.929538,-1.246646,-0.061176,0.07186,1.933252,-0.657652,-0.92034,-0.871664,-0.327356,0,...,0,0,0,0,1,0,0,1,0,0
4,0.571683,-1.246646,-1.094309,-0.958515,-0.422023,1.804113,1.750982,-0.279959,-0.264169,0,...,0,0,0,0,0,0,1,1,0,0


In [34]:
y.head()

Unnamed: 0,A,B,C,D,E,F,G
0,1,1,2,2,0,1,2
1,2,0,0,0,0,2,0
2,2,1,2,2,1,2,2
3,0,1,3,2,1,0,1
4,2,1,2,2,1,0,1


In [35]:
datapath = "Test_data"

save_file(X, 'pre_processed_test_data_with_quotes_1_2.csv', datapath)
save_file(y, 'test_data_target_columns.csv', datapath)


Writing file.  "Test_data\pre_processed_test_data_with_quotes_1_2.csv"
Writing file.  "Test_data\test_data_target_columns.csv"


In [37]:
# transform() re-encoded C, D and G in place on df_test_first2, so we also save this remapped copy
datapath = "Test_data"

save_file(df_test_first2, 'test_data_first2_quotes_CDG_remapped.csv', datapath)


Writing file.  "Test_data\test_data_first2_quotes_CDG_remapped.csv"


In [13]:
df_train_q2_q3 = pd.read_csv('WIP_data/training_data_with_quotes_2_3.csv')
df_train_q2_q3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73152 entries, 0 to 73151
Data columns (total 37 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   customer_ID          73152 non-null  int64  
 1   shopping_pt          73152 non-null  int64  
 2   state                73152 non-null  object 
 3   A                    73152 non-null  int64  
 4   B                    73152 non-null  int64  
 5   C                    73152 non-null  int64  
 6   D                    73152 non-null  int64  
 7   E                    73152 non-null  int64  
 8   F                    73152 non-null  int64  
 9   G                    73152 non-null  int64  
 10  cost                 73152 non-null  int64  
 11  group_size_3         73152 non-null  int64  
 12  homeowner_3          73152 non-null  int64  
 13  car_age_3            73152 non-null  int64  
 14  car_value_3          73152 non-null  int64  
 15  risk_factor_3        73152 non-null 

In [15]:
pp23 = PreProcess()
pp23.dataset(df_train_q2_q3)

X, y = pp23.transform(quote_nos=[2,3])

X.shape, y.shape

((73152, 94), (73152, 7))

In [14]:
df_train_q2_q3.head()

Unnamed: 0,customer_ID,shopping_pt,state,A,B,C,D,E,F,G,...,D_2,D_3,E_2,E_3,F_2,F_3,G_2,G_3,cost_2,cost_3
0,10109793,9,CO,1,1,3,3,0,2,1,...,3,3,0,0,2,2,1,1,656,656
1,10002231,8,OH,0,0,1,3,0,0,3,...,3,3,0,0,0,0,3,3,557,564
2,10150024,8,OH,1,1,2,3,0,2,3,...,3,3,0,0,3,3,3,3,617,617
3,10003949,8,FL,1,1,2,2,1,2,3,...,2,2,1,1,2,2,3,3,675,675
4,10103809,10,FL,1,1,1,3,1,1,3,...,3,3,1,1,1,1,3,3,617,617


In [16]:
X.head()

Unnamed: 0,shopping_pt,car_age_3,age_oldest_3,age_youngest_3,duration_previous_3,cost_2,cost_3,changes_step_3,cost_diff_step_3,state_AR,...,F_2_3,F_3_1,F_3_2,F_3_3,G_2_1,G_2_2,G_2_3,G_3_1,G_3_2,G_3_3
0,1.054006,-0.897173,-0.000878,0.134046,-0.639141,0.459097,0.443265,-0.531869,-0.072856,0,...,0,0,1,0,0,0,0,0,0,0
1,0.501449,0.826906,-1.207892,-1.069244,0.858809,-1.6223,-1.564636,-0.531869,0.261214,0,...,0,0,0,0,0,1,0,0,1,0
2,0.501449,2.033761,-0.058355,-1.355742,-0.639141,-0.360847,-0.40791,-0.531869,-0.072856,0,...,1,0,0,1,0,1,0,0,1,0
3,0.501449,-0.207541,-0.6906,-0.610848,-0.639141,0.858557,0.857941,-0.531869,-0.072856,0,...,0,0,1,0,0,1,0,0,1,0
4,1.606563,0.137274,-1.035461,-0.897345,-0.425148,-0.360847,-0.40791,-0.531869,-0.072856,0,...,0,1,0,0,0,1,0,0,1,0


In [17]:
y.head()

Unnamed: 0,A,B,C,D,E,F,G
0,1,1,2,2,0,2,0
1,0,0,0,2,0,0,2
2,1,1,1,2,0,2,2
3,1,1,1,1,1,2,2
4,1,1,0,2,1,1,2


In [18]:
datapath2 = "Transformed_data"

save_file(X, 'pre_processed_training_data_with_quotes_2_3.csv', datapath2)
save_file(y, 'training_data_target_columns_quotes_2_3.csv', datapath2)

A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "Transformed_data\pre_processed_training_data_with_quotes_2_3.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)Y
Writing file.  "Transformed_data\training_data_target_columns_quotes_2_3.csv"


In [11]:
# Let's also save the pre_processor object

save_file(pp23, 'pre_processor_quotes_2_3.pkl', datapath2)


Writing file.  "Transformed_data\pre_processor_quotes_2_3.pkl"
