* Explore the dataset - EDA
* Wrangle dataset - Cleaning dataset
* Build a regression model to predict insurance price
* Model interpretation
* Create amazing visualization using plotly, seaborn, tableau
* Deploy project using flask and/or plotly using Heroku

In [1]:
import pandas as pd

In [2]:
insurance = pd.read_csv("https://raw.githubusercontent.com/EvidenceN/Insurance_premium_prediction/master/data/auto_insurance_data.csv")

In [3]:
pd.options.display.max_columns = 999

In [4]:
insurance.head()

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,56274,Suburban,Married,69,32,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,0,Suburban,Single,94,13,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,AI49188,Nevada,12887.43165,No,Premium,Bachelor,2/19/11,Employed,F,48767,Suburban,Married,108,18,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,WW63253,California,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,0,Suburban,Married,106,18,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,43836,Rural,Single,73,12,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize


In [5]:
insurance.shape

(9134, 24)

In [6]:
from pandas_profiling import ProfileReport

In [7]:
profile = ProfileReport(insurance)

variables: 100%|██████████| 24/24 [00:08<00:00,  2.82it/s]
correlations [recoded]: 100%|██████████| 6/6 [00:07<00:00,  1.21s/it]  
interactions [continuous]: 100%|██████████| 64/64 [00:11<00:00,  5.49it/s]
table: 100%|██████████| 1/1 [00:00<00:00,  7.25it/s]
missing [matrix]: 100%|██████████| 2/2 [00:00<00:00,  2.21it/s]
package: 100%|██████████| 1/1 [00:00<00:00, 90.92it/s]
build report structure: 100%|██████████| 1/1 [00:09<00:00,  9.78s/it]


In [8]:
profile



# Things to do and check out -- Data WRANGLING steps to complete
## Regression problem - 2 Models for 2 targets.
**one target = predicting insurance premium** 

**second target = predicting customer lifetime value**
### Find the meaning of column names

* [x] Profile report says no date column - Wrong, convert date column to date. Convert Effective_to_date to dates
* [x] Drop customer column because it doesn't matter
* [x] Encode Coverage column - categorical to integer
* [] Target column for prediction - Customer lifetime value. 
* [] Mean baseline for lifetime value = 8004.940475
* [] Mean baseline for monthly premium = 93.2192905
* [] Change customer lifetime value to 2 significant figures
* [] Change monthly premium to 2 significant figures
* [] Look at correlation between monthly premium and customer lifetime value
* [] Encode education column into categorical
* [x] change effective_to_date from object to date
* [] Change employment status into numbers
* [] Change gender into numbers
* [] Change location_code into integer
* [] Change marital status into integer
* [] Explore relationship between month since last claim and insurance premium
* [] Look at relationship between month since last claim, and premium price
* [] Look at relationship between month since policy inception, and premium price
* [] Look at relationship between number of complaints and premium price
* [] Look at relationship between number of policy and premium price
* [] Type of policy needs to be converted into integers
* [] Drop renew_offer type because we don't know what offer1, offer2, offer3 means in this dataset. 
* [] drop `response` column because we don't know what "Response" means. Response to what? What was the original question they are responding to?
* [] change sales_channel to integer
* [] change state to integers - encoding
* [] look at relationship between state and insurance premium and life time value
* [] explore relationship between total claim amount and insurance premium
* [] encode vehicle class and 
* [] explore relationship between vehicle class and insurance premium and also lifetime value
* [] Explore vehicle class and number of policy. 
* [] Explore vehicle class and gender. 
* [] Explore vehicle class and vehicle size status. 
* [] Explore vehicle size and insurance premium status. 
* [] Explore vehicle size and lifetime value status. 
* [] Encode vehicle size into integer

Marked checkbox unicode  - &#x2611;
Marked checkbox unicode  - &#9745;

In [9]:
insurance.dtypes

Customer                          object
State                             object
Customer Lifetime Value          float64
Response                          object
Coverage                          object
Education                         object
Effective To Date                 object
EmploymentStatus                  object
Gender                            object
Income                             int64
Location Code                     object
Marital Status                    object
Monthly Premium Auto               int64
Months Since Last Claim            int64
Months Since Policy Inception      int64
Number of Open Complaints          int64
Number of Policies                 int64
Policy Type                       object
Policy                            object
Renew Offer Type                  object
Sales Channel                     object
Total Claim Amount               float64
Vehicle Class                     object
Vehicle Size                      object
dtype: object

In [10]:
# Convert Effective_To_Date from categorical to date
# Effective_to_date could mean the day insurance starts or could be interpreted
# as the day insurance ends. As in effective until this date. 

insurance["activation_date"] = pd.to_datetime(insurance["Effective To Date"], infer_datetime_format = True)

In [11]:
insurance.dtypes

Customer                                 object
State                                    object
Customer Lifetime Value                 float64
Response                                 object
Coverage                                 object
Education                                object
Effective To Date                        object
EmploymentStatus                         object
Gender                                   object
Income                                    int64
Location Code                            object
Marital Status                           object
Monthly Premium Auto                      int64
Months Since Last Claim                   int64
Months Since Policy Inception             int64
Number of Open Complaints                 int64
Number of Policies                        int64
Policy Type                              object
Policy                                   object
Renew Offer Type                         object
Sales Channel                           

In [12]:
insurance.head()

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,56274,Suburban,Married,69,32,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize,2011-02-24
1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,0,Suburban,Single,94,13,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize,2011-01-31
2,AI49188,Nevada,12887.43165,No,Premium,Bachelor,2/19/11,Employed,F,48767,Suburban,Married,108,18,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize,2011-02-19
3,WW63253,California,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,0,Suburban,Married,106,18,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize,2011-01-20
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,43836,Rural,Single,73,12,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize,2011-02-03


In [13]:
insurance = insurance.drop(columns = ["Customer", "Effective To Date"])

In [14]:
insurance.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,EmploymentStatus,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
0,Washington,2763.519279,No,Basic,Bachelor,Employed,F,56274,Suburban,Married,69,32,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize,2011-02-24
1,Arizona,6979.535903,No,Extended,Bachelor,Unemployed,F,0,Suburban,Single,94,13,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize,2011-01-31
2,Nevada,12887.43165,No,Premium,Bachelor,Employed,F,48767,Suburban,Married,108,18,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize,2011-02-19
3,California,7645.861827,No,Basic,Bachelor,Unemployed,M,0,Suburban,Married,106,18,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize,2011-01-20
4,Washington,2813.692575,No,Basic,Bachelor,Employed,M,43836,Rural,Single,73,12,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize,2011-02-03


In [15]:
# split data into train, test, validation dataset before proceeding. 
# should have done this before chaging to datetime and dropping columns. 

from sklearn.model_selection import train_test_split

train, test = train_test_split(insurance, train_size = 0.85, test_size=0.15, random_state=42)

# validation dataset

train, val = train_test_split(train, train_size = 0.85, test_size=0.15, random_state=42)

In [16]:
#encode coverage column from categorical to integer

insurance['Coverage'].describe()

count      9134
unique        3
top       Basic
freq       5568
Name: Coverage, dtype: object

In [17]:
insurance['Coverage'].value_counts()

Basic       5568
Extended    2742
Premium      824
Name: Coverage, dtype: int64

In [18]:
#[{‘col’: ‘col1’, ‘mapping’: {None: 0, ‘a’: 1, ‘b’: 2}}] correct mapping structure for ordinal encoding

coverage_dictionary = [{'col': 'Coverage','mapping':{"Basic":1, "Extended":2, "Premium": 3}}]

In [19]:
# use ordinal encoding to do encode coverage column

import category_encoders as ce

coverage_encoder = ce.OrdinalEncoder(cols="Coverage", mapping=coverage_dictionary)

In [20]:
train_encoded = coverage_encoder.fit_transform(train)
test_encoded = coverage_encoder.transform(test)

In [21]:
train_encoded.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,EmploymentStatus,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
5249,Arizona,4786.889347,No,1,College,Employed,M,45515,Urban,Married,61,14,33,0,9,Personal Auto,Personal L3,Offer2,Call Center,236.907007,Two-Door Car,Large,2011-01-02
2077,Arizona,8838.085637,Yes,1,Bachelor,Employed,M,82664,Rural,Married,114,24,10,3,9,Corporate Auto,Corporate L3,Offer2,Agent,133.425609,SUV,Medsize,2011-01-21
6357,Oregon,11638.89947,Yes,1,College,Medical Leave,F,25370,Suburban,Married,102,10,77,0,2,Personal Auto,Personal L3,Offer1,Branch,489.6,Sports Car,Large,2011-01-26
8128,California,4670.953723,No,1,College,Unemployed,F,0,Urban,Divorced,64,25,89,0,4,Corporate Auto,Corporate L2,Offer2,Call Center,181.810486,Four-Door Car,Medsize,2011-02-15
6787,Arizona,2352.3679,No,1,College,Unemployed,F,0,Suburban,Divorced,64,4,61,0,1,Corporate Auto,Corporate L2,Offer1,Branch,381.062306,Four-Door Car,Medsize,2011-01-19


In [22]:
train_encoded['Coverage'].describe()

count    6598.000000
mean        1.475447
std         0.651431
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         3.000000
Name: Coverage, dtype: float64

In [23]:
train['Coverage'].value_counts()

Basic       4038
Extended    1983
Premium      577
Name: Coverage, dtype: int64

In [24]:
train_encoded['Coverage'].value_counts()

1    4038
2    1983
3     577
Name: Coverage, dtype: int64

In [25]:
test['Coverage'].value_counts()

Basic       826
Extended    421
Premium     124
Name: Coverage, dtype: int64

In [26]:
test_encoded['Coverage'].value_counts()

1    826
2    421
3    124
Name: Coverage, dtype: int64

In [27]:
# encode education from categorical value to integers

train["Education"].describe()

count         6598
unique           5
top       Bachelor
freq          1992
Name: Education, dtype: object

In [28]:
train["Education"].value_counts()

Bachelor                1992
College                 1926
High School or Below    1898
Master                   521
Doctor                   261
Name: Education, dtype: int64

In [29]:
# combine college and bachelor into one datatype. Could be that
# college means people that went to college but didn't graduate
# Combine college and bachelor for clarification and easier assessment

train['Education'] = train['Education'].replace({"College":"Bachelor"})

In [30]:
train["Education"].value_counts()

Bachelor                3918
High School or Below    1898
Master                   521
Doctor                   261
Name: Education, dtype: int64

In [31]:
test['Education'] = test['Education'].replace({"College":"Bachelor"});

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [32]:
test["Education"].value_counts()

Bachelor                806
High School or Below    398
Master                  125
Doctor                   42
Name: Education, dtype: int64

In [33]:
# encoding education column from categorical into integers

education_dictionary = [{'col': 'Education','mapping':{"High School or Below":1, 
                                                       "Bachelor":2, "Master": 3,
                                                      "Doctor": 4}}]

education_encoder = ce.OrdinalEncoder(cols="Education", mapping=education_dictionary)


train_encoded['Education'] = train_encoded['Education'].replace({"College":"Bachelor"})
test_encoded['Education'] = test_encoded['Education'].replace({"College":"Bachelor"})

train_encoded = education_encoder.fit_transform(train_encoded)
test_encoded = education_encoder.transform(train_encoded)

In [34]:
train_encoded['Education'].value_counts()

2    3918
1    1898
3     521
4     261
Name: Education, dtype: int64

In [35]:
train_encoded.columns

Index(['State', 'Customer Lifetime Value', 'Response', 'Coverage', 'Education',
       'EmploymentStatus', 'Gender', 'Income', 'Location Code',
       'Marital Status', 'Monthly Premium Auto', 'Months Since Last Claim',
       'Months Since Policy Inception', 'Number of Open Complaints',
       'Number of Policies', 'Policy Type', 'Policy', 'Renew Offer Type',
       'Sales Channel', 'Total Claim Amount', 'Vehicle Class', 'Vehicle Size',
       'activation_date'],
      dtype='object')

In [36]:
# rounding various series in our dataframe. 
# columns to round - lifetime value, monthly premium, total claim amount

train_encoded = train_encoded.round({"Customer Lifetime Value": 2, "Total Claim Amount": 2})
test_encoded = test_encoded.round({"Customer Lifetime Value": 2, "Total Claim Amount": 2})
train = train.round({"Customer Lifetime Value": 2, "Total Claim Amount": 2})
test = test.round({"Customer Lifetime Value": 2, "Total Claim Amount": 2})

In [37]:
train_encoded.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,EmploymentStatus,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
5249,Arizona,4786.89,No,1,2,Employed,M,45515,Urban,Married,61,14,33,0,9,Personal Auto,Personal L3,Offer2,Call Center,236.91,Two-Door Car,Large,2011-01-02
2077,Arizona,8838.09,Yes,1,2,Employed,M,82664,Rural,Married,114,24,10,3,9,Corporate Auto,Corporate L3,Offer2,Agent,133.43,SUV,Medsize,2011-01-21
6357,Oregon,11638.9,Yes,1,2,Medical Leave,F,25370,Suburban,Married,102,10,77,0,2,Personal Auto,Personal L3,Offer1,Branch,489.6,Sports Car,Large,2011-01-26
8128,California,4670.95,No,1,2,Unemployed,F,0,Urban,Divorced,64,25,89,0,4,Corporate Auto,Corporate L2,Offer2,Call Center,181.81,Four-Door Car,Medsize,2011-02-15
6787,Arizona,2352.37,No,1,2,Unemployed,F,0,Suburban,Divorced,64,4,61,0,1,Corporate Auto,Corporate L2,Offer1,Branch,381.06,Four-Door Car,Medsize,2011-01-19


# Things to do and check out -- Data WRANGLING steps to complete
## Regression problem - 2 Models for 2 targets.
**one target = predicting insurance premium** 

**second target = predicting customer lifetime value**
### Find the meaning of column names

* [x] Profile report says no date column - Wrong, convert date column to date. Convert Effective_to_date to dates
* [x] Drop customer column because it doesn't matter
* [x] Encode Coverage column - categorical to integer
* [] Target column for prediction - Customer lifetime value. 
* [] Mean baseline for lifetime value = 8004.940475
* [] Mean baseline for monthly premium = 93.2192905
* [x] Change customer lifetime value to 2 significant figures
* [x] Change total claim amount to 2 significant figures
* [] Look at correlation between monthly premium and customer lifetime value
* [x] Encode education column into categorical
* [x] change effective_to_date from object to date
* [] Change employment status into numbers
* [] Change gender into numbers
* [] Change location_code into integer
* [] Change marital status into integer
* [] Explore relationship between month since last claim and insurance premium
* [] Look at relationship between month since last claim, and premium price
* [] Look at relationship between month since policy inception, and premium price
* [] Look at relationship between number of complaints and premium price
* [] Look at relationship between number of policy and premium price
* [] Type of policy needs to be converted into integers
* [] Drop renew_offer type because we don't know what offer1, offer2, offer3 means in this dataset. 
* [] drop `response` column because we don't know what "Response" means. Response to what? What was the original question they are responding to?
* [] change sales_channel to integer
* [] change state to integers - encoding
* [] look at relationship between state and insurance premium and life time value
* [] explore relationship between total claim amount and insurance premium
* [] encode vehicle class and 
* [] explore relationship between vehicle class and insurance premium and also lifetime value
* [] Explore vehicle class and number of policy. 
* [] Explore vehicle class and gender. 
* [] Explore vehicle class and vehicle size status. 
* [] Explore vehicle size and insurance premium status. 
* [] Explore vehicle size and lifetime value status. 
* [] Encode vehicle size into integer

Week 3 of project. Starting with encoding employment status

What type of encoding. Does order matter in this encoding. 

Start by using one hot encoding and then in the future

**COME BACK AND TRY OUT TARGET ENCODING TO SEE 
EFFECTS ON MODEL**. 

In [38]:
train_encoded["EmploymentStatus"].describe()

count         6598
unique           5
top       Employed
freq          4095
Name: EmploymentStatus, dtype: object

In [39]:
train_encoded["EmploymentStatus"].value_counts()

Employed         4095
Unemployed       1674
Medical Leave     323
Disabled          297
Retired           209
Name: EmploymentStatus, dtype: int64

Train, Test = RAW UNMANIPULATED DATA

**and**

Train_encoded, test_encoded is MANIPULATED DATAFRAME

In [40]:
# Encode employment status column with one hot encoding
# experiment with target encoding later. 
# order doesn't matter in this column situation

employment_encoder = ce.OneHotEncoder(cols = "EmploymentStatus", use_cat_names=True)

train_encoded = employment_encoder.fit_transform(train_encoded)
test_encoded = employment_encoder.transform(test_encoded)

In [41]:
train_encoded.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,EmploymentStatus_Employed,EmploymentStatus_Medical Leave,EmploymentStatus_Unemployed,EmploymentStatus_Disabled,EmploymentStatus_Retired,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
5249,Arizona,4786.89,No,1,2,1,0,0,0,0,M,45515,Urban,Married,61,14,33,0,9,Personal Auto,Personal L3,Offer2,Call Center,236.91,Two-Door Car,Large,2011-01-02
2077,Arizona,8838.09,Yes,1,2,1,0,0,0,0,M,82664,Rural,Married,114,24,10,3,9,Corporate Auto,Corporate L3,Offer2,Agent,133.43,SUV,Medsize,2011-01-21
6357,Oregon,11638.9,Yes,1,2,0,1,0,0,0,F,25370,Suburban,Married,102,10,77,0,2,Personal Auto,Personal L3,Offer1,Branch,489.6,Sports Car,Large,2011-01-26
8128,California,4670.95,No,1,2,0,0,1,0,0,F,0,Urban,Divorced,64,25,89,0,4,Corporate Auto,Corporate L2,Offer2,Call Center,181.81,Four-Door Car,Medsize,2011-02-15
6787,Arizona,2352.37,No,1,2,0,0,1,0,0,F,0,Suburban,Divorced,64,4,61,0,1,Corporate Auto,Corporate L2,Offer1,Branch,381.06,Four-Door Car,Medsize,2011-01-19


In [42]:
test_encoded.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,EmploymentStatus_Employed,EmploymentStatus_Medical Leave,EmploymentStatus_Unemployed,EmploymentStatus_Disabled,EmploymentStatus_Retired,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
5249,Arizona,4786.89,No,1,-1.0,1,0,0,0,0,M,45515,Urban,Married,61,14,33,0,9,Personal Auto,Personal L3,Offer2,Call Center,236.91,Two-Door Car,Large,2011-01-02
2077,Arizona,8838.09,Yes,1,-1.0,1,0,0,0,0,M,82664,Rural,Married,114,24,10,3,9,Corporate Auto,Corporate L3,Offer2,Agent,133.43,SUV,Medsize,2011-01-21
6357,Oregon,11638.9,Yes,1,-1.0,0,1,0,0,0,F,25370,Suburban,Married,102,10,77,0,2,Personal Auto,Personal L3,Offer1,Branch,489.6,Sports Car,Large,2011-01-26
8128,California,4670.95,No,1,-1.0,0,0,1,0,0,F,0,Urban,Divorced,64,25,89,0,4,Corporate Auto,Corporate L2,Offer2,Call Center,181.81,Four-Door Car,Medsize,2011-02-15
6787,Arizona,2352.37,No,1,-1.0,0,0,1,0,0,F,0,Suburban,Divorced,64,4,61,0,1,Corporate Auto,Corporate L2,Offer1,Branch,381.06,Four-Door Car,Medsize,2011-01-19


In [43]:
# changing employement status column names on train dataset

train_encoded = train_encoded.rename(columns = {"EmploymentStatus_Employed": "Employed", "EmploymentStatus_Unemployed": "Unemployed",
                                      "EmploymentStatus_Disabled": "Disabled", "EmploymentStatus_Retired": "Retired",
                                      "EmploymentStatus_Medical Leave": "Medical_Leave"})

In [45]:
train_encoded.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,Employed,Medical_Leave,Unemployed,Disabled,Retired,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
5249,Arizona,4786.89,No,1,2,1,0,0,0,0,M,45515,Urban,Married,61,14,33,0,9,Personal Auto,Personal L3,Offer2,Call Center,236.91,Two-Door Car,Large,2011-01-02
2077,Arizona,8838.09,Yes,1,2,1,0,0,0,0,M,82664,Rural,Married,114,24,10,3,9,Corporate Auto,Corporate L3,Offer2,Agent,133.43,SUV,Medsize,2011-01-21
6357,Oregon,11638.9,Yes,1,2,0,1,0,0,0,F,25370,Suburban,Married,102,10,77,0,2,Personal Auto,Personal L3,Offer1,Branch,489.6,Sports Car,Large,2011-01-26
8128,California,4670.95,No,1,2,0,0,1,0,0,F,0,Urban,Divorced,64,25,89,0,4,Corporate Auto,Corporate L2,Offer2,Call Center,181.81,Four-Door Car,Medsize,2011-02-15
6787,Arizona,2352.37,No,1,2,0,0,1,0,0,F,0,Suburban,Divorced,64,4,61,0,1,Corporate Auto,Corporate L2,Offer1,Branch,381.06,Four-Door Car,Medsize,2011-01-19


In [46]:
# changing employement status column names on test dataset

test_encoded = test_encoded.rename(columns = {"EmploymentStatus_Employed": "Employed", "EmploymentStatus_Unemployed": "Unemployed",
                                      "EmploymentStatus_Disabled": "Disabled", "EmploymentStatus_Retired": "Retired",
                                      "EmploymentStatus_Medical Leave": "Medical_Leave"})

In [48]:
# encode gender to be numerical. 

train['Gender'].value_counts()

F    3377
M    3221
Name: Gender, dtype: int64

In [50]:
# Encode gender column with one hot encoding
# order doesn't matter in this column situation

gender_encoder = ce.OneHotEncoder(cols = "Gender", use_cat_names=True)

train_encoded = gender_encoder.fit_transform(train_encoded)
test_encoded = gender_encoder.transform(test_encoded)

In [51]:
train_encoded.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,Employed,Medical_Leave,Unemployed,Disabled,Retired,Gender_M,Gender_F,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
5249,Arizona,4786.89,No,1,2,1,0,0,0,0,1,0,45515,Urban,Married,61,14,33,0,9,Personal Auto,Personal L3,Offer2,Call Center,236.91,Two-Door Car,Large,2011-01-02
2077,Arizona,8838.09,Yes,1,2,1,0,0,0,0,1,0,82664,Rural,Married,114,24,10,3,9,Corporate Auto,Corporate L3,Offer2,Agent,133.43,SUV,Medsize,2011-01-21
6357,Oregon,11638.9,Yes,1,2,0,1,0,0,0,0,1,25370,Suburban,Married,102,10,77,0,2,Personal Auto,Personal L3,Offer1,Branch,489.6,Sports Car,Large,2011-01-26
8128,California,4670.95,No,1,2,0,0,1,0,0,0,1,0,Urban,Divorced,64,25,89,0,4,Corporate Auto,Corporate L2,Offer2,Call Center,181.81,Four-Door Car,Medsize,2011-02-15
6787,Arizona,2352.37,No,1,2,0,0,1,0,0,0,1,0,Suburban,Divorced,64,4,61,0,1,Corporate Auto,Corporate L2,Offer1,Branch,381.06,Four-Door Car,Medsize,2011-01-19


In [52]:
# change gender column names on train and test dataset. 

train_encoded = train_encoded.rename(columns = {"Gender_M": "Male", "Gender_F": "Female"})
test_encoded = test_encoded.rename(columns = {"Gender_M": "Male", "Gender_F": "Female"})

In [54]:
train_encoded.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,Employed,Medical_Leave,Unemployed,Disabled,Retired,Male,Female,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,activation_date
5249,Arizona,4786.89,No,1,2,1,0,0,0,0,1,0,45515,Urban,Married,61,14,33,0,9,Personal Auto,Personal L3,Offer2,Call Center,236.91,Two-Door Car,Large,2011-01-02
2077,Arizona,8838.09,Yes,1,2,1,0,0,0,0,1,0,82664,Rural,Married,114,24,10,3,9,Corporate Auto,Corporate L3,Offer2,Agent,133.43,SUV,Medsize,2011-01-21
6357,Oregon,11638.9,Yes,1,2,0,1,0,0,0,0,1,25370,Suburban,Married,102,10,77,0,2,Personal Auto,Personal L3,Offer1,Branch,489.6,Sports Car,Large,2011-01-26
8128,California,4670.95,No,1,2,0,0,1,0,0,0,1,0,Urban,Divorced,64,25,89,0,4,Corporate Auto,Corporate L2,Offer2,Call Center,181.81,Four-Door Car,Medsize,2011-02-15
6787,Arizona,2352.37,No,1,2,0,0,1,0,0,0,1,0,Suburban,Divorced,64,4,61,0,1,Corporate Auto,Corporate L2,Offer1,Branch,381.06,Four-Door Car,Medsize,2011-01-19
