# Allstate Purchase Predictions
As a customer shops for an insurance policy, he or she will receive a number of quotes with different coverage options before
purchasing a plan. This is represented in this challenge as a series of rows that include a customer ID, information
about the customer, information about the quoted policy, and the cost. Your task is to predict the purchased coverage
options using a limited subset of the total interaction history. If the eventual purchase can be predicted sooner in
the shopping window, the quoting process is shortened and the issuer is less likely to lose the customer's business.

1. customer_ID - A unique identifier for the customer
2. shopping_pt - Unique identifier for the shopping point of a given customer
3. record_type - 0=shopping point, 1=purchase point
4. day - Day of the week (0-6, 0=Monday)
5. time - Time of day (HH:MM)
6. state - State where shopping point occurred
7. location - Location ID where shopping point occurred
8. group_size - How many people will be covered under the policy (1, 2, 3 or 4)
9. homeowner - Whether the customer owns a home or not (0=no, 1=yes)
10. car_age - Age of the customer’s car
11. car_value - How valuable was the customer’s car when new
12. risk_factor - An ordinal assessment of how risky the customer is (1, 2, 3, 4)
13. age_oldest - Age of the oldest person in customer's group
14. age_youngest - Age of the youngest person in customer’s group
15. married_couple - Does the customer group contain a married couple (0=no, 1=yes)
16. C_previous - What the customer formerly had or currently has for product option C (0=nothing, 1, 2, 3, 4)
17. duration_previous - How long (in years) the customer was covered by their previous issuer
18. A,B,C,D,E,F,G - The coverage options
19. cost - Cost of the quoted coverage options
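
To make the row layout concrete, here is a minimal sketch on a made-up history for one hypothetical customer (the values are illustrative and not taken from the data set). It separates the quote rows (`record_type == 0`) from the purchase row (`record_type == 1`) and joins the purchased options A-G into one plan string, which is one possible way to represent the prediction target.

```python
import pandas as pd

# Toy history for one hypothetical customer: two quotes, then a purchase.
toy = pd.DataFrame({
    'customer_ID': [1, 1, 1],
    'shopping_pt': [1, 2, 3],
    'record_type': [0, 0, 1],
    'A': [1, 1, 1], 'B': [0, 1, 1], 'C': [2, 2, 3], 'D': [2, 2, 3],
    'E': [0, 1, 1], 'F': [2, 2, 2], 'G': [1, 2, 2],
    'cost': [630, 641, 655],
})

options = list('ABCDEFG')
quotes = toy[toy.record_type == 0]     # shopping points seen before the purchase
purchase = toy[toy.record_type == 1]   # the row whose options are to be predicted

# Join the seven option values into a single plan string per customer.
target = purchase.set_index('customer_ID')[options].astype(str).agg(''.join, axis=1)
print(target)
```
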



In [159]:
# set up environment
import numpy as np
import pandas as pd
from datetime import datetime as dt

In [13]:
# READ files
# data_dir is used instead of dir to avoid shadowing the built-in dir()
data_dir = 'C:\\Users\\Lenovo\\PycharmProjects\\Kaggle\\Project3_AllState\\'
train = pd.read_csv(data_dir + 'train.csv.zip')
test = pd.read_csv(data_dir + 'test_v2.csv.zip')

## Intro summaries and data exploration
1. common columns
2. shape of both sets
3. null values
4. unique values
5. Data Types

In [21]:
train.shape, test.shape

((665249, 25), (198856, 25))

In [37]:
# COMMON COLUMNS
if list(train.columns) == list(test.columns):
    print('Both data sets have the same columns')

pd.Series(train.columns)

Both data sets have the same columns


0           customer_ID
1           shopping_pt
2           record_type
3                   day
4                  time
5                 state
6              location
7            group_size
8             homeowner
9               car_age
10            car_value
11          risk_factor
12           age_oldest
13         age_youngest
14       married_couple
15           C_previous
16    duration_previous
17                    A
18                    B
19                    C
20                    D
21                    E
22                    F
23                    G
24                 cost
dtype: object

In [130]:
# % OF NULL VALUES
setA = (train.isnull().sum() / train.shape[0]) * 100
setB = (test.isnull().sum() / test.shape[0]) * 100
pd.concat([setA, setB], join='outer', axis=1, keys=('Train', 'Test')).query('Train !=0 | Test !=0')

| | Train | Test |
|---|---|---|
| location | 0.0 | 0.34095 |
| car_value | 0.230139 | 0.371626 |
| risk_factor | 36.139551 | 37.960635 |
| C_previous | 2.812631 | 4.9126 |
| duration_previous | 2.812631 | 4.9126 |
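
`C_previous` and `duration_previous` show identical missing percentages in both sets, which suggests (but does not prove) that the two fields are missing on the same rows. The sketch below shows one way to check this; the toy frame and the helper name `missing_together` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first two columns are null together, risk_factor is not.
toy = pd.DataFrame({
    'C_previous': [1.0, np.nan, 3.0, np.nan],
    'duration_previous': [2.0, np.nan, 5.0, np.nan],
    'risk_factor': [3.0, 4.0, np.nan, np.nan],
})

def missing_together(df, col_a, col_b):
    """Return True if col_a and col_b are null on exactly the same rows."""
    return bool((df[col_a].isnull() == df[col_b].isnull()).all())

print(missing_together(toy, 'C_previous', 'duration_previous'))
print(missing_together(toy, 'C_previous', 'risk_factor'))
```

On the real data the same call would be `missing_together(train, 'C_previous', 'duration_previous')`.
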


In [40]:
# NUMBER OF UNIQUE VALUES IN EACH COLUMN (NaN is not counted)
setA = train.nunique()
setB = test.nunique()
pd.concat([setA, setB], axis=1, keys=('Train', 'Test'))

| | Train | Test |
|---|---|---|
| customer_ID | 97009 | 55716 |
| shopping_pt | 13 | 11 |
| record_type | 2 | 1 |
| day | 7 | 7 |
| time | 1204 | 1045 |
| state | 36 | 36 |
| location | 6248 | 6029 |
| group_size | 4 | 4 |
| homeowner | 2 | 2 |
| car_age | 67 | 57 |


In [129]:
# UNIQUE VALUES IN EACH COLUMN WITH FEWER THAN 20 DISTINCT VALUES
column_list = []
def col_to_list(x):
    if x.value_counts().count() < 20:
        column_list.append(x.name)
train.apply(col_to_list)

unique_values = pd.DataFrame({'Column': [None], 'Values': [None]}).iloc[:,[1,0]]

for col in column_list:
    temp_df = pd.DataFrame({'Column': col, 'Values': [train.loc[:, col].unique()]})
    unique_values = pd.concat([unique_values, temp_df], axis=0)
unique_values

| | Column | Values |
|---|---|---|
| 0 | | |
| 0 | shopping_pt | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] |
| 0 | record_type | [0, 1] |
| 0 | day | [0, 3, 4, 2, 1, 5, 6] |
| 0 | group_size | [2, 1, 3, 4] |
| 0 | homeowner | [0, 1] |
| 0 | car_value | [g, e, c, d, f, nan, h, i, b, a] |
| 0 | risk_factor | [3.0, 4.0, nan, 2.0, 1.0] |
| 0 | married_couple | [1, 0] |
| 0 | C_previous | [1.0, 3.0, 2.0, 4.0, nan] |


In [45]:
# DATA TYPES
pd.concat([train.dtypes, test.dtypes], axis=1, keys=['Train', 'Test'])

| | Train | Test |
|---|---|---|
| customer_ID | int64 | int64 |
| shopping_pt | int64 | int64 |
| record_type | int64 | int64 |
| day | int64 | int64 |
| time | object | object |
| state | object | object |
| location | int64 | float64 |
| group_size | int64 | int64 |
| homeowner | int64 | int64 |
| car_age | int64 | int64 |


## Data Manipulation and Cleaning
1. Join both sets for easier manipulation
2. Convert car_value to an integer
3. Convert time to a datetime.time object
4. Convert day to a day-of-week format
5. Try to store a few columns as ordered scales
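
The "ordered scale" idea in the list above could be sketched with a pandas ordered categorical. This is only one possible approach, shown on made-up values, and it assumes that the `car_value` letters rank alphabetically from `a` (lowest) to `i` (highest).

```python
import pandas as pd

# Made-up sample of car_value letters, including a missing value.
car_value = pd.Series(['g', 'e', None, 'a', 'i'])

# Declare the letters as an ordered scale, assuming 'a' < 'b' < ... < 'i'.
scale = pd.CategoricalDtype(categories=list('abcdefghi'), ordered=True)
car_value_ordered = car_value.astype(scale)

# Ordered categoricals support comparisons and expose integer codes (-1 = missing).
above_e = car_value_ordered > 'e'
codes = car_value_ordered.cat.codes
print(above_e.tolist())
print(codes.tolist())
```

An ordered categorical keeps the original labels readable while still allowing comparisons and sorting, which a plain integer column would not.
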

In [200]:
# JOIN BOTH SETS FOR EASIER MANIPULATION
train['Set_No'] = 0
test['Set_No'] = 1
total = pd.concat([train, test], ignore_index=True)
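
Once the shared cleaning is done, the `Set_No` flag allows the combined frame to be split back into its two parts. A minimal sketch on toy frames (the `_toy` and `_back` names and all values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the real train and test frames.
train_toy = pd.DataFrame({'customer_ID': [1, 2], 'cost': [600, 650]})
test_toy = pd.DataFrame({'customer_ID': [3], 'cost': [700]})

# Same tagging scheme as the cell above: 0 marks training rows, 1 marks test rows.
train_toy['Set_No'] = 0
test_toy['Set_No'] = 1
total_toy = pd.concat([train_toy, test_toy], ignore_index=True)

# ... shared cleaning steps would run on total_toy here ...

# Recover the two sets and drop the helper column.
train_back = total_toy[total_toy.Set_No == 0].drop(columns='Set_No')
test_back = (total_toy[total_toy.Set_No == 1]
             .drop(columns='Set_No')
             .reset_index(drop=True))
```
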

In [136]:
# CONVERT CAR VALUE TO INTEGER (a=1, b=2, ..., i=9)
# 'i' appears in the data, so it is mapped too; NaN and any unmapped value stay NaN
car_value_map = {letter: rank for rank, letter in enumerate('abcdefghi', start=1)}
total.car_value = total.car_value.map(car_value_map)

In [201]:
# CONVERT TIME INTO DATETIME.TIME OBJECT
total.time = total.time.apply(lambda x: dt.strptime(x, '%H:%M').time())
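
`datetime.time` values are stored as generic Python objects, so a numeric representation may be easier to use later on. The sketch below is an alternative to the cell above rather than the approach used in this notebook: it parses made-up `HH:MM` strings with `pd.to_datetime` and derives the hour and the minutes since midnight.

```python
import pandas as pd

times = pd.Series(['08:35', '11:57', '16:11'])  # made-up HH:MM strings

# Parse once with an explicit format, then derive numeric features.
parsed = pd.to_datetime(times, format='%H:%M')
hour = parsed.dt.hour
minutes_since_midnight = parsed.dt.hour * 60 + parsed.dt.minute

# The same datetime.time objects as the strptime approach, if they are still needed.
as_time = parsed.dt.time
```
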

In [208]:
# CONVERT day into DAY format
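
This cell was left empty. One possible conversion, sketched on made-up day codes, maps the integers to weekday names with the standard library's `calendar.day_name`, which also starts at Monday = 0 and so matches the encoding in the field list. The names `day_label` and `day_cat` are illustrative.

```python
import calendar
import pandas as pd

day = pd.Series([0, 3, 4, 6])  # made-up day codes, 0 = Monday

# calendar.day_name is indexed with Monday at position 0, matching the encoding.
day_names = list(calendar.day_name)
day_label = day.map(lambda d: day_names[d])

# Optionally keep the weekday order by using an ordered categorical.
day_cat = pd.Categorical(day_label, categories=day_names, ordered=True)
```
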

In [210]:
# Fill nulls in Location
total[total.location.isnull()].head(10).transpose()

MemoryError: 