# Activities List

Here are some of the tasks you need to perform:

### Activity 1

- Aggregate the data into one DataFrame using pandas.
- Standardizing header names
- Deleting and rearranging columns – delete the column customer, as it is only a unique identifier for each row of data
- Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of complaints)
- Filtering data and correcting typos – Filter the data in the state and gender columns to standardize the text in those columns
- Removing duplicates
- Replacing null values – Replace missing values with the mean of the column (for numerical columns)

### Activity 2

- Bucketing the data - Write a function that maps the values of the column "State" to zones: California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central
- Standardizing the data – Use string functions to standardize the text data (lower case)

### Activity 3

- Which columns are numerical?
- Which columns are categorical?
- Check and deal with NaN values. (Hint: Replacing null values – Replace missing values with the mean of the column (for numerical columns)).
- Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.
- BONUS: Put all the previously mentioned data transformations into a function/functions.
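
The month extraction and first-quarter filter could be sketched as follows. This is a minimal sketch, not the notebook's own solution: the column name `effective_to_date` and the three sample dates are assumptions, chosen to match the M/D/YY format of the "Effective To Date" column.

```python
import pandas as pd

# Hypothetical sample in the same M/D/YY format as "Effective To Date".
df = pd.DataFrame({"effective_to_date": ["2/18/11", "1/18/11", "4/2/11"]})

# Parse with an explicit format so month and day cannot be swapped.
df["effective_to_date"] = pd.to_datetime(df["effective_to_date"], format="%m/%d/%y")

# Store the month number in its own column.
df["month"] = df["effective_to_date"].dt.month

# Keep only the first quarter (January to March).
first_quarter = df[df["month"].isin([1, 2, 3])]
```

If no March rows exist, the same filter simply returns January and February.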

### Activity 4

- Show a plot of the total number of responses.
- Show a plot of the response rate by the sales channel.
- Show a plot of the response rate by the total claim amount.
- Show a plot of the response rate by income.
- Don't limit your creativity! Plot any interesting findings or insights that describe your data set and its variables. Use whichever plot type fits when you feel it is needed.
- Plot the Correlation Heatmap.
- Clean your notebook and make it readable and presentable, with good documentation that summarizes the Data Cleaning and Exploration (including plots) steps that you have performed.

### Activity 5

- Check the data types of the columns. Put the numeric columns into a DataFrame called `numerical` and the categorical columns into a DataFrame called `categoricals`.
(You can use `np.number` and `object` to select the numerical and categorical data types respectively; the `np.object` alias was removed in NumPy 1.24.)
- Now we will try to check the normality of the numerical variables visually
  - Use the seaborn library to construct distribution plots for the numerical variables
  - Use Matplotlib to construct histograms
  - Do the distributions of the different numerical variables look like a normal distribution?
- Normalize (numericals)
- For the numerical variables, check the multicollinearity between the features. Please note that we will use the column `total_claim_amount` later as the target variable. 
- Drop one of the two features that show a high correlation between them (greater than 0.9). Write code for both the correlation matrix and for seaborn heatmap. If there is no pair of features that have a high correlation, then do not drop any features

- Bonus: split Data set into train and test sets
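
The correlation check could look roughly like the sketch below. The frame `numerical` here is hypothetical toy data, not the notebook's data; in the real task the target `total_claim_amount` would be left out of the candidates for dropping.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical features; "b" is almost an exact multiple of "a".
numerical = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Absolute pairwise Pearson correlations.
corr = numerical.corr().abs()

# Keep only the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second feature of every pair correlated above the threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = numerical.drop(columns=to_drop)
```

The same `corr` matrix can be passed to seaborn's `heatmap` function for the plot. If no pair exceeds 0.9, `to_drop` is empty and nothing is removed.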

### Activity 6

#### Processing Data

(_Further processing..._)
- X-y split.
- Normalize (numerical). (_done_)
- One Hot/Label Encoding (categorical).
- Concat DataFrames

#### Linear Regression

- Train-test split.
- Apply linear regression.

#### Model Validation

- Description:
  - MSE.
  - RMSE.
  - MAE.
  - R2.
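
For reference, the four metrics can be computed directly from their definitions. This is a minimal NumPy sketch with made-up values for `y_true` and `y_pred`; a library such as scikit-learn offers equivalent functions.

```python
import numpy as np

# Hypothetical true targets and model predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error
mae = np.mean(np.abs(errors))     # mean absolute error

# R2 compares the residual sum of squares with the variance around the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```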

# Start

## Activity 1
- Aggregate the data into one DataFrame using pandas.
- Standardizing header names
- Deleting and rearranging columns – delete the column customer, as it is only a unique identifier for each row of data
- Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of complaints)
- Filtering data and correcting typos – Filter the data in the state and gender columns to standardize the text in those columns
- Removing duplicates
- Replacing null values – Replace missing values with the mean of the column (for numerical columns)


In [47]:
# load libraries

import pandas as pd
import numpy as np

In [48]:
# load file data

data_market = pd.read_csv('Data_Marketing_Customer_Analysis_Round2.csv')
data_file_1 = pd.read_csv('file1.csv')
data_file_2 = pd.read_csv('file2.csv')

In [49]:
# take a look at the given data

# market
data_market.head()


Unnamed: 0.1,Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,3,XL78013,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,


In [50]:
data_file_1.head()

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323


In [51]:
data_file_2.head()

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Total Claim Amount,Policy Type,Vehicle Class
0,GS98873,Arizona,F,Bachelor,323912.47%,16061,88,1/0/00,633.6,Personal Auto,Four-Door Car
1,CW49887,California,F,Master,462680.11%,79487,114,1/0/00,547.2,Special Auto,SUV
2,MY31220,California,F,College,899704.02%,54230,112,1/0/00,537.6,Personal Auto,Two-Door Car
3,UH35128,Oregon,F,College,2580706.30%,71210,214,1/1/00,1027.2,Personal Auto,Luxury Car
4,WH52799,Arizona,F,College,380812.21%,94903,94,1/0/00,451.2,Corporate Auto,Two-Door Car


In [52]:
# Aggregate data into one Data Frame using Pandas.
main_data = pd.concat([data_market, data_file_1, data_file_2])
main_data.head()

Unnamed: 0.1,Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,...,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type,ST,GENDER
0,0.0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,...,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,,,
1,1.0,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,...,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,,,
2,2.0,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,...,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A,,
3,3.0,XL78013,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,...,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A,,
4,4.0,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,...,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,,,


In [53]:
# Standardizing header names

main_data.columns


Index(['Unnamed: 0', 'Customer', 'State', 'Customer Lifetime Value',
       'Response', 'Coverage', 'Education', 'Effective To Date',
       'EmploymentStatus', 'Gender', 'Income', 'Location Code',
       'Marital Status', 'Monthly Premium Auto', 'Months Since Last Claim',
       'Months Since Policy Inception', 'Number of Open Complaints',
       'Number of Policies', 'Policy Type', 'Policy', 'Renew Offer Type',
       'Sales Channel', 'Total Claim Amount', 'Vehicle Class', 'Vehicle Size',
       'Vehicle Type', 'ST', 'GENDER'],
      dtype='object')
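
One possible way to standardize the header names is sketched below. It is an assumption about what "standardizing" should mean here (lower case, underscores instead of spaces, `ST` renamed to `State`); the empty frame `df` only carries a few of the header spellings listed above.

```python
import pandas as pd

# Hypothetical frame with a few of the header spellings seen above.
df = pd.DataFrame(columns=["Customer Lifetime Value", "EmploymentStatus", "ST", "GENDER"])

# Rename the abbreviated header first so that it matches the other file.
df = df.rename(columns={"ST": "State"})

# Lower-case every header and replace spaces with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))
```

Applying the same renaming to each file before `pd.concat` would make `ST`/`State` and `GENDER`/`Gender` land in one column each. Run-together names such as `EmploymentStatus` are only lower-cased by this sketch, not split.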

In [54]:
main_data = pd.concat([data_market, data_file_1, data_file_2])

# merge ST into State and GENDER into Gender:
# where the first column is missing, take the value from the second
# checking with pd.isnull did not seem to work; the identity test against np.nan did the job

main_data['Gender'] = list(map(lambda x, y: y if x is np.nan else x, main_data['Gender'], main_data['GENDER']))
main_data['State'] = list(map(lambda x, y: y if x is np.nan else x, main_data['State'], main_data['ST']))

#main_data
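
As an alternative to the element-wise `map`, the two columns can be combined with `fillna`. The following is a minimal sketch on a hypothetical three-row frame, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same split columns as after pd.concat.
df = pd.DataFrame({
    "State": ["Arizona", np.nan, np.nan],
    "ST": [np.nan, "Nevada", np.nan],
})

# Take "State" where it is present, otherwise fall back to "ST".
df["State"] = df["State"].fillna(df["ST"])
df = df.drop(columns="ST")
```

`fillna` with a Series aligns on the index, so a unique index is the safe setup, for example from `pd.concat(..., ignore_index=True)`.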

In [55]:
# Deleting and rearranging columns – drop the helper columns here; 'Customer' is dropped later, after it has been used to remove duplicates

main_data.drop(columns=['Unnamed: 0','GENDER','ST'], inplace=True)
main_data.head()

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,48029.0,...,0.0,9.0,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,0.0,...,0.0,1.0,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,22139.0,...,0.0,2.0,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,XL78013,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,49078.0,...,0.0,2.0,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,23675.0,...,,7.0,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,



- Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of complaints)


In [56]:
# Customer Lifetime Value and Income
# strip the percent sign and convert to float

# convert both series to strings
main_data['Customer Lifetime Value'] = main_data['Customer Lifetime Value'].astype('string')
main_data['Income'] = main_data['Income'].astype('string')
# cut off the % sign and everything after it
main_data['Customer Lifetime Value'] = list(map(lambda x: float(x.split('%')[0]) if type(x) == str and '%' in x else x, main_data['Customer Lifetime Value']))
main_data['Income'] = list(map(lambda x: x.strip() if type(x) == str else x, main_data['Income']))
# set rows with a missing value in either column to 0
# note: .loc[mask] without a column label assigns 0 to every column of those rows
main_data.loc[main_data['Customer Lifetime Value'].isnull()] = 0
main_data.loc[main_data['Number of Open Complaints'].isnull()] = 0

# reassign data type to float
main_data['Customer Lifetime Value'] = main_data['Customer Lifetime Value'].astype('float')
main_data['Income'] = main_data['Income'].astype('float')

#print(main_data['Number of Open Complaints'].isnull().sum())
#print(main_data['Customer Lifetime Value'].isnull().sum())
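
A column-specific variant of the cleaning step is sketched below on a hypothetical three-row frame. It keeps the other columns of a row intact when one cell is missing, which differs from the row-wide assignment used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mixing the two formats seen in the source files.
df = pd.DataFrame({
    "Customer Lifetime Value": ["697953.59%", 4809.21696, np.nan],
    "Number of Open Complaints": [0.0, np.nan, 1.0],
})

# Remove a trailing percent sign, then convert to float; bad values become NaN.
clv = df["Customer Lifetime Value"].astype(str).str.rstrip("%")
df["Customer Lifetime Value"] = pd.to_numeric(clv, errors="coerce")

# Fill only the named column; the other columns of those rows are untouched.
mask = df["Number of Open Complaints"].isna()
df.loc[mask, "Number of Open Complaints"] = 0
```

Passing the column label as the second argument to `.loc` is what restricts the assignment to a single column.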


- Filtering data and correcting typos – Filter the data in the state and gender columns to standardize the text in those columns


In [57]:
# Check the column names that remain after the drop

main_data.columns


Index(['Customer', 'State', 'Customer Lifetime Value', 'Response', 'Coverage',
       'Education', 'Effective To Date', 'EmploymentStatus', 'Gender',
       'Income', 'Location Code', 'Marital Status', 'Monthly Premium Auto',
       'Months Since Last Claim', 'Months Since Policy Inception',
       'Number of Open Complaints', 'Number of Policies', 'Policy Type',
       'Policy', 'Renew Offer Type', 'Sales Channel', 'Total Claim Amount',
       'Vehicle Class', 'Vehicle Size', 'Vehicle Type'],
      dtype='object')
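
The text standardization itself could be sketched as follows. The spellings `AZ`, `Cali`, `WA`, `female` and `Male` are assumptions for illustration only; the actual variants should be read from `value_counts()` on the real columns.

```python
import pandas as pd

# Hypothetical spellings; the real variants come from value_counts().
df = pd.DataFrame({
    "State": ["Arizona", "AZ", "Cali", "WA"],
    "Gender": ["F", "female", "Male", "M"],
})

# Map abbreviations to full names; values not in the map are left as they are.
state_map = {"AZ": "Arizona", "Cali": "California", "WA": "Washington"}
df["State"] = df["State"].replace(state_map)

# Reduce every gender spelling to its upper-case first letter.
df["Gender"] = df["Gender"].str.strip().str[0].str.upper()
```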


- Removing duplicates


In [58]:

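print(main_data[main_data.duplicated('Customer')]) # show rows whose Customer value repeats an earlier row



# keep the first row for each Customer and drop the later repeats
main_data.drop_duplicates(subset='Customer', inplace=True)


# find and delete the one remaining row that was set to 0
main_data.loc[main_data['Customer'] == 0] # present at index label 4
main_data.drop(4, inplace=True) # drops every row labelled 4; labels can repeat after pd.concat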

#main_data.head()

    Customer       State  Customer Lifetime Value Response Coverage Education  \
23         0           0                     0.00        0        0         0   
51         0           0                     0.00        0        0         0   
59         0           0                     0.00        0        0         0   
67         0           0                     0.00        0        0         0   
84         0           0                     0.00        0        0         0   
..       ...         ...                      ...      ...      ...       ...   
991  HV85198     Arizona                847141.75      NaN      NaN    Master   
992  BS91566     Arizona                543121.91      NaN      NaN   College   
993  IL40123      Nevada                568964.41      NaN      NaN   College   
994  MY32149  California                368672.38      NaN      NaN    Master   
995  SA91515  California                399258.39      NaN      NaN  Bachelor   

    Effective To Date Emplo
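
A variant that avoids hard-coded index labels is sketched below. The two small frames are hypothetical, and the `0` entry stands in for a row that was zeroed out earlier.

```python
import pandas as pd

# Hypothetical files that share one customer; 0 marks a zeroed-out row.
file_1 = pd.DataFrame({"Customer": ["RB50392", "QZ44356"], "Income": [0.0, 48767.0]})
file_2 = pd.DataFrame({"Customer": ["QZ44356", 0], "Income": [48767.0, 0.0]})

# ignore_index=True gives every row a unique label.
combined = pd.concat([file_1, file_2], ignore_index=True)

# Keep the first row per customer.
deduped = combined.drop_duplicates(subset="Customer", keep="first")

# Remove the placeholder row with a boolean mask instead of a fixed label.
deduped = deduped[deduped["Customer"] != 0].reset_index(drop=True)
```

With a unique index and a mask, the result does not depend on which position the unwanted row happens to occupy.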


- Replacing null values – Replace missing values with the mean of the column (for numerical columns)

In [59]:
# columns with numeric data:
# 'Customer Lifetime Value', 'Income', 'Total Claim Amount'
# Missing and non-positive values are replaced by the column mean
# (NaN > 0 is False, so NaN also takes the else branch).

main_data['Customer Lifetime Value'] = list(map(lambda x: x if x > 0 else main_data['Customer Lifetime Value'].mean(), main_data['Customer Lifetime Value']))

main_data['Income'] = list(map(lambda x: x if x > 0 else main_data['Income'].mean(), main_data['Income']))

main_data['Total Claim Amount'] = list(map(lambda x: x if x > 0 else main_data['Total Claim Amount'].mean(), main_data['Total Claim Amount']))

In [60]:
main_data['Income'].describe()

count     9134.000000
mean     47209.837802
std      21723.702506
min      10037.000000
25%      34337.000000
50%      37657.380009
75%      62320.000000
max      99981.000000
Name: Income, dtype: float64
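
Mean imputation restricted to missing values could be sketched with `fillna`, as below. The frame is hypothetical, and unlike the cell above this sketch leaves zeros in place.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in a numerical and a text column.
df = pd.DataFrame({
    "Income": [48029.0, np.nan, 22139.0],
    "State": ["Arizona", None, "Oregon"],
})

# Select the numerical columns only.
numeric_cols = df.select_dtypes(include=np.number).columns

# Fill each numerical column with its own mean; NaN is skipped when averaging.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
```

Whether an income of 0 counts as missing is a modelling decision. In the data above, the unemployed customer with an income of 0.0 is given the mean income by the `x > 0` rule, which this variant would not do.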

In [61]:
main_data.drop(columns='Customer', inplace=True)
main_data.head()

Unnamed: 0,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,Location Code,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,48029.0,Suburban,...,0.0,9.0,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,37657.380009,Suburban,...,0.0,1.0,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,22139.0,Suburban,...,0.0,2.0,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,49078.0,Suburban,...,0.0,2.0,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
5,,4745.181764,,Basic,High School or Below,2/14/11,Employed,M,50549.0,Suburban,...,0.0,7.0,Personal Auto,Personal L3,Offer1,Agent,292.8,Four-Door Car,Medsize,A


## Activity 2

- Bucketing the data - Write a function that maps the values of the column "State" to zones: California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central
- Standardizing the data – Use string functions to standardize the text data (lower case)
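
A minimal sketch of both steps follows. The names `ZONES` and `state_to_zone` and the sample frame are hypothetical; the zone names are the ones given in the task.

```python
import pandas as pd

# Zone per state, as listed in the task.
ZONES = {
    "california": "West Region",
    "oregon": "North West",
    "washington": "East",
    "arizona": "Central",
    "nevada": "Central",
}


def state_to_zone(state):
    """Return the zone for a state name, or the input unchanged if unknown."""
    if not isinstance(state, str):
        return state  # leave NaN and other non-text values untouched
    return ZONES.get(state.strip().lower(), state)


df = pd.DataFrame({"State": ["California", "Oregon", "Washington", "Arizona", "Nevada", None]})

# Bucketing: replace each state by its zone.
df["State"] = df["State"].apply(state_to_zone)

# Standardizing: lower-case the text; missing values stay missing.
df["State"] = df["State"].str.lower()
```

Looking states up in lower case makes the function tolerant of inconsistent capitalization in the input.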