# Lab Customer Analysis Round 2
For this lab, we will use the `marketing_customer_analysis.csv` file, which you can find in the `files_for_lab` folder. If you are using the Online Excel, see `files_for_lab/about.md` for more information.

Note: The next labs use the same data file, so please save your code so that you can reuse it later.

### Dealing with the data
1. Show the dataframe shape.
2. Standardize header names.
3. Which columns are numerical?
4. Which columns are categorical?
5. Check and deal with NaN values.
6. Datetime format: extract the month from the date column and store it in a separate column. Then filter the data to show only the first quarter, i.e. January, February and March. Hint: If the data contains no rows for March, consider only January and February.
BONUS: Put all the previously mentioned data transformations into a function.
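
Step 6 can be sketched on a small toy frame before applying it to the real file. This is a minimal sketch, not the lab solution: the frame `toy` and the new column name `month` are assumptions made for illustration, and the date format `%m/%d/%y` is inferred from values such as `2/18/11` in the `Effective To Date` column.

```python
import pandas as pd

# Toy data standing in for the lab file (hypothetical rows); the dates follow
# the month/day/year pattern seen in 'Effective To Date'.
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4"],
    "effective to date": ["2/18/11", "1/18/11", "3/5/11", "4/1/11"],
})

# Parse the strings with an explicit format so 2/18/11 is read as month/day/year.
toy["effective to date"] = pd.to_datetime(toy["effective to date"], format="%m/%d/%y")

# Store the month number (1-12) in its own column.
toy["month"] = toy["effective to date"].dt.month

# Keep only first-quarter rows: January, February and March.
first_quarter = toy[toy["month"].isin([1, 2, 3])]
print(first_quarter)
```

Using `isin([1, 2, 3])` also covers the hint: if the data has no March rows, the filter simply returns January and February.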

In [1]:
# import the necessary libraries
import pandas as pd
import numpy as np

# read the csv file into a pandas dataframe
customers = pd.read_csv('C:/Users/Ish/Documents/Ironhack Bootcamp/Day 2/lab-customer-analysis-round-2/files_for_lab/csv_files/marketing_customer_analysis.csv', index_col = [0])

# view the data frame
print(customers.head())

# view the shape of the dataframe
print(customers.shape)

  Customer       State  Customer Lifetime Value Response  Coverage Education  \
0  DK49336     Arizona              4809.216960       No     Basic   College   
1  KX64629  California              2228.525238       No     Basic   College   
2  LZ68649  Washington             14947.917300       No     Basic  Bachelor   
3  XL78013      Oregon             22332.439460      Yes  Extended   College   
4  QA50777      Oregon              9025.067525       No   Premium  Bachelor   

  Effective To Date EmploymentStatus Gender  Income  ...  \
0           2/18/11         Employed      M   48029  ...   
1           1/18/11       Unemployed      F       0  ...   
2           2/10/11         Employed      M   22139  ...   
3           1/11/11         Employed      M   49078  ...   
4           1/17/11    Medical Leave      F   23675  ...   

  Number of Open Complaints Number of Policies     Policy Type        Policy  \
0                       0.0                  9  Corporate Auto  Corporate L3  

In [2]:
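# view the column headers and standardize them to lower case
cols = [col.lower() for col in customers.columns]
print(cols)

# split 'employmentstatus' into two words; look the position up by name
# because a hard-coded index of 8 would overwrite 'gender' instead
cols[cols.index('employmentstatus')] = 'employment status'

# update the column headers
customers.columns = cols
customers.head(3)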


['customer', 'state', 'customer lifetime value', 'response', 'coverage', 'education', 'effective to date', 'employmentstatus', 'gender', 'income', 'location code', 'marital status', 'monthly premium auto', 'months since last claim', 'months since policy inception', 'number of open complaints', 'number of policies', 'policy type', 'policy', 'renew offer type', 'sales channel', 'total claim amount', 'vehicle class', 'vehicle size', 'vehicle type']


Unnamed: 0,customer,state,customer lifetime value,response,coverage,education,effective to date,employment status,gender,income,...,number of open complaints,number of policies,policy type,policy,renew offer type,sales channel,total claim amount,vehicle class,vehicle size,vehicle type
0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,48029,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,0,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,22139,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
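
Header standardization can also be wrapped in a small helper that renames by label instead of by position. The following is a minimal sketch under assumptions: the function name `standardize_headers` and the toy frame are hypothetical, and the only special case handled is the `employmentstatus` header seen above.

```python
import pandas as pd

def standardize_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with stripped, lower-case column names."""
    renamed = df.copy()
    renamed.columns = [col.strip().lower() for col in renamed.columns]
    # Rename by label, not by position, so a different column order
    # cannot cause the wrong column to be overwritten.
    return renamed.rename(columns={"employmentstatus": "employment status"})

# Hypothetical empty frame that only carries headers.
toy = pd.DataFrame(columns=["Customer", "EmploymentStatus", "Gender", " Income "])
print(standardize_headers(toy).columns.tolist())
```

Because `rename` ignores labels that are not present, the helper can be run repeatedly without raising an error.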


In [3]:
# determine which columns are numerical
print(customers.dtypes)


customer                          object
state                             object
customer lifetime value          float64
response                          object
coverage                          object
education                         object
effective to date                 object
employment status                 object
gender                            object
income                             int64
location code                     object
marital status                    object
monthly premium auto               int64
months since last claim          float64
months since policy inception      int64
number of open complaints        float64
number of policies                 int64
policy type                       object
policy                            object
renew offer type                  object
sales channel                     object
total claim amount               float64
vehicle class                     object
vehicle size                      object
vehicle type                      object

In [4]:
# print the columns containing numerical data
# (select_dtypes is the public alternative to the private _get_numeric_data)
print('Numerical Data\n', customers.select_dtypes(include='number').head())

# print the columns containing categorical data
print('Categorical Data\n', customers.select_dtypes(include='object').head())

Numerical Data
    customer lifetime value  income  monthly premium auto  \
0              4809.216960   48029                    61   
1              2228.525238       0                    64   
2             14947.917300   22139                   100   
3             22332.439460   49078                    97   
4              9025.067525   23675                   117   

   months since last claim  months since policy inception  \
0                      7.0                             52   
1                      3.0                             26   
2                     34.0                             31   
3                     10.0                              3   
4                      NaN                             31   

   number of open complaints  number of policies  total claim amount  
0                        0.0                   9          292.800000  
1                        0.0                   1          744.924331  
2                        0.0               
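
To answer steps 3 and 4 with column names rather than printed previews, the two groups can be collected into lists. This is a minimal sketch on a hypothetical toy frame; the variable names `numerical_cols` and `categorical_cols` are assumptions.

```python
import pandas as pd

# Hypothetical three-column frame with one text and two numeric columns.
toy = pd.DataFrame({
    "state": ["Arizona", "Oregon"],
    "income": [48029, 0],
    "total claim amount": [292.8, 744.924331],
})

# Integer and float columns count as numerical.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that a date column stored as text, such as `effective to date`, would land in the categorical list until it is converted to a datetime type.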

In [5]:
# check for NaN values in the dataset
# compute the percentage of missing values per column
print('Data before removing empty rows:\n',customers.shape)
na_percent_df = pd.DataFrame(round(customers.isna().sum()/len(customers),4)*100)
print(na_percent_df)
# 'vehicle type' is missing more than 50% of its values, so the column is dropped
customers = customers.drop(columns=['vehicle type'])


Data before removing empty rows:
 (10910, 25)
                                   0
customer                        0.00
state                           5.78
customer lifetime value         0.00
response                        5.78
coverage                        0.00
education                       0.00
effective to date               0.00
employment status               0.00
gender                          0.00
income                          0.00
location code                   0.00
marital status                  0.00
monthly premium auto            0.00
months since last claim         5.80
months since policy inception   0.00
number of open complaints       5.80
number of policies              0.00
policy type                     0.00
policy                          0.00
renew offer type                0.00
sales channel                   0.00
total claim amount              0.00
vehicle class                   5.70
vehicle size                    5.70
vehicle type                 
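
One possible way to deal with the remaining NaN values is sketched below. It is an assumption, not the approach taken in this notebook: columns above a missing-value threshold are dropped (50%, matching the rule applied to `vehicle type`), numerical gaps are filled with the column median, and categorical gaps with the most frequent value. The function name `handle_missing` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def handle_missing(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, then fill the remaining gaps column by column."""
    # Keep only columns whose share of missing values is within the threshold.
    cleaned = df.loc[:, df.isna().mean() <= max_missing].copy()
    for col in cleaned.columns:
        if cleaned[col].isna().any():
            if pd.api.types.is_numeric_dtype(cleaned[col]):
                # Numerical column: fill with the median.
                cleaned[col] = cleaned[col].fillna(cleaned[col].median())
            else:
                # Categorical column: fill with the most frequent value.
                cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])
    return cleaned

# Hypothetical frame: two columns with 25% missing, one with 75% missing.
toy = pd.DataFrame({
    "state": ["Oregon", None, "Oregon", "Arizona"],
    "months since last claim": [7.0, np.nan, 3.0, 11.0],
    "vehicle type": [None, None, "A", None],
})
result = handle_missing(toy)
print(result)
```

Dropping the affected rows with `dropna` is a simpler alternative when only a small share of rows is involved; which option is appropriate depends on the later analysis.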

In [14]:
# missing values in 'state', 'response', 'months since last claim', 'number of open complaints',
# 'vehicle class' and 'vehicle size' seem to be correlated
# view the rows where 'months since last claim' is missing; this column and
# 'number of open complaints' have the highest percentage of missing values
nan_rows = customers[customers['months since last claim'].isna()]
nan_rows.head(100)



Unnamed: 0,customer,state,customer lifetime value,response,coverage,education,effective to date,employment status,gender,income,...,months since policy inception,number of open complaints,number of policies,policy type,policy,renew offer type,sales channel,total claim amount,vehicle class,vehicle size
4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,23675,...,31,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize
23,NQ71171,California,5107.071054,No,Basic,Bachelor,2/2/11,Employed,M,70174,...,80,,7,Personal Auto,Personal L2,Offer1,Agent,128.900320,Four-Door Car,Medsize
51,FT56968,Arizona,2590.096027,No,Basic,High School or Below,1/3/11,Employed,M,22398,...,76,,1,Personal Auto,Personal L1,Offer1,Agent,321.600000,Four-Door Car,Large
59,EP83939,Arizona,5575.751228,No,Basic,High School or Below,1/26/11,Employed,M,91416,...,39,,5,Personal Auto,Personal L3,Offer2,Call Center,109.904496,Four-Door Car,Medsize
67,KR35099,Washington,7507.455372,Yes,Basic,College,2/6/11,Employed,M,60920,...,61,,2,Personal Auto,Personal L3,Offer2,Agent,231.201886,Two-Door Car,Medsize
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,PZ47901,Arizona,5504.139033,Yes,Basic,Bachelor,2/8/11,Unemployed,F,0,...,45,,5,Corporate Auto,Corporate L3,Offer1,Call Center,350.400000,Four-Door Car,Medsize
1346,RE46032,Nevada,43290.495430,No,Extended,College,1/29/11,Medical Leave,M,23203,...,71,,2,Personal Auto,Personal L3,Offer1,Agent,1158.793110,SUV,Medsize
1356,PL44132,Arizona,3061.799398,No,Extended,College,1/18/11,Employed,M,88362,...,63,,1,Personal Auto,Personal L1,Offer2,Call Center,452.599718,Two-Door Car,Medsize
1388,LO48748,Oregon,7485.356469,No,Extended,Master,1/30/11,Employed,F,26380,...,70,,4,Personal Auto,Personal L1,Offer2,Web,177.369840,Four-Door Car,Large
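
Whether missing values really occur in the same rows can be checked directly instead of by eye. The sketch below uses a hypothetical toy frame and assumed variable names; it compares the missing-value indicators of the columns against each other.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the first two columns are missing in the same rows,
# while 'state' is missing in a different row.
toy = pd.DataFrame({
    "months since last claim": [7.0, np.nan, 3.0, np.nan],
    "number of open complaints": [0.0, np.nan, 1.0, np.nan],
    "state": ["Oregon", "Arizona", None, "Nevada"],
})

# Boolean frame: True where a value is missing.
missing = toy.isna()

# Number of rows in which both columns are missing at the same time.
both = (missing["months since last claim"] & missing["number of open complaints"]).sum()

# Cross-tabulation of the two indicators (missing vs. present).
overlap = pd.crosstab(missing["months since last claim"],
                      missing["number of open complaints"])

# Pairwise correlation between the missing-value indicators of all columns.
corr = missing.astype(int).corr()

print(both)
print(overlap)
print(corr)
```

An indicator correlation close to 1 means two columns are missing in the same rows, in which case dropping those rows removes both gaps at once.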
