# Lab-customer-analysis-round-2

For this lab, we will use the `marketing_customer_analysis.csv` file, which you can find in the `files_for_lab` folder. See `files_for_lab/about.md` for more information if you are using Online Excel.

**Note**: The next labs use the same data file. Please save your code so that you can reuse it in the labs that follow.

In [1]:
import pandas as pd
df = pd.read_csv('files_for_lab/csv_files/marketing_customer_analysis.csv',index_col=0)

In [2]:
df.head()

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,48029,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,0,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,22139,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,XL78013,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,49078,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,23675,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,


## 1. Show the dataframe shape.


In [8]:
df.shape

(10910, 25)

## 2. Standardize header names.


In [9]:
df.columns

Index(['customer', 'state', 'customer_lifetime_value', 'response', 'coverage',
       'education', 'effective_to_date', 'employmentstatus', 'gender',
       'income', 'location_code', 'marital_status', 'monthly_premium_auto',
       'months_since_last_claim', 'months_since_policy_inception',
       'number_of_open_complaints', 'number_of_policies', 'policy_type',
       'policy', 'renew_offer_type', 'sales_channel', 'total_claim_amount',
       'vehicle_class', 'vehicle_size', 'vehicle_type'],
      dtype='object')

In [11]:
df.columns = df.columns.str.replace(' ', '_').str.lower()
df.dtypes

customer                          object
state                             object
customer_lifetime_value          float64
response                          object
coverage                          object
education                         object
effective_to_date                 object
employmentstatus                  object
gender                            object
income                             int64
location_code                     object
marital_status                    object
monthly_premium_auto               int64
months_since_last_claim          float64
months_since_policy_inception      int64
number_of_open_complaints        float64
number_of_policies                 int64
policy_type                       object
policy                            object
renew_offer_type                  object
sales_channel                     object
total_claim_amount               float64
vehicle_class                     object
vehicle_size                      object
vehicle_type                      object
dtype: object

## 3. Which columns are numerical?


In [12]:
numeric = df.select_dtypes(include='number')
numeric.columns

Index(['customer_lifetime_value', 'income', 'monthly_premium_auto',
       'months_since_last_claim', 'months_since_policy_inception',
       'number_of_open_complaints', 'number_of_policies',
       'total_claim_amount'],
      dtype='object')

## 4. Which columns are categorical?


In [13]:
categorical = df.select_dtypes(include = ['object'])
categorical.columns

Index(['customer', 'state', 'response', 'coverage', 'education',
       'effective_to_date', 'employmentstatus', 'gender', 'location_code',
       'marital_status', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'vehicle_class', 'vehicle_size', 'vehicle_type'],
      dtype='object')

## 5. Check and deal with `NaN` values.


In [14]:
df.isna().sum()

customer                            0
state                             631
customer_lifetime_value             0
response                          631
coverage                            0
education                           0
effective_to_date                   0
employmentstatus                    0
gender                              0
income                              0
location_code                       0
marital_status                      0
monthly_premium_auto                0
months_since_last_claim           633
months_since_policy_inception       0
number_of_open_complaints         633
number_of_policies                  0
policy_type                         0
policy                              0
renew_offer_type                    0
sales_channel                       0
total_claim_amount                  0
vehicle_class                     622
vehicle_size                      622
vehicle_type                     5482
dtype: int64

In [16]:
# Cleaning rules:
# state, response: null --> 'Other'
# months_since_last_claim, number_of_open_complaints: null --> column mean
# vehicle_class, vehicle_size, vehicle_type: drop the columns

df['state'] = df['state'].fillna('Other')
df['response'] = df['response'].fillna('Other')

mslc_mean = df['months_since_last_claim'].mean()
df['months_since_last_claim'] = df['months_since_last_claim'].fillna(mslc_mean)
noc_mean = df['number_of_open_complaints'].mean()
df['number_of_open_complaints'] = df['number_of_open_complaints'].fillna(noc_mean)

df.drop(columns=['vehicle_class', 'vehicle_size', 'vehicle_type'], inplace=True)

In [17]:
# Check that no missing values remain after cleaning
df.isna().sum()

customer                         0
state                            0
customer_lifetime_value          0
response                         0
coverage                         0
education                        0
effective_to_date                0
employmentstatus                 0
gender                           0
income                           0
location_code                    0
marital_status                   0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
policy_type                      0
policy                           0
renew_offer_type                 0
sales_channel                    0
total_claim_amount               0
dtype: int64
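
The per-column rules above can also be written as a single mapping passed to `DataFrame.fillna`. The following is a minimal sketch on a small made-up frame; `toy`, `fill_values`, and the values in them are illustrative assumptions, not the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics some of the columns cleaned above.
toy = pd.DataFrame({
    "state": ["Oregon", np.nan, "Nevada"],
    "months_since_last_claim": [10.0, np.nan, 20.0],
    "vehicle_type": [np.nan, "A", np.nan],
})

# One fill value per column: a label for text, the column mean for numbers.
fill_values = {
    "state": "Other",
    "months_since_last_claim": toy["months_since_last_claim"].mean(),
}

# fillna accepts a dict keyed by column name; drop removes the sparse column.
cleaned = toy.fillna(fill_values).drop(columns=["vehicle_type"])
print(cleaned.isna().sum())
```

Keeping the rules in one dictionary makes it easier to see, and later change, which column receives which replacement.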

## 6. Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. _Hint_: If data from March does not exist, consider only January and February.


In [18]:
# Convert effective_to_date from 'object' (string) to datetime

df['effective_to_date'] = pd.to_datetime(df['effective_to_date'], errors='coerce')
df['effective_to_date']

0       2011-02-18
1       2011-01-18
2       2011-02-10
3       2011-01-11
4       2011-01-17
           ...    
10905   2011-01-19
10906   2011-01-06
10907   2011-02-06
10908   2011-02-13
10909   2011-01-08
Name: effective_to_date, Length: 10910, dtype: datetime64[ns]

In [19]:
# Create a new column 'month' that holds only the month number

df['month'] = df['effective_to_date'].dt.month

In [20]:
df.month.value_counts()

1    5818
2    5092
Name: month, dtype: int64

In [21]:
# Filter on the 'month' column to keep the first quarter of the year

filtered = df[df['month'].isin([1, 2, 3])]
filtered

Unnamed: 0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,...,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,month
0,DK49336,Arizona,4809.216960,No,Basic,College,2011-02-18,Employed,M,48029,...,7.000000,52,0.000000,9,Corporate Auto,Corporate L3,Offer3,Agent,292.800000,2
1,KX64629,California,2228.525238,No,Basic,College,2011-01-18,Unemployed,F,0,...,3.000000,26,0.000000,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,1
2,LZ68649,Washington,14947.917300,No,Basic,Bachelor,2011-02-10,Employed,M,22139,...,34.000000,31,0.000000,2,Personal Auto,Personal L3,Offer3,Call Center,480.000000,2
3,XL78013,Oregon,22332.439460,Yes,Extended,College,2011-01-11,Employed,M,49078,...,10.000000,3,0.000000,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,1
4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,2011-01-17,Medical Leave,F,23675,...,15.149071,31,0.384256,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10905,FE99816,Nevada,15563.369440,No,Premium,Bachelor,2011-01-19,Unemployed,F,0,...,15.149071,40,0.384256,7,Personal Auto,Personal L1,Offer3,Web,1214.400000,1
10906,KX53892,Oregon,5259.444853,No,Basic,College,2011-01-06,Employed,F,61146,...,7.000000,68,0.000000,6,Personal Auto,Personal L3,Offer2,Branch,273.018929,1
10907,TL39050,Arizona,23893.304100,No,Extended,Bachelor,2011-02-06,Employed,F,39837,...,11.000000,63,0.000000,2,Corporate Auto,Corporate L3,Offer1,Web,381.306996,2
10908,WA60547,California,11971.977650,No,Premium,College,2011-02-13,Employed,F,64195,...,0.000000,27,4.000000,6,Personal Auto,Personal L1,Offer1,Branch,618.288849,2
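
As an alternative to listing the month numbers, the quarter can be read directly from the datetime column. This is a small sketch on made-up dates; the `toy` frame and the explicit `format` string are assumptions based on the `m/d/yy` values shown earlier.

```python
import pandas as pd

# Hypothetical dates in the same m/d/yy style as effective_to_date.
toy = pd.DataFrame({"effective_to_date": ["1/18/11", "2/10/11", "4/02/11"]})

# An explicit format avoids ambiguous day/month guessing.
toy["effective_to_date"] = pd.to_datetime(
    toy["effective_to_date"], format="%m/%d/%y"
)
toy["month"] = toy["effective_to_date"].dt.month

# dt.quarter equals 1 for January through March.
first_quarter = toy[toy["effective_to_date"].dt.quarter == 1]
print(first_quarter)
```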


## 7. Put all the previously mentioned data transformations into a function.

In [64]:
# Cleaning functions (element-wise helpers, e.g. for Series.apply)
# Missing values in pandas are NaN, and `NaN is None` is False,
# so the checks below use pd.isna() instead of `is None`.

def clean_other(x):
    if pd.isna(x):
        return 'Other'
    else:
        return x

def clean_mean_claims(x):
    if pd.isna(x):
        return df['months_since_last_claim'].mean()
    else:
        return x

def clean_mean_complains(x):
    if pd.isna(x):
        return df['number_of_open_complaints'].mean()
    else:
        return x
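
One possible way to gather the earlier steps (header standardization, `NaN` handling, month extraction, and first-quarter filtering) into a single function is sketched below. The function name `clean_customer_data`, the toy input, and the explicit date format are assumptions made for illustration; treat this as a starting point rather than the lab's reference solution.

```python
import numpy as np
import pandas as pd

def clean_customer_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the lab's cleaning steps to a copy of `raw` (hypothetical helper)."""
    out = raw.copy()
    # Standardize headers: spaces to underscores, lower case.
    out.columns = out.columns.str.replace(" ", "_").str.lower()
    # Text columns: missing --> 'Other'.
    for col in ("state", "response"):
        out[col] = out[col].fillna("Other")
    # Numeric columns: missing --> column mean.
    for col in ("months_since_last_claim", "number_of_open_complaints"):
        out[col] = out[col].fillna(out[col].mean())
    # Drop the sparse vehicle columns; ignore any that are absent.
    out = out.drop(
        columns=["vehicle_class", "vehicle_size", "vehicle_type"],
        errors="ignore",
    )
    # Parse dates, add the month, and keep the first quarter only.
    out["effective_to_date"] = pd.to_datetime(
        out["effective_to_date"], format="%m/%d/%y", errors="coerce"
    )
    out["month"] = out["effective_to_date"].dt.month
    return out[out["month"].isin([1, 2, 3])]

# Tiny made-up input that uses the raw header style.
toy = pd.DataFrame({
    "State": ["Oregon", np.nan],
    "Response": [np.nan, "Yes"],
    "Months Since Last Claim": [np.nan, 8.0],
    "Number of Open Complaints": [0.0, np.nan],
    "Vehicle Type": [np.nan, "A"],
    "Effective To Date": ["1/18/11", "2/10/11"],
})
cleaned = clean_customer_data(toy)
print(cleaned)
```

Working on a copy keeps the original frame untouched, so the function can be re-run safely in the following labs.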

## 8. BONUS 

### 8.1. List Comprehensions

#### 8.1.1 Find the capital letters (and not white space) in the sentence 'The Quick Brown Fox Jumped Over The Lazy Dog'.


In [22]:
string = 'The Quick Brown Fox Jumped Over The Lazy Dog'
capital_letters = [char for char in string if char.isupper()]
print(capital_letters)

['T', 'Q', 'B', 'F', 'J', 'O', 'T', 'L', 'D']


#### 8.1.2. Use a list comprehension to create a list with the same elements rounded to 2 decimal positions.

In [23]:
a = [
    0.84062117, 0.48006452, 0.7876326 , 0.77109654,
    0.44409793, 0.09014516, 0.81835917, 0.87645456,
    0.7066597 , 0.09610873, 0.41247947, 0.57433389,
    0.29960807, 0.42315023, 0.34452557, 0.4751035 ,
    0.17003563, 0.46843998, 0.92796258, 0.69814654,
    0.41290051, 0.19561071, 0.16284783, 0.97016248,
    0.71725408, 0.87702738, 0.31244595, 0.76615487,
    0.20754036, 0.57871812, 0.07214068, 0.40356048,
    0.12149553, 0.53222417, 0.9976855 , 0.12536346,
    0.80930099, 0.50962849, 0.94555126, 0.33364763
]

In [28]:
s = pd.Series(a)
s.round(2)

# List-comprehension alternative that keeps a plain list of floats:
# new_a = [round(num, 2) for num in a]
# new_a

0     0.84
1     0.48
2     0.79
3     0.77
4     0.44
5     0.09
6     0.82
7     0.88
8     0.71
9     0.10
10    0.41
11    0.57
12    0.30
13    0.42
14    0.34
15    0.48
16    0.17
17    0.47
18    0.93
19    0.70
20    0.41
21    0.20
22    0.16
23    0.97
24    0.72
25    0.88
26    0.31
27    0.77
28    0.21
29    0.58
30    0.07
31    0.40
32    0.12
33    0.53
34    1.00
35    0.13
36    0.81
37    0.51
38    0.95
39    0.33
dtype: float64

### 8.2. Lambdas

#### 8.2.1. Using Lambda Expressions in List Comprehensions

In the following challenge, we will combine two lists using a lambda expression in a list comprehension.

To do this, we need to introduce the `zip` function. `zip` returns an iterator of tuples, pairing the elements found at the same position in each input. For example, zipping `[1, 2, 3]` with `['a', 'b', 'c']` yields `(1, 'a')`, `(2, 'b')`, `(3, 'c')`.

In this exercise we will compare the elements at the same index in the two lists. We want to zip the two lists and then use a lambda expression to check whether the `list1` element is greater than the `list2` element.
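
A minimal sketch of the idea follows. The contents of `list1` and `list2` and the name `is_greater` are made up for illustration, since the lab does not specify the input lists here.

```python
# Hypothetical example lists; any two equal-length sequences would do.
list1 = [1, 5, 3, 8]
list2 = [2, 4, 3, 6]

# zip pairs up the elements found at the same index.
pairs = list(zip(list1, list2))
print(pairs)  # [(1, 2), (5, 4), (3, 3), (8, 6)]

# The lambda returns True when the first element is greater than the second.
is_greater = lambda x, y: x > y
comparison = [is_greater(x, y) for x, y in zip(list1, list2)]
print(comparison)  # [False, True, False, True]
```

Note that equal elements, such as the pair `(3, 3)`, compare as `False` because the test is strictly greater-than.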