# Lab Case Study

### Scenario

You are working as an analyst for an auto insurance company. The company has collected data about its customers, including their demographics, education, employment, policy details, information on the insured vehicle, and claim amounts. You will help senior management answer business questions so that they can better understand their customers, improve their services, and improve profitability.

### Business Objectives

- Retain customers.
- Analyze relevant customer data.
- Develop focused customer retention programs.

Based on the analysis, take targeted actions to increase profitable customer response, retention, and growth.

### Activities

Refer to the `Activities.md` file, where you will find guidelines for some of the activities you will carry out.

### Data

The CSV files are provided in the folder. The columns in the files are self-explanatory.

# Activities List

Here are some of the tasks you need to perform:

### Activities:
- Aggregate data into one Data Frame using Pandas.
- Standardizing header names
- Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data
- Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of open complaints)
- Filtering data and correcting typos – Filter the data in the state and gender columns to standardize the texts in those columns
- Removing duplicates
- Optional: Replacing null values – Replace missing values with the mean of the column (for numerical columns)

- Bucketing the data – Write a function that maps the column "State" to zones: California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central
- Standardizing the data – Use string functions to standardize the text data (lower case)
- Which columns are numerical?
- Which columns are categorical?
- Check and deal with NaN values. (Hint: Replacing null values – Replace missing values with the mean of the column (for numerical columns)).
- Datetime format – Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.
- BONUS: Put all the previously mentioned data transformations into a function/functions.


----

## 1. Aggregate data into one Data Frame using Pandas.

In [120]:

import pandas as pd
df= pd.read_csv('files/file1.csv')
df1=pd.read_csv('files/file2.csv')



In [121]:
#df

In [122]:
set(df.columns) == set(df1.columns)

True

In [123]:
#pd.concat combines both files (DataFrame.append was removed in pandas 2.0)
#ignore_index=True is important, otherwise the original row labels of each file are kept

df2=pd.concat([df,df1], ignore_index=True)
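
The effect of `ignore_index` is easiest to see on a toy example. The sketch below uses two small made-up frames (`part_a` and `part_b` are hypothetical stand-ins for the two CSV files, not the real data):

```python
import pandas as pd

# Two tiny hypothetical frames standing in for file1.csv and file2.csv
part_a = pd.DataFrame({"Customer": ["A1", "A2"], "ST": ["Washington", "Cali"]})
part_b = pd.DataFrame({"Customer": ["B1"], "ST": ["AZ"]})

# ignore_index=True discards the original row labels and builds a fresh 0..n-1 index
combined = pd.concat([part_a, part_b], ignore_index=True)

# Default behaviour: each frame keeps its own labels, so label 0 appears twice
kept = pd.concat([part_a, part_b])

print(combined.index.tolist())
print(kept.index.tolist())
```

Duplicate row labels are legal in pandas, but they make label-based selection ambiguous, which is why a fresh index is usually preferable after stacking files.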

In [124]:
#df2

## 2. Standardizing header names



In [125]:
df2.columns = [i.lower() for i in df2.columns]
df2.columns

Index(['customer', 'st', 'gender', 'education', 'customer lifetime value',
       'income', 'monthly premium auto', 'number of open complaints',
       'policy type', 'vehicle class', 'total claim amount'],
      dtype='object')
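
A possible extension, shown only as a sketch: the notebook lowercases the headers and keeps the spaces. If attribute-style access such as `df.customer_lifetime_value` is wanted, the spaces could also be replaced with underscores. The frame `demo` and the underscore convention are assumptions, not part of the original solution.

```python
import pandas as pd

# Hypothetical empty frame that only carries a few of the original headers
demo = pd.DataFrame(columns=["Customer", "ST", "Customer Lifetime Value"])

# Vectorised string methods on the column Index: trim, lowercase, underscore
demo.columns = demo.columns.str.strip().str.lower().str.replace(" ", "_")

print(demo.columns.tolist())
```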

## 3. Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data

In [126]:
#delete column 'customer'
df2 = df2.drop(columns = 'customer')

In [127]:
#df2

## 4. Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of open complaints)


In [128]:
#Checking the data types of every column
df2.dtypes

st                            object
gender                        object
education                     object
customer lifetime value       object
income                       float64
monthly premium auto         float64
number of open complaints     object
policy type                   object
vehicle class                 object
total claim amount           float64
dtype: object

In [129]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5004 entries, 0 to 5003
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   st                         2067 non-null   object 
 1   gender                     1945 non-null   object 
 2   education                  2067 non-null   object 
 3   customer lifetime value    2060 non-null   object 
 4   income                     2067 non-null   float64
 5   monthly premium auto       2067 non-null   float64
 6   number of open complaints  2067 non-null   object 
 7   policy type                2067 non-null   object 
 8   vehicle class              2067 non-null   object 
 9   total claim amount         2067 non-null   float64
dtypes: float64(3), object(7)
memory usage: 391.1+ KB


In [130]:
#customer lifetime value should be a float, not a string ending in '%'

def function(x):
    x = str(x)

    if x != "nan":
        #drop the trailing '%' and convert the rest to a float
        y = x[:-1]
        return float(y)
    else:
        #missing values are set to 0 here
        return 0

df2["customer lifetime value"] = df2["customer lifetime value"].apply(function)
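
As an alternative sketch, the same conversion can be done without a row-wise `apply`, using pandas string methods and `pd.to_numeric`. The sample values in `raw` are illustrative. Note one deliberate difference from the function above: missing entries stay `NaN` here instead of becoming 0, so a later mean-based fill would still have something to fill.

```python
import pandas as pd

# Illustrative sample: two percentage strings and one missing value
raw = pd.Series(["697953.59%", "1288743.17%", None])

# Strip the trailing '%' and convert; errors="coerce" turns anything
# that cannot be parsed into NaN instead of raising
clv = pd.to_numeric(raw.str.rstrip("%"), errors="coerce")

print(clv)
```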




In [131]:
# values of 'number of complaints'
df2["number of open complaints"].value_counts()



1/0/00    1626
1/1/00     247
1/2/00      93
1/3/00      60
1/4/00      29
1/5/00      12
Name: number of open complaints, dtype: int64

In [132]:
#number of open complaints: keep only the 3rd character, e.g. '1/2/00' -> 2
def function2(x):
    x = str(x)

    if x != "nan":
        #the character at index 2 is the complaint count
        y = x[2]
        return int(y)
    else:
        return 0

df2["number of open complaints"] = df2["number of open complaints"].apply(function2)




In [133]:
#check of the function
df2["number of open complaints"].unique()



array([0, 2, 1, 3, 5, 4])
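
Taking the character at a fixed position works for the values seen here, but it would break for a count of 10 or more. A hedged alternative is to split on `/` and take the middle field. The sample series below is made up, and the assumption that the count is always the second field comes only from the `value_counts()` output above.

```python
import pandas as pd

# Made-up sample in the same date-like format as the column
raw = pd.Series(["1/0/00", "1/2/00", "1/5/00", None])

# Split on '/', take the middle field, then convert to a number;
# missing entries stay NaN
complaints = pd.to_numeric(raw.str.split("/").str[1], errors="coerce")

print(complaints)
```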

## 5. Filtering data and Correcting typos – Filter the data in state and gender column to standardize the texts in those columns


In [134]:
#analyse the uniques in st
df2.st.unique()


array(['Washington', 'Arizona', 'Nevada', 'California', 'Oregon', 'Cali',
       'AZ', 'WA', nan], dtype=object)

In [135]:
#cleaning the state names
def st_clean(x):
    if not x == x:
        #NaN is the only value that is not equal to itself, so x is missing
        return x
    else:
        states = {'Arizona': 'AZ',
                  'California': 'CA',
                  'Nevada': 'NV',
                  'Oregon': 'OR',
                  'AZ': 'AZ',
                  'WA': 'WA',
                  'Washington': 'WA',
                  'Cali': 'CA'}

        return states[x]

In [136]:
#apply the function to the main df2
df2["st"] = df2["st"].apply(st_clean)

In [137]:
#analyse the uniques in gender
df2.gender.unique()

array([nan, 'F', 'M', 'Femal', 'Male', 'female'], dtype=object)

In [138]:
#cleaning the gender names
def gender1_clean(x):
    if not x == x:
        #NaN is the only value that is not equal to itself, so x is missing
        return x
    else:
        g = {'F': 'F',
             'M': 'M',
             'Femal': 'F',
             'Male': 'M',
             'female': 'F'}

        return g[x]
        





In [139]:
df2["gender"] = df2["gender"].apply(gender1_clean)

In [140]:
df2.gender.unique()

array([nan, 'F', 'M'], dtype=object)
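
The two cleaning functions above raise a `KeyError` as soon as a spelling appears that is not in the dictionary. A more forgiving sketch uses `Series.replace` with a dictionary, which leaves unlisted values and missing values untouched. The sample series and the name `state_codes` are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample with the kinds of spellings found in the column
states = pd.Series(["Washington", "Cali", "AZ", None, "Oregon"])

# Only the spellings that need changing are listed
state_codes = {"Washington": "WA", "California": "CA", "Cali": "CA",
               "Arizona": "AZ", "Nevada": "NV", "Oregon": "OR"}

# replace() swaps matching values and leaves everything else as it is
cleaned = states.replace(state_codes)

print(cleaned.tolist())
```

The trade-off is that an unexpected spelling passes through silently, so checking `unique()` afterwards, as the notebook does, remains useful.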

## 6. Removing duplicates


In [142]:
df2.drop_duplicates()

| | st | gender | education | customer lifetime value | income | monthly premium auto | number of open complaints | policy type | vehicle class | total claim amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WA | | Master | 0.00 | 0.0 | 1000.0 | 0 | Personal Auto | Four-Door Car | 2.704934 |
| 1 | AZ | F | Bachelor | 697953.59 | 0.0 | 94.0 | 0 | Personal Auto | Four-Door Car | 1131.464935 |
| 2 | NV | F | Bachelor | 1288743.17 | 48767.0 | 108.0 | 0 | Personal Auto | Two-Door Car | 566.472247 |
| 3 | CA | M | Bachelor | 764586.18 | 0.0 | 106.0 | 0 | Corporate Auto | SUV | 529.881344 |
| 4 | WA | M | High School or Below | 536307.65 | 36357.0 | 68.0 | 0 | Personal Auto | Four-Door Car | 17.269323 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4999 | AZ | M | Master | 847141.75 | 63513.0 | 70.0 | 0 | Personal Auto | Four-Door Car | 185.667213 |
| 5000 | AZ | F | College | 543121.91 | 58161.0 | 68.0 | 0 | Corporate Auto | Four-Door Car | 140.747286 |
| 5001 | NV | F | College | 568964.41 | 83640.0 | 70.0 | 0 | Corporate Auto | Two-Door Car | 471.050488 |
| 5002 | CA | F | Master | 368672.38 | 0.0 | 96.0 | 0 | Personal Auto | Two-Door Car | 28.460568 |


In [143]:
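#drop_duplicates() above returns a new DataFrame and was not assigned back,
#so df2 still has all 5004 rows; rebuild the index as 0..len(df2)-1
import numpy as np
df2.index = np.arange(len(df2))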
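
`drop_duplicates()` does not modify the frame it is called on; it returns a new one. The sketch below, on a small made-up frame, shows the difference between calling it and assigning its result, and uses `reset_index(drop=True)` to renumber the rows afterwards.

```python
import pandas as pd

# Made-up frame: one repeated row and two identical all-missing rows
demo = pd.DataFrame({"st": ["WA", "WA", "AZ", None, None],
                     "income": [0.0, 0.0, 48767.0, None, None]})

demo.drop_duplicates()  # result is discarded, demo still has 5 rows

# Assign the result and renumber the rows from 0
deduped = demo.drop_duplicates().reset_index(drop=True)

print(len(demo), len(deduped))
```

Rows that consist only of missing values count as duplicates of each other, so a block of empty rows collapses to a single row.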

In [144]:
#df2

## 7. Optional: Replacing null values – Replace missing values with the mean of the column (for numerical columns)

In [145]:
df2.columns

Index(['st', 'gender', 'education', 'customer lifetime value', 'income',
       'monthly premium auto', 'number of open complaints', 'policy type',
       'vehicle class', 'total claim amount'],
      dtype='object')

In [146]:
df2["customer lifetime value"].mean()

321081.0141846531

In [147]:
#replace empty fields with mean
#(assigning the result back avoids calling fillna(inplace=True) on a column selection)
df2['customer lifetime value'] = df2['customer lifetime value'].fillna(df2['customer lifetime value'].mean())

In [148]:
df2['income'] = df2['income'].fillna(df2['income'].mean())

In [149]:
df2['monthly premium auto'] = df2['monthly premium auto'].fillna(df2['monthly premium auto'].mean())

In [150]:
df2['total claim amount'] = df2['total claim amount'].fillna(df2['total claim amount'].mean())

In [151]:
df2['number of open complaints'] = df2['number of open complaints'].fillna(df2['number of open complaints'].mean())

In [152]:
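# not working: applymap passes single cell values, which have no .mean() method,
# and type(df2.columns) is an Index, never float or int
#df2 = df2.applymap(lambda i: i.mean() if type(df2.columns) == float or  type(df2.columns)== int else i)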

In [153]:
df2.isnull().sum().sum()

14807
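
One way to fill every numerical column in a single step, sketched on a small made-up frame: select the numeric columns with `select_dtypes` and pass the column means to `fillna`. Text columns are not touched, so they can still contain missing values afterwards, which is one reason the total null count can remain above zero.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one text column
demo = pd.DataFrame({"income": [100.0, np.nan, 300.0],
                     "gender": ["F", None, "M"]})

# Names of all numeric columns
num_cols = demo.select_dtypes(include=np.number).columns

# fillna accepts a Series of per-column fill values (here: the column means)
demo[num_cols] = demo[num_cols].fillna(demo[num_cols].mean())

print(demo)
```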

## 8. Bucketing the data - Write a function to replace column "State" to different zones. California as West Region, Oregon as North West, and Washington as East, and Arizona and Nevada as Central


In [154]:
#map the state codes to regions
def st_region(x):
    if not x == x:
        #NaN is not equal to itself, so x is missing
        return x
    else:
        region = {'AZ': 'Central',
                  'CA': 'West',
                  'NV': 'Central',
                  'OR': 'North West',
                  'WA': 'East'}

        return region[x]

df2["st"] = df2["st"].apply(st_region)

In [155]:
df2 = df2.rename(columns={'st': 'region'})
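
The same bucketing can be sketched with `Series.map` and a dictionary, which avoids the explicit missing-value check. The names `zones`, `st` and `region` and the sample values are illustrative.

```python
import pandas as pd

# State code -> zone, following the assignment text
zones = {"CA": "West", "OR": "North West", "WA": "East",
         "AZ": "Central", "NV": "Central"}

st = pd.Series(["WA", "CA", None, "NV"])

# map() looks each value up in the dictionary; missing values stay NaN,
# and codes that are not in the dictionary would also become NaN
region = st.map(zones)

print(region.tolist())
```

Unlike `replace`, `map` turns unlisted values into `NaN`, which makes an incomplete mapping easy to spot.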

## 9. Standardizing the data – Use string functions to standardize the text data (lower case)

In [156]:
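#lower-case every string cell (applymap is deprecated since pandas 2.1, where DataFrame.map replaces it)
df2 = df2.applymap(lambda i: i.lower() if isinstance(i, str) else i)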
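
A column-wise sketch of the same step, assuming the text columns are known by name (the list `text_cols` and the sample frame are made up): the `.str.lower()` accessor lowercases a whole column at once and leaves missing values as they are.

```python
import pandas as pd

# Made-up frame with two text columns and one numeric column
demo = pd.DataFrame({"education": ["Master", None],
                     "policy type": ["Personal Auto", "Corporate Auto"],
                     "income": [0.0, 48767.0]})

# Hypothetical list of the columns that hold text
text_cols = ["education", "policy type"]

for col in text_cols:
    demo[col] = demo[col].str.lower()

print(demo)
```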


In [39]:
# off topic: lambda
def my_func(x):
    return x ** 2
#a lambda is the same kind of function without a name; useful if you need a function only once:
'lambda x : x **2'

'lambda x : x **2'


## 10. Which columns are numerical?

In [157]:
df2_numerical= df2.select_dtypes(include=np.number).columns.tolist()
df2_numerical

['customer lifetime value',
 'income',
 'monthly premium auto',
 'number of open complaints',
 'total claim amount']

## 11. Which columns are categorical?

In [158]:
df2_categorical = df2.select_dtypes(include=object).columns.tolist()
df2_categorical

['region', 'gender', 'education', 'policy type', 'vehicle class']


## 12. Check and deal with NaN values. (Hint: Replacing null values – Replace missing values with the mean of the column (for numerical columns)).

In [160]:
df2.fillna(0,inplace=True)

In [161]:
df2.isnull().sum().sum()

0

## 13. Datetime format – Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.

In [162]:
df3 = pd.read_csv("Data_Marketing_Customer_Analysis_Round2.csv")


df3['date']= pd.to_datetime(df3['Effective To Date'])

In [163]:
#replace each date with its month name, e.g. 'January'
df3['date'] = df3['date'].dt.strftime('%B')

In [165]:
df3['date'].unique()
#there is only data for the first quarter, so no filtering is needed


array(['February', 'January'], dtype=object)
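
For a dataset that does contain later months, the filtering step could look like the sketch below. The three sample dates are made up (the April row exists only to show the filter doing something), and the `month/day/2-digit-year` format is assumed from the values visible in the `effective to date` column.

```python
import pandas as pd

# Made-up sample; the April date is hypothetical
demo = pd.DataFrame({"Effective To Date": ["2/18/11", "1/18/11", "4/02/11"]})

# Parse the strings with an explicit format: month/day/2-digit year
demo["date"] = pd.to_datetime(demo["Effective To Date"], format="%m/%d/%y")

# Store the month number in its own column
demo["month"] = demo["date"].dt.month

# Keep only January, February and March
first_quarter = demo[demo["month"].isin([1, 2, 3])]

print(first_quarter)
```

Keeping the month as a number makes the quarter filter a simple membership test; `dt.month_name()` can still be used if the name is needed for display.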

## 14. BONUS: Put all the previously mentioned data transformations into a function/functions.

### Putting cleaning activities into functions 

In [166]:
def load_data(path):
    return pd.read_csv(path)

In [167]:
def lower_case_column(df3):
    df3.columns = [i.lower() for i in df3]
    return df3

In [168]:
def rename_columns(df3):
    df3.rename(columns={'state':'STATE'},inplace =True)
    return df3

### Pipeline control - to be put at the top to rerun (just comment out the functions you don't want to run)


In [169]:
df3 = load_data(path ="Data_Marketing_Customer_Analysis_Round2.csv")\
.pipe(lower_case_column)\
.pipe(rename_columns)
#...
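
A sketch of how more of the earlier cleaning steps might be chained with `pipe`. The step functions `lower_case_columns`, `drop_customer` and `clean_gender` and the two-row frame `raw` are hypothetical; each step copies or returns a new frame so the input is left unchanged and the pipeline can be rerun safely.

```python
import pandas as pd

def lower_case_columns(df):
    # Work on a copy so the caller's frame is not modified
    df = df.copy()
    df.columns = [c.lower() for c in df.columns]
    return df

def drop_customer(df):
    # errors="ignore" makes the step safe to rerun if the column is already gone
    return df.drop(columns="customer", errors="ignore")

def clean_gender(df):
    df = df.copy()
    df["gender"] = df["gender"].replace({"Femal": "F", "female": "F", "Male": "M"})
    return df

# Hypothetical raw input
raw = pd.DataFrame({"Customer": ["A1", "A2"], "GENDER": ["Femal", "Male"]})

cleaned = (raw.pipe(lower_case_columns)
              .pipe(drop_customer)
              .pipe(clean_gender))

print(cleaned)
```

Commenting out a single `.pipe(...)` line then skips that step, as the heading above suggests.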

In [170]:
df3

| | unnamed: 0 | customer | STATE | customer lifetime value | response | coverage | education | effective to date | employmentstatus | gender | ... | number of open complaints | number of policies | policy type | policy | renew offer type | sales channel | total claim amount | vehicle class | vehicle size | vehicle type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.216960 | No | Basic | College | 2/18/11 | Employed | M | ... | 0.0 | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.800000 | Four-Door Car | Medsize | |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 1/18/11 | Unemployed | F | ... | 0.0 | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | |
| 2 | 2 | LZ68649 | Washington | 14947.917300 | No | Basic | Bachelor | 2/10/11 | Employed | M | ... | 0.0 | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.000000 | SUV | Medsize | A |
| 3 | 3 | XL78013 | Oregon | 22332.439460 | Yes | Extended | College | 1/11/11 | Employed | M | ... | 0.0 | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 1/17/11 | Medical Leave | F | ... | | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10905 | 10905 | FE99816 | Nevada | 15563.369440 | No | Premium | Bachelor | 1/19/11 | Unemployed | F | ... | | 7 | Personal Auto | Personal L1 | Offer3 | Web | 1214.400000 | Luxury Car | Medsize | A |
| 10906 | 10906 | KX53892 | Oregon | 5259.444853 | No | Basic | College | 1/6/11 | Employed | F | ... | 0.0 | 6 | Personal Auto | Personal L3 | Offer2 | Branch | 273.018929 | Four-Door Car | Medsize | A |
| 10907 | 10907 | TL39050 | Arizona | 23893.304100 | No | Extended | Bachelor | 2/6/11 | Employed | F | ... | 0.0 | 2 | Corporate Auto | Corporate L3 | Offer1 | Web | 381.306996 | Luxury SUV | Medsize | |
| 10908 | 10908 | WA60547 | California | 11971.977650 | No | Premium | College | 2/13/11 | Employed | F | ... | 4.0 | 6 | Personal Auto | Personal L1 | Offer1 | Branch | 618.288849 | SUV | Medsize | A |


In [None]:
#df3.to_csv(".../cleaned_file.csv")