# Lab Case Study
## Scenario
You are working as an analyst for an auto insurance company. The company has collected data about its customers, including their demographics, education, employment, policy details, information on the insured vehicles, and claim amounts. You will help senior management answer business questions so that they can better understand their customers, improve their services, and improve profitability.
## Business Objectives
- Retain customers.
- Analyze relevant customer data.
- Develop focused customer retention programs.
Based on the analysis, take targeted actions to increase profitable customer response, retention, and growth.
## Activities
Refer to the `Activities.md` file, where you will find guidelines for some of the activities you need to complete.
## Data
The CSV files are provided in the folder. The column names in the files are self-explanatory.

# Activities List
Here are some of the tasks you need to perform:
### Activity 1
- Aggregate the data into one DataFrame using Pandas.
- Standardizing header names
- Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data
- Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of open complaints)
- Filtering data and correcting typos – Filter the data in the state and gender columns to standardize the text in those columns
- Removing duplicates
- Replacing null values – Replace missing values with means of the column (for numerical columns)
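
The null-replacement step can be sketched on a small made-up frame. The column names mirror the dataset, but the values are invented for illustration, and this is only one way to do it: select the numeric columns and fill each with its own mean, leaving text columns untouched.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the insurance data; the values are made up.
df = pd.DataFrame({
    "income": [0.0, 48767.0, np.nan, 36357.0],
    "monthly_premium_auto": [94.0, np.nan, 106.0, 68.0],
    "gender": ["F", None, "M", "M"],
})

# Select the numeric columns, then fill each one with its own column mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print(df)
```

The missing `gender` value is left as is, since a mean is not defined for text columns.
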
### Activity 2
- Bucketing the data - Write a function that maps the values in the "State" column to zones: California as West Region, Oregon as North West, Washington as East, and Arizona and Nevada as Central
- Standardizing the data – Use string functions to standardize the text data (lower case)
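
A minimal sketch of the bucketing function, assuming the state names have already been standardized to their full spelling. The names `ZONES` and `to_zone` are illustrative and do not come from the lab material; the zone assignments are copied from the activity text.

```python
import pandas as pd

# Zone assignments copied from the activity description.
ZONES = {
    "California": "West Region",
    "Oregon": "North West",
    "Washington": "East",
    "Arizona": "Central",
    "Nevada": "Central",
}

def to_zone(state):
    """Return the zone for a state name; unknown or missing values pass through unchanged."""
    return ZONES.get(state, state)

states = pd.Series(["California", "Oregon", "Washington", "Arizona", "Nevada", None])
zones = states.apply(to_zone)
print(zones)
```

Passing unknown values through unchanged makes leftover typos easy to spot in a later `value_counts()` call.
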

### Activity 3
- Which columns are numerical?
- Which columns are categorical?
- Check and deal with NaN values. (Hint: replace missing values with the mean of the column, for numerical columns.)
- Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.
- BONUS: Put all the previously mentioned data transformations into a function/functions.
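
The datetime task could look roughly like the sketch below. The files loaded in this notebook do not show a date column, so the column name `effective_to_date`, the date format, and all values here are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical date column and made-up values; adjust the name and format to the real data.
df = pd.DataFrame({
    "effective_to_date": ["2/18/11", "1/31/11", "3/2/11", "4/10/11"],
    "total_claim_amount": [2.70, 1131.46, 566.47, 529.88],
})

# Parse the strings into datetimes, then store the month number in its own column.
df["effective_to_date"] = pd.to_datetime(df["effective_to_date"], format="%m/%d/%y")
df["month"] = df["effective_to_date"].dt.month

# Keep only the first quarter (January, February, March).
first_quarter = df[df["month"].isin([1, 2, 3])]
print(first_quarter)
```

If March is absent from the data, the same filter still works and simply returns January and February rows.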


In [1]:
import pandas as pd
import numpy as np

In [4]:
#Aggregate data into one Data Frame using Pandas.
insure1 = pd.read_csv('file1.csv')
insure2 = pd.read_csv('file2.csv')
type(insure1)

pandas.core.frame.DataFrame

In [5]:
# Check that both tables share the same set of column headers
# (the headers are lowercased later, after the tables are concatenated).
print(set(insure1.columns) == set(insure2.columns))

True


In [6]:
# Stack the two tables on top of each other with a fresh index.
insure_all = pd.concat([insure1, insure2], axis=0, ignore_index=True)
# Standardize the headers: lower case, spaces replaced by underscores.
column_names = [name.lower().replace(' ', '_') for name in insure_all.columns]
print(column_names)
insure_all.columns = column_names

['customer', 'st', 'gender', 'education', 'customer_lifetime_value', 'income', 'monthly_premium_auto', 'number_of_open_complaints', 'policy_type', 'vehicle_class', 'total_claim_amount']


In [7]:
insure_all = insure_all.drop(columns = 'customer')
insure_all

| | st | gender | education | customer_lifetime_value | income | monthly_premium_auto | number_of_open_complaints | policy_type | vehicle_class | total_claim_amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Washington | NaN | Master | NaN | 0.0 | 1000.0 | 1/0/00 | Personal Auto | Four-Door Car | 2.704934 |
| 1 | Arizona | F | Bachelor | 697953.59% | 0.0 | 94.0 | 1/0/00 | Personal Auto | Four-Door Car | 1131.464935 |
| 2 | Nevada | F | Bachelor | 1288743.17% | 48767.0 | 108.0 | 1/0/00 | Personal Auto | Two-Door Car | 566.472247 |
| 3 | California | M | Bachelor | 764586.18% | 0.0 | 106.0 | 1/0/00 | Corporate Auto | SUV | 529.881344 |
| 4 | Washington | M | High School or Below | 536307.65% | 36357.0 | 68.0 | 1/0/00 | Personal Auto | Four-Door Car | 17.269323 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4999 | Arizona | M | Master | 847141.75% | 63513.0 | 70.0 | 1/0/00 | Personal Auto | Four-Door Car | 185.667213 |
| 5000 | Arizona | F | College | 543121.91% | 58161.0 | 68.0 | 1/0/00 | Corporate Auto | Four-Door Car | 140.747286 |
| 5001 | Nevada | F | College | 568964.41% | 83640.0 | 70.0 | 1/0/00 | Corporate Auto | Two-Door Car | 471.050488 |
| 5002 | California | F | Master | 368672.38% | 0.0 | 96.0 | 1/0/00 | Personal Auto | Two-Door Car | 28.460568 |


In [8]:
# Deletion of duplicates
# Note: drop_duplicates() returns a new DataFrame; here the result is only displayed, not stored.
insure_all.drop_duplicates()

| | st | gender | education | customer_lifetime_value | income | monthly_premium_auto | number_of_open_complaints | policy_type | vehicle_class | total_claim_amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Washington | NaN | Master | NaN | 0.0 | 1000.0 | 1/0/00 | Personal Auto | Four-Door Car | 2.704934 |
| 1 | Arizona | F | Bachelor | 697953.59% | 0.0 | 94.0 | 1/0/00 | Personal Auto | Four-Door Car | 1131.464935 |
| 2 | Nevada | F | Bachelor | 1288743.17% | 48767.0 | 108.0 | 1/0/00 | Personal Auto | Two-Door Car | 566.472247 |
| 3 | California | M | Bachelor | 764586.18% | 0.0 | 106.0 | 1/0/00 | Corporate Auto | SUV | 529.881344 |
| 4 | Washington | M | High School or Below | 536307.65% | 36357.0 | 68.0 | 1/0/00 | Personal Auto | Four-Door Car | 17.269323 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4999 | Arizona | M | Master | 847141.75% | 63513.0 | 70.0 | 1/0/00 | Personal Auto | Four-Door Car | 185.667213 |
| 5000 | Arizona | F | College | 543121.91% | 58161.0 | 68.0 | 1/0/00 | Corporate Auto | Four-Door Car | 140.747286 |
| 5001 | Nevada | F | College | 568964.41% | 83640.0 | 70.0 | 1/0/00 | Corporate Auto | Two-Door Car | 471.050488 |
| 5002 | California | F | Master | 368672.38% | 0.0 | 96.0 | 1/0/00 | Personal Auto | Two-Door Car | 28.460568 |
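
Because `drop_duplicates()` returns a new DataFrame instead of modifying the original, the call above leaves `insure_all` unchanged. The toy example below (made-up rows, not the lab data) shows the difference between calling it and assigning the result back.

```python
import pandas as pd

# Toy frame with one exact duplicate row; the values are made up.
df = pd.DataFrame({
    "st": ["Oregon", "Oregon", "Nevada"],
    "income": [0.0, 0.0, 48767.0],
})

df.drop_duplicates()       # result is discarded, df still has 3 rows
rows_before = len(df)

# Assign the result back (and rebuild the index) to keep the change.
df = df.drop_duplicates().reset_index(drop=True)
rows_after = len(df)

print(rows_before, rows_after)
```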


In [9]:
insure_all['st'].unique()

array(['Washington', 'Arizona', 'Nevada', 'California', 'Oregon', 'Cali',
       'AZ', 'WA', nan], dtype=object)

In [10]:
insure_all['st'].value_counts()

Oregon        623
California    488
Arizona       328
Nevada        223
Washington    181
Cali          120
AZ             74
WA             30
Name: st, dtype: int64

In [11]:
def state(name):
    # NaN is the only value that is not equal to itself. This branch is a no-op,
    # so missing values fall through and str() below turns them into the string 'nan'.
    if name != name:
        pass
    # Expand the abbreviated spellings to the full state names.
    if str(name).endswith('li'):
        name = str(name).replace('Cali', 'California')
    name = str(name).replace('AZ', 'Arizona')
    name = str(name).replace('WA', 'Washington')
    return name

print(insure_all['st'].apply(state).value_counts())
print(insure_all['st'].apply(state).unique())
insure_all['st'] = insure_all['st'].apply(state)

nan           2937
Oregon         623
California     608
Arizona        402
Nevada         223
Washington     211
Name: st, dtype: int64
['Washington' 'Arizona' 'Nevada' 'California' 'Oregon' 'nan']
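
In the output above, the missing values show up as the string `'nan'` because `str()` is applied to them. One possible alternative, shown here on a toy Series rather than the lab data, is an exact-match replacement with a dictionary, which leaves real NaN values intact. The name `STATE_FIXES` is illustrative.

```python
import numpy as np
import pandas as pd

# Abbreviations seen in the 'st' column and their full spellings.
STATE_FIXES = {"Cali": "California", "AZ": "Arizona", "WA": "Washington"}

st = pd.Series(["Washington", "Cali", "AZ", "WA", np.nan, "Oregon"])

# Series.replace with a dict matches whole values, so 'Washington' is not
# touched by the 'WA' key, and NaN stays NaN.
st_clean = st.replace(STATE_FIXES)

print(st_clean.value_counts(dropna=False))
```

Keeping NaN as a true missing value means `isna()`, `dropna()` and `fillna()` continue to work on the column later.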


In [11]:
def clean_complaints(value):
    # NaN is the only value not equal to itself; treat missing entries as 0 complaints.
    if value != value:
        return 0
    # Entries look like '1/0/00'; the character at index 2 is the complaint count.
    value = str(value)
    return int(value[2])

insure_all['number_of_open_complaints'].apply(clean_complaints).value_counts()

0    4563
1     247
2      93
3      60
4      29
5      12
Name: number_of_open_complaints, dtype: int64
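
Indexing a single character works while the count has one digit. A slightly more robust sketch, assuming the entries always follow the `a/b/c` pattern seen above, splits on `/` and takes the middle field. The sample values are made up, and filling missing entries with 0 simply mirrors the choice made in `clean_complaints`.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["1/0/00", "1/2/00", np.nan, "1/5/00"])

# Split on '/' and take the middle field; missing entries stay NaN at this point.
middle = raw.str.split("/").str[1]

# Convert to numbers, then treat missing entries as 0 complaints.
complaints = pd.to_numeric(middle, errors="coerce").fillna(0).astype(int)

print(complaints.tolist())
```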

In [9]:
insure_all['number_of_open_complaints'] = insure_all['number_of_open_complaints'].apply(clean_complaints)
print(insure_all.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5004 entries, 0 to 5003
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   st                         2067 non-null   object 
 1   gender                     1945 non-null   object 
 2   education                  2067 non-null   object 
 3   customer_lifetime_value    2060 non-null   object 
 4   income                     2067 non-null   float64
 5   monthly_premium_auto       2067 non-null   float64
 6   number_of_open_complaints  5004 non-null   int64  
 7   policy_type                2067 non-null   object 
 8   vehicle_class              2067 non-null   object 
 9   total_claim_amount         2067 non-null   float64
dtypes: float64(3), int64(1), object(6)
memory usage: 391.1+ KB
None


In [10]:
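def clean_life(value):
    # NaN is the only value not equal to itself; missing entries are returned as 0.
    if value != value:
        return 0
    # Entries look like '697953.59%': drop the trailing '%' and divide by 100.
    value = str(value)
    return float(value[:-1]) / 100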

insure_all['customer_lifetime_value'].apply(clean_life).describe()

count     5004.000000
mean      3211.398471
std       5587.659894
min          1.000000
25%          1.000000
50%          1.000000
75%       5107.254050
max      58166.553500
Name: customer_lifetime_value, dtype: float64

In [11]:
insure_all['customer_lifetime_value'] = insure_all['customer_lifetime_value'].apply(clean_life)
insure_all['customer_lifetime_value'].head(10)

0        1.0000
1     6979.5359
2    12887.4317
3     7645.8618
4     5363.0765
5     8256.2978
6     5380.8986
7     7216.1003
8    24127.5040
9     7388.1781
Name: customer_lifetime_value, dtype: float64
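
The same conversion can be sketched with vectorized string methods. This is an alternative under the assumption that every non-missing entry ends in `%`. Unlike `clean_life`, it keeps missing entries as NaN, so they can later be replaced with the column mean as the activity suggests. The first two sample strings are taken from the table above.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["697953.59%", "1288743.17%", np.nan])

# Strip the trailing '%', convert to float, and divide by 100; NaN stays NaN.
clv = pd.to_numeric(raw.str.rstrip("%"), errors="coerce") / 100

print(clv)
```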

In [13]:
insure_all.dtypes

st                            object
gender                        object
education                     object
customer_lifetime_value      float64
income                       float64
monthly_premium_auto         float64
number_of_open_complaints      int64
policy_type                   object
vehicle_class                 object
total_claim_amount           float64
dtype: object
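
The dtype listing above answers the Activity 3 questions about numerical and categorical columns by inspection. A short sketch of doing the same split programmatically with `select_dtypes`, on a toy frame with made-up rows:

```python
import pandas as pd

# Toy frame with two text columns and two numeric columns.
df = pd.DataFrame({
    "st": ["Oregon", "Nevada"],
    "income": [0.0, 48767.0],
    "number_of_open_complaints": [0, 1],
    "policy_type": ["Personal Auto", "Corporate Auto"],
})

# Numeric dtypes (int, float) versus everything else.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```

Note that a column such as `number_of_open_complaints` is numeric by dtype but takes only a few distinct values, so it may still be treated as categorical in the analysis.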

In [26]:
insure_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5004 entries, 0 to 5003
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customer                   2067 non-null   object 
 1   st                         2067 non-null   object 
 2   gender                     1945 non-null   object 
 3   education                  2067 non-null   object 
 4   customer_lifetime_value    2060 non-null   object 
 5   income                     2067 non-null   float64
 6   monthly_premium_auto       2067 non-null   float64
 7   number_of_open_complaints  2067 non-null   object 
 8   policy_type                2067 non-null   object 
 9   vehicle_class              2067 non-null   object 
 10  total_claim_amount         2067 non-null   float64
dtypes: float64(3), object(8)
memory usage: 430.2+ KB
