# Activities List

Here are some of the tasks you need to perform:

### Activity 1

- [x] Aggregate data into one DataFrame using pandas.
- [x] Standardizing header names
- [ ] Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data
- [ ] Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of open complaints)
- [x] Filtering data and Correcting typos – Filter the data in state and gender column to standardize the texts in those columns
- [ ] Removing duplicates
- [ ] Replacing null values – Replace missing values with means of the column (for numerical columns)
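
The last item (mean imputation) is still unchecked in this notebook. A minimal sketch of one way to do it, assuming a made-up frame `toy` in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Income": [100.0, np.nan, 300.0],
    "Total Claim Amount": [10.0, 20.0, np.nan],
    "State": ["Oregon", None, "Nevada"],
})

# Pick the numeric columns, then fill each one with its own column mean.
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
```

Non-numeric columns such as `State` are left untouched, so their missing values need a separate decision.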

In [1]:
# setup libraries
import pandas as pd
import numpy as np

In [2]:
# read files
file_1 = pd.read_csv('Data/file1.csv')
file_2 = pd.read_csv('Data/file2.csv')
file_3 = pd.read_csv('Data/file3.csv')

In [3]:
# combine data
# concat keeps each file's original row labels, so index values repeat across files
data = pd.concat([file_1, file_2, file_3])
#data.columns

# copy GENDER into Gender and ST into State wherever the target is NaN
# (x == x is False only for NaN, so it works as a not-NaN test)

data['Gender'] = list(map(lambda x, y: x if x == x else y, data['Gender'],data['GENDER']))
data['State'] = list(map(lambda x, y: x if x == x else y, data['State'], data['ST']))

# drop GENDER and ST because they now duplicate Gender and State
data.drop(columns=['ST','GENDER'], inplace=True)
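
The same merge of duplicate columns can probably be written without a Python-level lambda. A sketch using `np.where`, which works by position and is therefore unaffected by repeated index labels; the frame `toy` is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with duplicate index labels, as after pd.concat
# without ignore_index.
toy = pd.DataFrame({
    "Gender": ["F", np.nan, np.nan],
    "GENDER": [np.nan, "M", np.nan],
}, index=[0, 0, 1])

# Keep Gender where it is present, otherwise take GENDER (positional).
toy["Gender"] = np.where(toy["Gender"].notna(), toy["Gender"], toy["GENDER"])
toy = toy.drop(columns=["GENDER"])
```

Rows where both columns are missing stay NaN.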

In [4]:
data

Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,Washington,
1,QZ44356,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,AI49188,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,WW63253,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,California,M
4,GA49547,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,Washington,M
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,Bachelor,23405.98798,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
7066,PK87824,College,3096.511217,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
7067,TD14365,Bachelor,8163.890428,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
7068,UP19263,College,7524.442436,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M


In [5]:
# clean up gender
data['Gender'].unique()

array([nan, 'F', 'M', 'Femal', 'Male', 'female'], dtype=object)

In [6]:
F = ['Femal', 'female', 'F']
M = ['Male', 'M']
data['Gender'] = list(map(lambda x: 'M' if x==x and x in M else ('F' if x==x and x in F else x), data['Gender']))
data['Gender'].unique()

array([nan, 'F', 'M'], dtype=object)

In [7]:
# fix State values
data['State'].unique()

array(['Washington', 'Arizona', 'Nevada', 'California', 'Oregon', 'Cali',
       'AZ', 'WA', nan], dtype=object)

In [8]:
data['State'] = list(map(lambda x: x if x != 'AZ' else 'Arizona', data['State']))
data['State'] = list(map(lambda x: x if x != 'Cali' else 'California', data['State']))
data['State'] = list(map(lambda x: x if x != 'WA' else 'Washington', data['State']))

In [9]:
data['State'].unique()

array(['Washington', 'Arizona', 'Nevada', 'California', 'Oregon', nan],
      dtype=object)
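
The gender and state clean-up above can also be expressed as lookup tables passed to `Series.replace`, which matches whole values and leaves NaN alone. A sketch on a hypothetical frame `toy`; the mapping names are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering the variants seen in the outputs above.
toy = pd.DataFrame({
    "Gender": ["F", "Femal", "female", "Male", "M", np.nan],
    "State": ["Cali", "AZ", "WA", "Oregon", "Nevada", np.nan],
})

# One dictionary per column: variant -> canonical value.
gender_map = {"Femal": "F", "female": "F", "Male": "M"}
state_map = {"Cali": "California", "AZ": "Arizona", "WA": "Washington"}

toy["Gender"] = toy["Gender"].replace(gender_map)
toy["State"] = toy["State"].replace(state_map)
```

Adding a new variant then means adding one dictionary entry rather than another `map` call.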

In [10]:
#data['Customer Lifetime Value'].describe

# strip the trailing percent sign and cast to float; NaN becomes 0.0
data['Customer Lifetime Value'] = list(map(lambda x: float(str(x).strip('%\r\t\n')) if x==x else float(0), data['Customer Lifetime Value']))

#debug print(data['Customer Lifetime Value'])

# make the float64 dtype explicit (astype returns a new Series, so assign it back)
data['Customer Lifetime Value'] = data['Customer Lifetime Value'].astype('float64')

# round to whole numbers
data['Customer Lifetime Value'] = data['Customer Lifetime Value'].round(0)

print(data.dtypes)
#debug print(data['Customer Lifetime Value'])
data

Customer                      object
Education                     object
Customer Lifetime Value      float64
Income                       float64
Monthly Premium Auto         float64
Number of Open Complaints     object
Policy Type                   object
Vehicle Class                 object
Total Claim Amount           float64
State                         object
Gender                        object
dtype: object


Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,0.0,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,Washington,
1,QZ44356,Bachelor,697954.0,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,AI49188,Bachelor,1288743.0,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,WW63253,Bachelor,764586.0,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,California,M
4,GA49547,High School or Below,536308.0,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,Washington,M
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,Bachelor,23406.0,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
7066,PK87824,College,3097.0,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
7067,TD14365,Bachelor,8164.0,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
7068,UP19263,College,7524.0,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M
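
For comparison, a vectorized sketch of the same clean-up using string methods and `pd.to_numeric`. The sample Series `clv` reuses a few values from the table above; like the cell above, it treats missing values as 0 and does not divide the percent values by 100:

```python
import numpy as np
import pandas as pd

# Sample values: percent strings, a plain float, and a missing value.
clv = pd.Series(["697953.59%", "1288743.17%", 23405.98798, np.nan])

# Cast to string, strip a trailing '%', and parse; unparseable text becomes NaN.
cleaned = pd.to_numeric(clv.astype(str).str.rstrip("%"), errors="coerce")

# Mirror the notebook: missing values become 0, then round to whole numbers.
cleaned = cleaned.fillna(0).round(0)
```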


In [11]:
# format number of open complaints
# there appear to be two formats: a plain number 0-5, and 1/<n>/00 (e.g. 1/2/00)
# for the second format, keep only the middle digit

# NOTE: no column label is passed to .loc, so every column of the rows where
# 'Number of Open Complaints' is NaN is set to 0, not just this column
data.loc[data['Number of Open Complaints'].isnull()] = 0
#data['Number of Open Complaints'].fillna(0)
data['Number of Open Complaints'] = list(map(lambda x: int(x[2]) if x==x and str(x).startswith('1/') and str(x).endswith('/00') else int(x), data['Number of Open Complaints']))

# astype returns a new Series, so assign it back
data['Number of Open Complaints'] = data['Number of Open Complaints'].astype(int)
data['Number of Open Complaints'].unique()

array([0, 2, 1, 3, 5, 4])
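
A sketch of an alternative that parses both formats with a single regular expression and touches only the complaints column. The Series `raw` is a hypothetical sample, and filling missing values with 0 is an assumption carried over from the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: both string formats, a plain int, and a missing value.
raw = pd.Series(["1/0/00", "1/2/00", 3, "0", np.nan])

# Capture the middle field of '1/<n>/00', or a bare number; no match gives NaN.
extracted = raw.astype(str).str.extract(r"^(?:1/)?(\d+)(?:/00)?$")[0]

# Parse to numbers, treat missing values as 0, and store as integers.
complaints = pd.to_numeric(extracted).fillna(0).astype(int)
```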

In [12]:
# fix education
data['Education'].unique()

array(['Master', 'Bachelor', 'High School or Below', 'College',
       'Bachelors', 'Doctor', 0], dtype=object)

In [13]:
# replace 'Bachelors' with 'Bachelor'
data['Education'] = list(map(lambda x: x if x != 'Bachelors' else 'Bachelor', data['Education']))

data['Education'].unique()

array(['Master', 'Bachelor', 'High School or Below', 'College', 'Doctor',
       0], dtype=object)

In [14]:
# rows with Education == 0 are the rows zero-filled in In [11]; they carry no data
# drop them (appending .copy() here would likely avoid the SettingWithCopyWarning raised later)
data = data[data.Education != 0]

In [15]:
data

Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,0.0,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934,Washington,
1,QZ44356,Bachelor,697954.0,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,AI49188,Bachelor,1288743.0,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,WW63253,Bachelor,764586.0,0.0,106.0,0,Corporate Auto,SUV,529.881344,California,M
4,GA49547,High School or Below,536308.0,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323,Washington,M
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,Bachelor,23406.0,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
7066,PK87824,College,3097.0,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
7067,TD14365,Bachelor,8164.0,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
7068,UP19263,College,7524.0,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M


In [16]:
data['Education'].unique()

array(['Master', 'Bachelor', 'High School or Below', 'College', 'Doctor'],
      dtype=object)

In [17]:
#drop 2937 rows

#data.drop(data.loc[data['Education']==0].index, inplace=True)

# reset and remove duplicates
#data.reset_index()
#data.drop_duplicates(inplace= True)

#data.reset_index()
#data.duplicated().unique()

In [18]:
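# NOTE: drop_duplicates() returns a new DataFrame; the result is not assigned,
# so this line leaves data unchanged
data.drop_duplicates()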
#non_gender = data.loc[(data['Gender'] != 'M') & (data['Gender'] != 'F')]
#data.drop(non_gender, inplace=True)
#non_gender
#data.reset_index(inplace=True, drop=True)
data.reset_index(inplace= True, drop= True)
data

Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,0.0,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934,Washington,
1,QZ44356,Bachelor,697954.0,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,AI49188,Bachelor,1288743.0,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,WW63253,Bachelor,764586.0,0.0,106.0,0,Corporate Auto,SUV,529.881344,California,M
4,GA49547,High School or Below,536308.0,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323,Washington,M
...,...,...,...,...,...,...,...,...,...,...,...
9132,LA72316,Bachelor,23406.0,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
9133,PK87824,College,3097.0,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
9134,TD14365,Bachelor,8164.0,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
9135,UP19263,College,7524.0,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M
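
Because `drop_duplicates()` returns a new frame instead of modifying the existing one, the result has to be kept for the removal to take effect. A minimal sketch on a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row.
toy = pd.DataFrame({
    "Customer": ["AA1", "AA1", "BB2"],
    "Income": [10.0, 10.0, 20.0],
})

# Keep the returned frame and renumber the rows from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Here `toy` still has three rows while `deduped` has two.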


### Activity 2

- Bucketing the data – Write a function that maps the values in the "State" column to zones: California to West Region, Oregon to North West, Washington to East, and Arizona and Nevada to Central
- Standardizing the data – Use string functions to standardize the text data (lower case)

In [19]:
print(data.columns)
print(data.dtypes)

Index(['Customer', 'Education', 'Customer Lifetime Value', 'Income',
       'Monthly Premium Auto', 'Number of Open Complaints', 'Policy Type',
       'Vehicle Class', 'Total Claim Amount', 'State', 'Gender'],
      dtype='object')
Customer                      object
Education                     object
Customer Lifetime Value      float64
Income                       float64
Monthly Premium Auto         float64
Number of Open Complaints      int64
Policy Type                   object
Vehicle Class                 object
Total Claim Amount           float64
State                         object
Gender                        object
dtype: object


In [20]:
col = ['Customer','Education','Policy Type','Vehicle Class','State','Gender']

In [21]:

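# NOTE: .str.lower() returns a new Series; without assigning it back
# (data[i] = data[i].str.lower()) the columns stay unchanged, as the output shows
for i in col:
    data[i].str.lower()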
    
#data.columns = data.columns.str.lower()
data

Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,0.0,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934,Washington,
1,QZ44356,Bachelor,697954.0,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,AI49188,Bachelor,1288743.0,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,WW63253,Bachelor,764586.0,0.0,106.0,0,Corporate Auto,SUV,529.881344,California,M
4,GA49547,High School or Below,536308.0,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323,Washington,M
...,...,...,...,...,...,...,...,...,...,...,...
9132,LA72316,Bachelor,23406.0,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
9133,PK87824,College,3097.0,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
9134,TD14365,Bachelor,8164.0,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
9135,UP19263,College,7524.0,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M
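
The output above is unchanged because the lower-cased Series is never stored. A sketch of the assigning version on a hypothetical frame `toy`; the column list is an assumption for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two text columns and one numeric column.
toy = pd.DataFrame({
    "Education": ["Master", "Bachelor"],
    "Gender": ["F", np.nan],
    "Income": [0.0, 48767.0],
})

text_cols = ["Education", "Gender"]
for name in text_cols:
    # Assign the result back; .str.lower() passes NaN through unchanged.
    toy[name] = toy[name].str.lower()
```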


In [22]:
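# every value is a list, so "state in v" is an exact membership test
# (with a plain string it would be a substring test, e.g. 'cal' in 'california')
cat_zone = {
    'west region' : ['california'],
    'north west'  : ['oregon'],
    'east'        : ['washington'],
    'central'     : ['arizona', 'nevada']
}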

In [23]:
def regroup_location(state: str) -> str:
    for k,v in cat_zone.items():
        if state.lower() in v:
            return k
    return state

In [24]:
#print(list(map(lambda x: x, data['State'])))

In [31]:

data['State'] = list(map(lambda x: regroup_location(x), data['State']))
#data.State = state_list

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['State'] = list(map(lambda x: regroup_location(x), data['State']))


In [30]:
print(regroup_location('california'))
print(regroup_location('nevada'))
data

west region
central


Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,0.0,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934,east,
1,QZ44356,Bachelor,697954.0,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935,central,F
2,AI49188,Bachelor,1288743.0,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247,central,F
3,WW63253,Bachelor,764586.0,0.0,106.0,0,Corporate Auto,SUV,529.881344,west region,M
4,GA49547,High School or Below,536308.0,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323,east,M
...,...,...,...,...,...,...,...,...,...,...,...
9132,LA72316,Bachelor,23406.0,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,west region,M
9133,PK87824,College,3097.0,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,west region,F
9134,TD14365,Bachelor,8164.0,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,west region,M
9135,UP19263,College,7524.0,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,west region,M
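
The zone lookup can also be done by inverting the dictionary once and using `Series.map`. A sketch under the same zone assignments as Activity 2; the names `zones`, `state_to_zone`, and the sample Series `states` are illustrative:

```python
import pandas as pd

# Zone -> states, with every value as a list.
zones = {
    "west region": ["california"],
    "north west": ["oregon"],
    "east": ["washington"],
    "central": ["arizona", "nevada"],
}

# Invert to state -> zone so each lookup is a single dictionary access.
state_to_zone = {state: zone for zone, members in zones.items() for state in members}

# Hypothetical sample, including one state that has no zone.
states = pd.Series(["California", "Oregon", "Washington", "Arizona", "Nevada", "Texas"])

# Unknown states map to NaN; fall back to the original value for those.
regions = states.str.lower().map(state_to_zone).fillna(states)
```

As in `regroup_location`, a state without a zone is returned unchanged.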


### Activity 3

- Which columns are numerical?
- Which columns are categorical?
- Check and deal with NaN values. (Hint: Replacing null values – Replace missing values with means of the column (for numerical columns).)
- Datetime format – Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.
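
A possible starting point for the first two questions and the datetime task. This is a sketch only: the date column name `Effective To Date` and its `month/day/year` format are assumptions, since no date column appears in the data shown above, and `toy` is a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame; "Effective To Date" is an assumed column name.
toy = pd.DataFrame({
    "Income": [0.0, 48767.0, 36357.0],
    "Policy Type": ["Personal Auto", "Corporate Auto", "Personal Auto"],
    "Effective To Date": ["2/18/11", "1/31/11", "4/2/11"],
})

# Split the columns by dtype: numeric versus everything else.
numerical = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Parse the dates (assumed format), store the month, keep January to March.
dates = pd.to_datetime(toy["Effective To Date"], format="%m/%d/%y")
toy["Month"] = dates.dt.month
first_quarter = toy[toy["Month"].isin([1, 2, 3])]
```

Note that the dtype split puts the unparsed date column among the categorical ones until it is converted.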