# Activities List

Here are some of the tasks you need to perform:

### Activity 1

- [x] Aggregate data into one Data Frame using Pandas.
- [x] Standardizing header names
- [x] Deleting and rearranging columns – delete the column customer as it is only a unique identifier for each row of data
- [x] Working with data types – Check the data types of all the columns and fix the incorrect ones (e.g. customer lifetime value and number of complaints)
- [x] Filtering data and correcting typos – Filter the data in the state and gender columns to standardize the text in those columns
- [x] Removing duplicates
- [x] Replacing null values – Replace missing values with means of the column (for numerical columns)

In [1]:
# setup libraries
import pandas as pd
import numpy as np

In [2]:
# setup
pd.set_option('display.max_rows', 1000)
# pd.get_option('display.max_rows')

Create a helper function for the recurring checks of value counts and unique values in a column.

In [3]:
def check_series(data_frame,serie_name):
    print('\nvalue counts:\n',data_frame[serie_name].value_counts())
    print('\nunique:\n',data_frame[serie_name].unique())
    print('\n',data_frame[serie_name].describe())

In [4]:
# read files
file_1 = pd.read_csv('Data/file1.csv')
file_2 = pd.read_csv('Data/file2.csv')
file_3 = pd.read_csv('Data/file3.csv')

In [5]:
# combine data; .reindex() without arguments is a no-op, so each file keeps its own row labels
data = pd.concat([file_1, file_2, file_3]).reindex()
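
If a fresh, unique row index is wanted after combining the files, `pd.concat` accepts `ignore_index=True`. The following is a minimal sketch on toy frames (the values are made up for illustration, not taken from the CSV files); the rest of this notebook keeps the original labels.

```python
import pandas as pd

# Toy stand-ins for the CSV files (hypothetical values).
part_1 = pd.DataFrame({'Customer': ['A1', 'A2'], 'State': ['Oregon', 'Nevada']})
part_2 = pd.DataFrame({'Customer': ['B1'], 'State': ['Arizona']})

# Without ignore_index, each part keeps its own 0..n-1 labels, so labels repeat.
kept = pd.concat([part_1, part_2])
# With ignore_index=True, the result gets a fresh 0..n-1 index.
fresh = pd.concat([part_1, part_2], ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0]
print(fresh.index.tolist())  # [0, 1, 2]
```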

##### The combined data contains duplicate columns for state and gender
> First copy all values into the preferred columns, "Gender" and "State".
>
> Then delete the redundant columns "ST" and "GENDER".

In [6]:
data['Gender'] = list(map(lambda x, y: x if x == x else y, data['Gender'],data['GENDER']))
data['State'] = list(map(lambda x, y: x if x == x else y, data['State'], data['ST']))

# drop GENDER and ST because they hold duplicate information
data.drop(columns=['ST','GENDER'], inplace=True)
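
The same "take the preferred column, fall back to the other one" logic can be sketched with `Series.fillna`. This is an illustrative alternative on a toy frame with a unique index, not the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the two overlapping columns (hypothetical values).
toy = pd.DataFrame({
    'Gender': ['F', np.nan, np.nan],
    'GENDER': [np.nan, 'Male', np.nan],
})

# Take 'Gender' where present, otherwise fall back to 'GENDER'.
toy['Gender'] = toy['Gender'].fillna(toy['GENDER'])
toy = toy.drop(columns=['GENDER'])

print(toy['Gender'].tolist())  # ['F', 'Male', nan]
```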

In [7]:
data.head()

Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,Washington,
1,QZ44356,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,AI49188,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,WW63253,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,California,M
4,GA49547,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,Washington,M


##### ***Step:*** get an overview of the values in Gender and State
>***decision:***
>
> - keep M for male and F for female
>
> - keep State as the location column

In [8]:
# which kind of values are in Gender
data['Gender'].unique()

array([nan, 'F', 'M', 'Femal', 'Male', 'female'], dtype=object)

In [9]:
F = ['Femal', 'female', 'F']
M = ['Male', 'M']
data['Gender'] = list(map(lambda x: 'M' if x==x and x in M else ('F' if x==x and x in F else x), data['Gender']))
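
Because every spelling seen in the output above starts with the letter of its category, the mapping can also be sketched by keeping the upper-cased first character. This shortcut is an assumption that only holds for the values listed here; other spellings would need an explicit mapping.

```python
import numpy as np
import pandas as pd

# The unique values reported for the Gender column.
gender = pd.Series([np.nan, 'F', 'M', 'Femal', 'Male', 'female'])

# Keep the first character and upper-case it; NaN stays NaN.
normalized = gender.str[0].str.upper()

print(normalized.tolist())  # [nan, 'F', 'M', 'F', 'M', 'F']
```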

In [10]:
# check result
check_series(data, 'Gender')
data['Gender'].isnull().sum()


value counts:
 F    4607
M    4408
Name: Gender, dtype: int64

unique:
 [nan 'F' 'M']

 count     9015
unique       2
top          F
freq      4607
Name: Gender, dtype: object


3059

- Gender still has 3059 null values.
- This will be addressed after removing duplicates and cleaning the other columns, since some of these rows will already have been removed by then.

#### State

In [11]:
# fix State values
check_series(data, 'State')


value counts:
 California    3032
Oregon        2601
Arizona       1630
Nevada         882
Washington     768
Cali           120
AZ              74
WA              30
Name: State, dtype: int64

unique:
 ['Washington' 'Arizona' 'Nevada' 'California' 'Oregon' 'Cali' 'AZ' 'WA'
 nan]

 count           9137
unique             8
top       California
freq            3032
Name: State, dtype: object


As the output shows, the state names need to be standardized to a single spelling.

In [12]:
# run lambda map functions to bring all states to the same standard

data['State'] = list(map(lambda x: x if x != 'AZ' else 'Arizona', data['State']))
data['State'] = list(map(lambda x: x if x != 'Cali' else 'California', data['State']))
data['State'] = list(map(lambda x: x if x != 'WA' else 'Washington', data['State']))
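
The three passes above can be collapsed into a single lookup with `Series.replace` and a dictionary. A minimal sketch on a toy series; the `aliases` name is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

state = pd.Series(['Washington', 'Cali', 'AZ', 'WA', np.nan, 'Oregon'])

# Abbreviation -> full state name; values not listed are left untouched.
aliases = {'AZ': 'Arizona', 'Cali': 'California', 'WA': 'Washington'}
standardized = state.replace(aliases)

print(standardized.tolist())
# ['Washington', 'California', 'Arizona', 'Washington', nan, 'Oregon']
```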

In [13]:
check_series(data, 'State')


value counts:
 California    3152
Oregon        2601
Arizona       1704
Nevada         882
Washington     798
Name: State, dtype: int64

unique:
 ['Washington' 'Arizona' 'Nevada' 'California' 'Oregon' nan]

 count           9137
unique             5
top       California
freq            3152
Name: State, dtype: object


In [14]:
data['State'].isnull().sum()

2937

State still has 2937 null values; they are handled in the same way as the Gender column.

#### Customer Lifetime Value

In [15]:
check_series(data, 'Customer Lifetime Value')


value counts:
 16468.220790    6
5246.278375     6
22332.439460    6
4270.034394     6
5107.163002     6
               ..
7477.176362     1
15700.284360    1
2968.077571     1
5452.171237     1
2611.836866     1
Name: Customer Lifetime Value, Length: 8211, dtype: int64

unique:
 [nan '697953.59%' '1288743.17%' ... 8163.890428 7524.442436 2611.836866]

 count      9130.00000
unique     8211.00000
top       16468.22079
freq          6.00000
Name: Customer Lifetime Value, dtype: float64


> Some values carry a trailing percent sign.
>
> It is removed with a string operation.
>
> Afterwards the series is cast to a float data type.

In [16]:
# remove the percent sign and cast to float; NaN becomes 0.0
data['Customer Lifetime Value'] = list(map(lambda x: float(str(x).strip('%\r\t\n')) if x==x else float(0), data['Customer Lifetime Value']))

# make sure the dtype is float and round to 2 decimals
data['Customer Lifetime Value'] = data['Customer Lifetime Value'].astype('float').round(2)

# debug
print(data.dtypes)
print()
check_series(data, 'Customer Lifetime Value')

Customer                      object
Education                     object
Customer Lifetime Value      float64
Income                       float64
Monthly Premium Auto         float64
Number of Open Complaints     object
Policy Type                   object
Vehicle Class                 object
Total Claim Amount           float64
State                         object
Gender                        object
dtype: object


value counts:
 0.00        2944
6689.02        6
22332.44       6
5568.95        6
5246.28        6
            ... 
11875.90       1
10634.84       1
5184.95        1
16373.73       1
2611.84        1
Name: Customer Lifetime Value, Length: 8188, dtype: int64

unique:
 [      0.    697953.59 1288743.17 ...    8163.89    7524.44    2611.84]

 count    1.207400e+04
mean     1.377715e+05
std      3.914222e+05
min      0.000000e+00
25%      2.236980e+03
50%      5.334900e+03
75%      1.326808e+04
max      5.816655e+06
Name: Customer Lifetime Value, dtype: float64
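
A vectorized variant of the percent-sign cleanup could look like the sketch below. Unlike the cell above, it keeps missing values as NaN instead of turning them into 0.0; that difference is a deliberate assumption of this example, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Mixed strings and floats, as in the raw column (sample values from the output above).
clv = pd.Series([np.nan, '697953.59%', '1288743.17%', 8163.890428], dtype=object)

# Strip a trailing '%' from the text form, then parse; unparseable text becomes NaN.
cleaned = pd.to_numeric(clv.astype(str).str.rstrip('%'), errors='coerce').round(2)

print(cleaned.tolist())  # [nan, 697953.59, 1288743.17, 8163.89]
```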


In [17]:
data

Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,RB50392,Master,0.00,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934,Washington,
1,QZ44356,Bachelor,697953.59,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,AI49188,Bachelor,1288743.17,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,WW63253,Bachelor,764586.18,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344,California,M
4,GA49547,High School or Below,536307.65,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323,Washington,M
...,...,...,...,...,...,...,...,...,...,...,...
7065,LA72316,Bachelor,23405.99,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
7066,PK87824,College,3096.51,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
7067,TD14365,Bachelor,8163.89,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
7068,UP19263,College,7524.44,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M


#### Number of Open Complaints

> There seem to be two formats: the plain number '0' to '5', and a date-like string '1/N/00' with N from 0 to 5, e.g. '1/2/00'.
>
> Strip the '1/' prefix and the '/00' suffix with a string operation.


In [18]:
# checking NaN 
data['Number of Open Complaints'].isna().sum()

2937

In [19]:
# the column contains NaN, which must be filled first,
# otherwise the string operation raises an error
# note: this assignment sets every column of the affected rows to 0, not only this one
data.loc[data['Number of Open Complaints'].isnull()] = 0

data['Number of Open Complaints'] = list(map(lambda x: int(x[2]) if x==x and str(x).startswith('1/') and str(x).endswith('/00') else int(x), data['Number of Open Complaints']))

# use pandas function to cast series to numbers (the result must be assigned back to take effect)
data['Number of Open Complaints'] = pd.to_numeric(data['Number of Open Complaints'], errors='coerce')

# check result
check_series(data, 'Number of Open Complaints')


value counts:
 0    10192
1     1012
2      376
3      290
4      148
5       56
Name: Number of Open Complaints, dtype: int64

unique:
 [0 2 1 3 5 4]

 count    12074.000000
mean         0.290376
std          0.807688
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          5.000000
Name: Number of Open Complaints, dtype: float64
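
The parsing of the two formats can also be sketched with a regular expression that pulls the middle number out of '1/N/00'. In this illustrative version the missing values are filled with 0 in this one column only; the toy values are assumptions.

```python
import numpy as np
import pandas as pd

raw = pd.Series(['1/0/00', '1/2/00', 0, 3, np.nan], dtype=object)

text = raw.astype(str)
# Extract N from '1/N/00'; values that do not match the pattern yield NaN here.
middle = text.str.extract(r'^1/(\d+)/00$', expand=False)
# Fall back to the plain value where the pattern did not match.
complaints = pd.to_numeric(middle.fillna(text), errors='coerce')
# Treat still-missing entries as 0, for this column only.
complaints = complaints.fillna(0).astype(int)

print(complaints.tolist())  # [0, 2, 0, 3, 0]
```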


#### Education

In [20]:
check_series(data, 'Education')


value counts:
 0                       2937
Bachelor                2719
College                 2682
High School or Below    2616
Master                   752
Doctor                   344
Bachelors                 24
Name: Education, dtype: int64

unique:
 ['Master' 'Bachelor' 'High School or Below' 'College' 'Bachelors' 'Doctor'
 0]

 count     12074
unique        7
top           0
freq       2937
Name: Education, dtype: int64


> Only a small adjustment is needed: merge 'Bachelors' into 'Bachelor'.

In [21]:
# replace 'Bachelors' with 'Bachelor'
data['Education'] = list(map(lambda x: x if x != 'Bachelors' else 'Bachelor', data['Education']))

check_series(data, 'Education')


value counts:
 0                       2937
Bachelor                2743
College                 2682
High School or Below    2616
Master                   752
Doctor                   344
Name: Education, dtype: int64

unique:
 ['Master' 'Bachelor' 'High School or Below' 'College' 'Doctor' 0]

 count     12074
unique        6
top           0
freq       2937
Name: Education, dtype: int64


As the check function shows, 2937 entries still carry no education information (value 0).
They will be addressed after the duplicates are removed.

#### Duplicates

> Find and remove all duplicate entries.
> - This is the reason for keeping 'Customer' as long as possible, as it helps to tell rows apart.
> - If 'Customer' were deleted first, distinct rows with otherwise identical values would turn into duplicates.

In [22]:
data.duplicated().sum()

2939

The number of duplicates roughly matches the count of 0 and NaN entries seen in the previous checks.

*check duplicate entries:*

In [23]:
data[data.duplicated()]

Unnamed: 0,Customer,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
1072,0,0,0.00,0.0,0.0,0,0,0,0.0,0,0
1073,0,0,0.00,0.0,0.0,0,0,0,0.0,0,0
1074,0,0,0.00,0.0,0.0,0,0,0,0.0,0,0
1075,0,0,0.00,0.0,0.0,0,0,0,0.0,0,0
1076,0,0,0.00,0.0,0.0,0,0,0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
4006,0,0,0.00,0.0,0.0,0,0,0,0.0,0,0
4007,0,0,0.00,0.0,0.0,0,0,0,0.0,0,0
0,GS98873,Bachelor,323912.47,16061.0,88.0,0,Personal Auto,Four-Door Car,633.6,Arizona,F
1,CW49887,Master,462680.11,79487.0,114.0,0,Special Auto,SUV,547.2,California,F


> The majority of the duplicated rows are just all-zero entries.
>
> To confirm the last three entries, we double-check the Customer series.

In [24]:
check_series(data, 'Customer')


value counts:
 0          2937
QD28391       2
PY42157       2
FM14335       2
HX77930       2
           ... 
KG49115       1
NT78297       1
TN66124       1
JV99867       1
Y167826       1
Name: Customer, Length: 9057, dtype: int64

unique:
 ['RB50392' 'QZ44356' 'AI49188' ... 'TD14365' 'UP19263' 'Y167826']

 count     12074
unique     9057
top           0
freq       2937
Name: Customer, dtype: int64


*There are duplicate entries in 'Customer', so we now delete these duplicates and then the 'Customer' column.*

In [25]:
data.drop_duplicates(inplace=True)
data = data.drop(columns='Customer')
data

Unnamed: 0,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
0,Master,0.00,0.0,1000.0,0,Personal Auto,Four-Door Car,2.704934,Washington,
1,Bachelor,697953.59,0.0,94.0,0,Personal Auto,Four-Door Car,1131.464935,Arizona,F
2,Bachelor,1288743.17,48767.0,108.0,0,Personal Auto,Two-Door Car,566.472247,Nevada,F
3,Bachelor,764586.18,0.0,106.0,0,Corporate Auto,SUV,529.881344,California,M
4,High School or Below,536307.65,36357.0,68.0,0,Personal Auto,Four-Door Car,17.269323,Washington,M
...,...,...,...,...,...,...,...,...,...,...
7065,Bachelor,23405.99,71941.0,73.0,0,Personal Auto,Four-Door Car,198.234764,California,M
7066,College,3096.51,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.200000,California,F
7067,Bachelor,8163.89,0.0,85.0,3,Corporate Auto,Four-Door Car,790.784983,California,M
7068,College,7524.44,21941.0,96.0,0,Personal Auto,Four-Door Car,691.200000,California,M
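
Why the order "remove duplicates first, drop the identifier second" matters can be illustrated on a toy frame (hypothetical values): two different customers with identical attributes survive in one order and collapse into one row in the other.

```python
import pandas as pd

toy = pd.DataFrame({
    'Customer': ['AA1', 'AA1', 'BB2', 'CC3'],
    'Education': ['Master', 'Master', 'College', 'College'],
    'Income': [100.0, 100.0, 200.0, 200.0],
})

# Order used in this notebook: deduplicate first, then drop the identifier.
dedup_first = toy.drop_duplicates().drop(columns='Customer')
# Reverse order: distinct customers with equal attributes collapse into one row.
drop_first = toy.drop(columns='Customer').drop_duplicates()

print(len(dedup_first))  # 3
print(len(drop_first))   # 2
```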


#### final steps for Activity 1

> double check data

In [26]:
# check all fields
# note: reset_index returns a new frame; the result is not assigned, so the index stays unchanged
data.reset_index(drop=True)
for i in data.columns:
    print('\ncheck:', i)
    check_series(data, i)


check: Education

value counts:
 Bachelor                2742
College                 2681
High School or Below    2616
Master                   751
Doctor                   344
0                          1
Name: Education, dtype: int64

unique:
 ['Master' 'Bachelor' 'High School or Below' 'College' 'Doctor' 0]

 count         9135
unique           6
top       Bachelor
freq          2742
Name: Education, dtype: object

check: Customer Lifetime Value

value counts:
 0.00        8
4686.47     6
2300.69     6
9095.05     6
5246.28     6
           ..
10634.84    1
5184.95     1
16373.73    1
10477.78    1
2611.84     1
Name: Customer Lifetime Value, Length: 8188, dtype: int64

unique:
 [      0.    697953.59 1288743.17 ...    8163.89    7524.44    2611.84]

 count    9.135000e+03
mean     1.819121e+05
std      4.408862e+05
min      0.000000e+00
25%      4.644155e+03
50%      7.712060e+03
75%      2.612852e+04
max      5.816655e+06
Name: Customer Lifetime Value, dtype: float64

check: Inc

Issues:
- Education: 1x entry with value 0 -> delete
- Policy Type: 1x entry with value 0 -> delete and cast series to string
- Vehicle Class: 1x entry with value 0 -> delete and cast to string
- Total Claim Amount -> round(2)
- State: 1x entry with value 0 -> delete
- Gender: 1x entry with value 0 -> delete
- there are still NaN entries -> delete

In [27]:
# locate the zero entry (it is deleted in the next cell)
res = data[data['Gender'] == 0]
data.iloc[res.index,:]

Unnamed: 0,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
1071,0,0.0,0.0,0.0,0,0,0,0.0,0,0


In [28]:
data.drop(index=res.index, inplace=True)
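
One caveat: the combined frame appears to carry repeated row labels (the duplicate listing earlier shows labels 0 and 1 again after 4007), and dropping by label removes every row that shares the label. A boolean mask avoids this. The sketch below uses a toy frame with a deliberately repeated label to show the difference.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Gender': ['F', 0, 'M'], 'State': ['Oregon', 0, 'Nevada']},
    index=[7, 7, 8],  # label 7 is deliberately repeated
)

# drop(index=...) removes every row carrying the label, including the valid one.
by_label = toy.drop(index=toy[toy['Gender'] == 0].index)
# A boolean mask removes only the rows that match the condition.
by_mask = toy[toy['Gender'] != 0]

print(len(by_label))  # 1
print(len(by_mask))   # 2
```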

In [29]:
# Policy Type cast to string
data['Policy Type'] = data['Policy Type'].astype('string')
# Total claim amount round(2)
data['Total Claim Amount'] = data['Total Claim Amount'].round(2)

In [30]:
data.isna().sum()

Education                      0
Customer Lifetime Value        0
Income                         0
Monthly Premium Auto           0
Number of Open Complaints      0
Policy Type                    0
Vehicle Class                  0
Total Claim Amount             0
State                          0
Gender                       122
dtype: int64

In [31]:
# remove all nan rows -> still 122 Gender entries NaN
data.dropna(inplace=True)
data.isna().sum()
data

Unnamed: 0,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount,State,Gender
1,Bachelor,697953.59,0.0,94.0,0,Personal Auto,Four-Door Car,1131.46,Arizona,F
2,Bachelor,1288743.17,48767.0,108.0,0,Personal Auto,Two-Door Car,566.47,Nevada,F
3,Bachelor,764586.18,0.0,106.0,0,Corporate Auto,SUV,529.88,California,M
4,High School or Below,536307.65,36357.0,68.0,0,Personal Auto,Four-Door Car,17.27,Washington,M
5,Bachelor,825629.78,62902.0,69.0,0,Personal Auto,Two-Door Car,159.38,Oregon,F
...,...,...,...,...,...,...,...,...,...,...
7065,Bachelor,23405.99,71941.0,73.0,0,Personal Auto,Four-Door Car,198.23,California,M
7066,College,3096.51,21604.0,79.0,0,Corporate Auto,Four-Door Car,379.20,California,F
7067,Bachelor,8163.89,0.0,85.0,3,Corporate Auto,Four-Door Car,790.78,California,M
7068,College,7524.44,21941.0,96.0,0,Personal Auto,Four-Door Car,691.20,California,M


In [32]:
data.describe()

Unnamed: 0,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Total Claim Amount
count,9011.0,9011.0,9011.0,9011.0,9011.0
mean,174080.9,37835.331151,109.878593,0.381867,430.82583
std,429552.1,30385.533173,581.497481,0.90828,290.313469
min,0.0,0.0,61.0,0.0,0.1
25%,4618.645,0.0,68.0,0.0,266.16
50%,7613.05,34317.0,83.0,0.0,377.35
75%,22986.48,62458.0,109.0,0.0,547.2
max,4922143.0,99981.0,35354.0,5.0,2893.24


### Activity 2

- [x] Bucketing the data - Write a function to replace column "State" to different zones. California as West Region, Oregon as North West, and Washington as East, and Arizona and Nevada as Central
- [x] Standardizing the data – Use string functions to standardize the text data (lower case)

##### ***Format everything in lower case***

In [33]:
print(data.columns)

Index(['Education', 'Customer Lifetime Value', 'Income',
       'Monthly Premium Auto', 'Number of Open Complaints', 'Policy Type',
       'Vehicle Class', 'Total Claim Amount', 'State', 'Gender'],
      dtype='object')


In [34]:
# get all columns
col = data.columns

for i in col:
    if data[i].dtypes == np.dtype('O'):  # matches object-dtype columns only; 'string' dtype columns are skipped
        data[i] = data[i].str.lower() 

# lower case column header        
data.columns = col.str.lower()
#debug
data.head()

Unnamed: 0,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount,state,gender
1,bachelor,697953.59,0.0,94.0,0,Personal Auto,four-door car,1131.46,arizona,f
2,bachelor,1288743.17,48767.0,108.0,0,Personal Auto,two-door car,566.47,nevada,f
3,bachelor,764586.18,0.0,106.0,0,Corporate Auto,suv,529.88,california,m
4,high school or below,536307.65,36357.0,68.0,0,Personal Auto,four-door car,17.27,washington,m
5,bachelor,825629.78,62902.0,69.0,0,Personal Auto,two-door car,159.38,oregon,f
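
In the output above, the policy type values keep their capitalization: that column was cast to the pandas 'string' dtype earlier, which is not equal to `np.dtype('O')`. A possible way to cover both kinds of text columns is `pandas.api.types.is_string_dtype`, sketched here on a toy frame (values are illustrative).

```python
import pandas as pd
from pandas.api.types import is_string_dtype

toy = pd.DataFrame({
    'Policy Type': pd.Series(['Personal Auto', 'Corporate Auto'], dtype='string'),
    'Vehicle Class': ['SUV', 'Four-Door Car'],
    'Income': [0.0, 48767.0],
})

# Lower-case every text column, whether it is object or 'string' dtype.
for name in toy.columns:
    if is_string_dtype(toy[name]):
        toy[name] = toy[name].str.lower()

# Lower-case the column headers as well.
toy.columns = toy.columns.str.lower()

print(toy['policy type'].tolist())  # ['personal auto', 'corporate auto']
```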


#### creating 'State zones'

In [35]:
cat_zone = {
    'west region' : 'california',
    'north west'  : 'oregon',
    'east'        : 'washington',
    'central'     :['arizona', 'nevada']
}

Define a helper function for categorizing the states.

In [36]:
def regroup_location(state: str) -> str:
    # return the zone whose state (or list of states) matches; unknown values pass through
    for k, v in cat_zone.items():
        if state == v or (isinstance(v, list) and state in v):
            return k
    return state

In [37]:
# reorganize the states into their zones
data['state'] = list(map(regroup_location, data['state']))

In [38]:
data.rename(columns={'state':'zone'}, inplace=True)

In [39]:
# debug check
data.head()

Unnamed: 0,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount,zone,gender
1,bachelor,697953.59,0.0,94.0,0,Personal Auto,four-door car,1131.46,central,f
2,bachelor,1288743.17,48767.0,108.0,0,Personal Auto,two-door car,566.47,central,f
3,bachelor,764586.18,0.0,106.0,0,Corporate Auto,suv,529.88,west region,m
4,high school or below,536307.65,36357.0,68.0,0,Personal Auto,four-door car,17.27,east,m
5,bachelor,825629.78,62902.0,69.0,0,Personal Auto,two-door car,159.38,north west,f
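
The zone bucketing can also be sketched as a plain lookup: invert the zone dictionary into a state-to-zone mapping and apply it with `Series.map`. Note one behavioral difference that this sketch assumes is acceptable: states missing from the mapping become NaN instead of being passed through. The `state_to_zone` name is introduced for illustration.

```python
import pandas as pd

# Zone -> list of states, following the task description.
cat_zone = {
    'west region': ['california'],
    'north west': ['oregon'],
    'east': ['washington'],
    'central': ['arizona', 'nevada'],
}
# Invert to state -> zone for a direct lookup.
state_to_zone = {state: zone for zone, states in cat_zone.items() for state in states}

state = pd.Series(['arizona', 'california', 'washington', 'oregon', 'nevada'])
zone = state.map(state_to_zone)

print(zone.tolist())
# ['central', 'west region', 'east', 'north west', 'central']
```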


### Activity 3

- Which columns are numerical?
- Which columns are categorical?
- Check and deal with NaN values. (Hint: Replacing null values – Replace missing values with means of the column (for numerical columns)).
- Datetime format - Extract the months from the dataset and store in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.

- Which columns are numerical?

In [40]:
num = list(data.describe())
num

['customer lifetime value',
 'income',
 'monthly premium auto',
 'number of open complaints',
 'total claim amount']

- Which columns are categorical?

In [41]:
print([x for x in data.columns if x not in num])

['education', 'policy type', 'vehicle class', 'zone', 'gender']
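
An alternative way to split the columns by kind is `DataFrame.select_dtypes`. A minimal sketch on a toy frame with a few of the column names used here (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'education': ['bachelor', 'college'],
    'income': [0.0, 21604.0],
    'number of open complaints': [0, 3],
    'zone': ['central', 'east'],
})

# Numeric dtypes (int, float) versus everything else.
numerical = toy.select_dtypes(include='number').columns.tolist()
categorical = toy.select_dtypes(exclude='number').columns.tolist()

print(numerical)    # ['income', 'number of open complaints']
print(categorical)  # ['education', 'zone']
```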


- Check and deal with NaN values. (Hint: Replacing null values – Replace missing values with means of the column (for numerical columns)).

In [42]:
data['income'].describe()

count     9011.000000
mean     37835.331151
std      30385.533173
min          0.000000
25%          0.000000
50%      34317.000000
75%      62458.000000
max      99981.000000
Name: income, dtype: float64

In [43]:
# helper function
def fill_with_mean(series_name):
    """Return the values as a list, with every 0 replaced by the series mean (zeros included in the mean)."""
    mean = series_name.mean()  # compute once instead of once per element
    print(series_name.name, 'mean:', mean)
    return list(map(lambda x: x if x != 0 else mean, series_name))


In [44]:
# fixing income
data['income'] = fill_with_mean(data['income'])
data['income'] = data['income'].astype('float').round(0)
data['income'].describe()

income mean: 37835.33115081567


count     9011.000000
mean     47362.310953
std      21725.736150
min      10037.000000
25%      34496.500000
50%      37835.000000
75%      62458.000000
max      99981.000000
Name: income, dtype: float64
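
The helper treats 0 as "missing" and fills it with a mean that still includes those zeros. If the zeros should not pull the mean down, one option is to mark them as NaN first and then use `fillna`. This is a sketch of a different design choice on toy values, not the method used for the results shown here.

```python
import numpy as np
import pandas as pd

income = pd.Series([0.0, 48767.0, 0.0, 36357.0])

# Mark zeros as missing so they are excluded from the mean.
as_missing = income.replace(0, np.nan)
# Fill the gaps with the mean of the remaining values.
filled = as_missing.fillna(as_missing.mean())

print(filled.tolist())  # [42562.0, 48767.0, 42562.0, 36357.0]
```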

In [45]:
# fixing total claim amount
data['total claim amount'] = fill_with_mean(data['total claim amount'])

total claim amount mean: 430.8258295416686


In [46]:
data['total claim amount'].describe()

count    9011.000000
mean      430.825830
std       290.313469
min         0.100000
25%       266.160000
50%       377.350000
75%       547.200000
max      2893.240000
Name: total claim amount, dtype: float64

In [47]:
data['customer lifetime value'].describe()

count    9.011000e+03
mean     1.740809e+05
std      4.295521e+05
min      0.000000e+00
25%      4.618645e+03
50%      7.613050e+03
75%      2.298648e+04
max      4.922143e+06
Name: customer lifetime value, dtype: float64

In [48]:
# note: the return value is not assigned here, so the column is left unchanged
fill_with_mean(data['customer lifetime value'])
data['customer lifetime value'].describe()

customer lifetime value mean: 174080.91236932587


count    9.011000e+03
mean     1.740809e+05
std      4.295521e+05
min      0.000000e+00
25%      4.618645e+03
50%      7.613050e+03
75%      2.298648e+04
max      4.922143e+06
Name: customer lifetime value, dtype: float64

In [49]:
# round customer lifetime value
data['customer lifetime value'] = data['customer lifetime value'].round(2)

In [50]:
# store as file
data.to_csv('data/work_file.csv')
data

Unnamed: 0,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount,zone,gender
1,bachelor,697953.59,37835.0,94.0,0,Personal Auto,four-door car,1131.46,central,f
2,bachelor,1288743.17,48767.0,108.0,0,Personal Auto,two-door car,566.47,central,f
3,bachelor,764586.18,37835.0,106.0,0,Corporate Auto,suv,529.88,west region,m
4,high school or below,536307.65,36357.0,68.0,0,Personal Auto,four-door car,17.27,east,m
5,bachelor,825629.78,62902.0,69.0,0,Personal Auto,two-door car,159.38,north west,f
...,...,...,...,...,...,...,...,...,...,...
7065,bachelor,23405.99,71941.0,73.0,0,Personal Auto,four-door car,198.23,west region,m
7066,college,3096.51,21604.0,79.0,0,Corporate Auto,four-door car,379.20,west region,f
7067,bachelor,8163.89,37835.0,85.0,3,Corporate Auto,four-door car,790.78,west region,m
7068,college,7524.44,21941.0,96.0,0,Personal Auto,four-door car,691.20,west region,m


- Datetime format - Extract the months from the dataset and store in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February and March. Hint: If data from March does not exist, consider only January and February.

## No date or time data found in the dataset
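
For reference, if a date column were available, the month extraction and first-quarter filter could be sketched as follows. The column name `effective to date`, its `month/day/year` format, and all values are hypothetical; nothing like it exists in the files used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'effective to date': ['2/18/11', '1/31/11', '4/2/11'],  # hypothetical column
    'total claim amount': [1131.46, 566.47, 529.88],
})

# Parse the text into datetimes (assumed format: month/day/two-digit year).
toy['effective to date'] = pd.to_datetime(toy['effective to date'], format='%m/%d/%y')
# Store the month number in a separate column.
toy['month'] = toy['effective to date'].dt.month
# Keep only January, February and March.
first_quarter = toy[toy['month'].isin([1, 2, 3])]

print(toy['month'].tolist())  # [2, 1, 4]
print(len(first_quarter))     # 2
```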