# Lab | Customer Analysis Round 1

#### Remember the process:

1. Case Study
2. Get data
3. Cleaning/Wrangling/EDA
4. Processing Data
5. Modeling
6. Validation
7. Reporting

### Abstract

The goal of this analysis is to understand customer demographics and buying behavior. Later in the week, we will use predictive analytics to identify the most profitable customers and how they interact. After that, we will take targeted actions to increase the response, retention, and growth of profitable customers.

For this lab, we will gather the data from three _csv_ files provided in the `files_for_lab` folder. Use that data to complete the data cleaning tasks listed in the instructions below.

### Instructions

- Read the three files into Python as DataFrames.
- Show the DataFrame's shape.
- Standardize header names.
- Rearrange the columns in the dataframe as needed
- Concatenate the three dataframes
- Which columns are numerical?
- Which columns are categorical?
- Understand the meaning of all columns
- Perform the data cleaning operations mentioned so far in class

  - Delete the column education and the number of open complaints from the dataframe.
  - Correct the values in the column customer lifetime value. They are given as a percent, so multiply them by 100 and convert the column to a numeric `dtype`.
  - Check for duplicate rows in the data and remove if any.
  - Filter out the data for customers who have an income of 0 or less.

In [1]:
# Your solution to the LAB here:

In [2]:
import pandas as pd
import numpy as np

Read the three files:

In [3]:
file1 = pd.read_csv('./files_for_lab/file1.csv')
file2 = pd.read_csv('./files_for_lab/file2.csv')
file3 = pd.read_csv('./files_for_lab/file3.csv')

Check the shape of each file:

In [4]:
file1.shape

(4008, 11)

In [5]:
file2.shape

(996, 11)

In [6]:
file3.shape

(7070, 11)

Arrange the columns and concatenate the datasets (this cell concatenates only `file1` and `file2`):

In [7]:
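# Start from an empty frame that carries file1's column order
column_names = file1.columns
data = pd.DataFrame(columns=column_names)
# Stack file1 and file2 row-wise; file3 is not included in this cell
data = pd.concat([data, file1, file2], axis=0)
data.shape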

(5004, 11)
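`file3` is read but is not part of the concatenation above. If a third file uses different header names or column order (an assumption; its headers are not shown in this notebook), one possible approach is to rename and reorder its columns before concatenating. The sketch below uses small made-up frames, and the names `first`, `third`, and `third_aligned` are hypothetical:

```python
import pandas as pd

# Toy frames standing in for the lab files (values are made up).
first = pd.DataFrame({"Customer": ["A1", "B2"], "ST": ["Nevada", "Oregon"]})
third = pd.DataFrame({"State": ["Arizona"], "Customer": ["C3"]})

# Give the third frame the same header names, then the same column order.
third_aligned = third.rename(columns={"State": "ST"})[first.columns]

# ignore_index=True rebuilds the index as 0..n-1 instead of repeating labels.
combined = pd.concat([first, third_aligned], axis=0, ignore_index=True)
print(combined.shape)
```

Using `ignore_index=True` avoids the repeated index labels visible in the output above, where the index restarts at 0 for each file.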

In [8]:
data

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,Master,847141.75%,63513,70,1/0/00,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,College,543121.91%,58161,68,1/0/00,Corporate Auto,Four-Door Car,140.747286
993,IL40123,Nevada,F,College,568964.41%,83640,70,1/0/00,Corporate Auto,Two-Door Car,471.050488
994,MY32149,California,F,Master,368672.38%,0,96,1/0/00,Personal Auto,Two-Door Car,28.460568


In [9]:
# Lowercase every header
data.columns = [col.lower() for col in data.columns]
# Rename the abbreviated 'st' header to 'state'
data = data.rename(columns={'st': 'state'})

In [10]:
data.columns

Index(['customer', 'state', 'gender', 'education', 'customer lifetime value',
       'income', 'monthly premium auto', 'number of open complaints',
       'policy type', 'vehicle class', 'total claim amount'],
      dtype='object')
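The headers above are lowercased but still contain spaces. A common further step, shown here only as a sketch on an empty toy frame, is to strip surrounding whitespace and replace spaces with underscores using the vectorized string methods on the column index. The variable name `frame` is hypothetical:

```python
import pandas as pd

# Empty toy frame with headers similar to the lab data.
frame = pd.DataFrame(columns=["Customer", "ST", "Customer Lifetime Value"])

# Strip whitespace, lowercase, and replace spaces with underscores.
frame.columns = frame.columns.str.strip().str.lower().str.replace(" ", "_")

# Expand the abbreviated header.
frame = frame.rename(columns={"st": "state"})
print(list(frame.columns))
```

Underscored names can be accessed as attributes (for example `frame.customer_lifetime_value`), which some people find more convenient than bracket access with spaces.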

In [11]:
datatypes = data.dtypes
datatypes

customer                     object
state                        object
gender                       object
education                    object
customer lifetime value      object
income                       object
monthly premium auto         object
number of open complaints    object
policy type                  object
vehicle class                object
total claim amount           object
dtype: object

Types of the columns:

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5004 entries, 0 to 995
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   customer                   2067 non-null   object
 1   state                      2067 non-null   object
 2   gender                     1945 non-null   object
 3   education                  2067 non-null   object
 4   customer lifetime value    2060 non-null   object
 5   income                     2067 non-null   object
 6   monthly premium auto       2067 non-null   object
 7   number of open complaints  2067 non-null   object
 8   policy type                2067 non-null   object
 9   vehicle class              2067 non-null   object
 10  total claim amount         2067 non-null   object
dtypes: object(11)
memory usage: 469.1+ KB
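Every column is reported as `object`, so the dtypes alone do not answer which columns are numerical and which are categorical. One way to find out, sketched here on made-up data, is to convert the columns expected to hold numbers with `pd.to_numeric` and then split the frame with `select_dtypes`. Which columns to convert is an assumption made for this example:

```python
import pandas as pd

# Toy frame stored entirely as object dtype, like the concatenated data.
sample = pd.DataFrame(
    {
        "state": ["Nevada", "Oregon", "Arizona"],
        "income": ["48767.0", "0", None],
        "total claim amount": [566.47, 17.27, 28.46],
    },
    dtype=object,
)

# Convert the columns assumed to be numeric; unparseable values become NaN.
for column in ["income", "total claim amount"]:
    sample[column] = pd.to_numeric(sample[column], errors="coerce")

numerical = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()
print(numerical, categorical)
```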


Delete the education and number of open complaints columns

In [13]:
del data['number of open complaints']

In [14]:
del data['education']

In [15]:
data

Unnamed: 0,customer,state,gender,customer lifetime value,income,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,697953.59%,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,1288743.17%,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,764586.18%,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,536307.65%,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,847141.75%,63513,70,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,543121.91%,58161,68,Corporate Auto,Four-Door Car,140.747286
993,IL40123,Nevada,F,568964.41%,83640,70,Corporate Auto,Two-Door Car,471.050488
994,MY32149,California,F,368672.38%,0,96,Personal Auto,Two-Door Car,28.460568


In [16]:
data['customer lifetime value'].info()

<class 'pandas.core.series.Series'>
Int64Index: 5004 entries, 0 to 995
Series name: customer lifetime value
Non-Null Count  Dtype 
--------------  ----- 
2060 non-null   object
dtypes: object(1)
memory usage: 78.2+ KB


Modify the customer lifetime value column

In [17]:
# Create another column to see how it would look as a string and compare the original with the new one
#data['customer lifetime value2'] = data['customer lifetime value'].astype(str)

In [18]:
#data.isna().sum()

In [19]:
data

Unnamed: 0,customer,state,gender,customer lifetime value,income,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,697953.59%,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,1288743.17%,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,764586.18%,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,536307.65%,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,847141.75%,63513,70,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,543121.91%,58161,68,Corporate Auto,Four-Door Car,140.747286
993,IL40123,Nevada,F,568964.41%,83640,70,Corporate Auto,Two-Door Car,471.050488
994,MY32149,California,F,368672.38%,0,96,Personal Auto,Two-Door Car,28.460568
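In the output above, the customer lifetime value column still holds strings ending in `%`. A minimal sketch of the conversion, on a made-up Series, is to strip the percent sign and parse the remainder as a number. The names `clv` and `clv_numeric` are hypothetical, and any rescaling asked for in the instructions can then be applied with ordinary arithmetic on the numeric result:

```python
import pandas as pd

# Toy values in the same format as the column, including a missing entry.
clv = pd.Series(["697953.59%", None, "1288743.17%"], dtype=object)

# Remove the trailing percent sign, then parse; missing entries stay NaN.
clv_numeric = pd.to_numeric(clv.str.rstrip("%"), errors="coerce")
print(clv_numeric.dtype)
```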


Remove duplicates based on the 'customer' column

In [20]:
data.drop_duplicates(subset=['customer'], inplace=True)
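Deduplicating on `customer` alone also drops rows that share an ID but differ in other columns, whereas the instructions ask about duplicate rows. The sketch below, on made-up data, contrasts the two behaviors and counts full-row duplicates before removing them. All names are hypothetical:

```python
import pandas as pd

# Toy data: one exact duplicate row, plus one repeated ID with a different income.
rows = pd.DataFrame(
    {
        "customer": ["A1", "A1", "B2", "B2"],
        "income": [100, 100, 200, 250],
    }
)

# Count rows that are identical to an earlier row in every column.
full_duplicates = int(rows.duplicated().sum())

# Drop only exact duplicate rows.
deduplicated = rows.drop_duplicates().reset_index(drop=True)

# Drop every repeated customer ID, keeping the first occurrence.
by_customer = rows.drop_duplicates(subset=["customer"]).reset_index(drop=True)
print(full_duplicates, len(deduplicated), len(by_customer))
```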

In [21]:
data

Unnamed: 0,customer,state,gender,customer lifetime value,income,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,,0.0,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,697953.59%,0.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,1288743.17%,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,764586.18%,0.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,536307.65%,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
991,HV85198,Arizona,M,847141.75%,63513,70,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,543121.91%,58161,68,Corporate Auto,Four-Door Car,140.747286
993,IL40123,Nevada,F,568964.41%,83640,70,Corporate Auto,Two-Door Car,471.050488
994,MY32149,California,F,368672.38%,0,96,Personal Auto,Two-Door Car,28.460568


Filter out the customers whose income is 0 or less:

In [22]:
data['income'].unique()

array([0.0, 48767.0, 36357.0, ..., 63513, 58161, 83640], dtype=object)

In [23]:
data = data[data['income'] > 0]
data

Unnamed: 0,customer,state,gender,customer lifetime value,income,monthly premium auto,policy type,vehicle class,total claim amount
2,AI49188,Nevada,F,1288743.17%,48767.0,108.0,Personal Auto,Two-Door Car,566.472247
4,GA49547,Washington,M,536307.65%,36357.0,68.0,Personal Auto,Four-Door Car,17.269323
5,OC83172,Oregon,F,825629.78%,62902.0,69.0,Personal Auto,Two-Door Car,159.383042
6,XZ87318,Oregon,F,538089.86%,55350.0,67.0,Corporate Auto,Four-Door Car,321.6
8,DY87989,Oregon,M,2412750.40%,14072.0,71.0,Corporate Auto,Four-Door Car,511.2
...,...,...,...,...,...,...,...,...,...
988,EX61844,Arizona,M,1401997.55%,36733,71,Personal Auto,Four-Door Car,327.078047
989,QC30857,Nevada,F,293115.52%,45768,73,Personal Auto,Two-Door Car,191.934494
991,HV85198,Arizona,M,847141.75%,63513,70,Personal Auto,Four-Door Car,185.667213
992,BS91566,Arizona,F,543121.91%,58161,68,Corporate Auto,Four-Door Car,140.747286


In [24]:
data['income'].unique()

array([48767.0, 36357.0, 62902.0, ..., 63513, 58161, 83640], dtype=object)
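The `income` column is still `object` dtype after filtering, mixing floats and integers. A possible refinement, sketched on made-up data, is to convert the column to a numeric dtype first so the comparison is done on real numbers and missing values are handled explicitly. The frame `customers` and its values are hypothetical:

```python
import pandas as pd

# Toy frame with mixed value types in an object column.
customers = pd.DataFrame(
    {
        "customer": ["A1", "B2", "C3", "D4"],
        "income": [0.0, 48767.0, "36357", None],
    },
    dtype=object,
)

# Convert to a numeric dtype; values that cannot be parsed become NaN.
customers["income"] = pd.to_numeric(customers["income"], errors="coerce")

# Keep strictly positive incomes; NaN compares as False and is dropped too.
positive_income = customers[customers["income"] > 0].reset_index(drop=True)
print(positive_income)
```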