# Lab-customer-analysis-round-2

For this lab, we will use the `marketing_customer_analysis.csv` file in the `files_for_lab` folder. If you are using the Online Excel, see `files_for_lab/about.md` for more information.

**Note**: The next labs use the same data file, so please save your code to reuse it there.

In [None]:
import pandas as pd

## 1. Show the dataframe shape.


In [None]:
data = pd.read_csv('files_for_lab/csv_files/marketing_customer_analysis.csv', index_col=0)
data.shape

## 2. Standardize header names.


In [None]:
data = data.rename(columns={'Customer':'id', 'EmploymentStatus':'employment_status'})
data.columns = data.columns.str.lower().str.strip().str.replace(' ', '_')

## 3. Which columns are numerical?


In [None]:
# 'number' matches every numeric dtype, regardless of integer or float width
data.select_dtypes(include='number').dtypes

## 4. Which columns are categorical?


In [None]:
data.select_dtypes(include=['object']).dtypes

## 5. Check and deal with `NaN` values.


In [None]:
data.isnull().sum()

In [None]:
print(data['state'].value_counts())
print(data['state'].mode())
print(data['response'].value_counts())
print(data['response'].mode())
print(data['vehicle_class'].value_counts())
print(data['vehicle_class'].mode())
print(data['vehicle_size'].value_counts())
print(data['vehicle_size'].mode())
print(data['vehicle_type'].value_counts())
print(data['vehicle_type'].mode())

print(data['months_since_last_claim'].mean())
print(data['months_since_last_claim'].median())
print(data['number_of_open_complaints'].mean())
print(data['number_of_open_complaints'].median())

In [None]:
# Replace NaN with the most common value for categorical columns and with the median for numerical columns
replace_dict = {
    'id': '',
    'state': 'California',
    'customer_lifetime_value': '',
    'response': 'No',
    'coverage': '',
    'education': '',
    'effective_to_date': '',
    'employment_status': '',
    'gender': '',
    'income': '',
    'location_code': '',
    'marital_status': '',
    'monthly_premium_auto': '',
    'months_since_last_claim': data['months_since_last_claim'].median(),
    'months_since_policy_inception': '',
    'number_of_open_complaints': data['number_of_open_complaints'].median(),
    'number_of_policies': '',
    'policy_type': '',
    'policy': '',
    'renew_offer_type': '',
    'sales_channel': '',
    'total_claim_amount': '',
    'vehicle_class': 'Four-Door Car',
    'vehicle_size': 'Medsize',
    'vehicle_type': 'A',  
}

for column in data.columns:
    data[column] = data[column].fillna(replace_dict[column])
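The fill values above are hard-coded from the inspection cell. As a minimal sketch, assuming a small hypothetical frame (`toy`, not the lab data), the same values could be derived programmatically. Note that `mode()` returns a Series, because ties are possible, so its first element is taken.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the lab file
toy = pd.DataFrame({
    'state': ['California', 'Oregon', 'California', np.nan],
    'months_since_last_claim': [3.0, np.nan, 10.0, 5.0],
})

# Median for numerical columns, most frequent value for the others
fill_values = {}
for column in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[column]):
        fill_values[column] = toy[column].median()
    else:
        fill_values[column] = toy[column].mode().iloc[0]  # mode() returns a Series

# fillna accepts a dict that maps column names to fill values
toy_filled = toy.fillna(fill_values)
print(fill_values)
print(toy_filled)
```

Passing the dictionary to `fillna` handles every listed column in one call, so columns without missing values do not need an entry.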

## 6. Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February, and March. _Hint_: If data from March does not exist, consider only January and February.


In [None]:
data['effective_to_date'] = pd.to_datetime(data['effective_to_date'], errors='coerce')
print(data.dtypes)
print(data.isna().sum())

In [None]:
data['effective_to_month'] = pd.DatetimeIndex(data['effective_to_date']).month

In [None]:
data[data['effective_to_month'] == 3]

In [None]:
data[(data['effective_to_month'] >= 1) & (data['effective_to_month'] <= 2)]
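The month filters above can also be written as a single membership test. A minimal sketch on hypothetical dates follows; `toy` and the `%m/%d/%y` format are assumptions for illustration, not taken from the lab file. It uses the `.dt.month` accessor and `isin`:

```python
import pandas as pd

# Hypothetical dates; one entry is deliberately invalid
toy = pd.DataFrame({'effective_to_date': ['1/15/11', '2/3/11', 'not a date', '4/20/11']})

# Invalid entries become NaT because of errors='coerce'
toy['effective_to_date'] = pd.to_datetime(
    toy['effective_to_date'], format='%m/%d/%y', errors='coerce'
)
toy['effective_to_month'] = toy['effective_to_date'].dt.month

# Keep only rows whose month is January, February, or March
first_quarter = toy[toy['effective_to_month'].isin([1, 2, 3])]
print(first_quarter)
```

Rows with an unparseable date have no month, so the filter drops them automatically.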

In [None]:
# Exporting for use in Lab 1.05
data.to_csv('files_for_lab/csv_files/marketing_customer_analysis_clean.csv')

## 7. Put all the previously mentioned data transformations into a function.

In [None]:
def clean_dfheaders(df):
    df.rename(columns={'Customer':'id', 'EmploymentStatus':'employment_status'}, inplace=True)
    df.columns = df.columns.str.lower().str.strip().str.replace(' ', '_')
    return df
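# Why `df = df.rename(...)` does not change the caller's DataFrame: rename() returns a new object,
# and the assignment only rebinds the local name `df` inside the function.
# With inplace=True the object that was passed in is modified directly.
# Outside a function, `data = data.rename(...)` works because it rebinds the notebook-level name itself.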

def clean_df(df):
    replace_dict = {
        'id': '',
        'state': 'California',
        'customer_lifetime_value': '',
        'response': 'No',
        'coverage': '',
        'education': '',
        'effective_to_date': '',
        'employment_status': '',
        'gender': '',
        'income': '',
        'location_code': '',
        'marital_status': '',
        'monthly_premium_auto': '',
        'months_since_last_claim': df['months_since_last_claim'].median(),
        'months_since_policy_inception': '',
        'number_of_open_complaints': df['number_of_open_complaints'].median(),
        'number_of_policies': '',
        'policy_type': '',
        'policy': '',
        'renew_offer_type': '',
        'sales_channel': '',
        'total_claim_amount': '',
        'vehicle_class': 'Four-Door Car',
        'vehicle_size': 'Medsize',
        'vehicle_type': 'A',  
    }
    # A bare mode() expression, e.g. 'state': df['state'].mode(), does not work as a fill value
    # because mode() returns a Series (ties are possible); use df['state'].mode()[0] instead.
    
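    # Loop over the dictionary so that columns without an entry
    # (e.g. an existing 'effective_to_month') do not raise a KeyError
    for column, value in replace_dict.items():
        if column in df.columns:
            df[column] = df[column].fillna(value)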
        
    df['effective_to_date'] = pd.to_datetime(df['effective_to_date'], errors='coerce')
    df['effective_to_month'] = pd.DatetimeIndex(df['effective_to_date']).month
        
    return df
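The questions in the comments of `clean_dfheaders` come down to rebinding a local name versus modifying the object that was passed in. A minimal sketch of the difference, using hypothetical helper functions and a toy frame that are not part of the lab:

```python
import pandas as pd

def rename_rebinding(df):
    # rename() returns a new DataFrame; this only rebinds the local name `df`
    df = df.rename(columns={'Customer': 'id'})
    return df

def rename_in_place(df):
    # inplace=True modifies the object the caller passed in
    df.rename(columns={'Customer': 'id'}, inplace=True)
    return df

toy = pd.DataFrame({'Customer': ['A1', 'B2']})

rename_rebinding(toy)                      # return value discarded
columns_after_discard = list(toy.columns)  # caller's frame is unchanged

toy = rename_rebinding(toy)                # keeping the return value works
columns_after_assign = list(toy.columns)

toy2 = pd.DataFrame({'Customer': ['A1', 'B2']})
rename_in_place(toy2)                      # no assignment needed
columns_in_place = list(toy2.columns)

print(columns_after_discard, columns_after_assign, columns_in_place)
```

Either style works as long as the call site matches it: assign the return value when the function builds a new frame, or rely on the side effect when it modifies the frame in place.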

In [None]:
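# Reload the raw file so the functions run on untransformed data
data = pd.read_csv('files_for_lab/csv_files/marketing_customer_analysis.csv', index_col=0)
data = clean_dfheaders(data)
data = clean_df(data)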

data

## 8. BONUS 

### 8.1. List Comprehensions

#### 8.1.1 Find the capital letters (and not white space) in the sentence 'The Quick Brown Fox Jumped Over The Lazy Dog'.


In [None]:
sentence = 'The Quick Brown Fox Jumped Over The Lazy Dog'
cap_letters = [letter for letter in sentence if letter.isupper()]
print(cap_letters)

#### 8.1.2. Use a list comprehension to create a list with the elements of `a` rounded to 2 decimal places.

In [None]:
a = [
    0.84062117, 0.48006452, 0.7876326 , 0.77109654,
    0.44409793, 0.09014516, 0.81835917, 0.87645456,
    0.7066597 , 0.09610873, 0.41247947, 0.57433389,
    0.29960807, 0.42315023, 0.34452557, 0.4751035 ,
    0.17003563, 0.46843998, 0.92796258, 0.69814654,
    0.41290051, 0.19561071, 0.16284783, 0.97016248,
    0.71725408, 0.87702738, 0.31244595, 0.76615487,
    0.20754036, 0.57871812, 0.07214068, 0.40356048,
    0.12149553, 0.53222417, 0.9976855 , 0.12536346,
    0.80930099, 0.50962849, 0.94555126, 0.33364763
]

In [None]:
# Avoid naming the loop variable `float`, which would shadow the built-in type
b = [round(value, 2) for value in a]
print(b)

### 8.2. Lambdas

#### 8.2.1. Using Lambda Expressions in List Comprehensions

In the following challenge, we will combine two lists using a lambda expression in a list comprehension.

To do this, we need the `zip` function, which returns an iterator of tuples.

Given two lists, `zip` yields a tuple of their first elements, then a tuple of their second elements, and so on.
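As a minimal illustration of `zip` with two lists (the lists `letters` and `numbers` are hypothetical examples):

```python
letters = ['a', 'b', 'c']      # hypothetical example lists
numbers = [1, 2, 3]

pairs = zip(letters, numbers)  # a lazy iterator, not a list
pairs_list = list(pairs)       # consume the iterator to see its contents
print(pairs_list)              # [('a', 1), ('b', 2), ('c', 3)]
```

The iterator can be consumed only once, which is why it is converted to a list before printing.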

In this exercise, we compare the elements at the same index in the two lists. We zip the two lists and then use a lambda expression to check whether the `list1` element is greater than the `list2` element.
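One possible solution is sketched here. The values in `list1` and `list2` are hypothetical, since the exercise's own lists are not shown, and `is_greater` is an illustrative name:

```python
list1 = [1, 8, 3, 7]   # hypothetical values
list2 = [4, 2, 3, 5]   # hypothetical values

# Lambda that compares one element from each list
is_greater = lambda x, y: x > y

# zip pairs the elements by index; the lambda is applied to each pair
comparison = [is_greater(x, y) for x, y in zip(list1, list2)]
print(comparison)  # [False, True, False, True]
```

The lambda is used because the exercise asks for one; the plain expression `x > y` inside the comprehension would give the same result.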