# Preprocessing of training data

This section covers the preprocessing of the training dataset.
The objective is to produce a cleaned dataset that can be used in later sections to train machine learning models.

We will consider the main insights from the EDA section to guide the preprocessing steps.
The following tasks will be performed:
- Treat missing values
- Remove unnecessary columns
- Deal with outliers
- Scale numerical features (considering Standard Scaling and Min-Max Scaling)
- Encode categorical features (considering Label Encoding and One-Hot Encoding)
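
As a rough illustration of the two scaling options listed above, the sketch below applies both scalers to a small toy column. The values and the column names `value`, `min_max` and `standard` are made up for this example and are not taken from the customer dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy values (assumption for illustration, not from customer.csv)
toy = pd.DataFrame({"value": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Min-Max Scaling maps the column onto the interval [0, 1]
toy["min_max"] = MinMaxScaler().fit_transform(toy[["value"]]).ravel()

# Standard Scaling centers the column at 0 with unit (population) variance
toy["standard"] = StandardScaler().fit_transform(toy[["value"]]).ravel()

print(toy)
```

Min-Max Scaling keeps the values inside a fixed interval, while Standard Scaling leaves them unbounded, which tends to matter for columns with extreme values.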

In [72]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

In [73]:
# Load the original dataset
df = pd.read_csv('../data/customer.csv')
print(df.shape)
print(df.columns)

(72458, 15)
Index(['Unnamed: 0', 'custid', 'sex', 'is_employed', 'income',
       'marital_status', 'health_ins', 'housing_type', 'num_vehicles', 'age',
       'state_of_res', 'code_column', 'gas_usage', 'rooms', 'recent_move_b'],
      dtype='object')


In [74]:
# Check the frequency of categories in the 'is_employed' column
df['is_employed'].value_counts(dropna=False)

is_employed
True     44630
NaN      25515
False     2313
Name: count, dtype: int64

In [75]:
# people with missing values in 'is_employed' will be considered as unemployed
df['is_employed'] = df['is_employed'].fillna(False)
df['is_employed'].value_counts()

is_employed
True     44630
False    27828
Name: count, dtype: int64

In [76]:
# Maximum and Minimum number of code_column associated with a state_of_res
max(df.groupby('state_of_res')['code_column'].nunique()), min(df.groupby('state_of_res')['code_column'].nunique())

(1, 1)
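
The same redundancy check can be written as a single boolean test. The sketch below uses a toy frame with hypothetical code values; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical codes; the real values live in customer.csv
toy = pd.DataFrame({
    "state_of_res": ["Alabama", "Alabama", "Arizona", "Arizona"],
    "code_column": [1047, 1047, 1318, 1318],
})

# Number of distinct codes observed for each state
codes_per_state = toy.groupby("state_of_res")["code_column"].nunique()

# True when every state maps to exactly one code,
# i.e. the code carries no information beyond the state itself
is_redundant = bool(codes_per_state.eq(1).all())
print(is_redundant)
```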

In [77]:
# Since each state has exactly one code, 'code_column' is redundant and can be dropped.
# We can also drop the identifier columns 'Unnamed: 0' and 'custid'.
# 'recent_move_b' is treated as an irrelevant feature, so we drop it as well.
df.drop(['Unnamed: 0','custid','code_column','recent_move_b'], axis=1, inplace=True)
print(df.shape)
print(df.columns)

(72458, 11)
Index(['sex', 'is_employed', 'income', 'marital_status', 'health_ins',
       'housing_type', 'num_vehicles', 'age', 'state_of_res', 'gas_usage',
       'rooms'],
      dtype='object')


In [78]:
df.isnull().sum()

sex                  0
is_employed          0
income               0
marital_status       0
health_ins           0
housing_type      1686
num_vehicles      1686
age                  0
state_of_res         0
gas_usage         1686
rooms                0
dtype: int64

In [79]:
num = df[df.isnull().any(axis=1)].shape[0]
print(f'{num} rows have missing values. \nApprox. {num/df.shape[0]*100:.2f}% of the original dataset.')

1686 rows have missing values. 
Approx. 2.33% of the original dataset.


```python
# Missing values are all in the same rows. We can drop them
df.dropna(inplace=True)
df.shape
```
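
The claim that the missing values share the same rows can be checked directly. The following is a minimal sketch on toy data, assuming the three affected columns are `housing_type`, `num_vehicles` and `gas_usage` as in the null counts above.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): two rows are missing all three fields at once
toy = pd.DataFrame({
    "housing_type": ["Rented", np.nan, "Rented", np.nan],
    "num_vehicles": [1.0, np.nan, 2.0, np.nan],
    "gas_usage": [3.0, np.nan, 40.0, np.nan],
    "rooms": [3, 4, 5, 2],
})
cols = ["housing_type", "num_vehicles", "gas_usage"]

# Count how many of the three columns are missing in each row
missing_per_row = toy[cols].isnull().sum(axis=1)

# If the gaps always co-occur, each row has either 0 or len(cols) missing fields
same_rows = bool(missing_per_row.isin([0, len(cols)]).all())
n_incomplete = int((missing_per_row > 0).sum())
print(same_rows, n_incomplete)
```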

In [80]:
# Instead of dropping the rows with missing values, we can use imputation techniques
# For numerical features, we can use the median
# For categorical features, we can use the mode
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['housing_type'] = df['housing_type'].fillna(df['housing_type'].mode()[0])
df['num_vehicles'] = df['num_vehicles'].fillna(df['num_vehicles'].median())
df['gas_usage'] = df['gas_usage'].fillna(df['gas_usage'].median())
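
Another way to impute several columns at once is to pass a dictionary to `DataFrame.fillna`, which avoids calling an inplace method on a single-column selection. The sketch below uses toy values and a hypothetical `fill_values` dictionary.

```python
import numpy as np
import pandas as pd

# Toy data (assumption for illustration)
toy = pd.DataFrame({
    "housing_type": ["Rented", "Rented", np.nan, "Homeowner free and clear"],
    "num_vehicles": [0.0, 2.0, np.nan, 4.0],
    "gas_usage": [3.0, np.nan, 40.0, 60.0],
})

# One fill value per column: mode for the categorical, median for the numerical ones
fill_values = {
    "housing_type": toy["housing_type"].mode()[0],
    "num_vehicles": toy["num_vehicles"].median(),
    "gas_usage": toy["gas_usage"].median(),
}

# fillna with a dict fills each column with its own value and returns a new frame
toy = toy.fillna(fill_values)
print(toy)
```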

In [81]:
df[df['age'] < 21]['age'].value_counts()

age
0    77
Name: count, dtype: int64

In [82]:
df[df['age'] < 21]

Unnamed: 0,sex,is_employed,income,marital_status,health_ins,housing_type,num_vehicles,age,state_of_res,gas_usage,rooms
594,Male,True,50000.0,Never married,False,Rented,1.0,0,Alabama,3.0,3
1260,Male,False,0.0,Married,True,Rented,0.0,0,Arizona,3.0,4
1658,Female,True,24700.0,Never married,True,Rented,3.0,0,Arizona,3.0,5
2340,Female,True,2400.0,Divorced/Separated,True,Rented,0.0,0,Arizona,3.0,4
2859,Female,False,9700.0,Married,True,Homeowner free and clear,3.0,0,Arkansas,3.0,2
...,...,...,...,...,...,...,...,...,...,...,...
67967,Female,False,5000.0,Widowed,True,Homeowner with mortgage/loan,0.0,0,Virginia,3.0,2
68681,Female,True,80000.0,Married,True,Homeowner with mortgage/loan,2.0,0,Virginia,90.0,3
69200,Male,False,0.0,Never married,True,Rented,2.0,0,Washington,3.0,6
70015,Male,True,75000.0,Divorced/Separated,True,Homeowner free and clear,2.0,0,Washington,3.0,4


In [83]:
# For variable 'age', we will truncate values to the range 21-99.
# Values outside this range will be replaced by the closest endpoint.
print(f"Max age: {df['age'].max()} | Min age: {df['age'].min()}")
df['age'] = df['age'].clip(lower=21, upper=99)
print(f"Max age: {df['age'].max()} | Min age: {df['age'].min()}")
print(df.shape)

Max age: 120 | Min age: 0
Max age: 99 | Min age: 21
(72458, 11)
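
To make the effect of `clip` concrete, the sketch below applies the same bounds to a handful of toy ages and counts how many values were altered. The ages are made up for illustration.

```python
import pandas as pd

# Toy ages (assumption), including values below 21 and above 99
ages = pd.Series([0, 18, 21, 45, 99, 120], name="age")

# Values below 21 become 21, values above 99 become 99, the rest are unchanged
clipped = ages.clip(lower=21, upper=99)

# Number of values that were moved to an endpoint
n_changed = int((ages != clipped).sum())
print(clipped.tolist(), n_changed)
```

Clipping keeps every row, so the shape of the dataset is unchanged, but it also concentrates all out-of-range records at the two endpoints.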


In [84]:
# Columns to be scaled to min-max range: 'age', 'num_vehicles', 'rooms'
min_max_columns = ['age', 'num_vehicles', 'rooms']
scaler = MinMaxScaler()
# Round the scaled values to 2 decimal places, to group similar values
df[min_max_columns] = scaler.fit_transform(df[min_max_columns]).round(2)
df[min_max_columns].describe()

Unnamed: 0,age,num_vehicles,rooms
count,72458.0,72458.0,72458.0
mean,0.361449,0.343805,0.49891
std,0.229563,0.192299,0.341307
min,0.0,0.0,0.0
25%,0.17,0.17,0.2
50%,0.35,0.33,0.4
75%,0.53,0.5,0.8
max,1.0,1.0,1.0
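
Min-Max Scaling computes `(x - min) / (max - min)` per column. The sketch below checks this against `MinMaxScaler` on toy data and shows that the fitted scaler can map values back to the original units. Note that rounding the scaled values, as done above, means the inverse is only approximate.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (assumption for illustration)
toy = pd.DataFrame({"age": [21, 40, 60, 99], "rooms": [1, 2, 4, 6]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

# Manual version of the same per-column formula
manual = (toy - toy.min()) / (toy.max() - toy.min())
max_abs_diff = float(np.abs(scaled - manual.to_numpy()).max())

# The fitted scaler keeps the column minima and ranges, so it can undo the scaling
restored = scaler.inverse_transform(scaled)
print(max_abs_diff, restored[0])
```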


In [85]:
df['income'].describe()

count    7.245800e+04
mean     4.188143e+04
std      5.827460e+04
min     -6.900000e+03
25%      1.070000e+04
50%      2.640000e+04
75%      5.200000e+04
max      1.257000e+06
Name: income, dtype: float64

In [86]:
# Negative values in 'income' will be replaced by their absolute values
df['income'] = df['income'].abs()
df['income'].describe()

count    7.245800e+04
mean     4.188688e+04
std      5.827069e+04
min      0.000000e+00
25%      1.070000e+04
50%      2.640000e+04
75%      5.200000e+04
max      1.257000e+06
Name: income, dtype: float64

In [87]:
# Columns to be standardized (zero mean, unit variance): 'income', 'gas_usage'
standard_columns = ['income', 'gas_usage']
scaler = StandardScaler()
df[standard_columns] = scaler.fit_transform(df[standard_columns]).round(2)
df[standard_columns].describe()

Unnamed: 0,income,gas_usage
count,72458.0,72458.0
mean,-7.4e-05,-0.001335
std,1.000088,1.000588
min,-0.72,-0.63
25%,-0.54,-0.6
50%,-0.27,-0.49
75%,0.17,0.15
max,20.85,8.46
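
Standard Scaling is a linear transformation: it shifts and rescales a column but does not change the shape of its distribution, so a long right tail such as the one visible in the maximum values above remains. The sketch below illustrates this on toy incomes, which are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy incomes with one extreme value (assumption for illustration)
income = pd.DataFrame({"income": [0.0, 10_000.0, 25_000.0, 50_000.0, 400_000.0]})

z = StandardScaler().fit_transform(income).ravel()

# StandardScaler divides by the population standard deviation (ddof=0)
manual = (income["income"] - income["income"].mean()) / income["income"].std(ddof=0)
matches = bool(np.allclose(z, manual))

# Skewness is unaffected by shifting and rescaling
skew_before = income["income"].skew()
skew_after = pd.Series(z).skew()
print(matches, round(skew_before, 3), round(skew_after, 3))
```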


In [88]:
df.head(10)

Unnamed: 0,sex,is_employed,income,marital_status,health_ins,housing_type,num_vehicles,age,state_of_res,gas_usage,rooms
0,Male,True,-0.34,Never married,True,Homeowner free and clear,0.0,0.04,Alabama,2.71,0.4
1,Female,False,-0.32,Divorced/Separated,True,Rented,0.0,0.78,Alabama,-0.6,1.0
2,Female,True,-0.36,Never married,True,Homeowner with mortgage/loan,0.33,0.13,Alabama,-0.01,0.4
3,Female,False,-0.07,Widowed,True,Homeowner free and clear,0.17,0.92,Alabama,1.27,0.2
4,Male,True,-0.05,Divorced/Separated,True,Rented,0.33,0.59,Alabama,-0.6,0.2
5,Male,False,-0.53,Married,True,Homeowner free and clear,0.33,0.71,Alabama,2.55,1.0
6,Female,True,-0.28,Married,False,Rented,0.33,0.06,Alabama,-0.6,0.4
7,Female,False,-0.13,Married,True,Homeowner free and clear,0.33,0.67,Alabama,0.15,0.8
8,Female,True,-0.29,Never married,True,Homeowner free and clear,0.83,0.08,Alabama,-0.6,0.6
9,Male,True,-0.18,Married,True,Homeowner with mortgage/loan,0.5,0.42,Alabama,-0.33,1.0


- sex - categorical nominal (binary)
- is_employed - categorical nominal (binary)
- income - numerical
- marital_status - categorical nominal (multiclass)
- health_ins - categorical nominal (binary)
- housing_type - categorical nominal (multiclass)
- num_vehicles - numerical
- age - numerical
- state_of_res - categorical nominal (multiclass)
- gas_usage - numerical
- rooms - numerical

In [89]:
cols_label_encode = ['sex','is_employed','health_ins', 'state_of_res']
cols_one_hot_encode = ['marital_status', 'housing_type']

In [90]:
label_encoder = LabelEncoder()
for col in cols_label_encode:
    df[col] = label_encoder.fit_transform(df[col])
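
`LabelEncoder` assigns integer codes following the sorted order of the categories. Because refitting a single encoder overwrites its learned classes, one option is to keep a separate encoder per column so that codes can be mapped back later. The sketch below assumes a hypothetical `encoders` dictionary and toy data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (assumption for illustration)
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female"],
    "state_of_res": ["Arizona", "Alabama", "Arkansas"],
})

# One fitted encoder per column, stored by column name
encoders = {}
for col in ["sex", "state_of_res"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)

# The stored encoder still knows the original categories
decoded = list(encoders["state_of_res"].inverse_transform(toy["state_of_res"]))
print(decoded)
```

For a nominal column with many categories, such as the state, the integer codes reflect only alphabetical order and not any real ordering between categories.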

In [91]:
df = pd.get_dummies(df, columns=cols_one_hot_encode)
df.head()

Unnamed: 0,sex,is_employed,income,health_ins,num_vehicles,age,state_of_res,gas_usage,rooms,marital_status_Divorced/Separated,marital_status_Married,marital_status_Never married,marital_status_Widowed,housing_type_Homeowner free and clear,housing_type_Homeowner with mortgage/loan,housing_type_Occupied with no rent,housing_type_Rented
0,1,1,-0.34,1,0.0,0.04,0,2.71,0.4,False,False,True,False,True,False,False,False
1,0,0,-0.32,1,0.0,0.78,0,-0.6,1.0,True,False,False,False,False,False,False,True
2,0,1,-0.36,1,0.33,0.13,0,-0.01,0.4,False,False,True,False,False,True,False,False
3,0,0,-0.07,1,0.17,0.92,0,1.27,0.2,False,False,False,True,True,False,False,False
4,1,1,-0.05,1,0.33,0.59,0,-0.6,0.2,True,False,False,False,False,False,False,True


In [92]:
dummies = list(filter(lambda x: x.startswith(tuple(cols_one_hot_encode)), df.columns))
dummies

['marital_status_Divorced/Separated',
 'marital_status_Married',
 'marital_status_Never married',
 'marital_status_Widowed',
 'housing_type_Homeowner free and clear',
 'housing_type_Homeowner with mortgage/loan',
 'housing_type_Occupied with no rent',
 'housing_type_Rented']

In [93]:
# Convert the boolean dummy columns to 0/1 integers
df[dummies] = df[dummies].astype(int)

df.head()

Unnamed: 0,sex,is_employed,income,health_ins,num_vehicles,age,state_of_res,gas_usage,rooms,marital_status_Divorced/Separated,marital_status_Married,marital_status_Never married,marital_status_Widowed,housing_type_Homeowner free and clear,housing_type_Homeowner with mortgage/loan,housing_type_Occupied with no rent,housing_type_Rented
0,1,1,-0.34,1,0.0,0.04,0,2.71,0.4,0,0,1,0,1,0,0,0
1,0,0,-0.32,1,0.0,0.78,0,-0.6,1.0,1,0,0,0,0,0,0,1
2,0,1,-0.36,1,0.33,0.13,0,-0.01,0.4,0,0,1,0,0,1,0,0
3,0,0,-0.07,1,0.17,0.92,0,1.27,0.2,0,0,0,1,1,0,0,0
4,1,1,-0.05,1,0.33,0.59,0,-0.6,0.2,1,0,0,0,0,0,0,1
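
The one-hot encoding and the conversion to 0/1 integers can also be done in a single call through the `dtype` parameter of `pd.get_dummies`. This is a minimal sketch on toy data, not a change to the steps above.

```python
import pandas as pd

# Toy data (assumption for illustration)
toy = pd.DataFrame({
    "marital_status": ["Married", "Widowed", "Married"],
    "rooms": [3, 2, 5],
})

# dtype=int makes the dummy columns integers instead of booleans
encoded = pd.get_dummies(toy, columns=["marital_status"], dtype=int)
print(encoded)
```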


In [94]:
# Min-max scale the 'state_of_res' column
df['state_of_res'] = MinMaxScaler().fit_transform(df[['state_of_res']])
df['state_of_res'].describe()

count    72458.000000
mean         0.474746
std          0.302055
min          0.000000
25%          0.180000
50%          0.460000
75%          0.760000
max          1.000000
Name: state_of_res, dtype: float64

In [95]:
# save the cleaned data to a new csv file
df.to_csv('../data/customer_cleaned.csv', index=False)
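
A quick round-trip check can confirm that a saved file reloads as expected. The sketch below writes a toy frame to a temporary file (the name `toy_cleaned.csv` is hypothetical) and verifies that no missing values or extra index column appear on reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned dataset (assumption for illustration)
toy = pd.DataFrame({"age": [0.0, 0.5, 1.0], "is_employed": [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_cleaned.csv"
    # index=False prevents the index from being written as an extra column
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

no_missing = not bool(reloaded.isnull().any().any())
same_columns = list(reloaded.columns) == list(toy.columns)
print(reloaded.shape, no_missing, same_columns)
```

Writing with `index=False` is what keeps a column like `Unnamed: 0` from reappearing when the file is read again.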

The cleaned dataset is saved in a new file: `customer_cleaned.csv`. This file can be used in future sections to load the cleaned dataset and train some models.

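In this section, we applied the following steps:
- missing values in `is_employed` were filled with `False`
- columns `Unnamed: 0`, `custid`, `code_column`, `recent_move_b` were removed
- missing values in `housing_type` were imputed with the mode, and those in `num_vehicles` and `gas_usage` with the median
- `age` values were truncated to [21, 99]
- columns `age`, `num_vehicles`, `rooms` were scaled using Min-Max Scaling
- negative `income` values were replaced by their absolute values
- columns `income` and `gas_usage` were scaled using Standard Scaling
- categorical columns were encoded using One-Hot Encoding or Label Encoding, according to the nature of their categories
- the label-encoded `state_of_res` column was additionally scaled using Min-Max Scaling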