Title: 7.2 Exercises

Author: Chad Wood

Date: 1 May 2022

Modified By: Chad Wood

Description: This program demonstrates performing EDA on sample categorical and continuous data for predictive modeling

### Instructions and Data Imports

Within the GitHub for Week 7, you’ll find a data set named eda_classification.csv. This data set contains both continuous and categorical variables and the target is binary (0 or 1). Build a model that predicts ‘y’ (i.e., the column labeled ‘y’). Note that you’ll need to consider EDA and feature engineering to do this. Be sure to review the feature engineering article from Week 3.


You're welcome to try to calculate an appropriate evaluation metric, although it's not required since we won't cover that until next week. Note that the model will not perform well: the data is not built for making predictions; it's built for you to perform EDA on categorical data.

In [129]:
import numpy as np
import pandas as pd

data = pd.read_csv('data/eda_classification.csv')

y = data['y']
input_data = data.drop(columns=[y.name])

### Data Cleansing

##### Separating Categorical From Continuous

In [130]:
possibly_categorical = []

# Creates list of all columns with fewer than 50 unique values
for col in input_data.columns:
    if len(input_data[col].unique()) < 50:
        possibly_categorical.append(col)

# Displays to visually verify categorical columns
input_data[possibly_categorical]

Unnamed: 0,x1,x11,x13,x14,x17
0,Jun,-0.01%,tesla,thurday,small
1,July,0.00%,Toyota,thur,small
2,Aug,0.00%,bmw,wednesday,small
3,Aug,0.01%,Toyota,wed,small
4,May,-0.01%,Honda,wednesday,small
...,...,...,...,...,...
9994,Apr,0.00%,volkswagon,thurday,small
9995,May,0.00%,bmw,wed,small
9996,May,-0.02%,Honda,wednesday,small
9997,Aug,-0.01%,bmw,friday,small
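
The cardinality check above can also be written with `DataFrame.nunique()`. The following is a minimal sketch on a toy frame; the column names, values, and the threshold of 4 are hypothetical and chosen only so the result is easy to verify by eye. Note that `nunique()` ignores missing values by default, whereas `len(col.unique())` counts `NaN` as a value, so the two can differ by one.

```python
import pandas as pd

# Toy frame with hypothetical values (not the exercise data)
toy = pd.DataFrame({
    "month": ["Jun", "July", "Aug", "Aug"],
    "amount": [1.5, 2.5, 3.5, 4.5],
    "size": ["small", "small", "large", "small"],
})

max_unique = 4  # hypothetical threshold; the notebook uses 50

# nunique() counts distinct non-null values per column
unique_counts = toy.nunique()

# Keeps the columns whose distinct-value count falls below the threshold
low_cardinality = unique_counts[unique_counts < max_unique].index.tolist()
print(low_cardinality)
```

A threshold like this only produces candidates: a low-cardinality numeric column (such as the percentage column `x11` here) still needs a manual check.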


In [131]:
# Manually lists non-categorical column names
non_categorical = ['x11']

# Keeps only the candidate columns that are not listed above
categoricals = [column for column in possibly_categorical if column not in non_categorical]

# Separates categorical and continuous data
categorical_data = input_data[categoricals]
continuous_data = input_data.drop(columns=categorical_data.columns)

##### Cleaning Categorical Data

In [132]:
# Previews data
categorical_data.head()

Unnamed: 0,x1,x13,x14,x17
0,Jun,tesla,thurday,small
1,July,Toyota,thur,small
2,Aug,bmw,wednesday,small
3,Aug,Toyota,wed,small
4,May,Honda,wednesday,small


In [133]:
# Maps raw month spellings (keys) to normalized abbreviations (values)
months = {
    'January': 'jan', 
    'Feb':     'feb',
    'Mar':     'mar', 
    'Apr':     'apr', 
    'May':     'may', 
    'Jun':     'jun', 
    'July':    'jul', 
    'Aug':     'aug',
    'sept.':   'sep',
    'Oct':     'oct',
    'Nov':     'nov',
    'Dev':     'dec'
}
categorical_data = categorical_data.replace({'x1': months})

In [134]:
# Maps raw weekday spellings (keys) to normalized abbreviations (values)
week_days = {
    'monday':    'mon', 
    'tuesday':   'tue',
    'wednesday': 'wed', 
    'wed':       'wed', 
    'thurday':   'thu', 
    'thur':      'thu', 
    'friday':    'fri', 
    'fri':       'fri'
}
categorical_data = categorical_data.replace({'x14': week_days})
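
Because `replace` with a nested dict only touches values that match a key exactly, any spelling missing from the mapping passes through unchanged. A quick way to confirm the mapping is complete is to inspect the distinct values afterwards. This is a minimal sketch with a hypothetical column and a shortened mapping, not the exercise data:

```python
import pandas as pd

# Toy column with inconsistent weekday spellings (hypothetical values)
toy = pd.DataFrame({"day": ["wednesday", "wed", "thurday", "thur", "fri"]})

# Maps each raw spelling to one normalized label
day_map = {"wednesday": "wed", "thurday": "thu", "thur": "thu"}

# The outer key selects the column; the inner dict maps old -> new values
cleaned = toy.replace({"day": day_map})

# Lists what is left so unmapped spellings stand out
remaining = sorted(cleaned["day"].unique())
print(remaining)
```

If an unexpected label shows up in the list, it can be added to the mapping and the cell rerun.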

In [135]:
# Lowercases all strings in the remaining text columns
categorical_data['x13'] = categorical_data.x13.str.lower() 
categorical_data['x17'] = categorical_data.x17.str.lower() 

In [136]:
import category_encoders as ce

# Fills missing values with the most common value
for column in categorical_data.columns:
    fill_val = categorical_data[column].value_counts().idxmax()
    categorical_data[column] = categorical_data[column].fillna(fill_val)

# Ordinal (integer) encoding for the ordinal column
ordinal_cols = ['x17']
encoder = ce.OrdinalEncoder(cols=ordinal_cols)
categorical_data = encoder.fit_transform(categorical_data)

# OneHot encoding for the remainder
categorical_data = pd.get_dummies(categorical_data)

# Previews data again
categorical_data.head()

Unnamed: 0,x17,x1_apr,x1_aug,x1_dec,x1_feb,x1_jan,x1_jul,x1_jun,x1_mar,x1_may,...,x13_mercades,x13_nissan,x13_tesla,x13_toyota,x13_volkswagon,x14_fri,x14_mon,x14_thu,x14_tue,x14_wed
0,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,0,0
1,1,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,1,0,0
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
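
The two steps in this cell, mode imputation and one-hot encoding, can be seen in isolation on a single toy column. This is a minimal sketch with hypothetical values; `dtype=int` is passed only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy column with one missing entry (hypothetical values)
toy = pd.DataFrame({"make": ["bmw", None, "bmw", "honda"]})

# value_counts() skips missing values, so idxmax() returns the most common label
fill_val = toy["make"].value_counts().idxmax()
toy["make"] = toy["make"].fillna(fill_val)

# Expands the string column into one indicator column per category
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Each category becomes its own column named `<column>_<value>`, which is why consistent spelling and casing matter before this step.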


##### Cleaning Continuous Data

In [137]:
# Locates columns of type object (for display)
continuous_data.select_dtypes(include=object).head()

Unnamed: 0,x7,x11
0,"($1,306.52)",-0.01%
1,($24.86),0.00%
2,($110.85),0.00%
3,($324.43),0.01%
4,"$1,213.37",-0.01%


In [138]:
# Corrects monetary column; parentheses indicate negative values
continuous_data['x7'] = (continuous_data['x7']
                         .replace(r'[\$,) ]+', '', regex=True) # Removes '$', ',', ')' and spaces
                         .replace('[(]', '-', regex=True) # Replaces '(' with '-'
                         .replace('', 'NaN').astype(float)) # Blank to NaN
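
The same accounting-style conversion can be tried on a handful of strings before applying it to the full column. The sketch below is one possible variant, not the notebook's exact method: it uses the `.str` accessor and `pd.to_numeric(..., errors="coerce")`, and the sample values are hypothetical.

```python
import pandas as pd

# Sample strings in accounting format, including one blank (hypothetical values)
raw = pd.Series(["($1,306.52)", "($24.86)", "$1,213.37", ""])

cleaned = (
    raw.str.replace(r"[$,) ]", "", regex=True)  # drops '$', ',', ')' and spaces
       .str.replace("(", "-", regex=False)      # a leading '(' marks a negative
)

# Anything that still cannot be parsed (such as the blank) becomes NaN
amounts = pd.to_numeric(cleaned, errors="coerce")
print(amounts)
```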

In [139]:
# Corrects string percentages to floats
continuous_data['x11'] = continuous_data['x11'].str.rstrip('%').astype('float') / 100.0

In [140]:
# Fills missing values with the mean value
for column in continuous_data.columns:
    fill_val = continuous_data[column].mean()
    continuous_data[column] = continuous_data[column].fillna(fill_val)

# Previews data again
continuous_data.head()

Unnamed: 0,x0,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x15,x16
0,-17.933519,6.55922,-14.45281,-4.732855,0.381673,2.563194,-1306.52,-89.394348,-28.454044,-16.201298,-0.0001,0.21701,9.729891,-0.786431
1,-37.214754,10.77493,-15.384004,-0.077339,10.983774,-15.210206,-24.86,153.032652,-32.557736,69.675903,0.0,-3.584908,35.727926,-0.985552
2,0.330441,-19.609972,-9.167911,2.064124,12.071688,12.506141,-110.85,-141.437276,-20.794952,55.042604,0.0,-3.991366,-9.283523,-3.394718
3,-13.709765,-8.01139,6.759264,1.727615,-1.768382,24.039733,-324.43,51.039653,-7.046908,-31.424419,0.0001,7.908897,-2.891882,-2.690222
4,-4.202598,7.07621,-26.004919,-4.269696,-3.414224,2.115989,1213.37,-31.0467,19.061182,-31.525515,-0.0001,0.846719,25.49748,3.516801


In [142]:
# Lists of column names for reference
continuous_cols = continuous_data.columns
categorical_cols = categorical_data.columns

# Recombines cleaned data into one dataframe
X = pd.concat([continuous_data, categorical_data], axis=1)

### Data Split / Standardization

In [160]:
from sklearn.model_selection import train_test_split

# Splits data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [161]:
from sklearn.preprocessing import StandardScaler

# Standardizes training data
sc = StandardScaler()
stnd_continuous_train_data = sc.fit_transform(X_train[continuous_cols])
stnd_X_train = np.append(stnd_continuous_train_data, X_train[categorical_cols].values, axis=1)

# Standardizes testing data (note: this fits a second scaler on the test split)
sc = StandardScaler()
stnd_continuous_test_data = sc.fit_transform(X_test[continuous_cols])
stnd_X_test = np.append(stnd_continuous_test_data, X_test[categorical_cols].values, axis=1)
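
The cell above fits a separate scaler on the test split, so the two splits are standardized with slightly different means and standard deviations. A commonly recommended alternative is to fit the scaler on the training split only and reuse it for the test split. The following is a minimal sketch on synthetic arrays (the data and variable names are hypothetical), not a rerun of the notebook:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the continuous train/test columns (hypothetical data)
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
test = rng.normal(loc=5.0, scale=2.0, size=(20, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training split
test_scaled = scaler.transform(test)        # reuses the training statistics, no refit

# The test split is scaled with the training mean and (population) std
expected = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, expected))
```

With this approach the test split is not guaranteed to have zero mean and unit variance, which is expected: it is measured on the scale the model was trained on.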

### Model Building

In [163]:
from sklearn.linear_model import LinearRegression

'''
With the knowledge that the model will not perform well,
and that the point of this exercise is EDA on categorical data,
I've opted for a simple linear regression (LR) model.
'''

# Fits the model and retrieves predictions for the test set
lr = LinearRegression()
lr.fit(stnd_X_train, y_train)
pred = lr.predict(stnd_X_test)

pred

array([0.47208427, 0.53774658, 0.5324499 , ..., 0.50383502, 0.45350643,
       0.51710992])
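
Since the target is binary and linear regression returns continuous scores, one simple way to obtain class labels is to threshold the scores. The sketch below assumes a cutoff of 0.5 and uses short hypothetical arrays in place of `pred` and `y_test`; it illustrates the optional evaluation step and is not a result from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical scores and true labels (stand-ins for pred and y_test)
scores = np.array([0.47, 0.54, 0.53, 0.50, 0.45])
y_true = np.array([0, 1, 0, 1, 0])

# Scores at or above the assumed 0.5 cutoff are labeled 1, the rest 0
labels = (scores >= 0.5).astype(int)

# Fraction of labels that match the true classes
acc = accuracy_score(y_true, labels)
print(labels, acc)
```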