# **Preprocessing Notebook**

### In this notebook we will use FeatureTools to generate additional features that capture relationships within our main dataframe. We will then build our baseline model, a Logistic Regression.

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler

#### Let's begin again by bringing in our main dataframe and taking a quick look.

In [2]:
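# Load the main dataframe saved by the previous notebook
df_mod = pd.read_csv('/Users/ryanm/Desktop/df-mod.csv')
# Drop the text columns; their numeric ID/code counterparts remain
df_mod = df_mod.drop(columns=['product_name', 'aisle', 'department'])
print(df_mod.shape)
df_mod.head(5)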

(3214874, 24)


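| Unnamed: 0 | user_id | order_number | order_id | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | aisle_id | ... | user_product_order_count | user_product_last_order | user_product_reorder_count | avg_days_between_orders | std_days_between_orders | total_orders | total_user_unique_products | user_reorder_proportion | product_popularity | avg_cart_position |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 2539329 | 3 | 9 | 12.0 | 26405 | 5 | 0 | 54 | ... | 2 | 4 | 1 | 19.7 | 9.26 | 10 | 7 | 0.9 | 405 | 2.02 |
| 1 | 1 | 2 | 2398795 | 4 | 8 | 16.0 | 26088 | 6 | 1 | 23 | ... | 1 | 2 | 1 | 19.7 | 9.26 | 10 | 7 | 0.9 | 247 | 2.99 |
| 2 | 1 | 3 | 473747 | 4 | 13 | 22.0 | 30450 | 5 | 1 | 88 | ... | 1 | 3 | 1 | 19.7 | 9.26 | 10 | 7 | 0.9 | 1696 | 3.4 |
| 3 | 1 | 4 | 2254736 | 5 | 8 | 30.0 | 26405 | 5 | 1 | 54 | ... | 2 | 4 | 1 | 19.7 | 9.26 | 10 | 7 | 0.9 | 405 | 2.02 |
| 4 | 1 | 5 | 431534 | 5 | 16 | 29.0 | 41787 | 8 | 1 | 24 | ... | 1 | 5 | 1 | 19.7 | 9.26 | 10 | 7 | 0.9 | 5653 | 7.29 |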


#### Before we start our baseline model, let's standardize the numerical features so that each one has zero mean and unit variance. This helps our forthcoming Logistic Regression model, and any other scale-sensitive models we try later.

In [3]:
num_features = [
    'user_product_reorder_count',
    'user_product_last_order',
    'user_product_order_count',
    'avg_cart_position',
    'product_popularity',
    'user_reorder_proportion',
    'total_user_unique_products',
    'total_orders',
    'days_since_prior_order',
    'add_to_cart_order',
    'avg_days_between_orders',
    'std_days_between_orders'
]

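# Keep only the listed features that are actually present in the dataframe
existing_num_features = [feature for feature in num_features if feature in df_mod.columns]
print(f"Existing numerical features to be scaled: {existing_num_features}")

df_numerical = df_mod[existing_num_features]

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
df_num_scaled = scaler.fit_transform(df_numerical)
# Reuse df_mod's index so the concat below aligns row by row
df_num_scaled = pd.DataFrame(df_num_scaled, columns=existing_num_features, index=df_mod.index)

# Swap the original numerical columns for their scaled versions
df_scaled = pd.concat([df_mod.drop(columns=existing_num_features), df_num_scaled], axis=1)
df_scaled.head()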

Existing numerical features to be scaled: ['user_product_reorder_count', 'user_product_last_order', 'user_product_order_count', 'avg_cart_position', 'product_popularity', 'user_reorder_proportion', 'total_user_unique_products', 'total_orders', 'days_since_prior_order', 'add_to_cart_order', 'avg_days_between_orders', 'std_days_between_orders']


| Unnamed: 0 | user_id | order_number | order_id | order_dow | order_hour_of_day | product_id | reordered | aisle_id | department_id | product_name_code | ... | user_product_order_count | avg_cart_position | product_popularity | user_reorder_proportion | total_user_unique_products | total_orders | days_since_prior_order | add_to_cart_order | avg_days_between_orders | std_days_between_orders |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 2539329 | 3 | 9 | 26405 | 0 | 54 | 17 | 31683 | ... | -0.410997 | -1.920683 | -0.508987 | 0.146423 | -0.76297 | -0.901902 | 0.031628 | -0.676228 | 1.461402 | 1.007456 |
| 1 | 1 | 2 | 2398795 | 4 | 8 | 26088 | 1 | 23 | 19 | 980 | ... | -0.551215 | -1.689794 | -0.513694 | 0.146423 | -0.76297 | -0.901902 | 0.493386 | -0.543345 | 1.461402 | 1.007456 |
| 2 | 1 | 3 | 473747 | 4 | 13 | 30450 | 1 | 88 | 13 | 7124 | ... | -0.551215 | -1.592202 | -0.470523 | 0.146423 | -0.76297 | -0.901902 | 1.186023 | -0.676228 | 1.461402 | 1.007456 |
| 3 | 1 | 4 | 2254736 | 5 | 8 | 26405 | 1 | 54 | 17 | 31683 | ... | -0.410997 | -1.920683 | -0.508987 | 0.146423 | -0.76297 | -0.901902 | 2.109539 | -0.676228 | 1.461402 | 1.007456 |
| 4 | 1 | 5 | 431534 | 5 | 16 | 41787 | 1 | 24 | 4 | 2419 | ... | -0.551215 | -0.666268 | -0.352629 | 0.146423 | -0.76297 | -0.901902 | 1.994099 | -0.277578 | 1.461402 | 1.007456 |
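
To make the effect of `StandardScaler` concrete, here is a minimal sketch on a small made-up dataframe. The `toy` values are illustrative assumptions, not rows from the real data. It compares the scaler's output with the formula z = (x - mean) / std computed by hand, using the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the notebook itself scales df_mod instead
toy = pd.DataFrame({
    'total_orders': [10, 10, 4, 25, 7],
    'avg_cart_position': [2.02, 2.99, 3.4, 7.29, 5.1],
})

scaler = StandardScaler()
toy_scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Manual standardization with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_scaled.round(3))
print("Matches manual formula:", np.allclose(toy_scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in very different units (order counts versus cart positions) contribute on a comparable scale.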


#### Excellent, our numerical features are now scaled and ready for the forthcoming model.
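
One way the forthcoming baseline could be set up is sketched below. This is an assumption about a reasonable workflow, not this notebook's actual model: the synthetic `X` and `y`, the split size, and the `baseline` pipeline name are all made up for illustration. Placing `StandardScaler` inside a `Pipeline` means the scaler is fit on the training split only, so statistics from the held-out rows never influence the scaling step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and the `reordered` target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * [1.0, 10.0, 100.0]  # deliberately mixed scales
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fit on X_train only when the pipeline is fit
baseline = Pipeline([
    ('scale', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy on held-out data: {accuracy:.3f}")
```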

#### As we have done before, let's save our work so we can keep using it in the next notebook(s).

In [10]:
# Save the scaled dataframe; index=False avoids writing an extra index column
main_path = r'C:/Users/ryanm/Desktop/df-scaled.csv'
df_scaled.to_csv(main_path, index=False)
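
The `Unnamed: 0` column visible in the previews above is typically what pandas produces when a CSV is written with its index and later read back without `index_col`. The sketch below shows the difference on a small hypothetical frame (`demo` and the temporary file names are assumptions for illustration), which is why `index = False` is worth keeping in the save step.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for df_scaled
demo = pd.DataFrame({'user_id': [1, 1, 2], 'reordered': [0, 1, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    no_index_path = os.path.join(tmp_dir, 'no_index.csv')

    demo.to_csv(with_index_path)             # default: the index is written too
    demo.to_csv(no_index_path, index=False)  # index is left out

    with_index = pd.read_csv(with_index_path)
    no_index = pd.read_csv(no_index_path)

print(list(with_index.columns))
print(list(no_index.columns))
```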