This notebook is a quick preparation step before modeling. Its purpose is to replace the categorical variable with dummy columns, standardize the features using StandardScaler(), and split the dataframe into training and testing sets.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os.path

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

Upon inspecting my dataframe, I found that index columns had been inserted as the data was saved and reloaded between the notebooks I have been using. This cell simply removes the unnecessary index column, along with the Seq_Number id field that I used to merge tables earlier in the project, since it is no longer needed.

In [2]:
df = pd.read_csv('my_data/prep_preprocess.csv')
df.rename(columns={"Unnamed: 0":"delete"},inplace=True)
df = df.drop(columns=["delete","Seq_Number"])
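As a side note on where the "Unnamed: 0" column comes from: to_csv() writes the index by default, and read_csv() then loads it back as an unnamed column. The sketch below reproduces this on a made-up two-row frame (the values and the in-memory round trip are assumptions for illustration, not the project's files) and shows that passing index=False when saving avoids the extra column.

```python
import io

import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({"Seq_Number": [101, 102], "bmi": [26.7, 28.6]})

# Default to_csv() writes the index, which read_csv() loads as "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False skips the index, so no extra column appears on reload.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Saving with index=False in the earlier notebooks would likely make the rename-and-drop step for that column unnecessary.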

The only categorical field in the dataframe is the age_grp column that I added in the last notebook to differentiate between rows representing children (<=18) and adults. This cell replaces that field with binary dummy columns.

In [3]:
age_grp_dummy = pd.get_dummies(df['age_grp'])
df = pd.concat([df,age_grp_dummy],axis=1).drop(columns=["age_grp"])
df.head()

,Gender,Age_yr,#_diff_foods,tot_calories,total_protein,total_carb,total_sugar,total_fiber,total_fat,avg_visc_fat,bmi,waist,weight,ave_BP,adult,child
0,1,69,11.0,1574.0,43.63,239.59,176.47,10.8,52.81,20.6,26.7,100.0,78.3,112.666667,1,0
1,1,54,8.0,5062.0,338.13,423.78,44.99,16.7,124.29,24.4,28.6,107.6,89.5,157.333333,1,0
2,1,72,27.0,1743.0,64.61,224.39,102.9,9.9,65.97,25.6,28.9,109.2,88.9,142.0,1,0
3,1,9,19.0,1490.0,77.75,162.92,80.58,10.6,58.27,14.9,17.1,61.0,32.2,104.666667,0,1
4,2,73,7.0,1421.0,55.24,178.2,87.78,12.3,55.36,20.8,19.7,88.6,52.0,137.333333,1,0
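To make the dummy encoding step easier to follow, here is a minimal sketch on a made-up three-row frame (the values are hypothetical). It passes dtype=int, which is an assumption on my part: recent pandas versions return boolean dummy columns by default, and dtype=int keeps the 0/1 values shown in the output above.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({"Age_yr": [69, 9, 54], "age_grp": ["adult", "child", "adult"]})

# One column per category; dtype=int forces 0/1 instead of True/False.
dummies = pd.get_dummies(toy["age_grp"], dtype=int)

# Attach the dummy columns and drop the original categorical field.
toy = pd.concat([toy, dummies], axis=1).drop(columns=["age_grp"])
print(toy)
```

Since every row is either an adult or a child, the two columns carry the same information. If that redundancy matters for a later model, pd.get_dummies(..., drop_first=True) keeps only one of them.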


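Here we standardize the data so that each column has zero mean and unit variance. Note that the scaler is fit on the whole dataframe, so the target (weight) and the dummy columns are scaled as well.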

In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)

In [5]:
scaled_df.columns

Index(['Gender', 'Age_yr', '#_diff_foods', 'tot_calories', 'total_protein',
       'total_carb', 'total_sugar', 'total_fiber', 'total_fat', 'avg_visc_fat',
       'bmi', 'waist', 'weight', 'ave_BP', 'adult', 'child'],
      dtype='object')
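For reference, StandardScaler() rescales each column as z = (x - mean) / std, using the population standard deviation. The sketch below checks this by hand on the five tot_calories values shown in the head() output earlier. Using only those five rows is an assumption made to keep the example small, so the numbers will not match the scaled values from the full dataframe.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five tot_calories values from the head() output, as a single column.
x = np.array([[1574.0], [5062.0], [1743.0], [1490.0], [1421.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score; np.std defaults to ddof=0 (population standard deviation).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```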

Finally, we split the data into training and testing sets, holding out 20% of the rows for testing.

In [6]:
from sklearn.model_selection import train_test_split

X = scaled_df.drop(columns=["weight"])
y = scaled_df.weight

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
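The cells above fit the scaler on every row before splitting, so the test rows contribute to the means and standard deviations. A common alternative, sketched here under my own assumptions rather than as this notebook's method, is to split first and fit the scaler on the training rows only. The frame below is synthetic (random values under column names borrowed from the notebook), and leaving the target unscaled is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df: real column names, random values.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age_yr": rng.integers(2, 80, size=50),
    "bmi": rng.normal(25, 5, size=50),
    "weight": rng.normal(70, 15, size=50),
})

X = demo.drop(columns=["weight"])
y = demo["weight"]  # target left unscaled in this sketch

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)

# Apply the same training statistics to both sets.
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)
```

With this ordering the training columns have zero mean and unit variance, while the test columns are only approximately standardized, which is what unseen data would look like at prediction time.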

That concludes the scope of this notebook. Later we will model our standardized data and see what insights we can come up with!