# Capstone 2: Pre-processing and Training Data Development<a id='Pre-processing_and_Training_Data_Development'></a>

## Contents<a id='Contents'></a>
* [Pre-processing and Training Data Development](#Pre-processing_and_Training_Data_Development)
  * [Contents](#Contents)
  * [Introduction](#Introduction)
  * [Imports](#Imports)
  * [Load the Data](#Load_the_Data)
  * [Training and Test Sets](#Training_and_Test_Sets)
    * [Training and test split with markdown columns](#Split_with_markdown_columns)
    * [Training and test split without markdown columns](#Split_without_markdown_columns)
  * [Scale the Data](#Scale_the_Data)
  * [Export the Data](#Export_the_Data)
  * [Summary](#Summary)

## Introduction<a id='Introduction'></a>

After the exploratory data analysis, I found that some features need a closer look before I can tell whether weekly sales can be predicted and which features predict them most accurately. To prepare the data for machine learning models, I will create training and testing datasets and apply any scaling that each model needs.

## Imports<a id='Imports'></a>

In [16]:
# Import relevant libraries and packages.
import os
import warnings  # For controlling warning messages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # For visualization.
import statsmodels.api as sm  # Classes and functions for estimating statistical models, running statistical tests, and statistical EDA.
from statsmodels.graphics.api import abline_plot  # For visualizing and evaluating predictions.
from sklearn.metrics import mean_squared_error, r2_score  # MSE is the average squared difference between predicted and true values; R^2 is the share of variance in the target explained by the model.
from sklearn.model_selection import train_test_split  # To split the data.
from sklearn import linear_model, preprocessing  # linear_model includes ordinary least squares regression; preprocessing helps standardize a data set (robust scalers or transformers suit data with outliers).
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from library.sb_utils import save_file

# The following two instructions just suppress warnings that could occur later.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action='ignore', module='scipy', message='^internal gelsd')

## Load the Data<a id='Load_the_Data'></a>

In [17]:
#Load the CSV data from the data wrangling
data_all_0 = pd.read_csv('C:/Users/jmhat/Desktop/Coding/Capstone2/data/data_all_0.csv')
data_no_nan = pd.read_csv('C:/Users/jmhat/Desktop/Coding/Capstone2/data/data_no_nan.csv')

## Split data into training and test sets<a id='Training_and_Test_Sets'></a>

Using scikit-learn's `train_test_split` function, I split both datasets, the one with and the one without the markdown columns. I used an 80/20 split between training and test samples and a random state of 42 for reproducibility.
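As a minimal sketch of what an 80/20 split with a fixed random state does, the example below uses a small made-up frame; the `toy` data and the `Xa_`/`Xb_` names are hypothetical and only stand in for the wrangled data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the wrangled data (values are made up).
toy = pd.DataFrame({
    "Store": np.arange(10),
    "Temperature": np.linspace(30.0, 75.0, 10),
    "Weekly_Sales": np.linspace(1000.0, 1900.0, 10),
})

X_toy = toy.drop(columns=["Weekly_Sales"])
y_toy = toy["Weekly_Sales"]

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
# Repeating the call with the same random_state selects the same rows.
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(Xa_train.shape, Xa_test.shape)        # 8 training rows and 2 test rows
print(Xa_test.index.equals(Xb_test.index))  # same held-out rows both times
```

Because the row selection depends only on the number of rows and the random state, rerunning the notebook on the same input files should reproduce the same training and test sets.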

### Training and test split with markdown columns<a id='Split_with_markdown_columns'></a>

In [18]:
#Split the data into training and testing sections
X_0 = data_all_0.drop(columns=['Weekly_Sales'])
y_0 = data_all_0['Weekly_Sales']
X_train, X_test, y_train, y_test = train_test_split(X_0, y_0, test_size=0.2, random_state=42)

### Training and test split without markdown columns<a id='Split_without_markdown_columns'></a>

In [19]:
X_nan = data_no_nan.drop(columns=['Weekly_Sales'])
y_nan = data_no_nan['Weekly_Sales']
X1_train, X1_test, y1_train, y1_test = train_test_split(X_nan, y_nan, test_size=0.2, random_state=42)

## Scale the training and testing data<a id='Scale_the_Data'></a>

There are several common scalers; I decided to use the standard scaler and the min max scaler for this data. I will also export the unscaled data for use in tree-based machine learning models. If I were to go on to create a neural network, I would also need to check whether its inputs must be non-negative.
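To make the difference between the two scalers concrete, here is a small sketch on a made-up single-column array (the values and the `x`, `manual_ss`, and `manual_mm` names are illustrative only). It checks scikit-learn's output against the textbook formulas.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up single feature column for illustration.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

x_ss = StandardScaler().fit_transform(x)
x_mm = MinMaxScaler().fit_transform(x)

# StandardScaler: (x - mean) / std, using the population standard deviation (ddof=0).
manual_ss = (x - x.mean(axis=0)) / x.std(axis=0)
# MinMaxScaler: (x - min) / (max - min), mapping the fitted data onto [0, 1].
manual_mm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(x_ss, manual_ss))
print(np.allclose(x_mm, manual_mm))
print(x_mm.ravel())
```

The standard scaler centers each column at zero with unit variance, so it produces negative values, while the min max scaler keeps the fitted data between 0 and 1.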

In [21]:
# Fit each scaler on the training features only, so no test-set statistics leak into the scaling.
SS_scaler = StandardScaler()
MM_scaler = MinMaxScaler()

SS_scaler.fit(X_train)
MM_scaler.fit(X_train)

XtrainSS = SS_scaler.transform(X_train)
XtrainMM = MM_scaler.transform(X_train)

#XSS_train = pd.DataFrame(XtrainSS, columns=X_0.columns)
#XMM_train = pd.DataFrame(XtrainMM, columns=X_0.columns)

In [22]:
# Use separate scaler objects for the data without markdown columns,
# so the scalers fitted on X_train above are not overwritten.
SS_scaler1 = StandardScaler()
MM_scaler1 = MinMaxScaler()

SS_scaler1.fit(X1_train)
MM_scaler1.fit(X1_train)

X1trainSS = SS_scaler1.transform(X1_train)
X1trainMM = MM_scaler1.transform(X1_train)

#X1SS_train = pd.DataFrame(X1trainSS, columns=X_nan.columns)
#X1MM_train = pd.DataFrame(X1trainMM, columns=X_nan.columns)

In [23]:
# Transform the test features with the scalers fitted on X_train; do not refit on the test set.
XtestSS = SS_scaler.transform(X_test)
XtestMM = MM_scaler.transform(X_test)

#XSS_test = pd.DataFrame(XtestSS, columns=X_0.columns)
#XMM_test = pd.DataFrame(XtestMM, columns=X_0.columns)

In [24]:
# Transform the test features with the scalers fitted on X1_train.
X1testSS = SS_scaler1.transform(X1_test)
X1testMM = MM_scaler1.transform(X1_test)

#X1SS_test = pd.DataFrame(X1testSS, columns=X_nan.columns)
#X1MM_test = pd.DataFrame(X1testMM, columns=X_nan.columns)
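A common convention with scikit-learn scalers is to learn the scaling parameters from the training split and reuse them on the test split, so the test set stays unseen. The sketch below illustrates this on randomly generated data; the `demo_` names and the data are assumptions for illustration, not the project's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up feature matrices standing in for a training and a test split.
demo_train = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
demo_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

demo_scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only.
demo_train_scaled = demo_scaler.fit_transform(demo_train)
# Reuse those training statistics on the test rows.
demo_test_scaled = demo_scaler.transform(demo_test)

# Training columns are centered at zero; test columns are only approximately
# centered, because they were scaled with the training mean and std.
print(demo_train_scaled.mean(axis=0).round(6))
print(demo_test_scaled.mean(axis=0).round(3))
```

Calling `fit` again on the same scaler object replaces the previously learned parameters, which is why each dataset needs its own scaler instance if both are to be reused later.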

## Export Test and Training Data<a id='Export_the_Data'></a>

In [28]:
# save the data to a new csv file
#XtrainSS.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/XtrainSS.csv", index=False)
np.savetxt('XtrainSS.csv', XtrainSS, delimiter=',')

In [29]:
#XtrainMM.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/XtrainMM.csv", index=False)
np.savetxt('XtrainMM.csv', XtrainMM, delimiter=',')

In [30]:
#X1trainSS.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X1trainSS.csv", index=False)
np.savetxt('X1trainSS.csv', X1trainSS, delimiter=',')

In [31]:
#X1trainMM.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X1trainMM.csv", index=False)
np.savetxt('X1trainMM.csv', X1trainMM, delimiter=',')

In [32]:
#XtestSS.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/XtestSS.csv", index=False)
np.savetxt('XtestSS.csv', XtestSS, delimiter=',')

In [33]:
#XtestMM.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/XtestMM.csv", index=False)
np.savetxt('XtestMM.csv', XtestMM, delimiter=',')

In [34]:
#X1testSS.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X1testSS.csv", index=False)
np.savetxt('X1testSS.csv', X1testSS, delimiter=',')

In [35]:
#X1testMM.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X1testMM.csv", index=False)
np.savetxt('X1testMM.csv', X1testMM, delimiter=',')

In [36]:
#X_train.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X_train.csv", index=False)
np.savetxt('X_train.csv', X_train, delimiter=',')

In [37]:
#X_test.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X_test.csv", index=False)
np.savetxt('X_test.csv', X_test, delimiter=',')

In [38]:
#y_train.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/y_train.csv", index=False)
np.savetxt('y_train.csv', y_train, delimiter=',')

In [39]:
#y_test.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/y_test.csv", index=False)
np.savetxt('y_test.csv', y_test, delimiter=',')

In [40]:
#X1_train.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X1_train.csv", index=False)
np.savetxt('X1_train.csv', X1_train, delimiter=',')

In [41]:
#X1_test.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/X1_test.csv", index=False)
np.savetxt('X1_test.csv', X1_test, delimiter=',')

In [42]:
#y1_train.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/y1_train.csv", index=False)
np.savetxt('y1_train.csv', y1_train, delimiter=',')

In [43]:
#y1_test.to_csv("C:/Users/jmhat/Desktop/Coding/Capstone2/data/y1_test.csv", index=False)
np.savetxt('y1_test.csv', y1_test, delimiter=',')
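With its default arguments, `np.savetxt` writes only the numeric values, so the column names are not stored in the CSV files. One possible alternative, sketched here under assumptions, is to wrap each scaled array in a DataFrame and write it with `to_csv`. The column names, the values, and the temporary output folder below are all hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up scaled array and column names standing in for XtrainSS and X_0.columns.
demo_columns = ["Store", "Temperature", "Fuel_Price"]
demo_scaled = np.array([[0.0, -1.2, 0.5], [1.0, 0.3, -0.5]])

# Hypothetical output folder; a temporary directory keeps the sketch self-contained.
out_dir = tempfile.mkdtemp()

# Map each output file name to the frame that should be written.
demo_outputs = {"XtrainSS_demo": pd.DataFrame(demo_scaled, columns=demo_columns)}
for name, frame in demo_outputs.items():
    frame.to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)

# Reading the file back recovers both the header and the values.
reloaded = pd.read_csv(os.path.join(out_dir, "XtrainSS_demo.csv"))
print(list(reloaded.columns))
```

Collecting the outputs in a dictionary keeps the export in one loop, and the saved header makes the files easier to inspect in the modeling notebook.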

## Summary<a id='Summary'></a>

In this notebook, I finished the final pre-processing and split the data from the two dataframes into training and test sets. I can now use these sets to run machine learning models and determine whether any features accurately predict the weekly sales of these stores. I will also compare the scaled and unscaled data to determine whether scaling helps the models.