# Guided Capstone Step 4. Pre-Processing and Training Data Development

**The Data Science Method**

1. Problem Identification
2. Data Wrangling
3. Exploratory Data Analysis
4. **Pre-processing and Training Data Development**
   * Create dummy or indicator features for categorical variables
   * Standardize the magnitude of numeric features
   * Split into testing and training datasets
   * Apply scaler to the testing set
5. Modeling
   * Fit Models with Training Data Set
   * Review Model Outcomes: iterate over additional models as needed
   * Identify the Final Model
6. Documentation
   * Review the Results
   * Present and share your findings (storytelling)
   * Finalize Code
   * Finalize Documentation

**<font color='teal'> Start by loading the necessary packages as we did in step 3 and printing out our current working directory just to confirm we are in the correct project directory. </font>**

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#os.getcwd(), os.listdir()
#os.chdir('data')
#os.listdir()
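# Build the path with os.path.join so it works on any OS and avoids the invalid "\s" escape
dfno = pd.read_csv(os.path.join("data", "step3_output_noindex.csv"))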

**<font color='teal'>  Load the csv file you created in step 3 (remember, it should be saved inside your data subfolder) and print the first five rows.</font>**

In [3]:
# tail() shows the last five rows (the output below); dfno.head() would show the first five
dfno.tail()

| | Name | state | summit_elev | vertical_drop | base_elev | trams | fastEight | fastSixes | fastQuads | quad | ... | SkiableTerrain_ac | SnowMaking_ac | daysOpenLastYear | yearsOpen | averageSnowfall | AdultWeekday | AdultWeekend | projectedDaysOpen | NightSkiing_ac | clusters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 170 | Hogadon Basin | Wyoming | 8000 | 640 | 7400 | 0 | 0.0 | 0 | 0 | 0 | ... | 92.0 | 32.0 | 121.0 | 61.0 | 80.0 | 48.0 | 48.0 | 95.0 | 0.0 | 1 |
| 171 | Sleeping Giant Ski Resort | Wyoming | 7428 | 810 | 6619 | 0 | 0.0 | 0 | 0 | 0 | ... | 184.0 | 18.0 | 61.0 | 81.0 | 310.0 | 42.0 | 42.0 | 77.0 | 0.0 | 1 |
| 172 | Snow King Resort | Wyoming | 7808 | 1571 | 6237 | 0 | 0.0 | 0 | 0 | 1 | ... | 400.0 | 250.0 | 121.0 | 80.0 | 300.0 | 59.0 | 59.0 | 123.0 | 110.0 | 1 |
| 173 | Snowy Range Ski & Recreation Area | Wyoming | 9663 | 990 | 8798 | 0 | 0.0 | 0 | 0 | 0 | ... | 75.0 | 30.0 | 131.0 | 59.0 | 250.0 | 49.0 | 49.0 | 120.053004 | 0.0 | 1 |
| 174 | White Pine Ski Area | Wyoming | 9500 | 1100 | 8400 | 0 | 0.0 | 0 | 0 | 0 | ... | 370.0 | 0.0 | 115.103943 | 81.0 | 150.0 | 57.916957 | 49.0 | 120.053004 | 0.0 | 1 |


## Create dummy features for categorical variables

**<font color='teal'> Create dummy variables for `state`. Add the dummies back to the dataframe and remove the original column for `state`. </font>**

Hint: you can see an example of how to execute this in Aiden's article on preprocessing [here](https://medium.com/@aiden.dataminer/the-data-science-method-dsm-pre-processing-and-training-data-development-fd2d75182967). 

In [4]:
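# Keep 'state' as a one-column dataframe so the dummy columns are prefixed 'state_'
dfo = dfno[["state"]]
# Drop the original 'state' column and append its dummy columns
dfab = pd.concat([dfno.drop(columns=dfo.columns), pd.get_dummies(dfo)], axis=1)
# Preview every fifth column of the first three rows, transposed for readability
dfab.iloc[:3, ::5].T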

| | 0 | 1 | 2 |
|---|---|---|---|
| Name | Hilltop Ski Area | Sunrise Park Resort | Yosemite Ski & Snowboard Area |
| fastEight | 0 | 0 | 0 |
| double | 0 | 1 | 3 |
| LongestRun_mi | 1 | 1.2 | 0.4 |
| averageSnowfall | 69 | 250 | 300 |
| clusters | 0 | 1 | 1 |
| state_Connecticut | 0 | 0 | 0 |
| state_Maine | 0 | 0 | 0 |
| state_Missouri | 0 | 0 | 0 |
| state_New Mexico | 0 | 0 | 0 |


In [5]:
dfab.state_Wyoming

0      0
1      0
2      0
3      0
4      0
      ..
170    1
171    1
172    1
173    1
174    1
Name: state_Wyoming, Length: 175, dtype: uint8
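
To see what `pd.get_dummies` does on a small scale, here is a minimal sketch using a made-up three-row dataframe. The resort names and values are hypothetical and are not taken from the capstone data; the drop-and-concat pattern is the same one used in the cell above.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Name": ["Resort A", "Resort B", "Resort C"],
    "state": ["Maine", "Wyoming", "Maine"],
    "vertical_drop": [500, 1100, 750],
})

# One indicator column per distinct state, prefixed with the source column name
dummies = pd.get_dummies(toy[["state"]])

# Drop the original categorical column and attach the indicator columns
toy_encoded = pd.concat([toy.drop("state", axis=1), dummies], axis=1)

print(toy_encoded.columns.tolist())
# ['Name', 'vertical_drop', 'state_Maine', 'state_Wyoming']
```

Each row has exactly one "on" value across the `state_*` columns. Depending on the pandas version, the indicator dtype may be `uint8` (as in the output above) or `bool`.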

## Standardize the magnitude of numeric features

**<font color='teal'> Using sklearn preprocessing, standardize the scale of the features of the dataframe except the name of the resort, which we don't need in the dataframe for modeling, so it can be dropped here as well. Also, we want to hold out our response variable(s) so we can have their true values available for model performance review. Let's set `AdultWeekend` to the y variable as our response for scaling and modeling. Later we will go back and consider the `AdultWeekday`, `daysOpenLastYear`, and `projectedDaysOpen`. For now leave them in the development dataframe. </font>**

In [6]:
# first we import the preprocessing package from the sklearn library
from sklearn import preprocessing

# Declare an explanatory variable, called X, and assign it the result of dropping 'Name' and 'AdultWeekend' from the df
X = dfab.drop(['Name','AdultWeekend'], axis=1)

# Declare a response variable, called y, and assign it the AdultWeekend column of the df 
y = dfab["AdultWeekend"]

# Here we instantiate the StandardScaler class from the preprocessing package, and then call its fit() method with parameter X 
scaler = preprocessing.StandardScaler().fit(X)

# Declare a variable called X_scaled, and assign it the result of calling the transform() method with parameter X 
X_scaled = scaler.transform(X)
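
As a quick check of what `StandardScaler` computes, the sketch below compares its output with the z-score formula $(x - \mu) / \sigma$ applied column by column. The toy array is hypothetical (two made-up features on very different scales) and is not part of the capstone data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales (e.g. an elevation and a lift count)
X_toy = np.array([
    [8000.0, 2.0],
    [7400.0, 4.0],
    [9500.0, 6.0],
])

# Learn per-column mean and standard deviation, then apply them
scaler_toy = StandardScaler().fit(X_toy)
X_toy_scaled = scaler_toy.transform(X_toy)

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))
print(X_toy_scaled.mean(axis=0).round(6), X_toy_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in thousands no longer dominate features measured in single digits.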

## Split into training and testing datasets

**<font color='teal'> From sklearn's model_selection module, import train_test_split and create a 75/25 split with y = `AdultWeekend`. We will start by using the adult weekend ticket price as our response variable for modeling.</font>**

In [7]:
# Import the train_test_split function from the sklearn.model_selection utility.  
from sklearn.model_selection import train_test_split

# Get the 1-dimensional flattened NumPy array of our response variable y
# (Series.ravel() is deprecated in recent pandas versions, so convert with to_numpy() first)
y = y.to_numpy().ravel()

# Call the train_test_split() function with the first two parameters set to X_scaled and y 
# Declare four variables, X_train, X_test, y_train and y_test separated by commas 
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=1)

In [8]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape, y.dtype

((131, 59), (44, 59), (131,), (44,), dtype('float64'))
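
The cells above fit the scaler on the full dataset before splitting. The outline at the top of this notebook lists "Apply scaler to the testing set" after the split, and a common alternative ordering is to split first, fit the scaler on the training rows only, and then apply that fitted scaler to the test rows, so that no test-set statistics influence the preprocessing. The following is a minimal sketch of that ordering, not the notebook's own method. It uses randomly generated stand-in data with the same shape as here (175 rows, 59 features), and all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same shape as the capstone features (assumption: random values)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(175, 59))
y_demo = rng.normal(loc=60.0, scale=15.0, size=175)

# 1) Split first, using the same 75/25 proportions as above
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# 2) Fit the scaler on the training rows only
scaler_demo = StandardScaler().fit(X_train_demo)

# 3) Apply the same fitted scaler to both the training and the testing sets
X_train_scaled_demo = scaler_demo.transform(X_train_demo)
X_test_scaled_demo = scaler_demo.transform(X_test_demo)

print(X_train_scaled_demo.shape, X_test_scaled_demo.shape)
# (131, 59) (44, 59)
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centered, because they are scaled with the training set's statistics.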

Here we start the actual modeling work. First let's fit a multiple linear regression model to predict the `AdultWeekend` price.