# The Validation Set Approach (Holdout Validation)

There are a few different types of cross-validation techniques we can use to evaluate a classifier's effectiveness. The simplest technique is called holdout validation, which involves:

1. randomly splitting our dataset into a training set and a test set,
2. fitting the model using the training set,
3. making predictions on the test set.




We'll randomly select 80% of the observations in the Smarket DataFrame as the training set and the remaining 20% as the test set. This ratio isn't set in stone; a 75%-25% split is also common.
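To make the idea of a random 80/20 split concrete, here is a minimal sketch that does the split by hand with NumPy. The frame `df` and the names `train_part` and `test_part` are hypothetical stand-ins, not part of the Smarket workflow, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 100 rows, one feature column.
df = pd.DataFrame({"x": np.arange(100)})

rng = np.random.default_rng(15)        # fixed seed so the split is repeatable
shuffled = rng.permutation(len(df))    # row positions in random order

n_train = int(0.8 * len(df))           # 80% of the rows go to training
train_part = df.iloc[shuffled[:n_train]]
test_part = df.iloc[shuffled[n_train:]]

print(len(train_part), len(test_part))  # 80 20
```

Every row lands in exactly one of the two parts, so the model never sees the rows it is later scored on.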

We'll explore more advanced cross-validation techniques in later missions; this mission focuses on holdout validation, the simplest kind. To split the data randomly into a training set and a test set, we'll use the **train_test_split()** function from the scikit-learn package.




In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

In [2]:
# Instructions:
# - Read Smarket.csv into a DataFrame named stocks.
# - Use the train_test_split() function from the scikit-learn package to split it.
stocks = pd.read_csv('Data/Smarket.csv')

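# Create 0/1 dummy columns ("Down", "Up") from Direction;
# dtype=int keeps them numeric on pandas versions that default to bool
stocks_up = pd.get_dummies(stocks['Direction'], dtype=int)
# Join the dummy variables to the main dataframe
stocks_new = pd.concat([stocks, stocks_up], axis=1)

# Label is 1 when the market went up, 0 otherwise
stocks_new["actual_label"] = stocks_new["Up"]
# Squared volume, used later as an extra predictor
stocks_new["Volume_sq"] = stocks_new["Volume"] ** 2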

train, test = train_test_split(stocks_new, train_size=0.8, random_state=15)
train.head()

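| Unnamed: 0.1 | Unnamed: 0 | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction | Down | Up | actual_label | Volume_sq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 310 | 311 | 2002 | 1.133 | -0.666 | 0.228 | -0.321 | 0.084 | 1.4479 | -2.369 | Down | 1 | 0 | 0 | 2.096414 |
| 136 | 137 | 2001 | 1.608 | -1.627 | -1.637 | -0.343 | 0.605 | 1.2807 | 1.045 | Up | 0 | 1 | 1 | 1.640192 |
| 566 | 567 | 2003 | 0.63 | 1.95 | -0.376 | 0.646 | -1.4 | 1.4602 | -1.224 | Down | 1 | 0 | 0 | 2.132184 |
| 556 | 557 | 2003 | 1.214 | -1.774 | -0.578 | -0.164 | -0.548 | 1.4616 | 2.612 | Up | 0 | 1 | 1 | 2.136275 |
| 813 | 814 | 2004 | -0.665 | -0.209 | 0.767 | 0.851 | 0.529 | 1.4588 | -0.106 | Down | 1 | 0 | 0 | 2.128097 |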
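The `random_state` argument is what makes a split repeatable. The sketch below illustrates this on a small hypothetical frame (`toy` and the `a_`/`b_`/`c_` names are invented for the example): the same seed reproduces the same rows, while a different seed will typically select different ones.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; it only stands in for a real dataset.
toy = pd.DataFrame({"x": np.arange(50), "y": np.arange(50) % 2})

a_train, a_test = train_test_split(toy, train_size=0.8, random_state=15)
b_train, b_test = train_test_split(toy, train_size=0.8, random_state=15)
c_train, c_test = train_test_split(toy, train_size=0.8, random_state=16)

print(len(a_train), len(a_test))           # 40 10
print(a_test.index.equals(b_test.index))   # True: same seed, same split
print(a_test.index.equals(c_test.index))   # typically False: different seed
```

Fixing the seed matters when two models are compared, because both should be scored on exactly the same held-out rows.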


## Train a Logistic Regression model with the train dataset
Now that we've split the dataset into a training set and a test set, we can:

1. train a logistic regression model on just the *training* set,
2. use the model to predict labels for the *test* set.

In [3]:
logreg = LogisticRegression()

# predictor columns for the first model
x_columns = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"]

# fit a logistic regression model on the training set
logreg.fit(train[x_columns], train["actual_label"])
# coefficients are printed in the same order as x_columns
print(logreg.coef_)

[[-0.07821872 -0.01875277  0.01126101  0.03855466 -0.00392213  0.20445724]]
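The printed array is easier to read once each coefficient is paired with its column name. The following is a small sketch that copies the values from the output above; the names `coef_by_feature` and `odds_ratios` are introduced here for illustration. Exponentiating a logistic regression coefficient gives the multiplicative change in the odds of "Up" for a one-unit increase in that predictor.

```python
import numpy as np
import pandas as pd

x_columns = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"]
# Coefficients copied from the printed output above.
coef = np.array([[-0.07821872, -0.01875277, 0.01126101,
                  0.03855466, -0.00392213, 0.20445724]])

# coef has shape (1, n_features) for a binary problem, so take row 0.
coef_by_feature = pd.Series(coef[0], index=x_columns)
odds_ratios = np.exp(coef_by_feature)

print(coef_by_feature.sort_values())
print(odds_ratios.round(3))
```

Keep in mind that scikit-learn's `LogisticRegression` applies L2 regularization by default, so these coefficients are shrunk toward zero relative to an unpenalized fit.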




## Evaluate model performance with the test dataset

Evaluate the accuracy of the predicted labels for the test set.

In [4]:
# work on a copy so adding columns does not trigger SettingWithCopyWarning
test = test.copy()

# use the model to predict labels for the test dataset
fitted_labels = logreg.predict(test[x_columns])
test["predicted_label"] = fitted_labels

# accuracy = share of test rows where the prediction matches the actual label
matches = test["predicted_label"] == test["actual_label"]
correct_predictions = test[matches]
accuracy = len(correct_predictions) / len(test)
print("The accuracy level for Reg1 is:", accuracy)

# mean_squared_error expects (y_true, y_pred); for 0/1 labels it equals 1 - accuracy
reg_mse = mean_squared_error(test["actual_label"], test["predicted_label"])
print('Reg1 MSE: %.4f' % reg_mse)

# In this case, logistic regression correctly predicted the movement of the market 48.4% of the time.

The accuracy level for Reg1 is: 0.484
Reg1 MSE: 0.5160


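The two numbers reported for Reg1 sum to one (0.484 + 0.516), and that is not a coincidence. When both the actual and predicted labels are 0 or 1, each squared error is 1 for a miss and 0 for a hit, so the MSE is simply the misclassification rate. A minimal sketch with hypothetical labels (not the Smarket results), using scikit-learn's `accuracy_score` as a shortcut for the manual count:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical labels, chosen only to illustrate the relationship.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0])

manual_accuracy = np.mean(y_true == y_pred)   # share of matching labels
accuracy = accuracy_score(y_true, y_pred)     # same quantity via scikit-learn
mse = mean_squared_error(y_true, y_pred)      # share of mismatches for 0/1 labels

print(manual_accuracy, accuracy, mse)         # 0.625 0.625 0.375
```

For a classifier, accuracy alone therefore carries all the information that the MSE of the hard labels would add.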


## Generate a new model with Volume_sq

In [5]:
# create a new logistic regression model with the volume_sq variable
x_columns_2 = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume", "Volume_sq"]

logreg_2 = LogisticRegression()
# fit a logistic model with the train dataset
logreg_2.fit(train[x_columns_2], train["actual_label"])
print(logreg_2.coef_)

[[-0.07810795 -0.01847347  0.01161489  0.03876639 -0.00363947  0.0987837
   0.03457029]]




## Evaluate the accuracy of the new model

In [6]:
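fitted_labels_2 = logreg_2.predict(test[x_columns_2])
# store the second model's predictions (fitted_labels_2, not fitted_labels)
test["predicted_label_2"] = fitted_labels_2

matches_2 = test["predicted_label_2"] == test["actual_label"]
correct_predictions_2 = test[matches_2]
accuracy_2 = len(correct_predictions_2) / len(test)
print("The accuracy level of the new model = ", accuracy_2)

The accuracy level of the new model =  0.484

Note that this recorded 0.484 was produced with `fitted_labels` (the first model's predictions) assigned to `predicted_label_2`, so it restates the first model's accuracy rather than measuring the second model. The second model's accuracy has to come from running the cell with `fitted_labels_2`.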


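When several feature sets are compared on the same holdout split, wrapping the fit-and-score steps in one function keeps each model tied to its own predictions. The sketch below is one possible way to do that; `holdout_accuracy` is a hypothetical helper, and the data is synthetic rather than Smarket, so the printed numbers say nothing about the stock models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def holdout_accuracy(train_df, test_df, feature_cols, label_col):
    """Fit on train_df, predict on test_df, return the test accuracy."""
    model = LogisticRegression()
    model.fit(train_df[feature_cols], train_df[label_col])
    predictions = model.predict(test_df[feature_cols])
    return float(np.mean(predictions == test_df[label_col].to_numpy()))


# Synthetic stand-in data (hypothetical, not Smarket).
rng = np.random.default_rng(0)
data = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
data["f1_sq"] = data["f1"] ** 2
data["label"] = (data["f1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# One split, reused for both models so the comparison is like for like.
tr, te = train_test_split(data, train_size=0.8, random_state=15)

acc_1 = holdout_accuracy(tr, te, ["f1", "f2"], "label")
acc_2 = holdout_accuracy(tr, te, ["f1", "f2", "f1_sq"], "label")
print(acc_1, acc_2)
```

Because the predictions never leave the function, there is no shared variable that a second model could accidentally reuse.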
  
