## **Predictive Modeling**
## **Demonstrate data splitting - Training-Validation-Testing - Cross Validations**

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
# Read the West Roxbury housing CSV file into a DataFrame
housing_df = pd.read_csv('https://raw.githubusercontent.com/reisanar/datasets/master/WestRoxbury.csv')

In [3]:
housing_df.head(2)

Unnamed: 0,TOTAL VALUE,TAX,LOT SQFT,YR BUILT,GROSS AREA,LIVING AREA,FLOORS,ROOMS,BEDROOMS,FULL BATH,HALF BATH,KITCHEN,FIREPLACE,REMODEL
0,344.2,4330,9965,1880,2436,1352,2.0,6,3,1,1,1,0,
1,412.6,5190,6590,1945,3108,1976,2.0,10,4,2,1,1,0,Recent


- *Recap steps or jump directly to "Predictive Power and Overfitting" code*

In [5]:
housing_df.REMODEL.dtype

dtype('O')

In [6]:
housing_df.columns = [s.strip().replace(" " , "_") for s in housing_df.columns]

In [7]:
housing_df.REMODEL = housing_df.REMODEL.astype("category")

In [8]:
housing_df.REMODEL.cat.categories # Show the category labels

Index(['None', 'Old', 'Recent'], dtype='object')

In [12]:
housing_df.REMODEL.dtype # Check type of converted variable

CategoricalDtype(categories=['None', 'Old', 'Recent'], ordered=False)

## **Creating Dummy Variables in pandas**

In [10]:
# drop_first=True drops the first dummy variable; REMODEL has 3 categories ('None', 'Old', 'Recent'), so 'None' becomes the baseline
housing_df = pd.get_dummies(housing_df, prefix_sep="_", drop_first=True)

In [11]:
housing_df.columns

Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT', 'YR_BUILT', 'GROSS_AREA',
       'LIVING_AREA', 'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH', 'HALF_BATH',
       'KITCHEN', 'FIREPLACE', 'REMODEL_Old', 'REMODEL_Recent'],
      dtype='object')

In [12]:
housing_df.loc[:, "REMODEL_Old":"REMODEL_Recent"].head(2)

Unnamed: 0,REMODEL_Old,REMODEL_Recent
0,0,0
1,0,1
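The effect of `drop_first=True` can be checked on a small example. The sketch below uses a hypothetical three-row frame named `toy` (its values are made up for illustration and are not taken from the West Roxbury data); only the category labels match the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for housing_df (values are illustrative only)
toy = pd.DataFrame({
    "TOTAL_VALUE": [344.2, 412.6, 330.1],
    "REMODEL": pd.Categorical(
        ["None", "Recent", "Old"],
        categories=["None", "Old", "Recent"],
    ),
})

# With drop_first=True the first category ("None") gets no column of its own:
# a row with REMODEL_Old == 0 and REMODEL_Recent == 0 is a "None" record.
dummies = pd.get_dummies(toy, prefix_sep="_", drop_first=True)
dummy_columns = list(dummies.columns)
print(dummy_columns)
```

Dropping one dummy avoids redundant information: with three categories, two indicator columns are enough to identify every record's category.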


## **Normalizing (Standardizing) and Rescaling Data**

- *Seen last class*

## **Predictive Power and Overfitting**

In supervised learning, a key question presents itself: How well will our prediction or classification model perform when we apply it to new data? We are particularly interested in comparing the performance of various models so that we can choose the one we think will do best when it is implemented in practice. A key concept is to make sure that our chosen model **generalizes** beyond the dataset that we have at hand. To assure **generalization**, we use the concept of **data partitioning** and try to avoid **overfitting**. These two important concepts are described next.

- *Figure: Overfitting. This function fits the data with no error.*

Somewhat surprisingly, even if we know for a fact that a higher-degree curve is the appropriate model, if the model-fitting dataset is not large enough, a lower-degree function (that is not as likely to fit the noise) is likely to perform better in terms of predicting new values. **Overfitting** can also result from the application of many different models, from which the best performing model is selected.
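The idea of a curve that fits the data with no error can be illustrated with a small sketch. It assumes synthetic data (a linear trend plus random noise, chosen only for illustration) and compares a degree-1 and a degree-7 polynomial fitted to eight points. The helper `mse` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): linear trend plus noise
x_train = np.linspace(0, 1, 8)
y_train = 2.0 * x_train + rng.normal(scale=0.3, size=x_train.size)
x_new = np.linspace(0.05, 0.95, 50)
y_new = 2.0 * x_new + rng.normal(scale=0.3, size=x_new.size)

def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

errors = {}
for degree in (1, 7):
    coefs = np.polyfit(x_train, y_train, degree)
    # Store (error on the fitting data, error on new data) for each degree
    errors[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_new, y_new))
    print(f"degree {degree}: fit MSE = {errors[degree][0]:.4f}, "
          f"new-data MSE = {errors[degree][1]:.4f}")
```

A degree-7 polynomial passes through all eight fitting points, so its error on those points is essentially zero. Its error on the new points is typically much larger than that of the straight line, because it has also fitted the noise.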

## **Creation and Use of Data Partitions**

## **Training Partition**

The training partition, typically the largest partition, contains the data used to build the various models we are examining. The same training partition is generally used to develop multiple models.

## **Validation Partition**

The validation partition (sometimes called the test partition) is used to assess the predictive performance of each model so that you can compare models and choose the best one. In some algorithms (e.g., classification and regression trees, k-nearest neighbors), the validation partition may be used in an automated fashion to tune and improve the model.

## **Test Partition**

The test partition (sometimes called the holdout or evaluation partition) is used to assess the performance of the chosen model with new data.

Why have both a validation and a test partition? When we use the validation data to assess multiple models and then choose the model that performs best with the validation data, we encounter another (lesser) facet of the **overfitting** problem: chance aspects of the validation data that happen to match the chosen model better than they match other models. In other words, by using the validation data to choose one of several models, the performance of the chosen model on the validation data will be overly optimistic. The random features of the validation data that enhance the apparent performance of the chosen model will probably not be present in new data to which the model is applied. Therefore, we may have overestimated the accuracy of our model.

In [13]:
trainData, validData = train_test_split(housing_df, test_size=0.40,random_state=1)

## **What is train_test_split?**

`train_test_split` is a function in scikit-learn's `model_selection` module that splits arrays or DataFrames into two random subsets, one for training and one for testing (or validation), so you don't need to divide the dataset manually. By default the partition differs from run to run; specifying `random_state` makes the split reproducible.

## **train_test_split(X, y, train_size=0.*, test_size=0.*, random_state=*)**

- `X, y`: the arrays or DataFrames to split. All of them must have the same number of rows. A single dataset can also be passed, as in the cell above.
- `train_size`: the size of the training set. There are three options: `None` (the default), an int giving the exact number of training samples, or a float between 0.0 and 1.0 giving the proportion of the data.
- `test_size`: the size of the test set, with the same three options. If it is `None`, it is set to the complement of `train_size`; if `train_size` is also `None`, it is set to 0.25.
- `random_state`: with the default `None`, the shuffle uses NumPy's global random state, so each run gives a different split. Passing an integer fixes the seed so that the same split is produced every time.
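A minimal sketch of these parameters in use, assuming a small synthetic `X` and `y` (made up here for illustration) rather than the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): 20 records, 2 predictors, 1 outcome
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Passing X and y together keeps predictor rows and outcomes aligned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
print(X_train.shape, X_test.shape)

# The same random_state reproduces the same partition
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
same_split = np.array_equal(X_train, X_train_again)

# With neither train_size nor test_size given, 25% of the rows go to the test set
X_train_default, X_test_default = train_test_split(X, random_state=1)
print(len(X_train_default), len(X_test_default))
```

Fixing `random_state` is mainly useful for teaching and debugging, since it lets everyone obtain identical partitions.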

In [14]:
import sklearn.model_selection as model_selection

In [15]:
trainData, validData = model_selection.train_test_split(housing_df, test_size=0.40, random_state=1)
print("Training : ", trainData.shape)
print("Validation : ", validData.shape)
print()

- *Training (50%), then splitting the remaining 50% into validation (30%) and test (20%)*

In [18]:
# training (50%), validation (30%), and test (20%)
trainData, temp = model_selection.train_test_split(housing_df, test_size=0.5, random_state=1)

In [19]:
# test_size=0.4 is relative to temp (50% of the data): 60% of temp is validation, 40% of temp is test
validData, testData = model_selection.train_test_split(temp, test_size=0.4, random_state=1)
print("Training : ", trainData.shape)
print("Test : ", testData.shape)
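Because the second call to `train_test_split` works on what is left after the first, its `test_size` is a fraction of the remainder, not of the full dataset. The sketch below wraps the two calls in a hypothetical helper, `three_way_split` (not part of scikit-learn), that takes fractions of the full dataset and converts them. The frame `demo` is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(df, valid_frac, test_frac, random_state=1):
    """Split df into train/validation/test; fractions refer to the FULL dataset."""
    holdout_frac = valid_frac + test_frac
    train, temp = train_test_split(
        df, test_size=holdout_frac, random_state=random_state
    )
    # Convert the test fraction so it is relative to temp, not to df
    valid, test = train_test_split(
        temp, test_size=test_frac / holdout_frac, random_state=random_state
    )
    return train, valid, test

# Synthetic frame (illustrative only) with 1000 rows
demo = pd.DataFrame({"x": np.arange(1000)})
train, valid, test = three_way_split(demo, valid_frac=0.3, test_frac=0.2)
print(len(train), len(valid), len(test))
```

With `valid_frac=0.3` and `test_frac=0.2`, the second split uses `test_size=0.4`, which matches the value used in the cell above.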

## **Cross-Validation**

When the number of records in our sample is small, **data partitioning** might not be advisable, as each partition will contain too few records for model building and performance evaluation. Furthermore, some data mining methods are sensitive to small changes in the training data, so that a different partitioning can lead to different results. An alternative to **data partitioning** is cross-validation, which is especially useful with small samples.

Cross-validation, or k-fold cross-validation, is a procedure that starts with partitioning the data into "folds," or non-overlapping subsamples. Often we choose k = 5 folds, meaning that the data are randomly partitioned into five equal parts, where each fold has 20% of the observations. A model is then fit k times. Each time, one of the folds is used as the validation set and the remaining k − 1 folds serve as the training set. The result is that each fold is used once as the validation set, thereby producing predictions for every observation in the dataset. We can then combine the model's predictions on each of the k validation sets in order to evaluate the overall performance of the model.

In Python, cross-validation is achieved using the `cross_val_score()` function or the more general `cross_validate()` function, where the argument `cv` determines the number of folds. Sometimes cross-validation is built into a data mining algorithm, with the results of the cross-validation used for choosing the algorithm's parameters.
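The procedure can be sketched with `cross_val_score()` and a linear regression. The data here are synthetic (three predictors with made-up coefficients and a little noise), not the housing data, and the choice of R² as the score is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)

# Synthetic regression data (illustrative only): 100 records, 3 predictors
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Five folds; shuffling assigns records to folds at random
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The model is fit five times; each fold serves once as the validation set
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print("R2 per fold:", np.round(scores, 3))
print("Mean R2    :", round(float(scores.mean()), 3))
```

Passing an integer such as `cv=5` also works; a `KFold` object is used here so that the folds are shuffled with a fixed seed.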