# `sklearn` 

**More model building with `sklearn`**

## Project Contents

- [The Full Data Science Process](#The-Full-Data-Science-Process)  
- [Fitting Models in sklearn](#Fitting-Models-in-sklearn)  
- [Fitting Our Own Model](#Fitting-Our-Own-Model)


## Overview

The primary task of this project is to create an `sklearn` model that gives good predictions on an unseen set of data. Building a model and evaluating it on out-of-sample data is an important benchmark in the Data Science / Machine Learning process.  

The project starts out with an example of the model building process on the synthetic office data-set. This will include creating a hold-out test sample - a technique not yet demonstrated.  

After building that model, there will be a review of a couple of the model types available in `sklearn`, and finally you will be asked to pick, train, and submit a model for testing on data to which you do not have access.

## Activities in this Project
- Fit an `sklearn` model of your choice that will make good predictions on a derived set of data  

### The Full Data Science Process

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

%matplotlib inline

**Reading-In and Cleaning Up The Data**  


The first thing to do is to take a hold-out set. This differs from the training and testing sets in that it will only be used at the very end of our work, in order to select between the various models that we tuned using the training and testing sets.  

Using the train/test parlance, the hold-out set might be considered the "exam" data. For this we will select 10% of the data.

In [2]:
holdout_path = '../resource/asnlib/publicdata/holdout.csv'
# holdout_indicies = np.random.choice(df.index, replace = False, size = round(len(df.index) * .1))
# pd.Series(holdout_indicies).to_csv(holdout_path, index = False, header = False)

holdout_indicies = pd.read_csv(holdout_path, header= None).values.flatten()
holdout_indicies[:10]

array([16177,  6687, 11676,  8850, 24789, 16988, 43580, 24626,  1692,
       13685])

With the indices for the holdout data in hand, we will read in all the data and separate the holdout data from the training data.

In [3]:
# Read in full data
data_path = "../resource/asnlib/publicdata/office_supply.csv"
full_df = pd.read_csv(data_path)

# Separate fitting data from exam data
holdout_data = full_df.loc[holdout_indicies, :]
training_data = full_df.drop(holdout_indicies, axis = 'rows')
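
The hold-out separation can be tried on a small scale. The sketch below is only an illustration: it uses a toy DataFrame with made-up column names (`x`, `y`) instead of the office data, and a seeded NumPy generator instead of the saved index file, so every name in it is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the office data; the column names are hypothetical
toy_df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) * 2.0})

# Draw 10% of the index labels without replacement (seeded for repeatability)
rng = np.random.default_rng(0)
n_holdout = round(len(toy_df) * 0.1)
toy_holdout_idx = rng.choice(toy_df.index.to_numpy(), size=n_holdout, replace=False)

# Rows with those labels become the hold-out set; the rest stay for fitting
toy_holdout = toy_df.loc[toy_holdout_idx]
toy_fitting = toy_df.drop(index=toy_holdout_idx)

print(len(toy_holdout), len(toy_fitting))
```

Saving the drawn indices to a file, as the commented-out lines in the cell above do, keeps the same rows in the hold-out set every time the notebook is rerun.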

**Below are the functions which will be used to preprocess the data**

In [4]:
# Function for preprocessing data
def office_preprocess(X,y):
    # Hard-code lists for dropping and to_bool
    # Dropped variables include dates and features with many missing values
    to_drop = ['date_of_last_transaction', 'date_of_first_purchase',
               'customer_number', 'language',
               'last_transaction_channel', 'number_of_employees']
    to_bool = ['desk', 'executive_chair', 'standard_chair',
               'monitor', 'printer','computer', 'insurance',
               'toner', 'office_supplies']
    # Hard-code values for notice, auto, and prem
    notice = "NOTICE"
    auto = "AUTO RENEW"
    prem = "Premier"

    # Function to convert and fill "Y/N" features
    def convert_fill_bool(val):
        if val == 'Y': return True
        else: return False

    # Function to encode the service as "premium" : true or false
    def encode_service(val):
        if val == prem: return True
        else: return False

    # Function to encode the repurchase feature into two columns: "notice" true/false and "auto_renew" true/false
    # "payment" plan implied by "false" in "notice" and "auto_renew" columns
    def encode_repurchase(series):

        def notice_encode(val):
            if val == notice: return True
            else: return False

        def auto_renew_encode(val):
            if val == auto: return True
            else: return False

        ser_notice = series.apply(notice_encode)
        ser_notice.name = "repurchase_notice"
        ser_auto = series.apply(auto_renew_encode)
        ser_auto.name = "repurchase_auto"

        return pd.concat([ser_notice, ser_auto], axis = 'columns')

    # Function to transform campaign_period_sales to a float
    def transform_target(raw):
        # make sure the value is initially cast as a string
        raw = str(raw)

        # determine if negative or not
        if raw.count("(") > 0: sign = -1
        else: sign = 1

        # remove all spaces, commas, dollar signs, and parentheses
        for to_rem in [" ",",","$", "(",")"]:
            raw = raw.replace(to_rem,"")
        return sign *float(raw)

    y_trans = y.apply(transform_target)

    X_trans = X.drop(to_drop, axis = 'columns')

    for col in to_bool:
        X_trans[col] = X_trans[col].apply(convert_fill_bool)

    X_trans['premier_service'] = X_trans['service_level'].apply(encode_service)
    X_trans.drop('service_level', axis = 'columns', inplace = True)

    repurch = encode_repurchase(X_trans['repurchase_method'])
    X_trans = pd.concat([X_trans.drop('repurchase_method', axis = 'columns'), repurch], axis = 'columns')

    return X_trans, y_trans

def rename_columns(df):
    df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]
    return df

def pull_out_target_pass_to_preprocess(df):
    # Pull out target and explanatory variables
    X = df.drop('campaign_period_sales', axis = 'columns')
    y = df['campaign_period_sales']

    X, y = office_preprocess(X,y)

    return pd.concat([y,X],axis = 'columns')
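
To see what the target transformation does to individual values, here is a small standalone sketch of the same idea. The function name `parse_currency` and the example strings are made up for illustration; the logic mirrors `transform_target` above, where parentheses mark a negative amount.

```python
def parse_currency(raw):
    """Convert strings such as '$1,234.50' or '($153.82)' to a float."""
    text = str(raw)
    # Accounting notation wraps negative amounts in parentheses
    sign = -1 if "(" in text else 1
    # Strip spaces, thousands separators, dollar signs, and parentheses
    for symbol in (" ", ",", "$", "(", ")"):
        text = text.replace(symbol, "")
    return sign * float(text)


examples = ["$1,234.50", "($153.82)", " 107.16 "]
parsed = [parse_currency(value) for value in examples]
print(parsed)  # [1234.5, -153.82, 107.16]
```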

In [5]:
# Perform pre-processing on both holdout and fitting data
holdout_data = rename_columns(holdout_data)
holdout_data = pull_out_target_pass_to_preprocess(holdout_data)

training_data = rename_columns(training_data)
training_data = pull_out_target_pass_to_preprocess(training_data)

print(holdout_data.head(1))
print(training_data.head(1))

       campaign_period_sales  number_of_transactions  \
16177                 153.82                      13   

       do_not_direct_mail_solicit  do_not_email  do_not_telemarket  \
16177                       False          True              False   

       email_available  desk  executive_chair  standard_chair  monitor  \
16177             True  True            False           False    False   

       printer  computer  insurance  toner  office_supplies  premier_service  \
16177    False     False      False  False             True             True   

       repurchase_notice  repurchase_auto  
16177              False             True  
   campaign_period_sales  number_of_transactions  do_not_direct_mail_solicit  \
0                 107.16                      20                       False   

   do_not_email  do_not_telemarket  email_available   desk  executive_chair  \
0         False              False            False  False            False   

   standard_chair  monitor  

Now we have our holdout data, which will be used for the final evaluation of our model, giving an estimate of how well we can expect the model to perform on unseen data.  

To select our model, we need to create our train_test_split. At this point we will separate out our target (y) and our explanatory data (X):

In [7]:
X = training_data.drop('campaign_period_sales', axis = 'columns')
y = training_data['campaign_period_sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25)
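
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split. A minimal sketch of making the split repeatable with the `random_state` argument follows; the toy arrays and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 observations with 2 features each
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The same random_state gives the same split on every call
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(Xa_train.shape, Xa_test.shape)     # (15, 2) (5, 2)
print(np.array_equal(Xa_test, Xb_test))  # True
```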

Finally we can proceed with model building. Our target is a continuous variable - campaign period sales.  

Below, two tree models and a simple linear regression are fit and used to predict on our test set.

In [8]:
# Instantiate Model
dt = DecisionTreeRegressor()
rf = RandomForestRegressor(n_estimators=100)
lr = LinearRegression()

# Fit Models
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)

# Find score on testing data
print("Decision Tree r2 Score:", dt.score(X_test, y_test))
print("Random Forest r2 Score:", rf.score(X_test, y_test) )
print("Linear Regression r2 Score:", lr.score(X_test, y_test))

Decision Tree r2 Score: 0.45292288735397995
Random Forest r2 Score: 0.49727241250938314
Linear Regression r2 Score: 0.4967582153786629


The decision tree's $R^2$ score is slightly worse than the other two, which are nearly identical.  

Let's decide to use the Random Forest.  

The next step will be to fit the Random Forest on both the training and testing data, and _then_ we can see how well it performs on the holdout data.

In [9]:
# Fitting model on ALL the training data
rf.fit(X,y)

# Splitting out target in the holdout data
holdout_target = holdout_data['campaign_period_sales']
holdout_explanitory = holdout_data.drop('campaign_period_sales', axis = 'columns')

# Finding r2 score
rf.score(holdout_explanitory, holdout_target)

0.500787676794062

Notably, our model did not perform markedly better (or worse) on the holdout data. This is actually a good thing. At this point we can be fairly confident that our Random Forest model will account for roughly 50% of the variance in our sales data.  

Were the scores to be inconsistent, that would most likely be a sign that the model was getting "lucky" on the data being picked in the holdout-set. 
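
The whole sequence (hold out, split, compare, refit, final check) can be sketched on synthetic data. This is an illustration under assumptions, not the office workflow itself: the data comes from `make_regression`, the hold-out set is carved off with `train_test_split` rather than saved indices, and all variable names are made up.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real data
X_all, y_all = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# Set aside 10% as hold-out, then split the remainder 75/25 into train/test
X_fit, X_hold, y_fit, y_hold = train_test_split(X_all, y_all, test_size=0.10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_fit, y_fit, test_size=0.25, random_state=0)

# Compare candidate models on the test split
candidates = {"tree": DecisionTreeRegressor(random_state=0), "linear": LinearRegression()}
test_scores = {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in candidates.items()}
best_name = max(test_scores, key=test_scores.get)

# Refit the chosen model on train + test, then score once on the hold-out set
best_model = candidates[best_name].fit(X_fit, y_fit)
holdout_score = best_model.score(X_hold, y_hold)

print(best_name, round(test_scores[best_name], 3), round(holdout_score, 3))
```

If the test score and the hold-out score printed at the end are close, that is the same kind of consistency observed for the Random Forest above.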

### Fitting Models in sklearn  

The above section offered a look at the model fitting portion of the data-science process. This section will focus more specifically on the syntax of fitting models in `sklearn`. As was mentioned last week, the syntax is very consistent for all `sklearn` models:  

- Step 1: Instantiate the model: `clf = <Model Type>()`  
- Step 2: Fit the model: `clf.fit(X_train, y_train)`  
- Step 3: Make predictions or find the score: `clf.predict(X_test)` **OR** `clf.score(X_test, y_test)`  
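
As a compact illustration of the three steps, here is a sketch on a tiny made-up dataset where the target is exactly `3 * x + 1`. The arrays and variable names are assumptions chosen so the result is easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four observations of a single feature; the target is exactly 3 * x + 1
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = 3 * X_small.ravel() + 1

# Step 1: instantiate
model = LinearRegression()

# Step 2: fit
model.fit(X_small, y_small)

# Step 3: predict for a new observation, or score against known targets
predictions = model.predict(np.array([[4.0]]))
score = model.score(X_small, y_small)

print(predictions, score)
```

Because the toy target is perfectly linear, the prediction for `x = 4` should come out at about 13 and the score at about 1.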

#### Step 1: Instantiating the model  

Most `sklearn` models can be instantiated with or without specifying options. However, if you want to look at the available options, you should check out the relevant page in the `sklearn` documentation. [Example: Documentation For Decision Tree Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).  

Below a decision tree, random forest and linear regression model are instantiated.  

Note how a parameter is specified for the random forest.

In [10]:
decision_tree = DecisionTreeRegressor()
random_forest = RandomForestRegressor(n_estimators=100)
linear_regression = LinearRegression()

At this point, the variables `decision_tree`, `random_forest`, and `linear_regression` hold instances of `sklearn` models. However, these models have not yet been fitted.  
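
Besides the documentation pages, the options of an instantiated model can be inspected in code with `get_params()`, which every `sklearn` estimator provides. The sketch below also checks for the `tree_` attribute, which a decision tree only gains once it has been fit; the variable names are illustrative.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

forest = RandomForestRegressor(n_estimators=100)
tree = DecisionTreeRegressor()

# get_params() returns a dict of every option, including the defaults
forest_params = forest.get_params()
tree_params = tree.get_params()
print(forest_params["n_estimators"])  # 100
print(tree_params["max_depth"])       # None

# Fitted attributes end in an underscore and do not exist before .fit()
is_fitted = hasattr(tree, "tree_")
print(is_fitted)                      # False
```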

#### Step 2: Fitting the model
All `sklearn` models may be fit with the `.fit()` method.  

The `.fit()` method should be given two arguments: first the explanatory variables (which must be two-dimensional), then the target (which should be one-dimensional).  

Below, the three models are fit with the training data.  

Fitting is typically the longest, most computationally intensive step.  

In [11]:
decision_tree.fit(X_train, y_train)
random_forest.fit(X_train, y_train)
linear_regression.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
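
The two-dimensional requirement for the explanatory variables matters most when there is only one feature. The following sketch, on a made-up two-column DataFrame, shows two ways to get a 2-D input and what happens when a 1-D array is passed instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

frame = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

# Double brackets keep a DataFrame, which is two-dimensional: shape (4, 1)
X_2d = frame[["x1"]]

# Single brackets give a one-dimensional Series; reshape(-1, 1) makes it a column
x_1d = frame["x1"].to_numpy()
X_reshaped = x_1d.reshape(-1, 1)

# Both 2-D forms are accepted by .fit()
LinearRegression().fit(X_2d, frame["y"])
LinearRegression().fit(X_reshaped, frame["y"])

# A 1-D array of explanatory values is rejected with a ValueError
try:
    LinearRegression().fit(x_1d, frame["y"])
    raised = False
except ValueError:
    raised = True

print(X_2d.shape, x_1d.shape, X_reshaped.shape, raised)
```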

#### Step 3: Predictions and/or Scores  

Once the model is fit you may make predictions using the model.  

This may be done (explicitly) with the `.predict()` method or (implicitly) with the `.score()` method.  

The `.predict()` method takes one argument: a numpy array or pandas DataFrame with the same number of columns as the explanatory variables. The output is a vector containing one prediction per observation.

The `.score()` method takes the same array that is given to `.predict()`, plus a vector of true values of the target variable. For these regressors, the output is the $R^2$ score.  

Below, `.predict()` is used with the `decision_tree`; `.score()` is used with the `linear_regression`.  

In [12]:
print(".predict(): ", decision_tree.predict(X_test))
print(".score(): ", linear_regression.score(X_test, y_test))

.predict():  [1970.70166667 1850.4         468.191      ... 3408.25        413.76564286
  463.16342105]
.score():  0.4967582153786629
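
To make the "implicit" prediction inside `.score()` concrete, the sketch below computes $R^2$ three ways on synthetic data: by hand as one minus the ratio of residual to total sum of squares, with `r2_score` on explicit predictions, and with `.score()`. The data and names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a linear signal in two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)
preds = model.predict(X_demo)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - preds) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

# All three routes give the same number
same_as_metric = np.isclose(manual_r2, r2_score(y_demo, preds))
same_as_score = np.isclose(manual_r2, model.score(X_demo, y_demo))
print(same_as_metric, same_as_score)  # True True
```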


The few steps above are all you need to complete the next question.

### Fitting Our Own Model

In [None]:

### Below is the training and testing data.
### In the "Grading" portion of this question there is holdout data (which you cannot see / access)
### We are creating and training a *regressor* which achieves an r^2 score of
### more than .8. 

### We can use LinearRegression, a DecisionTreeRegressor, RandomForestRegressor or any other Regressor
### The regressor MUST implement the `.predict` method (As all sklearn models do)

### It is recommended that we train our model using ONLY the training data (X_train and y_train)

### At the bottom of this cell is some code which demonstrates how the r^2 score will be calculated
### using the "testing" data. If we achieve a score of greater than .8 on that test, we pass.

### Save regressor to the variable "reg"

### DATA
train_path = "../resource/asnlib/publicdata/train.csv"
full = pd.read_csv(train_path)

X_test = full.loc[0:5400, ['x1','x2']]
y_test = full.loc[:5400, 'y']
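### .loc slices by label and includes both endpoints, so training starts at 5401
### to keep row 5400 out of both sets
X_train = full.loc[5401:, ["x1","x2"]]
y_train = full.loc[5401:, "y"]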

### Regression
reg = DecisionTreeRegressor()

### Fit on the training data only, so the testing score below is out-of-sample
reg.fit(X_train, y_train)


### Testing r^2 score
preds = reg.predict(X_test)
score = r2_score(y_test, preds)
print("You achieved an r^2 score of", score, "on the testing data")