_Lambda School Data Science ‚Äî Linear Models_

# Intro to Predictive Modeling

### Objectives
- recognize examples of supervised learning with tabular data
- distinguish between regression problems and classification problems
- explain why overfitting is a problem and model validation is important
- do train/test split
- begin with baselines for regression

I like Brandon Rohrer‚Äôs blog post, [‚ÄúWhat questions can machine learning answer?‚Äù](https://brohrer.github.io/five_questions_data_science_answers.html)

We‚Äôll focus on two of these questions in Unit 2. These are both types of ‚Äúsupervised learning.‚Äù

- ‚ÄúIs this A or B?‚Äù (Classification)
- ‚ÄúHow Much / How Many?‚Äù (Regression)

**This unit, you‚Äôll do four supervised learning projects** with ‚Äútabular data‚Äù (data in tables, like spreadsheets).

- Predict New York City apartment rents <-- **Today, we'll start this project!**
- Predict which water pumps in Tanzania need repairs
- Predict the prices suppliers will quote Caterpillar for industrial parts
- Choose your own labeled, tabular dataset, train a predictive model, and publish a blog post or web app with visualizations to explain your model!

In [79]:
import pandas as pd

# Predict NYC apartment rent üè†üí∏

You'll use a real-world data with rent prices for a subset of apartments in New York City!

Run this code cell to load the dataset: 

In [6]:
LOCAL = '../data/nyc/nyc-rent-2016.csv'
WEB = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/nyc/nyc-rent-2016.csv'

df = pd.read_csv(WEB)
assert df.shape == (48300, 34)

### Install [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling), version >= 2

In [7]:
!pip3 install --upgrade pandas-profiling

Requirement already up-to-date: pandas-profiling in /usr/local/lib/python3.7/site-packages (2.1.0)




In [8]:
import pandas_profiling
pandas_profiling.__version__

'2.1.0'

## Define the problem
- Is this **supervised** learning?

    Yes, we are using labeled data
    
- Is this **tabular** data?

    Yes, it is in csv format
    
- Is this **regression** or **classification**?

    This is Regression, we are predicting a continuous value price.

In [9]:
df.head()
target = 'price'

## Explain why overfitting is a problem and model validation is important

#### Jason Brownlee, [Overfitting and Underfitting With Machine Learning Algorithms](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/)

> The goal of a good machine learning model is to **generalize** well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.

> The cause of poor performance in machine learning is either overfitting or underfitting the data.

> **Overfitting** refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. 

> **Underfitting** refers to a model that can neither model the training data nor generalize to new data.

> Ideally, you want to select a model at the sweet spot between underfitting and overfitting.


#### Rob Hyndman & George Athanasopoulos, [_Forecasting: Principles and Practice_, Chapter 3.4](https://otexts.com/fpp2/accuracy.html), Evaluating forecast accuracy:

> The following points should be noted.

> - A model which fits the training data well will not necessarily forecast well.
> - A perfect fit can always be obtained by using a model with enough parameters.
> - Over-fitting a model to data is just as bad as failing to identify a systematic pattern in the data.

> **The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.**

> When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.

![](https://otexts.com/fpp2/fpp_files/figure-html/traintest-1.png)

> The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.

> Some references describe the test set as the ‚Äúhold-out set‚Äù because these data are ‚Äúheld out‚Äù of the data used for fitting. Other references call the training set the ‚Äúin-sample data‚Äù and the test set the ‚Äúout-of-sample data‚Äù. We prefer to use ‚Äútraining data‚Äù and ‚Äútest data‚Äù in this book.


#### Rachel Thomas, [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)

> An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again. How can this happen?

> One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). 



#### James, Witten, Hastie, Tibshirani, [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 2.2, Assessing Model Accuracy

> In general, we do not really care how well the method works training on the training data. Rather, _we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data._ Why is this what we care about? 

> Suppose that we are interested test data in developing an algorithm to predict a stock‚Äôs price based on previous stock returns. We can train the method using stock returns from the past 6 months. But we don‚Äôt really care how well our method predicts last week‚Äôs stock price. We instead care about how well it will predict tomorrow‚Äôs price or next month‚Äôs price. 

> On a similar note, suppose that we have clinical measurements (e.g. weight, blood pressure, height, age, family history of disease) for a number of patients, as well as information about whether each patient has diabetes. We can use these patients to train a statistical learning method to predict risk of diabetes based on clinical measurements. In practice, we want this method to accurately predict diabetes risk for _future patients_ based on their clinical measurements. We are not very interested in whether or not the method accurately predicts diabetes risk for patients used to train the model, since we already know which of those patients have diabetes.

#### Owen Zhang, [Winning Data Science Competitions](https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions/8)

> Good validation is _more important_ than good models. 

## Wrangle

In [98]:
df['price'].describe()

count    48300.000000
mean      3438.297950
std       1401.422247
min       1025.000000
25%       2495.000000
50%       3100.000000
75%       4000.000000
max       9999.000000
Name: price, dtype: float64

In [58]:
df.describe(exclude='number')

Unnamed: 0,created,description,display_address,street_address,interest_level
count,48300,46879.0,48168,48290,48300
unique,47643,37490.0,8691,15093,3
top,2016-05-14 05:23:52,,Broadway,3333 Broadway,low
freq,3,1614.0,424,174,33270


In [68]:
def basic_wrangle(df):
    df = df.copy()
    
    #Create an is_missing column for every column with any NANs
    for col in df.columns:
        if df[col].isna().sum():
            df[f'{col}_missing'] = df[col].isna()
            
    #Fill numeric columns with mean
    for col in df.select_dtypes(include='number').columns:
        df[col].fillna(value=df[col].mean)
    
    #Fill non-numeric columns with 'MISSING'
    for col in df.select_dtypes(exclude='number').columns:
        df[col].fillna(value='MISSING')
    return df
df_w = basic_wrangle(df)

## Feature Engineering

In [69]:
df.columns
def feature_engineering(df):
    df = df.copy()
    df['description_length'] = len(df['description'])
    df['dt'] = pd.to_datetime(df['created'])
    df['year'] = df['dt'].dt.year
    df['month'] = df['dt'].dt.month
    return df
df_f = feature_engineering(df_w)

## Feature Selection

In [76]:
def FeatureSelect(df):
    drop_me = ['description', 'dt', 'created', 'street_address']
    df = df.drop(columns=drop_me)
    return df

df_e = FeatureSelect(df_f)

## Do train/test split

We have two options for where we choose to split:
- Time
- Random




In [81]:
from sklearn.model_selection import train_test_split as tts

In [82]:
def train_val_test_split(x, y, test_size=.2, stratify=None):
    train_size = 1 - (test_size * 2)
    x_train, x_testval, y_train, y_testval = tts(x, y, train_size=train_size, stratify=stratify)
    x_test, x_val, y_test, y_val = tts(x_testval, y_testval, train_size=.5, stratify=stratify)
    return x_train, x_val, x_test, y_train, y_val, y_test

In [83]:
df_e.shape

(48300, 37)

In [84]:
x_train, x_val, x_test, y_train, y_val, y_test = train_val_test_split(df_e.drop(columns=target), df_e[target], test_size=.2)
x_train.shape, x_val.shape, x_test.shape

((28980, 36), (9660, 36), (9660, 36))

This choice depends on your goals. Rachel Thomas explains why you may want to split on time:

#### Rachel Thomas, [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)

> If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates your are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future). If your data includes the date and you are building a model to use in the future, you will want to choose a continuous section with the latest dates as your validation set


For this project, we'll split based on time. 

- Use data from April & May 2016 to train.
- Use data from June 2016 to test.

(But in some future projects this unit, we'll do random splits, and explain why.)

## Pre-processing

In [86]:
from sklearn.pipeline import make_pipeline
import category_encoders as ce

def pre_processing(train, val, test):
    pipeline = make_pipeline(ce.OrdinalEncoder())
    pipeline.fit(train)
    
    x_train = pd.DataFrame(pipeline.transform(train))
    x_train.columns = train.columns
    x_val = pd.DataFrame(pipeline.transform(val))
    x_val.columns = val.columns
    x_test = pd.DataFrame(pipeline.transform(test))
    x_test.columns = test.columns
    
    return x_train, x_val, x_test

pp_train, pp_val, pp_test = pre_processing(x_train, x_val, x_test)

## Begin with baselines for regression

### Why begin with baselines?

[My mentor](https://www.linkedin.com/in/jason-sanchez-62093847/) [taught me](https://youtu.be/0GrciaGYzV0?t=40s):

>***Your first goal should always, always, always be getting a generalized prediction as fast as possible.*** You shouldn't spend a lot of time trying to tune your model, trying to add features, trying to engineer features, until you've actually gotten one prediction, at least. 

> The reason why that's a really good thing is because then ***you'll set a benchmark*** for yourself, and you'll be able to directly see how much effort you put in translates to a better prediction. 

> What you'll find by working on many models: some effort you put in, actually has very little effect on how well your final model does at predicting new observations. Whereas some very easy changes actually have a lot of effect. And so you get better at allocating your time more effectively.

My mentor's advice is echoed and elaborated in several sources:

[Always start with a stupid model, no exceptions](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa)

> Why start with a baseline? A baseline will take you less than 1/10th of the time, and could provide up to 90% of the results. A baseline puts a more complex model into context. Baselines are easy to deploy.

[Measure Once, Cut Twice: Moving Towards Iteration in Data Science](https://blog.datarobot.com/measure-once-cut-twice-moving-towards-iteration-in-data-science)

> The iterative approach in data science starts with emphasizing the importance of getting to a first model quickly, rather than starting with the variables and features. Once the first model is built, the work then steadily focuses on continual improvement.

[*Data Science for Business*](https://books.google.com/books?id=4ZctAAAAQBAJ&pg=PT276), Chapter 7.3: Evaluation, Baseline Performance, and Implications for Investments in Data

> *Consider carefully what would be a reasonable baseline against which to compare model performance.* This is important for the data science team in order to understand whether they indeed are improving performance, and is equally important for demonstrating to stakeholders that mining the data has added value.

### What does baseline mean?

Baseline is an overloaded term, as you can see in the links above. Baseline has multiple meanings:

#### The score you'd get by guessing

> A baseline for classification can be the most common class in the training dataset.

> A baseline for regression can be the mean of the training labels. 

> A baseline for time-series regressions can be the value from the previous timestep. ‚Äî[Will Koehrsen](https://twitter.com/koehrsen_will/status/1088863527778111488)

#### Fast, first models that beat guessing

What my mentor was talking about.

#### Complete, tuned "simpler" model

Can be simpler mathematically and computationally. For example, Logistic Regression versus Deep Learning.

Or can be simpler for the data scientist, with less work. For example, a model with less feature engineering versus a model with more feature engineering.

#### Minimum performance that "matters"

To go to production and get business value.

#### Human-level performance 

Your goal may to be match, or nearly match, human performance, but with better speed, cost, or consistency.

Or your goal may to be exceed human performance.

In [102]:
x_val['pred']= y_train.mean()
x_test['pred'] = y_train.mean()

## Baseline Evaluation

In [92]:
from sklearn.metrics import mean_squared_error as msqe, r2_score as r2, mean_squared_log_error as msqle

In [103]:
msqe(y_val, x_val['pred'])

1959221.5503116844

In [105]:
msqe(y_test, x_test['pred'])

1987571.105819831

## Model

In [111]:
from xgboost import XGBRegressor as xgb

In [110]:
model = xgb()
model.fit(pp_train, y_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [112]:
yp_val = model.predict(pp_val)

In [113]:
msqe(y_val, yp_val)

441924.9974513354

In [114]:
yp_test = model.predict(pp_test)

In [115]:
msqe(y_test, yp_test)

442489.99748959584