<a href="https://colab.research.google.com/github/Nolanole/DS-Unit-2-Sprint-4-Practicing-Understanding/blob/master/Josh_Mancuso_Monday_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ASSIGNMENT

**1.** Complete the notebook cells that were originally commented **`TODO`**. 

**2.** Then, focus on feature engineering to improve your cross validation scores. Collaborate with your cohort on Slack. You could start with the ideas [Jake VanderPlas suggests:](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html#Example:-Predicting-Bicycle-Traffic)

> Our model is almost certainly missing some relevant information. For example, nonlinear effects (such as effects of precipitation and cold temperature) and nonlinear trends within each variable (such as disinclination to ride at very cold and very hot temperatures) cannot be accounted for in this model. Additionally, we have thrown away some of the finer-grained information (such as the difference between a rainy morning and a rainy afternoon), and we have ignored correlations between days (such as the possible effect of a rainy Tuesday on Wednesday's numbers, or the effect of an unexpected sunny day after a streak of rainy days). These are all potentially interesting effects, and you now have the tools to begin exploring them if you wish!

**3.** Experiment with the Categorical Encoding notebook.

**4.** At the end of the day, take the last step in the "universal workflow of machine learning" — "You can train your final production model on all the available data (training and validation) and evaluate it one last time on the test set."

See the [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) documentation for the `refit` parameter, `best_estimator_` attribute, and `predict` method:

> **refit : boolean, or string, default=True**

> Refit an estimator using the best found parameters on the whole dataset.

> The refitted estimator is made available at the `best_estimator_` attribute and permits using `predict` directly on this `GridSearchCV` instance.

### STRETCH

**A.** Apply this lesson other datasets you've worked with, like Ames Housing, Bank Marketing, or others.

**B.** In additon to `RandomizedSearchCV`, scikit-learn has [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Another library called scikit-optimize has [`BayesSearchCV`](https://scikit-optimize.github.io/notebooks/sklearn-gridsearchcv-replacement.html). Experiment with these alternatives.

**C.** _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?

In [79]:
!curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1616k    0 1616k    0     0   749k      0 --:--:--  0:00:02 --:--:--  748k


In [80]:
!wget https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/BicycleWeather.csv

--2019-05-13 23:03:27--  https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/BicycleWeather.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 234945 (229K) [text/plain]
Saving to: ‘BicycleWeather.csv.9’


2019-05-13 23:03:28 (4.52 MB/s) - ‘BicycleWeather.csv.9’ saved [234945/234945]



In [0]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [0]:
# Modified from cells 15, 16, and 20, at
# https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html#Example:-Predicting-Bicycle-Traffic

# Download and join data into a dataframe
def load(): 
    fremont_bridge = 'https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD'
    
    bicycle_weather = 'https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/BicycleWeather.csv'

    counts = pd.read_csv(fremont_bridge, index_col='Date', parse_dates=True, 
                         infer_datetime_format=True)

    weather = pd.read_csv(bicycle_weather, index_col='DATE', parse_dates=True, 
                          infer_datetime_format=True)

    daily = counts.resample('d').sum()
    daily['Total'] = daily.sum(axis=1)
    daily = daily[['Total']] # remove other columns

    weather_columns = ['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'AWND']
    daily = daily.join(weather[weather_columns], how='inner')
    
    # Make a feature for yesterday's total
    daily['Total_yesterday'] = daily.Total.shift(1)
    #daily = daily.drop(index=daily.index[0])
    
    return daily

daily = load()

In [0]:
# Modified from code cells 17-21 at
# https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html#Example:-Predicting-Bicycle-Traffic
import numpy as np

def jake_wrangle(X):  
    X = X.copy()

    # patterns of use generally vary from day to day; 
    # let's add binary columns that indicate the day of the week:
    days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    for i, day in enumerate(days):
        X[day] = (X.index.dayofweek == i).astype(float)


    # we might expect riders to behave differently on holidays; 
    # let's add an indicator of this as well:
    from pandas.tseries.holiday import USFederalHolidayCalendar
    cal = USFederalHolidayCalendar()
    holidays = cal.holidays('2012', '2016')
    X = X.join(pd.Series(1, index=holidays, name='holiday'))
    X['holiday'].fillna(0, inplace=True)


    # We also might suspect that the hours of daylight would affect 
    # how many people ride; let's use the standard astronomical calculation 
    # to add this information:
    def hours_of_daylight(date, axis=23.44, latitude=47.61):
        """Compute the hours of daylight for the given date"""
        days = (date - pd.datetime(2000, 12, 21)).days
        m = (1. - np.tan(np.radians(latitude))
             * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))
        return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.

    X['daylight_hrs'] = list(map(hours_of_daylight, X.index))

    
    # temperatures are in 1/10 deg C; convert to C
    X['TMIN'] /= 10
    X['TMAX'] /= 10
    
    # We can also calcuate the average temperature.
    X['Temp (C)'] = 0.5 * (X['TMIN'] + X['TMAX'])

    # precip is in 1/10 mm; convert to inches
    X['PRCP'] /= 254

    # In addition to the inches of precipitation, let's add a flag that 
    # indicates whether a day is dry (has zero precipitation):
    X['dry_day'] = (X['PRCP'] == 0).astype(int)


    # Let's add a counter that increases from day 1, and measures how many 
    # years have passed. This will let us measure any observed annual increase 
    # or decrease in daily crossings:
    X['annual'] = (X.index - X.index[0]).days / 365.

    return X
  
daily = jake_wrangle(daily)
daily = daily.reset_index().rename(columns={'index':'date'})

In [0]:
#Do some more feature engineering:

#is it the weekend:

def weekend(X):
  X = X.copy()
  X['wknd'] = (X.Sat == 1) | (X.Sun == 1)
  X['wknd'] = X['wknd'].astype('int')
  return X

daily = weekend(daily)

#Cold and rainy/snowy:

def cold_precip(X):
  X = X.copy()
  X['cold_precip'] = (X['Temp (C)'] < 10 ) & (X.dry_day ==0)
  X['cold_precip'] = X['cold_precip'].astype('int')
  return X

daily = cold_precip(daily)

#create bool col for extreme temp (below freezing or above 86 F (30C)
def extreme_temp(X):
  X = X.copy()
  X['extreme_temp'] = (X['Temp (C)'] < 0) | (X['Temp (C)'] > 30) 
  X['extreme_temp'] = X['extreme_temp'].astype('int')
  return X

daily = extreme_temp(daily)

#Make col for length of prior consecutive rain days as well as col for a sunny wknd that follows a rainy wknd:
def rain_streak(X):
  X = X.copy()
  X['rain_streak_count'] = 0
  for i in range(1, len(X)):
    X.at[i, 'rain_streak_count'] = 0 if (X.at[i, 'dry_day'] == 1) else (X.at[i-1, 'rain_streak_count'] + 1)
  X['rain_streak_now_sunny'] = (X.rain_streak_count.shift(1) > 1) & (X.dry_day == 1)
  X['rain_streak_now_sunny'] =  X['rain_streak_now_sunny'].astype('int')
  X['rain_streak_now_sunny_length'] =  X['rain_streak_now_sunny'].astype('int') * X['rain_streak_count'].shift(1)
  X['dry_wknd_day_and_rained_last_wknd'] = (X.dry_day==1) & ((X.Sat == 1) & (X.dry_day.shift(7) == 0 & (X.dry_day.shift(6) == 0))) | ((X.Sun == 1) & (X.dry_day.shift(8) == 0) & (X.dry_day.shift(7) == 0) & (X.dry_day.shift(1) == 0))
  X['dry_wknd_day_and_rained_last_wknd'] = X['dry_wknd_day_and_rained_last_wknd'].astype('int')
  X = X.drop(index=X.index[0]).reset_index(drop=True)
  return X

daily = rain_streak(daily)

def convert_date(X):
  X = X.copy()
  def get_month(x):
    return x.month
  def get_day(x):
    return x.day
  X['month'] = X['date'].apply(get_month)
  X['day'] = X['date'].apply(get_day)
  return X

daily = convert_date(daily)

In [0]:
#Fix some wacky values in the SNOW col:
daily['SNOW'] = daily['SNOW'].replace(-9999, 0)

In [0]:
from sklearn.metrics import mean_absolute_error

test = daily[-100:]
train = daily[:-100]
train.shape, test.shape

target = 'Total'
X_train = train.drop(columns=[target, 'date'])
y_train = train[target]

X_test = test.drop(columns=[target, 'date'])
y_test = test[target]

In [87]:
#first test w/ LinReg
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

scores = cross_validate(LinearRegression(), 
                        X_train, 
                        y_train, 
                        scoring="neg_mean_absolute_error",
                        cv=3, 
                        return_train_score=True, 
                        return_estimator=True)

pd.DataFrame(scores)

for i, model in enumerate(scores["estimator"]):
  coefficients = model.coef_
  intercept = model.intercept_
  feature_names = X_train.columns
  
  print(f'Model from cross-validation fols #{i}')
  print("Intercept", intercept)
  print(pd.Series(coefficients, feature_names).to_string())
  print('\n')

Model from cross-validation fols #0
Intercept 340.48533372148904
PRCP                                 -611.149459
SNOW                                    2.621691
SNWD                                   -2.774714
TMAX                                   62.500086
TMIN                                  -32.323433
AWND                                   -1.338357
Total_yesterday                         0.299317
Mon                                   617.279683
Tue                                   203.485488
Wed                                   141.979727
Thu                                    48.699896
Fri                                  -195.900206
Sat                                  -557.444842
Sun                                  -258.099746
holiday                             -1026.758662
daylight_hrs                           71.560777
Temp (C)                               15.088327
dry_day                               372.122879
annual                                 -8.867683
wknd

In [88]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, max_depth=None, n_jobs=-1)

scores = cross_validate(model, 
                        X_train, 
                        y_train, 
                        scoring="neg_mean_absolute_error", 
                        cv=3, 
                        return_train_score=True, 
                        return_estimator=True)

pd.DataFrame(scores)
scores["test_score"].mean()

-313.21847352024923

In [89]:
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 10, None],
    "criterion": ["mae"]
}

gridsearch = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=param_distributions, 
    n_iter=8, 
    cv=3, scoring="neg_mean_absolute_error", 
    verbose=5, 
    return_train_score=True,
    n_jobs=-1)

gridsearch.fit(X_train, y_train)


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:   37.5s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  1.4min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=-1,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=8, n_jobs=-1,
          param_distributions={'n_estimators': [100, 300], 'max_depth': [3, 5, 10, None], 'criterion': ['mae']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='neg_mean_absolute_error',
          verbose=5)

In [90]:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values(by="rank_test_score")

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_max_depth,param_criterion,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
7,11.867215,2.224591,0.113286,0.007195,300,,mae,"{'n_estimators': 300, 'max_depth': None, 'crit...",-347.677471,-314.903785,-273.339211,-311.973489,30.419119,1,-101.878014,-93.048837,-103.104652,-99.343834,4.479316
5,12.279874,0.091461,0.117344,0.001911,300,10.0,mae,"{'n_estimators': 300, 'max_depth': 10, 'criter...",-347.365909,-318.871246,-273.364247,-313.200467,30.475999,2,-124.962653,-113.644626,-128.479097,-122.362125,6.329162
6,4.583032,0.117595,0.118683,0.003246,100,,mae,"{'n_estimators': 100, 'max_depth': None, 'crit...",-350.8981,-313.801916,-277.853551,-314.184522,29.821539,3,-103.374517,-94.423333,-105.724603,-101.174151,4.86901
4,4.163617,0.02619,0.120221,0.002099,100,10.0,mae,"{'n_estimators': 100, 'max_depth': 10, 'criter...",-355.787181,-319.729003,-279.145981,-318.220722,31.30681,4,-127.01028,-114.49088,-129.907173,-123.802778,6.689872
3,8.887595,0.071076,0.109503,0.007272,300,5.0,mae,"{'n_estimators': 300, 'max_depth': 5, 'criteri...",-356.302305,-364.976729,-311.044564,-344.107866,23.64597,5,-262.324377,-236.483154,-264.378842,-254.395458,12.693651
2,3.068998,0.15084,0.11616,0.008399,100,5.0,mae,"{'n_estimators': 100, 'max_depth': 5, 'criteri...",-360.829766,-366.471978,-310.200717,-345.834154,25.301713,6,-262.958123,-237.316332,-264.152266,-254.808907,12.378722
0,2.282674,0.006074,0.10926,0.007181,100,3.0,mae,"{'n_estimators': 100, 'max_depth': 3, 'criteri...",-420.684798,-447.824611,-397.802368,-422.103925,20.446135,7,-396.550919,-351.95655,-389.161394,-379.222954,19.514847
1,6.524556,0.134476,0.116653,0.001089,300,3.0,mae,"{'n_estimators': 300, 'max_depth': 3, 'criteri...",-425.358001,-445.755073,-399.99284,-423.701971,18.719015,8,-400.40148,-351.063806,-391.214574,-380.893287,21.42348


In [96]:
import xgboost as xgb

param_distributions = {
    'n_estimators' : [100, 200, 500],
    'max_depth': [2,3,5,10,50],
    'criterion' : ['mae']
}

gridsearch = RandomizedSearchCV(
    xgb.XGBRegressor(n_jobs=-1, random_state=42),
    param_distributions= param_distributions,
    n_iter=8,
    cv=3,
    scoring='neg_mean_absolute_error',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)

gridsearch.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    5.6s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    6.4s finished
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    6.4s remaining:    0.0s


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, importance_type='gain',
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
       nthread=None, objective='reg:linear', random_state=42, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1),
          fit_params=None, iid='warn', n_iter=8, n_jobs=-1,
          param_distributions={'n_estimators': [100, 200, 500], 'max_depth': [2, 3, 5, 10, 50], 'criterion': ['mae']},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='neg_mean_absolute_error',
          verbose=10)

In [97]:
results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values(by="rank_test_score")

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_max_depth,param_criterion,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
2,0.392386,0.051357,0.007635,0.001722,500,2,mae,"{'n_estimators': 500, 'max_depth': 2, 'criteri...",-227.192625,-311.778166,-253.846503,-264.272431,35.310088,1,-139.202904,-127.597751,-130.548334,-132.449663,4.924847
3,0.135203,0.004577,0.004086,0.000101,100,3,mae,"{'n_estimators': 100, 'max_depth': 3, 'criteri...",-254.639534,-306.908451,-255.102223,-272.216736,24.531474,2,-164.833339,-164.40554,-170.440788,-166.559889,2.749762
0,0.385264,0.04064,0.008424,0.003227,200,5,mae,"{'n_estimators': 200, 'max_depth': 5, 'criteri...",-257.099819,-313.838645,-266.629967,-279.189477,24.807654,3,-35.147918,-36.651196,-39.750693,-37.183269,1.91637
6,0.211701,0.001803,0.00493,0.000108,100,5,mae,"{'n_estimators': 100, 'max_depth': 5, 'criteri...",-262.194168,-308.755083,-267.093963,-279.347738,20.890124,4,-73.489717,-69.109372,-70.31999,-70.973026,1.846925
5,0.987066,0.009773,0.013213,0.000662,500,5,mae,"{'n_estimators': 500, 'max_depth': 5, 'criteri...",-268.311474,-310.433659,-270.854414,-283.199849,19.285175,5,-5.917024,-6.237357,-7.646684,-6.600355,0.751335
1,0.54727,0.047977,0.006903,0.000581,100,50,mae,"{'n_estimators': 100, 'max_depth': 50, 'criter...",-260.535721,-307.769596,-298.837508,-289.047608,20.488067,6,-0.639236,-0.749855,-0.628221,-0.672437,0.054927
4,0.89312,0.005846,0.011426,2.3e-05,200,10,mae,"{'n_estimators': 200, 'max_depth': 10, 'criter...",-259.114431,-305.243336,-307.822155,-290.726641,22.377986,7,-0.197268,-0.288504,-0.183371,-0.223048,0.046631
7,0.406145,0.060942,0.006555,0.000896,100,10,mae,"{'n_estimators': 100, 'max_depth': 10, 'criter...",-258.762087,-305.718756,-307.927652,-290.802831,22.674167,8,-2.618583,-4.446952,-2.417623,-3.161053,0.912962


In [95]:
#Try on the 100 sample withheld test data:

best_XGB = xgb.XGBRegressor(max_depth=2, n_estimators=500, n_jobs=-1, random_state=42)
best_XGB.fit(X_train, y_train)

y_pred = best_XGB.predict(X_test)

mean_absolute_error(y_test,y_pred)

227.77776306152344