# 1) Moneyball dataset

## a) [Linear regression](#linear)

## b) [Lasso Regression](#lasso)

## c) [Random Forest](#rf)

# Load the first dataset

In [None]:
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
input_file = 'baseball.csv'
Mb_data = pd.read_csv(input_file,  sep = ',', header = 0)
Mb_data

# Columns

RS ... Runs Scored

RA ... Runs Allowed

***RD ... Run differential (actually difference)***

W ... Wins

OBP ... On-Base Percentage

SLG ... Slugging Percentage

BA ... Batting Average

Playoffs (binary)

RankSeason

RankPlayoffs

G ... Games Played

OOBP ... Opponent On-Base Percentage

OSLG ... Opponent Slugging Percentage

In [None]:
col_dict = {'RS':  'Runs Scored', 
            'RA':  'Runs Allowed',
            'RD':  'Run differential (actually difference)',
            'W':  'Wins',
            'OBP':  'On-Base Percentage',
            'SLG':  'Slugging Percentage',
            'BA':  'Batting Average',
            'Playoffs': 'playoffs reached (binary)',
            'RankSeason': 'season rank',
            'RankPlayoffs': 'playoff rank',
            'G':  'Games Played',
            'OOBP':  'Opponent On-Base Percentage',
            'OSLG':  'Opponent Slugging Percentage'
           }

<a id='rf'></a>

# c) Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.model_selection import train_test_split
from IPython.display import display
           
df_raw = Mb_data
rf = RandomForestRegressor(n_jobs=1)

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)
            
def add_RD(df):
    # Run differential, computed column-wise (adds the column to df in place)
    df['RD'] = df['RS'] - df['RA']

# First look at the data

In [None]:
display_all(df_raw.tail().transpose())
print('#'*40)
display('Some more info')
print('#'*40)
df_raw.info()  # info() prints its report directly and returns None

# Preprocessing for random forest

In [None]:
df_prep = df_raw.copy()  # work on a copy so that df_raw stays untouched
add_RD(df_prep)
display_all(df_prep.tail().transpose())
df_prep.info()  # info() prints its report directly and returns None

In [None]:
cols_to_drop = ['Team', 'League', 'Year', 'RankSeason', 'RankPlayoffs', 'Playoffs']
df_prep = df_prep.drop(cols_to_drop, axis=1)

# Fix missing values and type
df_prep.replace("?",0, inplace=True)
#df_prep = df_prep[df_prep.OOBP != 0]
df_prep[['OOBP','OSLG']] = df_prep[['OOBP','OSLG']].astype(float)


In [None]:
display(df_prep.columns.values)
display(df_prep.index)

In [None]:
display(df_prep)

In [None]:
# Split into train and test
def split_simple(df, n): 
    '''n... number to split at'''
    return df[:n].copy(), df[n:].copy()

def split_proper(df, test_ratio, seed=42):
    np.random.seed(seed)
    shuffled_indices = np.random.permutation(len(df))
    test_set_size = int(len(df) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return df.iloc[train_indices], df.iloc[test_indices]

# Train for the wins (FTW)

In [None]:
ratio = 0.2 # test/num_samples
num_instances, _ = df_prep.shape
print(f"From {num_instances} using {num_instances*ratio:.0f} for testing and {num_instances*(1-ratio):.0f} for training. Ratio = {ratio*100:.2f}%")

X, y = (df_prep.drop(['W'], axis=1), df_prep.W)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = ratio, random_state = 42)

train_simple, test_simple = split_simple(df_prep, int(num_instances*(1-ratio)))

display(test_simple)
print('\n\n\t\t\t\t\t\tVS\n\n')
display(X_test)

In [None]:
import math
def rmse(x,y): 
    return math.sqrt(((x-y)**2).mean())

def print_score(m, X_train, X_valid, y_train, y_valid, score='r2'):
    '''score ... name of the metric that m.score() returns (plain regressors return R²)'''
    res = {
        'RMS(train)': rmse(m.predict(X_train), y_train),
        'RMS(valid)': rmse(m.predict(X_valid), y_valid)}
    train_score, valid_score = m.score(X_train, y_train), m.score(X_valid, y_valid)
    if score == 'neg_mean_squared_error':
        # e.g. a GridSearchCV fitted with scoring='neg_mean_squared_error': flip the sign to get the RMSE
        res['Model_Score=RMSE'] = [np.sqrt(-train_score), np.sqrt(-valid_score)]
    elif score == 'pos_mean_squared_error':
        res['Model_Score=RMSE'] = [np.sqrt(train_score), np.sqrt(valid_score)]
    else:
        # default: m.score() is R² for scikit-learn regressors
        res['Model_Score=r²'] = [train_score, valid_score]
    if hasattr(m, 'oob_score_'): res['oob_score_'] = m.oob_score_
    display(res)
    return res

# Feature importance
from prettytable import PrettyTable as PT
def print_RF_featureImportance(rf, X):
    table = PT()
    table.field_names = ['Feature', 'Score', 'Comment']
    for name, score in zip(X.columns.values, rf.feature_importances_):
        print(f"{name}: {score:.5f}\t\t... {col_dict[name]}")
        table.add_row([name, round(score, ndigits=4), col_dict[name]])
    print(table)

before = 0

In [None]:
n_cores = 4
rf_W = RandomForestRegressor(n_jobs=n_cores)
# The following code is supposed to fail due to string values in the input data
rf_W.fit(X_train, y_train)
print("Before:")
display(before)
print("Now:")
before = print_score(rf_W, X_train, X_test, y_train, y_test)


In [None]:
print_RF_featureImportance(rf_W, X_train)

In [None]:
rf_W_prediction = rf_W.predict(X_test)

In [None]:
# Residuals of the W model; histplot replaces the deprecated distplot
sns.histplot(y_test - rf_W_prediction, kde=True)

# Target W again, this time without RD as a feature

In [None]:
X, y = (df_prep.drop(['W', 'RD'], axis=1), df_prep.W)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

display(X_test)
before = 0

# Bootstrapping:

Bootstrapping: Selecting rows from a dataset to generate a new dataset of the same size by picking WITH replacement.

Example:

    > DS = [1,2,3,4]
    > could turn into 
    > DS_bootstrapped = [3,2,4,4]
    
Consequences:

- Instances (rows) of the original set can end up duplicated (multiple times) in the resulting dataset.
- Some instances are left out entirely (about 1/3 of them on average) --> "Out-Of-Bag Dataset" (=OOB Dataset)
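
To make the two consequences concrete, here is a minimal sketch that bootstraps a list of row indices and counts what ends up in the bag and what is left out. The dataset size `n`, the seed and all variable names are made up for illustration; they are not part of the Moneyball notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

n = 10_000
indices = np.arange(n)  # stand-in for the row indices of a dataset

# Draw n rows WITH replacement: duplicates are allowed
bootstrap = rng.choice(indices, size=n, replace=True)

in_bag = np.unique(bootstrap)        # rows picked at least once
oob = np.setdiff1d(indices, in_bag)  # rows never picked = OOB dataset

oob_fraction = len(oob) / n
print(f"in-bag (unique rows): {len(in_bag) / n:.3f}")
print(f"out-of-bag:           {oob_fraction:.3f}")
```

Rerunning with a different seed changes which rows are left out, but the out-of-bag share stays close to one third.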

## Using the OOB Dataset

The OOB dataset was not used to construct the tree, so we can actually use it to test our tree and gain some insight into the error measure of the tree.
This error is called the "Out-Of-Bag Error" (OOB error).
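
In scikit-learn the OOB evaluation is switched on with `oob_score=True`; after fitting, the forest exposes `oob_score_` (R² for a regressor). The sketch below uses synthetic data, so `X_demo`, `y_demo` and `rf_demo` are illustrative names and the numbers say nothing about the Moneyball set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem (assumption: not the Moneyball data)
X_demo = rng.normal(size=(500, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=500)

# oob_score=True makes every tree get evaluated on the rows it did not see
rf_demo = RandomForestRegressor(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
)
rf_demo.fit(X_demo, y_demo)

train_r2 = rf_demo.score(X_demo, y_demo)
print(f"R² on the training data: {train_r2:.3f}")
print(f"R² on the OOB samples:   {rf_demo.oob_score_:.3f}")
```

The OOB score is usually lower than the training score and behaves like a built-in validation estimate, which helps when the dataset is too small to spare a separate validation split.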

In [None]:
n_cores = 4
number_of_trees = 1000 # default = 100
rf = RandomForestRegressor(n_jobs=n_cores, n_estimators=number_of_trees, bootstrap=True, oob_score=True) #, verbose=1)

rf.fit(X_train, y_train)
print("Before:")
display(before)
print("Now:")
before = print_score(rf, X_train, X_test, y_train, y_test)
print()
print("Feature importance")
print_RF_featureImportance(rf, X_train)
rf_RD = rf

In [None]:
rfRD_prediction = rf_RD.predict(X_test)

In [None]:
# Residuals of the second model; histplot replaces the deprecated distplot
sns.histplot(y_test - rfRD_prediction, kde=True)

# Optimize Hyperparameters via GridSearch

because we lazy bois

## Notes on the RandomForestRegressor from scikit-learn
The default values for the parameters controlling the size of the trees
(e.g. ``max_depth``, ``min_samples_leaf``, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets. To
reduce memory consumption, the complexity and size of the trees should be
controlled by setting those parameter values.

## Number of variables/features per tree

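A good starting point might be *the square root of the number of features presented to the tree* (a heuristic mostly used for classification; for regression, about a third of the features is also commonly suggested). Then, test some values below and above that starting point.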

In [None]:
def print_GridSearchResult(grid):
    # Use the object that was passed in instead of the global grid_search
    print(grid.best_params_)
    print(grid.best_estimator_)

In [None]:
from numpy import sqrt
num_features = X.shape[1]
print(num_features)
sqrt_num_features = int(round(sqrt(num_features)))  # max_features needs an int, not a float like 3.0
sqrt_num_features

In [None]:
from sklearn.model_selection import GridSearchCV
n_cores = 4
# but since we don't have that many features... we are just going to brute-force it :D
param_grid = [
    {
        'n_estimators': [3, 10, 30, 100, 1000], 'max_features': list(range(1, num_features + 1))
    }
#,{'bootstrap': [False], 'n_estimators': [3, 30, 100, 1000], 'max_features': [2, 3, 4]},
]
k = 10
forest_reg = RandomForestRegressor(n_jobs=n_cores)
grid_search = GridSearchCV(forest_reg, param_grid, n_jobs=n_cores , cv=k, return_train_score=True) #, scoring='neg_mean_squared_error'
grid_search.fit(X_train, y_train)


In [None]:
print_GridSearchResult(grid_search)
print_score(grid_search, X_train, X_test, y_train, y_test, score='')

In [None]:
print("max_features = 8")
{'RMS(train)': 1.6169133890268252,
 'RMS(valid)': 4.158368280598173,
 'Model_Score=r²': (0.980355161825956, 0.8600920766278795)}

# k-fold cross validation

In [None]:
from sklearn.model_selection import cross_val_score
from prettytable import PrettyTable

def display_scores(scores):
    print("Scores:", scores)
    table = PrettyTable()
    table.field_names = ['Run', 'Score']
    for i, score in enumerate(scores):
        table.add_row([i, round(score, 3)])
    print(table)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
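k = 5
model = rf_RD
scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=k)
# scikit-learn returns negated MSE values; flip the sign and take the root to get the RMSE per fold
rf_rmse_scores = np.sqrt(-scores)

In [None]:
display_scores(rf_rmse_scores)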

In [None]:
# Dump model
import joblib
import os

os.makedirs('tmp', exist_ok=True)
joblib.dump(rf_RD, "tmp/rf_RD.pkl")
# To load the model
# my_model_loaded = joblib.load("tmp/rf_RD.pkl")

# Summary on Random Forests

Book I like: **Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow** by Aurélien Géron (my colleague from work recommended it to me)

Youtube series heavily based on that book: https://www.youtube.com/watch?v=D_2LkhMJcfY

The company behind the Youtube channel kinda sucks...but the videos are a nice summary of the book.

git: https://github.com/ageron/handson-ml

citation: Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Retrieved from https://books.google.at/books?id=HHetDwAAQBAJ

Another video series that seemed nice: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ

Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html?highlight=random%20forest#sklearn.ensemble.RandomForestRegressor

Another series of courses I use: https://course18.fast.ai/lessonsml1/lesson1.html

    > github: https://github.com/fastai/fastai/tree/master/courses/ml1
    > "Mitschrift": https://medium.com/@hiromi_suenaga/machine-learning-1-lesson-1-84a1dc2b5236

Analysis/Guide on the Moneyball-Set: https://www.kaggle.com/wduckett/beane-and-depodesta-s-regression-roadmap

## General idea:

Based on decision trees (aka Classification And Regression Tree = CART).
Use multiple (different) CARTs and combine the trees' outputs into a single result (say, some form of average)... "Wisdom of the crowd".

Growing a decision tree: Use every single feature for the tree --> search for the very best feature when splitting a node.

The Random Forest algorithm introduces extra randomness when growing trees:
instead of searching for the very best feature when splitting a node (see Chapter 6 of the book), it searches for the best feature among a random subset of features.
This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.

Random Forests are a simple example of "Ensemble Learning": using multiple predictors to form another predictor.
In general, the set of base predictors can be made up of different types and/or use different sets of hyperparameters.
The result of the Ensemble is calculated by aggregating the result of each base predictor (e.g. by voting or averaging).
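
A small sketch of an ensemble made of *different* base predictors, using scikit-learn's `VotingRegressor`, which averages the predictions of its members. The data is synthetic and the choice of base models and their settings is arbitrary, picked only to illustrate the aggregation step.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic data (assumption: only for illustration)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=300)

# Three base predictors of different types
ensemble = VotingRegressor([
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=1)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
ensemble.fit(X_demo, y_demo)

# Aggregate by hand: average the predictions of the fitted base predictors
individual = np.column_stack([est.predict(X_demo) for est in ensemble.estimators_])
averaged = individual.mean(axis=1)

ensemble_pred = ensemble.predict(X_demo)
print("ensemble equals the plain average:", np.allclose(averaged, ensemble_pred))
```

A Random Forest follows the same pattern, except that all base predictors are decision trees that differ only through the randomness of their training.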

## Pseudocode

Training:

1. Assume number of cases in the training set is N. Then, a sample of these N cases is taken at random but with replacement (bootstrapping).

2. If there are M input variables (or features), a number m < M is specified (subset of features) such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.

3. Each tree is grown to the largest extent possible and there is no pruning.

Prediction:

1. Let each tree produce its prediction output.

2. Aggregate the individual prediction into the final result of the Random Forest (i.e. majority vote for classification, average for regression)

Terminology: **B**ootstrapping data + **agg**regating the results to make a decision = Bagging
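
The training and prediction steps above can be sketched by hand with plain decision trees. This is a simplified illustration under my own assumptions, not how scikit-learn implements `RandomForestRegressor` internally; the helper names `fit_bagged_trees` and `predict_bagged_trees` and the synthetic data are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=50, max_features=None, seed=0):
    """Grow n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        # Training step 1: sample N rows with replacement
        rows = rng.integers(0, n, size=n)
        # Training step 2: only max_features (= m) random features are tried per split
        # Training step 3: default settings grow the tree fully, without pruning
        tree = DecisionTreeRegressor(
            max_features=max_features,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees

def predict_bagged_trees(trees, X):
    """Prediction steps 1 and 2: let every tree predict, then average."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

# Synthetic data: 6 features, only the first two carry signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.2, size=400)

trees = fit_bagged_trees(X_demo[:300], y_demo[:300], n_trees=50, max_features=2)
pred = predict_bagged_trees(trees, X_demo[300:])

rmse_demo = float(np.sqrt(np.mean((pred - y_demo[300:]) ** 2)))
print(f"hold-out RMSE of the hand-rolled forest: {rmse_demo:.3f}")
```

For classification, the averaging in `predict_bagged_trees` would be replaced by a majority vote over the trees' predicted classes.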

## Advantages:

    - can handle both regression and classification
    - handles missing data well --> less preprocessing needed
    - maintains data accuracy
    - much less prone to overfitting than a single CART (adding more trees does not increase overfitting)
    - can handle large amounts of data with high dimensionality well
    - useful for EDA: feature importance
    
## Disadvantages:

    - not AS great for regression: a prediction is an average of training targets, so the output is piecewise constant and cannot extrapolate beyond the range seen in training
    - little control over what the model does (black box approach)
    
## Applications:

Think of areas where similar structures are already used by domain experts:

    - Medicine: diagnosing, figuring out medication,...
    - Stock market
    - Image classification (XBOX Kinect body part identification)
    
## How to tune the Hyperparameters?

https://www.gormanalysis.com/blog/random-forest-from-top-to-bottom/

In other words, “How do I tune the hyperparameters of a random forest?” This question isn’t specific to random forest. The most common approach is to use grid-search + cross validation – essentially “guess and check” where the training phase is based on one dataset and the testing phase is based on another. Otherwise, here are some notes specific to random forest:

- The more trees the better. But at a certain point the next tree just slows down your computer without adding more predictive power
- Leo Breiman (random forest’s creator) suggests sampling (with replacement) n rows from the training set before growing each tree where n = number of rows in the training set. This technique is known as bagging and will result in roughly 63% of the unique training samples being used to construct a single decision tree.
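
Where the "roughly 63%" comes from: each of the n draws misses a given row with probability 1 - 1/n, and the draws are independent. A short derivation, written for large n:

```latex
P(\text{row } i \text{ is never drawn})
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow{\;n \to \infty\;}\; e^{-1} \approx 0.368
\qquad\Longrightarrow\qquad
P(\text{row } i \text{ is drawn at least once}) \approx 1 - 0.368 = 0.632
```

So about 63% of the unique rows land in each tree's bootstrap sample, and the remaining 37% or so form that tree's OOB dataset.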

    




In [None]:
from sklearn import neighbors

neighbors.kNeigh

# Save DF

In [None]:
import os
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/raw')  # write to the same path that is read back below
df_raw = pd.read_feather('tmp/raw')