# **Distance Predictor Part 4**
Author: Declan Costello

Date: 8/10/2023

## **Part 4 Description**

Here I create pipelines with imputation, scaling, and one-hot encoding, and then use grid search for hyperparameter tuning, utilizing the new features created in Part 3.

## **Table of Contents**

1. [Installation](#Installation)
2. [Machine Learning](#Machine-Learning)
3. [Grid Search](#Grid-Search)
4. [Random Search](#Random-Search)
5. Weights
6. [Results](#Results)
7. [Future Analysis](#Future-Analysis)

# **Installation**

The following imports the necessary packages.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import set_config
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures

# **Data Import**

In [2]:
data = pd.read_csv('FE_data.csv')

data.pop('Unnamed: 0')
data.pop('hc_x')
data.pop('hc_y')
data.pop('events')
data.pop('woba_value')
data.pop('hit_distance_sc_percentile')
data.pop('launch_speed_percentile')
data.pop('release_speed_percentile')
data.pop('launch_angle_binned')
data.pop('pull_percent_binned')
data.pop('Pop_percentile')

data.pop('pitch_type')
data.pop('spray_angle')

0        -24.123185
1        -28.255293
2        -20.586881
3        -26.691419
4        -10.550415
            ...    
116153    34.530967
116154     0.538752
116155   -31.130626
116156    12.611244
116157    25.147719
Name: spray_angle, Length: 116158, dtype: float64
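The repeated `data.pop(...)` calls above could also be written as a single `DataFrame.drop` call. The following is a minimal sketch on a toy frame, not the notebook's actual data: the values are made up, and `toy` and `unused_cols` are illustrative names.

```python
import pandas as pd

# Toy stand-in for FE_data.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "launch_angle": [12.0, 25.0],
    "launch_speed": [95.1, 101.3],
    "hc_x": [110.2, 140.7],
    "hc_y": [88.4, 60.1],
    "events": ["single", "home_run"],
})

# Hypothetical list of columns that are not used as model inputs.
unused_cols = ["hc_x", "hc_y", "events"]

# drop() returns the reduced frame, whereas pop() returns the removed column
# (which is why the last pop() above displays the spray_angle Series).
toy = toy.drop(columns=unused_cols)

print(list(toy.columns))
```

Keeping the column names in one list also makes it easier to see at a glance which features were excluded.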

# **Random Forest Regressor Pipeline**

In [3]:
numeric_features = ['launch_angle', 'launch_speed']
#'release_speed', 'fav_platoon_split_for_batter', 'grouped_pitch_type','domed','game_elevation','is_barrel','Pop','pull_percent'
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])


categorical_features = ["stand"]
#"p_throws","home_team"
categorical_transformer = OneHotEncoder(handle_unknown="ignore")


preprocessor = ColumnTransformer(transformers=[
    ("num_transform", numeric_transformer, numeric_features),
    ("cat_transform", categorical_transformer, categorical_features)
])


pipeline = Pipeline(steps=[
    ("preprocesser", preprocessor),
    ("Random Forest Regressor", RandomForestRegressor())
])


set_config(display='diagram')

pipeline
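To see what the preprocessing step hands to the regressor, the same `ColumnTransformer` can be fitted on its own. This is a small sketch on a made-up four-row frame (the values and the `toy` name are assumptions, not the notebook's data): the two numeric columns are mean-imputed and standardized, and `stand` becomes two one-hot columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one missing value per numeric column.
toy = pd.DataFrame({
    "launch_angle": [10.0, np.nan, 30.0, 20.0],
    "launch_speed": [90.0, 100.0, np.nan, 95.0],
    "stand": ["L", "R", "R", "L"],
})

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

preprocessor = ColumnTransformer(transformers=[
    ("num_transform", numeric_transformer, ["launch_angle", "launch_speed"]),
    ("cat_transform", OneHotEncoder(handle_unknown="ignore"), ["stand"]),
])

# 2 scaled numeric columns + 2 one-hot columns (one per value of "stand").
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```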

**Train Test Split**

In [4]:
feature_cols = ['launch_angle', 'launch_speed','stand']
#'release_speed', 'home_team', 'stand', 'p_throws', 'fav_platoon_split_for_batter', 'grouped_pitch_type','domed','game_elevation','is_barrel','Pop','pull_percent']

#home_team, p_throws
X = data.loc[:, feature_cols]

target_cols = ['hit_distance_sc']
y = data.loc[:, target_cols]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,random_state=0)

**Fit, MAE**

In [5]:
pipeline.fit(X_train, y_train.values.ravel())

preds = pipeline.predict(X_valid)

RandomForestRegressor_mean_absolute_error =  mean_absolute_error(y_valid, preds)

RandomForestRegressor_mean_absolute_error

12.027377663625185
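`mean_squared_error` is imported above but never used. If a second metric is wanted, the root mean squared error can be reported next to the MAE; it is in the same units as the target and penalizes large misses more heavily. A sketch on made-up numbers (`y_true` and `y_pred` are illustrative, not notebook results):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up distances and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mae = mean_absolute_error(y_true, y_pred)

# Taking the square root by hand works across scikit-learn versions.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```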

# **HyperParameter Tuning**

# **RandomForestRegressor GridSearch**

**Train Test Split**

In [6]:
feature_cols = ['launch_angle', 'launch_speed', 'pull_percent']
# took out cats: 'home_team', 'stand', 'p_throws'
X = data.loc[:, feature_cols]

target_cols = ['hit_distance_sc']
y = data.loc[:, target_cols]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,random_state=0)

**GridSearchCV**

In [7]:
param_grid = {'n_estimators': [50,100,150],
              'max_depth':[2,4,6],
              'random_state':[0,1]}

grid = GridSearchCV(RandomForestRegressor(), param_grid)

**Fit Best Params**

In [8]:
grid.fit(X_train, y_train.values.ravel())

grid.best_params_

{'max_depth': 6, 'n_estimators': 150, 'random_state': 0}

**MAE**

In [9]:
preds = grid.predict(X_valid)

grid_lin_mean_absolute_error =  mean_absolute_error(y_valid, preds)

grid_lin_mean_absolute_error

14.967859689417626
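The grid-searched model scores worse on the validation set (about 14.97) than the untuned pipeline above (about 12.03). The two runs use different feature sets, so they are not directly comparable, but two other things may contribute and are worth checking: every candidate in the grid caps `max_depth` at 6, whereas a default `RandomForestRegressor` grows trees with no depth limit, and `GridSearchCV` ranks regressors by R² unless `scoring` is set. The sketch below, on synthetic data from `make_regression` (an assumption, not the notebook's data), adds `None` to the depth grid and scores by MAE.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real features.
X_toy, y_toy = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [2, 6, None],  # None lets trees grow without a depth limit
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),  # fix the seed instead of searching over it
    param_grid,
    scoring="neg_mean_absolute_error",  # rank candidates by MAE rather than R^2
    cv=3,
)
grid.fit(X_toy, y_toy)

print(grid.best_params_)
print(-grid.best_score_)  # cross-validated MAE of the best candidate
```

The score is negated because scikit-learn scorers follow a "higher is better" convention.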

# **RandomForestRegressor Pipeline GridSearch**

where did I find this example again?

In [10]:
numeric_features = ['launch_angle', 'launch_speed','pull_percent']
# 'release_speed', 'fav_platoon_split_for_batter', 'grouped_pitch_type','domed','game_elevation','is_barrel','Pop','pull_percent']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])


categorical_features = ["stand"]
#"p_throws", "home_team"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")


preprocessor = ColumnTransformer(transformers=[
    ("num_transform", numeric_transformer, numeric_features),
    ("cat_transform", categorical_transformer, categorical_features)
])


pipeline = Pipeline(steps=[
    ("preprocesser", preprocessor),
    ("Random Forest Regressor", RandomForestRegressor())
])


set_config(display='diagram')

pipeline

**Train Test Split**

In [11]:
feature_cols = ['launch_angle', 'launch_speed', 'pull_percent','stand']
# took out cats: 'home_team', 'p_throws'
X = data.loc[:, feature_cols]

target_cols = ['hit_distance_sc']
y = data.loc[:, target_cols]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,random_state=0)

**GridSearchCV**

In [18]:
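# param_grid = [
#     {
#         "preprocesser__num_transform__imputer__strategy": ["mean", "median"],
#         "classifier__n_estimators": [10, 100, 1000],
#         "classifier": [RandomForestRegressor()]  # it's a regressor, not a classifier, though?
#     }
# ]

# NOTE: these keys carry no pipeline step prefix, so the fit in the next cell
# fails with "Invalid parameter 'max_depth'". GridSearchCV addresses a parameter
# inside a Pipeline as '<step name>__<parameter>'.
param_grid = {'n_estimators': [50,100,150],
              'max_depth':[2,4,6],
              'random_state':[0,1]}

# TODO: google this line
grid_search = GridSearchCV(pipeline, param_grid)
#, cv=2, verbose=1,n_jobs=-1)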

**GridSearch**

In [19]:
grid_search.fit(X_train, y_train.values.ravel())

grid_search.best_params_

# print(grid_search.best_score_)
# print("best logistic regression from grid search:")
# print(grid_search.score(X_test, y_test))

# https://stackoverflow.com/questions/60786220/attributeerror-gridsearchcv-object-has-no-attribute-best-params

# best_ = GridSearchCV(pipeline, param_grid, refit=False, n_jobs=-1).fit(X_train, y_train).best_estimator_   # <---- OK
# best_

ValueError: Invalid parameter 'max_depth' for estimator Pipeline(steps=[('preprocesser',
                 ColumnTransformer(transformers=[('num_transform',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['launch_angle',
                                                   'launch_speed',
                                                   'pull_percent']),
                                                 ('cat_transform',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['stand'])])),
                ('Random Forest Regressor', RandomForestRegressor())]). Valid parameters are: ['memory', 'steps', 'verbose'].
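The error says the pipeline itself only accepts `memory`, `steps`, and `verbose`; parameters of a step have to be addressed as `<step name>__<parameter>`, and nested steps chain further prefixes. The following sketch shows one way the grid could be written with the step names used above. It runs on a small made-up frame (`X_toy`, `y_toy`, and the grid values are assumptions chosen to keep it fast), so its results say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with the same column names as the notebook.
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame({
    "launch_angle": rng.uniform(-10, 50, n),
    "launch_speed": rng.uniform(60, 115, n),
    "stand": rng.choice(["L", "R"], n),
})
y_toy = 3.0 * X_toy["launch_angle"] + 2.0 * X_toy["launch_speed"] + rng.normal(0, 5, n)
X_toy.iloc[::10, 1] = np.nan  # blank out some launch_speed values for the imputer

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num_transform", numeric_transformer, ["launch_angle", "launch_speed"]),
    ("cat_transform", OneHotEncoder(handle_unknown="ignore"), ["stand"]),
])
pipeline = Pipeline(steps=[
    ("preprocesser", preprocessor),
    ("Random Forest Regressor", RandomForestRegressor(random_state=0)),
])

# Every key is prefixed with the step name(s) that lead to the parameter.
param_grid = {
    "preprocesser__num_transform__imputer__strategy": ["mean", "median"],
    "Random Forest Regressor__n_estimators": [10, 20],
    "Random Forest Regressor__max_depth": [2, 4],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=2)
grid_search.fit(X_toy, y_toy.to_numpy())

print(grid_search.best_params_)
```

Calling `sorted(pipeline.get_params().keys())` lists every name that is valid in `param_grid`. A shorter step name such as `"regressor"` would make the keys easier to type, but the names above are kept to match the notebook.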

# **PolynomialRegression Grid Search**

https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks_v1/05.03-Hyperparameters-and-Model-Validation.ipynb

same thing:

https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html

In [None]:
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

param_grid = {'polynomialfeatures__degree': np.arange(5),
              'linearregression__fit_intercept': [True, False]}

grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)

**Fit Best Params**

In [None]:
grid.fit(X_train, y_train)

grid.best_params_

**MAE**

In [None]:
preds = grid.predict(X_valid)

grid_lin_mean_absolute_error =  mean_absolute_error(y_valid, preds)

grid_lin_mean_absolute_error
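The three cells above have not been executed. If they are run right after cell 11, `X_train` still contains the `stand` column, which `PolynomialFeatures` would likely reject if it holds string labels (an assumption about the data), so the numeric columns would need to be selected first. The sketch below runs the same polynomial grid search on synthetic one-feature data (`X_num` and `y_num` are made up). It starts the degree grid at 1 rather than 0, since a degree-0 model can only predict a constant.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic relationship with a little noise.
rng = np.random.default_rng(0)
X_num = rng.uniform(-3, 3, size=(140, 1))
y_num = 0.5 * X_num[:, 0] ** 2 - X_num[:, 0] + rng.normal(0, 0.3, 140)


def PolynomialRegression(degree=2, **kwargs):
    # make_pipeline names the steps after the lowercased class names.
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))


param_grid = {
    "polynomialfeatures__degree": [1, 2, 3, 4],
    "linearregression__fit_intercept": [True, False],
}

grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X_num, y_num)

print(grid.best_params_)
print(mean_absolute_error(y_num, grid.predict(X_num)))  # in-sample MAE, for illustration
```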

# **HYPER PARAM TODO**

Grid search over features:
https://github.com/wlongxiang/mlpipeline/blob/main/ml_pipeline_with_grid_search.ipynb

Grid search over features for regression, again: https://github.com/Andrew-Ng-s-number-one-fan/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-and-TensorFlow/blob/master/Notebooks/C2_N1_Predicting%20Housing%20Price.ipynb

Grid search for other models:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_stats.html#sphx-glr-auto-examples-model-selection-plot-grid-search-stats-py

# SVC 
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_stats.html#sphx-glr-auto-examples-model-selection-plot-grid-search-stats-py

https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769

# **Future Analysis**

In the future, I hope to try hyperparameter tuning on a classification project instead of a regression project.

Grid search for classification:

https://towardsdatascience.com/gridsearchcv-for-beginners-db48a90114ee

https://github.com/BindiChen/machine-learning/blob/main/traditional-machine-learning/005-grid-search-vs-random-search-vs-bayes-search/gridsearch-vs-randomsearch-vs-bayessearch.ipynb

Grid search hyperparameter tuning to find the best classification model:

https://github.com/tjburch/mlb-hit-classifier/blob/master/notebooks/2-added-variables.ipynb

# **Random**