# Machine Learning Phase

### Author: Solomon Stevens
### Date: July 28th, 2024

Determine the best machine learning model for our MLB lack-of-run-support data set:
* Normalize all independent variables
* Make the output boolean numeric (1 or 0)
* Create and train multiple machine learning models
* Test each model and log its performance
* Compare and determine which is best

### 1. Import libraries and data

In [1]:
# Imports
import pandas as pd

#-> For model creation
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

#-> For scoring
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import precision_score
from sklearn.metrics import r2_score
from sklearn.metrics import recall_score
from sklearn.metrics import root_mean_squared_error


print('Imported')

Imported


In [2]:
# Read in data
df = pd.read_csv('stats_EDA.csv')

print(df.head())

  last_name, first_name  year  home_run  k_percent  barrel_batted_rate  \
0          Adleman, Tim  2017        29       20.3                 7.9   
1      Alcantara, Sandy  2019        23       18.0                 6.5   
2      Alcantara, Sandy  2021        21       24.0                 6.1   
3      Alcantara, Sandy  2022        16       23.4                 5.3   
4      Alcantara, Sandy  2023        22       19.8                 7.0   

   hard_hit_percent  lack_of_run_support  p_unearned_run  
0              32.7                False               4  
1              34.3                 True               9  
2              39.4                 True              12  
3              38.5                False               9  
4              40.6                False               6  


### 2. Generate a data frame to work with
* We don't need factors such as the player's name or what year they pitched as independent variables.
* We also need to normalize each variable so that features with larger numeric ranges do not carry undue weight.
  * Normalize with min-max feature scaling
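
Min-max feature scaling maps each column to the range [0, 1] with `(x - min) / (max - min)`. The sketch below applies that formula in a loop instead of repeating it per column. The `min_max_scale` helper and the toy frame are illustrative assumptions, not part of the original notebook, and the helper assumes no column is constant (that would divide by zero).

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `frame` with each listed column rescaled to [0, 1]."""
    scaled = frame.copy()  # work on a copy so the input stays untouched
    for col in columns:
        col_min = scaled[col].min()
        col_max = scaled[col].max()
        # Assumes col_max > col_min (a constant column would divide by zero)
        scaled[col] = (scaled[col] - col_min) / (col_max - col_min)
    return scaled


# Toy data for illustration only
toy = pd.DataFrame({'home_run': [29, 23, 21, 16, 22],
                    'k_percent': [20.3, 18.0, 24.0, 23.4, 19.8]})

toy_scaled = min_max_scale(toy, ['home_run', 'k_percent'])
print(toy_scaled)
```

Because the minimum and maximum come from the whole frame, the scaled values depend on every row that is present when the helper is called.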

In [3]:
# Grab the columns we want
#-> Include the dependent variable first for visual purposes

df_ml = df[['lack_of_run_support', 'home_run', 'p_unearned_run', 'k_percent', 'barrel_batted_rate', 'hard_hit_percent']]

print(df_ml.head())

   lack_of_run_support  home_run  p_unearned_run  k_percent  \
0                False        29               4       20.3   
1                 True        23               9       18.0   
2                 True        21              12       24.0   
3                False        16               9       23.4   
4                False        22               6       19.8   

   barrel_batted_rate  hard_hit_percent  
0                 7.9              32.7  
1                 6.5              34.3  
2                 6.1              39.4  
3                 5.3              38.5  
4                 7.0              40.6  


In [4]:
# Normalize all the numeric columns
#-> Home Runs Allowed
df_ml.home_run = (df_ml.home_run - df_ml.home_run.min()) / (df_ml.home_run.max() - df_ml.home_run.min())

#-> Unearned Runs Allowed
df_ml.p_unearned_run = (df_ml.p_unearned_run - df_ml.p_unearned_run.min()) / (df_ml.p_unearned_run.max() - df_ml.p_unearned_run.min())

#-> Strikeout Percent
df_ml.k_percent = (df_ml.k_percent - df_ml.k_percent.min()) / (df_ml.k_percent.max() - df_ml.k_percent.min())

#-> Barrel Batted Rate
df_ml.barrel_batted_rate = (df_ml.barrel_batted_rate - df_ml.barrel_batted_rate.min()) / (df_ml.barrel_batted_rate.max() - df_ml.barrel_batted_rate.min())

#-> Hard Hit Percent
df_ml.hard_hit_percent = (df_ml.hard_hit_percent - df_ml.hard_hit_percent.min()) / (df_ml.hard_hit_percent.max() - df_ml.hard_hit_percent.min())


print(df_ml.head())


   lack_of_run_support  home_run  p_unearned_run  k_percent  \
0                False  0.642857        0.173913   0.322476   
1                 True  0.500000        0.391304   0.247557   
2                 True  0.452381        0.521739   0.442997   
3                False  0.333333        0.391304   0.423453   
4                False  0.476190        0.260870   0.306189   

   barrel_batted_rate  hard_hit_percent  
0            0.457364          0.357430  
1            0.348837          0.421687  
2            0.317829          0.626506  
3            0.255814          0.590361  
4            0.387597          0.674699  


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_ml.home_run = (df_ml.home_run - df_ml.home_run.min()) / (df_ml.home_run.max() - df_ml.home_run.min())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_ml.p_unearned_run = (df_ml.p_unearned_run - df_ml.p_unearned_run.min()) / (df_ml.p_unearned_run.max() - df_ml.p_unearned_run.min())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.htm
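
The repeated "A value is trying to be set on a copy of a slice" messages are pandas' `SettingWithCopyWarning`: `df_ml` was created by selecting columns from `df`, and pandas cannot tell whether later assignments are meant to reach back into `df`. A common way to avoid the warning is to take an explicit `.copy()` when building the working frame. The sketch below uses a small made-up frame (an assumption for illustration) and also shows `astype(int)` as one way to turn a boolean column into 1/0.

```python
import pandas as pd

# Made-up stand-in for the real data (illustrative values only)
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'home_run': [29, 23, 16],
                   'lack_of_run_support': [False, True, True]})

# An explicit copy makes df_ml independent of df, so assigning to its
# columns is unambiguous
df_ml = df[['lack_of_run_support', 'home_run']].copy()

# Bracket-style assignment on the copy
hr = df_ml['home_run']
df_ml['home_run'] = (hr - hr.min()) / (hr.max() - hr.min())

# Boolean to 1/0 without a lambda
df_ml['lack_of_run_support'] = df_ml['lack_of_run_support'].astype(int)

print(df_ml)
```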

In [5]:
# Turn the True/False values of our output to 1/0
df_ml.lack_of_run_support = df_ml.lack_of_run_support.apply(lambda x: 1 if x==True else 0)

print(df_ml.head())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_ml.lack_of_run_support = df_ml.lack_of_run_support.apply(lambda x: 1 if x==True else 0)


   lack_of_run_support  home_run  p_unearned_run  k_percent  \
0                    0  0.642857        0.173913   0.322476   
1                    1  0.500000        0.391304   0.247557   
2                    1  0.452381        0.521739   0.442997   
3                    0  0.333333        0.391304   0.423453   
4                    0  0.476190        0.260870   0.306189   

   barrel_batted_rate  hard_hit_percent  
0            0.457364          0.357430  
1            0.348837          0.421687  
2            0.317829          0.626506  
3            0.255814          0.590361  
4            0.387597          0.674699  


### 3. Split the Data
* 80% for training
* 20% for testing
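
With an imbalanced output, a plain random split can leave the training and testing sets with different shares of the positive class. One option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same on both sides. The sketch below uses synthetic data as an assumption (1000 rows, 5 features, 21% positives), not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 5 features, 21% positives (assumed values)
rng = np.random.default_rng(11)
X = rng.random((1000, 5))
y = np.array([1] * 210 + [0] * 790)

# stratify=y preserves the positive rate in both the 80% and the 20% split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y)

print('train positive rate:', y_train.mean())
print('test positive rate:', y_test.mean())
```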

In [6]:
# Turn the independent variables into an array to allow splitting
indep_vars = df_ml[['home_run', 'p_unearned_run', 'k_percent', 'barrel_batted_rate', 'hard_hit_percent']].to_numpy()

print(indep_vars[0:4])

[[0.64285714 0.17391304 0.32247557 0.45736434 0.35742972]
 [0.5        0.39130435 0.247557   0.34883721 0.42168675]
 [0.45238095 0.52173913 0.44299674 0.31782946 0.62650602]
 [0.33333333 0.39130435 0.42345277 0.25581395 0.59036145]]


In [7]:
# Assign the training and testing variables
X_train, X_test, y_train, y_test = train_test_split(indep_vars, df_ml['lack_of_run_support'], test_size=0.2, random_state=11)

In [8]:
# Confirm the correct sizes
print('X_train:', len(X_train), 'X_test:', len(X_test))
print('y_train:', len(y_train), 'y_test:', len(y_test))

X_train: 799 X_test: 200
y_train: 799 y_test: 200


### 4. Train and Optimize Models
A. Logistic Regression

In [9]:
# Create and train the model
logistic_regression_model = LogisticRegression().fit(X_train, y_train)

In [10]:
# Use the model to make predictions
logistic_regression_predict = logistic_regression_model.predict(X_test)

In [11]:
# Grade the test
print('Accuracy:', accuracy_score(y_test, logistic_regression_predict))
print('Precision:', precision_score(y_test, logistic_regression_predict, zero_division=0.0))
print('MSE:', mean_squared_error(y_test, logistic_regression_predict))
print('RMSE:', root_mean_squared_error(y_test, logistic_regression_predict))
print('MAE:', mean_absolute_error(y_test, logistic_regression_predict))
print('Recall:', recall_score(y_test, logistic_regression_predict))
print('R2:', r2_score(y_test, logistic_regression_predict))

Accuracy: 0.79
Precision: 0.0
MSE: 0.21
RMSE: 0.458257569495584
MAE: 0.21
Recall: 0.0
R2: -0.2658227848101262
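
These scores are consistent with a model that never predicts the positive class. The sketch below assumes the 200 test rows contain 42 positives (the count implied by the reported accuracy and R2, not read from the data) and scores an all-zero prediction by hand. It also shows that on 0/1 labels MSE and MAE both equal `1 - accuracy`, so they carry no extra information here.

```python
# Assumed test labels: 42 positives and 158 negatives (inferred, see above)
y_true = [1] * 42 + [0] * 158
# A "model" that always predicts the negative class
y_pred = [0] * 200

n = len(y_true)
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / n
precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions
recall = tp / (tp + fn) if (tp + fn) else 0.0

# On 0/1 labels every error is exactly 1, so MSE == MAE == 1 - accuracy
mse = sum((t - p) ** 2 for t, p in pairs) / n
mae = sum(abs(t - p) for t, p in pairs) / n

# R2 = 1 - SS_res / SS_tot
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in pairs)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print('Accuracy:', accuracy, 'Precision:', precision, 'Recall:', recall)
print('MSE:', mse, 'MAE:', mae, 'R2:', r2)
```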


B. Decision Trees

In [12]:
# Create the model
decision_tree_model = tree.DecisionTreeClassifier().fit(X_train, y_train)

In [13]:
# Make predictions
decision_tree_predict = decision_tree_model.predict(X_test)

In [14]:
# Grade the test
print('Accuracy:', accuracy_score(y_test, decision_tree_predict))
print('Precision:', precision_score(y_test, decision_tree_predict))
print('MSE:', mean_squared_error(y_test, decision_tree_predict))
print('RMSE:', root_mean_squared_error(y_test, decision_tree_predict))
print('MAE:', mean_absolute_error(y_test, decision_tree_predict))
print('Recall:', recall_score(y_test, decision_tree_predict))
print('R2:', r2_score(y_test, decision_tree_predict))

Accuracy: 0.65
Precision: 0.24074074074074073
MSE: 0.35
RMSE: 0.5916079783099616
MAE: 0.35
Recall: 0.30952380952380953
R2: -1.1097046413502105


C. Random Forest
* A Random Forest's maximum tree depth (`max_depth`) is a parameter we can vary
  * We have to determine which depth is best

In [20]:
# Print a header row
print(f"{'Max Depth':<15}{'Accuracy':<15}{'Precision':20}{'MSE':<7}{'RMSE':<20}{'MAE':7}{'Recall':<20}{'R2':<20}")

# Create a loop for testing different depths of the model
for i in range(1,11):
    # Create the model
    random_forest_model = RandomForestClassifier(max_depth=i).fit(X_train, y_train)

    # Make predictions
    random_forest_predict = random_forest_model.predict(X_test)

    # Log outputs
    print(f'{i:<15}{accuracy_score(y_test, random_forest_predict):<15}{precision_score(y_test, random_forest_predict, zero_division=0.0):<20}{mean_squared_error(y_test, random_forest_predict):<7}{root_mean_squared_error(y_test, random_forest_predict):<20}{mean_absolute_error(y_test, random_forest_predict):<7}{recall_score(y_test, random_forest_predict):<20}{r2_score(y_test, random_forest_predict):<20}')

Max Depth      Accuracy       Precision           MSE    RMSE                MAE    Recall              R2                  
1              0.79           0.0                 0.21   0.458257569495584   0.21   0.0                 -0.2658227848101262 
2              0.79           0.0                 0.21   0.458257569495584   0.21   0.0                 -0.2658227848101262 
3              0.79           0.0                 0.21   0.458257569495584   0.21   0.0                 -0.2658227848101262 
4              0.8            1.0                 0.2    0.4472135954999579  0.2    0.047619047619047616-0.20554550934297744
5              0.8            1.0                 0.2    0.4472135954999579  0.2    0.047619047619047616-0.20554550934297744
6              0.8            1.0                 0.2    0.4472135954999579  0.2    0.047619047619047616-0.20554550934297744
7              0.805          1.0                 0.195  0.44158804331639234 0.195  0.07142857142857142 -0.17540687160940305


Which is best?

* Depths 1-3 produce identical scores
* Depths 4-6 produce identical scores
* Depths 7 and 8 have a precision of 1.0
  * May indicate overfitting
* Go with a maximum depth of 9
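
Two details of the depth loop are worth a sketch. First, `RandomForestClassifier` is randomized, so scores can change between runs unless `random_state` is fixed. Second, the longest Recall values fill their 20-character column and run into the R2 column; a fixed precision such as `.3f` keeps the columns apart. The data below is synthetic (an assumption), and the `depth_sweep` helper and the seed value are hypothetical names chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the notebook's data (assumption)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.79], random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y)


def depth_sweep(depths, seed=11):
    """Return (depth, accuracy, precision, recall) for each max_depth."""
    rows = []
    for depth in depths:
        # A fixed random_state makes each forest reproducible
        model = RandomForestClassifier(max_depth=depth, random_state=seed)
        pred = model.fit(X_train, y_train).predict(X_test)
        rows.append((depth,
                     accuracy_score(y_test, pred),
                     precision_score(y_test, pred, zero_division=0),
                     recall_score(y_test, pred)))
    return rows


print(f"{'Max Depth':<12}{'Accuracy':<12}{'Precision':<12}{'Recall':<12}")
for depth, acc, prec, rec in depth_sweep(range(1, 11)):
    # Three decimals keep every value inside its 12-character column
    print(f'{depth:<12}{acc:<12.3f}{prec:<12.3f}{rec:<12.3f}')
```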

D. Naive Bayes

In [16]:
# Create the model
naive_bayes_model = GaussianNB().fit(X_train, y_train)

In [17]:
# Make predictions
naive_bayes_predict = naive_bayes_model.predict(X_test)

In [18]:
# Grade the test
print('Accuracy:', accuracy_score(y_test, naive_bayes_predict))
print('Precision:', precision_score(y_test, naive_bayes_predict, zero_division=0.0))
print('MSE:', mean_squared_error(y_test, naive_bayes_predict))
print('RMSE:', root_mean_squared_error(y_test, naive_bayes_predict))
print('MAE:', mean_absolute_error(y_test, naive_bayes_predict))
print('Recall:', recall_score(y_test, naive_bayes_predict))
print('R2:', r2_score(y_test, naive_bayes_predict))

Accuracy: 0.79
Precision: 0.0
MSE: 0.21
RMSE: 0.458257569495584
MAE: 0.21
Recall: 0.0
R2: -0.2658227848101262


E. K-Nearest Neighbor
* The number of neighbors (`n_neighbors`) can be varied
  * We have to find the best value

In [19]:
# Create initial header row
print(f"{'# Neighbors':<15}{'Accuracy':<15}{'Precision':20}{'MSE':<7}{'RMSE':<20}{'MAE':7}{'Recall':<20}{'R2':<20}")


# Create another loop
for i in range(3, 16, 2):
    # Create the model
    knn_model = KNeighborsClassifier(n_neighbors=i).fit(X_train, y_train)

    # Make predictions
    knn_predict = knn_model.predict(X_test)

    # Log outputs
    print(f'{i:<15}{accuracy_score(y_test, knn_predict):<15}{precision_score(y_test, knn_predict):<20}{mean_squared_error(y_test, knn_predict):<7}{root_mean_squared_error(y_test, knn_predict):<20}{mean_absolute_error(y_test, knn_predict):<7}{recall_score(y_test, knn_predict):<20}{r2_score(y_test, knn_predict):<20}')

# Neighbors    Accuracy       Precision           MSE    RMSE                MAE    Recall              R2                  
3              0.76           0.40625             0.24   0.4898979485566356  0.24   0.30952380952380953 -0.446654611211573  
5              0.785          0.46153846153846156 0.215  0.4636809247747852  0.215  0.14285714285714285 -0.29596142254370084
7              0.765          0.2727272727272727  0.235  0.4847679857416329  0.235  0.07142857142857142 -0.4165159734779984 
9              0.77           0.16666666666666666 0.23   0.47958315233127197 0.23   0.023809523809523808-0.386377335744424  
11             0.795          0.5714285714285714  0.205  0.4527692569068708  0.205  0.09523809523809523 -0.23568414707655183
13             0.785          0.4                 0.215  0.4636809247747852  0.215  0.047619047619047616-0.29596142254370084
15             0.8            1.0                 0.2    0.4472135954999579  0.2    0.047619047619047616-0.20554550934297744


Which is best?
* 15 has the highest accuracy (0.8) but a perfect precision of 1.0
  * Possible overfitting; with a recall of only 0.048, that precision rests on just a few positive predictions
* 11 has the next highest accuracy and precision
  * Go with 11
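
The comparison above picks `n_neighbors` by scoring each candidate on the test set. One alternative, sketched here under assumptions, is to choose it with cross-validation on the training split only, so the test set is used once at the end. The data is synthetic, and scoring by F1 is a swapped-in choice that balances precision and recall; it is not a metric the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the notebook's data (assumption)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.79], random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y)

# Same candidate values as the loop above: 3, 5, ..., 15
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': list(range(3, 16, 2))},
                      scoring='f1',  # swapped-in metric, see lead-in
                      cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_['n_neighbors']
# score() reports the same metric passed as `scoring`, here F1
test_f1 = search.score(X_test, y_test)

print('best n_neighbors:', best_k)
print('held-out F1:', test_f1)
```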

### 5. Summarize

| Algorithm                  | Accuracy | Precision | MSE   | RMSE  | MAE   | Recall | R2    |
| :------------------------: | :------: | :-------: | :---: | :---: | :---: | :----: | :---: |
| Logistic Regression        | 0.79     | 0.0       | 0.21  | 0.46  | 0.21  | 0.0    | -0.27 |
| Decision Trees             | 0.65     | 0.24      | 0.35  | 0.59  | 0.35  | 0.31   | -1.11 |
| Random Forest (depth 9)    | 0.805    | 0.67      | 0.195 | 0.44  | 0.195 | 0.143  | -0.18 |
| Naive Bayes                | 0.79     | 0.0       | 0.21  | 0.46  | 0.21  | 0.0    | -0.27 |
| KNN (11 neighbors)         | 0.795    | 0.57      | 0.205 | 0.45  | 0.205 | 0.095  | -0.24 |

### 6. Which is best?
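* We want to avoid a precision of exactly 0.0 or 1.0
  * This removes Logistic Regression (1) and Naive Bayes (4), numbering the rows of the summary table from the top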

* Remaining order for each score (best to worst)
  * Accuracy: (3), (5), (2)
  * Precision: (3), (5), (2)
  * MSE: (3), (5), (2)
  * RMSE: (3), (5), (2)
  * MAE: (3), (5), (2)
  * Recall: (2), (3), (5)
  * R2: (3), (5), (2)

* Random Forest with a maximum depth of 9 is the winner
  * Use it to predict what generates the greatest lack of run support
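
One way to ask a fitted Random Forest which inputs matter most is its `feature_importances_` attribute. The sketch below is only an illustration: it fits on synthetic data (an assumption) and borrows the notebook's column names as labels, so the printed ranking says nothing about the real pitchers. Impurity-based importances describe what the model relies on, not what causes a lack of run support.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names from the notebook, used here only as labels
feature_names = ['home_run', 'p_unearned_run', 'k_percent',
                 'barrel_batted_rate', 'hard_hit_percent']

# Synthetic stand-in data (assumption), one column per feature name
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.79], random_state=11)

# Same depth as the chosen model; random_state fixed for reproducibility
model = RandomForestClassifier(max_depth=9, random_state=11).fit(X, y)

# Importances are normalized to sum to 1; sort from most to least important
importances = pd.Series(model.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```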