### Baseball Case project by Francis Afuwah.
Batch: DS2312

### 1.0 Overview
This document describes the methodology used to develop a machine-learning model that predicts Major League Baseball team wins from historical data. The model uses team-performance features and compares several regression techniques to predict the number of wins.

### 1.1 Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import joblib

### 1.2 Loading and Preprocessing Data
1. Data source: A CSV file of team performance statistics.
2. Exploring the data: Displaying the first few rows and a summary description shows the data types and their distributions.
3. Renaming columns: Replacing the raw headers with descriptive names, based on the provided feature descriptions, makes the data easier to handle.
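
The loading, inspection, and renaming steps above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the in-memory CSV, its headers (`col_a`, `col_b`, `col_c`), and its values are invented, and the explicit old-to-new mapping is an assumed alternative to assigning a list of names by position.

```python
import io

import pandas as pd

# Invented three-column CSV standing in for the real file (values are made up).
csv_text = "col_a,col_b,col_c\n90,700,5500\n80,650,5400\n70,600,5300\n"
df = pd.read_csv(io.StringIO(csv_text))

# Inspect shape, dtypes, and missing values before touching the column names.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# An explicit mapping documents which raw header becomes which feature name.
# errors="raise" makes pandas fail loudly if a raw header is not in the frame.
mapping = {"col_a": "W", "col_b": "R", "col_c": "AB"}
renamed = df.rename(columns=mapping, errors="raise")
print(renamed.head())
```

Renaming through a mapping ties each new name to a specific raw header, so a reordered or unexpected file raises an error instead of silently mislabeling columns.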

In [2]:
# Load the dataset 
data = pd.read_csv(r'\Users\Admin\Desktop\Flip Robo-Intern\Datasets\baseball.csv')

In [3]:
# Displaying the first few rows and a summary description of the dataset
print(data.head())
print(data.describe())

       1      2    3    4   5  6   7    8   9   10  11  12  13  14  15  16  \
0  0.271  0.328   74  161  22  6  12   58  49  133  23  17   1   1   0   0   
1  0.264  0.318   24   48   7  0   1   22  15   18   0   7   0   0   0   0   
2  0.251  0.338  101  141  35  3  32  105  71  104  34   6   0   0   1   0   
3  0.224  0.274   28   94  21  1   1   44  27   54   2   7   1   1   0   0   
4  0.206  0.262   14   51  18  1   1   28  17   26   0   3   1   1   0   0   

   salary  
0     109  
1     160  
2    2700  
3     550  
4     300  
                1           2           3           4           5           6  \
count  337.000000  337.000000  337.000000  337.000000  337.000000  337.000000   
mean     0.257825    0.323973   46.697329   92.833828   16.673591    2.338279   
std      0.039546    0.047132   29.020166   51.896322   10.452001    2.543336   
min      0.063000    0.063000    0.000000    1.000000    0.000000    0.000000   
25%      0.238000    0.297000   22.000000   51.000000 

In [4]:
# Renaming the columns based on the provided feature descriptions.
# Names are assigned by position, so this list must follow the column order of the file.
column_names = [
    "W", "R", "AB", "H", "2B", "3B", "HR", "BB", "SO", "SB",
    "RA", "ER", "ERA", "CG", "SHO", "SV", "E"
]
data.columns = column_names

### 2.0 Data Preparation

1. Feature Scaling: Standardizes every feature except the target variable 'W' (wins).
2. Adding New Features: Creates two new features, Batting_Average and Slugging_Percentage, to give more insight into team performance.
3. Data Cleaning: Replaces infinite or missing values with zeros so that they do not disrupt model training.
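
As a worked example of the two derived features, the sketch below computes batting average (hits per at-bat) and slugging percentage (total bases per at-bat) for one invented stat line. The function names and the input numbers are illustrative assumptions. Returning 0.0 when there are no at-bats mirrors the zero-filling described in the data-cleaning step.

```python
def batting_average(hits, at_bats):
    """Hits per at-bat; returns 0.0 when there are no at-bats."""
    return hits / at_bats if at_bats else 0.0


def slugging_percentage(hits, doubles, triples, home_runs, at_bats):
    """Total bases per at-bat; returns 0.0 when there are no at-bats."""
    if not at_bats:
        return 0.0
    singles = hits - (doubles + triples + home_runs)
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Invented numbers, for illustration only.
avg = batting_average(hits=100, at_bats=400)
slg = slugging_percentage(hits=100, doubles=20, triples=5, home_runs=10, at_bats=400)
print(avg, slg)  # 65 singles + 40 + 15 + 40 = 160 total bases over 400 at-bats
```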

### 2.1 Data Preprocessing and Feature Engineering

In [5]:
# Scaling the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop(['W'], axis=1))
scaled_df = pd.DataFrame(scaled_features, columns=data.columns[1:])

In [6]:
# Adding new features
singles = data['H'] - (data['2B'] + data['3B'] + data['HR'])
scaled_df['Batting_Average'] = data['H'] / data['AB']
scaled_df['Slugging_Percentage'] = (singles + 2 * data['2B'] + 3 * data['3B'] + 4 * data['HR']) / data['AB']

In [7]:
# Handling potential infinities and NaNs
scaled_df.replace([np.inf, -np.inf], np.nan, inplace=True)
scaled_df.fillna(0, inplace=True)

In [8]:
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(scaled_df, data['W'], test_size=0.2, random_state=42)

### 3.0 Model Training and Evaluation
1. Model Selection: Four models are compared: Linear Regression, Decision Tree, Random Forest, and Gradient Boosting Regressor.
2. Training: The features are scaled and each model is trained on the training split.
3. Evaluation: MAE and RMSE are computed for both the training and test sets.
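
The scale, train, and evaluate steps above can also be expressed as a single scikit-learn pipeline. The sketch below is an assumed variant, not the notebook's code: the data is synthetic, the `report` helper is hypothetical, and only one model is shown. It splits before scaling, so the scaler statistics come from the training rows alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 5 features, linear target plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Split first, so the test rows never influence the scaler statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() standardizes using the training rows only, then fits the regressor.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)


def report(model, X_part, y_part):
    """Return MAE and RMSE of the model on one split."""
    preds = model.predict(X_part)
    mae = mean_absolute_error(y_part, preds)
    rmse = float(np.sqrt(mean_squared_error(y_part, preds)))
    return {"MAE": mae, "RMSE": rmse}


train_metrics = report(pipeline, X_train, y_train)
test_metrics = report(pipeline, X_test, y_test)
print(train_metrics, test_metrics)
```

Comparing the train and test rows of such a report is what reveals overfitting: a model with near-zero training error and clearly higher test error has memorized the training split.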

In [9]:
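# Feature scaling
# Note: this cell rebuilds the scaler and the train/test split from the original
# columns only, replacing the split from In [8]. Batting_Average and
# Slugging_Percentage (in scaled_df) are therefore not passed to the models below.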
scaler = StandardScaler()
features_scaled = scaler.fit_transform(data.drop('W', axis=1))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_scaled, data['W'], test_size=0.2, random_state=42)


In [10]:
# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42)
}

# Train models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} trained.")


Linear Regression trained.
Decision Tree trained.
Random Forest trained.
Gradient Boosting trained.


In [11]:
# Function to evaluate the models
def evaluate_model(model, X_train, y_train, X_test, y_test):
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    metrics = {
        "Train MAE": mean_absolute_error(y_train, train_preds),
        "Test MAE": mean_absolute_error(y_test, test_preds),
        "Train RMSE": np.sqrt(mean_squared_error(y_train, train_preds)),
        "Test RMSE": np.sqrt(mean_squared_error(y_test, test_preds))
    }
    return metrics

# Evaluate and print the performance of each model
for name, model in models.items():
    metrics = evaluate_model(model, X_train, y_train, X_test, y_test)
    print(f"{name} Performance: {metrics}")


Linear Regression Performance: {'Train MAE': 0.008288968512317762, 'Test MAE': 0.011260315647597037, 'Train RMSE': 0.012830519742369706, 'Test RMSE': 0.017749565611179727}
Decision Tree Performance: {'Train MAE': 0.0, 'Test MAE': 0.01802941176470588, 'Train RMSE': 0.0, 'Test RMSE': 0.02613258052136991}
Random Forest Performance: {'Train MAE': 0.005121263940520431, 'Test MAE': 0.012863970588235303, 'Train RMSE': 0.007833426375043931, 'Test RMSE': 0.01941981004178012}
Gradient Boosting Performance: {'Train MAE': 0.0026127517008521753, 'Test MAE': 0.009822843733609315, 'Train RMSE': 0.0033796668775065816, 'Test RMSE': 0.015700909520507907}


### 4.0 Model Optimization
Grid Search: Applied to the Gradient Boosting Regressor to tune the hyperparameters n_estimators, learning_rate, and max_depth, using 3-fold cross-validation scored by negative mean squared error.
Best Model Selection: The best parameter combination is identified, and the estimator refit with those settings is kept as the best model.
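
To make the size of the search concrete, the sketch below enumerates the parameter combinations by hand with `itertools.product`. It is an illustration of what an exhaustive grid search iterates over, not a replacement for GridSearchCV. The grid values and the fold count are the ones used in this section.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fit once per cross-validation fold.
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 2 * 2 * 2 = 8 candidates, 8 * 3 = 24 fits
```

The count grows multiplicatively: adding a third value to each of the three hyperparameters would raise the candidates from 8 to 27.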


In [12]:
# Grid search parameters
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5]
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=GradientBoostingRegressor(random_state=42),
                           param_grid=param_grid,
                           scoring='neg_mean_squared_error',
                           cv=3,
                           verbose=1)

# Perform grid search
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)

# Best model
best_model = grid_search.best_estimator_

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}


In [13]:
# Evaluate the best model from grid search
best_metrics = evaluate_model(best_model, X_train, y_train, X_test, y_test)
print("Best Model Performance:", best_metrics)


Best Model Performance: {'Train MAE': 0.0011308382493280045, 'Test MAE': 0.009198377723720612, 'Train RMSE': 0.0014578533927352218, 'Test RMSE': 0.01478032155621409}


### 5.0 Final Model Training and Saving

In [14]:
final_model = grid_search.best_estimator_
y_pred = final_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse}")

Test MSE: 0.00021845790530508686
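
As a quick consistency check, the test MSE printed here should be the square of the test RMSE reported for the tuned model in In [13], since RMSE is the square root of MSE. The sketch below copies both printed values; the tolerance is an arbitrary choice.

```python
import math

test_mse = 0.00021845790530508686  # "Test MSE" printed by In [14]
test_rmse = 0.01478032155621409    # "Test RMSE" printed by In [13]

# RMSE is the square root of MSE, so the two reports should agree.
consistent = math.isclose(math.sqrt(test_mse), test_rmse, rel_tol=1e-6)
print(consistent)
```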


In [15]:
# Specify the file name and path
model_path = r'C:\Users\Admin\Desktop\Flip Robo-Intern\Datasets\baseball_model.pkl'

# Save the model
joblib.dump(final_model, model_path)
print("Model saved successfully to:", model_path)

Model saved successfully to: C:\Users\Admin\Desktop\Flip Robo-Intern\Datasets\baseball_model.pkl


### 6.0 Make predictions

In [16]:
# Load the model
final_model = joblib.load(r'C:\Users\Admin\Desktop\Flip Robo-Intern\Datasets\baseball_model.pkl')

# X_test is the held-out split created earlier; any new data must first go through
# the same preprocessing (column order and scaling) that was used for training
y_pred = final_model.predict(X_test)

# Output the predicted wins
print("Number of predicted wins (W):", y_pred)

Number of predicted wins (W): [0.23417135 0.22363396 0.2737471  0.29279285 0.27928345 0.26146156
 0.291897   0.25559584 0.20575369 0.24140603 0.29934622 0.26343065
 0.12867751 0.27874131 0.21600342 0.25200186 0.2715085  0.29788143
 0.29902979 0.24770042 0.24437874 0.26056935 0.21040182 0.24510116
 0.30206567 0.24143723 0.3133333  0.27161034 0.24715905 0.23518055
 0.23765968 0.23343751 0.23042589 0.26096947 0.25093978 0.2425882
 0.26047312 0.2333012  0.2279649  0.29300551 0.21529046 0.24295576
 0.25222092 0.26861074 0.26250756 0.24627604 0.25956634 0.28155989
 0.25768772 0.24028341 0.24860997 0.24196058 0.28495197 0.24673924
 0.29764867 0.28892332 0.3024204  0.21316357 0.26055049 0.12193192
 0.26044577 0.27409992 0.27481712 0.25015169 0.31637968 0.26163577
 0.21465798 0.24649704]
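
Because a saved regressor expects inputs that were preprocessed exactly as during training, one option is to save the scaler and the model together as a single pipeline. The sketch below is a hedged illustration under that assumption: the data is synthetic, the file name is hypothetical, and a temporary directory replaces the notebook's path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 120 rows of unscaled features and a noisy target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(120, 4))
y = 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.5, size=120)

# The scaler travels with the model, so raw rows can be passed to predict().
pipeline = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=42))
pipeline.fit(X[:100], y[:100])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_sketch.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions on unseen rows.
new_rows = X[100:]
original_preds = pipeline.predict(new_rows)
restored_preds = restored.predict(new_rows)
same = bool(np.allclose(original_preds, restored_preds))
print(same)
```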


### 7.0 Conclusion
This project developed a predictive model for Major League Baseball team wins from historical team statistics. Four regression models were compared: linear regression, decision tree, random forest, and gradient boosting. On the held-out test set, gradient boosting had the lowest error of the four (test MAE about 0.0098, test RMSE about 0.0157). The decision tree fit the training set perfectly (train MAE 0.0) but had the highest test error, which is a sign of overfitting.

Preprocessing consisted of standardizing the features and engineering two derived metrics, batting average and slugging percentage. Thoughtful feature engineering of this kind can enrich a data set and lead to more precise predictions. Note, however, that the train/test split used for modeling (In [9]) is built from the scaled original columns only, so the reported metrics do not measure the contribution of the two derived features.

Tuning the Gradient Boosting Regressor with GridSearchCV shows the value of hyperparameter optimization. The selected parameters (learning_rate 0.1, max_depth 3, n_estimators 200) lowered the test RMSE from about 0.0157 to about 0.0148 and the test MAE from about 0.0098 to about 0.0092.

Operationally, the final model can do more than predict wins: it can also serve as an input to team management and planning. Its predictions could inform decisions such as which players to trade for, where to focus training, and which game strategies to adopt, and so improve the team's competitive edge.

The project also points to future improvements: extending the feature set, trying more complex modeling techniques, and adding real-time data updates for continuous model learning. These changes would make the model more effective and more adaptable to the changing dynamics of sports analytics.

In conclusion, this study lays the groundwork for predictive modeling in sports and shows the potential of combining data science with sports management. Approaches like this one are likely to become more central to strategic sports decision-making as the field of sports analytics grows.