## 1. Importing Required Libraries
We begin by importing the necessary Python libraries for data manipulation, visualization, and machine learning. These include:
- `pandas` for data handling,
- `random` and `numpy` for generating random numbers and working with arrays,
- `matplotlib.pyplot` and `seaborn` for data visualization,
- `xgboost` for implementing the XGBoost algorithm,
- `sklearn` for model training and evaluation.

In [None]:
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from xgboost import XGBRegressor
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE

## 2. Setting Global Seeds
Global seeds for both NumPy and Python’s random module are set to ensure reproducibility of results across different runs.
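
As a quick illustration of why the seeds matter, the sketch below reseeds both generators and checks that the same seed reproduces the same draws. It is a toy check rather than part of the analysis, and the helper name `draw` is made up for this example.

```python
import random
import numpy as np

def draw(seed):
    # Reseed both generators, then take one draw from each
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3)

a_py, a_np = draw(42)
b_py, b_np = draw(42)

# Identical seeds reproduce identical draws
print(a_py == b_py, np.array_equal(a_np, b_np))  # prints: True True
```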

In [None]:
# Set global seeds for reproducibility
np.random.seed(42)
random.seed(42)

## 3. Loading / Inspecting Data
The dataset (Appearances_Percentage_Batter.csv) is loaded into a Pandas DataFrame. This allows us to easily manipulate and analyze the data.
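
To show what the load-and-inspect steps return, here is a minimal sketch on a tiny stand-in file. The file name `toy_batters.csv` and its three rows are invented for illustration; only the `PA_`/`OBP_` column naming follows the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'toy_batters.csv'  # hypothetical file name
    # Two deliberately empty cells: PA_20 for player B, OBP_20 for player C
    csv_path.write_text('Name,PA_20,OBP_20\nA,200,0.350\nB,,0.310\nC,150,\n')
    toy = pd.read_csv(csv_path)

# Count missing values per column, as done for the real data
null_counts = toy.isnull().sum()
print(null_counts)
```

Passing a full path to `pd.read_csv` in this way is one alternative to changing the working directory first.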

In [None]:
# Load data into pandas dataframe
%cd "C:\Users\curna\Desktop\Data"
df = pd.read_csv('Appearances_Percentage_Batter.csv')
# Display first few rows to understand dataframe
df.head()

In [None]:
# Check columns for null values
df.isnull().sum()

In [None]:
# Use the describe method to get high level understanding of data
df.describe()

## 4. Visualizing the Relationship between Plate Appearances (PA) and On-Base Percentage (OBP)
To better understand how Plate Appearances (PA) relate to On-Base Percentage (OBP) over different years, we generate a scatter plot. This will allow us to visually assess any trends or correlations.

### Steps:
1. A list of years is defined (`['16', '17', '18', '19', '20', '21']`) to represent the years of interest.
2. A `figure` object is created with a size of 12x8 to ensure the plot has adequate space.
3. For each year, a scatter plot is created using `PA_{year}` on the x-axis and `OBP_{year}` on the y-axis. This plots Plate Appearances against On-Base Percentage for that specific year.
4. Labels for the x-axis, y-axis, and a title are added for clarity. A legend is also included to differentiate between years.

In [None]:
years = ['16', '17', '18', '19', '20', '21']
# Define custom colors for each year
colors = ['blue', 'green', 'red', 'orange', 'purple', 'gray']

# Create figure object and set figure size
plt.figure(figsize=(12, 8))

# Loop through each year and plot with specific color
for year, color in zip(years, colors):
    plt.scatter(df[f'PA_{year}'], df[f'OBP_{year}'], label=f'20{year}', color=color)

plt.xticks(np.arange(0, 800, 50))

# Add title, legend, and x, y labels
plt.xlabel('Plate Appearances (PA)')
plt.ylabel('On-Base Percentage (OBP)')
plt.title('Plate Appearances vs. On-Base Percentage')
plt.legend()
plt.show()

## 5. Data Cleaning and Preparation
In this section, we focus on cleaning the dataset by filtering out players with insufficient data and handling missing values.

### Steps:
1. **Filter Players by Plate Appearances:**  
   We filter the DataFrame to retain only those players who have a total of 150 or more plate appearances across the seasons from 2016 to 2020. This step ensures that we analyze players with sufficient data, which is critical for meaningful predictions.

2. **Handle Missing Values:**  
   For each year (2016 to 2021), we replace any missing values in the On-Base Percentage (OBP) and Plate Appearances (PA) columns with the median of their respective columns. This method of imputation is chosen to maintain the integrity of the dataset while minimizing the influence of outliers.

3. **Calculate Age:**  
   The player's age in 2021 is computed by subtracting the birth year (extracted from the 'birth_date' column) from 2021. This new 'age' column provides additional information that may be useful in predicting performance metrics.

These preprocessing steps are essential for ensuring that the dataset is robust and ready for further analysis and modeling.
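
The three steps above can be traced on a small made-up table. The players, values, and the reduced set of columns below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-player table; only the column naming follows the notebook
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'PA_19': [300, 40, np.nan],
    'PA_20': [200, 50, 180],
    'OBP_20': [0.350, 0.290, np.nan],
    'birth_date': ['1990-05-01', '1995-07-15', '1988-12-30'],
})

# 1. Keep players with at least 150 combined PA (sum skips NaN by default)
kept = toy[toy[['PA_19', 'PA_20']].sum(axis=1) >= 150].copy()

# 2. Fill remaining gaps with the column median of the kept rows
for col in ['PA_19', 'PA_20', 'OBP_20']:
    kept[col] = kept[col].fillna(kept[col].median())

# 3. Age in 2021 from the birth year
kept['age'] = 2021 - pd.to_datetime(kept['birth_date']).dt.year

print(kept[['Name', 'PA_19', 'OBP_20', 'age']])
```

Player B is dropped (90 combined PA), and player C's missing values are filled from player A, the only other remaining row. On such a small table the median is a crude stand-in, which is worth keeping in mind when many values are imputed.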

In [None]:
# Keep only players with at least 150 total plate appearances from 2016 to 2020
# (.copy() avoids pandas' SettingWithCopyWarning in the column assignments below)
df = df[df[['PA_16', 'PA_17', 'PA_18', 'PA_19', 'PA_20']].sum(axis=1) >= 150].copy()

# Replace missing values in both OBP and PA columns with their respective medians
years = ['16', '17', '18', '19', '20', '21']
for year in years:
    df[f'OBP_{year}'] = df[f'OBP_{year}'].fillna(df[f'OBP_{year}'].median())
    df[f'PA_{year}'] = df[f'PA_{year}'].fillna(df[f'PA_{year}'].median())

# Calculate player's age in 2021 by subtracting birth year from 2021
df['age'] = 2021 - pd.to_datetime(df['birth_date']).dt.year


## 6. Correlation Analysis
In this section, we analyze the relationships between selected features and the target variable by calculating and visualizing their correlations.

### Steps:
1. **Select Relevant Columns:**  
   A list named `data_cols` is created, which includes the relevant feature columns and the target variable for analysis. The columns chosen are:
   - Plate Appearances (PA) and On-Base Percentage (OBP) for the years 2016 to 2021.
   - The player's age.  
   These columns will help in understanding how different metrics relate to one another.

2. **Initialize the Figure:**  
   A new figure is initialized with a specified size of 10 by 5 inches to provide a clear view of the heatmap.

3. **Calculate the Correlation Matrix:**  
   The correlation matrix is computed using the `corr()` method on the selected columns. This matrix quantifies the linear relationship between pairs of features, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).

4. **Generate the Heatmap:**  
   A heatmap is created using Seaborn’s `heatmap()` function to visualize the correlation matrix. The `annot` parameter is set to `True` to display the correlation coefficients within the heatmap, and the color map `coolwarm` is used to visually represent the strength and direction of correlations.

5. **Set the Title:**  
   The heatmap is titled 'Correlation Heatmap of All Features' to provide context for the analysis.

This analysis allows us to identify significant correlations between features, which can inform feature selection and modeling strategies in subsequent steps.
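
To make the meaning of the correlation values concrete, the following sketch builds a synthetic table (all column names are invented) in which the expected correlations are known in advance, then reads off one column of the matrix the way one might for a target variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

toy = pd.DataFrame({
    'x': x,
    'double_x': 2 * x + 1,           # perfect positive linear relation
    'neg_x': -x,                     # perfect negative linear relation
    'noise': rng.normal(size=200),   # unrelated to x
})

corr = toy.corr()  # Pearson correlation by default

# Correlation of every other column with 'x', strongest positive first
target_corr = corr['x'].drop('x').sort_values(ascending=False)
print(target_corr.round(2))
```

Sorting a single column like this can be easier to scan than the full heatmap when only the relationship with one variable is of interest.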


In [None]:
# Select relevant columns to be used as features and target variables
data_cols = ['PA_21', 'OBP_21', 'PA_20', 'OBP_20', 'PA_19', 'OBP_19', 'PA_18', 'OBP_18', 'PA_17', 'OBP_17', 'PA_16', 'OBP_16', 'age']

# Initialize figure and specify its size
plt.figure(figsize=(10, 5))

# Calculate the correlation matrix for the chosen features
correlation_matrix = df[data_cols].corr()

# Generate a heatmap to visualize the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Set the title for the heatmap
plt.title('Correlation Heatmap of All Features')
plt.show()

## 7. Feature Engineering
In this section, we enhance the dataset by creating new features that better capture player performance trends and the significance of recent statistics.

### Steps:
1. **Generate a Weighted OBP Metric:**  
   A new column, `OBP_weighted`, is created to represent a weighted average of on-base percentage (OBP) over the past five years (2016 to 2020). This metric assigns more importance to the most recent seasons:
   - OBP for 2016 is weighted at 10%
   - OBP for 2017 is weighted at 15%
   - OBP for 2018 is weighted at 20%
   - OBP for 2019 is weighted at 25%
   - OBP for 2020 is weighted at 30%  
   This weighted approach helps to emphasize the players' current performance trends, which are more indicative of their future performance.

2. **Calculate OBP Trends:**  
   Two new columns, `OBP_trend_1920` and `OBP_trend_1819`, are computed to capture the change in OBP between consecutive seasons:
   - `OBP_trend_1920` reflects the difference between the OBP of 2020 and 2019, indicating how a player’s performance improved or declined from 2019 to 2020.
   - `OBP_trend_1819` measures the change in OBP from 2018 to 2019.  
   These trend features provide insights into players' performance consistency and improvement over time, which can be valuable for making predictions.

By incorporating these engineered features, the dataset is better equipped for modeling, potentially leading to improved predictive accuracy.
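
A worked example of the arithmetic for a single hypothetical player is sketched below. The OBP values are made up; the weights are the ones listed above.

```python
import numpy as np

weights = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # 2016 ... 2020
obp = np.array([0.300, 0.310, 0.320, 0.330, 0.340])  # made-up OBP values

# The weights should sum to 1 so the result stays on the OBP scale
assert np.isclose(weights.sum(), 1.0)

obp_weighted = float(np.dot(weights, obp))
trend_1920 = float(obp[4] - obp[3])  # OBP_20 - OBP_19
trend_1819 = float(obp[3] - obp[2])  # OBP_19 - OBP_18

print(round(obp_weighted, 3), round(trend_1920, 3), round(trend_1819, 3))
```

For this steadily improving player the weighted value (0.325) sits above the plain five-year mean (0.320), because the later, higher seasons count for more.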


In [None]:
# Generate a weighted OBP metric, giving more significance to recent years
df['OBP_weighted'] = (df['OBP_16'] * 0.1 + df['OBP_17'] * 0.15 
                      + df['OBP_18'] * 0.2 + df['OBP_19'] * 0.25 
                      + df['OBP_20'] * 0.3)

# Calculate OBP trends to reflect how performance changes between consecutive seasons
df['OBP_trend_1920'] = df['OBP_20'] - df['OBP_19']
df['OBP_trend_1819'] = df['OBP_19'] - df['OBP_18']

## 8. Data Preparation for Modeling
In this section, we specify the feature columns to be used for predicting the target variable. The dataset is then split into training and test sets to facilitate model training and evaluation.

### Steps:
1. **Define Feature Columns and Target Variable:**  
   A list named `features` is created to specify the independent variables that will be used for prediction. These include PA and OBP from 2016 to 2020, the player's age, and the engineered features `OBP_weighted`, `OBP_trend_1920`, and `OBP_trend_1819`. The target variable, `OBP_21`, represents the on-base percentage for the year 2021, which we aim to predict.

2. **Split the Data:**  
   The dataset is split into training and testing subsets using the `train_test_split` function from the `sklearn.model_selection` module. This function randomly divides the data:
   - `x_train` and `y_train` are the training features and target values, respectively.
   - `x_test` and `y_test` are the test features and target values.
   The `train_size` parameter is set to 0.7, indicating that 70% of the data will be used for training, while 30% will be reserved for testing. The `random_state` parameter is set to 42 to ensure reproducibility of the results across different runs.

By preparing the data in this manner, we can effectively train models and evaluate their performance based on unseen test data.
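
The effect of `train_size` and `random_state` can be checked on a toy table. The 100-row frame and the helper `split` below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({'feature': np.arange(100), 'target': np.arange(100) * 0.01})

def split(seed):
    # Same call pattern as in the notebook, on the toy table
    return train_test_split(toy[['feature']], toy['target'],
                            train_size=0.7, random_state=seed)

x_tr_a, x_te_a, y_tr_a, y_te_a = split(42)
x_tr_b, x_te_b, y_tr_b, y_te_b = split(42)

print(len(x_tr_a), len(x_te_a))            # 70 training rows, 30 test rows
print(x_te_a.index.equals(x_te_b.index))   # same seed, same test rows
```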


In [None]:
# Specify the feature columns and the target variable for prediction
features = ['PA_20', 'OBP_20', 'PA_19', 'OBP_19', 'PA_18', 'OBP_18', 'PA_17', 'OBP_17', 'PA_16', 'OBP_16',
            'age', 'OBP_weighted', 'OBP_trend_1920', 'OBP_trend_1819']
target = 'OBP_21'

# Split the data into training and test sets, ensuring results can be reproduced
x_train, x_test, y_train, y_test = train_test_split(df[features], df[target], train_size=0.7, test_size=0.3, random_state=42)

## 9. Model Training
In this section, we create a variety of regression models to predict on-base percentage (OBP) and train each model using the training dataset. The models include a range of regression techniques to compare their performance.

### Steps:
1. **Create a Dictionary of Models:**  
   A dictionary named `models` is established to store different regression models. Each model is associated with its corresponding name for easy reference during evaluation. The models included are:
   - Linear Regression
   - Ridge Regression
   - Lasso Regression
   - Random Forest Regression
   - Decision Tree Regressor
   - Gradient Boosting
   - AdaBoost
   - ElasticNet Regression
   - XGBoost

2. **Train Each Model:**  
   A loop iterates through the values of the `models` dictionary, which contains the instantiated model objects. For each model, the `fit` method is called using the training features (`x_train`) and target values (`y_train`). This step trains the model on the provided dataset, allowing it to learn the underlying patterns necessary for making predictions.

By training multiple models, we can later compare their performance and select the best one based on predictive accuracy and other evaluation metrics.
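
Some of these models, such as the random forest, involve randomness during fitting. One optional way to pin that down per model, rather than relying only on global seeds, is scikit-learn's `random_state` argument. The sketch below uses synthetic data and a made-up helper `fit_predict`; it is not a change to the notebook's own model dictionary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] * 0.5 - X[:, 1] * 0.2 + rng.normal(scale=0.1, size=120)

def fit_predict(seed):
    # A small forest keeps the example fast
    model = RandomForestRegressor(n_estimators=25, random_state=seed)
    return model.fit(X, y).predict(X[:5])

# Two forests built with the same random_state make the same predictions
same = np.allclose(fit_predict(42), fit_predict(42))
print(same)
```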


In [None]:
# Create a dictionary of models to be evaluated
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest Regression': RandomForestRegressor(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'AdaBoost': AdaBoostRegressor(),
    'ElasticNet Regression': ElasticNet(),
    'XGBoost': XGBRegressor()
}

# Train each model using the training dataset
for model in models.values():
    model.fit(x_train, y_train)

## 10. Model Evaluation
In this section, we define a function to assess the accuracy of the predictions made by the various regression models. The evaluation focuses on two key performance metrics: Mean Squared Error (MSE) and the Coefficient of Determination (R²).

### Steps:
1. **Define the Evaluation Function:**  
   A function named `evaluate_model` is created to evaluate the performance of a given model. The function takes three parameters: the model to be evaluated, the test features (`x_test`), and the actual target values (`y_test`).

2. **Generate Predictions:**  
   Inside the function, predictions for the test dataset are generated using the provided model. This step is crucial for comparing the predicted values against the actual values.

3. **Compute Mean Squared Error (MSE):**  
   The Mean Squared Error is calculated by comparing the actual target values with the predicted values. MSE provides a measure of how close the predictions are to the actual outcomes, with lower values indicating better accuracy.

4. **Compute Coefficient of Determination (R²):**  
   The Coefficient of Determination is computed to assess the model's explanatory power. R² indicates the proportion of variance in the dependent variable that can be explained by the independent variables, with values closer to 1 signifying a better fit.

5. **Display Performance Metrics:**  
   The MSE and R² values are printed to the console, providing a summary of the model's performance.

6. **Assess Each Model's Performance:**  
   A loop iterates through all models in the `models` dictionary:
   - For each model, a message is printed to indicate which model is being evaluated.
   - The `evaluate_model` function is called for each model, passing the test features and actual target values.

By evaluating the models based on MSE and R², we can determine which model performs best in predicting OBP, providing valuable insights for model selection and improvement.
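
For reference, the two metrics are \(\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\) and \(R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\). The sketch below computes both by hand on four made-up values and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([0.300, 0.320, 0.350, 0.400])  # made-up actual OBP
y_pred = np.array([0.310, 0.330, 0.340, 0.380])  # made-up predictions

residuals = y_true - y_pred

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R²: one minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

Because MSE is in squared units, taking its square root gives an error on the same scale as OBP, which can be easier to interpret.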


In [None]:
# Create a function to assess the accuracy of model predictions
def evaluate_model(model, x_test, y_test):
    # Generate predictions for the test dataset
    predictions = model.predict(x_test)
    # Compute the Mean Squared Error between actual and predicted values
    mse = mean_squared_error(y_test, predictions)
    # Compute the Coefficient of Determination to assess model performance
    r2 = r2_score(y_test, predictions)
    print(f'MSE: {mse:.4f} and R2: {r2:.4f}')

# Assess each model's performance
for name, model in models.items():
    print(f'Evaluating {name}:')
    evaluate_model(model, x_test, y_test)

## 11. Comparison of Model Predictions
In this section, we compare the predictions made by the top-performing regression models: ElasticNet, Ridge, and Linear Regression. This comparison visually assesses how well each model predicts the actual on-base percentage (OBP) for the year 2021.

### Steps:
1. **Select Top-Performing Models:**  
   A list of the best-performing models is defined, including ElasticNet Regression, Ridge Regression, and Linear Regression. These models will be evaluated based on their prediction capabilities.

2. **Initialize Figure for Plotting:**  
   A new figure is created for plotting the predictions. The figure's size is set to ensure clarity and readability of the visualizations.

3. **Define Colors for the Models:**  
   A list of colors is specified for the models to distinguish between them in the scatter plot. Colors help to enhance the visual appeal and facilitate easy comparison.

4. **Iterate Through Selected Models:**  
   A loop is initiated to go through each model in the selected list:
   - **Retrieve the Model:** Each model is retrieved from the predefined dictionary of models.
   - **Generate Predictions:** Predictions are generated using the test dataset for the current model.
   - **Plot Predictions:** The predictions are added to a scatter plot, with actual OBP values on the x-axis and predicted OBP values on the y-axis, using the specified colors for each model.

5. **Add Reference Line for Perfect Predictions:**  
   A reference line is plotted to represent perfect predictions, where actual values equal predicted values. This line serves as a benchmark to evaluate the accuracy of the models' predictions.

6. **Set Plot Labels and Title:**  
   The x-axis and y-axis labels are defined, along with the plot title. A legend is added to identify each model's predictions.

7. **Display the Plot:**  
   Finally, the plot is displayed, allowing for a visual comparison of how well each model's predictions align with the actual OBP values for the year 2021.

By visually comparing the predictions of the different models, we can assess their performance and identify which model offers the most accurate predictions for OBP.


In [None]:
# Select the top-performing models for comparison
best_models = ['ElasticNet Regression', 'Ridge Regression', 'Linear Regression']

# Initialize the figure for plotting
plt.figure(figsize=(12, 8))

# Define a list of colors for the models
colors = ['blue', 'red', 'green']

# Iterate through each of the selected models to plot predictions
for name, color in zip(best_models, colors):
    # Retrieve the model from the dictionary
    model = models[name]
    # Generate predictions using the test dataset
    predictions = model.predict(x_test)
    # Add the predictions to the scatter plot with specified colors
    plt.scatter(y_test, predictions, label=name, color=color)

# Add a reference line for perfect predictions
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='black', linestyle='--')

# Set plot labels and title
plt.xlabel('Actual OBP from 2021')
plt.ylabel('Predicted OBP from 2021')
plt.title('Predicted vs. Actual OBP for 2021')
plt.legend()
plt.show()

## 12. Analyzing Feature Importance in Ridge Regression
In this section, we analyze the feature importance derived from the Ridge Regression model by examining the coefficients associated with each feature. This analysis helps us understand the impact of various features on the target variable, on-base percentage (OBP).

### Steps:
1. **Select the Model for Analysis:**  
   The Ridge Regression model is chosen for analysis. This model has been previously trained and is now utilized to extract insights regarding feature importance.

2. **Retrieve Model Coefficients:**  
   The coefficients of the trained Ridge Regression model are retrieved. These coefficients represent the weight or importance of each feature in predicting the target variable.

3. **Create a DataFrame for Feature Importance:**  
   A DataFrame is created to display the importance of each feature. The absolute values of the coefficients are taken so that features with large negative coefficients rank alongside those with large positive ones. Because the features are not standardized here, coefficient magnitudes also depend on each feature's scale, so the ranking is only a rough guide to influence on OBP.

4. **Sort Features by Importance:**  
   The DataFrame is sorted in descending order based on the absolute values of the coefficients. This sorting enables easy identification of the most important features impacting OBP.

5. **Display Feature Importance:**  
   The feature importance DataFrame is printed, showing each feature alongside its calculated importance. Taking absolute values keeps features with strongly negative coefficients (the code comment gives increased age as an example) from ranking low, but it also hides the sign, which has to be read from `model.coef_` itself.

By analyzing the coefficients of the Ridge Regression model, we gain a clearer understanding of how various features contribute to predictions, informing potential strategies for player evaluation and decision-making.
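
One caveat when reading coefficients as importance is feature scale: PA values are in the hundreds while OBP values are below 1. The sketch below is a synthetic illustration, not the notebook's method. It builds two features that contribute equally to a target and compares raw Ridge coefficients with those from a pipeline that standardizes first; all data and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
small = rng.normal(0, 0.03, n)   # OBP-like scale (hypothetical)
big = rng.normal(0, 150, n)      # PA-like scale (hypothetical)
X = np.column_stack([small, big])

# Both terms contribute about 0.03 standard deviations to y
y = 1.0 * small + 0.0002 * big + rng.normal(0, 0.001, n)

# Coefficients on the raw features differ by orders of magnitude
raw_coef = np.abs(Ridge().fit(X, y).coef_)

# After standardizing, the two coefficients are of similar size
scaled = make_pipeline(StandardScaler(), Ridge()).fit(X, y)
std_coef = np.abs(scaled.named_steps['ridge'].coef_)

print('raw:', raw_coef)
print('standardized:', std_coef)
```

If a scale-free ranking is wanted, standardizing the features before fitting is one common option.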


In [None]:
# Choose the Ridge Regression model for analysis
model = models['Ridge Regression']

# Retrieve the coefficients from the trained model
coefficients = model.coef_

# Create a dataframe to showcase feature importance
# *** Take the absolute values of the coefficients so that large negative
# coefficients also rank highly (e.g., increased age pulling predicted OBP down) ***
feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': [round(abs(c), 6) for c in coefficients]
}).sort_values(by='Importance', ascending=False)

print(feature_importance)

## 13. Feature Selection Using Recursive Feature Elimination (RFE)
In this section, we employ Recursive Feature Elimination (RFE) to identify the most important features for predicting the on-base percentage (OBP). This process helps to simplify the model by selecting only the most relevant variables, potentially improving model performance and interpretability.

### Steps:
1. **Initialize RFE:**  
   RFE is initialized using the previously trained Ridge Regression model as the estimator. The goal is to select the top 10 features that contribute most significantly to the prediction of the target variable.

2. **Fit the RFE Model to Training Data:**  
   The RFE model is fitted to the training dataset. During this process, RFE evaluates the importance of each feature by recursively removing the least important ones until only the specified number of top features remains.

3. **Retrieve Selected Features:**  
   The features selected by RFE are retrieved. This includes the names of the top 10 features that have been identified as most significant for predicting OBP.

4. **Define the Target Variable:**  
   The target variable for prediction, which is the on-base percentage for the year 2021 (`OBP_21`), is defined. This variable will be used in the subsequent modeling steps.

5. **Split the Data into Training and Testing Subsets:**  
   The dataset is split into training and testing subsets based on the selected features. As before, 70% of the data is used for training and 30% is reserved for testing. Reusing `random_state=42` on the same rows makes the split reproducible and assigns the same players to each subset as in Section 8.

By performing RFE, we focus on the most relevant features, which can lead to improved model efficiency and clearer insights into the factors influencing OBP.
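
To see what `support_` and `ranking_` contain, the sketch below runs RFE on synthetic data where the informative features are known. The feature names `f0` to `f5` and the coefficients are invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f'f{i}' for i in range(6)])

# Only f0, f2 and f4 drive the (synthetic) target
y = 3.0 * X['f0'] - 2.0 * X['f2'] + 1.5 * X['f4'] + rng.normal(scale=0.1, size=200)

rfe = RFE(estimator=Ridge(), n_features_to_select=3).fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features
selected = list(X.columns[rfe.support_])
ranking = dict(zip(X.columns, rfe.ranking_))

print(selected)
print(ranking)
```

RFE ranks features by coefficient magnitude at each step, so on real data with mixed scales the outcome can depend on whether the features were standardized.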


In [None]:
# Initialize Recursive Feature Elimination (RFE) using the Ridge model, selecting the top 10 features
rfe = RFE(estimator=model, n_features_to_select=10)

# Fit the RFE model to the training data
rfe.fit(x_train, y_train)

# Retrieve the features selected by RFE
selected_features = x_train.columns[rfe.support_]
print("Selected Features:", selected_features)

# Define the target variable for prediction
target = 'OBP_21'

# Split the data into training and testing subsets (ensuring reproducibility)
x_train, x_test, y_train, y_test = train_test_split(df[selected_features], df[target], train_size=0.7, test_size=0.3, random_state=42)

## 14. Hyperparameter Tuning for Ridge Regression
In this section, we perform hyperparameter tuning on the Ridge Regression model to determine the optimal value for the alpha parameter. This tuning process helps enhance the model's performance by balancing bias and variance.

### Steps:
1. **Initialize the Model:**  
   A new Ridge Regression model is instantiated. This model will be optimized based on the alpha parameter.

2. **Define the Parameter Grid:**  
   A parameter grid is created to explore a range of alpha values. The alpha parameter is varied on a logarithmic scale from \(10^{-5}\) to \(10^{5}\) (50 values), allowing for a broad search for the best regularization strength.

3. **Set Up Grid Search with Cross-Validation:**  
   GridSearchCV is utilized to systematically evaluate different alpha values through cross-validation. The scoring method used is the negative mean squared error (MSE), which helps identify the alpha that minimizes prediction error.

4. **Fit the Model to Training Data:**  
   The Ridge Regression model is fitted to the training dataset using the parameter grid. This process trains the model while evaluating each alpha value's performance through cross-validation.

5. **Output the Best Alpha Value:**  
   The best alpha value identified during the grid search is printed. This value represents the optimal regularization strength for the Ridge Regression model, which can be used for subsequent model fitting and evaluation.

By conducting this hyperparameter tuning, we enhance the model's capability to generalize well to unseen data, thereby improving its predictive performance.
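
The sketch below runs the same kind of search on synthetic data to show which attributes of a fitted `GridSearchCV` are useful afterwards. The data and variable names are assumptions; `cv=5` is written out explicitly, which matches scikit-learn's default in recent versions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X @ np.array([0.5, -0.3, 0.0, 0.2, 0.0]) + rng.normal(scale=0.5, size=150)

alphas = np.logspace(-5, 5, 50)
search = GridSearchCV(Ridge(), {'alpha': alphas},
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X, y)

best_alpha = search.best_params_['alpha']
cv_mse = -search.best_score_          # flip the sign back to a plain MSE
tuned_model = search.best_estimator_  # already refit on all of X, y (refit=True)

print(f'best alpha: {best_alpha:.4g}, cross-validated MSE: {cv_mse:.4f}')
```

Since `refit=True` is the default, `best_estimator_` can be used directly for prediction; creating a fresh `Ridge` with the best alpha and fitting it again gives an equivalent model.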


In [None]:
# Initialize a new Ridge Regression model
ridge = Ridge()

# Define a parameter grid to explore optimal values for alpha
param_grid = {'alpha': np.logspace(-5, 5, 50)}  # Values from 0.00001 to 100,000

# Use GridSearchCV to find the best alpha value through cross-validation
# The scoring method is based on the negative mean squared error
ridge = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error')

# Fit the model to the training data
ridge.fit(x_train, y_train)

# Output the best alpha value for the Ridge Regression model
print(f"Best alpha for Ridge Regression: {ridge.best_params_['alpha']}")

## 15. Optimizing and Evaluating the Ridge Regression Model
In this section, we instantiate a new Ridge Regression model using the optimal alpha value identified during the hyperparameter tuning process. The model is then fitted to the training data, and its performance is evaluated on the test dataset.

### Steps:
1. **Instantiate the Model:**  
   A new Ridge Regression model is created using the optimal alpha value obtained from the previous grid search. This model is designed to minimize prediction error while preventing overfitting.

2. **Fit the Model to Training Data:**  
   The optimized Ridge Regression model is trained using the training dataset. This process involves learning the relationship between the features and the target variable (OBP) based on the training data.

3. **Evaluate Model Performance:**  
   The performance of the fitted model is assessed on the test dataset using the `evaluate_model` function defined in Section 10, which prints the Mean Squared Error (MSE) and R² score to quantify how well the model predicts the on-base percentage for the players in the test set.

By optimizing the model and evaluating its performance, we can gain insights into its predictive capabilities and make informed decisions about potential improvements or further tuning.


In [None]:
# Instantiate a new Ridge Regression model with the optimal alpha value
best_ridge_model = Ridge(alpha=ridge.best_params_['alpha'])

# Fit the optimized model to the training data
best_ridge_model.fit(x_train, y_train)

# Evaluate the performance of the model on the test data
evaluate_model(best_ridge_model, x_test, y_test)

## 16. Final Predictions of All OBP

In [None]:
# Assign the best Ridge Regression model to a variable for predictions
model = best_ridge_model

# Generate predictions using the test dataset
predictions = model.predict(x_test)

# Calculate the absolute differences between actual values and model predictions
errors = abs(y_test - predictions)

# Create a DataFrame to compare actual and predicted values along with the errors
error_df = pd.DataFrame({
    'Player': df.loc[x_test.index, 'Name'],
    'Actual': np.round(y_test, 3),
    'Predicted': np.round(predictions, 3),
    'Error': np.round(errors, 3)
})

# Sort the DataFrame by the 'Error' column
closest_predictions = error_df.sort_values(by='Error')

# Set display options to show all rows
pd.set_option('display.max_rows', None)  # None means unlimited

# Print all results
print(closest_predictions)

# Optionally, reset to default after printing if needed
pd.reset_option('display.max_rows')


## 17. Distribution of Errors based on predictions
### Steps:
1. **Set the Figure Size:**  
   The figure is set to 10 inches wide and 6 inches tall.

2. **Create a Histogram of the Errors:**  
   A histogram of the `Error` column of `closest_predictions` is drawn with bin edges spaced 0.005 apart, starting at 0 and stopping before 0.145, and a kernel density estimate (KDE) is overlaid.

3. **Set Title and Labels:**  
   The histogram is titled "Distribution of Prediction Errors", and labels are added for the x-axis and y-axis.

4. **Set X-Ticks:**  
   The x-ticks are placed at every bin edge and rotated by 45 degrees for readability.

5. **Initialize Counters:**  
   Two counters are initialized for predictions within and outside an absolute error of 0.050.

6. **Color Bars by Error Size:**  
   Each bar is colored according to the x-value at its center: green if that value is less than or equal to 0.050, red otherwise. The bar's height is added to the matching counter.

7. **Add Text to Bars:**  
   The height of each bar is written above it.

8. **Create a Legend:**  
   A legend in the top right corner distinguishes successful from unsuccessful predictions.

9. **Show the Plot:**  
   The plot is displayed with a tight layout.
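
The within/outside counts can also be computed directly from the error values instead of from bar heights. The sketch below uses made-up errors and NumPy's `np.histogram` as a stand-in for the plotted histogram; it also shows that a value beyond the last bin edge is not binned, so it would be missing from counts derived from the bars.

```python
import numpy as np

errors = np.array([0.004, 0.012, 0.049, 0.051, 0.121, 0.200])  # made-up errors
threshold = 0.050

# Direct counts from the raw values
within = int((errors <= threshold).sum())
outside = int((errors > threshold).sum())

# Same bin edges as the histogram: values past the last edge are not binned
edges = np.arange(0, 0.145, 0.005)
counts, _ = np.histogram(errors, bins=edges)
binned = int(counts.sum())

print(within, outside, binned)  # prints: 3 3 5
```

Whether this difference matters depends on the actual range of the errors; if none exceeds the last bin edge, both approaches count the same predictions.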

In [None]:
# Set the figure size
plt.figure(figsize=(10, 6))

# Create a histogram for errors with a specified number of bins
histogram = sns.histplot(closest_predictions['Error'], bins=np.arange(0, 0.145, 0.005), kde=True, color='blue')

# Set title and labels
plt.title('Distribution of Prediction Errors', fontsize=16)
plt.xlabel('Absolute Error', fontsize=14)
plt.ylabel('Frequency', fontsize=14)

# Set x-ticks to show all increments
plt.xticks(np.arange(0, 0.145, 0.005), rotation=45)

# Initialize counters for predictions within and outside the 0.050 range
within_range_count = 0
outside_range_count = 0

# Get the bar heights and positions, and set the color of each bar based on the x-axis value
for bar in histogram.patches:
    x_value = bar.get_x() + bar.get_width() / 2  # Get the x-axis value for the bar
    if x_value <= 0.050:  # Check whether the bar's center is at or below 0.050
        bar.set_facecolor('green')
        within_range_count += bar.get_height()  # Increment within-range count
    else:
        bar.set_facecolor('red')
        outside_range_count += bar.get_height()  # Increment outside-range count
    
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, height, f'({int(height)})', ha='center', va='bottom', fontsize=10)

# Create a legend (key) in the top right corner
plt.legend(handles=[plt.Rectangle((0, 0), 1, 1, color='green'), plt.Rectangle((0, 0), 1, 1, color='red')],
           labels=['Within 0.050 (successful)', 'Outside 0.050 (unsuccessful)'], loc='upper right', fontsize=12)

# Show the plot
plt.tight_layout()
plt.show()

## 17.1 Totals of Successful and Unsuccessful Predictions
### Steps:
1. **Set the Figure Size:**  
   The figure is set to 8 inches wide and 6 inches tall.

2. **Create a Bar Chart of the Counts:**  
   A bar chart compares the counts of successful and unsuccessful predictions.

3. **Set Title and Labels:**  
   The bar chart is titled "Total Successful and Unsuccessful Predictions", and labels are added for the x-axis and y-axis.

4. **Add the Total Count Above Each Bar:**  
   The total count is written above each bar so the values can be read directly.

5. **Show the Plot:**  
   The plot is displayed with a tight layout.

In [None]:
# Set the figure size
plt.figure(figsize=(8, 6))

# Create a bar chart for successful and unsuccessful counts
plt.bar(['Successful', 'Unsuccessful'], [within_range_count, outside_range_count], color=['green', 'red'])

# Set title and labels
plt.title('Total Successful and Unsuccessful Predictions', fontsize=16)
plt.xlabel('Prediction Outcome', fontsize=14)
plt.ylabel('Count', fontsize=14)

# Add total count above each bar
for i, count in enumerate([within_range_count, outside_range_count]):
    plt.text(i, count + 1, f'({int(count)})', ha='center', va='bottom', fontsize=10)

# Show the plot
plt.tight_layout()
plt.show()