## March Madness 2023 regression model to predict tournament wins

In [None]:
# Import packages
import pandas as pd
import statsmodels.api as sm

In [None]:
# Read in the data from a CSV file
data = pd.read_csv("Tournament Team Data.csv")

In [None]:
# Get the unique values in the "ROUND" column
rounds = data["ROUND"].unique()

# Display the unique values
print(rounds)

The "ROUND" column records the last round each team played in. To see how the rounds are encoded in the data set, we call unique() on "ROUND" and print its distinct values.

In [None]:

# Define a function to convert round numbers to wins
def round_to_wins(round_num):
    if round_num == 68 or round_num == 64:
        return 0
    elif round_num == 32:
        return 1
    elif round_num == 16:
        return 2
    elif round_num == 8:
        return 3
    elif round_num == 4:
        return 4
    elif round_num == 2:
        return 5
    elif round_num == 1:
        return 6
    else:
        return None


# Apply the function to create a new "WINS" column
data["WINS"] = data["ROUND"].apply(round_to_wins)

This code defines a function named "round_to_wins" that takes the last round a team played in and returns the number of tournament wins that round implies. Teams eliminated in the round of 68 or 64 have 0 wins, and the champion (round 1) has 6; any unrecognized value returns None. The function is applied to the "ROUND" column of "data" with the .apply() method, and the results are stored in a new "WINS" column.
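As an alternative to the if/elif chain, the same mapping can be written as a dictionary lookup. This is a minimal sketch, not the notebook's method: the names `ROUND_TO_WINS` and `round_to_wins_lookup` are hypothetical, and the sample values are made up for illustration.

```python
# Hypothetical lookup table equivalent to the if/elif chain in round_to_wins
ROUND_TO_WINS = {68: 0, 64: 0, 32: 1, 16: 2, 8: 3, 4: 4, 2: 5, 1: 6}


def round_to_wins_lookup(round_num):
    """Return the number of wins for a round, or None if the round is unknown."""
    return ROUND_TO_WINS.get(round_num)


# Illustrative inputs only; 99 stands in for an unrecognized round value
sample_rounds = [68, 64, 32, 16, 8, 4, 2, 1, 99]
sample_wins = [round_to_wins_lookup(r) for r in sample_rounds]
print(sample_wins)
```

With pandas, a dictionary like this could also be passed to `data["ROUND"].map(...)` in place of `.apply()`.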

In [None]:
# Display the expected value of wins by 'SEED'
expected_wins = data.groupby('SEED')['WINS'].mean()
print(expected_wins)

This code groups the data by "SEED" and calculates the average number of wins for each group using the .groupby() function and the .mean() method. It then prints the resulting expected value of wins for each "SEED" group.
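To make the grouped average concrete, here is roughly what the computation does, written out in plain Python. The rows are made up for illustration and are not taken from the tournament data; `toy_rows` and `toy_expected_wins` are hypothetical names.

```python
from collections import defaultdict

# Made-up (seed, wins) pairs for illustration only
toy_rows = [(1, 4), (1, 2), (2, 1), (2, 3), (16, 0), (16, 0)]

# Collect the wins of every team that shares a seed
wins_by_seed = defaultdict(list)
for seed, wins in toy_rows:
    wins_by_seed[seed].append(wins)

# Average the wins within each seed group
toy_expected_wins = {seed: sum(w) / len(w) for seed, w in wins_by_seed.items()}
print(toy_expected_wins)
```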

In [None]:
# Define the predictor variables
predictors = ['SEED', 'KENPOM ADJUSTED OFFENSE', 'KENPOM ADJUSTED DEFENSE']

# Define the response variable
response = 'WINS'

# Create the design matrix by adding a constant and selecting the predictor variables
X = sm.add_constant(data[predictors])

# Select the response column
y = data[response]

# Fit the linear regression model
model = sm.OLS(y, X).fit()

This code defines a linear regression model using the OLS (ordinary least squares) method from the statsmodels library. The predictor variables, response variable, and design matrix are defined, and the linear regression model is fit to the data. The resulting model uses SEED, the KenPom Adjusted Offense rating, and the KenPom Adjusted Defense rating to predict the number of wins a team will have in the tournament.
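Written out, the model being fit has the following form. The shorthand AdjO and AdjD (for the two KenPom ratings) and the coefficient symbols are notation introduced here for illustration.

$$
\widehat{\text{WINS}} = \beta_0 + \beta_1\,\text{SEED} + \beta_2\,\text{AdjO} + \beta_3\,\text{AdjD}
$$

OLS chooses the coefficients that minimize the sum of squared residuals over all teams $i$:

$$
\min_{\beta_0,\ldots,\beta_3} \sum_i \left(\text{WINS}_i - \widehat{\text{WINS}}_i\right)^2
$$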

In [None]:
# Save the model summary as a text file
with open('model_summary.txt', 'w') as file:
    file.write(str(model.summary()))

![](first_model_summary.png)

This is the output of an Ordinary Least Squares (OLS) regression. The model uses three independent variables, SEED, KENPOM ADJUSTED OFFENSE, and KENPOM ADJUSTED DEFENSE, to predict the dependent variable, WINS. The R-squared value of 0.331 means the model explains 33.1% of the variation in WINS. The summary also reports each coefficient with its standard error, t-statistic, and p-value. The SEED coefficient is negative: a larger seed number (a weaker seed) is associated with fewer wins. The KENPOM ADJUSTED OFFENSE coefficient is positive and the KENPOM ADJUSTED DEFENSE coefficient is negative, which fits how the ratings are scaled, since a lower adjusted defense rating means fewer points allowed. The F-statistic of 154.9 and its p-value of 1.53e-81 indicate that the model as a whole is statistically significant.
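For readers who want to see where an R-squared value comes from, here is a minimal sketch of the calculation for a model with an intercept. The function name `r_squared` and the toy numbers are assumptions for illustration; they are not the tournament data.

```python
def r_squared(actual, fitted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up actual and fitted win counts, for illustration only
toy_actual = [0, 1, 2, 3, 4]
toy_fitted = [0.5, 1.0, 2.0, 3.0, 3.5]
toy_r2 = r_squared(toy_actual, toy_fitted)
print(round(toy_r2, 3))
```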

In [None]:
# Convert SEED to categorical variable
data['SEED'] = pd.Categorical(data['SEED'])

# Define the predictor variables
predictors = ['SEED', 'KENPOM ADJUSTED OFFENSE', 'KENPOM ADJUSTED DEFENSE']

# Define the response variable
response = 'WINS'

# Create the design matrix: one-hot encode SEED, then add a constant.
# dtype=float keeps the dummy columns numeric (recent pandas versions default to bool)
X = sm.add_constant(pd.get_dummies(data[predictors], dtype=float))

# Define the response variable
y = data[response]

# Fit the linear regression model
model = sm.OLS(y, X).fit()

SEED is a discrete numerical variable taking values from 1 to 16. To try to improve the model, I changed SEED to a categorical variable. The 'SEED' column is first converted to a categorical type using pandas, and then the predictor variables and response variable are defined. The design matrix is created using 'get_dummies' from pandas and 'add_constant' from statsmodels, and the model is fit using the Ordinary Least Squares (OLS) method from statsmodels.
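The dummy encoding turns one SEED value into sixteen 0/1 columns, exactly one of which is 1. The sketch below shows that idea in plain Python. The helper `seed_dummies` is hypothetical, and the `SEED_<n>` column names assume the seeds are stored as integers.

```python
def seed_dummies(seed, n_seeds=16):
    """Return a one-hot row for a seed: one 'SEED_<n>' key per possible seed."""
    if seed not in range(1, n_seeds + 1):
        raise ValueError(f"seed must be between 1 and {n_seeds}, got {seed}")
    return {f"SEED_{s}": int(s == seed) for s in range(1, n_seeds + 1)}


# A 3 seed has SEED_3 set to 1 and every other dummy set to 0
row = seed_dummies(3)
print(row["SEED_3"], row["SEED_4"], sum(row.values()))
```

Treating each seed as its own level lets the model estimate a separate shift in wins per seed, rather than forcing a single straight-line effect across seeds 1 to 16.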

In [None]:
# Save the updated model summary as a text file
with open('model_summary_new.txt', 'w') as file:
    file.write(str(model.summary()))

![](second_model_summary.png)

This is the output of a linear regression model where the dependent variable is "WINS" and the independent variables include "KENPOM ADJUSTED OFFENSE", "KENPOM ADJUSTED DEFENSE", and "SEED" (with 16 different levels). The R-squared value is 0.421, which means that 42.1% of the variance in the dependent variable is explained by the independent variables. The coefficients of each independent variable show how much the dependent variable changes when the corresponding independent variable changes by one unit while holding all other variables constant. The p-values of the coefficients indicate the statistical significance of each independent variable in the model.

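The new model has a higher R-squared value (0.421 versus 0.331), so it is the one used to predict tournament wins below. Because the categorical model estimates more parameters, and R-squared cannot decrease when predictors are added, the adjusted R-squared reported in each summary is the fairer basis for this comparison.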
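When two models differ in the number of predictors, a common check is adjusted R-squared, which penalizes extra parameters. The sketch below applies the standard formula. The sample size `hypothetical_n` is a placeholder and not the real number of rows in the data set, and the predictor count of 17 for the second model assumes 15 independent seed dummies plus the two ratings. In statsmodels, the fitted results object also exposes this value as `rsquared_adj`.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Placeholder sample size for illustration; substitute len(data) in practice
hypothetical_n = 500

adj_first = adjusted_r_squared(0.331, hypothetical_n, 3)    # numeric SEED model
adj_second = adjusted_r_squared(0.421, hypothetical_n, 17)  # categorical SEED model
print(adj_first, adj_second)
```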

In [None]:
# Extract the coefficients and intercept by label
# ('const' is the column name added by sm.add_constant)
coefficients = model.params.drop('const')
intercept = model.params['const']

# Define a function to predict the number of wins
def predict_wins(seed, adj_offense, adj_defense):
    # Create a dictionary of the predictor variable values
    predictors_dict = {'SEED_' + str(seed): 1, 
                       'KENPOM ADJUSTED OFFENSE': adj_offense, 
                       'KENPOM ADJUSTED DEFENSE': adj_defense}
    
    # Calculate the predicted number of wins
    wins = intercept
    for predictor, value in predictors_dict.items():
        wins += coefficients[predictor] * value
    
    return wins

The code extracts the coefficients and intercept from a statistical model and defines a function called predict_wins() that takes in three predictor variables: seed, adjusted offense, and adjusted defense. The function then creates a dictionary of the predictor variable values and uses the coefficients and intercept extracted earlier to calculate the predicted number of wins based on the provided predictor variable values. The predicted number of wins is returned as the output of the function.
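The arithmetic inside predict_wins() can be traced by hand with a small stand-in model. In the sketch below, every coefficient value is made up for illustration and is not taken from the fitted model; `predict_wins_sketch` and the `hypothetical_*` names are likewise assumptions.

```python
# Made-up values for illustration only; the real ones come from model.params
hypothetical_intercept = -2.0
hypothetical_coefficients = {
    "SEED_1": 1.5,
    "KENPOM ADJUSTED OFFENSE": 0.08,
    "KENPOM ADJUSTED DEFENSE": -0.06,
}


def predict_wins_sketch(seed, adj_offense, adj_defense, intercept, coefficients):
    """Intercept + the seed's dummy coefficient + coefficient * rating for each rating."""
    seed_column = f"SEED_{seed}"
    if seed_column not in coefficients:
        raise KeyError(f"no coefficient for {seed_column}")
    return (
        intercept
        + coefficients[seed_column]  # the dummy value is 1 for the team's own seed
        + coefficients["KENPOM ADJUSTED OFFENSE"] * adj_offense
        + coefficients["KENPOM ADJUSTED DEFENSE"] * adj_defense
    )


# -2.0 + 1.5 + 0.08 * 120 - 0.06 * 90
toy_prediction = predict_wins_sketch(
    1, 120.0, 90.0, hypothetical_intercept, hypothetical_coefficients
)
print(round(toy_prediction, 3))
```

Only the dummy for the team's own seed contributes to the sum, because every other seed dummy is 0 for that team.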

### Function to predict wins:

In [15]:
# Predict and display the expected wins for a sample team
pw = predict_wins(1, 120.0, 79.0)
pw = round(pw, 3)
print(f"This team is expected to win {pw} games in the tournament.")

This team is expected to win 3.844 games in the tournament.
