
5.2:

In [2]:
import pandas as pd
import scipy.stats as stats

# Load the dataset
file_path = "Dataset 4.4.xlsx"
df = pd.read_excel(file_path)

# Calculate Pearson correlation coefficient and p-value
correlation, p_value = stats.pearsonr(df["Home"], df["Visiting"])

# Print results
print(f"Correlation coefficient: {correlation:.3f}")
print(f"P-value: {p_value:.3e}")

# Interpretation
if p_value < 0.05:
    print("The correlation is statistically significant.")
else:
    print("The correlation is not statistically significant.")


Correlation coefficient: 0.253
P-value: 2.209e-19
The correlation is statistically significant.


5.3:

In [3]:
import pandas as pd
import numpy as np
import scipy.stats as stats

# Load the dataset
file_path = "Dataset 5.18.xlsx"
df = pd.read_excel(file_path)

# Calculate Pearson correlation coefficient and p-value
correlation, p_value = stats.pearsonr(df["RS"], df["RA"])

# Approximate the margin of error for the correlation coefficient
# (normal approximation, symmetric around r; only a rough guide when |r| is large)
n = len(df)  # Number of teams
standard_error = np.sqrt((1 - correlation**2) / (n - 2))
margin_of_error = 1.96 * standard_error  # Half-width of an approximate 95% interval

# Print results
print(f"Correlation coefficient: {correlation:.3f}")
print(f"Margin of Error: ±{margin_of_error:.3f}")
print(f"P-value: {p_value:.6f}")

# Interpretation
if p_value < 0.05:
    print("The correlation is statistically significant.")
else:
    print("The correlation is not statistically significant.")

Correlation coefficient: -0.620
Margin of Error: ±0.291
P-value: 0.000256
The correlation is statistically significant.


Interpretation:
The negative correlation suggests that teams that score more runs tend to allow fewer runs, which aligns with the expectation that stronger teams generally have both good offense and good defense.
Since the p-value is very small, this correlation is unlikely to be due to random chance.
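The margin of error gives an approximate 95% interval of -0.911 to -0.329 for the true correlation. Because this normal approximation is symmetric around r, it is only a rough guide when the correlation is far from zero.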
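A common alternative to the symmetric margin of error is an interval built with Fisher's z-transform, which keeps the bounds inside [-1, 1] and lets them be asymmetric around r. The sketch below is illustrative only: the helper name fisher_ci is made up here, and n = 30 is an assumption (it is consistent with the reported margin of error, but it is not read from the dataset).

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z-transform."""
    z = math.atanh(r)                    # map r onto an unbounded, roughly normal scale
    se = 1.0 / math.sqrt(n - 3)          # standard error of z
    lower = math.tanh(z - z_crit * se)   # back-transform each bound to the r scale
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# r comes from the output above; n = 30 is an assumed sample size
lower, upper = fisher_ci(-0.620, 30)
print(f"Approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```

The resulting interval is narrower on the side closer to -1, which is the behavior a bounded statistic should have.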

5.4:

In [5]:
import pandas as pd
import scipy.stats as stats

# Load the dataset
file_path = "Dataset 2.8.xlsx"
df_raw = pd.read_excel(file_path, header=None)

# Keep the first three columns from zero-based row 2 onward (the rows above are assumed to be header rows)
df_cleaned = df_raw.iloc[2:, [0, 1, 2]].reset_index(drop=True)

# Rename columns
df_cleaned.columns = ["Harden_PTS", "LeBron_PTS", "Westbrook_PTS"]

# Convert to numeric
df_cleaned = df_cleaned.apply(pd.to_numeric, errors='coerce')

# Compute first-order autocorrelation for each player's points scored
autocorrelations = df_cleaned.apply(lambda x: x.autocorr(lag=1))

# Test each autocorrelation with the Pearson-r t statistic.
# Approximation: uses n - 2 degrees of freedom, although a lag-1 series has only n - 1 pairs.
n_games = len(df_cleaned)
t_values = autocorrelations * ((n_games - 2) ** 0.5) / ((1 - autocorrelations ** 2) ** 0.5)
p_values = pd.Series(stats.t.sf(abs(t_values), df=n_games-2) * 2, index=autocorrelations.index)  # Re-attach the player labels

# Print results
print("First-order Autocorrelation Coefficients:")
print(autocorrelations)
print("\nP-values:")
print(p_values)

# Interpretation
for player in autocorrelations.index:
    if p_values.loc[player] < 0.05:
        print(f"{player} has a statistically significant autocorrelation.")
    else:
        print(f"{player} does not have a statistically significant autocorrelation.")

First-order Autocorrelation Coefficients:
Harden_PTS       0.048652
LeBron_PTS       0.001120
Westbrook_PTS    0.152325
dtype: float64

P-values:
Harden_PTS       0.650723
LeBron_PTS       0.991687
Westbrook_PTS    0.154143
dtype: float64
Harden_PTS does not have a statistically significant autocorrelation.
LeBron_PTS does not have a statistically significant autocorrelation.
Westbrook_PTS does not have a statistically significant autocorrelation.


Interpretation:
None of the three players has a statistically significant lag-1 autocorrelation, so there is no evidence that their scoring in one game depends on their scoring in the previous game.
Russell Westbrook has the highest autocorrelation of the three (0.152), which makes him nominally the most "streaky", but the correlation is weak and not statistically significant (p = 0.154).
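To make the statistic concrete: a first-order autocorrelation is the Pearson correlation between each game and the game before it, which is also how pandas documents Series.autocorr. The dependency-free sketch below shows that pairing. The helper names and the points list are made up for illustration and are not taken from Dataset 2.8.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def lag1_autocorr(series):
    """Correlate games 1..n-1 with games 2..n (n - 1 overlapping pairs)."""
    return pearson(series[:-1], series[1:])

points = [30, 25, 41, 18, 33, 27, 36, 22]  # made-up game-by-game point totals
r1 = lag1_autocorr(points)
n_pairs = len(points) - 1                  # one pair is lost to the shift
print(f"lag-1 autocorrelation: {r1:.3f} from {n_pairs} pairs")
```

Because the shift drops one observation, a series of n games contributes only n - 1 pairs to the correlation.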

5.8:

In [6]:
import pandas as pd
import statsmodels.api as sm
from scipy.stats import pearsonr

# Load the dataset
file_path = "Dataset 5.21.xlsx"
df = pd.read_excel(file_path)

# Function to compute partial correlation manually
def partial_correlation(df, x, y, covar):
    X = sm.add_constant(df[covar])  # Add constant for intercept
    model_x = sm.OLS(df[x], X).fit()
    residuals_x = model_x.resid
    model_y = sm.OLS(df[y], X).fit()
    residuals_y = model_y.resid
    partial_corr, p_value = pearsonr(residuals_x, residuals_y)
    return partial_corr, p_value

# Compute Pearson correlations
corr_par4_drive, p_val_par4_drive = pearsonr(df["Par4"], df["Drive"])
corr_par5_drive, p_val_par5_drive = pearsonr(df["Par5"], df["Drive"])

# Compute Partial correlations (controlling for driving accuracy)
partial_corr_par4, p_val_partial_par4 = partial_correlation(df, "Par4", "Drive", "Acc")
partial_corr_par5, p_val_partial_par5 = partial_correlation(df, "Par5", "Drive", "Acc")

# Print results
print(f"Pearson Correlation (Par 4 & Drive): {corr_par4_drive:.3f}, p-value: {p_val_par4_drive:.3e}")
print(f"Partial Correlation (Par 4 & Drive, controlling for Acc): {partial_corr_par4:.3f}, p-value: {p_val_partial_par4:.3e}\n")
print(f"Pearson Correlation (Par 5 & Drive): {corr_par5_drive:.3f}, p-value: {p_val_par5_drive:.3e}")
print(f"Partial Correlation (Par 5 & Drive, controlling for Acc): {partial_corr_par5:.3f}, p-value: {p_val_partial_par5:.3e}")

Pearson Correlation (Par 4 & Drive): -0.286, p-value: 5.594e-05
Partial Correlation (Par 4 & Drive, controlling for Acc): -0.498, p-value: 1.637e-13

Pearson Correlation (Par 5 & Drive): -0.523, p-value: 5.933e-15
Partial Correlation (Par 5 & Drive, controlling for Acc): -0.593, p-value: 1.094e-19


Interpretation:
Negative correlations indicate that players with longer drive distances tend to have lower (better) scores on both Par 4 and Par 5 holes.
The relationship is stronger for Par 5 scores than for Par 4 scores, suggesting that driving distance has a greater impact on scoring on longer holes.
When driving accuracy is controlled for, both correlations become stronger (Par 4: -0.286 to -0.498; Par 5: -0.523 to -0.593). Accuracy was therefore masking part of the relationship rather than explaining it, which is consistent with longer hitters tending to be less accurate.
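With a single control variable, the residual-based partial correlation used above equals the closed-form expression (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below checks that equivalence on synthetic data, using only the standard library. All names and the simulated values are assumptions for illustration; none of them come from Dataset 5.21.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(ys, zs):
    """Residuals from a least-squares regression of ys on zs with an intercept."""
    mz, my = sum(zs) / len(zs), sum(ys) / len(ys)
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]

def partial_corr_residuals(xs, ys, zs):
    """Correlate what is left of x and y after removing the linear effect of z."""
    return pearson(residuals(xs, zs), residuals(ys, zs))

def partial_corr_formula(xs, ys, zs):
    """Closed-form first-order partial correlation."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic stand-ins: acc ~ accuracy, dist ~ distance, score ~ scoring average
random.seed(0)
acc = [random.gauss(0, 1) for _ in range(500)]
dist = [-0.7 * a + random.gauss(0, 1) for a in acc]
score = [-0.5 * d - 0.6 * a + random.gauss(0, 1) for d, a in zip(dist, acc)]

via_residuals = partial_corr_residuals(score, dist, acc)
via_formula = partial_corr_formula(score, dist, acc)
print(f"raw r: {pearson(score, dist):.3f}")
print(f"partial r (residuals): {via_residuals:.3f}")
print(f"partial r (formula):   {via_formula:.3f}")
```

The synthetic data are built so that distance and accuracy are negatively related while both lower the score, which is the kind of setup in which controlling for accuracy can strengthen the distance-score correlation.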

5.9:

In [7]:
import pandas as pd
import statsmodels.api as sm
from scipy.stats import pearsonr

# Load the datasets
pga_file = "Dataset 5.21.xlsx"  # PGA Tour data
lpga_file = "Dataset 5.22.xlsx"  # LPGA Tour data
pga_df = pd.read_excel(pga_file)
lpga_df = pd.read_excel(lpga_file)

# Function to compute partial correlation manually
def partial_correlation(df, x, y, covar):
    X = sm.add_constant(df[covar])  # Add constant for intercept
    model_x = sm.OLS(df[x], X).fit()
    residuals_x = model_x.resid
    model_y = sm.OLS(df[y], X).fit()
    residuals_y = model_y.resid
    partial_corr, p_value = pearsonr(residuals_x, residuals_y)
    return partial_corr, p_value

# Compute Pearson and Partial correlations for both tours
def compute_correlations(df, distance_col):
    corr_par4, p_val_par4 = pearsonr(df["Par4"], df[distance_col])
    corr_par5, p_val_par5 = pearsonr(df["Par5"], df[distance_col])
    partial_corr_par4, p_val_partial_par4 = partial_correlation(df, "Par4", distance_col, "Acc")
    partial_corr_par5, p_val_partial_par5 = partial_correlation(df, "Par5", distance_col, "Acc")
    return (corr_par4, p_val_par4, partial_corr_par4, p_val_partial_par4,
            corr_par5, p_val_par5, partial_corr_par5, p_val_partial_par5)

# PGA results
pga_results = compute_correlations(pga_df, "Drive")
# LPGA results
lpga_results = compute_correlations(lpga_df, "Dist")

# Print results
print("PGA Tour Results:")
print(f"Par 4 - Pearson: {pga_results[0]:.3f}, p-value: {pga_results[1]:.3e}")
print(f"Par 4 - Partial: {pga_results[2]:.3f}, p-value: {pga_results[3]:.3e}")
print(f"Par 5 - Pearson: {pga_results[4]:.3f}, p-value: {pga_results[5]:.3e}")
print(f"Par 5 - Partial: {pga_results[6]:.3f}, p-value: {pga_results[7]:.3e}\n")

print("LPGA Tour Results:")
print(f"Par 4 - Pearson: {lpga_results[0]:.3f}, p-value: {lpga_results[1]:.3e}")
print(f"Par 4 - Partial: {lpga_results[2]:.3f}, p-value: {lpga_results[3]:.3e}")
print(f"Par 5 - Pearson: {lpga_results[4]:.3f}, p-value: {lpga_results[5]:.3e}")
print(f"Par 5 - Partial: {lpga_results[6]:.3f}, p-value: {lpga_results[7]:.3e}")

PGA Tour Results:
Par 4 - Pearson: -0.286, p-value: 5.594e-05
Par 4 - Partial: -0.498, p-value: 1.637e-13
Par 5 - Pearson: -0.523, p-value: 5.933e-15
Par 5 - Partial: -0.593, p-value: 1.094e-19

LPGA Tour Results:
Par 4 - Pearson: -0.219, p-value: 4.729e-03
Par 4 - Partial: -0.644, p-value: 9.679e-21
Par 5 - Pearson: -0.387, p-value: 2.718e-07
Par 5 - Partial: -0.569, p-value: 1.513e-15


Key Takeaways:
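Controlling for accuracy changes the Par 4 correlation far more on the LPGA Tour (-0.219 to -0.644) than on the PGA Tour (-0.286 to -0.498). This suggests accuracy is more entangled with distance and scoring in the LPGA, although the accuracy-scoring correlation itself was not computed here.
Drive distance is related to scoring on both tours, and the Par 5 relationship is stronger for the PGA (Pearson -0.523, partial -0.593) than for the LPGA (Pearson -0.387, partial -0.569).
On both tours, the correlation between drive distance and scoring is stronger after controlling for accuracy.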
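Whether the PGA and LPGA correlations actually differ can be checked with a Fisher z-test for two independent correlations. The sketch below is one way to do that with the standard library. The function name is made up here, and the sample sizes N_PGA and N_LPGA are hypothetical placeholders that should be replaced with len(pga_df) and len(lpga_df).

```python
import math
from statistics import NormalDist

def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference between two independent Pearson r values."""
    z1, z2 = math.atanh(r1), math.atanh(r2)           # Fisher transform of each r
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))   # standard error of z1 - z2
    z = (z1 - z2) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))        # two-sided normal p-value
    return z, p

# Hypothetical sample sizes; substitute the real row counts of each dataset
N_PGA, N_LPGA = 190, 165

# Par 5 Pearson correlations taken from the output above
z_stat, p_two_sided = compare_independent_correlations(-0.523, N_PGA, -0.387, N_LPGA)
print(f"z = {z_stat:.3f}, two-sided p = {p_two_sided:.3f}")
```

A large p-value here would mean the gap between the two tours' correlations could plausibly be sampling noise, so the comparison in the takeaways should be read with that in mind.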

5.10:

In [8]:
import pandas as pd
import statsmodels.api as sm
from scipy.stats import pearsonr

# Load the dataset
file_path = "Dataset 5.20.xlsx"
df = pd.read_excel(file_path)

# Compute X (difference in yards gained) and Y (point difference)
df["X"] = df["YH"] - df["YV"]
df["Y"] = df["H"] - df["V"]

# Compute Z (difference in turnovers)
df["Z"] = df["TOH"] - df["TOV"]

# Compute Pearson correlation between X and Y
correlation_xy, p_value_xy = pearsonr(df["X"], df["Y"])

# Function to compute partial correlation manually
def partial_correlation(df, x, y, covar):
    X = sm.add_constant(df[covar])  # Add constant for intercept
    model_x = sm.OLS(df[x], X).fit()
    residuals_x = model_x.resid
    model_y = sm.OLS(df[y], X).fit()
    residuals_y = model_y.resid
    partial_corr, p_value = pearsonr(residuals_x, residuals_y)
    return partial_corr, p_value

# Compute Partial correlation (controlling for Z)
partial_corr_xy_z, p_value_partial_xy_z = partial_correlation(df, "X", "Y", "Z")

# Print results
print(f"Pearson Correlation (X & Y): {correlation_xy:.3f}, p-value: {p_value_xy:.3e}")
print(f"Partial Correlation (X & Y, controlling for Z): {partial_corr_xy_z:.3f}, p-value: {p_value_partial_xy_z:.3e}")

Pearson Correlation (X & Y): 0.578, p-value: 3.070e-24
Partial Correlation (X & Y, controlling for Z): 0.731, p-value: 5.731e-44


Interpretation:
The positive Pearson correlation (0.578) suggests that teams that gain more yards than their opponents tend to have a higher point differential.
When turnover differential is controlled for, the partial correlation rises to 0.731, which indicates that turnovers account for variation in point differential that was obscuring the yardage effect.
This suggests that both yards gained and turnovers are important predictors of game outcomes, but once turnovers are accounted for, the relationship between yardage and scoring difference becomes even stronger.