# Logistic Regression Analysis of My Performance at English Landing Park DGC

Using logistic regression to better understand how the score on each hole relates to my overall score at English Landing Park.

Load and Filter the Data

In [1]:
import pandas as pd
# Load the dataset
data = pd.read_csv('MJ_Scorecards_2024.csv')

# Filter for the relevant players and course
filtered_data = data[
    ((data['PlayerName'] == 'Par') | (data['PlayerName'] == 'Matthew Jones')) & 
    (data['CourseName'] == 'English Landing Park DGC')
]

# Drop unnecessary columns (Hole10 to Hole18)
filtered_data = filtered_data.drop(columns=[f'Hole{i}' for i in range(10, 19)])

# Display the first few rows of the filtered data
filtered_data.head()

| | PlayerName | CourseName | LayoutName | StartDate | EndDate | Total | +/- | RoundRating | Hole1 | Hole2 | Hole3 | Hole4 | Hole5 | Hole6 | Hole7 | Hole8 | Hole9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Par | English Landing Park DGC | Main | 2024-10-10 1818 | 2024-10-10 1852 | 28 | NaN | NaN | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 3 |
| 4 | Matthew Jones | English Landing Park DGC | Main | 2024-10-10 1818 | 2024-10-10 1852 | 29 | 1.0 | 189.0 | 3 | 3 | 4 | 3 | 4 | 3 | 4 | 3 | 2 |
| 7 | Par | English Landing Park DGC | Main | 2024-10-08 1823 | 2024-10-08 1856 | 28 | NaN | NaN | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 3 |
| 8 | Matthew Jones | English Landing Park DGC | Main | 2024-10-08 1823 | 2024-10-08 1856 | 31 | 3.0 | 172.0 | 4 | 3 | 3 | 3 | 4 | 3 | 4 | 4 | 3 |
| 9 | Par | English Landing Park DGC | Main | 2024-10-07 1820 | 2024-10-07 1930 | 28 | NaN | NaN | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 3 |


Only the first 9 holes are included so that rounds are consistent. English Landing is a 9-hole course, and I sometimes play it twice to make an 18-hole round.
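
Because some rounds are 18 holes, the `Total` and `+/-` columns may not describe the same nine holes used here (that is an assumption; I haven't checked how the export fills those columns for double rounds). A minimal sketch of recomputing a front-nine score relative to par straight from the hole columns, using three of the rows shown in the output above. The column names `Front9Total` and `Front9PlusMinus` are made up for this example.

```python
import pandas as pd

hole_cols = [f'Hole{i}' for i in range(1, 10)]

# Rows copied from the head() output above
sample = pd.DataFrame(
    [
        ['Par', 3, 3, 4, 3, 3, 3, 3, 3, 3],
        ['Matthew Jones', 3, 3, 4, 3, 4, 3, 4, 3, 2],
        ['Matthew Jones', 4, 3, 3, 3, 4, 3, 4, 4, 3],
    ],
    columns=['PlayerName'] + hole_cols,
)

# Course par for the nine holes, taken from the 'Par' row
course_par = sample.loc[sample['PlayerName'] == 'Par', hole_cols].iloc[0].sum()

# Keep only real rounds and compute the front-nine score relative to par
# (Front9Total and Front9PlusMinus are hypothetical column names)
rounds = sample[sample['PlayerName'] != 'Par'].copy()
rounds['Front9Total'] = rounds[hole_cols].sum(axis=1)
rounds['Front9PlusMinus'] = rounds['Front9Total'] - course_par

print(rounds[['PlayerName', 'Front9Total', 'Front9PlusMinus']])
```

For these two rounds the recomputed values agree with the `Total` and `+/-` columns in the output, which is what I'd expect for single 9-hole rounds.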

In [7]:
# Select independent variables (Hole scores) and dependent variable (+/-)
X = filtered_data[[f'Hole{i}' for i in range(1, 10)]]  # Hole1 to Hole9
y = filtered_data['+/-']  # Dependent variable

# Convert +/- to a binary target for logistic regression:
# 1 if the round deviated from par, 0 if it finished at even par.
# Caution: the 'Par' rows have NaN in '+/-', and NaN != 0 is True, so they are labeled 1.
y = (y != 0).astype(int)

from sklearn.model_selection import train_test_split

# Train-test split for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((76, 9), (20, 9), (76,), (20,))
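
Before splitting, it's worth checking what the binary target actually contains. The sketch below uses the `+/-` values from the five rows displayed earlier. The names `labels`, `has_both_classes`, and `over_par` are illustrative, and the "over par" target is just one possible alternative, not something this analysis used.

```python
import numpy as np
import pandas as pd

# '+/-' values from the five rows displayed by head(); 'Par' rows are NaN
plus_minus = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan], name='+/-')

# Same rule as the cell above: NaN != 0 evaluates to True, so NaN becomes 1
labels = (plus_minus != 0).astype(int)
print(labels.value_counts())

# A classifier needs both classes; this is False for these five rows
has_both_classes = labels.nunique() == 2
print('Both classes present:', has_both_classes)

# One possible alternative (an assumption): drop the NaN 'Par' rows first,
# then label rounds that finished over par
over_par = (plus_minus.dropna() > 0).astype(int)
print(over_par.tolist())
```

If `has_both_classes` comes out False on the real training labels, passing `stratify=y` to `train_test_split` won't help; the target definition itself needs to change.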

Standardizing the independent variables puts the coefficients on a comparable scale and helps the optimizer converge, so the hole scores are scaled with `StandardScaler`.

In [4]:
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Check the standardized values
X_train_scaled[:5]

array([[-0.67995731,  0.14664712,  0.02503915, -0.54978343, -0.80527431,
        -0.66937577, -0.74118569, -0.56077215, -0.10432022],
       [-0.67995731, -2.63964807, -1.87793657, -0.54978343,  0.39474231,
        -0.66937577, -0.74118569, -0.56077215, -0.10432022],
       [ 1.10199978,  0.14664712,  0.02503915, -0.54978343, -0.80527431,
         1.21479306,  1.01913032, -0.56077215, -0.10432022],
       [-0.67995731,  0.14664712,  0.02503915, -0.54978343, -0.80527431,
        -0.66937577, -0.74118569, -0.56077215, -0.10432022],
       [-0.67995731,  0.14664712,  0.02503915, -0.54978343, -0.80527431,
        -0.66937577, -0.74118569, -0.56077215, -0.10432022]])
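
For reference, standardization replaces each score x with z = (x - mean) / std, computed per column, and `StandardScaler` uses the population standard deviation (`ddof=0`). A small NumPy sketch of the same arithmetic on the Hole1, Hole5, and Hole9 scores from three of the rows shown earlier. It is only meant to show the formula, and unlike a real pipeline it would divide by zero on a column where every score is identical.

```python
import numpy as np

# Hole1, Hole5, Hole9 scores from three rows shown earlier (Par, then two rounds)
scores = np.array([
    [3, 3, 3],
    [3, 4, 2],
    [4, 4, 3],
], dtype=float)

col_mean = scores.mean(axis=0)
col_std = scores.std(axis=0)  # ddof=0, the population standard deviation

# z-score each column: subtract the column mean, divide by the column std
z = (scores - col_mean) / col_std
print(np.round(z, 4))
```

After scaling, every column has mean 0 and standard deviation 1, so a coefficient describes the effect of a one-standard-deviation change in that hole's score.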

Train the Model

In [5]:
import statsmodels.api as sm

# Add a constant term for the intercept
X_train_scaled = sm.add_constant(X_train_scaled)
X_test_scaled = sm.add_constant(X_test_scaled)

# Train the logistic regression model
model = sm.Logit(y_train, X_train_scaled)
result = model.fit()

# Display the summary of the model
result.summary()

         Current function value: 0.000000
         Iterations: 35


  return 1 - self.llf/self.llnull


| | | | |
|---|---|---|---|
| Dep. Variable: | +/- | No. Observations: | 76 |
| Model: | Logit | Df Residuals: | 66 |
| Method: | MLE | Df Model: | 9 |
| Date: | Sat, 12 Oct 2024 | Pseudo R-squ.: | inf |
| Time: | 15:38:14 | Log-Likelihood: | -2.7193e-10 |
| converged: | False | LL-Null: | 0.0 |
| Covariance Type: | nonrobust | LLR p-value: | 1.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 26.3562 | 6.06e+04 | 0.000 | 1.000 | -1.19e+05 | 1.19e+05 |
| x1 | 5.459e-15 | 8.11e+04 | 6.73e-20 | 1.000 | -1.59e+05 | 1.59e+05 |
| x2 | 1.107e-15 | 6.57e+04 | 1.68e-20 | 1.000 | -1.29e+05 | 1.29e+05 |
| x3 | 5.449e-15 | 6.75e+04 | 8.07e-20 | 1.000 | -1.32e+05 | 1.32e+05 |
| x4 | -1.69e-14 | 8.13e+04 | -2.08e-19 | 1.000 | -1.59e+05 | 1.59e+05 |
| x5 | -3.777e-15 | 1.13e+05 | -3.35e-20 | 1.000 | -2.21e+05 | 2.21e+05 |
| x6 | 1.435e-14 | 7.96e+04 | 1.8e-19 | 1.000 | -1.56e+05 | 1.56e+05 |
| x7 | 1.457e-14 | 9.04e+04 | 1.61e-19 | 1.000 | -1.77e+05 | 1.77e+05 |
| x8 | -1.012e-14 | 1.02e+05 | -9.92e-20 | 1.000 | -2e+05 | 2e+05 |
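
Two numbers in the summary are worth unpacking: `LL-Null: 0.0` and `Pseudo R-squ.: inf`. The null (intercept-only) model predicts the same probability p, the share of 1s, for every row, so its log-likelihood is the sum of y·ln(p) + (1 - y)·ln(1 - p). That sum is exactly 0 only when every label is the same class. The pseudo R-squared is `1 - llf/llnull` (the expression visible in the warning above), so dividing by 0 produces `inf`. A small NumPy sketch; `null_log_likelihood` is an illustrative helper, not a statsmodels function, and the all-ones array is what `LL-Null: 0.0` implies about the training labels rather than something I printed.

```python
import numpy as np

def null_log_likelihood(y):
    """Log-likelihood of an intercept-only logistic model."""
    y = np.asarray(y, dtype=float)
    p = y.mean()  # fitted probability of class 1
    if p == 0.0 or p == 1.0:
        return 0.0  # only one class: every row is predicted perfectly
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# 76 training rows, assumed all labeled 1 (implied by LL-Null: 0.0)
ll_single_class = null_log_likelihood(np.ones(76))
# Made-up labels with both classes, for contrast
ll_mixed = null_log_likelihood([1, 1, 0, 1])

# Values reported in the summary above
llf, llnull = np.float64(-2.7193e-10), np.float64(0.0)
with np.errstate(divide='ignore'):
    pseudo_r2 = 1 - llf / llnull

print(ll_single_class, round(ll_mixed, 4), pseudo_r2)
```

With a single class there is nothing for the hole scores to explain, which is also why the optimizer keeps pushing the intercept upward and never converges.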


Analyzing Coefficients

In [6]:
# Get the coefficients to determine feature importance
coefficients = result.params[1:]  # Exclude the constant term
feature_importance = pd.DataFrame({
    'Hole': X.columns,
    'Coefficient': coefficients
}).sort_values(by='Coefficient', ascending=False)

# Display the feature importance
feature_importance

| | Hole | Coefficient |
|---|---|---|
| x7 | Hole7 | 1.457e-14 |
| x6 | Hole6 | 1.434894e-14 |
| x1 | Hole1 | 5.45868e-15 |
| x3 | Hole3 | 5.448732e-15 |
| x2 | Hole2 | 1.106991e-15 |
| x5 | Hole5 | -3.776801e-15 |
| x8 | Hole8 | -1.011714e-14 |
| x9 | Hole9 | -1.281963e-14 |
| x4 | Hole4 | -1.690468e-14 |


### Interpreting the Coefficients

All nine coefficients are on the order of 1e-14 or smaller, which is zero to within floating-point precision. Combined with `converged: False` and standard errors around 1e+05 in the model summary, the signs and ordering below are numerical noise rather than evidence about any hole. Note also that the target is the binary "deviated from par" label, not the final score itself.

- **Hole 7 (1.457000e-14)**: The largest positive value, but still effectively zero.
- **Hole 6 (1.434894e-14)**: Practically identical to Hole 7, and just as close to zero.
- **Hole 1 (5.458680e-15)**: Nominally positive, effectively zero.
- **Hole 3 (5.448732e-15)**: Nearly the same value as Hole 1.
- **Hole 2 (1.106991e-15)**: The smallest magnitude of the nine.
- **Hole 5 (-3.776801e-15)**: Nominally negative, effectively zero.
- **Hole 8 (-1.011714e-14)**: Negative sign, same story.
- **Hole 9 (-1.281963e-14)**: Slightly larger in magnitude than Holes 5 and 8, but still around 1e-14.
- **Hole 4 (-1.690468e-14)**: The largest magnitude in the table, yet its standard error in the summary is 8.13e+04, so it can't be told apart from zero.
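
One way to read a logistic regression coefficient is as an odds ratio, exp(coef): the factor by which the odds of the outcome change for a one-standard-deviation increase in that hole's score (the inputs were standardized). A quick sketch using the coefficients from the table above:

```python
import numpy as np

# Coefficients copied from the feature importance table
coefficients = {
    'Hole7': 1.457e-14, 'Hole6': 1.434894e-14, 'Hole1': 5.45868e-15,
    'Hole3': 5.448732e-15, 'Hole2': 1.106991e-15, 'Hole5': -3.776801e-15,
    'Hole8': -1.011714e-14, 'Hole9': -1.281963e-14, 'Hole4': -1.690468e-14,
}

# Odds ratio = exp(coefficient); 1.0 means no change in the odds
odds_ratios = {hole: float(np.exp(coef)) for hole, coef in coefficients.items()}
largest_shift = max(abs(ratio - 1.0) for ratio in odds_ratios.values())

for hole, ratio in odds_ratios.items():
    print(f'{hole}: odds ratio = {ratio:.6f}')
print(f'Largest deviation from 1: {largest_shift:.2e}')
```

Every odds ratio is 1 to many decimal places, which is another way of saying the fitted model assigns no effect to any hole.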

### Overall Breakdown

- **Positive Coefficients**: Holes 7, 6, 1, 3, and 2 have positive signs. In a working model that would mean a higher score on those holes raises the odds of a round being labeled 1 (a deviation from par). Here the values are around 1e-14, so the sign means nothing.
- **Negative Coefficients**: Holes 5, 8, 9, and 4 have negative signs, with Hole 4 the largest in magnitude. Every p-value reported in the summary is 1.000 and every confidence interval spans more than ±1e+05, so Hole 4 doesn't actually stand out.
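
The 95% intervals in the model summary are coef ± 1.96 × std err, which makes it easy to see how wide they are relative to the coefficients. A sketch using the `const` and `x4` (Hole 4) rows from the summary; the 1.96 multiplier is the usual normal approximation.

```python
# (coef, std err) pairs copied from the model summary
rows = {
    'const': (26.3562, 6.06e+04),
    'x4': (-1.69e-14, 8.13e+04),
}
Z_95 = 1.96  # normal quantile for a two-sided 95% interval

intervals = {}
for name, (coef, std_err) in rows.items():
    half_width = Z_95 * std_err
    intervals[name] = (coef - half_width, coef + half_width)
    z_stat = coef / std_err  # Wald z statistic
    low, high = intervals[name]
    print(f'{name}: z = {z_stat:.2e}, 95% CI = [{low:.2e}, {high:.2e}]')
```

The interval for Hole 4 is about thirteen orders of magnitude wider than the coefficient itself, so the estimate carries no usable information.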

### Final Thoughts

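The results don't single out any hole: every coefficient is zero to floating-point precision. That isn't because the holes don't matter, though. `LL-Null: 0.0` means the intercept-only model already fits the training labels perfectly, which only happens when every training label is the same class, and the large positive intercept (26.36) says that class is 1. The likely cause is the target definition: the "Par" rows have NaN in `+/-`, `NaN != 0` evaluates to True so they all get labeled 1, and apparently none of my rounds in the training set finished at exactly even par. Hole 4 shouldn't get extra attention on the strength of this fit. The next step is to drop the "Par" rows and pick a target with both classes present (for example, over par vs. par or better), then refit.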