# Predicting Match Attendance using Linear Regression

**What is Linear Regression?**  
Linear regression models the relationship between one continuous dependent variable (target) and one or more independent variables (features), assuming a linear relationship.
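
Written as an equation, the model described above takes the following form. The symbols are standard textbook notation and are assumptions for illustration; they are not names used in this notebook's code.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```

Here $\hat{y}$ is the predicted attendance, $x_1, \dots, x_p$ are the features, $\beta_0$ is the intercept, and each $\beta_j$ is the coefficient that the model estimates for feature $x_j$.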

**Why use it?**  
- Predicts a continuous outcome (here: attendance as a percentage of stadium capacity).
- Provides interpretable coefficients indicating the average change in attendance per unit change in each feature.
- Helps quantify and compare the influence of different factors.

**Outputs:**  
- Predicted values for the target variable.
- RMSE (Root Mean Squared Error) for evaluating prediction accuracy.
- Model coefficients showing the magnitude and direction of each feature’s effect on the target value.

**Why analyze weather and match-related effects on attendance?**
Understanding these drivers supports data-driven decisions:
1. Stadium operations: Adjust staffing, concessions, and security based on expected turnout under varying weather.
2. Facility preparedness: Plan for weather-related infrastructure needs (e.g., covering stands, heating) to maintain fan comfort and safety.
3. Marketing & promotions: Time special offers or dynamic pricing on days with predicted low attendance due to adverse conditions.

## 1. Imports and Data Sources

Import necessary Python libraries.

In [182]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## 2. Build the Combined Dataset

Load your datasets, and merge the important tables into a single DataFrame for easier feature engineering and modeling.

In [183]:
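# Replace these file names with your real data.
# By default, pandas reads the string 'None' as a missing value (NaN). Passing
# keep_default_na=False with an empty na_values list turns that off, so 'None' stays
# a regular category in the 'precipitation' column. Note that this disables the
# default NaN parsing for every column of the weather table.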
df_match        = pd.read_csv('match_table.csv')
df_weather      = pd.read_csv('weather_table.csv', na_values=[], keep_default_na=False)
df_stadium      = pd.read_csv('stadium_table.csv')

df = (
    df_match
      .merge(df_weather, on='match_id', how='left')
      .merge(df_stadium[['stadium_id', 'capacity']], on='stadium_id', how='left')
)
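
A left merge silently keeps matches that have no weather record and can duplicate rows if a key is not unique. The following is a minimal sketch of how such problems could be detected, using made-up toy tables (`matches`, `weather`) rather than the real data. It assumes one weather row per match, which may not hold for your tables.

```python
import pandas as pd

# Toy stand-ins for the match and weather tables (values are made up).
matches = pd.DataFrame({"match_id": [1, 2, 3], "stadium_id": [10, 10, 20]})
weather = pd.DataFrame({"match_id": [1, 2], "temperature": [12.5, 18.0]})

# validate="one_to_one" raises a MergeError if match_id is duplicated on either side.
# indicator=True adds a "_merge" column that records where each row came from.
merged = matches.merge(
    weather, on="match_id", how="left", validate="one_to_one", indicator=True
)

# Rows marked "left_only" are matches without a weather record.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(merged)} rows after merge, {len(unmatched)} without weather data")
```

For the stadium table, where many matches share one stadium, `validate="many_to_one"` would be the corresponding check.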

## 3. Feature Engineering

Create new features that are not directly available in the raw data.  
Example: Calculate stadium attendance as a percentage of capacity from the absolute numbers.

In [184]:
df['attendance_in_percent'] = (df['attendance'] / df['capacity'] * 100).round(2)

# Preview the first rows as a quick sanity check of the computed values.
print(df[['attendance_in_percent']].head())

   attendance_in_percent
0                  66.15
1                  94.10
2                  52.10
3                  72.27
4                  87.02
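
Looking at the first rows does not reveal problems further down the table. A possible extra check, sketched here on made-up numbers, flags rows where the percentage exceeds 100 or is not finite (for example because a capacity of 0 was recorded). The toy DataFrame and the threshold of 100 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal match, an over-capacity record, and a zero capacity.
toy = pd.DataFrame({
    "attendance": [30000, 52000, 100],
    "capacity":   [40000, 50000, 0],
})
pct = (toy["attendance"] / toy["capacity"] * 100).round(2)

# Flag values that are infinite/NaN or above 100 percent for manual review.
suspicious = toy[~np.isfinite(pct) | (pct > 100)]
print(suspicious)
```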


## 4. Select and Split Features and Target

Choose the predictor variables and the target variable.

Create a separate copy of the selected columns to avoid modifying the original DataFrame `df`. This is important when performing transformations (e.g., scaling) that shouldn't affect `df`.

Finally, separate features (X) and target (y).

In [185]:
# Change 'model_columns' to include whatever predictors you need.
model_columns = [
    "temperature",
    "conditions",
    "humidity",
    "precipitation",
    "wind_speed_in_km_h",
    "attendance_in_percent"   # target: in percent.
]

model_df = df[model_columns].copy()

X = model_df.drop(columns="attendance_in_percent")
y = model_df["attendance_in_percent"]

## 5. Train/Test Split

Split the data into training and testing sets to train the model on one portion and evaluate its performance on unseen data.

You can adjust `test_size` to control how much data goes into the test set. Common values range from 0.2 to 0.3 (i.e., 20–30% of the data used for testing).
The `random_state` ensures reproducibility: using the same value will yield the same split every time. You can choose any integer value.

In [186]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17
)
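
The effect of `random_state` can be checked directly. This small sketch uses synthetic arrays (`X_demo`, `y_demo`, both hypothetical) and calls `train_test_split` twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features
y_demo = np.arange(10)

# Two calls with the same random_state should produce identical splits.
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=17
)

print(len(y_test_a))                       # number of samples in the test set
print(np.array_equal(y_test_a, y_test_b))  # whether both test sets are identical
```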

## 6. Encode Categorical Features

Many machine learning models cannot handle categorical variables directly.
One-Hot-Encoding transforms a categorical feature (e.g., "conditions" and "precipitation") into multiple binary (0/1) features — one for each category, indicating whether a sample belongs to that category.
This allows the model to learn a separate coefficient for each category.

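When One-Hot-Encoding is applied here, one category per feature is dropped and becomes the baseline (reference) category. Which one is dropped is set explicitly through the `drop` argument of the encoder. Dropping one category removes a redundant column, because the dropped category is already implied when all remaining dummies are 0. All coefficients of the remaining categories are interpreted relative to this baseline.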

Repeat the following steps for every categorical column you have among your predictors.

In [187]:
# Show all categories to decide which to drop.
print("All conditions in the dataset:")
print(np.sort(df['conditions'].unique()))

print("All precipitations in the dataset:")
print(np.sort(df['precipitation'].unique()))

# Apply One-Hot-Encoding to the 'conditions' and 'precipitation' columns.
# The 'drop' argument sets the reference category for each column, in the same order
# as the columns passed to fit(): 'Sunny' for 'conditions', 'None' for 'precipitation'.
encoder = OneHotEncoder(drop=['Sunny', 'None'], sparse_output=False)
encoder.fit(X_train[['conditions', 'precipitation']])
pos_cols = encoder.get_feature_names_out(['conditions', 'precipitation'])
X_train_cons_pres = pd.DataFrame(
    encoder.transform(X_train[['conditions', 'precipitation']]),
    columns=pos_cols,
    index=X_train.index
)
X_test_cons_pres = pd.DataFrame(
    encoder.transform(X_test[['conditions', 'precipitation']]),
    columns=pos_cols,
    index=X_test.index
)

# Show which category was dropped (baseline) for interpretation later.
# The fitted encoder stores, per column, its categories and the index of the dropped one.
reference_condition, reference_precipitation = (
    cats[idx] for cats, idx in zip(encoder.categories_, encoder.drop_idx_)
)
print(f"\nReference (baseline) condition used for comparison: {reference_condition}")
print(f"\nReference (baseline) precipitation used for comparison: {reference_precipitation}")

# Remove the original 'conditions' and 'precipitation' columns and append the dummies that have just been created.
X_train = pd.concat([X_train.drop(columns=['conditions', 'precipitation']), X_train_cons_pres], axis=1)
X_test  = pd.concat([X_test.drop(columns=['conditions', 'precipitation']), X_test_cons_pres],  axis=1)


All conditions in the dataset:
['Cloudy' 'Fog' 'Rain' 'Snow' 'Sunny' 'Windy']
All precipitations in the dataset:
['Drizzle' 'Hail' 'None' 'Rain' 'Snow']

Reference (baseline) condition used for comparison: Sunny

Reference (baseline) precipitation used for comparison: None


## 7. Scale Numeric Features

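Standardize numeric columns so they have mean = 0 and standard deviation = 1. The scaler is fitted on the training set only and then applied to both sets, so no information from the test set leaks into training.
For ordinary least squares, scaling does not change the predictions. It does put the coefficients of the numeric features on a common scale: each one then describes the effect of a one-standard-deviation change, which makes their magnitudes comparable. Models that use regularization or distances are additionally sensitive to feature scales, because features with larger ranges can dominate the fit.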

In [188]:
numeric_cols = [
    'temperature',
    'humidity',
    'wind_speed_in_km_h'
]

scaler = StandardScaler()
scaler.fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test[numeric_cols]  = scaler.transform(X_test[numeric_cols])
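
For a plain `LinearRegression`, standardizing rescales the coefficients but leaves the predictions unchanged. The sketch below illustrates this on synthetic data; the feature distributions and the names `X_raw`, `X_std` and `y_demo` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on different scales, loosely like temperature and humidity.
X_raw = np.column_stack([rng.normal(15, 8, 200), rng.normal(60, 20, 200)])
y_demo = 50 + 0.9 * X_raw[:, 0] + 0.1 * X_raw[:, 1] + rng.normal(0, 5, 200)

X_std = StandardScaler().fit_transform(X_raw)

# Fit the same model once on raw and once on standardized features.
pred_raw = LinearRegression().fit(X_raw, y_demo).predict(X_raw)
pred_std = LinearRegression().fit(X_std, y_demo).predict(X_std)

# The two sets of predictions agree up to floating-point error.
print(np.allclose(pred_raw, pred_std))
```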

## 8. Train Linear Regression Model

Train linear regression on the training data.

In [189]:
model = LinearRegression()
model.fit(X_train, y_train)

## 9. Make Predictions and Evaluate Performance

Predict the target variable values on the test set and evaluate the model accuracy.

- y_pred contains the predicted continuous values (in this case: attendance percentages) from the model.
- The Mean Squared Error (MSE) is computed to measure the average squared difference between actual and predicted values.
- The Root Mean Squared Error (RMSE) is computed as the square root of MSE, providing an interpretable error metric in the same units as the target variable.

This evaluation helps us understand how well the model predicts attendance and how large the typical prediction errors are.

In [190]:
y_pred = model.predict(X_test)
mse    = mean_squared_error(y_test, y_pred)
rmse   = np.sqrt(mse)
print(f"RMSE     : {rmse:.2f}")


RMSE     : 15.17
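
An RMSE is easier to judge next to a simple baseline, such as always predicting the mean attendance of the training set. The sketch below shows the comparison on made-up numbers; `y_pred_demo` stands in for the output of `model.predict(X_test)` and none of the values come from this notebook's data.

```python
import numpy as np

# Made-up values for illustration only.
y_train_demo = np.array([60.0, 75.0, 90.0, 70.0, 80.0])
y_test_demo = np.array([65.0, 85.0, 72.0])
y_pred_demo = np.array([68.0, 80.0, 75.0])  # stand-in for model predictions


def rmse(actual, predicted):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Baseline: always predict the mean of the training target.
baseline_pred = np.full_like(y_test_demo, y_train_demo.mean())

model_rmse = rmse(y_test_demo, y_pred_demo)
baseline_rmse = rmse(y_test_demo, baseline_pred)
print(f"Model RMSE   : {model_rmse:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

If the model's RMSE is not clearly below the baseline's, the features add little predictive value.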


## 10. Interpret Model Coefficients

Coefficients show how each feature affects the target variable (here: match attendance) in a linear way.

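- Numeric features (e.g., 'temperature', 'humidity') were standardized in step 7, so their coefficients indicate the change in attendance (in percentage points) per one-standard-deviation increase, not per original unit. Example: A coefficient of 0.95 for temperature means that a temperature one standard deviation higher is associated with ~0.95 percentage points higher attendance, assuming all other variables stay constant. Dividing a coefficient by the feature's standard deviation in the training set gives the effect per original unit (e.g., per degree).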
- Categorical features (e.g., 'conditions_Cloudy', 'precipitation_Rain') are interpreted relative to the baseline category (for 'conditions': 'Sunny', and for 'precipitation': 'None'). A negative coefficient means lower attendance compared to the reference, while a positive one means higher attendance.
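
Because the numeric features were standardized before fitting, their coefficients refer to standard deviations rather than original units. A minimal sketch of converting back, using synthetic data and hypothetical names (`scaler_demo`, `model_demo`), could look like this. In the notebook itself, the analogous quantities would be `model.coef_` and `scaler.scale_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic data: attendance rises by 0.5 points per degree, plus noise.
temperature = rng.normal(15, 8, 300).reshape(-1, 1)
attendance = 60 + 0.5 * temperature[:, 0] + rng.normal(0, 3, 300)

scaler_demo = StandardScaler().fit(temperature)
model_demo = LinearRegression().fit(scaler_demo.transform(temperature), attendance)

# Coefficient per standard deviation, as reported by the fitted model.
coef_per_sd = model_demo.coef_[0]
# Dividing by the feature's standard deviation gives the effect per degree.
coef_per_degree = coef_per_sd / scaler_demo.scale_[0]

print(f"per standard deviation: {coef_per_sd:.2f}")
print(f"per degree            : {coef_per_degree:.2f}")
```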

In [191]:
coef_df = pd.DataFrame({
    "feature":    X_train.columns,
    "coefficient": model.coef_
}).sort_values(by="coefficient", ascending=False)

print("\nFeature impacts on attendance:")
print(coef_df)


Feature impacts on attendance:
                  feature  coefficient
3       conditions_Cloudy     5.246340
9      precipitation_Hail     4.619206
5         conditions_Rain     3.612750
8   precipitation_Drizzle     3.574786
7        conditions_Windy     3.265041
4          conditions_Fog     2.752692
6         conditions_Snow     2.647916
11     precipitation_Snow     2.481612
1                humidity     1.767599
0             temperature     0.945195
2      wind_speed_in_km_h    -0.306377
10     precipitation_Rain    -0.524757


## 11. Conclusion

Reference for 'conditions': 'Sunny'.  
Reference for 'precipitation': 'None'.

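- Among the conditions, cloudy weather has the largest positive coefficient (about +5.2 percentage points relative to sunny days).
- Surprisingly, days with rain, wind, fog and snow are also associated with higher attendance than sunny days (about +2.6 to +3.6 percentage points).

- Precipitation shows a similar pattern: hail, drizzle and snow are associated with higher attendance than days without precipitation.
- Rain is the only precipitation type with a negative coefficient, and it is small (about -0.5 percentage points).

- Higher humidity and temperature are associated with slightly higher attendance (about +1.8 and +0.9 percentage points per standard deviation, since these features were standardized), which could reflect a preference for warmer and more humid days.
- Wind speed has a small negative coefficient (about -0.3 percentage points per standard deviation).

These coefficients describe associations in this dataset, not causal effects. With a test RMSE of about 15 percentage points, the weather effects are small compared to the typical prediction error and should be interpreted with caution.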