# 1. Title: Predicting Golf Course Crowdedness: Seasonal Influences of Temperature and Humidity

# 2. Introduction 

We aim to determine how weather affects golf course crowdedness using Samy Baladram's 'Golf Play Dataset Extended.' Specifically, we analyze the relationship between crowdedness and temperature in winter, and between crowdedness and humidity in summer. To do this, we fit linear regression models that quantify the relationship between weather conditions and golf course attendance, and we use them to predict crowdedness levels. The dataset offers golf-related metrics, including weather conditions, seasonality, and time, which lets us study and predict weather-driven crowdedness patterns in two distinct seasons.

# 3. Methods & Results

## 3.1 Methods Overview 
Our dataset `golf_df` consists of various attributes potentially influencing the crowdedness of a golf course. After conducting initial exploratory visualizations, we've pinpointed that crowdedness on the golf course, quantified from 0 to 1, correlates with winter temperatures and summer humidity. Our study will, therefore, concentrate on these seasons, producing two targeted graphs to elucidate the relationships of interest.

- **Data Segmentation:** The dataset is split into two seasonal subsets: `winter_months_df` and `summer_months_df`.

- **Feature Selection:** For the winter model, `Temperature` is identified as a predictive feature, while `Humidity` is selected for the summer model.

- **Model Development:** A Linear Regression algorithm is employed to construct separate predictive models for each seasonal subset.

- **Model Training:** Each model is trained on the training split of its seasonal subset: `Crowdedness` is regressed on `Temperature` for winter and on `Humidity` for summer.

- **Prediction and Evaluation:** The trained models are utilized to forecast crowdedness. Model performance is quantitatively assessed using standard metrics, such as `RMSE` (root mean squared error).
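
The steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the `toy_df` frame, its generated values, and the `fit_season_model` helper are assumptions made for the example. Only the column names and the scikit-learn calls mirror the analysis, and the RMSE here is computed on the same rows used for fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data standing in for the real golf dataset
rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    "Season": np.repeat(["Winter", "Summer"], n // 2),
    "Temperature": rng.uniform(-2, 8, n),
    "Humidity": rng.uniform(40, 95, n),
})
toy_df["Crowdedness"] = np.where(
    toy_df["Season"] == "Winter",
    0.5 + 0.02 * toy_df["Temperature"],
    1.5 - 0.012 * toy_df["Humidity"],
) + rng.normal(0, 0.05, n)


def fit_season_model(df, season, feature):
    """Segment by season, fit a one-feature linear model, return it with its in-sample RMSE."""
    subset = df[df["Season"] == season]
    model = LinearRegression().fit(subset[[feature]], subset["Crowdedness"])
    predicted = model.predict(subset[[feature]])
    rmse = mean_squared_error(subset["Crowdedness"], predicted) ** 0.5
    return model, rmse


winter_model, winter_rmse = fit_season_model(toy_df, "Winter", "Temperature")
summer_model, summer_rmse = fit_season_model(toy_df, "Summer", "Humidity")
print(f"winter slope={winter_model.coef_[0]:.4f}, RMSE={winter_rmse:.4f}")
print(f"summer slope={summer_model.coef_[0]:.4f}, RMSE={summer_rmse:.4f}")
```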

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
alt.__version__


# If your altair version is < 5.1.2, please run the command below in a new cell and restart the kernel
# pip install -U altair

'5.2.0'

## 3.2 Data Preparation and Analysis

### 3.2.1 Read Data

In [2]:
# dataset source: https://www.kaggle.com/datasets/samybaladram/golf-play-extended?select=golf_dataset_long_format_with_text.csv
url = "https://raw.githubusercontent.com/DeeHu/dsci-100-group-project/main/data/golf_dataset_long_format_with_text.csv"
golf_df = pd.read_csv(url)
display(golf_df.head())


Unnamed: 0,Date,Weekday,Holiday,Month,Season,Temperature,Humidity,Windy,Outlook,Crowdedness,EmailCampaign,MaintenanceTask,ID,Play,PlayTimeHour,Review
0,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",A,1,3.1,Absolutely exhilarating first day of the year!...
1,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",B,0,0.0,
2,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",C,0,0.0,
3,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",D,1,3.6,"Ah, the exhilarating dance with the wind today..."
4,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",E,1,3.4,The atmosphere on the course was nothing short...


### 3.2.2 Clean and Wrangle Data
<p>We noted that our dataset is relatively clean, with all the variables being in an appropriate format for analysis. We have transformed the dataset into a tidy format, ensuring each variable is a column, each observation is a row, and each type of observational unit forms a table.</p>

**Extracting Relevant Data**
\
For our specific analysis, we only need the `Month`, `Season`, `Temperature`, `Humidity`, and `Crowdedness` columns.

In [3]:
clean_golf_df = golf_df[['Month', 'Season', 'Temperature', 'Humidity', 'Crowdedness']]
clean_golf_df

Unnamed: 0,Month,Season,Temperature,Humidity,Crowdedness
0,Jan,Winter,3.3,49,0.73
1,Jan,Winter,3.3,49,0.73
2,Jan,Winter,3.3,49,0.73
3,Jan,Winter,3.3,49,0.73
4,Jan,Winter,3.3,49,0.73
...,...,...,...,...,...
7660,Dec,Winter,1.8,43,0.57
7661,Dec,Winter,1.8,43,0.57
7662,Dec,Winter,1.8,43,0.57
7663,Dec,Winter,1.8,43,0.57
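
The output above shows rows 0 to 4 with identical values, and the preview in Section 3.2.1 shows that those rows share one `Date` and differ only in the player `ID`. This suggests the long-format file repeats each day's weather and crowdedness once per player. If the analysis were instead run on one row per day, a sketch could look like the following. The `toy_df` frame is made up for illustration and `one_row_per_day` is a hypothetical helper; it relies on the `Date` column, which the selection above drops.

```python
import pandas as pd

# Hypothetical long-format sample: each day's readings repeated once per player ID
toy_df = pd.DataFrame({
    "Date": ["2021-01-01"] * 3 + ["2021-01-02"] * 3,
    "ID": ["A", "B", "C"] * 2,
    "Season": ["Winter"] * 6,
    "Temperature": [3.3] * 3 + [2.9] * 3,
    "Humidity": [49] * 3 + [51] * 3,
    "Crowdedness": [0.73] * 3 + [0.65] * 3,
})


def one_row_per_day(df):
    """Keep a single row per date for the day-level columns."""
    day_cols = ["Date", "Season", "Temperature", "Humidity", "Crowdedness"]
    return df[day_cols].drop_duplicates(subset="Date").reset_index(drop=True)


daily_df = one_row_per_day(toy_df)
print(daily_df)
```

Working at the day level would keep copies of the same day from landing in both the training and testing splits, at the cost of a smaller sample.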


**Handling Missing Data**
\
Before proceeding, we ensure no missing data in the relevant columns.

In [4]:
missing_data = clean_golf_df.isnull().sum()
missing_data

Month          0
Season         0
Temperature    0
Humidity       0
Crowdedness    0
dtype: int64

Great! There is no missing data in our current data frame.

### 3.2.3 Summary of the dataset:

In [5]:
# Generate summary statistics for the numeric columns only
detailed_summary = clean_golf_df[['Temperature', 'Humidity', 'Crowdedness']].describe()

**Table 1: Summary for Temperature, Humidity, and Crowdedness**
<br>


|       | Temperature | Humidity | Crowdedness |
|-------|-------------|----------|-------------|
| count | 7665.000000 | 7665.000000 | 7665.000000 |
| mean  | 13.435525   | 61.525114  | 0.620721    |
| std   | 8.040172    | 14.429511  | 0.150415    |
| min   | -2.000000   | 18.000000  | 0.000000    |
| 25%   | 6.000000    | 52.000000  | 0.530000    |
| 50%   | 13.800000   | 61.000000  | 0.630000    |
| 75%   | 20.500000   | 72.000000  | 0.720000    |
| max   | 29.500000   | 99.000000  | 1.000000    |

**Table 1** provides a summary of the key statistical measures for each numerical feature in the dataset, including the count of non-missing values, mean, standard deviation (std), minimum (min), lower quartile (25%), median (50%), upper quartile (75%), and maximum (max) values.


### 3.2.4 Dataset Visualization

From **Figure 1.1** and **Figure 1.2**, we observed that:
- `Crowdedness` shows a strong relationship with `Temperature` in the winter months
- `Crowdedness` shows a strong relationship with `Humidity` in the summer months
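
To put a number on how strong such relationships are, one option is the Pearson correlation coefficient computed within each season. The sketch below uses synthetic data: the `toy_df` values are generated for illustration and are not results from the golf dataset. On the real data, the same loop could be run over `clean_golf_df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the golf dataset)
rng = np.random.default_rng(42)
n = 150
toy_df = pd.DataFrame({
    "Season": np.repeat(["Winter", "Summer"], n),
    "Temperature": np.concatenate([rng.uniform(-2, 8, n), rng.uniform(18, 30, n)]),
    "Humidity": rng.uniform(40, 95, 2 * n),
})
# Winter crowdedness driven by temperature, summer by humidity (made-up relationships)
toy_df["Crowdedness"] = np.where(
    toy_df["Season"] == "Winter",
    0.48 + 0.026 * toy_df["Temperature"],
    1.5 - 0.0125 * toy_df["Humidity"],
) + rng.normal(0, 0.05, 2 * n)

# Pearson correlation of each weather feature with Crowdedness, per season
rows = {
    season: {
        "Temperature": group["Temperature"].corr(group["Crowdedness"]),
        "Humidity": group["Humidity"].corr(group["Crowdedness"]),
    }
    for season, group in toy_df.groupby("Season")
}
season_corr = pd.DataFrame.from_dict(rows, orient="index")
print(season_corr.round(2))
```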

In [6]:
alt.data_transformers.disable_max_rows()

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Define a common y-axis title
y_axis_title = "Crowdedness Index"

# Temperature Plot
temp_plot = alt.Chart(clean_golf_df).mark_point().encode(
    x=alt.X('Temperature:Q').title('Temperature (°C)').scale(zero=False),
    y=alt.Y('Crowdedness:Q').title(y_axis_title)
).properties(
    title="Monthly Temperature vs. Crowdedness"
).facet(
    column=alt.Column("Month:N", sort=month_order, title="Month")
).properties(
    title="Figure 1.1 - Relationship between Temperature and Crowdedness by Month"
)

# Humidity Plot
humidity_plot = alt.Chart(clean_golf_df).mark_point().encode(
    x=alt.X('Humidity:Q').title('Humidity (%)').scale(zero=False),
    y=alt.Y('Crowdedness:Q').title(y_axis_title)
).properties(
    title="Monthly Humidity vs. Crowdedness"
).facet(
    column=alt.Column("Month:N", sort=month_order, title="Month")   
).properties(
    title="Figure 1.2 - Relationship between Humidity and Crowdedness by Month"
)

# Display the plots
display(temp_plot)
display(humidity_plot)


Therefore, we filter our data frame into two subsets: one for the winter months (`winter_months_df`) and one for the summer months (`summer_months_df`).

In [7]:
winter_months_df = clean_golf_df[clean_golf_df['Season']=='Winter'].reset_index(drop=True)
summer_months_df = clean_golf_df[clean_golf_df['Season']=='Summer'].reset_index(drop=True)

display(winter_months_df.shape[0])
display(summer_months_df.shape[0])

1869

1974

### 3.2.5 Data Analysis & Visualization

**Split Data into `training` and `testing` data**

In [8]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# set the seed
np.random.seed(1)

golf_winter_train, golf_winter_test = train_test_split(
    winter_months_df, train_size=0.75)

golf_summer_train, golf_summer_test = train_test_split(
    summer_months_df, train_size=0.75)


We use 75% of each seasonal subset to train the model and reserve the remaining 25% for testing. As the missing-data check in Section 3.2.2 showed, neither the winter nor the summer subset contains missing values, so we can proceed directly to the analysis.
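
`train_test_split` also accepts a `random_state` argument, which pins the shuffle for that one call rather than relying on the global NumPy seed. A minimal sketch, using a placeholder frame with the same number of rows as the winter subset (the `toy_winter_df` contents are made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with as many rows as the winter subset (values are synthetic)
rng = np.random.default_rng(1)
toy_winter_df = pd.DataFrame({
    "Temperature": rng.uniform(-2, 8, 1869),
    "Crowdedness": rng.uniform(0.2, 0.9, 1869),
})

# Passing random_state makes each split repeatable on its own
train_a, test_a = train_test_split(toy_winter_df, train_size=0.75, random_state=1)
train_b, test_b = train_test_split(toy_winter_df, train_size=0.75, random_state=1)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```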

In [9]:
winter_temperature_plot = alt.Chart(golf_winter_train, title="Figure 2.1 - Temperature vs. Crowdedness in Winter").mark_point().encode(
    x=alt.X('Temperature').title('Temperature').scale(zero=False),
    y=alt.Y('Crowdedness').title("Crowdedness").scale(zero=False),
).properties(width=600, height=400)

winter_temperature_plot

In [10]:
summer_humidity_plot = alt.Chart(golf_summer_train, title="Figure 2.2 - Humidity vs. Crowdedness in Summer").mark_point().encode(
    x=alt.X('Humidity').title('Humidity').scale(zero=False),
    y='Crowdedness',
).properties(width=600, height=400) 

summer_humidity_plot


#### 3.2.5.1 Data Analysis & Visualization for Winter Months

In [11]:
# Create a LinearRegression object.
lm_winter = LinearRegression()
# fit our linear regression model
X_winter_train = golf_winter_train[['Temperature']]
y_winter_train = golf_winter_train['Crowdedness']

X_winter_test = golf_winter_test[['Temperature']]
y_winter_test = golf_winter_test['Crowdedness']

winter_fit = lm_winter.fit(
    X_winter_train,
    y_winter_train
)
print(winter_fit.coef_)
print(winter_fit.intercept_)

[0.02586892]
0.4806988847733365


In [12]:
# Make predictions
winter_predictions = golf_winter_train.assign(
    predicted = lm_winter.predict(X_winter_train)
)
winter_predictions

Unnamed: 0,Month,Season,Temperature,Humidity,Crowdedness,predicted
707,Jan,Winter,1.3,52,0.40,0.514328
1438,Jan,Winter,1.7,54,0.69,0.524676
1229,Dec,Winter,4.1,51,0.63,0.586761
1432,Jan,Winter,3.5,39,0.61,0.571240
446,Mar,Winter,4.2,41,0.54,0.589348
...,...,...,...,...,...,...
905,Feb,Winter,5.6,52,0.62,0.625565
1791,Mar,Winter,0.9,37,0.50,0.503981
1096,Mar,Winter,5.9,66,0.57,0.633326
235,Feb,Winter,4.6,49,0.46,0.599696


In [13]:
# Calculate RMSE on the training set
lm_rmse = mean_squared_error(
    y_true=winter_predictions['Crowdedness'],
    y_pred=winter_predictions['predicted']
)**(1/2)

lm_rmse

0.10491599621811941

**Visualize linear regression model for winter**

In [14]:
# Prepare data for prediction grid
winter_prediction_grid = winter_months_df[['Temperature']].agg(['min', 'max']).reset_index(drop=True)
winter_preds = winter_prediction_grid.assign(
    predicted=lm_winter.predict(winter_prediction_grid)
)

# Visualization using Altair
all_points_winter = alt.Chart(winter_months_df).mark_circle(opacity=0.4).encode(
    x=alt.X("Temperature")
        .scale(zero=False)
        .title("Temperature (°C)"),
    y=alt.Y("Crowdedness")
        .scale(zero=False)
        .title("Crowdedness")
)


# Line chart of predictions
winter_preds_plot = all_points_winter + alt.Chart(winter_preds).mark_line(
    color="#ff7f0e"
).encode(
    x="Temperature",
    y="predicted",
    tooltip=alt.Tooltip(['Temperature', 'predicted'])
).properties(width=600,
             height=400,
             title="Figure 3.1 - Winter Temperature vs. Crowdedness with Predictions")

winter_preds_plot

The equation for the linear model: `Crowdedness = 0.481 + 0.0259 * Temperature`
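
As a quick check of how the equation is used, the snippet below plugs one temperature into the coefficients printed by the fitted winter model. The helper name `predict_winter_crowdedness` is ours, introduced for illustration. The 5.6 °C input comes from one of the rows in the prediction table above, where the model's output is listed as 0.625565.

```python
# Coefficients copied from the fitted winter model's printed output
WINTER_INTERCEPT = 0.4806988847733365
WINTER_SLOPE = 0.02586892


def predict_winter_crowdedness(temperature_c):
    """Evaluate Crowdedness = intercept + slope * Temperature."""
    return WINTER_INTERCEPT + WINTER_SLOPE * temperature_c


# 0.4807 + 0.02587 * 5.6, which matches the table value for that row
print(round(predict_winter_crowdedness(5.6), 6))
```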

**Result Analysis** 

The linear regression yields a training RMSE of 0.1049, indicating that the model's predictions are relatively close to the observed data, given that crowdedness values range from approximately 0.2 to 0.9. The positive coefficient in the model's equation, `Crowdedness = 0.481 + 0.0259 * Temperature`, suggests that an increase in temperature is associated with an increase in crowdedness: each additional degree Celsius corresponds to a rise of about 0.026 in predicted crowdedness. This model, visualized in **Figure 3.1**, shows a clear upward trend and offers practical predictive power, although further evaluation against test data is advisable to confirm its generalizability.
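
One way to evaluate the winter model against test data is to compute the RMSE on held-out rows next to the training RMSE. The sketch below does this on synthetic data: the generated `toy_df` values and the `rmse` helper are assumptions for illustration. In the notebook, the same calls would take `X_winter_test` and `y_winter_test`, which are already defined.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic winter-like data (not the golf dataset)
rng = np.random.default_rng(7)
temperature = rng.uniform(-2, 8, 400)
toy_df = pd.DataFrame({
    "Temperature": temperature,
    "Crowdedness": 0.48 + 0.026 * temperature + rng.normal(0, 0.1, 400),
})

train, test = train_test_split(toy_df, train_size=0.75, random_state=1)
model = LinearRegression().fit(train[["Temperature"]], train["Crowdedness"])


def rmse(fitted_model, df):
    """Root mean squared error of the model's predictions on the given rows."""
    predicted = fitted_model.predict(df[["Temperature"]])
    return mean_squared_error(df["Crowdedness"], predicted) ** 0.5


train_rmse = rmse(model, train)
test_rmse = rmse(model, test)
print(f"train RMSE: {train_rmse:.4f}")
print(f"test RMSE:  {test_rmse:.4f}")
```

A test RMSE close to the training RMSE would suggest the model is not overfitting; a much larger test value would point the other way.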

#### 3.2.5.2 Data Analysis & Visualization for Summer Months

In [15]:
# Create a LinearRegression object
lm_summer = LinearRegression()

# Fit our linear regression model
X_summer_train = golf_summer_train[['Humidity']]
y_summer_train = golf_summer_train['Crowdedness']

X_summer_test = golf_summer_test[['Humidity']]
y_summer_test = golf_summer_test['Crowdedness']

summer_fit = lm_summer.fit(X_summer_train, y_summer_train)
print(summer_fit.coef_)
print(summer_fit.intercept_)

[-0.01266122]
1.5234691301737615


In [16]:
# Make predictions on the training set
summer_predictions = golf_summer_train.assign(
    predicted=lm_summer.predict(X_summer_train)
)

# Calculate RMSE on the training set
lm_rmse_summer = mean_squared_error(
    y_true=summer_predictions['Crowdedness'],
    y_pred=summer_predictions['predicted']
)**(1/2)

print(lm_rmse_summer)

0.10734088685487335


**Visualize linear regression model for summer**

In [17]:
# Prepare data for prediction grid
summer_prediction_grid = golf_summer_train[['Humidity']].agg(['min', 'max'])
summer_preds = summer_prediction_grid.assign(
    predicted=lm_summer.predict(summer_prediction_grid)
)

# Visualization using Altair
all_points_summer = alt.Chart(golf_summer_train).mark_circle(opacity=0.4).encode(
    x=alt.X("Humidity")
        .scale(zero=False)
        .title("Humidity"),
    y=alt.Y("Crowdedness")
        .scale(zero=False)
        .title("Crowdedness")
)

# Line chart of predictions
summer_preds_plot = all_points_summer + alt.Chart(summer_preds).mark_line(
    color="#ff7f0e"
).encode(
    x="Humidity",
    y="predicted"
).properties(width=600,
             height=400,
             title="Figure 3.2 - Summer Humidity vs. Crowdedness with Predictions")

summer_preds_plot

The equation for the linear model: `Crowdedness = 1.523 - 0.013 * Humidity`

**Result Analysis**

The regression for the summer months indicates a negative relationship between humidity and crowdedness, with a training RMSE of 0.1073, which, given the data spread, suggests the model's predictions are reasonably accurate. The linear equation, `Crowdedness = 1.523 - 0.013 * Humidity`, implies that higher humidity is associated with lower crowdedness: each additional percentage point of humidity corresponds to a drop of about 0.013 in predicted crowdedness, consistent with the trend shown in **Figure 3.2**. This model provides a useful tool for predicting summer golf course attendance based on humidity levels.
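
Because crowdedness is defined on a 0 to 1 scale, it can be worth checking where the fitted line leaves that range. The sketch below reuses the printed summer coefficients. The helper `predict_summer_crowdedness` and the choice to clip with `np.clip` are our assumptions, not part of the original analysis, and whether summer humidity in the dataset ever falls low enough to trigger the clip is not shown here.

```python
import numpy as np

# Coefficients copied from the fitted summer model's printed output
SUMMER_INTERCEPT = 1.5234691301737615
SUMMER_SLOPE = -0.01266122


def predict_summer_crowdedness(humidity_pct, clip=True):
    """Evaluate the summer line, optionally clipping to the 0-1 crowdedness scale."""
    raw = SUMMER_INTERCEPT + SUMMER_SLOPE * np.asarray(humidity_pct, dtype=float)
    return np.clip(raw, 0.0, 1.0) if clip else raw


# Humidity at which the unclipped line reaches a crowdedness of 1.0
humidity_at_full = (1.0 - SUMMER_INTERCEPT) / SUMMER_SLOPE
print(round(humidity_at_full, 1))

# Predictions for three example humidity levels (hypothetical inputs)
example_preds = predict_summer_crowdedness([30, 60, 90])
print(example_preds.round(3))
```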

# 5. Discussion

- Findings
<p>Our analysis found a clear positive relationship between temperature and crowdedness in the winter season and a clear negative relationship between humidity and crowdedness in the summer season. These regression results align with our initial expectation that winter temperature and summer humidity are correlated with the level of crowdedness. The strength of these relationships suggests that weather is a reliable predictor of golf course activity levels during these seasons.</p>

- Expectations vs. Reality
<p>The results confirmed our hypotheses: milder winter temperatures increase golf course usage, while higher summer humidity decreases it. This concurrence with our initial predictions underscores the predictable impact of weather on golfing habits.</p>

- Impacts
<p>These insights are valuable for both golfers and course managers. Golfers can use weather data to anticipate course crowdedness, optimizing their playing times. Course managers can leverage this information for better staffing and resource planning, enhancing player experience and operational efficiency.</p>

- Future Research
<p>Our findings raise further questions. Are the relationships between temperature and crowdedness in winter, and between humidity and crowdedness in summer, causal? And what level of crowdedness on the golf course do most golfers prefer?</p>

# 6. References

- Timbers, T., Campbell, T., Lee, M., Heagy, L., & Ostblom, J. (n.d.). Data Science: A First Introduction (Python Edition). Retrieved from https://python.datasciencebook.ca/

- dbSeer. (2019, August 1). Data Science 101: How to Use Linear Regression As Your Predictive Model. Retrieved from https://dbseer.com/data-science-101-how-to-use-linear-regression-as-your-predictive-model/

- Data source: https://www.kaggle.com/datasets/samybaladram/golf-play-extended?select=golf_dataset_long_format_with_text.csv