## 1. Title: Predicting Golf Course Crowdedness: Seasonal Influences of Temperature and Humidity

## 2. Introduction 

We aim to determine the impact of weather on golf course crowdedness using Samy Baladram's "Golf Play Dataset Extended." Specifically, we analyze the correlation between crowdedness and temperature in winter, and humidity in summer. Our objective is to use regression to predict precise crowdedness levels based on these factors. The dataset offers golf-related metrics, including weather conditions, seasonality, and time, enabling us to efficiently study and predict the weather-driven crowdedness patterns in two distinct seasons.

## 3. Preliminary Exploratory Data Analysis

In [47]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor  # regression, per the stated objective

In our preliminary analysis, we load the dataset from a GitHub repository so the notebook can be run without a manual download. The data is mostly clean: only the `Review` column contains empty cells, and we do not use it. Our initial inspection suggests that crowdedness on the golf course, measured on a scale from 0 to 1, is related to temperature in winter and to humidity in summer. Our study therefore concentrates on these two seasons and produces one scatter plot for each relationship.

### 3.1 Read Data 

In [48]:
# dataset source: https://www.kaggle.com/datasets/samybaladram/golf-play-extended?select=golf_dataset_long_format_with_text.csv
url = "https://raw.githubusercontent.com/DeeHu/dsci-100-group-project/main/data/golf_dataset_long_format_with_text.csv"
golf_df = pd.read_csv(url)
display(golf_df.head())
golf_df.info()

Unnamed: 0,Date,Weekday,Holiday,Month,Season,Temperature,Humidity,Windy,Outlook,Crowdedness,EmailCampaign,MaintenanceTask,ID,Play,PlayTimeHour,Review
0,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",A,1,3.1,Absolutely exhilarating first day of the year!...
1,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",B,0,0.0,
2,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",C,0,0.0,
3,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",D,1,3.6,"Ah, the exhilarating dance with the wind today..."
4,2021-01-01,4,1,Jan,Winter,3.3,49,1,sunny,0.73,Happy New Year and welcome to the Golf Course!...,"['Cleaning Amenities', 'Restroom Cleaning']",E,1,3.4,The atmosphere on the course was nothing short...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7665 entries, 0 to 7664
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             7665 non-null   object 
 1   Weekday          7665 non-null   int64  
 2   Holiday          7665 non-null   int64  
 3   Month            7665 non-null   object 
 4   Season           7665 non-null   object 
 5   Temperature      7665 non-null   float64
 6   Humidity         7665 non-null   int64  
 7   Windy            7665 non-null   int64  
 8   Outlook          7665 non-null   object 
 9   Crowdedness      7665 non-null   float64
 10  EmailCampaign    7665 non-null   object 
 11  MaintenanceTask  7665 non-null   object 
 12  ID               7665 non-null   object 
 13  Play             7665 non-null   int64  
 14  PlayTimeHour     7665 non-null   float64
 15  Review           1352 non-null   object 
dtypes: float64(3), int64(5), object(8)
memory usage: 958.2+ KB


### 3.2 Clean and Wrangle Data
<p>Upon preliminary inspection, the dataset is relatively clean, and all variables are in an appropriate format for analysis. It arrives in a tidy format: each variable is a column, each observation is a row, and each type of observational unit forms a table.</p>

#### 3.2.1 Extracting Relevant Data
For our specific analysis, we only need the `Month`, `Season`, `Temperature`, `Humidity`, and `Crowdedness` columns.

In [49]:
clean_golf_df = golf_df[['Month', 'Season', 'Temperature', 'Humidity', 'Crowdedness']]
clean_golf_df

Unnamed: 0,Month,Season,Temperature,Humidity,Crowdedness
0,Jan,Winter,3.3,49,0.73
1,Jan,Winter,3.3,49,0.73
2,Jan,Winter,3.3,49,0.73
3,Jan,Winter,3.3,49,0.73
4,Jan,Winter,3.3,49,0.73
...,...,...,...,...,...
7660,Dec,Winter,1.8,43,0.57
7661,Dec,Winter,1.8,43,0.57
7662,Dec,Winter,1.8,43,0.57
7663,Dec,Winter,1.8,43,0.57


#### 3.2.2 Handling Missing Data
Before proceeding, we confirm that the relevant columns contain no missing data.

In [50]:
missing_data = clean_golf_df.isnull().sum()
missing_data

Month          0
Season         0
Temperature    0
Humidity       0
Crowdedness    0
dtype: int64

Great! There is no missing data in our current data frame.

#### 3.2.3 Findings After Initial Investigation
- There is a strong relationship between `Crowdedness` and `Temperature` in the winter months.
- There is a strong relationship between `Crowdedness` and `Humidity` in the summer months.
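
One way to put a number on "strong" is the Pearson correlation within each season. The sketch below is illustrative only: `toy_df` is a small made-up frame that mimics the column names of `clean_golf_df`, and the helper `seasonal_correlation` is a hypothetical name of our own, not part of the dataset or of pandas.

```python
import pandas as pd

# Made-up values for illustration only; substitute clean_golf_df in the notebook.
toy_df = pd.DataFrame({
    "Season": ["Winter"] * 4 + ["Summer"] * 4,
    "Temperature": [1.0, 2.0, 3.0, 4.0, 24.0, 26.0, 25.0, 27.0],
    "Humidity": [50, 48, 52, 49, 60, 70, 80, 90],
    "Crowdedness": [0.40, 0.50, 0.60, 0.70, 0.90, 0.80, 0.70, 0.60],
})


def seasonal_correlation(df, season, predictor, target="Crowdedness"):
    """Pearson correlation between predictor and target within one season."""
    subset = df[df["Season"] == season]
    return subset[predictor].corr(subset[target])


winter_r = seasonal_correlation(toy_df, "Winter", "Temperature")
summer_r = seasonal_correlation(toy_df, "Summer", "Humidity")
print(f"Winter Temperature vs Crowdedness: r = {winter_r:.2f}")
print(f"Summer Humidity vs Crowdedness: r = {summer_r:.2f}")
```

A value of r near +1 or -1 would support the word "strong", while a value near 0 would suggest the pattern seen in a plot is weaker than it looks.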

In [51]:
alt.data_transformers.disable_max_rows()

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

humidity_plot = alt.Chart(clean_golf_df).mark_point().encode(
    x=alt.X('Humidity').title('Humidity').scale(zero=False),
    y='Crowdedness'
).facet(
    column=alt.Column("Month", sort=month_order)   
)

temp_plot = alt.Chart(clean_golf_df).mark_point().encode(
    x=alt.X('Temperature').title('Temperature').scale(zero=False),
    y='Crowdedness'
).facet(
    column=alt.Column("Month", sort=month_order)   
)

display(temp_plot)
display(humidity_plot)

We therefore split the data frame into two subsets: one for the winter months and one for the summer months.

In [52]:
winter_months_df = clean_golf_df[clean_golf_df['Season']=='Winter'].reset_index(drop=True)
summer_months_df = clean_golf_df[clean_golf_df['Season']=='Summer'].reset_index(drop=True)

display(winter_months_df.shape[0])
display(summer_months_df.shape[0])

1869

1974

### 3.3 Split Data into Training and Testing Sets

In [53]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# set the seed
np.random.seed(5)

golf_winter_train, golf_winter_test = train_test_split(
    winter_months_df, train_size=0.75)

golf_summer_train, golf_summer_test = train_test_split(
    summer_months_df, train_size=0.75)


In [116]:
golf_winter_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1401 entries, 348 to 867
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        1401 non-null   object 
 1   Season       1401 non-null   object 
 2   Temperature  1401 non-null   float64
 3   Humidity     1401 non-null   int64  
 4   Crowdedness  1401 non-null   float64
dtypes: float64(2), int64(1), object(2)
memory usage: 65.7+ KB


In [54]:
golf_winter_test.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 468 entries, 1531 to 509
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        468 non-null    object 
 1   Season       468 non-null    object 
 2   Temperature  468 non-null    float64
 3   Humidity     468 non-null    int64  
 4   Crowdedness  468 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 21.9+ KB


In [118]:
golf_summer_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1480 entries, 1220 to 925
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        1480 non-null   object 
 1   Season       1480 non-null   object 
 2   Temperature  1480 non-null   float64
 3   Humidity     1480 non-null   int64  
 4   Crowdedness  1480 non-null   float64
dtypes: float64(2), int64(1), object(2)
memory usage: 69.4+ KB


In [None]:
golf_summer_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 494 entries, 1554 to 1030
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        494 non-null    object 
 1   Season       494 non-null    object 
 2   Temperature  494 non-null    float64
 3   Humidity     494 non-null    int64  
 4   Crowdedness  494 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 23.2+ KB


### 3.4 Data Summary 

None of the columns we use contain missing values, so no rows had to be dropped. We focus on one relationship per season: temperature in winter and humidity in summer.

We use 75% of each seasonal subset to train the model and hold out the remaining 25% to estimate its accuracy on unseen data. The tables below summarize the training sets.

In [55]:


winter_attributes = ['Temperature', 'Humidity', 'Crowdedness']

# Per-column count, missing-value count, and mean for the winter training set
winter_count_values = golf_winter_train[winter_attributes].count()
winter_missing_values = golf_winter_train[winter_attributes].isnull().sum()
winter_mean_values = golf_winter_train[winter_attributes].mean(numeric_only=True)

winter_summary_df = pd.DataFrame({
    'Attribute': winter_attributes,
    "Count": winter_count_values,
    "Missing Values": winter_missing_values,
    "Mean": winter_mean_values}).reset_index(drop=True)

winter_summary_df

Unnamed: 0,Attribute,Count,Missing Values,Mean
0,Temperature,1401,0,3.6606
1,Humidity,1401,0,49.954318
2,Crowdedness,1401,0,0.576331


In [56]:
summer_attributes = ['Temperature', 'Humidity', 'Crowdedness']

# Per-column count, missing-value count, and mean for the summer training set
summer_count_values = golf_summer_train[summer_attributes].count()
summer_missing_values = golf_summer_train[summer_attributes].isnull().sum()
summer_mean_values = golf_summer_train[summer_attributes].mean(numeric_only=True)

summer_summary_df = pd.DataFrame({
    'Attribute': summer_attributes,
    "Count": summer_count_values,
    "Missing Values": summer_missing_values,
    "Mean": summer_mean_values}).reset_index(drop=True)

summer_summary_df



### 3.5 Data Visualization

#### 3.5.1 Crowdedness vs Temperature in the Winter

In [57]:
# Scatter plot of crowdedness against temperature for the winter training set
winter_temp_plot = alt.Chart(golf_winter_train).mark_point().encode(
    x=alt.X('Temperature').title('Temperature').scale(zero=False),
    y='Crowdedness',
).properties(width=600, height=400)

winter_temp_plot

#### 3.5.2 Crowdedness vs Humidity in the Summer

In [58]:
summer_humidity_plot = alt.Chart(golf_summer_train).mark_point().encode(
    x=alt.X('Humidity').title('Humidity').scale(zero=False),
    y='Crowdedness',
).properties(width=600, height=400) 

summer_humidity_plot


## 4. Methods

<p>To predict golf course crowdedness, we segmented our data by season: winter and summer. In winter, crowdedness correlated with temperature, while in summer it was related to humidity. Accordingly, we focus on 'Temperature' and 'Crowdedness' for the winter analysis and on 'Humidity' and 'Crowdedness' for the summer analysis.</p>

<p>We will split each seasonal dataset into a training set and a testing set and scale the numerical predictors so that they are on a comparable scale. We will then fit a K-nearest neighbours regression model, tuning on the training set to find the best value of “K” (the number of neighbours), and use the tuned model to make predictions on the testing set. During the winter months, we will predict crowdedness from temperature; during the summer months, we will predict it from humidity. We will use two scatter plots, one per season, to visualize how crowdedness changes with each predictor.</p>
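
The workflow described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the final analysis: `toy_winter` is synthetic data generated inside the block, the search range of 1 to 30 neighbours and the 5-fold cross-validation are arbitrary choices, and RMSE is assumed as the tuning metric.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for winter_months_df (made-up values, illustration only).
rng = np.random.default_rng(5)
temperature = rng.uniform(-5, 10, size=200)
noise = rng.normal(0, 0.05, size=200)
crowdedness = np.clip(0.4 + 0.03 * temperature + noise, 0, 1)
toy_winter = pd.DataFrame({"Temperature": temperature, "Crowdedness": crowdedness})

# Hold out 25% of the rows for the final evaluation.
train, test = train_test_split(toy_winter, train_size=0.75, random_state=5)

# Standardize the predictor, then apply K-nearest neighbours regression.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Try each K with 5-fold cross-validation on the training set only.
param_grid = {"kneighborsregressor__n_neighbors": range(1, 31)}
grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(train[["Temperature"]], train["Crowdedness"])

# GridSearchCV refits the best pipeline on the full training set by default.
best_k = grid.best_params_["kneighborsregressor__n_neighbors"]
predictions = grid.predict(test[["Temperature"]])
rmspe = np.sqrt(np.mean((test["Crowdedness"] - predictions) ** 2))
print(f"best K = {best_k}, test RMSPE = {rmspe:.3f}")
```

The summer model would follow the same pattern with `Humidity` as the single predictor. Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, so the held-out fold never influences the scaling.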

## 5. Expected Outcomes and Significance

### 5.1 Expected Outcomes

Using a regression algorithm, we predict golf course crowdedness based on factors like month, season, humidity, and temperature. The model aims to highlight how these factors affect crowdedness. In winter, we expect increased crowdedness with warmer temperatures, as golfers favor milder conditions. Conversely, in summer, higher humidity might lead to reduced crowdedness, suggesting golfers avoid humid conditions.
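
A quick way to check the expected direction of each relationship is the sign of a least-squares slope. The sketch below uses a straight-line fit purely as a diagnostic, which is not the model proposed in the Methods section, and the readings are made up for illustration rather than taken from the dataset.

```python
import numpy as np

# Made-up winter readings for illustration only (not taken from the dataset).
temperature = np.array([-2.0, 0.0, 1.5, 3.0, 5.0, 7.5])
crowdedness = np.array([0.35, 0.42, 0.50, 0.55, 0.63, 0.74])

# Fit crowdedness = slope * temperature + intercept by least squares.
slope, intercept = np.polyfit(temperature, crowdedness, deg=1)

direction = "increases" if slope > 0 else "decreases"
print(f"Crowdedness {direction} by about {abs(slope):.3f} per degree")
```

A positive slope for winter temperature and a negative slope for summer humidity would agree with the expectations stated above.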

### 5.2 Significance

Our model aims to give golf clubs insight into the factors that affect crowdedness, so they can respond strategically to fluctuations in attendance. By understanding how seasonal weather relates to attendance, managers can improve operational efficiency and the guest experience. The model can support informed decision-making, for example by helping managers anticipate and plan around periods of low attendance. It may also serve as a starting point for similar predictive tools for other outdoor recreational activities.