## 1. Title: Predicting Golf Course Crowdedness: Seasonal Influences of Temperature and Humidity

## 2. Introduction 

We aim to determine the impact of weather on golf course crowdedness using Samy Baladram's "Golf Play Dataset Extended." Specifically, we analyze the relationship between crowdedness and temperature in winter, and between crowdedness and humidity in summer. Our objective is to use regression to predict crowdedness levels from these factors. The dataset offers golf-related metrics, including weather conditions, seasonality, and time, which lets us study and predict weather-driven crowdedness patterns in two distinct seasons.

## 3. Preliminary Exploratory Data Analysis

In [None]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In our preliminary analysis, we load the dataset from a GitHub repository for convenient access. The data is mostly clean, but some cells are empty, so we check the columns we use for missing values. Crowdedness on the golf course, quantified from 0 to 1, appears to be related to temperature in winter and to humidity in summer. Our study therefore concentrates on these two seasons, with one plot per season to illustrate each relationship.

### 3.1 Read Data 

In [None]:
# dataset source: https://www.kaggle.com/datasets/samybaladram/golf-play-extended?select=golf_dataset_long_format_with_text.csv
url = "https://raw.githubusercontent.com/DeeHu/dsci-100-group-project/main/data/golf_dataset_long_format_with_text.csv"
golf_df = pd.read_csv(url)
display(golf_df.head())
golf_df.info()

### 3.2 Clean and Wrangle Data
<p>Upon preliminary inspection, we noted that our dataset is relatively clean, with all the variables being in an appropriate format for analysis. We have transformed the dataset into a tidy format, ensuring each variable is a column, each observation is a row, and each type of observational unit forms a table.</p>

#### 3.2.1 Extracting Relevant Data
For our specific analysis, we only need the `Month`, `Season`, `Temperature`, `Humidity`, and `Crowdedness` columns.

In [None]:
clean_golf_df = golf_df[['Month', 'Season', 'Temperature', 'Humidity', 'Crowdedness']]
clean_golf_df

#### 3.2.2 Handling Missing Data
Before proceeding, we check that the relevant columns contain no missing data.

In [None]:
missing_data = clean_golf_df.isnull().sum()
missing_data

Great! There is no missing data in our current data frame.

#### 3.2.3 Findings After Initial Investigation
- There is a strong relationship between `Crowdedness` and `Temperature` in the winter months.
- There is a strong relationship between `Crowdedness` and `Humidity` in the summer months.
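One way to put a number on these relationships is the Pearson correlation within each season. The sketch below is illustrative only: `toy_df` is a hypothetical, synthetic stand-in for `clean_golf_df` (its values are generated, not taken from the dataset), and the slopes used to generate it are assumptions chosen to mimic the pattern described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for clean_golf_df; these values are NOT from the real dataset.
rng = np.random.default_rng(5)
n = 100

winter = pd.DataFrame({
    "Season": "Winter",
    "Temperature": rng.uniform(-5, 15, n),
    "Humidity": rng.uniform(30, 95, n),
})
# Assumed pattern: crowdedness rises with temperature in winter.
winter["Crowdedness"] = np.clip(
    0.3 + 0.03 * winter["Temperature"] + rng.normal(0, 0.05, n), 0, 1)

summer = pd.DataFrame({
    "Season": "Summer",
    "Temperature": rng.uniform(15, 35, n),
    "Humidity": rng.uniform(30, 95, n),
})
# Assumed pattern: crowdedness falls with humidity in summer.
summer["Crowdedness"] = np.clip(
    0.9 - 0.006 * summer["Humidity"] + rng.normal(0, 0.05, n), 0, 1)

toy_df = pd.concat([winter, summer], ignore_index=True)

# Pearson correlation between the chosen predictor and Crowdedness, per season.
season_corr = {}
for season, predictor in [("Winter", "Temperature"), ("Summer", "Humidity")]:
    subset = toy_df[toy_df["Season"] == season]
    season_corr[season] = subset[predictor].corr(subset["Crowdedness"])

print(season_corr)
```

Running the same loop on `clean_golf_df` would show whether the real data supports the word "strong"; a coefficient near 0 would suggest the relationship is weak or nonlinear.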

In [None]:
alt.data_transformers.disable_max_rows()

month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

humidity_plot = alt.Chart(clean_golf_df, title="Relationship between Humidity and Crowdedness All Year").mark_point().encode(
    x=alt.X('Humidity').title('Humidity').scale(zero=False),
    y='Crowdedness'
).facet(
    column=alt.Column("Month", sort=month_order)   
)

temp_plot = alt.Chart(clean_golf_df, title="Relationship between Temperature and Crowdedness All Year").mark_point().encode(
    x=alt.X('Temperature').title('Temperature').scale(zero=False),
    y='Crowdedness'
).facet(
    column=alt.Column("Month", sort=month_order)   
)

display(temp_plot)
display(humidity_plot)

We therefore split the data frame into two: one for the winter months and one for the summer months.

In [None]:
winter_months_df = clean_golf_df[clean_golf_df['Season']=='Winter'].reset_index(drop=True)
summer_months_df = clean_golf_df[clean_golf_df['Season']=='Summer'].reset_index(drop=True)

display(winter_months_df.shape[0])
display(summer_months_df.shape[0])

### 3.3 Split Data into Training and Testing Sets

In [None]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# set the seed
np.random.seed(5)

golf_winter_train, golf_winter_test = train_test_split(
    winter_months_df, train_size=0.75)

golf_summer_train, golf_summer_test = train_test_split(
    summer_months_df, train_size=0.75)


In [None]:
golf_winter_train.info()

In [None]:
golf_winter_test.info()


In [None]:
golf_summer_train.info()

In [None]:
golf_summer_test.info()

### 3.4 Data Summary 

We kept only the columns relevant to our analysis, filtered the data by season, and use one plot for each of the two seasons.

We used 75% of each seasonal dataset to train the model and held out the remaining 25% to estimate its prediction error. As shown below, neither the winter nor the summer training set has missing values, so we can proceed with the analysis.

In [None]:


winter_attributes =  ['Temperature', 'Humidity', 'Crowdedness']

winter_count_values = golf_winter_train[winter_attributes].count()
winter_missing_values = golf_winter_train[winter_attributes].isnull().sum()
winter_mean_values = golf_winter_train[winter_attributes].mean(numeric_only=True)

winter_summary_df = pd.DataFrame({
    'Attribute': winter_attributes,
    "Count": winter_count_values,
    "Missing Values": winter_missing_values,
    "Mean": winter_mean_values}).reset_index(drop=True)

winter_summary_df

In [None]:
summer_attributes = ['Temperature', 'Humidity', 'Crowdedness']

# Use the summer attribute list so this cell does not depend on the winter cell
summer_count_values = golf_summer_train[summer_attributes].count()
summer_missing_values = golf_summer_train[summer_attributes].isnull().sum()
summer_mean_values = golf_summer_train[summer_attributes].mean(numeric_only=True)

summer_summary_df = pd.DataFrame({
    'Attribute': summer_attributes,
    "Count": summer_count_values,
    "Missing Values": summer_missing_values,
    "Mean": summer_mean_values}).reset_index(drop=True)

summer_summary_df

### 3.5 Data Visualization

#### 3.5.1 Crowdedness vs Temperature in the Winter

In [None]:
# Scatter plot of winter crowdedness against temperature (training data only)
winter_temp_plot = alt.Chart(golf_winter_train, title="Relationship between temperature and crowdedness in the winter").mark_point().encode(
    x=alt.X('Temperature').title('Temperature').scale(zero=False),
    y='Crowdedness',
).properties(width=600, height=400)

winter_temp_plot

#### 3.5.2 Crowdedness vs Humidity in the Summer

In [None]:
summer_humidity_plot = alt.Chart(golf_summer_train, title="Relationship between humidity and crowdedness in the summer").mark_point().encode(
    x=alt.X('Humidity').title('Humidity').scale(zero=False),
    y='Crowdedness',
).properties(width=600, height=400) 

summer_humidity_plot


## 4. Methods

<p>To predict golf course crowdedness, we segmented our data by season: winter and summer. In winter, crowdedness correlated with temperature, while in summer, it was related to humidity. Accordingly, we focus on 'Temperature' and 'Crowdedness' for the winter analysis and on 'Humidity' and 'Crowdedness' for the summer analysis. </p>

<p>We will split each seasonal dataset into a training set and a testing set and scale the numerical predictors so that they are on a comparable scale. We will then compare candidate “K” values (the number of neighbours) on the training set to find the best “K” for our data, and use the resulting model to make predictions on the testing set. We will use two scatter plots to visualize how crowdedness changes with the predictors. During the winter months, we will predict crowdedness based on temperature, while during the summer months, we will predict crowdedness based on humidity. </p>

### 4.1 Scaling our numerical values
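As a minimal sketch of this step, the cell below standardizes a single predictor with `StandardScaler`, fitting on the training rows only so that no information from the testing rows leaks into the scaling. The data frames `toy_train` and `toy_test` are hypothetical, synthetic stand-ins for `golf_winter_train` and `golf_winter_test`; their values are generated for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the winter training/testing sets (not real data).
rng = np.random.default_rng(5)
toy_train = pd.DataFrame({"Temperature": rng.uniform(-5, 15, size=40)})
toy_test = pd.DataFrame({"Temperature": rng.uniform(-5, 15, size=10)})

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only ...
train_scaled = np.asarray(scaler.fit_transform(toy_train))
# ... and reuse those same statistics to transform the testing data.
test_scaled = np.asarray(scaler.transform(toy_test))

# The scaled training column has mean ~0 and standard deviation ~1.
print(train_scaled.mean(), train_scaled.std())
```

In the full analysis the scaler would more likely sit inside a pipeline together with the regressor, so that scaling is refit on each cross-validation fold.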

### 4.2 Choosing the best "K" value
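One possible way to choose "K" is a cross-validated grid search over a scaling-plus-KNN pipeline, scored by root mean squared error. The sketch below assumes 5 folds and a candidate range of 1 to 30 neighbours; both are illustrative choices, not settings fixed by this proposal. `toy_train` is a hypothetical, synthetic stand-in for `golf_winter_train`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for golf_winter_train (not real data).
rng = np.random.default_rng(5)
temperature = rng.uniform(-5, 15, size=120)
crowdedness = np.clip(
    0.3 + 0.03 * temperature + rng.normal(0, 0.05, size=120), 0, 1)
toy_train = pd.DataFrame(
    {"Temperature": temperature, "Crowdedness": crowdedness})

# Scaling lives inside the pipeline so it is refit on every training fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Candidate K values (assumed range); the step name is generated by make_pipeline.
param_grid = {"kneighborsregressor__n_neighbors": range(1, 31)}

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,  # assumed number of folds
    scoring="neg_root_mean_squared_error",
)
grid.fit(toy_train[["Temperature"]], toy_train["Crowdedness"])

best_k = grid.best_params_["kneighborsregressor__n_neighbors"]
cv_rmspe = -grid.best_score_  # flip the sign back to a positive error

print(best_k, cv_rmspe)
```

The same search would be run separately for the summer data with `Humidity` as the predictor, since the best "K" need not be the same in both seasons.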

### 4.3 Making predictions
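Once a "K" has been chosen, the model can be refit on the full training set and evaluated on the held-out testing set. The sketch below illustrates this for the summer case with `Humidity` as the predictor. The value `chosen_k = 10` is a placeholder rather than a tuned result, and `toy_summer` is a hypothetical, synthetic stand-in for `summer_months_df`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for summer_months_df (not real data).
rng = np.random.default_rng(5)
humidity = rng.uniform(30, 95, size=160)
crowdedness = np.clip(
    0.9 - 0.006 * humidity + rng.normal(0, 0.05, size=160), 0, 1)
toy_summer = pd.DataFrame({"Humidity": humidity, "Crowdedness": crowdedness})

# Same 75/25 split proportion as in section 3.3.
toy_train, toy_test = train_test_split(
    toy_summer, train_size=0.75, random_state=5)

chosen_k = 10  # placeholder; in practice, use the K selected by cross-validation

model = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=chosen_k))
model.fit(toy_train[["Humidity"]], toy_train["Crowdedness"])

# Predict crowdedness for the unseen testing rows.
predictions = model.predict(toy_test[["Humidity"]])

# Root mean squared prediction error on the testing set.
rmspe = mean_squared_error(toy_test["Crowdedness"], predictions) ** 0.5
print(rmspe)
```

Because crowdedness ranges from 0 to 1, the error is on that same scale, which makes it easy to judge whether the predictions are accurate enough to be useful.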

## 5. Expected Outcomes and Significance

### 5.1 Expected Outcomes

Using a regression algorithm, we predict golf course crowdedness based on factors like month, season, humidity, and temperature. The model aims to highlight how these factors affect crowdedness. In winter, we expect increased crowdedness with warmer temperatures, as golfers favor milder conditions. Conversely, in summer, higher humidity might lead to reduced crowdedness, suggesting golfers avoid humid conditions.

### 5.2 Significance

Our model aims to provide golf clubs with insights into factors affecting crowdedness, enabling strategic responses to attendance fluctuations. By understanding the ties between seasonal factors and attendance, managers can enhance operational efficiency and guest experiences. This algorithm serves as a pivotal tool for informed decision-making and averting low attendance. Furthermore, it paves the way for predictive tools across outdoor recreational activities, signifying an evolution in the sports industry.