## 1. Title <span style="color:red">(to be completed)</span>

## 2. Introduction 

<p>Our project examines how the weather relates to how crowded a golf course is, since weather may influence how many people choose to play. We analyze the Golf Play Dataset Extended, created by Samy Baladram. Specifically, we study the relationship between crowdedness and weather in two seasons: temperature in winter and humidity in summer.</p>

<p>This leads to the predictive question we want to answer with regression: can we predict a numerical crowdedness value from temperature in winter and from humidity in summer? The dataset provides extended golf-related information, including weather conditions, humidity, temperature, season, time, and crowdedness. Kaggle rates the usability of this dataset as 10 out of 10, which suggests good compatibility, reliability, and completeness. The dataset is also well documented and clean, so little further cleaning and wrangling was needed. With the data it provides, we can analyze and predict crowdedness based on the weather in the two seasons.</p>

## 3. Preliminary Exploratory Data Analysis

In [36]:
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import mean_squared_error  # regression metric, since Crowdedness is numeric
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsRegressor  # regression rather than classification
from sklearn.model_selection import train_test_split

<p>For our preliminary exploratory data analysis, we read the dataset from our remote GitHub repository, which keeps it easy to access for further analysis. The data are relatively clean: the only column with empty cells is Review, which we do not use.</p>

<p>One important observation is that the relationship we are interested in appears mainly during the winter and summer months. Crowdedness, a measure of how crowded the golf course was (ranging from 0 to 1), appears to be affected by low temperatures in winter and by humidity in summer. We therefore focus on these two seasons and produce two separate plots, one per season, to give a clear view of the relationship we aim to explore.</p>

### 3.1 Read Data 

In [37]:
# dataset source: https://www.kaggle.com/datasets/samybaladram/golf-play-extended?select=golf_dataset_long_format_with_text.csv
url = "https://raw.githubusercontent.com/DeeHu/dsci-100-group-project/main/data/golf_dataset_long_format_with_text.csv"
golf_df = pd.read_csv(url)
display(golf_df.head())
golf_df.info()

| | Date | Weekday | Holiday | Month | Season | Temperature | Humidity | Windy | Outlook | Crowdedness | EmailCampaign | MaintenanceTask | ID | Play | PlayTimeHour | Review |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 4 | 1 | Jan | Winter | 3.3 | 49 | 1 | sunny | 0.73 | Happy New Year and welcome to the Golf Course!... | ['Cleaning Amenities', 'Restroom Cleaning'] | A | 1 | 3.1 | Absolutely exhilarating first day of the year!... |
| 1 | 2021-01-01 | 4 | 1 | Jan | Winter | 3.3 | 49 | 1 | sunny | 0.73 | Happy New Year and welcome to the Golf Course!... | ['Cleaning Amenities', 'Restroom Cleaning'] | B | 0 | 0.0 | |
| 2 | 2021-01-01 | 4 | 1 | Jan | Winter | 3.3 | 49 | 1 | sunny | 0.73 | Happy New Year and welcome to the Golf Course!... | ['Cleaning Amenities', 'Restroom Cleaning'] | C | 0 | 0.0 | |
| 3 | 2021-01-01 | 4 | 1 | Jan | Winter | 3.3 | 49 | 1 | sunny | 0.73 | Happy New Year and welcome to the Golf Course!... | ['Cleaning Amenities', 'Restroom Cleaning'] | D | 1 | 3.6 | Ah, the exhilarating dance with the wind today... |
| 4 | 2021-01-01 | 4 | 1 | Jan | Winter | 3.3 | 49 | 1 | sunny | 0.73 | Happy New Year and welcome to the Golf Course!... | ['Cleaning Amenities', 'Restroom Cleaning'] | E | 1 | 3.4 | The atmosphere on the course was nothing short... |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7665 entries, 0 to 7664
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             7665 non-null   object 
 1   Weekday          7665 non-null   int64  
 2   Holiday          7665 non-null   int64  
 3   Month            7665 non-null   object 
 4   Season           7665 non-null   object 
 5   Temperature      7665 non-null   float64
 6   Humidity         7665 non-null   int64  
 7   Windy            7665 non-null   int64  
 8   Outlook          7665 non-null   object 
 9   Crowdedness      7665 non-null   float64
 10  EmailCampaign    7665 non-null   object 
 11  MaintenanceTask  7665 non-null   object 
 12  ID               7665 non-null   object 
 13  Play             7665 non-null   int64  
 14  PlayTimeHour     7665 non-null   float64
 15  Review           1352 non-null   object 
dtypes: float64(3), int64(5), object(8)
memory usage: 958.3+ KB


### 3.2 Clean and Wrangle Data
<p>On preliminary inspection, the dataset is relatively clean, and every variable is already in an appropriate format for analysis. The data are also in a tidy format: each variable is a column, each observation is a row, and each type of observational unit forms a table.</p>
<p>However, for our specific analysis, we do not need all the columns.</p>

#### 3.2.1 Extracting Relevant Data
For our specific analysis, we only need the `Month`, `Season`, `Temperature`, `Humidity`, and `Crowdedness` columns.

In [38]:
clean_golf_df = golf_df[['Month', 'Season', 'Temperature', 'Humidity', 'Crowdedness']]
clean_golf_df

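| | Month | Season | Temperature | Humidity | Crowdedness |
|---|---|---|---|---|---|
| 0 | Jan | Winter | 3.3 | 49 | 0.73 |
| 1 | Jan | Winter | 3.3 | 49 | 0.73 |
| 2 | Jan | Winter | 3.3 | 49 | 0.73 |
| 3 | Jan | Winter | 3.3 | 49 | 0.73 |
| 4 | Jan | Winter | 3.3 | 49 | 0.73 |
| ... | ... | ... | ... | ... | ... |
| 7660 | Dec | Winter | 1.8 | 43 | 0.57 |
| 7661 | Dec | Winter | 1.8 | 43 | 0.57 |
| 7662 | Dec | Winter | 1.8 | 43 | 0.57 |
| 7663 | Dec | Winter | 1.8 | 43 | 0.57 |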


#### 3.2.2 Handling Missing Data
Before proceeding, we check that there is no missing data in the relevant columns.

In [39]:
missing_data = clean_golf_df.isnull().sum()
missing_data

Month          0
Season         0
Temperature    0
Humidity       0
Crowdedness    0
dtype: int64

Great! There is no missing data in our current data frame.

#### 3.2.3 Findings After Initial Investigation
- There appears to be a strong relationship between `Crowdedness` and `Temperature` in the winter months.
- There appears to be a strong relationship between `Crowdedness` and `Humidity` in the summer months.
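
One way to put a number on how strong these relationships are is to compute a Pearson correlation within each season. The sketch below is only an illustration: `toy_df` and `seasonal_corr` are hypothetical names, and the values are invented (and deliberately perfectly linear) rather than taken from the golf dataset.

```python
import pandas as pd

# Toy stand-in for clean_golf_df; these values are invented for illustration only.
toy_df = pd.DataFrame({
    "Season": ["Winter"] * 4 + ["Summer"] * 4,
    "Temperature": [1.0, 3.0, 5.0, 7.0, 24.0, 26.0, 28.0, 30.0],
    "Humidity": [40, 45, 50, 55, 50, 60, 70, 80],
    "Crowdedness": [0.50, 0.60, 0.70, 0.80, 0.90, 0.80, 0.70, 0.60],
})


def seasonal_corr(df, season, predictor):
    """Pearson correlation between `predictor` and Crowdedness within one season."""
    subset = df[df["Season"] == season]
    return subset[predictor].corr(subset["Crowdedness"])


winter_r = seasonal_corr(toy_df, "Winter", "Temperature")
summer_r = seasonal_corr(toy_df, "Summer", "Humidity")
print(f"winter r = {winter_r:.2f}, summer r = {summer_r:.2f}")
```

Applied to the real data frame, a value near 1 or -1 would support the word "strong", while a value near 0 would suggest that the relationship is weak or not linear.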

In [40]:
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# The data frame has more rows than Altair's default limit of 5000
alt.data_transformers.disable_max_rows()

humidity_plot = alt.Chart(clean_golf_df).mark_point().encode(
    x=alt.X('Humidity', title='Humidity'),
    y='Crowdedness',
    column=alt.Column('Month', sort=month_order)  # facet by month in calendar order
).properties(width=200, height=200)

temp_plot = alt.Chart(clean_golf_df).mark_point().encode(
    x=alt.X('Temperature', title='Temperature'),
    y='Crowdedness',
    column=alt.Column('Month', sort=month_order)  # facet by month in calendar order
).properties(width=200, height=200)

display(temp_plot)
display(humidity_plot)

Therefore, we filter our data frame into two: one for the winter months and one for the summer months.

In [None]:
winter_months_df = clean_golf_df[clean_golf_df['Season']=='Winter'].reset_index(drop=True)
summer_months_df = clean_golf_df[clean_golf_df['Season']=='Summer'].reset_index(drop=True)

display(winter_months_df.shape[0])
display(summer_months_df.shape[0])

1869

1974

### 3.3 Split Data into Training and Testing Sets

In [None]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# set the seed
np.random.seed(1)

golf_winter_train, golf_winter_test = train_test_split(
    winter_months_df, train_size=0.75)

golf_summer_train, golf_summer_test = train_test_split(
    summer_months_df, train_size=0.75)


In [None]:
golf_winter_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1401 entries, 707 to 1061
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        1401 non-null   object 
 1   Season       1401 non-null   object 
 2   Temperature  1401 non-null   float64
 3   Humidity     1401 non-null   int64  
 4   Crowdedness  1401 non-null   float64
dtypes: float64(2), int64(1), object(2)
memory usage: 65.7+ KB


In [None]:
golf_winter_test.info()


<class 'pandas.core.frame.DataFrame'>
Index: 468 entries, 1228 to 1347
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        468 non-null    object 
 1   Season       468 non-null    object 
 2   Temperature  468 non-null    float64
 3   Humidity     468 non-null    int64  
 4   Crowdedness  468 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 21.9+ KB


In [None]:
golf_summer_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1480 entries, 525 to 385
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        1480 non-null   object 
 1   Season       1480 non-null   object 
 2   Temperature  1480 non-null   float64
 3   Humidity     1480 non-null   int64  
 4   Crowdedness  1480 non-null   float64
dtypes: float64(2), int64(1), object(2)
memory usage: 69.4+ KB


In [None]:
golf_summer_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 494 entries, 1554 to 1030
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Month        494 non-null    object 
 1   Season       494 non-null    object 
 2   Temperature  494 non-null    float64
 3   Humidity     494 non-null    int64  
 4   Crowdedness  494 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 23.2+ KB


### 3.4 Data Summary 

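We kept only the columns relevant to our question, confirmed that they contain no missing values, and split the data by season. We use one plot for each season.

We use 75% of each seasonal data frame to train the model and hold out the remaining 25% to estimate its accuracy: 1401 training and 468 testing rows for winter, and 1480 training and 494 testing rows for summer.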
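
As a small aside on reproducibility, a 75/25 split can also be pinned down by passing `random_state` directly to `train_test_split` instead of relying on a global seed. The sketch below uses a synthetic data frame (`demo_df`, with invented values) in place of the seasonal data frames, so the row counts are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a seasonal data frame (values invented for illustration).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Temperature": rng.uniform(-5, 10, size=100),
    "Crowdedness": rng.uniform(0, 1, size=100),
})

# Passing random_state makes the split itself reproducible,
# independent of any global NumPy seed set elsewhere.
train_a, test_a = train_test_split(demo_df, train_size=0.75, random_state=1)
train_b, test_b = train_test_split(demo_df, train_size=0.75, random_state=1)

print(train_a.shape[0], test_a.shape[0])
print(train_a.index.equals(train_b.index))
```

Repeating the call with the same `random_state` yields the same rows in each set, which makes later results easier to compare across notebook runs.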

### 3.5 Data Viz <span style="color:red">(To be completed by Di)</span>

In [None]:
summer_humidity_plot = alt.Chart(summer_months_df).mark_point().encode(
    x=alt.X('Humidity', title='Humidity'),
    y='Crowdedness',
).properties(width=200, height=200) 

summer_humidity_plot



In [None]:
winter_temp_plot = alt.Chart(winter_months_df).mark_point().encode(
    x=alt.X('Temperature', title='Temperature'),
    y='Crowdedness',
).properties(width=200, height=200)

winter_temp_plot


## 4. Methods

<p>We will use variables that may affect the crowdedness of the golf course: month, season, temperature, and humidity. Specifically, we will focus on how crowdedness changes throughout the year. For this project, we will therefore use the columns Month, Season, Temperature, Humidity, and Crowdedness to answer our question.</p>

<p>We will split the dataset into a training set and a testing set and scale our numerical predictors so that they are on a comparable scale. We will then use K-nearest neighbours regression, choosing the number of neighbours "K" by cross-validation on the training set, and use the resulting model to make predictions on the testing set. During the winter months, we will predict crowdedness from temperature, while during the summer months, we will predict crowdedness from humidity. We will use two scatter plots, one per season, to visualize how crowdedness changes.</p>
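
The steps described above could be sketched roughly as follows for the winter case. This is a minimal illustration under stated assumptions, not our final analysis: the data are synthetic, the names `demo_df`, `knn_pipeline`, and `param_grid` are hypothetical, and the range of candidate K values and the 5-fold setting are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the winter data (invented values, not the golf dataset).
rng = np.random.default_rng(1)
temperature = rng.uniform(-5, 10, size=200)
crowdedness = np.clip(0.4 + 0.03 * temperature + rng.normal(0, 0.05, size=200), 0, 1)
demo_df = pd.DataFrame({"Temperature": temperature, "Crowdedness": crowdedness})

train_df, test_df = train_test_split(demo_df, train_size=0.75, random_state=1)

# Scale the predictor, then fit K-nearest neighbours regression.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Candidate K values (an arbitrary range chosen for this sketch).
param_grid = {"kneighborsregressor__n_neighbors": range(1, 31)}

# 5-fold cross-validation on the training set only; scikit-learn maximises
# the score, hence the negated RMSE.
grid = GridSearchCV(
    knn_pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(train_df[["Temperature"]], train_df["Crowdedness"])

best_k = grid.best_params_["kneighborsregressor__n_neighbors"]
cv_rmse = -grid.best_score_
print(f"best K = {best_k}, cross-validated RMSE = {cv_rmse:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each cross-validation fold, so the held-out fold does not influence the scaling. The summer model would follow the same pattern with `Humidity` as the predictor.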

## 5. Expected Outcomes and Significance

### 5.1 Expected Outcomes

<p>We use a regression algorithm to predict the crowdedness of a golf course on the basis of several key factors: month, season, humidity, and temperature. We expect the model to show how each of the selected factors contributes to fluctuations in crowdedness, which are presumably driven by low temperatures or humidity.</p>
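
To judge how well such a model predicts, one common summary is the root mean squared prediction error (RMSPE) on held-out data. The sketch below assumes synthetic humidity and crowdedness values and a fixed, untuned K of 10, so the printed number says nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic humidity/crowdedness pairs (invented for illustration).
rng = np.random.default_rng(2)
humidity = rng.uniform(40, 90, size=(200, 1))
crowdedness = np.clip(
    1.2 - 0.01 * humidity[:, 0] + rng.normal(0, 0.05, size=200), 0, 1
)

# Simple ordered split: first 150 rows for training, last 50 held out.
X_train, X_test = humidity[:150], humidity[150:]
y_train, y_test = crowdedness[:150], crowdedness[150:]

# K = 10 is an arbitrary placeholder, not a tuned value.
model = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
predictions = model.predict(X_test)

# RMSPE: root mean squared prediction error on the held-out data,
# expressed in the same units as Crowdedness (0 to 1).
rmspe = np.sqrt(mean_squared_error(y_test, predictions))
print(f"test RMSPE = {rmspe:.3f}")
```

Because the error is in the same units as crowdedness, it can be read directly against the 0 to 1 scale of the response.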

### 5.2 Significance

<p>If accurate, our model could help golf clubs plan for a potential drop in crowdedness. The output should highlight the factors that affect crowdedness at a golf course and the main drivers of attendance.</p>

<p>Overall, we expect our regression model to be a useful tool for supporting decision-making and helping to prevent low attendance at the golf club.</p>