# Data Scientist Associate Case Study

## Company Background
GoalZone is a fitness club chain providing five types of fitness classes in Canada. The fitness class schedule is finally back to normal now that the COVID-19 restrictions have been lifted. However, the company has received many complaints from customers who are having a hard time booking a fitness class.

From initial analysis, the program operations team found that the fitness classes are always fully booked but have a low attendance rate per class. To improve this situation, they would like to increase the class capacity available for sign-up whenever a low attendance rate is predicted.


## Customer Question
The operation manager has asked you to answer the following:
- Can you predict the attendance rate for each fitness class? 



## Dataset
The dataset contains the attendance information for the classes scheduled so far this year. The data you will use for this analysis can be accessed here: `"data/fitness_class.csv"`

| Column Name                     | Criteria |
|---------------------------------|----------|
| Day of Week                     | Character, the day of the week the class was scheduled, one of the values from “Mon” to “Sun”. |
| Time                            | Character, the time of the day the class was scheduled, either "AM" or "PM". |
| Class Category                  | Character, the category of the fitness class, one of “Yoga”, “Aqua”, “Strength”, “HIIT”, or “Cycling”. |
| Days Before                     | Numeric, number of days the class stayed fully booked, maximum five days. |
| Class Capacity                  | Numeric, maximum number of members who can sign up for that class, either 15 or 25. Any class capacity of 26 needs to be updated to 25. |
| Attendance                      | Numeric, number of members who actually attended the class. |
| Average Age                     | Numeric, average age of the members signing up for that class. Remove rows where the average age is smaller than 14 because group fitness classes are for members aged 14 and older. |
| Number of New Students          | Numeric, number of new students signing up for this class. |
| Number of Members Over 6 months | Numeric, number of members signing up for the class who have been with the club for more than 6 months. |

# Data Scientist Associate Case Study Submission


In [None]:
# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
# Data Validation

# loading the csv file 
fitness_df = pd.read_csv("data/fitness_class.csv") 

# sum of na values
na_count = fitness_df.isna().sum() 

# class capacity being 26 needs to be updated to 25
fitness_df['class_capacity'] = fitness_df['class_capacity'].replace(26, 25) 

# removing rows with average age < 14 because classes are for members aged 14+
# (>= 14 also drops fractional averages such as 13.5, which > 13 would keep)
fitness_df = fitness_df[fitness_df['age'] >= 14]

# column for proportion of new students
fitness_df['proportion_new'] = fitness_df['new_students']/(fitness_df['new_students']+fitness_df['over_6_month'])

# making sure that the unique values in each column match what is given in the data dictionary
for col in fitness_df.columns:
    print(fitness_df[col].unique())

**Output**:
<br> ['Wed' 'Sun' 'Mon' 'Tue' 'Thu' 'Sat' 'Fri']
<br> ['AM' 'PM']
<br> ['Yoga' 'Aqua' 'Strength' 'HIIT' 'Cycling']
<br> ...
<br>  0.57894737 0.38095238 0.47619048 0.45833333]

## Data Validation


**Explanation:**

To validate the data, I first counted the missing values in each column of the fitness dataframe. After confirming that every column had 0 missing values, I cleaned the class capacity column by replacing any value of 26 with 25: classes only come in sizes of 15 or 25, so a value of 26 is a data entry error. I then removed the rows where the average age of the class was less than 14, because the classes are only for members who are 14 or older. Next, I created a new column that holds the proportion of new customers in each class, for use in later analysis. Finally, I iterated through each column in the dataframe and printed its unique values to make sure they matched the data dictionary.
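
The manual check of unique values can also be expressed as an automated comparison against the data dictionary. The following is a minimal sketch, not the code used for this submission: the `ALLOWED` mapping, the `validate` helper, and the three-row `sample` frame are all hypothetical, and the column names are assumed to be the snake_case names used in the code cells.

```python
import pandas as pd

# Hypothetical allowed values, transcribed from the data dictionary.
ALLOWED = {
    "day_of_week": {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"},
    "time": {"AM", "PM"},
    "class_category": {"Yoga", "Aqua", "Strength", "HIIT", "Cycling"},
    "class_capacity": {15, 25},
}


def validate(df: pd.DataFrame) -> dict:
    """Return a mapping of column name -> set of values not in ALLOWED."""
    problems = {}
    for col, allowed in ALLOWED.items():
        unexpected = set(df[col].unique()) - allowed
        if unexpected:
            problems[col] = unexpected
    return problems


# Tiny made-up sample, not the real dataset.
sample = pd.DataFrame({
    "day_of_week": ["Mon", "Tue", "Sun"],
    "time": ["AM", "PM", "AM"],
    "class_category": ["Yoga", "HIIT", "Aqua"],
    "class_capacity": [15, 26, 25],
    "age": [29.1, 13.5, 41.0],
})

before = validate(sample)  # flags the capacity of 26

# Apply the two cleaning rules from the data dictionary.
sample["class_capacity"] = sample["class_capacity"].replace(26, 25)
sample = sample[sample["age"] >= 14]

after = validate(sample)  # nothing left to flag
print(before, after)
```

A check like this fails loudly if a new export of the file introduces an unexpected category, which printing unique values by hand would not.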

In [None]:
# Exploratory Analysis
# Explore the characteristics of the variables in the data

# characteristic plots

sns.histplot(data=fitness_df, x='age')  # age is roughly normally distributed around 30
plt.show()

sns.histplot(data=fitness_df, x='attendance')  # attendance looks right-skewed
plt.show()

# relationship plots

# the difference in class capacity accounts for the skew of attendance
sns.histplot(data=fitness_df, x='attendance', hue='class_capacity', palette='colorblind')
plt.show()

sns.histplot(data=fitness_df, x='attendance', hue='time', palette='colorblind')  # no difference between AM and PM
plt.show()

sns.boxplot(data=fitness_df, x='class_capacity', y='attendance')  # significant difference in attendance by class capacity
plt.show()

sns.boxplot(data=fitness_df, x='class_category', y='attendance')  # little difference in attendance by class category
plt.show()

# the more days the class is at full capacity, the higher its attendance tends to be
sns.boxplot(data=fitness_df, x='days_before', y='attendance')
plt.show()

# shows little correlation between the proportion of new customers and attendance;
# however, those that go more often also take the 25-capacity class
sns.scatterplot(data=fitness_df, x='proportion_new', y='attendance', hue='class_capacity', palette='colorblind')
plt.show()

# shows a negative correlation between age and attendance;
# however, older members are more likely to take the 15-capacity class
sns.scatterplot(data=fitness_df, x='age', y='attendance', hue='class_capacity', palette='colorblind')
plt.show()




<img src="images/age_1D_hist.png" style="width:400px;height:200;">
<caption><center> <u>Figure 1</u>: Age 1D Histogram <br> </center></caption>

<img src="images/attendance_1D_hist.png" style="width:400px;height:200;">
<caption><center> <u>Figure 2</u>: Attendance 1D Histogram <br> </center></caption>

<img src="images/attendance_by_capacity_1D_hist.png" style="width:400px;height:200;">
<caption><center> <u>Figure 3</u>: Attendance 1D Histogram by Capacity <br> </center></caption>

<img src="images/attendance_by_time_1D_hist.png" style="width:400px;height:200;">
<caption><center> <u>Figure 4</u>: Attendance 1D Histogram by Time<br> </center></caption>

<img src="images/capacity_attendance_boxplot.png" style="width:400px;height:200;">
<caption><center> <u>Figure 5</u>: Capacity vs Attendance Boxplot <br> </center></caption>



In [None]:
# changes to data for modeling
fitness_df = pd.get_dummies(fitness_df, columns=['day_of_week', 'time', 'class_category'])
fitness_df_attendance = fitness_df[['attendance']]
fitness_df_else = fitness_df.drop(columns='attendance')

## Exploratory Analysis

**Explanation:**

I started by creating a pairplot of all the variables in the dataset to get a baseline understanding of their distributions and relationships. I investigated the age distribution first to make sure it was not skewed, since I believed very young and very old customers would be less likely to attend class. I found no such skew: the distribution looked approximately normal with a mean of about 30. I then investigated the attendance distribution, which was right-skewed and looked almost like two different distributions stacked on top of each other. Because of that, I investigated the columns that have only two unique values: time and class capacity.

Plotting the distribution of attendance with a hue for class capacity and, separately, a hue for time, I found that class capacity strongly affected attendance while time did not. Next, I looked at boxplots of attendance by class category and did not find a significant difference between categories. I also looked at boxplots of attendance by the number of days the class was fully booked before it took place (the 'days_before' column) and found that the more days a class is fully booked, the higher its attendance tends to be. I then looked at two scatter plots, one of the proportion of new customers in a class against attendance and one of the average age against attendance, each with a hue for class capacity. In the first, attendance did not vary much with the proportion of new customers; however, the hue again makes it obvious that classes with a capacity of 25 have higher attendance. In the second, there is a negative correlation between age and attendance; however, the hue shows that older customers are more likely to attend the classes with a capacity of only 15.

Overall in my analysis I found that: 
* variables most impactful on attendance are class_capacity, the number of days the class is fully booked before the class, and the age of the customers
* negative correlation between age and attendance
* older customers are more likely to go into a lower capacity class which could be the reason lower capacity classes have lower attendance or vice versa
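
The visual findings above can be backed by numbers with a group mean and a correlation coefficient. This is a minimal sketch on made-up rows (the `toy` frame is an assumption for illustration, not the real data), so the printed values say nothing about the actual dataset. The `attendance_rate` column is also an illustrative addition: it divides attendance by capacity so that the two class sizes can be compared on the same scale.

```python
import pandas as pd

# Made-up rows for illustration only; the real data is in data/fitness_class.csv.
toy = pd.DataFrame({
    "class_capacity": [15, 15, 15, 25, 25, 25],
    "age": [38.0, 35.0, 33.0, 30.0, 27.0, 25.0],
    "attendance": [5, 7, 8, 12, 15, 18],
})

# Mean attendance per capacity level (numeric counterpart of the boxplot).
mean_by_capacity = toy.groupby("class_capacity")["attendance"].mean()

# Pearson correlation between average age and attendance.
age_corr = toy["age"].corr(toy["attendance"])

# Attendance rate normalises attendance by the size of the class.
toy["attendance_rate"] = toy["attendance"] / toy["class_capacity"]

print(mean_by_capacity)
print(age_corr)
print(toy[["class_capacity", "attendance_rate"]])
```

Running the same three computations on `fitness_df` would show whether capacity still matters once attendance is expressed as a rate rather than a raw count.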


To enable modeling, I converted the categorical columns to numerical values with one-hot encoding, since the scikit-learn models used here require numeric input. I then separated the attendance column from the rest of the dataframe to use it as the target for my models.
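
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies` does to two categorical columns. The three-row `toy` frame is made up for illustration. The `drop_first=True` variant at the end is an optional alternative that the analysis above does not use; it keeps one fewer indicator per column, which some prefer for linear models because a full set of indicators is redundant alongside an intercept.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "time": ["AM", "PM", "AM"],
    "class_category": ["Yoga", "HIIT", "Yoga"],
    "attendance": [8, 12, 9],
})

# One indicator column per category level; untouched columns come first.
encoded = pd.get_dummies(toy, columns=["time", "class_category"])

# Split into target and features.
y = encoded["attendance"]
X = encoded.drop(columns="attendance")
print(list(X.columns))
# ['time_AM', 'time_PM', 'class_category_HIIT', 'class_category_Yoga']

# Optional variant: drop the first level of each encoded column.
reduced = pd.get_dummies(toy, columns=["time", "class_category"], drop_first=True)
print(list(reduced.columns))
# ['attendance', 'time_PM', 'class_category_Yoga']
```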

In [None]:
# Model Fitting
# Choose and fit a baseline model
# Choose and fit a comparison model

# Start coding here... 


# Baseline model of a multiple regression

# select the feature columns explicitly
fitness_df_else = fitness_df_else[['class_capacity', 'days_before', 'age', 'new_students', 'over_6_month',
       'proportion_new', 'day_of_week_Fri', 'day_of_week_Mon',
       'day_of_week_Sat', 'day_of_week_Sun', 'day_of_week_Thu',
       'day_of_week_Tue', 'day_of_week_Wed', 'time_AM', 'time_PM',
       'class_category_Aqua', 'class_category_Cycling', 'class_category_HIIT',
       'class_category_Strength', 'class_category_Yoga']]


# split the features (fitness_df_else) and target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(fitness_df_else, fitness_df_attendance, test_size=0.2, random_state=42)

reg = LinearRegression()

reg.fit(X_train, y_train)

y_pred_reg = reg.predict(X_test)

# Comparison model
gradient_reg = GradientBoostingRegressor(random_state=42)

# passing a 1-D target avoids scikit-learn's column-vector DataConversionWarning
gradient_reg.fit(X_train, y_train.values.ravel())

y_pred_gradient = gradient_reg.predict(X_test)

## Model Fitting

**Explanation:**

Since the question asks us to predict the attendance of a fitness class, and attendance is a numeric variable, I treated this as a regression problem. Given that attendance depends on several variables such as age and class capacity, which are themselves related, I started with a simple multiple linear regression, which uses several predictor variables at once, as my baseline model. My comparison model was a gradient boosted regression model. I chose it because I wanted to reduce the chance of overfitting and improve the performance of my model.
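
The baseline-versus-comparison setup can be reproduced end to end on synthetic data, which is useful for checking the workflow without access to the CSV file. This is a sketch under stated assumptions: the feature ranges mimic the data dictionary, but the coefficients and noise used to generate `attendance` are invented, so the fitted models say nothing about GoalZone's real classes.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Synthetic features with ranges loosely based on the data dictionary.
capacity = rng.choice([15, 25], size=n)
days_before = rng.integers(1, 6, size=n)  # 1 to 5 days
age = rng.normal(30, 4, size=n)

# Invented linear relationship plus noise (an assumption, not a finding).
attendance = 0.5 * capacity + 0.8 * days_before - 0.2 * age + rng.normal(0, 1, size=n)

X = np.column_stack([capacity, days_before, age])
X_train, X_test, y_train, y_test = train_test_split(
    X, attendance, test_size=0.2, random_state=42
)

# Fit both models on the same split so their predictions are comparable.
models = {
    "linear": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
preds = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds[name] = model.predict(X_test)

print({name: p[:3] for name, p in preds.items()})
```

Keeping both models in one dictionary and fitting them on the same split makes it harder to evaluate them on different data by accident.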

In [None]:
# Model Evaluation
# Choose a metric and evaluate the performance of the two models

# Start coding here... 

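# RMSE is the square root of the MSE; taking the root explicitly avoids the
# `squared` argument, which was deprecated in scikit-learn 1.4
baseline_RMSE = mean_squared_error(y_test, y_pred_reg) ** 0.5
print('baseline RMSE: ' + str(baseline_RMSE))

comparison_RMSE = mean_squared_error(y_test, y_pred_gradient) ** 0.5
print('comparison RMSE: ' + str(comparison_RMSE))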

## Model Evaluation

**Explanation:**

I chose the root mean squared error (RMSE) as my metric because it measures how far the predicted values are from the observed values, in the same units as attendance, which makes it easy to compare two regression models. The evaluation shows that the baseline and comparison models perform at about the same level; however, the baseline model (multiple regression) performs slightly better because it has a lower RMSE.
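
For reference, the RMSE of $n$ predictions $\hat{y}_i$ against observations $y_i$ is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The sketch below computes it by hand on four made-up values (the `rmse` helper and the numbers are illustrative, not results from this analysis) and compares it with a naive predictor that always returns the mean, which gives a reference point for judging whether an RMSE is good.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up attendance values and predictions.
y_true = [10, 12, 8, 15]
y_pred = [11, 10, 8, 13]

# Errors are -1, 2, 0, 2 -> squares 1, 4, 0, 4 -> mean 2.25 -> root 1.5
model_rmse = rmse(y_true, y_pred)
print(model_rmse)  # 1.5

# Naive reference: always predict the mean of the observed values.
naive_rmse = rmse(y_true, [np.mean(y_true)] * len(y_true))
print(naive_rmse > model_rmse)  # True
```

A model whose RMSE is not clearly below the naive reference adds little over predicting the average attendance for every class.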
