## Instructions

● Use Python or R to perform the tasks required.

● Write your solutions in the workspace provided from your certification page.

● Include all of the visualizations you create to complete the tasks.

● Visualizations must be visible in the published version of the workspace. Links to external visualizations will not be accepted.

● You do not need to include code unless the question says you must.

● You must pass all criteria to pass this exam. The full criteria can be found here.


## Introduction
GoalZone is a fitness club chain in Canada.

GoalZone offers a range of fitness classes in two capacities - 25 and 15.

Some classes are always fully booked. Fully booked classes often have a low attendance rate.

GoalZone wants to increase the number of spaces available for classes.

They want to do this by predicting whether the member will attend the class or not.

If they can predict a member will not attend the class, they can make another space available.


## Initialize and load data

In [2]:
# import all needed libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read the data: one record per class booking made by a member

df = pd.read_csv("fitness_class_2212.csv")

# print the number of rows & columns

print("\nThe dataset has {} rows and {} columns.\n".format(df.shape[0], df.shape[1]))

# view the first few rows

df.head()


The dataset has 1500 rows and 8 columns.



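| | booking_id | months_as_member | weight | days_before | day_of_week | time | category | attended |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 79.56 | 8 | Wed | PM | Strength | 0 |
| 1 | 2 | 10 | 79.01 | 2 | Mon | AM | HIIT | 0 |
| 2 | 3 | 16 | 74.53 | 14 | Sun | AM | Strength | 0 |
| 3 | 4 | 5 | 86.12 | 10 | Fri | AM | Cycling | 0 |
| 4 | 5 | 15 | 69.29 | 8 | Thu | AM | HIIT | 0 |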


## Data Validation

**Original dataset**

In [3]:
# Data Validation
# Check every column in the data against the criteria in the data description table

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   booking_id        1500 non-null   int64  
 1   months_as_member  1500 non-null   int64  
 2   weight            1480 non-null   float64
 3   days_before       1500 non-null   object 
 4   day_of_week       1500 non-null   object 
 5   time              1500 non-null   object 
 6   category          1500 non-null   object 
 7   attended          1500 non-null   int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 93.9+ KB


**Missing Data**

In [4]:
# the counts of missing values in each column

df.isna().sum()

booking_id           0
months_as_member     0
weight              20
days_before          0
day_of_week          0
time                 0
category             0
attended             0
dtype: int64

**Unique Values for Categorical Variables & MAX and MIN for Numeric Values**

In [7]:
# unique values of the categorical columns and max/min of the numeric columns

print("\nUnique categories:\n")
print("day_of_week: ", df.day_of_week.unique().tolist())
print("time: ", df.time.unique().tolist())
print("category: ", df.category.unique().tolist())
print("attended: ", df.attended.unique().tolist())

print("\nNumeric values:\n")
print("Max & min booking id: ", df.booking_id.max(), "&", df.booking_id.min())
print("Max & min months_as_member: ", df.months_as_member.max(), "&", df.months_as_member.min())
print("Max & min weight: ", df.weight.max(), "&", df.weight.min())
# NOTE: days_before has dtype object (text), so max/min here compare strings, not numbers
print("Max & min days_before: ", df.days_before.max(), "&", df.days_before.min(), "\n")


Unique categories:

day_of_week:  ['Wed', 'Mon', 'Sun', 'Fri', 'Thu', 'Wednesday', 'Fri.', 'Tue', 'Sat', 'Monday']
time:  ['PM', 'AM']
category:  ['Strength', 'HIIT', 'Cycling', 'Yoga', '-', 'Aqua']
attended:  [0, 1]

Numeric values:

Max & min booking id:  1500 & 1
Max & min months_as_member:  148 & 1
Max & min weight:  170.52 & 55.41
Max & min days_before:  9 & 1 
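
The `9 & 1` result for `days_before` looks suspicious, because `df.head()` already shows values of 10 and 14. Since `df.info()` reports the column as `object`, the comparison is most likely lexicographic. A minimal sketch using only the five `days_before` values visible in `df.head()`, stored as text to mimic the column's dtype:

```python
import pandas as pd

# The five days_before values shown in df.head(), held as text (object dtype)
days_text = pd.Series(["8", "2", "14", "10", "8"])

# String comparison is character by character: "10" < "14" < "2" < "8"
text_max, text_min = days_text.max(), days_text.min()

# After conversion to numbers the ordering is the intended one
days_num = pd.to_numeric(days_text)
num_max, num_min = days_num.max(), days_num.min()

print("as text:   ", text_max, "&", text_min)
print("as numbers:", num_max, "&", num_min)
```

The text version reports `8 & 10` while the numeric version reports `14 & 2`, so the column needs converting to a numeric dtype before its range can be validated.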



**Correcting values**
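
The validation output points to four issues: 20 missing `weight` values, a text-typed `days_before`, inconsistent day names (`Wednesday`, `Fri.`, `Monday`), and a `-` placeholder in `category`. The following is a minimal sketch of one possible set of corrections, not necessarily the one required by the exam criteria. The sample frame `raw`, the helper `clean_bookings`, the `"12 days"` text format, the `"unknown"` label, and mean imputation for `weight` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics the issues found during validation
raw = pd.DataFrame({
    "weight": [79.56, np.nan, 74.53, 86.12],
    "days_before": ["8", "12 days", "14", "10"],   # "12 days" format is an assumption
    "day_of_week": ["Wednesday", "Fri.", "Monday", "Sun"],
    "category": ["Strength", "-", "HIIT", "Aqua"],
})

def clean_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = df.copy()
    # Keep the leading digits of days_before and cast to integer
    out["days_before"] = (
        out["days_before"].astype(str).str.extract(r"(\d+)", expand=False).astype(int)
    )
    # Normalise day names to three letters: "Wednesday" -> "Wed", "Fri." -> "Fri"
    out["day_of_week"] = out["day_of_week"].str.strip().str.rstrip(".").str[:3]
    # Replace the "-" placeholder with an explicit label (label is an assumption)
    out["category"] = out["category"].replace("-", "unknown")
    # Fill missing weights with the column mean (imputation choice is an assumption)
    out["weight"] = out["weight"].fillna(out["weight"].mean())
    return out

clean = clean_bookings(raw)
print(clean)
```

Applied to the notebook's data this would be `df = clean_bookings(df)`, after which `df.info()` and the unique-value check can be rerun to confirm the columns match their descriptions.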

**A Post-correction Data Description**

In [8]:
# dataframe description post-corrections

df.describe(include='all').fillna('')

| | booking_id | months_as_member | weight | days_before | day_of_week | time | category | attended |
|---|---|---|---|---|---|---|---|---|
| count | 1500.0 | 1500.0 | 1480.0 | 1500.0 | 1500 | 1500 | 1500 | 1500.0 |
| unique | | | | 31.0 | 10 | 2 | 6 | |
| top | | | | 10.0 | Fri | AM | HIIT | |
| freq | | | | 293.0 | 279 | 1141 | 667 | |
| mean | 750.5 | 15.628667 | 82.610378 | | | | | 0.302667 |
| std | 433.157015 | 12.926543 | 12.765859 | | | | | 0.459565 |
| min | 1.0 | 1.0 | 55.41 | | | | | 0.0 |
| 25% | 375.75 | 8.0 | 73.49 | | | | | 0.0 |
| 50% | 750.5 | 12.0 | 80.76 | | | | | 0.0 |
| 75% | 1125.25 | 19.0 | 89.52 | | | | | 1.0 |


## 1. For every column in the data:

### a. State whether the values match the description given in the table above.

### b. State the number of missing values in the column.

### c. Describe what you did to make values match the description if they did not match.

## 2. Create a visualization that shows how many bookings attended the class. Use the visualization to:

### a. State which category of the variable attended has the most observations

### b. Explain whether the observations are balanced across categories of the variable attended
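
One way to approach question 2 is a simple bar chart of the two `attended` classes. The sketch below rebuilds the class counts from the `describe()` output (mean of `attended` = 0.302667 over 1500 rows) so that it runs on its own. The axis labels and colours are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts implied by describe(): mean(attended) = 0.302667 over 1500 bookings
n_rows = 1500
n_attended = round(0.302667 * n_rows)
counts = pd.Series({
    "0 (did not attend)": n_rows - n_attended,
    "1 (attended)": n_attended,
})
share = counts / counts.sum()

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(counts.index, counts.values, color=["tab:gray", "tab:blue"])
ax.set_xlabel("attended")
ax.set_ylabel("Number of bookings")
ax.set_title("Bookings by attendance")

# Label each bar with its count and share of all bookings
for i, (n, s) in enumerate(zip(counts.values, share.values)):
    ax.text(i, n, f"{n} ({s:.0%})", ha="center", va="bottom")

print(counts)
```

Inside the notebook the same counts would come from `df["attended"].value_counts()`. With roughly 70% of bookings in class 0, the classes are not balanced, which matters when choosing evaluation metrics later.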

## 3. Describe the distribution of the number of months as a member. Your answer must include a visualization that shows the distribution.

## 4. Describe the relationship between attendance and number of months as a member. Your answer must include a visualization to demonstrate the relationship.
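
Questions 3 and 4 can be sketched with a histogram of `months_as_member` and a box plot of it split by `attended`. The frame `demo` below is a hypothetical, randomly generated stand-in for `df`. Its right skew and its link between membership length and attendance are assumptions, not findings from the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: right-skewed membership lengths (assumption)
rng = np.random.default_rng(0)
months = np.maximum(rng.gamma(shape=2.0, scale=8.0, size=500).round(), 1).astype(int)
# Assumed relationship: longer memberships attend more often
attended = (rng.random(500) < np.clip(months / 40, 0.05, 0.95)).astype(int)
demo = pd.DataFrame({"months_as_member": months, "attended": attended})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Question 3: distribution of months as a member
ax_hist.hist(demo["months_as_member"], bins=30)
ax_hist.set_xlabel("months_as_member")
ax_hist.set_ylabel("Number of bookings")
ax_hist.set_title("Distribution of months as a member")

# Question 4: months as a member, split by attendance
groups = [demo.loc[demo["attended"] == k, "months_as_member"].to_numpy() for k in (0, 1)]
ax_box.boxplot(groups)
ax_box.set_xticklabels(["0", "1"])
ax_box.set_xlabel("attended")
ax_box.set_ylabel("months_as_member")
ax_box.set_title("Months as a member by attendance")

# Group medians give a numeric summary to quote alongside the plot
medians = demo.groupby("attended")["months_as_member"].median()
print(medians)
```

With the real data, replacing `demo` with `df` is enough. If the histogram shows a long right tail, a log-scaled x-axis is a common way to make the bulk of the distribution easier to read.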

## 5. The business wants to predict whether members will attend using the data provided. State the type of machine learning problem that this is (regression/ classification/clustering).

## 6. Fit a baseline model to predict whether members will attend using the data provided. You must include your code.

## 7. Fit a comparison model to predict whether members will attend using the data provided. You must include your code.

## 8. Explain why you chose the two models used in parts 6 and 7.

## 9. Compare the performance of the two models used in parts 6 and 7, using any method suitable. You must include your code.
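
Questions 6, 7 and 9 call for two fitted classifiers and a comparison. The sketch below uses logistic regression as a baseline and a shallow decision tree as the comparison model. These are illustrative choices, not models prescribed by the task. The frame `demo` is randomly generated, and its feature distributions and attendance rule are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the cleaned df (all distributions are assumptions)
rng = np.random.default_rng(42)
n = 600
demo = pd.DataFrame({
    "months_as_member": np.maximum(rng.gamma(2.0, 8.0, n).round(), 1),
    "weight": rng.normal(82, 12, n),
    "days_before": rng.integers(1, 30, n),
})
# Assumed rule: attendance probability rises with membership length
p_attend = 1 / (1 + np.exp(-(demo["months_as_member"] - 20) / 6))
demo["attended"] = (rng.random(n) < p_attend).astype(int)

X = demo[["months_as_member", "weight", "days_before"]]
y = demo["attended"]
# Stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Question 6: baseline model
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Question 7: comparison model
comparison = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Question 9: compare both models on the same held-out set
scores = {
    "logistic_regression": accuracy_score(y_test, baseline.predict(X_test)),
    "decision_tree": accuracy_score(y_test, comparison.predict(X_test)),
}
for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Because about 70% of the real bookings are non-attendance, accuracy alone can flatter a model that mostly predicts class 0. Reporting precision and recall for class 1 alongside it, for example with `sklearn.metrics.classification_report`, gives a fairer comparison.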

## 10. Explain which model performs better and why.