# CardioGood Fitness

The market research team at AdRight is tasked with identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The team decides to investigate whether customer characteristics differ across the product lines, and collects data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months.

The data are stored in the CardioGoodFitness.csv file. The team identifies the following customer variables to study: 

 - product purchased: TM195, TM498, or TM798; 
 - gender; 
 - age, in years;
 - education, in years; 
 - relationship status, single or partnered; 
 - annual household income ($); 
 - average number of times the customer plans to use the treadmill each week; 
 - average number of miles the customer expects to walk/run each week; 
 - and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape. 

Perform descriptive analytics to create a customer profile for each CardioGood Fitness treadmill product line.

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport

In [2]:
import os
for dirname, _, filenames in os.walk('./'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./CardioGoodFitness.csv
./Untitled.ipynb
./.ipynb_checkpoints/Untitled-checkpoint.ipynb


In [4]:
df = pd.read_csv("CardioGoodFitness.csv")

In [6]:
df.head()

   Product  Age  Gender  Education  MaritalStatus  Usage  Fitness  Income  Miles
0    TM195   18    Male         14         Single      3        4   29562    112
1    TM195   19    Male         15         Single      2        3   31836     75
2    TM195   19  Female         14      Partnered      4        3   30699     66
3    TM195   19    Male         12         Single      3        3   32973     85
4    TM195   20    Male         13      Partnered      4        2   35247     47


In [5]:
df.describe()

             Age  Education     Usage   Fitness        Income       Miles
count      180.0      180.0     180.0     180.0         180.0       180.0
mean   28.788889  15.572222  3.455556  3.311111  53719.577778  103.194444
std     6.943498   1.617055  1.084797  0.958869  16506.684226   51.863605
min         18.0       12.0       2.0       1.0       29562.0        21.0
25%         24.0       14.0       3.0       3.0      44058.75        66.0
50%         26.0       16.0       3.0       3.0       50596.5        94.0
75%         33.0       16.0       4.0       4.0       58668.0      114.75
max         50.0       21.0       7.0       5.0      104581.0       360.0
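
The summary above pools all three treadmill models, while the stated goal is a profile for each product line. One possible way to get there is to group by `Product` before summarizing. The sketch below is only an illustration: the helper name `profile_by_product` is hypothetical, and the small frame simply retypes the five rows printed by `df.head()` so the block runs on its own. On the real data you would pass the full `df` instead.

```python
import pandas as pd

# The five rows shown by df.head(), retyped so this example is self-contained.
# With the actual file, use: sample = pd.read_csv("CardioGoodFitness.csv")
sample = pd.DataFrame({
    "Product": ["TM195"] * 5,
    "Age": [18, 19, 19, 19, 20],
    "Gender": ["Male", "Male", "Female", "Male", "Male"],
    "Education": [14, 15, 14, 12, 13],
    "MaritalStatus": ["Single", "Single", "Partnered", "Single", "Partnered"],
    "Usage": [3, 2, 4, 3, 4],
    "Fitness": [4, 3, 3, 3, 2],
    "Income": [29562, 31836, 30699, 32973, 35247],
    "Miles": [112, 75, 66, 85, 47],
})

NUMERIC_COLS = ["Age", "Education", "Usage", "Fitness", "Income", "Miles"]


def profile_by_product(frame):
    """Hypothetical helper: one row of summary statistics per product."""
    # Mean of every numeric variable within each product line
    numeric = frame.groupby("Product")[NUMERIC_COLS].mean()
    # Share of each gender and marital status within each product line
    gender = pd.crosstab(frame["Product"], frame["Gender"], normalize="index")
    marital = pd.crosstab(frame["Product"], frame["MaritalStatus"], normalize="index")
    return numeric.join(gender).join(marital)


profile = profile_by_product(sample)
print(profile.round(2))
```

On these five sample rows the result has a single TM195 row; on the full dataset it would have one row per product, which can be compared side by side.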


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product        180 non-null    object
 1   Age            180 non-null    int64 
 2   Gender         180 non-null    object
 3   Education      180 non-null    int64 
 4   MaritalStatus  180 non-null    object
 5   Usage          180 non-null    int64 
 6   Fitness        180 non-null    int64 
 7   Income         180 non-null    int64 
 8   Miles          180 non-null    int64 
dtypes: int64(6), object(3)
memory usage: 12.8+ KB


In [9]:
#EDA using Pandas Profiling
file = ProfileReport(df)
file.to_file(output_file='output.html')

In [10]:
# Encode the binary categorical variables as integers
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df['MaritalStatus'] = df['MaritalStatus'].map({'Single': 0, 'Partnered': 1})
df.head()

   Product  Age  Gender  Education  MaritalStatus  Usage  Fitness  Income  Miles
0    TM195   18       0         14              0      3        4   29562    112
1    TM195   19       0         15              0      2        3   31836     75
2    TM195   19       1         14              1      4        3   30699     66
3    TM195   19       0         12              0      3        3   32973     85
4    TM195   20       0         13              1      4        2   35247     47


In [11]:
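# Perform one-hot encoding on the Product column
# (dtype=int keeps the 0/1 display on pandas versions that default to booleans)
one_hot = pd.get_dummies(df['Product'], dtype=int)
# Drop the Product column, as it is now encoded
df = df.drop('Product', axis=1)
# Join the encoded columns back onto df
df = df.join(one_hot)
df.head()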

   Age  Gender  Education  MaritalStatus  Usage  Fitness  Income  Miles  TM195  TM498  TM798
0   18       0         14              0      3        4   29562    112      1      0      0
1   19       0         15              0      2        3   31836     75      1      0      0
2   19       1         14              1      4        3   30699     66      1      0      0
3   19       0         12              0      3        3   32973     85      1      0      0
4   20       0         13              1      4        2   35247     47      1      0      0


In [13]:
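from sklearn.model_selection import train_test_split

# Target: self-rated fitness (1-5); features: all remaining columns
y = df['Fitness']
X = df.drop('Fitness', axis=1)

# Hold out 15% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)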
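
As a quick sanity check on the split, the sizes of the two sets follow from the row count reported by `df.info()` (180) and `test_size=0.15`. The sketch below assumes scikit-learn's behaviour of rounding the test-set size up when `test_size` is a float; the variable names are illustrative only.

```python
import math

n_rows = 180        # rows reported by df.info()
test_size = 0.15    # fraction passed to train_test_split

# For a float test_size, scikit-learn takes the ceiling of test_size * n_rows
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# With so few test rows, accuracy can only change in steps of 1 / n_test
accuracy_step = 1 / n_test

print(n_train, n_test, round(accuracy_step, 3))
```

This gives 153 training rows and 27 test rows, so each test prediction shifts accuracy by roughly 0.037. A test accuracy of 0.7037, for instance, corresponds to 19 of 27 correct, which is worth keeping in mind before reading much into small differences between models.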

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

names = ["Linear Regression", "Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    LinearRegression(),
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

In [16]:
scores = {}
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    # score() is mean accuracy for classifiers and R^2 for LinearRegression
    scores[name] = clf.score(X_test, y_test)

# Print the models from best to worst test score
for name, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(name, score)

Linear Regression 0.736775113122601
Decision Tree 0.7037037037037037
AdaBoost 0.7037037037037037
Random Forest 0.6666666666666666
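
One caveat when reading this ranking: `LinearRegression.score` returns the coefficient of determination (R²), whereas a classifier's `score` returns mean accuracy, so the first line is not on the same scale as the others. The sketch below illustrates the difference with plain NumPy on made-up ratings. The arrays are hypothetical and are not taken from the dataset, and rounding the continuous predictions to the 1-to-5 scale is just one possible way to put a regressor on an accuracy footing, not something this notebook does.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Hypothetical Fitness ratings and hypothetical regression outputs
y_true = np.array([3, 3, 4, 2, 5])
y_continuous = np.array([3.4, 3.6, 3.6, 2.4, 4.4])

# Round the regression output to the nearest valid rating (1 to 5)
y_rounded = np.clip(np.rint(y_continuous), 1, 5).astype(int)

r2_value = r_squared(y_true, y_continuous)
acc_value = accuracy(y_true, y_rounded)
print(round(r2_value, 3), round(acc_value, 3))
```

On these made-up numbers the same set of predictions has an R² of about 0.77 but an accuracy of 0.6, which shows why the two kinds of score should not be ranked against each other directly.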
