
## Table of Contents

    Assignment
    Data Exploration and Processing
        Statistical Summary
    Non-Graphical Analysis
        Value Counts
        Unique Attributes
    Graphical Analysis
        Univariate Analysis - Numerical Variables
        Univariate Analysis - Categorical Variables
        Bivariate Analysis
        Multivariate Analysis
    Correlation Analysis
    Marginal & Conditional Probabilities
    Outlier Detection
    Actionable Insights & Recommendations



In [51]:
## Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stat
import warnings
warnings.filterwarnings('ignore')
import math
import statistics

In [52]:
## load and read the data
df = pd.read_csv('aerofit_treadmill_data.csv')
df.head(10)

   Product  Age  Gender  Education  MaritalStatus  Usage  Fitness  Income  Miles
0    KP281   18    Male         14         Single      3        4   29562    112
1    KP281   19    Male         15         Single      2        3   31836     75
2    KP281   19  Female         14      Partnered      4        3   30699     66
3    KP281   19    Male         12         Single      3        3   32973     85
4    KP281   20    Male         13      Partnered      4        2   35247     47
5    KP281   20  Female         14      Partnered      3        3   32973     66
6    KP281   21  Female         14      Partnered      3        3   35247     75
7    KP281   21    Male         13         Single      3        3   32973     85
8    KP281   21    Male         15         Single      5        4   35247    141
9    KP281   21  Female         15      Partnered      2        3   37521     85


- Results show that:
    Age --> most customers are in their 20s and early 30s (median 26; 75% are 33 or younger), and the oldest is 50
    


In [53]:
# check columns
df.columns

Index(['Product', 'Age', 'Gender', 'Education', 'MaritalStatus', 'Usage',
       'Fitness', 'Income', 'Miles'],
      dtype='object')

In [54]:
## check data types 
df.dtypes

Product          object
Age               int64
Gender           object
Education         int64
MaritalStatus    object
Usage             int64
Fitness           int64
Income            int64
Miles             int64
dtype: object

In [55]:
## define a function that converts to category

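def convert_to_category(df, *cols):
    """Convert each named column of df to the category dtype, in place."""
    for col in cols:
        df[col] = df[col].astype('category')

## Apply to the columns (pass the names as separate arguments, not as one list)
convert_to_category(df, 'Gender', 'Product', 'MaritalStatus')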


## check the changes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Product        180 non-null    category
 1   Age            180 non-null    int64   
 2   Gender         180 non-null    category
 3   Education      180 non-null    int64   
 4   MaritalStatus  180 non-null    category
 5   Usage          180 non-null    int64   
 6   Fitness        180 non-null    int64   
 7   Income         180 non-null    int64   
 8   Miles          180 non-null    int64   
dtypes: category(3), int64(6)
memory usage: 9.5 KB
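
As an alternative to mutating `df` one column at a time, the same conversion can be sketched with a single `DataFrame.astype` call, which accepts a column-to-dtype mapping and returns a new frame. The `sample` frame below is a hypothetical stand-in built from the first three rows of the dataset, and the names `sample`, `categorical_cols`, and `converted` are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df: the first three rows of the dataset.
sample = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP281"],
    "Gender": ["Male", "Male", "Female"],
    "MaritalStatus": ["Single", "Single", "Partnered"],
    "Age": [18, 19, 19],
})

categorical_cols = ["Gender", "Product", "MaritalStatus"]

# astype accepts a {column: dtype} mapping and returns a new DataFrame,
# leaving the original frame untouched.
converted = sample.astype({col: "category" for col in categorical_cols})

print(converted.dtypes)
```

Because `astype` returns a copy, this form avoids side effects on the input frame, at the cost of having to reassign the result (for example `df = df.astype(...)`).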


In [56]:
## check the statistical description
df.describe(include='all') 

        Product        Age  Gender  Education  MaritalStatus     Usage   Fitness        Income       Miles
count       180      180.0     180      180.0            180     180.0     180.0         180.0       180.0
unique        3        NaN       2        NaN              2       NaN       NaN           NaN         NaN
top       KP281        NaN    Male        NaN      Partnered       NaN       NaN           NaN         NaN
freq         80        NaN     104        NaN            107       NaN       NaN           NaN         NaN
mean        NaN  28.788889     NaN  15.572222            NaN  3.455556  3.311111  53719.577778  103.194444
std         NaN   6.943498     NaN   1.617055            NaN  1.084797  0.958869  16506.684226   51.863605
min         NaN       18.0     NaN       12.0            NaN       2.0       1.0       29562.0        21.0
25%         NaN       24.0     NaN       14.0            NaN       3.0       3.0      44058.75        66.0
50%         NaN       26.0     NaN       16.0            NaN       3.0       3.0       50596.5        94.0
75%         NaN       33.0     NaN       16.0            NaN       4.0       4.0       58668.0      114.75




#### Observations:

    - There are no missing values in the data.
    - There are 3 unique products in the dataset.
    - KP281 is the most frequent product (80 of 180 records).
    - Customer age ranges from 18 to 50, with a mean of 28.79; 75% of customers are 33 or younger.
    - 16 years of education is the typical value: both the median and the 75th percentile are 16, so 75% of customers have 16 or fewer years of education.
    - Of the 180 customers, 104 are male and the remaining 76 are female.
    - The standard deviations of Income (about 16,507) and Miles (about 51.9) are large relative to their means (about 53,720 and 103.2), so these variables may contain outliers.
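
The suspicion about outliers in Income and Miles can be made concrete with a small sketch of the 1.5 × IQR rule (Tukey fences), using only the quartiles reported by `describe()`. The helper `iqr_fences` and the `quartiles` dictionary are hypothetical names introduced for illustration; this is one common convention, not necessarily the method used later in the notebook.

```python
# 25% and 75% quartiles copied from the describe() output.
quartiles = {
    "Income": (44058.75, 58668.0),
    "Miles": (66.0, 114.75),
}

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

fences = {name: iqr_fences(q1, q3) for name, (q1, q3) in quartiles.items()}

for name, (lower, upper) in fences.items():
    print(f"{name}: lower fence = {lower:.3f}, upper fence = {upper:.3f}")
```

Under this rule, the minimum Income (29562) and minimum Miles (21) both sit above their lower fences, so any outliers would have to lie on the high side. On the full data, the candidate rows could be listed with a filter such as `df[df['Income'] > fences['Income'][1]]`.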



In [59]:
## check whether there are any null values

df.isna().sum()

Product          0
Age              0
Gender           0
Education        0
MaritalStatus    0
Usage            0
Fitness          0
Income           0
Miles            0
dtype: int64

In [58]:
## check the count of each product

df['Product'].value_counts()

Product
KP281    80
KP481    60
KP781    40
Name: count, dtype: int64
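
Raw counts are easier to compare as proportions. The sketch below rebuilds a stand-in `Product` column from the counts printed above and uses the `normalize=True` option of `value_counts` to get relative frequencies. The names `product` and `shares` are illustrative assumptions; on the real data the equivalent call would be `df['Product'].value_counts(normalize=True)`.

```python
import pandas as pd

# Rebuild the Product column from the printed counts (80 / 60 / 40).
product = pd.Series(
    ["KP281"] * 80 + ["KP481"] * 60 + ["KP781"] * 40,
    name="Product",
)

# normalize=True returns relative frequencies instead of raw counts.
shares = product.value_counts(normalize=True)

# Express the shares as percentages rounded to one decimal place.
print((shares * 100).round(1))
```

With these counts, KP281 accounts for about 44.4% of records, KP481 for about 33.3%, and KP781 for about 22.2%.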