**1. Importing Data**

Steps:
*   Import each library used in the project.
*   Load the dataset (downloaded from Kaggle) into Python using pandas.
*   Inspect the data to understand its structure, data types, and summary statistics.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

path = "/content/gym_members_exercise_tracking.csv"

gym_data = pd.read_csv(path)

# Info for the columns, number of rows and data types
gym_data.info()

print()
print()

gym_data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 973 entries, 0 to 972
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Age                            973 non-null    int64  
 1   Gender                         973 non-null    object 
 2   Weight (kg)                    973 non-null    float64
 3   Height (m)                     973 non-null    float64
 4   Max_BPM                        973 non-null    int64  
 5   Avg_BPM                        973 non-null    int64  
 6   Resting_BPM                    973 non-null    int64  
 7   Session_Duration (hours)       973 non-null    float64
 8   Calories_Burned                973 non-null    float64
 9   Workout_Type                   973 non-null    object 
 10  Fat_Percentage                 973 non-null    float64
 11  Water_Intake (liters)          973 non-null    float64
 12  Workout_Frequency (days/week)  973 non-null    int

| | Age | Weight (kg) | Height (m) | Max_BPM | Avg_BPM | Resting_BPM | Session_Duration (hours) | Calories_Burned | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 | 973.0 |
| mean | 38.683453 | 73.854676 | 1.72258 | 179.883864 | 143.766701 | 62.223022 | 1.256423 | 905.422405 | 24.976773 | 2.626619 | 3.321686 | 1.809866 | 24.912127 |
| std | 12.180928 | 21.2075 | 0.12772 | 11.525686 | 14.345101 | 7.32706 | 0.343033 | 272.641516 | 6.259419 | 0.600172 | 0.913047 | 0.739693 | 6.660879 |
| min | 18.0 | 40.0 | 1.5 | 160.0 | 120.0 | 50.0 | 0.5 | 303.0 | 10.0 | 1.5 | 2.0 | 1.0 | 12.32 |
| 25% | 28.0 | 58.1 | 1.62 | 170.0 | 131.0 | 56.0 | 1.04 | 720.0 | 21.3 | 2.2 | 3.0 | 1.0 | 20.11 |
| 50% | 40.0 | 70.0 | 1.71 | 180.0 | 143.0 | 62.0 | 1.26 | 893.0 | 26.2 | 2.6 | 3.0 | 2.0 | 24.16 |
| 75% | 49.0 | 86.0 | 1.8 | 190.0 | 156.0 | 68.0 | 1.46 | 1076.0 | 29.3 | 3.1 | 4.0 | 2.0 | 28.56 |
| max | 59.0 | 129.9 | 2.0 | 199.0 | 169.0 | 74.0 | 2.0 | 1783.0 | 35.0 | 3.7 | 5.0 | 3.0 | 49.84 |


**2. Data Cleaning**

Steps:

1. Handle Missing Values
*   Identify missing or null values using gym_data.isnull().sum().
*   Impute or drop missing data (e.g., mean imputation for numerical columns, mode for categorical ones).

2. Fix Inconsistent or Outlier Values
*   Use boxplots or z-scores to identify outliers in numerical columns such as Calories_Burned, BMI, and Resting_BPM.
*   Cap outliers (using quantiles) or remove them if justified.

3. Standardize Categorical Values
*   Ensure categorical values (e.g., Workout_Type) are consistent.

4. Convert Units or Rename Columns
*   Convert session duration from hours to minutes and rename the column to match.

5. Feature Engineering
*   Add new features, such as BMI categories.

6. Check for Duplicates
*   Duplicates can skew the analysis because the same information is counted multiple times.

7. Verify Data Types
*   Ensure each column has the correct data type for analysis (e.g., numerical, categorical).
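
Step 1 above mentions mean imputation for numerical columns and mode imputation for categorical ones. The following is a minimal sketch of both on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration and are not taken from the gym dataset.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the values are invented for illustration only.
toy = pd.DataFrame({
    "Calories_Burned": [500.0, np.nan, 700.0, 900.0],
    "Workout_Type": ["Yoga", "Cardio", None, "Yoga"],
})

# Numerical column: fill gaps with the column mean.
toy["Calories_Burned"] = toy["Calories_Burned"].fillna(toy["Calories_Burned"].mean())

# Categorical column: fill gaps with the most frequent value (the mode).
toy["Workout_Type"] = toy["Workout_Type"].fillna(toy["Workout_Type"].mode().iloc[0])

print(toy)
```

Assigning the filled column back, rather than calling `fillna(..., inplace=True)` on a selected column, keeps the change on the original DataFrame.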



In [None]:
# Check for missing values
print(gym_data.isnull().sum())

# Impute missing Calories_Burned values with the column mean.
# Assigning the result back avoids the chained-assignment FutureWarning that
# fillna(..., inplace=True) on a single column raises in recent pandas versions.
gym_data['Calories_Burned'] = gym_data['Calories_Burned'].fillna(gym_data['Calories_Burned'].mean())


Age                              0
Gender                           0
Weight (kg)                      0
Height (m)                       0
Max_BPM                          0
Avg_BPM                          0
Resting_BPM                      0
Session_Duration (hours)         0
Calories_Burned                  0
Workout_Type                     0
Fat_Percentage                   0
Water_Intake (liters)            0
Workout_Frequency (days/week)    0
Experience_Level                 0
BMI                              0
dtype: int64


**2. Detect and Handle Outliers**


*   The code below detects outliers in the Calories_Burned column of the gym_data DataFrame using the Interquartile Range (IQR) method, then caps them at the IQR fences (1.5 × IQR below Q1 and above Q3).




In [None]:
# Detect outliers using IQR for Calories_Burned
Q1 = gym_data['Calories_Burned'].quantile(0.25)
Q3 = gym_data['Calories_Burned'].quantile(0.75)
IQR = Q3 - Q1

# Fences: values outside [lower_bound, upper_bound] count as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = gym_data[(gym_data['Calories_Burned'] < lower_bound) | (gym_data['Calories_Burned'] > upper_bound)]
print(f"Number of outliers detected: {len(outliers)}")

# Cap outliers at the fences (values are clipped, not removed)
gym_data['Calories_Burned'] = gym_data['Calories_Burned'].clip(lower=lower_bound, upper=upper_bound)

Number of outliers detected: 10
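
The cleaning steps also mention z-scores as a way to find outliers. Below is a rough sketch of that alternative on a made-up series; the `calories` values and the `threshold` of 2.5 are assumptions chosen for this tiny example (a cutoff of 3 is common on larger samples, but a sample this small can never produce a z-score that high).

```python
import pandas as pd

# Made-up calorie values with one obvious extreme at the end.
calories = pd.Series([800, 820, 790, 810, 805, 795, 815, 800, 810, 2500], dtype=float)

# z-score: distance from the mean, measured in standard deviations.
z_scores = (calories - calories.mean()) / calories.std()

threshold = 2.5  # assumed cutoff for this small sample
is_outlier = z_scores.abs() > threshold

print(calories[is_outlier])
```

Unlike the IQR fences, the z-score depends on the mean and standard deviation, which are themselves pulled by extreme values, so the two methods can flag different rows.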


**3. Standardize or Format Categorical Variables**

*   Ensure categorical variables like Workout_Type and Gender are clean and consistent:


In [None]:
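# Correct inconsistencies: strip whitespace and normalize capitalization
# (capitalize() lowercases everything after the first letter, so 'HIIT' becomes 'Hiit')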
gym_data['Workout_Type'] = gym_data['Workout_Type'].str.strip().str.capitalize()
gym_data['Gender'] = gym_data['Gender'].str.strip().str.capitalize()

# Check unique values after cleaning
print(gym_data['Workout_Type'].unique())
print(gym_data['Gender'].unique())

['Yoga' 'Hiit' 'Cardio' 'Strength']
['Male' 'Female']
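
If acronyms such as HIIT should keep their original casing, one option is to map normalized values to canonical labels instead of calling `capitalize()`. This is only a sketch: the `raw` series and the `canonical` dictionary are hypothetical and would need to list every label that actually occurs in the data.

```python
import pandas as pd

# Made-up raw labels with stray whitespace and mixed case.
raw = pd.Series([" yoga", "HIIT ", "cardio", "Strength", "hiit"])

# Hypothetical canonical labels, keyed by the stripped, lowercased value.
canonical = {
    "yoga": "Yoga",
    "hiit": "HIIT",
    "cardio": "Cardio",
    "strength": "Strength",
}

cleaned = raw.str.strip().str.lower().map(canonical)

# Values missing from the mapping become NaN, which makes typos easy to spot.
unmapped = raw[cleaned.isna()]

print(cleaned.tolist())
print(f"Unmapped values: {unmapped.tolist()}")
```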


**4. Convert Units or Rename Columns**

Convert session duration from hours to minutes and rename the column to match:

In [None]:
# Convert session duration from hours to minutes
gym_data['Session_Duration (minutes)'] = gym_data['Session_Duration (hours)'] * 60

# Drop the old column for cleanliness
gym_data.drop('Session_Duration (hours)', axis=1, inplace=True)


**5. Feature Engineering**

Add a new feature, BMI_Category, derived from the existing BMI column:

In [None]:
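# Define BMI category classification.
# Each branch only needs an upper bound because earlier branches already
# exclude lower values; this leaves no gaps such as 24.9-25 or 29.9-30.
def classify_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'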

# Apply classification to BMI column
gym_data['BMI_Category'] = gym_data['BMI'].apply(classify_bmi)

# Check the new column
print(gym_data[['BMI', 'BMI_Category']].head())


     BMI BMI_Category
0  30.20        Obese
1  32.00        Obese
2  24.71       Normal
3  18.41  Underweight
4  14.39  Underweight
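
The same kind of binning can also be expressed with `pd.cut`, which works on the whole column at once and uses contiguous bins. The sketch below assumes the common 18.5 / 25 / 30 cut-offs; the first five BMI values come from the output above, and the sixth (24.95) is a hypothetical value added to show how a number close to a bin edge is handled.

```python
import pandas as pd

# First five values are from the output above; 24.95 is a hypothetical edge case.
bmi = pd.Series([30.20, 32.00, 24.71, 18.41, 14.39, 24.95])

bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# right=False makes each bin include its lower edge: [0, 18.5), [18.5, 25), ...
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"BMI": bmi, "BMI_Category": bmi_category}))
```

Because the bins share their edges, every non-negative BMI value falls into exactly one category.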


In [None]:
gym_data.head()

| | Age | Gender | Weight (kg) | Height (m) | Max_BPM | Avg_BPM | Resting_BPM | Calories_Burned | Workout_Type | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI | Session_Duration (minutes) | BMI_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | 88.3 | 1.71 | 180 | 157 | 60 | 1313.0 | Yoga | 12.6 | 3.5 | 4 | 3 | 30.2 | 101.4 | Obese |
| 1 | 46 | Female | 74.9 | 1.53 | 179 | 151 | 66 | 883.0 | Hiit | 33.9 | 2.1 | 4 | 2 | 32.0 | 78.0 | Obese |
| 2 | 32 | Female | 68.1 | 1.66 | 167 | 122 | 54 | 677.0 | Cardio | 33.4 | 2.3 | 4 | 2 | 24.71 | 66.6 | Normal |
| 3 | 25 | Male | 53.2 | 1.7 | 190 | 164 | 56 | 532.0 | Strength | 28.8 | 2.1 | 3 | 1 | 18.41 | 35.4 | Underweight |
| 4 | 38 | Male | 46.1 | 1.79 | 188 | 158 | 68 | 556.0 | Strength | 29.2 | 2.8 | 3 | 1 | 14.39 | 38.4 | Underweight |


**6. Check for Duplicates**

Duplicates can skew analysis if the same information is counted multiple times.

In [None]:
# Check for duplicate rows
duplicates = gym_data.duplicated()

# Count and optionally remove duplicates
print(f"Number of duplicate rows: {duplicates.sum()}")
gym_data.drop_duplicates(inplace=True)


Number of duplicate rows: 0
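
Since the dataset contains no duplicates, the effect of these calls is easier to see on a small made-up frame. In this sketch the `toy` rows are invented; `keep=False` is a standard option of `duplicated()` that marks every copy of a repeated row, which helps when inspecting duplicates before dropping them.

```python
import pandas as pd

# Made-up rows; the third row repeats the first.
toy = pd.DataFrame({
    "Age": [56, 46, 56],
    "Gender": ["Male", "Female", "Male"],
    "Calories_Burned": [1313.0, 883.0, 1313.0],
})

# Default behaviour: the first occurrence is kept, later copies are flagged.
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every copy, useful for inspecting duplicates side by side.
all_copies = toy[toy.duplicated(keep=False)]

# Drop the later copies and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Number of duplicate rows: {n_duplicates}")
print(all_copies)
print(deduped)
```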


**7. Verify Data Types**

Ensure each column has the correct data type for analysis (e.g., numerical, categorical).

In [None]:
print(gym_data.dtypes)

# Convert data types if necessary
# gym_data['Age'] = gym_data['Age'].astype(int)
# gym_data['Session_Duration (minutes)'] = gym_data['Session_Duration (minutes)'].astype(float)


Age                                int64
Gender                            object
Weight (kg)                      float64
Height (m)                       float64
Max_BPM                            int64
Avg_BPM                            int64
Resting_BPM                        int64
Calories_Burned                  float64
Workout_Type                      object
Fat_Percentage                   float64
Water_Intake (liters)            float64
Workout_Frequency (days/week)      int64
Experience_Level                   int64
BMI                              float64
Session_Duration (minutes)       float64
BMI_Category                      object
dtype: object
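
The text columns above are stored as `object`. One possible follow-up, sketched here on a small made-up frame, is to convert them to the pandas `category` dtype. Treating Experience_Level as an ordered category (1 < 2 < 3) is an assumption based on its 1 to 3 range in the summary statistics, not something the dataset documents.

```python
import pandas as pd

# Made-up rows that mirror a few of the dataset's columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Workout_Type": ["Yoga", "Hiit", "Cardio", "Strength"],
    "Experience_Level": [3, 2, 2, 1],
})

# Unordered categories for plain labels.
for col in ["Gender", "Workout_Type"]:
    toy[col] = toy[col].astype("category")

# Assumed ordinal scale: an ordered categorical keeps the 1 < 2 < 3 ranking.
toy["Experience_Level"] = pd.Categorical(
    toy["Experience_Level"], categories=[1, 2, 3], ordered=True
)

print(toy.dtypes)
```

Categorical columns use less memory than `object` columns when labels repeat, and ordered categories support comparisons such as `toy["Experience_Level"] >= 2`.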


**8. Check Data Consistency**

* Ensure logical consistency within and between columns (e.g., Calories_Burned should never be negative and should generally increase with Session_Duration).

In [None]:
# Example check: Ensure no negative values
assert (gym_data['Calories_Burned'] >= 0).all(), "Negative values found in Calories_Burned"
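
Beyond the non-negativity check, relationships between columns can be tested as well. The sketch below uses the five rows shown in the earlier `head()` output; the specific rules (resting below average heart rate, average not above maximum, and a positive correlation between calories and duration) are assumptions about what "consistent" means for this data.

```python
import pandas as pd

# The five rows shown in the head() output earlier in the notebook.
sample = pd.DataFrame({
    "Max_BPM": [180, 179, 167, 190, 188],
    "Avg_BPM": [157, 151, 122, 164, 158],
    "Resting_BPM": [60, 66, 54, 56, 68],
    "Calories_Burned": [1313.0, 883.0, 677.0, 532.0, 556.0],
    "Session_Duration (minutes)": [101.4, 78.0, 66.6, 35.4, 38.4],
})

# Assumed rule: resting < average <= maximum heart rate in every row.
bpm_ok = (sample["Resting_BPM"] < sample["Avg_BPM"]) & (sample["Avg_BPM"] <= sample["Max_BPM"])
assert bpm_ok.all(), "Inconsistent heart-rate values found"

# Assumed rule: longer sessions should tend to burn more calories.
corr = sample["Calories_Burned"].corr(sample["Session_Duration (minutes)"])
assert corr > 0, "Calories_Burned does not increase with session duration"

print(f"Pearson correlation: {corr:.2f}")
```

On the full dataset, the same checks would run against `gym_data` instead of `sample`.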


**9. Save the Clean Data**


In [None]:
gym_data.to_csv('gym_data_cleaned.csv', index=False)

In [None]:
gym_data.head()

| | Age | Gender | Weight (kg) | Height (m) | Max_BPM | Avg_BPM | Resting_BPM | Calories_Burned | Workout_Type | Fat_Percentage | Water_Intake (liters) | Workout_Frequency (days/week) | Experience_Level | BMI | Session_Duration (minutes) | BMI_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | 88.3 | 1.71 | 180 | 157 | 60 | 1313.0 | Yoga | 12.6 | 3.5 | 4 | 3 | 30.2 | 101.4 | Obese |
| 1 | 46 | Female | 74.9 | 1.53 | 179 | 151 | 66 | 883.0 | Hiit | 33.9 | 2.1 | 4 | 2 | 32.0 | 78.0 | Obese |
| 2 | 32 | Female | 68.1 | 1.66 | 167 | 122 | 54 | 677.0 | Cardio | 33.4 | 2.3 | 4 | 2 | 24.71 | 66.6 | Normal |
| 3 | 25 | Male | 53.2 | 1.7 | 190 | 164 | 56 | 532.0 | Strength | 28.8 | 2.1 | 3 | 1 | 18.41 | 35.4 | Underweight |
| 4 | 38 | Male | 46.1 | 1.79 | 188 | 158 | 68 | 556.0 | Strength | 29.2 | 2.8 | 3 | 1 | 14.39 | 38.4 | Underweight |


       _,-._
    ,-'     `-.
  ,'          `. 
 /              \
|   nazli        |
 \              /
  `.          ,' 
    `-._    _,-'
        `--'
