# Data Processing Notebook
This notebook **focuses on preparing and processing** our **Personal Health** dataset.

## Table of Contents
1. [Introduction](#introduction)
2. [Imports](#imports)
3. [Data Collection](#data-collection)
4. [Data Cleaning](#data-cleaning)
5. [Feature Engineering](#feature-engineering)
6. [Saving Processed Data](#saving-data)
7. [Conclusion](#conclusion)


<a id='introduction'></a>
## 1. Introduction
We have a dataset with the following **columns**:
- **Days**: (e.g., Monday, Tuesday, etc.)
- **Weight(kg)**: Daily body weight in kilograms.
- **Step_count**: Number of steps taken.
- **Gym**: Indicates if a gym session occurred ("Yes" or "No"; blanks are treated as "No" during cleaning).
- **Calorie(kcal)**: Total daily calorie intake.

### Our Main Research Questions
1. Relationship between **calorie intake** and **weight changes**.
2. Impact of **physical activity** (steps, gym) on weight management.
3. Interactions between **calorie intake**, **step count**, and **gym visits**.

Before advanced analysis, let's **clean** and **prepare** the data in this notebook!

<a id='imports'></a>
## 2. Imports
We import necessary libraries for **data manipulation** and **basic checks**.

_In the future, we may add additional imports for advanced outlier detection or data validation._

In [1]:
import warnings

# Suppress DeprecationWarnings (e.g., about pyarrow)
warnings.filterwarnings('ignore', category=DeprecationWarning)

import pandas as pd
import numpy as np


print("Libraries imported successfully!")

Libraries imported successfully!


<a id='data-collection'></a>
## 3. Data Collection
Assuming you have your dataset saved in a CSV file named **`my_personal_health_data.csv`** (or similar), we **load** it here.

We'll also apply a few immediate cleanup actions:
- Use `sep=';'` if the file is semicolon-delimited.
- Drop the last column if it is unnamed (a trailing delimiter on each row produces an empty `Unnamed: N` column).
- Cast `Step_count` and `Calorie(kcal)` to **int**, since neither should have decimals.
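
The cleanup actions above can be sketched on a small in-memory sample. This is an illustration under assumptions, not the notebook's actual data: the sample rows and the names `sample_csv` and `sample` are hypothetical, and the nullable `Int64` dtype is shown as one alternative to filling missing values with 0.

```python
import io

import pandas as pd

# Hypothetical sample with a trailing semicolon on every line, which makes
# pandas create an extra, empty "Unnamed: 5" column.
sample_csv = io.StringIO(
    "Days;Weight(kg);Step_count;Gym;Calorie(kcal);\n"
    "Monday;91.2;4149;No;3410;\n"
    "Tuesday;90.8;;Yes;3410;\n"
)

sample = pd.read_csv(sample_csv, sep=";")

# Drop every auto-named column, wherever it appears.
sample = sample.loc[:, ~sample.columns.str.startswith("Unnamed")]

# Nullable Int64 keeps a missing step count as <NA> instead of forcing it to 0.
sample["Step_count"] = sample["Step_count"].astype("Int64")

print(sample.columns.tolist())
print(sample.dtypes)
```

Keeping missing values as `<NA>` means they are skipped by `mean()` and similar aggregations rather than counted as zero-step days.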


In [2]:
# ------------------------------------------------------------------------------
# Data Collection
# ------------------------------------------------------------------------------

csv_file = 'my_personal_health_data.csv'  # Replace with your actual CSV name
# If your dataset uses commas, remove 'sep=";"'. If it uses semicolons, keep it.

df = pd.read_csv(csv_file, sep=';')
print("Raw Data loaded!\n")

# Example: Drop the last column if it is unnamed or empty
if 'Unnamed: 5' in df.columns:
    df.drop(columns=['Unnamed: 5'], inplace=True)
elif df.columns[-1].startswith('Unnamed'):
    df.drop(df.columns[-1], axis=1, inplace=True)

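# Convert Step_count + Calorie(kcal) to integer, if that’s correct for your data.
# to_numeric(errors='coerce') turns unparseable entries into NaN so the cast cannot fail;
# note that fillna(0) records missing values as 0, which lowers later averages.
df["Step_count"] = pd.to_numeric(df["Step_count"], errors="coerce").fillna(0).astype(int)
df["Calorie(kcal)"] = pd.to_numeric(df["Calorie(kcal)"], errors="coerce").fillna(0).astype(int)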

print("First 5 rows of the dataset:")
display(df.head())

print("Dataset shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())

Raw Data loaded!

First 5 rows of the dataset:


Unnamed: 0,Days,Weight(kg),Step_count,Gym,Calorie(kcal)
0,Thursday,89.5,4500,Yes,3410
1,Sunday,89.3,3254,Yes,3410
2,Monday,91.2,4149,No,3410
3,Tuesday,90.8,7342,Yes,3410
4,Wednesday,90.7,7784,Yes,3410


Dataset shape: (43, 5)

Column Names: ['Days', 'Weight(kg)', 'Step_count', 'Gym', 'Calorie(kcal)']


<a id='data-cleaning'></a>
## 4. Data Cleaning
Here, we will:
1. Check **data types**.
2. Identify and handle **missing values**.
3. (Optionally) detect or remove **outliers**.

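_Note: We are focusing on data validity, not yet analyzing patterns. The output shown below appears to come from a re-run after the feature-engineering cell, which is why it already lists the engineered columns._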
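
For the optional outlier step, one common choice is the interquartile-range (IQR) rule. The sketch below is an assumption about how that step could look, not something the notebook applies: the helper `iqr_outlier_mask`, the multiplier `k=1.5`, and the sample weights are all hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical week of weights with one implausibly low reading.
weights = pd.Series([91.2, 90.8, 90.7, 90.9, 90.6, 83.9, 90.3], name="Weight(kg)")
mask = iqr_outlier_mask(weights)

# Flag rather than delete, so the decision can be reviewed later.
print(weights[mask])
```

Flagging first and deciding later is the safer default for a small personal dataset, where a single dropped day is a noticeable share of the data.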

In [7]:
# ------------------------------------------------------------------------------
# Data Cleaning Steps
# ------------------------------------------------------------------------------
print("\nData Types before conversions:")
display(df.dtypes)

# Coerce numeric columns: errors='coerce' turns unparseable strings into NaN instead of raising
df['Weight(kg)'] = pd.to_numeric(df['Weight(kg)'], errors='coerce')
df['Step_count'] = pd.to_numeric(df['Step_count'], errors='coerce')
df['Calorie(kcal)'] = pd.to_numeric(df['Calorie(kcal)'], errors='coerce')

# If Gym is blank, replace with 'No'
df['Gym'] = df['Gym'].fillna('No')

# Check Missing Values
missing_counts = df.isnull().sum()
print("\nMissing Values after type conversion:")
print(missing_counts)

# Example fill strategy for Weight(kg): fill with mean if missing
df['Weight(kg)'] = df['Weight(kg)'].fillna(df['Weight(kg)'].mean())

# Re-check missing
print("\nMissing Values after filling:")
print(df.isnull().sum())

print("\nStats Overview (post-cleaning):")
display(df.describe())
print("Data cleaning complete!")


Data Types before conversions:


Days                   object
Weight(kg)            float64
Step_count              int32
Gym                    object
Calorie(kcal)           int32
Gym_Bool                 bool
Day_Type               object
Weight_Change         float64
Step_RollingAvg_7d    float64
dtype: object


Missing Values after type conversion:
Days                  0
Weight(kg)            0
Step_count            0
Gym                   0
Calorie(kcal)         0
Gym_Bool              0
Day_Type              0
Weight_Change         0
Step_RollingAvg_7d    0
dtype: int64

Missing Values after filling:
Days                  0
Weight(kg)            0
Step_count            0
Gym                   0
Calorie(kcal)         0
Gym_Bool              0
Day_Type              0
Weight_Change         0
Step_RollingAvg_7d    0
dtype: int64

Stats Overview (post-cleaning):


Unnamed: 0,Weight(kg),Step_count,Calorie(kcal),Weight_Change,Step_RollingAvg_7d
count,43.0,43.0,43.0,43.0,43.0
mean,92.57907,5310.093023,3691.046512,0.134884,4705.671096
std,2.195944,2620.707882,301.021195,2.22217,2197.246462
min,83.9,1649.0,3410.0,-9.3,0.0
25%,91.25,2603.5,3660.0,-0.2,3650.642857
50%,92.6,4505.0,3660.0,0.0,5738.714286
75%,94.25,7590.0,3660.0,0.4,6197.785714
max,95.7,10136.0,4875.0,10.5,6983.571429


Data cleaning complete!
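
To make the coercion and mean-fill steps concrete, here is a minimal sketch on hypothetical raw values (the strings in `raw_weights` are invented for illustration and do not come from the dataset). It shows how `errors="coerce"` creates the missing values that the fill step then replaces.

```python
import pandas as pd

# Hypothetical raw values: two valid numbers, a placeholder, and a decimal comma.
raw_weights = pd.Series(["91.2", "n/a", "90,8", "90.6"])

# errors="coerce" converts anything unparseable to NaN instead of raising.
weights = pd.to_numeric(raw_weights, errors="coerce")
n_missing = int(weights.isna().sum())

# Mean imputation: the mean is computed over the valid values only.
filled = weights.fillna(weights.mean())

print(n_missing)
print(filled.tolist())
```

Counting the coerced values before filling them is worthwhile: a large count usually points to a formatting problem (such as decimal commas) rather than genuinely missing measurements.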


<a id='feature-engineering'></a>
## 5. Feature Engineering
We can create new columns to help with analysis, such as:
- **Gym_Bool**: Convert "Yes" to `True`, else `False`.
- **Day_Type**: Classify days as `Weekend` vs. `Weekday`.
- **Weight_Change**: Difference in weight from the previous row (assumed to be the previous day).
- (Bonus) **Step_RollingAvg_7d**: A 7-day **rolling average** of step count.


In [8]:
# ------------------------------------------------------------------------------
# Feature Engineering
# ------------------------------------------------------------------------------

# 5.1 Gym_Bool
df['Gym_Bool'] = df['Gym'].eq('Yes')

# 5.2 Day_Type (Weekend vs. Weekday)
weekend_days = ['Saturday', 'Sunday']
df['Day_Type'] = np.where(df['Days'].isin(weekend_days), 'Weekend', 'Weekday')

# 5.3 Weight_Change
df['Weight_Change'] = df['Weight(kg)'].diff().fillna(0)

# Bonus: 7-day rolling average of Step_count (assumes rows are consecutive days in order)
try:
    # The first six rows have no full 7-day window; their NaN is replaced with 0.
    df['Step_RollingAvg_7d'] = df['Step_count'].rolling(window=7).mean().fillna(0)
except Exception as exc:  # avoid a bare except, which would also swallow KeyboardInterrupt
    df['Step_RollingAvg_7d'] = 0
    print(f"Rolling average not applied (check data length or ordering): {exc}")

# Check new columns
display(df.head(10))

Unnamed: 0,Days,Weight(kg),Step_count,Gym,Calorie(kcal),Gym_Bool,Day_Type,Weight_Change,Step_RollingAvg_7d
0,Thursday,89.5,4500,Yes,3410,True,Weekday,0.0,0.0
1,Sunday,89.3,3254,Yes,3410,True,Weekend,-0.2,0.0
2,Monday,91.2,4149,No,3410,False,Weekday,1.9,0.0
3,Tuesday,90.8,7342,Yes,3410,True,Weekday,-0.4,0.0
4,Wednesday,90.7,7784,Yes,3410,True,Weekday,-0.1,0.0
5,Thursday,90.9,7213,Yes,3410,True,Weekday,0.2,0.0
6,Friday,90.6,7644,No,3410,False,Weekday,-0.3,5983.714286
7,Saturday,90.3,4131,Yes,3410,True,Weekend,-0.3,5931.0
8,Sunday,90.9,2718,Yes,3410,True,Weekend,0.6,5854.428571
9,Monday,91.3,6421,No,3660,False,Weekday,0.4,6179.0
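
The zeros in the first six rows of `Step_RollingAvg_7d` are placeholders, not real averages, and they pull down summary statistics for that column. As a hedged alternative (not what the cell above does), `min_periods=1` averages over however many rows are available so far. The sketch reuses the first seven step counts from the table above; the names `strict` and `partial` are illustrative.

```python
import pandas as pd

# First seven step counts from the table above.
steps = pd.Series([4500, 3254, 4149, 7342, 7784, 7213, 7644], name="Step_count")

# Strict 7-day window: the first six rows are NaN until a full window exists.
strict = steps.rolling(window=7).mean()

# min_periods=1 uses the rows seen so far, so no placeholder value is needed.
partial = steps.rolling(window=7, min_periods=1).mean()

print(pd.DataFrame({"strict": strict, "partial": partial}).round(2))
```

Both versions agree once a full window is available; they differ only in how the first six rows are represented.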


<a id='saving-data'></a>
## 6. Saving Processed Data
After cleaning and feature engineering, we save this final version of the dataset for further analysis.

_We’ll name our output file **`my_personal_health_data_processed.csv`**._

In [9]:
# ------------------------------------------------------------------------------
# Save the processed dataset to a new CSV
# ------------------------------------------------------------------------------

processed_file = 'my_personal_health_data_processed.csv'
df.to_csv(processed_file, index=False)
print(f"Processed data saved as: {processed_file}")

Processed data saved as: my_personal_health_data_processed.csv
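
A quick round-trip check can confirm that a saved file reloads with the same columns and shape. The sketch below is an optional addition under assumptions: it writes a small hypothetical frame to a temporary directory rather than touching the real output file, and the file name `roundtrip_check.csv` is invented for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for the processed dataset.
processed = pd.DataFrame(
    {
        "Days": ["Thursday", "Monday"],
        "Weight(kg)": [89.5, 91.2],
        "Gym_Bool": [True, False],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_check.csv")
    processed.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

same_columns = reloaded.columns.tolist() == processed.columns.tolist()
same_shape = reloaded.shape == processed.shape
print(same_columns, same_shape)
print(reloaded.dtypes)
```

Saving with `index=False` matters here: without it, the index is written as a nameless first column and comes back as `Unnamed: 0` on reload.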


<a id='conclusion'></a>
## 7. Conclusion
In this notebook, we:
1. **Loaded** the raw dataset and handled delimiter issues.
2. Performed **data cleaning** (addressed missing values, standardized data types).
3. Created **new features** like `Gym_Bool`, `Day_Type`, `Weight_Change`, and optionally a rolling average.
4. **Saved** the final processed data for future analysis (e.g., in `data_analysis.ipynb`).

> **Next Step**: We’ll explore correlations, statistical tests, and potential modeling in our analysis notebook.

___
**End of Data Processing Notebook**