# Data Processing Notebook
This notebook focuses on preparing and processing the **Personal Health** dataset.

## Table of Contents
1. [Introduction](#introduction)
2. [Imports](#imports)
3. [Data Collection](#data-collection)
4. [Data Cleaning](#data-cleaning)
5. [Feature Engineering](#feature-engineering)
6. [Saving Processed Data](#saving-data)
7. [Conclusion](#conclusion)


<a id='introduction'></a>
## 1. Introduction
We have a dataset with the following columns:
- **Days**: (e.g., Monday, Tuesday, etc.)
- **Weight(kg)**: Daily body weight in kilograms.
- **Step_count**: Number of steps taken.
- **Gym**: Indicates if a gym session occurred ("Yes" or blank).
- **Calorie(kcal)**: Total daily calorie intake.

Our **main research questions** revolve around:
- The relationship between calorie intake and weight changes.
- The impact of physical activity (steps, gym) on weight management.
- Interactions between calorie intake, step count, and gym visits.

Before any advanced analysis, we clean and prepare the data in this notebook.

<a id='imports'></a>
## 2. Imports
We import necessary libraries for data manipulation and basic checks.

In [2]:
import pandas as pd
import numpy as np

# (Optional) Show more rows and columns in displayed output
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 10)
print("Libraries imported.")


Libraries imported.


<a id='data-collection'></a>
## 3. Data Collection
Assuming you have your dataset saved in a CSV file named **`my_personal_health_data.csv`** (or similar), we load it here.

In [5]:
# Replace 'my_personal_health_data.csv' with your actual file name or path.
csv_file = 'my_personal_health_data.csv'

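# The raw file is semicolon-delimited.
df = pd.read_csv(csv_file, sep=';')

# Drop the last column of the raw file; the five columns described in the introduction remain.
df = df.drop(columns=df.columns[-1])

# Note: after these fills, 0 stands for "not recorded" and is counted as a real zero in later statistics.
df["Step_count"] = df["Step_count"].fillna(0)          # Replace NaN with 0
df["Step_count"] = df["Step_count"].astype(int)        # Convert float -> int

df["Calorie(kcal)"] = df["Calorie(kcal)"].fillna(0)          # Replace NaN with 0
df["Calorie(kcal)"] = df["Calorie(kcal)"].astype(int)        # Convert float -> int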


print("First 5 rows of the dataset:")

display(df.head())

print("Dataset shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())

First 5 rows of the dataset:


| | Days | Weight(kg) | Step_count | Gym | Calorie(kcal) |
|---|---|---|---|---|---|
| 0 | Thursday | 89.5 | 4500 | Yes | 3410 |
| 1 | Sunday | 89.3 | 3254 | Yes | 3410 |
| 2 | Monday | 91.2 | 4149 | | 3410 |
| 3 | Tuesday | 90.8 | 7342 | Yes | 3410 |
| 4 | Wednesday | 90.7 | 7784 | Yes | 3410 |


Dataset shape: (44, 5)

Column Names: ['Days', 'Weight(kg)', 'Step_count', 'Gym', 'Calorie(kcal)']
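
Before moving on to cleaning, it can help to confirm that the loaded frame has the columns described in the introduction. The sketch below is one possible check, not part of the original workflow: the helper name `missing_columns` and the small `sample` frame are hypothetical, and only the column names are taken from this notebook.

```python
import pandas as pd

# Column names taken from the introduction of this notebook.
EXPECTED_COLUMNS = ["Days", "Weight(kg)", "Step_count", "Gym", "Calorie(kcal)"]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame` (hypothetical helper)."""
    return [col for col in expected if col not in frame.columns]


# Small stand-in for the loaded data (values copied from the first two rows above).
sample = pd.DataFrame(
    {
        "Days": ["Thursday", "Sunday"],
        "Weight(kg)": [89.5, 89.3],
        "Step_count": [4500, 3254],
        "Gym": ["Yes", "Yes"],
        "Calorie(kcal)": [3410, 3410],
    }
)

absent = missing_columns(sample)
if absent:
    raise ValueError(f"Missing expected columns: {absent}")
print("All expected columns are present.")
```

In the notebook itself, the same call would presumably be made on `df` right after loading.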


<a id='data-cleaning'></a>
## 4. Data Cleaning
Here, we will:
1. Ensure correct data types.
2. Check and handle missing values.
3. Address any obvious outliers (optional).

In [6]:
# 4.1 Data Types
print("\nData Types before conversions:")
display(df.dtypes)

# Coerce numeric columns to numeric dtypes; entries that cannot be parsed become NaN
df['Weight(kg)'] = pd.to_numeric(df['Weight(kg)'], errors='coerce')
df['Step_count'] = pd.to_numeric(df['Step_count'], errors='coerce')
df['Calorie(kcal)'] = pd.to_numeric(df['Calorie(kcal)'], errors='coerce')

# If Gym is blank, replace with 'No'
df['Gym'] = df['Gym'].fillna('No')

# 4.2 Check Missing Values
missing_counts = df.isnull().sum()
print("\nMissing Values after type conversion:\n", missing_counts)

# Example: fill numeric missing with mean (if needed)
df['Weight(kg)'] = df['Weight(kg)'].fillna(df['Weight(kg)'].mean())
df['Step_count'] = df['Step_count'].fillna(df['Step_count'].mean())
df['Calorie(kcal)'] = df['Calorie(kcal)'].fillna(df['Calorie(kcal)'].mean())

# Re-check missing
print("\nMissing Values after filling:\n", df.isnull().sum())

# 4.3 (Optional) Outlier Detection
# A simple approach to remove or flag outliers could be IQR-based or domain-based.
# We won't remove them here unless we see extreme values.
print("\nStats Overview:")
display(df.describe())

print("Data cleaning complete.")


Data Types before conversions:


Days              object
Weight(kg)       float64
Step_count         int32
Gym               object
Calorie(kcal)      int32
dtype: object


Missing Values after type conversion:
 Days             1
Weight(kg)       1
Step_count       0
Gym              0
Calorie(kcal)    0
dtype: int64

Missing Values after filling:
 Days             1
Weight(kg)       0
Step_count       0
Gym              0
Calorie(kcal)    0
dtype: int64

Stats Overview:


| | Weight(kg) | Step_count | Calorie(kcal) |
|---|---|---|---|
| count | 44.0 | 44.0 | 44.0 |
| mean | 92.57907 | 5189.409091 | 3607.159091 |
| std | 2.17026 | 2710.946186 | 630.982436 |
| min | 83.9 | 0.0 | 0.0 |
| 25% | 91.275 | 2547.0 | 3660.0 |
| 50% | 92.589535 | 4502.5 | 3660.0 |
| 75% | 94.175 | 7563.0 | 3660.0 |
| max | 95.7 | 10136.0 | 4875.0 |


Data cleaning complete.
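
The cleaning cell mentions an IQR-based approach to outliers without implementing it. As a minimal sketch of that idea, the hypothetical helper `iqr_outlier_mask` below flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`. The multiplier `k = 1.5` is a common convention assumed here, and the sample weights are illustrative values, not the full dataset.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Illustrative weights only; 83.9 is the minimum reported in the stats overview.
weights = pd.Series([89.5, 89.3, 91.2, 90.8, 90.7, 83.9, 90.9], name="Weight(kg)")
flags = iqr_outlier_mask(weights)

# Flag rather than drop, so the decision to remove a row stays explicit.
print(weights[flags])
```

With these sample values only 83.9 is flagged. The rule fits less well when a column's quartiles coincide: in the stats overview the 25%, 50%, and 75% values of `Calorie(kcal)` are all 3660, so the IQR is 0 and every other value would be flagged. A domain-based threshold may be a better choice for that column.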


<a id='feature-engineering'></a>
## 5. Feature Engineering
We can create new columns to help with analysis, such as:
- **Gym_Bool**: Convert "Yes" to `True`, else `False`.
- **Day_Type**: Classify days as `Weekend` vs. `Weekday`.
- **Weight_Change**: Difference in weight from the previous row, i.e. the previous recorded day (rows are not always consecutive calendar days).


In [7]:
# 5.1 Gym_Bool: True when Gym is "Yes", otherwise False
df['Gym_Bool'] = df['Gym'] == 'Yes'

# 5.2 Day_Type (Weekend vs. Weekday)
# Note: a missing day name is labeled 'Weekday' by this rule.
weekend_days = ['Saturday', 'Sunday']
df['Day_Type'] = np.where(df['Days'].isin(weekend_days), 'Weekend', 'Weekday')

# 5.3 Weight_Change: difference from the previous row.
# The first row has no predecessor, so its change is set to 0.
df['Weight_Change'] = df['Weight(kg)'].diff().fillna(0)

# Check new columns
display(df.head(10))

| | Days | Weight(kg) | Step_count | Gym | Calorie(kcal) | Gym_Bool | Day_Type | Weight_Change |
|---|---|---|---|---|---|---|---|---|
| 0 | Thursday | 89.5 | 4500 | Yes | 3410 | True | Weekday | 0.0 |
| 1 | Sunday | 89.3 | 3254 | Yes | 3410 | True | Weekend | -0.2 |
| 2 | Monday | 91.2 | 4149 | No | 3410 | False | Weekday | 1.9 |
| 3 | Tuesday | 90.8 | 7342 | Yes | 3410 | True | Weekday | -0.4 |
| 4 | Wednesday | 90.7 | 7784 | Yes | 3410 | True | Weekday | -0.1 |
| 5 | Thursday | 90.9 | 7213 | Yes | 3410 | True | Weekday | 0.2 |
| 6 | Friday | 90.6 | 7644 | No | 3410 | False | Weekday | -0.3 |
| 7 | Saturday | 90.3 | 4131 | Yes | 3410 | True | Weekend | -0.3 |
| 8 | Sunday | 90.9 | 2718 | Yes | 3410 | True | Weekend | 0.6 |
| 9 | Monday | 91.3 | 6421 | No | 3660 | False | Weekday | 0.4 |
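
The cleaning output shows one missing `Days` value, and the weekend rule above labels any day that is not Saturday or Sunday as `Weekday`, including a missing one. If that is not desired, one option is to keep missing days as missing. The sketch below illustrates this under that assumption. The helper `classify_day_type` is hypothetical and is not the notebook's method.

```python
import numpy as np
import pandas as pd

WEEKEND_DAYS = ["Saturday", "Sunday"]


def classify_day_type(days: pd.Series) -> pd.Series:
    """Label days as 'Weekend' or 'Weekday'; leave missing days as NaN (hypothetical helper)."""
    labels = pd.Series(
        np.where(days.isin(WEEKEND_DAYS), "Weekend", "Weekday"),
        index=days.index,
        dtype="object",
    )
    # mask() replaces entries where the condition is True with NaN.
    return labels.mask(days.isna())


# Illustrative input with one missing day name.
days = pd.Series(["Thursday", "Sunday", None, "Saturday"])
day_types = classify_day_type(days)
print(day_types)
```

Downstream group-bys on `Day_Type` would then skip the unlabeled row by default instead of counting it as a weekday.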


<a id='saving-data'></a>
## 6. Saving Processed Data
After cleaning and feature engineering, we can save this final version of the dataset for further analysis.

In [9]:
# Save the processed dataset to a new CSV
processed_file = 'my_personal_health_data_processed.csv'
df.to_csv(processed_file, index=False)
print(f"Processed data saved as: {processed_file}")

Processed data saved as: my_personal_health_data_processed.csv
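
A quick round-trip check can confirm that the saved file reloads as expected. Note that `to_csv` writes comma-separated values by default, so the processed file is read back without `sep=';'`, unlike the raw file. The sketch below uses a small illustrative frame and a temporary file (both assumptions for the example) rather than the real dataset.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame with a string, a float, and a boolean column.
frame = pd.DataFrame(
    {
        "Days": ["Thursday", "Sunday"],
        "Weight(kg)": [89.5, 89.3],
        "Gym_Bool": [True, True],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_sample.csv")
    frame.to_csv(path, index=False)   # comma-separated by default
    reloaded = pd.read_csv(path)      # default sep=',' matches to_csv

# Raises AssertionError if columns, dtypes, or values differ
# (floats are compared with a small tolerance by default).
pd.testing.assert_frame_equal(frame, reloaded)
print(reloaded.dtypes)
```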


<a id='conclusion'></a>
## 7. Conclusion
In this notebook:
1. We **loaded** the raw dataset.
2. We performed **data cleaning** (standardized data types, filled missing numeric values and blank `Gym` entries; the one missing `Days` value is left as-is).
3. We created **new features** like `Gym_Bool`, `Day_Type`, and `Weight_Change`.
4. We **saved** the final processed data for further analysis (e.g., in `data_analysis.ipynb`).

This completes the data processing phase for our personal health project!