# Data Processing Notebook
This notebook focuses on preparing and processing the **Personal Health** dataset.

## Table of Contents
1. [Introduction](#introduction)
2. [Imports](#imports)
3. [Data Collection](#data-collection)
4. [Data Cleaning](#data-cleaning)
5. [Feature Engineering](#feature-engineering)
6. [Saving Processed Data](#saving-data)
7. [Conclusion](#conclusion)


<a id='introduction'></a>
## 1. Introduction
We have a dataset with the following columns:
- **Days**: Day of the week (e.g., Monday, Tuesday).
- **Weight(kg)**: Daily body weight in kilograms.
- **Step_count**: Number of steps taken.
- **Gym**: Indicates if a gym session occurred ("Yes" or blank).
- **Calorie(kcal)**: Total daily calorie intake.

Our **main research questions** revolve around:
- The relationship between calorie intake and weight changes.
- The impact of physical activity (steps, gym) on weight management.
- Interactions between calorie intake, step count, and gym visits.

Before advanced analysis, let's clean and prepare the data in this notebook.

<a id='imports'></a>
## 2. Imports
We import necessary libraries for data manipulation and basic checks.

In [None]:
import pandas as pd
import numpy as np

# (Optional) Show more rows and columns when displaying DataFrames
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 10)
print("Libraries imported.")

<a id='data-collection'></a>
## 3. Data Collection
We assume the dataset is saved as a CSV file named **`my_personal_health_data.csv`** (or similar) and load it here.

In [None]:
# Replace 'my_personal_health_data.csv' with your actual file name or path.
csv_file = 'my_personal_health_data.csv'

df = pd.read_csv(csv_file)
print("First 5 rows of the dataset:")
display(df.head())

print("Dataset shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
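
Before cleaning, it can help to confirm that the loaded file actually contains the columns listed in the Introduction. The sketch below is one possible check, not part of the original workflow: the helper `find_missing_columns` and the toy frame are hypothetical and only stand in for the real `df`.

```python
import pandas as pd

# Column names taken from the Introduction; adjust if your file differs.
EXPECTED_COLUMNS = ['Days', 'Weight(kg)', 'Step_count', 'Gym', 'Calorie(kcal)']

def find_missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`."""
    # Strip whitespace so a header such as ' Gym' still matches 'Gym'.
    present = {str(col).strip() for col in frame.columns}
    return [col for col in expected if col not in present]

# Toy frame standing in for the loaded CSV (values invented for illustration).
sample = pd.DataFrame({
    'Days': ['Monday', 'Tuesday'],
    'Weight(kg)': [80.0, 79.8],
    'Step_count': [8000, 10500],
    'Calorie(kcal)': [2200, 2100],
})

missing_columns = find_missing_columns(sample)
print("Missing columns:", missing_columns)
```

In the notebook itself, the same helper could be called as `find_missing_columns(df)` right after loading; an empty list means all expected columns are present.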

<a id='data-cleaning'></a>
## 4. Data Cleaning
Here, we will:
1. Ensure correct data types.
2. Check and handle missing values.
3. Address any obvious outliers (optional).

In [None]:
# 4.1 Data Types
print("\nData Types before conversions:")
display(df.dtypes)

# Convert numeric columns; errors='coerce' turns unparseable entries into NaN
df['Weight(kg)'] = pd.to_numeric(df['Weight(kg)'], errors='coerce')
df['Step_count'] = pd.to_numeric(df['Step_count'], errors='coerce')
df['Calorie(kcal)'] = pd.to_numeric(df['Calorie(kcal)'], errors='coerce')

# pandas reads blank CSV cells as NaN; treat a missing Gym entry as 'No'
df['Gym'] = df['Gym'].fillna('No')

# 4.2 Check Missing Values
missing_counts = df.isnull().sum()
print("\nMissing Values after type conversion:\n", missing_counts)

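# Example: fill missing numeric values with the column mean (if needed).
# Note: a mean-filled weight can create artificial jumps in the
# day-to-day Weight_Change computed in Section 5.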
df['Weight(kg)'] = df['Weight(kg)'].fillna(df['Weight(kg)'].mean())
df['Step_count'] = df['Step_count'].fillna(df['Step_count'].mean())
df['Calorie(kcal)'] = df['Calorie(kcal)'].fillna(df['Calorie(kcal)'].mean())

# Re-check missing
print("\nMissing Values after filling:\n", df.isnull().sum())

# 4.3 (Optional) Outlier Detection
# A simple approach to remove or flag outliers could be IQR-based or domain-based.
# We won't remove them here unless we see extreme values.
print("\nStats Overview:")
display(df.describe())

print("Data cleaning complete.")
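
The IQR-based approach mentioned in step 4.3 can be sketched as follows. This is a minimal illustration that only flags values rather than removing them; the helper name `flag_iqr_outliers`, the conventional multiplier `k=1.5`, and the sample step counts are assumptions made for the example.

```python
import pandas as pd

def flag_iqr_outliers(series, k=1.5):
    """Return a boolean Series that is True where a value lies
    outside the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented step counts with one implausibly large value.
steps = pd.Series([8000, 8500, 9000, 9500, 10000, 60000], name='Step_count')

outlier_mask = flag_iqr_outliers(steps)
print(steps[outlier_mask])
```

On the real data, something like `df[flag_iqr_outliers(df['Step_count'])]` would list the rows worth inspecting by hand before deciding whether to keep them.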

<a id='feature-engineering'></a>
## 5. Feature Engineering
We can create new columns to help with analysis, such as:
- **Gym_Bool**: Convert "Yes" to `True`, else `False`.
- **Day_Type**: Classify days as `Weekend` vs. `Weekday`.
- **Weight_Change**: Difference in weight from the previous day (0 for the first row).


In [None]:
# 5.1 Gym_Bool
df['Gym_Bool'] = df['Gym'].eq('Yes')

# 5.2 Day_Type (Weekend vs. Weekday)
weekend_days = ['Saturday', 'Sunday']
df['Day_Type'] = np.where(df['Days'].isin(weekend_days), 'Weekend', 'Weekday')

# 5.3 Weight_Change (the first row has no previous day, so its NaN is set to 0)
df['Weight_Change'] = df['Weight(kg)'].diff().fillna(0)

# Check new columns
display(df.head(10))
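
To make the three derived columns concrete, here is a small worked example on a hand-made frame. The values are invented for illustration, and the frame name `toy` is hypothetical; the transformations mirror the ones applied to `df` in this section.

```python
import pandas as pd

# Three consecutive days with invented weights and gym entries.
toy = pd.DataFrame({
    'Days': ['Friday', 'Saturday', 'Sunday'],
    'Weight(kg)': [80.0, 79.5, 79.8],
    'Gym': ['Yes', 'No', 'Yes'],
})

# True only where the Gym entry is exactly 'Yes'.
toy['Gym_Bool'] = toy['Gym'].eq('Yes')

# Label Saturday and Sunday as 'Weekend', everything else as 'Weekday'.
is_weekend = toy['Days'].isin(['Saturday', 'Sunday'])
toy['Day_Type'] = is_weekend.map({True: 'Weekend', False: 'Weekday'})

# Day-over-day difference; the first row has no predecessor, so it becomes 0.
toy['Weight_Change'] = toy['Weight(kg)'].diff().fillna(0)

print(toy)
```

A negative `Weight_Change` means the weight dropped relative to the previous row, which is only a true day-to-day change if the rows are in chronological order with no skipped days.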

<a id='saving-data'></a>
## 6. Saving Processed Data
After cleaning and feature engineering, we can save this final version of the dataset for further analysis.

In [None]:
# Save the processed dataset to a new CSV
processed_file = 'my_personal_health_data_processed.csv'
df.to_csv(processed_file, index=False)
print(f"Processed data saved as: {processed_file}")
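
A quick way to confirm the export worked is to read the file back and compare it with the frame in memory. The sketch below does this against a temporary directory with a small invented frame, so it does not touch the real output file; the names `processed` and `roundtrip_check.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the processed DataFrame (values invented).
processed = pd.DataFrame({
    'Days': ['Monday', 'Tuesday'],
    'Weight(kg)': [80.0, 79.8],
    'Gym_Bool': [True, False],
})

# Write to a throwaway directory and read the file back in.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'roundtrip_check.csv')
    processed.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == processed.shape
same_columns = reloaded.columns.tolist() == processed.columns.tolist()
print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```

Because `index=False` is used when saving, the reloaded frame should have the same columns as the original, with no extra unnamed index column.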

<a id='conclusion'></a>
## 7. Conclusion
In this notebook:
1. We **loaded** the raw dataset.
2. We performed **data cleaning** (standardized data types, filled missing values).
3. We created **new features** like `Gym_Bool`, `Day_Type`, and `Weight_Change`.
4. We **saved** the final processed data for further analysis (e.g., in `data_analysis.ipynb`).

This completes the data processing phase for our personal health project!