
# Raw Data Import & Cleaning

In this notebook, I load the raw, self-recorded data from my daily coffee, water, focus, and sleep logs.  
The goal is to enrich the dataset with derived features so that it is ready for analysis and hypothesis testing.

In [54]:
import pandas as pd

## Load Raw Data

I recorded my habits in an Excel file. Let's load it into a DataFrame.  
The file contains one row per day with beverage consumption, sleep, and focus measurements.

In [55]:
raw_path = "/content/raw_focus_sleep_data.xlsx"
df = pd.read_excel(raw_path)
df.head()

| | Date | Coffee (ml) | Caffeine (mg) | Water (ml) | Focus (min) | Screen Time (min) | Sleep Start | Sleep End | Sleep (hrs) | Sleep Quality (%) | Last Coffee (hrs before sleep) | Temperature (°C) | Sugar (mg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 Mar | 900 | 225 | 1234 | 263 | 294 | 1:34 | 8:51 | 7.09 | 68 | 4 | 14 | 45 |
| 1 | 11 Mar | 1200 | 300 | 1195 | 284 | 258 | 0:08 | 8:09 | 8.58 | 73 | 6 | 20 | 60 |
| 2 | 12 Mar | 600 | 150 | 1233 | 248 | 263 | 1:27 | 8:28 | 7.95 | 67 | 5 | 12 | 30 |
| 3 | 13 Mar | 1200 | 300 | 1228 | 240 | 213 | 2:05 | 9:10 | 7.02 | 66 | 3 | 15 | 40 |
| 4 | 14 Mar | 600 | 150 | 1680 | 182 | 224 | 0:44 | 8:57 | 8.44 | 75 | 3 | 16 | 30 |
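
The `Date` values in the preview are text such as `10 Mar`, with no year. If real timestamps are needed later (for sorting or weekday grouping), one option is to append a year and parse the result. This is only a sketch: `ASSUMED_YEAR` is a hypothetical value that does not come from the log, and the five dates are copied from the preview.

```python
import pandas as pd

# Hypothetical: the log records only day and month, so the year is assumed.
ASSUMED_YEAR = 2025

# The five dates shown in the preview
dates = pd.Series(["10 Mar", "11 Mar", "12 Mar", "13 Mar", "14 Mar"], name="Date")

# Append the assumed year, then parse with an explicit day/month-abbreviation/year format
parsed = pd.to_datetime(dates + f" {ASSUMED_YEAR}", format="%d %b %Y")

print(parsed.dt.day.tolist())
```

The `%b` directive matches English month abbreviations only when the active locale is English, so this may need adjusting elsewhere.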


## Feature Engineering

Here I derive new columns from the existing data: boolean flags that split days into high and low groups, and the hour at which sleep started and ended.  
These engineered features will let me group observations and test hypotheses in later phases.

In [56]:
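# New boolean features: True when the day's value is strictly above the threshold
df['High Coffee'] = df['Coffee (ml)'] > 600   # more than 600 ml of coffee
df['High Water'] = df['Water (ml)'] > 1200    # more than 1200 ml of water
df['High Focus'] = df['Focus (min)'] > 240    # more than 240 minutes of focus

# Sleep time features: the hour part of the "H:MM" strings
df['Sleep Start Hour'] = df['Sleep Start'].str.split(':').str[0].astype(int)
df['Sleep End Hour'] = df['Sleep End'].str.split(':').str[0].astype(int)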
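
The same steps can be wrapped in a small helper so the thresholds are easy to change. This is a minimal sketch, not the notebook's actual cell: `add_features` and its parameter names are hypothetical, and the sample rows are the five shown in the preview. Casting to `str` first is an added assumption so the hour extraction also works if `pd.read_excel` returns `datetime.time` objects instead of strings.

```python
import datetime as dt

import pandas as pd


def add_features(df, coffee_ml=600, water_ml=1200, focus_min=240):
    """Return a copy of df with threshold flags and sleep-hour columns added."""
    out = df.copy()
    # Strict comparisons: a value equal to the threshold is NOT flagged as high
    out["High Coffee"] = out["Coffee (ml)"] > coffee_ml
    out["High Water"] = out["Water (ml)"] > water_ml
    out["High Focus"] = out["Focus (min)"] > focus_min
    # Hour part of "H:MM"; astype(str) also covers datetime.time values
    for col in ("Sleep Start", "Sleep End"):
        out[f"{col} Hour"] = out[col].astype(str).str.split(":").str[0].astype(int)
    return out


# The five rows from the preview (only the columns the helper needs)
preview = pd.DataFrame({
    "Coffee (ml)": [900, 1200, 600, 1200, 600],
    "Water (ml)": [1234, 1195, 1233, 1228, 1680],
    "Focus (min)": [263, 284, 248, 240, 182],
    "Sleep Start": ["1:34", "0:08", "1:27", "2:05", "0:44"],
    "Sleep End": ["8:51", "8:09", "8:28", "9:10", "8:57"],
})
features = add_features(preview)

# Same data, but with Sleep Start stored as datetime.time objects
as_times = preview.assign(**{
    "Sleep Start": [dt.time(1, 34), dt.time(0, 8), dt.time(1, 27),
                    dt.time(2, 5), dt.time(0, 44)]
})
features_from_times = add_features(as_times)
```

Because the comparisons are strict, the days with exactly 600 ml of coffee (12 and 14 Mar) and exactly 240 minutes of focus (13 Mar) are not flagged as high.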

## Basic Feature Check

Let's sanity-check the new boolean features by looking at their means. The mean of a boolean column is the share of days on which the flag is `True`.

In [57]:
df[['High Coffee', 'High Water', 'High Focus']].mean()

| | 0 |
|---|---|
| High Coffee | 0.581395 |
| High Water | 0.744186 |
| High Focus | 0.627907 |
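
To make these numbers easier to read, the shares can be converted back into day counts. The sketch below first shows on the five preview rows that the mean of a boolean Series equals the count of `True` values divided by the number of rows, then multiplies the reported shares by the 43 logged days. The variable names are illustrative only.

```python
import pandas as pd

# "High Coffee" flags for the five preview rows (Coffee (ml) > 600)
high_coffee = pd.Series([True, True, False, True, False], name="High Coffee")

share = high_coffee.mean()       # fraction of True values
count = int(high_coffee.sum())   # number of True values
print(f"{count} of {len(high_coffee)} days -> share {share:.2f}")

# Convert the reported shares back to day counts for the 43 logged days
n_days = 43
reported = {"High Coffee": 0.581395, "High Water": 0.744186, "High Focus": 0.627907}
day_counts = {name: round(value * n_days) for name, value in reported.items()}
print(day_counts)  # {'High Coffee': 25, 'High Water': 32, 'High Focus': 27}
```

So the thresholds split the 43 days into groups of 25/18, 32/11, and 27/16, which matters when judging whether each group is large enough for later comparisons.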


## Reorder Columns

For readability and consistency with my README.md, I will reorder the columns.

In [58]:
ordered_columns = [
    "Date", "Coffee (ml)", "Caffeine (mg)", "Water (ml)", "Sugar (mg)",
    "Focus (min)", "Screen Time (min)",
    "Sleep Start", "Sleep End", "Sleep (hrs)", "Sleep Quality (%)",
    "Last Coffee (hrs before sleep)", "Temperature (°C)",
    "High Coffee", "High Water", "High Focus",
    "Sleep Start Hour", "Sleep End Hour"
]
df = df[ordered_columns]
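
One caveat with `df[ordered_columns]` is that any column missing from the list is silently dropped. A small guard can report both problems before reordering. This is a sketch under my own assumptions: `reorder_columns` is a hypothetical helper, and the toy frame only mimics three of the real columns.

```python
import pandas as pd


def reorder_columns(df, ordered_columns):
    """Reorder df, failing on unknown names and reporting dropped columns."""
    missing = [c for c in ordered_columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in DataFrame: {missing}")
    dropped = [c for c in df.columns if c not in ordered_columns]
    if dropped:
        print("Columns left out of the new order:", dropped)
    return df[ordered_columns]


# Toy frame using values from the first preview row
toy = pd.DataFrame({"Coffee (ml)": [900], "Date": ["10 Mar"], "Water (ml)": [1234]})
reordered = reorder_columns(toy, ["Date", "Coffee (ml)", "Water (ml)"])
print(list(reordered.columns))
```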

## Validate the Data


Check the data types and look for missing values.

In [59]:
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Date                            43 non-null     object 
 1   Coffee (ml)                     43 non-null     int64  
 2   Caffeine (mg)                   43 non-null     int64  
 3   Water (ml)                      43 non-null     int64  
 4   Sugar (mg)                      43 non-null     int64  
 5   Focus (min)                     43 non-null     int64  
 6   Screen Time (min)               43 non-null     int64  
 7   Sleep Start                     43 non-null     object 
 8   Sleep End                       43 non-null     object 
 9   Sleep (hrs)                     43 non-null     float64
 10  Sleep Quality (%)               43 non-null     int64  
 11  Last Coffee (hrs before sleep)  43 non-null     int64  
 12  Temperature (°C)                43 non

| | 0 |
|---|---|
| Date | 0 |
| Coffee (ml) | 0 |
| Caffeine (mg) | 0 |
| Water (ml) | 0 |
| Sugar (mg) | 0 |
| Focus (min) | 0 |
| Screen Time (min) | 0 |
| Sleep Start | 0 |
| Sleep End | 0 |
| Sleep (hrs) | 0 |
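
Missing-value counts do not catch values that are present but implausible, such as a sleep quality above 100%. A simple range check could complement the validation above. This is a sketch only: `out_of_range` is a hypothetical helper, the bounds in `BOUNDS` are my own assumptions rather than rules from the log, and the sample values come from the five preview rows.

```python
import pandas as pd

# Hypothetical plausibility bounds (low, high); None means "no limit"
BOUNDS = {
    "Coffee (ml)": (0, None),
    "Sleep (hrs)": (0, 24),
    "Sleep Quality (%)": (0, 100),
}


def out_of_range(df, bounds):
    """Count, per column, the values that fall outside [low, high]."""
    counts = {}
    for col, (low, high) in bounds.items():
        bad = pd.Series(False, index=df.index)
        if low is not None:
            bad = bad | (df[col] < low)
        if high is not None:
            bad = bad | (df[col] > high)
        counts[col] = int(bad.sum())
    return pd.Series(counts)


# Values from the five preview rows
sample = pd.DataFrame({
    "Coffee (ml)": [900, 1200, 600, 1200, 600],
    "Sleep (hrs)": [7.09, 8.58, 7.95, 7.02, 8.44],
    "Sleep Quality (%)": [68, 73, 67, 66, 75],
})
report = out_of_range(sample, BOUNDS)
print(report)
```

A count of zero for every column means all sampled values fall inside the assumed bounds; any non-zero count points to rows worth inspecting by hand.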


## Save the Cleaned Dataset

The final cleaned dataset will be used for data visualization and hypothesis testing in the next notebook.

In [60]:
cleaned_path = "/content/cleaned_focus_sleep_data.xlsx"
df.to_excel(cleaned_path, index=False)
print("Cleaned data saved at:", cleaned_path)

Cleaned data saved at: /content/cleaned_focus_sleep_data.xlsx
