# Data Preprocessing Report

## 1. Data Loading and Exploration

### Dataset: Sleep Deprivation Dataset

- **Source:** Provided dataset
- **Rows:** 60
- **Columns:** 14
- **Target Variable:** `Sleep_Quality_Score`



In [3]:
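import pandas as pd

# Load the dataset and list each column's dtype and non-null count
df = pd.read_csv(r"C:\Users\lenovo\Downloads\archive\sleep_deprivation_dataset_detailed.csv")
df.info()

# Count missing values per column
df.isnull().sum()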



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Participant_ID             60 non-null     object 
 1   Sleep_Hours                60 non-null     float64
 2   Sleep_Quality_Score        60 non-null     int64  
 3   Daytime_Sleepiness         60 non-null     int64  
 4   Stroop_Task_Reaction_Time  60 non-null     float64
 5   N_Back_Accuracy            60 non-null     float64
 6   Emotion_Regulation_Score   60 non-null     int64  
 7   PVT_Reaction_Time          60 non-null     float64
 8   Age                        60 non-null     int64  
 9   Gender                     60 non-null     object 
 10  BMI                        60 non-null     float64
 11  Caffeine_Intake            60 non-null     int64  
 12  Physical_Activity_Level    60 non-null     int64  
 13  Stress_Level               60 non-null     int64  
dtypes: float64(5), int64(7), object(2)

Participant_ID               0
Sleep_Hours                  0
Sleep_Quality_Score          0
Daytime_Sleepiness           0
Stroop_Task_Reaction_Time    0
N_Back_Accuracy              0
Emotion_Regulation_Score     0
PVT_Reaction_Time            0
Age                          0
Gender                       0
BMI                          0
Caffeine_Intake              0
Physical_Activity_Level      0
Stress_Level                 0
dtype: int64

#### Observations:

- The dataset contains both numerical and categorical features.
- No missing values were detected.
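
The numerical/categorical split and the missing-value count can also be derived programmatically from the dtypes. The following is a minimal sketch on a small hypothetical frame (`toy`) that mimics a few of the columns above; the values are made up and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the dataset (values are made up).
toy = pd.DataFrame({
    "Participant_ID": ["P1", "P2", "P3"],
    "Sleep_Hours": [6.5, 7.2, 5.9],
    "Age": [25, 31, 28],
    "Gender": ["Male", "Female", "Female"],
})

# Columns with numeric dtypes vs. everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Total number of missing cells across the whole frame
n_missing = int(toy.isnull().sum().sum())

print(numeric_cols)
print(categorical_cols)
print(n_missing)
```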

---

## 2. Handling Missing Values

Since there are no missing values in the dataset, no imputation or removal is required.
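
If a later version of the data did contain gaps, a guard along these lines would catch them. This is only a sketch: the toy frame and the choice of median imputation are assumptions, not part of the pipeline described in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, for illustration only.
toy = pd.DataFrame({
    "Sleep_Hours": [6.5, np.nan, 5.9, 7.1],
    "Stress_Level": [3, 5, 4, 2],
})

if toy.isnull().values.any():
    # Fill numeric gaps with each column's median (an assumed strategy).
    numeric_cols = toy.select_dtypes(include="number").columns
    toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy)
```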

---

## 3. Encoding Categorical Variables

### Identified Categorical Columns:

- `Gender` (Nominal)
- `Participant_ID` (Dropped as it does not contribute to prediction)

### Encoding Process:

In [None]:
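from sklearn.preprocessing import LabelEncoder

# Encode Gender as integers; codes follow the sorted label order
label_encoder = LabelEncoder()
df["Gender"] = label_encoder.fit_transform(df["Gender"])

# Drop the identifier column, which carries no predictive information
df = df.drop(columns=["Participant_ID"])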


#### Explanation:

- **Label Encoding** is used for `Gender` because it has only two categories. `LabelEncoder` assigns codes in sorted label order, which gives Female = 0 and Male = 1.
- **`Participant_ID`** is an identifier, not a feature that affects the target variable, so it is dropped.
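
The mapping produced by `LabelEncoder` can be confirmed from its `classes_` attribute rather than taken on trust. A short sketch, assuming the column holds exactly the strings `"Male"` and `"Female"` (the sample values below are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Gender values; the real column is assumed to use these labels.
gender = pd.Series(["Male", "Female", "Female", "Male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

For a nominal feature with more than two categories, one-hot encoding would avoid implying an order between the integer codes.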

---

## 4. Feature Scaling

### Standardization & Normalization

#### Standardization (for features assumed to be approximately normal):

- `Sleep_Hours`
- `Stroop_Task_Reaction_Time`
- `PVT_Reaction_Time`
- `BMI`

#### Normalization (for features not assumed to be normal):

- `Sleep_Quality_Score`
- `Daytime_Sleepiness`
- `N_Back_Accuracy`
- `Emotion_Regulation_Score`
- `Age`
- `Caffeine_Intake`
- `Physical_Activity_Level`
- `Stress_Level`

### Scaling Process:

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardize the features assumed to be approximately normal
scaler = StandardScaler()
standardized_columns = ["Sleep_Hours", "Stroop_Task_Reaction_Time", "PVT_Reaction_Time", "BMI"]
df[standardized_columns] = scaler.fit_transform(df[standardized_columns])

# Rescale the remaining numerical columns to [0, 1]
# Note: this list includes the target, Sleep_Quality_Score
minmax_scaler = MinMaxScaler()
normalized_columns = ["Sleep_Quality_Score", "Daytime_Sleepiness", "N_Back_Accuracy",
                      "Emotion_Regulation_Score", "Age", "Caffeine_Intake",
                      "Physical_Activity_Level", "Stress_Level"]
df[normalized_columns] = minmax_scaler.fit_transform(df[normalized_columns])

#### Explanation:

- **Standardization** rescales a feature to zero mean and unit variance. It is applied to the features assumed to be approximately normally distributed.
- **Normalization** (min-max scaling) maps a feature's values into the `[0, 1]` range. It is applied to the features not assumed to be normally distributed.
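
Both transformations reduce to simple formulas: standardization computes `(x - mean) / std`, and min-max normalization computes `(x - min) / (max - min)`. The sketch below checks the manual formulas against scikit-learn on made-up values. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values standing in for a single feature column.
x = np.array([[5.0], [6.5], [7.2], [8.0]])

# Standardization: subtract the mean, divide by the population std (ddof=0).
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

# Min-max normalization: map the observed minimum to 0 and maximum to 1.
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(np.allclose(mm_manual, mm_sklearn))
```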

---

## 5. Splitting the Dataset

### Train-Test Split (80-20)


In [None]:
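from sklearn.model_selection import train_test_split

# Separate the features from the target variable
X = df.drop(columns=["Sleep_Quality_Score"])
y = df["Sleep_Quality_Score"]

# Hold out 20% of the rows for evaluation; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)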


#### Explanation:

- **80-20 Split** leaves 48 of the 60 rows for training and reserves 12 for evaluation.
- **Random State (42)** fixes the shuffle so that the split is reproducible.
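
One caveat with the order used in this report: the scalers in Section 4 are fit before the split, so test-set statistics influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data (60 rows, to mirror the dataset size). The column values are random, and using a single `StandardScaler` for all features is an assumption made for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 rows, two features and a target (values are random).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Sleep_Hours": rng.normal(7.0, 1.0, size=60),
    "BMI": rng.normal(24.0, 3.0, size=60),
    "Sleep_Quality_Score": rng.integers(1, 21, size=60),
})

X = toy.drop(columns=["Sleep_Quality_Score"])
y = toy["Sleep_Quality_Score"]

# Split first: with 60 rows and test_size=0.2 this gives 48 train / 12 test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which reflects what the model would see on new data.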

---

## 6. Before-and-After Comparisons

### Sample Before Processing:

| Sleep\_Hours | Gender | Stroop\_Task\_Reaction\_Time | BMI |
| ------------ | ------ | ---------------------------- | --- |
| 6.5          | Male   | 500                          | 24  |
| 7.2          | Female | 460                          | 21  |

### Sample After Processing:

| Sleep\_Hours | Gender | Stroop\_Task\_Reaction\_Time | BMI   |
| ------------ | ------ | ---------------------------- | ----- |
| -0.30        | 1      | -2.00                        | 0.71  |
| 1.59         | 0      | -0.85                        | -0.01 |

#### Key Changes:

- `Gender` is now numerical (Male = 1, Female = 0).
- `Sleep_Hours`, `Stroop_Task_Reaction_Time`, and `BMI` are standardized.
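
A comparison of this kind can be produced by keeping a copy of the raw frame before any transformation is applied. The following is a minimal sketch on made-up rows; the `raw` and `processed` names are illustrative and do not appear in the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows for illustration; not taken from the dataset.
raw = pd.DataFrame({
    "Sleep_Hours": [6.5, 7.2, 5.8, 8.1],
    "Gender": ["Male", "Female", "Female", "Male"],
    "BMI": [24.0, 21.0, 27.5, 22.3],
})

# Work on a copy so the raw values stay available for comparison.
processed = raw.copy()
processed["Gender"] = LabelEncoder().fit_transform(processed["Gender"])
scaled_cols = ["Sleep_Hours", "BMI"]
processed[scaled_cols] = StandardScaler().fit_transform(processed[scaled_cols])

# Stack the first two rows of each frame under "before"/"after" labels.
comparison = pd.concat([raw.head(2), processed.head(2)], keys=["before", "after"])
print(comparison)
```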

---

## 7. Challenges Faced

1. **Feature Distribution Assumptions** – Determining whether a feature follows a normal distribution would normally require visualization. Because the dataset is small (60 rows), the distributions were instead assumed from each feature's characteristics.
2. **Choosing the Right Encoding** – `Gender` has only two categories, so label encoding was sufficient.
3. **Standardization vs. Normalization** – A mix of both techniques was used, chosen according to the assumed distribution of each numerical feature.
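
A formal test such as Shapiro-Wilk could supplement the distribution assumptions described above. The sketch below shows one way this might be done, not what the report did. It uses synthetic columns, the conventional 0.05 threshold, and the same rule as Section 4 (standardize if plausibly normal, otherwise normalize).

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic columns: one roughly normal, one strongly right-skewed.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "roughly_normal": rng.normal(7.0, 1.0, size=60),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=60),
})

# Shapiro-Wilk: a small p-value is evidence against normality.
p_values = {col: float(shapiro(toy[col])[1]) for col in toy.columns}

# Suggest a scaler per column using the assumed 0.05 cutoff.
suggested = {
    col: "StandardScaler" if p >= 0.05 else "MinMaxScaler"
    for col, p in p_values.items()
}

print(p_values)
print(suggested)
```

With a sample this small the test has limited power, so a p-value above the threshold is weak evidence at best and does not establish that a feature is normal.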

---

