# Data Preprocessing

This notebook loads, cleans, and preprocesses the household power consumption dataset.

Steps:
1. Load the dataset
2. Combine Date and Time into a Timestamp column and use it as the index
3. Handle missing values
4. Normalize continuous features (e.g., Global_active_power, Voltage)
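The steps listed above can be tried end to end without the full download. The following is a minimal sketch on a hypothetical three-row sample that mimics the raw file's layout (semicolon-separated, `?` for a missing reading); the values and the name `sample` are illustrative assumptions, not data from the dataset.

```python
import io

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample in the raw file's layout: ';' separated, '?' marks a gap.
raw = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.0;234.0\n"
    "16/12/2006;17:25:00;?;236.0\n"
    "16/12/2006;17:26:00;2.0;238.0\n"
)

# Load, treating '?' as NaN.
sample = pd.read_csv(raw, sep=";", na_values=["?"])

# Combine Date and Time into a Timestamp index and drop the originals.
sample["Timestamp"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H:%M:%S"
)
sample = sample.set_index("Timestamp").drop(columns=["Date", "Time"])

# Fill gaps with the column mean.
sample = sample.fillna(sample.mean())

# Scale every column to the range [0, 1].
sample[sample.columns] = MinMaxScaler().fit_transform(sample)

print(sample)
```

Running the pipeline on a small sample like this makes it easy to inspect each intermediate result before applying the same calls to the roughly two million rows of the real file.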
    

In [8]:
!pip install pandas 
!pip install numpy
!pip install scikit-learn

Collecting pandas
  Using cached pandas-2.2.3-cp310-cp310-win_amd64.whl (11.6 MB)
Collecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: tzdata, pytz, pandas
Successfully installed pandas-2.2.3 pytz-2025.2 tzdata-2025.2


You should consider upgrading via the 'C:\Users\Ibrah\AppData\Local\Programs\Python\Python310\python.exe -m pip install --upgrade pip' command.




Collecting scikit-learn
  Using cached scikit_learn-1.6.1-cp310-cp310-win_amd64.whl (11.1 MB)
Collecting scipy>=1.6.0
  Downloading scipy-1.15.2-cp310-cp310-win_amd64.whl (41.2 MB)
Collecting joblib>=1.2.0
  Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.6.1 scipy-1.15.2 threadpoolctl-3.6.0



In [9]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import zipfile
import os


# Extract the ZIP file
zip_file_path = r"C:\Users\Ibrah\Downloads\individual+household+electric+power+consumption.zip"
# Raw string, so each backslash is written once
extract_dir = r"C:\Users\Ibrah\Downloads\power_consumption_data"

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Load the semicolon-delimited text file from the extracted folder; "?" marks a missing reading
csv_file_path = os.path.join(extract_dir, 'household_power_consumption.txt')
df = pd.read_csv(csv_file_path, sep=';', header=0, low_memory=False, na_values=["?"])

# Step 1: Combine Date and Time into a Timestamp column
df['Timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')
df.set_index('Timestamp', inplace=True)

# Step 2: Drop original Date and Time columns
df.drop(columns=['Date', 'Time'], inplace=True)

# Step 3: Handle missing values
# Fill missing values with the mean of the respective column
df.fillna(df.mean(), inplace=True)

# Step 4: Normalize the continuous features using Min-Max Scaling
scaler = MinMaxScaler()
continuous_columns = ['Global_active_power', 'Global_reactive_power', 'Voltage', 'Global_intensity', 
                      'Sub_metering_1', 'Sub_metering_2', 'Sub_metering_3']
df[continuous_columns] = scaler.fit_transform(df[continuous_columns])

# Check the first few rows of the cleaned dataset
df.head()


| Timestamp | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| 2006-12-16 17:24:00 | 0.374796 | 0.300719 | 0.37609 | 0.377593 | 0.0 | 0.0125 | 0.548387 |
| 2006-12-16 17:25:00 | 0.478363 | 0.313669 | 0.336995 | 0.473029 | 0.0 | 0.0125 | 0.516129 |
| 2006-12-16 17:26:00 | 0.479631 | 0.358273 | 0.32601 | 0.473029 | 0.0 | 0.025 | 0.548387 |
| 2006-12-16 17:27:00 | 0.480898 | 0.361151 | 0.340549 | 0.473029 | 0.0 | 0.0125 | 0.548387 |
| 2006-12-16 17:28:00 | 0.325005 | 0.379856 | 0.403231 | 0.323651 | 0.0 | 0.0125 | 0.548387 |
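The values in this preview are unitless because Min-Max scaling maps each column to [0, 1] via x' = (x - min) / (max - min). As a rough illustration, the sketch below checks that formula against `MinMaxScaler` and uses `inverse_transform` to recover the original units. The voltage readings and the name `demo_scaler` are hypothetical and chosen only for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical voltage readings in volts; not taken from the dataset.
voltage = np.array([[230.0], [235.0], [240.0], [250.0]])

demo_scaler = MinMaxScaler()
scaled = demo_scaler.fit_transform(voltage)

# The same result by hand: (x - min) / (max - min), computed per column.
col_min = voltage.min(axis=0)
col_max = voltage.max(axis=0)
by_hand = (voltage - col_min) / (col_max - col_min)
assert np.allclose(scaled, by_hand)

# inverse_transform maps scaled values back to the original units.
restored = demo_scaler.inverse_transform(scaled)
assert np.allclose(restored, voltage)

print(scaled.ravel())
```

Keeping the fitted scaler object around is what makes the reverse mapping possible, which matters if scaled values later need to be reported in kilowatts or volts.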



### Data Summary

In this section, we confirm that no missing values remain after the mean fill and display basic summary statistics.

We will:
- Check for missing values in the dataset.
- Provide summary statistics of the continuous features to understand their distributions.
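Because the mean fill in the previous cell has already run, a missing-value check at this point is expected to report zero for every column. To see how many gaps the raw data had, the count has to be taken before imputation. A minimal sketch on a hypothetical frame (`raw_df` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the data as loaded, before any fill.
raw_df = pd.DataFrame(
    {
        "Global_active_power": [4.2, np.nan, 3.9, np.nan],
        "Voltage": [234.8, 233.6, np.nan, 235.1],
    }
)

# Count and share of missing values per column, taken before imputation.
missing_report = pd.DataFrame(
    {
        "missing": raw_df.isna().sum(),
        "share": raw_df.isna().mean(),
    }
)
print(missing_report)

# After a mean fill, the same check reports zero everywhere.
filled = raw_df.fillna(raw_df.mean())
print(filled.isna().sum())
```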






In [10]:

# Step 5: Check for missing values
missing_values = df.isnull().sum()

# Basic statistics
stats = df.describe()

# Display missing values and basic stats
print("Missing Values:\n", missing_values)
print("\nSummary Stats:\n", stats)




Missing Values:
 Global_active_power      0
Global_reactive_power    0
Voltage                  0
Global_intensity         0
Sub_metering_1           0
Sub_metering_2           0
Sub_metering_3           0
dtype: int64

Summary Stats:
        Global_active_power  Global_reactive_power       Voltage  \
count         2.075259e+06           2.075259e+06  2.075259e+06   
mean          9.194415e-02           8.900322e-02  5.699469e-01   
std           9.511638e-02           8.058576e-02  1.040272e-01   
min           0.000000e+00           0.000000e+00  0.000000e+00   
25%           2.118414e-02           3.453237e-02  5.111470e-01   
50%           5.015390e-02           7.338129e-02  5.738288e-01   
75%           1.307261e-01           1.381295e-01  6.352181e-01   
max           1.000000e+00           1.000000e+00  1.000000e+00   

       Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3  
count      2.075259e+06    2.075259e+06    2.075259e+06    2.075259e+06  
mean       9

### Explanation of the Code

1. **Data Loading**: The archive is extracted and the semicolon-delimited text file is loaded with `pd.read_csv()`. Passing `na_values=["?"]` makes pandas treat `?` as a missing value.
2. **Timestamp Creation**: The `Date` and `Time` columns are combined into a `Timestamp` column, which becomes the index of the DataFrame; the two original columns are then dropped.
3. **Handling Missing Values**: Missing values are filled with the **mean** of each column using `fillna()`.
4. **Normalization**: **Min-Max Scaling** maps the continuous features (e.g., `Global_active_power`, `Voltage`) to the range [0, 1].
5. **Missing Values Check & Summary Stats**: We print the per-column count of remaining missing values and the dataset’s basic statistics (mean, min, max, etc.).
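A mean fill assigns the same column-wide value to every gap, regardless of when the gap occurs. One possible alternative, which this notebook does not use, is interpolation along the `Timestamp` index. The sketch below compares the two on a hypothetical minute-level series; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-level readings with a two-minute gap.
index = pd.date_range("2006-12-16 17:24:00", periods=5, freq="min")
power = pd.Series(
    [1.0, 2.0, np.nan, np.nan, 5.0], index=index, name="Global_active_power"
)

# Mean fill, as in the notebook: every gap gets the same column-wide value.
mean_filled = power.fillna(power.mean())

# Time-based linear interpolation: gaps follow the neighbouring readings.
time_filled = power.interpolate(method="time")

print(pd.DataFrame({"mean_fill": mean_filled, "time_interp": time_filled}))
```

Which approach suits the data depends on how long the gaps are; interpolation is usually more plausible for short gaps, while long outages may need a different treatment altogether.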



In [11]:
# Save the preprocessed data
df.to_csv('preprocessed_power_consumption.csv')  # or use df.to_pickle('preprocessed_power_consumption.pkl')
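When the saved CSV is read back, the `Timestamp` index comes in as plain text unless pandas is told to parse it. The following sketch shows one way to restore the `DatetimeIndex`. It uses a hypothetical two-row frame and an in-memory buffer instead of the real file so it can run on its own.

```python
import io

import pandas as pd

# Hypothetical two-row frame with a Timestamp index, standing in for df.
frame = pd.DataFrame(
    {"Voltage": [0.37609, 0.336995]},
    index=pd.DatetimeIndex(
        ["2006-12-16 17:24:00", "2006-12-16 17:25:00"], name="Timestamp"
    ),
)

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col and parse_dates restore the DatetimeIndex on reload.
reloaded = pd.read_csv(buffer, index_col="Timestamp", parse_dates=True)

print(reloaded.index)
```

The pickle option mentioned in the comment above preserves the index type without these arguments, at the cost of a format that is only readable from Python.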
