# Universal Data Cleaning & Preparation Template

This Jupyter Notebook provides a step-by-step template for cleaning and preparing any tabular dataset for machine learning.

In [14]:
# Import necessary libraries
import pandas as pd
import numpy as np

### Step 1: Configuration & Data Loading

**Action Required:** Update the `file_path` variable below with the name of your CSV file. Make sure the file is in the same folder as this notebook.

In [15]:
# --- CONFIGURATION ---
file_path = 'your_dataset.csv'

# --- DATA LOADING ---
try:
    df = pd.read_csv(file_path)
    print("Data loaded successfully!")
    print("First 5 rows of your data:")
    print(df.head())
except FileNotFoundError:
    print(f"ERROR: The file '{file_path}' was not found.")

ERROR: The file 'your_dataset.csv' was not found.


### Step 2: Data Exploration
Let's get a summary of the dataset to understand its structure, data types, and missing values.

In [16]:
# Get a summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1460 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          1460 non-null   object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

### Step 3: Data Cleaning & Preprocessing

This section contains automated steps to clean the data. Review the output of each cell to understand the changes made to your DataFrame.

In [17]:
# --- Fill Missing Numeric Values ---
print("--- Handling Missing Numeric Data ---")
numeric_cols = df.select_dtypes(include=np.number).columns

for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        mean_value = df[col].mean()
        df[col] = df[col].fillna(mean_value)
        print(f"INFO: Missing values in numeric column '{col}' filled with mean.")

--- Handling Missing Numeric Data ---


In [18]:
# --- Fill Missing Categorical Values ---
print("\n--- Handling Missing Categorical Data ---")
categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        mode_value = df[col].mode()[0]
        df[col] = df[col].fillna(mode_value)
        print(f"INFO: Missing values in categorical column '{col}' filled with mode.")


--- Handling Missing Categorical Data ---


In [19]:
# --- Encode All Categorical Data to Numbers ---
print("\n--- Encoding All Text Data to Numbers ---")
# Create a copy to keep the original df unchanged
df_processed = df.copy() 
object_cols_to_encode = df_processed.select_dtypes(include=['object']).columns
if len(object_cols_to_encode) > 0:
    df_processed = pd.get_dummies(df_processed, columns=object_cols_to_encode, drop_first=True)
    print("SUCCESS: All text columns have been one-hot encoded.")
else:
    print("INFO: No text columns to encode.")


--- Encoding All Text Data to Numbers ---
SUCCESS: All text columns have been one-hot encoded.


### Step 4: Preprocessing Complete!
The `df_processed` DataFrame now contains your fully cleaned and model-ready data.

In [20]:
# Display the first 5 rows of the final processed data
print("Head of the final, processed DataFrame:")
print(df_processed.head())

Head of the final, processed DataFrame:
   Id  MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0   1          60         65.0     8450            7            5       2003   
1   2          20         80.0     9600            6            8       1976   
2   3          60         68.0    11250            7            5       2001   
3   4          70         60.0     9550            7            5       1915   
4   5          60         84.0    14260            8            5       2000   

   YearRemodAdd  MasVnrArea  BsmtFinSF1  ...  SaleType_ConLI  SaleType_ConLw  \
0          2003       196.0         706  ...           False           False   
1          1976         0.0         978  ...           False           False   
2          2002       162.0         486  ...           False           False   
3          1970         0.0         216  ...           False           False   
4          2000       350.0         655  ...           False           False   