# **Homework Assignment: Data Preprocessing for Machine Learning**

## **Dataset**

Use the **California Housing Dataset** from Aurélien Géron's GitHub repo:

In [3]:
import os
import tarfile
import urllib.request
import pandas as pd

DOWNLOAD_ROOT = "https://github.com/ageron/data/raw/main/"
DATASETS_PATH = "datasets"
HOUSING_PATH = os.path.join(DATASETS_PATH, "housing")
HOUSING_URL = DOWNLOAD_ROOT + "housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, datasets_path=DATASETS_PATH):
    """Download housing.tgz and extract it under datasets_path; return True on success."""
    try:
        # Create the directory if it doesn't exist
        os.makedirs(datasets_path, exist_ok=True)

        # Download the archive
        tgz_path = os.path.join(datasets_path, "housing.tgz")
        print(f"Downloading {housing_url}...")
        urllib.request.urlretrieve(housing_url, tgz_path)

        # Extract into datasets/ rather than datasets/housing/: the archive
        # already contains a top-level housing/ folder, so extracting into
        # datasets/housing/ would put the CSV at datasets/housing/housing/housing.csv
        # and the existence check below would fail.
        print(f"Extracting {tgz_path}...")
        with tarfile.open(tgz_path) as housing_tgz:
            housing_tgz.extractall(path=datasets_path)

        print("Download and extraction completed successfully!")
        return True
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

# Try to fetch the data
if fetch_housing_data():
    # If download and extraction were successful, read the CSV
    csv_path = os.path.join(HOUSING_PATH, "housing.csv")
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
        print("Data loaded successfully!")
        print(df.head())
    else:
        print(f"Error: Could not find {csv_path} after download")
else:
    print("Failed to download the housing data")


## **Part 1: Exploratory Data Analysis (EDA)**

1. Display:

   * The first 10 rows.
   * Dataset info using `.info()`.
   * Summary statistics using `.describe()`.
   * Value counts for categorical columns (e.g., `ocean_proximity`).

2. Identify:

   * Columns with missing values.
   * Numerical vs categorical features.
   * Columns with unusual distributions or outliers.
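
The checks in steps 1 and 2 can be sketched on a small stand-in frame. The `toy` DataFrame and its values are made up for illustration, and the 1.5 × IQR outlier rule in `iqr_outlier_mask` is one common heuristic, not something the assignment prescribes. On the real data, the same calls would be made on the housing DataFrame instead of `toy`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the housing data (values are illustrative only).
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 2.1],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, 280.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "INLAND"],
})

print(toy.head(10))      # first 10 rows (fewer if the frame is shorter)
toy.info()               # dtypes and non-null counts per column
print(toy.describe())    # summary statistics for the numeric columns
proximity_counts = toy["ocean_proximity"].value_counts()
print(proximity_counts)

# Columns that contain at least one missing value.
missing_cols = toy.columns[toy.isna().any()].tolist()

# Split features by dtype: numeric vs. everything else (treated as categorical).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

def iqr_outlier_mask(series):
    """Flag values more than 1.5 * IQR outside the quartiles (a heuristic)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
```

Comparing the `mean`, `50%`, and `max` rows of `.describe()` is a quick way to spot skewed columns before plotting anything.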

---

## **Part 2: Handling Missing Values**

3. For missing data:

   * Drop rows or columns only where the share of missing values is negligible.
   * Use **median** imputation for `total_bedrooms`.

4. Create a `missing_report(df)` function that:

   * Returns a DataFrame with one row per column, giving the column name and the count and percentage of missing values.
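
One possible shape for `missing_report`, followed by median imputation, is sketched below. The output column names (`column`, `missing_count`, `missing_pct`), the sort order, and the `toy` data are assumptions made for illustration; the assignment only fixes the function name and what it reports.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return one row per column with the count and percentage of missing values."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "column": counts.index,
        "missing_count": counts.to_numpy(),
        "missing_pct": (counts.to_numpy() / len(df) * 100).round(2),
    })
    # Most incomplete columns first.
    return report.sort_values("missing_count", ascending=False).reset_index(drop=True)

# Made-up stand-in data with one gap in total_bedrooms.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

report = missing_report(toy)
print(report)

# Median imputation: fill gaps with the median of the observed values.
bedrooms_median = toy["total_bedrooms"].median()
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(bedrooms_median)
```

The median is less sensitive to extreme values than the mean, which is a common reason to prefer it for skewed count columns.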

---

## **Part 3: Encoding Categorical Variables**

5. Encode the `ocean_proximity` column:

   * Use **One-Hot Encoding** via `pd.get_dummies()` or `OneHotEncoder`.
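
A minimal sketch of the `pd.get_dummies()` route on made-up data follows. The `toy` frame is an assumption for illustration, and `dtype=int` is an optional choice that yields 0/1 columns instead of booleans.

```python
import pandas as pd

# Made-up stand-in data with a single categorical column.
toy = pd.DataFrame({
    "median_income": [8.3, 5.6, 3.8],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Replace ocean_proximity with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(encoded)
```

scikit-learn's `OneHotEncoder` does the same job but remembers the categories seen during `fit`, which tends to matter when the same encoding must later be applied to new data.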

---

## **Part 4: Feature Scaling**

6. For numerical features:

   * Apply both **StandardScaler** and **MinMaxScaler** to features like:

     * `median_income`, `housing_median_age`, `population`, `median_house_value`
   * Plot feature histograms before and after scaling.
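
The two scalers can be compared side by side on a small made-up frame, as sketched below. The `toy` values are assumptions for illustration, and wrapping the results back into DataFrames is a convenience for plotting and inspection, not a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up stand-in data for two numeric features.
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "population": [300.0, 500.0, 900.0, 2300.0],
})

# StandardScaler: subtract the column mean, divide by the column standard deviation.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# MinMaxScaler: map each column linearly onto [0, 1] (the default range).
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

print(standardized)
print(normalized)
```

For the before-and-after plots, calling `.hist()` on `toy`, `standardized`, and `normalized` (this needs matplotlib installed) draws one histogram per column. Both scalers are linear, so the shape of each histogram should stay the same while the axis range changes.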

---

## **Part 5: Optional Feature Engineering**

7. Create meaningful new features:

   * `rooms_per_household = total_rooms / households`
   * `bedrooms_per_room = total_bedrooms / total_rooms`
   * `population_per_household = population / households`
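
The three ratios are plain column-wise divisions, sketched here on made-up numbers (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up stand-in data for two districts.
toy = pd.DataFrame({
    "total_rooms": [800.0, 6000.0],
    "total_bedrooms": [200.0, 1500.0],
    "population": [300.0, 2400.0],
    "households": [100.0, 1200.0],
})

# Each new feature is an element-wise ratio of two existing columns.
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]
toy["bedrooms_per_room"] = toy["total_bedrooms"] / toy["total_rooms"]
toy["population_per_household"] = toy["population"] / toy["households"]

print(toy[["rooms_per_household", "bedrooms_per_room", "population_per_household"]])
```

Because a missing `total_bedrooms` value propagates as `NaN` into `bedrooms_per_room`, it is usually easier to compute these ratios after the imputation in Part 2.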