# Overview

1. **Imports, File Path, and Basic Setup**  
2. **Load the Data**  
3. **Drop Columns with Too Many Missing Values**  
4. **Drop Rows with Missing Target**  
5. **Fill Remaining Missing Values**  
6. **Remove Outliers (Simple Approach)**  
7. **Encode Categorical Variables (Two Methods)**  
8. **Transform the Target to Reduce Skew**  
9. **Correlation Analysis**  
10. **Demonstrate Feature Scaling**  
11. **Final Preview**

---

We will use the **“AmesHousing.csv”** dataset and treat **"SalePrice"** as the target for a linear regression model.  
If you have a different dataset, simply adjust the file path and target column accordingly.


## STEP 1: Imports, File Path, and Basic Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import skew
from sklearn.preprocessing import (
    PowerTransformer, 
    StandardScaler, 
    MinMaxScaler, 
    PolynomialFeatures
)

# Our dataset location
file_path = "AmesHousing.csv"
# The column we want to predict
target_col = "SalePrice"


## Why These Imports?

- **pandas**: For data handling  
- **numpy**: For numeric calculations  
- **matplotlib/seaborn**: For plotting and data visualization  
- **sklearn.preprocessing**: For transformations (scaling and power transforms)  
- **scipy.stats**: For measuring skewness (`skew`)  


## STEP 2: Load the Data

In [None]:
# Load the CSV into a DataFrame
df = pd.read_csv(file_path)

# Show the first 5 rows (to see what the data looks like)
print("Data Loaded Successfully!\n")
print(df.head())

# Shape tells us how many rows and columns
print("\nShape of the dataset:", df.shape)

# info() shows column names, data types, and where missing values might be
df.info()


## Note

- `df.head()`: Lets us see the top rows to get a feel for what columns we have.  
- `df.info()`: Shows how many non-missing (non-null) entries each column has.  


## From the Output:

### Missing Values:
- **Alley** has only 198 non-null entries out of 2930, so 2,732 values are missing.  
- Similarly, **Pool QC** has just 13 non-null entries (2,917 missing).

### Data Types:
- **int64**: Usually represents whole numbers (e.g., `Year Built`, `Overall Qual`).  
- **float64**: Often represents numeric data that can have decimals (e.g., `Lot Frontage`).  
- **object**: Typically represents text or categorical data (e.g., `Street`, `Neighborhood`).  
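
To rank columns by how much data they are missing, one option is a small summary table. The sketch below uses a made-up `toy` DataFrame (not the Ames data) and a hypothetical `missing_summary` name; with the real data, the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data (values are invented for illustration)
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "Lot Frontage": [65.0, np.nan, 80.0, 70.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Count and fraction of missing values per column, most-missing first
missing_summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "frac_missing": toy.isnull().mean(),
    "dtype": toy.dtypes.astype(str),
}).sort_values("frac_missing", ascending=False)

print(missing_summary)
```

Sorting by `frac_missing` puts the columns that are candidates for dropping at the top of the table.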


## STEP 3: Drop Columns with Too Many Missing Values

In a real dataset, some columns may have lots of missing data.  
For DEMO purposes, we will drop any column if more than **30%** of its entries are missing.  
You can choose a different threshold based on your needs.
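
As a quick illustration of how the threshold behaves, here is a minimal sketch on a made-up DataFrame. The column names and values are invented for the example; only columns whose missing fraction is above the threshold are left out.

```python
import numpy as np
import pandas as pd

# Invented example data: three columns with different amounts of missing values
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 4 of 5 missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 1 of 5 missing
    "complete": [1, 2, 3, 4, 5],                              # none missing
})

threshold = 0.3

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_fraction = toy.isnull().mean()

# Keep only the columns at or below the threshold (non-mutating variant)
kept = toy.loc[:, missing_fraction <= threshold]

print(missing_fraction)
print("Kept columns:", list(kept.columns))
```

Unlike the in-place version used in this notebook, this variant returns a new DataFrame and leaves `toy` unchanged.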


In [None]:
def drop_high_missing_columns(dataframe, threshold=0.3):
    """
    Drop any column that has a fraction of missing values 
    greater than the given threshold (default=0.3).
    
    For example, 0.3 means 30% of that column's values are missing.
    """
    # fraction of missing values for each column
    missing_fraction = dataframe.isnull().mean()
    
    # which columns are above the threshold
    cols_to_drop = missing_fraction[missing_fraction > threshold].index
    print(f"Columns to drop (>{threshold:.0%} missing):", list(cols_to_drop))
    
    # drop them in place
    dataframe.drop(columns=cols_to_drop, inplace=True)
    
    return dataframe

df = drop_high_missing_columns(df, threshold=0.3)


## STEP 4: Drop Rows with Missing Target

We cannot train a model if our target (**"SalePrice"**) is missing.

In [None]:
def drop_missing_target(dataframe, target_column):
    """
    Drop rows that have missing values for the target column.
    """
    if target_column not in dataframe.columns:
        print(f"Target column '{target_column}' not found. Skipping.")
        return dataframe
    
    before = len(dataframe)
    dataframe.dropna(subset=[target_column], inplace=True)
    after = len(dataframe)
    print(f"Dropped {before - after} rows that had missing {target_column}.")
    
    return dataframe

df = drop_missing_target(df, target_col)


## STEP 5: Fill Remaining Missing Values

Now we decide how to fill in the other missing values. A simple approach:  

- **Numeric columns**: Fill with the median value.  
- **Categorical (string/object) columns**: Fill with `"Missing"`.  
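
The two rules above can be sketched on a tiny example. The `toy` DataFrame and its values are invented for illustration; the column names are borrowed from the dataset only to make the output easier to read.

```python
import numpy as np
import pandas as pd

# Invented example data with one missing value in each column
toy = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 80.0, 100.0],
    "Garage Cond": ["TA", np.nan, "Fa", "TA"],
})

numeric_cols = toy.select_dtypes(include=[np.number]).columns
categorical_cols = toy.select_dtypes(exclude=[np.number]).columns

filled = toy.copy()
# Numeric columns: each column is filled with its own median
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
# Categorical columns: a literal "Missing" label becomes its own category
filled[categorical_cols] = filled[categorical_cols].fillna("Missing")

print(filled)
```

The median is computed from the non-missing values only, so the gap in `Lot Frontage` is filled with the median of 60, 80, and 100.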


In [None]:
def fill_missing_values(dataframe):
    """
    Fills missing numeric values with the median,
    and missing categorical values with 'Missing'.
    """
    numeric_cols = dataframe.select_dtypes(include=[np.number]).columns
    categorical_cols = dataframe.select_dtypes(exclude=[np.number]).columns

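    # Numeric -> median
    # (assign the result back: calling fillna(inplace=True) on a selected column
    # triggers a warning in recent pandas and may not modify the DataFrame)
    for col in numeric_cols:
        if dataframe[col].isnull().any():
            dataframe[col] = dataframe[col].fillna(dataframe[col].median())

    # Categorical -> "Missing"
    for col in categorical_cols:
        if dataframe[col].isnull().any():
            dataframe[col] = dataframe[col].fillna("Missing")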
    
    return dataframe

### Check Missing Values Before:
- Let us see how many missing values exist in total.  
- We will also inspect an example column (**Garage Cond**) that was previously missing in many rows.


In [None]:
# Count total missing values across the entire DataFrame
missing_before = df.isnull().sum().sum()
print(f"Total missing values (BEFORE): {missing_before}")

# Example: Check the unique values in 'Garage Cond' column (if it exists)
if 'Garage Cond' in df.columns:
    print("\nUnique values in 'Garage Cond' BEFORE filling:")
    print(df['Garage Cond'].unique()[:10])  # Just show up to 10 unique values
else:
    print("\n'Garage Cond' column not found.")


### Fill Missing Values

In [None]:
df = fill_missing_values(df)

### Check Missing Values After

In [None]:
# Count total missing values across the entire DataFrame again
missing_after = df.isnull().sum().sum()
print(f"\nTotal missing values (AFTER): {missing_after}")

# Check the 'Garage Cond' column again
if 'Garage Cond' in df.columns:
    print("\nUnique values in 'Garage Cond' AFTER filling:")
    print(df['Garage Cond'].unique()[:10])
else:
    print("\n'Garage Cond' column not found.")


## STEP 6: Remove Outliers

For DEMO purposes, let us remove houses where **"Gr Liv Area"** (above-ground living area) is greater than **4000 square feet**.  
These are rare and can skew linear models.


In [None]:
def remove_outliers(dataframe, col_name="Gr Liv Area", upper_limit=4000):
    """
    Remove any row where `col_name` exceeds `upper_limit`
    (by default, 'Gr Liv Area' above 4000 sq ft; a simple example).
    We'll also show a before/after boxplot of this column.
    """
    if col_name not in dataframe.columns:
        print(f"Column '{col_name}' not found. Skipping outlier removal.")
        return dataframe
    
    df_before = dataframe.copy()
    
    # Boxplot before
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    sns.boxplot(x=df_before[col_name])
    plt.title(f"{col_name} - Before")
    
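    # Keep rows at or below the limit, i.e. remove values greater than upper_limit.
    # .copy() returns an independent DataFrame, so later in-place edits are safe.
    dataframe = dataframe[dataframe[col_name] <= upper_limit].copy()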
    
    # Boxplot after
    plt.subplot(1, 2, 2)
    sns.boxplot(x=dataframe[col_name])
    plt.title(f"{col_name} - After")
    plt.tight_layout()
    plt.show()
    
    return dataframe

df = remove_outliers(df, col_name="Gr Liv Area", upper_limit=4000)


# Boxplots: Before and After Removing Outliers

## 1. Boxplot Before Removing Outliers (Left Plot)
The left boxplot shows the distribution of "Gr Liv Area" values prior to outlier removal.

- Several data points exceed 4,000 square feet. The boxplot draws them as individual points beyond the whiskers, which is how it marks potential outliers.
- In a least-squares fit, a few extreme points like these can have a large influence on the fitted coefficients.

## 2. Boxplot After Removing Outliers (Right Plot)
The right boxplot shows the same data after filtering out values greater than 4,000 square feet.

## 3. Key Takeaways:
- Removing rare, atypical observations keeps a handful of points from dominating the fit.
- The regression then reflects the pattern of typical houses, at the cost of no longer describing very large ones.
- Whether this improves predictions depends on the data, so it is worth comparing model performance with and without the filter.

## Note

In reality, outlier handling depends on context.  
Here, we are removing large houses simply as an example.
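
For comparison, the points a boxplot draws beyond its whiskers follow the common 1.5 × IQR rule, which is a different technique from the fixed 4,000 sq ft cutoff used in this notebook. The sketch below applies that rule to a made-up series of living areas; the values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Invented living-area values; the last two are deliberately extreme
area = pd.Series(
    [900, 1100, 1250, 1400, 1500, 1650, 1800, 2100, 4500, 5600],
    name="Gr Liv Area",
)

# Interquartile range: spread of the middle 50% of the data
q1 = area.quantile(0.25)
q3 = area.quantile(0.75)
iqr = q3 - q1

# Values above Q3 + 1.5 * IQR are flagged as potential high outliers
upper_whisker_limit = q3 + 1.5 * iqr
flagged = area[area > upper_whisker_limit]

print(f"Upper limit: {upper_whisker_limit}")
print("Flagged values:", flagged.tolist())
```

A data-driven limit like this adapts to each column, whereas a fixed cutoff is easier to explain and to justify with domain knowledge.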


## STEP 7: Encode Categorical Variables

Linear regression, like most machine learning models, requires numeric input.  
So we must convert (“encode”) any categorical columns.  

- **Frequency Encoding**:  
  For columns with many categories (over 10), we replace each category with how often it appears in the dataset (a fraction from 0 to 1).  

- **One-Hot Encoding**:  
  For columns with few categories, we create a new column of 0s and 1s for each category. One category is left out because it is implied when all the other columns are 0, which avoids redundant (perfectly collinear) columns.  
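
Both encodings can be seen on a tiny example. The `toy` DataFrame below is invented for illustration, and both methods are applied regardless of category count just to show what each one produces.

```python
import pandas as pd

# Invented example data
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "OldTown", "Edwards"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Frequency encoding: replace each category with its share of the rows
freq_map = toy["Neighborhood"].value_counts(normalize=True)
neighborhood_freq = toy["Neighborhood"].map(freq_map)

# One-hot encoding: one 0/1 column per category, with the first
# category (alphabetically, here "Grvl") dropped as the baseline
street_dummies = pd.get_dummies(
    toy["Street"], prefix="Street", drop_first=True, dtype=int
)

print(neighborhood_freq.tolist())
print(street_dummies)
```

Note that frequency encoding maps two different categories to the same number whenever they occur equally often (here `OldTown` and `Edwards`), so some information is lost in exchange for a single compact column.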


In [None]:
def encode_categorical_features(dataframe, freq_threshold=10):
    """
    If a column has more than `freq_threshold` unique categories,
    we use frequency encoding (replacing each category with its fraction of occurrences).
    Otherwise, we do one-hot encoding (creating new 0/1 columns for each category).

    This version prints out:
      1) Which columns were frequency-encoded vs. one-hot-encoded
      2) A small sample (first 5 rows) of one frequency-encoded column
         and one one-hot-encoded column (if any exist).
    """
    # Identify categorical columns (object dtype)
    cat_cols = dataframe.select_dtypes(include=["object"]).columns
    
    one_hot_frames = []
    freq_frames = {}
    
    # Keep track of columns that were encoded by which method
    freq_encoded_columns = []
    one_hot_encoded_columns = []
    
    # Decide how to encode each categorical column
    for col in cat_cols:
        unique_count = dataframe[col].nunique()
        
        if unique_count > freq_threshold:
            # Frequency encoding
            freq_map = dataframe[col].value_counts(normalize=True)
            freq_frames[col + "_freq"] = dataframe[col].map(freq_map)
            freq_encoded_columns.append(col)
        else:
            # One-hot encoding
            # dtype=int gives 0/1 columns (recent pandas versions default to True/False)
            dummies = pd.get_dummies(dataframe[col], prefix=col, drop_first=True, dtype=int)
            one_hot_frames.append(dummies)
            one_hot_encoded_columns.append(col)

    # Merge frequency-encoded columns back
    if freq_frames:
        freq_df = pd.DataFrame(freq_frames, index=dataframe.index)
        dataframe = pd.concat([dataframe, freq_df], axis=1)

    # Merge one-hot-encoded columns back
    if one_hot_frames:
        ohe_df = pd.concat(one_hot_frames, axis=1)
        dataframe = pd.concat([dataframe, ohe_df], axis=1)

    # Drop original categorical columns (replaced by numeric ones)
    dataframe.drop(columns=cat_cols, inplace=True)
    
    # ---------------------------
    # PRINT A SUMMARY OF RESULTS
    # ---------------------------
    print("\nEncoding Summary:")

    # Frequency-encoded columns
    if freq_encoded_columns:
        print(f"  Frequency-encoded columns (>{freq_threshold} unique categories):")
        for col in freq_encoded_columns:
            print(f"    - {col}")
    else:
        print(f"  No columns had more than {freq_threshold} unique categories.")

    # One-hot-encoded columns
    if one_hot_encoded_columns:
        print(f"\n  One-hot-encoded columns (<= {freq_threshold} unique categories):")
        for col in one_hot_encoded_columns:
            print(f"    - {col}")
    else:
        print(f"  No columns had {freq_threshold} or fewer unique categories.")

    # ---------------------------
    # SHOW EXAMPLES OF TRANSFORMED COLUMNS
    # ---------------------------
    # 1) Example of a frequency-encoded column
    if freq_encoded_columns:
        freq_example = freq_encoded_columns[0]               # just pick the first column we freq-encoded
        freq_example_col = freq_example + "_freq"            # this is how we named it above
        if freq_example_col in dataframe.columns:
            print(f"\nExample of frequency-encoded column: '{freq_example_col}'")
            display(dataframe[[freq_example_col]].head(5))
    
    # 2) Example of a one-hot-encoded column
    if one_hot_encoded_columns:
        oh_example = one_hot_encoded_columns[0]              # pick the first column we one-hot-encoded
        # Our new one-hot columns for this feature start with oh_example + "_"
        oh_cols = [c for c in dataframe.columns if c.startswith(oh_example + "_")]
        if oh_cols:
            print(f"\nExample of one-hot-encoded columns from '{oh_example}': {oh_cols}")
            display(dataframe[oh_cols].head(5))

    return dataframe


# Usage Example
df = encode_categorical_features(df, freq_threshold=10)


## STEP 8: Transform the Target to Reduce Skew

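House prices and other monetary values are often heavily right-skewed: most values are moderate and a few are very large.  
Linear regression does not strictly require a normally distributed target, but a strongly skewed target often leads to skewed residuals and gives the largest values outsized influence, so reducing the skew usually helps.  

We will use **PowerTransformer** (Yeo-Johnson method) to make it more normal-like. By default it also standardizes the result to zero mean and unit variance, so the transformed values are no longer in dollars.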


In [None]:
def transform_target_skew(dataframe, target_column="SalePrice"):
    """
    Applies the PowerTransformer to the target column 
    to reduce skew and make the data more normal-like.
    """
    if target_column not in dataframe.columns:
        print(f"Target column '{target_column}' not found. Skipping.")
        return dataframe
    
    # Before transformation
    target_before = dataframe[target_column].copy()
    skew_before = skew(target_before)
    print(f"Initial {target_column} skew: {skew_before:.2f}")

    # Apply the Yeo-Johnson transform
    transformer = PowerTransformer(method="yeo-johnson")
    dataframe[[target_column]] = transformer.fit_transform(dataframe[[target_column]])

    # After transformation
    target_after = dataframe[target_column].copy()
    skew_after = skew(target_after)
    print(f"Transformed {target_column} skew: {skew_after:.2f}")

    # Compare the distributions
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    sns.histplot(target_before, kde=True)
    plt.title(f"{target_column} - Before Transform")

    plt.subplot(1, 2, 2)
    sns.histplot(target_after, kde=True)
    plt.title(f"{target_column} - After Transform")

    plt.tight_layout()
    plt.show()

    return dataframe

df = transform_target_skew(df, target_col)


# Interpretation of SalePrice Transformation

## Before Transformation (Left Plot)
The histogram shows the **SalePrice** distribution before applying the PowerTransformer.

- The distribution is **right-skewed**, with a long tail stretching toward higher values.
- Most house prices are clustered at the lower end, with fewer high-priced houses.
- This skewness tends to produce skewed residuals in a linear model and lets the few expensive houses dominate the squared-error fit.


## After Transformation (Right Plot)
The histogram shows the **SalePrice** distribution after applying the PowerTransformer (Yeo-Johnson method).

- The transformed distribution is now **approximately symmetric and bell-shaped**, centered around the mean.
- The extreme right tail has been compressed, reducing the skewness to nearly zero.
- A more symmetric target is usually a better match for the assumptions behind linear regression, although the model's predictions are now on the transformed scale.

## Overall Impact:
- **Before Transformation:** The skewness (1.59) indicates a strongly right-skewed distribution.
- **After Transformation:** The skewness is reduced to 0.00, so the distribution is now symmetric. Zero skew alone does not prove normality, but the histogram is close to bell-shaped.
- **Takeaway:** The transformation removes most of the distortion caused by the skewed target, which tends to make a linear regression fit more reliable. To report predictions in dollars, they must be mapped back with the fitted transformer's `inverse_transform`.
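
Because a model trained on the transformed target predicts on the transformed scale, it helps to keep the fitted transformer around. The sketch below is a minimal illustration on synthetic, made-up prices (in thousands, drawn from a log-normal distribution); it is not the Ames data, and the variable names are assumptions.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed "prices" in thousands (invented for illustration)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=5.0, sigma=0.4, size=1000).reshape(-1, 1)

# Fit the Yeo-Johnson transform and keep the fitted object
transformer = PowerTransformer(method="yeo-johnson")
prices_t = transformer.fit_transform(prices)

skew_before = skew(prices.ravel())
skew_after = skew(prices_t.ravel())
print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")

# Map transformed values (e.g. model predictions) back to the original scale
recovered = transformer.inverse_transform(prices_t)
print("Round trip matches:", np.allclose(recovered, prices, rtol=1e-6))
```

In practice the same fitted `transformer` would be applied to a model's predictions, so one design option is to have the transformation step return the transformer alongside the DataFrame.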


## STEP 9: Correlation Analysis

Correlation helps us see how strongly each feature is related to our target.  
We also display a heatmap of all correlations.
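
As a small illustration of what the ranking looks like, the sketch below builds a synthetic dataset in which `price` is constructed to rise with `area` and fall with `age`. All names and numbers are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends positively on area and negatively on age
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
age = rng.normal(40, 10, n)
noise_col = rng.normal(0, 1, n)  # unrelated to price by construction
price = 100 * area - 2000 * age + rng.normal(0, 5000, n)

toy = pd.DataFrame(
    {"area": area, "age": age, "noise_col": noise_col, "price": price}
)

# Correlation of every feature with the target, excluding the target itself
target_corr = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a feature with a near-zero value here can still be useful to a model in a non-linear way.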


In [None]:
def correlation_analysis(dataframe, target_column="SalePrice"):
    """
    Prints out the correlation of each column with the target,
    and then shows a full correlation heatmap.
    """
    if target_column not in dataframe.columns:
        print(f"Target column '{target_column}' not found. Skipping correlation analysis.")
        return

    corr_matrix = dataframe.corr(numeric_only=True)  # numeric_only requires pandas 1.5+
    # Drop the target itself by name, plus any undefined (NaN) correlations,
    # which occur for constant columns and would otherwise sort to the end
    target_corr = (
        corr_matrix[target_column]
        .drop(target_column)
        .dropna()
        .sort_values(ascending=False)
    )

    print("\nTop 10 features MOST positively correlated with target:")
    print(target_corr.head(10))

    print("\nTop 10 features MOST negatively correlated with target:")
    print(target_corr.tail(10))

    # Plot a heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, cmap="coolwarm", annot=False, square=True)
    plt.title("Correlation Matrix Heatmap")
    plt.show()

correlation_analysis(df, target_column=target_col)


## STEP 10: Demonstrate Feature Scaling

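Standardizing features with **StandardScaler** gives each one zero mean and unit variance.  
For ordinary least squares with an intercept this does not change the predictions, but it makes coefficients comparable when feature ranges differ greatly, and it matters for regularized or gradient-based models.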
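
What StandardScaler computes can be checked by hand: subtract each column's mean and divide by its population standard deviation. The sketch below does this on a made-up two-column DataFrame; the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented example data with very different ranges per column
toy = pd.DataFrame({
    "Gr Liv Area": [900.0, 1200.0, 1500.0, 2400.0],
    "Overall Qual": [4.0, 5.0, 6.0, 9.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Same calculation by hand; ddof=0 matches the population standard
# deviation that StandardScaler uses
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
print("Matches manual calculation:", np.allclose(scaled, manual))
```

Scaling shifts and stretches the axis but leaves the shape of each distribution unchanged, which is why the before and after histograms look alike apart from their x-axis values.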


In [None]:
def scale_features_demo(dataframe, target_column="SalePrice", col_to_plot="Gr Liv Area"):
    """
    Shows how one column looks before and after StandardScaler.
    """
    # Separate the features from the target, if the target column exists
    if target_column in dataframe.columns:
        X = dataframe.drop(columns=[target_column])
    else:
        X = dataframe.copy()

    # Fit the StandardScaler
    scaler_std = StandardScaler()
    X_std = scaler_std.fit_transform(X)
    
    # Convert to a DataFrame for easy plotting
    X_std_df = pd.DataFrame(X_std, columns=X.columns, index=X.index)

    # Plot original vs. standard-scaled distributions for col_to_plot
    if col_to_plot in X.columns:
        plt.figure(figsize=(12, 4))

        # Original
        plt.subplot(1, 2, 1)
        sns.histplot(X[col_to_plot], kde=True, color="skyblue")
        plt.title(f"{col_to_plot} - Original")

        # StandardScaled
        plt.subplot(1, 2, 2)
        sns.histplot(X_std_df[col_to_plot], kde=True, color="orange")
        plt.title(f"{col_to_plot} - StandardScaled")

        plt.tight_layout()
        plt.show()
    else:
        print(f"Column '{col_to_plot}' was not found. Cannot plot distribution.")
        
scale_features_demo(df, target_column=target_col, col_to_plot="Gr Liv Area")



## STEP 11: Final Preview

Let us see how our DataFrame looks now and what its final shape is.


In [None]:
print("\nPreview of the processed DataFrame (unscaled):")
display(df.head())

print("\nFinal shape of processed data:", df.shape)
