## Setup: Libraries & Display Options

This cell imports all required libraries for:
- **Data handling**: `pandas`, `numpy`
- **Preprocessing**: imputers, encoders, pipelines, column transformers
- **Visualization**: `matplotlib`, `seaborn`
- **Modeling**: train/test split, linear & tree-based regressors, metrics

We also set a `pandas` display option to show **all columns** when printing DataFrames, which is helpful when inspecting encoded feature matrices.

In [None]:
# For Data manipulation and analytics
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For model training
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# setting display options for better viewing
pd.set_option('display.max_columns', None)

## Load Data & Initial Inspection

We load the **Ames Housing**-style dataset from `train.csv`.  
Then:
- Show the **first 5 rows** to eyeball columns and values.
- Print `df.info()` to see dtypes, non-null counts, and spot potential missing values.

> Tip: The code below reads from the Kaggle input directory. If your path or filename differs, update the argument to `pd.read_csv` accordingly.

In [None]:
# Load the training data
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')

print("First 5 rows of the dataset")
display(df.head())

print("\nInformation of Dataset:")
df.info()

## EDA: Distribution of SalePrice (Raw)

Plot the raw distribution of the target `SalePrice` using a histogram with kernel density estimation (KDE).  
Housing prices are usually **right-skewed**; verifying the skew helps justify a log transform later to stabilize variance and improve model fit.

In [None]:
# Plot the distribution of SalePrice
plt.figure(figsize = (10,6))
sns.histplot(df['SalePrice'], kde=True, bins=50)
plt.title('Distribution of Sale Prices')
plt.xlabel('Sale Price ($)')
plt.ylabel('Frequency')
plt.show()
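
The skew that the histogram shows visually can also be measured numerically. The following is a minimal sketch on synthetic log-normal data (a stand-in for `df['SalePrice']`, not the real dataset), using pandas' `Series.skew()`; the distribution parameters are arbitrary assumptions chosen only to produce a right-skewed sample.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; a stand-in for df['SalePrice'], not real data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000), name="SalePrice")

raw_skew = prices.skew()            # sample skewness of the raw values
log_skew = np.log1p(prices).skew()  # sample skewness after log1p

print(f"Skewness (raw):   {raw_skew:.3f}")
print(f"Skewness (log1p): {log_skew:.3f}")
```

A clearly positive raw skewness together with a value near zero after `log1p` supports applying the transform.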

## Transform Target: `log1p(SalePrice)`

Apply `np.log1p` to `SalePrice` (i.e., `log(1 + price)`).  
This reduces right skew, makes errors more homoscedastic, and often improves linear model performance.

> Note: To report metrics in original dollars, you’d need to `expm1` predictions and ground truth before scoring.

In [None]:
transformed_data = np.log1p(df['SalePrice'])
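
To make the note about reporting in original dollars concrete, here is a small sketch of the `log1p`/`expm1` round trip. The prices and the "predictions" are made-up values for illustration only; they do not come from the dataset or from any model in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices and log-scale predictions, for illustration only.
y_true_dollars = np.array([120000.0, 185000.0, 250000.0, 410000.0])
y_true_log = np.log1p(y_true_dollars)
y_pred_log = y_true_log + np.array([0.05, -0.02, 0.01, -0.08])  # pretend model output

# expm1 inverts log1p, so the round trip recovers the original prices.
roundtrip = np.expm1(y_true_log)

# The same predictions scored on the log scale and on the dollar scale.
mae_log = mean_absolute_error(y_true_log, y_pred_log)
mae_dollars = mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

print(f"MAE (log scale): {mae_log:.4f}")
print(f"MAE (dollars):   {mae_dollars:,.0f}")
```

The two numbers are not interchangeable: the log-scale error is roughly a relative error, while the dollar-scale error is dominated by expensive houses.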

## EDA: Distribution of SalePrice (Log-Scaled)

Re-plot the target after the log transform to confirm the distribution is now closer to **normal**.  
A more symmetric distribution usually benefits regression models and simplifies residual analysis.

In [None]:
# Plot the distribution of transformed SalePrice as it was skewed
plt.figure(figsize = (10,6))
sns.histplot(transformed_data, kde=True, bins=50)
plt.title('Distribution of log1p(Sale Price)')
plt.xlabel('log1p(Sale Price)')
plt.ylabel('Frequency')
plt.show()

## Commit Log-Scaled Target

Replace `df['SalePrice']` with the log-transformed values so that the rest of the pipeline uses the stabilized target consistently.

In [None]:
df['SalePrice'] = transformed_data

## Identify Categorical Columns & Target Variable

- Detect all **categorical** columns (dtype `object`) to plan encoding.
- Define `target_col = 'SalePrice'` for clarity and reusability.

This separation prepares us for a clean **feature/target** split.

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Identify the target variable
target_col = 'SalePrice'

print(f"Categorical features to encode: {list(categorical_cols)}")

## Split Features and Target

- `y` is the **log-transformed** `SalePrice`.
- `X` drops `Id` and `SalePrice` to keep only predictors.

This establishes the modeling matrices.

In [None]:
target_col = 'SalePrice'
y = df[target_col]
X = df.drop(['Id', target_col], axis=1)

## EDA: Correlation Heatmap of Numerical Features with SalePrice

Compute pairwise correlations for all **numerical features** and visualize each feature's correlation with the (log-scaled) `SalePrice` as a single-column heatmap, sorted from highest to lowest.  
- With the `coolwarm` palette, warmer (red) cells mark higher correlations and cooler (blue) cells mark lower ones.  
- Annotated values provide exact correlation coefficients.  

This helps quickly identify which numerical predictors (e.g., `OverallQual`, `GrLivArea`, `GarageCars`) are most strongly associated with housing prices.

In [None]:
# Correlation heatmap for numerical features
plt.figure(figsize=(15,10))
corr = df.corr(numeric_only=True)
sns.heatmap(corr[['SalePrice']].sort_values(by='SalePrice', ascending=False),
            annot=True, cmap='coolwarm')
plt.title("Correlation of Numerical Features with SalePrice")
plt.show()

## EDA: Scatter Plots of Top Predictors vs. SalePrice (Log-Scaled)

Visualize how the most correlated features relate to `SalePrice`.  
We generate scatter plots for 10 top predictors (`OverallQual`, `GrLivArea`, `GarageCars`, `GarageArea`, `TotalBsmtSF`, `1stFlrSF`, `FullBath`, `YearBuilt`, `YearRemodAdd`, `GarageYrBlt`).  
These plots help reveal **linear trends**, **outliers**, and whether relationships look suitable for regression models.

In [None]:
top_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 
                'TotalBsmtSF', '1stFlrSF', 'FullBath', 'YearBuilt', 
                'YearRemodAdd', 'GarageYrBlt']

plt.figure(figsize=(20, 25))

for i, col in enumerate(top_features, 1):
    plt.subplot(5, 2, i)  # 5 rows, 2 cols
    sns.scatterplot(x=df[col], y=df['SalePrice'], alpha=0.6)
    plt.title(f'{col} vs SalePrice')
    plt.xlabel(col)
    plt.ylabel('SalePrice')

plt.tight_layout()
plt.show()
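
The ten feature names above are hard-coded. As an alternative sketch, the top-k list can be derived from the correlation column itself. The example runs on synthetic data with made-up coefficients; the `Noise` column and the value of `k` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic frame: two informative columns and one pure-noise column.
rng = np.random.default_rng(42)
n = 500
quality = rng.integers(1, 11, size=n).astype(float)
area = rng.normal(1500, 400, size=n)
toy = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, size=n),
    "SalePrice": 11 + 0.1 * quality + 0.0003 * area + rng.normal(0, 0.1, size=n),
})

k = 2  # number of features to keep (hypothetical choice)
corr_with_target = toy.corr(numeric_only=True)["SalePrice"].drop("SalePrice")

# Rank by absolute correlation so strong negative predictors are kept too.
top_k = corr_with_target.abs().sort_values(ascending=False).head(k).index.tolist()
print(top_k)
```

Ranking by absolute value is a design choice: a feature with a strong negative correlation is as informative to a linear model as one with a strong positive correlation.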

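## Preprocess: Impute Missing Values in the Top Features

- Fill missing `GarageYrBlt` values with the house's `YearBuilt`.
- Median-impute the ten top numeric features with a `ColumnTransformer`; all other columns are dropped.
- Recompute and plot each feature's correlation with the log-scaled `SalePrice`.

Note that the imputer is fit on the full dataset here, before the train/test split, so the test rows contribute to the imputed medians.

In [None]: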
if 'GarageYrBlt' in X.columns and 'YearBuilt' in X.columns:
    X['GarageYrBlt'] = X['GarageYrBlt'].fillna(X['YearBuilt'])

numeric_pipe_top = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
])

preprocessor_top = ColumnTransformer(
    transformers=[("num", numeric_pipe_top, top_features)],
    remainder="drop",
    verbose_feature_names_out=False
)

# Fit/transform just the top features
X_top_array = preprocessor_top.fit_transform(X)
X_top = pd.DataFrame(X_top_array, columns=preprocessor_top.get_feature_names_out())

# If you want correlations with SalePrice for these top features:
corr_top = pd.concat([X_top, y.rename('SalePrice')], axis=1).corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)
print(corr_top.to_string())

# Quick bar plot of their correlations
plt.figure(figsize=(10,6))
corr_top.drop('SalePrice').plot(kind='bar')
plt.title('Correlation of Top Features with SalePrice')
plt.ylabel('Correlation')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## Split into Train/Test Sets

Create an 80/20 split with a fixed `random_state` for reproducibility.  
We print counts to verify the split sizes.

> Consistent splits enable **fair model comparisons**.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_top, y, test_size = 0.2, random_state = 42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
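
In this notebook the median imputer is fit before the split. A common leakage-free alternative is to split first and put the imputer and the estimator in one `Pipeline`, so the medians come from the training rows only. The sketch below uses synthetic data with invented coefficients and is not the notebook's own workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic features and a log-scale-like target (all values are made up).
rng = np.random.default_rng(0)
n = 300
X_toy = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, size=n),
    "GarageArea": rng.normal(450, 150, size=n),
})
y_toy = (11 + 0.0004 * X_toy["GrLivArea"] + 0.0005 * X_toy["GarageArea"]
         + rng.normal(0, 0.05, size=n))

# Knock out roughly 10% of GarageArea to simulate missing values.
missing = rng.random(n) < 0.1
X_toy.loc[missing, "GarageArea"] = np.nan

# Split first, so the test rows never influence the imputer.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

model = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # medians learned from X_tr only
    ("ridge", Ridge(alpha=10)),
])
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"Test R^2: {test_r2:.3f}")
```

Because the imputer lives inside the pipeline, `model.predict` applies the training medians to any new data automatically.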

## Validate: No NaNs After Preprocessing

Count `NaN` values in train and test features.  
This confirms that the median imputer produced a fully numeric, NaN-free matrix compatible with scikit-learn estimators. No encoders are involved, because only numeric features are kept.

In [None]:
print(np.isnan(X_train).sum()) 
print(np.isnan(X_test).sum())  

## Model Zoo: Train & Compare (LR, Ridge, Lasso, RF, GB)

Define and train five models on the **same split**:
- Linear Regression
- Ridge Regression (L2 regularization)
- Lasso Regression (L1 regularization; can shrink some coefficients to exactly zero, acting as feature selection)
- Random Forest (bagged trees; captures interactions; robust)
- Gradient Boosting (sequential trees; often among the strongest performers on tabular data)

Collect **R²/MAE/RMSE** for each and print a **sorted leaderboard** by R².  
This gives a concise, apples-to-apples comparison across approaches.

In [None]:
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=10),
    "Lasso Regression": Lasso(alpha=0.001, max_iter=5000),
    "Random Forest": RandomForestRegressor(n_estimators=300, max_depth=15, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                                   max_depth=4, random_state=42)
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    results.append([name, r2, mae, rmse])

results_df = pd.DataFrame(results, columns=["Model", "R²", "MAE", "RMSE"])
print(results_df.sort_values(by="R²", ascending=False).to_string(index=False))

## Visual Diagnostics: Actual vs Predicted (Per Model)

Define a plotting function and generate one **scatter plot** per model:
- X-axis: **Actual** (log) `SalePrice`
- Y-axis: **Predicted** (log) `SalePrice`
- Reference 45° line to gauge calibration

Each plot is saved as a PNG file for reporting.  
These plots reveal **bias**, **spread**, and whether predictions cluster off the ideal line.

In [None]:
def plot_actual_vs_pred(y_true, y_pred, title, save_path=None):
    plt.figure(figsize=(7, 6))
    plt.scatter(y_true, y_pred, alpha=0.6)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    plt.plot(lims, lims, '--r', linewidth=2)
    plt.title(title)
    plt.xlabel('Actual log1p(SalePrice)')
    plt.ylabel('Predicted log1p(SalePrice)')
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150)
    plt.show()


for name, model in models.items():
    y_pred = model.predict(X_test)
    plot_actual_vs_pred(y_test, y_pred, f'{name}: Actual vs Predicted',
                        save_path=f'{name.replace(" ", "_").lower()}_actual_vs_pred.png')

## Next Steps & Notes

- **Cross-Validation**: Use `cross_validate` with `cv=5` to reduce variance from a single split.
- **Hyperparameter Tuning**:
  - Random Forest: `n_estimators`, `max_depth`, `min_samples_split`, `max_features`
  - Gradient Boosting: `learning_rate`, `n_estimators`, `max_depth`, `subsample`
- **Boosting Libraries**: Try **XGBoost**, **LightGBM**, or **CatBoost** for stronger tabular performance.
- **Feature Engineering**:
  - Interaction terms (e.g., quality × area)
  - Age features (e.g., `YrSold - YearBuilt`, `YearRemodAdd` deltas)
  - Neighborhood aggregates (median price by area)
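
The cross-validation step listed above could look roughly like this. It is a minimal sketch on synthetic data from `make_regression`; the choice of `Ridge(alpha=10)` and of the three scorers is an assumption for illustration, not a recommendation derived from this notebook's results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression problem standing in for (X_top, y).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

cv_results = cross_validate(
    Ridge(alpha=10),
    X_demo, y_demo,
    cv=5,
    scoring=("r2", "neg_root_mean_squared_error", "neg_mean_absolute_error"),
)

# scikit-learn reports errors as negative scores, so flip the sign for reporting.
mean_r2 = cv_results["test_r2"].mean()
mean_rmse = -cv_results["test_neg_root_mean_squared_error"].mean()
mean_mae = -cv_results["test_neg_mean_absolute_error"].mean()

print(f"R^2:  {mean_r2:.3f}")
print(f"RMSE: {mean_rmse:.3f}")
print(f"MAE:  {mean_mae:.3f}")
```

Averaging over five folds gives a steadier estimate than the single 80/20 split used for the leaderboard.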

> Remember: Metrics are on the **log scale** here. For business-facing reporting, inverse-transform to price dollars.

## Generate Kaggle Submission (Id, SalePrice)

This cell:
1) Loads the competition `test.csv`.  
2) Applies the **same preprocessing** used for training (median imputation of the selected numeric features).  
3) Selects the same **top-feature slice** used for training; candidate names that the preprocessor does not output are skipped.  
4) Trains the chosen model on **all training data** (for best final performance).  
5) Predicts on the test set and **inverse-transforms** from the log scale (`expm1`).  
6) Writes `submission.csv` with columns **Id, SalePrice**.

Note: Kaggle scores this competition on the RMSE between the logarithms of predicted and actual prices, but the file you submit must contain **actual prices** (not log values).

In [None]:
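test_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
test_ids = test_df["Id"].copy()

# Mirror the training-side fix: fall back to YearBuilt when GarageYrBlt is missing
if 'GarageYrBlt' in test_df.columns and 'YearBuilt' in test_df.columns:
    test_df['GarageYrBlt'] = test_df['GarageYrBlt'].fillna(test_df['YearBuilt'])

# Refit the preprocessor on all training rows, then transform train and test
preprocessor_top.fit(X)
X_all_prepared = preprocessor_top.transform(X)
X_test_prepared = preprocessor_top.transform(test_df.drop(columns=["Id"]))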

feature_names_all = preprocessor_top.get_feature_names_out()
valid_top_features = [
    'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'ExterQual',
    'KitchenQual', 'BsmtQual', 'TotalBsmtSF', '1stFlrSF', 'FullBath',
    'YearBuilt', 'YearRemodAdd', 'GarageFinish', 'TotRmsAbvGrd',
    'Fireplaces', 'HeatingQC', 'MasVnrArea', 'PoolQC',
    'GarageYrBlt', 'Condition1_Norm' 
]
# Keep only the candidates that the preprocessor actually outputs (the numeric top features)
present = [f for f in valid_top_features if f in feature_names_all]
top_features_indices = [np.where(feature_names_all == f)[0][0] for f in present]

X_all_top = X_all_prepared[:, top_features_indices]
X_test_top = X_test_prepared[:, top_features_indices]

final_model = RandomForestRegressor(n_estimators=300, max_depth=15, random_state=42)

final_model.fit(X_all_top, y)             
y_test_pred_log = final_model.predict(X_test_top)

y_test_pred = np.expm1(y_test_pred_log)    
y_test_pred = np.maximum(y_test_pred, 0.0) 

submission = pd.DataFrame({
    "Id": test_ids,
    "SalePrice": y_test_pred
})

submission_path = "/kaggle/working/submission.csv"
submission.to_csv(submission_path, index=False)

print("Saved:", submission_path)
submission.head()
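
To sanity-check a submission-style prediction locally, the log-based error described above can be computed directly from dollar-scale prices. The helper below is a hypothetical sketch: the name `rmsle` and the sample prices are invented, and it uses `log1p` to match this notebook's transform (for prices of this size the difference from a plain `log` is negligible).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log1p(actual) and log1p(predicted) prices (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up dollar-scale prices, for illustration only.
actual = np.array([100000.0, 200000.0, 300000.0])
predicted = np.array([110000.0, 190000.0, 300000.0])

score = rmsle(actual, predicted)
print(f"RMSLE: {score:.4f}")
```

Since the models here are trained on `log1p(SalePrice)`, the RMSE they report on the log scale is this same quantity, which is why no inverse transform is needed for scoring, only for the submission file.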