

# House Prices Prediction

## Objective
The main objective here is to predict the final sale price of each home, which makes this a regression problem. We will use the provided dataset, perform Exploratory Data Analysis (EDA), preprocess the data, and train a Gradient Boosting model to generate predictions.

## 1. Setup and Data Loading
We start by importing the necessary libraries and loading the training and testing datasets.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from scipy.stats import skew

# Settings for better readability
pd.set_option('display.max_columns', None)
plt.style.use('ggplot')

# Load data
train_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")
train_df.head()

## 2. Exploratory Data Analysis (EDA)

### Target Variable: SalePrice
Let's look at the distribution of the target variable to understand its properties.

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(train_df['SalePrice'], kde=True)
plt.title('Distribution of SalePrice')
plt.show()

print(f"Skewness: {train_df['SalePrice'].skew()}")

**Interpretation:**
The histogram above shows that `SalePrice` is **right-skewed** (positively skewed). This means most houses fall in a lower or average price range, with a long tail of very expensive houses.

Many regression models, linear ones in particular, behave better when the target is roughly symmetric rather than heavily skewed. Therefore, we will apply a **log-transformation** (`np.log1p`) to reduce the skew.

In [None]:
train_df['SalePrice'] = np.log1p(train_df['SalePrice'])

plt.figure(figsize=(10,6))
sns.histplot(train_df['SalePrice'], kde=True)
plt.title('Distribution of Log(SalePrice)')
plt.show()

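**Result:** After the transformation, the distribution looks much closer to a normal curve (Gaussian distribution). This matters most for linear models. Tree-based models do not require a normally distributed target, but training on log prices means errors are measured in relative rather than absolute terms, which matches the competition's log-scale RMSE metric.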
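To make the effect of the transform concrete, here is a minimal sketch on synthetic data. The log-normal "prices" below are an assumption for illustration only, not the competition data; the point is simply that `np.log1p` pulls in the long right tail and that `np.expm1` undoes it.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draw); NOT the competition data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

log_prices = np.log1p(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"Skewness before log1p: {skew_before:.3f}")
print(f"Skewness after log1p:  {skew_after:.3f}")

# expm1 inverts log1p, so the transform can be undone at prediction time.
recovered = np.expm1(log_prices)
print("Round trip recovers the original values:", np.allclose(recovered, prices))
```

The skewness after the transform should be close to zero, while the original values remain recoverable up to floating-point precision.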

### Correlations
We want to identify which features have the strongest relationships with `SalePrice`.

In [None]:
corr = train_df.select_dtypes(include=[np.number]).corr()
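# SalePrice itself ranks first (correlation 1.0), so take 11 rows to keep 10 features
top_corr = corr['SalePrice'].sort_values(ascending=False).head(11)
print("SalePrice and its Top 10 Positively Correlated Features:")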
print(top_corr)

plt.figure(figsize=(12,8))
sns.heatmap(train_df[top_corr.index].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Top Features')
plt.show()

**Interpretation:**
- **OverallQual (Overall Quality)** has the highest correlation with SalePrice. In other words, higher quality construction commands higher prices.
- **GrLivArea (Above Grade Living Area)** is strongly correlated. Bigger houses tend to be more expensive.
- **GarageCars/GarageArea**: Garage size is also a significant predictor.
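Features such as `GarageCars` and `GarageArea` are likely to be strongly correlated with each other as well as with the target, which the heatmap can reveal. As a rough sketch, the hypothetical helper `correlated_pairs` below lists feature pairs whose absolute correlation exceeds a threshold. The toy columns and the 0.8 cutoff are assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |corr|) for pairs above the threshold, strongest first."""
    corr = df.corr().abs()
    pairs = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if corr.loc[a, b] > threshold
    ]
    return sorted(pairs, key=lambda item: item[2], reverse=True)

# Toy data: garage_cars is almost a rescaled copy of garage_area.
rng = np.random.default_rng(0)
garage_area = rng.normal(500, 100, size=300)
toy = pd.DataFrame({
    "garage_area": garage_area,
    "garage_cars": garage_area / 250 + rng.normal(0, 0.1, size=300),
    "unrelated": rng.normal(0, 1, size=300),
})

redundant = correlated_pairs(toy)
for a, b, value in redundant:
    print(f"{a} vs {b}: |corr| = {value:.2f}")
```

Tree ensembles tolerate redundant features reasonably well, so this is mainly useful for interpretation or if a linear model is tried later.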

## 3. Data Preprocessing

To prepare the data for the model, we must handle missing values and encode categorical variables into a numerical format.

In [None]:
y = train_df['SalePrice']
train_features = train_df.drop(['Id', 'SalePrice'], axis=1)
test_features = test_df.drop(['Id'], axis=1)

# Combine for processing to ensure consistent dimensionality
all_data = pd.concat([train_features, test_features]).reset_index(drop=True)
print(f"Combined shape: {all_data.shape}")

### Handling Missing Values
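- **Categorical Features**: Two common options are filling with the mode or treating missing values as a separate category. Here we fill with 'Missing' so the model can learn whether 'missingness' itself is predictive.
- **Numerical Features**: We fill missing values with the median of the column. Note that the median is computed over the combined train and test rows, so test data has a small influence on the imputed values.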

In [None]:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
categorical_feats = all_data.dtypes[all_data.dtypes == "object"].index

# Fill Numeric with Median
for col in numeric_feats:
    all_data[col] = all_data[col].fillna(all_data[col].median())

# Fill Categorical with 'Missing' string
for col in categorical_feats:
    all_data[col] = all_data[col].fillna("Missing")

print(f"Missing values remaining: {all_data.isnull().sum().sum()}")
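The cell above computes medians on the combined train and test rows. A stricter variant, sketched here under the assumption that you prefer to keep test data out of preprocessing statistics, learns the medians from the training rows only and reuses them for the test rows. The two tiny frames are made-up examples, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Alley": ["Pave", None, None, "Grvl"],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 200.0],
    "Alley": [None, "Pave"],
})

num_cols = train.select_dtypes(include="number").columns
cat_cols = train.select_dtypes(exclude="number").columns

# Medians are learned from the training rows only.
train_medians = train[num_cols].median()

train_filled = train.copy()
test_filled = test.copy()
train_filled[num_cols] = train[num_cols].fillna(train_medians)
test_filled[num_cols] = test[num_cols].fillna(train_medians)
train_filled[cat_cols] = train[cat_cols].fillna("Missing")
test_filled[cat_cols] = test[cat_cols].fillna("Missing")

print(test_filled)
```

Here the missing test value is filled with the training median (70.0) and is unaffected by the large test value of 200.0.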

### Encoding
We use **One-Hot Encoding** (`pd.get_dummies`) to convert categorical variables into binary columns (0 or 1). This increases the dataset width but captures categorical information effectively.

In [None]:
all_data = pd.get_dummies(all_data)
print(f"Shape after encoding: {all_data.shape}")

# Split back into train and test
X = all_data[:len(train_df)]
X_test = all_data[len(train_df):]
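The reason for encoding the combined frame is that `pd.get_dummies` only creates columns for categories it actually sees. The toy example below (made-up rows, not the real data) shows the mismatch that arises when train and test are encoded separately, and one possible alternative to concatenation: aligning the test columns to the training columns with `reindex`.

```python
import pandas as pd

# Made-up rows: the category "Grvl" never appears in the test frame.
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Pave", "Pave"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(list(train_enc.columns))  # ['Street_Grvl', 'Street_Pave']
print(list(test_enc.columns))   # ['Street_Pave']

# Align test to the training columns; categories unseen in test become all-zero columns.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Aligning to the training columns also drops any category that appears only in the test set, which the model could not have learned from anyway.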

## 4. Modeling

We will use a **Gradient Boosting Regressor**, an ensemble that builds decision trees sequentially, with each new tree fitted to correct the errors of the trees before it. We use an 80/20 train-validation split to assess our model's performance without touching the test set.
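To illustrate the idea behind boosting, the following is a simplified sketch for squared-error loss, not scikit-learn's actual implementation. Each shallow tree is fitted to the current residuals, and a fraction of its prediction (the learning rate) is added to the running estimate. The synthetic sine data and the hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem for illustration.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant guess
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # For squared error, the residuals are the negative gradient of the loss.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Take a small step in the direction suggested by the new tree.
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE at start: {mse_history[0]:.4f}")
print(f"Training MSE after 50 trees: {mse_history[-1]:.4f}")
```

The training error shrinks as trees are added, which is also why a held-out validation split is needed to judge how well the model generalizes.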

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Validation
preds_val = model.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, preds_val))
print(f"Validation RMSE (Log Scale): {rmse:.4f}")

**Performance Interpretation:**
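The RMSE (Root Mean Squared Error) is calculated on the log-transformed prices. A score of **approx 0.13** is quite competitive for a base model on this dataset. Because the error is measured in log units, it corresponds to a typical relative error of roughly 12 to 14% in the predicted price (since `exp(0.13) ≈ 1.14`), rather than an exact average percentage error.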
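The link between a log-scale error and a percentage error can be checked with a few lines of arithmetic. This sketch assumes the quoted score of about 0.13 and treats `log1p(price)` as approximately `log(price)`, which is reasonable for prices in the tens of thousands of dollars or more.

```python
import numpy as np

log_rmse = 0.13  # the approximate validation score quoted above

# A log-scale error of d multiplies the price by exp(d) or exp(-d).
over = np.expm1(log_rmse)     # relative error when over-predicting by one RMSE
under = -np.expm1(-log_rmse)  # relative error when under-predicting by one RMSE

print(f"+{over:.1%} / -{under:.1%}")  # prints: +13.9% / -12.2%
```

The asymmetry is a property of working in log space: the same log error is a slightly larger percentage on the high side than on the low side.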

### Feature Importance
Let's inspect what the model determined as the most important drivers of House Price.

In [None]:
feature_importance = model.feature_importances_
sorted_idx = np.argsort(feature_importance)[-10:]

plt.figure(figsize=(10,6))
plt.barh(range(10), feature_importance[sorted_idx])
plt.yticks(range(10), X.columns[sorted_idx])
plt.title("Top 10 Important Features")
plt.show()

**Interpretation:**
The Feature Importance plot confirms our EDA:
- `OverallQual` is typically the dominant feature.
- `GrLivArea` and `TotalBsmtSF` (Total Basement Square Footage) are also critical.

This tells us that **size and quality** are the two biggest factors in determining a home's value in this dataset.
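The `feature_importances_` attribute used above is impurity-based and computed from the training data. As a cross-check, one option (a different technique, not what the notebook uses) is permutation importance on held-out data via `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data in which feature 0 matters most and feature 2 is pure noise; all names and sizes are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 is strong, feature 1 is weaker, feature 2 is noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(400, 3))
y_syn = 3.0 * X_syn[:, 0] + 1.0 * X_syn[:, 1] + rng.normal(0, 0.1, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the validation set and measure the drop in score.
result = permutation_importance(gbr, X_va, y_va, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")
```

If both methods agree on the top features, that gives more confidence in the interpretation.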

## 5. Submission

Finally, we predict on the official test set. Since we trained on log-prices, we must use `np.expm1` (exponential minus 1) to convert the predictions back to the original dollar values.

In [None]:
# Predict
final_preds_log = model.predict(X_test)
final_preds = np.expm1(final_preds_log)

# Create submission DataFrame
submission = pd.DataFrame({'Id': test_df['Id'], 'SalePrice': final_preds})

# Display head
print(submission.head())

# Save
submission.to_csv('submission.csv', index=False)
print("submission.csv saved successfully.")
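Before uploading, it can help to run a few sanity checks on the submission frame. The helper `check_submission` below is hypothetical (not part of any library), and the toy IDs and log predictions are made up for illustration.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_ids):
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(sub.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert sub["Id"].tolist() == list(expected_ids), "Ids do not match the test set"
    assert np.isfinite(sub["SalePrice"]).all(), "NaN or infinite predictions"
    assert (sub["SalePrice"] > 0).all(), "prices must be positive"

# Made-up example: three test Ids and log-scale predictions.
toy_ids = [1461, 1462, 1463]
toy_log_preds = np.array([12.0, 11.5, 12.3])
toy_submission = pd.DataFrame({"Id": toy_ids, "SalePrice": np.expm1(toy_log_preds)})

check_submission(toy_submission, toy_ids)
print("Submission checks passed.")
```

A common mistake this catches is forgetting `np.expm1`, although log-scale values would still be positive; comparing the predicted price range against the training prices is a useful additional check.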