# Problem Statement
**Objective:**
Predict the sale prices of houses in Ames, Iowa, using machine learning. We aim to build a regression model that estimates a house's sale price from features such as living area, location, and year built.

### Import Required Libraries

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


In [None]:
# Load the dataset
df = pd.read_csv("../data/Ames-Housing-Dataset.csv")
print("Dataset shape:", df.shape)
df.head()



### Data Cleaning


In [None]:
# Drop columns with > 30% missing values
null_percent = df.isnull().mean() * 100
df = df.drop(columns=null_percent[null_percent > 30].index)
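
To make the 30% cutoff concrete, here is a minimal sketch on a made-up four-row frame. The values are invented for illustration and are not taken from the Ames data; only the thresholding logic mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): 'Alley' is 75% missing, 'Lot Frontage' is 25% missing.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Lot Frontage": [60.0, np.nan, 80.0, 70.0],
    "SalePrice": [200000, 150000, 310000, 180000],
})

# Share of missing values per column, expressed in percent
null_percent = toy.isnull().mean() * 100

# Columns above the 30% cutoff are dropped; the rest are kept for imputation
to_drop = null_percent[null_percent > 30].index
toy_clean = toy.drop(columns=to_drop)

print(null_percent.round(1).to_dict())
print(list(toy_clean.columns))
```

Only `Alley` exceeds the cutoff here, so `Lot Frontage` survives and its single gap is left for the fill step.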


In [None]:
# Fill missing values: mode for categorical columns, median for numeric columns
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].median())
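
The cell above computes medians and modes over the full dataset, so rows that later end up in the test set influence the fill values. One possible variant, sketched here on made-up data, is to derive the fill values from the training rows only and reuse them for the test rows. The toy frame and the manual 6/2 split are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in both a numeric and a categorical column
toy = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 80.0, 70.0, 65.0, np.nan, np.nan, 75.0],
    "Garage Type": ["Attchd", np.nan, "Detchd", "Attchd", "Attchd", "Detchd", np.nan, "Attchd"],
})

# Hypothetical split: first six rows for training, last two for testing
train, test = toy.iloc[:6], toy.iloc[6:]

num_cols = train.select_dtypes(include="number").columns
cat_cols = train.select_dtypes(exclude="number").columns

# Fill values are learned from the training rows only
fill_values = train[num_cols].median().to_dict()
fill_values.update({col: train[col].mode()[0] for col in cat_cols})

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```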


In [None]:
# Drop ID columns
df.drop(columns=['Order', 'PID'], inplace=True)
df.head()

###  Exploratory Data Analysis (EDA)

#### Living Area Vs Price

In [None]:
# Living Area vs Price
sns.scatterplot(data=df, x='Gr Liv Area', y='SalePrice')
plt.title("Living Area vs Sale Price")
plt.show()


**Insights:**
The scatter plot shows that:
- Sale price tends to increase with above-ground living area.
- Larger houses generally sell for more.
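
A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below uses synthetic data (the slope, intercept, and noise level are arbitrary assumptions), so the printed value says nothing about the real dataset. On the notebook's data, the equivalent call would be `df['Gr Liv Area'].corr(df['SalePrice'])`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic houses: price grows linearly with area, plus random noise
area = rng.uniform(600, 3500, size=200)
price = 50_000 + 100 * area + rng.normal(0, 30_000, size=200)
toy = pd.DataFrame({"Gr Liv Area": area, "SalePrice": price})

# Pearson correlation: +1 is a perfect increasing linear relationship, 0 is none
r = toy["Gr Liv Area"].corr(toy["SalePrice"])
print(f"Pearson r = {r:.2f}")
```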

#### Distribution of House Sale Prices (Sale Price Vs Number of Houses)

In [None]:
# Histogram to see the spread of Sale Prices
plt.figure(figsize=(8, 5))
sns.histplot(df['SalePrice'], bins=40, color='skyblue', kde=True)
plt.title("Distribution of House Sale Prices")
plt.xlabel("Sale Price (USD)")
plt.ylabel("Number of Houses")
plt.show()


**Insights:**
- Most houses sold for between $100,000 and $250,000.
- A few expensive houses sold for more than $400,000.
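
Statements like these can be checked numerically rather than read off the histogram. The following sketch uses synthetic log-normally distributed prices (the distribution parameters are assumptions chosen only to look price-like); on the real data, the same expressions would be applied to `df['SalePrice']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic, right-skewed prices (not the Ames data)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

# Fraction of houses in the middle band and in the expensive tail
share_mid = prices.between(100_000, 250_000).mean()
share_high = (prices > 400_000).mean()

# Positive skewness indicates a long right tail
skewness = prices.skew()

print(f"Share between $100K and $250K: {share_mid:.1%}")
print(f"Share above $400K: {share_high:.1%}")
print(f"Skewness: {skewness:.2f}")
```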



#### Average Sale Price Vs Year Built

In [None]:
# Group by Year Built and calculate average Sale Price
avg_price_by_year = df.groupby('Year Built')['SalePrice'].mean()

# Plot the line chart
plt.figure(figsize=(10, 5))
avg_price_by_year.plot(kind='line', color='green', marker='o')
plt.title("Average Sale Price by Year Built")
plt.xlabel("Year Built")
plt.ylabel("Average Sale Price")
plt.grid(True)
plt.tight_layout()
plt.show()


**Insights:**
This line chart shows how the average sale price varies with the year a house was built:
- Houses built in recent years (2000–2010) generally have higher average prices.
- The line fluctuates from year to year, but the overall trend is upward.
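
Year-by-year averages can be noisy when only a few houses were built in a given year. One way to see the overall trend more clearly is to average by decade instead. This is a sketch on made-up numbers, not the notebook's method.

```python
import pandas as pd

# Made-up houses, two per decade
toy = pd.DataFrame({
    "Year Built": [1955, 1958, 1972, 1979, 1995, 1998, 2003, 2008],
    "SalePrice": [110000, 125000, 140000, 150000, 190000, 205000, 240000, 260000],
})

# Integer division maps each year to the start of its decade (1958 -> 1950)
decade = (toy["Year Built"] // 10) * 10

# Average sale price per decade
avg_by_decade = toy.groupby(decade)["SalePrice"].mean()
print(avg_by_decade)
```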



#### Average Sale Price by Neighborhood (Sale Price Vs Neighborhood)

In [None]:
plt.figure(figsize=(12, 5))
avg_price = df.groupby('Neighborhood')['SalePrice'].mean().sort_values(ascending=False)
sns.barplot(x=avg_price.index, y=avg_price.values, palette='coolwarm')
plt.xticks(rotation=90)
plt.title("Average Sale Price by Neighborhood")
plt.ylabel("Average Sale Price")
plt.xlabel("Neighborhood")
plt.show()


### Preprocessing for Modeling

In [None]:
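# Encode categorical columns as integer codes
# (the encoder is refit for each column, so `le` only keeps the last column's mapping)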
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])

# Define features and target
X = df.drop(columns='SalePrice')
y = df['SalePrice']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
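
Because a single `LabelEncoder` is refit for every column, the category-to-code mappings are not kept, which makes it hard to encode a new house later. One possible alternative, sketched here on made-up categories, is scikit-learn's `OrdinalEncoder`, which encodes several feature columns at once and stores the categories in `categories_`. This swaps in a different encoder from the one the notebook uses; the toy frame and the `-1` code for unseen categories are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical features
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "OldTown", "NAmes"],
    "House Style": ["1Story", "2Story", "1Story", "1Story"],
})

# Unseen categories at prediction time are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
codes = enc.fit_transform(toy)

# Rebuild a readable mapping per column from the fitted categories
mapping = {
    col: {cat: i for i, cat in enumerate(cats)}
    for col, cats in zip(toy.columns, enc.categories_)
}
print(mapping["Neighborhood"])

# Encode a new row; 'SplitLevel' was never seen during fitting
new = pd.DataFrame({"Neighborhood": ["OldTown"], "House Style": ["SplitLevel"]})
new_codes = enc.transform(new)
print(new_codes)
```

With the mapping available, the encoded value for a neighborhood can be looked up by name instead of being hard-coded.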


### Model Training and Evaluation

####  Linear Regression

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

print("Linear Regression Results")
print(f"MAE  : ${mae_lr:,.2f}")
print(f"RMSE : ${rmse_lr:,.2f}")
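
For readers unfamiliar with the two metrics: MAE is the average absolute error, while RMSE squares the errors before averaging and therefore penalizes large misses more heavily. A small worked sketch with four made-up predictions shows the difference and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true and predicted prices
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 180_000.0])
y_pred = np.array([210_000.0, 140_000.0, 260_000.0, 180_000.0])

errors = y_true - y_pred                 # [-10000, 10000, 40000, 0]

mae = np.mean(np.abs(errors))            # average size of the miss
rmse = np.sqrt(np.mean(errors ** 2))     # square, average, then take the root

# The library functions should agree with the manual formulas
mae_sk = mean_absolute_error(y_true, y_pred)
rmse_sk = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE  : ${mae:,.2f}")
print(f"RMSE : ${rmse:,.2f}")
```

The single $40,000 miss pulls the RMSE well above the MAE, which is why the two metrics are reported together.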


#### Random Forest Regressor

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print("\nRandom Forest Regression Results")
print(f"MAE  : ${mae_rf:,.2f}")
print(f"RMSE : ${rmse_rf:,.2f}")


#### Comparing Both Models

**Comparison:**
- Linear Regression gives an MAE of $20.2K and an RMSE of $33.4K, a decent baseline but with less precise predictions.
- Random Forest lowers both errors, with an MAE of $15.8K and an RMSE of $26.5K, which suggests it handles non-linear patterns better.
##### **Random Forest** is the better choice for more accurate house price prediction.
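
The size of the improvement can be expressed as a relative reduction in error. The short calculation below uses only the rounded figures reported above.

```python
# Reported errors in thousands of dollars (rounded values from the comparison)
lr = {"MAE": 20.2, "RMSE": 33.4}
rf = {"MAE": 15.8, "RMSE": 26.5}

# Relative reduction: (baseline - improved) / baseline
reductions = {m: (lr[m] - rf[m]) / lr[m] * 100 for m in lr}

for metric, pct in reductions.items():
    print(f"{metric}: {pct:.1f}% lower with Random Forest")
```

Both metrics drop by roughly a fifth relative to the linear baseline.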

### Feature Importance from Random Forest


In [None]:
importances = rf_model.feature_importances_
feat_imp = pd.Series(importances, index=X.columns).sort_values(ascending=False).head(10)

# Plot top 10 features
plt.figure(figsize=(10, 6))
sns.barplot(x=feat_imp, y=feat_imp.index, palette='coolwarm')
plt.title("Top 10 Features Influencing Sale Price")
plt.xlabel("Feature Importance Score")
plt.ylabel("Features")
plt.show()


**Insights:**
This bar graph shows which features the Random Forest model finds most useful when predicting house prices. For example, `Overall Qual` (overall material and finish quality) and `Gr Liv Area` (above-ground living area) are among the most influential. The longer the bar, the more that feature contributes to the model's predictions.
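
To build intuition for how `feature_importances_` behaves, the sketch below fits a small forest on synthetic data where only one column actually drives the target. The column names and data are invented for illustration. The scores are normalized to sum to 1, so each bar can be read as a share of the total importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic features: only 'signal' influences the target
X_toy = pd.DataFrame({
    "signal": rng.uniform(0, 10, 300),
    "noise_a": rng.uniform(0, 10, 300),
    "noise_b": rng.uniform(0, 10, 300),
})
y_toy = 5 * X_toy["signal"] + rng.normal(0, 1, 300)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_toy, y_toy)

# Importances are aligned with the column order used for fitting
imp = pd.Series(model.feature_importances_, index=X_toy.columns)
imp = imp.sort_values(ascending=False)

print(imp)
print("Sum of importances:", imp.sum())
```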

### Conclusion
We developed and evaluated regression models to predict house prices using the Ames Housing Dataset.
Through data cleaning, exploratory analysis, and modeling,
we identified living area, overall quality, and neighborhood as strong predictors of price.
Linear regression gave a decent baseline (MAE $20.2K, RMSE $33.4K), and the random forest model reduced both errors (MAE $15.8K, RMSE $26.5K).
These results suggest that random forest has strong potential
for real-world use in estimating house prices.

### Usage Example

In [None]:
example = {
    'Gr Liv Area': 2200,        # Above ground living area (sq ft)
    'Garage Cars': 2,           # Number of cars garage can hold
    'Year Built': 2010,         # Year the house was built
    'Overall Qual': 7,          # Overall quality of the house (1-10)
    'Total Bsmt SF': 1000,      # Total basement area
    '1st Flr SF': 1200,         # First floor square footage
    '2nd Flr SF': 1000,         # Second floor square footage
    'Full Bath': 2,             # Number of full bathrooms
    'Half Bath': 1,             # Number of half bathrooms
    'Fireplaces': 1,            # Number of fireplaces
    'Neighborhood': 15,         # Encoded value for 'Neighborhood'
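    # Columns not listed here are filled with training-set medians below
}

example_df = pd.DataFrame([example])
# Align with the training columns; use training medians rather than 0 for unspecified
# features, since 0 is not a plausible value for columns such as 'Lot Area' or 'Yr Sold'
example_df = example_df.reindex(columns=X.columns).fillna(X_train.median())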

predicted_price = rf_model.predict(example_df)
print(f"🏡 Predicted Sale Price for the Example House: ${predicted_price[0]:,.2f}")
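
One caveat with `reindex` is that it silently drops any key that does not match a training column, so a typo such as `'GrLivArea'` would go unnoticed. The sketch below shows a hypothetical helper, `build_input_row`, that rejects unknown feature names and fills the rest from supplied defaults. The helper name, the three-column layout, and the default values are all assumptions for illustration.

```python
import pandas as pd


def build_input_row(features, columns, defaults):
    """Return a one-row DataFrame ordered like `columns` (hypothetical helper).

    Raises KeyError if `features` contains a name that is not a known column.
    """
    unknown = set(features) - set(columns)
    if unknown:
        raise KeyError(f"Unknown feature names: {sorted(unknown)}")
    # Use the supplied value when present, otherwise fall back to the default
    row = {col: features.get(col, defaults[col]) for col in columns}
    return pd.DataFrame([row], columns=list(columns))


# Made-up column list and defaults standing in for X.columns and X_train.median()
columns = ["Gr Liv Area", "Overall Qual", "Year Built"]
defaults = pd.Series({"Gr Liv Area": 1400, "Overall Qual": 6, "Year Built": 1973})

row = build_input_row({"Gr Liv Area": 2200, "Overall Qual": 7}, columns, defaults)
print(row)

# A misspelled feature name is reported instead of being dropped
caught = None
try:
    build_input_row({"GrLivArea": 2200}, columns, defaults)
except KeyError as exc:
    caught = str(exc)
print("Error:", caught)
```

In the notebook's setting, the result could then be passed to `rf_model.predict`, provided `columns` matches the columns the model was trained on.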
