# PYTHON JUPYTER NOTEBOOK SCRIPT FOR HOUSE PRICE PREDICTION PROJECT
# PROJECT NAME: House Price & Category Prediction
# NAME: Jawad Ali
# ROLL NO: 22F-BSAI-97


## OBJECTIVE:
This machine learning project trains and evaluates two models on the same housing data:
1. **Regression**: Use Linear Regression (LR) to predict continuous house price ('SalePrice').
2. **Classification**: Use Random Forest Classifier (RF) to predict a discrete 'Price_Category' (Low, Medium, High).

## DATASET:
Kaggle "House Prices: Advanced Regression Techniques" dataset (train.csv)

## 1. Import Required Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix

print("All libraries imported successfully!")

## 2. Load and Explore the Dataset

In [None]:
# Load the dataset
df = pd.read_csv('train.csv')

# Drop the 'Id' column
df = df.drop('Id', axis=1)

# Display basic information
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())
print("\nBasic Statistics:")
print(df.describe())

## 3. Handle Missing Values

In [None]:
# Check for missing values
print("Missing values before handling:")
print(df.isnull().sum()[df.isnull().sum() > 0])

# Identify numerical and categorical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Fill numerical missing values with median
for col in numerical_cols:
    if df[col].isnull().sum() > 0:
        # Assign the result back: fillna(inplace=True) on a column selection is deprecated in recent pandas
        df[col] = df[col].fillna(df[col].median())

# Fill categorical missing values with 'None'
for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna('None')

print("\n\nMissing values after handling:")
print(df.isnull().sum().sum())
print("All missing values handled successfully!")
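The imputation rule above can be checked on a tiny frame. This is a minimal sketch on made-up values (the `toy` frame is hypothetical and only borrows two column names from the dataset); it assigns the filled column back rather than filling in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numerical and one categorical column with gaps
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "Alley": ["Grvl", None, None],
})

# Numerical column: replace NaN with the median of the observed values (72.5 here)
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

# Categorical column: replace missing entries with the literal string 'None'
toy["Alley"] = toy["Alley"].fillna("None")

print(toy)
```

Note that computing the median on the full dataset before splitting lets test rows influence the fill value; computing it on the training split only would avoid that.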

## 4. Create Price Category Target Variable

In [None]:
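# Create Price_Category using pandas.cut
# bins=3 splits the SalePrice range into three equal-width intervals: Low, Medium, High
# (equal width, not equal count, so the categories can be very unbalanced)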
df['Price_Category'] = pd.cut(df['SalePrice'], 
                               bins=3, 
                               labels=['Low', 'Medium', 'High'])

# Display the distribution of categories
print("Price Category Distribution:")
print(df['Price_Category'].value_counts().sort_index())
print("\n")
print("Sample data with Price_Category:")
print(df[['SalePrice', 'Price_Category']].head(10))
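Because `pd.cut` with an integer `bins` produces equal-width intervals, skewed prices tend to pile up in the lowest category. The sketch below uses made-up prices (not taken from train.csv) to compare it with `pd.qcut`, which builds equal-frequency bins; `pd.qcut` is shown only as a possible alternative, not as the method used in this notebook.

```python
import pandas as pd

# Hypothetical right-skewed prices (one expensive outlier)
prices = pd.Series([100, 110, 120, 130, 140, 150, 160, 170, 400], name="SalePrice")
labels = ["Low", "Medium", "High"]

# Equal-width bins: the range 100..400 is cut into three intervals of equal width
equal_width = pd.cut(prices, bins=3, labels=labels)

# Equal-frequency bins: each category gets roughly one third of the rows
equal_freq = pd.qcut(prices, q=3, labels=labels)

print("pd.cut counts:")
print(equal_width.value_counts().sort_index())
print("\npd.qcut counts:")
print(equal_freq.value_counts().sort_index())
```

With these toy values, equal-width binning leaves the Medium category empty, while equal-frequency binning puts three rows in each category.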

## 5. Separate Features and Targets

In [None]:
# Save target variables
y_reg = df['SalePrice']  # Regression target
y_clf = df['Price_Category']  # Classification target

# Create feature set by dropping both target columns
X = df.drop(['SalePrice', 'Price_Category'], axis=1)

print("Feature set shape:", X.shape)
print("Regression target shape:", y_reg.shape)
print("Classification target shape:", y_clf.shape)
print("\nFeatures columns:", X.columns.tolist()[:10], "... (showing first 10)")

## 6. One-Hot Encode Categorical Features

In [None]:
# Apply One-Hot Encoding to categorical features
print("Shape before encoding:", X.shape)

X_encoded = pd.get_dummies(X, drop_first=True)

print("Shape after encoding:", X_encoded.shape)
print("\nEncoded features (first 10):", X_encoded.columns.tolist()[:10])
print("\nOne-Hot Encoding completed successfully!")
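To see what `drop_first=True` does, here is a small sketch on a hypothetical two-column frame. Numerical columns pass through unchanged, and each categorical column loses its first (alphabetically sorted) level, which becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical feature
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# 'Street' has levels Grvl and Pave; drop_first removes Grvl,
# so a single indicator column Street_Pave remains
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for the Linear Regression model more than for the Random Forest.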

## 7. Split Data for Regression and Classification

In [None]:
# Split data for Regression (70/30 split, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_encoded, y_reg, test_size=0.3, random_state=42
)

# Split data for Classification (70/30 split, random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_encoded, y_clf, test_size=0.3, random_state=42
)

print("Regression Train set:", X_train_reg.shape, y_train_reg.shape)
print("Regression Test set:", X_test_reg.shape, y_test_reg.shape)
print("\nClassification Train set:", X_train_clf.shape, y_train_clf.shape)
print("Classification Test set:", X_test_clf.shape, y_test_clf.shape)
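Since both splits use the same feature matrix, `test_size`, and `random_state`, they select the same rows. A minimal sketch on synthetic arrays (all names here are hypothetical) illustrates this, and also shows that `train_test_split` can split several arrays in one call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples, 2 features, and two different targets
X = np.arange(20).reshape(10, 2)
y_a = np.arange(10)          # stands in for a regression target
y_b = np.arange(10) * 10     # stands in for a classification target

# Two separate calls with identical settings
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X, y_a, test_size=0.3, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X, y_b, test_size=0.3, random_state=42
)
print("Same test rows:", np.array_equal(Xa_test, Xb_test))

# One call that splits the features and both targets together
X_train, X_test, ya_tr, ya_te, yb_tr, yb_te = train_test_split(
    X, y_a, y_b, test_size=0.3, random_state=42
)
print("Single-call split matches:", np.array_equal(X_test, Xa_test))
```

A single call would let both models share one set of feature matrices; the notebook keeps two calls so the regression and classification pipelines stay visibly separate.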

## 8. Scale Features

In [None]:
# Initialize StandardScaler for Regression
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

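# Initialize StandardScaler for Classification
# (tree-based models such as Random Forest do not need scaling; it is kept here for a uniform pipeline)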
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)

print("Feature scaling completed successfully!")
print("\nRegression - Scaled train set shape:", X_train_reg_scaled.shape)
print("Regression - Scaled test set shape:", X_test_reg_scaled.shape)
print("\nClassification - Scaled train set shape:", X_train_clf_scaled.shape)
print("Classification - Scaled test set shape:", X_test_clf_scaled.shape)
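The scaler is fitted on the training split and then reused on the test split, so test rows are standardized with the training mean and standard deviation. The sketch below checks this by hand on made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# Manual standardization with the training statistics
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)  # population std (ddof=0), as used by StandardScaler
manual_test_scaled = (X_test - train_mean) / train_std

print("Scaler output:", X_test_scaled.ravel())
print("Manual output:", manual_test_scaled.ravel())
```

Fitting the scaler on the test data as well would leak information about the test distribution into preprocessing.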

## 9. Train Linear Regression Model

In [None]:
# Initialize and train Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train_reg_scaled, y_train_reg)

print("Linear Regression model trained successfully!")
print(f"Model Coefficients: {lr_model.coef_[:5]}... (showing first 5)")
print(f"Model Intercept: {lr_model.intercept_:.2f}")

## 10. Train Random Forest Classifier Model

In [None]:
# Initialize and train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_clf_scaled, y_train_clf)

print("Random Forest Classifier trained successfully!")
print(f"Number of trees: {rf_model.n_estimators}")
print(f"Feature importances (first 5): {rf_model.feature_importances_[:5]}")

## 11. Evaluate Regression Model

In [None]:
# Make predictions on test set
y_pred_reg = lr_model.predict(X_test_reg_scaled)

# Calculate R-squared
r2 = r2_score(y_test_reg, y_pred_reg)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))

# Display results
print("=" * 60)
print("LINEAR REGRESSION MODEL - EVALUATION METRICS")
print("=" * 60)
print(f"R-squared (R²): {r2:.4f}")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")
print("=" * 60)
print("\nInterpretation:")
print(f"- The model explains {r2*100:.2f}% of the variance in house prices.")
print(f"- The typical prediction error (RMSE) is about ${rmse:,.2f}; large errors weigh more heavily than in a plain average.")
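As a worked example of the two metrics, the sketch below computes R² and RMSE by hand on four made-up values and compares them with scikit-learn. It also computes the mean absolute error to show that RMSE is not the same thing as the average absolute miss.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# MAE: mean absolute residual (the literal "average miss")
mae_manual = np.mean(np.abs(residuals))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE (manual):  {rmse_manual:.3f}")
print(f"RMSE (sklearn): {np.sqrt(mean_squared_error(y_true, y_pred)):.3f}")
print(f"MAE  (manual):  {mae_manual:.3f}")
print(f"R^2  (manual):  {r2_manual:.3f}")
print(f"R^2  (sklearn): {r2_score(y_true, y_pred):.3f}")
```

Because the residuals are squared before averaging, RMSE is at least as large as MAE and grows faster when a few predictions are far off.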

## 12. Evaluate Classification Model

In [None]:
# Make predictions on test set
y_pred_clf = rf_model.predict(X_test_clf_scaled)

# Calculate Accuracy Score
accuracy = accuracy_score(y_test_clf, y_pred_clf)

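# Generate Confusion Matrix
# Pass labels explicitly: by default scikit-learn sorts string labels alphabetically
# (High, Low, Medium), which would not match the Low/Medium/High printout below
class_labels = ['Low', 'Medium', 'High']
cm = confusion_matrix(y_test_clf, y_pred_clf, labels=class_labels)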

# Display results
print("=" * 60)
print("RANDOM FOREST CLASSIFIER - EVALUATION METRICS")
print("=" * 60)
print(f"Accuracy Score: {accuracy:.4f} ({accuracy*100:.2f}%)")
print("\nConfusion Matrix:")
print("-" * 40)
print(cm)
print("-" * 40)
print("\nConfusion Matrix Format:")
print("                  Predicted")
print("              Low  Medium  High")
print("Actual Low    ", cm[0])
print("Actual Medium ", cm[1])
print("Actual High   ", cm[2])
print("=" * 60)
print("\nInterpretation:")
print(f"- The model correctly classifies {accuracy*100:.2f}% of houses into price categories.")
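The row and column order of a confusion matrix depends on the label order. The sketch below, on made-up labels, contrasts the default (alphabetically sorted) order with an explicit `labels` argument, which is what a Low/Medium/High printout needs.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted categories
y_true = ["Low", "Low", "Medium", "High", "Low"]
y_pred = ["Low", "Medium", "Medium", "High", "Low"]

# Default: labels are sorted, so rows/columns are High, Low, Medium
cm_default = confusion_matrix(y_true, y_pred)

# Explicit order: rows/columns are Low, Medium, High
cm_ordered = confusion_matrix(y_true, y_pred, labels=["Low", "Medium", "High"])

print("Default order (High, Low, Medium):")
print(cm_default)
print("\nExplicit order (Low, Medium, High):")
print(cm_ordered)
```

Passing `labels` also guarantees a 3x3 matrix even if one category never appears in the test split.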

## 13. Summary and Conclusion

This project trained and evaluated two machine learning models on the same dataset:

### **Regression Model (Linear Regression)**
- **Purpose**: Predict continuous house prices
- **Metrics**: R² Score and RMSE
- **Use Case**: Estimating exact sale prices for houses

### **Classification Model (Random Forest)**
- **Purpose**: Classify houses into price categories (Low, Medium, High)
- **Metrics**: Accuracy Score and Confusion Matrix
- **Use Case**: Quick categorization of properties for market segmentation

### **Key Steps Implemented**:
1. ✅ Data loading and preprocessing
2. ✅ Missing value handling (median for numerical, 'None' for categorical)
3. ✅ Target creation (Price_Category derived from SalePrice using pd.cut with three equal-width bins)
4. ✅ One-Hot Encoding for categorical variables
5. ✅ Train-test split (70/30) for both tasks
6. ✅ Feature scaling using StandardScaler (fitted on the training split only)
7. ✅ Model training for Linear Regression and Random Forest
8. ✅ Comprehensive evaluation with appropriate metrics
