# House Price Prediction using XGBoost Regressor



## 1. Introduction
This report provides an in-depth analysis of the House Price Prediction dataset and implements the **XGBoost Regressor** to predict house prices. The report includes dataset exploration, preprocessing, model building, evaluation, and conclusion.

---

## 2. Libraries Used
The following libraries are used for data analysis, visualization, preprocessing, and model building:



In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score

## 3. Dataset Overview

The dataset has 2,000 rows and 10 columns describing factors that affect house prices:

- **Id**: Integer row identifier (1 to 2000).
- **Area**: Numeric, representing the square footage of the house.
- **Bedrooms**, **Bathrooms**, **Floors**: Integer counts.
- **YearBuilt**: Integer, the year the house was built.
- **Location**: Categorical, representing the area where the house is located (e.g. Downtown, Suburban, Urban, Rural).
- **Condition**: Categorical, indicating the overall condition of the house (e.g. Excellent, Good, Fair, Poor).
- **Garage**: Categorical, representing whether a garage is present (Yes/No).
- **Price**: Numeric, representing the actual price of the house.

### 3.1 Number of Columns

Of the 10 columns, one is an identifier (**Id**), eight are candidate features (**Area**, **Bedrooms**, **Bathrooms**, **Floors**, **YearBuilt**, **Location**, **Condition**, **Garage**), and one is the target variable (**Price**).

### 3.2 Relationship Between Columns  

The following relationships are expected from domain knowledge; the report does not test them:

- **Location and Price**: Houses in prime locations tend to have higher prices.
- **Area and Price**: Larger houses generally have higher prices.
- **Condition and Price**: Well-maintained houses are expected to have higher prices.
- **Garage and Price**: Houses with a garage are usually more expensive.
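
These expectations can be checked rather than assumed. The sketch below is a minimal illustration that rebuilds the five rows printed by `data.head()` in Section 4 and computes group means and a correlation. The name `preview` is introduced here for illustration only, and five rows are far too few to support any conclusion; the same calls would need to be run on the full `data` frame.

```python
import pandas as pd

# The five rows printed by data.head() in Section 4 (illustration only).
preview = pd.DataFrame({
    "Area": [1360, 4272, 3592, 966, 4926],
    "Location": ["Downtown", "Downtown", "Downtown", "Suburban", "Downtown"],
    "Condition": ["Excellent", "Excellent", "Good", "Fair", "Fair"],
    "Garage": ["No", "No", "No", "Yes", "Yes"],
    "Price": [149919, 424998, 266746, 244020, 636056],
})

# Categorical feature vs. Price: compare the mean price per category.
mean_price_by_garage = preview.groupby("Garage")["Price"].mean()
mean_price_by_location = preview.groupby("Location")["Price"].mean()

# Numeric feature vs. Price: Pearson correlation.
area_price_corr = preview["Area"].corr(preview["Price"])

print(mean_price_by_garage)
print(mean_price_by_location)
print(f"Area/Price correlation: {area_price_corr:.3f}")
```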


## 4. Basic Analysis

Loading and displaying the dataset:

In [3]:
data = pd.read_csv(r"C:\Users\Shaik Sakhlaih\Downloads\House Price Prediction Dataset.csv")
print(data.head())
print(data.tail())
data.info()  # info() prints its summary directly and returns None
print(data.describe())

   Id  Area  Bedrooms  Bathrooms  Floors  YearBuilt  Location  Condition  \
0   1  1360         5          4       3       1970  Downtown  Excellent   
1   2  4272         5          4       3       1958  Downtown  Excellent   
2   3  3592         2          2       3       1938  Downtown       Good   
3   4   966         4          2       2       1902  Suburban       Fair   
4   5  4926         1          4       2       1975  Downtown       Fair   

  Garage   Price  
0     No  149919  
1     No  424998  
2     No  266746  
3    Yes  244020  
4    Yes  636056  
        Id  Area  Bedrooms  Bathrooms  Floors  YearBuilt  Location  Condition  \
1995  1996  4994         5          4       3       1923  Suburban       Poor   
1996  1997  3046         5          2       1       2019  Suburban       Poor   
1997  1998  1062         5          1       2       1903     Rural       Poor   
1998  1999  4062         3          1       2       1936     Urban  Excellent   
1999  2000  2989        

## 5. Checking for Null Values

In [4]:
print(data.isnull().sum())

Id           0
Area         0
Bedrooms     0
Bathrooms    0
Floors       0
YearBuilt    0
Location     0
Condition    0
Garage       0
Price        0
dtype: int64


**Result**: The dataset contains no missing values.

## 6. Data Preprocessing

Since the dataset contains categorical variables, we convert them into numerical format using **Label Encoding**.

In [5]:
le = LabelEncoder()

data['Location'] = le.fit_transform(data['Location'])
data['Condition'] = le.fit_transform(data['Condition'])
data['Garage'] = le.fit_transform(data['Garage'])
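
The cell above refits a single `LabelEncoder` for each column, so only the last mapping (for `Garage`) remains available afterwards. `LabelEncoder` also assigns codes in sorted order, which for `Condition` is alphabetical rather than a quality ranking. The following sketch, on a toy frame, keeps one encoder per column and shows an explicit ordinal mapping as an alternative. The names `toy`, `encoders`, and `condition_rank` are hypothetical, and the worst-to-best ranking is an assumption not stated in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with category values seen in the dataset preview.
toy = pd.DataFrame({
    "Condition": ["Excellent", "Good", "Fair", "Poor"],
    "Garage": ["No", "Yes", "Yes", "No"],
})

# One encoder per column, kept in a dict so codes can be decoded later.
encoders = {}
encoded = toy.copy()
for col in ["Condition", "Garage"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns integer codes in sorted (alphabetical) order.
condition_classes = list(encoders["Condition"].classes_)
print(condition_classes)

# Assumed quality ranking, worst to best; not stated in the source data.
condition_rank = {"Poor": 0, "Fair": 1, "Good": 2, "Excellent": 3}
ranked = toy["Condition"].map(condition_rank)
print(ranked.tolist())
```

Tree-based models such as XGBoost can split on either encoding, but an explicit ranking keeps the codes interpretable.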


## 7. Model Building

### 7.1 Splitting the Dataset

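The code below separates **Garage** as the target variable and keeps every other column, including **Id** and **Price**, as a feature. This does not match the report's stated goal of predicting **Price**, so the results in Section 8 describe how well the model predicts the encoded garage flag, not house prices. The dataset is then split 70/30 into training and testing sets.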

In [6]:
x = data.drop(['Garage'], axis=1)
y = data['Garage']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
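
For the report's stated goal of predicting price, the target would be `Price` rather than `Garage`. The following is a minimal sketch of that split on a synthetic stand-in frame. The names `toy`, `features`, and `target` are hypothetical, the values are made up, and dropping `Id` rests on the assumption that a row identifier carries no pricing signal.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset (values are made up).
toy = pd.DataFrame({
    "Id": range(1, 11),
    "Area": [1360, 4272, 3592, 966, 4926, 2100, 1800, 3050, 2750, 4100],
    "Garage": [0, 0, 0, 1, 1, 0, 1, 1, 0, 1],
    "Price": [150, 425, 267, 244, 636, 310, 290, 400, 350, 520],
})

# Price is the target; Id is dropped because it only identifies the row.
features = toy.drop(columns=["Id", "Price"])
target = toy["Price"]

# Same 70/30 split and seed as the report's cell.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0
)
print(x_train.shape, x_test.shape)
```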


### 7.2 Initializing the XGBoost Regressor

In [None]:
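params = {
    'objective': 'binary:logistic',  # logistic objective: outputs probabilities for a 0/1 target
    'max_depth': 4,                  # maximum depth of each tree
    'alpha': 10,                     # L1 regularization strength
    'learning_rate': 0.1,
    'n_estimators': 100              # number of boosting rounds
}

# XGBRegressor lives in the xgboost package, imported above as xgb
xgb_clf = xgb.XGBRegressor(**params)
xgb_clf.fit(x_train, y_train)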


### 7.3 Making Predictions

In [None]:
y_pred = xgb_clf.predict(x_test)

## 8. Model Evaluation

### 8.1 Mean Squared Error


In [None]:
mse = mean_squared_error(y_test, y_pred)
print("The mean squared error:", mse)

**Result**: `0.2531`

### 8.2 R² Score

In [None]:
r2 = r2_score(y_test, y_pred)
print("The R2 score:", r2)


**Result**: `-0.0186`. A negative R² means the model's predictions have a larger squared error than simply predicting the mean of `y_test` for every row.
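
To make the two scores concrete, the sketch below computes MSE and R² by hand on a small made-up 0/1 target. It shows that a constant prediction equal to the target mean (0.5 for balanced classes) gives an MSE of 0.25 and an R² of 0, and that a constant away from the mean pushes R² below zero. The helper names `mse` and `r2` and the toy labels are assumptions for illustration, not part of the report's code.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up balanced 0/1 target (its mean is 0.5).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=float)

at_mean = np.full_like(y_true, 0.5)   # constant prediction equal to the mean
off_mean = np.full_like(y_true, 0.6)  # constant prediction away from the mean

print(mse(y_true, at_mean), r2(y_true, at_mean))
print(mse(y_true, off_mean), r2(y_true, off_mean))
```

Seen this way, the reported MSE of 0.2531 and R² of -0.0186 are close to what an uninformative constant prediction would score on a 0/1 target.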

## 9. Conclusion  

- The dataset was successfully analyzed and preprocessed.  
- **Label Encoding** was applied to categorical variables.  
- The **XGBoost Regressor** model was built and trained.  
- The model was evaluated using **Mean Squared Error (MSE)** and **R² Score**.  
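- The **R² score** was **-0.0186**: the model did slightly worse than predicting the mean of the target for every test row.
- The **MSE** was **0.2531**. Because the code in Section 7.1 uses the 0/1-encoded **Garage** column as the target, this is about the error a constant prediction of 0.5 would give (0.25).
- These scores therefore measure garage prediction, not price prediction. Setting **Price** as the target is the first step; alternative models (such as **Random Forest Regressor** or **Linear Regression**) can then be compared on the same split.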
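
As a rough sketch of how such a comparison might be set up, the block below fits the two scikit-learn models named above on the same train/test split and collects their R² scores. All data here is synthetic (a noisy linear relation between a made-up area and price), so the printed scores say nothing about the house price dataset. The names `models` and `scores` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price is a noisy linear function of area (made up).
rng = np.random.default_rng(0)
X = rng.uniform(500, 5000, size=(200, 1))
y = 100.0 * X[:, 0] + rng.normal(0, 1000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the same split and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: R2 = {score:.4f}")
```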
