# House Price Prediction

### <font color='blue'>*Author: Ali Chehrazi*</font>

## Description

In this project, XGBRegressor was employed to predict house sale prices. The data employed in this project can be found at the following link:

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

<img src="Images\Credit_Card_Fraud_1.webp" width="400">

### About Dataset

The provided data consists of information related to real estate properties, particularly for predicting the sale price of these properties. Here's a summary of the data fields:

1. SalePrice: The sale price of the property in dollars (the target variable to predict).

The remaining fields represent various features and characteristics of the properties:

2. MSSubClass: The building class.
3. MSZoning: The general zoning classification.
4. LotFrontage: The linear feet of street connected to the property.
5. LotArea: The lot size in square feet.
6. Street: The type of road access.
7. Alley: The type of alley access.
8. LotShape: The general shape of the property.
9. LandContour: The flatness of the property.
10. Utilities: The type of utilities available.
11. LotConfig: The lot configuration.
12. LandSlope: The slope of the property.
13. Neighborhood: The physical locations within Ames city limits.
14. Condition1: Proximity to the main road or railroad.
15. Condition2: Proximity to the main road or railroad (if a second is present).
16. BldgType: The type of dwelling.
17. HouseStyle: The style of dwelling.
18. OverallQual: The overall material and finish quality.
19. OverallCond: The overall condition rating.
20. YearBuilt: The original construction date.
21. YearRemodAdd: The remodel date.
22. RoofStyle: The type of roof.
23. RoofMatl: The roof material.
24. Exterior1st: The exterior covering on the house.
25. Exterior2nd: The exterior covering on the house (if more than one material is used).
26. MasVnrType: The type of masonry veneer.
27. MasVnrArea: The masonry veneer area in square feet.
28. ExterQual: The quality of exterior materials.
29. ExterCond: The present condition of the material on the exterior.
30. Foundation: The type of foundation.
31. BsmtQual: The height of the basement.
32. BsmtCond: The general condition of the basement.
33. BsmtExposure: The presence of walkout or garden level basement walls.
34. BsmtFinType1: The quality of the basement finished area.
35. BsmtFinSF1: Type 1 finished square feet.
36. BsmtFinType2: The quality of the second finished area (if present).
37. BsmtFinSF2: Type 2 finished square feet.
38. BsmtUnfSF: The unfinished square feet of basement area.
39. TotalBsmtSF: The total square feet of basement area.
40. Heating: The type of heating.
41. HeatingQC: The heating quality and condition.
42. CentralAir: The presence of central air conditioning.
43. Electrical: The electrical system.
44. 1stFlrSF: The square footage of the first floor.
45. 2ndFlrSF: The square footage of the second floor.
46. LowQualFinSF: The square footage of low-quality finished areas on all floors.
47. GrLivArea: The above-grade (ground) living area square footage.
48. BsmtFullBath: The number of basement full bathrooms.
49. BsmtHalfBath: The number of basement half bathrooms.
50. FullBath: The number of full bathrooms above grade.
51. HalfBath: The number of half baths above grade.
52. BedroomAbvGr: The number of bedrooms above the basement level.
53. KitchenAbvGr: The number of kitchens.
54. KitchenQual: The quality of the kitchen.
55. TotRmsAbvGrd: The total rooms above grade (excluding bathrooms).
56. Functional: The home functionality rating.
57. Fireplaces: The number of fireplaces.
58. FireplaceQu: The quality of the fireplace.
59. GarageType: The garage location.
60. GarageYrBlt: The year the garage was built.
61. GarageFinish: The interior finish of the garage.
62. GarageCars: The size of the garage in car capacity.
63. GarageArea: The size of the garage in square feet.
64. GarageQual: The quality of the garage.
65. GarageCond: The condition of the garage.
66. PavedDrive: The presence of a paved driveway.
67. WoodDeckSF: The wood deck area in square feet.
68. OpenPorchSF: The open porch area in square feet.
69. EnclosedPorch: The enclosed porch area in square feet.
70. 3SsnPorch: The three-season porch area in square feet.
71. ScreenPorch: The screen porch area in square feet.
72. PoolArea: The pool area in square feet.
73. PoolQC: The quality of the pool.
74. Fence: The quality of the fence.
75. MiscFeature: Miscellaneous features not covered in other categories.
76. MiscVal: The value of miscellaneous features.
77. MoSold: The month the property was sold.
78. YrSold: The year the property was sold.
79. SaleType: The type of sale.
80. SaleCondition: The condition of the sale.

The dataset covers a wide range of property characteristics, from lot and building attributes to sale conditions, with SalePrice as the target variable for predictive modeling.

## Importing Libraries

First, let's import the required libraries.

In [None]:
import matplotlib.pyplot as plt  # plotting library
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns  # plotting library
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# List the input files available under the read-only /kaggle/input directory.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# GridSearchCV is used later for hyperparameter tuning; the other libraries are already imported above.
from sklearn.model_selection import GridSearchCV

## Preprocessing
Let's read the dataset and save it into a data frame.

In [None]:
df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
print(f'Dataset includes the information for {len(df)} houses.\n')
df.head()

### Data types

In [None]:
pd.set_option('display.max_rows', None)
df.dtypes

### NaN values
Let's check if there are any NaN values in the data frame and fill in the NaN values appropriately. 

In [None]:
df.isna().sum()[df.isna().sum() > 0]
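
As a small illustration of the same check, the sketch below counts missing values once and also reports the share of rows affected. The `toy` frame and its values are made up for the example; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values once, then keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Report the counts together with the percentage of rows affected.
report = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": 100 * missing / len(toy),
})
print(report)
```

Seeing the percentage next to the count can help decide between filling a column and dropping records.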

Based on the data description, NaN values in the following features mean that the property does not have that feature. Since these are categorical variables, let's fill them with a 'None' category.

Alley,FireplaceQu,PoolQC,Fence,MiscFeature <---- 'None'

For the categorical features related to the garage and the basement (listed below), a 'None' category can be assigned as well. For GarageYrBlt (the year the garage was built), however, let's fill the missing values with the original construction year of the house.

GarageType,GarageFinish,GarageQual,GarageCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2 <---- 'None'

GarageYrBlt <---- YearBuilt

For the Electrical feature, there is only one record with a missing value; let's drop that record.

For MasVnrType (masonry veneer type), let's fill the missing values with a 'None' category. For MasVnrArea, let's use the median.

For the LotFrontage, let's use the median.

In [None]:
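# Categorical features where NaN means the property does not have the feature.
Selected_features=['Alley','FireplaceQu','PoolQC','Fence','MiscFeature','GarageType','GarageFinish','GarageQual','GarageCond','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','MasVnrType']
for item in Selected_features:
    df[item]=df[item].fillna('None')
# A missing garage year falls back to the construction year of the house.
# Assigning the filled column back avoids chained assignment, which newer pandas versions warn about or ignore.
df['GarageYrBlt']=df['GarageYrBlt'].fillna(df['YearBuilt'])
# Drop the single record with a missing Electrical value.
df.drop(df[df['Electrical'].isna()].index,inplace=True)
# Fill the remaining numerical gaps with the column median.
df['LotFrontage']=df['LotFrontage'].fillna(df['LotFrontage'].median())
df['MasVnrArea']=df['MasVnrArea'].fillna(df['MasVnrArea'].median())
print('The number of remaining missing values is',df.isna().sum().sum(),'\n')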
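
The imputation rules described above can be checked by hand on a small example. The sketch below applies them to a made-up `toy` frame (the values are assumptions for illustration; only the column names come from the dataset) and resets the index after dropping the incomplete record, which the notebook itself does not do.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the housing data (values are made up).
toy = pd.DataFrame({
    "Alley": [np.nan, "Pave", np.nan, np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan],
    "YearBuilt": [2003, 1976, 1995, 1915],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})

# Rule 1: NaN in a categorical "feature absent" column becomes the 'None' category.
toy["Alley"] = toy["Alley"].fillna("None")

# Rule 2: a missing garage year falls back to the house's construction year.
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(toy["YearBuilt"])

# Rule 3: drop records with a missing Electrical value (row 2 here).
toy = toy.dropna(subset=["Electrical"]).reset_index(drop=True)

# Rule 4: fill numerical gaps with the median of the remaining rows (62.5 here).
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
print("remaining missing values:", toy.isna().sum().sum())
```

Because the record is dropped before the median is computed, the median is taken over the remaining rows only, which mirrors the order of operations in the cell above.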

## Train/test split

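Let's drop the 'Id' column, separate the target ('SalePrice'), and split the data into train and test sets (80/20). Only the numerical features are kept for modeling; columns of type object (the categorical features) are excluded.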

In [None]:
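id_=df.pop('Id')  # identifier, not a predictive feature
y=df.pop('SalePrice')  # target variable
# Keep only the numerical columns; object (categorical) columns are left out of the model.
x=df.select_dtypes(exclude=['object'])
# 80/20 split; no random_state is set, so the split differs between runs.
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)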
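
To make the effect of `select_dtypes(exclude=['object'])` concrete, here is a minimal sketch on a made-up `toy` frame (the values are illustrative only). It also passes `random_state=0` to `train_test_split`, which is an assumption added here for reproducibility and is not part of the notebook's split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: two numerical features, one categorical feature, and the target.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "OverallQual": [7, 6, 7, 7, 8],
    # Explicit object dtype, matching how text columns appear in df.dtypes.
    "Neighborhood": pd.Series(
        ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"], dtype="object"
    ),
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

y = toy.pop("SalePrice")

# Excluding object columns silently removes every categorical feature.
x = toy.select_dtypes(exclude=["object"])
dropped = [col for col in toy.columns if col not in x.columns]
print("kept:", list(x.columns))
print("dropped:", dropped)

# Fixing random_state (an addition for this sketch) makes the split repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Listing the dropped columns is a quick way to see how much categorical information the model never receives.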

Let's take a look at X_train and Y_train.

In [None]:
pd.set_option('display.max_rows', 10)
X_train

In [None]:
Y_train

## XGBRegressor Model

Let's run a grid search over a range of hyperparameters and pick the combination with the highest mean cross-validated R-squared score.
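
For a sense of how expensive this search is, the grid used in this section can be counted with scikit-learn's `ParameterGrid`. The sketch below assumes the 4-fold setting used in this section and the default `refit=True` of `GridSearchCV`, which adds one final fit on the whole training set.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one searched in this section.
parameters = {
    "n_estimators": [10, 100, 500],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.05, 0.1, 0.2],
    "min_child_weight": [1, 10, 100],
}

# Every combination of values is one candidate model: 3 * 4 * 3 * 3 = 108.
n_candidates = len(ParameterGrid(parameters))

# Each candidate is fitted once per fold, plus one refit of the best candidate.
cv = 4
n_fits = n_candidates * cv + 1

print(n_candidates, "candidates,", n_fits, "model fits")  # 108 candidates, 433 model fits
```

Since the count grows multiplicatively with every added value, trimming one hyperparameter list shortens the search considerably.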

In [None]:
parameters = {'n_estimators': [10, 100, 500],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.05, 0.1, 0.2],
    'min_child_weight': [1, 10, 100]}

XGBR = XGBRegressor()
# 4-fold cross-validated grid search; the default score for a regressor is R-squared.
XGBR_cv = GridSearchCV(XGBR, parameters, cv=4)
XGBR_cv.fit(X_train, Y_train)
print("Tuned hyperparameters (best parameters):", XGBR_cv.best_params_)
print("R-squared Score_CV:", XGBR_cv.best_score_)
print("R-squared Score_test:", XGBR_cv.score(X_test, Y_test))
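
R-squared is unitless, so it can help to report errors in dollars as well. The notebook imports `mean_absolute_error` and `mean_squared_error` but does not use them; the sketch below shows how they could be applied. The `y_true` and `y_pred` arrays are made-up values for illustration, not results from the model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up sale prices and predictions (illustrative values only).
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0, 260000.0])

# Mean absolute error: average size of the miss, in dollars.
mae = mean_absolute_error(y_true, y_pred)

# Root mean squared error: like MAE, but penalizes large misses more.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R-squared: share of the variance in y_true explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.0f}")   # MAE:  10000
print(f"RMSE: {rmse:.0f}")  # RMSE: 10000
print(f"R2:   {r2:.3f}")    # R2:   0.968
```

In the notebook, `y_true` would presumably be `Y_test` and `y_pred` would come from `XGBR_cv.predict(X_test)`.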

__Summary:__ The "House Price Prediction" project used XGBRegressor to predict property sale prices from the Ames housing dataset. Preprocessing consisted of handling missing values, and the model was trained on the numerical features only. A grid search with 4-fold cross-validation tuned the hyperparameters, giving an R-squared score of 0.8549 in cross-validation and 0.9047 on the held-out test set. Because the 80/20 split is made without a fixed random seed, these scores will vary somewhat from run to run.