### <span style = 'color:green'> Build a machine learning algorithm to predict the house prices by using ensemble techniques.   </span>

#### About the dataset

Here's a brief version of what you'll find in the data description file.

- SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: $Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale


- **File descriptions**
- train.csv - the training set
- test.csv - the test set
- data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here

**Ensemble Techniques**

- Bagging - Building multiple models (typically of the same type) from different subsamples of the training dataset.
- Boosting - Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.
- Voting - Building multiple models (typically of differing types) and simple statistics (like calculating the mean) are used to combine predictions.




* Build Various ***ML Models*** with the view of ***increasing accuracy*** of the Model. 


1. Dicision trees regression

2. random forest regression 

3. ADA boost

4. Gradient Boosting with XGBoost




### To download the dataset <a href = 'https://drive.google.com/file/d/1rknDE31orIy3R214mzd4Wcrlbyw7LS2O/view?usp=sharing' title = 'Google Drive'> Click Here

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, BaggingRegressor, VotingRegressor
from xgboost import XGBRegressor

In [13]:
#Load the train and test datasets
train_data = pd.read_csv('train(1).csv')
test_data = pd.read_csv('test.csv')

In [14]:
all_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)

In [15]:
all_data.dropna(axis=1, inplace=True)

In [21]:
#Load the train and test datasets
train_data = pd.read_csv('train(1).csv')
test_data = pd.read_csv('test.csv')

In [23]:
numeric_columns = all_data.select_dtypes(include=['number']).columns

# Impute missing values with the mean of numeric columns
all_data[numeric_columns] = all_data[numeric_columns].fillna(all_data[numeric_columns].mean())

# Convert categorical variables to one-hot encoding
all_data = pd.get_dummies(all_data)

# Split features and target variable
X_train = all_data[:len(train_data)].drop(columns=['SalePrice'])
y_train = train_data['SalePrice']
X_test = all_data[len(train_data):].drop(columns=['SalePrice'])

# Step 5: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: Train-test split for model evaluation
X_train_eval, X_val, y_train_eval, y_val = train_test_split(X_train_scaled, y_train, test_size=0.2, random_state=42)


In [26]:
# Decision Tree Regression
dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train, y_train)
print("\nDecision Tree Regression:")
print("Training R^2 Score:", dt_reg.score(X_train, y_train))

# Random Forest Regression
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_train)
print("\nRandom Forest Regression:")
print("Training R^2 Score:", rf_reg.score(X_train, y_train))

# AdaBoost
adaboost_reg = AdaBoostRegressor(random_state=42)
adaboost_reg.fit(X_train, y_train)
print("\nAdaBoost:")
print("Training R^2 Score:", adaboost_reg.score(X_train, y_train))

# Gradient Boosting with XGBoost
xgb_reg = XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train)
print("\nGradient Boosting with XGBoost:")
print("Training R^2 Score:", xgb_reg.score(X_train, y_train))


Decision Tree Regression:
Training R^2 Score: 1.0

Random Forest Regression:
Training R^2 Score: 0.9802302759460504

AdaBoost:
Training R^2 Score: 0.8738536568526146

Gradient Boosting with XGBoost:
Training R^2 Score: 0.9996190384852393


In [27]:
# Step 7: Model Evaluation for Gradient Boosting with XGBoost
# Make predictions on the validation set
y_pred = xgb_reg.predict(X_val)

# Evaluate the model using appropriate metrics
mse = mean_squared_error(y_val, y_pred)
print("Mean Squared Error (MSE) on Validation Set:", mse)

# Optionally, you can also calculate other evaluation metrics such as R-squared, MAE, etc.

# Step 8: Hyperparameter Tuning (Optional)
# If the model performance is not satisfactory, you can perform hyperparameter tuning to improve it.
# You can use techniques like GridSearchCV or RandomizedSearchCV to search for the best hyperparameters.

# For example:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 3, 5],
    # Add more hyperparameters as needed
}

grid_search = GridSearchCV(estimator=XGBRegressor(random_state=42),
                           param_grid=param_grid,
                           scoring='neg_mean_squared_error',  # Use appropriate scoring metric
                           cv=5)  # Cross-validation folds

grid_search.fit(X_train_eval, y_train_eval)

# Best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Evaluate the model with the best hyperparameters
best_xgb_reg = grid_search.best_estimator_
y_pred_best = best_xgb_reg.predict(X_val)
mse_best = mean_squared_error(y_val, y_pred_best)
print("Mean Squared Error (MSE) on Validation Set (Best Model):", mse_best)


Mean Squared Error (MSE) on Validation Set: 19344377725.94379
Best Hyperparameters: {'learning_rate': 0.2, 'max_depth': 3, 'min_child_weight': 1}
Mean Squared Error (MSE) on Validation Set (Best Model): 723754364.3995544
