# House Price Classification Model

This notebook develops a classification model to predict whether a house is 'Expensive' or not based on various features. The goal is to build a model that can accurately classify houses into these two categories.

The process involves:
1.  **Data Loading and Exploration:** Loading the dataset and examining its structure, missing values, and basic statistics.
2.  **Data Preprocessing:** Handling missing values, encoding categorical features (both ordinal and one-hot encoding), and scaling numerical features.
3.  **Model Training:** Training a RandomForestClassifier model.
4.  **Hyperparameter Tuning:** Using GridSearchCV to find the best hyperparameters for the model to improve performance.
5.  **Model Evaluation:** Evaluating the model's performance on both the training and testing data using various metrics like accuracy, F1 score, precision, recall, and ROC AUC.
6.  **Prediction on New Data:** Using the trained model to make predictions on a new dataset.
7.  **Saving Predictions:** Saving the predictions to a CSV file.

**Insights from the model:**

On the held-out test set the model reaches Testing Accuracy: 0.949, F1 Score: 0.795, Precision Score: 0.829, Recall Score: 0.763, and ROC AUC Score: 0.981, against a Training Accuracy of 0.999. The near-perfect training accuracy shows that the Random Forest fits the training data almost exactly. The drop of about five points on the test set indicates some overfitting, which is common for this model family, although generalization remains good. Accuracy alone is flattering here because expensive houses are the minority class, so precision and recall for the 'Expensive' class are more informative: about 83% of the houses predicted as expensive really are expensive, while about 24% of the truly expensive houses are missed. The high ROC AUC shows that the predicted probabilities rank expensive houses above non-expensive ones well, which suggests that adjusting the decision threshold could trade some precision for higher recall.

Further steps could involve exploring other models, feature engineering, or gathering more data to potentially improve the model's performance and robustness.

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

In [None]:
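# Google Drive share link for the training data
url = "https://drive.google.com/file/d/1MNscpmMalx2vDHb4vdEdPwkXz3zZ7K0P/view?usp=sharing"
# The file id is the second-to-last path segment; build a direct-download link from it
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = pd.read_csv(path)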
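
The link conversion used above can be wrapped in a small helper so that the same logic serves both the training file and the new data. This is a minimal sketch: the function name `drive_download_url` is hypothetical, and it assumes the share link always has the `file/d/<id>/view` shape seen in this notebook.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive 'file/d/<id>/view' share link into a direct-download link."""
    # The file id is the second-to-last segment of the share URL
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/1MNscpmMalx2vDHb4vdEdPwkXz3zZ7K0P/view?usp=sharing"
download_url = drive_download_url(share_url)
print(download_url)
```

The resulting string can be passed to `pd.read_csv` exactly as `path` is in the cell above.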

In [None]:
data

Unnamed: 0,LotArea,LotFrontage,TotalBsmtSF,BedroomAbvGr,Fireplaces,PoolArea,GarageCars,WoodDeckSF,ScreenPorch,Expensive,...,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
0,8450,65.0,856,3,0,0,2,0,0,0,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
1,9600,80.0,1262,3,1,0,2,298,0,0,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
2,11250,68.0,920,3,1,0,2,0,0,0,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
3,9550,60.0,756,3,1,0,3,0,0,0,...,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml
4,14260,84.0,1145,4,1,0,3,192,0,0,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,7917,62.0,953,3,1,0,2,0,0,0,...,Attchd,RFn,TA,TA,Y,,,,WD,Normal
1456,13175,85.0,1542,3,2,0,2,349,0,0,...,Attchd,Unf,TA,TA,Y,,MnPrv,,WD,Normal
1457,9042,66.0,1152,4,2,0,1,0,0,1,...,Attchd,RFn,TA,TA,Y,,GdPrv,Shed,WD,Normal
1458,9717,68.0,1078,2,0,0,1,366,0,0,...,Attchd,Unf,TA,TA,Y,,,,WD,Normal


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   LotArea        1460 non-null   int64  
 1   LotFrontage    1201 non-null   float64
 2   TotalBsmtSF    1460 non-null   int64  
 3   BedroomAbvGr   1460 non-null   int64  
 4   Fireplaces     1460 non-null   int64  
 5   PoolArea       1460 non-null   int64  
 6   GarageCars     1460 non-null   int64  
 7   WoodDeckSF     1460 non-null   int64  
 8   ScreenPorch    1460 non-null   int64  
 9   Expensive      1460 non-null   int64  
 10  MSZoning       1460 non-null   object 
 11  Condition1     1460 non-null   object 
 12  Heating        1460 non-null   object 
 13  Street         1460 non-null   object 
 14  CentralAir     1460 non-null   object 
 15  Foundation     1460 non-null   object 
 16  ExterQual      1460 non-null   object 
 17  ExterCond      1460 non-null   object 
 18  BsmtQual

In [None]:
pd.options.display.max_rows = 100
data.isnull().sum()

Unnamed: 0,0
LotArea,0
LotFrontage,259
TotalBsmtSF,0
BedroomAbvGr,0
Fireplaces,0
PoolArea,0
GarageCars,0
WoodDeckSF,0
ScreenPorch,0
Expensive,0


In [None]:
data.drop(['FireplaceQu', 'MasVnrType', 'Alley','PoolQC', 'Fence', 'MiscFeature'], axis=1, inplace=True)
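
The notebook does not state why these six columns are dropped. One plausible rule is to drop columns whose share of missing values exceeds a cut-off. The sketch below illustrates that idea on a toy frame; the data and the 0.5 threshold are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with different amounts of missing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Share of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.5  # hypothetical cut-off, not taken from the notebook
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
print(missing_share)
print(cols_to_drop)
```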

In [None]:
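# define features (X) and target (y)
# pop() removes 'Expensive' from data in place, so X (the same object) no longer contains the target
y = data.pop('Expensive')
X = data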

In [None]:
#split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31416)
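
The split above is purely random. Because the 'Expensive' class is the minority, a stratified split is one option for keeping the class ratio the same in both parts. The sketch below shows the `stratify` argument of `train_test_split` on toy arrays. It is an alternative the notebook does not use, and the toy data is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 10% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the 10% positive share in both train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=31416, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```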

In [None]:
# select categorical and numerical column names
X_cat_columns = X.select_dtypes(exclude="number").columns
X_num_columns = X.select_dtypes(include="number").columns

# Categorical columns with a natural order (ordinal encoding)
ordinal_cols = ["ExterQual", "ExterCond", "BsmtQual", "BsmtCond",
                "BsmtExposure", "BsmtFinType1", "KitchenQual",
                "LotShape", "LandSlope", "Functional",
                "GarageQual", "GarageCond", "HeatingQC", "PavedDrive"]

# Ordinal encoding order for each feature
bsmt_qual_order = ["Po", "Fa", "TA", "Gd", "Ex"]
bsmt_cond_order = ["Po", "Fa", "TA", "Gd", "Ex"]
bsmt_exposure_order = ["No", "Mn", "Av", "Gd"]
bsmt_fintype1_order = ["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"]
kitchen_qual_order = ["Po", "Fa", "TA", "Gd", "Ex"]
exter_qual_order = ["Po", "Fa", "TA", "Gd", "Ex"]
exter_cond_order = ["Po", "Fa", "TA", "Gd", "Ex"]
lot_shape_order = ["IR3", "IR2", "IR1", "Reg"]
land_slope_order = ["Sev", "Mod", "Gtl"]
functional_order = ["Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"]
garage_qual_order = ["Po", "Fa", "TA", "Gd", "Ex"]
garage_cond_order = ["Po", "Fa", "TA", "Gd", "Ex"]
heating_qc_order = ["Po", "Fa", "TA", "Gd", "Ex"]
paved_drive_order = ["N", "P", "Y"]

ordinal_encoder = OrdinalEncoder(categories=[
    exter_qual_order, exter_cond_order, bsmt_qual_order, bsmt_cond_order,
    bsmt_exposure_order, bsmt_fintype1_order, kitchen_qual_order, lot_shape_order,
    land_slope_order, functional_order, garage_qual_order, garage_cond_order,
    heating_qc_order, paved_drive_order
], handle_unknown="use_encoded_value",
   unknown_value=-1  # Placeholder for unknown categories
)


# Specify one-hot encoded columns based on updated categorical feature list
onehot_cols = ["MSZoning", "Condition1", "Heating", "Street", "CentralAir",
               "Foundation", "LandContour", "Utilities", "LotConfig",
               "Neighborhood", "Condition2", "BldgType", "HouseStyle",
               "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd",
               "BsmtFinType2", "Electrical", "GarageType", "GarageFinish",
               "SaleType", "SaleCondition"]

onehot_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# Numeric pipeline
numeric_pipe = make_pipeline(
    SimpleImputer(),
    StandardScaler()
)

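# Categorical pipeline with both ordinal and one-hot encoding.
# The inner ColumnTransformer selects columns by name, so the imputer must return a
# DataFrame; this relies on set_output(transform='pandas') being applied to the full pipeline.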
categoric_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    ColumnTransformer(
        transformers=[
            ("cat_ordinal", ordinal_encoder, ordinal_cols),
            ("cat_onehot", onehot_encoder, onehot_cols)
        ]
    )
)

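# Full preprocessor with explicit step names ("num", "cat").
# Note: the next cell rebuilds it with make_column_transformer, and the parameter
# grid uses that version's auto-generated step names.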
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, X_num_columns),
        ("cat", categoric_pipe, X_cat_columns),
    ]
)
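
To make the ordinal encoding step concrete, the sketch below applies an `OrdinalEncoder` with one explicit category order and the same `handle_unknown` settings as above to a toy column. The toy values, including the made-up label `"Unknown"`, are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same quality scale as used above, worst to best
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

enc = OrdinalEncoder(
    categories=[quality_order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # categories outside the list map to -1
)

train_toy = pd.DataFrame({"KitchenQual": ["TA", "Gd", "Ex", "Fa"]})
enc.fit(train_toy)

# "Po" is in the category list (code 0); "Unknown" is not (code -1)
new_toy = pd.DataFrame({"KitchenQual": ["Po", "Ex", "Unknown"]})
codes = enc.transform(new_toy).ravel().tolist()
print(codes)
```

Each label is mapped to its position in the supplied list, so the integer codes preserve the quality ranking.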

In [None]:
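# Rebuild the preprocessor with make_column_transformer, which names the steps
# automatically ("pipeline-1", "pipeline-2"); the parameter grid uses these names.
# This overrides the preprocessor defined in the previous cell.
preprocessor = make_column_transformer(
    (numeric_pipe, X_num_columns),
    (categoric_pipe, X_cat_columns),
)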
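
The auto-generated step names matter because `GridSearchCV` addresses parameters by their full path. The sketch below builds a reduced version of the pipeline to show where names such as `pipeline-1` come from. The two column names and the simplified categorical pipeline are assumptions for brevity.

```python
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         OneHotEncoder(handle_unknown="ignore"))

# Two transformers of the same class get numbered names
pre = make_column_transformer((num_pipe, ["LotArea"]), (cat_pipe, ["Street"]))
pipe = make_pipeline(pre, RandomForestClassifier())

step_names = [name for name, _, _ in pre.transformers]
print(step_names)

# Full parameter paths that a grid search can reference
imputer_keys = sorted(k for k in pipe.get_params()
                      if k.endswith("simpleimputer__strategy"))
print(imputer_keys)
```

Listing `pipe.get_params()` in this way is a convenient check before writing a parameter grid, since a misspelled key makes the search fail at fit time.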

In [None]:
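# initialize the model (no random_state is set, so results vary slightly between runs)
rf_model = RandomForestClassifier()

In [None]:
# create the pipeline; pandas output lets the nested ColumnTransformer select columns by name
fullpipe = make_pipeline(preprocessor,
                         rf_model).set_output(transform='pandas')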

In [None]:
# create parameter grid
param_grid = {
    "columntransformer__pipeline-1__simpleimputer__strategy":["mean", "median"],
    "columntransformer__pipeline-1__standardscaler__with_mean":[True, False],
    "columntransformer__pipeline-1__standardscaler__with_std":[True, False],
    "randomforestclassifier__n_estimators": [100, 150],
    "randomforestclassifier__max_depth": [None, 20],
    "randomforestclassifier__min_samples_split": [2, 5],
    "randomforestclassifier__min_samples_leaf": [1, 2],
    "randomforestclassifier__max_features": ["sqrt", "log2"],
    "randomforestclassifier__bootstrap": [True]
}

In [None]:
# define cross validation
search = GridSearchCV(fullpipe,
                      param_grid,
                      cv=10,
                      verbose=1)

In [None]:
# fit
search.fit(X_train, y_train)

Fitting 10 folds for each of 256 candidates, totalling 2560 fits
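
The figure of 256 candidates follows directly from the grid: eight parameters with two values each and one parameter with a single value. A quick way to check the size of a grid before launching a search is `ParameterGrid`, sketched here with the same grid as above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "columntransformer__pipeline-1__simpleimputer__strategy": ["mean", "median"],
    "columntransformer__pipeline-1__standardscaler__with_mean": [True, False],
    "columntransformer__pipeline-1__standardscaler__with_std": [True, False],
    "randomforestclassifier__n_estimators": [100, 150],
    "randomforestclassifier__max_depth": [None, 20],
    "randomforestclassifier__min_samples_split": [2, 5],
    "randomforestclassifier__min_samples_leaf": [1, 2],
    "randomforestclassifier__max_features": ["sqrt", "log2"],
    "randomforestclassifier__bootstrap": [True],
}

n_candidates = len(ParameterGrid(param_grid))  # 2**8 * 1
n_fits = n_candidates * 10                     # cv=10 folds per candidate
print(n_candidates, n_fits)
```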




In [None]:
# cross validation average accuracy
search.best_score_

0.9546419098143236

In [None]:
# best parameters
search.best_params_

{'columntransformer__pipeline-1__simpleimputer__strategy': 'median',
 'columntransformer__pipeline-1__standardscaler__with_mean': False,
 'columntransformer__pipeline-1__standardscaler__with_std': False,
 'randomforestclassifier__bootstrap': True,
 'randomforestclassifier__max_depth': 20,
 'randomforestclassifier__max_features': 'sqrt',
 'randomforestclassifier__min_samples_leaf': 1,
 'randomforestclassifier__min_samples_split': 5,
 'randomforestclassifier__n_estimators': 150}
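
The selected combination sets both `with_mean` and `with_std` to `False`, which turns the scaler into a pass-through step. This is consistent with tree-based models splitting on thresholds, which makes them largely insensitive to feature scaling. The sketch below confirms the pass-through behaviour on a toy array; the values are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features (hypothetical values)
X_toy = np.array([[8450.0, 65.0],
                  [9600.0, 80.0],
                  [11250.0, 68.0]])

# With centering and scaling both disabled, the data passes through unchanged
passthrough = StandardScaler(with_mean=False, with_std=False)
X_scaled = passthrough.fit_transform(X_toy)

print(np.array_equal(X_toy, X_scaled))
```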

In [None]:
# training accuracy
y_train_pred = search.predict(X_train)

accuracy_score(y_train, y_train_pred)

0.9991438356164384

In [None]:
# testing accuracy
y_test_pred = search.predict(X_test)

accuracy_score(y_test, y_test_pred)

0.9486301369863014

In [None]:
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# y_test is binary (0/1), so the default average='binary' scores the positive class (1 = Expensive)
f1 = f1_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
roc_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])  # Use probabilities for AUC

# Print scores
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)
print("F1 Score:", f1)
print("Precision Score:", precision)
print("Recall Score:", recall)
print("ROC AUC Score:", roc_auc)

Training Accuracy: 0.9991438356164384
Testing Accuracy: 0.9486301369863014
F1 Score: 0.7945205479452054
Precision Score: 0.8285714285714286
Recall Score: 0.7631578947368421
ROC AUC Score: 0.9805221715706589
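
The reported scores are consistent with one specific set of confusion-matrix counts. The counts below are inferred from the printed metrics, assuming a test set of 292 rows (20% of 1460); they are not an output of the notebook. The sketch rebuilds label arrays from those counts and recomputes the metrics as a check.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Inferred counts (assumption): true/false positives and negatives on 292 test rows
tp, fp, fn, tn = 29, 6, 9, 248

y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Accuracy of always predicting the majority class (not expensive)
baseline_accuracy = (fp + tn) / len(y_true)

print(accuracy, precision, recall, f1)
print(baseline_accuracy)
```

Under these inferred counts, always predicting "not expensive" would already score about 0.87 accuracy, which is why precision and recall say more about this model than accuracy does.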


In [None]:
url = "https://drive.google.com/file/d/1QjjOREyIugHZ0hkXdOgsqrHhkse0wfX-/view?usp=sharing"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
new_data = pd.read_csv(path)

In [None]:
# drop the same columns that were removed from the training data
new_data.drop(['FireplaceQu', 'MasVnrType', 'Alley','PoolQC', 'Fence', 'MiscFeature'], axis=1, inplace=True)

In [None]:
new_prediction = search.predict(new_data)
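
Prediction only works if the new data contains every column the pipeline was fitted on. A small check before calling `predict` can give a clearer message than the error raised inside the pipeline. The helper `missing_columns` and the toy frame below are hypothetical; in the notebook the expected names would come from `X_train.columns`.

```python
import pandas as pd


def missing_columns(expected, frame):
    """Return the expected column names that are absent from frame."""
    return [col for col in expected if col not in frame.columns]


# Toy stand-ins (hypothetical) for the training columns and the new data
train_cols = ["Id", "LotArea", "Street"]
new_toy = pd.DataFrame({"Id": [1461], "LotArea": [11622]})

missing = missing_columns(train_cols, new_toy)
print(missing)
```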

In [None]:
predictions_df = pd.DataFrame({
    "Id": new_data['Id'],
    "Expensive": new_prediction
})

In [None]:
predictions_df.to_csv("predictions.csv", index=False)