# **1. Import Library**

Pada tahap ini, Anda perlu mengimpor beberapa pustaka (library) Python yang dibutuhkan untuk analisis data dan pembangunan model machine learning.

In [1]:
# Import relevant Library

# Data Wrangling
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

# Data Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Evaluation metrics
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, f1_score

# **2. Loading Dataset from Clustering Results**

Load the clustering result dataset from the CSV file into a DataFrame variable.

In [2]:
# loading data
df = pd.read_csv('/content/house_cluster.csv')

In [3]:
df.head(5)

Unnamed: 0,property_type,price,location,city,baths,purpose,bedrooms,area_m2,Cluster
0,Flat,38000,DHA Defence,Islamabad,3,For Rent,3,253.0,Medium Affordable Properties
1,House,11500000,Wapda Town,Lahore,3,For Sale,3,126.0,Large Luxury Properties
2,House,6500000,Lahore Medical Housing Society,Lahore,3,For Sale,3,76.0,Large Luxury Properties
3,House,15000000,Bahria Town,Lahore,6,For Sale,5,253.0,Large Luxury Properties
4,Upper Portion,42000,Gulistan-e-Jauhar,Karachi,3,For Rent,3,303.0,Cheap Affordable Properties


In [4]:
df.value_counts('Cluster')

Unnamed: 0_level_0,count
Cluster,Unnamed: 1_level_1
Large Luxury Properties,5707
Medium Affordable Properties,3101
Cheap Affordable Properties,1192


Equalize the number of clusters as targets

In [5]:
df = df.groupby('Cluster').apply(lambda x: x.sample(n=1000, random_state=42))
df.reset_index(drop=True, inplace=True)

print(df.value_counts('Cluster'))

Cluster
Cheap Affordable Properties     1000
Large Luxury Properties         1000
Medium Affordable Properties    1000
Name: count, dtype: int64


  df = df.groupby('Cluster').apply(lambda x: x.sample(n=1000, random_state=42))


Melakukan Encoding dan Stadarisasi Data

In [6]:
df.shape

(3000, 9)

In [7]:
le = LabelEncoder()
df['Cluster'] = le.fit_transform(df['Cluster'])
df['city'] = le.fit_transform(df['city'])
df['purpose'] = le.fit_transform(df['baths'])
df['property_type'] = le.fit_transform(df['property_type'])
df['location'] = le.fit_transform(df['location'])

In [8]:
scaler = StandardScaler()
df[['price','baths','bedrooms','area_m2']] = scaler.fit_transform(df[['price','baths','bedrooms','area_m2']])

# **3. Data Splitting**

The Data Splitting stage aims to separate the dataset into two parts: training data (training set) and test data (test set).

In [9]:
X = df.drop(columns=['Cluster'])
y = df['Cluster']

In [10]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# **4. Building Classification Models**


## **a. Building Classification Models**


Initialize the model with 3 algorithms whose performance will be compared. The classification algorithm used is:
*   KNeighbors Classifier
*   Support Vector Machine (SVC)
*   Random Forest

In [11]:
def model(x_train,y_train):
  #loreg
  knn = KNeighborsClassifier()
  knn.fit(x_train,y_train)

  #support vector machine
  svc = SVC()
  svc.fit(x_train,y_train)

  #random forest
  rfc = RandomForestClassifier()
  rfc.fit(x_train,y_train)

  print('[0] KNeighbors Classifier Training Acc : ' , knn.score(x_train, y_train))
  print('[1] Support Vector Machine Training Acc : ' , svc.score(x_train, y_train))
  print('[2] Random Forest Classifier Training Acc : ' , rfc.score(x_train, y_train))

  return knn, svc, rfc

In [12]:
model = model(x_train,y_train)

[0] KNeighbors Classifier Training Acc :  0.9270833333333334
[1] Support Vector Machine Training Acc :  0.5129166666666667
[2] Random Forest Classifier Training Acc :  1.0


**The following are the accuracy results of the 3 trained models**

| Training Model            | Accuracy Training |
|---------------------------|-------------------|
| KNeighbors Classifier       | 0.9270833333333334|
| Support Vector Machine:   | 0.5129166666666667|
| Random Forest Classifier: | 1.0          |

## **b. Evaluation of Classification Models**

After the model is trained, it is necessary to test the model. The model is tested with test data. After testing, the performance of the model test will be seen.

The metrics used in this test use the confusion matrix:

* Accuracy
* Precision
* Recall
* F1 Score



In [13]:
def evaluate_model(y_true, y_pred, model_name):
    # Confusion Matrix
    cm = confusion_matrix(y_true, y_pred)
    print(f'Confusion Matrix for {model_name}:')
    print(cm)

    # Metrik lainnya
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')

    # Cetak hasil
    print(f'Accuracy: {accuracy:.4f}')
    print(f'Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}')
    print(f'F1 Score: {f1:.4f}')

In [19]:
evaluate_model(y_test, model[0].predict(x_test), 'KNeighbors Classifier')

Confusion Matrix for KNeighbors Classifier:
[[189  20   8]
 [ 12 167  18]
 [  7  15 164]]
Accuracy: 0.8667
Precision: 0.8677
Recall: 0.8667
F1 Score: 0.8669


In [15]:
evaluate_model(y_test, model[1].predict(x_test), 'Support Vector Machine')

Confusion Matrix for Support Vector Machine:
[[154  30  33]
 [ 64  99  34]
 [ 73  54  59]]
Accuracy: 0.5200
Precision: 0.5142
Recall: 0.5200
F1 Score: 0.5076


In [16]:
evaluate_model(y_test, model[2].predict(x_test), 'Random Forest Classifier')

Confusion Matrix for Random Forest Classifier:
[[217   0   0]
 [  0 197   0]
 [  0   0 186]]
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000


**Testing results from the three models**:

| Model                     | Accuracy | Precision | Recall | F1-Score |
|---------------------------|----------|-----------|--------|----------|
| KNeighbors Classifier      | 0.8667   | 0.8667   | 0.8667 | 0.8669 |
| Support Vector Machine:   | 0.5200  | 0.5142    | 0.5200| 0.5076  |
| Random Forest Classifier: | 1.0000 | 1.0000  | 1.0000| 1.0000 |



## **c. Classification Model Tuning**
Based on the table above, two of the three models have very good performance, but not with support vector machines. therefore hyperparameter tuning will be carried out using RandomizedSearchCV. This is done to try to improve the performance of the support vector machine classifier model

In [20]:
param_dist = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['rbf', 'poly', 'sigmoid']
}

In [21]:
svc = SVC()
random_search = RandomizedSearchCV(estimator=svc, param_distributions=param_dist, n_iter=5, cv=3, n_jobs=-1, verbose=2, random_state=42)
random_search.fit(x_train, y_train)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


In [22]:
print(f"Best parameters (Random Search): {random_search.best_params_}")
best_svc_random = random_search.best_estimator_

Best parameters (Random Search): {'kernel': 'poly', 'gamma': 0.1, 'C': 0.1}


In [23]:
y_pred_random = best_svc_random.predict(x_test)

In [26]:
    accuracy = accuracy_score(y_test, y_pred_random)
    precision = precision_score(y_test, y_pred_random, average='weighted')
    recall = recall_score(y_test, y_pred_random, average='weighted')
    f1 = f1_score(y_test, y_pred_random, average='weighted')

In [27]:
    print(f'Accuracy: {accuracy:.4f}')
    print(f'Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}')
    print(f'F1 Score: {f1:.4f}')

Accuracy: 0.9967
Precision: 0.9967
Recall: 0.9967
F1 Score: 0.9967


## **d. Evaluation of Classification Models after Tuning**

Initially, the support vector machine classifier model had poor training and testing accuracy results. This indicates that the model is underfit. Therefore, it is necessary to carry out hyperparameter tuning to successfully model performance of the support vector machine classifier. The following are the results before and after hyperparameter tuning

Beikut adalah hasil evaluasi model klasifikasi

| Metrics   | Testing before tuning | after hyperparameter tuning |
|-----------|--------------------|----------------------------------|
| Accuracy  | 0.5200            | 0.9967                       |
| Precision | 0.5142             | 0.9967                         |
| Recall    | 0.5200             | 0.9967                        |
| F1-Score  | 0.5076            | 0.9967                        |


## **e. Analysis of Classification Model Evaluation Results**

Based on the experiments above, we conclude that the Random Forest Classifier is a strong and optimal classification algorithm because it combines many decision trees to increase the accuracy and stability of predictions. Some algorithms have their respective advantages and disadvantages. Therefore, the function of hyperparameter tuning is to improve model performance if the model used only uses default settings