<a href="https://colab.research.google.com/github/N4bilFikri/Data-Mining/blob/main/Tugas_4_Modelling_PDAB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#*Comparison of Economic Levels Between Poor and Rich Children in Developing Countries*#

#**Import the Libraries and Resources to Be Used**

**Core Library**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**Library for Splitting Data**

In [2]:
from sklearn.model_selection import train_test_split

**Library for Data Normalization**

In [3]:
from sklearn.preprocessing import MinMaxScaler

**Libraries for Building Models**

In [4]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

from sklearn.metrics import accuracy_score

#**Initialize a Variable to Store the DataFrame**

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/N4bilFikri/Data-Mining/main/Data%20Cleaned.csv')

In [6]:
df.head()

Unnamed: 0,total,poorest_20perc,richest_20perc,diff,country_encoded,country_Albania,country_Algeria,country_Argentina,country_Bangladesh,country_Belize,...,country_Syrian Arab Republic,country_Tajikistan,country_Thailand,country_The former Yugoslav Republic of Macedonia,country_Trinidad and Tobago,country_Tunisia,country_Turkmenistan,country_Uzbekistan,country_Viet Nam,country_Yemen
0,5,1,24,24,18,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,6,1,23,22,11,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,1,24,23,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,11,1,34,33,13,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,10,2,28,26,8,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
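The `Unnamed: 0` column in the preview is what pandas typically produces when a CSV was written together with its row index and then read back without `index_col`. A minimal sketch, assuming that is what happened here; the inline CSV is a three-row excerpt shaped like the preview above, not the real file:

```python
import io

import pandas as pd

# Three-row excerpt whose first header cell is empty, as produced by
# DataFrame.to_csv() with the default index=True (an assumption here).
csv_text = ",total,poorest_20perc,richest_20perc\n0,5,1,24\n1,6,1,23\n2,6,1,24\n"

# Default read: the unnamed first column is kept as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 treats that column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

If the column really is a leftover index, reading the file this way would keep it out of the feature matrix built later.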




#**Predictive Modelling**

#####*The dataset has several independent variables (for example, country_encoded) and one dependent variable (total). Based on this, the goal of the analysis is assumed to be predicting the value of total from the available independent variables.*

#####*In this context, a suitable modelling approach is a predictive (supervised) algorithm. Such an algorithm builds a mathematical model that predicts a target from the independent variables: regression predicts continuous values, while classification predicts categorical ones.*

#####*Accordingly, predictive algorithms are the appropriate choice for this dataset.*
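To make the regression framing concrete, here is a minimal sketch that fits an ordinary least-squares model with NumPy. The data are synthetic, the assumed relationship between the variables is invented for illustration, and the names `poorest`, `richest`, and `total` only echo the column names; nothing here is fitted to the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two feature columns (assumed ranges, not real data).
poorest = rng.uniform(0, 50, size=200)
richest = poorest + rng.uniform(10, 50, size=200)

# Assumed relationship for illustration: total sits between the two groups.
total = 0.5 * poorest + 0.5 * richest + rng.normal(0, 1.0, size=200)

# Design matrix with an intercept column, solved by least squares.
X = np.column_stack([np.ones_like(poorest), poorest, richest])
coef, *_ = np.linalg.lstsq(X, total, rcond=None)

pred = X @ coef
mae = float(np.mean(np.abs(pred - total)))
print("coefficients:", np.round(coef, 2))
print("mean absolute error:", round(mae, 2))
```

The output is a continuous number per row, which is what distinguishes regression from predicting a class label.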

###**Pre-Processing**

#####**Split Data**

In [29]:
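# Features are all columns except the target (this keeps 'Unnamed: 0' too).
x = df.drop('total', axis=1)
y = df['total']  # target

# Hold out 30% of the rows for testing; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)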
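As a rough illustration of what a shuffled 70/30 split does, the sketch below permutes row indices with a seeded NumPy generator and slices them. It does not reproduce the exact rows that `train_test_split(..., random_state=42)` selects, and the array `data` is a made-up placeholder.

```python
import numpy as np

# Placeholder data: 10 rows, 3 columns (hypothetical values).
data = np.arange(30).reshape(10, 3)

rng = np.random.default_rng(42)
shuffled = rng.permutation(len(data))  # shuffled row indices

n_test = int(round(0.3 * len(data)))  # 30% of the rows
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

train_rows, test_rows = data[train_idx], data[test_idx]
print(train_rows.shape, test_rows.shape)
```

Every row lands in exactly one of the two parts, and reusing the same seed gives the same partition on each run.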

#####**Data Normalization**

In [30]:
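scaler = MinMaxScaler()

# Fit the scaler on the training features only, then scale them to [0, 1].
x_train_norm = scaler.fit_transform(x_train)

# Reuse the training min/max for the test features to avoid data leakage.
x_test_norm = scaler.transform(x_test)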
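Min-max scaling maps each feature to [0, 1] using the minimum and maximum seen in the training split: x' = (x - min) / (max - min). The sketch below applies that formula by hand with NumPy on made-up numbers, to show why the statistics come from the training data only and why scaled test values can fall outside [0, 1].

```python
import numpy as np

# Hypothetical single-feature splits (values chosen for illustration only).
train = np.array([[2.0], [10.0], [6.0]])
test = np.array([[4.0], [12.0]])

# Statistics come from the training split only, as with scaler.fit_transform.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)  # like scaler.transform

print(train_scaled.ravel())  # 0, 1 and 0.5
print(test_scaled.ravel())   # 0.25 and 1.25 (12 exceeds the training maximum)
```

This hand-written version would divide by zero for a constant column; `MinMaxScaler` handles that case itself.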

###**Build and Train Model**

#####**Gaussian Naive Bayes**

In [31]:
# Classifier: each distinct value of y_train (total) is treated as a class label.
gnb = GaussianNB()

gnb.fit(x_train_norm, y_train)

#####**K-Nearest Neighbor**

In [32]:
# Default settings, including n_neighbors=5.
knn = KNeighborsClassifier()

knn.fit(x_train_norm, y_train)

#####**Decision Tree**

In [33]:
# No random_state is set, so the fitted tree may differ between runs.
dtc = DecisionTreeClassifier()

dtc.fit(x_train_norm, y_train)

###**Run and Compare Predictions**

In [34]:
gnb_pred = gnb.predict(x_test_norm)
knn_pred = knn.predict(x_test_norm)
dtc_pred = dtc.predict(x_test_norm)

In [35]:
# Reset the indices so the test rows line up with the prediction columns.
x_test = pd.DataFrame(x_test).reset_index(drop=True)

y_test = pd.DataFrame(y_test).reset_index(drop=True)

# One single-column DataFrame per model's predictions.
gnb_col = pd.DataFrame(gnb_pred.astype(int), columns=["gnb_prediction"])
knn_col = pd.DataFrame(knn_pred.astype(int), columns=["knn_prediction"])
dtc_col = pd.DataFrame(dtc_pred.astype(int), columns=["dtc_prediction"])

# Place features, true target, and predictions side by side.
combined_data = pd.concat([x_test, y_test, gnb_col, knn_col, dtc_col], axis=1)

In [36]:
combined_data.head()

Unnamed: 0,poorest_20perc,richest_20perc,diff,country_encoded,country_Albania,country_Algeria,country_Argentina,country_Bangladesh,country_Belize,country_Bhutan,...,country_Trinidad and Tobago,country_Tunisia,country_Turkmenistan,country_Uzbekistan,country_Viet Nam,country_Yemen,total,gnb_prediction,knn_prediction,dtc_prediction
0,13,57,44,20,0,0,0,0,0,0,...,0,0,0,0,0,0,33,11,21,26
1,7,59,52,23,0,0,0,0,0,0,...,0,0,0,0,0,0,26,11,21,26
2,2,28,26,8,0,0,0,0,0,0,...,0,0,0,0,0,0,10,11,6,6
3,34,73,39,14,0,0,0,0,0,0,...,0,0,0,0,0,0,55,11,51,68
4,48,87,39,21,0,0,0,0,0,0,...,0,0,0,0,0,0,73,11,51,68
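The notebook imports `accuracy_score`, but the comparison here is visual. As a small sketch, the snippet below scores only the five rows shown in the preview above, using exact-match accuracy and mean absolute error computed in plain Python. Five rows are far too few to rank the models, and the helper names are illustrative.

```python
# Values copied from the five preview rows above.
total = [33, 26, 10, 55, 73]
predictions = {
    "gnb": [11, 11, 11, 11, 11],
    "knn": [21, 21, 6, 51, 51],
    "dtc": [26, 26, 6, 68, 68],
}


def exact_match_accuracy(y_true, y_pred):
    """Fraction of rows where the prediction equals the true value."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between prediction and true value."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


for name, pred in predictions.items():
    acc = exact_match_accuracy(total, pred)
    mae = mean_absolute_error(total, pred)
    print(f"{name}: accuracy={acc:.2f}, MAE={mae:.1f}")
```

On these rows exact-match accuracy is at or near zero for every model, even where predictions are numerically close. That is one reason an error-based metric suits a numeric target like `total`.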


#**TensorFlow Classification (Extras)**

In [37]:
import tensorflow as tf
from tensorflow import keras

In [38]:
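model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(x_train.shape[1],)),
    keras.layers.Dense(32, activation='relu'),
    # NOTE: softmax over a single unit always outputs 1.0, so this layer
    # cannot separate classes as written.
    keras.layers.Dense(1, activation='softmax')
])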

In [39]:
# NOTE: binary_crossentropy assumes 0/1 targets, but `total` holds many integer values.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [40]:
model.fit(x_train_norm, y_train, epochs=10, batch_size=32, validation_data=(x_test_norm, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7d512d6219c0>

In [41]:
tf_pred_prob = model.predict(x_test_norm)

# NOTE: tf_pred_prob has shape (n_samples, 1), so argmax over axis=1 is always 0.
tf_pred = np.argmax(tf_pred_prob, axis=1)

tf_col = pd.DataFrame(tf_pred, columns=["tf_prediction"])

final_data = pd.concat([combined_data, tf_col], axis=1)



In [42]:
final_data

Unnamed: 0,poorest_20perc,richest_20perc,diff,country_encoded,country_Albania,country_Algeria,country_Argentina,country_Bangladesh,country_Belize,country_Bhutan,...,country_Tunisia,country_Turkmenistan,country_Uzbekistan,country_Viet Nam,country_Yemen,total,gnb_prediction,knn_prediction,dtc_prediction,tf_prediction
0,13,57,44,20,0,0,0,0,0,0,...,0,0,0,0,0,33,11,21,26,0
1,7,59,52,23,0,0,0,0,0,0,...,0,0,0,0,0,26,11,21,26,0
2,2,28,26,8,0,0,0,0,0,0,...,0,0,0,0,0,10,11,6,6,0
3,34,73,39,14,0,0,0,0,0,0,...,0,0,0,0,0,55,11,51,68,0
4,48,87,39,21,0,0,0,0,0,0,...,0,0,0,0,0,73,11,51,68,0
5,25,76,51,12,0,0,0,0,0,0,...,0,0,0,0,0,47,11,35,61,0
6,3,40,37,32,0,0,0,0,0,0,...,1,0,0,0,0,18,11,5,11,0
7,30,66,36,33,0,0,0,0,0,0,...,0,1,0,0,0,48,11,21,26,0
8,23,73,49,4,0,0,0,0,1,0,...,0,0,0,0,0,44,11,32,61,0
9,12,53,41,27,0,0,0,0,0,0,...,0,0,0,0,0,30,11,21,27,0
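The all-zero `tf_prediction` column above follows from the model's output shape. A softmax normalizes across the units of a layer, so with a single output unit every prediction is exactly 1.0, and `np.argmax(..., axis=1)` over one column can only return 0. A NumPy sketch of that behavior, with made-up logits standing in for the network's raw outputs:

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical raw outputs of a Dense(1) layer for four samples.
single_unit_logits = np.array([[-3.2], [0.0], [1.7], [42.0]])

probs = softmax(single_unit_logits)
labels = np.argmax(probs, axis=1)

print(probs.ravel())  # every entry is 1.0
print(labels)         # every entry is 0
```

For a numeric target, one common alternative (not what the notebook does) is a single linear output unit trained with a regression loss such as mean squared error. For classification, the output layer would instead need one unit per class.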
