# Classification of Academic Learning Outcomes of FTMM Universitas Airlangga Students (2020) Based on Gender, Self-Motivation, Life Goals, and Self-Perception

Group Members:
* Anisyaul Fitria 007
* Mohammad Sihabudin Al Qurtubi 043
* Qothrotunnidha' Almaulidiyah 093

Dataset link: https://docs.google.com/spreadsheets/d/1zCu6VuM6ZN2WSGgQfvigywsdu_FBb6iCVK_gCBLzbew/edit?usp=sharing

# Library

In [1]:
import pandas as pd  # data structures and data manipulation
import numpy as np  # scientific computing
import seaborn as sns  # plotting
import matplotlib.pyplot as plt  # plotting
from sklearn.preprocessing import StandardScaler  # data transformation
from sklearn.model_selection import train_test_split, GridSearchCV  # data splitting and parameter search
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, precision_score, recall_score  # evaluation metrics
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.naive_bayes import GaussianNB  # naive Bayes model
from sklearn.tree import DecisionTreeClassifier  # decision tree model

import warnings
warnings.filterwarnings('ignore')

In [23]:
# Import data
url = 'https://raw.githubusercontent.com/Alqurtubi17/uas_datmin/main/DatasetKelompok12%20-%20Sheet1.csv'
prep = pd.read_csv(url)
prep

# Preprocessing

In [3]:
cleanup_nums = {"Hasil belajar akademik semester sebelumnya  (IPS). (0) IPS < 3.0 (1) 3.0 ≤ IPS ≤ 3.5 (2) IPS >  3.5": {"(2) IPS >  3.5": "Sangat Baik", "(1) 3.0 ≤ IPS ≤ 3.5": "Baik", "(0) IPS < 3.0": "Cukup Baik"},
               "Gender": {"Laki-laki": 1, "Perempuan": 0}}

In [4]:
# apply the manual label mapping
df1 = prep.replace(cleanup_nums)

In [5]:
# drop the timestamp ('Cap waktu') and study program ('Prodi') columns
df1 = df1.drop(['Cap waktu', 'Prodi'], axis=1)

In [6]:
# rename the remaining columns
df1.columns = ["Gender", "IPS", "M1", "M2", "M3", "M4", "M5", 
                "PP1", "PP2", "PP3", "PP4", "PP5", "SE1", "SE2", "SE3", "SE4"]

In [7]:
# aggregate the item scores into one total score per indicator
df1['M_Total'] = df1['M1'] + df1['M2'] + df1['M3'] + df1['M4'] + df1['M5']
df1['PP_Total'] = df1['PP1'] + df1['PP2'] + df1['PP3'] + df1['PP4'] + df1['PP5']
df1['SE_Total'] = df1['SE1'] + df1['SE2'] + df1['SE3'] + df1['SE4']

In [8]:
# label (discretize) the total score based on prior research
conditions = [
    (df1['M_Total'] >= 1) & (df1['M_Total'] <= 5),
    (df1['M_Total'] >= 6) & (df1['M_Total'] <= 10),
    (df1['M_Total'] >= 11) & (df1['M_Total'] <= 15),
    (df1['M_Total'] >= 16) & (df1['M_Total'] <= 20),
    (df1['M_Total'] >= 21) & (df1['M_Total'] <= 25)
]

values = ['1', '2', '3', '4', '5']

# string default keeps the fallback label the same dtype as the category labels
df1['M_Total'] = np.select(conditions, values, default='0')

In [9]:
# label (discretize) the total score based on prior research
conditions2 = [
    (df1['PP_Total'] >= 1) & (df1['PP_Total'] <= 5),
    (df1['PP_Total'] >= 6) & (df1['PP_Total'] <= 10),
    (df1['PP_Total'] >= 11) & (df1['PP_Total'] <= 15),
    (df1['PP_Total'] >= 16) & (df1['PP_Total'] <= 20),
    (df1['PP_Total'] >= 21) & (df1['PP_Total'] <= 25)
]

# string default keeps the fallback label the same dtype as the category labels
df1['PP_Total'] = np.select(conditions2, values, default='0')

Interpretation:

All independent attributes are discretized. Respondents' total scores are categorized with the following criteria: for the motivation and personal life goal indicators, the lowest possible total is 5 and the highest is 25, with 5 perception categories, so the interval is (25 - 5)/5 = 4. The intervals used for the perception categories are those in the code above.
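As a minimal sketch of the same discretization, the ranges written in the conditions above (1-5, 6-10, 11-15, 16-20, 21-25) can also be expressed with `pd.cut`. The `totals` series below is made-up illustration data, not the survey responses, and `bin_edges` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical total scores for a 5-item indicator (not the real survey data).
totals = pd.Series([5, 10, 11, 15, 16, 20, 21, 25], name="M_Total")

# Right-closed bins (0,5], (5,10], (10,15], (15,20], (20,25] match the
# conditions 1-5, 6-10, 11-15, 16-20, 21-25 for integer totals.
bin_edges = [0, 5, 10, 15, 20, 25]
labels = ["1", "2", "3", "4", "5"]
categories = pd.cut(totals, bins=bin_edges, labels=labels)

# The same mapping written with np.select, in the style of the cells above.
conditions = [(totals >= lo + 1) & (totals <= lo + 5) for lo in bin_edges[:-1]]
via_select = np.select(conditions, labels, default="0")

print(categories.astype(str).tolist())
print(via_select.tolist())
```

Both approaches assign the same label to every score here; `pd.cut` only avoids writing each range by hand.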

In [10]:
# label (discretize) the total score based on prior research
conditions3 = [
    (df1['SE_Total'] >= 1) & (df1['SE_Total'] <= 5),
    (df1['SE_Total'] >= 6) & (df1['SE_Total'] <= 10),
    (df1['SE_Total'] >= 11) & (df1['SE_Total'] <= 15),
    (df1['SE_Total'] >= 16) & (df1['SE_Total'] <= 20)
]
values = ['1', '2', '3', '4']
# string default keeps the fallback label the same dtype as the category labels
df1['SE_Total'] = np.select(conditions3, values, default='0')

Interpretation:

For the self-perception indicator, the lowest possible total is 4 and the highest is 20, with 4 perception categories, so the interval is (20 - 4)/4 = 4. The intervals used for the perception categories are those in the code above.
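The interval arithmetic stated in the interpretations can be checked with a small helper. This is only a sketch of the formula (highest score minus lowest score, divided by the number of categories); the function name `equal_width_interval` is hypothetical and not part of the notebook.

```python
def equal_width_interval(min_score: int, max_score: int, n_categories: int) -> float:
    """Interval width = (highest score - lowest score) / number of categories."""
    if n_categories <= 0:
        raise ValueError("n_categories must be positive")
    return (max_score - min_score) / n_categories


# Motivation and personal life goals: totals range from 5 to 25, 5 categories.
interval_m_pp = equal_width_interval(5, 25, 5)
# Self-perception: totals range from 4 to 20, 4 categories.
interval_se = equal_width_interval(4, 20, 4)

print(interval_m_pp, interval_se)
```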

In [11]:
# select the attributes used for modelling
df = df1[['IPS', 'Gender', 'M_Total', 'PP_Total', 'SE_Total']]

# Modelling

In [12]:
# separate the independent variables (x) from the dependent variable (y)
x = df.drop('IPS', axis = 1)
y = df['IPS']

In [13]:
# split the dataset into training and test sets with a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
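To illustrate what a 70:30 split does, here is a small sketch on made-up data. The `toy` frame and its column names are hypothetical, and the `stratify` argument is an optional addition that the notebook's own split does not use; it keeps the class proportions similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, a 3-class target.
toy = pd.DataFrame({
    "feature": range(10),
    "target": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"],
})
X_toy = toy.drop("target", axis=1)
y_toy = toy["target"]

# test_size=0.3 reserves 30% of the rows for testing.
# stratify=y_toy is an assumption added here, not taken from the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
```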

In [14]:
# create a dataframe to store the evaluation results
model_performance = pd.DataFrame(columns=['Accuracy','Sensitivity', 'Precision','F1-Score'])

In [15]:
# build a decision tree model with max_depth = 4
modeltree = DecisionTreeClassifier(max_depth=4).fit(X_train,y_train)
dt_y_testpred = modeltree.predict(X_test)

print('Training set score: {:.2f}'.format(modeltree.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(modeltree.score(X_test, y_test)))

# confusion matrix
cnf_matrix = confusion_matrix(y_test, dt_y_testpred)
print("Confusion Matrix for Decision Tree Classifier:")
print(cnf_matrix)

# classification report with the per-class measurements
print("\nClassification Report for Decision Tree Classifier:\n%s"%classification_report(y_test,dt_y_testpred))

In [16]:
# per-class false positives, false negatives, true positives and true negatives
FP = (cnf_matrix.sum(axis=0) - np.diag(cnf_matrix)).astype(float)
FN = (cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)).astype(float)
TP = (np.diag(cnf_matrix)).astype(float)
TN = (cnf_matrix.sum() - (FP + FN + TP)).astype(float)

# overall accuracy and support-weighted averages of the per-class metrics
accuracy = accuracy_score(y_test, dt_y_testpred)
sensitivity = recall_score(y_test, dt_y_testpred, average='weighted')
precision = precision_score(y_test, dt_y_testpred, average='weighted')
f1s = f1_score(y_test, dt_y_testpred, average='weighted')

model_performance.loc['Decision Tree'] = [accuracy, sensitivity, precision, f1s]
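The per-class counts computed above (FP, FN, TP, TN) are the ingredients of the per-class metrics. The sketch below, on a made-up 3-class confusion matrix rather than the notebook's results, shows how sensitivity, specificity and precision follow from them, and that the support-weighted sensitivity equals the overall accuracy.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm = np.array([[5, 2, 0],
               [1, 6, 1],
               [0, 2, 3]])

TP = np.diag(cm).astype(float)
FP = cm.sum(axis=0) - TP          # predicted as the class but actually another
FN = cm.sum(axis=1) - TP          # belongs to the class but predicted as another
TN = cm.sum() - (FP + FN + TP)

sensitivity = TP / (TP + FN)      # per-class recall
specificity = TN / (TN + FP)
precision = TP / (TP + FP)

# Weighting per-class sensitivity by class support gives the overall accuracy.
support = cm.sum(axis=1)
weighted_recall = np.average(sensitivity, weights=support)
accuracy = TP.sum() / cm.sum()

print(sensitivity, specificity, precision)
print(weighted_recall, accuracy)
```

This is why a weighted `recall_score` and `accuracy_score` report the same value for a given set of predictions.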

In [17]:
# build a Gaussian naive Bayes model
modelngb = GaussianNB()
modelngb.fit(X_train, y_train)
nb_y_testpred = modelngb.predict(X_test)

print('Training set score: {:.2f}'.format(modelngb.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(modelngb.score(X_test, y_test)))

# confusion matrix
cnf_matrix2 = confusion_matrix(y_test, nb_y_testpred)
print("Confusion Matrix for Gaussian Naive Bayes Classifier:")
print(cnf_matrix2)

# classification report with the per-class measurements
nb_accuracy_score = accuracy_score(y_test, nb_y_testpred)
print("\nClassification Report for Gaussian Naive Bayes Classifier:\n%s"%classification_report(y_test, nb_y_testpred))

In [18]:
# per-class false positives, false negatives, true positives and true negatives
FP = (cnf_matrix2.sum(axis=0) - np.diag(cnf_matrix2)).astype(float)
FN = (cnf_matrix2.sum(axis=1) - np.diag(cnf_matrix2)).astype(float)
TP = (np.diag(cnf_matrix2)).astype(float)
TN = (cnf_matrix2.sum() - (FP + FN + TP)).astype(float)

# overall accuracy and support-weighted averages of the per-class metrics
accuracy = accuracy_score(y_test, nb_y_testpred)
sensitivity = recall_score(y_test, nb_y_testpred, average='weighted')
precision = precision_score(y_test, nb_y_testpred, average='weighted')
f1s = f1_score(y_test, nb_y_testpred, average='weighted')

model_performance.loc['Gaussian Naive Bayes'] = [accuracy, sensitivity, precision, f1s]

# Evaluation

In [19]:
model_performance.style.background_gradient(cmap='RdYlBu_r').format({'Accuracy': '{:.2%}',
                                                                    'Sensitivity': '{:.2%}',
                                                                    'Precision': '{:.2%}',
                                                                    'F1-Score': '{:.2%}',
                                                                    })

Interpretation:

In terms of accuracy, naive Bayes performs better than the decision tree, and the same holds for sensitivity and precision.