# 🎓 Predicting College Majors from Student Data
This project predicts a student's college major from academic grades and stated interests using machine learning.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
df = pd.read_csv("dataset_prediksi_jurusan.csv")
df.head()

In [None]:
# EDA
df.info()
df.describe()

In [None]:
# Visualize the grade distributions
# DataFrame.hist creates its own figure, so pass figsize here instead of calling plt.figure()
df[['nilai_matematika', 'nilai_bahasa_inggris', 'nilai_ipa', 'nilai_ips']].hist(bins=20, figsize=(10, 6))
plt.tight_layout()
plt.show()

# Distribution of the target (major)
sns.countplot(data=df, x="jurusan_output")
plt.title("Distribution of Target Majors")
plt.show()
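
The count plot shows class balance visually; the same information can be read off numerically. The following is a minimal sketch on a made-up target column (the major names are illustrative assumptions, not values from the dataset) that computes each class's share and a simple imbalance ratio.

```python
import pandas as pd

# Hypothetical target values standing in for df["jurusan_output"]
target = pd.Series(
    ["Teknik", "Teknik", "Ekonomi", "Sastra", "Teknik", "Ekonomi"],
    name="jurusan_output",
)

counts = target.value_counts()                      # absolute count per major
class_share = target.value_counts(normalize=True)   # fraction per major

# Ratio of the largest class to the smallest; values far above 1 suggest imbalance
imbalance_ratio = counts.max() / counts.min()

print(class_share)
print("Imbalance ratio:", imbalance_ratio)
```

A large ratio would be a reason to look at per-class precision and recall rather than accuracy alone.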

In [None]:
# Preprocessing
label_encoders = {}
categorical_columns = ["minat_ipa", "minat_ips", "minat_bahasa", "ekonomi_keluarga", "tipe_sekolah", "jurusan_output"]

for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Separate features and target
X = df.drop("jurusan_output", axis=1)
y = df["jurusan_output"]

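# Split first so the scaler only ever sees training data (avoids test-set leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize: fit on the training set, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)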
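
Label-encoding nominal features such as `tipe_sekolah` gives them an arbitrary integer order, which KNN and logistic regression may read as a ranking. One possible alternative, shown here only as a sketch, is to one-hot encode nominal columns and standardize numeric ones with a `ColumnTransformer`. The column names follow the notebook, but the sample values (`"negeri"`, `"swasta"`) are assumptions, since the dataset's actual categories are not shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample; values are illustrative only
sample = pd.DataFrame({
    "nilai_matematika": [80.0, 65.0, 90.0, 70.0],
    "nilai_ipa": [75.0, 60.0, 95.0, 85.0],
    "tipe_sekolah": ["negeri", "swasta", "negeri", "swasta"],
})

numeric_cols = ["nilai_matematika", "nilai_ipa"]
nominal_cols = ["tipe_sekolah"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                         # z-score the grades
    ("cat", OneHotEncoder(handle_unknown="ignore"), nominal_cols),   # one column per category
])

features = preprocess.fit_transform(sample)
print(features.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Ordinal columns (possibly `ekonomi_keluarga`, if its levels are ordered) could instead keep an explicit integer mapping.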

In [None]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Logistic Regression:\n", classification_report(y_test, y_pred_lr))

In [None]:
# Random Forest
rf = RandomForestClassifier(random_state=42)  # fixed seed so results are reproducible
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest:\n", classification_report(y_test, y_pred_rf))
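
A fitted random forest exposes `feature_importances_`, which can hint at which inputs drive the predicted major. The sketch below uses synthetic data from `make_classification` as a stand-in for the student dataset; only the four grade column names are taken from the notebook, and the resulting numbers say nothing about the real data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["nilai_matematika", "nilai_bahasa_inggris", "nilai_ipa", "nilai_ips"]

# Synthetic 3-class problem with 4 features (stand-in data, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook itself, the equivalent would be `pd.Series(rf.feature_importances_, index=X.columns)`.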

In [None]:
# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("KNN:\n", classification_report(y_test, y_pred_knn))
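
The three reports above come from a single 80/20 split, so small differences between models may be noise. A hedged sketch of a more stable comparison uses 5-fold cross-validation with the scaler inside a pipeline. The data here is synthetic (`make_classification`), and macro-F1 is an assumed choice of metric rather than one stated in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student data (3 classes, 8 features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

cv_scores = {}
for name, model in models.items():
    # The scaler sits inside the pipeline, so it is re-fit on each training fold
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1_macro")
    cv_scores[name] = scores.mean()
    print(f"{name}: macro-F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```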

In [None]:
# Confusion Matrix
def plot_confusion(y_true, y_pred, model_name):
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

plot_confusion(y_test, y_pred_rf, "Random Forest")
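
Raw counts in a confusion matrix are hard to compare when majors have different numbers of students. As a small sketch, `confusion_matrix` accepts `normalize="true"` to divide each row by its total, so the diagonal becomes per-class recall. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up encoded labels for three classes
y_true_demo = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2, 0, 2]

# Each row sums to 1; entry (i, j) is the share of class i predicted as class j
cm_norm = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")
per_class_recall = cm_norm.diagonal()

print(cm_norm)
print(per_class_recall)  # [0.5, 1.0, 0.75]
```

The normalized matrix can be passed to `sns.heatmap` with `fmt='.2f'` in place of `fmt='d'`.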

In [None]:
# Save the model and the scaler
import joblib
joblib.dump(rf, "model_prediksi_jurusan.pkl")
joblib.dump(scaler, "scaler.pkl")
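
To use the saved artifacts later, the same scaler must be applied before predicting, and the target's label encoder is needed to turn the integer prediction back into a major name. The sketch below is self-contained: it trains a toy model on random data, saves it to a temporary directory, and reloads it. The file names, major names, and feature count are assumptions for illustration, and saving the encoder alongside the model is a suggestion rather than something the notebook does.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 4))                     # made-up feature matrix
majors = np.array(["Ekonomi", "Sastra", "Teknik"])    # hypothetical major names
y_names = majors[rng.integers(0, 3, size=60)]

target_encoder = LabelEncoder().fit(y_names)
scaler_demo = StandardScaler().fit(X_demo)
model = RandomForestClassifier(random_state=42).fit(
    scaler_demo.transform(X_demo), target_encoder.transform(y_names)
)

with tempfile.TemporaryDirectory() as tmp:
    # Persist all three artifacts, then load them back as a later session would
    joblib.dump(model, os.path.join(tmp, "model.pkl"))
    joblib.dump(scaler_demo, os.path.join(tmp, "scaler.pkl"))
    joblib.dump(target_encoder, os.path.join(tmp, "target_encoder.pkl"))

    loaded_model = joblib.load(os.path.join(tmp, "model.pkl"))
    loaded_scaler = joblib.load(os.path.join(tmp, "scaler.pkl"))
    loaded_encoder = joblib.load(os.path.join(tmp, "target_encoder.pkl"))

# Predict for one new (made-up) student: scale, predict, decode
new_student = rng.normal(size=(1, 4))
pred = loaded_model.predict(loaded_scaler.transform(new_student))
predicted_major = loaded_encoder.inverse_transform(pred)[0]
print(predicted_major)
```

In the notebook, the target encoder is available as `label_encoders["jurusan_output"]`.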