# Final Project: Solving an Edutech Company's Problems

- Name: Panca Adnan Andrian
- Email: acari.panca21@gmail.com
- Dicoding ID: izrael

## Preparation

#### Questions:
1. **What percentage of students at Jaya Jaya Institut are at risk of dropping out, broken down by age category and previous qualification?**
2. **How does the graduation rate compare between students who receive a scholarship and those who do not?**
3. **Is there a relationship between the number of curricular units taken in the first semester and the likelihood that a student completes their education on time?**

### Preparing the required libraries

In [39]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import joblib

### Preparing the data to be used

## Data Understanding

In [40]:
# Load the data from CSV
url = "https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/main/students_performance/data.csv"
data = pd.read_csv(url, delimiter=";")
df = data



In [41]:
data.head()

| | Marital_status | Application_mode | Application_order | Course | Daytime_evening_attendance | Previous_qualification | Previous_qualification_grade | Nacionality | Mothers_qualification | Fathers_qualification | ... | Curricular_units_2nd_sem_credited | Curricular_units_2nd_sem_enrolled | Curricular_units_2nd_sem_evaluations | Curricular_units_2nd_sem_approved | Curricular_units_2nd_sem_grade | Curricular_units_2nd_sem_without_evaluations | Unemployment_rate | Inflation_rate | GDP | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
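
The three questions from the Preparation section could be explored with simple group-wise rates. The following is a minimal sketch under stated assumptions, not the notebook's actual analysis: the `toy` frame is made-up illustration data that only reuses the dataset's column names, the age bin edges are hypothetical, and `Status == "Graduate"` is used as a stand-in for "completing on time" because the dataset shown here has no explicit on-time column.

```python
import pandas as pd

# Toy rows for illustration only (NOT the real dataset); column names match the notebook's data.
toy = pd.DataFrame({
    "Age_at_enrollment": [18, 19, 22, 25, 31, 40, 20, 28],
    "Previous_qualification": [1, 1, 1, 39, 39, 1, 1, 39],
    "Scholarship_holder": [1, 0, 0, 0, 1, 0, 1, 0],
    "Curricular_units_1st_sem_enrolled": [6, 6, 5, 0, 6, 5, 6, 0],
    "Status": ["Graduate", "Dropout", "Graduate", "Dropout",
               "Graduate", "Dropout", "Enrolled", "Dropout"],
})

# Question 1: dropout share (%) per age group and per previous qualification.
# The bin edges below are hypothetical choices, not taken from the notebook.
toy["Age_group"] = pd.cut(toy["Age_at_enrollment"], bins=[0, 20, 30, 100],
                          labels=["<=20", "21-30", ">30"])
is_dropout = toy["Status"].eq("Dropout")
dropout_by_age = is_dropout.groupby(toy["Age_group"], observed=True).mean() * 100
dropout_by_qual = is_dropout.groupby(toy["Previous_qualification"]).mean() * 100

# Question 2: graduation rate (%) for scholarship holders (1) vs. non-holders (0).
is_graduate = toy["Status"].eq("Graduate")
grad_by_scholarship = is_graduate.groupby(toy["Scholarship_holder"]).mean() * 100

# Question 3: status distribution (%) per number of enrolled 1st-semester units.
status_by_units = pd.crosstab(toy["Curricular_units_1st_sem_enrolled"],
                              toy["Status"], normalize="index") * 100

print(dropout_by_age, dropout_by_qual, grad_by_scholarship, status_by_units, sep="\n\n")
```

To run this on the real data, replace `toy` with the loaded `data` frame; the resulting percentages will of course differ from the toy values.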


## Data Preparation / Preprocessing

In [42]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Marital_status                                4424 non-null   int64  
 1   Application_mode                              4424 non-null   int64  
 2   Application_order                             4424 non-null   int64  
 3   Course                                        4424 non-null   int64  
 4   Daytime_evening_attendance                    4424 non-null   int64  
 5   Previous_qualification                        4424 non-null   int64  
 6   Previous_qualification_grade                  4424 non-null   float64
 7   Nacionality                                   4424 non-null   int64  
 8   Mothers_qualification                         4424 non-null   int64  
 9   Fathers_qualification                         4424 non-null   i

In [43]:
# Check only the columns that contain null values
null_columns = data.isnull().sum()
null_columns = null_columns[null_columns > 0]  # Keep only columns with at least one null

print(null_columns)


Series([], dtype: int64)


In [44]:
# Mapping course codes to their respective abbreviations
course_abbreviations = {
    33: 'Biofuel Tech',
    171: 'Animation',
    8014: 'Soc. Serv. (Eve)',
    9003: 'Agronomy',
    9070: 'Comm. Design',
    9085: 'Vet Nursing',
    9119: 'Informatics Eng.',
    9130: 'Equinculture',
    9147: 'Mgmt',
    9238: 'Social Service',
    9254: 'Tourism',
    9500: 'Nursing',
    9556: 'Oral Hygiene',
    9670: 'Ad & Mkt Mgmt',
    9773: 'Journ. & Comm.',
    9853: 'Basic Ed.',
    9991: 'Mgmt (Eve)'
}
# Replace course codes with their abbreviated names
data['Course_name'] = data['Course'].map(course_abbreviations)
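
One thing worth checking after a dictionary-based `map` is whether any code was left without a name, since `Series.map` returns `NaN` for keys missing from the dictionary. The sketch below illustrates that check on a made-up frame; the code `1234` and the shortened mapping are hypothetical and exist only to trigger the unmapped case.

```python
import pandas as pd

# Hypothetical subset of the mapping, plus one code (1234) that is deliberately missing.
course_abbreviations = {171: "Animation", 9254: "Tourism", 9500: "Nursing"}
toy = pd.DataFrame({"Course": [171, 9254, 1234, 9500]})

toy["Course_name"] = toy["Course"].map(course_abbreviations)

# Codes without a mapping show up as NaN in Course_name.
unmapped_codes = sorted(toy.loc[toy["Course_name"].isna(), "Course"].unique().tolist())
print(unmapped_codes)  # [1234]
```

An empty list on the real data would indicate that every course code in the dataset is covered by the mapping.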


## Modeling

In [45]:
# Select the relevant features
features = [
    'Age_at_enrollment',
    'Scholarship_holder',
    'Curricular_units_1st_sem_approved',
    'Curricular_units_1st_sem_credited',
    'Curricular_units_1st_sem_enrolled', 
    'Previous_qualification_grade',
    'Tuition_fees_up_to_date', 
    'Gender'
]

# Separate the features and the label
X = data[features]
y = data['Status']  # Assumes 'Status' is the target column

In [46]:

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
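
The three `Status` classes are not equally frequent (the test set in the evaluation holds 316 Dropout, 151 Enrolled, and 418 Graduate rows), so one option is a stratified split, which keeps class proportions equal across the training and test sets. This is a sketch of that alternative on toy labels, not what the notebook does; applying it to the real data would change the reported metrics.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels for illustration only (not the real data).
y_toy = pd.Series(["Graduate"] * 50 + ["Dropout"] * 30 + ["Enrolled"] * 20)
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# stratify=y_toy keeps each class's share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_te.value_counts().to_dict())
```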

In [47]:

# Create and train the RandomForestClassifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
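
A fitted random forest exposes `feature_importances_`, which can hint at which of the selected features drive the predictions. The sketch below shows the pattern on synthetic data where only the hypothetical `signal` column determines the label; with the notebook's model, the same lines would use `model` and `X_train.columns` instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration: the label depends only on "signal".
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_toy = (X_toy["signal"] > 0).astype(int)

toy_model = RandomForestClassifier(n_estimators=100, random_state=42)
toy_model.fit(X_toy, y_toy)

# Importances are impurity-based and sum to 1 across features.
importances = pd.Series(toy_model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality numeric features, so they are best read as a rough ranking rather than a precise measurement.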

## Evaluation

In [48]:

# Display the evaluation results
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[230  37  49]
 [ 45  31  75]
 [ 21  45 352]]
              precision    recall  f1-score   support

     Dropout       0.78      0.73      0.75       316
    Enrolled       0.27      0.21      0.23       151
    Graduate       0.74      0.84      0.79       418

    accuracy                           0.69       885
   macro avg       0.60      0.59      0.59       885
weighted avg       0.67      0.69      0.68       885
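
The per-class figures in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes, in the order Dropout, Enrolled, Graduate. As a worked check, the sketch below recomputes precision, recall, and accuracy from the printed matrix. It also makes the weak spot visible: only 31 of the 151 Enrolled students are recognized, and 75 of them are predicted as Graduate.

```python
import numpy as np

# Confusion matrix as printed above; rows = true class, columns = predicted class.
labels = ["Dropout", "Enrolled", "Graduate"]
cm = np.array([[230, 37, 49],
               [45, 31, 75],
               [21, 45, 352]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true members of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.diagonal().sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<9} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```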



In [49]:
joblib.dump(model, 'model/random_forest_model.joblib')

['model/random_forest_model.joblib']
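
`joblib.dump` does not create missing folders, so the `model/` directory has to exist before the cell above runs. The following sketch shows one way to guard against that and to reload the saved model later. It uses a tiny stand-in model and a temporary directory so that it is self-contained; both are assumptions for illustration, and the notebook's real artifacts are `model` and `model/random_forest_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model for illustration; the notebook's real model is `model`.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_toy = np.array(["Dropout", "Dropout", "Graduate", "Graduate"] * 5)
toy_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# A temporary directory keeps the example self-contained; the notebook uses 'model/'.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "model")
    os.makedirs(model_dir, exist_ok=True)  # create the folder if it is missing
    model_path = os.path.join(model_dir, "random_forest_model.joblib")

    joblib.dump(toy_model, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original model's predictions.
same_predictions = bool((loaded_model.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same_predictions)
```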