# Final Project: Solving an Edutech Company's Problems

- Name: Herly Riyanto Hidayat
- Email: herlynjjd@gmail.com
- Dicoding ID: heryryanth

## Preparation

### Preparing the required libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
import joblib # Save the model

### Preparing the data to be used

## Data Understanding

In [None]:
# URL to the raw dataset
url = "https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/main/students_performance/data.csv"

# Load the dataset
df = pd.read_csv(url, sep=';')

# Display the first five rows
df.head()

### Check missing values

In [None]:
df.isnull().sum()

### Check duplicated values

In [None]:
df.duplicated().sum()

### Check the variety of data types

In [None]:
df.info()

**Insight:**
1. The data has no missing values and no duplicated rows.
2. By data type, there are 7 float columns, 29 int columns, and 1 object column.
3. The single object column, `Status`, is the target; the other 36 columns are features.
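
The checks behind these insights can be gathered into one summary. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real dataset; the helper name `summarize_frame` is hypothetical and is not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, not taken from data.csv)
toy = pd.DataFrame({
    "Admission_grade": [127.3, 142.5, 119.0],
    "Debtor": [0, 1, 0],
    "Status": ["Graduate", "Dropout", "Enrolled"],
})

def summarize_frame(frame, target="Status"):
    """Collect missing values, duplicated rows, dtype counts, and the feature list."""
    return {
        "missing": int(frame.isnull().sum().sum()),        # total missing cells
        "duplicates": int(frame.duplicated().sum()),       # fully duplicated rows
        "dtype_counts": frame.dtypes.astype(str).value_counts().to_dict(),
        "features": [c for c in frame.columns if c != target],
    }

summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(df)` on the loaded dataset should reproduce the counts listed above, provided the `Status` column is present.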

### Exploratory Data Analysis

#### Univariate Analysis on Status column

In [None]:
# Count the occurrences of each category in the target variable
status_counts = df['Status'].value_counts()
print(status_counts)

print("\nPercentage Distribution:")
print((status_counts / len(df) * 100).round(2))

# Plot class distribution
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Status', hue='Status')
plt.title('Distribution of Student Outcomes')
plt.xlabel('Status')
plt.ylabel('Count')
plt.show()

#### Bivariate Analysis: Effect of Admission_grade on Status

In [None]:
plt.figure(figsize=(10, 4))
sns.boxplot(x='Status', y='Admission_grade', data=df, hue='Status')
plt.title('Admission Grade by Student Status')
plt.ylabel('Admission Grade')
plt.xlabel('Status')
plt.show()

#### Bivariate Analysis: Effect of First Semester Grade on Status

In [None]:
plt.figure(figsize=(10, 4))
sns.boxplot(x='Status', y='Curricular_units_1st_sem_grade', data=df, hue='Status')
plt.title('First Semester Grade by Student Status')
plt.ylabel('1st Semester Grade')
plt.xlabel('Status')
plt.show()
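
A boxplot shows the distributions visually; a per-status numeric summary can make the comparison easier to quote. The sketch below is illustrative only: the `toy` frame and its grades are made up, and only the column names come from the dataset.

```python
import pandas as pd

# Made-up grades for illustration; column names follow the dataset
toy = pd.DataFrame({
    "Status": ["Graduate", "Graduate", "Dropout", "Dropout", "Enrolled"],
    "Curricular_units_1st_sem_grade": [13.0, 14.0, 0.0, 10.0, 12.0],
})

# Median, mean, and group size of the first semester grade for each status
grade_summary = (
    toy.groupby("Status")["Curricular_units_1st_sem_grade"]
       .agg(["median", "mean", "count"])
)
print(grade_summary)
```

The same `groupby` call can be pointed at `df` and at `Admission_grade` to compare the groups in the real data.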

#### Bivariate Analysis: Effect of Scholarship on Status

In [None]:
plt.figure(figsize=(10, 4))
sns.countplot(x='Scholarship_holder', hue='Status', data=df, palette='Set1')
plt.title('Scholarship Ownership vs Status')
plt.ylabel('Count')
plt.xlabel('Scholarship Ownership')
plt.show()

#### Bivariate Analysis: Effect of Tuition Fee on Status

In [None]:
plt.figure(figsize=(10, 4))
sns.countplot(x='Tuition_fees_up_to_date', hue='Status', data=df, palette='Set2')
plt.title('Tuition Fee Status vs Status')
plt.ylabel('Count')
plt.xlabel('Tuition Fee Status')
plt.show()

#### Bivariate Analysis: Effect of Debtor on Status

In [None]:
plt.figure(figsize=(10, 4))
sns.countplot(x='Debtor', hue='Status', data=df, palette='Set3')
plt.title('Debt Status vs Status')
plt.ylabel('Count')
plt.xlabel('Debt Status')
plt.show()

#### Correlation Matrix

In [None]:
# Select only numerical columns
numeric_df = df.select_dtypes(include=['number'])

# Compute the full correlation matrix
corr_matrix = numeric_df.corr()

# Set a threshold for "strong" correlation
threshold = 0.6
strong_corr = corr_matrix[(abs(corr_matrix) >= threshold) & (abs(corr_matrix) < 1)]

# Visualize with a heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(strong_corr, annot=True, cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Heatmap of Strong Correlations (|corr| ≥ 0.6)')
plt.show()
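
A large heatmap can be hard to read, so it may help to list the strongly correlated pairs as a table as well. This is a minimal sketch under stated assumptions: `strong_pairs` is a hypothetical helper, and the `toy` frame with columns `a`, `b`, `c` is made-up data used only to show the output shape.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.6):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False here and are filtered out
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up data: b is nearly linear in a, c is only weakly related to both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.2, 7.9, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = strong_pairs(toy)
print(pairs)
```

Unlike the heatmap filter, this version keeps pairs with a correlation of exactly 1 and drops only the diagonal.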

**Insight:**
1. Students with a higher Admission Grade tend to be more likely to Graduate.
2. Students with a higher first semester grade tend to Graduate.
3. Students who hold a scholarship have a lower Dropout rate than students who do not (Scholarship).
4. Tuition fee payment status appears to be associated with a higher Dropout rate.
5. Debt status is related to whether students Graduate or Dropout.
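
Count plots compare raw counts, which can mislead when the groups differ in size. Row-normalized percentages are one way to check the rate-based insights above. The sketch below assumes made-up data in `toy`; `status_rate_by_group` is a hypothetical helper name, while the column names follow the dataset.

```python
import pandas as pd

# Made-up sample: 4 scholarship holders (1) and 4 non-holders (0)
toy = pd.DataFrame({
    "Scholarship_holder": [1, 1, 1, 1, 0, 0, 0, 0],
    "Status": ["Graduate", "Graduate", "Graduate", "Dropout",
               "Dropout", "Dropout", "Graduate", "Enrolled"],
})

def status_rate_by_group(frame, group_col, target="Status"):
    """Percentage of each status within every group (each row sums to 100)."""
    return pd.crosstab(frame[group_col], frame[target], normalize="index") * 100

rates = status_rate_by_group(toy, "Scholarship_holder")
print(rates.round(1))
```

The same helper can be applied to `df` with `Tuition_fees_up_to_date` or `Debtor` as `group_col` to compare dropout rates across those groups.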

## Data Preparation / Preprocessing

## Modeling

## Evaluation