## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Merging Dataframes

1. Student information files:
   - studentInfo.csv: student demographics and registration-related information.

2. Course information files:
   - courses.csv: information about each course (module).
   - assessments.csv: the assessments within each course (assignments, exams, and so on).

3. Student performance files:
   - studentAssessment.csv: students' assessment scores.
   - studentRegistration.csv: students' course registration records.

4. VLE activity files:
   - vle.csv: descriptive information about VLE activities.
   - studentVle.csv: students' VLE (Virtual Learning Environment) activity data.

---

In [5]:
from sklearn.metrics import classification_report

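def evaluate_score(y_true, y_pred):
    """Print per-class precision, recall, and F1 for the given predictions."""
    # Earlier single-number variant, kept for reference; it would also need
    # precision_score and recall_score imported from sklearn.metrics:
    # precision = precision_score(y_true, y_pred)
    # recall = recall_score(y_true, y_pred)
    # print('precision: {0:.6f}, recall: {1:.6f}'\
    #       .format(precision, recall))
    print(classification_report(y_true, y_pred))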
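
To make the columns of the report easier to read, here is a minimal, dependency-free sketch of how per-class precision, recall, and F1 follow from the counts of true positives, false positives, and false negatives. The helper `per_class_metrics` and the toy labels are hypothetical and only illustrate the definitions; they are not part of the notebook's pipeline.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1), treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels, not taken from the dataset.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

for label in (0, 1):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Each row of the report treats one class as the positive class in turn, which is why a model can show high recall for class 1 and near-zero recall for class 0 at the same time.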

In [6]:
df = pd.read_csv("merged_data_final.csv")
display(df)
display(df.info())

Unnamed: 0,gender,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,module_presentation_length,...,code_module_EEE,code_module_FFF,code_module_GGG,code_presentation_2013B,code_presentation_2013J,code_presentation_2014B,code_presentation_2014J,assessment_type_CMA,assessment_type_Exam,assessment_type_TMA
0,0,4699,9,2,0,240,0,0,-159.0,268,...,False,False,False,False,True,False,False,False,False,True
1,1,4699,2,1,0,60,0,0,-53.0,268,...,False,False,False,False,True,False,False,False,False,True
2,1,3630,5,1,0,60,0,0,-52.0,268,...,False,False,False,False,True,False,False,False,False,True
3,1,3576,5,0,0,60,0,0,-176.0,268,...,False,False,False,False,True,False,False,False,False,True
4,0,3630,8,1,0,60,0,0,-110.0,268,...,False,False,False,False,True,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168899,1,3630,2,0,0,30,0,0,-2.0,269,...,False,False,True,False,False,False,True,True,False,False
168900,1,3630,0,1,0,30,0,0,-10.0,269,...,False,False,True,False,False,False,True,True,False,False
168901,0,3630,0,0,0,60,0,0,-10.0,269,...,False,False,True,False,False,False,True,True,False,False
168902,1,3576,5,0,2,30,0,0,2.0,269,...,False,False,True,False,False,False,True,True,False,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168904 entries, 0 to 168903
Data columns (total 35 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   gender                      168904 non-null  int64  
 1   highest_education           168904 non-null  int64  
 2   imd_band                    168904 non-null  int64  
 3   age_band                    168904 non-null  int64  
 4   num_of_prev_attempts        168904 non-null  int64  
 5   studied_credits             168904 non-null  int64  
 6   disability                  168904 non-null  int64  
 7   final_result                168904 non-null  int64  
 8   date_registration           168904 non-null  float64
 9   module_presentation_length  168904 non-null  int64  
 10  my_average_score            168904 non-null  float64
 11  my_score_std                168904 non-null  float64
 12  my_score_trend              168904 non-null  int64  
 13  assessment_wei

None

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import load


# Undersampling: downsample class 0 to the size of class 1
df = pd.read_csv("merged_data_final.csv")
df_0 = df[df['final_result'] == 0]
df_1 = df[df['final_result'] == 1]
df_0_down = resample(df_0, replace=False, n_samples=len(df_1), random_state=42)
df_balanced = pd.concat([df_0_down, df_1])

# Separate features and labels
X = df_balanced.drop('final_result', axis=1)
y = df_balanced['final_result']
X = pd.get_dummies(X)
X = X.dropna()
y = y.loc[X.index]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits
standard_sc = StandardScaler()
X_train = standard_sc.fit_transform(X_train)
X_test = standard_sc.transform(X_test)
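
The undersampling step in this cell assumes class 0 is the larger class, since it draws `len(df_1)` rows from `df_0` without replacement. As a hedged illustration of the same idea without that assumption, the sketch below downsamples every class to the size of the smallest one using only the standard library. The function `undersample` and the toy rows are hypothetical and stand in for the DataFrame logic; they are not the notebook's implementation.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key, seed=42):
    """Downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        # sample() draws without replacement, mirroring resample(replace=False)
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical toy rows: 7 of class 0 and 3 of class 1.
rows = [{"final_result": 0, "score": i} for i in range(7)]
rows += [{"final_result": 1, "score": i} for i in range(3)]

balanced = undersample(rows, "final_result")
counts = Counter(row["final_result"] for row in balanced)
print(counts)
```

Fixing the seed keeps the selected subset reproducible across runs, which plays the same role as `random_state=42` in the cell above.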

In [8]:
from sklearn.model_selection import RandomizedSearchCV
from joblib import dump, load  # dump is only needed if the commented-out save loop below is re-enabled

# lr_clf =  LogisticRegression(max_iter=1000, solver='liblinear', penalty='l1', C=np.float64(5.0))
# svm_clf =  SVC(max_iter=1000, probability=True, kernel='poly', C=np.float64(2.1))
# knn_clf =  KNeighborsClassifier(n_neighbors=3, algorithm='auto')
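# xgb_clf =  XGBClassifier(tree_method='hist', device="cuda",
#                          n_estimators=500, objective="reg:squarederror",
#                          subsample=np.float64(0.7999999999999999), max_depth=8,
#                          learning_rate=np.float64(0.14))
# Note: "reg:squarederror" is a regression objective; "binary:logistic" is
# XGBoost's standard objective for binary classification.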

# lr_clf.fit(X_train, y_train)
# svm_clf.fit(X_train, y_train)
# knn_clf.fit(X_train, y_train)
# xgb_clf.fit(X_train, y_train)

file_names = ['lr_clf.joblib', 'svm_clf.joblib', 'knn_clf.joblib', 'xgb_clf.joblib']

# model_best_estimators = [lr_clf, svm_clf, knn_clf, xgb_clf]

# for idx, file_name in enumerate(file_names):
#     dump(model_best_estimators[idx], file_name)

# Load each tuned model from disk and report its test-set performance
best_models = []
for name in file_names:
    model = load(name)
    best_models.append(model)
    y_pred = model.predict(X_test)
    print(name)
    evaluate_score(y_test, y_pred)
    print()

lr_clf.joblib
              precision    recall  f1-score   support

           0       0.68      0.70      0.69      2391
           1       0.69      0.67      0.68      2391

    accuracy                           0.68      4782
   macro avg       0.68      0.68      0.68      4782
weighted avg       0.68      0.68      0.68      4782


svm_clf.joblib
              precision    recall  f1-score   support

           0       0.50      0.02      0.04      2391
           1       0.50      0.98      0.66      2391

    accuracy                           0.50      4782
   macro avg       0.50      0.50      0.35      4782
weighted avg       0.50      0.50      0.35      4782


knn_clf.joblib
              precision    recall  f1-score   support

           0       0.77      0.71      0.74      2391
           1       0.73      0.79      0.76      2391

    accuracy                           0.75      4782
   macro avg       0.75      0.75      0.75      4782
weighted avg       0.75     

configuration generated by an older version of XGBoost, please export the model by calling
`Booster.save_model` from that version first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html

for more details about differences between saving model and serializing.



In [9]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Undersampling: downsample class 0 to the size of class 1
df_0 = df[df['final_result'] == 0]
df_1 = df[df['final_result'] == 1]
df_0_down = resample(df_0, replace=False, n_samples=len(df_1), random_state=42)
df_balanced = pd.concat([df_0_down, df_1])

# Separate features and labels
X = df_balanced.drop('final_result', axis=1)
y = df_balanced['final_result']
X = pd.get_dummies(X)
X = X.dropna()
y = y.loc[X.index]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

standard_sc = StandardScaler()
X_train = standard_sc.fit_transform(X_train)
X_test = standard_sc.transform(X_test)

# Soft-voting ensemble: logistic regression, KNN, and XGBoost (the SVM is commented out)
voting_clf = VotingClassifier(
    estimators=[
        ('lr_clf', best_models[0]),
        # ('svm_clf', best_models[1]),
        ('knn_clf', best_models[2]),
        ('xgb_clf', best_models[3])
    ],
    voting='soft'
)
voting_clf.fit(X_train, y_train)

# Evaluation
y_pred_train = voting_clf.predict(X_train)
y_pred_test = voting_clf.predict(X_test)


print("✅ Training accuracy:", accuracy_score(y_train, y_pred_train))
print("✅ Test accuracy:", accuracy_score(y_test, y_pred_test))
print("\n📋 Classification report (test set):\n")
print(classification_report(y_test, y_pred_test))



✅ Training accuracy: 0.9911638607131653
✅ Test accuracy: 0.8598912588874947

📋 Classification report (test set):

              precision    recall  f1-score   support

           0       0.90      0.81      0.85      2391
           1       0.83      0.91      0.87      2391

    accuracy                           0.86      4782
   macro avg       0.86      0.86      0.86      4782
weighted avg       0.86      0.86      0.86      4782
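
The ensemble above uses `voting='soft'`, which averages the class probabilities predicted by each member and picks the class with the highest average. The sketch below illustrates that rule for a single sample and contrasts it with hard (majority) voting. The function `soft_vote` and the probability values are hypothetical, chosen only to show a case where the two rules disagree; they are not outputs of the models in this notebook.

```python
def soft_vote(probabilities):
    """Average per-model class probabilities; return (winning class, averages)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(model_probs[c] for model_probs in probabilities) / n_models
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=lambda c: averaged[c])
    return winner, averaged


# Hypothetical predict_proba outputs for one sample: [P(class 0), P(class 1)].
model_probs = [
    [0.45, 0.55],  # e.g. logistic regression, weakly favours class 1
    [0.90, 0.10],  # e.g. KNN, strongly favours class 0
    [0.40, 0.60],  # e.g. XGBoost, weakly favours class 1
]

soft_winner, averaged = soft_vote(model_probs)

# Hard voting counts each model's own top class instead of averaging.
hard_votes = [max(range(2), key=lambda c: p[c]) for p in model_probs]
hard_winner = max(set(hard_votes), key=hard_votes.count)

print("soft:", soft_winner, "hard:", hard_winner)
```

In this toy case one confident model outweighs two hesitant ones under soft voting, while hard voting follows the two-to-one majority. Soft voting therefore depends on how well calibrated each member's probabilities are.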

