## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Merging Dataframes

1. 학생 정보 관련 파일:
   - studentInfo.csv: 학생의 인구 통계 정보, 등록 정보 등을 담고 있습니다.

2. 과정 정보 관련 파일:
   - courses.csv: 각 코스(모듈)에 대한 정보를 담고 있습니다.
   - assessments.csv: 코스 내의 평가 정보 (과제, 시험 등)를 담고 있습니다.

3. 학생 성적 관련 파일:
    - studentAssessment.csv: 학생들의 평가 점수를 담고 있습니다.
    - studentRegistration.csv: 학생의 코스 등록 정보를 담고 있습니다.
    
4. VLE 활동 관련 파일:
   - vle.csv: VLE 활동에 대한 설명 정보를 담고 있습니다.
   - studentVle.csv: 학생의 VLE(Virtual Learning Environment) 활동 데이터를 담고 있습니다.

---

In [2]:
from sklearn.metrics import classification_report

def evaluate_score(y_true, y_pred):
    # precision = precision_score(y_true, y_pred)
    # recall = recall_score(y_true, y_pred)
    # print(classification_report(y_true, y_pred))
    # print('precision: {0:.6f}, recall: {1:.6f}'\
    #       .format(precision, recall))
    print(classification_report(y_true, y_pred))

In [5]:
lr_params =  {
            'penalty' : ['l1', 'l2'],
            'C' : np.arange(200) / 10,
            'solver' : ['lbfgs', 'newton-cg', 'liblinear']
            }
svm_params =  {
            'C' : np.arange(200) / 10,
            'kernel' : ['linear', 'poly', 'rbf', 'sigmoid']
            }
knn_params =  {
            'n_neighbors' : range(3, 12, 2),
            'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute']
            }
xgb_params =  {
            'learning_rate' : np.arange(101) / 100,
            'max_depth' : range(3, 9),
            'subsample': np.arange(0.5, 1.05, 0.1),
            'lambda': [0, 1, 10]
            }

model_params = [lr_params, svm_params, knn_params, xgb_params]

In [3]:
df = pd.read_csv("merged_data_final.csv")
display(df)
display(df.info())

Unnamed: 0,gender,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,module_presentation_length,...,code_module_EEE,code_module_FFF,code_module_GGG,code_presentation_2013B,code_presentation_2013J,code_presentation_2014B,code_presentation_2014J,assessment_type_CMA,assessment_type_Exam,assessment_type_TMA
0,0,4699,9,2,0,240,0,0,-159.0,268,...,False,False,False,False,True,False,False,False,False,True
1,1,4699,2,1,0,60,0,0,-53.0,268,...,False,False,False,False,True,False,False,False,False,True
2,1,3630,5,1,0,60,0,0,-52.0,268,...,False,False,False,False,True,False,False,False,False,True
3,1,3576,5,0,0,60,0,0,-176.0,268,...,False,False,False,False,True,False,False,False,False,True
4,0,3630,8,1,0,60,0,0,-110.0,268,...,False,False,False,False,True,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168899,1,3630,2,0,0,30,0,0,-2.0,269,...,False,False,True,False,False,False,True,True,False,False
168900,1,3630,0,1,0,30,0,0,-10.0,269,...,False,False,True,False,False,False,True,True,False,False
168901,0,3630,0,0,0,60,0,0,-10.0,269,...,False,False,True,False,False,False,True,True,False,False
168902,1,3576,5,0,2,30,0,0,2.0,269,...,False,False,True,False,False,False,True,True,False,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168904 entries, 0 to 168903
Data columns (total 35 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   gender                      168904 non-null  int64  
 1   highest_education           168904 non-null  int64  
 2   imd_band                    168904 non-null  int64  
 3   age_band                    168904 non-null  int64  
 4   num_of_prev_attempts        168904 non-null  int64  
 5   studied_credits             168904 non-null  int64  
 6   disability                  168904 non-null  int64  
 7   final_result                168904 non-null  int64  
 8   date_registration           168904 non-null  float64
 9   module_presentation_length  168904 non-null  int64  
 10  my_average_score            168904 non-null  float64
 11  my_score_std                168904 non-null  float64
 12  my_score_trend              168904 non-null  int64  
 13  assessment_wei

None

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import load


# 언더샘플링
df = pd.read_csv("merged_data_final.csv")
df_0 = df[df['final_result'] == 0]
df_1 = df[df['final_result'] == 1]
df_0_down = resample(df_0, replace=False, n_samples=len(df_1), random_state=42)
df_balanced = pd.concat([df_0_down, df_1])

# 특성과 라벨 분리
X = df_balanced.drop('final_result', axis=1)
y = df_balanced['final_result']
X = pd.get_dummies(X)
X = X.dropna()
y = y.loc[X.index]

# train/test 분할
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

standard_sc = StandardScaler()
X_train = standard_sc.fit_transform(X_train)
X_test = standard_sc.transform(X_test)

In [6]:
from sklearn.model_selection import RandomizedSearchCV

lr_clf =  LogisticRegression(max_iter=1000)
svm_clf =  SVC(max_iter=1000, probability=True)
knn_clf =  KNeighborsClassifier()
xgb_clf =  XGBClassifier(tree_method='hist', device="cuda",
                         n_estimators=500, objective="reg:squarederror")

models_names = [lr_clf, svm_clf, knn_clf, xgb_clf]

model_best_params = []
model_best_estimators = []

for idx, model_ in enumerate(models_names):
    rd_search = RandomizedSearchCV(model_, model_params[idx], cv=5, n_iter=100, random_state=0, scoring='f1')
    rd_search.fit(X_train, y_train)
    
    model_best_params.append(rd_search.best_params_)
    model_best_estimators.append(rd_search.best_estimator_)

165 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
95 fits failed with the following error:
Traceback (most recent call last):
  File "/home/dheum/.conda/envs/pystudy_env/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/dheum/.conda/envs/pystudy_env/lib/python3.12/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dheum/.conda/envs/pystudy_env/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1193, in fit
    solver = _check_solver(self.solver, self.penalty

In [7]:
from joblib import dump
file_names = ['lr_clf.joblib', 'svm_clf.joblib', 'knn_clf.joblib', 'xgb_clf.joblib']

for idx, file_name in enumerate(file_names):
    dump(model_best_estimators[idx], file_name)

In [8]:
for idx, best_param in enumerate(model_best_params):
    print(f'{file_names[idx]} | best parameters:\n{best_param}')

lr_clf.joblib | best parameters:
{'solver': 'liblinear', 'penalty': 'l1', 'C': np.float64(5.0)}
svm_clf.joblib | best parameters:
{'kernel': 'poly', 'C': np.float64(2.1)}
knn_clf.joblib | best parameters:
{'n_neighbors': 3, 'algorithm': 'auto'}
xgb_clf.joblib | best parameters:
{'subsample': np.float64(0.7999999999999999), 'max_depth': 8, 'learning_rate': np.float64(0.14), 'lambda': 0}


In [9]:
best_models = []
for idx, name in enumerate(file_names):
    best_models.append(load(name))
    y_pred = best_models[idx].predict(X_test)
    print(f'{file_names[idx]}')
    evaluate_score(y_test, y_pred)
    print()

lr_clf.joblib
              precision    recall  f1-score   support

           0       0.68      0.70      0.69      2391
           1       0.69      0.67      0.68      2391

    accuracy                           0.68      4782
   macro avg       0.68      0.68      0.68      4782
weighted avg       0.68      0.68      0.68      4782


svm_clf.joblib
              precision    recall  f1-score   support

           0       0.50      0.02      0.04      2391
           1       0.50      0.98      0.66      2391

    accuracy                           0.50      4782
   macro avg       0.50      0.50      0.35      4782
weighted avg       0.50      0.50      0.35      4782


knn_clf.joblib
              precision    recall  f1-score   support

           0       0.77      0.71      0.74      2391
           1       0.73      0.79      0.76      2391

    accuracy                           0.75      4782
   macro avg       0.75      0.75      0.75      4782
weighted avg       0.75     

In [55]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import load

df = pd.read_csv("merged_data_final.csv")

# 언더샘플링
df_0 = df[df['final_result'] == 0]
df_1 = df[df['final_result'] == 1]
df_0_down = resample(df_0, replace=False, n_samples=len(df_1), random_state=42)
df_balanced = pd.concat([df_0_down, df_1])

# 특성과 라벨 분리
X = df_balanced.drop('final_result', axis=1)
y = df_balanced['final_result']
X = pd.get_dummies(X)
X = X.dropna()
y = y.loc[X.index]

# train/test 분할
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

standard_sc = StandardScaler()
X_train = standard_sc.fit_transform(X_train)
X_test_scaled = standard_sc.transform(X_test)

# 로지스틱 회귀 + 앙상블
voting_clf = VotingClassifier(
    estimators=[
        ('lr_clf', best_models[0]),
        # ('svm_clf', best_models[1]),
        ('knn_clf', best_models[2]),
        ('xgb_clf', best_models[3])
    ],
    voting='soft'
)
voting_clf.fit(X_train, y_train)

# 평가
y_pred_train = voting_clf.predict(X_train)
y_pred_test = voting_clf.predict(X_test_scaled)


print("✅ 학습 정확도:", accuracy_score(y_train, y_pred_train))
print("✅ 테스트 정확도:", accuracy_score(y_test, y_pred_test))
print("\n📋 분류 성능 보고서 (테스트셋):\n")
print(classification_report(y_test, y_pred_test))

✅ 학습 정확도: 0.9917912788873784
✅ 테스트 정확도: 0.8609368465077374

📋 분류 성능 보고서 (테스트셋):

              precision    recall  f1-score   support

           0       0.90      0.81      0.85      2391
           1       0.83      0.91      0.87      2391

    accuracy                           0.86      4782
   macro avg       0.86      0.86      0.86      4782
weighted avg       0.86      0.86      0.86      4782



Unnamed: 0,gender,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,date_registration,module_presentation_length,my_average_score,...,code_module_EEE,code_module_FFF,code_module_GGG,code_presentation_2013B,code_presentation_2013J,code_presentation_2014B,code_presentation_2014J,assessment_type_CMA,assessment_type_Exam,assessment_type_TMA
42819,1,3576,7,0,0,60,0,5.0,262,67.000000,...,False,False,False,False,False,False,True,False,False,True
32180,0,3576,7,1,0,60,0,-81.0,234,96.818182,...,False,False,False,False,False,True,False,False,False,True
71210,1,3630,0,0,1,120,1,-30.0,240,72.846154,...,False,False,False,True,False,False,False,True,False,False
16849,0,3630,3,0,0,60,0,-81.0,268,70.500000,...,False,False,False,False,True,False,False,False,False,True
45963,0,3630,4,0,0,90,0,-243.0,241,9.333333,...,False,False,False,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162373,1,3576,3,1,0,30,0,-73.0,241,78.444444,...,False,False,True,False,False,True,False,True,False,False
143693,0,3630,4,0,3,60,0,-24.0,269,66.416667,...,False,True,False,False,False,False,True,False,False,True
49036,0,3576,0,0,0,60,0,-165.0,241,56.285714,...,False,False,False,False,False,True,False,True,False,False
529,1,3630,4,0,0,180,0,-31.0,268,69.600000,...,False,False,False,False,True,False,False,False,False,True


In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("merged_data_final.csv")

# 언더샘플링
df_0 = df[df['final_result'] == 0]
df_1 = df[df['final_result'] == 1]
df_0_down = resample(df_0, replace=False, n_samples=len(df_1), random_state=42)
df_balanced = pd.concat([df_0_down, df_1])

# 특성과 라벨 분리
X = df_balanced.drop('final_result', axis=1)
y = df_balanced['final_result']
X = pd.get_dummies(X)
X = X.dropna()
y = y.loc[X.index]

# train/test 분할
X_train, X_test, _, _ = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

standard_sc = StandardScaler()
X_train = standard_sc.fit(X_train)


# ¹Ýº¹ ½ÃÀÛ
t = True
result_info = {}

for idx, row in df[df['final_result'] == 1].iterrows():
    base_input = row[X.columns].copy()
    for avg in np.linspace(40, base_input['course_avg_score'], 1):
        for max_ in np.linspace(60, base_input['course_max_score'], 1):
            for std in np.linspace(5, base_input['course_std_score'], 1):
                modified = base_input.copy()
                modified['course_avg_score'] = avg
                modified['course_max_score'] = max_
                modified['course_std_score'] = std

                modified = pd.DataFrame(modified)
                modified = modified.T
                modified = standard_sc.transform(modified)
                pred = voting_clf.predict(modified)[0]
                
                if pred == 0:
                    result_info = {
                        'original_student_id': idx,
                        'original_final_result': row['final_result'],
                        'modified_course_avg_score': avg,
                        'modified_course_max_score': max_,
                        'modified_course_std_score': std
                    }
                    t = False
                    break
            if not t:
                break
        if not t:
            break
    if not t:
        break

# Ãâ·Â
print("Á¶°Ç ¸¸Á·ÇÏ´Â ÇÐ»ý Á¤º¸:")
print(result_info)

gender                                1
highest_education                  3630
imd_band                              7
age_band                              0
num_of_prev_attempts                  0
studied_credits                      60
disability                            0
date_registration                -180.0
module_presentation_length          268
my_average_score                   67.0
my_score_std                   1.154701
my_score_trend                        1
assessment_weight              0.007034
weighted_score                 0.464223
course_avg_score              69.431637
course_max_score                   98.0
course_std_score               12.63941
course_late_rate               0.235438
days_early_submission               2.0
my_late_rate                        0.0
code_module_AAA                    True
code_module_BBB                   False
code_module_CCC                   False
code_module_DDD                   False
code_module_EEE                   False


gender                                0
highest_education                  3576
imd_band                              7
age_band                              1
num_of_prev_attempts                  0
studied_credits                      60
disability                            0
date_registration                -170.0
module_presentation_length          268
my_average_score              65.333333
my_score_std                   8.115828
my_score_trend                        1
assessment_weight              0.007034
weighted_score                 0.520493
course_avg_score              69.431637
course_max_score                   98.0
course_std_score               12.63941
course_late_rate               0.235438
days_early_submission               2.0
my_late_rate                   0.166667
code_module_AAA                    True
code_module_BBB                   False
code_module_CCC                   False
code_module_DDD                   False
code_module_EEE                   False


Á¶°Ç ¸¸Á·ÇÏ´Â ÇÐ»ý Á¤º¸:
{'original_student_id': 21, 'original_final_result': 1, 'modified_course_avg_score': np.float64(40.0), 'modified_course_max_score': np.float64(88.5), 'modified_course_std_score': np.float64(5.0)}


---
# Polynomial

In [63]:
df = pd.read_csv("merged_poly_data_final.csv")
display(df)
display(df.info())

Unnamed: 0,my_average_score^2,my_average_score module_presentation_length,my_average_score highest_education,my_average_score,my_average_score course_avg_score,my_score_std code_module_CCC,my_score_std course_late_rate,my_average_score course_std_score,my_average_score weighted_score,course_avg_score,...,my_score_std course_avg_score,my_late_rate,weighted_score days_early_submission,my_late_rate course_std_score,course_std_score^2,course_std_score course_late_rate,my_score_std highest_education,course_avg_score days_early_submission,my_late_rate course_late_rate,final_result
0,6724.000000,21976.000000,385318.000000,82.000000,5693.394237,0.0,0.725670,1036.431602,44.987434,69.431637,...,214.002678,0.000000,0.548627,0.000000,159.754679,2.975802,14483.290700,69.431637,0.000000,0.0
1,4408.960000,17795.200000,312013.600000,66.400000,4610.260699,0.0,1.020836,839.256809,32.692557,69.431637,...,301.048404,0.400000,-1.477073,5.055764,159.754679,2.975802,20374.378489,-208.294911,0.094175,0.0
2,5776.000000,20368.000000,275880.000000,76.000000,5276.804414,0.0,1.622647,960.595143,38.488311,69.431637,...,478.524535,0.000000,1.012850,0.000000,159.754679,2.975802,25018.048485,138.863274,0.000000,0.0
3,2959.360000,14579.200000,194534.400000,54.400000,3777.081055,0.0,4.829644,687.583892,26.401631,69.431637,...,1424.279655,1.000000,-3.397269,12.639410,159.754679,2.975802,73355.955047,-486.021459,0.235438,0.0
4,4624.000000,18224.000000,246840.000000,68.000000,4721.351318,0.0,2.584466,859.479865,37.784943,69.431637,...,762.168382,0.200000,0.000000,2.527882,159.754679,2.975802,39847.414596,0.000000,0.047088,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168899,5459.567901,19876.111111,268216.666667,73.888889,5843.426117,0.0,1.807261,1451.565063,11.245518,79.083962,...,1352.222703,0.000000,0.304390,0.000000,385.935512,2.076436,62067.810699,158.167925,0.000000,0.0
168900,6377.163265,21481.571429,289881.428571,79.857143,6315.419289,0.0,1.615037,1568.812853,20.256427,79.083962,...,1208.397229,0.285714,0.000000,5.612926,385.935512,2.076436,55466.137546,0.000000,0.030199,0.0
168901,7612.562500,23470.250000,316717.500000,87.250000,6900.075726,0.0,1.225808,1714.047316,17.705349,79.083962,...,917.169414,0.125000,2.840973,2.455655,385.935512,2.076436,42098.611022,1107.175475,0.013212,0.0
168902,6597.049383,21848.777778,290450.666667,81.222222,6423.375175,0.0,2.528878,1595.630167,20.602691,79.083962,...,1892.147960,0.333333,-0.253658,6.548414,385.935512,2.076436,85558.701065,-79.083962,0.035232,1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168904 entries, 0 to 168903
Data columns (total 35 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   my_average_score^2                           168904 non-null  float64
 1   my_average_score module_presentation_length  168904 non-null  float64
 2   my_average_score highest_education           168904 non-null  float64
 3   my_average_score                             168904 non-null  float64
 4   my_average_score course_avg_score            168904 non-null  float64
 5   my_score_std code_module_CCC                 168904 non-null  float64
 6   my_score_std course_late_rate                168904 non-null  float64
 7   my_average_score course_std_score            168904 non-null  float64
 8   my_average_score weighted_score              168904 non-null  float64
 9   course_avg_score                             168904 non-nul

None

In [64]:
from joblib import load

# 언더샘플링
df_0 = df[df['final_result'] == 0]
df_1 = df[df['final_result'] == 1]
df_0_down = resample(df_0, replace=False, n_samples=len(df_1), random_state=42)
df_balanced = pd.concat([df_0_down, df_1])

# 특성과 라벨 분리
X = df_balanced.drop('final_result', axis=1)
y = df_balanced['final_result']
X = pd.get_dummies(X)
X = X.dropna()
y = y.loc[X.index]

# train/test 분할
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

standard_sc = StandardScaler()
X_train = standard_sc.fit_transform(X_train)
X_test = standard_sc.transform(X_test)

In [None]:
lr_clf2 =  LogisticRegression(max_iter=1000)
svm_clf2 =  SVC(max_iter=1000, probability=True)
knn_clf2 =  KNeighborsClassifier()
xgb_clf2 =  XGBClassifier(tree_method='hist', device="cuda",
                         n_estimators=500, objective="reg:squarederror")

models_names = [lr_clf2, svm_clf2, knn_clf2, xgb_clf2]

model_best_params = []
model_best_estimators = []

for idx, model_ in enumerate(models_names):
    rd_search = RandomizedSearchCV(model_, model_params[idx], cv=5, n_iter=100, random_state=0, scoring='f1')
    rd_search.fit(X_train, y_train)
    
    model_best_params.append(rd_search.best_params_)
    model_best_estimators.append(rd_search.best_estimator_)

In [None]:
file_names2 = ['lr_clf_poly.joblib', 'svm_clf_poly.joblib', 'knn_clf_poly.joblib', 'xgb_clf_poly.joblib']

for idx, file_name in enumerate(file_names):
    dump(model_best_estimators[idx], file_name)

In [None]:
for idx, best_param in enumerate(model_best_params):
    print(f'{file_names[idx]} | Poly_model best parameters:\n{best_param}')

lr_clf.joblib | Poly_model best parameters:
{'solver': 'lbfgs', 'penalty': 'l2', 'C': np.float64(11.3)}
svm_clf.joblib | Poly_model best parameters:
{'kernel': 'poly', 'C': np.float64(2.1)}
knn_clf.joblib | Poly_model best parameters:
{'n_neighbors': 11, 'algorithm': 'auto'}
xgb_clf.joblib | Poly_model best parameters:
{'subsample': np.float64(0.7999999999999999), 'max_depth': 6, 'learning_rate': np.float64(0.02), 'lambda': 10}


In [None]:
best_models2 = []
for idx, name in enumerate(file_names2):
    best_models2.append(load(name))
    y_pred = best_models2[idx].predict(X_test)
    print(f'{file_names[idx]}')
    evaluate_score(y_test, y_pred)
    print()

lr_clf.joblib
              precision    recall  f1-score   support

         0.0       0.58      0.68      0.63      2390
         1.0       0.62      0.52      0.56      2390

    accuracy                           0.60      4780
   macro avg       0.60      0.60      0.59      4780
weighted avg       0.60      0.60      0.59      4780


svm_clf.joblib
              precision    recall  f1-score   support

         0.0       0.50      0.21      0.30      2390
         1.0       0.50      0.78      0.61      2390

    accuracy                           0.50      4780
   macro avg       0.50      0.50      0.46      4780
weighted avg       0.50      0.50      0.46      4780


knn_clf.joblib
              precision    recall  f1-score   support

         0.0       0.60      0.56      0.58      2390
         1.0       0.59      0.63      0.61      2390

    accuracy                           0.60      4780
   macro avg       0.60      0.60      0.59      4780
weighted avg       0.60     

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import load


# 언더샘플링
df_0 = df[df['final_result'] == 0]
df_1 = df[df['final_result'] == 1]
df_0_down = resample(df_0, replace=False, n_samples=len(df_1), random_state=42)
df_balanced = pd.concat([df_0_down, df_1])

# 특성과 라벨 분리
X = df_balanced.drop('final_result', axis=1)
y = df_balanced['final_result']
X = pd.get_dummies(X)
X = X.dropna()
y = y.loc[X.index]

# train/test 분할
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

standard_sc = StandardScaler()
X_train = standard_sc.fit_transform(X_train)
X_test = standard_sc.transform(X_test)

# 로지스틱 회귀 + 앙상블
voting_clf2 = VotingClassifier(
    estimators=[
        ('lr_clf', best_models2[0]),
        # ('svm_clf', best_models2[1]),
        ('knn_clf', best_models2[2]),
        ('xgb_clf', best_models2[3])
    ],
    voting='soft'
)
voting_clf2.fit(X_train, y_train)

# 평가
y_pred_train = voting_clf2.predict(X_train)
y_pred_test = voting_clf2.predict(X_test)


print("✅ 학습 정확도:", accuracy_score(y_train, y_pred_train))
print("✅ 테스트 정확도:", accuracy_score(y_test, y_pred_test))
print("\n📋 분류 성능 보고서 (테스트셋):\n")
print(classification_report(y_test, y_pred_test))

✅ 학습 정확도: 0.6850282485875706
✅ 테스트 정확도: 0.6209205020920502

📋 분류 성능 보고서 (테스트셋):

              precision    recall  f1-score   support

         0.0       0.63      0.59      0.61      2390
         1.0       0.61      0.65      0.63      2390

    accuracy                           0.62      4780
   macro avg       0.62      0.62      0.62      4780
weighted avg       0.62      0.62      0.62      4780

