## Overall Performance of All Models

The table below lists the models we tried for this data challenge and the F1 score each one achieved.


| Model Name | Performance (F1 Score) | Model Name | Performance (F1 Score) |
|--|--|--|--|
| Logistic Regression | 0.6680633564575165 | Bagging Tree | 0.7403892068203023 |
| KNN | 0.6728005559416261 | Random Forest | 0.7489226858421332 |
| Gradient Boosting | 0.7311811698293893 | AdaBoost | 0.726369737237488 |
| Naive Bayes | 0.7363129412411065 | XGBoost | 0.7501138071719989 |
| CatBoost | 0.7524212296157773 | CNN | 0.7339021501847703 |
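For reference, the F1 score reported in the table is the harmonic mean of precision and recall for the positive class. The following is a minimal sketch of that calculation on made-up labels; the helper name `f1_from_counts` and the example labels are illustrative assumptions, not part of the original notebook, which uses `sklearn.metrics.f1_score`.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    if tp == 0:
        # No true positives means precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only (1 = correct answer, 0 = incorrect).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Count true positives, false positives and false negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

f1 = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, f1)
```

Because F1 ignores true negatives, it can rank models differently from plain accuracy when the two classes are imbalanced.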

## Identify the Problem and Exploratory Data Analysis

Eight CSV files are available as input for predicting the final score. They contain students' action records, assignment details, relationships between unit-level assignments, explanation details, hint details, problem details, sequence details, and relationships between unit-level sequences.

The key information can therefore be extracted from students' actions, assignment details, explanation details, hint details, problem details, and sequence details.

We analyzed the data type of every variable and created a data map to visualize the relationships between the variables.

## Feature Engineering

### Students' Action

For the students' action data, we deleted hint_id because it could not be matched to "hint_details". We processed the timestamp data, keeping the maximum and minimum values and calculating the max gap. We also converted the categorical variable action into dummy variables and converted the other categorical variables to numeric variables.
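The aggregation described above could be sketched as follows. This is a hedged illustration on toy data, not the original preprocessing code: the action names are invented, and "max gap" is interpreted here as the largest pause between two consecutive actions within one assignment log, which is an assumption.

```python
import pandas as pd

# Toy action log; the action names and timestamps are hypothetical.
logs = pd.DataFrame({
    "assignment_log_id": ["a1", "a1", "a1", "a2", "a2"],
    "timestamp": [100.0, 130.0, 190.0, 50.0, 55.0],
    "action": ["problem_started", "hint_requested", "correct_response",
               "problem_started", "wrong_response"],
})

# Gap between consecutive actions within each assignment log.
logs = logs.sort_values(["assignment_log_id", "timestamp"])
logs["gap"] = logs.groupby("assignment_log_id")["timestamp"].diff()

# Earliest timestamp, latest timestamp and largest gap per assignment log.
time_features = logs.groupby("assignment_log_id").agg(
    timestamp_min=("timestamp", "min"),
    timestamp_max=("timestamp", "max"),
    max_gap=("gap", "max"),
)

# One dummy column per action type, summed per assignment log.
action_counts = (
    pd.get_dummies(logs["action"], prefix="action", dtype=int)
    .groupby(logs["assignment_log_id"])
    .sum()
)

features = time_features.join(action_counts)
print(features)
```

Summing the dummy columns yields one row per assignment log, which is the shape needed to join the result onto the unit test scores.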

### Assignment Details

Because the variable "assignment_log_id" in "assignment_details.csv" could not be connected with either "action_logs.csv" or "assignment_relationships.csv", we used sequence_id to connect the assignment details (assignment_release_date, assignment_due_date, assignment_start_time and assignment_end_time) to the sequence details.

### Assignment Relationship

In the file "assignment_relationships.csv", unit_test_assignment_log_id can be connected with "training_unit_test_scores.csv" and "evaluation_unit_test_scores.csv", and in_unit_assignment_log_id can be connected with assignment_log_id in "action_logs.csv".

### Explanation Details

We selected three variables from "explanation_details.csv": explanation_contains_image, explanation_contains_equation, and explanation_contains_video.

### Hint Details

The file "hint_details.csv" contains detailed information about hints. However, we found that the hint_id values here do not match the hint_id values in "action_logs.csv", so we chose not to use the hint information.
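One way to check whether two files share a usable join key is to measure how many distinct key values overlap. The sketch below uses toy data; the helper `key_overlap` is a hypothetical name introduced for illustration and is not part of the original notebook.

```python
import pandas as pd


def key_overlap(left: pd.Series, right: pd.Series) -> float:
    """Share of distinct non-null keys in `left` that also appear in `right`."""
    left_keys = set(left.dropna().unique())
    if not left_keys:
        return 0.0
    right_keys = set(right.dropna().unique())
    return len(left_keys & right_keys) / len(left_keys)


# Toy stand-ins for the two files; the id values are made up.
action_logs = pd.DataFrame({"hint_id": ["h1", "h2", None, "h3"]})
hint_details = pd.DataFrame({"hint_id": ["x7", "x8", "x9"]})

overlap = key_overlap(action_logs["hint_id"], hint_details["hint_id"])
print(f"share of hint_id values that match: {overlap:.0%}")
```

An overlap near zero, as in this toy case, means a merge on that key would attach almost no information.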

### Problem Details

For the file "problem_details.csv", we deleted the variables problem_multipart_id, problem_multipart_position, problem_skill_description, and problem_text_bert_pca. We split the variable problem_skill_code into three new variables based on the meaning of its value: problem_skill_code_grade, problem_skill_code_skill and problem_skill_code_level. All categorical variables were converted to numeric variables. The variables problem_contains_image, problem_contains_equation and problem_contains_video were kept.
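A rough sketch of the skill-code split and the categorical-to-numeric conversion is shown below. It assumes the code is a dot-separated string such as "6.RP.A.1" whose first three parts map to grade, skill and level; the actual format in the dataset and the encoding the authors used may differ, and the example values are made up.

```python
import pandas as pd

# Toy problem table; the skill codes follow an assumed "grade.skill.level.n" layout.
problems = pd.DataFrame({
    "problem_id": ["p1", "p2", "p3"],
    "problem_skill_code": ["6.RP.A.1", "7.NS.A.2", None],
})

# Split the code on "." into separate columns (missing codes stay missing).
parts = problems["problem_skill_code"].str.split(".", expand=True)
problems["problem_skill_code_grade"] = parts[0]
problems["problem_skill_code_skill"] = parts[1]
problems["problem_skill_code_level"] = parts[2]

# Convert each categorical column to integer codes; missing values become -1.
new_cols = [
    "problem_skill_code_grade",
    "problem_skill_code_skill",
    "problem_skill_code_level",
]
for col in new_cols:
    codes, _ = pd.factorize(problems[col])
    problems[col] = codes

print(problems[["problem_id"] + new_cols])
```

`pd.factorize` assigns codes in order of first appearance, so the integers carry no ordering; tree-based models tolerate this, while linear models may benefit from one-hot encoding instead.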

### Sequence Details

For the file "sequence_details.csv", we converted the variables sequence_folder_path_level_1, sequence_folder_path_level_2, sequence_folder_path_level_3, sequence_folder_path_level_4 and sequence_name into numeric variables. We expanded sequence_problem_ids into multiple rows so that the sequence details could be connected with both the assignment details and the problem details.
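Expanding a list-valued column into one row per element can be done with `DataFrame.explode`. The sketch below assumes that sequence_problem_ids is stored as a string representation of a Python list, which is an assumption about the file format; the ids are made up.

```python
import ast

import pandas as pd

# Toy sequence table; each sequence lists its problems as a list-like string.
sequences = pd.DataFrame({
    "sequence_id": ["s1", "s2"],
    "sequence_problem_ids": ["['p1', 'p2', 'p3']", "['p4']"],
})

# Parse the string into a real Python list, then emit one row per problem id.
sequences["sequence_problem_ids"] = sequences["sequence_problem_ids"].apply(
    ast.literal_eval
)
long_form = sequences.explode("sequence_problem_ids", ignore_index=True)

print(long_form)
```

After this step each row pairs one sequence_id with one problem id, so the table can be merged with problem details on the problem id and with assignment details on sequence_id.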

### Sequence Relationships

In the file "sequence_relationships.csv", in_unit_sequence_id can be connected to sequence_id in "assignment_details.csv" and to sequence_id in "sequence_details.csv".

## Modelling

In [1]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

We have preprocessed data to make it ready for running machine learning models.

The file "action_logs_summed_dummies_wtime.csv" combines action details, explanation information and assignment relationship information.

The file "problem_details.csv" has been processed and all categorical variables were converted to numeric variables.

The file "sdad.csv" combines sequence details and assignment details.

### Merge Important Data

In [2]:
# Load the necessary data
tuts = pd.read_csv('/kaggle/input/d/cielzhao80/edm-cup-2023/training_unit_test_scores.csv')
euts = pd.read_csv('/kaggle/input/d/cielzhao80/edm-cup-2023/evaluation_unit_test_scores.csv')
# ar = pd.read_csv('/kaggle/input/d/cielzhao80/edm-cup-2023/assignment_relationships.csv')
al = pd.read_csv('/kaggle/input/d/cielzhao80/edm-cup-2023/action_logs_summed_dummies_wtime.csv')
pro_d = pd.read_csv('/kaggle/input/d/cielzhao80/edm-cup-2023/problem_details.csv')
sdad = pd.read_csv('/kaggle/input/d/cielzhao80/edm-cup-2023/sdad.csv')


In [3]:
psa = pro_d.merge(sdad, how = 'left', left_on = 'problem_id', right_on = 'sequence_problem_ids')

In [4]:
del psa['sequence_problem_ids']

In [5]:
tal = tuts.merge(al, how='left', left_on='assignment_log_id', right_on='unit_test_assignment_log_id')
eal = euts.merge(al, how='left', left_on='assignment_log_id', right_on='unit_test_assignment_log_id')

In [6]:
del tal['unit_test_assignment_log_id']
del eal['unit_test_assignment_log_id']

In [7]:
tp = tal.merge(psa, how='left', left_on='problem_id', right_on='problem_id')
ep = eal.merge(psa, how='left', left_on='problem_id', right_on='problem_id')
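A left merge keeps every row of the left table, but it also adds rows whenever a key appears more than once on the right side, for example if a problem belongs to more than one sequence. The following sketch on toy data shows one way to detect this; it is an optional sanity check, not a step from the original notebook, and the table contents are made up.

```python
import pandas as pd

# Toy score table and a feature table in which problem p1 appears twice.
scores = pd.DataFrame({
    "assignment_log_id": ["a1", "a2", "a3"],
    "problem_id": ["p1", "p2", "p1"],
    "score": [1, 0, 1],
})
problem_features = pd.DataFrame({
    "problem_id": ["p1", "p1", "p2"],
    "sequence_id": ["s1", "s2", "s1"],
})

# Compare row counts before and after the merge.
merged = scores.merge(problem_features, how="left", on="problem_id")
rows_added = len(merged) - len(scores)

# Alternatively, let pandas raise an error when right-side keys are not unique.
try:
    scores.merge(problem_features, how="left", on="problem_id",
                 validate="many_to_one")
    is_many_to_one = True
except pd.errors.MergeError:
    is_many_to_one = False

print(rows_added, is_many_to_one)
```

If rows are added, the same unit test score is counted several times in training, so deduplicating or aggregating the right-hand table before merging may be worth considering.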

### Fill Missing Value with -1

In [8]:
tp_fill = tp.fillna(-1)
ep_fill = ep.fillna(-1)

In [9]:
X_train_all, X_valid_all, y_train, y_valid = train_test_split(tp_fill.drop('score', axis=1),
                                                      tp_fill['score'],
                                                      test_size=0.2,
                                                      random_state=123)

In [10]:
X_train = X_train_all.drop(['assignment_log_id', 'problem_id'], axis=1)
X_valid = X_valid_all.drop(['assignment_log_id', 'problem_id'], axis=1)

In [11]:
t_input_cols = [c for c in tp_fill.columns if c not in ['assignment_log_id', 'problem_id', 'score']]
e_input_cols = [c for c in ep_fill.columns if c not in ['assignment_log_id', 'problem_id', 'id', 'score']]
target_col = 'score'

### Fit Data into Models

#### Logistic Regression

In [12]:
lr = LogisticRegression(max_iter=1000)
lr = lr.fit(X_train, y_train)
y_pred = lr.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.6680633564575165


#### Bagging Tree

In [13]:
dt = DecisionTreeClassifier(max_depth=5)
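# Pass the base tree positionally: the keyword was renamed from `base_estimator`
# to `estimator` in scikit-learn 1.2, so this form works with both.
bagging = BaggingClassifier(dt, n_estimators=100, random_state=123)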
bagging = bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.7403892068203023


#### Random Forest

In [15]:
rf = RandomForestClassifier(n_estimators=100, random_state=123)
rf = rf.fit(X_train, y_train)
y_pred = rf.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.7489226858421332


In [17]:
# Predict evaluation data
rfc = RandomForestClassifier(n_estimators=100, random_state=123)
rfc = rfc.fit(tp_fill[t_input_cols], tp_fill[target_col])
ep_fill[target_col] = rfc.predict_proba(ep_fill[e_input_cols])[:,1]
ep_fill[['id', 'score']].to_csv('/kaggle/working/rf5.csv', index=False)

#### Gradient Boosting

In [16]:
gb = GradientBoostingClassifier(n_estimators=100, random_state=123)
gb = gb.fit(X_train, y_train)
y_pred = gb.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.7311811698293893


#### AdaBoost

In [17]:
ab = AdaBoostClassifier(n_estimators=100, random_state=123)
ab = ab.fit(X_train, y_train)
y_pred = ab.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.726369737237488


#### KNN

In [18]:
knn = KNeighborsClassifier(n_neighbors=3)
knn = knn.fit(X_train, y_train)
y_pred = knn.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.6728005559416261


#### Naive Bayes

In [19]:
gnb = GaussianNB()
gnb = gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_valid)
f1_gnb = f1_score(y_valid, y_pred)
print(f1_gnb)

# mnb = MultinomialNB()
# mnb = mnb.fit(X_train, y_train)
# y_pred = mnb.predict(X_valid)
# f1_mnb = f1_score(y_valid, y_pred)
# print(f1_mnb)

0.7363129412411065


#### XGBoost

In [20]:
xgb = XGBClassifier()
X_train = X_train.apply(pd.to_numeric)
X_valid = X_valid.apply(pd.to_numeric)
xgb = xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.7501138071719989


In [30]:
xgb = XGBClassifier()
tp_fill[t_input_cols] = tp_fill[t_input_cols].apply(pd.to_numeric)
tp_fill[target_col] = tp_fill[target_col].apply(pd.to_numeric)
ep_fill[e_input_cols] = ep_fill[e_input_cols].apply(pd.to_numeric)
ep_fill[target_col] = ep_fill[target_col].apply(pd.to_numeric)
xgb = xgb.fit(tp_fill[t_input_cols], tp_fill[target_col])
ep_fill[target_col] = xgb.predict_proba(ep_fill[e_input_cols])[:,1]
ep_fill[['id', 'score']].to_csv('/kaggle/working/xgb6.csv', index=False)

#### CatBoost

In [21]:
cat = CatBoostClassifier(verbose=False)
cat = cat.fit(X_train, y_train)
y_pred = cat.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.7524212296157773


In [None]:
cat = CatBoostClassifier()
cat = cat.fit(tp_fill[t_input_cols], tp_fill[target_col])
ep_fill[target_col] = cat.predict_proba(ep_fill[e_input_cols])[:,1]
ep_fill[['id', 'score']].to_csv('/kaggle/working/cat6.csv', index=False)

#### CNN

In [22]:
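# Note: MLPClassifier is a fully connected multilayer perceptron (two hidden
# layers of 10 units each), not a convolutional network.
cnn = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=123)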
cnn = cnn.fit(X_train, y_train)
y_pred = cnn.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
print(f1)

0.7339021501847703


## Results and Interpretation

| Model Name | Performance (F1 Score) | Model Name | Performance (F1 Score) |
|--|--|--|--|
| Logistic Regression | 0.6680633564575165 | Bagging Tree | 0.7403892068203023 |
| KNN | 0.6728005559416261 | Random Forest | 0.7489226858421332 |
| Gradient Boosting | 0.7311811698293893 | AdaBoost | 0.726369737237488 |
| Naive Bayes | 0.7363129412411065 | XGBoost | 0.7501138071719989 |
| CatBoost | 0.7524212296157773 | CNN | 0.7339021501847703 |

In conclusion, CatBoost (F1 ≈ 0.752) and XGBoost (F1 ≈ 0.750) performed best at predicting students' performance on the online learning platform, with Random Forest (F1 ≈ 0.749) close behind.

The analysis went through several iterations. Initially, we included only students' action and problem information, which gave a score of 0.67. Adding explanation details improved the score to 0.69. Finally, including sequence and assignment details raised it to 0.7. These results suggest that students' actions and problem information are the most important factors for predicting performance, which indicates that students' online actions and content details can be used to predict their performance.

However, we recognize that there may be deeper information that could further improve our predictions. We will continue to explore the data to identify additional key features.