# Additive Factors Model (AFM) Implementation
This notebook provides a Python implementation of the AFM student model.
AFM estimates the probability that a student completes a step correctly, based on the student's prior attempts (opportunities).
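
For reference, AFM is commonly written as the logistic model below. The symbols follow the usual notation in the AFM literature and are not names used in this notebook: $\theta_i$ is the proficiency of student $i$, $\beta_k$ the easiness of knowledge component (KC) $k$, $\gamma_k$ its learning rate, $q_{jk}$ the Q-matrix entry that indicates whether step $j$ requires KC $k$, and $T_{ik}$ the number of prior opportunities student $i$ has had on KC $k$.

$$
\ln\frac{p_{ij}}{1 - p_{ij}} = \theta_i + \sum_{k} q_{jk}\,\bigl(\beta_k + \gamma_k T_{ik}\bigr)
$$

Here $p_{ij}$ is the probability that student $i$ gets step $j$ right.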

Please review and execute the implementation steps taking into account what we learned about the rule space student models and the Q-matrix.

## Initializing the environment
First, we import the libraries required for handling data and training the machine learning model. We also suppress warnings to keep the output readable.

In [2]:
import pandas as pd
import numpy as np
import warnings

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from patsy import dmatrices
from sklearn.metrics import f1_score, precision_score, recall_score, brier_score_loss
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

warnings.filterwarnings('ignore')

## Data preparation
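We define the function **feature_engineering()**, which converts the student data values into the appropriate types: first attempts marked as correct become 1, incorrect attempts and hint requests become 0, and all other rows are dropped.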

In [3]:
def feature_engineering(df):
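    # Encode the first attempt as a binary outcome: correct -> 1, incorrect or hint -> 0
    df.loc[df['First Attempt'] == 'incorrect', 'First Attempt'] = 0
    df.loc[df['First Attempt'] == 'hint', 'First Attempt'] = 0
    df.loc[df['First Attempt'] == 'correct', 'First Attempt'] = 1
    # Keep only rows with a binary outcome (drops e.g. 'unknown'); copy to avoid editing a view
    df = df[(df['First Attempt'] == 0) | (df['First Attempt'] == 1)].copy()

    # Drop rows with missing values and store the outcome in its own column
    df = df.dropna()
    df.insert(loc=len(df.columns), column='Outcome', value=df['First Attempt'])

    # Rename columns to the names used in the model formula
    df.rename(columns={'KC (Default)': 'KCModel', 'Opportunity': 'OpportunityModel'}, inplace=True)
    df.rename(columns={'Corrects': 'CorrectModel', 'Incorrects': 'IncorrectModel'}, inplace=True)
    df.rename(columns={'Hints': 'TellsModel'}, inplace=True)
    return df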
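
As a quick illustration of what this preprocessing does, the sketch below applies the same encoding to a small hand-made table. The toy rows and the helper name `encode_first_attempt` are made up for this example and are not part of the notebook's dataset or code.

```python
import pandas as pd

# Hypothetical toy data with the same 'First Attempt' labels as the dataset
toy = pd.DataFrame({
    "AnonStudentId": ["Stu_1", "Stu_1", "Stu_2", "Stu_2"],
    "First Attempt": ["correct", "hint", "unknown", "incorrect"],
    "KC (Default)": ["KC1", "KC1", "KC2", None],
})

def encode_first_attempt(df):
    """Map correct -> 1, incorrect/hint -> 0, and drop every other row."""
    mapping = {"correct": 1, "incorrect": 0, "hint": 0}
    out = df.copy()
    out["Outcome"] = out["First Attempt"].map(mapping)  # unmapped labels become NaN
    out = out.dropna()  # drops the 'unknown' row and the row with a missing KC
    out["Outcome"] = out["Outcome"].astype(int)
    return out

encoded = encode_first_attempt(toy)
print(encoded[["AnonStudentId", "KC (Default)", "Outcome"]])
```

Only the first two rows survive: the `unknown` attempt has no binary outcome, and the last row has no KC.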

## Model Training and Testing
For the model's implementation, we will use logistic regression and the Python library scikit-learn.
The function **trainModel()** splits the dataset into a training and a test dataset following the 80/20 (Pareto) principle.
Then, we use the "train" subset to fit the model and the "test" subset to evaluate it.

The predicted values are stored in the variable *"y_pred"*, while the actual values are stored in the variable *"y_test"*.
By comparing *"y_pred"* with *"y_test"*, we can assess the performance of the predictive model.
To do so, we use RMSE, F1, precision and recall, since the model effectively works as a binary classifier that predicts whether a student step is correct or incorrect. Note that RMSE is computed here on the predicted class labels (0 or 1), not on predicted probabilities.



In [4]:
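def trainModel(df,modeltype,X):
    # modeltype is only a label for the caller; it does not affect training

    # Binary outcome (1 = correct, 0 = incorrect) as the prediction target
    y = df['Outcome']
    y = y.astype('int')

    # Hold out 20% of the rows for testing; fit an L2-regularized logistic regression on the rest
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    TrainTestSplitModel = LogisticRegression(max_iter=1000, penalty='l2')
    TrainTestSplitModel.fit(X_train, y_train)

    # Predict class labels (0 or 1) and compare them with the held-out outcomes
    y_pred = TrainTestSplitModel.predict(X_test)
    RMSE = np.sqrt(np.mean((y_test - y_pred)**2))
    f1 = f1_score(y_test, y_pred, average="macro")
    precision = precision_score(y_test, y_pred, average="macro")
    recall = recall_score(y_test, y_pred, average="macro")

    return (RMSE, f1, precision, recall)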
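
To make the four measures concrete, here is a minimal pure-Python sketch that recomputes them by hand for a tiny, made-up pair of label lists. The helper names are hypothetical. The macro averaging mirrors `average="macro"` above, that is, the unweighted mean over the two classes.

```python
import math

def prf_for_class(y_true, y_pred, cls):
    """Precision, recall and F1 when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_metrics(y_true, y_pred):
    """Return (RMSE, F1, precision, recall), macro-averaged over classes 0 and 1."""
    per_class = [prf_for_class(y_true, y_pred, c) for c in (0, 1)]
    precision = sum(m[0] for m in per_class) / 2
    recall = sum(m[1] for m in per_class) / 2
    f1 = sum(m[2] for m in per_class) / 2
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
    return rmse, f1, precision, recall

# Made-up labels: 2 of the 8 predictions are wrong
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
rmse, f1, precision, recall = macro_metrics(y_true, y_pred)
print(rmse, f1, precision, recall)
```

Because the predictions are hard 0/1 labels, each squared error is either 0 or 1, so the RMSE equals the square root of the error rate (here the square root of 2/8, which is 0.5).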

## Read data
Now, let's import an example dataset that we will use for training and testing the model.
First, we read the dataset from a CSV file and store it as a pandas DataFrame.

Then, we call the function **feature_engineering()** to pre-process and prepare the data.

In [5]:
datalink = "https://drive.google.com/uc?export=download&id=1c121feuMH0BJBWU5FDr3wAm1OMquF7pl"
df = pd.read_csv(datalink)
df.head()

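| Unnamed: 0 | Row | AnonStudentId | First Attempt | KC (Default) | Incorrects | Hints | Corrects | Opportunity |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Stu_1 | correct | KC12 | 1 | 14 | 1 | 10.0 |
| 1 | 2 | Stu_1 | unknown | | 1 | 14 | 1 | 1.0 |
| 2 | 3 | Stu_1 | unknown | KC8 | 1 | 9 | 0 | 2.0 |
| 3 | 4 | Stu_1 | unknown | KC9 | 5 | 8 | 1 | |
| 4 | 5 | Stu_1 | correct | KC10 | 1 | 8 | 1 | 4.0 |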


In [6]:
data = feature_engineering(df)
data.head()

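| Unnamed: 0 | Row | AnonStudentId | First Attempt | KCModel | IncorrectModel | TellsModel | CorrectModel | OpportunityModel | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Stu_1 | 1 | KC12 | 1 | 14 | 1 | 10.0 | 1 |
| 4 | 5 | Stu_1 | 1 | KC10 | 1 | 8 | 1 | 4.0 | 1 |
| 6 | 7 | Stu_1 | 1 | KC11 | 0 | 7 | 1 | 8.0 | 1 |
| 7 | 8 | Stu_1 | 1 | KC17 | 0 | 6 | 1 | 11.0 | 1 |
| 10 | 11 | Stu_1 | 0 | KC14 | 3 | 5 | 1 | 12.0 | 0 |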


## Define the model's function

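Here we define the logistic regression formula for AFM. Remember, AFM calculates the probability of correctness based on the student's prior opportunities in the respective Knowledge Components (KCs). In the formula below, `AnonStudentId` captures student proficiency, `KCModel` captures the easiness of each KC, and `KCModel:OpportunityModel` captures the learning rate per KC.
The function **dmatrices()** from patsy builds the X (input) and y (output) matrices that we will use for training and testing the model.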

In [None]:
# Specify the model type
modeltype = "AFM"
# Specify the model formula
y, X = dmatrices('Outcome ~ AnonStudentId + KCModel + KCModel:OpportunityModel', data, return_type="dataframe")
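
To see what this formula represents, the sketch below evaluates an AFM prediction by hand for a step that involves a single KC. All parameter values are invented for illustration. In the notebook they would correspond to coefficients that logistic regression estimates for the `AnonStudentId`, `KCModel` and `KCModel:OpportunityModel` terms.

```python
import math

def afm_probability(student_proficiency, kc_easiness, kc_learning_rate, opportunities):
    """P(correct) = sigmoid(theta + beta + gamma * T) for a single-KC step."""
    logit = student_proficiency + kc_easiness + kc_learning_rate * opportunities
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical parameter values, chosen only for illustration
theta = 0.2    # student proficiency
beta = -0.5    # KC easiness (negative means a harder KC)
gamma = 0.15   # KC learning rate per practice opportunity

# With a positive learning rate, the probability rises with each opportunity
for t in (0, 2, 5, 10):
    print(t, round(afm_probability(theta, beta, gamma, t), 3))
```

The model is additive on the logit scale: each extra opportunity adds `gamma` to the logit, regardless of whether the earlier attempts were correct.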

In [None]:
(RMSE, f1, precision, recall)=trainModel(data,modeltype,X)
print(RMSE, f1, precision, recall)
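
RMSE is often reported on predicted probabilities rather than on hard class labels. As a hedged sketch of that variant (not what `trainModel()` does), the example below fits a logistic regression on synthetic data from `make_classification` and computes both versions. The data and variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# RMSE on hard 0/1 predictions, as in trainModel()
rmse_labels = np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2))

# RMSE on the predicted probability of the positive class (column 1)
p_correct = model.predict_proba(X_te)[:, 1]
rmse_proba = np.sqrt(np.mean((y_te - p_correct) ** 2))

print(rmse_labels, rmse_proba)
```

The square of the probability-based RMSE is the Brier score, for which `brier_score_loss` is already imported at the top of the notebook.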

**QUESTION**
When splitting the dataset into train and test subsets, we followed an 80/20 split.
If we change the split to 50/50, how does this affect the performance of the model? 
Use the code below to recalculate the values for RMSE, F1, precision and recall with the 50/50 split.

In [None]:
# A test_size of 0.5 gives a 50/50 split
def trainModel5050(df,modeltype,X):

    y = df['Outcome']
    y= y.astype('int')

    X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.5, random_state=0)
    TrainTestSplitModel=LogisticRegression(max_iter=1000,penalty='l2')   
    TrainTestSplitModel.fit(X_train,y_train)

    y_pred=TrainTestSplitModel.predict(X_test)
    RMSE=np.sqrt(np.mean((y_test-y_pred)**2))
    f1=f1_score(y_test, y_pred, average="macro")
    precision=precision_score(y_test, y_pred, average="macro")
    recall=recall_score(y_test, y_pred, average="macro")    

    return (RMSE, f1, precision, recall)

(RMSE, f1, precision, recall)=trainModel5050(data,modeltype,X)
print(RMSE, f1, precision, recall)
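
A single split can give a noisy picture, so one possible complement to comparing split sizes is k-fold cross-validation, for which `KFold` and `cross_val_score` are already imported at the top of the notebook. The sketch below is an assumption about how that could look, and it runs on synthetic data rather than on the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the design matrix X and the outcome y
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Five folds: each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="f1_macro"
)

# The spread across folds indicates how much a single split can vary
print(scores.mean(), scores.std())
```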