# **Hello there!**

Let me start by saying that I am new to Kaggle and to data science in general, so if I have made any mistakes or you have any suggestions, please feel free to point them out in the comments section below.

While going through this competition and its notebooks, I came across a notebook called [0-780-unoptimized-lgbm-interesting-features](https://www.kaggle.com/zyy2016/0-780-unoptimized-lgbm-interesting-features). In it, the author talks about using **TrueSkill** features. This piqued my interest, so I explored the idea and am writing this notebook to help other beginners like me understand the concept. If you like that notebook, please do drop an upvote on it, as it helps the author reach more people and share their work.

I don't know how much this will help the competition score, but it is an interesting concept that I wanted to explore further. So here goes nothing...

# So what exactly is TrueSkill?

According to [TrueSkill.org](https://trueskill.org/):

"TrueSkill is a rating system among game players. It was developed by Microsoft Research and has been used on Xbox LIVE for ranking and matchmaking service. This system quantifies players’ TRUE skill points by the Bayesian inference algorithm. It also works well with any type of match rule including N:N team game or free-for-all."

In simple words, every participant in a competition is described by a rating. Strictly speaking, a TrueSkill rating is a pair of numbers: `mu`, the estimated skill, and `sigma`, the uncertainty about that estimate. When two participants clash, both ratings are updated according to the result. If participant A wins against participant B, A's `mu` goes up and B's `mu` goes down, and the size of the change depends on how surprising the result was and on how uncertain the two ratings were.

# So how does this relate to Riiid!?

Think of the TOEIC test as a competition rather than a test. You treat every **user_id** as a participant and every **content_id** as a participant as well. Every time a user answers a question correctly, the rating for the user_id **increases** and the rating for the content_id **decreases**. The reverse happens when the user answers a question incorrectly.
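
To make the user-versus-question idea concrete, here is a minimal sketch of a two-player TrueSkill-style update that uses only the standard library. It is a simplification and not the `trueskill` package's implementation: it assumes draws are impossible and that skills do not drift over time, so its numbers will differ slightly from what `rate_1vs1` returns. The helper names (`normal_pdf`, `normal_cdf`, `update_1vs1`) are made up for this illustration; the starting values `mu = 25` and `sigma = 25/3` are the package defaults.

```python
import math

MU = 25.0          # default starting skill estimate
SIGMA = MU / 3     # default starting uncertainty
BETA = SIGMA / 2   # default performance noise

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def update_1vs1(winner, loser):
    """Return updated (mu, sigma) pairs for the winner and the loser.

    Simplified sketch: assumes no draws and no skill drift.
    """
    mu_w, sigma_w = winner
    mu_l, sigma_l = loser
    c = math.sqrt(2 * BETA ** 2 + sigma_w ** 2 + sigma_l ** 2)
    t = (mu_w - mu_l) / c
    v = normal_pdf(t) / normal_cdf(t)   # larger when the result was unexpected
    w = v * (v + t)                     # share of uncertainty to remove
    new_winner = (mu_w + sigma_w ** 2 / c * v,
                  sigma_w * math.sqrt(1 - sigma_w ** 2 / c ** 2 * w))
    new_loser = (mu_l - sigma_l ** 2 / c * v,
                 sigma_l * math.sqrt(1 - sigma_l ** 2 / c ** 2 * w))
    return new_winner, new_loser

# A new user answers a new question correctly, so the user "wins"
user, question = (MU, SIGMA), (MU, SIGMA)
user, question = update_1vs1(winner=user, loser=question)
print("user:", user)
print("question:", question)
```

After this single correct answer the user's `mu` moves above 25 and the question's `mu` moves below it, while both `sigma` values shrink because the system has learned something about each of them.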

In [None]:
import datatable as dt
import pandas as pd
import numpy as np
from trueskill import Rating, quality_1vs1, rate_1vs1
import math
import trueskill

In [None]:
data_types_dict = {
    'timestamp': 'int64',
    'user_id': 'int32', 
    'content_id': 'int16', 
    'content_type_id':'int8', 
    'task_container_id': 'int16',
    'answered_correctly': 'int8', 
    'prior_question_elapsed_time': 'float32', 
    'prior_question_had_explanation': 'bool'
}
target = 'answered_correctly'

In [None]:
#loading the dataset using datatable
train = dt.fread('../input/riiid-test-answer-prediction/train.csv', columns=set(data_types_dict.keys())).to_pandas()
print("Data Loaded")

In [None]:
# answered_correctly is -1 for lecture rows, so this keeps only question rows
train = train[train[target] != -1].reset_index(drop=True)
print("Lecture rows removed")

In [None]:
train = train.astype(data_types_dict)
print("Basic Pre Processing Done")

In [None]:
# Keep only the last 4 interactions of each user to keep the run time manageable
train = train.groupby('user_id').tail(4)

In [None]:
users = np.unique(train["user_id"])
questions = np.unique(train["content_id"])

In [None]:
# Every user and every question starts from the default TrueSkill rating
rating_users = [Rating() for _ in users]

rating_questions = [Rating() for _ in questions]

In [None]:
dict_users = dict(zip(users,rating_users))
dict_questions = dict(zip(questions,rating_questions))

In [None]:
answers = train["answered_correctly"].values
u_temp = train["user_id"].values
q_temp = train["content_id"].values

In [None]:
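def win_probability(team1, team2):
    # Probability that team1 (here a single Rating) beats team2
    delta_mu = team1.mu - team2.mu
    sum_sigma = team1.sigma ** 2 + team2.sigma ** 2
    size = 2  # one user and one question
    # 0.05 plays the role of beta here; the default trueskill environment
    # uses beta = 25/6, so this value is a choice rather than the default
    denom = math.sqrt(size * (0.05 * 0.05) + sum_sigma)
    ts = trueskill.global_env()
    return ts.cdf(delta_mu / denom)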

The function above is taken from the notebook [0-780-unoptimized-lgbm-interesting-features](https://www.kaggle.com/zyy2016/0-780-unoptimized-lgbm-interesting-features). Do consider dropping an upvote on it.
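
For readers who want to see what this function computes without the `trueskill` package, here is a small standard-library sketch of the same formula: the standard normal CDF applied to the skill gap divided by the combined uncertainty. The name `win_probability_plain` and the `beta` parameter are illustrative choices of mine, with `beta` defaulting to the 0.05 used above.

```python
import math

def win_probability_plain(mu1, sigma1, mu2, sigma2, beta=0.05):
    """P(player 1 beats player 2) under Gaussian skill and performance noise."""
    delta_mu = mu1 - mu2
    denom = math.sqrt(2 * beta ** 2 + sigma1 ** 2 + sigma2 ** 2)
    # Standard normal CDF written with the error function
    return 0.5 * (1 + math.erf(delta_mu / denom / math.sqrt(2)))

# Two fresh ratings are evenly matched
evenly_matched = win_probability_plain(25, 25 / 3, 25, 25 / 3)
# A user rated well above a question, with little uncertainty on either side
strong_user = win_probability_plain(30, 2, 25, 2)
print(evenly_matched, strong_user)
```

Two identical ratings give exactly 0.5, and the probability moves toward 1 as the user's `mu` pulls ahead of the question's or as both `sigma` values shrink.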

In [None]:
winning_probability = []
print("Creating Feature")
print("Running this will take about 10 mins. Grab a chai :)")
for count, (user_id, content_id, answer) in enumerate(zip(u_temp, q_temp, answers), start=1):
    old_user_rating = dict_users[user_id]
    old_question_rating = dict_questions[content_id]
    # Record the prediction before updating, so the feature only uses past answers
    prob = win_probability(old_user_rating, old_question_rating)
    winning_probability.append(prob)
    if answer == 1:
        # Correct answer: the user "wins" against the question
        new_user_rating, new_question_rating = rate_1vs1(old_user_rating, old_question_rating)
    else:
        # Incorrect answer: the question "wins" against the user
        new_question_rating, new_user_rating = rate_1vs1(old_question_rating, old_user_rating)
    dict_users[user_id] = new_user_rating
    dict_questions[content_id] = new_question_rating
    if count % 1000000 == 0:
        print(count // 1000000, "million rows done")

In [None]:
train["correct_prob"] = winning_probability

In [None]:
train.head()

Currently most of the time is spent updating the dictionaries, which is why I haven't run this on the entire dataset. If you know a better way of updating them, please do drop it in the comments. To avoid rerunning the code on the entire dataset again and again, you could run it once and save the result as a CSV file in your working directory for later use.
If you liked the kernel, do drop an upvote on it. As stated before, any comments and suggestions are more than welcome.