# EDA of Eedi Dataset
## Author: Bhavana Jonnalagadda


In [None]:
# !pip install plotly 

In [3]:
import os

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
# Plot rendering settings: make figures work with HTML output, JupyterLab, and static output
pio.renderers.default = "notebook+plotly_mimetype+png"
# pio.renderers.default = "png"
%matplotlib inline

## Load and format data

We focus on the data available for tasks 3 and 4, since it is a much smaller and more workable dataset. The task descriptions below are taken from the paper "Instructions and Guide for Diagnostic Questions: The NeurIPS 2020 Education Challenge":

- **The third task** is to predict the “quality” of a question, as defined by a panel of domain experts (experienced teachers), based on the information learned from the students’ answers found in the dataset. This task requires the definition of a metric for evaluating the question quality that mimics the experts’ judgement of the question quality. This is an unsupervised task.
- **The fourth task** is to interactively generate a sequence of questions to ask a student in order to maximise the predictive accuracy of a model on their remaining answers. Specifically, a participant’s model will be provided with a set of previously-unseen students, whose answers to questions are completely hidden, and a set of potential questions to query for each student. The model will then choose a personalized question to query for each of these students in turn, and then their corresponding answer will be revealed to the model. Based on this information, the model should choose a second question to query for each student, and so on, until 10 questions have been asked in total. The aim of the task is to maximise the predictive accuracy of a participant’s model on a held-out set of questions for each student, after the model has been exposed to the 10 answers from each student.
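
To make the fourth task's query protocol concrete, here is a minimal sketch of the interaction loop on toy data. It is not the challenge's evaluation code: the names `run_query_loop`, `random_policy`, `toy_answers`, and `toy_candidates` are hypothetical, the answers are randomly generated, and a random policy stands in for whatever model a participant would actually use.

```python
import numpy as np


def run_query_loop(answers, candidate_mask, choose_question, n_queries=10):
    """Reveal up to `n_queries` answers per student, one query round at a time.

    answers: (n_students, n_questions) array of 0/1 correctness, hidden from the policy.
    candidate_mask: boolean array, True where a question may be queried.
    choose_question: callable(observed, student, candidates) -> question index.
    """
    n_students = answers.shape[0]
    observed = np.full(answers.shape, np.nan)  # NaN marks "not yet revealed"
    available = candidate_mask.copy()
    for _ in range(n_queries):
        for student in range(n_students):
            candidates = np.flatnonzero(available[student])
            if candidates.size == 0:
                continue  # nothing left to ask this student
            question = choose_question(observed, student, candidates)
            observed[student, question] = answers[student, question]
            available[student, question] = False  # never ask the same question twice
    return observed


def random_policy(rng):
    """Placeholder policy (an assumption): pick a candidate uniformly at random."""
    def choose(observed, student, candidates):
        return rng.choice(candidates)
    return choose


rng = np.random.default_rng(0)
toy_answers = rng.integers(0, 2, size=(5, 30))  # synthetic data, not Eedi data
toy_candidates = np.ones_like(toy_answers, dtype=bool)
toy_candidates[:, 20:] = False  # last 10 questions play the role of the held-out set

observed = run_query_loop(toy_answers, toy_candidates, random_policy(rng))
revealed_per_student = (~np.isnan(observed)).sum(axis=1)
print(revealed_per_student)
```

A real entry would replace `random_policy` with a model that uses the answers revealed so far in `observed` to pick the next question, and would then be scored on its predictions for the held-out columns.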

In [9]:
# Use normpath so the path also works on Windows machines
data_dir = os.path.normpath("../data/")
eedi_dir = os.path.join(data_dir, "Eedi_dataset")
eedi_metadata_dir = os.path.join(eedi_dir, "metadata")

df_train = pd.read_csv(os.path.join(eedi_dir, "train_data", "train_task_3_4.csv"))
df_answer = pd.read_csv(os.path.join(eedi_metadata_dir, "answer_metadata_task_3_4.csv"))
df_student = pd.read_csv(os.path.join(eedi_metadata_dir, "student_metadata_task_3_4.csv"))
df_question = pd.read_csv(os.path.join(eedi_metadata_dir, "question_metadata_task_3_4.csv"))
df_subject = pd.read_csv(os.path.join(eedi_metadata_dir, "subject_metadata.csv"))

print(df_train, df_answer, df_student, df_question, df_subject, sep="\n")

         QuestionId  UserId  AnswerId  IsCorrect  CorrectAnswer  AnswerValue
0               898    2111    280203          1              2            2
1               767    3062     55638          1              3            3
2               165    1156    386475          1              2            2
3               490    1653    997498          1              4            4
4               298    3912    578636          1              3            3
...             ...     ...       ...        ...            ...          ...
1382722          80    2608     57945          1              2            2
1382723         707    2549    584230          0              2            1
1382724         840    5901   1138956          1              1            1
1382725         794    3854   1151183          0              1            3
1382726         157    3184   1321883          1              3            3

[1382727 rows x 6 columns]
         AnswerId             DateAnswered  Conf