# Train Test Split

- Perform a stratified shuffle split on the QA pair dataset.
- Use the topic of each misconception as the stratification variable.
- Hold out 10% of the data for testing.
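
As a rough illustration of what stratification buys us, the sketch below runs the same kind of split on a hypothetical toy frame (`toy_df`, with made-up topic sizes of 70 / 20 / 10). It is not the project data, only a minimal example assuming scikit-learn's `StratifiedShuffleSplit`.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 rows spread over three imbalanced topics.
toy_df = pd.DataFrame({"Topic": [0] * 70 + [1] * 20 + [2] * 10})

# One shuffle split that holds out 10% of the rows.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, test_idx = next(sss.split(toy_df, toy_df["Topic"]))

# Each topic contributes to the test set in proportion to its size.
test_counts = toy_df["Topic"].iloc[test_idx].value_counts().sort_index()
print(test_counts)
```

With a plain random split, a small topic could end up entirely in the training set; stratifying on the topic keeps every topic represented in both parts.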

In [1]:
import os
os.chdir("../")

## Load Data

In [2]:
import pandas as pd

In [3]:
qa_df = pd.read_csv("data/qa-pair-datasettyjgd2rs.csv")
qa_df.head()

Unnamed: 0,QuestionId,QuestionText,SubjectId,SubjectName,ConstructId,ConstructName,AnswerText,MisconceptionId,MisconceptionName
0,0,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,33,BIDMAS,856,Use the order of operations to carry out calcu...,Does not need brackets,1672.0,"Confuses the order of operations, believes add..."
1,1,"Simplify the following, if possible: \( \frac{...",1077,Simplifying Algebraic Fractions,1612,Simplify an algebraic fraction by factorising ...,\( m+1 \),2142.0,Does not know that to factorise a quadratic ex...
2,1,"Simplify the following, if possible: \( \frac{...",1077,Simplifying Algebraic Fractions,1612,Simplify an algebraic fraction by factorising ...,\( m+2 \),143.0,Thinks that when you cancel identical terms fr...
3,1,"Simplify the following, if possible: \( \frac{...",1077,Simplifying Algebraic Fractions,1612,Simplify an algebraic fraction by factorising ...,\( m-1 \),2142.0,Does not know that to factorise a quadratic ex...
4,2,Tom and Katie are discussing the \( 5 \) plant...,339,Range and Interquartile Range from a List of Data,2774,Calculate the range from a list of data,Only\nTom,1287.0,Believes if you changed all values by the same...


In [4]:
m_df = pd.read_csv("data/misconception_dataset.csv")
m_df.head()

Unnamed: 0,MisconceptionId,MisconceptionName,Topic,Count
0,0,Does not know that angles in a triangle sum to...,3,1
1,1,Uses dividing fractions method for multiplying...,0,2
2,2,Believes there are 100 degrees in a full turn,-1,2
3,3,Thinks a quadratic without a non variable term...,16,1
4,4,Believes addition of terms and powers of terms...,14,2


## Add Topic Column to QA Pair Dataset

In [5]:
# Look up each row's topic through a MisconceptionId -> Topic Series (unmatched ids become NaN).
topic_by_id = m_df.set_index("MisconceptionId")["Topic"]
qa_df["Topic"] = qa_df["MisconceptionId"].map(topic_by_id)
qa_df.head()

Unnamed: 0,QuestionId,QuestionText,SubjectId,SubjectName,ConstructId,ConstructName,AnswerText,MisconceptionId,MisconceptionName,Topic
0,0,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,33,BIDMAS,856,Use the order of operations to carry out calcu...,Does not need brackets,1672.0,"Confuses the order of operations, believes add...",6
1,1,"Simplify the following, if possible: \( \frac{...",1077,Simplifying Algebraic Fractions,1612,Simplify an algebraic fraction by factorising ...,\( m+1 \),2142.0,Does not know that to factorise a quadratic ex...,16
2,1,"Simplify the following, if possible: \( \frac{...",1077,Simplifying Algebraic Fractions,1612,Simplify an algebraic fraction by factorising ...,\( m+2 \),143.0,Thinks that when you cancel identical terms fr...,0
3,1,"Simplify the following, if possible: \( \frac{...",1077,Simplifying Algebraic Fractions,1612,Simplify an algebraic fraction by factorising ...,\( m-1 \),2142.0,Does not know that to factorise a quadratic ex...,16
4,2,Tom and Katie are discussing the \( 5 \) plant...,339,Range and Interquartile Range from a List of Data,2774,Calculate the range from a list of data,Only\nTom,1287.0,Believes if you changed all values by the same...,24
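
The topic lookup can also be tried on miniature tables. The sketch below uses hypothetical frames `toy_qa` and `toy_m`, filled with the id/topic pairs visible in the output above, and counts rows whose misconception id has no topic. It assumes misconception ids are unique in the lookup table.

```python
import pandas as pd

# Hypothetical miniature tables, using id/topic pairs visible in the output above.
toy_qa = pd.DataFrame({"MisconceptionId": [1672.0, 2142.0, 143.0, 2142.0]})
toy_m = pd.DataFrame({"MisconceptionId": [143, 1672, 2142], "Topic": [0, 6, 16]})

# Build a MisconceptionId -> Topic lookup and apply it to every row.
topic_by_id = toy_m.set_index("MisconceptionId")["Topic"]
toy_qa["Topic"] = toy_qa["MisconceptionId"].map(topic_by_id)

# Rows whose id is missing from the lookup would show up as NaN here.
unmatched = int(toy_qa["Topic"].isna().sum())
print(toy_qa)
print("unmatched rows:", unmatched)
```

Checking `unmatched` before splitting is useful because `StratifiedShuffleSplit` needs a valid label for every row.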


In [6]:
qa_df["Topic"].value_counts()

Topic
-1     1154
 0      405
 1      400
 6      302
 5      226
 3      186
 7      173
 9      172
 16     142
 13     141
 4      129
 8      120
 2      119
 21      83
 10      77
 18      67
 19      61
 12      58
 20      56
 17      55
 14      48
 11      46
 15      39
 25      35
 24      29
 23      24
 22      23
Name: count, dtype: int64
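
Before splitting, it is worth checking that even the smallest topic is large enough to appear in a 10% test set. The sketch below re-enters the counts printed above as a Series (`topic_counts` is just a local name for this check) and estimates the per-topic test size. The estimate is a simple proportion, so the exact allocation chosen by scikit-learn may differ by a row.

```python
import pandas as pd

# Topic counts copied from the value_counts() output above.
topic_counts = pd.Series({
    -1: 1154, 0: 405, 1: 400, 6: 302, 5: 226, 3: 186, 7: 173, 9: 172,
    16: 142, 13: 141, 4: 129, 8: 120, 2: 119, 21: 83, 10: 77, 18: 67,
    19: 61, 12: 58, 20: 56, 17: 55, 14: 48, 11: 46, 15: 39, 25: 35,
    24: 29, 23: 24, 22: 23,
})

test_size = 0.1
total = int(topic_counts.sum())

# Proportional estimate of how many rows of each topic land in the test set.
expected_test = topic_counts * test_size

smallest_topic = topic_counts.idxmin()
largest_share = topic_counts.max() / total

print("total rows:", total)
print("smallest topic:", smallest_topic, "with", topic_counts.min(), "rows")
print("expected test rows for smallest topic:", expected_test.min())
print("share of largest topic:", round(largest_share, 3))
```

Every topic has at least 23 rows, so each one should contribute roughly two or more rows to the test set.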

## Train Test Split

In [7]:
from src.constants.column_names import QAPairCSVColumns

In [8]:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
# split() yields positional indices, so translate them to index labels before using .loc.
train_index, test_index = next(sss.split(qa_df, qa_df["Topic"]))
qa_df.loc[qa_df.index[test_index], QAPairCSVColumns.SPLIT] = "test"
qa_df.loc[qa_df.index[train_index], QAPairCSVColumns.SPLIT] = "train"

qa_df[QAPairCSVColumns.SPLIT].value_counts()

Split
train    3933
test      437
Name: count, dtype: int64
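
One detail to keep in mind: `StratifiedShuffleSplit.split` returns row positions, not index labels. The two coincide here because `qa_df` has the default index from `read_csv`. The sketch below uses a hypothetical `toy_df` whose index starts at 100 to show a position-based assignment that does not rely on that coincidence.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy frame whose index labels (100..119) differ from row positions (0..19).
toy_df = pd.DataFrame({"Topic": [0] * 10 + [1] * 10}, index=range(100, 120))

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_pos, test_pos = next(sss.split(toy_df, toy_df["Topic"]))

# Default every row to "train", then mark the test rows by position with .iloc.
toy_df["Split"] = "train"
toy_df.iloc[test_pos, toy_df.columns.get_loc("Split")] = "test"

print(toy_df["Split"].value_counts())
print(toy_df.groupby("Topic")["Split"].value_counts())
```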

In [9]:
count_df = qa_df.groupby("Topic", as_index=False)["Split"].value_counts()
count_df.head()

Unnamed: 0,Topic,Split,count
0,-1,train,1039
1,-1,test,115
2,0,train,365
3,0,test,40
4,1,train,360


In [10]:
import plotly.express as px
fig = px.bar(count_df, x="Topic", y="count", color="Split", barmode="stack", text="count")
fig.update_layout(title="Train Test Split", xaxis_title="Topic", yaxis_title="Count", title_x=0.5, width=1000, height=600)
fig.show()

## Logging to W&B


In [11]:
from src.constants.wandb_project import WandbProject
from src.utils.wandb_artifact import log_dataframe_artifact
import wandb

In [12]:
wandb.init(project=WandbProject.PROJECT_NAME, job_type="dataset-upload")

log_dataframe_artifact(
    qa_df,
    artifact_name=WandbProject.QA_PAIR_DATASET_NAME,
    artifact_type="dataset",
    description="""
    QA pair dataset.

    Each row of the dataset contains the following columns:
    - `QuestionId`: Id of the question.
    - `QuestionText`: Text of the question.
    - `SubjectId`: Id of the subject.
    - `SubjectName`: Name of the subject.
    - `ConstructId`: Id of the construct.
    - `ConstructName`: Name of the construct.
    - `AnswerText`: Text of the answer.
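    - `MisconceptionId`: Id of the misconception.
    - `MisconceptionName`: Name of the misconception.
    - `Topic`: Topic of the misconception, taken from the misconception dataset.
    - `Split`: Dataset split, either `train` or `test`.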
    """,
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mshakleenishfar[0m. Use [1m`wandb login --relogin`[0m to force relogin


<Artifact qa-pair-dataset>