# Data Summary

- 4 different assignments were given to students
- The texts were evaluated in two categories, "content" and "wording"
- There are no repeated student IDs

# Training Approach

- Main Task: Regression on the content and wording scores
- Auxiliary Task: Predict the assignment
    - Learn features specific to each assignment


## Data Summary

We are working with student assignment data. The main aim is to predict the evaluation scores for student texts based on their content and wording. The available data is split into training and test sets.

### Dataset Structure

- `prompts_train` & `prompts_test`: Contain the prompts (assignments) given to students.
- `summaries_train` & `summaries_test`: Contain the student summaries written in response to the prompts. `summaries_train` also includes the evaluation scores for content and wording.

### Data Loading & Inspection

In [13]:
import pandas as pd

# Load datasets
prompts_test = pd.read_csv("../data/prompts_test.csv")
prompts_train = pd.read_csv("../data/prompts_train.csv")
summaries_test = pd.read_csv("../data/summaries_test.csv")
summaries_train = pd.read_csv("../data/summaries_train.csv")

Inspecting the `prompts_train` dataset:

In [14]:
prompts_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   prompt_id        4 non-null      object
 1   prompt_question  4 non-null      object
 2   prompt_title     4 non-null      object
 3   prompt_text      4 non-null      object
dtypes: object(4)
memory usage: 256.0+ bytes


This dataset has 4 columns and 4 entries, suggesting there were 4 different assignments given.

Inspecting the descriptive statistics of the `summaries_train` dataset:

In [15]:
summaries_train.describe()

| | content | wording |
|---|---|---|
| count | 7165.0 | 7165.0 |
| mean | -0.014853 | -0.063072 |
| std | 1.043569 | 1.036048 |
| min | -1.729859 | -1.962614 |
| 25% | -0.799545 | -0.87272 |
| 50% | -0.093814 | -0.081769 |
| 75% | 0.49966 | 0.503833 |
| max | 3.900326 | 4.310693 |


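Both `content` and `wording` have a mean close to 0 and a standard deviation close to 1, which suggests the scores were standardized. The ranges are asymmetric (minimums around -1.7 and -2.0, maximums around 3.9 and 4.3), so both distributions appear to have a longer right tail.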
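To go one step beyond `describe()`, the sketch below checks skewness and the correlation between the two scores. It is only an illustration: the `scores` frame is a toy sample built from five score pairs in `summaries_train`, so the printed numbers say nothing about the full dataset. On the real data, `summaries_train[["content", "wording"]]` would take its place.

```python
import pandas as pd

# Toy sample (assumption): five content/wording pairs from summaries_train.
# Replace with summaries_train[["content", "wording"]] for the real check.
scores = pd.DataFrame({
    "content": [0.205683, -0.548304, 3.128928, -0.210614, 3.272894],
    "wording": [0.380538, 0.506755, 4.231226, -0.471415, 3.219757],
})

# Positive skewness indicates a longer right tail.
skewness = scores.skew()

# Pearson correlation between the two regression targets.
correlation = scores["content"].corr(scores["wording"])

print(skewness)
print(f"Pearson correlation: {correlation:.3f}")
```

A strong correlation between the two targets would be one argument for predicting them with a shared model rather than two independent ones.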

Taking a closer look at the first few rows. Because a notebook cell displays only its last expression, the output below shows `summaries_train.head()` only:

In [16]:
prompts_train.head()
summaries_train.head()

| | student_id | prompt_id | text | content | wording |
|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 |
| 1 | 0020ae56ffbf | ebad26 | They would rub it up with soda to make the sme... | -0.548304 | 0.506755 |
| 2 | 004e978e639e | 3b9047 | In Egypt, there were many occupations and soci... | 3.128928 | 4.231226 |
| 3 | 005ab0199905 | 3b9047 | The highest class was Pharaohs these people we... | -0.210614 | -0.471415 |
| 4 | 0070c9e7af47 | 814d6b | The Third Wave developed  rapidly because the ... | 3.272894 | 3.219757 |
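
Since both tables carry a `prompt_id` column, each summary can be linked to its prompt. The following is a minimal sketch under stated assumptions: `summaries` reuses the five rows shown above (without the `text` column), while `prompts` is a placeholder table whose titles are hypothetical stand-ins, not the real `prompt_title` values.

```python
import pandas as pd

# The five rows shown above, with the text column omitted for brevity.
summaries = pd.DataFrame({
    "student_id": ["000e8c3c7ddb", "0020ae56ffbf", "004e978e639e",
                   "005ab0199905", "0070c9e7af47"],
    "prompt_id": ["814d6b", "ebad26", "3b9047", "3b9047", "814d6b"],
    "content": [0.205683, -0.548304, 3.128928, -0.210614, 3.272894],
    "wording": [0.380538, 0.506755, 4.231226, -0.471415, 3.219757],
})

# Placeholder prompt table (assumption): IDs come from the rows above,
# titles are hypothetical.
prompts = pd.DataFrame({
    "prompt_id": ["814d6b", "ebad26", "3b9047"],
    "prompt_title": ["title A", "title B", "title C"],
})

# Many summaries map to one prompt; validate raises if prompt_id
# is duplicated on the prompts side.
merged = summaries.merge(prompts, on="prompt_id", how="left",
                         validate="many_to_one")

# Number of summaries and mean scores per prompt.
per_prompt = merged.groupby("prompt_id")[["content", "wording"]].agg(["count", "mean"])
print(per_prompt)
```

With the real files, the same merge on `summaries_train` and `prompts_train` would show how the summaries are distributed across the 4 prompts.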


Finally, checking the number of unique student IDs:

In [17]:
unique_student_ids = summaries_train['student_id'].nunique()
print(f"Number of unique student IDs: {unique_student_ids}")

Number of unique student IDs: 7165
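
The unique count only rules out repeats once it is compared with the row count (7165, per `describe()` above). A small sketch of making that comparison explicit follows. The `summaries` frame here is a toy stand-in holding the five IDs from the `head()` output, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in (assumption): the five student IDs from the head() output.
# On the real data, use summaries_train instead.
summaries = pd.DataFrame({
    "student_id": ["000e8c3c7ddb", "0020ae56ffbf", "004e978e639e",
                   "005ab0199905", "0070c9e7af47"],
})

n_rows = len(summaries)
n_unique = summaries["student_id"].nunique()

# duplicated() marks every occurrence after the first as True.
has_duplicates = summaries["student_id"].duplicated().any()

print(f"rows: {n_rows}, unique IDs: {n_unique}, duplicates: {has_duplicates}")
```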


## Preliminary Observations

- We have 4 different assignments given to students.
- Texts are evaluated in two categories: "content" and "wording".
- There are no repeated student IDs in the training dataset (7165 unique IDs across 7165 rows), so each student wrote exactly one summary.
- The main task is to perform regression on the `content` and `wording` scores.
- An auxiliary task could be to predict the specific assignment given a student's response. This might help the model recognize specific features or nuances associated with each assignment.
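
One common way to combine a main and an auxiliary task is a weighted sum of their losses. The sketch below illustrates that idea in plain Python and is not the training code of this notebook: the function names, the choice of mean squared error and cross-entropy, and the `aux_weight` value are all assumptions made for illustration.

```python
import math

def mse(preds, targets):
    """Mean squared error over the regression targets (content, wording)."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(logits, label):
    """Cross-entropy of one example, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def multitask_loss(score_preds, score_targets, prompt_logits, prompt_label,
                   aux_weight=0.1):
    """Main regression loss plus a down-weighted auxiliary prompt loss.

    aux_weight is a hypothetical hyperparameter; 0.0 disables the
    auxiliary task entirely.
    """
    main = mse(score_preds, score_targets)
    aux = cross_entropy(prompt_logits, prompt_label)
    return main + aux_weight * aux

# Example: predicted vs. true (content, wording), and logits over 4 prompts.
loss = multitask_loss(
    score_preds=[0.1, 0.3],
    score_targets=[0.205683, 0.380538],
    prompt_logits=[2.0, 0.1, -1.0, 0.3],
    prompt_label=0,
)
print(f"combined loss: {loss:.4f}")
```

Keeping `aux_weight` small lets the prompt classifier shape the shared features without dominating the regression objective. The right value would need to be tuned on validation data.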