# Linking Writing Processes to Writing Quality
## Use typing behavior to predict essay quality

In [2]:
import pandas as pd

df_train = pd.read_csv('Dataset/train_logs.csv')
df_test = pd.read_csv('Dataset/test_logs.csv')
df_train_scores = pd.read_csv('Dataset/train_scores.csv')

### Training dataset summary statistics
- Examine summary stats
- Categorical variables that will need encoding (and their cardinality)
- Null/empty values

In [3]:
df_train

Unnamed: 0,id,event_id,down_time,up_time,action_time,activity,down_event,up_event,text_change,cursor_position,word_count
0,001519c8,1,4526,4557,31,Nonproduction,Leftclick,Leftclick,NoChange,0,0
1,001519c8,2,4558,4962,404,Nonproduction,Leftclick,Leftclick,NoChange,0,0
2,001519c8,3,106571,106571,0,Nonproduction,Shift,Shift,NoChange,0,0
3,001519c8,4,106686,106777,91,Input,q,q,q,1,1
4,001519c8,5,107196,107323,127,Input,q,q,q,2,1
...,...,...,...,...,...,...,...,...,...,...,...
8405893,fff05981,3615,2063944,2064440,496,Nonproduction,Leftclick,Leftclick,NoChange,1031,240
8405894,fff05981,3616,2064497,2064497,0,Nonproduction,Shift,Shift,NoChange,1031,240
8405895,fff05981,3617,2064657,2064765,108,Replace,q,q,q => q,1031,240
8405896,fff05981,3618,2069186,2069259,73,Nonproduction,Leftclick,Leftclick,NoChange,1028,240


In [4]:
df_train.describe()

Unnamed: 0,event_id,down_time,up_time,action_time,cursor_position,word_count
count,8405898.0,8405898.0,8405898.0,8405898.0,8405898.0,8405898.0
mean,2067.649,793560.3,793658.4,98.08498,1222.964,231.4687
std,1588.284,514945.1,514942.8,253.3985,948.5242,175.9088
min,1.0,106.0,252.0,0.0,0.0,0.0
25%,852.0,373184.2,373282.0,66.0,499.0,96.0
50%,1726.0,720886.0,720980.0,93.0,1043.0,200.0
75%,2926.0,1163042.0,1163141.0,122.0,1706.0,327.0
max,12876.0,8313630.0,8313707.0,447470.0,7802.0,1326.0


In [5]:
column_types = df_train.dtypes
categorical_variables = ['activity', 'down_event', 'up_event', 'text_change']
column_types

id                 object
event_id            int64
down_time           int64
up_time             int64
action_time         int64
activity           object
down_event         object
up_event           object
text_change        object
cursor_position     int64
word_count          int64
dtype: object

As seen above, out of the 11 columns in our training data, 4 are categorical and 7 are not (id has an object dtype, but we don't regard it as a categorical variable).

Below we collect the cardinality of each categorical variable in a dictionary. There are 50 distinct activity categories, 131 distinct down_event categories, 130 distinct up_event categories and 4111 distinct text_change categories.

In [6]:
cardinalities = {}
for v in categorical_variables:
    cardinalities[v] = len(df_train[v].unique())
cardinalities

{'activity': 50, 'down_event': 131, 'up_event': 130, 'text_change': 4111}

In [7]:
nulls = df_train.isnull().sum()
nulls

id                 0
event_id           0
down_time          0
up_time            0
action_time        0
activity           0
down_event         0
up_event           0
text_change        0
cursor_position    0
word_count         0
dtype: int64

The id column identifies the essay that each typing event was registered for. Counting the unique ids gives the number of essays in the dataset, which we will need when engineering the dataset prior to training.

In [8]:
number_of_essays = df_train['id'].unique()
print('Number of essays in the dataset: ', len(number_of_essays))

Number of essays in the dataset:  2471


### Preprocessing

In [None]:
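# DATA CLEANING

# Impute null values with the column mean.
# Only numeric columns are used: the mean is undefined for object columns
# such as id or activity.
for column in df_train.select_dtypes(include='number').columns:
    column_mean = df_train[column].mean()
    df_train[column] = df_train[column].fillna(column_mean)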

In [1]:
# SCALING AND NORMALISATION
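
The scaling cell above is still empty. As a minimal sketch of what it could contain, the block below applies z-score standardisation and min-max normalisation using plain pandas. The toy frame `features` and its values are made up for illustration and only borrow column names from the logs. Which columns to scale, and whether to scale raw events or per-essay features, is an assumption left open here.

```python
import pandas as pd

# Toy numeric frame standing in for the numeric columns of the data
# (hypothetical values, for illustration only).
features = pd.DataFrame({
    "action_time": [31.0, 404.0, 0.0, 91.0, 127.0],
    "word_count": [0.0, 0.0, 0.0, 1.0, 1.0],
})

# Z-score standardisation: subtract the column mean and divide by the
# column standard deviation (population std, ddof=0).
standardised = (features - features.mean()) / features.std(ddof=0)

# Min-max normalisation: rescale every column to the [0, 1] range.
normalised = (features - features.min()) / (features.max() - features.min())

print(standardised.round(3))
print(normalised.round(3))
```

Both formulas divide by a per-column spread, so a constant column would produce NaN values and should be dropped or handled separately. The statistics should also be computed on the training data only and then reused for the test data.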

### EDA

In [13]:
# Due to the size of the training data, choose a 10% random subset to perform EDA on.
# sample() already returns a new DataFrame, so no explicit copy is needed.
df_train_eda = df_train.sample(frac=0.1, random_state=42)

### Feature engineering

As mentioned, each typing event belongs to a specific essay, so an essay can be described by the typing events that produced it.
A natural way to structure the data is to group the typing events by essay id, that is, to aggregate the logs. Each essay can then be represented by features computed from its typing events, which gives one row per essay in the training dataset.
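
The grouping described above can be sketched with a pandas `groupby` and named aggregations. This is only an illustration under stated assumptions: the toy frame `logs` stands in for df_train with made-up values, and the feature names (`n_events`, `mean_action_time`, and so on) are hypothetical choices, not a fixed feature set.

```python
import pandas as pd

# Toy stand-in for df_train, using column names from the logs above
# (values are made up for illustration only).
logs = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "event_id": [1, 2, 3, 1, 2],
    "action_time": [30, 90, 120, 50, 70],
    "activity": ["Nonproduction", "Input", "Input", "Input", "Replace"],
    "word_count": [0, 1, 1, 1, 0],
})

# One row per essay: aggregate the event-level columns by essay id.
essay_features = logs.groupby("id").agg(
    n_events=("event_id", "count"),              # number of logged events
    mean_action_time=("action_time", "mean"),    # average key/click duration
    total_action_time=("action_time", "sum"),    # total time spent in actions
    final_word_count=("word_count", "last"),     # word count at the last event
    n_input_events=("activity", lambda s: (s == "Input").sum()),
).reset_index()

print(essay_features)
```

The resulting frame has one row per essay id, so it could later be joined with the score data on the essay identifier, assuming the scores file uses the same id values.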