# 4. Feature Engineering

   - Explore potential features from the text, such as summary length, unique word count, etc.
   - Analyze prompt texts to see if they can offer additional features.
   - Discuss and implement feature extraction methods together.


## Previous code

In [2]:
import pandas as pd
from transformers import BertTokenizer


# Load datasets
prompts_test = pd.read_csv("../data/prompts_test.csv")
prompts_train = pd.read_csv("../data/prompts_train.csv")
summaries_test = pd.read_csv("../data/summaries_test.csv")
summaries_train = pd.read_csv("../data/summaries_train.csv")

# Drop student_id column from summaries_train and summaries_test
summaries_train = summaries_train.drop(columns=['student_id'])
summaries_test = summaries_test.drop(columns=['student_id'])

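# Map each prompt_id to an integer index. The mapping is built from the training prompts only;
# IDs absent from the mapping are left unchanged by replace().
id_mapping = {id_val: idx for idx, id_val in enumerate(prompts_train['prompt_id'].unique())}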

summaries_train['prompt_id'] = summaries_train['prompt_id'].replace(id_mapping)
summaries_test['prompt_id'] = summaries_test['prompt_id'].replace(id_mapping)

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the 'text' column
texts = summaries_train['text'].tolist()
tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)

In [7]:
summaries_train['summary_length'] = summaries_train['text'].apply(lambda x: len(x.split()))
print(summaries_train['summary_length'].head())

0     61
1     52
2    235
3     25
4    203
Name: summary_length, dtype: int64


In [8]:
summaries_train['unique_word_count'] = summaries_train['text'].apply(lambda x: len(set(x.split())))
print(summaries_train['unique_word_count'].head())

0     51
1     38
2    149
3     19
4    138
Name: unique_word_count, dtype: int64


In [9]:
# max(..., 1) avoids a ZeroDivisionError for empty or whitespace-only summaries
summaries_train['avg_word_length'] = summaries_train['text'].apply(lambda x: sum(len(word) for word in x.split()) / max(len(x.split()), 1))
print(summaries_train['avg_word_length'].head())

0    4.688525
1    3.711538
2    4.834043
3    5.320000
4    5.024631
Name: avg_word_length, dtype: float64


In [10]:
# Restrict to numeric columns; the 'text' column cannot be correlated
correlation_matrix = summaries_train.corr(numeric_only=True)
print(correlation_matrix)

                   prompt_id   content   wording  summary_length  \
prompt_id           1.000000  0.006426 -0.016128        0.091246   
content             0.006426  1.000000  0.751380        0.792626   
wording            -0.016128  0.751380  1.000000        0.536343   
summary_length      0.091246  0.792626  0.536343        1.000000   
unique_word_count   0.095362  0.806767  0.544271        0.981951   
avg_word_length    -0.332255  0.187802  0.156207        0.099059   

                   unique_word_count  avg_word_length  
prompt_id                   0.095362        -0.332255  
content                     0.806767         0.187802  
wording                     0.544271         0.156207  
summary_length              0.981951         0.099059  
unique_word_count           1.000000         0.140658  
avg_word_length             0.140658         1.000000  




## Correlation Analysis

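Strong positive correlation: summary length (word count) and unique word count. Both correlate strongly with `content` (r ≈ 0.79 and 0.81) and moderately with `wording` (r ≈ 0.54 for both). The two features are nearly collinear with each other (r ≈ 0.98), so they carry largely the same information.
Weak positive correlation: average word length (r ≈ 0.19 with `content`, r ≈ 0.16 with `wording`).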

## Multi-task candidates

Readability Scores
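
As one possible starting point, a readability score can be computed per summary without extra dependencies. The sketch below uses the Flesch Reading Ease formula; the helper names `count_syllables` and `flesch_reading_ease` are hypothetical, and the syllable counter is a rough vowel-group heuristic rather than a dictionary-based count, so scores are approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        # Assumption: score empty text as 0.0 instead of raising an error
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

It could be applied in the same way as the other features, for example `summaries_train['text'].apply(flesch_reading_ease)`.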

Grammatical Errors

Dependency Parsing:
Analyze sentence structures to see if certain patterns are more common in high-scoring responses.

Semantic Similarity:
Measure how similar student summaries are to the original prompt or a given reference summary. This can give insights into how closely students stuck to the original topic.
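
A simple baseline for this idea is cosine similarity over bag-of-words counts. Note that this measures lexical overlap, not meaning, so it is only a proxy for semantic similarity; embeddings from a language model would be a closer fit. The function names below are hypothetical, and the tokenization (lowercased alphanumeric runs) is an assumption made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    # Only tokens present in both texts contribute to the dot product
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty text is treated as having zero similarity
        return 0.0
    return dot / (norm_a * norm_b)
```

Using this as a feature would require joining each summary to its prompt text on `prompt_id`. The name of the column holding the prompt text is not shown in this notebook, so any name used for it would be an assumption.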