# CommonLit - Evaluate Student Summaries Dataset with TensorFlow Decision Forests

This notebook walks you through how to train a baseline Random Forest model using TensorFlow Decision Forests on the **CommonLit - Evaluate Student Summaries** dataset made available for this competition.

Roughly, the code will look as follows:

```
import tensorflow_decision_forests as tfdf
import pandas as pd

dataset = pd.read_csv("project/dataset.csv")
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="my_label")

model = tfdf.keras.RandomForestModel()
model.fit(tf_dataset)

model.summary()  # prints the summary itself; wrapping it in print() would also print "None"
```

Decision Forests are a family of tree-based models that includes Random Forests and Gradient Boosted Trees. They are a good place to start when working with tabular data: they often outperform neural networks, or at least provide a strong baseline before you begin experimenting with them.

# Import the libraries

In [1]:
! python --version

Python 3.9.18


In [2]:
! pip install torch



In [3]:
import string
import re
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import textstat
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import seaborn as sns
import matplotlib.pyplot as plt
#from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import torch
from sklearn.metrics import mean_squared_error
from textblob import TextBlob

from sklearn.model_selection import KFold, GroupKFold
from tqdm import tqdm

from typing import List
import warnings
import logging
import os
import shutil
import json
import transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from datasets import Dataset,load_dataset, load_from_disk
from transformers import TrainingArguments, Trainer
from datasets import load_metric, disable_progress_bar

import nltk
from nltk import sent_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from collections import Counter
import spacy
from autocorrect import Speller
from spellchecker import SpellChecker
import lightgbm as lgb

warnings.simplefilter("ignore")
logging.disable(logging.ERROR)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
disable_progress_bar()
tqdm.pandas()


In [4]:
print("TensorFlow v" + tf.__version__)
# print("TensorFlow Decision Forests v" + tfdf.__version__)

TensorFlow v2.14.0


In [5]:
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch
    
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
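    # benchmark mode lets cuDNN pick algorithms non-deterministically, so disable it for reproducibility
    torch.backends.cudnn.benchmark = False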
    
seed_everything(seed=42)

# Load the Dataset

### Load the prompt csv

In [6]:
df_train_prompt = pd.read_csv('data/prompts_train.csv')
print("Full prompt train dataset shape is {}".format(df_train_prompt.shape))

Full prompt train dataset shape is (4, 4)


The prompt data has 4 rows and 4 columns. We can preview every row and column with the following code:

In [7]:
df_train_prompt.head()

Unnamed: 0,prompt_id,prompt_question,prompt_title,prompt_text
0,39c16e,Summarize at least 3 elements of an ideal trag...,On Tragedy,Chapter 13 \r\nAs the sequel to what has alrea...
1,3b9047,"In complete sentences, summarize the structure...",Egyptian Social Structure,Egyptian society was structured like a pyramid...
2,814d6b,Summarize how the Third Wave developed over su...,The Third Wave,Background \r\nThe Third Wave experiment took ...
3,ebad26,Summarize the various ways the factory would u...,Excerpt from The Jungle,"With one member trimming beef in a cannery, an..."


### Load the summaries csv

In [8]:
df_train_summaries = pd.read_csv('data/summaries_train.csv')
print("Full summaries train dataset shape is {}".format(df_train_summaries.shape))

Full summaries train dataset shape is (7165, 5)


The summaries data has 7165 rows and 5 columns. We can preview all 5 columns by printing the first 5 rows with the following code:

In [9]:
df_train_summaries.head()

Unnamed: 0,student_id,prompt_id,text,content,wording
0,000e8c3c7ddb,814d6b,The third wave was an experimentto see how peo...,0.205683,0.380538
1,0020ae56ffbf,ebad26,They would rub it up with soda to make the sme...,-0.548304,0.506755
2,004e978e639e,3b9047,"In Egypt, there were many occupations and soci...",3.128928,4.231226
3,005ab0199905,3b9047,The highest class was Pharaohs these people we...,-0.210614,-0.471415
4,0070c9e7af47,814d6b,The Third Wave developed rapidly because the ...,3.272894,3.219757


In [10]:
combi = df_train_prompt.merge(df_train_summaries, how="left", on="prompt_id")
# saving the dataframe
combi.to_csv('merge_train.csv')


# Preprocess the data

In [11]:
"""
to do next on sunday:
1. pos: part of speech
2. Jaccard Similarity
3. Average Word Length: Calculate the average word length in the summary and the prompt.
4. Average Sentence Length: Compute the average sentence length in the summary and the prompt.
5. Keyword Matching : Identify and count specific keywords or phrases related to the prompt that appear in the summary.
6. NER : Identify and count named entities in both the prompt and the summary.
7. Apply topic modeling techniques (e.g., LDA) to identify and compare the main topics in the prompt and the summary.
8. Semantic similarity: Compute semantic similarity scores (e.g., Word Mover's Distance) between the prompt and the summary.
9. Use readability metrics (e.g., Flesch-Kincaid, Gunning Fog Index) to measure the readability of both the prompt and the summary.

"""


tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

twd = TreebankWordDetokenizer()
STOP_WORDS = set(stopwords.words('english'))

spacy_ner_model = spacy.load('en_core_web_sm',)
speller = Speller(lang='en')
spellchecker = SpellChecker() 

def word_overlap_count(row):
    """Count the distinct non-stop-word tokens shared by prompt_text and text."""
    def check_is_stop_word(word):
        return word in STOP_WORDS

    prompt_words = row['prompt_tokens']
    summary_words = row['summary_tokens']
    if STOP_WORDS:
        # Drop stop words so that only content words count toward the overlap
        prompt_words = [word for word in prompt_words if not check_is_stop_word(word)]
        summary_words = [word for word in summary_words if not check_is_stop_word(word)]
    return len(set(prompt_words).intersection(set(summary_words)))
            
def ngrams(token, n):
    # Use the zip function to slide a window of n consecutive tokens over the list,
    # then concatenate each window into a space-separated n-gram string
    ngrams = zip(*[token[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

def ngram_co_occurrence(row, n) -> int:
    # Tokenize the original text and summary into words
    original_tokens = row['prompt_tokens']
    summary_tokens = row['summary_tokens']

    # Generate n-grams for the original text and summary
    original_ngrams = set(ngrams(original_tokens, n))
    summary_ngrams = set(ngrams(summary_tokens, n))

    # Calculate the number of common n-grams
    common_ngrams = original_ngrams.intersection(summary_ngrams)
    return len(common_ngrams)
    
def ner_overlap_count(row, mode):
    model = spacy_ner_model
    def clean_ners(ner_list):
        return set([(ner[0].lower(), ner[1]) for ner in ner_list])
    prompt = model(row['prompt_text'])
    summary = model(row['text'])

    if "spacy" in str(model):
        prompt_ner = set([(token.text, token.label_) for token in prompt.ents])
        summary_ner = set([(token.text, token.label_) for token in summary.ents])
    elif "stanza" in str(model):
        prompt_ner = set([(token.text, token.type) for token in prompt.ents])
        summary_ner = set([(token.text, token.type) for token in summary.ents])
    else:
        raise Exception("Model not supported")

    prompt_ner = clean_ners(prompt_ner)
    summary_ner = clean_ners(summary_ner)
    
    intersecting_ners = prompt_ner.intersection(summary_ner)
        
    ner_dict = dict(Counter([ner[1] for ner in intersecting_ners]))
    
    if mode == "train":
        return ner_dict
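    elif mode == "test":
        # NOTE: `ner_keys` is not defined anywhere in this notebook; it must be set
        # before calling this function with mode="test"
        return {key: ner_dict.get(key) for key in ner_keys}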


def quotes_count(row):
    summary = row['text']
    text = row['prompt_text']
    quotes_from_summary = re.findall(r'"([^"]*)"', summary)
    if len(quotes_from_summary)>0:
        return [quote in text for quote in quotes_from_summary].count(True)
    else:
        return 0
        
def spelling(text):

    wordlist= text.split()
    amount_miss = len(list(spellchecker.unknown(wordlist)))

    return amount_miss
    
def add_spelling_dictionary(tokens: List[str]) -> List[str]:
    """dictionary update for pyspell checker and autocorrect"""
    spellchecker.word_frequency.load_words(tokens)
    speller.nlp_data.update({token:1000 for token in tokens})
    
    
####### new method
def count_sentences(text):
    sentences = sent_tokenize(text)
    return len(sentences)

def count_stopwords(text: str) -> int:
    stopword_list = set(stopwords.words('english'))
    words = text.split()
    stopwords_count = sum(1 for word in words if word.lower() in stopword_list)
    return stopwords_count

# Count the punctuations in the text.
# punctuation_set -> !"#$%&'()*+, -./:;<=>?@[\]^_`{|}~
def count_punctuation(text: str) -> int:
    punctuation_set = set(string.punctuation)
    punctuation_count = sum(1 for char in text if char in punctuation_set)
    return punctuation_count

# Count the numbers (runs of consecutive digits) in the text.
def count_numbers(text: str) -> int:
    numbers = re.findall(r'\d+', text)
    numbers_count = len(numbers)
    return numbers_count

# Function to extract POS tags for a given text
def extract_pos_tags(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    pos_tag_list = [tag for _, tag in pos_tags]
    return pos_tag_list


def extract_keywords(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    
    # Define relevant parts of speech for keywords (e.g., nouns, adjectives)
    relevant_pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS']
    
    keywords = [word for word, pos in pos_tags if pos in relevant_pos]
    
    return keywords

# Function to count keyword matches in the summary
def count_keyword_matches(prompt_keywords, summary):
    summary = summary.lower()  # Convert summary to lowercase for case-insensitive matching
    count = 0
    for keyword in prompt_keywords:
        if keyword.lower() in summary:
            count += 1
    return count


# Function to calculate average sentence length
def calculate_average_sentence_length(text):
    sentences = sent_tokenize(text)  # Tokenize text into sentences
    sentence_lengths = [len(sentence.split()) for sentence in sentences]  # Calculate word count for each sentence
    if len(sentence_lengths) > 0:
        return sum(sentence_lengths) / len(sentence_lengths)  # Calculate average sentence length
    else:
        return 0  # Return 0 if there are no sentences

# Function to calculate average word length
def calculate_average_word_length(text):
    words = text.split()  # Split text into words
    word_lengths = [len(word) for word in words]  # Calculate the length of each word
    if len(word_lengths) > 0:
        return sum(word_lengths) / len(word_lengths)  # Calculate average word length
    else:
        return 0  # Return 0 if there are no words




# This function applies all the above preprocessing functions on a text feature  

def run(prompts: pd.DataFrame, summaries:pd.DataFrame) -> pd.DataFrame:
    
    # before merge preprocess
    prompts["original_prompt_len"] = prompts["prompt_text"].apply(lambda x: len(x))
    
    prompts['prompt_sentenceCount'] = prompts['prompt_text'].apply(lambda x:count_sentences(x))
    
    
    prompts["prompt_length"] = prompts["prompt_text"].apply(lambda x: len(word_tokenize(x)))
    
    prompts["prompt_tokens"] = prompts["prompt_text"].apply(lambda x: word_tokenize(x))
    
    prompts["prompt_word_cnt"] = prompts["prompt_text"].apply(lambda x:len(x. split(' ')))
    
    prompts["prompt_stpword_cnt"] = prompts["prompt_text"].apply(lambda x: count_stopwords(x))
    
    #prompts["prompt_punct_cnt"] = prompts["prompt_text"].apply(lambda x: count_punctuation(x))
    
    #prompts["prompt_num_cnt"] = prompts["prompt_text"].apply(lambda x: count_numbers(x))
    
    # Add prompt tokens into spelling checker dictionary
    prompts["prompt_tokens"].apply(lambda x: add_spelling_dictionary(x))
    
    
    # Add POS features for prompt_question and prompt_text
    prompts['question_pos_tags'] = prompts['prompt_question'].apply(extract_pos_tags)
    prompts['prompt_text_pos_tags'] = prompts['prompt_text'].apply(extract_pos_tags)
    
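    # Example: Count the occurrences of specific POS tags. Only the exact tags 'NN' (singular noun)
    # and 'VB' (base-form verb) are counted; tags such as 'NNS' or 'VBD' are not included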
    prompts['noun_count'] = prompts['prompt_text_pos_tags'].apply(lambda x: x.count('NN'))
    prompts['verb_count'] = prompts['prompt_text_pos_tags'].apply(lambda x: x.count('VB'))
    
    # Example: TextBlob sentiment analysis
    prompts['prompt_textblob_polarity'] = prompts['prompt_text'].apply(lambda x: TextBlob(x).sentiment.polarity)
    prompts['prompt_textblob_subjectivity'] = prompts['prompt_text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
    
    # Automatically extract keywords from the prompt
    prompts['prompt_keywords'] = prompts['prompt_text'].apply(extract_keywords)
    
    
    # Calculate average sentence length for the prompt_text and student_summary columns
    prompts['avg_sentence_length_prompt'] = prompts['prompt_text'].apply(calculate_average_sentence_length)
    summaries['avg_sentence_length_summary'] = summaries['text'].apply(calculate_average_sentence_length)
    
    # Calculate average word length for the prompt_text and student_summary columns
    prompts['avg_word_length_prompt'] = prompts['prompt_text'].apply(calculate_average_word_length)
    summaries['avg_word_length_summary'] = summaries['text'].apply(calculate_average_word_length)

    
    # Add POS features for the summary text
    summaries['text_pos_tags'] = summaries['text'].apply(extract_pos_tags)
    
    
    # Example: Count the occurrences of the exact POS tags 'NN' and 'VB' in the summary
    summaries['text_noun_count'] = summaries['text_pos_tags'].apply(lambda x: x.count('NN'))
    summaries['text_verb_count'] = summaries['text_pos_tags'].apply(lambda x: x.count('VB'))
    
    summaries['textblob_polarity'] = summaries['text'].apply(lambda x: TextBlob(x).sentiment.polarity)
    summaries['textblob_subjectivity'] = summaries['text'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
    
    summaries["original_summary_len"] = summaries["text"].apply(lambda x: len(x))
    
    summaries['summary_sentenceCount'] = summaries['text'].apply(lambda x:count_sentences(x))

    summaries["summary_length"] = summaries["text"].apply(lambda x: len(word_tokenize(x)))
    
    summaries["summary_tokens"] = summaries["text"].apply(lambda x: word_tokenize(x))
    
    summaries["summary_word_cnt"] = summaries["text"].apply(lambda x:len(x.split(' ')))
    
    #summaries["summary_stpword_cnt"] = summaries["text"].apply(lambda x: count_stopwords(x))
    
    #summaries["summary_punct_cnt"] = summaries["text"].apply(lambda x: count_punctuation(x))
    
    #summaries["summary_num_cnt"] = summaries["text"].apply(lambda x: count_numbers(x))
    
    
    #from IPython.core.debugger import Pdb; Pdb().set_trace()
    # fix misspelling
    summaries["fixed_summary_text"] = summaries["text"].progress_apply(lambda x: speller(x))
    
    # count misspelling
    summaries["splling_err_num"] = summaries["text"].progress_apply(spelling)
    
    # merge prompts and summaries
    input_df = summaries.merge(prompts, how="left", on="prompt_id")
    # after merge preprocess
    # input_df['length_ratio'] = input_df['summary_length'] / input_df['prompt_length']
    
    input_df['word_overlap_count'] = input_df.progress_apply(word_overlap_count, axis=1)
    input_df['bigram_overlap_count'] = input_df.progress_apply(ngram_co_occurrence,args=(2,), axis=1)
    input_df['bigram_overlap_ratio'] = input_df['bigram_overlap_count'] / (input_df['summary_length'] - 1)
    
    input_df['trigram_overlap_count'] = input_df.progress_apply(ngram_co_occurrence, args=(3,), axis=1)
    input_df['trigram_overlap_ratio'] = input_df['trigram_overlap_count'] / (input_df['summary_length'] - 2)
    
    input_df['quotes_count'] = input_df.progress_apply(quotes_count, axis=1)
    
    return input_df.drop(columns=["summary_tokens", "prompt_tokens"])

In [30]:
result = run(df_train_prompt, df_train_summaries)

100%|██████████| 7165/7165 [04:21<00:00, 27.41it/s]
100%|██████████| 7165/7165 [00:00<00:00, 12914.34it/s]
100%|██████████| 7165/7165 [00:00<00:00, 16007.98it/s]
100%|██████████| 7165/7165 [00:00<00:00, 7537.33it/s]
100%|██████████| 7165/7165 [00:01<00:00, 6739.01it/s]
100%|██████████| 7165/7165 [00:00<00:00, 152467.91it/s]
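
To make the n-gram overlap features computed by `run` more concrete, here is a minimal sketch of the same idea on toy token lists. The sentences are hypothetical and `ngram_overlap` is an illustrative stand-in for `ngram_co_occurrence` that takes token lists directly instead of a DataFrame row.

```python
from typing import List, Set


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]


def ngram_overlap(prompt_tokens: List[str], summary_tokens: List[str], n: int) -> int:
    """Count the distinct n-grams that appear in both the prompt and the summary."""
    shared: Set[str] = set(ngrams(prompt_tokens, n)) & set(ngrams(summary_tokens, n))
    return len(shared)


# Toy token lists (hypothetical; not taken from the dataset).
prompt_tokens = "the pharaoh ruled the land of egypt".split()
summary_tokens = "the pharaoh ruled egypt".split()

# Shared bigrams: "the pharaoh" and "pharaoh ruled".
bigram_count = ngram_overlap(prompt_tokens, summary_tokens, 2)

# The ratio divides by the number of bigram positions in the summary.
bigram_ratio = bigram_count / (len(summary_tokens) - 1)

# Shared trigrams: only "the pharaoh ruled".
trigram_count = ngram_overlap(prompt_tokens, summary_tokens, 3)
```

Note that the ratio's denominator is `summary_length - 1` for bigrams and `summary_length - 2` for trigrams, so a very short summary can make it zero. It may be worth checking the resulting columns for infinite or missing values before training.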


In [13]:
result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7165 entries, 0 to 7164
Data columns (total 41 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   student_id                    7165 non-null   object 
 1   prompt_id                     7165 non-null   object 
 2   text                          7165 non-null   object 
 3   content                       7165 non-null   float64
 4   wording                       7165 non-null   float64
 5   avg_sentence_length_summary   7165 non-null   float64
 6   avg_word_length_summary       7165 non-null   float64
 7   text_pos_tags                 7165 non-null   object 
 8   text_noun_count               7165 non-null   int64  
 9   text_verb_count               7165 non-null   int64  
 10  textblob_polarity             7165 non-null   float64
 11  textblob_subjectivity         7165 non-null   float64
 12  original_summary_len          7165 non-null   int64  
 13  sum

In [14]:
print("Full summaries train dataset shape is {}".format(result.shape))

Full summaries train dataset shape is (7165, 41)


In [15]:
result['prompt_text_pos_tags'].head()

0    [IN, DT, NNP, NNP, NN, VBD, NN, IN, NNP, NNP, ...
1    [IN, CD, NN, VBG, NN, IN, DT, NN, ,, CC, DT, N...
2    [JJ, NN, VBD, VBN, IN, DT, NN, ., IN, DT, NN, ...
3    [JJ, NN, VBD, VBN, IN, DT, NN, ., IN, DT, NN, ...
4    [IN, DT, NNP, NNP, NN, VBD, NN, IN, NNP, NNP, ...
Name: prompt_text_pos_tags, dtype: object
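
The `count_keyword_matches` helper defined earlier tests each keyword with plain substring containment, so a short keyword can match inside a longer word. The sketch below contrasts that behaviour with a whole-word variant. The whole-word function, the keywords, and the sentence are all hypothetical illustrations, not part of the original pipeline.

```python
import re
from typing import Iterable


def count_keyword_matches_substring(keywords: Iterable[str], summary: str) -> int:
    """Same approach as the notebook's helper: plain substring containment."""
    summary = summary.lower()
    return sum(1 for keyword in keywords if keyword.lower() in summary)


def count_keyword_matches_whole_word(keywords: Iterable[str], summary: str) -> int:
    """Hypothetical variant: a keyword counts only when it appears as a whole word."""
    summary = summary.lower()
    return sum(
        1
        for keyword in keywords
        if re.search(r"\b" + re.escape(keyword.lower()) + r"\b", summary)
    )


keywords = ["art", "king"]
summary = "The party was held for the king."

# "art" is found inside "party", so the substring version counts both keywords.
substring_hits = count_keyword_matches_substring(keywords, summary)

# With word boundaries, only "king" matches.
whole_word_hits = count_keyword_matches_whole_word(keywords, summary)
```

Whether the stricter matching helps the model is an open question; it is only one possible refinement to try.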

In [31]:
result['keyword_count'] = result.apply(lambda row: count_keyword_matches(row['prompt_keywords'], row['text']), axis=1)


result['text_pos_len'] = result['text_pos_tags'].apply(lambda x: len(x))
result['prompt_pos_len'] = result['prompt_text_pos_tags'].apply(lambda x: len(x))

result['text_adj_count'] = result['text_pos_tags'].apply(lambda x: x.count('JJ'))
result['prompt_adj_count'] = result['prompt_text_pos_tags'].apply(lambda x: x.count('JJ'))

result['text_adj_ratio'] = result['text_adj_count'] / result['text_pos_len']
result['prompt_adj_ration'] = result['prompt_adj_count'] / result['prompt_pos_len']

result['text_verb_ratio'] = result['text_verb_count'] / result['text_pos_len']
result['prompt_verb_ration'] = result['verb_count'] / result['prompt_pos_len']

result['text_noun_ratio'] = result['text_noun_count'] / result['text_pos_len']
result['prompt_noun_ration'] = result['noun_count'] / result['prompt_pos_len']

# Calculate readability metrics for the prompt question and the summary
result['flesch_kincaid_prompt'] = result['prompt_question'].apply(textstat.flesch_kincaid_grade)
result['flesch_kincaid_summary'] = result['text'].apply(textstat.flesch_kincaid_grade)
result['gunning_fog_prompt'] = result['prompt_question'].apply(textstat.gunning_fog)
result['gunning_fog_summary'] = result['text'].apply(textstat.gunning_fog)
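
For intuition about the readability features, here is a minimal sketch of the standard Flesch-Kincaid grade formula computed from raw counts. It assumes the word, sentence, and syllable counts are already known; `textstat` does its own tokenization and syllable counting, so its values may differ from this simplified version. Note also that in the cell above the `*_prompt` metrics are computed on `prompt_question`, not on `prompt_text`.

```python
def flesch_kincaid_grade_from_counts(
    total_words: int, total_sentences: int, total_syllables: int
) -> float:
    """Flesch-Kincaid grade level from raw counts (standard published formula)."""
    if total_words == 0 or total_sentences == 0:
        return 0.0  # convention chosen for this sketch when the text is empty
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts: 100 words in 5 sentences with 150 syllables.
# 0.39 * 20 + 11.8 * 1.5 - 15.59
grade = flesch_kincaid_grade_from_counts(
    total_words=100, total_sentences=5, total_syllables=150
)
```

Longer sentences and longer words both raise the grade, which is why these scores tend to correlate with the sentence-length and word-length features computed earlier.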

In [32]:
# Define a function to calculate Jaccard similarity
def jaccard_similarity(str1, str2):
    # Tokenize the strings and convert them to sets
    set1 = set(str1.split())
    set2 = set(str2.split())
    
    # Calculate Jaccard similarity
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    similarity = intersection / union if union > 0 else 0.0
    
    return similarity

# Calculate Jaccard similarity between prompt_question and text
result['jaccard_similarity'] = result.apply(lambda row: jaccard_similarity(row['prompt_question'], row['text']), axis=1)
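
As a quick check of what this feature measures, here is the same Jaccard computation on toy strings (hypothetical examples, not rows from the dataset). It also shows that, because the strings are split on whitespace without lowercasing, tokens that differ only in case are treated as different.

```python
def jaccard_similarity(str1: str, str2: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    set1, set2 = set(str1.split()), set(str2.split())
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


# Shared tokens: {"the", "cat"}; all tokens: {"the", "cat", "sat", "ran"}.
score = jaccard_similarity("the cat sat", "the cat ran")

# "The" and "the" are different tokens, so only "cat" is shared out of 3 tokens.
case_sensitive_score = jaccard_similarity("The cat", "the cat")

# Two empty strings give an empty union, which is defined here as 0.0.
empty_score = jaccard_similarity("", "")
```

Lowercasing both strings and stripping punctuation before splitting would be one way to make the score less sensitive to surface differences, if that turns out to matter.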


In [33]:
result['combined_text'] = result['prompt_text'] + ' ' + result['text']

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer(max_df=0.85, max_features=1000, stop_words='english')
X = vectorizer.fit_transform(result['combined_text'])

# Apply Latent Dirichlet Allocation (LDA) for topic modeling
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(X)

# Get the topics for the prompt_text
prompt_topics = lda.transform(vectorizer.transform(result['prompt_text']))

# Get the topics for the student_summary
summary_topics = lda.transform(vectorizer.transform(result['text']))

# Assign the topics to DataFrame
result['prompt_topics'] = prompt_topics.tolist()
result['summary_topics'] = summary_topics.tolist()
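
Each row now holds two topic distributions, one for the prompt and one for the summary. One possible way to compare them, sketched below, is the cosine similarity between the two vectors. This comparison is an assumption added for illustration and is not a feature the notebook computes; the three-topic vectors are made-up values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative three-topic distributions (hypothetical values).
prompt_topics = [0.998, 0.001, 0.001]
on_topic_summary = [0.717, 0.015, 0.268]
off_topic_summary = [0.020, 0.960, 0.020]

on_topic_score = cosine_similarity(prompt_topics, on_topic_summary)
off_topic_score = cosine_similarity(prompt_topics, off_topic_summary)
```

A summary whose dominant topic matches the prompt's gets a score close to 1, while a summary dominated by another topic gets a score close to 0.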


In [34]:
# Convert the 'prompt_topics' and 'summary_topics' columns from object to float
#result['prompt_topics'] = result['prompt_topics'].apply(lambda x: [float(val) for val in x])
#result['summary_topics'] = result['summary_topics'].apply(lambda x: [float(val) for val in x])
# Separate each item in 'prompt_topics' and 'summary_topics' into separate columns


result[['prompt_topic_1', 'prompt_topic_2', 'prompt_topic_3']] = result['prompt_topics'].apply(pd.Series)
result[['summary_topic_1', 'summary_topic_2', 'summary_topic_3']] = result['summary_topics'].apply(pd.Series)


result[['prompt_question', 'prompt_title', 'prompt_topics', 'summary_topics', 'prompt_topic_1', 'prompt_topic_2', 'prompt_topic_3', 'summary_topic_1', 'summary_topic_2', 'summary_topic_3']].head(30)

Unnamed: 0,prompt_question,prompt_title,prompt_topics,summary_topics,prompt_topic_1,prompt_topic_2,prompt_topic_3,summary_topic_1,summary_topic_2,summary_topic_3
0,Summarize how the Third Wave developed over su...,The Third Wave,"[0.9976707222247121, 0.0011622064773270835, 0....","[0.7166481316313827, 0.015401043676678662, 0.2...",0.997671,0.001162,0.001167,0.716648,0.015401,0.267951
1,Summarize the various ways the factory would u...,Excerpt from The Jungle,"[0.0009218235174879116, 0.9981522727362265, 0....","[0.02161244422911558, 0.9570443049783564, 0.02...",0.000922,0.998152,0.000926,0.021612,0.957044,0.021343
2,"In complete sentences, summarize the structure...",Egyptian Social Structure,"[0.0011688524885937544, 0.001174645211872204, ...","[0.06296437014671895, 0.04736791369332212, 0.8...",0.001169,0.001175,0.997657,0.062964,0.047368,0.889668
3,"In complete sentences, summarize the structure...",Egyptian Social Structure,"[0.0011688524885937544, 0.001174645211872204, ...","[0.08841918467068985, 0.028057187596373157, 0....",0.001169,0.001175,0.997657,0.088419,0.028057,0.883524
4,Summarize how the Third Wave developed over su...,The Third Wave,"[0.9976707222247121, 0.0011622064773270835, 0....","[0.7820817880146029, 0.06862669389643532, 0.14...",0.997671,0.001162,0.001167,0.782082,0.068627,0.149292
5,Summarize the various ways the factory would u...,Excerpt from The Jungle,"[0.0009218235174879116, 0.9981522727362265, 0....","[0.021058089327376895, 0.8931636755248883, 0.0...",0.000922,0.998152,0.000926,0.021058,0.893164,0.085778
6,"In complete sentences, summarize the structure...",Egyptian Social Structure,"[0.0011688524885937544, 0.001174645211872204, ...","[0.012857424035880197, 0.049620897662872476, 0...",0.001169,0.001175,0.997657,0.012857,0.049621,0.937522
7,Summarize the various ways the factory would u...,Excerpt from The Jungle,"[0.0009218235174879116, 0.9981522727362265, 0....","[0.019859350988571122, 0.8760290135922674, 0.1...",0.000922,0.998152,0.000926,0.019859,0.876029,0.104112
8,Summarize at least 3 elements of an ideal trag...,On Tragedy,"[0.001415458355819997, 0.001409109992094977, 0...","[0.017046221848662425, 0.015254946699526208, 0...",0.001415,0.001409,0.997175,0.017046,0.015255,0.967699
9,Summarize at least 3 elements of an ideal trag...,On Tragedy,"[0.001415458355819997, 0.001409109992094977, 0...","[0.030917140205545455, 0.030766154069026847, 0...",0.001415,0.001409,0.997175,0.030917,0.030766,0.938317


In [20]:
result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7165 entries, 0 to 7164
Data columns (total 66 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   student_id                    7165 non-null   object 
 1   prompt_id                     7165 non-null   object 
 2   text                          7165 non-null   object 
 3   content                       7165 non-null   float64
 4   wording                       7165 non-null   float64
 5   avg_sentence_length_summary   7165 non-null   float64
 6   avg_word_length_summary       7165 non-null   float64
 7   text_pos_tags                 7165 non-null   object 
 8   text_noun_count               7165 non-null   int64  
 9   text_verb_count               7165 non-null   int64  
 10  textblob_polarity             7165 non-null   float64
 11  textblob_subjectivity         7165 non-null   float64
 12  original_summary_len          7165 non-null   int64  
 13  sum

In [21]:
result = result.drop(['prompt_topics', 'summary_topics'], axis=1)

In [22]:
result.describe()

Unnamed: 0,content,wording,avg_sentence_length_summary,avg_word_length_summary,text_noun_count,text_verb_count,textblob_polarity,textblob_subjectivity,original_summary_len,summary_sentenceCount,...,flesch_kincaid_summary,gunning_fog_prompt,gunning_fog_summary,jaccard_similarity,prompt_topic_1,prompt_topic_2,prompt_topic_3,summary_topic_1,summary_topic_2,summary_topic_3
count,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,...,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0,7165.0
mean,-0.014853,-0.063072,23.932378,4.563326,11.934822,3.998744,0.080354,0.437366,418.776971,3.763015,...,9.114613,11.309216,11.173788,0.089429,0.154575,0.278974,0.56645,0.168779,0.282777,0.548444
std,1.043569,1.036048,14.268509,0.419488,8.758633,3.574231,0.192219,0.181227,307.833685,3.11006,...,4.534061,3.583176,4.821147,0.054385,0.359656,0.446934,0.493677,0.31583,0.405843,0.441937
min,-1.729859,-1.962614,5.125,3.258065,0.0,0.0,-1.0,0.0,114.0,1.0,...,0.0,6.56,2.64,0.0,0.000922,0.001162,0.000926,0.001464,0.001368,0.002126
25%,-0.799545,-0.87272,15.0,4.275862,6.0,2.0,-0.01331,0.333333,216.0,2.0,...,6.2,6.56,8.01,0.052632,0.000922,0.001175,0.000926,0.013891,0.0155,0.030303
50%,-0.093814,-0.081769,20.5,4.545455,9.0,3.0,0.083333,0.45,320.0,3.0,...,8.2,10.27,10.3,0.079646,0.001169,0.001409,0.997175,0.024229,0.028107,0.86605
75%,0.49966,0.503833,28.8,4.827586,15.0,5.0,0.190625,0.551429,513.0,5.0,...,11.0,14.43,13.2,0.113924,0.001415,0.998152,0.997657,0.074879,0.857096,0.957449
max,3.900326,4.310693,471.0,7.015625,103.0,32.0,1.0,1.0,3940.0,47.0,...,61.0,16.02,64.41,0.481481,0.997671,0.998152,0.997657,0.992219,0.995308,0.997039


## Extract feature columns

In [23]:
# Identifier, raw-text, label, and list-valued columns that must not be used as features
NON_FEATURE_COLUMNS = ['student_id', 'prompt_id', 'fixed_summary_text', 'text', 'prompt_question', 'prompt_title', 'prompt_text', 'content', 'wording', 'prompt_sentenceCount', 'summary_stpword_cnt', 'prompt_stpword_cnt', 'summary_punct_cnt', 'prompt_punct_cnt', 'summary_num_cnt', 'splling_err_num', 'prompt_num_cnt', 'quotes_count', 'question_pos_tags', 'text_pos_tags', 'prompt_text_pos_tags', 'prompt_keywords', 'combined_text']

# errors='ignore' skips the names that run() never creates because their lines are commented out
FEATURE_COLUMNS = result.drop(columns=NON_FEATURE_COLUMNS, errors='ignore').columns.to_list()

#columns = ['student_id', 'prompt_id', 'text', 'prompt_question', 'prompt_title', 'prompt_text', 'content', 'wording'], axis = 1

In [None]:
FEATURE_COLUMNS

In [None]:
result.isnull().sum()

## Plot feature columns

In [None]:
# figure, axis = plt.subplots(3, 2, figsize=(15, 15))
# plt.subplots_adjust(hspace=0.25, wspace=0.3)

# for i, column_name in enumerate(FEATURE_COLUMNS):
#     row = i//2
#     col = i % 2
#     bp = sns.barplot(ax=axis[row, col], x=preprocessed_df['student_id'], y=preprocessed_df[column_name], color='blue')
#     bp.set(xticklabels=[])
#     axis[row, col].set_title(column_name)
# axis[2, 1].set_visible(False)
# plt.show()

Now let us split the dataset into training and testing datasets:

In [None]:
"""
def split_dataset(dataset, test_ratio=0.20):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(result)
train_ds_pd.shape, valid_ds_pd.shape
"""
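
The random split above assigns individual rows to either side, so summaries written for the same prompt can land in both the training and the validation set. The cross-validation code in this notebook instead groups rows by `prompt_id`. The sketch below shows the idea of a group-aware holdout on a few hypothetical rows; `split_by_group` is an illustrative helper, not a function from any library.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]


def split_by_group(
    rows: List[Row], group_key: str, held_out_groups: Set[str]
) -> Tuple[List[Row], List[Row]]:
    """Send every row whose group is held out to validation, all others to training."""
    train = [row for row in rows if row[group_key] not in held_out_groups]
    valid = [row for row in rows if row[group_key] in held_out_groups]
    return train, valid


# Hypothetical rows: three prompts with a few summaries each.
rows = [
    {"student_id": "s1", "prompt_id": "A"},
    {"student_id": "s2", "prompt_id": "B"},
    {"student_id": "s3", "prompt_id": "A"},
    {"student_id": "s4", "prompt_id": "C"},
    {"student_id": "s5", "prompt_id": "B"},
]

train_rows, valid_rows = split_by_group(rows, "prompt_id", held_out_groups={"B"})

train_prompts = {row["prompt_id"] for row in train_rows}
valid_prompts = {row["prompt_id"] for row in valid_rows}
```

Because no prompt appears on both sides, the validation score reflects how the model handles a prompt it has never seen.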

In [None]:
# `content` label dataset columns
FEATURE_CONTENT = FEATURE_COLUMNS

# `wording` label dataset columns
FEATURE_WORDING = FEATURE_COLUMNS

In [None]:
params = {
    'objective': 'reg:squarederror',
    'n_estimators': 70,        # Adjust as needed
    'max_depth': 5,            # Adjust as needed
    'eta': 0.1,
    'subsample': 0.5,
    'colsample_bytree': 0.7,   # Add other hyperparameters here
}

# Create an XGBoost regressor for label content
model_content = XGBRegressor(**params)


# Create an XGBoost regressor for label wording
model_wording = XGBRegressor(**params)

In [None]:
n_splits = 4  # Choose the number of folds you want
group_kfold = GroupKFold(n_splits=n_splits)


scores_c = []  # To store the evaluation scores for each fold
scores_w = []

for train_idx, test_idx in group_kfold.split(result[FEATURE_COLUMNS],  groups= result['prompt_id']):
    #X_train, X_test = result[FEATURE_COLUMNS][train_idx], result[FEATURE_COLUMNS][test_idx]
    X_train = result.loc[train_idx, FEATURE_COLUMNS]
    X_test = result.loc[test_idx, FEATURE_COLUMNS]

    y_train_c, y_test_c = result['content'][train_idx], result['content'][test_idx]
    y_train_w, y_test_w = result['wording'][train_idx], result['wording'][test_idx]
    
    # Fit the XGBoost regressor for label content on the training fold
    model_content.fit(X_train, y_train_c)

    # Fit the XGBoost regressor for label wording on the training fold
    model_wording.fit(X_train, y_train_w)
    
    # Make predictions on the test data
    evaluation_content = model_content.predict(X_test)
    evaluation_content_rmse = np.sqrt(np.mean((evaluation_content - y_test_c)**2))
    scores_c.append(evaluation_content_rmse)
    
    evaluation_wording = model_wording.predict(X_test)
    evaluation_wording_rmse = np.sqrt(np.mean((evaluation_wording - y_test_w)**2))
    scores_w.append(evaluation_wording_rmse)

# Calculate the average score across all folds
average_score_c = np.mean(scores_c)
print(f'Average Root Mean Squared Error Content: {average_score_c}')

# Calculate the average score across all folds
average_score_w = np.mean(scores_w)
print(f'Average Root Mean Squared Error Wording: {average_score_w}')

MCRMSE = np.mean([average_score_c, average_score_w])
print(f"MCRMSE: {MCRMSE:.4f}")
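
The MCRMSE printed by the cell above is the mean of the per-target RMSE values for `content` and `wording`. Here is a minimal, dependency-free sketch of that computation on made-up numbers; the helper names `rmse` and `mcrmse` are illustrative.

```python
import math
from typing import Dict, List


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error for a single target column."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mcrmse(y_true: Dict[str, List[float]], y_pred: Dict[str, List[float]]) -> float:
    """Mean of the per-column RMSE values across all target columns."""
    per_column = [rmse(y_true[column], y_pred[column]) for column in y_true]
    return sum(per_column) / len(per_column)


# Made-up targets and predictions for two summaries.
y_true = {"content": [0.0, 0.0], "wording": [0.0, 0.0]}
y_pred = {"content": [1.0, -1.0], "wording": [3.0, -3.0]}

content_rmse = rmse(y_true["content"], y_pred["content"])  # errors of size 1
wording_rmse = rmse(y_true["wording"], y_pred["wording"])  # errors of size 3
score = mcrmse(y_true, y_pred)
```

In the cross-validation loop the per-fold RMSE values are first averaged for each target, and the two averages are then averaged again to give the final MCRMSE.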

# Train the model

We will train the model using a one-liner.

Note: when training with TensorFlow Decision Forests, you may see a warning about Autograph. You can safely ignore it; it will be fixed in the next release.

In [None]:
# Training the XGBoost regressor for label content
#model_content.fit(train_ds_pd[FEATURE_CONTENT], train_ds_pd['content'])

# Training the XGBoost regressor for label wording
#model_wording.fit(train_ds_pd[FEATURE_WORDING], train_ds_pd['wording'])

Now, let us run an evaluation using the validation dataset.

In [None]:

#evaluation_content = model_content.predict(valid_ds_pd[FEATURE_CONTENT])
#evaluation_content_rmse = np.sqrt(np.mean((evaluation_content - valid_ds_pd['content'])**2))
# evaluation_content = model_content.score(valid_ds_pd[FEATURE_CONTENT], valid_ds_pd['content'])
#print(f"Content RMSE: {evaluation_content_rmse:.4f}")

# Run evaluation for model_wording
#evaluation_wording = model_wording.predict(valid_ds_pd[FEATURE_WORDING])
#evaluation_wording_rmse = np.sqrt(np.mean((evaluation_wording - valid_ds_pd['wording'])**2))
#print(f"Wording RMSE: {evaluation_wording_rmse:.4f}")

#MCRMSE = np.mean([evaluation_content_rmse, evaluation_wording_rmse])
#print(f"MCRMSE: {MCRMSE:.4f}")

# Submission

In [None]:
#df_test_prompt = pd.read_csv('data/prompts_test.csv')
#df_test_summaries = pd.read_csv('data/summaries_test.csv')

In [None]:
#df_test = df_test_summaries.merge(df_test_prompt, on='prompt_id')

In [None]:
#df_test.head()

In [None]:
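# NOTE: `feature_engineer` is not defined in this notebook; the matching step here would be
# run(df_test_prompt, df_test_summaries) followed by the additional feature cells
#processed_test_df = feature_engineer(df_test)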

In [None]:
#processed_test_df.head()

In [None]:
# test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(processed_test_df[FEATURE_COLUMNS], task = tfdf.keras.Task.REGRESSION)
#test_ds = processed_test_df[FEATURE_COLUMNS]

In [None]:
#processed_test_df['content'] = model_content.predict(test_ds)
#processed_test_df['wording'] = model_wording.predict(test_ds)

In [None]:
#processed_test_df.head()

In [None]:
#processed_test_df[['student_id', 'content', 'wording']].to_csv('submission.csv',index=False)
# display(pd.read_csv('submission.csv'))
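
For reference, the submission file written by the commented-out cell above has three columns: `student_id`, `content`, and `wording`. The sketch below builds the same layout with the standard library only. The ids and predictions are hypothetical, and the six-decimal formatting is an assumption rather than a competition requirement.

```python
import csv
import io
from typing import List


def build_submission_csv(
    student_ids: List[str], content_preds: List[float], wording_preds: List[float]
) -> str:
    """Return CSV text with one row per student: student_id, content, wording."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["student_id", "content", "wording"])
    for student_id, content, wording in zip(student_ids, content_preds, wording_preds):
        writer.writerow([student_id, f"{content:.6f}", f"{wording:.6f}"])
    return buffer.getvalue()


# Hypothetical ids and predictions.
submission_text = build_submission_csv(
    student_ids=["id_a", "id_b"],
    content_preds=[0.25, -0.5],
    wording_preds=[1.0, 0.125],
)
submission_lines = submission_text.splitlines()
```

Writing `submission_text` to `submission.csv` would produce a file with the same columns as the `to_csv(..., index=False)` call in the cell above.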