<div style="background:#926AA6   ;border-radius:5px; font-family:'Times';font-size:35px;color:  #f2f2f2" ><center>&ensp; 📚Feedback Prize - Predicting Effective Arguments</center></div>

![](https://images.unsplash.com/photo-1610484826967-09c5720778c7?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=870&q=80)

<div style="background:#926AA6   ;border-radius:5px; font-family:'Times';font-size:35px;color:  #f2f2f2" >&ensp; 🌌Story</div>

 <div style="background:#343148   ;font-family: mono;font-size:15px;color:  #f2f2f2" >
     Writing is crucial for success. In particular, argumentative writing fosters critical thinking and civic engagement skills, and can be strengthened by practice. However, only 13 percent of eighth-grade teachers ask their students to write persuasively each week. Additionally, resource constraints disproportionately impact Black and Hispanic students, so they are more likely to write at the “below basic” level as compared to their white peers. An automated feedback tool is one way to make it easier for teachers to grade the writing tasks they assign, which in turn helps students improve their writing skills.
 <p style="font-family: mono;font-size:15px;color:  #f2f2f2">There are numerous automated writing feedback tools currently available, but they all have limitations, especially with argumentative writing. Existing tools often fail to evaluate the quality of argumentative elements, such as organization, evidence, and idea development. Most importantly, many of these writing tools are inaccessible to educators due to their cost, which most impacts already underserved schools.</p>
 <p style="font-family: mono;font-size:15px;color:  #f2f2f2">Georgia State University (GSU) is an undergraduate and graduate urban public research institution in Atlanta. U.S. News & World Report ranked GSU as one of the most innovative universities in the nation. GSU awards more bachelor’s degrees to African-Americans than any other non-profit college or university in the country. GSU and The Learning Agency Lab, an independent nonprofit based in Arizona, are focused on developing science of learning-based tools and programs for social good.</p>
 <p style="font-family: mono;font-size:15px;color:  #f2f2f2">To best prepare all students, GSU and The Learning Agency Lab have joined forces to encourage data scientists to improve automated writing assessments. This public effort could also encourage higher quality and more accessible automated writing tools. If successful, students will receive more feedback on the argumentative elements of their writing and will apply the skill across many disciplines.</p>
 </div>

  

<div style="background:#926AA6   ;font-family:'Times';font-size:35px;color:  #f2f2f2" >&ensp;💾 Data</div>

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Importing Libraries</b></div>

In [None]:
import os
import logging
from types import SimpleNamespace
from pathlib import Path
from datetime import datetime
import math
import random
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import GroupKFold
from sklearn.metrics import log_loss
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer
from tqdm import tqdm
from scipy.special import softmax
from IPython.display import display, HTML
from transformers import DataCollatorWithPadding
# Note: this rebinds the name Dataset, shadowing torch.utils.data.Dataset imported above
from datasets import Dataset, load_metric

import wandb

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Data Overview</b></div>

<div style="background:#343148   ;font-family: mono;font-size:15px;color:  #f2f2f2" >&ensp;
    The dataset presented here contains argumentative essays written by U.S. students in grades 6-12. These essays were annotated by expert raters for discourse elements commonly found in argumentative writing:
<ul>
<li><b>Lead</b> - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
<li><b>Position</b> - an opinion or conclusion on the main question
<li><b>Claim</b> - a claim that supports the position
<li><b>Counterclaim</b> - a claim that refutes another claim or gives an opposing reason to the position
<li><b>Rebuttal</b> - a claim that refutes a counterclaim
<li><b>Evidence</b> - ideas or examples that support claims, counterclaims, or rebuttals
<li><b>Concluding Statement</b> - a concluding statement that restates the claims
</ul>
    Your task is to predict the quality rating of each discourse element. Human readers rated each rhetorical or argumentative element, in order of increasing quality, as one of:
    
<ul>
<li><b>Ineffective</b>
<li><b>Adequate</b>
<li><b>Effective</b>
</ul>    
</div>

<body>

<table style="width:100%">
  <tr>
    <th style="color:black; font-size: 20px" bgcolor='#926AA6'>Feature</th>
    <th style="color:black; font-size: 20px" bgcolor='#926AA6'>Description</th>
    
  </tr>
  <tr>
      <td style=" font-size: 17px"><b>discourse_id</b></td>
      <td style="font-size: 17px">ID code for discourse element</td>
    </tr>
      <tr>
      <td style=" font-size: 17px"><b>essay_id </b></td>
      <td style="font-size: 17px">ID code for essay response. This ID code corresponds to the name of the full-text file in the train/ folder.</td>
    </tr>
          <tr>
      <td style=" font-size: 17px"><b>discourse_text  </b></td>
      <td style="font-size: 17px">Text of discourse element.</td>
    </tr>
       <tr>
      <td style=" font-size: 17px"><b>discourse_type  </b></td>
      <td style="font-size: 17px">Class label of discourse element. </td>
    </tr>
  <tr>
      <td style=" font-size: 17px"><b>discourse_type_num </b></td>
      <td style="font-size: 17px">Enumerated class label of discourse element.</td>
    </tr>
  <tr>
      <td style=" font-size: 17px"><b>discourse_effectiveness</b></td>
      <td style="font-size: 17px">Quality rating of discourse element, the target.</td>
    </tr>

    
    
</table>

</body>
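
Because the three ratings are listed in order of increasing quality, the target can be encoded as ordered integers. The block below is a minimal sketch of one way to do that with a pandas ordered categorical; the toy frame and the `label` column name are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

# Toy frame standing in for train.csv; only the target column is shown.
toy = pd.DataFrame(
    {"discourse_effectiveness": ["Adequate", "Effective", "Ineffective", "Adequate"]}
)

# Order taken from the description above: increasing quality.
LABEL_ORDER = ["Ineffective", "Adequate", "Effective"]

# An ordered categorical keeps the ranking and yields stable integer codes.
target = pd.Categorical(
    toy["discourse_effectiveness"], categories=LABEL_ORDER, ordered=True
)
toy["label"] = target.codes  # hypothetical column name

print(toy)
```

Fixing the category list explicitly avoids depending on the order in which labels happen to appear in the file.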


<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Reading Data</b></div>


In [None]:
train_df = pd.read_csv("/kaggle/input/feedback-prize-effectiveness/train.csv")
test_df = pd.read_csv("/kaggle/input/feedback-prize-effectiveness/test.csv")

In [None]:
def overview(df):
    """Render the row and column counts of `df` as a small table."""
    num_rows = len(df.index)
    num_col = len(df.columns)
    fig, ax = plt.subplots()

    # Header labels and the single row of values for the table
    lab = ['Number Of Rows', 'Number Of Columns']
    table_data = [[num_rows, num_col]]
    ax.set_title('Number of Samples', fontweight="bold")

    # Create the table, one header colour per column
    table = ax.table(cellText=table_data, colLabels=lab,
                     colColours=["#926AA6"] * len(lab), loc='center')

    # Enlarge the table and hide the axes
    table.set_fontsize(14)
    table.scale(2, 4)
    ax.axis('off')
    plt.show()

overview(train_df)

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>First and Last Five Rows</b></div>


In [None]:
first_five = train_df.head(5)
last_five = train_df.tail(5)
print("First Five Rows")
display(first_five)
print("=" * 100)
print("Last Five Rows")
display(last_five)

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>General Information</b></div>


In [None]:

def info(df):
    """Render per-column missing-value counts and dtypes of `df` as a table."""
    # Use the `df` argument throughout rather than the global train_df
    missing = df.isnull().sum()
    percent_missing = (missing * 100 / len(df)).round(2)
    data = pd.DataFrame({
        'Features': df.columns,
        'Num of Missing values': missing.values,
        'percentage Missing': percent_missing.values,
        'DataType': df.dtypes.astype(str).values,
    })

    fig, ax = plt.subplots()
    ax.set_title("General Information", fontsize=40, y=1.5)

    # Create the table, one header colour per column
    table = ax.table(cellText=data.values, colLabels=data.columns,
                     colColours=["#926AA6"] * len(data.columns), loc='center')

    # Enlarge the table and hide the axes
    table.set_fontsize(14)
    table.scale(5, 5)
    ax.axis('off')
    plt.show()

info(train_df)

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Target Distributions</b></div>


In [None]:
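def pie_target(df, col, title):
    """Draw a donut chart and a count plot of `col` side by side."""
    colors = ["#570990", "#e4b6fe", '#8b22ba', "#8a3cf6"]
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    fig.suptitle(title, size=20)

    values = df[col].value_counts()
    labels = list(values.index)
    n = len(values)

    # Offset only the first (largest) slice; use one colour per class
    explode = [0.05] + [0] * (n - 1)
    ax[0].pie(values, colors=colors[:n], explode=explode, startangle=60,
              labels=labels, autopct='%1.0f%%', pctdistance=0.6)
    # A white circle in the centre turns the pie into a donut
    ax[0].add_artist(plt.Circle((0, 0), 0.4, fc='white'))

    # Same class order in both plots so the colours match
    sns.countplot(x=col, data=df, hue=col, order=labels, hue_order=labels,
                  palette=colors[:n], ax=ax[1])
    plt.show()

pie_target(train_df, 'discourse_effectiveness', 'Label Distribution')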
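
The percentages shown on the chart can also be read off numerically. The following is a small sketch using `value_counts(normalize=True)`; the toy frame is made up for illustration and its proportions are not those of the real training set.

```python
import pandas as pd

# Toy stand-in for train_df; the class balance here is invented.
toy = pd.DataFrame(
    {"discourse_effectiveness": ["Adequate"] * 5 + ["Effective"] * 3 + ["Ineffective"] * 2}
)

# Share of each label in percent, rounded to one decimal place.
share = (
    toy["discourse_effectiveness"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(share)
```

On the real data the same call would be applied to `train_df["discourse_effectiveness"]`.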

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Discourse Types</b></div>


In [None]:

def count_discourse_type(feat, df):
    """Draw a count plot of the first column named in `feat`."""
    fig, ax = plt.subplots(figsize=(12, 8))
    # The palette has fewer colours than there are discourse types, so colours repeat
    colors = ["#570990", "#e4b6fe", '#8b22ba', "#8a3cf6"]

    sns.countplot(x=feat[0], data=df, palette=colors, ax=ax)
    ax.set_title('Discourse Types', fontdict={'fontsize': 20})
    fig.tight_layout()
    plt.show()

cat_features = ['discourse_type']

count_discourse_type(cat_features, train_df)

In [None]:

def pie_target(feat, df):
    """Draw a donut chart of the first column named in `feat`."""
    fig, ax = plt.subplots(figsize=(8, 8))
    colors = ["#570990", "#e4b6fe", '#8b22ba', "#8a3cf6"]

    values = df[feat[0]].value_counts()
    labels = list(values.index)

    ax.pie(values, colors=colors, startangle=60, labels=labels,
           autopct='%1.0f%%', pctdistance=0.6)
    # Single title; the chart is a pie, not a count plot
    ax.set_title('Discourse Type', fontdict={'fontsize': 20})
    # A white circle in the centre turns the pie into a donut
    ax.add_artist(plt.Circle((0, 0), 0.4, fc='white'))
    fig.tight_layout()
    plt.show()

cat_features = ['discourse_type']

pie_target(cat_features, train_df)

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Discourse Type vs Discourse Effectiveness</b></div>


In [None]:

cat_features = ['discourse_type']

def rel_tar(df, feat_list, target):
    """Plot counts of each feature in `feat_list`, split by `target`."""
    plt.figure(figsize=(18, 18))
    for idx, feat in enumerate(feat_list):
        plt.subplot(2, 2, idx + 1)  # 2x2 grid, so at most four features
        sns.countplot(x=feat, data=df, hue=target, palette='BuPu')
        plt.title(f'{feat} vs {target}', fontdict={'fontsize': 20})
        plt.xlabel(" ")
        plt.ylabel(" ")
        plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

rel_tar(train_df, cat_features, 'discourse_effectiveness')
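
Raw counts are dominated by the most frequent discourse types, so a row-normalised table can make the comparison easier to read. This is a hedged sketch with `pd.crosstab`; the toy rows are invented and only the column names match the dataset.

```python
import pandas as pd

# Invented rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "discourse_type": ["Claim", "Claim", "Claim", "Claim", "Evidence", "Evidence"],
    "discourse_effectiveness": [
        "Adequate", "Adequate", "Effective", "Ineffective", "Effective", "Ineffective",
    ],
})

# normalize="index" makes every row (discourse type) sum to 1.
share = pd.crosstab(
    toy["discourse_type"], toy["discourse_effectiveness"], normalize="index"
)
print(share)
```

Each row then gives the fraction of that discourse type falling into each rating.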


<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Discourse Text</b></div>


In [None]:
train_df['wrd_cnt'] = train_df['discourse_text'].apply(lambda x : len(x.split()))
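
As a quick illustration of what this count measures: `str.split()` with no arguments splits on runs of whitespace, so punctuation stays attached to its word and repeated spaces are ignored. The sample strings below are made up.

```python
def word_count(text: str) -> int:
    """Number of whitespace-separated tokens, as used for wrd_cnt."""
    return len(text.split())

# Made-up samples: punctuation, irregular spacing, and an empty string.
samples = ["Hi, i'm Isaac.", "  two   spaces  ", ""]
counts = [word_count(s) for s in samples]
print(counts)
```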

In [None]:
def hist(col, title):
    """Histogram of `col` with bars shaded by height."""
    plt.figure(figsize=(10, 8))
    ax = sns.histplot(col, kde=False)

    # Bar heights, normalised to [0, 1] so they can index the colormap
    values = np.array([patch.get_height() for patch in ax.patches])
    norm = plt.Normalize(values.min(), values.max())

    # One colour per bar from the RdPu colormap
    colors = plt.cm.RdPu(norm(values))
    ax.grid(False)
    for patch, color in zip(ax.patches, colors):
        patch.set_color(color)

    plt.title(title, size=20)

hist(train_df['wrd_cnt'], 'Distribution of word count in discourse text')

In [None]:
def dist_plot(feat, df):
    """Histogram of `feat[0]` for each effectiveness label, side by side."""
    labels = ["Adequate", "Ineffective", "Effective"]
    fig, ax = plt.subplots(1, len(labels), figsize=(22, 8))
    fig.suptitle('Word Count Distribution by Effectiveness', size=29)

    for axis, label in zip(ax, labels):
        # Title each panel with the label it actually shows
        axis.set_title(label)
        sns.histplot(df[df['discourse_effectiveness'] == label][feat[0]], ax=axis)

        # Shade each bar by its height using the RdPu colormap
        heights = np.array([patch.get_height() for patch in axis.patches])
        norm = plt.Normalize(heights.min(), heights.max())
        axis.grid(False)
        for patch, color in zip(axis.patches, plt.cm.RdPu(norm(heights))):
            patch.set_color(color)

    fig.tight_layout()
    plt.show()

float_features = ['wrd_cnt']

dist_plot(float_features, train_df)

In [None]:
def colors(i):
    """Shade the bars of the global `ax[i]` by height (RdPu colormap)."""
    valuesp = np.array([patch.get_height() for patch in ax[i].patches])
    norm = plt.Normalize(valuesp.min(), valuesp.max())
    # One colour per bar from the RdPu colormap
    patch_colors = plt.cm.RdPu(norm(valuesp))
    ax[i].grid(False)
    for patch, color in zip(ax[i].patches, patch_colors):
        patch.set_color(color)

discourse_types = train_df.discourse_type.unique()

# One panel per discourse type; `colors(i)` reads the global `ax` defined here
fig, ax = plt.subplots(1, len(discourse_types), sharex='col', sharey='row', figsize=(25, 5))
for i, discourse_type in enumerate(discourse_types):
    filtered_df = train_df[train_df.discourse_type == discourse_type]
    sns.histplot(data=filtered_df["wrd_cnt"], ax=ax[i])
    colors(i)
    ax[i].set_title(discourse_type)
fig.suptitle('Word Count Distribution by Discourse Type', size=22)
plt.show()


In [None]:
fig, ax = plt.subplots(1, len(discourse_types), sharex='col', sharey='row', figsize=(25, 5))
for i, discourse_type in enumerate(discourse_types):
   
    filtered_df = train_df[train_df.discourse_type == discourse_type].query('discourse_effectiveness=="Effective"')
    sns.histplot(data=filtered_df["wrd_cnt"], ax=ax[i])
    colors(i)
    ax[i].set_title(discourse_type)
fig.suptitle('Word Count of Effective Discourse Elements', size = 22)
plt.show()

In [None]:
fig, ax = plt.subplots(1, len(discourse_types), sharex='col', sharey='row', figsize=(25, 5))
for i, discourse_type in enumerate(discourse_types):
   
    filtered_df = train_df[train_df.discourse_type == discourse_type].query('discourse_effectiveness=="Ineffective"')
    sns.histplot(data=filtered_df["wrd_cnt"], ax=ax[i])
    colors(i)
    ax[i].set_title(discourse_type)
fig.suptitle('Word Count of Ineffective Discourse Elements', size = 22)
plt.show()

In [None]:
fig, ax = plt.subplots(1, len(discourse_types), sharex='col', sharey='row', figsize=(25, 5))
for i, discourse_type in enumerate(discourse_types):
   
    filtered_df = train_df[train_df.discourse_type == discourse_type].query('discourse_effectiveness=="Adequate"')
    sns.histplot(data=filtered_df["wrd_cnt"], ax=ax[i])
    colors(i)
    ax[i].set_title(discourse_type)
fig.suptitle('Word Count of Adequate Discourse Elements', size = 22)
plt.show()
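
The histograms above can be complemented with a numeric summary per rating. The block below is a sketch using `groupby` and `agg`; the word counts are invented for illustration and say nothing about the real distribution.

```python
import pandas as pd

# Invented word counts; column names follow the notebook.
toy = pd.DataFrame({
    "discourse_effectiveness": ["Effective", "Effective", "Adequate", "Adequate", "Ineffective"],
    "wrd_cnt": [60, 40, 20, 30, 10],
})

# Count, mean and median word count for each effectiveness label.
summary = (
    toy.groupby("discourse_effectiveness")["wrd_cnt"]
    .agg(["count", "mean", "median"])
)
print(summary)
```

Grouping by both `discourse_type` and `discourse_effectiveness` would give the per-panel equivalent of the plots.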

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b >Discourse Text Examples</b></div>


In [None]:
import html

from IPython.display import display, HTML

def show_examples_for_discourse_type(discourse_type):
    """Show one shuffled example per effectiveness label for a discourse type."""
    filt = train_df.query(f'discourse_type == "{discourse_type}"').sample(frac=1, random_state=420)

    def first_text(label):
        # Escape the text so stray < or & characters cannot break the table markup
        rows = filt[filt.discourse_effectiveness == label]
        return html.escape(rows.iloc[0].discourse_text)

    display(HTML(
        f"""
        <h4 style="background:#cc0088 ;color: black; font-size: 20px; width:10%" >{discourse_type}</h4>
        <table>
            <tr>
              <th style="color:black; font-size: 15px" bgcolor='#8b22ba' width=33%>Ineffective</th>
              <th style="color:black; font-size: 15px" bgcolor='#8b22ba' width=33%>Adequate</th>
              <th style="color:black; font-size: 15px" bgcolor='#8b22ba' width=33%>Effective</th>
            </tr>
            <tr>
              <td>{first_text('Ineffective')}</td>
              <td>{first_text('Adequate')}</td>
              <td>{first_text('Effective')}</td>
            </tr>
        </table>
        """
    ))

In [None]:
# Sorted so the tables appear in the same order on every run
for dt in sorted(train_df.discourse_type.unique()):
    show_examples_for_discourse_type(dt)

<div style=";font-family:'Times';font-size:30px;color:  #FA7A35" >📌 <b>Word Cloud On Discourse Text</b></div>


In [None]:
from wordcloud import WordCloud

# Fixed label order so the columns are the same on every run
effects = ["Ineffective", "Adequate", "Effective"]
fig, ax = plt.subplots(len(discourse_types), len(effects), sharex='col', sharey='row', figsize=(30, 30))
for i, discourse_type in enumerate(discourse_types):
    for j, effect in enumerate(effects):
        subset = train_df.query(f'discourse_type == "{discourse_type}" and discourse_effectiveness == "{effect}"')
        # Join the full texts; str(Series) would only pass the truncated repr
        text = ' '.join(subset['discourse_text'])
        wc = WordCloud(background_color='white').generate(text)

        ax[i, j].imshow(wc)
        ax[i, j].axis('off')
        ax[i, j].set_title(f'{effect} {discourse_type}', fontsize=30)

fig.suptitle('Word Cloud on Discourse Type and Effectiveness', size=40)
plt.show()
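
A word cloud shows relative frequency only visually. As a rough numeric counterpart, the sketch below counts tokens with `collections.Counter`; the `top_words` helper and the sample sentences are hypothetical, and unlike `WordCloud` it does not remove stopwords.

```python
import re
from collections import Counter

def top_words(texts, n=3):
    """Return the n most frequent lowercase tokens across `texts` (hypothetical helper)."""
    counter = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens
        counter.update(re.findall(r"[a-z']+", text.lower()))
    return counter.most_common(n)

# Made-up sentences standing in for discourse_text values.
texts = ["Students should vote", "students should not drive", "Students matter"]
top = top_words(texts, n=2)
print(top)
```

On the notebook's data, `texts` would be the `discourse_text` values of one discourse type and rating.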

<div style="background:#926AA6   ;font-family:'Times';font-size:35px;color:  white" ><center>&ensp;Thank you</center></div>
<div style="background:#926AA6   ;font-family:'Times';font-size:35px;color:  white" ><center>&ensp;⚠ WORK IN PROGRESS ⚠</center></div>
