# __Introduction__

The dataset contains argumentative essays written by U.S. students in grades 6-12. The essays were annotated by expert raters for elements commonly found in argumentative writing.

The goal of this competition is to come up with a model that can accurately segment each essay into the discourse categories listed below.

* Lead - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
* Position - an opinion or conclusion on the main question
* Claim - a claim that supports the position
* Counterclaim - a claim that refutes another claim or gives an opposing reason to the position
* Rebuttal - a claim that refutes a counterclaim
* Evidence - ideas or examples that support claims, counterclaims, or rebuttals.
* Concluding Statement - a concluding statement that restates the claims

This notebook contains a **comprehensive EDA** that tries to surface class-wise linguistic patterns that could be leveraged for segmentation.

<img src="https://i2.wp.com/www.iedunote.com/img/20535/essay-writing-made-easy.jpg?fit=1920%2C1280&quality=100&ssl=1" style="width:500px;height:500px;img-align:center;">

### Importing required libraries 📚

In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", None)  # None disables truncation; -1 is no longer accepted by recent pandas
import numpy as np

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

import seaborn as sns
import matplotlib.style as style 
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import scattertext as st
from IPython.display import IFrame

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

import eli5






# __EDA__ 📊

### Loading in the dataset

In [None]:
df = pd.read_csv('/kaggle/input/feedback-prize-2021/train.csv')

### Overview of the dataset

In [None]:
df.head()

### The column descriptions are:

* id - ID code for essay response
* discourse_id - ID code for discourse element
* discourse_start - character position where discourse element begins in the essay response
* discourse_end - character position where discourse element ends in the essay response
* discourse_text - text of discourse element
* discourse_type - classification of discourse element
* discourse_type_num - enumerated class label of discourse element
* predictionstring - the word indices of the training sample, as required for predictions
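
To make the relationship between the character offsets and `predictionstring` concrete, here is a minimal sketch. It assumes that word indices come from a plain whitespace split of the full essay text; the helper name `char_span_to_word_indices` and the sample essay are made up for illustration and are not part of the dataset.

```python
def char_span_to_word_indices(essay_text, start, end):
    """Return indices of whitespace-separated words that overlap [start, end)."""
    indices = []
    pos = 0
    for i, word in enumerate(essay_text.split()):
        # Locate this word in the original text, searching from the previous word's end
        word_start = essay_text.index(word, pos)
        word_end = word_start + len(word)
        pos = word_end
        if word_start < end and word_end > start:
            indices.append(i)
    return indices


# Hypothetical essay and discourse element (not taken from the dataset)
essay = "Phones are useful. Drivers should not use them."
discourse_start = essay.index("Drivers")
discourse_end = len(essay)

predictionstring = " ".join(
    str(i) for i in char_span_to_word_indices(essay, discourse_start, discourse_end)
)
print(predictionstring)
```

If the assumption about whitespace splitting holds, a model that predicts character spans could convert them to this format before submission.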

### Check for null values

In [None]:
df.isnull().sum()  # show every column; .head() would hide the last three

### Number of unique essays to work with

In [None]:
print(f"We have {df['id'].nunique()} essays")

### Distribution of target

In [None]:
sns.set_theme(style="darkgrid")
sns.set_palette('rainbow')
plt.figure(figsize=(8,6))
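# Take labels and counts from the same Series so that each bar gets the right label
# (unique() and value_counts() return the classes in different orders)
type_counts = df['discourse_type'].value_counts()
sns.barplot(y=type_counts.index, x=type_counts.values)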
plt.xlabel('Count of Discourse Type')
plt.title('Target Distribution')
plt.show()

### Basic text cleaning: lowercasing and removing URLs, @-mentions, punctuation and any other non-alphanumeric characters

In [None]:
def clean_text(sentence):
    sentence = sentence.lower()
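    # Raw string for the pattern; \bnan\b only strips standalone 'nan' tokens, not 'nan' inside words like 'finance'
    sentence = ' '.join(re.sub(r"(\bnan\b)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ", sentence).split())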
    return sentence

df['discourse_text'] = df['discourse_text'].map(clean_text)
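
To see what the pattern strips out, it helps to run it on a made-up string. The sketch below uses a standalone copy of the function, renamed `clean_text_demo` so it does not clash with the notebook's version. It assumes that `nan` should only be removed as a whole word (via `\b`), so that words such as "finance" survive; the sample sentence is invented.

```python
import re

# Alternatives, in order: standalone 'nan', @-mentions, any non-alphanumeric character, URLs
PATTERN = r"(\bnan\b)|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)"


def clean_text_demo(sentence):
    sentence = sentence.lower()
    # Replace every match with a space, then collapse repeated whitespace
    return " ".join(re.sub(PATTERN, " ", sentence).split())


sample = "Check https://example.com @user1 nan finance, it's GREAT!!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Note that apostrophes are replaced by spaces, so contractions such as "it's" are split into two tokens.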

### Let's first take a look at the average length (in characters) of a discourse element in each class

In [None]:
# The per-class frames are created here because this cell runs before the subjectivity cell that also defines them
df_claim = df.loc[df['discourse_type'] == 'Claim'].copy()
df_evidence = df.loc[df['discourse_type'] == 'Evidence'].copy()
df_position = df.loc[df['discourse_type'] == 'Position'].copy()
df_concluding_statement = df.loc[df['discourse_type'] == 'Concluding Statement'].copy()
df_lead = df.loc[df['discourse_type'] == 'Lead'].copy()
df_counterclaim = df.loc[df['discourse_type'] == 'Counterclaim'].copy()
df_rebuttal = df.loc[df['discourse_type'] == 'Rebuttal'].copy()

# Length of each discourse element in characters
for class_df in (df_claim, df_evidence, df_position, df_concluding_statement,
                 df_lead, df_counterclaim, df_rebuttal):
    class_df['discourse_text_len'] = class_df['discourse_text'].str.len()

# Mean length per class; each mean is taken over its own class
# (the Rebuttal total was previously divided by the Counterclaim row count)
claim_avg = df_claim['discourse_text_len'].mean()
evidence_avg = df_evidence['discourse_text_len'].mean()
position_avg = df_position['discourse_text_len'].mean()
concluding_statement_avg = df_concluding_statement['discourse_text_len'].mean()
lead_avg = df_lead['discourse_text_len'].mean()
counterclaim_avg = df_counterclaim['discourse_text_len'].mean()
rebuttal_avg = df_rebuttal['discourse_text_len'].mean()

sent_list = [claim_avg,evidence_avg,position_avg,concluding_statement_avg,lead_avg,counterclaim_avg,rebuttal_avg]

sent_df = {'Classes':['Claim', 'Evidence', 'Position', 'Concluding Statement','Lead','Counterclaim','Rebuttal'],
        'Length': sent_list}
sent_df = pd.DataFrame(sent_df)


sns.set_style('darkgrid')
style.use('seaborn-pastel')

plot = sns.catplot(
    data=sent_df, kind="bar",
    x="Classes", y="Length",height = 8, aspect=9.7/6.27,
    legend=False
)

plot.fig.suptitle("Average Sentence Length");


### ---> The distribution is very uneven: Evidence has a substantially higher average length than the other classes, followed by Concluding Statement and Lead.

### ---> Lead is the introduction, Concluding Statement is the ending and Evidence is the crux of an argumentative essay, so it stands to reason that these are the three longest classes.

### ---> Similar classes end up close to each other: Claim and Position have comparable average lengths, as do Counterclaim and Rebuttal.
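
The lengths above are measured in characters, not words or sentences. As a rough cross-check, the sketch below computes both the average character count and the average word count per class in plain Python. The `rows` list is made-up stand-in data rather than rows from `df`, and splitting on whitespace is only an approximation of a real tokenizer.

```python
from collections import defaultdict
from statistics import mean

# Made-up (discourse_type, discourse_text) pairs standing in for rows of df
rows = [
    ("Claim", "phones distract drivers"),
    ("Claim", "texting slows reactions"),
    ("Evidence", "a study found that drivers who text are far more likely to crash"),
    ("Position", "drivers should not use phones"),
]

char_lens = defaultdict(list)
word_lens = defaultdict(list)
for label, text in rows:
    char_lens[label].append(len(text))          # length in characters
    word_lens[label].append(len(text.split()))  # length in whitespace-separated words

avg_chars = {label: mean(values) for label, values in char_lens.items()}
avg_words = {label: mean(values) for label, values in word_lens.items()}

for label in avg_chars:
    print(label, avg_chars[label], avg_words[label])
```

If the two measures rank the classes differently on the real data, that would hint at differences in word length between classes rather than in the number of words.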

### The TextBlob library lets us compute the subjectivity of a text. Summing this measure over each class might offer some insight. The higher the score, the higher the subjectivity.

In [None]:
def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

df_claim = df.loc[df['discourse_type'] == 'Claim']
claim_score = df_claim['discourse_text'].astype(str).apply(get_subjectivity).sum()

df_evidence = df.loc[df['discourse_type'] == 'Evidence']
evidence_score = df_evidence['discourse_text'].astype(str).apply(get_subjectivity).sum()

df_position  = df.loc[df['discourse_type'] == "Position"]
position_score = df_position['discourse_text'].astype(str).apply(get_subjectivity).sum()


df_concluding_statement = df.loc[df['discourse_type'] == 'Concluding Statement']
concluding_statement_score = df_concluding_statement['discourse_text'].astype(str).apply(get_subjectivity).sum()

df_lead = df.loc[df['discourse_type'] == "Lead"]
lead_score = df_lead['discourse_text'].astype(str).apply(get_subjectivity).sum()

df_counterclaim = df.loc[df['discourse_type'] == 'Counterclaim']
counterclaim_score = df_counterclaim['discourse_text'].astype(str).apply(get_subjectivity).sum()

df_rebuttal = df.loc[df['discourse_type'] == 'Rebuttal']
rebuttal_score = df_rebuttal['discourse_text'].astype(str).apply(get_subjectivity).sum()

scores_list = [claim_score,evidence_score,position_score,concluding_statement_score,lead_score,counterclaim_score,rebuttal_score]

scores_df = {'Classes':['Claim', 'Evidence', 'Position', 'Concluding Statement','Lead','Counterclaim','Rebuttal'],
        'Count':scores_list}
scores_df = pd.DataFrame(scores_df)


sns.set_style('darkgrid')
style.use('seaborn-pastel')


plot = sns.catplot(
    data=scores_df, kind="bar",
    x="Classes", y="Count",height = 8, aspect=9.7/6.27,
    legend=False
)

plot.fig.suptitle("Subjectivity Score");



### ---> Evidence has around 5k fewer data points than Claim but still outscores it substantially on summed subjectivity. This is surprising, since evidential statements should be less subjective than mere claims.

### ---> The same holds for the Position and Concluding Statement pair: a concluding statement, which mostly restates points already made, should be *less* subjective than a position statement.

### ---> Both Counterclaim and Rebuttal essentially contradict an opposing statement with facts, which may be why they have almost the same subjectivity score (after accounting for the difference in data points).

### ---> Lead's relatively low subjectivity score can be explained by the fact that many openings start with a fact, a statistic or a paraphrased quote.
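
Because the scores above are summed over each class, they grow with the number of data points as well as with how subjective the text is. One way to separate the two effects is to divide by the class size. The sketch below illustrates the difference on made-up scores; with the real data the lists would hold the per-row output of `get_subjectivity`.

```python
# Made-up subjectivity scores per class (illustrative only, not computed from the dataset)
scores_by_class = {
    "Claim": [0.6, 0.4, 0.5, 0.5],
    "Evidence": [0.7, 0.9],
}

# Summed score: depends on class size as well as on the individual scores
summed = {label: sum(values) for label, values in scores_by_class.items()}

# Mean score: comparable across classes of different sizes
averaged = {label: sum(values) / len(values) for label, values in scores_by_class.items()}

largest_sum = max(summed, key=summed.get)
largest_mean = max(averaged, key=averaged.get)
print(largest_sum, largest_mean)
```

In this toy case the larger class wins on the sum while the smaller class wins on the mean, which is the kind of effect worth ruling out before reading too much into the bar chart.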

### Wordcloud sorted by class to get the most frequent words

In [None]:
# Each class corpus is built inline: the *_corpus variables are only defined in the RAKE cells further down
wc_kwargs = dict(max_font_size=50, max_words=100, background_color="white", collocation_threshold=3)
wc_claim = WordCloud(**wc_kwargs).generate(" ".join(df_claim['discourse_text'].astype(str)))
wc_evidence = WordCloud(**wc_kwargs).generate(" ".join(df_evidence['discourse_text'].astype(str)))
wc_position = WordCloud(**wc_kwargs).generate(" ".join(df_position['discourse_text'].astype(str)))
wc_lead = WordCloud(**wc_kwargs).generate(" ".join(df_lead['discourse_text'].astype(str)))
wc_concluding_statement = WordCloud(**wc_kwargs).generate(" ".join(df_concluding_statement['discourse_text'].astype(str)))
wc_counterclaim = WordCloud(**wc_kwargs).generate(" ".join(df_counterclaim['discourse_text'].astype(str)))
wc_rebuttal = WordCloud(**wc_kwargs).generate(" ".join(df_rebuttal['discourse_text'].astype(str)))

fig = plt.figure(figsize=(12,6))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_claim,interpolation="bilinear")
ax1 = plt.title("Claim", fontsize=20)

ax2 = fig.add_subplot(122)
ax2 = plt.imshow(wc_evidence,interpolation="bilinear")
ax2 = plt.title("Evidence", fontsize=20)

fig = plt.figure(figsize=(12,6))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_position,interpolation="bilinear")
ax1 = plt.title("Position", fontsize=20)

ax2 = fig.add_subplot(122)
ax2 = plt.imshow(wc_lead,interpolation="bilinear")
ax2 = plt.title("Lead", fontsize=20)

fig = plt.figure(figsize=(12,6))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_concluding_statement,interpolation="bilinear")
ax1 = plt.title("Concluding Statement", fontsize=20)

ax2 = fig.add_subplot(122)
ax2 = plt.imshow(wc_counterclaim,interpolation="bilinear")
ax2 = plt.title("Counterclaim", fontsize=20)


fig = plt.figure(figsize=(12,6))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_rebuttal,interpolation="bilinear")
ax1 = plt.title("Rebuttal", fontsize=20)

plt.show()



### ---> Common words like 'people', 'think' and 'student' are the most frequent, although the frequency of the word 'Electoral' across multiple classes is a bit surprising.

### ---> This suggests that the essays cover only a few topics, and that despite the authors ranging from 6th to 12th grade, the vocabulary stays fairly consistent at roughly middle-school level.

### RAKE (Rapid Automatic Keyword Extraction) does a decent job of extracting top phrases from a corpus. We use the rake-nltk library to do this for each class and spaCy to highlight the entity types present in the **top 100** phrases.

In [None]:
!pip install rake-nltk
from rake_nltk import Rake
r = Rake()
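
For intuition about what RAKE ranks, here is a simplified plain-Python sketch of the idea: split the text into candidate phrases at stopwords and punctuation, score each word by degree divided by frequency, and score each phrase as the sum of its word scores. This is not the rake-nltk implementation; the function name `rake_sketch` and the tiny `STOPWORDS` set are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; rake-nltk uses NLTK's full English list
STOPWORDS = {"a", "the", "is", "of", "and", "to", "in", "for", "are", "be", "should"}


def rake_sketch(text):
    # Tokenize into lowercase words and single punctuation marks
    tokens = re.findall(r"[a-z0-9]+|[^a-z0-9\s]", text.lower())

    # Candidate phrases are runs of words between stopwords or punctuation
    phrases, current = [], []
    for token in tokens:
        if token in STOPWORDS or not token.isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)

    # Word score = degree / frequency, where degree counts co-occurrences within phrases
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}

    # Phrase score = sum of its word scores, highest first
    scored = [(sum(word_score[w] for w in phrase), " ".join(phrase)) for phrase in phrases]
    return sorted(scored, reverse=True)


ranked = rake_sketch("The electoral college is unfair. Students should vote.")
print(ranked[0])
```

Longer phrases made of words that rarely occur alone score highest, which is why the top-ranked RAKE phrases tend to be long.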

### Top Phrases for Claim

In [None]:
claim_corpus = df_claim['discourse_text'].tolist()
claim_corpus = " ".join(claim_corpus)
r.extract_keywords_from_text(claim_corpus)
top_claims = r.get_ranked_phrases()[0:100]
top_claims = " ".join(top_claims)
displacy.render(nlp(top_claims),style='ent',jupyter=True)

### Top phrases for Evidence

In [None]:
evidence_corpus = df_evidence['discourse_text'].tolist()
evidence_corpus = " ".join(evidence_corpus)
r.extract_keywords_from_text(evidence_corpus)
top_evidence = r.get_ranked_phrases()[0:100]
top_evidence = " ".join(top_evidence)
displacy.render(nlp(top_evidence),style='ent')

### Top phrases for Position

In [None]:
position_corpus = df_position['discourse_text'].tolist()
position_corpus = " ".join(position_corpus)
r.extract_keywords_from_text(position_corpus)
top_position = r.get_ranked_phrases()[0:100]
top_position = " ".join(top_position)
displacy.render(nlp(top_position),style='ent')

### Top Phrases for Concluding Statement

In [None]:
concluding_statement_corpus = df_concluding_statement['discourse_text'].tolist()
concluding_statement_corpus = " ".join(concluding_statement_corpus)
r.extract_keywords_from_text(concluding_statement_corpus)
top_concluding_statement = r.get_ranked_phrases()[0:100]
top_concluding_statement = " ".join(top_concluding_statement)
displacy.render(nlp(top_concluding_statement),style='ent')

### Top Phrases for Lead Statement

In [None]:
lead_corpus = df_lead['discourse_text'].tolist()
lead_corpus = " ".join(lead_corpus)
r.extract_keywords_from_text(lead_corpus)
top_lead = r.get_ranked_phrases()[0:100]
top_lead = " ".join(top_lead)
displacy.render(nlp(top_lead),style='ent')

### Top Phrases for Counterclaim 

In [None]:
counterclaim_corpus = df_counterclaim['discourse_text'].tolist()
counterclaim_corpus = " ".join(counterclaim_corpus)
r.extract_keywords_from_text(counterclaim_corpus)
top_counterclaim = r.get_ranked_phrases()[0:100]
top_counterclaim = " ".join(top_counterclaim)
displacy.render(nlp(top_counterclaim),style='ent')

### Top Phrases for Rebuttal 

In [None]:
rebuttal_corpus = df_rebuttal['discourse_text'].tolist()
rebuttal_corpus = " ".join(rebuttal_corpus)
r.extract_keywords_from_text(rebuttal_corpus)
top_rebuttal = r.get_ranked_phrases()[0:100]
top_rebuttal = " ".join(top_rebuttal)
displacy.render(nlp(top_rebuttal),style='ent')

### Stopword count by Class

In [None]:
def stopword_counter(corpus):
    count = 0  
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(corpus)
    for w in word_tokens:
        if w in stop_words:
            count = count+1
 
    return count

claim_count = stopword_counter(claim_corpus)
evidence_count = stopword_counter(evidence_corpus)
position_count = stopword_counter(position_corpus)
concluding_statement_count = stopword_counter(concluding_statement_corpus)
lead_count = stopword_counter(lead_corpus)
counterclaim_count = stopword_counter(counterclaim_corpus)
rebuttal_count = stopword_counter(rebuttal_corpus)
stopwords_count = [claim_count,evidence_count,position_count,concluding_statement_count,lead_count,counterclaim_count,rebuttal_count]

stopwords_df = {'Classes':['Claim', 'Evidence', 'Position', 'Concluding Statement','Lead','Counterclaim','Rebuttal'],
        'Stopword Count':stopwords_count}
stopwords_df = pd.DataFrame(stopwords_df)
stopwords_df
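
Raw stopword counts mostly reflect how much text each class contains. A share of tokens that are stopwords is easier to compare across classes. The sketch below shows the calculation in plain Python; the small `STOPWORDS` set is a stand-in for `stopwords.words('english')`, whitespace splitting stands in for `word_tokenize`, and `stopword_ratio` is a hypothetical helper name.

```python
# Stand-in for NLTK's English stopword list (illustrative subset only)
STOPWORDS = {"the", "a", "is", "to", "of", "and", "that", "not"}


def stopword_ratio(corpus):
    """Fraction of whitespace-separated tokens that are stopwords."""
    tokens = corpus.split()
    if not tokens:
        return 0.0
    return sum(token in STOPWORDS for token in tokens) / len(tokens)


ratio = stopword_ratio("the phone is a distraction to drivers")
print(round(ratio, 3))
```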



### Sentiment (polarity) score summed by class. A higher score indicates more positive statements

In [None]:
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

df_claim = df.loc[df['discourse_type'] == 'Claim']
claim_score = df_claim['discourse_text'].astype(str).apply(getPolarity).sum()

df_evidence = df.loc[df['discourse_type'] == 'Evidence']
evidence_score = df_evidence['discourse_text'].astype(str).apply(getPolarity).sum()

df_position  = df.loc[df['discourse_type'] == "Position"]
position_score = df_position['discourse_text'].astype(str).apply(getPolarity).sum()

df_concluding_statement = df.loc[df['discourse_type'] == 'Concluding Statement']
concluding_statement_score = df_concluding_statement['discourse_text'].astype(str).apply(getPolarity).sum()

df_lead = df.loc[df['discourse_type'] == "Lead"]
lead_score = df_lead['discourse_text'].astype(str).apply(getPolarity).sum()

df_counterclaim = df.loc[df['discourse_type'] == 'Counterclaim']
counterclaim_score = df_counterclaim['discourse_text'].astype(str).apply(getPolarity).sum()

df_rebuttal = df.loc[df['discourse_type'] == 'Rebuttal']
rebuttal_score = df_rebuttal['discourse_text'].astype(str).apply(getPolarity).sum()

sentiment_list = [claim_score,evidence_score,position_score,concluding_statement_score,lead_score,counterclaim_score,rebuttal_score]

sentiment_df = {'Classes':['Claim', 'Evidence', 'Position', 'Concluding Statement','Lead','Counterclaim','Rebuttal'],
        'Score':sentiment_list}
sentiment_df = pd.DataFrame(sentiment_df)


sns.set_style('darkgrid')
style.use('seaborn-pastel')

plot = sns.catplot(
    data=sentiment_df, kind="bar",
    x="Classes", y="Score",
    height=8, aspect=9.7/6.27, legend=False
)

plot.fig.suptitle("Sentiment Score")

### ---> The summed sentiment is roughly proportional to the number of data points in each class, suggesting that most essays are bereft of words that carry strong sentiment.

### ---> The essays are written in a neutral tone, with facts used more than personal preferences to justify a position, which could be why sentiment here tracks the amount of text in a class rather than its meaning.

An interesting approach to EDA is to group similar classes and see whether a classifier can be trained to discriminate between them with reasonable accuracy.
Being able to tell them apart could assist text segmentation, which could in turn serve as an intermediate step for the argumentative/non-argumentative classification. Here we try out two pairs: the first is (Claim, Position) and the second is (Counterclaim, Rebuttal).

In [None]:
# Stack the Claim and Position rows into one frame with a text column and a label column
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_train = pd.concat([df_claim, df_position], ignore_index=True)
df_train = df_train[['discourse_text', 'discourse_type']].rename(
    columns={'discourse_text': 'txt', 'discourse_type': 'label'})


In [None]:
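# Fixed seed for a reproducible split; stratify keeps the class ratio equal in both parts
X_train,X_test,y_train,y_test = train_test_split(df_train['txt'],df_train['label'],
                                                 random_state=42,stratify=df_train['label'])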

In [None]:
SD_clf = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,2))),('clf',  SGDClassifier())])

SD_clf.fit(X_train,y_train)
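
With `ngram_range=(1,2)` the vectorizer builds features from single words and from pairs of adjacent words. The plain-Python sketch below approximates which features come out of one sentence. It assumes the default lowercasing and a token pattern that keeps only words of two or more characters; `word_ngrams` is a hypothetical helper, not part of scikit-learn, and it leaves out the TF-IDF weighting itself.

```python
import re


def word_ngrams(text, ngram_range=(1, 2)):
    # Lowercase and keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


features = word_ngrams("I think phones should be banned")
print(features)
```

Bigrams such as "should be" let a linear model pick up short phrases that single words would miss.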

In [None]:
SD_clf.score(X_test,y_test)

### About 88% accuracy, suggesting that the two classes have sufficiently distinctive features to tell them apart.

In [None]:
pred_sd = SD_clf.predict(X_test)
# classification_report expects the true labels first, then the predictions
pd.DataFrame(classification_report(y_test,pred_sd,output_dict=True)).T
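
The order of the arguments to `classification_report` matters: the true labels come first and the predictions second. Passing them the other way round exchanges precision and recall for every class. The plain-Python sketch below shows the effect on made-up labels; `precision_recall` is a hypothetical helper written for this illustration.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from paired label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration
y_true = ["Claim", "Claim", "Claim", "Position"]
y_pred = ["Claim", "Position", "Position", "Position"]

correct_order = precision_recall(y_true, y_pred, "Claim")
swapped_order = precision_recall(y_pred, y_true, "Claim")
print(correct_order, swapped_order)
```

Accuracy is unaffected by the swap, so the mistake only shows up in the per-class columns of the report.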


In [None]:
eli5.show_weights(SD_clf)

### ---> Words like 'should', 'believe' and 'think' are generally used in normative sentences, which tracks because the student is writing about their *opinion*.

### ---> 'reason', 'because' and 'also' are words usually used when elaborating on a Claim, so that tracks too.

### ---> There is a clear pattern of words/phrases for both classes, with Position sentences usually expressing normative sentiments and Claim sentences giving the rationale that supports them.

In [None]:
# Stack the Counterclaim and Rebuttal rows into one frame with a text column and a label column
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_train_2 = pd.concat([df_counterclaim, df_rebuttal], ignore_index=True)
df_train_2 = df_train_2[['discourse_text', 'discourse_type']].rename(
    columns={'discourse_text': 'txt', 'discourse_type': 'label'})


In [None]:
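# Fixed seed for a reproducible split; stratify keeps the class ratio equal in both parts
X_train_2,X_test_2,y_train_2,y_test_2 = train_test_split(df_train_2['txt'],df_train_2['label'],
                                                         random_state=42,stratify=df_train_2['label'])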

In [None]:
SD_clf_2 = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1,2))),('clf',  SGDClassifier())])

SD_clf_2.fit(X_train_2,y_train_2)

In [None]:
SD_clf_2.score(X_test_2,y_test_2)

### ---> Only about 80 percent accuracy here, possibly because there are far fewer data points than in the previous case and because the two classes overlap thematically.
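
An accuracy figure is easier to judge next to a majority-class baseline, that is, the accuracy obtained by always predicting the more frequent class. The sketch below computes that baseline in plain Python. The label counts are made up; on the real data `labels` would be `y_test_2`.

```python
from collections import Counter

# Made-up label distribution (illustrative only, not the dataset's real counts)
labels = ["Counterclaim"] * 6 + ["Rebuttal"] * 4

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_label, baseline_accuracy)
```

The closer the baseline is to the classifier's score, the less the classifier has actually learned about the difference between the two classes.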



In [None]:
pred_sd_2 = SD_clf_2.predict(X_test_2)
# classification_report expects the true labels first, then the predictions
pd.DataFrame(classification_report(y_test_2,pred_sd_2,output_dict=True)).T

In [None]:
eli5.show_weights(SD_clf_2)

### ---> Thematically, Counterclaim and Rebuttal are very similar, as both refute a statement, so the words that serve as features for detecting them are similar too.

### ---> As expected, many words used for negation or contrast show up in both classes: "no", "however", "although" etc.

### ---> It's difficult to find much semantic/linguistic difference between these 2 classes.

The notebook is a work in progress and will be updated with more EDA and modelling content.