## [Kaggle Link](https://www.kaggle.com/c/feedback-prize-2021)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [None]:
df =pd.read_csv("../input/feedback-prize-2021/train.csv")
df.head()

## 1.1 About Data
The dataset contains argumentative essays written by U.S. students in grades 6-12. The essays were annotated by expert raters for elements commonly found in argumentative writing.

The task is to predict the human annotations. This requires first segmenting each essay into discrete rhetorical and argumentative elements (i.e., discourse elements such as Lead, Claim, and Evidence) and then classifying each element as one of the following:

* **Lead** - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
* **Position** - an opinion or conclusion on the main question
* **Claim** - a claim that supports the position
* **Counterclaim** - a claim that refutes another claim or gives an opposing reason to the position
* **Rebuttal**  - a claim that refutes a counterclaim
* **Evidence**  - ideas or examples that support claims, counterclaims, or rebuttals.
* **Concluding Statement**  - a concluding statement that restates the claims


`train.csv` contains the annotated version of all essays in the training set. Its columns are:

* **id** - ID code for essay response
* **discourse_id** - ID code for discourse element
* **discourse_start** - character position where discourse element begins in the essay response
* **discourse_end** - character position where discourse element ends in the essay response
* **discourse_text** - text of discourse element
* **discourse_type** - classification of discourse element
* **discourse_type_num** - enumerated class label of discourse element
* **predictionstring** - the word indices of the training sample, as required for predictions
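
To make the relationship between the character offsets and `predictionstring` concrete, here is a minimal sketch. It assumes that word indices come from splitting the essay on whitespace and that the span starts at a word boundary. The helper name `char_span_to_predictionstring` and the sample sentence are hypothetical and not part of the competition data.

```python
def char_span_to_predictionstring(text: str, start: int, end: int) -> str:
    """Return the whitespace-token indices covered by text[start:end].

    Hypothetical helper; assumes `start` falls on a word boundary.
    """
    first_word = len(text[:start].split())   # number of words before the span
    n_words = len(text[start:end].split())   # number of words inside the span
    return " ".join(str(i) for i in range(first_word, first_word + n_words))


essay = "Phones are useful. They help students learn."  # made-up sample text
start = essay.index("They")
print(char_span_to_predictionstring(essay, start, len(essay)))  # 3 4 5 6
```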

In [None]:
train_dir ="../input/feedback-prize-2021/train"
test_dir = "../input/feedback-prize-2021/test"

In [None]:
print(f"Total number of Rows/Records : {len(df.id)} ")
print(f"Total number of Files data in CSV : {len(df.groupby('id'))} ")
print(f"="*50)
print(f"Total Number of files in Train Folder : { len(os.listdir(train_dir))}")
print(f"Total Number of files in Test Folder : { len(os.listdir(test_dir))}")

## 1.2 EDA

As explained in About Data, each discourse element must be classified into one of the categories listed above, so we start by looking at how those categories are distributed.

In [None]:
plt.rcParams["figure.figsize"] = (15,8)
plt.title("Discourse_type Distribution in train Dataset",fontsize=20)
plt.xlabel("Classes")
# plt.xticks(rotation=60)
plt.ylabel("Records Count")
counts = df.discourse_type.value_counts()
plt.bar(counts.index, counts.values, color=plt.rcParams['axes.prop_cycle'].by_key()['color'])
# Add the count as a text label above each bar
for index, count in enumerate(counts):
  plt.text(x=index, y=count+1, s=f"{count}", fontdict=dict(fontsize=15), ha="center", bbox=dict(facecolor='wheat', boxstyle='square', edgecolor='black', pad=0.1))
plt.tight_layout()
plt.show()

In [None]:
# discourse_type_num
plt.rcParams["figure.figsize"] = (15,8)
plt.title("discourse_type_num Distribution in train Dataset",fontsize=20)
plt.xlabel("Classes")
plt.xticks(rotation=90)
plt.ylabel("Records Count")
counts = df.discourse_type_num.value_counts()
plt.bar(counts.index, counts.values, color=plt.rcParams['axes.prop_cycle'].by_key()['color'])
# Add the count as a text label above each bar
for index, count in enumerate(counts):
  plt.text(x=index, y=count+1, s=f"{count}", fontdict=dict(fontsize=8), rotation=90, ha="center", bbox=dict(facecolor='wheat', boxstyle='square', edgecolor='black', pad=0.5))
plt.tight_layout()
plt.show()
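
The `discourse_type_num` values plotted above appear to combine the discourse type with a running counter (for example `Claim 2`). Below is a small sketch of separating the two parts, assuming the counter is always the last space-separated token. `split_type_num` is a hypothetical helper name, and the sample labels are illustrative.

```python
from typing import Tuple


def split_type_num(label: str) -> Tuple[str, int]:
    """Split 'Concluding Statement 1' into ('Concluding Statement', 1)."""
    # Split once from the right so multi-word types stay intact
    discourse_type, number = label.rsplit(" ", 1)
    return discourse_type, int(number)


for label in ["Claim 2", "Concluding Statement 1"]:
    print(split_type_num(label))
```

With the DataFrame loaded earlier, this could be applied as `df['discourse_type_num'].map(split_type_num)`.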

## Read Text of File

In [None]:
# Read a couple of files to check how the text is stored
for i in os.listdir(train_dir)[:2]:
  print(f"\033[1m File Name is : {i} \033[0m ")
  with open(train_dir+'/'+i, 'r') as file: 
    data = file.read()
    print(data,end="\n")
  print("="*200)

## Checking Length of Every Document 

In [None]:
file_data =[]
for i in os.listdir(train_dir):
  data={}
  with open(train_dir+'/'+i,'r') as file:
    text_data=file.read()
    data['file_name']=i
    data['text_data']=text_data
  file_data.append(data)  

# Convert the list of dicts to a DataFrame
file_df = pd.DataFrame(file_data)
file_df['text_len'] =file_df['text_data'].apply(len)
file_df.head()

In [None]:
plt.title("Distribution of text length per file")
plt.xlabel("Text length (characters)")
file_df['text_len'].plot(kind='hist',bins=100)
plt.show()

The histogram shows a few documents that are much longer than average (note that `text_len` counts characters, not words). Let's inspect them manually.

In [None]:
file_df[file_df['text_len']>8000]

I checked a few files, such as 8895 and 5866, and found extra whitespace made of special characters like `\xa0` (the non-breaking space, `&nbsp;` in HTML), which should be removed. You can see this in the row below.
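
As a quick illustration of why normalization helps here: `\xa0` is the Unicode non-breaking space, and NFKD normalization maps it to an ordinary space. The sample strings below are made up. One caveat worth checking: NFKD can also decompose accented letters, which changes the string length and could shift the character offsets stored in `discourse_start` and `discourse_end`.

```python
import unicodedata

raw = "Hello\xa0world"                       # made-up sample with a non-breaking space
clean = unicodedata.normalize("NFKD", raw)
print(repr(clean))                           # 'Hello world'

accented = "caf\u00e9"                       # 'é' stored as a single code point
decomposed = unicodedata.normalize("NFKD", accented)
print(len(accented), len(decomposed))        # 4 5
```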

---



In [None]:
file_df[file_df['text_len']>6000].text_data.loc[11236]

In [None]:
import unicodedata
file_data =[]
for i in os.listdir(train_dir):
  data={}
  with open(train_dir+'/'+i,'r') as file:
    text_data=file.read()
    data['file_name']=i
    # NFKD normalization turns \xa0 into a regular space; strip() removes leading/trailing whitespace
    data['text_data']=unicodedata.normalize("NFKD",text_data).strip()
  file_data.append(data)  

# Convert the list of dicts to a DataFrame
file_df = pd.DataFrame(file_data)
file_df['text_len'] =file_df['text_data'].apply(len)
file_df.head()

In [None]:
plt.title("Distribution of text length per file")
plt.xlabel("Text length (characters)")
file_df['text_len'].plot(kind='hist',bins=100)
plt.show()

In [None]:
file_df[file_df['text_len']>6000]

## Length and Label comparison

In [None]:
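# Count whitespace-separated words (not characters), matching the "number of words" plot labels below
df['discourse_len'] = df['discourse_text'].str.split().str.len()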

In [None]:
#https://www.kaggle.com/erikbruin/nlp-on-student-writing-eda
from matplotlib.ticker import FuncFormatter

fig = plt.figure(figsize=(12,8))

ax1 = fig.add_subplot(211)
df.groupby('discourse_type')['discourse_len'].mean().sort_values().plot(kind="barh", ax=ax1)
ax1.set_title("Average number of words versus Discourse Type", fontsize=14, fontweight = 'bold')
ax1.set_xlabel("Average number of words", fontsize = 10)
ax1.set_ylabel("")

ax2 = fig.add_subplot(212)
df.groupby('discourse_type')['discourse_type'].count().sort_values().plot(kind="barh", ax=ax2)
ax2.get_xaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ','))) #add thousands separator
ax2.set_title("Frequency of Discourse Type in all essays", fontsize=14, fontweight = 'bold')
ax2.set_xlabel("Frequency", fontsize = 10)
ax2.set_ylabel("")

plt.tight_layout(pad=2)
plt.show()

## Visualize using spaCy

displaCy did not color the entities when the labels were lowercase, so we convert them to uppercase.

In [None]:
labels = df.discourse_type.unique().tolist()
labels = list(map(str.upper,labels))
print(labels)

In [None]:
# https://www.kaggle.com/thedrcat/feedback-prize-eda-with-displacy
import spacy
from spacy import displacy


def visualize(example):
  # One hex color per (uppercase) discourse type
  colors = {
    'LEAD': '#8000FF',
    'POSITION': '#2B7FF6',
    'EVIDENCE': '#2ADDDD',
    'CLAIM': '#80FFB4',
    'CONCLUDING STATEMENT': '#D4DD80',  # leading '#' was missing
    'COUNTERCLAIM': '#FF8042',
    'REBUTTAL': '#FF0000'
  }
  ents = []
  for _, row in df[df['id'] == example].iterrows():
    ents.append({
      'start': int(row['discourse_start']),
      'end': int(row['discourse_end']),
      'label': row['discourse_type'].upper()  # upper case
    })
  with open(train_dir+"/"+example+'.txt', 'r') as file: 
    data = file.read()
  doc = {
      "text": data,
      "ents": ents,
      "title": example
  }
  options = {"ents": labels, "colors": colors}
  displacy.render(doc, style="ent", options=options, manual=True, jupyter=True)

In [None]:
for i in df['id'].sample(n=5,random_state=10).values.tolist():
  visualize(i)
  print("\n\n")
  print("="*120)


**Observations from the visualization above**

* CLAIM elements in particular often appear back to back, and the annotations sometimes skip a few words or sentences between elements, which makes prediction more complicated.
* In most essays the Concluding Statement is at the end.
* As seen at the start of the EDA section, the most frequent tags are Claim, Evidence, Position, and Concluding Statement.
* The CSV also provides numbered labels (Claim 1, Claim 2, etc.) in `discourse_type_num`: when a tag repeats within an essay, its number increases. We only need to predict the tag itself, not the number.
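
The first observation (annotations sometimes skip words or sentences) can be checked numerically. The following is a minimal sketch that lists the character ranges of an essay not covered by any annotated span. `find_gaps` and the sample offsets are hypothetical, and the spans are assumed to be `(discourse_start, discourse_end)` pairs for a single essay.

```python
from typing import List, Tuple


def find_gaps(spans: List[Tuple[int, int]], text_len: int) -> List[Tuple[int, int]]:
    """Return the character ranges in [0, text_len) not covered by any span."""
    gaps = []
    cursor = 0  # end of the region covered so far
    for start, end in sorted(spans):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < text_len:
        gaps.append((cursor, text_len))
    return gaps


# Made-up offsets for a 100-character essay
print(find_gaps([(0, 40), (41, 70), (80, 95)], text_len=100))
# [(40, 41), (70, 80), (95, 100)]
```

Single-character gaps such as `(40, 41)` are usually just the whitespace between adjacent elements; longer gaps point to text that was left unannotated.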