For parts of the EDA I took inspiration from [Sanskar Hasija](https://www.kaggle.com/odins0n/feedback-prize-eda). If you find this notebook useful, please upvote his notebook too!

# <center>IMPORTS</center> 

In [None]:
# Some EDA elements have been inspired by Sanskar Hasija! if you like the notebook please upvote his as well

import os
import spacy
import wordcloud
import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go


In [None]:
train_dir = "../input/feedback-prize-2021/train"
test_dir = "../input/feedback-prize-2021/test"

# Build the full path to every essay file
train_files = [f"{train_dir}/{name}" for name in os.listdir(train_dir)]
test_files = [f"{test_dir}/{name}" for name in os.listdir(test_dir)]
    
train = pd.read_csv("../input/feedback-prize-2021/train.csv")
sub = pd.read_csv("../input/feedback-prize-2021/sample_submission.csv")

# <center>EDA</center> 

In [None]:
print("Total number of train files = " , len(train_files))
print("Total number of test files = " , len(test_files))

### Train Essay Sample

In [None]:
# Use a context manager so the file handle is closed after reading
with open(train_files[0], "r") as f:
    print(f.read())

### Test Essay Sample

In [None]:
# Use a context manager so the file handle is closed after reading
with open(test_files[3], "r") as f:
    print(f.read())

## Train Tabular Dataframe

### Column Description
* **id** - ID code for essay response
* **discourse_id** - ID code for discourse element
* **discourse_start** - character position where discourse element begins in the essay response
* **discourse_end** - character position where discourse element ends in the essay response
* **discourse_text** - text of discourse element
* **discourse_type** - classification of discourse element
* **discourse_type_num** - enumerated class label of discourse element
* **predictionstring** - the word indices of the training sample, as required for predictions
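
To make the relation between the character offsets and `predictionstring` concrete, here is a minimal sketch. It assumes the word indices come from plain whitespace tokenization of the essay and that every span starts on a word boundary. The helper name `span_to_predictionstring` and the toy essay are hypothetical, not part of the dataset.

```python
def span_to_predictionstring(text: str, start: int, end: int) -> str:
    """Map a character span to space-separated word indices.

    Assumption: words are whitespace-separated and the span
    begins on a word boundary.
    """
    first_word = len(text[:start].split())   # words before the span
    n_words = len(text[start:end].split())   # words inside the span
    return " ".join(str(i) for i in range(first_word, first_word + n_words))


# Toy essay (made up for illustration)
essay = "Phones are useful. Drivers should not use them while driving."
start = essay.index("Drivers")
end = len(essay)
print(span_to_predictionstring(essay, start, end))
```

On real data the same helper could be applied row by row to `discourse_start` and `discourse_end` and compared against the `predictionstring` column to check how well the assumption holds.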

### Quick view of Train Dataframe

In [None]:
train.head()

### Basic statistics of training data

In [None]:
print("Number of rows in train dataframe = " , len(train))

In [None]:
train.describe()

### Null Values 

In [None]:
train.isnull().sum()

### Quick view of Submission File

In [None]:
sub.head()

# <center>DATA DISTRIBUTION</center> 

In short, we have **15594** essays submitted by students and **144293** identified discourse elements.

In [None]:
print(f"Average number of discourse elements per essay: {len(train)/len(train_files):.2f}")

### The 7 Discourse Types

* **Lead** - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
* **Position** - an opinion or conclusion on the main question
* **Claim** - a claim that supports the position
* **Counterclaim** - a claim that refutes another claim or gives an opposing reason to the position
* **Rebuttal** - a claim that refutes a counterclaim
* **Evidence** - ideas or examples that support claims, counterclaims, or rebuttals.
* **Concluding Statement** - a concluding statement that restates the claims

### Discourse Type Distribution

In [None]:
# Count rows per discourse type once instead of rescanning the column per class
type_counts = train["discourse_type"].value_counts()
fig = px.bar(x = type_counts.index,
             y = type_counts.values,
             color = type_counts.index,
             color_continuous_scale="Emrld")
fig.update_xaxes(title="Classes")
fig.update_xaxes(categoryorder="total descending")
fig.update_yaxes(title = "Number of Rows")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Discourse Type Distribution ',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

A few insights:
* Essays typically contain more than one Claim and more than one Evidence element
* Most essays appear to have one Position and one Concluding Statement
* Some students provide a Lead
* Counterclaim and Rebuttal are infrequent (it looks like most students agree with the statements)
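
These observations can be checked by counting each discourse type per essay. The sketch below uses a small made-up dataframe with the same `id` and `discourse_type` columns as `train`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the train dataframe (values are made up)
toy = pd.DataFrame({
    "id": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "discourse_type": [
        "Lead", "Position", "Claim", "Claim", "Evidence",
        "Position", "Claim", "Evidence", "Concluding Statement",
    ],
})

# Rows: essays, columns: discourse types, values: element counts
per_essay = (
    toy.groupby(["id", "discourse_type"])
       .size()
       .unstack(fill_value=0)
)

# Average number of elements of each type per essay
print(per_essay.mean().sort_values(ascending=False))
```

Replacing `toy` with `train` should give the per-essay averages for the real data; a mean well above 1 for Claim and Evidence would support the first bullet.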

### Enumerated class label of Discourse Element Distribution

In [None]:
# Count rows per enumerated class label once instead of rescanning the column per class
num_counts = train["discourse_type_num"].value_counts()
fig = px.bar(x = num_counts.index,
             y = num_counts.values,
             color = num_counts.index,
             color_continuous_scale="blues")
fig.update_xaxes(title="Classes")
fig.update_xaxes(categoryorder="total descending")
fig.update_yaxes(title = "Number of Rows")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Enumerated class label of Discourse Element Distribution ',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

# <center>DISCOURSE TEXT DISTRIBUTION</center> 

In [None]:
# Length of each discourse element in characters
train["discourse_len"] = train["discourse_end"] - train["discourse_start"]
pd.pivot_table(train, values='discourse_len', index=['discourse_type'],
                    aggfunc=['count','mean','max','min']).rename_axis(None, axis=1).reset_index() 

In [None]:
train["discourse_len"] = train["discourse_end"] - train["discourse_start"]
train01 = pd.pivot_table(train, values='discourse_len', index=['id'],
                    aggfunc=('sum')).rename_axis(None, axis=1).reset_index() 
fig = px.histogram(data_frame= train01,x = "discourse_len",  marginal="violin",nbins = 400 )
fig.show()

In [None]:
fig = px.histogram(train, x="discourse_len", color="discourse_type")
fig.show()

In [None]:

fig = px.histogram(data_frame= train,x = "discourse_len",  marginal="violin",nbins = 400 )
fig.show()

# <center>Markov Transition Matrix</center> 

In [None]:
#https://stackoverflow.com/questions/22219004/how-to-group-dataframe-rows-into-list-in-pandas-groupby
markovraw = train.groupby('id')['discourse_type'].apply(list).reset_index(name='new')

In [None]:
markovraw.loc[:, 'starting'] = markovraw.new.map(lambda x: x[0])
markovraw.loc[:, 'ending'] = markovraw.new.map(lambda x: x[-1])


In [None]:
markovraw.head()

In [None]:
# Pad every essay's sequence with explicit Start and End states
markovraw['new'] = markovraw['new'].map(lambda seq: ['Start'] + seq + ['End'])
markovraw.head()

In [None]:
markov_chain=[]
for mc in markovraw.new:
    markov_chain.extend(mc)
# Sort the states so the index assignment is the same on every run
my_map = dict(enumerate(sorted(set(markov_chain))))

my_map


In [None]:
inv_map = {v: k for k, v in my_map.items()}
final_list= [inv_map.get(item)  for item in markov_chain]
inv_map

In [None]:
T = final_list


In [None]:


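# Create a square matrix of zeros, one row and column per state
n_states = len(inv_map)
M = [[0] * n_states for _ in range(n_states)]

# Count transitions between consecutive states
for (i, j) in zip(T, T[1:]):
    M[i][j] += 1

# Convert counts to row-wise probabilities
for row in M:
    n = sum(row)
    if n > 0:
        row[:] = [f / n for f in row]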


In [None]:
# State labels in index order (inv_map keeps the ordering of my_map)
vals = list(inv_map)
# Rows are the current state, columns are the next state
pd.DataFrame(data = M, 
                  index = vals, 
                  columns = vals)

Ignore the End to Start transition probability of 1; it is an artifact of concatenating the sequences of all essays into one list.
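
One way to avoid the End to Start artifact is to collect transition pairs within each essay separately and only then normalize. The sketch below is an alternative to the approach above, not the code used in this notebook; it uses two made-up sequences and `pd.crosstab` with `normalize="index"` to get row-wise probabilities.

```python
import pandas as pd

# Toy discourse-type sequences, one list per essay (made up for illustration)
sequences = [
    ["Lead", "Position", "Claim", "Evidence", "Concluding Statement"],
    ["Position", "Claim", "Evidence", "Claim", "Evidence"],
]

# Collect (current, next) pairs inside each essay only,
# so no pair ever crosses an essay boundary
pairs = []
for seq in sequences:
    padded = ["Start"] + seq + ["End"]
    pairs.extend(zip(padded, padded[1:]))

pairs_df = pd.DataFrame(pairs, columns=["from", "to"])

# Row-normalized counts: each row sums to 1
transition = pd.crosstab(pairs_df["from"], pairs_df["to"], normalize="index")
print(transition.round(2))
```

With the real data, `sequences` could be taken from `train.groupby('id')['discourse_type'].apply(list)`. Because End never appears as a "from" state, the resulting table has no End row at all.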

# <center>TEXT VISUALIZATION</center> 

In [None]:
# This is all Sanskar Hasija! I just picked it up for analysis; very handy
r = 24
# The essay id is the file name without its directory and extension
essay_id = os.path.splitext(os.path.basename(train_files[r]))[0]
ents = []
for i, row in train[train['id'] == essay_id].iterrows():
    ents.append({
                    'start': int(row['discourse_start']), 
                     'end': int(row['discourse_end']), 
                     'label': row['discourse_type']
                })

with open(train_files[r], 'r') as file:
    data = file.read()

doc2 = {
    "text": data,
    "ents": ents,
}

colors = {'Lead': '#EE11D0','Position': '#AB4DE1','Claim': '#1EDE71','Evidence': '#33FAFA','Counterclaim': '#4253C1','Concluding Statement': 'yellow','Rebuttal': 'red'}
options = {"ents": train.discourse_type.unique().tolist(), "colors": colors}
spacy.displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True);

## References 
[Text coloring](https://www.kaggle.com/ibrezmohd/nlp-on-student-writing-eda/edit)