**Created by Sanskar Hasija**

**📊 Feedback Prize - EDA 📊**

**15 DECEMBER 2021**


  # <center> 📊 FEEDBACK PRIZE - EDA 📊 </center>

# <center>IMPORTS</center> 

In [None]:
import os
import spacy
import wordcloud
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

In [None]:
train_dir = "../input/feedback-prize-2021/train"
test_dir = "../input/feedback-prize-2021/test"
train_files = os.listdir(train_dir)
test_files = os.listdir(test_dir)

# Prefix each file name with its directory to get a usable path
train_files = [os.path.join(train_dir, name) for name in train_files]
test_files = [os.path.join(test_dir, name) for name in test_files]
    
train = pd.read_csv("../input/feedback-prize-2021/train.csv")
sub = pd.read_csv("../input/feedback-prize-2021/sample_submission.csv")

# <center>EDA</center> 

In [None]:
print(f"Total number of train files = {len(train_files)}")
print(f"Total number of test files = {len(test_files)}")

### Train Essay Sample

In [None]:
# Use a context manager so the file handle is closed after reading
with open(train_files[0], "r") as f:
    print(f.read())

### Test Essay Sample

In [None]:
# Use a context manager so the file handle is closed after reading
with open(test_files[3], "r") as f:
    print(f.read())

## Train Tabular Dataframe

### Column Description
* **id** - ID code for essay response
* **discourse_id** - ID code for discourse element
* **discourse_start** - character position where discourse element begins in the essay response
* **discourse_end** - character position where discourse element ends in the essay response
* **discourse_text** - text of discourse element
* **discourse_type** - classification of discourse element
* **discourse_type_num** - enumerated class label of discourse element
* **predictionstring** - the word indices of the training sample, as required for predictions
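
To make the relationship between the character offsets and `predictionstring` concrete, here is a minimal sketch on a made-up essay. It assumes that words are whitespace-delimited (Python's `str.split`) and that the span starts at a word boundary. The helper name `char_span_to_predictionstring` and the sample text are hypothetical and are not part of the competition data or code.

```python
def char_span_to_predictionstring(text, start, end):
    """Convert a character span into space-separated word indices.

    Assumption: words are whitespace-delimited (str.split) and the span
    begins at a word boundary.
    """
    first_word = len(text[:start].split())   # number of words before the span
    n_words = len(text[start:end].split())   # number of words inside the span
    return " ".join(str(i) for i in range(first_word, first_word + n_words))


# Hypothetical essay and discourse element, for illustration only
essay = "Phones are useful. Drivers should not use them while driving."
fragment = "Drivers should not use them"
start = essay.find(fragment)      # plays the role of discourse_start
end = start + len(fragment)       # plays the role of discourse_end

prediction = char_span_to_predictionstring(essay, start, end)
print(essay[start:end])   # plays the role of discourse_text
print(prediction)         # plays the role of predictionstring
```

If a span begins in the middle of a word, this simple count would be off by one, so real data may need extra handling.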

### Quick view of Train Dataframe

In [None]:
train.head()

### Basic statistics of training data

In [None]:
print(f"Number of rows in train dataframe = {len(train)}")

In [None]:
train.describe()

### Null Values 

In [None]:
train.isnull().sum()

### Quick view of Submission File

In [None]:
sub.head()

# <center>DATA DISTRIBUTION</center> 

### The 7 Discourse Types

* **Lead** - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
* **Position** - an opinion or conclusion on the main question
* **Claim** - a claim that supports the position
* **Counterclaim** - a claim that refutes another claim or gives an opposing reason to the position
* **Rebuttal** - a claim that refutes a counterclaim
* **Evidence** - ideas or examples that support claims, counterclaims, or rebuttals
* **Concluding Statement** - a concluding statement that restates the claims

### Discourse Type Distribution

In [None]:
# Count rows per class once; sort_index() keeps the same sorted order as np.unique
type_counts = train["discourse_type"].value_counts().sort_index()
fig = px.bar(x = type_counts.index,
             y = type_counts.values,
             color = type_counts.index,
             color_continuous_scale="Emrld")
fig.update_xaxes(title="Classes")
fig.update_yaxes(title = "Number of Rows")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Discourse Type Distribution ',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        template="plotly_white")
fig.show()

### Distribution of Enumerated Discourse Class Labels

In [None]:
# Count rows per label once; sort_index() keeps the same sorted order as np.unique
type_num_counts = train["discourse_type_num"].value_counts().sort_index()
fig = px.bar(x = type_num_counts.index,
             y = type_num_counts.values,
             color = type_num_counts.index,
             color_continuous_scale="blues")
fig.update_xaxes(title="Classes")
fig.update_yaxes(title = "Number of Rows")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Distribution of Enumerated Discourse Class Labels',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        template="plotly_white")
fig.show()

# <center>DISCOURSE TEXT DISTRIBUTION</center> 

### Length of Discourse Text

In [None]:
train["discourse_len"] = train["discourse_end"] - train["discourse_start"]
fig = px.histogram(data_frame= train,x = "discourse_len",  marginal="violin",nbins = 400 )
fig.update_layout(template="plotly_white")
fig.show()

### Starting Position of Discourse Text

In [None]:
fig = px.histogram(data_frame= train,x = "discourse_start",  marginal="violin" ,nbins = 400)
fig.update_layout(template="plotly_white")
fig.show()

### Ending Position of Discourse Text

In [None]:
fig = px.histogram(data_frame= train,x = "discourse_end",  marginal="violin" ,nbins = 400)
fig.update_layout(template="plotly_white")
fig.show()

# <center>WORD CLOUD</center> 

In [None]:
# Use a distinct variable name so the imported `wordcloud` module is not overwritten
# (otherwise re-running this cell fails)
cloud = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=80, max_words=5000,
                      width = 600, height = 400,
                      background_color='black').generate(' '.join(train["discourse_text"]))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(cloud, interpolation='bilinear')
ax.set_axis_off()
plt.show()

# <center>TEXT VISUALIZATION</center> 

In [None]:
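r = 20
# Essay id = file name without its directory or ".txt" extension
essay_id = os.path.splitext(os.path.basename(train_files[r]))[0]
ents = []
for i, row in train[train['id'] == essay_id].iterrows():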
    ents.append({
                    'start': int(row['discourse_start']), 
                     'end': int(row['discourse_end']), 
                     'label': row['discourse_type']
                })

with open(train_files[r], 'r') as file:
    data = file.read()

doc2 = {
    "text": data,
    "ents": ents,
}

colors = {'Lead': '#EE11D0','Position': '#AB4DE1','Claim': '#1EDE71','Evidence': '#33FAFA','Counterclaim': '#4253C1','Concluding Statement': 'yellow','Rebuttal': 'red'}
options = {"ents": train.discourse_type.unique().tolist(), "colors": colors}
spacy.displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True);
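
The dictionary passed to displaCy in manual mode has a `text` key and an `ents` list of `start`/`end`/`label` entries. The sketch below shows one way to build that structure defensively without calling spaCy. The helper `build_displacy_doc` and the sample essay are hypothetical. Clamping the end offsets, dropping empty spans, and sorting by start offset are assumptions about what keeps the rendering well-behaved, not steps taken from the original notebook.

```python
def build_displacy_doc(text, spans):
    """Build a {"text", "ents"} dict in the shape used above with manual=True.

    `spans` is an iterable of (start, end, label) character spans.
    End offsets are clamped to the text length, empty or inverted spans
    are dropped, and entities are sorted by start offset.
    """
    ents = []
    for start, end, label in spans:
        start, end = int(start), min(int(end), len(text))
        if start < end:
            ents.append({"start": start, "end": end, "label": label})
    ents.sort(key=lambda ent: ent["start"])
    return {"text": text, "ents": ents}


# Hypothetical essay with two labelled spans given out of order
essay = "Phones are useful. Drivers should not use them."
position_start = essay.find("Drivers")
doc = build_displacy_doc(
    essay,
    [(float(position_start), 999.0, "Position"), (0.0, 18.0, "Lead")],
)
for ent in doc["ents"]:
    print(ent["label"], "->", essay[ent["start"]:ent["end"]])
```

The resulting `doc` could then be passed to the same render call used above in place of `doc2`.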