# **EDA Simplified: Feedback Prize (Adapted by Ibrez and Sanskar Hasija)**

## Introduction
Across the country, writing is a key to success for many students, especially those who hope to get into competitive universities, and I also want to improve my own writing through this notebook. Unfortunately, many students struggle with writing, and low-income, Black, and Hispanic students fare even worse, with less than 15 percent demonstrating writing proficiency. One way to help students improve their writing is via automated feedback tools, which evaluate student writing and provide personalized feedback. How? We can identify and analyze the argumentative writing elements in essays written by students in grades 6 through 12. So, let's get to it!

## Imports and Setup
Before we analyze argumentative writing elements, first things first: we import stuff! First, we import the os module to work with files and directories. Then we import spaCy (spacy) and wordcloud, which we use later to visualize and analyze text. Next come NumPy (numpy) as np and pandas (pandas) as pd, which handle the data analysis of the 6th-to-12th grade students' essays. Finally, we import the plotting libraries: Seaborn (seaborn) as sns, Plotly Express (plotly.express) as px, Matplotlib's pyplot submodule (matplotlib.pyplot) as plt, and Plotly's graph_objects submodule (plotly.graph_objects) as go.

In [None]:
# Files and Directories
import os

# Text Analysis
import spacy
import wordcloud

# Machine Learning and Data Analysis
import pandas as pd
import numpy as np

# Plotting
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

We then set up the train and test directories by defining two variables, TRAIN_DIR and TEST_DIR, which hold the paths to the training and test essays in the feedback-prize-2021 competition data. Next, train_files and test_files are assigned the lists of file names that os.listdir returns for those two directories. Two for loops then walk over the indices of each list (range(len(...))) and replace every file name with its full path: the directory, a "/" separator, and the file name. Finally, we use pandas to read train.csv into train_df and sample_submission.csv into sub_df.

In [None]:
# ℹ️: You can copy file path in data sections if you would want!
TRAIN_DIR = "../input/feedback-prize-2021/train"
TEST_DIR = "../input/feedback-prize-2021/test"
train_files = os.listdir(TRAIN_DIR)
test_files = os.listdir(TEST_DIR)

for file in range(len(train_files)):
    train_files[file] = TRAIN_DIR + "/" + str(train_files[file])
    
for file in range(len(test_files)):
    test_files[file] = TEST_DIR + "/" + str(test_files[file])
    
train_df = pd.read_csv("../input/feedback-prize-2021/train.csv")
sub_df = pd.read_csv("../input/feedback-prize-2021/sample_submission.csv")
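
The path-building loops above can also be written more compactly. The following is a minimal sketch, not the original notebook's code: `list_essay_paths` is a hypothetical helper name, and the temporary directory only stands in for TRAIN_DIR so that the example runs anywhere.

```python
import os
import tempfile


def list_essay_paths(directory):
    """Return the full path of every entry in `directory`, sorted by name."""
    # os.listdir returns bare names in arbitrary order, so we sort them
    # and join each one onto the directory.
    return [os.path.join(directory, name) for name in sorted(os.listdir(directory))]


# Stand-in for TRAIN_DIR: a temporary folder with two fake essay files.
demo_dir = tempfile.mkdtemp()
for name in ("B222.txt", "A111.txt"):
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write("demo essay")

demo_paths = list_essay_paths(demo_dir)
print([os.path.basename(p) for p in demo_paths])
```

Sorting the names means an index such as 9 points at the same file on every run, which os.listdir alone does not guarantee.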

## EDA
Now, let's do the EDA part! First, we print the number of train and test files by passing train_files and test_files to the len function.

In [None]:
print("No. of train files: ", len(train_files))
print("No. of test files: ", len(test_files))

Running the code cell above shows that there are 15,594 train files and just 5 test files! Now, let's look at one of the train essays written by an anonymous 6th-to-12th grade student!

### Analysis of Train Essay Sample
To look at a sample train essay, we open one of the paths in train_files (any index works; mine's 9) in read mode ("r") as f, then print its contents with the read function. Opening the file in a with block closes it automatically afterwards.

In [None]:
with open(train_files[9], "r") as f:
    print(f.read())

As we can see, this anonymous student wrote an argument about how less driving reduces pollution and protects the environment. But what about the test essay sample? Well, let's move on...

### Analysis of Test Essay Sample
Just as we did for the train essay sample, let's now look at a test essay sample. The code is the same, except that we open an entry of test_files instead (any index works; mine's 0).

In [None]:
with open(test_files[0], "r") as f:
    print(f.read())

Looking at the test essay sample, we can see that another anonymous student argues that asking for others' opinions is especially beneficial. Now, without further ado, let's dive into the basics of the train tabular dataframe!

### Train Tabular Dataframe
Here are the columns of the train tabular dataframe!
* **id**: The ID code of the essay response.
* **discourse_id**: The ID code of the discourse element.
* **discourse_start**: The character position where the discourse element starts in the essay response.
* **discourse_end**: The character position where the discourse element ends in the essay response.
* **discourse_text**: The text of the discourse element.
* **discourse_type**: The classification of the discourse element.
* **discourse_type_num**: The enumerated class label of the discourse element.
* **predictionstring**: The word indices of the discourse element within the training sample, as required for predictions.
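
To make the relationship between the character offsets and predictionstring concrete, here is a small sketch. It assumes that word indices come from splitting the essay on whitespace with str.split(), which is my reading of the competition description rather than something this notebook verifies. The helper name `char_span_to_word_indices` and the example essay are hypothetical.

```python
def char_span_to_word_indices(text, start, end):
    """Return the indices of whitespace-separated words overlapping [start, end)."""
    indices = []
    pos = 0
    for word_idx, word in enumerate(text.split()):
        # Locate this word in the original text, searching from where
        # the previous word ended.
        word_start = text.index(word, pos)
        word_end = word_start + len(word)
        pos = word_end
        # Keep the word if its character span overlaps the discourse span.
        if word_start < end and word_end > start:
            indices.append(word_idx)
    return indices


# Made-up essay: the second sentence plays the role of one discourse element.
essay = "Cars pollute the air. We should drive less."
start = essay.index("We")
end = len(essay)

word_indices = char_span_to_word_indices(essay, start, end)
prediction_string = " ".join(str(i) for i in word_indices)
print(prediction_string)
```

Under that assumption, a discourse element is described twice in the table: once by characters (discourse_start, discourse_end) and once by words (predictionstring).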

### Sneak Peek of a Train Dataframe
Now that we know the columns of the training DataFrame, let's take a peek at train_df by displaying its first 5 rows using the head function!

In [None]:
train_df.head()

As mentioned, the training DataFrame has eight columns! Next, let's go further and look at some statistics of the training data!

### Analyzing the Basic Statistics of Training Data
First of all, we need to find out the number of rows. Without further ado, we print the number of rows in the training dataframe by passing train_df to the len function!

In [None]:
print("No of rows in the training dataframe: ", len(train_df))

Running the code cell above shows that there are 144,293 rows in the training dataframe. There is one more thing to find: the count, mean, std, min, max, and quartiles of the numeric columns discourse_id, discourse_start, and discourse_end. The describe function reports all of these at once, and the IQR is simply the difference between the 75% and 25% quartiles.

So, let's call the describe function on the train_df dataframe.

In [None]:
train_df.describe()

The table above shows the basic statistics of the discourse_id, discourse_start, and discourse_end columns. Keep in mind that discourse_id is only an identifier, so its statistics are not meaningful.
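
Since describe reports the quartiles but not the IQR itself, the IQR has to be derived from them. Here is a minimal sketch on a toy DataFrame; the values are made up and only stand in for the numeric columns of train_df.

```python
import pandas as pd

# Toy stand-in for two numeric columns of train_df (values are made up).
toy_df = pd.DataFrame({
    "discourse_start": [0, 10, 20, 30, 40],
    "discourse_end": [8, 19, 33, 41, 60],
})

stats = toy_df.describe()

# IQR = third quartile minus first quartile, both reported by describe().
iqr = stats.loc["75%"] - stats.loc["25%"]
print(iqr)
```

The same two lines should work on train_df.describe() for any numeric column.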

### Null Values Check
Most importantly, we need to check whether our train_df dataframe has any null values. We count the null values in each column by chaining the isnull and sum functions on train_df. Let's do it!

In [None]:
train_df.isnull().sum()

As we can see, there are zero null values in every column of the train_df dataframe! Now, let's look at how the data is distributed!

## Distributing the Data
Let's now look at how the data is distributed! First of all, here are the seven discourse types that appear in the students' essays:
* Lead: a compelling and interesting first section of your paper that tells about the issue and has an attention getting hook (Also known as the inviting introduction).
* Position: an opinion or conclusion on the main question or problem.
* Claim: an argument based on facts and reasoning.
* Counterclaim: a claim that refutes another claim or gives an opposing reason to the position.
* Rebuttal: a claim that opposes a counterclaim.
* Evidence: personal experiences, definitions, facts, research, data, quotes from an authority in the field, or statistical graphs which tend to support or prove something.
* Concluding Statement: the final section of your paper that clearly summarizes the points made and is supported by evidence (Also known as the purposeful conclusion).

### Distributing the Discourse Type
Let's count the seven discourse types in the argumentative essays and plot them! First, we create a bar chart with Plotly Express by calling px.bar. Its x argument is np.unique(train_df["discourse_type"]), the sorted unique discourse types. Its y argument is a list comprehension that, for each unique type i, counts how many times i occurs in the discourse_type column. The color argument is set to the same unique types so that each bar gets its own color, and color_continuous_scale is set to Mint (or Emrld, or Peach, or whatever you want!). (ℹ️: color_continuous_scale only applies to numeric color values, so with these text labels Plotly falls back to its discrete colors.)

We then label the axes with the update_xaxes and update_yaxes functions, setting title to Classes for the x-axis and Number of Rows for the y-axis. Finally, update_layout turns the legend on with showlegend=True, sets the title to a dictionary (text set to Discourse Type Distribution, y set to 0.95, x set to 0.5, xanchor set to center, and yanchor set to top), and sets template to whatever built-in template you like (mine's seaborn).

And then we show our figure by calling the show function on fig.

In [None]:
fig = px.bar(
    x=np.unique(train_df["discourse_type"]),
    y=[list(train_df["discourse_type"]).count(i) for i in np.unique(train_df["discourse_type"])],
    color=np.unique(train_df["discourse_type"]),
    color_continuous_scale="Mint",
)
fig.update_xaxes(title="Classes")
fig.update_yaxes(title="Number of Rows")
fig.update_layout(showlegend=True, 
                  title={
                      'text':'Discourse Type Distribution', 
                      'y':0.95, 
                      'x':0.5, 
                      'xanchor':'center', 
                      'yanchor':'top'}, template="seaborn")
fig.show()

As you can see, the number of Claim elements in the essays is about 50K, Concluding Statement is about 13K, Counterclaim is 5,817, Lead is 9,305, Position is about 15K, and Rebuttal is 4,337; the Evidence count can be read off the chart in the same way.
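
The per-type counts can also be computed without converting the column to a list for every type. This is a sketch of an alternative using pandas' value_counts on a toy Series; the labels come from the list of discourse types, but the counts are made up.

```python
import pandas as pd

# Toy stand-in for train_df["discourse_type"] (counts are made up).
toy_types = pd.Series(
    ["Claim", "Evidence", "Claim", "Lead", "Position", "Claim", "Evidence"]
)

# value_counts() tallies every label in one pass; sort_index() orders the
# labels alphabetically, matching the order np.unique would return.
type_counts = toy_types.value_counts().sort_index()
print(type_counts)

# The labels and counts could then be passed to px.bar as
# x=type_counts.index and y=type_counts.values.
```

On a dataframe with more than a hundred thousand rows, a single tally like this avoids rescanning the whole column once per discourse type.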

### Distributing the Enumerated Class Labels
Now let's create another graph! It's similar to the one for the discourse type, but this time the bar chart is built from the discourse_type_num column of train_df, and the title in update_layout is set to Enumerated class label of Discourse Element Distribution.

In [None]:
fig = px.bar(
    x=np.unique(train_df["discourse_type_num"]),
    y=[list(train_df["discourse_type_num"]).count(i) for i in np.unique(train_df["discourse_type_num"])],
    color=np.unique(train_df["discourse_type_num"]),
    color_continuous_scale="Mint",
)
fig.update_xaxes(title="Classes")
fig.update_yaxes(title="Number of Rows")
fig.update_layout(showlegend=True, 
                  title={
                      'text':'Enumerated class label of Discourse Element Distribution', 
                      'y':0.95, 
                      'x':0.5, 
                      'xanchor':'center', 
                      'yanchor':'top'}, template="seaborn")
fig.show()

Once the graph has rendered, we can see that there are so many distinct enumerated class labels that the chart is hard to interpret.
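
One way to make the enumerated labels easier to read is to separate the discourse type from its running number. The sketch below assumes that each label has the form "<discourse type> <number>" (for example "Claim 2"), which is how I understand the column; the sample labels are made up and `split_label` is a hypothetical helper.

```python
from collections import Counter

# Made-up sample of enumerated labels, standing in for
# train_df["discourse_type_num"].
labels = ["Lead 1", "Position 1", "Claim 1", "Claim 2", "Evidence 1",
          "Concluding Statement 1", "Claim 1", "Evidence 2"]


def split_label(label):
    """Split 'Concluding Statement 1' into ('Concluding Statement', 1)."""
    # rsplit with maxsplit=1 cuts only at the last space, so multi-word
    # type names stay intact.
    type_name, number = label.rsplit(" ", 1)
    return type_name, int(number)


# Count how often each running number occurs, regardless of type.
number_counts = Counter(split_label(label)[1] for label in labels)
print(sorted(number_counts.items()))
```

Plotting the running numbers on their own gives far fewer bars than plotting every type-and-number combination.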

## Distributing the Discourse Text
Now that we've looked at the discourse elements, let's analyze every segment of the discourse text! So what are we stallin' for? Let's move on!

### Length of Discourse Text
To measure the length of each discourse text, we first add a discourse_len column to train_df, defined as the difference between the discourse_end and discourse_start columns (so the length is measured in characters). Next, we create a histogram by calling px.histogram with four parameters: data_frame set to train_df, x (the x-axis) set to discourse_len, marginal set to violin (which draws a violin plot above the histogram), and nbins set to 400. Finally, we call update_layout on fig with the template parameter set to, again, any template you'd like (mine's presentation) and show the figure by calling the show function on fig.

In [None]:
train_df["discourse_len"] = train_df["discourse_end"] - train_df["discourse_start"]
fig = px.histogram(data_frame=train_df, x="discourse_len", marginal="violin", nbins=400)
fig.update_layout(template="presentation")
fig.show()
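
The histogram pools all discourse types together. As a complementary view, the length can be summarized per type. This is a sketch on made-up rows, not a result from the competition data.

```python
import pandas as pd

# Toy stand-in for train_df; offsets and labels are made up.
toy_df = pd.DataFrame({
    "discourse_type": ["Claim", "Claim", "Evidence", "Evidence", "Lead"],
    "discourse_start": [0, 50, 100, 400, 0],
    "discourse_end": [40, 110, 350, 900, 200],
})

# Same definition as above: length in characters.
toy_df["discourse_len"] = toy_df["discourse_end"] - toy_df["discourse_start"]

# Median length per discourse type, longest first.
median_len = (
    toy_df.groupby("discourse_type")["discourse_len"]
    .median()
    .sort_values(ascending=False)
)
print(median_len)
```

The median is used here instead of the mean so that a few very long elements do not dominate the summary.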

### Starting Point of Discourse Text
Now, we look at where each discourse text starts! It's similar to the way we measured the length of the discourse text, but the x parameter of the histogram function is set to the discourse_start column of train_df.

In [None]:
fig = px.histogram(data_frame=train_df, x="discourse_start", marginal="violin", nbins=400)
fig.update_layout(template="presentation")
fig.show()

## Wordclouding
Now, let's wordcloud the words! We first build a WordCloud object from the wordcloud module with 6 parameters: stopwords set to the module's STOPWORDS set, max_font_size set to 90, max_words set to 4500, width set to 600, height set to 400, and background_color set to black. We then call its generate function on one long string, made by joining every entry of the discourse_text column in train_df with a space (' '). The object is stored in a variable called cloud rather than wordcloud, so that the imported module is not overwritten and the cell can be run again. Next, we call plt.subplots with figsize set to (14, 10) to get a figure and an axes, fig and ax. We then call imshow on ax to draw the word cloud image, with the interpolation parameter set to bilinear. Finally, we turn the axes off by calling set_axis_off on ax and display the figure with plt.show.

In [None]:
# Name the object `cloud` so the imported `wordcloud` module is not overwritten.
cloud = wordcloud.WordCloud(stopwords=wordcloud.STOPWORDS, max_font_size=90, max_words=4500,
                            width=600, height=400, background_color='black')
cloud.generate(' '.join(txt for txt in train_df["discourse_text"]))
fig, ax = plt.subplots(figsize=(14,10))
ax.imshow(cloud, interpolation='bilinear')
ax.set_axis_off()
plt.show()
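
A word cloud draws frequent words larger, so at its core it relies on word counts. The following rough sketch shows only that counting idea with the standard library. It is not how the wordcloud package is implemented internally, and the texts and the tiny stopword set are made up.

```python
from collections import Counter

# Made-up discourse texts standing in for train_df["discourse_text"].
texts = [
    "Cars cause pollution in the city",
    "Less driving means less pollution",
    "The city should limit cars",
]

# Tiny hypothetical stopword set; the wordcloud module ships a much larger one.
stopwords = {"the", "in", "should"}

# Join, lowercase, split on whitespace, and drop stopwords.
words = [w for w in " ".join(texts).lower().split() if w not in stopwords]
frequencies = Counter(words)
print(frequencies.most_common(3))
```

Printing the most common words like this is also a quick sanity check on what the cloud should emphasize.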

As you can see, the cloud is packed with words from every topic that the students from grades 6 to 12 wrote about. But wait, there's more!

### Wordclouding using stylecloud
This is the part where we turn the students' topics into a word cloud that's F-U-N. We do it with stylecloud! But first, we need to install and import stylecloud. Inspired by: https://www.kaggle.com/kapakudaibergenov/stylecloud

In [None]:
!pip install stylecloud
import stylecloud

Before we use stylecloud, we need to concatenate our text data! We define a variable called concated_discourse_text by calling the join function on a single-space string (' '), passing it a list comprehension that loops i over the discourse_text column of train_df after converting it to strings with astype(str).

In [None]:
concated_discourse_text = ' '.join([i for i in train_df.discourse_text.astype(str)])

We now generate our wordcloud, but in an artistic way! We call the gen_stylecloud function of the stylecloud module with 5 parameters: text set to concated_discourse_text, icon_name set to any Font Awesome icon (mine's fab fa-google), palette set to any palette you like (mine's colorbrewer.diverging.Spectral_11), background_color set to black, and finally, size set to 1024.

In [None]:
stylecloud.gen_stylecloud(text=concated_discourse_text, icon_name="fab fa-google", palette="colorbrewer.diverging.Spectral_11", background_color="black", size=1024)

After that, we display the result with the Image function, which takes three parameters: filename set to the stylecloud.png file that stylecloud wrote, and width and height both set to 1024. To do that, we first import Image from the display submodule of the IPython module.

In [None]:
from IPython.display import Image
Image(filename='./stylecloud.png', width=1024, height=1024)

Waouh! What a masterpiece! We just displayed a word cloud in the shape of the Google "G" logo, filled with colorful words from the topics students wrote about in their essays! Finally, let's visualize the text of one of the students' argumentative essays!

## Text Visualization
Now we've come to the final part of our EDA: visualizing our text! We first set the r variable to any index you'd like (mine's 30) and set the ents variable to an empty list. The essay ID is recovered from the file path with train_files[r][35:-4]: the slice drops the first 35 characters (the directory prefix "../input/feedback-prize-2021/train/") and the last 4 (the ".txt" extension). We then filter train_df down to the rows whose id equals that essay ID and loop over them with the iterrows function, which yields an index i and a row for every matching row. Inside the loop, we append to ents a dictionary with three keys: 'start' set to the row's discourse_start as an int, 'end' set to the row's discourse_end as an int, and 'label' set to the row's discourse_type.

Next, we open the essay file in read mode ('r') using with open, and store its contents in the data variable by calling the read function. We then create a dictionary, doc2, with its text key set to data and its ents key set to ents. Furthermore, we create a colors dictionary that maps each discourse type to any color you'd like, and an options dictionary whose ents key is the list of unique discourse types (train_df.discourse_type.unique().tolist()) and whose colors key is the colors dictionary.

Finally, we display the highlighted text by calling the render function of spacy.displacy with doc2, the style parameter set to ent, the options parameter set to the options dictionary, and the manual and jupyter parameters set to True.

In [None]:
r = 30
ents = []
for i, row in train_df[train_df['id'] == train_files[r][35:-4]].iterrows():
    ents.append({
        'start': int(row['discourse_start']),
        'end': int(row['discourse_end']),
        'label': row['discourse_type']
    })

with open(train_files[r], 'r') as file:
    data = file.read()

doc2 = {
    "text": data,
    "ents": ents,
}

colors = {'Lead': 'magenta','Position': 'purple','Claim': 'orange','Evidence': 'green','Counterclaim': 'blue','Concluding Statement': 'yellow','Rebuttal': 'red'}
options = {"ents": train_df.discourse_type.unique().tolist(), "colors": colors}
spacy.displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True);
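
The slice [35:-4] only works while the directory prefix is exactly 35 characters long and the extension is 4 characters. A sketch of a variant that does not depend on those lengths follows. Both helper names (`essay_id_from_path`, `build_displacy_doc`) are hypothetical, the essay and annotations are made up, and the sketch only builds the dictionary without calling spaCy.

```python
import os


def essay_id_from_path(path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]


def build_displacy_doc(text, rows):
    """Build a dict with the same keys as doc2: 'text' and 'ents'."""
    ents = [
        {
            "start": int(row["discourse_start"]),
            "end": int(row["discourse_end"]),
            "label": row["discourse_type"],
        }
        for row in rows
    ]
    # Keep the spans in reading order.
    ents.sort(key=lambda ent: ent["start"])
    return {"text": text, "ents": ents}


# Made-up essay and annotations, standing in for one essay's rows in train_df.
essay_text = "We should drive less. Cars pollute the air."
rows = [
    {"discourse_start": 22.0, "discourse_end": 43.0, "discourse_type": "Evidence"},
    {"discourse_start": 0.0, "discourse_end": 21.0, "discourse_type": "Position"},
]

doc = build_displacy_doc(essay_text, rows)
essay_id = essay_id_from_path("../input/feedback-prize-2021/train/ABC123.txt")
print(essay_id, doc["ents"])
```

The resulting dictionary has the same shape as doc2, so it could be passed to the render call in the same way.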

## Conclusion
And so we did it! We analyzed the students' argumentative essays and are ready to move on to identifying their discourse elements! After analyzing 'em, what should we do now? We could submit our own argumentative essays to Feedback Prize, say about how Ponyboy in *The Outsiders* rebelled against the Greaser stereotypes, or identify more essays with discourse text identification... But it's our choice!

## **WORKS CITED**

* odins0n (Sanskar Hasija). (2022, January 25). 📊 feedback prize - eda 📊. Kaggle. Retrieved February 20, 2022, from https://www.kaggle.com/odins0n/feedback-prize-eda 
* Kapakudaibergenov (Kapa Kudaibergenov). (2021, June 3). Stylecloud. Kaggle. Retrieved February 20, 2022, from https://www.kaggle.com/kapakudaibergenov/stylecloud 