# ***Student writing📖 : The Visualization📊***

# **Context -**

***Writing is crucial for success. In particular, argumentative writing fosters critical thinking and civic engagement skills, and can be strengthened by practice. However, only 13 percent of eighth-grade teachers ask their students to write persuasively each week. Additionally, resource constraints disproportionately impact Black and Hispanic students, so they are more likely to write at the “below basic” level as compared to their white peers. An automated feedback tool is one way to make it easier for teachers to grade writing tasks assigned to their students that will also improve their writing skills.***

**There are numerous automated writing feedback tools currently available, but they all have limitations, especially with argumentative writing. Existing tools often fail to evaluate the quality of argumentative elements, such as organization, evidence, and idea development. Most importantly, many of these writing tools are inaccessible to educators due to their cost, which most impacts already underserved schools.**


***Georgia State University (GSU) is an undergraduate and graduate urban public research institution in Atlanta. U.S. News & World Report ranked GSU as one of the most innovative universities in the nation. GSU awards more bachelor’s degrees to African-Americans than any other non-profit college or university in the country. GSU and The Learning Agency Lab, an independent nonprofit based in Arizona, are focused on developing science of learning-based tools and programs for social good.***

**To best prepare all students, GSU and The Learning Agency Lab have joined forces to encourage data scientists to improve automated writing assessments. This public effort could also encourage higher quality and more accessible automated writing tools. If successful, students will receive more feedback on the argumentative elements of their writing and will apply the skill across many disciplines.**

# **Goal of the Competition :**

*The goal of this competition is to classify argumentative elements in student writing as "effective," "adequate," or "ineffective." You will create a model trained on data that is representative of the 6th-12th grade population in the United States in order to minimize bias. Models derived from this competition will help pave the way for students to receive enhanced feedback on their argumentative writing. With automated guidance, students can complete more assignments and ultimately become more confident, proficient writers.*

*This competition will comprise two tracks. The first track will be a traditional track in which accuracy of classification will be the only metric used for success. Success on this track will be updated on the Kaggle leaderboard. Prize money for the accuracy-only, “Leaderboard Prize” track will be $25,000.*

*The second track will measure computational efficiency, determined using a combination of accuracy and the speed at which models generate their predictions. We are hosting this track because highly accurate models are often computationally heavy. Such models have a larger carbon footprint and frequently prove difficult to use in real-world educational contexts, since most educational organizations have limited computational capabilities. Weekly updates on models based on computational efficiency will be posted in the discussion forum. Prize money for the computational, “Efficiency Prize” track will be $30,000.*

*You can find more details about the Efficiency Prize Evaluation via the side tab.*
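
The trade-off behind the efficiency track can be illustrated with a small sketch that records both accuracy and wall-clock prediction time. This is not the official Efficiency Prize formula (that is described on the competition page); `evaluate`, `always_adequate`, and the toy texts and labels are hypothetical names and data used only for illustration.

```python
import time


def evaluate(predict, texts, labels):
    """Return (accuracy, elapsed_seconds) for a prediction function."""
    start = time.perf_counter()
    predictions = [predict(text) for text in texts]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels), elapsed


# Hypothetical stand-in for a model: always predicts the same class.
def always_adequate(text):
    return "adequate"


# Made-up examples, not taken from the competition data.
toy_texts = ["Phones should be allowed in class.", "Because.", "Studies show otherwise."]
toy_labels = ["adequate", "ineffective", "adequate"]

accuracy, seconds = evaluate(always_adequate, toy_texts, toy_labels)
print(f"accuracy={accuracy:.2f}, time={seconds:.6f}s")
```

A heavier model would typically raise the first number and the second one together, which is the tension the second track is designed to reward resolving.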

## **Import Libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

## **Reading the train file**

In [None]:
train_df = pd.read_csv("../input/feedback-prize-effectiveness/train.csv")

train_df.head()

## **Checking the shape of the train dataset**

In [None]:
train_df.shape

## **Checking the info of the train data**

In [None]:
train_df.info()

## **Checking missing values**

In [None]:
train_df.isnull().sum()
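
Raw counts of missing values are easier to judge next to the share of rows they represent. The sketch below shows one way to get both, assuming a small made-up frame (`toy_df`) in place of `train_df`; the column values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for train_df; the values are made up.
toy_df = pd.DataFrame({
    "discourse_id": ["a1", "b2", "c3", "d4"],
    "discourse_text": ["Some claim.", None, "Some evidence.", "A lead."],
})

missing_count = toy_df.isnull().sum()   # number of missing cells per column
missing_share = toy_df.isnull().mean()  # fraction of rows missing per column
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```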

## **Reading the test dataset**

In [None]:
test_df = pd.read_csv('../input/feedback-prize-effectiveness/test.csv')

test_df.head()

## **Checking the shape of the test data**

In [None]:
test_df.shape

## **Checking missing values**

In [None]:
test_df.isnull().sum()

## **Checking info of the test dataset**

In [None]:
test_df.info()

# ***The Visualization***

### ***Let's draw a Funnel-Chart for better visualization (discourse_effectiveness)***


In [None]:
temp = train_df.groupby('discourse_effectiveness').count()['discourse_id'].reset_index().sort_values(by='discourse_id',ascending=False)
fig = go.Figure(go.Funnelarea(
    text =temp.discourse_effectiveness,
    values = temp.discourse_id,
    title = {"position": "top center", "text": "Funnel-Chart of discourse_effectiveness Distribution"}
    ))
fig.show()

### **Distribution of Discourse Effectiveness**

In [None]:
# Count rows per class once instead of rescanning the column for every label.
# color_continuous_scale is dropped: it has no effect on categorical colors.
counts = train_df["discourse_effectiveness"].value_counts().sort_index()
fig = px.bar(x=counts.index, y=counts.values, color=counts.index)
fig.update_xaxes(title="Classes")
fig.update_yaxes(title="Number of Rows")
fig.update_layout(showlegend=True, 
                  title={
                      'text':'Discourse Effectiveness Distribution', 
                      'y':0.95, 
                      'x':0.5, 
                      'xanchor':'center', 
                      'yanchor':'top'}, template="seaborn")
fig.show()
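
The charts above show the class balance visually; the same information can be read as numbers. A minimal sketch, assuming a made-up label series (`toy_labels`) rather than the real `discourse_effectiveness` column:

```python
import pandas as pd

# Made-up labels standing in for train_df["discourse_effectiveness"].
toy_labels = pd.Series(
    ["adequate", "adequate", "effective", "ineffective", "adequate", "effective"],
    name="discourse_effectiveness",
)

counts = toy_labels.value_counts()                # rows per class
shares = toy_labels.value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()     # largest class vs smallest class

print(shares)
print("imbalance ratio:", imbalance_ratio)
```

A ratio far from 1 is a hint that stratified splits or class weights may be worth considering when a model is trained later.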

### ***Let's draw a Funnel-Chart for better visualization (discourse_type)***

In [None]:
temp = train_df.groupby('discourse_type').count()['discourse_id'].reset_index().sort_values(by='discourse_id',ascending=False)
fig = go.Figure(go.Funnelarea(
    text =temp.discourse_type,
    values = temp.discourse_id,
    title = {"position": "top center", "text": "Funnel-Chart of discourse_type Distribution"}
    ))
fig.show()

### **Distribution of Discourse Type**

In [None]:
# Count rows per discourse type once instead of rescanning the column for every label.
# color_continuous_scale is dropped: it has no effect on categorical colors.
counts = train_df["discourse_type"].value_counts().sort_index()
fig = px.bar(x=counts.index, y=counts.values, color=counts.index)
fig.update_xaxes(title="Classes")
fig.update_yaxes(title="Number of Rows")
fig.update_layout(showlegend=True, 
                  title={
                      'text':'Discourse Type Distribution', 
                      'y':0.95, 
                      'x':0.5, 
                      'xanchor':'center', 
                      'yanchor':'top'}, template="seaborn")
fig.show()

In [None]:
!pip install textstat

In [None]:
train_new=train_df.copy()

In [None]:
import textstat
train_new.head()

In [None]:
# Compute readability features for each discourse element with textstat.
# Series.apply replaces the row-by-row chained assignment (df[col][i] = ...),
# which triggers SettingWithCopyWarning and may not write to the frame at all.
texts = train_new['discourse_text']
# Note: despite its name, Reading_Time holds the Flesch reading-ease score.
train_new['Reading_Time'] = texts.apply(textstat.flesch_reading_ease)
train_new['Average_Char_per_Word'] = texts.apply(textstat.avg_character_per_word)
train_new['Average_Sen_Length'] = texts.apply(textstat.avg_sentence_length)
train_new['Average_Syllables_per_Word'] = texts.apply(textstat.avg_syllables_per_word)
train_new['Word_Count'] = texts.apply(textstat.lexicon_count)
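
The `Reading_Time` column above is filled by `textstat.flesch_reading_ease`, so it is a readability score rather than a duration. As a rough sketch of what that score measures, the standard Flesch reading-ease formula can be written directly from word, sentence, and syllable counts. The helper below is a hypothetical stand-alone function, not textstat's implementation, which also has to estimate those counts from raw text and may therefore give slightly different values.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch reading-ease formula; higher scores mean easier text."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Made-up counts: 100 words in 5 sentences with 150 syllables in total.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
print(round(score, 3))
```

Longer sentences and longer words both pull the score down, which is why the two averages computed alongside it tend to move with it.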

In [None]:
train_new.head()

## ***Reading ease (Flesch score) variation across all discourse types :***

In [None]:
# The Reading_Time column holds the Flesch reading-ease score.
fig=px.histogram(data_frame=train_new,x="Reading_Time",marginal="violin",color="discourse_type")

fig.update_layout(title={'text': "Reading ease (Flesch score) variation across all discourse types:",
                         'font': {'size': 25}},
                  template='plotly_white')
fig.show()

## ***Average characters per word for all discourse types :***

In [None]:
fig=px.histogram(data_frame=train_new,x="Average_Char_per_Word",marginal="violin",color="discourse_type")

fig.update_layout(title={'text': "Average characters per word for all discourse types:",
                         'font': {'size': 25}},
                  template='plotly_white')
fig.show()

## ***Average sentence length for all discourse types :***

In [None]:
fig=px.histogram(data_frame=train_new,x="Average_Sen_Length",marginal="violin",color="discourse_type")

fig.update_layout(title={'text': "Average sentence length for all discourse types:",
                         'font': {'size': 25}},
                  template='plotly_white')
fig.show()

## ***Average syllables per word for all discourse types :***

In [None]:
fig=px.histogram(data_frame=train_new,x="Average_Syllables_per_Word",marginal="violin",color="discourse_type")

fig.update_layout(title={'text': "Average syllables per word for all discourse types:",
                         'font': {'size': 25}},
                  template='plotly_white')
fig.show()

## ***Word count for all discourse types :***

In [None]:
fig=px.histogram(data_frame=train_new,x="Word_Count",marginal="violin",color="discourse_type")

fig.update_layout(title={'text': "Word count for all discourse types:",
                         'font': {'size': 25}},
                  template='plotly_white')
fig.show()
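
The histograms compare the distributions by eye; a grouped summary gives the same comparison as a table. A minimal sketch, assuming a made-up frame (`toy`) with the same `discourse_type` and `Word_Count` column names as `train_new`:

```python
import pandas as pd

# Made-up rows standing in for train_new.
toy = pd.DataFrame({
    "discourse_type": ["Claim", "Claim", "Evidence", "Evidence", "Lead"],
    "Word_Count": [10, 14, 60, 80, 30],
})

# One row per discourse type with the number of elements and typical length.
summary = toy.groupby("discourse_type")["Word_Count"].agg(["count", "median", "mean"])
print(summary)
```

The same call should work for the other feature columns by swapping the column name.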

# ***Discourse Type Distribution:***

In [None]:
values = train_new['discourse_type'].value_counts()
labels=values.index
text=values.index
fig = go.Figure(data=[go.Pie(values=values,labels=labels,hole=.3,pull=[0.1,0,0,0,0,0,0])])
fig.update_traces(hoverinfo='label+percent', textinfo='label', textfont_size=20,
                  marker=dict(line=dict(color='#000000', width=3)))
fig.update_layout(title={'text': "Discourse Type Distribution", 'font': {'size': 25}})
fig.show()

In [None]:
from nltk.corpus import stopwords
from nltk.util import ngrams
import nltk
from tqdm.notebook import tqdm
from sklearn.feature_extraction.text import CountVectorizer

### ***Most Frequent Bigrams across all types :***

In [None]:
def get_n_grams(n_grams, top_n=10):
    """Return the top_n most frequent n-grams for each discourse type."""
    frames = []
    for dt in tqdm(train_df['discourse_type'].unique()):
        texts = train_df.loc[train_df['discourse_type'] == dt, 'discourse_text'].tolist()
        # Count n-grams of exactly size n_grams, ignoring English stop words.
        vec = CountVectorizer(lowercase=True, stop_words='english',
                              ngram_range=(n_grams, n_grams)).fit(texts)
        bag_of_words = vec.transform(texts)
        sum_words = bag_of_words.sum(axis=0)  # total count per vocabulary term
        words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
        cvec_df = pd.DataFrame.from_records(
            words_freq, columns=['words', 'counts']
        ).sort_values(by="counts", ascending=False)
        cvec_df.insert(0, "Discourse_type", dt)
        frames.append(cvec_df.iloc[:top_n, :])

    # DataFrame.append was removed in pandas 2.0; concatenate once instead.
    return pd.concat(frames)

In [None]:
bigrams = get_n_grams(n_grams = 2, top_n=10)
bigrams.head()
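
To make the counting step in `get_n_grams` concrete, here is a standard-library sketch of the same idea: drop stop words, pair each remaining token with its successor, and tally the pairs. It is a simplification, not `CountVectorizer` itself. `top_bigrams`, the tiny `STOP_WORDS` set, and the toy sentences are hypothetical, and the tokenizer here is cruder than scikit-learn's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; scikit-learn's English list is much longer.
STOP_WORDS = {"the", "a", "is", "to", "in", "and", "of"}


def top_bigrams(texts, top_n=3):
    """Return the top_n most frequent bigrams as (bigram, count) pairs."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
        counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return [(" ".join(pair), n) for pair, n in counts.most_common(top_n)]


# Made-up sentences, not taken from the competition data.
toy_texts = [
    "The electoral college is unfair to voters",
    "Keep the electoral college in place",
    "Voters trust the electoral college",
]
result = top_bigrams(toy_texts, top_n=1)
print(result)
```

Because stop words are removed before pairing, a bigram can join two words that were not adjacent in the original sentence, which is worth keeping in mind when reading the charts.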

In [None]:
f1=bigrams[bigrams['Discourse_type']=='Lead']
fig=px.histogram(y=f1['counts'],x=f1['words'],color=f1.Discourse_type,color_discrete_sequence=['blue'])

fig.update_layout(title={'text': "Bigrams in Lead :", 'font': {'size': 25}},
                  template='plotly_white')
fig.show()

In [None]:
f2=bigrams[bigrams['Discourse_type']=='Position']
fig=px.histogram(y=f2['counts'],x=f2['words'],color=f2.Discourse_type,color_discrete_sequence=['tomato'])

fig.update_layout(title={'text': "Bigrams in Position :", 'font': {'size': 25}},
                  template='plotly_white')
fig.show()

In [None]:
f3=bigrams[bigrams['Discourse_type']=='Claim']
fig=px.histogram(y=f3['counts'],x=f3['words'],color=f3.Discourse_type,color_discrete_sequence=['gold'])

fig.update_layout(title={'text': "Bigrams in Claim :", 'font': {'size': 25}},
                  template='plotly_white')
fig.show()

In [None]:
f4=bigrams[bigrams['Discourse_type']=='Evidence']
fig=px.histogram(y=f4['counts'],x=f4['words'],color=f4.Discourse_type,color_discrete_sequence=['lightgreen'])

fig.update_layout(title={'text': "Bigrams in Evidence :", 'font': {'size': 25}},
                  template='plotly_white')
fig.show()

In [None]:
f5=bigrams[bigrams['Discourse_type']=='Counterclaim']
fig=px.histogram(y=f5['counts'],x=f5['words'],color=f5.Discourse_type,color_discrete_sequence=['lightskyblue'])

fig.update_layout(title={'text': "Bigrams in Counterclaim :", 'font': {'size': 25}},
                  template='plotly_white')
fig.show()

In [None]:
f6=bigrams[bigrams['Discourse_type']=='Rebuttal']
fig=px.histogram(y=f6['counts'],x=f6['words'],color=f6.Discourse_type,color_discrete_sequence=['gray'])

fig.update_layout(title={'text': "Bigrams in Rebuttal :", 'font': {'size': 25}},
                  template='plotly_white')
fig.show()

In [None]:
f7=bigrams[bigrams['Discourse_type']=='Concluding Statement']
fig=px.histogram(y=f7['counts'],x=f7['words'],color=f7.Discourse_type,color_discrete_sequence=['hotpink'])

fig.update_layout(title={'text': "Bigrams in Concluding Statement :", 'font': {'size': 25}},
                  template='plotly_white')
fig.show()

## **If you like my work, kindly upvote it !! 😊😊**


### ***Here is some more of my work ✍️ (have a look):***


* **[Spaceship Titanic : Story of a Space Titanic 🌌🚢](https://www.kaggle.com/code/deepakkaura/spaceship-titanic-story-of-a-space-titanic)**


* **[H&M : Insightful Plots and Prediction](https://www.kaggle.com/code/deepakkaura/h-m-insightful-plots-and-prediction)**


* **[A few samples of my work](https://github.com/deepak7642/Few-Samples-of-My-work)**
