
#  <span style="color:green"> <center> A Quick Look at ```Selected Text``` Noise

### In this notebook, I am exploring the noise in the dataset. 
1. I analysed the ```selected_text``` column in the original dataset.
2. I compared ```selected_text``` with the ```prediction``` of a reference model to find out why I am getting low performance in positive and negative sentiment predictions.


### Reference Model
I use the predictions of a trained model ```(5-fold ensemble of RoBERTa-Base, public LB 0.712)``` to find anomalies. Many of them can be identified through error analysis. In particular, when my Jaccard score is very low, the cause is often the inherent noise in the data.


### <span style="color:orange"> Can We Use It For Our Benefit?



### <span style="color:green">  If you find it useful, please upvote! Thank you! 🔥</span>

**N.B. I have listed only a few kinds of noise. At the end of the notebook you can generate random samples to find more varieties.**

## Include Prerequisites

In [None]:
# PACKAGES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import sys, os
from IPython.display import display

# DATA I/O
print('Loading Ground Truth')
train_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
display(train_df.head())

# Load prediction of Trained RoBERTa Model (5 fold ensemble, 0.712LB)
print('Loading Trained Model Predictions')
preds = pd.read_csv('/kaggle/input/tweeterroranalysis/error-analysis-roberta-base-5model.csv')
preds.drop(columns=['textID'], inplace=True)
display(preds.head())

# Anomaly Statistics: one indicator column per noise type is added below
anamoly = train_df.copy().drop(columns=['textID','text','selected_text','sentiment'])

## Estimate Jaccard Score of the Model Predictions

In [None]:
def jaccard(str1, str2): 
    """Word-level Jaccard score: |A ∩ B| / |A ∪ B| on lowercased, whitespace-split tokens."""
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

preds['Jaccard'] = preds.apply(lambda x: jaccard(str(x.selected_text), str(x.prediction)), axis=1)
# str() guards against missing (NaN) predictions
preds['word_length'] = preds['prediction'].apply(lambda x: len(str(x).split()))
display(preds.head())

## Let's Look @ my Jaccard Distribution

- There are tons of errors in positive and negative sentiment predictions.
- At prediction time my model copies the whole text as the prediction when the sentiment is ```neutral```. This is not a perfect strategy, as some neutral rows have Jaccard scores below 1.
- Average Jaccard values for 3 sentiments: 
``` 
POSITIVE: 0.5814722049084579
NEGATIVE: 0.5906752954713453
NEUTRAL:  0.9764467881939682
```
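To see why copying the whole tweet is not always perfect for ```neutral``` rows, here is a minimal sketch that scores the copy strategy directly against ```selected_text```. The toy rows are made up for illustration (they are not taken from ```train.csv```), and the column name ```copy_jaccard``` is my own assumption.

```python
import pandas as pd

def jaccard(str1, str2):
    # Same word-level metric as used in this notebook
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    'text': ['going to the store later', 'meh , not sure about this', 'good morning all'],
    'selected_text': ['going to the store later', 'not sure about this', 'good morning'],
    'sentiment': ['neutral', 'neutral', 'positive'],
})

neutral = toy[toy['sentiment'] == 'neutral'].copy()
# Score obtained when the prediction is simply the full tweet text
neutral['copy_jaccard'] = neutral.apply(
    lambda r: jaccard(r['text'], r['selected_text']), axis=1)

# Neutral rows where the label is a strict subset of the tweet
imperfect = neutral[neutral['copy_jaccard'] < 1.0]
print(neutral['copy_jaccard'].mean())
print(imperfect[['text', 'selected_text', 'copy_jaccard']])
```

On the real data, the same filter applied to ```train_df``` would list the neutral rows where the label is shorter than the tweet.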

In [None]:
dfg = preds.groupby(['sentiment'])
df_pos = dfg.get_group('positive')
df_neg = dfg.get_group('negative')
df_neutral = dfg.get_group('neutral')
fig=px.histogram(df_pos, x='Jaccard',title='Positive Sentiment');fig.show();
fig=px.histogram(df_neg, x='Jaccard',title='Negative Sentiment');fig.show();
fig=px.histogram(df_neutral, x='Jaccard',title='Neutral Sentiment');fig.show();

print("All data Jaccard: " + str(preds.Jaccard.mean()))
print("+ data Jaccard: " + str(df_pos.Jaccard.mean()))
print("- data Jaccard: " + str(df_neg.Jaccard.mean()))
print("= data Jaccard: " + str(df_neutral.Jaccard.mean()))

# 👻 Warmup: Let's Look at the Lonely Characters in ```selected_text```

### Not many of those!

In [None]:
# Flag rows whose selected_text contains a punctuation mark standing alone as a token
lonely_chars = {'comma': ',', 'semicolon': ';', 'colon': ':', 'period': '.', '@': '@', '_': '_'}
for name, ch in lonely_chars.items():
    anamoly['lonely ' + name] = train_df.selected_text.apply(lambda x: int(ch in str(x).split()))

print("Lonely Character Statistics")
display(pd.DataFrame(anamoly.iloc[:,:6].sum()))

# Now Let's See Where My Model Scored Jaccard=0 for Positive and Negative Sentiments.

### NOT FAIR!😳 I am penalized because:

### 1. Missing a ```!```  <span style="color:orange">Damn! It ```hurts!!!```

### 2. Missing a ```.```  <span style="color:orange">It is ```stupid...```

### 3. Missing ```d``` in ```good```?  <span style="color:orange">LOL. It's not ```goo```

### 4. Missing ```ng``` in ```amazing```?  <span style="color:orange">Dude. It's not ```amazi``` at all!

### 5. Words are not complete in the data! E.g. ```st jokin``` instead of ```just joking```.

### ...
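A minimal sketch of why these cases cost so much: the metric compares whole whitespace-separated tokens, so a one-character difference makes two single-word strings completely disjoint. The pairs below are made up to mirror the cases listed above; they are not rows from the dataset.

```python
def jaccard(str1, str2):
    # Same word-level metric as used in this notebook
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# (ground truth, prediction) pairs, invented for illustration
pairs = [
    ('hurts!!!', 'hurts!!!!'),   # one extra '!'
    ('stupid..', 'stupid...'),   # one extra '.'
    ('goo', 'good'),             # label truncated by one character
    ('amazi', 'amazing'),        # label truncated by two characters
    ('so good', 'good'),         # a real partial overlap, for contrast
]

scores = {pair: jaccard(*pair) for pair in pairs}
for (truth, pred), score in scores.items():
    print(f'{truth!r:12} vs {pred!r:12} -> {score}')
```

The first four pairs score 0.0 and only the last one gets partial credit (0.5), which is why single-word predictions are so sensitive to this kind of noise.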

In [None]:
# DataFrame.append was removed in pandas 2.0, so the two groups are combined with pd.concat
df_all0 = pd.concat([df_neg[df_neg.Jaccard==0], df_pos[df_pos.Jaccard==0]]).reset_index(drop=True)
print('Missing !')
display(df_all0[df_all0.index==340])
print('Missing .')
display(df_all0[df_all0.index==226])
print('Data has missing character at the end')
display(df_all0[df_all0.index==792])
print('Data has missing multiple characters at the end')
display(df_all0[df_all0.index==621])
print('Data has words only containing last two characters')
display(df_all0[df_all0.index==1150])
display(df_all0[df_all0.index==581])
display(df_all0[df_all0.index==906])
print('Data has words only containing last one character')
display(df_all0[df_all0.index==181])
display(df_all0[df_all0.index==984])
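
One way to hunt for truncated words like the ones displayed above is to check whether ```selected_text``` begins or ends inside a word of ```text```. This is only a sketch: the helper ```cuts_through_word``` and the toy rows are hypothetical, and it inspects only the first verbatim occurrence of the span, so it can misjudge spans that appear more than once in a tweet.

```python
import pandas as pd

def cuts_through_word(text, selected_text):
    """Return True if selected_text starts or ends in the middle of a word of text."""
    if not selected_text:
        return False
    start = text.find(selected_text)  # first verbatim occurrence only
    if start == -1:
        return False  # span not found verbatim, so it cannot be judged
    end = start + len(selected_text)
    # A character directly before the span that is not whitespace means a cut word start
    starts_mid_word = start > 0 and not text[start - 1].isspace()
    # A character directly after the span that is not whitespace means a cut word end
    ends_mid_word = end < len(text) and not text[end].isspace()
    return starts_mid_word or ends_mid_word

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    'text': ['is it just joking around', 'this is so good today', 'I love this'],
    'selected_text': ['st jokin', 'goo', 'love'],
})
toy['truncated'] = toy.apply(
    lambda r: cuts_through_word(str(r['text']), str(r['selected_text'])), axis=1)
print(toy)
```

Applied to ```train_df```, the rows flagged ```True``` would be candidates for this kind of label noise.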

# <span style="color:red"> I gained 0.001LB just by adding the following post-processing! Can you come up with more ways to improve?

```
sample = pd.read_csv("../input/tweet-sentiment-extraction/sample_submission.csv")
sample.loc[:, 'selected_text'] = final_output

sample['selected_text'] = sample['selected_text'].apply(lambda x: x.replace('!!!!', '!') if len(x.split())==1 else x)
sample['selected_text'] = sample['selected_text'].apply(lambda x: x.replace('..', '.') if len(x.split())==1 else x)
sample['selected_text'] = sample['selected_text'].apply(lambda x: x.replace('...', '.') if len(x.split())==1 else x)
```
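
As one possible variation to try (untested on the leaderboard, so treat it as an assumption rather than a known improvement), repeated punctuation can be collapsed with a single regular expression. The helper name ```collapse_repeats``` is hypothetical. The sketch also shows what the chained ```str.replace``` calls produce for a three-period ending, which is worth knowing when comparing the two approaches.

```python
import re

def collapse_repeats(token):
    """Collapse runs of '.' or '!' to one character, for single-word strings only."""
    if len(token.split()) != 1:
        return token  # multi-word strings are left untouched
    return re.sub(r'([.!])\1+', r'\1', token)

def chained_replace(token):
    """The three replacements from the snippet above, applied in the same order."""
    if len(token.split()) != 1:
        return token
    return token.replace('!!!!', '!').replace('..', '.').replace('...', '.')

for s in ['hurts!!!!', 'stupid...', 'wow..!!', 'so good...']:
    print(repr(s), '->', repr(chained_replace(s)), '|', repr(collapse_repeats(s)))
```

The two functions differ on ```'stupid...'```: the chained version returns ```'stupid..'``` because ```'..'``` is replaced first, while the regex version returns ```'stupid.'```. Which target matches the noisy labels more often is something to measure on the training predictions.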

# To find more noise, you can run the following code to get random samples!

In [None]:
df_all0.sample(20) 

# Scrap

In [None]:
df = preds.copy()

print('Before')
print(df.Jaccard.mean())

df['prediction'] = df['prediction'].apply(lambda x: x.replace('!!!!', '!') if len(x.split())==1 else x)
df['prediction'] = df['prediction'].apply(lambda x: x.replace('..', '.') if len(x.split())==1 else x)
df['prediction'] = df['prediction'].apply(lambda x: x.replace('...', '.') if len(x.split())==1 else x)
df['Jaccard'] = df.apply(lambda x: jaccard(str(x.selected_text), str(x.prediction)), axis=1)

print('After')
print(df.Jaccard.mean())

In [None]:
import re

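def post_process(s):
    """Hedge a single-word prediction by emitting several punctuation variants.

    'word...'  -> 'word. word..'
    'word!!!!' -> 'word! word!! word!!!'
    Anything else is returned unchanged.
    """
    a = re.findall('[^A-Za-z0-9]', s)   # every non-alphanumeric character
    b = re.sub('[^A-Za-z0-9]+', '', s)  # the word with punctuation stripped

    if a.count('.')==3:
        return b + '. ' + b + '..'
    if a.count('!')==4:
        return b + '! ' + b + '!! ' +  b + '!!!'
    return s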
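
The idea behind emitting several variants can be checked with a small worked example: a hedged prediction trades a possible 1.0 for a guaranteed 0.5 on the shorter variants. This sketch repeats the logic of ```post_process``` and ```jaccard``` so it runs on its own; the strings are invented for illustration.

```python
import re

def jaccard(str1, str2):
    # Same word-level metric as used in this notebook
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def post_process(s):
    # Same branching as the notebook's post_process
    a = re.findall('[^A-Za-z0-9]', s)
    b = re.sub('[^A-Za-z0-9]+', '', s)
    if a.count('.') == 3:
        return b + '. ' + b + '..'
    if a.count('!') == 4:
        return b + '! ' + b + '!! ' + b + '!!!'
    return s

pred = 'stupid...'
hedged = post_process(pred)  # 'stupid. stupid..'

# Compare raw and hedged predictions against each plausible ground truth
results = {}
for truth in ['stupid.', 'stupid..', 'stupid...']:
    results[truth] = (jaccard(truth, pred), jaccard(truth, hedged))
    print(f'{truth!r:12} raw={results[truth][0]} hedged={results[truth][1]}')
```

The hedged string scores 0.5 when the label is ```stupid.``` or ```stupid..``` (where the raw prediction scores 0) and drops to 0 when the label really is ```stupid...```, so it only pays off if the shorter labels are more common.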
    

In [None]:
df = preds.copy()
#df = df_all0.copy()

print('Before')
print(df.Jaccard.mean())

df['prediction'] = df.apply(lambda x: post_process(x['prediction']) if (len(str(x['prediction']).split())==1) else x['prediction'], axis=1)
df['Jaccard'] = df.apply(lambda x: jaccard(str(x.selected_text), str(x.prediction)), axis=1)

print('After')
print(df.Jaccard.mean())

In [None]:
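was_zero = preds.Jaccard == 0  # rows that scored 0 before post-processing
improved = df[was_zero & (df.Jaccard > 0)]
print(f'Improved {100*len(improved)/was_zero.sum()}% of the Jaccard=0 entries')
improved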

In [None]:
df = df_all0.copy()
# Count periods only when both prediction and ground truth are single words; NaN otherwise
df['pred_count'] = df.apply(lambda x: x['prediction'].count('.') if (len(x['prediction'].split())==1 and len(x['selected_text'].split())==1) else np.nan, axis=1)
df['gt_count'] = df.apply(lambda x: x['selected_text'].count('.') if (len(x['prediction'].split())==1 and len(x['selected_text'].split())==1) else np.nan, axis=1)

In [None]:
a = df[df.word_length==1]

# For each number of periods in the prediction, plot the period counts found in the ground truth
for n_dots in range(1, 5):
    print(f'{n_dots} .')
    t = a[a.pred_count==n_dots].gt_count
    u = plt.hist(t)
    plt.bar(np.arange(len(u[0])),u[0]); plt.show();