# Distilbert-base-uncased-emotion
Base Model - https://huggingface.co/distilbert-base-uncased#distilbert-base-model-uncased

This notebook uses the Hugging Face model [bhadresh-savani/distilbert-base-uncased-emotion](https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion#distilbert-base-uncased-emotion) to compute emotion scores for text, taking context into account.

**About the Model:** 

DistilBERT is trained with knowledge distillation during the pre-training phase, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding. It is smaller and faster than BERT.

The model is distilbert-base-uncased fine-tuned on the emotion dataset using the Hugging Face Trainer with the following hyperparameters:

1.   learning rate : 2e-5
2.   batch size : 64
3.   num_train_epochs: 8




##  Installing Libraries and Packages

In [1]:
!pip install -q transformers


# TextClassifier with DistilBert

In [136]:
from google.colab import drive 
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [159]:
import pandas as pd
from transformers import pipeline
import os
import argparse
import time
import html
import re

inputFileName="/content/drive/MyDrive/Colab Notebooks/data/blog_sample_data.json"
outputFileName="/content/drive/MyDrive/Colab Notebooks/data/blog_sample_data-classified.json"

# read the json file
df = pd.read_json(inputFileName)

# define the distilbert text-classifier
classifier = pipeline("text-classification",model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True)


def clean_text(text):
    """Cleans string for processing. Decodes bytes, removes emails and .com urls
    Args:
        text (Union[str, bytes, None]): text to clean
    Returns:
        Union[str, None]: Returns cleaned string, or None if no text remains after cleaning
    """
    if not text:
        return None
    if isinstance(text, bytes):  # Decode byte strings before any str operations
        text = text.decode('utf-8')
    # Remove emails + ***.com urls; split()/join also collapses repeated whitespace
    text = ' '.join(item for item in text.split() if '@' not in item and '.com' not in item)
    return text or None

# method to classify the text and measure elapsed wall-clock time in seconds
def get_distil_bert_textClassification(txt):
    # Cleaning the text
    cleantxt = clean_text(txt)
    if cleantxt is None:
        cleantxt = ""
    # perf_counter measures wall-clock time; process_time would measure CPU time
    t = time.perf_counter()
    result = classifier(cleantxt)
    elapsed_time = time.perf_counter() - t
    return pd.Series([result, elapsed_time])

# apply get_distil_bert_textClassification() to column 'content' and assign the results to the classification_score and elapsed_time columns
df[['classification_score', 'elapsed_time']]= df['content'].apply(get_distil_bert_textClassification)

# flatten the classification_score
emotions = ["sadness", "joy", "love", "anger", "fear", "surprise"]
df['classification_score_flatten'] = df['classification_score'].apply(lambda x: x[0])
# named 'scores' rather than 'list' to avoid shadowing the built-in
scores = pd.DataFrame.from_records(df['classification_score_flatten'],
                                   columns=emotions)

# one column per emotion, holding only the numeric score
for emotion in emotions:
    df[emotion] = scores[emotion].apply(lambda x: x['score'])

# Save the dataframe as a CSV file (note that outputFileName ends in .json).
df.drop(columns=['classification_score', 'classification_score_flatten'])\
    .to_csv(outputFileName)


Token indices sequence length is longer than the specified maximum sequence length for this model (3729 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: ignored

# **Observations**
The input text must be cleaned and shortened before emotion scores are calculated, for the following reasons:

1. The **bhadresh-savani/distilbert-base-uncased-emotion** model does not handle empty strings and `None` sensibly.

*   `classifier(None)` throws an error:

```
text input must of type str (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]`
```

*   `classifier("")` returns scores even though the input is empty:

```
[[{'label': 'sadness', 'score': 0.11096163094043732},
  {'label': 'joy', 'score': 0.20327672362327576},
  {'label': 'love', 'score': 0.02004520408809185},
  {'label': 'anger', 'score': 0.47305959463119507},
  {'label': 'fear', 'score': 0.17409689724445343},
  {'label': 'surprise', 'score': 0.018559947609901428}]]
```

2. The bhadresh-savani/distilbert-base-uncased-emotion model accepts at most 512 tokens; longer input raises a RuntimeError:

```
RuntimeError: The size of tensor a (3729) must match the size of tensor b (512) at non-singleton dimension 1
```
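
Both input problems in observation 1 can be caught before the model is called. The following is a minimal sketch, not part of the original notebook: `classify_or_none` is a hypothetical helper, and `fake_classifier` is a stand-in with a made-up score so the example runs without downloading the model.

```python
def classify_or_none(classifier, text):
    """Return classifier(text), or None when text is None or blank.

    Hypothetical helper: it avoids the error raised for None and the
    arbitrary scores returned for an empty string.
    """
    if text is None or not text.strip():
        return None
    return classifier(text)


def fake_classifier(text):
    # Stand-in for the Hugging Face pipeline; the score is made up.
    return [[{"label": "joy", "score": 1.0}]]


print(classify_or_none(fake_classifier, None))     # None
print(classify_or_none(fake_classifier, "   "))    # None
print(classify_or_none(fake_classifier, "hello"))  # [[{'label': 'joy', 'score': 1.0}]]
```

Returning `None` keeps rows without text distinguishable from rows with real scores; a caller could drop or skip them afterwards.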




# **Solutions**

1.   Clean the text.
2.   Break long strings into chunks of 512 characters.
3.   Calculate the classifier emotion scores for each chunk - Emotions {sadness, joy, love, anger, fear, surprise}.
4.   Take the average of the scores over all chunks.
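
Step 2 can also be done on word boundaries, so that no word is cut in half at the edge of a chunk. This is a minimal sketch using the standard library's `textwrap.wrap`, not the approach used in the cells that follow. `chunk_text` is a hypothetical name, the sample input is made up, and the limit counts characters, which is assumed here to keep each chunk within the model's 512-token limit.

```python
import textwrap


def chunk_text(text, max_chars=512):
    """Split text into chunks of at most max_chars characters, breaking at whitespace."""
    return textwrap.wrap(text, width=max_chars)


sample = "word " * 300  # 1500 characters of hypothetical input
chunks = chunk_text(sample)
print(len(chunks))                  # number of chunks produced
print(max(len(c) for c in chunks))  # longest chunk, never more than 512
print(chunk_text(""))               # [] : empty text yields no chunks
```

Note that `textwrap.wrap` also collapses runs of whitespace, which is harmless for text that has already been cleaned.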

In [160]:
def clean_data(df):
    """Cleans dataframe for processing. Decodes bytes, lowercases, and removes html entities, @ mentions, urls, non-alpha characters
    Args:
        df (pd.DataFrame): dataframe with a 'content' column of text to clean
    Returns:
        pd.DataFrame: the same dataframe with the 'content' column cleaned
    """
    #Decode byte strings
    df['content'] = df['content'].apply(lambda t: t.decode('utf-8') if isinstance(t, bytes) else t)
    #Lowercase all text
    df['content'] = df['content'].apply(lambda t: t.lower())
    #Decode HTML
    df['content'] = df['content'].apply(lambda t: html.unescape(t))
    #Remove @ mentions
    df['content'] = df['content'].apply(lambda t: re.sub(r'@[A-Za-z0-9]+', '', t))
    #Remove URLs
    df['content'] = df['content'].apply(lambda t: re.sub(r'https?://[A-Za-z0-9./]+', '', t))
    #Replace remaining non-alpha characters with spaces
    df['content'] = df['content'].apply(lambda t: re.sub(r'[^a-zA-Z]', ' ', t))
    return df

inputFileName="/content/drive/MyDrive/Colab Notebooks/data/blog_sample_data.json"

# read the json file
df = pd.read_json(inputFileName)

df = clean_data(df)

# Check the length of each text after cleaning
df.content.apply(len)

0        2058
1       17945
2       10901
3        7951
4       66287
        ...  
9995     3074
9996     5429
9997     1576
9998       27
9999       24
Name: content, Length: 10000, dtype: int64
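
To see what the cleaning steps do to a single document, the same substitutions can be applied to one string. This is a sketch only: `clean_string` is a hypothetical helper that mirrors the steps of `clean_data` for a `str` input, and the sample sentence is made up.

```python
import html
import re


def clean_string(t):
    """Apply the clean_data steps to one string (hypothetical helper)."""
    t = t.lower()                                   # lowercase
    t = html.unescape(t)                            # decode HTML entities
    t = re.sub(r'@[A-Za-z0-9]+', '', t)             # remove @ mentions
    t = re.sub(r'https?://[A-Za-z0-9./]+', '', t)   # remove URLs
    t = re.sub(r'[^a-zA-Z]', ' ', t)                # non-alpha characters become spaces
    return t


sample = "Hi @Bob &amp; all: see https://example.com/x now!"
print(clean_string(sample).split())  # ['hi', 'all', 'see', 'now']
```

Because removed characters are replaced by spaces rather than deleted, the cleaned text contains runs of spaces, and the character counts above include them.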

In [161]:
# define the distilbert text-classifier
classifier = pipeline("text-classification",model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True)


In [162]:
def chunk_list(lst: list, items_per_chunk: int):
    """Breaks a list into chunks

    Args:
        lst ([list]): List to chunk
        items_per_chunk ([int]): Number of items per list

    Yields:
        [list]: a chunk of lst, with size 'items_per_chunk'
    """
    for i in range(0, len(lst), items_per_chunk):
        yield lst[i:i + items_per_chunk]
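
A short usage sketch with made-up inputs: because Python strings support slicing, `chunk_list` also works on a string and yields substrings. An empty string yields no chunks at all, which matters for any code that later divides by the number of chunks.

```python
def chunk_list(lst, items_per_chunk):
    """Same generator as above, repeated so the example is self-contained."""
    for i in range(0, len(lst), items_per_chunk):
        yield lst[i:i + items_per_chunk]


print(list(chunk_list("abcdefgh", 3)))       # ['abc', 'def', 'gh']
print(list(chunk_list([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
print(list(chunk_list("", 512)))             # []
```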


In [165]:
def avg_scoreby512(text):
    """Calculate the average scores of a text by dividing it into chunks of 512 characters.

    Args:
        text (str): cleaned text

    Returns:
        dict: average score per emotion (sadness, joy, love, anger, fear, surprise)
    """
    print("Debug Text - ", text)
    emotions = ["sadness", "joy", "love", "anger", "fear", "surprise"]
    scores = {emotion: [] for emotion in emotions}

    for str_chunk in chunk_list(text, 512):
        # result[0] holds one {'label': ..., 'score': ...} dict per emotion
        result = classifier(str_chunk)
        for item in result[0]:
            scores[item['label']].append(item['score'])
        print("Debug Chunk {} , Sadness Score: {} ".format(len(scores["sadness"]), scores["sadness"][-1]))

    # An empty text yields no chunks; the division below then raises ZeroDivisionError
    averages = {emotion: sum(vals) / len(vals) for emotion, vals in scores.items()}
    print("Debug Average: {}".format(averages))
    # Keys follow the pattern 'sadness_scores', 'joy_scores', ...
    return {emotion + "_scores": avg for emotion, avg in averages.items()}


In [164]:
df['classification_score'] = df['content'].dropna().apply(lambda text: avg_scoreby512(text))

Debug Text -  the student diana ord  ez  from   eso b at ies manuel ca adas has created this wonderful presentation about the chernobyl disaster and nuclear energy   usp sharing here s irene medina  also from   eso b  with her project about deforestation this teenage swedish girl is the promoter of the march   th strike and the fridays for future movement  videos about plastic innovative ways with plastic waste we are going to work in our integrated project  walking around my region   to do so  you will divide into teams of three or four members  each team will research on one of the following towns   moraleda de zafayona  alhama de granada  montefr o  hu tor t jar  loja after your research  you will present the classroom a tourist brochure with information about one of these towns  group task   this is the rubric we will use to evaluate your work  and you can have a look at this example here  each member of the team will research about a different aspect individual task  we will use t

ZeroDivisionError: ignored
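
A likely cause, although the truncated traceback does not confirm it: a row whose `content` is an empty string produces no chunks, so every `len(...)` in `avg_scoreby512` is 0. The sketch below shows one way to guard the averaging step. `average_scores` is a hypothetical helper and the input scores are made up; it takes per-chunk results in the same shape the classifier returns.

```python
def average_scores(chunk_results):
    """Average per-label scores over chunks; return None if there are no chunks.

    chunk_results: list of classifier outputs, each shaped like
    [[{'label': ..., 'score': ...}, ...]].
    """
    if not chunk_results:
        return None  # avoids dividing by zero for empty text
    totals = {}
    for result in chunk_results:
        for item in result[0]:
            totals[item["label"]] = totals.get(item["label"], 0.0) + item["score"]
    return {label: total / len(chunk_results) for label, total in totals.items()}


# Made-up scores for two chunks and two labels.
results = [
    [[{"label": "joy", "score": 0.5}, {"label": "fear", "score": 0.5}]],
    [[{"label": "joy", "score": 1.0}, {"label": "fear", "score": 0.0}]],
]
print(average_scores(results))  # {'joy': 0.75, 'fear': 0.25}
print(average_scores([]))       # None
```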

# Check for NaN

In [132]:
import numpy as np

# Return the row indices where the value is NaN
np.where(pd.isnull(df['content']))


(array([], dtype=int64),)

In [134]:
df.shape

(10000, 2)