# Distilbert-base-uncased-emotion
Base Model - https://huggingface.co/distilbert-base-uncased#distilbert-base-model-uncased

This notebook uses the Hugging Face model [bhadresh-savani/distilbert-base-uncased-emotion](https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion#distilbert-base-uncased-emotion) to score the emotions expressed in text, taking the context of each word into account.

**About the Model:** 

DistilBERT is created with knowledge distillation during the pre-training phase, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding. It is smaller and faster than BERT (about 60% faster, according to the DistilBERT authors).

Distilbert-base-uncased was fine-tuned on the emotion dataset using the Hugging Face Trainer with the following hyperparameters:

1.   learning rate : 2e-5
2.   batch size : 64
3.   num_train_epochs: 8




##  Installing Libraries and Packages

In [None]:
!pip install -q transformers



In [None]:
# Mount Google Drive so the data file can be read
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Place the input file at "/content/drive/MyDrive/Colab Notebooks/data/blog_sample_data.json"

In [None]:
from transformers import pipeline

# TextClassifier with DistilBert

In [None]:
import pandas as pd
import time


inputFileName="/content/drive/MyDrive/Colab Notebooks/data/blog_sample_data.json"
# The results are written with to_csv() below, so use a .csv extension
outputFileName="/content/drive/MyDrive/Colab Notebooks/data/blog_sample_data-classified.csv"

# read the input JSON file into a DataFrame
df = pd.read_json(inputFileName)


# define the distilbert text-classifier
classifier = pipeline("text-classification",model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True)

def clean_text(text):
    """Cleans string for processing. Decodes bytes and removes emails and .com urls
    Args:
        text (Union[str, bytes, None]): text to clean
    Returns:
        Union[str, None]: Returns cleaned string, or None if no text remains after cleaning
    """
    if isinstance(text, bytes):  # decode byte strings before any str operations
        text = text.decode('utf-8')
    if not text:
        return None
    # Remove emails and ***.com urls; split/join also collapses runs of whitespace
    text = ' '.join(item for item in text.split() if '@' not in item and '.com' not in item)
    return text or None

# method to classify the text and measure the elapsed wall-clock time in seconds
def get_distil_bert_textClassification(txt):
    # Clean the text; fall back to an empty string if nothing remains
    cleantxt = clean_text(txt)
    if cleantxt is None:
        cleantxt = ""
    # perf_counter() measures wall-clock time; process_time() counts CPU time only
    t = time.perf_counter()
    result = classifier(cleantxt)
    elapsed_time = time.perf_counter() - t
    return pd.Series([result, elapsed_time])

# apply get_distil_bert_textClassification() to column 'content' and store the results
# in the 'classification_score' and 'elapsed_time' columns
df[['classification_score', 'elapsed_time']]= df['content'].apply(get_distil_bert_textClassification)

# flatten the classification_score: the pipeline returns a one-element list per text
df['classification_score_flatten'] = df['classification_score'].apply(lambda x: x[0])
labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
# named `scores` rather than `list`, which would shadow the Python builtin
scores = pd.DataFrame.from_records(df['classification_score_flatten'].tolist(), columns=labels)

# copy the score of each emotion into its own column
for label in labels:
    df[label] = scores[label].apply(lambda x: x['score'])

# Save the dataframe to CSV file.
df.drop(columns=['classification_score', 'classification_score_flatten'])\
    .to_csv(outputFileName)



Token indices sequence length is longer than the specified maximum sequence length for this model (3729 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: ignored

## Observations - Reduce the Number of Words

1. An interesting observation about the **bhadresh-savani/distilbert-base-uncased-emotion** model: it does not handle `None` or empty strings sensibly.

*   `classifier(None)` raises an error:

```
text input must of type str (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]`
```

*   `classifier("")` still returns scores, even though there is no text to classify:

```
[[{'label': 'sadness', 'score': 0.11096163094043732},
  {'label': 'joy', 'score': 0.20327672362327576},
  {'label': 'love', 'score': 0.02004520408809185},
  {'label': 'anger', 'score': 0.47305959463119507},
  {'label': 'fear', 'score': 0.17409689724445343},
  {'label': 'surprise', 'score': 0.018559947609901428}]]
```

2. `RuntimeError`: the bhadresh-savani/distilbert-base-uncased-emotion model accepts at most 512 tokens per input, so longer texts fail:

```
RuntimeError: The size of tensor a (3729) must match the size of tensor b (512) at non-singleton dimension 1
```
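
Both observations suggest guarding the classifier call. The following is a minimal sketch, not the notebook's original code: `classify_safely` is a hypothetical helper that skips `None` and blank input and asks the pipeline to truncate long input. It assumes the callable passed in forwards the `truncation` keyword to its tokenizer, as the `text-classification` pipeline in recent `transformers` releases does.

```python
from typing import Any, Callable, Optional


def classify_safely(text: Optional[str], classifier: Callable[..., Any]) -> Optional[Any]:
    """Classify `text`, or return None when there is nothing to classify.

    `classifier` is assumed to be a Hugging Face text-classification
    pipeline, or any callable that accepts the same arguments.
    """
    # None and whitespace-only strings carry no emotion signal, and the
    # scores the model returns for "" are not meaningful, so skip the call.
    if text is None or not text.strip():
        return None
    # truncation=True cuts the input to the model's maximum length
    # (512 tokens here) instead of raising a RuntimeError. Everything
    # beyond that limit is ignored.
    return classifier(text, truncation=True)
```

With the `classifier` defined earlier, `classify_safely("", classifier)` returns `None` instead of arbitrary scores. Truncation discards the rest of a long post, so it trades completeness for simplicity.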




In [None]:
import html
import re

df = pd.read_json("/content/drive/MyDrive/Colab Notebooks/data/blog_sample_data.json")

def clean_data(df):
    #Lowercase all posts
    df['content'] = df['content'].apply(lambda t: t.lower())
    #Decode HTML entities
    df['content'] = df['content'].apply(lambda t: html.unescape(t))
    #Remove @ mentions
    df['content'] = df['content'].apply(lambda t: re.sub(r'@[A-Za-z0-9]+','',t))
    #Remove URLs
    df['content'] = df['content'].apply(lambda t: re.sub(r'https?://[A-Za-z0-9./]+','',t))
    #Replace remaining non-alpha characters with spaces
    df['content'] = df['content'].apply(lambda t: re.sub(r"[^a-zA-Z]", " ", t))
    return df

df = clean_data(df)
# Length of each cleaned post, in characters (not tokens)
df.content.apply(len)

0        2058
1       17945
2       10901
3        7951
4       66287
        ...  
9995     3074
9996     5429
9997     1576
9998       27
9999       24
Name: content, Length: 10000, dtype: int64

In [None]:
# Break long strings into chunks of 512 characters, then average the scores of the chunks.

In [None]:
def chunk_lst(lst, items_per_chunk: int):
    """Breaks a list (or any sliceable sequence, such as a string) into chunks

    Args:
        lst (Union[list, str]): Sequence to chunk
        items_per_chunk (int): Number of items per chunk

    Yields:
        Union[list, str]: a chunk of lst, with at most 'items_per_chunk' items
    """
    for i in range(0, len(lst), items_per_chunk):
        yield lst[i:i + items_per_chunk]


In [None]:
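# chunk_lst() returns a generator; list() materializes it so each row stores
# its chunks and can be iterated more than once
df['content'] = df['content'].apply(lambda t: list(chunk_lst(t, 512)))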
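
The averaging step mentioned above is not shown in the notebook. A minimal sketch of it follows, under these assumptions: `average_chunk_scores` is a hypothetical helper, each call to the classifier returns the same shape as the `classifier("")` output in the observations (a list holding one list of `{'label', 'score'}` dicts), and every chunk is weighted equally regardless of its length.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable


def average_chunk_scores(chunks: Iterable[str], classifier: Callable[[str], Any]) -> Dict[str, float]:
    """Classify each chunk and return the mean score per emotion label."""
    totals: Dict[str, float] = defaultdict(float)
    n_chunks = 0
    for chunk in chunks:
        if not chunk.strip():
            continue  # blank chunks carry no signal, so skip them
        # For a single string the pipeline returns [[{...}, ...]],
        # hence the [0] to reach the list of label/score dicts.
        for entry in classifier(chunk)[0]:
            totals[entry["label"]] += entry["score"]
        n_chunks += 1
    if n_chunks == 0:
        return {}  # nothing was classified
    return {label: total / n_chunks for label, total in totals.items()}
```

It could be applied per row as `df['content'].apply(lambda chunks: average_chunk_scores(chunks, classifier))`. Note that `chunk_lst(t, 512)` splits on characters, not tokens, and may cut a word in half at a chunk boundary; splitting on whitespace or on tokenizer output would be an alternative design.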