# Emotion Detection using DistilBERT
---
Twitter user bios are analyzed to estimate emotion probabilities across 28 labels (27 emotions plus neutral). I'm using a DistilBERT transformer trained on the GoEmotions dataset. The model card can be found [here](https://huggingface.co/joeddav/distilbert-base-uncased-go-emotions-student?text=I+feel+lucky+to+be+here.).

### Setup

Install Transformers from [HuggingFace](https://github.com/huggingface/transformers).

Plotly is upgraded to the latest version, and pyyaml is downgraded to 5.4.1 to fix a YAML loader issue.

The emoji package is used to demojize or strip emojis from the text.

In [2]:
!pip install transformers
!pip install --upgrade plotly
!pip install pyyaml==5.4.1
!pip install emoji 

Collecting pyyaml==5.4.1
  Using cached PyYAML-5.4.1-cp39-cp39-win_amd64.whl (213 kB)
Installing collected packages: pyyaml
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 6.0
    Uninstalling PyYAML-6.0:


ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'c:\\programdata\\anaconda3\\lib\\site-packages\\_yaml\\__init__.py'
Consider using the `--user` option or check the permissions.





### Import Necessary Packages


In [3]:
#Models and Core Packages
from transformers import AutoTokenizer, TFAutoModel, pipeline
import pandas as pd

#For Preprocessing
from pathlib import Path
import re    # RegEx for removing non-letter characters
import nltk  # natural language processing
import emoji # processing emojis
nltk.download("stopwords")
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

#For data visualization
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline

pd.options.plotting.backend = "plotly"
pd.options.display.max_colwidth=160

Neither PyTorch nor TensorFlow >= 2.0 have been found.Models won't be available and only tokenizers, configurationand file/data utilities can be used.


ImportError: cannot import name 'TFAutoModel' from 'transformers' (C:\ProgramData\Anaconda3\lib\site-packages\transformers\__init__.py)
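
The `ImportError` above appears to follow from the preceding warning: with neither TensorFlow nor PyTorch installed, the TF model classes cannot be imported. A minimal sketch of a pre-flight check using only the standard library; the helper name `available_backends` is hypothetical and not part of Transformers:

```python
import importlib.util

def available_backends():
    """Return the framework codes ('tf', 'pt') whose packages can be found."""
    candidates = {"tf": "tensorflow", "pt": "torch"}
    return [code for code, module in candidates.items()
            if importlib.util.find_spec(module) is not None]

backends = available_backends()
if "tf" not in backends:
    print("TensorFlow not found; install it before importing TFAutoModel.")
```

The codes match the values accepted by the `framework` argument of `pipeline`, which is set to `'tf'` later in this notebook.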

### HuggingFace installation check

Run the default pipeline to verify the installation, following the HuggingFace installation guide [here](https://huggingface.co/docs/transformers/installation).

In [3]:
pipeline('sentiment-analysis')('we love you')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998704195022583}]

### DistilBERT base uncased GoEmotions student model

The model is shared [here](https://huggingface.co/joeddav/distilbert-base-uncased-go-emotions-student). Below is a test run that plots the prediction scores for a single sentence.

In [5]:
classifier = pipeline("text-classification", model='joeddav/distilbert-base-uncased-go-emotions-student', framework='tf', return_all_scores=True)

predictions = classifier("I feel lucky to be here.")

df = pd.DataFrame(predictions[0])

df.plot(x='label', y='score', kind='bar')

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at joeddav/distilbert-base-uncased-go-emotions-student.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


  defaults = yaml.load(f)


### Utility Functions

#### Progress bar
Adapted from this Stack Overflow [answer](https://stackoverflow.com/a/46939639/9573439).

In [6]:
from IPython.display import HTML, display
import time

def progress(done, total):
  percent = int(100 * done // total)
  return HTML("""
      <span>
        Progress: 
        <progress
            value='{percent}'
            max='100'
            style='width: 50%'
        >
            {percent}
        </progress> {done}/{total} Complete
      </span>
  """.format(percent=percent, done=done, total=total))

In [7]:
out = display(progress(0, 100), display_id=True)
for i in range(0, 100, 1):
    time.sleep(0.01)
    out.update(progress(i+1, 100))
    if i == 50:
      break

#### Verdict

Return the label with the maximum probability among the predicted probabilities.

In [8]:
def verdict(predictions):
  # Return the label with the highest score ('' if predictions is empty)
  max_score = 0
  best_label = ''
  for d in predictions:
    if d['score'] > max_score:
      max_score = d['score']
      best_label = d['label']
  return best_label

print(verdict(predictions[0]))

relief
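
Since the classifier returns a score for every label, it can also help to look at the runners-up rather than only the maximum. A small sketch, assuming the same list-of-dicts shape as `predictions[0]`; the helper `top_labels` is hypothetical, and the scores in `sample` are made up for illustration and are not model output:

```python
def top_labels(predictions, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(predictions, key=lambda d: d["score"], reverse=True)
    return [(d["label"], d["score"]) for d in ranked[:k]]

# Made-up scores in the same shape as predictions[0]
sample = [
    {"label": "joy", "score": 0.20},
    {"label": "relief", "score": 0.35},
    {"label": "gratitude", "score": 0.30},
    {"label": "neutral", "score": 0.15},
]
print(top_labels(sample, k=2))
```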


#### Contraction Mapping

Dictionary containing valid English contractions, taken from Wikipedia.

In [10]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
                       "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", 
                       "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", 
                       "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am",
                       "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", 
                       "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                       "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not",
                       "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", 
                       "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
                       "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", 
                       "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have",
                       "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is",
                       "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would",
                       "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have",
                       "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
                       "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", 
                       "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did",
                       "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
                       "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", 
                       "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
                       "y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have",
                       "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have", 'u.s':'america', 'e.g':'for example'}

#### Text Processing

In [13]:
# Regex patterns.
url_regx             = r"(quick\s*links?\s*:\s*)*((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
user_regx            = r'@[^\s]+'
hashtag_regx         = r'#[^\s]+'
special_quotes_regx  = r'[’‘´`]+'
alpha_regx           = r"[^a-zA-Z0-9'\"\.\,\?\!\&\%\$\/\-+]"
valid_punc_regx      = r"([\"\.\,\?\!\&\%\$\/\-+])"
break_alphanum_regx  = r'([0-9]+)'
sequence_regx        = r"(.)\1\1+"
seq_replace_regx     = r"\1\1"
is_empty_str_regx    = r'.*[a-zA-Z0-9]+.*'
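
To make the two back-reference patterns concrete, here is a small sketch that applies only the punctuation and repeated-character rules. The patterns are copied from the cell above; the wrapper `collapse_repeats` is a hypothetical name used for illustration:

```python
import re

# Patterns copied from the cell above
valid_punc_regx  = r"([\"\.\,\?\!\&\%\$\/\-+])"
sequence_regx    = r"(.)\1\1+"
seq_replace_regx = r"\1\1"

def collapse_repeats(s):
    # Two or more repeats of the same punctuation mark become one
    s = re.sub(valid_punc_regx + r"\1+", r"\1", s)
    # Three or more repeats of any character become two
    return re.sub(sequence_regx, seq_replace_regx, s)

print(collapse_repeats("heeyyyy"))
print(collapse_repeats("meninsn?? yes please!!!!!"))
```

The `\1` back-reference is what restricts each match to repeats of the same character, so mixed runs such as `?!` are left alone.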

In [14]:
stop_words = set(nltk.corpus.stopwords.words('english'))

def process_text(s):
  # Remove links
  s = re.sub(url_regx, ' ', s)

  # Replace tabs with whitespace
  s = s.replace('\t', ' ')

  # Replace all emojis.
  s = emoji.replace_emoji(s, ' ')

  # Replace @USERNAME to ' '.
  s = re.sub(user_regx, ' ', s)

  # Replace #HASHTAG to ' '
  s = re.sub(hashtag_regx, ' ', s)

  # Replace special quotes
  s = re.sub(special_quotes_regx, "'", s)

  # Collapse 2 or more repeats of the same valid punctuation mark into 1
  s = re.sub(valid_punc_regx+r"\1+", r"\1", s)

  # Collapse 3 or more repeats of the same character into 2
  s = re.sub(sequence_regx, seq_replace_regx, s)

  # Replace everything except English letters, digits and valid punctuation with a space
  s = re.sub(alpha_regx, " ", s)

  # Put space between punctuations and letters/digits
  s = re.sub(valid_punc_regx, r" \1 ", s)

  # Break alphanumeric into words and numbers
  s = re.sub(break_alphanum_regx, r' \1 ', s)

  # If no alphabet/digits remain
  if re.match(is_empty_str_regx, s) is None:
    return ''

  # Tokenize
  # tokens = nltk.word_tokenize(s) # breaks contractions into 2 words
  tokens = s.split()

  valid_bag = []
  for w in tokens:
    # Remove stopwords
    # if w in stop_words:
    #     continue
    # Contraction mapping: replace the token with its expanded words
    if w in contraction_mapping:
      valid_bag.extend(contraction_mapping[w].split())
    else:
      valid_bag.append(w)

  # At least 2 words or 1 word and 1 punctuation
  if len(valid_bag) < 2:
    return ''
  
  return ' '.join(valid_bag)
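
The contraction step can be looked at in isolation. A minimal sketch, assuming whitespace-split tokens as in `process_text`, in which each matched token is replaced by the words of its expansion. `expand_contractions` and `sample_mapping` are illustrative names, and the three entries are copied from `contraction_mapping`:

```python
def expand_contractions(tokens, mapping):
    """Replace each token found in mapping by the words of its expansion."""
    expanded = []
    for w in tokens:
        if w in mapping:
            expanded.extend(mapping[w].split())
        else:
            expanded.append(w)
    return expanded

# A few entries taken from contraction_mapping above
sample_mapping = {"i'm": "i am", "don't": "do not", "they're": "they are"}
print(" ".join(expand_contractions("i'm sure they don't".split(), sample_mapping)))
```

Because the bios are lowercased during loading, only the lowercase keys of the mapping can match.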



### Limitations of the model


#### Emoji
This model doesn't work well with emojis or empty strings. When emojis are present, the label with the highest probability is often wrong. Replacing the emojis with their corresponding aliases doesn't improve the situation either.

One approach is to remove all of the emojis.
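
The notebook itself relies on `emoji.replace_emoji` for this. As a dependency-free illustration of the same idea, the sketch below drops characters in the Unicode category "Symbol, other" (`So`), which covers most emojis, along with two invisible characters that commonly accompany them. This is a rough approximation and an assumption on my part, not the notebook's method: it also removes symbols such as © and does not handle skin-tone modifiers. `strip_emojis` and `EMOJI_GLUE` are hypothetical names:

```python
import unicodedata

# Variation selector-16 and zero-width joiner often accompany emojis
EMOJI_GLUE = {"\ufe0f", "\u200d"}

def strip_emojis(s):
    """Drop 'Symbol, other' characters (which include most emojis) and their glue."""
    kept = [ch for ch in s
            if unicodedata.category(ch) != "So" and ch not in EMOJI_GLUE]
    # Collapse the whitespace left behind
    return " ".join("".join(kept).split())

examples = ["❤️ you", "meh 😒", "lol 😂"]
print([strip_emojis(s) for s in examples])
```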

In [15]:
examples = ["❤️ you", "meh 😒", "did not bring the charger 😢", "lol 😂", "can you help me with the loan? 😊"]
demojized = [emoji.demojize(s) for s in examples]
demojized_eng = ["heart you", "meh unamused", "did not bring the charger crying", "lol tears of joy", "can you help me with the loan? smiling"]
noemoji = ["you", "meh", "did not bring the charger", "lol", "can you help me with the loan?"]
# Interleave the four variants of each example: original, emoji removed,
# emoji as plain English, emoji as demojized alias
sentences = [s for variants in zip(examples, noemoji, demojized_eng, demojized) for s in variants]
df = pd.DataFrame({
    "sentence": sentences,
    "verdict": [verdict(classifier(s)[0]) for s in sentences]
})

display(df)

| | sentence | verdict |
|---|---|---|
| 0 | ❤️ you | admiration |
| 1 | you | realization |
| 2 | heart you | caring |
| 3 | :red_heart: you | caring |
| 4 | meh 😒 | amusement |
| 5 | meh | annoyance |
| 6 | meh unamused | disapproval |
| 7 | meh :unamused_face: | confusion |
| 8 | did not bring the charger 😢 | disappointment |
| 9 | did not bring the charger | disappointment |


#### Training Limitations
According to the model card for the DistilBERT model [here](https://huggingface.co/joeddav/distilbert-base-uncased-go-emotions-student), the model was trained using the zero-shot pipeline provided by HuggingFace on the unlabeled GoEmotions dataset. It uses the same classes as the GoEmotions dataset.

This model may underperform compared to a fully supervised model. The author shares no accuracy or F1 score.

### Pre-processing Bios

The file contains the user ids and bios of Twitter users. A suitable subset of the techniques below is used to clean and prepare the data.

#### Tweet Cleaning Strategy (in recommended order of execution)
1. **Lowercase** the bios.
2. **Duplicates** removed.
3. **Remove links** and any "Quick Links: " text.
4. **HTML** code check; remove if found in the dataset.
5. **Tabs** replaced with whitespace.
6. **Emoji** to text, ASCII emoji to text. Replace consecutive identical emojis with a single one. Remove emojis if accuracy suffers.
7. **Mentions** removed or replaced with the mask *USER*, depending on accuracy.
8. **Hashtags** converted to valid words if possible. If not, removed or replaced with the mask *HASHTAG*, depending on accuracy.
9. **Special quotation** marks replaced with proper ones.
10. **Contraction mapping** and short-form (u, lol etc.) expansion.
11. **Consecutive letters**: 3 or more replaced with only 2 (*heeyyyy* to *heeyy*).
12. **Acronyms** expansion.
13. **English letters, digits, valid punctuation** kept; everything else removed.
14. **Break alphanumeric words** by adding a space between letters and numbers (assuming a missing-space mistake).
15. **Stopwords** removed depending on the effect on accuracy (may be important for emotions?).
16. **Space between words and punctuation**. Must come after ASCII emoji to text is done and unnecessary ones are removed.
17. **Spelling correction** based on a valid dictionary.
18. **POS** generation, tokenization.
19. **Lemmatization** depending on accuracy.
20. **Remove multiple spaces**.
21. **Empty sentences** removed from the dataset.
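
Steps 7 and 8 mention masking as an alternative to removal. A minimal sketch of that variant, reusing the mention and hashtag patterns from the Text Processing cell; the function name `mask_entities` and its parameters are hypothetical, and whether masking helps accuracy is left open here:

```python
import re

# Same patterns as user_regx and hashtag_regx in the Text Processing cell
user_regx    = r'@[^\s]+'
hashtag_regx = r'#[^\s]+'

def mask_entities(s, user_mask="USER", hashtag_mask="HASHTAG"):
    """Replace mentions and hashtags with placeholder tokens."""
    s = re.sub(user_regx, user_mask, s)
    return re.sub(hashtag_regx, hashtag_mask, s)

print(mask_entities("3rd generation @yankees fan #ourvoicesmatter"))
```

Note that both patterns consume everything up to the next whitespace, so punctuation attached to a mention or hashtag is masked along with it.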

#### Mounting Google drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Checking file location

In [16]:
!head /content/drive/My\ Drive/users_bio.csv

413080213	#PaisleyBuddie @❤️luv my5brats NSSRP 👵🏼❤️@IrvineWelsh. ❤️. @SarahPinborough 📖🎬 #ImaGranny ❤️👵🏼❤️ #HiStitch ❤️💙🐾
493832011	We are the organizers of Regina's Queen City Pride Festival! Pride 2021 is from June 4 to 13!  Quick Links: https://t.co/Ae11sMU5mj  #QCPRIDE
2989319032	Meninsn?? Yes please!!!!!
1042385216	https://t.co/5B8iqsIMaH gives u #LGBTQ news from #Australia #Ireland #NewZealand #UK #USA #Scandinavia & the 🌎 4 the LGBTQ community & their families & friends
490149888	Info & photo ancient - not updating😅.3rd generation @Yankees fan, Consultant, Advocate, @OurVoicesNY #OurVoicesMatter
2209217329	
2547645997	Humanist, Musician, Writer. https://t.co/yTLXTc5x3e
144965060	Watch Gay Webcams on CameraBoys & LiveJasmin: https://t.co/EU3XNRneqe #livejasmin #cameraboys
18975463	Queer Art and Culture | Cabaret & Events with & for our LGBTIQ+ Community
105306151	she/her/elle  Community & political organiser on #Trans, #LGBT, #Housing, #Homelessness, #MentalHealth issues.


#### Processing dataset

There are line-ending issues in the input file. If it is read directly with pandas, multiple lines are treated as a single line, so the file has to be loaded manually.

**Step 1:** Since the file is read line by line anyway, empty bios are dropped and the bios are lowercased during loading.

In [14]:
ids = []
bios = []
with open('/content/drive/My Drive/users_bio.csv', mode='r') as fin:
  for l in fin:
    l = l.strip()
    parts = l.split('\t')

    # Empty string or only id and no bio
    if len(parts) < 2 or len(parts[1]) == 0:
      continue

    ids.append(parts[0])

    # Lowercase bios during load
    bios.append(parts[1].lower())
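
The same loading rules can be exercised without the Drive file. A small sketch that wraps them in a generator over any iterable of lines, assuming the tab-separated `id<TAB>bio` layout shown in the preview; `parse_bio_lines` is a hypothetical helper and the sample rows are shortened from that preview:

```python
def parse_bio_lines(lines):
    """Yield (id, lowercased bio) pairs, skipping rows without a bio."""
    for line in lines:
        parts = line.strip().split('\t')
        # Empty line, or an id with no bio after it
        if len(parts) < 2 or len(parts[1]) == 0:
            continue
        yield parts[0], parts[1].lower()

# Rows modeled on the file preview above
sample = [
    "2989319032\tMeninsn?? Yes please!!!!!\n",
    "2209217329\t\n",
    "2547645997\tHumanist, Musician, Writer.\n",
]
print(list(parse_bio_lines(sample)))
```

The id-only row is skipped because `strip()` removes the trailing tab, leaving a single field.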

**Step 2:** Remove duplicates.

In [15]:
df = pd.DataFrame({'ids': ids, 'bios': bios})

df.drop_duplicates(inplace=True)

df.dropna(inplace=True)

df['length'] = [len(s) for s in df['bios']]

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1767520 entries, 0 to 2892130
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   ids     object
 1   bios    object
 2   length  int64 
dtypes: int64(1), object(2)
memory usage: 53.9+ MB


In [16]:
df.head()

| | ids | bios | length |
|---|---|---|---|
| 0 | 413080213 | #paisleybuddie @❤️luv my5brats nssrp 👵🏼❤️@irvinewelsh. ❤️. @sarahpinborough 📖🎬 #imagranny ❤️👵🏼❤️ #histitch ❤️💙🐾 | 111 |
| 1 | 493832011 | we are the organizers of regina's queen city pride festival! pride 2021 is from june 4 to 13! quick links: https://t.co/ae11smu5mj #qcpride | 141 |
| 2 | 2989319032 | meninsn?? yes please!!!!! | 25 |
| 3 | 1042385216 | https://t.co/5b8iqsimah gives u #lgbtq news from #australia #ireland #newzealand #uk #usa #scandinavia & the 🌎 4 the lgbtq community & their families & friends | 159 |
| 4 | 490149888 | info & photo ancient - not updating😅.3rd generation @yankees fan, consultant, advocate, @ourvoicesny #ourvoicesmatter | 117 |


In [17]:
df[df.length == df.length.max()].head(1)

| | ids | bios | length |
|---|---|---|---|
| 235764 | 1214436673909489664 | https://t.co/m24cseqjvx https://t.co/edqsdzytb5 https://t.co/jhqyqlqmdj https://t.co/l5ja3v4rc8 https://t.co/tpd9htmvhv https://t.co/d5ieaky7cl https://... | 294 |


In [18]:
df[df.length == df.length.min()].head(1)

| | ids | bios | length |
|---|---|---|---|
| 137 | 96475659 | 😅 | 1 |


**Step 3, 5-16:** links, mentions, hashtags, emojis, special quotations, contraction mapping, consecutive letters, ~~acronyms~~, non-english/non-number removal, break alphanumeric words, ~~remove stopwords~~, space between words and punctuations, ~~spelling correction~~. 

Test run and Inspection

In [19]:
for s in df['bios'].to_list()[:5]:
  print(f'Before: {s}\n')
  print(f'After: {process_text(s)}\n\n')

Before: #paisleybuddie @❤️luv my5brats nssrp 👵🏼❤️@irvinewelsh. ❤️. @sarahpinborough 📖🎬 #imagranny ❤️👵🏼❤️ #histitch ❤️💙🐾

After: luv my 5 brats nssrp .


Before: we are the organizers of regina's queen city pride festival! pride 2021 is from june 4 to 13!  quick links: https://t.co/ae11smu5mj  #qcpride

After: we are the organizers of regina's queen city pride festival ! pride 2021 is from june 4 to 13 !


Before: meninsn?? yes please!!!!!

After: meninsn ? yes please !


Before: https://t.co/5b8iqsimah gives u #lgbtq news from #australia #ireland #newzealand #uk #usa #scandinavia & the 🌎 4 the lgbtq community & their families & friends

After: gives u news from & the 4 the lgbtq community & their families & friends


Before: info & photo ancient - not updating😅.3rd generation @yankees fan, consultant, advocate, @ourvoicesny #ourvoicesmatter

After: info & photo ancient - not updating . 3 rd generation fan , consultant , advocate ,




Final processing of all bios before saving them to a file.

In [20]:
df['processed'] = [process_text(s) for s in df['bios']]
df['length_processed'] = [len(s) for s in df['processed']]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1767520 entries, 0 to 2892130
Data columns (total 5 columns):
 #   Column            Dtype 
---  ------            ----- 
 0   ids               object
 1   bios              object
 2   length            int64 
 3   processed         object
 4   length_processed  int64 
dtypes: int64(2), object(3)
memory usage: 80.9+ MB


In [22]:
df.drop(df[df.length_processed < 2].index, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1670994 entries, 0 to 2892130
Data columns (total 5 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   ids               1670994 non-null  object
 1   bios              1670994 non-null  object
 2   length            1670994 non-null  int64 
 3   processed         1670994 non-null  object
 4   length_processed  1670994 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 76.5+ MB


In [23]:
df[df.length_processed == df.length_processed.max()].head(1)

| | ids | bios | length | processed | length_processed |
|---|---|---|---|---|---|
| 2732017 | 737706614 | di-de-di-da-di-de-do-do di-ba-di-de-do di-de-de-di-de-de-de-do-do-day-bi-di-do di-de-di-da-di-de-do-do di-ba-di-de-do di-de-de-di-de-de-de-do-do-day-bi-di-do | 157 | di - de - di - da - di - de - do - do di - ba - di - de - do di - de - de - di - de - de - de - do - do - day - bi - di - do di - de - di - da - di - de - d... | 249 |


In [24]:
df[df.length_processed == df.length_processed.min()].head(1)

| | ids | bios | length | processed | length_processed |
|---|---|---|---|---|---|
| 6216 | 3496040303 | i #writelgbtq #romance. #queer. #asd. #immigrant. #polyglot. #lbtq #writingcommunity - https://t.co/ksrroryjlg | 110 | i - | 3 |


#### Save Processed Bios to File

In [25]:
df.to_csv('/content/drive/My Drive/users_bio_processed.csv', sep='\t', columns=['ids', 'processed'], index=False)

### Predictions in Batch

In [17]:
df = pd.read_csv('/content/drive/My Drive/users_bio_processed.csv', sep='\t', dtype={"ids": "string", "processed": "string"})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670994 entries, 0 to 1670993
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   ids        1670994 non-null  string
 1   processed  1670994 non-null  string
dtypes: string(2)
memory usage: 25.5 MB


In [18]:
out_path = Path('/content/drive/My Drive/users_bio_distilbert_27.csv')
start_index = 0

if out_path.is_file():
  with open(out_path, mode='r') as fin:
    start_index = len(fin.readlines()) - 2

print(f'Start processing from: {start_index}')

Start processing from: 131530
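
Counting raw lines to find the resume point is sensitive to anything in the file that is not a finished row, such as the header, a blank line or a half-written last row. A more defensive sketch, assuming each finished row has an id, 28 scores and a verdict (30 comma-separated fields) and that ids are numeric; `count_completed_rows` is a hypothetical helper, not part of the notebook:

```python
from pathlib import Path

def count_completed_rows(path, n_fields=30):
    """Count rows that look complete: numeric id, n_fields fields, newline at the end."""
    path = Path(path)
    if not path.is_file():
        return 0
    completed = 0
    with open(path, mode="r") as fin:
        for line in fin:
            fields = line.rstrip("\n").split(",")
            # Skip the header, blank lines, marker lines and partial rows
            if (len(fields) == n_fields
                    and fields[0].strip().isdigit()
                    and line.endswith("\n")):
                completed += 1
    return completed
```

With this helper, `count_completed_rows(out_path)` would give the position of the first bio without a finished row, provided rows were written in order and without duplicates.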


In [None]:
with open(out_path, mode='a') as fout:
  if start_index == 0:
    # New file: write the header (ids, 28 label scores, verdict)
    fout.write('ids,admiration,amusement,anger,annoyance,approval,'+\
              'caring,confusion,curiosity,desire,disappointment,disapproval,'+\
              'disgust,embarrassment,excitement,fear,gratitude,grief,joy,love,'+\
              'nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral,verdict\n')
  else:
    # Mark the point where an interrupted run was resumed
    fout.write('\nRestarted process.\n')

  n = len(df['processed'])
  out = display(progress(start_index, n), display_id=True)

  for i in df['processed'].loc[start_index:].index:
    # A single input string yields a list holding one list of label/score dicts
    predictions = classifier(df['processed'][i])
    fout.write(f'{df["ids"][i]},')

    for probs in predictions:
      for d in probs:
        fout.write(f"{d['score']}, ")
      
      fout.write(f'{verdict(probs)}\n')
    
    out.update(progress(i+1, n))
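
The loop above sends one bio at a time to the classifier. HuggingFace pipelines also accept a list of strings and return one result per string, so the bios could be grouped into chunks. A minimal sketch of the chunking only; `iter_batches` is a hypothetical helper, and `fake_classifier` is a stand-in so the example runs without the model:

```python
def iter_batches(items, batch_size):
    """Yield (start_index, chunk) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def fake_classifier(texts):
    # Hypothetical stand-in for the pipeline: one list of scores per text
    return [[{"label": "neutral", "score": 1.0}] for _ in texts]

texts = ["a b", "c d", "e f", "g h", "i j"]
for start, chunk in iter_batches(texts, batch_size=2):
    results = fake_classifier(chunk)
    for offset, probs in enumerate(results):
        # start + offset recovers the position of the text in the full list
        print(start + offset, probs[0]["label"])
```

Keeping the start index alongside each chunk makes it possible to write the matching id next to every result and to resume from a known position.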


In [None]:
# drive.flush_and_unmount()