### `preprocessing_and_embedding.ipynb`
This jupyter notebook contains data set cleaning, tokenization and extracting embeddings (sentence vectors) for a suicidal tweet classifier.

I tried to explain the important parts as much as I can.

For the tokenization and embedding parts, I used the [BERT](https://huggingface.co/docs/transformers/model_doc/bert) model (specifically `bert-base-uncased` model).

In [16]:
import re
import unicodedata
import string

import pandas as pd
import numpy as np
import contractions
import emoji

import torch
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import BertTokenizer, BertModel

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
stopword_list = stopwords.words('english')

In [10]:
df = pd.read_csv('../data/vader_processed.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,vader_sentiment_label,vader_score,tweet
0,0,0,-0.2699,"Wow, my dad yday: “you don’t take those stupid..."
1,1,0,-0.5995,what part of this was really harmfult of a lot...
2,2,1,0.3382,one of the ways I got through my #depression i...
3,3,0,-0.8643,see i wanna do one of them but they all say th...
4,4,0,-0.8316,IS IT clinical depression or is it the palpabl...


In [11]:
df = df.drop(['Unnamed: 0'], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22963 entries, 0 to 22962
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   vader_sentiment_label  22963 non-null  int64  
 1   vader_score            22963 non-null  float64
 2   tweet                  22963 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 538.3+ KB


### Couple of things I noticed after the examination of the trimmed data:
1) Most of the tweets contain hashtags, emojis, numbers, and symbols.
2) The overall emotion of the tweet depends on what kind of emoji(s) does it contains. For example, the tweet that contains '😞' emoji is more likely to be depressive.
3) Some of the punctiation marks are repetitive (i.e. '!!' or '??'). These marks could be valuable.
4) There are non-English tweets.
5) Most of the tweets contain tagged users (i.e. '@elon'). A column named 'mentions' in the original copy of the dataset provides the tagged users in the tweet however, not all of them appear in the column.
6) Some of the tweets contain links and hardcoded pictures (i.e. 'pic.twitter.com/tBhxLdatP8'). 

### Some valuable attributes will be saved for later use:
#### These attributes represent tweet's characteristic features. 
- Swearing or offensive word(s) [Integer] => How many are there?
- Tweet length [Integer] => this column will be calculated after removing links.
- Emoji attribute will be break down into two columns: pos_emoji [Boolean], neg_emoji [Boolean]
- URL/Link [Boolean]

### Tweets that contain depressive emojis are more likely to be depressive tweets.
- So, all of the emojis in the dataset must be gathered into a data structure for later analysis.
- Then, most common emojis must be found for categorization.
- Finally, tweets should be checked whether they contain a categorized emoji or not.

In [12]:
df_attributes = df.copy()
emoji_map = {}

def get_emoji_map(tweet, map):
    if type(tweet) == 'float64':
        print(tweet)
    for char in tweet:
        if emoji.is_emoji(char):
            if char not in map.keys():
                map[char] = [emoji.demojize(char), 1]
            else:
                map[char][1] += 1
                
df_attributes['tweet'].apply(get_emoji_map, map=emoji_map)
sorted_emoji_map = sorted(emoji_map.items(), key=lambda x:x[1][1], reverse=True)
sorted_emoji_map

[('😂', [':face_with_tears_of_joy:', 460]),
 ('😭', [':loudly_crying_face:', 342]),
 ('❤', [':red_heart:', 272]),
 ('💔', [':broken_heart:', 175]),
 ('😔', [':pensive_face:', 132]),
 ('🏻', [':light_skin_tone:', 127]),
 ('🤣', [':rolling_on_the_floor_laughing:', 125]),
 ('♀', [':female_sign:', 113]),
 ('🙏', [':folded_hands:', 107]),
 ('🏼', [':medium-light_skin_tone:', 105]),
 ('😍', [':smiling_face_with_heart-eyes:', 102]),
 ('💜', [':purple_heart:', 95]),
 ('🏾', [':medium-dark_skin_tone:', 91]),
 ('🙃', [':upside-down_face:', 88]),
 ('🏽', [':medium_skin_tone:', 88]),
 ('🤷', [':person_shrugging:', 87]),
 ('😊', [':smiling_face_with_smiling_eyes:', 76]),
 ('😩', [':weary_face:', 74]),
 ('🙄', [':face_with_rolling_eyes:', 72]),
 ('💕', [':two_hearts:', 71]),
 ('😢', [':crying_face:', 69]),
 ('♂', [':male_sign:', 68]),
 ('🤔', [':thinking_face:', 63]),
 ('🥺', [':pleading_face:', 61]),
 ('😞', [':disappointed_face:', 60]),
 ('💯', [':hundred_points:', 59]),
 ('🖤', [':black_heart:', 57]),
 ('🤦', [':person_f

In [13]:
# Here I exract positive and negative emojis using the analyzer
analyzer = SentimentIntensityAnalyzer()
emoji_df = pd.DataFrame(columns=['emoji', 'emoji_unicode', 'emoji_count', 'pos_score', 'neg_score', 'neu_score', 'compound_score'])

for key, val in emoji_map.items():
    emoji_df.loc[len(emoji_df.index)] = [key, val[0], val[1], analyzer.polarity_scores(key)['pos'], analyzer.polarity_scores(key)['neg'], analyzer.polarity_scores(key)['neu'], analyzer.polarity_scores(key)['compound']]

pos_df = emoji_df[emoji_df['compound_score'] > 0.27]
neg_df = emoji_df[emoji_df['compound_score'] < -0.27]

pos_emojis = pos_df['emoji_unicode'].tolist()
neg_emojis = neg_df['emoji_unicode'].tolist()

In [14]:
def check_url_link(tweet):
    sentence = tweet.split(' ')
    for word in sentence:
        if word.startswith('https:') or word.startswith('http:'): 
            return 1
    return 0

def check_pos_emoji(tweet):
    for char in tweet:
        if emoji.is_emoji(char) and char in pos_emojis:
            return 1
    return 0

def check_neg_emoji(tweet):
    for char in tweet:
        if emoji.is_emoji(char) and char in neg_emojis:
            return 1
    return 0

def get_tweet_length(tweet):
    sentence = tweet.split(' ')
    res = " "
    
    for word in sentence:
        if word.startswith('https:') or word.startswith('http:') or word.startswith('pic.twitter.com'):
            sentence.remove(word)
    return len(res.join(sentence))

def get_profanity_words(tweet):
    cleaned_tweet = re.sub(r'[^\w\s]', '', tweet)
    sentence = cleaned_tweet.split(' ') 
    profanity_wordlist = np.loadtxt('../data/profanity_wordlist.txt', dtype='str')
    count = 0
    
    for word in sentence:
        if word.lower() in profanity_wordlist:
            count += 1
    return count

In [15]:
tweet_num = 22913

tweet = df_attributes['tweet'][tweet_num]
vader_score = df_attributes['vader_sentiment_label'][tweet_num]

print(f"{tweet} - {vader_score}\n")
print(f"url link: {check_url_link(tweet)}")
print(f"positive emoji: {check_pos_emoji(tweet)}")
print(f"negative emoji: {check_neg_emoji(tweet)}")
print(f"tweet length: {get_tweet_length(tweet)}")
print(f"profanity word: {get_profanity_words(tweet)}")

I've been trying to figure out why my depression is so damn hard to shake today. I knew something was going on. - 0

url link: 0
positive emoji: 0
negative emoji: 0
tweet length: 111
profanity word: 1


In [109]:
df_attributes['tweet_length'] = df_attributes['tweet'].apply(get_tweet_length)
df_attributes['url_link'] = df_attributes['tweet'].apply(check_url_link)
df_attributes['pos_emoji'] = df_attributes['tweet'].apply(check_pos_emoji)
df_attributes['neg_emoji'] = df_attributes['tweet'].apply(check_neg_emoji)
df_attributes['profanity_word'] = df_attributes['tweet'].apply(get_profanity_words)

In [110]:
df_attributes

Unnamed: 0,vader_sentiment_label,vader_score,tweet,tweet_length,url_link,pos_emoji,neg_emoji,profanity_word
0,0,-0.2699,"Wow, my dad yday: “you don’t take those stupid...",278,0,0,0,0
1,0,-0.5995,what part of this was really harmfult of a lot...,274,0,0,0,0
2,1,0.3382,one of the ways I got through my #depression i...,208,0,0,0,0
3,0,-0.8643,see i wanna do one of them but they all say th...,114,0,0,0,0
4,0,-0.8316,IS IT clinical depression or is it the palpabl...,78,0,0,0,0
...,...,...,...,...,...,...,...,...
22958,0,-0.8126,CBD for depression? Nature works in mysterious...,116,1,0,0,0
22959,0,-0.5719,Depression is real,18,0,0,0,0
22960,0,-0.5060,Even though Tropical Depression Barry did not ...,245,1,0,0,0
22961,0,-0.7906,https://medtally.com/post/cluster-analysis-wi...,83,1,0,0,0


### Data Preprocessing
Data should be processed in a way that it can be vectorized for tokenization.

(Note: At this point, not so sure about removing hashtags. Maybe will delete later)

- All tweets converted into lowercase.
- URL(s) removed from tweets. Such as, "https...., pic.twitter.com/..."
- All contractions are expanded.
- Accented chars are converted into original forms. Such as, "á": "a"
- All emojis, mentions and digits are removed.
- All punctuation marks and special characters (i.e. £, $) are removed.
- All stopwords are removed.
- Finally, stemming is applied for a more robust data set.

In [111]:
df_clean = df_attributes.copy()

In [112]:
def to_lowercase(tweet):
    return tweet.lower()

In [113]:
# remove url links such as, 'https://...', 'http://...', or 'pic.twitter...'
def remove_url(tweet):
    tweet = re.sub(r'http\S+', '', tweet)
    tweet = re.sub(r'pic\.twitter\.com\S+', '', tweet)
    tweet = re.sub(r'www.+', '', tweet)
    tweet = tweet.replace(u'\xa0', u' ')
    
    return tweet.strip()

In [114]:
# this dictionary contains contractions and their expanded forms
contractions_dict = {
  "brb": "be right back",
  "btw": "by the way",
  "cant": "can not",
  "dont": "do not",
  "doesnt": "doesn ot",
  "didnt": "did not",
  "hasnt": "has not",
  "havent": "have not",
  "heres": "here is",
  "howd": "how did",
  "howve": "how have",  
  "hows": "how is",
  "id": "i would",
  "ive": "i have",
  "isnt": "is not",
  "itd": "it would",
  "itll": "it will",
  "its": "it iss",
  "kys": "kill yourself",  
  "lets": "let us", 
  "ngl": "not gonna lie",
  "omg": "oh my god",
  "omfg": "oh my fucking god",  
  "shes": "she is",
  "stfu": "shut the fuck up",  
  "thats": "that is",
  "theres": "there is",
  "theyd": "they would",
  "theyll": "they will",
  "theyre": "they are",
  "theyve": "they have",
  "thisll": "this will",  
  "uve": "you have",  
  "wasnt": "was not",
  "wed": "we would",
  "werent": "were not",
  "whatll": "what will",
  "what're": "what are",
  "whats": "what is",
  "whatve": "what have",
  "whens": "when is",
  "whered": "where would",
  "wheres": "where is",
  "whereve": "where have",
  "wholl": "who will",
  "who'll've": "who will have",
  "whos": "who is",
  "whove": "who have",
  "whys": "why is",
  "whyve": "why have",
  "will've": "will have",
  "wont": "will not",
  "wouldve": "would have",
  "wouldnt": "would not",
  "yall": "you all",
  "yalls": "you alls",
  "youd": "you would",
  "youll": "you will",
  "youllve": "you will have",
  "youre": "you are",
  "youve": "you have",
  "ain't": "am not",
  "aren't": "are not",
  "can't": "cannot",
  "can't've": "cannot have",
  "'cause": "because",
  "could've": "could have",
  "couldn't": "could not",
  "couldn't've": "could not have",
  "didn't": "did not",
  "doesn't": "does not",
  "don't": "do not",
  "hadn't": "had not",
  "hadn't've": "had not have",
  "hasn't": "has not",
  "haven't": "have not",
  "he'd": "he would",
  "he'd've": "he would have",
  "he'll": "he will",
  "he'll've": "he will have",
  "he's": "he is",
  "here's": "here is", 
  "how'd": "how did",
  "how'd'y": "how do you",
  "how've": "how have",  
  "how'll": "how will",
  "how's": "how is",
  "i'd": "i would",
  "i'd've": "i would have",
  "i'll": "i will",
  "i'll've": "i will have",
  "i'm": "i am",
  "i've": "i have",
  "isn't": "is not",
  "it'd": "it had",
  "it'd've": "it would have",
  "it'll": "it will",
  "it'll've": "it will have",
  "it's": "it is",
  "let's": "let us",
  "ma'am": "madam",
  "mayn't": "may not",
  "might've": "might have",
  "mightn't": "might not",
  "mightn't've": "might not have",
  "must've": "must have",
  "mustn't": "must not",
  "mustn't've": "must not have",
  "needn't": "need not",
  "needn't've": "need not have",
  "o'clock": "of the clock",
  "oughtn't": "ought not",
  "oughtn't've": "ought not have",
  "seen't": "see not",  
  "shan't": "shall not",
  "sha'n't": "shall not",
  "shan't've": "shall not have",
  "she'd": "she would",
  "she'd've": "she would have",
  "she'll": "she will",
  "she'll've": "she will have",
  "she's": "she is",
  "should've": "should have",
  "shouldn't": "should not",
  "shouldn't've": "should not have",
  "so've": "so have",
  "so's": "so is",
  "that'd": "that would",
  "that'd've": "that would have",
  "that'll": "that will", 
  "that's": "that is",
  "there'd": "there had",
  "there'd've": "there would have",
  "there's": "there is",
  "they'd": "they would",
  "they'd've": "they would have",
  "they'll": "they will",
  "they'll've": "they will have",
  "they're": "they are",
  "they've": "they have",
  "this'll": "this will",  
  "to've": "to have",
  "u've": "you have",  
  "wasn't": "was not",
  "we'd": "we had",
  "we'd've": "we would have",
  "we'll": "we will",
  "we'll've": "we will have",
  "we're": "we are",
  "we've": "we have",
  "weren't": "were not",
  "what'll": "what will",
  "what'll've": "what will have",
  "what're": "what are",
  "what's": "what is",
  "what've": "what have",
  "when's": "when is",
  "when've": "when have",
  "where'd": "where did",
  "where's": "where is",
  "where've": "where have",
  "who'll": "who will",
  "who'll've": "who will have",
  "who's": "who is",
  "who've": "who have",
  "why's": "why is",
  "why've": "why have",
  "will've": "will have",
  "won't": "will not",
  "won't've": "will not have",
  "would've": "would have",
  "wouldn't": "would not",
  "wouldn't've": "would not have",
  "ya'll": "you all",
  "y'all": "you all",
  "y'alls": "you alls",
  "y'all'd": "you all would",
  "y'all'd've": "you all would have",
  "y'all're": "you all are",
  "y'all've": "you all have",
  "yday": "yesterday",
  "you'd": "you had",
  "you'd've": "you would have",
  "you'll": "you will",
  "you'll've": "you will have",
  "you're": "you are",
  "you've": "you have"
}
"""
    create a regular expression pattern using 'contractions_dict'
    this pattern is used to identify contractions in the given input string
"""
contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))

"""
    replaces all occurrences of contractions in the input string with their expanded forms
    'replace' function is used as the replacement function, and it's applied to each match found by the regular expression
"""
def expand_contractions(tweet, contractions_dict=contractions_dict):
    # takes a 'match' object and returns corresponded expansion form
    tweet = tweet.replace("’", "'")
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, tweet)

In [115]:
#tweet = "Wow, my dad yday: “you don’t take those stupid depression drugs anymore though, do you? Because they’re the absolute worst thing [and there is never a need for them]!”  Ain’t it great when your own family is so supportive? My mom’s and sister’s stance on this is similar, btw..."
#expand_contractions(tweet)

In [116]:
"""
    convert accent characters into standard ASCII characters
    Such as: résumé, café, prótest, divorcé => resume, cafe, protest, divorce
"""
def convert_accented_chars(tweet):
    return unicodedata.normalize('NFKD', tweet).encode('ascii', 'ignore').decode('utf-8', 'ignore')

In [117]:
def remove_emojis(tweet):
    return ''.join(char for char in tweet if not emoji.is_emoji(char))

In [118]:
# remove mentions (and tags ?)
def remove_mentions(tweet):
    return re.sub(r'@\S*', '', tweet)

In [119]:
# remove digits
def remove_digits(tweet):
    return ''.join(char for char in tweet if not char.isdigit())

In [120]:
def remove_special_characters(tweet):
    pattern = r'[^a-zA-z0-9.,!?/:;\"\'\s]' 
    return re.sub(pattern, '', tweet)

In [121]:
def remove_punctuation(tweet):
    return ''.join([char for char in tweet if char not in string.punctuation])

In [122]:
def remove_stopwords(tweet):
    word_tokens = nltk.word_tokenize(tweet) 
    filtered_sentence = [w for w in word_tokens if not w in stopword_list]
    tweet = ' '.join(filtered_sentence)
    return tweet

#remove_stopwords('wow my dad yday you do not take those stupi would depression drugs anymore though do you because they are the absolute worst thing and there is never a need for them  am not it great when your own family is so supporti have my moms and sisters stance on this is similar by the way')

In [123]:
df_clean['tweet'] = df_clean['tweet'].apply(to_lowercase)
df_clean['tweet'] = df_clean['tweet'].apply(remove_url)
df_clean['tweet'] = df_clean['tweet'].apply(expand_contractions)
df_clean['tweet'] = df_clean['tweet'].apply(convert_accented_chars)
df_clean['tweet'] = df_clean['tweet'].apply(remove_emojis)
df_clean['tweet'] = df_clean['tweet'].apply(remove_mentions)
df_clean['tweet'] = df_clean['tweet'].apply(remove_digits)
df_clean['tweet'] = df_clean['tweet'].apply(remove_special_characters)
df_clean['tweet'] = df_clean['tweet'].apply(remove_punctuation)
df_clean['tweet'] = df_clean['tweet'].apply(remove_stopwords)

#df_clean.to_csv('test.csv')

In [124]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22963 entries, 0 to 22962
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   vader_sentiment_label  22963 non-null  int64  
 1   vader_score            22963 non-null  float64
 2   tweet                  22963 non-null  object 
 3   tweet_length           22963 non-null  int64  
 4   url_link               22963 non-null  int64  
 5   pos_emoji              22963 non-null  int64  
 6   neg_emoji              22963 non-null  int64  
 7   profanity_word         22963 non-null  int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 1.4+ MB


### Some tweets remained as an empty string after unnecessary parts were removed.
Such as:
    "💜💜💜 @NewYorkTimes" => " "

These rows must be removed as well.

In [125]:
df_clean[df_clean['tweet'].str.len() == 0]

Unnamed: 0,vader_sentiment_label,vader_score,tweet,tweet_length,url_link,pos_emoji,neg_emoji,profanity_word
1281,0,0.0000,,2,0,0,0,0
1447,1,0.7845,,7,0,0,0,0
1450,0,0.0000,,0,0,0,0,0
1723,1,0.6369,,3,0,0,0,0
1732,0,0.0000,,132,0,0,0,0
...,...,...,...,...,...,...,...,...
22650,0,0.0000,,0,0,0,0,0
22708,0,0.0000,,0,0,0,0,0
22731,0,0.0000,,0,0,0,0,0
22779,1,0.4939,,2,0,0,0,0


In [126]:
# remove rows where 'tweet' column is empty
df_clean = df_clean[df_clean['tweet'].str.len() > 0]

len(df_clean.index)

22830

After cleaning the text, the text should be processed with Stemmer before tokenization.

In [127]:
stemmer = PorterStemmer()

def get_stem(tweet, stemmer=stemmer):
    return ' '.join([stemmer.stem(word) for word in tweet.split()])

In [128]:
df_clean['tweet'] = df_clean['tweet'].apply(get_stem) 
#df_clean.to_csv('test.csv')

## Tokenization and Embedding Extraction with BERT
The text is cleaned and should be tokenized in a way thay the embeddings can be extracted for later training.

In [129]:
df_bert = df_clean.copy()
df_bert.head()

Unnamed: 0,vader_sentiment_label,vader_score,tweet,tweet_length,url_link,pos_emoji,neg_emoji,profanity_word
0,0,-0.2699,wow dad yesterday take stupi would depress dru...,278,0,0,0,0
1,0,-0.5995,part realli harmfult lot peopl went everi gui ...,274,0,0,0,0
2,1,0.3382,one way got depress learn danc rain sourc stre...,208,0,0,0,0
3,0,-0.8643,see wan na one say ptsd depress andor anxieti ...,114,0,0,0,0
4,0,-0.8316,clinic depress palpabl hopeless gener,78,0,0,0,0


### Tokenization

In [130]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Returns tokens of the tweet, and tensors of the tokens and segment ids
def get_tokenization(tweet, tokenizer=tokenizer):
    marked_text = "[CLS] " + tweet + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)

    # Map the token strings to their vocabulary indeces.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    # segments_ids is used as an indicator of multiple sentences.
    # We're taking only one sentence here, so segments_ids will be filled up with only '1's
    # Mark each of the tokens as belonging to sentence "1"
    segments_ids = [1] * len(tokenized_text)

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.tensor([segments_ids])

    return tokenized_text, tokens_tensor, segments_tensor

In [131]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

### Extracting Embeddings - Sentence Vectors

Understanding what `extract_sentence_embeddings` does:
- It predict hidden states features for each layer.
- `hidden_states` object has four dimensions, in the following order:
- 1) The layer number (13 = 12 + input embeddings)
  2) The batch number (1 sentence)
  3) The word/token number (depends on the given `tokens_tensor`)
  4) The hidden unit/feature number (768 features)
- Then, average the second to last hiden layer of each token producing a single 768 length vector to get a single vector for the entire sentence.

In [132]:
"""
    Side note: torch.no_grad tells PyTorch not to construct the compute graph 
    during this forward pass (since we won’t be running backprop here).
    This just reduces memory consumption and speeds things up a little.
"""

def extract_sentence_embeddings(tokens_tensor, segments_tensor, model=model):
    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensor)
       
        # `hidden_states` has shape [13 x 1 x 22 x 768]
        hidden_states = outputs[2]

        # `token_vecs` is a tensor with shape [22 x 768]
        token_vecs = hidden_states[-2][0]

        # Calculate the average of all 22 token vectors.
        sentence_embedding = torch.mean(token_vecs, dim=0)
        
    return sentence_embedding

Let's test the tokenization and extraction of sentence embeddings with the first tweet in our cleaned data set.  

In [133]:
tweet = df_bert['tweet'][0]
tweet

'wow dad yesterday take stupi would depress drug anymor though absolut worst thing never need great famili supporti mom sister stanc similar way'

In [134]:
tokenized_text, tokens_tensor, segments_tensor = get_tokenization(tweet)
tokenized_text, tokens_tensor, segments_tensor

(['[CLS]',
  'wow',
  'dad',
  'yesterday',
  'take',
  'stu',
  '##pi',
  'would',
  'de',
  '##press',
  'drug',
  'any',
  '##mo',
  '##r',
  'though',
  'abs',
  '##ol',
  '##ut',
  'worst',
  'thing',
  'never',
  'need',
  'great',
  'fa',
  '##mi',
  '##li',
  'support',
  '##i',
  'mom',
  'sister',
  'stan',
  '##c',
  'similar',
  'way',
  '[SEP]'],
 tensor([[  101, 10166,  3611,  7483,  2202, 24646,  8197,  2052,  2139, 20110,
           4319,  2151,  5302,  2099,  2295, 14689,  4747,  4904,  5409,  2518,
           2196,  2342,  2307,  6904,  4328,  3669,  2490,  2072,  3566,  2905,
           9761,  2278,  2714,  2126,   102]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))

In [135]:
sentence_embedding = extract_sentence_embeddings(tokens_tensor, segments_tensor)
sentence_embedding

tensor([-7.8889e-01,  3.2795e-01,  9.7737e-01, -3.5318e-01,  2.9939e-01,
        -4.3548e-01,  6.1954e-01,  6.1555e-01, -2.1930e-01, -6.4830e-01,
         6.3507e-01, -5.2176e-01,  7.4387e-01,  8.0955e-01, -1.4502e+00,
         3.5829e-01,  5.1148e-01,  6.6309e-01, -1.0264e-01, -2.3209e-01,
         8.7779e-01, -3.3429e-01, -1.0678e-01,  1.6091e-01,  3.2531e-02,
        -4.5209e-01,  3.9826e-01,  4.1566e-02, -6.6612e-03,  1.5509e-01,
         3.1926e-01,  2.1097e-01, -3.9433e-01, -3.6388e-01,  8.4692e-02,
        -2.0377e-01,  1.4267e-01,  9.0458e-01, -1.5283e-01,  2.7483e-01,
         1.7792e-01, -6.7930e-01, -4.8960e-02,  1.0906e-01, -6.2476e-01,
        -5.9140e-02, -1.2102e-01, -2.2941e-01,  4.8750e-02,  1.4828e-01,
        -1.9536e-01, -5.0544e-01, -6.5010e-01, -4.8235e-01, -8.1248e-02,
         4.1584e-01,  2.3588e-01, -4.8958e-01,  1.2158e-01, -2.8020e-01,
         4.9251e-02,  1.1793e-01,  3.6503e-01, -2.1966e-01, -2.4879e-02,
        -2.4909e-01, -1.1744e-01,  1.4650e-01, -4.3

In [136]:
type(sentence_embedding)

torch.Tensor

In [137]:
# Note: At this point, I have to get the vectors and relevant attributes per tweet.
# I will probably use a tuple for this:

# => [(tensor, tweet_length, url_link , pos_emoji , neg_emoji, profanity_word, class)]
