### Pre-processing Bios

The input file contains the user IDs and bios of Twitter users. The techniques listed below are applied selectively to clean and prepare the bios.

#### Tweet Cleaning Strategy (in recommended order of execution)
1. **Lowercase** the bios
2. **Duplicates** removed.
3. **Remove links** and any "Quick Links: " text.
4. **html** code check, remove if found in the dataset.
5. **Tabs** replaced with whitespace.
6. **Emoji** to text, ASCII emoji to text. Replace consecutive identical emoji with a single one. Remove them if accuracy suffers.
7. **Mentions** removed or replaced with the mask *USER*, depending on accuracy.
8. **Hashtags** converted to valid words if possible. If not, removed or replaced with the mask *HASHTAG*, depending on accuracy.
9. **Special quotation** marks replaced with proper ones.
10. **Contraction mapping** and short forms (u, lol etc.) expansion.
11. **Consecutive letters** if 3 or more then replaced with only 2 (*heeyyyy* to *heeyy*).
12. **Acronyms** expansion.
13. **English letters, digits, valid punctuations** kept, everything else removed. 
14. **Break Alphanumeric words** by adding space between letters and numbers (assuming missing space mistake).
15. **Stopwords** removed depending on the impact on accuracy (they may be important for emotions).
16. **Space between words and punctuation**. Must run after ASCII emoji are converted to text and unnecessary punctuation is removed.
17. **Spelling correction** based on valid dictionary.
18. **POS** generation, tokenization
19. **Lemmatization** depending on accuracy.
20. **Remove multiple spaces**.
21. **Empty sentences** removed from dataset.
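
Steps 11, 14 and 16 are plain regex substitutions. The following is a minimal sketch of how they could be written; the helper names (`squeeze_repeats`, `break_alphanumeric`, `space_punctuation`) are hypothetical and not part of this notebook, and the punctuation set is only an assumption about what counts as "valid".

```python
import re

def squeeze_repeats(text):
    """Step 11: collapse runs of 3 or more identical letters to 2."""
    return re.sub(r"([a-zA-Z])\1\1+", r"\1\1", text)

def break_alphanumeric(text):
    """Step 14: insert a space at every letter/digit boundary."""
    text = re.sub(r"([a-zA-Z])([0-9])", r"\1 \2", text)
    return re.sub(r"([0-9])([a-zA-Z])", r"\1 \2", text)

def space_punctuation(text):
    """Step 16: pad selected punctuation with spaces, then collapse extra spaces."""
    text = re.sub(r'([".,?!&%$/+-])', r" \1 ", text)
    return re.sub(r" {2,}", " ", text).strip()

print(squeeze_repeats("heeyyyy"))
print(break_alphanumeric("abc123def"))
print(space_punctuation("hi,there!"))
```

Running `space_punctuation` before ASCII emoji such as `:-)` are converted would split them apart, which is why the list places step 16 after the emoji step.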

In [1]:
import pandas as pd
from pathlib import Path
from ekphrasis.classes.segmenter import Segmenter
import re
import nltk
nltk.download("stopwords")
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import PorterStemmer
import emoji
from tqdm.auto import tqdm
from collections import defaultdict
import time

pd.options.plotting.backend = "plotly"
pd.options.display.max_colwidth=160

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shuvo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shuvo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\shuvo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Utility Functions

#### Contraction Mapping

A dictionary of valid English contractions and their expansions, taken from Wikipedia.

In [2]:
contraction_mapping = {
    "ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because",
    "could've": "could have", "couldn't": "could not", "didn't": "did not",  
    "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not",
    "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
    "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have",
    "I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have",
    "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have",
    "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will",
    "it'll've": "it will have", "it's": "it is", "let's": "let us", "ma'am": "madam",
    "mayn't": "may not", "might've": "might have","mightn't": "might not",
    "mightn't've": "might not have", "must've": "must have", "mustn't": "must not",
    "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have",
    "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
    "she'd": "she would", "she'd've": "she would have", "she'll": "she will", 
    "she'll've": "she will have", "she's": "she is", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
    "so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have",
    "that's": "that is", "there'd": "there would", "there'd've": "there would have",
    "there's": "there is", "here's": "here is","they'd": "they would",
    "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have",
    "they're": "they are", "they've": "they have", "to've": "to have",
    "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will",
    "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not",
    "what'll": "what will", "what'll've": "what will have", "what're": "what are",
    "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have",
    "where'd": "where did", "where's": "where is", "where've": "where have",
    "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
    "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have",
    "won't": "will not", "won't've": "will not have", "would've": "would have",
    "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
    "y'all'd": "you all would", "y'all'd've": "you all would have","y'all're": "you all are",
    "y'all've": "you all have","you'd": "you would", "you'd've": "you would have",
    "you'll": "you will", "you'll've": "you will have", "you're": "you are",
    "you've": "you have", 'u.s':'america', 'e.g':'for example'
}
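
A mapping like this is applied token by token with an exact, case-sensitive lookup. The sketch below illustrates the idea on a two-entry sample mapping; `expand_contractions` and `sample_mapping` are hypothetical names used only for illustration.

```python
def expand_contractions(text, mapping):
    """Replace each whitespace-separated token that is a key of `mapping`."""
    out = []
    for token in text.split():
        expansion = mapping.get(token)
        if expansion is None:
            out.append(token)
        else:
            # An expansion may contain several words.
            out.extend(expansion.split())
    return " ".join(out)

sample_mapping = {"won't": "will not", "i'm": "i am"}

print(expand_contractions("i'm sure it won't rain", sample_mapping))
# A token with attached punctuation is not a key, so it is left unchanged.
print(expand_contractions("it won't.", sample_mapping))
```

Because the lookup is exact, punctuation has to be separated or removed before this step, and in lowercased text only the lowercase keys can match.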

#### Text Processing

In [3]:
# Regex patterns.
url_regx             = r"(quick\s*links?\s*:\s*)*((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
user_regx            = r'@[^\s]+'
hashtag_regx         = r'#[^\s]+'
# Note: inside [...] the '|' is a literal pipe, so this class also matches '|'.
special_quotes_regx  = r'[’|‘|´|`]+'
alpha_regx           = r"[^a-zA-Z':_]"
valid_punc_regx      = r"([\"\.\,\?\!\&\%\$\/\-+])"
break_alphanum_regx  = r'([0-9]+)'
sequence_regx        = r"([a-zA-Z])\1\1+"
seq_replace_regx     = r"\1\1"
# is_empty_str_regx    = r'.*[a-zA-Z0-9]+.*'
is_empty_str_regx    = r'.*[a-zA-Z]+.*'
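
One regex detail is easy to miss in character classes: inside `[...]` the `|` is an ordinary character, not an alternation operator. The sketch below compares a class written with pipes against one without; the sample text and variable names are made up for illustration.

```python
import re

# '|' inside a character class is a literal pipe.
with_pipes = r"[’|‘|´|`]+"
without_pipes = r"[’‘´`]+"

text = "art | culture ‘quoted’"

# The pipe between the words is rewritten to an apostrophe as well.
print(re.sub(with_pipes, "'", text))
# Only the typographic quotes are rewritten.
print(re.sub(without_pipes, "'", text))
```

Which behavior is wanted depends on whether `|` separators in bios should be turned into apostrophes or handled by a later cleaning step.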

In [4]:
stop_words = set(nltk.corpus.stopwords.words('english'))
seg_tw = Segmenter(corpus="twitter")
lemmatizer = nltk.stem.WordNetLemmatizer()

tag_map = defaultdict(lambda : wordnet.NOUN)
tag_map['J'] = wordnet.ADJ
tag_map['V'] = wordnet.VERB
tag_map['R'] = wordnet.ADV

Reading twitter - 1grams ...
Reading twitter - 2grams ...




In [5]:
def process_text(s):
    # Remove links
    s = re.sub(url_regx, ' ', s)

    # Replace tabs with whitespace
    s = s.replace('\t', ' ')

    # Remove @USERNAME mentions.
    s = re.sub(user_regx, ' ', s)
    
    # Collapse runs of 3 or more identical letters to 2.
    s = re.sub(sequence_regx, seq_replace_regx, s)

    # Segment each #HASHTAG into words (e.g. #newzealand -> new zealand).
    tags = re.findall(hashtag_regx, s)
    for tag in tags:
        seg_tag = seg_tw.segment(tag[1:])
        s = s.replace(tag, seg_tag)

    # Replace special quotes
    s = re.sub(special_quotes_regx, "'", s)

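    # Convert emojis to ' :name: ' text so they survive the filter below.
    s = emoji.demojize(s, delimiters=(' :', ': '))

    # Replace everything except English letters, "'", ':' and '_' with a space.
    # Digits and punctuation are removed here as well.
    s = re.sub(alpha_regx, " ", s)

    # Convert the ':name:' codes back to emojis.
    s = emoji.emojize(s)

    # Remove ':' and '_' now that the emojize step is done.
    s = re.sub(r'[:_]', ' ', s)

    # Return an empty string if no English letters remain.
    if re.match(is_empty_str_regx, s) is None:
        return ''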
    
    
    # Tokenize
    # tokens = nltk.word_tokenize(s) # breaks contractions into 2 words
    tokens = s.split()

    valid_bag = []
    for w in tokens:
        # Remove stopwords
        # if w in stop_words:
            # continue
        
        # Contraction Mapping
        if w in contraction_mapping:
            valid_bag.extend(contraction_mapping[w].split())
            continue

        valid_bag.append(w)

    # Require at least 2 tokens (words or emojis).
    if len(valid_bag) < 2:
        return ''

    s = ' '.join(valid_bag)

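    return s

    # Optional POS-based lemmatization, currently disabled: the string literal
    # below is unreachable because it follows the return statement.
    """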
    valid_bag = []

    tokens = nltk.word_tokenize(s)

    for token, tag in nltk.pos_tag(tokens):
        lemma = lemmatizer.lemmatize(token, tag_map[tag[0]])
        # print(f'{token} - {tag[0]} - {tag_map[tag[0]]} - {lemma}')
        valid_bag.append(lemma)
  
    return ' '.join(valid_bag)
    """


#### Processing dataset

There are line-ending issues in the bios file. If it is read directly with pandas, multiple lines are treated as a single line, so the file has to be loaded manually.

**Step 1:** Since the file is read line by line anyway, empty bios are removed and the remaining bios are converted to lowercase during loading.

In [6]:
# Build the paths from parts so they also work outside Windows.
user_bios = Path('..') / 'TweetApp' / 'data' / 'all_user_bios.csv'
predictions = Path('..') / 'TweetApp' / 'data' / 'users_bio_distilbert.csv'

if not user_bios.is_file():
    raise FileNotFoundError(f"User bios file not found: {user_bios}")

if not predictions.is_file():
    raise FileNotFoundError(f"DistilBERT predicted bios not found: {predictions}")


Keep only the bios that DistilBERT predicted as one of the negative emotions listed below.

In [7]:
negative_ids = {}
negative_tags = {'anger', 'sadness', 'fear', 'surprise'}

with open(predictions, 'r') as f:
    for line in tqdm(f):
        parts = line.strip().split(',')
        # The first field is the user id, the last field is the predicted emotion.
        label = parts[-1]
        if label in negative_tags:
            negative_ids[parts[0]] = label

len(negative_ids)

1741916it [00:01, 911997.98it/s]


690001
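
The loop above reads the id from the first field and the label from the last field, which stays correct even if the text in between contains commas. A minimal sketch of the same idea, assuming each line has the form `id,<bio text>,label`; the helper name `parse_prediction_line` and the sample line are made up for illustration:

```python
def parse_prediction_line(line):
    """Return (user_id, label) from a line shaped like 'id,<bio text>,label'."""
    line = line.strip()
    # Everything before the first comma is the id.
    user_id, _, rest = line.partition(",")
    # Everything after the last comma is the label.
    _, _, label = rest.rpartition(",")
    return user_id, label

print(parse_prediction_line("42,hello, world,anger\n"))
```

If the label field could itself be quoted or contain commas, a proper CSV parser such as the standard `csv` module would be the safer choice.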

In [8]:
ids = []
bios = []
with open(user_bios, mode='r', encoding='utf-8') as fin:
    for l in fin:
        l = l.strip()
        parts = l.split('\t')

        # Empty string or only id and no bio
        if len(parts) < 2 or len(parts[1]) == 0:
            continue

        if parts[0] not in negative_ids:
            continue

        ids.append(parts[0])

        # Lowercase bios and remove extra spaces for duplicate bio detection
        l = parts[1].lower()
        l = re.sub(r'( )\1+', ' ', l).strip()
        bios.append(l)

len(bios)

1128830

**Step 2:** Duplicates removed

In [9]:
df = pd.DataFrame({'ids': ids, 'bios': bios})

df.drop_duplicates(inplace=True)

df.dropna(inplace=True)

df['length'] = [len(s) for s in df['bios']]

df.info()

# Same id but different bios
len(df[df['ids'].duplicated()])

<class 'pandas.core.frame.DataFrame'>
Int64Index: 691682 entries, 0 to 1128829
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   ids     691682 non-null  object
 1   bios    691682 non-null  object
 2   length  691682 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 21.1+ MB


1682

In [10]:
df.head()

| | ids | bios | length |
|---|---|---|---|
| 0 | 2989319032 | meninsn?? yes please!!!!! | 25 |
| 1 | 1042385216 | https://t.co/5b8iqsimah gives u #lgbtq news from #australia #ireland #newzealand #uk #usa #scandinavia & the 🌎 4 the lgbtq community & their families & friends | 159 |
| 2 | 144965060 | watch gay webcams on cameraboys & livejasmin: https://t.co/eu3xnrneqe #livejasmin #cameraboys | 93 |
| 3 | 18975463 | queer art and culture \| cabaret & events with & for our lgbtiq+ community | 73 |
| 4 | 635586076 | we believe in the power of social media,protest and boycotts to bring social justice and change.we report news other gay news media won't. | 138 |


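**Steps 3, 5-17:** links, tabs, mentions, hashtags, emojis, special quotation marks, contraction mapping, consecutive letters, ~~acronyms~~, removal of non-English characters (digits and punctuation are also removed by `alpha_regx`), ~~break alphanumeric words~~, ~~remove stopwords~~, ~~space between words and punctuation~~, ~~spelling correction~~. Struck-through steps are not applied by `process_text`.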

Test run and Inspection

In [11]:
s = "she / they🏳️‍🌈 , member of the gruesome twosome 😏 ❤️🔥 blm✊🏿"
print(process_text(s))

she they 🏳️‍🌈 member of the gruesome twosome 😏 ❤️ 🔥 blm ✊🏿


In [12]:
for s in df['bios'].head(5).to_list():
    print(f'Before: {s}\n')
    print(f'After: {process_text(s)}\n\n')

Before: meninsn?? yes please!!!!!

After: meninsn yes please


Before: https://t.co/5b8iqsimah gives u #lgbtq news from #australia #ireland #newzealand #uk #usa #scandinavia & the 🌎 4 the lgbtq community & their families & friends

After: gives u lgbtq news from australia ireland new zealand uk usa scandinavia the 🌎 the lgbtq community their families friends


Before: watch gay webcams on cameraboys & livejasmin: https://t.co/eu3xnrneqe #livejasmin #cameraboys

After: watch gay webcams on cameraboys livejasmin livejasmin camera boys


Before: queer art and culture | cabaret & events with & for our lgbtiq+ community

After: queer art and culture ' cabaret events with for our lgbtiq community


Before: we believe in the power of social media,protest and boycotts to bring social justice and change.we report news other gay news media won't.

After: we believe in the power of social media protest and boycotts to bring social justice and change we report news other gay news media will not




Final processing of the full dataset. The result is saved to a file at the end.

In [13]:
df['processed'] = [process_text(s) for s in tqdm(df['bios'])]
df['length_processed'] = [len(s) for s in df['processed']]
df.info()

100%|██████████| 691682/691682 [02:11<00:00, 5245.94it/s]


<class 'pandas.core.frame.DataFrame'>
Int64Index: 691682 entries, 0 to 1128829
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   ids               691682 non-null  object
 1   bios              691682 non-null  object
 2   length            691682 non-null  int64 
 3   processed         691682 non-null  object
 4   length_processed  691682 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 31.7+ MB


In [14]:
df.drop(df[df.length_processed < 2].index, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 639763 entries, 0 to 1128829
Data columns (total 5 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   ids               639763 non-null  object
 1   bios              639763 non-null  object
 2   length            639763 non-null  int64 
 3   processed         639763 non-null  object
 4   length_processed  639763 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 29.3+ MB


In [15]:
df[df.length_processed == df.length_processed.max()].head(1)

| | ids | bios | length | processed | length_processed |
|---|---|---|---|---|---|
| 217252 | 377727439 | 😎 20 \| gay \| make-up \| netflix \| stoner 🍁 ig:instamatty_ 🔥sc:snapamatty👇🏼👇🏼👇🏼👇🏼👇🏼👇🏼👇🏼👇🏼👇🏼my animations👇🏼👇🏼👇🏼👇🏼👇🏼 | 112 | 😎 ' gay ' make up ' netflix ' stoner 🍁 ig instamatty 🔥 sc snapamatty backhand index pointing down medium light skin tone backhand index pointing down medium... | 810 |


In [16]:
df[df.length_processed == df.length_processed.min()].head(1)

| | ids | bios | length | processed | length_processed |
|---|---|---|---|---|---|
| 4529 | 126450048 | a ㅆô†ɧعr ℱɪʀʂʈ 06-24-13 | 23 | a r | 3 |


#### Save Processed Bios to File

In [17]:
processed_bios = '..\\TweetApp\\data\\user_bio_processed_for_perspective.tsv'
df.to_csv(processed_bios, sep='\t', columns=['ids', 'processed'], index=False)