<div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#ff6f69; font-size:40px'>2. Data Cleaning and Preprocessing - Pipeline </h1>
</div>


<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Prior Requirements</h2>
</div>

Make sure the following are installed in your Python environment.

From the Anaconda Prompt (drop the leading `!` when running a command outside a notebook cell):
- `TextBlob` Package: `!pip install textblob`
- `spacy` Package: `!pip install spacy`
- Trained English pipeline for spaCy (loaded below as `en_core_web_sm`): `python -m spacy download en_core_web_sm`
- Consolidated Text Preprocessing package: `!pip install git+ssh://git@github.com/HwaiTengTeoh/pt.git`
- `emot` Package: `!pip install emot`
- Download `Emoji_Dict.p` from download link: https://drive.google.com/open?id=1G1vIkkbqPBYPKHcQ8qy0G2zkoab2Qv4v
- Download `Emoticon_Dict.p` from download link: https://drive.google.com/open?id=1HDpafp97gCl9xZTQWMgP2kKK_NuhENlE
- `Gensim` Package: `!pip install gensim`
- Spelling Check - `language-tool-python` Package: `!pip install language-tool-python` **(More precise)**
- Contraction to Expansion - `pycontractions` Package: `!pip install pycontractions` **(More precise)**
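
A missing package only surfaces later as a `ModuleNotFoundError`, so it can help to check the imports up front. The snippet below is a minimal sketch using only the standard library. The helper name `missing_modules` and the list of import names are assumptions made for illustration; note that an import name can differ from the pip package name (for example, `language-tool-python` is imported as `language_tool_python`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names (not pip names) assumed from the requirements listed above.
required = [
    "textblob",
    "spacy",
    "emot",
    "gensim",
    "language_tool_python",
    "pycontractions",
    "preprocess_text",
]

print("Missing modules:", missing_modules(required) or "none")
```

Anything reported as missing can then be installed before running the cells that follow.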


<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Import Libraries/ Modules</h2>
</div>

In [15]:
# Import Dependencies
%matplotlib inline

# Begin Python Imports
import datetime, warnings, scipy
warnings.filterwarnings("ignore")

# Data Manipulation
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')

# Progress bar
from tqdm.notebook import tqdm_notebook
from tqdm import tqdm
tqdm_notebook.pandas()

# Text Cleaning & Normalization
import re
import pickle
import spacy
import nltk
from emot.emo_unicode import UNICODE_EMOJI, EMOTICONS_EMO
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

nltk.download('averaged_perceptron_tagger')
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Acer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Import data</h2>
</div>

In [16]:
# Read AMiCA data 
amica_data = pd.read_csv('amica_data_toclean_version.csv', encoding='utf8')

# Check first 2 instances and last 2 instances
pd.concat([amica_data.head(2), amica_data.tail(2)])

Unnamed: 0.1,Unnamed: 0,tag,text,label,role,harmfulness_score,oth_language,file_index,word_count,char_count,avg_word_len,stopword_count,hashtag_count,mention_count,digit_counts,uppercase_count,emails_count,urls_count,punc_count,exclaimation_count,questionmark_count,pos,pos_ADJ_counts,pos_ADP_counts,pos_ADV_counts,pos_AUX_counts,pos_CCONJ_counts,pos_DET_counts,pos_NOUN_counts,pos_INTJ_counts,pos_NUM_counts,pos_PART_counts,pos_PRON_counts,pos_PROPN_counts,pos_PUNCT_counts,pos_SCONJ_counts,pos_SYM_counts,pos_VERB_counts,pos_other_counts,ner,ner_CARDINAL_counts,ner_DATE_counts,ner_EVENT_counts,ner_FAC_counts,ner_GPE_counts,ner_LANGUAGE_counts,ner_LAW_counts,ner_LOC_counts,ner_MONEY_counts,ner_NORP_counts,ner_ORDINAL_counts,ner_ORG_counts,ner_PERCENT_counts,ner_PERSON_counts,ner_PRODUCT_counts,ner_QUANTITY_counts,ner_TIME_counts,ner_WORK_OF_ART_counts,text_check,emoji_counts,emoticon_counts,term_absolute_counts,term_allness_counts,term_badword_counts,term_negation_counts,term_diminisher_counts,term_intensifier_counts
0,0,s.0.w.0,Oh My God. :x,Non-Cyberbullying,,0.0,0,xml_folder\Askfm_conversation_10000_main.xml,4,10,2.5,0,0,0,0,0,0,0,2,0,0,INTJ PRON PROPN PUNCT PUNCT,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,oh god seal lip wear brace tongue tie,0,0,0,0,1,0,0,0
1,1,s.0.w.0,opinion on Ross Golby?,Non-Cyberbullying,,0.0,0,xml_folder\Askfm_conversation_10002_main.xml,4,19,4.75,1,0,0,0,0,0,0,1,0,1,NOUN ADP PROPN PROPN PUNCT,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,PERSON,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,opinion ross colby,0,0,1,1,0,0,0,0
107104,107104,s.57.w.0,Like=15 likes,Non-Cyberbullying,,0.0,0,xml_folder\Askfm_conversation_9999_main.xml,2,12,6.0,0,0,0,0,0,0,0,1,0,0,PROPN VERB,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,PERSON,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,like like,0,0,0,0,0,0,0,0
107105,107105,s.58.w.0,no 5 likes for everyone,Non-Cyberbullying,,0.0,0,xml_folder\Askfm_conversation_9999_main.xml,5,19,3.8,3,0,0,1,0,0,0,0,0,0,DET NUM NOUN ADP PRON,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,like,0,0,2,2,0,1,0,0



<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Initial Dataset Exploration</h2>
</div>

In [17]:
# Check dimension of dataset
amica_data.shape
print("There are "+ str(amica_data.shape[0]) +" rows and "+ str(amica_data.shape[1]) +" columns from the AMiCA dataset.")

There are 107106 rows and 67 columns from the AMiCA dataset.


In [18]:
# Check column type
amica_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107106 entries, 0 to 107105
Data columns (total 67 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Unnamed: 0               107106 non-null  int64  
 1   tag                      107106 non-null  object 
 2   text                     107106 non-null  object 
 3   label                    107106 non-null  object 
 4   role                     107106 non-null  object 
 5   harmfulness_score        107106 non-null  float64
 6   oth_language             107106 non-null  int64  
 7   file_index               107106 non-null  object 
 8   word_count               107106 non-null  int64  
 9   char_count               107106 non-null  int64  
 10  avg_word_len             107106 non-null  float64
 11  stopword_count           107106 non-null  int64  
 12  hashtag_count            107106 non-null  int64  
 13  mention_count            107106 non-null  int64  
 14  digi

In [19]:
# Delete Unwanted column
amica_data.drop('Unnamed: 0', inplace=True, axis=1)

In [20]:
# Reset the index
amica_data=amica_data.reset_index(drop=True)

In [21]:
# Last check column type
amica_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107106 entries, 0 to 107105
Data columns (total 66 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   tag                      107106 non-null  object 
 1   text                     107106 non-null  object 
 2   label                    107106 non-null  object 
 3   role                     107106 non-null  object 
 4   harmfulness_score        107106 non-null  float64
 5   oth_language             107106 non-null  int64  
 6   file_index               107106 non-null  object 
 7   word_count               107106 non-null  int64  
 8   char_count               107106 non-null  int64  
 9   avg_word_len             107106 non-null  float64
 10  stopword_count           107106 non-null  int64  
 11  hashtag_count            107106 non-null  int64  
 12  mention_count            107106 non-null  int64  
 13  digit_counts             107106 non-null  int64  
 14  uppe

In [22]:
# Preview the text column
amica_data['text']

0                                             Oh My God. :x
1                                    opinion on Ross Golby?
2         dont know him, I just see him walking past jub...
3                                                 Dick size
4                                           you should know
                                ...                        
107101    Do you believe that playing is more important ...
107102                                     yeah I am with u
107103         I already ans that question just scroll down
107104                                        Like=15 likes
107105                              no 5 likes for everyone
Name: text, Length: 107106, dtype: object


<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>(Ignore) Handling of Missing Data - None</h2>
</div>

In [23]:
# Calculate the proportion of missing data

def checkMissing(data,perc=0):
    """ 
    Function that takes in a dataframe and prints the percentage
    of missing values for each column where it exceeds `perc`.
    """
    missing = [(i, data[i].isna().mean()*100) for i in data]
    missing = pd.DataFrame(missing, columns=["column_name", "percentage"])
    missing = missing[missing.percentage > perc]
    print(missing.sort_values("percentage", ascending=False).reset_index(drop=True))

print("Proportion of missing data in columns")
checkMissing(amica_data)

Proportion of missing data in columns
  column_name  percentage
0         ner   73.373107
1  text_check    0.004668



<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Text Preprocessing Pipeline </h2>
</div>


In [25]:
import preprocess_text as pt
import language_tool_python
from pycontractions.contractions import Contractions
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate
tool = language_tool_python.LanguageTool('en-US')
cont = Contractions(api_key="glove-twitter-100")

# Functions
def get_term_list(path):
    '''
    Function to import term list file
    '''
    word_list = []
    with open(path,"r") as f:
        for line in f:
            word = line.replace("\n","").strip()
            word_list.append(word)
    return word_list

def get_vocab(corpus):
    '''
    Function returns unique words in document corpus
    '''
    # vocab set
    unique_words = set()
    
    # looping through each document in corpus
    for document in tqdm(corpus):
        for word in document.split(" "):
            if len(word) > 2:
                unique_words.add(word)
    
    return unique_words

def create_profane_mapping(profane_words,vocabulary):
    '''
    Function creates a mapping between commonly found profane words and words in 
    document corpus 
    '''
    
    # mapping dictionary
    mapping_dict = dict()
    
    # looping through each profane word
    for profane in tqdm(profane_words):
        mapped_words = set()
        
        # looping through each word in vocab
        for word in vocabulary:
            # mapping only if ratio > 90
            try:
                if fuzz.ratio(profane,word) > 90:
                    mapped_words.add(word)
            except Exception:
                pass
                
        # list of all vocab words for given profane word
        mapping_dict[profane] = mapped_words
    
    return mapping_dict

def replace_words(corpus,mapping_dict):
    '''
    Function replaces obfuscated profane words using a mapping dictionary
    '''
    
    processed_corpus = []
    
    # iterating over each document in the corpus
    for document in tqdm(corpus):
        
        # splitting sentence to word
        comment = document.split()
        
        # iterating over mapping_dict
        for mapped_word,v in mapping_dict.items():
            
            # comparing target word to each comment word 
            for target_word in v:
                
                # each word in comment
                for i,word in enumerate(comment):
                    if word == target_word:
                        comment[i] = mapped_word
        
        # joining comment words
        document = " ".join(comment)
        document = document.strip()
                    
        processed_corpus.append(document)
        
    return processed_corpus

# Counts of term by category
countvec = CountVectorizer(ngram_range=(1,3))
def get_term_counts(x,category):
    
    # Split input text by unigram, bigram and trigram
    # as the keywords may span up to 3 words
    def get_ngram_text(x):
        
        try:
            countvec.fit_transform(x)
            # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
            text_list = countvec.get_feature_names_out()
            return text_list

        except ValueError:
            return [' '] # raised on an empty vocabulary, e.g. when the text contains only stop words
    
    # keep the n-grams that appear in the category list
    term_category = [t for t in get_ngram_text(x) if t in category]
    
    # return the number of distinct matching n-grams
    return len(term_category)


# Import external list, store as list
term_absolute_list = get_term_list("term_list/compiled_absolute.txt")
term_allness_list = get_term_list("term_list/compiled_allness.txt")
term_badword_list = get_term_list("term_list/compiled_badword.txt")
term_negation_list = get_term_list("term_list/compiled_negation.txt")
term_diminisher_list = get_term_list("term_list/compiled_diminisher.txt")
term_intensifier_list = get_term_list("term_list/compiled_intensifier.txt")

ModuleNotFoundError: No module named 'pycontractions'
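
The mapping built by `create_profane_mapping` and applied by `replace_words` can be illustrated on toy data. The sketch below is not the notebook's implementation: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (scores may differ slightly from fuzzywuzzy's), and the helper names `similarity`, `build_mapping` and `replace_variants` are hypothetical. It keeps the same `> 90` threshold, which shows how strict that cut-off is for short words.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score from 0 to 100, approximating fuzz.ratio with the standard library."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def build_mapping(terms, vocabulary, threshold=90):
    """Map each canonical term to the vocabulary words scoring above the threshold."""
    return {
        term: {word for word in vocabulary if similarity(term, word) > threshold}
        for term in terms
    }


def replace_variants(document, mapping):
    """Replace every mapped variant in a document with its canonical term."""
    lookup = {
        variant: term
        for term, variants in mapping.items()
        for variant in variants
    }
    return " ".join(lookup.get(word, word) for word in document.split())


# Toy data: one canonical term and three candidate vocabulary words.
mapping = build_mapping(["stupid"], {"stupidd", "stup1d", "student"})
print(mapping)
print(replace_variants("so stupidd and stup1d", mapping))
```

With a six-letter term, one extra character still clears the threshold, but a single substituted character does not, so variants such as `stup1d` are left untouched at this setting.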

In [37]:
###############################
# Text Preprocessing Pipeline #
###############################

def text_preprocessing_pipeline(df=amica_data,
                                textual_statistics=False,
                                remove_url=False,
                                remove_email=False,
                                remove_user_mention=False,
                                remove_html=False,
                                remove_space_single_char=False,
                                normalize_elongated_char=False,
                                normalize_emoji=False,
                                normalize_emoticon=False,
                                normalize_accented=False,
                                lower_case=False,
                                normalize_slang=False,
                                normalize_badterm=False,
                                spelling_check=False,
                                normalize_contraction=False,
                                term_list=False,
                                remove_numeric=False,
                                remove_stopword=False,
                                keep_pronoun=False,
                                remove_punctuation=False,
                                pos=False,
                                ner=False,
                                lemmatise=False
                               ):
    '''
    -------------
     Description
    -------------
    Function that compiles all preprocessing steps into a single call
    
    -----------
     Parameter
    -----------
    df: Data Frame
    textual_statistics: Boolean
    remove_url: Boolean
    remove_email: Boolean
    remove_user_mention: Boolean
    remove_html: Boolean
    remove_space_single_char: Boolean
    normalize_elongated_char: Boolean
    normalize_emoji: Boolean
    normalize_emoticon: Boolean
    normalize_accented: Boolean
    lower_case: Boolean
    normalize_slang: Boolean
    normalize_badterm: Boolean
    spelling_check: Boolean
    normalize_contraction: Boolean
    term_list: Boolean
    remove_numeric: Boolean
    remove_stopword: Boolean
    keep_pronoun: Boolean
    remove_punctuation: Boolean
    pos: Boolean
    ner: Boolean
    lemmatise: Boolean
    
    '''
    
    if textual_statistics:
        print('Developing textual statistics from original text')
        df['word_count'] = df['text'].progress_apply(lambda x: pt.get_wordcounts(x))
        df['char_count'] = df['text'].progress_apply(lambda x: pt.get_char_counts(x))
        df['avg_word_len'] = df['text'].progress_apply(lambda x: pt.get_avg_wordlength(x))
        df['stopword_count'] = df['text'].progress_apply(lambda x: pt.get_stopwords_counts(x))
        df['hashtag_count'] = df['text'].progress_apply(lambda x: pt.get_hashtag_counts(x))
        df['mention_count'] = df['text'].progress_apply(lambda x: pt.get_mention_counts(x))
        df['digit_counts'] = df['text'].progress_apply(lambda x: pt.get_digit_counts(x))
        df['uppercase_count'] = df['text'].progress_apply(lambda x: pt.get_uppercase_counts(x))
        df['emails_count'] = df['text'].progress_apply(lambda x: pt.get_emails(x))
        df['urls_count'] = df['text'].progress_apply(lambda x: pt.get_urls(x))
        df['punc_count'] = df['text'].progress_apply(lambda x: pt.get_punc_counts(x))
        df["exclaimation_count"] = df["text"].progress_apply(lambda x: x.count("!"))
        df["questionmark_count"] = df["text"].progress_apply(lambda x: x.count("?"))
    
    if pos:
        print('Text Preprocessing: Developing POS tag count')
        df["pos"] = df["text"].progress_apply(lambda x: pt.get_pos_tag(x))
        df["pos_ADJ_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="ADJ"))     #adjective
        df["pos_ADP_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="ADP"))     #adposition
        df["pos_ADV_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="ADV"))     #adverb
        df["pos_AUX_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="AUX"))     #auxiliary
        df["pos_CCONJ_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="CCONJ")) #coordinating conjunction
        df["pos_DET_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="DET"))     #determiner
        df["pos_NOUN_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="NOUN"))   #noun
        df["pos_INTJ_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="INTJ"))   #interjection
        df["pos_NUM_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="NUM"))     #numeral
        df["pos_PART_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="PART"))   #particle
        df["pos_PRON_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="PRON"))   #pronoun
        df["pos_PROPN_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="PROPN")) #proper noun
        df["pos_PUNCT_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="PUNCT")) #punctuation
        df["pos_SCONJ_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="SCONJ")) #subordinating conjunction
        df["pos_SYM_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="SYM"))     #symbol
        df["pos_VERB_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="VERB"))   #verb
        df["pos_other_counts"] = df["pos"].progress_apply(lambda x: pt.get_pos_tag_counts(x,pos_tag="X"))     #other
    
    if ner:
        print('Text Preprocessing: Developing NER tag count')
        df["ner"] = df["text"].progress_apply(lambda x: pt.get_ner(x))
        ner_lst = nlp.pipe_labels['ner']
        for ner_label in ner_lst:  # renamed so the loop does not shadow the `ner` flag
            df["ner_"+ ner_label +"_counts"] = df["ner"].apply(lambda x: pt.get_ner_counts(x,ner_label))
                
    if remove_url:
        print('Text Preprocessing: Remove URL')
        df['text_check'] = df['text'].progress_apply(lambda x: pt.remove_urls(x))
        
    if remove_email:
        print('Text Preprocessing: Remove email')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_emails(x))
        
    if remove_user_mention:
        print('Text Preprocessing: Remove user mention')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_mention(x))
    
    if remove_html:
        print('Text Preprocessing: Remove html element')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_html_tags(x))
        
    if remove_space_single_char:
        print('Text Preprocessing: Remove single space between single characters e.g F U C K')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_space_single_chars(x))
        
    if normalize_elongated_char:
        print('Text Preprocessing: Reduction of elongated characters')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_elongated_chars(x))
        
    if normalize_emoji:
        print('Text Preprocessing: Normalize and count emoji')
        df['emoji_counts'] = df['text_check'].progress_apply(lambda x: pt.get_emoji_counts(x))
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.convert_emojis(x))
        
        
    if normalize_emoticon:
        print('Text Preprocessing: Normalize and count emoticon')
        df['emoticon_counts'] = df['text_check'].progress_apply(lambda x: pt.get_emoticon_counts(x))
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.convert_emoticons(x))
        
        
    if normalize_accented:
        print('Text Preprocessing: Normalize accented character')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_accented_chars(x))
        
    if lower_case:
        print('Text Preprocessing: Convert to lower case')
        df['text_check'] = df['text_check'].progress_apply(lambda x: str(x).lower())
    
    if normalize_slang:
        print('Text Preprocessing: Normalize slang')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.slang_resolution(x))
        
    if normalize_badterm:
        print('Text Preprocessing: Replace obfuscated bad term')
        # unique words in vocab 
        unique_words = get_vocab(corpus= df['text_check'])
        
        # creating mapping dict 
        mapping_dict = create_profane_mapping(profane_words=term_badword_list,vocabulary=unique_words)
        
        df['text_check'] = replace_words(corpus=df['text_check'],
                                                 mapping_dict=mapping_dict)
        
    if spelling_check:
        print('Text Preprocessing: Spelling Check')
        df['text_check'] = df['text_check'].progress_apply(lambda x: tool.correct(x))
        tool.close()
        
    if normalize_contraction:
        print('Text Preprocessing: Contraction to Expansion')
        
        # Special handling to prevent code from taking forever to run.
        # The positional indices below are hard-coded and must exist in df.
        text_col = df.columns.get_loc('text_check')
        hardcode_clean_50702 = df['text_check'].iloc[50702].replace("'d"," would").replace("wasn't","was not").replace("wouldn't","would not").replace("'s"," is").replace("'m"," am")
        df.iloc[50702, text_col] = hardcode_clean_50702

        hardcode_clean_107720 = df['text_check'].iloc[107720].replace("'d"," would").replace("wasn't","was not").replace("wouldn't","would not")
        df.iloc[107720, text_col] = hardcode_clean_107720

        df['text_check'] = df['text_check'].progress_apply(lambda x: ''.join(list(cont.expand_texts([x], precise=True))))
    
    if term_list:
        print('Developing count features for terms by category')
        df['term_absolute_counts'] = df['text_check'].progress_apply(lambda x: get_term_counts([x],category=term_absolute_list))
        df['term_allness_counts'] = df['text_check'].progress_apply(lambda x: get_term_counts([x],category=term_allness_list))
        df['term_badword_counts'] = df['text_check'].progress_apply(lambda x: get_term_counts([x],category=term_badword_list))
        df['term_negation_counts'] = df['text_check'].progress_apply(lambda x: get_term_counts([x],category=term_negation_list))
        df['term_diminisher_counts'] = df['text_check'].progress_apply(lambda x: get_term_counts([x],category=term_diminisher_list))
        df['term_intensifier_counts'] = df['text_check'].progress_apply(lambda x: get_term_counts([x],category=term_intensifier_list))

    if remove_numeric: 
        print('Text Preprocessing: Remove numeric')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_numeric(x))
        
    if remove_punctuation:
        print('Text Preprocessing: Remove punctuations')
        df['text_check'] = df['text_check'].progress_apply(lambda x: pt.remove_special_chars(x))
        
    if remove_stopword:
        print('Text Preprocessing: Remove stopword')
        if keep_pronoun:
            print('Text Preprocessing: and, keep Pronoun')
        df["text_check"] = df["text_check"].progress_apply(lambda x: pt.remove_stopwords(x,keep_pronoun=keep_pronoun))
        
    # Remove multiple spaces
    print('Text Preprocessing: Remove multiple spaces')
    df['text_check'] = df['text_check'].progress_apply(lambda x: ' '.join(x.split()))
    
    if lemmatise:
        print('Text Preprocessing: Lemmatization')
        df["text_check"] = df["text_check"].progress_apply(lambda x: pt.make_base(x))
        
    # Make sure remove multiple spaces
    # df['text_check'] = df['text_check'].progress_apply(lambda x: ' '.join(x.split()))
    
    # Make sure lower case for all again
    df['text_check'] = df['text_check'].progress_apply(lambda x: str(x).lower())
    
    # Remove empty text after cleaning
    print('Last Step: Remove empty text after preprocessing. Done')
    df = df[~df['text_check'].isna()]
    df = df[df['text_check'] != '']
    df = df.reset_index(drop=True)
    
    return df
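
The `term_list` step relies on `get_term_counts`, which splits each text into unigrams, bigrams and trigrams and counts how many of them appear in a term list. The sketch below reproduces that idea in plain Python. It is an approximation rather than the notebook's code: the tokenisation (lowercasing, tokens of two or more word characters) is assumed to mirror `CountVectorizer`'s defaults, and the helper names `ngram_set` and `count_category_terms` are hypothetical.

```python
import re


def ngram_set(text, max_n=3):
    """Return the distinct 1- to max_n-grams of a text (lowercased, tokens of 2+ word characters)."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def count_category_terms(text, category):
    """Count the distinct n-grams of `text` that appear in the term list `category`."""
    return len(ngram_set(text) & set(category))


# Toy term list with single-word and multi-word entries.
toy_terms = ["never", "never ever", "give up", "always"]
print(count_category_terms("I will never ever give up", toy_terms))
```

Because the n-grams are collected into a set, a term that occurs several times in one text is counted once, which matches the behaviour of counting entries in a fitted vocabulary.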


<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Inspect Cleaning Process</h2>
</div>



<div class="alert alert-info" style="background-color:#ff6f69; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>Output: Preprocessed and Cleaned Data</h2>
</div>


In [23]:
amica_data_clean_with_stopword = text_preprocessing_pipeline(
                                    df=amica_data,
                                    textual_statistics=True,
                                    remove_url=True,
                                    remove_email=True,
                                    remove_user_mention=True,
                                    remove_html=True,
                                    remove_space_single_char=True,
                                    normalize_elongated_char=True,
                                    normalize_emoji=True,
                                    normalize_emoticon=True,
                                    normalize_accented=True,
                                    lower_case=True,
                                    normalize_slang=True,
                                    normalize_badterm=True,
                                    spelling_check=True,
                                    normalize_contraction=True,
                                    term_list=True,
                                    remove_numeric=True,
                                    remove_stopword=False, # Keep stopwords
                                    keep_pronoun=False,  # Ignored because remove_stopword=False
                                    remove_punctuation=True,
                                    pos=True,
                                    ner=True,
                                    lemmatise=True)


Developing textual statistics from original text

Text Preprocessing: Developing POS tag count

Text Preprocessing: Developing NER tag count

Text Preprocessing: Remove URL

Text Preprocessing: Remove email

Text Preprocessing: Remove user mention

Text Preprocessing: Remove html element

Text Preprocessing: Remove single spcae between single characters e.g F U C K

Text Preprocessing: Reduction of elongated characters

Text Preprocessing: Normalize and count emoji

Text Preprocessing: Normalize and count emoticon

Text Preprocessing: Normalize accented character

Text Preprocessing: Convert to lower case

Text Preprocessing: Normalize slang

Text Preprocessing: Replace obfuscated bad term

100%|██████████| 113694/113694 [00:00<00:00, 541277.61it/s]
100%|██████████| 1921/1921 [03:45<00:00,  8.53it/s]
100%|██████████| 113694/113694 [01:21<00:00, 1394.15it/s]

Text Preprocessing: Spelling Check

Text Preprocessing: Contraction to Expansion

At least one of the documents had no words that were in the vocabulary.

Developing Binary Features for existence of terms by category


  0%|          | 0/113694 [00:00<?, ?it/s]

  0%|          | 0/113694 [00:00<?, ?it/s]

  0%|          | 0/113694 [00:00<?, ?it/s]

  0%|          | 0/113694 [00:00<?, ?it/s]

  0%|          | 0/113694 [00:00<?, ?it/s]

  0%|          | 0/113694 [00:00<?, ?it/s]

Text Preprocessing: Remove numeric


  0%|          | 0/113694 [00:00<?, ?it/s]

Text Preprocessing: Remove punctuations


  0%|          | 0/113694 [00:00<?, ?it/s]

Text Preprocessing: Remove multiple spaces


  0%|          | 0/113694 [00:00<?, ?it/s]

Text Preprocessing: Lemmatization


  0%|          | 0/113694 [00:00<?, ?it/s]

  0%|          | 0/113694 [00:00<?, ?it/s]

Last Step: Remove empty text after preprocessing. Done
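The final pipeline step drops rows whose text became empty during cleaning. This is consistent with the progress bars, which show 113694 rows going into this run and 112249 rows in the later cells. The body of `text_preprocessing_pipeline` is not shown in this section, so the following is only a minimal sketch of the idea in plain Python; `drop_empty_text` and the sample records are hypothetical.

```python
# Hypothetical sample records; the real data lives in a pandas DataFrame.
rows = [
    {"id": 0, "text_check": "opinion ross colby"},
    {"id": 1, "text_check": ""},       # every token was removed
    {"id": 2, "text_check": "   "},    # only whitespace is left
    {"id": 3, "text_check": "like like"},
]


def drop_empty_text(records, column="text_check"):
    """Keep only the records whose text still has at least one non-space character."""
    return [r for r in records if r[column] and r[column].strip()]


cleaned = drop_empty_text(rows)
print(len(rows), "->", len(cleaned))
```

Whitespace-only strings are treated as empty here, on the assumption that a row with no remaining tokens carries no signal for later modelling.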


In [24]:
# Independent copies so that each stopword variant starts from the same cleaned data
amica_data_clean_with_stopword_base1 = amica_data_clean_with_stopword.copy()
amica_data_clean_with_stopword_base2 = amica_data_clean_with_stopword.copy()

In [25]:
amica_data_clean_no_stopword_pronoun = text_preprocessing_pipeline(
    df=amica_data_clean_with_stopword_base1,
    remove_stopword=True,  # Remove stopwords
    keep_pronoun=True)     # But keep pronouns

Text Preprocessing: Remove stopword
Text Preprocessing: and, keep Pronoun


  0%|          | 0/112249 [00:00<?, ?it/s]

Text Preprocessing: Remove multiple spaces


  0%|          | 0/112249 [00:00<?, ?it/s]

  0%|          | 0/112249 [00:00<?, ?it/s]

Last Step: Remove empty text after preprocessing. Done


In [26]:
amica_data_clean_no_stopword_all = text_preprocessing_pipeline(
    df=amica_data_clean_with_stopword_base2,
    remove_stopword=True,  # Remove all stopwords
    keep_pronoun=False)    # Including pronouns

Text Preprocessing: Remove stopword


  0%|          | 0/112249 [00:00<?, ?it/s]

Text Preprocessing: Remove multiple spaces


  0%|          | 0/112249 [00:00<?, ?it/s]

  0%|          | 0/112249 [00:00<?, ?it/s]

Last Step: Remove empty text after preprocessing. Done
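The two runs above differ only in `keep_pronoun`: with `keep_pronoun=True` pronouns survive stopword removal, and with `keep_pronoun=False` they are removed along with the other stopwords. The pipeline's internals are not shown here, so the sketch below only illustrates the behaviour. The word lists are tiny and hypothetical (they are not the lists the pipeline uses), and `remove_stopwords` is an illustrative helper.

```python
# Hypothetical, deliberately tiny word lists. NOT the lists used by the pipeline.
STOPWORDS = {"my", "or", "on", "do", "not", "him", "i", "just",
             "you", "should", "am", "with"}
PRONOUNS = {"i", "you", "him"}


def remove_stopwords(text, keep_pronoun=False):
    """Drop stopwords from whitespace-tokenized text, optionally keeping pronouns."""
    kept = []
    for token in text.split():
        is_stopword = token in STOPWORDS
        is_protected = keep_pronoun and token in PRONOUNS
        if is_stopword and not is_protected:
            continue  # discard this stopword
        kept.append(token)
    return " ".join(kept)


sample = "you should know"
with_pronoun = remove_stopwords(sample, keep_pronoun=True)
without_pronoun = remove_stopwords(sample, keep_pronoun=False)
print(with_pronoun)
print(without_pronoun)
```

Short texts that consist only of stopwords become empty after this step, which is why each run ends by removing empty rows again.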


In [27]:
# Save the three cleaned variants to CSV (the DataFrame index is written as the first column)
amica_data_clean_with_stopword.to_csv('amica_data_clean_with_stopword.csv')
amica_data_clean_no_stopword_pronoun.to_csv('amica_data_clean_no_stopword_pronoun.csv')
amica_data_clean_no_stopword_all.to_csv('amica_data_clean_no_stopword_all.csv')

<h1><center>- END Preprocessing and Cleaning -</center></h1>

In [28]:
# Quick count of the total number of words (stopwords kept)
amica_data_clean_with_stopword['wc'] = amica_data_clean_with_stopword['text_check'].progress_apply(pt.get_wordcounts)
amica_data_clean_with_stopword['wc'].sum()

  0%|          | 0/112249 [00:00<?, ?it/s]

1067429
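The total above is the sum of per-row word counts. The implementation of `pt.get_wordcounts` is not shown in this notebook, so the sketch below assumes a simple whitespace-based count; `count_words` is a hypothetical stand-in and may not match the package exactly.

```python
def count_words(text):
    """Count whitespace-separated tokens (assumed behaviour, not the package's code)."""
    return len(text.split())


# A few cleaned texts taken from the output of this notebook
texts = ["opinion on ross colby", "dick size", "you should know"]
total = sum(count_words(t) for t in texts)
print(total)
```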

In [29]:
amica_data_clean_with_stopword['text_check']

0            oh my god seal lip or wear brace or tongue tie
1                                     opinion on ross colby
2         do not know him i just see him walk past jubil...
3                                                 dick size
4                                           you should know
                                ...                        
112244                                   yeah i am with you
112245                                   where are you from
112246         i already and that question just scroll down
112247                                            like like
112248                                 no like for everyone
Name: text_check, Length: 112249, dtype: object

In [30]:
# Quick count of the total number of words (stopwords removed, pronouns kept)
amica_data_clean_no_stopword_pronoun['wc'] = amica_data_clean_no_stopword_pronoun['text_check'].progress_apply(pt.get_wordcounts)
amica_data_clean_no_stopword_pronoun['wc'].sum()

  0%|          | 0/109940 [00:00<?, ?it/s]

659654

In [31]:
amica_data_clean_no_stopword_pronoun['text_check']

0         oh god seal lip wear brace tongue tie
1                            opinion ross colby
2           know him i him walk past jubilee no
3                                     dick size
4                                      you know
                          ...                  
109935                               yeah i you
109936                                      you
109937                        i question scroll
109938                                like like
109939                            like everyone
Name: text_check, Length: 109940, dtype: object

In [32]:
# Quick count of the total number of words (all stopwords removed)
amica_data_clean_no_stopword_all['wc'] = amica_data_clean_no_stopword_all['text_check'].progress_apply(pt.get_wordcounts)
amica_data_clean_no_stopword_all['wc'].sum()

  0%|          | 0/107106 [00:00<?, ?it/s]

463837

In [33]:
amica_data_clean_no_stopword_all['text_check']

0         oh god seal lip wear brace tongue tie
1                            opinion ross colby
2                        know walk past jubilee
3                                     dick size
4                                          know
                          ...                  
107101            believe playing important win
107102                                     yeah
107103                          question scroll
107104                                like like
107105                                     like
Name: text_check, Length: 107106, dtype: object
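To compare the three variants, the row counts and word totals printed above can be put side by side. The sketch below hard-codes those figures from the cell outputs and computes how much of the stopword-inclusive baseline each variant retains; the dictionary layout and key names are illustrative only.

```python
# Row counts and word totals copied from the cell outputs above
variants = {
    "with_stopword":       {"rows": 112249, "words": 1067429},
    "no_stopword_pronoun": {"rows": 109940, "words": 659654},
    "no_stopword_all":     {"rows": 107106, "words": 463837},
}

baseline = variants["with_stopword"]
summary = {}
for name, stats in variants.items():
    summary[name] = {
        # Share of baseline rows and words that survive in this variant
        "rows_kept_pct": 100 * stats["rows"] / baseline["rows"],
        "words_kept_pct": 100 * stats["words"] / baseline["words"],
        "avg_words_per_row": stats["words"] / stats["rows"],
    }
    print(name,
          f"rows kept: {summary[name]['rows_kept_pct']:.1f}%",
          f"words kept: {summary[name]['words_kept_pct']:.1f}%",
          f"avg words/row: {summary[name]['avg_words_per_row']:.2f}")
```

Stopword removal shrinks the word total far more than the row count, since most rows keep at least one content word.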