In [194]:
import pandas as pd
import wordninja
import re
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')
# WordNetLemmatizer needs the WordNet corpus
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/summerai/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [210]:
data = pd.read_csv('fulltext_cleaned_0122.csv')

In [211]:
data 

Unnamed: 0,Index_Unknown,uid,url,keywords,full_text,classifiers
0,5,380117,https://www.westernjournal.com/trump-right-jud...,coronavirus|covid|qanon|biden|voterfraud|antii...,"Michigan Secretary of State Jocelyn Benson, pi...",voterfraud
1,13,448712,https://pjmedia.com/news-and-politics/victoria...,antilatinx|whitesupremacy|antiblack,"Joe Hall is a former Marine and handyman, who ...",whitesupremacy
2,15,256646,https://www.breitbart.com/politics/2020/12/23/...,biden|antilatinx|antiimmigrant|coronavirus,PoliticsEntertainmentMediaEconomyWorldLondo...,antilatinx
3,15,256646,https://www.breitbart.com/politics/2020/12/23/...,biden|antilatinx|antiimmigrant|coronavirus,PoliticsEntertainmentMediaEconomyWorldLondo...,biden
4,21,406930,https://www.theepochtimes.com/supreme-court-ju...,bigtech|qanon|qanon,"Thomas, considered a conservative on the high ...",bigtech
...,...,...,...,...,...,...
119919,911692,911560,https://hannity.com/media-room/not-so-fast-ari...,qanon|biden|voterfraud|biden|biden|qanon|biden...,,presidentbiden
119920,911739,911656,https://www.nytimes.com/2021/03/26/opinion/ezr...,coronavirus|disinformation|whitesupremacy|anti...,SectionsSEARCHSkip to contentSkip to site inde...,coronavirus
119921,911739,911656,https://www.nytimes.com/2021/03/26/opinion/ezr...,coronavirus|disinformation|whitesupremacy|anti...,SectionsSEARCHSkip to contentSkip to site inde...,bigtech
119922,911789,911757,https://www.theguardian.com/us-news/2021/apr/0...,coronavirus|covid|antiblack|qanon|whitesuprema...,Skip to main contentSkip to navigationAdvertis...,whitesupremacy


### Assumptions
<br> 1. Irrelevant text, such as category names and sublinks, has been removed.
<br> 2. Links in the text have been removed, so tokens like jpg don't appear unless they are part of the article.
<br> 3. White space is appropriately preserved between consecutive sentences, e.g. there are no cases like sentence1.sentence2
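Assumption 3 can be spot-checked before cleaning. The following is a minimal sketch, assuming that a missing space between sentences looks like a lowercase letter, a period, and then an uppercase letter. The helper name `count_glued_sentences` is hypothetical, and the heuristic may also flag some abbreviations or domain names.

```python
import re

# Hypothetical heuristic: lowercase letter + period + uppercase letter,
# e.g. "Mexico.Biden". This only approximates a missing space.
GLUED_SENTENCE = re.compile(r"[a-z]\.[A-Z]")

def count_glued_sentences(text):
    """Count places where two sentences appear to be joined without a space."""
    return len(GLUED_SENTENCE.findall(text))

sample = "tests for foreign nationals in Mexico.Biden said, Congress did its job."
print(count_glued_sentences(sample))
```

The sample string is taken from the article used in the demo, so the assumption does not hold for every row.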

### Solution
<br> 0. Remove rows whose full text is null.
<br> 1. Replace all special characters with a white space, then collapse runs of white space into a single one.
<br> 2. Split concatenated words with wordninja, skipping tokens that wordninja does not split well, e.g. n95.
<br> 3. Lemmatize.
<br> 4. Convert the full text to lower case.
<br> 5. Remove stop words.

In [212]:
# step 0
data = data[data['full_text'].notnull()].reset_index(drop=True)

### Demo
This demo shows how each step affects a single article

In [262]:
example = data.iloc[2,-2]
example

'   PoliticsEntertainmentMediaEconomyWorldLondon / EuropeBorder / Cartel ChroniclesIsrael / Middle EastAfricaAsiaLatin AmericaAll WorldVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresB Inspired    BREITBART    PoliticsEntertainmentMediaEconomyWorldLondon / EuropeBorder / Cartel ChroniclesIsrael / Middle EastAfricaAsiaLatin AmericaWorld NewsVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresPodcastsBreitbart News DailyB InspiredAbout UsPeopleNewsletters  As Biden suggestedhe is satisfied with Congress allocating just $600 stimulus checks for each American out of a $900 billion coronavirus relief package, his advisers said he will push to fund better housing and coronavirus tests for foreign nationals in Mexico.Biden said, Congress did its job this week in reference to the Democrat-controlled House and Republican-controlled Senate passing the package. President Trump, on the other hand, has demandedthe package be reworked to includ

In [263]:
# step 1

# replace every special character with a white space,
# except a hyphen, comma or period that sits between two letters or digits
# E.g. keep COVID-19, one-year, 2,000, 3.00 and U.S
step1_remove_char = re.sub(r"(?!(?<=[a-zA-Z0-9])[\,\.\-](?=[a-zA-Z0-9]))[^a-zA-Z0-9 \n]"," ", example + '3.00, 5g')
# if there is more than one white space between words, reduce it to one
step1_remove_spaces = re.sub(r'\s+', " ", step1_remove_char).strip()

In [264]:
step1_remove_char

'   PoliticsEntertainmentMediaEconomyWorldLondon   EuropeBorder   Cartel ChroniclesIsrael   Middle EastAfricaAsiaLatin AmericaAll WorldVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresB Inspired    BREITBART    PoliticsEntertainmentMediaEconomyWorldLondon   EuropeBorder   Cartel ChroniclesIsrael   Middle EastAfricaAsiaLatin AmericaWorld NewsVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresPodcastsBreitbart News DailyB InspiredAbout UsPeopleNewsletters  As Biden suggestedhe is satisfied with Congress allocating just  600 stimulus checks for each American out of a  900 billion coronavirus relief package  his advisers said he will push to fund better housing and coronavirus tests for foreign nationals in Mexico.Biden said  Congress did its job this week in reference to the Democrat-controlled House and Republican-controlled Senate passing the package  President Trump  on the other hand  has demandedthe package be reworked to includ

In [265]:
step1_remove_spaces

'PoliticsEntertainmentMediaEconomyWorldLondon EuropeBorder Cartel ChroniclesIsrael Middle EastAfricaAsiaLatin AmericaAll WorldVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresB Inspired BREITBART PoliticsEntertainmentMediaEconomyWorldLondon EuropeBorder Cartel ChroniclesIsrael Middle EastAfricaAsiaLatin AmericaWorld NewsVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresPodcastsBreitbart News DailyB InspiredAbout UsPeopleNewsletters As Biden suggestedhe is satisfied with Congress allocating just 600 stimulus checks for each American out of a 900 billion coronavirus relief package his advisers said he will push to fund better housing and coronavirus tests for foreign nationals in Mexico.Biden said Congress did its job this week in reference to the Democrat-controlled House and Republican-controlled Senate passing the package President Trump on the other hand has demandedthe package be reworked to include 2,000 stimulus checks for e
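The effect of the step 1 pattern is easier to see on short strings. This sketch reuses the same regular expression; the wrapper name `step1` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in step 1: a comma, period or hyphen survives only when
# it has a letter or digit on both sides.
KEEP_INNER_PUNCT = re.compile(
    r"(?!(?<=[a-zA-Z0-9])[\,\.\-](?=[a-zA-Z0-9]))[^a-zA-Z0-9 \n]"
)

def step1(text):
    """Replace special characters with spaces, then collapse white space."""
    replaced = KEEP_INNER_PUNCT.sub(" ", text)
    return re.sub(r"\s+", " ", replaced).strip()

print(step1("COVID-19 costs $2,000, or 3.00 (U.S.)"))
print(step1("nationals in Mexico.Biden said, 'one-year' plan!"))
```

Note that a period between two letters is also kept, so a pair of sentences joined without a space (Mexico.Biden) passes through this step unchanged.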

In [266]:
# step 2

# if a word matches this pattern or is in the list below, we don't pass it to wordninja
# the pattern matches words that contain a hyphen, mix letters and digits, or have no lowercase letters
wordninja_filter = re.compile(r"-|([A-Za-z]+\d+\w*|\d+[A-Za-z]+\w*)|^[^a-z]*$")
# words in this list are not passed to wordninja because it can't split them well
words_pass = ['qanon', 'covid', 'vaxx']

# split the string by a white space
string_isolated = step1_remove_spaces.split()

# check word by word to detect split
step2_words_split = ''
for el in string_isolated:
    # if the word matches the pattern or is in the list, we don't pass it to wordninja
    if wordninja_filter.search(el) or el.lower() in words_pass:
        temp = el
    # all other words are checked and split if necessary
    else:
        temp = ' '.join(wordninja.split(el))
    step2_words_split += ' ' + temp
    step2_words_split = step2_words_split.strip()

In [268]:
step2_words_split

'Politics Entertainment Media Economy World London Europe Border Cartel Chronicles Israel Middle East Africa Asia Latin America All World Video Tech Sports On the Hill On the Hill Articles On The Hill Exclusive Video Wires B Inspired BREITBART Politics Entertainment Media Economy World London Europe Border Cartel Chronicles Israel Middle East Africa Asia Latin America World News Video Tech Sports On the Hill On the Hill Articles On The Hill Exclusive Video Wires Podcasts Bre it bart News Daily B Inspired About Us People Newsletters As Biden suggested he is satisfied with Congress allocating just 600 stimulus checks for each American out of a 900 billion coronavirus relief package his advisers said he will push to fund better housing and coronavirus tests for foreign nationals in Mexico Biden said Congress did its job this week in reference to the Democrat-controlled House and Republican-controlled Senate passing the package President Trump on the other hand has demanded the package be 
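To see which tokens bypass wordninja, the filter from step 2 can be applied to a handful of tokens on its own. This is a small sketch; the helper name `skips_wordninja` and the token list are illustrative and not part of the notebook.

```python
import re

# Same filter and word list as in the step 2 demo cell.
wordninja_filter = re.compile(r"-|([A-Za-z]+\d+\w*|\d+[A-Za-z]+\w*)|^[^a-z]*$")
words_pass = ['qanon', 'covid', 'vaxx']

def skips_wordninja(token):
    """Return True if the token is kept as-is instead of being split."""
    return bool(wordninja_filter.search(token)) or token.lower() in words_pass

tokens = ["COVID-19", "n95", "5g", "BREITBART", "2,000", "Covid",
          "suggestedhe", "CovidArmy", "Vaxxed"]
for token in tokens:
    print(token, skips_wordninja(token))
```

Only exact matches against `words_pass` are protected, so `CovidArmy` and `Vaxxed` are still sent to wordninja.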

In [269]:
# step 3
def step3_get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

step3_lemmatizer = WordNetLemmatizer()

In [270]:
temp_2 = ''
for word in step2_words_split.split():
    words_lemmatized = step3_lemmatizer.lemmatize(word, step3_get_wordnet_pos(word))
    temp_2 += ' ' + words_lemmatized

In [271]:
temp_2

' Politics Entertainment Media Economy World London Europe Border Cartel Chronicles Israel Middle East Africa Asia Latin America All World Video Tech Sports On the Hill On the Hill Articles On The Hill Exclusive Video Wires B Inspired BREITBART Politics Entertainment Media Economy World London Europe Border Cartel Chronicles Israel Middle East Africa Asia Latin America World News Video Tech Sports On the Hill On the Hill Articles On The Hill Exclusive Video Wires Podcasts Bre it bart News Daily B Inspired About Us People Newsletters As Biden suggest he be satisfied with Congress allocate just 600 stimulus check for each American out of a 900 billion coronavirus relief package his adviser say he will push to fund well housing and coronavirus test for foreign national in Mexico Biden say Congress do it job this week in reference to the Democrat-controlled House and Republican-controlled Senate passing the package President Trump on the other hand have demand the package be rework to incl
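The tag lookup inside `step3_get_wordnet_pos` can be illustrated without NLTK. This sketch assumes Penn Treebank style tags, which is what `nltk.pos_tag` returns, and writes the WordNet part-of-speech constants out as their single-character values. The function name `penn_to_wordnet` is hypothetical.

```python
# WordNet POS constants are single characters; these values mirror
# wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV.
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[0].upper(), "n")

for tag in ["JJR", "NNS", "VBD", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

Because each word is tagged in isolation, the tagger has no sentence context, which may explain why some words in the output above (for example passing) are left unchanged.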

In [272]:
# step 4 & step 5
step4_stop_words = set(stopwords.words('english'))
result = ''
for word in temp_2.split():
    if word.lower() not in step4_stop_words:
        result += ' ' + word.lower()
print(result.strip())

politics entertainment media economy world london europe border cartel chronicles israel middle east africa asia latin america world video tech sports hill hill articles hill exclusive video wires b inspired breitbart politics entertainment media economy world london europe border cartel chronicles israel middle east africa asia latin america world news video tech sports hill hill articles hill exclusive video wires podcasts bre bart news daily b inspired us people newsletters biden suggest satisfied congress allocate 600 stimulus check american 900 billion coronavirus relief package adviser say push fund well housing coronavirus test foreign national mexico biden say congress job week reference democrat-controlled house republican-controlled senate passing package president trump hand demand package rework include 2,000 stimulus check american call reporter week biden transition team official say plan provide funding improve shelter humanitarian assistance immigrant wait northern mexi
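Steps 4 and 5 can also be written as a list comprehension joined once at the end, which avoids repeated string concatenation. The sketch below uses a tiny hand-written stop-word set in place of `stopwords.words('english')` so that it runs without NLTK data; `lower_and_filter` is a hypothetical name.

```python
# Tiny illustrative stop-word set; the notebook uses stopwords.words('english').
demo_stop_words = {"the", "he", "be", "with", "for", "of", "a", "it",
                   "do", "this", "in", "to"}

def lower_and_filter(text, stop_words):
    """Lowercase each word and drop it if it is a stop word."""
    kept = [w.lower() for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)

print(lower_and_filter("Biden say Congress do it job this week", demo_stop_words))
```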

### Putting everything together

In [273]:
def fulltext_clean(string):
    #PREPARATION
    # step 1
    # replace every special character with a white space,
    # except a hyphen, comma or period that sits between two letters or digits
    # E.g. keep 2,000, 3.00, covid-19
    remove_char = re.sub(r"(?!(?<=[a-zA-Z0-9])[\,\.\-](?=[a-zA-Z0-9]))[^a-zA-Z0-9 \n]"," ", string)
    # if there is more than one white space between words, reduce it to one
    remove_spaces = re.sub(r'\s+', " ", remove_char).strip()

    # step 2
    # if a word matches this pattern or is in the list below, we don't pass it to wordninja
    # the pattern matches words that contain a hyphen, mix letters and digits, or have no lowercase letters
    wordninja_filter = re.compile(r"-|([A-Za-z]+\d+\w*|\d+[A-Za-z]+\w*)|^[^a-z]*$")
    # words in this list are not passed to wordninja because it can't split them well
    words_pass = ['qanon', 'covid']
    
    # step 3
    # set up for lemmatize
    def get_wordnet_pos(word):
        """Map POS tag to first character lemmatize() accepts"""
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)

    lemmatizer = WordNetLemmatizer()
    
    # step4
    # prepare stop words
    stop_words = set(stopwords.words('english'))

    # CLEANING
    # split the string by a white space
    string_isolated = remove_spaces.split()

    # check the string word by word: split if necessary, lemmatize, and remove stop words
    words_split = ''
    for el in string_isolated:
        # step 2
        # if the word matches the pattern or is in the list, we don't pass it to wordninja
        if wordninja_filter.search(el) or el.lower() in words_pass:
            temp = el
        # all other words are checked and split if necessary
        else:
            temp = ' '.join(wordninja.split(el))

        # wordninja may turn one token into several words, so each word is
        # lemmatized and checked separately, as in the demo above
        for word in temp.split():
            # step 3: lemmatize the word
            word_lemmatized = lemmatizer.lemmatize(word, get_wordnet_pos(word))

            # step 4 & step 5
            if word_lemmatized.lower() not in stop_words:
                words_split += ' ' + word_lemmatized.lower()

    words_split = words_split.strip()
        
    return words_split

In [274]:
# original article
single_article = data.iloc[2000,-2]
single_article

'PopularFully Vaxxed ESPN Host Stephen Smith Says He Nearly Died After Contracting CovidArmy Conducting Two-Week \'Guerrilla Warfare Exercise\' in Rural North Carolina Focused On Battling \'Freedom Fighters\'\'Shut The F**k Up!\' Fans Heckle NYC Mayor Eric Adams At New York Knicks GamePoll: Nearly Half of Democrats Support Fining or Imprisoning Americans Who \'Question Efficacy\' of Covid ShotsTrain Derails in Garbage-Strewn Area Trashed by Looters in Los Angeles BREAKING: Coca-Cola is forcing employees to complete online training telling them to "try to be less white." These images are from an internal whistleblower: pic.twitter.com/gRi4N20esZ Karlyn supports banning critical race theory in NH (@DrKarlynB) February 19, 2021 Tucker Carlson\'s Review Of Robin DiAngelo\'s Book \'White Fragility\'"The real point of her book is to defeat & demoralize you.""Everything about \'White Fragility\' is poisonous garbage." pic.twitter.com/4lkyz57ROQ The Columbia Bugle  (@ColumbiaBugle) June 25, 20

In [275]:
# article after cleaning
fulltext_clean(single_article)

'popular fully vax x ed espn host stephen smith says nearly died contracting c ovid army conducting two-week guerrilla warfare exercise rural north carolina focused battling freedom fighters shut f k fans heckle nyc mayor eric adams new york knicks game poll nearly half democrats support fining imprisoning americans question efficacy covid shots train derails garbage-strewn area trashed looters los angeles breaking coca-cola force employee complete online training tell try less white image internal whistle blower pic twitter com gri4n20esz karl yn support ban critical race theory nh dr karl yn b february 19 2021 tucker carlson review robin di angelo book white fragility real point book defeat demoralize everything white fragility poisonous garbage pic twitter com 4lkyz57roq columbia bugle columbia bugle june 25 2020'

### Discussion
<br> 1. Should we handle censored words that look like shi**t and f**k? E.g. see data.iloc[150, -2] and the case above.
<br> 2. If a word in 'words_pass' is concatenated with another word, we can't separate it correctly. E.g. see the word 'CovidArmy' above: a word like this is sent to wordninja, which can't split covid well, so we end up with c ovid army; if we don't pass it to wordninja, we end up with covidarmy.
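Both discussion points could be explored with small regex helpers. The sketch below is one possible approach and is not part of the pipeline above. `split_protected` separates a protected term from the text glued to it, so that only the remainder needs to go to wordninja. `find_censored` lists censored words so their frequency can be checked before deciding how to treat them. Both names and patterns are hypothetical.

```python
import re

# Protected terms that wordninja does not split well (from words_pass).
PROTECTED = re.compile(r"(qanon|covid)", re.IGNORECASE)

def split_protected(token):
    """Separate protected terms from the text glued to them."""
    return PROTECTED.sub(r" \1 ", token).split()

# Letters, one or more asterisks, then optional letters, e.g. F**k or shi**t.
CENSORED = re.compile(r"[A-Za-z]+\*+[A-Za-z]*")

def find_censored(text):
    """Return censored words; run on raw text, since step 1 removes '*'."""
    return CENSORED.findall(text)

print(split_protected("CovidArmy"))
print(find_censored("'Shut The F**k Up!' Fans Heckle"))
```

A substring match like this would also split words that merely contain a protected term (covidiots would become covid and iots), so the list of protected terms would need to stay short and specific.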