# Data Cleaning

## Part 1: Web Scraping

First, we want to extract the article text from the URLs linked to in the RT, Ruptly, and Sputnik tweets. We will be using web scraping methods to obtain the text.

### Planned Method

* Use BeautifulSoup or lxml to scrape article text
* Sections to target are body, main, or article sections
* HTML tags: p and h1/h2/h3
* In particular, classes of h1/h2/h3 with "article" and/or "title"
* Ignore h1/h2/h3 with "footer" or "newsletter" in class
* Ignore text with no "!" or "."
* If PDF, use textract to extract article text
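
The filtering rules in the list above can be sketched as two small predicates. This only illustrates the planned rules and is not the code that was ultimately used. The names `is_wanted_heading` and `looks_like_sentence` are hypothetical, and treating "article"/"title" as a strict requirement is an assumption made to keep the sketch short.

```python
# Hypothetical helpers illustrating the planned filtering rules.
WANTED = ("article", "title")
UNWANTED = ("footer", "newsletter")

def is_wanted_heading(class_names):
    """Return True if a heading's class list suggests article content."""
    joined = " ".join(class_names).lower()
    # Reject headings whose classes mention footer or newsletter
    if any(word in joined for word in UNWANTED):
        return False
    # Keep headings whose classes mention article or title
    return any(word in joined for word in WANTED)

def looks_like_sentence(text):
    """Return True if the text contains '!' or '.', per the planned rule."""
    return "!" in text or "." in text

print(is_wanted_heading(["article__title"]))   # True
print(is_wanted_heading(["footer-title"]))     # False
print(looks_like_sentence("Subscribe now"))    # False
```

In practice the class list would come from the parsed HTML element (for example, the `class` attribute BeautifulSoup exposes for a tag).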

### Actual Method

The RT dataset consists mainly of articles from the domain rt.com. BeautifulSoup is used to extract the text from the linked articles. The result is derived from the text of the HTML p tags. The scraping function returns a list of strings for each URL (one string per p tag), so requesting every URL yields a list of lists of strings.

The Sputnik dataset consists mainly of articles from the domain sptnkne.ws. BeautifulSoup is used to extract the description of the linked articles. The result is derived from the text of the HTML p tags found within the HTML div tag of class 'b-article__text'.

The Ruptly dataset was found to be unsuitable for the topic modeling analysis planned for it, so the code had to be scrapped. The scrapped code is found in the Extra section at the end of this notebook.

In [678]:
# Imports needed for web scraping and pickling
import pandas as pd
import requests
from bs4 import BeautifulSoup
import pickle

# Read in RT URL file into DataFrame, get the values of the expanded_url column
rtdf = pd.read_csv("rt_urls.csv")
urls = rtdf['expanded_url'].values

# There are 2144 tweets in the RT file
# len(rtdf)

# Define function to scrape article data from URLs in the RT dataset
def url_to_rttext(url):
    '''Return article data from RT URLs'''
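    try:
        # A timeout keeps a single unresponsive URL from stalling the whole run
        page = requests.get(url, timeout=30)
        soup = BeautifulSoup(page.content, 'lxml')
        result = [p.text for p in soup.find_all('p')]
        return result
    except Exception:
        # Any request or parsing failure yields None, which is handled downstream
        return None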

In [635]:
## Request the article data; takes ~ 1 hour to run
# rtarticle = [url_to_rttext(url) for url in urls]

In [752]:
# Articles stored as lists of strings get changed into a single string
def combine_text(list_of_str):
    '''Take iterable collection of strings and combine them into one string'''
    combined_text = ' '.join(list_of_str)
    return combined_text

# Store article text as new column "article" in RT DataFrame
rtdf['article'] = rtarticle
rtdf['article'] = rtdf['article'].apply(lambda x: combine_text(x) if x is not None else '')

In [753]:
# Read in Sputnik URL file, get the values of the expanded_url column
sputnikdf = pd.read_csv("sputnikint_urls.csv")
urls = sputnikdf['expanded_url'].values

# There are 3198 tweets in the Sputnik file
# len(sputnikdf)

# Scrape article data from URLs in the Sputnik dataset
def url_to_stext(url):
    '''Return article data from Sputnik URLs'''
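    try:
        # A timeout keeps a single unresponsive URL from stalling the whole run
        page = requests.get(url, timeout=30)
        soup = BeautifulSoup(page.content, 'lxml')
        # find() returns None when the div is missing; the resulting error is caught below
        body = soup.find('div', class_='b-article__text').find_all('p')
        result = [p.text for p in body]
        return result
    except Exception:
        # Any request or parsing failure yields None, which is handled downstream
        return None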

In [754]:
## Request the article data; takes a while to run
# sarticle = [url_to_stext(url) for url in urls]

In [755]:
# Store article text as new column "article" in Sputnik DataFrame
sputnikdf['article'] = sarticle
sputnikdf['article'] = sputnikdf['article'].apply(lambda x: combine_text(x) if x is not None else '')

In [756]:
sputnikdf.head()

Unnamed: 0.1,Unnamed: 0,url,expanded_url,display_url,indices,article
0,0,https://t.co/CPCTe4xCbd,https://sptnkne.ws/CpBz,sptnkne.ws/CpBz,"[112, 135]",Former Senate staffer Tara Reade claims a barr...
1,1,https://t.co/oin3K2gk86,https://sptnkne.ws/CpAU,sptnkne.ws/CpAU,"[58, 81]",Five Indian Army personnel including a command...
2,2,https://t.co/kKSz6S9ofV,https://sptnkne.ws/CpAV,sptnkne.ws/CpAV,"[90, 113]","According to the Infobae news portal, the cour..."
3,3,https://t.co/LuJwSuff35,http://sptnkne.ws/CpA2,sptnkne.ws/CpA2,"[109, 132]",The number of COVID-19-related deaths througho...
4,4,https://t.co/CemmMGniMs,https://sptnkne.ws/CpAZ,sptnkne.ws/CpAZ,"[72, 95]",The epicentre of the natural disaster was loca...


## Part 2: Clean Text

This part involves applying text cleaning techniques like making text lowercase, removing punctuation, and removing words with numerals. The ids of the tweets (the column titled 'Unnamed: 0') will be kept to facilitate the next step.
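
As a small worked example of these steps, the sketch below applies lowercasing, punctuation removal, and removal of words containing numerals to one made-up sentence. `demo_clean` and the sample sentence are hypothetical; it is a simplified stand-in for illustration, not the notebook's `clean_text` function.

```python
import re
import string

def demo_clean(text):
    """Simplified illustration: lowercase, strip punctuation, drop tokens with digits."""
    text = text.lower()
    # Remove ASCII punctuation characters
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    # Remove any token that contains a digit
    text = re.sub(r"\w*\d\w*", "", text)
    # Collapse the leftover whitespace into single spaces
    return " ".join(text.split())

sample = "COVID-19 cases rose by 3,000 on Monday!"
print(demo_clean(sample))  # cases rose by on monday
```

Because punctuation is stripped before numerals are handled, "COVID-19" first becomes "covid19" and is then dropped entirely as a word containing a number.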

In [783]:
# Apply text cleaning techniques
import re
import string

def clean_text(text):
    '''Make text lowercase, remove punctuation, remove words containing numbers, and remove non-English characters.'''
    text = text.lower()
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text) # initial punctuation removal
    text = re.sub(r'[‘’“”…]', '', text) # additional punctuation removal
    text = re.sub(r'\w*\d\w*', '', text) # remove words containing numbers
    text = re.sub(r'[^a-z]+[ ]*[^a-z]+', ' ', text) # replace runs of non-English characters with a space
    text = re.sub(r' {2,}', ' ', text) # collapse repeated spaces into one so words are not joined
    return text

# Create clean DataFrames
clean_data_rt = pd.concat([rtdf['Unnamed: 0'], rtdf['article'].apply(lambda x: clean_text(x) if x!='' else '')], axis=1)
clean_data_s = pd.concat([sputnikdf['Unnamed: 0'], sputnikdf['article'].apply(lambda x: clean_text(x) if x!='' else '')], axis=1)

In [785]:
## Inspect the cleaned DataFrames
clean_data_rt.head()
clean_data_s.head()

Unnamed: 0.1,Unnamed: 0,article
0,3,the south korean military says it issued a war...
1,4,by jason otoole who has worked as a senior fea...
2,8,perkins is best known for his book confessions...
3,9,the ongoing coronavirus crisis has produced a ...
4,10,


## Part 3: Organize Articles

This part involves preparing the cleaned DataFrames for use in the Topic_Modeling_Yen notebook. This includes creating a corpus and document-term matrix for both RT and Sputnik.

### Corpus

The corpus was mostly created in the previous step. The target URL CSVs list the URLs that were retweeted most often (over 80 retweets for RT and over 20 for Sputnik, per the file names); the topic modeling procedure will focus on these.

In [786]:
# Read in target url CSVs
rt_target = pd.read_csv('RT_urls_with_over_80_retweets.csv')
s_target = pd.read_csv('sputnik_urls_with_over_20.csv')

# Create a DataFrame with the targeted urls
rt_corpus = pd.merge(clean_data_rt, rt_target).rename(columns={'Unnamed: 0': 'ID'})[['ID', 'article']]
s_corpus = pd.merge(clean_data_s, s_target).rename(columns={'Unnamed: 0': 'ID'})[['ID', 'article']]

# Pickle the corpus for later use
rt_corpus.to_pickle("rt_corpus.pkl")
s_corpus.to_pickle("s_corpus.pkl")

### Document-Term Matrix

To perform topic modeling, the text must first be tokenized, that is, broken down into smaller units such as words. The scikit-learn library's CountVectorizer is used to do this, producing a matrix with one row per document and one column per word. The same class also removes stop words, which are common words that add little meaning to the text, such as the conjunction "and."
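
To make the row-per-document, column-per-word layout concrete, here is a minimal sketch that builds a tiny document-term matrix with only the standard library. The two documents and the stop word list are made up, and whitespace splitting is a simplification; CountVectorizer applies its own tokenization rules.

```python
from collections import Counter

# Two toy documents and a tiny stop-word list (all hypothetical)
docs = ["the oil price fell and the market fell", "oil demand rose"]
stop_words = {"the", "and"}

# Tokenize on whitespace and count the remaining words per document
counts = [Counter(w for w in doc.split() if w not in stop_words) for doc in docs]

# The vocabulary becomes the columns; each document becomes a row
vocab = sorted(set().union(*counts))
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)   # ['demand', 'fell', 'market', 'oil', 'price', 'rose']
for row in matrix:
    print(row) # [0, 2, 1, 1, 1, 0] then [1, 0, 0, 1, 0, 1]
```

Most entries in a real matrix are zero, which is why the output of the real matrix further down is dominated by zeros.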

In [788]:
# Create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')

# RT document-term matrix
rt_cv = cv.fit_transform(clean_data_rt.article)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
rt_dtm = pd.DataFrame(rt_cv.toarray(), columns=cv.get_feature_names_out())
rt_dtm.index = clean_data_rt.index

# Sputnik document-term matrix (cv is refit on the Sputnik articles)
s_cv = cv.fit_transform(clean_data_s.article)
s_dtm = pd.DataFrame(s_cv.toarray(), columns=cv.get_feature_names_out())
s_dtm.index = clean_data_s.index

Unnamed: 0,aa,aadmi,aaj,aaja,aakshat,aam,aarhus,aarogya,aarogyasetu,aarohyasetu,...,zuckerberg,zuganov,zur,zurich,zurzeit,zuvor,zuwara,zvezda,zvezdar,zvi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3193,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3194,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3195,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3196,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Additional Stop Word Removal

Additional stop words are identified by inspecting the most common words in each article: words that rank highly in many articles but carry little meaning are added to the stop word list and removed.
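
The notebook chooses these words by reading the printed output. One possible way to narrow down the candidates, offered only as a sketch, is to count how many articles list each word among their top words. The `top_words` data and the `min_articles` threshold below are made up for illustration.

```python
from collections import Counter

# Hypothetical top words per article (keys are article indices)
top_words = {
    0: ["korean", "dmz", "said"],
    1: ["musk", "rt", "said"],
    2: ["perkins", "rt", "said"],
}

# Count how many articles list each word among their top words
doc_freq = Counter(word for words in top_words.values() for word in set(words))

# Words shared by at least `min_articles` articles are candidate stop words
min_articles = 2
candidates = sorted(w for w, n in doc_freq.items() if n >= min_articles)
print(candidates)  # ['rt', 'said']
```

The candidates still need a manual check, since a word that is frequent across articles (such as "coronavirus") may be exactly what the topic model should pick up.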

In [848]:
# Find the top 10 words in each article
data = rt_dtm.transpose()

for article in data.columns:
    common_words = data[article].nlargest(10)
    print(common_words)

korean      4
dmz         3
hit         3
north       3
fired       2
guard       2
incident    2
inside      2
post        2
reported    2
Name: 0, dtype: int64
musk         11
musks         5
america       4
certainly     4
just          4
like          4
lockdown      4
pandemic      4
came          3
case          3
Name: 1, dtype: int64
said        6
death       5
perkins     5
budget      4
economic    4
people      4
rt          4
better      3
fear        3
military    3
Name: 2, dtype: int64
patent            6
gates             5
body              4
mikhalkov         4
activity          3
cryptocurrency    3
device            3
theory            3
tv                3
alarming          2
Name: 3, dtype: int64
aabsent       0
aal           0
aals          0
aan           0
aaron         0
aarps         0
aback         0
abandon       0
abandoned     0
abandoning    0
Name: 4, dtype: int64
kim       8
said      5
news      4
event     3
kcna      3
kims      3
north     3
...
hail          6
inches        3
study         3
town          3
actually      2
atmosphere    2
carlos        2
gargantuan    2
hailstone     2
hailstones    2
Name: 167, dtype: int64
workers         4
climate         3
coronavirus     3
equal           3
frontline       3
im              3
pandemic        3
post            3
thunberg        3
appreciation    2
Name: 168, dtype: int64
bodies         7
funeral        6
trucks         6
home           5
wednesday      4
brooklyn       3
coronavirus    3
corpses        3
police         3
smell          3
Name: 169, dtype: int64
minister       6
mishustin      6
prime          6
putin          6
government     5
major          4
president      4
belousov       3
colleagues     3
coronavirus    3
Name: 170, dtype: int64
aabsent       0
aal           0
aals          0
aan           0
aaron         0
aarps         0
aback         0
abandon       0
abandoned     0
abandoning    0
Name: 171, dtype: int64
tweet        6
twitter      5
...
mayor        5
new          5
people       5
blasio       4
community    4
jews         4
social       4
yorkers      4
chaim        3
city         3
Name: 254, dtype: int64
bombers     4
air         3
rt          3
close       2
forces      2
left        2
video       2
white       2
able        1
advanced    1
Name: 255, dtype: int64
huawei       6
app          5
companies    4
google       4
googles      3
wego         3
banned       2
began        2
biggest      2
brands       2
Name: 256, dtype: int64
countries     11
doctors        9
cuba           7
people         6
just           5
pandemic       5
cuban          4
cubas          4
healthcare     4
including      4
Name: 257, dtype: int64
iit              9
syrian           9
opcw             8
government       7
investigation    6
ltamenah         6
said             6
evidence         5
fact             5
western          5
Name: 258, dtype: int64
videos          5
pentagon        4
said            4
unidentified    4
...
cotton      5
learn       5
need        5
vaccine     5
chinese     4
students    4
america     3
study       3
begun       2
china       2
Name: 434, dtype: int64
noble        7
president    7
prizes       7
russia       7
know         6
nobel        6
history      5
story        5
trump        5
eating       4
Name: 435, dtype: int64
kim           22
north         14
kims           7
korea          7
koreas         6
succession     6
family         4
likely         4
power          4
china          3
Name: 436, dtype: int64
bailout       2
british       2
business      2
capital       2
company       2
deal          2
fund          2
government    2
million       2
news          2
Name: 437, dtype: int64
country      4
iran         2
outbreak     2
president    2
religious    2
rt           2
sites        2
white        2
zones        2
allow        1
Name: 438, dtype: int64
million        11
bpd            10
oil            10
crude           9
demand          7
week            7
...
moscow         4
people         4
announced      3
blood          3
coronavirus    3
hospitals      3
plasma         3
total          3
wednesday      3
according      2
Name: 642, dtype: int64
police          8
queensland      4
rules           4
video           4
construction    3
site            3
story           3
trio            3
according       2
australia       2
Name: 643, dtype: int64
cuomo        7
new          5
work         4
essential    3
virus        3
want         3
worker       3
yorkers      3
added        2
asked        2
Name: 644, dtype: int64
china            6
beijing          3
coronavirus      3
said             3
epidemic         2
foreign          2
geng             2
international    2
legal            2
ministry         2
Name: 645, dtype: int64
money         6
university    4
harvard       3
bailout       2
endowment     2
promised      2
rt            2
statement     2
trump         2
use           2
Name: 646, dtype: int64
space         13
...
percent        6
china          5
american       4
business       4
economic       3
firms          3
respondents    3
survey         3
chamber        2
commerce       2
Name: 851, dtype: int64
india         13
million        9
health         8
government     7
tests          7
healthcare     6
beds           5
indian         5
medical        5
people         5
Name: 852, dtype: int64
protesters       4
police           3
rabin            3
square           3
avivs            2
black            2
corruption       2
demonstrators    2
distancing       2
gathering        2
Name: 853, dtype: int64
crisis       5
europe       4
gentiloni    4
eu           3
trillion     3
common       2
economic     2
eurozone     2
financial    2
fund         2
Name: 854, dtype: int64
oil        12
global      4
prices      4
barrel      3
crude       3
cut         3
demand      3
million     3
able        2
current     2
Name: 855, dtype: int64
bus         4
police      4
dallas      3
man         3
...
coronavirus     3
laboratory      3
news            3
based           2
ghebreyesus     2
organization    2
repeatedly      2
rt              2
story           2
virus           2
Name: 1063, dtype: int64
bat            4
bats           4
tested         3
viruses        3
cause          2
coronavirus    2
kerala         2
positive       2
pteropus       2
rt             2
Name: 1064, dtype: int64
aircraft     3
plane        3
captain      2
committee    2
hit          2
incident     2
rt           2
runway       2
shows        2
video        2
Name: 1065, dtype: int64
sailors     4
navy        3
tested      3
aircraft    2
carrier     2
french      2
lavault     2
outbreak    2
positive    2
reported    2
Name: 1066, dtype: int64
syrian       8
army         4
group        4
said         4
american     3
fighters     3
military     3
near         3
surrender    3
weapons      3
Name: 1067, dtype: int64
ships       6
iranian     5
vessels     5
military    4
navy        4
oil         3
...
decision    5
cuomo       4
lift        4
trump       4
date        3
said        3
states      3
work        3
change      2
day         2
Name: 1247, dtype: int64
african       4
africa        3
guangzhou     3
alleged       2
china         2
city          2
consulate     2
foreigners    2
local         2
officials     2
Name: 1248, dtype: int64
new          5
armor        3
commander    2
day          2
defense      2
design       2
machines     2
military     2
mm           2
rt           2
Name: 1249, dtype: int64
kudrin        5
measures      5
businesses    4
crisis        4
economy       4
government    4
russia        4
according     3
billion       3
country       3
Name: 1250, dtype: int64
percent         8
restrictions    4
rights          4
capital         3
romir           3
supportive      3
survey          3
april           2
coronavirus     2
education       2
Name: 1251, dtype: int64
biden       7
sanders     7
asking      4
im          4
party       3
policy      3
...
tweet        6
add          4
browser      3
instant      3
instantly    3
learn        3
location     3
twitter      3
website      3
code         2
Name: 1463, dtype: int64
sanders        10
labour          9
democratic      8
uk              8
party           7
corbyn          6
politics        5
biden           4
government      4
ideological     4
Name: 1464, dtype: int64
million       8
cuts          6
bpd           5
said          5
output        4
production    4
arabia        3
barrel        3
barrels       3
day           3
Name: 1465, dtype: int64
million       8
cuts          6
bpd           5
said          5
output        4
production    4
arabia        3
barrel        3
barrels       3
day           3
Name: 1466, dtype: int64
taliban       5
attack        4
afghan        3
fired         3
rockets       3
bagram        2
car           2
claimed       2
free          2
government    2
Name: 1467, dtype: int64
claims          4
million         4
insurance       2
...
app         7
germany     4
use         4
data        3
said        3
tracking    3
users       3
apps        2
braun       2
cases       2
Name: 1675, dtype: int64
kissinger         8
rt                5
global            3
government        3
humanity          3
like              3
pandemic          3
project           3
world             3
administration    2
Name: 1676, dtype: int64
hospital     4
pm           4
care         3
confirmed    3
nhs          3
raab         3
according    2
advice       2
afternoon    2
continued    2
Name: 1677, dtype: int64
cabinet      3
monday       3
pm           3
raab         3
condition    2
confirmed    2
foreign      2
hospital     2
johnson      2
johnsons     2
Name: 1678, dtype: int64
week           5
minister       3
recorded       3
announced      2
coronavirus    2
country        2
deaths         2
foreign        2
lockdown       2
lowest         2
Name: 1679, dtype: int64
billion       3
percent       3
crisis        2
economy       2
...
navy        7
crozier     5
letter      5
sailors     5
command     4
captain     3
chain       3
croziers    3
did         3
military    3
Name: 1891, dtype: int64
migrants    10
libya        6
african      4
boats        4
europe       4
italy        4
mafia        3
ngo          3
people       3
rescue       3
Name: 1892, dtype: int64
facebook        9
twitter         6
media           5
social          5
leaders         4
thing           4
world           4
decided         3
hes             3
organization    3
Name: 1893, dtype: int64
cases        8
virus        5
confirmed    3
deaths       3
lockdown     3
people       3
spread       3
wuhan        3
country      2
effect       2
Name: 1894, dtype: int64
coronavirus    6
poll           4
respondents    4
americans      3
appears        3
available      3
half           3
percent        3
shot           3
vaccinated     3
Name: 1895, dtype: int64
said           3
billion        2
coronavirus    2
iata           2
industry       2
...
migrants     3
asked        2
bath         2
clothes      2
eyes         2
gautam       2
local        2
officer      2
officials    2
public       2
Name: 2108, dtype: int64
percent       7
pandemic      2
rt            2
week          2
analyst       1
app           1
autonomous    1
bailout       1
bear          1
business      1
Name: 2109, dtype: int64
china          4
virus          4
beijing        3
chinese        3
coronavirus    3
embassy        3
nations        3
pandemic       3
story          3
western        3
Name: 2110, dtype: int64
canada        5
left          5
pay           5
harry         4
kingdom       4
trump         4
friend        3
meghan        3
pair          3
protection    3
Name: 2111, dtype: int64
new          13
york         12
trump        11
told          7
city          6
president     6
state         6
blasio        5
cuomo         4
federal       4
Name: 2112, dtype: int64
weve         5
calls        4
city         4
day          4
almojera     3
...

In [851]:
# Some common words that add little meaning include "rt", "said", and "im"; they are added to the stop words
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer expects a list (or 'english'), so convert the resulting frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(['rt', 'said', 'im']))

# Recreate document-term matrices
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv = CountVectorizer(stop_words=stop_words)
rt_cv = cv.fit_transform(clean_data_rt.article)
rt_stop = pd.DataFrame(rt_cv.toarray(), columns=cv.get_feature_names_out())
rt_stop.index = clean_data_rt.index

cv = CountVectorizer(stop_words=stop_words)
s_cv = cv.fit_transform(clean_data_s.article)
s_stop = pd.DataFrame(s_cv.toarray(), columns=cv.get_feature_names_out())
s_stop.index = clean_data_s.index

# Pickle the document-term matrices
rt_stop.to_pickle('rt_dtm.pkl')
s_stop.to_pickle('s_dtm.pkl')

In [850]:
# Also pickle the cleaned data and the CountVectorizer object
clean_data_rt.to_pickle('rt_clean.pkl')
clean_data_s.to_pickle('s_clean.pkl')
# Note: cv was last fit on the Sputnik articles
with open("cv.pkl", "wb") as f:
    pickle.dump(cv, f)

# Extra

This is the code that was ultimately scrapped.

### Web Scraping

In [208]:
# %load_ext memory_profiler
## For memory profiling

In [210]:
## This cell's code extracted a lot of unneeded text, and its peak memory was higher than
## that of the cell below when not in function form

# article = []

# ruptly = pd.read_csv("ruptly_urls.csv")
# url = ruptly['expanded_url'].values[0]
# page = requests.get(url)
# soup = BeautifulSoup(page.content, 'lxml')
# for text in soup.body.find_all(text=True):
#     if text.parent.name not in ['script', 'meta', 'link', 'style'] and text != '\n' and text != None:
#         article.append(text.strip())

peak memory: 367.14 MiB, increment: -0.34 MiB


In [642]:
## Read in Ruptly URL file, get the values of the expanded_url column
# ruptlydf = pd.read_csv("ruptly_urls.csv")
# urls = ruptlydf['expanded_url'].values

## There are 668 tweets in the Ruptly file
## len(ruptlydf)

## Scrape article data from URLs in the Ruptly dataset
# def url_to_ruptext(url):
#     '''Return article data from Ruptly URLs'''
#     try:
#         page = requests.get(url)
#         soup = BeautifulSoup(page.content, 'lxml')
#         desc = soup.find('p', id='eow-description')
#         result = [item.string for item in desc if item.string]
#         return result
#     except:
#         return None

In [572]:
## Request the article data; takes a while to run

# ruptlyarticle = [url_to_ruptext(url) for url in urls]

In [751]:
# ruptlydf['article'] = ruptlyarticle
# ruptlydf['article'] = ruptlydf['article'].apply(lambda x: combine_text(x) if x!=None else '')
# ruptlydf.head()

### Clean Text

In [None]:
# clean_data_rup = pd.concat([ruptlydf['indices'], ruptlydf['article'].apply(lambda x: clean_text(x) if x!='' else '')], axis=1)

In [None]:
# clean_data_rup.head()

### Organize Articles

In [None]:
# rup_target = pd.read_csv('Ruptly_urls_with_over_80.csv')
# rup_corpus = pd.merge(clean_data_rup, rup_target)

# rup_corpus.to_pickle("rup_corpus.pkl")