## CLEANING AND VECTORIZATION
In this notebook we pre-process, clean, lemmatize, tag, and vectorize our web-scraped tweets. The goal is to create an optimized document-term matrix for topic modeling.

In [19]:
import numpy as np
import pandas as pd
import spacy
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from langdetect import detect
import unicodedata

### Standardize Encoding

Read in CSV as dataframe, and standardize encoding.

In [12]:
df_raw = pd.read_csv('df_raw.csv', encoding='utf-8', index_col=0)

In [5]:
df_raw.columns

Index(['user', 'date', 'url', 'outlinks', 'content'], dtype='object')

In [13]:
# reorder the columns for easier viewing
cols = df_raw.columns.tolist()
cols.insert(2, cols.pop(cols.index('url')))
df_raw = df_raw.reindex(columns=cols)

Normalize our web-scraped tweets and convert them to ASCII, so that emojis and other non-ASCII characters are dropped before the remaining pre-processing steps.

In [29]:
# NFKD-normalize, then drop any character that has no ASCII equivalent
df_raw['clean'] = df_raw['content'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('ascii'))

In [30]:
df_raw.head()

Unnamed: 0,user,date,url,outlinks,content,clean
0,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-11-01T01:08:48+00:00,https://twitter.com/TheAtlantic/status/1322707...,['http://on.theatln.tc/YXH6gyR'],The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily: Will this decade be the ne...
1,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-11-01T00:38:45+00:00,https://twitter.com/TheAtlantic/status/1322699...,['http://on.theatln.tc/l89Uzv7'],There's plenty that's going wrong for Trump. H...,There's plenty that's going wrong for Trump. H...
2,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-11-01T00:06:48+00:00,https://twitter.com/TheAtlantic/status/1322691...,['http://on.theatln.tc/ZGvkM7u'],"If Trump tries to steal the election, people w...","If Trump tries to steal the election, people w..."
3,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T23:34:45+00:00,https://twitter.com/TheAtlantic/status/1322683...,['http://on.theatln.tc/kypt5Zc'],The Trump campaign's “election-security operat...,The Trump campaign's election-security operati...
4,"{'username': 'TheAtlantic', 'displayname': 'Th...",2020-10-31T23:04:31+00:00,https://twitter.com/TheAtlantic/status/1322675...,['http://on.theatln.tc/rNbarVc'],"Even if Joe Biden wins decisively next week, t...","Even if Joe Biden wins decisively next week, t..."
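To make the normalization step concrete, here is a minimal sketch of the same `unicodedata` call applied to a single string. The helper name `to_ascii` and the sample text are made up for illustration and are not part of the notebook's pipeline.

```python
import unicodedata

def to_ascii(text):
    """Decompose characters with NFKD, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Made-up sample, not taken from the dataset
sample = "Café “quoted” déjà vu 🎉"
print(repr(to_ascii(sample)))  # 'Cafe quoted deja vu '
```

Accented letters keep their base letter because NFKD splits them into a letter plus a combining mark, and only the mark is dropped. Curly quotes and emojis have no ASCII decomposition, so they disappear entirely, which is what happens to the quotation mark in row 3 of the output above.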


### Preprocessing

In [32]:
def preprocess(tweet):
    """
    Takes in a tweet and performs initial text cleaning/preprocessing.
    """
    # make sure the input is a string
    tweet = str(tweet)
    # no lowercasing yet: capitalization helps spaCy detect proper nouns later
    # tweet = tweet.lower()
    # remove URLs
    rem_url = re.sub(r'http\S+', '', tweet)
    # remove @ mentions
    rem_tag = re.sub(r'@\S+', '', rem_url)
    # remove the # symbol but keep the hashtag text
    rem_hashtag = re.sub(r'#', '', rem_tag)
    # remove everything except letters and whitespace (digits, punctuation, etc.)
    clean_text = re.sub(r'[^A-Za-z\s]', '', rem_hashtag)

    return clean_text

In [33]:
df_raw['clean'] = df_raw['clean'].map(preprocess)

In [80]:
df_raw[['content', 'clean']].head(3)

Unnamed: 0,content,clean
0,The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily Will this decade be the new s
1,There's plenty that's going wrong for Trump. H...,Theres plenty thats going wrong for Trump Here...
2,"If Trump tries to steal the election, people w...",If Trump tries to steal the election people wi...
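The four substitutions can be traced one at a time on a single string. The sketch below reuses the same patterns as `preprocess`; the helper `preprocess_steps` and the sample tweet are invented for illustration.

```python
import re

def preprocess_steps(tweet):
    """Apply the same four substitutions as preprocess(), keeping each intermediate result."""
    no_url = re.sub(r"http\S+", "", tweet)                # drop URLs
    no_tag = re.sub(r"@\S+", "", no_url)                  # drop @ mentions
    no_hash = re.sub(r"#", "", no_tag)                    # drop '#', keep the hashtag text
    letters_only = re.sub(r"[^A-Za-z\s]", "", no_hash)    # keep letters and whitespace only
    return [no_url, no_tag, no_hash, letters_only]

# Made-up sample tweet
sample = "Breaking: @jane says #Election2020 isn't over https://example.com/a1"
for step in preprocess_steps(sample):
    print(repr(step))
```

The final string is `'Breaking  says Election isnt over '`. Two side effects are worth noticing: removed mentions leave doubled spaces behind, and apostrophes are stripped, which is why "There's" shows up as "Theres" in the output above.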


### Standardize Language

Let's remove any non-English tweets so that the corpus contains only English text.

In [37]:
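def english_only(x):
    """
    Take a tweet, detect its language, and return it only if it is English.
    Non-English tweets, and tweets whose language cannot be detected, are coded as NaN.
    """
    try:
        if detect(x) == 'en':
            return x
        else:
            return np.nan
    except Exception:
        # detection can fail (e.g. on empty text); treat those tweets as missing too
        return np.nan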


In [38]:
%%time
# code any non-English tweets as NaN
df_raw['clean'] = df_raw['clean'].apply(english_only)


CPU times: user 12min 4s, sys: 6.02 s, total: 12min 10s
Wall time: 12min 14s


In [39]:
df_raw.clean.isnull().sum()


1251

In [40]:
# drop non-English tweets; copy so that later column assignments act on an independent dataframe
df_raw = df_raw[df_raw.clean.notnull()].copy()
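The control flow of the language filter can be checked without running the detector itself. In the sketch below, `toy_detect` is a hypothetical stand-in for `langdetect.detect` (it is not the real detector), and the detector is passed in as a parameter purely so the example is self-contained.

```python
import math

def english_only(text, detect_fn):
    """Return text when detect_fn labels it 'en'; otherwise return NaN."""
    try:
        return text if detect_fn(text) == "en" else math.nan
    except Exception:
        # A failed detection is treated the same way as a non-English tweet
        return math.nan

def toy_detect(text):
    """Hypothetical stand-in for langdetect.detect, for illustration only."""
    if not text.strip():
        raise ValueError("no features in text")
    return "es" if "hola" in text.lower().split() else "en"

tweets = ["The election is next week", "Hola a todos", "   "]
kept = [t for t in tweets if isinstance(english_only(t, toy_detect), str)]
print(kept)  # ['The election is next week']
```

Both the non-English tweet and the blank one end up as missing values, so a single `notnull()` filter removes them together.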

### Tagging and Lemmatizing

We only want the nouns (and proper nouns) in each tweet for topic modeling, as they capture the essence of article subjects. Let's tag the nouns and return their lemmatized forms in a single step.

Because we have so much data, we will stream the tweets through spaCy's `nlp.pipe` in batches to shorten the processing time. Tips on how to do this were found here: https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad

In [41]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [42]:
def noun_lemmatize_pipe(doc):
    """
    Takes in a tweet processed by spaCy and returns only the lemmas of its nouns and proper nouns.
    """
    lemma_list = [token.lemma_ for token in doc
                  if token.pos_ in ("NOUN", "PROPN")]
    return lemma_list

# stream the tweets through nlp.pipe in batches to shorten processing time
def preprocess_pipe(texts):
    """
    Applies noun_lemmatize_pipe to every document streamed from nlp.pipe for faster processing.
    """
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=50):
        preproc_pipe.append(noun_lemmatize_pipe(doc))
    return preproc_pipe

In [43]:
%%time
#apply function and create a new column to house the outputs
df_raw['clean_lemmatized'] = preprocess_pipe(df_raw['clean'])


CPU times: user 2min 22s, sys: 670 ms, total: 2min 23s
Wall time: 2min 24s


In [79]:
df_raw[['content', 'clean', 'clean_lemmatized']].head(3)

Unnamed: 0,content,clean,clean_lemmatized
0,The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily Will this decade be the new s,"[decade, s]"
1,There's plenty that's going wrong for Trump. H...,Theres plenty thats going wrong for Trump Here...,"[plenty, Trump, thing, campaign, gap, Joe, Bid..."
2,"If Trump tries to steal the election, people w...",If Trump tries to steal the election people wi...,"[Trump, election, people, coup, strategy, write]"
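The part-of-speech filter itself is simple enough to illustrate without loading a model. The sketch below uses a hypothetical `Token` namedtuple that mimics only the two spaCy token attributes the filter reads (`lemma_` and `pos_`); the tags are written by hand here, whereas in the notebook they come from `en_core_web_sm`.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the two attributes the filter reads
Token = namedtuple("Token", ["lemma_", "pos_"])

def noun_lemmas(doc):
    """Keep the lemmas of tokens tagged NOUN or PROPN."""
    return [tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "PROPN")]

# Hand-tagged tokens for "Trump tries to steal the election"
doc = [
    Token("Trump", "PROPN"),
    Token("try", "VERB"),
    Token("to", "PART"),
    Token("steal", "VERB"),
    Token("the", "DET"),
    Token("election", "NOUN"),
]
print(noun_lemmas(doc))  # ['Trump', 'election']
```

Because the filter trusts the tagger, tagging mistakes flow straight into the output. Stray fragments left over from cleaning can be tagged as nouns, as the "s" in row 0 above shows.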


We've successfully extracted the nouns and proper nouns and lemmatized them, while keeping the run time manageable!

### Remove Additional Words

Let's filter out any additional words that may appear in the tweets but aren't related to article subjects, like the names of the publications and common headline section titles.

In [54]:
removal_words= ["Times", "Wall", "Street", "Journal", "New", "Yorker", "York", "Medium", "Wired", "Financial", "Washington", "Post", "Business", "Insider", "Economist", "The", "Atlantic", "Daily", "Weekly", "SPONSORED", "Sponsored", "BREAKING", "Breaking", "NEWS", "News" ]

In [73]:
df_raw['clean_lemmatized'] = df_raw['clean_lemmatized'].apply(lambda x: [word for word in x if word not in removal_words])
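One detail of this filter is that the membership test is case-sensitive, which is why the list carries several capitalizations of the same word. A minimal sketch, using an illustrative subset of the list and a made-up helper name:

```python
# Illustrative subset of removal_words, stored as a set for fast membership tests
removal_set = {"The", "Atlantic", "Daily", "News", "NEWS"}

def drop_words(lemmas, blocked):
    """Return the lemmas that are not in the blocked collection (exact, case-sensitive match)."""
    return [word for word in lemmas if word not in blocked]

print(drop_words(["Atlantic", "decade", "news", "News"], removal_set))  # ['decade', 'news']
```

Lowercase "news" survives because only "News" and "NEWS" are listed. Using a set instead of a list does not change the result; it only makes each lookup cheaper, which can matter across a large number of tweets.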


### Vectorize
Now let's rejoin each twice-cleaned, lemmatized list of nouns and proper nouns into a single string!

In [74]:
df_raw['clean_final'] = df_raw['clean_lemmatized'].apply(lambda x: ' '.join(x))

In [59]:
df_raw[['content', 'clean', 'clean_lemmatized', 'clean_final']].head(3)

Unnamed: 0,content,clean,clean_lemmatized,clean_final
0,The Atlantic Daily: Will this decade be the ne...,The Atlantic Daily Will this decade be the new s,"[decade, s]",decade s
1,There's plenty that's going wrong for Trump. H...,Theres plenty thats going wrong for Trump Here...,"[plenty, Trump, thing, campaign, gap, Joe, Bid...",plenty Trump thing campaign gap Joe Biden report
2,"If Trump tries to steal the election, people w...",If Trump tries to steal the election people wi...,"[Trump, election, people, coup, strategy, write]",Trump election people coup strategy write


We will use the TF-IDF vectorizer (`TfidfVectorizer`) rather than the count vectorizer, because TF-IDF down-weights words that appear in many tweets, so rare words are not drowned out by frequent ones. Because our focus is on nouns and proper nouns, rare words are likely to be just as impactful as frequently used words, if not more so.
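To see how the weighting favors rare words, here is a small pure-Python sketch of smoothed TF-IDF with L2-normalized rows, which is my understanding of scikit-learn's default formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1. It is an illustration on a toy corpus, not the library's implementation: tokenization here is a plain whitespace split, and the helper name `tfidf_rows` is made up.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalized rows for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # guard empty docs
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy corpus: 'trump' is in every document, 'election' in two, 'coup' in one
docs = ["trump election coup", "trump campaign gap", "trump election poll"]
rows = tfidf_rows(docs)
for term, weight in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

In the first document the ordering is coup, then election, then trump: each term occurs once, so the difference comes entirely from how many documents contain it. A count vectorizer would give all three the same value.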

We'll also use some of the vectorizer's built-in parameters as final checks before we output our doc-term matrix.

In [91]:
# define the vectorizer: standardize to lowercase, remove English stop words, and drop any word that appears in fewer than 0.005% of tweets (about 10 tweets)
tfidf = TfidfVectorizer(lowercase=True, stop_words='english', min_df=0.00005)
# fit on the fully cleaned dataframe column
doc_term_matrix = tfidf.fit_transform(df_raw['clean_final'])
# turn the matrix into a dataframe with words as columns
# (get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later)
matrix_df = pd.DataFrame(doc_term_matrix.toarray(), columns=tfidf.get_feature_names_out())

In [92]:
matrix_df

Unnamed: 0,aaron,ab,abandonment,abbey,abbott,abby,abc,abe,abenomics,aberration,...,zoo,zoological,zoom,zooms,zoos,zoox,zora,zuckerberg,zuckerbergs,zuzana
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
198745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
198746,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
198747,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
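The "about 10" in the `min_df` comment can be checked against the matrix above, which has 198,748 rows (index 0 through 198747). A float `min_df` is read as a fraction of documents, and terms whose document frequency falls below that threshold are dropped. The arithmetic, as a quick sketch:

```python
import math

n_docs = 198_748   # rows in the document-term matrix shown above
min_df = 0.00005   # fraction of documents, as passed to TfidfVectorizer

threshold = min_df * n_docs        # document-frequency cutoff
min_docs = math.ceil(threshold)    # smallest whole number of tweets that meets the cutoff
print(round(threshold, 2), min_docs)  # 9.94 10
```

So a word has to appear in at least 10 tweets to earn a column in the matrix.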


In [93]:
matrix_df.columns[1:100]

Index(['ab', 'abandonment', 'abbey', 'abbott', 'abby', 'abc', 'abe',
       'abenomics', 'aberration', 'abes', 'abhijit', 'abigail', 'ability',
       'abiy', 'abolition', 'abolitionist', 'abortion', 'abraham', 'abrams',
       'absence', 'absentee', 'absolution', 'absurdity', 'abu', 'abundance',
       'abuse', 'abuser', 'abyss', 'ac', 'aca', 'academia', 'academic',
       'academy', 'acceleration', 'accelerator', 'accent', 'accenture',
       'acceptance', 'access', 'accessibility', 'accessory', 'accident',
       'acclaim', 'accommodation', 'accomplice', 'accomplishment', 'accord',
       'account', 'accountability', 'accountant', 'accounting', 'accounts',
       'accumulation', 'accuracy', 'accusation', 'accuser', 'ache',
       'achievement', 'achilles', 'acid', 'ackman', 'acknowledgement', 'aclu',
       'acne', 'acosta', 'acquaintance', 'acquisition', 'acquittal', 'acre',
       'acrimony', 'acronym', 'act', 'acting', 'action', 'activism',
       'activist', 'activity', 'actor',

### Next Steps
Our doc-term matrix looks great! We have cleaned out special characters and non-English tweets, and we have extracted the nouns and proper nouns in their lemmatized forms. Now we can topic model in our next notebook!
