# Data Preprocessing

This notebook preprocesses the Twitter text so that the tweets can be passed to the Watson Natural Language API.

In [1]:
# Libraries

import numpy as np
import pandas as pd
import spacy
from spacy.lang.en import English

In [2]:
# The file is semicolon-delimited; pass the separator by keyword.
data = pd.read_csv("trumptweets.csv", sep=";")

In [3]:
data

Unnamed: 0,username,date,retweets,favorites,text,geo,mentions,hashtags,id,permalink
0,realDonaldTrump,09.02.20 00:47,13459,72445,A great coach and a fantastic guy. His endorse...,,,,"1,22629E+18",https://twitter.com/realDonaldTrump/status/122...
1,realDonaldTrump,08.02.20 22:08,47880,215503,Pete Rose played Major League Baseball for 24 ...,,,,"1,22625E+18",https://twitter.com/realDonaldTrump/status/122...
2,realDonaldTrump,08.02.20 20:48,9452,37402,Total and complete Endorsement for Debbie Lesk...,,#NAME?,,"1,22623E+18",https://twitter.com/realDonaldTrump/status/122...
3,realDonaldTrump,08.02.20 20:40,17545,62484,Governor Cuomo wanted to see me this weekend. ...,,,,"1,22623E+18",https://twitter.com/realDonaldTrump/status/122...
4,realDonaldTrump,08.02.20 20:01,27437,120598,We will not be touching your Social Security o...,,,,"1,22622E+18",https://twitter.com/realDonaldTrump/status/122...
...,...,...,...,...,...,...,...,...,...,...
10245,realDonaldTrump,07.01.17 16:02,24681,87739,Having a good relationship with Russia is a go...,,,,"8,17748E+17",https://twitter.com/realDonaldTrump/status/817...
10246,realDonaldTrump,07.01.17 13:03,16601,73661,Only reason the hacking of the poorly defended...,,,,"8,17703E+17",https://twitter.com/realDonaldTrump/status/817...
10247,realDonaldTrump,07.01.17 12:56,15401,60280,Intelligence stated very strongly there was ab...,,,,"8,17701E+17",https://twitter.com/realDonaldTrump/status/817...
10248,realDonaldTrump,07.01.17 04:53,13961,59218,Gross negligence by the Democratic National Co...,,,,"8,1758E+17",https://twitter.com/realDonaldTrump/status/817...


In [4]:
text = data.text

In [5]:
text

0        A great coach and a fantastic guy. His endorse...
1        Pete Rose played Major League Baseball for 24 ...
2        Total and complete Endorsement for Debbie Lesk...
3        Governor Cuomo wanted to see me this weekend. ...
4        We will not be touching your Social Security o...
                               ...                        
10245    Having a good relationship with Russia is a go...
10246    Only reason the hacking of the poorly defended...
10247    Intelligence stated very strongly there was ab...
10248    Gross negligence by the Democratic National Co...
10249    Happy Birthday @EricTrump ! https://www. faceb...
Name: text, Length: 10250, dtype: object

In [6]:
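# Only tokenization and lemmas are needed, so skip the heavier components.
# The named-entity recognizer is registered under the pipe name "ner".
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])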
#nlp = spacy.load("en")
#nlp_en = English()  # Includes English data

def preprocess(text):
    doc = nlp(text)
    
    # Collect the lemmas of the tokens we keep.
    result = []

    # Steps performed in this function:
    # - tokenize: nlp() splits the text into tokens
    # - is_alpha: keep only tokens that consist of alphabetic characters
    # - lemma: store the base form of each kept token
    # Note: stop words (the most common words of the language, token.is_stop)
    # are not filtered out yet.

    for token in doc:
        if token.is_alpha:
            result.append(token.lemma_)
    return result

In [7]:
print(preprocess(text[0]))
print(text[0])

['A', 'great', 'coach', 'and', 'a', 'fantastic', 'guy', 'His', 'endorsement', 'of', 'me', 'in', 'Indiana', 'be', 'a', 'very', 'big', 'deal']
A great coach and a fantastic guy. His endorsement of me in Indiana was a very big deal! https:// twitter.com/kyle__boone/st atus/1226234981808250880 …


In [8]:
text[40:60]

40    @nytimes “The Votes Were A Resounding Victory ...
41                           pic.twitter.com/FIg1SYtJcy
42    I will be making a public statement tomorrow a...
43                           pic.twitter.com/JDS4zUXXJG
44                           pic.twitter.com/lj8RUqcz37
45    “The Democrats want to run a Country, and they...
46    Thank you Jonathan, and great job! https:// tw...
47    It was a great and triumphant evening for our ...
48                 #SOTU2020 pic.twitter.com/W03gQLkdpo
49                           pic.twitter.com/QeFYDg9jZ0
50    My Approval Rating in the Republican Party = 9...
51    Market up big today on very good economic news...
52    The Democrat Party in Iowa really messed up, b...
53    When will the Democrats start blaming RUSSIA, ...
54    It is not the fault of Iowa, it is the Do Noth...
55    The Democrat Caucus is an unmitigated disaster...
56           Big WIN for us in Iowa tonight. Thank you!
57    Many people do not know what a great guy &
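
Several rows in this slice contain nothing but a `pic.twitter.com/...` media link, so `preprocess` would return an empty list for them. The following is a minimal sketch, using only the standard library, of how such rows might be flagged before any text is sent to the API. The names `MEDIA_LINK`, `strip_media_links`, and `is_media_only` are hypothetical helpers and not part of the original notebook, and the pattern is an assumption based on the links visible in the output above.

```python
import re

# Matches Twitter media links such as "pic.twitter.com/FIg1SYtJcy".
# Assumption: media links always take this form in the dataset.
MEDIA_LINK = re.compile(r"pic\.twitter\.com/\S+")


def strip_media_links(tweet: str) -> str:
    """Remove pic.twitter.com links and collapse the leftover whitespace."""
    without_links = MEDIA_LINK.sub(" ", tweet)
    return " ".join(without_links.split())


def is_media_only(tweet: str) -> bool:
    """True if no alphabetic character remains once media links are removed."""
    remainder = strip_media_links(tweet)
    return not any(ch.isalpha() for ch in remainder)


samples = [
    "pic.twitter.com/FIg1SYtJcy",
    "#SOTU2020 pic.twitter.com/W03gQLkdpo",
    "Big WIN for us in Iowa tonight. Thank you!",
]
media_flags = [is_media_only(s) for s in samples]
```

Assuming every entry of `text` is a string, something like `text[~text.map(is_media_only)]` could then drop the media-only rows. Note that this check works on raw characters, so a row such as `#SOTU2020 pic.twitter.com/W03gQLkdpo` is kept here even though it may still yield no alphabetic tokens after spaCy tokenization.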

In [9]:
print(" ".join(preprocess(text[59])))
print(text[59])

where the Whistleblower where the 2 Whistleblower where the Informer Why do Corrupt politician Schiff MAKE UP my conversation with the Ukrainian President Why do the House do its job And sooo much much
Where’s the Whistleblower? Where’s the second Whistleblower? Where’s the Informer? Why did Corrupt politician Schiff MAKE UP my conversation with the Ukrainian President??? Why didn’t the House do its job? And sooo much more!


In [10]:
print(" ".join(preprocess(text[142])))
print(text[142])

هذا ما قد تبدو عليه دولة فلسطين المستقبلية بعاصمة في أجزاء من القدس الشرقية
هذا ما قد تبدو عليه دولة فلسطين المستقبلية بعاصمة في أجزاء من القدس الشرقية. pic.twitter.com/CFuYwwjSso


In [11]:
doc = nlp(text[142])
doc

هذا ما قد تبدو عليه دولة فلسطين المستقبلية بعاصمة في أجزاء من القدس الشرقية. pic.twitter.com/CFuYwwjSso

In [12]:
doc.lang_

'en'

In [19]:
for token in doc:
    print(token.is_alpha)

True
True
True
True
True
True
True
True
True
True
True
True
True
True
False
False
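
The outputs above show two things: `doc.lang_` is the language of the loaded pipeline's vocabulary (here `'en'`), not a detected language, and `token.is_alpha` is `True` for the Arabic words because it is equivalent to `token.text.isalpha()`, which is Unicode-aware. One possible workaround, sketched below with the standard library only, is to additionally require ASCII letters. The helpers `is_ascii_alpha` and `ascii_alpha_share` are hypothetical names introduced for illustration, and whitespace splitting stands in for spaCy tokenization here.

```python
def is_ascii_alpha(token_text: str) -> bool:
    """True only for non-empty strings made of ASCII letters (a-z, A-Z)."""
    return token_text.isascii() and token_text.isalpha()


def ascii_alpha_share(tokens: list) -> float:
    """Share of alphabetic tokens that are pure ASCII; 0.0 if none are alphabetic."""
    alpha = [t for t in tokens if t.isalpha()]
    if not alpha:
        return 0.0
    return sum(is_ascii_alpha(t) for t in alpha) / len(alpha)


# Whitespace-split stand-ins for tokenized tweets.
english_tokens = "We will not be touching your Social Security".split()
arabic_tokens = "هذا ما قد تبدو عليه دولة فلسطين".split()

english_share = ascii_alpha_share(english_tokens)
arabic_share = ascii_alpha_share(arabic_tokens)
```

A tweet with a low share could be skipped or routed elsewhere. This is a crude heuristic rather than real language identification: it would also reject English words that contain accented letters, and it cannot tell English apart from other languages written in ASCII letters.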


In [14]:
doc = nlp(text[141])
doc.text
doc.lang_

'en'