
In [1]:
!pip install tweet-preprocessor -qq

In [2]:
import pandas as pd
import preprocessor as prepro
import spacy

load up data

In [3]:
data_congress = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/pol_tweets.gz')

In [4]:
data_congress

Unnamed: 0,text,labels
340675,RT @GreenBeretFound Today we remember Sgt. 1st...,0
289492,"Yes, yes, yes, yes and yes. 😷 #JerseyStrong 💪🏾...",1
371088,Made new friends this afternoon delivering mas...,1
82212,RT @TXMilitary Happening TODAY: Pilots with th...,0
476047,RT @SteveScalise President Trump's legal team ...,0
...,...,...
61499,Outrageous.\n\nBrave health care workers are p...,0
185562,RT @dskolnick .@RepTimRyan proposes up to $3K ...,1
354040,It is clear that the #HeroesAct will help tens...,1
708686,Democrats are talking about Bolton and Mulvane...,0


preprocessing

In [5]:
# prepro settings
prepro.set_options(prepro.OPT.URL, prepro.OPT.EMOJI, prepro.OPT.NUMBER, prepro.OPT.RESERVED, prepro.OPT.MENTION, prepro.OPT.SMILEY)

In [6]:
data_congress['text_clean'] = data_congress['text'].map(lambda t: prepro.clean(t))

In [7]:
data_congress['text_clean'] = data_congress['text_clean'].str.replace('#','')
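
To make the effect of these options concrete, here is a rough sketch of the same cleanup using only Python's `re` module. The patterns and the `rough_clean` name are assumptions for illustration; they only approximate what tweet-preprocessor does, and emoji and smiley removal is left out. The example tweet is made up.

```python
import re

# Hypothetical patterns that only approximate the options set above;
# tweet-preprocessor's own expressions are more thorough.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
RESERVED = re.compile(r"\b(?:RT|FAV)\b")
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def rough_clean(text):
    """Strip URLs, mentions, reserved words and numbers, then drop '#'."""
    for pattern in (URL, MENTION, RESERVED, NUMBER):
        text = pattern.sub(" ", text)
    text = text.replace("#", "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())

print(rough_clean("RT @SteveScalise 3 reasons to back the #HeroesAct https://t.co/abc"))
```

Note that the hashtag text itself is kept (only the `#` is dropped), so hashtags can still end up as features later on.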

bootstrap dictionary with spacy (add-on)
Here we take a sample of 5000 tweets and build a dictionary containing only `'NOUN', 'PROPN', 'ADJ', 'ADV'` tokens. The assumption is that this captures "more relevant" words. We also remove stopwords and lemmatize.


In [8]:
# "en" was a spaCy 2.x shortcut; the full model name also works in spaCy 3
nlp = spacy.load("en_core_web_sm")

In [22]:
tokens = []

for tweet in nlp.pipe(data_congress.sample(5000)['text_clean']):
  tweet_tok = [token.lemma_.lower() for token in tweet if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] and not token.is_stop] 
  tokens.extend(tweet_tok)

In [23]:
# here we create this dictionary and in the next step it is used in the tokenization
bootstrap_dictionary = list(set(tokens))
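
One thing to keep in mind: the dictionary holds lowercased lemmas, while the vectorizer later sees the cleaned (not lemmatized) tweets. A fixed vocabulary only counts tokens that match an entry exactly. The sketch below imitates that matching in plain Python, assuming scikit-learn's default tokenization (lowercasing, tokens of two or more word characters). `count_in_vocabulary` and the toy vocabulary are hypothetical.

```python
import re
from collections import Counter

# Assumption: mimics scikit-learn's default tokenization
# (lowercasing plus tokens of two or more word characters).
TOKEN = re.compile(r"\b\w\w+\b")

def count_in_vocabulary(text, vocabulary):
    """Count only those tokens that appear verbatim in the vocabulary."""
    vocab = set(vocabulary)
    tokens = TOKEN.findall(text.lower())
    return Counter(tok for tok in tokens if tok in vocab)

toy_vocabulary = ["tax", "climate", "worker"]  # hypothetical lemmas
counts = count_in_vocabulary(
    "Climate action protects workers and cuts taxes", toy_vocabulary
)
print(counts)
```

Here only `climate` is counted: `workers` and `taxes` do not match the lemmas `worker` and `tax`, so inflected forms are silently ignored.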

repack preprocessing into a function

In [26]:
# now only for the cleanup. The vectorizer is removed and put into the pipeline below.

def preprocessTweets(data_tweets):
  clean_text = data_tweets.map(lambda t: prepro.clean(t))
  clean_text = clean_text.str.replace('#','')
  return clean_text

vectorization and SML part

Here we also add random undersampling (using imblearn) to improve the recall on the underrepresented class "rep" (0).
For this to work we build a pipeline that contains the TF-IDF vectorization, the undersampling and the logistic regression. This bundles all steps together so that we don't have to execute them individually every time.
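
As a rough illustration of what random undersampling does, the sketch below shrinks every class to the size of the smallest one by drawing without replacement. It is a plain-Python stand-in, not imblearn's implementation; `random_undersample` and the toy data are made up for this example.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = min(len(group) for group in by_label.values())
    kept_samples, kept_labels = [], []
    for label, group in by_label.items():
        for sample in rng.sample(group, target):  # without replacement
            kept_samples.append(sample)
            kept_labels.append(label)
    return kept_samples, kept_labels

# Toy data: three "rep" (0) tweets against five "dem" (1) tweets.
X_toy = [f"tweet {i}" for i in range(8)]
y_toy = [0, 0, 0, 1, 1, 1, 1, 1]
X_bal, y_bal = random_undersample(X_toy, y_toy)
print(sorted(y_bal))
```

Inside an imblearn pipeline the sampler is applied only while fitting, so the test data and new tweets are scored without being resampled.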

In [69]:
X = data_congress['text_clean']
y = data_congress['labels']

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)

In [42]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

model = make_pipeline_imb(TfidfVectorizer(vocabulary=bootstrap_dictionary), 
                          RandomUnderSampler(),
                          LogisticRegression())

In [43]:
model.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token...
                 RandomUnderSampler(random_state=None, ratio=None,
                                    replacement=False, return_indices=False,
                                    sampling_strategy='auto')),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight

In [44]:
model.score(X_test, y_test)

0.7579

In [45]:
from sklearn.metrics import classification_report

In [46]:
y_pred = model.predict(X_test)

In [47]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.65      0.77      0.70      3760
           1       0.84      0.75      0.79      6240

    accuracy                           0.76     10000
   macro avg       0.75      0.76      0.75     10000
weighted avg       0.77      0.76      0.76     10000
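
The derived columns of the report can be checked by hand. The sketch below recomputes the F1 scores, the macro-averaged recall and the accuracy from the rounded values printed above, so small rounding differences against the exact score are expected.

```python
# Rounded values copied from the classification report above.
report = {
    0: {"precision": 0.65, "recall": 0.77, "support": 3760},
    1: {"precision": 0.84, "recall": 0.75, "support": 6240},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

total = sum(row["support"] for row in report.values())
f1_scores = {
    label: f1(row["precision"], row["recall"]) for label, row in report.items()
}
macro_recall = sum(row["recall"] for row in report.values()) / len(report)
# Accuracy equals the support-weighted average of the per-class recalls.
accuracy = sum(row["recall"] * row["support"] for row in report.values()) / total

print({label: round(score, 2) for label, score in f1_scores.items()})
print(round(macro_recall, 2), round(accuracy, 2))
```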



apply to new data

In [48]:
data_tweets = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/pres_debate_2020.gz')

In [49]:
data_tweets

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,user_id,user_id_str,username,name,day,hour,link,urls,photos,video,thumbnail,nlikes,nreplies,nretweets,quote_url,search,near,geo,source,reply_to,translate,trans_src,trans_dest
0,1318944772183281664,1318944772183281664,2020-10-21 15:58:33,2020-10-21 15:58:33,0,,Still time to register: Students can join the ...,en,[presidentialdebate2020],[],1130857348921036802,1130857348921036800,UVADemocracy,UVA Democracy Initiative,3,15,https://twitter.com/UVADemocracy/status/131894...,[https://bit.ly/349NTIU],[https://pbs.twimg.com/media/Ek3UXC1X0AAw47D.png],1,https://pbs.twimg.com/media/Ek3UXC1X0AAw47D.png,2,0,2,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
1,1318938583122743296,1318938583122743296,2020-10-21 15:33:57,2020-10-21 15:33:57,0,,Be prepared for Trump to railroad Thursday’s d...,en,[presidentialdebate2020],[],243363569,243363569,kevinjguest,Kevin Guest,3,15,https://twitter.com/kevinjguest/status/1318938...,[],[],0,,0,0,0,https://twitter.com/donaldjtrumpjr/status/1318...,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
2,1318932554897031168,1318932554897031168,2020-10-21 15:10:00,2020-10-21 15:10:00,0,,Join us tomorrow from 5-8pm as @michaelpleahy ...,en,[presidentialdebate2020],[],26819436,26819436,TalkradioWLAC,Talkradio WLAC,3,15,https://twitter.com/TalkradioWLAC/status/13189...,[https://wlac.iheart.com/calendar/event/5f8df3...,[],0,,0,0,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
3,1318928783169245184,1318928783169245184,2020-10-21 14:55:01,2020-10-21 14:55:01,0,,Wanna bet #ProudBoys comes up #PresidentialDeb...,en,"[proudboys, presidentialdebate2020]",[],298018860,298018860,PBPoliticsFins,Antonio Fins,3,14,https://twitter.com/PBPoliticsFins/status/1318...,[https://www.palmbeachpost.com/story/news/2020...,[],0,,0,0,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
4,1318927150247018496,1318927150247018496,2020-10-21 14:48:31,2020-10-21 14:48:31,0,,RT College Tour @BelmontUniv was spotless. Gor...,en,"[musiccity, presidentialdebate2020]",[],4159192877,4159192877,12thSouth,12th South,3,14,https://twitter.com/12thSouth/status/131892715...,[],[https://pbs.twimg.com/media/Ek3CBOhXYAIpo8C.j...,1,https://pbs.twimg.com/media/Ek3CBOhXYAIpo8C.jpg,0,0,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8806,1311705601597222912,1311705601597222912,2020-10-01 16:32:40,2020-10-01 16:32:40,0,,Check out my Gig on Fiverr: do email marketing...,en,"[october1st, worsttrumpever, thursdaythoughts,...",[],1294253523769667586,1294253523769667584,kvngmujebo,Kvngmujebo,4,16,https://twitter.com/kvngmujebo/status/13117056...,[https://www.fiverr.com/share/1KRDaK],[],0,,0,0,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
8807,1311705488531156992,1311705488531156992,2020-10-01 16:32:13,2020-10-01 16:32:13,0,,This was made in 2012! It’s exactly like what ...,en,"[democrats, republicans, presidentialdebate202...",[],25635056,25635056,AZLatina480,Jessica Williams,4,16,https://twitter.com/AZLatina480/status/1311705...,[],[],0,,0,0,0,https://twitter.com/JonnyEthco/status/13113395...,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
8808,1311705196657958912,1311705196657958912,2020-10-01 16:31:03,2020-10-01 16:31:03,0,,How you finna lose two swing states with one q...,en,[presidentialdebate2020],[],382541164,382541164,SampsonRaySimon,Sampson Ray Simon,4,16,https://twitter.com/SampsonRaySimon/status/131...,[],[],0,,0,0,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
8809,1311704929090891776,1311704929090891776,2020-10-01 16:30:00,2020-10-01 16:30:00,0,,"This morning on the @ArleneBynonShow, @gmacofg...",en,"[erinotoole, blanchet, houseofcommons, trudeau...",[],1343036322,1343036322,SXMCanadaTalks,SXMCanadaTalks,4,16,https://twitter.com/SXMCanadaTalks/status/1311...,[https://soundcloud.com/canadatalks/political-...,[],0,,3,2,3,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,


In [50]:
X_new_tweets = preprocessTweets(data_tweets['tweet'])

In [51]:
predictions_new_tweets = model.predict_proba(X_new_tweets)

In [52]:
predictions_new_tweets

array([[0.27158913, 0.72841087],
       [0.63047388, 0.36952612],
       [0.375382  , 0.624618  ],
       ...,
       [0.45854787, 0.54145213],
       [0.51437921, 0.48562079],
       [0.62824224, 0.37175776]])

In [53]:
data_tweets['dem_probability'] = predictions_new_tweets[:,1]

In [54]:
for tweet in data_tweets.sort_values('dem_probability')['tweet'][:10]:
  print(tweet)
  print('\n')

#PresidentialDebate2020 elections have been infiltrated by Chinese Communist Party !! Sign the Petition to Investigate, Condemn &amp; Reject the Chinese Communist Party #CCP #BidenCrimeFamily #COVID19 #Subversion   https://reject ccp  https://t.co/PNO7Vx4JSD


@CBSNews #PresidentialDebate2020 RIGGED SYSTEM! What the American People Want for topics: De-funding The police? Biden's Mental Fitness/Kamala Plan to Run Country? Increasing Taxes? Biden's Russian/China Collusion? Hunter's Relationship China/Russia? USA SHOULD DEMAND!  #LyingJoeBiden


I can start marking all these Pro #Biden hashtags made by Chinese bots as spam now finally???? my god!!! Did Twitter eff up???? Did #chinabitchbiden trigger something???? #bidencares #votebidenharris #bidenharrislandslide #PresidentialDebate2020 #Trump2020 #MAGA @realDonaldTrump  https://t.co/BWjmt59zqx


#PresidentialDebate2020 elections have been infiltrated by Chinese Communist Party !! Sign the Petition to Investigate, Condemn &amp; Reject the

In [55]:
for tweet in data_tweets.sort_values('dem_probability')['tweet'][-10:]:
  print(tweet)
  print('\n')

#PresidentialDebate2020   ? for Trump: @JoeBiden wills to lead for ALL Americans, so President Trump, what is it that you think your fans will desire of #BidenHarris2020 should they win the election?  #BidenHarris2020


The racist president’s words written in black and white. #Debate2020 #PresidentialDebate2020   https://t.co/Ed9njTi8R3


This segment on climate change gives me chills... it’s scary how much they don’t believe in science and climate change 💔 #PresidentialDebate2020


@JoeBiden It is not a problem of recognizing black people for their contributions, Mr. Vice President, it is about equal opportunities and rights.  THAT IS REAL CHANGE! #EstamosCambiando  #PresidentialDebate2020  https://t.co/z2HenQSet1


An appalling report on #PresidentialDebate2020 @BBCNews You should be ashamed of yourselves. No balance on racism or analysis of the president’s lack of condemnation of white supremacy . Instead an ending suggesting ‘this is why people will support Donald Trump’ We expect 

explainability

In [56]:
!pip -q install eli5

[?25l[K     |███                             | 10 kB 21.4 MB/s eta 0:00:01[K     |██████▏                         | 20 kB 25.2 MB/s eta 0:00:01[K     |█████████▎                      | 30 kB 12.8 MB/s eta 0:00:01[K     |████████████▍                   | 40 kB 9.7 MB/s eta 0:00:01[K     |███████████████▌                | 51 kB 5.2 MB/s eta 0:00:01[K     |██████████████████▌             | 61 kB 5.4 MB/s eta 0:00:01[K     |█████████████████████▋          | 71 kB 6.0 MB/s eta 0:00:01[K     |████████████████████████▊       | 81 kB 6.7 MB/s eta 0:00:01[K     |███████████████████████████▉    | 92 kB 7.0 MB/s eta 0:00:01[K     |███████████████████████████████ | 102 kB 5.5 MB/s eta 0:00:01[K     |████████████████████████████████| 106 kB 5.5 MB/s 
[?25h

In [61]:
import eli5
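# model[0] is the fitted vectorizer and model[2] the classifier inside the pipeline.
# There is no standalone `vectorizer` variable any more, so the feature names come from model[0].
# Newer scikit-learn releases rename this method to get_feature_names_out().
eli5.show_weights(model[2],
                  feature_names=model[0].get_feature_names(), target_names=['rep','dem'], top=20)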

Weight?,Feature
+4.632,trump
+4.420,forthepeople
+4.098,climate
+3.917,democracy
+3.774,black
+3.556,heroesact
+3.473,trumps
+3.319,aca
… 4907 more positive …,… 4907 more positive …
… 4186 more negative …,… 4186 more negative …


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [65]:
data_tweets['clean_tweet'] = preprocessTweets(data_tweets['tweet'])

In [68]:
eli5.show_prediction(model[2], data_tweets['clean_tweet'][5237], vec=model[0], target_names=['rep','dem'])

Contribution?,Feature
0.437,Highlighted in text (sum)
-0.406,<BIAS>
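
To read this table: in a logistic regression the contributions add up to a score (log-odds), which the logistic function turns into a probability. The sketch below does that arithmetic, assuming the two rows above are the complete list of contributions and that the score refers to the "dem" class.

```python
import math

# Contributions copied from the eli5 table above.
contributions = {"Highlighted in text (sum)": 0.437, "<BIAS>": -0.406}

score = sum(contributions.values())           # log-odds for the "dem" class
dem_probability = 1 / (1 + math.exp(-score))  # logistic (sigmoid) function

print(round(score, 3), round(dem_probability, 3))
```

A score this close to zero means the model is nearly undecided for this tweet: the words in the text barely outweigh the negative bias term.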
