<a href="https://colab.research.google.com/github/SDS-AAU/DSBA-2021/blob/master/static/notebooks/DSBA21_M2W2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install tweet-preprocessor -qq

In [2]:
import pandas as pd
import preprocessor as prepro
import spacy

load up data

In [3]:
data_congress = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/pol_tweets.gz')

In [4]:
data_congress

Unnamed: 0,text,labels
340675,RT @GreenBeretFound Today we remember Sgt. 1st...,0
289492,"Yes, yes, yes, yes and yes. 😷 #JerseyStrong 💪🏾...",1
371088,Made new friends this afternoon delivering mas...,1
82212,RT @TXMilitary Happening TODAY: Pilots with th...,0
476047,RT @SteveScalise President Trump's legal team ...,0
...,...,...
61499,Outrageous.\n\nBrave health care workers are p...,0
185562,RT @dskolnick .@RepTimRyan proposes up to $3K ...,1
354040,It is clear that the #HeroesAct will help tens...,1
708686,Democrats are talking about Bolton and Mulvane...,0


preprocessing

In [5]:
# prepro settings
prepro.set_options(prepro.OPT.URL, prepro.OPT.EMOJI, prepro.OPT.NUMBER, prepro.OPT.RESERVED, prepro.OPT.MENTION, prepro.OPT.SMILEY)

In [6]:
data_congress['text_clean'] = data_congress['text'].map(lambda t: prepro.clean(t))

In [7]:
data_congress['text_clean'] = data_congress['text_clean'].str.replace('#','')
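To make the effect of the cleaning options more tangible, here is a rough regex-based stand-in. It is not the tweet-preprocessor implementation; the patterns and the helper name `rough_clean` are assumptions for illustration only, and emojis and smileys are not handled.

```python
import re

# Hypothetical, simplified patterns; the real library uses its own rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RESERVED_RE = re.compile(r"\b(?:RT|FAV)\b")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def rough_clean(text):
    """Approximate the cleanup: drop URLs, mentions, reserved words and numbers."""
    for pattern in (URL_RE, MENTION_RE, RESERVED_RE, NUMBER_RE):
        text = pattern.sub("", text)
    # Same as the second step above: keep the hashtag word, drop the '#'.
    text = text.replace("#", "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())

example = "RT @user Join us at 5 for #HeroesAct https://t.co/abc"
print(rough_clean(example))
```

Keeping the hashtag word while dropping the `#` lets tags such as `heroesact` end up as ordinary tokens later on.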

bootstrap dictionary with spacy (add-on)
Here we take a sample of 5000 tweets and build a dictionary containing only tokens tagged `'NOUN', 'PROPN', 'ADJ', 'ADV'`. The assumption is that this captures "more relevant" words. We also remove stopwords and lemmatize.


In [9]:
nlp = spacy.load("en_core_web_sm")

In [10]:
tokens = []

for tweet in nlp.pipe(data_congress.sample(5000)['text_clean']):
  tweet_tok = [token.lemma_.lower() for token in tweet if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] and not token.is_stop] 
  tokens.extend(tweet_tok)

In [13]:
# here we create this dictionary and in the next step it is used in the tokenization
bootstrap_dictionary = list(set(tokens))
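What a fixed vocabulary means for the vectorizer can be sketched in plain Python: only terms that are in the dictionary get a column, everything else is ignored. This is a simplified stand-in, not `TfidfVectorizer` itself (no IDF weighting, no normalization, naive whitespace tokenization). The helper name `count_with_vocabulary` and the toy tokens are made up for the example.

```python
def count_with_vocabulary(text, vocabulary):
    """Count only the tokens that appear in a fixed vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    for token in text.lower().split():
        if token in index:  # out-of-vocabulary tokens are silently dropped
            counts[index[token]] += 1
    return counts

# Toy stand-in for the lemmas collected from spaCy.
toy_tokens = ["climate", "trump", "climate", "democracy"]

# Sorting the de-duplicated tokens gives a reproducible column order.
toy_vocabulary = sorted(set(toy_tokens))

toy_counts = count_with_vocabulary("Climate action protects democracy now", toy_vocabulary)
print(toy_vocabulary)
print(toy_counts)
```

Words outside the dictionary ("action", "protects", "now") contribute nothing, which is the intended filtering effect of the bootstrapped dictionary.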

repack preprocessing into a function

In [14]:
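# Cleanup only. The vectorizer is no longer part of this function; it now lives in the pipeline below.

def preprocessTweets(data_tweets):
  """Clean a pandas Series of raw tweets the same way as the training text.

  Relies on the options set earlier via prepro.set_options(...).
  """
  clean_text = data_tweets.map(lambda t: prepro.clean(t))
  clean_text = clean_text.str.replace('#','')
  return clean_text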

vectorization and SML part

Here we also add random undersampling (using imblearn) to improve recall on the underrepresented class "rep" (0).
For this to work we build a pipeline that contains the TF-IDF vectorization, the undersampling, and the logistic regression. This bundles all steps together so that we don't have to execute each of them individually every time.
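The idea behind random undersampling can be sketched without any library: every class is cut down at random to the size of the smallest class, so the classifier sees a balanced training set. This is a minimal illustration, not imblearn's implementation; the function name `random_undersample` and the toy data are assumptions.

```python
import random
from collections import defaultdict

def random_undersample(texts, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()  # preserve the original order of the kept rows
    return [texts[i] for i in keep], [labels[i] for i in keep]

toy_texts = ["a", "b", "c", "d", "e"]
toy_labels = [0, 1, 1, 1, 0]  # class 0 is the minority, as "rep" is here
X_bal, y_bal = random_undersample(toy_texts, toy_labels)
print(y_bal.count(0), y_bal.count(1))
```

In the imblearn pipeline the sampler is applied only while fitting; at prediction time the data passes through unchanged, which is why the test set keeps its original class balance.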

In [15]:
X = data_congress['text_clean']
y = data_congress['labels']

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)

In [17]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

model = make_pipeline_imb(TfidfVectorizer(vocabulary=bootstrap_dictionary), 
                          RandomUnderSampler(),
                          LogisticRegression())

In [18]:
model.fit(X_train, y_train)

Pipeline(steps=[('tfidfvectorizer',
                 TfidfVectorizer(vocabulary=['chainyet', 'meal', 'we.put',
                                             'hug', 'connect', 'foolish',
                                             'packed', 'barnwell', 'makeup',
                                             'repunderwood', 'inexcusable',
                                             'belcher', 'postalservice',
                                             'provocation', 'produce', 'pastie',
                                             'brindaba', 'coronavius',
                                             'dangerously', 'chaps', 'burrow',
                                             'money', 'uscg', 'jim', 'small',
                                             'scope', 'specifically', 'marble',
                                             'freethefamilie', 'crafting', ...])),
                ('randomundersampler', RandomUnderSampler()),
                ('logisticregression', LogisticRegressi

In [19]:
model.score(X_test, y_test)

0.76

In [20]:
from sklearn.metrics import classification_report

In [21]:
y_pred = model.predict(X_test)

In [22]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.65      0.77      0.71      3760
           1       0.84      0.76      0.80      6240

    accuracy                           0.76     10000
   macro avg       0.75      0.76      0.75     10000
weighted avg       0.77      0.76      0.76     10000



apply to new data

In [23]:
data_tweets = pd.read_json('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/pres_debate_2020.gz')

In [24]:
data_tweets

Unnamed: 0,id,conversation_id,created_at,date,timezone,place,tweet,language,hashtags,cashtags,...,nretweets,quote_url,search,near,geo,source,reply_to,translate,trans_src,trans_dest
0,1318944772183281664,1318944772183281664,2020-10-21 15:58:33,2020-10-21 15:58:33,0,,Still time to register: Students can join the ...,en,[presidentialdebate2020],[],...,2,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
1,1318938583122743296,1318938583122743296,2020-10-21 15:33:57,2020-10-21 15:33:57,0,,Be prepared for Trump to railroad Thursday’s d...,en,[presidentialdebate2020],[],...,0,https://twitter.com/donaldjtrumpjr/status/1318...,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
2,1318932554897031168,1318932554897031168,2020-10-21 15:10:00,2020-10-21 15:10:00,0,,Join us tomorrow from 5-8pm as @michaelpleahy ...,en,[presidentialdebate2020],[],...,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
3,1318928783169245184,1318928783169245184,2020-10-21 14:55:01,2020-10-21 14:55:01,0,,Wanna bet #ProudBoys comes up #PresidentialDeb...,en,"[proudboys, presidentialdebate2020]",[],...,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
4,1318927150247018496,1318927150247018496,2020-10-21 14:48:31,2020-10-21 14:48:31,0,,RT College Tour @BelmontUniv was spotless. Gor...,en,"[musiccity, presidentialdebate2020]",[],...,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8806,1311705601597222912,1311705601597222912,2020-10-01 16:32:40,2020-10-01 16:32:40,0,,Check out my Gig on Fiverr: do email marketing...,en,"[october1st, worsttrumpever, thursdaythoughts,...",[],...,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
8807,1311705488531156992,1311705488531156992,2020-10-01 16:32:13,2020-10-01 16:32:13,0,,This was made in 2012! It’s exactly like what ...,en,"[democrats, republicans, presidentialdebate202...",[],...,0,https://twitter.com/JonnyEthco/status/13113395...,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
8808,1311705196657958912,1311705196657958912,2020-10-01 16:31:03,2020-10-01 16:31:03,0,,How you finna lose two swing states with one q...,en,[presidentialdebate2020],[],...,0,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,
8809,1311704929090891776,1311704929090891776,2020-10-01 16:30:00,2020-10-01 16:30:00,0,,"This morning on the @ArleneBynonShow, @gmacofg...",en,"[erinotoole, blanchet, houseofcommons, trudeau...",[],...,3,,PresidentialDebate2020,,,,"{'user_id': None, 'username': None}",,,


In [25]:
X_new_tweets = preprocessTweets(data_tweets['tweet'])

In [26]:
predictions_new_tweets = model.predict_proba(X_new_tweets)

In [27]:
predictions_new_tweets

array([[0.36471858, 0.63528142],
       [0.6576234 , 0.3423766 ],
       [0.34498307, 0.65501693],
       ...,
       [0.3937557 , 0.6062443 ],
       [0.57080851, 0.42919149],
       [0.82055616, 0.17944384]])

In [28]:
data_tweets['dem_probability'] = predictions_new_tweets[:,1]
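Why column 1? `predict_proba` returns one column per class, ordered by class label, so column 0 belongs to label 0 ("rep") and column 1 to label 1 ("dem"). A small check in plain Python, using the six rows printed above (the variable names are made up for the example):

```python
# Rows copied from the predict_proba output shown above.
rows = [
    [0.36471858, 0.63528142],
    [0.6576234, 0.3423766],
    [0.34498307, 0.65501693],
    [0.3937557, 0.6062443],
    [0.57080851, 0.42919149],
    [0.82055616, 0.17944384],
]

# Column 1 is the probability of label 1 ("dem").
dem_probability = [row[1] for row in rows]

# The two class probabilities in each row sum to 1 (up to float rounding).
row_sums = [sum(row) for row in rows]
print(dem_probability)
print(row_sums)
```

Since the two columns are complementary, sorting by `dem_probability` in ascending order puts the most "rep"-like tweets first.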

In [29]:
for tweet in data_tweets.sort_values('dem_probability')['tweet'][:10]:
  print(tweet)
  print('\n')

@realDonaldTrump @MarkMeadows @senatemajldr @kevinomccarthy @SpeakerPelosi @SenSchumer Nancy Pelosi never listen and she never work for Americans. Made in China 🇨🇳#PresidentialDebate2020 #stimulus #2020Election #Trump2020 #MAGA  https://t.co/6dxhe01gao


I can start marking all these Pro #Biden hashtags made by Chinese bots as spam now finally???? my god!!! Did Twitter eff up???? Did #chinabitchbiden trigger something???? #bidencares #votebidenharris #bidenharrislandslide #PresidentialDebate2020 #Trump2020 #MAGA @realDonaldTrump  https://t.co/BWjmt59zqx


@CBSNews #PresidentialDebate2020 RIGGED SYSTEM! What the American People Want for topics: De-funding The police? Biden's Mental Fitness/Kamala Plan to Run Country? Increasing Taxes? Biden's Russian/China Collusion? Hunter's Relationship China/Russia? USA SHOULD DEMAND!  #LyingJoeBiden


Senate found Hunter Biden got $3.5 from Moscow &amp; used this money to pay prostitutes connected to human trafficking. But Media &amp; Chris Wallace 

In [30]:
for tweet in data_tweets.sort_values('dem_probability')['tweet'][-10:]:
  print(tweet)
  print('\n')

The racist president’s words written in black and white. #Debate2020 #PresidentialDebate2020   https://t.co/Ed9njTi8R3


76% of the people watching the #PresidentialDebate2020 will be watching just to see Trump get muted.


@axios @BretBaier Add in streaming views and Trump crushed it again! #PresidentialDebate2020


...And were guests of Trump at first #PresidentialDebate2020


#Trump  Mute him out Mute him out Mute him out  Next #PresidentialDebate2020


rona really said fuck trump #PresidentialDebate2020 @realDonaldTrump


"I know I am but what am I." - Trump #BidenTrumpDebate #PresidentialDebate2020 #MorningJoe


Trump caved!!!! 😂😂😂#JoeBidenKamalaHarris2020 #PresidentialDebate2020  #TrumpIsANationalDisgrace


rona really said fuck trump  #PresidentialDebate2020


Whose paying that black women behind trump to nod all day! #PresidentialDebate2020




explainability

In [31]:
!pip -q install eli5

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.


In [34]:
import eli5
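eli5.show_weights(model[2], # we pull the fitted LogisticRegression out of the undersampling pipeline here (it has the index 2)
                  # model[0] is the fitted TfidfVectorizer; note that scikit-learn 1.2 removed
                  # get_feature_names() in favor of get_feature_names_out()
                  feature_names=model[0].get_feature_names(), target_names=['rep','dem'], top=20)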



Weight?,Feature
+4.910,forthepeople
+4.881,trump
+4.199,black
+3.886,climate
+3.871,democracy
+3.539,heroesact
+3.500,aca
+3.422,trumps
… 4829 more positive …
… 4239 more negative …


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [35]:
data_tweets['clean_tweet'] = preprocessTweets(data_tweets['tweet'])

In [36]:
eli5.show_prediction(model[2], data_tweets['clean_tweet'][5237], vec=model[0], target_names=['rep','dem'])

Contribution?,Feature
0.369,<BIAS>
-0.046,Highlighted in text (sum)
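How to read this table: for a linear model the contributions add up to the decision score, and the logistic function turns that score into a probability for class 1 ("dem"). The arithmetic below uses the two values from the output above; it assumes the listed contributions are the complete breakdown of the score.

```python
import math

# Values copied from the eli5 output above.
bias = 0.369
text_contribution = -0.046

# Assumption: the contributions sum to the decision-function score.
score = bias + text_contribution

# The logistic (sigmoid) function maps the score to P(class 1), i.e. "dem".
probability_dem = 1.0 / (1.0 + math.exp(-score))

print(round(score, 3), round(probability_dem, 2))
```

This gives a score of about 0.32 and a probability of roughly 0.58. The words in the tweet shift the score only slightly, so the prediction rests mostly on the intercept.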
