# Exam Q1 Nadine Kanbier - 4283724
#### Compare the use of clickbait titles between democrats and republicans in framing.p. How many times do democrats refer to an article with a clickbait title and how many times do republicans do? Inspect the titles in the dataset that were classified as clickbait and try to explain the results. (Hint: consult the manual to see how to classify the title of the articles as clickbait/non-clickbait!)

#### Your answer must consist of the following:
#### • The complete code to answer the question with a short comment for every step (ca. 2 sentences per step)
#### • An answer to the question + explanation (ca. 200 words)

In [1]:
# First, import the training data set.
import pandas as pd

DATASET_URL = 'https://gist.githubusercontent.com/amitness/0a2ddbcb61c34eab04bad5a17fd8c86b/raw/66ad13dfac4bd1201e09726677dd8ba8048bb8af/clickbait.csv'
data = pd.read_csv(DATASET_URL)
data.head(5)

Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1
1,250 Nuns Just Cycled All The Way From Kathmand...,1
2,"Australian comedians ""could have been shot"" du...",0
3,Lycos launches screensaver to increase spammer...,0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0


In [2]:
# Train/test split
from sklearn.model_selection import train_test_split

X = list(data.title.values)  # the titles we use as input --> X
y = list(data.label.values)  # the labels we want to predict --> y
labels = ['not clickbait', 'clickbait']

X_train_str, X_test_str, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [3]:
# Vectorization
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer() # this initializes the CountVectorizer 

cv.fit(X_train_str) # create the vocabulary

X_train = cv.transform(X_train_str)
X_test = cv.transform(X_test_str)

In [5]:
# Train the model + evaluate the performance
vocabulary = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
vectorized_texts = pd.DataFrame(X_train.toarray(), columns=vocabulary)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(X_train, y_train)

from sklearn.metrics import classification_report

y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred, 
                          target_names=labels))

               precision    recall  f1-score   support

not clickbait       0.96      0.98      0.97      3178
    clickbait       0.98      0.96      0.97      3220

     accuracy                           0.97      6398
    macro avg       0.97      0.97      0.97      6398
 weighted avg       0.97      0.97      0.97      6398



In [40]:
# Load the dataset we want to use
df = pd.read_pickle('framing.p')

In [41]:
# Predict unlabeled titles and add it to the dataframe
new_example = list(df.title.values)
new_example_bow = cv.transform(new_example)

predictions = lr.predict(new_example_bow)
df['clickbait'] = predictions

In [42]:
# Check if it makes sense
df.head()

Unnamed: 0,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,netloc,title,description,label,clickbait
0,1325914751495499776,2020-11-09 21:34:45,SenShelby,R,Alabama,Senator,ICYMI – @BusinessInsider declared #Huntsville ...,businessinsider,https://www.businessinsider.com/personal-finan...,www.businessinsider.com,The 10 best US cities to move to if you want t...,The best US cities to move to if you want to s...,,1
1,1294021087118987264,2020-08-13 21:20:43,SenShelby,R,Alabama,Senator,Great news! Today @mazda_toyota announced an a...,,https://pressroom.toyota.com/mazda-and-toyota-...,pressroom.toyota.com,Mazda and Toyota Further Commitment to U.S. Ma...,"HUNTSVILLE, Ala., (Aug. 13, 2020) – Today, Maz...",,0
2,1323340848130609156,2020-11-02 19:06:59,DougJones,D,Alabama,Senator,He’s already quitting on the folks of Alabama ...,,https://apnews.com/article/c73f0dfe8008ebaf85e...,apnews.com,"Tuberville, Jones fight for Senate seat in Ala...","GARDENDALE, Ala. (AP) — U.S. Sen. Doug Jones, ...",,0
3,1323004075831709698,2020-11-01 20:48:46,DougJones,D,Alabama,Senator,I know you guys are getting bombarded with fun...,,https://secure.actblue.com/donate/djfs-close?r...,secure.actblue.com,I just gave!,Join us! Contribute today.,negiotated,1
4,1322567531320717314,2020-10-31 15:54:06,DougJones,D,Alabama,Senator,"Well looky here folks, his own players don’t t...",,https://slate.com/culture/2020/10/tommy-tuberv...,slate.com,What Tommy Tuberville’s Former Auburn Players ...,"""All I could think is, why?""",,0


In [43]:
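# Count the posts per party whose linked title was predicted as clickbait
R_click = df.clickbait[(df.party == 'R') & (df.clickbait == 1)].count()
D_click = df.clickbait[(df.party == 'D') & (df.clickbait == 1)].count()
# Count all posts per party, to use as the denominator for the percentages
D = df.clickbait[df.party == 'D'].count()
R = df.clickbait[df.party == 'R'].count()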

In [44]:
# Percentage of all Republican posts linking to clickbait titles
R_click/R * 100

13.61472140007721

In [45]:
# Percentage of all Democratic posts linking to clickbait titles
D_click/D * 100

18.182986378822925
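The two percentages above can also be computed in one step with a `groupby`. The following is only a sketch on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the real `framing.p` data); it relies on the fact that the mean of a 0/1 column equals the share of ones.

```python
import pandas as pd

# Toy stand-in for the framing data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "party": ["D", "D", "D", "D", "R", "R", "R", "R", "R"],
    "clickbait": [1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Per party: number of posts, number of clickbait posts, and the clickbait share.
summary = toy.groupby("party")["clickbait"].agg(
    n_posts="count", n_clickbait="sum", share="mean"
)
summary["share_pct"] = (summary["share"] * 100).round(1)
print(summary)
```

Applied to the real data, the same call on `df` would give the absolute counts and the percentages side by side, which also answers the "how many times" part of the question directly.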

In [46]:
# Interpret the results: how did the model predict the labels? Does the interpretation of the model explain the results?
vocabulary = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
regression_coefficients = lr.coef_[0] # get the LR weights
vocab_coef_combined = list(zip(regression_coefficients, vocabulary)) 
feature_importance = pd.DataFrame(vocab_coef_combined,
                      columns=['coef', 'word'])
feature_importance.sort_values('coef', ascending=False).head(10)

Unnamed: 0,coef,word
20545,4.018,you
18566,3.058204,these
18594,2.929925,this
154,2.922905,2015
2962,2.900514,buzzfeed
168,2.783366,21
8694,2.774251,here
12329,2.670787,my
18575,2.663407,things
20553,2.609577,your


In [39]:
# Select the posts whose title was predicted as clickbait and inspect the first ten
clickbait_titles = df[df.clickbait == 1]
clickbait_titles.iloc[:10]

Unnamed: 0,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,netloc,title,description,label,clickbait
0,1325914751495499776,2020-11-09 21:34:45,SenShelby,R,Alabama,Senator,ICYMI – @BusinessInsider declared #Huntsville ...,businessinsider,https://www.businessinsider.com/personal-finan...,www.businessinsider.com,The 10 best US cities to move to if you want t...,The best US cities to move to if you want to s...,,1
3,1323004075831709698,2020-11-01 20:48:46,DougJones,D,Alabama,Senator,I know you guys are getting bombarded with fun...,,https://secure.actblue.com/donate/djfs-close?r...,secure.actblue.com,I just gave!,Join us! Contribute today.,negiotated,1
8,1319698596074262530,2020-10-23 17:53:58,DougJones,D,Alabama,Senator,"We are excited to have @DebraMessing, @SeanHay...",,http://secure.actblue.com/donate/10.23.20wg,secure.actblue.com,I just gave!,Join us! Contribute today.,,1
9,1319690481501114368,2020-10-23 17:21:44,DougJones,D,Alabama,Senator,Try this link: https://t.co/TuqKYnXhsX http...,,https://secure.actblue.com/donate/10.23.20wg,secure.actblue.com,I just gave!,Join us! Contribute today.,,1
19,1311091680402079745,2020-09-29 23:53:10,DougJones,D,Alabama,Senator,"Proud to have the endorsement of my friend, th...",,http://vote.org,vote.org,Everything You Need to Vote - Vote.org,Register to vote. Check your registration stat...,,1
21,1309636079172177925,2020-09-25 23:29:07,DougJones,D,Alabama,Senator,I will not be a party to Mitch McConnell's pow...,,https://secure.actblue.com/donate/stop-mcconne...,secure.actblue.com,I just gave to Doug Jones!,We need Doug Jones in the Senate,,1
30,1295517455436193792,2020-08-18 00:26:45,DougJones,D,Alabama,Senator,Very excited to have a small role for the open...,,https://www.demconvention.com/watch-the-conven...,www.demconvention.com,Watch the 2020 #DemConvention August 17-20,Democrats are coming together August 17-20. Be...,,1
39,1321520316569604101,2020-10-28 18:32:50,RepByrne,R,Alabama 1st District,Representative,I enjoyed visiting TR Miller High School yeste...,,https://www.brewtoncityschools.org/site/defaul...,www.brewtoncityschools.org,Thank you Representative Bradley Byrne,,negiotated,1
50,1316084474338316297,2020-10-13 18:32:45,RepByrne,R,Alabama 1st District,Representative,Congratulations to Spring Hill College (@sprhi...,,https://shcbadgers.com/news/2020/9/18/softball...,shcbadgers.com,Fowler named to Top 30 Women of the Year list ...,"INDIANAPOLIS, Ind. – The National Collegiate A...",,1
66,1307384687808126978,2020-09-19 18:22:54,RepByrne,R,Alabama 1st District,Representative,RT @UnitedWayNews: .@UnitedWay is working with...,,https://howrightnow.org/,howrightnow.org,How Right Now,How Right Now is an initiative to address peop...,,1
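The first ten rows shown above give only a partial impression. One way to make the inspection more systematic would be to count which flagged titles and which domains occur most often per party. This is a sketch on a toy DataFrame that reuses a few titles and domains visible in the output above; the row selection and the names `toy`, `flagged`, `top_titles` and `top_domains` are assumptions for illustration.

```python
import pandas as pd

# Toy rows modelled on the output above (not the full dataset).
toy = pd.DataFrame({
    "party": ["D", "D", "D", "D", "R", "R", "R"],
    "netloc": [
        "secure.actblue.com", "secure.actblue.com", "vote.org", "apnews.com",
        "www.businessinsider.com", "howrightnow.org", "pressroom.toyota.com",
    ],
    "title": [
        "I just gave!", "I just gave!", "Everything You Need to Vote - Vote.org",
        "Tuberville, Jones fight for Senate seat",
        "The 10 best US cities to move to", "How Right Now",
        "Mazda and Toyota Further Commitment",
    ],
    "clickbait": [1, 1, 1, 0, 1, 1, 0],
})

flagged = toy[toy.clickbait == 1]

# Most frequent flagged titles and domains, counted per party.
top_titles = flagged.groupby(["party", "title"]).size().sort_values(ascending=False)
top_domains = flagged.groupby(["party", "netloc"]).size().sort_values(ascending=False)

print(top_titles.head(10))
print(top_domains.head(10))
```

On the real data, the same two lines applied to `clickbait_titles` would show whether a handful of repeated titles or domains account for a large part of one party's clickbait share.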


**Answer:** 
When I first started on this question, I trained a BERT model (because it is generally a 'smarter' model). However, I realized I could not fully interpret **why** the model decided a title was clickbait or not clickbait. Because the question consists of 1) the comparison between both parties and 2) the inspection and explanation of the results, I think the logistic regression model fits better because of its interpretability. If the question had only consisted of part 1, I might have gone for the BERT model. The results of the BERT model and this model were similar: they differed by only 2%.

The Democrats in our dataset refer to an article with a clickbait title **18.2%** of the time. The Republicans in our dataset do so **13.6%** of the time.

Seeing how the model came to its results did not directly explain the difference between Democrats and Republicans (besides the fact that BuzzFeed leans towards the left according to Allsides.com). However, when inspecting the titles manually, it became clear that the titles were mostly directed towards the act of voting (like: Everything You Need to Vote - Vote.org). A short literature review showed that older adults and non-Democrats (often Republicans or independents) have a higher taste for clickbait (Munger et al., 2018). A possible explanation for our results is that the Democrats are aware of this and use it strategically, trying to win over swing voters and even Republican voters.

*Reference: Munger, K., Luca, M., Nagler, J., & Tucker, J. (2018). https://rubenson.org/wp-content/uploads/2018/09/munger-tpbw18.pdf*