# Exploring titles

This notebook explores the `controversial.csv` and `top.csv` datasets of Reddit article titles for the month of September 2020.

What signs indicate an article is controversial? 

Is the title alone enough to make this determination? After all, it is a running joke on Reddit that Redditors read only the title, not the article.

Later I'll scrape and process the articles themselves and see what can be done.

In [3]:
# Start with a standard bag-of-words approach and get some word counts.
import numpy as np
import pandas as pd
import gensim
import sklearn as sk

In [10]:
df1 = pd.read_csv('controversial.csv')

In [16]:
# Label every row of df1 as controversial.
df1["controversial"] = True

In [17]:
df1.head()

Unnamed: 0,title,score,upvote_ratio,id,url,comms_num,created,body,timestamp,controversial
0,"Crowd gathers outside hospital, chants ""We hop...",37,0.55,isfcae,https://www.fox10phoenix.com/news/crowd-gather...,131,1600093000.0,,2020-09-14 15:09:07,True
1,Planned Parenthood Quietly Stops Distributing ...,2,0.51,ifpeu1,https://www.ncregister.com/daily-news/planned-...,20,1598306000.0,,2020-08-24 22:53:51,True
2,First transgender person elected to local offi...,31,0.55,isjodx,https://mainebeacon.com/first-transgender-pers...,28,1600114000.0,,2020-09-14 21:12:41,True
3,Mount Union Area High School student asked by ...,0,0.48,ijwcu1,https://www.wearecentralpa.com/news/mount-unio...,30,1598903000.0,,2020-08-31 20:37:45,True
4,Mixed race woman fired by G4S after row over b...,0,0.49,ienlnk,https://www.theguardian.com/uk-news/2020/aug/2...,29,1598149000.0,,2020-08-23 03:24:32,True


In [20]:
# Collect the ids of the controversial articles (avoids shadowing the built-in `id`).
controversial_ids = df1["id"].tolist()
# controversial_ids

In [24]:
df2 = pd.read_csv('top.csv')
df2.shape

(994, 9)

In [27]:
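# Drop top articles that also appear in the controversial set, so no article is counted twice.
# .copy() makes df2 an independent frame, so adding a column later does not trigger SettingWithCopyWarning.
df2 = df2[~df2.id.isin(controversial_ids)].copy()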

In [34]:
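# 188 of the 994 top articles were also controversial.
# The shape shows 10 columns because this cell was run after the `controversial` column was added.
df2.shape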

(806, 10)
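The overlap between the two sets can also be counted directly rather than inferred from the change in shape. The following is a minimal sketch on made-up data: the frames `controversial` and `top` and all ids and titles in them are hypothetical stand-ins for `df1` and `df2`, not rows from the real files.

```python
import pandas as pd

# Toy stand-ins for the two datasets; ids and titles are invented for illustration.
controversial = pd.DataFrame({"id": ["a1", "b2", "c3"], "title": ["t1", "t2", "t3"]})
top = pd.DataFrame({"id": ["b2", "d4", "e5", "c3"], "title": ["t2", "t4", "t5", "t3"]})

# Boolean mask: True where a top article's id also appears in the controversial set.
overlap_mask = top["id"].isin(controversial["id"])
n_overlap = int(overlap_mask.sum())

# Keep only the top articles that are not controversial, then label them.
top_only = top[~overlap_mask].copy()
top_only["controversial"] = False

print(n_overlap, top_only.shape)
```

Counting the mask before filtering gives the number of shared articles without having to compare two `shape` outputs by hand.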

In [29]:
df2["controversial"] = False

In [30]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
df = pd.concat([df1, df2], ignore_index=True)

In [31]:
df.head()

Unnamed: 0,title,score,upvote_ratio,id,url,comms_num,created,body,timestamp,controversial
0,"Crowd gathers outside hospital, chants ""We hop...",37,0.55,isfcae,https://www.fox10phoenix.com/news/crowd-gather...,131,1600093000.0,,2020-09-14 15:09:07,True
1,Planned Parenthood Quietly Stops Distributing ...,2,0.51,ifpeu1,https://www.ncregister.com/daily-news/planned-...,20,1598306000.0,,2020-08-24 22:53:51,True
2,First transgender person elected to local offi...,31,0.55,isjodx,https://mainebeacon.com/first-transgender-pers...,28,1600114000.0,,2020-09-14 21:12:41,True
3,Mount Union Area High School student asked by ...,0,0.48,ijwcu1,https://www.wearecentralpa.com/news/mount-unio...,30,1598903000.0,,2020-08-31 20:37:45,True
4,Mixed race woman fired by G4S after row over b...,0,0.49,ienlnk,https://www.theguardian.com/uk-news/2020/aug/2...,29,1598149000.0,,2020-08-23 03:24:32,True


In [32]:
df.shape

(1792, 10)
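Before relying on the combined frame, it may be worth checking that the labels are balanced as expected and that no article id appears twice. This is a small sketch on a hypothetical frame named `toy`; the ids are invented, and on the real data the same two checks would be run against `df`.

```python
import pandas as pd

# Toy combined frame; the ids are invented for illustration.
toy = pd.DataFrame({
    "id": ["a1", "b2", "d4", "e5", "f6"],
    "controversial": [True, True, False, False, False],
})

# Number of rows per label.
label_counts = toy["controversial"].value_counts()

# True when every id occurs exactly once, i.e. the de-duplication worked.
ids_unique = toy["id"].is_unique

print(label_counts)
print(ids_unique)
```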

In [33]:
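# index=False keeps the row index out of the file, so re-reading it does not add an "Unnamed: 0" column.
df.to_csv('top_and_controversial.csv', index=False)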
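With both sets in one labeled frame, the word counts mentioned at the start could begin with something like the sketch below. It is only an illustration under stated assumptions: the titles are invented, the helpers `tokenize` and `word_counts` are hypothetical names, and it uses a plain `collections.Counter` instead of a vectorizer from scikit-learn or gensim.

```python
import re
from collections import Counter

import pandas as pd

# Toy titles, invented for illustration; the real input would be df["title"].
toy = pd.DataFrame({
    "title": [
        "Mayor bans protest outside hospital",
        "Protest over hospital funding grows",
        "Local dog wins national award",
        "Dog rescued from river wins hearts",
    ],
    "controversial": [True, True, False, False],
})


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", title.lower())


def word_counts(titles):
    """Count how often each token occurs across a collection of titles."""
    counts = Counter()
    for title in titles:
        counts.update(tokenize(title))
    return counts


# Separate bag-of-words counts for each label.
controversial_counts = word_counts(toy.loc[toy["controversial"], "title"])
top_counts = word_counts(toy.loc[~toy["controversial"], "title"])

print(controversial_counts.most_common(3))
print(top_counts.most_common(3))
```

Keeping one counter per label makes it easy to compare which words are frequent in controversial titles but rare in top titles, and the reverse.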