# Intro
This notebook takes data from two excellent NYT articles and puts it in a more structured form.
The articles in question are [The People, Places and Things Trump Has Praised on Twitter: A Complete List](https://www.nytimes.com/interactive/2018/02/14/upshot/trump-compliments-list.html) and [The 459 People, Places and Things Donald Trump Has Insulted on Twitter: A Complete List](https://www.nytimes.com/interactive/2016/01/28/upshot/donald-trump-twitter-insults.html).

Each of these articles catalogues Trump's tweets and highlights the phrases that were an insult or a compliment.

## So what is this?
At [LightTag](https://lighttag.io) we build tools to label text and we couldn't miss an opportunity to promote a wonderful labeled data set. This notebook shows our process of extracting and enriching the data NYT provided.

## Where is this heading?
We're using this data to jump-start a wider-scale, public labeling project where we'll be labeling Trump tweets.

In [254]:
import pandas as pd
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize import RegexpTokenizer

# Simple tokenizer: keeps runs of word characters and drops punctuation
tokenizer = RegexpTokenizer(r'\w+')
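
The `\w+` pattern keeps runs of letters, digits and underscores and discards everything else, so punctuation is dropped and contractions are split. The sketch below uses the standard library's `re.findall` as a stand-in, on the assumption that it returns the same tokens as `RegexpTokenizer(r'\w+')` for inputs like these; `tokenize` and the sample phrases are illustrative.

```python
import re

def tokenize(text):
    # Stand-in for RegexpTokenizer(r'\w+').tokenize: keep runs of word characters only
    return re.findall(r"\w+", text)

print(tokenize("So Dumb!".lower()))     # punctuation is dropped
print(tokenize("It's rigged".lower()))  # the apostrophe splits "it's" into two tokens
```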



### Preparing Insults

In [255]:
Insults = pd.read_csv('./insults.csv') # Read the insults csv that came with this repo

In [256]:
def has_verb(s):
    '''
    This function uses nltk/wordnet to check whether the first WordNet sense of any word in the phrase is a verb.
    We assume that if what NYT called an insult has a verb in it, then it is an accusation.
    For example, "So Dumb!" is an insult but "rigged the election" is an accusation.
    '''
    words = tokenizer.tokenize(s.lower())
    for word in words:
        try:
            if wn.synsets(word)[0].pos()=="v":
                return True
        except IndexError:
            pass
    return False
Insults['has_verb'] = Insults.quotes_vec.apply(has_verb)

In [257]:
Insults['tag'] = Insults.has_verb.apply(lambda x: "insult" if not x else "accusation")
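
The rule in the two cells above can be tried without WordNet. This is a minimal sketch: `TOY_FIRST_POS` is a hypothetical, hand-written lookup standing in for the part of speech of `wn.synsets(word)[0]` (it is not real WordNet output), and `tag_phrase` is an illustrative name.

```python
import re

# Hypothetical lookup: word -> part of speech of its first sense.
# These values are hand-written for illustration, not taken from WordNet.
TOY_FIRST_POS = {"so": "r", "dumb": "a", "rigged": "v", "election": "n"}

def has_verb(phrase, first_pos=TOY_FIRST_POS):
    # True if any token's first sense is a verb; unknown words are skipped
    words = re.findall(r"\w+", phrase.lower())
    return any(first_pos.get(word) == "v" for word in words)

def tag_phrase(phrase):
    # Phrases containing a verb are treated as accusations, the rest as insults
    return "accusation" if has_verb(phrase) else "insult"

print(tag_phrase("So Dumb!"))             # insult
print(tag_phrase("rigged the election"))  # accusation
```

Because only the first sense of each word is consulted, the heuristic is rough: a word whose first listed sense happens to be a verb will turn any phrase into an accusation.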

In [258]:
# We rename some columns so that we can easily stack this df with the compliments one, which we are about to make
Insults = Insults.rename(columns={"insult_slugs":"slug","quotes_vec":"phrase"})

### Preparing Compliments
The compliments data is similar but not quite the same: it is a TSV instead of a CSV, the tweet body isn't provided, and so on.

In [259]:
Compliments = pd.read_csv('./compliments.tsv',sep='\t').rename(columns={"compliment":"phrase"})
Compliments['tag'] = "compliment" # For now, every compliment is a compliment


In [260]:
Both = pd.concat([Insults, Compliments])  # DataFrame.append was removed in pandas 2.0


# Todo
The tweets in the compliments data don't have the tweet body, so we need to fetch it for those tweets.
For now we'll drop those tweets; in the future we will use the [Trump Twitter Archive](http://www.trumptwitterarchive.com/) to fill those holes.


In [261]:

links_with_no_tweet = Both.groupby('tweet_link').tweet.any()
links_with_no_tweet = links_with_no_tweet[links_with_no_tweet==False]



In [262]:
# Todo: replace this with a fix for missing tweets
Both = Both[~Both.tweet_link.isin(links_with_no_tweet.index)]  # Drop the links that have no tweet body
Both = Both.sort_values(by=['tweet_link', 'tweet']).ffill()  # Sort, then forward-fill missing values
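
For the Todo above, one possible approach is to key an external lookup on the status ID at the end of each `tweet_link` and fill only the rows whose body is missing. This is a sketch under assumptions: `archive` is a hypothetical dict of status ID to tweet text that you would have to build yourself (for example from a Trump Twitter Archive export), rows are plain dicts with `None` for a missing body, and the placeholder text is made up.

```python
def status_id(tweet_link):
    # The links in this data end in /status/<id>, so the ID is the last path segment
    return tweet_link.rstrip("/").rsplit("/", 1)[-1]

def fill_missing_bodies(rows, archive):
    # Return copies of the rows, filling an empty 'tweet' from the archive when possible
    filled = []
    for row in rows:
        row = dict(row)
        if not row.get("tweet"):
            row["tweet"] = archive.get(status_id(row["tweet_link"]))
        filled.append(row)
    return filled

# Hypothetical data, for illustration only
archive = {"641783686363508736": "placeholder tweet text"}
rows = [
    {"tweet_link": "http://twitter.com/realDonaldTrump/status/641783686363508736", "tweet": None},
    {"tweet_link": "http://twitter.com/realDonaldTrump/status/669501105630498816", "tweet": "already present"},
]
result = fill_missing_bodies(rows, archive)
print(result[0]["tweet"])
```

Rows whose ID is not in the lookup keep `None`, so they can still be dropped afterwards as the cell above does.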

# Adding entity info
Here we add the start and end for each phrase NYT highlighted, then try to resolve the entities they claim are talked about in the text.

In [263]:
def get_start_end(row):
    # We want the start and end offsets of each phrase in the tweet.
    # The match is exact and case-sensitive; rows with no match get no offsets.
    if isinstance(row.tweet, str):
        start = row.tweet.find(row.phrase)
        if start != -1:
            end = start + len(row.phrase)
            return pd.Series({"start": start, "end": end})
Both = Both.join(Both.apply(get_start_end, axis=1))
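
Since `str.find` is case-sensitive, a phrase that NYT recorded with different casing than the tweet gets no offsets, which is probably why some rows in the output further down have an empty `start` and `end`. The sketch below shows one possible fallback; `find_span` is a hypothetical helper, not part of the notebook, and it assumes lowercasing does not change the string length (true for ASCII text).

```python
def find_span(tweet, phrase):
    # Try an exact match first, as get_start_end does
    start = tweet.find(phrase)
    if start == -1:
        # Fall back to a case-insensitive match
        start = tweet.lower().find(phrase.lower())
    if start == -1:
        return None
    return start, start + len(phrase)

tweet = ("Broken down political pundit @GeorgeWill, who is wrong almost all "
         "of the time, should be thrown off @FoxNews. Boring and totally biased.")

print(find_span(tweet, "wrong almost all of the time"))   # (49, 77)
print(find_span(tweet, "broken down"))                    # (0, 11)
print(find_span(tweet, "should be thrown off Fox News"))  # None
```

The last phrase still fails because NYT rewrote `@FoxNews` as `Fox News`; fixing that would need fuzzier matching than a case fold.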

The NYT data has a field called slug, which is a slug representation of who was being spoken about, for example hillary-clinton. We want to highlight that entity if possible.
Sometimes there is more than one entity in a tweet, so we need to be ready for that as well.

In [264]:

def find_ent(row):
    # Look for a mention of the slug's entity in the tweet text
    if isinstance(row.slug, str) and isinstance(row.tweet, str):
        words = row.slug.split('-')
        tweet = row.tweet.lower()
        ents = []
        if len(words) == 2:  # If it looks like a name
            words = [' '.join(words)] + words  # Search for the full name first, then the first name, then the last
        for word in words:
            start = tweet.find(word)
            if start != -1:
                end = start + len(word)
                ents.append({"start": start, "end": end, "phrase": row.tweet[start:end], "tag": "entity",
                             "tweet_link": row.tweet_link, "tweet": row.tweet, "slug": row.slug})
                break  # If we found a mention then stop. This is a hack, but it works
        return ents

F = Both.apply(find_ent, axis=1)
# Make a dataframe out of what we found, filtering out empty rows
Entities = pd.DataFrame(sum(filter(lambda x: x, F.values), []))
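
To make the slug matching concrete, here is a standalone sketch of the same lookup on two tweets from the data. `slug_terms` and `first_mention` are illustrative names that mirror the logic of `find_ent`; they are not part of the notebook.

```python
def slug_terms(slug):
    # "karl-rove" -> ["karl rove", "karl", "rove"]; other slugs are just split on "-"
    words = slug.split("-")
    if len(words) == 2:
        words = [" ".join(words)] + words
    return words

def first_mention(tweet, slug):
    # Return (start, end, text) for the first term found in the tweet, else None
    lowered = tweet.lower()
    for term in slug_terms(slug):
        start = lowered.find(term)
        if start != -1:
            end = start + len(term)
            return start, end, tweet[start:end]
    return None

rove = ".@Karl Rove just totally bombed on @Morning_Joe. He has ZERO credibility. @FoxNews"
will = ("Broken down political pundit @GeorgeWill, who is wrong almost all "
        "of the time, should be thrown off @FoxNews. Boring and totally biased.")

print(first_mention(rove, "karl-rove"))    # (2, 11, 'Karl Rove')
print(first_mention(will, "george-will"))  # (30, 36, 'George')
```

The second case shows the cost of stopping at the first hit: the handle `@GeorgeWill` has no space, so the full name is missed and only the first name is highlighted.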

In [265]:
# Now append the entities to the other tags we have and drop any duplicates
Result = pd.concat([Both, Entities]).drop_duplicates()

In [266]:
# Let's make it pretty and take a look
Pretty = Result.set_index(["tweet_link","tweet",'slug',"tag"])[["phrase","start","end"]].sort_index()
Pretty.head(50)

| tweet_link | tweet | slug | tag | phrase | start | end |
| --- | --- | --- | --- | --- | --- | --- |
| http://twitter.com/realDonaldTrump/status/641783686363508736 | I am on @seanhannity tonight at 10:00 on @FoxNews. Will be talking about my day in D.C., the horrendous nuke deal with Iran and more. Enjoy! | iran-deal | entity | Iran | 119.0 | 123.0 |
| http://twitter.com/realDonaldTrump/status/641783686363508736 | I am on @seanhannity tonight at 10:00 on @FoxNews. Will be talking about my day in D.C., the horrendous nuke deal with Iran and more. Enjoy! | iran-deal | insult | horrendous | 93.0 | 103.0 |
| http://twitter.com/realDonaldTrump/status/667347920560238592 | Broken down political pundit @GeorgeWill, who is wrong almost all of the time, should be thrown off @FoxNews. Boring and totally biased. | george-will | accusation | broken down | | |
| http://twitter.com/realDonaldTrump/status/667347920560238592 | Broken down political pundit @GeorgeWill, who is wrong almost all of the time, should be thrown off @FoxNews. Boring and totally biased. | george-will | accusation | boring and totally biased | | |
| http://twitter.com/realDonaldTrump/status/667347920560238592 | Broken down political pundit @GeorgeWill, who is wrong almost all of the time, should be thrown off @FoxNews. Boring and totally biased. | george-will | accusation | should be thrown off Fox News | | |
| http://twitter.com/realDonaldTrump/status/667347920560238592 | Broken down political pundit @GeorgeWill, who is wrong almost all of the time, should be thrown off @FoxNews. Boring and totally biased. | george-will | entity | George | 30.0 | 36.0 |
| http://twitter.com/realDonaldTrump/status/667347920560238592 | Broken down political pundit @GeorgeWill, who is wrong almost all of the time, should be thrown off @FoxNews. Boring and totally biased. | george-will | insult | wrong almost all of the time | 49.0 | 77.0 |
| http://twitter.com/realDonaldTrump/status/669501105630498816 | .@Karl Rove just totally bombed on @Morning_Joe. He has ZERO credibility. @FoxNews | karl-rove | entity | Karl Rove | 2.0 | 11.0 |
| http://twitter.com/realDonaldTrump/status/669501105630498816 | .@Karl Rove just totally bombed on @Morning_Joe. He has ZERO credibility. @FoxNews | karl-rove | insult | has ZERO credibility | 52.0 | 72.0 |
| http://twitter.com/realDonaldTrump/status/678986289350316033 | It's the Democrat's total weakness that is the greatest recruiting tool of ISIS!!! | democrats | accusation | it's the Democrat's total weakness that is the... | | |


In [267]:
Pretty.to_csv('./ny_times_data.csv') # We can save it as a csv as well

In [268]:
len(Pretty)

7058