# Notes

**Focus on**:
- Quality of data
- Log your data and calls when using data scraping
- Creativity
- Visualization (explanatory figures), simple is better
- Be critical of your data collection and generating process
    - Bias
    - Missing data
        - Ignore
        - Collect new data
        - Remove or replace missing data
    - Internal and external validity
    - Data collection type (random, survey, big data, other)
- Less focus on the analytical section and more on the collection and presentation

### Reflect on the ethical aspect
- Do you respect privacy? 
- Can single individuals be identified? 
- What are the potential consequences?
- Are there ethical considerations?
    - With respect to individuals? 
    - With respect to firms or organizations?
- Consider the GDPR:
    - Is it anonymous? 
    - Personal data or statistics?
    - Any chance of re-identification?
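
One rough way to probe the re-identification question is to count how many records share each combination of indirectly identifying fields. The sketch below is a minimal illustration under assumed data: the records, the field names and the helper `smallest_group` are hypothetical and not part of this project.

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical toy records; the field names are illustrative only
records = [
    {"age_band": "20-29", "city": "Copenhagen", "score": 3},
    {"age_band": "20-29", "city": "Copenhagen", "score": 5},
    {"age_band": "30-39", "city": "Aarhus", "score": 4},
]

k = smallest_group(records, ["age_band", "city"])
print(k)  # a value of 1 means at least one record is unique on these fields
```

A smallest group of 1 suggests that at least one individual could be singled out from these fields alone, which is a signal to coarsen or drop them.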

### Logging

- Log your calls and use the log to determine the success ratio
    - Where did the call fail? Rewrite the code.
    - Don't be greedy: use time.sleep(0.5) between calls.
- Visualize the log (lecture 10)
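
The logging advice above can be sketched as follows. This is a minimal offline illustration, not the course's scraping_class: `timed_call`, `success_ratio` and `fake_fetch` are hypothetical names, and the URLs are placeholders.

```python
import time

def timed_call(fetch, url, log, delay=0.5):
    """Call fetch(url), append the outcome to log, then pause before the next call."""
    start = time.time()
    try:
        fetch(url)
        success, error = True, ""
    except Exception as exc:
        success, error = False, repr(exc)
    log.append({"url": url, "success": success, "error": error,
                "seconds": time.time() - start})
    time.sleep(delay)

def success_ratio(log):
    """Share of logged calls that succeeded (0.0 for an empty log)."""
    return sum(entry["success"] for entry in log) / len(log) if log else 0.0

# Stand-in fetcher so the sketch runs offline; swap in a real request function
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("simulated failure")

log = []
for url in ["http://example.com/a", "http://example.com/bad", "http://example.com/c"]:
    timed_call(fake_fetch, url, log, delay=0)  # delay=0 only to keep the demo fast

print(success_ratio(log))
failed = [entry["url"] for entry in log if not entry["success"]]
print(failed)  # the calls to revisit when rewriting the code
```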

We start by importing our data source into Python. The file *tweets.json* is created from [the Trump Twitter Archive](http://www.trumptwitterarchive.com/archive). We have selected all tweets from 20 January 2017 (the day he assumed office) to 21 August 2019.

**DOCSTRING**

Below we provide the docstring for the data project.
We import the relevant packages, some of which require installation through either *pip* or *conda*. 

In [3]:
'''
DOCSTRING:

This project analyzes Donald J. Trump's twitter data and presents a visual analysis of key elements.
It makes use of several packages, some of which should be installed via either pip or conda.
Executing the code cells will save files to the relative path of this Jupyter Notebook. 

'''

# Importing packages
import pandas as pd
import scraping_class, time, json
import numpy as np
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from IPython.display import display, HTML
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from afinn import Afinn
%matplotlib inline


We set up our connector to the relevant data source and log our connections in a file called *my_log*. 

We later use the log file to visualize our data connection attempts.

In [4]:
logfile = 'my_log'  # name of the log file
connector = scraping_class.Connector(logfile)
data = []
# Fetching data
for i in range(2015,2020):
    url = 'http://www.trumptwitterarchive.com/data/realdonaldtrump/'+str(i)+'.json'
    r, call_id = connector.get(url, 'Tweets')
    json_file = r.json() 
    data += json_file[::-1] # invert list
    time.sleep(0.5) # set sleep timer to prevent unintentional DOS attacks

In [5]:
# Creating and manipulating dataframes
# Main dataframe
df = pd.DataFrame(data)
date = [i[2]+i[1]+i[-1]+'-'+i[3] for i in df['created_at'].str.split(' ')] # slice date
df['datetime'] = date
df['datetime'] = pd.to_datetime(df['datetime'], format='%d%b%Y-%H:%M:%S') # format datetime 
#df = df[(df['datetime'] > '2017-01-20')] # filter by relevant date
df = df.query("is_retweet == False") # drop retweets
df = df.reset_index(drop=True).sort_values(by=['datetime']) # sort chronologically
df.index = df['datetime'] # set index to date
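
The date slicing above is easier to follow on a single value. The sketch below uses only the standard library and an assumed example of the `created_at` layout (the sample string is illustrative, not taken from the data set).

```python
from datetime import datetime

# Assumed example of the created_at layout; not taken from the data set
created_at = "Fri Jan 20 17:51:25 +0000 2017"

parts = created_at.split(" ")
# day + month abbreviation + year, then '-' and the clock time
sliced = parts[2] + parts[1] + parts[-1] + "-" + parts[3]
parsed = datetime.strptime(sliced, "%d%b%Y-%H:%M:%S")

# Parsing the whole string directly also keeps the UTC offset
direct = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

print(sliced)
print(parsed, direct)
```

If the layout assumption holds, passing the second format string to `pd.to_datetime` would make the manual slicing unnecessary, at the cost of getting timezone-aware timestamps.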

# Fuzzy Section

In [566]:
import bs4 as bs
import pickle
import requests
import fuzzywuzzy

def sp500_tickers_names():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    names = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        name  = row.findAll('td')[1].text
        tickers.append(ticker)
        names.append(name)
        
    with open("sp500tickers.pickle","wb") as f:
        pickle.dump(tickers,f)
    # Use a separate file for the names; reusing sp500tickers.pickle would overwrite the tickers
    with open("sp500names.pickle","wb") as f:
        pickle.dump(names,f)
        
    return tickers, names

# changes made directly here: switch to AIG, add Google
SP500 = sp500_tickers_names()[1]

In [569]:
# Adding names that differ a lot from their official names

missing_names = ['Google', 'AIG', 'JPMorgan', 'Microsoft'] 
for names in missing_names:
    SP500.append(names)

In [571]:
#Running the first fuzzy search 
from fuzzywuzzy import fuzz
#################################################################################
# Warning: may take a long time (10**7 iterations), approx. 35 min              #
#################################################################################
# Refer instead to the pickled file above
# Second run-through is on reduced firm sample
from tqdm import tqdm
relevant_tweet = {}
store_count = {}
for SP in tqdm(SP500):
    count_sp = 0
    for i in range(len(df['text'])):
        if fuzz.token_set_ratio(df['text'][i], SP)>=60: 
            count_sp +=1
            relevant_tweet[i]=SP
    store_count[SP]=count_sp 

100%|████████████████████████████████████████████████████████████████████████████████| 509/509 [34:44<00:00,  3.04s/it]
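
Since the search takes about 35 minutes, it helps to compute it once and reload the result afterwards. The sketch below shows one possible caching pattern; `load_or_compute`, `slow_search` and the file name are hypothetical, and the toy function merely stands in for the fuzzy search.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the pickled result if cache_path exists, else compute and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

# Toy stand-in for the slow search; it records how often it actually runs
calls = []
def slow_search():
    calls.append(1)
    return {0: "Google", 7: "Visa Inc."}

cache_path = os.path.join(tempfile.mkdtemp(), "fuzzy_round1.p")  # hypothetical file name
first = load_or_compute(cache_path, slow_search)
second = load_or_compute(cache_path, slow_search)  # served from the pickle
print(len(calls))
```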


In [579]:
# Inspecting the data frame to see what firms might be problematic
df_sc=pd.Series(store_count)
df_sc = df_sc.to_frame()
df_sc.columns = ['count']
df_sc.query('count >50')

| Company | count |
|---|---|
| American Express Co | 205 |
| AT&T Inc. | 695 |
| Bank of America Corp | 331 |
| The Bank of New York Mellon Corp. | 63 |
| Best Buy Co. Inc. | 67 |
| Cimarex Energy | 88 |
| Cincinnati Financial | 53 |
| CMS Energy | 88 |
| The Cooper Companies | 139 |
| Devon Energy | 88 |


In [578]:
## Used for second round of fuzzy search:
non_zero_SP500 = list(df_sc.query("count !=0").index) # Select the companies with non-zero amount of relevant tweets
SP500=non_zero_SP500 #Overwriting the old data frame without the companies with no tweets

237

In [580]:
# Further changes to the dataframe:
# The inspection of the data revealed some problematic companies that had too many false positives
# AT&T, Energy, Companies, Trade, facebook, resources, general, healthcare, international, Lincoln, national, technology, public, southern,

# We try removing parts of company names that have inherent meaning. The idea is to keep the distinctive part of the name,
# i.e. to search for Ameriprise rather than Ameriprise Financial
# This approach is not without its weaknesses, but right now we are getting too many false positives


############################## This code was for testing the approach #############################
#Initially we create a list of firms affected by this modification:
# part_remove = []
# for i in range(len(SP500)):
#     if any(j in SP500[i].lower() for j in problematic_names):
#         part_remove.append(SP500[i])

# #We then go over all the firms removing the problematic parts of their names:        
# for part in part_remove:
#       for problems in problematic_names:
#              if problems in part:
#                     part_remove[part_remove.index(part)]=part.replace(problems, '')
##################################################################################################

# Removing the problematic parts 
# AT&T, Energy, Companies, Trade, facebook, resources, general, healthcare, international, Lincoln, national, technology, public, southern,
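problematic_names = ['Financial', 'Energy', 'Companies','Corporation', 'Resources', 'National', 'Technology', 'International', 'Company', 'Technologies']
for idx, firm in enumerate(SP500):
    for problems in problematic_names:
        # Strip each generic word in turn; looking the firm up again with SP500.index(firm)
        # would fail once an earlier replacement has already changed the list entry
        firm = firm.replace(problems, '')
    SP500[idx] = firm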
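
Plain string replacement leaves stray spaces behind, which is why names such as 'PNC  Services' and 'The Cooper ' show up in later outputs. The sketch below is one possible alternative, not the approach used in this notebook: it removes only whole words and collapses the leftover whitespace. The helper `strip_generic` and the input names are illustrative assumptions.

```python
import re

GENERIC_WORDS = ['Financial', 'Energy', 'Companies', 'Corporation', 'Resources',
                 'National', 'Technology', 'International', 'Company', 'Technologies']

# \b restricts each match to a whole word
pattern = re.compile(r"\b(?:" + "|".join(GENERIC_WORDS) + r")\b")

def strip_generic(name):
    """Drop generic words, then collapse the whitespace left behind."""
    return " ".join(pattern.sub(" ", name).split())

for name in ['Ameriprise Financial', 'PNC Financial Services', 'The Cooper Companies']:
    print(repr(strip_generic(name)))
```

Note that the later cells refer to firms by their un-normalized names (for example 'Robert Half ' with a trailing space), so those literals would need updating if the names were cleaned this way.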

        

In [582]:
# We run the fuzzy search again on the reduced data set, without the companies that had no matches
# and with the problematic parts of company names removed
# Warning: may take a long time (approx. 15 min), though it covers 237 companies compared to 509 in the first run-through
# Refer instead to the pickled file in the hand-in
# Second run-through is on reduced firm sample
relevant_tweet = {}
store_count = {}
for SP in tqdm(SP500):
    count_sp = 0
    for i in range(len(df['text'])):
        if fuzz.token_set_ratio(df['text'][i], SP)>=60: 
            count_sp +=1
            relevant_tweet[i]=SP
    store_count[SP]=count_sp       

100%|████████████████████████████████████████████████████████████████████████████████| 237/237 [14:25<00:00,  3.15s/it]


In [583]:
# Inspecting the data frame to see what firms might be problematic
df_sc=pd.Series(store_count)
df_sc = df_sc.to_frame()
df_sc.columns = ['count']
df_sc.query('count >50')

| Company | count |
|---|---|
| American Express Co | 205 |
| American Group | 503 |
| AT&T Inc. | 695 |
| Bank of America Corp | 331 |
| The Bank of New York Mellon Corp. | 63 |
| Best Buy Co. Inc. | 67 |
| Dollar General | 177 |
| E*Trade | 320 |
| Fox Class A | 110 |
| Fox Class B | 71 |


In [584]:
# After this second run-through we inspect relevant_tweet and remove entries directly from it
# A more correct approach would be to change the SP500 list we search over and run the search again
# But due to time constraints, that computationally costly approach is infeasible

# Some companies get a huge number of false positives because their names have inherent meaning:
# After visual inspection of the fuzzy search results, we have added the following exceptions:
exeptions = ['american', 'anthem', 'america', 'united']

# We collect the matches in a dict so we can inspect how many tweets we remove and check which companies are affected
remove_ex = {}
for i, j in relevant_tweet.items():
        if any(ex in relevant_tweet[i].lower() for ex in exeptions):
            remove_ex[i]=j
#display(remove_ex[0:100]) -> removed firms like: united, packaging of america, American express, american airlines.
#Then we remove these tweets from the relevant_tweets using the pop function
all(map( relevant_tweet.pop, remove_ex))

True
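
The `all(map(dict.pop, keys))` idiom works here because every popped value is a non-empty company name. As a hedged aside, the toy example below (made-up data) shows how it would stop early on a falsy value, and how an explicit loop avoids that.

```python
# Toy data; key 8 has an empty value on purpose
matches = {3: "Google", 8: "", 12: "Visa Inc."}
to_remove = [3, 8, 12]

# all() stops at the first falsy value that pop returns, so key 12 is never removed
shortcut = dict(matches)
all(map(shortcut.pop, to_remove))

# An explicit loop removes every key and tolerates keys that are already gone
explicit = dict(matches)
for key in to_remove:
    explicit.pop(key, None)

print(shortcut, explicit)
```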

In [585]:
#After removing the exceptions we inspect the relevant_tweets again, and find an amount of tweets on AT&T that indicates problems:
#looking at AT&T
ATT_tweets=[k for k,v in relevant_tweet.items() if v == 'AT&T Inc.']
att_tweets = []
#Inserting tweet ID in df['text'] to call tweets
for i in ATT_tweets:
    att_tweets.append(df['text'][i])
    
display(att_tweets[0:5]) # Reading over contents
# We discover the following issue:
print(fuzz.token_set_ratio('Today, I was thrilled to host the @WWP Soldier Ride once again at the @WhiteHouse. We were all deeply honored to be in the presence of TRUE AMERICAN HEROES....https://t.co/q6D5875xCw', 'AT&T'))
print(fuzz.token_set_ratio('at', 'AT&T'))
#Removing AT&T tweets - since every tweet with the word 'at' will get 100 pct. match
print(len(relevant_tweet)) #Checking length before removing
print(len(ATT_tweets)) # Amount to be removed
all(map(relevant_tweet.pop, ATT_tweets)) #Removing using the pop function over the dict relevant_tweet
print(len(relevant_tweet)) # Check if result adds up


['Beautiful evening at a #MAGARally with great American Patriots. Loyal citizens like you helped build this Country and together, we are taking back this Country – returning power to YOU, the AMERICAN PEOPLE. Get out and https://t.co/0pWiwCHGbh! https://t.co/9nCTLdFVW4 https://t.co/wBOUVedVtT',
 'Horrible killing of a 13 year old American girl at her home in Israel by a Palestinian terrorist. We must get tough. https://t.co/zauQ6kb9Hj',
 'We pause today to remember the 2,403 American heroes who selflessly gave their lives at Pearl Harbor 75 years ago...\nhttps://t.co/r5eRLR24Q3',
 'It was an honor to host our American heroes from the @WWP #SoldierRideDC at the @WhiteHouse today with @FLOTUS, @VP… https://t.co/u5AI1pupVV',
 'Speech transcript at Arab Islamic American Summit ➡️https://t.co/eUWxJXJxbe\n\nReplay ➡️https://t.co/VtmlSqciXx\n\n#RiyadhSummit #POTUSAbroad']

100
100
2408
587
1821
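
The AT&T result can be explored offline with a simplified stand-in for a token-set comparison. The sketch below is built on `difflib` from the standard library and is not fuzzywuzzy's own code, so its scores may differ from `fuzz.token_set_ratio`; the helper names are hypothetical. It assumes punctuation is treated as a separator, which turns AT&T into the tokens `at` and `t`.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Lowercase, treat every non-alphanumeric run as a separator, return the token set."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_sketch(s1, s2):
    """Compare the shared tokens against each side's full token string; keep the best score."""
    t1, t2 = tokens(s1), tokens(s2)
    common = " ".join(sorted(t1 & t2))
    only1 = " ".join(sorted(t1 - t2))
    only2 = " ".join(sorted(t2 - t1))
    c1 = (common + " " + only1).strip()
    c2 = (common + " " + only2).strip()
    return max(ratio(common, c1), ratio(common, c2), ratio(c1, c2))

print(tokens("AT&T"))
print(token_set_sketch("at", "AT&T"))
print(token_set_sketch("We met at noon", "AT&T"))
print(token_set_sketch("see you at https://t.co/x", "AT&T"))
```

In this stand-in, a text sharing only the token `at` already clears the threshold of 60 used above, and a text that also contains a lone `t` (for example from a `t.co` link) matches perfectly. Short company names made of common words are therefore poor search keys for this kind of matching.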


In [586]:
# We reuse the approach used on AT&T on the following:
# Third inspection indicates we should inspect the following companies: 
problem_comp = ['Dollar General', 'E*Trade', 'Fox  Class A', 'General Dynamics', 'General Electric',\
                'General Mills', 'General Motors','HCA Healthcare', 'Southern Co.', 'Chubb Limited', 'CVS Health', 'Juniper Networks']
problem_tweets = {}
problem_list = {}
for comp in problem_comp:
    problem_list[comp] = [k for k,v in relevant_tweet.items() if v == str(comp)]
    problem_tweets[comp] = [df['text'][k] for k,v in relevant_tweet.items() if v == str(comp)]

# This step requires manually entering the firm key for inspection, since the full set of tweets
# is too much to show and consider at once. 
problem_tweets['General Motors']

#problem_list[1]

# Below are the conclusions regarding the relevance of the individual firm
# Manually looking through reveals:
# 
#Dollar General	Not working?
#E*Trade	Irrelevant
#Fox Class A	Relevant (considering fox news is part of Fox Class A)
#General Dynamics	Not working?
#General Electric	Irrelevant
#General Mills	Not working?
#General Motors	 Partially relevant
#HCA Healthcare	Irrelevant
#Southern Co.	Irrelevant
#Chubb Limited 	Irrelevant
#CVS Health 	 Irrelevant
#Juniper Networks 	 Irrelevant




['I am pleased to inform you that I have just named General/Secretary John F Kelly as White House Chief of Staff. He is a Great American....',
 'Great meeting with Ford CEO Mark Fields and General Motors CEO Mary Barra at the @WhiteHouse today. https://t.co/T0eIgO6LP8',
 'Today, it was an honor to have @UN\nSecretary-General @AntonioGuterres at the @WhiteHouse. Speaking for the U.S.A., we appreciate all you do! https://t.co/Sk0Jcazzxw',
 'When you give a crazed, crying lowlife a break, and give her a job at the White House, I guess it just didn’t work out. Good work by General Kelly for quickly firing that dog!',
 '“They were all in on it, clear Hillary Clinton and FRAME Donald Trump for things he didn’t do.” Gregg Jarrett on @foxandfriends  If we had a real Attorney General, this Witch Hunt would never have been started! Looking at the wrong people.',
 '"@BreitbartVideo: .@AnnCoulter: Trump Has Best Shot in General Election http://t.co/Vf6c5kvrcn via @IanHanchett http://t.co/GOQTWZhjA

In [588]:
# Based on the conclusions above we remove the following firms:
tweets_to_remove = []
irrelevant_comps = ['E*Trade', 'General Electric','HCA Healthcare', 'Southern Co.', 'Chubb Limited', 'CVS Health', 'Juniper Networks']
for comp in irrelevant_comps:
     for k,v in problem_list.items():
            if k==comp:
                tweets_to_remove.append(v)


for i in range(len(tweets_to_remove)):
     all(map(relevant_tweet.pop, tweets_to_remove[i]))

In [589]:
#Inspecting the relevant_tweet frame
df_rele=pd.Series(relevant_tweet)
df_rele = df_rele.to_frame()
df_rele.columns = ['Company']
df_rele.head(10)

# We check the reduced relevant_tweet frame for how many times each unique company is mentioned:
store_counts_rele = {}
for i in set(df_rele['Company']):
    count = 0
    for j in df_rele.index:
        if df_rele['Company'][j]==str(i):
            count+=1
    store_counts_rele[i]=count

store_counts_rele
#Inspection of these results indicates that the following companies produce too many false positives:
#Unum Group, Public Storage, Dollar, Best Buy Co. Inc, Robert Half, Paper, Dish Network, The Bank of New York mellon Corp, lam Research, ball C
#General Motors - we still haven't solved that these were only partially relevant (10%)

{'Tiffany & Co.': 4,
 'Comcast Corp.': 8,
 'Snap-on': 1,
 'The Cooper ': 6,
 'Wynn Resorts Ltd': 4,
 'A.O. Smith Corp': 2,
 'Jack Henry & Associates': 3,
 'Google': 26,
 'Lincoln ': 13,
 'Capital One ': 9,
 'Duke ': 3,
 'Lockheed Martin Corp.': 3,
 'Visa Inc.': 11,
 'Western Union Co': 12,
 'Discovery Inc. Class C': 3,
 'The Walt Disney ': 4,
 'UDR, Inc.': 2,
 'Dow Inc.': 17,
 'Nasdaq, Inc.': 6,
 'Block H&R': 7,
 "Macy's Inc.": 14,
 'Kimberly-Clark': 2,
 'Huntington Bancshares': 2,
 'Globe Life Inc.': 1,
 'Vulcan Materials': 1,
 'PNC  Services': 7,
 'Starbucks Corp.': 1,
 'Boston Properties': 5,
 'Under Armour Class C': 2,
 'Mettler Toledo': 6,
 'Harley-Davidson': 6,
 'Sealed Air': 3,
 'Tapestry, Inc.': 1,
 'Unum Group': 48,
 "Edison Int'l": 1,
 'Texas Instruments': 1,
 'Valero ': 1,
 'Exxon Mobil Corp.': 1,
 'Equity Residential': 7,
 'Alexandria Real Estate Equities': 1,
 'Home Depot': 1,
 'TripAdvisor': 5,
 'Intercontinental Exchange': 1,
 'Eastman Chemical': 5,
 'Public Storage': 72
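
The per-company counts can also be obtained without the nested loop. The sketch below uses `collections.Counter` on a made-up stand-in for `relevant_tweet` (the positions and matches are illustrative, not real results).

```python
from collections import Counter

# Toy stand-in for relevant_tweet: tweet position -> matched company
relevant_tweet_demo = {0: "Google", 4: "Visa Inc.", 9: "Google", 15: "Dow Inc."}

counts = Counter(relevant_tweet_demo.values())
print(counts.most_common(1))  # the most frequently matched company first
```

On the real dictionary, `Counter(relevant_tweet.values())` should give the same numbers as `store_counts_rele` in a single pass.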

In [590]:
# Fourth round of inspection indicates we should inspect the following companies: 
problem_comp4 = ['Unum Group', 'Public Storage', 'Dollar Tree', 'Best Buy Co. Inc.',\
                'Robert Half ','AvalonBay Communities, Inc.',\
                ' Paper','Waters ', 'Dish Network', 'The Bank of New York Mellon Corp.','Lam Research', 'Ball Corp']
problem_tweets4 = {}
problem_list4 = {}
for comp in problem_comp4:
    problem_list4[comp] = [k for k,v in relevant_tweet.items() if v == str(comp)]
    problem_tweets4[comp] = [df['text'][k] for k,v in relevant_tweet.items() if v == str(comp)]
    
problem_tweets4['Ball Corp']
#problem_list4

# Manually looking through reveals:
# Conclusions:
#Unum Group	Irrelevant (only group)
#Public Storage	Irrelevant (Only public)
#Dollar Tree	Irrelevant ( only dollar)
#Best Buy Co. Inc.	Irrelevant (only best or buy)
#Robert Half 	Irrelevant (Robert Mueller tweets)
#AvalonBay Communities, Inc.	Irrelevant (only communities)
#Paper	 Irrelevant (about newspapers)
#Waters	Irrelevant
#Dish Network.	Irrelevant (only network)
#The Bank of New York Mellon Corp. 	Irrelevant (only bank and new york)
#Lam Research 	 Irrelevant (only research)
#Ball Corp 	 Irrelevant



['At stake in this Election is whether we continue the extraordinary prosperity we have achieved - or whether we let the Radical Democrat Mob take a giant wrecking ball to our Country and our Economy! #JobsNotMobs https://t.co/POhRivI1BZ',
 'I wonder if Marshawn Lynch will now speak and call some  coach a moron for not allowing him to run the ball three times for one yard?',
 '"@ElianaBenador: #CPAC th crystal ball to see th future: True Conservative candidates will be sorely missing. None of them measures toTrump',
 'Placing the ball in the right position for the next shot is eighty percent of winning golf. -- Ben Hogan',
 '.@TMobile  You service is absolutely terrible - get on the ball!  @JohnLegere',
 '"@RockinJoe1: @realDonaldTrump your candidacy would hit the GOP like a wrecking ball! Total game changer"  Stay tuned!',
 'Tom Brady would have won if he was throwing a soccer ball. He is my friend and a total winner! @Patriots',
 '"@thatgirlflorida: @realDonaldTrump @ByronYork @CBSNe

In [591]:
#We remove tweets according to the conclusion to the search above
tweets_to_remove4 = []
irrelevant_comps4 = ['Unum Group', 'Public Storage', 'Dollar Tree', 'Best Buy Co. Inc.',\
                'Robert Half ','AvalonBay Communities, Inc.',\
                ' Paper','Waters ', 'Dish Network', 'The Bank of New York Mellon Corp.','Lam Research', 'Ball Corp']
for comp in irrelevant_comps4:
     for k,v in problem_list4.items():
            if k==comp:
                tweets_to_remove4.append(v)

for i in range(len(tweets_to_remove4)):
     all(map(relevant_tweet.pop, tweets_to_remove4[i]))
    

In [592]:
# The last thing we fix is the General Motors problem, where we found that some tweets were relevant
# and we were reluctant to remove them all
# Trying to identify the relevant General Motors tweets: 
problem_gen_id_tweet = {}
for k,v in relevant_tweet.items():
    if v == 'General Motors':
        problem_gen_id_tweet[k]=df['text'][k]

# After manual inspection we have determined the following keywords to separate relevant from irrelevant tweets

keywords = ['Barra', '@GM', 'Motors', 'G.M.']
save_gen = []    
for k, v in problem_gen_id_tweet.items():
          if any(key in v for key in keywords):
              save_gen.append(k)
            
all(map(problem_gen_id_tweet.pop, save_gen)) # Drop the relevant tweets from problem_gen_id_tweet, leaving only the irrelevant ones
all(map(relevant_tweet.pop, problem_gen_id_tweet)) # Remove the remaining (irrelevant) tweets from relevant_tweet



True
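
The keyword rule can be checked on two of the tweets shown in the General Motors inspection output above. The sketch below wraps the rule in a small helper; `mentions_gm` is a hypothetical name and the dictionary keys are arbitrary.

```python
KEYWORDS = ['Barra', '@GM', 'Motors', 'G.M.']

def mentions_gm(text, keywords=KEYWORDS):
    """True if the tweet contains any of the (case-sensitive) keywords."""
    return any(key in text for key in keywords)

# Two tweets from the inspection output; the keys are arbitrary
tweets = {
    1: 'Great meeting with Ford CEO Mark Fields and General Motors CEO Mary Barra at the @WhiteHouse today. https://t.co/T0eIgO6LP8',
    2: 'I am pleased to inform you that I have just named General/Secretary John F Kelly as White House Chief of Staff. He is a Great American....',
}

kept = {k: v for k, v in tweets.items() if mentions_gm(v)}
print(list(kept))  # positions of the tweets that stay attributed to General Motors
```

Because the match is case-sensitive and substring-based, a tweet writing "general motors" in lowercase would be dropped, which is worth keeping in mind when judging the remaining false negatives.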

In [595]:
# We pickle the resulting dictionary so that no one has to rerun the fuzzy search
with open("Final_tweet_GMfix.p", "wb") as f:
    pickle.dump(relevant_tweet, f)

In [596]:
tweets_final_GMfix = {}
for k,v in relevant_tweet.items():
    tweets_final_GMfix[df['text'][k]] = v

# We pickle the tweets as well
with open("tweets_final_GMfix.p", "wb") as f:
    pickle.dump(tweets_final_GMfix, f)

In [597]:
relevant_tweet_load = pickle.load( open( "Final_tweet_GMfix.p", "rb" ) )
tweets_final_GMfix_load = pickle.load(open("tweets_final_GMfix.p", "rb"))

In [598]:
relevant_tweet_load == relevant_tweet

True

In [599]:
tweets_final_GMfix_load == tweets_final_GMfix

True