# Notes

**Focus on**:
- Quality of data
- Log your data and calls when using data scraping
- Creativity
- Visualization (explanatory figures), simple is better
- Be critical of your data collection and generating process
    - Bias
    - Missing data
        - Ignore
        - Collect new data
        - Remove or replace missing data
    - Internal and external validity
    - Data collection type (random, survey, big data, other)
- Less focus on the analytical section and more on the collection and presentation

### Reflect on the ethical aspect
- Do you respect privacy? 
- Can single individuals be identified? 
- What are the potential consequences?
- Are there ethical considerations?
    - With respect to individuals? 
    - With respect to firms or organizations?
- Consider the GDPR:
    - Is it anonymous? 
    - Personal data or statistics?
    - Any chance of re-identification?

### Logging

- Log your calls, use it to determine success ratio
    - Where did the call fail? Rewrite code.
    - Don't be greedy. time.sleep(0.5) between each call.
- Visualize the log (lecture 10)
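
The logging advice above can be sketched with the standard library alone. The helper names `logged_call` and `success_ratio`, the record fields, and the `fetch` callable are illustrative assumptions; they are not the interface of `scraping_class.Connector`.

```python
import time

def logged_call(fetch, url, log, delay=0.5):
    """Call fetch(url), append one record to log, then pause for delay seconds."""
    start = time.time()
    record = {"url": url, "success": False, "error": None}
    result = None
    try:
        result = fetch(url)
        record["success"] = True
    except Exception as exc:  # record the failure instead of stopping the run
        record["error"] = repr(exc)
    record["elapsed"] = time.time() - start
    log.append(record)
    time.sleep(delay)  # be polite: wait between calls
    return result

def success_ratio(log):
    """Share of logged calls that succeeded (0.0 for an empty log)."""
    if not log:
        return 0.0
    return sum(r["success"] for r in log) / len(log)
```

In practice `fetch` could be any function that takes a URL and returns a response. The records where `success` is false show which calls failed and where the code may need rewriting.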

We start by importing our data source to Python. The file *tweets.json* is created from [the Trump Twitter Archive](http://www.trumptwitterarchive.com/archive). We have selected all tweets from 20 January 2017 (the day Trump assumed office) to 21 August 2019.

**DOCSTRING**

Below we provide our docstring to the data project.
We import the relevant packages, some of which will require installation through either *pip* or *conda*. 

In [2]:
'''
DOCSTRING:

This project analyzes Donald J. Trump's twitter data and presents a visual analysis of key elements.
It makes use of several packages, some of which should be installed via either pip or conda.
Executing the code cells will save files to the relative path of this Jupyter Notebook. 

'''

# Importing packages
import pandas as pd
import scraping_class, time, json
import numpy as np
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from IPython.display import display, HTML
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from afinn import Afinn
%matplotlib inline


We set up our connector to the relevant data source and log our connections in a file called *my_log*. 

We later use the log file to visualize our data connection attempts.

In [3]:
logfile = 'my_log'  # name of the log file
connector = scraping_class.Connector(logfile)
data = []
# Fetching data
for i in range(2015,2020):
    url = 'http://www.trumptwitterarchive.com/data/realdonaldtrump/'+str(i)+'.json'
    r, call_id = connector.get(url, 'Tweets')
    json_file = r.json() 
    data += json_file[::-1] # invert list
    time.sleep(0.5) # set sleep timer to prevent unintentional DOS attacks

In [4]:
# Creating and manipulating dataframes
# Main dataframe
df = pd.DataFrame(data)
date = [i[2]+i[1]+i[-1]+'-'+i[3] for i in df['created_at'].str.split(' ')] # slice date
df['datetime'] = date
df['datetime'] = pd.to_datetime(df['datetime'], format='%d%b%Y-%H:%M:%S') # format datetime 
#df = df[(df['datetime'] > '2017-01-20')] # filter by relevant date
df = df.query("is_retweet == False") # drop retweets
df = df.reset_index(drop=True).sort_values(by=['datetime']) # reset the index and sort chronologically
df.index = df['datetime']
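
As an alternative to slicing the date string by hand, the timestamp can be parsed in one step. This is a sketch that assumes `created_at` follows the layout implied by the slicing above (weekday, month, day, time, UTC offset, year); the format constant, the function name, and the sample string are made up for illustration.

```python
from datetime import datetime

# Assumed layout of created_at, e.g. "Fri Jan 20 17:00:00 +0000 2017"
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(raw):
    """Parse one created_at string into a timezone-aware datetime."""
    return datetime.strptime(raw, TWITTER_FORMAT)

# Illustrative input, not taken from the data set
sample = parse_created_at("Fri Jan 20 17:00:00 +0000 2017")
```

Unlike the sliced version, this keeps the UTC offset, so the resulting timestamps are timezone-aware. The same format string should also be usable as the `format` argument of `pd.to_datetime`.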

# Fuzzy Section

In [5]:
import bs4 as bs
import pickle
import requests
import fuzzywuzzy

def sp500_tickers_names():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    names = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        name  = row.findAll('td')[1].text
        tickers.append(ticker[:-1]) # removing \n from the ticker ID
        names.append(name)
        
    with open("sp500tickers.pickle","wb") as f:
        pickle.dump(tickers,f)
    with open("sp500names.pickle","wb") as f: # separate file, so the names do not overwrite the tickers
        pickle.dump(names,f)
        
    return tickers, names

#SP500 = sp500_tickers_names()[1]

# Scrape once and unpack both lists (the function returns tickers first, then names)
values_500, keys_500 = sp500_tickers_names()
sp500_dict = dict(zip(keys_500, values_500))
sp500_dict

{'3M Company': 'MMM',
 'Abbott Laboratories': 'ABT',
 'AbbVie Inc.': 'ABBV',
 'ABIOMED Inc': 'ABMD',
 'Accenture plc': 'ACN',
 'Activision Blizzard': 'ATVI',
 'Adobe Systems Inc': 'ADBE',
 'Advanced Micro Devices Inc': 'AMD',
 'Advance Auto Parts': 'AAP',
 'AES Corp': 'AES',
 'Affiliated Managers Group Inc': 'AMG',
 'AFLAC Inc': 'AFL',
 'Agilent Technologies Inc': 'A',
 'Air Products & Chemicals Inc': 'APD',
 'Akamai Technologies Inc': 'AKAM',
 'Alaska Air Group Inc': 'ALK',
 'Albemarle Corp': 'ALB',
 'Alexandria Real Estate Equities': 'ARE',
 'Alexion Pharmaceuticals': 'ALXN',
 'Align Technology': 'ALGN',
 'Allegion': 'ALLE',
 'Allergan, Plc': 'AGN',
 'Alliance Data Systems': 'ADS',
 'Alliant Energy Corp': 'LNT',
 'Allstate Corp': 'ALL',
 'Alphabet Inc Class A': 'GOOGL',
 'Alphabet Inc Class C': 'GOOG',
 'Altria Group Inc': 'MO',
 'Amazon.com Inc.': 'AMZN',
 'Amcor plc': 'AMCR',
 'Ameren Corp': 'AEE',
 'American Airlines Group': 'AAL',
 'American Electric Power': 'AEP',
 'American Exp

In [7]:
# Adding names that differ a lot from their official names

# missing_names = ['Google': 'AIG', 'JPMorgan', 'Microsoft'] 
# for names in missing_names:
#     SP500.append(names) 
missing_names = {'Google':'GOOGL', 'AIG':'AIG', 'JPMorgan':'JPM', 'Microsoft':'MSFT'}
sp500_dict.update(missing_names)

In [8]:
#Running the first fuzzy search 
from fuzzywuzzy import fuzz
#################################################################################
# Warning: may take a long time (10**7 iterations), approx. 35 min              #
#################################################################################
# Refer instead to the pickled file above
# Second run-through is on reduced firm sample
from tqdm import tqdm
relevant_tweet = {}
store_count = {}
for k,v in tqdm(sp500_dict.items()):
    count_sp = 0
    for i in range(len(df['text'])):
        if fuzz.token_set_ratio(df['text'][i], k)>=60: 
            count_sp +=1
            relevant_tweet[i]=[k,v] 
    store_count[k]=count_sp 

100%|████████████████████████████████████████████████████████████████████████████████| 509/509 [33:14<00:00,  2.89s/it]


In [9]:
# Inspecting the data frame to see what firms might be problematic
df_sc=pd.Series(store_count)
df_sc = df_sc.to_frame()
df_sc.columns = ['count']
df_sc.query('count >50')

| Company | count |
|---|---|
| American Express Co | 205 |
| AT&T Inc. | 695 |
| Bank of America Corp | 331 |
| The Bank of New York Mellon Corp. | 64 |
| Best Buy Co. Inc. | 67 |
| Cimarex Energy | 88 |
| Cincinnati Financial | 53 |
| CMS Energy | 88 |
| The Cooper Companies | 139 |
| Devon Energy | 88 |


In [15]:
## Used for second round of fuzzy search:
zero_count_firms = list(df_sc.query("count == 0").index) # Select the companies with no matching tweets
# Drop those companies from the search dictionary, keeping only firms with at least one match
for firm in zero_count_firms:
    sp500_dict.pop(firm)

In [22]:
# Further changes to the dictionary:
# The inspection of the data revealed some problematic companies that had too many false positives
# AT&T, Energy, Companies, Trade, facebook, resources, general, healthcare, international, Lincoln, national, technology, public, southern,

# We try removing parts of company names that have inherent meaning - the idea is to leave the names of the companies
# without the word that has inherent meaning - i.e. to search for Ameriprise rather than Ameriprise Financial
# This approach is not without its weaknesses - but right now we are getting too many false positives


############################## This code was for testing the approach #############################
#Initially we create a list of firms affected by this modification:
# part_remove = []
# for i in range(len(SP500)):
#     if any(j in SP500[i].lower() for j in problematic_names):
#         part_remove.append(SP500[i])

# #We then go over all the firms removing the problematic parts of their names:        
# for part in part_remove:
#       for problems in problematic_names:
#              if problems in part:
#                     part_remove[part_remove.index(part)]=part.replace(problems, '')
##################################################################################################

# Removing the problematic parts 
# AT&T, Energy, Companies, Trade, facebook, resources, general, healthcare, international, Lincoln, national, technology, public, southern,
problematic_names = ['Financial', 'Energy', 'Companies','Corporation', 'Resources', 'National', 'Technology', 'International', 'Company', 'Technologies']
# for firm in SP500:
#       for problems in problematic_names:
#              if problems in firm:
#                     SP500[SP500.index(firm)]=firm.replace(problems, '')    

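for key in list(sp500_dict): # iterate over a copy of the keys, since the dict is modified in the loop
    new_key = key
    for problems in problematic_names:
        new_key = new_key.replace(problems, "")
    if new_key != key:
        sp500_dict[new_key] = sp500_dict.pop(key)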

sp500_dict
        

{'AES Corp': 'AES',
 'Alexandria Real Estate Equities': 'ARE',
 'Amazon.com Inc.': 'AMZN',
 'American Airlines Group': 'AAL',
 'American Electric Power': 'AEP',
 'American Express Co': 'AXP',
 'American Tower Corp.': 'AMT',
 'Anthem': 'ANTM',
 'A.O. Smith Corp': 'AOS',
 'Apple Inc.': 'AAPL',
 'Applied Materials Inc.': 'AMAT',
 'Arista Networks': 'ANET',
 'Arthur J. Gallagher & Co.': 'AJG',
 'AT&T Inc.': 'T',
 'AvalonBay Communities, Inc.': 'AVB',
 'Ball Corp': 'BLL',
 'Bank of America Corp': 'BAC',
 'The Bank of New York Mellon Corp.': 'BK',
 'Best Buy Co. Inc.': 'BBY',
 'Block H&R': 'HRB',
 'Boston Properties': 'BXP',
 'Broadcom Inc.': 'AVGO',
 'Cabot Oil & Gas': 'COG',
 'Capri Holdings': 'CPRI',
 'Caterpillar Inc.': 'CAT',
 'CBRE Group': 'CBRE',
 'CBS Corp.': 'CBS',
 'Charter Communications': 'CHTR',
 'Chubb Limited': 'CB',
 'Church & Dwight': 'CHD',
 'Cisco Systems': 'CSCO',
 'Citrix Systems': 'CTXS',
 'Comcast Corp.': 'CMCSA',
 'Comerica Inc.': 'CMA',
 'Conagra Brands': 'CAG',
 'Co
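
Plain `str.replace` leaves stray whitespace behind (names such as `'Franklin '` and `'PNC  Services'` appear in the outputs further down). A possible refinement, sketched here under assumptions, removes the generic words only as whole words and then normalizes the whitespace. The function name is hypothetical, and the input names in the comments are inferred from the cleaned output rather than taken from the scraped list.

```python
import re

GENERIC_WORDS = ['Financial', 'Energy', 'Companies', 'Corporation', 'Resources',
                 'National', 'Technology', 'International', 'Company', 'Technologies']

def strip_generic_words(name, words=GENERIC_WORDS):
    """Remove generic words (whole words only) and collapse leftover whitespace."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    cleaned = re.sub(pattern, " ", name)
    return " ".join(cleaned.split())  # collapses double spaces and trims the ends

# Inferred example inputs: 'PNC Financial Services' and 'Franklin Resources'
cleaned_examples = [strip_generic_words(n) for n in ("PNC Financial Services", "Franklin Resources")]
```

Cleaner keys matter here because the later steps compare company names with `==`, where a trailing space is easy to miss.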

In [746]:
# neg = {'Financial Storage':'FS', 'Energy Enterprise': 'EE', 'Business Factory':'BF'}
# problematic_names_test = ['Financial', 'Energy', 'Companies','Corporation', 'Resources', 'National', 'Technology', 'International', 'Company', 'Technologies']


# for key in neg:
#     for problems in problematic_names:
#         if problems in key:
#             neg[key.replace(problems, "")] = neg.pop(key)
# # for k,v in neg.items():
# #     for problems in problematic_names:
# #         if problems in v[0]:
# #             v[0]=v[0].replace(problems,"")
            

# neg

# Misguided attempt - works for replacing within relevant_tweet, which has a list as its dictionary value.
# for k,v in sp500_dict.items():
#     for problems in problematic_names:
#         if problems in v[0]:
#             v[0] = v[0].replace(problems,"")

            


In [23]:
# We run the fuzzy search again on the data set without the companies that had no matches,
# and with the problematic parts of company names removed
####################################################################################################################
# Warning: may take a long time (approx. 15 min), though it covers 237 companies compared to 509 in the first run #
####################################################################################################################
# Refer instead to the pickled file in the hand-in
# This second run-through is on the reduced firm sample
relevant_tweet = {}
store_count = {}
for k,v in tqdm(sp500_dict.items()):
    count_sp = 0
    for i in range(len(df['text'])):
        if fuzz.token_set_ratio(df['text'][i], k)>=60: 
            count_sp +=1
            relevant_tweet[i]=[k,v] 
    store_count[k]=count_sp     

100%|████████████████████████████████████████████████████████████████████████████████| 237/237 [14:30<00:00,  3.39s/it]


In [24]:
# Inspecting the data frame to see what firms might be problematic
df_sc=pd.Series(store_count)
df_sc = df_sc.to_frame()
df_sc.columns = ['count']
df_sc.query('count >50')

| Company | count |
|---|---|
| American Express Co | 205 |
| AT&T Inc. | 695 |
| Bank of America Corp | 331 |
| The Bank of New York Mellon Corp. | 64 |
| Best Buy Co. Inc. | 67 |
| Dollar General | 177 |
| E*Trade | 320 |
| General Dynamics | 135 |
| General Electric | 144 |
| General Mills | 134 |


In [25]:
# After this second run-through we inspect relevant_tweet and remove entries directly from it
# A more correct approach would be to change the SP500 names we search over and redo the search
# But due to time constraints, that computationally costly approach is infeasible

# Some companies get a huge number of false positives because their names have inherent meaning:
# After visual inspection of the fuzzy search results, we have added the following exceptions:
exeptions = ['American', 'Anthem', 'America', 'United']

# We build a dict so we can inspect how many tweets we remove and check which companies are affected
# remove_ex = {}
# for i, j in relevant_tweet.items():
#         if any(ex in relevant_tweet[i].lower() for ex in exeptions):
#             remove_ex[i]=j
remove_ex = {}
for i, j in relevant_tweet.items():
        if any(ex in j[0] for ex in exeptions):
            remove_ex[i]=j[0]
#display(remove_ex[0:100]) -> removed firms like: united, packaging of america, American express, american airlines.
#Then we remove these tweets from the relevant_tweets using the pop function
all(map( relevant_tweet.pop, remove_ex))

True

In [26]:
# After removing the exceptions we inspect relevant_tweet again and find a suspiciously large number of AT&T tweets:
# Looking at AT&T
ATT_tweets=[k for k,v in relevant_tweet.items() if v[0] == 'AT&T Inc.']
att_tweets = []
#Inserting tweet ID in df['text'] to call tweets
for i in ATT_tweets:
    att_tweets.append(df['text'][i])
    
display(att_tweets[0:5]) # Reading over contents
# We discover the following issue:
print(fuzz.token_set_ratio('Today, I was thrilled to host the @WWP Soldier Ride once again at the @WhiteHouse. We were all deeply honored to be in the presence of TRUE AMERICAN HEROES....https://t.co/q6D5875xCw', 'AT&T'))
print(fuzz.token_set_ratio('at', 'AT&T'))
#Removing AT&T tweets - since every tweet with the word 'at' will get 100 pct. match
print(len(relevant_tweet)) #Checking length before removing
print(len(ATT_tweets)) # Amount to be removed
all(map(relevant_tweet.pop, ATT_tweets)) #Removing using the pop function over the dict relevant_tweet
print(len(relevant_tweet)) # Check if result adds up


['The NFL National Anthem Debate is alive and well again - can’t believe it! Isn’t it in contract that players must stand at attention, hand on heart? The $40,000,000 Commissioner must now make a stand. First time kneeling, out for game. Second time kneeling, out for season/no pay!',
 '.@ApprenticeNBC season premiere this Sunday at 9/8c on @NBC- http://t.co/GD0xOaphnN',
 "By popular demand I will be tweeting during tomorrow's record 14th season premiere of @ApprenticeNBC on @nbc at 9/8c http://t.co/MiVcT6zjEk",
 '#CelebApprentice Who will hear those two famous words? @Apprenticenbc premieres tomorrow at 9/8c on NBC. http://t.co/Z6XE3Ngz6D',
 '"@llluminatedOnes: Watch #CelebApprentice tomorrow at 9pm ET on @NBC because @RealDonaldTrump said so! @ApprenticeNBC http://t.co/pm0b8cMJfn']

100
100
2313
568
1745
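
The behaviour seen above can be reproduced with a simplified re-implementation of the token-set idea. This is a sketch built on `difflib` from the standard library, not fuzzywuzzy's own code, so its scores may differ from `fuzz.token_set_ratio`. The function names and the sample tweets are made up for illustration.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Lowercase, turn non-alphanumerics into spaces, and return the set of tokens."""
    return set(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_score(s1, s2):
    """Compare the shared tokens against each string's shared-plus-unique tokens."""
    t1, t2 = tokens(s1), tokens(s2)
    if not t1 or not t2:
        return 0
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(ratio(common, combined1), ratio(common, combined2), ratio(combined1, combined2))

# 'AT&T' is reduced to the tokens {'at', 't'}
score_word = token_set_score("at", "AT&T")
score_link = token_set_score("see you at 9 https://t.co/abc", "AT&T")  # made-up tweet
score_plain = token_set_score("meet me at noon", "AT&T")               # made-up tweet
```

In this sketch a tweet that contains both the word "at" and a `t.co` link shares every token of the company name, and a tweet that only contains "at" still scores above the threshold of 60 used in the search. Short names made of common words are therefore poor search keys for this kind of matching.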


In [27]:
# We reuse the approach used on AT&T on the following:
# Third inspection indicates we should inspect the following companies: 
problem_comp = ['Dollar General', 'E*Trade', 'Fox  Class A', 'General Dynamics', 'General Electric',\
                'General Mills', 'General Motors','HCA Healthcare', 'Southern Co.', 'Chubb Limited', 'CVS Health', 'Juniper Networks']
problem_tweets = {}
problem_list = {}
for comp in problem_comp:
    problem_list[comp] = [k for k,v in relevant_tweet.items() if v[0] == str(comp)]
    problem_tweets[comp] = [df['text'][k] for k,v in relevant_tweet.items() if v[0] == str(comp)]

# This step requires manually entering the firm key for inspection, since the full set of tweets
# is too much to show and review at once.
problem_tweets['General Motors']

#problem_list[1]

# Below are the conclusions regarding the relevance of the individual firm
# Manually looking through reveals:
# 
#Dollar General	Not working?
#E*Trade	Irrelevant
#Fox Class A	Relevant (considering fox news is part of Fox Class A)
#General Dynamics	Not working?
#General Electric	Irrelevant
#General Mills	Not working?
#General Motors	 Partially relevant
#HCA Healthcare	Irrelevant
#Southern Co.	Irrelevant
#Chubb Limited 	Irrelevant
#CVS Health 	 Irrelevant
#Juniper Networks 	 Irrelevant




['Great meeting with Ford CEO Mark Fields and General Motors CEO Mary Barra at the @WhiteHouse today. https://t.co/T0eIgO6LP8',
 'Today, it was an honor to have @UN\nSecretary-General @AntonioGuterres at the @WhiteHouse. Speaking for the U.S.A., we appreciate all you do! https://t.co/Sk0Jcazzxw',
 'When you give a crazed, crying lowlife a break, and give her a job at the White House, I guess it just didn’t work out. Good work by General Kelly for quickly firing that dog!',
 '“They were all in on it, clear Hillary Clinton and FRAME Donald Trump for things he didn’t do.” Gregg Jarrett on @foxandfriends  If we had a real Attorney General, this Witch Hunt would never have been started! Looking at the wrong people.',
 '"@BreitbartVideo: .@AnnCoulter: Trump Has Best Shot in General Election http://t.co/Vf6c5kvrcn via @IanHanchett http://t.co/GOQTWZhjAM"',
 '"@realDBP: @realDonaldTrump @MenOfHistory The Art of the Deal is still the best book ever written on business and life in general!" Than

In [28]:
# Based on the conclusions above we remove the following firms:
irrelevant_comps = ['E*Trade', 'General Electric','HCA Healthcare', 'Southern Co.', 'Chubb Limited', 'CVS Health', 'Juniper Networks']
# Drop every tweet index collected for these firms from relevant_tweet
for comp in irrelevant_comps:
    for idx in problem_list.get(comp, []):
        relevant_tweet.pop(idx)

In [30]:
#Inspecting the relevant_tweet frame
df_rele=pd.Series(relevant_tweet)
df_rele = df_rele.to_frame()
df_rele.columns = ['Company']
df_rele.head(10)

# We check how many times each unique company is mentioned in the reduced relevant_tweet frame:
#store_counts_rele = {}
# for i in set(df_rele['Company']):
#     count = 0
#     for j in df_rele.index:
#         if df_rele['Company'][j]==str(i):
#             count+=1
#     store_counts_rele[i]=count

# store_counts_rele


# Version for relevant_tweet in dict form
store_counts_rele = {}
company_unique = []
for k, v in relevant_tweet.items():
    company_unique.append(v[0])


for i in set(company_unique):
    count = 0
    for j in df_rele.index:
        if df_rele['Company'][j][0]==str(i):
            count+=1
    store_counts_rele[i]=count

store_counts_rele
    

#Inspection of these results indicates that the following companies produce too many false positives:
#Unum Group, Public Storage, Dollar, Best Buy Co. Inc, Robert Half, Paper, Dish Network, The Bank of New York mellon Corp, lam Research, ball C
#General Motors - we still haven't solved that these were only partially relevant (10%)

{'Vulcan Materials': 1,
 'Lockheed Martin Corp.': 2,
 'Home Depot': 1,
 'Nasdaq, Inc.': 6,
 'T-Mobile US': 14,
 'Johnson Controls ': 13,
 'Morgan Stanley': 7,
 'Skyworks Solutions': 10,
 'Yum! Brands Inc': 1,
 'Franklin ': 15,
 'Kellogg Co.': 1,
 'Goldman Sachs Group': 7,
 'Cincinnati ': 14,
 'Western Union Co': 10,
 'Southwest Airlines': 4,
 'Kinder Morgan': 1,
 'Apple Inc.': 8,
 'Phillips 66': 1,
 'Robert Half ': 65,
 'Sempra ': 1,
 'Lam Research': 14,
 'Unum Group': 41,
 'Target Corp.': 4,
 'Realty Income ': 9,
 'Boeing ': 9,
 "McDonald's Corp.": 1,
 'Total System Services': 2,
 'Huntington Bancshares': 2,
 'Tyson Foods': 4,
 'Nordstrom': 2,
 'The Cooper ': 7,
 'Kansas City Southern': 2,
 'Public Storage': 57,
 'Amazon.com Inc.': 21,
 'Pfizer Inc.': 2,
 'PNC  Services': 9,
 'Dover Corp.': 1,
 'Texas Instruments': 1,
 'Dollar Tree': 31,
 'Broadcom Inc.': 1,
 'Ball Corp': 12,
 'General Motors': 106,
 'Best Buy Co. Inc.': 54,
 'Progressive Corp.': 2,
 'Rollins Inc.': 2,
 'The Bank of N
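
The same per-company counts can be obtained in a single pass with `collections.Counter`. This is a sketch on toy data in the shape of `relevant_tweet`; the function name and the toy entries are illustrative assumptions.

```python
from collections import Counter

def count_mentions(relevant_tweet):
    """Count how many matched tweets each company name has."""
    return Counter(company for company, _ticker in relevant_tweet.values())

# Toy data in the same shape as relevant_tweet: {tweet position: [company, ticker]}
toy = {0: ["Apple Inc.", "AAPL"], 5: ["Boeing ", "BA"], 9: ["Apple Inc.", "AAPL"]}
counts = count_mentions(toy)
top = counts.most_common(1)  # the company with the most matched tweets
```

Sorting with `most_common` puts the firms with suspiciously many matches first, which is the set that needs manual inspection.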

In [31]:
# Fourth round of inspection indicates we should inspect the following companies: 
problem_comp4 = ['Unum Group', 'Public Storage', 'Dollar Tree', 'Best Buy Co. Inc.',\
                'Robert Half ','AvalonBay Communities, Inc.',\
                ' Paper','Waters ', 'Dish Network', 'The Bank of New York Mellon Corp.','Lam Research', 'Ball Corp']
problem_tweets4 = {}
problem_list4 = {}
for comp in problem_comp4:
    problem_list4[comp] = [k for k,v in relevant_tweet.items() if v[0] == str(comp)]
    problem_tweets4[comp] = [df['text'][k] for k,v in relevant_tweet.items() if v[0] == str(comp)]
    
problem_tweets4['Ball Corp']
#problem_list4

# Manually looking through reveals:
# Conclusions:
#Unum Group	Irrelevant (only group)
#Public Storage	Irrelevant (Only public)
#Dollar Tree	Irrelevant ( only dollar)
#Best Buy Co. Inc.	Irrelevant (only best or buy)
#Robert Half 	Irrelevant (Robert Mueller tweets)
#AvalonBay Communities, Inc.	Irrelevant (only communities)
#Paper	 Irrelevant (about newspapers)
#Waters	Irrelevant
#Dish Network.	Irrelevant (only network)
#The Bank of New York Mellon Corp. 	Irrelevant (only bank and new york)
#Lam Research 	 Irrelevant (only research)
#Ball Corp 	 Irrelevant



['At stake in this Election is whether we continue the extraordinary prosperity we have achieved - or whether we let the Radical Democrat Mob take a giant wrecking ball to our Country and our Economy! #JobsNotMobs https://t.co/POhRivI1BZ',
 'I wonder if Marshawn Lynch will now speak and call some  coach a moron for not allowing him to run the ball three times for one yard?',
 '"@ElianaBenador: #CPAC th crystal ball to see th future: True Conservative candidates will be sorely missing. None of them measures toTrump',
 'Placing the ball in the right position for the next shot is eighty percent of winning golf. -- Ben Hogan',
 '.@TMobile  You service is absolutely terrible - get on the ball!  @JohnLegere',
 '"@RockinJoe1: @realDonaldTrump your candidacy would hit the GOP like a wrecking ball! Total game changer"  Stay tuned!',
 'Tom Brady would have won if he was throwing a soccer ball. He is my friend and a total winner! @Patriots',
 '"@thatgirlflorida: @realDonaldTrump @ByronYork @CBSNe

In [32]:
#We remove tweets according to the conclusion to the search above
irrelevant_comps4 = ['Unum Group', 'Public Storage', 'Dollar Tree', 'Best Buy Co. Inc.',\
                'Robert Half ','AvalonBay Communities, Inc.',\
                ' Paper','Waters ', 'Dish Network', 'The Bank of New York Mellon Corp.','Lam Research', 'Ball Corp']
# Drop every tweet index collected for these firms from relevant_tweet
for comp in irrelevant_comps4:
    for idx in problem_list4.get(comp, []):
        relevant_tweet.pop(idx)
    

In [33]:
# The last thing we fix is the General Motors problem, where we found that some tweets were relevant
# and were reluctant to remove them all
# Trying to identify the relevant General Motors tweets: 
problem_gen_id_tweet = {}
for k,v in relevant_tweet.items():
    if v[0] == 'General Motors':
        problem_gen_id_tweet[k]=df['text'][k]

# After manual inspection we have determined the following keywords to filter out irrelevant tweets

keywords = ['Barra', '@GM', 'Motors', 'G.M.']
save_gen = []    
for k, v in problem_gen_id_tweet.items():
          if any(key in v for key in keywords):
              save_gen.append(k)
            
all(map(problem_gen_id_tweet.pop, save_gen)) #Removing the relevant tweets from our problem_GM_tweets
all(map(relevant_tweet.pop, problem_gen_id_tweet)) #Removing the irrelevant tweets



True
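
The keyword filter above can also be written as a small function that returns both groups, which makes it easy to check what is kept and what is dropped. This is a sketch; the function name and the toy tweets are made up for illustration.

```python
def split_by_keywords(tweets, keywords):
    """Split {id: text} into (kept, dropped) by whether any keyword occurs in the text."""
    kept, dropped = {}, {}
    for tweet_id, text in tweets.items():
        target = kept if any(key in text for key in keywords) else dropped
        target[tweet_id] = text
    return kept, dropped

keywords = ['Barra', '@GM', 'Motors', 'G.M.']
# Made-up tweets for illustration only
toy_tweets = {
    1: "Meeting with General Motors today",
    2: "The Attorney General spoke this morning",
    3: "Thank you @GM for the new plant",
}
kept, dropped = split_by_keywords(toy_tweets, keywords)
```

The match is case-sensitive, as in the cell above, so a lowercase "motors" would not be kept.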

In [35]:
# We pickle the resulting dictionary so that no one has to rerun the fuzzy search
pickle.dump(relevant_tweet, open( "Final_tweet_with_tickers.p", "wb" ) ) 

In [36]:
tweets_final_GMfix = {}
for k,v in relevant_tweet.items():
    tweets_final_GMfix[df['text'][k]] = v

# We pickle the tweets as well    
pickle.dump(tweets_final_GMfix, open( "tweets_final_with_tickers.p", "wb" ) ) 

In [597]:
# Load the files pickled above to check that they round-trip correctly
relevant_tweet_load = pickle.load( open( "Final_tweet_with_tickers.p", "rb" ) )
tweets_final_GMfix_load = pickle.load(open("tweets_final_with_tickers.p", "rb"))

In [598]:
relevant_tweet_load == relevant_tweet

True

In [599]:
tweets_final_GMfix_load == tweets_final_GMfix

True
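
Since the fuzzy search takes a long time, the save-and-reload pattern above can be wrapped in a small cache helper. This is a sketch with a hypothetical function name, `load_or_compute`; it reuses a pickled result when the file exists and only runs the expensive step otherwise.

```python
import os
import pickle

def load_or_compute(path, compute):
    """Return the object pickled at path, or compute it, pickle it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:  # the context manager closes the file after reading
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Here `compute` would be a function that runs the fuzzy search and returns `relevant_tweet`. Pickle files should only be loaded when they come from a trusted source, because unpickling can execute arbitrary code.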

In [37]:
get_index_comp = []
get_index_time = []
get_index_ticker = []
for k,v in relevant_tweet.items():
    get_index_comp.append(v[0])
    get_index_ticker.append(v[1])
    get_index_time.append(df.index[k])

In [671]:
# get_index_time = []
# for k in relevant_tweet:
#     get_index_time.append(df.index[k])
    
# get_index_time

In [38]:
d_relevant = {'Timestamp': get_index_time, 'Company':get_index_comp, 'Ticker': get_index_ticker}
df_relevant = pd.DataFrame(d_relevant)
df_relevant

| | Timestamp | Company | Ticker |
|---|---|---|---|
| 0 | 2015-09-29 15:20:39 | V.F. Corp. | VFC |
| 1 | 2019-06-16 22:55:55 | Alexandria Real Estate Equities | ARE |
| 2 | 2015-01-01 00:09:47 | Amazon.com Inc. | AMZN |
| 3 | 2015-04-25 13:48:54 | Amazon.com Inc. | AMZN |
| 4 | 2015-12-07 15:08:20 | Amazon.com Inc. | AMZN |
| 5 | 2015-12-23 14:55:24 | Amazon.com Inc. | AMZN |
| 6 | 2017-06-28 13:06:14 | Amazon.com Inc. | AMZN |
| 7 | 2017-07-22 10:33:01 | Amazon.com Inc. | AMZN |
| 8 | 2017-07-23 23:57:36 | Amazon.com Inc. | AMZN |
| 9 | 2017-07-25 02:23:18 | Amazon.com Inc. | AMZN |


In [39]:
pickle.dump(df_relevant, open( "df_relevant.p", "wb" ) ) 