# Notes

**Focus on**:
- Quality of data
- Log your data and your calls when scraping data
- Creativity
- Visualization (explanatory figures), simple is better
- Be critical of your data collection and generating process
    - Bias
    - Missing data
        - Ignore
        - Collect new data
        - Remove or replace missing data
    - Internal and external validity
    - Data collection type (random, survey, big data, other)
- Less focus on the analytical section and more on the collection and presentation
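
The missing-data options listed above (ignore, remove, replace) can be sketched with pandas. This is a minimal illustration on made-up numbers; the column names `price` and `volume` and the choice of the median as replacement value are assumptions, not part of the project.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns with gaps
df = pd.DataFrame({
    "price": [10.0, None, 12.0, 13.0],
    "volume": [100.0, 110.0, None, 130.0],
})

# Quantify the problem first: share of missing values per column
missing_share = df.isna().mean()

# Option 1: remove every row with at least one missing value
dropped = df.dropna()

# Option 2: replace missing values, here with the column median
filled = df.fillna(df.median())

print(missing_share)
print(len(df), len(dropped))
print(filled)
```

Whichever option is chosen, reporting the share of missing values alongside the result makes the choice easier to criticize later.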

### Reflect on the ethical aspect
- Do you respect privacy?
- Can single individuals be identified?
- What are the potential consequences?
- Are there ethical considerations?
    - With respect to individuals?
    - With respect to firms or organizations?
- Consider the GDPR:
    - Is it anonymous?
    - Personal data or statistics?
    - Any chance of re-identification?
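
One rough way to approach the re-identification question is to count how many rows are unique on a set of quasi-identifiers. The sketch below uses an invented table; the columns `zip_code`, `birth_year` and `gender` are assumptions chosen for illustration, and a clean result here is not a guarantee of anonymity under the GDPR.

```python
import pandas as pd

# Hypothetical survey extract; no names, but several quasi-identifiers
people = pd.DataFrame({
    "zip_code": ["2100", "2100", "2200", "2200", "2300"],
    "birth_year": [1990, 1990, 1985, 1985, 1972],
    "gender": ["F", "F", "M", "M", "F"],
})

quasi_identifiers = ["zip_code", "birth_year", "gender"]

# Size of each group of rows sharing the same quasi-identifier values
group_sizes = people.groupby(quasi_identifiers).size()

# Groups of size 1 describe exactly one person and may allow re-identification
unique_rows = int((group_sizes == 1).sum())
smallest_group = int(group_sizes.min())

print(unique_rows, smallest_group)
```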

### Project ideas

- Factor-based investing (use the Google or Yahoo Finance API)
- Correlation between Trump's tweets and the stock market
- Don't use LinkedIn


### Logging

- Log your calls and use the log to determine the success ratio
    - Where did the call fail? Rewrite the code.
    - Don't be greedy: use `time.sleep(0.5)` between calls.
- Visualize the log (lecture 10)
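
The logging points above can be sketched without any scraping library. The helper names `logged_call`, `success_ratio` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP request so that the example runs offline; in a real project the delay would stay at 0.5 seconds.

```python
import time

def logged_call(fetch, url, log, delay=0.5):
    """Call fetch(url), append one entry to log, then pause for delay seconds."""
    start = time.time()
    try:
        result = fetch(url)
        success, error = True, ""
    except Exception as exc:  # record the failure instead of stopping the loop
        result, success, error = None, False, repr(exc)
    log.append({"url": url, "success": success, "error": error,
                "seconds": time.time() - start})
    time.sleep(delay)  # don't be greedy
    return result

def success_ratio(log):
    """Share of logged calls that succeeded (0.0 for an empty log)."""
    return sum(entry["success"] for entry in log) / len(log) if log else 0.0

# Stand-in for a real HTTP request so the sketch runs offline
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("simulated failure")
    return "payload from " + url

log = []
for url in ["page/1", "page/bad", "page/3", "page/4"]:
    logged_call(fake_fetch, url, log, delay=0)

failed = [entry["url"] for entry in log if not entry["success"]]
print(success_ratio(log), failed)
```

The list of failed URLs answers "where did the call fail?", and the logged durations give something to plot when visualizing the log.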

We start by importing our data source into Python. The file *tweets.json* was created from [the Trump Twitter Archive](http://www.trumptwitterarchive.com/archive). We selected all tweets from 20 January 2017 (the day Trump assumed office) to 21 August 2019.
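
The date window described above can be applied explicitly once the `created_at` strings are parsed. This is a sketch on three made-up rows that follow the archive's timestamp layout; the frame name `sample` and the `timestamp` column are assumptions.

```python
import pandas as pd

# Three made-up rows in the archive's created_at format
sample = pd.DataFrame({
    "created_at": [
        "Sun Jan 01 05:00:10 +0000 2017",
        "Fri Jan 20 17:51:25 +0000 2017",
        "Thu Aug 22 01:00:00 +0000 2019",
    ],
    "text": ["before", "inside", "after"],
})

# Parse the timestamp; %z reads the +0000 offset and utc=True makes the result UTC-aware
sample["timestamp"] = pd.to_datetime(
    sample["created_at"], format="%a %b %d %H:%M:%S %z %Y", utc=True
)

start = pd.Timestamp("2017-01-20", tz="UTC")
end = pd.Timestamp("2019-08-22", tz="UTC")  # exclusive, so 21 August is included in full

in_window = sample[(sample["timestamp"] >= start) & (sample["timestamp"] < end)]
print(in_window["text"].tolist())
```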

In [None]:
# CODE BIN:
# # Loading dataset from file using a relative path
# with codecs.open('tweets.json', 'r', 'utf-8-sig') as json_file:  
#     data = json.load(json_file)
#     df=pd.DataFrame(data) # converting to a pandas data frame
    

In [98]:
# Importing packages
import json, codecs
import time

import pandas as pd
import scraping_class  # helper module that provides the logging Connector

logfile = 'my_log'  # name of your log file
connector = scraping_class.Connector(logfile)
data = []
# Fetching one JSON file per year, 2017 through 2019
for year in range(2017, 2020):
    url = 'http://www.trumptwitterarchive.com/data/realdonaldtrump/' + str(year) + '.json'
    r, call_id = connector.get(url, 'Tweets')
    data += r.json()[::-1]  # reverse the order of the list
    time.sleep(0.5)  # pause between calls to avoid overloading the server


In [144]:
# Creating and manipulating dataframe
df = pd.DataFrame(data)
keywords = ['china', 'tariff', 'mexico', 'europe', 'billions', 'trade-war', 'war']  # list of relevant keywords
# search for keywords while ignoring word case (upper/lower)
df['keyword'] = df['text'].str.contains('|'.join(keywords), case=False)
# remove retweets and non-relevant tweets; .copy() avoids SettingWithCopyWarning below
df_filter = df.query("is_retweet == False & keyword == True").copy()
# build e.g. '01Jan2017' from 'Sun Jan 01 05:00:10 +0000 2017'
date = [i[2] + i[1] + i[-1] for i in df_filter['created_at'].str.split(' ')]
df_filter['date'] = pd.to_datetime(date, format='%d%b%Y')
# sort by date and renumber the rows
df_filter = df_filter.sort_values(by=['date']).reset_index(drop=True)
df_filter




| | created_at | favorite_count | id_str | in_reply_to_user_id_str | is_retweet | retweet_count | source | text | keyword | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sun Jan 01 05:00:10 +0000 2017 | 126230 | 815422340540547073 | | False | 32665 | Twitter for iPhone | TO ALL AMERICANS-\n#HappyNewYear &amp; many bl... | True | 2017-01-01 |
| 1 | Mon Jan 02 23:47:12 +0000 2017 | 64480 | 816068355555815424 | | False | 17507 | Twitter for Android | China has been taking out massive amounts of m... | True | 2017-01-02 |
| 2 | Tue Jan 03 16:44:13 +0000 2017 | 52850 | 816324295781740544 | | False | 15571 | Twitter for iPhone | "@DanScavino: Ford to scrap Mexico plant, inve... | True | 2017-01-03 |
| 3 | Wed Jan 04 13:19:09 +0000 2017 | 86542 | 816635078067490816 | | False | 19529 | Twitter for Android | Thank you to Ford for scrapping a new plant in... | True | 2017-01-04 |
| 4 | Thu Jan 05 18:14:30 +0000 2017 | 107842 | 817071792711942145 | | False | 32121 | Twitter for Android | Toyota Motor said will build a new plant in Ba... | True | 2017-01-05 |
| 5 | Fri Jan 06 11:19:49 +0000 2017 | 65840 | 817329823374831617 | | False | 19129 | Twitter for Android | The dishonest media does not report that any m... | True | 2017-01-06 |
| 6 | Fri Jan 06 12:34:37 +0000 2017 | 39031 | 817348644647108609 | | False | 9331 | Twitter for Android | Wow, the ratings are in and Arnold Schwarzeneg... | True | 2017-01-06 |
| 7 | Sun Jan 08 02:07:09 +0000 2017 | 85216 | 817915516018892805 | | False | 17231 | Twitter for Android | I look very much forward to meeting Prime Mini... | True | 2017-01-08 |
| 8 | Mon Jan 09 04:05:31 +0000 2017 | 77977 | 818307689323368448 | | False | 19152 | Twitter for Android | Dishonest media says Mexico won't be paying fo... | True | 2017-01-09 |
| 9 | Mon Jan 09 14:16:34 +0000 2017 | 109595 | 818461467766824961 | | False | 23379 | Twitter for Android | Ford said last week that it will expand in Mic... | True | 2017-01-09 |
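
A keyword filter based on substring matching also fires when a keyword sits inside a longer word, for example "war" inside "forward". The sketch below compares plain substring matching with a variant that requires each keyword to start at a word boundary. The three tweets are made up and the shortened keyword list is an assumption; whether the stricter rule suits the project is a judgment call.

```python
import re
import pandas as pd

# Made-up example tweets
tweets = pd.Series([
    "Looking forward to the meeting",   # contains "war" only inside "forward"
    "The Trade War is going well",
    "CHINA and tariffs",
])
keywords = ["china", "tariff", "war"]

# Plain substring search, ignoring case
substring = tweets.str.contains("|".join(keywords), case=False)

# Same keywords, but each must start at a word boundary
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + ")"
word_start = tweets.str.contains(pattern, case=False)

print(substring.tolist())
print(word_start.tolist())
```

Only the start of the word is anchored, so "tariffs" still matches "tariff".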
