# 1. Define QueryNews Function

- To acquire an API key: http://eventregistry.org/ (register using a GWU email)
- The free tier only has 2000 tokens, so the QueryNews function can be called at most 2000 times.
- The free tier seems to have an upper limit of about 400 queries per day.
- The free tier can only access data from the most recent 30 days.
- Details about the API: https://github.com/EventRegistry/event-registry-python
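
Because the free tier only reaches back 30 days, it can help to compute the allowed date window before querying. The sketch below is a minimal illustration; the helper name `free_tier_window` is hypothetical, and it assumes that the `dateStart` and `dateEnd` parameters of `QueryArticles` accept ISO-formatted date strings.

```python
from datetime import date, timedelta

def free_tier_window(today=None, days=30):
    """Return (start, end) ISO date strings covering the last `days` days."""
    end = today or date.today()
    start = end - timedelta(days=days)
    return start.isoformat(), end.isoformat()

# Example with a fixed "today" so the result is reproducible
start, end = free_tier_window(date(2020, 4, 10))
print(start, end)  # 2020-03-11 2020-04-10
```

Under that assumption, the two strings could be passed on as `QueryArticles(keywords=company, dateStart=start, dateEnd=end, lang=["eng"])`.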

In [1]:
pip install eventregistry

Collecting eventregistry
  Downloading https://files.pythonhosted.org/packages/37/22/5163e7ce25c0c115e88963a95d68673d2c68cc51fc5bf34897dcae5c0c69/eventregistry-8.7.tar.gz (45kB)
     |████████████████████████████████| 51kB 1.4MB/s eta 0:00:01
Building wheels for collected packages: eventregistry
  Building wheel for eventregistry (setup.py) ... done
  Stored in directory: /Users/aimeejiang/Library/Caches/pip/wheels/d2/9b/2a/61e30267ffa8e68a8cc13b0607caa36ebda9de8d834c1851f2
Successfully built eventregistry
Installing collected packages: eventregistry
Successfully installed eventregistry-8.7
Note: you may need to restart the kernel to use updated packages.


In [1]:
from eventregistry import *
er = EventRegistry(apiKey = "6e511143-02d0-40f3-b28e-b4898956ad89")

def QueryNews(company,category=""):
    
    "The Function returns raw news from EventRegistry News API."
    
    # Query news information based on company name if no category is specified
    if category == "": 
        q = QueryArticles(
            keywords=company,
            lang = ["eng"])
    # Query news information based on company name and category if there is category
    else: 
        q = QueryArticles(
            keywords=company,
            categoryUri=er.getCategoryUri(category),
            lang = ["eng"])
        
    q.setRequestedResult(RequestArticlesInfo(sortBy="rel"))
    res = er.execQuery(q)
    return(res)

using user provided API key for making requests
Event Registry host: http://eventregistry.org
Text analytics host: http://analytics.eventregistry.org


In [5]:
help(QueryArticles)

Help on class QueryArticles in module eventregistry.QueryArticles:

class QueryArticles(eventregistry.Base.Query)
 |  QueryArticles(keywords=None, conceptUri=None, categoryUri=None, sourceUri=None, sourceLocationUri=None, sourceGroupUri=None, authorUri=None, locationUri=None, lang=None, dateStart=None, dateEnd=None, dateMentionStart=None, dateMentionEnd=None, keywordsLoc='body', ignoreKeywords=None, ignoreConceptUri=None, ignoreCategoryUri=None, ignoreSourceUri=None, ignoreSourceLocationUri=None, ignoreSourceGroupUri=None, ignoreAuthorUri=None, ignoreLocationUri=None, ignoreLang=None, ignoreKeywordsLoc='body', isDuplicateFilter='keepAll', hasDuplicateFilter='keepAll', eventFilter='keepAll', startSourceRankPercentile=0, endSourceRankPercentile=100, minSentiment=-1, maxSentiment=1, dataType='news', requestedResult=None)
 |  
 |  Base class for Query and AdminQuery
 |  used for storing parameters for a query. Parameter values can either be
 |  simple values (set by _setVal()) or an array 

In [3]:
# query all news about Apple: leave the category as an empty string (the default), so no category filter is applied
res = QueryNews("apple","")
res

{'articles': {'results': [{'uri': '6001514268',
    'lang': 'eng',
    'isDuplicate': False,
    'date': '2020-03-20',
    'time': '18:10:00',
    'dateTime': '2020-03-20T18:10:00Z',
    'dateTimePub': '2020-03-20T18:01:00Z',
    'dataType': 'news',
    'sim': 0,
    'url': 'https://www.marketstudyreport.com/global-apple-puree-market-research-report-2020',
    'title': 'Global Apple Puree Market Research Report 2020',
    'body': "Table of Contents 1 Apple Puree Market Overview 1.1 Product Overview and Scope of Apple Puree 1.2 Apple Puree Segment by Type 1.2.1 Global Apple Puree Sales Growth Rate Comparison by Type (2021-2026) 1.2.2 Conventional 1.2.3 Organic 1.3 Apple Puree Segment by Application 1.3.1 Apple Puree Sales Comparison by Application: 2020 VS 2026 1.3.2 Beverages 1.3.3 Infant Food 1.3.4 Bakery & Snacks 1.3.5 Ice Cream & Yoghurt 1.3.6 Others 1.4 Global Apple Puree Market Size Estimates and Forecasts 1.4.1 Global Apple Puree Revenue 2015-2026 1.4.2 Global Apple Puree Sales 2

# 2. Define NewsWrangling Function

In [2]:
import pandas as pd

def NewsWrangling(n):
    
    "The Function returns wrangled data, including date, title, body and relevance of the news."
    
    # Applying list comprehension
    lists = n["articles"].get("results")
    date = [news["date"] for news in lists]
    title = [news["title"] for news in lists]
    body = pd.Series([news["body"] for news in lists],dtype="str")
    relevance = [news["relevance"] for news in lists]
    # Create dictionary
    dic = {
        "date":date,
        "title":title,
        "body":body,
        "relevance":relevance}
    # Create dataframe from the dictionary
    df = pd.DataFrame(dic,columns=["date",
                                   "title",
                                   "body", 
                                   "relevance"])
    return(df)

In [5]:
NewsWrangling(res).title[1]

'Global Apple Powder Market Research Report 2020'

In [6]:
NewsWrangling(res).head()

| | date | title | body | relevance |
|---|---|---|---|---|
| 0 | 2020-03-20 | Global Apple Puree Market Research Report 2020 | Table of Contents 1 Apple Puree Market Overvie... | 100 |
| 1 | 2020-04-09 | Global Apple Powder Market Research Report 2020 | Table of Contents 1 Apple Powder Market Overvi... | 97 |
| 2 | 2020-03-16 | France's Competition Regulator Details in Grea... | Last Thursday Patently Apple posted a report t... | 41 |
| 3 | 2020-04-02 | This List of Streaming Services Will Ensure Yo... | Your guide to the many streaming services and ... | 33 |
| 4 | 2020-03-15 | Roundup: Everything we know about iOS 14, watc... | In this week's top stories: It was a big week ... | 31 |
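
If some articles lack one of the four fields, indexing with `news["relevance"]` raises a `KeyError`. The sketch below shows one tolerant alternative on a made-up response; `wrangle` and `toy_response` are hypothetical names, and the nested `articles` / `results` layout is taken from the raw output printed earlier.

```python
import pandas as pd

# Toy response mimicking the structure returned by QueryNews (values are made up)
toy_response = {
    "articles": {
        "results": [
            {"date": "2020-03-20", "title": "A", "body": "text a", "relevance": 100},
            {"date": "2020-04-09", "title": "B", "body": "text b"},  # no relevance key
        ]
    }
}

def wrangle(response):
    """Build a date/title/body/relevance table, tolerating missing keys."""
    results = response.get("articles", {}).get("results", [])
    columns = ["date", "title", "body", "relevance"]
    # reindex adds any missing column and leaves absent values as NaN
    return pd.DataFrame(results).reindex(columns=columns)

toy_df = wrangle(toy_response)
print(toy_df)
```

Missing values then surface as NaN in the table instead of stopping the whole run.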


# 3. Define SentimentAnalysis Function

- There are two types of sentiment analysis: lexicon-based sentiment analysis and text classification.

- Here we use the lbsa module from https://github.com/AntoinePassemiers/Lexicon-Based-Sentiment-Analysis/tree/master/src. It returns 8 categories for each text input: anticipation, joy, surprise, trust, anger, disgust, fear, sadness. I aggregate them into two scores: positive and negative.

In [3]:
import lbsa 
import numpy as np

def SentimentAnalysis(text):
    
    "The function conducts sentiment analysis using an existing module: lbsa."
    
    # Load the Lexicon-Based Sentiment Analysis Tool
    lexicon = lbsa.get_lexicon('sa', language='english')
    sentiment = lexicon.process(text)
    
    # Calculate Positive and Negative Scores
    positive = np.sum([sentiment["anticipation"],
                   sentiment['joy'],
                   sentiment['surprise'],
                   sentiment['trust']])
    negative =  np.sum([sentiment['anger'],
                    sentiment['disgust'],
                    sentiment['fear'],
                    sentiment['sadness']])
    
    # Calculate Net Sentiment Score as a fraction of Total Scores
    if positive+negative == 0:
        sentiment_score = 0
    else: 
        sentiment_positive = positive/(positive+negative)
        sentiment_negative = negative/(positive+negative)
        sentiment_score = sentiment_positive - sentiment_negative
    
    return(sentiment_score)

In [8]:
words = "I am really really happy but I am exhausted!"
SentimentAnalysis(words)



0.5
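
The net score computed by SentimentAnalysis simplifies to (positive - negative) / (positive + negative), which always lies between -1 and 1. The sketch below works through that arithmetic on hand-written emotion counts. The counts are hypothetical and are not produced by lbsa, and `net_sentiment` is an illustrative name.

```python
POSITIVE = ("anticipation", "joy", "surprise", "trust")
NEGATIVE = ("anger", "disgust", "fear", "sadness")

def net_sentiment(counts):
    """Return (positive - negative) / (positive + negative), or 0 if both are 0."""
    positive = sum(counts.get(k, 0) for k in POSITIVE)
    negative = sum(counts.get(k, 0) for k in NEGATIVE)
    total = positive + negative
    return 0 if total == 0 else (positive - negative) / total

# Hypothetical emotion counts for one text
example = {"joy": 2, "trust": 1, "sadness": 1}
print(net_sentiment(example))  # (3 - 1) / 4 = 0.5
```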

# 4. Define AddSentimentScore Function

In [4]:
def AddSentimentScore(wrangled_news):

    "The function multiplies the sentiment score by the relevance value and adds the final score to the dataframe."
    
    # Get sentiment score for each article
    scores=[SentimentAnalysis(news) for news in wrangled_news["body"]]
    
    # Calculate finalized score by multiplying relevance
    wrangled_news["sentimentscore"] = scores * wrangled_news["relevance"]/100
    return(wrangled_news)

In [10]:
d = NewsWrangling(res)
data = AddSentimentScore(d)
data

| | date | title | body | relevance | sentimentscore |
|---|---|---|---|---|---|
| 0 | 2020-03-20 | Global Apple Puree Market Research Report 2020 | Table of Contents 1 Apple Puree Market Overvie... | 100 | 0.684211 |
| 1 | 2020-04-09 | Global Apple Powder Market Research Report 2020 | Table of Contents 1 Apple Powder Market Overvi... | 97 | 0.752430 |
| 2 | 2020-03-16 | France's Competition Regulator Details in Grea... | Last Thursday Patently Apple posted a report t... | 41 | 0.069280 |
| 3 | 2020-04-02 | This List of Streaming Services Will Ensure Yo... | Your guide to the many streaming services and ... | 33 | 0.110812 |
| 4 | 2020-03-15 | Roundup: Everything we know about iOS 14, watc... | In this week's top stories: It was a big week ... | 31 | 0.147791 |
| ... | ... | ... | ... | ... | ... |
| 95 | 2020-04-07 | Sign in with Apple FAQ: What you need to know ... | At Apple's Worldwide Developer Conference in 2... | 14 | 0.079459 |
| 96 | 2020-04-06 | Global Apple Fiber Market Deep Analysis From 2... | Global Apple Fiber Market Research Report pres... | 14 | 0.118696 |
| 97 | 2020-04-02 | Apple Is Giving Up Its 30% Cut to Encourage 'P... | For years, Apple has insisted that all purchas... | 14 | 0.085631 |
| 98 | 2020-04-02 | 4 Big Reasons to Love Apple Stock Now | When it comes to Apple (NASDAQ:AAPL), there's ... | 14 | 0.080937 |
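
The weighting step can be checked on a tiny table. The sketch below uses made-up numbers and assumes relevance is on a 0-100 scale, as the values in the results suggest, so dividing by 100 turns it into a weight between 0 and 1.

```python
import pandas as pd

toy = pd.DataFrame({"relevance": [100, 50, 10]})
raw_scores = pd.Series([0.8, -0.4, 1.0], index=toy.index)  # made-up net sentiment scores

# A highly relevant article keeps its score; a marginal one is shrunk toward 0
toy["sentimentscore"] = raw_scores * toy["relevance"] / 100
print(toy)
```

With this weighting, a strongly positive but barely relevant article (score 1.0, relevance 10) contributes less than a moderately positive, highly relevant one.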


# 5. Download Queried Raw Data (stored in case we need to use it again)

- Because the raw data is saved to the local computer, we do not need to query it again when we reuse it, which saves the roughly 400 tokens available per day in EventRegistry.
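
One way to make this token saving automatic is to check for a local file before querying. The sketch below is a hypothetical helper, not part of the original notebook: `load_or_query`, the `cache_dir` folder name, and the `query_fn` argument are all assumptions for illustration.

```python
import json
from pathlib import Path

def load_or_query(company, query_fn, cache_dir="raw_news"):
    """Return cached raw news for `company`, calling `query_fn` only on a cache miss."""
    path = Path(cache_dir) / f"{company}.txt"
    if path.exists():
        # Cache hit: no API token is spent
        with open(path) as f:
            return json.load(f)
    result = query_fn(company)  # cache miss: this spends one API token
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result
```

In this notebook it could be called as `load_or_query(comp, QueryNews)`, so rerunning the loop would not spend tokens on companies that are already on disk.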

## 5.1 Get Company Names List (from the "rank-screener" file in Google Drive)

In [5]:
import pandas as pd
# to save tokens, only query the first 200 companies in the list
companies = pd.read_csv("company.csv",sep=',',header=None)[0:200]
companies = companies[0]
companies

0                        Abbvie Inc
1                   Astrazeneca Plc
2      Bristol-Myers Squibb Company
3               Glaxosmithkline Plc
4                 Johnson & Johnson
                   ...             
195          Gritstone Oncology Inc
196         Galera Therapeutics Inc
197             G1 Therapeutics Inc
198            Halozyme Therapeutic
199     Happiness Biotech Group Ltd
Name: 0, Length: 200, dtype: object

## 5.2 Write Raw data to files

In [12]:
import json

for comp in companies:
    # query news for each company in the list, filtered by the given category
    result = QueryNews(comp,"coronavirus")
    # specify file names
    name = comp + ".txt"
    with open(name,"w") as outfile:
        # write JSON-format data
        json.dump(result, outfile)

# 6. Load Raw News from local files

In [13]:
import json

# mydic is a dictionary: key = file name (company name + ".txt"), value = queried raw news
mydic = {}
for comp in companies:
    name = comp + ".txt"
    with open(name) as f:
        mydic[name] = json.load(f)

# 7. Process Raw News Using Two Functions (takes a long time to run)
### 1 NewsWrangling(): to get the dataframe
### 2 AddSentimentScore(): the SentimentAnalysis function is embedded.

In [15]:
# Create a data frame with 30 consecutive days, from 2020-03-09 to 2020-04-07
date = pd.date_range('2020-03-09', periods=30, freq='D')
df = pd.DataFrame({"date": date.astype("str")})

for i in mydic.keys():
    # mydic is json format raw news data
    wrangled = NewsWrangling(mydic[i]) 
    # sentiment score added format
    scored = AddSentimentScore(wrangled) 
    # extract date and daily mean sentiment score
    added_socialscore =  scored[["date","sentimentscore"]].groupby("date").mean() 
    df = pd.merge(df,added_socialscore, left_on=["date"],right_index=True,how='outer')
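
Merging one company at a time adds a column named `sentimentscore` on every pass, so the columns have to be renamed afterwards. An alternative is sketched below on toy data: collect one daily-mean Series per company and concatenate them side by side. The company names and scores are made up, and `scored_by_company` stands in for the output of AddSentimentScore.

```python
import pandas as pd

# Toy per-company tables standing in for the output of AddSentimentScore
scored_by_company = {
    "Company A": pd.DataFrame({"date": ["2020-03-09", "2020-03-09", "2020-03-11"],
                               "sentimentscore": [0.2, 0.4, -0.1]}),
    "Company B": pd.DataFrame({"date": ["2020-03-10"],
                               "sentimentscore": [0.5]}),
}

# One Series of daily mean scores per company, keyed by company name
daily = {name: table.groupby("date")["sentimentscore"].mean()
         for name, table in scored_by_company.items()}

# Align on the full date range; days without news stay NaN
all_days = pd.date_range("2020-03-09", periods=3, freq="D").astype(str)
wide = pd.concat(daily, axis=1).reindex(all_days)
wide.index.name = "date"
print(wide)
```

Here the dictionary keys become the column names directly, and reindexing on the fixed date range keeps exactly one row per day.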

In [16]:
# Rename the data frame according to company names
names = [name.replace('.txt','') for name in list(mydic.keys())]
names.insert(0,'date')
df.columns = names
df

Unnamed: 0,date,Abbvie Inc,Astrazeneca Plc,Bristol-Myers Squibb Company,Glaxosmithkline Plc,Johnson & Johnson,Eli Lilly and Company,Merck & Company,Novo Nordisk,Novartis Ag,...,Genmab,Gamida Cell Ltd,Genfit S.A. ADR,Genprex Inc,Gossamer Bio Inc,Gritstone Oncology Inc,Galera Therapeutics Inc,G1 Therapeutics Inc,Halozyme Therapeutic,Happiness Biotech Group Ltd
0.0,2020-03-09,,,,,,,,,,...,,,,,,,,,,
1.0,2020-03-10,,,,,,,,,,...,,,,,,,,,,
2.0,2020-03-11,0.162636,0.008526,,0.029546,,0.194604,-0.002203,0.063107,-0.572876,...,0.043201,,,,,-8.4e-05,,,,
3.0,2020-03-12,0.128188,-0.053792,-0.400852,0.230409,-0.019317,0.158914,-0.000684,0.007117,0.021362,...,0.032475,,,,0.000588,,,0.352941,,
4.0,2020-03-13,0.185714,0.321429,,0.138519,,0.01125,-0.002812,,0.012059,...,0.01,,,,,,,,,
5.0,2020-03-14,,,,,,,,,,...,,,,,,,,,,
6.0,2020-03-15,,0.044043,,,,,,,-0.005524,...,-0.000295,,,,,,,,,
7.0,2020-03-16,0.189404,,,0.002585,0.339767,0.181285,,0.171379,0.051693,...,0.010009,,,,0.191365,,,,,
8.0,2020-03-17,0.0634,,0.048462,,,,,0.074126,-0.135084,...,0.003799,,,,,,,,,
9.0,2020-03-18,0.182433,0.0012,0.11676,0.549141,0.315714,-0.038203,,0.237391,0.051662,...,0.000354,,,,,,,,,


In [12]:
# There are missing values per company
for company in names:
    print(df[company].count())

33
27
30
30
28
20
27
9
28
28
27
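
The same counts can be obtained without a loop, and dividing by the number of rows gives the share of days with data per company. A minimal sketch on a made-up table:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date": ["2020-03-09", "2020-03-10", "2020-03-11"],
    "Company A": [0.1, np.nan, 0.3],
    "Company B": [np.nan, np.nan, 0.2],
})

# count() skips NaN; dividing by the number of rows gives coverage per column
coverage = toy.count() / len(toy)
print(coverage)
```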


In [19]:
df

# 8. Write sentiment time series data to file (with missing values)

In [17]:
df.to_csv("corona-virus_200.txt", header=True, index=False, sep='\t')

# 9. Write sentiment time series data to file (after Akima interpolation)

In [20]:
# visualize the time series
%matplotlib inline
df.interpolate(method='akima').plot()

NotImplementedError: Interpolation with NaNs in the index has not been implemented. Try filling those NaNs before interpolating.

In [21]:
df.interpolate(method='akima').to_csv("business_200_timeseries.txt", header=True, index=False, sep='\t')

NotImplementedError: Interpolation with NaNs in the index has not been implemented. Try filling those NaNs before interpolating.
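
The error message points at NaN labels in the row index. A likely cause is the outer merge with `right_index=True` in section 7, which can leave NaN index labels for dates that exist only in the right-hand table; the float index (0.0, 1.0, ...) in the printed table is consistent with that. The sketch below reproduces the situation on a made-up table and rebuilds the index from the dates before interpolating. It uses `method="linear"` so that it runs without SciPy; `method="akima"` could be swapped in when SciPy is installed and each column has enough non-missing points.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged sentiment table (values are made up)
df_toy = pd.DataFrame(
    {"date": ["2020-03-09", "2020-03-10", "2020-03-11", "2020-03-12"],
     "Company A": [0.1, np.nan, 0.3, 0.4]},
    index=[0.0, 1.0, np.nan, 3.0],  # NaN index label, as an outer merge can produce
)

# Replace the broken index with the dates, in chronological order
clean = (df_toy.assign(date=pd.to_datetime(df_toy["date"]))
               .sort_values("date")
               .set_index("date"))

# Only numeric columns remain, so interpolation applies to every column
filled = clean.interpolate(method="linear")
print(filled)
```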

# Next Steps

## What can we do next?

### 1: Get the remaining 600 companies' sentiment scores over time

- We cannot query all 600 companies in one run:
  - 1 We have fewer than 500 tokens per day and 2000 tokens in total.
  - 2 The API search does not seem to scale: when a for loop queries 200 companies in one run, the first 5 companies have news on almost every day, but the later companies seem to have sparse news.

- Because the keywordsLoc argument of QueryArticles is not set explicitly, each article's sentiment score must be multiplied by its relevance.
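
Given the daily limit, the remaining companies could be split into per-day batches and queried on consecutive days. A minimal sketch, assuming a limit of 400 queries per day as estimated in section 1; `daily_batches` and the placeholder company names are hypothetical.

```python
def daily_batches(companies, per_day=400):
    """Split `companies` into consecutive chunks of at most `per_day` items."""
    return [companies[i:i + per_day] for i in range(0, len(companies), per_day)]

remaining = [f"company_{i}" for i in range(600)]  # placeholder names
batches = daily_batches(remaining)
print([len(b) for b in batches])  # [400, 200]
```

Under that assumption, 600 companies would take two days and 600 of the 2000 total tokens for a single category.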


### 2: Get company sentiment score based on categories:

- 1 The categories can be specified using the DMOZ taxonomy. You can see the hierarchy of all categories at this website: http://eventregistry.org/documentation?tab=searchArticles. **Autosuggest--> Categories--> type words in the box**

- business
- society
- environment
- health
- science

- 2 To query based on category, the category argument in **"5.2 Write Raw data to files"** needs to be set to the desired category.
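
Querying by category multiplies the number of queries, so it is worth checking the token cost first. The sketch below does the arithmetic for the five categories listed above and shows one possible file naming scheme; `raw_file_name` is a hypothetical helper, and the cost assumes one token per company and category pair.

```python
from itertools import product

categories = ["business", "society", "environment", "health", "science"]
n_companies = 600     # companies still to be queried
token_budget = 2000   # free-tier total

# One query (one token) per company/category pair
tokens_needed = n_companies * len(categories)
print(tokens_needed, tokens_needed <= token_budget)  # 3000 False

def raw_file_name(company, category):
    """Hypothetical naming scheme that keeps categories apart on disk."""
    return f"{company}__{category}.txt"

pairs = list(product(["Abbvie Inc"], categories))
print(raw_file_name(*pairs[0]))  # Abbvie Inc__business.txt
```

At 3000 queries, covering all five categories for 600 companies would exceed a single free-tier allowance, so either the company list or the category list would need trimming.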

### 3: Visualize the data in Tableau

- Time series lines? Analyze the change of sentiment score.
- Average sentiment in the 31 days?
- Compare the average sentiment score (impact alpha using NLP) with the traditional impact alpha we calculated using ESG metrics. (But the companies may not match in the two datasets.)
- Compare the average sentiment score (impact alpha using NLP) with the investment alpha.

### 4: Forecast the data in SAS

- Some of the series still have missing values.
- Some of the series are white noise (for example, Abbvie Inc)
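
Before moving to SAS, a rough screen for white noise can be run in pandas. The sketch below compares the lag-1 autocorrelation with the approximate bound of 2/sqrt(n). It is only a heuristic, not a formal test, and dropping missing days shifts the lag spacing; `looks_like_white_noise` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def looks_like_white_noise(series, lag=1):
    """Rough check: is the lag autocorrelation within +/- 2/sqrt(n)?"""
    s = pd.Series(series).dropna()
    if len(s) < 3:
        return None  # too few points to say anything
    r = s.autocorr(lag=lag)
    if np.isnan(r):
        return None  # e.g. a constant series has no defined autocorrelation
    return bool(abs(r) <= 2 / np.sqrt(len(s)))

# A steadily increasing series is clearly not white noise
trend = pd.Series(np.arange(30, dtype=float))
print(looks_like_white_noise(trend))  # False
```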