# EPE Tracker Tool

## Expected Input

This notebook expects two files, "probability_weights_tweets.json" and "probability_weights_article.json". An example of their format is provided below:

```
{
  "initial": 0.3,
  "threshold": 0.75,
  "users": {
    "insight_centre": 0.9,
    "job_ie": -0.8
  },
  "hashtags": {
    "epe": 0.8,
    "jobfairy": -0.8
  },
  "words": {
    "speech": 0.5,
    "advertisement": -0.2
  }
}
```

The initial and threshold keys take values between 0 and 1: initial is the probability to begin with, and threshold is the probability above which an item is considered an EPE event. The users, hashtags and words maps let the user provide positive and negative weights for specific accounts, hashtags and words, which are used when the program runs. There is no need to provide the hashtags or users keys in probability_weights_article.json.
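The constraints above can be checked before running the collection scripts. The following is a minimal sketch, not part of the notebook: the function name `validate_weights` and the `require_twitter_keys` flag are hypothetical, and treating the words map as always required is an assumption.

```python
import json


def validate_weights(weights, require_twitter_keys=True):
    """Return a list of problems found in a probability-weights mapping."""
    problems = []

    # initial and threshold must be numbers in the range [0, 1]
    for key in ("initial", "threshold"):
        value = weights.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1:
            problems.append(f"{key} must be a number between 0 and 1")

    # Assumption: words is always expected; users and hashtags only for tweets
    required_maps = ["words"]
    if require_twitter_keys:
        required_maps += ["users", "hashtags"]
    for key in required_maps:
        if not isinstance(weights.get(key), dict):
            problems.append(f"{key} must be a map of term to weight")

    return problems


# An article-style file, which may omit the users and hashtags keys
example = json.loads('{"initial": 0.3, "threshold": 0.75, "words": {"speech": 0.5}}')
print(validate_weights(example, require_twitter_keys=False))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-edited file in one pass.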

This notebook also expects an "env.json" file, an example of which is shown below:

```
{
  "end_date": {
    "year": 2022,
    "month": 1,
    "day": 1
  },
  "podcast_id": "0HTattjY2kXo9zywjNW8Mk",
  "audience_estimate": {
    "twitter": "unknown",
    "insight_website": "unknown",
    "podcast": 500,
    "brainstorm": 2000,
    "silicon_republic": 1000,
    "external": {
      "irishtimes": 3000,
      "irish times": 3000
    }
  },
  "people": {
    "Georgiana Ifrim": {
      "twitter": "heerme"
    }
  },
  "pages": {
    "Insight Centre": {
      "twitter": "insight_centre"
    }
  },
  "twitter_consumer_key": "...",
  "twitter_consumer_secret": "...",
  "twitter_access_token": "...",
  "twitter_access_token_secret": "...",
  "spotify_client_id": "...",
  "spotify_client_secret": "..."
}
```

The end_date key is the date you would like to start collecting from: all events from end_date up to the current day will be collected.
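As an illustration of how the end_date object maps to a collection window, here is a small sketch. The function name `collection_window` and the `today` parameter are hypothetical, and rejecting a future end_date is an assumption rather than documented behaviour of the tool.

```python
from datetime import date


def collection_window(env, today=None):
    """Return (start, end): from env['end_date'] up to the current day."""
    parts = env["end_date"]
    start = date(parts["year"], parts["month"], parts["day"])
    end = today or date.today()
    # Assumption: an end_date in the future is a configuration mistake
    if start > end:
        raise ValueError("end_date must not be in the future")
    return start, end


env = {"end_date": {"year": 2022, "month": 1, "day": 1}}
start, end = collection_window(env, today=date(2022, 3, 1))
print(start.isoformat(), end.isoformat())  # prints 2022-01-01 2022-03-01
```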

The podcast_id key is the Spotify ID of the podcast whose episodes you would like to collect. It can be found by visiting the podcast on the Spotify website. For example, https://open.spotify.com/show/0HTattjY2kXo9zywjNW8Mk means our ID is 0HTattjY2kXo9zywjNW8Mk.
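If you prefer not to copy the ID by hand, it can be pulled out of the show URL. This is a sketch only; `podcast_id_from_url` is a hypothetical helper, not a function from the notebook's scripts.

```python
from urllib.parse import urlparse


def podcast_id_from_url(url):
    """Extract the show ID from a Spotify show URL."""
    # Split the path into its non-empty segments, e.g. ["show", "<id>"]
    path_parts = [part for part in urlparse(url).path.split("/") if part]
    if len(path_parts) >= 2 and path_parts[-2] == "show":
        return path_parts[-1]
    raise ValueError(f"not a Spotify show URL: {url}")


print(podcast_id_from_url("https://open.spotify.com/show/0HTattjY2kXo9zywjNW8Mk"))
# prints 0HTattjY2kXo9zywjNW8Mk
```

Because `urlparse` separates the query string from the path, any tracking parameters appended to a shared link are ignored.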

The audience_estimate key provides estimates of the number of people reached by an event. There should be a key for each of the core sources (currently twitter, insight_website, podcast, brainstorm and silicon_republic) as well as an external key, which maps other sites such as the Irish Times, RTÉ or The Journal to their estimates.
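To make the shape of audience_estimate concrete, the sketch below looks up an estimate for a core source or an external site. The helper `estimate_audience` is hypothetical; falling back to "unknown" and matching external site names in lowercase are assumptions suggested by the example file, not confirmed behaviour of the tool.

```python
def estimate_audience(audience_estimate, source, site=None):
    """Look up the audience estimate for a core source or an external site."""
    if source == "external":
        external = audience_estimate.get("external", {})
        # Assumption: external site names are matched in lowercase
        return external.get((site or "").lower(), "unknown")
    # Assumption: a missing core source is reported as "unknown"
    return audience_estimate.get(source, "unknown")


audience = {
    "twitter": "unknown",
    "podcast": 500,
    "external": {"irishtimes": 3000, "irish times": 3000},
}
print(estimate_audience(audience, "podcast"))                  # prints 500
print(estimate_audience(audience, "external", "Irish Times"))  # prints 3000
```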

The people key is a map from each person's name (key) to an object holding their Twitter handle (value). These are the researchers we are trying to find events for.

The pages key is a map from each page's name (key) to an object holding its Twitter handle (value). It allows tweets to be collected from extra Twitter accounts without the page itself (for example "Insight Centre") being identified as conducting EPE.
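The distinction between people and pages can be sketched as follows: both contribute handles to collect from, but only people can be credited with an event. The helpers `handles_by_owner` and `attribute_tweet` are hypothetical and the lowercase handle matching is an assumption; the notebook's own scripts may do this differently.

```python
def handles_by_owner(env):
    """Build handle -> name lookups for the people and pages maps."""
    people = {info["twitter"].lower(): name
              for name, info in env.get("people", {}).items()}
    pages = {info["twitter"].lower(): name
             for name, info in env.get("pages", {}).items()}
    return people, pages


def attribute_tweet(handle, people, pages):
    """Return the person to credit for a tweet, or None for page accounts."""
    # Pages are collected from, but never credited with conducting EPE
    return people.get(handle.lower())


env = {
    "people": {"Georgiana Ifrim": {"twitter": "heerme"}},
    "pages": {"Insight Centre": {"twitter": "insight_centre"}},
}
people, pages = handles_by_owner(env)
print(sorted(set(people) | set(pages)))                   # handles to collect from
print(attribute_tweet("heerme", people, pages))           # prints Georgiana Ifrim
print(attribute_tweet("insight_centre", people, pages))   # prints None
```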

The various IDs, keys, tokens and secrets at the bottom are used to interact with the Spotify and Twitter APIs. These should not be changed unless you are making your own version of the tool, in which case you should generate new keys and tokens as set out in the Twitter and Spotify API documentation.

## Collection

### Brainstorm Collection

Dependencies: (Working on wget)

In [49]:
#!wget https://www.rte.ie/brainstorm/ -r -E -p --no-parent --accept-regex ".*/brainstorm/20[0-9]{2}/.*"
!pip install newspaper3k



Code:

In [39]:
%run -i collect_brainstorm.py

100 brainstorm articles found


### Insight Website Collection

Dependencies:

In [9]:
#!wget https://www.insight-centre.org/news/ -r -E -p
!pip install newspaper3k

'wget' is not recognized as an internal or external command,
operable program or batch file.




Code:

In [17]:
%run -i collect_insight_website.py

['./www.insight-centre.org\\1-3m-investment-for-insight-spinout-output-sports-irish-times\\index.html', './www.insight-centre.org\\a-breath-of-fresh-air-ep7-of-the-insight-podcast-out-now\\index.html', './www.insight-centre.org\\ai-data-and-feminism-dont-miss-the-next-trustworthy-ai-data-science-and-society-network-meeting\\index.html', './www.insight-centre.org\\ai-data-robotics-association-launches\\index.html', './www.insight-centre.org\\ai-through-the-looking-glass-barry-osullivan-talks-to-dame-wendy-hall-from-ria-event\\index.html', './www.insight-centre.org\\alan-smeaton-talks-to-rtes-claire-byrne-about-fake-twitter-accounts\\index.html', './www.insight-centre.org\\alice-perry-open-data-hackathon-november-16\\index.html', './www.insight-centre.org\\alison-keogh-talks-human-behaviour-with-silicon-republic\\index.html', './www.insight-centre.org\\alison-keogh-talks-human-behaviour-with-silicon-republic-2\\index.html', './www.insight-centre.org\\barry-osullivan-and-derek-bridge-laun

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


156 insight website articles found


### Podcast Collection

Dependencies:

In [7]:
!pip install requests --upgrade
!pip install urllib3 --upgrade
!pip install spotipy --upgrade

Collecting spotipy
  Downloading spotipy-2.19.0-py3-none-any.whl (27 kB)
Installing collected packages: spotipy
Successfully installed spotipy-2.19.0


Code:

In [8]:
%run -i collect_podcast.py

27 podcast episodes found


### Silicon Republic Collection

Ideally, this collection should be run once a week.

Dependencies:

In [None]:
!pip install newspaper3k

Code:

In [21]:
%run -i collect_silicon_republic.py

128 silicon republic articles found


### Twitter Collection

Dependencies:

In [25]:
!pip install tweepy
!pip install wordninja
!pip install emoji

Collecting wordninja
  Downloading wordninja-2.0.0.tar.gz (541 kB)
     -------------------------------------- 541.6/541.6 KB 6.8 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Using legacy 'setup.py install' for wordninja, since package 'wheel' is not installed.
Installing collected packages: wordninja
  Running setup.py install for wordninja: started
  Running setup.py install for wordninja: finished with status 'done'
Successfully installed wordninja-2.0.0
Collecting emoji
  Downloading emoji-1.7.0.tar.gz (175 kB)
     -------------------------------------- 175.4/175.4 KB 5.3 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Using legacy 'setup.py install' for emoji, since package 'wheel' is not installed.
Installing collected packages: emoji
  Running setup.py install for emoji: started
  Running setup.py install for emoji: finished with statu

Code:

In [43]:
%run -i collect_tweets.py

136 tweets found


## Analysis

Dependencies:

In [44]:
!pip install nltk
!pip install spacy
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_sm
import nltk
import spacy

# Download the NLTK models and corpora needed for entity extraction
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('treebank')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

!pip install python-dateutil
!pip install magicdate
!pip install newspaper3k
!pip install pycountry
!pip install locationtagger

Collecting en-core-web-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl (777.4 MB)
     ------------------------------------ 777.4/777.4 MB 924.9 kB/s eta 0:00:00
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     ---------------------------------------- 13.9/13.9 MB 6.6 MB/s eta 0:00:00

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\lukej\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\lukej\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\lukej\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     C:\Users\lukej\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lukej\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\lukej\AppData\Roaming\nltk_data...
[nltk_data]   Package avera


[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


Code:

In [51]:
%run -i analysis.py

{'tweets': True, 'insight_website': True, 'brainstorm': True, 'silicon_republic': True, 'podcast': True}


## Output

In [65]:
print("Identified", sum(len(x) for x in slim_output["found_epe"].values()), "events by", len(slim_output["found_epe"]), "people with", len(slim_output["unknown_epe"]), "other unknown events")

Identified 6 events by 5 people with 25 other unknown events


In [61]:
from IPython.display import JSON
JSON(slim_output["found_epe"])

<IPython.core.display.JSON object>

In [62]:
JSON(slim_output["unknown_epe"])

<IPython.core.display.JSON object>

In [63]:
JSON(slim_output["duplicates"])

<IPython.core.display.JSON object>