- [The GDELT Project](https://www.gdeltproject.org/)


- Chen Luo, Feb 18, 2020

## 1. What is GDELT?

- GDELT = Global Database of Events, Language, and Tone

- A project supported by Google that monitors the world's broadcast, print, and web news

---

## 2. What makes it special?

- Multilingual (over 100 languages)

- A detailed coding system, including people, locations, organizations, themes, emotions

- Open access

- High update frequency (every 15 minutes)

- Historical breadth (some datasets date back to the 19th century)

- [User-friendly documentation](https://www.gdeltproject.org/data.html#documentation)

---

## 3. Two major datasets

- GDELT Event Database
    - Contains over 300 categories of physical activities around the world
    
    - Nearly 280 themes
    
    - Nearly 60 attributes are coded for each event

- GDELT Global Knowledge Graph (April 1, 2013 to present)
    - Codes each news report using named-entity recognition (NER) and geocoding algorithms

---

## 4. How to apply GDELT datasets to your network analysis?
- Option 1: GDELT + Gephi
    - [GKG Network Visualizer](http://analysis.gdeltproject.org/module-gkg-network.html)
    
    - Two steps: enter your keywords, then check your mailbox for the results
    
    - This service is currently unavailable but is expected to return in the next few weeks.
    
    - There are other easy-to-use [analysis services](http://analysis.gdeltproject.org/)
    
    
- Option 2: Customized data
     - Google BigQuery / Raw data file (`csv` & `tsv`)
     
     - The following demo (very preliminary) builds a country co-occurrence network (top 1K edges) for `global pandemic` (a predefined theme in the GDELT project) from news coverage over the past week
         - The toolkit includes Google BigQuery ([preview](https://bigquery.cloud.google.com/table/gdelt-bq), but please pay attention to the [quota](https://cloud.google.com/dialogflow/quotas)), Google Cloud Platform, Gephi (geolayout plug-in)

In [1]:
import os
from google.cloud import bigquery
from google.cloud.bigquery.job import QueryJobConfig

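# initialize: standard SQL is already the client's default;
# pass job_config to client.query() if you want to make it explicit
job_config = QueryJobConfig(use_legacy_sql=False)
# point to the service-account key file of your Google Cloud Platform project
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './gdelt-a776108ed74c.json'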

### 4.1 query for the `edges.csv`

In [3]:
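# `gdeltv2.gkg_partitioned` table stores the GKG V2 data (about 11 TB)
# `extra.countrygeolookup` table contains countryname (str), latitude (str), longitude (str), fips (str)
# query_1 builds the edges table (Source, Target, Type, Weight)
# the LIKE pattern keeps only articles whose V2Themes mention HEALTH_PANDEMIC at least twice
# COUNT(*) counts pairs of location mentions, so repeated mentions of a country within one article add weight
# 'OS' means oceans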

query_1 = '''
SELECT
  d.countryname Source, e.countryname Target, "Undirected" Type, ROUND(c.Count/SUM(c.Count) OVER (), 6) Weight
FROM (
  SELECT
    a.countrycode Source, b.countrycode Target, COUNT(*) AS Count
  FROM ( (
      SELECT
        DocumentIdentifier url, REGEXP_EXTRACT(location, r'^.*?#.*?#(.*?)#') countrycode
      FROM
        `gdelt-bq.gdeltv2.gkg_partitioned`, UNNEST(SPLIT(V2Locations, ';')) AS location
      WHERE
        LENGTH(V2Locations) > 3 AND V2Themes LIKE '%HEALTH_PANDEMIC%HEALTH_PANDEMIC%' AND DATE(_PARTITIONTIME) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE())) a
  JOIN ( (
      SELECT
        DocumentIdentifier url, REGEXP_EXTRACT(location, r'^.*?#.*?#(.*?)#') countrycode
      FROM
        `gdelt-bq.gdeltv2.gkg_partitioned`, UNNEST(SPLIT(V2Locations, ';')) AS location
      WHERE
        LENGTH(V2Locations) > 3 AND V2Themes LIKE '%HEALTH_PANDEMIC%HEALTH_PANDEMIC%' AND DATE(_PARTITIONTIME) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE())) b
  ON
    a.url=b.url
  WHERE
    a.countrycode < b.countrycode AND a.countrycode != 'OS' AND b.countrycode != 'OS'
  GROUP BY
    1, 2
  ORDER BY
    3 DESC
  LIMIT
    1000) c
JOIN (
  SELECT
    fips, countryname
  FROM
    `gdelt-bq.extra.countrygeolookup`) d
ON
  c.Source = d.fips
JOIN (
  SELECT
    fips, countryname
  FROM
    `gdelt-bq.extra.countrygeolookup`) e
ON
  c.Target = e.fips
ORDER BY
  Count DESC
'''
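
To make the logic of `query_1` easier to follow, here is a minimal pure-Python sketch of the same steps on toy data: split `V2Locations` on `;`, take the third `#`-delimited field as the country code, pair the codes within each article, drop `OS`, and normalize the counts of the top pairs. The function names and the two sample strings are made up for illustration; this is not how BigQuery executes the query.

```python
import re
from collections import Counter
from itertools import product

# The third '#'-delimited field of a V2Locations entry is the country code
# (same pattern as in the SQL above).
COUNTRY_RE = re.compile(r'^.*?#.*?#(.*?)#')

def country_codes(v2locations):
    """Return one country code per location mention (duplicates are kept)."""
    codes = []
    for location in v2locations.split(';'):
        match = COUNTRY_RE.search(location)
        if match:
            codes.append(match.group(1))
    return codes

def cooccurrence_weights(documents, top_n=1000):
    """Count country pairs per article, keep the top_n pairs, normalize to weights."""
    counts = Counter()
    for v2locations in documents:
        codes = country_codes(v2locations)
        # Mirrors the self-join on url: every mention is paired with every other.
        for a, b in product(codes, codes):
            if a < b and 'OS' not in (a, b):
                counts[(a, b)] += 1
    top = counts.most_common(top_n)
    total = sum(count for _, count in top)
    return {pair: round(count / total, 6) for pair, count in top}

# Toy V2Locations strings (hypothetical, shaped like GKG location entries).
documents = [
    '1#China#CH#CH##35#105#CH#120;4#Wuhan, Hubei, China#CH#CH12#13071#30.5833#114.267#-1930776#200;1#Japan#JA#JA##36#138#JA#300',
    '1#Italy#IT#IT##42.8333#12.8333#IT#50;1#China#CH#CH##35#105#CH#90',
]
weights = cooccurrence_weights(documents)
print(weights)
```

In the first toy article China is mentioned twice, so the China-Japan pair is counted twice, which matches the `COUNT(*)` behavior of the join.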

### 4.2 generate the `nodes.csv`

In [4]:
# query_2 is for building the nodes table (including ID, Label, Lat, Long)
# Lat & Long fields are prepared for the geo-layout
query_2 = '''
SELECT
  Country Id, Country Label, Latitude, Longitude
FROM (
  WITH
    network AS(
    SELECT
      d.countryname Source, d.latitude SourceLatitude, d.longitude SourceLongitude, 
      e.countryname Target, e.latitude TargetLatitude, e.longitude TargetLongitude
    FROM (
      SELECT
        a.countrycode Source, b.countrycode Target, COUNT(*) AS Count
      FROM ( (
          SELECT
            DocumentIdentifier url, REGEXP_EXTRACT(location,r'^.*?#.*?#(.*?)#') countrycode
          FROM
            `gdelt-bq.gdeltv2.gkg_partitioned`, UNNEST(SPLIT(V2Locations, ';')) AS location
          WHERE
            LENGTH(V2Locations) > 3 AND V2Themes LIKE '%HEALTH_PANDEMIC%HEALTH_PANDEMIC%' AND DATE(_PARTITIONTIME) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE())) a
      JOIN ( (
          SELECT
            DocumentIdentifier url, REGEXP_EXTRACT(location,r'^.*?#.*?#(.*?)#') countrycode
          FROM
            `gdelt-bq.gdeltv2.gkg_partitioned`, UNNEST(SPLIT(V2Locations, ';')) AS location
          WHERE
            LENGTH(V2Locations) > 3 AND V2Themes LIKE '%HEALTH_PANDEMIC%HEALTH_PANDEMIC%' AND DATE(_PARTITIONTIME) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE())) b
      ON
        a.url=b.url
      WHERE
        a.countrycode < b.countrycode AND a.countrycode != 'OS' AND b.countrycode != 'OS'
      GROUP BY
        1, 2
      ORDER BY
        3 DESC
      LIMIT
        1000) c
    JOIN (
      SELECT
        fips, countryname, latitude, longitude
      FROM
        `gdelt-bq.extra.countrygeolookup`) d
    ON
      c.Source = d.fips
    JOIN (
      SELECT
        fips, countryname, latitude, longitude
      FROM
        `gdelt-bq.extra.countrygeolookup`) e
    ON
      c.Target = e.fips
    ORDER BY
      Count DESC) (
    SELECT
      Source Country, SourceLatitude Latitude, SourceLongitude Longitude
    FROM
      network)
  UNION DISTINCT (
    SELECT
      Target Country, TargetLatitude Latitude, TargetLongitude Longitude
    FROM
      network) )
ORDER BY
  Country
'''
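
`query_2` repeats the edge subquery and then takes the `UNION DISTINCT` of all sources and targets, attaching coordinates to each country. The same idea can be sketched in plain Python, assuming the edges and a name-to-coordinates lookup are already in memory. The function name and the coordinates below are illustrative placeholders, not values read from GDELT.

```python
def nodes_from_edges(edges, geo_lookup):
    """Return sorted (Id, Label, Latitude, Longitude) rows for every country in edges."""
    countries = set()
    for source, target in edges:
        countries.update((source, target))  # distinct union of both endpoints
    return [(name, name) + tuple(geo_lookup[name]) for name in sorted(countries)]

# Toy inputs: coordinates are kept as strings, as in the lookup table.
edges = [('China', 'Japan'), ('China', 'Italy')]
geo_lookup = {
    'China': ('35', '105'),
    'Italy': ('42.8', '12.8'),
    'Japan': ('36', '138'),
}
for node in nodes_from_edges(edges, geo_lookup):
    print(node)
```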

### 4.3 build the query module

In [7]:
import csv

class buildNetwork:
    """Run a BigQuery query and save the result as a Gephi-ready CSV file."""

    def __init__(self, query, filepath):
        self.client = bigquery.Client()
        self.query = query
        self.filepath = filepath
        self.idx = 0  # number of data rows written

    def query_process(self):
        # submit the query and wait until the result rows are available
        return self.client.query(self.query).result()

    def _save_results(self, header, fields):
        # csv.writer quotes values that contain commas, so such names cannot break the file
        with open(self.filepath, 'w', encoding='utf-8', newline='') as f:
            writer = csv.writer(f, lineterminator='\n')
            writer.writerow(header)
            for row in self.query_process():
                self.idx += 1
                writer.writerow([row[field] for field in fields])
        return self.idx

    def save_edges_results(self):
        columns = ['Source', 'Target', 'Type', 'Weight']
        return self._save_results(columns, columns)

    def save_nodes_results(self):
        # the query returns the column `Id`; the file header uses `ID`
        return self._save_results(['ID', 'Label', 'Latitude', 'Longitude'],
                                  ['Id', 'Label', 'Latitude', 'Longitude'])

### 4.4 run the query

In [8]:
%%time
# for the edges
newBuildNetwork = buildNetwork(query_1, './edges_pandemic.csv')
newBuildNetwork.save_edges_results()
print('This query generates %s edges'%(newBuildNetwork.idx))

This query generates 1001 edges
CPU times: user 72.2 ms, sys: 10.5 ms, total: 82.8 ms
Wall time: 7.11 s


In [9]:
%%time
# for the nodes
newBuildNetwork = buildNetwork(query_2, './nodes_pandemic.csv')
newBuildNetwork.save_nodes_results()
print('This query generates %s nodes'%(newBuildNetwork.idx))

This query generates 120 nodes
CPU times: user 52.5 ms, sys: 7.23 ms, total: 59.7 ms
Wall time: 7.96 s
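
Before importing the two files into Gephi, it may be worth checking that every edge endpoint also appears in the nodes table. The following is a minimal sketch using only the standard library; `missing_nodes` is a hypothetical helper, and the in-memory CSV text stands in for `edges_pandemic.csv` and `nodes_pandemic.csv`.

```python
import csv
import io

def missing_nodes(edges_file, nodes_file):
    """Return the edge endpoints that have no matching row in the nodes table."""
    node_ids = {row['ID'] for row in csv.DictReader(nodes_file)}
    missing = set()
    for row in csv.DictReader(edges_file):
        missing.update({row['Source'], row['Target']} - node_ids)
    return missing

# Toy data with the same headers as the files written above.
nodes_csv = (
    'ID,Label,Latitude,Longitude\n'
    'China,China,35,105\n'
    'Italy,Italy,42.8,12.8\n'
)
edges_csv = (
    'Source,Target,Type,Weight\n'
    'China,Japan,Undirected,0.666667\n'
    'China,Italy,Undirected,0.333333\n'
)
# Japan is absent from the toy nodes table, so it is reported.
print(missing_nodes(io.StringIO(edges_csv), io.StringIO(nodes_csv)))
```

With real files, pass the handles from `open('./edges_pandemic.csv', encoding='utf-8', newline='')` and the corresponding nodes file instead of the `io.StringIO` objects.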


### 4.5 move to Gephi

- The final network

In [13]:
%%html
<img src='./final_network.png' width=1000 height=600>

## 5. Some studies using GDELT
- [Vargo, C. J., Guo, L., & Amazeen, M. A. (2018). The agenda-setting power of fake news: A big data analysis of the online media landscape from 2014 to 2016. *New Media & Society*, *20*(5), 2028-2049.](https://journals.sagepub.com/doi/pdf/10.1177/1461444817712086)

> This article uses **GDELT’s Global Knowledge Graph (GKG)** as its data source (Leetaru, 2012a, 2015a). On a daily basis, GDELT monitors news globally and employs a computer-assisted content analysis that identifies people, locations, themes, emotions, narratives, and events (Leetaru, 2015). The dataset has given researchers the ability to **computationally analyze news content of all sorts: real, fake, and fact-checking oriented** (Abbar et al., 2015; Vargo and Guo, 2017).

- [Vargo, C. J., & Guo, L. (2017). Networks, big data, and intermedia agenda setting: An analysis of traditional, partisan, and emerging online U.S. news. *Journalism & Mass Communication Quarterly*, *94*(4), 1031-1055.](https://journals.sagepub.com/doi/pdf/10.1177/1077699016679976)

> As the first big-data analysis that uses GDELT dataset to examine online intermedia agenda setting, our study focuses on the salience transfer of issues only. Given that **GDELT themes include both issues and attributes**, future research should consider analyzing NAS effects in terms of attributes, or a combination of issues and attributes. For example, **GDELT dataset identifies a number of different aspects of economy, for example, “Econ_Bankruptcy,” “Econ_Cost of living,” and “Econ_Debt.”** Researchers could investigate how news media associate different attributes of the economy issue, and then determine which medium leads and which follows.

- [(Workshop paper) News media coverage of refugees in 2016: A GDELT case study](https://aaai.org/ocs/index.php/ICWSM/ICWSM17/paper/download/15778/14897)

> We rely on the Global Database of Events, Language, and Tone (GDELT) project that monitors and analyses news articles around the world. In a first phase, we explore the global media conversation on refugees along two important dimensions: **article quantity** and **sentiment**. We identify and characterize events that generated an extensive media coverage and also extreme sentiment in the news reports. In a second phase, we refine the analysis by focusing on the news media coverage related to refugees in Europe. GDELT features allow us to identify the **key countries** and **time evolution** of the media coverage. Lastly, we determine the **main actors linked to refugees in the news and study their interaction through a network analysis**.

- [(In Chinese) Pang, X. & Liu, Z. China-U.S. relations in massive machine-coded event data: Influence of reciprocity, policy inertia, and a third power. *World Economics and Politics*, *5*, 53-79.](https://github.com/LyndonChenLuo/Homework/tree/master/CMN214/GDELT0218/Pang&Liu.pdf)

> Based on the event data of directed actions between China and the United States between 1979 and 2017 from the Global Data on Events Location and Tone, this research is intended to investigate the effects of reciprocity, policy inertia, and third power (Russia) on how China and the U.S. cooperate or conflict with each other. We specify a **Vector Autoregressive Model** and rely on the **Impulse Response Functions** to identify the mutual effects of six time series of direct actions among China, the United States, and Russia.