# Data And International Relations

This workbook introduces the foundations of a database that combines various sources of data on international relations, starting with the [Correlates of War](https://correlatesofwar.org/) (COW) project. Its ultimate goal is to give students of computational international relations an easier way to use the data already available, and thereby to improve the quality of research in the field.


#### Problems for which I need solutions:

1. Which factors does the IR literature identify, and where should I start? For a prototype, I think relying on COW data alone is enough.
2. We need to get the data. Downloading and merging should be automated so that the data is sound.
    - Which libraries should we use?
        The Beautiful Soup documentation is needed here.
    - How do we tell Python which files to download?
        This requires reading the page's HTML in order to locate the links to the files.
    - How do we tell Python to repeat the download at intervals?
    - This could be done later, after first downloading the files manually. That would allow visualising some basic machine learning outcomes sooner.


3. We need to run some basic machine learning models to see whether they produce any results.
    - Which models exist, and which one is applicable?
    
    
4. We need some kind of output window, such as a simple website with a drop-down menu, so that I can showcase the prototype to professors.
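
For the question of repeating the download at intervals, one minimal approach is to re-download a file only when the local copy is missing or older than some threshold. The sketch below is an assumption, not something taken from the COW site: the helper name `needs_refresh` and the 30-day threshold are hypothetical, and the download step itself is left out.

```python
import os
import time

# Hypothetical threshold: treat local copies older than 30 days as stale
MAX_AGE_SECONDS = 30 * 24 * 60 * 60


def needs_refresh(path, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Return True if `path` is missing or was last modified too long ago."""
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    # Age of the local copy in seconds, based on its modification time
    age = now - os.path.getmtime(path)
    return age > max_age_seconds
```

A scheduled job could call `needs_refresh` on each local file and trigger the download only when it returns `True`, which avoids requesting the same files on every run.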



### Getting the data

#### COW 
As the COW data on national material capabilities is most familiar to me, I'll start with that piece. For web scraping, I follow this [beginner's guide](https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460).

#Libraries

In [2]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

#Set the URL you want to webscrape from

In [3]:
url = 'https://correlatesofwar.org/data-sets/national-material-capabilities'

#Connect to the URL and get the source code

In [28]:
source_code = requests.get(url).text

#Parse the HTML and save it to a BeautifulSoup object. Depending on the website's HTML, it may be necessary to use 'lxml' instead of 'html.parser'. See more on the differences between parsers [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers).

In [30]:
soup = BeautifulSoup(source_code, "html.parser")

In [22]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <base href="https://correlatesofwar.org/data-sets/national-material-capabilities/"/>
  <!--[if lt IE 7]></base><![endif]-->
  <meta content="Power is considered by many to be a central concept in explaining conflict, and six indicators - military expenditure, military personnel, energy consumption, iron and steel production, urban population, and total population - are included in this data set. It serves as the basis for the most widely used indicator of national capability, CINC (Composite Indicator of National Capability) and covers the period 1816-2012." name="DC.description"/>
  <meta content="Power is considered by many to be a central concept in explaining conflict, and six indicators - military expenditure, military personnel, energy consumption, iron and steel production, urban population, and total population - are included in 

#Individual items, such as the title, can be extracted by accessing them as attributes of the soup object

In [15]:
match = soup.title.text

In [16]:
print(match)

National Material Capabilities (v5.0) — Correlates of War


#Find all hyperlinks present on webpage

In [34]:
links = soup.find_all('a')

#From all links check for zip links and if present download file

In [36]:
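from urllib.parse import urljoin

i = 0
for link in links:
    # Links without an href attribute are treated as empty strings
    href = link.get('href', '')
    if '.zip' in href:
        i += 1
        print("Downloading file: ", i)

        # Resolve relative links against the page URL, then fetch the file
        response = requests.get(urljoin(url, href))
        response.raise_for_status()

        # Write the response body to a local ZIP file
        with open("zip" + str(i) + ".zip", 'wb') as zip_file:
            zip_file.write(response.content)
        print("File ", i, " downloaded")

print("All ZIP files downloaded")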

Downloading file:  1
File  1  downloaded
Downloading file:  2
File  2  downloaded
All ZIP files downloaded
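
One way to check whether the downloads worked is to test if each saved file really is a ZIP archive, since a server that returns an HTML error page would still produce a file with a `.zip` name. The sketch below uses the standard library's `zipfile` module. The helper name `describe_download` is hypothetical, and the file names are assumed to follow the `zip1.zip`, `zip2.zip` pattern used above.

```python
import zipfile


def describe_download(path):
    """Return the member names of a ZIP archive, or None if `path` is not a valid ZIP."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        # List the files stored inside the archive without extracting them
        return archive.namelist()
```

If `describe_download("zip1.zip")` returns `None`, the response was probably not a ZIP archive, and the link or the request is worth inspecting before going further.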


#Something is wrong with this method, perhaps because I adapted an example meant for locating PDF files. The explanations for ZIP files seem more difficult.
#Some sources:

In [None]:
# To download the whole data set, let's do a for loop through all a tags
# Note: the URL below points to the MTA site, not the COW page, so this loop is not yet adapted
line_count = 1 #variable to track what line you are on
for one_a_tag in soup.findAll('a'):  #'a' tags are for links
    if line_count >= 36: #code for text files starts at line 36
        link = one_a_tag['href']
        download_url = 'http://web.mta.info/developers/'+ link
        urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:]) 
        time.sleep(1) #pause the code for a sec
    #add 1 for next line
    line_count +=1
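
The loop above builds URLs for the MTA site rather than the COW page. A rough adaptation, assuming the ZIP links on the target page may be relative, is to filter the `href` values and resolve them against the page URL with `urllib.parse.urljoin`. The helper name `zip_urls` and the sample `href` values below are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin, urlparse


def zip_urls(hrefs, base_url):
    """Return absolute URLs for every href whose path ends in '.zip'."""
    urls = []
    for href in hrefs:
        if not href:
            continue  # skip <a> tags without an href
        absolute = urljoin(base_url, href)
        # Compare only the path, so query strings do not hide the extension
        if urlparse(absolute).path.lower().endswith('.zip'):
            urls.append(absolute)
    return urls


# Hypothetical href values, for illustration only
sample = ['/about', 'files/example.zip', None, 'https://example.org/data.ZIP?x=1']
print(zip_urls(sample, 'https://example.org/data-sets/page/'))
```

In the notebook, the inputs could be `[a.get('href') for a in soup.find_all('a')]` and `url`, with a `time.sleep(1)` between downloads as in the loop above.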