# Using Python for Information Retrieval

In this unit, we'll use Python to turn a collection of loose text documents into a real-life database. (Note: This database was created for a project by R. Terman and E. Voeten, and was built using much the same process you'll be learning here.)

The lecture and problem set will draw on your new Python skills, especially working with text, lists, and dictionaries; writing for-loops, conditional statements, and functions; and "thinking" like a programmer.

**About the Data**

We'll be creating a database from [Universal Periodic Review outcome reports](http://www.ohchr.org/EN/HRBodies/UPR/Pages/BasicFacts.aspx).

The Universal Periodic Review (UPR) is a process run by the United Nations Human Rights Council, which involves a periodic review of the human rights records of all 193 UN Member States.

Reviews take place through an interactive discussion between the State under review and other UN Member States. During this discussion, any UN Member State can pose questions, offer comments, and make recommendations to the State under review. The State under review can then respond, stating which recommendations it rejects, accepts, will consider, etc. Reports are then drawn up detailing this discussion.

We will be analyzing outcome reports from the 2014 Universal Periodic Reviews of 42 countries, which we retrieved [here](http://www.ohchr.org/EN/HRBodies/UPR/Pages/Documentation.aspx) and formatted as text documents.

The goal is to convert these semi-structured texts to a tabular dataset of **recommendations** with the following variables:

1. Text of recommendation (*text*)
2. Country to which the recommendation is directed (*to*)
3. Country that is making the recommendation (*from*)
4. The year when the review took place (*year*)
5. The response to the recommendation, i.e. whether the reviewed country rejects, accepts, etc. (*decision*)

In other words, we want to turn this:

<img src="img/text.png" width="600">

into this:

<img src="img/tabular.png" width="400">



In [1]:
import os
import re
import csv
from operator import itemgetter
from itertools import groupby

# PART A: Start with one document

## 1. Read, Clean, Assign

**task**:

1. Read one document
2. Collect information on the country and year
3. Keep the section we're interested in
4. Turn each line into an item in a list.

**skills**:
- file reading
- string slicing
- string methods
- indexing

### 1.1 Read in "cotedivoire2014.txt"

Fill in the blanks to read in the file.

In [62]:
# FILL ME OUT
dir = './data/txts'
file_name = "cotedivoire2014.txt"
with ____(dir + '/'+ file_name,'r', encoding = "ISO-8859-1") as ____:
    text = f.____()
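
If the pattern is unfamiliar, here is a minimal sketch of reading a whole file into one string. The file `example2010.txt` and the names `sample_path` and `sample_text` are made up for illustration; the example writes its own small file first so that it runs anywhere.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "example2010.txt")
with open(sample_path, "w", encoding="ISO-8859-1") as out:
    out.write("I. Introduction\nSome text\n")

# Open the file for reading; the `with` block closes it automatically.
with open(sample_path, "r", encoding="ISO-8859-1") as f:
    sample_text = f.read()  # the whole file as a single string

print(repr(sample_text))
```

Passing the `encoding` argument matters here because the files are not plain ASCII; country names with accents would otherwise be misread or raise an error.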

### 1.2 Assign country and year variables 

Slice the file name to create two new variables, `country` and `year`.

In [63]:
# FILL ME OUT
country = _______
year = _______
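
As a reminder of how slicing with negative indices works, here is a small sketch on a made-up file name. It assumes the name always ends in a four-digit year followed by `.txt`; the variable names are hypothetical.

```python
sample_name = "example2010.txt"  # hypothetical file name, same pattern as ours

stem = sample_name[:-4]     # drop the 4-character ".txt" extension
sample_year = stem[-4:]     # the last four characters of what is left
sample_country = stem[:-4]  # everything before the year

print(sample_country, sample_year)
```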

### 1.3 Get the Recommendations Section

Note that the section we want starts with `"II. Conclusions and/or recommendations\n"`. What method would you use to get everything after this substring? Fill in the blank below and assign the value to a new variable called `rec_text`.


In [94]:
# FILL ME OUT
sections = text.____("II. Conclusions and/or recommendations\n")
rec_text = ______
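
One way to think about this step: cutting a string at a marker gives you a list of pieces, and the piece you want is the one after the marker. A sketch on a made-up string (the text and the names `marker`, `pieces`, and `after_marker` are illustrative only):

```python
sample_text = (
    "I. Summary\n"
    "Background text\n"
    "II. Findings\n"
    "1. First finding\n"
    "2. Second finding\n"
)
marker = "II. Findings\n"

# Cutting at the marker returns the text before it and the text after it.
pieces = sample_text.split(marker)
after_marker = pieces[1]

print(repr(after_marker))
```

This assumes the marker appears exactly once in the text. If it could appear more than once, `pieces` would have more than two items and you would need to decide which one to keep.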


### 1.4 Turn it into a list

Turn the string above into a list of lines, and store it in a variable called `recs`

In [93]:
# FILL ME OUT
recs = _______
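
There is more than one string method that breaks text into lines, and they treat a trailing newline differently. A quick comparison on a made-up string:

```python
block = "1. First finding\n2. Second finding\n"

with_split = block.split("\n")     # leaves an empty string after the last "\n"
with_splitlines = block.splitlines()  # no empty string at the end

print(with_split)       # ['1. First finding', '2. Second finding', '']
print(with_splitlines)  # ['1. First finding', '2. Second finding']
```

Either works for this exercise, but keep the possible empty last item in mind when you index into the list later.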


## 2. Chunk 

**task**:

These texts have 3 sections each. 
1. The first section contains those recommendations the country supports. 
2. The second section contains recs the country will examine. 
3. The third contains recommendations the country explicitly rejects. 

We want to chunk the text into three lists, `accept`, `examine`, `reject` -- each containing their respective recommendations.

**skills**:
- string methods
- lists
- loops
- conditionals
- indexing

### 2.1: Find the paragraph numbers

Each section starts with a main paragraph number (e.g. **123**). The individual recommendations are then noted as subparagraphs (e.g. **123.1, 123.2**, etc.).

The problem is, we don't know what these paragraph numbers are *a priori*. 

Fill in the blanks below to create 3 variables containing the 3 paragraph numbers.

In [48]:
# FILL ME OUT
para1 = recs[0]._______ # find the main paragraph number of the first line
para1 = int(para1)
para2 = ______ # use para1 to find para2
para3 = ______
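
Here is a sketch of the idea on a made-up header line. It assumes, as the hint in the cell suggests, that the three sections are numbered consecutively; check this against your own document before relying on it. The names below are illustrative.

```python
first_line = "45. The recommendations listed below enjoy support:"

# Everything before the first "." is the main paragraph number.
main_number = int(first_line.split(".")[0])

# Assumption: the next section's number is simply one higher.
next_number = main_number + 1

print(main_number, next_number)
```

Converting to `int` is what makes the `+ 1` possible; adding 1 to the string `"45"` would raise an error.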

### 2.2 Parse the text

Now create 3 new lists: `accept`, `examine`, `reject`. Loop through `recs` and assign each line to its corresponding section.

**hint**: How do you know if a line belongs to a section? It starts with the main paragraph number for that section. So use the **.startswith()** method.

In [49]:
# FILL ME OUT
# keep the lines that start with the section's number; [1:] drops the section's header line
accept = [line for line in ____ if ____.startswith(str(para1))][1:]

examine = [______ for ______ in ______ if ______.startswith(______)][1:]

reject = [______][1:]
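
To see the filtering pattern in isolation, here is a sketch on a short made-up list of lines (the numbers and country names are invented):

```python
lines = [
    "45. Supported:",
    "45.1 Do A (Chile);",
    "45.2 Do B (Peru);",
    "46. To be examined:",
    "46.1 Do C (Fiji);",
]

# Keep only the lines that begin with the section number.
section_45 = [line for line in lines if line.startswith("45")]

# The first match is the section header, so skip it.
items_45 = section_45[1:]

print(items_45)
```

One thing to watch: the prefix `"45"` would also match a line starting with `"450"`. If your document has paragraph numbers like that, a stricter prefix such as `"45."` is safer.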

## 3. Get Recommending Country

**skills**

- string methods
- indexing
- functions

**task**
- extract the substring representing the recommending country.

### 3.1 Extracting the Country

Take a look at a recommendation. I've given you a sample one below.

In [95]:
# grab a sample recommendation from the `accept` list
rec = accept[1]
print(rec)

127.2 Make efforts towards the ratification of the OP-CAT (Chile); 


Notice that they're all formatted the same way, with the recommending country in parentheses at the end.

Using your string skills, find a way to pull out the recommending country.

In [60]:
# FILL ME OUT
rec_country = ________
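
One possible approach, sketched on a made-up recommendation: look for the last pair of parentheses rather than the first, since the text of a recommendation may contain parentheses of its own. The names here are illustrative, and this is only one of several ways to do it.

```python
sample_rec = "45.1 Ratify the treaty (including its protocol) (Chile); "

# rfind searches from the right, so it finds the final "(" and ")".
open_idx = sample_rec.rfind("(")
close_idx = sample_rec.rfind(")")

# Slice out what sits between them.
recommender = sample_rec[open_idx + 1:close_idx]

print(recommender)
```

Using `find` instead of `rfind` here would return the position of the first `"("`, and the slice would start at "including its protocol".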


### 3.2 Create a Function

Create a function called `get_country` that takes an individual recommendation and returns the recommending country.

In [96]:
# FILL ME OUT
def get_country(rec):
    # YOUR CODE HERE
    return rec_country

# test your code
get_country(rec)


## 4. Store in Dictionary

**task**:

We now want to create a new list called `rec_list` containing just individual recommendations. Each recommendation should be a dictionary with the following keys: 

1. `to`: the country under review
2. `from`: the country (or countries) giving the recommendation
3. `year`: the year of the review (all 2014 here)
4. `decision`: whether the recommendation was supported, rejected, etc.
5. `text`: the text of the recommendation

Create your `rec_list` by looping through the `accept`, `examine`, and `reject` lists. (Hint: You can loop over each list in turn, or use a loop within a loop.)

**skills**:
- loops
- dictionaries

### 4.1 Fill in the Blanks

The program below loops through all the recommendations in the `accept` list and creates the list of dictionaries described above. Fill in the blanks to complete the code.

(Remember the `country` and `year` variables we created above!)

In [None]:
# make dictionaries for each individual recommendation item
accept_dictionaries = []
for _____ in accept:
    dic = {}
    dic['to'] = _____
    dic['year'] = _____
    dic['decision'] = 'accept'
    dic['from'] = ______
    dic['text'] = rec
    accept_dictionaries._____(____)
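
The general pattern of building a list of dictionaries in a loop looks like this on made-up data. Only two keys are filled in here, and the names `sample_items`, `records`, and `record` are hypothetical.

```python
sample_items = ["45.1 Do A (Chile); ", "45.2 Do B (Peru); "]

records = []
for item in sample_items:
    record = {}                    # a fresh dictionary for every item
    record["decision"] = "accept"  # same value for every item in this list
    record["text"] = item          # different value for each item
    records.append(record)

print(records)
```

Creating the dictionary inside the loop matters. If `record = {}` sat above the `for` line, every entry in `records` would point to the same dictionary, and all of them would end up holding the last item's values.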

### 4.2 Repeat 

Now write a program that does the same for the `examine` and `reject` lists:

In [72]:
examine_dictionaries = []
for _____ in examine:
    #YOUR CODE HERE
    
reject_dictionaries = []
for _____ in reject:
    #YOUR CODE HERE


### 4.3 Put 'em Together

Now concatenate the `accept_dictionaries`, `examine_dictionaries`, and `reject_dictionaries` lists to make one big list called `rec_list`.

In [74]:
# FILL ME OUT
rec_list = _______

# uncomment to test your code
# print(len(rec_list))
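
As a quick refresher, the `+` operator joins lists end to end and returns a new list. A sketch with made-up lists:

```python
first = [{"decision": "accept"}]
second = [{"decision": "examine"}, {"decision": "examine"}]
third = [{"decision": "reject"}]

# `+` builds a new list; the three originals are left unchanged.
combined = first + second + third

print(len(combined))
```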

# PART B: Repeat for all documents

We just wrote a program that takes one document and turns it into a dataset!

The problem is we have 11 documents!

We'll now modify our program to create our data set from all 11 documents.

## 5. Make a function

**task**

Combine the code you wrote above to create a function that takes a file name and returns a list of dictionaries representing all of the recommendations in that document.

**skills**
- Functions
- Copying and pasting :)

In [76]:
# complete the code below.

def process_document(file_name):
    
    # YOUR CODE HERE
    
    return(rec_list)

In [79]:
# test your code!
print(process_document("tuvalu2013.txt")[:5])

[{'from': 'Costa Rica', 'to': 'tuvalu', 'decision': 'accept', 'year': '2013', 'text': '82.1. Continue the efforts to achieve accession to the main human rights international instruments and their consistent incorporation into domestic legislation (Costa Rica); '}, {'from': 'Nicaragua', 'to': 'tuvalu', 'decision': 'accept', 'year': '2013', 'text': '82.2. Consider ratifying new international human rights instruments which would assist in strengthening its legal and institutional framework for the promotion and protection of human rights (Nicaragua); '}, {'from': 'Turkey', 'to': 'tuvalu', 'decision': 'accept', 'year': '2013', 'text': '82.3. Continue its efforts to accede to the remaining core international human rights treaties, which will strengthen the domestic legislation with regard to the promotion and protection of human rights, including freedom of religion or belief (Turkey); '}, {'from': 'Viet Nam', 'to': 'tuvalu', 'decision': 'accept', 'year': '2013', 'text': '82.4. Work closely

## 6. Loop through filenames

**task**

1. Find the file_names in our directory.
2. Apply the function above to all the filenames
3. Create a master database

**skills**
- I/O
- Loops
- Functions

### 6.1 Make a list of file_names

The program below reads all the file_names in the directory `data/txts`.

In [84]:
dir = 'data/txts'
for file_name in os.listdir(dir):
    print(file_name)

.DS_Store
afghanistan2014.txt
bangladesh2013.txt
cotedivoire2014.txt
djibouti2013.txt
fiji2014.txt
jordan2013.txt
kazakhstan2014.txt
monaco2013.txt
sanmarino2014.txt
turkmenistan2013.txt
tuvalu2013.txt


Modify the program to include only the file_names that end in `.txt`

In [85]:
# YOUR CODE HERE

afghanistan2014.txt
bangladesh2013.txt
cotedivoire2014.txt
djibouti2013.txt
fiji2014.txt
jordan2013.txt
kazakhstan2014.txt
monaco2013.txt
sanmarino2014.txt
turkmenistan2013.txt
tuvalu2013.txt
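
The filtering idea can be tried out on a plain list of strings before touching the directory. The names in this sketch are made up, and `notes.md` is a hypothetical extra file:

```python
names = [".DS_Store", "fiji2014.txt", "notes.md", "tuvalu2013.txt"]

# Keep only the names with a ".txt" extension.
txt_names = [name for name in names if name.endswith(".txt")]

print(txt_names)
```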


### 6.2 Process the documents

Fill in the blanks below to process each document.

In [None]:
all_recs = []
for file_name in os.listdir(dir):
    if ______.______(".txt"):
        recs = ______(file_name)
        ______._____(recs)
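
When collecting results from several documents, the choice of list method decides whether you end up with one flat list or a list of lists. A sketch with made-up batches:

```python
batch_one = [{"to": "a"}, {"to": "b"}]
batch_two = [{"to": "c"}]

# extend adds each item of the batch to the master list.
flat = []
flat.extend(batch_one)
flat.extend(batch_two)

# append adds the whole batch as a single item.
nested = []
nested.append(batch_one)
nested.append(batch_two)

print(len(flat), len(nested))
```

For a master list of recommendations, where the length should equal the total number of recommendations, the flat version is the one that matches.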

In [87]:
# Should be 1830
print(len(all_recs))

1830

### 6.3 Save to file

Now we get to save our database to a CSV, and we're done!

In [92]:
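import csv

# use the keys of the first record as the column headings
keys = all_recs[0].keys()

# write the header row, then one row per recommendation
# (newline='' keeps the csv module from adding blank lines between rows on Windows)
with open('upr-recs.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_recs)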
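
A simple way to check a CSV you have just written is to read it back and compare. The sketch below does this on one made-up record written to a temporary file; the record, the path, and the variable names are all hypothetical.

```python
import csv
import os
import tempfile

rows = [{"to": "fiji", "from": "Chile", "year": "2014",
         "decision": "accept", "text": "1.1 Do A (Chile); "}]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# Write a header row followed by one row per dictionary.
with open(path, "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; each row comes back as a dictionary keyed by the header.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded[0]["from"])
```

Note that everything read back from a CSV is a string, so a year stored as the integer `2014` would come back as `"2014"`.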