# Assignment 4a: Data structures (CSV/TSV and JSON)

**Deadline for Assignment 4a+b: Friday, October 14, 2022 (5pm) via Canvas (Assignment 4)** 

* Please name your notebook with the following naming convention: 
  ASSIGNMENT_4a_FIRSTNAME_LASTNAME.ipynb 
* Please submit your complete assignment (4a + 4b) by compressing all your material into **a single .zip file** following this naming convention: ASSIGNMENT_4_FIRSTNAME_LASTNAME.zip.  

## Please note that there is a BA and an MA version of Assignment 4b

In case you are not sure about creating a zip file from a folder, please refer to [this guide](https://fossbytes.com/how-to-zip-file-in-windows-mac/) (or any other guide you find online).



If you have **questions** about this chapter, please contact us at cltl.python.course@gmail.com. Questions and answers will be collected in [this Q&A document](https://docs.google.com/document/d/1ynQAqPa2CGB02okyyE4F1StytDqpyRoBqUpWfeBqI_Y/edit?usp=sharing), so please check if your question has already been answered. 

In this block, we covered the following chapters about data formats:

- Chapter 16 - Data Formats I (CSV/TSV)
- Chapter 17 - Data Formats II (JSON)
- Chapter 18 - Data Formats III (XML) *only for master-level course*

In this assignment, you will also have to apply your knowledge about containers. If you get stuck, you are likely to find solutions in the chapters about containers (Block 2). 


**Tip**:

Your code may throw a Unicode error when you try to open one of the files used in this assignment. If this happens, you can probably solve it by specifying the encoding when reading in the file:

```python
with open('your/file/path', 'r', encoding='utf-8') as infile:
    pass  # your code
```


## Exercise 1: Trump's Facebook Status Updates (CSV/TSV)

In the folder `../Data/csv_data` there is a TSV file called `trump_facebook.tsv` that contains Facebook status updates posted by Donald Trump. It was downloaded from [here](https://www.reddit.com/r/datasets/comments/581hqm/all_of_donald_trumps_facebook_statuses_reaction). Follow the instructions below to read the file and find specific status updates.


### 1a. Write your own function for reading CSV
Write a function called `read_csv()` that has two parameters: 

* **`input_file`** (positional parameter) and 
* **`delimiter`** (keyword parameter with default string `","`). 

The function should read the file and return `status_updates`, which contains the content of the file as a 'list of dicts'. When tested on `../Data/csv_data/trump_facebook.tsv`, the first two status updates should thus be represented as follows:

```
[{'link_name': 'Timeline Photos',
  'num_angrys': '7',
  'num_comments': '543',
  'num_hahas': '17',
  'num_likes': '6178',
  'num_loves': '572',
  'num_reactions': '6813',
  'num_sads': '0',
  'num_shares': '359',
  'num_wows': '39',
  'status_id': '153080620724_10157915294545725',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'status_published': '10/17/2016 20:56:51',
  'status_type': 'photo'},
 {'link_name': '',
  'num_angrys': '5211',
  'num_comments': '3644',
  'num_hahas': '75',
  'num_likes': '26649',
  'num_loves': '487',
  'num_reactions': '33768',
  'num_sads': '191',
  'num_shares': '17653',
  'num_wows': '1155',
  'status_id': '153080620724_10157914483265725',
  'status_link': 'https://www.facebook.com/DonaldTrump/videos/10157914483265725/',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 8th, we will END this RIGGED system once and for all!",
  'status_published': '10/17/2016 18:00:41',
  'status_type': 'video'}]
```

**DO NOT USE THE CSV MODULE FOR THIS EXERCISE!**

In case you didn't manage to create the `read_csv()` function, run the following code, which uses the `DictReader` class from the `csv` module, to get the data in the right format for the following exercises:

In [2]:
def read_csv(input_file, delimiter=","):
    """
    Reads a CSV file and returns it in a list of dictionaries
    :param input_file: Path to the CSV to be read
    :param delimiter: Delimiter to split the contents of the file on
    :return: A list of dictionaries which contains the contents of the CSV
    """
    status_updates_list = []

    # Open the file and store the contents in a variable
    with open(input_file, 'r', encoding='utf-8') as file_reader:
        csv_data = file_reader.read()
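    # Split the data on the newline character so that every row is stored separately in a list
    # Empty lines (e.g. after a trailing newline at the end of the file) are skipped
    csv_data = [line for line in csv_data.split('\n') if line]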
    # The first row contains the column names, which we split based on the given delimiter
    column_names = csv_data[0].split(delimiter)
    # The rest of the rows contain the actual values
    rows = csv_data[1:]
    # Process the values in the rows iteratively
    for row in rows:
        status_update_dict = dict()
        # Same principle as splitting the column names
        row_values = row.split(delimiter)

        # Iterate over the column names and row values, which are lists in the same order.
        # So it's safe to use zip()
        for column_name, value in zip(column_names, row_values):
            # Update the current dictionary
            status_update_dict[column_name] = value
        # Add the constructed dictionary to the list of dictionaries
        status_updates_list.append(status_update_dict)
    return status_updates_list


# test your function here
filename = "../Data/csv_data/trump_facebook.tsv"
status_updates = read_csv(filename, delimiter="\t")
status_updates[0:2]

[{'status_id': '153080620724_10157915294545725',
  'status_message': 'Beautiful evening in Wisconsin- THANK YOU for your incredible support tonight! Everyone get out on November 8th - and VOTE! LETS MAKE AMERICA GREAT AGAIN! -DJT',
  'link_name': 'Timeline Photos',
  'status_type': 'photo',
  'status_link': 'https://www.facebook.com/DonaldTrump/photos/a.488852220724.393301.153080620724/10157915294545725/?type=3',
  'status_published': '10/17/2016 20:56:51',
  'num_reactions': '6813',
  'num_comments': '543',
  'num_shares': '359',
  'num_likes': '6178',
  'num_loves': '572',
  'num_wows': '39',
  'num_hahas': '17',
  'num_sads': '0',
  'num_angrys': '7'},
 {'status_id': '153080620724_10157914483265725',
  'status_message': "The State Department's quid pro quo scheme proves how CORRUPT our system is. Attempting to protect Crooked Hillary, NOT our American service members or national security information, is absolutely DISGRACEFUL. The American people deserve so much better. On November 

In [3]:
# Not going to use this one as the self-written function works as expected

# import csv
#
# filename = "../Data/csv_data/trump_facebook.tsv"
# with open(filename, "r") as infile:
#     status_updates = []
#     csv_reader = csv.DictReader(infile, delimiter='\t')
#     for row in csv_reader:
#         status_updates.append(row)

### 1b. Find the status updates with the most responses

Define a function called **`get_update_most_responded_to()`** that has the following parameters: 
* **`status_updates`** (positional parameter) 
* **`response_type`** (keyword parameter with default string `"likes"`) 

The function should find the status update that received the highest number of responses of the given type. A response type can be any of the possible responses to a Facebook status (such as 'angrys', 'comments', 'hahas', etc. - anything that starts with 'num_'). It should return three strings: the **`status_message`**, the **`status_type`** and the **`status_link`** of this particular status update.


In [4]:
def get_update_most_responded_to(status_updates, response_type='likes'):
    """
    Gets the message, type and link of a status update with the highest no. responses
    :param status_updates: The status updates to search through
    :param response_type: The response type to search the highest number of
    :return: Tuple containing the message, type and link
    """
    # The current maximum no. responses starts at minus infinity,
    # so that any real number of responses is greater
    max_responses = float('-inf')
    # Tuple which is going to store the current status with the greatest no. responses
    selected_status = ('', '', '')

    for status_update in status_updates:
        # Store the current no. responses in a variable (and cast to an integer as it's still a string)
        num_responses = int(status_update[f'num_{response_type}'])

        # Check if the selected no. responses is greater than the current maximum no. responses
        if num_responses > max_responses:
            # Store the status_message, status_type and status_link of the current status update in the selected_status tuple
            selected_status = (status_update['status_message'], status_update['status_type'], status_update['status_link'])
            # Update the new maximum no. responses
            max_responses = num_responses
    return selected_status

# Get the status_message, status_type and status_link of the status update with the highest no. reactions
# Disclaimer: some fields might be empty (this means these fields are also empty in the original dataset)
status_message, status_type, status_link = get_update_most_responded_to(status_updates, response_type='reactions')
print(f'status_message: {status_message}')
print(f'status_type: {status_type}')
print(f'status_link: {status_link}')

status_message: Stop congratulating Obama for killing Bin Laden. The Navy Seals killed Bin Laden.
status_type: status
status_link: 
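
The same search can also be written with the built-in `max()` and a `key` function. The following is a minimal sketch on two made-up status updates; the sample data and the name `get_update_most_responded_to_max` are hypothetical and not part of the assignment. Note that `max()` raises a `ValueError` on an empty list, whereas the loop above returns three empty strings.

```python
def get_update_most_responded_to_max(status_updates, response_type='likes'):
    """Return (status_message, status_type, status_link) of the update with the most responses."""
    # max() compares the updates by the integer value of the chosen 'num_' column
    best = max(status_updates, key=lambda update: int(update[f'num_{response_type}']))
    return best['status_message'], best['status_type'], best['status_link']


# Made-up sample data in the same 'list of dicts' format
sample_updates = [
    {'status_message': 'first', 'status_type': 'photo', 'status_link': 'link-1',
     'num_likes': '10', 'num_comments': '7'},
    {'status_message': 'second', 'status_type': 'video', 'status_link': 'link-2',
     'num_likes': '9', 'num_comments': '12'},
]
print(get_update_most_responded_to_max(sample_updates))
print(get_update_most_responded_to_max(sample_updates, response_type='comments'))
```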


### 1c. Find the longest status updates

Define a function called **`get_longest_update()`** that has the following parameters: 
* **`status_updates`** (positional parameter) 
* **`length_type`** (keyword parameter with default string `"tokens"`). 

The function should find the longest update. By default, the function should find the status update that is the longest in terms of number of tokens. Also implement the options to find the longest status update in terms of characters or sentences in the message. These options should be carried out when `length_type` is changed to `"sentences"` or `"characters"`.

The function should return the status message (called `'status_message'` in the data structure) of the longest update as a string. 

**Attention**: It is recommended to use NLTK for this exercise. 
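
To make the three options concrete, the sketch below measures one made-up message in all three ways. To stay self-contained, it uses simple regular expressions as rough stand-ins for `nltk.word_tokenize()` and `nltk.sent_tokenize()`. This is an assumption for illustration only, and NLTK will split real messages more carefully. The names `LENGTH_FUNCTIONS` and `measure_length` are hypothetical.

```python
import re

# Rough stand-ins for NLTK (assumption for illustration, not NLTK's behaviour):
# a token is a run of word characters or a single punctuation mark,
# a sentence ends at '.', '!' or '?' followed by whitespace.
LENGTH_FUNCTIONS = {
    'tokens': lambda message: len(re.findall(r"\w+|[^\w\s]", message)),
    'sentences': lambda message: len([s for s in re.split(r"(?<=[.!?])\s+", message) if s]),
    'characters': len,
}


def measure_length(message, length_type='tokens'):
    """Return the length of message in tokens, sentences or characters."""
    if length_type not in LENGTH_FUNCTIONS:
        raise ValueError(f"Unknown length_type: {length_type!r}")
    return LENGTH_FUNCTIONS[length_type](message)


message = "Thank you! See you soon."
for length_type in LENGTH_FUNCTIONS:
    # Prints 7 tokens, 2 sentences and 24 characters for this message
    print(length_type, measure_length(message, length_type))
```

Looking up the measuring function in a dictionary keeps the check for unknown options and the actual measuring in one place.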


In [7]:
import nltk

nltk.download('punkt')

def get_longest_update(status_updates, length_type='tokens'):
    """
    Gets a status message with the most no. tokens, sentences or characters
    :param status_updates: The status updates to search through
    :param length_type: The type to check on
    :return: A status message (string)
    """
    # Variable which stores the current longest update
    current_longest_update = ''
    current_longest_length = float('-inf')
    # Definition of the possible length types
    length_type_tokens = 'tokens'
    length_type_sentences = 'sentences'
    length_type_characters = 'characters'

    # Check if the given length_type is known and thus can be handled by this function
    assert length_type in [length_type_tokens, length_type_sentences, length_type_characters], \
        f"The given length_type '{length_type}' is not found! " \
        f"Possible options are '{length_type_tokens}', '{length_type_sentences}' and '{length_type_characters}'"

    for status_update in status_updates:
        # Store the status message in a variable
        status_message = status_update['status_message']
        update_length = -1

        # Check what kind of length type we're dealing with
        if length_type == 'tokens':
            # Tokenize the current status message
            tokens = nltk.word_tokenize(status_message)
            # Get the no. tokens
            update_length = len(tokens)
        elif length_type == 'sentences':
            # Split the current status message into sentences
            sentences = nltk.sent_tokenize(status_message)
            # Get the no. sentences
            update_length = len(sentences)
        elif length_type == 'characters':
            # Get the no. characters
            update_length = len(status_message)

        # Check if the current length is longer than the longest length so far
        if update_length > current_longest_length:
            # Update the longest length and the corresponding status message
            current_longest_length = update_length
            current_longest_update = status_message
    return current_longest_update


print(f"Longest update based on the number of tokens: {get_longest_update(status_updates, length_type='tokens')}")
print()
print(f"Longest update based on the number of sentences: {get_longest_update(status_updates, length_type='sentences')}")
print()
print(f"Longest update based on the number of characters: {get_longest_update(status_updates, length_type='characters')}")
print(f"Calling get_longest_update() this way should raise an error (which is intentional): {get_longest_update(status_updates, length_type='words')}")

[nltk_data] Downloading package punkt to /Users/maxfaber/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Longest update based on the number of tokens: ***Message from Eric Trump*** Before my father takes the stage to face Hillary Clinton, I'll be giving him a list of supporters who made a contribution just before the big debate.   Add your name to the list: http://bit.ly/2duCRlw   Please contribute $100, $65, $35, $20, $15, or even $3 before 8pm ET tonight to get your name on the list of supporters I give him before he takes the stage.  Get on the list here: http://bit.ly/2duCRlw Paula P. James N. Kathy W. Erick M. Curt C. Mark C. Nancy  L. Barbara S. David L. Roy B. Kris M. Daniel S. Daniel D. Eugene L. Caihua W. Ken M. Tommy H. Bill S. Thomas O. Christine W. Dennis J. Erin C. Chad M. Rachel T. Carolyn G. William J. Cindy C. Eugene L. Judy F. Manny C. Edward R. Garry L. Grace B. Boris V. Chris  L. William J. Steven  T. Joann M. Paul S. James E. John P. Marc S. Jim B. Melynda S. Richard S. Jonathan J. Craig O. Ed K. Eileen M. Carmen M. Sherry P. Daniel Mabrey T. Chad B. Ellen  R. Scott P.

AssertionError: The given length_type 'words' is not found! Possible options are 'tokens', 'sentences' and 'characters'

### 1d. Find the status updates containing specific keywords

Define a function called **`get_updates_with_keywords()`** that takes three input arguments: 

* **`status_updates`** (mandatory positional argument) 
* **`keywords`** (mandatory positional argument) 
* **`case_sensitive`** (keyword argument with default `False`)

The function should find the status updates that contain **any of the keywords**. The parameter `case_sensitive` should specify whether uppercase and lowercase characters must be treated as distinct. 

The function should return **`filtered_status_updates`**, which is a list of dictionaries with all information about the status updates (same format as the input argument `'status_updates'`). 

**Attention**: It is highly recommended to use NLTK for this exercise. Make sure that you **tokenize** the messages before you look for keywords. 
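
The reason for tokenizing first is that a plain substring search also matches keywords inside longer words. The sketch below shows the difference on a made-up message. The function `simple_tokenize` is a hypothetical regular-expression stand-in for `nltk.word_tokenize()`, used here only to keep the example self-contained.

```python
import re


def simple_tokenize(message):
    """Rough stand-in for nltk.word_tokenize() (assumption for illustration)."""
    return re.findall(r"\w+|[^\w\s]", message)


message = "Obamacare was discussed in the debate."
keyword = "obama"

# Substring search: also matches inside the longer word 'Obamacare'
substring_match = keyword in message.lower()
# Token search: only matches complete tokens
token_match = keyword in [token.lower() for token in simple_tokenize(message)]

print(substring_match)  # True
print(token_match)      # False
```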

In [17]:
import nltk

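def get_updates_with_keywords(status_updates, keywords, case_sensitive=False):
    """
    Gets all status updates whose message contains at least one of the keywords
    :param status_updates: The status updates to search through
    :param keywords: The keywords to look for
    :param case_sensitive: Whether uppercase and lowercase characters are treated as distinct
    :return: A list of dictionaries containing the matching status updates
    """
    filtered_status_updates = []
    # Normalize the keywords once; a set makes the membership test fast
    if case_sensitive:
        keyword_set = set(keywords)
    else:
        keyword_set = {keyword.lower() for keyword in keywords}

    for status_update in status_updates:
        # Tokenize the message before looking for keywords
        tokens = nltk.word_tokenize(status_update['status_message'])
        if not case_sensitive:
            tokens = [token.lower() for token in tokens]
        # Keep the status update if any of its tokens is a keyword
        if any(token in keyword_set for token in tokens):
            filtered_status_updates.append(status_update)
    return filtered_status_updates

keywords = ["clinton", "obama"] # test with these keywords; also experiment with other keywords
updates_with_keywords = get_updates_with_keywords(status_updates, keywords)
# Check whether all matching status messages are unique
len(updates_with_keywords) == len(set([x['status_message'] for x in updates_with_keywords]))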

False

## Exercise 2: Nobel Prize Winners (JSON)

There is a lot of interesting data online. For example, the [Nobel Prize Organisation](https://www.nobelprize.org) provides the [Nobel Prize API](https://nobelprize.readme.io) that allows you to download information about the prizes, the laureates and the countries. 

The information is formatted in JSON. Have a look at the following URLs:
- http://api.nobelprize.org/v1/prize.json
- http://api.nobelprize.org/v1/laureate.json
- http://api.nobelprize.org/v1/country.json

For this exercise, we will only look at the prizes and the laureates. 

We can download the data using the `requests` module. How this works is shown below.

In [None]:
import requests

In [None]:
# Download data on prizes
api_url = "http://api.nobelprize.org/v1/prize.json"
r = requests.get(api_url)
dict_prizes = r.json()
# uncomment the line below if you'd like to see what's inside dict_prizes
#dict_prizes 

In [None]:
# Download data on laureates
api_url = "http://api.nobelprize.org/v1/laureate.json"
r = requests.get(api_url)
dict_laureates = r.json()
# uncomment the line below if you'd like to see what's inside dict_laureates
#dict_laureates 

### 2a. Read the JSON files

We have already stored the data as the JSON files `laureate.json` and `prize.json` in the folder `../Data/json_data/NobelPrize`. Open these JSON files and load them as the Python dictionaries `dict_laureates` and `dict_prizes`.
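
Loading a JSON file follows the same pattern as opening any text file, with `json.load()` doing the conversion to Python containers. The sketch below is self-contained: it first writes a made-up miniature file to a temporary folder (the content, including the top-level key `"prizes"`, is an assumption for illustration) and then loads it again. For the assignment, you would open the files in `../Data/json_data/NobelPrize` instead.

```python
import json
import os
import tempfile

# Made-up miniature of a prize file, written to a temporary folder for the demo
sample = {"prizes": [{"year": "2018", "category": "peace"}]}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "prize.json")
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(sample, outfile)

    # json.load() reads from an open file object and returns Python containers
    with open(path, "r", encoding="utf-8") as infile:
        loaded = json.load(infile)

print(type(loaded))
print(loaded == sample)
```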

In [None]:
# load laureate.json and prize.json here

### 2b. Get all laureates from year and category

Create a function called **`get_laureates()`** that has three parameters: 

* **`dict_prizes`** (positional parameter) 
* **`year`** (keyword parameter with default `None`) 
* **`category`** (keyword parameter with default `None`) 

The function should find all laureates that received the Nobel Prize, optionally in a specific year and/or category (specified using the keywords `year` and `category`). It should return a list of the full names of the laureates.

For example, for the year 2018 and category "peace" it should return the list `['Denis Mukwege', 'Nadia Murad']`.
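
One possible approach is sketched below, not as a reference solution. It assumes that `dict_prizes` has a top-level key `"prizes"`, that each prize stores `"year"` and `"category"` as strings, and that each laureate has a `"firstname"` and, usually, a `"surname"`. Check these assumptions against the actual file. The sample data is a made-up miniature.

```python
def get_laureates(dict_prizes, year=None, category=None):
    """Return the full names of all laureates, optionally filtered by year and/or category."""
    laureates = []
    for prize in dict_prizes["prizes"]:
        # Years are assumed to be stored as strings, so compare as strings
        if year is not None and prize.get("year") != str(year):
            continue
        if category is not None and prize.get("category") != category:
            continue
        # Some prizes may have no laureates (e.g. a year without an award)
        for laureate in prize.get("laureates", []):
            # Organisations may lack a surname, so join only the parts present
            parts = [laureate.get("firstname", ""), laureate.get("surname", "")]
            laureates.append(" ".join(part for part in parts if part))
    return laureates


# Made-up miniature of the assumed structure
sample_prizes = {"prizes": [
    {"year": "2018", "category": "peace", "laureates": [
        {"firstname": "Denis", "surname": "Mukwege"},
        {"firstname": "Nadia", "surname": "Murad"}]},
    {"year": "2018", "category": "literature", "laureates": [
        {"firstname": "Olga", "surname": "Tokarczuk"}]},
    {"year": "1942", "category": "peace"},
]}
print(get_laureates(sample_prizes, year=2018, category="peace"))
```

Because both filters default to `None`, calling the function without `year` and `category` returns every laureate in the data.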

In [None]:
def get_laureates(dict_prizes, year=None, category=None):
    # your code here
    pass


year = 2018
category = "peace"
# test your function here

### 2c. Get all prizes from affiliations

Create a function called **`get_affiliation_prizes()`** that takes one input parameter: 

* **`dict_laureates`** (positional parameter) 

The function should find all affiliations that were involved in winning the Nobel Prize and provide information on the category and year of those Nobel Prizes. It should return a nested dictionary of the following format:

```
{
    "A.F. Ioffe Physico-Technical Institute": [
        {"category": "physics", "year": "2000"}
    ],
    "Aarhus University": [
        {"category": "chemistry", "year": "1997"},
        {"category": "economics", "year": "2010"}
    ]
}
```

**Tip:** some of the entries will lack information (for example, there is no associated affiliation). Use `if-statements` to check if essential information is present. 

**General tip for working with data**: If your code breaks, check whether your assumptions about the data hold (very often, they unfortunately do not). For instance, a dictionary key you thought was always present may be missing from a couple of dictionaries. 
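
The checks described in these tips could look as follows. This is a sketch under stated assumptions rather than a reference solution: it assumes a top-level key `"laureates"`, a list `"prizes"` per laureate, and a list `"affiliations"` per prize in which a missing affiliation shows up as an entry without a `"name"` (for example an empty list). The sample data is made up to exercise those cases.

```python
def get_affiliation_prizes(dict_laureates):
    """Map each affiliation name to a list of {'category', 'year'} dictionaries."""
    affiliation_prizes = {}
    for laureate in dict_laureates["laureates"]:
        for prize in laureate.get("prizes", []):
            # Skip prizes that lack essential information
            if "category" not in prize or "year" not in prize:
                continue
            entry = {"category": prize["category"], "year": prize["year"]}
            for affiliation in prize.get("affiliations", []):
                # An affiliation without a name (e.g. an empty list) is skipped
                if not isinstance(affiliation, dict) or "name" not in affiliation:
                    continue
                prizes = affiliation_prizes.setdefault(affiliation["name"], [])
                # Avoid listing the same prize twice for shared awards
                if entry not in prizes:
                    prizes.append(entry)
    return affiliation_prizes


# Made-up miniature of the assumed structure
sample_laureates = {"laureates": [
    {"firstname": "A", "prizes": [
        {"year": "1997", "category": "chemistry",
         "affiliations": [{"name": "Aarhus University"}]}]},
    {"firstname": "B", "prizes": [
        {"year": "2010", "category": "economics",
         "affiliations": [{"name": "Aarhus University"}]}]},
    {"firstname": "C", "prizes": [
        {"year": "1901", "category": "peace", "affiliations": [[]]}]},
]}
print(get_affiliation_prizes(sample_laureates))
```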

In [None]:
def get_affiliation_prizes(dict_laureates):
    # your code here
    pass

# test your function here

### 2d. Write to JSON

Next, write the dictionary created in the previous exercise to a JSON file using the following path: 

`../Data/json_data/NobelPrize/nobel_prizes_affiliations.json`.
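
Writing works the other way around with `json.dump()`. In the sketch below, a temporary folder stands in for `../Data/json_data/NobelPrize` and the dictionary is a made-up miniature, so the example runs on its own. The file is read back at the end to check that nothing was lost.

```python
import json
import os
import tempfile

# Made-up miniature of the dictionary from the previous exercise
affiliation_prizes = {
    "Aarhus University": [{"category": "chemistry", "year": "1997"}],
}

# A temporary folder stands in for ../Data/json_data/NobelPrize in this sketch
with tempfile.TemporaryDirectory() as folder:
    json_file = os.path.join(folder, "nobel_prizes_affiliations.json")
    with open(json_file, "w", encoding="utf-8") as outfile:
        # indent makes the file readable; ensure_ascii=False keeps accented names intact
        json.dump(affiliation_prizes, outfile, indent=4, ensure_ascii=False)

    # Read the file back to check that nothing was lost
    with open(json_file, "r", encoding="utf-8") as infile:
        round_trip = json.load(infile)

print(round_trip == affiliation_prizes)
```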

In [None]:
# write the resulting dictionary to 'json_file'