# 🛠 IFQ718 Module 04 Exercises-03

## 🔍  Context: Parsing common data exchange formats

This notebook introduces two more modules from the Python Standard Library: `csv` and `json`.

They make it convenient to read and write CSV- and JSON-formatted files.


### The `csv` module

You have already become familiar with the comma-separated values (CSV) format in a previous notebook. Remember that data formatted as CSV looks like this:

<pre>

sepal_length,sepal_width,petal_length,petal_width,species
4.4,3.0,1.3,0.2,setosa
5.1,3.4,1.5,0.2,setosa
5.0,3.5,1.3,0.3,setosa
4.5,2.3,1.3,0.3,setosa
4.4,3.2,1.3,0.2,setosa
5.8,2.6,4.0,1.2,versicolor
5.0,2.3,3.3,1.0,versicolor
5.6,2.7,4.2,1.3,versicolor
5.7,3.0,4.2,1.2,versicolor
5.7,2.9,4.2,1.3,versicolor
6.3,3.3,6.0,2.5,virginica
6.8,3.2,5.9,2.3,virginica
6.7,3.3,5.7,2.5,virginica
6.7,3.0,5.2,2.3,virginica
6.3,2.5,5.0,1.9,virginica

</pre>

Previously, you parsed this format using the `.split(',')` approach. We now introduce the `csv` module, which can correctly parse more complex CSV files.

You may be asking, why is this important? Consider a CSV file with a column of sentences, where the comma `,` is used as a punctuation mark rather than a field delimiter. Using `.split(',')` would split such a line in the wrong places. To avoid this, a properly formatted CSV file surrounds that field with quotation marks, like this:

<pre>
author,sentence
Public domain,"Hello, World!"
Jake Bradford,"You should consider the use of quotations, as commas may break an ill-formatted CSV file"
</pre>

Let's try using the `.split(',')` approach. We will count how many fields are on each line.

In [None]:
text = '''author,sentence
Public domain,"Hello, World!"
Jake Bradford,"You should consider the use of quotations, as commas may break an ill-formatted CSV file"
'''

for line in text.splitlines():  # splitlines() avoids a trailing empty line
    fields = line.strip().split(',')
    print(f'Number of fields on this line: {len(fields)}', f'\t{fields}')

Okay, this is not looking good. The header row has two fields: `author` and `sentence`, but the subsequent rows have three fields (as the second field has been incorrectly split into two).

Let's try again with a different dataset:

In [None]:
# First, download the dataset
import os
import urllib.request

os.makedirs('data', exist_ok=True)  # make sure the target folder exists
urllib.request.urlretrieve('https://gist.githubusercontent.com/jaidevd/23aef12e9bf56c618c41/raw/c05e98672b8d52fa0cb94aad80f75eb78342e5d4/books_new.csv', 'data/books.csv')

In [None]:
# Here are the headers of the CSV file
with open('data/books.csv', 'r') as fp:
    print(fp.readline())

In [None]:
# and now, the first few lines
with open('data/books.csv', 'r') as fp:
    for idx, line in enumerate(fp):
        print(line.strip())
        if idx == 5:
            break

In [None]:
# What do you notice?

Now, open the file with the `csv` module.

**Using the `csv.reader()` function**

In [None]:
import csv

with open('data/books.csv', 'r') as fp:
    reader = csv.reader(fp)
    for idx, line in enumerate(reader):
        print(f'Number of fields on this line: {len(line)}', f'\t{line}')
        if idx == 5:
            break

Look closely at some of the fields: they contain commas `,`.

For example, `Foreman, John` is a value of the `Author` field.

In [None]:
# Print only those fields containing a comma
with open('data/books.csv', 'r') as fp:
    reader = csv.reader(fp)
    next(reader) # skip the header row as we know already it does not contain any commas
    for idx, line in enumerate(reader):
        print(f'Number of fields on this line: {len(line)}', f'\t{[f for f in line if "," in f]}')
        if idx == 5:
            break
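Returning to the small two-row example from earlier, here is a minimal sketch of the same text parsed with `csv.reader()` instead of `.split(',')`. Wrapping the string in `io.StringIO` is our own choice for illustration, so that the example runs without a file on disk.

```python
import csv
import io

text = '''author,sentence
Public domain,"Hello, World!"
Jake Bradford,"You should consider the use of quotations, as commas may break an ill-formatted CSV file"
'''

# csv.reader accepts any iterable of lines; StringIO makes the string behave like a file
rows = list(csv.reader(io.StringIO(text)))

for row in rows:
    print(f'Number of fields on this line: {len(row)}', f'\t{row}')
```

Every row should now have two fields, with the comma inside the quoted sentence kept as part of the field.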

**Try using a different reader from the module: `csv.DictReader`**

`DictReader` provides a dictionary for each row, rather than a list. This allows you to select fields from a row using their column title rather than their index position.

Previously, to select the author, we would have used `line[1]`, as the author field is the second column (index 1) of the file.

Now, we can use the syntax `line['Author']`.

The same outcome is achieved, but, arguably, using `DictReader` makes for cleaner and more interpretable code.

In [None]:
# Print the author of the first few rows
with open('data/books.csv', 'r') as fp:
    reader = csv.DictReader(fp)
    # no need to skip the header row: DictReader reads it to obtain the field names
    for idx, line in enumerate(reader):
        print(line['Author'])
        if idx == 5:
            break
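To make the difference between the two readers concrete, here is a small side-by-side sketch. The sample rows are made up for illustration (they are not taken from `books.csv`), and `io.StringIO` stands in for a file.

```python
import csv
import io

# Made-up sample rows (hypothetical values), using column titles like those in books.csv
sample = '''Title,Author,Publisher
First Example,"Surname, Given",Example Press
Second Example,,Example Press
'''

# csv.reader yields one list per line, header row included
list_rows = list(csv.reader(io.StringIO(sample)))
print(list_rows[1][1])

# csv.DictReader uses the header row as keys and yields the data rows only
dict_rows = list(csv.DictReader(io.StringIO(sample)))
print(dict_rows[0]['Author'])

print(len(list_rows), len(dict_rows))
```

Note that `DictReader` returns one row fewer than `reader`, because it consumes the header row itself. An empty field, such as the missing author in the second row, comes back as an empty string.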

**Writing a CSV-formatted file using `csv.writer`**

The `csv` module also provides functionality to write CSV files with complex data.

We will continue using the publications dataset in the following examples.

In the first example, we will write the publication title and publisher to a new file.

In [None]:
# the number of rows to process
n = 10

# Open two files: one to read, another to write
with open('data/books.csv', 'r') as fpR, open('books_title_publisher.csv', 'w', newline='') as fpW:
    
    reader = csv.DictReader(fpR)
    writer = csv.writer(fpW)
    
    # the header row of the new file
    writer.writerow(['Title', 'Publisher'])
    
    for idx, row in enumerate(reader):
        # stop once n rows have been written
        if idx == n:
            break

        print('Writing to file: ', [row['Title'], row['Publisher']])
        writer.writerow([row['Title'], row['Publisher']])

Find the file in the JupyterLab file explorer, then open it using the text editor (do not use the CSV file viewer).
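When you inspect the raw text, pay attention to any field that contains a comma. The sketch below shows what `csv.writer` does in that case, using hypothetical rows and an in-memory `io.StringIO` buffer rather than a real file.

```python
import csv
import io

# Hypothetical rows; the last title contains a comma
rows = [
    ['Title', 'Publisher'],
    ['Plain title', 'Example Press'],
    ['Title, with a comma', 'Example Press'],
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # the default quoting mode is csv.QUOTE_MINIMAL
writer.writerows(rows)

output = buffer.getvalue()
print(output)
```

With the default settings, only the field that needs it is wrapped in quotation marks, so the written text can be read back by `csv.reader()` without ambiguity.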

### ✍ Activity 1: More on the publications dataset

In [None]:
# Open the dataset using the csv module

In [None]:
# Inspect the first ten rows. Loop through and break after ten iterations

In [None]:
# Count how many publications per publisher 

In [None]:
# How many publications do not have any authors listed?

In [None]:
# Create a new CSV file with only those publications that do not have any authors listed

### ✍ Activity 2: Open the Titanic dataset using the `csv` module

The dataset contains the following columns:

* survival - Indicates whether the passenger survived the shipwreck. It is our target variable (the variable we will be predicting).
* pclass - Indicates the socio-economic status of the given passenger (1st = Upper, 2nd = Middle, 3rd = Lower).
* sex - Male or Female.
* Age - The age of the passenger.
* sibsp - The number of siblings / spouses of the passenger that are on board.
* parch - The number of parents / children of the passenger that are on board.
* ticket - The passenger's ticket number (not necessarily unique, as passengers travelling together can share a ticket).
* fare - How much the passenger paid in total.
* cabin - The cabin number of the passenger.
* Embarked - Which port the passenger embarked from (C = Cherbourg, Q = Queenstown, S = Southampton).

In [None]:
# First, download the dataset, if you don't have it already
import urllib.request
urllib.request.urlretrieve('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv', 'data/titanic.csv')

In [None]:
# Write your code here

In [None]:
# How many passengers survived?

In [None]:
# How many passengers from each socio-economic class?

In [None]:
# What was the total of all fares collected?

In [None]:
# What are the names of the oldest and youngest passengers?

### The `json` module

Pronounced 'jason', the `json` module reads and writes JSON-formatted data, which looks like this:

```json
[
    {
        "AwayTeam": "Western Bulldogs",
        "AwayTeamScore": 71,
        "DateUtc": "2022-03-16 08:10:00Z",
        "Group": null,
        "HomeTeam": "Melbourne",
        "HomeTeamScore": 97,
        "Location": "MCG",
        "MatchNumber": 1,
        "RoundNumber": 1
    },
    {
        "AwayTeam": "Richmond",
        "AwayTeamScore": 76,
        "DateUtc": "2022-03-17 08:25:00Z",
        "Group": null,
        "HomeTeam": "Carlton",
        "HomeTeamScore": 101,
        "Location": "MCG",
        "MatchNumber": 2,
        "RoundNumber": 1
    }
]
```

Yes, the structure looks very much like Python `dict`s and `list`s. How convenient.
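The resemblance is close but not exact: JSON writes `null` where Python writes `None`, for example. As a minimal sketch, the snippet below parses one hand-written record (a shortened version of the sample above) with `json.loads()`, which reads JSON from a string, and shows the Python types that come back.

```python
import json

# A single record in the same shape as the sample above, shortened for illustration
record_text = '{"HomeTeam": "Melbourne", "HomeTeamScore": 97, "Group": null}'

record = json.loads(record_text)  # loads = "load from a string"

print(type(record))                   # a JSON object becomes a dict
print(type(record['HomeTeamScore']))  # a JSON integer becomes an int
print(record['Group'])                # JSON null becomes None
```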

JSON is *JavaScript Object Notation*, defined in [RFC 7159](https://datatracker.ietf.org/doc/html/rfc7159.html) (since obsoleted by RFC 8259):

>JavaScript Object Notation (JSON) is a lightweight, text-based,
   **language-independent data interchange format**.  It was derived from
   the ECMAScript Programming Language Standard.  JSON defines a small
   set of formatting rules for the portable representation of structured
   data.
   
*Do you like the AFL? More on that in a moment...*

Let's download some JSON-formatted data, if you don't have it already.

In [None]:
import urllib.request
urllib.request.urlretrieve('https://fixturedownload.com/feed/json/afl-2022', 'data/afl-2022.json')

Now, read the contents of the file, just as a plain string:

In [None]:
with open('data/afl-2022.json', 'r') as fp:
    for idx, line in enumerate(fp):
        print(f'Line #{idx+1} is {len(line)} chars\n\n\t{line[:150]}\n\t...')
        if idx == 10:
            break

Well, that is interesting. There is only one line in this file. It is very long. And it does not look very easy to parse manually.

Side note: I found a JSON parser that *geekskool* has written. [It is long](https://github.com/geekskool/python-json-parser/blob/59e8d445d22d775a59a5b635fe5000382e3025b9/jsonparse.py).

Let's use the `json` module:

In [None]:
import json
with open('data/afl-2022.json', 'r') as fp:
    afl = json.load(fp)
    print(type(afl))

Great. The `json.load()` function gave me a native Python `list`. This is more workable than the horrible string we were given earlier.

I can **dump** this as a "pretty **s**tring" using `json.dumps()` (the `s` stands for *string*) to make it more readable. However, the output will be very long, so I will print only the first three items of the structure.

In [None]:
print(json.dumps(afl[0:3], indent=4, sort_keys=True))
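The `json` module can also write to a file. The sketch below uses `json.dump()` (no `s`) to write a small structure to disk and `json.load()` to read it back. The record is copied from the sample at the start of this section, and the output path in a temporary directory is our own choice for illustration.

```python
import json
import os
import tempfile

# One record copied from the sample at the start of this section
matches = [{"HomeTeam": "Melbourne", "HomeTeamScore": 97,
            "AwayTeam": "Western Bulldogs", "AwayTeamScore": 71, "Group": None}]

# Hypothetical output location inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'matches.json')

with open(path, 'w') as fp:
    json.dump(matches, fp, indent=4, sort_keys=True)  # dump = write to a file object

with open(path, 'r') as fp:
    round_trip = json.load(fp)

# The data read back should equal the data that was written
print(round_trip == matches)
```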

How many top-level records are there?

In [None]:
print(len(afl))

What does the JSON file contain, i.e., what is its general context? For comparison, the previous dataset contained publications.

In [None]:
# Type your answer here

**Count the number of times each pair of teams has played each other**

In [None]:
counts = {}
for match in afl:
    
    # consider the two teams in alphabetical order
    teams = ' vs '.join(sorted([match['HomeTeam'], match['AwayTeam']]))
                    
    if teams not in counts:
        counts[teams] = 0
    counts[teams] += 1
                    
for teams in counts:
    print(f'{teams: <40}', counts[teams])
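The same counting pattern can be written more compactly with `collections.Counter` from the Standard Library. This is an alternative sketch, not a replacement for the loop above. It runs on a tiny hand-made list so that it is self-contained: the first two fixtures come from the sample at the start of this section, and the third is a made-up return match.

```python
from collections import Counter

sample_matches = [
    {'HomeTeam': 'Melbourne', 'AwayTeam': 'Western Bulldogs'},
    {'HomeTeam': 'Carlton', 'AwayTeam': 'Richmond'},
    {'HomeTeam': 'Western Bulldogs', 'AwayTeam': 'Melbourne'},  # hypothetical return match
]

# Sort the two team names so that home/away order does not matter
pair_counts = Counter(
    ' vs '.join(sorted([m['HomeTeam'], m['AwayTeam']])) for m in sample_matches
)

# most_common() lists the pairs from most to least frequent
for teams, count in pair_counts.most_common():
    print(f'{teams: <40}', count)
```

To apply this to the real data, replace `sample_matches` with the `afl` list loaded earlier.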

### ✍ Activity 3: Which match had the largest difference in the score?

In [None]:
# Write your code here