# Reading and writing files, JSON

## Contents:

* File Input/Output
* Reading and writing JSON

## File Input/Output

Much of our input data comes from files stored on our computer (on the file system). While we work with these files in Python, the analysis happens in memory; to keep the results, we have to save them back to the file system. Reading and writing files is therefore an essential programming skill.

Until now, we have run code (almost instantly) in our Jupyter Notebooks, but imagine code that takes a couple of hours to run on a large collection of files. Then we want to save the result, either for further analysis or to make the files available to others (e.g. to share them as part of your research).

The following code opens a file in our file system, prints the first 10 lines and closes the file. Please note that this file must exist on your computer. If you have only downloaded this notebook, go back to the repository, download the file, and place it in the appropriate path (or change the path below). This path corresponds to the folder structure on your file system.

> **Please note:** The code below shows you how the `open()` function works. It's better to use a `with` block (see below), which does this opening and closing for you.

In [None]:
infile = open('data/adams-hhgttg.txt', 'r', encoding='utf-8')

for i, banana in enumerate(infile):
    if i == 10:
        break
    print(banana)

infile.close()

The key step here is the call to `open()`, which opens a file and returns a **file object** (hint: try printing the type of `infile`). It is commonly used with the following three parameters: the **name of the file** that we want to open, the **mode** and the **encoding**.

- **filename**: the name of the file to open, this corresponds to the full/relative path to the file from the notebook. 

- the **mode** in which we want to open a file: the most commonly used values are `r` for **reading** (the default, so you don't have to specify it explicitly), `w` for **writing** (overwriting existing files), and `a` for **appending**. (Note that [the documentation](https://docs.python.org/3/library/functions.html#open) lists further mode values that may be necessary in exceptional cases.)

- **encoding**: which mapping of string to code points (conversion to bytes) to use, more on this later. 
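As a small illustration of what the **encoding** parameter controls, the sketch below converts the same character to bytes under two different encodings. The character and the choice of Latin-1 as the second encoding are just assumptions for the example; they are not taken from the course data.

```python
text = "é"

# The same character becomes different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # one byte in Latin-1

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'

# Decoding with the wrong encoding produces garbled text.
print(utf8_bytes.decode("latin-1"))  # Ã©

# Decoding with the right encoding gives back the original string.
print(utf8_bytes.decode("utf-8"))    # é
```

This is why passing the encoding explicitly to `open()` is a good habit: the bytes on disk only turn back into the intended text if they are read with the encoding they were written with.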

>**IMPORTANT**: every opened file should be **closed** by calling its `close()` method before the end of the program; otherwise the file may be unavailable for later manipulation or to other programs.

There are other ways to read a text file, such as the methods `read()` and `readlines()`, which would simplify the code above to:

```python
infile = open('data/adams-hhgttg.txt', 'r', encoding='utf-8')
text = infile.readlines()
print(text[:10])
infile.close()
```

However, these methods **read the whole file at once**, thus creating capacity/efficiency problems when working with big corpora.

In the solution we adopt here the input file is read line by line, so that at any given moment **only one line of text** is loaded into memory. 
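To make the line-by-line approach concrete, here is a minimal sketch that counts lines without ever holding the whole file in memory. It writes its own small throwaway file first (the name `demo.txt` and its contents are made up for this example), so it runs without the course data.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")

with open(demo_path, "w", encoding="utf-8") as outfile:
    outfile.write("first line\n\nthird line\n")

n_lines = 0
n_non_empty = 0

with open(demo_path, encoding="utf-8") as infile:
    for line in infile:      # one line at a time, never the whole file
        n_lines += 1
        if line.strip():     # an "empty" line still contains "\n"
            n_non_empty += 1

print(n_lines, n_non_empty)  # 3 2
```

Looping over the file object directly is what keeps memory use low; calling `readlines()` first would build the full list of lines before the loop starts.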

You can see all file object methods, including examples, on this W3schools page: https://www.w3schools.com/python/python_ref_file.asp

### The with statement 

A `with` statement is used to wrap the execution of a block of code.

Using this construction to open files has three major advantages:

- there is no need to explicitly close the file (the file is automatically closed as soon as the nested code exits)
- the file is closed automatically even when unhandled errors cause the program to crash
- the code is much clearer (it is easy to see where in the code a file is opened)

Thus, you can make things a bit easier for yourself and forget about the explicit `.close()` method. The code above can be rewritten as follows:

In [None]:
with open('data/adams-hhgttg.txt', encoding='utf-8') as infile:  # The file is opened
    
    lines = infile.readlines()
    
# As soon as we exit the indented scope, the file is closed again 
# (and made available to other programs on your computer)
print(lines[:10])

The code in the indented with block is executed while the file is opened. It is automatically closed as the block is closed. 
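The claim that the file is closed even when an error occurs can be checked with the `closed` attribute of the file object. The sketch below is only an illustration: it creates a throwaway file (hypothetical name `demo.txt`) and deliberately triggers a `ValueError` inside the `with` block.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")

with open(demo_path, "w", encoding="utf-8") as outfile:
    outfile.write("42 is the answer\n")

try:
    with open(demo_path, encoding="utf-8") as infile:
        first_line = infile.readline()
        number = int(first_line)  # raises ValueError: not a plain number
except ValueError:
    print("Something went wrong while processing the file")

# The with block was left through an exception, yet the file is closed.
print(infile.closed)  # True
```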

### Quiz

Hint: you can call `.read()` on the file object.

* Write one function that takes a file path as argument and prints statistics about the file, giving:
    * The number of words (often called 'tokens')
    * The number of unique words (often called 'types')
    * The type:token ratio (i.e. unique words / words)
    * The 10 most frequent words, including their frequencies
* Write a normalization or cleaning function that takes a string as argument, that pre-processes this text and returns a normalized version, by removing/substituting:
    * Uppercase characters
    * Punctuation
* Call the normalization function inside the first function

Test the function on the filepath in `file_path` below. Compare the results from running the function with and without normalization.
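If the terms are new to you, the following tiny worked example (on a made-up sentence, not on the course data) shows how tokens, types and the type:token ratio relate to each other. It is a sketch of the arithmetic only, not a solution to the quiz.

```python
from collections import Counter

words = "the cat and the dog and the bird".split()

n_tokens = len(words)      # 8 words in total
n_types = len(set(words))  # 5 unique words: the, cat, and, dog, bird
ttr = n_types / n_tokens   # 5 / 8 = 0.625

print(n_tokens, n_types, ttr)
print(Counter(words).most_common(2))  # [('the', 3), ('and', 2)]
```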

In [None]:
import string
from collections import Counter


In [None]:
# Your code here
def get_file_statistics(file_path, normalization=False):
    
    with open(file_path, 'r', encoding='utf-8') as infile:
        text = infile.read()
        
    if normalization:
        text = normalize(text)
        
    words = text.split()
       
    n_words = len(words)
    n_unique = len(set(words))
        
    print("Number of words:", n_words)
    print("Number of unique words:", n_unique)
    print("TTR:", n_unique / n_words)
    
    counter = Counter(words)
    most_common_words = counter.most_common(10)
    
    print("Frequencies:")
    for word, frequency in most_common_words:
        print("\t", word, "(" + str(frequency) + ")")


def normalize(text):
    
    normalized_text = text.lower()
    
    for char in string.punctuation:
        normalized_text = normalized_text.replace(char, '')
    
    return normalized_text


In [None]:
file_path = 'data/adams-hhgttg.txt'

# your_function_name(file_path)
get_file_statistics(file_path, normalization=False)

print()

get_file_statistics(file_path, normalization=True)

---

## Writing files

Writing an output file in Python has a structure that is close to what we used in our reading examples above. The main differences are:

- the specification of the **mode** `w`
- the use of the method `write()` for each line of text

> **Warning!** Opening an _existing_ file in `w` mode will erase its contents!
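The difference between `w` and `a` is easiest to see by writing to the same file several times. The sketch below uses a throwaway file in a temporary folder (the name `modes.txt` is made up for the example) so that nothing in your own folders is overwritten.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "modes.txt")

with open(demo_path, "w", encoding="utf-8") as outfile:
    outfile.write("first\n")

# 'w' again: the previous contents are gone.
with open(demo_path, "w", encoding="utf-8") as outfile:
    outfile.write("second\n")

# 'a': new text is added to the end of the existing contents.
with open(demo_path, "a", encoding="utf-8") as outfile:
    outfile.write("third\n")

with open(demo_path, encoding="utf-8") as infile:
    contents = infile.read()

print(repr(contents))  # 'second\nthird\n'
```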

In [None]:
# The folder you wish to write the file to ('stuff' below) has to exist on the file system

with open('stuff/output-test-1.txt', 'w', encoding='utf-8') as outfile:
    
    outfile.write("My name is:")
    outfile.write("John")

When writing line by line, it's up to you to take care of the **newlines** by appending `\n` to each line. Unlike the `print()` function, the `write()` function has no standard line-end character.

In [None]:
with open('stuff/output-test-2.txt', 'w', encoding='utf-8') as outfile:
    
    outfile.write("My name is:\n")
    outfile.write("Alexander")
    
    
    outfile.write("ééèèüAæøå")


We can inspect the file we just created with the command line. The following is not Python, but a basic command line tool to print the contents of a file. At least on Mac and Linux, this works. Otherwise, just navigate to the file in your file explorer and open it.

> Prepending a `!` to a command executes a program on your computer. Use it with care and don't run such a cell in a notebook that you do not trust!

In [None]:
!cat stuff/output-test-2.txt

### Quiz

Instead of printing the statistics in the previous quiz, write them to a file. For instance, use the file path in `file_path` to write the file to. Copy your function from above, rename it and add the required code to it.


In [None]:
# Your code here

file_path = 'stuff/adams-hhgttg-statistics.txt'

# your_adapted_function_that_writes_statistics(file_path)


In [None]:
# Your code here
def get_file_statistics(file_path, target_file, normalization=False):
    
    with open(file_path, 'r', encoding='utf-8') as infile:
        text = infile.read()
        
    if normalization:
        text = normalize(text)
        
    words = text.split()
       
    n_words = len(words)
    n_unique = len(set(words))
        
    
    
    counter = Counter(words)
    most_common_words = counter.most_common(10)
    

        
    with open(target_file, 'w', encoding='utf-8') as outfile:
        
        outfile.write("Number of words: " + str(n_words))
        outfile.write('\n')
        
        outfile.write("Number of unique words: " + str(n_unique))
        outfile.write('\n')
        
        outfile.write("TTR: " + str(n_unique / n_words))
        outfile.write('\n')

        outfile.write("Frequencies:\n")  # End this line before listing the words
        for word, frequency in most_common_words:
            outfile.write("\t" + word + " (" + str(frequency) + ")")
            outfile.write('\n')

def normalize(text):
    
    normalized_text = text.lower()
    
    for char in string.punctuation:
        normalized_text = normalized_text.replace(char, '')
    
    return normalized_text


In [None]:
get_file_statistics('data/adams-hhgttg.txt', target_file=file_path)

Let's quickly check its contents:

In [None]:
!cat stuff/adams-hhgttg-statistics.txt

---

## Reading files from a folder

In [None]:
import os

In [None]:
# Write a function that reads through the folders and files in a directory. 
# Read through the data directory and all its contents.

def read_through_folder(path):
    """
    Read from all files in a given folder. 
    
    Args:
        path (str): Path to a folder
        
    Returns:
        dict: dictionary with filenames as keys and the first 100 characters of their contents as value
    """
    
    files = os.listdir(path)
    
    data = dict()
    
    for n, file in enumerate(files, 1):
        
        filepath = os.path.join(path, file)
        
        content = read_from_file(filepath)
        
        print(n, file)
        
        data[file] = content[:100]
        
    return data
    

def read_from_file(filepath):
    
    with open(filepath, 'r', encoding='utf-8') as infile:
        text = infile.read()
        
    return text



In [None]:
path = 'data/gutenberg-extension'

data = read_through_folder(path)

In [None]:
data

In [None]:
# Write a function that reads through the folders and files in a directory. 
# Read through the data directory and all its contents.

def read_through_folder(path):
    """
    Read from the first file in every subfolder of a given folder. 
    
    Args:
        path (str): Path to a folder
        
    Returns:
        dict: dictionary with folder names as keys and, as values, dictionaries 
            mapping a filename to the first 10 characters of its contents
    """
    
    data = dict()
    
    for root, dirs, files in os.walk(path):
        
        # Read from the folders here
        for folder in dirs:
            
            folderpath = os.path.join(root, folder)
            
            data[folder] = dict()
    
            # Then every file in that folder
            for file in os.listdir(folderpath):

                filepath = os.path.join(folderpath, file)
                
                if os.path.isdir(filepath):  # Some entries can be folders
                    continue

                # Read its contents and keep a short preview
                content = read_from_file(filepath)
                data[folder][file] = content[:10]
        
                break  # Stop after the first file in each folder
        
    return data
    

def read_from_file(filepath):
    """Give back the text from a file"""
    
    with open(filepath, 'r', encoding='utf-8') as infile:
        text = infile.read()
        
    return text



In [None]:
path = 'data'

data = read_through_folder(path)

for folder, value in data.items():
    
    for file, content in value.items():
        print(content)

In [None]:
print(data)

## Looping through folders and files

If you want to load multiple files from a folder without explicitly providing the path to each file, you can point to the folder instead. We can use the built-in `os` module to loop through a folder and load multiple files in memory.

In [None]:
import os  # You only have to do this once in your code. 
           # Always put this at the top of your file.

In [None]:
list(os.walk("data/gutenberg-extension"))

In [None]:
gutenberg_books = dict()  # Create an empty dictionary to store our data in

for root, dirs, files in os.walk("data/gutenberg-extension"):
    for file in files:
        
        if not file.endswith('.txt'):  # Why this?
            continue
        
        # You have to specify the full (relative) path, not only the file name.
        file_path = os.path.join(root, file)  
        
        with open(file_path, encoding='utf-8') as infile:
            gutenberg_books[file] = infile.read()

In [None]:
gutenberg_books.keys()

The `os.walk()` method is convenient if you are dealing with a combination of files and folders, no matter how deep the hierarchy goes (folders in folders etc.). A simpler function is `os.listdir()`.
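To see the difference between the two, the sketch below builds a tiny folder tree in a temporary location (the names `a.txt`, `sub` and `b.txt` are made up for the example) and lists it with both functions.

```python
import os
import tempfile

# Build a tiny folder tree: root/a.txt and root/sub/b.txt
root_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(root_dir, "sub"))

for relative_path in ["a.txt", os.path.join("sub", "b.txt")]:
    with open(os.path.join(root_dir, relative_path), "w", encoding="utf-8") as outfile:
        outfile.write("demo\n")

# os.listdir() only looks at the top level and mixes files and folders.
top_level = sorted(os.listdir(root_dir))
print(top_level)  # ['a.txt', 'sub']

# os.walk() descends into every subfolder.
walked_files = []
for folder, dirs, files in os.walk(root_dir):
    for file in files:
        walked_files.append(file)

print(sorted(walked_files))  # ['a.txt', 'b.txt']
```

Note that `os.listdir()` returns the subfolder `sub` alongside the file, which is one reason to check what each entry is (for instance by its extension) before trying to open it.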

In [None]:
os.listdir('data/gutenberg-extension/')

In [None]:
gutenberg_books = dict()  # Create an empty dictionary to store our data in

folder_path = "data/gutenberg-extension"

for file in os.listdir(folder_path):

    if not file.endswith('.txt'):  # Why this?
        continue
    
    file_path = os.path.join(folder_path, file)
    
    with open(file_path, encoding='utf-8') as infile:
        gutenberg_books[file] = infile.read()

In [None]:
gutenberg_books.keys()

The dictionary object now contains a lot of information: all the contents of all files. There's a chance that your browser/notebook will crash when calling the dictionary here. Instead, let's call a part of one of the books, the first 300 characters:
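A safer habit than displaying the whole dictionary is to print a short summary per entry. The sketch below uses a small made-up dictionary called `books` as a stand-in for a dictionary of file names and full texts; the names and texts are purely hypothetical.

```python
# Hypothetical stand-in for a dictionary of file name -> full text
books = {
    "book-one.txt": "It was a dark and stormy night. " * 1000,
    "book-two.txt": "Call me Ishmael. " * 500,
}

for name, text in books.items():
    # Show the size and a short preview, never the full text
    print(name, len(text), "characters:", repr(text[:30]))

# The sizes alone are often enough to check that loading went well
sizes = {name: len(text) for name, text in books.items()}
```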

In [None]:
print(gutenberg_books['doyle-sherlock.txt'][:300])

---

# Reading and writing data in JSON and CSV

We now know how we can read and write textual content to files on our file system. Two more structured and common formats to store data in are JSON and CSV. If you are not familiar with these, take a look at:

* JSON (https://www.w3schools.com/whatis/whatis_json.asp)
* CSV (https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/)

## JSON

The syntax of JSON is very similar to the syntax of `int`, `str`, `list` and `dict` data types in Python. 

The following data (excerpt) is taken from the data that feeds the Instagram page of the UvA (https://www.instagram.com/uva_amsterdam/). The API/service of Instagram returns web data in JSON that is used by your browser to show you a page with content. You can also find this when inspecting the source of the page. 

Take a JSON file (named `example.json`) that looks like this:
```json
{
    "biography": "Welcome to the UvA \u274c\u274c\u274c \nFind out more about our:\n\ud83c\udfdb campuses \ud83c\udf93 education \ud83d\udd0e research\nShare your \ud83d\udcf8 using: #uva_amsterdam\nQuestions? Contact us:",
    "blocked_by_viewer": false,
    "restricted_by_viewer": null,
    "country_block": false,
    "external_url": "https://linkin.bio/uva_amsterdam",
    "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fuva_amsterdam\u0026e=ATOBo7L11uPBpsMfd6-pFnoBRaF3T-6ovlD9Blc2q1LGUjnmyuGutPfuK-ib70Bt_YmGu6cDNCX1Y1lC\u0026s=1",
    "edge_followed_by": {
        "count": 42241
    },
    "fbid": "17841401222133463",
    "followed_by_viewer": false,
    "edge_follow": {
        "count": 362
    },
    "follows_viewer": false,
    "full_name": "UvA: University of Amsterdam",
    "id": "1501672737",
    "is_business_account": true,
    "is_joined_recently": false,
    "business_category_name": "Professional Services",
    "overall_category_name": null,
    "category_enum": "UNIVERSITY",
    "category_name": null,
    "profile_pic_url": "https://scontent-amt2-1.cdninstagram.com/v/t51.2885-19/s150x150/117066908_1128864954173821_2797787766361156925_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com\u0026_nc_ohc=PXsEzg-CKaUAX8dEtNL\u0026tp=1\u0026oh=86bb46d8006b77db2037955187e69de1\u0026oe=6056619F",
    "username": "uva_amsterdam",
    "connected_fb_page": null
}
```

Such a file can be loaded into Python as a dictionary:
```python
{
    'biography': 'Welcome to the UvA ❌❌❌ \nFind out more about our:\n🏛 campuses 🎓 education 🔎 research\nShare your 📸 using: #uva_amsterdam\nQuestions? Contact us:',
     'blocked_by_viewer': False,
     'restricted_by_viewer': None,
     'country_block': False,
     'external_url': 'https://linkin.bio/uva_amsterdam',
     'external_url_linkshimmed': 'https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fuva_amsterdam&e=ATOBo7L11uPBpsMfd6-pFnoBRaF3T-6ovlD9Blc2q1LGUjnmyuGutPfuK-ib70Bt_YmGu6cDNCX1Y1lC&s=1',
     'edge_followed_by': {'count': 42241},
     'fbid': '17841401222133463',
     'followed_by_viewer': False,
     'edge_follow': {'count': 362},
     'follows_viewer': False,
     'full_name': 'UvA: University of Amsterdam',
     'id': '1501672737',
     'is_business_account': True,
     'is_joined_recently': False,
     'business_category_name': 'Professional Services',
     'overall_category_name': None,
     'category_enum': 'UNIVERSITY',
     'category_name': None,
     'profile_pic_url': 'https://scontent-amt2-1.cdninstagram.com/v/t51.2885-19/s150x150/117066908_1128864954173821_2797787766361156925_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_ohc=PXsEzg-CKaUAX8dEtNL&tp=1&oh=86bb46d8006b77db2037955187e69de1&oe=6056619F',
     'username': 'uva_amsterdam',
     'connected_fb_page': None
}
```

The main differences between dictionaries in Python and the JSON file notation are:

* Python dictionaries exist in memory in Python, they are an abstract datatype. JSON is a data format and can be saved on your computer, or be transmitted as string (e.g. for a website request, sending data).
* Keys in JSON can only be of type string. This means that writing a Python dictionary with integers as keys will transform them to string. Reading back the file will therefore give you a Python dictionary with strings as keys.
* Non-ASCII characters can be written as escape sequences (e.g. `\u274c` for ❌), which is what Python's `json.dump()` does by default. The same goes for letters with diacritics (e.g. é, ê, ç, ñ). If all characters are escaped this way, you don't have to specify an encoding when opening json files.
* `True` and `False` are lowercased: `true` and `false`. `None` is `null`. 
* JSON only allows double quotes for its "strings". 
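Several of these differences can be observed in a quick round trip through a JSON string. The sketch below uses `json.dumps()` and `json.loads()`, the string-based counterparts of `json.dump()` and `json.load()`; the dictionary `record` is made up for the example.

```python
import json

record = {1: "one", "active": True, "nickname": None, "symbol": "❌"}

as_json = json.dumps(record)
print(as_json)
# {"1": "one", "active": true, "nickname": null, "symbol": "\u274c"}

restored = json.loads(as_json)
print(restored)
# {'1': 'one', 'active': True, 'nickname': None, 'symbol': '❌'}

# The integer key has become a string key
print(1 in restored, "1" in restored)  # False True

# ensure_ascii=False keeps non-ASCII characters as they are
print(json.dumps(record, ensure_ascii=False))
```

If you write JSON with `ensure_ascii=False`, it becomes important again to open the file with an explicit encoding such as `utf-8`.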

The built-in json module of Python needs to be imported first, to work with json files and notation. 

In [None]:
import json

Let's read a json file from our disk using `json.load()`. The file comes from the public API of the municipality of Amsterdam, which lets you look up information on houses by searching on street name and house number. See: https://api.data.amsterdam.nl/atlas/search/adres/. Most often, information from such APIs or 'REST-services' is given back in JSON.

In [None]:
with open('data/bg1.json') as jsonfile:
    data = json.load(jsonfile)

Then, we can inspect the loaded data as a Python dictionary:

In [None]:
print(type(data))
data

When we are only interested in the information on the building, we can take out that part to store it separately. This is the first dictionary element in the list that can be found under key `data['results']`. The rest of the information is feedback from the API, telling us that there is 1 hit. 

In [None]:
data_selection = data['results'][0]

# Delete all keys starting with an _underscore

for k in list(data_selection):
    if k.startswith('_'):
        del data_selection[k]

data_selection
# print(type(data_selection))

Then, save it back to a json file using `json.dump()`:

In [None]:
with open('stuff/bg1-selection.json', 'w') as outfile:
    json.dump(data_selection, outfile, indent=4)

### Quiz

* Modify the file statistics function you built previously once more, so that it returns a Python dictionary with these statistics.
* Write a function that uses the `os.walk()` or `os.listdir()` method to run the file statistics function over every file in a folder. Create a dictionary that takes the file name as key, and the returned statistics dictionary as value.
* Also add arguments for a `target_file_path`, and a `data` dictionary to that function. Use the `json.dump()` method to write the dictionary to the provided file path using a with statement.
* Inspect the file by opening it on your computer with a text editor of some sort. Find a way to make it 'pretty printed' (e.g. with _indents_).

In [None]:
# Your code here

source_folder = "data/gutenberg-extension"
target_file_path = "stuff/gutenberg-statistics.json"

def your_modified_statistics_function(file_path):
    # Your code
    
    return statistics_dict

def your_functions_here():
    return


In [None]:
# Your code here
def get_file_statistics(file_path, normalization=False):
    
    with open(file_path, 'r', encoding='utf-8') as infile:
        text = infile.read()
        
    if normalization:
        text = normalize(text)
        
    words = text.split()
       
    n_words = len(words)
    n_unique = len(set(words))   
    
    counter = Counter(words)
    most_common_words = counter.most_common(10)
    
    statistics = dict()
    
    statistics['n_words'] = n_words
    statistics['n_unique'] = n_unique
    statistics['TTR'] = n_unique / n_words
    statistics['MFW'] = [word for word, frequency in most_common_words]
        
    return statistics

def normalize(text):
    
    normalized_text = text.lower()
    
    for char in string.punctuation:
        normalized_text = normalized_text.replace(char, '')
    
    return normalized_text


In [None]:
def get_statistics_for_folder(folder, target_file):
    """Compute statistics for every file in a folder and write them to a JSON file"""
    
    statistics_files = dict()
    
    for f in os.listdir(folder):
        
        filepath = os.path.join(folder, f)
        
        stats_dict = get_file_statistics(filepath)
        
        statistics_files[f] = stats_dict
        
    with open(target_file, 'w') as jsonfile:
        json.dump(statistics_files, jsonfile, indent=4)

In [None]:
# get_file_statistics('data/adams-hhgttg.txt')
get_statistics_for_folder('data/gutenberg-extension/', 'stuff/gutenberg_statistics.json')

---

# Exercises

### Exercise 1 (previously Exercise 6 in Notebook 2)

Read the file `data/adams-hhgttg.txt` and:

- Count the number of lines in the file

- Count the number of non-empty lines

- Read each line of the input file, remove its newline character and write it to file `stuff/adams-output.txt`

- Compute the average number of alphanumeric characters per line

- Identify all the unique words used in the text (no duplicates!) and write them in a text file called `stuff/lexicon.txt` (one word per line)

In [None]:
# your code here

with open("stuff/lexicon.txt", "w", encoding="utf-8") as outfile:
    outfile.write("something")

### Exercise 2

TBD