# External modules

## Importing external modules

Most of the functionality we have used so far is simply built into the Python language itself. Often, however, you need to use external modules in your code to have access to more specific functionality that you do not want to code yourself. Many such modules are already available in the Python Standard Library, for a wide variety of tasks. There are also countless third-party providers of Python modules.

To use an external module in your code, you need to explicitly 'import' it. Consider for example the module `random` from the standard library, which contains functions for generating random numbers:

In [None]:
import random
print(random.randint(0, 10)) # randint takes two arguments: the lower and upper bound between which to choose an int

Note the syntax used: using the dot, we indicate that our machine should look for the `randint()` function inside the `random` module we just imported. You can import an entire module or import only a specific function in the module. We could also have imported the (single) function we needed as follows:

In [None]:
from random import randint
print(randint(0, 100))

In this case we wouldn't have to specify where our machine should find the `randint()` function. This can save on some typing if you have to use the function often in your code. You can also (temporarily) change the names of the functions you import, using the `as` syntax:

In [None]:
from random import randint as random_integer
print(random_integer(0, 100))

Imported libraries stay available throughout a session, so you only need to import them once in a notebook (although it does not hurt to import the same library more than once).
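The `as` syntax also works for entire modules, not just for single functions. The following is a minimal sketch: the alias `rnd` is an arbitrary name chosen for this illustration, and the call to `seed()` is only there to make the "random" result reproducible between runs.

```python
import random as rnd  # 'rnd' is an arbitrary alias chosen for this example

rnd.seed(42)  # fixing the seed makes the sequence of random numbers reproducible
number = rnd.randint(0, 10)  # both bounds are inclusive
print(number)
```

After this import, the module is only reachable under its alias (`rnd.randint`), unless you also imported it under its original name.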

## Installing external modules

In Anaconda Navigator, have a look at the Environments tab. It lists all the packages already installed in your current (base) environment, as well as the packages that are available to install directly.
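You can also check from within Python whether a module is available in your environment. The sketch below uses `importlib.util.find_spec()` from the standard library; the helper name `is_installed` and the second module name are made up for this illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))                          # part of the standard library
print(is_installed("a_module_that_does_not_exist"))  # hypothetical name
```

If a module turns out to be missing, it can usually be installed through Anaconda Navigator or from the command line with `conda install` or `pip install`.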

---

## Reading in files

Input for your programs often comes from files on your disk, such as 'corpora' (in linguistics, a 'corpus' is a large collection of digital text). Likewise, you often want output to be written back to files on your disk as well. Thus, reading and writing files is often an essential part of programming and, luckily for us, this is really simple in Python. The following example reads a file from disk in the default way:

In [None]:
# You probably need to adjust the path depending on where you put the data folder
f = open('data/Arnold__Literature_and_Dogma.txt', 'r', encoding='utf-8')
text = f.read()
f.close()
print(text)

So what exactly happens here? The `open()` function does not return the actual text that is saved in the text file. It only returns a so-called 'file object' from which we can read the content using the `.read()` method. We passed three arguments to the `open()` function:

 * the path to the file that you wish to open. A path specifies a location on disk, using folder and file names combined with forward slashes (Windows traditionally uses backward slashes, although Python also accepts forward slashes there). There are two kinds of paths (more on this below, when we discuss directories):
   * *relative paths* start from the location where the code is run from (in the case of a notebook: the location where the notebook is saved). They are shorter and more robust when the code is moved together with the data (so they do not change relative to each other). However, they are sensitive to where the code is run from.
   * *absolute paths* start from the root of the system. They work no matter where the code is run from, but break when you move the data or run it on a different computer.
 * the mode: `'r'` represents read mode, so that the file will be opened for reading (and e.g. not for writing).
 * the encoding: `encoding='utf-8'` tells Python how the characters in the file are stored as bytes. UTF-8 is a very common encoding for text files.

After having read the contents from the file, we can `close()` it. If you omit this, the file may stay open until Python cleans up the file object, at the latest when the program (or the notebook kernel) stops.

It is best practice, however, to open a file with a `with` statement, so that you cannot forget to close it.

In [None]:
with open('data/Arnold__Literature_and_Dogma.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text)
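To see that the `with` statement really closes the file, you can inspect the `closed` attribute of the file object. This is a small sketch that first writes a throwaway file to a temporary folder (writing files is covered further below), so that it does not depend on the course data; the file name `example.txt` is made up.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on the course data
path = os.path.join(tempfile.mkdtemp(), 'example.txt')  # hypothetical file name
with open(path, 'w', encoding='utf-8') as f:
    f.write('Some text')

with open(path, 'r', encoding='utf-8') as f:
    text = f.read()
    print(f.closed)  # the file is still open inside the block

print(f.closed)  # the file was closed automatically when the block ended
```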

Reading an entire file into one string is not always desirable, especially not with huge files. The following example reads up to the next newline every time, and so returns one line at a time.


In [None]:
with open('data/Arnold__Literature_and_Dogma.txt', 'r', encoding='utf-8') as f:
    for line in f: # Yes, we can do looping over a file object too!
        print(line)

As you can see, we now get double spacing (double newlines). This is because every line in our file ends with a newline character `"\n"`. The `print` function always adds a newline by default, so we print a total of two newline characters per line. As a reminder, this is how you can use newlines yourself:

In [None]:
s = "This is the first line.\nThis is the second line."
print(s)

There exists a similar character to encode 'tab' characters, namely `\t`. You can use this character to play around with the indentation of your (e.g. hierarchically structured) output:


In [None]:
s = "First line\n\t* Second line\n\t* Third line\n\t* Fourth line\nFifth line"
print(s)

If you are not happy with the double spacing above and wish to remove all preceding and trailing whitespace in a string (newlines, spaces, but also tabs), you can use the `strip()` function:

In [None]:
s = "   strip me!    \n"
print(s)
print(s.strip())
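Note that `strip()` also removes whitespace at the start of a string, which you may want to keep when indentation is meaningful. Two alternatives, sketched here on a made-up string, are `rstrip("\n")`, which only removes the trailing newline, and the `end` parameter of `print()`, which controls what is printed after the text.

```python
line = "    an indented line\n"

print(repr(line.strip()))       # whitespace removed on both sides
print(repr(line.rstrip("\n")))  # only the trailing newline removed
print(line, end="")             # print() adds nothing after the text
```

Here `repr()` is only used to make the remaining whitespace visible in the output.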

Now, try to adapt the code that read in the Arnold file and have your code print each line without the preceding and trailing whitespace! Have the annoying "double lines" disappeared in your output?

In [None]:
with open('data/Arnold__Literature_and_Dogma.txt', 'r', encoding='utf-8') as f:
    for line in f: # Yes, we can do looping over a file object too!
        print(line)

Rather than just printing, we can of course do whatever we want with this file's content. Let's count the number of lines (but note that a line does not necessarily correspond to a sentence).

In [None]:
count = 0
with open('data/Arnold__Literature_and_Dogma.txt', 'r', encoding='utf-8') as f:
    for line in f: # Yes, we can do looping over a file object too!
        count += 1
print(count)

#### DIY 1

Read the file `Arnold__Literature_and_Dogma.txt` and compute the average length (in characters) of the lines.

In [None]:
# insert your code here

line_lengths = []
with open('data/Arnold__Literature_and_Dogma.txt', 'r', encoding='utf-8') as f:
    for line in f:
        line_lengths.append(len(line))

sum(line_lengths)/len(line_lengths)

In [None]:
line_lengths[:10]

In [None]:
import numpy as np
arr = np.array(line_lengths)
arr

In [None]:
arr.mean()
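Keep in mind that `len(line)` also counts the newline character at the end of each line. If you want the average without it, you can strip the newline before measuring. The sketch below uses `io.StringIO` with a made-up text as a stand-in for a real file, so that it runs on its own.

```python
import io

# io.StringIO stands in for a real file here, so the example runs on its own
fake_file = io.StringIO("abc\nde\n\nfghi\n")

lengths = []
for line in fake_file:
    lengths.append(len(line.rstrip("\n")))  # do not count the newline itself

average = sum(lengths) / len(lengths)
print(lengths)
print(average)
```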

- - -

## Writing out files

Now that we mastered the art of reading files, let's move on to writing files, which follows a similar logic:

In [None]:
with open('testoutput.txt', 'w', encoding='utf-8') as f:
    f.write("Hello world!")

In this code block, we have automatically created a new file called `testoutput.txt`. We then wrote a single line to this file, after which the `with` statement closed it for us. Note that the `w` is a crucial difference: if you had used `r` here, Python would have opened the file in 'read-only' mode and you wouldn't have been able to write to it.

Open the created file in any text editor (Visual Studio Code for instance) to check that the output worked properly.

If you want your data to be written on multiple lines, you need to take care to explicitly encode the newlines. Instead of:
    

In [None]:
with open('testoutput.txt', 'w', encoding='utf-8') as f:
    f.write("Hello world on the first line!")
    f.write("Hello world on the second line!")

You need to write:

In [None]:
with open('testoutput.txt', 'w', encoding='utf-8') as f:
    f.write("Hello world on the first line!\n")
    f.write("Hello world on the second line!")

Otherwise your file would have `Hello world on the first line!Hello world on the second line!` in it, i.e. without a newline in between.

Besides 'read-mode' and 'write-mode' when dealing with text files, there is also the 'append-mode' in Python. Watch out: in 'write-mode', you will always *overwrite* the existing content of the file. However, if you've opened a file in 'append-mode', everything you write to the file will be added at the end of the file, without deleting anything of the existing content in the file. In order to enable the append mode, you need to specify `'a'` as your second parameter when you open files.
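The difference between the two modes can be sketched as follows. The example works on a throwaway file in a temporary folder (the name `append_demo.txt` is made up), so it does not touch your own files.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'append_demo.txt')  # hypothetical file name

with open(path, 'w', encoding='utf-8') as f:  # 'w': start from an empty file
    f.write("First line\n")

with open(path, 'a', encoding='utf-8') as f:  # 'a': add to the end of the file
    f.write("Second line\n")

with open(path, 'r', encoding='utf-8') as f:
    content = f.read()
print(content)
```

Had the second block used `'w'` instead of `'a'`, the file would only contain the second line.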

In [None]:
# append one more sentence to the file 'testoutput.txt'

#### DIY 2

Read the file `austen-emma-excerpt-tokenised.txt`, and write to a file `data/words.txt` all lowercased tokens (= words and punctuation marks) occurring in this text (without duplicates!), alphabetically ordered, one word per line. That way, you are really creating a lexicon or word list of the text. (Tip: you should use `set()` in this exercise!)

In [None]:
# insert your code here
with open('data/austen-emma-excerpt-tokenised.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    all_tokens = text.lower().split()
    unique_tokens = list(set(all_tokens))
    unique_tokens = sorted(unique_tokens)
    
with open('data/words.txt', 'w', encoding='utf-8') as f:
    for token in unique_tokens:
        f.write(token + '\n')
        #f.write('\n')

In [None]:
l = [1,2,1,4]
l

In [None]:
set(l)

In [None]:
list(set(l))

Check your output by viewing the `words.txt` file in a text editor such as Visual Studio Code (Windows users: do not use Notepad!)

## JSON writing and reading
JSON (JavaScript Object Notation) is a data format that can store the most common built-in Python types (e.g. dict, list, str, int, float). JSON files are especially convenient for quickly writing and reading dictionaries and lists. It is also a data format of choice for web applications, as it is well suited to transmitting data.

![json-logo.png](attachment:json-logo.png)

More importantly: JSON is human-readable! This means that it is easy for a human to open a JSON file and read its contents.

The `json` library can parse JSON from strings or files. The library parses JSON into a Python dictionary or list. It can also convert Python dictionaries or lists into JSON strings.
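For strings, the library offers `dumps()` and `loads()` (with an `s` for 'string'), which work like their file-based counterparts `dump()` and `load()`. A minimal sketch with made-up data:

```python
import json

person = {"name": "Frank", "age": 39}

as_text = json.dumps(person)  # Python object -> JSON string
print(as_text)
print(type(as_text))

back = json.loads(as_text)    # JSON string -> Python object
print(back == person)
```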

### Writing Python to JSON
Writing JSON means converting a Python object into a JSON file.
As mentioned above, the `json` library serializes a Python dictionary or list into a JSON string and writes it to the file.
To do that, we use the `dump()` function (dump a Python object to an open file), as follows:

In [None]:
import json

data = {"name": "Frank", "age": 39, "interests": ["Long walks on the beach", "Knitting"]}
with open('test.json', 'w', encoding='utf-8') as f:
    # write the data to file_path as json
    json.dump(data, f)

### Reading JSON to Python
Now that we have written some data as a json file it would be nice if we could read it back to Python. This is done with the `load` function from the json module.

In [None]:
with open('test.json','r', encoding='utf-8') as f:
    data = json.load(f)
print(data)

Now you can write and read JSON! It is important to note at this point that JSON cannot store all types of Python objects, but only the following types:
- lists []
- dictionaries {}
- booleans: True, False
- numbers: float, int
- character strings: "string"
- None.

Thus, any other types need to be converted in order to be stored in JSON.
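As an illustration of such a conversion, the sketch below stores a set (with made-up tokens) by turning it into a sorted list first, and turns it back into a set after loading. It also shows that tuples are accepted but come back as lists.

```python
import json

tokens = {"emma", "woodhouse", "handsome"}  # a set cannot be stored directly

try:
    json.dumps(tokens)
except TypeError as error:
    print("Not possible:", error)

as_text = json.dumps(sorted(tokens))  # convert to a (sorted) list first
restored = set(json.loads(as_text))   # and back to a set after loading
print(restored == tokens)

print(json.loads(json.dumps((1, 2))))  # a tuple is written as a JSON list
```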

---

## Working with directories

It's time to scale up our operations to not just working with one file, but with collections of files! After all, that is one of the reasons why programming is so handy: it makes it easy to do repetitive tasks fast. Here, we will work with information on architectural treatises extracted from http://architectura.cesr.univ-tours.fr/

![directory.jpg](attachment:directory.jpg)

In the notebooks we use, there is a convenient way to quickly inspect the contents of a folder using the `ls` command (note that this is not a Python command, just something for the notebooks). Our data are contained under the general `architectura_treaties` folder:

In [None]:
ls "data/architectura_treaties"

As you can see, this folder holds a number of `json` files, ending in the `.json` extension. Let us open a random file:

In [None]:
with open('data/architectura_treaties/book_019.json', 'r', encoding='utf-8') as f:
    data = f.read()
print(data)

Make sure that you do not pass the `'w'` parameter ("write") to `open()` instead of `'r'` ("read"), since this would overwrite and thus erase the existing file. Note that the string returned by `f.read()`, which we assigned to the variable `data`, is the serialized version of the data as text, not the actual data. To obtain the actual Python objects, the text still has to be parsed, for instance with `json.load()` as shown earlier.

In this chapter we would like to work with all the files in the `architectura_treaties` directory. This is where loops come in handy of course, since what we really would like to do, is iterate over the contents of the directory. Accessing these contents in Python is easy, but requires importing some extra functionality. In this case, we need to import the `os` module, which contains all functionality related to the OS, the *operating system* of your machine, such as directory information:  

In [None]:
import os

Using the dot-syntax (`os.xxx`), we can now access all functions that come with this module, such as `listdir()`, which returns a list of the items contained in a given directory:

In [None]:
filenames = os.listdir('data/architectura_treaties/')
print(filenames[:20])

The function `os.listdir()` returns a list of strings, representing the filenames contained under a directory. It has the same functionality as the `ls` command in a notebook.

#### DIY 3

A simple question: using the output of `listdir()`, can you tell how many files are in the directory?

In [None]:
# your code goes here


With `os.listdir()`, you need to make sure that you pass the correct path to an existing directory:  

In [None]:
os.listdir('data/non_existing_directory')
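The call above stops with a `FileNotFoundError`. If you would rather have your program continue, one option is to catch that error, as in this minimal sketch (which assumes that the folder indeed does not exist on your machine):

```python
import os

try:
    filenames = os.listdir('data/non_existing_directory')
except FileNotFoundError:
    filenames = []  # fall back to an empty list
    print("That directory does not exist")
print(filenames)
```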

It might therefore be convenient to check whether a directory actually exists in a given location, using the `os.path` submodule, which defines a range of interesting functions for working with paths:

In [None]:
print(os.path.isdir('data/architectura_treaties'))
print(os.path.isdir('non_existing_directory'))

The second directory, naturally, does not exist and `isdir()` evaluates to `False` in this case. Creating a new (and thus empty) directory is also easy using `os`:

In [None]:
os.mkdir('my_own_dir')

We can see that it lives in the present working directory now, by typing `ls` again:

In [None]:
ls

Or we use Python:

In [None]:
print(os.path.isdir('my_own_dir'))
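Running `os.mkdir()` a second time on the same name raises a `FileExistsError`, which easily happens when you re-run a notebook cell. The sketch below shows this in a throwaway temporary folder, together with `os.makedirs()`, which can also create missing parent directories and, with `exist_ok=True`, accepts a directory that is already there. The folder names are made up for the illustration.

```python
import os
import tempfile

base = tempfile.mkdtemp()  # throwaway location for this example
new_dir = os.path.join(base, 'my_own_dir')

os.mkdir(new_dir)
try:
    os.mkdir(new_dir)  # second attempt: the directory already exists
except FileExistsError:
    print("The directory already exists")

# os.makedirs() also creates missing parent directories, and with
# exist_ok=True it does not complain if the directory is already there
nested = os.path.join(new_dir, 'sub', 'subsub')
os.makedirs(nested, exist_ok=True)
os.makedirs(nested, exist_ok=True)
print(os.path.isdir(nested))
```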

Removing directories is also easy, but PLEASE watch out, sometimes it is too easy: if you remove a wrong directory in Python, it will be gone forever... Unlike other applications, Python does not keep a copy of it in your Trash and it does not have a Ctrl-Z button. Please watch out with what you do, since with great power comes great responsibility! Removing the entire directory which we just created can be done as follows:

In [None]:
import shutil
shutil.rmtree('my_own_dir')

And lo and behold: the directory has disappeared again:

In [None]:
print(os.path.isdir('my_own_dir'))

Here, we use the `rmtree()` function to remove the entire directory in a *recursive* way: even if the directory isn't empty and contains files and subfolders, we will remove all of them. The `os` module also comes with an `rmdir()` function, but this will not allow you to remove a directory which is not empty, as becomes clear in the `OSError` raised below:

In [None]:
os.rmdir('data/architectura_treaties')

The folder contains things and therefore cannot be removed using this function. There are, of course, also ways to remove individual files or check whether they exist:

In [None]:

my_dirname = 'data/my_own_dir'
os.mkdir(my_dirname)

my_filename = my_dirname + '/my_file.txt'
with open(my_filename, 'w', encoding='utf-8') as f:
    f.write('Content')
    
print(os.path.exists(my_filename))
os.remove(my_filename)
print(os.path.exists(my_filename))

Here, we created a directory, wrote a new file to it (`my_file.txt`), and removed the file again. Using `os.path.exists()` we monitored at which point the file existed. Finally, the `shutil` module also ships with a useful `copyfile()` function which allows you to copy files from one location to another, possibly with another name. To copy `book_005.json` to the present directory, for instance, we could do:

In [None]:
shutil.copyfile('data/architectura_treaties/book_005.json', 'book_005_copy.json')

Indeed, we have added an exact copy of `book_005.json` to our present working directory:

In [None]:
ls

We can safely remove it again:

In [None]:
os.remove('book_005_copy.json')

### Paths

The paths we have used so far are 'relative' paths, in the sense that they are relative to the place on our machine from which we execute our Python code. Absolute paths can also be retrieved and will differ on each computer, because they typically include user names etc:

In [None]:
os.path.abspath('data/architectura_treaties/book_001.json')

While absolute paths are longer to type, they have the advantage that they can be used anywhere on your computer (i.e. irrespective of where you run your code from). Paths can be tricky. Suppose that we would like to open one of our filenames:

In [None]:
filenames = os.listdir('data/architectura_treaties')
random_filename = filenames[9]
with open(random_filename, 'r', encoding='utf-8') as f:
    text = f.read()
print(text)

Python throws a `FileNotFoundError`, complaining that the file we wish to open does not exist. This situation stems from the fact that `os.listdir()` only returns the base name of a given file, and not an entire (absolute or relative) path to it. To properly access the file, we must therefore not forget to include the rest of the path again: 

In [None]:
filenames = os.listdir('data/architectura_treaties')
random_filename = filenames[9]
with open('data/architectura_treaties/'+ random_filename, 'r', encoding='utf-8') as f:
    text = f.read()
print(text)

Instead of building the path by string concatenation, it is more robust to use the `os.path.join()` function. It takes care of missing or double slashes, backslashes if you are on Windows etc. This code does the same as the above:

In [None]:
filenames = os.listdir('data/architectura_treaties')
random_filename = filenames[9]
with open(os.path.join('data', 'architectura_treaties', random_filename), 'r', encoding='utf-8') as f:
    text = f.read()
print(text)
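To get a feeling for what `os.path.join()` produces, it helps to print a few results. The following sketch only manipulates path strings and does not touch the disk, so the files do not need to exist; `os.path.dirname()` and `os.path.splitext()` are two further functions from the same submodule.

```python
import os

path = os.path.join('data', 'architectura_treaties', 'book_001.json')
print(path)  # uses the separator of your operating system

print(os.path.join('data/', 'architectura_treaties'))  # no double slash

print(os.path.dirname(path))   # everything except the last component
print(os.path.basename(path))  # only the last component
print(os.path.splitext(path))  # (path without extension, extension)
```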

Apart from `os.listdir()`, there are a number of other common ways to obtain directory listings in Python. Using the `glob` module for instance, we can easily access the full relative path leading to each of our treatise files:

In [None]:
import glob
filenames = glob.glob('data/architectura_treaties/*')
print(filenames[:10])

The asterisk (`*`) in the argument passed to `glob.glob()` is worth noting here. It is a wildcard which matches any series of characters (i.e. the filenames under `architectura_treaties`), much like the wildcards used on the command line. Note that it behaves differently from the asterisk in regular expressions, where it is a quantifier. When we exploit this wildcard syntax, `glob.glob()` offers another distinct advantage: we can use it to easily filter out filenames which we are not interested in:

In [None]:
filenames = glob.glob('data/architectura_treaties/*.json')
print(filenames[:10])

The command in this code block only returns filenames that end in `".json"`. This is useful when we would like to ignore other sorts of junk files etc. that might be present in a directory. To replicate similar behaviour with `os.listdir()`, we would have needed a typical `for` loop, such as:

In [None]:
filenames = []
for fn in os.listdir('data/architectura_treaties'):
    if fn.endswith('.json'):
        filenames.append(fn)
print(filenames[:10])
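Note that the loop above collects base names only, whereas `glob.glob()` returns paths that include the folder. Neither function guarantees a particular order. The sketch below builds full, sorted paths with `os.listdir()`; it works on a throwaway folder with made-up file names so that it runs on its own.

```python
import os
import tempfile

# Build a small throwaway directory with mixed file types (hypothetical names)
folder = tempfile.mkdtemp()
for name in ['book_002.json', 'book_001.json', 'notes.txt']:
    with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
        f.write('')

# Keep only the .json files, prepend the folder, and sort the result
json_paths = sorted(
    os.path.join(folder, fn)
    for fn in os.listdir(folder)
    if fn.endswith('.json')
)
print([os.path.basename(p) for p in json_paths])
```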

However, when using `glob.glob()`, you might sometimes want to be able to extract a file's base name again. There are several solutions to this:

In [None]:
filenames = glob.glob('data/architectura_treaties/*.json')
fn = filenames[10]
print(fn)

# simple string splitting:
print(fn.split('/')[-1])

# using os.sep:
print(fn.split(os.sep)[-1])

# using os.path:
print(os.path.basename(fn))

Both `os.sep` and `os.path.basename` have the advantage that they know what separator is used for paths in the operating system, so you don't need to explicitly code it like in the first solution. Separators differ between Windows (backslash) and Mac/Linux (forward slash).

Finally, sometimes, you might be interested in all the subdirectories of a particular directory (and all the subdirectories of these subdirectories etc.). Parsing such deep directory structures can be tricky, especially if you do not know how deep a directory tree might run. You could of course try stacking multiple loops using `os.listdir()`, but a more convenient way is `os.walk()`:

In [None]:
for root, directories, filenames in os.walk("data"):
    print(root, "contains the following directories and files:")
    print(directories)
    print(filenames)
    print() # For an empty line between the blocks
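Like `os.listdir()`, `os.walk()` yields base names, so to open a file you have to join it with `root` again. The following sketch collects the full paths of all `.json` files in a tree; it builds its own throwaway directory tree with made-up names, so it does not depend on the course data.

```python
import os
import tempfile

# Throwaway directory tree (hypothetical names) so the example runs anywhere
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, 'a', 'b'))
for relative in [('one.json',), ('a', 'two.json'), ('a', 'b', 'three.txt')]:
    with open(os.path.join(top, *relative), 'w', encoding='utf-8') as f:
        f.write('')

json_files = []
for root, directories, filenames in os.walk(top):
    for fn in filenames:
        if fn.endswith('.json'):
            json_files.append(os.path.join(root, fn))  # root + base name = usable path

print(sorted(os.path.basename(p) for p in json_files))
```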

As you can see, `os.walk()` allows you to efficiently loop over the entire tree. As always, don't forget that help is right around the corner in your notebooks. Using `help()`, you can quickly access the documentation of modules and their functions etc. (but only after you have imported the modules first!). If you want help on a function, pass the function itself, without parentheses: write `help(os.walk)`, not `help(os.walk())`.

In [None]:
help(os.walk)

If you want to access a folder that is not inside the current working directory, you must use `'..'` for every step upwards in the hierarchy. Open your file explorer and look for a folder which is at the same level as your 'Python course' folder.

In [None]:
for file in os.listdir(os.path.join('..', '01_Variables')): #fill in the name of a folder which is on the same level as your Python folder
    print(file)

## Exercise

Load all the JSON files from the `architectura_treaties` folder (parsing the JSON content of each of them) and put them all in a list. We will need this list in the next chapter.

In [None]:
# your code
complete_data = []

## Hard and optional Exercise

Now for something much more difficult: load the `xenotheka.json` file and look at the loaded output. It should be a list of dictionaries.

We want to save every entry as an independent json file in a:
- directory named based on the author of the book.
- filename based on the title

Note that the author and title are merged in a single field ('title') in the loaded entry, so they will need to be separated first.

1. Separate the title into the author name and the actual title
2. Create the author directory in the OUTPUT_FOLDER if necessary
3. Save the entry in a json in the author folder

In [None]:
OUTPUT_FOLDER = 'xenotheka_output'
os.makedirs(OUTPUT_FOLDER, exist_ok=True)  # create the output folder if it does not exist yet
#your code