# Resit Assignment part A

**Deadline: Friday, November 13, 2020 before 17:00** 

- Please name your files: 
    * ASSIGNMENT-RESIT-A.ipynb
    * utils.py (from part B)
    * raw_text_to_coll.py (from part B)

- Please zip your files, name the zip file RESIT-ASSIGNMENT.zip, and submit it on Canvas (Resit Assignment).
- If you have **questions** about this topic:
    - [in the week of the 2nd of November] please contact **Pia (pia.sommerauer@vu.nl)**
    - [in the week of the 9th of November] please contact **Marten (m.c.postma@vu.nl)**
    
Questions and answers will be collected in [this Q&A document](https://docs.google.com/document/d/1Yf2lE6HdApz4wSgNpxWL_nnVcXED1YNW8Rg__wCKcvs/edit?usp=sharing), 
so please check if your question has already been answered.

All of the covered chapters are important to this assignment. However, please pay special attention to:
- Chapter 10 - Dictionaries
- Chapter 11 - Functions and scope
- Chapter 14 - Reading and writing text files
- Chapter 15 - Off to analyzing text
- Chapter 17 - Data Formats II (JSON)
- Chapter 19 - More about Natural Language Processing Tools (spaCy)


In this assignment:
* we are going to process the texts in `../Data/Dreams/*.txt`
* for each file, we are going to determine:
    * the number of characters
    * the number of sentences
    * the number of words
    * the longest word
    * the longest sentence

## Note
This notebook should be placed in the same folder as the other Assignments!

## Loading spaCy

Please make sure that spaCy is installed on your computer.

In [None]:
import spacy

Please make sure you can load the English spaCy model:

In [None]:
nlp = spacy.load('en_core_web_sm')

## Exercise 1: get paths

Define a function called **get_paths** that has the following parameter: 
* **input_folder**: a string

The function:
* stores all paths to .txt files in the *input_folder* in a list
* returns a list of strings, i.e., each string is a file path

In [None]:
# your code here

Please test your function using the following function call:

In [None]:
paths = get_paths(input_folder='../Data/Dreams')
print(paths)
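
As a hint, collecting file paths by extension can be sketched with the standard library module `glob`. This is only an illustration on a temporary folder, not the required solution; the helper name `list_txt_paths` and the demo file names are made up for this example.

```python
import glob
import os
import tempfile


def list_txt_paths(folder):
    """Hypothetical helper: return a sorted list of paths to the .txt files in folder."""
    pattern = os.path.join(folder, '*.txt')  # e.g., 'some_folder/*.txt'
    return sorted(glob.glob(pattern))


# demo on a temporary folder, so the sketch does not depend on ../Data/Dreams
with tempfile.TemporaryDirectory() as tmp_folder:
    for name in ['a.txt', 'b.txt', 'notes.md']:
        with open(os.path.join(tmp_folder, name), 'w', encoding='utf-8') as outfile:
            outfile.write('some text')

    demo_paths = list_txt_paths(tmp_folder)
    demo_names = [os.path.basename(path) for path in demo_paths]

print(demo_names)  # only the two .txt files, notes.md is skipped
```

`glob.glob` returns paths in arbitrary order, which is why the sketch sorts them.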

## Exercise 2: load text

Define a function called **load_text** that has the following parameter: 
* **txt_path**: a string


The function:
* opens the **txt_path** for reading and loads the contents of the file as a string
* returns a string, i.e., the content of the file

In [None]:
# your code here
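
As a reminder of Chapter 14, reading a whole file into one string might look like the sketch below. The helper name `read_file_as_string`, the demo file, and the `utf-8` encoding are assumptions made for this illustration.

```python
import os
import tempfile


def read_file_as_string(path):
    """Hypothetical helper: return the full content of a text file as one string."""
    # the with-statement closes the file automatically
    # encoding='utf-8' is an assumption about how the file was saved
    with open(path, encoding='utf-8') as infile:
        return infile.read()


# demo on a temporary file, so the sketch does not depend on ../Data/Dreams
with tempfile.TemporaryDirectory() as tmp_folder:
    demo_path = os.path.join(tmp_folder, 'demo.txt')
    with open(demo_path, 'w', encoding='utf-8') as outfile:
        outfile.write('I dreamed of a house.\nIt was blue.')

    demo_content = read_file_as_string(demo_path)

print(type(demo_content))
print(demo_content)
```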

## Exercise 3: return the longest

Define a function called **return_the_longest** that has the following parameter: 
* **list_of_strings**: a list of strings


The function:
* returns the string with the highest number of characters. If multiple strings have the same length, return one of them.

In [None]:
def return_the_longest(list_of_strings):
    """
    given a list of strings, return the longest string
    if multiple strings have the same length, return one of them.
    
    :param list list_of_strings: a list of strings
    
    """

Please test your function by running the following cell:

In [None]:
a_list_of_strings = ["this", "is", "a", "sentence"]
longest_string = return_the_longest(a_list_of_strings)

error_message = f'the longest string should be "sentence", you provided {longest_string}'
assert longest_string == 'sentence', error_message
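
One possible way to think about ties is shown below, using the built-in `max` with `key=len`. This is a sketch of one approach, not the only accepted one; the list `words` is made up for the illustration.

```python
words = ['dream', 'is', 'house', 'a']

# key=len tells max to compare the strings by their number of characters
longest = max(words, key=len)

# 'dream' and 'house' both have 5 characters;
# max returns the first maximal item it encounters
print(longest)  # dream
```

Note that `max` raises a `ValueError` when the list is empty, so a loop-based solution may want to handle that case explicitly.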

## Exercise 4: extract statistics
We are going to use spaCy to extract statistics from Vickie's dreams! Here are a few tips on how to use spaCy:

#### tip 1: process text with spaCy

In [None]:
a_text = 'this is one sentence. this is another.'
doc = nlp(a_text)

#### tip 2: the number of characters is the length of the document's text

In [None]:
num_chars = len(doc.text)
print(num_chars)

#### tip 3: loop through the sentences of a document

In [None]:
for sent in doc.sents:
    sent = sent.text
    print(sent)

#### tip 4: loop through the words of a document

In [None]:
for token in doc:
    word = token.text
    print(word)

Define a function called **extract_statistics** that has the following parameters: 
* **nlp**: the result of calling spacy.load('en_core_web_sm')
* **txt_path**: path to a txt file, e.g., '../Data/Dreams/vickie8.txt'

The function:
* loads the content of the file using the function **load_text**
* processes the content of the file using **nlp(content)** (see tip 1 of this exercise)

The function returns a dictionary with five keys:
* **num_sents**: the number of sentences in the document
* **num_chars**: the number of characters in the document
* **num_tokens**: the number of words (i.e., spaCy tokens, see tip 4) in the document
* **longest_sent**: the longest sentence in the document
    * Please make a list with all the sentences and call the function **return_the_longest** to retrieve the longest sentence
* **longest_word**: the longest word in the document
    * Please make a list with all the words and call the function **return_the_longest** to retrieve the longest word
    
Test the function on one of the files from Vickie's dreams.

In [None]:
def extract_statistics(nlp, txt_path):
    """
    given a txt_path
    -use the load_text function to load the text
    -process the text using spaCy
    
    :param nlp: loaded spaCy model (result of calling spacy.load('en_core_web_sm'))
    :param str txt_path: path to txt file
    
    :rtype: dict
    :return: a dictionary with the following keys:
    -"num_sents" : the number of sentences
    -"num_chars" : the number of characters 
    -"num_tokens" : the number of words 
    -"longest_sent" : the longest sentence
    -"longest_word" : the longest word
    """

In [None]:
stats = extract_statistics(nlp, txt_path=paths[0])
stats

## Exercise 5: process all txt files

#### tip 1: how to obtain the basename of a file

In [None]:
import os

In [None]:
basename = os.path.basename('../Data/Dreams/vickie1.txt')[:-4]
print(basename)
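
The slice `[:-4]` works here because the extension `.txt` is exactly four characters long. A sketch of an alternative that does not depend on the length of the extension uses `os.path.splitext`; the helper name `basename_without_extension` is made up for this illustration.

```python
import os


def basename_without_extension(path):
    """Hypothetical helper: 'folder/name.ext' -> 'name'."""
    filename = os.path.basename(path)                 # 'vickie1.txt'
    stem, extension = os.path.splitext(filename)      # ('vickie1', '.txt')
    return stem


print(basename_without_extension('../Data/Dreams/vickie1.txt'))  # vickie1
print(basename_without_extension('vickie1.json'))                # vickie1
```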

Define a function called **process_all_txt_files** that has the following parameters: 
* **nlp**: the result of calling spacy.load('en_core_web_sm')
* **input_folder**: a string (we will test it using '../Data/Dreams')

The function:
* obtains a list of txt paths using the function **get_paths** with **input_folder** as an argument
* loops through the txt paths one by one
* for each iteration, the **extract_statistics** function is called with **nlp** and the current **txt_path** as arguments

The function returns a dictionary:
* the keys are the basenames of the txt files (see tip 1 of this exercise)
* the values are the output of calling the function **extract_statistics** for a specific file

Test your function using '../Data/Dreams' as a value for the parameter *input_folder*.

In [None]:
def process_all_txt_files(nlp, input_folder):
    """
    given an input folder
    -obtain the txt paths using the get_paths function
    -process each path with the extract_statistics function
    
    :param nlp: loaded spaCy model (result of calling spacy.load('en_core_web_sm'))
    :param str input_folder: path to a folder containing txt files
    
    :rtype: dict
    :return: dictionary mapping:
    -basename -> output of extract_statistics function
    """

In [None]:
basename_to_stats = process_all_txt_files(nlp, input_folder='../Data/Dreams')
basename_to_stats

## Exercise 6: write to disk

In this exercise, you are going to write the results to disk.
Please loop through **basename_to_stats** and create one JSON file for each dream.

* the path is f'{basename}.json', i.e., 'vickie1.json', 'vickie2.json', etc. (please write them to the same folder as this notebook)
* the content of each JSON file is the corresponding value of **basename_to_stats**, i.e., the statistics dictionary of that dream

In [None]:
import json

In [None]:
for basename, stats in basename_to_stats.items():
    pass
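
As a reminder of Chapter 17, writing one dictionary to a JSON file and reading it back might look like the sketch below. The dictionary `demo_stats` contains toy values, and the file name `demo.json` and the temporary folder are assumptions made so that the illustration does not write into the notebook folder.

```python
import json
import os
import tempfile

# toy values for illustration only, not actual results for any dream
demo_stats = {
    'num_sents': 2,
    'num_chars': 38,
    'longest_word': 'sentence',
}

with tempfile.TemporaryDirectory() as tmp_folder:
    json_path = os.path.join(tmp_folder, 'demo.json')

    # json.dump writes the dictionary to an opened file object
    with open(json_path, 'w', encoding='utf-8') as outfile:
        json.dump(demo_stats, outfile, indent=4)

    # reading the file back to check that nothing was lost
    with open(json_path, encoding='utf-8') as infile:
        reloaded = json.load(infile)

print(reloaded == demo_stats)  # True
```

The `indent=4` argument is optional; it only makes the resulting file easier to read.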