# Assignment 3a

## Due: Friday, September 30, 2022 at 5pm (submission via Canvas)


* Please submit your assignment (notebooks of parts 3a and 3b + Python modules) as **a single .zip file** using Canvas (Assignments --> Assignment 3). Please put the notebooks for Assignment 3a and 3b as well as the Python modules (files ending with .py) in one folder, which you call ASSIGNMENT_3_FIRSTNAME_LASTNAME. Please zip this folder and upload it as your submission.

* Please name your zip file with the following naming convention: ASSIGNMENT_3_FIRSTNAME_LASTNAME.zip

**IMPORTANT NOTE**:
* The students who follow the Bachelor version of this course, i.e., the course Introduction to Python for Humanities and Social Sciences (L_AABAALG075) as part of the minor Digital Humanities, do **not have to do Exercises 3 and 4 of Assignment 3b**
* The other students, i.e., those who follow the Master version of this course, which is Programming in Python for Text Analysis (L_AAMPLIN021), are required to **do Exercises 3 and 4 of Assignment 3b**

If you have **questions** about this topic, please contact us **(cltl.python.course@gmail.com)**. Questions and answers will be collected on Piazza, so please check if your question has already been answered first.


In this block, we covered a lot of ground:

* Chapter 12 - Importing external modules 
* Chapter 13 - Working with Python scripts
* Chapter 14 - Reading and writing text files
* Chapter 15 - Off to analyzing text 


In this assignment, you will first complete a number of small exercises about each chapter to make sure you are familiar with the most important concepts. In the second part of the assignment, you will apply your newly acquired skills to write your very own text processing program (ASSIGNMENT-3b) :-). But don't worry, there will be instructions and hints along the way. 


**Can I use external modules other than the ones treated so far?**

For now, please try to avoid it. All the exercises can be solved with what we have covered in blocks I, II, and III. 


## Functions & scope

### Exercise 1

Define a function called `split_sort_text` which takes one positional parameter called **text** (a string).

The function:
* splits the string on a space character, i.e., ' '
* returns all the unique words in alphabetical order as a list.

* Hint 1: There is a specific Python container which does not allow for duplicates and simply removes them. Use this one. 
* Hint 2: There is a function called `sorted`, which sorts the items in an iterable. Look at the documentation to see how it is used. 
* Hint 3: Don't forget to write a docstring. Please make sure that the docstring explains what the input is, what the function does, and what the function returns. If you want, but this is not needed to receive full points, you can use [reStructuredText](http://docutils.sourceforge.net/rst.html).

In [3]:
def split_sort_text(text):
    """
    Splits text on the ' ' (space) character, removes duplicates, and sorts the result alphabetically
    :param text: Text to be split and sorted (string)
    :return: Sorted list of unique strings
    """
    # Splits the text based on the ' ' (space) character
    # The result may contain duplicates, given that it's a list
    splitted_text_list_unsorted = text.split(' ')

    # Convert the list to a set to make sure it doesn't contain any more duplicates
    splitted_text_set_unsorted = set(splitted_text_list_unsorted)

    # Sort the items in the set (sorted() returns a list which is ordered)
    splitted_text_list_sorted = sorted(splitted_text_set_unsorted)
    return splitted_text_list_sorted


In [6]:
# Test case
split_sort_text('c a a b')

['a', 'b', 'c']

## Working with external modules

### Exercise 2
NLTK offers a way of using WordNet in Python. Do some research (using Google, because quite frankly, that's what we do very often) and see if you can find out how to import it. WordNet is a computational lexicon which organizes words according to their senses (collected in synsets). See if you can print all the **synset definitions** of the lemma **dog**.

Run the following cell first to make sure WordNet is installed:

In [2]:
import nltk
# the following lines download material including WordNet (only needed once)
nltk.download('book')
nltk.download('omw-1.4')
nltk.download('wordnet')

In [17]:
for synset in nltk.corpus.wordnet.synsets('dog'):
    print(f'Synset name: \'{synset.name()}\'')
    print(f'definition: \'{synset.definition()}\'\n')

Synset name: 'dog.n.01'
definition: 'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

Synset name: 'frump.n.01'
definition: 'a dull unattractive unpleasant girl or woman'

Synset name: 'dog.n.03'
definition: 'informal term for a man'

Synset name: 'cad.n.01'
definition: 'someone who is morally reprehensible'

Synset name: 'frank.n.02'
definition: 'a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll'

Synset name: 'pawl.n.01'
definition: 'a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward'

Synset name: 'andiron.n.01'
definition: 'metal supports for logs in a fireplace'

Synset name: 'chase.v.01'
definition: 'go after with the intent to catch'



## Working with Python scripts

### Exercise 3

#### a.) Define a function called `count`, which determines how often each word occurs in a string. Do not use NLTK just yet. Find a way to test it. 

* Write a helper-function called `preprocess`, which removes the punctuation specified by the user, and returns the same string without the unwanted characters. You call the function `preprocess` inside the `count` function.

* Remember that there are string methods that you can use to get rid of unwanted characters. Test the `preprocess` function using the following string `'this is a (tricky) test'`.

* Remember how we used dictionaries to count words? If not, have a look at Chapter 10 - Dictionaries. 

* Make sure you split the string on a space character ' '. Then loop over the resulting list to count the words.

* Test your function using an example string, which will tell you whether it fulfills the requirements (remove punctuation, split, count). You will get a point for good testing.
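
The two building blocks mentioned above (a string method that removes characters, and a dictionary used as a counter) can be sketched on a different task: counting characters instead of words. This is only an illustration under stated assumptions, not the expected solution. The names `remove_chars` and `count_chars` are hypothetical and do not appear elsewhere in this assignment.

```python
def remove_chars(text, unwanted):
    """
    Remove every character in unwanted from text (hypothetical helper).
    :param text: the input string
    :param unwanted: an iterable of characters to remove
    :return: the string without the unwanted characters
    """
    for char in unwanted:
        # str.replace returns a new string; strings are immutable
        text = text.replace(char, '')
    return text


def count_chars(text):
    """
    Count how often each character occurs in text (hypothetical helper).
    :param text: the input string
    :return: dictionary mapping each character to its frequency
    """
    frequencies = {}
    for char in text:
        if char in frequencies:
            frequencies[char] += 1
        else:
            frequencies[char] = 1
    return frequencies


cleaned = remove_chars('a-b-b!', '-!')
print(cleaned)               # abb
print(count_chars(cleaned))  # {'a': 1, 'b': 2}
```

Printing the intermediate result (`cleaned`) before counting is one simple way to test each step separately.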

#### b.) Create a Python script 

Use your editor to create a Python script called **count_words.py**. Place the function definition of the **count** function in **count_words.py**. Also put a function call of the **count** function in this file to test it. Place your helper function definition, i.e., **preprocess**, in a separate script called **utils_3a.py**. Import your helper function **preprocess** into count_words.py. Test whether everything works as expected by calling the script count_words.py from the terminal.

The function **preprocess** cleans the text by removing the characters that the user does not want. The **count** function calls **preprocess**, builds upon its output, and creates a dictionary in which each key is a word and each value is the frequency of that word.

**Please submit these scripts together with the other notebooks**.

Don't forget to add docstrings to your functions. 

In [4]:
# Feel free to use this cell to try out your code. 

## Dealing with text files

### Exercise 4

**Playing with lyrics**

a.) Write a function called `load_text`, which opens and reads a file and returns the text in the file. It should have the file path as a parameter. Test it by loading this file: ../Data/lyrics/walrus.txt

* Hint: remember it is best practice to use a context manager
* Hint: **FileNotFoundError**: This means that the path you provide does not lead to an existing file on your computer. Please carefully study Chapter 14. Please determine where the notebook or Python module that you are working with is located on your computer. Try to determine where Python is looking if you provide a path such as “../Data/lyrics/walrus.txt”. Try to go from your notebook to the location on your computer where Python is trying to find the file. One tip: if you did not store the Assignments notebooks 3a and 3b in the folder “Assignments”, you would get this error.
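
To make the hints above more concrete, here is a minimal sketch of how you might check where Python resolves a relative path, followed by reading a file with a context manager. The file `example.txt` and the temporary folder are assumptions made only so that the sketch runs on any computer; they are not part of the assignment data.

```python
import os
import tempfile

# Relative paths are resolved against the current working directory
print(os.getcwd())
# The absolute location a relative path points to (the file need not exist)
print(os.path.abspath('../Data/lyrics/walrus.txt'))

# Create a throwaway file so that this sketch runs anywhere (hypothetical file)
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, 'example.txt')
with open(example_path, 'w', encoding='utf-8') as outfile:
    outfile.write('first line\nsecond line\n')

# Check that the path exists before trying to open it
print(os.path.exists(example_path))  # True

# The context manager closes the file automatically after the block
with open(example_path, encoding='utf-8') as infile:
    content = infile.read()

print(content)
```

If `os.path.exists` returns `False` for your own path, compare the printed absolute location with the place where the file is actually stored.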

b.) Write a function called `replace_walrus`, which takes lyrics as input and replaces every instance of 'walrus' with 'hippo' (make sure to account for upper and lower case; it is fine to transform everything to lower case). The function should write the new version of the song to a file called 'walrus_hippo.txt' and store it in ../Data/lyrics. 

Don't forget to add docstrings to your functions. 
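
As a rough illustration of the idea in part b.), the sketch below lowercases a made-up sentence, replaces one word with another, and writes the result to a file. The sentence, the words 'cat' and 'dog', and the file name `example_out.txt` are assumptions chosen for the example; your own function should work on the lyrics instead.

```python
import os
import tempfile

text = 'The Cat sat. the cat slept.'

# Lowercase first so that 'Cat' and 'cat' are treated alike
new_text = text.lower().replace('cat', 'dog')

# Write to a throwaway location (hypothetical file name)
out_path = os.path.join(tempfile.mkdtemp(), 'example_out.txt')
with open(out_path, 'w', encoding='utf-8') as outfile:
    outfile.write(new_text)

# Read the file back to check what was written
with open(out_path, encoding='utf-8') as infile:
    written = infile.read()

print(written)  # the dog sat. the dog slept.
```

Reading the file back in is a simple way to test that the output ended up where you expected.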

In [5]:
# your code here


## Analyzing text with NLTK

### Exercise 5

**Building a simple NLP pipeline**

For this exercise, you will need NLTK. Don't forget to import it. 

Write a function called `tag_text`, which takes raw text as input and returns the tagged text. To do this, make sure you follow the steps below:

* Tokenize the text. 

* Perform part-of-speech tagging on the list of tokens. 

* Return the tagged text


Then test your function using the text snippet below (`test_text`) as input.

Please note that the tags may not be correct and that this is not a mistake on your end, but simply NLP tools not being perfect.

In [8]:
test_text = """Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:"""

In [9]:
# your code here

## Python knowledge

### Exercise 6

6.a) Explain in your own words the difference between the global and the local scope.

[answer]

6.b) What is the difference between the modes 'w' and 'a' when opening a file?

[answer]