# NLP Practical Test Model Answer
----

The NLP practical test will take place within this Jupyter notebook. Each question will require you to write a function which will return the answer. This notebook will be graded automatically, so it is important that the names of any existing variables and functions are left unchanged.

A shell function with the correct name for each question has already been defined for you. You will simply need to fill in the necessary code inside the function, as directed by the comments.

Each function has a return statement that reads `return None`. You will need to replace `None` with the relevant variable that you define in the function.

#### Import Libraries and Read In the Data

Do not modify or remove any of the code in these cells.

In [1]:
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string

nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\phomolo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\phomolo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# read in the data (the file is attached to the test on Athena)
with open('alice_in_wonderland.txt', 'r', encoding='ISO-8859-1') as f:
    data = f.read()
print(data[:863])

Alice's Adventures in Wonderland

                ALICE'S ADVENTURES IN WONDERLAND

                          Lewis Carroll

               THE MILLENNIUM FULCRUM EDITION 3.0




                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

 


#### Convert to Lowercase and Remove Punctuation

Do not modify or remove any of the code in these cells.

In [3]:
def remove_punctuation(words):
    words = words.lower()
    return ''.join([char for char in words if char not in string.punctuation])

In [4]:
data = remove_punctuation(data)
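
As a quick check of what `remove_punctuation` does, the sketch below applies the same logic to two short phrases taken from the excerpt printed above. Note that `string.punctuation` includes the apostrophe, the backtick and the hyphen, so possessives and hyphenated words are fused into a single word rather than split. The sample strings are for illustration only.

```python
import string

def remove_punctuation(words):
    # same logic as the notebook's function: lowercase, then drop punctuation characters
    words = words.lower()
    return ''.join([char for char in words if char not in string.punctuation])

title = remove_punctuation("Alice's Adventures in Wonderland")
chapter = remove_punctuation("Down the Rabbit-Hole")
print(title)    # alices adventures in wonderland
print(chapter)  # down the rabbithole
```

This is why later answers look up `'alice'` rather than `'Alice'`, and why tokens such as `rabbithole` appear in the bag of words.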

In [5]:
# define stemmer function
stemmer = SnowballStemmer('english')

# tokenise data
tokeniser = TreebankWordTokenizer()
tokens = tokeniser.tokenize(data)

# define lemmatiser
lemmatizer = WordNetLemmatizer()

# bag of words
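def bag_of_words_count(words, word_dict=None):
    """ this function takes in a list of words and returns a dictionary 
        with each word as a key, and the value represents the number of 
        times that word appeared"""
    # start from a fresh dictionary on each call; a mutable default ({})
    # would be shared between calls and keep accumulating counts
    if word_dict is None:
        word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict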

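# remove stopwords (build the stopword set once instead of reloading the list for every token)
stop_words = set(stopwords.words('english'))
tokens_less_stopwords = [word for word in tokens if word not in stop_words]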

# create bag of words
bag_of_words = bag_of_words_count(tokens_less_stopwords)
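
For comparison, the standard library's `collections.Counter` produces the same word counts as a manual loop like the one in `bag_of_words_count`. The sketch below uses a toy token list (an assumption for illustration, not the book's tokens) and a hypothetical helper, `manual_count`, that mirrors that loop.

```python
from collections import Counter

def manual_count(words):
    # hypothetical helper mirroring the counting loop in bag_of_words_count
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

toy_tokens = ["alice", "said", "rabbit", "said", "alice", "said"]  # toy data
manual_bag = manual_count(toy_tokens)
counter_bag = Counter(toy_tokens)
print(manual_bag)                       # {'alice': 2, 'said': 3, 'rabbit': 1}
print(dict(counter_bag) == manual_bag)  # True
```

Either form works for the questions that follow, since both behave like a dictionary mapping each word to its count.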

----

### Question 1: What is the 14th character in the book?

Write a function which extracts the 14th character from the relevant body of text.

In [6]:
def char_14(text):
    return text[13]

In [7]:
char_14(data)

'u'
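
Python indexes strings from zero, so the 14th character sits at index 13. The minimal check below uses the cleaned title line typed out by hand, on the assumption that this is how `data` begins after `remove_punctuation`. The helper `nth_char` is hypothetical and only illustrates the off-by-one conversion.

```python
cleaned_title = "alices adventures in wonderland"  # assumed start of the cleaned text

def nth_char(text, n):
    # hypothetical helper: n is a 1-based position, so subtract 1 to get the index
    return text[n - 1]

fourteenth = nth_char(cleaned_title, 14)
print(fourteenth)  # u
```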

---
### Question 2: What is the 34th word in the book?

Modify the function below to return the 34th word from the relevant body of text.

In [8]:
def word_34(tokens):
    return tokens[33] 

In [9]:
word_34(tokens)

'the'
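
The same zero-based rule applies to the token list: the 34th word is `tokens[33]`. The sketch below re-creates the opening of the cleaned text by hand and uses `str.split()` as a stand-in for `TreebankWordTokenizer`. That substitution is an assumption: the two can differ on other text, so treat this as an illustration of the indexing rather than of the tokeniser.

```python
# opening of the book after lowercasing and punctuation removal (typed out by hand)
excerpt = (
    "alices adventures in wonderland alices adventures in wonderland "
    "lewis carroll the millennium fulcrum edition 30 chapter i down the "
    "rabbithole alice was beginning to get very tired of sitting by her "
    "sister on the bank"
)

words = excerpt.split()  # whitespace split, standing in for the real tokeniser

def nth_word(words, n):
    # hypothetical helper: n is a 1-based position in the word list
    return words[n - 1]

thirty_fourth = nth_word(words, 34)
print(thirty_fourth)  # the
```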

---
### Question 3: How many words are in the book?

Modify the function below to return the total number of words from the relevant body of text.

In [10]:
def total_words(tokens):
    return len(tokens)

In [11]:
total_words(tokens)

26411

---
### Question 4: What is the stem of the word '*writing*'?

Write a function which returns the stem of the word 'writing'.

In [12]:
def find_stem(word):
    return stemmer.stem(word)

In [13]:
find_stem('writing')

'write'

---
### Question 5: What is the stem of the 55th word in the book?

Write a function which returns the stem of the word in position 55 in the book.

In [14]:
def find_stem_n(tokens, n):
    return stemmer.stem(tokens[n-1])

In [15]:
find_stem_n(tokens, 55)

'but'

---
### Question 6: What is the lemma of the word '*hippopotami*'?

Use the lemmatizer function from the relevant library to find the lemma of the word 'hippopotami'.

In [16]:
def find_lemma(word):
    return lemmatizer.lemmatize(word) 

In [17]:
find_lemma('hippopotami')

'hippopotamus'

---
### Question 7: What is the lemma of the 389th word in the book?

Modify the function below to find the lemma of the word in the 389th position in the book.

In [18]:
def find_lemma_n(tokens, n):
    return lemmatizer.lemmatize(tokens[n-1]) 

In [19]:
find_lemma_n(tokens, 389)

'see'

---
### Question 8: How many stopwords are in the book?

Modify the function below to count the number of distinct stopwords that appear in the body of text.

In [20]:
def count_stopwords(tokens):
    # use a set so each membership test is fast
    stops = set(stopwords.words('english'))
    # collect each stopword found in the text once, then count the distinct ones
    return len({w for w in tokens if w in stops})

In [21]:
count_stopwords(tokens)

128
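
Note that `count_stopwords` returns the number of distinct stopwords found (the length of the collection it builds), not the number of stopword occurrences. The sketch below shows the difference on a toy sentence. The stopword set here is a hypothetical three-word set, not NLTK's English list.

```python
from collections import Counter

toy_tokens = "the cat sat on the mat and the dog".split()
toy_stops = {"the", "on", "and"}  # hypothetical stopword set for illustration

stop_counts = Counter(w for w in toy_tokens if w in toy_stops)
distinct_stopwords = len(stop_counts)        # how many different stopwords appear
total_stopwords = sum(stop_counts.values())  # how many stopword occurrences there are
print(distinct_stopwords, total_stopwords)   # 3 5
```

If the total number of occurrences were wanted instead, summing the counts as in `total_stopwords` would be one way to get it.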

---
### Question 9: How many times does '*Alice*' appear in the book?

Modify the function below to return the count for the number of times the word 'Alice' appears in the book. Because the text was converted to lowercase earlier, the word is looked up as 'alice'.

In [22]:
def word_count(bag, word):
    # return 0 rather than raising a KeyError if the word is not in the bag
    count = bag.get(word, 0)
    return count

In [23]:
word_count(bag_of_words, 'alice')

386
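
Two details matter when looking a word up in the bag: the keys are lowercase, and a word that never occurs has no key at all. The sketch below is a hypothetical variant, `safe_word_count`, that handles both. The bag is a toy dictionary, not the book's counts.

```python
def safe_word_count(bag, word):
    # hypothetical variant: lowercase the query and return 0 for unseen words
    return bag.get(word.lower(), 0)

toy_bag = {"alice": 2, "rabbit": 1}  # toy counts for illustration

alice_count = safe_word_count(toy_bag, "Alice")
hatter_count = safe_word_count(toy_bag, "hatter")
print(alice_count)   # 2
print(hatter_count)  # 0
```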

---
### Question 10: What is the most common word in the book?

Modify the function below to return the word which appears the most times in the book.

In [24]:
def most_common_word(bag):
    max_word = max(bag, key=bag.get)
    return max_word

In [25]:
most_common_word(bag_of_words)

'said'
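
`max(bag, key=bag.get)` iterates over the dictionary's keys and picks the one with the largest count. As a cross-check, `collections.Counter.most_common` gives the same word together with its count. The sketch below uses a toy bag, not the book's counts.

```python
from collections import Counter

toy_bag = {"alice": 2, "said": 3, "rabbit": 1}  # toy counts for illustration

by_max = max(toy_bag, key=toy_bag.get)           # key with the highest value
by_counter = Counter(toy_bag).most_common(1)[0]  # (word, count) pair
print(by_max)      # said
print(by_counter)  # ('said', 3)
```

Because stopwords were removed before the bag was built, the result reflects the most common non-stopword.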