# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 01/2022

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 3 [here](https://www.nltk.org/book/ch03.html).

**1\.** Define a string `s = 'colorless'`. Write a Python statement that changes this to `'colourless'` using only the **slice** and **concatenation** operations.

In [3]:
# define string
s = 'colorless'

# slice the first 4 letters of s ('colo')
# concatenate the letter 'u'
# concatenate the remaining letters of s ('rless')
s[:4] + 'u' + s[4:]

'colourless'

**2\.** We can use the slice notation to remove morphological endings on words.

For example, `'dogs'[:-1]` removes the last character of `'dogs'`, leaving `'dog'`.

Use **slice notation** to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): _dish-es, run-ning, nation-ality, un-do, pre-heat._

In [6]:
# define strings
s1, s2, s3, s4, s5 = 'dishes', 'running', 'nationality', 'undo', 'preheat'

# remove affixes using slice notation:
# suffixes are cut from the end, prefixes from the start
s1[:-2], s2[:-4], s3[:-5], s4[2:], s5[3:]

('dish', 'run', 'nation', 'do', 'heat')

**3\.** We saw how we can generate an `IndexError` by indexing beyond the end of a string. 

Is it possible to **construct an index that goes too far to the left, before the start of the string**?

In [9]:
# define a string
s = 'dog'

# an index that goes too far to the left
s[-4]

IndexError: string index out of range
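
To make the answer explicit: yes, a negative index can overshoot the start of the string just as a positive one can overshoot the end. The sketch below shows the valid index range for the same `'dog'` string. The helper name `safe_char` is hypothetical and introduced only for illustration.

```python
s = 'dog'

# valid indices run from -len(s) up to len(s) - 1
first_valid, last_valid = -len(s), len(s) - 1
print(s[first_valid], s[last_valid])   # d g

def safe_char(text, i):
    """Return text[i], or None when i falls outside the valid range."""
    try:
        return text[i]
    except IndexError:
        return None

print(safe_char(s, -3))   # d
print(safe_char(s, -4))   # None

# slices, unlike single indices, never raise: out-of-range bounds are clipped
print(s[-10:2])           # do
```

Only single-index access raises `IndexError`; a slice with the same out-of-range bound silently returns whatever part of the string exists.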

**4\.** We can specify a **"step" size** for the slice. 

The following returns every second character within the slice: `monty[6:11:2]`.

It also works in the reverse direction: `monty[10:5:-2]`.

Try these for yourself, then experiment with different step values.

In [26]:
# define a string
s = 'airplane'

# return every 3rd character inside the slice range
print(s[0:7:3])

# return every 2nd character inside the slice range
print(s[0:7:2])

# return every 2nd character inside the slice range, in reverse
print(s[7:0:-2])

apn
arln
eapi
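
One way to reason about step sizes is that a slice visits the same indices as `range(start, stop, step)`. The sketch below rebuilds two of the slices above from explicit indices. The function name `slice_by_range` is hypothetical, and it assumes non-negative bounds that lie inside the string.

```python
s = 'airplane'

def slice_by_range(text, start, stop, step):
    """Rebuild text[start:stop:step] from explicit indices (in-range, non-negative bounds assumed)."""
    return ''.join(text[i] for i in range(start, stop, step))

# forward, every 3rd character: indices 0, 3, 6
print(slice_by_range(s, 0, 7, 3), s[0:7:3])     # apn apn

# backward, every 2nd character: indices 7, 5, 3, 1
print(slice_by_range(s, 7, 0, -2), s[7:0:-2])   # eapi eapi
```

In both directions the `stop` index itself is excluded, which is why the reverse slice never reaches index 0.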


**5\.** What happens if you ask the interpreter to evaluate `monty[::-1]`?

Explain why this is a reasonable result.

In [30]:
# try slice & step size
'monty'[::-1]

'ytnom'

The slice returns the **word in reverse**. 

That is the expected result: leaving the start and stop of the slice empty means it covers **the whole word**.

The step of `-1` means the slice **starts from the end and moves backwards one character at a time**.
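
As a small check of that reasoning, the reversal can be spelled out in three equivalent ways. This is only an illustrative sketch; the variable names are not from the NLTK book.

```python
word = 'monty'

# 1. slice with a step of -1 and default start/stop
by_slice = word[::-1]

# 2. the built-in reversed() iterator, joined back into a string
by_reversed = ''.join(reversed(word))

# 3. explicit indices, from the last character down to index 0
by_indices = ''.join(word[i] for i in range(len(word) - 1, -1, -1))

print(by_slice, by_reversed, by_indices)   # ytnom ytnom ytnom
```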

**6\.** Describe the **class of strings matched** by the following regular expressions.

`[a-zA-Z]+` one or more lower- or upper-case alphabetic characters.

`[A-Z][a-z]*` a single upper-case letter followed by zero or more lower-case letters, e.g. a capitalised word.

`p[aeiou]{,2}t` the letter 'p', followed by zero to two lower-case vowels, followed by the letter 't'.

`\d+(\.\d+)?` one or more digits, optionally followed by a dot and one or more digits, i.e. an integer or a decimal number.

`([^aeiou][aeiou][^aeiou])*` zero or more repetitions of a three-character sequence: any character other than a lower-case vowel, then a lower-case vowel, then any character other than a lower-case vowel. Because zero repetitions are allowed, it also matches the empty string.

`\w+|[^\w\s]+` either one or more word characters (letters, digits or underscore), or one or more characters that are neither word characters nor whitespace, such as punctuation.

Test your answers using `nltk.re_show()`.
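
As a complement to `nltk.re_show()`, the descriptions can also be checked with the standard `re` module. The sketch below uses `re.fullmatch`, which asks whether a *whole* string belongs to the class, whereas `re_show` marks matches found anywhere inside a text. The test strings and the helper name `check` are made up for illustration.

```python
import re

# pattern -> (strings expected to match in full, strings expected not to)
cases = {
    r'[a-zA-Z]+': (['dog', 'Dog', 'DOG'], ['', 'dog1']),
    r'[A-Z][a-z]*': (['A', 'Daniel'], ['daniel', 'DAniel']),
    r'p[aeiou]{,2}t': (['pt', 'pat', 'pout'], ['peeet', 'Pat']),
    r'\d+(\.\d+)?': (['13', '3.14'], ['.5', '3.']),
    r'([^aeiou][aeiou][^aeiou])*': (['', 'dog', 'dogcat'], ['do', 'ado']),
    r'\w+|[^\w\s]+': (['word', '1770', '...'], ['', 'a b', 'a,']),
}

def check(cases):
    """Return (pattern, text, expected, actual) for every test string."""
    rows = []
    for pattern, (matching, non_matching) in cases.items():
        for text in matching:
            rows.append((pattern, text, True, bool(re.fullmatch(pattern, text))))
        for text in non_matching:
            rows.append((pattern, text, False, bool(re.fullmatch(pattern, text))))
    return rows

rows = check(cases)
for pattern, text, expected, actual in rows:
    print(f'{pattern!r:32} {text!r:10} expected={expected} actual={actual}')
```

Every row should show the same value for `expected` and `actual`; a mismatch would mean the description of that pattern needs another look.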

In [72]:
import nltk

s = """
Daniel Lambert was born on the 13 of March, 1770, in the Parish of
St. Margaret, at Leicester. From the extraordinary bulk to which he
attained, the reader may be naturally disposed to inquire, whether or
no his parents were persons of remarkable dimensions. This was not the
case; nor were any of his family inclined to corpulence, excepting
an uncle and aunt on the father’s side, who were both very heavy.
The former died during the infancy of Lambert, in the capacity of
gamekeeper to the Earl of Stamford, to whose predecessor his father had
been huntsman in early life. The family of Lambert, senior, consisted
besides Daniel, of another son, who died young, and two daughters, who
are still living, and both women of the common size.

The habits of the subject, 34*1+10 of this memoir were not, in any respect,
different from those of other young persons till the age of fourteen.
Even at that early period he was strongly attached to the sports of the
field. This, however, was only the natural effect of a very obvious
cause, aided probably by an innate propensity to those diversions.--We
have already mentioned the profession of his father and uncle, and have
yet to observe, that his maternal grandfather was a great cock-fighter.
Born and bred among horses, dogs, and cocks, and all the other
appendages of sporting, in the pursuits of which he was encouraged
even in his childhood, it cannot be a matter of wonder that he should
be passionately fond of all those exercises and amusements, which are
comprehended under the denomination of field sports. 12+34*1
"""
# re_show() prints the annotated text itself and returns None,
# so it is called directly; print() only separates the results
nltk.re_show(r'[a-zA-Z]+', s[:50])
print()
nltk.re_show(r'[A-Z][a-z]*', s[:50])
print()
nltk.re_show(r'p[aeiou]{,2}t', s[:350])
print()
nltk.re_show(r'\d+(\.\d+)?', s[:50])
print()
nltk.re_show(r'([^aeiou][aeiou][^aeiou])*', s[:50])
print()
nltk.re_show(r'\w+|[^\w\s]+', s[:50])


{Daniel} {Lambert} {was} {born} {on} {the} 13 {of} {March}, 1770,


{Daniel} {Lambert} was born on the 13 of {March}, 1770,


Daniel Lambert was born on the 13 of March, 1770, in the Parish of
St. Margaret, at Leicester. From the extraordinary bulk to which he
attained, the reader may be naturally disposed to inquire, whether or
no his parents were persons of remarkable dimensions. This was not the
case; nor were any of his family inclined to corpulence, exce{pt}ing
an unc


Daniel Lambert was born on the {13} of March, {1770},

{}
{Dan}{}i{}e{}l{} {Lamber}{}t{} {was}{} {bor}{}n{ on}{} {}t{he }{}1{}3{ of}{} {Mar}{}c{}h{},{} {}1{}7{}7{}0{},{}


{Daniel} {Lambert} {was} {born} {on} {the} {13} {of} {March}{,} {1770}{,}


**7\.** Write **regular expressions** to match the following classes of strings:

1. A single determiner (assume that _a, an,_ and _the_ are the only determiners).
2. An arithmetic expression using integers, addition, and multiplication, such as 2*3+8. 

In [74]:
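# determiner: a, an or the as a whole word (\b), in upper or lower case
nltk.re_show(r'(?i)\b(?:a|an|the)\b', s)
print()

# arithmetic expression: an integer followed by one or more '+' or '*' and another integer;
# shown only on the lines of s that contain a digit, to keep the output short
digit_lines = '\n'.join(line for line in s.splitlines() if any(c.isdigit() for c in line))
nltk.re_show(r'\d+(?:[*+]\d+)+', digit_lines)


Daniel Lambert was born on {the} 13 of March, 1770, in {the} Parish of
St. Margaret, at Leicester. From {the} extraordinary bulk to which he
attained, {the} reader may be naturally disposed to inquire, whether or
no his parents were persons of remarkable dimensions. This was not {the}
case; nor were any of his family inclined to corpulence, excepting
{an} uncle and aunt on {the} father’s side, who were both very heavy.
{The} former died during {the} infancy of Lambert, in {the} capacity of
gamekeeper to {the} Earl of Stamford, to whose predecessor his father had
been huntsman in early life. {The} family of Lambert, senior, consisted
besides Daniel, of another son, who died young, and two daughters, who
are still living, and both women of {the} common size.

{The} habits of {the} subject, 34*1+10 of this memoir were not, in any respect,
different from those of other young persons till {the} age of fourteen.
Even at that early period he was strongly attached to {the} sports of {the}
field. This, however, was only {the} natural effect of {a} very obvious
cause, aided probably by {an} innate propensity to those diversions.--We
have already mentioned {the} profession of his father and uncle, and have
yet to observe, that his maternal grandfather was {a} great cock-fighter.
Born and bred among horses, dogs, and cocks, and all {the} other
appendages of sporting, in {the} pursuits of which he was encouraged
even in his childhood, it cannot be {a} matter of wonder that he should
be passionately fond of all those exercises and amusements, which are
comprehended under {the} denomination of field sports. 12+34*1

Daniel Lambert was born on the 13 of March, 1770, in the Parish of
The habits of the subject, {34*1+10} of this memoir were not, in any respect,
comprehended under the denomination of field sports. {12+34*1}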
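
A determiner pattern has to avoid firing inside longer words (the "an" in "huntsman", the "the" in "then"), and an arithmetic pattern should not accept a bare integer. The sketch below shows one possible way to get both behaviours with the standard `re` module, using word boundaries (`\b`) and a repeated operator group. The sample sentence is invented for illustration.

```python
import re

sample = 'The huntsman gave a dog an apple; then the sum 2*3+8 and 12+34*1, not 1770.'

# a, an or the as whole words only, ignoring case
determiner = re.compile(r'\b(?:a|an|the)\b', re.IGNORECASE)

# an integer followed by at least one "+integer" or "*integer" part
arithmetic = re.compile(r'\d+(?:[*+]\d+)+')

determiners_found = determiner.findall(sample)
expressions_found = arithmetic.findall(sample)

print(determiners_found)   # ['The', 'a', 'an', 'the']
print(expressions_found)   # ['2*3+8', '12+34*1']
```

The non-capturing groups `(?:...)` matter here: with ordinary capturing groups, `findall` would return only the group contents rather than the whole match.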

**8\.** Write a **utility function** that takes a URL as its argument, and returns the contents of the URL, with **all HTML markup removed**. 

Use `from urllib import request` and then `request.urlopen('http://nltk.org/').read().decode('utf8')` to access the contents of the URL.

In [12]:
from urllib import request
from bs4 import BeautifulSoup

def remove_HTML(url):
    """Remove HTML markup from text."""
    
    # access contents
    content = request.urlopen(url).read().decode('utf8')
    # strip text from HTML markup
    raw = BeautifulSoup(content, 'html.parser').get_text()
    
    # return stripped text
    return raw

# define URL
url_address = 'http://nltk.org/'

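# check original content
# (the slice below is applied to the whole formatted message, heading included,
# so fewer than 100 characters of the page itself are printed)
print("This is the first 100 characters of the original URL contents:\n\n{}\n".
     format(request.urlopen(url_address).read().decode('utf8'))[:100])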

# invoke function
print("This is the first 100 characters with stripped HTML markup:\n\n{}".
      format(remove_HTML(url_address)[:100]))

This is the first 100 characters of the original URL contents:

<!DOCTYPE html>
<head>
  <meta chars
This is the first 100 characters with stripped HTML markup:







NLTK :: Natural Language Toolkit














NLTK



Documentation














NLTK Docume
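
The stripped text above still contains long runs of blank lines, and `BeautifulSoup` is a third-party dependency. As a rough alternative, the sketch below removes markup using only the standard library's `html.parser` and collapses whitespace afterwards. The names `TextExtractor` and `strip_html` and the sample page are hypothetical, and the code works on an HTML string rather than fetching a URL.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Return the text of an HTML string with markup removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # join the text nodes, then squeeze every run of whitespace into one space
    return ' '.join(' '.join(parser.parts).split())

page = ('<html><head><title>NLTK</title><style>p {color: red}</style></head>'
        '<body><p>Natural   Language\n Toolkit</p></body></html>')
stripped = strip_html(page)
print(stripped)   # NLTK Natural Language Toolkit
```

To use it on a live page, the string returned by `request.urlopen(url).read().decode('utf8')` could be passed to `strip_html` in place of `page`.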


**9\.** Save some text into a file `corpus.txt`.

Define a function `load(f)` that reads from the file named in its sole argument, and returns a string containing the text of the file.

1. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag `(?x)`.
2. Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.

In [93]:
# create a file & write some text
with open('corpus.txt', 'w') as f:
    f.write('This is the file\'s text. My name is Mike, '
            'today is 10-02-2022 and I have £0.00 on my account. '
            'I work in Unemployment Org. I also have 0.000$ !')
    
# define function
def load(f):
    """Read a file and return the text of that file."""
    # open the file (a separate name avoids shadowing the argument f)
    with open(f) as file:
        # save its content
        content = file.read()
    # return content
    return content

# print the file's content
print("The file's content is the following:\n\n{}\n".
      format(load('corpus.txt')))

from nltk.tokenize import RegexpTokenizer

# assign text to var
my_text = load('corpus.txt')

# create multi-line regex pattern
pattern = r'''(?x)         # verbose flag
        (\w*\d*\s)         # word characters followed by one whitespace character
'''

# instantiate Regexp tokenizer
tokenizer = RegexpTokenizer(pattern)

# tokenize text
tokens = tokenizer.tokenize(my_text)

# check results
print("Tokenized text without punctuation marks:\n\n{}\n".
      format(tokens))

pattern_B = r'''(?x)         # verbose flag

        \d{,2}-\d{2}-\d{4}   # dates
    |   \£?\d+(?:\.\d+)?\$?  # monetary amounts
    |   [A-Z]\w+             # proper names
'''

# instantiate Regexp tokenizer
tokenizer_B = RegexpTokenizer(pattern_B)

# tokenize text
tokens_b = tokenizer_B.tokenize(my_text)

# check results
print("Tokenized text with 2nd pattern:\n\n{}\n".
      format(tokens_b))

The file's content is the following:

This is the file's text. My name is Mike, today is 10-02-2022 and I have £0.00 on my account. I work in Unemployment Org. I also have 0.000$ !

Tokenized text without punctuation marks:

['This ', 'is ', 'the ', 's ', ' ', 'My ', 'name ', 'is ', ' ', 'today ', 'is ', '2022 ', 'and ', 'I ', 'have ', '00 ', 'on ', 'my ', ' ', 'I ', 'work ', 'in ', 'Unemployment ', ' ', 'I ', 'also ', 'have ', ' ']

Tokenized text with 2nd pattern:

['This', 'My', 'Mike', '10-02-2022', '£0.00', 'Unemployment', 'Org', '0.000$']
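
The first pattern above keeps words and drops punctuation, whereas part 1 of the exercise asks for the punctuation itself. The sketch below is one possible verbose pattern for that, tested with `re.findall` on a shortened copy of the corpus text. The choice of punctuation classes is an assumption about what "various kinds" should cover, and the same pattern string should also be usable as the second argument of `nltk.regexp_tokenize(text, pattern)`.

```python
import re

text = ("This is the file's text. My name is Mike, today is 10-02-2022 "
        "and I have £0.00 on my account.")

punct_pattern = r'''(?x)    # verbose flag: layout and comments are ignored
      \.{3}                 # an ellipsis, kept as one token
    | --+                   # a dash written as two or more hyphens
    | [.,;:!?]              # sentence punctuation
    | ['"()\[\]-]           # quotes, brackets and the single hyphen
    | [£$]                  # currency symbols
'''

punct_tokens = re.findall(punct_pattern, text)
print(punct_tokens)   # ["'", '.', ',', '-', '-', '£', '.', '.']
```

The longer alternatives (`\.{3}`, `--+`) are listed first so that they win over the single-character classes when both could match at the same position.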



**10\.** Rewrite the following loop as a list comprehension:

`sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
result`

In [100]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

# list comprehension
result = [(word, len(word)) for word in sent]

# check result
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

**11\.** Define a string `raw` containing a sentence of your own choosing.

Now, split `raw` on some character other than space, such as `'s'`.

In [101]:
# define string
raw = 'This is my amazing and imaginative string!'

# split string
raw.split('s')

['Thi', ' i', ' my amazing and imaginative ', 'tring!']

**12\.** Write a `for` loop to print out the characters for a string, one per line.

In [103]:
# define string 
s = 'my string!'

# print characters one per line
for char in s:
    print(char)

m
y
 
s
t
r
i
n
g
!


**13\.** What is the difference between calling `split` on a string with no argument or with `' '` as the argument, e.g. `sent.split()` versus `sent.split(' ')`?

What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces?

In [109]:
# define a string
raw = 'This is my amazing and imaginative string!'

# split string
print("This is the result of '.split()':\n\n{}\n".format(raw.split()))

print("This is the result of '.split(' ')':\n\n{}\n".format(raw.split(' ')))


# define a string
raw_extrawhitespace = 'This is    my amazing and    imaginative string!'

# split string
print("This is the result of '.split()' with extra whitespace:\n\n{}\n".format(raw_extrawhitespace.split()))

print("This is the result of '.split(' ')' with extra whitespace:\n\n{}\n".format(raw_extrawhitespace.split(' ')))


This is the result of '.split()':

['This', 'is', 'my', 'amazing', 'and', 'imaginative', 'string!']

This is the result of '.split(' ')':

['This', 'is', 'my', 'amazing', 'and', 'imaginative', 'string!']

This is the result of '.split()' with extra whitespace:

['This', 'is', 'my', 'amazing', 'and', 'imaginative', 'string!']

This is the result of '.split(' ')' with extra whitespace:

['This', 'is', '', '', '', 'my', 'amazing', 'and', '', '', '', 'imaginative', 'string!']



When the words are separated by **single spaces only**, there is no difference between `.split()` and `.split(' ')`.

When the string contains **tabs, consecutive spaces, or a mix of tabs and spaces**, the two differ: `.split()` treats any run of whitespace as one separator and never returns empty strings, whereas `.split(' ')` splits at every single space, so consecutive spaces produce empty strings (`''`) and tabs are left inside the tokens.
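
The cell above only tests consecutive spaces. A short sketch with tabs mixed in (the string `raw_mixed` is made up for illustration) shows the remaining cases the exercise asks about:

```python
# a tab, two consecutive spaces, and a space-tab-space sequence
raw_mixed = 'This\tis  my \t string!'

no_arg = raw_mixed.split()
space_arg = raw_mixed.split(' ')

print(no_arg)      # ['This', 'is', 'my', 'string!']
print(space_arg)   # ['This\tis', '', 'my', '\t', 'string!']
```

With `' '` as the separator the tab is never split on, so it survives inside `'This\tis'` and even appears as a token of its own.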

**14\.** Create a variable `words` containing a list of words. Experiment with `words.sort()` and `sorted(words)`. 

What is the difference?

In [114]:
words = ['word1', 'word3', 'word2', 'word5', 'word4']

# try different sorting methods
print("This is the result of `words.sort()`:\n\n{}\n".format(words.sort()))

print("This is the result of `sorted(words)`:\n\n{}\n".format(sorted(words)))

This is the result of `words.sort()`:

None

This is the result of `sorted(words)`:

['word1', 'word2', 'word3', 'word4', 'word5']



`words.sort()` returns `None`; it sorts the original list in place.

`sorted(words)` returns a new sorted list; it does not modify the original list.
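
Because `words.sort()` runs first in the cell above, the list is already sorted by the time `sorted(words)` is called, which hides the effect on the original list. A minimal sketch that inspects the list after each call (variable names are illustrative only):

```python
words = ['word3', 'word1', 'word2']

# sorted() builds a new list and leaves the original untouched
new_list = sorted(words)
after_sorted = list(words)          # snapshot of words after sorted()
print(new_list)       # ['word1', 'word2', 'word3']
print(after_sorted)   # ['word3', 'word1', 'word2']

# list.sort() reorders the list in place and returns None
result = words.sort()
print(result)         # None
print(words)          # ['word1', 'word2', 'word3']
```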

**15\.** Explore the difference between strings and integers by typing the following at a Python prompt: `"3" * 7` and `3 * 7`.

Try converting between strings and integers using `int("3")` and `str(3)`.

In [126]:
# check differences
str_mult = "3" * 7
print("This is the result of '3' * 7: {} and its data type is: {}.\n".
      format((str_mult), type(str_mult)))

num_mult = 3 * 7
print("This is the result of 3 * 7: {} and its data type is: {}.\n".
      format((num_mult), type(num_mult)))

# try conversions
print("This is the result of int(str_mult): {} and its data type is: {}.\n".
      format(int(str_mult), type(int(str_mult))))

print("This is the result of str(num_mult): {} and its data type is: {}.\n".
      format(str(num_mult), type(str(num_mult))))

This is the result of '3' * 7: 3333333 and its data type is: <class 'str'>.

This is the result of 3 * 7: 21 and its data type is: <class 'int'>.

This is the result of int(str_mult): 3333333 and its data type is: <class 'int'>.

This is the result of str(num_mult): 21 and its data type is: <class 'str'>.
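
Conversion from string to integer only works when the string really is an integer literal. The sketch below wraps `int()` in a hypothetical helper, `to_int`, to show which inputs are accepted, and contrasts string concatenation with integer addition.

```python
def to_int(text):
    """Return int(text), or None when text is not a valid integer literal."""
    try:
        return int(text)
    except ValueError:
        return None

print(to_int("3"))       # 3
print(to_int(" 42 "))    # 42 (surrounding whitespace is ignored)
print(to_int("3.5"))     # None
print(to_int("three"))   # None

# the same digits behave differently depending on their type
print("3" + str(7))      # 37
print(3 + int("7"))      # 10
```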

