# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 01/2022

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 1 [here](https://www.nltk.org/book/ch01.html).

# CONTENT

1. Language Processing and Python
    1. Computing with Language: Texts and Words
    2. A Closer Look at Python: Texts as Lists of Words
    3. Computing with Language: Simple Statistics
    4. [Back to Python: Making Decisions and Taking Control](#DecisionsControl)
        1. [Conditionals](#Conditionials)
        2. [Operating on Every Element](#Loops)
        3. [Looping with Conditions](#LoopingWithConditions)

**Install**, **import** and **download NLTK**. <br>

*Uncomment the `#!pip install nltk` and `#nltk.download()` lines if you haven't installed NLTK and downloaded its data yet.*

In [1]:
# install nltk
#!pip install nltk

# load nltk
import nltk

# download nltk
#nltk.download()

Load all items (9 texts and 9 sentences) from **NLTK's book module**.

In [2]:
# load all items from NLTK’s book module.
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


<a name="DecisionsControl"></a>
## 1.1.4 Back To Python: Making Decisions and Taking Control
1. [Conditionals](#Conditionials)
2. [Operating on Every Element](#Loops)
3. [Looping with Conditions](#LoopingWithConditions)

<a name="Conditionials"></a>
### 4.1 Conditionals

![nltk_WordComparisonOperators.PNG](attachment:nltk_WordComparisonOperators.PNG)

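`word.startswith("letters")` returns `True` for words that start with the specified letters, so as a filter it keeps only those words. <br>
`word.endswith("letters")` returns `True` for words that end with the specified letters, so as a filter it keeps only those words.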

In [3]:
# create a list with the words within text1 that start with "magn"
list_with_magn = [w for w in set(text1) if w.startswith("magn")]
print(list_with_magn)
print("")
# create a list with the words within text1 that end with "ableness"
list_with_ableness = [w for w in set(text1) if w.endswith("ableness")]
print(list_with_ableness)

['magnify', 'magnet', 'magnets', 'magnetic', 'magnificent', 'magnitude', 'magnetism', 'magnanimity', 'magnification', 'magnanimous', 'magnifying', 'magniloquent', 'magnified', 'magnificence', 'magnetizing', 'magnetically']

['intolerableness', 'reasonableness', 'comfortableness', 'indomitableness', 'immutableness', 'honourableness', 'palpableness', 'uncomfortableness', 'indispensableness']
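
Both methods are case-sensitive, and `startswith` also accepts a tuple of prefixes. Here is a minimal sketch on a small hand-made token list; `tokens` is an assumption standing in for `text1`, so the cell runs without the NLTK data. `sorted()` is used because iterating over a `set` has no guaranteed order.

```python
# hypothetical sample tokens standing in for text1
tokens = ["magnet", "Magnitude", "magnify", "reasonableness", "whale", "magnet"]

# startswith is case-sensitive: "Magnitude" is not matched by "magn"
starts_magn = sorted(w for w in set(tokens) if w.startswith("magn"))
print(starts_magn)  # ['magnet', 'magnify']

# lowercase each word first for a case-insensitive match
starts_magn_any_case = sorted(w for w in set(tokens) if w.lower().startswith("magn"))
print(starts_magn_any_case)  # ['Magnitude', 'magnet', 'magnify']

# a tuple tests several prefixes at once
starts_either = sorted(w for w in set(tokens) if w.startswith(("magn", "wha")))
print(starts_either)  # ['magnet', 'magnify', 'whale']

# endswith works the same way on the end of the word
ends_ableness = sorted(w for w in set(tokens) if w.endswith("ableness"))
print(ends_ableness)  # ['reasonableness']
```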


`word.islower()` &rarr; keeps the words whose letters are all lowercase. <br>
`word.isupper()` &rarr; keeps the words whose letters are all uppercase. <br>
`word.istitle()` &rarr; keeps the words in title case (a capital first letter followed by lowercase letters).

In [4]:
# create a list with the words whose letters are all lowercase
list_lower = [w for w in set(text1) if w.islower()]
# print the first 10 words of the list
print(list_lower[0:10])

print("")

# create a list with the words whose letters are all uppercase
list_upper = [w for w in set(text1) if w.isupper()]
# print the first 10 words of the list
print(list_upper[0:10])

print("")

# create a list with the words in title case (capital first letter)
list_title = [w for w in set(text1) if w.istitle()]
# print the first 10 words of the list
print(list_title[0:10])

['trap', 'wealthy', 'cooler', 'town', 'teachings', 'sighed', 'deadreckoning', 'is', 'iciness', 'pruning']

['UPON', 'BOTTOM', 'FALCONER', 'APOLOGY', 'HVAL', 'BOARD', 'YORK', 'OPEN', 'NEST', 'BIOGRAPHY']

['Callao', 'Commanded', 'Mounttop', 'Stubb', 'Pulpit', 'Clinging', 'Sullenly', 'Crushed', 'Latter', 'Sleeping']
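
The three case tests can give surprising results on edge cases, so it may help to compare them side by side. This is a small sketch on hand-picked strings; `samples` is an assumption and is not taken from `text1`.

```python
# hypothetical sample strings chosen to show edge cases
samples = ["whale", "UPON", "Stubb", "Moby-Dick", "McDonald", "1820", "A"]

# map each word to its (islower, isupper, istitle) results
results = {w: (w.islower(), w.isupper(), w.istitle()) for w in samples}

for word, flags in results.items():
    print(word, flags)

# "McDonald" fails istitle() because a capital follows a lowercase letter
# "1820" fails all three because it contains no cased characters
# "A" is both uppercase and title case
```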


`word.isalpha()` &rarr; keeps the words whose characters are all alphabetic. <br>
`word.isdigit()` &rarr; keeps the words whose characters are all digits. <br>
`word.isalnum()` &rarr; keeps the words whose characters are all alphanumeric (letters or digits).

In [5]:
# create a list with the words whose characters are all alphabetic
list_alpha = [w for w in set(text1) if w.isalpha()]
# print the first 10 words of the list
print(list_alpha[0:10])

print("")

# create a list with the words whose characters are all digits
list_digit = [w for w in set(text1) if w.isdigit()]
# print the first 10 words of the list
print(list_digit[0:10])

print("")

# create a list with the words whose characters are all alphanumeric
list_alnum = [w for w in set(text1) if w.isalnum()]
# print the first 10 words of the list
print(list_alnum[0:10])

['trap', 'wealthy', 'cooler', 'Callao', 'town', 'Commanded', 'teachings', 'sighed', 'UPON', 'deadreckoning']

['92', '86', '1729', '37', '7', '1820', '800', '63', '1842', '25']

['trap', 'wealthy', 'cooler', 'Callao', 'town', 'Commanded', 'teachings', 'sighed', 'UPON', 'deadreckoning']
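
Every alphabetic token is also alphanumeric, which is presumably why the first and third slices above look the same. The difference only shows on tokens that mix letters and digits or contain punctuation. Here is a minimal sketch on a hand-made list; `tokens` is an assumption, not data from `text1`.

```python
# hypothetical sample tokens mixing letters, digits and punctuation
tokens = ["whale", "1820", "chapter1", "dead-reckoning", ",", "Ahab"]

alpha = [w for w in tokens if w.isalpha()]
digit = [w for w in tokens if w.isdigit()]
alnum = [w for w in tokens if w.isalnum()]
# tokens containing anything other than letters or digits
other = [w for w in tokens if not w.isalnum()]

print(alpha)  # ['whale', 'Ahab']
print(digit)  # ['1820']
print(alnum)  # ['whale', '1820', 'chapter1', 'Ahab']
print(other)  # ['dead-reckoning', ',']
```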


<a name="Loops"></a>
### 4.2 Operating on Every Element

A [**List Comprehension**](https://www.w3schools.com/python/python_lists_comprehension.asp) is an **efficient** and **succinct** way (a one-liner) of generating a new list from an existing sequence.

**Syntax**: `[f(w) for ...]` or `[w.f() for ...]`
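
The two syntax forms can be sketched next to the explicit loop they replace. The `words` list is an assumption used only for illustration.

```python
# hypothetical sample sentence
words = ["Call", "me", "Ishmael"]

# explicit loop: build the list step by step
lengths_loop = []
for w in words:
    lengths_loop.append(len(w))

# [f(w) for ...]: apply a function to every element
lengths = [len(w) for w in words]

# [w.f() for ...]: call a method on every element
uppercased = [w.upper() for w in words]

print(lengths_loop)  # [4, 2, 7]
print(lengths)       # [4, 2, 7]
print(uppercased)    # ['CALL', 'ME', 'ISHMAEL']
```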

In [43]:
# get the length of text1
print("The length of text1 is: {} tokens.\n".format(len(text1)))

The length of text1 is: 260819 tokens.



In [44]:
# get the length of text1 without duplicate tokens
text1_no_dup = len(set(text1))

print("The length of text1 without duplicates is: {} tokens.".format(text1_no_dup))
print("This is {} fewer tokens than the full length text!\n".format(len(text1)-text1_no_dup))

The length of text1 without duplicates is: 19317 tokens.
This is 241502 fewer tokens than the full length text!



In [45]:
# get the length of text1 without duplicate tokens & with all letters lowercased
text1_no_duplicates_lower = len(set(word.lower() for word in text1))

print("The length of text1 without duplicates and all letters lowercased is: {} tokens.".format(text1_no_duplicates_lower))
print("This is {} fewer tokens than the text without duplicates!".format(text1_no_dup - text1_no_duplicates_lower))

The length of text1 without duplicates and all letters lowercased is: 17231 tokens.
This is 2086 fewer tokens than the text without duplicates!


In [48]:
# remove punctuation and numbers by keeping only the tokens that are fully alphabetic
text1_clean = len(set(word.lower() for word in text1 if word.isalpha()))

print("The length of text1 without duplicates, all letters lowercased and with punctuation and numbers removed is: {} tokens.".format(text1_clean))
print("This is {} fewer tokens than the lowercased text without duplicates!".format(text1_no_duplicates_lower-text1_clean))

The length of text1 without duplicates, all letters lowercased and with punctuation and numbers removed is: 16948 tokens.
This is 283 fewer tokens than the lowercased text without duplicates!
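
The counting steps in this section can be gathered into one helper so the effect of each normalisation is visible at a glance. This is only a sketch: `vocab_sizes` and the `sample` list are hypothetical names introduced here, and the same function could be called on `text1` once the NLTK data is loaded.

```python
def vocab_sizes(tokens):
    """Return token and vocabulary counts under increasingly strict filters."""
    return {
        # every token, duplicates included
        "tokens": len(tokens),
        # distinct tokens, case-sensitive
        "distinct": len(set(tokens)),
        # distinct tokens after lowercasing
        "distinct_lower": len(set(w.lower() for w in tokens)),
        # distinct lowercased tokens that are fully alphabetic
        "distinct_lower_alpha": len(set(w.lower() for w in tokens if w.isalpha())),
    }

# hypothetical sample tokens standing in for text1
sample = ["The", "whale", ",", "the", "Whale", ";", "1851", "the"]
sizes = vocab_sizes(sample)
print(sizes)
# {'tokens': 8, 'distinct': 7, 'distinct_lower': 5, 'distinct_lower_alpha': 2}
```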


<a name="LoopingWithConditions"></a>
### 4.3 Looping with Conditions

In [53]:
# keep the words that contain either of these two character sequences
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
# for every word that meets the condition
for word in tricky:
    # print the word followed by a tab instead of a newline
    # (by default, print puts each word on a new line)
    print(word, end='\t')

ancient	ceiling	conceit	conceited	conceive	conscience	conscientious	conscientiously	deceitful	deceive	deceived	deceiving	deficiencies	deficiency	deficient	delicacies	excellencies	fancied	insufficiency	insufficient	legacies	perceive	perceived	perceiving	prescience	prophecies	receipt	receive	received	receiving	society	species	sufficient	sufficiently	undeceive	undeceiving	
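
The generator expression above combines a loop and a condition in one line. The sketch below spells the same selection out as an explicit `for` loop with an `if`, on a small hand-made set (`tokens` is an assumption standing in for `set(text2)`). It also shows `"\t".join(...)` as one possible alternative to `print(word, end='\t')` that leaves no trailing tab.

```python
# hypothetical sample tokens standing in for set(text2)
tokens = {"receive", "ancient", "whale", "ceiling", "society", "sea"}

# one-line version, as in the cell above
tricky = sorted(w for w in tokens if "cie" in w or "cei" in w)

# the same selection as an explicit loop with a condition
tricky_loop = []
for w in sorted(tokens):
    if "cie" in w or "cei" in w:
        tricky_loop.append(w)

print(tricky)       # ['ancient', 'ceiling', 'receive', 'society']
print(tricky_loop)  # ['ancient', 'ceiling', 'receive', 'society']

# join the words with tabs; unlike end='\t', no tab follows the last word
line = "\t".join(tricky)
print(line)
```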