# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 01/2022

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 1 [here](https://www.nltk.org/book/ch01.html).

# CONTENT

1. Language Processing and Python
    1. Computing with Language: Texts and Words
    2. [A Closer Look at Python: Texts as Lists of Words](#ListsofWords)
        1. [Lists](#Lists) <br>
        1. [Indexing Lists](#Indeces) <br>
        1. [Strings](#Strings) <br>

**Install**, **import** and **download NLTK**. <br>

*Uncomment the `!pip install nltk` and `nltk.download()` lines (lines 2 and 8 of the cell below) if you haven't installed NLTK and downloaded its data yet.*

In [1]:
# install nltk
#!pip install nltk

# load nltk
import nltk

# download nltk
#nltk.download()

Load all items (nine texts and nine sentences) from **NLTK's book module**.

In [2]:
# load all items from NLTK’s book module.
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


<a name="ListsofWords"></a>
## 1.2 A Closer Look at Python: Texts as Lists of Words

<a name="Lists"></a>
### 2.1 Lists

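`set(list)` returns a **set** holding the unique elements of the list, i.e. it removes duplicates. A set is unordered, so the original word order is not preserved.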

In [3]:
# create a list with 200 words
word_list = text1[2300:2500]
# print the length of the list
print(len(word_list))
# remove duplicates by converting the list to a set
new_word_list = set(word_list)
# count the unique words in the set
len(new_word_list)

200


125
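Because a set has no order, it is often converted back into a list afterwards. The sketch below shows two common ways to do that; the sample `words` list is made up for illustration and is not taken from the NLTK book.

```python
# hypothetical sample; any list of strings behaves the same way
words = ["the", "whale", "the", "sea", "whale", "ship"]

# sorted() turns the set back into an alphabetically ordered list
unique_sorted = sorted(set(words))
print(unique_sorted)    # ['sea', 'ship', 'the', 'whale']

# dict.fromkeys() keeps the first occurrence of each word in its original order
unique_in_order = list(dict.fromkeys(words))
print(unique_in_order)  # ['the', 'whale', 'sea', 'ship']
```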

We can **combine the elements of two separate lists into a single list** using **concatenation**.

**Syntax**: `list_1 + list_2`

In [46]:
list_1 = [1, 2, 3]
list_2 = [4, 5, 6]
list_1 + list_2

[1, 2, 3, 4, 5, 6]
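Note that `+` builds a new list and leaves both operands untouched. A short sketch contrasting it with the in-place `extend()` method (the variable name `combined` is chosen here for illustration):

```python
list_1 = [1, 2, 3]
list_2 = [4, 5, 6]

# + returns a new list; list_1 and list_2 are unchanged
combined = list_1 + list_2
print(combined)  # [1, 2, 3, 4, 5, 6]
print(list_1)    # [1, 2, 3]

# extend() appends the items of list_2 to list_1 in place
list_1.extend(list_2)
print(list_1)    # [1, 2, 3, 4, 5, 6]
```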

<a name="Indeces"></a>
### 2.2 Indexing Lists
We can find the element at the specified position of a list.

>The number that represents this position is the item's **index**.

**Syntax**: `text[index]`

In [47]:
# get the word at index 1501
text1[1501]

'That'

We can also do the opposite: find the index of the **first occurrence** of a specified element.

**Syntax**: `text.index(element)`

In [48]:
# find the index of the specified word
text1.index("That")

1501
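On a plain Python list, `index()` reports only the first match and raises a `ValueError` when the element is missing. The sketch below illustrates both behaviours with a made-up `sentence` list; the optional start argument shown here belongs to the built-in `list.index()` and is not assumed to be available on NLTK `Text` objects.

```python
# hypothetical sample sentence stored as a list of words
sentence = ["call", "me", "Ishmael", "call", "me", "soon"]

# index() returns the position of the first match only
first_call = sentence.index("call")
print(first_call)   # 0

# an optional second argument starts the search at a later position
second_call = sentence.index("call", first_call + 1)
print(second_call)  # 3

# index() raises ValueError for a missing word, so check membership first
word = "whale"
position = sentence.index(word) if word in sentence else -1
print(position)     # -1
```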

**Slicing** permits us to **access sublists**, i.e. manageable pieces of language from large texts.

**Syntax**: `text[m:n]` (returns the elements from index `m` up to and including index `n-1`)

>`text[0:5]` returns a total of 5 words. **The second value**, i.e. the stop index 5, **is exclusive**. Hence, the slice contains the elements at indexes 0, 1, 2, 3 and 4, but not the element at index 5!

In [50]:
# get the first 10 words of text1
print(text1[:10])

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.']
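Either bound of a slice can be omitted, and negative indexes count from the end of the list. A small sketch on a made-up `words` list (a plain Python list, used here instead of `text1` so the example runs without NLTK data):

```python
# hypothetical sample list of words
words = ["Moby", "Dick", "by", "Herman", "Melville", "1851"]

middle = words[1:4]     # indexes 1, 2 and 3; index 4 is excluded
print(middle)           # ['Dick', 'by', 'Herman']

first_two = words[:2]   # omitted start means "from the beginning"
print(first_two)        # ['Moby', 'Dick']

from_four = words[4:]   # omitted stop means "to the end"
print(from_four)        # ['Melville', '1851']

last_two = words[-2:]   # negative indexes count from the end
print(last_two)         # ['Melville', '1851']

# a slice [m:n] always holds n - m items (when both are within range)
print(len(middle))      # 3
```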


We can **replace an entire slice** with new elements.

**Syntax**: `list[m:n] = new_elements` (the replacement may be longer or shorter than the slice it replaces)

In [54]:
my_list = [1, 2, 3, 4, 5]
# replace the items at indexes 1 and 2 with three new items
my_list[1:3] = [0, 9, 8]
# print list
print(my_list)

# replace everything from index 1 onward with three new items
my_list[1:] = [0, 9, 8]
# print list
print(my_list)

[1, 0, 9, 8, 4, 5]
[1, 0, 9, 8]
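Since the replacement does not need to match the length of the slice, slice assignment can also shrink a list, delete items, or insert items. A brief sketch with an illustrative `numbers` list:

```python
numbers = [1, 2, 3, 4, 5]

# replace two items (indexes 1 and 2) with a single item: the list shrinks
numbers[1:3] = [0]
print(numbers)  # [1, 0, 4, 5]

# assigning an empty list deletes the slice
numbers[1:3] = []
print(numbers)  # [1, 5]

# assigning to an empty slice inserts without replacing anything
numbers[1:1] = [7, 8]
print(numbers)  # [1, 7, 8, 5]
```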


<a name="Strings"></a>
### 2.3 Strings
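**Indexing**, **slicing** and **concatenation** work the same way on strings as on lists. The one difference is that strings are immutable, so we cannot assign to an index or a slice of a string.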
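A quick sketch of these operations on a string, with variable names chosen for illustration. The last step shows that, unlike a list, a string cannot be changed in place:

```python
name = "Moby"

first_letter = name[0]       # indexing
print(first_letter)          # M

middle = name[1:3]           # slicing: indexes 1 and 2
print(middle)                # ob

full_title = name + " Dick"  # concatenation
print(full_title)            # Moby Dick

# strings are immutable: item assignment raises TypeError
assignment_failed = False
try:
    name[0] = "T"
except TypeError:
    assignment_failed = True
print(assignment_failed)     # True
```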

We can **join two words into a single string** with a specified delimiter.

**Syntax**: `"delimiter".join([word1, word2])`

In [57]:
# join the two strings with "-" as the delimiter and assign it to a variable
moby_dick = "-".join(["Moby", "Dick"])
print(moby_dick)

Moby-Dick


We can **tokenize** a text, i.e. **split the text into individual words** with a specified delimiter.

**Syntax**: `text.split("delimiter")`

In [58]:
# split the phrase into individual words with '-' as the delimiter
moby_dick.split("-")

['Moby', 'Dick']
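Called without an argument, `split()` breaks on any run of whitespace, while an explicit delimiter is matched literally. A minimal sketch on a made-up `phrase` (note the three spaces before "Ishmael"):

```python
# hypothetical sample phrase with irregular spacing
phrase = "Call me   Ishmael ."

# no argument: split on any run of whitespace
by_whitespace = phrase.split()
print(by_whitespace)  # ['Call', 'me', 'Ishmael', '.']

# explicit delimiter: consecutive delimiters produce empty strings
by_space = phrase.split(" ")
print(by_space)       # ['Call', 'me', '', '', 'Ishmael', '.']

# join() with the same delimiter reverses the explicit split
rebuilt = " ".join(by_space)
print(rebuilt == phrase)  # True
```

For simple word splitting, the argument-free form is usually the more convenient one, since it never returns empty strings.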