# Natural Language Processing (NLP)

In [1]:
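import nltk  # NLTK ships with the Anaconda distribution

nltk.download('omw-1.4')
nltk.download('tagsets')  # the package id is 'tagsets'; the misspelling 'tagasets' produces the "not found in index" error in the log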
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data] Error loading tagasets: Package 'tagasets' not found in
[nltk_data]     index
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...


True

## Introduction to NLP

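NLP is often considered harder than computer vision, largely because it is difficult to find a good numerical representation for text. Natural language is also ambiguous: the same word or sentence can mean different things depending on context.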
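To make the representation problem concrete, here is a minimal sketch using a bag-of-words count, one of the simplest text representations. The helper name `bag_of_words` and the two example sentences are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences map to the same counts even though they mean different things.
print(a == b)
```

The two sentences describe opposite events, yet this representation cannot tell them apart, which is one reason richer representations are needed.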

In [4]:
# Tokenizing text
# a token does not have to be a single word: it can be more (a whole sentence) or less (a punctuation mark or a clitic such as 'd)

from nltk.tokenize import sent_tokenize, word_tokenize

example_string = "All the speed he took, all the turns he'd taken"

tokenized = sent_tokenize(example_string)  # splits on sentence boundaries such as full stops; this string has none, so it stays one sentence
w_tokenized = word_tokenize(example_string)
print(tokenized)
print(w_tokenized)

["All the speed he took, all the turns he'd taken"]
['All', 'the', 'speed', 'he', 'took', ',', 'all', 'the', 'turns', 'he', "'d", 'taken']
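To get a feel for what a word tokenizer has to do, the following is a rough sketch of a regex-based tokenizer built only on the standard library. It is not how `word_tokenize` is implemented; the function name `simple_tokenize` and the pattern are assumptions chosen for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("All the speed he took, all the turns he'd taken")
print(tokens)
```

This naive pattern breaks `he'd` into three pieces (`he`, `'`, `d`), whereas `word_tokenize` keeps the clitic `'d` together, as the output above shows. Handling such cases is why dedicated tokenizers carry extra rules.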


In [11]:
# stop words are words you want to ignore because they add little meaningful insight to a sentence
from nltk.corpus import stopwords

example_string = "It's leviosa, not leviosaaaa"

words = word_tokenize(example_string)

# set the language for the stopwords
stop_words = set(stopwords.words('english'))
filtered_list = []

# SUGGESTION: see spaCy, an NLP library that fills a role similar to scikit-learn's in machine learning; it can analyze text once it has been extracted from sources such as PDFs

# let's remove the stop words
for word in words:
    if word.casefold() not in stop_words:  # casefold() makes the comparison case-insensitive
        filtered_list.append(word)
        
print(filtered_list)

["'s", 'leviosa', ',', 'leviosaaaa']


Content words give information about the topic, whereas context words give information about the writing style.
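The distinction can be sketched by partitioning a token list in two, assuming that context words correspond roughly to stop words. The tiny `TOY_STOP_WORDS` set and the helper `split_content_context` are hypothetical names introduced for this example; NLTK's English stop-word list is much longer.

```python
# Hypothetical, hand-picked stop-word set for illustration only.
TOY_STOP_WORDS = {"the", "of", "is", "what", "do"}

def split_content_context(tokens):
    """Return (content, context) lists, treating stop words as context words."""
    content = [t for t in tokens if t.casefold() not in TOY_STOP_WORDS]
    context = [t for t in tokens if t.casefold() in TOY_STOP_WORDS]
    return content, context

tokens = ["The", "crew", "of", "USS", "Discovery", "is", "exploring"]
content, context = split_content_context(tokens)
print(content)
print(context)
```

The first list hints at the topic (a crew, a ship, exploring), while the second contains only words that could appear in a text on any subject.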

### Stemming

Stemming is the process of reducing a word to its root (its stem) by stripping affixes. The stem does not have to be a valid dictionary word.

In [12]:
from nltk.stem import PorterStemmer 

In [16]:
stemmer = PorterStemmer()  # the result depends on the stemming algorithm; Porter is one of several

example_string = "The crew of USS Discovery discovered many discoveries. Discovering is what explorers do."

words = word_tokenize(example_string)

stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

['the', 'crew', 'of', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']
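To show how much the result depends on the algorithm, here is a toy suffix-stripping stemmer. It is a deliberately naive sketch and not the Porter algorithm; the function `toy_stem`, its suffix list, and the minimum stem length of 3 are assumptions made for this example.

```python
def toy_stem(word, suffixes=("ing", "ies", "ed", "es", "s", "y")):
    """Lowercase the word and strip the first matching suffix,
    as long as at least three characters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Discovery", "discovered", "discoveries", "Discovering"]
stems = [toy_stem(w) for w in words]
print(stems)
```

All four words collapse to `discover` here, while the Porter stemmer above produces `discoveri` and `discov` for the same words. Neither is wrong: each algorithm applies its own rules, and the stems only need to be consistent, not readable.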
