# **PRACTICE 2 - NLTK**
Text pre-processing and normalization
* Tokenization
* Normalization
* Lemmatization
* Stemming
* BONUS
* Exercise 2 - 16/03/2023

Basic concepts are taken from "Speech and Language Processing (3rd ed. draft)" by Dan Jurafsky and James H. Martin.

### Tokenization
Tokenization divides a string into a list of substrings (tokens). For example, a tokenizer can be used to find the words and punctuation marks in a string. Case folding, a simple kind of normalization, is often applied right after tokenization: mapping everything to lower case means that Woodchuck and woodchuck are represented identically, which is very helpful for generalization in many tasks, such as information retrieval or speech recognition.
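
As a rough illustration of the idea, the sketch below splits a string into words and punctuation marks with a regular expression and then lower-cases the result. The function name `simple_tokenize` and the pattern are assumptions made for this example; this is not how NLTK's `word_tokenize` works internally.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character at a time.
    return re.findall(r"\w+|[^\w\s]", text)

demo_tokens = simple_tokenize("Woodchuck and woodchuck, again!")
print(demo_tokens)

# Case folding: both spellings of "woodchuck" become identical.
demo_folded = [t.lower() for t in demo_tokens]
print(demo_folded)
```

A regular expression this simple mishandles contractions, abbreviations and hyphenated words, which is one reason to prefer a trained tokenizer in practice.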

### Normalization
Word normalization is the task of putting words/tokens in a standard format, choosing a single normal form for words with multiple forms like USA and US or uh-huh and uhhuh.
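
One simple way to picture this is a lookup table from each variant to the chosen normal form. The table `CANONICAL_FORMS` and the helper `normalize_token` below are hypothetical names with hand-picked entries, used only to illustrate the idea.

```python
# Hypothetical lookup table: variant spelling -> chosen normal form.
CANONICAL_FORMS = {
    "US": "USA",
    "U.S.": "USA",
    "U.S.A.": "USA",
    "uhhuh": "uh-huh",
}

def normalize_token(token):
    """Return the normal form of a token, or the token itself if unknown."""
    return CANONICAL_FORMS.get(token, token)

demo_normalized = [normalize_token(t) for t in ["US", "USA", "uhhuh", "us"]]
print(demo_normalized)
```

The lookup is deliberately case-sensitive: lower-casing first would make the country name US indistinguishable from the pronoun us, so the order of normalization steps matters.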

### Lemmatization
Lemmatization is the task of determining that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing, so a lemmatizer maps all of them to the lemma sing.
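
The toy sketch below shows why lemmatization needs lexical knowledge: irregular forms such as sang cannot be recovered by rules alone, so they are looked up. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names with a hand-written mini-lexicon; a real lemmatizer relies on a full dictionary such as WordNet.

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "sang": "sing",
    "sung": "sing",
    "sings": "sing",
    "singing": "sing",
}

def toy_lemmatize(word):
    """Look the word up in the mini-lexicon; unknown words are returned lower-cased."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

demo_lemmas = [toy_lemmatize(w) for w in ["sang", "sung", "sings", "map"]]
print(demo_lemmas)
```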

### Stemming
Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
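
A minimal sketch of suffix stripping is given below. It is not the Porter algorithm, only an illustration: the suffix list, the minimum stem length of three characters, and the name `toy_stem` are all assumptions made for this example.

```python
# Suffixes to strip, checked in this order (an arbitrary, illustrative choice).
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["waiting", "waited", "waits", "wait", "sang", "sing"]:
    print(w, "->", toy_stem(w))
```

Regular forms collapse onto wait, but sang is left unchanged because no suffix rule applies to it. This is the kind of case that lemmatization handles and stemming does not.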


In [None]:
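import nltk

# nltk.data.find expects the resource path inside the NLTK data directory
# (e.g. 'tokenizers/punkt'), not the bare package name.
# Note: recent NLTK releases load the tokenizer from 'punkt_tab' instead;
# add it here if word_tokenize raises a LookupError.
resources = {
    'punkt': 'tokenizers/punkt',
    'wordnet': 'corpora/wordnet',
    'stopwords': 'corpora/stopwords',
}

for package, path in resources.items():
    try:
        nltk.data.find(path)
    except LookupError:
        # Download only the packages that are not installed yet.
        nltk.download(package)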

## Tokenization

In [None]:
import nltk
from nltk.tokenize import word_tokenize

sample_text = "This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes."

tokens = word_tokenize(sample_text)
print(f'Tokens: {tokens}')

## Normalization

In [None]:
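import string

# Case folding: map every token to lower case.
normalized_tokens = [word.lower() for word in tokens]
print(f'Lowercased tokens: {normalized_tokens}')

# Strip punctuation characters from each token. string.punctuation already
# contains the apostrophe, so no extra entry is needed.
punct_table = str.maketrans('', '', string.punctuation)
normalized_tokens = [w.translate(punct_table) for w in normalized_tokens]
# Drop the tokens that became empty (they were pure punctuation).
normalized_tokens = [w for w in normalized_tokens if w]
print(f'Tokens without punctuation: {normalized_tokens}')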

## Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in normalized_tokens]
print(f'Lemmatized tokens: {lemmatized_tokens}')

## Stemming
In the following cells we use the Porter stemmer.

In [None]:
from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]

ps = PorterStemmer()
for w in e_words:
    root_word = ps.stem(w)
    print(root_word)

In [None]:
stemmed_words = [ps.stem(w) for w in normalized_tokens]
print(f'Stemmed words: {stemmed_words}')
# Stem the lemmatized tokens as well, to compare the two pipelines.
lemmed_stemmed_words = [ps.stem(w) for w in lemmatized_tokens]
print(f'Lemmatized and stemmed words: {lemmed_stemmed_words}')

## BONUS - Stopword removal
In the past it was common to remove high-frequency words from both
the query and document before representing them. The list of such high-frequency
words to be removed is called a stop list. The intuition is that high-frequency terms
(often function words like the, a, to) carry little semantic weight and may not help
with retrieval.
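
As a small illustration, the sketch below filters a token list against a hand-written stop list. `STOP_LIST` and `remove_stop_words` are hypothetical names, and the seven entries are chosen only for this example; NLTK's English stop list is considerably longer.

```python
# Tiny hand-written stop list, for illustration only.
STOP_LIST = {"the", "a", "an", "to", "of", "in", "and"}

def remove_stop_words(tokens):
    """Keep only the tokens whose lower-cased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_LIST]

demo_content_words = remove_stop_words(["The", "map", "of", "the", "island"])
print(demo_content_words)
```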

In [None]:
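import nltk
import nltk.collocations as collocations
from nltk.corpus import brown
import string

# The Brown corpus is read below, so it must be downloaded as well.
nltk.download('stopwords')
nltk.download('brown')

# Sets give constant-time membership tests inside the filter.
ignored_words = set(nltk.corpus.stopwords.words('english'))
punctuations = set(string.punctuation)

bigram_measures = collocations.BigramAssocMeasures()
bigrams_finder = collocations.BigramCollocationFinder.from_words(brown.words())
# Drop bigrams containing a stop word or a token made only of punctuation.
bigrams_finder.apply_word_filter(
    lambda w: w.lower() in ignored_words or all(ch in punctuations for ch in w)
)
# The 20 bigrams with the highest pointwise mutual information (PMI).
bigrams_finder.nbest(bigram_measures.pmi, 20)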


# Exercise 2 - 16/03/2023

Repeat Exercise 1 - 16/03/2023 from the notebook "23_March-16_nltk_practice_1", but this time apply all the pre-processing and normalization techniques covered above. Compare the results obtained with and without these techniques.