# Introduction to NLP in Python
## Quest 1: NLP Basics for Text Preprocessing

### Tokenization

Tokenizers divide strings into lists of substrings. After installing the `nltk` library, let's import it along with two of its built-in functions, `sent_tokenize` and `word_tokenize`.

In [1]:
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

1. `sent_tokenize`

The first function, `sent_tokenize`, splits the given text into sentences. This is especially useful when you are dealing with larger chunks of text made up of longer sentences.

We will use the following sample paragraph, a short story about a little girl who dreams of becoming a ballerina. Run the cell below to tokenize it into sentences.

In [2]:
sample = 'Once upon a time, there was a little girl who loved to dance. She would spin and twirl around her room every day, dreaming of becoming a ballerina. One day, a famous ballet teacher saw her dancing and offered to train her. From then on, the little girl\'s dreams came true as she danced on stages all around the world.'
sentence_tokens = sent_tokenize(sample)

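If you encounter the "Resource punkt not found" error when running the above cell, run `nltk.download('punkt')` and then re-run the cell. Newer NLTK releases may name a different resource (for example `punkt_tab`); in that case, download the resource named in the error message.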
<br/><br/>

2. `word_tokenize`

Likewise, the `word_tokenize` function splits the paragraph into individual words, with punctuation marks returned as separate tokens. Run the cell below to compare the outputs.

In [3]:
word_tokens = word_tokenize(sample)

Feel free to experiment by passing different sentences and pieces of text through each tokenizer.

The nltk library includes many more tokenizers, each suited to a different kind of text or token. You can learn more about tokenizers from the nltk documentation [here](https://www.nltk.org/api/nltk.tokenize.html).

Return to the StackUp platform, where we will continue with the quest.
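
As a rough illustration of how other tokenizers differ, the sketch below runs three of them from `nltk.tokenize` on one short sentence. The sentence and variable names are made up for this example, and none of these three tokenizers needs a downloaded resource.

```python
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer, wordpunct_tokenize

# Illustrative sentence, adapted from the sample paragraph.
text = "The little girl's dreams came true."

# Split on whitespace only: punctuation stays attached to the words.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)

# Keep runs of word characters only: punctuation is dropped entirely.
word_only_tokens = RegexpTokenizer(r"\w+").tokenize(text)

# Separate runs of word characters from runs of punctuation.
wordpunct_tokens = wordpunct_tokenize(text)

print(whitespace_tokens)
print(word_only_tokens)
print(wordpunct_tokens)
```

Notice how the apostrophe in "girl's" is handled differently by each tokenizer; which behaviour is preferable depends on what the tokens are used for afterwards.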

<br/><br/>

### Removing stop words

Stop words are common words that add little meaning to the text. English stop words include conjunctions such as for, and, but, or, yet, so, and articles such as a, an, the.

NLTK has a predefined list of stop words for English. Let's go ahead and import it by running the cell below.

In [4]:
# Run this line if the stop word corpus has yet to be downloaded.
# nltk.download('stopwords')
from nltk.corpus import stopwords
# Use a separate name so the imported stopwords module is not overwritten.
stop_words = set(stopwords.words('english'))

The set `stop_words` now contains the NLTK predefined stop words. Using the tokenized text from earlier, let's remove the stop words and keep the remaining tokens.

In [5]:
stopwords_removed = [i for i in word_tokens if i not in stop_words]
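
One thing to watch for: the NLTK stop words are all lowercase, so the membership test above is case-sensitive and keeps capitalised tokens such as "Once", as well as punctuation. The sketch below shows one possible way to handle both. To stay self-contained it uses a small hand-picked stand-in set (`demo_stop_words`, an assumption for this example) instead of the full NLTK set; in the notebook you would use the set built in the cell above.

```python
# Tokens from the first sentence of the sample paragraph.
tokens = ['Once', 'upon', 'a', 'time', ',', 'there', 'was', 'a',
          'little', 'girl', 'who', 'loved', 'to', 'dance', '.']

# Hypothetical stand-in for the NLTK stop word set (all lowercase).
demo_stop_words = {'once', 'a', 'there', 'was', 'who', 'to'}

# Case-sensitive filter, as in the cell above: 'Once' and punctuation survive.
case_sensitive = [t for t in tokens if t not in demo_stop_words]

# Lowercase each token before the lookup and keep alphabetic tokens only.
case_insensitive = [
    t for t in tokens
    if t.isalpha() and t.lower() not in demo_stop_words
]

print(case_sensitive)
print(case_insensitive)
```

Whether to drop punctuation at this stage is a design choice; some later steps, such as sentence-level analysis, may still need it.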

Now, let's head back to the StackUp platform, where we cover the third preprocessing technique in this quest.

<br/><br/>

### Stemming and Lemmatization

Here, we will experiment with the `PorterStemmer` and the `WordNetLemmatizer`. Recall from the quest that stemming removes the suffix from a word, while lemmatization takes into account the context and the meaning of the word in the sentence.

Play around with different words to compare the outputs produced by a stemmer and a lemmatizer!

In [6]:
# run these lines if they have yet to be downloaded.
# once downloaded, you can comment out the lines.
# nltk.download('wordnet')
# nltk.download('omw-1.4')

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()

Let's apply both methods to the tokens that remain after stop word removal.

In [7]:
sample_stem = [stemmer.stem(token) for token in stopwords_removed]
sample_lemma = [lemma.lemmatize(token) for token in stopwords_removed]

Compare the results produced above! The lemmatizer is more accurate when it comes to getting the root word of more complex plurals, because it returns dictionary forms rather than truncated stems. Note that `lemmatize` treats every token as a noun unless you pass a `pos` argument, so verb forms such as "dancing" come back unchanged here. In the case of a large dataset where performance is an issue, however, stemming comes in handy.

And that sums up the 3 techniques for text preprocessing in NLP! **Return to the StackUp platform,** where we wrap up the quest and prepare the deliverables for submission.
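
To see what suffix stripping looks like in isolation, here is a minimal sketch that uses only the `PorterStemmer`, which needs no downloaded resources. The word list is made up for this example, with words taken from the sample paragraph.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few inflected words from the sample paragraph.
words = ["dancing", "danced", "dreams", "stages"]

# Map each word to its stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Both "dancing" and "danced" reduce to the same stem, `danc`, which is not a dictionary word. That is the trade-off of stemming: related forms are grouped cheaply, but the result is not always readable.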

In [8]:
print(sample, "\n")
print(sentence_tokens, "\n")
print(word_tokens, "\n")
print(stopwords_removed, "\n")
print("Stemming results: ", sample_stem, "\n")
print("Lemmatization results: ", sample_lemma, "\n")

Once upon a time, there was a little girl who loved to dance. She would spin and twirl around her room every day, dreaming of becoming a ballerina. One day, a famous ballet teacher saw her dancing and offered to train her. From then on, the little girl's dreams came true as she danced on stages all around the world. 

['Once upon a time, there was a little girl who loved to dance.', 'She would spin and twirl around her room every day, dreaming of becoming a ballerina.', 'One day, a famous ballet teacher saw her dancing and offered to train her.', "From then on, the little girl's dreams came true as she danced on stages all around the world."] 

['Once', 'upon', 'a', 'time', ',', 'there', 'was', 'a', 'little', 'girl', 'who', 'loved', 'to', 'dance', '.', 'She', 'would', 'spin', 'and', 'twirl', 'around', 'her', 'room', 'every', 'day', ',', 'dreaming', 'of', 'becoming', 'a', 'ballerina', '.', 'One', 'day', ',', 'a', 'famous', 'ballet', 'teacher', 'saw', 'her', 'dancing', 'and', 'offered',