# Introduction to NLP in Python
## Quest 1: NLP Basics for Text Preprocessing

### Tokenization

Tokenizers divide strings into lists of substrings. After installing the nltk library, let's import it along with two of its built-in functions, *sent_tokenize* and *word_tokenize*.

In [1]:
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

1. `sent_tokenize`

The first function, `sent_tokenize`, splits the given text into sentences. This is especially useful when you are working with larger chunks of text that contain many sentences.

We will make use of the following sample paragraph about NLP in the healthcare industry. Run the cell below to check out the output.

In [2]:
#nltk.download('punkt')
text = 'In the healthcare industry, NLP is used to analyze large amounts of healthcare-related data. This includes clinical notes and medical imaging reports, and many more. With the help of NLP, healthcare providers can quickly and accurately identify patterns and insights from patient data. For example, NLP can predict patient outcomes, such as the likelihood of readmission or the risk of developing a particular condition. NLP can also be used to extract key information from medical imaging reports. An example can be the size and location of tumours. This can help healthcare providers make more informed treatment decisions. Overall, NLP is a powerful tool that can help improve patient outcomes and enhance the quality of care in the healthcare industry.'
sent_tokenize(text)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


['In the healthcare industry, NLP is used to analyze large amounts of healthcare-related data.',
 'This includes clinical notes and medical imaging reports, and many more.',
 'With the help of NLP, healthcare providers can quickly and accurately identify patterns and insights from patient data.',
 'For example, NLP can predict patient outcomes, such as the likelihood of readmission or the risk of developing a particular condition.',
 'NLP can also be used to extract key information from medical imaging reports.',
 'An example can be the size and location of tumours.',
 'This can help healthcare providers make more informed treatment decisions.',
 'Overall, NLP is a powerful tool that can help improve patient outcomes and enhance the quality of care in the healthcare industry.']

If you encounter the "Resource punkt not found" error when running the above cell, run `nltk.download('punkt')` first. Newer NLTK releases may ask for `punkt_tab` instead; in that case, download whichever resource the error message names.
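To see why a dedicated sentence tokenizer is worth using, here is a minimal sketch of a naive splitter that breaks on end-of-sentence punctuation followed by whitespace. The function name `naive_sentence_split` and the example string are illustrative assumptions, not part of NLTK or the quest.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

example = "Dr. Lee reviewed the scan. The tumour measured 2.5 cm."

# The abbreviation "Dr." is wrongly treated as the end of a sentence.
print(naive_sentence_split(example))
# ['Dr.', 'Lee reviewed the scan.', 'The tumour measured 2.5 cm.']
```

Try passing the same string to `sent_tokenize` and compare the result; its pretrained Punkt model is designed to cope with common abbreviations like this one.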
<br/><br/>

2. `word_tokenize`

Likewise, the `word_tokenize` function splits the paragraph into individual word and punctuation tokens. Run the cell below to compare the outputs.

In [3]:
word_tokenize(text)

['In',
 'the',
 'healthcare',
 'industry',
 ',',
 'NLP',
 'is',
 'used',
 'to',
 'analyze',
 'large',
 'amounts',
 'of',
 'healthcare-related',
 'data',
 '.',
 'This',
 'includes',
 'clinical',
 'notes',
 'and',
 'medical',
 'imaging',
 'reports',
 ',',
 'and',
 'many',
 'more',
 '.',
 'With',
 'the',
 'help',
 'of',
 'NLP',
 ',',
 'healthcare',
 'providers',
 'can',
 'quickly',
 'and',
 'accurately',
 'identify',
 'patterns',
 'and',
 'insights',
 'from',
 'patient',
 'data',
 '.',
 'For',
 'example',
 ',',
 'NLP',
 'can',
 'predict',
 'patient',
 'outcomes',
 ',',
 'such',
 'as',
 'the',
 'likelihood',
 'of',
 'readmission',
 'or',
 'the',
 'risk',
 'of',
 'developing',
 'a',
 'particular',
 'condition',
 '.',
 'NLP',
 'can',
 'also',
 'be',
 'used',
 'to',
 'extract',
 'key',
 'information',
 'from',
 'medical',
 'imaging',
 'reports',
 '.',
 'An',
 'example',
 'can',
 'be',
 'the',
 'size',
 'and',
 'location',
 'of',
 'tumours',
 '.',
 'This',
 'can',
 'help',
 'healthcare',
 'pro

Additionally, feel free to experiment with different sentences and pieces of text, and pass them through each tokenizer.

The nltk library includes many more tokenizers, each suited to different kinds of text and to the type of tokens you need. You can learn more about tokenizers from the nltk documentation [here](https://www.nltk.org/api/nltk.tokenize.html).
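As a small illustration of how other tokenizers behave, the sketch below uses two that ship in `nltk.tokenize` and need no extra downloads: `RegexpTokenizer` and `wordpunct_tokenize`. The example sentence is made up for this demonstration.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sentence = "NLP can't parse healthcare-related notes, can it?"

# Keep only runs of word characters; punctuation is dropped entirely.
word_only = RegexpTokenizer(r"\w+").tokenize(sentence)

# Split into runs of word characters and runs of punctuation.
word_punct = wordpunct_tokenize(sentence)

print(word_only)
# ['NLP', 'can', 't', 'parse', 'healthcare', 'related', 'notes', 'can', 'it']
print(word_punct)
# ['NLP', 'can', "'", 't', 'parse', 'healthcare', '-', 'related', 'notes', ',', 'can', 'it', '?']
```

Notice that both split `healthcare-related` into separate pieces, whereas `word_tokenize` kept it as one token in the output above. Which behaviour is preferable depends on the task.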

Return to the StackUp platform, where we will continue with the quest.

<br/><br/>

### Removing stop words

Stop words are common words that add little meaning to the text. Stop words in English include conjunctions such as for, and, but, or, yet, so, and articles such as a, an, the.

NLTK ships a predefined list of English stop words. Let's import it by running the cell below.

In [5]:
#nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


The set `stopwords` now contains the NLTK predefined stop words (note that the assignment reuses the name of the imported corpus object). Using the tokenized text from earlier, let's remove the stop words and return the remaining tokens.

In [6]:
tokens = word_tokenize(text)
tokens_no_stopwords = [i for i in tokens if i not in stopwords]
print(tokens_no_stopwords)

['In', 'healthcare', 'industry', ',', 'NLP', 'used', 'analyze', 'large', 'amounts', 'healthcare-related', 'data', '.', 'This', 'includes', 'clinical', 'notes', 'medical', 'imaging', 'reports', ',', 'many', '.', 'With', 'help', 'NLP', ',', 'healthcare', 'providers', 'quickly', 'accurately', 'identify', 'patterns', 'insights', 'patient', 'data', '.', 'For', 'example', ',', 'NLP', 'predict', 'patient', 'outcomes', ',', 'likelihood', 'readmission', 'risk', 'developing', 'particular', 'condition', '.', 'NLP', 'also', 'used', 'extract', 'key', 'information', 'medical', 'imaging', 'reports', '.', 'An', 'example', 'size', 'location', 'tumours', '.', 'This', 'help', 'healthcare', 'providers', 'make', 'informed', 'treatment', 'decisions', '.', 'Overall', ',', 'NLP', 'powerful', 'tool', 'help', 'improve', 'patient', 'outcomes', 'enhance', 'quality', 'care', 'healthcare', 'industry', '.']
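Notice that capitalised words such as 'In', 'This' and 'An' survive in the output above: NLTK's English stop words are all lowercase and the membership test is case-sensitive. Punctuation tokens are kept as well. The sketch below shows one possible way to handle both, using a tiny hand-written stop word set as a stand-in for the NLTK list so that it runs without any downloads; `is_punctuation` is a hypothetical helper, not an NLTK function.

```python
import string

# Tiny hand-written stop word set, standing in for NLTK's English list.
stop_words = {"in", "the", "is", "to", "this", "and"}

tokens = ["In", "the", "healthcare", "industry", ",", "NLP", "is", "used", ".",
          "This", "includes", "clinical", "notes", "."]

def is_punctuation(token):
    """True when every character in the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Lowercase each token only for the comparison, so the original casing is kept.
filtered = [t for t in tokens
            if t.lower() not in stop_words and not is_punctuation(t)]
print(filtered)
# ['healthcare', 'industry', 'NLP', 'used', 'includes', 'clinical', 'notes']
```

Whether to drop punctuation depends on the task, so treat this as an optional extra step rather than part of the quest's required output.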


Now, let's head back to the StackUp platform, where we cover the third preprocessing technique in this quest.

<br/><br/>

### Stemming and Lemmatization

Here, we will experiment with the `PorterStemmer` and the `WordNetLemmatizer`. Recall from the quest that stemming strips suffixes from a word using fixed rules, while lemmatization takes into account the context and what the word means in the sentence in order to return its dictionary form.

Play around with different words to compare the outputs produced by a stemmer and a lemmatizer!

In [7]:
# run these lines if they have yet to be downloaded.
# once downloaded, you can comment out the lines.
#nltk.download('wordnet')
#nltk.download('omw-1.4')

from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemma = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\harsh\AppData\Roaming\nltk_data...
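As a quick sanity check of suffix stripping on its own, the sketch below runs the Porter stemmer over several inflected forms of one word. The word list is an illustrative choice; the stemmer needs no extra NLTK downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Different surface forms that should collapse to a single stem.
words = ["connect", "connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(w) for w in words]
print(stems)
# ['connect', 'connect', 'connect', 'connect', 'connect']
```

Collapsing related forms like this is what makes stemming useful for tasks such as search, even though the stem is not always a real word.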


Let's test both methods on various pluralised words.

In [8]:
plurals = ['apples', 'octopuses', 'categories', 'criteria', 'tomatoes', 'matrices', 'hypotheses', 'radii', 'algae', 'cacti']

plurals_stem = [stemmer.stem(plural) for plural in plurals]
plurals_lemma = [lemma.lemmatize(plural) for plural in plurals]

print("Stemming results: ", plurals_stem)
print("Lemmatization results: ", plurals_lemma)

Stemming results:  ['appl', 'octopus', 'categori', 'criteria', 'tomato', 'matric', 'hypothes', 'radii', 'alga', 'cacti']
Lemmatization results:  ['apple', 'octopus', 'category', 'criterion', 'tomato', 'matrix', 'hypothesis', 'radius', 'algae', 'cactus']


Compare the results above! The lemmatizer is more accurate at recovering the root word of irregular plurals. However, on a large dataset where performance is an issue, stemming comes in handy because it is generally faster.

And that sums up the three techniques for text preprocessing in NLP! **Return to the StackUp platform,** where we wrap up the quest and prepare the deliverables for submission.

# Deliverables

In [32]:
sample = "Both stemming and lemmatization are techniques used in natural language processing to reduce words to their base form. Stemming involves removing suffixes and prefixes from words to create a stem, which may not always be a real word. In contrast, lemmatization involves reducing words to their base form, which is always a real word and takes into account the context and part of speech of the word. While stemming is generally faster, it may not always produce accurate results. Lemmatization is slower but more accurate and is often preferred for applications that require high precision."
sentence_token = sent_tokenize(sample)
words_token = word_tokenize(sample)
stopwords_removed = [i for i in words_token if i not in stopwords]
sample_stem = [stemmer.stem(token) for token in stopwords_removed]
sample_lemma = [lemma.lemmatize(token) for token in stopwords_removed]

In [33]:
print(sample, "\n")
print(sentence_token,"\n")
print(words_token, "\n")
print(stopwords_removed, "\n")
print("Stemming results: ", sample_stem,"\n")
print("Lemmatization results: ", sample_lemma,"\n")

Both stemming and lemmatization are techniques used in natural language processing to reduce words to their base form. Stemming involves removing suffixes and prefixes from words to create a stem, which may not always be a real word. In contrast, lemmatization involves reducing words to their base form, which is always a real word and takes into account the context and part of speech of the word. While stemming is generally faster, it may not always produce accurate results. Lemmatization is slower but more accurate and is often preferred for applications that require high precision. 

['Both stemming and lemmatization are techniques used in natural language processing to reduce words to their base form.', 'Stemming involves removing suffixes and prefixes from words to create a stem, which may not always be a real word.', 'In contrast, lemmatization involves reducing words to their base form, which is always a real word and takes into account the context and part of speech of the word.