# Getting Started with Natural Language Processing

## 1. Intro to NLP
Look at the technologies around us:

- Spellcheck and autocorrect
- Auto-generated video captions
- Virtual assistants like Amazon’s Alexa
- Autocomplete
- Your news site’s suggested articles

What do they have in common?

All of these handy technologies exist because of ***natural language processing***! Also known as ***NLP***, the field is at the intersection of linguistics, artificial intelligence, and computer science. The goal? Enabling computers to interpret, analyze, and approximate the generation of human languages (like English or Spanish).

NLP got its start around 1950 with Alan Turing’s test for artificial intelligence, which evaluates whether a computer can use language to fool humans into believing it’s human.

But approximating human speech is only one of a wide range of applications for NLP! Applications from detecting spam emails or bias in tweets to improving accessibility for people with disabilities all rely heavily on natural language processing techniques.

NLP can be conducted in several programming languages. However, Python has some of the most extensive open-source NLP libraries, including the [Natural Language Toolkit or NLTK](https://www.nltk.org/). Because of this, you’ll be using Python to get your first taste of NLP.

#### Instructions

Don’t worry if you don’t understand much of the content right now — you don’t need to at this point! This lesson is an introductory overview of some of the main topics in NLP and you’ll get to dive deeper into each topic on its own later.

![NLP](Natural_Language_Processing_Overview.gif)

## 2. Text Preprocessing
> "You never know what you have... until you clean your data."

~ Unknown (or possibly made up)


Cleaning and preparation are crucial for many tasks, and NLP is no exception. ***Text preprocessing*** is usually the first step you’ll take when faced with an NLP task.

Without preprocessing, your computer interprets `"the"`, `"The"`, and `"<p>The"` as entirely different words. There is a LOT you can do here, depending on the formatting you need. Lucky for you, [Regex](https://en.wikipedia.org/wiki/Regular_expression) and NLTK will do most of it for you! Common tasks include:

**Noise removal** — stripping text of formatting (e.g., HTML tags).

**Tokenization** — breaking text into individual words.
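
As a minimal sketch of these first two steps, here is a version that uses only Python's built-in `re` module rather than NLTK. The helper names `remove_noise` and `tokenize` are made up for this illustration, and real-world HTML is usually better handled by a proper parser than by a regular expression.

```python
import re

def remove_noise(text):
    """Replace anything that looks like an HTML tag (e.g. <p> or </p>) with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Split text into word tokens on runs of non-word characters."""
    # \W+ matches one or more characters that are not letters, digits, or "_"
    return [token for token in re.split(r"\W+", text) if token]

raw = "<p>The squid saw the dentist's bag.</p>"
clean = remove_noise(raw)
tokens = tokenize(clean)
print(tokens)
# ['The', 'squid', 'saw', 'the', 'dentist', 's', 'bag']
```

Notice that splitting on `\W+` turns `dentist's` into `dentist` and `s`, and that `The` and `the` are still different tokens at this point. NLTK's `word_tokenize()` treats apostrophes with more care, and lowercasing belongs to the normalization step.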

**Normalization** — cleaning text data in any other way:

- **Stemming** is a blunt axe to chop off word prefixes and suffixes. “booing” and “booed” become “boo”, but “sing” may become “s” and “sung” would remain “sung.”
- **Lemmatization** is a scalpel to bring words down to their root forms. For example, NLTK’s savvy lemmatizer knows “am” and “are” are related to “be.”
- Other common tasks include lowercasing, [stopword](https://en.wikipedia.org/wiki/Stop_words) removal, spelling correction, etc.
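
To make the axe-versus-scalpel contrast concrete, here is a toy sketch in plain Python. It is not how NLTK's `PorterStemmer` or `WordNetLemmatizer` work internally: `naive_stem` just strips a few hard-coded suffixes, and `LEMMAS` is a tiny hand-written lookup table standing in for a real dictionary. All of these names are invented for this example.

```python
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Chop off the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A hand-written stand-in for the dictionary a real lemmatizer consults.
LEMMAS = {"am": "be", "are": "be", "went": "go", "sung": "sing"}

def naive_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word, word)

words = ["Booing", "booed", "sing", "sung", "are"]
lowered = [word.lower() for word in words]  # lowercasing is normalization too

print([naive_stem(word) for word in lowered])
# ['boo', 'boo', 's', 'sung', 'are']
print([naive_lemmatize(word) for word in lowered])
# ['booing', 'booed', 'sing', 'sing', 'be']
```

The stemmer is cheap but mangles "sing" and misses "sung"; the lookup gets both right, but only for words someone has put in the table.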

#### ✅ Instructions
1. We used NLTK’s PorterStemmer to normalize the text — run the code to see how it does. (It may take a few seconds for the code to run.)

2. In the output terminal you’ll see our program counts `"go"` and `"went"` as different words! Also, what’s up with `"mani"` and `"hardli"`? A lemmatizer will fix this. Let’s do it.

Where `lemmatizer` is defined, replace `None` with `WordNetLemmatizer()`.

Where we defined `lemmatized`, replace the empty list with a list comprehension that uses `lemmatizer` to `lemmatize()` each token in `tokenized`.

(Don’t know Python that well? No problem. Just check the hints for help throughout the lesson.)

3. Why are the lemmatized verbs like `"went"` still conjugated? By default `lemmatize()` treats every word as a noun.

Give `lemmatize()` a second argument: `get_part_of_speech(token)`. This will tell our lemmatizer what part of speech the word is.

Run your code again to see the result!
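
The `part_of_speech` module itself is not shown in this lesson, so the following is only a guess at what a helper like `get_part_of_speech()` might do: ask WordNet for every sense of a word and pick the part of speech that occurs most often. The function `most_common_pos` is a hypothetical name, and the sketch assumes NLTK and its `wordnet` corpus are installed before `get_part_of_speech()` is called.

```python
from collections import Counter

def most_common_pos(pos_tags, default="n"):
    """Return the most frequent WordNet POS tag, or `default` if there are none."""
    # WordNet marks "satellite adjectives" as "s"; treat them as plain adjectives.
    normalized = ["a" if tag == "s" else tag for tag in pos_tags]
    if not normalized:
        return default
    return Counter(normalized).most_common(1)[0][0]

def get_part_of_speech(word):
    """Guess a word's part of speech from its WordNet senses (assumed behavior)."""
    # Imported here so that most_common_pos() stays usable without NLTK.
    from nltk.corpus import wordnet
    return most_common_pos([synset.pos() for synset in wordnet.synsets(word)])

print(most_common_pos(["v", "v", "n"]))  # v
print(most_common_pos([]))               # n
```

With a helper like this, `lemmatizer.lemmatize("went", get_part_of_speech("went"))` can look the word up as a verb and reduce it to `"go"`, whereas `lemmatizer.lemmatize("went")` looks it up as a noun and leaves it unchanged.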

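##### If you get an error regarding `punkt`
import nltk
nltk.download('punkt')

Recent NLTK releases may ask for `punkt_tab` instead; if so, download the resource named in the error message.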

In [10]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jesus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [16]:
# regex for removing punctuation!
import re
# nltk preprocessing magic
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# grabbing a part of speech function:
from part_of_speech import get_part_of_speech

text = "So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise. I went to the dentist the other day, and sure enough I saw an angry one jump out of my dentist's bag within minutes of arriving. She hardly even noticed."

cleaned = re.sub(r'\W+', ' ', text)
print(cleaned)
tokenized = word_tokenize(cleaned)
print('\n',tokenized)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
print("\nStemmed text:")
print(stemmed)
## -- CHANGE these -- ##
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

print("\nLemmatized text (every token treated as a noun):")
print(lemmatized)
# pass each token's part of speech so verbs like "went" are reduced as well
lemmatized = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]

print("\nLemmatized text (using each token's part of speech):")
print(lemmatized)

So many squids are jumping out of suitcases these days that you can barely go anywhere without seeing one burst forth from a tightly packed valise I went to the dentist the other day and sure enough I saw an angry one jump out of my dentist s bag within minutes of arriving She hardly even noticed 

 ['So', 'many', 'squids', 'are', 'jumping', 'out', 'of', 'suitcases', 'these', 'days', 'that', 'you', 'can', 'barely', 'go', 'anywhere', 'without', 'seeing', 'one', 'burst', 'forth', 'from', 'a', 'tightly', 'packed', 'valise', 'I', 'went', 'to', 'the', 'dentist', 'the', 'other', 'day', 'and', 'sure', 'enough', 'I', 'saw', 'an', 'angry', 'one', 'jump', 'out', 'of', 'my', 'dentist', 's', 'bag', 'within', 'minutes', 'of', 'arriving', 'She', 'hardly', 'even', 'noticed']

Stemmed text:
['So', 'mani', 'squid', 'are', 'jump', 'out', 'of', 'suitcas', 'these', 'day', 'that', 'you', 'can', 'bare', 'go', 'anywher', 'without', 'see', 'one', 'burst', 'forth', 'from', 'a', 'tightli', 'pack', 'valis', 'I',