# NATURAL LANGUAGE PROCESSING

**Natural Language Processing, or NLP**, refers to analytics tasks that deal with natural human language, in the form of text or speech. These tasks usually involve some sort of machine learning, whether for text classification or for feature generation, but NLP isn't just machine learning. Tasks such as text preprocessing and cleaning also fall under the NLP umbrella.

The most common Python library used for NLP tasks is the _`Natural Language Toolkit`_, or _`NLTK`_ for short. This library was developed by researchers at the University of Pennsylvania, and has become one of the most widely used and complete collections of NLP tools for Python.

**Working with text data** comes with a unique set of problems and solutions that other types of datasets don't have. Text data often requires more cleaning and preprocessing than other kinds of data before we can use statistical methods or machine learning to work with it.

The main challenge in working with text data isn't just the preprocessing; it's the number of decisions you have to make about how you'll clean and structure the data. The main preprocessing steps are:

   - **Creating a Bag of Words**: the bag contains information about all the important words in the text individually, but not in any particular order. The simplest way to create a bag of words is to just count how many times each unique word is used in a given corpus. If we have a number for every word, then we have a way to treat each bag as a vector, which opens up all kinds of machine learning tools for use.
   
   
   - **Basic Cleaning and Tokenization**: Punctuation and capitalization are among the most basic problems seen when working with text data. Cleaning a text dataset usually means removing punctuation and lowercasing everything; tokenization then splits the cleaned text into individual word tokens.
   
   
   - **Stemming, Lemmatization, and Stop Words**: Different forms of the same word, such as `"run"`, `"runs"`, `"running"` and `"ran"`, are more similar than different, so we may want our algorithm to treat them all as the same word, `"run"`. NLP methods such as ***Stemming*** and ***Lemmatization*** help us deal with this problem by reducing each word token down to its root word. Stop words are very common words (such as `"the"` or `"and"`) that carry little meaning on their own and are often removed.
   
   
   - **Stemming** accomplishes this by removing the ends of words where the end signals some sort of derivational change to the word.
   
   
   - **Lemmatization** accomplishes much the same thing as stemming, but does it in a more complex way, by examining the morphology of words and attempting to reduce each word to its most basic form, or lemma. Note that the results often differ somewhat from those of stemming.
   
   
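   - **Vectorization Strategies**: Count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. Count vectorization uses the raw word counts, while TF-IDF weights each count by how rare the word is across all documents.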
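
The cleaning, tokenization, bag-of-words, and vectorization steps above can be sketched with only the Python standard library. This is a minimal illustration, not a production pipeline: the two-sentence `corpus`, the helper names `clean_and_tokenize` and `tf_idf`, and the particular TF-IDF formula (term frequency times `log(N / document frequency)`) are all assumptions made for the example. Libraries such as scikit-learn use smoothed and normalized variants by default, so their numbers will differ.

```python
import math
import string
from collections import Counter

def clean_and_tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

# Hypothetical toy corpus, made up for this example
corpus = [
    "The dog runs.",
    "The dog ran, and the cat ran!",
]

# One bag of words (token -> count) per document; word order is discarded
bags = [Counter(clean_and_tokenize(doc)) for doc in corpus]

# A shared, sorted vocabulary gives every document a vector of the same length
vocab = sorted(set().union(*bags))
count_vectors = [[bag[word] for word in vocab] for bag in bags]

def tf_idf(bag, bags, vocab):
    """One simple TF-IDF variant: (count / doc length) * log(N / doc frequency)."""
    total = sum(bag.values())
    n_docs = len(bags)
    weights = []
    for word in vocab:
        doc_freq = sum(1 for b in bags if word in b)
        weights.append((bag[word] / total) * math.log(n_docs / doc_freq))
    return weights

tfidf_vectors = [tf_idf(bag, bags, vocab) for bag in bags]

print(vocab)          # ['and', 'cat', 'dog', 'ran', 'runs', 'the']
print(count_vectors)  # [[0, 0, 1, 0, 1, 1], [1, 1, 1, 2, 0, 2]]
```

Under this formula, words that appear in every document (here `"the"` and `"dog"`) get a weight of zero, while words unique to one document (such as `"ran"`) keep a positive weight.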
   
   
Here are some of the ways that NLTK can make our lives easier when working with text data:

- _Stop Word Removal_: NLTK contains a built-in list of stop words, making it easy to remove the words that don't matter from our data.

- _Filtering and Cleaning_: NLTK provides simple, easy ways to create and filter frequency distributions, as well as multiple ways to clean, stem, lemmatize, or tokenize datasets.

- _Feature Selection and Feature Engineering_: NLTK contains tools to quickly generate features such as bigrams and n-grams. It also contains major resources such as the Penn Treebank to allow quick feature engineering, such as generating part-of-speech tags or sentence polarity.
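
A few of these NLTK tools can be tried without downloading any extra data. The sketch below assumes NLTK is installed and uses a made-up sentence; it shows `FreqDist` for frequency distributions, `PorterStemmer` for stemming, and `bigrams` for n-gram features. The stop word list, the WordNet lemmatizer, and the part-of-speech tagger are left out because they each need a one-time `nltk.download(...)` of the relevant resource.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.util import bigrams

# Hypothetical pre-tokenized sentence, made up for this example
tokens = "the dog runs and the dog ran while the cat was running".split()

# Frequency distribution: how often each token occurs
freq = FreqDist(tokens)
top_word, top_count = freq.most_common(1)[0]
print(top_word, top_count)  # the 3

# Stemming strips suffixes, so regular forms collapse to "run"
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in ["run", "runs", "running", "ran"]]
print(stems)  # ['run', 'run', 'run', 'ran']

# Bigrams: pairs of adjacent tokens, usable as features
pairs = list(bigrams(tokens))
print(pairs[:2])  # [('the', 'dog'), ('dog', 'runs')]
```

Note that the stemmer leaves the irregular form `"ran"` unchanged, since no suffix can be stripped from it. Irregular forms like this are the cases where lemmatization tends to give different results.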

**Regular Expressions** are patterns that describe some text. We can use these regular expressions to quickly match patterns and filter through text documents. Regular expressions (or regex, for short) are an important tool anytime we need to pull information from a larger text document without manually reading the entire thing. For data scientists, regex is extremely useful for data gathering: for example, we can quickly scrape webpages by searching through the HTML to find the information we need.
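
As a small illustration of pulling information out of HTML with regex, the sketch below uses Python's built-in `re` module on a made-up HTML snippet (the URLs and link text are placeholders). For real scraping a dedicated HTML parser is usually more robust, so treat this as a demonstration of pattern matching rather than a recommended scraper.

```python
import re

# Hypothetical HTML fragment, made up for this example
html = """
<ul>
  <li><a href="https://example.com/a">First</a></li>
  <li><a href="https://example.com/b">Second</a></li>
</ul>
"""

# Capture whatever sits between the quotes of each href attribute
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['https://example.com/a', 'https://example.com/b']

# Capture the visible text of each link
labels = re.findall(r"<a [^>]*>([^<]+)</a>", html)
print(labels)  # ['First', 'Second']
```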

**Ranges, Groups, and Quantifiers** A range such as `[A-Z]` will match any uppercase letter. Ranges are always inside square brackets. We can put many things inside a range at the same time, and regex will match on any of them.

**Character Classes** Character classes are a special case of ranges. Since it's quite common to use ranges to match on things like words or numbers, regex includes character classes as a shortcut. For example, `\d` matches any digit and `\w` matches any word character (a letter, digit, or underscore).
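
Ranges and character classes can be compared side by side with Python's built-in `re` module. The sample string below is made up for the example.

```python
import re

# Hypothetical sample text, made up for this example
text = "Order A7 shipped to Bob on 2021-03-04."

# A range: any single uppercase letter
uppercase = re.findall(r"[A-Z]", text)
print(uppercase)  # ['O', 'A', 'B']

# Several things in one range: any uppercase letter OR any digit
upper_or_digit = re.findall(r"[A-Z0-9]", text)

# Character classes as shortcuts: \d for digits, \w for word characters
digits = re.findall(r"\d", text)
words = re.findall(r"\w+", text)
print(words)  # ['Order', 'A7', 'shipped', 'to', 'Bob', 'on', '2021', '03', '04']
```

For plain ASCII text like this, `\d` behaves the same as the range `[0-9]`.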

**Groups and Quantifiers** Groups are kind of like ranges, but they specify an exact pattern to match on. Groups are denoted by parentheses. Whereas `[A-Z0-9]` matches on any uppercase letter or any digit, `(A-Z0-9)` will only match on the sequence `'A-Z0-9'` exactly. This becomes much more useful when paired with quantifiers, which allow us to specify how many times a group should occur in a row. If we want to specify an exact number of times, we can use curly braces: `(ab){3}` matches `'ababab'`.
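
The difference between a range and a group, and the effect of a curly-brace quantifier, can be checked with a short sketch using Python's `re` module. The test strings, including the phone-number-like pattern, are made up for the example.

```python
import re

# Range: matches each uppercase letter or digit individually (the "-" is skipped)
range_hits = re.findall(r"[A-Z0-9]", "A-Z0-9")
print(range_hits)  # ['A', 'Z', '0', '9']

# Group: matches only the literal sequence "A-Z0-9"
group_hit = re.search(r"(A-Z0-9)", "xxA-Z0-9xx").group(1)
print(group_hit)  # A-Z0-9

# Quantifier with curly braces: the group "ab" exactly three times in a row
exactly_three = re.fullmatch(r"(ab){3}", "ababab") is not None
only_two = re.fullmatch(r"(ab){3}", "abab") is not None
print(exactly_three, only_two)  # True False

# Quantifiers also apply to character classes: 3 digits, a dash, 4 digits
numbers = re.findall(r"\d{3}-\d{4}", "Call 555-1234 or 555-9876")
print(numbers)  # ['555-1234', '555-9876']
```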

**Regex Cheat Sheet**

![image.png](attachment:image.png)

**Context-Free Grammar** (CFG) In order to understand CFGs, we first need to back up and gain a little background knowledge about linguistics. According to linguistics, there are five different levels of language. One way that we can help a computer understand how to interpret a sentence is to create a CFG for it to use when parsing. The CFG defines the rules for how valid sentences can be formed. We do this by labeling different word tokens with their grammatical types, and then defining which combinations of grammatical types are valid examples of verb phrases, noun phrases, etc.

![image-2.png](attachment:image-2.png)
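
A CFG of this kind can be sketched with NLTK's `CFG.fromstring` and `ChartParser`, neither of which needs any downloaded data. The tiny grammar and the sentences below are made up purely for illustration: word tokens are labeled with grammatical types (`Det`, `N`, `V`), and rules define which combinations form a noun phrase (`NP`), a verb phrase (`VP`), and a sentence (`S`).

```python
import nltk

# Hypothetical toy grammar, made up for this example
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'ball'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)

# A sentence that follows the rules yields a parse tree rooted at S
tokens = "the dog chased the ball".split()
trees = list(parser.parse(tokens))
print(trees[0])

# The same words in an order the grammar does not allow yield no parse
bad_tokens = "chased the dog the ball".split()
bad_trees = list(parser.parse(bad_tokens))
print(len(bad_trees))  # 0
```

Printing a tree shows the nested phrase structure, which is the same information a parse tree diagram conveys.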

## KEY TAKEAWAYS

- NLP has become increasingly popular over the past few years, and NLP researchers have produced many valuable insights
 
 
- The Natural Language Tool Kit (NLTK) is one of the most popular Python libraries for NLP


- Regular Expressions are an important part of NLP, which can be used for pattern matching and filtering


- Regular Expressions can become confusing, so make sure to use our provided cheat sheet the first few times you work with regex


- It is strongly recommended you take some time to use regex tester websites to ensure you understand how changing your regex pattern affects your results when working towards a correct answer!


- Feature engineering is essential both for working with text data and for understanding the dynamics of your text


- Common feature engineering techniques include stop word removal, stemming, lemmatization, and n-gram generation


- When diving deeper into grammar and linguistics, context-free grammars and part-of-speech tagging are important


- In this context, parse trees can help computers when dealing with ambiguous words


- How you clean and preprocess your data will have a major effect on the conclusions you'll be able to draw in your NLP classification problems

