Natural Language Processing (NLP) is a field that focuses on enabling computers to interact with human language. It intersects with Computer Science, Artificial Intelligence, Linguistics, and Information Theory. Initially, NLP relied on rule-based algorithms borrowed from linguistics, but it has since shifted towards machine learning and AI approaches, leading to significant advancements.

Working with text data requires specific preprocessing steps to prepare it for analysis. A common approach is to create a "Bag of Words" by counting the occurrences of each unique word. This allows the text to be treated as a vector, so that machine learning algorithms can be applied to it. Preprocessing starts with basic cleaning, where punctuation is removed and text is converted to lowercase, and tokenization, where the text is split into individual tokens. Stemming and lemmatization then collapse variations in word forms by reducing words to a common root. Removing stop words, such as common articles and prepositions, helps reduce the dimensionality of the text data.
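
The steps above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stop word list and the two sample sentences are made up for the example, and stemming and lemmatization are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real projects use a fuller one.
STOP_WORDS = {"the", "a", "an", "on", "in", "of", "and"}

def bag_of_words(text):
    """Lowercase the text, drop punctuation, tokenize, remove stop words, count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

docs = ["The cat sat on the mat.", "The dog sat on the log!"]
bags = [bag_of_words(d) for d in docs]

# A shared, sorted vocabulary fixes the column order of every vector.
vocab = sorted(set().union(*bags))
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)    # ['cat', 'dog', 'log', 'mat', 'sat']
print(vectors)  # [[1, 0, 0, 1, 1], [0, 1, 1, 0, 1]]
```

Each document becomes one row of counts over the same vocabulary, which is the shape most machine learning algorithms expect.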

Vectorization strategies convert text data into numerical vectors. Count vectorization represents text as a matrix of word counts, while TF-IDF vectorization weights each word by its term frequency (TF) within a document and its inverse document frequency (IDF) across all documents, so that words appearing in nearly every document count for less. The TF-IDF weight of a term is the product of the two quantities defined below. With these foundations, practitioners can preprocess and transform text data for a wide range of NLP tasks.

$\text{TF} = \frac{\text{Number of occurrences of a term in a document}}{\text{Total number of terms in the document}}$

$\text{IDF} = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the term}}\right)$
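
The two formulas translate almost directly into Python. The sketch below assumes pre-tokenized documents (the three toy documents are invented for illustration) and uses the natural logarithm, since the formula above does not fix a base. Libraries often apply smoothed variants, so their numbers may differ.

```python
import math

# Hypothetical pre-tokenized documents.
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog", "dog"],
]

def tf(term, doc):
    """Occurrences of the term divided by the total number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    """TF-IDF weight: the product of TF and IDF."""
    return tf(term, doc) * idf(term, docs)

print(tf("dog", docs[2]))               # 0.5 (2 of 4 tokens)
print(idf("dog", docs))                 # log(3 / 2)
print(tf_idf("chased", docs[2], docs))  # 0.25 * log(3)
```

Here "chased" appears in only one document, so it gets a higher IDF than "dog", which appears in two.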

TL;DR: Natural Language Processing (NLP) involves analyzing and processing human language in the form of text or speech. It draws on machine learning and involves tasks like text preprocessing and cleaning. The Natural Language Toolkit (NLTK) is a Python library widely used for NLP. It provides tools for data cleaning, linguistic analysis, and feature generation and extraction. NLTK builds linguistic concepts into its tools, so working with text data is accessible even if you're not an expert in linguistics. It also simplifies complex tasks like generating parse trees.

A sample Parse Tree created with NLTK
![image](https://raw.githubusercontent.com/learn-co-curriculum/dsc-introduction-to-nltk/master/images/new_parse_tree.png)

TL;DR: When working with text data, NLTK simplifies the process by providing features such as stop word removal, filtering and cleaning tools, and feature selection and engineering capabilities. NLTK's stop word library helps remove irrelevant words, and it offers easy ways to filter frequency distributions and clean datasets through stemming, lemmatization, and tokenization. NLTK also enables the generation of features like bigrams and n-grams, as well as part-of-speech tags and sentence polarity. By utilizing NLTK effectively, text data can be processed and prepared for tasks such as classification. 

## What Are Regular Expressions?

Regular Expressions (regex) are patterns used to match and filter text documents. They are invaluable for extracting information from large text documents efficiently. Data scientists often use regex to scrape webpages and gather data.

## Use Cases for NLP

Regex is particularly useful in Natural Language Processing (NLP). It helps define tokenization rules, which split strings into meaningful units. For example, a naive rule would split "they're" into ["they", "'", "re"], whereas a well-chosen regex pattern can treat it as a single token. Patterns like this make tokenization more intelligent and avoid problems later in text preprocessing.
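
As a rough illustration of the contraction case, the sketch below compares a naive letters-only pattern with one that allows an optional apostrophe suffix. The sample sentence is made up, and the pattern only handles the straight apostrophe (`'`), not typographic quotes.

```python
import re

text = "They're sure it isn't Sam's."

# Naive rule: runs of letters only, so contractions break at the apostrophe.
naive_tokens = re.findall(r"[A-Za-z]+", text)

# Letters, optionally followed by an apostrophe and more letters,
# so a contraction stays together as one token.
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

print(naive_tokens)  # ['They', 're', 'sure', 'it', 'isn', 't', 'Sam', 's']
print(tokens)        # ["They're", 'sure', 'it', "isn't", "Sam's"]
```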

## Creating Basic Patterns

To work with regex, we define a pattern using a Python string and compile it using the `re` library. Once compiled, the pattern can be applied to a string to find all instances of the pattern.
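
A minimal sketch of that workflow, using an invented sample sentence:

```python
import re

text = "The cat sat on the mat with another cat."

# Define the pattern as a (raw) string, then compile it.
p = re.compile(r"cat")

# findall returns every non-overlapping match as a list of strings.
matches = p.findall(text)

# search returns a match object for the first occurrence (or None).
first = p.search(text)

print(matches)       # ['cat', 'cat']
print(first.span())  # (4, 7)
```

Writing the pattern as a raw string (`r"..."`) keeps backslashes intact, which matters as soon as the pattern uses character classes such as `\d`.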

## Ranges, Groups, and Quantifiers

Ranges in regex allow matching on a set of characters. For example, `[A-Z]` matches any uppercase letter. Character classes provide shortcuts, such as `\d` for any digit. Groups, denoted by parentheses, treat an exact sequence of characters as a single unit. Quantifiers specify how many times the preceding character or group should occur consecutively: `*` means zero or more times, `+` means one or more times, and `?` means zero or one time.
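
The short sketch below tries each of these building blocks on invented strings. It uses `re.finditer` with `group(0)` for the grouped pattern so that the whole match is shown rather than just the group's contents.

```python
import re

text = "Order A12 shipped, order b7 pending, haha and hahaha, color or colour"

# Range + character class + quantifier: an uppercase letter, then one or more digits.
codes = re.findall(r"[A-Z]\d+", text)

# Group + quantifier: the unit 'ha' repeated one or more times.
laughs = [m.group(0) for m in re.finditer(r"(ha)+", text)]

# ? makes the preceding character optional.
spellings = re.findall(r"colou?r", text)

# * allows zero or more of the preceding character.
variants = re.findall(r"go*gle", "ggle gogle google")

print(codes)      # ['A12']
print(laughs)     # ['haha', 'hahaha']
print(spellings)  # ['color', 'colour']
print(variants)   # ['ggle', 'gogle', 'google']
```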

## Cheat Sheet and Simplifying Regex

Regex can be complex, but understanding how patterns are built matters more than memorizing symbols. Keep a cheat sheet handy for looking up specific symbols.

[Regex Cheat Sheet](https://www.rexegg.com/regex-quickstart.html)

![image](https://raw.githubusercontent.com/learn-co-curriculum/dsc-introduction-to-regular-expressions/master/images/regex_cheat_sheet.png)

## Using Regex Testers

Working with regex patterns often involves an iterative process. It's helpful to use regex tester websites like [Regex Pal](https://www.regexpal.com/) or [regexr](https://regexr.com/) to visually inspect and test patterns with sample text.

## The Data

We'll work with a plain text file called `'menu.txt'`. Our goal is to use regex to scrape various types of information from this file.

```python
import re

with open('menu.txt', 'r') as f:
    file = f.read()

print(file)
```

### Creating a Basic Pattern With Character Classes

Let's start by creating a basic pattern using character classes to match predefined elements like words or digits. We'll find all the digits in the document.

```python
# \d matches a single digit; the raw string (r'...') keeps the backslash intact
pattern = r'\d'
p = re.compile(pattern)
digits = p.findall(file)
digits
```

To get numbers with a dollar sign in front of them, we need to escape the dollar sign using `\`.

```python
# \$ matches a literal dollar sign; unescaped, $ is an end-of-string anchor
pattern = r'\$\d'
p = re.compile(pattern)
matches = p.findall(file)
matches
```

### Using Groups, Ranges, and Quantifiers

Groups, ranges, and quantifiers allow us to create more complex patterns. As a first attempt at matching prices in the menu, let's match a dollar sign, a digit, and the three characters that follow.

```python
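# \$\d  -> a dollar sign followed by one digit
# .{3}  -> then any three characters (an unescaped . matches any character)
pattern = r'(\$\d.{3})'
p = re.compile(pattern)
prices = p.findall(file)
prices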
```

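That pattern always grabs exactly three characters after the first digit, whatever they are. To match all the numbers in a price regardless of length, we can use quantifiers: `\d+` for one or more digits, `\.?` for an optional decimal point, and `\d*` for any digits after it.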

```python
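# \$ literal dollar sign, \d+ one or more digits,
# \.? optional decimal point, \d* zero or more trailing digits
pattern = r'(\$\d+\.?\d*)'
p = re.compile(pattern)
prices = p.findall(file)
prices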
```

### Putting It All Together

Now, let's write a pattern to match the phone number from the bottom of the menu.

```python
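# \( and \) match literal parentheses; {n} repeats the preceding \d exactly n times.
# With no capturing groups, findall returns each phone number as a single string.
pattern = r'\(\d{3}\) \d{3}-\d{4}'
p = re.compile(pattern)
phone_numbers = p.findall(file)
phone_numbers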
```
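
One detail worth checking when a pattern contains groups is what `findall` returns: with no groups it returns whole matches, and with two or more groups it returns a tuple of the groups' contents for each match. The sketch below shows this on a made-up string standing in for the contents of `menu.txt`; the items, prices, and phone number are all hypothetical.

```python
import re

# Hypothetical stand-in for the contents of menu.txt.
sample = "Burger $7.50\nFries $3\nCall us: (555) 123-4567"

# No groups: each result is the full matched string.
prices = re.findall(r"\$\d+\.?\d*", sample)
whole_numbers = re.findall(r"\(\d{3}\) \d{3}-\d{4}", sample)

# Two groups: each result is a tuple (area code, local number).
parts = re.findall(r"\((\d{3})\) (\d{3}-\d{4})", sample)

print(prices)         # ['$7.50', '$3']
print(whole_numbers)  # ['(555) 123-4567']
print(parts)          # [('555', '123-4567')]
```

Groups are handy when you want to pull the pieces of a match apart, and leaving them out is simpler when you only need the full match.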

