# NLP

Natural Language Processing (NLP) is a field that focuses on enabling computers to interact with human language. It intersects with Computer Science, Artificial Intelligence, Linguistics, and Information Theory. Initially, NLP relied on rule-based algorithms borrowed from linguistics, but it has since shifted towards machine learning and AI approaches, leading to significant advancements.

Working with text data requires specific preprocessing steps to prepare it for analysis. A common approach is to create a "Bag of Words" by counting the occurrences of each unique word. This allows the text to be treated as a vector, so that machine learning algorithms can be applied to it. Preprocessing typically starts with basic cleaning (removing punctuation and converting text to lowercase) and tokenization (splitting the text into individual words). Stemming and lemmatization address variations in word forms by reducing words to their root forms. Removing stop words, such as common articles and prepositions, helps reduce the dimensionality of the text data.
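As a minimal sketch of the cleaning, tokenization, and counting steps, the snippet below builds a Bag of Words with the standard library only. The sample sentence and the helper name `bag_of_words` are assumptions made for illustration; they are not part of any library.

```python
import string
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, and count tokens."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return Counter(cleaned.split())

# Hypothetical sample sentence, used only for illustration
bow = bag_of_words("The cat sat on the mat. The mat was red!")
print(bow.most_common(2))  # [('the', 3), ('mat', 2)]
```

Each unique word becomes one dimension of the vector, which is why removing stop words or stemming shrinks the feature space.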

Vectorization strategies convert text data into numerical vectors. Count vectorization represents text as a matrix of word counts, while TF-IDF vectorization weights each word by the product of its term frequency (TF) and inverse document frequency (IDF), so words that are frequent in one document but rare across the corpus receive the highest weights. Understanding these foundational concepts makes it possible to preprocess and transform text data effectively for various NLP tasks.

$\text{TF} = \frac{\text{Number of occurrences of a term in a document}}{\text{Total number of terms in the document}}$

$\text{IDF} = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the term}}\right)$
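The two formulas can be applied by hand to a toy corpus. This is a sketch under stated assumptions: the three documents and the function names `tf`, `idf`, and `tf_idf` are made up for illustration, and the natural logarithm is used. Library implementations often add smoothing and normalization, so their numbers will usually differ.

```python
import math

# Toy corpus, already tokenized (hypothetical data)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (total documents / documents containing the term).

    Assumes the term appears in at least one document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so IDF = log(1) = 0 and the weight vanishes
print(tf_idf("the", docs[0], docs))
# "dog" appears in only one document, so it gets a positive weight
print(tf_idf("dog", docs[1], docs))
```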

NLP work involves analyzing and processing human language in the form of text or speech, and in practice it combines text cleaning and preprocessing with machine learning. The Natural Language Toolkit (NLTK) is a Python library widely used for NLP. It provides tools for data cleaning, linguistic analysis, and feature generation and extraction. NLTK incorporates linguistic concepts in a way that makes working with text data accessible even if you're not an expert in linguistics, and it simplifies complex tasks like generating parse trees.

A sample Parse Tree created with NLTK
![image](https://raw.githubusercontent.com/learn-co-curriculum/dsc-introduction-to-nltk/master/images/new_parse_tree.png)

When working with text data, NLTK simplifies the process by providing features such as stop word removal, filtering and cleaning tools, and feature selection and engineering capabilities. NLTK's stop word library helps remove irrelevant words, and it offers easy ways to filter frequency distributions and clean datasets through stemming, lemmatization, and tokenization. NLTK also enables the generation of features like bigrams and n-grams, as well as part-of-speech tags and sentence polarity. By utilizing NLTK effectively, text data can be processed and prepared for tasks such as classification. 

## What Are Regular Expressions?

Regular Expressions (regex) are patterns used to match and filter text documents. They are invaluable for extracting information from large text documents efficiently. Data scientists often use regex to scrape webpages and gather data.

## Use Cases for NLP

Regex is particularly useful in Natural Language Processing (NLP). It helps define tokenization rules that split strings into meaningful units. For example, a naive tokenizer might split "they're" into ["they", "'", "re"], whereas a well-chosen regex pattern keeps it as a single token. Regex patterns are essential for intelligent tokenization and for avoiding issues in text preprocessing.
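A short sketch of the difference, using only Python's `re` module. The sample sentence and both patterns are illustrative assumptions; real tokenizers handle many more edge cases.

```python
import re

sentence = "They're hiking, aren't they?"

# Naive pattern: runs of letters only, so the apostrophe splits contractions
naive = re.findall(r"[a-zA-Z]+", sentence)

# Letters, optionally followed by an apostrophe and more lowercase letters
pattern = r"[a-zA-Z]+(?:'[a-z]+)?"
tokens = re.findall(pattern, sentence)

print(naive)   # ['They', 're', 'hiking', 'aren', 't', 'they']
print(tokens)  # ["They're", 'hiking', "aren't", 'they']
```

The `(?:...)` group is non-capturing, so `findall` returns the whole match rather than just the contraction suffix.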

## Creating Basic Patterns

To work with regex, we define a pattern using a Python string and compile it using the `re` library. Once compiled, the pattern can be applied to a string to find all instances of the pattern.
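In its smallest form, and assuming a made-up sample string, the compile-then-apply workflow looks like this:

```python
import re

# Hypothetical sample text
text = "Order 3 tacos and 12 nachos."

# Compile the pattern once; the raw string keeps the backslash intact
p = re.compile(r"\d+")

# The compiled pattern can then be reused on any string
matches = p.findall(text)
print(matches)  # ['3', '12']
```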

## Ranges, Groups, and Quantifiers

Ranges in regex allow matching on a set of characters. For example, `[A-Z]` matches any uppercase letter. Character classes provide shortcuts, such as `\d` for any digit. Groups, denoted by parentheses, match an exact pattern. Quantifiers specify the number of times a group should occur consecutively, using symbols like `*`, `+`, and `?`.
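The snippet below tries each of these building blocks on a made-up sentence. The sample text and patterns are assumptions chosen for illustration.

```python
import re

text = "Flight AB123 leaves at 9, flight ZZ7 at 10; hahaha!"

# Range + quantifier: two uppercase letters followed by one or more digits
codes = re.findall(r"[A-Z]{2}\d+", text)
print(codes)  # ['AB123', 'ZZ7']

# Group + quantifier: the exact sequence "ha" repeated one or more times
laugh = re.search(r"(ha)+", text).group()
print(laugh)  # 'hahaha'

# '?' makes the preceding character optional
forms = re.findall(r"flights?", "flight flights")
print(forms)  # ['flight', 'flights']
```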

## Cheat Sheet and Simplifying Regex

Regex can be complex, but understanding how patterns are built is more important than memorizing symbols. Keep a cheat sheet handy for looking up specific symbols.

[Regex Cheat Sheet](https://www.rexegg.com/regex-quickstart.html)

![image](https://raw.githubusercontent.com/learn-co-curriculum/dsc-introduction-to-regular-expressions/master/images/regex_cheat_sheet.png)

## Using Regex Testers

Working with regex patterns often involves an iterative process. It's helpful to use regex tester websites like [Regex Pal](https://www.regexpal.com/) or [regexr](https://regexr.com/) to visually inspect and test patterns with sample text.

## Regex Cheat Sheet

Don't worry about memorizing regex symbols and patterns. Keep a cheat sheet handy and focus on understanding the basic structure of regex patterns. 

## The Data

We'll work with a plain text file called `'menu.txt'`. Our goal is to use regex to scrape various types of information from this file.

```python
import re

with open('menu.txt', 'r') as f:
    file = f.read()

print(file)
```

### Creating a Basic Pattern With Character Classes

Let's start by creating a basic pattern using character classes to match predefined elements like words or digits. We'll find all the digits in the document.

```python
# \d matches any single digit; a raw string keeps the backslash intact
pattern = r'\d'
p = re.compile(pattern)
digits = p.findall(file)
digits
```

To get numbers with a dollar sign in front of them, we need to escape the dollar sign using `\`.

```python
# An escaped '$' followed by a single digit
pattern = r'\$\d'
p = re.compile(pattern)
digits = p.findall(file)
digits
```

### Using Groups, Ranges, and Quantifiers

Groups, ranges, and quantifiers allow us to create more complex patterns. Let's use them to match any price in the menu.

```python
# '$', one digit, then any 3 characters (the unescaped '.' matches any character)
pattern = r'(\$\d.{3})'
p = re.compile(pattern)
digits = p.findall(file)
digits
```

To match all the numbers in a price, we can use groups, ranges, and quantifiers more effectively.

```python
# '$', one or more digits, an optional literal '.', then any trailing digits
pattern = r'(\$\d+\.?\d*)'
p = re.compile(pattern)
digits = p.findall(file)
digits
```

### Putting It All Together

Now, let's write a pattern to match the phone number from the bottom of the menu.

```python
# Matches '(ddd) ddd-dddd'; a single group makes findall return strings, not tuples
pattern = r'(\(\d{3}\) \d{3}-\d{4})'
p = re.compile(pattern)
phone_numbers = p.findall(file)
phone_numbers
```



## Feature Engineering For Text Data

Stopword removal, frequency distributions, stemming and lemmatization, and bigrams/n-grams are common approaches to NLP feature engineering. Let's discuss each of them in detail:

1. **Stopword Removal**: Stopwords are common words in a language that do not carry much meaning and are often removed to reduce noise and dimensionality in text data. NLTK provides a list of stopwords for various languages. The steps to remove stopwords are:
   - Obtain the stopwords for the specific language (e.g., English) from NLTK.
   - Optionally, remove punctuation marks or other characters that are not relevant to the analysis.
   - Tokenize the text data into individual words.
   - Remove the stopwords from the tokenized data using list comprehension or other methods.

2. **Frequency Distributions**: Frequency distribution is a way to count the occurrences of words in a text corpus. It helps identify the most common words and their frequencies, providing insights into the overall vocabulary and content. The steps to create a frequency distribution are:
   - Tokenize the text data into individual words.
   - Use the `FreqDist` class from NLTK to create a frequency distribution object.
   - Explore the frequency distribution using methods like `most_common()` to retrieve the most common words and their frequencies.

3. **Stemming and Lemmatization**: Stemming and lemmatization are techniques used to reduce words to their base or root forms, which can help in reducing dimensionality and capturing the essence of related words. Stemming applies simple rules to remove prefixes or suffixes from words, while lemmatization uses language-specific dictionaries to determine the base form of a word. NLTK provides stemmers and lemmatizers for various languages. The steps for stemming and lemmatization are:
   - Tokenize the text data into individual words.
   - Apply stemming using a stemmer (e.g., the Porter Stemmer) or lemmatization using a lemmatizer (e.g., WordNetLemmatizer) from NLTK.

4. **Bigrams, N-grams, and Mutual Information Score**: N-grams are contiguous sequences of n words in a text. Bigrams are a specific type of n-gram that consists of pairs of adjacent words. N-grams can capture the relationship between words and provide context. Mutual Information Score is a statistical measure that quantifies the dependence between two words in an n-gram. It helps identify meaningful word combinations. The steps to create n-grams and calculate mutual information scores are:
   - Tokenize the text data into individual words.
   - Use NLTK's `ngrams` function to generate n-grams of the desired size.
   - Optionally, filter n-grams based on frequency to retain meaningful combinations.
   - Calculate the mutual information score for each n-gram to assess the dependence between the words in the n-gram.
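
The first two items above (stop word removal and frequency distributions) can be sketched with the standard library alone. This is an illustration under stated assumptions: the stop word set is a tiny hand-picked stand-in for NLTK's English list, the sample text is made up, and `collections.Counter` plays the role of NLTK's `FreqDist`.

```python
import string
from collections import Counter

# Tiny hand-picked stop word set (an assumption for illustration);
# NLTK's stopwords.words('english') provides a much fuller list.
stop_words = {"the", "a", "and", "of", "to", "is", "on"}

# Hypothetical sample text
text = "The chef is proud of the soup, and the soup is the pride of the menu."

# Strip punctuation, lowercase, and tokenize on whitespace
tokens = text.lower().translate(str.maketrans('', '', string.punctuation)).split()

# Remove stop words with a list comprehension
filtered = [t for t in tokens if t not in stop_words]

# Frequency distribution; Counter.most_common mirrors FreqDist.most_common
freq = Counter(filtered)
print(filtered)             # ['chef', 'proud', 'soup', 'soup', 'pride', 'menu']
print(freq.most_common(1))  # [('soup', 2)]
```

Note that "proud" and "pride" remain separate tokens here; stemming or lemmatization is the step that would address related word forms.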

These techniques can be applied sequentially or selectively, depending on the specific requirements of the NLP task and the characteristics of the text data. Feature engineering in NLP involves experimenting with these approaches to find the most informative features for a given problem.
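
As a sketch of the bigram and mutual information steps, the snippet below builds bigrams with `zip` and scores them with pointwise mutual information (PMI), one common formulation of a mutual information score. The token list, the helper name `pmi`, and the frequency threshold of 2 are assumptions for illustration; NLTK offers its own n-gram and collocation tools for the same purpose.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized text
tokens = "new york is big and new york is busy and big".split()

# Bigrams: pairs of adjacent tokens
bigrams = list(zip(tokens, tokens[1:]))

word_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

def pmi(w1, w2):
    """Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_pair = bigram_counts[(w1, w2)] / len(bigrams)
    p_w1 = word_counts[w1] / len(tokens)
    p_w2 = word_counts[w2] / len(tokens)
    return math.log2(p_pair / (p_w1 * p_w2))

# Optional frequency filter: keep only bigrams seen at least twice
frequent = [bg for bg, n in bigram_counts.items() if n >= 2]
for w1, w2 in frequent:
    print(w1, w2, round(pmi(w1, w2), 2))
```

A higher score means the two words co-occur more often than their individual frequencies would suggest. The frequency filter matters because PMI tends to overrate pairs that appear only once.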

# CFG and POS

A Context-Free Grammar (CFG) is a formal grammar used to describe the syntax or structure of a language. It consists of a set of production rules that define how different components of a language can be combined to form valid sentences.

In the context of natural language processing (NLP), CFGs are used to analyze and generate sentences. They provide a way to model the syntactic structure of a language, focusing on the arrangement of words and phrases rather than their meaning. CFGs are particularly useful for parsing and syntactic analysis, and they work hand in hand with part-of-speech (POS) tagging, since POS tags supply the word-level categories that the grammar rules combine.

CFGs define different types of constituents or phrases, such as noun phrases (NP), verb phrases (VP), prepositional phrases (PP), etc. These constituents are represented by non-terminal symbols in the grammar. Terminal symbols represent individual words or tokens in the language.

The production rules in a CFG specify how non-terminal symbols can be expanded or replaced by other symbols, including terminal symbols. For example, a production rule might state that a noun phrase (NP) can consist of a determiner (Det) followed by a noun (N). These rules define the structure of sentences in the language and allow for recursive expansion of constituents.

CFGs can be used to generate valid sentences by starting with a start symbol (usually "S") and repeatedly applying the production rules until only terminal symbols remain. They can also be used to parse or analyze existing sentences by constructing parse trees that represent the syntactic structure of the sentence according to the CFG.
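
The generation process can be sketched in a few lines of plain Python. The toy grammar, its vocabulary, and the helper name `expand` are assumptions for illustration; the grammar is deliberately non-recursive, because this exhaustive expansion would never terminate on a recursive one.

```python
from itertools import product

# Toy grammar (hypothetical): each non-terminal maps to its possible right-hand sides
grammar = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["chased"]],
}

def expand(symbol):
    """Return every terminal word sequence derivable from the symbol."""
    if symbol not in grammar:  # terminal symbol: an actual word
        return [[symbol]]
    results = []
    for rhs in grammar[symbol]:
        # Expand each symbol on the right-hand side, then combine the options
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
print(sentences)
```

Starting from "S", the rules yield four sentences, such as "the dog chased the cat". Each one is syntactically valid under the grammar regardless of whether it is meaningful.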

In summary, a Context-Free Grammar is a formal system used to describe the syntax or structure of a language. In NLP, CFGs are useful for parsing and, together with part-of-speech tags, for modeling and analyzing the syntactic structure of sentences.

# Text Classification

For the final lab of this section, we'll use everything we've learned so far to build a classifier that works well with text data. As you've probably guessed, text data is significantly harder to work with than most traditional datasets, because of the sheer amount of preprocessing needed to get the data into a format acceptable to a machine learning algorithm.

The main challenge in working with text data isn't just the preprocessing; it's the number of decisions you have to make about how you'll clean and structure the data. In a traditional dataset full of numerical and categorical features, the preprocessing steps are fairly straightforward. Generally, we normalize the numeric data, check for and deal with multicollinearity, convert categorical data to numerical format through one-hot encoding, and so forth. Although the steps themselves may not be easy, there's generally little ambiguity about what needs to be done. Text data is a bit more ambiguous. Let's examine some of the decisions we generally need to make when working with text data.

## Cleaning and Preprocessing Text Data

Once we have our data, the fun part begins: preprocessing and cleaning it. As you've seen throughout this section, this is more challenging for text than for more traditional data types because there's no clear-cut answer for exactly what sort of preprocessing and cleaning we need to do. With traditional datasets, our goals at this stage are generally pretty clear (normalize and clean the numerical data, convert categorical data to a numeric format, check for and deal with multicollinearity, and so on), and the steps we take depend largely on what the data looks like when we get hold of it. Text data is different: if we inspect a raw text dataset, we'll generally see that it has only one dimension, the actual text, in the form of a string. This could be anything from a tweet to a full novel. Before we can begin cleaning and preprocessing our text data, we need to make some decisions about things such as:

- Do we remove stop words or not?
- Do we stem or lemmatize our text data, or leave the words as is?
- Is basic tokenization enough, or do we need to support special edge cases through the use of regex?
- Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?
- Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?
- What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?

These are all questions that we'll need to think about pretty much anytime we begin working with text data.
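
To make two of these decisions concrete, the sketch below limits the vocabulary to the `N` most frequent words and then compares count vectorization with Boolean vectorization. The documents, the value of `N`, and the helper names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical, already-tokenized documents
docs = [
    ["good", "food", "good", "service"],
    ["bad", "food", "good"],
    ["good", "food", "service", "price"],
]

# Limit the vocabulary to the N most frequent words across the corpus
N = 3
corpus_counts = Counter(word for doc in docs for word in doc)
vocab = [word for word, _ in corpus_counts.most_common(N)]

def count_vector(doc):
    """How many times each vocabulary word occurs in the document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

def boolean_vector(doc):
    """1 if the vocabulary word occurs in the document at all, else 0."""
    return [int(word in doc) for word in vocab]

print(vocab)                    # ['good', 'food', 'service']
print(count_vector(docs[0]))    # [2, 1, 1]
print(boolean_vector(docs[0]))  # [1, 1, 1]
```

Words outside the top `N` ("bad" and "price" here) are dropped entirely, which is the trade-off behind the "how many?" question.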

## Feature Engineering

Another common decision point when working with text data is exactly what features to include in the dataset. As we saw in a previous lab, NLTK makes it quite easy to do things like generate part-of-speech tags for words, or create word or character-level n-grams. In general, there's no great answer for exactly which features will improve the performance of your model, and which won't. This means that your best bet is to experiment, and treat the entire project as an iterative process! When working with text data, don't be afraid to try modeling on alternative forms of the text data, such as bigrams or n-grams. Similarly, we encourage you to explore how adding features such as POS tags or mutual information scores affects the overall model performance. Sometimes, it has a great effect on performance. Other times, not much. Either way, you won't know until you try!