## Text Preprocessing

### Regular Expression

In Python, regular expressions (regex) are sequences of characters that define a search pattern, used mainly for pattern matching within strings. They allow you to search, match, and manipulate text based on certain patterns rather than exact matches.

Here's a brief overview of how regular expressions work in Python:

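1. **Pattern Definition**: A pattern is written as a Python string, usually a raw string such as `r'\d+'` so that Python does not interpret the backslashes, and is passed to functions in the `re` module such as `re.search()`, `re.match()`, `re.findall()`, `re.sub()`, and `re.split()`.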

2. **Metacharacters**: Regular expressions consist of both literal characters and metacharacters. Metacharacters are characters with a special meaning. For example:

    1. **`.` (Dot)**:
        - Matches any single character except newline (`\n`).
        - For example, `a.c` would match 'abc', 'adc', 'aec', etc., but not 'ac' or 'abbc'.

    2. **`^` (Caret)**:
        - Matches the start of a string.
        - For example, `^hello` would match 'hello' only if it appears at the beginning of a string.

    3. **`$` (Dollar)**:
        - Matches the end of a string.
        - For example, `world$` would match 'world' only if it appears at the end of a string.

    4. **`[]` (Square Brackets)**:
        - Defines a character class, allowing you to specify a set of characters that you want to match.
        - For example, `[aeiou]` matches any vowel, `[0-9]` matches any digit, and `[abc]` matches 'a', 'b', or 'c'.

    5. **`|` (Pipe)**:
        - Represents alternation, allowing you to specify alternatives.
        - For example, `cat|dog` matches either 'cat' or 'dog'.

    6. **`*` (Asterisk)**:
        - Matches zero or more occurrences of the preceding character or group.
        - For example, `ba*t` would match 'bt', 'bat', 'baat', 'baaat', and so on.

    7. **`+` (Plus)**:
        - Matches one or more occurrences of the preceding character or group.
        - For example, `ba+t` would match 'bat', 'baat', 'baaat', and so on, but not 'bt'.

    8. **`?` (Question Mark)**:
        - Matches zero or one occurrence of the preceding character or group, indicating it is optional.
        - For example, `colou?r` would match both 'color' and 'colour'.

    9. **`{}` (Curly Brackets)**:
        - Specifies the exact number of occurrences or a range of occurrences.
        - For example, `a{3}` matches 'aaa', and `a{1,3}` matches 'a', 'aa', or 'aaa'.

    10. **`\` (Backslash)**:
        - Used as an escape character to treat metacharacters as literal characters.
        - For example, `\.` matches a literal dot.

    11. **`()` (Round Brackets)**:
        - Creates a capturing group, allowing you to capture and extract substrings or apply quantifiers to a group of characters.
        - For example, `(abc)+` matches 'abc', 'abcabc', 'abcabcabc', and so on.

    12. **`\b` (Word Boundary)**:
        - A zero-width assertion that matches the position between a word character (letter, digit, or underscore) and a non-word character, or the start or end of the string. It lets a pattern match whole words only.
        - For example, `\bcat\b` matches 'cat' in 'the cat sat' but not in 'concatenate'.

3. **Functions**: Python provides several functions for working with regular expressions:
   - `re.search()`: Searches for the pattern within the string and returns the first match.
   - `re.match()`: Checks for a match only at the beginning of the string.
   - `re.findall()`: Returns a list of all non-overlapping matches.
   - `re.sub()`: Replaces occurrences of the pattern with a replacement string.
   - `re.split()`: Splits the string by occurrences of the pattern.

4. **Anchors**: Anchors like `^` and `$` are used to match the start and end of a string, respectively.

5. **Character Classes**: `[ ]` is used to specify a character class. For example, `[aeiou]` matches any vowel.

6. **Quantifiers**: Quantifiers like `*`, `+`, `?`, and `{m,n}` specify how many times the preceding character or group may occur.

7. **Escape Character**: `\` is used as an escape character to treat metacharacters as literal characters. For example, `\.` matches a literal dot.

8. **Alternation**: `|` is used to specify alternatives. For example, `cat|dog` matches either 'cat' or 'dog'.
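
The functions and metacharacters listed above can be tried together on one string. The following is a minimal sketch; the sample sentence and variable names are made up for illustration.

```python
import re

text = "The cat sat. A dog barked; the cat ran."

# re.search: first match anywhere in the string (None if there is no match)
first = re.search(r"cat|dog", text)
print(first.group())  # cat

# re.match: only succeeds when the pattern matches at the very start
at_start = re.match(r"cat", text)
print(at_start)  # None

# re.findall: all non-overlapping matches; \b restricts to whole words
words = re.findall(r"\b[a-z]{3}\b", text)
print(words)  # ['cat', 'sat', 'dog', 'the', 'cat', 'ran']

# re.sub: replace every match
replaced = re.sub(r"cat|dog", "animal", text)
print(replaced)  # The animal sat. A animal barked; the animal ran.

# re.split: split on '.' or ';' plus trailing whitespace, dropping empty pieces
clauses = [part for part in re.split(r"[.;]\s*", text) if part]
print(clauses)  # ['The cat sat', 'A dog barked', 'the cat ran']
```

Note that `\b[a-z]{3}\b` skips 'The' because of the uppercase 'T', and skips 'barked' because there is no word boundary after its first three letters.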

#### `findall()`

```python
import re

# Regular expression pattern for matching email addresses
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
text = "Hello, you can reach out to mam@gmail.com or contact support@domain.com for more details."
# Use re.findall to extract all email addresses from the text
emails = re.findall(pattern, text)
print(emails)
```

**Output**:
```
['mam@gmail.com', 'support@domain.com']
```


**Pattern Breakdown**

1. **`[a-zA-Z0-9._%+-]+`**
   - **`[a-zA-Z0-9._%+-]`**: This is a character class that matches any single alphanumeric character (both uppercase and lowercase), period (`.`), underscore (`_`), percent (`%`), plus (`+`), or hyphen (`-`).
   - **`+`**: This quantifier matches one or more of the preceding character class, allowing for email local parts that can be of varying lengths.

2. **`@`**
   - This matches the literal `@` symbol in the email address.

3. **`[a-zA-Z0-9.-]+`**
   - **`[a-zA-Z0-9.-]`**: This character class matches any single alphanumeric character (both uppercase and lowercase), period (`.`), or hyphen (`-`) in the domain part of the email.
   - **`+`**: This quantifier matches one or more of the preceding character class, allowing for domain names of varying lengths.

4. **`\.`**
   - This matches the literal period (`.`) before the top-level domain. The backslash `\` is used to escape the period since a plain period matches any character in regular expressions.

5. **`[a-zA-Z]{2,}`**
   - **`[a-zA-Z]`**: This character class matches any single uppercase or lowercase letter.
   - **`{2,}`**: This quantifier specifies that the preceding character class (letters) must occur at least 2 times, which is typical for top-level domains (e.g., `.com`, `.org`).
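
To pull the two halves of each address apart, the same pattern can be wrapped in named groups. This is a sketch only: the group names `local` and `domain` and the variable `email_re` are chosen here for illustration, and the pattern remains a practical heuristic rather than a full validator of email syntax.

```python
import re

# Same pattern as in the breakdown, with named groups around the local part and the domain
email_re = re.compile(
    r"(?P<local>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"
)

text = "Hello, you can reach out to mam@gmail.com or contact support@domain.com for more details."

# finditer yields match objects, so each group can be read by name
parts = [(m.group("local"), m.group("domain")) for m in email_re.finditer(text)]
print(parts)
# [('mam', 'gmail.com'), ('support', 'domain.com')]

# fullmatch requires the whole string to fit the pattern
print(email_re.fullmatch("user@host"))  # None: no dot followed by a 2+ letter suffix
```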

#### `sub()`

```python
import re

# Regular expression pattern for matching email addresses
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Text containing email addresses
text = "Please contact us at support@example.com or sales@example.org."
# Use re.sub to replace email addresses with '[REDACTED]'
redacted_text = re.sub(pattern, '[REDACTED]', text)
print(redacted_text)
```

**Output**:
```
Please contact us at [REDACTED] or [REDACTED].
```
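
`re.sub()` also accepts a function as the replacement, which is useful when only part of each match should change. The sketch below hides the local part of each address and keeps the domain; the helper name `mask_local_part` is hypothetical, and the capturing groups are added here for illustration.

```python
import re

# Group 1 captures the local part, group 2 captures the domain
pattern = r'([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
text = "Please contact us at support@example.com or sales@example.org."

def mask_local_part(match):
    # Called once per match; the return value replaces the matched text
    return "***@" + match.group(2)

masked_text = re.sub(pattern, mask_local_part, text)
print(masked_text)
# Please contact us at ***@example.com or ***@example.org.
```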


### Tokenization

Tokenization is the process of dividing text into individual units or tokens. The choice of tokens depends on the granularity needed for the analysis. Common types of tokens include:

**Sentence Tokenization**:
Divides text into sentences. Useful for tasks where understanding the structure of the text is important.
```python
import nltk
nltk.download('punkt')  # Newer NLTK releases may ask for 'punkt_tab' instead
```

```python
from nltk.tokenize import sent_tokenize
text = "Natural language processing is amazing. It allows computers to understand human language."
sentences = sent_tokenize(text)
print(sentences)
# Output: ['Natural language processing is amazing.', 'It allows computers to understand human language.']
```


**Word Tokenization**:
Divides text into individual words. This method is often used for tasks like text classification and sentiment analysis.

```python
from nltk.tokenize import word_tokenize
text = "Natural language processing is amazing!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['Natural', 'language', 'processing', 'is', 'amazing', '!']
```
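
For a rough idea of what a word tokenizer does, a single regular expression gets surprisingly far. This is a simplified sketch using only the standard library, not NLTK's algorithm: it treats each run of word characters as a token and each punctuation mark as its own token.

```python
import re

text = "Natural language processing is amazing!"

# \w+ matches a run of word characters; [^\w\s] matches one punctuation character
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Natural', 'language', 'processing', 'is', 'amazing', '!']
```

On this sentence the result matches the `word_tokenize` output, but the two diverge on harder input such as contractions and abbreviations, where a trained or rule-rich tokenizer is the safer choice.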


**N-grams**:
Groups of `n` contiguous tokens. Useful for capturing context and patterns in text.

```python
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
text = "Natural language processing is amazing"
tokens = word_tokenize(text)
bigrams = list(ngrams(tokens, 2))
print(bigrams)
# Output: [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'amazing')]
```
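
N-grams are simple enough to build by hand, which can help when NLTK is not available. In this sketch the helper name `make_ngrams` is made up for illustration; it zips the token list against shifted copies of itself.

```python
def make_ngrams(tokens, n):
    # zip stops at the shortest shifted copy, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Natural language processing is amazing".split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
# [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'amazing')]
print(len(trigrams))  # 3
```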



**Character Tokenization**:
Divides text into individual characters. This is less common but useful for specific tasks like character-level language modeling.

```python
text = "NLP"
tokens = list(text)
print(tokens)
# Output: ['N', 'L', 'P']
```

### Stemming and Lemmatization

Stemming and lemmatization are text preprocessing techniques used in natural language processing (NLP) to reduce words to their base or root form. This helps in standardizing text and improving the effectiveness of text analysis and machine learning models.

**1. Stemming**

Stemming reduces a word to a base form by stripping suffixes with simple rules. The stem need not be a dictionary word; it is just a common root that related words share. For example, "running" and "runs" both stem to "run", while a rule-based stemmer leaves the irregular form "ran" and the noun "runner" unchanged.


**Popular Stemming Algorithms**:
- **Porter Stemmer**: Applies heuristic rules to remove common suffixes.
- **Snowball Stemmer**: An improved version of the Porter Stemmer with better support for multiple languages.

**Example with Python’s NLTK**:
```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer() # Create a Porter Stemmer object
words = ["running", "runner", "ran", "easily", "fairly"] # Example words
stems = [stemmer.stem(word) for word in words] # Stem the words
print(stems)
# Output: ['run', 'runner', 'ran', 'easili', 'fairli']
```
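
To make the idea of heuristic suffix rules concrete, here is a deliberately tiny suffix stripper. It is a hypothetical toy (`toy_stem` and its suffix list are invented for illustration) and is far simpler than the Porter algorithm, but it shows why stems such as "easi" are not always real words and why irregular forms such as "ran" are left alone.

```python
def toy_stem(word):
    # Try a few common suffixes; keep at least three characters of the word
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo a doubled final letter: "runn" -> "run"
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

words = ["running", "jumped", "cats", "easily", "ran"]
toy_stems = [toy_stem(word) for word in words]
print(toy_stems)
# ['run', 'jump', 'cat', 'easi', 'ran']
```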

**2. Lemmatization**

Lemmatization reduces a word to its dictionary form (lemma) by taking its part of speech and meaning into account. Unlike stemming, it returns actual words and usually relies on a dictionary or lexical database. For example, treated as verbs, "running" and "ran" both have the lemma "run", while the noun "runner" keeps the lemma "runner".

**Popular Lemmatization Libraries**:
- **WordNet Lemmatizer**: Uses the WordNet lexical database to find the lemma of a word.
- **spaCy**: Provides efficient lemmatization as part of its NLP pipeline.

**Example with Python’s NLTK**:
```python
import nltk
nltk.download('wordnet')
```
```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer() #Create a WordNet Lemmatizer object
words = ["running", "runner", "ran", "easily", "fairly"] #Example words
lemmas = [lemmatizer.lemmatize(word, pos=wordnet.VERB) for word in words] #Lemmatize the words
print(lemmas)
# Output: ['run', 'runner', 'run', 'easily', 'fairly']
```

## Feature Extraction in NLP

Feature extraction is the process of converting textual data into a set of numerical features. These features are then used as input for machine learning models, aiding in reducing the data's complexity while retaining relevant information needed for analysis. Common feature extraction techniques are:

1. **Bag of Words (BoW)**:
   - **Description**: Represents text as a bag (multiset) of words, ignoring grammar and word order but keeping track of word frequency.
   - **Feature Representation**: Each unique word in the corpus becomes a feature, with the value being the count (or frequency) of the word in the document.
   - **Example**: For the documents "cat sat" and "cat sat cat" with the vocabulary `['cat', 'sat']`, the feature vectors are `[1, 1]` and `[2, 1]`.

2. **N-grams**:
   - **Description**: Represents sequences of `n` words (or characters) as features, capturing word combinations and patterns.
   - **Feature Representation**: Features are created for each sequence of `n` words, with their frequencies or occurrences used as feature values.
   - **Example**: For the sentence "The cat sat", bigrams are "The cat", "cat sat".

3. **Term Frequency-Inverse Document Frequency (TF-IDF)**:
   - **Description**: Enhances BoW by weighing the importance of words based on their frequency in a document relative to their frequency across all documents.
   - **Feature Representation**: Each word is assigned a TF-IDF score, reflecting its importance in the document relative to the corpus.
   - **Example**: Words that appear frequently in a document but rarely in the corpus get higher TF-IDF scores.

4. **Word Embeddings**:
   - **Description**: Represents words as dense vectors in a continuous vector space, capturing semantic relationships between words.
   - **Feature Representation**: Pre-trained embeddings (e.g., Word2Vec, GloVe) provide vector representations for words based on their context and meaning.
   - **Example**: The word "king" might be represented by a vector `[0.2, -0.1, 0.4, ...]`.


5. **Sentence Embeddings**:
   - **Description**: Represents entire sentences or documents as dense vectors that capture the meaning of the whole text.
   - **Feature Representation**: Models like Universal Sentence Encoder or BERT provide vector representations for sentences or documents.
   - **Example**: A sentence like "The cat sat on the mat" might be represented by a vector `[0.1, -0.3, 0.2, ...]`.

### Bag of Words

The Bag of Words model represents a text document as a collection of its words, disregarding the **grammar** and the **context** but keeping track of the frequency of each word. Essentially, it transforms text into a vector of word counts or occurrences.

Suppose we have the following two sentences:

1. "programming is fun"
2. "programming is easy and easy."

**Step 1: Tokenization**

- Sentence 1: `['programming', 'is', 'fun']`
- Sentence 2: `['programming', 'is', 'easy', 'and', 'easy']`

**Step 2: Vocabulary Building**

- Vocabulary: `['and', 'easy', 'fun', 'is', 'programming']`

**Step 3: Vectorization**

- For Sentence 1: `[0, 0, 1, 1, 1]`
- For Sentence 2: `[1, 2, 0, 1, 1]`

**Implementation Example:**

```python
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "programming is fun",
    "programming is easy and easy."
]
vectorizer = CountVectorizer() #Create a Bag of Words model
X = vectorizer.fit_transform(documents)
features = vectorizer.get_feature_names_out() #Get the feature names (words in the vocabulary)
print("Feature Names:", features)
print("Bag of Words Matrix:\n", X.toarray()) #Get the Bag of Words representation
```
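
The three steps can also be reproduced by hand with the standard library, which makes it easy to check the vectors listed above. This is a minimal sketch; it lowercases the text and keeps runs of word characters as tokens, which is a simplification of what `CountVectorizer` does.

```python
import re
from collections import Counter

documents = ["programming is fun", "programming is easy and easy."]

# Step 1: tokenization (lowercase, keep runs of word characters)
tokenized = [re.findall(r"\w+", doc.lower()) for doc in documents]

# Step 2: vocabulary building (sorted unique tokens)
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 3: vectorization (count of each vocabulary word per document)
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['and', 'easy', 'fun', 'is', 'programming']
print(vectors)     # [[0, 0, 1, 1, 1], [1, 2, 0, 1, 1]]
```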

### TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is used in text processing to evaluate the importance of a word in a document relative to a collection of documents, or corpus. It helps in converting text into numerical features that can be used in machine learning algorithms and information retrieval systems. It has two main components:

**1. Term Frequency (TF):** The term frequency of a word in a document is the number of times the word appears in the document, normalized by the total number of words in that document.

$$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

**Example**:
- Document: "The cat in the hat."
- Term Frequency of "cat": 1 / 5 = 0.2

**2. Inverse Document Frequency (IDF):** The inverse document frequency measures how important a term is by considering how often it appears across all documents in the corpus. Words that appear in many documents are less informative.

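$$ \text{IDF}(t) = \log_{10} \left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right) $$

The base of the logarithm is a convention. Base 10 is used here; some libraries, such as scikit-learn, use the natural logarithm instead.

**Example**:
- Corpus: 1000 documents
- Number of documents containing "cat": 50
- IDF of "cat": $\log_{10} \left(\frac{1000}{50}\right) = \log_{10} (20) \approx 1.30$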

**3. TF-IDF Calculation:** TF-IDF is the product of TF and IDF. It gives a numerical value representing the importance of a term in a document relative to the corpus.

**TF-IDF of "cat": $0.2 \times 1.30 = 0.26$**
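
The arithmetic above can be checked in a few lines. This sketch assumes a base-10 logarithm, which is what the value 1.30 implies, and takes the corpus size of 1000 and the document frequency of 50 directly from the example.

```python
import math

document = "The cat in the hat."
# Lowercase each word and strip the trailing period
tokens = [word.strip(".").lower() for word in document.split()]

tf_cat = tokens.count("cat") / len(tokens)  # 1 / 5
idf_cat = math.log10(1000 / 50)             # log10(20)
tfidf_cat = tf_cat * idf_cat

print(round(tf_cat, 2), round(idf_cat, 2), round(tfidf_cat, 2))
# 0.2 1.3 0.26
```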



```python
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
documents = [
    "The cat in the hat",
    "The cat is on the mat",
    "The dog is on the mat"
]
vectorizer = TfidfVectorizer() #Create a TF-IDF vectorizer
tfidf_matrix = vectorizer.fit_transform(documents) #Fit and transform the documents
feature_names = vectorizer.get_feature_names_out() #Get feature names (words)
tfidf_array = tfidf_matrix.toarray() #Convert the matrix to an array and print it
print("Feature Names:", feature_names)
print("TF-IDF Matrix:\n", tfidf_array)
```
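
The numbers printed by `TfidfVectorizer` will not match the textbook formula. With its default settings, scikit-learn uses raw term counts for TF, a smoothed IDF of the form $\ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$, and then scales each document vector to unit L2 length. The sketch below follows that recipe by hand on the same three documents, under the assumption that whitespace splitting is a good enough stand-in for the library's tokenizer on this input.

```python
import math
from collections import Counter

documents = ["The cat in the hat", "The cat is on the mat", "The dog is on the mat"]
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed IDF with the natural logarithm: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    rows.append([value / norm for value in raw])  # L2-normalised row

print(vocabulary)
for row in rows:
    print([round(value, 3) for value in row])
```

A term that appears in every document, such as "the", gets an IDF of exactly 1 under this formula rather than 0, so it is down-weighted but not removed.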

### Word Embedding

Word embeddings are a type of word representation used in natural language processing (NLP) that map words or phrases into continuous vector spaces. Unlike traditional methods like the Bag of Words (BoW) or TF-IDF, which represent words as discrete tokens or counts, word embeddings provide dense, low-dimensional representations that capture semantic meaning and relationships between words.


**Popular Algorithms for Word Embeddings**:

- **Word2Vec**: Developed by Google, it uses two main models:
  - **Continuous Bag of Words (CBOW)**: Predicts a target word based on its context words.
  - **Skip-gram**: Predicts context words given a target word.

- **GloVe (Global Vectors for Word Representation)**: Developed by Stanford, it uses matrix factorization on the word co-occurrence matrix to learn embeddings.

- **FastText**: An extension of Word2Vec by Facebook, which represents words as bags of character n-grams, allowing for better handling of morphologically rich languages.

- **Transformers (e.g., BERT, GPT)**: Modern models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) provide context-sensitive embeddings and are used for more advanced NLP tasks.
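
To illustrate the Skip-gram idea, the sketch below generates the (target, context) training pairs for a small window. It shows only how the pairs are formed; the function name `skipgram_pairs` is made up for illustration, and real Word2Vec implementations add further steps such as subsampling frequent words and negative sampling.

```python
def skipgram_pairs(tokens, window):
    pairs = []
    for i, target in enumerate(tokens):
        # Context positions: up to `window` tokens on each side of the target
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

CBOW uses the same windows in the opposite direction: the context words are the input and the target word is the prediction.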


**Example with Word2Vec using Python’s `gensim` library**:

```python
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked'],
    ['the', 'cat', 'is', 'on', 'the', 'roof']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
word_vectors = model.wv #Get word vectors
vector_cat = word_vectors['cat']
vector_dog = word_vectors['dog']

print("Vector for 'cat':", vector_cat)
print("Vector for 'dog':", vector_dog)
```

**Example with `transformers` library**:
```python
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Encode text
inputs = tokenizer("The bank is by the river bank.", return_tensors="pt")
outputs = model(**inputs)

# Extract embeddings
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)
```

**Output**:
```
tensor([[[-0.0608,  0.0237, -0.0156, ..., -0.0105,  0.0223, -0.0125],
         [ 0.0462,  0.0123, -0.0312, ..., -0.0191,  0.0141,  0.0310],
         ...
         [ 0.0208, -0.0094, -0.0155, ..., -0.0028, -0.0158, -0.0185]]],
       grad_fn=<NativeLayerNormBackward>)
```

### Sentence Embedding

Sentence embedding is a technique used to convert sentences into fixed-size vectors that capture the semantic meaning of the sentences. There are various methods to obtain sentence embeddings, including traditional methods like averaging word embeddings and modern methods like using pre-trained models from libraries like Hugging Face’s `transformers`.

Here are two approaches for generating sentence embeddings:

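#### 1. Using Pre-trained Models from Hugging Face Transformers

This approach uses a pre-trained model such as BERT, loaded through the Hugging Face `transformers` library, and takes the final hidden state of the `[CLS]` token as the sentence vector. Models trained specifically for sentence similarity, such as Sentence-BERT, usually give better sentence vectors than the raw `[CLS]` output of `bert-base-uncased`.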

```python
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

def get_sentence_embedding(sentence):
    # Tokenize input sentence
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get the embeddings of the [CLS] token
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return cls_embedding.squeeze().numpy()

# Example sentence
sentence = "This is an example sentence."
embedding = get_sentence_embedding(sentence)
print("Sentence embedding:", embedding)
```

#### 2. Using Averaged Word Embeddings

This method uses pre-trained word embeddings (e.g., GloVe, FastText) and averages them to get a sentence embedding. Here’s how you can do it with GloVe embeddings:

```python
import re

import numpy as np
from gensim.models import KeyedVectors

# GloVe text files have no header line, hence no_header=True (gensim 4.0 or later)
glove_vectors = KeyedVectors.load_word2vec_format('glove.6B.50d.txt', binary=False, no_header=True)

def get_word_embeddings(word):
    # Out-of-vocabulary words fall back to a zero vector
    return glove_vectors[word] if word in glove_vectors else np.zeros(glove_vectors.vector_size)

def get_sentence_embedding(sentence):
    # Lowercase and keep only word characters so that "sentence." matches the entry "sentence"
    words = re.findall(r"\w+", sentence.lower())
    word_embeddings = [get_word_embeddings(word) for word in words]
    sentence_embedding = np.mean(word_embeddings, axis=0)
    return sentence_embedding

# Example sentence
sentence = "This is an example sentence."
embedding = get_sentence_embedding(sentence)
print("Sentence embedding:", embedding)
```

#### Explanation

1. **Hugging Face Transformers**:
   - **Model and Tokenizer**: Load a pre-trained BERT model and tokenizer from Hugging Face.
   - **Tokenization**: Convert the sentence to tokens that the model understands.
   - **Embedding Extraction**: Use the model output, specifically the final hidden state of the `[CLS]` token, as a summary vector for the sentence.

2. **Averaged Word Embeddings**:
   - **GloVe or Similar Embeddings**: Load pre-trained word vectors.
   - **Word Embeddings**: Retrieve embeddings for each word in the sentence.
   - **Average Embedding**: Compute the average of the word embeddings to get a sentence embedding.
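
The averaging approach can be tried without downloading any pre-trained vectors. The sketch below uses a hypothetical three-dimensional vocabulary (`toy_vectors`) purely for illustration; unlike a zero-vector fallback, it skips unknown words so they do not pull the average toward zero.

```python
import re

# Made-up 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "this": [1.0, 0.0, 0.0],
    "is": [0.0, 1.0, 0.0],
    "example": [0.0, 0.0, 1.0],
}

def average_embedding(sentence, vectors, dim=3):
    words = re.findall(r"\w+", sentence.lower())
    # Keep only words that have a vector
    known = [vectors[word] for word in words if word in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(component) / len(known) for component in zip(*known)]

embedding = average_embedding("This is an example sentence.", toy_vectors)
print(embedding)  # each component is 1/3: three known words, one per axis
```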

## NLP Applications

### Text Similarity Analysis

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample texts
texts = [
    "Machine learning is a field of artificial intelligence.",
    "Artificial intelligence encompasses machine learning and deep learning.",
    "I love learning about new technologies and artificial intelligence.",
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Transform texts into TF-IDF matrices
tfidf_matrix = vectorizer.fit_transform(texts)

# Compute cosine similarity between each pair of texts
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print("Cosine Similarity Matrix:")
print(cosine_sim)
```

**Output**:
```
Cosine Similarity Matrix:
[[1.         0.40055573 0.20517518]
 [0.40055573 1.         0.36320897]
 [0.20517518 0.36320897 1.        ]]
```


#### Explanation

1. **TF-IDF Vectorizer**: Converts text data into TF-IDF features.
2. **Fit and Transform**: Transforms the text data into TF-IDF matrices.
3. **Cosine Similarity**: Measures similarity between texts based on their TF-IDF representations.

The resulting `cosine_sim` matrix shows the similarity scores between each pair of texts. Each value in the matrix ranges from 0 to 1, where 1 indicates complete similarity and 0 indicates no similarity.
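
Cosine similarity itself is a short formula: $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The following standard-library sketch (the function name `cosine` is chosen here for illustration) shows the computation on small vectors.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel vectors, similarity close to 1
orthogonal = cosine([1, 0, 0], [0, 1, 0])      # no shared terms, similarity 0
print(same_direction, orthogonal)
```

Because the score depends only on direction, a document and a longer document with the same word proportions are treated as equally similar to any third text.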

### Text Classification

```python
import pandas as pd
from sklearn import preprocessing, linear_model, metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
url = 'https://drive.google.com/file/d/1mcTm9p3zqfXZwEmT-iU6f2w9AxGxS6yg/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
df = pd.read_csv(path)
df.head()
```

**Output**:

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | Stuning even for the non-gamer: This sound tra... | 2 |
| 1 | The best soundtrack ever to anything.: I'm rea... | 2 |
| 2 | Amazing!: This soundtrack is my favorite music... | 2 |
| 3 | Excellent Soundtrack: I truly like this soundt... | 2 |
| 4 | Remember, Pull Your Jaw Off The Floor After He... | 2 |


```python
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'])
```

```python
# Create a count vectorizer object.
# Note: fitting on the full dataset lets test-set vocabulary shape the features;
# fit on X_train only for a stricter evaluation.
count_vect = CountVectorizer()
count_vect.fit(df['text'])

# Transform the training and testing data using the count vectorizer object
X_train_cv = count_vect.transform(X_train)
X_test_cv = count_vect.transform(X_test)
```

```python
# Word-level TF-IDF (fitted on the full dataset, as above)
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(df['text'])
X_train_tfidf = tfidf_vect.transform(X_train)
X_test_tfidf = tfidf_vect.transform(X_test)
```

```python
from sklearn import svm

def train_model(classifier, X_train, y_train, X_test, y_test):
    # Fit on the training split, then score predictions on the held-out split
    classifier.fit(X_train, y_train)
    predictions = classifier.predict(X_test)
    return metrics.accuracy_score(y_test, predictions)
```

```python
# Train Classifier on Count Vectors
accuracy = train_model(svm.SVC(), X_train_cv, y_train, X_test_cv, y_test)
print("Count Vectors Accuracy: ", accuracy)

# Train Classifier on Word Level TF IDF Vectors
accuracy = train_model(svm.SVC(), X_train_tfidf, y_train, X_test_tfidf, y_test)
print("WordLevel TF-IDF Accuracy: ", accuracy)
```

**Output**:
```
Count Vectors Accuracy:  0.846
WordLevel TF-IDF Accuracy:  0.8692
```

Exact values vary between runs because `train_test_split` shuffles the data randomly unless `random_state` is set.
