# NLP Sept: 2024

This notebook is an end-to-end NLP crash course for September 2024.

Authors:
- Eng. Ahmed Métwalli
- Eng. Alia Elhefny

## Section 1: Introduction to Natural Language Processing (NLP)

### 1.1 Introduction to NLP

Natural Language Processing (NLP) is a multidisciplinary field that combines computer science, linguistics, and artificial intelligence to enable machines to understand, interpret, and generate human language. NLP is ubiquitous in the modern digital world, powering applications such as voice assistants, translation services, and sentiment analysis.

#### 1.1.1 What is NLP in the Real World?

NLP is applied in various real-world scenarios, including:

- **Text Analysis and Summarization**: Automated generation of concise summaries from large documents.
- **Sentiment Analysis**: Assessing the emotional tone of texts in social media or customer reviews.
- **Machine Translation**: Translating text from one language to another, as seen in Google Translate.
- **Chatbots and Virtual Assistants**: Enabling conversational interfaces in applications like Siri, Alexa, and customer support bots.

#### 1.1.2 NLP Tasks

Key tasks in NLP include:

- **Tokenization**: Splitting text into individual words or phrases.
- **Named Entity Recognition (NER)**: Identifying and classifying entities like names, places, and organizations.
- **Part-of-Speech (POS) Tagging**: Determining the grammatical category (noun, verb, etc.) of each word.
- **Dependency Parsing**: Analyzing the grammatical structure of a sentence.
- **Text Classification**: Assigning predefined categories to text data, such as spam detection in emails.
- **Sentiment Analysis**: Detecting the sentiment or emotion expressed in text.

### 1.2 What is Language?

Language is a complex system of communication used by humans, comprising various components that convey meaning and facilitate interaction. It consists of several fundamental building blocks:

#### 1.2.1 Building Blocks of Language

1. **Phonemes**: The smallest units of sound in a language. For example, the word "cat" has three phonemes: /k/, /æ/, and /t/.
2. **Morphemes**: The smallest units of meaning. "Unbelievable" has three morphemes: "un-", "believe", and "-able".
3. **Lexemes**: The abstract unit behind all inflected forms of a single word. For example, the lexeme "run" covers "run", "runs", "ran", and "running".
4. **Syntax**: The arrangement of words and phrases to create well-formed sentences. It governs the grammatical structure of language.
5. **Context**: The situational background that influences the meaning of words and sentences.

### 1.3 Introduction to Approaches to NLP

NLP can be approached using various methods, each with its strengths and limitations:

#### 1.3.1 Heuristics-Based NLP

- Utilizes rule-based methods to process language.
- Effective for well-defined, small-scale problems.
- Example: Regular expressions for pattern matching in text.

#### 1.3.2 Machine Learning for NLP

- Uses statistical methods and algorithms to learn from data.
- Techniques include supervised and unsupervised learning.
- Common algorithms: Naive Bayes, Support Vector Machines (SVM), and decision trees.

#### 1.3.3 Deep Learning for NLP

- Leverages neural networks, especially deep neural networks, to model complex language patterns.
- Significant advancements have been made with architectures like Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers.
- Applications include language translation, text generation, and question answering.


## Lab 1: Environment + Hands-on Regex (Heuristic Based)

### Environment Preparation:
- Download anaconda: https://www.anaconda.com/download/success
- Create a new environment called 'NLP_SEPT_2024'
    - Set Python version 3.11.x
    - Install:
        - Notebook
        - JupyterLab
        - VS Code
        - CMD Prompt
        - Powershell Prompt
    - In Python install basic packages (pip install `package`):
        - pandas
        - numpy
        - matplotlib
        - seaborn
        - wordcloud # Visualizing the most frequent words in a corpus.
        - nltk # Tokenization, POS tagging, stemming, and more.
        - spacy # Named Entity Recognition (NER), dependency parsing, and part-of-speech tagging.
        - textblob # Sentiment analysis, translation, and language detection.
        - gensim # Topic modeling, document similarity, and word embeddings.
        - torch # Implementing custom deep learning architectures, fine-tuning models like BERT for NLP tasks.
        - transformers # Text classification, translation, question answering, and language generation using pre-trained models.
        - sentence-transformers # Text similarity, clustering, and retrieval tasks.
        - tensorflow
        - keras
        - scikit-learn
        - chime

### Regex Hands-on

Common Regex Patterns (https://docs.python.org/3/library/re.html):
- `.`: Matches any character except a newline.
- `^`: Matches the start of the string.
- `$`: Matches the end of the string.
- `*`: Matches 0 or more repetitions of the preceding element.
- `+`: Matches 1 or more repetitions of the preceding element.
- `?`: Matches 0 or 1 repetition of the preceding element.
- `{n}`: Matches exactly n repetitions of the preceding element.
- `{n,}`: Matches n or more repetitions of the preceding element.
- `{n,m}`: Matches between n and m repetitions of the preceding element.
- `[]`: Matches any one of the enclosed characters.
- `|`: Alternation; matches either the pattern before or after the `|`.
- `()`: Groups multiple patterns into one.
- `\d`: Matches any digit (equivalent to `[0-9]`).
- `\D`: Matches any non-digit character.
- `\s`: Matches any whitespace character (spaces, tabs, newlines).
- `\S`: Matches any non-whitespace character.
- `\b`: Matches a word boundary (the position between a word and a non-word character).
- `\B`: Matches a non-word boundary.
- `\w`: Matches any word character (alphanumeric plus underscore).
- `\W`: Matches any non-word character.

- Practice REGEX: https://regex101.com/r/rsVgaP/1
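
A few of the patterns above in action. This is a minimal sketch: the sample string and the variable names are made up for illustration and are not part of the lab exercises.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns
sample = "Order 66 shipped on day 7; orders 1024 and 5 are pending."

# \d+ : one or more digits in a row
numbers = re.findall(r"\d+", sample)

# \b\d{1,2}\b : whole numbers that are exactly one or two digits long
short_numbers = re.findall(r"\b\d{1,2}\b", sample)

# [Oo]rders? : a character class followed by an optional trailing 's'
order_words = re.findall(r"[Oo]rders?", sample)

# re.sub replaces every match with the given replacement string
masked = re.sub(r"\d+", "#", sample)

print(numbers)        # ['66', '7', '1024', '5']
print(short_numbers)  # ['66', '7', '5']
print(order_words)    # ['Order', 'orders']
print(masked)         # Order # shipped on day #; orders # and # are pending.
```

Note how `\b` keeps `1024` out of `short_numbers`: there is no word boundary between its digits, so a one- or two-digit match cannot end inside it.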

In [2]:
import re
import pandas as pd
import numpy as np


In [19]:
# Example strings
text = "You should call 911 now. 911 is the emergency number"


In [None]:
# Find all numbers in the text
pattern = ...
# Use re.findall()
matches = ...
print(f"Numbers found: {matches}")

In [None]:
# Replace all numbers with the string '999'
pattern = ...
# hint use re.sub()
replaced_text = ...
print(replaced_text)


In [177]:
# Extract Email 
data = {'emails': ['john.doe@example.com', 'jane_smith@abc.co.uk', 'invalid.email@com']}
df = pd.DataFrame(data)

# Hint: Use df['col'].str.extract()

# Extract the username separately
# ^: Start of the string.
# ([\w.%+-]+): Captures the username part of the email.
# [\w.%+-]: Matches any word character (letters, digits, underscores), plus the special characters . % + -.
# +: One or more of the preceding characters.
# @: Matches the literal @ symbol, which is required to separate the username and domain.
df['username'] = ...

# Extract the domain separately
# @: Matches the literal @ symbol, which precedes the domain part.
# ([\w.-]+\.[a-zA-Z]{2,}): Captures the domain part of the email.
# [\w.-]+: Matches the main domain part, including letters, digits, hyphens, and dots.
# \.[a-zA-Z]{2,}: Matches the top-level domain (TLD) with at least two letters.
# $: End of the string.
df['domain'] = ...
df


Unnamed: 0,emails,username,domain
0,john.doe@example.com,john.doe,example.com
1,jane_smith@abc.co.uk,jane_smith,abc.co.uk
2,invalid.email@com,invalid.email,


In [147]:
# Email Validation
def validate_email(email):
    pattern = ...
    # ^(?!.*\.\.): Negative lookahead to ensure there are no consecutive dots in the email string.
    # [a-zA-Z0-9._%+-]+: Matches the local part (username) of the email. Allows letters, digits, and special characters such as ., _, %, +, -.
    # @[a-zA-Z0-9.-]+: Matches the domain part. Allows letters, digits, hyphens, and dots. This pattern allows a single dot but not consecutive dots within the domain part.
    # \.[a-zA-Z]{2,6}$: Matches the TLD with 2 to 6 alphabetic characters, which covers most common TLDs like .com, .org, .museum, etc.
    return bool(re.match(pattern, email))
# Test the refined function
emails = ['test.email@example.com', 'invalid-email@.com', 'name@domain.co', 'test..email@example.com', 'test@domain.c', 'test@domain.toolongtld']
results = [validate_email(email) for email in emails]
print(f"Validation Results: {results}")



Validation Results: [True, False, True, False, False, False]
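
One possible way to fill in the pattern is to concatenate the pieces described in the comments above. The sketch below does exactly that; the constant and function names are hypothetical, and the pattern is a practical heuristic rather than a full implementation of the email address standard.

```python
import re

# Assembled from the commented pieces: lookahead, local part, domain, TLD.
EMAIL_PATTERN = r"^(?!.*\.\.)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$"

def validate_email_sketch(email):
    """Return True if the whole string matches EMAIL_PATTERN."""
    return bool(re.match(EMAIL_PATTERN, email))

emails = ['test.email@example.com', 'invalid-email@.com', 'name@domain.co',
          'test..email@example.com', 'test@domain.c', 'test@domain.toolongtld']
email_results = [validate_email_sketch(email) for email in emails]
print(f"Validation Results: {email_results}")
# Validation Results: [True, False, True, False, False, False]
```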


In [32]:
# Phone Number Validation
def validate_phone_number(number):
    # Pattern to match common phone number formats
    pattern = ...

    # ^: Start of the string.
    # (\+\d{1,3}[-.\s]?)?: Matches the optional country code part.
        # \+: Matches a literal plus sign '+' at the start, indicating an international code.
        # \d{1,3}: Matches 1 to 3 digits for the country code (e.g., '1' for the US, '44' for the UK).
        # [-.\s]?: Matches an optional separator, which can be a hyphen '-', a dot '.', or a space ' '.
        # ?: Makes the entire country code part optional.
    # (\(?\d{3}\)?[-.\s]?)?: Matches the optional area code part.
        # \(?\d{3}\)?: Matches 3 digits for the area code, which may or may not be enclosed in parentheses. 
            # - \(? : Matches an optional opening parenthesis '('.
            # - \d{3}: Matches exactly 3 digits for the area code.
            # - \)?: Matches an optional closing parenthesis ')'.
        # [-.\s]?: Matches an optional separator (hyphen, dot, or space).
            # ?: Makes the entire area code part optional.
    # (\d{3}[-.\s]?\d{4}): Matches the main phone number part.
        # \d{3}: Matches exactly 3 digits.
        # [-.\s]?: Matches an optional separator (hyphen, dot, or space).
        # \d{4}: Matches exactly 4 digits for the remaining part of the phone number.
    # $: End of the string. Ensures that the pattern matches the entire phone number from start to end.
    
    return bool(re.fullmatch(pattern, number))


# Test the function with a list of phone numbers
numbers = ['+1-800-555-5555',  # Valid: Includes country code and separators.
           '(123) 456 7890',   # Valid: Area code in parentheses and spaces as separators.
           '12345']            # Invalid: Too short to be a valid phone number.

# Validate each phone number using the function
results = [validate_phone_number(number) for number in numbers]

# Display the validation results for each phone number
print(f"Validation Results: {results}")


Validation Results: [True, True, False]
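
The three commented groups (country code, area code, main number) can be joined into a single pattern. This is a sketch of one possible answer, with hypothetical names; real-world phone formats vary far more than this pattern allows.

```python
import re

# Optional country code, optional area code, then the 3+4 digit main number.
PHONE_PATTERN = r"^(\+\d{1,3}[-.\s]?)?(\(?\d{3}\)?[-.\s]?)?(\d{3}[-.\s]?\d{4})$"

def validate_phone_sketch(number):
    """Return True if the whole string matches PHONE_PATTERN."""
    return bool(re.fullmatch(PHONE_PATTERN, number))

numbers_to_check = ['+1-800-555-5555', '(123) 456 7890', '12345']
phone_results = [validate_phone_sketch(n) for n in numbers_to_check]
print(f"Validation Results: {phone_results}")
# Validation Results: [True, True, False]
```

Because `re.fullmatch` already anchors the match at both ends, the `^` and `$` in the pattern are redundant here but harmless.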


In [20]:
# Extract URLs

# Extracting URLs from the given text
text = 'Visit our website at https://www.example.com or follow us at http://blog.example.com'

# Define the pattern to match URLs
pattern = ...

# https?://: 
# - https?: Matches the literal 'http' followed optionally by 's'. This means it can match both 'http' and 'https'.
# - ://: Matches the literal characters '://', which are required after 'http' or 'https' in a URL.

# [a-zA-Z0-9./-]+:
# - [a-zA-Z0-9./-]: Character set that matches any of the following characters:
#   - a-z: Lowercase English letters.
#   - A-Z: Uppercase English letters.
#   - 0-9: Digits.
#   - . (dot): Matches the literal dot, which is used in domain names and paths.
#   - / (forward slash): Matches the literal slash, which is used to separate different parts of the URL.
#   - - (hyphen): Matches the literal hyphen, which can be part of domain names or paths.
# - +: Quantifier that matches one or more of the preceding characters in the set, ensuring the pattern matches the entire URL.

# Hint: Use re.findall()
urls = ...

print(f"Extracted URLs: {urls}")


Extracted URLs: ['https://www.example.com', 'http://blog.example.com']
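
Putting the two commented pieces together gives one possible pattern. The sketch below uses hypothetical variable names and only handles the simple URLs in this example (no query strings, ports, or fragments).

```python
import re

url_text = 'Visit our website at https://www.example.com or follow us at http://blog.example.com'

# Scheme (http or https), then one or more letters, digits, dots, slashes, or hyphens
URL_PATTERN = r"https?://[a-zA-Z0-9./-]+"

found_urls = re.findall(URL_PATTERN, url_text)
print(f"Extracted URLs: {found_urls}")
# Extracted URLs: ['https://www.example.com', 'http://blog.example.com']
```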


In [171]:
# Extract Birthday
# Sample text containing dates
text = "John's birthday is on 23/05/1995 and Mary's is on 15-04-1992."

# Define the pattern to match date formats
pattern = ...

# \b: Matches a word boundary, so the pattern matches whole numbers and not parts of larger strings.
    # - This prevents partial matches like '123' in '123abc'.
# \d{1,2} and \d{2,4}:
# - \d: Matches any digit from 0 to 9.
# - {1,2}: Matches one or two digits for the day or month part, allowing for numbers like '3' or '23'.
# - {2,4}: Matches two to four digits for the year part, allowing for numbers like '95' or '1995'.
# [-/]: Matches either a hyphen '-' or a forward slash '/', which are common separators in date formats.


# Hint Use re.findall()
dates = ...

print(f"Extracted Dates: {dates}")


Extracted Dates: ['23/05/1995', '15-04-1992']
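
A sketch of one pattern that fits the description above: day, separator, month, separator, year, all wrapped in word boundaries. The names are hypothetical, and the pattern checks only the shape of a date, not whether it is a valid calendar date.

```python
import re

date_text = "John's birthday is on 23/05/1995 and Mary's is on 15-04-1992."

# 1-2 digits, a separator, 1-2 digits, a separator, then 2-4 digits
DATE_PATTERN = r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b"

found_dates = re.findall(DATE_PATTERN, date_text)
print(f"Extracted Dates: {found_dates}")
# Extracted Dates: ['23/05/1995', '15-04-1992']
```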


In [179]:
# Split text into sentences
text = "Hello there! How are you today? Let's learn regex."

# Define the pattern to split the text into sentences
pattern = ...

# Explanation of the pattern:
# [.!?]:
# - [ ]: Square brackets define a character class, which matches any one of the enclosed characters.
# - .: Matches a literal period (.) which marks the end of a sentence.
# - !: Matches a literal exclamation mark (!) which marks the end of an exclamatory sentence.
# - ?: Matches a literal question mark (?) which marks the end of a question.

# Hint: Use re.split(pattern,text)
sentences = ...

sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

print(f"Sentences: {sentences}")


Sentences: ['Hello there', 'How are you today', "Let's learn regex"]
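
A sketch of the split using the character class described above (names are hypothetical). Splitting on every `.`, `!`, or `?` is a rough heuristic: it would also break on abbreviations such as "Dr." or decimals such as "3.14".

```python
import re

sentence_text = "Hello there! How are you today? Let's learn regex."

# Split wherever a period, exclamation mark, or question mark occurs
raw_parts = re.split(r"[.!?]", sentence_text)

# Strip whitespace and drop the empty string left after the final period
split_sentences = [part.strip() for part in raw_parts if part.strip()]
print(f"Sentences: {split_sentences}")
# Sentences: ['Hello there', 'How are you today', "Let's learn regex"]
```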


# Section 2: Approaches & Detailed NLP Pipeline

## 1. Introduction to Approaches to NLP

### Heuristics-Based NLP
Heuristic approaches rely on hand-written rules and patterns rather than on data-driven methods. Regular expressions (regex) are a common tool for this.

**Example:** Named Entity Recognition (NER) using heuristics might involve defining a set of patterns to detect names, locations, or dates.

In [None]:
import re
text = "Ahmed lives in Egypt"
pattern = r"..."  # Simple heuristic for proper nouns
# Use \b and make sure the first letter is capital and then get the rest of the word
entities = re.findall(pattern, text)
print(entities)  # ['Ahmed', 'Egypt']

['Ahmed', 'Egypt']
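
One pattern that follows the hint (word boundary, a capital first letter, then the rest of the word) is sketched below with hypothetical names. It is a crude heuristic: it would also pick up any capitalized word at the start of a sentence.

```python
import re

ner_text = "Ahmed lives in Egypt"

# Word boundary, one uppercase letter, one or more lowercase letters, word boundary
PROPER_NOUN_PATTERN = r"\b[A-Z][a-z]+\b"

found_entities = re.findall(PROPER_NOUN_PATTERN, ner_text)
print(found_entities)  # ['Ahmed', 'Egypt']
```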


### Machine Learning for NLP
Machine learning approaches involve training models on labeled data. A basic example is using the Naive Bayes algorithm for text classification. One common method for text representation in machine learning is the bag-of-words model, which transforms text into numerical features by counting the occurrences of words.

**Equation:**

$$
P(y|x) = \frac{P(x|y)P(y)}{P(x)}
$$

This is the basis of the Naive Bayes classifier, where:
- \( y \) is the class label (e.g., positive or negative sentiment),
- \( x \) is the text feature (e.g., a sequence of words),
- \( P(y|x) \) is the posterior probability of class \( y \) given the text feature \( x \),
- \( P(x|y) \) is the likelihood of text feature \( x \) given class \( y \),
- \( P(y) \) is the prior probability of class \( y \),
- \( P(x) \) is the probability of the text feature \( x \).
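
To make the formula concrete, here is a minimal from-scratch sketch, not the scikit-learn implementation. It assumes the usual "naive" simplification that words are conditionally independent given the class, so \( P(x|y) \) becomes a product of per-word probabilities, and it uses add-one (Laplace) smoothing. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Estimate log P(y) and log P(word|y) with add-one smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[y][word] = count
    for text, y in zip(texts, labels):
        word_counts[y].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {y: math.log(n / len(labels)) for y, n in class_counts.items()}
    log_likelihood = {}
    for y in class_counts:
        total = sum(word_counts[y].values()) + len(vocab)
        log_likelihood[y] = {w: math.log((word_counts[y][w] + 1) / total) for w in vocab}
    return log_prior, log_likelihood, vocab

def predict_naive_bayes(text, log_prior, log_likelihood, vocab):
    """Return the class with the highest log P(y) + sum of log P(word|y).

    P(x) is the same for every class, so it can be dropped from the comparison.
    Words outside the training vocabulary are ignored.
    """
    words = [w for w in text.lower().split() if w in vocab]
    scores = {y: log_prior[y] + sum(log_likelihood[y][w] for w in words)
              for y in log_prior}
    return max(scores, key=scores.get)

# Toy data: 1 = positive, 0 = negative
toy_texts = ["great food", "great service", "terrible food", "terrible slow service"]
toy_labels = [1, 1, 0, 0]
params = train_naive_bayes(toy_texts, toy_labels)
print(predict_naive_bayes("great", *params))     # 1
print(predict_naive_bayes("terrible", *params))  # 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.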

**Example:** Text classification using Scikit-learn


In [None]:
# Importing necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class Utility:
    @staticmethod
    def vectorize_and_split(texts,labels):
        # Step 1: Vectorizing the text data using CountVectorizer (Bag-of-Words model)
        vectorizer = ... # Use count vectorizer
        X_v = vectorizer.fit_transform(...)  # Transform text into a matrix of token counts
        # Step 2: Splitting data into training and test sets
        X_train, ..., ..., y_test = train_test_split(..., labels, test_size=0.2, random_state=42)
        return X_train,...,...,y_test,vectorizer
    @staticmethod
    def model_pipeline(X_train,X_test,y_train,y_test):
        # Step 3: Initializing the Naive Bayes model (MultinomialNB is suitable for text classification)
        model = ... # Use the multinomial NB
        # Step 4: Training the model on the training data
        ... # Train the model
        # Step 5: Making predictions on the test data
        y_pred = ... # Predict
        # Step 6: Evaluating the model's performance
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy: {accuracy * 100:.2f}%")
        return ...
    @staticmethod
    def model_predict_and_evaluate(model,new_texts,vectorizer):
        X_new = vectorizer.transform(...) # transform new data
        predictions = model.predict(...) # predict new data 
        
        # Output predictions
        for text, pred in zip(new_texts, predictions):
            sentiment = "Positive" if pred == 1 else "Negative/Neutral"
            print(f"Text: '{text}' => Sentiment: {sentiment}")

class Helper:
    @staticmethod
    def run_pipeline(new_texts,texts,labels):
        X_train,...,...,y_test,vectorizer = Utility.vectorize_and_split(texts=texts,labels=labels)
        model = Utility.model_pipeline(X_train=X_train,
                                       X_test=...,
                                       y_train=...,
                                       y_test=y_test)
        Utility.model_predict_and_evaluate(model,new_texts,vectorizer)


In [None]:
# Data
# Predicting the sentiment of new texts
new_texts = [
    "Ahmed enjoys coding in Python",
    "The service was horrible and I am disappointed",
    "Quantum mechanics is an intriguing field",
    "I didn't like the food at all, it was tasteless"
]
# Sample dataset with texts and corresponding sentiment labels (1: positive, 0: neutral/negative)
texts = [
    "Ahmed loves NLP and enjoys learning about machine learning",
    "Egypt has beautiful landscapes and rich culture",
    "Heuristics are simple yet effective in problem solving",
    "Python is a great language for data science and AI",
    "Natural language processing is fascinating",
    "I don't like spam emails",
    "This movie was very boring and too long",
    "The weather in Egypt is warm and sunny most of the year",
    "I had a terrible customer service experience",
    "AI is transforming industries and creating new opportunities",
    "This is the worst product I have ever bought",
    "I love exploring new places and experiencing different cultures",
    "The food at the restaurant was fantastic",
    "The traffic in Cairo is terrible during rush hours",
    "I enjoy spending time with my family on the weekends",
    "The company's customer service was exceptional",
    "This is an average book with no exciting plot",
    "I am fascinated by advancements in quantum computing",
    "The park near my house is always clean and peaceful",
    "This phone has terrible battery life, I am disappointed",
    # Additional data:
    "The music was beautiful and uplifting",
    "I am not happy with the slow internet speed",
    "The staff at the hotel were very polite and helpful",
    "This software is extremely buggy and crashes often",
    "I had a great time at the concert",
    "I am fed up with all the ads in this app",
    "The customer support was rude and unhelpful",
    "The car's performance was beyond my expectations",
    "I hate waiting in long lines",
    "The book was interesting but too long",
    "I had a wonderful vacation with my family",
    "This smartphone is overpriced and not worth the money",
    "I enjoyed watching the latest movie at the cinema",
    "The delivery service was fast and efficient",
    "This laptop is lightweight and has a long battery life",
    "I am dissatisfied with the product's quality",
    "The new restaurant has a cozy atmosphere and delicious food",
    "The airline lost my luggage and I am very upset",
    "The presentation was informative and well-organized",
    "The hotel room was dirty and smelled bad"
]
# Labels: 1 for positive sentiment, 0 for neutral or negative sentiment
labels = [
    1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
    # Additional labels:
    1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0
]

In [None]:
Helper.run_pipeline(new_texts=new_texts,
                    texts=texts,
                    labels=labels)

Accuracy: 75.00%
Text: 'Ahmed enjoys coding in Python' => Sentiment: Positive
Text: 'The service was horrible and I am disappointed' => Sentiment: Negative/Neutral
Text: 'Quantum mechanics is an intriguing field' => Sentiment: Positive
Text: 'I didn't like the food at all, it was tasteless' => Sentiment: Positive


#### Terminology from the Last Example

How CountVectorizer Works:
CountVectorizer is a simple and effective tool used in Natural Language Processing (NLP) to convert a collection of text documents into a matrix of token counts. It essentially creates a Bag-of-Words (BoW) representation of the text data, where:

- Tokenization: It splits the text into words or tokens.
- Vocabulary creation: It builds a vocabulary (unique words from the dataset).
- Count representation: Each text document is then represented as a vector of word counts, i.e., how many times each word from the vocabulary appears in each document.
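
The three steps above can be sketched in plain Python. This is a simplified illustration with hypothetical names and toy documents; the real CountVectorizer uses its own tokenization rules and returns a sparse matrix.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Tokenization: lowercase and split on whitespace (a simplification)
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary creation: sorted unique tokens, each mapped to a column index
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: index for index, word in enumerate(unique_words)}

# Count representation: one row per document, one column per vocabulary word
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(count_matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```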

---

Is CountVectorizer the Best Option?
While CountVectorizer is effective in many cases, it has limitations:

- Ignores word importance: It treats all words equally, meaning frequent words like "the" or "is" get as much weight as important words, which may not be desirable.
- Doesn't consider word context: It doesn't capture the relationship between words (e.g., "New York" is treated as two separate words rather than a single entity).
- Sparse representations: For large vocabularies, the resulting matrix can be very sparse, with many zero entries.

**Other Options and Their Differences:**

**TfidfVectorizer** (Term Frequency-Inverse Document Frequency):

- What it does: It transforms text into a matrix of TF-IDF features. It takes into account the frequency of a word in a document and its frequency across all documents. This way, less frequent but important words are given higher weight compared to common words.
- Why it's better: It reduces the importance of frequently occurring words that may not be informative (like "the" or "is"), leading to better performance for some tasks.
- Example: Words that appear in fewer documents will have higher weight, while common words will have a lower weight.


**HashingVectorizer**:

- What it does: Similar to CountVectorizer, but instead of building a vocabulary, it uses a hashing trick to map terms to indices in a fixed-size vector space.
- Why it's different: No need to store a vocabulary, which saves memory. This can be useful when dealing with very large datasets.

- Downside: It's not possible to reverse the transformation and map back to the original words since hashing is a one-way process.
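
The hashing trick itself can be sketched in a few lines. This is an illustration under stated assumptions, not how scikit-learn implements it: MD5 is used here only because it ships with the standard library, and `hash_vectorize` and `n_features=8` are hypothetical choices.

```python
import hashlib

def hash_vectorize(text, n_features=8):
    """Map each token to a column with a stable hash, then count occurrences."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column chosen by the hash
        vector[index] += 1
    return vector

v1 = hash_vectorize("the cat sat on the mat")
v2 = hash_vectorize("the cat sat on the mat")
print(v1)
print(v1 == v2)  # True: no vocabulary is stored, yet the mapping is repeatable
```

With a small `n_features`, different words can land in the same column (a collision), which is the price paid for the fixed memory footprint.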

**Word2Vec or GloVe (Word Embeddings)**:

- What it does: Instead of representing words based on their counts, it captures the semantic meaning of words by embedding them into dense vectors. Similar words are mapped close to each other in the vector space.
- Why it's better: It captures semantic relationships between words, so words like "king" and "queen" would have similar vectors, while traditional vectorizers treat them as completely different.
- Downside: More complex to train and use, but it can provide more informative features, especially for tasks that benefit from semantic understanding.


**BERT, GPT, and other transformer-based models**:

- What they do: These models use deep learning to generate context-aware word embeddings. Each word is represented based on the context in which it appears.
- Why it's better: Unlike CountVectorizer or TfidfVectorizer, they capture the meaning of a word depending on its surrounding words, making them much more powerful for complex NLP tasks like sentiment analysis, text classification, and question answering.
- Downside: Computationally expensive and require large datasets for training. They also need more processing power compared to simpler vectorizers.

When to Use Which?
- For simple tasks with limited data: CountVectorizer or TfidfVectorizer is usually sufficient. TfidfVectorizer tends to perform better in cases where the frequency of common words needs to be downplayed.

- For large-scale or production tasks: HashingVectorizer is useful when dealing with very large datasets, especially when memory is a concern.

- For advanced NLP tasks requiring semantic understanding: Word embeddings (Word2Vec, GloVe) or transformer-based models like BERT are more powerful, capturing context and meaning.

### Deep Learning for NLP
Deep learning approaches, such as Recurrent Neural Networks (RNNs) and Transformers, have revolutionized NLP by modeling complex dependencies in text data. RNNs process text one token at a time, carrying a hidden state from step to step, while Transformers, like BERT, use attention mechanisms that allow all positions in a sequence to be processed in parallel.

**Equation:** 
The attention mechanism in Transformers is defined as:

$$
\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
$$

Where:
- \( Q \) is the **query**,
- \( K \) is the **key**,
- \( V \) is the **value**,

and \( d_k \) is the dimension of the key.
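
The formula can be traced step by step with a small pure-Python sketch. The helper names and the toy matrices are hypothetical; in practice this would be done with NumPy or PyTorch tensors, and real models add learned projections and multiple heads on top.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return (output, weights)."""
    d_k = len(K[0])                    # dimension of each key vector
    scores = matmul(Q, transpose(K))   # similarity of every query with every key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Two tokens, each with 2-dimensional query, key, and value vectors
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
print(weights)  # each row sums to 1
print(output)   # each row is a weighted average of the rows of V
```

Each query attends most strongly to the key it is most similar to, so the first output row is pulled toward the first row of `V`.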

**Example:** Using a Hugging Face BERT-family model for text classification


In [None]:
from transformers import ...
"""Imports the pipeline function from the Hugging Face Transformers library. 
The pipeline is a high-level abstraction that allows you to easily use state-of-the-art NLP models without worrying about the underlying complexity."""


classifier = ...(...) # use sentiment analysis from pipeline
""" Uses Bidirectional Encoder Representations from Transformers (BERT).
This creates a pre-configured pipeline for sentiment analysis.
By default, Hugging Face uses a pre-trained model from the BERT family, specifically DistilBERT, which has been fine-tuned for sentiment analysis. It will automatically download the required model weights and tokenizer when you first run the code."""


result = ...("Ahmed enjoys studying NLP")
"""Text tokenization: Behind the scenes, the input text is first tokenized, which means it's split into smaller units (tokens) that the model can process.
Model prediction: The pre-trained model then processes these tokens and predicts whether the text expresses a positive or negative sentiment. Since the text is about "enjoying" something, the model will likely predict positive sentiment.
Post-processing: The result is then converted back into a human-readable label, which in this case is either "POSITIVE" or "NEGATIVE", along with a confidence score."""
print(result)

## 2. NLP Pipeline in Detail

### 1. Data Acquisition
The first step in any NLP project is acquiring the data. Data can come from web scraping, APIs, or existing datasets.

**Example:** Load a dataset (here, the 20 Newsgroups dataset bundled with scikit-learn)


In [None]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train')
df = pd.DataFrame({'text': data.data, 'label': data.target})

# Take a look at the data


### 2. Text Cleaning
Text cleaning involves removing noise such as punctuation, numbers, and stopwords, and converting text to lowercase.

**Example:** Simple text cleaning


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Metwalli\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [None]:
import re
from nltk.corpus import stopwords

"""nltk.corpus.stopwords: provides a corpus of stopwords (common words like "the", "is", "in", etc.) for various languages. Stopwords are often removed in text preprocessing since they don't contribute much to the meaning of the text for certain tasks like classification."""

def clean_text(text):
    """Purpose: Text cleaning and normalization are crucial steps before further processing or analysis, especially in Natural Language Processing (NLP). This ensures the text is free from unnecessary noise like digits, punctuation, and stopwords, and that it’s all in a consistent format (lowercase)."""
    #Hint: Use .sub() from regex and remove numbers and punctuation
    text = ... # This removes all numbers from the text 
    text = ... # regular expression that matches any character that is not a word (\w) or whitespace (\s), effectively removing punctuation.
    text = ... #Converts the text to lowercase to standardize it.
    text = ' '.join([word for word in text.split() if word not in ...('english')]) # Remove stop words
    return text

df['cleaned_text'] = df['text'].apply(...)
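
The same cleaning steps can be tried on a single string without downloading anything. In this sketch the stopword set is a tiny hand-picked stand-in for NLTK's English list, and the function name is hypothetical.

```python
import re

# Tiny hand-picked stopword set, for illustration only (NLTK's list is much longer)
STOPWORDS = {"the", "is", "in", "a", "an", "and", "of", "to", "it"}

def clean_text_sketch(text):
    """Remove digits and punctuation, lowercase, and drop stopwords."""
    text = re.sub(r"\d+", "", text)      # remove numbers
    text = re.sub(r"[^\w\s]", "", text)  # remove anything that is not a word character or whitespace
    text = text.lower()                  # standardize case
    return " ".join(word for word in text.split() if word not in STOPWORDS)

cleaned_example = clean_text_sketch("The 3 cats sat in the garden, and it rained!")
print(cleaned_example)  # cats sat garden rained
```

Building the stopword collection once as a set, rather than inside the loop, keeps the membership test fast on large corpora.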

### 3. Pre-processing
Pre-processing includes tokenization, lemmatization, and stemming.

**Equation:** For term-weighting schemes like TF-IDF, the formula is:

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \left( \frac{N}{\text{df}(t)} \right)
$$

Where:
- **tf(t, d)** is the term frequency of term \( t \) in document \( d \),
- **df(t)** is the document frequency of term \( t \) across all documents,
- **N** is the total number of documents.
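
The formula can be checked by hand on a toy corpus. This sketch applies it literally, with raw counts and the natural logarithm; library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant by default, so their numbers will differ. The documents and names below are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, already tokenized
toy_docs = [["nlp", "is", "fun"], ["nlp", "is", "hard"], ["cats", "are", "fun"]]
N = len(toy_docs)

# df(t): number of documents that contain term t at least once
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tf_idf(term, doc):
    """tf(t, d) * log(N / df(t)) for a term that occurs somewhere in the corpus."""
    tf = doc.count(term)
    return tf * math.log(N / doc_freq[term])

score_nlp = tf_idf("nlp", toy_docs[0])    # log(3/2): appears in 2 of 3 documents
score_hard = tf_idf("hard", toy_docs[1])  # log(3):   appears in 1 of 3 documents
print(round(score_nlp, 3), round(score_hard, 3))  # 0.405 1.099
```

The rarer term "hard" gets the higher weight, which is exactly the effect the inverse document frequency is meant to produce.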

**Example:** Tokenization and stemming


In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Metwalli\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Metwalli\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

#We import these functions because we are going to tokenize the text (split it into individual words) and then apply stemming to convert words to their root form for further processing.

ps = ... #This creates an instance of the PorterStemmer, which will be used to stem each tokenized word.

df['tokens'] = df['cleaned_text'].apply(...) # Use lambda function
df['stemmed_tokens'] = df['tokens'].apply(lambda x: [... for word in x]) # use ps.stem(word) on each token

### 4. Feature Engineering
This involves representing text as numeric features. Popular techniques include Bag-of-Words, TF-IDF, and Word Embeddings (Word2Vec, GloVe).

**Example:** TF-IDF vectorization


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
"""The TfidfVectorizer is used to convert text into a numerical form that machine learning models can understand. The vectorizer automatically handles tokenization and computes the TF-IDF score for each word."""
tfidf = ... # use an instance of tfidf
X = ... # use Fit & transform in tfidf on the cleaned_text from the df

"""it will:
Tokenize the text (i.e., split it into words).
Compute the term frequency for each word in each document.
Compute the inverse document frequency (IDF) across the corpus.
Multiply the TF and IDF values to get the TF-IDF score for each word in each document."""

### 5. Modeling
You can now apply machine learning or deep learning models like Logistic Regression, SVM, or neural networks.

**Example:** Logistic Regression for text classification


In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, df['label'])

### 6. Evaluation
Model evaluation metrics include accuracy, precision, recall, and F1 score.

**Equations:**

$$
\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}
$$

Where:
- **TP** = true positives,
- **FP** = false positives,
- **FN** = false negatives.
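
These definitions translate directly into code. The sketch below counts TP, FP, and FN for a binary problem and also computes the F1 score as the harmonic mean of precision and recall; the function name and the toy labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here TP = 3, FP = 1, and FN = 1, so both precision and recall equal 3/4.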

**Example:** Evaluating model performance


In [None]:
from sklearn.metrics import classification_report

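# Note: X is the same data the model was trained on, so these scores are optimistic.
# Evaluate on a held-out test split for a realistic estimate.
y_pred = model.predict(X)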
print(classification_report(df['label'], y_pred))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       480
           1       0.96      0.97      0.97       584
           2       0.95      0.96      0.95       591
           3       0.93      0.95      0.94       590
           4       0.98      0.98      0.98       578
           5       0.98      0.98      0.98       593
           6       0.94      0.96      0.95       585
           7       0.98      0.98      0.98       594
           8       1.00      0.99      0.99       598
           9       1.00      1.00      1.00       597
          10       1.00      1.00      1.00       600
          11       1.00      0.98      0.99       595
          12       0.98      0.97      0.98       591
          13       1.00      0.99      0.99       594
          14       1.00      0.99      1.00       593
          15       0.96      1.00      0.98       599
          16       0.98      0.99      0.99       546
          17       1.00    

### 7. Deployment
Deploying the model can involve using web frameworks like FastAPI or Flask to serve predictions.

**Example:** FastAPI skeleton for deploying an NLP model


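```python
from fastapi import FastAPI
import joblib

app = FastAPI()
# Assumes 'model.pkl' stores a fitted scikit-learn Pipeline (vectorizer + classifier),
# so that predict() can take raw text directly.
model = joblib.load('model.pkl')

@app.post("/predict/")
def predict(text: str):
    # Convert the NumPy array to a plain list so the response is JSON-serializable
    return {"prediction": model.predict([text]).tolist()}
```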

### 8. Monitoring and Model Updating
Monitoring involves tracking model performance post-deployment and updating the model when performance degrades.
