# NLP Sept: 2024

This notebook is an end-to-end NLP crash course for September 2024.

Authors:
- Eng. Ahmed Métwalli
- Eng. Alia Elhefny

## Section 1: Introduction to Natural Language Processing (NLP)

### 1.1 Introduction to NLP

Natural Language Processing (NLP) is a multidisciplinary field that combines computer science, linguistics, and artificial intelligence to enable machines to understand, interpret, and generate human language. NLP is ubiquitous in the modern digital world, powering applications such as voice assistants, translation services, and sentiment analysis.

#### 1.1.1 What is NLP in the Real World?

NLP is applied in various real-world scenarios, including:

- **Text Analysis and Summarization**: Automated generation of concise summaries from large documents.
- **Sentiment Analysis**: Assessing the emotional tone of texts in social media or customer reviews.
- **Machine Translation**: Translating text from one language to another, as seen in Google Translate.
- **Chatbots and Virtual Assistants**: Enabling conversational interfaces in applications like Siri, Alexa, and customer support bots.

#### 1.1.2 NLP Tasks

Key tasks in NLP include:

- **Tokenization**: Splitting text into smaller units (tokens), such as words, subwords, or punctuation marks.
- **Named Entity Recognition (NER)**: Identifying and classifying entities like names, places, and organizations.
- **Part-of-Speech (POS) Tagging**: Determining the grammatical category (noun, verb, etc.) of each word.
- **Dependency Parsing**: Analyzing the grammatical structure of a sentence.
- **Text Classification**: Assigning predefined categories to text data, such as spam detection in emails.
- **Sentiment Analysis**: Detecting the sentiment or emotion expressed in text.

### 1.2 What is Language?

Language is a complex system of communication used by humans, comprising various components that convey meaning and facilitate interaction. It consists of several fundamental building blocks:

#### 1.2.1 Building Blocks of Language

1. **Phonemes**: The smallest units of sound in a language. For example, the word "cat" has three phonemes: /k/, /æ/, and /t/.
2. **Morphemes**: The smallest units of meaning. "Unbelievable" has three morphemes: "un-", "believe", and "-able".
3. **Lexemes**: Abstract vocabulary units that group all inflected forms of a single word. For example, the lexeme "run" covers "runs", "ran", and "running".
4. **Syntax**: The arrangement of words and phrases to create well-formed sentences. It governs the grammatical structure of language.
5. **Context**: The situational background that influences the meaning of words and sentences.

### 1.3 Introduction to Approaches to NLP

NLP can be approached using various methods, each with its strengths and limitations:

#### 1.3.1 Heuristics-Based NLP

- Utilizes rule-based methods to process language.
- Effective for well-defined, small-scale problems.
- Example: Regular expressions for pattern matching in text.
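
As a minimal sketch of the heuristic approach, the rule below flags a message as urgent when it contains a keyword from a small hand-written list. The keyword list and the function name `is_urgent` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical hand-written rule: a message is "urgent" if it contains
# any of these keywords as a whole word (case-insensitive).
URGENT_PATTERN = re.compile(r"\b(urgent|asap|immediately)\b", re.IGNORECASE)


def is_urgent(message: str) -> bool:
    """Return True if the message matches the hand-written urgency rule."""
    return URGENT_PATTERN.search(message) is not None


print(is_urgent("Please reply ASAP"))   # True
print(is_urgent("See you next week"))   # False
```

Rules like this are easy to read and debug, but every new case (synonyms, misspellings, other languages) has to be added by hand.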

#### 1.3.2 Machine Learning for NLP

- Uses statistical methods and algorithms to learn from data.
- Techniques include supervised and unsupervised learning.
- Common algorithms: Naive Bayes, Support Vector Machines (SVM), and decision trees.
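
To make the idea of learning from data concrete, here is a small sketch of a multinomial Naive Bayes text classifier written in plain Python with add-one (Laplace) smoothing. The training sentences, labels, and function names are made up for illustration; in practice a library implementation would normally be used instead.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (text, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(text, class_counts, word_counts, vocab):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data
train = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("lunch with the team on monday", "ham"),
]
model = train_naive_bayes(train)
print(predict("free prize", *model))    # spam
print(predict("team meeting", *model))  # ham
```

Unlike the heuristic approach, no keyword list is written by hand here: the word statistics are estimated from the labelled examples.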

#### 1.3.3 Deep Learning for NLP

- Leverages neural networks, especially deep neural networks, to model complex language patterns.
- Significant advancements have been made with architectures like Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers.
- Applications include language translation, text generation, and question answering.


## Lab 1: Environment + Hands-on Regex (Heuristic Based)

### Environment Preparation:
- Download Anaconda: https://www.anaconda.com/download/success
- Create a new environment called 'NLP_SEPT_2024'
    - Set Python version 3.11.x
    - Install:
        - Notebook
        - JupyterLab
        - VS Code
        - CMD Prompt
        - Powershell Prompt
    - Install the basic Python packages (`pip install <package>`):
        - pandas
        - numpy
        - matplotlib
        - seaborn
        - wordcloud # Visualizing the most frequent words in a corpus.
        - nltk # Tokenization, POS tagging, stemming, and more.
        - spacy # Named Entity Recognition (NER), dependency parsing, and part-of-speech tagging.
        - textblob # Sentiment analysis, translation, and language detection.
        - gensim # Topic modeling, document similarity, and word embeddings.
        - torch # Implementing custom deep learning architectures, fine-tuning models like BERT for NLP tasks.
        - transformers # Text classification, translation, question answering, and language generation using pre-trained models.
        - sentence-transformers # Text similarity, clustering, and retrieval tasks.
        - tensorflow
        - keras
        - scikit-learn
        - chime

### Regex Hands-on

Common regex patterns (https://docs.python.org/3/library/re.html):
- `.`: Matches any character except a newline.
- `^`: Matches the start of the string (and the start of each line in `re.MULTILINE` mode).
- `$`: Matches the end of the string, or just before a newline at the end of the string.
- `*`: Matches 0 or more repetitions of the preceding element.
- `+`: Matches 1 or more repetitions of the preceding element.
- `?`: Matches 0 or 1 repetition of the preceding element.
- `{n}`: Matches exactly n repetitions of the preceding element.
- `{n,}`: Matches n or more repetitions of the preceding element.
- `{n,m}`: Matches between n and m repetitions of the preceding element.
- `[]`: Matches any one of the enclosed characters.
- `|`: Alternation; matches either the pattern before or the pattern after the `|`.
- `()`: Groups a sub-pattern and captures the text it matches.
- `\d`: Matches any digit (`[0-9]` for ASCII text; for `str` patterns it also matches other Unicode digits unless `re.ASCII` is set).
- `\D`: Matches any non-digit character.
- `\s`: Matches any whitespace character (spaces, tabs, newlines).
- `\S`: Matches any non-whitespace character.
- `\b`: Matches a word boundary (the position between a word character and a non-word character).
- `\B`: Matches a position that is not a word boundary.
- `\w`: Matches any word character (alphanumeric plus underscore).
- `\W`: Matches any non-word character.

- Practice REGEX: https://regex101.com/r/rsVgaP/1
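
A few of the patterns listed above can be tried out with a short sketch like the following. The sample sentence is made up for illustration.

```python
import re

sample = "Order 66 shipped on 2024-09-01 to cat, cot and cut lovers."

# \d+ : one or more digits
print(re.findall(r"\d+", sample))                # ['66', '2024', '09', '01']

# [aou] : character class, any one of the enclosed characters
print(re.findall(r"c[aou]t", sample))            # ['cat', 'cot', 'cut']

# {n} : counted repetition
print(re.findall(r"\d{4}-\d{2}-\d{2}", sample))  # ['2024-09-01']

# | : alternation
print(re.findall(r"cat|cut", sample))            # ['cat', 'cut']

# ^ and $ : anchors at the start and end of the string
print(bool(re.search(r"^Order", sample)))        # True
print(bool(re.search(r"lovers\.$", sample)))     # True
```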

In [2]:
import re
import pandas as pd
import numpy as np


In [19]:
# Example strings
text = "You should call 911 now. 911 is the emergency number"


In [None]:
# Find all numbers in the text
pattern = ...
# Use re.findall()
matches = ...
print(f"Numbers found: {matches}")
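
One possible way to complete the exercise above, assuming "numbers" means runs of consecutive digits, is sketched here. The pattern `\d+` is one choice among several.

```python
import re

text = "You should call 911 now. 911 is the emergency number"

# \d+ matches one or more consecutive digits
pattern = r"\d+"
matches = re.findall(pattern, text)
print(f"Numbers found: {matches}")  # Numbers found: ['911', '911']
```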

In [None]:
# Replace all numbers with '999'
pattern = ...
# hint use re.sub()
replaced_text = ...
print(replaced_text)
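
A possible completion of the replacement exercise, again assuming that a number is any run of digits, could look like this:

```python
import re

text = "You should call 911 now. 911 is the emergency number"

# Replace every run of digits with the string '999'
pattern = r"\d+"
replaced_text = re.sub(pattern, "999", text)
print(replaced_text)  # You should call 999 now. 999 is the emergency number
```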


In [177]:
# Extract Email 
data = {'emails': ['john.doe@example.com', 'jane_smith@abc.co.uk', 'invalid.email@com']}
df = pd.DataFrame(data)

# Hint: Use df['col'].str.extract()

# Extract the username separately
# ^: Start of the string.
# ([\w.%+-]+): Captures the username part of the email.
# [\w.%+-]: Matches any word character (letters, digits, underscores), plus the special characters . % + -.
# +: One or more of the preceding characters.
# @: Matches the literal @ symbol, which is required to separate the username and domain.
df['username'] = ...

# Extract the domain separately
# @: Matches the literal @ symbol, which precedes the domain part.
# ([\w.-]+\.[a-zA-Z]{2,}): Captures the domain part of the email.
# [\w.-]+: Matches the main domain part, including letters, digits, hyphens, and dots.
# \.[a-zA-Z]{2,}: Matches the top-level domain (TLD) with at least two letters.
# $: End of the string.
df['domain'] = ...
df


Unnamed: 0,emails,username,domain
0,john.doe@example.com,john.doe,example.com
1,jane_smith@abc.co.uk,jane_smith,abc.co.uk
2,invalid.email@com,invalid.email,
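
The two patterns described in the comments above can be checked with the standard library alone, as in the sketch below. The helper `first_group` is a hypothetical name introduced for this example. With pandas, the same patterns could be passed to `df['emails'].str.extract(pattern, expand=False)`, which returns the captured group per row and NaN where nothing matches.

```python
import re

emails = ['john.doe@example.com', 'jane_smith@abc.co.uk', 'invalid.email@com']

# Username: everything from the start of the string up to the '@'
username_pattern = r"^([\w.%+-]+)@"
# Domain: everything after the '@', ending in a dot plus at least two letters
domain_pattern = r"@([\w.-]+\.[a-zA-Z]{2,})$"


def first_group(pattern, value):
    """Return the first captured group, or None if the pattern does not match."""
    match = re.search(pattern, value)
    return match.group(1) if match else None


usernames = [first_group(username_pattern, e) for e in emails]
domains = [first_group(domain_pattern, e) for e in emails]

print(usernames)  # ['john.doe', 'jane_smith', 'invalid.email']
print(domains)    # ['example.com', 'abc.co.uk', None]
```

The last address has no dot after the `@`, so the domain pattern does not match it.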


In [147]:
# Email Validation
def validate_email(email):
    pattern = ...
    # ^(?!.*\.\.): Negative lookahead to ensure there are no consecutive dots in the email string.
    # [a-zA-Z0-9._%+-]+: Matches the local part (username) of the email. Allows letters, digits, and special characters such as ., _, %, +, -.
    # @[a-zA-Z0-9.-]+: Matches the domain part. Allows letters, digits, hyphens, and dots. The character class alone would accept consecutive dots; those are rejected by the negative lookahead at the start of the pattern.
    # \.[a-zA-Z]{2,6}$: Matches the TLD with 2 to 6 alphabetic characters, which covers most common TLDs like .com, .org, .museum, etc.
    return bool(re.match(pattern, email))
# Test the refined function
emails = ['test.email@example.com', 'invalid-email@.com', 'name@domain.co', 'test..email@example.com', 'test@domain.c', 'test@domain.toolongtld']
results = [validate_email(email) for email in emails]
print(f"Validation Results: {results}")



Validation Results: [True, False, True, False, False, False]
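
Assembling the pieces described in the comments gives one possible pattern for `validate_email`. This is a sketch of a simple shape check, not a full implementation of the email address standards.

```python
import re

# Pieces, in order: no consecutive dots anywhere, local part, '@',
# domain, and a TLD of 2 to 6 letters at the end of the string.
EMAIL_PATTERN = r"^(?!.*\.\.)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$"


def validate_email(email):
    """Return True if the email matches the simplified pattern above."""
    return bool(re.match(EMAIL_PATTERN, email))


emails = ['test.email@example.com', 'invalid-email@.com', 'name@domain.co',
          'test..email@example.com', 'test@domain.c', 'test@domain.toolongtld']
results = [validate_email(email) for email in emails]
print(f"Validation Results: {results}")
# Validation Results: [True, False, True, False, False, False]
```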


In [32]:
# Phone Number Validation
def validate_phone_number(number):
    # Pattern to match common phone number formats
    pattern = ...

    # ^: Start of the string.
    # (\+\d{1,3}[-.\s]?)?: Matches the optional country code part.
        # \+: Matches a literal plus sign '+' at the start, indicating an international code.
        # \d{1,3}: Matches 1 to 3 digits for the country code (e.g., '1' for the US, '44' for the UK).
        # [-.\s]?: Matches an optional separator, which can be a hyphen '-', a dot '.', or a space ' '.
        # ?: Makes the entire country code part optional.
    # (\(?\d{3}\)?[-.\s]?)?: Matches the optional area code part.
        # \(?\d{3}\)?: Matches 3 digits for the area code, which may or may not be enclosed in parentheses. 
            # - \(? : Matches an optional opening parenthesis '('.
            # - \d{3}: Matches exactly 3 digits for the area code.
            # - \)?: Matches an optional closing parenthesis ')'.
        # [-.\s]?: Matches an optional separator (hyphen, dot, or space).
        # ?: Makes the entire area code part optional.
    # (\d{3}[-.\s]?\d{4}): Matches the main phone number part.
        # \d{3}: Matches exactly 3 digits.
        # [-.\s]?: Matches an optional separator (hyphen, dot, or space).
        # \d{4}: Matches exactly 4 digits for the remaining part of the phone number.
    # $: End of the string. Ensures that the pattern matches the entire phone number from start to end.
    
    return bool(re.fullmatch(pattern, number))


# Test the function with a list of phone numbers
numbers = ['+1-800-555-5555',  # Valid: Includes country code and separators.
           '(123) 456 7890',   # Valid: Area code in parentheses and spaces as separators.
           '12345']            # Invalid: Too short to be a valid phone number.

# Validate each phone number using the function
results = [validate_phone_number(number) for number in numbers]

# Display the validation results for each phone number
print(f"Validation Results: {results}")


Validation Results: [True, True, False]
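
The three groups described in the comments (optional country code, optional area code, main number) can be combined into one pattern, as in this sketch. It covers only the formats listed in the comments and is not a general phone number validator.

```python
import re

# optional country code, optional area code, then 3 digits + 4 digits
PHONE_PATTERN = r"^(\+\d{1,3}[-.\s]?)?(\(?\d{3}\)?[-.\s]?)?(\d{3}[-.\s]?\d{4})$"


def validate_phone_number(number):
    """Return True if the whole string matches the phone pattern."""
    return bool(re.fullmatch(PHONE_PATTERN, number))


numbers = ['+1-800-555-5555', '(123) 456 7890', '12345']
results = [validate_phone_number(number) for number in numbers]
print(f"Validation Results: {results}")  # Validation Results: [True, True, False]
```

Because `re.fullmatch` already requires the whole string to match, the `^` and `$` anchors are redundant here; they are kept to mirror the comments.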


In [20]:
# Extract URLs

# Extracting URLs from the given text
text = 'Visit our website at https://www.example.com or follow us at http://blog.example.com'

# Define the pattern to match URLs
pattern = ...

# https?://: 
# - https?: Matches the literal 'http' followed optionally by 's'. This means it can match both 'http' and 'https'.
# - ://: Matches the literal characters '://', which are required after 'http' or 'https' in a URL.

# [a-zA-Z0-9./-]+:
# - [a-zA-Z0-9./-]: Character set that matches any of the following characters:
#   - a-z: Lowercase English letters.
#   - A-Z: Uppercase English letters.
#   - 0-9: Digits.
#   - . (dot): Matches the literal dot, which is used in domain names and paths.
#   - / (forward slash): Matches the literal slash, which is used to separate different parts of the URL.
#   - - (hyphen): Matches the literal hyphen, which can be part of domain names or paths.
# - +: Quantifier that matches one or more of the preceding characters in the set, so the match extends over the host name and path until a character outside the set (such as a space) is reached.

# Hint: Use re.findall()
urls = ...

print(f"Extracted URLs: {urls}")


Extracted URLs: ['https://www.example.com', 'http://blog.example.com']
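
A possible completion, using the pattern explained in the comments, is sketched below.

```python
import re

text = 'Visit our website at https://www.example.com or follow us at http://blog.example.com'

# 'http' or 'https', then '://', then letters, digits, dots, slashes, or hyphens
pattern = r"https?://[a-zA-Z0-9./-]+"
urls = re.findall(pattern, text)
print(f"Extracted URLs: {urls}")
# Extracted URLs: ['https://www.example.com', 'http://blog.example.com']
```

This simple character set does not cover query strings (`?`, `=`, `&`), and because it includes the dot, a URL followed directly by a sentence-ending period would capture that period as well.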


In [171]:
# Extract Birthday
# Sample text containing dates
text = "John's birthday is on 23/05/1995 and Mary's is on 15-04-1992."

# Define the pattern to match date formats
pattern = ...

# \b: Matches a word boundary, ensuring that the pattern matches whole numbers and not parts of larger strings.
    # - This prevents partial matches like '123' in '123abc'.
# \d{1,2} and \d{2,4}:
# - \d: Matches any digit.
# - {1,2}: Matches one or two digits for the day or month part, allowing for numbers like '3' or '23'.
# - {2,4}: Matches two to four digits for the year part, allowing for numbers like '95' or '1995'.
# [-/]: Matches either a hyphen '-' or a forward slash '/', which are common separators in date formats.


# Hint Use re.findall()
dates = ...

print(f"Extracted Dates: {dates}")


Extracted Dates: ['23/05/1995', '15-04-1992']
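
One way to assemble the date pattern from the pieces described in the comments is sketched here.

```python
import re

text = "John's birthday is on 23/05/1995 and Mary's is on 15-04-1992."

# day (1-2 digits), separator, month (1-2 digits), separator, year (2-4 digits)
pattern = r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b"
dates = re.findall(pattern, text)
print(f"Extracted Dates: {dates}")  # Extracted Dates: ['23/05/1995', '15-04-1992']
```

The pattern checks only the shape of a date, not its validity, so a string such as `99/99/9999` would also match.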


In [179]:
# Split text into sentences
text = "Hello there! How are you today? Let's learn regex."

# Define the pattern to split the text into sentences
pattern = ...

# Explanation of the pattern:
# [.!?]:
# - [ ]: Square brackets define a character class, which matches any one of the enclosed characters.
# - .: Matches a literal period (.) which marks the end of a sentence.
# - !: Matches a literal exclamation mark (!) which marks the end of an exclamatory sentence.
# - ?: Matches a literal question mark (?) which marks the end of a question.

# Hint: Use re.split(pattern,text)
sentences = ...

sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

print(f"Sentences: {sentences}")


Sentences: ['Hello there', 'How are you today', "Let's learn regex"]
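
A possible completion of the sentence-splitting exercise, using the character class explained in the comments:

```python
import re

text = "Hello there! How are you today? Let's learn regex."

# Split on any single '.', '!' or '?'
pattern = r"[.!?]"
sentences = re.split(pattern, text)

# Drop empty pieces and surrounding whitespace
sentences = [sentence.strip() for sentence in sentences if sentence.strip()]
print(f"Sentences: {sentences}")
# Sentences: ['Hello there', 'How are you today', "Let's learn regex"]
```

This heuristic splits on every period, so abbreviations ("Dr.") and decimal numbers ("3.14") would be split incorrectly; libraries such as nltk or spacy provide more robust sentence segmentation.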
