
#Regular expressions or RegEx
A regular expression (RegEx, or RE) is a sequence of characters that defines a search pattern. It is mainly used to find or replace the substrings of a given string that match the pattern.

Regular expressions form a small language for specifying text search strings: a pattern written in this specialized syntax can match or extract a single string or a whole set of strings.

For example, regular expressions can extract all hashtags from a tweet, or pull email IDs and phone numbers out of large unstructured text.
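
As a minimal sketch of the hashtag and email use case, the cell below runs `re.findall()` over a made-up tweet. The sample text and both patterns are assumptions chosen for illustration; in particular, the email pattern is deliberately simple and does not cover every valid address.

```python
import re

# Hypothetical sample text, made up for this illustration
tweet = "Loving the new #NLP course! Contact me at jane.doe@example.com #regex #Python3"

# A '#' followed by one or more word characters (letters, digits, underscores)
hashtags = re.findall(r'#\w+', tweet)

# A deliberately simple email pattern; real addresses can be more complex.
# The group is non-capturing so that findall() returns the whole match.
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', tweet)

print(hashtags)
print(emails)
```

This should print the three hashtags followed by the single email address found in the text.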

#How can Regular Expressions be used in NLP?
In NLP, regular expressions are useful in many places, such as:

1. To validate data fields.

For example, dates, email addresses, URLs, abbreviations, etc.

2. To filter a particular text from the whole corpus.

For example, spam, disallowed websites, etc.

3. To identify particular strings in a text.

For example, token boundaries.

4. To convert the output of one processing component into the format required for a second component.

#1. use of Regular Expression (re) : To Validate data fields.

In [1]:
import re

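# Regular expression for date validation (MM/DD/YYYY format).
# It checks the format only: an impossible date such as 02/31/2022 still matches.
date_regex = r'(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d{2}'

def validate_date(date_str):
    # re.fullmatch() succeeds only if the entire string matches the pattern
    # (with re.match() and '$', a trailing newline would also be accepted)
    return re.fullmatch(date_regex, date_str) is not None

# Test the function
# date = "12/31/2022"  # matches the expected format
date = "12-31-2022"    # uses hyphens instead of slashes, so validation fails
if validate_date(date):
    print(f"{date} is a valid date.")
else:
    print(f"{date} is not a valid date.")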

12-31-2022 is not a valid date.
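
A regular expression of this kind validates the shape of the string, not the calendar, so it accepts 02/31/2022. One possible way to close that gap, sketched here as an assumption rather than as part of the original notebook, is to follow the format check with `datetime.strptime()`. The helper name `is_real_date` is hypothetical.

```python
import re
from datetime import datetime

# The same MM/DD/YYYY pattern as in the cell above, compiled once
date_format = re.compile(r'(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/(19|20)\d{2}')

def is_real_date(date_str):
    # Hypothetical helper: format check first, calendar check second
    if date_format.fullmatch(date_str) is None:
        return False
    try:
        # strptime() raises ValueError for dates that do not exist
        datetime.strptime(date_str, "%m/%d/%Y")
    except ValueError:
        return False
    return True

for candidate in ["12/31/2022", "02/31/2022", "12-31-2022"]:
    print(candidate, is_real_date(candidate))
```

Only the first candidate passes both checks: the second has the right format but names a day that does not exist, and the third uses the wrong separator.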


#2. use of Regular Expression (re) : To Filter a particular text from the whole corpus.

In [2]:
import nltk
nltk.download('punkt')
import re

# Sample corpus
corpus = "This is an example sentence. Here is another example sentence with a specific word: apple. This is the last sentence."

# Tokenize the corpus into sentences
sentences = nltk.sent_tokenize(corpus)

# Define the regular expression pattern to match the desired text (e.g., "apple")
pattern = r'\bapple\b'

# Keep only the sentences that contain the desired text
filtered_sentences = [sentence for sentence in sentences if re.search(pattern, sentence)]

# Print the filtered sentences
for sentence in filtered_sentences:
    print(sentence)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Here is another example sentence with a specific word: apple.
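
The same idea works in the opposite direction, for example to drop spam or messages that mention disallowed websites. The sketch below is an illustration under assumptions: the messages and the blocklist are made up, and joining escaped terms with `|` is just one simple way to build such a filter.

```python
import re

# Hypothetical messages and blocklist, made up for this illustration
messages = [
    "Meeting moved to 3 pm tomorrow.",
    "WIN a FREE prize now, click http://spam.example.com",
    "Lunch at noon?",
    "Cheap deals at www.blocked.example.org today only",
]
blocked_terms = ["free prize", "spam.example.com", "blocked.example.org"]

# One alternation covers all terms; re.escape() keeps the dots in the
# domain names literal, and re.IGNORECASE ignores upper/lower case
blocked = re.compile("|".join(re.escape(term) for term in blocked_terms), re.IGNORECASE)

# Keep only the messages in which no blocked term is found
clean_messages = [m for m in messages if not blocked.search(m)]
print(clean_messages)
```

Only the two messages without a blocked term should remain in `clean_messages`.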


#3. use of Regular Expression (re) : To Identify particular strings in a text.

\b: Word boundary anchor.

[bB]: Match either 'b' or 'B'.

\w+: Match one or more word characters (letters, digits, or underscores).

re.findall(pattern, text): Find all non-overlapping matches of the pattern in the text.

matches: Store the matched strings.

In [3]:
import re

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Define a regular expression pattern to search for words starting with 'b'
pattern = r'\b[bB]\w+'

# Use re.findall() to find all matches of the pattern in the text
matches = re.findall(pattern, text)

# Print the matched strings
print(matches)

['brown']
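
Token boundaries, mentioned earlier as an example of this use case, can be sketched the same way. The tokenizer pattern below is a simplified assumption (it splits contractions and hyphenated words, for instance) and is not a substitute for a real tokenizer such as the one in NLTK.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# \w+ matches a run of word characters; [^\w\s] matches a single
# character that is neither a word character nor whitespace (punctuation)
tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)

# re.finditer() yields match objects, which also report where each match
# starts and ends in the text
spans = [(m.group(), m.span()) for m in re.finditer(r'\b[bB]\w+', text)]
print(spans)
```

The first call separates the final period from "dog", and the second reports the character offsets of "brown" in addition to the word itself.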


#4. use of Regular Expression (re) : To convert the output of one processing component into the format required for a second component.

In [4]:
import re

# Input date string in the format "DD-MM-YYYY"
input_date = "31-01-2024"

# Define a regular expression pattern to match the input date format
pattern = r'(\d{2})-(\d{2})-(\d{4})'

# Use re.sub() to replace the matched pattern with the desired format
output_date = re.sub(pattern, r'\3/\2/\1', input_date)

# Print the converted date string
print("Input date:", input_date)
print("Output date:", output_date)

Input date: 31-01-2024
Output date: 2024/01/31


#INTERPRETATION OF ABOVE CODE:

import re: Import the re module, which provides support for working with regular expressions.

input_date: The input date string in the format "DD-MM-YYYY".

pattern: The regular expression pattern r'(\d{2})-(\d{2})-(\d{4})' has the following parts:

(\d{2}): Match and capture two digits (day).

-: Match the hyphen separator.

(\d{2}): Match and capture two digits (month).

-: Match the hyphen separator.

(\d{4}): Match and capture four digits (year).

re.sub(pattern, r'\3/\2/\1', input_date): Use re.sub() to rewrite the matched text in the format "YYYY/MM/DD". In the replacement string, \1, \2, and \3 are backreferences to the first, second, and third capture groups (day, month, and year respectively), so r'\3/\2/\1' emits the year, a "/", the month, another "/", and finally the day.

output_date: Store the converted date string.

print("Input date:", input_date): Print the original input date string.

print("Output date:", output_date): Print the converted date string.

This example demonstrates how to use regular expressions to convert the output of one processing component (date in "DD-MM-YYYY" format) into the format required for a second component ("YYYY/MM/DD").
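
Numbered backreferences become hard to read as patterns grow. A possible variation, shown here as a sketch and not as part of the original example, uses named groups and converts every date inside a longer string. The log line is made up for illustration.

```python
import re

# The same DD-MM-YYYY pattern, with named groups instead of numbered ones
pattern = re.compile(r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})')

# Hypothetical log line, made up for this illustration
line = "Order placed on 31-01-2024 and shipped on 02-02-2024."

# \g<name> refers to a named group inside the replacement string;
# sub() replaces every non-overlapping match, not just the first
converted = pattern.sub(r'\g<year>/\g<month>/\g<day>', line)
print(converted)
```

Both dates in the line are rewritten to "YYYY/MM/DD", and the surrounding text is left unchanged.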

# What does this mean : pattern = r'\bapple\b'?

r: This prefix marks the string as a raw string literal. In a raw string, Python leaves backslashes (\) untouched instead of interpreting them as escape sequences. This matters for regular expressions: in a normal string, '\b' is the backspace character, whereas r'\b' passes the two characters \ and b through to the re module.

\b: This is a word boundary anchor. It matches a position, not a character: the position between a word character (a letter, digit, or underscore) and a non-word character (such as a space or punctuation), or the start or end of the string when it borders a word character. Here it asserts that the match begins at the start of a word.

apple: This is the word we are searching for in the text.

\b: A second word boundary anchor. It asserts that the match ends at the end of a word.

Putting it all together, r'\bapple\b' is a regular expression pattern that matches the word "apple" as a standalone word (i.e., not part of a longer word) in a text. The word boundaries (\b) ensure that "apple" is matched only when it appears as a whole word and not as part of another word.
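
The effect of the word boundaries and of the r prefix can be checked directly. The sentence below is made up for illustration.

```python
import re

# Hypothetical sentence containing "apple" alone and inside longer words
text = "An apple a day; pineapple and apples do not count."

# Without \b, "apple" also matches inside "pineapple" and "apples"
substring_matches = re.findall(r'apple', text)

# With \b on both sides, only the standalone word matches
word_matches = re.findall(r'\bapple\b', text)

print(len(substring_matches), len(word_matches))

# Without the r prefix, Python turns '\b' into one backspace character;
# with it, the string keeps the two characters backslash and 'b'
print(len('\b'), len(r'\b'))
```

The plain pattern finds three matches and the bounded pattern finds one, while the two string literals have lengths 1 and 2.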

#LITERALS:
In Python, a literal is a notation that represents a fixed value directly in source code. Literals initialize variables or supply values to expressions without any computation or transformation. Python has numeric, string, boolean, None, and special (Ellipsis) literals, as well as bracketed notations for building lists, tuples, sets, and dictionaries.

Here's a breakdown of literals in Python:

#Numeric Literals:

Integers: Whole numbers without a fractional part, e.g., 42, -10.

Floating-point numbers: Numbers with a fractional part or in exponential notation, e.g., 3.14, -0.01, 2.5e2.

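Complex numbers: Numbers with a real and an imaginary part, written as real + imaginary*j, e.g., 2 + 3j, -1 - 2j. Strictly speaking, only the imaginary part (3j) is a literal; the full value is an expression that adds a real number to it.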

#String Literals:

Single-line strings: Enclosed in either single (') or double (") quotes, e.g., 'hello', "world".

Multi-line strings: Enclosed in triple quotes (''' or """) so that they can span multiple lines, e.g., '''multi-line\nstring'''.

#Boolean Literals:

True and False: Representing boolean values.

#None Literal:

None: Represents the absence of a value or a null value.

#Sequence and Collection Literals:

Lists: Ordered collections of items enclosed in square brackets ([]), e.g., [1, 2, 3].

Tuples: Ordered collections of items enclosed in parentheses (()), e.g., (1, 2, 3).

Sets: Unordered collections of unique items enclosed in curly braces, e.g., {1, 2, 3}. Empty braces ({}) create a dictionary, so an empty set is written set().

Dictionaries: Collections of key-value pairs enclosed in curly braces ({}), where each key is associated with a value, e.g., {'a': 1, 'b': 2}. Since Python 3.7, dictionaries preserve insertion order.

#Special Literals:

Ellipsis (...): A built-in placeholder object, often used to mark an unspecified value or code that is still to be written, e.g., x = ... or def todo(): ...

Literals are fundamental building blocks in Python programming, used to represent constant values directly within the code.
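
The categories above can be inspected with `type()`. The cell below is a small sketch with arbitrary sample values; it also checks two details that are easy to trip over.

```python
# Arbitrary sample values, one per kind of literal described above
samples = [42, 3.14, 2.5e2, 3j, 'hello', True, None,
           [1, 2, 3], (1, 2, 3), {1, 2, 3}, {'a': 1}, ...]

# Print each value next to the name of its type
for value in samples:
    print(repr(value), type(value).__name__)

# Empty braces create a dict, not a set; an empty set needs set()
empty_braces_type = type({}).__name__
empty_set_type = type(set()).__name__
print(empty_braces_type, empty_set_type)

# Exponent notation yields a float even when the value is a whole number
print(2.5e2)
```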