# Regular Expressions

What are Regular Expressions?
A regular expression, often abbreviated as "regex" or "regexp," is a powerful tool used in computer science and text processing to describe and match patterns in strings. It's a sequence of characters that defines a search pattern. These patterns can be used for tasks like searching, extracting, replacing, and validating text.

Regular expressions are widely used in tasks like text searching and manipulation, data validation, parsing, and more. They are supported by many programming languages and text processing tools, making them a versatile and essential tool for working with text data.

Why Regular Expressions?
Regular expressions are used for a variety of tasks in computer science, data processing, and text analysis due to their powerful pattern matching capabilities. Here are some key reasons why regular expressions are widely used:

- Pattern Matching: Regular expressions allow you to search for specific patterns or sequences of characters within a larger body of text. This is useful for tasks like finding specific words, dates, email addresses, or other structured information.

- Text Extraction: They can be used to extract specific pieces of information from a text document, such as names, phone numbers, URLs, or any other structured data.

- Data Validation: Regular expressions are used to validate input data. For example, you can use them to check if a user-provided email address or phone number is in the correct format.

- Search and Replace: They enable you to search for specific patterns in a text and replace them with something else. This is useful for tasks like cleaning up data or making text substitutions.

- Parsing and Tokenization: Regular expressions are essential for breaking down text into smaller units or tokens. This is used in tasks like natural language processing and compiler design.

- Web Scraping: When extracting data from websites, regular expressions can be used to locate and extract specific elements or information from HTML pages.

- Log File Analysis: Regular expressions are invaluable for searching and parsing log files, allowing you to extract important information or identify patterns of interest.

- Pattern Validation: They are used to validate whether a string adheres to a specific pattern or format, such as checking if a password meets certain criteria.

- Data Transformation and Cleaning: Regular expressions can be used to clean and transform text data. For example, removing unnecessary characters or formatting.

- Language Agnostic: Regular expressions are supported by most programming languages, making them a versatile tool that can be applied in a wide range of contexts.

- Efficiency: When used correctly, regular expressions can provide very efficient search and match operations, especially for complex patterns.

Overall, regular expressions are a fundamental tool for working with text data and are an essential skill for tasks ranging from data preprocessing to text analysis and beyond. They provide a flexible and powerful means to perform complex pattern matching operations.
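A few of the uses listed above (pattern matching, text extraction, and search and replace) can be sketched with Python's built-in `re` module. The sample sentence and the variable names are made up for illustration; this is a minimal sketch, not a production-grade validator.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Contact ada@example.com or call 0704-336-0273 before 2024-05-17."

# Pattern matching: look for a date of the form YYYY-MM-DD anywhere in the text.
date_match = re.search(r"\d{4}-\d{2}-\d{2}", text)
found_date = date_match.group() if date_match else None

# Text extraction: collect every run of consecutive digits.
digit_runs = re.findall(r"\d+", text)

# Search and replace: mask every digit with '#'.
masked = re.sub(r"\d", "#", text)

print(found_date)   # 2024-05-17
print(digit_runs)   # ['0704', '336', '0273', '2024', '05', '17']
print(masked)       # Contact ada@example.com or call ####-###-#### before ####-##-##.
```

`re.search` looks for the first match anywhere in the string, `re.findall` returns all non-overlapping matches as a list, and `re.sub` returns a new string with the matches replaced.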

## Applications of Regular Expressions

Regular expressions (regex) are applied in a wide range of real-world scenarios across many domains. Here are some of the most important applications of regex:

- Data Validation and Form Input:
Ensuring that user-provided data (like email addresses, phone numbers, passwords, etc.) adheres to specified formats before processing.

- Search and Replace in Text Editors:
Find-and-replace operations in text editors or IDEs, allowing for quick and precise changes in code or documents.

- Log File Parsing and Analysis:
Extracting relevant information from log files, helping to identify patterns, errors, or anomalies in system logs.

- Web Scraping and Data Extraction:
Extracting specific information from web pages, like email addresses, phone numbers, product names, etc., for further analysis.

- Data Cleaning and Transformation:
Preprocessing text data by removing unnecessary characters, fixing formatting issues, and standardizing data for analysis.

- Search Engines and Information Retrieval:
Powering search engines for matching user queries to relevant content on websites or databases.

- URL Routing and Validation:
Validating and parsing URLs to ensure they follow the correct format and extracting specific parameters from them.

- Lexical Analysis in Compiler Design:
Tokenizing source code into meaningful units for further processing by a compiler.

- Natural Language Processing (NLP):
Tokenizing sentences or words, extracting entities (like names, dates, locations), and performing advanced text processing tasks.

- Network Security and Firewall Rules:
Defining and enforcing rules for allowing or blocking specific types of traffic based on patterns in network traffic logs.

- Database Querying and Validation:
Validating and querying databases for specific patterns or formats, such as social security numbers, credit card numbers, etc.

- Formal Language Theory and Automata:
In computer science theory, regex is used in the definition of regular languages and finite automata.

- Validation of Configuration Files:
Ensuring that configuration files for software or systems follow the correct syntax and structure.

- Extracting Metadata from Documents:
Parsing documents (like PDFs, Word documents) to extract metadata such as titles, authors, dates, etc.

- URL Rewriting in Web Servers:
Modifying URLs on the fly to improve SEO or to direct traffic to specific pages.

- Pattern Matching in DNA Sequences:
Identifying specific genetic sequences or motifs in DNA for biological research.

These are just some of the many real-world applications of regular expressions. Their versatility and powerful pattern-matching capabilities make them an invaluable tool in various fields of computer science and data processing.
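As one concrete illustration of the applications above, log file parsing might look like the following sketch. The log format, the sample lines, and the names `log_lines` and `error_pattern` are assumptions made up for this example; real log formats vary.

```python
import re

# Hypothetical log lines in an assumed "DATE TIME LEVEL MESSAGE" layout.
log_lines = [
    "2024-05-17 10:15:02 ERROR Disk quota exceeded",
    "2024-05-17 10:15:09 INFO Backup finished",
    "2024-05-17 10:16:40 ERROR Connection timed out",
]

# Three capturing groups: the date, the time, and the message of ERROR entries.
error_pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) ERROR (.+)")

errors = []
for line in log_lines:
    match = error_pattern.fullmatch(line)
    if match:
        _date, time, message = match.groups()
        errors.append((time, message))

print(errors)
# [('10:15:02', 'Disk quota exceeded'), ('10:16:40', 'Connection timed out')]
```

Compiling the pattern once with `re.compile` avoids repeating the pattern string when the same regex is applied to many lines.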

References for Practice
Try to work through the interactive tutorials on the websites listed below:

https://regexone.com/lesson/introduction_abcs

https://regex101.com

### Meta Characters
., ^, $, *, +, ?, {, }, [, ], (, ), |, \
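Because these characters have a special meaning, they need a backslash in front of them when they should match literally. The short sketch below shows the difference for `.` and uses `re.escape()` to escape a whole string; the sample strings are made up for illustration.

```python
import re

# Unescaped, "." is a metacharacter and matches any character except a newline.
dot_any = re.fullmatch(r"3.14", "3x14") is not None        # True
# Escaped with a backslash, "." matches only a literal dot.
dot_literal = re.fullmatch(r"3\.14", "3x14") is not None   # False
dot_exact = re.fullmatch(r"3\.14", "3.14") is not None     # True

# re.escape() adds the backslashes for you when the text comes from elsewhere.
escaped = re.escape("1+1=2?")
literal_ok = re.fullmatch(escaped, "1+1=2?") is not None   # True

print(dot_any, dot_literal, dot_exact, literal_ok)
```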

User-defined Character Classes
[abc] - Match either a or b or c
[^abc] - Match any character except a, b, or c
[a-z] - Match a lowercase English letter
[A-Z] - Match an uppercase English letter
[a-zA-Z] - Match any English letter
[0-9] - Match any digit character
[a-zA-Z0-9_] - Match any alphanumeric character or underscore
[^a-zA-Z0-9_] - Match any character except an alphanumeric character or underscore
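The character classes above can be tried out with `re.findall`, which returns every match as a list. The sample string is made up for illustration.

```python
import re

# Hypothetical sample string, invented for this example.
sample = "Room B12, floor 3_a!"

vowels = re.findall(r"[aeiou]", sample)            # any one of a, e, i, o, u
uppercase = re.findall(r"[A-Z]", sample)           # uppercase English letters
digits = re.findall(r"[0-9]", sample)              # digits
others = re.findall(r"[^a-zA-Z0-9_]", sample)      # everything except letters, digits, underscore

print(vowels)     # ['o', 'o', 'o', 'o', 'a']
print(uppercase)  # ['R', 'B']
print(digits)     # ['1', '2', '3']
print(others)     # [' ', ',', ' ', ' ', '!']
```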

Pre-defined Character Classes
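\d - Match a digit character, i.e. [0-9]
\D - Match any character except a digit character, i.e. [^0-9]
\w - Match a word character (alphanumeric or underscore), i.e. [a-zA-Z0-9_]
\W - Match any character except a word character, i.e. [^a-zA-Z0-9_]
\s - Match a whitespace character (space, tab, newline, and similar).
\S - Match any character except whitespace.
\t - Match a tab character.
\n - Match a newline character.
Note: for Python 3 string patterns, \d, \w, and \s also match non-ASCII (Unicode) digits, word characters, and whitespace unless the re.ASCII flag is used.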
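The predefined classes can be checked the same way. The sketch below also shows one Python-specific detail: for `str` patterns, `\w` matches Unicode word characters by default and is restricted to `[a-zA-Z0-9_]` only when the `re.ASCII` flag is passed. The sample strings are made up for illustration.

```python
import re

# Hypothetical sample containing a tab and a newline.
sample = "id_7\tok\n"

digits = re.findall(r"\d", sample)        # ['7']
words = re.findall(r"\w+", sample)        # ['id_7', 'ok']
whitespace = re.findall(r"\s", sample)    # ['\t', '\n']
non_word = re.findall(r"\W", sample)      # ['\t', '\n']

# "caf\u00e9" is the word "cafe" with an accented final letter.
unicode_words = re.findall(r"\w+", "caf\u00e9")            # accented letter counts as a word character
ascii_words = re.findall(r"\w+", "caf\u00e9", re.ASCII)    # only [a-zA-Z0-9_] counts

print(digits, words, whitespace, non_word)
print(unicode_words, ascii_words)
```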

Quantifiers
a* - Match zero or more occurrences of a
a+ - Match one or more occurrences of a
a? - Match at most one occurrence of a, i.e. 0 or 1
a{n} - Match exactly n occurrences of a
a{m,n} - Match at least m and at most n occurrences of a (write it without a space after the comma)
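Each quantifier can be checked against a few strings with `re.fullmatch`. The helper `matches` below is a hypothetical convenience function written for this example, not part of the `re` module.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

results = {
    "a* on ''": matches(r"a*", ""),                # True: zero occurrences are allowed
    "a+ on ''": matches(r"a+", ""),                # False: at least one 'a' is required
    "a? on 'aa'": matches(r"a?", "aa"),            # False: at most one 'a' is allowed
    "a{3} on 'aaa'": matches(r"a{3}", "aaa"),      # True: exactly three
    "a{2,4} on 'aaa'": matches(r"a{2,4}", "aaa"),  # True: between two and four
    "a{2,4} on 'aaaaa'": matches(r"a{2,4}", "aaaaa"),  # False: five is too many
}

for label, outcome in results.items():
    print(label, "->", outcome)
```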

Groups
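Capturing Groups - (X): treats X as a single unit and stores the text it matched
Non Capturing Groups - (?:X): treats X as a single unit without storing the match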
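The difference between the two kinds of group shows up in what the match object stores. In the sketch below, the date string and the name are made up for illustration.

```python
import re

# Capturing groups: each (...) is numbered from 1, left to right.
captured = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", "2024-05-17")
date_parts = captured.groups()
year = captured.group(1)

# Non-capturing group: (?:...) groups the alternatives but stores nothing,
# so only the name in the second pair of parentheses is captured.
titled = re.fullmatch(r"(?:Mr|Ms|Dr)\. (\w+)", "Dr. Okafor")
name_parts = titled.groups()

print(date_parts)  # ('2024', '05', '17')
print(year)        # 2024
print(name_parts)  # ('Okafor',)
```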

Importing the required module

In [1]:
import re

re.fullmatch()
It returns a match object if and only if the entire string matches the pattern. Otherwise, it will return None.
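To see what "the entire string" means in practice, `re.fullmatch()` can be compared with two other functions from the same module, `re.match()` (anchored at the start only) and `re.search()` (anywhere in the string). The pattern and inputs below are made up for illustration.

```python
import re

pattern = r"\d{3}"  # exactly three digits

full_ok = re.fullmatch(pattern, "123")        # match: the whole string is three digits
full_long = re.fullmatch(pattern, "1234")     # None: one digit is left over
start_only = re.match(pattern, "1234")        # match on "123": only the start must match
start_fail = re.match(pattern, "ab1234")      # None: the string does not start with digits
anywhere = re.search(pattern, "ab1234")       # match on "123": found inside the string

print(full_ok.group())     # 123
print(full_long)           # None
print(start_only.group())  # 123
print(start_fail)          # None
print(anywhere.group())    # 123
```

For validating user input, `re.fullmatch()` is usually the safer choice of the three, because trailing or leading extra characters cause the match to fail.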

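Regular expression to match a Nigerian phone number

In [50]:
import re

number = input('Enter your number: ')
# Assumed local mobile format: 11 digits, starting with 0,
# then 7, 8 or 9, then 0 or 1, then 8 more digits (e.g. 070..., 080..., 081..., 090...)
regex = r"0[789][01]\d{8}"
match = re.fullmatch(regex, number)
if match:
    # \033[32m switches the terminal text to green; \033[0m resets it
    print('\033[32m' + number, 'is valid' + '\033[0m')
else:
    # \033[31m switches the terminal text to red
    print('\033[31m' + number, 'not valid' + '\033[0m')

07043360273 is valid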


In [23]:
# Step 1 - Ask the user to enter a phone number
user_entered_mobile_number = input("Enter a mobile number to validate: ")

# Step 2 - Create a regex pattern for Indian mobile numbers (10 digits, first digit 6-9)
regex = r"[6-9]\d{9}"

# Step 3 - Match the Regex with the user entered mobile number
match = re.fullmatch(regex, user_entered_mobile_number)

# Step 4 - Report whether the number is valid
if match:
    print(user_entered_mobile_number, 'is valid.')
else:
    print(user_entered_mobile_number, 'not valid.')

90909090909 not valid.
