## Regular Expressions 

A **Regular Expression (regex)** is a sequence of characters that defines a search pattern. Regular expressions are primarily used for string searching, manipulation, and validation. They provide a powerful tool for text processing and pattern matching tasks.

#### Uses of Regular Expressions

Regular expressions are widely used in various fields for the following tasks:

1. **Data Pre-processing**:
   - Regex helps in cleaning and transforming raw data by identifying specific patterns in text and modifying or removing them.

2. **Rule-based Information Mining Systems**:
   - In rule-based systems, regular expressions help in extracting relevant information based on predefined patterns.

3. **Pattern Matching**:
   - Regex allows for searching text to find occurrences of a particular pattern, such as finding words, numbers, or email addresses in large text corpora.

4. **Text Feature Engineering**:
   - In machine learning and natural language processing (NLP), regex is used to extract relevant features from text data, such as keywords, tokens, or named entities.

5. **Web Scraping**:
   - When scraping data from web pages, regex is often used to extract specific pieces of information from HTML content, such as links, tables, or specific text sections.

6. **Data Validation**:
   - Regex is widely used in input validation to ensure that data conforms to a specific format, for example when validating email addresses, phone numbers, or credit card numbers.

7. **Data Extraction**:
   - Regex allows for extracting important pieces of information from large datasets by matching specific patterns in the data.

### Importing module 

In [1]:
import re

#### 1. Search 
The re.search() function in Python's re module scans a string for a pattern. It returns a match object for the first occurrence of the pattern, or None if the pattern is not found.

In [2]:
search = re.search('learn', 'i am learning data science')
print("The word covers the range:", search.span(), "indices.")
print("The word starts at index:", search.start())
print("The word ends at index (exclusive):", search.end())

The word covers the range: (5, 10) indices.
The word starts at index: 5
The word ends at index (exclusive): 10


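#### 2. Findall
The re.findall() function returns all non-overlapping matches of the pattern as a list, or an empty list when nothing matches.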

In [3]:
a='deep understanding and thinking deeply releases deep thoughts which deepen the knowledge'
b='deep'

print("The number of Matches are :",len(re.findall(b,a)))
print(re.findall(b,a))

m = len(re.findall(b, a))
print(f"Found the word '{b}' {m} times in the given string.")

print(re.findall(r"\b\d+\b", "a 1 b 2 c 3"))



The number of Matches are : 4
['deep', 'deep', 'deep', 'deep']
Found the word 'deep' 4 times in the given string.
['1', '2', '3']


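#### 3. Finditer
The re.finditer() function yields a match object for each non-overlapping match, so the position of every match is available through start(), end(), and span().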

In [4]:
a = 'deep understanding and thinking deeply releases deep thoughts which deepen the knowledge'
b = 'deep'

# Use re.finditer() to get match objects
matches = list(re.finditer(b, a))

# Print the count of matches
print(f"Found the word '{b}' at {len(matches)} positions in the given string.")

# Print the start indices of each match
indices = [match.start() for match in matches]
print(f"The word '{b}' is found at indices: {indices}")


Found the word 'deep' at 4 positions in the given string.
The word 'deep' is found at indices: [0, 32, 48, 68]
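
The four hits above include 'deeply' and 'deepen', because a plain pattern matches anywhere inside a word. As a minimal sketch (the variable names are illustrative), adding `\b` word boundaries restricts the search to the standalone word:

```python
import re

text = 'deep understanding and thinking deeply releases deep thoughts which deepen the knowledge'

# \b asserts a word boundary, so 'deeply' and 'deepen' are no longer counted.
whole_word_hits = [(m.start(), m.end()) for m in re.finditer(r'\bdeep\b', text)]
print(whole_word_hits)  # [(0, 4), (48, 52)]
```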


### Creating the Regex

In [5]:
txt = 'My telephone number is 834-4324-345'
pattern = r'\d\d\d-\d\d\d\d-\d\d\d'  # raw string, so \d is passed to the regex engine unchanged

re.search(pattern, txt)

<re.Match object; span=(23, 35), match='834-4324-345'>

In [6]:
# equivalent, more compact ways of writing the pattern
pattern_1 = r'\d{3}-\d{4}-\d{3}'
pattern_2 = r'(\d+)-(\d+)-(\d+)'
re.search(pattern_2,txt).group()

'834-4324-345'

#### Using groups in various ways 

In [7]:
pattern = r'(\d+)-(\w+)-(\d+)'
string = "Order ID: 05-deesha-15"

match = re.search(pattern, string)
if match:
    print(match.group(0))  #  (entire match)
    print(match.group(1))  #  (first capturing group)
    print(match.group(2))  #  (second capturing group)
    print(match.group(1,3))
    


05-deesha-15
05
deesha
('05', '15')
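
Groups can also be given names, which tends to read better than positional indices. The sketch below reuses the same order string; the group names `prefix`, `name`, and `suffix` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical group names, picked for this example only.
pattern = r'(?P<prefix>\d+)-(?P<name>\w+)-(?P<suffix>\d+)'
match = re.search(pattern, "Order ID: 05-deesha-15")

if match:
    print(match.group('name'))  # deesha
    print(match.groupdict())    # {'prefix': '05', 'name': 'deesha', 'suffix': '15'}
    print(match.groups())       # ('05', 'deesha', '15')
```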


In [8]:
match = re.search(r'\d+', "No digits here")
if match:
    print(match.group())  # Safe to call only if match is not None
else:
    print("No match found.")


No match found.


### Specific pattern finding 


1. With specific patterns 

In [9]:
srch=re.findall('at', 'The rat sat on the mat and attached by a cat')
print(srch)
srch_1=re.findall('.at', 'The rat sat on the mat and attached by a cat')
print(srch_1)
srch_2=re.findall('.a.', 'The rat sat on the mat and attached by a cat')
print(srch_2)

['at', 'at', 'at', 'at', 'at']
['rat', 'sat', 'mat', ' at', 'cat']
['rat', 'sat', 'mat', ' an', ' at', 'tac', ' a ', 'cat']


In [10]:
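all_digits = re.findall(r'\d', '4 is divisible by 2 not by 3')  # every digit; the name avoids shadowing the built-in all()
print(all_digits)
start = re.findall(r'^\d', '4 is divisible by 2 not by 3')      # a digit at the very start of the string
print(start)
end = re.findall(r'\d$', '4 is divisible by 2 not by 31')       # a digit at the very end of the string
print(end)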

['4', '2', '3']
['4']
['1']


In [11]:
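# a digit with whitespace on both sides
with_spaces = re.findall(r'\s\d\s', '4 is divisible by 2 not by 3')
print(with_spaces)
# a digit directly after a letter (a comma inside the class would match a literal ',')
after_letter = re.findall(r'[a-zA-Z]\d', '4 is divisible by2 not by 3')
print(after_letter)
# lookbehind: a digit preceded by any character
any_bef = re.findall(r'(?<=.)\d', 'x4 is divisible by 2 and y3')
print(any_bef)
# lookahead: a digit followed by any character
any_aft = re.findall(r'\d(?=.)', '4x is divisible by 2z and 3')
print(any_aft)

# The regex \d(?=.) ensures that:
# - only digits followed by a character are matched;
# - the following character is not included in the match.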

[' 2 ']
['y2']
['4', '2', '3']
['4', '2']


In [12]:
# All alphanumeric characters
txt = r'Welcome to GFG data science Program for 499 days for only 3999 ~!@#$%^&*()_+=-/><?.,|\}{[]:;'
srch = re.findall(r'[A-Za-z0-9]', txt)
# Everything except alphanumerics and whitespace, using ^ (negation)
srch2 = re.findall(r'[^A-Za-z0-9\s]', txt)
print(len(srch))
len(srch2)

51


29

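### SUB and SUBN functions
The sub and subn functions find a pattern in the target string and replace its matches with another string. By default both replace every occurrence; the optional count argument limits the replacement to the first count matches. sub returns the new string, while subn returns a tuple containing the new string and the number of substitutions made.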

In [13]:
print(re.sub(r"\b\d+\b", "2022", "This year is 2021"))

This year is 2022


In [14]:
print(re.subn(r"\b\d+\b", "NUMBER", "a 1 b 2 c 3 d 4", count=2))

('a NUMBER b NUMBER c 3 d 4', 2)
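
To make the role of `count` and the difference in return values concrete, here is a small side-by-side sketch on the same input (the variable names are illustrative):

```python
import re

text = "a 1 b 2 c 3 d 4"
pattern = r"\b\d+\b"

# count defaults to 0, which means "replace every match".
all_replaced = re.sub(pattern, "NUMBER", text)
# count=1 replaces only the first match.
first_only = re.sub(pattern, "NUMBER", text, count=1)
# subn performs the same replacement but also reports how many were made.
new_text, n_subs = re.subn(pattern, "NUMBER", text)

print(all_replaced)      # a NUMBER b NUMBER c NUMBER d NUMBER
print(first_only)        # a NUMBER b 2 c 3 d 4
print(new_text, n_subs)  # a NUMBER b NUMBER c NUMBER d NUMBER 4
```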


In [15]:
text = "a 1 b 2 c 3 d 4 "
text = re.sub(r"\b\d+\b", "NUMBER", text, count=1)
# Replace the last occurrence of a number
text = re.sub(r"\b\d+\b(?=[^\d]*$)", "NUMBER", text)
print(text)

a NUMBER b 2 c 3 d NUMBER 


#### Exclusion

In [16]:
txt = "I'm deep and roll no is 22124"

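print(''.join(re.findall(r'[^\d]', txt)))  # everything except digits
print(''.join(re.findall(r'[\d]', txt)))   # digits only
print(''.join(re.findall(r'[\D]', txt)))   # \D is shorthand for "not a digit"
print(''.join(re.findall(r'[^\D]', txt)))  # negating \D gives digits again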

I'm deep and roll no is 
22124
I'm deep and roll no is 
22124


#### Phone numbers in different patterns

In [17]:
for number in ["6-622534-342", "543-5345-645","4563-453-445","53-5453-5345","435-234-6324"]:
    # findall returns a list; take the first match and strip the dashes
    print(re.findall(r'\d+-\d+-\d+', number)[0].replace('-', ''))

6622534342
5435345645
4563453445
5354535345
4352346324
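
An alternative sketch for the same normalisation: check the overall shape with `re.fullmatch` and then delete every non-digit with `re.sub`. The list name `raw_numbers` and the three sample values are taken from the cell above for illustration.

```python
import re

raw_numbers = ["6-622534-342", "543-5345-645", "4563-453-445"]

digits_only = []
for raw in raw_numbers:
    # Accept only strings made of exactly three dash-separated digit groups.
    if re.fullmatch(r'\d+-\d+-\d+', raw):
        # \D matches any non-digit character, so this removes the dashes.
        digits_only.append(re.sub(r'\D', '', raw))

print(digits_only)  # ['6622534342', '5435345645', '4563453445']
```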


#### Email Pattern Matching 

In [32]:
mail='deepsethi00090@gmail.com.commmmmmmm'
pattern = r'[A-Za-z0-9]+@\w+\.\w+\.\w+'   # \. matches a literal dot; a bare . would match any character
pattern_2 = r'[A-Za-z0-9]+@\w+\.\w+\.\w'  # ends after a single word character
srch=re.search(pattern,mail)
print(srch.span())
srch_2=re.search(pattern_2,mail)
print(srch_2.span()) 

(0, 35)
(0, 26)


In [28]:
mail='deepsethi00090@gmail.com.commmmmmmm'
print(len(mail))

35
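
In email patterns, whether the dot is escaped makes a real difference: an unescaped `.` matches any character, while `\.` matches only a literal dot. A minimal sketch with two made-up sample strings:

```python
import re

samples = ['user@gmail.com', 'user@gmailXcom']  # made-up inputs for illustration

loose = r'\w+@\w+.\w+'    # '.' matches any character
strict = r'\w+@\w+\.\w+'  # '\.' matches a literal dot only

for s in samples:
    print(s, bool(re.fullmatch(loose, s)), bool(re.fullmatch(strict, s)))
# user@gmail.com True True
# user@gmailXcom True False
```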


#### Matching any string that has the form of a mail address

In [57]:
def is_valid_mail(s):
    # user@host.tld with a fixed set of top-level domains
    pattern = r'[a-zA-Z0-9]+@\w+\.(org|in|com|edu|gov|net)'
    # university-style addresses with two dots, e.g. user@host.ac.in
    pattern_2 = r'[a-zA-Z0-9]+@\w+\.\w+\.in'
    # re.search does not anchor the match, so extra leading or trailing characters are still accepted
    srch = re.search(pattern, s)
    srch_2 = re.search(pattern_2, s)

    if srch or srch_2:
        print('The input has the form of a mail address.')
    else:
        print("The input doesn't match any of the mail formats.")


In [58]:
is_valid_mail('lovepreetgmail.com')

The input doesn't match any of the mail formats.


In [60]:
is_valid_mail('22301@iiitu.ac.intlll')

The input has the form of a mail address.
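
The last input is accepted even though it ends in 'intlll', because `re.search` only needs the pattern to occur somewhere in the string. A stricter sketch, assuming the whole string must match, can use `fullmatch`. The name `looks_like_mail` and the combined pattern are hypothetical; the list of top-level domains is copied from the function above, and the optional second-level part is a simplification.

```python
import re

# Hypothetical stricter variant: optional second-level label, fixed TLD list.
MAIL_RE = re.compile(r'[A-Za-z0-9]+@\w+(\.\w+)?\.(org|in|com|edu|gov|net)')

def looks_like_mail(s):
    """Return True only if the entire string matches the pattern."""
    return MAIL_RE.fullmatch(s) is not None

print(looks_like_mail('22301@iiitu.ac.in'))      # True
print(looks_like_mail('22301@iiitu.ac.intlll'))  # False
print(looks_like_mail('lovepreetgmail.com'))     # False
```

This is still far from a complete email validator; it only illustrates the effect of anchoring the match.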


#### Email Extraction

In [97]:
def process_emails(mails):
    user_id = []
    host_name = []
    domain_type = []
    
    for mail in mails:
        u_id = mail.split('@')[0]
        host = mail.split('@')[1].split('.')[0]
        domain = '.'.join(mail.split('@')[1].split('.')[1:])
        
        user_id.append(u_id)
        host_name.append(host)
        domain_type.append(domain)
        
        print(f" The User ID is : {u_id}, with the Host Name: {host}, and the Domain Type is: {domain}")
    
    return user_id, host_name, domain_type


In [98]:
emails = [
    'user1@gmail.com', 'user2@yahoo.co.uk', 'john_doe@outlook.com', 'alice123@domain.org',
    'bob_smith@company.biz', 'charlie.m@university.edu', 'david_wilson@company.co', 'emma.jones@website.net',
    'frank_lee@service.org', 'george.king@brand.info', 'hannah.white@provider.tv', 'ian_gray@shop.store',
    'jack_rose@domain.co.uk', 'karen.m@company.com', 'luke_brown@platform.xyz', 'mary.j@site.org',
    'nick_smith@tech.com', 'olivia.wilson@store.shop', 'paul_johnson@socialmedia.org', 'queen_b@web.net'
]

user_ids, host_names, domain_types = process_emails(emails)

 The User ID is : user1, with the Host Name: gmail, and the Domain Type is: com
 The User ID is : user2, with the Host Name: yahoo, and the Domain Type is: co.uk
 The User ID is : john_doe, with the Host Name: outlook, and the Domain Type is: com
 The User ID is : alice123, with the Host Name: domain, and the Domain Type is: org
 The User ID is : bob_smith, with the Host Name: company, and the Domain Type is: biz
 The User ID is : charlie.m, with the Host Name: university, and the Domain Type is: edu
 The User ID is : david_wilson, with the Host Name: company, and the Domain Type is: co
 The User ID is : emma.jones, with the Host Name: website, and the Domain Type is: net
 The User ID is : frank_lee, with the Host Name: service, and the Domain Type is: org
 The User ID is : george.king, with the Host Name: brand, and the Domain Type is: info
 The User ID is : hannah.white, with the Host Name: provider, and the Domain Type is: tv
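 The User ID is : ian_gray, with the Host Name: shop, and the Domain Type is: store
 The User ID is : jack_rose, with the Host Name: domain, and the Domain Type is: co.uk
 The User ID is : karen.m, with the Host Name: company, and the Domain Type is: com
 The User ID is : luke_brown, with the Host Name: platform, and the Domain Type is: xyz
 The User ID is : mary.j, with the Host Name: site, and the Domain Type is: org
 The User ID is : nick_smith, with the Host Name: tech, and the Domain Type is: com
 The User ID is : olivia.wilson, with the Host Name: store, and the Domain Type is: shop
 The User ID is : paul_johnson, with the Host Name: socialmedia, and the Domain Type is: org
 The User ID is : queen_b, with the Host Name: web, and the Domain Type is: net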
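
The same three parts can be pulled out with a single regex instead of repeated `split` calls. The sketch below is one possible pattern, not the only one; the name `EMAIL_PARTS` and the group names are hypothetical.

```python
import re

# Hypothetical pattern: user part, first host label, then everything after the first dot.
EMAIL_PARTS = re.compile(r'(?P<user>[^@]+)@(?P<host>[^.]+)\.(?P<domain>.+)')

for mail in ['user2@yahoo.co.uk', 'charlie.m@university.edu']:
    m = EMAIL_PARTS.fullmatch(mail)
    if m:
        print(m.group('user'), m.group('host'), m.group('domain'))
# user2 yahoo co.uk
# charlie.m university edu
```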

#### Extracting numbers from running text with a function

In [81]:
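def numbers(s):
    c = 0        # length of the current run of digits
    found = []   # named 'found' to avoid shadowing the built-in list
    for i in range(len(s)):
        if s[i].isdigit():
            if c == 0:
                start = i
            c += 1
        else:
            if c != 0:
                found.append(s[start:i])
            c = 0

    # a number that runs to the end of the string
    if c != 0:
        found.append(s[-c:])

    for number in found:
        print(number)

    print('The numbers in the text are :', " ".join(found))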

In [82]:
numbers('5 cats and 3 dogs played in 1 park, while 10 birds flew above 2 trees and 4 kids laughed for 6 hours.')

5
3
1
10
2
4
6
The numbers in the text are : 5 3 1 10 2 4 6
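
For comparison, the character-by-character loop can be replaced by one `re.findall` call with `\d+`, which matches each maximal run of digits. A minimal sketch on the same sentence:

```python
import re

sentence = '5 cats and 3 dogs played in 1 park, while 10 birds flew above 2 trees and 4 kids laughed for 6 hours.'

# \d+ matches one or more consecutive digits, so '10' stays a single number.
found = re.findall(r'\d+', sentence)
print(found)  # ['5', '3', '1', '10', '2', '4', '6']
```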


#### Word search 

In [87]:
txt = 'AI is having the capability to revolutionize all the industries. The Industry is ready for a big change'.lower()
patt = 'the'.lower()

for i in range(len(txt)):
    if(txt[i:i+len(patt)]==patt):
        print(i,i+len(patt))

13 16
49 52
65 68
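
The same positions can be obtained with `re.finditer`. In this sketch the `re.IGNORECASE` flag replaces the manual `.lower()` calls, and `re.escape` is a precaution in case the search term contains regex metacharacters; the variable `spans` is illustrative.

```python
import re

txt = 'AI is having the capability to revolutionize all the industries. The Industry is ready for a big change'

# Collect (start, end) for every case-insensitive occurrence of 'the'.
spans = [(m.start(), m.end()) for m in re.finditer(re.escape('the'), txt, flags=re.IGNORECASE)]
for start, end in spans:
    print(start, end)
# 13 16
# 49 52
# 65 68
```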


#### Words with specific starting characters and specific endings

In [93]:
txt = 'AI is having the capability to revolutionize all the industries. The Industry is ready for a big change'
word = 'in'


for wrd in txt.split(' '):
    if (wrd[:2].lower() == word.lower()):
        print(wrd)
        
print('\n')

txt_2 = 'AI is having the capability to revolutionize all the industries. The Industry is ready for a big change'
word_2='he'
for wrd in txt_2.split(' '):
    if(wrd[-2:].lower()==word_2.lower()):
        print(wrd)

industries.
Industry


the
the
The
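
A regex sketch of the same two searches, using `\b` word boundaries. Note one difference from the split-based loop: `\w*` stops at punctuation, so 'industries' is returned without its trailing period.

```python
import re

txt = 'AI is having the capability to revolutionize all the industries. The Industry is ready for a big change'

# Words that start with 'in' (case-insensitive).
starts_with_in = re.findall(r'\bin\w*', txt, flags=re.IGNORECASE)
# Words that end with 'he' (case-insensitive).
ends_with_he = re.findall(r'\b\w*he\b', txt, flags=re.IGNORECASE)

print(starts_with_in)  # ['industries', 'Industry']
print(ends_with_he)    # ['the', 'the', 'The']
```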


#### Printing consecutive word pairs of the text


In [94]:
txt='Cryptocurrency is a digital or virtual currency secured by cryptography, decentralized, and based on blockchain technology, offering financial privacy, security, and new investment opportunities globally.'

words = txt.split(' ')  # split once instead of on every iteration
for first, second in zip(words, words[1:]):
    print(first, second)

Cryptocurrency is
is a
a digital
digital or
or virtual
virtual currency
currency secured
secured by
by cryptography,
cryptography, decentralized,
decentralized, and
and based
based on
on blockchain
blockchain technology,
technology, offering
offering financial
financial privacy,
privacy, security,
security, and
and new
new investment
investment opportunities
opportunities globally.
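
If punctuation should not stick to the words, the tokens can be taken with `re.findall(r'\w+', ...)` before pairing. This is a sketch on a shortened sample sentence, not the full text above.

```python
import re

# Shortened sample sentence, used only for illustration.
sentence = 'Cryptocurrency is a digital or virtual currency secured by cryptography.'

# \w+ keeps only runs of word characters, so the trailing period is dropped.
words = re.findall(r'\w+', sentence)
pairs = list(zip(words, words[1:]))

print(pairs[:3])  # [('Cryptocurrency', 'is'), ('is', 'a'), ('a', 'digital')]
print(pairs[-1])  # ('by', 'cryptography')
```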
