# Python with Prof. Chauhan Bhavik
## Regular Expressions

### Part 1: Introduction to Regular Expressions
1) What are Regular Expressions (REs)?
2) Why use Regular Expressions?
3) Using Python's re module
4) Basic regex functions with examples (match, search, findall, split, sub)
5) Small practice exercises

### Part 2: Sequence Characters in Regular Expressions
6) What are sequence characters in regex?
7) Common sequence characters in Python's re module
8) Examples with explanation
9) Practice exercises

### Part 3: Quantifiers in Regular Expressions
10) What are quantifiers in regex?
11) Different types of quantifiers in Python
12) Practical examples with explanation
13) Practice exercises

### Part 4: Special Characters in Regular Expressions
14) Special characters in regex
15) How to use them in Python
16) Examples with explanation
17) Practice exercises

### Part 5: Using Regular Expressions on Files
18) How to use regex with text files
19) Reading a file in Python
20) Applying regex patterns on file content
21) Extracting useful information
22) Practice exercises

### Part 6: Retrieving Information from an HTML File using Regex
23) Reading HTML content
24) Using regex to extract:
  - Titles
  - Headings
  - Hyperlinks
  - Emails
25) Practice exercises

## Part 1: Introduction to Regular Expressions
### 1) What are Regular Expressions (REs)?
- Regular Expressions (REs) are patterns used to match strings. They are very useful for searching, replacing, and extracting text.

- Real-life use cases:
  - Validating email addresses or phone numbers
  - Extracting data from text/logs
  - Web scraping
### 2) Why use Regular Expressions?
- We use regular expressions to find, validate, extract, and replace patterns in text (like emails, phone numbers, dates, log entries) quickly and efficiently.

### 3) Using Python's re module
- To work with regular expressions in Python, we use the built-in re module.
- Let's import the regular expressions module

In [195]:
import re  

### 4) Basic regex functions with examples (match, search, findall, split, sub)

#### re.match()
- re.match() → checks for a match at the beginning of the string
    - match.start() → Returns the starting index of the matched substring.
    - match.end() → Returns the ending index (exclusive) of the matched substring. 

In [199]:
text = 'Bhavik'
m = re.match("Bh", text) 
print(m)

<re.Match object; span=(0, 2), match='Bh'>


In [201]:
text = 'Energy can be transformed from one form to another form'

m1 = re.match("form", text)
m2 = re.match("Energy can be", text)
print(m1)
print(m2)

None
<re.Match object; span=(0, 13), match='Energy can be'>


#### re.search()
- re.search() → searches for a match anywhere in the string

In [204]:
text = 'Energy can be transformed from one form to another form'
pattern = 'form'

match = re.search(pattern, text)
print(match)

<re.Match object; span=(19, 23), match='form'>


In [206]:
if match:
    print('Pattern found at:', match.start(), 'to', match.end())
else:
    print('Pattern not found')

Pattern found at: 19 to 23
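
A match object also exposes the matched text through `group()`, and a failed search returns `None` rather than raising an error. The sketch below wraps both facts in a small helper. The name `locate` is hypothetical and used only for illustration; it is not part of the `re` module.

```python
import re

text = 'Energy can be transformed from one form to another form'

def locate(pattern, text):
    """Return (matched_text, start, end) for the first match, or None."""
    m = re.search(pattern, text)
    if m is None:          # no match: re.search returned None
        return None
    return (m.group(), m.start(), m.end())

found = locate('form', text)
missing = locate('heat', text)

print(found)    # ('form', 19, 23)
print(missing)  # None
```

Slicing the string with the returned indexes, `text[19:23]`, gives back the matched text `'form'`.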


#### re.findall()
- re.findall() → returns all matches in a list

In [209]:
print(re.findall(pattern, text))

['form', 'form', 'form']


#### re.split()
- re.split() → splits string by a pattern

In [212]:
print(re.split(' ', text))

['Energy', 'can', 'be', 'transformed', 'from', 'one', 'form', 'to', 'another', 'form']


#### re.sub()
- re.sub() → replaces pattern with another string

In [215]:
print(re.sub('Energy', 'Water', text))

Water can be transformed from one form to another form


### 5) Small practice exercises

Try these small tasks:
- Find all numbers in the string: 'My phone number is 9876543210 and office number is 079-23232323'
- Replace all spaces in 'Python is fun to learn' with '-'
- Split an email address like 'student@example.com' into username and domain.

## Part 2: Sequence Characters in Regular Expressions
### 6) What are sequence characters in regex?
- Sequence characters in regex are special symbols that represent a class of characters (like digits, words, or spaces) instead of writing them explicitly.
### 7) Common sequence characters in Python's re module
- \d → Any digit (0–9)
- \D → Any non-digit
- \w → Any word character (letters, digits, underscore)
- \W → Any non-word character
- \s → Any whitespace (space, tab, newline)
- \S → Any non-whitespace
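
The examples in this tutorial write patterns with an `r` prefix (raw strings). A short sketch of why, using `\b` (the word-boundary sequence that appears later in Part 5): in an ordinary Python string, `'\b'` is a single backspace character, so the regex engine never sees the backslash. The sample text is made up for illustration.

```python
import re

text = "cat concat"

# In a normal string, '\b' is one character (backspace), not a regex escape.
print(len('\b'), len(r'\b'))          # 1 2

plain = re.findall('\bcat\b', text)   # looks for backspace + 'cat' + backspace
raw = re.findall(r'\bcat\b', text)    # looks for the whole word 'cat'

print(plain)  # []
print(raw)    # ['cat']
```

Using raw strings for every pattern avoids having to remember which escape sequences Python itself interprets.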
### 8) Examples with explanation

In [9]:
import re

txt = "My contact number is 1232459876 and office contact number is 079-14568932"

##### Example 1: Find all digits in text

In [223]:
print(re.findall(r'\d+',txt))

['1232459876', '079', '14568932']


##### Example 2: Find all words

In [226]:
print(re.findall(r'\w+',txt))

['My', 'contact', 'number', 'is', '1232459876', 'and', 'office', 'contact', 'number', 'is', '079', '14568932']


##### Example 3: Find all whitespaces

In [229]:
print(re.findall(r'\s+',txt))

[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


##### Example 4: Extract non-digit characters

In [232]:
print(re.findall(r'\D+',txt))

['My contact number is ', ' and office contact number is ', '-']


### 9) Practice exercises
- Extract all words from 'Hello_123 World99 Python_3' using \w+.
- Find all numbers in 'Invoice: 555, Price: 1234, Code: 77'.
- Count how many whitespaces are in 'Python is easy to learn'.

## Part 3: Quantifiers in Regular Expressions
### 10) What are quantifiers in regex?
- Quantifiers define how many times a character or group should appear in the string.


### 11) Different types of quantifiers in Python
- Common Quantifiers:
    - `*` → 0 or more times
    - `+` → 1 or more times
    - `?` → 0 or 1 time (optional)
    - `{n}` → exactly n times
    - `{n,}` → n or more times
    - `{n,m}` → between n and m times
### 12) Practical examples with explanation

In [3]:
import re
t = "aaa abc ada"
print(re.findall(r'a*', t))        # * → 0 or more times

['aaa', '', 'a', '', '', '', 'a', '', 'a', '']
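
The empty strings in this output come from positions where `a*` matches zero characters, which is allowed because `*` means "0 or more". One way to see this is `re.finditer`, which yields match objects so the position of each match can be inspected. The variable names below are illustrative.

```python
import re

t = "aaa abc ada"

# Record where each match starts and ends, plus the matched text.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r'a*', t)]

for start, end, matched in spans:
    kind = "empty" if start == end else "non-empty"
    print(start, end, repr(matched), kind)

# Keeping only the non-empty matches gives the same result as r'a+'.
non_empty = [matched for _, _, matched in spans if matched]
print(non_empty)   # ['aaa', 'a', 'a', 'a']
```

Every empty match has `start == end`: the pattern succeeded without consuming any character.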


In [5]:
print(re.findall(r'a+',t))        # + → 1 or more times

['aaa', 'a', 'a', 'a']


In [7]:
tx = "color colr colour"
print(re.findall(r'colou?r',tx))    # "u?" → matches u zero or one time

['color', 'colour']


The `?` quantifier is useful for optional letters: `colou?r` matches both "color" and "colour", and `ba?t` matches both "bat" and "bt".

In [10]:
text = """
My favorite color is blue.
My favourite colour is red.
Some people write color, others write colour.
"""

matches = re.findall(r'colou?r', text)

print("Matches found: ", matches)

Matches found:  ['color', 'colour', 'color', 'colour']


In [12]:
m = re.findall(r'favou?rite', text)

print("Here I found : ",m)

Here I found :  ['favorite', 'favourite']


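When processing English text, you might want to match both American spellings (color, honor) and British spellings (colour, honour).

Regex with `?` makes that easy when the difference is an optional letter. Pairs such as analyze/analyse differ by a substituted letter instead, so they need a character class like `analy[sz]e` (see Part 4).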

In [15]:
text = """
Visit our site at http://example.com
Make sure to use https://secure.com for secure browsing.
"""

matches = re.findall(r'https?', text)
print("Matches found:", matches)

Matches found: ['http', 'https']


In [37]:
text = " 1 11 111 1111 11111"
print("Exactly n times (n=4):", re.findall(r'1{4}',text))
print("n or more times (n=3):", re.findall(r'1{3,}',text))
print("Between n and m times (n=2, m=4):", re.findall(r'1{2,4}',text))

Exactly n times (n=4): ['1111', '1111']
n or more times (n=3): ['111', '1111', '11111']
Between n and m times (n=2, m=4): ['11', '111', '1111', '1111']
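
Note that `1{4}` also finds four 1s inside the longer run `11111`, which is why `'1111'` appears twice in the output. If the goal is runs of exactly that length, one option is to anchor the pattern with `\b` (word boundary, explained in Part 5). A minimal sketch:

```python
import re

text = " 1 11 111 1111 11111"

# \b on both sides requires the run of 1s to be a complete "word".
exactly_four = re.findall(r'\b1{4}\b', text)
two_to_four = re.findall(r'\b1{2,4}\b', text)

print(exactly_four)  # ['1111']
print(two_to_four)   # ['11', '111', '1111']
```

With the boundaries in place, `11111` no longer contributes a partial match.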


### 13) Practice exercises
- Find all words that start with `'a'` and have at least 2 `'b'` in `'abb abbb abbbb a ab'`.
- Extract numbers with exactly 3 digits from `'123 45 6789 12 999'`.
- Match words ending with `'ing'` in `'playing run sing walking talking'`.


## Part 4: Special Characters in Regular Expressions
### 14) Special characters in regex
Regex has special characters that change how patterns are interpreted.
#### What are Special Characters in Regex?
**Common Special Characters:**
- `.` → Matches any single character except newline
- `^` → Matches the beginning of a string
- `$` → Matches the end of a string
- `[]` → Matches any single character listed inside the brackets
- `[^ ]` → Matches any single character **not** listed inside the brackets
- `|` → Acts like OR
- `()` → Groups expressions
- `\` → Escapes a special character

### 15) How to use them in Python
After importing the `re` module, these special characters can be used in any pattern passed to its functions.

In [64]:
import re

### 16) Examples with explanation

##### (dot) . → Matches any single character except newline

In [68]:
text = "cat bat rat mat rom"
print(re.findall(r'.at', text))

['cat', 'bat', 'rat', 'mat']


##### (caret) ^ → Matches the beginning of a string

In [72]:
text1 = "Python is fun"
text2 = "Learning python is fun"
print(re.findall(r'^Python', text1))
print(re.findall(r'^python', text2))

['Python']
[]


##### $ → Matches the end of a string

In [91]:
text1 = "python is love"
text2 = "Learning python is fun"
print(re.findall(r'fun$', text1))
print(re.findall(r'fun$', text2))

[]
['fun']
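
By default, `^` and `$` anchor to the start and end of the whole string, even when the string contains several lines. Passing the `re.MULTILINE` flag makes them match at the start and end of each line instead. The sample text below is invented for this sketch.

```python
import re

text = "Python is fun\nLearning is fun\nPython is love"

whole_string = re.findall(r'^Python', text)                   # start of string only
each_line = re.findall(r'^Python', text, flags=re.MULTILINE)  # start of every line
no_flag_ends = re.findall(r'fun$', text)                      # string ends with 'love'
line_ends = re.findall(r'fun$', text, flags=re.MULTILINE)     # end of every line

print(whole_string)  # ['Python']
print(each_line)     # ['Python', 'Python']
print(no_flag_ends)  # []
print(line_ends)     # ['fun', 'fun']
```

This flag becomes useful in Part 5, where a whole file is read into a single multi-line string.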


##### [ ] → Matches any single character inside brackets

In [100]:
text = "cat bat rat mat"
print(re.findall(r'[cr]at', text))

['cat', 'rat']


##### [^ ] → Matches any character not inside brackets

In [106]:
text = "cat bat rat mat"
print(re.findall(r'[^cr]at', text))

['bat', 'mat']


##### | → Acts like OR

In [110]:
text = "I like cows and dogs"
print(re.findall(r'cow|dog', text))

['cow', 'dog']


##### () → Groups expressions

In [116]:
text = "grey gray gems gram"
print(re.findall(r'gr(a|e)y', text))

['e', 'a']
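
The output is `['e', 'a']` rather than the full words because, when a pattern contains a capture group, `re.findall` returns the group's contents instead of the whole match. Two common ways to get the full words back are a non-capturing group `(?:...)` or a character class. A short sketch:

```python
import re

text = "grey gray gems gram"

captured = re.findall(r'gr(a|e)y', text)         # returns only the group
non_capturing = re.findall(r'gr(?:a|e)y', text)  # group is used, not captured
char_class = re.findall(r'gr[ae]y', text)        # same idea with brackets

print(captured)       # ['e', 'a']
print(non_capturing)  # ['grey', 'gray']
print(char_class)     # ['grey', 'gray']
```

For single-character alternatives the character class is usually the simpler choice; `(?:...)` is needed when the alternatives are longer than one character.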


##### \\ → Escapes a special character

In [118]:
text = "Price is 100$"
print(re.findall(r'100\$', text))

['100$']
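
In a raw string, a single backslash is enough to escape a special character: `r'100\$'` matches the text `100$`, whereas `r'100\\$'` asks for a literal backslash followed by the end of the string. When the text to match comes from a variable, `re.escape` can add the backslashes automatically. A minimal sketch:

```python
import re

text = "Price is 100$"

single = re.findall(r'100\$', text)    # \$ is a literal dollar sign
double = re.findall(r'100\\$', text)   # \\ is a literal backslash, then $ anchors the end

literal = "100$"                       # text we want to match exactly
escaped = re.findall(re.escape(literal), text)

print(single)   # ['100$']
print(double)   # []
print(escaped)  # ['100$']
```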


### 17) Practice exercises
1. Match words that end with `at` but do not start with `c` in `"cat bat rat mat"`.
2. Find all strings that start with `'Hello'` in `"Hello World Hello123"`.
3. Extract both `'grey'` and `'gray'` from `"grey gray green"`.
4. Match the dollar sign (`$`) in `"Total cost is 500$"`.

## Part 5: Using Regular Expressions on Files
Regular expressions are often applied to files for:
- Searching logs for error messages
- Extracting phone numbers, emails, or IDs
- Filtering specific lines
- Cleaning unwanted data
### 18) How to use regex with text files

In [239]:
import re

In [265]:
text = """Hello my e-mail address is qwe@ewq.com
Contact: +91-9876543210
Error: File not found!!!
Error is not good!
Back-up email: plm@mlp.com
"""

with open("sample.txt", "w") as f:
    f.write(text)

### 19) Reading a file in Python
We usually read the file into a string, then apply regex.

In [268]:
with open("sample.txt", "r") as f:
    content = f.read()

print(content)

Hello my e-mail address is qwe@ewq.com
Contact: +91-9876543210
Error: File not found!!!
Error is not good!
Back-up email: plm@mlp.com



### 20) Applying regex patterns on file content

#### Let's Find all lines containing 'Error'

In [272]:
e = re.findall(r'.*Error.*',content)
print("Error lines: ", e)

Error lines:  ['Error: File not found!!!', 'Error is not good!']


.* (before Error)

- . → any character except newline (\n)

- Asterisk `*` → zero or more times

- So: matches everything from the start of the line up to the word "Error".

Error

- Literal word "Error".

- This is the keyword we want to detect in each line.

.* (after Error)

- Again: match everything after the word "Error" until the end of the line.
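
Reading the whole file with `f.read()` works well for small files. For larger ones, an alternative is to scan line by line and test each line with `re.search`, which also makes it easy to report line numbers. The sketch below is one possible approach, not the only one. It writes its own copy of the sample text to a temporary folder so that it is self-contained, and the file name `sample_copy.txt` is a hypothetical choice.

```python
import re
import tempfile
from pathlib import Path

sample = """Hello my e-mail address is qwe@ewq.com
Contact: +91-9876543210
Error: File not found!!!
Error is not good!
Back-up email: plm@mlp.com
"""

error_pattern = re.compile(r'Error')   # compile once, reuse for every line

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_copy.txt"   # hypothetical file name
    path.write_text(sample)

    error_lines = []
    with open(path, "r") as f:
        for line_no, line in enumerate(f, start=1):
            if error_pattern.search(line):
                error_lines.append((line_no, line.rstrip("\n")))

print(error_lines)
# [(3, 'Error: File not found!!!'), (4, 'Error is not good!')]
```

Because each line is tested separately, the leading and trailing `.*` are no longer needed here.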

#### Let's extract all words starting with capital letters

In [284]:
capi = re.findall(r'\b[A-Z][A-Za-z]*\b',content)
print("Words starting with capital letters: ",capi)

Words starting with capital letters:  ['Hello', 'Contact', 'Error', 'File', 'Error', 'Back']


\b

- Word boundary → ensures the match starts at the beginning of a word.

- Without this, it could match capital letters even inside words.

[A-Z]

- First character must be a capital letter (A to Z).

[A-Za-z]*

- After the first capital, match zero or more letters (uppercase or lowercase).

- `*` → zero or more, so words of any length can match.

\b (at the end)

- Word boundary → ensures the match ends at the end of a word.

### 21) Extracting useful information
#### Let's Extract all emails from above file.

In [290]:
import re

eml = re.findall(r'[\w._%+-]+@[\w.-]+\.\w+',content)
print("Emails found: ",eml)

Emails found:  ['qwe@ewq.com', 'plm@mlp.com']


##### Let's understand what happened

[\w._%+-]+

- \w → matches any word character (letters, digits, _)
- ._%+- → allows dot ., underscore _, percent %, plus +, hyphen -
- + → one or more of the above
- Matches the username part of the email (before @).
- Example: "john.doe", "user_123", "info+support"

@
- Literal @ symbol that separates user and domain.

[\w.-]+
- \w → word characters
- .- → dot . or hyphen -
- Plus sign + → one or more
- Matches the domain name part.
- Example: "gmail", "yahoo-mail", "university.edu"

\.\w+

- \. → literal dot .
- \w+ → one or more word characters
- Matches the top-level domain (TLD).
- Example: .com, .org, .in, .edu
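
This pattern is a practical heuristic rather than a full email validator. A quick sketch of how it behaves on a few invented strings (none of them come from the sample file):

```python
import re

email_pattern = r'[\w._%+-]+@[\w.-]+\.\w+'

samples = "Write to first.last+tag@mail.example.co.in, not user@localhost. Or a@b.com."

found = re.findall(email_pattern, samples)
print(found)
# ['first.last+tag@mail.example.co.in', 'a@b.com']

# The pattern is permissive: this string is not a sensible address,
# yet every part of the pattern can be satisfied.
odd = re.findall(email_pattern, "..@-.x")
print(odd)
# ['..@-.x']
```

`user@localhost` is skipped because the pattern requires a dot followed by word characters after the `@`, and the full stop that ends the sentence is not swallowed into `a@b.com`.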

#### Let's extract all phone numbers

In [256]:
cont = re.findall(r'\+91-\d{10}', content)
print("Contact number: ", cont)

Contact number:  ['+91-9876543210']


##### Let's understand what happened

\+

- Plus sign + is a special regex character (quantifier: "one or more")

- To match a literal plus sign (+), we escape it with \\

- So \+ matches the actual “+” in +91

91-

- Matches the literal digits 91 followed by a hyphen -

- Together: +91-

\d{10}

- \d → digit (0–9)

- {10} → exactly 10 digits. So this part matches any 10-digit phone number
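
If some numbers in a file lack the country code, the `?` quantifier from Part 3 can make the prefix optional by applying it to a non-capturing group `(?:...)`. The second number below is invented for illustration.

```python
import re

text = "Contact: +91-9876543210, office: 9123456780"

with_prefix_only = re.findall(r'\+91-\d{10}', text)
prefix_optional = re.findall(r'(?:\+91-)?\d{10}', text)

print(with_prefix_only)  # ['+91-9876543210']
print(prefix_optional)   # ['+91-9876543210', '9123456780']
```

A non-capturing group is used so that `re.findall` still returns the whole match rather than only the prefix.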

### 22) Practice exercises
Use regex to complete the following tasks on the `refile.txt` file:
1. All words ending with `ing`.
2. All domains (part after `@`) from the emails.
3. All lines that start with `'Hello'`.
4. Replace all emails with `'[EMAIL]'`.

## Part 6: Retrieving Information from an HTML File using Regex

Although libraries like **BeautifulSoup** are better suited to parsing HTML, regex can be useful for:
- Quick extraction of patterns (like emails or links)
- Log analysis
- Simple scraping tasks

In [314]:
import re 
my_site = """
<html>
<head>
<title>Welcome to My Website</title>
</head>
<body>
<h1>Main Heading</h1>
<p>Contact us at support@example.com</p>
<a href="https://example.com">Visit Example</a>
<a href="https://college.edu">College Website</a>
<h2>Subheading</h2>
<p>Python is powerful!</p>
</body>
</html>
"""

with open("ws.html", "w") as f:
    f.write(my_site)

### 23) Reading HTML content

In [319]:
with open("ws.html", "r") as f:
    html_content = f.read()

print(html_content)


<html>
<head>
<title>Welcome to My Website</title>
</head>
<body>
<h1>Main Heading</h1>
<p>Contact us at support@example.com</p>
<a href="https://example.com">Visit Example</a>
<a href="https://college.edu">College Website</a>
<h2>Subheading</h2>
<p>Python is powerful!</p>
</body>
</html>



### 24) Using regex to extract:

#### Extract titles

In [329]:
title = re.findall(r'<title>(.*?)</title>', html_content)
print("Page Title:", title)

Page Title: ['Welcome to My Website']


##### `<title>`

- Matches the opening `<title>` tag literally.

##### `(.*?)`

- `.` → matches any character (except newline by default).

- `*` → zero or more occurrences.

- `?` → makes it non-greedy (smallest possible match).

- Wrapped in `()` → capture group → extracts only the content between `<title>` and `</title>`.

##### `</title>`

- Matches the closing `</title>` tag literally.
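
The non-greedy `?` matters as soon as the same tag appears more than once on a line. A small sketch with an invented snippet that uses `<li>` tags:

```python
import re

html = "<li>one</li><li>two</li>"

greedy = re.findall(r'<li>(.*)</li>', html)   # .* runs to the LAST </li>
lazy = re.findall(r'<li>(.*?)</li>', html)    # .*? stops at the FIRST </li>

print(greedy)  # ['one</li><li>two']
print(lazy)    # ['one', 'two']
```

With the single `<title>` in the sample page both forms give the same result, but the non-greedy form is the safer habit.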

#### Headings


In [331]:
headings = re.findall(r'<h[1-6]>(.*?)</h[1-6]>', html_content)
print("Headings:", headings)

Headings: ['Main Heading', 'Subheading']


##### `<h[1-6]>`

- Matches any opening heading tag from `<h1>` to `<h6>`.

- `[1-6]` → means "any digit between 1 and 6".

##### `(.*?)`

- Captures the text inside the heading.

- `.` → any character

- `*` → zero or more

- `?` → non-greedy (smallest match).

##### `</h[1-6]>`

- Matches a closing heading tag (`</h1>` … `</h6>`).
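
The headings pattern does not check that the closing tag has the same level as the opening one, so mismatched markup such as `<h2>...</h3>` would also match. One way to require the same level is a backreference: capture the digit in a group and reuse it with `\1`. The HTML snippet below is invented for this sketch.

```python
import re

html = "<h1>Main</h1>\n<h2>Broken</h3>\n<h2>Sub</h2>"

loose = re.findall(r'<h[1-6]>(.*?)</h[1-6]>', html)
strict = re.findall(r'<h([1-6])>(.*?)</h\1>', html)   # \1 = same digit as group 1

print(loose)   # ['Main', 'Broken', 'Sub']
print(strict)  # [('1', 'Main'), ('2', 'Sub')]

# With two groups, findall returns tuples; keep only the heading text.
strict_text = [heading for _, heading in strict]
print(strict_text)  # ['Main', 'Sub']
```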

#### Hyperlinks

In [333]:
links = re.findall(r'href="(.*?)"', html_content)
print("Hyperlinks:", links)

Hyperlinks: ['https://example.com', 'https://college.edu']


##### `href="`

- Matches the literal text `href="` that starts a link inside an `<a>` tag.

##### `(.*?)`

- Captures the actual URL inside the quotes.

- `.` → any character

- `*` → zero or more

- `?` → non-greedy (stops at the first `"`)

##### `"`

- Matches the closing quote of the hyperlink.
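
An alternative to the non-greedy `.*?` is the negated character class from Part 4: `[^"]*` means "any characters except a double quote". The sketch below compares both and also shows one way to capture each URL together with its link text; the two-group pattern is an illustration, not the only possible form.

```python
import re

html = ('<a href="https://example.com">Visit Example</a>\n'
        '<a href="https://college.edu">College Website</a>')

lazy = re.findall(r'href="(.*?)"', html)
negated = re.findall(r'href="([^"]*)"', html)

# Two capture groups: findall returns (url, link_text) tuples.
pairs = re.findall(r'<a href="([^"]*)">(.*?)</a>', html)

print(lazy == negated)  # True
print(pairs)
# [('https://example.com', 'Visit Example'), ('https://college.edu', 'College Website')]
```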

#### Emails

In [343]:
emails = re.findall(r'[\w._%+-]+@[\w.-]+\.\w+', html_content)
print("Emails:", emails)

Emails: ['support@example.com']


[\w._%+-]+

- \w → any letter, digit, or underscore ([A-Za-z0-9_])

- ._%+- → allows ., _, %, +, - inside email names

- `+` → one or more of these characters

- Example matches: support, john.doe, user_123

@

- Literal @ symbol separating username and domain.

[\w.-]+

- Domain name part (like example or college).

- Allows letters, numbers, dots, and hyphens.

\.\w+

- \. → literal dot before domain extension.

- \w+ → one or more word characters (like com, edu, org)

### 25) Practice exercises
1. Extract all paragraph (`<p> ... </p>`) text from the HTML.
2. Find all `h2` headings only.
3. Extract only `.edu` links from the HTML.
4. Replace all emails with `[HIDDEN_EMAIL]`.

# Thank You