#  Data Science Learning Journey  
*Curiosity to Capability — One Notebook at a Time*

---
Compiled and authored by **Partho Sarothi Das**   
	Dhaka, Bangladesh  
	Bachelor's & Master's in Statistics  
	Investment Banking Professional → Aspiring Data Scientist 
    
---

# RegEx Module  
- A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
- RegEx can be used to check if a string contains the specified search pattern.
### re functions: findall(), search(), split(), sub()

In [3]:
# Import Regular Expressions module

import re

### search()

- The search() function searches the string for a match, and returns a Match object if there is a match.
- If there is more than one match, only the first occurrence of the match will be returned:

In [5]:
# search() ---> Check whether the string starts with "I" and ends with "hon" (as in "Python").

text = "I love Python"
x = re.search("^I.*hon$", text)
if x:
    print("Yes! Match found.")
else:
    print("Didn't find any match.")

Yes! Match found.


In [6]:
# Search for the first white-space character in the string:
text = "I love Python"
x = re.search(r"\s",text)
x.start()

1

### findall()

The findall() function returns a list containing all matches, in the order they are found:

In [8]:
import re

txt = "I want to study in Denmark and want a world-class degree."
x = re.findall("want", txt)
print(x)

['want', 'want']


In [9]:
# Return an empty list if no match was found

txt = "I want to study in Denmark and want a world-class degree."
x = re.findall("go", txt)
print(x)

[]
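
To contrast the two functions covered so far, the sketch below (an addition, not part of the original notebook cells) runs search() and findall() on the same string. The variable names are illustrative, and finditer() is another re function that is not otherwise covered in these notes; it is used here only to show the position of every match.

```python
import re

txt = "I want to study in Denmark and want a world-class degree."

# search() stops at the first match and returns a single Match object
first = re.search("want", txt)
print(first.start())

# findall() keeps scanning and returns every match as a list of strings
all_matches = re.findall("want", txt)
print(all_matches)

# finditer() yields one Match object per match, so each position is available
positions = [m.start() for m in re.finditer("want", txt)]
print(positions)
```

Use search() when only the first hit matters, and findall() when every occurrence is needed.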


### split()

The split() function returns a list where the string has been split at each match:

In [11]:
import re

text = "I need a scholarship to study in Europe."
x = re.split(r"\s", text)
x

['I', 'need', 'a', 'scholarship', 'to', 'study', 'in', 'Europe.']

In [12]:
# Split the string only at the first occurrence

import re

text = "I really enjoy Data Science."
x = re.split(r"\s", text, maxsplit=1)
x

['I', 'really enjoy Data Science.']

In [13]:
# Split the string only at the first 2 occurrences

import re

text = "I really enjoy Data Science."
x = re.split(r"\s", text, maxsplit=2)
x

['I', 'really', 'enjoy Data Science.']

### The sub() Function  
The sub() function replaces the matches with the text of your choice:

In [15]:
# Replace every whitespace character with '**'

text = "ML is amazing!!!"
re.sub(r"\s", "**", text)

'ML**is**amazing!!!'

In [16]:
# Replace only the first two whitespace characters with '**'

text = "ML can solve many real-life problems."
re.sub(r"\s", "**", text, count=2)

'ML**can**solve many real-life problems.'

# Match Object

- A Match Object is an object containing information about the search and the result.
- If there is no match, the value None will be returned, instead of the Match Object.

### The Match object has properties and methods used to retrieve information about the search, and the result:

- .span() ---> returns a tuple containing the start and end positions of the match.
- .string ---> returns the string passed into the function.
- .group() ---> returns the part of the string where there was a match.
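
Because search() returns None when nothing matches, calling .span() or .group() directly on the result would raise an AttributeError in that case. The following is a minimal sketch of the usual guard; the helper name describe_match is hypothetical and chosen only for illustration, it is not part of the re module.

```python
import re

def describe_match(pattern, text):
    """Return (matched text, span), or None when the pattern is not found."""
    m = re.search(pattern, text)
    if m is None:  # no match: there is no Match object to query
        return None
    return m.group(), m.span()

found = describe_match(r"\bS\w+", "The rain in Spain")
missing = describe_match(r"\bZ\w+", "The rain in Spain")

print(found)
print(missing)
```

Checking for None before touching the Match object keeps the rest of the code free of try/except blocks.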

In [18]:
# Search for the first word that starts with an uppercase "S":

import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


In [19]:
# Print the part of the string where there was a match.

x = re.search(r"\bS\w+", txt)
print(x.group())

Spain


In [20]:
# Print the string passed into the function:

print(x.string)

The rain in Spain


# Metacharacters
Metacharacters are characters with a special meaning:

| **Character** | **Description**                                   | **Example**      | **Matches**                             |
| ------------- | ------------------------------------------------- | ---------------- | --------------------------------------- |
| `[]`          | A **set of characters**                           | `[a-m]`          | Any **one** character from a to m       |
| `\`           | **Escape character** or special sequence          | `\d`             | A digit (same as `[0-9]`)               |
| `.`           | **Any character** (except newline `\n`)           | `he..o`          | Matches e.g. `hello`, `heppo`, `hexxo`  |
| `^`           | Matches **start of string**                       | `^hello`         | Matches if string starts with `hello`   |
| `$`           | Matches **end of string**                         | `planet$`        | Matches if string ends with `planet`    |
| `*`           | **0 or more** occurrences of the preceding element | `he.*o`         | `heo`, `hello`, `heyyyyyo`, etc.        |
| `+`           | **1 or more** occurrences of the preceding element | `he.+o`         | Requires at least one char in between   |
| `?`           | **0 or 1** occurrence of the preceding element    | `he.?o`          | `heo`, `hexo`, but not `hexxo`          |
| `{n}`         | **Exactly n** occurrences                         | `he.{2}o`        | `hexxo`, `hello`, not `heo`             |
| `{n,m}`       | Between **n and m** repetitions                   | `a{1,3}`         | `a`, `aa`, or `aaa`                     |
| `\|`          | **OR** operator                                   | `falls\|stays`   | Matches either `falls` or `stays`       |
| `()`          | **Grouping** and capturing                        | `(ab)+`          | Matches `ab`, `abab`, `ababab`, etc.    |


In [23]:
# [] ---> Find all lowercase letters (a to z), then all uppercase letters (A to Z)

text = "I love Data Science."
print(re.findall("[a-z]",text))
print(re.findall('[A-Z]', text))

['l', 'o', 'v', 'e', 'a', 't', 'a', 'c', 'i', 'e', 'n', 'c', 'e']
['I', 'D', 'S']


In [24]:
# \d  -----> Find all digit characters

text = 'It is a 20 dollar note.'
re.findall(r"\d", text)

['2', '0']

In [25]:
# Search for a sequence that starts with "he", followed by two (any) characters, and an "o"

text = "hello planet, heiio planet, he--o planet"
re.findall("he..o", text)

['hello', 'heiio', 'he--o']

In [26]:
# Check if the string starts with 'hello'

text = "hello world!!!"
re.findall("^hello", text)

['hello']

In [27]:
#Check if the string ends with 'apples':

text = "I love apples"
re.findall("apples$", text)

['apples']

In [28]:
# Zero or more occurrences
# Search for "Den" followed by 0 or more (any) characters and "rk", then for "bea" followed by 0 or more (any) characters and "ful":

text = "Denmark is a very beautiful country."
print(re.findall("Den.*rk", text)) 
print(re.findall("bea.*ful", text))

['Denmark']
['beautiful']


In [29]:
# One or more occurrences

text = 'hello world'
re.findall('wo.+d', text)

['world']

In [30]:
# Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o"
# (reuses text = 'hello world' from the previous cell):

re.findall("he.{2}o", text)

['hello']

In [31]:
#Check if the string contains either "falls" or "stays":

txt = "The rain in Spain falls mainly in the plain!"
x = re.findall("falls|stays", txt)
print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['falls']
Yes, there is at least one match!
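
The cells above cover most of the metacharacter table, but not `?`, `{n,m}`, or `()`. The sketch below fills that gap with made-up sample strings; it is an illustrative addition rather than part of the original notebook.

```python
import re

# ? ---> zero or one occurrence of the preceding element ("u" is optional)
optional = re.findall("colou?r", "color and colour")

# {n,m} ---> between n and m repetitions (greedy: takes as many as allowed)
repeats = re.findall("a{1,3}", "a aa aaaa")

# () ---> groups a sub-pattern so the quantifier applies to the whole group
grouped = re.search("(ab)+", "xx ababab yy")

print(optional)
print(repeats)
print(grouped.group())
```

search() is used for the grouping example because findall() returns the captured group instead of the whole match when the pattern contains a group.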


# Special Sequences  
A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

| **Character** | **Description**                                                                                                                   | **Example**    |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------|----------------|
| `\A`          | Returns a match if the specified characters are at the **beginning of the string**                                                | `r"\AThe"`     |
| `\b`          | Returns a match where the specified characters are at the **beginning or end of a word** (use raw string `r""`)                   | `r"\bain"`     |
|               |                                                                                                                                   | `r"ain\b"`     |
| `\B`          | Returns a match where the specified characters are present, but **NOT at the beginning or end of a word** (use raw string `r""`)  | `r"\Bain"`     |
|               |                                                                                                                                   | `r"ain\B"`     |
| `\d`          | Matches any **digit** (0–9)                                                                                                       | `r"\d"`        |
| `\D`          | Matches any character that is **NOT a digit**                                                                                     | `r"\D"`        |
| `\s`          | Matches any **whitespace character** (space, tab, newline, etc.)                                                                  | `r"\s"`        |
| `\S`          | Matches any character that is **NOT a whitespace character**                                                                      | `r"\S"`        |
| `\w`          | Matches any **word character** (letters a–z and A–Z, digits 0–9, and the underscore `_`)                                          | `r"\w"`        |
| `\W`          | Matches any character that is **NOT a word character**                                                                            | `r"\W"`        |
| `\Z`          | Returns a match if the specified characters are at the **end of the string**                                                      | `r"Spain\Z"`   |

In [34]:
# \A---> #Check if the string starts with "The"
#  beginning of the string

text = "The train is running late."

x = re.findall(r"\AThe", text)
if x:
    print('There is a match.')
else:
    print('There is no match.')

There is a match.


In [35]:
# \b ---> beginning or end of a word

text1 = 'Beginning of the sentence.'
text2 = 'at the end'

print(re.findall(r"\bBeginning", text1))
print(re.findall(r"\bend", text2))

['Beginning']
['end']


In [36]:
# \B ----> NOT at the beginning or end of a word

text1 = 'Beginning of the sentence.'
text2 = 'at the end'

print(re.findall(r"\BBeginning", text1))
print(re.findall(r"\Bthe", text2))

[]
[]
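
Both \B searches above return empty lists, so it may help to see a case where \B does match. The sketch below is an illustrative addition that runs the four example patterns from the table against one string; the variable names are chosen for this example only.

```python
import re

txt = "The rain in Spain"

starts_word = re.findall(r"\bain", txt)      # "ain" at the start of a word
ends_word = re.findall(r"ain\b", txt)        # "ain" at the end of a word
not_start = re.findall(r"\Bain", txt)        # "ain" NOT at the start of a word
not_end = re.findall(r"ain\B", txt)          # "ain" NOT at the end of a word

print(starts_word)
print(ends_word)
print(not_start)
print(not_end)
```

In this string "ain" only ever appears at the end of a word ("rain", "Spain"), so the second and third patterns find it twice and the other two find nothing.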


In [37]:
# "\d"  "\D"   ----> Find every digit (0-9) and every non-digit character:

text = '10 apples for 120 taka.'
digits = re.findall(r"\d", text)
char = re.findall(r"\D", text)
print(digits)
print(char)
print('Digit matched' if digits else 'Not matched.')
print('Char matched' if char else 'Not matched.')

['1', '0', '1', '2', '0']
[' ', 'a', 'p', 'p', 'l', 'e', 's', ' ', 'f', 'o', 'r', ' ', ' ', 't', 'a', 'k', 'a', '.']
Digit matched
Char matched


In [38]:
# Return a match at every NON white-space character:

import re

txt = "The rain in Spain"
x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


In [39]:
#Return a match at every white-space character:

txt = "The rain in Spain"
x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


In [40]:
# Return a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _), then at every non-word character:

txt = "abdc 879 $%#@_bg!)("
x = re.findall(r"\w", txt)
y = re.findall(r"\W",txt)

print(x)
print(y)

['a', 'b', 'd', 'c', '8', '7', '9', '_', 'b', 'g']
[' ', ' ', '$', '%', '#', '@', '!', ')', '(']


In [41]:
#Check if the string ends with "Spain":

txt = "The rain in Spain"
x = re.findall(r"Spain\Z", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['Spain']
Yes, there is a match!


# Sets
A set is a set of characters inside a pair of square brackets [] with a special meaning:

| **Set**        | **Description**                                                                                     |
|----------------|-----------------------------------------------------------------------------------------------------|
| `[arn]`        | Match where **one of the characters** a, r, or n is present                                         |
| `[a-n]`        | Match for any **lowercase character** from a to n                                                   |
| `[^arn]`       | Match for any character **except** a, r, and n                                                      |
| `[0123]`       | Match where **any of the digits** 0, 1, 2, or 3 is present                                           |
| `[0-9]`        | Match for **any digit** from 0 to 9                                                                 |
| `[0-5][0-9]`   | Match for any **two-digit number from 00 to 59**                                                    |
| `[a-zA-Z]`     | Match for **any letter**, lowercase or uppercase                                                    |
| `[+]`          | In sets, special characters like `+`, `*`, `.`, `\|`, `()`, `$`, `{}` have **no special meaning**; `[+]` matches a literal `+` |


In [44]:
# Check if the string has any a, c, or e characters:

text = 'apple'
re.findall("[ace]", text)

['a', 'e']

In [45]:
# Check if the string has any characters between a and g

text = 'apple'
re.findall("[a-g]", text)

['a', 'e']

In [46]:
# Check if the string has other characters than a, c, or e characters.

text = 'apple'
re.findall("[^ace]", text)

['p', 'p', 'l']

In [47]:
# Check if the string has any characters outside the range a to e

text = 'apple'
re.findall("[^a-e]", text)

['p', 'p', 'l']

In [48]:
# Check if the string has any 0, 1, 2, or 3 digits:

text = "I have 50 apples"
re.findall("[0123]", text)

['0']

In [49]:
# Check if the string has any digits from 0 to 5:

text = "I have 1298 apples"
re.findall("[0-5]", text)

['1', '2']

In [50]:
# Check if the string has any two-digit numbers, from 00 to 59:

txt = "8 times before 11:45 AM"
x = re.findall("[0-5][0-9]", txt)

print(x)

['11', '45']
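
One caveat: [0-5][0-9] matches any two adjacent characters that fit the two sets, including the first two digits of a longer number. The sketch below is an illustrative addition with a made-up sample string; it combines the sets with the \b special sequence so that only stand-alone two-digit numbers are returned.

```python
import re

txt = "Batch 1298 left at 11:45"

# Without boundaries, the "12" inside "1298" also counts as a match
loose = re.findall("[0-5][0-9]", txt)

# \b on both sides requires the two digits to form a whole "word"
strict = re.findall(r"\b[0-5][0-9]\b", txt)

print(loose)
print(strict)
```

Whether the stricter pattern is the right choice depends on the data; it is shown here as one possible approach.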


In [51]:
# Check if the string has any characters from a to z lower case, and A to Z upper case:

text = 'I have bought 50 eggs.'
re.findall("[a-zA-Z]", text)

['I', 'h', 'a', 'v', 'e', 'b', 'o', 'u', 'g', 'h', 't', 'e', 'g', 'g', 's']

In [52]:
# Check if the string has any + characters:

text = 'I have +50 apples'
re.findall("[+]",text)

['+']

# ✅ Basic Level (1–10)

In [None]:
# 1. **What is a regular expression? How is it used in Python?**
# 2. **What does the `re` module do in Python?**
# 3. **What function is used to search for a pattern in a string?**
# 4. **Write a regex to find any digit from 0 to 9.**
# 5. **What is the difference between `re.search()` and `re.match()`?**
# 6. **What does the regex pattern `.` (dot) match?**
# 7. **Write a regex to match a string that starts with `"The"` using `^`.**
# 8. **What is the purpose of `[]` in regex? Give an example.**
# 9. **How do you escape a special character like `.` in regex?**
# 10. **What does the `+` operator mean in regular expressions?**

# ✅ Intermediate Level (11–20)

In [None]:
# 11. **What does `\s` match in regular expressions? How about `\S`?**
# 12. **What’s the difference between `\d`, `\D`, `\w`, and `\W`?**
# 13. **Write a regex to find all words in a string using `re.findall()` and `\w+`.**
# 14. **What does the pattern `[a-zA-Z]` match?**
# 15. **How do you use `|` (pipe) in regex? Show with an example.**
# 16. **What is the purpose of parentheses `()` in regex?**
# 17. **Write a pattern that matches email addresses.**
# 18. **What will `re.findall(r'\bcat\b', "The cat sat on the catalog")` return?**
# 19. **How do you match a literal `\` backslash character in regex?**
# 20. **Write a regex pattern to extract all numbers from the string: `"Order 12, 25 apples and 30 bananas"`**

# ✅ Advanced Level (21–30)

In [None]:
# 21. **What is the difference between greedy and non-greedy quantifiers in regex? Show example.**
# 22. **Explain how `re.sub()` works with an example.**
# 23. **Write a pattern to match valid Bangladeshi phone numbers: starts with `01` and has 11 digits.**
# 24. **How do you extract the domain from an email using regex?**
# 25. **Write a regex to match strings that end with `.csv` or `.txt`**
# 26. **What does `re.split(r'\s+', text)` do?**
# 27. **Write a regex to match dates in the format `DD-MM-YYYY`.**
# 28. **How can you compile a regex pattern and reuse it multiple times?**
# 29. **Explain the role of `(?i)` in regex.**
# 30. **Write a regex to validate a strong password: at least one uppercase, one lowercase, one digit, and one special character.**