In [1]:
import re

In [2]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. 
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

pattern = r"FY(\d{4} Q[1-4])[^\$]+\$([\d\.]+)"

matches = re.findall(pattern, text)
matches

[('2021 Q1', '4.85'), ('2020 Q4', '3')]

In [7]:
matches = re.search(pattern, text)
matches

<re.Match object; span=(51, 70), match='FY2021 Q1 was $4.85'>

In [8]:
matches.groups()

('2021 Q1', '4.85')
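The tuples returned above are positional, so it is easy to mix up which group is which. A minimal sketch of the same extraction with named groups and `re.finditer()` follows; the group names `period` and `amount` and the `costs` dictionary are illustrative choices, not part of the original exercise.

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

# Same idea as the pattern above, but each group gets a name (names are assumptions).
pattern = r"FY(?P<period>\d{4} Q[1-4])[^$]+\$(?P<amount>[\d.]+)"

# finditer yields one Match object per hit, so groups can be read by name.
costs = {m.group("period"): float(m.group("amount"))
         for m in re.finditer(pattern, text)}

print(costs)  # {'2021 Q1': 4.85, '2020 Q4': 3.0}
```

Named groups cost nothing at match time and make the later code easier to read when a pattern has more than two groups.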

1. Extract all Twitter handles from the following text. A Twitter handle is the single word that appears after https://twitter.com/ and contains only alphanumeric characters (A-Z, a-z, 0-9) and the underscore _.

In [14]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''
pattern = "https://twitter.com/[a-z]+_[a-z]+|https://twitter.com/[a-z]+_[0-9]_[a-z]+|https://twitter.com/[a-z]+"

re.findall(pattern, text)

['https://twitter.com/elonmusk',
 'https://twitter.com/teslarati',
 'https://twitter.com/dummy_tesla',
 'https://twitter.com/dummy_2_tesla']

In [10]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'

re.findall(pattern, text)

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']

2. Extract Concentration of Risk types. Each one is the text that appears after "Concentration of Risk:". In the example below, your regex should extract these two strings:

(1) Credit Risk

(2) Supply Risk

In [16]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
pattern = "Concentration of Risk: ([^\n]+)"

re.findall(pattern, text)

['Credit Risk', 'Supply Risk']
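The pattern above matches the phrase anywhere in a line. If the phrase could also appear mid-sentence, anchoring to the start of a line is one way to keep only the headings. The sketch below uses a made-up three-line text (an assumption for illustration) together with the `re.MULTILINE` flag.

```python
import re

# Hypothetical text: the middle line mentions the phrase inside a sentence.
text = """Concentration of Risk: Credit Risk
We discuss Concentration of Risk: in several notes.
Concentration of Risk: Supply Risk
"""

# With re.MULTILINE, ^ and $ match at the start and end of every line.
pattern = r"^Concentration of Risk: (.+)$"
risk_types = re.findall(pattern, text, flags=re.MULTILINE)

print(risk_types)  # ['Credit Risk', 'Supply Risk']
```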

3. Companies in Europe report their financial numbers on a semi-annual basis, so you can have a document like the one below. To extract both quarterly and semi-annual periods, you can use a regex like this:

In [23]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'  # (?:...) groups the alternatives without capturing them
matches = re.findall(pattern, text)
matches

['2021 Q1', '2021 S1']
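The `(?:...)` in the pattern is a non-capturing group. The short comparison below, on a shortened sample string chosen for illustration, shows how `re.findall()` behaves when the inner group captures and when it does not.

```python
import re

text = "FY2021 Q1 and FY2021 S1"

# Inner group captures: findall returns a tuple per match, one item per group.
with_capture = re.findall(r"FY(\d{4} (Q[1-4]|S[1-2]))", text)

# Inner group is non-capturing: only the outer group is returned.
without_capture = re.findall(r"FY(\d{4} (?:Q[1-4]|S[1-2]))", text)

print(with_capture)     # [('2021 Q1', 'Q1'), ('2021 S1', 'S1')]
print(without_capture)  # ['2021 Q1', '2021 S1']
```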

## **Regular Expression in Python**


---

### **What is Regex?**

**Regex (Regular Expression)** is a special sequence of characters that helps you **search, match, or manipulate text** easily.

Think of it like a **smart filter** that can find patterns in text — such as email addresses, phone numbers, specific words, or even dates.

---

### **Why is Regex Needed?**

Imagine these tasks:

* You want to find **all email addresses** in a file.
* You want to check if a **password is strong** (has numbers, capital letters, symbols, etc.).
* You want to **replace** all dates written like `01-01-2023` with `2023/01/01`.

Doing these with plain string methods (`split()`, `find()`, etc.) takes a lot of manual looping and special cases. **With regex, each task becomes a single pattern.**
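As a rough sketch of the date-replacement task from the list, the snippet below rewrites `01-01-2023` as `2023/01/01` with `re.sub()` and backreferences. It assumes the input is in DD-MM-YYYY order; the sample sentence is made up for illustration.

```python
import re

text = "Invoice dated 01-01-2023, due 15-02-2023."

# Assumption: dates are written DD-MM-YYYY. Groups 1, 2, 3 = day, month, year.
pattern = r"\b(\d{2})-(\d{2})-(\d{4})\b"

# \3/\2/\1 reorders the captured groups into YYYY/MM/DD.
converted = re.sub(pattern, r"\3/\2/\1", text)

print(converted)  # Invoice dated 2023/01/01, due 2023/02/15.
```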

---

### **How to Use Regex in Python**

Python has a built-in library called `re` for Regex.

```python
import re
```

---

### **Common Regex Functions in Python**

| Function       | Description                              |
| -------------- | ---------------------------------------- |
| `re.search()`  | Finds the first match                    |
| `re.findall()` | Finds all matches in a string            |
| `re.sub()`     | Replaces matched text                    |
| `re.match()`   | Checks if the string starts with a match |
| `re.split()`   | Splits string by the matched pattern     |

---

### **Examples**

### Example 1: Check if a string contains a number

```python
import re

text = "I have 2 apples"
if re.search(r"\d", text):
    print("Contains a number")
```

### Example 2: Find all email addresses

```python
text = "Contact us at hello@example.com or info@site.org"
emails = re.findall(r"\S+@\S+\.\S+", text)
print(emails)
```

### Example 3: Replace all digits with `*`

```python
text = "Phone: 123-456-7890"
new_text = re.sub(r"\d", "*", text)
print(new_text)
```

### Example 4: Check if the string starts with "Hello"

```python
text = "Hello World!"
if re.match(r"Hello", text):
    print("Starts with Hello")
```
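The function table also lists `re.split()`, and the difference between `re.match()` and `re.search()` is easy to miss. A small sketch on a made-up sample string:

```python
import re

text = "Order 66 shipped; order 67 pending"

# re.match only looks at the start of the string, so the digits are not found.
at_start = re.match(r"\d+", text)

# re.search scans the whole string and returns the first match.
anywhere = re.search(r"\d+", text)

# re.split cuts the string wherever the pattern matches.
parts = re.split(r";\s*", text)

print(at_start)          # None
print(anywhere.group())  # 66
print(parts)             # ['Order 66 shipped', 'order 67 pending']
```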

---

### **Real-Life Use Cases**

| Task              | Regex Use                                           |
| ----------------- | --------------------------------------------------- |
| Form validation   | Check if email/phone is valid                       |
| Data cleaning     | Remove unwanted symbols or whitespace               |
| Web scraping      | Extract emails, URLs, or prices                     |
| Log file analysis | Find error messages, timestamps                     |
| Password rules    | Ensure strong passwords with numbers, symbols, etc. |
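To make the password-rules row concrete, here is one possible check built from lookaheads. The rule itself (at least 8 characters, one digit, one uppercase letter, one symbol) and the names `STRONG` and `is_strong` are assumptions for illustration; real policies vary.

```python
import re

# Hypothetical rule: 8+ characters with a digit, an uppercase letter and a symbol.
# Each (?=...) lookahead checks one requirement without consuming characters.
STRONG = re.compile(r"(?=.*\d)(?=.*[A-Z])(?=.*[^A-Za-z0-9]).{8,}")

def is_strong(password):
    """Return True if the whole password satisfies the rule above."""
    return STRONG.fullmatch(password) is not None

print(is_strong("Tesla2021!"))  # True
print(is_strong("tesla2021"))   # False
```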

---

### **Tips for Beginners**

* Always start small and test your pattern.
* Use **raw strings** in Python like `r"\d+"`, so backslashes reach the regex engine unchanged (no `\\` confusion).
* Use online tools like [regex101.com](https://regex101.com/) to test your pattern.
* Break your pattern into parts if it gets too complex.
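One way to break a pattern into parts is the `re.VERBOSE` flag, which ignores whitespace inside the pattern and allows `#` comments. The sketch below applies it to a fiscal-period pattern chosen for illustration.

```python
import re

# In VERBOSE mode, whitespace in the pattern is ignored, so a literal
# space has to be written as \s (or escaped).
pattern = re.compile(r"""
    FY                # literal prefix
    (\d{4})           # group 1: four-digit year
    \s                # one whitespace character
    (Q[1-4]|S[1-2])   # group 2: quarter or half-year
""", re.VERBOSE)

periods = pattern.findall("FY2021 Q1 and FY2021 S1")
print(periods)  # [('2021', 'Q1'), ('2021', 'S1')]
```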

---

### **Summary**

* Regex is like a powerful search tool for patterns in text.
* Python has a built-in `re` module for Regex.
* It's useful for validation, cleaning, searching, and replacing text.

---

### **Regex Coding Exercises**

In [24]:
## re library 

import re

**1. Extract Phone Numbers**

In [25]:
import re

text = "Please contact 01794303336 or +8801892432631 for any info."

pattern = r"\d{11}|\+88\d{11}"

re.findall(pattern,text)



['01794303336', '+8801892432631']

**2. Extract Note Titles**

In [26]:
text = """
Note 1 - Overview
Grameenphone Ltd. (“Grameenphone”, the “Company”, “we”, “us” or “our”) was incorporated in Bangladesh under the 
Companies Act (1994). We are the leading telecommunications service provider in Bangladesh, offering voice, data, 
and digital services nationwide. The company operates under a single reportable segment: telecommunications services.

Note 2 - COVID-19 Impact
During 2021 and 2022, the COVID-19 pandemic affected our retail operations, network expansions, 
and field support services. Movement restrictions and lockdowns impacted customer acquisition and recharge volumes. 
However, digital engagement and mobile internet usage increased significantly during this period.

Note 3 - Basis of Preparation
These unaudited interim financial statements have been prepared in accordance with International Accounting Standard (IAS) 
34 “Interim Financial Reporting”. These statements should be read in conjunction with the annual financial statements for 
the year ended December 31, 2021.

"""

pattern = r"Note \d - [^\n]+"

re.findall(pattern,text)


['Note 1 - Overview',
 'Note 2 - COVID-19 Impact',
 'Note 3 - Basis of Preparation']

In [27]:
text = """
Note 1 - Overview
Grameenphone Ltd. (“Grameenphone”, the “Company”, “we”, “us” or “our”) was incorporated in Bangladesh under the 
Companies Act (1994). We are the leading telecommunications service provider in Bangladesh, offering voice, data, 
and digital services nationwide. The company operates under a single reportable segment: telecommunications services.

Note 2 - COVID-19 Impact
During 2021 and 2022, the COVID-19 pandemic affected our retail operations, network expansions, 
and field support services. Movement restrictions and lockdowns impacted customer acquisition and recharge volumes. 
However, digital engagement and mobile internet usage increased significantly during this period.

Note 3 - Basis of Preparation
These unaudited interim financial statements have been prepared in accordance with International Accounting Standard (IAS) 
34 “Interim Financial Reporting”. These statements should be read in conjunction with the annual financial statements for 
the year ended December 31, 2021.

"""

pattern = r"Note \d - ([^\n]+)"

re.findall(pattern,text)


['Overview', 'COVID-19 Impact', 'Basis of Preparation']

**3. Extract Financial Periods (e.g., "FY2022 Q3")**


In [28]:
text = """
The total revenue from mobile financial services in FY2022 Q3 was BDT 2,450 crore.
In the previous quarter i.e. FY2022 Q2, it was BDT 2,120 crore.
"""
pattern = r"FY\d{4} Q[1-4]"

re.findall(pattern,text)



['FY2022 Q3', 'FY2022 Q2']

In [29]:
text = """
The total revenue from mobile financial services in FY2022 Q3 was BDT 2,450 crore.
In the previous quarter i.e. FY2022 Q2, it was BDT 2,120 crore.
"""
pattern = r"FY(\d{4} Q[1-4])"

re.findall(pattern,text)

['2022 Q3', '2022 Q2']

**4. Extract Only Financial Numbers**

In [30]:
text = """
The total revenue from mobile financial services in FY2022 Q3 was BDT 2,450 crore.
In the previous quarter i.e. FY2022 Q2, it was BDT 2,120 crore.
"""
pattern = "BDT [0-9,]+"

re.findall(pattern,text)

['BDT 2,450', 'BDT 2,120']

In [31]:
text = """
The total revenue from mobile financial services in FY2022 Q3 was BDT 2,450 crore.
In the previous quarter i.e. FY2022 Q2, it was BDT 2,120 crore.
"""
pattern = "BDT ([0-9,]+)"

re.findall(pattern,text)

['2,450', '2,120']

**5. Extract Periods and Financial Numbers Together**

In [33]:
text = """
The total revenue from mobile financial services in FY2022 Q3 was BDT 2,450 crore.
In the previous quarter i.e. FY2022 Q2, it was BDT 2,120 crore.
"""
pattern = r"FY(\d{4} Q[1-4]).*?BDT ([0-9,]+ [a-zA-Z]+)"  # .*? lazily skips ahead to the next "BDT" on the same line

re.findall(pattern,text)

[('2022 Q3', '2,450 crore'), ('2022 Q2', '2,120 crore')]
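The amounts come back as strings with thousands separators. If numbers are needed for further analysis, one possible follow-up step is to strip the commas and convert to `int`, as in this sketch. The `revenue` dictionary is an illustrative name, and the values stay in crore as written in the text.

```python
import re

text = """
The total revenue from mobile financial services in FY2022 Q3 was BDT 2,450 crore.
In the previous quarter i.e. FY2022 Q2, it was BDT 2,120 crore.
"""

# Capture the period and the amount only (the unit is left out here).
pattern = r"FY(\d{4} Q[1-4]).*?BDT ([0-9,]+)"

# Remove the thousands separators before converting to int.
revenue = {period: int(amount.replace(",", ""))
           for period, amount in re.findall(pattern, text)}

print(revenue)  # {'2022 Q3': 2450, '2022 Q2': 2120}
```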

**6. Extract Any Link (e.g., Facebook, LinkedIn, Website)**

In [35]:
text = """
For more information, visit our  website at https://www.datasolution360.com/ or follow us on Facebook:
https://www.facebook.com/Datasolution360 and You can also check out us at linkedin https://www.linkedin.com/company/data-solution-360/
"""

pattern = r"https?://\S+"
re.findall(pattern,text)


['https://www.datasolution360.com/',
 'https://www.facebook.com/Datasolution360',
 'https://www.linkedin.com/company/data-solution-360/']
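Because `\S+` takes every non-whitespace character, a link followed directly by a comma or full stop would keep that punctuation. A simple sketch of one workaround, on a made-up sentence, is to strip common trailing punctuation after matching:

```python
import re

# Hypothetical sentence where punctuation touches the links.
text = "Visit https://www.datasolution360.com/, or see https://www.facebook.com/Datasolution360."

raw = re.findall(r"https?://\S+", text)

# Strip punctuation that is unlikely to be the last character of a URL.
clean = [url.rstrip(".,;:!?)") for url in raw]

print(raw)    # ['https://www.datasolution360.com/,', 'https://www.facebook.com/Datasolution360.']
print(clean)  # ['https://www.datasolution360.com/', 'https://www.facebook.com/Datasolution360']
```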

**7. Extract Emails**

In [36]:
text = "Contact us at tanji.evan23@gmail.com or datasolution360.business@gmail.com"

emails = re.findall(r"\S+@\S+", text)
print(emails)


['tanji.evan23@gmail.com', 'datasolution360.business@gmail.com']
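The `\S+@\S+` pattern is fine for clean input but would also swallow a comma or full stop that follows an address. The sketch below tries a somewhat stricter pattern on a made-up sentence; it is a practical approximation, not a full validator for every legal email address.

```python
import re

# Hypothetical sentence where punctuation follows each address.
text = "Contact tanji.evan23@gmail.com, or datasolution360.business@gmail.com."

# Local part: word characters, dots, plus and hyphen.
# Domain: labels separated by dots, each label at least one character long.
pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(pattern, text)
print(emails)  # ['tanji.evan23@gmail.com', 'datasolution360.business@gmail.com']
```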
