1. Extract all Twitter handles from the following text. A Twitter handle is the text that appears after https://twitter.com/ and is a single word. It contains only alphanumeric characters and underscores, i.e. A-Z, a-z, 0-9, and _.

In [1]:
import re
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'

re.findall(pattern, text)

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']

Explanation:

In the pattern:

```python
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'
```

the backslash (`\`) before the dot (`.`) in `twitter\.com` is **used to escape the dot**.

---

### Why?

In **regular expressions**, the dot `.` is a **special character** that matches **any single character** (except a newline).
So:

* `twitter.com` (without backslash) would match:

  * `twitterXcom`, `twitter-com`, `twitter1com`, etc.
* `twitter\.com` (with backslash) means:

  * match a **literal dot** (`.`)

---

### Summary:

* ✅ `twitter\.com` = matches the exact string `twitter.com`
* ❌ `twitter.com` (unescaped) = matches things like `twitterXcom`, `twitteracom`, etc.

So the `\.` ensures precision. Without it, your regex might match incorrect or unexpected URLs.
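The difference can be checked directly. The snippet below is a minimal sketch; the `samples` list and the `twitterXcom` URL are made up for illustration and are not part of the exercise text.

```python
import re

# Hypothetical sample URLs: one real-looking, one with "X" where the dot should be.
samples = ["https://twitter.com/elonmusk", "https://twitterXcom/elonmusk"]

unescaped = r"https://twitter.com/([a-zA-Z0-9_]+)"   # "." matches any single character
escaped = r"https://twitter\.com/([a-zA-Z0-9_]+)"    # "\." matches only a literal dot

unescaped_hits = [s for s in samples if re.search(unescaped, s)]
escaped_hits = [s for s in samples if re.search(escaped, s)]

print(unescaped_hits)  # both URLs match
print(escaped_hits)    # only the URL with a real dot matches
```

Writing the pattern as a raw string (`r'...'`) keeps Python from treating `\.` as a string escape before the regex engine sees it.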

The output is `'elonmusk'` rather than the full URL (`https://twitter.com/elonmusk`) because of how the regular expression is written:

```python
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'
```

### 🔍 Let's break it down:

* `https://twitter\.com/` → matches the **literal beginning** of the Twitter URL.
* `([a-zA-Z0-9_]+)` → this is a **capturing group**, meaning:

  > extract **only this part** — the username after `twitter.com/`.

---

### 📌 `re.findall(pattern, text)` behavior:

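* `re.findall()` returns a **list of the captured group's text** when the pattern contains exactly one capturing group `()`. With several groups it returns a list of tuples, and with no group it returns the full matches.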
* In your case, the group is `([a-zA-Z0-9_]+)`, which matches **just the usernames**, like `'elonmusk'`.
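To make the `re.findall()` return shapes concrete, here is a small sketch. The shortened `text` and the extra group around `twitter` are assumptions added only to show the tuple case.

```python
import re

text = "https://twitter.com/elonmusk https://twitter.com/teslarati"

# No capturing group: findall returns the full matches.
no_group = re.findall(r"https://twitter\.com/[a-zA-Z0-9_]+", text)

# One capturing group: findall returns only the group's text.
one_group = re.findall(r"https://twitter\.com/([a-zA-Z0-9_]+)", text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"https://(twitter)\.com/([a-zA-Z0-9_]+)", text)

print(no_group)
print(one_group)
print(two_groups)
```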

---

### ✅ If you want the **full URLs**, do **not** use capturing parentheses:

```python
pattern = r'https://twitter\.com/[a-zA-Z0-9_]+'
```

Then:

```python
re.findall(pattern, text)
```

Output:

```python
['https://twitter.com/elonmusk', 'https://twitter.com/teslarati', 'https://twitter.com/dummy_tesla', 'https://twitter.com/dummy_2_tesla']
```
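If both the full URL and the username are needed, one option is `re.finditer()`, which yields match objects instead of strings. This is a sketch rather than part of the original exercise; the shortened `text` is an assumption.

```python
import re

text = "Follow https://twitter.com/elonmusk and https://twitter.com/dummy_2_tesla"
pattern = re.compile(r"https://twitter\.com/([a-zA-Z0-9_]+)")

# group(0) is the whole match (the URL); group(1) is the first capturing group (the handle).
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

for url, handle in pairs:
    print(handle, "->", url)
```

This keeps the capturing group in the pattern, so the same regex serves both purposes.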

---


2. Extract concentration risk types. Each one is the text that appears after "Concentration of Risk:". In the example below, your regex should extract these two strings:

(1) Credit Risk

(2) Supply Risk

In [2]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
pattern = r'Concentration of Risk: ([^\n]*)'

re.findall(pattern, text)

['Credit Risk', 'Supply Risk']

Explanation:


```python
pattern = r'Concentration of Risk: ([^\n]*)'
```

### 🔍 Explanation

#### 1. **Literal part: `'Concentration of Risk: '`**

* This matches **exactly** the phrase:

  ```
  Concentration of Risk:
  ```
* It is **case-sensitive** unless you specify otherwise in the regex options (e.g., `re.IGNORECASE`).
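As a quick illustration of the case-sensitivity point, the sketch below runs the same pattern with and without `re.IGNORECASE`. The upper-case heading in `text` is invented for the example.

```python
import re

# Hypothetical input: the first heading is in upper case.
text = "CONCENTRATION OF RISK: Credit Risk\nConcentration of Risk: Supply Risk\n"
pattern = r"Concentration of Risk: ([^\n]*)"

case_sensitive = re.findall(pattern, text)
case_insensitive = re.findall(pattern, text, flags=re.IGNORECASE)

print(case_sensitive)    # only the heading with matching case
print(case_insensitive)  # both headings
```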

#### 2. **Capturing Group: `([^\n]*)`**

This part is very important:

* `(` and `)` → This is a **capturing group**, meaning this part of the match will be **extracted**.
* `[^...]` → This is a **negated character class**, meaning “match anything **except** the characters listed”.
* `\n` → This is a **newline character**.
* `*` → Match **zero or more** of the previous character class.

So together:

### 📌 `([^\n]*)` means:

> "Capture everything **after** 'Concentration of Risk: ' until the **end of the line** (i.e., before the next `\n`)."

---
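Because `.` does not match a newline by default, `(.*)` behaves the same as `([^\n]*)` here. The sketch below compares the two and shows how `re.DOTALL` changes the result; the shortened `text` is an assumption.

```python
import re

text = (
    "Concentration of Risk: Credit Risk\n"
    "Financial instruments ...\n"
    "Concentration of Risk: Supply Risk\n"
)

negated_class = re.findall(r"Concentration of Risk: ([^\n]*)", text)
dot_star = re.findall(r"Concentration of Risk: (.*)", text)

# With re.DOTALL, "." also matches newlines, so the greedy ".*" runs to the end of the text.
dot_all = re.findall(r"Concentration of Risk: (.*)", text, flags=re.DOTALL)

print(negated_class)
print(dot_star)
print(len(dot_all))
```

The explicit `[^\n]*` form has the advantage that it keeps stopping at the line end even if someone later adds the `re.DOTALL` flag.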

### ✅ Example:

```python
import re

text = """
Company Overview
Concentration of Risk: High due to single supplier.
Financials
"""

pattern = r'Concentration of Risk: ([^\n]*)'

match = re.findall(pattern, text)
print(match)
```

### Output:

```python
['High due to single supplier.']
```


3. Companies in Europe report their financial numbers on a semi-annual basis, so you may come across a document like this. To extract both quarterly and semi-annual periods, you can use a regex as shown below.

Hint: use a non-capturing group (?:...) for the Q/S alternation so that re.findall returns the whole period, e.g. 2021 Q1, instead of a tuple of groups.

In [3]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
matches = re.findall(pattern, text)
matches

['2021 Q1', '2021 S1']

Explanation:

The code:

```python
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
matches = re.findall(pattern, text)
matches
```

---

### 🔍 Step-by-Step Explanation

#### 📄 `text = ''' ... '''`

You have a multiline string containing financial data for **Tesla** and **BMW**, with fiscal year and quarter/semester indicators like `FY2021 Q1` and `FY2021 S1`.

---

#### 🧠 Regular Expression:

```python
pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
```

Let’s break this down piece by piece:

##### 🧩 `'FY'`

* Matches the **literal string** `'FY'`, which stands for **Fiscal Year**.

##### 🧩 `(\d{4} (?:Q[1-4]|S[1-2]))`

This is the **capturing group** that we want to extract.

###### 🔢 `\d{4}`

* `\d` means **any digit** (`0–9`).
* `{4}` means **exactly 4 digits**.
* So this matches the **year**, e.g., `2021`.

###### 🧩 `' '` (space)

* A space between the year and quarter/semester.

###### 🧩 `(?:Q[1-4]|S[1-2])`

* This is a **non-capturing group** (`(?:...)`) used for grouping without capturing.
* It matches either:

  * `Q[1-4]` → A quarter (Q1, Q2, Q3, Q4)
  * `S[1-2]` → A semester (S1 or S2)

So `(?:Q[1-4]|S[1-2])` matches any one of:

* `Q1`, `Q2`, `Q3`, `Q4`
* `S1`, `S2`
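To see why the inner group is non-capturing, compare it with a plain capturing group. This is a minimal sketch on a shortened, made-up `text`.

```python
import re

text = "FY2021 Q1 and FY2021 S1"

# Inner group is non-capturing: only the outer group is returned.
non_capturing = re.findall(r"FY(\d{4} (?:Q[1-4]|S[1-2]))", text)

# Inner group is capturing: findall now returns a tuple (outer, inner) per match.
capturing = re.findall(r"FY(\d{4} (Q[1-4]|S[1-2]))", text)

print(non_capturing)
print(capturing)
```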

---

### ✅ What does the pattern match?

The entire regex:

```regex
FY(\d{4} (?:Q[1-4]|S[1-2]))
```

Matches strings like:

* `FY2021 Q1` → captures `2021 Q1`
* `FY2021 S1` → captures `2021 S1`

The `FY` part is not captured (it's outside the parentheses), but the rest (`2021 Q1`, `2021 S1`) **is captured**.

---

### 📦 Output of `re.findall(...)`

Given your `text`, the matches will be:

```python
['2021 Q1', '2021 S1']
```
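If the year and the period are needed as separate values, named groups are one possible variation. The sketch below is not part of the original exercise; the group names `year` and `period` are chosen here for illustration.

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

# (?P<name>...) is a capturing group that can be read back by name.
pattern = re.compile(r"FY(?P<year>\d{4}) (?P<period>Q[1-4]|S[1-2])")

periods = [(int(m.group("year")), m.group("period")) for m in pattern.finditer(text)]
print(periods)
```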

