---

## Advanced Regular Expression Assignments

### Assignment 1: Extracting Phone Numbers

**Raw Text:** 
Extract all valid Pakistani phone numbers from a given text.

**Example:**
```
Text: Please contact me at 0301-1234567 or 042-35678901 for further details.
```



In [23]:
import re

contact = """Please contact me at 0301-1234567 or 042-35678901 for further details."""


pattern = r"\d{3,4}-\d{7,8}"

phone_nos = re.findall(pattern, contact)
print("All Phone Numbers:\t", phone_nos)

# Keep only numbers in the mobile format (4-digit prefix, hyphen, 7 digits).
# re.fullmatch requires the whole string to fit the pattern, not just its start.
pakistani_nos = [nos for nos in phone_nos if re.fullmatch(r"\d{4}-\d{7}", nos)]
print("\nPakistani Phone Numbers:\t", pakistani_nos)

All Phone Numbers:	 ['0301-1234567', '042-35678901']

Pakistani Phone Numbers:	 ['0301-1234567']
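
The two-step filter above can also be folded into a single pattern. The sketch below is one possible variant, not the assignment's required answer. It assumes (this is not stated in the task) that mobile numbers are written as `03XX-XXXXXXX` and landlines as an area code starting with `0` followed by 6 to 8 digits. The lookarounds keep the pattern from matching inside a longer run of digits and hyphens, such as a CNIC.

```python
import re

# Assumed formats (not given in the assignment text):
#   mobile   -> 03XX-XXXXXXX
#   landline -> area code of 3-5 digits starting with 0, hyphen, 6-8 digits
PHONE_PATTERN = re.compile(
    r"(?<![\d-])"            # not preceded by a digit or hyphen
    r"(?:03\d{2}-\d{7}"      # mobile number
    r"|0\d{2,4}-\d{6,8})"    # landline number
    r"(?![\d-])"             # not followed by a digit or hyphen
)

sample = "Call 0301-1234567 or 042-35678901, not 12345-6789012-3."
phone_numbers = PHONE_PATTERN.findall(sample)
print(phone_numbers)  # ['0301-1234567', '042-35678901']
```

The CNIC at the end of the sample is skipped because every `0` inside it is preceded by another digit.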


### Assignment 2: Validating Email Addresses

**Raw Text:** 
Validate email addresses according to Pakistani domain extensions (.pk).

**Example:**
```
Text: Contact us at info@example.com or support@domain.pk for assistance.
```



In [17]:
import re

emails = """Contact us at info@example.com or support@domain.pk for assistance."""

# The domain part allows dots so that multi-level domains such as name@uni.edu.pk match in full
pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'

email_ids = re.findall(pattern, emails)
print("All email addresses:\t", email_ids)

pk_emails = [email for email in email_ids if email.endswith(".pk")]
print("\nValidated email addresses according to Pakistani domain extensions (.pk):\t", pk_emails)

All email addresses:	 ['info@example.com', 'support@domain.pk']

Validated email addresses according to Pakistani domain extensions (.pk):	 ['support@domain.pk']
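
Extraction with `re.findall` and validation of a single string are slightly different jobs. For validation, `re.fullmatch` is a natural fit because it requires the whole string to match. The helper below is a hypothetical sketch (the name `is_pk_email` is not part of the assignment), and it is far looser than the full email address grammar.

```python
import re

# Hypothetical helper for illustration; a simplified check, not a full
# implementation of the email address grammar.
PK_EMAIL = re.compile(
    r"[A-Za-z0-9._%+-]+"       # local part
    r"@(?:[A-Za-z0-9-]+\.)+"   # one or more domain labels, each ending in a dot
    r"pk",                     # top-level domain must be .pk
    re.IGNORECASE,
)

def is_pk_email(address: str) -> bool:
    """Return True if the whole string looks like an address under .pk."""
    return PK_EMAIL.fullmatch(address) is not None

candidates = ["support@domain.pk", "info@example.com", "user@uni.edu.pk", "bad@.pk"]
pk_only = [c for c in candidates if is_pk_email(c)]
print(pk_only)  # ['support@domain.pk', 'user@uni.edu.pk']
```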


### Assignment 3: Extracting CNIC Numbers

**Raw Text:** 
Extract all Pakistani CNIC (Computerized National Identity Card) numbers from a given text.

**Example:**
```
Text: My CNIC is 12345-6789012-3 and another one is 34567-8901234-5.
```


In [31]:
import re

cnic = """My CNIC is 12345-6789012-3 and another one is 34567-8901234-5."""

pattern = r"\d{5}-\d{7}-\d{1}"

pakistani_cnics = re.findall(pattern, cnic)
print("Pakistani CNICs:\t", pakistani_cnics)

Pakistani CNICs:	 ['12345-6789012-3', '34567-8901234-5']
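
One limitation of the plain pattern is that it also matches a CNIC-shaped slice inside a longer string of digits. A possible refinement, sketched here with made-up sample text, adds lookarounds so that a match cannot be preceded or followed by another digit or hyphen.

```python
import re

plain = re.compile(r"\d{5}-\d{7}-\d")
strict = re.compile(r"(?<![\d-])\d{5}-\d{7}-\d(?![\d-])")

# The second number has one digit too many at each end, so it is not a CNIC.
sample = "Valid: 12345-6789012-3. Too long: 912345-6789012-34."

plain_matches = plain.findall(sample)
strict_matches = strict.findall(sample)
print(plain_matches)   # ['12345-6789012-3', '12345-6789012-3']
print(strict_matches)  # ['12345-6789012-3']
```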



### Assignment 4: Identifying Urdu Words

**Raw Text:** 
Identify and extract Urdu words from a mixed English-Urdu text.

**Example:**
```
Text: یہ sentence میں کچھ English words بھی ہیں۔
```



In [9]:
import re

mixed_text = "یہ sentence میں کچھ English words بھی ہیں۔"

pattern = r"[\u0600-\u06FF]+"    # Arabic script Unicode range

urdu_words = re.findall(pattern, mixed_text)
print("Urdu Words:\t", urdu_words)

Urdu Words:	 ['یہ', 'میں', 'کچھ', 'بھی', 'ہیں۔']
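
Note that the last item in the output still carries the Urdu full stop, because the range `\u0600-\u06FF` covers punctuation as well as letters. One way to leave it out, sketched below, is a negative lookahead that rejects a few punctuation code points before each character is consumed. The list of excluded marks is an assumption and is not exhaustive.

```python
import re

mixed_text = "یہ sentence میں کچھ English words بھی ہیں۔"

# Excluded marks (assumed list): Arabic comma U+060C, semicolon U+061B,
# question mark U+061F, and the Urdu full stop U+06D4.
pattern = r"(?:(?![\u060C\u061B\u061F\u06D4])[\u0600-\u06FF])+"

urdu_words = re.findall(pattern, mixed_text)
print("Urdu Words:\t", urdu_words)
```

With this pattern the final word is returned without the trailing full stop.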


### Assignment 5: Finding Dates

**Raw Text:** 
Find and extract dates in the format DD-MM-YYYY from a given text.

**Example:**
```
Text: The event will take place on 15-08-2023 and 23-09-2023.
```



In [11]:
import re

message = """The event will take place on 15-08-2023 and 23-09-2023."""

pattern = r"\d{2}-\d{2}-\d{4}"

dates = re.findall(pattern, message)
print("Event Dates:\t", dates)

Event Dates:	 ['15-08-2023', '23-09-2023']
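
The pattern checks only the shape of a date, so a string like `99-99-2023` would also be returned. A possible follow-up step, shown as a sketch, is to pass each candidate to `datetime.strptime` and keep only those that parse as real calendar dates. The function name `extract_valid_dates` and the sample text are made up for illustration.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")

def extract_valid_dates(text):
    """Return date objects for every DD-MM-YYYY string that is a real date."""
    valid = []
    for candidate in DATE_PATTERN.findall(text):
        try:
            valid.append(datetime.strptime(candidate, "%d-%m-%Y").date())
        except ValueError:
            # Right shape, but not a calendar date (e.g. 31 February).
            pass
    return valid

sample = "Valid: 15-08-2023 and 23-09-2023. Invalid: 31-02-2023 and 99-99-2023."
dates = extract_valid_dates(sample)
print([d.isoformat() for d in dates])  # ['2023-08-15', '2023-09-23']
```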


### Assignment 6: Extracting URLs

**Raw Text:** 
Extract all URLs from a text that belong to Pakistani domains.

**Example:**
```
Text: Visit http://www.example.pk or https://website.com.pk for more information.
```



In [19]:
import re

URLs = """Visit http://www.example.pk or https://website.com.pk for more information."""

pattern = r"https?://[A-Za-z0-9.-]+\.pk"    # A-Za-z, because the range A-z also includes [ \ ] ^ _ and `

pakistani_domain_URLs = re.findall(pattern, URLs)
print("URLs belonging to Pakistani domains:\t", pakistani_domain_URLs)

URLs belonging to Pakistani domains:	 ['http://www.example.pk', 'https://website.com.pk']
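
A pattern that stops at `.pk` drops any path after the domain and can also match the front of a host such as `example.pk.evil.com`. A sketch of one alternative is to extract URLs loosely with a regex and then check the host name with `urllib.parse.urlparse`. The function name `pakistani_urls` and the sample text are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Loose extraction: a scheme followed by anything up to whitespace or a comma.
URL_PATTERN = re.compile(r"https?://[^\s,]+")

def pakistani_urls(text):
    """Return URLs whose host name ends in .pk."""
    results = []
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".")                 # drop a sentence-ending period
        host = urlparse(url).hostname or ""
        if host.endswith(".pk"):
            results.append(url)
    return results

sample = ("Visit http://www.example.pk/contact or https://website.com.pk, "
          "but not https://example.pk.evil.com.")
pk_urls = pakistani_urls(sample)
print(pk_urls)  # ['http://www.example.pk/contact', 'https://website.com.pk']
```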


### Assignment 7: Analyzing Currency

**Raw Text:** 
Extract and analyze currency amounts in Pakistani Rupees (PKR) from a given text.

**Example:**
```
Text: The product costs PKR 1500, while the deluxe version is priced at Rs. 2500.
```



In [19]:
import re

statement = """The product costs PKR 1500, while the deluxe version is priced at Rs. 2500."""

# pattern = r"(?i)(?:PKR|Rs\.?)\s?([\d,]+\.?\d*)"
pattern = r"(?i)(?:PKR|Rs\.?)\s?([0-9]+)"

pakistani_currency = re.findall(pattern, statement)
print(pakistani_currency)

['1500', '2500']


## My Notes:

### `(?i)(?:PKR|Rs\.?)` - non-capturing part

These notes describe the fuller, commented-out pattern above. A non-capturing group `(?:...)` must match, but `re.findall` does not return its text; only the capturing group appears in the result.

* `(?i)` = makes the whole pattern case-insensitive, so PKR/Rs can also be pkr/rs.
* `(?:PKR|Rs\.?)` = matches either "PKR" or "Rs" (with an optional period).

### `\s?`

* `\s?` = an optional single whitespace character after PKR/Rs.

### `([\d,]+\.?\d*)` - capturing part, returned in the result

* `[\d,]+` = matches one or more digits (`\d`) or commas (`,`).
* `\.?` = matches an optional decimal point.
* `\d*` = matches zero or more digits following the decimal point.
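
To cover the "analyze" half of the task, the captured strings can be converted to numbers. The sketch below uses a slightly stricter capture group than the one in the notes, so that a comma is accepted only between digits and a sentence comma after the amount is not swallowed. The sample sentence and the use of `Decimal` are choices made for this illustration.

```python
import re
from decimal import Decimal

AMOUNT_PATTERN = re.compile(
    r"(?i)\b(?:PKR|Rs\.?)\s?"      # currency marker, case-insensitive
    r"(\d+(?:,\d+)*(?:\.\d+)?)"    # digits, optional comma groups, optional decimals
)

statement = "Basic: PKR 1,500.50, deluxe: Rs. 2500, bulk: rs 12,000."

raw_amounts = AMOUNT_PATTERN.findall(statement)
amounts = [Decimal(raw.replace(",", "")) for raw in raw_amounts]
total = sum(amounts)

print(raw_amounts)  # ['1,500.50', '2500', '12,000']
print(total)        # 16000.50
```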
                                                

### Assignment 8: Removing Punctuation

**Raw Text:** 
Remove all punctuation marks from a text while preserving Urdu characters.

**Example:**
```
Text: کیا! آپ, یہاں؟
```



In [28]:
import re

text = "کیا! آپ, یہاں؟"

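# Remove punctuation with re.sub (the substitution function in re).
# The class lists ASCII marks plus the Urdu question mark (U+061F),
# comma (U+060C), and full stop (U+06D4), which the ASCII marks do not cover.
cleaned_text = re.sub(r'[!,?\u061F\u060C\u06D4]+', '', text)

print("Original Text:", text)
print("\nCleaned Text:", cleaned_text)


Original Text: کیا! آپ, یہاں؟

Cleaned Text: کیا آپ یہاں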
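
An explicit character class removes only the marks it lists. As an alternative sketch that does not use a regex at all, the standard `unicodedata` module can drop every character whose Unicode general category starts with `P` (punctuation), which covers both ASCII and Urdu marks. The function name `remove_punctuation` is hypothetical.

```python
import unicodedata

def remove_punctuation(text):
    """Drop every character whose Unicode general category is punctuation (P*)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

text = "کیا! آپ, یہاں؟"
cleaned = remove_punctuation(text)
print("Cleaned Text:", cleaned)
```

Letters, spaces, and combining marks are left untouched, so the Urdu words themselves are preserved.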


### Assignment 9: Extracting City Names

**Raw Text:** 
Extract names of Pakistani cities from a given text.

**Example:**
```
Text: Lahore, Karachi, Islamabad, and Peshawar are major cities of Pakistan.
```


In [2]:
import re

cities = """Lahore, Karachi, Islamabad, and Peshawar are major cities of Pakistan."""

# Defining a list of known Pakistani cities for comparison
pakistani_cities = ["Karachi", "Lahore", "Islamabad", "Rawalpindi", "Peshawar"]

# re.escape guards against city names that contain regex metacharacters
pattern = r"\b(?:" + "|".join(map(re.escape, pakistani_cities)) + r")\b"

matched_cities = re.findall(pattern, cities, re.IGNORECASE)
print(matched_cities)

['Lahore', 'Karachi', 'Islamabad', 'Peshawar']
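
Because the search uses `re.IGNORECASE`, matches come back exactly as they are written in the text, so `LAHORE` and `lahore` would be listed as different strings. A possible extension, sketched with made-up sample text, maps each match back to the spelling in the reference list and counts the mentions.

```python
import re
from collections import Counter

pakistani_cities = ["Karachi", "Lahore", "Islamabad", "Rawalpindi", "Peshawar"]

pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, pakistani_cities)) + r")\b",
    re.IGNORECASE,
)

# Lookup from lower-case spelling to the spelling used in the reference list.
canonical = {city.lower(): city for city in pakistani_cities}

sample = "LAHORE and lahore are the same city; Karachi is on the coast."
city_counts = Counter(canonical[m.lower()] for m in pattern.findall(sample))
print(city_counts)
```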



### Assignment 10: Analyzing Vehicle Numbers

**Raw Text:** 
Identify and extract Pakistani vehicle registration numbers (e.g., ABC-123) from a text.

**Example:**
```
Text: I saw a car with the number plate LEA-567 near the market.
```



In [11]:
import re

number_plate = """I saw a car with the number plate LEA-567 near the market."""

pattern = r"\b[A-Z]{3}-\d{3}\b"

reg_no = re.findall(pattern, number_plate)
print("Pakistani Vehicle Numbers:\t", reg_no)

Pakistani Vehicle Numbers:	 ['LEA-567']
