<a href="https://colab.research.google.com/github/E1250/nlp_ref/blob/main/Regular%20Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* Lecture -  https://www.youtube.com/watch?v=mN_O29PqLHQ&t=3s
* Lab - https://www.youtube.com/watch?v=iOWe92EkaXo
* Colab - https://colab.research.google.com/drive/1YS5OXdbshBpS4rgxlxRRBg6D9wSOb3Im?usp=sharing

# Basics of Regex

## Definition
A regular expression (regex) is a sequence of characters that defines a search pattern. It is used for string matching, extraction, and manipulation tasks.

## Syntax
Common regex symbols include:
- `.`: Matches any single character.
- `^`: Matches the start of a string when used outside brackets, as in `^[A-Z]`; as the first character inside brackets it negates the set, as in `[^abc]`.
- `$`: Matches the end of a string.
- `*`: Matches zero or more occurrences of the preceding element.
- `+`: Matches one or more occurrences of the preceding element.
- `?`: Matches zero or one occurrence of the preceding element.
- `[A-Z]`: Matches any single uppercase letter in the range; `[a-z]` and `[0-9]` work the same way for lowercase letters and digits.
- `[abc]`: Matches any one character in the set.
- `[^abc]`: Matches any character not in the set.
- `(a|b)`: Matches either `a` or `b`.
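- `{2,}`: Matches two or more occurrences of the preceding element; `{n}` means exactly `n` occurrences and `{n,m}` means between `n` and `m`.
- `\.`: Matches a literal `.`; the backslash escapes a special character so it is treated as an ordinary character, not a pattern symbol.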
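
The symbols above can be tried out with Python's `re.findall`. The following is a small sketch; the sample strings are made up for illustration and do not come from the lecture. Each comment names the symbol the pattern exercises.

```python
import re

# (pattern, sample text) pairs; the sample strings are illustrative assumptions.
examples = [
    (r"c.t", "cat cot cut"),       # .     any single character
    (r"^[A-Z]", "Hello world"),    # ^     start of the string
    (r"world$", "Hello world"),    # $     end of the string
    (r"ab*", "a ab abbb"),         # *     zero or more "b"
    (r"ab+", "a ab abbb"),         # +     one or more "b"
    (r"colou?r", "color colour"),  # ?     optional "u"
    (r"[^abc]", "abcd"),           # [^ ]  any character not in the set
    (r"(a|b)", "cab"),             # |     either "a" or "b"
    (r"[0-9]{2,}", "7 42 2023"),   # {2,}  two or more digits
    (r"3\.14", "3.14 3x14"),       # \.    a literal dot
]

# Map each pattern to the list of matches found in its sample text.
results = {pattern: re.findall(pattern, text) for pattern, text in examples}

for pattern, matches in results.items():
    print(f"{pattern!r:14} -> {matches}")
```

Comparing the `ab*` and `ab+` rows shows the difference between "zero or more" and "one or more": only `ab*` matches the lone `a`.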

# Why Regex in NLP

## Pattern Recognition
Regex is used to identify patterns in text, such as email addresses, phone numbers, or specific keywords.

## Data Cleaning
It helps in cleaning and preprocessing text data by removing unwanted characters or formatting.

## Tokenization
Regex can be used to split text into tokens, essential for many NLP tasks.

## Text Transformation
You can use regex for tasks like replacing or modifying specific parts of the text.


In [None]:
import re

1. Finding Patterns:


In [None]:
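# Arabic for "Email me at hello_world@gmail.com"
text = "راسلني على hello_world@gmail.com"
# Local part, "@", domain, then a dot and a top-level domain of two or more letters
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'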
emails = re.findall(email_pattern, text)
emails

['hello_world@gmail.com']
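
When the parts of an address are needed separately, named groups are one option. This is a sketch built on the same pattern; the group names `user` and `domain` are chosen here for illustration and are not part of the original notebook.

```python
import re

text = "راسلني على hello_world@gmail.com"

# Same email pattern, with the local part and the domain wrapped in named groups.
# The names "user" and "domain" are illustrative assumptions.
email_parts = re.compile(
    r'\b(?P<user>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})\b'
)

match = email_parts.search(text)
# groupdict() maps each group name to the text it captured.
parts = match.groupdict() if match else {}
print(parts)
```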


2. Data Cleaning:

Removing Numbers and Symbols:

In [None]:
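# Arabic for "Some random text with numbers 123 and symbols #!"
text = "بعض النصوص العشوائية مع أرقام 123 ورموز #!"
# Remove every character that is neither a space nor an Arabic letter in the range أ-ي
cleaned_text = re.sub(r'[^أ-ي ]', '', text)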
cleaned_text

بعض النصوص العشوائية مع أرقام  ورموز 
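
Removing characters this way leaves a doubled space where the number was and a trailing space at the end. A possible follow-up step, sketched here as an assumption about what a cleaning pipeline might want, is to collapse runs of whitespace and trim the ends.

```python
import re

text = "بعض النصوص العشوائية مع أرقام 123 ورموز #!"

# Step 1: keep only Arabic letters in the range أ-ي and spaces (as in the cell above).
cleaned_text = re.sub(r'[^أ-ي ]', '', text)

# Step 2 (an added, optional step): collapse any run of whitespace into one space
# and strip leading and trailing whitespace.
normalized_text = re.sub(r'\s+', ' ', cleaned_text).strip()
print(normalized_text)
```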


3. Tokenization:

Splitting Text into Words:

In [None]:
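# Arabic for "Hello, and how are you"
text = "مرحبا وكيف حالك"
# \w+ matches runs of word characters; in Python 3 this includes Arabic letters
tokens = re.findall(r'\b\w+\b', text)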
tokens

['مرحبا', 'وكيف', 'حالك']
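
The pattern above discards punctuation. If punctuation marks should be kept as separate tokens, one common variation is to match either a run of word characters or a single non-space, non-word character. The sample sentence below (Arabic for "Hello, how are you?") is an assumption added for illustration.

```python
import re

# Illustrative sample: "Hello, how are you?" with an Arabic comma and question mark.
text = "مرحبا، كيف حالك؟"

# \w+       a run of word characters (a word)
# [^\w\s]   a single character that is neither a word character nor whitespace
tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)
```

Here the Arabic comma and question mark each come out as their own token, giving five tokens in total.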


4. Text Transformation:

Replacing Words:

In [None]:
# Arabic for "The weather is sunny"
text = "الطقس مشمس"
# Replace "sunny" (مشمس) with "rainy" (ممطر)
modified_text = re.sub(r'مشمس', 'ممطر', text)
modified_text

الطقس ممطر


Extracting Hashtags:

In [None]:
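# Arabic for "Can we eliminate #brown #rot in #potatoes?"
text = "هل نستطيع القضاء على#العفن#البني فى #البطاطس؟"
# Capture the run of word characters that follows each "#"
hashtags = re.findall(r'#(\w+)', text)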
hashtags

['العفن', 'البني', 'البطاطس']


Finding Arabic Dates:

In [None]:
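# Arabic for "Today's date is 23/10/2023", written with Arabic-Indic digits
text = "التاريخ اليوم هو ٢٣/١٠/٢٠٢٣"
# One or two digits, "/", one or two digits, "/", two to four digits
dates = re.findall(r'\b[٠-٩]{1,2}/[٠-٩]{1,2}/[٠-٩]{2,4}\b', text)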
dates

['٢٣/١٠/٢٠٢٣']
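
To work with the date as numbers, the three fields can be captured with groups and converted with `int()`, which accepts Arabic-Indic digits in Python 3. This is a sketch; reading the fields as day/month/year is an assumption about the format, not something the pattern itself enforces.

```python
import re

text = "التاريخ اليوم هو ٢٣/١٠/٢٠٢٣"

# Same date pattern, with each field wrapped in a capturing group.
date_pattern = re.compile(r'\b([٠-٩]{1,2})/([٠-٩]{1,2})/([٠-٩]{2,4})\b')

match = date_pattern.search(text)
if match:
    # int() understands Unicode decimal digits, so no manual mapping is needed.
    # Day/month/year order is assumed here.
    day, month, year = (int(field) for field in match.groups())
    print(day, month, year)
```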


Matching Arabic Phone Numbers:

In [None]:
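# Arabic for "Call me at 0123456789", written with Arabic-Indic digits
text = "اتصل بي على ٠١٢٣٤٥٦٧٨٩"
# Exactly ten Arabic-Indic digits with a word boundary on each side
phone_numbers = re.findall(r'\b[٠-٩]{10}\b', text)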
phone_numbers

['٠١٢٣٤٥٦٧٨٩']
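
Downstream code often expects ASCII digits. One way to convert the matches, sketched here with `str.translate`, is to build a translation table from the Arabic-Indic digits (U+0660 to U+0669) to `0`-`9`. The table name `ARABIC_INDIC_TO_ASCII` is hypothetical.

```python
import re

# Translation table: Arabic-Indic digit (U+0660 + i) -> ASCII digit i.
# The name ARABIC_INDIC_TO_ASCII is an illustrative choice.
ARABIC_INDIC_TO_ASCII = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})

text = "اتصل بي على ٠١٢٣٤٥٦٧٨٩"
phone_numbers = re.findall(r'\b[٠-٩]{10}\b', text)

# Convert each matched number to ASCII digits.
ascii_numbers = [number.translate(ARABIC_INDIC_TO_ASCII) for number in phone_numbers]
print(ascii_numbers)
```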


Removing Arabic Diacritics (Tashkeel):

In [None]:
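# Arabic for "This is a text with diacritics"
text = "هَذَا نَصٌّ بِالْتَشْكِيل"
# U+064B to U+065F covers the tashkeel marks (fatha, damma, kasra, shadda, sukun, tanween)
text_without_diacritics = re.sub(r'[\u064B-\u065F]', '', text)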
text_without_diacritics

هذا نص بالتشكيل
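
For repeated use, the substitution can be wrapped in a small helper with a precompiled pattern. The sketch below is one possible shape; the function name `strip_tashkeel` and the optional removal of the tatweel (kashida, U+0640) are assumptions added for illustration, not part of the original notebook.

```python
import re

# Precompiled patterns: the diacritic range used above, plus the tatweel character.
TASHKEEL = re.compile(r'[\u064B-\u065F]')
TATWEEL = re.compile(r'\u0640')

def strip_tashkeel(text: str, remove_tatweel: bool = False) -> str:
    """Remove Arabic diacritics; optionally remove the tatweel as well.

    Hypothetical helper: the name and the remove_tatweel flag are illustrative.
    """
    text = TASHKEEL.sub('', text)
    if remove_tatweel:
        text = TATWEEL.sub('', text)
    return text

print(strip_tashkeel("هَذَا نَصٌّ بِالْتَشْكِيل"))
```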
