# Regular Expressions

A regular expression (regex or regexp, sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. String-searching algorithms typically use this pattern for "find" or "find and replace" operations on strings, or for input validation.

The concept arose in the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language, and it came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s; two widely used ones are the POSIX standard and the Perl syntax.

## Regex in Python

Python provides a module for regular expressions called `re`.

We are going to use its `findall` and `sub` functions.

In [15]:
import re

text = "my name is Asya"

re.sub("Asya", "Eso", text)

'my name is Eso'

In [16]:
text

'my name is Asya'

In [17]:
re.findall("Asya", text)

['Asya']

Here we searched for and substituted plain text, but the pattern can also be a regular expression. Note that `re.sub` returns a new string; `text` itself is unchanged.

## Special Characters (Regular Expressions)

Some characters have a special function instead of standing for themselves, for example `\n`, which indicates a newline, and `\t`, which is a tab.

### Basic patterns that match single chars

| Character  | function |
| ------------- | ------------- |
| a-z, 0-9  | ordinary characters just match themselves exactly.|
| . (dot)  | matches any single character except newline '\n'  |
| \w | matches a "word" character: a letter, digit or underscore [a-zA-Z0-9_] |
| \W | matches any non-word character |
| \b | boundary between word and non-word |
| \s | matches a single whitespace character -- space, newline, return, tab |
| \S | matches any non-whitespace character |
| \t, \n, \r | tab, newline, return |
| \d | decimal digit [0-9] |
| ^ | matches start of the string |
| $ | matches the end of the string |
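
As a rough illustration of a few rows of the table, the sketch below applies `\d`, `\w`, `\s` and `\b` to a made-up sentence. The `sample` string and the variable names are assumptions chosen for the example; they are not part of the original notebook.

```python
import re

# Hypothetical sample text containing a tab and a newline
sample = "Room 42\tis on floor 3.\nCall ext 7!"

digits = re.findall(r"\d", sample)        # every single decimal digit
words = re.findall(r"\w+", sample)        # runs of word characters
whitespace = re.findall(r"\s", sample)    # spaces, the tab and the newline
is_word = re.findall(r"\bis\b", sample)   # "is" only as a standalone word

print(digits)
print(words)
print(whitespace)
print(is_word)
```

Here `\w+` uses the `+` repetition described later in this section so that whole words are returned instead of single characters.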

## Let's mix them with normal characters

> Note that we put `r` before the pattern string to make it a raw string, so Python does not process backslash escapes in it. For example, it does not replace `\n` with a newline before the regex engine sees the pattern.
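
A minimal sketch of why the `r` prefix matters, assuming a made-up `sentence`: in a normal string `\b` is the backspace character, so the word-boundary pattern only behaves as intended when written as a raw string.

```python
import re

# In a normal string "\n" is one newline character; in a raw string it is
# two characters: a backslash followed by "n".
normal_len = len("\n")
raw_len = len(r"\n")
print(normal_len, raw_len)

# "\b" in a normal string is a backspace character, so the pattern below
# looks for backspace + "py" + backspace and finds nothing.
sentence = "py is short for python"  # hypothetical example text
without_raw = re.findall("\bpy\b", sentence)

# As a raw string, \b reaches the regex engine and means "word boundary".
with_raw = re.findall(r"\bpy\b", sentence)

print(without_raw)
print(with_raw)
```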

In [6]:
text = """regular expression is a special sequence of characters \
that helps you match or find other strings or sets of strings, \
using regular expression pattern. regular expressions are widely used in UNIX world."""

re.findall(r"^regular", text)
# ^ anchors the match to the start of the string, so only the first "regular" is found


['regular']

## Replace the first "regular" with a capitalized one

In [7]:
text = """regular expression is a special sequence of characters \
that helps you match or find other strings or sets of strings, \
using regular expression pattern. regular expressions are widely used in UNIX world."""

re.sub(r"^regular", "Regular", text)

'Regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using regular expression pattern. regular expressions are widely used in UNIX world.'

# Lab 1 : Ends with number

Using a regular expression, write a script that checks whether a string ends with a number.

In [15]:
# Prompt the user to enter a string and store it in the 'text' variable
text = input("enter a string:").strip()
#The strip() method is used to remove leading and trailing whitespace
#(spaces, tabs, or newlines) from the entered string.

# Use a regular expression pattern to find if the string ends with a number
# [0-9]$ is a regular expression pattern that matches any digit (0-9) at the end of the string ($)
result = re.findall(r"[0-9]$", text)

# Check if 'result' contains any matches (i.e., if the string ends with a number)
if result:
    print("string ends with a number")
else:
    print("it is not")


enter a string:it's 2023
string ends with a number


# Lab 2 : Only Valid text

Write a script that uses a regular expression to check whether the user input consists only of **`alphabet letters`**, **`numbers`** and **`_`**.

In [16]:
# Prompt the user to enter a username and store it in the 'username' variable
username = input("enter your username:").strip()

# Use a regular expression pattern to find any non-word characters (e.g., special symbols) in the username
# \W is a regular expression pattern that matches any non-word character (not letters, digits, or underscores)
invalid_input = re.findall(r"\W", username)

# Check if 'invalid_input' contains any matches (i.e., if the username contains non-word characters)
if invalid_input:
    print("invalid username")
else:
    print("username {} is valid".format(username))


enter your username:asya_123
username asya_123 is valid
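
One thing to be aware of: with the approach above, an empty input contains no non-word characters and is therefore reported as valid. A possible alternative, sketched below, is to require the whole string to match `\w+` using `re.fullmatch`. The helper name `is_valid_username` is hypothetical and used only for illustration.

```python
import re

def is_valid_username(username):
    """Return True if username is one or more word characters and nothing else."""
    # \w+ needs at least one word character; fullmatch() requires the
    # pattern to cover the entire string, not just a part of it.
    return re.fullmatch(r"\w+", username) is not None

print(is_valid_username("asya_123"))  # only letters, digits and underscore
print(is_valid_username("asya 123"))  # contains a space
print(is_valid_username(""))          # empty input
```

This prints `True`, `False` and `False`.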


## Building a bigger regular expression

You can mix and match multiple expressions and use more than one instance of each.

| Example  | description |
| --- | --- |
| [Pp]ython | Match "Python" or "python" |
| rub[ye] | Match "ruby" or "rube" |
| [aeiou] | Match any one lowercase vowel |
| [0-9] | Match any digit; same as [0123456789] |
| [a-z] | Match any lowercase ASCII letter |
| [A-Z] | Match any uppercase ASCII letter |
| [a-zA-Z0-9] | Match any ASCII letter or digit |
| [^aeiou] | Match anything other than a lowercase vowel |
| [^0-9] | Match anything other than a digit |

> We can use **OR** (written `|`) to combine multiple regular expressions.


In [17]:
# Define a list of text samples to search for the word "Python" or "py"
texts = [
  "python is a great language",
  "i love to write in py",
  "what a cool language Python is",
  "the pyramids of giza are so huge!"
]
# Iterate through each text in the 'texts' list
for text in texts:
    # Use a regular expression pattern to find "Python" or "python", or the standalone word "py" or "Py"
    # [Pp]ython matches "Python" or "python": only the first letter may be either case
    # \b[Pp]y\b matches the word "py" or "Py" surrounded by word boundaries
    python_detected = re.findall(r"[Pp]ython|\b[Pp]y\b", text)
    
    # Check if 'python_detected' contains any matches
    if python_detected:
        print("talking about python")
    else:
        print("something else")


talking about python
talking about python
talking about python
something else


## Repetition Cases

| Example | description |
| --- | --- |
| ruby? | Match "rub" or "ruby": the y is optional |
| ruby* | Match "rub" plus 0 or more y(s) |
| ruby+ | Match "rub" plus 1 or more y(s) |
| \d{3} | Match exactly 3 digits |
| \d{3,} | Match 3 or more digits |
| \d{3,5} | Match 3, 4, or 5 digits |
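
The quantifiers in the table can be compared side by side. The sketch below is only an illustration: the `candidates` list and the `results` dictionary are assumptions made for this example, and `re.fullmatch` is used so that each candidate must match the pattern completely.

```python
import re

candidates = ["rub", "ruby", "rubyyy"]  # hypothetical test strings

# For each pattern, collect the candidates that it matches in full
results = {}
for pattern in [r"ruby?", r"ruby*", r"ruby+"]:
    results[pattern] = [c for c in candidates if re.fullmatch(pattern, c)]
    print(pattern, results[pattern])

# {3,5} is greedy: it takes up to 5 digits, and the 2 digits left over
# from "1234567" are too few to form another match.
ranged = re.findall(r"\d{3,5}", "12 123 1234567")
print(ranged)
```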

In [18]:
# Define a text string containing some text with a year in it
text = "it's 2018, happy new year!"

# Use a regular expression pattern to find four consecutive digits (\d{4}) in the text
# \d matches any digit, and {4} specifies that it should be exactly four consecutive digits
matches = re.findall(r"\d{4}", text)

# The result of re.findall() will be a list of all matched four-digit sequences
# In this case, it will find "2018" as the only match
print(matches)


['2018']


## Let's replace numbers with NUM

In [30]:
# Define a text string with some content that includes numbers
text = "this is a text tweet that contains multiple numbers 01112131411, 012121212"

# Use re.sub() to replace sequences of 11 digits with the word "PHONE" in the 'text' string
new_text = re.sub(r"\d{11}", " PHONE ", text)

# Use re.sub() to replace all other sequences of digits (1 or more) with the word "NUM" in 'new_text'
new_text = re.sub(r"\d+", " NUM ", new_text)

# Print the modified 'new_text' string
print(new_text)


this is a text tweet that contains multiple numbers  PHONE ,  NUM 
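
Note that `\d{11}` also matches the first 11 digits of a longer number. If only runs of exactly 11 digits should count as phone numbers, one option is to surround the pattern with `\b`. The sketch below assumes a made-up `text` to show the difference.

```python
import re

# Hypothetical text with a 15-digit number and an 11-digit number
text = "order 123456789012345 shipped, call 01112131411"

# Without boundaries, the first 11 digits of the longer number are replaced
unbounded = re.sub(r"\d{11}", "PHONE", text)

# With \b on both sides, only a run of exactly 11 digits is replaced
bounded = re.sub(r"\b\d{11}\b", "PHONE", text)

print(unbounded)
print(bounded)
```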


## Search groups

You can create groups in a regex with parentheses and retrieve each one from the match object returned by the `search()` function.

In [22]:
email_address = 'Please contact us at: support@google.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@google.com
support
google.com


## `match` vs `search`

The `match()` function checks for a match only at the beginning of the string whereas the `search()` function checks for a match anywhere in the string.
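
A small sketch of the difference, reusing the year sentence from earlier in this section. Both functions return `None` when nothing matches, so the result should be checked before calling `group()`.

```python
import re

text = "it's 2018, happy new year!"

# match() only looks at the beginning of the string, where there is no digit
at_start = re.match(r"\d{4}", text)

# search() scans the whole string and finds the year
anywhere = re.search(r"\d{4}", text)

# match() succeeds when the pattern does occur at the beginning
prefix = re.match(r"it's", text)

print(at_start)
print(anywhere.group())
print(prefix.group())
```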

## The `re.compile()` function

We can compile a regular expression once and reuse it instead of writing it multiple times.

In [19]:
# Define a regular expression pattern for matching email addresses
mail_re = re.compile(r"[\w\.-]+@[\w\.-]+\.[\w\.-]+")

# Create a list of text samples containing email addresses
mails = [
    "this is a message with an email of:example.name@company.org",
    "my-email55@yahoo.com is the email you would like to use",
    "send me an email at:shortmail@long-company.net"
]

# Iterate through each text in the 'mails' list
for mail in mails:
    # Use the regular expression pattern to find all email addresses in the text
    found_emails = mail_re.findall(mail)
    
    # Print the list of found email addresses in the current text
    print(found_emails)


['example.name@company.org']
['my-email55@yahoo.com']
['shortmail@long-company.net']


# Lab 3 : Extract Hashtags

Write a script to extract the hashtags from a tweet.

In [21]:
tweet = "a tweet with no hashtag, but a #HASHTAG and another #cool one #هاشتاج"

hashtags = re.findall(r"#\S+", tweet)

if hashtags:
    print(hashtags)
else:
    print("no hashtags found")

['#HASHTAG', '#cool', '#هاشتاج']
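
The pattern `#\S+` keeps everything up to the next whitespace, so punctuation that directly follows a hashtag ends up in the result. A possible variation, sketched below with a made-up tweet, is `#\w+`, which stops at the first non-word character. In Python 3, `\w` also matches non-ASCII letters for string patterns by default, so hashtags in other scripts are still captured.

```python
import re

# Hypothetical tweet where hashtags are followed by punctuation
tweet = "loving the weather #sunny, see you at #PyCon2023!"

loose = re.findall(r"#\S+", tweet)   # stops only at whitespace
strict = re.findall(r"#\w+", tweet)  # stops at the first non-word character

print(loose)
print(strict)
```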
