# Key Concepts:
Regular Expressions: A sequence of characters that defines a search pattern, used for pattern matching within text.

URL Validation: The process of checking if a URL is syntactically correct and conforms to expected standards.

URL Extraction: The process of identifying and extracting URLs from larger text content.

Domain Extraction: The process of isolating the domain name (e.g., google.com) from a URL.

### Example 1: Validate the URL

The code below defines a Python function validate_url(url) that checks whether a given URL is valid. It matches the whole string against a regular expression that accepts an http:// or https:// scheme, then either a domain name or an IPv4 address, then an optional port and an optional path or query string.

The function returns True if the URL matches the pattern and False otherwise.

In [None]:
import re

# Define a function to validate a URL
def validate_url(url):
    # Compile a regular expression pattern for validating URLs.
    # Python joins the adjacent raw strings below into one pattern,
    # so each piece can carry its own comment.
    pattern = re.compile(
        r'^https?://'  # Scheme: http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\b|'  # Domain name: dot-separated labels, then a 2-6 letter TLD
        r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'  # or an IPv4 address: first three octets, each followed by a dot
        r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))'  # Last octet of the IPv4 address
        r'(:[0-9]{1,4})?'  # Optional port of one to four digits
        r'(?:/?|[/?]\S+)$',  # Optional trailing slash, path, or query string
        re.IGNORECASE  # Make the pattern case insensitive
    )
    # Return True if the URL matches the pattern, otherwise False
    return pattern.match(url) is not None

# Test the function with a valid URL
print(validate_url("http://www.google.com"))  # Output: True

# Test the function with an invalid URL
print(validate_url("www.google.com"))  # Output: False

# Extra test with the link to our GLAB
print(validate_url("https://colab.research.google.com/drive/1PpLud29mzw2EWlGjjg1pd7RIatVJGWUB?usp=sharing#scrollTo=aoLjmV1DJv4b"))  # Output: True

True
False
True
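
To get a feel for what this pattern accepts and rejects, the sketch below runs a copy of it against a few extra inputs. The test URLs are hypothetical and chosen only to exercise each part of the pattern; the module-level name `URL_PATTERN` is an assumption made for this sketch.

```python
import re

# Copy of the pattern from Example 1, left unchanged so its limits are visible.
URL_PATTERN = re.compile(
    r'^https?://'
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\b|'
    r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))'
    r'(:[0-9]{1,4})?'
    r'(?:/?|[/?]\S+)$',
    re.IGNORECASE,
)

def validate_url(url):
    return URL_PATTERN.match(url) is not None

# Hypothetical inputs, each paired with the result this pattern gives.
cases = {
    "http://192.168.0.1:8080/status": True,   # IPv4 address with a 4-digit port
    "HTTP://EXAMPLE.COM": True,               # re.IGNORECASE accepts upper case
    "https://example.com?q=1": True,          # query string without a path
    "http://256.1.1.1": False,                # octet outside 0-255
    "ftp://example.com": False,               # only http and https are accepted
    "https://example.com:65535/": False,      # 5-digit port exceeds {1,4}
    "https://example.technology": False,      # TLD longer than six letters
}

for url, expected in cases.items():
    result = validate_url(url)
    assert result == expected, url
    print(f"{result!s:5}  {url}")
```

The last two cases are valid URLs in practice, so the pattern is better read as a reasonable filter than as a complete validator.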


### Example 2: Extract URLs from a string


In [None]:
import re

def extract_urls(text):
    # Regular expression for matching URLs
    url_pattern = r'https?://[a-zA-Z0-9.-]+(?:/\S*)?'

    # Find all URLs in the text
    urls = re.findall(url_pattern, text)

    return urls

# Sample text containing URLs
text = "Visit my website at https://www.example.com for more information. You can also check http://www.example.org/products"

# Call the function to extract URLs
urls = extract_urls(text)

# Print the extracted URLs
print("URLs found in the text:")
for url in urls:
    print(url)

URLs found in the text:
https://www.example.com
http://www.example.org/products


This regular expression matches URLs. It looks for text starting with "http://" or "https://" followed by a domain part, which consists of alphanumeric characters, dots, and hyphens.

It also allows for an optional path and query string, matched by (?:/\S*)?. We use re.findall to find all URLs in the given text, and the URLs found are then printed.

The trailing group is written as a non-capturing group, (?:...), on purpose. When a pattern contains a capturing group, re.findall returns only the text captured by that group rather than the whole match, so the capturing form (/\S*)? would print an empty string and "/products" instead of the two URLs.


Here's a breakdown:

`url_pattern = ...`: This assigns the regex pattern to a variable named url_pattern for later use.

`r'...'`: The r before the opening quote marks a raw string literal, which prevents Python from interpreting backslashes (\) in the pattern as escape sequences.

`https?://`: This matches the beginning of a URL, either "http://" or "https://". The ? makes the "s" optional, allowing for both secure and non-secure URLs.

`[a-zA-Z0-9.-]+`: This matches one or more (+) occurrences of any alphanumeric character (a-zA-Z0-9), dot (.), or hyphen (-). This part typically matches the domain name portion of the URL (e.g., "www.example.com").

`(?:/\S*)?`: This part is optional, as indicated by the ? at the end.

`(?:/ ... )`: This non-capturing group matches the path and query string part of the URL, if present, without changing what re.findall returns.

`\S*`: This matches zero or more (*) occurrences of any non-whitespace character (\S). This allows for various path and query string structures (e.g., "/products", "/blog?id=123").
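
To see how a capturing group changes what `re.findall` returns, the following sketch runs both forms of the pattern on the sample text from this example. The variable names are assumptions made for illustration; `re.finditer` with `group(0)` is shown as one way to get whole matches even when the pattern contains a capturing group.

```python
import re

text = ("Visit my website at https://www.example.com for more information. "
        "You can also check http://www.example.org/products")

capturing = r'https?://[a-zA-Z0-9.-]+(/\S*)?'        # path in a capturing group
non_capturing = r'https?://[a-zA-Z0-9.-]+(?:/\S*)?'  # path in a non-capturing group

# With one capturing group, findall returns only that group's text
# (an empty string when the optional group did not take part in the match).
groups_only = re.findall(capturing, text)

# With no capturing groups, findall returns the full matches.
full_matches = re.findall(non_capturing, text)

# finditer yields match objects; group(0) is always the whole match.
via_finditer = [m.group(0) for m in re.finditer(capturing, text)]

print(groups_only)
print(full_matches)
print(via_finditer)
```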

### Example 3: Extract URLs and their domain names from a string

In [None]:
import re

def extract_urls_and_domains(text):
    # Regular expression for matching URLs
    url_pattern = r'https?://\S+'

    # Regular expression for extracting domain names from URLs
    domain_pattern = r'https?://([a-zA-Z0-9.-]+)'

    # Find all URLs in the text
    urls = re.findall(url_pattern, text)

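    # Extract domain names from the URLs
    # Use re.search to match the domain pattern and take the first group (domain name).
    # re.search returns None when the host has characters outside the domain pattern,
    # so those URLs are skipped instead of raising AttributeError.
    domain_matches = (re.search(domain_pattern, url) for url in urls)
    domains = [m.group(1) for m in domain_matches if m]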

    return urls, domains

# Sample text containing a URL
text = "Visit my website at https://www.example.com for more information."

# Call the function to extract URLs and domains
urls, domains = extract_urls_and_domains(text)

# Print the extracted URLs
print("URLs found in the text:")
for url in urls:
    print(url)

# Print the extracted domain names
print("\nExtracted domain names:")
for domain in domains:
    print(domain)

URLs found in the text:
https://www.example.com

Extracted domain names:
www.example.com
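
The two steps above can also be combined into a single pass. The sketch below is one possible variant, not the approach used in this example: it uses `re.finditer` with a named group so each URL stays paired with its domain, and it strips trailing sentence punctuation that `\S` would otherwise include. The names `URL_WITH_DOMAIN` and `extract_url_domain_pairs`, and the sample sentence, are assumptions made for illustration.

```python
import re

# One pattern for both jobs: the named group "domain" captures the host part,
# and \S* consumes the rest of the URL up to the next whitespace.
URL_WITH_DOMAIN = re.compile(r'https?://(?P<domain>[a-zA-Z0-9.-]+)\S*')

def extract_url_domain_pairs(text):
    pairs = []
    for match in URL_WITH_DOMAIN.finditer(text):
        # Drop punctuation that belongs to the sentence rather than the URL.
        url = match.group(0).rstrip('.,;:!?)')
        # The domain character class allows dots, so a sentence-ending
        # period can end up in the group; strip it.
        domain = match.group('domain').rstrip('.')
        pairs.append((url, domain))
    return pairs

sample = "See https://www.example.com/docs, or http://blog.example.org."
for url, domain in extract_url_domain_pairs(sample):
    print(url, "->", domain)
```

Stripping punctuation this way is a heuristic: a URL that legitimately ends in one of those characters would be shortened as well.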
