<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Examples: Regular expressions
© ExploreAI Academy

In this notebook, we look at how to use Python's `re` library to extract the data we're interested in. We also look at the `re.compile` function, which creates compiled pattern objects that can be reused across searches.

## Learning objectives

By the end of this notebook, you should be able to:
- Understand how to use regex to extract data we're interested in.
- Apply regex through both module-level functions, such as `re.split`, and compiled pattern objects.

## Examples

### Example 1

#### Question 
Given a paragraph about conservation efforts, split the text into individual sentences using regular expressions.

#### Solution

In [None]:
import re

text = "Conservation efforts are increasing. Habitats are being restored. Species are recovering."

# Split the text wherever a period is followed by a space
sentences = re.split(r"\. ", text)

print(sentences)

#### Explanation
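This code uses `re.split` to split the paragraph wherever a period is followed by a space. The split consumes the delimiter, so the first two sentences lose their periods, while the last sentence keeps its period because no space follows it. This is a simple use of `re.split` for text segmentation.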
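If the sentences should keep their closing punctuation, one possible variation is to split on the whitespace only and use a lookbehind to check for the punctuation. The sketch below assumes sentences end in `.`, `!`, or `?`, which will not hold for text containing abbreviations such as "e.g.".

```python
import re

text = "Conservation efforts are increasing. Habitats are being restored. Species are recovering."

# (?<=[.!?]) is a lookbehind: it requires sentence-ending punctuation
# just before the split point, but does not consume it.
# \s+ is the part that is actually removed by the split.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(sentences)
# ['Conservation efforts are increasing.', 'Habitats are being restored.', 'Species are recovering.']
```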

### Example 2

#### Question 
Extract all numbers followed by "acres" to find references to land area in a text. Utilise `re.compile` to create a regex pattern that matches this format.

#### Solution

In [None]:
import re

text = "The national park covers 5000 acres, while the community forest spans 750 acres."

# Compiling the regex pattern
pattern = re.compile(r'\d+\s*acres', re.IGNORECASE)

# Finding all occurrences of land area
land_areas = pattern.findall(text)

print(land_areas)

#### Explanation
This solution uses a compiled regex pattern to find numerical values followed by "acres". The regex `\d+\s*acres` looks for one or more digits (`\d+`), followed by zero or more whitespace characters (`\s*`) and the word "acres". The `re.IGNORECASE` flag ensures that variations in capitalisation, such as "Acres", are also matched. `findall` returns each full match as a string, here `'5000 acres'` and `'750 acres'`.
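When only the numbers are needed, a capturing group is one way to pull them out. The sketch below wraps `\d+` in parentheses, so `findall` returns just the captured digits. The conversion to `int` and the total are additions for illustration, not part of the original question.

```python
import re

text = "The national park covers 5000 acres, while the community forest spans 750 acres."

# With one capturing group, findall returns the group's text
# instead of the full match.
pattern = re.compile(r'(\d+)\s*acres', re.IGNORECASE)

# Convert the captured digit strings to integers
areas = [int(value) for value in pattern.findall(text)]

print(areas)       # [5000, 750]
print(sum(areas))  # 5750
```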

### Example 3

#### Question 
Given a text with various animal and plant species names formatted as 'Genus species', compile a regex object to find all occurrences of these species' names in the text.

#### Solution

In [None]:
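import re

# Compiling the regex pattern: a capitalised word followed by a lowercase word
pattern = re.compile(r'\b[A-Z][a-z]* [a-z]+\b')
text = "In the Amazon rainforest, species like Panthera onca, Inia geoffrensis, and Euterpe precatoria are found."

# Finding all occurrences that fit the 'Genus species' format
species = pattern.findall(text)
print(species)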

#### Explanation
The script compiles a regex pattern using `re.compile`, which is then used to find all matches in the given text. The regex `\b[A-Z][a-z]* [a-z]+\b` matches a word that starts with a capital letter followed by zero or more lowercase letters (the genus name in biological nomenclature), then a space, and then one or more lowercase letters (the species name).

While this pattern reflects the usual form of a scientific name, it does not capture only valid species names. Any capitalised word followed by a lowercase word fits the pattern, so on the sample text it also returns "In the" and "Amazon rainforest" alongside the three species. The pattern also does not account for the complexities and exceptions found in biological nomenclature, such as species names with hyphens, Latin abbreviations, or names of more than two words. This regex is therefore useful for preliminary data extraction, but further verification and refinement are needed to ensure the accuracy and relevance of the extracted data, especially for scientific or research purposes.
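One possible refinement is to treat the regex output as candidates and check them against a reference list. In the sketch below, `known_genera` is a hypothetical allow-list invented for illustration; in practice it would come from a taxonomic reference.

```python
import re

pattern = re.compile(r'\b[A-Z][a-z]* [a-z]+\b')
text = "In the Amazon rainforest, species like Panthera onca, Inia geoffrensis, and Euterpe precatoria are found."

# Everything that fits the 'Genus species' shape, including false positives
candidates = pattern.findall(text)
print(candidates)
# ['In the', 'Amazon rainforest', 'Panthera onca', 'Inia geoffrensis', 'Euterpe precatoria']

# Hypothetical allow-list of genus names (an assumption for this example)
known_genera = {"Panthera", "Inia", "Euterpe"}

# Keep only candidates whose first word is a known genus
species = [name for name in candidates if name.split()[0] in known_genera]
print(species)
# ['Panthera onca', 'Inia geoffrensis', 'Euterpe precatoria']
```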

### Example 4

#### Question 
Given a text containing different plant names related to sustainable land management, extract all occurrences of specific plants. The names to extract are "oak", "maple", and "pine". Use `re.compile` to optimise the pattern matching.

#### Solution

In [None]:
import re

text = "The forest had a variety of trees including oak, maple, and pine. Other species included birch and spruce."

# Compiling the regex pattern
pattern = re.compile(r'\boak\b|\bmaple\b|\bpine\b', re.IGNORECASE)

# Finding all occurrences of the specified plants
found_plants = pattern.findall(text)

print(found_plants)


#### Explanation
This code uses `re.compile` to create a compiled regex object. Compiling converts the pattern string into an internal format once, and the resulting object can then be reused for many searches. This is useful when parsing large texts or processing many strings with the same pattern. Python's `re` module also caches recently compiled patterns, so the speed gain over the module-level functions is often modest. The main benefits are a reusable, named pattern and clearer code.
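To illustrate reuse, the sketch below compiles the pattern once and applies it to several strings. The `survey_notes` list is hypothetical data invented for this example.

```python
import re

# Hypothetical field notes, invented for illustration
survey_notes = [
    "Plot A: oak and Pine dominate the canopy.",
    "Plot B: mostly birch, with some spruce.",
    "Plot C: a single MAPLE near the stream.",
]

# Compile once...
pattern = re.compile(r'\boak\b|\bmaple\b|\bpine\b', re.IGNORECASE)

# ...then reuse the same object for every string
found_per_note = [pattern.findall(note) for note in survey_notes]

print(found_per_note)
# [['oak', 'Pine'], [], ['MAPLE']]
```

Matches are returned as they appear in the text, so the original capitalisation is preserved even though the search is case insensitive.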

The regex pattern `\boak\b|\bmaple\b|\bpine\b` uses word boundaries (`\b`) to match whole words only, so "pine" is matched but "pineapple" is not. The `|` acts as an OR operator, matching any of the specified plant names. The `re.IGNORECASE` flag makes the search case insensitive.
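When the list of names is longer or changes often, the alternation can be built from a Python list rather than typed by hand. This is a sketch of one way to do it: `plant_names` is a name introduced here for illustration, `re.escape` guards against names containing regex metacharacters, and the non-capturing group `(?:...)` lets one pair of word boundaries cover all alternatives.

```python
import re

text = "The forest had a variety of trees including oak, maple, and pine. Other species included birch and spruce."

# Names to search for (illustrative variable, same plants as in the question)
plant_names = ["oak", "maple", "pine"]

# Join the escaped names with | and wrap them in a non-capturing group
alternatives = "|".join(re.escape(name) for name in plant_names)
pattern = re.compile(rf'\b(?:{alternatives})\b', re.IGNORECASE)

print(pattern.pattern)   # \b(?:oak|maple|pine)\b

found_plants = pattern.findall(text)
print(found_plants)      # ['oak', 'maple', 'pine']
```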


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>