# REGEX :

* A regex (regular expression) is a sequence of characters that defines a **search pattern**.
* The pattern lets you **match, extract or modify text**.
* In Python, we use the built-in regex library, `re`, to:
    - **Define** string patterns
    - **Operate** on the strings that match the pattern : search, extract, replace or remove
    

Let's start with extracting **#hashtags** from a social media corpus

In [5]:
# Here is a small corpus of tweets that contain hashtags
tweets = [
    'An #autumn scene showing a beautiful #horse coming to visit me.', 
    'My new favourite eatery in #liverpool and I mean superb! #TheBrunchClub #breakfast #food', 
    '#nowplaying Pointer Sisters - Dare Me | #80s #disco #funk #radio']

# 1. import the regex library
import re

# 2. define the pattern
# '#' followed by a non-empty sequence of non-whitespace characters: '\S+'
pattern = r'#\S+'

# 3. find all the strings that match the pattern with the 'findall' method
for text in tweets:
    print(re.findall(pattern, text))

['#autumn', '#horse']
['#liverpool', '#TheBrunchClub', '#breakfast', '#food']
['#nowplaying', '#80s', '#disco', '#funk', '#radio']
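
Because `\S` matches any non-whitespace character, `#\S+` also captures punctuation glued to a hashtag. The sketch below compares it with the stricter `#\w+` on a made-up tweet (the sentence is an assumption for illustration, not part of the corpus above).

```python
import re

# Hypothetical tweet with punctuation right after each hashtag
tweet = 'Loved the show last night! #disco, #funk. #80s!'

# \S+ keeps every non-whitespace character, punctuation included
loose = re.findall(r'#\S+', tweet)
print(loose)   # ['#disco,', '#funk.', '#80s!']

# \w+ stops at the first character that is not a letter, digit or underscore
strict = re.findall(r'#\w+', tweet)
print(strict)  # ['#disco', '#funk', '#80s']
```

Which pattern is preferable depends on the corpus: `#\w+` is cleaner on punctuated text, but it would cut a hashtag at a hyphen or any other non-word character.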


Extracting **@usernames**

In [7]:
text = 'Check out this new NLP course on @openclassrooms by @alexip'

pattern = r'@\S+'

print(re.findall(pattern, text))

['@openclassrooms', '@alexip']


The Python `re` library includes the following three main functions to extract specific strings or modify a text:
- `re.findall(pattern, text)` : returns the list of all strings that match the pattern
- `re.sub(pattern, replace_with, text)` : replaces every string sequence that matches the pattern with the **replace_with** sequence
- `re.search(pattern, text)` : returns the first match as a match object, with information about its starting and ending positions (or `None` if nothing matches)
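
As a minimal sketch, the three functions can be compared side by side on the username example. The `'@USER'` placeholder is an arbitrary choice made for this illustration.

```python
import re

text = 'Check out this new NLP course on @openclassrooms by @alexip'
pattern = r'@\S+'

# findall: every match, as a list of strings
all_matches = re.findall(pattern, text)
print(all_matches)        # ['@openclassrooms', '@alexip']

# search: only the first match, as a match object with positions
first = re.search(pattern, text)
print(first.group())      # @openclassrooms
print(first.span())       # (33, 48)

# sub: replace every match ('@USER' is an arbitrary placeholder)
masked = re.sub(pattern, '@USER', text)
print(masked)             # Check out this new NLP course on @USER by @USER

# search returns None when nothing matches, so test before calling .group()
print(re.search(r'#\S+', text))  # None
```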

Let's apply `re.sub()` to remove the HTML tags from an HTML page

In [8]:
import requests

# Music is in the House!
url = 'https://en.wikipedia.org/wiki/House_music'

# GET the content
# Note : requests.get().content returns a bytes object
# that we can decode into a string with .decode('UTF-8')
html = requests.get(url).content.decode('UTF-8')

# Remove the header part of the HTML
html = html.split('</head>')[1]

print(html)


<body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-House_music rootpage-House_music skin-vector-2022 action-view"><a class="mw-jump-link" href="#bodyContent">Jump to content</a>
<div class="vector-header-container">
	<header class="vector-header mw-header">
		<div class="vector-header-start">
			<nav class="vector-main-menu-landmark" aria-label="Site" role="navigation">
				
<div id="vector-main-menu-dropdown" class="vector-dropdown vector-main-menu-dropdown vector-button-flush-left vector-button-flush-right"  >
	<input type="checkbox" id="vector-main-menu-dropdown-checkbox" role="button" aria-haspopup="true" data-event-name="ui.dropdown-vector-main-menu-dropdown" class="vector-dropdown-checkbox "  aria-label="Main menu"  >
	<label id="vector-main-menu-dropdown-label" for="vector-main-menu-dropdown-checkbox" class="vector-dropdown-label cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-butt

In [10]:
# Now remove the HTML tags
pattern = r'<[^>]*>'
text = re.sub(pattern, '', html)

print(text)


Jump to content

	
		
			
				

	
	

Main menu
	
	


				
		

	
	Main menu
	move to sidebar
	hide


	

	
		Navigation
	
	
		
		
			
			Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate
		
		
	


	
	

	
		Contribute
	
	
		
		
			
			HelpLearn to editCommunity portalRecent changesUpload file
		
		
	




				

	


		
			

	
	
		
		
	


		
		
			

	

Search
	
	
		
			
				
					
						
						
					
					
				
				Search
			
		
	


			
	
	

	
		
		
			
			
		
		
	


	

	
		
		
			
			
		
		
	


	
		
		
	
	

	
		
		
			
			
		
		
	


	

	
		
		
			Create account

Log in


			
		
		
	


	
	

	
	

Personal tools
	
	


		

	
		
		
			
			 Create account Log in
		
		
	



	
		Pages for logged out editors learn more
	
	
		
		
			
			ContributionsTalk
		
		
	


	
	




		
	


	
		
			
		
		
			
		
			
				
				
				
		
		
	
	
				
					
					
	
	Contents
	move to sidebar
	hide



	
		
			
				(Top)
			
		
		
		
			
			1Characteristics
		
		
		
		
	
	
		
			
			2Origins 
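
The output above shows that removing the tags leaves the original tabs and line breaks behind. One possible follow-up, sketched here on a small hand-written HTML fragment (an assumption standing in for the downloaded page), is a second `re.sub()` that collapses every run of whitespace into a single space.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded page
fragment = (
    '<div>\n\t<h1>House music</h1>\n\n'
    '\t<p>House is a genre of <b>electronic</b> dance music.</p>\n</div>'
)

# Step 1: remove the tags (same pattern as above)
no_tags = re.sub(r'<[^>]*>', '', fragment)

# Step 2: collapse runs of spaces, tabs and line breaks into one space
clean = re.sub(r'\s+', ' ', no_tags).strip()

print(clean)  # House music House is a genre of electronic dance music.
```

This flattens paragraph structure as well, so it suits bag-of-words style processing better than anything that relies on line breaks.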

Extracting **URLs**

In [11]:
# To extract URLs, we need to define a pattern which:
# 1. has a starting point : 'http'
# 2. and ends just before either " or <
# r'http.+?(?="|<)'

url = 'https://en.wikipedia.org/wiki/House_music'

# GET, decode and drop header
html = requests.get(url).content.decode('UTF-8').split('</head>')[1]

# find all the urls
pattern = r'http.+?(?="|<)'
urls = re.findall(pattern, html)

print(f'We found {len(urls)} URLs')

We found 441 URLs


In [12]:
for i in range(10) :
    print(urls[i])

https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en
https://ar.wikipedia.org/wiki/%D9%87%D8%A7%D9%88%D8%B3_(%D9%85%D9%88%D8%B3%D9%8A%D9%82%D9%89)
https://be-tarask.wikipedia.org/wiki/%D0%93%D0%B0%D1%9E%D1%81
https://bh.wikipedia.org/wiki/%E0%A4%B9%E0%A4%BE%E0%A4%89%E0%A4%B8_(%E0%A4%AC%E0%A4%BF%E0%A4%A7%E0%A4%BE)
https://bg.wikipedia.org/wiki/%D0%A5%D0%B0%D1%83%D1%81
https://bs.wikipedia.org/wiki/House_muzika
https://ca.wikipedia.org/wiki/M%C3%BAsica_house
https://cs.wikipedia.org/wiki/House_music
https://cy.wikipedia.org/wiki/House
https://da.wikipedia.org/wiki/House
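
Two details of this pattern are worth isolating. The `?` after `.+` makes the repetition lazy, so the match stops at the first `"` or `<`, and the lookahead `(?="|<)` checks for that character without including it in the match. The sketch below shows the difference on a hand-written snippet (the `example.org` links are placeholders), and uses the standard library's `html.unescape` to decode the `&amp;` entities visible in the first URL above.

```python
import re
from html import unescape

# Hypothetical HTML snippet with two links (example.org is a placeholder)
snippet = (
    '<a href="https://example.org/a?x=1&amp;y=2">A</a> '
    '<a href="https://example.org/b">B</a>'
)

# Lazy: stop at the first " or < after 'http'
lazy = re.findall(r'http.+?(?="|<)', snippet)
print(lazy)      # ['https://example.org/a?x=1&amp;y=2', 'https://example.org/b']

# Greedy: run as far as possible, so both links end up in a single match
greedy = re.findall(r'http.+(?="|<)', snippet)
print(len(greedy))  # 1

# Decode HTML entities such as &amp; in the extracted URLs
decoded = [unescape(u) for u in lazy]
print(decoded)   # ['https://example.org/a?x=1&y=2', 'https://example.org/b']
```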


**Components :**

* `[]` : a set of characters
* `a-z` : Lowercase letters ; `A-Z` : Uppercase letters ; `À-ÖØ-öø-ÿ` : Accented letters
* `\d` or `[0-9]` : Digits
* `\S` : Any character that is not a whitespace character
* `\w` : word characters: letters, digits and underscore
* `\s` : whitespace characters, including line breaks, tabs, non-breaking spaces etc
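
To see what each component matches, here is a small sketch that applies them to one made-up sentence (the sentence is an assumption chosen to contain an underscore, digits and an accented letter).

```python
import re

# Hypothetical sample sentence
sample = 'Room_7 opens at 9:30, Élodie said.'

# Lowercase ASCII letters only: capitals, digits and accents break the runs
lower = re.findall(r'[a-z]+', sample)
print(lower)     # ['oom', 'opens', 'at', 'lodie', 'said']

# Letters including the accented range listed above
letters = re.findall(r'[a-zA-ZÀ-ÖØ-öø-ÿ]+', sample)
print(letters)   # ['Room', 'opens', 'at', 'Élodie', 'said']

# Digits
digits = re.findall(r'\d+', sample)
print(digits)    # ['7', '9', '30']

# Word characters: letters, digits and underscore
words = re.findall(r'\w+', sample)
print(words)     # ['Room_7', 'opens', 'at', '9', '30', 'Élodie', 'said']

# Anything that is not whitespace, punctuation included
chunks = re.findall(r'\S+', sample)
print(chunks)    # ['Room_7', 'opens', 'at', '9:30,', 'Élodie', 'said.']
```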


**Repetition :**

* `+` : 1 or more repetitions
* `?` : 0 or 1 repetition
* `*` : 0 or more repetitions
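
A quick way to compare the three quantifiers is to apply them to the same letter. In this sketch the word list is made up, and each quantifier applies to the `u` that precedes it.

```python
import re

# Hypothetical word list with zero, one and two 'u' characters
words = 'color colour colouur clr'

# ? : zero or one 'u'
optional = re.findall(r'colou?r', words)
print(optional)      # ['color', 'colour']

# + : one or more 'u'
one_or_more = re.findall(r'colou+r', words)
print(one_or_more)   # ['colour', 'colouur']

# * : zero or more 'u'
zero_or_more = re.findall(r'colou*r', words)
print(zero_or_more)  # ['color', 'colour', 'colouur']
```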


**Boundaries :**

* `\b` : empty string, but only at the beginning or end of a word, so a potential word tokenizer can be `r'\b\w+\b'`
* `^` : matches at the start of the text
* `$` : matches at the end of the text
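
The boundaries can be tried out on a short made-up sentence, as sketched below. Note that `\b\w\b` only matches one-character words, so a word tokenizer needs the `+` in `\b\w+\b`.

```python
import re

# Hypothetical sample sentence
text = 'Regex is fun, regex is powerful'

# Word tokenizer: one or more word characters between two word boundaries
tokens = re.findall(r'\b\w+\b', text)
print(tokens)   # ['Regex', 'is', 'fun', 'regex', 'is', 'powerful']

# Without the +, only single-character words match
single = re.findall(r'\b\w\b', 'a cat is a pet')
print(single)   # ['a', 'a']

# ^ anchors the match at the start of the text
first_word = re.findall(r'^\w+', text)
print(first_word)  # ['Regex']

# $ anchors the match at the end of the text
last_word = re.findall(r'\w+$', text)
print(last_word)   # ['powerful']
```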


