# The re Module

Regex functionality in Python resides in a module named re. The re module contains many useful functions and methods, most of which you’ll learn about in the next tutorial in this series.

For now, you’ll focus predominantly on one function, re.search().



```
# re.search(<regex>, <string>)
```





re.search(regex, string) scans string looking for the first location where the pattern regex matches. If a match is found, then re.search() returns a match object. Otherwise, it returns None.
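As a minimal sketch of the two possible outcomes, the snippet below searches an illustrative string (`'foo123bar'`, chosen here only as sample input). When the search succeeds, the match object exposes the matched text through `group()` and its position through `span()`:

```python
import re

text = 'foo123bar'  # illustrative sample string

# A successful search returns a match object
m = re.search('123', text)
matched_text = m.group()   # the substring that matched
position = m.span()        # (start, end) indices into text

# An unsuccessful search returns None
missing = re.search('999', text)

print(matched_text, position, missing)
```

Because `None` is falsy and a match object is truthy, the return value can be used directly as the condition of an `if` statement.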


## How to Import & Use re.search()

Because search() resides in the re module, you need to import it before you can use it. One way to do this is to import the entire module and then use the module name as a prefix when calling the function:



```
import re
re.search(...)
```



## First Example

In [1]:
s = 'foo123bar'

# One last reminder to import!
import re

re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In [2]:
if re.search('123', s):
    print('Found a match.')
else:
    print('No match.')

Found a match.


## Python Regex Metacharacters
The real power of regex matching in Python emerges when the `<regex>` argument contains special characters called **metacharacters**. These have a special meaning to the regex matching engine and vastly enhance the capability of the search.

Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.

In a regex, a set of characters specified in square brackets ([]) makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

In [None]:
s = 'foo123bar'
re.search('[0-9][0-9][0-9]', s)


<re.Match object; span=(3, 6), match='123'>

In [None]:
print(re.search('[0-9][0-9][0-9]', 'crux87'))

None


With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the in operator or with string methods.
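To make the contrast concrete, here is a small sketch comparing the `in` operator with a character-class regex. The list of sample strings is an assumption made for illustration:

```python
import re

samples = ['foo123bar', 'crux87', 'abc456']  # illustrative strings

# The in operator can only test for one fixed substring at a time
has_literal_123 = ['123' in s for s in samples]

# A regex can test for any run of three consecutive digits
has_three_digits = [
    re.search('[0-9][0-9][0-9]', s) is not None for s in samples
]

print(has_literal_123)   # [True, False, False]
print(has_three_digits)  # [True, False, True]
```

The `in` test misses `'abc456'` because it looks for the literal text `'123'`, while the regex accepts any three digits in a row.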

Take a look at another regex metacharacter. The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard:

In [None]:
s = 'foo123bar'
re.search('1.3', s)  # matches '123': the dot stands in for '2'


s = 'foo13bar'
print(re.search('1.3', s))  # no character sits between '1' and '3', so no match

None
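The newline exception can be checked directly. The sketch below uses short made-up strings, and it also shows the `re.DOTALL` flag, which is not covered in the text above but is the standard way to let the dot match a newline as well:

```python
import re

# The dot matches any single character except a newline
same_line = re.search('1.3', '1x3')
across_newline = re.search('1.3', '1\n3')

# With the re.DOTALL flag, the dot also matches a newline
with_dotall = re.search('1.3', '1\n3', re.DOTALL)

print(same_line is not None)    # True
print(across_newline)           # None
print(with_dotall is not None)  # True
```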


# Metacharacters Supported by the re Module



*   `.` Matches any single character except a newline
*   `^` Anchors a match at the start of a string
*   `$` Anchors a match at the end of a string
*   `*` Matches zero or more repetitions
*   `+` Matches one or more repetitions
*   `?` Matches zero or one repetition
*   `{}` Matches an explicitly specified number of repetitions
*   `\` Escapes a metacharacter, removing its special meaning
*   `[]` Specifies a character class
*   `|` Designates alternation
*   `()` Creates a group
*   `:`, `#`, `=`, `!` Designate a specialized group when they follow `(?`
*   `<>` Creates a named group, as in `(?P<name>...)`
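The following sketch exercises several of the metacharacters listed above on short made-up strings. The patterns and inputs are illustrative assumptions, not part of the original notebook:

```python
import re

# ^ and $ anchor a match at the start and end of the string
starts_with_foo = re.search('^foo', 'foobar') is not None   # True
starts_with_bar = re.search('^bar', 'foobar') is not None   # False
ends_with_bar = re.search('bar$', 'foobar') is not None     # True

# *, +, ? and {} control repetition
zero_or_more = re.search('ab*', 'a').group()                # 'a'
one_or_more = re.search('ab+', 'abbb').group()              # 'abbb'
optional = re.search('colou?r', 'color').group()            # 'color'
exactly_three = re.search('[0-9]{3}', 'foo123bar').group()  # '123'

# A backslash removes the special meaning of a metacharacter
literal_dot = re.search(r'1\.3', '123')  # None: '2' is not a literal dot

# | separates alternatives, () groups, and (?P<name>...) names a group
m = re.search(r'(?P<animal>cat|dog)s', 'two dogs')
animal = m.group('animal')                                  # 'dog'

print(zero_or_more, one_or_more, optional, exactly_three, animal)
```

Patterns that contain a backslash are written here as raw strings (`r'...'`) so that Python passes the backslash through to the regex engine unchanged.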


In [3]:
# search for substrings
re.search('capell[io]', 'Marco ha i capelli neri')

<re.Match object; span=(11, 18), match='capelli'>

In [None]:
# [o] allows only 'o', so 'capello' is not found: the result is None and nothing is displayed
re.search('capell[o]', 'Marco ha i capelli neri')

In [4]:
# look for the first lowercase letter
re.search('[a-z]', 'SS Lazio')


<re.Match object; span=(4, 5), match='a'>

In [5]:
# search for the first two consecutive digits
re.search('[0-9][0-9]', 'SS Lazio 1900')


<re.Match object; span=(9, 11), match='19'>

In [6]:
# search for the first hexadecimal digit
re.search('[0-9a-fA-F]', '--- a0 ---')


<re.Match object; span=(4, 5), match='a'>

In [None]:
# search for a non-digit character followed by a character that is not an uppercase letter
re.search('[^0-9][^A-Z]', 'SS Lazio 1900')

<re.Match object; span=(1, 3), match='S '>

In [7]:
# find every run of one or more digits (raw string, so \d reaches the regex engine as written)
re.findall(r"\d+", 'SS Lazio 1900')

['1900']
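The cell above switches from `re.search()` to `re.findall()`. As a brief sketch of the difference, using an illustrative string with two numbers in it: `re.search()` stops at the first match, while `re.findall()` returns every non-overlapping match as a list of strings.

```python
import re

text = 'SS Lazio 1900, AS Roma 1927'  # illustrative string

first = re.search(r'\d+', text).group()  # only the first run of digits
every = re.findall(r'\d+', text)         # all non-overlapping runs of digits

print(first)  # 1900
print(every)  # ['1900', '1927']
```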

In [8]:
!pip install beautifulsoup4
!pip install urllib3
!pip install pandas
!pip install matplotlib
!pip install nltk



In [9]:
import datetime
import numpy as np
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import os
import pandas as pd

# Scraping

In [None]:

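# url = 'https://www.ansa.it/'  # alternative target, overridden by the line below
url = "https://web.uniroma2.it/it/percorso/didattica/sezione/informatica-94846"

# Send a browser-like User-Agent, since some servers reject the default one
req = Request(url=url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'})
response = urlopen(req)
# Parse the response; naming the parser explicitly avoids a BeautifulSoup warning
html = BeautifulSoup(response, 'html.parser')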


In [None]:
html

<!DOCTYPE html>
<html lang="it">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Informatica</title>
<link href="https://stackpath.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://cdn.jsdelivr.net/npm/bootstrap@4.6.0/dist/css/bootstrap.min.css" integrity="sha384-B0vP5xmATw1+K9KRQjQERJvTumQW0nPEzvF6L/Z6nronJ3oUOFUFpCjEUQouq2+l" rel="stylesheet"/>
<link href="//fonts.googleapis.com/css?family=Roboto:300,400,500,700,900" media="all" rel="stylesheet"/>
<link href="https://maxst.icons8.com/vue-static/landings/line-awesome/line-awesome/1.3.0/css/line-awesome.min.css" rel="stylesheet"/>
<link href="/assets/css/sidebar/main.css" rel="stylesheet"/>
<link href="/assets/css/sidebar/sidebar-themes.css" rel="stylesheet"/>
<link href="https://cdn.datatables.net/1.10.24/css/jquery.dataTables.css" rel="stylesheet"

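Find all URLs in the page. Note that the pattern below stops at the first `/` after the host name, so it captures only the scheme and host of each link: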

In [None]:
import re

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', str(html))

In [None]:
urls

['https://stackpath.bootstrapcdn.com',
 'https://cdn.jsdelivr.net',
 'https://maxst.icons8.com',
 'https://cdn.datatables.net',
 'https://web.uniroma2.it',
 'https://web.uniroma2.it',
 'http://www.w3.org',
 'http://www.w3.org',
 'http://www.bohemiancoding.com',
 'http://www.scuolaiad.it',
 'http://lettere.uniroma2.it',
 'http://statistiche.almalaurea.it',
 'http://statistiche.almalaurea.it',
 'https://valmon.disia.unifi.it',
 'http://www.informatica.uniroma2.it',
 'http://www.informatica.uniroma2.it',
 'http://lettere.uniroma2.it',
 'https://uniroma2.ubuy.cineca.it',
 'http://urp.uniroma2.it',
 'http://formazione.insegnanti.uniroma2.it',
 'https://delphi.uniroma2.it',
 'https://agevola.uniroma2.it',
 'https://cdn.jsdelivr.net',
 'https://cdn.jsdelivr.net',
 'https://cdn.datatables.net']
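The list above contains several repeated hosts. One possible way to remove the repeats while keeping the order of first appearance is sketched below. It runs on a small made-up HTML fragment (`sample_html` is an assumption for illustration, not the real page), so it needs no network access:

```python
import re

# Illustrative HTML fragment, not the content of the real page
sample_html = '''
<link href="https://cdn.jsdelivr.net/npm/bootstrap@4.6.0/dist/css/bootstrap.min.css"/>
<a href="http://www.informatica.uniroma2.it/">Informatica</a>
<script src="https://cdn.jsdelivr.net/npm/jquery"></script>
'''

# Same pattern as above: scheme plus host, stopping at the first '/'
host_pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
found = re.findall(host_pattern, sample_html)

# dict.fromkeys keeps only the first occurrence of each key, in order
unique = list(dict.fromkeys(found))

print(found)
print(unique)
```

Using `dict.fromkeys` rather than `set` preserves the order in which the hosts appear in the page, which makes the result easier to compare with the original list.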