#Regular expressions

On many occasions we need to search for substrings that do not always have the same content, but follow a pattern, such as a phone number. In order to search for this type of substring we need to use regular expressions.

A regular expression is a sequence of characters, some literal and some with a special meaning, that defines a text pattern. Using regular expressions, we can search for these patterns in text strings.

To apply regular expressions to strings, we'll use Python's re module. Its main functions are:

search(): searches for the first occurrence of the pattern anywhere in the string. It returns a match object, from which we can obtain the positions where the pattern was found (start() and end()). If the pattern does not occur in the string, it returns None.

match(): looks for the pattern only at the beginning of the string. If the pattern is not found at the beginning of the string, it returns None.

split(): splits a string wherever the pattern occurs and returns a list with the resulting pieces.

sub(): replaces every occurrence of the pattern with another string that we pass as a parameter, and returns the new string.

findall(): finds all occurrences of the pattern in a string and returns a list with all the substrings that match it.
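Because search() and match() return None when nothing is found, calling start() on the result without checking it first raises an AttributeError. The following is a minimal sketch of that check; the sentence and the variable names are made up for illustration.

```python
import re

sentence = "The course covers Python."

# search() scans the whole string; match() only looks at the beginning.
found = re.search("Python", sentence)
at_start = re.match("Python", sentence)

if found is not None:
    print("search:", found.start(), found.end())
if at_start is None:
    print("match: 'Python' is not at the beginning of the string")
```

Testing the result against None before using it keeps the code safe for inputs where the pattern is absent.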

In [None]:
import re

# Sample sentence (Spanish for "This is a test message for the Python course.")
mensaje = "Esto es un mensaje de prueba para el curso de Python."

# search(): first occurrence anywhere in the string
match = re.search('curso', mensaje)
print("Start:", match.start(), "End:", match.end())

# match(): only at the beginning of the string
match = re.match('Esto', mensaje)
print("Start:", match.start(), "End:", match.end())

# split(): split on every space
print(re.split(' ', mensaje))

# sub(): replace 'Python' with 'Java'
print(re.sub('Python', 'Java', mensaje))

# findall(): every occurrence of 'de'
print(re.findall('de', mensaje))

Start: 37 End: 42
Start: 0 End: 4
['Esto', 'es', 'un', 'mensaje', 'de', 'prueba', 'para', 'el', 'curso', 'de', 'Python.']
Esto es un mensaje de prueba para el curso de Java.
['de', 'de']


We will use a longer text to test the next examples:

In [None]:
text = "In literary theory, a text is any object that can be 'read', whether this object is a work of literature, a street sign, an arrangement of buildings on a city block, or styles of clothing. It is a coherent set of signs that transmits some kind of informative message. This set of signs is considered in terms of the informative message's content, rather than in terms of its physical form or the medium in which it is represented. \n Within the field of literary criticism, 'text' also refers to the original information content of a particular piece of writing; that is, the 'text' of a work is that primal symbolic arrangement of letters as originally composed, apart from later alterations, deterioration, commentary, translations, paratext, etc. Therefore, when literary criticism is concerned with the determination of a 'text', it is concerned with the distinguishing of the original information content from whatever has been added to or subtracted from that content as it appears in a given textual document (that is, a physical representation of text). Since the history of writing predates the concept of the 'text', most texts were not written with this concept in mind. Most written works fall within a narrow range of the types described by text theory. \n The concept of 'text' becomes relevant if and when a 'coherent written message is completed and needs to be referred to independently of the circumstances in which it was created.'"
print(text)

In literary theory, a text is any object that can be 'read', whether this object is a work of literature, a street sign, an arrangement of buildings on a city block, or styles of clothing. It is a coherent set of signs that transmits some kind of informative message. This set of signs is considered in terms of the informative message's content, rather than in terms of its physical form or the medium in which it is represented. 
 Within the field of literary criticism, 'text' also refers to the original information content of a particular piece of writing; that is, the 'text' of a work is that primal symbolic arrangement of letters as originally composed, apart from later alterations, deterioration, commentary, translations, paratext, etc. Therefore, when literary criticism is concerned with the determination of a 'text', it is concerned with the distinguishing of the original information content from whatever has been added to or subtracted from that content as it appears in a given te

Next, we need to know how to build more complex patterns, which let us search for substrings that differ in content but share a common pattern. The building blocks are the following.

Literals: these are elements that contain only ordinary characters and match themselves exactly. They are the patterns we used in the previous examples.

Escape characters: these are used to represent special characters within a text string, such as line breaks. They are also needed to search for characters that have their own meaning within a regular expression, for example the asterisk (*). Escape sequences begin with a backslash (\). Some of the most important are:

• \n: line break.

• \t: tab.

• \\: backslash.

• \d: a digit.

• \w: an alphanumeric character or an underscore.

• \s: a whitespace character (space, tab or line break).

• \D: a character that is not a digit.

• \W: a character that is neither alphanumeric nor an underscore.

To use them, we just have to include them inside a text string that we will use in the corresponding function of the re module. For example, if we want to replace all line breaks with the text “**HERE**” in the text, we will apply the corresponding escape character to use in the sub function:

In [None]:
# The raw string (r'...') passes the backslash to the re module unchanged.
print(re.sub(r'\n', '**HERE**', text))
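The other escape sequences from the list work the same way. The sketch below applies \d, \s and \W to a short made-up string (the sample text is an assumption, not part of the course material). The patterns are written as raw strings so that Python leaves the backslashes for the re module to interpret.

```python
import re

# Made-up sample containing a tab (\t) and a line break (\n).
sample = "Room 42\tis on floor 3.\nCall ext 105."

print(re.findall(r"\d", sample))    # every single digit
print(re.findall(r"\d+", sample))   # runs of consecutive digits
print(re.split(r"\s+", sample))     # split on any whitespace: space, tab, line break
print(re.sub(r"\W", "_", "a-b c"))  # replace each non-alphanumeric character
```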

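Character groups: enclosing part of a pattern in parentheses creates a group. Groups allow us not only to find a pattern in the text, but also to capture the part of the match inside the parentheses so that we can process it later. When the pattern contains a group, findall() returns only the captured text, as the second pattern in the next cell shows.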

In [None]:
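# 'se' followed by more word characters, then two more words
pattern = r'se\w+\s\w+\s\w+'
print(re.findall(pattern, text))

# Same beginning, but only the word captured by the parentheses is returned
pattern = r'se\w+\s(\w+)'
print(re.findall(pattern, text))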

['set of signs', 'set of signs', 'sentation of text']
['of', 'of', 'of']
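Groups are also available on a match object through group(). As a sketch, the example below returns to the phone-number case mentioned in the introduction. The contact line and the number format are assumptions chosen for illustration.

```python
import re

# Hypothetical contact line; the 3-digit/4-digit format is an assumption.
contact = "Office: 555-1234, mobile: 555-9876"

# Two groups: the three-digit prefix and the four-digit line number.
pattern = r"(\d\d\d)-(\d\d\d\d)"

first = re.search(pattern, contact)
print(first.group(0))  # the whole match
print(first.group(1))  # text captured by the first group
print(first.group(2))  # text captured by the second group

# With several groups, findall() returns one tuple per match.
print(re.findall(pattern, contact))
```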


Metacharacters: Metacharacters are characters that have a special meaning within regular expressions. These metacharacters allow you to search for repetitions of patterns, types of characters, etc. Next, we will describe some of the most common metacharacters:

• |: separates the different alternatives that we are looking for within a text.

• ?: the element that precedes it appears once or not at all.

• +: the element that precedes it appears one or more times.

• *: the element that precedes it appears zero or more times.

• {n}: the element that precedes it appears exactly n times.

• {n,m}: the element that precedes it appears between n and m times. If n is omitted, the element appears from 0 to m times. If m is omitted, the element appears n or more times.

• []: represents a class of characters, that is, it matches any single one of the characters listed inside the brackets.

• -: inside brackets, defines a range of characters, for example [a-z] or [0-9].

In [None]:
pattern = r'is|be|with|the'  # matches is, be, with or the, even inside longer words
print(re.sub(pattern, '*HERE**', text))

pattern = r'texts?'  # the word text, singular or plural
print(re.findall(pattern, text))

ejemplo = '1 22 333 4444 33 3 34'
pattern = r'3+'  # the digit 3 one or more times in a row
print(re.findall(pattern, ejemplo))  # ['333', '33', '3', '3']

ejemplo = '12233311444425211221112'
pattern = r'(1*2)'  # the digit 1 zero or more times, followed by the digit 2
print(re.findall(pattern, ejemplo))  # ['12', '2', '2', '2', '112', '2', '1112']

ejemplo = '1 22 333 4444 1234 45 a45 324235'
pattern = r'\d{3}'  # exactly 3 digits in a row; longer runs are consumed 3 at a time
print(re.findall(pattern, ejemplo))  # ['333', '444', '123', '324', '235']

ejemplo = '1 22 333 4444 542524545'
pattern = r'\d{2,3}'  # 2 or 3 digits in a row
print(re.findall(pattern, ejemplo))  # ['22', '333', '444', '542', '524', '545']

ejemplo = '1 22 333 4444 233 323 3345 234322'
pattern = r'[23]{2,3}'  # the digits 2 or 3, repeated 2 or 3 times (no comma inside the brackets)
print(re.findall(pattern, ejemplo))  # ['22', '333', '233', '323', '33', '23', '322']

ejemplo = '1 22 333 4444 12345 123 13345'
pattern = r'[1-3]{2,3}'  # digits between 1 and 3, repeated 2 or 3 times
print(re.findall(pattern, ejemplo))  # ['22', '333', '123', '123', '133']
 


In literary *HERE**ory, a text *HERE** any object that can *HERE** 'read', whe*HERE**r th*HERE** object *HERE** a work of literature, a street sign, an arrangement of buildings on a city block, or styles of clothing. It *HERE** a coherent set of signs that transmits some kind of informative message. Th*HERE** set of signs *HERE** considered in terms of *HERE** informative message's content, ra*HERE**r than in terms of its physical form or *HERE** medium in which it *HERE** represented. 
 Within *HERE** field of literary critic*HERE**m, 'text' also refers to *HERE** original information content of a particular piece of writing; that *HERE**, *HERE** 'text' of a work *HERE** that primal symbolic arrangement of letters as originally composed, apart from later alterations, deterioration, commentary, translations, paratext, etc. Therefore, when literary critic*HERE**m *HERE** concerned *HERE** *HERE** determination of a 'text', it *HERE** concerned *HERE** *HERE** d*HERE**tingu*HERE**hing o
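The output above shows that the alternatives are also replaced inside longer words: "theory" becomes "*HERE**ory". One common way to restrict a pattern to whole words is the \b word-boundary anchor. It is not covered in the list above, so treat the following as a small sketch on a shortened sentence.

```python
import re

sentence = "In literary theory, a text is any object that can be read"

# Without boundaries, 'the' also matches inside 'theory'.
print(re.findall(r"is|be|with|the", sentence))

# \b marks a word boundary; (?:...) groups the alternatives without capturing,
# so findall() still returns the whole match.
print(re.findall(r"\b(?:is|be|with|the)\b", sentence))
```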

#Wikipedia example

In [None]:
!pip install wikipedia
import wikipedia
p = wikipedia.page("Python programming language")
print(p.url)
print(p.title)
content = p.content # Content of page.
print(content)

https://en.wikipedia.org/wiki/Python_(programming_language)
Python (programming language)
Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, 
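The downloaded content is an ordinary string, so the re functions from the previous section apply to it directly. To keep the sketch runnable without a network connection, it uses a short excerpt as a stand-in for p.content. The excerpt is copied from the output above and the variable name is illustrative.

```python
import re

# Stand-in for p.content: opening sentences taken from the output above.
excerpt = ("Python is a high-level, interpreted, general-purpose programming "
           "language. Python is dynamically-typed and garbage-collected.")

# Hyphenated words: word characters, a hyphen, more word characters.
print(re.findall(r"\w+-\w+", excerpt))

# How many times the word 'Python' appears.
print(len(re.findall("Python", excerpt)))
```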

#Numpy 
NumPy is a package that we can install in Python, aimed at scientific computing. It provides new data structures, namely one-dimensional and multidimensional arrays, and includes very powerful methods for working with them.

This library is the basis for many of the scientific and data analysis libraries that exist in Python. 

#Arrays in numpy 

The array is the most basic structure in NumPy. It is a sequence of values, each of which is assigned a position (an index, starting at 0).

In [None]:
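import numpy as np
myarray = np.array([*range(1, 6)])  # the values 1 to 5
print(myarray[2])                   # third element (indices start at 0)
# Mixed types are converted to one common type (here, strings).
myarray = np.array([1, 2.0, 'Hola', True])
# Nested lists of equal length produce a two-dimensional array (a matrix).
mymatrix = np.array([[1, 2, 3], [4, 5, 6]])
# Rows of different lengths do not form a matrix; recent NumPy versions
# raise a ValueError unless dtype=object is requested explicitly.
mymatrix = np.array([[1, 2, 3], [4, 5]], dtype=object)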



3
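An array's structure can be inspected through its attributes. The sketch below recreates arrays like the ones in the cell above and prints their number of dimensions, shape and element type. The variable name mixed is introduced only for this example.

```python
import numpy as np

myarray = np.array([1, 2, 3, 4, 5])
mymatrix = np.array([[1, 2, 3], [4, 5, 6]])

print(myarray.ndim, myarray.shape)    # number of dimensions, size along each one
print(mymatrix.ndim, mymatrix.shape)
print(mymatrix[1, 2])                 # row 1, column 2 (both counted from 0)

# Mixed input is converted to one common type; here every value becomes a string.
mixed = np.array([1, 2.0, 'Hola', True])
print(mixed.dtype.kind)               # 'U' stands for Unicode string
```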


  


Arrays are very similar to Python lists, but they store their values more compactly, and many calculations over all the values in an array run much faster than the equivalent loop over a Python list.

In [None]:

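import sys
mylist = [*range(1000)]
myarray = np.array([*range(1000)])

# Memory used by the list's elements: each Python int is a full object.
listsize = 0
for k in mylist:
    listsize += sys.getsizeof(k)
print(listsize)

# Memory used by the array's elements: each one is a fixed-size number.
arraysize = 0
for k in myarray:
    arraysize += k.itemsize
print(arraysize)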

In [None]:
import time
list1 = [*range(1000000)]
list2 = [*range(1000000)]
array1 = np.array([*range(1000000)])
array2 = np.array([*range(1000000)])

# Element-wise subtraction with a list comprehension
comienzo = time.time()
resultado = [x - y for x, y in zip(list1, list2)]
final = time.time()
print('Time: ', final - comienzo)

# The same subtraction as a single array operation
comienzo = time.time()
resultado = array1 - array2
final = time.time()
print('Time: ', final - comienzo)




Time:  0.1020822525024414
Time:  0.005115509033203125


#Array operations (element-wise)

In [None]:
array1=np.array([ 2, 3, 5])
array2=np.array([ 2, 4, 10])

print(np.subtract(array1,array2))
print(array1-array2)

print(np.add(array1,array2))
print(array1+array2)

print(np.multiply(array1,array2))
print(array1*array2)

print(np.divide(array1,array2))
print(array1/array2)

print(np.power(array1,array2))
print(array1**array2)

print(np.power(array1,2))
print(array1**2)

print(np.sqrt(array1))

print(np.square(array1))


print(np.gcd(array1,array2))

print(np.lcm(array1,array2))


[ 0 -1 -5]
[ 0 -1 -5]
[ 4  7 15]
[ 4  7 15]
[ 4 12 50]
[ 4 12 50]
[1.   0.75 0.5 ]
[1.   0.75 0.5 ]
[      4      81 9765625]
[      4      81 9765625]
[ 4  9 25]
[ 4  9 25]
[1.41421356 1.73205081 2.23606798]
[ 4  9 25]
[2 1 5]
[ 2 12 10]


#Array comparisons (element-wise)

In [None]:
print(np.greater(array1,array2))
print(np.greater_equal(array1,array2))
print(np.equal(array1,array2))
print(np.less(array1,array2))
print(np.less_equal(array1,array2))
print(np.not_equal(array1,array2))




[False False False]
[ True False False]
[ True False False]
[False  True  True]
[ True  True  True]
[False  True  True]


#Array logical operations (element-wise)

In [None]:
array1=np.array([True, False, True])
array2=np.array([False, False, True])
print(np.logical_and(array1,array2))
print(np.logical_or(array1,array2))
print(np.logical_xor(array1,array2))
print(np.logical_not(array1))



[False False  True]
[ True False  True]
[ True False False]
[False  True False]
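Comparisons and logical operations are often combined: each comparison produces a boolean array, and the logical function merges the two position by position. The sketch below also uses the result as an index to keep only the selected values, a step this section does not cover. The array values are made up.

```python
import numpy as np

values = np.array([2, 3, 5, 8, 13])

# True where the value is greater than 2 AND less than 10.
in_range = np.logical_and(np.greater(values, 2), np.less(values, 10))
print(in_range)

# A boolean array used as an index keeps only the positions that are True.
print(values[in_range])
```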

#Array statistics

In [None]:
array1=np.array([*range(101)])
print(np.amin(array1))
print(np.amax(array1))
print(np.percentile(array1,50))
print(np.median(array1))
print(np.mean(array1))
print(np.std(array1))
print(np.var(array1))






0
100
50.0
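Some of these statistics are related to each other, which gives a quick way to sanity-check results. The sketch below uses the same array of integers from 0 to 100. The ddof parameter in the last line is an addition not used in the cell above.

```python
import numpy as np

array1 = np.array([*range(101)])  # the integers 0 to 100

# The 50th percentile is the median.
print(np.percentile(array1, 50) == np.median(array1))

# The standard deviation is the square root of the variance.
print(np.isclose(np.std(array1), np.sqrt(np.var(array1))))

# np.std and np.var divide by N by default; ddof=1 divides by N - 1 instead.
print(np.var(array1), np.var(array1, ddof=1))
```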
