# Regular Expressions: Great results with simple methods


## Save some texts from Wikipedia

Text mining needs texts. For practice we use articles from Wikipedia, which Python makes easy to download. We import the SaveWiki script (available on LearnWeb; it contains just a few handy functions) and save all Wikipedia pages in the category *Infectious diseases* to the folder *infect*.

In [1]:
import SaveWiki
SaveWiki.downloadWikiCat('Infectious diseases','infect')

## Simple string search

In Python we can easily search in a line of text/string. We simply run through all the lines of a file and see whether a certain word occurs in it.

In [2]:
import codecs

file = codecs.open('infect/Quarantine.txt','r','utf8')

for line in file:
    line = line.strip()
    if 'community' in line:
        print(line)
        print('-----')
    
file.close()

The word quarantine comes from quarantena or quarantaine, meaning "forty days", used in the Venetian language in the 14th and 15th centuries and also in France. The word is designated in the period during which all ships were required to be isolated before passengers and crew could go ashore during the Black Death plague. The quarantena followed the trentino, or "thirty-day isolation" period, first imposed in 1347 in the Republic of Ragusa, Dalmatia (modern Dubrovnik in Croatia).Merriam-Webster gives various meanings to the noun form, including "a period of 40 days", several relating to ships, "a state of enforced isolation", and as "a restriction on the movement of people and goods which is intended to prevent the spread of disease or pests". The word is also used as a verb.Quarantine is distinct from medical isolation, in which those confirmed to be infected with a communicable disease are isolated from the healthy population.Quarantine may be used interchangeably with cordon sanitai

We now find all lines that contain the word *community*, but not those that contain only *communities*. We could handle that in plain Python, but then capitalization, for example, would be the next problem. Regular expressions are often a more efficient solution here.
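
A minimal sketch of this limitation, using made-up sample lines rather than the downloaded article: the `in` test is case-sensitive and matches only the exact spelling, whereas a single regular expression covers both capitalizations and both the singular and the plural.

```python
import re

# Made-up sample lines (not taken from the Wikipedia file)
lines = [
    "The community was isolated.",
    "Communities along the coast closed their ports.",
    "No match in this line.",
]

# Plain substring test: case-sensitive, exact spelling only
plain_hits = [line for line in lines if "community" in line]

# Regular expression: upper or lower case C, singular or plural
pattern = r"[Cc]ommunit(y|ies)"
regex_hits = [line for line in lines if re.search(pattern, line)]

print(len(plain_hits))  # 1
print(len(regex_hits))  # 2
```

The character class `[Cc]` is equivalent to the alternation `(C|c)`.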

### Perl Syntax

A number of UNIX programs, such as vi, sed, and grep, have used regular expressions since the 1970s. Many of these features were brought together in the Perl scripting language, so the notation used in all of these programs is often called Perl notation. Python supports this notation as well. An overview of the notation is available on LearnWeb and can be found on many websites.

We now use the *re.search()* function for searching. The first argument is a regular expression, the second is the string in which to search.


In [3]:
import re

file = codecs.open('infect/Quarantine.txt','r','utf8')

for line in file:
    line = line.strip()
    if re.search('(C|c)ommunit(y|ies)',line):
        print(line)
        print('-----')
    
file.close()

The word quarantine comes from quarantena or quarantaine, meaning "forty days", used in the Venetian language in the 14th and 15th centuries and also in France. The word is designated in the period during which all ships were required to be isolated before passengers and crew could go ashore during the Black Death plague. The quarantena followed the trentino, or "thirty-day isolation" period, first imposed in 1347 in the Republic of Ragusa, Dalmatia (modern Dubrovnik in Croatia).Merriam-Webster gives various meanings to the noun form, including "a period of 40 days", several relating to ships, "a state of enforced isolation", and as "a restriction on the movement of people and goods which is intended to prevent the spread of disease or pests". The word is also used as a verb.Quarantine is distinct from medical isolation, in which those confirmed to be infected with a communicable disease are isolated from the healthy population.Quarantine may be used interchangeably with cordon sanitai

We do not yet know which form was found, *community* or *communities*. The match object tells us:

In [4]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

for line in file:
    line = line.strip()
    result = re.search('(C|c)ommunit(y|ies)',line)
    if result:
        print(result.group(0))
    
file.close()

community
community
communities
communities
communities


We can also output the position in the string:

In [5]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr+=1
    line = line.strip()
    result = re.search('(C|c)ommunit(y|ies)',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
        
    
file.close()

7 community 1157 - 1166
20 community 812 - 821
30 communities 40 - 51
38 communities 4258 - 4269
159 communities 122 - 133


So far we only find the first occurrence of the search pattern in each line. To find all occurrences we can use the _findall()_ function, which we look at later. For now, let's focus on the regular expressions themselves.

Note that a * is not a wildcard; it means that the preceding element may be repeated any number of times, including zero. You can use a dot (.) to match any single character. If you want to search for a literal '.', you must write '\\.'. Likewise, if you are looking for a parenthesis, you must precede it with a _backslash_.

Finally, another example, in which we use repetition with a given lower and upper bound:
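
The following sketch illustrates these three points on made-up strings; the variable names are only for illustration.

```python
import re

# 'ab*' means: an 'a' followed by zero or more 'b' characters
star_many = re.search(r"ab*", "abbbc").group(0)
star_zero = re.search(r"ab*", "ac").group(0)
print(star_many, star_zero)   # abbb a

# '.' matches any single character (except a newline by default)
any_char = re.findall(r"c.t", "cat cut c.t")
print(any_char)               # ['cat', 'cut', 'c.t']

# '\.' matches only a literal dot
literal_dot = re.findall(r"c\.t", "cat cut c.t")
print(literal_dot)            # ['c.t']

# '\(' and '\)' match literal parentheses
parens = re.search(r"\(SARS\)", "the syndrome (SARS) spread").group(0)
print(parens)                 # (SARS)
```

Writing the patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.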

In [6]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search('[A-Z]{3,5}',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

2 SARS 396 - 400
38 SARS 3108 - 3112
73 COVID 198 - 203
89 AQIS 116 - 120
119 SARS 535 - 539
120 CDC 72 - 75
122 CDC 164 - 167
123 CDC 54 - 57
126 CDC 8 - 11
132 DGMQ 49 - 53
133 ACRP 534 - 538
143 COVID 2432 - 2437
148 MAF 0 - 3
191 COVID 4 - 9
193 COVID 11 - 16
199 COVID 550 - 555
220 COVID 78 - 83
230 NASA 56 - 60
252 ISBN 140 - 144
253 SARS 33 - 37
254 PMID 158 - 162
258 MRSA 83 - 87
260 SARS 15 - 19
261 PBS 28 - 31
262 PDF 89 - 92


### Exercise 

Try the following regular expressions and try to understand the expressions using the two tables in the slides.

1. '[A-Z]{3}'
2. '[A-Z]{3,}'
3. '\(.*\)'
4. '\([^ ]*\)'
5. '\([^\(\)]*\)'
6. '\(\w*\)'
7. '\d+\. [A-Z][a-zä]+ [12][09][0-9][0-9]'
8. '\w+virus'

#### 1. [A-Z]{3}

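This regular expression matches three consecutive uppercase letters (A to Z). It does not require a whole word: as the output shows, longer abbreviations such as SARS are cut off after their first three letters. Because we use re.search, only the first match in each line is reported.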

In [7]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search('[A-Z]{3}',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

2 SAR 396 - 399
38 SAR 3108 - 3111
73 COV 198 - 201
89 AQI 116 - 119
119 SAR 535 - 538
120 CDC 72 - 75
122 CDC 164 - 167
123 CDC 54 - 57
126 CDC 8 - 11
132 DGM 49 - 52
133 ACR 534 - 537
143 COV 2432 - 2435
148 MAF 0 - 3
191 COV 4 - 7
193 COV 11 - 14
199 COV 550 - 553
220 COV 78 - 81
230 NAS 56 - 59
252 ISB 140 - 143
253 SAR 33 - 36
254 PMI 158 - 161
258 MRS 83 - 86
260 SAR 15 - 18
261 PBS 28 - 31
262 PDF 89 - 92
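
If only complete three-letter words are wanted, word boundaries (`\b`) can be added on both sides. A small sketch on a made-up sentence:

```python
import re

# Made-up sample sentence
sample = "SARS and COVID are monitored by the CDC and the PBS."

# Without boundaries, the first three capitals of a longer run also match
without_boundary = re.findall(r"[A-Z]{3}", sample)
print(without_boundary)   # ['SAR', 'COV', 'CDC', 'PBS']

# With \b on both sides, only complete three-letter words match
with_boundary = re.findall(r"\b[A-Z]{3}\b", sample)
print(with_boundary)      # ['CDC', 'PBS']
```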


#### 2. [A-Z]{3,}

This regular expression matches a run of at least three consecutive uppercase letters (A to Z), with no upper limit. Because the quantifier is greedy, the whole abbreviation is returned.

In [8]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search('[A-Z]{3,}',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

2 SARS 396 - 400
38 SARS 3108 - 3112
73 COVID 198 - 203
89 AQIS 116 - 120
119 SARS 535 - 539
120 CDC 72 - 75
122 CDC 164 - 167
123 CDC 54 - 57
126 CDC 8 - 11
132 DGMQ 49 - 53
133 ACRP 534 - 538
143 COVID 2432 - 2437
148 MAF 0 - 3
191 COVID 4 - 9
193 COVID 11 - 16
199 COVID 550 - 555
220 COVID 78 - 83
230 NASA 56 - 60
252 ISBN 140 - 144
253 SARS 33 - 37
254 PMID 158 - 162
258 MRSA 83 - 87
260 SARS 15 - 19
261 PBS 28 - 31
262 PDF 89 - 92


#### 3. (.*)

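Without backslashes, the parentheses form a group instead of matching literal parentheses. The pattern therefore matches each line completely, up to but not including the trailing newline. With escaped parentheses, as written in the exercise list, it would match text enclosed in parentheses instead.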

In [9]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search('(.*)',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

1 A quarantine is a restriction on the movement of people, animals and goods which is intended to prevent the spread of disease or pests. It is often used in connection to disease and illness, preventing the movement of those who may have been exposed to a communicable disease, yet do not have a confirmed medical diagnosis. It is distinct from medical isolation, in which those confirmed to be infected with a communicable disease are isolated from the healthy population. Quarantine considerations are often one aspect of border control. 0 - 538
2 The concept of quarantine has been known since biblical times, and is known to have been practised through history in various places. Notable quarantines in modern history include the village of Eyam in 1665 during the bubonic plague outbreak in England; East Samoa during the 1918 flu pandemic; the Diphtheria outbreak during the 1925 serum run to Nome, the 1972 Yugoslav smallpox outbreak, the SARS pandemic, the Ebola pandemic and extensive quara

#### 4. ([^ ]*)

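Here, too, the unescaped parentheses only form a group. The pattern matches the longest run of characters other than a space at the start of each line, which is usually the first word. On empty lines it matches just the newline character, because the lines are not stripped.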

In [10]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search('([^ ]*)',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

1 A 0 - 1
2 The 0 - 3
3 Ethical 0 - 7
4 
 0 - 1
5 
 0 - 1
6 == 0 - 2
7 The 0 - 3
8 
 0 - 1
9 
 0 - 1
10 == 0 - 2
11 
 0 - 1
12 
 0 - 1
13 === 0 - 3
14 An 0 - 2
15 
 0 - 1
16 Anyone 0 - 6
17 
 0 - 1
18 
 0 - 1
19 === 0 - 3
20 The 0 - 3
21 
 0 - 1
22 
 0 - 1
23 === 0 - 3
24 The 0 - 3
25 Venice 0 - 6
26 
 0 - 1
27 
 0 - 1
28 === 0 - 3
29 
 0 - 1
30 Epidemics 0 - 9
31 
 0 - 1
32 In 0 - 2
33 
 0 - 1
34 
 0 - 1
35 ==== 0 - 4
36 Since 0 - 5
37 The 0 - 3
38 Sanitary 0 - 8
39 
 0 - 1
40 
 0 - 1
41 == 0 - 2
42 
 0 - 1
43 Plain 0 - 5
44 
 0 - 1
45 
 0 - 1
46 == 0 - 2
47 The 0 - 3
48 
 0 - 1
49 
 0 - 1
50 === 0 - 3
51 Guidance 0 - 8
52 
 0 - 1
53 respond 0 - 7
54 proportionately 0 - 15
55 be 0 - 2
56 be 0 - 2
57 be 0 - 2
58 only 0 - 4
59 
 0 - 1
60 all 0 - 3
61 all 0 - 3
62 all 0 - 3
63 all 0 - 3
64 
 0 - 1
65 infected 0 - 8
66 basic 0 - 5
67 communication 0 - 13
68 constraints 0 - 11
69 patients 0 - 8
70 
 0 - 1
71 
 0 - 1
72 === 0 - 3
73 Quarantine 0 - 10
74 
 0 - 1
75 
 0 - 1
76 === 0 - 3
77 Qu

#### 5. ([^()]*)

With unescaped outer parentheses, this pattern matches the longest run of characters at the start of each line that contains neither ( nor ), including the trailing newline.

In [11]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search('([^()]*)',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

1 A quarantine is a restriction on the movement of people, animals and goods which is intended to prevent the spread of disease or pests. It is often used in connection to disease and illness, preventing the movement of those who may have been exposed to a communicable disease, yet do not have a confirmed medical diagnosis. It is distinct from medical isolation, in which those confirmed to be infected with a communicable disease are isolated from the healthy population. Quarantine considerations are often one aspect of border control.
 0 - 539
2 The concept of quarantine has been known since biblical times, and is known to have been practised through history in various places. Notable quarantines in modern history include the village of Eyam in 1665 during the bubonic plague outbreak in England; East Samoa during the 1918 flu pandemic; the Diphtheria outbreak during the 1925 serum run to Nome, the 1972 Yugoslav smallpox outbreak, the SARS pandemic, the Ebola pandemic and extensive quar

#### 6. (\w*)

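Again the unescaped parentheses only form a group. The pattern matches the run of word characters (letters, digits, underscore) at the start of each line. Because * also allows zero repetitions, lines that start with any other character, such as headings or empty lines, yield an empty match (0 - 0).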

In [12]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search(r'(\w*)',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

1 A 0 - 1
2 The 0 - 3
3 Ethical 0 - 7
4  0 - 0
5  0 - 0
6  0 - 0
7 The 0 - 3
8  0 - 0
9  0 - 0
10  0 - 0
11  0 - 0
12  0 - 0
13  0 - 0
14 An 0 - 2
15  0 - 0
16 Anyone 0 - 6
17  0 - 0
18  0 - 0
19  0 - 0
20 The 0 - 3
21  0 - 0
22  0 - 0
23  0 - 0
24 The 0 - 3
25 Venice 0 - 6
26  0 - 0
27  0 - 0
28  0 - 0
29  0 - 0
30 Epidemics 0 - 9
31  0 - 0
32 In 0 - 2
33  0 - 0
34  0 - 0
35  0 - 0
36 Since 0 - 5
37 The 0 - 3
38 Sanitary 0 - 8
39  0 - 0
40  0 - 0
41  0 - 0
42  0 - 0
43 Plain 0 - 5
44  0 - 0
45  0 - 0
46  0 - 0
47 The 0 - 3
48  0 - 0
49  0 - 0
50  0 - 0
51 Guidance 0 - 8
52  0 - 0
53 respond 0 - 7
54 proportionately 0 - 15
55 be 0 - 2
56 be 0 - 2
57 be 0 - 2
58 only 0 - 4
59  0 - 0
60 all 0 - 3
61 all 0 - 3
62 all 0 - 3
63 all 0 - 3
64  0 - 0
65 infected 0 - 8
66 basic 0 - 5
67 communication 0 - 13
68 constraints 0 - 11
69 patients 0 - 8
70  0 - 0
71  0 - 0
72  0 - 0
73 Quarantine 0 - 10
74  0 - 0
75  0 - 0
76  0 - 0
77 Quarantine 0 - 10
78  0 - 0
79 Civil 0 - 5
80 The 0 - 3
81 New 0 -
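
The cells for expressions 3 to 6 above use unescaped parentheses, while the exercise list writes them with backslashes. As a rough sketch of what the escaped variants match, here they are applied to a made-up sample sentence (not taken from the article):

```python
import re

# Made-up sample sentence with two parenthesised parts
sample = "SARS (2003) and severe cases (see CDC report) were isolated."

# 3. Greedy: from the first '(' to the last ')'
greedy = re.search(r"\(.*\)", sample).group(0)
print(greedy)        # (2003) and severe cases (see CDC report)

# 4. No spaces allowed between the parentheses
no_space = re.findall(r"\([^ ]*\)", sample)
print(no_space)      # ['(2003)']

# 5. No further parentheses allowed inside: each pair separately
no_nested = re.findall(r"\([^\(\)]*\)", sample)
print(no_nested)     # ['(2003)', '(see CDC report)']

# 6. Only word characters allowed between the parentheses
word_only = re.findall(r"\(\w*\)", sample)
print(word_only)     # ['(2003)']
```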

#### 7. \d+. [A-Z][a-zä]+ [12][09][0-9][0-9]

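This regular expression extracts dates of the form day, month name, year. Note that the dot is not escaped in the cell below, so it matches any character: in '17 January 1912' the \d+ matches '1' and the dot matches '7'. With the escaped dot from the exercise list, the pattern would only match dates written like '17. Januar 1912'. The year part [12][09][0-9][0-9] accepts years that start with 10, 19, 20 or 29.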

In [13]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search(r'\d+. [A-Z][a-zä]+ [12][09][0-9][0-9]',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

37 17 January 1912 593 - 608
38 24 June 1922 117 - 129
77 14 February 2003 603 - 619
119 21 March 2017 789 - 802
137 24 September 2015 293 - 310
180 24 July 1969 8 - 20
194 26 March 2020 3 - 16
199 23 January 2020 345 - 360
204 22 February 2020 40 - 56
205 10 March 2020 215 - 228
210 18 March 2020 519 - 532
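
The difference between the unescaped dot used in the cell above and the escaped dot from the exercise list can be seen on two made-up sample strings:

```python
import re

unescaped = r"\d+. [A-Z][a-zä]+ [12][09][0-9][0-9]"   # as used in the cell above
escaped = r"\d+\. [A-Z][a-zä]+ [12][09][0-9][0-9]"    # as written in the exercise list

# Made-up sample strings
english = "isolated on 17 January 1912"
german = "isoliert am 17. Januar 1912"

english_unescaped = re.search(unescaped, english).group(0)
english_escaped = re.search(escaped, english)
german_escaped = re.search(escaped, german).group(0)

print(english_unescaped)  # 17 January 1912
print(english_escaped)    # None
print(german_escaped)     # 17. Januar 1912
```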


#### 8. \w+virus

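This regular expression matches words in which 'virus' is preceded by at least one word character, such as 'coronavirus'. The word 'virus' on its own does not match, and since there is no word boundary at the end, a match may also be the beginning of a longer word such as 'coronaviruses'.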

In [14]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search(r'\w+virus',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

38 coronavirus 3791 - 3802


## Grouping

Parentheses in the pattern form groups. We can output the matching part in the found text for each group. The whole pattern corresponds to group 0, the remaining groups are numbered from left to right. Groups can be nested!

In [15]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

for line in file:
    result = re.search(r'([A-Z]\w+) ([A-Z]\w+)(\.|$| [a-z])',line)
    if result:
        print(result.group(0),'|',result.group(1),'|',result.group(2))
    
file.close()

East Samoa d | East | Samoa
Black Death p | Black | Death
Evil Speech. | Evil | Speech
Medieval Islamic w | Medieval | Islamic
The Islamic p | The | Islamic
Black Death w | Black | Death
North African t | North | African
North America t | North | America
Ottoman Empire a | Ottoman | Empire
Great Britain f | Great | Britain
The Venice c | The | Venice
The Polish g | The | Polish
Riverside Hospital o | Riverside | Hospital
United Nations a | United | Nations
International Institute f | International | Institute
The Lancet i | The | Lancet
Daily News w | Daily | News
But Capt. | But | Capt
Australian Quarantine a | Australian | Quarantine
Services Agency a | Services | Agency
Services Officer. | Services | Officer
Animals Act a | Animals | Act
HK Laws. | HK | Laws
Aviation Department t | Aviation | Department
Medical Service c | Medical | Service
United Kingdom u | United | Kingdom
Quarantine Act w | Quarantine | Act
Executive Orders o | Executive | Orders
CDC Director t | CDC | Director


## More functions

There are three other functions that work with regular expressions:

### Split

Splits a string at each occurrence of the pattern. The result is a list of the parts found.

In [16]:
from pprint import pprint
print(re.split('-','multi-drug-resistant'))
text = 'During the 1918 influenza pandemic, some communities instituted protective sequestration (sometimes referred to as "reverse quarantine") to keep the infected from introducing influenza into healthy populations.'
pprint(re.split(r'[\.,;:]? +',text)) #Notice the space before +!

['multi', 'drug', 'resistant']
['During',
 'the',
 '1918',
 'influenza',
 'pandemic',
 'some',
 'communities',
 'instituted',
 'protective',
 'sequestration',
 '(sometimes',
 'referred',
 'to',
 'as',
 '"reverse',
 'quarantine")',
 'to',
 'keep',
 'the',
 'infected',
 'from',
 'introducing',
 'influenza',
 'into',
 'healthy',
 'populations.']


### Match

Tests whether the string starts with the search pattern.
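
A short sketch of the difference between `re.match` and `re.search`, using a made-up heading line:

```python
import re

# Made-up heading line
line = "=== Quarantine in the United States ==="

# re.match only succeeds if the pattern matches at position 0
starts_with_equals = bool(re.match(r"=+", line))
starts_with_word = bool(re.match(r"Quarantine", line))

# re.search scans the whole string
contains_word = bool(re.search(r"Quarantine", line))

print(starts_with_equals, starts_with_word, contains_word)  # True False True
```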

### Findall

Finds all occurrences and not just the first one. The result is a list of strings if no groups are used. If the pattern contains exactly one group, the list holds the text of that group; with several groups, each entry is a tuple of strings, one string per group.

In the following we use one additional pair of parentheses to access the entire match.

In [17]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    #resultlist = re.findall('[12][09][0-9][0-9]',line)
    resultlist = re.findall(r'((19|20)\d{2})',line)
    if len(resultlist) > 0:
        for result in resultlist:
            print(nr,result[0])
    
file.close()

2 1918
2 1925
2 1972
2 2020
3 2015
30 1901
30 1918
35 1927
37 1903
37 1912
37 1920
37 1926
37 1912
37 1904
37 1914
37 1914
37 1917
37 1921
38 1922
38 1923
38 1922
38 1925
38 1922
38 1923
38 1922
38 1923
38 1923
38 1924
38 1925
38 1926
38 1926
38 1927
38 2007
38 2014
38 1957
38 1968
38 1994
38 2016
38 2019
38 2020
47 1907
51 1984
77 2003
77 2003
89 2015
93 2005
107 2000
119 2014
119 2017
132 2014
133 2008
137 2015
143 1907
143 1918
143 1963
143 1944
143 1963
143 2020
167 1907
167 1910
167 1915
167 1938
168 1907
168 1910
168 1938
171 1918
172 1918
175 1942
175 1990
176 1942
176 1990
179 1969
179 1971
180 1969
180 1971
183 1972
184 1972
187 2014
188 2014
191 2020
194 2020
199 2020
204 2020
205 2020
205 2020
205 2020
205 2020
205 2020
205 2020
210 2020
215 2020
220 2020
248 1911
252 1999
253 2015
254 2000
258 2003
258 1935
259 2020
259 2005
260 2020
260 2005
262 2014
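
How groups change the shape of the `findall` result can be seen on a made-up sample string. The last variant uses a non-capturing group `(?:...)`, one possible way to keep the alternation without the extra pair of parentheses:

```python
import re

# Made-up sample string
text = "Outbreaks in 1918 and 2003"

# No groups: list of strings
plain = re.findall(r"[12][09]\d{2}", text)
print(plain)          # ['1918', '2003']

# Two groups: list of tuples, one entry per group
grouped = re.findall(r"((19|20)\d{2})", text)
print(grouped)        # [('1918', '19'), ('2003', '20')]

# Non-capturing group: alternation without creating a group
non_capturing = re.findall(r"(?:19|20)\d{2}", text)
print(non_capturing)  # ['1918', '2003']
```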


## A small application

Finally, let's build a small application.

We build a KWIC table for viruses. KWIC stands for Keyword in Context and is used to clarify the meaning of a word through the context and to show possible uses of a word.

In [18]:
import glob

filelist = glob.glob("infect/*.txt")
for f in filelist:
    result = re.search(r'.*\\([\w,_\-\'\(\)]+)\.txt',f) # The r prefix (for raw) is needed so that \\ is read as a backslash; the pattern assumes Windows-style paths
    title = result.group(1)
    
    file = codecs.open(f,'r','utf8')
    # Now we search for all viruses
    for line in file:
        start = 0
        line = line.strip()
        resultlist = re.findall(r'([\w-]*[Vv]irus(es)?)\b',line)
        if len(resultlist) > 0:
            for result in resultlist:
                virus = result[0]
                #now we need to find the position of the result in the line
                position = re.search(r'\b'+virus+r'\b',line[start:])
                start = start + position.start()
                end = start + position.end()
                left_context = ' '*max(0,20-start)  + line[max(0,start-20):start]
                right_context = line[end:end+20] 
                virus = virus + max(0,18-len(virus))*' '
                print(left_context,virus,right_context, '('+ title +')', sep = '\t')
                start += 1
    file.close()

ds on the strain of 	virus             		(ACAM2000)
ed from the Vaccina 	virus             	M2000 vaccine cannot	(ACAM2000)
ontain the smallpox 	virus             	d, is not dead like 	(ACAM2000)
nes containing live 	viruses           	io and chickenpox.Th	(ACAM2000)
kenpox.The vaccinia 	virus             	ed via a typical sho	(ACAM2000)
r arm. The vaccinia 	virus             	ird week, leaving a 	(ACAM2000)
ncing symptoms, the 	virus             	her the host is show	(Asymptomatic_carrier)
   === Epstein–Barr 	virus             		(Asymptomatic_carrier)
ted with persistent 	viruses           	 of the herpes virus	(Asymptomatic_carrier)
uch as Epstein–Barr 	virus             	es virus family. Stu	(Asymptomatic_carrier)
ember of the herpes 	virus             	% of adults have ant	(Asymptomatic_carrier)
e infected with the 	virus             		(Asymptomatic_carrier)
e to produce active 	virus             	 virus unintentional	(Asymptomatic_carrier)
s of the attenuated 	virus             	

AttributeError: 'NoneType' object has no attribute 'start'
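
The traceback above occurs because `re.search` returns `None` whenever the pattern built from the found string does not match the rest of the line. One way this can happen is a hit that begins with a hyphen after a space, where the leading `\b` cannot match. In addition, `end` is computed from the already updated `start`, so the right context is shifted. A minimal sketch of an alternative, assuming `findall` plus the second search may be replaced by `re.finditer`, whose match objects carry the positions directly (the function name `kwic_rows` and the sample line are made up for illustration):

```python
import re

def kwic_rows(line, title, width=20):
    """Return (left context, keyword, right context, title) for each virus word."""
    rows = []
    for match in re.finditer(r"[\w-]*[Vv]irus(?:es)?\b", line):
        start, end = match.start(), match.end()
        # Pad the left context so that all keywords line up in one column
        left = line[max(0, start - width):start].rjust(width)
        right = line[end:end + width]
        rows.append((left, match.group(0).ljust(18), right, title))
    return rows

# Made-up sample line
sample = "The vaccinia virus is related to the smallpox virus."
rows = kwic_rows(sample, "Sample")
for row in rows:
    print(*row, sep="\t")
```

Because every position comes from the match object itself, no second search is needed and there is no `None` case to handle.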

## Another application

We could also count how often each virus is mentioned:


In [19]:
from collections import Counter

viruscount = Counter()

filelist = glob.glob("infect/*.txt")
for f in filelist:
    file = codecs.open(f,'r','utf8')

    for line in file:
        line = line.strip()
        resultlist = re.findall(r'(([A-Z][a-z]+( |-)){,2}[\w-]*[Vv]irus(es)?)\b',line)
        
        if len(resultlist) > 0:
            for result in resultlist:
                virus = result[0]
                viruscount.update([virus])
    file.close()

viruscount.most_common()

[('virus', 482),
 ('viruses', 347),
 ('Viruses', 58),
 ('coronavirus', 41),
 ('The virus', 25),
 ('Some viruses', 13),
 ('rhinovirus', 11),
 ('rotavirus', 11),
 ('Barr virus', 10),
 ('Ebola virus', 10),
 ('lyssavirus', 10),
 ('West Nile virus', 9),
 ('poliovirus', 9),
 ('coronaviruses', 9),
 ('adenovirus', 9),
 ('herpesvirus', 9),
 ('Virus', 8),
 ('Epstein-Barr virus', 8),
 ('Coronavirus', 6),
 ('Many viruses', 6),
 ('Cytomegalovirus', 6),
 ('Norovirus', 6),
 ('These viruses', 6),
 ('metapneumovirus', 5),
 ('hantavirus', 5),
 ('norovirus', 5),
 ('adenoviruses', 5),
 ('papillomavirus', 5),
 ('herpesviruses', 5),
 ('polyomavirus', 5),
 ('Nipah virus', 4),
 ('Rotavirus', 4),
 ('cytomegalovirus', 4),
 ('retrovirus', 4),
 ('Adenovirus', 4),
 ('Plant viruses', 4),
 ('The viruses', 3),
 ('Poliovirus', 3),
 ('African Ebola virus', 3),
 ('Emerging Viruses', 3),
 ('Marburg virus', 3),
 ('Coronaviruses', 3),
 ('flavivirus', 3),
 ('Rhinovirus', 3),
 ('Rabies virus', 3),
 ('Herpesviruses', 3),
 ('p

In [20]:
len(viruscount)

180

In [21]:
sum(viruscount.values())

1374
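
The counts above list 'virus' and 'Virus', or 'coronavirus' and 'Coronavirus', as separate entries. If these should be merged, one option is to lowercase each hit before counting. A sketch on a made-up sample text, with a simplified pattern:

```python
from collections import Counter
import re

# Made-up sample text, not taken from the downloaded articles
text = "Coronavirus and rotavirus differ. The coronavirus spreads; Rotavirus too."

# Lowercase every hit so that capitalised variants fall into the same bucket
counts = Counter(hit.lower() for hit in re.findall(r"\b\w+virus\b", text))
print(counts.most_common())
```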

# Exercises

* Find a list of all diseases ending with -itis



In [22]:
file = codecs.open('infect/Quarantine.txt','r','utf8')

nr = 0
for line in file:
    nr = nr+1
    result = re.search(r'\w+itis',line)
    if result:
        print(nr,result.group(0),result.start(),'-',result.end())
    
file.close()

36 Britis 637 - 643
110 Britis 5 - 11
111 Britis 4444 - 4450
176 Britis 30 - 36
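
The hits above are the beginning of the word 'British', because the pattern does not require the word to end after 'itis'. A sketch with word boundaries on a made-up sample sentence; collecting the hits in a set removes duplicates:

```python
import re

# Made-up sample sentence
text = "British doctors treated hepatitis, meningitis and Hepatitis B."

# \b after 'itis' rejects 'Britis' inside 'British'
diseases = sorted({hit.lower() for hit in re.findall(r"\b\w+itis\b", text)})
print(diseases)  # ['hepatitis', 'meningitis']
```

The same pattern could be applied to every file returned by `glob.glob("infect/*.txt")` to collect a list for the whole category.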


### Exercise Hearst Patterns

In [23]:
import glob

regex = [r'\w+ such as (the)? \w+ ((and | or) \w+)?',
         r'\w+,? especially \w+ ((and | or)\w+)?',
         r'\w+,? including \w+ ((and | or)\w+)?',
         r'(\w+(,?))+ and other \w+',
         r'(\w+(,?))+ or other \w+']

filelist = glob.glob("infect/*.txt")
for rg in regex:
    print('\nTaking regular expression:')
    print(rg + '\n')
    count = 0
    for f in filelist:
        file = codecs.open(f,'r','utf8')
        nr = 0
        for line in file:
            nr = nr+1
            result = re.search(rg,line)
            if result:
                count += 1
                print(nr,result.group(0),result.start(),'-',result.end())
        file.close()
    print('\nTotal supporting examples for pattern : {} are {}'.format(rg,count))


Taking regular expression:
\w+ such as (the)? \w+ ((and | or) \w+)?

10 Organizations such as the American  839 - 874
38 resistance such as the potential  940 - 973
129 rules such as the pneumonia  67 - 95
1 terms such as the infective  547 - 575
1 criteria such as the Bradford  714 - 744
18 organisms such as the African  908 - 938
24 animals such as the West  1132 - 1157
43 problems such as the growing  239 - 268
155 countries such as the US  598 - 623
22 responses such as the SOS  382 - 408
35 elements such as the Little  228 - 256
125 phase such as the Rehabilitation  864 - 897
164 wards such as the Heffron  292 - 318
24 sites such as the Lazzarettos  1012 - 1042
81 vehicles such as the ambulance  67 - 98
142 landmarks such as the Columbia  401 - 432
15 strains such as the highly  300 - 327
5 people such as the elderly  650 - 677
54 profiles such as the Th3  763 - 788
57 others such as the National  986 - 1014
155 Cells such as the macrophage  372 - 401
2 pandemics such as the 1918
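
In the first pattern above, the optional `(the)?` sits between two mandatory spaces, so without 'the' two consecutive spaces would be required. This is why every hit listed contains 'the'. The space before the optional last group is mandatory as well, which explains the trailing space in each hit. A sketch of a variant in which each optional part carries its own space, tried on a made-up sentence (the group layout is an assumption made for illustration):

```python
import re

# Optional parts carry their own space, so no double space is required
such_as = r"(\w+) such as (?:the )?(\w+)(?: (?:and|or) (\w+))?"

# Made-up sample sentence
text = "Diseases such as measles and mumps spread among animals such as the bat."

pairs = [match.groups() for match in re.finditer(such_as, text)]
for hypernym, *hyponyms in pairs:
    print(hypernym, "->", [word for word in hyponyms if word])
```

Group 1 holds the general term and groups 2 and 3 the examples, with `None` for a missing second example.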