# Regex exercise

This notebook will showcase various Regex exercises.

## Places to live
In this exercise we will answer questions about the hometowns of our target population.

1. How many people live in Amsterdam?
2. What are the street names of people living in postal code number 7311? 
3. Which cities have streets containing the name 'Willem'? Give the result in alphabetical order.

To start with question one: according to www.postcode-adresboek.nl, the postal codes of Amsterdam run from 1011 to 1109. We need to find these codes in the text file in combination with the word 'Amsterdam'. We're looking for four digits, then one or no space, then two letters, followed by a space and then 'Amsterdam' (possibly with additions, which don't matter). The first digit is a 1 and the second is a 0 or a 1, so the pattern is slightly broader than the official range: it accepts 1000 to 1199.
1: the first digit is 1
[01]: the second digit is 0 or 1 (not [0|1]: inside square brackets | is a literal character, so that class would also accept a '|')
\d{2}: followed by any two digits
\s?: then one or zero whitespace characters
[A-Z]{2}: then two capitals between A and Z
\sAmsterdam: then a whitespace character and the word Amsterdam

I will also write a function that checks a string against a pattern and, if it matches, adds the string to a list, because I may have to repeat this action several times in this notebook. I import the re module, open the file, and go through it line by line. Every line that matches the pattern is added to the list. The length of the list is then the number of people living in Amsterdam.

In [1]:
import re

# Postal code starting with 10 or 11, followed by the city name.
pattern_ams = r"1[01]\d{2}\s?[A-Z]{2}\sAmsterdam"

def search_request(pattern, target, tally):
    # Append target to tally when the pattern occurs anywhere in it.
    if re.search(pattern, target):
        tally.append(target)
    return pattern, target, tally

AmsterdamCitizens = []
with open('people.txt', errors="ignore") as file:
    data = file.read().split("\n")
    for line in data:
        search_request(pattern_ams, line, AmsterdamCitizens)

print(AmsterdamCitizens[0:5])
print(f'There are {len(AmsterdamCitizens)} citizens of Amsterdam in the file.')

['1056 WL Amsterdam', '1075 GC Amsterdam', '1011 GH Amsterdam', '1013 XB Amsterdam', '1077 CR Amsterdam']
There are 213 citizens of Amsterdam in the file.
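
To check what the Amsterdam pattern accepts and rejects, here is a small sketch that runs it on a few lines. The sample lines are made up for illustration and are not taken from people.txt. The character class is written as [01]. The last two lines show why that matters: inside square brackets, | is a literal pipe and not an "or".

```python
import re

# The pattern from question one, with the character class written as [01].
pattern_ams = r"1[01]\d{2}\s?[A-Z]{2}\sAmsterdam"

# Made-up sample lines (assumption: not taken from people.txt).
samples = [
    "1056 WL Amsterdam",        # space between digits and letters
    "1075GC Amsterdam",         # no space
    "1011 GH Amsterdam-Noord",  # addition after the city name
    "7311 AB Apeldoorn",        # different city
    "1234 AB Amsterdam",        # second digit is not 0 or 1
]

matches = [s for s in samples if re.search(pattern_ams, s)]
print(matches)

# Inside a character class, | is a literal pipe, not "or".
pipe_in_wrong_class = re.search(r"[0|1]", "|") is not None
pipe_in_right_class = re.search(r"[01]", "|") is not None
print(pipe_in_wrong_class, pipe_in_right_class)
```

Only the first three sample lines should end up in the list.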


Next, I look for the street names and house numbers of the people in postal code 7311, which needs a new pattern. We know that we have to find 7311. The street name and the house number each get a named group, written (?P<groupname>...), so that we can easily retrieve them later. The street name can consist of anything (".*") and the house number of any number of digits ("\d*"), separated by a whitespace character (\s). The postal code 7311 must follow on the next line (\n7311). In the cell above I split the data on newlines so that the results would look better. I have to undo that here: this pattern spans two lines, so it cannot match if the file is processed line by line.

In [2]:
streetPattern = re.compile(r"(?P<streetname>.*)\s(?P<streetnumber>\d*)\n7311")
with open('people.txt', errors="ignore") as file:
    unsplit = file.read()
    postcodeMatches = streetPattern.findall(unsplit)
    for address in postcodeMatches: 
        print(address)


('Schotweg', '153')
('Torenstraat', '200')
('Nieuwstraat', '34')
('Kortebeinstraat', '58')
('Librije', '31')
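
The named groups can also be read back by name. The sketch below runs the same pattern on a short made-up string. The record layout (name, address line, then postal code and city) is an assumption for illustration, not a description of people.txt.

```python
import re

streetPattern = re.compile(r"(?P<streetname>.*)\s(?P<streetnumber>\d*)\n7311")

# Made-up records in an assumed layout: name, address line, postal code and city.
sample = (
    "Jan Jansen\n"
    "Schotweg 153\n"
    "7311 AB Apeldoorn\n"
    "Piet Pietersen\n"
    "Korte Dijk 12\n"
    "1056 WL Amsterdam\n"
)

# findall returns one tuple per match, holding the groups in order.
tuples = streetPattern.findall(sample)
print(tuples)

# finditer yields match objects, so each group can be read by its name.
addresses = [
    (m.group("streetname"), m.group("streetnumber"))
    for m in streetPattern.finditer(sample)
]
print(addresses)
```

Both approaches give the same pairs. finditer is mainly useful when the code should refer to 'streetname' and 'streetnumber' explicitly instead of to positions in a tuple.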


Finally, I look for all places that have a street containing 'Willem'. The pattern gets a group for the place name. It looks for a line of arbitrary characters that contains 'Willem' and ends in a whitespace character followed by one or more digits (the house number). The next line must start with four digits and two capitals (the postal code). This way the pattern only finds street names and not, for example, first or last names. After the postal code comes the group 'plaatsnaam' (place name), which starts with one or more letters and then captures everything up to the end of the line, so city names of more than one word are stored as well.
All these city names are saved in a list. Then I go through the list and keep only the unique city names. This returns 36 cities.

In [3]:
Willem = re.compile(r".*Willem.*\s\d+\n\d{4}\s*[A-Z]{2}\s(?P<plaatsnaam>[A-Za-z]+.*)")
WillemMatches = Willem.findall(unsplit)
answer = []
for item in WillemMatches:
    if item not in answer: 
        answer.append(item)
print(f"There are {len(answer)} cities with a street containing 'Willem'. The cities are:")
print(sorted(answer))


There are 36 cities with a street containing 'Willem'. The cities are:
['Amstelveen', 'Bathmen', 'Berg en Dal', 'Best', 'Borculo', 'Buren', 'Bussum', 'Den Helder', 'Deventer', 'Dordrecht', 'Dreumel', 'Edam', 'Eibergen', 'Eindhoven', 'Enschede', 'Groningen', 'Helden', 'IJsselstein', 'Katwijk', 'Leeuwarden', 'Leiden', 'Monnickendam', 'Nieuwegein', 'Oegstgeest', 'Oud Gastel', 'Rotterdam', 'Schijndel', 'Sittard', 'Slagharen', 'Sleeuwijk', 'Sneek', 'Strijensas', 'Ugchelen', 'Uithoorn', 'Vlaardingen', 'Voorburg']
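
As a compact alternative to the loop that filters duplicates, a set combined with sorted() gives the same unique, alphabetical list. The sketch below uses made-up records in an assumed layout, including a person named Willem. That person's city should not appear in the result, because no house number follows the name.

```python
import re

Willem = re.compile(r".*Willem.*\s\d+\n\d{4}\s*[A-Z]{2}\s(?P<plaatsnaam>[A-Za-z]+.*)")

# Made-up records in an assumed layout: name or address line, then postal code and city.
sample = (
    "Willem de Vries\n"
    "Dorpsstraat 4\n"
    "1011 AB Amsterdam\n"
    "Prins Willemstraat 10\n"
    "6571 CD Berg en Dal\n"
    "Willem de Zwijgerlaan 7\n"
    "2311 EF Leiden\n"
    "Koning Willem I-laan 3\n"
    "2311 GH Leiden\n"
)

# With a single group in the pattern, findall returns that group's text per match.
found = Willem.findall(sample)

# A set drops the duplicates; sorted() turns it into an alphabetical list.
cities = sorted(set(found))
print(cities)
```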


## Phone numbers

In this exercise we will answer several questions regarding a text-file.
1. How many phone numbers are in the file?
2. How many phone numbers end in 0?
3. What are the phone numbers containing 666 in the body? (so excluding the first '06')
4. What are the phone numbers that only use numbers 6, 7, 8, 9 and 0?

We know that Dutch phone numbers, both mobile and landline, consist of 10 digits.

First I will make a pattern that the phone numbers could match.
A phone number can be written in the following ways
(with a space, with a dash, or with the two parts joined together):
06 12345678
031 2345678
0431 345678

I'll catch the first group (2 to 4 digits) as follows: \d{2} (it starts with at least 2 digits) and \d* (then zero or more further digits).
Next, I'll catch the separator, if there is one: \s?.? (an optional whitespace character and one optional character; note that . matches any character, not only punctuation marks).
Finally, the last group of digits: \d{6} (at least 6 digits) and \d* (then zero or more further digits).

In [4]:
telPattern = r"\d{2}\d*\s?.?\d{6}\d*"
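
Before running the pattern on the file, it can be tried on a few strings. The strings below are made up for illustration. The fourth one shows that the . in the pattern accepts any character as a separator, not just a dash.

```python
import re

telPattern = r"\d{2}\d*\s?.?\d{6}\d*"

# Made-up strings (assumption: not taken from the data file).
samples = [
    "06 12345678",        # separated by a space
    "031-2345678",        # separated by a dash
    "0431345678",         # both parts joined together
    "06x12345678",        # . accepts any character as separator
    "1056 WL Amsterdam",  # no run of six digits, so no match
]

results = {s: bool(re.search(telPattern, s)) for s in samples}
for text, matched in results.items():
    print(f"{text!r}: {matched}")
```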

I'll use this pattern to search my file and save any found matches in a 'phonebook'-list.

In [5]:
phonebook = []
for line in data: 
    search_request(telPattern, line, phonebook)

print(phonebook[0:5])


['06 65594347', '06 54990591', '06 59639138', '06 00708074', '06 99487919']


The result is a list with phone numbers! I'll check how many there are.

In [6]:
print(f"There are {len(phonebook)} phone numbers in the file.")

There are 2369 phone numbers in the file.


Now I'll find out how many numbers end in a 0. I'll slightly adapt my previous pattern by appending '0$', which checks whether the last digit of each number is a 0. If it is, the number is added to my results list.

In [7]:
nul_pattern = r"\d{2}\d*\s?.?\d{6}\d*0$"
nul_numbers = []
for number in phonebook:
    search_request(nul_pattern, number, nul_numbers)

print(f"There are {len(nul_numbers)} numbers ending in a 0.")

There are 226 numbers ending in a 0.
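
One detail of the adapted pattern is worth checking: \d{6} already consumes six digits and the trailing 0 has to be a seventh, so a number whose last group has exactly six digits cannot match. The sketch below compares the adapted pattern with a plain '0$' check on made-up numbers in the three layouts listed earlier. Whether this affects the count depends on which layouts occur in the file, which I have not verified here.

```python
import re

nul_pattern = r"\d{2}\d*\s?.?\d{6}\d*0$"

# Made-up numbers (assumption: not taken from the data file).
numbers = ["06 12345670", "031 2345670", "0431 345670", "06 12345678"]

# The adapted pattern needs at least seven digits in the final group.
adapted = [n for n in numbers if re.search(nul_pattern, n)]

# Because phonebook already holds only phone numbers, checking the last character is enough.
simple = [n for n in numbers if re.search(r"0$", n)]

print(adapted)
print(simple)
```

The simple check also finds '0431 345670', which the adapted pattern misses.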


Now we'll find all the numbers containing 666. Again, I'll adapt my previous pattern. 
The beginning stays the same. After it I change the pattern: any number of digits (\d*), then 666, and then again any number of digits (\d*).
The results are kept in a list.

In [8]:
devil_pattern = r"\d{2}\d*\s?.?\d*666\d*"
devil_numbers = []
for number in phonebook:
    search_request(devil_pattern, number, devil_numbers)

print(f"There are {len(devil_numbers)} numbers with '666' in their body. The numbers are: {devil_numbers}")

There are 10 numbers with '666' in their body. The numbers are: ['06 81159666', '06 66617275', '06 35458666', '06 79129666', '06 93566645', '06 56664905', '06 40789666', '06 47666791', '06 22666721', '06 81666899']
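
To see that the leading '06' really is excluded, the pattern can be compared with a more explicit approach: strip everything that is not a digit, drop the first two digits, and look for '666' in what remains. The helper body() is a hypothetical name introduced for this sketch, the numbers are made up, and the approach assumes a two-digit prefix as in mobile numbers.

```python
import re

devil_pattern = r"\d{2}\d*\s?.?\d*666\d*"

# Made-up numbers (assumption: not taken from the data file).
numbers = [
    "06 12666345",  # 666 in the middle of the body
    "06 66612345",  # 666 at the start of the body
    "06 66123456",  # only 66 in the body
    "0666123456",   # 666 only appears when the prefix is included
    "06 12345678",  # no 666 at all
]

by_pattern = [n for n in numbers if re.search(devil_pattern, n)]

def body(number):
    # Hypothetical helper: keep the digits only and drop the two-digit prefix.
    return re.sub(r"\D", "", number)[2:]

by_body = [n for n in numbers if "666" in body(n)]

print(by_pattern)
print(by_body)
```

On these samples both approaches select the same two numbers.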


Finally, I have to find the numbers that use only 6, 7, 8, 9 and 0. I'll use the pattern from the last exercise, but I'll change the second part into a filter for 6, 7, 8, 9 and 0. I do this with the character class [06-9], which matches a 0 or any digit from 6 to 9, repeated exactly eight times with {8}.
I'll also filter for mobile phone numbers only, which means I'll match the first two digits separately (\d{2}), then a whitespace character (\s), and then require that the remaining 8 digits all come from the given set.

In [9]:
speci_pattern = r"\d{2}\s[06-9]{8}"
speci_numbers = []
for number in phonebook:
    search_request(speci_pattern, number, speci_numbers)

print(f"There are {len(speci_numbers)} numbers with only 6, 7, 8, 9 or 0. These numbers are: {speci_numbers}")

There are 13 numbers with only 6, 7, 8, 9 or 0. These numbers are: ['06 89790998', '06 60767687', '06 06809799', '06 69768699', '06 98090669', '06 77089007', '06 78986889', '06 79067060', '06 68980880', '06 90899980', '06 89677768', '06 00078607', '06 77967898']
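
The character class can be tested in isolation. The sketch below uses re.fullmatch, which requires the whole string to fit the pattern. That is a slightly stricter choice than the re.search call inside search_request. It also shows that [06-9] and [06789] describe the same set. The numbers are made up for illustration.

```python
import re

speci_pattern = r"\d{2}\s[06-9]{8}"

# Made-up numbers (assumption: not taken from the data file).
numbers = ["06 89790998", "06 60767687", "06 12345678", "06 89790995"]

# fullmatch only succeeds when the entire string fits the pattern.
speci = [n for n in numbers if re.fullmatch(speci_pattern, n)]

# [06-9] is shorthand for [06789]: a 0 or a digit from 6 to 9.
spelled_out = [n for n in numbers if re.fullmatch(r"\d{2}\s[06789]{8}", n)]

print(speci)
print(speci == spelled_out)
```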
