# Welcome to the Notebook 
---

## Task 1
### What is Regex?

Regular expressions (Regex) allow us to extract substrings with a specific pattern from a text.



##### Metacharacters:
Characters that have a special meaning inside a pattern, such as `.`, `+`, `*`, and `[]`.

<img src="images/t1.png" >

##### Special Sequences:
Backslash-prefixed shorthands for common character classes, such as `\d` (digit), `\w` (word character), and `\s` (whitespace).

<img src="images/t2.png" >
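As a rough companion to the two tables, the sketch below tries a few common metacharacters and special sequences on a made-up string. The sample text and variable names are illustrative assumptions and are not part of the original notebook.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns below
sample = "Order 66 shipped to Bay_7 at dock 12."

# \d matches one digit; + repeats the previous token one or more times
digits = re.findall(r'\d+', sample)

# \w matches letters, digits and the underscore
words = re.findall(r'\w+', sample)

# ^ anchors at the start of the string, $ at the end; \. is a literal dot
starts_with_order = bool(re.search(r'^Order', sample))
ends_with_dot = bool(re.search(r'\.$', sample))

# [A-Z] is a character set; * repeats zero or more times
capitalised = re.findall(r'[A-Z]\w*', sample)

print(digits)        # ['66', '7', '12']
print(capitalised)   # ['Order', 'Bay_7']
```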
    

In [1]:
import re

In [2]:
paragraph = """John is 24 years old and Sara is 23 and Maiki is 15 years old."""

Let's extract all the ages

In [3]:
pattern_for_ages = r'\d+'
pattern_for_ages_2 = r'[0-9]+'

ages = re.findall(pattern_for_ages, paragraph)
ages_2 = re.findall(pattern_for_ages_2, paragraph)
print(ages)
print(ages_2)

['24', '23', '15']
['24', '23', '15']


Let's extract all the names

In [4]:
pattern_for_names = r'[A-Z][a-z]+'
re.findall(pattern_for_names, paragraph)

['John', 'Sara', 'Maiki']
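Names and ages can also be captured together. The sketch below assumes every person is written as "Name is NN", which holds for this paragraph but not for text in general. With two capture groups, `re.findall` returns one tuple per match. The name `ages_by_name` is introduced here for illustration.

```python
import re

paragraph = """John is 24 years old and Sara is 23 and Maiki is 15 years old."""

# Group 1 captures the capitalised name, group 2 the digits that follow " is "
pairs = re.findall(r'([A-Z][a-z]+) is (\d+)', paragraph)

# Convert the captured digit strings to integers while building the mapping
ages_by_name = {name: int(age) for name, age in pairs}

print(pairs)         # [('John', '24'), ('Sara', '23'), ('Maiki', '15')]
print(ages_by_name)  # {'John': 24, 'Sara': 23, 'Maiki': 15}
```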

## Task 2
- Extract phone numbers
- Reformat the phone numbers so that they all share the same format
- Extract names
- Store the data in a Python dictionary

In [5]:
phone_numbers = """
john: 145-202-9330
Sara: 156.201.3333
maiki: 111*505*1254
"""

Extracting phone numbers 

In [6]:
phone_pattern = r'\d+[-.*]\d+[-.*]\d+'  # separators listed explicitly; a bare . would match any character

re.findall(phone_pattern, phone_numbers)

['145-202-9330', '156.201.3333', '111*505*1254']
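A note on separators: an unescaped `.` matches any character, so a pattern that uses it as a separator also accepts strings that are not phone numbers. The sketch below compares a loose pattern with one that lists the allowed separators in a character set. The test string is made up for illustration.

```python
import re

# Hypothetical text: one real phone number and one unrelated digit/letter run
text = "145-202-9330 and 2021x07y15"

loose = r'\d+.\d+.\d+'            # . accepts ANY character between digit runs
strict = r'\d+[-.*]\d+[-.*]\d+'   # only -, . or * are accepted as separators

loose_hits = re.findall(loose, text)
strict_hits = re.findall(strict, text)

print(loose_hits)   # ['145-202-9330', '2021x07y15']
print(strict_hits)  # ['145-202-9330']
```

Inside square brackets `.` and `*` lose their special meaning, and `-` is literal when it comes first.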

Reformatting phone numbers and then extracting them again

In [7]:
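# Replace the '.' and '*' separators with '-' so every number uses one format
replace_pattern = r'[.*]'
phone_numbers_new = re.sub(replace_pattern, '-', phone_numbers)

# Two equivalent patterns: \d{3} is shorthand for \d\d\d
phone_pattern = r'\d\d\d-\d\d\d-\d\d\d\d'
phone_pattern2 = r'\d{3}-\d{3}-\d{4}'

#re.findall(phone_pattern, phone_numbers_new)
phones = re.findall(phone_pattern2, phone_numbers_new)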

Let's extract the names

In [8]:
name_pattern = r'[A-Z]?[a-z]+'

names = re.findall(name_pattern, phone_numbers)

Let's store them in a dictionary

In [9]:
# names and phones are in the same order, so pair them up position by position
contacts = dict(zip(names, phones))

contacts

{'john': '145-202-9330', 'Sara': '156-201-3333', 'maiki': '111-505-1254'}
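As an alternative sketch, each name and its number can be captured in a single pass, so the pairing does not depend on two separate lists lining up. It assumes every entry looks like `name: ddd?ddd?dddd` with `-`, `.` or `*` as the separator. The name `entry_pattern` is hypothetical.

```python
import re

phone_numbers = """
john: 145-202-9330
Sara: 156.201.3333
maiki: 111*505*1254
"""

# One capture group for the name and three for the digit blocks
entry_pattern = r'(\w+):\s*(\d{3})[-.*](\d{3})[-.*](\d{4})'

# Rebuild each number with '-' so all values share one format
contacts = {
    name: f'{area}-{prefix}-{line}'
    for name, area, prefix, line in re.findall(entry_pattern, phone_numbers)
}

print(contacts)
# {'john': '145-202-9330', 'Sara': '156-201-3333', 'maiki': '111-505-1254'}
```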

## Task 3
The user enters an email address, and we want to check whether it is in a valid format.

<img height= 400 width=600 src="images/emailparts.png">

In [10]:
email = input("What is the e-mail address?")

# fullmatch requires the whole input to fit the pattern; match would accept trailing junk
email_pattern = r'[A-Za-z0-9_.-]+@[A-Za-z0-9.-]+\.[a-z]+'

if re.fullmatch(email_pattern, email):
    print("This is correct")
else:
    print("This is false")

What is the e-mail address?denniscrielaard@icloud.com
This is correct


- Exercise: Write a RegEx pattern that <b>does not</b> allow the user to enter an email address that begins with a digit.

In [14]:
email = input("What is the e-mail address?")

# The first character must be a letter; the rest of the local part may be empty
email_pattern = r'[A-Za-z][A-Za-z0-9_.-]*@[A-Za-z0-9.-]+\.[a-z]+'

if re.fullmatch(email_pattern, email):
    print("This is correct")
else:
    print("This is false")

What is the e-mail address?123dennis@crielaard.com
This is false
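`re.match` only anchors at the start of the string, so anything after a valid-looking prefix is ignored. `re.fullmatch` requires the entire input to fit. The sketch below wraps the check in a hypothetical helper, `is_valid_email`. Its pattern is a simplified assumption, far from a complete email grammar.

```python
import re

# Simplified pattern: a letter first, then the local part, '@', domain, '.', lowercase TLD
email_pattern = r'[A-Za-z][A-Za-z0-9_.-]*@[A-Za-z0-9.-]+\.[a-z]+'

def is_valid_email(address):
    """Return True only if the whole string fits the simplified pattern."""
    return re.fullmatch(email_pattern, address) is not None

# match accepts the trailing '!!!' because it only checks the beginning
prefix_only = re.match(email_pattern, 'dennis@crielaard.com!!!') is not None

print(prefix_only)                                   # True
print(is_valid_email('dennis@crielaard.com!!!'))     # False
print(is_valid_email('denniscrielaard@icloud.com'))  # True
print(is_valid_email('123dennis@crielaard.com'))     # False
```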


## Task 4
- Write a RegEx pattern that recognizes the following URLs
- Reformat the URLs to domain name + top-level domain, e.g. coursera.org
    
<img height= 400 width=600 src="images/urlparts.png">

In [18]:
urls = """
https://www.google.com
http://youtube.com
https://www.nasa.gov
https://coursera.org
"""

pattern = r'https?://(www\.)?(\w+)(\.\w+)'

Let's reformat the URLs


In [19]:
urls_list = re.findall(pattern, urls)
# Each match is a tuple: (optional 'www.' prefix, domain name, top-level domain)
for _, domain, tld in urls_list:
    print(domain + tld)

google.com
youtube.com
nasa.gov
coursera.org


In [21]:
new_urls = re.sub(pattern, r'\2\3', urls)
print(new_urls)


google.com
youtube.com
nasa.gov
coursera.org
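Numbered groups such as `\2\3` become hard to read as a pattern grows. A possible variant, sketched below, uses named groups and a non-capturing group for the optional `www.` prefix. The group names `domain` and `tld` are chosen here for illustration, and the pattern still assumes URLs with exactly one dot after the optional prefix.

```python
import re

urls = """
https://www.google.com
http://youtube.com
https://www.nasa.gov
https://coursera.org
"""

# (?:...) groups without capturing; (?P<name>...) captures under a name
url_pattern = re.compile(r'https?://(?:www\.)?(?P<domain>\w+)\.(?P<tld>\w+)')

short_urls = [
    f"{m.group('domain')}.{m.group('tld')}"
    for m in url_pattern.finditer(urls)
]

print(short_urls)  # ['google.com', 'youtube.com', 'nasa.gov', 'coursera.org']
```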



- Exercise: Below is a list of dates in different formats. Reformat all of them to dd/mm/yyyy using RegEx.

In [34]:
dates = """
12 01 2020
15.05.2021
07/03/2020
10-03-2019
"""


In [35]:
pattern = r'(\d+)[./\-\s](\d+)[./\-\s](\d+)'
new_dates = re.sub(pattern, r'\1/\2/\3', dates)
print(new_dates)


12/01/2020
15/05/2021
07/03/2020
10/03/2019
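The dates in this exercise are already zero-padded. If some were written as `7.3.2020`, a replacement string alone could not pad them. `re.sub` also accepts a function as the replacement, as sketched below. The input dates and the helper name `to_dd_mm_yyyy` are assumptions made for this example.

```python
import re

# Hypothetical input with unpadded days and months
raw_dates = "7.3.2020\n12 01 2020\n10-3-2019"

date_pattern = r'(\d{1,2})[./\- ](\d{1,2})[./\- ](\d{4})'

def to_dd_mm_yyyy(match):
    """Build dd/mm/yyyy from the three captured groups, zero-padding day and month."""
    day, month, year = match.groups()
    return f'{int(day):02d}/{int(month):02d}/{year}'

# re.sub calls the function once per match and inserts its return value
clean_dates = re.sub(date_pattern, to_dd_mm_yyyy, raw_dates)

print(clean_dates)
# 07/03/2020
# 12/01/2020
# 10/03/2019
```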



## Task 5

### Text Mining using RegEx

We have a dataset consisting of some personal notes

In [6]:
import pandas as pd
import re

data = pd.read_csv("dataset.csv")
data

Unnamed: 0,notes
0,Friends reunion on Thursday 25-05-2021 at 6:00 pm
1,on Saturday-night 29-05-2021 at 3:00 pm champi...
2,the doctor's appointment is on Tuesday 12-04-2...
3,Meeting with friends on Friday 14-12-2020 at 8...
4,On Wednesday 06.01.2021 at 9:30 pm there is a ...
5,Don't forget to call Dani on Friday 22/07/2020...
6,"Wednesday 25/05/2021 at 5:30 pm, meeting with ..."
7,Job interview at 9:00 am Monday 02.02.2021


We want to extract some useful information from this text data and store it in a dataframe. <br>
So let's create the dataframe

In [7]:
information = pd.DataFrame(columns=['day', 'month', 'year', 'weekday', 'time'])
information

Unnamed: 0,day,month,year,weekday,time


Counting the number of words in each note

In [9]:
data['notes'].str.count(r'\w+')

0    11
1    15
2    14
3    12
4    14
5    15
6    12
7    10
Name: notes, dtype: int64

Get the list of all of the words in each note

In [10]:
data['notes'].str.findall(r'\w+')

0    [Friends, reunion, on, Thursday, 25, 05, 2021,...
1    [on, Saturday, night, 29, 05, 2021, at, 3, 00,...
2    [the, doctor, s, appointment, is, on, Tuesday,...
3    [Meeting, with, friends, on, Friday, 14, 12, 2...
4    [On, Wednesday, 06, 01, 2021, at, 9, 30, pm, t...
5    [Don, t, forget, to, call, Dani, on, Friday, 2...
6    [Wednesday, 25, 05, 2021, at, 5, 30, pm, meeti...
7    [Job, interview, at, 9, 00, am, Monday, 02, 02...
Name: notes, dtype: object

Exercise: find all of the dates in each note

In [16]:
pattern = r'\d{2}[-./]\d{2}[-./]\d{4}'  # accept only -, . or / between the parts

data['notes'].str.findall(pattern)

0    [25-05-2021]
1    [29-05-2021]
2    [12-04-2021]
3    [14-12-2020]
4    [06.01.2021]
5    [22/07/2020]
6    [25/05/2021]
7    [02.02.2021]
Name: notes, dtype: object

Let's clean these dates and extract their parts

In [30]:
# Normalise every separator to '-', then split each date into named columns
dates = data['notes'].str.replace(r'(\d{2})[-./](\d{2})[-./](\d{4})', r'\1-\2-\3', regex=True)
dates = dates.str.extract(r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})')
dates

Unnamed: 0,day,month,year
0,25,5,2021
1,29,5,2021
2,12,4,2021
3,14,12,2020
4,6,1,2021
5,22,7,2020
6,25,5,2021
7,2,2,2021
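`str.extract` returns the captured text, so the `day`, `month` and `year` columns hold strings such as `'05'`, not numbers. The sketch below rebuilds two of the notes by hand (`dataset.csv` is not assumed to be available) and converts the parts to integers before doing a numeric comparison.

```python
import pandas as pd

# Two notes copied from the dataset preview above
notes = pd.Series([
    "Friends reunion on Thursday 25-05-2021 at 6:00 pm",
    "On Wednesday 06.01.2021 at 9:30 pm",
])

# Named groups become column names; the values are still strings
parts = notes.str.extract(r'(?P<day>\d{2})[-./](?P<month>\d{2})[-./](?P<year>\d{4})')
print(parts['month'].tolist())    # ['05', '01']

# Convert to integers so that comparisons such as year == 2021 behave numerically
numeric = parts.astype(int)
print(numeric['month'].tolist())  # [5, 1]

in_2021 = numeric[numeric['year'] == 2021]
```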


Let's extract the times

In [31]:
time_df = data['notes'].str.extract(r'(?P<time>\d+[:]\d{2}\s[a-z]+)')
time_df

Unnamed: 0,time
0,6:00 pm
1,3:00 pm
2,4:40 pm
3,8:30 pm
4,9:30 pm
5,12:00 pm
6,5:30 pm
7,9:00 am
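Times in the `h:mm am/pm` form are awkward to sort or compare. One possible follow-up, sketched here with a hypothetical helper `to_24h`, converts them to 24-hour `HH:MM` strings using capture groups. It assumes the input is exactly a time like the ones extracted above.

```python
import re

def to_24h(text):
    """Convert 'h:mm am/pm' to 'HH:MM'; return None if the text does not fit."""
    m = re.fullmatch(r'(\d{1,2}):(\d{2})\s*([ap]m)', text.strip(), flags=re.IGNORECASE)
    if m is None:
        return None
    hour, minute, meridiem = int(m.group(1)), m.group(2), m.group(3).lower()
    if meridiem == 'pm' and hour != 12:
        hour += 12          # 6 pm -> 18
    elif meridiem == 'am' and hour == 12:
        hour = 0            # 12 am -> 00
    return f'{hour:02d}:{minute}'

print(to_24h('6:00 pm'))   # 18:00
print(to_24h('12:00 pm'))  # 12:00
print(to_24h('9:00 am'))   # 09:00
```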


Exercise: Extract the weekday names

In [37]:
weekday = data['notes'].str.extract(r'(?P<weekday>\w+day)')
weekday

Unnamed: 0,weekday
0,Thursday
1,Saturday
2,Tuesday
3,Friday
4,Wednesday
5,Friday
6,Wednesday
7,Monday
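The pattern `\w+day` works on these notes, but it would also pick up words such as "birthday" or "today". A stricter sketch lists the seven weekday stems explicitly. The example note is made up to show the difference.

```python
import re

# Hypothetical note containing a non-weekday word that ends in "day"
note = "Don't forget Dani's birthday party on Friday 22/07/2020"

loose = r'\w+day'
# \b marks a word boundary; (?:...) groups the alternatives without capturing
strict = r'\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b'

loose_hits = re.findall(loose, note)
strict_hits = re.findall(strict, note)

print(loose_hits)   # ['birthday', 'Friday']
print(strict_hits)  # ['Friday']
```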


Now let's merge these three dataframes

In [39]:
information = dates.join(weekday).join(time_df)
information

Unnamed: 0,day,month,year,weekday,time
0,25,5,2021,Thursday,6:00 pm
1,29,5,2021,Saturday,3:00 pm
2,12,4,2021,Tuesday,4:40 pm
3,14,12,2020,Friday,8:30 pm
4,6,1,2021,Wednesday,9:30 pm
5,22,7,2020,Friday,12:00 pm
6,25,5,2021,Wednesday,5:30 pm
7,2,2,2021,Monday,9:00 am


Now we can run many analyses on this dataframe. <br>

Let's answer this question: <b>On which days of the week am I busiest?</b>

Aggregate the dataframe on the weekday column and count the rows.

In [40]:
information.groupby('weekday').count()

Unnamed: 0_level_0,day,month,year,time
weekday,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Friday,2,2,2,2
Monday,1,1,1,1
Saturday,1,1,1,1
Thursday,1,1,1,1
Tuesday,1,1,1,1
Wednesday,2,2,2,2
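`groupby(...).count()` repeats the same count in every column. When only the number of notes per weekday matters, `value_counts` on a single column is a more compact option. The sketch below rebuilds the weekday column by hand from the table above, and `busiest` is a name introduced for illustration.

```python
import pandas as pd

# Weekday values copied from the merged table above
weekdays = pd.Series([
    'Thursday', 'Saturday', 'Tuesday', 'Friday',
    'Wednesday', 'Friday', 'Wednesday', 'Monday',
])

# One count per distinct weekday, sorted from most to least frequent
counts = weekdays.value_counts()

# Keep every weekday that ties for the highest count
busiest = sorted(counts[counts == counts.max()].index.tolist())

print(busiest)  # ['Friday', 'Wednesday']
```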


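Another analytical question: in which months of this year (2021) have I been busiest?

In [None]:
# The extracted columns hold strings, so compare against '2021' rather than the integer 2021
this_year = information[information['year'] == '2021']
this_year.groupby('month')['day'].count()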