# Introduction to Data Science 


## Lab 6: Regular Expressions

**British University in Egypt**<br>
**Instructors:** Nahla Barakt <br>

In [1]:
from IPython.display import HTML
style = "<style>div.exercise { background-color: #ffcccc;border-color: #E9967A; border-left: 5px solid #800080; padding: 0.5em;}</style>"
HTML(style)

In [2]:
##!pip install plotly==5.6.0

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd

# Table of Contents
<ol start="1">
  <li> Regular Expressions  </li>
  <li> Exercise </li>
</ol>

##  Part 1: Regular Expressions

### Background and Motivation
*Regular Expressions* (a.k.a. `regex` or `regexp`) are a tool for working with and manipulating text data.  We've already done some text manipulation in this lab, but we've shied away from particularly thorny examples until now.  Using `python`'s string methods is useful, but that approach has its limitations.

Regular expressions provide a set of rules for working with text data.  At first, these expressions look completely foreign (e.g. `([0-9]+(\.[0-9]+){3})`), but once you know some of the basics they're not so bad.

As it turns out, the fundamentals of regular expressions are rooted in formal language theory (regular languages and finite automata).  Mathematicians have studied regular expressions simply to lay down and understand their theoretical underpinnings.  We won't go anywhere near that level of detail.  For us, regular expressions will simply be used to process some gnarly text data.

There are a few key `regex` patterns and concepts that you must know and be comfortable with.  The fact is, there are many ways to create a `regex` to search for a particular pattern.  Some approaches are more succinct than others.  As with most things, you will get better the more you practice.  You should try to make your `regex` patterns as crisp as possible while still maintaining readability.

### Some resources
In order to become proficient with `regex`s, you are **strongly encouraged** to take the *RegexOne* tutorial at [https://regexone.com/](https://regexone.com/).  That tutorial is an interactive and accessible introduction to regular expressions.  It contains problems at the end to test your knowledge.  The *RegexOne* website also contains a very nice demo for `Python3`.  This lab will borrow from the *RegexOne* `python` demo to walk you through some concepts.

You may also want to consider the book [Mastering Regular Expressions](http://shop.oreilly.com/product/9780596528126.do) for more details as well as some historical comments.

---

### Learning by Example
Suppose you have a string containing a date:

In [4]:
birthday = "June 11"

You would like to search this string for the month.  For such a simple string, this can easily be done with the `python` string methods.

In [5]:
birth_month = birthday.strip()[:-3]
print(birth_month)

June


We're after much more intense strings, which we'll process with regular expressions.  Let's warm up with a `regex` on this simple string.

In [6]:
regex = r"\w+" 
regex

'\\w+'

What in the world does this mean?!  Well, there are a few syntactical details here:
1. The `r` means that the string is a *raw string*.  This tells `python` not to treat backslashes as the start of escape sequences such as `\n`.  For example, raw strings are convenient when writing TeX, which is full of backslashes.
2. The `\w` indicates any alphanumeric character (letters, digits, and the underscore).
3. The `+` indicates one or more occurrences.

In plain English, `regex` is a regular expression that tries to match one or more occurrences of alphanumeric characters.
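To make the raw-string point concrete, here is a minimal sketch comparing a raw string with an ordinary one.  The variable names are only illustrative and are not used elsewhere in the lab.

```python
# A raw string keeps each backslash as a literal character.
raw_newline = r"\n"       # two characters: a backslash and the letter n
plain_newline = "\n"      # one character: a newline

print(len(raw_newline), len(plain_newline))   # 2 1

# The raw string r"\w+" holds the same three characters as "\\w+",
# which is why the notebook displays it as '\\w+'.
pattern = r"\w+"
print(pattern == "\\w+", len(pattern))        # True 3
```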

We still haven't specified what string we want to find the matches in.  All we've done so far is specify a `regex`.

Let's remedy that.  We will now use the `python` `re` module to start matching some regular expressions in strings.  Here are two more resources for you:
* [`re` module documentation](https://docs.python.org/3/library/re.html) --- The official `python` documentation on the `re` module
* [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto) --- A gentler introduction to using the `re` module.

Honestly, your best bet is still to start with the resources found on the *RegexOne* site.

In [7]:
import re # Regular expression module

In [8]:
re.search(r'\w', birthday)

<re.Match object; span=(0, 1), match='J'>

In [9]:
months = re.search(regex, birthday) # Search string for regex
print(months,type(months))

<re.Match object; span=(0, 4), match='June'> <class 're.Match'>


We just searched the `birthday` string for the regular expression contained in `regex`.  If the pattern doesn't match, then we get `None` in return, otherwise we get an object that contains some information.  In our case, the pattern matches something in `birthday`.  What information did we get?
* We are told the `start` and `end` of the matching pattern (that is the `span=(0, 4)`)
* We are told what matched (that is the `'June'`)

Note that the `\w+` expression stops when it reaches white space so we just get the first string.

You can access the starting and ending indices with the `start()` and `end()` methods as follows:

In [10]:
print("The matched pattern starts at index {0} and ends at index {1}.".format(months.start(), months.end()))

The matched pattern starts at index 0 and ends at index 4.
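Besides `start()` and `end()`, a match object also exposes the matched text and the span directly.  The short sketch below repeats the `birthday` search so that it runs on its own.

```python
import re

birthday = "June 11"
match = re.search(r"\w+", birthday)

print(match.group())   # the matched text: June
print(match.span())    # (start, end) as a tuple: (0, 4)

# Slicing the original string with start() and end() recovers the same text.
print(birthday[match.start():match.end()])   # June
```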


Note that we could have used a very simple pattern to search for the word `June`:

In [11]:
regex = r"June"
months = re.search(regex, birthday)
print(months)

<re.Match object; span=(0, 4), match='June'>


Same answer!

In [12]:
months = re.search(r"Oct", birthday) #nothing displays

**Note:** When a regex fails to match, like above, it can look a little weird. Instead of getting an empty regex object that prints out, we get `None`, which doesn't display anything. Printing it still works though.

In [13]:
print(months) #printing the match object shows us the result, even if no match was found.

None
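Because a failed search returns `None`, calling a method such as `group()` on the result raises an `AttributeError`.  One way to guard against that is sketched below; `first_match` is a hypothetical helper written for this illustration, not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group()

birthday = "June 11"
print(first_match(r"June", birthday))   # June
print(first_match(r"Oct", birthday))    # None
```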


As already mentioned, regular expressions work directly with text.  You need the fancier stuff when you have more complicated strings.  We'll get to that in a moment.  First, do the following exercise.

<div class=exercise><b>Exercise</b></div>
Consider the string 
```python
statement = "June is a lovely month."
```
* Use a regular expression to find the pattern `June`.
* Create a new string, `fragment` from `statement`, which starts just after the word `June`.

Your output should be ` is a lovely month.`

In [14]:
# your code here
statement = "June is a lovely month."
express = re.search(r"June", statement)
fragment = statement[express.end():]
print(fragment)

 is a lovely month.


Okay, we're ready to move on to more interesting things.  We'll do this in a sequence of demos.

First, let's try to get the day out of the birthday string.  We'll use some more interesting expressions to illustrate some of the important patterns.

#### We can use `\d` to get just digits.

In [15]:
regex = r"\d+"
re.search(regex, birthday)

<re.Match object; span=(5, 7), match='11'>

#### We can use `[a-z]` for characters `a` to `z` and `[0-9]` for digits `0` to `9`.

In [16]:
regex = r"[A-Za-z]+"
re.search(regex, birthday)

<re.Match object; span=(0, 4), match='June'>

Note that we had to specify both capital letters and lowercase letters.  We also needed the `+` pattern to make sure that one or more occurrences of the characters were found.  If not, we would have only gotten one occurrence as illustrated in the next example.

In [17]:
regex = r"[0-9]"
re.search(regex, birthday)

<re.Match object; span=(5, 6), match='1'>

Only got the first occurrence of `1`!
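The effect of the quantifier can be seen side by side in the sketch below.  The `{2}` form (exactly two repetitions) is not used elsewhere in this lab; it is included only as one more standard quantifier for comparison.

```python
import re

birthday = "June 11"

one_digit = re.search(r"[0-9]", birthday).group()       # exactly one digit
all_digits = re.search(r"[0-9]+", birthday).group()     # one or more digits
two_digits = re.search(r"[0-9]{2}", birthday).group()   # exactly two digits

print(one_digit, all_digits, two_digits)   # 1 11 11
```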

#### `findall()`

Let's start getting down to business.  We want the actual month and the actual day.  Not the whole thing.  That's not too hard given what we already have at our disposal.

In [18]:
regex_month = r"[A-Za-z]+"
month = re.findall(regex_month, birthday)
print(month)

regex_day = r"\d+"
day = re.findall(regex_day, birthday)
print(day)

['June']
['11']


The `findall()` method returns a list of all the pattern matches.  Very cool.  Now we're ready to move on to another very important concept: *groups*.
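If the positions of the matches are needed as well, `re.finditer()` is a close relative of `findall()` that yields match objects instead of plain strings.  A minimal sketch on the same `birthday` string:

```python
import re

birthday = "June 11"

# findall returns the matched strings only.
print(re.findall(r"\w+", birthday))   # ['June', '11']

# finditer yields match objects, so the positions are available as well.
for match in re.finditer(r"\w+", birthday):
    print(match.group(), match.span())
```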

#### Groups
Let's say we have a busy string of birthdays:

In [19]:
birthdays = "June 11th, December 13th, September 21st, May 12th"

We want to get all the months and all the days.  This looks like a job for the `findall()` method.

In [20]:
regex = r"[A-Za-z]+"
bdays = re.findall(regex, birthdays)
print(bdays)

['June', 'th', 'December', 'th', 'September', 'st', 'May', 'th']


That's not right.  Almost, but not quite.  We can fix things in a bunch of ways.  Let's take this opportunity to introduce groups.

In [21]:
regex = r"([A-Za-z]+) (\d+\w+)"
bdays = re.findall(regex, birthdays)
print(bdays)

[('June', '11th'), ('December', '13th'), ('September', '21st'), ('May', '12th')]


Let's try to unpack all of that:
* The parentheses indicate a group.  So, our first set of parentheses indicates that we want a pattern of one or more letters.
* Right after that first group, we have a space.
* Then we have another group.  This time, the group indicates a pattern with one or more occurrences of digits followed by one or more occurrences of any alphanumeric characters.

We could have accomplished the same thing in a number of ways.  Here are a couple more possibilities:
```python
regex = r"([A-Za-z]+)\s(\d+\w+)"
regex = r"([A-Za-z]+)\s(\w+)"
regex = r"([A-Za-z]+) (\d+[a-z]+)"
```
You get the idea.
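As a quick check, the sketch below runs all four patterns against the same `birthdays` string and confirms that they return the same list.  It also shows how individual groups can be read from a single match with `group(1)` and `group(2)`.

```python
import re

birthdays = "June 11th, December 13th, September 21st, May 12th"

patterns = [
    r"([A-Za-z]+) (\d+\w+)",
    r"([A-Za-z]+)\s(\d+\w+)",
    r"([A-Za-z]+)\s(\w+)",
    r"([A-Za-z]+) (\d+[a-z]+)",
]

# Every pattern should produce the same list of (month, day) tuples.
results = [re.findall(pattern, birthdays) for pattern in patterns]
print(all(result == results[0] for result in results))   # True

# On a single match, group(0) is the whole match and group(n) is the n-th group.
first = re.search(patterns[0], birthdays)
print(first.group(0))   # June 11th
print(first.group(1))   # June
print(first.group(2))   # 11th
```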

It's also possible to drop the groups and get each month and day together as a single string.

In [22]:
regex = r"[A-Za-z]+ \d+\w+"
bdays = re.findall(regex, birthdays)
for bday in bdays:
    print(bday)

June 11th
December 13th
September 21st
May 12th


#### Applying Regex on dataframe

In [23]:
## Defining a data frame
df=pd.DataFrame({'Name':['Wikipedia'],'Description':\
              ['Other collaborative online encyclopedias were attempted before Wikipedia, but none were as successful.[18] Wikipedia began as a complementary project for Nupedia, a free online English-language encyclopedia project whose articles were written by experts and reviewed under a formal process.[19] It was founded on March 9, 2000, under the ownership of Bomis, a web portal company. Its main figures were Bomis CEO Jimmy Wales and Larry Sanger, editor-in-chief for Nupedia and later Wikipedia.[1][20] Nupedia was initially licensed under its own Nupedia Open Content License, but even before Wikipedia was founded, Nupedia switched to the GNU Free Documentation License at the urging of Richard Stallman.[21] Wales is credited with defining the goal of making a publicly editable encyclopedia,[22][23] while Sanger is credited with the strategy of using a wiki to reach that goal.[24] On January 10, 2001, Sanger proposed on the Nupedia mailing list to create a wiki as a "feeder" project for Nupedia.[25]']})

In [24]:
df['Description'][0]

'Other collaborative online encyclopedias were attempted before Wikipedia, but none were as successful.[18] Wikipedia began as a complementary project for Nupedia, a free online English-language encyclopedia project whose articles were written by experts and reviewed under a formal process.[19] It was founded on March 9, 2000, under the ownership of Bomis, a web portal company. Its main figures were Bomis CEO Jimmy Wales and Larry Sanger, editor-in-chief for Nupedia and later Wikipedia.[1][20] Nupedia was initially licensed under its own Nupedia Open Content License, but even before Wikipedia was founded, Nupedia switched to the GNU Free Documentation License at the urging of Richard Stallman.[21] Wales is credited with defining the goal of making a publicly editable encyclopedia,[22][23] while Sanger is credited with the strategy of using a wiki to reach that goal.[24] On January 10, 2001, Sanger proposed on the Nupedia mailing list to create a wiki as a "feeder" project for Nupedia.[

In [25]:
## Let's write a regex to get numbers between brackets
df['Description'].str.findall(r'\[[0-9]+\]')


0    [[18], [19], [1], [20], [21], [22], [23], [24]...
Name: Description, dtype: object
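Square brackets normally delimit a character class, so literal brackets are safest written as `\[` and `\]`.  The sketch below uses the plain `re` module on a short excerpt of the description (the variable `text` is just an illustrative stand-in) and also shows how a group returns only the numbers.

```python
import re

# Short excerpt of the description text, used only for illustration.
text = "none were as successful.[18] and later Wikipedia.[1][20]"

# \[ and \] match literal brackets; [0-9]+ between them matches the digits.
citations = re.findall(r"\[[0-9]+\]", text)
print(citations)   # ['[18]', '[1]', '[20]']

# Wrapping the digits in a group returns just the numbers.
numbers = re.findall(r"\[([0-9]+)\]", text)
print(numbers)     # ['18', '1', '20']
```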

In [26]:
## Write a regex to extract dates , noting the format of date is (Month dd, yyyy)

##############Your Sol##############


0    [March 9, 2000, January 10, 2001]
Name: Description, dtype: object

There are many other ways to play with these `regex` patterns.  You will get many chances to do so in your homework.  For now, let's do an exercise.

<div class=exercise><b>Exercise</b></div>
* Open and read the file `shelterdogs.xml` into a string named `dogs`.  It should look like:

```
<?xml version="1.0" encoding="UTF-8"?>

<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
```
* Write a regular expression to match the dog names.  That is, you want to match the name inside the name tag: `<name> dog_name </name>`.
  * **Hint:** Use a group.
* Print out each name.

Your output should be 
```python
Cloe
Karl
```

In [27]:
# your code here

['Cloe', 'Karl']


<div class=exercise><b>Exercise</b></div>
Although you successfully completed the previous exercise, you think it would have been nicer to strip out the first two lines of the `dogs` string.

**Hints:**
* The first line has some special metacharacters in it (e.g. ?, ", \n).  You can escape these by using a backslash. For example, \? treats ? like a real question mark.  Otherwise it's the *optional* character in regular expressions.
* Consider using [\n]+ to deal with the end of line character.

Your output should be:
```
<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>
```

<?xml version="1.0" encoding="UTF-8"?>

<dogshelter>
    <dog id="dog1">
        <name> Cloe </name>
        <age> 3 </age>
        <breed> Border Collie </breed>
        <playgroup> Yes </playgroup>
    </dog>
    <dog id="dog2"> 
        <name> Karl </name> 
        <age> 7 </age>
        <breed> Beagle </breed>
        <playgroup> Yes </playgroup>
    </dog>
</dogshelter>



In [29]:
regex=r'<\?.*\?>[\n]+'
words=re.search(regex,dogs)
print(words)

<re.Match object; span=(0, 40), match='<?xml version="1.0" encoding="UTF-8"?>\n\n'>
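The search above only locates the header.  To actually strip it, one option is `re.sub()`, which replaces the matched text with an empty string.  This is a sketch under the assumption that the header is the first match; `dogs_sample` is a hypothetical, shortened stand-in for the contents of `shelterdogs.xml`.

```python
import re

# Hypothetical, shortened stand-in for the contents of shelterdogs.xml.
dogs_sample = '<?xml version="1.0" encoding="UTF-8"?>\n\n<dogshelter>\n</dogshelter>\n'

# Replace the header and the newlines after it with an empty string.
# count=1 limits the substitution to the first match.
stripped = re.sub(r'<\?.*\?>[\n]+', '', dogs_sample, count=1)
print(stripped)
```

The same call should work on the full `dogs` string, since `.` does not cross line breaks by default and the pattern therefore stays on the first line.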


## Part 2: Exercise

In [30]:
# load the data
nasa_csv = pd.read_csv('nasa.csv', index_col=0)
nasa_csv.head()

Unnamed: 0,Start_Datetime,End_Datetime,startFrequency,endFrequency,flare_Location,flare_region,importance,CME_Date,CME_Time,width,speed,CPA,is_halo,lower_bound
0,1997-04-01 14:00:00,1997-04-01 14:15:00,8000,4000,S25E16,8026,M1.3,04/01,15:18,79.0,312.0,74,False,False
1,1997-04-07 14:30:00,1997-04-07 17:30:00,11000,1000,S28E19,8027,C6.8,04/07,14:27,360.0,878.0,na,True,False
2,1997-05-12 05:15:00,1997-05-14 16:00:00,12000,80,N21W08,8038,C1.3,05/12,05:30,360.0,464.0,na,True,False
3,1997-05-21 20:20:00,1997-05-21 22:00:00,5000,500,N05W12,8040,M1.3,05/21,21:00,165.0,296.0,263,False,False
4,1997-09-23 21:53:00,1997-09-23 22:16:00,6000,2000,S29E25,8088,C1.4,09/23,22:02,155.0,712.0,133,False,False


In [31]:
## Print the shape of the data

##############Your Sol##############


(482, 14)

In [32]:
## Check nulls of each column

##############Your Sol##############


Start_Datetime      0
End_Datetime        0
startFrequency      0
endFrequency        0
flare_Location      6
flare_region       83
importance        105
CME_Date           20
CME_Time           20
width              20
speed              20
CPA                21
is_halo             0
lower_bound         0
dtype: int64

In [33]:
## Drop nulls on column importance

##############Your Sol##############


In [34]:
### get unique values of column importance

##############Your Sol##############


array(['M1.3', 'C6.8', 'C1.3', 'C1.4', 'C8.6', 'M4.2', 'X2.1', 'X9.4',
       'X2.6', 'B9.4', 'C1.1', 'M1.4', 'X1.2', 'C8.9', 'X1.0', 'M6.8',
       'X1.1', 'X2.7', 'M7.7', 'B6.6', 'B7.9', 'C7.5', 'M1.0', 'C2.9',
       'C4.4', 'M8.4', 'C1.0', 'C5.9', 'M8.0', 'M4.4', 'M3.9', 'C8.8',
       'M1.7', 'C3.5', 'M1.6', 'C7.6', 'C2.1', 'X1.8', 'M3.8', 'M1.8',
       'C4.7', 'C7.3', 'M6.5', 'M1.2', 'C2.3', 'C9.7', 'M3.1', 'FILA',
       'M1.5', 'C7.8', 'M7.6', 'X2.3', 'M5.2', 'M3.5', 'M3.0', 'M1.9',
       'M5.7', 'X5.7', 'M3.7', 'M5.9', 'M5.1', 'M2.5', 'C4.0', 'C3.2',
       'M7.4', 'C5.4', 'C7.9', 'X2.0', 'M8.2', 'X1.9', 'X4.0', 'C1.6',
       'C6.5', 'M6.7', 'C5.6', 'X1.7', 'X20.', 'X5.6', 'M7.9', 'M2.3',
       'X14.', 'C2.2', 'M6.3', 'X5.3', 'C9.5', 'M9.1', 'X1.6', 'X1.3',
       'M2.8', 'M9.9', 'M7.1', 'X3.4', 'M5.0', 'M2.2', 'C3.1', 'C9.6',
       'M2.6', 'X1.5', 'C4.5', 'C3.7', 'M8.5', 'C3.3', 'X3.3', 'X4.8',
       'M8.7', 'M2.4', 'X3.1', 'C5.2', 'M2.9', 'M4.6', 'M2.7', 'M1.1',
      

In [35]:
## check previous results, you can see an irrelevant record, drop it

##############Your Sol##############


In [36]:
## The importance column represents the power of the solar flare; it consists of two parts, a letter and a number
## Split the importance column into two parts: the first contains the letter and the second contains the number as a float
## Use the pandas str.split function

##############Your Sol##############

In [37]:
## There are some columns that are dates, so we need to convert them
## Search for pd.to_datetime and convert Start_Datetime, End_Datetime, CME_Date to dates

##############Your Sol##############


In [38]:
## Get the average of the numerical part of the importance column

##############Your Sol##############


4.085106382978725

In [39]:
## Count the number of solar flares of each importance type

##############Your Sol##############


importance_1
B      6
C     96
M    185
X     89
Name: importance_2, dtype: int64