<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST2312/blob/main/CST2312_PY4E_Regex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CST2312 - Class #14, from Charles Severance, Python for Everybody, Lesson 12 - Regular Expressions
updated by Professor Patrick, 25-Oct-2021


This notebook works with a Github repository in ProfessorPatrickSlatraigh/CST2312. 

That repository includes the files "mbox-short.txt" and "mbox.txt". 
The repo can be cloned for use in Google Colab, or a file in the repo can be read directly by its URL.  Alternatively, Google Drive can be used to open a Python file handle for a file which persists beyond the current Colab session.  Before working with files in Github or Google Drive, the first section of this notebook describes how local files can be uploaded to a Colab session.  Files which are uploaded to the content area of Colab do not persist after the Colab session ends.



Here is the mbox-short.txt file as a reference: https://www.py4e.com/code3/mbox-short.txt , which is a shortened version of the file: https://www.py4e.com/code3/mbox.txt .  Both files are text files which contain a series of email messages.  The files are used as references in exercises in the Charles Severance book Python for Everybody (py4e.com).



---



The following code snippet imports pandas, which is used by the Github file-loading example later in this notebook.

In [1]:
# let's import pandas as pd so that we have it available
import pandas as pd


The following code snippet imports re -- the regular expressions library in Python, which we will use in this notebook.

In [2]:
# let's import the regular expressions library so that we have it available
import re



---



## **UPLOADING TO COLAB EVERY TIME**

The first example reads "mbox-short.txt" from the Google Colab content folder "sample_data".  In order to do that, the "mbox-short.txt" file needs to be uploaded to the "sample_data" folder.  That upload is temporary for the Google Colab session - the "mbox-short.txt" file will go away after you finish with your active Colab notebook.  Note that this method does not require pandas.


Use the panel on the left of your Colab session to navigate to the content area and the "sample_data" folder.  Then use the three vertical dots to the right of the name "sample_data" to choose 'Upload' and navigate to the "mbox-short.txt" file on your computer.


Use the three vertical dots to the right of your uploaded "mbox-short.txt" file in the "sample_data" content folder to choose "Copy path" and that will put the full file path in your clipboard.  If the path is not the same as in the following call to the open() function then replace the string for the file name with the path from your clipboard - paste it in as the argument to open().

In [None]:
colab_handle = open("/content/sample_data/mbox-short.txt")

Now you can use the print function to see the attributes of the new colab_handle you created to the "mbox-short.txt" file in the content folder "sample_data" on Google Colab.


In [None]:
print(colab_handle)

<_io.TextIOWrapper name='/content/sample_data/mbox-short.txt' mode='r' encoding='UTF-8'>


You can use a for loop to print the contents of "mbox-short.txt".

In [None]:
for line in colab_handle :
    print(line)
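
Each line read from a file keeps its trailing newline, and print() adds another, so the loop above shows a blank line between lines.  The following is a minimal sketch of one way to avoid that, using rstrip() and a with block that closes the file automatically.  The temporary file and its contents are made up here as a stand-in for "mbox-short.txt" so that the sketch runs without any upload.

```python
import os
import tempfile

# Hypothetical stand-in for mbox-short.txt so the sketch runs anywhere
sample_text = "From: alice@example.com\nSubject: hello\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

# 'with' closes the file automatically when the block ends
stripped_lines = []
with open(sample_path) as sample_handle:
    for line in sample_handle:
        stripped_lines.append(line.rstrip())  # drop the trailing newline

for line in stripped_lines:
    print(line)

os.remove(sample_path)  # clean up the temporary file
```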


# **READING FILES FROM GITHUB**

Now let's try reading the same file from a Github repository (repo).  We will use the CST2312 repo in the ProfessorPatrickSlatraigh account on Github.  The file "mbox-short.txt" was uploaded to that repo.  

From Github we navigated to the "mbox-short.txt" file and viewed it in its raw format using the "raw" button to the right of the file name.  While in raw viewing mode in a browser, we copied the URL from the browser to the clipboard.  Please note that this works with public repos, not private repos.


In [None]:
git_handle = pd.read_fwf("https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/CST2312/main/mbox-short.txt")

In [None]:
print(git_handle)

Storing files in Github gives us persistence.  That is, when we are done with our Google Colab session the files on Github remain and can be used again.  And our Google Colab notebooks should work each time we open them without the need for us to upload files to the content area on Google Colab for every session.
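
Note that pd.read_fwf() returns a pandas DataFrame rather than a file handle.  If a plain list of text lines is preferred, one alternative (not the method used in this notebook) is urlopen() from the standard library.  The sketch below is only an illustration: it points urlopen() at a temporary local file through a file:// URL, which is a made-up stand-in for the raw Github URL so that the code runs without a network connection.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical stand-in for the raw Github URL: a local file exposed as file://
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("From: alice@example.com\nSubject: hello\n")
    sample_path = tmp.name
sample_url = Path(sample_path).as_uri()

# urlopen() returns a response; read() gives bytes and decode() gives a string
with urlopen(sample_url) as response:
    url_text = response.read().decode("utf-8")

url_lines = url_text.splitlines()  # one list item per line, newlines removed
for line in url_lines:
    print(line)

os.remove(sample_path)  # clean up the temporary file
```

With a real raw URL in place of sample_url, the same read, decode and splitlines steps should apply.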

# **READING FILES FROM GOOGLE DRIVE**

We can also have persistent files stored in Google Drive.  To read files from Google Drive we will need to import the drive module from google.colab.  We will also need to have Google Drive give stream access to Google Colab.  If the files are on a different Google Drive account from the Google Colab account then be sure to have permission of the Google Drive owner for access to the file.

You can use the drive module from google.colab to mount your entire Google Drive to Colab by:

1. Executing the below code which will provide you with an authentication link

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


2. Open the link

3. Choose the Google account whose Drive you want to mount

4. Allow Google Drive Stream access to your Google Account

5. Copy the code displayed, paste it into the text box in the cell output, and press Enter

Once the Drive is mounted, you’ll get the message “Mounted at /content/gdrive”, and you’ll be able to browse through the contents of your Drive from the file-explorer pane.

You can even write directly to Google Drive from Colab using the usual file/directory operations.

In [None]:
!touch "/content/gdrive/My Drive/sample_file.txt"

This will create a file in your Google Drive, and will be visible in the file-explorer pane once you refresh it.  Notice that the path within the content area is different from the "sample_data" folder we used earlier for files uploaded directly to Google Colab.  The content area will have a "gdrive" folder after you have authenticated with Google Drive.  Within the "gdrive" folder there should be a folder structure according to your Google Drive folders.

If your Google Drive folder had the file "mbox-short.txt" within the "My Drive" folder then you would be able to open that file with the following code:

In [4]:
gdrive_handle = open("/content/gdrive/My Drive/mbox-short.txt")

Now you can use the print function to see the attributes of the new gdrive_handle you created to the "mbox-short.txt" file in the content folder "gdrive/My Drive/" on Google Drive.


In [None]:
print(gdrive_handle)

<_io.TextIOWrapper name='/content/gdrive/My Drive/mbox-short.txt' mode='r' encoding='UTF-8'>


As in the earlier Google Colab example, you can now use a for loop to print the contents of "mbox-short.txt" in Google Drive.

In [None]:
for line in gdrive_handle :
    print(line)


# **Reading Pastebin and Other HTTP with GET**

This section reads the "mail-short.txt" file from a Pastebin posting using the RAW format in Pastebin and the HTTP GET from the requests module.  The source file is online at: https://pastebin.com/raw/ADPQe6BM 

First import the requests module as rq

In [None]:
import requests as rq

Then use the GET command to read the RAW text file on Pastebin

In [None]:
http_handle = rq.get('https://pastebin.com/raw/ADPQe6BM')
list_of_lines = http_handle.text.splitlines()

Print the response to check that the HTTP request worked

In [None]:
print(http_handle)

<Response [200]>


And print the result

In [None]:
for line in list_of_lines:
    print(line)
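
The splitlines() call used above turns one long string into a list of lines and removes the line endings, whichever style they use.  A small sketch with made-up text standing in for http_handle.text:

```python
# Made-up text with mixed line endings, standing in for http_handle.text
sample_text = "From: alice@example.com\r\nSubject: hello\nX-Test: yes"

# splitlines() handles both '\r\n' and '\n' and drops them from each item
sample_lines = sample_text.splitlines()
print(sample_lines)
```

Because the line endings are already gone, printing each item does not produce the blank lines seen when printing lines read directly from a file handle.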



---



# **Using Regular Expressions**



---



# Using re.search()

In PY4E Chapter 12 (Regular Expressions), Part I, the search() function is used to process lines in our mbox-short.txt file.  It returns a match object where there is a match and None otherwise, so it can be used directly as the condition of an if statement.

In [None]:
# gdrive_handle should already have mbox-short.txt open
# be sure that the pointer of gdrive_handle is at the start of the file

for line in gdrive_handle :
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)
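
To see what re.search() hands back, here is a short sketch using two made-up lines in place of the file.  A match object counts as true in an if statement and None counts as false.

```python
import re

# Made-up sample lines standing in for lines of mbox-short.txt
sample_lines = ["From: alice@example.com", "Subject: hello"]

for line in sample_lines:
    result = re.search('^From:', line)
    # result is a match object when the pattern is found, otherwise None
    print(repr(line), '->', result)

match = re.search('^From:', sample_lines[0])     # pattern found
no_match = re.search('^From:', sample_lines[1])  # pattern not found
print(match.group())  # the text that matched
```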


Let's do a little quick housekeeping -- remember that we can use the seek() method on a file handle to change the pointer position in the file.  Let's reset the gdrive_handle pointer to the start of the file so that we do not need to open it again.


In [None]:
gdrive_handle.seek(0)
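
The effect of seek(0) can be tried without a real file.  This sketch uses io.StringIO, an in-memory object that behaves like an open text file, with made-up contents.

```python
import io

# In-memory stand-in for an open text file
sample_handle = io.StringIO("line one\nline two\n")

first_pass = [line.rstrip() for line in sample_handle]
# The pointer is now at the end of the file, so a second loop reads nothing
second_pass = [line.rstrip() for line in sample_handle]

sample_handle.seek(0)  # move the pointer back to the start
third_pass = [line.rstrip() for line in sample_handle]

print(first_pass)
print(second_pass)
print(third_pass)
```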

Give it a try and see if you can process the file again but this time, add the housekeeping exercise of resetting the file at the end of the snippet.

In [None]:
for line in gdrive_handle :
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)

gdrive_handle.seek(0)


This example looks for lines whose first character is an 'X', followed by any number of characters and then a colon ':'.  Let's insert a counter 'i' to stop after the first twenty instances of a match.  And let's conclude with some housekeeping to reset the file pointer.

In [None]:
i = 0
for line in gdrive_handle :
    line = line.rstrip()
    if re.search('^X.*:', line) :
        print(line)
        i += 1
        if i >= 20:
            break

gdrive_handle.seek(0)


Let's refine that last snippet to select only lines which begin with 'X-', followed by one or more non-whitespace characters ('\S+'), followed by a colon ':'.

In [None]:
i = 0
for line in gdrive_handle :
    line = line.rstrip()
    if re.search(r'^X-\S+:', line) :
        print(line)
        i += 1
        if i >= 20:
            break

gdrive_handle.seek(0)
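
The difference between the loose pattern '^X.*:' and the stricter pattern '^X-\S+:' is easier to see on a few hand-picked lines.  The lines below are made up for illustration and the sketch does not need the file to be open.

```python
import re

# Made-up lines: the first two look like real mail headers, the last two do not
sample_lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-: Very short",
    "X-Plane is behind schedule: two weeks",
]

# 'X', then any characters, then a colon
loose = [line for line in sample_lines if re.search(r'^X.*:', line)]

# 'X-', then one or more non-whitespace characters, then a colon
strict = [line for line in sample_lines if re.search(r'^X-\S+:', line)]

print(len(loose), "lines match the loose pattern")
print(len(strict), "lines match the strict pattern:", strict)
```

The raw string prefix r'...' keeps Python from treating the backslash in '\S' as a string escape.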




---



# Using re.findall()

In PY4E Chapter 12 (Regular Expressions), Part II, the findall() method is used to process lines in our mbox-short.txt file and return a list of matches.

In [20]:
x = 'My 2 favorite numbers are 19 and 42.'
y = re.findall('[0-9]+', x)
print(y)


['2', '19', '42']


The next example searches the string x for any uppercase vowel and returns a list of those found.

In [21]:
y = re.findall('[AEIOU]+', x)
print(y)

[]
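
The list is empty because the string x has no uppercase vowels.  As a quick check of the same pattern, here is a sketch with a made-up string that does contain some.

```python
import re

# Made-up string with uppercase vowels, both alone and in a run
vowel_text = 'I love IOU notes from Ohio'

# '[AEIOU]+' matches one or more uppercase vowels in a row
upper_vowels = re.findall('[AEIOU]+', vowel_text)
print(upper_vowels)
```

Each run of adjacent uppercase vowels comes back as a single item, because of the '+' after the character set.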


*Now, greedy matching.*

In [None]:
x = 'From: Using the: character'
y = re.findall('^F.+:', x)
print(y)


A non-greedy version of that last example.

In [None]:
x = 'From: Using the: character'
y = re.findall('^F.+?:', x)
print(y)

Greedy is the default for '+' and '*' -- they must be followed by a '?' to be non-greedy, which is to select the shortest matching string, not the longest.
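
Putting the two patterns side by side on the same string shows the difference.  This sketch repeats the string from the cells above.

```python
import re

x = 'From: Using the: character'

# Greedy: '.+' grabs as much as it can, so the match runs to the LAST colon
greedy = re.findall('^F.+:', x)

# Non-greedy: '.+?' grabs as little as it can, so the match stops at the FIRST colon
non_greedy = re.findall('^F.+?:', x)

print(greedy)      # ['From: Using the:']
print(non_greedy)  # ['From:']
```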

Using greedy matching to find email addresses in the mbox-short.txt file.  (Again, we will limit our process to the first twenty instances.)

In [None]:
i = 0
for line in gdrive_handle :
    line = line.rstrip()
    address = re.findall(r'\S+@\S+', line)
    if address :
        print(address)
        i += 1
    if i >= 20:
        break

gdrive_handle.seek(0)
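
Here is a sketch of what the pattern '\S+@\S+' picks up on two sample lines written in the style of the mailbox file.  Note that '\S' matches any non-whitespace character, so punctuation next to an address is captured too.

```python
import re

# Sample lines in the style of the mailbox file (used here for illustration)
plain_line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
bracket_line = 'for <source@collab.sakaiproject.org>; Sat, 05 Jan 2008'

plain_result = re.findall(r'\S+@\S+', plain_line)
bracket_result = re.findall(r'\S+@\S+', bracket_line)

print(plain_result)    # ['stephen.marquard@uct.ac.za']
print(bracket_result)  # ['<source@collab.sakaiproject.org>;']
```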

Let's fine-tune that last snippet to look only at lines which begin with 'From ' -- including the space.  The regular expression in the parentheses is the extraction pattern, which is the part that findall() returns.  The pattern outside of the parentheses, including any spaces, must also be present for the line to match, but it is not included in the result.

In [None]:
i = 0
for line in gdrive_handle :
    line = line.rstrip()
    address = re.findall(r'^From (\S+@\S+)', line)
    if address :
        print(address)
        i += 1
    if i >= 20:
        break

gdrive_handle.seek(0)
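
The role of the parentheses can be checked on single lines.  In this sketch the sample lines are in the style of the mailbox file and are used only for illustration.

```python
import re

from_line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
header_line = 'From: stephen.marquard@uct.ac.za'

# Without parentheses, findall() returns everything the pattern matched
whole_match = re.findall(r'^From \S+@\S+', from_line)

# With parentheses, findall() returns only the part inside them
address_only = re.findall(r'^From (\S+@\S+)', from_line)

# 'From:' is followed by a colon, not a space, so this line does not match
no_address = re.findall(r'^From (\S+@\S+)', header_line)

print(whole_match)   # ['From stephen.marquard@uct.ac.za']
print(address_only)  # ['stephen.marquard@uct.ac.za']
print(no_address)    # []
```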

How would you print the email addresses without the string quotes and list brackets?  Try your code in the following cell.

In [None]:
# your code here 



---



# Regular Expressions for String Extraction

In PY4E Chapter 12 (Regular Expressions), Part III, the application of regular expressions to string extraction is discussed.
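
As a minimal sketch of what string extraction with a regular expression can look like, the following pulls the host name out of an email address on one sample line.  The sample line and the variable names are chosen here for illustration only.

```python
import re

sample_line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# '@' marks where to start; '([^ ]*)' captures every character up to the next space
host = re.findall(r'@([^ ]*)', sample_line)

# Same extraction, but only for lines which begin with 'From '
host_from_line = re.findall(r'^From .*@([^ ]*)', sample_line)

print(host)            # ['uct.ac.za']
print(host_from_line)  # ['uct.ac.za']
```

The character set '[^ ]' means "any character that is not a space", so the capture stops at the first space after the '@'.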



---



*With thanks to this reference article:  
Neptune.ai blogs - How to Deal with Files in Google Colab: Everything You Need to Know,* https://neptune.ai/blog/google-colab-dealing-with-files-2


*And, of course to Charles Severance and his work Python for Everybody at* https://py4e.com

---

