# Working with text - Regular Expressions
When working with text data, regular expressions are one of the most indispensable and powerful tools a programmer has at hand. Often referred to as 'regexps', they can be used to search texts for certain words and characters, to substitute one word or part of a word for another and, most prominently, to extract relevant data using patterns. These patterns are what make them so interesting and useful.

For instance, suppose you are looking for words consisting of exactly 7 letters, or you are interested in names, that is, words that start with a capital letter. Perhaps you want to extract decimal numbers, or find any prices scattered across the text (a number followed by a currency sign). How about gathering contact information, such as the phone numbers and emails that appear in the documents? Those follow exact patterns, such as (whitehouse phone number) or bestemailaddress@google.com. You might want to gather all links, which are http:// followed by a domain name or address, or to get the whole sentences that contain any of the above. This is where regexps shine.
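As a small preview (the pattern syntax itself belongs to the later missions), here is a rough sketch of three of these searches using Python's built-in `re` module. The sample sentence and the patterns are illustrative assumptions, not a definitive way to match names or prices.

```python
import re

# Hypothetical sample sentence, made up for this preview
text = "Alice paid $12.50 for a ticket to Wonderland and brought 3 friends."

# Words consisting of exactly 7 letters
seven_letter_words = re.findall(r"\b[A-Za-z]{7}\b", text)

# Words that start with a capital letter followed by lowercase letters
capitalized = re.findall(r"\b[A-Z][a-z]+\b", text)

# A dollar sign followed by a number with an optional decimal part
prices = re.findall(r"\$\d+(?:\.\d+)?", text)

print(seven_letter_words)  # ['brought', 'friends']
print(capitalized)         # ['Alice', 'Wonderland']
print(prices)              # ['$12.50']
```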

In this lesson we will cover the basic techniques that let you accomplish everything mentioned above and more. Upon finishing it you'll feel confident working with text, preprocessing it and applying regular expressions, all using Python.

## Mission 1 - working with files
---
An essential step before working with text is to read the files and get acquainted with their contents.
This mission explains how to read a single file and a number of files, and as a practical exercise we'll find the number of words each txt file contains.

### Read files
To work with text files we first need to read them into memory.

Here is a small file that contains part of a poem from Alice in Wonderland by Lewis Carroll.

In [None]:
beautiful_soup.txt

In [None]:
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Soup of the evening, beautiful Soup!
...

To open a file we can use the `open()` function, which creates a file object:

In [51]:
f = open('beautiful_soup.txt', 'r')

The first argument is the filename, the second is the mode. Mode `'r'` means the file is opened for reading only.

Now the contents of the file can be read and assigned to a variable using `f.read()`:

In [52]:
file_str = f.read()
file_str

'Beautiful Soup, so rich and green,\nWaiting in a hot tureen!\nWho for such dainties would not stoop?\nSoup of the evening, beautiful Soup!\nSoup of the evening, beautiful Soup!\n...'

Python stores the whole text in a variable as a single string.

Notice the `\n` character in the places where the original had line breaks. `\n` is a special whitespace character that denotes a new line. When you open files in text editors such as Notepad, the `\n` characters are still there, but they are rendered as line breaks rather than displayed.

Another example of a whitespace character is `\t` which denotes a tab.
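To see the difference between how these characters are stored and how they are rendered, the sketch below compares `repr()` with `print()`. The string is a made-up variation of the poem's first lines with a tab added for illustration.

```python
# Hypothetical string: a tab was inserted to show both whitespace characters
snippet = "Beautiful Soup,\tso rich and green,\nWaiting in a hot tureen!"

# repr() shows the escape sequences, the way a notebook displays a string
print(repr(snippet))

# print() renders them: the tab becomes a gap, the newline starts a new line
print(snippet)

# Each escape sequence is a single character in the string
newline_length = len("\n")
tab_count = snippet.count("\t")
print(newline_length, tab_count)  # 1 1
```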

[Best practice](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) when working with files is to use the `with` statement, as follows:

In [32]:
with open('beautiful_soup.txt', 'r') as file:
    file_str = file.read()
file_str

'Beautiful Soup, so rich and green,\nWaiting in a hot tureen!\nWho for such dainties would not stoop?\nSoup of the evening, beautiful Soup!\nSoup of the evening, beautiful Soup!\n...'

This way Python automatically closes the file as soon as the `with` block ends, even if an error occurs inside it. The file handle is released promptly, which especially matters when you work with many or very large files.
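One way to check this behaviour is the file object's `closed` attribute. The sketch below first writes a throwaway file (the name `demo.txt` and the temporary folder are assumptions made so the example runs anywhere) and then reads it back.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on beautiful_soup.txt
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with open(demo_path, "w", encoding="utf-8") as file:
    file.write("Soup of the evening, beautiful Soup!")

with open(demo_path, "r", encoding="utf-8") as file:
    file_str = file.read()
    closed_inside = file.closed  # still open inside the block

closed_after = file.closed  # closed once the block has ended

print(closed_inside, closed_after)  # False True
```

With a plain `f = open(...)`, the file stays open until `f.close()` is called explicitly.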

What if we want to read several files?

To read several documents we could type each filename manually, read the file and assign its contents to a new variable. That works, but it is not practical if we have many files or the names are hard to type. How about reading all the txt files at once?

To do that, we first build a list of filenames using the `listdir()` function of the `os` module:

In [47]:
import os
path = 'short_poems/'
filenames = os.listdir(path)
filenames

['fire_and_ice.txt', 'nothing_gold_can_stay.txt', 'risk.txt', 'trees.txt']

Now we can loop through the filenames, choosing only those with `.txt` at the end:

In [50]:
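for filename in filenames:
    if filename.endswith(".txt"):
        # The poems contain non-ASCII characters, so state the encoding explicitly
        with open(path+filename, 'r', encoding='utf-8') as file:
            file_str = file.read()
        # A bare expression inside a loop is not displayed, so print the repr
        print(repr(file_str))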

'Some say the world will end in fire,\nSome say in ice.\nFrom what I’ve tasted of desire\nI hold with those who favor fire.\nBut if it had to perish twice,\nI think I know enough of hate\nTo say that for destruction ice\nIs also great\nAnd would suffice.\n\nAuthor: Robert Frost'

'Nature’s first green is gold, \nHer hardest hue to hold. \nHer early leaf’s a flower; \nBut only so an hour. \nThen leaf subsides to leaf. \nSo Eden sank to grief, \nSo dawn goes down to day. \nNothing gold can stay.\nAuthor: Robert Frost'

'And then the day came,\nwhen the risk\nto remain tight\nin a bud\nwas more painful\nthan the risk\nit took\nto Blossom.\nAuthor: Anaïs Nin'

'I think that I shall never see\nA poem lovely as a tree.\n\nA tree whose hungry mouth is prest\nAgainst the earth’s sweet flowing breast;\n\nA tree that looks at God all day,\nAnd lifts her leafy arms to pray;\n\nA tree that may in Summer wear\nA nest of robins in her hair;\n\nUpon whose bosom snow has lain;\nWho intimately lives with rain.\n\nPoems are made by fools like me,\nBut only God can make a tree.\n\nAuthor: Joyce Kilmer'

Note how we pass `path+filename` as the argument to the `open()` function. Because `path` ends with a slash, `path+filename` forms a single `path/filename.txt` string.
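Plain concatenation depends on `path` ending with a separator. As an alternative sketch, `os.path.join()` from the standard library inserts the separator when it is missing. The variable names below are illustrative.

```python
import os

path_with_slash = "short_poems/"
path_without_slash = "short_poems"
filename = "risk.txt"

# Concatenation only works when the folder name ends with a slash
good_concat = path_with_slash + filename
bad_concat = path_without_slash + filename
print(good_concat)  # short_poems/risk.txt
print(bad_concat)   # short_poemsrisk.txt

# os.path.join adds the platform's separator where needed
joined = os.path.join(path_without_slash, filename)
print(joined)
```

On Windows the joined path uses a backslash as the separator, which `open()` accepts just as well.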

Great! Reading text files is a breeze now.

### Count words
To count the words in a document, we need to obtain a list of words. The `str.split()` method splits a string on any whitespace and returns a list of words, or tokens.

In [38]:
sentence = 'Brown fox jumps over a lazy dog.'
word_list = sentence.split()
word_list

['Brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog.']

We can find the number of words from the length of the list:

In [39]:
print('Words: ',len(word_list))

Words:  7
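Two details of `str.split()` are worth a quick sketch: called without arguments it treats any run of whitespace (spaces, tabs, newlines) as one separator, and punctuation stays attached to the words. The sample string is a made-up variation of the sentence above.

```python
# Hypothetical sample: double space, a tab and a newline between words
messy = "Brown  fox\tjumps\nover a lazy dog."

# No argument: split on any run of whitespace
words = messy.split()
print(words)  # ['Brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog.']

# Explicit ' ' argument: split on every single space only
by_space = messy.split(" ")
print(by_space)  # ['Brown', '', 'fox\tjumps\nover', 'a', 'lazy', 'dog.']

print("Words: ", len(words))
```

Note that the last token is `'dog.'` with the full stop included. That does not change the count, but it matters if the words are compared or looked up later.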


If we are interested in how many lines there are in the document, we can split the document on the newline character `\n`, passing it as a parameter to the `str.split()` method.

In [40]:
with open('beautiful_soup.txt', 'r') as file:
    file_str = file.read()
    lines = file_str.split('\n')
    print('Lines: ',len(lines))

Lines:  6
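One caveat with this approach: if the text ends with a newline, `split('\n')` returns an extra empty string at the end, so the count is one too high. A minimal sketch of the difference, using a made-up two-line string and the built-in `str.splitlines()` method as an alternative:

```python
# Hypothetical text that ends with a trailing newline
text = "Soup of the evening,\nbeautiful Soup!\n"

# split('\n') keeps an empty string after the final newline
split_lines = text.split("\n")
print(split_lines)  # ['Soup of the evening,', 'beautiful Soup!', '']

# splitlines() does not produce that trailing empty string
clean_lines = text.splitlines()
print(clean_lines)  # ['Soup of the evening,', 'beautiful Soup!']

print("Lines: ", len(clean_lines))
```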


### Practice
Find out how large the documents are, measured in number of words.

1. Read in all .txt documents.
    - Build a list of filenames from the `short_poems` folder.
    - Loop through only the txt filenames.
2. Calculate the number of words in each document.
    - Split the current document into words and assign the list to a variable.
    - Print the name of the document and the number of words it contains.

## Mission 2 - regular expressions p.1
TBD

## Mission 3 - regular expressions p.2
TBD

## Mission 4 - Project - CIA files
The student is challenged to extract sentences that contain dates from a large number of documents on disk, say, 50 files. The files describe secret projects spanning the period from 1900 onwards, even into the future. The aim is to present the filenames sorted by the extracted dates, each with its first line (a header) and the sentences that contained the date as a recap.

The catch is that dates appear in the text in various forms, such as `1941 Sept 1`, `1941 September 1`, `1 September 1941`, `1941 Sept`, `1941-09-01`, `1941` etc., i.e. with different positions of year, month and day and different representations of each. So a major part of the challenge is to normalize the dates to a single form and then sort based on the findings.

On completion, users will have a powerful tool that goes through all the text files on a designated path, finds any dates, and presents the files and their recaps sorted by date.