# File Handling

## Reading in a file 
Text files are one of the most common file formats you'll have to deal with.

- Cross-platform
- Usually end in .txt

We use Python's built-in `open()` function and provide a file path.  

https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

From the JupyterLab sidebar, look for **months.txt** by clicking on the **support_files** directory and then on the **datasets** directory.

We provide these in the order we clicked on them to tell this notebook where to find the file.

In [7]:
f = open("./support_files/datasets/months.txt")

In [8]:
# creates a file object
f

<_io.TextIOWrapper name='./support_files/datasets/months.txt' mode='r' encoding='UTF-8'>

In [9]:
# now you can read in the file, contents returned as a string
print(f.read())

January
February
March
April
May
June
July
August
September
October
November
December


Since we have read the entire contents of the file, the cursor that reads the file is now at the end.  

If we want to read the file again we need to reset the position of the cursor to the start of the file using the [seek()](https://python-reference.readthedocs.io/en/latest/docs/file/seek.html) method.



In [14]:
# go to the start of the file
f.seek(0)

0
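As a minimal sketch of the cursor behavior described above, the example below writes a small stand-in file first so it can run anywhere. The file name `months_demo.txt` and its contents are assumptions for illustration, not the workshop dataset.

```python
import os
import tempfile

# Create a small scratch file (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), "months_demo.txt")
out = open(path, "w")
out.write("January\nFebruary\nMarch")
out.close()

f = open(path)
first_read = f.read()    # reads everything; the cursor is now at the end
second_read = f.read()   # nothing is left to read, so this is an empty string
f.seek(0)                # move the cursor back to the start of the file
third_read = f.read()    # the full contents again
f.close()

print(repr(second_read))
print(first_read == third_read)
```

The second `read()` returns an empty string rather than raising an error, which is why forgetting `seek(0)` can look like an empty file.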

## Reading in a file line by line

- End-of-line characters are control characters that invisibly mark the end of a line in a text file
- You can create one in a string with `"\n"`
- The idea dates back to typewriters, where you had to enter a line return

In [15]:
print("Here's the first line \n and here's the second")

Here's the first line 
 and here's the second


In [16]:
f.readline()

'January\n'

## Stripping extra characters
- `readline()` returns the line with its trailing newline
- `print()` adds a newline of its own
- you can end up with a lot of extra blank lines

In [18]:
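f = open("support_files/datasets/months.txt")
# named "line" rather than "next" to avoid shadowing Python's built-in next()
line = f.readline()
# readline() returns an empty string at the end of the file
while line != "":
    print(line)
    line = f.readline()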

January

February

March

April

May

June

July

August

September

October

November

December


In [20]:
f = open("support_files/datasets/months.txt")
# named "line" rather than "next" to avoid shadowing Python's built-in next()
line = f.readline()
while line != "":
    # use strip to remove the extra newline
    print(line.strip())
    line = f.readline()

January
February
March
April
May
June
July
August
September
October
November
December
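There is more than one way to avoid the doubled newlines. The sketch below uses a hypothetical in-memory list of lines (not the real file) to compare `strip()`, which removes all leading and trailing whitespace, with `rstrip("\n")`, which removes only trailing newlines, and with telling `print()` not to add its own newline.

```python
# Hypothetical lines, shaped like what readline() would return.
lines = ["January\n", "  February  \n", "March"]

# Option 1: keep the line as-is and stop print() from adding a newline.
for line in lines:
    print(line, end="")
print()

# Option 2: strip() removes whitespace from both ends, including spaces.
stripped = [line.strip() for line in lines]

# Option 3: rstrip("\n") removes only newlines at the end of the line.
right_stripped = [line.rstrip("\n") for line in lines]

print(stripped)
print(right_stripped)
```

Which one fits depends on whether leading or trailing spaces in the data are meaningful.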


## Reading in Every Line
- `readlines()` reads in all the lines and returns them as a list

In [22]:
f = open("support_files/datasets/months.txt")
f.readlines()

['January\n',
 'February\n',
 'March\n',
 'April\n',
 'May\n',
 'June\n',
 'July\n',
 'August\n',
 'September\n',
 'October\n',
 'November\n',
 'December']

In [23]:
f = open("support_files/datasets/months.txt")
for month in f.readlines():
    print("Month = " + month.strip())

Month = January
Month = February
Month = March
Month = April
Month = May
Month = June
Month = July
Month = August
Month = September
Month = October
Month = November
Month = December


## Writing to a file

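- r = read file (the default)
- w = write to file, erasing any existing data, create if it doesn't exist
- a = write new data to end of file, create if it doesn't exist
- b = binary mode (combined with another mode, for example rb)
- r\+ = read and write to an existing file
- w\+ = read and write to file, erasing any existing data, create if it doesn't exist
- a\+ = read, append to file, create if it doesn't exist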

more file modes: https://docs.python.org/3/library/functions.html#open
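To see how a few of these modes differ, here is a small sketch that works on a scratch file in a temporary directory. The file name `modes_demo.txt` is an assumption for illustration, so nothing in the support files is touched.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file, or erases it if it already exists.
f = open(path, "w")
f.write("first\n")
f.close()

# "a" keeps the existing data and writes at the end.
f = open(path, "a")
f.write("second\n")
f.close()

# "r+" opens an existing file for reading and writing without erasing it.
f = open(path, "r+")
before = f.read()
f.close()

# Opening with "w" again wipes the previous contents.
f = open(path, "w")
f.write("third\n")
f.close()

f = open(path, "r")
after = f.read()
f.close()

print(repr(before))
print(repr(after))
```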

In [24]:
f = open("months.txt", "a+")

In [26]:
# this would overwrite our month data if we provided the full file path to the support files
f = open("months.txt", "w")
f.write("Erasing all the things and adding this!")
f.close()

### Closing the file
- `write()` does not close the file
- changes may not show up in the file until you close (or flush) it
- closing is good practice; don't leave a file handle open

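## Opening a file with `with`
- `with` is a statement, not a function
- less error-prone than calling `close()` yourself
- automatically closes the file handle when the `with` block is finished, even if an error occurs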
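A minimal sketch of the `with` statement, again using a hypothetical scratch file (`with_demo.txt`) rather than the workshop data. The `closed` attribute on the file object shows that the handle is closed once the block ends.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

# The file is closed automatically at the end of the indented block.
with open(path, "w") as f:
    f.write("January\nFebruary\n")
closed_after_write = f.closed

# Reading works the same way; no explicit close() is needed.
with open(path) as f:
    months = [line.strip() for line in f]

print(closed_after_write)
print(months)
```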

# Parsing Text: Advanced String Operations

Most of the time when we're parsing text files, they will have more data than a single column.

Let's look at how to parse a file that contains voting data about radishes.

There are two columns of data separated by a space, a dash, and then another space. 

The first column is the name of the voter and the second column is the name of the radish variety.

In [31]:
in_file = open("./support_files/datasets/small_radish.txt", "r")

In [32]:
for line in in_file:
    print(line.strip())

Evie Pulsford - April Cross
Matilda Condon - April Cross
Samantha Mansell - Champion
geronima trevisani - cherry belle
Alexandra Shoebridge - Snow Belle


Use the Python string method [split()](https://docs.python.org/3.3/library/stdtypes.html#str.split) to separate each line into two parts.

In [34]:
for line in open("./support_files/datasets/small_radish.txt"):
    # strip trailing new lines 
    line = line.strip()
    parts = line.split(" - ")
    print(parts)

['Evie Pulsford', 'April Cross']
['Matilda Condon', 'April Cross']
['Samantha Mansell', 'Champion']
['geronima trevisani', 'cherry belle']
['Alexandra Shoebridge', 'Snow Belle']


### Print who voted for the April Cross radish

In [36]:
for line in open("./support_files/datasets/small_radish.txt"):
    line = line.strip()
    parts = line.split(" - ")
    name, vote = parts
    if vote == "April Cross":
        print(name + " voted for April Cross!")

Evie Pulsford voted for April Cross!
Matilda Condon voted for April Cross!


## Regular Expressions

Now that we know how to read files and extract simple data using split to separate on a particular character, we'll want to do additional analysis of text data.

This can be done using regular expressions (aka regex or regexes) to match different parts of the string.

Regex functionality in Python resides in a module named `re`.

https://docs.python.org/3/howto/regex.html

In [37]:
import re

### Compiling the regular expression

Regular expressions use a specific set of tokens to determine what to match.  

Most characters only match themselves, so if you're looking for **r** then you would type the letter.  

There are two categories of characters that can be used to match multiple characters and find patterns in strings.

1. Metacharacters
2. Special Sequences


#### Metacharacters

Metacharacters don't match themselves; instead they tell the regex engine to do something special.  

The primary metacharacters are:

`[ ] . ^ $ * + ? { } | ( ) \`

The first metacharacters we'll look at are the square brackets `[` and `]`. 

They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'.

Ex: [abc] matches the characters a, b, or c

Ex: [a-z] matches all lowercase characters of the alphabet

The `.` metacharacter matches any one character except a newline

The `^` metacharacter specifies that the characters that follow must be at the start of the string

The `$` metacharacter specifies that the preceding characters must be at the end of the string

The `*` and `+` metacharacters control how many times the preceding item can repeat: `*` means zero or more times and `+` means one or more times

The `?` metacharacter makes the preceding item optional (zero or one time)

The `{` and `}` metacharacters specify the number of repetitions. ex: `a{3}` matches `aaa`

The `|` metacharacter separates alternatives. ex: A OR B is `A|B`

The `(` and `)` metacharacters are used to create groups within a match
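The metacharacters above can be tried out on a short string. This is a small sketch using `re.search()`, which returns the first match or `None` when nothing matches; the sample phrase and variable names are chosen for illustration only.

```python
import re

phrase = "Mary had a little lamb."

in_class = re.search("[abc]", phrase).group(0)          # first a, b, or c
any_char = re.search("l.ttle", phrase).group(0)         # . stands in for the "i"
at_start = re.search("^Mary", phrase) is not None       # phrase starts with "Mary"
not_at_start = re.search("^lamb", phrase) is None       # "lamb" is not at the start
at_end = re.search(r"lamb\.$", phrase) is not None      # \. is a literal period
one_or_more = re.search("t+", phrase).group(0)          # a run of one or more t's
exactly_three = re.search("[a-z]{3}", phrase).group(0)  # three lowercase letters in a row
either = re.search("lamb|sheep", phrase).group(0)       # whichever alternative is found
groups = re.search("(had) (a)", phrase).groups()        # text captured by each ( )

print(in_class, any_char, one_or_more, exactly_three, either, groups)
```

Because `.` is itself a metacharacter, matching a literal period requires escaping it as `\.`, written here in a raw string (`r"..."`) so Python leaves the backslash alone.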


#### Special Sequences

Special sequences start with the `\` metacharacter and stand for predefined sets of characters.

\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
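Here is a short sketch of the special sequences in action. The sample text is made up for illustration, and the patterns are written as raw strings (`r"..."`) so the backslashes reach the `re` module unchanged.

```python
import re

text = "Order 66 shipped on day 4"  # hypothetical sample text

digits = re.search(r"\d+", text).group(0)      # first run of digits
non_digits = re.search(r"\D+", text).group(0)  # first run of non-digits
space_span = re.search(r"\s", text).span()     # position of the first whitespace
non_space = re.search(r"\S+", text).group(0)   # first run of non-whitespace
word = re.search(r"\w+", text).group(0)        # first run of word characters
non_word = re.search(r"\W", text).group(0)     # first non-word character

print(digits, repr(non_digits), space_span, non_space, word, repr(non_word))
```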


Once you have determined which tokens to use, you compile the pattern and then use it to search through multiple strings for matches.

https://docs.python.org/3/library/re.html#re.compile


In [52]:
s = re.compile('little')

While there are multiple methods in the `re` module, we're going to use [search()](https://docs.python.org/3/library/re.html#re.Pattern.search).

This finds the first match of the pattern in the string and returns it as a match object.

In [66]:
phrase = "Mary had a little lamb.  It's fleece was white as snow."
match = s.search(phrase)
match

<re.Match object; span=(11, 17), match='little'>

In [60]:
# get the full text that matched the pattern
match.group(0)

'little'

In [61]:
# start and end positions of the match in the string (the end is exclusive)
match.span()

(11, 17)
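A compiled pattern can be reused across many strings. The sketch below runs one pattern over a hypothetical list of lines and records the span of each match; note that `search()` returns `None` when there is no match, so it is worth checking before calling `span()` or `group()`.

```python
import re

pattern = re.compile("lamb")

# Hypothetical lines to search through.
lines = [
    "Mary had a little lamb.",
    "Its fleece was white as snow.",
    "The lamb was sure to go.",
]

spans = []
for line in lines:
    match = pattern.search(line)
    if match is None:
        spans.append(None)          # no match in this line
    else:
        spans.append(match.span())  # (start, end) of the first match

print(spans)
```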

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">
<b>Exercise 5.1</b>

At what coordinates does the word `fleece` appear?
</div>


In [63]:
fleece = re.search('fleece', phrase)
fleece.span()

(30, 36)

## Substitution

The `re` module can also be used to substitute values using [sub()](https://docs.python.org/3/library/re.html#re.sub).

It takes three arguments:

1. The pattern to match
2. The value used to replace each match
3. The string to search and replace in


In [65]:
bob = re.sub("Mary", "Bob", phrase)
bob

"Bob had a little lamb.  It's fleece was white as snow."

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">
    <b>Exercise 5.2</b>

Fix the grammatical error in the phrase by removing the `'` from the string.
</div>

In [67]:
fixed = re.sub("'", "", phrase)
fixed

'Mary had a little lamb.  Its fleece was white as snow.'
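`sub()` also accepts regex patterns and an optional `count` argument that limits how many replacements are made. As a small sketch on the same phrase, the first call collapses each run of whitespace into a single space, and the second replaces only the first `a`. The original string is not modified; `sub()` returns a new one.

```python
import re

phrase = "Mary had a little lamb.  It's fleece was white as snow."

# Replace each run of whitespace with one space.
single_spaced = re.sub(r"\s+", " ", phrase)

# count=1 stops after the first replacement.
first_only = re.sub("a", "A", phrase, count=1)

print(single_spaced)
print(first_only)
```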

## Conclusion


This concludes your introduction to regular expression matching and Python’s re module. Congratulations! 

Python regular expressions are a powerful tool for analyzing text.

You now know how to:

- Use `re.search()` to perform regex matching in Python
- Use `re.sub()` to substitute characters in strings

Read up on more ways to use the regular expression module here: https://www.pythoncheatsheet.org/cheatsheet/regular-expressions