# File operations

In [16]:
# Searching through a file

fhand = open('Test-Copy1.txt')
for line in fhand:
    # Print only the lines that begin with 'From:'
    if line.startswith('From:'):
        print(line)

We can use the find string method to simulate a text editor search that finds lines containing the search string anywhere in the line. find looks for an occurrence of one string within another and returns either the position where it was found or -1 if it was not found. For example:
```python
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.find('@uct.ac.za') == -1: continue
    print(line)
```
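To make the two possible return values concrete, here is a minimal sketch that calls find on two in-memory strings instead of a file. The sample lines are assumptions chosen for illustration.

```python
# Illustrative lines; these sample strings are assumptions, not read from a file.
lines = [
    'From: stephen.marquard@uct.ac.za',
    'Subject: meeting notes',
]

# find() returns the index of the first occurrence, or -1 if there is none.
positions = [line.find('@uct.ac.za') for line in lines]

for line, pos in zip(lines, positions):
    print(pos, line)
```

The first line reports the index of the at-sign, and the second reports -1, which is why the loop above skips lines with `continue` when the result is -1.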

In [6]:
with open('Test-Copy1.txt') as f:
    print(f.read())







            November 17, 2020   12:14:31


                SUMMARY                 
________________________________________
Items     Price       Quantity    Amount
________________________________________
coke      1.0               89      89.0
fanta     3.0               32      96.0
sprite    4.0               87     348.0
________________________________________
                  Total Amount     896.0
________________________________________

Paid: 1000.0
Balance: 104.0



In [17]:
import re

hand = open('Test-Copy1.txt')
for line in hand:
    line = line.rstrip()
    if re.search('fanta', line):
        print(line)

fanta     3.0               32      96.0


# Regular Expressions

The regular expression library **re** must be imported into your program before you can use it. The simplest use of the regular expression library is the **search()** function. The following program demonstrates a trivial use of the search function.
```python
# Search for lines that contain 'From:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
```
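The same check can be tried without a file. The sketch below runs **search()** on a few in-memory strings (assumed sample data) to show that it returns a match object when the pattern occurs anywhere in the string and `None` otherwise.

```python
import re

# Illustrative lines; these sample strings are assumptions, not taken from mbox-short.txt.
samples = ['From: a@example.org', 'X-From: relay', 'Subject: hello']

# re.search() returns a match object when the pattern occurs anywhere in the
# string, and None otherwise, so it can be used directly as a condition.
matched = [s for s in samples if re.search('From:', s)]
print(matched)

# The match object records where the pattern was found.
m = re.search('From:', 'X-From: relay')
print(m.start())
```

Note that 'X-From: relay' is kept as well, because the pattern is not anchored to the start of the line.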

### Character Matching

In the following example, the regular expression F..m: would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:” since the period characters in the regular expression match any character.

```python
# Search for lines that start with 'F', followed by any 2 characters, followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)
```
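As a quick check of the claim above, this sketch tests the pattern against the four strings named in the text, plus two extra strings (added here as assumptions) that should not match.

```python
import re

# The first four strings come from the text; the last two are added counterexamples.
candidates = ['From:', 'Fxxm:', 'F12m:', 'F!@m:', 'Frm:', 'XFrom:']

# Each period matches exactly one character, and ^ anchors the match
# to the start of the string.
results = [bool(re.search('^F..m:', s)) for s in candidates]

for s, ok in zip(candidates, results):
    print(s, ok)
```

'Frm:' fails because only one character sits between 'F' and 'm:', and 'XFrom:' fails because of the `^` anchor.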



```python
# Search for lines that start with 'From:' and have an at sign
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)
```

The search string **^From:.+@** will successfully match lines that start with **“From:”**, followed by one or more characters **(.+)**, followed by an at-sign **(@)**. So this will match the following line:
<blockquote>From: stephen.marquard@uct.ac.za
</blockquote>
You can think of the **.+** wildcard as expanding to match all the characters between the colon character and the at-sign.
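To see exactly which characters the pattern consumes, the sketch below prints the matched text. It also uses a second, hypothetical line with two at-signs to show that **.+** is greedy and expands as far as it can, while the non-greedy form **.+?** stops at the first at-sign.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'
matched_text = re.search('^From:.+@', line).group()
print(matched_text)

# Hypothetical line with two at-signs, used only to contrast greedy and non-greedy.
two = 'From: a@example.org via b@example.org'
greedy = re.search('^From:.+@', two).group()   # expands to the last at-sign
lazy = re.search('^From:.+?@', two).group()    # stops at the first at-sign
print(greedy)
print(lazy)
```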

```python
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)
```

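The **findall()** function searches the string and returns a list of all of the substrings that look like *email addresses*. The pattern uses the two-character sequence **\S**, which matches a single non-whitespace character, so **\S+@\S+** matches an at-sign with at least one non-whitespace character on each side. The string “@2PM” is not matched because there is no non-whitespace character before its at-sign.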
```python
Output: ['csev@umich.edu', 'cwen@iupui.edu']
```

```python
# Search for lines that have an at sign between characters
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)
```
We read each line and then extract all the substrings that match our regular expression. Since **findall()** returns a list, we simply check if the number of elements in our returned list is more than zero to print only lines where we found at least one substring that looks like an email address.
```python
output:

['wagnermr@iupui.edu']
['cwen@iupui.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801032122.m03LMFo4005148@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
```

```python
# Search for lines that have an at sign between characters
# The characters must be a letter or number
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)
```
```python
Output:
['wagnermr@iupui.edu']
['cwen@iupui.edu']
['postmaster@collab.sakaiproject.org']
['200801032122.m03LMFo4005148@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
```
Notice that on the *source@collab.sakaiproject.org* lines, our regular expression eliminated two characters at the end of the string (“>;”). This is because we are looking for substrings that start with a single lowercase letter, uppercase letter, or number (**[a-zA-Z0-9]**), followed by zero or more non-blank characters (**\S\***), followed by an at-sign, followed by zero or more non-blank characters (**\S\***), followed by an uppercase or lowercase letter (**[a-zA-Z]**). Note that we switched from **+** to **\*** to indicate zero or more non-blank characters, since **[a-zA-Z0-9]** is already one non-blank character. Remember that the **\*** or **+** applies to the single character or character class immediately to its left.
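The effect of the stricter pattern can be checked on a few of the raw strings from the earlier output. This sketch compares the loose pattern with the **\*** pattern described above, and uses a hypothetical one-character address `a@b` to show why **\*** is needed rather than **+**.

```python
import re

# Raw matches copied from the earlier output, before trimming.
raw = [
    '<source@collab.sakaiproject.org>;',
    'apache@localhost)',
    '<postmaster@collab.sakaiproject.org>',
]

strict = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]'
cleaned = [re.findall(strict, s) for s in raw]
for s, found in zip(raw, cleaned):
    print(s, '->', found)

# Hypothetical minimal address: one character on each side of the at-sign.
with_star = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', 'a@b')
with_plus = re.findall(r'[a-zA-Z0-9]\S+@\S+[a-zA-Z]', 'a@b')
print(with_star, with_plus)
```

With **+**, each side of the at-sign would need at least two characters, so the short address is missed. With **\***, the bracketed sets alone are enough.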

In [None]:
import re
dir()

### Python Regular Expression Quick Guide
```text
^        Matches the beginning of a line
$        Matches the end of the line
.        Matches any character
\s       Matches whitespace
\S       Matches any non-whitespace character
*        Repeats a character zero or more times
*?       Repeats a character zero or more times 
         (non-greedy)
+        Repeats a character one or more times
+?       Repeats a character one or more times 
         (non-greedy)
[aeiou]  Matches a single character in the listed set
[^XYZ]   Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
(        Indicates where string extraction is to start
)        Indicates where string extraction is to end
```
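A few entries from the guide can be tried directly. The sketch below is a minimal illustration on an assumed sample line in the mbox style; it shows parentheses used for extraction and a bracketed character set.

```python
import re

# Sample line in the mbox 'From ' style; assumed here for illustration.
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# Parentheses mark the part of the match that findall() should return.
domain = re.findall(r'^From \S+@(\S+)', line)
print(domain)

# [0-9] is a character range; the parentheses extract only the two-digit hour.
hour = re.findall(r'^From .* ([0-9][0-9]):', line)
print(hour)

# [aeiou] matches a single character from the listed set.
vowels = re.findall('[aeiou]', 'regex')
print(vowels)
```

Without the parentheses, **findall()** would return the whole matched text rather than just the domain or the hour.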