# In-Class Coding Lab: Files

The goals of this lab are to help you to understand:

- Reading data from a file all at once or one line at a time.
- Searching for data in files.
- Parsing text data to numerical data.
- How to build complex programs incrementally.

## Average Spam Confidence

For this lab, we will write a program to read spam confidence headers from a mailbox file like `ICCL-mbox-tiny.txt` or `ICCL-mbox-small.txt`. These files contain raw email data, and in that data is a SPAM confidence number for each message:

`X-DSPAM-Confidence: 0.8475`

Our goal is to find each of these lines in the file and extract the confidence number (in this case `0.8475`), with the end goal of calculating the average SPAM confidence of all the emails in the file.

### Reading from the file

Let's start with some code to read the lines of text from `ICCL-mbox-tiny.txt`:

In [1]:
filename = "CCL-mbox-tiny.txt"  # filename is CCL
with open(filename, 'r') as f:
    for line in f.readlines():
        print(line.strip())

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ;
5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Received: from pr
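The lab goals mention reading a file either all at once or one line at a time. The sketch below compares the three common approaches. It writes its own small throwaway file, so `sample_text` and `sample-mbox.txt` are made-up stand-ins rather than the lab's data files.

```python
import os
import tempfile

# Made-up sample content; not taken from the lab's mailbox files.
sample_text = "From a@example.com\nX-DSPAM-Confidence: 0.8475\nSubject: hello\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-mbox.txt")
    with open(sample_path, "w") as f:
        f.write(sample_text)

    # 1. All at once, as a single string.
    with open(sample_path, "r") as f:
        whole_text = f.read()

    # 2. All at once, as a list of lines (each line keeps its trailing newline).
    with open(sample_path, "r") as f:
        all_lines = f.readlines()

    # 3. One line at a time: looping over the file object reads lines lazily.
    streamed_lines = []
    with open(sample_path, "r") as f:
        for line in f:
            streamed_lines.append(line.strip())

print(len(whole_text))
print(all_lines)
print(streamed_lines)
```

For small files like the ones in this lab, all three behave the same in practice. Looping over the file object directly avoids holding every line in memory, which matters more for large files.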

In [2]:
#print number of lines
total_lines = 0
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        total_lines += 1        
print(total_lines)

332


### Now Try It

Now modify the code above to print the number of lines in the file, instead of printing the lines themselves. You'll need to increment a variable each time through the loop and then print it out afterwards.

There should be **332** lines.
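As a minimal sketch of the counting pattern described here, the example below counts lines in an in-memory stand-in for a file. The `io.StringIO` object and its three lines are assumptions for illustration; with the real mailbox file you would loop over the opened file instead.

```python
import io

# io.StringIO stands in for an open text file; it supports the same line-by-line loop.
sample_file = io.StringIO("first line\nsecond line\nthird line\n")

total_lines = 0
for line in sample_file:
    total_lines += 1  # one increment per line read

# A more compact alternative: rewind, then count with a generator expression.
sample_file.seek(0)
compact_total = sum(1 for _ in sample_file)

print(total_lines, compact_total)
```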

### Finding the SPAM Confidence

Next, we'll focus on getting only the lines that start with `X-DSPAM-Confidence:`. We do this by including an `if` statement inside the `for` loop.

You need to edit line 4 of the code below to print only the lines that begin with `X-DSPAM-Confidence:`. There should be **5** of them.

In [4]:
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        if line.startswith("X-DSPAM-Confidence:"):
            print(line.strip())        


X-DSPAM-Confidence: 0.8475
X-DSPAM-Confidence: 0.6178
X-DSPAM-Confidence: 0.6961
X-DSPAM-Confidence: 0.7565
X-DSPAM-Confidence: 0.7626
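The same filter can be tried without a file. This sketch assumes a short, hand-written list of lines (`sample_lines` is illustrative, not real lab data) and keeps only those that start with the header name.

```python
# Hand-written sample lines for illustration; each ends with a newline, as lines read from a file do.
sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "Mime-Version: 1.0\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "Content-Transfer-Encoding: 7bit\n",
]

prefix = "X-DSPAM-Confidence:"

# startswith() is True only when the line begins with the exact prefix text.
matches = [line.strip() for line in sample_lines if line.startswith(prefix)]

print(matches)
```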


### Parsing out the confidence value

The final step is to figure out how to parse the confidence value out of the string.
For example, given the line `X-DSPAM-Confidence: 0.8475`, we need to get the value `0.8475` as a float.

The strategy here is to replace `X-DSPAM-Confidence:` with an empty string, then call the `float()` function to convert the result to a float.

### Now Try It


In [5]:
line = 'X-DSPAM-Confidence: 0.8475'
number = float(line.replace('X-DSPAM-Confidence:', ''))  # remove the header name, then convert to float
print(number)

0.8475
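The strategy described in this section (remove the header name, then call `float()`) can be sketched as follows. The example also shows why `str.strip()` is a risky substitute for `str.replace()`: `strip()` treats its argument as a set of characters to trim from both ends, not as a prefix. The string `"X-DSPAM-Confidence: nice"` is a made-up input chosen to make that visible.

```python
line = "X-DSPAM-Confidence: 0.8475\n"
prefix = "X-DSPAM-Confidence:"

# replace() removes the exact substring; float() ignores the leftover space and newline.
number = float(line.replace(prefix, ""))

# Alternative: split once on the first colon and convert the part after it.
split_number = float(line.split(":", 1)[1])

# strip(chars) trims any of the given characters from both ends,
# so here it also eats the word "nice" and leaves only the space.
stripped = "X-DSPAM-Confidence: nice".strip(prefix)

print(number)
print(split_number)
print(repr(stripped))
```

With numeric values, `strip()` happens to give the right answer because digits and `.` are not in the prefix's character set, but `replace()` states the intent directly.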


### Putting it all together

Now that we have all the working parts, let's put it all together.

```
line count is 0
total confidence is 0
open mailbox file
for each line in file
    if line starts with `X-DSPAM-Confidence:`
        remove `X-DSPAM-Confidence:` from line and convert to float
        increment line count
        add spam confidence to total confidence
print average confidence (total confidence/line count)
```

In [6]:
## TODO: Write program here:
line_count = 0
total_confidence = 0
filename = "CCL-mbox-short.txt"

try:
    with open(filename, 'r') as f:
        for line in f.readlines():
            if line.startswith("X-DSPAM-Confidence:"):
                # replace() removes the exact header text; strip() would treat it as a set of characters
                confidence = float(line.replace('X-DSPAM-Confidence:', ''))
                line_count += 1
                total_confidence += confidence
                
    avg_confidence = total_confidence / line_count
    print("The average spam confidence for file '%s' is %.4f." % (filename, avg_confidence))
    
except FileNotFoundError:
    print("Error: 10012. File Not Found. The data file is not currently available.")

The average spam confidence for file 'CCL-mbox-short.txt' is 0.7507.


When you have the program working, try it with the `ICCL-mbox-short.txt` mailbox file, too.

In [1]:
line_count = 0
total_confidence = 0
filename = "CCL-mbox-tiny.txt"

try:
    with open(filename, 'r') as f:
        for line in f.readlines():
            if line.startswith("X-DSPAM-Confidence:"):
                # replace() removes the exact header text; strip() would treat it as a set of characters
                confidence = float(line.replace('X-DSPAM-Confidence:', ''))
                line_count += 1
                total_confidence += confidence
                
    avg_confidence = total_confidence / line_count
    print("The average spam confidence for file '%s' is %.4f." % (filename, avg_confidence))
    
except FileNotFoundError:
    print("Error: 10012. File Not Found. The data file is not currently available.")

The average spam confidence for file 'CCL-mbox-tiny.txt' is 0.7361.
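Since the same program was run twice with only the filename changed, one possible refinement is to wrap the logic in a function. The sketch below is an illustration, not part of the original lab: the name `average_spam_confidence`, its `prefix` parameter, and the sample lines are all hypothetical. It also returns `None` when no matching line is found, which avoids a division by zero.

```python
def average_spam_confidence(lines, prefix="X-DSPAM-Confidence:"):
    """Return the mean confidence over lines starting with prefix, or None if there are none."""
    line_count = 0
    total_confidence = 0.0
    for line in lines:
        if line.startswith(prefix):
            total_confidence += float(line.replace(prefix, ""))
            line_count += 1
    if line_count == 0:
        return None  # no confidence headers found, so there is no average to report
    return total_confidence / line_count


# Hand-written sample lines for illustration only.
sample_lines = [
    "Subject: test\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "X-DSPAM-Confidence: 0.6178\n",
]

average = average_spam_confidence(sample_lines)
print(average)
print(average_spam_confidence(["Subject: no headers here\n"]))
```

Because the function accepts any iterable of lines, it can also be given an open file directly, for example `with open(filename, 'r') as f: average_spam_confidence(f)`.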


In [3]:
def x(): 
    y = 10 
    return y 
y = 15

y = 5 
print(y)

5
