# In-Class Coding Lab: Files

The goals of this lab are to help you to understand:

- Reading data from a file all at once or one line at a time.
- Searching for data in files
- Parsing text data to numerical data.
- How to build complex programs incrementally.

## Average Spam Confidence

For this lab, we will write a program to read spam confidence headers from a mailbox file like `CCL-mbox-tiny.txt` or `CCL-mbox-small.txt`. These files contain raw email data, and in that data is a SPAM confidence number for each message:

`X-DSPAM-Confidence: 0.8475`

Our goal will be to find each of these lines in the file and extract the confidence number (in this case `0.8475`), with the end goal of calculating the average SPAM confidence of all the emails in the file.

### Reading from the file

Let's start with some code to read the lines of text from `CCL-mbox-tiny.txt`:



In [3]:
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:  # 'r' opens for reading; 'w' would erase the file
    for line in f.readlines():
        print(line.strip())


From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ;
5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Received: from pr
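
The cell above loads every line with `readlines()`. As a rough sketch of the two reading styles named in the lab goals (all at once versus one line at a time), the example below writes a small made-up file and reads it both ways. The file name `sample-mbox.txt` and its contents are assumptions for illustration and are not part of the lab data.

```python
import os
import tempfile

# Hypothetical sample data, standing in for a real mailbox file.
sample_text = "From a@example.com\nX-DSPAM-Confidence: 0.8475\nSubject: hello\n"

# Write the sample to a temporary folder so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample-mbox.txt")
with open(sample_path, "w") as f:
    f.write(sample_text)

# Style 1: read the whole file at once into a single string.
with open(sample_path, "r") as f:
    whole_text = f.read()

# Style 2: read one line at a time by looping over the file object.
lines_one_by_one = []
with open(sample_path, "r") as f:
    for line in f:
        lines_one_by_one.append(line.strip())

print(len(whole_text))
print(lines_one_by_one)
```

Reading one line at a time keeps only the current line in memory, which tends to matter more as the mailbox file gets larger.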

### Now Try It

Now modify the code above to print the number of lines in the file, instead of printing the lines themselves. You'll need to increment a variable each time through the loop and then print it out afterwards.

There should be **332** lines.


In [13]:
# TODO Write code to not print the lines but count the number of lines!
line_count = 0
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        line_count = line_count + 1
        
print(f"there are {line_count} lines in the file")

there are 332 lines in the file
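
Once the loop version works, the same count can be written more compactly. The sketch below is one possible alternative, not the required answer. It uses `io.StringIO` with three made-up lines as a stand-in for the opened file, so it runs without the lab data.

```python
import io

# io.StringIO behaves like an open text file; the three lines are made-up sample data.
fake_file = io.StringIO("line one\nline two\nline three\n")

# sum() adds 1 for every line the for loop would have visited.
line_count = sum(1 for line in fake_file)

print(f"there are {line_count} lines in the file")
```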



### Finding the SPAM Confidence

Next, we'll focus on only the lines that start with `X-DSPAM-Confidence:`. We do this by including an `if` statement inside the `for` loop. This is a very common pattern in computing, used to search through massive amounts of data.

You need to edit line 4 of the code below so that it only prints lines which begin with `X-DSPAM-Confidence:`. You know you got it working if there are only **5** rows in your output.

In [14]:
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        if line.startswith("X-DSPAM-Confidence:"): 
            print(line.strip())        


X-DSPAM-Confidence: 0.8475
X-DSPAM-Confidence: 0.6178
X-DSPAM-Confidence: 0.6961
X-DSPAM-Confidence: 0.7565
X-DSPAM-Confidence: 0.7626
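
The same `if`-inside-`for` search can also collect the matching lines instead of printing them. The following is a minimal sketch, assuming a made-up four-line sample held in `io.StringIO` rather than the real mailbox file. The names `sample`, `prefix`, and `matches` are illustrative only.

```python
import io

# Made-up sample data standing in for the mailbox file.
sample = io.StringIO(
    "From a@example.com\n"
    "X-DSPAM-Confidence: 0.8475\n"
    "Subject: hello\n"
    "X-DSPAM-Confidence: 0.6178\n"
)

prefix = "X-DSPAM-Confidence:"

# Keep only the lines that start with the prefix, without the trailing newline.
matches = [line.strip() for line in sample if line.startswith(prefix)]

print(matches)
```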


### Parsing out the confidence value

The final step is to figure out how to parse out the confidence value from the string. 
For example, given the line `X-DSPAM-Confidence: 0.8475`, we need to get the value `0.8475` as a float.

The strategy here is to replace `X-DSPAM-Confidence:` with an empty string, then call the `float()` function to convert the result to a float. The leftover leading space is not a problem, because `float()` ignores surrounding whitespace.

### Now Try It

Write code to parse the value `0.8475` from the text string `'X-DSPAM-Confidence: 0.8475'`.

In [7]:
line = 'X-DSPAM-Confidence: 0.8475'
number = float(line.split(' ')[1])
print(number)

0.8475


In [12]:
line = 'X-DSPAM-Confidence: 0.8475'
number = float(line.replace("X-DSPAM-Confidence:", ""))
print(number)

0.8475
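
Both cells above give the same answer for this line, but they behave differently on slightly messier input. The sketch below wraps the `replace()` approach in a hypothetical helper, `parse_confidence` (a name invented here, not part of the lab), and shows what `split(' ')` does when there are two spaces after the colon.

```python
def parse_confidence(line):
    """Return the number after 'X-DSPAM-Confidence:' as a float (hypothetical helper)."""
    # float() ignores surrounding spaces and a trailing newline.
    return float(line.replace("X-DSPAM-Confidence:", ""))

# A line as it would come from a file, newline included.
print(parse_confidence("X-DSPAM-Confidence: 0.8475\n"))

# With two spaces after the colon, replace() still works...
two_spaces = "X-DSPAM-Confidence:  0.8475"
print(parse_confidence(two_spaces))

# ...but split(' ') produces an empty string at index 1, which float() would reject.
print(two_spaces.split(" "))
```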


### Putting it all together

Now that we have all the working parts, let's put it all together.

```
0.  use the file named 'CCL-mbox-short.txt' 
1.  line count is 0
2.  total confidence is 0
3.  open mailbox file
4.  for each line in file
5.      if line starts with `X-DSPAM-Confidence:`
6.          remove `X-DSPAM-Confidence:` from line and convert to float
7.          increment line count
8.          add spam confidence to total confidence
9.  print average confidence (total confidence/line count)
```

In [15]:
filename = 'CCL-mbox-tiny.txt'
line_count = 0
total_confidence = 0
with open(filename, "r") as f:
    for line in f:
        if line.startswith("X-DSPAM-Confidence:"):
            # Remove the header name; float() ignores the leftover whitespace
            confidence = float(line.replace("X-DSPAM-Confidence:", ""))
            line_count = line_count + 1
            total_confidence = total_confidence + confidence

print(f"Total lines with X-DSPAM-Confidence: {line_count}")
print(f"Average spam confidence {total_confidence/line_count}")

Total lines with X-DSPAM-Confidence: 5
Average spam confidence 0.7361
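
The program above divides by `line_count`, which fails with a `ZeroDivisionError` if a file contains no `X-DSPAM-Confidence:` lines at all. One possible way to handle that case is sketched below. The function name `average_confidence` and the choice to return `None` are assumptions made for this example, and `io.StringIO` with made-up values stands in for the mailbox file.

```python
import io

def average_confidence(lines):
    """Average the X-DSPAM-Confidence values in lines, or None if there are none."""
    prefix = "X-DSPAM-Confidence:"
    count = 0
    total = 0.0
    for line in lines:
        if line.startswith(prefix):
            total += float(line.replace(prefix, ""))
            count += 1
    # Guard against dividing by zero when no matching lines were found.
    if count == 0:
        return None
    return total / count

# Made-up sample with two confidence lines: (0.5 + 0.25) / 2
sample = io.StringIO("Subject: hi\nX-DSPAM-Confidence: 0.5\nX-DSPAM-Confidence: 0.25\n")
print(average_confidence(sample))

# A sample with no confidence lines gives None instead of an error.
print(average_confidence(io.StringIO("Subject: hi\n")))
```

Because the function accepts any iterable of lines, an open file object from `with open(filename, "r") as f:` could be passed in the same way.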


### Question

How do you know this is right? How can you verify it? HINT: You might not be able to hand-calculate the average spam confidence for a large file, but you can do that with `CCL-mbox-tiny.txt`, right?
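
One way to check, sketched here with the five confidence values printed earlier in this lab, is to type them in by hand and compute the average separately from the file-reading code. The names `values` and `expected` are illustrative only.

```python
# The five values printed by the "Finding the SPAM Confidence" cell.
values = [0.8475, 0.6178, 0.6961, 0.7565, 0.7626]

# Average computed independently of the file-reading program.
expected = sum(values) / len(values)

# Rounded to four places to hide tiny floating-point differences.
print(round(expected, 4))
```

If this agrees with the program's output for `CCL-mbox-tiny.txt`, that is reasonable evidence the loop logic is doing what was intended.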

## Metacognition

Please answer the following questions. This should be a personal narrative, in your own voice. Answer each question by double-clicking on it and placing your answer next to the Answer: prompt.

### Questions


1. Record any questions you have about this lab that you would like to ask in recitation. It is expected you will have questions if you did not complete the code sections correctly.  Learning how to articulate what you do not understand is an important skill of critical thinking. 

Answer: 

2. What was the most difficult aspect of completing this lab? Least difficult?  

Answer: 

3. What aspects of this lab do you find most valuable? Least valuable?

Answer: 

4. Rate your comfort level with this week's material so far.   

1 ==> I can do this on my own and explain how to do it.   
2 ==> I can do this on my own without any help.   
3 ==> I can do this with help or guidance from others. If you choose this level please list those who helped you.   
4 ==> I don't understand this at all yet and need extra help. If you choose this please try to articulate that which you do not understand.   

Answer: 



In [None]:
# SAVE YOUR WORK FIRST! CTRL+S
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()