## Quick Wedge Investigation

Let's take a quick look at a Wedge file and also learn a little bit about data structures in Python.

1. Manually open one of the Wedge zipped files, extract the CSV file inside it, and move it into the same folder as this notebook. *Pick a different file from the one I use so we can compare notes.*

1. Change the value of `input_file` below to the name of your file.

We'll go through this together in class on Tuesday, so you don't have to run it all now. Just run this notebook through the point where I've written "Stop Here". 

In [1]:
input_file = "transArchive_201307_201309.csv"

Let's open the file and look at some rows. The easiest way to read a file in Python is to open it with `open`, which returns a file handle, and then call `read` on that handle. `read` accepts an optional `size` argument, so you don't have to read the whole file at once.

In [2]:
my_file = open(input_file)
my_file.read(1000)

'"datetime","register_no","emp_no","trans_no","upc","description","trans_type","trans_subtype","trans_status","department","quantity","Scale","cost","unitPrice","total","regPrice","altPrice","tax","taxexempt","foodstamp","wicable","discount","memDiscount","discountable","discounttype","voided","percentDiscount","ItemQtty","volDiscType","volume","VolSpecial","mixMatch","matched","memType","staff","numflag","itemstatus","tenderstatus","charflag","varflag","batchHeaderID","local","organic","display","receipt","card_no","store","branch","match_id","trans_id"\n"2013-07-01 07:18:31","16","54","1","0000000009506","Offsite: Plain Croissant","I"," "," ","17","10","0","0.2100","1.3900","13.9000","1.3900","0.0000","0","0","1","0","0.0000","0.0000","1","0","0",NULL,"10","0","0","0.0000","0","0",,NULL,"1","0","0",,"0",NULL,"0","0"," ","0","50056","32","0","0","4"\n"2013-07-01 07:18:42","16","54","1","0000000009515","Offsite: White Burger Bun","I"," "," ","17","24","0","0.0000","0.4000","9.6000","0.

*Question*: What is `read` giving you here? 

It's best practice to close files after you open them.

In [3]:
my_file.close()
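A quick sketch of what `close` changes, if you want to check it for yourself. It writes a tiny stand-in file to a temporary folder so it runs without the Wedge data; the name `demo.csv` and its contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical stand-in file so the sketch runs without the Wedge data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
out_file = open(demo_path, "w")
out_file.write('"upc","description"\n"0000000009506","Offsite: Plain Croissant"\n')
out_file.close()

demo_file = open(demo_path)
first_chunk = demo_file.read(10)       # read at most 10 characters
open_before = not demo_file.closed     # the handle is still open here

demo_file.close()
closed_after = demo_file.closed        # True once close() has run

# Reading from a closed handle raises ValueError.
try:
    demo_file.read(10)
    read_failed = False
except ValueError:
    read_failed = True

print(first_chunk)
print(open_before, closed_after, read_failed)
```

The `closed` attribute is a handy way to confirm that a handle you opened earlier is no longer holding the file.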

There's a much handier way to make sure files get closed: open them using `with`, and Python closes the file for you when the indented block ends.

In [4]:
with open(input_file) as my_file :
    print(my_file.read(1000))

"datetime","register_no","emp_no","trans_no","upc","description","trans_type","trans_subtype","trans_status","department","quantity","Scale","cost","unitPrice","total","regPrice","altPrice","tax","taxexempt","foodstamp","wicable","discount","memDiscount","discountable","discounttype","voided","percentDiscount","ItemQtty","volDiscType","volume","VolSpecial","mixMatch","matched","memType","staff","numflag","itemstatus","tenderstatus","charflag","varflag","batchHeaderID","local","organic","display","receipt","card_no","store","branch","match_id","trans_id"
"2013-07-01 07:18:31","16","54","1","0000000009506","Offsite: Plain Croissant","I"," "," ","17","10","0","0.2100","1.3900","13.9000","1.3900","0.0000","0","0","1","0","0.0000","0.0000","1","0","0",NULL,"10","0","0","0.0000","0","0",,NULL,"1","0","0",,"0",NULL,"0","0"," ","0","50056","32","0","0","4"
"2013-07-01 07:18:42","16","54","1","0000000009515","Offsite: White Burger Bun","I"," "," ","17","24","0","0.0000","0.4000","9.6000","0.400

Notice the `\n` in the first output above. This is the newline character. The backslash tells Python to treat the `n` not as a literal letter but as a line break, which is why `print` shows each row on its own line.

In [5]:
for letter in "a string of letters" :
    print(letter + "\n")

a

 

s

t

r

i

n

g

 

o

f

 

l

e

t

t

e

r

s
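A small sketch of what the backslash is doing, on a made-up string: `\n` is a single character, `repr` shows it in its escaped form, and `print` acts on it. The variable names are only for illustration.

```python
text = "row one\nrow two"

# "\n" counts as one character, not two.
newline_length = len("\n")

# repr shows the escape sequence; print turns it into a line break.
print(repr(text))
print(text)

# splitlines breaks the string apart at each newline.
rows = text.splitlines()
print(rows)
```

This is also why the same 1000 characters looked different in the two cells above: one displayed the escaped form and the other printed it.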



The other really popular special character is tab, denoted by `\t`.

In [6]:
print("\t".join("a string"))

a	 	s	t	r	i	n	g


If you're ahead, look up the function `join` and see if you can figure out what it's doing.
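For comparison once you've had a go, here is a short sketch of `join` on made-up inputs. The string you call it on is placed between the items it is given, and a string counts as a sequence of characters, which is why the cell above puts a tab between every letter.

```python
# join is called on the separator and takes the items to glue together.
fields = ["2013-07-01", "16", "54"]
joined_with_commas = ",".join(fields)
print(joined_with_commas)

# A string is a sequence of characters, so each character is an item.
joined_letters = "-".join("abc")
print(joined_letters)

# join only accepts strings, so convert numbers first.
joined_numbers = "\t".join(str(n) for n in [1, 2, 3])
print(repr(joined_numbers))
```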

---

There's a more convenient way to read text files that are split into lines: looping over an open file gives you one line at a time.

In [7]:
with open(input_file) as my_file :
    for line in my_file :
        print(line)        
        break

"datetime","register_no","emp_no","trans_no","upc","description","trans_type","trans_subtype","trans_status","department","quantity","Scale","cost","unitPrice","total","regPrice","altPrice","tax","taxexempt","foodstamp","wicable","discount","memDiscount","discountable","discounttype","voided","percentDiscount","ItemQtty","volDiscType","volume","VolSpecial","mixMatch","matched","memType","staff","numflag","itemstatus","tenderstatus","charflag","varflag","batchHeaderID","local","organic","display","receipt","card_no","store","branch","match_id","trans_id"



### Stop Here
Try to run through the cells above this point before class. Let me know on Slack if you have any trouble. 
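One detail from the line-by-line loop above: each line still ends with its newline character, and `print` adds another, which is where the blank space after the printed line comes from. A minimal sketch of one way to handle it, using an in-memory stand-in for the file (`io.StringIO` with made-up rows):

```python
import io

# In-memory stand-in for an open file; the rows are made up.
fake_file = io.StringIO('"upc","total"\n"0000000009506","13.9000"\n')

cleaned = []
for line in fake_file:
    ends_with_newline = line.endswith("\n")   # each line keeps its "\n"
    cleaned.append(line.rstrip("\n"))         # strip it before using the text

for row in cleaned:
    print(row)
```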

---

If you want to print out the first `n` lines, you could make a counter variable to do it. Complete the code below to print out the first 5 lines. I've put `??` in the places you need to put some code.

In [None]:
counter = 1

with open(input_file) as ?? :
    for line in my_file :
        print(??) 
        counter += ??
        if counter == ?? :
            break

There's another cool trick that I use all the time. Python provides a function, `enumerate`, that auto-generates this sort of counter as it goes along. Check it out.

In [None]:
with open(input_file) as my_file :
    for idx,line in enumerate(my_file) :
        print(idx)                              
        print(line)
        
        if idx == 4 :
            break
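Two variations that may be handy, sketched on an in-memory stand-in for the file: `enumerate` accepts a `start` argument if you'd rather count from 1, and `itertools.islice` from the standard library (not used elsewhere in this notebook) takes the first few lines without an explicit `break`. The sample lines are made up.

```python
import io
from itertools import islice

# Made-up lines standing in for an open file.
fake_file = io.StringIO("header\nrow 1\nrow 2\nrow 3\n")

# start=1 makes the counter begin at 1 instead of 0.
numbered = []
for line_number, line in enumerate(fake_file, start=1):
    numbered.append((line_number, line.rstrip("\n")))
    if line_number == 2:
        break

# Rewind to the beginning, then take the first three lines with islice.
fake_file.seek(0)
first_three = [line.rstrip("\n") for line in islice(fake_file, 3)]

print(numbered)
print(first_three)
```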

We'll talk more about parsing text files, since it's an important part of working in Python. In the interim, some questions for you:

* Does your file have a header row?
* What's the delimiter of your file?
* Does your file put quotes around the fields? 


---

### Splitting lines

One of the key tasks when working with text files is splitting lines into fields. Python strings provide a method for this.

In [None]:
?str.split

Here's an example of `split` in action, using the `line` variable you created above. 

In [None]:
line.split(",") 

Repurpose the cell above that uses `enumerate` to split the lines based on the delimiter that you have. What do you notice about the output? Just print a couple of lines so you don't blow away your screen!
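One thing to watch for: `split` knows nothing about quotes, so the quote marks stay in each piece and a comma inside a quoted field cuts that field in two. The sketch below uses a made-up row (the description `Bread, Sliced` is hypothetical, not taken from the Wedge data) and compares `split` with the standard library's `csv` module, which this notebook hasn't introduced.

```python
import csv

# Made-up row with a comma inside a quoted field.
line = '"2013-07-01 07:18:31","16","Bread, Sliced","1.3900"\n'

# split cuts at every comma, including the one inside the quotes.
pieces = line.split(",")
print(len(pieces))
print(pieces)

# csv.reader accepts any iterable of lines and respects the quotes.
fields = next(csv.reader([line]))
print(len(fields))
print(fields)
```

If a field index seems to point at the wrong column for some rows, an embedded comma like this is one possible cause worth checking.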

---

Now we're going to do something much more sophisticated, to give you a sense of how much we can do with these simple tools. I'm going to write some uncommented code that we'll read together to figure out what's going on.

In [None]:
vals = set() # "vals" is a bad name for a variable, but I'm being tricky

with open(input_file) as my_file :
    for idx, line in enumerate(my_file) :
        pieces = line.split(",")
        vals.add(pieces[45])
        
        if idx % 100000 == 0 :
            print("Processed " + str(idx+1) + " lines.")

print("Processed " + str(idx + 1) + " lines.")
print("Done processing.")

In [None]:
print(len(vals))

What's this doing?
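As a hint about the data structure involved, here is a minimal sketch of how a `set` behaves, using made-up values rather than anything from the Wedge file:

```python
# A set keeps only one copy of each value, however often it is added.
seen = set()
for value in ["101", "202", "101", "303", "202"]:
    seen.add(value)

print(len(seen))
print(sorted(seen))

# Checking whether a value is in the set uses the `in` operator.
print("101" in seen)
```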