In this notebook we'll work through some of the issues, identified in class, that we need to solve for the Wedge project. The key tasks are as follows:

1. We need to get our file or files out of the zip file.
1. We need to be able to read in the file. 
1. We need to do a few tests: looking for a header row, checking for delimiters, checking for quotes.
1. We need to identify the owner number, which has index `45` in the row.
1. We need to find the destination row.
1. We need to write out the row. 

I've built some toy examples for us to play with in class. First, let's get a list of the files. The `os` package has a handy function `listdir` that will help. 

In [1]:
import os

In [2]:
os.listdir("wedge/")

['transArchive_201310_201312_small.zip',
 'transArchive_201207_201209_small.zip',
 'transArchive_201204_201206_inactive_small.zip',
 'transArchive_201304_201306_inactive_small.zip',
 'transArchive_201007_201009_small.zip',
 'transArchive_201105_small.zip',
 'transArchive_201110_201112_small.zip',
 'transArchive_201304_201306_small.zip',
 'transArchive_201404_201406_inactive_small.zip',
 'transArchive_201504_201506_small.zip',
 '.DS_Store',
 'transArchive_201612_small.zip',
 'transArchive_201606_small.zip',
 'transArchive_201401_201403_inactive_small.zip',
 'transArchive_201407_201409_small.zip',
 'transArchive_201201_201203_inactive_small.zip',
 'transArchive_201301_201303_inactive_small.zip',
 'transArchive_201310_201312_inactive_small.zip',
 'transArchive_201107_201109_small.zip',
 'transArchive_201601_small.zip',
 'transArchive_201210_201212_inactive_small.zip',
 'transArchive_201010_201012_small.zip',
 'transArchive_201204_201206_small.zip',
 'transArchive_201410_201412_inactive_sm

Let's save these files to a variable.

In [3]:
zip_files = os.listdir("wedge")

## Working with Zip Files

Zip files are complicated, but useful. Here's a nice description I found on [GeeksForGeeks](https://www.geeksforgeeks.org/working-zip-files-python/):

> ZIP is an archive file format that supports lossless data compression. By lossless compression, we mean that the compression algorithm allows the original data to be perfectly reconstructed from the compressed data. So, a ZIP file is a single file containing one or more compressed files, offering an ideal way to make large files smaller and keep related files together.

So, one tricky thing is that a zip file is a _file_ but it can also contain lots of other sub-files, so it acts like a _folder_ as well.
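
To make the file-versus-folder idea concrete, here is a minimal sketch that builds a tiny zip archive in memory and then lists its members. The member names (`a.csv`, `b.csv`) and their contents are made up for illustration; they are not Wedge files.

```python
import io
from zipfile import ZipFile

# Build a small zip archive in memory (hypothetical member names and contents).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "x,y\n1,2\n")
    zf.writestr("b.csv", "x,y\n3,4\n")

# Reopen the same bytes for reading: one "file" that holds two members.
with ZipFile(buffer, "r") as zf:
    member_names = zf.namelist()
    first_member_text = zf.read("a.csv").decode("utf-8")

print(member_names)
print(first_member_text)
```

The archive is a single object, but `namelist` treats it like a folder and `read` pulls out one member by name.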

Q: Why don't we just unzip all the Wedge files by hand and skip the unzipping in code? 

A: ?? 

There's a useful package for working with zip files called ... `zipfile`. Might be worth bookmarking the manual [page](https://docs.python.org/3/library/zipfile.html) for it. We won't need the whole package, so we'll just import the _Class_ `ZipFile`.

In [4]:
from zipfile import ZipFile # usually you'd do all these imports at the beginning

In [5]:
# Let's extract one file from the first zip in our list

# opening the zip file in READ mode 
with ZipFile("wedge/" + zip_files[0], 'r') as zf : 
    # printing what's in the zip file.  
    zf.printdir() 
  
    # extracting all the files 
    print('Extracting all the files now...') 
    zf.extractall() 
    print('Done!') 

File Name                                             Modified             Size
transArchive_201310_201312_small.csv           2019-09-18 14:51:20      2949882
Extracting all the files now...
Done!


Look in the folder for this repository and you'll see that the file was extracted into this directory. (Q: Why this one?) Now, we don't want to do this in practice, so we'll try to just read the files in the zip file. Let's delete the file that we just extracted just to be clean. The `os` package has a helpful function (`remove`) for us. But we need to get the name of the file first.

In [6]:
with ZipFile("wedge/" + zip_files[0], 'r') as zf :
    print(zf.namelist())

['transArchive_201310_201312_small.csv']


Q: what is `namelist` returning?

A: `namelist` returns a list of the names of the files inside the zip. Here it is a list with one element, the string `'transArchive_201310_201312_small.csv'`, which is the name of the underlying csv.

In [7]:
# Now let's delete that spurious file we created
with ZipFile("wedge/" + zip_files[0], 'r') as zf :
    this_file_list = zf.namelist()
    os.remove(this_file_list[0])

In [8]:
# step 1: make a new directory as a test to hold the extracted files
temp_folder_name = "temp"

if not os.path.isdir(temp_folder_name): # if the folder doesn't exist yet
    os.mkdir(temp_folder_name)          # make it

# extract all zipfile contents to directory
for zip_file in zip_files: 
    if not zip_file.endswith(".zip"):   # skip non-zip entries such as .DS_Store
        continue
    with ZipFile("wedge/" + zip_file, 'r') as zf :
        # extract all contents to temporary folder
        zf.extractall(path=temp_folder_name)



In [9]:
# delete temporary directory as a test step 2

#get files inside temp dir
files_to_delete = os.listdir(temp_folder_name)

#delete them one at a time
for file in files_to_delete: 
    os.remove(temp_folder_name + "/" + file)
    
#remove the folder
os.rmdir(temp_folder_name)




Now go check the folder for this repository on your machine and you'll see it's gone.
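
As an aside, the make-a-folder, extract, delete-everything pattern can also be handled by the standard library's `tempfile.TemporaryDirectory`, which cleans up after itself. This is only a sketch of that alternative, run against a stand-in archive built in memory (the member name `toy.csv` is hypothetical), not against the real `wedge/` files.

```python
import io
import os
import tempfile
from zipfile import ZipFile

# Build a stand-in archive in memory (hypothetical member name and contents).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("toy.csv", "a,b\n1,2\n")

# The directory and everything in it are removed when the `with` block exits.
with tempfile.TemporaryDirectory() as temp_dir:
    with ZipFile(buffer, "r") as zf:
        zf.extractall(path=temp_dir)
    extracted = os.listdir(temp_dir)
    existed_inside = os.path.isdir(temp_dir)

still_exists = os.path.isdir(temp_dir)
print(extracted, existed_inside, still_exists)
```

The trade-off is that you don't get to pick the folder name, which is usually fine for scratch space.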

In this next cell, write a loop over all the files in `zip_files` printing the contents to the screen.

In [10]:
for zipf in zip_files :
    if zipf[-3:] == "zip":
        with ZipFile("wedge/" + zipf,'r') as zf :  
            print(zf.namelist())
        

['transArchive_201310_201312_small.csv']
['transArchive_201207_201209_small.csv']
['transArchive_201204_201206_inactive_small.csv']
['transArchive_201304_201306_inactive_small.csv']
['transArchive_201007_201009_small.csv']
['transArchive_201105_small.csv']
['transArchive_201110_201112_small.csv']
['transArchive_201304_201306_small.csv']
['transArchive_201404_201406_inactive_small.csv']
['transArchive_201504_201506_small.csv']
['transArchive_201612_small.csv']
['transArchive_201606_small.csv']
['transArchive_201401_201403_inactive_small.csv']
['transArchive_201407_201409_small.csv']
['transArchive_201201_201203_inactive_small.csv']
['transArchive_201301_201303_inactive_small.csv']
['transArchive_201310_201312_inactive_small.csv']
['transArchive_201107_201109_small.csv']
['transArchive_201601_small.csv']
['transArchive_201210_201212_inactive_small.csv']
['transArchive_201010_201012_small.csv']
['transArchive_201204_201206_small.csv']
['transArchive_201410_201412_inactive_small.csv']
['tr

Q: What do you notice about the contents of the zips? 

A: ?? 

Now, we're getting close. At this point, we'd like to do something like the following:

1. Open a zip file
1. Get a list of the files in there
1. Read those files as we've read plain text files before.

In [11]:
this_zf = zip_files[0]

with ZipFile("wedge/" + this_zf,'r') as zf :
    zipped_files = zf.namelist()
    
    for file_name in zipped_files :
        with zf.open(file_name,'r') as input_file :
            for idx, line in enumerate(input_file) :
                print(line)
                if idx >= 5 :
                    break


b'"datetime","register_no","emp_no","trans_no","upc","description","trans_type","trans_subtype","trans_status","department","quantity","Scale","cost","unitPrice","total","regPrice","altPrice","tax","taxexempt","foodstamp","wicable","discount","memDiscount","discountable","discounttype","voided","percentDiscount","ItemQtty","volDiscType","volume","VolSpecial","mixMatch","matched","memType","staff","numflag","itemstatus","tenderstatus","charflag","varflag","batchHeaderID","local","organic","display","receipt","card_no","store","branch","match_id","trans_id"\r\n'
b'"2013-10-01 07:11:53","700","700","829","0000000009118","KB:Tofu Creole/Br.Rice/Adult","I"," "," ","8","1","0","0.0000","10.0000","10.0000","10.0000","0.0000","1","0","0","0","0.0000","0.0000","0","0","0","0.00000000","1","0","1","0.0000","0","0"," ","0","1","0","0",,"0",NULL,"1","0"," ","0","3","512","0","0","1"\r\n'
b'"2013-10-01 07:11:53","700","700","829","0000000009120","KB:Beef Stroganoff/Adult","I"," "," ","8","1","0","0

Let's spend a little time reading what's going on here. 

Q: What's up with the `b'some string stuff'`? 

A: ?? 


To deal with byte strings, we can use `io.TextIOWrapper` to get the job done.

In [12]:
import io

this_zf = zip_files[0]

with ZipFile("wedge/" + this_zf,'r') as zf :
    zipped_files = zf.namelist()
    
    for file_name in zipped_files :
        input_file = zf.open(file_name,'r')
        input_file = io.TextIOWrapper(input_file,encoding="utf-8")
        
        for idx, line in enumerate(input_file) :
            print(line)
            if idx > 3 :
                break

        input_file.close() # tidy up

"datetime","register_no","emp_no","trans_no","upc","description","trans_type","trans_subtype","trans_status","department","quantity","Scale","cost","unitPrice","total","regPrice","altPrice","tax","taxexempt","foodstamp","wicable","discount","memDiscount","discountable","discounttype","voided","percentDiscount","ItemQtty","volDiscType","volume","VolSpecial","mixMatch","matched","memType","staff","numflag","itemstatus","tenderstatus","charflag","varflag","batchHeaderID","local","organic","display","receipt","card_no","store","branch","match_id","trans_id"

"2013-10-01 07:11:53","700","700","829","0000000009118","KB:Tofu Creole/Br.Rice/Adult","I"," "," ","8","1","0","0.0000","10.0000","10.0000","10.0000","0.0000","1","0","0","0","0.0000","0.0000","0","0","0","0.00000000","1","0","1","0.0000","0","0"," ","0","1","0","0",,"0",NULL,"1","0"," ","0","3","512","0","0","1"

"2013-10-01 07:11:53","700","700","829","0000000009120","KB:Beef Stroganoff/Adult","I"," "," ","8","1","0","0.0000","10.000

Q: What do you notice about this output? 

A: ??

Okay, now we're close to finishing our work with zip files. In the cell below, write code that will 

1. Iterate over every zip file.
1. Print out the name of the containing file. 
1. Print out the first 3 lines of each file. 

In [18]:
# your code here

zip_files = os.listdir("wedge/")

for zip_file in zip_files :
    
    with ZipFile("wedge/" + zip_file, 'r') as my_zip_file:
    
        files_inside = my_zip_file.namelist()
        for zipped_file in files_inside :
            print(zipped_file)
            
            with my_zip_file.open(zipped_file,'r') as input_file:
                for idx,line in enumerate(input_file):
                    print(line.decode("utf-8").strip().split("\t"))
                    if idx == 2:
                        break
                        
        #print(f"{zip_file} has this inside it {files_inside}")
      



transArchive_201310_201312_small.csv
['"datetime","register_no","emp_no","trans_no","upc","description","trans_type","trans_subtype","trans_status","department","quantity","Scale","cost","unitPrice","total","regPrice","altPrice","tax","taxexempt","foodstamp","wicable","discount","memDiscount","discountable","discounttype","voided","percentDiscount","ItemQtty","volDiscType","volume","VolSpecial","mixMatch","matched","memType","staff","numflag","itemstatus","tenderstatus","charflag","varflag","batchHeaderID","local","organic","display","receipt","card_no","store","branch","match_id","trans_id"']
['"2013-10-01 07:11:53","700","700","829","0000000009118","KB:Tofu Creole/Br.Rice/Adult","I"," "," ","8","1","0","0.0000","10.0000","10.0000","10.0000","0.0000","1","0","0","0","0.0000","0.0000","0","0","0","0.00000000","1","0","1","0.0000","0","0"," ","0","1","0","0",,"0",NULL,"1","0"," ","0","3","512","0","0","1"']
['"2013-10-01 07:11:53","700","700","829","0000000009120","KB:Beef Stroganoff/Ad

BadZipFile: File is not a zip file
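
The `BadZipFile` error is consistent with the loop reaching an entry that is not a zip archive; the directory listing earlier included `.DS_Store`. One way to guard against that is `zipfile.is_zipfile`, sketched here on a throwaway folder rather than the real `wedge/` directory. The helper name `list_zip_members` and the toy files are assumptions for illustration.

```python
import os
import tempfile
import zipfile

def list_zip_members(folder):
    """Return {zip file name: member names}, skipping entries that are not zip archives."""
    members = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not zipfile.is_zipfile(path):
            continue  # e.g. .DS_Store
        with zipfile.ZipFile(path, "r") as zf:
            members[name] = zf.namelist()
    return members

# Demo on a throwaway folder with one toy zip and one stray non-zip file.
with tempfile.TemporaryDirectory() as folder:
    with zipfile.ZipFile(os.path.join(folder, "toy.zip"), "w") as zf:
        zf.writestr("toy.csv", "a,b\n1,2\n")
    with open(os.path.join(folder, ".DS_Store"), "wb") as stray:
        stray.write(b"not a zip")
    result = list_zip_members(folder)

print(result)
```

Checking the file itself is a bit more robust than checking the extension, since it also catches a file that is named `.zip` but isn't one.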

## Checking for delimiters

Now that we can get inside these files, let's test for delimiters. First do some Googling and see if you can come up with some good approaches. 

---

Some dead space to make it easier to not peek ahead. Practice the searching!

---

For real!

---

Okay, let's get on with it.

The `csv` module has a super-useful class called `Sniffer`. Its `sniff` method lets you test for delimiters. Let's see it in action. (Also, we're going to store the delimiters in a dictionary keyed to file name so that we can use them later.)

In [20]:
import csv

delimiters = dict() 

# Start by reading in all the files again.

for this_zf in zip_files :
    with ZipFile("wedge/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")
            
            dialect = csv.Sniffer().sniff(sample=input_file.readline(),
                                      delimiters=[",",";","\t"])
            
            delimiters[file_name] = dialect.delimiter
            
            print(" ".join(["It looks like",
                           file_name,
                           "has delimiter",
                           dialect.delimiter,
                           "."]))

            input_file.close() # tidy up

It looks like transArchive_201310_201312_small.csv has delimiter , .
It looks like transArchive_201207_201209_small.csv has delimiter , .
It looks like transArchive_201204_201206_inactive_small.csv has delimiter ; .
It looks like transArchive_201304_201306_inactive_small.csv has delimiter ; .
It looks like transArchive_201007_201009_small.csv has delimiter , .
It looks like transArchive_201105_small.csv has delimiter , .
It looks like transArchive_201110_201112_small.csv has delimiter , .
It looks like transArchive_201304_201306_small.csv has delimiter , .
It looks like transArchive_201404_201406_inactive_small.csv has delimiter ; .
It looks like transArchive_201504_201506_small.csv has delimiter , .


BadZipFile: File is not a zip file

Let's read through this and try to interpret what's going on. 
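
To help with the interpretation, here is the sniffing step pulled out on its own and run on two short, hand-made samples instead of the zipped files. The helper name `guess_delimiter` and the sample strings are assumptions for illustration; the samples only imitate the first few columns of the Wedge data.

```python
import csv

def guess_delimiter(sample, candidates=",;\t"):
    """Return the delimiter csv.Sniffer picks for `sample` from `candidates`."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# Hand-made samples: a header line plus one data line each.
comma_sample = '"datetime","register_no","emp_no"\r\n"2013-10-01 07:11:53","700","700"\r\n'
semicolon_sample = '"datetime";"register_no";"emp_no"\r\n"2013-10-01 07:11:53";"700";"700"\r\n'

comma_guess = guess_delimiter(comma_sample)
semicolon_guess = guess_delimiter(semicolon_sample)

print(repr(comma_guess), repr(semicolon_guess))
```

`sniff` returns a dialect object, and the `delimiter` attribute is the piece stored in the `delimiters` dictionary above.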

## Checking for Headers

Now that we can find the delimiters, let's check for the presence of headers. First, let's 
just split the first line based on the delimiter and print that out. 

In [None]:
for this_zf in zip_files :
    with ZipFile("wedge/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")
            
            this_delimiter = delimiters[file_name]
            
            for line in input_file :
                print(line.strip().split(this_delimiter))
                break

            input_file.close() # tidy up

Now rewrite the above cell so that you test for the presence of a header row and write out the True or False value.

In [None]:
# your code here
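
One possible heuristic, sketched on hand-made lines rather than the real files: the first column in the samples printed earlier is a timestamp like `2013-10-01 07:11:53`, so a first line whose first field does not parse as a timestamp is probably a header. The function name `looks_like_header` is hypothetical, and the sketch assumes every file uses that same timestamp format, which hasn't been checked here.

```python
from datetime import datetime

def looks_like_header(first_line, delimiter):
    """Guess whether `first_line` is a header row (assumed heuristic, see lead-in)."""
    # Take the first field and drop any surrounding double quotes.
    first_field = first_line.strip().split(delimiter)[0].strip('"')
    try:
        datetime.strptime(first_field, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return True   # not a timestamp, so probably a column name
    return False      # parsed as a timestamp, so this is a data row

header_result = looks_like_header('"datetime","register_no","emp_no"\r\n', ",")
data_result = looks_like_header('2013-10-01 07:11:53;700;700\r\n', ";")

print(header_result, data_result)
```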

## Mix of Quotes and Not

There are some simple ways we can probably deal with quotes for the Wedge. If we get here, we'll discuss. 
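
As one candidate approach (not necessarily the one we'll settle on in class), `csv.reader` strips the surrounding quotes when it parses a line, so quoted and unquoted files come out looking the same. The sample lines and the helper name `parse_line` below are made up for illustration.

```python
import csv

# Hand-made lines imitating a quoted, comma-delimited file and an
# unquoted, semicolon-delimited file.
quoted_line = '"2013-10-01 07:11:53","700","KB:Tofu Creole/Br.Rice/Adult",,NULL\r\n'
unquoted_line = '2013-10-01 07:11:53;700;KB:Tofu Creole/Br.Rice/Adult;;NULL\r\n'

def parse_line(line, delimiter):
    """Parse one delimited line, removing surrounding double quotes from fields."""
    return next(csv.reader([line], delimiter=delimiter, quotechar='"'))

quoted_fields = parse_line(quoted_line, ",")
unquoted_fields = parse_line(unquoted_line, ";")

print(quoted_fields)
print(unquoted_fields)
```

With the fields parsed this way, the owner number would be read by index from the resulting list rather than from the raw string.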