In this notebook we'll work through some of the issues we identified in class that we'll need to solve to do the Wedge project. 
<!--
Here's a reminder.
![list-from-board](list_of_tasks.jpg)
Working from this list, we see the key tasks as follows:
--> 

Here's a list of tasks we need to accomplish: 

1. We need to get our file or files out of the zip file.
1. We need to be able to read each file into Python while it's still inside the zip. 
1. We need to do a few tests: looking for a header row, checking for delimiters, checking for quotes.
1. We need to write out the data in a "clean" format. 

I've built some toy examples for us to play with in class. First, let's get a list of the files. The `os` package has a handy function `listdir` that will help. 

In [None]:
import os

In [None]:
os.listdir("data/")

Let's save these files to a variable.

In [None]:
zip_files = os.listdir("data/")
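
One thing to watch: `os.listdir` returns *everything* in the folder, in no guaranteed order. Here's a minimal sketch of one way to keep only the zip files and sort them. The helper name `list_zip_files` and the toy files are made up for illustration, and it runs in a throwaway temporary folder rather than our `data/` folder.

```python
import os
import tempfile
from zipfile import ZipFile

def list_zip_files(folder):
    """Return the sorted names of the .zip files in `folder` (hypothetical helper)."""
    return sorted(name for name in os.listdir(folder)
                  if name.lower().endswith(".zip"))

# Toy demonstration in a temporary folder (not the class data).
with tempfile.TemporaryDirectory() as demo_dir:
    # Two tiny zips, created out of alphabetical order on purpose.
    with ZipFile(os.path.join(demo_dir, "b.zip"), "w") as zf:
        zf.writestr("b.txt", "hello\n")
    with ZipFile(os.path.join(demo_dir, "a.zip"), "w") as zf:
        zf.writestr("a.txt", "hello\n")
    # A stray file that is not a zip.
    with open(os.path.join(demo_dir, "notes.txt"), "w") as f:
        f.write("not a zip\n")

    demo_zip_files = list_zip_files(demo_dir)
    print(demo_zip_files)
```

The stray `notes.txt` is left out, so later code that calls `ZipFile` on every entry won't trip over it.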

## Working with Zip Files

Zip files are complicated, but useful. Here's a nice description I found on [GeeksForGeeks](https://www.geeksforgeeks.org/working-zip-files-python/):

> ZIP is an archive file format that supports lossless data compression. By lossless compression, we mean that the compression algorithm allows the original data to be perfectly reconstructed from the compressed data. So, a ZIP file is a single file containing one or more compressed files, offering an ideal way to make large files smaller and keep related files together.

So, one tricky thing is that a zip file is a _file_ but it can also contain lots of other sub-files, so it acts like a _folder_ as well.

Q: Why don't we just unzip all the Wedge files by hand and skip working with zips in Python? 

A: Because these files are *big*. We might run out of space unzipping them all. 
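
If you're curious how much bigger a file gets when it's unzipped, the zip itself will tell you without extracting anything. This is a small sketch on a made-up zip built in memory (the name `toy.txt` and its contents are invented); `infolist`, `compress_size`, and `file_size` come from the `zipfile` module.

```python
import io
from zipfile import ZipFile, ZIP_DEFLATED

# Build a small, very repetitive toy zip in memory (not a Wedge file).
buffer = io.BytesIO()
with ZipFile(buffer, "w", compression=ZIP_DEFLATED) as zf:
    zf.writestr("toy.txt", "the same line over and over\n" * 1000)

# Read the sizes back without extracting anything to disk.
sizes = {}
with ZipFile(buffer, "r") as zf:
    for info in zf.infolist():
        sizes[info.filename] = (info.compress_size, info.file_size)
        print(info.filename,
              "compressed:", info.compress_size, "bytes,",
              "unzipped:", info.file_size, "bytes")
```

The same loop over `zf.infolist()` should work on a zip opened from disk, so you could add up `file_size` across files before deciding whether you have room to extract them.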

There's a useful package for working with zip files called ... `zipfile`. Might be worth bookmarking the manual [page](https://docs.python.org/3/library/zipfile.html) for it. We won't need the whole package, so we'll just import the _Class_ `ZipFile`.

In [None]:
from zipfile import ZipFile # usually you'd do all these imports at the beginning

In [None]:
# Let's extract one file from the first zip in our list

# opening the zip file in READ mode 
with ZipFile("data/" + zip_files[0], 'r') as zf : 
    # printing what's in the zip file.  
    zf.printdir() 
  
    # extracting all the files 
    print('Extracting all the files now...') 
    zf.extractall() 
    print('Done!') 

Look in the folder `python-file-exploration/` and you'll see that the file was extracted into this directory. (Q: Why this one?) Now, we don't want to do this in practice, so we'll try to just read the files in the zip file. Let's delete the file that we just extracted just to be clean. The `os` package has a helpful function (`remove`) for us. But we need to get the name of the file first.

In [None]:
with ZipFile("data/" + zip_files[0], 'r') as zf :
    print(zf.namelist())

Q: what is `namelist` returning?

A: A list of strings, each string containing the name of a file inside the zip. 

In [None]:
# Now let's delete that spurious file we created
with ZipFile("data/" + zip_files[0], 'r') as zf :
    this_file_list = zf.namelist()
    os.remove(this_file_list[0])

Now go check the `python-file-exploration` directory and you'll see it's gone.

In this next cell, write a loop over all the files in `zip_files`, printing the names of the files inside each zip to the screen.

In [None]:
for zipf in zip_files :
    with ZipFile("data/" + zipf,'r') as zf :  
        print(zf.namelist())

Q: What do you notice about the contents of the zips? 

A: One of them has two files! I don't think this is true for the Wedge, but it's good to practice. 

Now, we're getting close. At this point, we'd like to do something like the following:

1. Open a zip file
1. Get a list of the files in there
1. Read those files as we've read plain text files before.

In [None]:
this_zf = zip_files[0]

with ZipFile("data/" + this_zf,'r') as zf :
    zipped_files = zf.namelist()
    
    for file_name in zipped_files :
        with zf.open(file_name,'r') as input_file :
            for idx, line in enumerate(input_file) :
                print(line)
                if idx > 5 :
                    break


Let's spend a little time reading what's going on here. 

Q: What's up with the `b'some string stuff'`? 

A: `zf.open` hands us the file in binary mode, so each line comes back as _bytes_, not a string. The explanation of this gets [complicated](https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string) and we'll talk more about it later in the semester. For now, we can wrap the file in the class `io.TextIOWrapper`, which decodes the bytes into text as we read.
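
To make the bytes-versus-string difference concrete, here's a tiny sketch on a made-up line (the content is invented, not from our data). It shows that string methods won't accept a `str` argument on bytes, and that `decode` turns the bytes into an ordinary string.

```python
# A made-up line, the way it might come out of a zipped file: note the b prefix.
raw_line = b'"name"\t"score"\n'
print(type(raw_line))

# Splitting bytes on a str delimiter fails...
try:
    raw_line.split("\t")
except TypeError as err:
    print("Can't mix bytes and str:", err)

# ...so we decode to a string first, assuming the file is UTF-8.
text_line = raw_line.decode("utf-8")
fields = text_line.strip().split("\t")
print(fields)
```

Decoding line by line like this works, but it gets tedious, which is why wrapping the whole file once is usually nicer.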

In [None]:
import io

this_zf = zip_files[0]

with ZipFile("data/" + this_zf,'r') as zf :
    zipped_files = zf.namelist()
    
    for file_name in zipped_files :
        input_file = zf.open(file_name,'r')
        input_file = io.TextIOWrapper(input_file,encoding="utf-8")
        
        for idx, line in enumerate(input_file) :
            print(line)
            if idx > 3 :
                break

        input_file.close() # tidy up

Q: What do you notice about this output? 

A: Header rows. Quotes around strings. Tab delimiters are being printed as actual tabs.

Okay, now we're close to finishing our work with zip files. In the cell below, write code that will 

1. Iterate over every zip file.
1. Print out the name of the containing file. 
1. Print out the first 3 lines of each file. 

In [None]:
for this_zf in zip_files :
    with ZipFile("data/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            print(file_name + "\n")
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")

            for idx, line in enumerate(input_file) :
                print(line)
                if idx >= 2 : # idx starts at 0, so this stops after 3 lines
                    break

            input_file.close() # tidy up

Notice all the weirdness in there!
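
The loops above count lines with `enumerate` and `break`. As an alternative, here's a sketch that uses `itertools.islice` to take the first `n` lines, which makes off-by-one mistakes harder. The helper name `head_of_zipped_file` and the toy zip are made up for illustration.

```python
import io
from itertools import islice
from zipfile import ZipFile

def head_of_zipped_file(zf, file_name, n=3, encoding="utf-8"):
    """Return the first n lines of file_name inside the open ZipFile zf."""
    # Closing the wrapper also closes the underlying zipped file.
    with io.TextIOWrapper(zf.open(file_name, "r"), encoding=encoding) as text_file:
        return [line.rstrip("\n") for line in islice(text_file, n)]

# Toy zip built in memory (invented content, not the class data).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("toy.txt", "a\tb\n1\t2\n3\t4\n5\t6\n")

with ZipFile(buffer, "r") as zf:
    first_lines = head_of_zipped_file(zf, "toy.txt", n=3)

print(first_lines)
```

Because the helper returns the lines instead of printing them, we can reuse it later when we test for delimiters and headers.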

## Checking for delimiters

Now that we can get inside these files, let's test for delimiters. First do some Googling and see if you can come up with some good approaches. 

---

Some dead space to make it easier to not peek ahead. Practice the searching!

---

For real!

---

Okay, let's get on with it.

The `csv` module has a super-useful class called `Sniffer`. Its `sniff` method looks at a sample of text and guesses the delimiter. Let's see it in action. (Also, we're going to store the delimiters in a dictionary keyed to file name so that we can use them later.)

In [None]:
import csv

delimiters = dict() 

# Start by reading in all the files again.

for this_zf in zip_files :
    with ZipFile("data/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")
            
            dialect = csv.Sniffer().sniff(sample=input_file.readline(),
                                      delimiters=",;\t") # a string of candidate characters
            
            delimiters[file_name] = dialect.delimiter
            
            #input_file.seek(0)
            
            print(" ".join(["It looks like",
                           file_name,
                           "has delimiter",
                           repr(dialect.delimiter), # repr makes a tab visible as '\t'
                           "."]))

            input_file.close() # tidy up
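
It can help to see `sniff` on its own, away from the zip-reading code. This sketch runs it on three made-up strings (the sample names and contents are invented). Keep in mind the sniffer is guessing, so on messier data it can guess wrong or raise `csv.Error`.

```python
import csv

# Three tiny made-up samples, one per delimiter we care about.
samples = {
    "commas": "id,name,score\n1,ann,3.5\n2,bob,4.0\n",
    "semicolons": "id;name;score\n1;ann;3.5\n2;bob;4.0\n",
    "tabs": "id\tname\tscore\n1\tann\t3.5\n2\tbob\t4.0\n",
}

detected = {}
for label, sample in samples.items():
    # Restrict the guess to the three candidate characters.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
    detected[label] = dialect.delimiter
    # repr makes the tab show up as '\t' instead of blank space.
    print(label, "->", repr(dialect.delimiter))
```

Giving the sniffer a few lines instead of just one gives it more to work with when the first line is unusual.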

In [None]:
for this_zf in zip_files :
    with ZipFile("data/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
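            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")

            for line in input_file :
                # strip the newline first, then split on this file's own delimiter
                line = line.strip().split(delimiters[file_name])

            input_file.close() # tidy up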


Let's read through this and try to interpret what's going on. 
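
Splitting each line by hand works for simple rows, but it leaves the quotes in place and breaks if a delimiter shows up inside a quoted field. As a sketch of another route, `csv.reader` can do the splitting once we know the delimiter. The toy zip, its file names, and the `toy_delimiters` dictionary below are all invented for illustration.

```python
import csv
import io
from zipfile import ZipFile

# Toy zip built in memory with one tab-delimited and one comma-delimited file.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("toy_tabs.txt", 'name\tcomment\n"ann"\t"likes tabs"\n')
    zf.writestr("toy_commas.csv", 'name,comment\n"bob","likes, commas"\n')

# Stand-in for a dictionary of delimiters keyed by file name.
toy_delimiters = {"toy_tabs.txt": "\t", "toy_commas.csv": ","}

parsed = {}
with ZipFile(buffer, "r") as zf:
    for file_name in zf.namelist():
        # newline="" lets the csv module handle line endings itself.
        with io.TextIOWrapper(zf.open(file_name, "r"),
                              encoding="utf-8", newline="") as text_file:
            reader = csv.reader(text_file, delimiter=toy_delimiters[file_name])
            parsed[file_name] = list(reader)

for file_name, rows in parsed.items():
    print(file_name, rows)
```

Notice that the quotes are gone and the comma inside `"likes, commas"` did not split the field.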

## Checking for Headers

Now that we can find the delimiters, let's check for the presence of headers. First, let's 
just split the first line based on the delimiter and print that out. 

In [None]:
for this_zf in zip_files :
    with ZipFile("data/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")
            
            this_delimiter = delimiters[file_name]
            
            for line in input_file :
                print(line.strip().split(this_delimiter))
                break

            input_file.close() # tidy up

Now rewrite the above cell so that you test for the presence of a header row and write out the True or False value.

In [None]:
for this_zf in zip_files :
    with ZipFile("data/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")
            
            this_delimiter = delimiters[file_name]
            
            for line in input_file :
                print("Header row? " + str("Sepal" in line))
                print(line.strip().split(this_delimiter))
                break

            input_file.close() # tidy up
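
Checking for a known word like `"Sepal"` only works when we already know what the header says. For files we haven't seen, `csv.Sniffer` also has a `has_header` method that guesses by comparing the first row to the rows below it. Here's a sketch on two made-up samples. It's a heuristic, so treat its answer as a hint rather than a guarantee.

```python
import csv

# Two made-up samples: the same data with and without a header row.
with_header = "name,score\nann,1.5\nbob,22\ncat,3.25\n"
without_header = "ann,1.5\nbob,22\ncat,3.25\n"

sniffer = csv.Sniffer()
# In the first sample the top row doesn't look like the rows below it
# ("score" is not a number), which is what the heuristic keys on.
guess_with = sniffer.has_header(with_header)
guess_without = sniffer.has_header(without_header)

print("Header row? " + str(guess_with))
print("Header row? " + str(guess_without))
```

On a file inside a zip, we could read the first several lines into one string and pass that to `has_header` instead of these toy samples.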