In this notebook we'll work through some of the issues we identify in class that we need to solve for the Wedge project. The key tasks are as follows:

1. We need to get our file or files out of the zip file.
1. We need to be able to read in the file. 
1. We need to do a few tests: looking for a header row, checking for delimiters, checking for quotes.
1. We need to identify the owner number, which has index `45` in the row.
1. We need to find the destination row.
1. We need to write out the row. 
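
To make the owner-number step a bit more concrete, here is a minimal sketch that splits a delimited line and pulls out the field at index `45`. The helper name `get_owner` and the made-up row are assumptions for illustration only; real Wedge rows will look different, and a naive `split` assumes no delimiters hide inside quoted fields.

```python
OWNER_INDEX = 45  # position of the owner number, per the task list above

def get_owner(line, delimiter=","):
    """Return the field at OWNER_INDEX, or None if the row is too short.

    Hypothetical helper: uses a naive split, so it assumes no delimiter
    characters appear inside quoted fields.
    """
    fields = line.rstrip("\r\n").split(delimiter)
    if len(fields) <= OWNER_INDEX:
        return None
    return fields[OWNER_INDEX].strip('"')

# a made-up row with 50 fields, where field 45 holds the value "45"
fake_row = ",".join(str(i) for i in range(50))
print(get_owner(fake_row))
```

Returning `None` for short rows is just one choice; you might prefer to raise an error so that malformed rows don't pass silently.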

I've built some toy examples for us to play with in class. First, let's get a list of the files. The `os` package has a handy function `listdir` that will help. 

In [4]:
import os

In [5]:
os.listdir("data")
#os.listdir()

['file1.zip',
 'file2.zip',
 'file3.zip',
 'file4.zip',
 'file5.zip',
 'file_6_7.zip']

Let's save these files to a variable.

In [6]:
zip_files = os.listdir("data/")
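
One thing to watch: `os.listdir` returns bare names (no folder), in arbitrary order, and it returns everything in the folder, not just zips. Here's a small sketch of a more defensive version. The helper name `list_zip_paths` is made up for this example, and the demo runs on a throwaway folder so it works anywhere.

```python
import os
import tempfile

def list_zip_paths(folder):
    """Return sorted full paths of the .zip files in `folder`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".zip")
    )

# demo on a throwaway folder holding two fake zips and one other file
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["file1.zip", "notes.txt", "file2.zip"]:
        open(os.path.join(demo_dir, name), "w").close()
    found = [os.path.basename(p) for p in list_zip_paths(demo_dir)]
    print(found)
```

Using `os.path.join` instead of `"data/" + name` also keeps the paths correct across operating systems.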

## Working with Zip Files

Zip files are complicated, but useful. Here's a nice description I found on [GeeksForGeeks](https://www.geeksforgeeks.org/working-zip-files-python/):

> ZIP is an archive file format that supports lossless data compression. By lossless compression, we mean that the compression algorithm allows the original data to be perfectly reconstructed from the compressed data. So, a ZIP file is a single file containing one or more compressed files, offering an ideal way to make large files smaller and keep related files together.

So, one tricky thing is that a zip file is a _file_ but it can also contain lots of other sub-files, so it acts like a _folder_ as well.
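
To see the "file that acts like a folder" idea in isolation, here's a small sketch that builds a zip entirely in memory and then lists and reads its members. The member names `a.csv` and `b.csv` are invented for the example.

```python
import io
from zipfile import ZipFile

# build a small zip archive in memory (nothing touches the disk)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "x,y\n1,2\n")
    zf.writestr("b.csv", "x,y\n3,4\n")

# the archive is a single bytes object ...
archive_bytes = buffer.getvalue()

# ... but it can be listed like a folder and read member by member
with ZipFile(io.BytesIO(archive_bytes), "r") as zf:
    members = zf.namelist()
    first_contents = zf.read("a.csv").decode("utf-8")

print(members)
print(first_contents)
```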

Q: Why don't we just unzip all the Wedge files by hand first and skip the unzipping in our code? 

A: ?? 

There's a useful package for working with zip files called ... `zipfile`. Might be worth bookmarking the manual [page](https://docs.python.org/3/library/zipfile.html) for it. We won't need the whole package, so we'll just import the _Class_ `ZipFile`.

In [7]:
from zipfile import ZipFile # usually you'd do all these imports at the beginning

In [8]:
# Let's extract one file from the first zip in our list

# opening the zip file in READ mode 
with ZipFile("data/" + zip_files[0], 'r') as zf : 
    # printing what's in the zip file.  
    zf.printdir() 
  
    # extracting all the files 
    print('Extracting all the files now...') 
    zf.extractall() 
    print('Done!') 

File Name                                             Modified             Size
file1.csv                                      2018-09-17 18:18:58         4177
Extracting all the files now...
Done!


In [9]:
zip_files

['file1.zip',
 'file2.zip',
 'file3.zip',
 'file4.zip',
 'file5.zip',
 'file_6_7.zip']

Look in the folder for this repository and you'll see that the file was extracted into this directory. (Q: Why this one?) Now, we don't want to do this in practice, so we'll try to just read the files in the zip file. Let's delete the file that we just extracted just to be clean. The `os` package has a helpful function (`remove`) for us. But we need to get the name of the file first.

In [10]:
zip_files

['file1.zip',
 'file2.zip',
 'file3.zip',
 'file4.zip',
 'file5.zip',
 'file_6_7.zip']

In [11]:
for zip_file in zip_files :
    with ZipFile("data/" + zip_file, 'r') as zf :
        print(zf.namelist())

['file1.csv']
['file2.csv']
['file3.csv']
['file4.csv']
['file5.csv']
['file6.csv', 'file7.csv']


Q: what is `namelist` returning?

A: The list of names of the files inside each zip. For `file1.zip` this is a list of one element, which is a string: the name of the underlying csv. For `file_6_7.zip` the list has two elements. 

In [12]:
# Now let's delete that spurious file we created
with ZipFile("data/" + zip_files[0], 'r') as zf :
    this_file_list = zf.namelist()
    #print(this_file_list)
    os.remove(this_file_list[0])

In [13]:
def clean_up_temp_folder(folder_name) : 
    # delete every file in a temporary folder, then the folder itself

    num_deleted = 0 
    
    # get files inside temp dir
    files_to_delete = os.listdir(folder_name)

    # delete them one at a time.
    for file in files_to_delete :
        # use the function argument here, not the global temp_folder_name
        os.remove(os.path.join(folder_name, file))
        num_deleted += 1

    # remove the (now empty) folder
    os.rmdir(folder_name)
    
    print(f"Just removed folder {folder_name}. We deleted {num_deleted} files.")
    return num_deleted
    

In [14]:
# Make new directory
temp_folder_name = "temp"

if not os.path.isdir(temp_folder_name) : # if the folder doesn't exist yet
    os.mkdir(temp_folder_name)           # make it

# extract all zipfile contents to directory
for zip_file in zip_files :
    with ZipFile("data/" + zip_file, 'r') as zf :
        
        # extract all contents to temporary folder
        zf.extractall(path=temp_folder_name) 


In [15]:
clean_up_temp_folder(temp_folder_name)

Just removed folder temp. We deleted 7 files.


7

Now go check the folder for this repository on your machine and you'll see it's gone.
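
If you'd rather not write the cleanup yourself, the standard library's `tempfile.TemporaryDirectory` can do it for you. This is an alternative sketch, not what we did above: it extracts a tiny in-memory zip (the member name `toy.csv` is invented) into a temporary folder that disappears when the `with` block ends.

```python
import io
import os
import tempfile
from zipfile import ZipFile

# a tiny in-memory archive standing in for one of the zips in data/
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("toy.csv", "a,b\n1,2\n")
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp_dir:
    with ZipFile(buffer, "r") as zf:
        zf.extractall(path=tmp_dir)
    extracted = os.listdir(tmp_dir)
    print(extracted)

# the folder and everything in it are removed once the with block ends
still_there = os.path.isdir(tmp_dir)
print(still_there)
```

The trade-off is that you don't get to pick the folder name or inspect the files afterwards, which is sometimes exactly what you want and sometimes not.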

In this next cell, write a loop over all the files in `zip_files` printing the contents to the screen.

In [16]:
for zipf in zip_files :
    with ZipFile("data/" + zipf,'r') as zf :  
        print(zf.namelist())

['file1.csv']
['file2.csv']
['file3.csv']
['file4.csv']
['file5.csv']
['file6.csv', 'file7.csv']


Q: What do you notice about the contents of the zips? 

A: ?? 

Now, we're getting close. At this point, we'd like to do something like the following:

1. Open a zip file
1. Get a list of the files in there
1. Read those files as we've read plain text files before.

In [46]:
this_zf = zip_files[0]

with ZipFile("data/" + this_zf,'r') as zf :
    zipped_files = zf.namelist()
    
    for file_name in zipped_files :
        #print(file_name)
        with zf.open(file_name,'r') as input_file :
            for idx, line in enumerate(input_file) :
                print(line)
                if idx == 4 :
                    break


b'"Sepal.Length"\t"Sepal.Width"\t"Petal.Length"\t"Petal.Width"\t"Species"\r\n'
b'5.1\t3.5\t1.4\t0.2\t"setosa"\r\n'
b'4.9\t3\t1.4\t0.2\t"setosa"\r\n'
b'4.7\t3.2\t1.3\t0.2\t"setosa"\r\n'
b'4.6\t3.1\t1.5\t0.2\t"setosa"\r\n'


Let's spend a little time reading what's going on here. 

Q: What's up with the `b'some string stuff'`? 

A: ?? 


To deal with byte strings, we can use `io.TextIOWrapper` to get the job done.

In [50]:
import io

this_zf = zip_files[0]

with ZipFile("data/" + this_zf,'r') as zf :
    zipped_files = zf.namelist()
    
    for file_name in zipped_files :
        input_file = zf.open(file_name,'r')
        input_file = io.TextIOWrapper(input_file,encoding="utf-8")
        
        for idx, line in enumerate(input_file) :
            print(line)
            if idx > 3 :
                break

        input_file.close() # tidy up

"Sepal.Length"	"Sepal.Width"	"Petal.Length"	"Petal.Width"	"Species"

5.1	3.5	1.4	0.2	"setosa"

4.9	3	1.4	0.2	"setosa"

4.7	3.2	1.3	0.2	"setosa"

4.6	3.1	1.5	0.2	"setosa"



Q: What do you notice about this output? 

A: ??
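
Here's the same raw-versus-wrapped comparison boiled down to a toy you can run anywhere. It's a sketch built on an in-memory zip with an invented member (`toy.tsv`), so the only thing to look at is the type of each line.

```python
import io
from zipfile import ZipFile

# a tiny tab-delimited file with Windows line endings, zipped in memory
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("toy.tsv", 'a\tb\r\n1\t"x"\r\n')
buffer.seek(0)

with ZipFile(buffer, "r") as zf:
    # raw read: each line is a bytes object
    with zf.open("toy.tsv", "r") as raw:
        raw_first = raw.readline()

    # wrapped read: each line is a str
    with io.TextIOWrapper(zf.open("toy.tsv", "r"), encoding="utf-8") as text:
        text_lines = [line.rstrip("\n") for line in text]

print(type(raw_first).__name__, raw_first)
print(text_lines)
```

Using the wrapper in a `with` statement closes the underlying zipped file too, so there's no separate `close()` call to remember.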

Okay, now we're close to finishing our work with zip files. In the cell below, write code that will 

1. Iterate over every zip file.
1. Print out the name of the containing file. 
1. Print out the first 3 lines of each file. 

In [70]:
import csv

In [90]:
andrews_list = ['"Sepal.Length"', '"Sepal.Width"', '"Petal.Length"', '"Petal.Width"', '"Species"']

In [91]:
andrews_list

['"Sepal.Length"',
 '"Sepal.Width"',
 '"Petal.Length"',
 '"Petal.Width"',
 '"Species"']

In [92]:
first_item = andrews_list[0]

In [94]:
first_item.strip('"')

'Sepal.Length'

In [96]:
first_item.replace('"','')

'Sepal.Length'

In [128]:
headers = ['Sepal.Length','Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

In [129]:
# your code here

zip_files = os.listdir("data")

if not os.path.isdir("temp_clean") : # if the folder doesn't exist yet
    os.mkdir("temp_clean")           # make it

for zip_file in zip_files :
    
    with ZipFile("data/" + zip_file,'r') as my_zip_file : 
    
        files_inside = my_zip_file.namelist()
        for zipped_file in files_inside :
            sniffer = csv.Sniffer()

            print(f"Processing {zipped_file} now.")
            
            with my_zip_file.open(zipped_file,'r') as input_file :
                
                output_file_name = input_file.name.replace(".csv","_clean.csv")

                with open("temp_clean/" + output_file_name,'w') as outfile : 
                    outfile.write("\t".join(headers) + "\n")
                                    
                    rows_printed = 0
                    for idx, line in enumerate(input_file) :

                        # decode the bytes once, then sniff the delimiter from this line
                        text = line.decode("utf-8")
                        dialect = sniffer.sniff(text)
                        line = text.strip().split(dialect.delimiter)
                        line = [piece.replace('"','') for piece in line]

                        # only the first line can be a header
                        file_has_header = (idx == 0 and 'Sepal' in line[0])

                        # we already wrote our own header, so skip the file's
                        if not file_has_header :
                            outfile.write("\t".join(line) + "\n")
                            rows_printed += 1

                        # show the first few lines so we can eyeball the result
                        if rows_printed <= 3 :
                            print(line)
                        
        #print(f"{zip_file} has this inside it {files_inside}")
 

Processing file1.csv now.
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
Processing file2.csv now.
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
Processing file3.csv now.
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
Processing file4.csv now.
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
Processing file5.csv now.
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', '3', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
Processing file6.csv now.
['5.1', '3.5', '1.4', '0.2', 'setosa']
['4.9', 

In [120]:
input_file.name

'file1.csv'

## Checking for delimiters

Now that we can get inside these files, let's test for delimiters. First do some Googling and see if you can come up with some good approaches. 

---

Some dead space to make it easier to not peek ahead. Practice the searching!

---

For real!

---

Okay, let's get on with it.

The `csv` module has a super-useful class called `Sniffer`. Its `sniff` method looks at a sample of text and guesses the delimiter. Let's see it in action. (Also, we're going to store the delimiters in a dictionary keyed to file name so that we can use them later.)

In [89]:
import csv

delimiters = dict() 

# Start by reading in all the files again.

for this_zf in zip_files :
    with ZipFile("data/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")
            
            dialect = csv.Sniffer().sniff(sample=input_file.readline(),
                                          delimiters=[",",";","\t"])
            
            delimiters[file_name] = dialect.delimiter
            
            print(" ".join(["It looks like",
                           file_name,
                           "has delimiter",
                           dialect.delimiter,
                           "."]))

            input_file.close() # tidy up

It looks like file1.csv has delimiter 	 .
It looks like file2.csv has delimiter , .
It looks like file3.csv has delimiter , .
It looks like file4.csv has delimiter ; .
It looks like file5.csv has delimiter , .
It looks like file6.csv has delimiter , .
It looks like file7.csv has delimiter ; .


Let's read through this and try to interpret what's going on. 
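
To help with the interpretation, here is `csv.Sniffer` on its own, with no zip files involved. The sample strings below are invented (iris-style rows without quotes); the point is just to see what `sniff` hands back for each one.

```python
import csv

# made-up two-row samples, one per delimiter we expect to meet
samples = {
    "comma": "5.1,3.5,1.4,0.2,setosa\n4.9,3,1.4,0.2,setosa\n",
    "semicolon": "5.1;3.5;1.4;0.2;setosa\n4.9;3;1.4;0.2;setosa\n",
    "tab": "5.1\t3.5\t1.4\t0.2\tsetosa\n4.9\t3\t1.4\t0.2\tsetosa\n",
}

sniffer = csv.Sniffer()
detected = {}

for label, sample in samples.items():
    try:
        # restrict the guess to the delimiters we consider plausible
        dialect = sniffer.sniff(sample, delimiters=",;\t")
        detected[label] = dialect.delimiter
    except csv.Error:
        # sniff raises csv.Error when it cannot decide
        detected[label] = None

for label, delim in detected.items():
    print(label, repr(delim))
```

Printing with `repr` makes the tab visible as `'\t'` instead of a blank gap. The `try`/`except` matters on real data, since `sniff` gives up with `csv.Error` rather than returning a bad guess.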

## Checking for Headers

Now that we can find the delimiters, let's check for the presence of headers. First, let's 
just split the first line based on the delimiter and print that out. 

In [None]:
for this_zf in zip_files :
    with ZipFile("data/" + this_zf,'r') as zf :
        zipped_files = zf.namelist()

        for file_name in zipped_files :
            input_file = zf.open(file_name,'r')
            input_file = io.TextIOWrapper(input_file,encoding="utf-8")
            
            this_delimiter = delimiters[file_name]
            
            for line in input_file :
                print(line.strip().split(this_delimiter))
                break

            input_file.close() # tidy up

Now rewrite the above cell so that you test for the presence of a header row and write out the True or False value.

In [None]:
# your code here
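
If you get stuck, here's one possible heuristic, sketched on plain strings rather than on the zipped files. It assumes data rows begin with a number (true for our toy iris files, not necessarily for the Wedge), and the function name `looks_like_header` is made up. The `csv.Sniffer` class also has a `has_header` method that is worth reading about, though it wants a sample of several rows.

```python
def looks_like_header(first_line, delimiter):
    """Guess whether `first_line` is a header row.

    Assumption: data rows begin with a number, so a first field that
    cannot be parsed as a float suggests a header.
    """
    first_field = first_line.strip().split(delimiter)[0].strip('"')
    try:
        float(first_field)
    except ValueError:
        return True
    return False

header_guess = looks_like_header('"Sepal.Length"\t"Sepal.Width"', "\t")
data_guess = looks_like_header("5.1,3.5,1.4,0.2,setosa", ",")
print(header_guess, data_guess)
```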

## Mix of Quotes and Not

There are some simple ways we can probably deal with quotes for the Wedge. If we get here, we'll discuss. 
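
As a preview, here is one option sketched on an invented sample: let `csv.reader` do the parsing, since it removes the surrounding quote characters as it goes. Whether this is the right choice for the Wedge files is something we'd need to check against the real data.

```python
import csv
import io

# made-up sample: some fields quoted, some not
raw = '"Sepal.Length"\t"Species"\r\n5.1\t"setosa"\r\n4.9\tsetosa\r\n'

# newline="" leaves line endings alone so the csv module can handle them
reader = csv.reader(io.StringIO(raw, newline=""), delimiter="\t", quotechar='"')
rows = list(reader)

for row in rows:
    print(row)
```

Compared with `strip('"')` or `replace('"','')`, the reader also copes with delimiters that appear inside a quoted field, which a plain `split` would break apart.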