# Python Logic

This workshop walks through the process of writing a basic file parser. The goal is to read in a file, parse the data, write the cleaned data out as a CSV file, and print some statistics about our program's runtime to the console.

Before we get started, there are a few areas of Python to discuss.

### Packages

One of the biggest strengths of python is the number of packages that have been written for it. **Packages** are tools that other people have written that can be used in our own code.

Packages are easy to install. If you're using Anaconda, most of the popular packages are already there for you to use. If you installed Python from the official site, then we need to use **pip**, Python's official package manager, to install them. pip is run from the terminal (not from a Python file or the Python interactive prompt).

In [1]:
pip install numpy


The following command must be run outside of the IPython shell:

    $ pip install numpy

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more information on how to install packages:

    https://docs.python.org/3/installing/


We need to tell Python that we want to use specific packages in our program. Depending on how much of a package we need, there are a few ways to do that.

In [2]:
import numpy # so we can use numpy in our program

import numpy as np # now we can refer to numpy as np 

from numpy import * # brings in every numpy function for us to use
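To see how the import styles differ without needing numpy installed, here is a small sketch that uses the standard library's math package instead. Swapping math in for numpy is our assumption for illustration; the import syntax is the same for any package.

```python
import math                 # refer to things as math.sqrt, math.pi, ...
import math as m            # same package, shorter name
from math import sqrt, pi   # bring in only the names we ask for

print(math.sqrt(16))  # 4.0
print(m.sqrt(16))     # 4.0
print(sqrt(16))       # 4.0

# all three names point at the very same function
same_function = math.sqrt is m.sqrt is sqrt
print(same_function)  # True
```

Importing everything with `*` is convenient, but it can overwrite names we already use, so the first two styles are usually the safer choice.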

We can also import our own files to use as packages by stating **import filename** (without the .py), as shown below. By default, Python looks for the file in the same directory as the script being run (and in the locations of installed packages).

ASIDE: We can also import a file from a different directory by using the sys package and specifying the path. If you want to read up on this later: https://stackoverflow.com/questions/4383571/importing-files-from-different-folder

In [3]:
import hello
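The aside above about importing from a different directory can be sketched as follows. This is only one way to do it; the folder is a throwaway temporary one and the module name workshop_greetings is made up for the example.

```python
import os
import sys
import tempfile

# make a throwaway folder holding a tiny module (hypothetical name)
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "workshop_greetings.py"), "w") as fp:
    fp.write("def greet(name):\n    return 'hello, ' + name\n")

# tell Python to also look in that folder when importing
sys.path.append(folder)

import workshop_greetings  # found through the path we just added

message = workshop_greetings.greet("Andrew")
print(message)  # hello, Andrew
```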

### \_\_main\_\_

Many programming languages start running code in what's called a "main" function. Python considers all code not in a function definition to be part of the main function.

This means, Python runs code from first line to last line, following standard logic flow, and we need to define a function before, or above, where we first use (or call) it.

So, when we import a package, Python will take the line **import 'package'** and RUN the file it looks at.

If it's just defining functions this is okay, but sometimes we want a file to do things IF AND ONLY IF it's the file we call in the terminal with the command **python hello.py**

To do this we use a special variable Python has called **\_\_name\_\_** (with two underscores before and after).
If it's the file we ran, the **\_\_name\_\_** variable will be **\_\_main\_\_**,
otherwise it will be the name of the imported package (e.g. **hello**, with no underscores).

This means we can write the code below (in hello.py)

In [4]:
def main():
    ...
    
if __name__ == "__main__":
    main()

What happens here is that Python sees **main()**, but because we're only defining the function, it doesn't run it. If we used the terminal to call hello.py, it then sees that the **\_\_name\_\_** variable is **\_\_main\_\_** and runs the code in **main()**. If we imported it with **import hello**, the **\_\_name\_\_** variable will be **hello** and the code in **main()** won't run.
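If you want to check the value of **\_\_name\_\_** for yourself, here is a small experiment. It writes a stand-in file (hello_demo.py, a made-up name) to a temporary folder, then both imports it and runs it as a script. Using runpy to mimic `python hello_demo.py` is our choice for keeping the example in one file.

```python
import importlib
import os
import runpy
import sys
import tempfile

# a tiny stand-in for hello.py that records what __name__ was when it ran
source = "seen_name = __name__\n"

folder = tempfile.mkdtemp()
path = os.path.join(folder, "hello_demo.py")
with open(path, "w") as fp:
    fp.write(source)

# 1) importing the file: __name__ is the name of the module
sys.path.append(folder)
imported = importlib.import_module("hello_demo")
print(imported.seen_name)  # hello_demo

# 2) running the file as a script: __name__ is "__main__"
#    (run_name="__main__" mimics calling `python hello_demo.py`)
script_globals = runpy.run_path(path, run_name="__main__")
print(script_globals["seen_name"])  # __main__
```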

So let's put together some basic code that we'll need for this project.

In [5]:
def main():
    return None
    
if __name__ == "__main__":
    main()

Now that we have the skeleton of our program written, we can start writing the body of our code. The packages we need, such as time (more on that later), will be imported as we go.

### Reading from files

We want to get information out of files. To do that we use the **open()** function, which takes a file path and a mode ("r" for 'read' or "w" for 'write'), both as strings.

In [6]:
fp = open("../files/example_data.txt","r") # opens example_data.txt in read mode

ASIDE: we can name variables almost anything we want, following the few rules that exist. **file** was the name of a built-in in Python 2, so by convention we use something else. Popular variable names include **f** for "file" or **fp** for "file-pointer".

When we are done using our file we need to call **close()**. There are a lot of problems that can occur if we don't close a file, but they're a bit too technical for our purposes. For now, just remember to call **close()** when you're done with a file.

Once we have our file opened in read mode, we want to get data from our file, which we can do in a few ways:

In [7]:
data = fp.read() # this reads in the ENTIRE file
# this works for our sample file, but will fill up our computer's memory for really big files

data = fp.read(100) # reads ONLY (up to) the next 100 characters of a file

line = fp.readline() # reads ONE line; call it repeatedly to go through the file line by line

# if we know our file is short enough...
lines = fp.readlines() # reads in ALL lines, where each line is an entry in a list

In a file, each line ends with an invisible newline character (**\\n**). **readline()** reads one line at a time (as a string), stopping at each newline character.

Since we're only looking at one line at a time, this reduces memory usage significantly.
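One detail worth seeing in action: each read picks up where the previous one stopped. The sketch below writes a small three-line file to a temporary folder (the file name and contents are made up for the example) and then mixes the different read calls.

```python
import os
import tempfile

# write a small three-line file to experiment on
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fp:
    fp.write("first\nsecond\nthird\n")

fp = open(path, "r")
chunk = fp.read(3)        # the first 3 characters
one_line = fp.readline()  # the REST of the first line (we already read "fir")
rest = fp.readlines()     # every remaining line, as a list
leftover = fp.read()      # nothing left to read, so we get an empty string
fp.close()

print(repr(chunk))     # 'fir'
print(repr(one_line))  # 'st\n'
print(rest)            # ['second\n', 'third\n']
print(repr(leftover))  # ''
```

That empty string at the end is also why a loop like `while line:` stops: an empty string counts as False.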

Let's edit our program. We'll pretend we're using a large file and need to worry about memory.

In [8]:
def main():
    fp = open("../files/example_data.txt", "r")
    line = fp.readline()
    print(line)
    while line:
        line = fp.readline()
        print(line)
    fp.close() # we're done with the file, so close it
    
if __name__ == "__main__":
    main()

# a python comment in a txt file!?

data: not-a number, gpa: 3.2, credits: 135, semesters: 9

credits:132, gpa : 3.8, semesters: 8

gpa : 4.0, credits: 73, filler, semesters: 3

gpa:2.9, filler, credits :56, semesters: 4

credits : 90, garbage, gpa: 3.7, semesters: 7

gpa:3.2, semesters: 7, credits :113,

! another random character?

5 + 137 = 142

gpa: 3.7, credits: 15, semesters: 1

and more noise





There's a more elegant solution using the **with** statement

In [9]:
def main():
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            print(line)
            
if __name__ == "__main__":
    main()

# a python comment in a txt file!?

data: not-a number, gpa: 3.2, credits: 135, semesters: 9

credits:132, gpa : 3.8, semesters: 8

gpa : 4.0, credits: 73, filler, semesters: 3

gpa:2.9, filler, credits :56, semesters: 4

credits : 90, garbage, gpa: 3.7, semesters: 7

gpa:3.2, semesters: 7, credits :113,

! another random character?

5 + 137 = 142

gpa: 3.7, credits: 15, semesters: 1

and more noise




This does the exact same thing, but is more concise, easier to read, and will automatically close the file when we leave the with statement. It's generally the preferred way to read a file.
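We can confirm the automatic closing by checking the file's **closed** attribute. The sketch below uses a small temporary file (made up for the example) and also shows that the file gets closed when the block ends early because of an error.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fp:
    fp.write("gpa: 3.2\n")

with open(path, "r") as fp:
    open_inside = not fp.closed   # still open inside the block
closed_after = fp.closed          # closed as soon as the block ends

# the file is closed even if the block stops early because of an error
try:
    with open(path, "r") as fp2:
        raise RuntimeError("something went wrong")
except RuntimeError:
    closed_after_error = fp2.closed

print(open_inside, closed_after, closed_after_error)  # True True True
```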

### Parsing Strings

Now that we can access our file, we can start parsing it. Let's take a moment to try and understand how our data is formatted.

If you look at the file example_data.txt you'll notice that in general it's formatted as "word" : "number"

but there are a few lines spread out that don't follow this pattern. That's junk data that we're going to need to deal with as we look at lines.

*I will make a guarantee, so that we don't need to worry about it, that every valid row will have all the same words before colons.*

There are a lot of tools for us to work with, but we'll focus on just a few. If you want to read up on them, documentation is available: https://docs.python.org/3.7/library/stdtypes.html
and https://docs.python.org/3.7/library/string.html
Documentation is not something to memorize; it is used for reference.
Feel free to "control + f" and search for these in the documentation to learn more about them.

The functions we'll be focusing on:

In [10]:
import string
temp = "A Temporary STRING : STRING"

print(temp.lower()) # returns a copy of the string with all lowercase characters
# temp.strip() returns a copy with leading and trailing whitespace (or other characters) removed
print(temp.split()) # splits the string on the characters we ask for (whitespace by default), returns a list
print(string.ascii_letters) # is a string of all ascii letters
print(temp.isdigit()) # returns True if the string is made of only digits
print(temp.isalpha()) # returns True if the string is made of only letters

a temporary string : string
['A', 'Temporary', 'STRING', ':', 'STRING']
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
False
False


ASIDE: for more complex string manipulation there are "regular expressions". Those are much more powerful, since they let us describe patterns of text to search for, but they take a while to learn and are beyond the scope of this workshop.

So, let's deal with the easiest lines first:

In [11]:
if temp[0] in string.ascii_letters:
    print("is a letter!")

is a letter!


**string.ascii_letters** requires the string package, so let's import that too.

In [12]:
import string

Let's make sure our line has the general formatting of what we want.

In [13]:
if ":" in line:
    print("contains a ':'!")

So these statements will get rid of our junk lines. We can combine our two if statements together as well, so let's do that as we modify our code.

In [14]:
import string

def main():
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters and ':' in line:
                print(line)
            
if __name__ == "__main__":
    main()

data: not-a number, gpa: 3.2, credits: 135, semesters: 9

credits:132, gpa : 3.8, semesters: 8

gpa : 4.0, credits: 73, filler, semesters: 3

gpa:2.9, filler, credits :56, semesters: 4

credits : 90, garbage, gpa: 3.7, semesters: 7

gpa:3.2, semesters: 7, credits :113,

gpa: 3.7, credits: 15, semesters: 1



Now, let's get rid of whitespace and separate our string so we can get data out.

In [15]:
line = temp.lower() # makes everything lowercase
line = line.strip() # removes leading and trailing whitespace
line = line.replace(" ", "") # removes all spaces
line = line.split(',') # splits the line at the ,'s into a list

Let's add this to our program, in one line.

In [16]:
import string

def main():
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters and ':' in line:
                line = line.lower().replace(" ", "").split(',')
                print(line)
            
if __name__ == "__main__":
    main()

['data:not-anumber', 'gpa:3.2', 'credits:135', 'semesters:9\n']
['credits:132', 'gpa:3.8', 'semesters:8\n']
['gpa:4.0', 'credits:73', 'filler', 'semesters:3\n']
['gpa:2.9', 'filler', 'credits:56', 'semesters:4\n']
['credits:90', 'garbage', 'gpa:3.7', 'semesters:7\n']
['gpa:3.2', 'semesters:7', 'credits:113', '\n']
['gpa:3.7', 'credits:15', 'semesters:1\n']


These functions are executed left to right. Looking at what they do, this will run as long as .split(',') is at the end, because .lower() and .replace() return strings while .split() returns a list, and a list doesn't have string functions like .lower().

Also, because we assigned this new output to line, we can no longer get the original line back as it was in the file.

Currently line is a list, so \[entry_1, entry_2, ... , entry_n\]
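To make the left-to-right chaining concrete, here it is one step at a time on a single sample line (copied from our data, with some capitals added so that .lower() has something to do).

```python
raw = "GPA : 4.0, Credits: 73, filler, Semesters: 3\n"

step1 = raw.lower()             # 'gpa : 4.0, credits: 73, filler, semesters: 3\n'
step2 = step1.replace(" ", "")  # 'gpa:4.0,credits:73,filler,semesters:3\n'
step3 = step2.split(",")        # ['gpa:4.0', 'credits:73', 'filler', 'semesters:3\n']

# the chained version gives exactly the same result
chained = raw.lower().replace(" ", "").split(",")
print(step3)
print(chained == step3)  # True

# once we have a list, string functions no longer apply
try:
    step3.lower()
except AttributeError as error:
    print("lists have no .lower():", error)
```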


Let's continue to parse this. We want to make sure every entry has data in the format we want, "word" : "number", which means our conditions are: ":" is in the entry, the left side is a word, and the right side is a number.

In [17]:
for entry in line:
    if ":" in entry:
        data = entry.split(':')
        if data[0].isalpha() and data[1].isdigit():
            print(data[0] + " " + data[1])


Let's add this to our program.

In [18]:
import string

def main():
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters and ':' in line:
                line = line.lower().replace(" ", "").split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            print(data[0] + " " + data[1])
            
if __name__ == "__main__":
    main()

credits 135
credits 132
credits 73
credits 56
credits 90
semesters 7
credits 113
credits 15


But wait! This isn't working as we wanted! As it turns out, **str.isdigit()** looks only for digits, so a '.' causes it to be False.

We want a way to check if it's a number, so what we can do is **try** to cast the value to a float.

### try & except

One tool for handling errors that we can take advantage of is the **try** and **except** statements. Python will try to run a block of code, and if there is no error, continue as normal. If Python "throws" an error, we can deal with it in our own block of code.

We can use this to do what we want.

In [19]:
def isNum(num_string):
    try: # we try to do the following block of code
        float(num_string) # cast num_string to a float
        return True # the cast worked, so return True
    except ValueError: # if num_string can't be turned into a number, float() will throw a ValueError
        return False # in which case we return False
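Here is a quick comparison of **isdigit()** and our **isNum()** on a few strings. The last three are worth knowing about: **float()** accepts more than plain decimals, which is fine for our sample data but could matter for other files.

```python
def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

print("3.2".isdigit())       # False, the '.' is not a digit
print(isNum("3.2"))          # True
print(isNum("135"))          # True
print(isNum("not-anumber"))  # False
print(isNum(""))             # False, an empty string is not a number

# float() is more generous than we might expect
print(isNum("1e5"))  # True, scientific notation
print(isNum("nan"))  # True, "not a number" is still a valid float
print(isNum("inf"))  # True, infinity
```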

So let's add this to our code.

In [20]:
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters and ':' in line:
                line = line.lower().replace(" ", "").split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            print(data[0] + " " + data[1])
            
if __name__ == "__main__":
    main()

gpa 3.2
credits 135
semesters 9

credits 132
gpa 3.8
semesters 8

gpa 4.0
credits 73
semesters 3

gpa 2.9
credits 56
semesters 4

credits 90
gpa 3.7
semesters 7

gpa 3.2
semesters 7
credits 113
gpa 3.7
credits 15
semesters 1



But wait! We are checking for ":" in two places, and we don't need to do that. There are also some newlines "\n" we can remove.

Let's think about it: if ":" is in our line, there's no guarantee that every entry will have one, but if an entry has a ":" then its line does too. So checking each entry is enough, and we can simplify our code and remove the redundancy.

In [22]:
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters: # removed our other conditional
                line = line.lower().replace(" ", "").strip("\n").split(',') # added a strip newlines
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            print(data[0] + " : " + data[1])
            
if __name__ == "__main__":
    main()

gpa : 3.2
credits : 135
semesters : 9
credits : 132
gpa : 3.8
semesters : 8
gpa : 4.0
credits : 73
semesters : 3
gpa : 2.9
credits : 56
semesters : 4
credits : 90
gpa : 3.7
semesters : 7
gpa : 3.2
semesters : 7
credits : 113
gpa : 3.7
credits : 15
semesters : 1


We said we wanted to make a CSV of our data. It was presented as pairs separated by ":", so the best way for us to store the data in our program is with a dictionary, where data\[0\] is our key and data\[1\] is our value.

We're going to need one for every line as well, so let's make a list to store our dictionaries.

In [23]:
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().replace(" ", "").strip("\n").split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            data_dict[data[0]] = float(data[1])
            if len(data_dict) != 0:
                data_dicts.append(data_dict)
                print(data_dict)
            
if __name__ == "__main__":
    main()

{'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
{'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}
{'gpa': 4.0, 'credits': 73.0, 'semesters': 3.0}
{'gpa': 2.9, 'credits': 56.0, 'semesters': 4.0}
{'credits': 90.0, 'gpa': 3.7, 'semesters': 7.0}
{'gpa': 3.2, 'semesters': 7.0, 'credits': 113.0}
{'gpa': 3.7, 'credits': 15.0, 'semesters': 1.0}


### Writing to Files

Writing to files is similar to reading from them. First we need to open a file in write mode, so use the **open()** function, but with **"w"** instead of **"r"**.

If the file we tell Python to write to doesn't exist, it will create a new file. If it does exist, it will write over the old one, deleting all data previously in it. If we don't want to clear the file and instead append to it, we can use the append mode with **"a"** instead.

In [24]:
fp = open("../files/example_parsed.csv", "w")
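The difference between **"w"** and **"a"** is easy to see on a throwaway file. The sketch below uses a temporary folder so nothing in our project gets overwritten; the file name is made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as fp:   # the file doesn't exist yet, so it is created
    fp.write("first\n")

with open(path, "w") as fp:   # "w" again: the old contents are thrown away
    fp.write("second\n")

with open(path, "a") as fp:   # "a": keep what's there and add to the end
    fp.write("third\n")

with open(path, "r") as fp:
    contents = fp.read()

print(repr(contents))  # 'second\nthird\n'
```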

Let's add this to our code, but use that same **with** keyword as before.
Since by this point we're done with our first open, we can use **fp** again.

In [25]:
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().replace(" ", "").strip("\n").split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            data_dict[data[0]] = float(data[1])
            if len(data_dict) != 0:
                data_dicts.append(data_dict)
                print(data_dict)
            
    with open("../files/example_parsed.csv", "w") as fp:
        return None
            
if __name__ == "__main__":
    main()

{'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
{'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}
{'gpa': 4.0, 'credits': 73.0, 'semesters': 3.0}
{'gpa': 2.9, 'credits': 56.0, 'semesters': 4.0}
{'credits': 90.0, 'gpa': 3.7, 'semesters': 7.0}
{'gpa': 3.2, 'semesters': 7.0, 'credits': 113.0}
{'gpa': 3.7, 'credits': 15.0, 'semesters': 1.0}


We can use the **file.write()** function to write to a file. **file.write()** does not add newlines, but we can manually add them when we want with the newline character **\n**. For example:

In [26]:
name_var = "Andrew"
with open ("hello.txt", "w") as fp:
    fp.write("hello, my name is " + name_var)

Let's think about what we need to do to write the data we've parsed from our file properly.
We need to make one line for each dictionary in our list **data_dicts**. The first line should be the keys. Then, for each dictionary, we need to traverse its values and write them to a line, with commas separating them. We do not want a comma at the end of the line. In short we want *"a_val, b_val, c_val, d_val, e_val, ... , n_val"*

Let's modify our program first to write the keys.

In [27]:
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().replace(" ", "").strip("\n").split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            data_dict[data[0]] = float(data[1])
            if len(data_dict) != 0:
                data_dicts.append(data_dict)
                print(data_dict)
            
    with open("../files/example_parsed.csv", "w") as fp:
        for index, key in enumerate(data_dicts[0].keys()):
            if index < len(data_dicts[0]) - 1: 
                fp.write(key + ", ")
            else:
                fp.write(key + "\n")
            
if __name__ == "__main__":
    main()

{'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
{'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}
{'gpa': 4.0, 'credits': 73.0, 'semesters': 3.0}
{'gpa': 2.9, 'credits': 56.0, 'semesters': 4.0}
{'credits': 90.0, 'gpa': 3.7, 'semesters': 7.0}
{'gpa': 3.2, 'semesters': 7.0, 'credits': 113.0}
{'gpa': 3.7, 'credits': 15.0, 'semesters': 1.0}


We used the function **enumerate()**. This lets us traverse a list and get TWO variables at each step: the index of our current location in the list, and the value of the list at that location. This is equivalent to:

In [28]:
keys = list(data_dicts[0].keys()) # assumes data_dicts from main(); keys() can't be indexed directly, so we make a list
for index in range(len(keys)):
    print(keys[index])

Using **enumerate()** is much easier to read.

We added "," between each word and a newline character "\n" at the end of the line as well. 
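As an aside, the "separator between every item but not after the last" pattern is exactly what the string function **join()** does, so the index check can be avoided. This is an alternative to the approach in the program, not a replacement for it; the sample keys and values below are copied from our data.

```python
keys = ["gpa", "credits", "semesters"]
values = [3.2, 135.0, 9.0]

header = ", ".join(keys) + "\n"
# join() only accepts strings, so we convert the numbers first
row = ", ".join(str(value) for value in values) + "\n"

print(repr(header))  # 'gpa, credits, semesters\n'
print(repr(row))     # '3.2, 135.0, 9.0\n'
```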

Now we want to go through the values of each dictionary and write them out separated by commas, one line per dictionary, so let's do that.

In [29]:
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().replace(" ", "").strip("\n").split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            data_dict[data[0]] = float(data[1])
            if len(data_dict) != 0:
                data_dicts.append(data_dict)
                print(data_dict)
            
    with open("../files/example_parsed.csv", "w") as fp:
        for index, key in enumerate(data_dicts[0].keys()):
            if index < len(data_dicts[0]) - 1:
                fp.write(key + ", ")
            else:
                fp.write(key + "\n")
        for data_dict in data_dicts:
            for index, value in enumerate(data_dict.values()):
                if index < len(data_dict) - 1:
                    fp.write(str(value) + ", ")
                else:
                    fp.write(str(value) + "\n")
            
if __name__ == "__main__":
    main()

{'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
{'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}
{'gpa': 4.0, 'credits': 73.0, 'semesters': 3.0}
{'gpa': 2.9, 'credits': 56.0, 'semesters': 4.0}
{'credits': 90.0, 'gpa': 3.7, 'semesters': 7.0}
{'gpa': 3.2, 'semesters': 7.0, 'credits': 113.0}
{'gpa': 3.7, 'credits': 15.0, 'semesters': 1.0}


This was almost the same thing as what we did to write the keys, except we're looking at **dict.values()** and doing it for each dictionary in our list.

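But looking at our data, something's off! A dictionary hands back its keys and values in the order they were added (guaranteed since Python 3.7), and our lines didn't all list their words in the same order, so the values don't always line up with the header.

Well, we know what our keys are, so as long as we use the same list of keys for every line we can make sure the values are in the same order, so let's do that.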

In [30]:
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().replace(" ", "").strip("\n").split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            data_dict[data[0]] = float(data[1])
            if len(data_dict) != 0:
                data_dicts.append(data_dict)
                print(data_dict)
            
    with open("../files/example_parsed.csv", "w") as fp:
        for index, key in enumerate(data_dicts[0].keys()):
            if index < len(data_dicts[0]) - 1:
                fp.write(key + ", ")
            else:
                fp.write(key + "\n")
        for data_dict in data_dicts:
            for index, key in enumerate(data_dicts[0].keys()):
                if index < len(data_dict) - 1:
                    fp.write(str(data_dict[key]) + ", ")
                else:
                    fp.write(str(data_dict[key]) + "\n")
            
if __name__ == "__main__":
    main()

{'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
{'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}
{'gpa': 4.0, 'credits': 73.0, 'semesters': 3.0}
{'gpa': 2.9, 'credits': 56.0, 'semesters': 4.0}
{'credits': 90.0, 'gpa': 3.7, 'semesters': 7.0}
{'gpa': 3.2, 'semesters': 7.0, 'credits': 113.0}
{'gpa': 3.7, 'credits': 15.0, 'semesters': 1.0}
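The ordering problem described above can be reproduced with just the first two dictionaries from our output. The sketch assumes Python 3.7 or newer, where dictionaries keep the order in which keys were added.

```python
first = {'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
second = {'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}

keys = list(first.keys())                # ['gpa', 'credits', 'semesters']

misaligned = list(second.values())       # [132.0, 3.8, 8.0], credits lands under gpa!
aligned = [second[key] for key in keys]  # [3.8, 132.0, 8.0]

print(keys)
print(misaligned)
print(aligned)
```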


While writing the file like this is in fact correct, we chose to write to a CSV. Because a proper CSV file has a guaranteed format for our data, it's easy to make a tool to read and write them that we can use instead of rewriting it over and over again.

Python has already done this for us, so we don't even need to write the tool, we can just use it. It's in the preinstalled **csv** package.

In [31]:
import csv
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().replace(" ", "").strip("\n").split(',') # added a strip newlines
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            data_dict[data[0]] = float(data[1])
            if len(data_dict) != 0:
                data_dicts.append(data_dict)
                print(data_dict)
            
    with open("../files/example_parsed.csv", "w", newline="") as fp: # newline="" is recommended by the csv documentation
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    main()

{'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
{'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}
{'gpa': 4.0, 'credits': 73.0, 'semesters': 3.0}
{'gpa': 2.9, 'credits': 56.0, 'semesters': 4.0}
{'credits': 90.0, 'gpa': 3.7, 'semesters': 7.0}
{'gpa': 3.2, 'semesters': 7.0, 'credits': 113.0}
{'gpa': 3.7, 'credits': 15.0, 'semesters': 1.0}


That's it! A lot of programming is wondering if a language can do this or that beforehand, and doing a quick Google search or a search through the documentation to check and to learn how. You'll learn the full capabilities of a language by using and exploring it yourself, so practice, practice, practice! To see what these functions do, take a look here: https://docs.python.org/3.6/library/csv.html
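To see what **DictWriter** produces without touching any files, we can point it at an in-memory buffer. Using io.StringIO as a stand-in for the opened file is our choice for this sketch; the two rows are copied from our parsed data.

```python
import csv
import io

rows = [
    {'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0},
    {'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0},  # different key order
]

buffer = io.StringIO()  # an in-memory stand-in for a file opened with "w"
writer = csv.DictWriter(buffer, fieldnames=['gpa', 'credits', 'semesters'])
writer.writeheader()    # writes the fieldnames as the first row
writer.writerows(rows)  # writes one row per dictionary, ordered by fieldnames

output = buffer.getvalue()
print(output)
```

Notice that the second row comes out in the same column order as the header even though its keys were added in a different order, so the writer takes care of the ordering problem for us.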

At this point there's a small problem with our code, a problem I'll leave to you to fix. If we have a massive file, we're going to use a massive amount of memory, since we're storing all of it in a list of dictionaries before we even begin to write.

Try rewriting the above code so that it writes to the file as it reads it in.
I've included one of my solutions in combined.py
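If you get stuck, here is one possible shape for a streaming version. It is only a sketch and is not the solution in combined.py: the **parse()** helper is a compact stand-in for our parsing logic, **stream()** is a name made up for this example, and in-memory buffers stand in for the two files. It relies on the guarantee that every valid row has the same keys, since the first valid row decides the columns.

```python
import csv
import io
import string

def parse(line):
    """A compact stand-in for our parsing logic: returns a dict or None."""
    data_dict = dict()
    if line and line[0] in string.ascii_letters:
        for entry in line.lower().replace(" ", "").strip("\n").split(","):
            parts = entry.split(":")
            if len(parts) == 2 and parts[0].isalpha():
                try:
                    data_dict[parts[0]] = float(parts[1])
                except ValueError:
                    pass  # the right side is not a number, skip this entry
    return data_dict if data_dict else None

def stream(source, target):
    """Write each parsed row as soon as it is read, keeping nothing in a list."""
    writer = None
    for line in source:
        row = parse(line)
        if row is None:
            continue
        if writer is None:
            # the first valid row decides the column order
            writer = csv.DictWriter(target, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(row)

source = io.StringIO(
    "# junk\n"
    "gpa: 3.2, credits: 135, semesters: 9\n"
    "credits:132, gpa : 3.8, semesters: 8\n"
)
target = io.StringIO()
stream(source, target)
print(target.getvalue())
```

With real files, **source** and **target** would be the two file objects from nested **with open(...)** statements.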

### Cleaning our code

With the current state of our code, our **main()** function is doing a lot: it's reading in a file, parsing it, and writing to a file. So let's break the parsing part into a new function called **parse()**. We're going to want to pass in a line (one line per call), and have it return a dictionary that we can then append to our **data_dicts** list.

In [32]:
import csv
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def parse(line): # we try to keep all functions (except main) in alphabetical order
    data_dict = dict() # placeholder for now, we'll fill this in next
    return data_dict

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().replace(" ", "").strip("\n").split(',') # added a strip newlines
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and isNum(data[1]):
                            data_dict[data[0]] = float(data[1])
            if len(data_dict) != 0:
                data_dicts.append(data_dict)
                print(data_dict)
            
    with open("../files/example_parsed.csv", "w", newline="") as fp: # newline="" is recommended by the csv documentation
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    main()

{'gpa': 3.2, 'credits': 135.0, 'semesters': 9.0}
{'credits': 132.0, 'gpa': 3.8, 'semesters': 8.0}
{'gpa': 4.0, 'credits': 73.0, 'semesters': 3.0}
{'gpa': 2.9, 'credits': 56.0, 'semesters': 4.0}
{'credits': 90.0, 'gpa': 3.7, 'semesters': 7.0}
{'gpa': 3.2, 'semesters': 7.0, 'credits': 113.0}
{'gpa': 3.7, 'credits': 15.0, 'semesters': 1.0}


All the code to do this already exists: it starts at **data_dict = dict()**, and we add a return statement in place of the **data_dicts.append(data_dict)** line.

A consequence of moving the code to a new function is that sometimes it will return None (when a line has no valid data), so let's adjust for that too.

In [33]:
import csv
import string

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().replace(" ", "").strip("\n").split(',')
        for entry in line:
            if ":" in entry:
                data = entry.split(':')
                if data[0].isalpha() and isNum(data[1]):
                    data_dict[data[0]] = float(data[1])
    if len(data_dict) != 0:
        return data_dict

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            line = parse(line)
            if line is not None:
                data_dicts.append(line)
        
    with open("../files/example_parsed.csv", "w", newline="") as fp: # newline="" is recommended by the csv documentation
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    main()

This is a bit more manageable to read. Our main function reads in a file and writes out to a file.
It also calls a parse function to parse each line. We don't need to care how the parsing works as long as what it returns is in the format we expect (a dictionary, or None).

A bonus of this is that now if we want to use the same parsing function in a different program, we can just import the file and call **parse()**, and because the **\_\_name\_\_** of the file won't be **\_\_main\_\_**, the function **main()** won't run!

### Timers

Sometimes it's useful to know how long a program or function takes to run, so let's quickly put together a timer.

To do this we need to use the **time()** function in the time module.

**time()** returns the time in seconds (as a decimal) since the epoch, which is 00:00 UTC on Thursday, January 1st, 1970, so we can do basic arithmetic to determine the number of seconds, minutes, hours, or days elapsed between any two points in time.

ASIDE: because values in programming take up memory, they eventually hit a maximum value. On systems that store the number of seconds since the epoch as a signed 32-bit integer, that maximum is reached on January 19th, 2038, after which the count overflows. Systems that use a 64-bit value are not affected, so whether **time()** misbehaves in 2038 depends on the platform Python is running on.
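
As a quick check of that date, we can ask Python what moment the largest signed 32-bit number of seconds corresponds to. This is only an illustration of the arithmetic; the constant name MAX_INT32 is our own.

```python
from datetime import datetime, timezone

MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit integer can hold

epoch = datetime.fromtimestamp(0, tz=timezone.utc)
last_moment = datetime.fromtimestamp(MAX_INT32, tz=timezone.utc)

print(epoch)        # 1970-01-01 00:00:00+00:00
print(last_moment)  # 2038-01-19 03:14:07+00:00
```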

So to create a basic timer:

In [34]:
import time

start = time.time()
... # our code
end = time.time()
elapsed = end - start
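
As an optional variation, the same pattern can be wrapped in a small helper. This is a sketch, not part of the workshop program: the helper name time_call is hypothetical, and it swaps in **time.perf_counter()**, a standard library clock intended for measuring short intervals, in place of **time.time()**.

```python
import time

def time_call(function, *args):
    """Run function(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = function(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Time a simple built-in call as a stand-in for "our code"
total, seconds = time_call(sum, range(1_000_000))
print(total)  # 499999500000
print("took {:.4f} seconds".format(seconds))
```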

Let's add that to our program to measure the time it takes to run the code *if and only if* we run the file directly, that is, where **\_\_name\_\_** is **\_\_main\_\_**.

In [35]:
import csv
import string
import time

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().replace(" ", "").strip("\n").split(',')
        for entry in line:
            if ":" in entry:
                data = entry.split(':')
                if data[0].isalpha() and isNum(data[1]):
                    data_dict[data[0]] = float(data[1])
    if len(data_dict) != 0:
        return data_dict

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            line = parse(line)
            if line is not None:
                data_dicts.append(line)

    with open("../files/example_parsed.csv", "w", newline="") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)

if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start

### String Formatting

Let's print this out to the terminal, in the format "program ran in \_\_\_\_\_ minutes and \_\_\_\_\_ seconds".
We can do this with:

In [36]:
elapsed_min = elapsed // 60
elapsed_sec = elapsed % 60
print("program ran in " + str(elapsed_min) + " minutes and " + str(elapsed_sec) + " seconds")

There's a slightly cleaner way to say the same thing using **str.format()**:

In [37]:
print("program ran in {} minutes and {} seconds".format(elapsed_min, elapsed_sec))

This is one way to format strings in Python 3, and it's much easier to read than before. It automatically converts our values to strings as well! We can compress this further with something called *f-strings*, but for now let's focus on readability instead of brevity.

The **{}** placeholders are replaced with what we pass in to **str.format()**, in the order we pass it in, so **elapsed_min** fills in the first **{}** and **elapsed_sec** fills in the second **{}**.
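
The ordering rule can be seen in a short sketch. The two values are made up so the example runs on its own; the indexed and named placeholders are extra options **str.format()** supports, shown here only for comparison.

```python
# Made-up values so the example is self-contained
elapsed_min = 2.0
elapsed_sec = 7.5

# Empty braces are filled in the order the arguments are passed
in_order = "program ran in {} minutes and {} seconds".format(elapsed_min, elapsed_sec)

# Numbers inside the braces pick an argument by position
by_index = "{1} seconds, {0} minutes".format(elapsed_min, elapsed_sec)

# Names inside the braces pick a keyword argument
by_name = "{minutes} minutes and {seconds} seconds".format(minutes=elapsed_min, seconds=elapsed_sec)

print(in_order)  # program ran in 2.0 minutes and 7.5 seconds
print(by_index)  # 7.5 seconds, 2.0 minutes
print(by_name)   # 2.0 minutes and 7.5 seconds
```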

Let's make one change so that the number of seconds gets printed to 4 decimal places.

In [38]:
print("program ran in {} minutes and {:.4f} seconds".format(elapsed_min, elapsed_sec))

The **.4f** means "after the decimal point, round the number to four places, in floating point number format".
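
A few more format codes, as a small sketch to experiment with. The values are arbitrary examples.

```python
value = 3.14159265

four_places = "{:.4f}".format(value)    # four digits after the decimal point
two_places = "{:.2f}".format(value)     # two digits after the decimal point
padded = "{:10.2f}|".format(value)      # right-aligned in a field 10 characters wide
thousands = "{:,}".format(1234567)      # comma as a thousands separator
percent = "{:.1%}".format(0.256)        # multiply by 100 and add a percent sign

print(four_places)  # 3.1416
print(two_places)   # 3.14
print(padded)       #       3.14|
print(thousands)    # 1,234,567
print(percent)      # 25.6%
```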

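There are a lot of different combinations, and the best way to learn them is by googling and experimenting yourself. The formatting options for **str.format()** are listed in the format specification section of the Python documentation (first link below); the second link covers the older printf-style (%) formatting.

https://docs.python.org/3/library/string.html#formatspec

https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting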

Let's add this to our program so it looks like:

In [18]:
import csv
import string
import time

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().replace(" ", "").strip("\n").split(',')
        for entry in line:
            if ":" in entry:
                data = entry.split(':')
                if data[0].isalpha() and isNum(data[1]):
                    data_dict[data[0]] = float(data[1])
    if len(data_dict) != 0:
        return data_dict

def main():
    data_dicts = list()
    
    with open("../files/example_data.txt", "r") as fp:
        for line in fp:
            line = parse(line)
            if line is not None:
                data_dicts.append(line)
            
    with open("../files/example_parsed.csv", "w", newline="") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
        
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60 # floor division, always gives a whole number of minutes, rounded DOWN
    elapsed_sec = elapsed % 60 # modulo, gives us the remainder of elapsed / 60
    print("program ran in {} minutes and {:.4f} seconds".format(elapsed_min, elapsed_sec))

program ran in 0.0 minutes and 0.0010 seconds
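
Notice the minutes print as 0.0 rather than 0: floor division on a float still gives a float. If a whole number is preferred, one possible tweak (a sketch with a made-up elapsed value, not a change to the program above) is to use the built-in **divmod()** and convert the minutes with **int()**:

```python
elapsed = 125.3456  # made-up number of seconds for illustration

# divmod returns the floor-division result and the remainder in one step
elapsed_min, elapsed_sec = divmod(elapsed, 60)

message = "program ran in {} minutes and {:.4f} seconds".format(int(elapsed_min), elapsed_sec)
print(message)  # program ran in 2 minutes and 5.3456 seconds
```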


### Command Line Arguments

There is one last problem with our code we should worry about fixing. Right now, the file we want to read in and the file we want to write out are hard coded in the middle of our program. If this were a longer file, they would be hard to find, so let's put some constants at the top of our program to make this easier.

Constants in Python are actually just variables, but we tell other programmers that we don't want the value to change by naming them in all capital letters.

In [None]:
import csv
import string
import time

# Constants
INPUT = "../files/example_data.txt"
OUTPUT = "../files/example_parsed.csv"

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().replace(" ", "").strip("\n").split(',')
        for entry in line:
            if ":" in entry:
                data = entry.split(':')
                if data[0].isalpha() and isNum(data[1]):
                    data_dict[data[0]] = float(data[1])
    if len(data_dict) != 0:
        return data_dict

def main():
    data_dicts = list()
    
    with open(INPUT, "r") as fp: # replace the file path with INPUT
        for line in fp:
            line = parse(line)
            if line is not None:
                data_dicts.append(line)

            
    with open(OUTPUT, "w", newline="") as fp: # replace the file path with OUTPUT
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60
    elapsed_sec = elapsed % 60
    print("program ran in {} minutes and {:.4f} seconds".format(elapsed_min, elapsed_sec))

Now it's easy to change the values in our code! If we only intend to use this on the same file in the same relative location that's okay, but we still have to actually open the program and edit it should that change. We want this to be a bit more adaptable, which we can do using **command line arguments**. To use them we need the *sys* module.

In [None]:
import csv
import string
import sys # added sys in alphabetical order
import time

# Constants
INPUT = "../files/example_data.txt"
OUTPUT = "../files/example_parsed.csv"

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().replace(" ", "").strip("\n").split(',')
        for entry in line:
            if ":" in entry:
                data = entry.split(':')
                if data[0].isalpha() and isNum(data[1]):
                    data_dict[data[0]] = float(data[1])
    if len(data_dict) != 0:
        return data_dict

def main():
    data_dicts = list()
    
    with open(INPUT, "r") as fp:
        for line in fp:
            line = parse(line)
            if line is not None:
                data_dicts.append(line)

            
    with open(OUTPUT, "w", newline="") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60
    elapsed_sec = elapsed % 60
    print("program ran in {} minutes and {:.4f} seconds".format(elapsed_min, elapsed_sec))

Now is a good time to notice that the order I've been importing in is alphabetical. There is no rule that this needs to be done, but it makes the imports easier to read. In general I recommend:

- import ___ and import ___ as ___ statements first, sorted alphabetically
- from ___ import ___ statements after that

as the order to write your import statements (readability is important!).

At this point we can get arguments from the command line. Python expects commands in the form:

In [None]:
python program.py arg1 arg2 arg3 arg4

We can access the arguments through **sys.argv** as we would a list. Note that program.py is actually the argument at index 0, so we need to say **sys.argv\[1\]** to get arg1.
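
To make the indexing concrete without needing a real command line, here is a sketch that treats a plain list as if it were **sys.argv**. The function name split_arguments and the example list are our own, for illustration only.

```python
def split_arguments(argv):
    """Separate the script name (index 0) from the real arguments."""
    script_name = argv[0]
    arguments = argv[1:]
    return script_name, arguments

# What sys.argv would look like for: python program.py arg1 arg2 arg3 arg4
example_argv = ["program.py", "arg1", "arg2", "arg3", "arg4"]

script_name, arguments = split_arguments(example_argv)
print(script_name)      # program.py
print(example_argv[1])  # arg1
print(arguments)        # ['arg1', 'arg2', 'arg3', 'arg4']
```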

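In our program let's give the constants the value of arg1 for **INPUT** and arg2 for **OUTPUT**.

Note: this code will NOT run correctly in Jupyter, because Jupyter passes its own arguments to the kernel instead of ours (which is where the '-f' in the error below comes from).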

In [17]:
import csv
import string
import sys
import time

# Constants
INPUT = sys.argv[1] # assign the constants their values (sys.argv entries are already strings)
OUTPUT = sys.argv[2]

def isNum(num_string):
    try:
        float(num_string)
        return True
    except ValueError:
        return False

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().replace(" ", "").strip("\n").split(',')
        for entry in line:
            if ":" in entry:
                data = entry.split(':')
                if data[0].isalpha() and isNum(data[1]):
                    data_dict[data[0]] = float(data[1])
    if len(data_dict) != 0:
        return data_dict

def main():
    data_dicts = list()
    
    with open(INPUT, "r") as fp:
        for line in fp:
            line = parse(line)
            if line is not None:
                data_dicts.append(line)

            
    with open(OUTPUT, "w", newline="") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60
    elapsed_sec = elapsed % 60
    print("program ran in {} minutes and {:.4f} seconds".format(elapsed_min, elapsed_sec))

FileNotFoundError: [Errno 2] No such file or directory: '-f'
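
If the program is started with too few arguments, **sys.argv\[1\]** raises an IndexError instead of a helpful message. One possible guard is sketched below; the helper name get_paths and the usage text are assumptions of ours, not part of the workshop file.

```python
import sys

USAGE = "usage: python workshop_parser.py INPUT_FILE OUTPUT_FILE"

def get_paths(argv):
    """Return (input_path, output_path), or exit with a usage message."""
    if len(argv) != 3:  # script name + two arguments
        sys.exit(USAGE)
    return argv[1], argv[2]

# In the real program this would be called as get_paths(sys.argv)
input_path, output_path = get_paths(
    ["workshop_parser.py", "../files/example_data.txt", "../files/example_parsed.csv"]
)
print(input_path)   # ../files/example_data.txt
print(output_path)  # ../files/example_parsed.csv
```

Calling a helper like this from inside the **\_\_main\_\_** block, rather than reading **sys.argv** at the top of the file, would also keep the file importable from other programs.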

Now we finally have a proper, well-written program, so let's run it! In the command line we navigate to the folder where this file, workshop_parser.py, is saved, then type:

In [16]:
python workshop_parser.py ../files/example_data.txt ../files/example_parsed.csv


When it's done there should be a new file called example_parsed.csv in the "files" folder, example_data.txt should remain untouched, and the runtime should be printed to the console. Check to make sure everything is working as intended. The final version of this program is available above or in the file **workshop_parser_FINAL.py**.
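
One way to check the result is to read the CSV back in with **csv.DictReader**. The sketch below is self-contained: it writes a small file to a temporary folder first, using rows we made up for illustration, then reads it back.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the parsed data
rows = [
    {"gpa": 3.7, "credits": 15.0, "semesters": 1.0},
    {"gpa": 3.2, "credits": 30.0, "semesters": 2.0},
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example_parsed.csv")

    # Write the file the same way main() does
    with open(path, "w", newline="") as fp:
        writer = csv.DictWriter(fp, rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

    # Read it back: each row becomes a dictionary keyed by the header
    with open(path, "r", newline="") as fp:
        reader = csv.DictReader(fp)
        read_back = [dict(row) for row in reader]

print(len(read_back))  # 2
print(read_back[0])    # {'gpa': '3.7', 'credits': '15.0', 'semesters': '1.0'}
```

Note that values read from a CSV come back as strings, so they need converting with **float()** before doing any math on them.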

The first workshop covered the basics of Python. This workshop worked on taking data, cleaning it, and presenting it in a format we can easily work with later. Now that we have data in the format we want, in the next workshop we'll be focusing on performing statistical analysis of our data and visualizing it with numpy and plotly.