# Python Logic

In the first workshop, we covered the basics of Python. In this workshop, we will build on that knowledge and explore one of Python's strengths, manipulating data, by using the tools Python provides to build a simple file-parsing program. The file parser will read in a file, parse the data, write the cleaned data out as a CSV file, and print some statistics about the program's runtime to the console.

Before we get started, there are a few areas of Python to discuss.

### Packages

One of the biggest strengths of Python is the number of packages that have been written for it. **Packages** are tools that other people have written that can be used in our own code. Packages are very helpful because they let you write code more efficiently and concisely, since you can use functionality that is already written.

Packages are easy to install. If you're using Anaconda, most of the popular packages are already there for you to use. If you installed Python from the official site, you need to use **pip**, Python's official package manager, to install them. pip is run from the terminal (not from a Python file or the Python interactive prompt), as the message below shows.

In [1]:
pip install numpy


The following command must be run outside of the IPython shell:

    $ pip install numpy

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more informations on how to install packages:

    https://docs.python.org/3/installing/


Next, we need to tell Python which packages we want to use in our program. Depending on how much of a package we want to use, there are a few ways to do that.

In [2]:
import numpy # so we can use numpy in our program

import numpy as np # now we can refer to numpy as np 

from numpy import * # brings in every numpy function for us to use, or replace * with the specific function you want to import 
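The three import styles can be compared side by side. The sketch below uses the standard-library math module as a stand-in for numpy (an assumption made only so it runs without installing anything), and the function name hypotenuse is made up for the example.

```python
# math stands in for numpy here so the example runs on a fresh install.
import math                # use it as math.sqrt(...)
import math as m           # same module, shorter name
from math import sqrt      # bring in just one function


def hypotenuse(a, b):
    """Compute the same value three ways, one per import style."""
    first = math.sqrt(a * a + b * b)
    second = m.sqrt(a * a + b * b)
    third = sqrt(a * a + b * b)
    return first, second, third


print(hypotenuse(3, 4))  # (5.0, 5.0, 5.0)
```

Importing only the names you need (the third style) is usually easier to follow than `import *`, because a reader can see where each name came from.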

We can also use our own programs as packages. If the file is in the same folder, use **import filename** (without the .py). If it's in a different folder, Python needs to be told where to look for it. So, for example, hello.py in the same folder would be imported with the line below (the error shown is what you get when there is no hello.py in the folder):

In [3]:
import hello

ModuleNotFoundError: No module named 'hello'

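If it's in a different folder, it is tempting to spell out the location using '..' to go up a folder and / to go into a folder, for example to go up two folders and into a folder called packages. Python's import statement does not accept file paths, though, so the attempt below fails with a SyntaxError. The usual fix is to add that folder to **sys.path** (the list of folders Python searches for modules) and then use a plain **import hello**.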

In [4]:
import ../../packages/hello


SyntaxError: invalid syntax (<ipython-input-4-c1d6e7036059>, line 1)
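One way to import a file from another folder is to add that folder to sys.path first. The sketch below is only an illustration: it builds a throwaway packages folder with a made-up hello.py (the greet function is hypothetical) so that it can run anywhere.

```python
import sys
import tempfile
from pathlib import Path

# Build a throwaway "packages" folder containing a hypothetical hello.py.
packages_dir = Path(tempfile.mkdtemp()) / "packages"
packages_dir.mkdir()
(packages_dir / "hello.py").write_text(
    "def greet():\n    return 'hello from another folder'\n"
)

# Tell Python to search that folder too, then import by module name
# (no path and no .py in the import statement itself).
sys.path.insert(0, str(packages_dir))
import hello

print(hello.greet())
```

In a real project the folder would already exist, so only the sys.path line and the import would be needed.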

So, why are packages important? They reduce the complexity of the code you write by letting you use code that is already written, and they can make code easier to read. For example, NumPy is a popular package used for scientific computing.

### \_\_main\_\_

Next, we will go over the "main" function. When writing code, we want to organize it in a way that is readable and that makes sense. Python reads a file from top to bottom, so a function has to be defined before the line that actually calls it runs. It is good practice to group code into different functions based on what the code does. Then, we can define a function called main() that contains the code we actually want to run.

Many programming languages start running code in what's called a "main" function. Python has no required main function; it simply runs all code that is not inside a function definition.

This means Python runs code from the first line to the last, following standard logic flow, and we need to define a function before, or above, where we first use (or call) it.

So, when we import a package, Python takes the line **import package** and RUNS the file it finds.

If the file is just defining functions this is okay, but sometimes we want a file to do things IF AND ONLY IF it's the file we ran from the terminal with the command **python hello.py**.

To do this we use a special variable Python has called **\_\_name\_\_** (with two underscores before and after).
If it's the file we ran, the **\_\_name\_\_** variable will be **"\_\_main\_\_"**;
otherwise it will be the name of the imported module (e.g. **"hello"**).

This means we can write the code below (in hello.py)

In [5]:
def func1():
    pass # placeholder: does nothing yet

def func2():
    pass # placeholder: does nothing yet

def main():
    func1()
    func2()
    
if __name__ == "__main__":
    main()

What happens here is that Python sees **main()**, but because we're only defining the function it doesn't run it. If we used the terminal to run hello.py, Python then sees that the **\_\_name\_\_** variable is **"\_\_main\_\_"** and runs the code in **main()**. If we imported it with **import hello**, the **\_\_name\_\_** variable will be **"hello"** and the code in **main()** won't run.
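To see the difference between running a file and importing it, the sketch below writes a small throwaway module (the name name_demo and its contents are made up for this example), runs it once as a script with runpy.run_path, and loads it once as an import with importlib.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Hypothetical module contents: it records whether main() was called.
SOURCE = '''
ran_main = False

def main():
    global ran_main
    ran_main = True

if __name__ == "__main__":
    main()
'''


def compare_run_and_import():
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "name_demo.py"
        path.write_text(SOURCE)

        # Like "python name_demo.py": __name__ is "__main__", so main() runs.
        as_script = runpy.run_path(str(path), run_name="__main__")

        # Like "import name_demo": __name__ is the module name, so main() is skipped.
        spec = importlib.util.spec_from_file_location("name_demo", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

    return (
        (as_script["__name__"], as_script["ran_main"]),
        (module.__name__, module.ran_main),
    )


print(compare_run_and_import())
# (('__main__', True), ('name_demo', False))
```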

So let's put together some basic code that we'll need for this project:

In [6]:
def main():
    pass # placeholder: we'll fill this in as we go
    
if __name__ == "__main__":
    main()

Now that we have the skeleton of our program written, we can start writing the body of our code. We will import the packages we need, such as the timer (more on that later), as we go.
Again, it is important to understand the way logic flows in Python so that when you are writing code, you can think about how to organize your code for readability.


### Reading from files

Data can be stored in a number of different file formats, such as a text file or a CSV file. Python can open a file and extract information from it so that we are able to work with the data. This is called reading from a file. For our project, we will be reading from the example_data.txt file.

The **open()** function takes in two string parameters, a file and a mode (ex: "r" for 'read' or "w" for 'write'), and returns a file object.

In [1]:
fp = open("..files/example_data.txt","r") # opens example_data.txt in read mode

ASIDE: we can name variables almost anything we want, following the few rules that exist. **file** was the name of a built-in type in Python 2, so by convention we use something else. Popular variable names include **f** for "file" or **fp** for "file-pointer".

When we are done using our file we need to call **close()**. There are a lot of problems that can occur if we don't close a file, but they're a bit too technical for our purposes. For now, just remember to call **close()** when you're done with a file.

Once we have our file opened in read mode, we want to get data from our file, which we can do in a few ways:

In [5]:
data = fp.read() # this reads in the ENTIRE file and returns the contents as a string
# this works for our sample file, but will fill up our computer's memory for really big files

data = fp.read(100) # reads at most the next 100 characters of the file, returns a string

line = fp.readline() # this reads the next line of the file, returning a string
# this can be used if our file is large and we want to look at one line at a time

# if we know our file is short enough...
lines = fp.readlines() # reads in ALL lines, where each line is an entry in a list

In a file, each line ends with an invisible newline character (**\\n**). **readline()** reads one line at a time (as a string), stopping at each newline character.

Since we're only holding one line at a time, this reduces memory usage significantly.
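The difference between the reading methods is easiest to see on a tiny input. The sketch below uses io.StringIO, an in-memory object that behaves like an open text file, so it runs without example_data.txt; the sample text and the function name read_every_way are made up for illustration.

```python
import io

SAMPLE = "gpa: 3.2\ncredits: 135\nsemesters: 9\n"


def read_every_way(text):
    """Apply each reading method to a fresh file-like object."""
    results = {}
    results["read"] = io.StringIO(text).read()            # whole file as one string
    results["read_5"] = io.StringIO(text).read(5)         # only the first 5 characters
    results["readline"] = io.StringIO(text).readline()    # first line, newline included
    results["readlines"] = io.StringIO(text).readlines()  # list with one string per line
    return results


for name, value in read_every_way(SAMPLE).items():
    print(name, repr(value))
```

Printing with repr() makes the newline characters visible as \n instead of invisible line breaks.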

Let's edit our program. We'll pretend we're using a large file and need to worry about memory.

In [8]:
def main():
    fp = open("..files/example_data.txt", "r")
    line = fp.readline()
    while line: # readline() returns an empty string at the end of the file
        print(line)
        line = fp.readline()
    fp.close()
    
if __name__ == "__main__":
    main()

There's a more elegant solution using the **with** statement

In [12]:
def main():
    with open("example_data.txt", "r") as fp:
        for line in fp:
            print(line)
            
if __name__ == "__main__":
    main()

﻿# a python comment in a txt file!?

data: not-a number, gpa: 3.2, credits: 135, semesters: 9

credits:132, gpa : 3.8, semesters: 8

gpa : 4.0, credits: 73, filler, semesters: 3

gpa:2.9, filler, credits :56, semesters: 4

credits : 90, garbage, gpa: 3.7, semesters: 7

gpa:3.2, semesters: 7, credits :113,

! another random character?

5 + 137 = 142

gpa: 3.7, credits: 15, semesters: 1

and more noise

　



This does the same thing, but is more concise, easier to read, and will automatically close the file when we leave the with statement. It's the best way to read a file.
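The automatic close can be checked with the file object's closed attribute. This is a small sketch that writes its own throwaway file (the file name with_demo.txt is made up) rather than relying on example_data.txt.

```python
import tempfile
from pathlib import Path


def closed_inside_and_outside():
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "with_demo.txt"
        path.write_text("gpa: 3.7, credits: 15, semesters: 1\n")

        with open(path, "r") as fp:
            inside = fp.closed            # still open inside the with block
            lines = [line for line in fp]
        outside = fp.closed               # closed as soon as the block ends

    return inside, outside, lines


print(closed_inside_and_outside())
# (False, True, ['gpa: 3.7, credits: 15, semesters: 1\n'])
```

The file is closed even if an error happens inside the with block, which is the main reason to prefer it over calling close() by hand.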

### Parsing Strings

Now that we can access our file, we can start parsing and cleaning the contents in the file to grab only the data that we want to look at. Let's take a moment to try and understand how our data is formatted.

If you look at the file example_data.txt you'll notice that in general it's formatted as "word" : "number",

but there are a few lines spread throughout that don't follow this pattern. That's junk data that we're going to need to deal with as we look at lines.

*I will make a guarantee, so that we don't need to worry about it, that every valid row will have all the same words before colons (gpa, credits, semesters).*

Python has a lot of tools for us to work with, but we'll focus on just a few. If you want to read up on them, documentation is available at https://docs.python.org/3.7/library/stdtypes.html
and https://docs.python.org/3.7/library/string.html
Documentation is not something to memorize; it is used for reference.
Feel free to "control + f" and search for these in the documentation to learn more about them.

The functions/constants we'll be focusing on:

In [11]:
str.lower() # returns a copy of the string with all lowercase characters
str.strip() # returns a copy with whitespace (or other characters we ask for) removed from both ends
str.split() # splits the string on the characters we ask for, returns a list
str.isdigit() # returns True if str is non-empty and made of only digits
str.isalpha() # returns True if str is non-empty and made of only letters

string.ascii_letters # constant in the string module; a string of all ASCII letters
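Here is a short sketch of the string tools listed above applied to one made-up piece of text, so that the effect of each call is visible. The sample values are assumptions chosen to resemble the data file.

```python
import string

raw = "  Credits:132 \n"

cleaned = raw.lower().strip()       # "credits:132"
word, number = cleaned.split(":")   # "credits" and "132"

checks = {
    "word is letters": word.isalpha(),                            # True
    "number is digits": number.isdigit(),                         # True
    "decimal is digits": "3.8".isdigit(),                         # False: "." is not a digit
    "padded word is letters": " gpa".isalpha(),                   # False: the space is not a letter
    "starts with a letter": cleaned[0] in string.ascii_letters,   # True
}

print(cleaned, word, number)
print(checks)
```

Note that isdigit() and isalpha() are strict: a single space or decimal point anywhere in the string makes them return False.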

ASIDE: for more complex string manipulation there are "regular expressions". They are much more powerful for matching patterns in text, but they take practice to read and write, so they are beyond the scope of this workshop.

So, let's deal with the easiest lines first and remove lines that don't start with a letter or don't contain a colon:

In [None]:
if line[0] in string.ascii_letters:
    pass # the line starts with a letter, so we keep it

**string.ascii_letters** requires the string package, so let's import that too:

In [None]:
import string

Let's make sure our line has the general formatting of what we want:

In [None]:
if ":" in line:
    pass # the line contains a colon, so we keep it

So these statements will get rid of our junk lines. We can combine our two if statements as well, so let's do that as we modify our code.

In [13]:
import string

def main():
    with open("example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters and ':' in line:
                print(line)
            
if __name__ == "__main__":
    main()

data: not-a number, gpa: 3.2, credits: 135, semesters: 9

credits:132, gpa : 3.8, semesters: 8

gpa : 4.0, credits: 73, filler, semesters: 3

gpa:2.9, filler, credits :56, semesters: 4

credits : 90, garbage, gpa: 3.7, semesters: 7

gpa:3.2, semesters: 7, credits :113,

gpa: 3.7, credits: 15, semesters: 1



Now, let's get rid of whitespace and separate our string so we can get data out.

In [None]:
line = line.lower() # makes everything lowercase
line = line.strip() # removes whitespace from both ends of the line
line = line.split(',') # splits the line at the ,'s into a list

Let's add this to our program, in one line.

In [14]:
import string

def main():
    with open("example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters and ':' in line:
                line = line.lower().strip().split(',')
                print(line)
            
if __name__ == "__main__":
    main()

['data: not-a number', ' gpa: 3.2', ' credits: 135', ' semesters: 9']
['credits:132', ' gpa : 3.8', ' semesters: 8']
['gpa : 4.0', ' credits: 73', ' filler', ' semesters: 3']
['gpa:2.9', ' filler', ' credits :56', ' semesters: 4']
['credits : 90', ' garbage', ' gpa: 3.7', ' semesters: 7']
['gpa:3.2', ' semesters: 7', ' credits :113', '']
['gpa: 3.7', ' credits: 15', ' semesters: 1']


These functions are executed left to right. Looking at what each one does, this will run as long as .split(',') is at the end, because .lower() and .strip() return strings while .split() returns a list.

Also, because we assigned this new output to line, we can no longer get the original line back as it was in the file.

Currently line is a list, so \[entry_1, entry_2, ... , entry_n\]


Let's continue to parse this. We want to make sure every entry has data in the format we want, "word" : "number", which means our conditions are: ":" is in the entry, the left side is a word, and the right side is a number.

In [None]:
for entry in line:
    if ":" in entry:
        data = entry.split(':')
        if data[0].isalpha() and data[1].isdigit():
            print(data[0] + " " + data[1])


Let's add this to our program:

In [15]:
import string

def main():
    with open("example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters and ':' in line:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            print(data[0] + " " + data[1])
            
if __name__ == "__main__":
    main()

credits 132


But wait! We are checking for ":" in two places, and we don't need to do that.

Let's think about it: if ":" is in our line there's no guarantee that every entry will have one, but if an entry has a ":" then the line it came from must have one too. So the check on the whole line is redundant, and we can simplify our code by removing it.

In [16]:
import string

def main():
    with open("example_data.txt", "r") as fp:
        for line in fp:
            if line[0] in string.ascii_letters:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ":" in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            print(data[0] + " " + data[1])
            
if __name__ == "__main__":
    main()

credits 132
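Only one pair, credits 132, is printed, even though the file contains many more. Two things get in the way: most entries keep a space around the word or the number (' gpa'.isalpha() is False), and values such as 3.8 contain a decimal point ('3.8'.isdigit() is False). One possible way to handle both, offered as a sketch rather than as the workshop's own solution, is to strip each piece and try converting the value with float(). The helper name parse_entry is made up for this example.

```python
def parse_entry(entry):
    """Return (word, number_text) for entries like ' gpa : 3.8', else None."""
    if ":" not in entry:
        return None
    key, value = entry.split(":", 1)
    key = key.strip()      # drop spaces around the word
    value = value.strip()  # drop spaces around the number
    if not key.isalpha():
        return None
    try:
        float(value)       # accepts '132' and '3.8', rejects 'not-a number'
    except ValueError:
        return None
    return key, value


line = "gpa : 4.0, credits: 73, filler, semesters: 3\n"
pairs = [parse_entry(entry) for entry in line.lower().strip().split(",")]
print([pair for pair in pairs if pair is not None])
# [('gpa', '4.0'), ('credits', '73'), ('semesters', '3')]
```

The rest of this workshop keeps the simpler isalpha()/isdigit() checks, so its outputs match the cells shown here.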


We said we wanted to make a CSV of our data. Each piece of data is presented as a word and a number separated by ":", so the best way for us to store it in our program is with a dictionary, where data\[0\] is our key and data\[1\] is our value.

We're going to need one dictionary for every line as well, so let's make a list to store our dictionaries.

In [None]:
import string

def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ':' in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            data_dict[data[0]] = data[1]
                            
                data_dicts.append(data_dict)
            
if __name__ == "__main__":
    main()

So as we have seen, Python has many tools that help us parse and clean data so that we can format data into a readable format. There is plenty of documentation available online of the many tools in Python that you can use. 

### Writing to Files

After manipulating data and getting the data into the format that we want, we can now use Python to transfer this data into another file. This is called writing to a file.

Writing to files is similar to reading from them. First we need to open a file in write mode, so use the **open()** function, but with **"w"** instead of **"r"**.

If the file we tell Python to write to doesn't exist, it will create a new file. If it does exist, it will write over the old one, deleting all the data that was in it. If we don't want to clear the file and instead want to append to it, we can use append mode with **"a"** instead.
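The difference between write mode and append mode can be seen on a throwaway file. This is a minimal sketch; the file name modes_demo.txt and the function name are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_then_append():
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "modes_demo.txt"

        with open(path, "w") as fp:   # creates the file
            fp.write("first\n")
        with open(path, "w") as fp:   # "w" again: the old contents are erased
            fp.write("second\n")
        after_write = path.read_text()

        with open(path, "a") as fp:   # "a": new text goes after the old contents
            fp.write("third\n")
        after_append = path.read_text()

    return after_write, after_append


print(write_then_append())
# ('second\n', 'second\nthird\n')
```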

In [None]:
fp = open("..files/example_parsed.csv", "w")

Let's add this to our code, but use that same **with** keyword as before.
Since by this point we're done with our first open, we can use **fp** again.

In [None]:
import string

def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ':' in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            data_dict[data[0]] = data[1]
            if len(data_dict) != 0: # only keep lines that gave us some data
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        pass # placeholder: we'll write to the file here
            
if __name__ == "__main__":
    main()

We can use the **fp.write()** function to write to a file. **fp.write()** does not add newlines, but we can manually add them when we want with the newline character **\n**. For example:

In [None]:
name_var = "Andrew"
with open("hello.txt", "w") as fp:
    fp.write("hello, my name is " + name_var + "\n")

Let's think about what we need to do to write the data we've parsed from our file properly.
We have a list of dictionaries where each dictionary has three entries: one each for gpa, semesters, and credits.

We need to make one line for each dictionary in our list **data_dicts**. The first line, or the header of our file, should be the keys (gpa, semesters, credits). Then we need to traverse the values in each dictionary and write them to a line, with commas separating all of them and no comma after the last one. In short we want *"a_val, b_val, c_val, d_val, e_val, ... , n_val"*

Let's modify our program first to write the keys.

In [None]:
import string

def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ':' in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            data_dict[data[0]] = data[1]
            if len(data_dict) != 0: # only keep lines that gave us some data
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        for index, key in enumerate(data_dicts[0].keys()):
            if index < len(data_dicts[0]) - 1:        #determine if key is at the last index
                fp.write(key + ", ")
            else:
                fp.write(key + "\n")
            
if __name__ == "__main__":
    main()

We used the function **enumerate()**. It lets us traverse a list and get TWO variables at each step: the index of our current location in the list and the value at that location. This is equivalent to:

In [None]:
keys = list(data_dicts[0].keys()) # dictionary keys can't be indexed directly, so make a list
for index in range(len(keys)):
    print(index, keys[index])

Using **enumerate()** is much easier to read.
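The "comma after everything except the last item" pattern can be tried on its own with a plain list. This is a sketch with made-up sample keys; the function name join_with_commas is hypothetical.

```python
keys = ["gpa", "credits", "semesters"]


def join_with_commas(items):
    """Build one line of text: items separated by ', ' and ending with a newline."""
    pieces = []
    for index, item in enumerate(items):
        if index < len(items) - 1:   # not the last item yet
            pieces.append(item + ", ")
        else:                        # last item: end the line instead
            pieces.append(item + "\n")
    return "".join(pieces)


header = join_with_commas(keys)
print(header, end="")                      # gpa, credits, semesters
print(header == ", ".join(keys) + "\n")    # True
```

The last line shows that str.join() produces the same text in one step, which is a common shortcut once the loop version makes sense.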

We added "," between each word and a newline character "\n" at the end of the line as well. 

Now we want to go through the values of each dictionary and write them out separated by commas, one line per dictionary, so let's do that.

In [None]:
import string

def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ':' in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            data_dict[data[0]] = data[1]
            if len(data_dict) != 0: # only keep lines that gave us some data
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        for index, key in enumerate(data_dicts[0].keys()):
            if index < len(data_dicts[0]) - 1:
                fp.write(key + ", ")
            else:
                fp.write(key + "\n")
        for data_dict in data_dicts:
            for index, value in enumerate(data_dict.values()):
                if index < len(data_dict) - 1:
                    fp.write(value + ", ")
                else:
                    fp.write(value + "\n")
            
if __name__ == "__main__":
    main()

This is almost the same thing as what we did to write the keys, except we're looking at **dict.values()** and doing it for each dictionary in our list.

ASIDE: Since Python 3.7, dictionaries keep their key : value pairs in the order they were inserted, so calling data_dict.values() gives the values back in the same order as data_dict.keys(). Two dictionaries can still have different orders if their keys were inserted in a different order, so the columns only line up when every row was filled in the same order.

While writing like this works, we chose to write to a CSV. Because a proper CSV file has a guaranteed format for our data, it's easy to make a tool for reading and writing that we can use instead of rewriting it over and over again.

Python has already done this for us, so we don't even need to write the tool. We can just use it: it's in the preinstalled **csv** package.

In [None]:
import csv
import string

def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ':' in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            data_dict[data[0]] = data[1]
            if len(data_dict) != 0: # only keep lines that gave us some data
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys()) #parameters: file and fieldnames
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    main()

That's it! A lot of programming is wondering whether a language can do this or that and then doing a quick Google search, or a search through the documentation, to check and to learn how. You'll learn the full capabilities of a language by using and exploring it yourself, so practice, practice, practice! To see what these functions do, look here: https://docs.python.org/3.6/library/csv.html
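To see what csv.DictWriter produces without touching any files, the sketch below writes into io.StringIO, an in-memory stand-in for an open file. The sample rows are made up, and the second row deliberately lists its keys in a different order to show that DictWriter lines the columns up by field name.

```python
import csv
import io

rows = [
    {"gpa": "3.8", "credits": "132", "semesters": "8"},
    {"credits": "73", "gpa": "4.0", "semesters": "3"},  # same keys, different order
]


def rows_to_csv(rows):
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output easy to compare; the default is "\r\n".
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()     # writes the field names as the first line
    writer.writerows(rows)   # writes one line per dictionary
    return buffer.getvalue()


print(rows_to_csv(rows), end="")
# gpa,credits,semesters
# 3.8,132,8
# 4.0,73,3
```

When writing to a real file, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.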

At this point there's a small problem with our code, a problem I'll leave to you to fix. If we have a massive file we're going to use a massive amount of memory, since we're storing all of it in a list of dictionaries before we even begin to write.

Try rewriting the above code so that it writes to the file as it reads it in.
I've included one of my solutions in combined.py

### Cleaning Our Code

With the current state of our code, our **main()** function is doing a lot: it's reading in a file, parsing it, and writing to a file. So let's break the parsing part into a new function called **parse()**. We're going to want to pass in a line, and have it return a dictionary that we can then append to our **data_dicts** list.

In [None]:
import csv
import string

def parse(line):
    pass # placeholder: we'll fill this in next

def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = dict()
            if line[0] in string.ascii_letters:
                line = line.lower().strip().split(',')
                for entry in line:
                    if ':' in entry:
                        data = entry.split(':')
                        if data[0].isalpha() and data[1].isdigit():
                            data_dict[data[0]] = data[1]
            if len(data_dict) != 0: # only keep lines that gave us some data
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    main()

All the code to do this already exists: it starts at **data_dict = dict()**, and we add a return statement in place of the **data_dicts.append(data_dict)** line.

In [None]:
import csv
import string

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().strip().split(',')
        for entry in line:
            if ':' in entry:
                data = entry.split(':')
                if data[0].isalpha() and data[1].isdigit():
                    data_dict[data[0]] = data[1]
    if len(data_dict) != 0: # data_dicts does not exist inside parse(), so check data_dict
        return data_dict
    else:
        return None
        

def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
        
            data_dict = parse(line)
            if data_dict is not None: # parse() returns None for junk lines
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    main()

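We can compress this a bit further without making it any more difficult to read: a function that reaches its end without a return statement returns None automatically, so the else branch in **parse()** can go.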

In [None]:
import csv
import string

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().strip().split(',')
        for entry in line:
            if ':' in entry:
                data = entry.split(':')
                if data[0].isalpha() and data[1].isdigit():
                    data_dict[data[0]] = data[1]
    if len(data_dict) != 0: # otherwise we fall off the end and return None
        return data_dict
        
def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = parse(line)
            if data_dict: # parse() returns None for junk lines, so skip those
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    main()

This is a bit more manageable to read. Our main function reads in a file and writes out to a file.
It also calls a parse function to parse our line. We don't need to care how the parsing works as long as what it returns is in the format we expect (a dictionary).

A bonus of this is that if we want to use the same parsing function in a different program, we can just import the file and call **parse()**, and because the **\_\_name\_\_** of the file won't be **\_\_main\_\_**, the function **main()** won't run!

### Timers

Sometimes it's useful to know how long a program or function takes to run, so let's quickly put together a timer.

To do this we need to use the **time()** function in the time package.

**time()** returns the time in seconds (as a decimal) since the epoch, which is 00:00 UTC on Thursday, January 1st, 1970 on most systems, so we can do basic arithmetic to determine the number of seconds, minutes, hours, or days elapsed between any two points in time.

ASIDE: because values in programming take up memory, they eventually hit a maximum value. On systems that store the number of seconds since the epoch as a 32-bit signed integer, that maximum will be reached in January 2038, and software on those systems has to be changed to adjust for this. This is known as the year 2038 problem.

So to create a basic timer:

In [None]:
import time

start = time.time()
... # our code
end = time.time()
elapsed = end - start
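The same start/end pattern can be wrapped in a small helper so it can time any function. This is a sketch; the name time_call is made up, and the built-in sum is used only as something to time.

```python
import time


def time_call(func, *args):
    """Run func(*args) and return its result along with the seconds it took."""
    start = time.time()
    result = func(*args)
    elapsed = time.time() - start
    return result, elapsed


total, elapsed = time_call(sum, range(1_000_000))
print("sum was", total, "and took", elapsed, "seconds")
```

The standard library also offers time.perf_counter(), which is intended for measuring short durations; time.time() is kept here to match the workshop.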

Let's add the timer to our program to measure the time it takes to run the code *if and only if* we run the file directly, that is, where **\_\_name\_\_** is **\_\_main\_\_**.

In [None]:
import csv
import string
import time

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().strip().split(',')
        for entry in line:
            if ':' in entry:
                data = entry.split(':')
                if data[0].isalpha() and data[1].isdigit():
                    data_dict[data[0]] = data[1]
    if len(data_dict) != 0: # otherwise we fall off the end and return None
        return data_dict
        
def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = parse(line)
            if data_dict: # parse() returns None for junk lines, so skip those
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start

### String Formatting

Let's print this out to the terminal, in the format "program ran in \_\_\_\_\_ minutes and \_\_\_\_\_ seconds".
We can do this with:

In [None]:
elapsed_min = elapsed // 60
elapsed_sec = elapsed % 60
print("program ran in " + str(elapsed_min) + " minutes and " + str(elapsed_sec) + " seconds")

There's a slightly cleaner way to say the same thing, using **str.format()**:

In [None]:
print("program ran in {} minutes and {} seconds".format(elapsed_min, elapsed_sec))

This is a way to format strings in Python 3, and it's much easier to read than before. It automatically converts our values to strings as well! We can compress this further with something called *f-strings*, but for now let's focus on readability instead of brevity.

Each **{}** will be replaced with what we pass in to **str.format()**, in the order we pass it in, so **elapsed_min** fills in the first **{}** and **elapsed_sec** fills in the second **{}**.

Let's make one change so that the number of seconds gets printed to 3 decimal places.

In [None]:
print("program ran in {} minutes and {:.3f} seconds".format(elapsed_min, elapsed_sec))

The .3f means "format the number as a floating point number, rounded to three places after the decimal".

There are a lot of different combinations, and the best way to learn them is by googling and experimenting yourself. For a full list of the various formatting options for strings, I've included the link below.

https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
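The formatting can be tried on a fixed number of seconds without timing anything. In this sketch the helper name format_elapsed is made up; it uses the built-in divmod(), which returns the whole-number quotient and the remainder in one call, and int() so the minutes print as 2 rather than 2.0.

```python
def format_elapsed(elapsed):
    """Turn a number of seconds into the workshop's 'minutes and seconds' message."""
    minutes, seconds = divmod(elapsed, 60)  # same as elapsed // 60 and elapsed % 60
    return "program ran in {} minutes and {:.3f} seconds".format(int(minutes), seconds)


print(format_elapsed(125.5))
# program ran in 2 minutes and 5.500 seconds
```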

Let's add this to our program so it looks like:

In [None]:
import csv
import string
import time

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().strip().split(',')
        for entry in line:
            if ':' in entry:
                data = entry.split(':')
                if data[0].isalpha() and data[1].isdigit():
                    data_dict[data[0]] = data[1]
    if len(data_dict) != 0: # otherwise we fall off the end and return None
        return data_dict
        
def main():
    data_dicts = list()
    
    with open("..files/example_data.txt", "r") as fp:
        for line in fp:
            data_dict = parse(line)
            if data_dict: # parse() returns None for junk lines, so skip those
                data_dicts.append(data_dict)
            
    with open("..files/example_parsed.csv", "w") as fp:
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60 # floor division, always gives a whole number, rounded DOWN
    elapsed_sec = elapsed % 60 # modulus, gives us the remainder of elapsed / 60
    print("program ran in {} minutes and {:.3f} seconds".format(elapsed_min, elapsed_sec))
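
The split into minutes and seconds can be checked by hand with a made-up runtime. This is only a sketch: `divmod()` is a built-in that returns both results at once, and the `int()` call is an optional extra so that the minutes print as `2` rather than `2.0`.

```python
elapsed = 135.25  # pretend the program ran for 135.25 seconds

elapsed_min = elapsed // 60  # floor division of a float still gives a float
elapsed_sec = elapsed % 60   # what is left over after removing whole minutes

# divmod() computes the quotient and the remainder in one call.
minutes, seconds = divmod(elapsed, 60)

print(elapsed_min, elapsed_sec)  # 2.0 15.25
print("program ran in {} minutes and {:.3f} seconds".format(int(minutes), seconds))
# program ran in 2 minutes and 15.250 seconds
```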

### Command Line Arguments

There is one last problem with our code that we should fix. Right now, the file we want to read in and the file we want to write out are hard-coded in the middle of our program. If this were a longer file, they would be hard to find, so let's put some constants at the top of our program to make this easier.

Constants in Python are actually just variables, but we signal to other programmers that we don't want the value to change by naming them in all capital letters.

In [None]:
import csv
import string
import time

# Constants
INPUT = "..files/example_data.txt"
OUTPUT = "..files/example_parsed.csv"

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().strip().split(',')
        for entry in line:
            if ':' in entry:
                data = entry.split(':')
                if data[0].isalpha() and data[1].isdigit():
                    data_dict[data[0]] = data[1]
    if len(data_dict) != 0:
        return data_dict
        
def main():
    data_dicts = list()
    
    with open(INPUT, "r") as fp: # and we replace the string with our variable
        for line in fp:
            parsed = parse(line)
            if parsed: # skip lines where parse() found no usable data
                data_dicts.append(parsed)
            
    with open(OUTPUT, "w", newline="") as fp: # here too (newline="" is what the csv module recommends)
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60
    elapsed_sec = elapsed % 60
    print("program ran in {} minutes and {:.3f} seconds".format(elapsed_min, elapsed_sec))

Now it's easy to change the values in our code! If we only intend to use this on the same file in the same relative location, that's okay, but we still have to open the program and edit it should that change. We want this to be a bit more adaptable, so we will use **command line arguments**. To use them we need the *sys* module.

In [None]:
import csv
import string
import sys
import time

# Constants
INPUT = "..files/example_data.txt"
OUTPUT = "..files/example_parsed.csv"

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().strip().split(',')
        for entry in line:
            if ':' in entry:
                data = entry.split(':')
                if data[0].isalpha() and data[1].isdigit():
                    data_dict[data[0]] = data[1]
    if len(data_dict) != 0:
        return data_dict
        
def main():
    data_dicts = list()
    
    with open(INPUT, "r") as fp:
        for line in fp:
            parsed = parse(line)
            if parsed: # skip lines where parse() found no usable data
                data_dicts.append(parsed)
            
    with open(OUTPUT, "w", newline="") as fp: # newline="" is what the csv module recommends
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60
    elapsed_sec = elapsed % 60
    print("program ran in {} minutes and {:.3f} seconds".format(elapsed_min, elapsed_sec))

Now is a good time to notice that I've been writing the imports in alphabetical order. There is no rule that says this must be done, but it makes them easier to read. In general I recommend:

- import ___ and import ___ as ___, where ___ is sorted alphabetically
- from ___ import ___

as the order to write your import statements. (Readability is important!)
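
Applied to a real file, that ordering might look like the sketch below. The two `from` imports (`Counter` and `Path`) are not used by our parser; they are assumptions added here only to show where that kind of import would go.

```python
# Plain imports first, sorted alphabetically by module name.
import csv
import string
import sys
import time

# "from ... import ..." statements afterwards (hypothetical, for illustration).
from collections import Counter
from pathlib import Path

# A quick use of each, just to show the names are available.
print(Counter("hello")["l"])           # 2
print(Path("example_data.txt").suffix) # .txt
```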

At this point we can get arguments from the command line. Python will expect commands in the form:

In [None]:
python program.py arg1 arg2 arg3 arg4

We can access the arguments through **sys.argv** as we would the items of a list. Note that program.py is actually the argument at index 0, so we need to say **sys.argv\[1\]** to get arg1.
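
To see the indexing without opening a terminal, the sketch below passes a hand-written list that imitates what **sys.argv** would hold for the command above. The helper name `describe_args` is made up for this example.

```python
import sys

def describe_args(argv):
    """Return the script name and the list of arguments that follow it."""
    script_name = argv[0]  # index 0 is the program itself
    arguments = argv[1:]   # everything after it is what the user typed
    return script_name, arguments

# Imitates: python program.py arg1 arg2 arg3 arg4
name, args = describe_args(["program.py", "arg1", "arg2", "arg3", "arg4"])
print(name)     # program.py
print(args[0])  # arg1

# In a real script the list comes from sys.argv instead.
print("this process has", len(sys.argv), "entries in sys.argv")
```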

In our program, let's give the constants the value of arg1 for **INPUT** and arg2 for **OUTPUT**.

In [None]:
import csv
import string
import sys
import time

# Constants
INPUT = sys.argv[1] # assign the constants their values (command line arguments are already strings)
OUTPUT = sys.argv[2]

def parse(line):
    data_dict = dict()
    if line[0] in string.ascii_letters:
        line = line.lower().strip().split(',')
        for entry in line:
            if ':' in entry:
                data = entry.split(':')
                if data[0].isalpha() and data[1].isdigit():
                    data_dict[data[0]] = data[1]
    if len(data_dict) != 0:
        return data_dict
        
def main():
    data_dicts = list()
    
    with open(INPUT, "r") as fp:
        for line in fp:
            parsed = parse(line)
            if parsed: # skip lines where parse() found no usable data
                data_dicts.append(parsed)
            
    with open(OUTPUT, "w", newline="") as fp: # newline="" is what the csv module recommends
        writer = csv.DictWriter(fp, data_dicts[0].keys())
        writer.writeheader()
        writer.writerows(data_dicts)
            
if __name__ == "__main__":
    start = time.time()
    main()
    elapsed = time.time() - start
    elapsed_min = elapsed // 60
    elapsed_sec = elapsed % 60
    print("program ran in {} minutes and {:.3f} seconds".format(elapsed_min, elapsed_sec))

Now we finally have a proper, well-written program, so let's run it! In the command line, navigate to the folder where this file, workshop_parser.py, is saved, then type:

In [None]:
python workshop_parser.py ..files/example_data.txt ..files/example_parsed.csv

When it's done, there should be a new file called example_parsed.csv in the "files" folder, example_data.txt should remain untouched, and the runtime should be printed to the console. Check to make sure everything is working as intended. The final version of this program is available above or in the file **workshop_parser_FINAL.py**.
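
One way to check the output is to read the CSV back and compare it with what was written. The sketch below does this with an in-memory buffer standing in for the real file, and with made-up rows in the same shape that `parse()` returns, so it can be run without the workshop's data files.

```python
import csv
import io

# Made-up rows: each dict maps a column name to a value, as parse() produces.
rows = [{"a": "1", "b": "2"}, {"a": "3", "b": "4"}]

buffer = io.StringIO()  # stands in for the output file
writer = csv.DictWriter(buffer, rows[0].keys())
writer.writeheader()
writer.writerows(rows)

# Rewind and read everything back as a list of dicts.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

print(read_back == rows)  # True
```

With the real file, the same comparison would use `open()` on the output path in place of the buffer.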

ASIDE: There are still a few quality-of-life improvements we can add to the file, such as making sure the arguments passed to our program follow an expected format and adding some documentation. None of these are necessary if we plan to keep the file small and for personal use only, so they are outside the scope of this workshop. I've included an extra file with the additions I've discussed called **workshop_parser_FINAL_ASIDE.py**. If you wish to see these changes, take a look at that file.
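
As a rough idea of what checking the arguments could look like, here is a minimal sketch. It is not the contents of the ASIDE file; the function name `read_paths` and the usage message are assumptions made for this example.

```python
import sys

def read_paths(argv):
    """Return (input_path, output_path), or exit with a usage message."""
    if len(argv) != 3:  # the script name plus exactly two arguments
        sys.exit("usage: python workshop_parser.py INPUT OUTPUT")
    return argv[1], argv[2]

# Imitates a correct call; a real script would pass sys.argv instead.
input_path, output_path = read_paths(["workshop_parser.py", "data.txt", "parsed.csv"])
print(input_path, output_path)  # data.txt parsed.csv
```

Calling `sys.exit()` with a string prints that message to the console and stops the program, which is friendlier than the `IndexError` we would otherwise get from a missing argument.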

The first workshop covered the basics of Python. This workshop focused on taking data, cleaning it, and presenting it in a format we can easily work with later. Now that we have data in the format we want, the next workshop will focus on performing statistical analysis of our data and visualizing it with numpy and plotly.