# Programming For Analytics
##### © Pratik Agrawal, 2016

## Functions, File I/O, Data Frames
- Functions
 - Structure
 - Docstring (Document String or Help Text)
 - Arguments
 - Exercises!
- File I/O
 - `glob()`
 - Reading Files
   - `open()`/`file()`
   - `read()`
   - `readline()`/`readlines()`
   - `seek()`
   - `close()`
 - Writing Files
   - `write()`
   - `writelines()`
 - Closing Files
   - `try`/`finally`
   - `with`
 - Exercises!
- DataFrames! 
 - `pandas`
   - `read_csv()`
   - `head()`
   - `tail()`
   - `describe()`
   - Reading specific columns
   - Creating new columns
   - `unique()`
   - `nunique()`
   - `value_counts()`
 - Exercises!
 
 
## Functions
### 1. Structure
A function in Python is defined using the keyword `def`, followed by a function name, a signature within parentheses `()`, and a colon `:`.

In [None]:
def func_0(arg_0, arg_1):
    """func_0 checks for equivalence of the two arguments
       arguments-
       arg_0: first value to compare
       arg_1: second value to compare
       return-
       boolean: True or False based on comparison
    """
    return (arg_0==arg_1)

In [None]:
func_0("a","a")

### 2. Docstring 
A docstring (documentation string) describes what the function does. Python stores it with the function, so a user can display it by calling the built-in `help()` function on the function.

In [None]:
def func_0(arg_0, arg_1):
    """func_0 checks for equivalence of the two arguments
       arguments-
       arg_0: first value to compare
       arg_1: second value to compare
       return-
       boolean: True or False based on comparison
    """
    return (arg_0==arg_1)

In [None]:
help(func_0)
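The text shown by `help()` comes from the function's `__doc__` attribute. A minimal sketch, using a hypothetical function `is_equal` with a one-line docstring rather than the longer one above:

```python
def is_equal(arg_0, arg_1):
    """Return True if the two arguments compare equal."""
    return arg_0 == arg_1

# The docstring is stored on the function object itself
doc_text = is_equal.__doc__
print(doc_text)
```

Because the docstring travels with the function, tools such as `help()` and most editors can show it without opening the source file.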

### 3. Arguments
A function can accept any number of arguments, from zero upward, and use them to perform its operation.
Note: If your function does not use an argument, consider removing it from the function definition.

- With one argument

In [None]:
def func_0(arg_0):
    """func_0 prints the argument passed within
       arguments-
       arg_0: value to be printed
       return-
       None
    """
    print(arg_0)

- With multiple arguments (`*args` collects any number of positional arguments into a tuple)

In [None]:
def sum_all_values(*args):
    """Sum all values passed to the function
       arguments-
       args: comma separated values that need to be passed in to the function
       return-
       Sum of all values passed in to function
    """
    return sum(args)

In [None]:
sum_all_values(1)

In [None]:
sum_all_values(1,2,3)

- With compound data types such as dictionaries

In [None]:
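# Assumption: LogisticRegression is scikit-learn's class, which takes
# C (inverse regularization strength) rather than alpha
from sklearn.linear_model import LogisticRegression

def instantiate_model(params):
    """Function to instantiate a model
       arguments-
       params: dictionary of parameters required to instantiate model
       return-
       instantiated model object
    """
    lr = LogisticRegression(C=params["C"], tol=params["tol"])
    return lr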
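The forms above can be combined with default values and keyword arguments. A small sketch, where `scale_values` and its parameters are hypothetical names chosen for illustration:

```python
def scale_values(values, factor=2.0, offset=0.0):
    """Multiply each value by factor, then add offset."""
    return [value * factor + offset for value in values]

doubled = scale_values([1, 2, 3])                # both defaults are used
shifted = scale_values([1, 2, 3], offset=10.0)   # override one parameter by name

# A dictionary of parameters can be unpacked into keyword arguments with **
params = {"factor": 0.5, "offset": 1.0}
unpacked = scale_values([2, 4], **params)
```

Unpacking with `**params` is an alternative to indexing the dictionary inside the function; it only works when the dictionary keys match the parameter names.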

### 4. Exercises
The shortest distance between two points on the globe, assuming it is perfectly spherical, is the length of the  _great circle_ path.  If you are given two locations in latitude and longitude, then the [Haversine Formula](http://en.wikipedia.org/wiki/Haversine_formula) gives this shortest distance in a numerically stable way:

$a = \sin^{2}(\Delta \phi/2) + \cos(\phi_{1})\cos(\phi_{2})\sin^{2}(\Delta \lambda/2)$

$c = 2 \arcsin(\sqrt{a})$

$d = rc$

Where $\phi_{i}$ is latitude and $\lambda_{i}$ is the longitude of point $i$ and $r$ is the radius of the globe.

#### Q1

Write a function that takes as inputs a `radius` and two points specified by tuples of `(latitude, longitude)` and returns the distance between the points along a great circle.

NOTE: Remember to convert your angles to radians!!

In [None]:
from math import sin, cos, asin, radians    

In [None]:
# test your answer: should be roughly 7895 km
r_earth = 6371.0 # km
austin = (30.2500, -97.7500)
cambridge = (52.2050, 0.1190)

print(round(haversine_formula(r_earth, austin, cambridge)))

<br>
<br>
<br>
<br>
#### Solution

In [None]:
def haversine_formula(r, p1, p2):
    # unpack points
    phi_1, lambda_1 = p1
    phi_2, lambda_2 = p2
    
    # convert to radians
    phi_1 = radians(phi_1)
    phi_2 = radians(phi_2)
    lambda_1 = radians(lambda_1)
    lambda_2 = radians(lambda_2)
    
    # compute haversine formula
    a = sin((phi_1-phi_2)/2)**2 + cos(phi_1)*cos(phi_2)*sin((lambda_1-lambda_2)/2.0)**2
    c = 2*asin(a**0.5)
    return r*c
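One way to sanity-check a haversine implementation is to use a pair of points whose distance is known in closed form. The sketch below restates the formula compactly (using `sqrt` instead of `**0.5`) and checks that the distance from the equator to the North Pole along a meridian is a quarter of the circumference, $\pi r/2$. The function name `great_circle_km` is an illustrative assumption.

```python
from math import sin, cos, asin, sqrt, radians, pi

def great_circle_km(r, p1, p2):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    phi_1, lambda_1 = [radians(angle) for angle in p1]
    phi_2, lambda_2 = [radians(angle) for angle in p2]
    a = (sin((phi_2 - phi_1) / 2.0) ** 2
         + cos(phi_1) * cos(phi_2) * sin((lambda_2 - lambda_1) / 2.0) ** 2)
    return r * 2 * asin(sqrt(a))

r_earth = 6371.0  # km
equator_to_pole = great_circle_km(r_earth, (0.0, 0.0), (90.0, 0.0))
quarter_circumference = pi * r_earth / 2.0
print(abs(equator_to_pole - quarter_circumference) < 1e-6)
```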

#### Q2

Given a dictionary of cities and their latitudes and longitudes:

In [None]:
cities = {
    'Atlanta': (33.7569444444, -84.3902777778),
    'Austin': (30.3, -97.7333333333),
    'Boston': (42.3577777778, -71.0616666667),
    'Chicago': (41.9, -87.65),
    'Dallas': (32.7825, -96.7975),
    'Denver': (39.7391666667, -104.984722222),
    'Houston': (29.7627777778, -95.3830555556),
    'Los Angeles': (34.05, -118.25),
    'Miami': (25.7833333333, -80.2166666667),
    'New York': (40.67, -73.94),
    'San Francisco': (37.7666666667, -122.433333333),
    'Seattle': (47.6, -122.316666667),
}

Write a function that, given a dictionary of city names and locations, plus the names of two cities, returns the great circle distance between the two cities.  The result should be rounded to the nearest 10 km.

    def city_distance(cities, city_1, city_2):
        ...

HINT: The built-in `round()` function takes an optional second argument for the number of digits of precision.  This argument can be negative.
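For instance, a negative second argument rounds to tens, hundreds, and so on. The value below is an arbitrary example:

```python
distance = 1318.7  # km, arbitrary example value

nearest_km = round(distance)        # no second argument: nearest whole number
nearest_10 = round(distance, -1)    # -1: nearest 10
nearest_100 = round(distance, -2)   # -2: nearest 100

print(nearest_km, nearest_10, nearest_100)
```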

<br>
<br>
<br>
<br>
#### Solution

In [None]:
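def city_distance(cities, city_1, city_2):
    r_earth = 6371.0 # km
    # look up each city's (latitude, longitude) by name
    distance = haversine_formula(r_earth, cities[city_1], cities[city_2])
    # round to the nearest 10 km
    return round(distance, -1)

city_distance(cities, "Atlanta", "Austin")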

#### Q3

Write a function that, given a dictionary of cities, returns a dictionary whose keys are pairs of cities and whose values are the distances between them.  You should use an appropriate data structure for the keys, and round distances to the nearest 10 km.

    def compute_distances(cities):
        ...

Bonus points if you compute the distance for a given pair of cities only once.

<br>
<br>
<br>
<br>
#### Solution

In [None]:
def compute_distances(cities):
    r_earth = 6371.0 # km
    
    distances = {}
    # create a set so we can track visited cities
    # tracking this is optional, but otherwise we will double the computations
    unvisited = set(cities)
    for city_1 in cities:
        # remove city_1 from cities to visit
        unvisited.remove(city_1)
        
        location_1 = cities[city_1]
        for city_2 in unvisited:
            location_2 = cities[city_2]
            distance = haversine_formula(r_earth, location_1, location_2)
            distances[frozenset((city_1, city_2))] = round(distance, -1)
            
    return distances

In [None]:
compute_distances(cities)
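Because a `frozenset` ignores order, either ordering of a city pair finds the same entry. A short sketch of looking up a result, using a hand-written dictionary with a made-up distance in place of the output of `compute_distances()`:

```python
# Made-up value for illustration; real values come from compute_distances(cities)
distances = {frozenset(("Atlanta", "Austin")): 1320.0}

forward = distances[frozenset(("Atlanta", "Austin"))]
backward = distances[frozenset(("Austin", "Atlanta"))]

# Two frozensets with the same members are equal regardless of order,
# whereas the tuples ("Atlanta", "Austin") and ("Austin", "Atlanta") are not
same_key = frozenset(("Atlanta", "Austin")) == frozenset(("Austin", "Atlanta"))
```

A plain `set` could not be used here, because dictionary keys must be hashable and only the frozen variant is.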

## File I/O
Most of the time you will be working with data stored in files, so we need ways both to discover files and to read and write them.
### 1. `glob`
The `glob` module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. No tilde expansion is done, but \*, ?, and character ranges expressed with [] will be correctly matched.

In [None]:
import glob

#### `glob()`
The module's main function, `glob.glob()`, returns a list of the paths that match a pattern. In the example below, we pass only the directory name. Execute the code snippet to see what the output is.

In [None]:
complete_list_of_files = glob.glob("../programming-for-analytics-course-material/data/")
complete_list_of_files

Now let's add a `*` at the end of the directory path.

In [None]:
complete_list_of_files = glob.glob("../programming-for-analytics-course-material/data/*")
complete_list_of_files

As seen above, `*` is a wildcard: it tells `glob()` that you are interested in everything under `../programming-for-analytics-course-material/data`.

If we are only interested in a particular set of files, we can narrow the pattern:

In [None]:
files_msft = glob.glob("../programming-for-analytics-course-material/data/msft*.csv")
files_msft

#### or

In [None]:
files_aapl = glob.glob("../programming-for-analytics-course-material/data/aapl*")
files_aapl

#### or

In [None]:
files_2014 = glob.glob("../programming-for-analytics-course-material/data/*14*14*")
files_2014
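The three kinds of pattern mentioned earlier (`*`, `?`, and `[]` ranges) can be tried safely in a throwaway directory. This is a sketch under assumptions: the file names are hypothetical, and `match` is a helper written only for this example.

```python
import glob
import os
import tempfile

# Build a temporary directory with hypothetical file names
demo_dir = tempfile.mkdtemp()
for name in ["aapl-2014.csv", "aapl-2015.csv", "msft-2014.csv", "notes.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

def match(pattern):
    """Return the sorted base names of files in demo_dir matching pattern."""
    paths = glob.glob(os.path.join(demo_dir, pattern))
    return sorted(os.path.basename(path) for path in paths)

all_csv = match("*.csv")              # * matches any run of characters
one_char = match("aapl-201?.csv")     # ? matches exactly one character
char_range = match("*-201[5-9].csv")  # [] matches one character from a range
```

`glob.glob()` makes no promise about the order of its results, which is why the helper sorts them.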

### 2. Reading Files

Let's say we have a file 'rcs.txt' which contains data in text format like this:

    #freq (MHz)     vv (dB)     hh (dB)
      100          -20.3       -31.2
      200          -22.7       -33.6

We'd like to get the data into a list of lists of floating point numbers in
Python:

    [[100.0, -20.3, -31.2],
     [200.0, -22.7, -33.6]]

We can open the file with the built-in `open()` function or, in Python 2 only, with the `file` type (`file()` was removed in Python 3):

#### `open()`/`file()`
Both return a file object, and the mode argument is optional.

In [None]:
file_in = open('../programming-for-analytics-course-material/data/rcs.txt')

#### or

In [None]:
file_in = file('../programming-for-analytics-course-material/data/rcs.txt')

#### or

In [None]:
file_in = open('../programming-for-analytics-course-material/data/rcs.txt','r')

Here `'r'` specifies read-only mode, which is also the default when no mode is given.

#### `read()`
You can read in the contents as a string using the `read()` method of the file object

In [None]:
text = file_in.read()

In [None]:
print(text)

In [None]:
type(text)

#### `readline()/readlines()`
These methods read either a single line (`readline()`) or all remaining lines as a list (`readlines()`).
<br>Let's open the file again.

In [None]:
file_in = open('../programming-for-analytics-course-material/data/rcs.txt','r')

In [None]:
file_in.readline()

Call `readline()` again

In [None]:
file_in.readline()

One more time-

In [None]:
file_in.readline()

And now the last time!

In [None]:
file_in.readline()

As the calls above show, `readline()` returns the next line of the file each time it is called. Once the end of the file is reached it returns an empty string; it does not go back to the start of the file.
<br><br> We also had to call `readline()` once per line, so any application that needs every line of a file would have to wrap the call in a loop.
<br>`readlines()` solves this problem.

In [None]:
file_in = open('../programming-for-analytics-course-material/data/rcs.txt','r')

In [None]:
file_in.readlines()

In [None]:
file_in.readlines()

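From the above we see that `readlines()` returns a list of lines, where each line in the file is an item in the list. The second call returned an empty list because the file position was already at the end of the file.

<br>We did make one mistake: after reading all the lines into a list, we never stored the list anywhere.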

In [None]:
file_in = open('../programming-for-analytics-course-material/data/rcs.txt','r')

In [None]:
all_lines = file_in.readlines()

In [None]:
all_lines
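With the lines in a list, the list of lists of floats described at the start of this section is a short loop away. A sketch that uses a hard-coded copy of the sample lines, so it runs without the file:

```python
# Stand-in for all_lines = file_in.readlines()
all_lines = [
    "#freq (MHz)     vv (dB)     hh (dB)\n",
    "  100          -20.3       -31.2\n",
    "  200          -22.7       -33.6\n",
]

data = []
for line in all_lines:
    # Skip the header comment and any blank lines
    if line.startswith("#") or not line.strip():
        continue
    # split() with no arguments splits on any run of whitespace
    data.append([float(value) for value in line.split()])

print(data)
```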

#### `seek()`
Up until now, we reopened the file every time we wanted to read its lines again. Opening the same file repeatedly has drawbacks:
<br>1. Every open file handle uses memory and an operating-system file descriptor.
<br>2. On some platforms, notably Windows, an open handle can stop other programs from renaming or deleting the file.
<br><br>
If all you need is to read the file from the top again, use `seek()`:

In [None]:
file_in.seek(0)

In [None]:
file_in.readlines()

The `seek(0)` call moves the file's I/O position to the beginning of the file. To move to the end of the file instead, pass a second argument that makes the offset relative to the end: `seek(0, 2)`. A call such as `seek(-1)` raises an error, because an absolute position cannot be negative.
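The effect of the file position can be seen end to end on a small throwaway file. A sketch, assuming a hypothetical file `demo.txt` created in a temporary directory:

```python
import os
import tempfile

# Create a small throwaway file to read from
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
demo = open(path, "w")
demo.write("first\nsecond\n")
demo.close()

file_in = open(path, "r")
first_read = file_in.readlines()    # reads every line
second_read = file_in.readlines()   # position is at the end, so nothing is left
file_in.seek(0)                     # back to the beginning
third_read = file_in.readlines()    # every line again
file_in.seek(0, os.SEEK_END)        # jump to the end (os.SEEK_END equals 2)
rest = file_in.read()               # nothing left to read
file_in.close()
```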

#### `close()`
Every `open()` should be paired with a `close()` call once the file I/O is finished. Closing releases the file handle back to the operating system.

In [None]:
file_in.close()

### 3. Writing Files
From time to time you will also need to write files to disk:

In [None]:
file_out = open('../programming-for-analytics-course-material/data/index.txt','w')

In [None]:
file_in = open('../programming-for-analytics-course-material/data/index.txt','r')

In [None]:
file_in.readlines()

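The only difference between opening a file for reading and for writing is the mode: `'r'` versus `'w'`. Note that `'w'` creates the file if it does not exist and empties it if it does, which is why `readlines()` above returned an empty list.

Let's get a complete list of the files in our data directory.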
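The behavior of write mode is easiest to see next to append mode `'a'`, which is not used elsewhere in these notes. A sketch on a hypothetical file in a temporary directory; `read_all` is a helper written only for this example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file name

def read_all(file_path):
    """Open a file in read mode and return its whole contents."""
    handle = open(file_path, "r")
    contents = handle.read()
    handle.close()
    return contents

file_out = open(path, "w")   # 'w' creates the file
file_out.write("one\n")
file_out.close()

file_out = open(path, "a")   # 'a' appends to the existing contents
file_out.write("two\n")
file_out.close()
after_append = read_all(path)

file_out = open(path, "w")   # opening with 'w' again empties the file first
file_out.write("three\n")
file_out.close()
after_rewrite = read_all(path)
```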

In [None]:
complete_list_of_files = glob.glob('../programming-for-analytics-course-material/data/*')

In [None]:
complete_list_of_files

#### `write()`
`write()` writes a string to the file. It does not add a newline, so we write `'\n'` ourselves after each path.
<br>
We are going to store the list of files and their paths in a text file called `index.txt`.

In [None]:
for line in complete_list_of_files:
    file_out.write(line)
    file_out.write('\n')
file_out.close()

##### Open the index.txt file, and check what you wrote to the file

Each path appears on its own line because we wrote `'\n'` after it. Without that second `write()` call, all the paths would run together on a single line.

#### `writelines()`
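`writelines()` writes every string in a sequence, so no explicit `for` loop is required. Like `write()`, it does not add newlines, so we append one to each path.

In [None]:
file_out = open('../programming-for-analytics-course-material/data/index.txt','w')
file_out.writelines([path + '\n' for path in complete_list_of_files])
file_out.close()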
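Despite its name, `writelines()` adds no line separators of its own. A sketch that writes the same hypothetical list of paths with and without newlines and reads the results back:

```python
import os
import tempfile

paths = ["data/aapl.csv", "data/msft.csv"]  # hypothetical file list
index_path = os.path.join(tempfile.mkdtemp(), "index.txt")

# Without newlines, the strings are written back to back
file_out = open(index_path, "w")
file_out.writelines(paths)
file_out.close()
file_in = open(index_path, "r")
run_together = file_in.read()
file_in.close()

# With a newline appended to each string, every path gets its own line
file_out = open(index_path, "w")
file_out.writelines([path + "\n" for path in paths])
file_out.close()
file_in = open(index_path, "r")
separated = file_in.readlines()
file_in.close()
```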

### 4. Closing Files
We have already seen the function `close()` to close files.

This approach has a drawback: `close()` has to be called explicitly. If an exception interrupts your program before it reaches `close()`, the file handle stays open, and data still sitting in the write buffer may never reach the disk.

#### `try` - `finally`
A `try`/`finally` block makes sure the file is closed even if an exception is raised while the file is in use.

In [None]:
file_in = open('../programming-for-analytics-course-material/data/rcs.txt','r')

In [None]:
try:
    all_lines = file_in.readlines()
finally:
    file_in.close()
all_lines

However, we had to write quite a few statements to get this working. We can make this a lot nicer:

#### `with`

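The `with` statement closes the file automatically when its block ends, even if an exception is raised inside the block. First, let's clear our variable `all_lines`.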

In [None]:
all_lines = []

In [None]:
with open('../programming-for-analytics-course-material/data/rcs.txt','r') as file_in:
    all_lines = file_in.readlines()
all_lines
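The file object's `closed` attribute confirms that `with` really does close the file, both on normal exit and when an exception escapes the block. A sketch on a hypothetical temporary file, with a deliberately raised error:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical file name

with open(path, "w") as file_out:
    file_out.write("hello\n")
closed_after_write = file_out.closed   # the block ended normally

try:
    with open(path, "r") as file_in:
        raise ValueError("simulated failure while reading")
except ValueError:
    pass
closed_after_error = file_in.closed    # the block ended with an exception
```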

### 5. Exercises!
#### Q1. 
Read in a set of logs from an ASCII file.

Read in the logs found in the file `short_logs.crv`.
The logs are arranged as follows:

    DEPTH    S-SONIC    P-SONIC ...
    8922.0   171.7472   86.5657
    8922.5   171.7398   86.5638
    8923.0   171.7325   86.5619
    8923.5   171.7287   86.5600
    ...

So the first line is a list of log names for each column of numbers.
The columns are the log values for the given log.

Make a dictionary with keys as the log names and values as the
log data:

    logs['DEPTH']
    [8922.0, 8922.5, 8923.0, ...]
    logs['S-SONIC']
    [171.7472, 171.7398, 171.7325, ...]
    

<br>
<br>
<br>
<br>
#### Solution

In [None]:
log_file = open('../programming-for-analytics-course-material/data/short_logs.crv')

# The first line is a header that has all the log names:
header = log_file.readline()
log_names = header.split()
log_count = len(log_names)

# Read in each row of values, converting them to floats as
# they are read in.  Assign them to the log name for their
# particular column:
logs = {}

# Initialize the logs dictionary so that it contains the log names
# as keys, and an empty list for the values.
for name in log_names:
    logs[name] = []

for line in log_file:
    values = [float(val) for val in line.split()]
    for i, name in enumerate(log_names):
        logs[name].append(values[i])

log_file.close()

# output the first 10 values for the DEPTH log.

print('DEPTH: %s' % logs['DEPTH'][:10])


#### Q2.
The files aapl\* and msft\* contain trading data.  The data in each
file is in comma-separated format:

    Date,Open,High,Low,Close,Volume,Adj Close
    2015-10-21,114.00,115.580002,113.699997,113.760002,41795200,113.760002
    2015-10-20,111.339996,114.169998,110.82,113.769997,48778800,113.769997
    2015-10-19,110.800003,111.75,110.110001,111.730003,29606100,111.730003
    2015-10-16,111.779999,112.00,110.529999,111.040001,38236300,111.040001
    2015-10-15,110.93,112.099998,110.489998,111.860001,37341000,111.860001

In this exercise you will write two functions: one that reads
in data from files of this format, and one which writes data
out to files of this format.
##### a)
You should be able to provide the name of a company, such as `aapl`, along with the directory where the files might be located, and let the read function discover all the files. 
The data read in should be stored in a `dict` object keyed on the date.
##### b) 
Merge all the `dict` objects into one.
##### c) 
Write a function that accepts the name of a company, such as `aapl`, and a `dict` object containing all of the merged financial data. 
This function should write all the data out to a single file with a name such as 
`aapl-mmmyy-mmmyy.csv`
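One possible starting point for part (a), not a full solution: the standard-library `csv` module can parse each row into a dictionary, which is then stored under its date. The sketch assumes Python 3 and reads from an in-memory string holding two rows copied from the sample above; the variable names are illustrative.

```python
import csv
import io

sample = (
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2015-10-21,114.00,115.580002,113.699997,113.760002,41795200,113.760002\n"
    "2015-10-20,111.339996,114.169998,110.82,113.769997,48778800,113.769997\n"
)

prices_by_date = {}
for row in csv.DictReader(io.StringIO(sample)):
    date = row.pop("Date")       # the remaining columns become the value
    prices_by_date[date] = row   # values are still strings at this point

print(sorted(prices_by_date))
```

For real files, the `io.StringIO(sample)` argument would be replaced by an open file object, and the numeric fields would need converting with `float()` or `int()`.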

## DataFrames!
Our data is represented by a DataFrame. You can think of a data frame as a giant spreadsheet that you can program. It's a collection of series (or columns) with a common set of commands that make managing data in Python super easy.

### 1. `pandas`
`pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In [None]:
import pandas as pd

#### `read_csv()`
This function, along with a variety of other `read_*` functions in pandas, reads a particular file format into a DataFrame.

Let's read in the `best-sandwiches` dataset.

In [None]:
best_sws_df = pd.read_csv('../programming-for-analytics-course-material/data/best-sandwiches.csv')

#### `head()`
The `head()` function allows us to peek at the first few rows of the df.

In [None]:
best_sws_df.head()

By default, `head()` displays the first 5 rows of data.

We can display the first `n` rows by passing `n`:

In [None]:
best_sws_df.head(10)

#### `tail()`
The `tail()` function allows us to peek at the last few rows of the df.

In [None]:
best_sws_df.tail()

By default, `tail()` displays the last 5 rows of data.
We can display the last `n` rows by passing `n`:

In [None]:
best_sws_df.tail(10)

#### `describe()`
The `describe()` function presents descriptive statistics for each column of the df.

In [None]:
best_sws_df.describe()

However, `describe()` does not work like the `R` function `summary()`: by default it summarizes only the numeric columns, and it does not automatically present frequency information for categorical data.
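The difference shows up on a small frame with one text column and one numeric column. A sketch using a hypothetical miniature dataset rather than the sandwiches file:

```python
import pandas as pd

# Hypothetical miniature dataset
df = pd.DataFrame({
    "sandwich": ["BLT", "Reuben", "BLT", "Club"],
    "price": [9.0, 11.0, 9.0, 10.0],
})

numeric_summary = df.describe()               # numeric columns only by default
text_summary = df["sandwich"].describe()      # count, unique, top, freq
full_summary = df.describe(include="all")     # every column in one table
```

For a text column, `describe()` reports only the single most frequent value (`top`) and its count (`freq`), not a full frequency table.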

#### Reading a specific column
You do not need to display the entire df, or use the entire df for each calculation or output. You can also select a single column. 

In [None]:
best_sws_df.sandwich

We can also call the functions `head()`, `tail()`, and `describe()` on a single column-

In [None]:
best_sws_df.sandwich.head()

In [None]:
best_sws_df.sandwich.tail()

In [None]:
best_sws_df.sandwich.describe()

#### Note
Attribute access such as `best_sws_df.sandwich` only works for columns that already exist and whose names are valid Python identifiers. You cannot create a new column this way.

#### Creating a new column
To create a new column, or to modify an existing one, refer to the column using bracket notation:

In [None]:
best_sws_df["sandwich"].head()

We should probably normalize the names, by converting all the names to upper case.

In [None]:
best_sws_df["sandwich"] = best_sws_df.sandwich.apply(lambda x: x.upper())
best_sws_df.sandwich.head()
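The difference between bracket and attribute notation matters most when the column does not exist yet. A sketch on hypothetical data; the column names `price_with_tip` and `discount` are made up for the example:

```python
import warnings
import pandas as pd

df = pd.DataFrame({"sandwich": ["blt", "reuben"], "price": [9.0, 11.0]})  # hypothetical data

# Bracket notation creates a real column
df["price_with_tip"] = df["price"] * 1.2

# Attribute assignment with a new name only sets a Python attribute;
# pandas emits a UserWarning, which is silenced here to keep the output clean
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.discount = [1.0, 2.0]

column_names = list(df.columns)
print(column_names)
```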

#### `unique()`
The `unique()` function returns an array of the unique values in a given column.

In [None]:
list(best_sws_df.price.unique())

The number of unique values in the price column can be calculated using the `len()` function

In [None]:
len(list(best_sws_df.price.unique()))

#### or

In [None]:
best_sws_df.price.nunique()

#### `value_counts()`
The function `value_counts()` returns the frequency of each unique value in a column:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
best_sws_df.price.value_counts().plot(kind="bar")
plt.show()
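Before plotting, it can help to look at what `value_counts()` actually returns: a Series indexed by the unique values, sorted from most to least frequent. A sketch with hypothetical prices:

```python
import pandas as pd

prices = pd.Series([10, 9, 10, 8, 10, 9], name="price")  # hypothetical prices

counts = prices.value_counts()         # index: unique values, data: frequencies
most_common_price = counts.index[0]    # first entry is the most frequent value
frequency_table = counts.to_dict()
print(frequency_table)
```

The length of the result equals `nunique()`, since there is one entry per unique value.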

### 2. Exercises!
#### Q1
Read in the financial data file you have created as a dataframe (concatenation of all the files for a company).
#### Q2
Provide descriptive statistics for all columns of this data