# Programming For Analytics
##### © Pratik Agrawal, 2016

## Functions, File I/O, Data Frames
- Functions
 - Structure
 - Docstring (Document String or Help Text)
 - Arguments
 - Exercises!
- File I/O
 - `glob()`
 - Reading Files
   - `open()`/`file()`
   - `read()`
   - `readline()`/`readlines()`
   - `seek()`
   - `close()`
 - Writing Files
   - `write()`
   - `writelines()`
 - Closing Files
   - `try`/`finally`
   - `with`
 - Exercises!
- DataFrames! 
 - `pandas`
   - `read_csv()`
   - `head()`
   - `tail()`
   - `describe()`
   - Reading specific columns
   - Creating new columns
   - `unique()`
   - `nunique()`
   - `value_counts()`
 - Exercises!
 
 
## Functions
### 1. Structure
A function in Python is defined using the keyword `def`, followed by a function name, a signature within parentheses `()`, and a colon `:`.

In [1]:
def func_0(arg_0, arg_1):
    """func_0 check for equivalence of the two arguments
       arguments-
       arg_0: first value to compare
       arg_1: second value to compare
       return-
       boolean: True or False based on comparison
    """
    return (arg_0==arg_1)

In [2]:
func_0("a","a")

True

### 2. Docstring 
A docstring (documentation string) describes what the function does. It is also the text that Python's built-in `help()` function displays when a user calls it on the function.

In [3]:
def func_0(arg_0, arg_1):
    """func_0 check for equivalence of the two arguments
       arguments-
       arg_0: first value to compare
       arg_1: second value to compare
       return-
       boolean: True or False based on comparison
    """
    return (arg_0==arg_1)

In [4]:
help(func_0)

Help on function func_0 in module __main__:

func_0(arg_0, arg_1)
    func_0 check for equivalence of the two arguments
    arguments-
    arg_0: first value to compare
    arg_1: second value to compare
    return-
    boolean: True or False based on comparison
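
The text that `help()` prints is stored on the function object itself. As a small sketch (the function `is_equal` is made up for illustration), the docstring can be read from the `__doc__` attribute or, with its indentation cleaned up, through `inspect.getdoc()`:

```python
import inspect

def is_equal(arg_0, arg_1):
    """Check two values for equality.

    Returns True if the values compare equal, False otherwise.
    """
    return arg_0 == arg_1

# The docstring as stored on the function object
raw_doc = is_equal.__doc__

# The same docstring with common leading indentation removed
clean_doc = inspect.getdoc(is_equal)

print(clean_doc.splitlines()[0])
```

Reading the docstring programmatically is useful when you want to list or check the documentation of many functions at once.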



### 3. Arguments
A function can accept any number of arguments, from zero upward, and use them to perform its operation.
Note: If your function does not use an argument, consider removing it from the function definition.

- With one argument

In [5]:
def func_0(arg_0):
    """func_0 prints the argument passed within
       arguments-
       arg_0: value to be printed
       return-
       None
    """
    print(arg_0)

- With multiple arguments

In [6]:
def sum_all_values(*args):
    """Sum all values passed to the function
       arguments-
       args: comma separated values that need to be passed in to the function
       return-
       Sum of all values passed in to function
    """
    return sum(args)

In [7]:
sum_all_values(1)

1

In [8]:
sum_all_values(1,2,3)

6

- With compound data types such as dictionaries

In [9]:
# Assumes scikit-learn's LogisticRegression, whose regularization strength
# is set with "C" (it has no "alpha" parameter)
from sklearn.linear_model import LogisticRegression

def instantiate_model(params):
    """Function to instantiate a model
       arguments-
       params: dictionary of parameters required to instantiate model
       return-
       instantiated model object
    """
    # Look up the dictionary keys "C" and "tol" and pass their values to the model
    lr = LogisticRegression(C=params["C"], tol=params["tol"])
    return lr
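
Besides `*args`, a function can declare default values and collect extra keyword arguments with `**kwargs`, and a dictionary can be unpacked into keyword arguments with `**`. The sketch below uses a made-up function, `describe_model`, that is not part of any library:

```python
def describe_model(name, alpha=1.0, **kwargs):
    """Return a one-line description of a (hypothetical) model configuration.

    name: model name
    alpha: a parameter with a default value
    kwargs: any additional keyword arguments
    """
    extras = ", ".join("%s=%s" % (key, kwargs[key]) for key in sorted(kwargs))
    return "%s(alpha=%s%s)" % (name, alpha, ", " + extras if extras else "")

# Only the required argument: alpha falls back to its default
print(describe_model("ridge"))

# A dictionary unpacked into keyword arguments with **
params = {"alpha": 0.5, "tol": 0.001}
print(describe_model("ridge", **params))
```

Unpacking with `**params` is an alternative to passing the dictionary itself and indexing it inside the function; which one reads better depends on how many parameters the function really needs.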

### 4. Exercises
The shortest distance between two points on the globe, assuming it is perfectly spherical, is the length of the  _great circle_ path.  If you are given two locations in latitude and longitude, then the [Haversine Formula](http://en.wikipedia.org/wiki/Haversine_formula) gives this shortest distance in a numerically stable way:

$a = \sin^{2}(\Delta \phi/2) + \cos(\phi_{1})\cos(\phi_{2})\sin^{2}(\Delta \lambda/2)$

$c = 2 \arcsin(\sqrt{a})$

$d = rc$

Where $\phi_{i}$ is latitude and $\lambda_{i}$ is the longitude of point $i$ and $r$ is the radius of the globe.

#### Q1

Write a function that takes as inputs a `radius` and two points specified by tuples of `(latitude, longitude)` and returns the distance between the points along a great circle.

NOTE: Remember to convert your angles to radians!!
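
Python's trigonometric functions work in radians, so passing degrees directly gives a wrong result without raising any error. A minimal illustration with a right angle:

```python
import math

angle_deg = 90.0

# Passing degrees straight to math.sin treats 90 as 90 radians
wrong = math.sin(angle_deg)

# Converting first gives the expected sine of a right angle
right = math.sin(math.radians(angle_deg))

print(right)
```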

In [28]:
# Exercise 1

In [32]:
# Phi is latitude and lambda is longitude, both in degrees. The formula expects them in radians.
# Each point is a tuple of (latitude, longitude)
from math import sin, cos, asin, radians

In [33]:
import math
math.radians(30)

0.5235987755982988

In [48]:
# import the math module first
import math

def haversine_formula(r, p1, p2):
    # unpack the points, since each point is a (latitude, longitude) tuple
    phi_1, lambda_1 = p1
    phi_2, lambda_2 = p2

    # convert to radians
    phi_1 = math.radians(phi_1)
    phi_2 = math.radians(phi_2)
    lambda_1 = math.radians(lambda_1)
    lambda_2 = math.radians(lambda_2)

    # compute the haversine formula
    a = math.sin((phi_1 - phi_2)/2)**2 + (math.cos(phi_1)*math.cos(phi_2)*math.sin((lambda_1 - lambda_2)/2)**2)
    c = 2 * math.asin(a**0.5)
    return r * c

#### Q2

Given the list of cities from the "Flight Distances" exercise, and their latitudes and longitudes:

In [55]:
cities = {
    'Atlanta': (33.7569444444, -84.3902777778),
    'Austin': (30.3, -97.7333333333),
    'Boston': (42.3577777778, -71.0616666667),
    'Chicago': (41.9, -87.65),
    'Dallas': (32.7825, -96.7975),
    'Denver': (39.7391666667, -104.984722222),
    'Houston': (29.7627777778, -95.3830555556),
    'Los Angeles': (34.05, -118.25),
    'Miami': (25.7833333333, -80.2166666667),
    'New York': (40.67, -73.94),
    'San Francisco': (37.7666666667, -122.433333333),
    'Seattle': (47.6, -122.316666667),
}

write a function that, given a dictionary of city names and locations, plus the names of two cities, returns the great circle distance between the two cities.  The result should be rounded to the nearest 10 km.

    def city_distance(cities, city_1, city_2):
        ...

HINT: The built-in `round()` function takes an optional second argument for the number of digits of precision.  This argument can be negative.

In [49]:
# Create a function that takes the names city_1 and city_2 and looks up each city's
# (latitude, longitude) tuple in the dictionary "cities"
def city_distance(cities, city_1, city_2):
    r_earth = 6371.0 # km
    distance = haversine_formula(r_earth, cities[city_1], cities[city_2])
    # Round output to the nearest 10
    return round(distance, -1)

city_distance(cities, "Atlanta", "Austin")

1310.0

In [50]:
# Exercise 2
r_earth = 6371.0 # km

# Manually input city coordinates
Austin = (30.2500, -97.7500)
Cambridge = (52.2050, 0.1190)

print(round(haversine_formula(r_earth, Austin, Cambridge), -1))

7890.0
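
A great-circle function can be sanity-checked against cases whose answer is known in closed form. The sketch below re-implements the three Haversine equations above in a standalone function (the name `great_circle_distance` is chosen here for illustration) and checks a quarter and a half of the equator on a unit sphere, which should give $\pi/2$ and $\pi$:

```python
import math

def great_circle_distance(r, p1, p2):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    phi_1, lambda_1 = (math.radians(v) for v in p1)
    phi_2, lambda_2 = (math.radians(v) for v in p2)
    a = (math.sin((phi_2 - phi_1) / 2) ** 2
         + math.cos(phi_1) * math.cos(phi_2) * math.sin((lambda_2 - lambda_1) / 2) ** 2)
    return r * 2 * math.asin(math.sqrt(a))

# A quarter of the equator on a unit sphere: expected pi / 2
quarter = great_circle_distance(1.0, (0.0, 0.0), (0.0, 90.0))

# Half of the equator on a unit sphere: expected pi
half = great_circle_distance(1.0, (0.0, 0.0), (0.0, 180.0))

print(abs(quarter - math.pi / 2) < 1e-9, abs(half - math.pi) < 1e-9)
```

A check like this catches mistakes such as a forgotten radian conversion or a mistyped square root, both of which leave the code running but produce implausible distances.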

#### Q3

Write a function that, given a set of cities returns a dictionary whose keys are pairs of cities and whose values are the distances between them.  You should use an appropriate data structure for the keys, and round distances to the nearest 10 km.

    def compute_distances(cities):
        ...

Bonus points if you compute the distance for a given pair of cities only once.

In [56]:
# Exercise 3
def compute_distances(cities):
    r_earth = 6371.0 # km
    
    # Create a new dictionary to hold all distances
    distances = {}

    # Create a set to track the cities we have not visited yet. It starts with all cities since nothing is visited
    # Tracking this is optional, but otherwise we would compute every distance twice
    unvisited = set(cities)

    # Loop over every city name in the dictionary "cities"
    for city_1 in cities:

        # Remove city_1 from the cities to visit, so now we have visited city_1
        unvisited.remove(city_1)

        # Save location_1 as the value (location tuple) stored under city_1 in the "cities" dictionary
        location_1 = cities[city_1]

        # Now loop over the remaining cities in the "unvisited" set
        for city_2 in unvisited:

            # Save location_2 as the location tuple of city_2 in the "cities" dictionary
            location_2 = cities[city_2]

            # Calculate the distance between the two locations using haversine_formula
            distance = haversine_formula(r_earth, location_1, location_2)

            # The pair of cities (a frozenset, so the order of the two names does not matter)
            # is the key in the dictionary "distances"; the rounded distance between them is the value
            distances[frozenset((city_1, city_2))] = round(distance, -1)
            
    return distances
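
Another way to visit each pair exactly once is `itertools.combinations` from the standard library. The sketch below shows the idea on made-up one-dimensional positions, with a stand-in `toy_distance` function instead of the Haversine formula, so that it runs on its own:

```python
from itertools import combinations

# Hypothetical stand-in for a distance function, so the sketch runs on its own
def toy_distance(p1, p2):
    return abs(p1 - p2)

positions = {"A": 0, "B": 3, "C": 10}

pair_distances = {}
# combinations(..., 2) yields each unordered pair of keys exactly once
for name_1, name_2 in combinations(sorted(positions), 2):
    pair_distances[frozenset((name_1, name_2))] = toy_distance(
        positions[name_1], positions[name_2])

# The order of the names does not matter when looking up a frozenset key
print(pair_distances[frozenset(("C", "A"))])
```

For `n` cities this produces `n * (n - 1) / 2` entries, the same pairs the visited-set approach produces.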

In [57]:
compute_distances(cities)

The result is a dictionary with 66 entries, one for each unordered pair of the 12 cities. Each key is a `frozenset` containing two city names, and each value is the distance between them in km, rounded to the nearest 10 km.

## File I/O
Most of the time you will be interacting with files. We need a way to access the files, but at the same time we also need ways to discover files.
### 1. `glob`
The `glob` module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. No tilde expansion is done, but \*, ?, and character ranges expressed with [] will be correctly matched.

In [59]:
import glob

#### `glob()`
The `glob()` function, invoked as `glob.glob()`, discovers files whose paths match a pattern. In the example below, we pass only the directory name. Execute the code snippet to see what the output is.

In [64]:
# Shows only File Path
complete_list_of_files = glob.glob("/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data")
complete_list_of_files

['/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data']

Now let's add a `*` at the end of the directory path.

In [75]:
# Shows all actual files in this path
complete_list_of_files = glob.glob("/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/*")
complete_list_of_files

['/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-1-jan15-oct15.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-2-jan14-dec14.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-3-dec80-dec13.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/best-sandwiches.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/Divvy_Stations_2013.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/Divvy_Trips_2013.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/index.txt',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-1-jan15-oct15.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-2-jan14-dec14.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/ms

As seen above, `*` is a wildcard: it tells the `glob()` function that you are interested in all files under `../programming-for-analytics-course-material/data`

If we are only interested in a particular set of files, then we could also do something like-

In [76]:
# Shows only files in this path that start with msft
files_msft = glob.glob("/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft*")
files_msft

['/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-1-jan15-oct15.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-2-jan14-dec14.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-3-mar86-dec13.csv']

#### or

In [67]:
# Shows only files in this path that start with aapl
files_aapl = glob.glob("/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl*")
files_aapl

['/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-1-jan15-oct15.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-2-jan14-dec14.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-3-dec80-dec13.csv']

#### or

In [68]:
# Shows only files in this path whose names contain 14 twice
files_2014 = glob.glob("/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/*14*14*")
files_2014

['/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-2-jan14-dec14.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-2-jan14-dec14.csv']
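
The patterns above use absolute paths from one particular machine. As a self-contained sketch, the same wildcards can be tried on a throwaway directory. The file names here are made up, and `os.path.join` is used so the pattern does not hard-code a path separator:

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few empty files (names are made up)
data_dir = tempfile.mkdtemp()
for name in ["aapl-1.csv", "aapl-2.csv", "msft-1.csv", "notes.txt"]:
    open(os.path.join(data_dir, name), "w").close()

# glob() returns matches in arbitrary order, so sort them for a stable result
csv_files = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
aapl_files = sorted(glob.glob(os.path.join(data_dir, "aapl*")))

# ? matches exactly one character, [12] matches one character from the set
numbered = sorted(glob.glob(os.path.join(data_dir, "????-[12].csv")))

print([os.path.basename(path) for path in csv_files])
```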

### 2. Reading Files

Let's say we have a file 'rcs.txt' which contains data in text format like this:

    #freq (MHz)     vv (dB)     hh (dB)
      100          -20.3       -31.2
      200          -22.7       -33.6

We'd like to get the data into a list of lists of floating point numbers in
Python:

    [[100.0, -20.3, -31.2],
     [200.0, -22.7, -33.6]]

We can open the file with the `open` function or the `file` type:

#### `open()`/`file()`
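There are two ways to open files in Python 2: the `open()` function and the `file` type. Python 3 removed `file`, so `open()` is the portable choice.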

In [69]:
# Read file using open() function
file_in = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt')

#### or

In [70]:
# Read file using the file type (Python 2 only)
file_in = file('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt')

#### or

In [71]:
# READ-ONLY file using open() function, BUT add a ,'r' after the file path
file_in = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt','r')

Here `'r'` specifies read-only mode, which is also the default when no mode is given.

#### `read()`
You can read in the contents as a string using the `read()` method of the file object

In [72]:
text = file_in.read()

In [73]:
print(text)

#freq (MHz)     vv (dB)     hh (dB)
  100          -20.3       -31.2
  200          -22.7       -33.6



In [74]:
type(text)

str

#### `readline()/readlines()`
These methods let you read either a single line (`readline()`) or all lines as a list of lines (`readlines()`).
<br>Let's open the file again

In [77]:
file_in = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt','r')

In [78]:
# reads first line
file_in.readline()

'#freq (MHz)     vv (dB)     hh (dB)\n'

Call `readline()` again

In [79]:
# reads next line
file_in.readline()

'  100          -20.3       -31.2\n'

One more time-

In [80]:
# reads next line
file_in.readline()

'  200          -22.7       -33.6\n'

And now the last time!

In [81]:
# reads final line
file_in.readline()

''

From the above, note that `readline()` stepped through the file one line at a time, but it did not go back to the start of the file once all lines had been read.
<br><br> We also had to call `readline()` multiple times, so any application that needs every line of a file would require the programmer to write a loop. 
<br>A solution to this problem is `readlines()`

In [86]:
file_in = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt','r')

In [87]:
# use readlines() to read MULTIPLE LINES
file_in.readlines()

['#freq (MHz)     vv (dB)     hh (dB)\n',
 '  100          -20.3       -31.2\n',
 '  200          -22.7       -33.6\n']

In [84]:
# if you readlines() again, it will be empty since you already read them all
file_in.readlines()

[]

From the above we see that `readlines()` returns a list of lines, where each line in the file is an item in the list. 

<br>We did make a mistake: after reading all lines into a list, we never stored the list anywhere

In [88]:
file_in = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt','r')

In [89]:
# Save reading of all lines as a variable
all_lines = file_in.readlines()

In [90]:
# call the variable so we can see all lines whenever we want
all_lines

['#freq (MHz)     vv (dB)     hh (dB)\n',
 '  100          -20.3       -31.2\n',
 '  200          -22.7       -33.6\n']
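
With the lines in hand, the list of lists of floats described at the start of this section can be built by skipping the `#` header and converting each remaining field with `float()`. The sketch below first writes a small temporary file with the same layout as `rcs.txt` so that it is self-contained. The temporary file name is an assumption for illustration:

```python
import os
import tempfile

# Recreate a small file with the same layout as rcs.txt
sample_text = (
    "#freq (MHz)     vv (dB)     hh (dB)\n"
    "  100          -20.3       -31.2\n"
    "  200          -22.7       -33.6\n"
)
path = os.path.join(tempfile.mkdtemp(), "rcs_sample.txt")
file_out = open(path, "w")
file_out.write(sample_text)
file_out.close()

data = []
file_in = open(path, "r")
# A file object can be looped over directly, one line at a time
for line in file_in:
    if line.startswith("#") or not line.strip():
        continue  # skip the header comment and blank lines
    data.append([float(field) for field in line.split()])
file_in.close()

print(data)  # [[100.0, -20.3, -31.2], [200.0, -22.7, -33.6]]
```

Looping over the file object reads one line at a time, which avoids holding a large file in memory the way `readlines()` does.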

#### `seek()`
Up until now, each time we wanted to read lines from the file, we opened the file again. Repeatedly opening the same file causes several problems:
<br>1. Multiple file handles stay open, and each one consumes memory and an operating-system file descriptor.
<br>2. On some operating systems, every open handle can keep the file locked.
<br><br>
If all you need is to read the file from the top-

In [94]:
# Resets read lines to the beginning
file_in.seek(0)

In [95]:
file_in.readlines()

['#freq (MHz)     vv (dB)     hh (dB)\n',
 '  100          -20.3       -31.2\n',
 '  200          -22.7       -33.6\n']

The `seek(0)` call moves the file's I/O position to the beginning of the file. To move to the end of the file instead, call `seek(0, 2)`, where the second argument `2` means "relative to the end of the file".
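
A minimal sketch of moving the read position around, using a small temporary file (the file name is made up) so it runs on its own. The second argument of `seek()` selects the reference point: `0` (the default) is the start of the file and `2` is the end:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seek_demo.txt")
file_out = open(path, "w")
file_out.write("first line\nsecond line\n")
file_out.close()

file_in = open(path, "r")
first_pass = file_in.readlines()   # reads to the end of the file
exhausted = file_in.readlines()    # nothing left to read: []

file_in.seek(0)                    # back to the beginning
second_pass = file_in.readlines()  # the same lines again

file_in.seek(0, 2)                 # whence=2: jump to the end of the file
at_end = file_in.read()            # nothing after the end: ''
file_in.close()

print(first_pass == second_pass, exhausted, repr(at_end))
```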

#### `close()`
Each time we open a file, we should make a corresponding `close()` call once the file I/O is finished. This releases the file handle back to the operating system.

In [96]:
# MUST close file when you're done or else it will eat up memory
file_in.close()

### 3. Writing Files
You will, from time-to-time, write files to disk as well-

In [97]:
# since we opened with a ,'w' now we can write to it
file_out = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/index.txt','w')

The only difference between opening a file for reading and for writing is switching the mode from `'r'` to `'w'`. Be aware that opening an existing file with `'w'` erases its previous contents.
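
The effect of the mode is easy to see on a scratch file. In this sketch (the file name is made up), `'w'` starts from an empty file each time, while `'a'` (append mode) keeps what is already there:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

file_out = open(path, "w")      # creates the file
file_out.write("first\n")
file_out.close()

file_out = open(path, "w")      # 'w' again: the old contents are discarded
file_out.write("second\n")
file_out.close()

file_out = open(path, "a")      # 'a' keeps the contents and writes at the end
file_out.write("third\n")
file_out.close()

file_in = open(path, "r")
contents = file_in.read()
file_in.close()

print(repr(contents))  # 'second\nthird\n'
```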

Let's get a complete list of files for our data directory

In [98]:
complete_list_of_files = glob.glob('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/*')

In [99]:
complete_list_of_files

['/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-1-jan15-oct15.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-2-jan14-dec14.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/aapl-3-dec80-dec13.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/best-sandwiches.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/Divvy_Stations_2013.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/Divvy_Trips_2013.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/index.txt',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-1-jan15-oct15.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/msft-2-jan14-dec14.csv',
 '/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/ms

#### `write()`
`write()` writes a single string to a file. 
<br>
We are going to store the list of files and their paths in a text file called `index.txt`

In [100]:
# Write a line to the file "index.txt"
for line in complete_list_of_files:
    file_out.write(line)
file_out.close()

##### Open the index.txt file, and check what you wrote to the file

Everything runs together on a single line. This is because `write()` does not add a newline after each string, so each path must end with `'\n'` to appear on its own line in the file.
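
A minimal sketch of the fix, using made-up paths and a temporary index file: append `"\n"` to each string before writing it, then read the file back to confirm that every entry is on its own line.

```python
import os
import tempfile

paths = ["data/a.csv", "data/b.csv", "data/c.csv"]  # made-up file names
index_path = os.path.join(tempfile.mkdtemp(), "index_demo.txt")

file_out = open(index_path, "w")
for line in paths:
    # write() writes exactly the string it is given, so add the newline ourselves
    file_out.write(line + "\n")
file_out.close()

file_in = open(index_path, "r")
lines_back = file_in.readlines()
file_in.close()

print(lines_back)  # ['data/a.csv\n', 'data/b.csv\n', 'data/c.csv\n']
```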

#### `writelines()`
No more `for` loops are required to write each line: `writelines()` writes every string in a list. Like `write()`, it does not add newlines, so we append `'\n'` to each path ourselves.

In [102]:
# Open index.txt, write every path in complete_list_of_files on its own line, and close the file
file_out = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/index.txt','w')
file_out.writelines([line + '\n' for line in complete_list_of_files])
file_out.close()

### 4. Closing Files
We have already seen the function `close()` to close files.

This method has a drawback: `close()` has to be called explicitly. If your program hits an error and stops before reaching the `close()` call, the file handle is never closed explicitly, so buffered output may not reach the disk and, on some systems, the file can stay locked.

#### `try` - `finally`
A `try`/`finally` block makes sure your file is closed even if an error occurs while the file is in use

In [104]:
file_in = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt','r')

In [None]:
# try reads the lines into "all_lines"; finally closes the file whether or not the read succeeds
try:
    all_lines = file_in.readlines()
finally:
    file_in.close()
all_lines
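
To see that `finally` really runs when something goes wrong, the sketch below forces an error while the file is open. The temporary file name and its contents are made up for the demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "finally_demo.txt")
file_out = open(path, "w")
file_out.write("not a number\n")
file_out.close()

file_in = open(path, "r")
error_seen = False
try:
    try:
        value = float(file_in.readline())  # raises ValueError
    finally:
        file_in.close()                    # runs whether or not an error occurred
except ValueError:
    error_seen = True

print(error_seen, file_in.closed)  # True True
```

The outer `try`/`except` is only there so the demonstration can continue after the error; the inner `try`/`finally` is what guarantees the `close()` call.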

However we had to write quite a few statements to get this working. We can make this a lot nicer - 

#### `with`

First, let's clear our variable `all_lines`

In [106]:
# create an empty list
all_lines = []

In [None]:
# Open the file for reading and store all lines in the variable all_lines
# The with ... as statement opens the file and closes it automatically when the block ends
with open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/rcs.txt','r') as file_in:
    all_lines = file_in.readlines()
all_lines
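
File objects expose a `closed` attribute, which makes it easy to confirm what `with` does. A small sketch on a temporary file (the file name is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

# The file is closed automatically when the with block ends
with open(path, "w") as file_out:
    file_out.write("100 -20.3 -31.2\n")

with open(path, "r") as file_in:
    fields = file_in.readline().split()

# Both handles are closed even though close() was never called explicitly
print(file_out.closed, file_in.closed, fields)
```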

### 5. Exercises!
#### Q1. 
Read in a set of logs from an ASCII file.

Read in the logs found in the file `short_logs.crv`.
The logs are arranged as follows:

    DEPTH    S-SONIC    P-SONIC ...
    8922.0   171.7472   86.5657
    8922.5   171.7398   86.5638
    8923.0   171.7325   86.5619
    8923.5   171.7287   86.5600
    ...

So the first line is a list of log names for each column of numbers.
The columns are the log values for the given log.

Make a dictionary with keys as the log names and values as the
log data:

    logs['DEPTH']
    [8922.0, 8922.5, 8923.0, ...]
    logs['S-SONIC']
    [171.7472, 171.7398, 171.7325, ...]
    

In [109]:
logs = {}

In [None]:
log_file = open('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/short_logs.crv')

# The first line is a header that has all the log names:
log_names = log_file.readline().split()
for name in log_names:
    logs[name] = []

# Every remaining line holds one value per log
for line in log_file:
    values = [float(val) for val in line.split()]
    for i, name in enumerate(log_names):
        logs[name].append(values[i])

log_file.close()

print('DEPTH %s' % logs['DEPTH'][:10])
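
The same column-building step can also be written with `zip()`, which transposes rows into columns. This is a sketch on a few made-up lines in the same layout as `short_logs.crv`, not on the real file:

```python
# Made-up sample in the same layout as short_logs.crv
sample_lines = [
    "DEPTH    S-SONIC    P-SONIC",
    "8922.0   171.7472   86.5657",
    "8922.5   171.7398   86.5638",
]

log_names = sample_lines[0].split()
rows = [[float(val) for val in line.split()] for line in sample_lines[1:]]

# zip(*rows) transposes the rows into columns, one column per log name
logs = {name: list(column) for name, column in zip(log_names, zip(*rows))}

print(logs["DEPTH"])  # [8922.0, 8922.5]
```

This variant holds all rows in memory before transposing, so the line-by-line loop remains the better fit for very large files.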

#### Q2.
The files aapl\*, msft\* contain trading data.  The data in each
file is arranged in comma-separated format-

    Date,Open,High,Low,Close,Volume,Adj Close
    2015-10-21,114.00,115.580002,113.699997,113.760002,41795200,113.760002
    2015-10-20,111.339996,114.169998,110.82,113.769997,48778800,113.769997
    2015-10-19,110.800003,111.75,110.110001,111.730003,29606100,111.730003
    2015-10-16,111.779999,112.00,110.529999,111.040001,38236300,111.040001
    2015-10-15,110.93,112.099998,110.489998,111.860001,37341000,111.860001

In this exercise you will write two functions: one that reads
in data from files of this format, and one which writes data
out to files of this format.
##### a)
You should be able to provide the name of a company, such as `aapl`, along with the directory where the files might be located, and let the read function discover all the files. 
The data read in should be stored in a `dict` object and keyed on the date.
##### b) 
Append all `dict` objects together.
##### c) 
Write a function that accepts the name of a company, such as `aapl`, and a `dict` object with all the financial data concatenated. 
This function should write out all the data to a single file with a name such as 
`aapl-mmmyy-mmmyy.csv`

In [None]:
# THIS IS HOMEWORK for the msft and aapl files (there are 3 of each)


In [None]:
# start of financial data
# the 3 aapl files will be found, concatenated, and written to a single file

## DataFrames!
Our data is represented by a DataFrame. You can think of data frames as a giant spreadsheet which you can program. It's a collection of series (or columns) with a common set of commands that make managing data in Python super easy.

### 1. `pandas`
`pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In [111]:
# import this library called pandas which is good for data structures and data analysis
import pandas as pd

#### `read_csv()`
`read_csv()` reads a comma-separated file into a DataFrame; pandas provides a variety of other read functions for other file formats.

Let's read in the `best-sandwiches` dataset

In [113]:
# Read filetype using pandas function pd.read_csv()
best_sws_df = pd.read_csv('/Users/Gian/GitHub/programming-for-analytics-course-material 9.27.13 PM/data/best-sandwiches.csv')
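
`read_csv()` accepts a file-like object as well as a path, which is handy for trying things out without a file on disk. The sketch below uses a tiny made-up CSV held in a string and assumes `pandas` is installed:

```python
import io

import pandas as pd

# A made-up, in-memory CSV so the sketch does not depend on any file on disk
csv_text = u"rank,sandwich,price\n1,BLT,10.0\n2,Fried Bologna,9.0\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 3)
print(list(df.columns))  # ['rank', 'sandwich', 'price']
```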

#### `head()`
The `head()` function allows us to peek at the first few rows of the DataFrame

In [115]:
# head() shows first 5 rows of values
best_sws_df.head()

Unnamed: 0,rank,sandwich,restaurant,description,price,address,city,phone,website,full_address,formatted_address,lat,lng
0,1,BLT,Old Oak Tap,The B is applewood smoked&mdash;nice and snapp...,$10,2109 W. Chicago Ave.,Chicago,773-772-0406,theoldoaktap.com,"2109 W. Chicago Ave., Chicago","2109 West Chicago Avenue, Chicago, IL 60622, USA",41.895734,-87.67996
1,2,Fried Bologna,Au Cheval,Thought your bologna-eating days had retired w...,$9,800 W. Randolph St.,Chicago,312-929-4580,aucheval.tumblr.com,"800 W. Randolph St., Chicago","800 West Randolph Street, Chicago, IL 60607, USA",41.884672,-87.647754
2,3,Woodland Mushroom,Xoco,Leave it to Rick Bayless and crew to come up w...,$9.50.,445 N. Clark St.,Chicago,312-334-3688,rickbayless.com,"445 N. Clark St., Chicago","445 North Clark Street, Chicago, IL 60654, USA",41.890602,-87.630925
3,4,Roast Beef,Al&rsquo;s Deli,"The Francophile brothers behind this deli, whi...",$9.40.,914 Noyes St.,Evanston,,alsdeli.net,"914 Noyes St., Evanston","914 Noyes Street, Evanston, IL 60201, USA",42.058442,-87.684425
4,5,PB&amp;L,Publican Qualty Meats,"When this place opened in February, it quickly...",$10,825 W. Fulton Mkt.,Chicago,312-445-8977,publicanqualitymeats.com,"825 W. Fulton Mkt., Chicago","825 West Fulton Market, Chicago, IL 60607, USA",41.886637,-87.648553


By default the `head()` function only displays the first few rows of data... to be precise, 5 rows. 

We can display `n` rows of data from the top-

In [116]:
# shows first 10 rows
best_sws_df.head(10)

Unnamed: 0,rank,sandwich,restaurant,description,price,address,city,phone,website,full_address,formatted_address,lat,lng
0,1,BLT,Old Oak Tap,The B is applewood smoked&mdash;nice and snapp...,$10,2109 W. Chicago Ave.,Chicago,773-772-0406,theoldoaktap.com,"2109 W. Chicago Ave., Chicago","2109 West Chicago Avenue, Chicago, IL 60622, USA",41.895734,-87.67996
1,2,Fried Bologna,Au Cheval,Thought your bologna-eating days had retired w...,$9,800 W. Randolph St.,Chicago,312-929-4580,aucheval.tumblr.com,"800 W. Randolph St., Chicago","800 West Randolph Street, Chicago, IL 60607, USA",41.884672,-87.647754
2,3,Woodland Mushroom,Xoco,Leave it to Rick Bayless and crew to come up w...,$9.50.,445 N. Clark St.,Chicago,312-334-3688,rickbayless.com,"445 N. Clark St., Chicago","445 North Clark Street, Chicago, IL 60654, USA",41.890602,-87.630925
3,4,Roast Beef,Al&rsquo;s Deli,"The Francophile brothers behind this deli, whi...",$9.40.,914 Noyes St.,Evanston,,alsdeli.net,"914 Noyes St., Evanston","914 Noyes Street, Evanston, IL 60201, USA",42.058442,-87.684425
4,5,PB&amp;L,Publican Qualty Meats,"When this place opened in February, it quickly...",$10,825 W. Fulton Mkt.,Chicago,312-445-8977,publicanqualitymeats.com,"825 W. Fulton Mkt., Chicago","825 West Fulton Market, Chicago, IL 60607, USA",41.886637,-87.648553
5,6,Belgian Chicken Curry Salad,Hendrickx Belgian Bread Crafter,The mom-and-pop aesthetic is a yeast-scented b...,$7.25.,100 E. Walton St.,Chicago,312-649-6717,,"100 E. Walton St., Chicago","100 East Walton Street, Chicago, IL 60611, USA",41.900246,-87.625163
6,7,Lobster Roll,Acadia,In a town that recently discovered the joys of...,$16,1639 S. Wabash Ave.,Chicago,312-360-9500,acadiachicago.com,"1639 S. Wabash Ave., Chicago","1639 South Wabash Avenue, Chicago, IL 60616, USA",41.858965,-87.625142
7,8,Smoked Salmon Salad,Birchwood Kitchen,Birchwood&rsquo;s sandwich-slinging virtuosos ...,$10,2211 W. North Ave.,Chicago,773-276-2100,birchwoodkitchen.com,"2211 W. North Ave., Chicago","2211 West North Avenue, Chicago, IL 60647, USA",41.910324,-87.682842
8,9,Atomica Cemitas,Cemitas Puebla,"Standing three inches high, the Atomica is som...",$9,3619 W. North Ave.,Chicago,773-772-8435,cemitaspuebla.com,"3619 W. North Ave., Chicago","3619 West North Avenue, Chicago, IL 60647, USA",41.90985,-87.717581
9,10,Grilled Laughing Bird Shrimp and Fried Oyster ...,Nana,Grilled Laughing Bird shrimp and fried oyster ...,$17,3267 S. Halsted St.,Chicago,312-929-2486,nanaorganic.com,"3267 S. Halsted St., Chicago","3267 South Halsted Street, Chicago, IL 60608, USA",41.834559,-87.646049


#### `tail()`
The `tail()` function allows us to peek at the last few rows of the DataFrame

In [117]:
# shows last 5 rows
best_sws_df.tail()

Unnamed: 0,rank,sandwich,restaurant,description,price,address,city,phone,website,full_address,formatted_address,lat,lng
56,46,Kufta,Chickpea,"Nestled within the freshest, chewiest pita poc...",$8,2018 W. Chicago Ave.,Chicago,773-384-9930,chickpeaonthego.com,"2018 W. Chicago Ave., Chicago","2018 West Chicago Avenue, Chicago, IL 60622, USA",41.896178,-87.677832
57,47,Debbie&rsquo;s Egg Salad,The Goddess and Grocer,Nothing quite satisfies the comfortfood hunger...,$6.50.,25 E. Delaware Pl.,Chicago,312-896-2600,goddessandgrocer.com,"25 E. Delaware Pl., Chicago","25 East Delaware Place, Chicago, IL 60611, USA",41.89899,-87.62729
58,48,Beef Curry,Zenwich,Proof positive that sandwiches have surpassed ...,$7.50.,416 N. York St.,Elmhurst,,eatmyzenwich.com,"416 N. York St., Elmhurst","416 North York Street, Elmhurst, IL 60126, USA",41.910661,-87.939928
59,49,Le V&eacute;g&eacute;tarien,Toni Patisserie,Toni Cox spreads rich white bean hummus on a b...,$8.75.,65 E. Washington St.,Chicago,312-726-2020,tonipatisserie.com,"65 E. Washington St., Chicago","65 East Washington Street, Chicago, IL 60602, USA",41.883212,-87.625406
60,50,The Gatsby,Phoebe&rsquo;s Bakery,The best thing about Phoebe&rsquo;s panini is ...,$6.85.,3351 N. Broadway,Chicago,773-868-4000,phoebesbakery.com,"3351 N. Broadway, Chicago","3351 North Broadway, Chicago, IL 60657, USA",41.942739,-87.644342


By default the `tail()` function only displays the last few rows of data... to be precise, 5 rows.

We can display `n` rows of data from the bottom-

In [118]:
# shows last 10 rows
best_sws_df.tail(10)

Unnamed: 0,rank,sandwich,restaurant,description,price,address,city,phone,website,full_address,formatted_address,lat,lng
51,41,The Marty,Z&amp;H MarketCafe,The generous application of muhammara&mdash;a ...,$7.25.,1323 E. 57th St.,Chicago,773-538-7372,zhmarketcafe.com,"1323 E. 57th St., Chicago","1323 East 57th Street, Chicago, IL 60637, USA",41.791193,-87.59382
52,42,Whitefish,Market House on the Square,Could a sandwich sound more unsexy? Hardly. Bu...,$11,655 Forest Ave.,Lake Forest,,themarkethouse.com,"655 Forest Ave., Lake Forest","655 Forest Avenue, Lake Forest, IL 60045, USA",42.251828,-87.841322
53,43,"Oat Bread, Pecan Butter, and Fruit Jam",Elaine&rsquo;s Coffee Call,The chalkboard menu is ambiguous about whether...,$6,1816 N. Clark St.,Chicago,,jdvhotels.com/hotels/chicago/lincoln,"1816 N. Clark St., Chicago","1816 North Clark Street, Chicago, IL 60614, USA",41.915304,-87.634502
54,44,Cauliflower Melt,Marion Street Cheese Market,Who says veggie sandwiches are healthy? This b...,$9,100 S. Marion St.,Oak Park,,marionstreetcheesemarket.com,"100 S. Marion St., Oak Park","100 South Marion Street, Oak Park, IL 60302, USA",41.886737,-87.802515
55,45,Cubano,Cafecito,"The success of this sandwich, a Cuban traditio...",$5.49.,26 E. Congress Pkwy.,Chicago,312-922-2233,cafecitochicago.com,"26 E. Congress Pkwy., Chicago","26 East Congress Parkway, Chicago, IL 60605, USA",41.875804,-87.626366
56,46,Kufta,Chickpea,"Nestled within the freshest, chewiest pita poc...",$8,2018 W. Chicago Ave.,Chicago,773-384-9930,chickpeaonthego.com,"2018 W. Chicago Ave., Chicago","2018 West Chicago Avenue, Chicago, IL 60622, USA",41.896178,-87.677832
57,47,Debbie&rsquo;s Egg Salad,The Goddess and Grocer,Nothing quite satisfies the comfortfood hunger...,$6.50.,25 E. Delaware Pl.,Chicago,312-896-2600,goddessandgrocer.com,"25 E. Delaware Pl., Chicago","25 East Delaware Place, Chicago, IL 60611, USA",41.89899,-87.62729
58,48,Beef Curry,Zenwich,Proof positive that sandwiches have surpassed ...,$7.50.,416 N. York St.,Elmhurst,,eatmyzenwich.com,"416 N. York St., Elmhurst","416 North York Street, Elmhurst, IL 60126, USA",41.910661,-87.939928
59,49,Le V&eacute;g&eacute;tarien,Toni Patisserie,Toni Cox spreads rich white bean hummus on a b...,$8.75.,65 E. Washington St.,Chicago,312-726-2020,tonipatisserie.com,"65 E. Washington St., Chicago","65 East Washington Street, Chicago, IL 60602, USA",41.883212,-87.625406
60,50,The Gatsby,Phoebe&rsquo;s Bakery,The best thing about Phoebe&rsquo;s panini is ...,$6.85.,3351 N. Broadway,Chicago,773-868-4000,phoebesbakery.com,"3351 N. Broadway, Chicago","3351 North Broadway, Chicago, IL 60657, USA",41.942739,-87.644342


#### `describe()`
The `describe()` function presents descriptive statistics for the columns of the DataFrame. By default it covers only the numeric columns.

In [119]:
# Summary statistics for the numeric columns only; object (text) columns are skipped
# price is stored as text (object), so it would have to be converted to a numeric type first
# lat and lng are geographic coordinates
best_sws_df.describe()

Unnamed: 0,rank,lat,lng
count,61.0,61.0,61.0
mean,25.278689,41.906938,-87.682965
std,13.689572,0.083674,0.090901
min,1.0,41.601541,-88.125653
25%,13.0,41.884672,-87.684425
50%,25.0,41.89899,-87.648218
75%,36.0,41.930101,-87.634157
max,50.0,42.251828,-87.59382


However, unlike the `summary()` function in `R`, `describe()` does not automatically present frequency information for categorical data.
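One way to get frequency-style information for text columns is through the `include` and `exclude` arguments of `describe()`. The sketch below uses a small made-up DataFrame (`toy_df` and its values are assumptions for illustration, not the sandwich data) to show the numeric-only default next to the two alternatives.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror best_sws_df
toy_df = pd.DataFrame({
    "rank": [1, 2, 3, 4],
    "sandwich": ["BLT", "Cubano", "BLT", "Kufta"],
    "city": ["Chicago", "Chicago", "Chicago", "Oak Park"],
})

# Default behaviour: numeric columns only
numeric_summary = toy_df.describe()

# Leave out the numeric columns: count, unique, top and freq for the text columns
text_summary = toy_df.describe(exclude=["number"])

# Every column at once; statistics that do not apply to a column appear as NaN
full_summary = toy_df.describe(include="all")

print(text_summary)
```

For a text column, `top` is the most frequent value and `freq` is how often it occurs, which is the closest `describe()` gets to a frequency table.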

#### Reading a specific column
You do not need to display or use the entire DataFrame for every calculation or output; you can also select a single column.

In [120]:
# read a specific column (in this case, sandwich is the name of a column)
# read_csv() loaded the data into memory, so nothing done here changes the file on disk
best_sws_df.sandwich

0                                                   BLT
1                                         Fried Bologna
2                                     Woodland Mushroom
3                                            Roast Beef
4                                              PB&amp;L
5                           Belgian Chicken Curry Salad
6                                          Lobster Roll
7                                   Smoked Salmon Salad
8                                       Atomica Cemitas
9     Grilled Laughing Bird Shrimp and Fried Oyster ...
10                              Ham and Raclette Panino
11                                        Breaded Steak
12                                        Breaded Steak
13                                        Breaded Steak
14                                        Breaded Steak
15                                          The Hawkeye
16                                          Chicken Dip
17                                 Wild Boar Slo

We can also call the functions `head()`, `tail()`, and `describe()` on a single column-

In [121]:
# Show first 5 values of sandwich column
best_sws_df.sandwich.head()

0                  BLT
1        Fried Bologna
2    Woodland Mushroom
3           Roast Beef
4             PB&amp;L
Name: sandwich, dtype: object

In [122]:
# Show us last 5 values of sandwich column
best_sws_df.sandwich.tail()

56                          Kufta
57       Debbie&rsquo;s Egg Salad
58                     Beef Curry
59    Le V&eacute;g&eacute;tarien
60                     The Gatsby
Name: sandwich, dtype: object

In [123]:
# Provide summary statistics of sandwich column like unique values, most repeated value, number of times it is repeated
# Also shows data type of column and total rows
best_sws_df.sandwich.describe()

count                                  61
unique                                 50
top       Serrano ham and Manchego cheese
freq                                    4
Name: sandwich, dtype: object

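#### Note
Dot notation (`best_sws_df.sandwich`) is a shortcut for reading an existing column. It cannot be used to create a new column, and it only works when the column name is a valid Python identifier that does not clash with a DataFrame attribute or method.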

#### Creating a new column
To create a new column, or to overwrite an existing one, refer to the column using bracket notation-

In [124]:
best_sws_df["sandwich"].head()

0                  BLT
1        Fried Bologna
2    Woodland Mushroom
3           Roast Beef
4             PB&amp;L
Name: sandwich, dtype: object
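To see how bracket notation and dot notation differ when assigning, here is a minimal sketch on a made-up DataFrame. The names `toy_df`, `name_length` and `shouting` are hypothetical and exist only for this illustration.

```python
import warnings
import pandas as pd

# Hypothetical toy data standing in for best_sws_df
toy_df = pd.DataFrame({"sandwich": ["BLT", "Cubano"], "price": ["$10", "$5.49."]})

# For reading, dot notation and bracket notation return the same column
same = toy_df.sandwich.equals(toy_df["sandwich"])

# Bracket notation creates a new column
toy_df["name_length"] = toy_df["sandwich"].str.len()

# Dot notation with a NEW name only sets a plain Python attribute:
# pandas emits a UserWarning and no column is added
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy_df.shouting = toy_df["sandwich"].str.upper()

columns_after = list(toy_df.columns)
print(same, columns_after)
```

Only `name_length` shows up among the columns afterwards, which is why bracket notation is the safer habit whenever a column is being assigned.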

We should probably normalize the names, by converting all the names to upper case.

In [128]:
# Assigning to best_sws_df["sandwich"] overwrites the column in the in-memory DataFrame (the file on disk is unchanged)
# .apply() applies a function to every value in the column
# lambda x: x.upper() takes a value x from the column and returns it in upper case
# The same pattern can strip the $ sign from price; assign the result to a new numeric (float) column

best_sws_df["sandwich"] = best_sws_df.sandwich.apply(lambda x: x.upper())
best_sws_df.sandwich.head()

0                  BLT
1        FRIED BOLOGNA
2    WOODLAND MUSHROOM
3           ROAST BEEF
4             PB&AMP;L
Name: sandwich, dtype: object
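The `PB&AMP;L` in the output shows that the names contain HTML entities, and upper-casing them mangles the entity. A possible fix, sketched below on a few sample names copied from the output above, is to decode the entities with the standard library's `html.unescape()` before upper-casing. The variable names are hypothetical, and `.str.upper()` is used here as a vectorised alternative to `.apply(lambda x: x.upper())`.

```python
import html
import pandas as pd

# A few names as they appear in the raw data (hypothetical sample Series)
names = pd.Series(["PB&amp;L", "Debbie&rsquo;s Egg Salad", "Le V&eacute;g&eacute;tarien"])

# Decode the HTML entities first, then upper-case with the string accessor
clean_names = names.apply(html.unescape).str.upper()

print(clean_names.tolist())
```

On the real data this would presumably be assigned back with `best_sws_df["sandwich"] = ...`, in the same way as the cell above.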

#### `unique()`
The `unique()` function returns the unique values in a given column.

In [129]:
# This is a great way to see how many unique categorical values you have. Not so good for prices
list(best_sws_df.price.unique())

['$10',
 '$9',
 '$9.50.',
 '$9.40.',
 '$7.25.',
 '$16',
 '$17',
 '$11',
 '$5.49.',
 '$14',
 '$13',
 '$4.50.',
 '$11.95.',
 '$11.50.',
 '$6.25.',
 '$15',
 '$5',
 '$6',
 '$8',
 '$5.99.',
 '$7.52.',
 '$7.50.',
 '$12.95.',
 '$7',
 '$21',
 '$9.79.',
 '$9.75.',
 '$7.95.',
 '$6.50.',
 '$8.75.',
 '$6.85.']

The number of unique values in the price column can be calculated using the `len()` function

In [130]:
# len() counts the entries in the list of unique values
len(list(best_sws_df.price.unique()))

31

#### or

In [131]:
# Same result, but cleaner and quicker: nunique() returns the number of unique values directly

best_sws_df.price.nunique()

31
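The two approaches agree here because the price column has no missing values. When a column does contain missing values they can differ, since `nunique()` drops `NaN` by default while `unique()` keeps it. A minimal sketch on a made-up Series (the data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value
prices = pd.Series(["$10", "$9", "$10", np.nan])

n_with_nan = len(prices.unique())          # NaN is counted as a value here
n_default = prices.nunique()               # NaN is dropped by default
n_keep_nan = prices.nunique(dropna=False)  # ask nunique() to count NaN as well

print(n_with_nan, n_default, n_keep_nan)
```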

#### `value_counts()`
The `value_counts()` function provides the frequency of each unique value in a column-

In [132]:
# Counts how many times each unique value appears
best_sws_df.price.value_counts()

$5.49.     5
$9         5
$10        4
$9.79.     4
$8         4
$6         4
$7.50.     3
$7.52.     3
$7         3
$7.25.     2
$11.95.    2
$11        2
$13        2
$17        1
$11.50.    1
$6.50.     1
$15        1
$9.40.     1
$8.75.     1
$14        1
$9.50.     1
$12.95.    1
$7.95.     1
$16        1
$4.50.     1
$5.99.     1
$21        1
$5         1
$9.75.     1
$6.25.     1
$6.85.     1
Name: price, dtype: int64
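The prices above are stored as text with a leading `$` and, in many cases, a trailing `.`, so they cannot be averaged or sorted numerically as they stand. One possible conversion is sketched below on a few sample values taken from the output. The variable names are hypothetical, and this is only one of several ways to clean the strings.

```python
import pandas as pd

# Sample price strings in the same format as the price column
prices = pd.Series(["$10", "$9.50.", "$5.49.", "$10"])

# Strip any leading or trailing "$" and "." characters, then convert to float.
# The decimal point inside the number is untouched because strip() only
# removes characters from the two ends of each string.
numeric_prices = prices.str.strip("$.").astype(float)

print(numeric_prices.tolist())
print(numeric_prices.describe())
```

Once the values are floats, `describe()` reports the mean, standard deviation and quartiles just as it does for `rank`, `lat` and `lng`.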

### 2. Exercises!
#### Q1
Read in the financial data file you have created as a dataframe (concatenation of all the files for a company).
#### Q2
Provide descriptive statistics for all columns of this data.

In [None]:
####################### THIS IS HOMEWORK from the 3 files that start with msft and the 3 files that start with appl
# 
# Try using %%timeit

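# %matplotlib inline replaces the discouraged %pylab inline and should run before plotting
%matplotlib inline
import matplotlib.pyplot as plt

# bar chart of how often each price appears
best_sws_df.price.value_counts().plot(kind="bar")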
