## Python for Data Science, the basics

The objective of this section is to introduce some basic concepts in Python that are especially relevant to data science for public policy.  Many of these examples have been assembled from existing tutorials.  

### Functions

Python functions are not all that different from `R` functions, although the syntax to declare one differs slightly.  Python uses the `def` keyword to start a function.  Here is the syntax:

```python
def function_name(arg1, arg2, arg3, ..., argN):
    # statements inside the function
```

**Note**: All the statements inside the function must be indented by the same number of spaces. A function can accept zero or more arguments (also known as parameters), enclosed in parentheses. A function body cannot be empty, but you can use the `pass` keyword as a placeholder that does nothing, like this:

```python
def myfunc():
    pass
```

Here is an example that calculates the sum of all integers from `start` to `end`, inclusive.  The function updates `result` on each pass through the loop.  Note also that Python, unlike `R`, is zero-indexed, and that `range` does not include its end argument (hence the `end + 1` in `range`).

In [1]:
def mysum(start, end):
    result = 0
    for i in range(start, end + 1):
        result += i

    # one way to print the result, note the comment syntax
    print("")
    print("The sum is:")
    print(result)
    
    # another way to print the same result
    print("\nThe sum is: %s" % result)

mysum(10, 50)


The sum is:
1230

The sum is: 1230
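
To see why `end + 1` is needed in the function above, it can help to look at what `range` actually produces. The following is a minimal sketch; the small ranges are chosen only for illustration, and the last lines cross-check the loop result with the built-in `sum`.

```python
# range(start, stop) yields start, start + 1, ..., stop - 1; stop itself is excluded.
print(list(range(3, 6)))      # [3, 4, 5]
print(list(range(3, 6 + 1)))  # [3, 4, 5, 6]

# Cross-check of the loop result using the built-in sum() over the same range.
total = sum(range(10, 50 + 1))
print(total)  # 1230
```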


The above function simply prints the result to the console. What if we want to assign the result to a variable for further processing? Then we need to use the `return` statement, which sends a result back to the caller and exits the function.

In [2]:
def mysum(start, end):
    result = 0
    for i in range(start, end + 1):
        result += i
    return result

s = mysum(10, 50)
print(s)

1230


In [3]:
def mysum(start, end):
    if start > end:
        print("start should be less than end")
        return    # here we are not returning any value so a special value None is returned
    
    result = 0
    for i in range(start, end + 1):
        result += i
    return result

s = mysum(110, 50)
print(s)

start should be less than end
None
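
Because a bare `return` hands back `None`, the caller usually has to check for it before using the result. The sketch below shows one way to do that; it uses a condensed version of `mysum` (built on `sum` rather than a loop) purely to keep the example short.

```python
def mysum(start, end):
    # Condensed version of the function above: None signals out-of-order arguments.
    if start > end:
        print("start should be less than end")
        return None
    return sum(range(start, end + 1))

s = mysum(110, 50)

# Compare against None with "is", not "==".
if s is None:
    print("No sum was computed")
else:
    print(s * 2)
```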


You can also raise exceptions to stop the process, or catch them and do specific things based on the type of exception.

In [4]:
def mysum(start, end):
    if start > end:
        raise Exception("What sort of backwards world do you think we live in?")
    
    result = 0
    for i in range(start, end + 1):
        result += i
    return result

s = mysum(110, 50)
print(s)

Exception: What sort of backwards world do you think we live in?
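
Doing specific things based on the type of exception is handled with `try` and `except`. Here is a minimal sketch; `safe_divide` is a hypothetical helper invented for this illustration, and the two exception types it catches are built into Python.

```python
def safe_divide(numerator, denominator):
    """Hypothetical helper: divide two values, reacting differently per exception type."""
    try:
        return float(numerator) / float(denominator)
    except ZeroDivisionError:
        # Raised by the division when the denominator is zero.
        print("Cannot divide by zero")
        return None
    except ValueError as err:
        # Raised by float() when the input is not a number.
        print("Could not convert input to a number: %s" % err)
        return None

print(safe_divide(10, 4))     # 2.5
print(safe_divide(10, 0))     # prints the zero message, then None
print(safe_divide("ten", 4))  # prints the conversion message, then None
```

Each `except` clause only runs for its own exception type, so any other error would still stop the program with a traceback.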

**PRACTICE 1**: Write a function to calculate $x^y$ iteratively.  That is, you supply the arguments `x` and `y`, and the function multiplies a running result by `x` a total of `y` times.  Bonus if you set the initial values appropriately and use the `ValueError` exception for negative `y` values.

In [None]:
# Insert answer here

**PRACTICE 2**: Write a function to count the number of words in a supplied sentence.  Use the [`split()`](http://stackoverflow.com/questions/4071396/split-by-comma-and-strip-whitespace-in-python) method.

In [None]:
# Insert answer here

### Pandas

Python is used for many things, including web development, backend operations, and data science.  A few packages have been developed for Python that were specifically built to handle data types and operations that are common in data science.  The most notable is [`Pandas`](http://pandas.pydata.org/).  There are three basic data types in pandas:

1. **Series**: A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
2. **DataFrame**: A 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 
3. **Panel**: A less-used container for 3-dimensional data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to operations involving panel data and, in particular, econometric analysis of panel data. Note that `Panel` was deprecated in pandas 0.20 and removed in pandas 0.25, so it is only available in older installations; a DataFrame with a MultiIndex is the usual replacement.

Note that each data structure is a container for the one below it: a DataFrame holds Series, and a Panel holds DataFrames.  Here we start to explore the basics of interacting with pandas.  First, import the package.  You may have to install it, unless you used Anaconda to install Jupyter and Python.

In [7]:
import pandas
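
The description of a DataFrame as a dict of Series can be made concrete with a small sketch. The names `ints`, `words`, and `frame`, and the values in them, are made up for illustration.

```python
import pandas

# Two Series that share the same index labels.
ints = pandas.Series([1, 2, 3], index=['a', 'b', 'c'])
words = pandas.Series(['x', 'y', 'z'], index=['a', 'b', 'c'])

# A DataFrame can be built from a dict of Series; each key becomes a column name.
frame = pandas.DataFrame({'ints': ints, 'words': words})
print(frame)

# Selecting one column returns a Series again.
print(type(frame['words']))
```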

Create a Series by passing a list of values, letting pandas create a default integer index:

In [8]:
s = pandas.Series([1, 3, 5, None, 6, 8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
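
The `None` in the list was converted to `NaN`, which is a floating point value, and that is why the integers were upcast to `float64`. A short sketch of how missing values in this Series might be detected and handled:

```python
import pandas

s = pandas.Series([1, 3, 5, None, 6, 8])

print(s.isna())        # True only at position 3
print(s.isna().sum())  # 1
print(s.mean())        # the NaN is skipped by default
print(s.dropna())      # the five remaining values
```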


Create a DataFrame by passing a numpy array, with a datetime index and labeled columns.  You will have to import `numpy` (short for Numerical Python) to generate random numbers.

In [9]:
import numpy

dates = pandas.date_range('20130101', periods=6)
columns = list('ABCD')
randos = numpy.random.randn(6,4)
print(columns)
randos.shape

['A', 'B', 'C', 'D']


(6, 4)

In [10]:
df = pandas.DataFrame(randos, index=dates, columns=columns)
print(df.dtypes)
print(df.shape)
df

A    float64
B    float64
C    float64
D    float64
dtype: object
(6, 4)


| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -1.125246 | -0.381719 | 1.813668 | -0.911999 |
| 2013-01-02 | -0.995033 | -0.750741 | -2.201204 | 1.371349 |
| 2013-01-03 | -0.378763 | 0.279895 | -0.093675 | 0.25506 |
| 2013-01-04 | -0.928351 | -0.755839 | -0.532296 | 0.844673 |
| 2013-01-05 | -1.939081 | 1.987731 | 1.275682 | -1.469973 |
| 2013-01-06 | 0.643405 | 0.530623 | -1.363361 | -0.537567 |
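
Because the values are drawn at random, running the cell yourself will give different numbers from the ones printed here. If reproducible output matters, one option is a seeded generator. This is a sketch only: the seed `0` is an arbitrary choice, `df_seeded` is a name introduced here to avoid overwriting `df`, and `numpy.random.default_rng` requires a reasonably recent numpy.

```python
import numpy
import pandas

# A generator with a fixed seed gives the same draws on every run.
rng = numpy.random.default_rng(0)
randos = rng.standard_normal((6, 4))

dates = pandas.date_range('20130101', periods=6)
df_seeded = pandas.DataFrame(randos, index=dates, columns=list('ABCD'))

# A second generator with the same seed reproduces the array exactly.
same = numpy.random.default_rng(0).standard_normal((6, 4))
print(numpy.array_equal(randos, same))  # True
```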


Much like `R` you can view useful slices of the DataFrame using built-in functions.  You can also describe or summarize the data, transpose the data frame, sort values, and split the object into its components -- all using built-in functions.

In [11]:
df.head(3)

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -1.125246 | -0.381719 | 1.813668 | -0.911999 |
| 2013-01-02 | -0.995033 | -0.750741 | -2.201204 | 1.371349 |
| 2013-01-03 | -0.378763 | 0.279895 | -0.093675 | 0.25506 |


In [12]:
df.tail(2)

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-05 | -1.939081 | 1.987731 | 1.275682 | -1.469973 |
| 2013-01-06 | 0.643405 | 0.530623 | -1.363361 | -0.537567 |


In [13]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.787178 | 0.151658 | -0.183531 | -0.074743 |
| std | 0.862285 | 1.044374 | 1.530413 | 1.086936 |
| min | -1.939081 | -0.755839 | -2.201204 | -1.469973 |
| 25% | -1.092693 | -0.658485 | -1.155594 | -0.818391 |
| 50% | -0.961692 | -0.050912 | -0.312985 | -0.141254 |
| 75% | -0.51616 | 0.467941 | 0.933343 | 0.69727 |
| max | 0.643405 | 1.987731 | 1.813668 | 1.371349 |


In [14]:
df.values

array([[-1.12524613, -0.38171931,  1.81366801, -0.91199866],
       [-0.99503303, -0.75074055, -2.20120413,  1.37134923],
       [-0.37876336,  0.27989473, -0.09367454,  0.25505951],
       [-0.92835061, -0.75583889, -0.53229585,  0.84467316],
       [-1.93908051,  1.9877306 ,  1.27568153, -1.46997254],
       [ 0.64340537,  0.5306228 , -1.36336056, -0.53756677]])

In [15]:
df.sort_index(axis=0, ascending=False)

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-06 | 0.643405 | 0.530623 | -1.363361 | -0.537567 |
| 2013-01-05 | -1.939081 | 1.987731 | 1.275682 | -1.469973 |
| 2013-01-04 | -0.928351 | -0.755839 | -0.532296 | 0.844673 |
| 2013-01-03 | -0.378763 | 0.279895 | -0.093675 | 0.25506 |
| 2013-01-02 | -0.995033 | -0.750741 | -2.201204 | 1.371349 |
| 2013-01-01 | -1.125246 | -0.381719 | 1.813668 | -0.911999 |


In [16]:
df.sort_values(by='B')

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-04 | -0.928351 | -0.755839 | -0.532296 | 0.844673 |
| 2013-01-02 | -0.995033 | -0.750741 | -2.201204 | 1.371349 |
| 2013-01-01 | -1.125246 | -0.381719 | 1.813668 | -0.911999 |
| 2013-01-03 | -0.378763 | 0.279895 | -0.093675 | 0.25506 |
| 2013-01-06 | 0.643405 | 0.530623 | -1.363361 | -0.537567 |
| 2013-01-05 | -1.939081 | 1.987731 | 1.275682 | -1.469973 |
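
Transposing and splitting a DataFrame into its components are mentioned above but only `df.values` is shown. A minimal sketch of the rest, using a small frame named `small` with made-up values so the shapes are easy to follow:

```python
import pandas

small = pandas.DataFrame(
    {'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]},
    index=pandas.date_range('20130101', periods=3),
)

# Transpose: rows become columns and columns become rows.
transposed = small.T
print(small.shape, transposed.shape)  # (3, 2) (2, 3)

# The three components of a DataFrame.
print(small.index)    # row labels
print(small.columns)  # column labels
print(small.values)   # the underlying numpy array
```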


Another important operation in pandas is selecting specific elements of a DataFrame by label, by position, or by value.


In [17]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,-0.995033,-0.750741,-2.201204,1.371349
2013-01-03,-0.378763,0.279895,-0.093675,0.25506
2013-01-04,-0.928351,-0.755839,-0.532296,0.844673


In [18]:
df['20130102':'20130104']['A']

2013-01-02   -0.995033
2013-01-03   -0.378763
2013-01-04   -0.928351
Freq: D, Name: A, dtype: float64
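
The chained selection above works for reading data. pandas also provides the `.loc` (label-based) and `.iloc` (position-based) indexers, which select rows and columns in one step. A sketch on a small frame with made-up values:

```python
import pandas

small = pandas.DataFrame(
    {'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]},
    index=pandas.date_range('20130101', periods=4),
)

# Label-based: rows 2013-01-02 through 2013-01-03 (both ends included), column 'A'.
by_label = small.loc['20130102':'20130103', 'A']

# Position-based: rows 1 and 2 (the end position 3 is excluded), first column.
by_position = small.iloc[1:3, 0]

print(by_label)
print(by_position)
```

Note the difference in how the end of the slice is treated: label slices include it, while position slices follow the usual Python rule and exclude it.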

In [19]:
df[df.B > 0]

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-03 | -0.378763 | 0.279895 | -0.093675 | 0.25506 |
| 2013-01-05 | -1.939081 | 1.987731 | 1.275682 | -1.469973 |
| 2013-01-06 | 0.643405 | 0.530623 | -1.363361 | -0.537567 |


In [20]:
df > 0

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | False | False | True | False |
| 2013-01-02 | False | False | False | True |
| 2013-01-03 | False | True | False | True |
| 2013-01-04 | False | False | False | True |
| 2013-01-05 | False | True | True | False |
| 2013-01-06 | True | True | False | False |


In [21]:
df[df > 0]

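| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | NaN | NaN | 1.813668 | NaN |
| 2013-01-02 | NaN | NaN | NaN | 1.371349 |
| 2013-01-03 | NaN | 0.279895 | NaN | 0.25506 |
| 2013-01-04 | NaN | NaN | NaN | 0.844673 |
| 2013-01-05 | NaN | 1.987731 | 1.275682 | NaN |
| 2013-01-06 | 0.643405 | 0.530623 | NaN | NaN |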
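
Boolean filters like the ones above can be combined. In pandas this is done with `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses. A sketch on a small frame with made-up values:

```python
import pandas

small = pandas.DataFrame({'A': [-1.0, 2.0, 3.0, -4.0],
                          'B': [1.0, -2.0, 3.0, -4.0]})

# Rows where both columns are positive.
both_positive = small[(small.A > 0) & (small.B > 0)]

# Rows where at least one column is positive.
either_positive = small[(small.A > 0) | (small.B > 0)]

# Rows where A is not positive.
a_not_positive = small[~(small.A > 0)]

print(both_positive)
print(either_positive)
print(a_not_positive)
```

The parentheses are required because `&` and `|` bind more tightly than comparison operators in Python.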


In [22]:
# Setting values
df.at[dates[0],'A'] = 0
df

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.0 | -0.381719 | 1.813668 | -0.911999 |
| 2013-01-02 | -0.995033 | -0.750741 | -2.201204 | 1.371349 |
| 2013-01-03 | -0.378763 | 0.279895 | -0.093675 | 0.25506 |
| 2013-01-04 | -0.928351 | -0.755839 | -0.532296 | 0.844673 |
| 2013-01-05 | -1.939081 | 1.987731 | 1.275682 | -1.469973 |
| 2013-01-06 | 0.643405 | 0.530623 | -1.363361 | -0.537567 |
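
`.at` sets a single cell. To set many cells at once based on a condition, one common approach is `.loc` with a boolean mask. A sketch on a small frame with made-up values:

```python
import pandas

small = pandas.DataFrame({'A': [-1.0, 2.0, -3.0], 'B': [4.0, 5.0, 6.0]})

# Set a single cell by row label and column label.
small.at[0, 'B'] = 0.0

# Set every negative value in column A to zero in one step.
small.loc[small['A'] < 0, 'A'] = 0.0

print(small)
```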


A final class of utilities that is useful in this lecture is operations on DataFrames.

In [23]:
df.mean()

A   -0.599637
B    0.151658
C   -0.183531
D   -0.074743
dtype: float64

In [24]:
# Along the other axis (one mean per row)
df.mean(axis=1)

2013-01-01    0.129988
2013-01-02   -0.643907
2013-01-03    0.015629
2013-01-04   -0.342953
2013-01-05   -0.036410
2013-01-06   -0.181725
Freq: D, dtype: float64
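
The meaning of the axis argument is easier to see with round numbers. In this sketch (values made up), `axis=0` collapses the rows to give one result per column, and `axis=1` collapses the columns to give one result per row.

```python
import pandas

small = pandas.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

col_means = small.mean(axis=0)  # one value per column: A 2.0, B 20.0
row_means = small.mean(axis=1)  # one value per row: 5.5, 11.0, 16.5

print(col_means)
print(row_means)
```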

In [25]:
# Applying standard functions to the data
df.apply(numpy.cumsum)

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.0 | -0.381719 | 1.813668 | -0.911999 |
| 2013-01-02 | -0.995033 | -1.13246 | -0.387536 | 0.459351 |
| 2013-01-03 | -1.373796 | -0.852565 | -0.481211 | 0.71441 |
| 2013-01-04 | -2.302147 | -1.608404 | -1.013506 | 1.559083 |
| 2013-01-05 | -4.241228 | 0.379327 | 0.262175 | 0.089111 |
| 2013-01-06 | -3.597822 | 0.909949 | -1.101186 | -0.448456 |


In [26]:
# Applying custom, temporary functions to the data
df.apply(lambda x: x.max() - x.min())

A    2.582486
B    2.743569
C    4.014872
D    2.841322
dtype: float64
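
A `lambda` is just an unnamed function, so the same result can be obtained with a named function or, in this case, with plain column arithmetic. A sketch on a small frame with made-up values; `value_range` is a name introduced here for illustration.

```python
import pandas

def value_range(column):
    # Receives one column (a Series) at a time when used with apply.
    return column.max() - column.min()

small = pandas.DataFrame({'A': [1.0, 4.0, 2.0], 'B': [10.0, 0.0, 5.0]})

with_lambda = small.apply(lambda x: x.max() - x.min())
with_named = small.apply(value_range)
vectorized = small.max() - small.min()

print(with_lambda)  # A 3.0, B 10.0
```

All three give the same Series. A named function is easier to reuse and test, while the arithmetic form avoids `apply` altogether.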

### Doing my homework

I am tasked with analyzing the market for earth imagery from satellites.  We will analyze a cleaned version of a dataset that includes information on all active satellites currently in orbit.  The data is saved as a CSV in this repository.  Read the data.

In [27]:
data = pandas.read_csv('satdata.csv')
data[['country', 'users', 'purpose', 'date']].head()

| | country | users | purpose | date |
|---|---|---|---|---|
| 0 | Denmark | Civil | Earth Observation | 2016-04-25 |
| 1 | Multinational | Commercial | Communications | 2014-02-06 |
| 2 | Multinational | Commercial | Communications | 2016-06-15 |
| 3 | Multinational | Commercial | Communications | 1997-08-19 |
| 4 | Multinational | Commercial | Communications | 2015-03-02 |
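
By default `read_csv` reads dates as plain text. For time-based work it can help to parse them on load with the `parse_dates` argument. The sketch below does not read the real `satdata.csv`; it uses the first three rows of the preview above, typed in by hand, and assumes the real `date` column follows the same format.

```python
import io
import pandas

# Hand-typed sample mirroring the preview; not the full dataset.
csv_text = """country,users,purpose,date
Denmark,Civil,Earth Observation,2016-04-25
Multinational,Commercial,Communications,2014-02-06
Multinational,Commercial,Communications,2016-06-15
"""

sample = pandas.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(sample.dtypes)                     # the date column is now a datetime type
print(sample['purpose'].value_counts())  # number of rows per purpose
print(sample['date'].min())              # earliest date in the sample
```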


**PRACTICE 3**.  How many satellites are currently in orbit?

In [None]:
# Insert answer into here

**PRACTICE 4**.  What is the average, max, and minimum size of the satellites in orbit?

In [None]:
# Insert answer into here

**PRACTICE 5**.  How many earth observing or earth science satellites are currently in orbit?  How many of these satellites are for strictly non-military users? Save the subset as `sub`.

In [None]:
# Insert answer into here

**PRACTICE 6**.  Create a histogram of the detailed purpose within the class of earth observing, non-military satellites.

In [None]:
# Insert answer into here

**PRACTICE 7**.  Create a subset of earth observing, non-military satellites with the following attributes.  Create a time series graph of the cumulative number of these types of satellites in orbit.  Resample the data to graph by week, rather than by day.  

```python
acceptables = [
    'Radar Imaging/Earth Science',
    'Optical Imaging/Automatic Identification System (AIS)',
    'Optical Imaging/Meteorology',
    'Hyperspectral Imaging',
    'Optical Imaging/Infrared Imaging',
    'Multispectral Imaging',
    'Radar Imaging',
    'Optical Imaging (video)',
    'Optical Imaging',
    'Earth Science',
    'Meteorology/Earth Science',
    'Optical Imaging/Meterology'
]
```

In [41]:
# Insert answer into here

**PRACTICE 8**.  There is a jump in orbiting satellites around April 2016.  Why?  Who sent so many satellites into orbit?  How many?

In [52]:
# Insert answer into here