## Python for Data Science, the basics

The objective of this section is to introduce some basic concepts in Python that are especially relevant to data science for public policy.  Many of these examples have been assembled from existing tutorials.  

### Functions

Python functions are not all that different from `R` functions, although the syntax for declaring one differs slightly.  Python uses the `def` keyword to start a function.  Here is the syntax:

```python
def function_name(arg1, arg2, arg3, .... argN):
     #statement inside function
```

**Note**: All the statements inside the function must be indented by the same number of spaces. A function can accept zero or more arguments (also known as parameters), enclosed in parentheses. You can also leave the body of the function empty by using the `pass` keyword, like this:

```python
def myfunc():
    pass
```

Here is an example that calculates the sum of all integers between the supplied arguments, inclusive.  The function accumulates the total in `result`.  Note that `range` does not include its end argument, hence the `end + 1`.  Note also that Python, unlike `R`, is zero-indexed.

In [2]:
def mysum(start, end):
    result = 0
    for i in range(start, end + 1):
        result += i

    # one way to print the result, note the comment syntax
    print("")
    print("The sum is:")
    print(result)
    
    # another way to print the same result
    print("\nThe sum is: %s" % result)

mysum(10, 50)


The sum is:
1230

The sum is: 1230
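
The half-open behavior of `range` is easy to check directly. The short sketch below is not part of the original notebook; it only illustrates why `end + 1` is needed, and shows that the built-in `sum` gives the same total as the loop.

```python
# range(start, end) stops one short of `end`
print(list(range(10, 15)))      # the last value printed is 14, not 15

# Adding 1 to the end point makes the range inclusive
inclusive = list(range(10, 50 + 1))
print(inclusive[0], inclusive[-1], len(inclusive))

# The built-in sum() adds up the same values as the loop in mysum(10, 50)
total = sum(range(10, 50 + 1))
print(total)
```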


The above function simply prints the result to the console.  What if we want to assign the result to a variable for further processing?  Then we need to use the `return` statement.  The `return` statement sends a result back to the caller and exits the function.

In [3]:
def mysum(start, end):
    result = 0
    for i in range(start, end + 1):
        result += i
    return result

s = mysum(10, 50)
print(s)

1230


In [4]:
def mysum(start, end):
    if(start > end):
        print("start should be less than end")
        return    # here we are not returning any value so a special value None is returned
    
    result = 0
    for i in range(start, end + 1):
        result += i
    return result

s = mysum(110, 50)
print(s)

start should be less than end
None
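
Because a bare `return` hands back `None`, the caller usually needs to check for it before using the result. A minimal sketch of that check follows; the shortened `mysum` body (using `sum`) is a stand-in for the loop above, not the original code.

```python
def mysum(start, end):
    if start > end:
        print("start should be less than end")
        return  # a bare return hands back None
    return sum(range(start, end + 1))

s = mysum(110, 50)

# Compare against None with `is`, not `==`
if s is None:
    print("No sum was computed")
else:
    print("The sum is: %s" % s)
```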


You can also raise an `Exception`, either to stop the process or to let the calling code respond based on the type of exception.

In [7]:
def mysum(start, end):
    if(start > end):
        raise Exception("What sort of backwards world do you think we live in?")
    
    result = 0
    for i in range(start, end + 1):
        result += i
    return result

s = mysum(110, 50)
print(s)

Exception: What sort of backwards world do you think we live in?
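
Responding to a specific type of exception is done with `try` and `except`. The sketch below is an illustration rather than the notebook's own code: it swaps the generic `Exception` for the built-in `ValueError` and uses a shortened function body.

```python
def mysum(start, end):
    # ValueError is the built-in exception for arguments with bad values
    if start > end:
        raise ValueError("start should be less than or equal to end")
    return sum(range(start, end + 1))

try:
    s = mysum(110, 50)
except ValueError as err:
    # Runs only when a ValueError was raised inside the try block
    print("Could not compute the sum: %s" % err)
    s = None

print(s)
```

Catching `ValueError` rather than every `Exception` means that unrelated errors, such as a typo in a variable name, still stop the program.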

**PRACTICE 1**: Write a function to calculate $x^y$ iteratively.  That is, you supply the arguments `x` and `y`, and the function multiplies a running result by `x` a total of `y` times.  Bonus if you set the initial values appropriately and use the `ValueError` exception for negative `y` values.

In [None]:
# Insert answer here

**PRACTICE 2**: Write a function to count the number of words in a supplied sentence.  Use the [`split()`](http://stackoverflow.com/questions/4071396/split-by-comma-and-strip-whitespace-in-python) method.

In [None]:
# Insert answer here

### Pandas

Python is used for many things, including web development, backend operations, and data science.  A few packages have been developed for Python that were specifically built to handle data types and operations that are common in data science.  The most notable is [`pandas`](http://pandas.pydata.org/).  There are three basic data types in pandas:

1. **Series**: A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
2. **DataFrame**: A 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 
3. **Panel**: A somewhat less-used, but still important container for 3-dimensional data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric analysis of panel data. Note that `Panel` has since been deprecated and removed from pandas, so the examples here use only Series and DataFrame.
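
To make the "dict of Series" description concrete, here is a minimal sketch. The names `population`, `area`, and `df_example`, and all of the values, are made up for illustration.

```python
import pandas

# Two Series that share the same labels (hypothetical values)
population = pandas.Series([10, 20, 30], index=['a', 'b', 'c'])
area = pandas.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

# A DataFrame built from a dict of Series: each key becomes a column
df_example = pandas.DataFrame({'population': population, 'area': area})
print(df_example)

# Selecting one column hands back a Series
print(type(df_example['population']))
```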

Note that each data structure is built from the one before it: a DataFrame is a collection of Series, and a Panel is a collection of DataFrames.  Here we start to explore the basics of interacting with pandas.  First, import the package.  You may have to install it, unless you used Anaconda to install Jupyter and Python.

In [8]:
import pandas

Create a Series by passing a list of values, letting pandas create a default integer index:

In [9]:
s = pandas.Series([1, 3, 5, None, 6, 8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
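
The `None` in the list is stored as `NaN`, which is a floating-point marker. That is why the integers above are displayed as floats. The sketch below shows a few common ways of working with missing entries; it is an illustration, not an exhaustive list.

```python
import pandas

s = pandas.Series([1, 3, 5, None, 6, 8])

# None is stored as NaN, so the whole Series becomes float64
print(s.dtype)

# Flag and count the missing entries
print(s.isnull())
print(s.isnull().sum())

# Drop them, or fill them with a placeholder value
print(s.dropna())
print(s.fillna(0))

# Reductions such as mean() skip NaN by default
print(s.mean())
```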


Create a DataFrame by passing a numpy array, along with a datetime index and labeled columns.  You will have to import `numpy` (short for "numerical Python") to generate the random numbers.

In [10]:
import numpy

dates = pandas.date_range('20130101', periods=6)
columns = list('ABCD')
randos = numpy.random.randn(6,4)
print(columns)
randos.shape

['A', 'B', 'C', 'D']


(6, 4)

In [12]:
df = pandas.DataFrame(randos, index=dates, columns=columns)
print(df.dtypes)
print(df.shape)
df

A    float64
B    float64
C    float64
D    float64
dtype: object
(6, 4)


|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 1.343419 | 0.367837 | -0.207296 | 2.071716 |
| 2013-01-02 | 1.913703 | -0.7477 | -2.543656 | -1.347205 |
| 2013-01-03 | -0.73699 | -0.241838 | 0.642106 | 1.34643 |
| 2013-01-04 | 1.738009 | -0.971304 | 0.111547 | -0.30235 |
| 2013-01-05 | 2.152918 | 0.746074 | -0.556313 | -1.012012 |
| 2013-01-06 | -0.225186 | -1.473381 | 0.053223 | 2.11061 |
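
Because the values are random, rerunning the cell will produce different numbers from those shown here. If repeatable output is wanted, one option is to fix the random seed first. In this sketch the seed value `0` and the name `df_seeded` are arbitrary choices, not part of the original notebook.

```python
import numpy
import pandas

# Fixing the seed makes randn return the same numbers on every run
numpy.random.seed(0)   # the seed value 0 is an arbitrary choice
first = numpy.random.randn(6, 4)

numpy.random.seed(0)
second = numpy.random.randn(6, 4)

# Both draws are identical because they started from the same seed
print(numpy.array_equal(first, second))

dates = pandas.date_range('20130101', periods=6)
df_seeded = pandas.DataFrame(first, index=dates, columns=list('ABCD'))
print(df_seeded.shape)
```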


Much like in `R`, you can view useful slices of the DataFrame using built-in methods.  You can also describe or summarize the data, transpose the data frame, sort values, and split the object into its components, all using built-in methods.

In [13]:
df.head(3)

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 1.343419 | 0.367837 | -0.207296 | 2.071716 |
| 2013-01-02 | 1.913703 | -0.7477 | -2.543656 | -1.347205 |
| 2013-01-03 | -0.73699 | -0.241838 | 0.642106 | 1.34643 |


In [14]:
df.tail(2)

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-05 | 2.152918 | 0.746074 | -0.556313 | -1.012012 |
| 2013-01-06 | -0.225186 | -1.473381 | 0.053223 | 2.11061 |


In [15]:
df.describe()

|  | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 1.030979 | -0.386719 | -0.416732 | 0.477865 |
| std | 1.211538 | 0.839815 | 1.114291 | 1.556918 |
| min | -0.73699 | -1.473381 | -2.543656 | -1.347205 |
| 25% | 0.166965 | -0.915403 | -0.469059 | -0.834596 |
| 50% | 1.540714 | -0.494769 | -0.077036 | 0.52204 |
| 75% | 1.869779 | 0.215418 | 0.096966 | 1.890394 |
| max | 2.152918 | 0.746074 | 0.642106 | 2.11061 |


In [16]:
df.values

array([[ 1.34341879,  0.36783709, -0.20729565,  2.07171588],
       [ 1.91370251, -0.74770037, -2.54365597, -1.34720528],
       [-0.73699029, -0.24183764,  0.64210595,  1.34642999],
       [ 1.73800877, -0.97130404,  0.11154683, -0.30234977],
       [ 2.15291833,  0.74607425, -0.55631338, -1.01201152],
       [-0.22518607, -1.47338149,  0.05322319,  2.11060978]])

In [17]:
df.sort_index(axis=0, ascending=False)

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-06 | -0.225186 | -1.473381 | 0.053223 | 2.11061 |
| 2013-01-05 | 2.152918 | 0.746074 | -0.556313 | -1.012012 |
| 2013-01-04 | 1.738009 | -0.971304 | 0.111547 | -0.30235 |
| 2013-01-03 | -0.73699 | -0.241838 | 0.642106 | 1.34643 |
| 2013-01-02 | 1.913703 | -0.7477 | -2.543656 | -1.347205 |
| 2013-01-01 | 1.343419 | 0.367837 | -0.207296 | 2.071716 |


In [18]:
df.sort_values(by='B')

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-06 | -0.225186 | -1.473381 | 0.053223 | 2.11061 |
| 2013-01-04 | 1.738009 | -0.971304 | 0.111547 | -0.30235 |
| 2013-01-02 | 1.913703 | -0.7477 | -2.543656 | -1.347205 |
| 2013-01-03 | -0.73699 | -0.241838 | 0.642106 | 1.34643 |
| 2013-01-01 | 1.343419 | 0.367837 | -0.207296 | 2.071716 |
| 2013-01-05 | 2.152918 | 0.746074 | -0.556313 | -1.012012 |


Another important capability in pandas is selecting specific elements of a DataFrame, either by their position or label, or by a condition on their values.


In [19]:
df['20130102':'20130104']

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 1.913703 | -0.7477 | -2.543656 | -1.347205 |
| 2013-01-03 | -0.73699 | -0.241838 | 0.642106 | 1.34643 |
| 2013-01-04 | 1.738009 | -0.971304 | 0.111547 | -0.30235 |


In [20]:
df['20130102':'20130104']['A']

2013-01-02    1.913703
2013-01-03   -0.736990
2013-01-04    1.738009
Freq: D, Name: A, dtype: float64
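
The same selection can be written in one step with the `.loc` (label-based) and `.iloc` (position-based) indexers, which avoids chaining two sets of brackets. The sketch below uses a stand-in frame called `demo`, filled with the integers 0 to 23 instead of random numbers so that the results are easy to check by eye.

```python
import numpy
import pandas

dates = pandas.date_range('20130101', periods=6)
# Deterministic stand-in values (0..23) so the selections are easy to check
demo = pandas.DataFrame(numpy.arange(24).reshape(6, 4),
                        index=dates, columns=list('ABCD'))

# By label: rows 2013-01-02 through 2013-01-04 (both ends included), column A
by_label = demo.loc['20130102':'20130104', 'A']
print(by_label)

# By position: rows 1, 2, 3 and column 0 (the end of the slice is excluded)
by_position = demo.iloc[1:4, 0]
print(by_position)

# A single cell, by row label and column label
print(demo.loc[dates[0], 'B'])
```

Note the asymmetry: label slices include both end points, while position slices follow the usual Python rule of excluding the end.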

In [21]:
df[df.B > 0]

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 1.343419 | 0.367837 | -0.207296 | 2.071716 |
| 2013-01-05 | 2.152918 | 0.746074 | -0.556313 | -1.012012 |


In [22]:
df > 0

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | True | True | False | True |
| 2013-01-02 | True | False | False | False |
| 2013-01-03 | False | False | True | True |
| 2013-01-04 | True | False | True | False |
| 2013-01-05 | True | True | False | False |
| 2013-01-06 | False | False | True | True |


In [23]:
df[df > 0]

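|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 1.343419 | 0.367837 | NaN | 2.071716 |
| 2013-01-02 | 1.913703 | NaN | NaN | NaN |
| 2013-01-03 | NaN | NaN | 0.642106 | 1.34643 |
| 2013-01-04 | 1.738009 | NaN | 0.111547 | NaN |
| 2013-01-05 | 2.152918 | 0.746074 | NaN | NaN |
| 2013-01-06 | NaN | NaN | 0.053223 | 2.11061 |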
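
Boolean conditions can also be combined. A minimal sketch, again using a deterministic stand-in frame `demo` rather than the random `df`: pandas uses `&`, `|`, and `~` in place of Python's `and`, `or`, and `not`, and each condition needs its own parentheses.

```python
import numpy
import pandas

dates = pandas.date_range('20130101', periods=6)
# Stand-in values: row i holds 4i, 4i+1, 4i+2, 4i+3
demo = pandas.DataFrame(numpy.arange(24).reshape(6, 4),
                        index=dates, columns=list('ABCD'))

# Rows where both conditions hold
both = demo[(demo.A > 4) & (demo.B < 20)]
print(both)

# Rows where at least one condition holds
either = demo[(demo.A < 4) | (demo.D > 20)]
print(either)

# Rows where the condition does not hold
negated = demo[~(demo.A > 4)]
print(negated)
```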


In [24]:
# Setting values
df.at[dates[0],'A'] = 0
df

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.0 | 0.367837 | -0.207296 | 2.071716 |
| 2013-01-02 | 1.913703 | -0.7477 | -2.543656 | -1.347205 |
| 2013-01-03 | -0.73699 | -0.241838 | 0.642106 | 1.34643 |
| 2013-01-04 | 1.738009 | -0.971304 | 0.111547 | -0.30235 |
| 2013-01-05 | 2.152918 | 0.746074 | -0.556313 | -1.012012 |
| 2013-01-06 | -0.225186 | -1.473381 | 0.053223 | 2.11061 |
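
Values can be set by position or by condition as well as by label. The following sketch, on a deterministic stand-in frame `demo`, shows the three variants side by side.

```python
import numpy
import pandas

dates = pandas.date_range('20130101', periods=6)
# Stand-in values: row i holds 4i, 4i+1, 4i+2, 4i+3
demo = pandas.DataFrame(numpy.arange(24).reshape(6, 4),
                        index=dates, columns=list('ABCD'))

# By label: a single cell
demo.at[dates[0], 'A'] = -1

# By position: row 0, column 1
demo.iat[0, 1] = -2

# By condition: set D to 0 wherever C is greater than 15
demo.loc[demo.C > 15, 'D'] = 0

print(demo)
```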


A final class of utilities that is useful in this lecture is operations on DataFrames.

In [25]:
df.mean()

A    0.807076
B   -0.386719
C   -0.416732
D    0.477865
dtype: float64

In [26]:
# Along the other axis
df.mean(1)

2013-01-01    0.558064
2013-01-02   -0.681215
2013-01-03    0.252427
2013-01-04    0.143975
2013-01-05    0.332667
2013-01-06    0.116316
Freq: D, dtype: float64

In [27]:
# Applying standard functions to the data
df.apply(numpy.cumsum)

|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.0 | 0.367837 | -0.207296 | 2.071716 |
| 2013-01-02 | 1.913703 | -0.379863 | -2.750952 | 0.724511 |
| 2013-01-03 | 1.176712 | -0.621701 | -2.108846 | 2.070941 |
| 2013-01-04 | 2.914721 | -1.593005 | -1.997299 | 1.768591 |
| 2013-01-05 | 5.067639 | -0.846931 | -2.553612 | 0.756579 |
| 2013-01-06 | 4.842453 | -2.320312 | -2.500389 | 2.867189 |


In [28]:
# Applying custom, temporary functions to the data
df.apply(lambda x: x.max() - x.min())

A    2.889909
B    2.219456
C    3.185762
D    3.457815
dtype: float64
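
By default `apply` passes each column to the function. With `axis=1` it passes each row instead, mirroring `df.mean(1)` above. The sketch below uses a deterministic stand-in frame `demo` and also shows that this particular lambda has a direct equivalent built from `max` and `min`.

```python
import numpy
import pandas

# Stand-in values: row i holds 4i, 4i+1, 4i+2, 4i+3
demo = pandas.DataFrame(numpy.arange(24).reshape(6, 4), columns=list('ABCD'))

# Range of each column, computed with apply and a lambda
via_apply = demo.apply(lambda x: x.max() - x.min())
print(via_apply)

# The same result using built-in column reductions
direct = demo.max() - demo.min()
print(direct)

# With axis=1 the function receives one row at a time
row_range = demo.apply(lambda x: x.max() - x.min(), axis=1)
print(row_range)
```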

### Doing my homework

I am tasked with analyzing the market for earth imagery from satellites.  We will analyze a cleaned version of a dataset that includes information on all active satellites currently in orbit.  The data is saved as a CSV in this repository.  Read the data.

In [32]:
data = pandas.read_csv('satdata.csv')
data[['country', 'users', 'purpose', 'date','mass']].head()

|  | country | users | purpose | date | mass |
|---|---|---|---|---|---|
| 0 | Denmark | Civil | Earth Observation | 2016-04-25 | 1 |
| 1 | Multinational | Commercial | Communications | 2014-02-06 | 6330 |
| 2 | Multinational | Commercial | Communications | 2016-06-15 | 1800 |
| 3 | Multinational | Commercial | Communications | 1997-08-19 | 3775 |
| 4 | Multinational | Commercial | Communications | 2015-03-02 | 2000 |
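
By default `read_csv` reads a date column in as plain strings. One option is to ask it to parse dates while reading. The sketch below does not use the real `satdata.csv`; it reads a small made-up CSV from a string, with column names copied from the output above, so that it can run on its own.

```python
import io
import pandas

# Hypothetical stand-in for satdata.csv, mirroring the columns shown above
csv_text = """country,users,purpose,date,mass
Denmark,Civil,Earth Observation,2016-04-25,1
Multinational,Commercial,Communications,2014-02-06,6330
Multinational,Commercial,Communications,1997-08-19,3775
"""

# parse_dates converts the named column to datetime values while reading
sample = pandas.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(sample.dtypes)

# Datetime columns support min, max, sorting, and resampling
print(sample['date'].min(), sample['date'].max())
```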


**PRACTICE 3**.  How many satellites are currently in orbit?

In [30]:

print(len(data))

1459


**PRACTICE 4**.  What are the average, maximum, and minimum sizes (by mass) of the satellites in orbit?

In [33]:
# Insert answer into here
x = pandas.to_numeric(data.mass, errors = 'coerce')
x.describe()

count    660.000000
mean     264.006061
std      291.132048
min        1.000000
25%             NaN
50%             NaN
75%             NaN
max      980.000000
Name: mass, dtype: float64
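
Only 660 of the 1459 rows have a numeric mass after the conversion, so it is worth checking what `errors='coerce'` discarded. The sketch below uses made-up raw values, not the real `mass` column. Whether thousands separators are the actual cause in `satdata.csv` is an assumption that the output above does not confirm.

```python
import pandas

# Hypothetical raw values; the real `mass` column may look different
raw = pandas.Series(['1', '6,330', '1800', 'unknown', '980'])

# Anything that cannot be parsed as a number becomes NaN
clean = pandas.to_numeric(raw, errors='coerce')
print(clean)

# Look at the original strings that were turned into NaN
print(raw[clean.isnull()])

# Assumption: if thousands separators are the culprit, strip them first
fixed = pandas.to_numeric(raw.str.replace(',', ''), errors='coerce')
print(fixed.max())
```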

**PRACTICE 5**.  How many earth observing or earth science satellites are currently in orbit?  How many of these satellites are for strictly non-military users? Save the subset as `sub`.

In [39]:
# Insert answer into here
acceptables = ['Earth Observation',
 'Earth Observation/Communications',
 'Earth Observation/Communications/Space Science',
 'Earth Observation/Technology Development',
 'Earth Science',
 'Earth/Space Science']
set(data.purpose)

#Boolean
sub = data[data.purpose.isin(acceptables)]
sub.shape

(405, 14)
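
The `isin` method returns a Boolean Series, which is what makes the filter above work. A minimal sketch on a made-up four-row frame (`toy` and its values are hypothetical) shows the mask itself and how two such masks can be combined.

```python
import pandas

# Hypothetical rows, using the same column names as the satellite data
toy = pandas.DataFrame({
    'purpose': ['Earth Observation', 'Communications',
                'Earth Science', 'Navigation'],
    'users': ['Civil', 'Commercial', 'Military', 'Government'],
})

# isin gives one True/False value per row
mask = toy.purpose.isin(['Earth Observation', 'Earth Science'])
print(mask)

# Indexing with the mask keeps only the True rows
earth = toy[mask]
print(len(earth))

# Two masks can be combined with & to filter on both columns at once
earth_civil = toy[mask & toy.users.isin(['Civil', 'Government'])]
print(earth_civil)
```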

**PRACTICE 6**.  Create a histogram of the detailed purpose within the class of earth observing, non-military satellites.

In [43]:
# Copy records
acceptables = [
    'Civil',
 'Civil/Government',
 'Commercial',
 'Commercial/Gov/Mil',
 'Commercial/Government',
 'Commerical',
 'Government',
 'Government/Civil',
 'Government/Commercial'
]
set(data.users)
sub = sub[sub.users.isin(acceptables)]
sub.shape



(290, 14)
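
For a categorical column, the counts behind a histogram come from `value_counts`. The sketch below uses made-up values; the name of the column holding the detailed purpose in `satdata.csv` is not shown in this section, so none is assumed here.

```python
import pandas

# Hypothetical detailed-purpose values
detail = pandas.Series(['Optical Imaging', 'Radar Imaging', 'Optical Imaging',
                        'Meteorology', 'Optical Imaging'])

# Count how often each category appears, most frequent first
counts = detail.value_counts()
print(counts)

# With matplotlib installed, the counts can be drawn as a bar chart:
# counts.plot(kind='bar')
```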

**PRACTICE 7**.  Create a subset of earth observing, non-military satellites with the following attributes.  Create a time series graph of the cumulative number of these types of satellites in orbit.  Resample the data to graph by week, rather than by day.

```python
acceptables = [
    'Radar Imaging/Earth Science',
    'Optical Imaging/Automatic Identification System (AIS)',
    'Optical Imaging/Meteorology',
    'Hyperspectral Imaging',
    'Optical Imaging/Infrared Imaging',
    'Multispectral Imaging',
    'Radar Imaging',
    'Optical Imaging (video)',
    'Optical Imaging',
    'Earth Science',
    'Meteorology/Earth Science',
    'Optical Imaging/Meterology'
]
```

In [41]:
# Insert answer into here
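
As a hint for the resampling step, here is a sketch on five made-up launch dates. It is one possible approach under those assumptions, not the notebook's answer: give each satellite a count of 1, sum the counts per week, then accumulate.

```python
import pandas

# Hypothetical launch dates standing in for the `date` column
launches = pandas.to_datetime(['2016-04-01', '2016-04-02', '2016-04-02',
                               '2016-04-12', '2016-04-25'])

# One entry per satellite, indexed by launch date
ones = pandas.Series(1, index=launches).sort_index()

# Count launches per week (weeks with no launches get 0), then accumulate
weekly = ones.resample('W').sum()
cumulative = weekly.cumsum()
print(cumulative)

# With matplotlib installed, the series can be drawn as a line:
# cumulative.plot()
```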

**PRACTICE 8**.  There is a jump in orbiting satellites around April 2016.  Why?  Who sent so many satellites into orbit?  How many?

In [52]:
# Insert answer into here