# Applying Python to data analysis

Everything we have done so far is a foundation for applying Python to data analysis. 
For this task we need: 
* The basic Python types (`list`, `set`, `dict`, `tuple`):
 
 * How to use those types. 
 * How to construct new ones. 
* The data storage types (`ndarray`, `DataFrame`): 
 
 * How to make one. 
 * How to manipulate one. 
  * Filtering. 
  * Constructing new columns. 
  * Transforming between types. 
  
# Now we move on to the final step of the journey. 
* Use this knowledge to do actual data analysis. 
* Learn to use the pre-packaged Python libraries that are constructed to help. 

# Some important caveats
* `numpy` predates `pandas`
 
 * Most data analysis libraries support the `numpy` format `ndarray`.
 * Some data analysis libraries don't support the `pandas` format `DataFrame`.  
* Libraries provide general-purpose methods and usually leave special-purpose needs uncovered. 

 * If everyone else needs to do something, chances are that there's a library that helps. 
 * If, on the other hand, your needs are unique, a library is unlikely to exist. 

* Libraries support the common patterns of data abstraction in Python, so things that seem reasonable usually work. 
 * However, some things may have unexpected results. 
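
As a small illustration of the caveats above, here is a sketch of converting a `DataFrame` to an `ndarray` before handing it to a library that only accepts `numpy` input. The frame `scores` and its contents are made up for this example. One result that may be unexpected: the row and column labels do not survive the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for illustration.
scores = pd.DataFrame({'math': [90, 80], 'art': [70, 60]},
                      index=['ann', 'bob'])

# Convert to a plain ndarray for libraries that only accept numpy input.
values = scores.to_numpy()

print(type(values))   # a numpy.ndarray
print(values.shape)   # (2, 2)
print(values)         # only the numbers remain; the labels are gone
```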


# Some ubiquitous patterns

### 1. If you have one kind of object and want another, try the constructor of the type you want. 

In the following cell, we want the result to be:
```
array([[1, 2, 3],
       [4, 5, 6]])
```

In [3]:
import numpy as np
nd = np.array([[1, 2, 3],
           [4, 5, 6]])
nd

array([[1, 2, 3],
       [4, 5, 6]])

In [5]:
# what if we want a DataFrame? 
import pandas as pd
df = pd.DataFrame([[1, 2, 3],
           [4, 5, 6]])
df

|     |  0  |  1  |  2  |
| --: | :-: | :-: | :-: |
|  0  |  1  |  2  |  3  |
|  1  |  4  |  5  |  6  |


In [27]:
# What if we want the array from within the DataFrame df? 
v2 = np.array(df)
v2

array([[1, 2, 3],
       [4, 5, 6]])
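
The same pattern applies to the basic Python types. The following is a minimal sketch with made-up values, showing a few constructors that accept an object of another type.

```python
import numpy as np
import pandas as pd

pairs = [('a', 1), ('b', 2)]

as_dict = dict(pairs)                        # list of pairs -> dict
as_set = set([1, 1, 2])                      # list -> set (duplicates collapse)
as_tuple = tuple([1, 2, 3])                  # list -> tuple
as_series = pd.Series(np.array([1, 2, 3]))   # ndarray -> Series
as_list = as_series.tolist()                 # Series -> list of Python ints

print(as_dict, as_set, as_tuple, as_list)
```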

### 2. Modify behaviors with extra optional arguments.


What if we want a DataFrame with row and column labels, like this...?


|        |   `a`   |   `b`   |   `c`   |
| -----: | :-----: | :-----: | :-----: |
|   `d`  |   `1`   |   `2`   |   `3`   |
|   `e`  |   `4`   |   `5`   |   `6`   |


In [7]:
# What if we want a DataFrame with row and column labels?
d2 = pd.DataFrame(nd, columns=['a', 'b', 'c'], index=['d', 'e'])
d2

|     |  a  |  b  |  c  |
| --: | :-: | :-: | :-: |
|  d  |  1  |  2  |  3  |
|  e  |  4  |  5  |  6  |

# Aside: how do optional arguments work? 
Consider the following example: 

In [8]:
def foo(number, multiplier=2):
    return number*multiplier

print(foo(2))
print(foo(2,7))
print(foo(3, multiplier=20))

4
14
60


* `multiplier=2` in the definition makes `multiplier` an optional argument with a default value of 2. 
* The default is used when the call does not supply a value. 
* You may pass the argument by position (`foo(2, 7)`) or by name (`foo(3, multiplier=20)`). 
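
Two further details about named arguments, shown as a short sketch that reuses the `foo` function from the cell above: named arguments may appear in any order, and positional arguments must come before named ones.

```python
def foo(number, multiplier=2):
    return number * multiplier

# Named arguments may be given in any order.
print(foo(multiplier=20, number=3))   # 60

# Positional arguments must come before named arguments;
# writing foo(multiplier=20, 3) would be a SyntaxError.
print(foo(3, multiplier=20))          # 60
```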

### 3. Arguments that are sequences can be specified in many valid ways. 

Now we want a dataframe that looks like this:

|        |   `x`   |   `y`   |   `z`   |
| -----: | :-----: | :-----: | :-----: |
|   `d`  |   `1`   |   `2`   |   `3`   |
|   `e`  |   `4`   |   `5`   |   `6`   |

Given the data as specified in the next cell, create a dataframe as above, matching the row identifiers and column names exactly.


In [9]:
data = [[1, 2, 3], [4, 5, 6]]

In [10]:
pd.DataFrame(data, columns=['x', 'y', 'z'])

|     |  x  |  y  |  z  |
| --: | :-: | :-: | :-: |
|  0  |  1  |  2  |  3  |
|  1  |  4  |  5  |  6  |


In [11]:
# Create the dataframe while keeping the number of explicit conversions to a minimum.
# Letting the python machinery do as much as possible _implicitly_
cols = 'xyz'
pd.DataFrame(data, ...)

TypeError: 'ellipsis' object is not iterable

In [12]:
# Create the dataframe while keeping the number of explicit conversions to a minimum.
# Letting the python machinery do as much as possible _implicitly_
cdict = {'x': 42, 'y': 20, 'z': 10}
pd.DataFrame(data, ...)

TypeError: 'ellipsis' object is not iterable
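
One possible way to complete the two cells above, offered as a sketch rather than the intended answer. It does not rely on pandas accepting `cols` or `cdict` directly; instead it converts each to a list with `list()`, which works because a string iterates over its characters and a dict iterates over its keys. The names `from_string` and `from_dict` are introduced here for illustration.

```python
import pandas as pd

data = [[1, 2, 3], [4, 5, 6]]
cols = 'xyz'
cdict = {'x': 42, 'y': 20, 'z': 10}

# A string is a sequence of characters: list('xyz') == ['x', 'y', 'z'].
from_string = pd.DataFrame(data, columns=list(cols), index=['d', 'e'])

# Iterating over a dict yields its keys in insertion order.
from_dict = pd.DataFrame(data, columns=list(cdict), index=['d', 'e'])

print(from_string)
print(from_string.equals(from_dict))   # both routes give the same frame
```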

# Let's make sure we can do some basic things.

It's often important to convert between the basic types `array`, `DataFrame`, and `Series` to get things done. Here are some examples.

In [13]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('03-07-py-np-pd-wrapup.ok')

Assignment: Applying python
OK, version v1.14.15



#### 1. Consider the `DataFrame`:

In [14]:
df = pd.DataFrame({'name': ['xavier', 'mark', 'ben'],
                   'species': ['cat', 'dog', 'dog'],
                   'fleas': [20, 100, 30],
                   'ticks': [2, 4, 2]})
df

|     |  name  | species | fleas | ticks |
| --: | :----: | :-----: | :---: | :---: |
|  0  | xavier |   cat   |  20   |   2   |
|  1  |  mark  |   dog   |  100  |   4   |
|  2  |  ben   |   dog   |  30   |   2   |


1. Create a `numpy` `ndarray` `nf` from `df` that contains only the numeric columns of `df`. Although this particular `df` is small, your recipe should work even if `df` has thousands of rows. 

In [35]:
# Your answer:
nf = np.array(df._get_numeric_data())
nf

array([[ 20,   2],
       [100,   4],
       [ 30,   2]])

In [36]:
_ = ok.grade('q01')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



2. What is wrong with just using `nf = np.array(df)`?

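___Your answer:___ `np.array(df)` converts every column, including the non-numeric ones, so the result is an array of `dtype=object` rather than a numeric array.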
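
The difference can be checked directly. The sketch below rebuilds the same `df` and compares the two conversions. It uses `DataFrame.select_dtypes`, a public pandas method, as an alternative to the private `_get_numeric_data` helper used in the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['xavier', 'mark', 'ben'],
                   'species': ['cat', 'dog', 'dog'],
                   'fleas': [20, 100, 30],
                   'ticks': [2, 4, 2]})

# Mixed columns force a single catch-all dtype.
print(np.array(df).dtype)          # object

# Keep only the numeric columns, then convert to an ndarray.
nf = df.select_dtypes(include='number').to_numpy()
print(nf.dtype.kind)               # 'i', meaning an integer dtype
print(nf.shape)                    # (3, 2)
```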

Now consider the `array`: 

In [37]:
column_labels = ['x', 'y', 'z']
row_labels = ['a', 'b', 'c']
n3 = np.array([[1,2,3],[4,5,6],[7,8,9]])
n3

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

3. Create a `DataFrame` `d3` from this that has the column and row labels specified. 

In [39]:
# Your answer: 
d3 = pd.DataFrame(n3, columns=column_labels, index=row_labels)
print(d3)

   x  y  z
a  1  2  3
b  4  5  6
c  7  8  9


In [40]:
_ = ok.grade('q03')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Now consider the following data: 

In [41]:
%more e4.csv

4. Read in this file and convert it to an `array` `n4`. Omit non-numeric columns. Hint: read it as a `DataFrame`, and read up on how to avoid treating the first line as a header. 

In [49]:
# Your answer: 
d4 = pd.read_csv('e4.csv', header=None)
n4 = np.array(d4._get_numeric_data())
n4

array([[210, 400],
       [500, 422],
       [ 40,  50]])
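
To see what `header=None` does without needing the file, here is a sketch that reads a headerless CSV from an in-memory string. The text below is a made-up stand-in, not the real contents of `e4.csv`, and the names `text`, `frame`, and `numbers` are introduced only for this example.

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless CSV file; not the real e4.csv.
text = "ann,1,2\nbob,3,4\n"

# header=None keeps the first line as data and numbers the columns 0, 1, 2.
frame = pd.read_csv(io.StringIO(text), header=None)
print(list(frame.columns))   # [0, 1, 2]
print(len(frame))            # 2 rows: the first line was not used as a header

# Drop the non-numeric column, then convert to an ndarray.
numbers = frame.select_dtypes(include='number').to_numpy()
print(numbers)
```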

In [50]:
_ = ok.grade('q04')

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



# When you are done with this notebook, 

* Save and checkpoint. 
* Ensure that the name of this file is precisely `03-07-py-np-pd-wrapup.ipynb`. 
* <del>Change `ready` to `True` in the cell below. </del>
* <del>Run the cell below to submit your work for grading. </del>
* Save and checkpoint the notebook. 

* If your Jupyter installation can download the notebook as a PDF,
    * Download it (File >> Download as >> PDF via LaTeX (.pdf)).
    * Rename the downloaded file to `<loginid>-03-07-py-np-pd-wrapup.pdf`. For example, my filename would be `jsingh11-03-07-py-np-pd-wrapup.pdf`.
    * Submit the file `<loginid>-03-07-py-np-pd-wrapup.pdf` to Canvas.
* Otherwise,
    * Download it (File >> Download as >> Notebook (.ipynb)).
    * Rename the downloaded file to `<loginid>-03-07-py-np-pd-wrapup.ipynb`. For example, my filename would be `jsingh11-03-07-py-np-pd-wrapup.ipynb`.
    * Submit the file `<loginid>-03-07-py-np-pd-wrapup.ipynb` to Canvas.