# Data analysis with Pandas and Numpy

## 1. Numpy

Numpy is a powerful Python package for manipulating numerical data structures called arrays. The name numpy stands for "numerical Python".

A vector and a matrix are both examples of arrays: a vector is a one-dimensional array, and a matrix is a two-dimensional array. Numerical work often involves data with more than two dimensions, and arrays extend to any number of dimensions while keeping the data in a regular, grid-like layout.

Numpy is an essential package for working with numerical data. Although many of the operations you need can be performed with Python's native data types, numpy usually provides more convenient and efficient functionality.

We can create a 1-dimensional numpy array by importing numpy and using the `array()` function.

In [None]:
import numpy as np

In [None]:
arr1 = np.array([8, 4, 6, 0, 2])

In [None]:
# creating a two-dimensional array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
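# attribute: the size of each dimension, here (2, 3)
arr2.shape
# method: the sum of all elements, here 21
arr2.sum()
# the type of the object, numpy.ndarray
type(arr2)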

You can select elements of an array with numpy's indexing and slicing. The first dimension indexes rows, the second indexes columns, and any further dimensions have no simple geometric picture. As everywhere in Python, the first element has index 0. A slice includes its start index and excludes its stop index, so to slice from index m through index n inclusive, you must use m:n+1. Further:

- a scalar m means the (m+1)th element
- an empty : means the entire dimension
- two colons followed by an integer ::p means every pth element, starting with the first
- a colon followed by an integer :n gives a slice from the first element to the nth element
- an integer followed by a colon m: gives a slice from the (m+1)th element to the last element

In [None]:
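arr2[0]       # first row
arr2[1,1]     # element in the second row, second column
arr2[:,1]     # second column
arr2[0,:]     # first row, with the column slice written out
arr2[-1, -1]  # element in the last row, last column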
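
As a minimal sketch of the slicing rules listed above, the following applies each rule to a small one-dimensional array. The array `v`, its values, and the variable names are made up for illustration.

```python
import numpy as np

# illustrative values, not from the original notebook
v = np.array([10, 20, 30, 40, 50, 60])

third = v[2]            # scalar index 2 is the 3rd element
everything = v[:]       # an empty colon selects the whole dimension
every_other = v[::2]    # every 2nd element, starting with the first
first_three = v[:3]     # from the first element to the 3rd element
from_fourth = v[3:]     # from the 4th element to the last element
middle = v[1:4]         # index 1 through index 3; the stop index 4 is excluded
```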

In [None]:
# basic arithmetic
arr3 = np.array([[10,8,6], [4,2,0]])
arr2 + arr3
arr2 * arr3
arr2 + 2
arr2 < 5
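
The arithmetic and comparison operators in the cell above act element by element, and a scalar operand is applied to every element. A minimal sketch reusing the `arr2` and `arr3` values, with the results stored under hypothetical variable names so they can be inspected:

```python
import numpy as np

arr2 = np.array([[1, 2, 3], [4, 5, 6]])
arr3 = np.array([[10, 8, 6], [4, 2, 0]])

total = arr2 + arr3     # [[11, 10, 9], [8, 7, 6]]
product = arr2 * arr3   # [[10, 16, 18], [16, 10, 0]]; element-wise, not a matrix product
shifted = arr2 + 2      # [[3, 4, 5], [6, 7, 8]]
mask = arr2 < 5         # [[True, True, True], [True, False, False]]
```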

Numpy has many built-in functions. In this example, we create a vector called `dependents` of 20 random integers drawn from the half-open interval [0, 6), that is, from 0 to 5, each representing a number of dependents. Since the EITC can only be applied to up to 3 dependents, we create a new vector called `eitc_qual` that uses the `minimum` function to take the element-wise minimum of `dependents` and 3.

NOTE: we could have imported the `random` module under a shorter name and called its `randint` function with:

```python
from numpy import random as rd
dependents = rd.randint(0,6,20)
```

In [None]:
dependents = np.random.randint(0,6,20)
eitc_qual = np.minimum(dependents, 3)
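
Because `np.random.randint` returns different values on each run, the capping step is easier to check with fixed input. A minimal sketch, using hand-picked dependent counts rather than random ones:

```python
import numpy as np

# illustrative counts, chosen by hand instead of drawn at random
dependents = np.array([0, 5, 3, 1, 4, 2])

# element-wise minimum of each count and 3, so larger counts are capped at 3
eitc_qual = np.minimum(dependents, 3)
# eitc_qual is [0, 3, 3, 1, 3, 2]
```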

You can think of `np.where` as an element-wise if-else statement. The first argument is the condition, the second is the value used where the condition is met, and the third is the value used where it is not.

In [None]:
mstat = np.array(['single', 'joint', 'joint', 'single', 'single'])
num_taxpayers = np.where(mstat=='single', 1, 2)
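
To make the if-else reading concrete, the sketch below compares `np.where` with an equivalent list comprehension on the same `mstat` values. The list `num_taxpayers_loop` is a hypothetical name added for illustration.

```python
import numpy as np

mstat = np.array(['single', 'joint', 'joint', 'single', 'single'])

# vectorized: 1 where the status is 'single', 2 everywhere else
num_taxpayers = np.where(mstat == 'single', 1, 2)

# the same logic written as an explicit if-else over each element
num_taxpayers_loop = [1 if status == 'single' else 2 for status in mstat]

# both give 1, 2, 2, 1, 1
```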

## 2. Pandas

Pandas is another Python package for data analysis. Pandas' two main data structures are the `Series` object and the `DataFrame` object. For now, we will focus on the `DataFrame` object.

The `DataFrame` is the standard tabular data structure that you would recognize from programs like Stata, SAS, or R. As with the univariate `Series` object, the `DataFrame` supports traditional data analysis while interacting with all of Python's other functionality. You will notice that many of the methods available on pandas `DataFrame` objects are also available in `numpy`. The two usually give equivalent results, but the advantage of performing operations with the `DataFrame` is that it respects the index values.

In [None]:
import pandas as pd

In [None]:
# there are multiple ways to create a DataFrame. Here, we convert a dictionary to a DataFrame 
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
# attributes
frame.index
frame.columns
frame.shape
# copying DataFrame
frame2 = frame.copy()
# one way to add a column
region = ['East', 'East', 'East', 'West', 'West']
frame2['Region'] = region
# a better way to add a column using np.where (the column name is lowercase 'state')
frame['Region'] = np.where(frame['state'] == 'Ohio', 'East', 'West')
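
One way to see what respecting the index values means is to add two `Series` whose labels are in different orders. This is a minimal sketch with made-up labels and values: pandas matches entries by label, whereas the underlying numpy arrays are added by position.

```python
import pandas as pd

# illustrative data: the same labels in a different order
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

by_label = a + b                            # aligned on the index labels
by_position = a.to_numpy() + b.to_numpy()   # plain numpy arrays: added by position

# by_label:    x -> 31, y -> 22, z -> 13
# by_position: [11, 22, 33]
```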

Instead of manually creating the data for the DataFrame, let's read a csv file into a Pandas DataFrame using the `read_csv()` function. The mandatory argument is a string containing the file path to the csv (relative to where your Jupyter Notebook or Python script is saved). In this case, we have saved `weather.csv` to the same folder as this Notebook.

In [None]:
weather_df = pd.read_csv('weather.csv')
# the default index is 0, 1, 2... We can set the YEAR column as our index for readability.
weather_df2 = weather_df.set_index('YEAR')
weather_df2.index
# use the describe() method to get a sense of the data
describe_df = weather_df2.describe()
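
Since `weather.csv` is not reproduced here, the sketch below runs the same steps on a small in-memory CSV. The column names `TEMP` and `RAIN` and all of the values are invented for illustration; only the `YEAR` column name comes from the cell above.

```python
import io
import pandas as pd

# hypothetical stand-in for weather.csv; columns other than YEAR are assumptions
csv_text = """YEAR,TEMP,RAIN
2000,10.0,30.0
2001,12.0,20.0
2002,14.0,40.0
"""

weather_df = pd.read_csv(io.StringIO(csv_text))
weather_df2 = weather_df.set_index('YEAR')   # returns a new DataFrame; weather_df is unchanged

describe_df = weather_df2.describe()         # count, mean, std, min, quartiles, max per column
mean_temp = describe_df.loc['mean', 'TEMP']  # 12.0
```

Note that `set_index` does not modify `weather_df` in place, which is why the result is assigned to a new name.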