# Pandas Tutorial
Fan 10/24/2017
## 1. Intro to Pandas + 2. Data Indexing and Selection

At the most basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. There are three fundamental Pandas data structures: Series, DataFrame, and Index.

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

### Series + Data Selection in Series
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or an array.

In [2]:
import pandas as pd
data = pd.Series([2,3,4,5])
print(data)
print(data.values)
print(data.index)

0    2
1    3
2    4
3    5
dtype: int64
[2 3 4 5]
RangeIndex(start=0, stop=4, step=1)


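#### 1. Series as generalized NumPy array
The essential difference between a NumPy array and a Pandas Series is the index: a NumPy array has an implicitly defined integer index, whereas a Series has an explicitly defined index associated with its values. That explicit index need not consist of integers, as the string labels below show.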

In [66]:
data1 = pd.Series([1,2,3,4], index= ['a','b','c', 'd'])
print(data1)
print(data1['a'])
print(data1[0:2])

a    1
b    2
c    3
d    4
dtype: int64
1
a    1
b    2
dtype: int64


#### 2. Series as specialized dictionary
The type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [59]:
population_dic = {'California': 333333,
                 'New York' : 222222,
                 'Washington' : 11111}
population = pd.Series(population_dic)
print(population)
print(population['California'])
print(population.iloc[0])  # positional access; population[0] on a labelled Series is deprecated in recent pandas

California    333333
New York      222222
Washington     11111
dtype: int64
333333
333333
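
The dictionary analogy extends to membership tests and key access, while the array side adds label-based slicing that a plain dict lacks. A minimal sketch, reusing the illustrative population figures from above (the variable name `pop_series` is chosen only for this example):

```python
import pandas as pd

# Same illustrative figures as above; `pop_series` is a name made up for this sketch.
pop_series = pd.Series({'California': 333333,
                        'New York': 222222,
                        'Washington': 11111})

# Dictionary-like behaviour: membership test and key listing
print('California' in pop_series)       # True
print(list(pop_series.keys()))          # ['California', 'New York', 'Washington']

# Array-like behaviour a plain dict lacks: slicing by label.
# Label slices include the end point.
print(pop_series['California':'New York'])

# Assigning to a new key extends the Series, as with a dict
pop_series['Texas'] = 444444
print(len(pop_series))                  # 4
```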


#### 3. Constructing series object
`>>>pd.Series(data, index = index)`

where index is an optional argument, and data can be one of many entities.
For example, data can be a list or NumPy array, in which case index defaults to an integer sequence.
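
The different kinds of `data` can be sketched side by side. The variable names here are illustrative only:

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
s_list = pd.Series([2, 4, 6])

# From a scalar: the value is repeated to fill the given index
s_scalar = pd.Series(5, index=[100, 200, 300])

# From a dict: the keys become the index
s_dict = pd.Series({2: 'a', 1: 'b', 3: 'c'})

# From a dict with an explicit index: only the listed keys are kept,
# in the order given
s_subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(s_list.index.tolist())    # [0, 1, 2]
print(s_scalar.tolist())        # [5, 5, 5]
print(s_subset.tolist())        # ['c', 'a']
```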

### The Pandas DataFrame Object + Data Selection in DataFrame
#### 1. DataFrame as generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [36]:
area_dic = {'California' : 1231231,
           'New York' : 131231,
           'Washington' : 2232323}
area = pd.Series(area_dic)
states = pd.DataFrame({'population':population_dic,
                      'area' : area_dic})
print(states)
print(states.index)
print(states.columns)

               area  population
California  1231231      333333
New York     131231      222222
Washington  2232323       11111
Index(['California', 'New York', 'Washington'], dtype='object')
Index(['area', 'population'], dtype='object')


In [76]:
states['area']

California    1231231
New York       131231
Washington    2232323
Name: area, dtype: int64

In [77]:
states.iloc[0:2,1]

California    333333
New York      222222
Name: population, dtype: int64

In [79]:
states[0:3]

               area  population
California  1231231      333333
New York     131231      222222
Washington  2232323       11111


Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
#### 2. DataFrame as specialized dictionary
Notice the potential point of confusion here: in a two-dimensional NumPy array, `data[0]` returns the first row. For a DataFrame, `data['col0']` returns the first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays.
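
The row-versus-column difference is easy to check directly. The following is a small sketch with made-up data; the names `arr`, `df`, and the column labels are assumptions for illustration:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

# NumPy: a single index selects a row
print(arr[0])                   # [1 2]

# DataFrame: a single key selects a column (returned as a Series)
print(df['col0'].tolist())      # [1, 3, 5]

# Dictionary-style assignment adds a new column
df['total'] = df['col0'] + df['col1']
print(df.columns.tolist())      # ['col0', 'col1', 'total']

# To get the first row, ask for it by position explicitly
print(df.iloc[0].tolist())      # [1, 2, 3]
```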
#### 3. Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways.
##### From a single Series object

In [31]:
pd.DataFrame(population, columns=['population'])

            population
California      333333
New York        222222
Washington       11111


##### From a list of dicts

In [34]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


##### From a dictionary of Series objects

In [37]:
pd.DataFrame({'population': population,
              'area': area})

               area  population
California  1231231      333333
New York     131231      222222
Washington  2232323       11111


##### From a two-dimensional NumPy array

In [3]:
import numpy as np
print(np.random.rand(3,2))
pd.DataFrame(np.random.rand(3, 2), columns=['first c', 'second c'], index=['a', 'b', 'c'])

[[ 0.11038117  0.55315952]
 [ 0.87817474  0.95823377]
 [ 0.12982379  0.71718979]]


    first c  second c
a  0.184990  0.383516
b  0.763681  0.642214
c  0.421959  0.784291


### The Pandas Index Object + Indexers: loc and iloc
We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects.

The `loc` attribute allows indexing and slicing that always references the explicit index.
The `iloc` attribute allows indexing and slicing that always references the implicit Python-style index.
(The older hybrid `ix` indexer was removed in pandas 1.0 and is not covered here.)

As a simple example, let's construct an Index from a list of integers.

In [42]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

#### 1. Index as immutable array
The Index in many ways operates like an array, but the index cannot be modified via the normal means.
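
A short sketch of both halves of that statement: array-style reads work, while item assignment raises an error. The Index values match the example above.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Array-style access works as expected
print(ind[1])                           # 3
print(ind[::2].tolist())                # [2, 5, 11]
print(ind.size, ind.shape, ind.ndim)    # 5 (5,) 1

# Item assignment is rejected because an Index is immutable
try:
    ind[1] = 0
except TypeError as err:
    print('TypeError:', err)
```

This immutability makes it safer to share one Index between several Series or DataFrames without side effects from accidental modification.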

#### 2. Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [43]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB)  # set intersection; in pandas 2.x, indA & indB is an element-wise operation instead

Int64Index([3, 5, 7], dtype='int64')
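
The other set combinations are available as named methods. A minimal sketch follows; the named methods are used because, to the best of my knowledge, pandas 2.x treats the `&`, `|`, and `^` operators on an Index as element-wise logical operations rather than set operations.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB).tolist())          # [3, 5, 7]
print(indA.union(indB).tolist())                 # [1, 2, 3, 5, 7, 9, 11]
print(indA.difference(indB).tolist())            # [1, 9]
print(indA.symmetric_difference(indB).tolist())  # [1, 2, 9, 11]
```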

#### 3. loc and iloc

In [67]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(data.iloc[0])
print(data.loc[1])


a
a
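
The difference matters most when slicing a Series whose explicit index is itself made of integers. A small sketch using the same Series as above:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc slices by label and includes the end label
print(data.loc[1:3].tolist())    # ['a', 'b']

# iloc slices by position and excludes the end position
print(data.iloc[1:3].tolist())   # ['b', 'c']
```

Spelling out `loc` or `iloc` removes the ambiguity about whether `1:3` refers to labels or to positions.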


## 3. Operating on Data in Pandas
One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy.
### Ufuncs: Index Preservation

In [80]:
rng = np.random.RandomState(42)
data = pd.DataFrame(rng.randint(0,10,(3,4)))
data

   0  1  2  3
0  6  3  7  4
1  6  9  2  6
2  7  4  3  7


In [83]:
np.sin(data)

          0         1         2         3
0 -0.279415  0.141120  0.656987 -0.756802
1 -0.279415  0.412118  0.909297 -0.279415
2  0.656987 -0.756802  0.141120  0.656987
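
Index preservation is easier to see when the labels are not the default integers. A minimal sketch with a labelled Series (the labels `x`, `y`, `z` are arbitrary):

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0], index=['x', 'y', 'z'])

# Apply a NumPy ufunc directly to the Series
result = np.exp(ser)

# The result is still a Series, carrying the original labels
print(type(result).__name__)    # Series
print(result.index.tolist())    # ['x', 'y', 'z']
print(result['x'])              # 1.0
```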


## 4. Handling Missing Data

### 1. None: Pythonic missing data
The first sentinel value used by Pandas is `None`, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects):

In [4]:
value1 = np.array([1,2,None,3])
value1

array([1, 2, None, 3], dtype=object)

The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error.
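
A quick way to see this is to attempt the aggregation and catch the failure. A minimal sketch:

```python
import numpy as np

values = np.array([1, 2, None, 3])
print(values.dtype)             # object

try:
    values.sum()
except TypeError as err:
    # Addition between an int and None is not defined
    print('TypeError:', err)
```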
### 2. NaN: Missing numerical data
The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [6]:
value2 = np.array([1,2,np.nan, 3, 4])
value2

array([  1.,   2.,  nan,   3.,   4.])

In [7]:
value2 + 1

array([  2.,   3.,  nan,   4.,   5.])

In [8]:
np.nansum(value2)

10.0

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.
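
A brief sketch of that interchangeability: in numeric data, `None` is converted to NaN on construction, and in both numeric and string data the null-detection methods treat the two alike.

```python
import numpy as np
import pandas as pd

# None is converted to NaN, and the integers are upcast to float64
ser = pd.Series([1, np.nan, 2, None])
print(ser.dtype)                  # float64
print(ser.isnull().tolist())      # [False, True, False, True]

# In a Series of strings, None is also reported as missing
names = pd.Series(['a', None, 'c'])
print(names.isnull().tolist())    # [False, True, False]
```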
### 3. Operating on Null Values
There are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

- `isnull()`: Generate a boolean mask indicating missing values
- `notnull()`: Opposite of `isnull()`
- `dropna()`: Return a filtered version of the data
- `fillna()`: Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration of these routines.

#### Detecting null values

In [11]:
data = pd.Series([1,2,3,np.nan])
data.isnull()

0    False
1    False
2    False
3     True
dtype: bool

In [13]:
data[data.isnull()]

3   NaN
dtype: float64

#### Dropping null values

In [14]:
data.dropna()

0    1.0
1    2.0
2    3.0
dtype: float64

In [18]:
data1 = pd.DataFrame([[1,2,3],[3,np.nan,4],[np.nan,5,6]])
data1

     0    1  2
0  1.0  2.0  3
1  3.0  NaN  4
2  NaN  5.0  6


In [20]:
data1.dropna()

     0    1  2
0  1.0  2.0  3


In [21]:
data1.dropna(axis = 'columns')

   2
0  3
1  4
2  6
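
By default `dropna()` removes any row or column containing a null value, which can discard a lot of good data. The `how` and `thresh` parameters give finer control. A sketch on the same data with one extra, entirely empty column added for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [3, np.nan, 4],
                   [np.nan, 5, 6]])
df[3] = np.nan                  # add a column that is entirely missing

# how='all' drops only the columns in which every value is null
print(df.dropna(axis='columns', how='all').shape)          # (3, 3)

# thresh=3 keeps only the rows with at least three non-null values
print(df.dropna(axis='index', thresh=3).index.tolist())    # [0]
```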


#### Filling null values
Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.

In [22]:
data1.fillna(0)

     0    1  2
0  1.0  2.0  3
1  3.0  0.0  4
2  0.0  5.0  6


In [25]:
# Forward fill: propagate the previous valid value forward down each column.
# fillna(method='ffill') is deprecated in recent pandas releases; ffill() is equivalent.
data1.ffill(axis = 0)

     0    1  2
0  1.0  2.0  3
1  3.0  2.0  4
2  3.0  5.0  6
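
Two other common fill strategies are back-filling and filling with a per-column statistic. A minimal sketch on the same data; using the column mean is only one possible imputation choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [3, np.nan, 4],
                   [np.nan, 5, 6]])

# Back-fill: propagate the next valid value backward along each column.
# A gap in the last row has no later value to borrow from, so it stays NaN.
back_filled = df.bfill(axis=0)
print(back_filled)

# Fill each column's gaps with that column's mean
mean_filled = df.fillna(df.mean())
print(mean_filled)
```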


## 5. Hierarchical Indexing
## 6. Combining Datasets: Concat and Append

In [26]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [6]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [7]:
d1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
d2 = pd.DataFrame([['a','b','c'],['d','e','f'],['g','h','i']])
display('d1','d2','pd.concat([d1,d2])')

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

   0  1  2
0  a  b  c
1  d  e  f
2  g  h  i

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
0  a  b  c
1  d  e  f
2  g  h  i
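
Note that the concatenated result above repeats the row labels 0, 1, 2. `pd.concat` offers several ways to deal with duplicate labels. A sketch with two small frames (the column names `A` and `B` are made up for this example):

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
d2 = pd.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])

# Default: the row labels 0 and 1 each appear twice
print(pd.concat([d1, d2]).index.tolist())                      # [0, 1, 0, 1]

# ignore_index=True builds a fresh integer index
print(pd.concat([d1, d2], ignore_index=True).index.tolist())   # [0, 1, 2, 3]

# verify_integrity=True raises instead of silently duplicating labels
try:
    pd.concat([d1, d2], verify_integrity=True)
except ValueError as err:
    print('ValueError:', err)

# keys= tags each source frame, producing a hierarchical index
print(pd.concat([d1, d2], keys=['x', 'y']).index.tolist())
# [('x', 0), ('x', 1), ('y', 0), ('y', 1)]
```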


In [37]:
pd.concat([d1,d2], axis = 1, join = 'outer', ignore_index = True)

   0  1  2  3  4  5
0  1  2  3  a  b  c
1  4  5  6  d  e  f
2  7  8  9  g  h  i


In [40]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([d1, d2])

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
0  a  b  c
1  d  e  f
2  g  h  i
