# Pandas Practice, Dataframe Manipulation

*This notebook describes basic DataFrame manipulation with the **pandas** Python module.*

Pandas documentation can be found at [Pandas](https://pandas.pydata.org/pandas-docs/stable/)

#### First, import pandas with the alias ```pd```

In [53]:
import pandas as pd

#### Then, import numpy as well
We will be using both libraries.

In [54]:
import numpy as np

### The first concept to learn is the Series

- A series is a one-dimensional **labelled array** 
    - It is capable of holding any data type, including objects
    - The axis labels are collectively referred to as the **index**


- The basic method to create a series is to call:
```python
s = pd.Series(data, index=index)
```    
- In the above sample, ```data``` can be many different things:
    - a Python dict
    - a one-dimensional NumPy array (ndarray)
    - a scalar value (like 5)


- The passed **index** is a list of axis labels. This separates into a few cases depending on what **data** is:

#### If **data** is an ndarray:

- **index** must be the same length as **data**.
- If no index is passed, one is created with the values ```[0, ..., len(data) - 1]``` (a zero-based index). For example:

In [55]:
s = pd.Series(np.random.randn(5))
print(s)

0    0.171813
1   -1.034947
2    0.186197
3   -1.450996
4   -0.786615
dtype: float64


- The index of a Series is available through its ```index``` attribute:

In [56]:
print(s.index)

RangeIndex(start=0, stop=5, step=1)


- Labels can be passed explicitly through the ```index``` argument:

In [57]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    1.367234
b    0.152522
c   -1.672828
d    1.179884
e   -0.376522
dtype: float64
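The length requirement stated above can be checked directly. A minimal sketch, with fixed values and variable names chosen for illustration rather than taken from the notebook: an index whose length differs from the data raises a ```ValueError```.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0])

# Matching lengths: three values, three labels.
ok = pd.Series(values, index=['x', 'y', 'z'])

# Mismatched lengths: three values, two labels.
try:
    pd.Series(values, index=['x', 'y'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(ok.index.tolist())
print(type(mismatch_error).__name__)
```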


#### If **data** is a dict:

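- If **index** is passed, the values in ```data``` corresponding to the labels in the index are pulled out; labels with no matching key get ```NaN```.
- Otherwise, the index is built from the dict's keys. In current pandas versions (0.23+ on Python 3.6+) the keys keep their insertion order; older versions sorted them where possible: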

In [58]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [59]:
e = pd.Series(d)
print(e)

a    0.0
b    1.0
c    2.0
dtype: float64
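The key order is easier to see with keys that are not already alphabetical. A minimal sketch, assuming a recent pandas (0.23 or later) on Python 3.6+, where dict keys keep their insertion order instead of being sorted; the dict contents are made up for illustration.

```python
import pandas as pd

# Keys deliberately inserted out of alphabetical order.
unsorted_d = {'b': 1.0, 'a': 0.0, 'c': 2.0}
from_dict = pd.Series(unsorted_d)

# On the assumed versions, the index follows insertion order.
print(from_dict.index.tolist())
```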


In [60]:
f = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(f)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


Note that ```NaN``` (not a number) is the standard missing-data marker used in pandas.
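Missing entries such as the ```d``` label above can be detected and handled with ```isna()```, ```dropna()```, and ```fillna()```. A short sketch that rebuilds the same Series so it runs on its own; the fill value of 0.0 is an arbitrary choice for illustration.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}
f = pd.Series(d, index=['b', 'c', 'd', 'a'])

missing = f.isna()        # boolean Series, True where the value is NaN
without_nan = f.dropna()  # new Series without the 'd' entry
filled = f.fillna(0.0)    # new Series with NaN replaced by 0.0

print(missing)
print(without_nan.index.tolist())
print(filled['d'])
```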

#### If **data** is a scalar value:

- If ```data``` is a scalar value, an index must be provided.
- The value will be repeated to match the length of **index**

In [61]:
g = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
print(g)

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64


### A few quick notes:
- A Series can be sliced like ```s[:2]```; the index is sliced along with the values.
- Membership tests with ```in``` check the index labels, as with dict keys:

In [62]:
'e' in s

True

In [63]:
'f' in s

False
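The slicing note above can be illustrated with fixed values (chosen here for reproducibility, not taken from the notebook): a positional slice keeps the labels that belong to the selected values.

```python
import pandas as pd

t = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0],
              index=['a', 'b', 'c', 'd', 'e'])

first_two = t[:2]  # positional slice: the first two elements

print(first_two.index.tolist())
print('a' in first_two, 'e' in first_two)
```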

Like a dict, a Series lets you set and get values by label:

In [64]:
s['a'] = 12.
print(s)

a    12.000000
b     0.152522
c    -1.672828
d     1.179884
e    -0.376522
dtype: float64


In [65]:
print(s['a'])

12.0


If a label is not in the index, a ```KeyError``` is raised:

In [66]:
s['f']

KeyError: 'f'

With the ```get``` method, a missing label returns ```None``` or a specified default instead:

In [67]:
print(s.get('f'))

None


In [68]:
print(s.get('f', np.nan))

nan


Mathematical operations work as they do in NumPy; here a ```Series``` behaves much like an ndarray:

In [69]:
print(s + s)

a    24.000000
b     0.305043
c    -3.345656
d     2.359767
e    -0.753044
dtype: float64


In [70]:
print(s * 2)

a    24.000000
b     0.305043
c    -3.345656
d     2.359767
e    -0.753044
dtype: float64
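Because the values above are random, a sketch with fixed numbers (an assumption made for reproducibility) makes it easier to verify that arithmetic is applied element by element and that the labels are preserved.

```python
import numpy as np
import pandas as pd

u = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

doubled = u * 2         # the scalar is applied to every element
summed = u + u          # element-wise addition, same result as doubling
squared = np.square(u)  # NumPy ufuncs return a Series with the same index

print(doubled.tolist())
print(squared.tolist())
```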


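This is another example, using NumPy's ```np.exp()``` function. The output shown is from an earlier run of the cell (note the lower cell number), when ```s``` held different random values, so only the ```a``` row, exp(12), matches the Series printed above: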

In [51]:
np.exp(s)

a    162754.791419
b         3.447194
c         0.330539
d         1.132191
e         0.351002
dtype: float64

#### A key difference between an ndarray and a pandas Series:
Operations between Series automatically align the data based on label. Thus, you can write computations without considering whether the Series involved have the same labels.

In [72]:
s[1:] + s[:-1]

a         NaN
b    0.305043
c   -3.345656
d    2.359767
e         NaN
dtype: float64

Notice that above, the result of an operation between unaligned Series will have the **union** of the indexes involved.

If a label is not found in one Series or the other, the result will be marked as missing, ```NaN```
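A small sketch of label alignment with fixed values (illustrative, not from the notebook). It also shows ```Series.add``` with ```fill_value```, one way to treat a label missing on one side as 0 instead of producing ```NaN```.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Union of labels; 'a' and 'd' exist on one side only, so they become NaN.
aligned = left + right

# Same alignment, but a value missing on one side is treated as 0.
filled = left.add(right, fill_value=0)

print(aligned)
print(filled)
```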

Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research.

The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

Note: a Series can also have a ```name``` attribute:

In [73]:
s = pd.Series(np.random.randn(5), name='something')
print(s)

0   -2.247914
1    0.116714
2    2.641048
3   -0.989728
4   -1.239781
Name: something, dtype: float64


In [75]:
s.name

'something'

You can also rename a Series with the ```rename``` method:

In [76]:
s2 = s.rename("different")
print(s2.name)

different
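Note that ```rename``` returns a new Series and leaves the original untouched. A quick check, using fixed values instead of random ones for reproducibility:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name='something')
s2 = s.rename('different')

# The original keeps its name; only the returned Series is renamed.
print(s.name, s2.name)
```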


## See next notebook, D