## Getting started with pandas

In [5]:
import pandas as pd
from pandas import Series, DataFrame

*Series* and *DataFrame* are the two most important data structures in pandas.
A Series is like a one-dimensional NumPy array, but it also carries an index that labels each element.

In [5]:
obj = pd.Series([-1,0,3,4])
obj

0   -1
1    0
2    3
3    4
dtype: int64

In [6]:
obj.values

array([-1,  0,  3,  4])

In [7]:
obj.index

RangeIndex(start=0, stop=4, step=1)

You can also choose the index labels yourself:

In [8]:
obj2 = pd.Series([-1,0,3,4], index=['a','b','c','d'])
obj2

a   -1
b    0
c    3
d    4
dtype: int64

In [10]:
obj2['b']

0

In [12]:
obj2[['a','c','d']]

a   -1
c    3
d    4
dtype: int64

___You can also create a Series from a Python dictionary!___
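
As a minimal sketch of the dictionary route: the keys become the index and the values become the data. The population figures and the `'Utah'` label are made up purely for illustration.

```python
import pandas as pd

# Hypothetical figures, used only for illustration.
populations = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000}

# Dict keys become the index, dict values become the data.
from_dict = pd.Series(populations)

# An explicit index selects and orders the keys;
# a label that is not in the dict ('Utah') gets NaN.
reindexed = pd.Series(populations, index=['Texas', 'Utah', 'Ohio'])

print(from_dict)
print(reindexed)
```

Because `'Utah'` has no value, the second Series is stored as floats so that it can hold NaN.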

Some key points:
* A Series can be altered in place
* A Series has a `name`, and arithmetic between two Series aligns them on their index labels (like a join in SQL)
* Missing data is represented as NaN
    * functions for detecting it: `isnull` / `notnull`
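
The key points above can be sketched in a few lines. The Series names and values here are assumptions chosen for the example, not taken from the notebook.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='left')
other = pd.Series([10.0, 20.0], index=['b', 'c'], name='right')

# In-place alteration: assign to an existing label.
s['a'] = 100.0

# Arithmetic aligns on index labels; 'a' has no partner in `other`,
# so the result is NaN at that label.
total = s + other

# Detect missing values.
missing_mask = total.isnull()
present = total[total.notnull()]

print(total)
print(present)
```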
    
## DataFrame
A DataFrame is a table of rows and columns in which each column can hold a different type (a heterogeneous table).
There are many ways to construct one, but a common way is from a dict of equal-length lists: each entry in the dict becomes a column, and its values fill the rows.

In [19]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
      'year': [2000, 2001, 2002, 2001, 2002, 2003],
      'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

|   | state | year | pop |
|---|-------|------|-----|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |


In [21]:
frame.head() # select the first 5 rows

|   | state | year | pop |
|---|-------|------|-----|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


In [24]:
ordered_df = pd.DataFrame(data, columns=['year', 'pop', 'state'])  # you can pass fewer columns, or names not in the data (those columns are filled with NaN)
ordered_df

|   | year | pop | state |
|---|------|-----|-------|
| 0 | 2000 | 1.5 | Ohio |
| 1 | 2001 | 1.7 | Ohio |
| 2 | 2002 | 3.6 | Ohio |
| 3 | 2001 | 2.4 | Nevada |
| 4 | 2002 | 2.9 | Nevada |
| 5 | 2003 | 3.2 | Nevada |


In [26]:
ordered_df['state'] # get a Series from a dataframe
# alternative: ordered_df.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [31]:
missing_df = pd.DataFrame(data, columns=['year','pop', 'debt'], index=['a','b','c','d','e','f'])
missing_df

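|   | year | pop | debt |
|---|------|-----|------|
| a | 2000 | 1.5 | NaN |
| b | 2001 | 1.7 | NaN |
| c | 2002 | 3.6 | NaN |
| d | 2001 | 2.4 | NaN |
| e | 2002 | 2.9 | NaN |
| f | 2003 | 3.2 | NaN |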


In [30]:
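debt_df = missing_df.copy()  # work on a copy so missing_df keeps its NaN debt column
debt_df['debt'] = 1.56  # assigning a scalar fills every row of the column
debt_df

|   | year | pop | debt |
|---|------|-----|------|
| a | 2000 | 1.5 | 1.56 |
| b | 2001 | 1.7 | 1.56 |
| c | 2002 | 3.6 | 1.56 |
| d | 2001 | 2.4 | 1.56 |
| e | 2002 | 2.9 | 1.56 |
| f | 2003 | 3.2 | 1.56 |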


In [32]:
missing_df.loc['e']

year    2002
pop      2.9
debt     NaN
Name: e, dtype: object

In [43]:
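# add a boolean column that flags the rows from the year 2000
# (the `eastern` column in the output is not created by any cell shown here; it was already in missing_df when the output was recorded)
missing_df['y2k'] = missing_df['year'] == 2000
missing_df

|   | year | pop | debt | eastern | y2k |
|---|------|-----|------|---------|-----|
| a | 2000 | 1.5 | NaN | True | True |
| b | 2001 | 1.7 | NaN | False | False |
| c | 2002 | 3.6 | NaN | False | False |
| d | 2001 | 2.4 | NaN | False | False |
| e | 2002 | 2.9 | NaN | False | False |
| f | 2003 | 3.2 | NaN | False | False |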


In [44]:
del missing_df['y2k']
missing_df

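|   | year | pop | debt | eastern |
|---|------|-----|------|---------|
| a | 2000 | 1.5 | NaN | True |
| b | 2001 | 1.7 | NaN | False |
| c | 2002 | 3.6 | NaN | False |
| d | 2001 | 2.4 | NaN | False |
| e | 2002 | 2.9 | NaN | False |
| f | 2003 | 3.2 | NaN | False |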


In [45]:
missing_df.T

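|   | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| year | 2000 | 2001 | 2002 | 2001 | 2002 | 2003 |
| pop | 1.5 | 1.7 | 3.6 | 2.4 | 2.9 | 3.2 |
| debt | NaN | NaN | NaN | NaN | NaN | NaN |
| eastern | True | False | False | False | False | False |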


In [47]:
missing_df.columns.name='cols'
missing_df.index.name='rows'
missing_df

| cols | year | pop | debt | eastern |
|------|------|-----|------|---------|
| rows |      |     |      |         |
| a | 2000 | 1.5 | NaN | True |
| b | 2001 | 1.7 | NaN | False |
| c | 2002 | 3.6 | NaN | False |
| d | 2001 | 2.4 | NaN | False |
| e | 2002 | 2.9 | NaN | False |
| f | 2003 | 3.2 | NaN | False |


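## Arithmetic and data alignment
By default, arithmetic between two DataFrames behaves like an outer join on the row and column labels: the result contains the union of the labels, with NaN wherever a label is present in only one of the operands.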

When an operation involves a DataFrame and a Series, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows, so every row is combined with the same Series object.
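
A small sketch of both behaviours, using made-up frames (`df1`, `df2`) and a made-up Series (`row`) that are not part of the notebook:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]}, index=['a', 'b'])
df2 = pd.DataFrame({'y': [10.0, 20.0], 'z': [30.0, 40.0]}, index=['b', 'c'])

# Outer-join behaviour: the result has rows a, b, c and columns x, y, z.
# Only ('b', 'y') exists in both frames, so every other cell is NaN.
summed = df1 + df2

# fill_value=0 treats a value missing on one side as 0;
# cells missing on both sides stay NaN.
filled = df1.add(df2, fill_value=0)

# DataFrame and Series: the Series index is matched on the columns
# and the subtraction is repeated for every row.
row = pd.Series({'x': 1.0, 'y': 3.0})
diff = df1 - row

print(summed)
print(filled)
print(diff)
```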

* `apply` applies a function along an axis, i.e. to each column or to each row
* `applymap` applies a function to each element of the DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map`)
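
The following sketch shows both kinds of function application. The frame and the helper `value_range` are hypothetical. Because `applymap` was renamed to `DataFrame.map` in pandas 2.1, the example picks whichever of the two the installed version provides.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [2.0, 3.0, 10.0]})

def value_range(values):
    """Spread between the largest and smallest value of a Series."""
    return values.max() - values.min()

# apply: the function receives one column at a time (default axis=0) ...
per_column = frame.apply(value_range)
# ... or one row at a time with axis=1.
per_row = frame.apply(value_range, axis=1)

# Element-wise application: DataFrame.map on pandas >= 2.1, applymap before that.
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(lambda v: f'{v:.1f}')

print(per_column)
print(per_row)
print(formatted)
```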

## Computing descriptive statistics

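To compute a group of summary statistics in one call:
`dataFrame.describe()`

Reductions such as `sum` and `mean` skip NA values by default; pass `skipna=False` to include them, in which case the result is NaN wherever a value is missing.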
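
A minimal sketch of the NA handling, assuming a small made-up frame with one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [4.0, 5.0, 6.0]})

# describe() reports count, mean, std, min, quartiles and max per column;
# the count for 'one' only includes the non-missing values.
summary = df.describe()

# Default: NaN is skipped, so the mean of 'one' uses 1.0 and 3.0 only.
skipping = df.mean()

# skipna=False: any NaN in a column makes that column's result NaN.
strict = df.mean(skipna=False)

print(summary)
print(skipping)
print(strict)
```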

## Correlation and Covariance

The common methods, `corr` and `cov`, are built in for both Series and DataFrame.
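
As an illustration with made-up data: `y` is an exact multiple of `x` and `z` decreases as `x` increases, so the correlations are easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.0, 6.0, 8.0],
                   'z': [4.0, 3.0, 2.0, 1.0]})

# Between two Series: a single number.
x_y_corr = df['x'].corr(df['y'])   # perfectly correlated
x_z_corr = df['x'].corr(df['z'])   # perfectly anti-correlated
x_y_cov = df['x'].cov(df['y'])     # sample covariance

# On a DataFrame: the full pairwise matrix.
corr_matrix = df.corr()
cov_matrix = df.cov()

print(corr_matrix)
print(cov_matrix)
```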

## Set operations

* `value_counts`: counts how often each distinct value appears in a Series
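
A short sketch with a made-up Series of letters:

```python
import pandas as pd

letters = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# One entry per distinct value, sorted by count in descending order.
counts = letters.value_counts()

# normalize=True returns relative frequencies instead of raw counts.
shares = letters.value_counts(normalize=True)

print(counts)
print(shares)
```

The result is itself a Series whose index holds the distinct values, so a single count can be looked up with `counts['a']`.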