# Pandas Python Library

Dropping a column or row.

In [None]:
data_frame.drop(["ColumnName"], axis=1)  # axis=1 drops a column, axis=0 drops a row
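
As a rough sketch of both cases, here is `drop` on a small made-up frame (the names `frame`, `x` and `y` are illustrative and not part of these notes). `drop` returns a new DataFrame, so the original is unchanged unless the result is assigned back.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
frame = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

without_y = frame.drop(["y"], axis=1)     # drop a column by label
without_row0 = frame.drop([0], axis=0)    # drop a row by index label

# Equivalent keyword forms that avoid remembering the axis number.
same_cols = frame.drop(columns=["y"])
same_rows = frame.drop(index=[0])
```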

### What can a DataFrame hold?

1. A Pandas __DataFrame__
2. A Pandas __Series__: a one-dimensional labeled array capable of holding any data type with axis labels or index. An example of a Series object is one column from a DataFrame. 
3. A NumPy __ndarray__, which can be record or structured.
4. A two-dimensional __ndarray__.
5. Dictionaries of one-dimensional __ndarray__ objects, lists, dictionaries or Series.
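
The inputs listed above can be sketched roughly as follows. All variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary of lists: keys become column names.
from_lists = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# From a dictionary of Series: rows are aligned on the Series index.
from_series = pd.DataFrame({"a": pd.Series([1, 2]), "b": pd.Series([3, 4])})

# From a two-dimensional ndarray: columns get default integer labels.
from_2d = pd.DataFrame(np.array([[1, 3], [2, 4]]))

# From a structured ndarray: field names become column names.
records = np.array([(1, 3), (2, 4)], dtype=[("a", "i8"), ("b", "i8")])
from_records = pd.DataFrame(records)

# A DataFrame can itself be passed to the constructor.
from_frame = pd.DataFrame(from_lists)
```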

You can specify the __index__ and __column__ names for your DataFrame. The index labels the rows, while the column names label the columns. These two components of a DataFrame come in very handy for selecting and manipulating the data.
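
A small sketch of setting both sets of labels at construction time and reading them back. The labels `r1`..`r3` and `c1`, `c2` are invented for the example.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])
labeled = pd.DataFrame(values, index=["r1", "r2", "r3"], columns=["c1", "c2"])

row_labels = list(labeled.index)        # labels of the rows
column_labels = list(labeled.columns)   # labels of the columns
```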

__loc__ and __iloc__ are used for index-based selection.

__loc__ is for when you want to select by index label.
__iloc__ is for when you want to select by integer position.

Both take a row selector followed by a column selector, so you can write df.loc[row, columns] or df.iloc[row, columns].
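
To make the difference concrete, here is a minimal sketch on an invented frame with string index labels (the `people` data is hypothetical).

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [31, 45, 27], "city": ["Oslo", "Lima", "Pune"]},
    index=["ann", "bob", "cy"],
)

by_label = people.loc["bob", "age"]    # label-based: row "bob", column "age"
by_position = people.iloc[1, 0]        # position-based: second row, first column

# Slices differ: loc includes the end label, iloc excludes the end position.
loc_slice = people.loc["ann":"bob", ["age"]]
iloc_slice = people.iloc[0:2, [0]]
```

Both slices select the same two rows here, which shows that the end label is included by loc while the end position is excluded by iloc.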

__corr()__ will give you pair-wise correlation of columns.
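
A quick sketch with made-up numbers, chosen so that the correlations are easy to predict: `y` is a multiple of `x`, and `z` decreases as `x` increases.

```python
import pandas as pd

scores = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

# Square matrix with one row and one column per numeric column.
# The default method is Pearson correlation.
matrix = scores.corr()
```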

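When looking at a method in pandas, check whether its name starts with arg. Methods such as argsort, argmax and argmin return integer positions rather than values, while their counterparts without the prefix (sort_values, max, min) return the sorted or selected values themselves. Note that sort_values returns a new object by default and only sorts in place when inplace=True is passed.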
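
A short sketch of the arg-prefixed methods next to their plain counterparts, on an invented Series.

```python
import pandas as pd

s = pd.Series([30, 10, 20], index=["a", "b", "c"])

positions = s.argsort()       # integer positions that would sort the values
largest_pos = s.argmax()      # integer position of the maximum
largest_label = s.idxmax()    # index label of the maximum

sorted_copy = s.sort_values() # new sorted Series; s itself is unchanged
```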

### Group by - Apply

GroupBy.apply(func, *args, **kwargs)

Apply function func group-wise and combine the results together. 

The function passed to apply must take a dataframe as its first argument and return a dataframe, a series or a scalar.

In [2]:
import pandas as pd

df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
g = df.groupby('A')

In [9]:
df

Unnamed: 0,A,B,C
0,a,1,4
1,a,2,6
2,b,3,5


In [8]:
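# Select the numeric columns explicitly; dividing the string column A would fail.
df.groupby('A', group_keys=False)[['B', 'C']].apply(lambda x: x / x.sum())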

Unnamed: 0,B,C
0,0.333333,0.4
1,0.666667,0.6
2,1.0,1.0


### Filter

DataFrame.filter(items=None, like=None, regex=None, axis=None)

Subset the rows or columns of a DataFrame according to the labels of the specified axis. Note that this routine does not filter a DataFrame on its contents; the filter is applied to the labels only. For a DataFrame the default axis is the columns; pass axis=0 to filter on the row index instead.

In [17]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.filter(items=["A"])

Unnamed: 0,A
0,1
1,2
2,3
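
The other parameters in the signature can be sketched like this. The frame and its labels are made up for the example.

```python
import pandas as pd

table = pd.DataFrame(
    {"price_usd": [1, 2], "price_eur": [3, 4], "qty": [5, 6]},
    index=["row_a", "row_b"],
)

exact = table.filter(items=["qty"])       # exact column labels
substring = table.filter(like="price")    # labels containing a substring
pattern = table.filter(regex="_eur$")     # labels matching a regular expression
rows = table.filter(like="_b", axis=0)    # same idea applied to the row index
```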


### Group By - Filter

DataFrameGroupBy.filter(func, dropna=True, *args, **kwargs)

Return a copy of the DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func.

In [18]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')
grouped.filter(lambda x: x['B'].mean() > 3.)

Unnamed: 0,A,B,C
1,bar,2,5.0
3,bar,4,1.0
5,bar,6,9.0


### Group By - Aggregate

DataFrameGroupBy.agg(arg, *args, **kwargs)

Aggregate using one or more operations over the specified axis. 

In [23]:
import numpy as np

df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [3, 2, 3, 4],
                   'C': np.random.randn(4)})

In [24]:
df.groupby('A').agg('min')

A,B,C
1,2,-0.432306
2,3,-0.290737
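
Since agg accepts one or more operations, here is a rough sketch of passing several at once. The `sales` frame uses fixed made-up values instead of random numbers so that the results are predictable.

```python
import pandas as pd

sales = pd.DataFrame({"A": [1, 1, 2, 2], "B": [3, 2, 3, 4], "C": [1.0, 2.0, 3.0, 5.0]})

# Several operations on every column: the result has two-level column labels.
multi = sales.groupby("A").agg(["min", "max"])

# Different operations per column, passed as a dictionary.
per_column = sales.groupby("A").agg({"B": "sum", "C": "mean"})
```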


### Dataframe Transform

__transform()__ seems like it could be highly useful. It is an operation used in conjunction with groupby.

While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output has the same shape as the input. A common example is to center the data by subtracting the group-wise mean.

So transform always returns an object of the same size as its input.

In [26]:
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [3, 2, 3, 4],
                   'C': [-1, 0, 1, 2]})
df

Unnamed: 0,A,B,C
0,1,3,-1
1,1,2,0
2,2,3,1
3,2,4,2


In [27]:
df.groupby("A")["B"].transform("sum")

0    5
1    5
2    7
3    7
Name: B, dtype: int64
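
The centering example mentioned earlier can be sketched with the same data. The variable names `frame`, `group_mean` and `centered` are chosen here for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 1, 2, 2], "B": [3, 2, 3, 4], "C": [-1, 0, 1, 2]})

# transform broadcasts each group's mean back to every row of that group.
group_mean = frame.groupby("A")["B"].transform("mean")

# Subtracting it centers B within each group.
centered = frame["B"] - group_mean
```

Because the transformed result has one value per original row, it lines up with `frame["B"]` and the subtraction works row by row.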

### Using logic in pandas

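Logical operators in pandas work element-wise and use the single-character forms: & for and, | for or, and ~ for not, rather than && or Python's and, or and not keywords. Wrap each comparison in parentheses, because & and | bind more tightly than comparison operators such as > and <.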

Below is a code example.

In [None]:
train[(train["GrLivArea"] > 4000) & (train["SalePrice"] < 400000)] 
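
For a version that runs on its own, here is a sketch with a hypothetical stand-in for `train`. The `houses` frame and its values are made up; only the column names are taken from the example above.

```python
import pandas as pd

# Hypothetical stand-in for the `train` data; the values are invented.
houses = pd.DataFrame(
    {"GrLivArea": [1500, 4500, 5000], "SalePrice": [200000, 180000, 750000]}
)

big = houses["GrLivArea"] > 4000
cheap = houses["SalePrice"] < 400000

both = houses[big & cheap]      # element-wise AND
either = houses[big | cheap]    # element-wise OR
not_big = houses[~big]          # element-wise NOT
```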