
# Pandas Operations

In [0]:
import pandas as pd

In [0]:
df_one = pd.DataFrame({'k1': ['A', 'A', 'B', 'B', 'C', 'C'],
                       'col1': [100, 200, 300, 300, 400, 500],
                       'col2': ['NY', 'CA', 'WA','WA', 'AK', 'NV']})

In [5]:
df_one

Unnamed: 0,k1,col1,col2
0,A,100,NY
1,A,200,CA
2,B,300,WA
3,B,300,WA
4,C,400,AK
5,C,500,NV


## Get all unique values for a column

In [6]:
df_one['col2'].unique()

array(['NY', 'CA', 'WA', 'AK', 'NV'], dtype=object)

## Get the number of unique values in a column

In [7]:
df_one['col2'].nunique()

5
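
One detail worth knowing is how the two calls treat missing values. The sketch below uses a hypothetical Series (not part of the notebook's data) that contains one missing entry; as far as I know, unique() keeps the missing value while nunique() skips it unless dropna=False is passed.

```python
import pandas as pd

# Hypothetical Series with one missing value; not from the original data.
states = pd.Series(['NY', 'CA', 'WA', 'WA', None])

unique_values = states.unique()               # the missing value is kept
count_default = states.nunique()              # missing values are skipped
count_with_na = states.nunique(dropna=False)  # missing value counted once

print(unique_values)
print(count_default, count_with_na)
```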

## Only show unique rows

In [8]:
df_one.drop_duplicates()

Unnamed: 0,k1,col1,col2
0,A,100,NY
1,A,200,CA
2,B,300,WA
4,C,400,AK
5,C,500,NV
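
drop_duplicates compares every column by default and returns a new DataFrame, leaving the original untouched. A minimal sketch of two of its options, rebuilding the same data under the illustrative name df: subset restricts the comparison to the listed columns, and keep chooses which of the duplicate rows survives.

```python
import pandas as pd

df = pd.DataFrame({'k1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'col1': [100, 200, 300, 300, 400, 500],
                   'col2': ['NY', 'CA', 'WA', 'WA', 'AK', 'NV']})

# Compare only the 'k1' column; keep the first row seen for each key.
first_per_key = df.drop_duplicates(subset=['k1'])

# Same comparison, but keep the last row for each key instead.
last_per_key = df.drop_duplicates(subset=['k1'], keep='last')

print(first_per_key)
print(last_per_key)
```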


## Performing projections on column values
The first method is to define a function and pass it to apply, which calls the function once for each value in the column.

In [0]:
def grab_first_letter(word):
    """Return the first character of a string."""
    return word[0]

In [0]:
df_one['first letter'] = df_one['col2'].apply(grab_first_letter)

The second method does the same thing, but passes a lambda to apply instead of a named function.

In [0]:
df_one['second letter'] = df_one['col2'].apply(lambda x: x[1])


In [15]:
df_one

Unnamed: 0,k1,col1,col2,first letter,second letter
0,A,100,NY,N,Y
1,A,200,CA,C,A
2,B,300,WA,W,A
3,B,300,WA,W,A
4,C,400,AK,A,K
5,C,500,NV,N,V
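
For simple string operations like these, the .str accessor is an alternative to apply. The sketch below rebuilds the column as a standalone Series (the name col2 is only for illustration) and indexes into each string without writing a function.

```python
import pandas as pd

col2 = pd.Series(['NY', 'CA', 'WA', 'WA', 'AK', 'NV'])

# .str[i] takes the character at position i from every string in the Series.
first_letter = col2.str[0]
second_letter = col2.str[1]

# Same result as the apply-based version.
same_as_apply = first_letter.equals(col2.apply(lambda x: x[0]))

print(first_letter.tolist())
print(second_letter.tolist())
```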


## Remapping column values using the map function and a mapping dictionary

In [0]:
df_one['numbers'] = df_one['k1'].map({'A': 1, 'B': 2, 'C': 3})

In [17]:
df_one

Unnamed: 0,k1,col1,col2,first letter,second letter,numbers
0,A,100,NY,N,Y,1
1,A,200,CA,C,A,1
2,B,300,WA,W,A,2
3,B,300,WA,W,A,2
4,C,400,AK,A,K,3
5,C,500,NV,N,V,3
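
One thing to watch for: a value with no entry in the dictionary is mapped to NaN rather than raising an error. The sketch below adds a hypothetical key 'D' that the mapping does not cover, then fills the gap with a placeholder of -1 (an arbitrary choice for illustration).

```python
import pandas as pd

# 'D' is a hypothetical key that the mapping below does not cover.
k1 = pd.Series(['A', 'B', 'C', 'D'])

mapped = k1.map({'A': 1, 'B': 2, 'C': 3})

# The unmapped value becomes NaN, which also turns the column into floats.
# Filling with a placeholder lets us convert back to integers.
filled = mapped.fillna(-1).astype(int)

print(mapped)
print(filled)
```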


## Getting the max / min of a column, and their location

In [20]:
df_one['col1'].max()

500

In [21]:
df_one['col1'].idxmax()

5

In [22]:
df_one['col1'].min()

100

In [23]:
df_one['col1'].idxmin()

0
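
idxmax and idxmin return index labels, so they can be passed straight to loc to retrieve the whole row. A minimal sketch, rebuilding the original three columns under the illustrative name df:

```python
import pandas as pd

df = pd.DataFrame({'k1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'col1': [100, 200, 300, 300, 400, 500],
                   'col2': ['NY', 'CA', 'WA', 'WA', 'AK', 'NV']})

# Look up the full row holding the largest and smallest 'col1' values.
max_row = df.loc[df['col1'].idxmax()]
min_row = df.loc[df['col1'].idxmin()]

print(max_row)
print(min_row)
```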

## Get all columns from a DataFrame

In [24]:
df_one.columns

Index(['k1', 'col1', 'col2', 'first letter', 'second letter', 'numbers'], dtype='object')

Note that the columns property is not read-only: you can assign a new list of names to it, which renames every column in the DataFrame at once. The list must contain exactly one name per column.

In [0]:
df_one.columns = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6']

In [26]:
df_one

Unnamed: 0,c1,c2,c3,c4,c5,c6
0,A,100,NY,N,Y,1
1,A,200,CA,C,A,1
2,B,300,WA,W,A,2
3,B,300,WA,W,A,2
4,C,400,AK,A,K,3
5,C,500,NV,N,V,3
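
To rename only some columns, the rename method with a columns dictionary is an alternative to reassigning the whole list. The sketch below uses a small illustrative DataFrame; the new names 'key' and 'state' are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'k1': ['A', 'B'],
                   'col1': [100, 200],
                   'col2': ['NY', 'CA']})

# Only the columns listed in the dictionary are renamed.
# rename returns a new DataFrame; df itself is unchanged.
renamed = df.rename(columns={'k1': 'key', 'col2': 'state'})

print(renamed.columns.tolist())
print(df.columns.tolist())
```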


## Using concatenation to join two DataFrames

In [0]:
features = pd.DataFrame({'A': [100, 200, 300, 400, 500],
                         'B': [12, 13, 14, 15, 16]})

predictions = pd.DataFrame({'pred': [0, 1, 1, 0, 1]})

In [29]:
features

Unnamed: 0,A,B
0,100,12
1,200,13
2,300,14
3,400,15
4,500,16


In [30]:
predictions

Unnamed: 0,pred
0,0
1,1
2,1
3,0
4,1


By default, concat joins by row (axis=0). That is not what we want here: it stacks the prediction rows after the feature rows and fills every column a frame lacks with NaN.

In [31]:
pd.concat([features, predictions])


Unnamed: 0,A,B,pred
0,100.0,12.0,
1,200.0,13.0,
2,300.0,14.0,
3,400.0,15.0,
4,500.0,16.0,
0,,,0.0
1,,,1.0
2,,,1.0
3,,,0.0
4,,,1.0
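
Row-wise concatenation is the right tool when the frames share the same columns. Note that the index labels 0 to 4 repeat in the output above; passing ignore_index=True renumbers the rows instead. A minimal sketch, where more_features is a hypothetical second batch that is not part of the notebook:

```python
import pandas as pd

features = pd.DataFrame({'A': [100, 200], 'B': [12, 13]})

# Hypothetical second batch with the same columns.
more_features = pd.DataFrame({'A': [300, 400], 'B': [14, 15]})

# Default: each frame keeps its own index labels, so they repeat.
stacked = pd.concat([features, more_features])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([features, more_features], ignore_index=True)

print(stacked)
print(renumbered)
```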


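Concatenating along the column axis (axis=1) gives the result we want: the prediction column is placed next to the feature columns, with rows matched by index.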

In [32]:
pd.concat([features, predictions], axis=1)

Unnamed: 0,A,B,pred
0,100,12,0
1,200,13,1
2,300,14,1
3,400,15,0
4,500,16,1
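
Because axis=1 matches rows by index label rather than by position, frames with different indexes will not line up. The sketch below uses a hypothetical predictions frame whose index is shifted by one, then shows one possible fix: resetting the index so the rows pair up by position.

```python
import pandas as pd

features = pd.DataFrame({'A': [100, 200, 300], 'B': [12, 13, 14]})

# Hypothetical predictions whose index labels are shifted by one.
predictions = pd.DataFrame({'pred': [0, 1, 1]}, index=[1, 2, 3])

# Rows are matched by label, so labels 0 and 3 each get NaN on one side.
aligned = pd.concat([features, predictions], axis=1)

# Resetting the index first pairs the rows by position instead.
positional = pd.concat([features, predictions.reset_index(drop=True)], axis=1)

print(aligned)
print(positional)
```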


## Get Dummies
This function is useful for machine learning, where categorical values usually have to be converted to numbers (one-hot encoding). It creates one column for each distinct value in the specified column and sets that column to 1 in every row where the original value matches, and to 0 otherwise.

In [34]:
pd.get_dummies(df_one['c1'])

Unnamed: 0,A,B,C
0,1,0,0
1,1,0,0
2,0,1,0
3,0,1,0
4,0,0,1
5,0,0,1
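
In practice the dummy columns are usually joined back onto the rest of the data. The sketch below is one way to do that, using a small illustrative DataFrame. The prefix and dtype arguments are optional: prefix labels the new columns, and dtype=int asks for 1/0 output explicitly, since recent pandas versions (2.0 and later, to my knowledge) return booleans by default.

```python
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'A', 'B', 'C'],
                   'c2': [100, 200, 300, 400]})

# One indicator column per distinct value, named c1_A, c1_B, c1_C.
dummies = pd.get_dummies(df['c1'], prefix='c1', dtype=int)

# Replace the original categorical column with its indicator columns.
encoded = pd.concat([df.drop(columns='c1'), dummies], axis=1)

print(encoded)
```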
