# Learn X in Y Minutes
## Where X=Pandas

_Inspired by the popular [learnxinyminutes](https://learnxinyminutes.com/)._

Pandas takes its name from "panel data", an econometrics term for multidimensional structured datasets, and doubles as a play on "Python data analysis". It provides a rich set of features for exploring and manipulating data, making it the go-to toolkit for many data scientists.

Since Pandas is a Python library, you might want to check out [Python](https://learnxinyminutes.com/docs/python/) first.

---

# Basics
Let's get the hang of Pandas!

In [1]:
# import pandas
import pandas as pd

In [2]:
# 0.22 was the latest release when this was written
pd.__version__

u'0.22.0'

In [3]:
# A pandas DataFrame is like an SQL table: made up of rows and columns
import random
df = pd.DataFrame([[random.randint(0, 9) for _ in range(10)] for _ in range(5)],
                  index=list(range(5)),
                  columns=list('abcdefghij'))

df # => a 5*10 matrix/table

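|   | a | b | c | d | e | f | g | h | i | j |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 4 | 4 | 1 | 0 | 0 | 9 | 1 | 1 | 0 |
| 1 | 4 | 3 | 4 | 8 | 4 | 2 | 4 | 5 | 2 | 9 |
| 2 | 2 | 5 | 8 | 6 | 6 | 3 | 3 | 4 | 4 | 8 |
| 3 | 0 | 0 | 6 | 3 | 5 | 1 | 7 | 1 | 4 | 0 |
| 4 | 3 | 0 | 7 | 4 | 5 | 4 | 1 | 8 | 6 | 3 |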
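
Because the cell above draws random numbers, its values change on every run. A minimal sketch of a reproducible variant, assuming a fixed seed (42 is arbitrary) and the dict-of-columns form of the constructor rather than the list-of-rows form used above:

```python
import random

import pandas as pd

random.seed(42)  # arbitrary seed, chosen only so reruns give the same table

# Build the frame column by column: each dict key becomes a column name
data = {col: [random.randint(0, 9) for _ in range(5)] for col in 'abcdefghij'}
df = pd.DataFrame(data)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column
```

When no `index` is passed, pandas numbers the rows from 0, which matches the explicit index in the original cell.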


In [4]:
# Add another column (one label per row)
df['grp'] = ['a', 'b', 'a', 'b', 'a']

In [5]:
# ..column headers
df.columns

Index([u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'grp'], dtype='object')

In [6]:
# ..just the first rows
df.head(2)

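|   | a | b | c | d | e | f | g | h | i | j | grp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 4 | 4 | 1 | 0 | 0 | 9 | 1 | 1 | 0 | a |
| 1 | 4 | 3 | 4 | 8 | 4 | 2 | 4 | 5 | 2 | 9 | b |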


In [7]:
# ..or the last rows
df.tail(2)

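|   | a | b | c | d | e | f | g | h | i | j | grp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 6 | 3 | 5 | 1 | 7 | 1 | 4 | 0 | b |
| 4 | 3 | 0 | 7 | 4 | 5 | 4 | 1 | 8 | 6 | 3 | a |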


In [8]:
# Select specific columns
df[['d', 'f']]

|   | d | f |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 8 | 2 |
| 2 | 6 | 3 |
| 3 | 3 | 1 |
| 4 | 4 | 4 |


In [9]:
# ..rows (a plain slice selects by position; the end is exclusive)
df[3:5]

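|   | a | b | c | d | e | f | g | h | i | j | grp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0 | 0 | 6 | 3 | 5 | 1 | 7 | 1 | 4 | 0 | b |
| 4 | 3 | 0 | 7 | 4 | 5 | 4 | 1 | 8 | 6 | 3 | a |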
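
With the default integer index, positions and labels look identical, which hides the difference between the two indexers. A small sketch, assuming a hypothetical string index (`r0`..`r4`) so the distinction between `.iloc` (position) and `.loc` (label) is visible:

```python
import pandas as pd

# Two columns from the table above, relabelled with hypothetical string labels
df = pd.DataFrame({'d': [1, 8, 6, 3, 4], 'f': [0, 2, 3, 1, 4]},
                  index=['r0', 'r1', 'r2', 'r3', 'r4'])

by_position = df.iloc[3:5]       # positions 3 and 4; the end is exclusive
by_label = df.loc['r3':'r4']     # labels r3 through r4; the end is inclusive
single_cell = df.loc['r1', 'd']  # row label first, then column label

print(by_position)
print(single_cell)
```

Both slices return the same two rows here; they differ only in how the bounds are interpreted.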


In [10]:
# Rename columns
df = df.rename(columns = {'a': 'aa', 'b': 'bb'})
list(df.columns)

['aa', 'bb', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'grp']

In [11]:
# Create a new column from other columns; the addition is element-wise, aligned on the index
df['ij'] = df['i'] + df['j']
df[1:3]

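|   | aa | bb | c | d | e | f | g | h | i | j | grp | ij |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 3 | 4 | 8 | 4 | 2 | 4 | 5 | 2 | 9 | b | 11 |
| 2 | 2 | 5 | 8 | 6 | 6 | 3 | 3 | 4 | 4 | 8 | a | 12 |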
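
Assigning with `df['ij'] = ...` modifies the frame in place. As a sketch of a non-mutating alternative, `DataFrame.assign` returns a new frame with the extra column; the `i` and `j` values below are copied from the table above:

```python
import pandas as pd

df = pd.DataFrame({'i': [1, 2, 4, 4, 6], 'j': [0, 9, 8, 0, 3]})

# assign() returns a copy with the new column; df itself is left untouched.
# The lambda receives the frame being assigned to.
out = df.assign(ij=lambda d: d['i'] + d['j'])

print(out)
print(list(df.columns))  # still only the original columns
```

This form is handy inside method chains, where there is no variable name to assign back to.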


There are more ways to slice and view the data.

In [12]:
# Filter rows by passing a boolean mask to .loc
df.loc[df['f'] % 2 == 0]

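|   | aa | bb | c | d | e | f | g | h | i | j | grp | ij |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 4 | 4 | 1 | 0 | 0 | 9 | 1 | 1 | 0 | a | 1 |
| 1 | 4 | 3 | 4 | 8 | 4 | 2 | 4 | 5 | 2 | 9 | b | 11 |
| 4 | 3 | 0 | 7 | 4 | 5 | 4 | 1 | 8 | 6 | 3 | a | 9 |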


In [13]:
# Show only the wanted columns: .loc takes the row mask first, then the column list
df.loc[df['f'] > 5, ['f', 'bb']] # => empty here, because no row has f > 5

|   | f | bb |
|---|---|---|


In [14]:
# Filters accept any expression that evaluates to a True / False mask; combine masks with | (or) and & (and)
df.loc[df['aa'].isin([3, 5]) | (df['bb'] < 4), ['aa', 'bb']]

|   | aa | bb |
|---|---|---|
| 0 | 3 | 4 |
| 1 | 4 | 3 |
| 3 | 0 | 0 |
| 4 | 3 | 0 |
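
The same filter can be written in a few equivalent ways. The sketch below rebuilds the `aa` and `bb` columns from the table above and shows the mask form, its negation with `~`, and `DataFrame.query`, which takes the condition as a string:

```python
import pandas as pd

df = pd.DataFrame({'aa': [3, 4, 2, 0, 3], 'bb': [4, 3, 5, 0, 0]})

# Each comparison needs its own parentheses: | and & bind tighter than < or ==
mask = df['aa'].isin([3, 5]) | (df['bb'] < 4)

masked = df.loc[mask]                          # rows where the mask is True
negated = df.loc[~mask]                        # ~ flips the mask
queried = df.query('aa in [3, 5] or bb < 4')   # same condition as a string

print(masked)
print(negated)
```

`query` reads closer to SQL, while explicit masks are easier to build up programmatically; which one to use is largely a matter of taste.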


In [15]:
# Group and aggregate!
df['grp'] = ['a', 'b', 'a', 'b', 'a'] # => a column with discrete values to group on
df.groupby(['grp']).agg({'c': 'sum', 'd': 'mean', 'e': 'min', 'f': 'max'})[['c', 'd', 'e', 'f']]

# => rows are grouped by the categorical column, then each column is aggregated with the function given for it
# agg accepts a dict of {column: function name or numpy function}

| grp | c | d | e | f |
|---|---|---|---|---|
| a | 19 | 3.666667 | 0 | 4 |
| b | 10 | 5.5 | 4 | 2 |
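
The dict form above names each output column after its input column. A sketch of named aggregation, which lets you choose the output names; this assumes pandas 0.25 or newer (it is not available in the 0.22 release shown earlier), and the output names `c_total`, `d_mean` and `n_rows` are made up for illustration:

```python
import pandas as pd

# grp, c and d copied from the table above
df = pd.DataFrame({'grp': ['a', 'b', 'a', 'b', 'a'],
                   'c': [4, 4, 8, 6, 7],
                   'd': [1, 8, 6, 3, 4]})

# Each keyword is an output column: name=(input column, aggregation)
summary = df.groupby('grp').agg(c_total=('c', 'sum'),
                                d_mean=('d', 'mean'),
                                n_rows=('c', 'count'))

print(summary)
```

The result has one row per group and keeps the keyword order, so no trailing column selection is needed.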


In [16]:
# ..and order / sort too
df.sort_values(by='ij', ascending=False)[['ij']].head()

|   | ij |
|---|---|
| 2 | 12 |
| 1 | 11 |
| 4 | 9 |
| 3 | 4 |
| 0 | 1 |
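
`sort_values` also takes several keys, each with its own direction. A short sketch using the `grp` and `ij` values from above, plus `nlargest` as a shortcut for "sort descending, then take the head":

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'b', 'a', 'b', 'a'],
                   'ij': [1, 11, 12, 4, 9]})

# Sort by grp ascending, then by ij descending within each group
ordered = df.sort_values(by=['grp', 'ij'], ascending=[True, False])

# The two rows with the largest ij
top_two = df.nlargest(2, 'ij')

print(ordered)
print(top_two)
```

The original row labels travel with the rows; call `reset_index(drop=True)` if you want them renumbered after sorting.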


In [17]:
# You can chain operations; wrapping the chain in parentheses avoids backslashes.
(df.loc[df['ij'] > 10]
   .groupby(['grp'])
   .agg('sum')
   .sort_values(['ij'], ascending=[True]))

| grp | aa | bb | c | d | e | f | g | h | i | j | ij |
|---|---|---|---|---|---|---|---|---|---|---|---|
| b | 4 | 3 | 4 | 8 | 4 | 2 | 4 | 5 | 2 | 9 | 11 |
| a | 2 | 5 | 8 | 6 | 6 | 3 | 3 | 4 | 4 | 8 | 12 |
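
Custom steps can join a chain through `DataFrame.pipe`, and `.loc` accepts a callable so a filter can refer to columns created earlier in the same chain. A sketch under those assumptions; `add_total` is a hypothetical helper, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'b', 'a', 'b', 'a'],
                   'i': [1, 2, 4, 4, 6],
                   'j': [0, 9, 8, 0, 3]})


def add_total(frame, cols, name):
    """Hypothetical helper: return a copy with a row-wise sum of `cols`."""
    return frame.assign(**{name: frame[cols].sum(axis=1)})


result = (
    df.pipe(add_total, cols=['i', 'j'], name='ij')  # custom step
      .loc[lambda d: d['ij'] > 10]                  # filter on the new column
      .groupby('grp')
      .sum()
      .sort_values('ij')
)

print(result)
```

Each step returns a new frame, so `df` is unchanged at the end of the chain.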


In [18]:
# This is how you join. merge() does an inner join by default, so only matching rows survive.
df_2 = pd.DataFrame([[random.randint(0, 9) for _ in range(2)] for _ in range(5)],
                  index=list(range(5)),
                  columns=['b_2', 'c_2']) # => another df

df[['bb', 'c']].merge(df_2, left_on='c', right_on='c_2')

|   | bb | c | b_2 | c_2 |
|---|---|---|---|---|
| 0 | 0 | 6 | 0 | 6 |
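
An inner join silently drops rows without a match. The sketch below keeps every left-hand row with `how='left'` and uses `indicator=True` to record where each row came from; the `right` frame holds made-up values chosen so that exactly one key matches:

```python
import pandas as pd

left = pd.DataFrame({'bb': [4, 3, 5, 0, 0], 'c': [4, 4, 8, 6, 7]})
right = pd.DataFrame({'b_2': [0, 5], 'c_2': [6, 9]})  # hypothetical values

joined = left.merge(right, how='left', left_on='c', right_on='c_2',
                    indicator=True)  # adds a _merge column

print(joined)
```

Rows without a partner get `NaN` in the right-hand columns and `left_only` in `_merge`, which makes unmatched keys easy to audit.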


Here are a few more things you will need.

In [19]:
# Append datasets row-wise; columns are matched by name
df_3 = pd.DataFrame([[random.randint(0, 9) for _ in range(10)] for _ in range(5)],
                  index=list(range(5)),
                  columns=list('abcdefghij'))

# Note: DataFrame.append was removed in pandas 2.0; use pd.concat([df, df_3]) there
df.append(df_3).reset_index()[3:7] # => notice NaN values in columns that didn't match

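|   | index | a | aa | b | bb | c | d | e | f | g | grp | h | i | ij | j |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | NaN | 0.0 | NaN | 0.0 | 6 | 3 | 5 | 1 | 7 | b | 1 | 4 | 4.0 | 0 |
| 4 | 4 | NaN | 3.0 | NaN | 0.0 | 7 | 4 | 5 | 4 | 1 | a | 8 | 6 | 9.0 | 3 |
| 5 | 0 | 0.0 | NaN | 3.0 | NaN | 4 | 4 | 4 | 1 | 5 | NaN | 9 | 9 | NaN | 9 |
| 6 | 1 | 3.0 | NaN | 7.0 | NaN | 1 | 0 | 0 | 9 | 7 | NaN | 4 | 6 | NaN | 6 |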
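
`pd.concat` does the same row-wise stacking and is available in both old and current pandas releases. A minimal sketch with two small frames (hypothetical names `top` and `bottom`) whose columns only partly overlap:

```python
import pandas as pd

top = pd.DataFrame({'aa': [3, 4], 'c': [4, 4]})
bottom = pd.DataFrame({'a': [0, 3], 'c': [4, 1]})

# ignore_index=True renumbers the rows instead of repeating 0, 1, 0, 1
stacked = pd.concat([top, bottom], ignore_index=True)

print(stacked)
```

Columns present in only one input are filled with `NaN` for the rows that came from the other, just as in the append output above.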


In [20]:
# Lambda goodness! apply() runs the function on each column of the frame
df[['bb']].apply(lambda x: x ** 2)

|   | bb |
|---|---|
| 0 | 16 |
| 1 | 9 |
| 2 | 25 |
| 3 | 0 |
| 4 | 0 |
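
`apply` is the general-purpose tool, but simple arithmetic rarely needs it. The sketch below, using the `aa` and `bb` values from above, contrasts a vectorized expression, a row-wise `apply` with `axis=1`, and an element-wise `Series.map`:

```python
import pandas as pd

df = pd.DataFrame({'aa': [3, 4, 2, 0, 3], 'bb': [4, 3, 5, 0, 0]})

# Vectorized: operates on the whole column at once, no lambda needed
squared = df['bb'] ** 2

# Row-wise: axis=1 hands each row to the function as a Series
row_max = df.apply(lambda row: max(row['aa'], row['bb']), axis=1)

# Element-wise on a single column
labels = df['bb'].map(lambda v: 'zero' if v == 0 else 'nonzero')

print(squared.tolist())
print(row_max.tolist())
print(labels.tolist())
```

Row-wise `apply` calls Python once per row, so it is usually the slowest of the three on large frames.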


In [21]:
# Of course, there's a pivot in there.
df.pivot_table(columns='grp', aggfunc='mean') # => turns each distinct 'grp' value into a column

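| grp | a | b |
|---|---|---|
| aa | 2.666667 | 2.0 |
| bb | 3.0 | 1.5 |
| c | 6.333333 | 5.0 |
| d | 3.666667 | 5.5 |
| e | 3.666667 | 4.5 |
| f | 2.333333 | 1.5 |
| g | 4.333333 | 5.5 |
| h | 4.333333 | 3.0 |
| i | 3.666667 | 3.0 |
| ij | 7.333333 | 7.5 |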
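
A pivot becomes more useful with a second key on the rows. The sketch below adds a hypothetical `half` column (not part of the original data) and spells out `index`, `columns` and `values` explicitly:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'b', 'a', 'b', 'a'],
                   'half': ['x', 'x', 'y', 'y', 'y'],  # hypothetical second key
                   'c': [4, 4, 8, 6, 7]})

# One row per 'half', one column per 'grp', each cell the mean of 'c'
table = df.pivot_table(index='half', columns='grp', values='c', aggfunc='mean')

print(table)
```

Naming `values` keeps the result to one measure; leaving it out aggregates every remaining column, as in the cell above.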


In [22]:
# A must-have for analysts, describe().. <3
df.describe()

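|   | aa | bb | c | d | e | f | g | h | i | j | ij |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 2.4 | 2.4 | 5.8 | 4.4 | 4.0 | 2.0 | 4.8 | 3.8 | 3.4 | 4.0 | 7.4 |
| std | 1.516575 | 2.302173 | 1.788854 | 2.701851 | 2.345208 | 1.581139 | 3.193744 | 2.949576 | 1.949359 | 4.301163 | 4.722288 |
| min | 0.0 | 0.0 | 4.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | 4.0 | 3.0 | 4.0 | 1.0 | 3.0 | 1.0 | 2.0 | 0.0 | 4.0 |
| 50% | 3.0 | 3.0 | 6.0 | 4.0 | 5.0 | 2.0 | 4.0 | 4.0 | 4.0 | 3.0 | 9.0 |
| 75% | 3.0 | 4.0 | 7.0 | 6.0 | 5.0 | 3.0 | 7.0 | 5.0 | 4.0 | 8.0 | 11.0 |
| max | 4.0 | 5.0 | 8.0 | 8.0 | 6.0 | 4.0 | 9.0 | 8.0 | 6.0 | 9.0 | 12.0 |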
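
Notice that `grp` is missing from the summary: by default `describe()` covers numeric columns only. A short sketch of two ways to bring text columns in, using the `grp` and `ij` values from above:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'b', 'a', 'b', 'a'],
                   'ij': [1, 11, 12, 4, 9]})

numeric = df.describe()                  # numeric columns only
everything = df.describe(include='all')  # adds unique / top / freq for text columns
counts = df['grp'].value_counts()        # frequency of each distinct value

print(everything)
print(counts)
```

Statistics that do not apply to a column's type show up as `NaN` in the combined summary.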


# In Action
Let's use an actual dataset and do some basic descriptive analysis, applying your new Pandas skills along the way.

In [None]:
# Maybe some kaggle stuff here

# Useful Links
- [Official Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)   
- [Pandas Cheatsheet (PDF)](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)   
- [Python for Data Analysis (Book)](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=pd_sim_14_2?_encoding=UTF8&psc=1&refRID=TJ9Q3J20VHGT3KFN273Q)   
- [Dataschool.io Series on Pandas (Videos)](http://www.dataschool.io/easier-data-analysis-with-pandas/)   
- [Intro to Data Analysis (MOOC)](https://www.udacity.com/course/intro-to-data-analysis--ud170)   