## Python Data Analysis Library - Pandas - Basic Commands

In [None]:
import pandas

Pandas provides convenient functions to read and parse data from many file formats. Let's read sample data from a TSV file. As in the previous classes, we can use the help() function to get help with unfamiliar commands.

In [None]:
help(pandas.read_csv)

Notice that our file is tab-separated, not comma-separated. Let's see what happens if we use the command with its default settings.

In [None]:
pandas.read_csv('gapminder.tsv')

Well, that did not work. Let's specify that our file uses TAB as the separator.

In [None]:
pandas.read_csv('gapminder.tsv', sep='\t')

OK, it worked. We can assign the output of read_csv to a variable and use it later.

In [None]:
df = pandas.read_csv('gapminder.tsv', sep='\t')
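
To see the effect of the separator without needing gapminder.tsv on disk, here is a minimal sketch that reads a small, made-up tab-separated sample from memory. The sample text and the names tsv_text, wrong and right are assumptions for illustration only. Because this sample contains no commas, the default separator does not raise an error here; it silently puts each whole line into one column.

```python
import io
import pandas

# Made-up, tab-separated sample standing in for gapminder.tsv
# (values are illustrative only, not real data)
tsv_text = (
    "country\tcontinent\tyear\tlifeExp\n"
    "CountryA\tAsia\t1952\t40.0\n"
    "CountryA\tAsia\t1957\t42.0\n"
    "CountryB\tEurope\t1952\t60.0\n"
)

# Default separator is a comma, so every line is read as a single column
wrong = pandas.read_csv(io.StringIO(tsv_text))

# With sep='\t' the four columns are split as intended
right = pandas.read_csv(io.StringIO(tsv_text), sep='\t')

print(wrong.shape)  # (3, 1)
print(right.shape)  # (3, 4)
```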

In [None]:
type(df)

So it is a DataFrame. A DataFrame is a fast and efficient object for data manipulation with integrated indexing. Let's see how it looks.

In [None]:
df[:10]

If you are not in a notebook or IPython, you need to call print() explicitly.

In [None]:
print(df)

You can use .head() to display the first few rows of the DataFrame.

In [None]:
df.head() # vs print(df.head())

Let's see the shape of the dataframe 

In [None]:
df.shape

However, if you try to call shape() you will get an error, because shape is an attribute, not a method.

In [None]:
df.shape() 
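
The difference between the attribute and the call can be sketched on a tiny, made-up DataFrame (the name toy and its values are assumptions for illustration). shape is a plain tuple of (rows, columns), and calling a tuple raises a TypeError.

```python
import pandas

# Tiny made-up frame; values are illustrative only
toy = pandas.DataFrame({"year": [1952, 1957, 1962],
                        "lifeExp": [40.0, 42.0, 44.0]})

# shape is a tuple, so it can be unpacked directly
rows, cols = toy.shape
print(rows, cols)  # 3 2

# Calling it fails because a tuple is not callable
try:
    toy.shape()
except TypeError as err:
    print("TypeError:", err)
```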

As always, you can hit the TAB key to see the options available on the DataFrame object df.

In [None]:
df.

As you can see, a DataFrame has many attributes. We will explore some of them in the following cells.

In [None]:
df.columns

In [None]:
df.dtypes

Pandas makes selecting columns very easy!

In [None]:
country_df = df['country'] # subset 1 column

In [None]:
country_df.head()

In [None]:
subset = df[['country', 'continent', 'year']] # subset multiple columns

In [None]:
subset.head()
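
One detail worth checking is the type each kind of selection returns. In this sketch on a made-up frame (toy and its values are assumptions), single brackets with one label give a Series, while a list of labels gives a DataFrame, even when the list has only one entry.

```python
import pandas

# Tiny made-up frame; values are illustrative only
toy = pandas.DataFrame({
    "country": ["CountryA", "CountryB"],
    "continent": ["Asia", "Europe"],
    "year": [1952, 1952],
})

one_col = toy["country"]        # one label     -> Series
one_col_df = toy[["country"]]   # list of labels -> DataFrame

print(type(one_col).__name__)     # Series
print(type(one_col_df).__name__)  # DataFrame
print(one_col_df.shape)           # (2, 1)
```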

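We can also select columns by position, but remember that Python indexes start at zero. In current pandas, df[[1, 2, 3]] looks for columns labelled 1, 2 and 3 and raises a KeyError here, so we use .iloc for positions.

In [None]:
subset = df.iloc[:, [1, 2, 3]] # subset columns by column position
# note it is 0-indexed,
# meaning 1 is the second column

# compare this with df.iloc[:, [1]]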

In [None]:
subset.head()

In [None]:
subset = df.iloc[:, list(range(1, 4))] # in Python 3 you need list(range()); in Python 2 range() alone is enough

In [None]:
subset.head()
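
Recent pandas releases treat integers inside df[[...]] as column labels, so positional column selection is usually written with .iloc. The sketch below, on a made-up frame (toy and its values are assumptions), shows three equivalent ways to pick the second to fourth columns.

```python
import pandas

# Tiny made-up frame; values are illustrative only
toy = pandas.DataFrame({
    "country": ["CountryA", "CountryB"],
    "continent": ["Asia", "Europe"],
    "year": [1952, 1952],
    "lifeExp": [40.0, 60.0],
})

by_list = toy.iloc[:, [1, 2, 3]]   # explicit list of positions
by_slice = toy.iloc[:, 1:4]        # slice, end position excluded
by_name = toy[toy.columns[1:4]]    # look up the names, then select by label

print(list(by_list.columns))  # ['continent', 'year', 'lifeExp']
```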

.loc provides a purely label-based indexer to select rows from a DataFrame.

In [None]:
help(df.loc)

In [None]:
df.loc[0]

In [None]:
row_100 = df.loc[99]

In [None]:
type(row_100)

A negative index does not work here: with -1, .loc actually looks for a row labelled -1.

In [None]:
df.loc[-1]  

We can use shape to get the number of rows; with the default index, the last row label is one less than that.

In [None]:
df.shape[0]

In [None]:
df.loc[df.shape[0] - 1]
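
That .loc works on labels rather than positions is easier to see when the index does not start at zero. This sketch uses a made-up frame whose row labels are 10, 20 and 30 (toy, its labels and its values are assumptions). Taking the last label from the index avoids the shape arithmetic, which only works with the default 0, 1, 2, ... index.

```python
import pandas

# Made-up frame with row labels 10, 20, 30 instead of 0, 1, 2
toy = pandas.DataFrame({"lifeExp": [40.0, 42.0, 44.0]},
                       index=[10, 20, 30])

print(toy.loc[10, "lifeExp"])  # 40.0, the row labelled 10

# There is no row labelled 0, so .loc raises a KeyError
try:
    toy.loc[0]
except KeyError:
    print("no row labelled 0")

# The last label comes from the index itself
last_label = toy.index[-1]
print(toy.loc[last_label, "lifeExp"])  # 44.0
```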

.iloc provides purely integer-position based indexing. Now 0 is the position, not the label.

In [None]:
df.iloc[0]

So a negative index works.

In [None]:
df.iloc[-1]
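
As a small sketch of positional indexing, the made-up frame below has row labels 10, 20 and 30 (toy, its labels and its values are assumptions). .iloc ignores those labels and counts positions, so 0 is the first row, -1 the last, and negative slices work as they do for Python lists.

```python
import pandas

# Made-up frame with row labels 10, 20, 30
toy = pandas.DataFrame({"year": [1952, 1957, 1962],
                        "lifeExp": [40.0, 42.0, 44.0]},
                       index=[10, 20, 30])

first = toy.iloc[0]       # first row by position
last = toy.iloc[-1]       # last row by position
last_two = toy.iloc[-2:]  # last two rows, still a DataFrame

# .name on a row holds the label that row had in the frame
print(first.name, last.name)  # 10 30
print(list(last_two.index))   # [20, 30]
```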

ix used to be the most flexible indexer: primarily label-based, with an integer-position fallback. It was deprecated in pandas 0.20 and removed in pandas 1.0, so the cells below use .loc and .iloc instead.

In [None]:
df.loc[0]

In [None]:
df.head()

In [None]:
df.loc[[0, 99, 999]]

In [None]:
# df.loc[rows, columns]   (labels)
# df.iloc[rows, columns]  (integer positions)

In [None]:
df.loc[0, 'continent']

In [None]:
df.iloc[0, 1]

In [None]:
df.loc[[0, 99, 999], ['continent', 'year']]

In [None]:
df.head()
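
Since .ix was removed in pandas 1.0, selections that mix labels and positions need one extra step. The sketch below, on a made-up frame (toy, its values and the helper name col_pos are assumptions), shows rows and columns selected by label, by position, and by position for rows with names for columns.

```python
import pandas

# Tiny made-up frame; values are illustrative only
toy = pandas.DataFrame({
    "country": ["CountryA", "CountryB", "CountryC"],
    "continent": ["Asia", "Europe", "Africa"],
    "year": [1952, 1952, 1952],
})

# Labels for both rows and columns
by_label = toy.loc[[0, 2], ["continent", "year"]]

# Integer positions for both rows and columns
by_position = toy.iloc[[0, 2], [1, 2]]

# Positions for rows, names for columns: translate the names to positions first
col_pos = toy.columns.get_indexer(["continent", "year"])
mixed = toy.iloc[[0, 2], col_pos]

print(by_label.equals(by_position))  # True
print(by_label.equals(mixed))        # True
```

The three results agree here only because the toy frame uses the default 0, 1, 2 row labels, so row labels and row positions coincide.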

Pandas provides built-in grouping and aggregation capabilities.

In [None]:
df.groupby('year')['lifeExp'].mean()

In [None]:
grouped = df.groupby(['year', 'continent'])['lifeExp'].mean()

In [None]:
grouped
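
Grouping by two columns returns a Series whose index has two levels. As a minimal sketch on made-up numbers (toy, wide, flat and all values are assumptions), unstack() turns one level into columns and reset_index() turns the result back into an ordinary table.

```python
import pandas

# Made-up data: two Asian rows and one European row per year
toy = pandas.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Asia", "Asia", "Europe"],
    "year": [1952, 1952, 1952, 1957, 1957, 1957],
    "lifeExp": [40.0, 50.0, 60.0, 42.0, 52.0, 62.0],
})

# Mean life expectancy per (year, continent) pair
grouped = toy.groupby(["year", "continent"])["lifeExp"].mean()
print(grouped.loc[(1952, "Asia")])  # 45.0

# One row per year, one column per continent
wide = grouped.unstack("continent")
print(wide.loc[1957, "Europe"])  # 62.0

# Back to a flat table with year, continent and lifeExp columns
flat = grouped.reset_index()
print(list(flat.columns))  # ['year', 'continent', 'lifeExp']
```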

In [None]:
# during the tutorial I also imported matplotlib:
# import matplotlib.pyplot as plt
# you only need the magic below for the plot

%matplotlib inline

df.groupby('year')['lifeExp'].mean().plot()

Suranga Edirisinghe 05/17/2017 (neranjan@gsu.edu)