# Pandas Viewing

In this activity, we will practice exploring data in Pandas, which is an important part of any data analysis.

In [1]:
import numpy as np
import pandas as pd

## Object Creation

Create a Series by passing a list of values; Pandas assigns a default integer index.


In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [3]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
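
The output above shows `float64` even though most of the values were written as integers. A minimal sketch of why, using the illustrative names `s_int` and `s_nan` (not part of the original activity): `np.nan` is a floating-point value, so including it upcasts the whole Series.

```python
import numpy as np
import pandas as pd

# Without a missing value, a list of ints keeps an integer dtype.
s_int = pd.Series([1, 3, 5, 6, 8])

# np.nan is a float, so including it upcasts the whole Series to float64.
s_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s_int.dtype)          # an integer dtype (exact width depends on the platform)
print(s_nan.dtype)          # float64
print(s_nan.isna().sum())   # number of missing entries
print(list(s_nan.index))    # the default integer index
```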

Create a DataFrame by passing a NumPy array with a datetime index and labeled columns.

In [4]:
dates = pd.date_range('20130101', periods=6)

In [5]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [7]:
df

,A,B,C,D
2013-01-01,0.39325,-0.265685,-0.129212,0.205404
2013-01-02,-1.032109,-1.810026,-0.954699,-0.715999
2013-01-03,0.995691,-1.260994,-1.08822,-0.725872
2013-01-04,-0.142701,-0.772313,-1.220831,0.203825
2013-01-05,0.056843,-0.420129,1.372631,-1.197541
2013-01-06,-1.04759,0.866896,-2.424426,0.332825


Create a DataFrame from a dict of objects. Scalar values are repeated for every row, and each column keeps its own dtype.

In [8]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

In [9]:
df2

,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [10]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
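
As a rough illustration of the dict constructor, the sketch below builds a smaller frame (the name `small` is hypothetical) and shows that scalar entries are repeated to match the length of the Series entry. It then uses `select_dtypes` to keep only the numeric columns, which is one way to act on the per-column dtypes listed above.

```python
import pandas as pd

small = pd.DataFrame({
    'A': 1.0,                                                   # scalar, repeated for each row
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),   # sets the row index (0..3)
    'F': 'foo',                                                 # scalar string, repeated as well
})

print(small.shape)    # four rows, three columns
print(small.dtypes)   # one dtype per column

# Keep only the columns whose dtype is numeric.
numeric = small.select_dtypes(include='number')
print(list(numeric.columns))
```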

In Jupyter, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that can be completed:

In [12]: df2.<TAB>  # noqa: E225, E999

df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.consolidate
df2.applymap           df2.D

The list above is only a subset of what you can do with a DataFrame.

## Viewing Data

Here is how to view the top and bottom rows of the frame:

In [13]:
df.head()

,A,B,C,D
2013-01-01,0.39325,-0.265685,-0.129212,0.205404
2013-01-02,-1.032109,-1.810026,-0.954699,-0.715999
2013-01-03,0.995691,-1.260994,-1.08822,-0.725872
2013-01-04,-0.142701,-0.772313,-1.220831,0.203825
2013-01-05,0.056843,-0.420129,1.372631,-1.197541


In [14]:
df.tail(3)

,A,B,C,D
2013-01-04,-0.142701,-0.772313,-1.220831,0.203825
2013-01-05,0.056843,-0.420129,1.372631,-1.197541
2013-01-06,-1.04759,0.866896,-2.424426,0.332825
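
The two calls above differ in their argument: `head()` with no argument returns the first five rows, while `tail(3)` asks for exactly three. A small sketch on a hypothetical ten-row frame (`df_demo` is an illustrative name) makes the row counts easy to check:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': range(10)})

top = df_demo.head()          # default n=5: the first five rows
bottom = df_demo.tail(3)      # the last three rows
all_but_last_two = df_demo.head(-2)   # negative n: everything except the last two rows

print(top['x'].tolist())
print(bottom['x'].tolist())
print(len(all_but_last_two))
```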


Display the index and columns:

In [15]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [16]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

Convert the DataFrame to a NumPy array. The index and column labels are not included in the result.

In [17]:
df.to_numpy()

array([[ 0.3932503 , -0.26568504, -0.12921245,  0.20540356],
       [-1.03210876, -1.81002621, -0.95469935, -0.71599888],
       [ 0.99569097, -1.26099447, -1.0882201 , -0.7258723 ],
       [-0.14270118, -0.77231266, -1.2208309 ,  0.20382465],
       [ 0.05684337, -0.42012898,  1.37263127, -1.19754144],
       [-1.04759005,  0.86689583, -2.42442555,  0.33282479]])

NumPy arrays have one dtype for the entire array, while Pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), Pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being the object dtype, which requires casting every value to a Python object. For large frames, that conversion can be expensive in both time and memory.
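
To make the dtype behaviour concrete, here is a small sketch with three hypothetical frames (`homog`, `int_float`, and `mixed` are illustrative names, not part of the activity). It only inspects the dtype of the resulting array:

```python
import numpy as np
import pandas as pd

# All columns are float64, so the array is float64 too.
homog = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# Integer and float columns share a common numeric dtype: float64.
int_float = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

# A numeric column next to a string column has no common numeric dtype,
# so every value is boxed as a Python object.
mixed = pd.DataFrame({'a': [1.0, 2.0], 'b': ['x', 'y']})

print(homog.to_numpy().dtype)
print(int_float.to_numpy().dtype)
print(mixed.to_numpy().dtype)
```

When only the numeric columns are needed, selecting them before calling `to_numpy()` avoids the fallback to object dtype.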



The describe() method shows a quick statistical summary of your data: the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

In [18]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.129436,-0.610375,-0.740793,-0.316227
std,0.803967,0.91935,1.270599,0.643143
min,-1.04759,-1.810026,-2.424426,-1.197541
25%,-0.809757,-1.138824,-1.187678,-0.723404
50%,-0.042929,-0.596221,-1.02146,-0.256087
75%,0.309149,-0.304296,-0.335584,0.205009
max,0.995691,0.866896,1.372631,0.332825
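
The rows of the summary can be cross-checked against the individual reductions. The sketch below does this on a tiny hypothetical frame (`df_demo` is an illustrative name) and also passes the `percentiles` argument to request different quantiles:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]})

summary = df_demo.describe()

# Each row of the summary matches the corresponding column reduction.
print(summary.loc['count', 'A'], df_demo['A'].count())
print(summary.loc['mean', 'A'], df_demo['A'].mean())
print(summary.loc['std', 'A'], df_demo['A'].std())    # sample standard deviation
print(summary.loc['50%', 'A'], df_demo['A'].median())

# Request the 10th and 90th percentiles instead of the default quartiles.
custom = df_demo.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```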


Transpose your data, swapping rows and columns:

In [19]:
df.T

,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,0.39325,-1.032109,0.995691,-0.142701,0.056843,-1.04759
B,-0.265685,-1.810026,-1.260994,-0.772313,-0.420129,0.866896
C,-0.129212,-0.954699,-1.08822,-1.220831,1.372631,-2.424426
D,0.205404,-0.715999,-0.725872,0.203825,-1.197541,0.332825
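
As a quick check of what transposing does to the labels, the following sketch uses a small hypothetical frame (`df_demo`, with made-up row labels `r0` to `r2`) and confirms that the index and columns trade places:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(6).reshape(3, 2),
                       index=['r0', 'r1', 'r2'],
                       columns=['A', 'B'])

t = df_demo.T

print(df_demo.shape, t.shape)   # the two dimensions are swapped
print(list(t.index))            # former column labels
print(list(t.columns))          # former row labels

# A value at (row, column) in the original sits at (column, row) after transposing.
print(df_demo.loc['r1', 'B'], t.loc['B', 'r1'])
```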


Sort the columns by name. Here axis=1 selects the column labels and ascending=False reverses the order:

In [20]:
df.sort_index(axis=1, ascending=False)

,D,C,B,A
2013-01-01,0.205404,-0.129212,-0.265685,0.39325
2013-01-02,-0.715999,-0.954699,-1.810026,-1.032109
2013-01-03,-0.725872,-1.08822,-1.260994,0.995691
2013-01-04,0.203825,-1.220831,-0.772313,-0.142701
2013-01-05,-1.197541,1.372631,-0.420129,0.056843
2013-01-06,0.332825,-2.424426,0.866896,-1.04759
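
The `axis` argument decides which set of labels is sorted. A minimal sketch on a hypothetical frame with deliberately unordered labels (`df_demo` is an illustrative name) shows both directions, and that the original frame is left unchanged:

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                       index=['y', 'x'],
                       columns=['B', 'C', 'A'])

by_rows = df_demo.sort_index()                             # axis=0 (default): sort row labels
by_cols = df_demo.sort_index(axis=1)                       # axis=1: sort column labels
by_cols_desc = df_demo.sort_index(axis=1, ascending=False)

print(list(by_rows.index))
print(list(by_cols.columns))
print(list(by_cols_desc.columns))
print(list(df_demo.columns))    # sort_index returns a new frame by default
```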


We can also sort a DataFrame by the values in a specific column. Rows are reordered in ascending order of that column by default:

In [21]:
df.sort_values(by='B')

,A,B,C,D
2013-01-02,-1.032109,-1.810026,-0.954699,-0.715999
2013-01-03,0.995691,-1.260994,-1.08822,-0.725872
2013-01-04,-0.142701,-0.772313,-1.220831,0.203825
2013-01-05,0.056843,-0.420129,1.372631,-1.197541
2013-01-01,0.39325,-0.265685,-0.129212,0.205404
2013-01-06,-1.04759,0.866896,-2.424426,0.332825
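
Two common variations are sorting in descending order and sorting by more than one column. The sketch below shows both on a small hypothetical frame (`df_demo` and its row labels are illustrative); with a list passed to `by`, later columns break ties in earlier ones.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [3, 1, 2],
                        'B': [0.5, -1.0, 0.5]},
                       index=['r0', 'r1', 'r2'])

# Largest values of A first.
desc = df_demo.sort_values(by='A', ascending=False)

# Sort by B, then use A to order the rows where B is tied.
multi = df_demo.sort_values(by=['B', 'A'])

print(list(desc.index))
print(list(multi.index))
```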
