# Pandas

Pandas adds the **Series** and **DataFrame** data structures to Python, along with data input/output functions, basic data analysis tools, utilities and plotting capabilities. It builds on top of the NumPy ndarray. The name pandas is derived from **panel data**. Pandas is primarily used for data munging and preparation, but it also provides some data analysis tools. Pandas does not implement any significant modeling functionality. For modeling, you can use statsmodels and scikit-learn, both of which accept DataFrames as input.

In [1]:
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
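
The `float64` dtype in the output above follows from the `np.nan` entry: NaN is a floating-point value, so the integers are upcast. A small sketch of how the missing value affects the dtype and aggregations (the names `s_int` and `s_nan` are made up for this illustration):

```python
import numpy as np
import pandas as pd

s_int = pd.Series([1, 3, 5, 6, 8])            # no missing values
s_nan = pd.Series([1, 3, 5, np.nan, 6, 8])    # same data as the Series above

print(s_int.dtype)               # an integer dtype, since nothing is missing
print(s_nan.dtype)               # float64, because NaN is a float
print(s_nan.isna().sum())        # number of missing entries
print(s_nan.sum())               # reductions skip NaN by default
print(s_nan.sum(skipna=False))   # NaN propagates when skipna=False
```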


In [2]:
data = np.random.randn(6, 4)
df = pd.DataFrame(data, columns=list('ABCD'))
print(df)

          A         B         C         D
0  1.102933  0.004014 -1.392061 -1.144745
1  0.590045 -0.420929  0.945703  1.962290
2  0.684100 -0.653912  1.431976 -0.900998
3 -0.559076  0.563607 -0.607014  0.014401
4 -0.171584 -1.903543  0.390012 -1.587581
5  0.741322  1.001099  0.989959 -1.002977


## Properties of DataFrame

In [3]:
print('Shape:', df.shape)
print('Data Types:', df.dtypes)
print('Column Labels:', df.columns)
print(df.index)

Shape: (6, 4)
Data Types: A    float64
B    float64
C    float64
D    float64
dtype: object
Column Labels: Index(['A', 'B', 'C', 'D'], dtype='object')
RangeIndex(start=0, stop=6, step=1)


In [4]:
print(df.head())

          A         B         C         D
0  1.102933  0.004014 -1.392061 -1.144745
1  0.590045 -0.420929  0.945703  1.962290
2  0.684100 -0.653912  1.431976 -0.900998
3 -0.559076  0.563607 -0.607014  0.014401
4 -0.171584 -1.903543  0.390012 -1.587581


In [5]:
print(df.tail())

          A         B         C         D
1  0.590045 -0.420929  0.945703  1.962290
2  0.684100 -0.653912  1.431976 -0.900998
3 -0.559076  0.563607 -0.607014  0.014401
4 -0.171584 -1.903543  0.390012 -1.587581
5  0.741322  1.001099  0.989959 -1.002977


In [6]:
print(df.head(3))

          A         B         C         D
0  1.102933  0.004014 -1.392061 -1.144745
1  0.590045 -0.420929  0.945703  1.962290
2  0.684100 -0.653912  1.431976 -0.900998


## Indexing and Slicing DataFrame

In [7]:
print(type(df['A'])) # Each column of a DataFrame is a Series

<class 'pandas.core.series.Series'>


In [8]:
print(df['A']) # Select column A as a Series, retaining the row index

0    1.102933
1    0.590045
2    0.684100
3   -0.559076
4   -0.171584
5    0.741322
Name: A, dtype: float64


In [9]:
print(type(df.loc[1, :])) # A single row of a DataFrame is also returned as a Series
print(df.loc[1, :]) # Select the row with index label 1

<class 'pandas.core.series.Series'>
A    0.590045
B   -0.420929
C    0.945703
D    1.962290
Name: 1, dtype: float64


In [10]:
print(df.A) # A column can also be accessed with dot notation, provided its name is a valid Python identifier (no spaces, for instance)

0    1.102933
1    0.590045
2    0.684100
3   -0.559076
4   -0.171584
5    0.741322
Name: A, dtype: float64
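
Dot notation is convenient but has limits beyond spaces in names. The sketch below uses a made-up frame `demo` to show two cases where bracket notation is needed: a column name containing a space, and a column name that collides with an existing DataFrame method (here `count`).

```python
import pandas as pd

# Hypothetical frame, chosen only to illustrate attribute access
demo = pd.DataFrame({'A': [1, 2], 'Column B': [3, 4], 'count': [5, 6]})

print(demo.A.tolist())            # works: 'A' is a valid identifier
print(demo['Column B'].tolist())  # a name with a space needs brackets
print(callable(demo.count))       # demo.count is the DataFrame method, not the column
print(demo['count'].tolist())     # brackets reach the column named 'count'
```

Bracket notation works in every case, so it is the safer choice in code that must handle arbitrary column names.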


In [11]:
print(df.A[1]) # Column A, row index 1

0.5900447924454416


In [12]:
print(df.loc[0:2, :]) # .loc slices by label, so 0:2 includes the stop label 2, unlike slicing in NumPy

          A         B         C         D
0  1.102933  0.004014 -1.392061 -1.144745
1  0.590045 -0.420929  0.945703  1.962290
2  0.684100 -0.653912  1.431976 -0.900998
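
The inclusive stop applies to label-based slicing with `.loc`. Position-based slicing with `.iloc` follows the usual Python rule and excludes the stop. A minimal sketch on a made-up frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with known values: column A holds 0, 4, 8, 12, 16, 20
demo = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

by_label = demo.loc[0:2, :]      # label-based: rows labelled 0, 1 and 2
by_position = demo.iloc[0:2, :]  # position-based: rows at positions 0 and 1
print(len(by_label), len(by_position))

# The difference is easier to see when the labels are not 0..n-1
demo.index = list('uvwxyz')
print(demo.loc['v':'x', 'A'].tolist())   # labels v, w and x
print(demo.iloc[1:3, 0].tolist())        # positions 1 and 2
```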


In [13]:
df.columns = ['Column A', 'Column B', 'Column C', 'Column D'] # Column names can be changed, and can contain spaces
print(df.head(3))

   Column A  Column B  Column C  Column D
0  1.102933  0.004014 -1.392061 -1.144745
1  0.590045 -0.420929  0.945703  1.962290
2  0.684100 -0.653912  1.431976 -0.900998


In [14]:
print(df['Column C'][0:3]) # An integer slice with [] is positional here, so it follows normal Python rules and excludes the stop 3

0   -1.392061
1    0.945703
2    1.431976
Name: Column C, dtype: float64


In [15]:
print(df[['Column A', 'Column C']][0:3]) # Select a list of columns first, then a slice of rows

   Column A  Column C
0  1.102933 -1.392061
1  0.590045  0.945703
2  0.684100  1.431976


In [16]:
print(df[['Column D', 'Column A']][5:2:-1]) # Order of rows and columns can be chosen by user

   Column D  Column A
5 -1.002977  0.741322
4 -1.587581 -0.171584
3  0.014401 -0.559076


In [17]:
print(df['Column B'] * 2)

0    0.008028
1   -0.841858
2   -1.307823
3    1.127213
4   -3.807087
5    2.002198
Name: Column B, dtype: float64


In [18]:
print(type(df.to_numpy()))

<class 'numpy.ndarray'>


In [19]:
print(df.to_numpy())

[[ 1.10293289  0.00401405 -1.39206098 -1.14474462]
 [ 0.59004479 -0.42092884  0.94570298  1.96228952]
 [ 0.68409978 -0.65391167  1.43197645 -0.90099754]
 [-0.55907583  0.56360652 -0.60701377  0.01440117]
 [-0.1715836  -1.90354331  0.39001194 -1.58758127]
 [ 0.7413225   1.00109901  0.98995902 -1.00297743]]
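
`to_numpy()` returns a single ndarray, so all columns must share one dtype. The sketch below, on made-up frames `nums` and `mixed`, suggests what happens when a string column is present: the result falls back to `object`, and selecting the numeric columns first avoids that.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration
nums = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
mixed = nums.assign(E=['Positive', 'Non positive'])

print(nums.to_numpy().dtype)               # float64: all columns are floats
print(mixed.to_numpy().dtype)              # object: floats and strings share no numeric dtype
print(mixed[['A', 'B']].to_numpy().dtype)  # select the numeric columns first
```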


In [20]:
print(df.index.array)

<PandasArray>
[0, 1, 2, 3, 4, 5]
Length: 6, dtype: int64


In [21]:
df.columns.array

<PandasArray>
['Column A', 'Column B', 'Column C', 'Column D']
Length: 4, dtype: object

## Boolean Operations

In [22]:
df.columns = list('ABCD')
print(df.A > 0)

0     True
1     True
2     True
3    False
4    False
5     True
Name: A, dtype: bool


In [23]:
df[df.A > 0]

          A         B         C         D
0  1.102933  0.004014 -1.392061 -1.144745
1  0.590045 -0.420929  0.945703  1.962290
2  0.684100 -0.653912  1.431976 -0.900998
5  0.741322  1.001099  0.989959 -1.002977
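
Boolean masks can be combined element-wise with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>`. A short sketch on a made-up frame `demo`:

```python
import pandas as pd

# Hypothetical values, chosen so each mask selects different rows
demo = pd.DataFrame({'A': [1.1, 0.6, -0.5, 0.7],
                     'B': [0.0, -0.4, 0.6, 1.0]})

both = demo[(demo.A > 0) & (demo.B > 0)]     # A positive AND B positive
either = demo[(demo.A > 0) | (demo.B > 0)]   # A positive OR B positive
not_a = demo[~(demo.A > 0)]                  # rows where A is not positive

print(both.index.tolist())
print(either.index.tolist())
print(not_a.index.tolist())
```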


In [24]:
indx = df[df.B > 0].index
print(indx)

Int64Index([0, 3, 5], dtype='int64')


In [25]:
df.loc[indx, ['B', 'D']]

          B         D
0  0.004014 -1.144745
3  0.563607  0.014401
5  1.001099 -1.002977


In [26]:
df['E'] = 'Non positive' # Creates a new column E and fills every row with 'Non positive'
df.loc[indx, 'E'] = 'Positive' # Overwrites column E in the rows whose index is in indx (the rows where B > 0)
print(df)

          A         B         C         D             E
0  1.102933  0.004014 -1.392061 -1.144745      Positive
1  0.590045 -0.420929  0.945703  1.962290  Non positive
2  0.684100 -0.653912  1.431976 -0.900998  Non positive
3 -0.559076  0.563607 -0.607014  0.014401      Positive
4 -0.171584 -1.903543  0.390012 -1.587581  Non positive
5  0.741322  1.001099  0.989959 -1.002977      Positive
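
The same labelling can be done in one step with `numpy.where`, which picks between two values according to a boolean mask. This is an alternative sketch on a made-up frame `demo`, not the approach used in the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical values for column B
demo = pd.DataFrame({'B': [0.004, -0.421, -0.654, 0.564]})

# 'Positive' where the condition holds, 'Non positive' elsewhere
demo['E'] = np.where(demo['B'] > 0, 'Positive', 'Non positive')
print(demo)
```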


## Reading and Writing MS Excel and CSV Files

Most data is available in one of several common formats, such as CSV, MS Excel, HDF5 and SQL databases. Pandas provides functions to read and write each of these formats.
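
Writing is the mirror image of reading: each `pd.read_*` function has a matching `DataFrame.to_*` method. The sketch below round-trips a made-up frame through CSV, using an in-memory `io.StringIO` buffer in place of a file so that it runs without touching the disk.

```python
import io
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1.5, 2.0, 3.25]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # index=False leaves out the row index
csv_text = buffer.getvalue()
print(csv_text)

restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

With a real file, a path string such as `'out.csv'` would take the place of the buffer in both calls.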

In [32]:
fn = 'PCA CDB-2901-F-Census.xlsx'
df_bgm = pd.read_excel(fn)  # read_excel accepts the file name directly
print(f"{fn}: {len(df_bgm)} records")
print(f"{len(df_bgm.columns)} columns")
print(df_bgm.columns.array)

PCA CDB-2901-F-Census.xlsx: 1323 records
95 columns
<PandasArray>
[         'State',       'District',        'DT Name',       'CD Block',
   'Town/Village',           'Ward',             'EB',          'Level',
           'Name',            'TRU',          'No_HH',          'TOT_P',
          'TOT_M',          'TOT_F',           'P_06',           'M_06',
           'F_06',           'P_SC',           'M_SC',           'F_SC',
           'P_ST',           'M_ST',           'F_ST',          'P_LIT',
          'M_LIT',          'F_LIT',          'P_ILL',          'M_ILL',
          'F_ILL',     'TOT_WORK_P',     'TOT_WORK_M',     'TOT_WORK_F',
     'MAINWORK_P',     'MAINWORK_M',     'MAINWORK_F',      'MAIN_CL_P',
      'MAIN_CL_M',      'MAIN_CL_F',      'MAIN_AL_P',      'MAIN_AL_M',
      'MAIN_AL_F',      'MAIN_HH_P',      'MAIN_HH_M',      'MAIN_HH_F',
      'MAIN_OT_P',      'MAIN_OT_M',      'MAIN_OT_F',     'MARGWORK_P',
     'MARGWORK_M',     'MARGWORK_F',      'MARG_CL_P',    

In [33]:
p_tot = df_bgm['TOT_P'].sum()
p_male = df_bgm['TOT_M'].sum()
p_female = df_bgm['TOT_F'].sum()
p_lit = df_bgm['P_LIT'].sum()
p_litm = df_bgm['M_LIT'].sum()
p_litf = df_bgm['F_LIT'].sum()
p_scm = df_bgm['M_SC'].sum()
p_scf = df_bgm['F_SC'].sum()
print(p_tot, p_lit, p_lit * 100 / p_tot)
print(p_lit, p_litm, p_litf, p_litm+p_litf-p_lit)

11168997 6751338 60.447128779782105
6751338 3894327 2857011 0
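
Several columns can be summed in one call by selecting them as a list, which gives a Series of totals indexed by column name. The sketch below uses made-up numbers that only mimic the census column names above, and derives male and female literacy rates from the totals.

```python
import pandas as pd

# Made-up rows; only the column names come from the census file
toy = pd.DataFrame({
    'TOT_M': [100, 200], 'TOT_F': [90, 210],
    'M_LIT': [70, 150],  'F_LIT': [50, 140],
})

totals = toy[['TOT_M', 'TOT_F', 'M_LIT', 'F_LIT']].sum()  # Series of column sums
male_rate = totals['M_LIT'] * 100 / totals['TOT_M']
female_rate = totals['F_LIT'] * 100 / totals['TOT_F']
print(f"Male literacy: {male_rate:.1f}%  Female literacy: {female_rate:.1f}%")
```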


In [34]:
fn = 'PCA CDB-2918-F-Census.xlsx'
df_blr = pd.read_excel(fn)
print(f"{fn}: {len(df_blr)} records")
print(f"{len(df_blr.columns)} columns")
print(df_blr.columns.array)

PCA CDB-2918-F-Census.xlsx: 625 records
95 columns
<PandasArray>
[         'State',       'District',        'DT Name',       'CD Block',
   'Town/Village',           'Ward',             'EB',          'Level',
           'Name',            'TRU',          'No_HH',          'TOT_P',
          'TOT_M',          'TOT_F',           'P_06',           'M_06',
           'F_06',           'P_SC',           'M_SC',           'F_SC',
           'P_ST',           'M_ST',           'F_ST',          'P_LIT',
          'M_LIT',          'F_LIT',          'P_ILL',          'M_ILL',
          'F_ILL',     'TOT_WORK_P',     'TOT_WORK_M',     'TOT_WORK_F',
     'MAINWORK_P',     'MAINWORK_M',     'MAINWORK_F',      'MAIN_CL_P',
      'MAIN_CL_M',      'MAIN_CL_F',      'MAIN_AL_P',      'MAIN_AL_M',
      'MAIN_AL_F',      'MAIN_HH_P',      'MAIN_HH_M',      'MAIN_HH_F',
      'MAIN_OT_P',      'MAIN_OT_M',      'MAIN_OT_F',     'MARGWORK_P',
     'MARGWORK_M',     'MARGWORK_F',      'MARG_CL_P',     

In [35]:
movies = pd.read_csv('ml-latest-small/movies.csv')
print(len(movies))
movies.info()

9742
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId    9742 non-null int64
title      9742 non-null object
genres     9742 non-null object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [36]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
userId       100836 non-null int64
movieId      100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [37]:
links = pd.read_csv('ml-latest-small/links.csv')
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId    9742 non-null int64
imdbId     9742 non-null int64
tmdbId     9734 non-null float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


In [38]:
tags = pd.read_csv('ml-latest-small/tags.csv')
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
userId       3683 non-null int64
movieId      3683 non-null int64
tag          3683 non-null object
timestamp    3683 non-null int64
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [39]:
m = pd.merge(movies, links, on='movieId', how='inner')
m.head()

   movieId                               title                                       genres  imdbId   tmdbId
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  114709    862.0
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy  113497   8844.0
2        3             Grumpier Old Men (1995)                               Comedy|Romance  113228  15602.0
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance  114885  31357.0
4        5  Father of the Bride Part II (1995)                                       Comedy  113041  11862.0
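
The `how` argument decides which keys survive a merge. An inner merge keeps only keys present in both frames, while an outer merge keeps every key and fills the gaps with NaN. The sketch below uses two made-up frames, `left` and `right`; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames: movieId 3 exists only on the left, 4 only on the right
left = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['a', 'b', 'c']})
right = pd.DataFrame({'movieId': [1, 2, 4], 'imdbId': [10, 20, 40]})

inner = pd.merge(left, right, on='movieId', how='inner')
outer = pd.merge(left, right, on='movieId', how='outer', indicator=True)

print(inner)   # only movieId 1 and 2
print(outer)   # all four movieIds, with a _merge column
```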


In [40]:
tags[tags['movieId']==1]['tag']

629     pixar
981     pixar
2886      fun
Name: tag, dtype: object

In [41]:
mean_ratings = ratings.groupby('movieId', as_index=False)['rating'].mean()

In [42]:
m = m.merge(mean_ratings, on='movieId', how='inner')

In [43]:
m.sort_values(['rating'], ascending=False)

      movieId                                  title                           genres   imdbId    tmdbId  rating
7638    88448  Paper Birds (Pájaros de papel) (2010)                     Comedy|Drama  1360822   50004.0     5.0
8089   100556             Act of Killing, The (2012)                      Documentary  2375605  123678.0     5.0
9065   143031                        Jump In! (2007)             Comedy|Drama|Romance   805559   13968.0     5.0
9076   143511                           Human (2015)                      Documentary  3327994  359364.0     5.0
9078   143559                    L.A. Slasher (2015)             Comedy|Crime|Fantasy  2735292  323792.0     5.0
...       ...                                    ...                              ...      ...       ...     ...
9253   157172  Wizards of the Lost Kingdom II (1989)                   Action|Fantasy    90334    7237.0     0.5
7536    85334           Hard Ticket to Hawaii (1987)                    Action|Comedy    93146   26011.0     0.5
6486    53453   Starcrash (a.k.a. Star Crash) (1978)  Action|Adventure|Fantasy|Sci-Fi    79946   22049.0     0.5
5200     8494             Cincinnati Kid, The (1965)                            Drama    59037     886.0     0.5
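
A mean of 5.0 or 0.5 can come from a single rating, so a ranking by mean alone may be dominated by rarely rated movies. One possible refinement, sketched here on made-up data, is to compute the rating count alongside the mean and keep only movies with a minimum number of ratings. The threshold `min_ratings` and the frame `toy_ratings` are assumptions for this illustration.

```python
import pandas as pd

# Made-up ratings: movie 2 has a perfect mean but only one rating
toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 3, 3],
    'rating':  [4.0, 5.0, 3.0, 5.0, 4.5, 4.0],
})

# Named aggregation: mean and count of 'rating' per movie
stats = (toy_ratings.groupby('movieId', as_index=False)
         .agg(rating=('rating', 'mean'), n_ratings=('rating', 'count')))

min_ratings = 2  # arbitrary threshold chosen for this sketch
ranked = (stats[stats['n_ratings'] >= min_ratings]
          .sort_values('rating', ascending=False))
print(ranked)
```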


## pandasql

pandasql is a Python package that allows you to query pandas DataFrames using SQL syntax.

In [47]:
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

res = pysqldf('SELECT * FROM movies')
print(type(res))
print(res)

<class 'pandas.core.frame.DataFrame'>
      movieId                                      title  \
0           1                           Toy Story (1995)   
1           2                             Jumanji (1995)   
2           3                    Grumpier Old Men (1995)   
3           4                   Waiting to Exhale (1995)   
4           5         Father of the Bride Part II (1995)   
...       ...                                        ...   
9737   193581  Black Butler: Book of the Atlantic (2017)   
9738   193583               No Game No Life: Zero (2017)   
9739   193585                               Flint (2017)   
9740   193587        Bungo Stray Dogs: Dead Apple (2018)   
9741   193609        Andrew Dice Clay: Dice Rules (1991)   

                                           genres  
0     Adventure|Animation|Children|Comedy|Fantasy  
1                      Adventure|Children|Fantasy  
2                                  Comedy|Romance  
3                            Come
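
As far as its README describes, pandasql uses SQLite syntax. When pandasql is not available, a similar result can be had with the standard-library `sqlite3` module together with `DataFrame.to_sql` and `pd.read_sql_query`. This is a substitute technique, not how the cell above works. The frame `toy_movies` reuses three rows from the output shown earlier.

```python
import sqlite3
import pandas as pd

# Three rows taken from the movies output shown earlier
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy', 'Comedy|Romance'],
})

conn = sqlite3.connect(':memory:')               # temporary in-memory database
toy_movies.to_sql('movies', conn, index=False)   # copy the frame into a table

res = pd.read_sql_query(
    "SELECT movieId, title FROM movies WHERE genres LIKE '%Comedy%'", conn)
conn.close()
print(res)
```

The result is an ordinary DataFrame, as with `sqldf`.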

# References
* [Official Pandas tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
* [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min)
* [Intro to pandas data structures by Greg Reda](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/)
* [pandasql](https://github.com/yhat/pandasql/)