# Introduction to Pandas

* Series
* DataFrames
* Missing Data
* GroupBy
* Operations
* Data Input and Output

**Series**

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

In [2]:
import numpy as np
import pandas as pd

## Creating a Series

A list, NumPy array, or dictionary can be converted to a Series:

In [5]:
labels = ['a','b','c']
my_list = [10,20,30]

arr = np.array(my_list)

d = {'a':10,'b':20,'c':30}

### Using Lists

In [6]:
pd.Series(data=my_list,index=labels)
# pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

### Using NumPy Arrays

In [7]:
# pd.Series(arr)
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

### Using Dictionaries

In [9]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64
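
When a dictionary is passed together with an explicit index, values are looked up by key rather than by position. The following is a small sketch of that behavior; the name `picked` and the label `'z'` are made up for illustration.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Values are pulled from the dict by key, in the order given by index.
# 'z' is not a key in d, so its value is NaN and the dtype becomes float64.
picked = pd.Series(d, index=['c', 'a', 'z'])
print(picked)
```

Labels that are in the dictionary but not in `index` (here `'b'`) are simply left out.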

### Data in a Series

A pandas Series can hold a variety of object types:

In [10]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [11]:
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers for fast lookups of information (the index works like a hash table or dictionary).

In [13]:
# grab information from a Series
sales_Q1 = pd.Series(data=[250,450,200,150],index = ['USA', 'China','India', 'Brazil'])  
sales_Q1

USA       250
China     450
India     200
Brazil    150
dtype: int64

In [14]:
sales_Q2 = pd.Series([260,500,210,100],index = ['USA', 'China','India', 'Japan'])    
sales_Q2

USA      260
China    500
India    210
Japan    100
dtype: int64

In [17]:
print(sales_Q1['USA'])
print(sales_Q2['China'])
sales_Q1 + sales_Q2

250
500


Brazil      NaN
China     950.0
India     410.0
Japan       NaN
USA       510.0
dtype: float64
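
The `NaN` entries above come from index alignment: arithmetic between two Series matches values by label, and a label present in only one Series has nothing to add to. The result is also converted to float, because `NaN` is a floating-point value. If a missing label should count as zero instead, one option is `Series.add` with `fill_value`, sketched here with the same data (the name `total` is only illustrative).

```python
import pandas as pd

sales_Q1 = pd.Series([250, 450, 200, 150], index=['USA', 'China', 'India', 'Brazil'])
sales_Q2 = pd.Series([260, 500, 210, 100], index=['USA', 'China', 'India', 'Japan'])

# Labels found in only one Series are treated as 0 in the other,
# so Brazil and Japan keep their single-quarter values instead of NaN.
total = sales_Q1.add(sales_Q2, fill_value=0)
print(total)
```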

## DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index.
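
To make the "Series sharing an index" picture concrete, here is a minimal sketch that builds a DataFrame from two Series. The names `w`, `x`, and `frame` are illustrative and not part of the examples that follow.

```python
import pandas as pd

# Two Series with the same index labels
w = pd.Series([2, 6, 2], index=['A', 'B', 'C'])
x = pd.Series([79, -29, 21], index=['A', 'B', 'C'])

# Each dict key becomes a column name; the shared labels become the row index
frame = pd.DataFrame({'W': w, 'X': x})
print(frame)

# Pulling a column back out returns a Series with that same index
print(type(frame['W']))
print(frame['W'].index.equals(frame['X'].index))
```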

In [23]:
import pandas as pd
import numpy as np
from numpy.random import randint

In [26]:
np.random.seed(42)
data = randint(-100,100,(5,4))
print('Data: \n',data)

Data: 
 [[  2  79  -8 -86]
 [  6 -29  88 -80]
 [  2  21 -26 -13]
 [ 16  -1   3  51]
 [ 30  49 -48 -99]]


In [28]:
columns= ['W', 'X', 'Y', 'Z'] # four columns
index= ['A', 'B', 'C', 'D', 'E'] # five rows

df = pd.DataFrame(data, index, columns)
print('DataFrame: \n',df)

DataFrame: 
     W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99


### Selection and Indexing
#### Columns (axis : 1)

In [53]:
# Grab single column
df['W']

A     2
B     6
C     2
D    16
E    30
Name: W, dtype: int64

In [54]:
# grab multiple columns
df[['W','Z']]

    W   Z
A   2 -86
B   6 -80
C   2 -13
D  16  51
E  30 -99


In [55]:
# DataFrame Columns are just Series 

print('Type: \n',type( df['W']))

Type: 
 <class 'pandas.core.series.Series'>


In [56]:
# Create new column
df['new'] = df['W'] + df['Y']
df

    W   X   Y   Z  new
A   2  79  -8 -86   -6
B   6 -29  88 -80   94
C   2  21 -26 -13  -24
D  16  -1   3  51   19
E  30  49 -48 -99  -18


In [57]:
# Removing columns
# axis =1 : Column

df.drop('new', axis=1)

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99


In [60]:
# drop() returns a new DataFrame by default; inplace=True modifies df itself
df.drop('new', axis=1, inplace=True)
df

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
C   2  21 -26 -13
D  16  -1   3  51
E  30  49 -48 -99
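
The difference between the two `drop` calls above is easy to miss, so here is a small self-contained sketch on a toy frame. The names `demo`, `dropped`, and `cols_after_plain_drop` are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'W': [2, 6], 'new': [-6, 94]}, index=['A', 'B'])

# Without inplace=True, drop() leaves demo untouched and returns a new DataFrame
dropped = demo.drop('new', axis=1)
cols_after_plain_drop = list(demo.columns)
print(cols_after_plain_drop)    # 'new' is still there
print(list(dropped.columns))    # 'new' is gone in the returned copy

# Reassigning the result is an alternative to inplace=True
demo = demo.drop(columns='new')
print(list(demo.columns))
```

Reassignment and `inplace=True` end in the same state; which one to use is largely a matter of style.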


#### Rows (axis : 0)

In [63]:
df.loc['A']

W     2
X    79
Y    -8
Z   -86
Name: A, dtype: int64

In [68]:
# Select rows by label
df.loc[['A','E']]

    W   X   Y   Z
A   2  79  -8 -86
E  30  49 -48 -99


In [69]:
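# Select rows by integer position; only the last expression in a cell is displayed
df.iloc[0]     # first row, returned as a Series
df.iloc[0:2]   # first two rows, returned as a DataFrame

   W   X   Y   Z
A  2  79  -8 -86
B  6 -29  88 -80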
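
One difference between `loc` and `iloc` is worth spelling out: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like ordinary Python slicing. This is a short sketch on a three-row toy frame; `demo`, `by_label`, and `by_position` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({'W': [2, 6, 2], 'X': [79, -29, 21]}, index=['A', 'B', 'C'])

by_label = demo.loc['A':'B']     # label slice: the end label 'B' is included
by_position = demo.iloc[0:2]     # position slice: position 2 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Both slices return rows A and B here, even though the end points are written differently.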


In [71]:
# Remove row by name
df.drop('C', axis=0)

    W   X   Y   Z
A   2  79  -8 -86
B   6 -29  88 -80
D  16  -1   3  51
E  30  49 -48 -99


#### Select subset of rows and columns

In [72]:
df.loc[['A','C'],['W','Y']]

   W   Y
A  2  -8
C  2 -26


#### Conditional Selection

In [75]:
df>0

      W      X      Y      Z
A  True   True  False  False
B  True  False   True  False
C  True   True  False  False
D  True  False   True   True
E  True   True  False  False


In [76]:
df[df>0]

    W     X     Y     Z
A   2  79.0   NaN   NaN
B   6   NaN  88.0   NaN
C   2  21.0   NaN   NaN
D  16   NaN   3.0  51.0
E  30  49.0   NaN   NaN
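
Indexing with a boolean DataFrame keeps the shape of the frame and puts `NaN` wherever the condition is `False`. The same result can be written with `DataFrame.where`, which also accepts a replacement value. The sketch below uses a two-row toy frame, and the names `demo`, `masked`, and `filled` are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'X': [79, -29], 'Y': [-8, 88]}, index=['A', 'B'])

masked = demo[demo > 0]           # NaN where the condition is False
same = demo.where(demo > 0)       # equivalent spelling
filled = demo.where(demo > 0, 0)  # use 0 instead of NaN

print(masked.equals(same))
print(filled)
```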


In [78]:
# df['X']>0
df[df['X']>0]

    W   X   Y   Z
A   2  79  -8 -86
C   2  21 -26 -13
E  30  49 -48 -99


In [79]:
df[df['X']>0]['Y']

A    -8
C   -26
E   -48
Name: Y, dtype: int64

In [80]:
df[df['X']>0][['Y','Z']]

    Y   Z
A  -8 -86
C -26 -13
E -48 -99
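
The two-step selection above (filter rows, then pick columns) can also be written as a single `loc` call that takes the row condition and the column list together. This is a sketch on a toy frame with illustrative names `demo`, `chained`, and `single`.

```python
import pandas as pd

demo = pd.DataFrame(
    {'X': [79, -29, 21], 'Y': [-8, 88, -26], 'Z': [-86, -80, -13]},
    index=['A', 'B', 'C'],
)

chained = demo[demo['X'] > 0][['Y', 'Z']]     # two indexing steps
single = demo.loc[demo['X'] > 0, ['Y', 'Z']]  # rows and columns in one step

print(single)
print(single.equals(chained))
```

For reading data the two forms give the same result. The single `loc` form is generally the safer choice when the selection is later used for assignment.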


In [82]:
# Combine two conditions with & (and) or | (or); wrap each condition in parentheses
df[(df['W']>0) & (df['Y'] > 1)]

    W   X   Y   Z
B   6 -29  88 -80
D  16  -1   3  51
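
The `&` example extends naturally to `|` (or) and `~` (not). Python's `and`, `or`, and `not` keywords do not work here, because they try to reduce a whole Series to a single `True`/`False`. A brief sketch using the same W and Y values as above (`demo`, `either`, and `neither` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame(
    {'W': [2, 6, 2, 16, 30], 'Y': [-8, 88, -26, 3, -48]},
    index=list('ABCDE'),
)

# Rows where at least one condition holds
either = demo[(demo['W'] > 10) | (demo['Y'] > 1)]

# Rows where neither condition holds: negate the combined mask with ~
neither = demo[~((demo['W'] > 10) | (demo['Y'] > 1))]

print(either.index.tolist())
print(neither.index.tolist())
```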


In [83]:
# Reset to the default integer index 0, 1, ..., n-1; the old index becomes a column
df.reset_index()

  index   W   X   Y   Z
0     A   2  79  -8 -86
1     B   6 -29  88 -80
2     C   2  21 -26 -13
3     D  16  -1   3  51
4     E  30  49 -48 -99


In [86]:
newindx = 'CA NY WY OR CO'.split()
df['States'] = newindx
df

    W   X   Y   Z States
A   2  79  -8 -86     CA
B   6 -29  88 -80     NY
C   2  21 -26 -13     WY
D  16  -1   3  51     OR
E  30  49 -48 -99     CO


In [87]:
df.set_index('States')

         W   X   Y   Z
States
CA       2  79  -8 -86
NY       6 -29  88 -80
WY       2  21 -26 -13
OR      16  -1   3  51
CO      30  49 -48 -99
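
Like `drop`, `set_index` returns a new DataFrame and leaves the original unchanged unless the result is assigned. This small sketch shows the behavior on a toy frame; `demo` and `by_state` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({'W': [2, 6], 'States': ['CA', 'NY']}, index=['A', 'B'])

demo.set_index('States')             # result is discarded; demo keeps its index
print(demo.index.tolist())

by_state = demo.set_index('States')  # keep the result under a new name
print(by_state.index.tolist())
print(by_state.loc['NY', 'W'])       # rows can now be looked up by state
```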


### DataFrame Summaries

There are several ways to obtain summary information about a DataFrame.<br>
<tt><strong>df.describe()</strong></tt> provides summary statistics for all numeric columns.<br>
<tt><strong>df.dtypes</strong></tt> lists the data type of each column, and <tt><strong>df.info()</strong></tt> also reports non-null counts and memory usage.

In [88]:
df.describe()

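              W          X          Y          Z
count   5.00000   5.000000   5.000000   5.000000
mean   11.20000  23.800000   1.800000 -45.400000
std    11.96662  42.109381  51.915316  63.366395
min     2.00000 -29.000000 -48.000000 -99.000000
25%     2.00000  -1.000000 -26.000000 -86.000000
50%     6.00000  21.000000  -8.000000 -80.000000
75%    16.00000  49.000000   3.000000 -13.000000
max    30.00000  79.000000  88.000000  51.000000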


In [89]:
df.dtypes

W          int64
X          int64
Y          int64
Z          int64
States    object
dtype: object

In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, A to E
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   W       5 non-null      int64 
 1   X       5 non-null      int64 
 2   Y       5 non-null      int64 
 3   Z       5 non-null      int64 
 4   States  5 non-null      object
dtypes: int64(4), object(1)
memory usage: 400.0+ bytes
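
Note that the `describe()` output above has no `States` column, because non-numeric columns are skipped by default. Passing `include='all'` is one way to summarize them as well. The sketch below uses a toy frame, and the names `demo`, `numeric_summary`, and `full_summary` are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'W': [2, 6, 2], 'States': ['CA', 'NY', 'WY']})

numeric_summary = demo.describe()           # numeric columns only
full_summary = demo.describe(include='all') # adds unique/top/freq rows for text columns

print(numeric_summary.columns.tolist())
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of `States`, show up as `NaN` in the combined table.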


***