# Series

The first main data type we will cover in pandas is the Series. Let's import pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [22]:
import numpy as np
import pandas as pd

# Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [24]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {
        'a':10,
        'b':20,
        'c':30
    }

## Using Lists

In [25]:
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

## Using NumPy Arrays

In [26]:
pd.Series(data=arr, index=labels)

a    10
b    20
c    30
dtype: int64

## Using Dicts

In [27]:
pd.Series(data=d)

a    10
b    20
c    30
dtype: int64

# Data objects in Series

Series can hold different types of objects:

e.g. a list of strings:

In [28]:
pd.Series(data=['Robin', 'Deifilia', 'Eric'])

0       Robin
1    Deifilia
2        Eric
dtype: object

e.g. a list of functions:

In [29]:
pd.Series(data=[len, sum, max])

0    <built-in function len>
1    <built-in function sum>
2    <built-in function max>
dtype: object

# Indexing

pandas uses these index labels or numbers for fast lookups of information (the index works like a hash table or dictionary).

Below is a demo of how to access data in a pandas Series object. Consider the following two Series, ser1 and ser2:

In [None]:
population_19 = {
    'Canada': 37.59,
    'USA': 328.20,
    'Mexico': 126.01,
    'Japan': 126.27,
    'Singapore': 5.7
}

population_20 = {
    'Canada': 37.72,
    'USA': 331.20,
    'Mexico': 128.93,
    'Japan': 126.48,
    'Singapore': 5.8,
    'New Zealand': 4.82
}

In [30]:
ser1 = pd.Series(population_19)

In [31]:
ser1

Canada        37.59
USA          328.20
Mexico       126.01
Japan        126.27
Singapore      5.70
dtype: float64

In [32]:
ser2 = pd.Series(population_20)

In [33]:
ser2

Canada          37.72
USA            331.20
Mexico         128.93
Japan          126.48
Singapore        5.80
New Zealand      4.82
dtype: float64

In [34]:
ser1['Canada']

37.59

In [35]:
ser2['USA']

331.2
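Besides plain brackets, a Series offers a few other lookup styles. The sketch below uses a shortened copy of the population data (the variable names are illustrative) to show explicit label-based and position-based access, plus dict-like membership tests:

```python
import pandas as pd

ser = pd.Series({'Canada': 37.59, 'USA': 328.20, 'Mexico': 126.01})

by_label = ser.loc['USA']       # explicit label-based lookup
by_position = ser.iloc[1]       # explicit position-based lookup (second element)
has_japan = 'Japan' in ser      # membership is tested against the index, like dict keys
fallback = ser.get('Japan', float('nan'))  # .get returns a default instead of raising KeyError

print(by_label, by_position, has_japan, fallback)
```

Using `.loc` and `.iloc` makes it explicit whether a label or a position is meant, which matters once the labels themselves are integers.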

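We can compute the population growth rate (in percent) by subtracting the two Series objects and dividing by the earlier year. The arithmetic aligns on index labels, so New Zealand, which appears only in ser2, comes out as NaN: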

In [36]:
((ser2 - ser1) / ser1) * 100

Canada         0.345837
Japan          0.166310
Mexico         2.317276
New Zealand         NaN
Singapore      1.754386
USA            0.914077
dtype: float64
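The NaN for New Zealand comes from label alignment: arithmetic between two Series matches values by index label, and a label present on only one side has nothing to pair with. A small sketch with a reduced data set (names `s19` and `s20` are illustrative) shows how to find and drop such entries:

```python
import pandas as pd

s19 = pd.Series({'Canada': 37.59, 'USA': 328.20})
s20 = pd.Series({'Canada': 37.72, 'USA': 331.20, 'New Zealand': 4.82})

# Values are paired by label; the result index is the union of both indexes.
growth = (s20 - s19) / s19 * 100

missing = growth[growth.isna()].index.tolist()  # labels found in only one Series
clean = growth.dropna()                         # keep countries present in both years

print(missing)
print(clean)
```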

We now move on to the main topic of pandas, the DataFrame, which builds on the pandas Series object.

# DataFrames

The DataFrame is inspired by the R programming language. You can think of a DataFrame as a collection of Series objects put together, sharing the same index.
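To make the "collection of Series" picture concrete, here is a minimal sketch that builds a DataFrame from two Series (the names `w`, `x`, and `frame` are illustrative). The Series are aligned on their labels, so a row missing from one of them is filled with NaN:

```python
import pandas as pd

w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0], index=['A', 'B'])   # no value for row 'C'

# Each dict key becomes a column; rows are the union of the Series labels.
frame = pd.DataFrame({'W': w, 'X': x})

print(frame)
print(type(frame['W']))   # each column is itself a Series
```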

In [38]:
from numpy.random import randn

# Set the random seed so the results are reproducible
np.random.seed(5555)

['A', 'B', 'C', 'D', 'E']

In [None]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [41]:
df

          W         X         Y         Z
A  1.645072  2.259312 -2.051142  1.050748
B  0.186003  0.103139  0.431995  0.745913
C -0.856635  1.696150 -0.838842  0.600775
D -1.837590 -0.209084 -0.680674 -2.211914
E  0.294773  0.360709  0.254907  1.634370


# Selection and Indexing

Ways of accessing data in a Dataframe object:

In [42]:
df['W']   # Does the output look familiar?

A    1.645072
B    0.186003
C   -0.856635
D   -1.837590
E    0.294773
Name: W, dtype: float64

In [43]:
type(df['W'])

pandas.core.series.Series

We can also pass a list of column names:

In [44]:
df[['W', 'X']]

          W         X
A  1.645072  2.259312
B  0.186003  0.103139
C -0.856635  1.696150
D -1.837590 -0.209084
E  0.294773  0.360709


In [45]:
df['W', 'X']

KeyError: ('W', 'X')

For pandas objects (Series, DataFrame), the indexing operator [] only accepts

    - a column name or a list of column names, to select column(s)
    - a slice or Boolean array, to select row(s)

In other words, [] refers to only one dimension of the DataFrame at a time. This is why df['W', 'X'] raises a KeyError: the tuple ('W', 'X') is treated as a single column name.

For df[[colname(s)]], the inner brackets build a list and the outer brackets are the indexing operator, so you must use double brackets to select two or more columns. With a single column name, one pair of brackets returns a Series, while double brackets return a DataFrame.

source: https://stackoverflow.com/questions/33417991/pandas-why-are-double-brackets-needed-to-select-column-after-boolean-indexing
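The rules for `[]` can be checked on a tiny frame. This is only a sketch; `df_demo` and its values are made up for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4]}, index=['A', 'B'])

one_col_series = df_demo['W']             # single name -> Series
one_col_frame = df_demo[['W']]            # list of names -> DataFrame (even with one name)
first_row = df_demo[0:1]                  # a slice selects rows, not columns
positive_w = df_demo[df_demo['W'] > 1]    # a Boolean Series also selects rows

print(type(one_col_series).__name__, type(one_col_frame).__name__)
print(first_row)
print(positive_w)
```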

In [46]:
# Attribute (dot) access; works only when the column name is a valid Python identifier
df.X

A    2.259312
B    0.103139
C    1.696150
D   -0.209084
E    0.360709
Name: X, dtype: float64

Creating new column:

In [47]:
df['new_col'] = df['X'] + df['Y']

In [48]:
df

          W         X         Y         Z   new_col
A  1.645072  2.259312 -2.051142  1.050748  0.208170
B  0.186003  0.103139  0.431995  0.745913  0.535134
C -0.856635  1.696150 -0.838842  0.600775  0.857308
D -1.837590 -0.209084 -0.680674 -2.211914 -0.889758
E  0.294773  0.360709  0.254907  1.634370  0.615615


Removing a column from df:

In [49]:
df.drop('new_col', axis=1)

          W         X         Y         Z
A  1.645072  2.259312 -2.051142  1.050748
B  0.186003  0.103139  0.431995  0.745913
C -0.856635  1.696150 -0.838842  0.600775
D -1.837590 -0.209084 -0.680674 -2.211914
E  0.294773  0.360709  0.254907  1.634370


In [50]:
df   # Notice the column, new_col, is still in df.

          W         X         Y         Z   new_col
A  1.645072  2.259312 -2.051142  1.050748  0.208170
B  0.186003  0.103139  0.431995  0.745913  0.535134
C -0.856635  1.696150 -0.838842  0.600775  0.857308
D -1.837590 -0.209084 -0.680674 -2.211914 -0.889758
E  0.294773  0.360709  0.254907  1.634370  0.615615


This is because drop does not modify the DataFrame in place unless you pass inplace=True. By default, it returns a new DataFrame object and leaves the original untouched.

In [51]:
df.drop('new_col', axis=1, inplace=True)

In [52]:
df

          W         X         Y         Z
A  1.645072  2.259312 -2.051142  1.050748
B  0.186003  0.103139  0.431995  0.745913
C -0.856635  1.696150 -0.838842  0.600775
D -1.837590 -0.209084 -0.680674 -2.211914
E  0.294773  0.360709  0.254907  1.634370
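The two drop behaviours shown above can be compared side by side. The following sketch uses a small made-up frame (`df_demo`) and the `columns=` keyword, which is equivalent to passing `axis=1`:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new_col': [4, 6]})

trimmed = df_demo.drop(columns='new_col')     # same as drop('new_col', axis=1)
still_there = 'new_col' in df_demo.columns    # True: the original is untouched

# With inplace=True the frame itself is modified and the call returns None,
# so the result should not be assigned back to a variable.
returned = df_demo.drop('new_col', axis=1, inplace=True)

print(list(trimmed.columns), still_there, returned, list(df_demo.columns))
```

Reassigning the result (`df = df.drop(...)`) is a common alternative to `inplace=True` and makes the change visible in the code.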


Dropping rows:

In [53]:
df.drop('A', axis=0)

          W         X         Y         Z
B  0.186003  0.103139  0.431995  0.745913
C -0.856635  1.696150 -0.838842  0.600775
D -1.837590 -0.209084 -0.680674 -2.211914
E  0.294773  0.360709  0.254907  1.634370


Accessing Rows:

In [54]:
df.loc['A']  # Label based location

W    1.645072
X    2.259312
Y   -2.051142
Z    1.050748
Name: A, dtype: float64

In [55]:
type(df.loc['A'])

pandas.core.series.Series

Select by element index position rather than labels:

In [56]:
df.iloc[0]  # Integer position based location

W    1.645072
X    2.259312
Y   -2.051142
Z    1.050748
Name: A, dtype: float64

In [58]:
df

          W         X         Y         Z
A  1.645072  2.259312 -2.051142  1.050748
B  0.186003  0.103139  0.431995  0.745913
C -0.856635  1.696150 -0.838842  0.600775
D -1.837590 -0.209084 -0.680674 -2.211914
E  0.294773  0.360709  0.254907  1.634370


Selecting by subset of rows and columns:

In [57]:
df.loc['B','Y']

0.43199540294682964

In [59]:
df.loc[['A','B'],['W','Y']]

          W         Y
A  1.645072 -2.051142
B  0.186003  0.431995
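One difference between `.loc` and `.iloc` that is easy to miss is how slices behave. In this sketch (the frame `df_demo` is made up for illustration), a label slice includes its end point while a position slice excludes it:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                       index=['A', 'B', 'C'],
                       columns=['W', 'X', 'Y', 'Z'])

by_label = df_demo.loc['A':'B', 'W':'X']   # label slices include both end points
by_position = df_demo.iloc[0:2, 0:2]       # position slices exclude the stop, like Python lists

print(by_label)
print(by_label.equals(by_position))
```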


# Conditional Selection


An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [60]:
df

          W         X         Y         Z
A  1.645072  2.259312 -2.051142  1.050748
B  0.186003  0.103139  0.431995  0.745913
C -0.856635  1.696150 -0.838842  0.600775
D -1.837590 -0.209084 -0.680674 -2.211914
E  0.294773  0.360709  0.254907  1.634370


In [63]:
type(df > 0)   # the comparison itself returns a Boolean DataFrame

pandas.core.frame.DataFrame

Showing positive cells only:

In [62]:
df[df > 0]

          W         X         Y         Z
A  1.645072  2.259312       NaN  1.050748
B  0.186003  0.103139  0.431995  0.745913
C       NaN  1.696150       NaN  0.600775
D       NaN       NaN       NaN       NaN
E  0.294773  0.360709  0.254907  1.634370


In [None]:
df[df['W'] > 0]     # Returns only rows A, B, and E, where W is positive

In [None]:
df['W'] > 0

In [None]:
df[df['W'] > 0]['Y']

In [None]:
df[df['W'] > 0][['Y','X']]

In [None]:
df[df['Z'] < 0]

In [None]:
df['Z'] < 0

To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [None]:
df[(df['W'] > 0) & (df['Y'] > 1)]

In [None]:
df[(df['W'] > 0) & (df['Y'] > 0)]
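The element-wise operators are needed because Python's `and` and `or` expect a single truth value and raise a ValueError when given a whole Series. A short sketch with made-up values shows `&`, `|`, the negation operator `~`, and selecting rows and a column in one `.loc` call:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.6, 0.2, -0.9], 'Y': [-2.1, 0.4, -0.8]},
                       index=['A', 'B', 'C'])

w_pos = df_demo['W'] > 0
y_pos = df_demo['Y'] > 0

both = df_demo[w_pos & y_pos]          # rows where both conditions hold
either = df_demo[w_pos | y_pos]        # rows where at least one holds
neither = df_demo[~(w_pos | y_pos)]    # ~ negates a Boolean Series

# .loc takes the row mask and the column label together,
# avoiding the chained df[...][...] form.
y_where_w_pos = df_demo.loc[w_pos, 'Y']

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```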

In [None]:
df

## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [None]:
df

In [None]:
# Reset to default 0,1...n index
df.reset_index()

In [None]:
provinces = 'BC ON QC SK AB'.split()

In [None]:
df['Province'] = provinces

In [None]:
df

In [None]:
df.set_index('Province', inplace=True)

In [None]:
df

## Multi-Index and Index Hierarchy

Let us go over how to work with a MultiIndex. First, we'll create a quick example of what a multi-indexed DataFrame looks like:

In [None]:
# Index Levels
teams = {
    'Atlanta Hawks': 14,
    'Boston Celtics': 3,
    'Brooklyn Nets': 7,
    'Charlotte Bobcats': 10,
    'Chicago Bulls': 11,
    'Cleveland Cavaliers': 15,
    'Detroit Pistons': 13,
    'Indiana Pacers': 4,
    'Miami Heat': 5,
    'Milwaukee Bucks': 1,
    'New York Knicks': 12,
    'Orlando Magic': 8,
    'Philadelphia Sixers': 6,
    'Toronto Raptors': 2,
    'Washington Wizards': 9,
    'Dallas Mavericks': 7,
    'Denver Nuggets': 3,
    'Golden State Warriors': 15,
    'Houston Rockets': 4,
    'LA Clippers': 2,
    'LA Lakers': 1,
    'Memphis Grizzlies': 9,
    'Minnesota Timberwolves': 14,
    'New Orleans Hornets': 13,
    'Oklahoma City Thunder': 5,
    'Phoenix Suns': 10,
    'Portland Trail Blazers': 8,
    'Sacramento Kings': 12,
    'San Antonio Spurs': 11,
    'Utah Jazz': 6
}
conf = ['Eastern' if i < 15 else 'Western' for i in range(30)]
hier_index = list(zip(conf, teams))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [None]:
hier_index

In [None]:
nba_df = pd.DataFrame(list(teams.values()),index=hier_index,columns=['Standing'])
nba_df

Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation, df[]. Selecting one level of the index returns the sub-DataFrame:

In [None]:
nba_df.loc['Western']

In [None]:
nba_df.loc['Western'].loc['LA Lakers']

In [None]:
nba_df[nba_df['Standing'] == 1]   # Get best teams from both conferences
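Chained `.loc` calls are not the only way into a MultiIndex. The sketch below uses a three-team sample and assumed level names (`Conference`, `Team`, which the cells above do not set) to show a tuple lookup and the `xs` cross-section method:

```python
import pandas as pd

# Level names are an assumption added for this example.
idx = pd.MultiIndex.from_tuples(
    [('Eastern', 'Boston Celtics'),
     ('Eastern', 'Toronto Raptors'),
     ('Western', 'LA Lakers')],
    names=['Conference', 'Team'])
demo = pd.DataFrame({'Standing': [3, 2, 1]}, index=idx)

# A tuple addresses both levels at once and returns a single value here.
lakers = demo.loc[('Western', 'LA Lakers'), 'Standing']

# xs selects on a named level, including an inner one.
east = demo.xs('Eastern', level='Conference')
raptors = demo.xs('Toronto Raptors', level='Team')

print(lakers)
print(east)
print(raptors)
```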

# Missing Data

Convenient methods to deal with Missing Data in pandas:

In [None]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df

In [None]:
df.dropna()

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(thresh=2)

In [None]:
df.fillna(value='Fill in value')

In [None]:
df['A'].fillna(value=df['A'].mean())

In [None]:
df   # Again the operations above are not inplace unless specified
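As a compact recap of the calls above, this sketch rebuilds the same small frame under a different name (`df_demo`) and adds one variation: filling every column with its own mean by passing the result of `mean()` to `fillna`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan],
                        'B': [5, np.nan, np.nan],
                        'C': [1, 2, 3]})

nan_counts = df_demo.isna().sum()        # number of missing values per column
kept = df_demo.dropna(thresh=2)          # keep rows with at least 2 non-NaN values
filled = df_demo.fillna(df_demo.mean())  # fill each column with that column's mean

print(nan_counts)
print(kept)
print(filled)
```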

# Groupby

The groupby method allows you to group rows of data together and call aggregate functions on each group.

In [None]:
data = {
    'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
    'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
    'Sales':[200,120,340,124,243,350]
}
df = pd.DataFrame(data)
df

We can use the .groupby() method to group rows together based on a column name. For instance, let's group by Company. This creates a DataFrameGroupBy object:

In [None]:
df.groupby('Company')

In [None]:
by_comp = df.groupby("Company")

In [None]:
by_comp.mean(numeric_only=True)  # Skip string columns such as Person; pandas 2.0+ raises a TypeError without this

Getting standard deviation: 

In [None]:
by_comp.std(numeric_only=True)

More aggregate methods:

In [None]:
by_comp.min()

In [None]:
by_comp.max()

In [None]:
by_comp.count()

In [None]:
by_comp.describe()

In [None]:
by_comp.describe().transpose()

In [None]:
by_comp.describe().loc['GOOG']

In [None]:
by_comp.describe().transpose()['GOOG']
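Several aggregates can also be requested in one call with `agg`. The sketch below reuses the sales data from above under a different variable name (`sales`) and selects the `Sales` column first, so the string column never enters the calculation:

```python
import pandas as pd

sales = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350]
})

# One row per company, one column per aggregate.
summary = sales.groupby('Company')['Sales'].agg(['mean', 'max', 'count'])
print(summary)
```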

# Merging, Joining, and Concatenating

There are three main ways of combining DataFrames: merging, joining, and concatenating. We will see some examples:

In [None]:
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7]) 

df3 = pd.DataFrame({
    'A': ['A8', 'A9', 'A10', 'A11'],
    'B': ['B8', 'B9', 'B10', 'B11'],
    'C': ['C8', 'C9', 'C10', 'C11'],
    'D': ['D8', 'D9', 'D10', 'D11']},
    index=[8, 9, 10, 11])

In [None]:
df1

In [None]:
df2

In [None]:
df3

## Concatenation

Concatenation glues DataFrames together. Keep in mind that the labels on the other axis should match; by default, pandas fills any labels missing from one DataFrame with NaN. You can use **pd.concat** and pass in a list of DataFrames to concatenate together:

In [None]:
pd.concat([df1,df2,df3])

In [None]:
pd.concat([df1,df2,df3],axis=1)
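The effect of the `axis` argument is easiest to see on two tiny frames. In this sketch (frames `a` and `b` are made up), stacking rows works cleanly, while placing the frames side by side produces NaN because their row labels do not overlap:

```python
import pandas as pd

a = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
b = pd.DataFrame({'A': ['A2', 'A3']}, index=[2, 3])

rows = pd.concat([a, b])            # stack vertically: 4 rows, 1 column
cols = pd.concat([a, b], axis=1)    # side by side: rows are aligned on the index
renumbered = pd.concat([a, a], ignore_index=True)  # discard old labels, use 0..n-1

print(rows)
print(cols)
print(renumbered)
```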

In [None]:
left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})   

In [None]:
left

In [None]:
right

## Merging

The **merge** function allows you to merge DataFrames together using logic similar to joining SQL tables. For example:

In [None]:
pd.merge(left,right,how='inner',on='key')

A more complicated example:

In [None]:
left = pd.DataFrame({
    'key1': ['K0', 'K0', 'K1', 'K2'],
    'key2': ['K0', 'K1', 'K0', 'K1'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']})
    
right = pd.DataFrame({
    'key1': ['K0', 'K1', 'K1', 'K2'],
    'key2': ['K0', 'K0', 'K0', 'K0'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']})

In [None]:
left

In [None]:
right

In [None]:
pd.merge(left, right, on=['key1', 'key2'])  # default: inner join

In [None]:
pd.merge(left, right, how='outer', on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='right', on=['key1', 'key2'])

In [None]:
pd.merge(left, right, how='left', on=['key1', 'key2'])
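To see which side each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch uses two small made-up frames with a single key column:

```python
import pandas as pd

left_demo = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right_demo = pd.DataFrame({'key': ['K0', 'K2', 'K3'], 'C': ['C0', 'C2', 'C3']})

inner = pd.merge(left_demo, right_demo, on='key')   # only keys found in both frames
outer = pd.merge(left_demo, right_demo, on='key',
                 how='outer', indicator=True)       # all keys, with their origin

print(inner)
print(outer)
```

The `_merge` column takes the values `both`, `left_only`, and `right_only`, which is handy for checking a join before relying on it.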

## Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

In [None]:
left = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']},
    index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({
    'C': ['C0', 'C2', 'C3'],
    'D': ['D0', 'D2', 'D3']},
    index=['K0', 'K2', 'K3'])

In [None]:
left

In [None]:
right

In [None]:
left.join(right)  # Join by index instead of columns

In [None]:
left.join(right, how='outer')
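A join is essentially a merge on the index. The following sketch, using reduced versions of the frames above under different names, compares `join` with the equivalent `pd.merge` call:

```python
import pandas as pd

left_demo = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right_demo = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

default_join = left_demo.join(right_demo)             # left join: keeps K0, K1, K2
inner_join = left_demo.join(right_demo, how='inner')  # only labels in both: K0, K2

# The same inner join written as a merge on both indexes.
via_merge = pd.merge(left_demo, right_demo, left_index=True, right_index=True)

print(default_join)
print(inner_join.equals(via_merge))
```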

In [None]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

In [None]:
df['col2'].unique()

In [None]:
df['col2'].nunique()   # Number of unique values in the column 'col2'

In [None]:
df['col2'].value_counts()

Selecting data in df:

In [None]:
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf

### Applying Functions


In [None]:
def square(x):
    return x**2

df['col1'].apply(square)

The apply function also accepts lambda expressions, which are convenient for short one-off functions:

In [None]:
df['col1'].apply(lambda n: n**2)

In [None]:
df['col3'].apply(len)
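For simple arithmetic and string operations, apply is often not needed at all, since pandas provides vectorized equivalents. A brief sketch with a made-up frame compares the two approaches:

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                        'col3': ['abc', 'def', 'ghi', 'xyz']})

squared_apply = df_demo['col1'].apply(lambda n: n ** 2)  # calls the function per element
squared_vec = df_demo['col1'] ** 2                       # vectorized arithmetic, same result
lengths = df_demo['col3'].str.len()                      # vectorized string method

print(squared_apply.equals(squared_vec))
print(lengths.tolist())
```

The vectorized forms are usually faster on large columns; apply remains useful when no built-in operation fits.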

In [None]:
df['col1'].sum()

Permanently Removing a Column:

In [None]:
del df['col1'] # Or by df.drop(...)

In [None]:
df

Getting index and column names:

In [None]:
df.index

In [None]:
df.columns

# Sorting

In [None]:
df.sort_values(by='col2') #inplace=False by default

Checking for null values:

In [None]:
df.isnull()

In [None]:
df['NaN_col'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
df

In [None]:
df.dropna()  # Drop rows with NaN values

In [None]:
df.fillna('FILL DATA')

In [None]:
data = {
    'A':['foo','foo','foo','bar','bar','bar'],
    'B':['one','one','two','two','one','one'],
    'C':['x','y','x','y','x','y'],
    'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)
df

In [None]:
df.pivot_table(values='D',
               index=['A', 'B'],
               columns=['C']) # Just like Excel
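A pivot table averages the values in each cell by default and leaves NaN where a combination never occurs. This sketch repeats the data above under a different name (`pivot_src`) and shows how `aggfunc` and `fill_value` change that:

```python
import pandas as pd

pivot_src = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two', 'one', 'one'],
    'C': ['x', 'y', 'x', 'y', 'x', 'y'],
    'D': [1, 3, 2, 5, 4, 1]})

# Default: mean of D per (A, B) row and C column; missing combinations are NaN.
table = pivot_src.pivot_table(values='D', index=['A', 'B'], columns='C')

# Sum instead of mean, and show 0 for combinations that do not occur.
totals = pivot_src.pivot_table(values='D', index=['A', 'B'], columns='C',
                               aggfunc='sum', fill_value=0)

print(table)
print(totals)
```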

# Data Input / Output

## CSV input

In [None]:
df = pd.read_csv('example.csv')
df

In [None]:
df['Gender'] = pd.Series(data=['Male', 'Female', 'Female'])
df

## CSV Output

In [None]:
df.to_csv('example.csv',index=False)
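The reason for `index=False` becomes clear in a round trip. The sketch below writes to in-memory buffers instead of a file (so it does not depend on example.csv) and shows that a saved index comes back as an extra column named `Unnamed: 0`:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Without the index: reading the text back reproduces the frame.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# With the index (the default): to_csv returns a string when no path is given,
# and the unnamed index column shows up as 'Unnamed: 0' on re-reading.
with_index_text = df_demo.to_csv()
reread = pd.read_csv(io.StringIO(with_index_text))

print(round_trip.columns.tolist())
print(reread.columns.tolist())
```

Passing `index_col=0` to `read_csv` is the alternative when the index is worth keeping.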

## Excel

pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; files containing images or macros may cause the read_excel method to fail.

In [None]:
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

In [None]:
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

## HTML

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt, run:

    conda install lxml
    conda install html5lib
    conda install beautifulsoup4

Then restart Jupyter Notebook.
(or use pip install if you aren't using the Anaconda Distribution)

Quick tip: you can run terminal commands in Jupyter Notebook by putting ```!``` in front of the statement, e.g. ```!conda install lxml```.

The pandas read_html function reads the tables on a web page and returns a list of DataFrame objects:

In [None]:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')

In [None]:
df

In [None]:
df[0]

# SQL (Optional)

If you're interested in learning SQL, I strongly recommend taking COMP421.

The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.

If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter that respects the Python DB-API.

See also some cookbook examples for some advanced strategies.

The key functions are:

    - read_sql_table(table_name, con[, schema, ...]): read a SQL database table into a DataFrame.
    - read_sql_query(sql, con[, index_col, ...]): read a SQL query into a DataFrame.
    - read_sql(sql, con[, index_col, ...]): read a SQL query or database table into a DataFrame.
    - DataFrame.to_sql(name, con[, schema, ...]): write records stored in a DataFrame to a SQL database.

In [None]:
from sqlalchemy import create_engine

In [None]:
engine = create_engine('sqlite:///:memory:')

In [None]:
df[0].to_sql('data', engine)

In [None]:
sql_df = pd.read_sql('data',con=engine)

In [None]:
sql_df
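For SQLite, the same round trip also works without SQLAlchemy, using the `sqlite3` module from the standard library. The sketch below is self-contained: the table name `banks` and its contents are made up, and it uses `read_sql_query` with a parameterized query rather than reading the whole table:

```python
import sqlite3
import pandas as pd

df_demo = pd.DataFrame({'bank': ['First', 'Second'],
                        'city': ['Austin', 'Boise']})

con = sqlite3.connect(':memory:')           # temporary in-memory database
df_demo.to_sql('banks', con, index=False)   # write the frame as a table

# The ? placeholder is filled from params, which avoids building SQL by hand.
result = pd.read_sql_query('SELECT bank FROM banks WHERE city = ?',
                           con, params=('Austin',))
con.close()

print(result)
```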