# Pandas Basics

Much like numpy can be used to query and analyze data, pandas can do the same thing!  Both tools are used in the data science community because they fill different requirements and excel at different things.  In this lecture we will learn more about pandas and its strengths, specifically:

* Pandas versus Numpy
* Pandas indexes
* Working with data in Pandas
* SQL Intro
* Querying DataFrames
* Merging DataFrames
* Stacking and Unstacking
* Aggregations

## Pandas versus Numpy

In the last lecture we learned about Numpy and its power for making Python very fast.  We saw how to query data and learned how Numpy "thinks" about data: as a tensor.  In this lecture we will look at Pandas, which thinks about data primarily like a database table.  This is because pandas provides an obvious primitive for one dimensional tensors, the `Series`, and an obvious primitive for two dimensional tensors, the `DataFrame`, but no obvious primitive for "higher order" tensors.  You can build higher order structures in pandas, but honestly that set of primitives is tough to work with.

While pandas is limited in the number of dimensions it can represent, most data is either one dimensional or two dimensional anyway, or at the very least can be decomposed into order 1 or order 2 tensors, so this isn't as much of a restriction as you might think.  In this way you can think of pandas as "specializing" in the typical case and not worrying about the exotic cases.
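One way to decompose higher order data into a two dimensional structure is to fold the extra dimension into a `MultiIndex` on the rows. The following is only a sketch under assumed data: the 2x3x4 array and the names `tensor`, `depth`, and `row` are illustrative and do not come from the lecture.

```python
import numpy as np
import pandas as pd

# A small order 3 tensor: 2 "depth" slices, each a 3x4 matrix (illustrative data).
tensor = np.arange(24).reshape(2, 3, 4)

# Fold the first two axes into a two-level row index so the data fits in a DataFrame.
row_index = pd.MultiIndex.from_product([range(2), range(3)], names=["depth", "row"])
stacked = pd.DataFrame(tensor.reshape(6, 4), index=row_index)

# Selecting on the outer level recovers one 3x4 slice of the original tensor.
second_slice = stacked.loc[1]
print(second_slice)
```

This works, but every extra dimension adds another index level to keep track of, which is part of why higher order data is usually more comfortable in numpy.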

Additionally, pandas is generally somewhat slower than numpy.  For this reason you can access the numpy equivalent of a pandas data structure, which in some cases gives you numpy's speed alongside pandas' ease of use.  That said, not all pandas operations can be done with the numpy objects.
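As a minimal sketch of getting at the numpy equivalents, `to_numpy()` returns the data of a `Series` or `DataFrame` as a plain numpy array. The small `series` and `frame` objects here are made up for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, 4.0, 7.0])
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# to_numpy() hands back the values as a numpy ndarray.
series_values = series.to_numpy()
frame_values = frame.to_numpy()

print(type(series_values), series_values.shape)
print(type(frame_values), frame_values.shape)
```

Note that the resulting arrays carry no row or column labels, which is one reason label-based pandas operations have no direct counterpart on the numpy objects.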

Let's look at some simple examples of how to do things in numpy and equivalently in pandas:

In [5]:
import numpy as np
import time

start = time.time()
vector_one = np.array([1, 4, 7])
vector_two = np.array([2, 4, 6])

print("Result", np.matmul(vector_one, vector_two.T))
print("took ", time.time() - start, "seconds")

Result 60
took  0.00042819976806640625 seconds


In [6]:
import pandas as pd

start = time.time()
series_one = pd.Series(vector_one)
series_two = pd.Series(vector_two)

print("Result", series_one.dot(series_two.T))
print("took", time.time() - start, "seconds")

Result 60
took 0.0012764930725097656 seconds


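As you can see, the numpy version is about three times faster in this run (roughly 0.0004 seconds versus 0.0013 seconds), though a single timing like this is noisy.  The pandas version, on the other hand, is a little bit easier to read.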
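A single wall-clock reading also includes the time spent constructing the objects and printing. A somewhat more careful comparison can be sketched with the standard library's `timeit` module, which averages over many calls. The call count below is an arbitrary choice, and the numbers will differ from machine to machine.

```python
import timeit
import numpy as np
import pandas as pd

vector_one = np.array([1, 4, 7])
vector_two = np.array([2, 4, 6])
series_one = pd.Series(vector_one)
series_two = pd.Series(vector_two)

# Time only the dot product itself, averaged over many calls.
n_calls = 10_000
numpy_seconds = timeit.timeit(lambda: np.dot(vector_one, vector_two), number=n_calls) / n_calls
pandas_seconds = timeit.timeit(lambda: series_one.dot(series_two), number=n_calls) / n_calls

print("numpy :", numpy_seconds, "seconds per call")
print("pandas:", pandas_seconds, "seconds per call")
```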

Next let's look at how to slice pandas versus numpy data structures:

In [10]:
array = np.random.normal(0, 1, size=10000)

start = time.time()
print("Result", len(array[array > 0.5])/len(array))
print("took", time.time() - start, "seconds")

Result 0.3196
took 0.0003104209899902344 seconds


In [11]:
series = pd.Series(array)

start= time.time()
print("Result", len(series[series > 0.5])/len(array))
print("took", time.time() - start, "seconds")

Result 0.3196
took 0.0016829967498779297 seconds


So, as you can see, numpy beats pandas for speed in both comparisons.  However, there are two ways in which pandas is the better choice.  First, look at the number of public attributes and methods on a pandas Series versus a numpy order 1 tensor (counting names that contain no underscore, as a rough proxy):

In [12]:
numpy_methods = [method for method in dir(array) if "_" not in method]
pandas_methods = [method for method in dir(series) if "_" not in method]
print(len(numpy_methods))
print(len(pandas_methods))

71
158


It should be clear that you can do _a lot_ more with a pandas Series out of the box than with a numpy order 1 tensor.  While this isn't always a good thing, it's worth keeping in mind the strengths and deficiencies of both frameworks.
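To give a feel for what "out of the box" means, here is a small sketch of two built-in `Series` methods. The window size of 100 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.random.normal(0, 1, size=10000))

# Summary statistics in one call (count, mean, std, min, quartiles, max).
summary = series.describe()
print(summary)

# A rolling mean over a 100-element window; this takes several steps in plain numpy.
rolling_mean = series.rolling(window=100).mean()
print(rolling_mean.tail())
```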

## Pandas Indexes

Pandas dataframes come equipped with three indexes:
* an index for columns
* an index for rows
* an ordering for rows that serves as an implicit index.

The third index is the least straightforward without an example, and we'll look at it soon.

Let's look at a motivating example for the rows and columns first:

In [14]:
import pandas as pd

supply_matrix = np.random.rand(3,3) * 100
demand_vector = np.random.rand(3) * 100

df = pd.DataFrame(supply_matrix)
df.columns = ["Manufacturing", "Technology", "Real Estate"]
df

| | Manufacturing | Technology | Real Estate |
|---|---|---|---|
| 0 | 15.733845 | 54.994273 | 26.636739 |
| 1 | 39.737975 | 14.633238 | 27.060675 |
| 2 | 99.48918 | 84.865771 | 62.413707 |


As you can see, the column names are listed above the dataframe's data and tell us what each column means.  If you've ever seen an Excel spreadsheet, a database table, or a CSV, these carry the same meaning.

Additionally, there is a row index, shown on the left-hand side of the dataframe.  This can either be a simple enumeration of the rows, numbered from 0, or it can take on a semantic meaning as follows:

In [20]:
import pandas as pd

supply_matrix = np.random.rand(3,3) * 100

df = pd.DataFrame(supply_matrix)
df.columns = ["Manufacturing", "Technology", "Real Estate"]
df.index = ["Manufacturing", "Technology", "Real Estate"]
df

| | Manufacturing | Technology | Real Estate |
|---|---|---|---|
| Manufacturing | 28.319233 | 3.615384 | 71.878214 |
| Technology | 15.882315 | 33.206169 | 91.800734 |
| Real Estate | 23.779109 | 94.587522 | 65.557476 |
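With a semantic row index in place, rows and cells can be looked up by name. The sketch below rebuilds a frame like the one above (with fresh random values) and uses `loc` for label-based access. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

sectors = ["Manufacturing", "Technology", "Real Estate"]
df = pd.DataFrame(np.random.rand(3, 3) * 100, index=sectors, columns=sectors)

# Look up a single cell by row label and column label.
cell = df.loc["Technology", "Real Estate"]

# Look up a whole row by its label; the result is a Series indexed by the column names.
technology_row = df.loc["Technology"]
print(cell)
print(technology_row)
```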


Typically, though, the simple enumeration is what gets used for the row index.  Let's return to our previous example and show the "third" index:

In [22]:
import pandas as pd

supply_matrix = np.random.rand(3,3) * 100

df = pd.DataFrame(supply_matrix)
df.columns = ["Manufacturing", "Technology", "Real Estate"]
df

| | Manufacturing | Technology | Real Estate |
|---|---|---|---|
| 0 | 88.83863 | 49.091207 | 0.960337 |
| 1 | 78.447857 | 47.530946 | 33.912809 |
| 2 | 57.043741 | 67.166951 | 32.933661 |


In [23]:
df = df.sort_values("Manufacturing")
df

| | Manufacturing | Technology | Real Estate |
|---|---|---|---|
| 2 | 57.043741 | 67.166951 | 32.933661 |
| 1 | 78.447857 | 47.530946 | 33.912809 |
| 0 | 88.83863 | 49.091207 | 0.960337 |


As you can see, the row index now conflicts with the order in which the rows appear.  This means we can query for the first row in two ways:

In [24]:
df.iloc[0]

Manufacturing    57.043741
Technology       67.166951
Real Estate      32.933661
Name: 2, dtype: float64

In [25]:
df.loc[2]

Manufacturing    57.043741
Technology       67.166951
Real Estate      32.933661
Name: 2, dtype: float64

Notice that these two ways of selecting the first row in the dataframe differ in which index they use, but return the same information: `iloc` selects by position in the current ordering, while `loc` selects by row label.  This will become especially important when we move on to merging dataframes, because the explicit row index is used rather than the "implicit" third index defined by the order in which the data appears in the dataframe.  The explicit row index _can_ be mutated, of course, but it need not be when the ordering of the data changes.  This may seem complex, but it can be desirable, especially given the ability to control when and if the explicit row index is updated.
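The difference between the explicit labels and the implicit ordering can be sketched with a one-column frame. The values are made up, and `reset_index(drop=True)` is shown as one way to renumber the labels when you decide you want them to match the positions again.

```python
import pandas as pd

df = pd.DataFrame({"Manufacturing": [88.8, 78.4, 57.0]})
sorted_df = df.sort_values("Manufacturing")

# After sorting, labels and positions disagree: position 0 holds label 2.
print(list(sorted_df.index))

# Arithmetic aligns on the explicit row index, not on position,
# so every row is matched with itself and the differences are all zero.
difference = sorted_df["Manufacturing"] - df["Manufacturing"]
print(difference)

# reset_index(drop=True) renumbers the labels so they match the positions again.
renumbered = sorted_df.reset_index(drop=True)
print(list(renumbered.index))
```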

## Working With Data In Pandas

One of pandas' greatest strengths is its ability to easily read in data and dump out data.  Let's look at an example:

In [36]:
supply_matrix = np.round(np.random.rand(500,3) * 100, 2)

df1 = pd.DataFrame(supply_matrix)
df1.columns = ["Manufacturing", "Technology", "Real Estate"]
df1.to_csv("economy.csv", index=False)

Here we pass `index=False`, which tells the method to save only the data and not the index.  Now let's see how to read the dataframe back in:

In [37]:
df2 = pd.read_csv("economy.csv")
df2.equals(df1)

True
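To see what `index=False` changes, the sketch below round-trips a tiny, made-up frame through an in-memory buffer (`io.StringIO` stands in for a file on disk) both with and without the index.

```python
import io
import pandas as pd

df = pd.DataFrame({"Manufacturing": [1.5, 2.5], "Technology": [3.5, 4.5]})

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
round_trip_with_index = pd.read_csv(with_index)
print(list(round_trip_with_index.columns))
# → ['Unnamed: 0', 'Manufacturing', 'Technology']

# With index=False only the data columns are written, so the round trip is clean.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip_without_index = pd.read_csv(without_index)
print(list(round_trip_without_index.columns))
# → ['Manufacturing', 'Technology']
```

If you do want to keep the index in the file, passing `index_col=0` to `pd.read_csv` reads that first column back in as the row index instead of as data.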

We can also read CSVs directly from the internet with the same method:

In [39]:
churn_df = pd.read_csv("https://raw.githubusercontent.com/EricSchles/datascience_book/master/Churn_Modelling.csv")
churn_df

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 9997 | 15569892 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |


As you can see, all we need to do is supply a URL pointing to a CSV and we are all set!  In addition to saving to and reading from CSVs, we can also read data directly from a database into pandas!  This is why pandas is sometimes thought of as an "in-memory" database.

For this we are going to need a connection object to the database and a relevant query:

In [40]:
!heroku config:get DATABASE_URL -a limitless-brook-05892

postgres://USER:PASSWORD@HOST:5432/DATABASE


In [53]:
import os
from sqlalchemy import create_engine
# Read the URL from the DATABASE_URL environment variable (assumed to be set) instead of pasting credentials here.
# SQLAlchemy 1.4+ rejects the "postgres://" scheme, so rewrite it to "postgresql://".
engine = create_engine(os.environ["DATABASE_URL"].replace("postgres://", "postgresql://", 1))
sql_table_df = pd.read_sql("select * from public.table", engine)

In [54]:
sql_table_df.head()


| | id | manufacturing | technology | real_estate |
|---|---|---|---|---|
| 0 | 1 | 33.02 | 62.58 | 14.47 |
| 1 | 2 | 67.9 | 85.5 | 21.02 |
| 2 | 3 | 21.38 | 15.68 | 19.35 |
| 3 | 4 | 64.38 | 35.94 | 31.53 |
| 4 | 5 | 83.37 | 12.78 | 48.09 |
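If you don't have a Postgres server at hand, the same connection-plus-query pattern can be tried with an in-memory SQLite database from the standard library. This is only a sketch: SQLite stands in for Postgres here, and the table name `economy` and its three rows are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for the Postgres server (an assumption for portability).
connection = sqlite3.connect(":memory:")

# Write a small DataFrame to a table named "economy" (a hypothetical name).
source_df = pd.DataFrame({
    "manufacturing": [33.02, 67.90, 21.38],
    "technology": [62.58, 85.50, 15.68],
})
source_df.to_sql("economy", connection, index=False)

# The filtering happens in the database; only the matching rows reach pandas.
query = "select * from economy where manufacturing > 30"
result_df = pd.read_sql(query, connection)
print(result_df)

connection.close()
```

Pushing the `where` clause into the query, rather than loading the whole table and filtering in pandas, keeps the amount of data transferred small.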
