# Manipulating Data with Pandas - part 1


pandas is an open-source library that provides data structures and data analysis tools for Python.

What problem does pandas solve?
It lets users carry out their entire data analysis workflow in Python, without having to switch to a more domain-specific language such as R.

Some [Library highlights](https://pandas.pydata.org):
- A fast and efficient **DataFrame** object for data manipulation with integrated indexing;
- Tools for **reading and writing** data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent **data alignment** and integrated handling of **missing data**: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible **reshaping** and pivoting of data sets;
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets;
- Columns can be inserted and deleted from data structures for **size mutability**;
- Aggregating or transforming data with a powerful **group by** engine allowing split-apply-combine operations on data sets;
- High performance **merging and joining** of data sets;
- **Hierarchical axis indexing** provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- **Time series** functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

First things first: import the pandas library, along with NumPy.

In [1]:
import pandas as pd
import numpy as np 

In [2]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

## Concatenation with `pd.concat` 
By default, concatenation takes place row-wise, i.e. `axis=0`. Pass `axis=1` to concatenate column-wise instead.

In [3]:
# Default concatenation

df1 = make_df('AB', [1,2])
df2 = make_df('AB', [3,4])
df3 = pd.concat([df1,df2])

In [4]:
display(df1, df2, df3)

Unnamed: 0,A,B
1,A1,B1
2,A2,B2


Unnamed: 0,A,B
3,A3,B3
4,A4,B4


Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


In [5]:
df4 = make_df('AB', [0,1])
df5 = make_df('CD', [0,1])
display(df4, df5, pd.concat([df4, df5], axis=1))

Unnamed: 0,A,B
0,A0,B0
1,A1,B1


Unnamed: 0,C,D
0,C0,D0
1,C1,D1


Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
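
One detail the examples above do not exercise is what happens when the inputs share index labels. The following is a minimal sketch on hypothetical toy frames (`x` and `y` are not part of the notebook's data): row-wise `pd.concat` keeps the original labels, duplicates included, unless `ignore_index=True` is passed.

```python
import pandas as pd

# Two toy frames that deliberately share the index labels 0 and 1
x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

# Row-wise concatenation keeps the original labels, so 0 and 1 appear twice
kept = pd.concat([x, y])

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([x, y], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Whether to keep or renumber the labels depends on whether the index carries meaning; if it is just a row counter, renumbering is usually the safer choice.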


## The `append()` method 
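The `append()` method was a shorthand for `pd.concat([df1, df2])`: you could simply call `df1.append(df2)`. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use `pd.concat` instead.

In [6]:
# On pandas < 2.0 this could be written df1.append(df2); pd.concat gives the same result
pd.concat([df1, df2])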

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


## Merge and Join 

pandas offers high-performance, in-memory join and merge operations. `pd.merge()` implements a subset of relational algebra, a formal set of rules for manipulating relational data.

Merge and join are among the most frequently used operations when working with multiple datasets.
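
Before turning to the real data, here is a minimal sketch of a many-to-one merge on hypothetical toy tables (the `employees` and `supervisors` frames and their values are made up for illustration). Each group has one supervisor but may have several employees, so the supervisor value is repeated as needed.

```python
import pandas as pd

# Hypothetical toy tables: several employees per group, one supervisor per group
employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
supervisors = pd.DataFrame({
    "group": ["Accounting", "Engineering", "HR"],
    "supervisor": ["Carly", "Guido", "Steve"],
})

# pd.merge joins on the shared "group" column; the supervisor value is
# repeated for every employee in the same group (many-to-one)
merged_toy = pd.merge(employees, supervisors, on="group")
print(merged_toy)
```

Passing `on="group"` explicitly is optional here, since `pd.merge` defaults to the columns the two frames have in common, but naming the key makes the intent easier to read.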

In [7]:
pop = pd.read_csv('state-population.csv')
areas = pd.read_csv('state-areas.csv')
abbrevs = pd.read_csv('state-abbrevs.csv')

display(pop.head(), areas.head(), abbrevs.head())

Unnamed: 0,state/region,ages,year,population
0,AL,under18,2012,1117489.0
1,AL,total,2012,4817528.0
2,AL,under18,2010,1130966.0
3,AL,total,2010,4785570.0
4,AL,under18,2011,1125763.0


Unnamed: 0,state,area (sq. mi)
0,Alabama,52423
1,Alaska,656425
2,Arizona,114006
3,Arkansas,53182
4,California,163707


Unnamed: 0,state,abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


Let's say we want to rank US states and territories by their 2008 population density. The three tables above contain the information we need, so we'll start with a many-to-one merge that adds the full state name to each population row.

In [8]:
merged = pd.merge(pop, abbrevs, how='outer', 
                  left_on='state/region', right_on='abbreviation')
# Dropping duplicate information 
merged = merged.drop('abbreviation', axis=1)
merged.head()

Unnamed: 0,state/region,ages,year,population,state
0,AL,under18,2012,1117489.0,Alabama
1,AL,total,2012,4817528.0,Alabama
2,AL,under18,2010,1130966.0,Alabama
3,AL,total,2010,4785570.0,Alabama
4,AL,under18,2011,1125763.0,Alabama


In [9]:
# Checking for mismatches 
merged.isnull().any()

state/region    False
ages            False
year            False
population       True
state            True
dtype: bool

In [10]:
merged[merged.population.isnull()].head()

Unnamed: 0,state/region,ages,year,population,state
2448,PR,under18,1990,,
2449,PR,total,1990,,
2450,PR,total,1991,,
2451,PR,under18,1991,,
2452,PR,total,1993,,


In [11]:
# Locate the regions whose state name is missing
merged.loc[merged.state.isnull(), 'state/region'].unique()

array(['PR', 'USA'], dtype=object)

In [12]:
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()

state/region    False
ages            False
year            False
population       True
state           False
dtype: bool

There are no more null values in the `state` column: we've filled them in appropriately.
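
As an alternative to scanning for nulls after the merge, `pd.merge` accepts `indicator=True`, which records where each row came from. The sketch below uses hypothetical miniature tables (`pop_toy` and `abbrevs_toy`, with made-up population figures) rather than the CSV files loaded earlier.

```python
import pandas as pd

# Hypothetical miniature versions of the population and abbreviation tables
pop_toy = pd.DataFrame({
    "state/region": ["AL", "AK", "PR", "USA"],
    "population": [4.8e6, 0.7e6, 3.7e6, 3.1e8],
})
abbrevs_toy = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "abbreviation": ["AL", "AK"],
})

# indicator=True adds a "_merge" column saying where each row came from:
# "both", "left_only" or "right_only"
checked = pd.merge(pop_toy, abbrevs_toy, how="outer",
                   left_on="state/region", right_on="abbreviation",
                   indicator=True)

# Keys present in the population table but missing from the abbreviations
unmatched = checked.loc[checked["_merge"] == "left_only", "state/region"]
print(sorted(unmatched))
```

This points directly at the unmatched keys, whereas a null check can also flag values that were already missing before the merge.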

In [13]:
final = pd.merge(merged, areas, on='state', how='left')
final.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
0,AL,under18,2012,1117489.0,Alabama,52423.0
1,AL,total,2012,4817528.0,Alabama,52423.0
2,AL,under18,2010,1130966.0,Alabama,52423.0
3,AL,total,2010,4785570.0,Alabama,52423.0
4,AL,under18,2011,1125763.0,Alabama,52423.0


In [14]:
# Checking for null values to see if there are any mismatches 
final.isnull().any()

state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool

In [15]:
# Which regions were ignored 
final.state[final['area (sq. mi)'].isnull()].unique()

array(['United States'], dtype=object)

Our data does not contain the area of the United States as a whole. We could insert that value as the sum of all state areas, but in this case we'll just drop the rows with null values, which also removes the rows with a missing population.
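
For reference, filling in the national area instead of dropping it might look like the sketch below. It runs on hypothetical two-state toy frames (`final_toy` and `areas_toy`), and it assumes that a plain sum over the area table is the figure you want; with the real file you would first decide whether entries such as Puerto Rico belong in that total.

```python
import pandas as pd

# Hypothetical miniature of the merged table; area is missing for the USA row
final_toy = pd.DataFrame({
    "state": ["Alabama", "Alaska", "United States"],
    "area (sq. mi)": [52423.0, 656425.0, float("nan")],
})
areas_toy = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "area (sq. mi)": [52423, 656425],
})

# Sum the per-state areas and assign the total to the rows that lack one
total_area = areas_toy["area (sq. mi)"].sum()
is_usa = final_toy["state"] == "United States"
final_toy.loc[is_usa, "area (sq. mi)"] = total_area
print(final_toy)
```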

In [16]:
final.dropna(inplace=True)

Now we select the data needed to answer our question: rank US states and territories by their 2008 population density. We first filter to the 2008 rows for the total population, then index the result by `state`, compute the population density, and sort it in descending order.

In [17]:
data2008 = final.query("year == 2008 & ages == 'total'")
data2008.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
12,AL,total,2008,4718206.0,Alabama,52423.0
84,AK,total,2008,687455.0,Alaska,656425.0
108,AZ,total,2008,6280362.0,Arizona,114006.0
178,AR,total,2008,2874554.0,Arkansas,53182.0
204,CA,total,2008,36604337.0,California,163707.0


In [18]:
# Index by state; reassigning avoids modifying a filtered copy of `final` in place
data2008 = data2008.set_index('state')
density = data2008['population'] / data2008['area (sq. mi)']

density = density.sort_values(ascending=False)
density.head()

state
District of Columbia    8532.882353
Puerto Rico             1069.947653
New Jersey               998.749140
Rhode Island             682.849838
Connecticut              639.534452
dtype: float64

The result shows the ranking of US states, plus Washington, D.C. and Puerto Rico, in order of their 2008 population density. Let's check out the end of the list.

In [19]:
density.tail()

state
South Dakota    10.361951
North Dakota     9.300308
Montana          6.640201
Wyoming          5.582234
Alaska           1.047271
dtype: float64

Shout out to Jake VanderPlas for writing the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do)
