# Pandas 2:  Cleaning and organizing data 

We describe features of Pandas that allow us to clean and organize data, part of its powerful data management toolset.  Topics include:  

* [String methods](#strings).  Fixing string variables.  
* [Missing values](#missing).  Marking, dropping, counting.    
* [Selecting variables and observations](#selection).  
* [Boolean selection](#boolean).  Choosing observations from conditions.  

<!--
* [Indexing](#index).  Setting and resetting the index.  Multi-indexes. 
* [Switching rows and columns](#pivot).  Transpose.  Pivot and melt.  
* [Stack and unstack](#stack).  Managing column structure and labels.  Melt.  
--> 

More topics coming.  

**Note: requires internet access to run.**  

<!-- 
internal links http://sebastianraschka.com/Articles/2014_ipython_internal_links.html
-->

This IPython notebook was created by Dave Backus for the NYU Stern course [Data Bootcamp](http://databootcamp.nyuecon.com/).  

<a id=prelims></a>

## Preliminaries 

In [2]:
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas 

%matplotlib inline                     

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())

Python version: 3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.17.1
Today:  2016-04-02


<a id=wants></a>

## The want operator 

We need to know what we're looking for.  Some problems we've run across that ask to be solved:

* Numerical data is contaminated by commas (marking thousands) or dollar signs.  
* Row and column labels are contaminated.  
* Missing values are marked erratically.  
* Variables run across rows rather than down columns.  

We'll solve these problems and more.  

### Example:  Chipotle data 

This data comes from a [New York Times story](http://www.nytimes.com/interactive/2015/02/17/upshot/what-do-people-actually-order-at-chipotle.html) about the number of calories in a typical order at Chipotle.  The topic doesn't particularly excite us, but the data raises a number of issues that come up repeatedly. We adapt some code written by [Daniel Forsyth](http://www.danielforsyth.me/pandas-burritos-analyzing-chipotle-order-data-2/). 

In [3]:
url = 'https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv'
chp = pd.read_csv(url, sep='\t')   # tab (\t) delimited 
print('Variable dtypes:\n', chp.dtypes, sep='')
chp.head()

Variable dtypes:
order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


**Comment.** Note that the variable `item_price` has dtype object. The reason is evidently the dollar sign.  We'd prefer to have it as a number, specifically a float.  

### Example:  Data Bootcamp entry poll

This is the [poll](https://docs.google.com/forms/d/1N7ugrqIQNHm_e1BLVAtaRMnC8SmSiTDMoYBy_0FYZic/viewform) we did at the start of the course.  Responses were collected in a Google spreadsheet, which we converted to a csv and uploaded to our website.  

In [4]:
import pandas as pd
url1 = 'http://pages.stern.nyu.edu/~dbackus/Data/'
url2 = 'Data-Bootcamp-entry-poll_s16.csv'
url = url1 + url2 
ep = pd.read_csv(url, header=0) 
print('Dimensions:', ep.shape)
print('\nData types:\n', ep.dtypes, sep='')
ep.head(2) 

Dimensions: (104, 11)

Data types:
Timestamp                                                                                               object
What program are you enrolled in?                                                                       object
What career path most interests you?                                                                    object
How much programming experience have you had?                                                           object
How much experience with probability and statistics have you had?                                       object
Do you use social media for information purposes?  Check all that apply.                                object
Other comments (optional)                                                                              float64
What is your expected major or concentration?                                                           object
What kinds of data most interest you?  List any that cross your mind.        

Unnamed: 0,Timestamp,What program are you enrolled in?,What career path most interests you?,How much programming experience have you had?,How much experience with probability and statistics have you had?,Do you use social media for information purposes? Check all that apply.,Other comments (optional),What is your expected major or concentration?,What kinds of data most interest you? List any that cross your mind.,Why have you enrolled in this course?,If we have time -- and we may not -- what special topics would interest you? Check all that apply.
0,1/3/2016 4:43,Wharton UGR,Entrepreneurship,,,Twitter,,Analytics or other quant,People Analytics,I lost my mind for a minute,Natural language processing
1,1/3/2016 9:36,MBA,"Technology (Google, Amazon, etc)",I have taken one programming course,I have taken one probability or statistics course,Twitter,,Strategy & Finance,,To help with my career,Natural language processing


**Comments.** This is mostly text data, which means it's assigned the dtype object.  Which is fine.  But the variable names and values are excessively verbose.  It would be easier to work with if we came up with shorter ones.  

### Example:  OECD healthcare statistics 

The OECD collects [healthcare data](http://www.oecd.org/els/health-systems/health-data.htm) on lots of (mostly rich) countries, which is helpful in producing comparisons.  Here we use a [spreadsheet](http://www.oecd.org/health/health-systems/OECD-Health-Statistics-2015-Frequently-Requested-Data.xls) linked in one of their documents. 

In [5]:
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2015-Frequently-Requested-Data.xls'
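docs = pd.read_excel(url1+url2, 
                     skiprows=3, 
                     usecols=[0, 51, 52, 53, 54, 55, 57],
                     sheetname='Physicians',   # called sheet_name in more recent versions of Pandas
                     skip_footer=21)           # called skipfooter in more recent versions of Pandas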
print('Dimensions:', docs.shape)
print('\nIndex', docs.index.tolist(), sep='')
print('\nVariable dtypes:\n', docs.dtypes.tail(8), sep='')
docs.head()

Dimensions: (34, 7)

Index[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]

Variable dtypes:
Unnamed: 0                 object
2010                       object
2011                       object
2012                      float64
2013                       object
2014                       object
2013 (or nearest year)    float64
dtype: object


Unnamed: 0.1,Unnamed: 0,2010,2011,2012,2013,2014,2013 (or nearest year)
0,Australia 1,..,3.31,3.31,3.39,..,3.39
1,Austria 1,4.8,4.84,4.9,4.99,..,4.99
2,Belgium 1,2.92,2.91,2.93,2.95,..,2.95
3,Canada 2,2.37,2.44,2.5,2.55,..,2.55
4,Chile 3,1.43,1.58,1.74,1.88,2.02,1.88


**Comments.**  Here we have a couple of issues.  One is that the first column includes a space and a number:  `Australia 1`, `Chile 3`, etc.  We care about this because when we plot the data across countries, the country labels are going to be country names, so we want them in a better form than this.  A second issue is the `..`'s in the sheet, which lead Pandas to label any column that includes them as dtype object.  Here we want to label them as missing values.  

One more thing before we proceed:  change the name of the country variable. 

In [6]:
names = list(docs)
docs = docs.rename(columns={names[0]: 'Country'})
docs.head()

Unnamed: 0,Country,2010,2011,2012,2013,2014,2013 (or nearest year)
0,Australia 1,..,3.31,3.31,3.39,..,3.39
1,Austria 1,4.8,4.84,4.9,4.99,..,4.99
2,Belgium 1,2.92,2.91,2.93,2.95,..,2.95
3,Canada 2,2.37,2.44,2.5,2.55,..,2.55
4,Chile 3,1.43,1.58,1.74,1.88,2.02,1.88


### Example:  World Economic Outlook 

The IMF's [World Economic Outlook database](https://www.imf.org/external/ns/cs.aspx?id=28) contains a broad range of macroeconomic data for a large number of countries.  It's updated twice a year and is a go-to source for things like current account balances (roughly, the trade balance) and government debt and deficits.  It also has a few quirks, as we'll see.  

In [7]:
url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/02/weodata/'
url2 = 'WEOOct2015all.xls'
url = url1 + url2 
weo = pd.read_csv(url, sep='\t', usecols=[1,2,3,4,6,40,41,42,43,44], thousands=',', na_values=['n/a', '--']) 

print('Variable dtypes:\n', weo.dtypes, sep='')
weo.head()

Variable dtypes:
ISO                    object
WEO Subject Code       object
Country                object
Subject Descriptor     object
Units                  object
2011                  float64
2012                  float64
2013                  float64
2014                  float64
2015                  float64
dtype: object


Unnamed: 0,ISO,WEO Subject Code,Country,Subject Descriptor,Units,2011,2012,2013,2014,2015
0,AFG,NGDP_R,Afghanistan,"Gross domestic product, constant prices",National currency,386.368,440.336,457.654,463.52,472.981
1,AFG,NGDP_RPCH,Afghanistan,"Gross domestic product, constant prices",Percent change,6.479,13.968,3.933,1.282,2.041
2,AFG,NGDP,Afghanistan,"Gross domestic product, current prices",National currency,836.222,1033.591,1116.827,1173.182,1172.361
3,AFG,NGDPD,Afghanistan,"Gross domestic product, current prices",U.S. dollars,17.89,20.296,20.17,20.444,19.681
4,AFG,NGDP_D,Afghanistan,"Gross domestic product, deflator",Index,216.432,234.728,244.033,253.103,247.866


**Comment.** This has several issues:  

* The variables run across rows with observations labeled by year across the top (1980, 1981, etc in the file; we read in 2011 to 2015).  
* Read naively, the data columns have dtype object.  A little work suggests that this is because they include commas marking thousands.   
* The entries labeled `n/a` need to be marked as missing values.

We solve the last two in the `read_csv` function with the arguments `thousands=','` and `na_values=['n/a', '--']`, which is why the data columns above show up as floats.  The other one takes some work.  
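To see what the `thousands` and `na_values` arguments do, here is a minimal sketch on a made-up tab-delimited snippet.  The data and the names `raw`, `naive`, and `clean` are ours, not part of the IMF file.  Without the arguments the year columns come in as text; with them they should come in as floats.  

```python
import io
import pandas as pd

# made-up tab-delimited data: commas mark thousands, '--' marks a missing value
raw = "Country\t2014\t2015\nA\t1,234.5\t2,000.0\nB\t--\t15.0\n"

# naive read: the year columns are read as text, not numbers
naive = pd.read_csv(io.StringIO(raw), sep='\t')
print(naive.dtypes, '\n')

# tell read_csv about the thousands separator and the missing-value markers
clean = pd.read_csv(io.StringIO(raw), sep='\t',
                    thousands=',', na_values=['n/a', '--'])
print(clean.dtypes)
print(clean)
```

Handling these things when we read the data saves us from cleaning the columns one at a time afterwards.  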

<a id='strings'></a>

## String methods 

We can treat variables as strings in Pandas, just as we dealt with strings in core Python.  

Recall how we converted the string 

```python
dollars = '$123.45'
```

to a float.  We replaced the dollar sign with a blank and converted it to a float:

```python
num = dollars.replace('$', '')
num = float(num)
print('Type of variable num:', type(num))
```
Run the code below to verify. 

Here we do the same thing with "string methods" in Pandas.  We append `.str` to a variable and then apply the string method of our choice.  Type conversions work through the `astype` method.  

In [8]:
dollars = '$123.45'
num = dollars.replace('$', '')
num = float(num)
print('Type of variable num:', type(num))

Type of variable num: <class 'float'>


**Example.**  Here we do this to the `item_price` variable in the Chipotle dataframe.  

In [9]:
chpnum = chp.copy()    # copy, so that changes to chpnum leave chp alone 
chpnum['item_price'] = chpnum['item_price'].str.replace('$', '').astype(float)
chpnum.dtypes

order_id                int64
quantity                int64
item_name              object
choice_description     object
item_price            float64
dtype: object
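
The same steps can be tried without an internet connection.  This is a minimal sketch on made-up prices; the names `prices` and `fixed` are ours, and we assume a version of Pandas in which `str.replace` accepts the `regex` argument.  

```python
import pandas as pd

# made-up prices stored as strings, as in the Chipotle data
prices = pd.DataFrame({'item_price': ['$2.39', '$3.39', '$16.98']})

# .copy() gives an independent dataframe; plain assignment would only
# create a second name for the same object
fixed = prices.copy()

# regex=False treats '$' as a literal character rather than a regular expression
fixed['item_price'] = (fixed['item_price']
                       .str.replace('$', '', regex=False)
                       .astype(float))

print(fixed.dtypes)
print('Total:', fixed['item_price'].sum())
```

Once the variable is a float we can do arithmetic with it, which was the point of the exercise.  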

**Example.** Here we strip off the numbers at the end of the country names in the OECD `docs` dataframe. We do this in three steps:

* Play with the `rsplit` method to see how it works.  
* Apply `rsplit` to the example `country = 'United States 1'`.  
* Use a string method to do this to all the entries of the variable `Country`.  

In [10]:
# one at a time
country = 'United States 1'

# get documentation for the rsplit method
country.rsplit?

In [11]:
# an example 
country.rsplit()

['United', 'States', '1']

**Comment.** Not quite:  we only want to split once.  

In [12]:
# what about this?
country.rsplit(maxsplit=1)

['United States', '1']

In [13]:
# one more step, we want the first component of the list
country.rsplit(maxsplit=1)[0]

'United States'

In [14]:
# now do this for the variable Country 
docs['Country'].str.rsplit(n=1).str[0].head() 

0    Australia
1      Austria
2      Belgium
3       Canada
4        Chile
dtype: object

In [15]:
# now assign it back to the variable 
docs['Country'] = docs['Country'].str.rsplit(n=1).str[0]
docs.head()

Unnamed: 0,Country,2010,2011,2012,2013,2014,2013 (or nearest year)
0,Australia,..,3.31,3.31,3.39,..,3.39
1,Austria,4.8,4.84,4.9,4.99,..,4.99
2,Belgium,2.92,2.91,2.93,2.95,..,2.95
3,Canada,2.37,2.44,2.5,2.55,..,2.55
4,Chile,1.43,1.58,1.74,1.88,2.02,1.88


**Comments.** 

* Note that we need two `.str`'s here:  one to do the split, the other to extract the first element.  
* The Pandas string method wants `n=1` rather than `maxsplit=1`.  Pandas names the argument `n`, while the `rsplit` method in core Python names it `maxsplit`.  
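
Here is a minimal sketch of both versions on made-up labels.  The names `labels`, `pieces`, and `names` are ours.  

```python
import pandas as pd

# made-up labels in the style of the OECD sheet
labels = pd.Series(['Australia 1', 'United States 1', 'Chile 3'])

# core Python: the argument is called maxsplit
print('United States 1'.rsplit(maxsplit=1)[0])

# Pandas string method: the argument is called n
pieces = labels.str.rsplit(n=1)      # each entry becomes a list, e.g. ['Chile', '3']
names = pieces.str[0]                # the second .str pulls out the first list element
print(names.tolist())
```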

**Exercise.** ??

<a id='missing'></a>

## Missing values 

It's important to label missing values, so that Pandas doesn't interpret entries as strings.  Pandas is also smart enough to ignore things labeled missing when it does calculations or graphs.  If we compute, for example, the mean of a variable, the default is to ignore missing values.  

We've seen that we can label certain entries as missing values in read statements:  `read_csv`, `read_excel`, and so on.  Here we do it directly.  

### Marking missing values

**Example.** The `docs` dataframe contains a number of instances of `..` (double period).  How can we mark them as missing values?  

In [16]:
docs.head()

Unnamed: 0,Country,2010,2011,2012,2013,2014,2013 (or nearest year)
0,Australia,..,3.31,3.31,3.39,..,3.39
1,Austria,4.8,4.84,4.9,4.99,..,4.99
2,Belgium,2.92,2.91,2.93,2.95,..,2.95
3,Canada,2.37,2.44,2.5,2.55,..,2.55
4,Chile,1.43,1.58,1.74,1.88,2.02,1.88


**What to do.**  We use the replace method on the whole dataframe.  To mark something as missing, we replace it with `None`, which Pandas interprets as missing and labels `NaN`.  

In [17]:
docs.replace(to_replace=['..'], value=[None]).head()

Unnamed: 0,Country,2010,2011,2012,2013,2014,2013 (or nearest year)
0,Australia,,3.31,3.31,3.39,,3.39
1,Austria,4.8,4.84,4.9,4.99,,4.99
2,Belgium,2.92,2.91,2.93,2.95,,2.95
3,Canada,2.37,2.44,2.5,2.55,,2.55
4,Chile,1.43,1.58,1.74,1.88,2.02,1.88


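**Comment.**  In the version of Pandas we're running, replace automatically updates the dtypes.  Here the double dots led Pandas to label the variables as objects.  After the replace, they're now floats, as they should be.  Other versions may leave the dtype as object, in which case we need to convert the columns explicitly.  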

In [18]:
docs.dtypes.head()

Country     object
2010        object
2011        object
2012       float64
2013        object
dtype: object

In [19]:
docsna = docs.replace(to_replace=['..'], value=[None])
docsna.dtypes.head()

Country     object
2010       float64
2011       float64
2012       float64
2013       float64
dtype: object

**Comment.**  Some people prefer to use the numpy `nan`.  Here's an example.  The only advantage is that we avoid possible conflicts with other uses of the value `None`.  

In [37]:
docs.replace(to_replace=['..'], value=[np.nan]).head()

Unnamed: 0,Country,2010,2011,2012,2013,2014,2013 (or nearest year)
0,Australia,,3.31,3.31,3.39,,3.39
1,Austria,4.8,4.84,4.9,4.99,,4.99
2,Belgium,2.92,2.91,2.93,2.95,,2.95
3,Canada,2.37,2.44,2.5,2.55,,2.55
4,Chile,1.43,1.58,1.74,1.88,2.02,1.88
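
Here is a self-contained sketch of the `nan` approach on made-up data.  The names `toy`, `toyna`, and `years` are ours.  Because the dtype you get after `replace` may depend on the version of Pandas, the sketch converts the data columns explicitly with `pd.to_numeric` rather than relying on automatic conversion.  

```python
import numpy as np
import pandas as pd

# made-up data in the style of the OECD sheet, with numbers stored as text
toy = pd.DataFrame({'Country': ['Australia', 'Austria', 'Chile'],
                    '2010': ['..', '4.8', '1.43'],
                    '2014': ['..', '..', '2.02']})

# step 1: mark the double dots as missing
toyna = toy.replace(to_replace='..', value=np.nan)

# step 2: convert the data columns to numbers explicitly
years = ['2010', '2014']
toyna[years] = toyna[years].apply(pd.to_numeric)

print(toyna.dtypes)
print(toyna)
```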


**Comment.** Unlike the string methods we described earlier, this use of replace affects complete entries, not elements of string entries.  For example, suppose we tried to replace the periods in decimal numbers with an asterisk.  We could try the following, but it doesn't work:  the decimal numbers  don't change.  

In [39]:
docs.replace(to_replace=['.'], value=['*']).head()

Unnamed: 0,Country,2010,2011,2012,2013,2014,2013 (or nearest year)
0,Australia,..,3.31,3.31,3.39,..,3.39
1,Austria,4.8,4.84,4.9,4.99,..,4.99
2,Belgium,2.92,2.91,2.93,2.95,..,2.95
3,Canada,2.37,2.44,2.5,2.55,..,2.55
4,Chile,1.43,1.58,1.74,1.88,2.02,1.88
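
The difference between the two kinds of replace is easier to see side by side.  This is a minimal sketch on a made-up series; the names `s`, `whole`, and `inside` are ours, and we assume a version of Pandas in which `str.replace` accepts the `regex` argument.  

```python
import pandas as pd

# made-up entries stored as text
s = pd.Series(['3.31', '..', '4.8', '.'])

# replace method: matches complete entries only, so just the last entry changes
whole = s.replace(to_replace='.', value='*')

# string method: matches characters inside each entry
# (regex=False treats the period as a literal character)
inside = s.str.replace('.', '*', regex=False)

print(whole.tolist())
print(inside.tolist())
```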


### Working with missing values 

In [20]:
# grab a variable to play with
var = docsna[2013].head(10)
var

0    3.39
1    4.99
2    2.95
3    2.55
4    1.88
5    3.69
6     NaN
7    3.28
8    3.02
9    3.33
Name: 2013, dtype: float64

In [21]:
# which ones are missing ("null")?
var.isnull()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
9    False
Name: 2013, dtype: bool

In [22]:
# which ones are not missing ("not null")?
var.notnull()

0     True
1     True
2     True
3     True
4     True
5     True
6    False
7     True
8     True
9     True
Name: 2013, dtype: bool

In [23]:
# drop the missing 
var.dropna()

0    3.39
1    4.99
2    2.95
3    2.55
4    1.88
5    3.69
7    3.28
8    3.02
9    3.33
Name: 2013, dtype: float64

**Comment.** We usually don't have to worry about this:  Pandas takes care of missing values automatically.  
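
To count missing values, and to see what "automatically" means for a calculation like the mean, here is a minimal sketch on a made-up series.  The variable names are ours.  

```python
import numpy as np
import pandas as pd

# made-up series with one missing value
x = pd.Series([3.39, 4.99, np.nan, 2.55])

n_missing = x.isnull().sum()          # True counts as 1, so the sum is the number missing
n_valid = x.count()                   # count() skips missing values
mean_default = x.mean()               # missing values are ignored by default
mean_strict = x.mean(skipna=False)    # insist on including them:  the result is NaN

print('Missing:', n_missing, ' Valid:', n_valid)
print('Mean ignoring missing:', mean_default)
print('Mean with skipna=False:', mean_strict)
```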

<a id='selection'></a>

## Selecting variables and observations 

The word **selection** refers to choosing a subset of variables or observations using their labels or index.  There are lots of ways to do this.  Mostly we do "Boolean" selection, which we address in the next section.  We review more direct options here, mostly at high speed because they're not things we use much.  

In the outline below, `df` is a dataframe, `var` and `varn` are variable names, `vlist = ['var1', 'var2']` is a list of variable names, `nlist = [0, 3, 4]` is a list of numerical variable or observation indexes, and `n1` and `n2` are integers.  Some of the basic selection/indexing/slicing methods have the form:  

* `df[var]` extracts a variable -- a series, in other words.
* `df[var][3]` extracts observation 3 (starting at zero) from the series `df[var]`. 
* `df[vlist]` extracts a new dataframe consisting of the variables in `vlist`.  
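* `df[nlist]` does the same thing, with the variables identified by number rather than name.  (Recent versions of Pandas may require `df.iloc[:, nlist]` for this.)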
* `df[n1:n2]` extracts observations `n1` to `n2-1`, the traditional slicing syntax.  

We find the last one confusing:  it extracts rows, not columns.  Pandas guru Wes McKinney notes: "this might seem inconsistent to some readers."  Yup!  We don't do it much, partly for that reason.  

<!-- page 127 top -->

The Pandas docs push the `loc` and `iloc` methods.  We'll ignore them -- we don't use them much -- but if you're interested, see [the docs](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing).    
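
For the curious, here is a minimal sketch that puts the basic forms next to `loc` and `iloc`.  The dataframe and the variable names are made up for illustration.  

```python
import pandas as pd

# made-up dataframe in the style of the WEO data
df = pd.DataFrame({'ISO': ['AFG', 'ALB', 'DZA', 'AGO'],
                   'Units': ['Index', 'Percent', 'Index', 'Percent'],
                   '2011': [216.4, 2.5, 103.2, 13.5]})

one_var = df['ISO']               # a single variable (a series)
two_vars = df[['ISO', '2011']]    # a list of names gives a dataframe
rows = df[1:3]                    # slicing picks rows 1 and 2

# iloc selects by integer position:  all rows, columns 0 and 2
by_position = df.iloc[:, [0, 2]]

# loc selects by label; note that label slices include the endpoint
by_label = df.loc[1:2, ['ISO', 'Units']]

print(rows)
print(by_position)
print(by_label)
```

Note that `df[1:3]` and `df.loc[1:2]` pick the same rows here:  the first slices by position and stops before 3, the second slices by label and includes 2.  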

In [24]:
# we create a small dataframe to experiment with 
small = weo.head()
small

Unnamed: 0,ISO,WEO Subject Code,Country,Subject Descriptor,Units,2011,2012,2013,2014,2015
0,AFG,NGDP_R,Afghanistan,"Gross domestic product, constant prices",National currency,386.368,440.336,457.654,463.52,472.981
1,AFG,NGDP_RPCH,Afghanistan,"Gross domestic product, constant prices",Percent change,6.479,13.968,3.933,1.282,2.041
2,AFG,NGDP,Afghanistan,"Gross domestic product, current prices",National currency,836.222,1033.591,1116.827,1173.182,1172.361
3,AFG,NGDPD,Afghanistan,"Gross domestic product, current prices",U.S. dollars,17.89,20.296,20.17,20.444,19.681
4,AFG,NGDP_D,Afghanistan,"Gross domestic product, deflator",Index,216.432,234.728,244.033,253.103,247.866


**Exercise.** Try each of these in a different cell and explain what they do:  
    
* `small[['ISO', 'Units']]`
* `small[[0, 4]]`
* `small['2011']`
* `small['2011'][3]`
* `small[1:3]`

<a id='boolean'></a>


## Boolean selection  

This is mostly what we do:  we choose observations that satisfy one or more conditions.  

### Our want operator

Our *want* here is to take the `weo` dataframe and extract government debt and deficits for a given set of countries.  

Here's the head of the dataframe to give us a sense of what we're dealing with.  

In [25]:
weo.head()

Unnamed: 0,ISO,WEO Subject Code,Country,Subject Descriptor,Units,2011,2012,2013,2014,2015
0,AFG,NGDP_R,Afghanistan,"Gross domestic product, constant prices",National currency,386.368,440.336,457.654,463.52,472.981
1,AFG,NGDP_RPCH,Afghanistan,"Gross domestic product, constant prices",Percent change,6.479,13.968,3.933,1.282,2.041
2,AFG,NGDP,Afghanistan,"Gross domestic product, current prices",National currency,836.222,1033.591,1116.827,1173.182,1172.361
3,AFG,NGDPD,Afghanistan,"Gross domestic product, current prices",U.S. dollars,17.89,20.296,20.17,20.444,19.681
4,AFG,NGDP_D,Afghanistan,"Gross domestic product, deflator",Index,216.432,234.728,244.033,253.103,247.866


**List variables and countries.** We start by listing the variables and countries in the dataframe:  the menu of choices we have at hand.  

Note, too, the use of the `drop_duplicates` method, which does what it sounds like.    

In [26]:
variable_list = weo[['WEO Subject Code', 'Subject Descriptor', 'Units']].drop_duplicates()
print('Number of variables: ', variable_list.shape[0])
variable_list.head()

Number of variables:  45


Unnamed: 0,WEO Subject Code,Subject Descriptor,Units
0,NGDP_R,"Gross domestic product, constant prices",National currency
1,NGDP_RPCH,"Gross domestic product, constant prices",Percent change
2,NGDP,"Gross domestic product, current prices",National currency
3,NGDPD,"Gross domestic product, current prices",U.S. dollars
4,NGDP_D,"Gross domestic product, deflator",Index
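
To see `drop_duplicates` at work on something small, here is a minimal sketch with made-up data in the same long format, where each country shows up once for every variable.  The names `toy`, `all_rows`, and `unique` are ours.  

```python
import pandas as pd

# made-up long-format data:  each country repeats once per variable
toy = pd.DataFrame({'ISO': ['ARG', 'ARG', 'GRC', 'GRC', 'GRC'],
                    'Country': ['Argentina', 'Argentina', 'Greece', 'Greece', 'Greece'],
                    'Code': ['NGDP', 'GGXWDG_NGDP', 'NGDP', 'GGXWDG_NGDP', 'NGDP_D']})

all_rows = toy[['ISO', 'Country']]       # one row per observation, with repeats
unique = all_rows.drop_duplicates()      # keeps the first copy of each distinct row

print('Rows before:', all_rows.shape[0])
print('Rows after: ', unique.shape[0])
print(unique)
```

The result keeps the index of the first copy of each row, which is why the index in the output jumps.  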


In [27]:
country_list = weo[['ISO', 'Country']].drop_duplicates()
print('Number of countries: ', country_list.shape[0])
country_list.head()

Number of countries:  190


Unnamed: 0,ISO,Country
0,AFG,Afghanistan
44,ALB,Albania
88,DZA,Algeria
132,AGO,Angola
176,ATG,Antigua and Barbuda


**Exercise.** 

* Construct a complete list of countries with `countries = weo[['ISO', 'Country']]`; that is, without applying the `drop_duplicates` method.  How large is it?  How many duplicates have we dropped?  
* What are the country codes for Argentina, Germany, and Greece?  
* What are the variable codes for government debt (gross debt, percent of GDP) and net lending/borrowing (also percent of GDP)?  

### Comparisons 

We can construct comparisons for dataframe columns much as we did with simple variables.  The difference is that we get a complete column of True/False responses, not just one.  

Multiple comparisons have a different syntax than we saw earlier.  `and` is replaced by `&`, and `or` is replaced by `|`.  And when we have more than one comparison, we need to enclose each of them in parentheses.  

Here's an example.  
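
This is a minimal sketch on a made-up dataframe; the names `toy`, `is_nc`, `is_big`, `both`, and `either` are ours.  

```python
import pandas as pd

# made-up dataframe with one text variable and one numerical variable
toy = pd.DataFrame({'Units': ['National currency', 'Percent change',
                              'National currency', 'U.S. dollars'],
                    '2011': [386.4, 6.5, 36.2, 17.9]})

is_nc = toy['Units'] == 'National currency'     # one True/False per row
is_big = toy['2011'] >= 100

# each comparison goes in its own parentheses
both = (toy['Units'] == 'National currency') & (toy['2011'] >= 100)     # "and"
either = (toy['Units'] == 'National currency') | (toy['2011'] >= 100)   # "or"

print(both.tolist())
print(either.tolist())
```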

**Exercise.**  Compute and explain the comparisons:  

* `small['Units'] == 'National currency'`
* `small['2011'] >= 100`
* `(small['Units'] == 'National currency') & (small['2011'] >= 100)`
* `(small['Units'] == 'National currency') | (small['2011'] >= 100)`

### Boolean selection

Boolean selection simply chooses those observations for which a condition is `True`.  Some people refer to this as **filtering**.

**Example.**  We choose observations for which the units are `'National currency'`.  We do this first in two steps, then in one.  

In [28]:
ncunits = small['Units'] == 'National currency'
small[ncunits]

Unnamed: 0,ISO,WEO Subject Code,Country,Subject Descriptor,Units,2011,2012,2013,2014,2015
0,AFG,NGDP_R,Afghanistan,"Gross domestic product, constant prices",National currency,386.368,440.336,457.654,463.52,472.981
2,AFG,NGDP,Afghanistan,"Gross domestic product, current prices",National currency,836.222,1033.591,1116.827,1173.182,1172.361


In [29]:
small[small['Units'] == 'National currency']

Unnamed: 0,ISO,WEO Subject Code,Country,Subject Descriptor,Units,2011,2012,2013,2014,2015
0,AFG,NGDP_R,Afghanistan,"Gross domestic product, constant prices",National currency,386.368,440.336,457.654,463.52,472.981
2,AFG,NGDP,Afghanistan,"Gross domestic product, current prices",National currency,836.222,1033.591,1116.827,1173.182,1172.361


**Exercise.** Construct dataframes for which 

* `small['Units']` does **not** equal `'National currency'`.
* `small['Units']` equals `'National currency'` and `small['2011']` is greater than 100.  

### The `isin` method

Pay attention now, this is useful.  Suppose we want to extract the data for which `weo['ISO'] == 'ARG'` (Argentina) or `weo['ISO'] == 'GRC'` (Greece).  We could do that by combining the comparisons:  

```python
(weo['ISO'] == 'ARG') | (weo['ISO'] == 'GRC')
```
Remind yourself that `|` stands for "or."  

A simpler approach is to apply the `isin` method to a variable.  This selects the observations for which `weo['ISO']` equals any element in a list, so one call replaces a string of `|` comparisons.  We can also combine `isin` with other comparisons.  
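
Here is a minimal sketch, on a made-up dataframe, that checks that the two approaches give the same answer.  The names `toy`, `by_or`, and `by_isin` are ours.  

```python
import pandas as pd

# made-up dataframe with country codes
toy = pd.DataFrame({'ISO': ['ARG', 'DEU', 'GRC', 'USA', 'ARG'],
                    'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

by_or = (toy['ISO'] == 'ARG') | (toy['ISO'] == 'GRC')    # combined comparisons
by_isin = toy['ISO'].isin(['ARG', 'GRC'])                # same thing with isin

print(by_isin.tolist())
print('Same result:', (by_or == by_isin).all())
print(toy[by_isin])
```

With two countries the savings are small, but with ten the `isin` version is a lot easier to read.  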

**Example.**  Let's apply the same logic to variable codes.  If we want to extract the observations with codes 
```
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
```

we would use 

In [30]:
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
weo['WEO Subject Code'].isin(vlist).head(45)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32     True
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40     True
41    False
42    False
43    False
44    False
Name: WEO Subject Code, dtype: bool

**Comment.** We're choosing 2 variables from 45, so there are lots of Falses.

In [31]:
# this time let's use the result for selection 
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
weo[weo['WEO Subject Code'].isin(vlist)].head(6)

Unnamed: 0,ISO,WEO Subject Code,Country,Subject Descriptor,Units,2011,2012,2013,2014,2015
32,AFG,GGXCNL_NGDP,Afghanistan,General government net lending/borrowing,Percent of GDP,-0.646,0.182,-0.629,-1.733,-0.251
40,AFG,GGXWDG_NGDP,Afghanistan,General government gross debt,Percent of GDP,,,,,
76,ALB,GGXCNL_NGDP,Albania,General government net lending/borrowing,Percent of GDP,-3.516,-3.434,-5.159,-5.414,-5.074
84,ALB,GGXWDG_NGDP,Albania,General government gross debt,Percent of GDP,59.413,62.02,70.091,72.522,73.305
120,DZA,GGXCNL_NGDP,Algeria,General government net lending/borrowing,Percent of GDP,-0.436,-4.089,-0.4,-7.331,-13.682
128,DZA,GGXWDG_NGDP,Algeria,General government gross debt,Percent of GDP,9.942,9.944,8.278,8.788,10.188


**Comment.** We can do the same thing with countries.  If we want to choose two variables and three countries, the code looks like:  

In [32]:
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']
weo_sub = weo[weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries)]
weo_sub

Unnamed: 0,ISO,WEO Subject Code,Country,Subject Descriptor,Units,2011,2012,2013,2014,2015
252,ARG,GGXCNL_NGDP,Argentina,General government net lending/borrowing,Percent of GDP,-1.949,-2.359,-1.997,-2.689,-4.852
260,ARG,GGXWDG_NGDP,Argentina,General government gross debt,Percent of GDP,35.763,37.328,40.217,45.282,52.137
2716,DEU,GGXCNL_NGDP,Germany,General government net lending/borrowing,Percent of GDP,-0.861,0.091,0.133,0.306,0.511
2724,DEU,GGXWDG_NGDP,Germany,General government gross debt,Percent of GDP,77.928,79.311,76.978,74.621,70.746
2804,GRC,GGXCNL_NGDP,Greece,General government net lending/borrowing,Percent of GDP,-10.215,-6.406,-2.944,-3.914,-4.171
2812,GRC,GGXWDG_NGDP,Greece,General government gross debt,Percent of GDP,170.96,156.494,174.951,177.054,196.95


**Comments.**

* We've now done what we described when we applied the want operator. 
* This is a go-to method.  Circle it for later reference.  

**Exercise.** Use the `isin` method to extract *Gross domestic product in US dollars* for China, India, and the United States.  Assign the result to the dataframe `gdp`.   

**Exercise (challenging).** Plot the variable `gdp['2015']` as a bar chart.  What would you say it needs?  

## Review

**Exercise.** We explore the Census's [Business Dynamics Statistics](http://www.census.gov/ces/dataproducts/bds/), a huge collection of data about firms. We've extracted a small piece of one of their databases that includes these variables for 2013:

* Size: size category of firms based on number of employees
* Firms: number of firms in this size category
* Emp: number of employees in this size category

Run the code cell below to load the data and use the result to answer these questions:  

* Compute the mean number of firms with `nbar = bds['Firms'].mean()`. 
* Generate the new variable `bds['AboveMean']` that is `True` if the value of `bds['Firms']` is above the mean, `False` otherwise. 
* What dtype is this new variable?  
* Select the observations for which the number of firms is above the mean.  
* *Challenging.*  Fix the size categories.  Specifically, use a string method to eliminate the prefixes `a)`, `b)`, etc.  That is, change `a) 1 to 4` to `1 to 4`, `b) 5 to 9` to `5 to 9`, and so on.  *Hint:* Use the `split` method.  

In [35]:
data = {'Size': ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49', 'e) 50 to 99',
                 'f) 100 to 249', 'g) 250 to 499', 'h) 500 to 999', 'i) 1000 to 2499',
                 'j) 2500 to 4999', 'k) 5000 to 9999', 'l) 10000+'], 
        'Firms': [2846416, 1020772, 598153, 373345, 115544, 63845,
                  19389, 9588, 6088, 2287, 1250, 1357], 
        'Emp': [5998912, 6714924, 8151891, 11425545, 8055535, 9788341, 
                6611734, 6340775, 8321486, 6738218, 6559020, 32556671]}
bds = pd.DataFrame(data) 
bds.head()

Unnamed: 0,Emp,Firms,Size
0,5998912,2846416,a) 1 to 4
1,6714924,1020772,b) 5 to 9
2,8151891,598153,c) 10 to 19
3,11425545,373345,d) 20 to 49
4,8055535,115544,e) 50 to 99
