# Case Study - Counting pandas

## How to count all the functions, methods and attributes that pandas has to offer?
There are probably multiple intelligent ways to do this, but for this exercise we will start by assuming that the [API reference](http://pandas.pydata.org/pandas-docs/stable/api.html) in the pandas docs contains all the functionality of pandas. Full URL: http://pandas.pydata.org/pandas-docs/stable/api.html

Wow, that's an absurd amount of functionality for one library. Manually counting this might take some time. Let's use pandas to help us out.

In [2]:
import pandas as pd

### Finding pages with html tables

Often it will not be obvious that a web page is built from HTML tables. For example, the pandas API reference page does not appear to contain what you would normally call a 'table'. However, all modern browsers let you view the HTML behind the current page. In Chrome you can right click and choose **Inspect** or **View page source**. If you click Inspect, the browser navigates directly to the HTML for the element you clicked.

While inspecting the HTML, you can use the search function to find HTML tables, which are always written with **`<table>`** elements.

Go ahead and inspect the API page and see if the underlying elements are indeed HTML tables.
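
If you would rather check for tables programmatically than through the browser, here is a minimal sketch using only Python's standard library. The `TableCounter` class and the sample HTML string are made up for illustration; they are not part of pandas or of the API reference page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count the <table> start tags seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.n_tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == "table":
            self.n_tables += 1


# a tiny made-up page, standing in for real page source
sample_html = """
<html><body>
  <p>No table here.</p>
  <table><tr><td>read_pickle(path)</td><td>Load pickled pandas object</td></tr></table>
  <TABLE><tr><td>Series.to_dict()</td><td>Convert Series to dict</td></tr></TABLE>
</body></html>
"""

counter = TableCounter()
counter.feed(sample_html)
print(counter.n_tables)
```

To try this on a real page you would feed it the page source you copied from the browser instead of `sample_html`.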

### `read_html` to scrape tables

pandas has a handy-dandy function, **`read_html`**, which reads all the HTML tables from the given URL. It returns a list of pandas DataFrame objects, one for each table found. Let's use this now to grab every single table on that page.

In [4]:
# grab all html tables from api reference page
api_tables = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html')

In [5]:
# how many tables are there
len(api_tables)

119

In [6]:
# let's look at a few tables
api_tables[0]

Unnamed: 0,0,1
0,read_pickle(path),Load pickled pandas object (or any other pickl...


In [8]:
# take a look at another table
api_tables[44]

Unnamed: 0,0,1
0,"Series.from_csv(path[, sep, parse_dates, ...])","Read CSV file (DISCOURAGED, please use pandas...."
1,Series.to_pickle(path),Pickle (serialize) object to input file path.
2,"Series.to_csv([path, index, sep, na_rep, ...])",Write Series to a comma-separated values (csv)...
3,Series.to_dict(),Convert Series to {label -> value} dict
4,Series.to_frame([name]),Convert Series to DataFrame
5,Series.to_xarray(),Return an xarray object from the pandas object.
6,"Series.to_hdf(path_or_buf, key, \*\*kwargs)",Write the contained data to an HDF5 file using...
7,"Series.to_sql(name, con[, flavor, schema, ...])",Write records stored in a DataFrame to a SQL d...
8,"Series.to_msgpack([path_or_buf, encoding])",msgpack (serialize) object to input file path
9,"Series.to_json([path_or_buf, orient, ...])",Convert the object to a JSON string.


It looks like they are all two-column tables with the attribute in the first column and the description in the second column. Everything looks good. Let's try counting.

In [9]:
count = 0
for table in api_tables:
    count += table.shape[0]
print("There are {} things pandas can do!".format(count))

There are 943 things pandas can do!
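
The same row count can be written more compactly. The sketch below uses two small made-up tables (not the real `api_tables`) to show a generator-based sum and, as an alternative, stacking all tables into one DataFrame with `pd.concat`.

```python
import pandas as pd

# two small made-up tables shaped like the ones read_html returns:
# column 0 holds the signature, column 1 the description
toy_tables = [
    pd.DataFrame({0: ["read_pickle(path)"], 1: ["Load pickled pandas object"]}),
    pd.DataFrame({0: ["Series.to_dict()", "Series.to_frame([name])"],
                  1: ["Convert Series to dict", "Convert Series to DataFrame"]}),
]

# same idea as the loop: add up the number of rows in each table
row_count = sum(len(table) for table in toy_tables)

# alternative: stack everything into one DataFrame and take its length
combined = pd.concat(toy_tables, ignore_index=True)

print(row_count, len(combined))
```

A single combined DataFrame can be convenient later on, since one column then holds every entry and can be searched in one step.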


## How much functionality does the pandas Series have?
As seen above, each entry lists the pandas object followed by its method or attribute in object-oriented notation. If we want to count just the Series functionality, we need to search each table's first column for the word `Series`. pandas Series again come nicely equipped with plenty of [string processing methods](http://pandas.pydata.org/pandas-docs/stable/text.html).

To use these string processing methods, define a pandas Series, type `.str.` after it, and press tab to see all the available methods.
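
If tab completion is not available (for example, outside a notebook), one possible way to see roughly the same list is to inspect the `.str` accessor with Python's built-in `dir`. This is only a sketch; the Series below is made-up example data.

```python
import pandas as pd

s = pd.Series(["Series.to_dict()", "DataFrame.to_dict()"])  # made-up example data

# public names on the .str accessor, roughly what tab completion would show
str_methods = [name for name in dir(s.str) if not name.startswith("_")]

print(len(str_methods))
print(str_methods[:5])
```

The exact number of methods depends on the pandas version installed.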

In [10]:
# Let's use the first column from the above table
s = api_tables[44][0]
s.head()

0    Series.from_csv(path[, sep, parse_dates, ...])
1                            Series.to_pickle(path)
2    Series.to_csv([path, index, sep, na_rep, ...])
3                                  Series.to_dict()
4                           Series.to_frame([name])
Name: 0, dtype: object

In [11]:
# use the str.contains method to see if each item does in fact contain the word 'Series' in it
s.str.contains('Series')

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
Name: 0, dtype: bool

In [12]:
# OK, let's count the appearances of 'Series' across all tables
count_series = 0
for table in api_tables:
    count_series += table[0].str.contains('Series').sum()
print("There are {} things pandas Series can do!".format(count_series))

There are 283 things pandas Series can do!
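
One caveat worth keeping in mind: `str.contains('Series')` matches the word anywhere in the entry, so names that merely include it are counted too. The sketch below, on a few illustrative entries rather than the real tables, compares it with the stricter `str.startswith('Series.')`. Passing `na=False` treats missing cells as non-matches; whether the real tables contain missing cells is an assumption here.

```python
import pandas as pd

# illustrative entries; only the first two are methods of Series itself
names = pd.Series([
    "Series.to_dict()",
    "Series.str.contains(pat)",
    "SeriesGroupBy.nunique([dropna])",
    "DataFrame.from_records(data)",
    None,  # stands in for an empty cell
])

# loose match: 'Series' anywhere in the text
loose = int(names.str.contains("Series", na=False).sum())

# strict match: the entry must begin with 'Series.'
strict = int(names.str.startswith("Series.", na=False).sum())

print(loose, strict)
```

Which of the two counts is the "right" one depends on what you want to treat as Series functionality.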


## Problem 1
<span  style="color:green; font-size:16px"> Writing a new for loop every time we want to count a new word in our dataset is cumbersome. Can you write a function that accepts the parameter **word** and returns the number of times this word appears in the pandas API as a function, method, or attribute? Count a few words with it, like DataFrame or MultiIndex.</span>

In [10]:
# your code here
def count_functionality(word):
    pass  # placeholder so the cell runs; replace with your implementation

## Problem 2
<span  style="color:green; font-size:16px">Define a new function by modifying the above function slightly so that it returns a list of all the matching methods.</span>

In [11]:
# your code here

## Problem 3
<span  style="color:green; font-size:16px">Explore several of the Series `.str` methods that you should now have captured in a list on one of the API reference tables to get </span>

In [12]:
# your code here

## Problem 4
<span  style="color:green; font-size:16px">Let's get some 'live' data.</span>
1. Navigate to [RealClearPolitics Trump vs Clinton](http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html)
1. Use pandas **`read_html`** to read in the full table at the bottom of the page and display it here in the notebook
1. Use the `header` parameter to pick the correct header row instead of the default numbers
1. Inspect the info to make sure the Clinton and Trump data types are float64
1. Add a column that calculates the difference between Trump and Clinton
1. Sort the DataFrame by this newly created column
1. What conclusions (if any) can you draw?

In [None]:
# your code here