# Machine Learning

## Table of Contents

- [Setup and Initialization](#Setup-and-Initialization)
- [Motivation and Background](#Motivation-and-Background)
- [Data Basics](#Data-Basics)
- [Understanding the Data](#Understanding-the-Data)

    - [Exercise 1 - descriptive statistics](#Exercise-1---descriptive-statistics)

- [Cleaning and Subsetting Data](#Cleaning-and-Subsetting-Data)

    - [Exercise 2 - function `cleanData`](#Exercise-2---function-cleanData)

- [Model Selection and Assessment](#Model-Selection-and-Assessment)

    - [Exercise 3 - Train your Model](#Exercise-3---Train-your-Model)

- [References](#References)


# Setup and Initialization

- Back to [Table of Contents](#Table-of-Contents)

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment.

In [86]:
import pandas
from sqlalchemy import create_engine
import numpy
from IPython.core.display import Image
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn import metrics
import math

# Motivation and Background

- Back to [Table of Contents](#Table-of-Contents)

Research in science policy often involves making use of publicly available datasets.  Today, we will be diving into the National Institutes of Health (NIH) grant data.  NIH provides a wealth of information about its grants, which you can access here: http://projectreporter.nih.gov/reporter.cfm. Unfortunately, since the information draws on disparate sources like eRA databases, Medline, PubMed Central, the NIH Intramural Database, and iEdison, it is often incomplete.

In this workbook, we will examine one application of machine learning that deals with predicting missing information. Often in science policy research, we are interested in knowing what areas of science are being funded.  There are diverse techniques to determine a grant's area of science, including assessing a grant's topic using text analytics like topic modeling (covered in the Text Analysis workbook).

The best option is an explicit and standardized identifier that is present on every record.  The NIH grant data doesn't have an area of science taxonomy.  It does, however, provide a place to store an explicit, though potentially not standardized or totally accurate, indicator of a grant's overall topic area - the academic department of the grant's principal investigator (PI).  Unfortunately, this variable is often left empty.  In this workbook, we will walk through the process of imputing values for a missing categorical variable using the example of predicting the academic department of a given grant's principal investigator.

# Data Basics

- Back to [Table of Contents](#Table-of-Contents)

The data we need for this exercise is in the '`umetricsgrants`' database in MySQL on the class server.  For more details on this data, see the data schemas we've provided in class:

- Database Schema PowerPoint (also includes USPTO and StarMetrics data) - [http://jpsmonline.umd.edu/mod/resource/view.php?id=2387](http://jpsmonline.umd.edu/mod/resource/view.php?id=2387)
- UmetricsGrants Schema (image) - [http://jpsmonline.umd.edu/mod/resource/view.php?id=2384](http://jpsmonline.umd.edu/mod/resource/view.php?id=2384)

In particular, in the '`nih_project`' table, there is a variable '`ORG_DEPT`' that details the department of the PI of the grant.  This will be our outcome variable of interest (the variable we will try to predict when it is missing).  Once you start trying your own models, you can use as predictors variables from any of the tables whose names start with '`nih_`'.  To get us started, however, we have taken a subset of columns from these tables and placed them in a single table named '`MachineLearning`' in the '`homework`' database.

In the past (in the Database Basics and Text Analysis assignment notebooks) we have interacted with the database using SQL and a direct Python connection to the database.  In this lesson, we'll be using a different tool - the **pandas** package ( site: [http://pandas.pydata.org/](http://pandas.pydata.org/); doc: [http://pandas.pydata.org/pandas-docs/stable/index.html](http://pandas.pydata.org/pandas-docs/stable/index.html) ) - to read in and manipulate data.  Rather than working with rows fetched directly from MySQL, pandas stores the data in a special table format called a "data frame" that allows for easy statistical analysis and can be used directly for machine learning.

Pandas uses a database engine to connect to databases (via the SQLAlchemy Python package).  In the code cell below, we will create a database engine connected to our class MySQL database server for pandas to use.  Place your database username and password in the variables '`mysql_username`' and '`mysql_password`', then run the cell:

In [87]:
# set up database credentials
mysql_username = ""
mysql_password = ""

# Create database connection for pandas.
pandas_db = create_engine( "mysql://" + mysql_username + ":" + mysql_password + "@localhost:3306/homework?charset=utf8" )

Next, we will use this database connection to have pandas read in the data stored in the '`MachineLearning`' table.  Pandas has a set of Input/Output tools that let it read from and write to a large variety of tabular data formats ( [http://pandas.pydata.org/pandas-docs/stable/io.html](http://pandas.pydata.org/pandas-docs/stable/io.html) ), including CSV and Excel files, databases via SQL, JSON files, and SAS and Stata data files.  In the example below, we'll use the `pandas.read_sql()` function to read the results of an SQL query into a pandas data frame.  

In [88]:
data_frame = pandas.read_sql( 'Select * from homework.MachineLearning;', pandas_db )
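
If you don't have access to the class server, the sketch below shows the same `pandas.read_sql()` pattern against a throwaway in-memory SQLite database.  The table contents are made up for illustration and the SQLite connection is only a stand-in for the MySQL engine above; it is not part of the assignment.

```python
import sqlite3

import pandas

# Hypothetical stand-in for the class MySQL server: an in-memory SQLite database.
connection = sqlite3.connect( ":memory:" )

# Toy table with made-up rows; the real MachineLearning table has more columns.
connection.execute( "CREATE TABLE MachineLearning ( APPLICATION_ID INTEGER, ORG_DEPT TEXT )" )
connection.executemany(
    "INSERT INTO MachineLearning VALUES ( ?, ? )",
    [ ( 1, "BIOLOGY" ), ( 2, None ), ( 3, "CHEMISTRY" ) ]
)
connection.commit()

# Same kind of call as in the cell above, pointed at the SQLite connection.
toy_data_frame = pandas.read_sql( "SELECT * FROM MachineLearning;", connection )
connection.close()

print( toy_data_frame )
```

The SQL NULL in the second row arrives in the data frame as a missing value, which is the kind of value we will be counting and filtering later in this notebook.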

Now, let's look at what the data looks like.  The DataFrame method '`data_frame.head( number_of_rows )`' outputs the first `number_of_rows` rows in a data frame.  Let's look at the first five rows in our data.

In the code cell below, there are two ways to output this information.  If you just call the method, you'll get an HTML table output directly into the Jupyter/IPython notebook.  If you pass the results of the method to the "`print()`" function, you'll get text output that also works outside of Jupyter/IPython.

In [None]:
# to get a pretty tabular view, just call the method.
data_frame.head( 5 )

# to get a text-based view, print() the call to the method.
#print( data_frame.head( 5 ) )

# Understanding the Data

- Back to [Table of Contents](#Table-of-Contents)

In pandas, our data is represented by a DataFrame. You can think of a data frame as a giant programmable spreadsheet: the data for each column is stored in its own list, which pandas calls a Series (or vector of values), and the DataFrame comes with a set of methods (another name for functions that are tied to objects) that make managing data easy.

A Series is a list of values, each of which can also have a label.  Pandas calls this set of labels an "index".  When you retrieve a Series that represents a row, the index generally holds the names of the columns; when you retrieve a Series that represents a column of data in a table, it holds the IDs of the rows.

While DataFrames and Series are separate objects, they may share the same methods where those methods make sense in both a table and list context (`head()` and `tail()`, as used in examples in this notebook).
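
As a small illustration of the index labels described above, the sketch below builds a toy data frame (the rows are made up, not taken from the class database) and shows that a column Series is labeled by row ID while a row Series is labeled by column name.

```python
import pandas

# Made-up rows for illustration only.
toy_data_frame = pandas.DataFrame( {
    "ORG_DEPT" : [ "BIOLOGY", "CHEMISTRY", "GENETICS" ],
    "YEAR" : [ 2000, 2001, 2001 ]
} )

# A column comes back as a Series whose index holds the row IDs.
column_series = toy_data_frame[ "ORG_DEPT" ]
print( list( column_series.index ) )

# A row comes back as a Series whose index holds the column names.
row_series = toy_data_frame.iloc[ 0 ]
print( list( row_series.index ) )

# head() is available on both kinds of object.
print( toy_data_frame.head( 2 ) )
print( column_series.head( 2 ) )
```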

More details on pandas data structures:

- Data Structures overview: [http://pandas.pydata.org/pandas-docs/stable/dsintro.html](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
- Series specifics: [http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)
- DataFrame specifics: [http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)

For example, if you want to look at the last five values in the `ORG_DEPT` column, you can retrieve the Series that contains the `ORG_DEPT` column's data using square-bracket notation (like you'd use to access a value in a dictionary):

    data_frame[ "ORG_DEPT" ]
    
then call the "`tail( number_of_rows )`" method (the opposite of "`head()`") on the Series to get `number_of_rows` values from the end of the column's data.  To get the last 5 values in the `ORG_DEPT` column:

In [None]:
# get vector of "ORG_DEPT" column values from data frame
org_dept_column_series = data_frame[ "ORG_DEPT" ]

# see the last 5 values in the vector.
print( org_dept_column_series.tail( 5 ) )

# It is also OK to chain together, but I did not above for clarity's sake, and in
#    general, be wary of doing too many things on one line.
# data_frame[ "ORG_DEPT" ].tail( 5 )

To see how the data is stored internally, we can reference the "`data_frame.dtypes`" attribute, which contains a pandas Series object with the name of each column in your data frame as the label and the type of the data in that column as the value:

In [None]:
data_frame.dtypes

Let's look at these database columns one by one:

* **`APPLICATION_ID`** - Unique identifier for each grant
* **`CFDA_CODE`** - CFDA contains detailed program descriptions for 2,292 Federal assistance programs: [https://www.cfda.gov/](https://www.cfda.gov/)
* **`YEAR`** - Year in which grant was awarded
* **`ACTIVITY`** - A 3-character code identifying the grant, contract, or intramural activity through which a project is supported. Here is a list of activity codes: [http://grants.nih.gov/grants/funding/ac_search_results.htm](http://grants.nih.gov/grants/funding/ac_search_results.htm)
* **`ADMINISTERING_IC`** - Administering Institute or Center - A two-character code to designate the agency, NIH Institute, or Center administering the grant. See [definitions](http://grants.nih.gov/grants/glossary.htm#I14).
* **`ARRA_FUNDED`** - “Y” indicates a project supported by funds appropriated through the American Recovery and Reinvestment Act of 2009.
* **`ORG_NAME`** - The name of the educational institution, research organization, business, or government agency receiving funding for the grant, contract, cooperative agreement, or intramural project.  
* **`ORG_DEPT`** - The departmental affiliation of the contact principal investigator for a project, using a standardized categorization of departments.  Names are available only for medical school departments.
* **`STUDY_SECTION`** - A designator of the legislatively-mandated panel of subject matter experts that reviewed the research grant application for scientific and technical merit.
* **`TOTAL_COST`** - Total project funding from all NIH Institute and Centers for a given fiscal year.
* **`TOPIC_ID`** - Using text analysis techniques, a topic_id was assigned to each grant. This topic_id is a key for that topic. You can see what the topic contains by looking in the `topiclda_text` table in `umetricsgrants` database.
* **`ED_INST_TYPE`** - Generic name for the grouping of components across an institution that has applied for or receives NIH funding.

In the examples that follow, we'll only be using the columns in this table as predictors for models.  Once you get to the last exercise, however, when you are trying to make as accurate a model as you can, you will be free to use any of the columns in the table `nih_project` in the `umetricsgrants` database.  A complete description of all variables in that table: [http://exporter.nih.gov/about.aspx](http://exporter.nih.gov/about.aspx)

### Exercise 1 - descriptive statistics

- Return to [Table of Contents](#Table-of-Contents)

Pandas provides some great functions for descriptive statistics ( [http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics) ).  Some examples:

- **`describe()`** - "computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course)" ( from [http://pandas.pydata.org/pandas-docs/stable/basics.html#summarizing-data-describe](http://pandas.pydata.org/pandas-docs/stable/basics.html#summarizing-data-describe) )

    - includes the count of values, mean, standard deviation, min, 25%, 50%, and 75% values, and the max.

- **`head()` and `tail()`**, shown above - "To view a small sample of a Series or DataFrame object, use the `head()` and `tail()` methods. The default number of elements to display is five, but you may pass a custom number." ( from [http://pandas.pydata.org/pandas-docs/stable/basics.html#head-and-tail](http://pandas.pydata.org/pandas-docs/stable/basics.html#head-and-tail) )
- **`value_counts()`** - The `value_counts()` "series method and top-level function computes a histogram of a one-dimensional array of values." ( from [http://pandas.pydata.org/pandas-docs/stable/basics.html#value-counts-histogramming-mode](http://pandas.pydata.org/pandas-docs/stable/basics.html#value-counts-histogramming-mode) ).  This method returns a Series of the counts of the number of times each unique value in the column is present in the column (also known as frequencies), from largest count to least, with the value itself the label for each row.
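
To make these methods concrete before the exercise, here is a minimal sketch on a toy data frame.  The values are invented for illustration; the real table will give very different numbers.

```python
import pandas

# Made-up rows, including one missing department and one missing cost.
toy_data_frame = pandas.DataFrame( {
    "ORG_DEPT" : [ "BIOLOGY", "BIOLOGY", "CHEMISTRY", None, "BIOLOGY", "CHEMISTRY", "GENETICS" ],
    "TOTAL_COST" : [ 100.0, 200.0, 300.0, 400.0, None, 600.0, 700.0 ]
} )

# describe() summarizes the numeric columns by default and skips missing values.
summary = toy_data_frame.describe()
print( summary )

# value_counts() also skips missing values by default and sorts from the
#    largest count to the smallest.
department_counts = toy_data_frame[ "ORG_DEPT" ].value_counts()

# head() then trims the result to the most frequent values.
print( department_counts.head( 2 ) )
```

Note that `describe()` only reports on `TOTAL_COST` here, since `ORG_DEPT` holds text rather than numbers.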

For the first part of exercise 1, we will combine some of these methods for calculating descriptive statistics into a function that accepts a DataFrame of data from our `homework.MachineLearning` table, then calculates and returns both the descriptives for the columns in the table and the top ten most frequently referenced departments.

We will be making a function that returns multiple values to help you understand how this works in Python, since some of the machine learning functions used below return multiple values.

In a Python function, if you want to return multiple values, you place each in the line with your return statement, separated by commas.  This is like a list of variables, but you don't need to put it in square brackets.  You just place the items after the `return` keyword.  So, if you wanted to write a function that returns the circumference and area of a circle when passed the radius of a circle (run the cell below):

In [None]:
def calculateCircleInfo( radius_IN ):
    
    '''
    Accepts radius of circle.  Calculates circumference and area of circle,
       returns them both.
    '''
    
    # return references
    circumference_OUT = -1
    area_OUT = -1
    
    # got a radius?
    if ( ( radius_IN is not None ) and ( radius_IN != "" ) and ( radius_IN >= 0 ) ):
        
        # yes.  calculate circumference...
        circumference_OUT = 2 * math.pi * radius_IN
        
        # ...and calculate area
        area_OUT = math.pi * ( radius_IN ** 2 )
        
    #-- END check to see if radius is populated and not negative (we'll allow 0...). --#
    
    # return both circumference and area.
    return circumference_OUT, area_OUT
    
#-- END function calculateCircleInfo() --#

When a function or method returns more than one thing, it returns the items in a tuple.  When you call a function or method that returns more than one thing, to accept all of the items it returns, you can either:

- assign the results to a single variable.  This variable will contain a tuple, in which you can then reference the individual values returned by the function or method using square bracket notation, by index (starting with 0).
- assign the results of the function to a list of the same number of variables, separated by commas, to the left of the assignment operator (the equal sign - "=").

For example, to calculate the circumference and area of a circle with radius of 3:

In [None]:
# declare variables
radius = -1
circumference = -1
area = -1
circle_info_tuple = None

# set radius
radius = 3

# calculate circumference and area - individual values
circumference, area = calculateCircleInfo( radius )

# print the results
print( "[values] - For circle of radius " + str( radius ) + ":\n- circumference = " + str( circumference ) + "\n- area = " + str( area ) )

# calculate circumference and area - tuple
circle_info_tuple = calculateCircleInfo( radius )

# unpack value using square-bracket tuple notation.
circumference = circle_info_tuple[ 0 ] # first
area = circle_info_tuple[ 1 ] # second

# print the results
print( "\n[tuple] - For circle of radius " + str( radius ) + ":\n- circumference = " + str( circumference ) + "\n- area = " + str( area ) )

Now, we'll make our own function that returns multiple values.  Complete the `printDescriptiveStats()` function below so that it takes the data frame of data from the `homework.MachineLearning` table and uses it to create and return:

- 1) - **`summary_OUT`** - the summary statistics for all variables in the dataset (the results of invoking `describe()`)
- 2) - **`top_depts_OUT`** - a Series that contains the top 10 most referenced department names from the column "`ORG_DEPT`" and the count of mentions for each (use a combination of the `head()` and `value_counts()` methods).

In [89]:
def printDescriptiveStats( data_frame_IN ):

    """
    Parameters
    ----------
    data_frame_IN : A pandas DataFrame
    
    Returns
    -------
    summary_OUT : A Pandas dataframe containing count, mean, standard deviation, 
              minimum, maximum, and the 25th, 50th and 75th percentile
    top_depts_OUT : A pandas.core.series.Series containing the top 10 departments
               and their frequencies
    """
    ### BEGIN SOLUTION
    
    # use describe to summarize table
    summary_OUT = data_frame_IN.describe()
    
    # get column "ORG_DEPT"
    department_series = data_frame_IN["ORG_DEPT"]

    # calculate frequencies
    department_frequencies = department_series.value_counts()
    
    # get top ten most frequent
    top_depts_OUT = department_frequencies.head( 10 )
    
    return summary_OUT, top_depts_OUT 

    ### END SOLUTION
    
#-- END function printDescriptiveStats() --#

Run the cell above that contains your function definition, then run the cell below to test it out and see what the data in data_frame looks like.

In [90]:
# TEST - lets see what our data looks like
output = printDescriptiveStats( data_frame )

# TEST to make sure two things returned
assert len( output ) == 2

# get descriptives and top_departments
descriptives = output[ 0 ]
top_departments = output[ 1 ]

print( "Descriptives:" )
print( descriptives )
print( "\nTop Departments:" )
print( top_departments )

# TEST to make sure frequencies are right.
assert top_departments.iloc[ 0 ] == 66695

Descriptives:
       APPLICATION_ID           YEAR       TOTAL_COST
count   750000.000000  495804.000000    493884.000000
mean   6367218.393325    2000.791922    313985.411447
std     202037.966264       0.806793    546415.565411
min    2628851.000000    1999.000000         0.000000
25%    6287895.000000    2000.000000    119554.000000
50%    6380018.000000    2001.000000    235692.000000
75%    6489934.000000    2001.000000    328750.000000
max    7305971.000000    2006.000000  57896226.000000

Top Departments:
INTERNAL MEDICINE/MEDICINE     66695
BIOCHEMISTRY                   26188
NONE                           20624
MICROBIOLOGY/IMMUN/VIROLOGY    20412
PHARMACOLOGY                   19536
PSYCHIATRY                     18692
BIOLOGY                        17976
PSYCHOLOGY                     17220
PATHOLOGY                      17004
PHYSIOLOGY                     16068
dtype: int64


# Cleaning and Subsetting Data

- Back to [Table of Contents](#Table-of-Contents)

Looking at the data, it is clear that there are a lot of missing values in `ORG_DEPT` (look at how many rows have "NONE", and where that value ranks in the frequencies for `ORG_DEPT`).

In order to train a model to predict this value, we'll need to separate the rows that have a value from those that do not, so that we can train and evaluate our model using a sample of just those grant rows that contain an `ORG_DEPT` value.

To filter out rows with no `ORG_DEPT` value, we'll need to do some basic cleaning of the data.  For that, we first need to figure out which variables have missing values.

The function `calc_null_frequencies()` accepts a DataFrame in which you'd like to count null values per column.  It returns a DataFrame that lists each column in the DataFrame and for each, counts of non-null ("False") and null ("True") values:

In [91]:
def calc_null_frequencies( data_frame_IN ):

    """
    for a given DataFrame, calculates how many values for 
    each variable are null and returns the resulting table.
    """
    
    # return reference
    null_frequencies_OUT = None
    
    # declare variables
    data_frame_long = None
    variable_value_series = None
    is_null_value_series = None
    variable_name_series = None
    
    # Use DataFrame.melt() to convert DataFrame table into long format
    #   where every column value is its own row: variable names are in
    #   column "variable", values are in column "value".
    data_frame_long = pandas.melt( data_frame_IN )
    
    # get Series that just contains column values
    #value_series = data_frame_long.value # alternate dot (".") notation
    variable_value_series = data_frame_long[ "value" ]
    
    # Make a Series where position of each column value in Series
    #    contains True if value was NULL or False if not.
    is_null_value_series = variable_value_series.isnull()

    # get Series of column names to match each of the values
    #   (not distinct - one for each value - lots!)
    # variable_name_series = data_frame_long.variable # alternate dot (".") notation
    variable_name_series = data_frame_long[ "variable" ]
    
    # create a DataFrame that tallies counts of non-null values ("False") and
    #    null values ("True") for each column name.
    null_frequencies_OUT = pandas.crosstab( variable_name_series, is_null_value_series )

    return null_frequencies_OUT

#-- END function calc_null_frequencies() --#

In [92]:
# Lets see what our NULL values look like
calc_null_frequencies( data_frame )

| variable | False | True |
|---|---:|---:|
| ACTIVITY | 750000 | 0 |
| ADMINISTERING_IC | 750000 | 0 |
| APPLICATION_ID | 750000 | 0 |
| ARRA_FUNDED | 750000 | 0 |
| CFDA_CODE | 493152 | 256848 |
| ED_INST_TYPE | 402091 | 347909 |
| ORG_DEPT | 402091 | 347909 |
| ORG_NAME | 748056 | 1944 |
| STUDY_SECTION | 593916 | 156084 |
| TOPIC_ID | 750000 | 0 |
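
As a cross-check on the table above, a shorter route to per-column null counts is to chain `isnull()` and `sum()` directly on the DataFrame.  The sketch below uses a toy data frame with invented values; it is an alternative to `calc_null_frequencies()`, not a replacement the assignment requires.

```python
import pandas

# Made-up rows: two NULL departments, one NULL year, and one "NONE" string.
toy_data_frame = pandas.DataFrame( {
    "ORG_DEPT" : [ "BIOLOGY", None, "NONE", None ],
    "YEAR" : [ 2000, 2001, None, 2001 ]
} )

# isnull() marks every cell True or False; sum() counts the True values per column.
null_counts = toy_data_frame.isnull().sum()
print( null_counts )

# The string "NONE" is an ordinary value, not a NULL, so it has to be
#    counted (and later filtered) separately.
none_string_count = ( toy_data_frame[ "ORG_DEPT" ] == "NONE" ).sum()
print( none_string_count )
```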


Now that we have a better understanding of the issues our dataset has (in particular, `ORG_DEPT` has many NULL values and many rows where the value is "NONE"), let's go ahead and write some code to deal with them.

### Exercise 2 - function `cleanData`

- Return to [Table of Contents](#Table-of-Contents)

To facilitate our cleanup, we will create a function that uses pandas' subsetting functionality to filter out rows where a given column contains a given value.

In pandas, if you want to filter a DataFrame, you use a special form of square bracket notation to specify a boolean test to be used on each row to decide if it should be included in the resulting DataFrame.  If the test evaluates to True, the row is included.  If the test evaluates to False, the row is not included.

The syntax for this notation is to place this boolean test inside square brackets next to the name of the variable that contains the DataFrame you want to filter:

    filtered_data_frame = data_frame[ <boolean_test> ]

Inside this `<boolean_test>`, you can reference the value for a given column using the standard square bracket notation (`data_frame[ "<column_name>" ]`), you can call functions or methods on objects, and you can combine multiple tests with the element-wise operators `&` (and) and `|` (or), wrapping each test in parentheses.  Python's `and` and `or` keywords do not work here: they raise a `ValueError` when applied to a pandas Series.  Examples make it clearer:

    # filter out rows where column "CFDA_CODE" is not 123 (so keep rows where "CFDA_CODE" is 123)
    filtered_data_frame = data_frame[ data_frame[ "CFDA_CODE" ] == 123 ]
    
    # filter out rows where column "YEAR" is NULL (so keep rows where pandas.notnull( data_frame[ "YEAR" ] ) == True)
    filtered_data_frame = data_frame[ pandas.notnull( data_frame[ "YEAR" ] ) == True ]
    
    # filter out rows where column "ORG_NAME" = "ARBYS" (so keep rows where "ORG_NAME" is not "ARBYS")
    filtered_data_frame = data_frame[ data_frame[ "ORG_NAME" ] != "ARBYS" ]
    
    # filter out rows where column "ORG_NAME" = "ARBYS" or "YEAR" is 1726 (Arby's didn't exist in 1726...)
    filtered_data_frame = data_frame[ ( data_frame[ "ORG_NAME" ] != "ARBYS" ) & ( data_frame[ "YEAR" ] != 1726 ) ]
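
To see how combined tests behave, here is a minimal runnable sketch on a toy data frame.  The organization names and years are invented for illustration.  Each test produces a Series of True/False values, one per row, and pandas combines them element by element with `&` and `|`.

```python
import pandas

# Made-up rows for illustration only.
toy_data_frame = pandas.DataFrame( {
    "ORG_NAME" : [ "ARBYS", "UNIVERSITY A", "UNIVERSITY B", "ARBYS" ],
    "YEAR" : [ 2000, 1726, 2001, 1726 ]
} )

# Each test is itself a Series of True/False values, one per row.
is_not_arbys = toy_data_frame[ "ORG_NAME" ] != "ARBYS"
is_not_1726 = toy_data_frame[ "YEAR" ] != 1726

# "&" keeps only the rows that pass both tests.
both_tests = toy_data_frame[ is_not_arbys & is_not_1726 ]
print( both_tests )

# "|" keeps the rows that pass at least one test.
either_test = toy_data_frame[ is_not_arbys | is_not_1726 ]
print( either_test )
```

Storing each test in its own named variable, as done here, also sidesteps the need for parentheses around each comparison.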

In the cleanData function below, using the DataFrame (**`data_frame_IN`**), name of a column in the dataframe (**`column_name_IN`**), and filter value passed in (**`filter_value_IN`**):

- return a cleaned dataframe that is a subset of **`data_frame_IN`** from which you've removed rows that contain the specified value (**`filter_value_IN`**) in the specified column (**`column_name_IN`**).  In order to return this subset, store it in the return variable **`cleaned_data_OUT`**.
- Your function should be equipped to deal with NULL - a special value that can't be filtered normally.  The simplest way to do this is to decide that a certain string value will tell you when you are filtering on NULL (any case of the string "NULL" - "NULL"/"null"/"NuLl", etc.).  Whenever you find this value has been passed in `filter_value_IN`, use the `pandas.notnull()` function to test whether you should keep a given row, rather than a boolean operator.
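
The reason NULL needs special handling is that a missing value is never equal to anything, including itself, so the usual comparison operators can't find it.  The sketch below illustrates this with a plain Python NaN and then shows `pandas.notnull()` doing the job on a toy Series (values invented for illustration).

```python
import pandas

# NaN is how pandas commonly represents a missing number.
missing = float( "nan" )

# A missing value is never equal to anything, even itself.
print( missing == missing )   # False
print( missing != missing )   # True

# Made-up department values, one of them missing.
department_series = pandas.Series( [ "BIOLOGY", None, "NONE" ] )

# pandas.notnull() returns True for every value that is present.
not_null_mask = pandas.notnull( department_series )
print( not_null_mask.tolist() )

# Using the mask as a filter drops the missing row but keeps the "NONE" string.
kept_series = department_series[ not_null_mask ]
print( kept_series.tolist() )
```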

In [93]:
def cleanData( data_frame_IN, column_name_IN, filter_value_IN ):

    """
    Parameters
    ----------
    - data_frame_IN : A pandas DataFrame
    - column_name_IN : Name of the column on the dataframe
    - filter_value_IN : The value that causes rows to be filtered out of data_frame_IN 
           if it is present in the column named column_name_IN.
    
    Returns
    -------
    - cleaned_data_OUT : A Pandas DataFrame containing only rows that did not have the 
                specified value in the specified column
    """
    
    # return reference
    cleaned_data_OUT = None

    # check to make sure that column name is in data frame's list of column names.
    if( column_name_IN not in list( data_frame_IN.columns.values ) ):
    
        # no - nothing to filter on, so report the problem and return None.
        print("ERROR : Column you specified not present in the dataframe")
        return cleaned_data_OUT
        
    #-- END check to see if column is in data frame's list of column names. --#
    
    ### BEGIN SOLUTION

    # filtering on NULL?  Only a string can match "NULL", so check the type first.
    if isinstance( filter_value_IN, str ) and filter_value_IN.upper() == "NULL":
    
        # keep rows where column passed is not NULL.
        cleaned_data_OUT = data_frame_IN[ pandas.notnull( data_frame_IN[ column_name_IN ] ) == True ]

    else:
        
        # keep rows where column passed in does not contain filter_value_IN.
        cleaned_data_OUT = data_frame_IN[ data_frame_IN[ column_name_IN ] != filter_value_IN ]

    ### END SOLUTION
    
    return cleaned_data_OUT
    
#-- END function cleanData() --#

Clearly, if the department name is NULL, we cannot use that data to train our classifier.  Fortunately, we can use our `cleanData()` function to remove all rows where "`ORG_DEPT`" is NULL.

In [94]:
# TEST - remove all rows where "`ORG_DEPT`" is NULL.
cleaned_data_frame = cleanData( data_frame, "ORG_DEPT", "null" )

# generate null frequencies data frame again to see if it worked.
null_frequencies_df = calc_null_frequencies( cleaned_data_frame )

# TEST - should be 402091 rows left after clearing out NULL departments.
assert len( cleaned_data_frame ) == 402091

# output NULL frequencies table
null_frequencies_df

| variable | False | True |
|---|---:|---:|
| ACTIVITY | 402091 | 0 |
| ADMINISTERING_IC | 402091 | 0 |
| APPLICATION_ID | 402091 | 0 |
| ARRA_FUNDED | 402091 | 0 |
| CFDA_CODE | 396159 | 5932 |
| ED_INST_TYPE | 402091 | 0 |
| ORG_DEPT | 402091 | 0 |
| ORG_NAME | 402091 | 0 |
| STUDY_SECTION | 399547 | 2544 |
| TOPIC_ID | 402091 | 0 |


That cleaned up most of the NULL values for us, even in the other columns.  There might be other values in "`ORG_DEPT`" we want to clear out as well, though ("NONE"), so let's run the code cell below to print a frequency table for "`ORG_DEPT`" and see what values remain:

In [95]:
cleaned_data_frame["ORG_DEPT"].value_counts()

INTERNAL MEDICINE/MEDICINE       66695
BIOCHEMISTRY                     26188
NONE                             20624
MICROBIOLOGY/IMMUN/VIROLOGY      20412
PHARMACOLOGY                     19536
PSYCHIATRY                       18692
BIOLOGY                          17976
PSYCHOLOGY                       17220
PATHOLOGY                        17004
PHYSIOLOGY                       16068
PEDIATRICS                       15660
ANATOMY/CELL BIOLOGY             13732
CHEMISTRY                        13196
PUBLIC HEALTH &PREV MEDICINE     12568
GENETICS                          9188
NEUROLOGY                         8836
SURGERY                           8056
MISCELLANEOUS                     7112
RADIATION-DIAGNOSTIC/ONCOLOGY     6980
NEUROSCIENCES                     6108
OTHER HEALTH PROFESSIONS          6028
DENTISTRY                         5240
OPHTHALMOLOGY                     5180
VETERINARY SCIENCES               5064
OTHER BASIC SCIENCES              4180
OBSTETRICS &GYNECOLOGY   

The values of "`NONE`", "`MISCELLANEOUS`" and "`NO CODE ASSIGNED`" are also useless in terms of training or testing a machine learning model for predicting a grant's department.  Run the code cell below to use our `cleanData()` function to remove all rows where the "`ORG_DEPT`" column contains "`NONE`", "`MISCELLANEOUS`" or "`NO CODE ASSIGNED`":

In [96]:
# Getting rid of NONE, NO CODE ASSIGNED and MISCELLANEOUS.
cleaned_data_frame = cleanData( cleaned_data_frame, "ORG_DEPT", "NONE" )
cleaned_data_frame = cleanData( cleaned_data_frame, "ORG_DEPT", "NO CODE ASSIGNED" )
cleaned_data_frame = cleanData( cleaned_data_frame, "ORG_DEPT", "MISCELLANEOUS" )

In order to avoid confusion caused by mixed case, we should also make sure that the values of all categorical variables (variables of data type "`numpy.object_`") are converted to a uniform case.  Run the example code in the cell below to convert all categorical columns to upper case:

In [97]:
# Converting to upper case

# loop over column names
column_name_series = cleaned_data_frame.columns.values
for column_name in list( column_name_series ):
    
    # get column data type
    column_data_type = cleaned_data_frame[ column_name ].dtype
    
    # is it categorical (numpy.object_)?
    if column_data_type == numpy.object_:
        
        # yes - use Series.str.upper() method to convert all values in column to upper case.
        cleaned_data_frame[ column_name ] = cleaned_data_frame[ column_name ].str.upper()
        
    #-- END check to see if variable is categorical --#
    
#-- END loop over columns. --#

# and calculate the null frequencies again to see where we're at
calc_null_frequencies( cleaned_data_frame )

| variable | False | True |
|---|---:|---:|
| ACTIVITY | 374331 | 0 |
| ADMINISTERING_IC | 374331 | 0 |
| APPLICATION_ID | 374331 | 0 |
| ARRA_FUNDED | 374331 | 0 |
| CFDA_CODE | 369871 | 4460 |
| ED_INST_TYPE | 374331 | 0 |
| ORG_DEPT | 374331 | 0 |
| ORG_NAME | 374331 | 0 |
| STUDY_SECTION | 372071 | 2260 |
| TOPIC_ID | 374331 | 0 |
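
One practical note on the upper-casing step: assigning new column values to a data frame that was produced by filtering another one can, depending on your pandas version, trigger a `SettingWithCopyWarning`.  A common way around this is to call `copy()` on the filtered result first.  The sketch below shows the pattern on a toy data frame with invented values; it is a suggestion, not something the cells above depend on.

```python
import pandas

# Made-up rows with mixed-case department names.
toy_data_frame = pandas.DataFrame( {
    "ORG_DEPT" : [ "Biology", "NONE", "chemistry" ],
    "YEAR" : [ 2000, 2001, 2001 ]
} )

# Filter, then copy, so later assignments change an independent DataFrame.
cleaned_toy_frame = toy_data_frame[ toy_data_frame[ "ORG_DEPT" ] != "NONE" ].copy()

# Convert the text column to upper case, as in the loop above.
cleaned_toy_frame[ "ORG_DEPT" ] = cleaned_toy_frame[ "ORG_DEPT" ].str.upper()

print( cleaned_toy_frame )

# The original data frame is left untouched.
print( toy_data_frame )
```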


For this example, "`YEAR`", "`STUDY_SECTION`" and "`CFDA_CODE`" are categorical variables that we aren't interested in predicting, so run the cell below to remove rows where these columns are NULL:

In [98]:
cleaned_data_frame = cleanData( cleaned_data_frame, "CFDA_CODE", "NULL" )
cleaned_data_frame = cleanData( cleaned_data_frame, "STUDY_SECTION", "NULL" )
cleaned_data_frame = cleanData( cleaned_data_frame, "YEAR", "NULL" )

# Lets see if we have any more null frequencies to deal with
calc_null_frequencies( cleaned_data_frame )

| variable | False | True |
|---|---:|---:|
| ACTIVITY | 367691 | 0 |
| ADMINISTERING_IC | 367691 | 0 |
| APPLICATION_ID | 367691 | 0 |
| ARRA_FUNDED | 367691 | 0 |
| CFDA_CODE | 367691 | 0 |
| ED_INST_TYPE | 367691 | 0 |
| ORG_DEPT | 367691 | 0 |
| ORG_NAME | 367691 | 0 |
| STUDY_SECTION | 367691 | 0 |
| TOPIC_ID | 367691 | 0 |


We still have 48 missing values for "`TOTAL_COST`", which we'll want to address one way or another so we can use this column in our models.  We can get rid of them, but we could also use some sort of function to estimate a value (basic options are measures of central tendency like the mean, median or mode of the other values in that column).  This can be complicated to do well.  Since there are a total of 367,691 records, losing 48 won't really impact the size of our data set, so for this exercise, we'll just delete these 48 records and move on:

In [99]:
# get rid of rows where "TOTAL_COST" is NULL.
cleaned_data_frame = cleanData( cleaned_data_frame, "TOTAL_COST", "NULL" )

# look at null frequencies now.
calc_null_frequencies( cleaned_data_frame )

| variable | False |
|---|---|
| ACTIVITY | 367643 |
| ADMINISTERING_IC | 367643 |
| APPLICATION_ID | 367643 |
| ARRA_FUNDED | 367643 |
| CFDA_CODE | 367643 |
| ED_INST_TYPE | 367643 |
| ORG_DEPT | 367643 |
| ORG_NAME | 367643 |
| STUDY_SECTION | 367643 |
| TOPIC_ID | 367643 |
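
For comparison, here is a minimal sketch of the imputation alternative mentioned above: filling missing "`TOTAL_COST`" values with the column median instead of dropping the rows. The toy `example_frame` and its values are made up for illustration; the workbook itself simply deletes the 48 rows.

```python
import numpy
import pandas

# hypothetical toy data - not taken from the NIH data set.
example_frame = pandas.DataFrame( {
    "APPLICATION_ID" : [ 1, 2, 3, 4, 5 ],
    "TOTAL_COST" : [ 100000.0, numpy.nan, 250000.0, 400000.0, numpy.nan ]
} )

# the median ignores missing values by default.
median_cost = example_frame[ "TOTAL_COST" ].median()

# fill the gaps in a copy, so the original frame is left untouched.
imputed_frame = example_frame.copy()
imputed_frame[ "TOTAL_COST" ] = imputed_frame[ "TOTAL_COST" ].fillna( median_cost )

print( "Median used for imputation: " + str( median_cost ) )
print( imputed_frame )
```

Whether the median (or mean, or mode) is a sensible stand-in depends on why the values are missing, which is part of what makes imputation hard to do well.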


Let's take another look at our "`ORG_DEPT`" variable, and see if we can combine some departments that are very similar.

In [100]:
# Look at the department values that remain, and how many total records we have:
print( cleaned_data_frame[ "ORG_DEPT" ].value_counts() )
print( "\nRecord Count: " + str( len( cleaned_data_frame ) ) )

INTERNAL MEDICINE/MEDICINE       64811
BIOCHEMISTRY                     26136
MICROBIOLOGY/IMMUN/VIROLOGY      20324
PHARMACOLOGY                     19392
PSYCHIATRY                       18536
BIOLOGY                          17852
PSYCHOLOGY                       17180
PATHOLOGY                        16804
PHYSIOLOGY                       15980
PEDIATRICS                       15216
ANATOMY/CELL BIOLOGY             13704
CHEMISTRY                        13184
PUBLIC HEALTH &PREV MEDICINE     11328
GENETICS                          9124
NEUROLOGY                         8776
SURGERY                           7904
RADIATION-DIAGNOSTIC/ONCOLOGY     6884
NEUROSCIENCES                     6080
OTHER HEALTH PROFESSIONS          5840
DENTISTRY                         5224
OPHTHALMOLOGY                     5124
VETERINARY SCIENCES               4988
OTHER BASIC SCIENCES              4152
OBSTETRICS &GYNECOLOGY            3736
ENGINEERING (ALL TYPES)           3020
OTOLARYNGOLOGY           

It certainly looks like a lot of these departments can be combined into broader areas.  To reduce the number of categories to predict (and so make it easier for a model to make good predictions), run the cell below to combine some of these departments into broader areas of interest:

In [101]:
# MEDICINE
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "ANESTHESIOLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "DENTISTRY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "DERMATOLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "EMERGENCY MEDICINE", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "FAMILY MEDICINE", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "INTERNAL MEDICINE/MEDICINE", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "NEUROLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "NEUROSURGERY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "OBSTETRICS &GYNECOLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "OPHTHALMOLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "ORTHOPEDICS", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "OTOLARYNGOLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "PATHOLOGY",[ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "PHARMACOLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "PHYSICAL MEDICINE &REHAB", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "PEDIATRICS", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "PLASTIC SURGERY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "RADIATION-DIAGNOSTIC/ONCOLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "SURGERY", [ 'ORG_DEPT' ] ] = "MEDICINE"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "UROLOGY", [ 'ORG_DEPT' ] ] = "MEDICINE"

# BIOLOGY
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "ANATOMY/CELL BIOLOGY", [ 'ORG_DEPT' ] ] = "BIOLOGY"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "PHYSIOLOGY", [ 'ORG_DEPT' ] ] = "BIOLOGY"
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "ZOOLOGY", [ 'ORG_DEPT' ] ] = "BIOLOGY"

# ENGINEERING
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "BIOMEDICAL ENGINEERING", [ 'ORG_DEPT' ] ] = "ENGINEERING (ALL TYPES)"

# OTHER HEALTH PROFESSIONS
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "PUBLIC HEALTH &PREV MEDICINE", [ 'ORG_DEPT' ] ] = "OTHER HEALTH PROFESSIONS"

# OTHER CLINICAL SCIENCES
cleaned_data_frame.loc[ cleaned_data_frame.ORG_DEPT == "NUTRITION", [ 'ORG_DEPT' ] ] = "OTHER CLINICAL SCIENCES"

# We will also get rid of "ADMINISTRATION"
cleaned_data_frame = cleanData( cleaned_data_frame, "ORG_DEPT", "ADMINISTRATION" )

# check out distribution of categories now:
cleaned_data_frame["ORG_DEPT"].value_counts()

MEDICINE                         168031
BIOLOGY                           48976
BIOCHEMISTRY                      26136
MICROBIOLOGY/IMMUN/VIROLOGY       20324
PSYCHIATRY                        18536
PSYCHOLOGY                        17180
OTHER HEALTH PROFESSIONS          17168
CHEMISTRY                         13184
GENETICS                           9124
NEUROSCIENCES                      6080
VETERINARY SCIENCES                4988
ENGINEERING (ALL TYPES)            4984
OTHER BASIC SCIENCES               4152
BIOSTATISTICS &OTHER MATH SCI      2644
OTHER CLINICAL SCIENCES            2376
SOCIAL SCIENCES                    2336
PHYSICS                             760
BIOPHYSICS                          228
dtype: int64
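
The long run of `.loc` assignments above could also be expressed as a lookup table. The sketch below is one possible way to do that, using a toy DataFrame and only a handful of the department names; the names `area_to_departments` and `department_to_area` are made up for this example.

```python
import pandas

# hypothetical toy data using a few of the department names from above.
example_frame = pandas.DataFrame( {
    "ORG_DEPT" : [ "DENTISTRY", "PHYSIOLOGY", "SURGERY", "CHEMISTRY", "ZOOLOGY", "NUTRITION" ]
} )

# map each broad area to the departments that should be folded into it.
area_to_departments = {
    "MEDICINE" : [ "DENTISTRY", "SURGERY" ],
    "BIOLOGY" : [ "PHYSIOLOGY", "ZOOLOGY" ],
    "OTHER CLINICAL SCIENCES" : [ "NUTRITION" ]
}

# invert into a department --> area lookup.
department_to_area = {}
for area_name, department_list in area_to_departments.items():
    for department_name in department_list:
        department_to_area[ department_name ] = area_name

# replace() leaves values that are not in the lookup (e.g. "CHEMISTRY") unchanged.
example_frame[ "ORG_DEPT" ] = example_frame[ "ORG_DEPT" ].replace( department_to_area )

print( example_frame[ "ORG_DEPT" ].value_counts() )
```

Keeping the groupings in one dictionary makes it easier to review and adjust which departments are combined.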

# Model Selection and Assessment

- Back to [Table of Contents](#Table-of-Contents)

Now that we have a clean dataset, we can move on to the fun parts! scikit-learn's models expect numeric input and do not accept string-valued categorical variables directly, so we first need to convert all such variables to dummy (indicator) variables. Fortunately, pandas makes this easy.

But before we do that, let's split our variables into predictors (features, independent variables, or "X" variables) and variables to predict (dependent variables, or "Y" variables).  For ease of reference, in subsequent examples, names of variables that pertain to predictors will start with "`X_`", and names of variables that pertain to variables we are to predict will start with "`y_`".

In [102]:
# Let's go ahead and split into predictors and predicted

# make a list of the column names not in the dependent (y) column name list (currently just "ORG_DEPT")
# one line - X_column_list = [ column_name for column_name in list( cleaned_data_frame.columns.values ) if column_name not in [ "ORG_DEPT" ] ]
X_column_list = []
y_column_list = [ "ORG_DEPT" ]

# loop over column names.
column_name_list = cleaned_data_frame.columns.values
for column_name in column_name_list:
    
    # if the name is not in y_column_list, add it to X_column_list
    if ( column_name not in y_column_list ):
        
        # add to X_column_list
        X_column_list.append( column_name )
        
    #-- END check to see if column is in predicted/DV/y list --#
    
#-- END loop over columns. --#

# split columns into two DataFrames, those we are to predict,
#    and those that are predictors.
X_data_frame = cleaned_data_frame[ X_column_list ]
y_data_frame = cleaned_data_frame[ y_column_list ]

Now, we can easily convert all categorical variables in `X_data_frame` into dummy/binary variables using the `pandas.get_dummies()` function ( [http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) ).

In [103]:
# scikit-learn's algorithms don't work on categorical variables. Fortunately, pandas provides an easy way out!
X_data_frame = pandas.get_dummies( X_data_frame )
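
To make the effect of `pandas.get_dummies()` concrete, here is a small sketch on a made-up frame with one numeric and one categorical column (the values are illustrative, not drawn from the NIH data):

```python
import pandas

# hypothetical toy data: one numeric column and one categorical column.
example_frame = pandas.DataFrame( {
    "TOTAL_COST" : [ 100000, 250000, 400000 ],
    "ACTIVITY" : [ "R01", "R21", "R01" ]
} )

dummy_frame = pandas.get_dummies( example_frame )

# numeric columns pass through; each category becomes its own indicator column.
print( list( dummy_frame.columns ) )
print( dummy_frame )
```

Each distinct category value gets its own column, so a categorical variable with many distinct values (such as "`ORG_NAME`") can add a large number of columns to `X_data_frame`.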

If we're building a model, we're going to need a way to know whether or not it's working. Convincing others of the quality of results is often the most challenging part of an analysis.  In machine learning, making repeatable, well-documented work with clear success metrics makes all the difference.

For our classifier, we're going to use the following build methodology:

<img src="https://s3.amazonaws.com/demo-datasets/traintest.png" />

In brief, this methodology involves:

- First **splitting your data** into a training set (75% of your data) and a test set (25% of your data).
- "**Feature engineering** is the process of transforming raw data into features that better represent the underlying problem/data to the predictive models, resulting in improved model accuracy on unseen data." ( from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) ).  In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.
- In the **Model Build** phase, you decide on a model then train or fit your model using your training data.
- In the **Evaluate Performance** phase, you run your fitted model on your set of testing predictors, then assess the quality of the model by comparing the predicted values to the actual values for each record in your testing data set. 

Since we have a limited number of relatively basic features, we won't be going into any Feature Engineering examples for this exercise.  However, feature engineering is an essential part of implementing quality machine learning - to learn more, start with the "Discover Feature Engineering" tutorial: [http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)

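Let us now split our dataset into test and training sets using the `train_test_split()` function from scikit-learn's `sklearn.cross_validation` module ( [http://scikit-learn.org/stable/modules/cross_validation.html](http://scikit-learn.org/stable/modules/cross_validation.html) ). Note that in scikit-learn 0.18 and later this function is provided by `sklearn.model_selection`, and the `sklearn.cross_validation` module was removed in version 0.20: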

In [104]:
# use train_test_split() to split our X and Y variables into separate 75% and 25%
#    DataFrames of training (X_train and y_train) and testing (X_test and y_test) data.
X_train, X_test, y_train, y_test = train_test_split( X_data_frame, y_data_frame, test_size = 0.25, random_state = 0 )
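
As a quick sanity check of what `train_test_split()` returns, the sketch below splits a made-up 8-row data set the same way. The `try`/`except` import is an assumption added so the example runs on both older and newer scikit-learn releases; the `example_` names are hypothetical.

```python
import pandas

# newer scikit-learn releases provide train_test_split in sklearn.model_selection;
#    fall back to the older module used in this workbook if that import fails.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

# hypothetical toy data with 8 rows.
example_X = pandas.DataFrame( { "feature" : range( 8 ) } )
example_y = pandas.DataFrame( { "label" : [ "a", "b" ] * 4 } )

example_X_train, example_X_test, example_y_train, example_y_test = train_test_split(
    example_X, example_y, test_size = 0.25, random_state = 0 )

# 25% of 8 rows is 2 test rows, leaving 6 for training.
print( "Training rows: " + str( len( example_X_train ) ) )
print( "Testing rows: " + str( len( example_X_test ) ) )

# rows stay aligned between X and y because both are split with the same indices.
print( list( example_X_test.index ) == list( example_y_test.index ) )
```

Fixing `random_state` makes the split repeatable, so the accuracy numbers can be reproduced from one run to the next.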

Python's `scikit-learn` is a very well-known machine learning library. It is also well documented and maintained. You can learn all about it here: [http://scikit-learn.org/stable/](http://scikit-learn.org/stable/). We will be using different classifiers from this library for our predictions in this workbook. 

We will start with the simplest `LogisticRegression` model ( [http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) ) and see how well that does.

You can use any number of metrics to judge your models, but we will be using the accuracy score as our measure. 

In [105]:
# Let's fit the model
model = LogisticRegression()
model.fit( X_train, y_train )
print(model)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)


  y = column_or_1d(y, warn=True)
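
The `column_or_1d` line at the end of the output above appears to be the tail of a scikit-learn warning, which is raised when `y` is passed as a single-column table rather than a one-dimensional array (as is the case for the `y_train` DataFrame). A minimal sketch of one way to avoid it, using a made-up stand-in for `y_train`:

```python
import pandas

# hypothetical stand-in for y_train: a DataFrame with a single column.
example_y_train = pandas.DataFrame( { "ORG_DEPT" : [ "MEDICINE", "BIOLOGY", "MEDICINE" ] } )

# shape is ( n_samples, 1 ) - a column vector.
print( example_y_train.shape )

# flatten to shape ( n_samples, ) before passing it to fit().
example_y_1d = example_y_train.values.ravel()
print( example_y_1d.shape )
```

With the real data, this would correspond to calling `model.fit( X_train, y_train.values.ravel() )`.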


When we print the model, we see different parameters we can adjust as we refine the model based on running it against test data (values such as `intercept_scaling`, `max_iter`, `penalty`, and `solver`).  Example output:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

To adjust these parameters, one would alter the call that creates the `LogisticRegression()` model instance, passing it one or more of these parameters with a value other than the default.  So, to re-fit the model with `max_iter` of 1000, `intercept_scaling` of 2, and `solver` of "lbfgs" (pulled from thin air as an example), you'd create your model as follows:

    model = LogisticRegression( max_iter = 1000, intercept_scaling = 2, solver = "lbfgs" )

More details on what each of these parameters means can be found on the `LogisticRegression` documentation page: [http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

The basic way one would tune these parameters is to iterate over fitting your model to your training data with different parameters (hopefully chosen based on your knowledge of your data and the model you are fitting), then testing it against your test data until the model's accuracy is as high as you can get it.  Unfortunately, this exposes one to the risk that the model is over-fitted to one's test data, and so won't perform as well when it is used to predict other sets of data.

Cross-validation is a good way to fine-tune the parameters with less risk of over-fitting.  It involves dividing your training data into 5 or so equal sets called folds. Then, for each set of parameters you want to try, you fit the model once per fold: each time, a different fold is held out to serve as the test data set while the model is trained on the remaining folds, and the resulting scores are averaged. The real test data stays untouched until the final evaluation. This sounds complicated, but scikit-learn has functions to help, and a good tutorial on cross-validation can be found on the scikit-learn site: [http://scikit-learn.org/stable/modules/cross_validation.html](http://scikit-learn.org/stable/modules/cross_validation.html).
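
As a rough illustration of cross-validation, the sketch below compares a few values of the `C` parameter using scikit-learn's `cross_val_score()` function. It assumes scikit-learn 0.18 or later (where the function lives in `sklearn.model_selection`), uses a synthetic data set in place of the NIH data, and the candidate `C` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the training data - not the NIH data set.
example_X, example_y = make_classification( n_samples = 200, n_features = 5, random_state = 0 )

# try a few values of the regularization parameter C (chosen arbitrarily).
mean_scores = {}
for c_value in [ 0.01, 1.0, 100.0 ]:

    example_model = LogisticRegression( C = c_value )

    # cv = 5 fits the model 5 times, each time scoring on a different held-out fold.
    fold_scores = cross_val_score( example_model, example_X, example_y, cv = 5 )

    # keep the average accuracy across the 5 folds for this parameter value.
    mean_scores[ c_value ] = fold_scores.mean()
    print( "C = " + str( c_value ) + " : mean accuracy = " + str( mean_scores[ c_value ] ) )

#-- END loop over candidate C values. --#
```

One would then pick the parameter value with the best mean score, re-fit on all of the training data, and only then score the model once against the test data.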

Now let's use the model we just fit to make predictions on our test dataset, and see what our accuracy score is:

In [106]:
# store our test "to predict" variables in "expected".
expected = y_test

# predict values from our "predictors" using the model.
predicted = model.predict(X_test)

# generate an accuracy score by comparing expected to predicted.
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )

Accuracy = 0.460175159583


We get an accuracy score of about 46%. This is not a great score; however, it is much better than random guessing, which with 18 categories would succeed about 1/18 (roughly 5.6%) of the time. The other way to guess would be to always predict the mode, which in this case is MEDICINE with a frequency of 168,031 out of 367,207 records, which would give us an accuracy score of roughly 45.8%. So logistic regression does better than both (though barely better than the mode). Let's see if the other classifiers can do any better.
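
The two baselines mentioned above can be worked out directly from the `value_counts()` output shown earlier. This is a minimal sketch; note that it uses counts from the whole cleaned data set rather than just the test set, so the mode baseline is approximate.

```python
# counts copied from the value_counts() output above.
department_counts = {
    "MEDICINE" : 168031,
    "BIOLOGY" : 48976,
    "BIOCHEMISTRY" : 26136,
    "MICROBIOLOGY/IMMUN/VIROLOGY" : 20324,
    "PSYCHIATRY" : 18536,
    "PSYCHOLOGY" : 17180,
    "OTHER HEALTH PROFESSIONS" : 17168,
    "CHEMISTRY" : 13184,
    "GENETICS" : 9124,
    "NEUROSCIENCES" : 6080,
    "VETERINARY SCIENCES" : 4988,
    "ENGINEERING (ALL TYPES)" : 4984,
    "OTHER BASIC SCIENCES" : 4152,
    "BIOSTATISTICS &OTHER MATH SCI" : 2644,
    "OTHER CLINICAL SCIENCES" : 2376,
    "SOCIAL SCIENCES" : 2336,
    "PHYSICS" : 760,
    "BIOPHYSICS" : 228
}

total_records = sum( department_counts.values() )
mode_department = max( department_counts, key = department_counts.get )

# uniform random guessing across all categories.
random_baseline = 1.0 / len( department_counts )

# always guessing the most frequent category.
mode_baseline = float( department_counts[ mode_department ] ) / total_records

print( "Total records: " + str( total_records ) )
print( "Random-guess baseline: " + str( round( random_baseline, 3 ) ) )
print( "Mode (" + mode_department + ") baseline: " + str( round( mode_baseline, 3 ) ) )
```

Any classifier worth keeping should clearly beat the mode baseline, which is the harder of the two to improve on here.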

### Exercise 3 - Train your Model

- Back to [Table of Contents](#Table-of-Contents)

Complete the function below to train different classifiers from the scikit-learn library.

The `classifier()` function that you will implement:

- Accepts X_train_IN, y_train_IN, X_test_IN, and y_test_IN variables.
- creates a model, fits the model, tests the model, and calculates an accuracy score for the model (like we did above).
- returns the accuracy score as a percent (so a number between 0 and 100, not a decimal between 0 and 1 - ... so multiply by 100).

Your goal is to come up with a classifier that gives at least 75% accuracy on the test dataset.  To do this, you can:

- choose different models from those offered as part of scikit learn.

    - To start, here are some resources to help with choosing a model:

        - The scikit learn tutorial on choosing a model - [http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
        - This video on choosing and tuning a model - [http://blog.kaggle.com/2015/05/14/scikit-learn-video-5-choosing-a-machine-learning-model/](http://blog.kaggle.com/2015/05/14/scikit-learn-video-5-choosing-a-machine-learning-model/)
    
    - In particular, here are some sets of models you could explore (just make sure the model you choose is appropriate for your data - predicting a categorical variable using a mix of numeric and categorical data):

        - other Linear Models like the `LogisticRegression` - [http://scikit-learn.org/stable/modules/linear_model.html](http://scikit-learn.org/stable/modules/linear_model.html)
        - Decision Tree models - [http://scikit-learn.org/stable/modules/tree.html](http://scikit-learn.org/stable/modules/tree.html)

- play around with different parameters for the models you try.
- experiment with different sets of X variables.

Again, in general, make sure that the model and parameters you choose are appropriate for both your X and Y variables.

In [107]:
def classifier(X_train_IN, y_train_IN, X_test_IN, y_test_IN):
    """
    Parameters
    ----------
    X_train_IN : A pandas DataFrame of features used for training the classifier
    y_train_IN : A pandas dataframe of y values used for training the classifier
    X_test_IN, y_test_IN : Use these to test the accuracy of your classifier
    
    Returns
    -------
    accuracy score : a float giving the percent (0 to 100) of accurate predictions you made
    """
   

    ### BEGIN SOLUTION
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    model.fit(X_train_IN, y_train_IN)
    expected = y_test_IN
    predicted = model.predict(X_test_IN)
    return accuracy_score(expected, predicted)*100
    ### END SOLUTION.

In [108]:
# TEST to see if your accuracy is at least 75%
#    (store the result in its own variable so the imported accuracy_score() function is not overwritten)
accuracy_percent = classifier( X_train, y_train, X_test, y_test )
print( "Accuracy Percentage: " + str( accuracy_percent ) )

# TEST - is it at least 75%?
assert accuracy_percent >= 75

Accuracy Percentage: 97.0305657829


# References

- Back to [Table of Contents](#Table-of-Contents)

Links to documentation:

* [Scikit-Learn Documentation](http://scikit-learn.org/stable/)
* [NIH Reporter Documentation](http://exporter.nih.gov/about.aspx)