# Data Exploration: Federal RePORTER

Written by: Maryah Garner

## Table of Contents
* [Load projects data](#load-data)
* [Data glimpse](#data-glimpse)
* [Columns](#columns)
* [Grouping and aggregating data](#grouping-aggregating)
* [Merge dataframes](#merge-dataframes)
* [Subsetting data](#subsetting-data)
* [Descriptive stats](#descriptive-stats)
* [Abstracts](#abstracts)
* [Checkpoint](#checkpoint)

## Import necessary libraries

In [None]:
import pandas as pd

# Projects Data

## Load projects data <a class="anchor" id="load-data"></a>

In [None]:
# Specify the path to the data folder
# Change "NAME" to your name as recorded on your computer
# path = 'C:/Users/NAME/PADM-GP_2505/Data/'
Path = '.../PADM-GP_2505/Data'

In [None]:
# Read-in a CSV file
grants_2016 = pd.read_csv(Path + '/Projects/RePORTER_PRJ_C_FY2016_new.csv', encoding='latin-1')

## Data glimpse <a class="anchor" id="data-glimpse"></a>

In [None]:
# We can see how many (rows, columns) there are in the dataframe by using .shape
grants_2016.shape

In [None]:
# See first 5 rows with head() function
grants_2016.head(5)

## Columns <a class="anchor" id="columns"></a>

In [None]:
# Check the column names
grants_2016.columns

### Columns, rows, data selection

#### Single column selection
If we want to select a specific column, we can use the following syntax:

In [None]:
# select a single column: write the dataframe variable name, followed by square brackets,
# with the column name in quotes (either single or double) inside the brackets.
grants_2016['IC_NAME'].head()

#### Multiple-column selection
To select multiple columns, wrap the column names in double brackets `[[` and `]]`.
- The interior brackets hold the list of column names, and the outside brackets are the indexing operator.

In [None]:
# here we selected the columns and assigned them to a new dataframe called "df"
df = grants_2016[['IC_NAME','ORG_NAME', 'PI_NAMEs', 'ORG_STATE','APPLICATION_TYPE', 'PROJECT_START', 'PROJECT_END','TOTAL_COST']]
df.head()
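
The difference between single and double brackets can be checked on a small example. This is a minimal sketch on a made-up table (`toy` and its values are hypothetical, not taken from the RePORTER file): one column name in single brackets returns a Series, while a list of names in double brackets returns a dataframe, even if the list has only one item.

```python
import pandas as pd

# Hypothetical toy data standing in for the grants table
toy = pd.DataFrame({
    'IC_NAME': ['NATIONAL CANCER INSTITUTE', 'NATIONAL EYE INSTITUTE'],
    'ORG_STATE': ['NY', 'CA'],
    'TOTAL_COST': [250000.0, 100000.0],
})

single = toy['IC_NAME']                    # one name in single brackets -> Series
double = toy[['IC_NAME']]                  # one-item list in double brackets -> DataFrame
several = toy[['IC_NAME', 'TOTAL_COST']]   # list of two names -> DataFrame with two columns

print(type(single).__name__)   # Series
print(type(double).__name__)   # DataFrame
print(several.shape)           # (2, 2)
```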

## Grouping and Aggregating Data <a class="anchor" id="grouping-aggregating"></a>

#### Group by and aggregation functions
We can group the dataframe by a column, apply an aggregation function to each group, and sort the result.

For example, we would like to know: how many NIH grants were awarded by each administering agency, Institute, or Center?

In [None]:
# calculate how many grants (unique application ids) were awarded by each administering agency, Institute, or Center (IC_NAME)
# step1: in the groupby() method, we pass the column we want to group by
# step2: use the nunique() method to count the number of unique values (in this case, number of unique application ids by each entity)
# step3: sort the results in descending order (set the ascending parameter to False)

df_group = grants_2016.groupby('IC_NAME')['APPLICATION_ID'].nunique().sort_values(ascending=False)
df_group.head()

#### Create dataframe

In [None]:
# Note that the aggregation function returned a Series, not a dataframe.
# So we have to convert it into a dataframe if we want to process it further
df_group = df_group.to_frame().reset_index()
df_group.head()

#### Rename columns

In [None]:
# Let's correct the column name: this column no longer holds APPLICATION_ID values but the number of funded projects
df_group.rename(columns={'APPLICATION_ID':'number of funded projects'}, inplace = True)
df_group.head(10)
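
As a possible shortcut, the convert-and-rename steps can be combined, because `Series.reset_index` accepts a `name` argument for the value column. The sketch below uses a made-up table (`toy` and its values are hypothetical) and is only one way to get the same result.

```python
import pandas as pd

# Hypothetical toy data: institute "A" has a repeated application id
toy = pd.DataFrame({
    'IC_NAME': ['A', 'A', 'B', 'A'],
    'APPLICATION_ID': [1, 2, 3, 2],
})

counts = (
    toy.groupby('IC_NAME')['APPLICATION_ID']
       .nunique()                                       # unique ids per institute
       .sort_values(ascending=False)                    # largest count first
       .reset_index(name='number of funded projects')   # Series -> dataframe, naming the count column
)
print(counts)
```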

Instead of just looking at the total number of projects funded by each entity, you might also want to know the sum of the total cost of these projects.

In [None]:
# calculate the sum of the total costs for each administering agency, Institute, or Center (IC_NAME)
# step1: in the groupby() method, we pass the column we want to group by
# step2: use the sum() method to add together the total costs of each entity's projects
# step3: sort the results in descending order (set the ascending parameter to False)

Cost = grants_2016.groupby('IC_NAME')['TOTAL_COST'].sum().sort_values(ascending=False)

# step4: convert into a dataframe and reset index

Cost = Cost.to_frame().reset_index()
Cost.head()

Useful aggregation functions include `sum()` (sum), `mean()` (average), and `agg()`, which takes a Python dictionary that specifies the aggregation function to use for each column.
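
To illustrate the dictionary form of `agg()`, here is a minimal sketch on a made-up table (`toy` and its values are hypothetical). Each key is a column name and each value is the aggregation to apply to that column, so the project count and the total cost can be computed in one step.

```python
import pandas as pd

# Hypothetical toy data standing in for the grants table
toy = pd.DataFrame({
    'IC_NAME': ['A', 'A', 'B'],
    'APPLICATION_ID': [1, 2, 3],
    'TOTAL_COST': [100.0, 200.0, 50.0],
})

# Count unique application ids and sum the costs for each institute
summary = toy.groupby('IC_NAME').agg({
    'APPLICATION_ID': 'nunique',
    'TOTAL_COST': 'sum',
})
print(summary)
```

The result is already a dataframe indexed by `IC_NAME`, so calling `reset_index()` on it turns the institute names back into a regular column.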

## Merge Dataframes <a class="anchor" id="merge-dataframes"></a>
Pandas provides an ability to merge (join) two datasets together. You can store the results in a new dataframe. There are different ways of merging data: left, right, outer, inner (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).

Join the two dataframes we just created using the common identifier (`IC_NAME`).


In [None]:
# Merge the first dataframe "df_group" (the count of projects per funding entity) with the Cost dataframe
# how='inner' means keep only the IC_NAME values that appear in both dataframes

merge_df = pd.merge(df_group, Cost, on='IC_NAME', how = 'inner')
merge_df.head(5)
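
The effect of the `how` argument is easiest to see when the two tables do not share all of their keys. The sketch below uses two made-up tables (`left`, `right`, and their values are hypothetical): institute "B" is in both, "A" is only in the left table, and "C" is only in the right one.

```python
import pandas as pd

# Hypothetical toy tables with partially overlapping keys
left = pd.DataFrame({'IC_NAME': ['A', 'B'], 'number of funded projects': [10, 20]})
right = pd.DataFrame({'IC_NAME': ['B', 'C'], 'TOTAL_COST': [500.0, 700.0]})

inner = pd.merge(left, right, on='IC_NAME', how='inner')   # keys found in both tables
left_join = pd.merge(left, right, on='IC_NAME', how='left')  # every key from the left table
outer = pd.merge(left, right, on='IC_NAME', how='outer')   # every key from either table

print(len(inner), len(left_join), len(outer))   # 1 2 3
```

Rows without a match on the other side are kept by the left and outer merges, with the missing values filled in as `NaN`.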


## Subsetting Data <a class="anchor" id="subsetting-data"></a>
#### Subsetting on a single value
For your project you might only want to look at projects funded by a specific Institute. In these notebooks we will explore projects funded by the NATIONAL CANCER INSTITUTE.

In [None]:
merge_df2 = pd.merge(df_group, Cost, on='IC_NAME', how='inner')
merge_df2.head()

In [None]:
# conditional subsetting: put the conditional statement within the square brackets 
# the conditional statement here is that we want the IC_NAME to be NATIONAL CANCER INSTITUTE. 

df_NCI = grants_2016[grants_2016['IC_NAME'] == 'NATIONAL CANCER INSTITUTE']
df_NCI.head()

#### Subsetting string/categorical data
If you are interested in studying all NIH-funded projects focused on cancer research, looking only at projects funded by the National Cancer Institute might be too narrow a search.
We will use the `isin` method to select both projects funded by the NATIONAL CANCER INSTITUTE and the NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES.

In [None]:
# select specific institutes
# we specify the target list within the parentheses of the `isin` method

df = grants_2016[grants_2016['IC_NAME'].isin(['NATIONAL CANCER INSTITUTE', 'NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES'])]
df.head()
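
As a rough check of what `isin` does, the sketch below compares it with the longer form written out with `==` and `|` (or), and shows `~` for selecting everything that is not in the list. The table `toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with three institutes
toy = pd.DataFrame({'IC_NAME': [
    'NATIONAL CANCER INSTITUTE',
    'NATIONAL EYE INSTITUTE',
    'NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES',
]})

targets = ['NATIONAL CANCER INSTITUTE', 'NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES']

with_isin = toy[toy['IC_NAME'].isin(targets)]
# The same selection written out one comparison at a time
with_or = toy[(toy['IC_NAME'] == targets[0]) | (toy['IC_NAME'] == targets[1])]
# ~ negates the condition: institutes that are NOT in the target list
others = toy[~toy['IC_NAME'].isin(targets)]

print(with_isin.equals(with_or))   # True
print(len(others))                 # 1
```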

#### Subsetting with multiple conditions
If we want to subset the data with more than one condition, we can specify all the conditions and combine them with the `&` operator. Remember to put each condition within its own pair of parentheses.

In [None]:
# use the notnull function to select observations that have a total cost recorded (i.e., the total cost is not null)
# also select specific institutes
# both conditions are built from grants_2016 so that they line up row by row

df2 = grants_2016[(grants_2016['TOTAL_COST'].notnull()) &
                  (grants_2016['IC_NAME'].isin(['NATIONAL CANCER INSTITUTE', 'NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES']))]
df2.head()
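
The parentheses matter because `&` binds more tightly than comparison operators such as `==`, so an unparenthesized condition would be grouped incorrectly. One way to sidestep this, sketched below on a made-up table (`toy` and its values are hypothetical), is to store each condition in its own variable and then combine the variables with `&` (and) or `|` (or).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row has no total cost recorded
toy = pd.DataFrame({
    'IC_NAME': ['NATIONAL CANCER INSTITUTE', 'NATIONAL EYE INSTITUTE', 'NATIONAL CANCER INSTITUTE'],
    'TOTAL_COST': [250000.0, 100000.0, np.nan],
})

has_cost = toy['TOTAL_COST'].notnull()                 # True, True, False
is_nci = toy['IC_NAME'] == 'NATIONAL CANCER INSTITUTE'  # True, False, True

both = toy[has_cost & is_nci]     # rows where both conditions hold
either = toy[has_cost | is_nci]   # rows where at least one condition holds

print(len(both), len(either))   # 1 3
```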

## Descriptive stats <a class="anchor" id="descriptive-stats"></a>
Pandas has integrated some very useful tools to help us understand the distribution of the data. The `describe` method computes the most commonly used descriptive statistics, such as count, mean, standard deviation and quantiles for a dataframe.
We will select meaningful numeric columns and look at their distributions.
Using the df_NCI dataframe we created earlier, we will look at the summary statistics for projects funded by the National Cancer Institute.

In [None]:
# see the descriptive statistics of selected numeric variables (using the df_NCI dataframe)
df_NCI[['DIRECT_COST_AMT', 'INDIRECT_COST_AMT','TOTAL_COST', 'TOTAL_COST_SUB_PROJECT']].describe()


### Scientific notation
Turn off scientific notation

In [None]:
# Display floats with two decimal places instead of scientific notation
pd.set_option('display.float_format', '{:.2f}'.format)
# see the descriptive statistics of selected numeric variables (using the df_NCI dataframe)
df_NCI[['DIRECT_COST_AMT', 'INDIRECT_COST_AMT','TOTAL_COST', 'TOTAL_COST_SUB_PROJECT']].describe()
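
Note that `pd.set_option` changes the display for the rest of the session. If the format is only wanted for one output, `pd.option_context` applies it temporarily. The sketch below is a variation under assumptions: the Series `s` is made up, and the `'{:,.2f}'` format (two decimals plus a thousands separator) is an illustrative choice rather than the format used above.

```python
import pandas as pd

# Hypothetical values to display
s = pd.Series([1234567.891, 0.5])

# The float format applies only inside the "with" block
with pd.option_context('display.float_format', '{:,.2f}'.format):
    formatted = s.to_string()

print(formatted)
```

Calling `pd.reset_option('display.float_format')` restores the default display after a `set_option` call.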


# Abstracts <a class="anchor" id="abstracts"></a>
We are just going to read in and view the grant abstracts for now.

In [None]:
# Read-in a CSV file
abstracts_2016 = pd.read_csv(Path + '/Abstracts/RePORTER_PRJABS_C_FY2016_new.csv', encoding='latin-1')


# look at the first 2 rows with head() function
abstracts_2016.head(2)

In [None]:
# Show full text in a cell (None removes the column width limit; -1 is no longer accepted by recent pandas versions)
pd.set_option('display.max_colwidth', None)
# look at the first 2 rows again
abstracts_2016.head(2)

# Checkpoint (Assignment 2, due February 24th) <a class="anchor" id="checkpoint"></a>
### 1. Read in projects data for a year of your choice (other than 2016)
### 2. Subset the data for a specific administering agency, Institute, or Center (IC)
### 3. What are the top 5 organizations (by number of projects) that have received funding from this entity?
### 4. How many projects were funded by each of the top 5 organizations, and what is the total cost?
### 5. Who are the top five PIs from the organizations with the most projects, and how many projects were they the PI for?
Note: you are not expected to clean the `PI_NAMEs` variable yet (the next notebook will walk you through that process).