<h3>Projects funded in the USA (NSF)</h3>

<p>The National Science Foundation (NSF) funds research and education in science and engineering, through grants, contracts, and cooperative agreements. The Foundation accounts for about 20 percent of federal support to academic institutions for basic research. 

Information about research projects that NSF has funded since 1989 can be found by searching the Award Abstracts database (https://www.nsf.gov/awardsearch/download.jsp). The information includes abstracts that describe the research, and names of principal investigators and their institutions. The database includes both completed and in-process research.</p>

<p>We use all the awards provided from 1994 to 2017 (259106 awards). The dataset consists of one .zip archive per year, each containing one .xml file per award.</p>

In [1]:
import json
import os
import zipfile

import cjson
import pandas as pd
import xmltodict

filenames = os.listdir('USA_data')

<p>First, we open each archive and convert every XML file in it into a dictionary entry with:
<ul>
<li>key: the file name/award id (ex. 9601223)</li>
<li>value: the actual content of the xml file in JSON format</li>
</ul></p>

In [2]:
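raw_text = {}

for zipfilename in filenames:
    with zipfile.ZipFile('USA_data/' + zipfilename) as z:
        for filename in z.namelist():
            # Directory entries inside a zip archive end with '/'
            if not filename.endswith('/'):
                try:
                    with z.open(filename) as f:
                        # Key: name without '.xml'; value: content as a JSON string
                        raw_text[filename[:-4]] = json.dumps(xmltodict.parse(f))
                except Exception:
                    # Report files that could not be parsed and carry on
                    print(filename)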
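
The member-handling logic above can be tried without downloading the NSF archives. The following is a minimal sketch, written for Python 3 and using only the standard library. It builds a small in-memory zip (the folder name `2017/` and the XML content are made up for illustration), skips directory entries with `ZipInfo.is_dir()`, and reads the XML with `xml.etree.ElementTree` instead of `xmltodict`. That substitution is made here only so the example has no third-party dependency; it is not the notebook's own method.

```python
import io
import os
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical award file, shaped like the fields used in this notebook.
SAMPLE_XML = (
    "<rootTag><Award>"
    "<AwardTitle>Example award</AwardTitle>"
    "<AwardID>9601223</AwardID>"
    "</Award></rootTag>"
)


def build_sample_zip():
    """Return an in-memory zip holding one directory entry and one XML file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as z:
        z.writestr("2017/", "")  # directory entry
        z.writestr("2017/9601223.xml", SAMPLE_XML)
    buffer.seek(0)
    return buffer


def read_awards(fileobj):
    """Map award id (file name without folder or extension) to its title."""
    titles = {}
    with zipfile.ZipFile(fileobj) as z:
        for info in z.infolist():
            if info.is_dir():  # skip folders stored inside the archive
                continue
            award_id = os.path.splitext(os.path.basename(info.filename))[0]
            with z.open(info) as f:
                root = ET.parse(f).getroot()
            titles[award_id] = root.findtext("Award/AwardTitle")
    return titles


titles = read_awards(build_sample_zip())
print(titles)
```

Taking the base name before stripping the extension keeps the key equal to the award id even when the archive stores its files inside a folder.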

<p>For each award (each key of our dictionary), we create a new dictionary holding the values that we will use later in the analysis:
<ul>
<li>id: the award id</li>
<li>title: the award title</li>
<li>objective: the abstract narration of the award</li>
<li>state: the code of the state</li>
<li>subjects: information provided from 'Program Reference' and 'Program Element' values</li>
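<li>foa: the name(s) listed under FoaInformation</li>
<li>framework_programme: the EU Framework Programme period (FP4 to H2020) that matches the award year, taken from the first two digits of the award id</li>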
</ul>
<br>
Then, we append each dictionary to a list.
</p>
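
Fields such as 'ProgramReference' and 'FoaInformation' can occur once or several times in an award file. `xmltodict` represents an element that occurs once as a dict and a repeated element as a list of dicts, which is why the extraction code has to handle both shapes. One possible way to treat them uniformly is sketched below. The helper names `as_list` and `join_field` are hypothetical, and the sample records are hand-written for illustration rather than taken from the dataset.

```python
def as_list(node):
    """Normalise a parsed XML field to a list of dicts."""
    if node is None:
        return []
    if isinstance(node, dict):
        return [node]
    return list(node)


def join_field(award, tag, key):
    """Join the `key` text of every `tag` entry, skipping empty values."""
    values = [item.get(key) for item in as_list(award.get(tag))]
    return ' '.join(v for v in values if v)


# Hand-written records shaped like xmltodict output (illustrative only).
single = {'ProgramReference': {'Text': 'FIRST TOPIC'}}
several = {'ProgramReference': [{'Text': 'FIRST TOPIC'}, {'Text': 'SECOND TOPIC'}]}
missing = {}

print(join_field(single, 'ProgramReference', 'Text'))
print(join_field(several, 'ProgramReference', 'Text'))
print(repr(join_field(missing, 'ProgramReference', 'Text')))
```

With this approach a missing field yields an empty string instead of raising an exception, so no `try`/`except` is needed around the lookup.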

In [3]:
usa_list = []

for key, val in raw_text.items():
    usa_dict = {}

    value = cjson.decode(val)
    new_val = value['rootTag']['Award']

    usa_dict['id'] = new_val['AwardID']
    usa_dict['title'] = new_val['AwardTitle']
    usa_dict['objective'] = new_val['AbstractNarration']
    usa_dict['state'] = new_val['Institution']['StateCode']
    
    if key[:2] in ['94', '95', '96', '97']:
        usa_dict['framework_programme'] = 'FP4'
    elif key[:2] in ['98', '99', '00', '01']:
        usa_dict['framework_programme'] = 'FP5'
    elif key[:2] in ['02', '03', '04', '05', '06']:
        usa_dict['framework_programme'] = 'FP6'
    elif key[:2] in ['07', '08', '09', '10', '11', '12', '13']:
        usa_dict['framework_programme'] = 'FP7'
    else:
        usa_dict['framework_programme'] = 'H2020'
    
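    # One program reference: program element text followed by the reference text.
    # Several program references: their texts joined with spaces.
    try:
        if isinstance(new_val['ProgramReference'], dict):
            usa_dict['subjects'] = new_val['ProgramElement']['Text'] + new_val['ProgramReference']['Text']
        elif isinstance(new_val['ProgramReference'], list):
            text = ''
            for i in new_val['ProgramReference']:
                text = text + ' ' + i['Text']

            usa_dict['subjects'] = text
    except (KeyError, TypeError):
        # Leave 'subjects' unset when a field is missing or empty
        pass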


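    # FoaInformation may hold a single entry (dict) or several (list)
    try:
        if isinstance(new_val['FoaInformation'], dict):
            usa_dict['foa'] = new_val['FoaInformation']['Name']
        elif isinstance(new_val['FoaInformation'], list):
            text = ''
            for i in new_val['FoaInformation']:
                text = text + ' ' + i['Name']

            usa_dict['foa'] = text
    except (KeyError, TypeError):
        # Leave 'foa' unset when a field is missing or empty
        pass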

    usa_list.append(usa_dict)
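
The chain of `if`/`elif` branches that assigns `framework_programme` could also be written as a lookup over year ranges. The sketch below is one possible formulation, not the notebook's own code. The function name `framework_programme` and the `PERIODS` table are hypothetical, and it assumes, as the cell above does, that the first two characters of the award id are the two-digit award year. Unlike the original branches, it raises a `ValueError` if those characters are not digits.

```python
# (first year, last year, label), copied from the branches in the cell above.
PERIODS = [
    (1994, 1997, 'FP4'),
    (1998, 2001, 'FP5'),
    (2002, 2006, 'FP6'),
    (2007, 2013, 'FP7'),
]


def framework_programme(award_id):
    """Return the period label for an award id such as '9409829'."""
    prefix = int(award_id[:2])
    # Prefixes 94-99 are read as 1994-1999, everything else as 20xx.
    year = 1900 + prefix if prefix >= 94 else 2000 + prefix
    for first, last, label in PERIODS:
        if first <= year <= last:
            return label
    return 'H2020'  # same fallback as the final else branch


for award_id in ['9409829', '0520468', '1231468', '1659006']:
    print(award_id, framework_programme(award_id))
```

Keeping the ranges in one table makes the period boundaries easier to check and to change than a list of two-digit strings per branch.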

<p>Finally, we use this list to create a DataFrame:</p>

In [4]:
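# Awards without 'subjects' or 'foa' get NaN in those columns
dfUSA = pd.DataFrame(usa_list)
# Save the DataFrame to disk so later steps can reload it
dfUSA.to_pickle('dfUSA')
dfUSA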



In [5]:
dfUSA

Unnamed: 0,foa,framework_programme,id,objective,state,subjects,title
0,,FP7,1231468,Intellectual Merit: There is much to be learne...,NH,,Using Next Generation Sequencing to Quantify t...
1,,FP7,1039870,With this award from the Major Research Instru...,PA,MAJOR RESEARCH INSTRUMENTATION CHEMICAL INSTR...,MRI: Acquisition of a New 500 MHz NMR Console ...
2,,FP7,1231467,This award focuses on providing support to ena...,MI,,International Medical Geography Symposium; Eas...
3,,FP7,1231461,Objective: <br/>The objective of this program ...,PA,Wireless comm & sig processing Computat syste...,Collaborative Research: CCSS: Cyber-Enabled Sm...
4,Other Applications NEC,FP7,0835543,Kohn-Sham density functional theory (DFT) prov...,WA,UNASSIGNED COMPLEXITY CDI NON SOLICITED RESEA...,CDI-Type II Beyond Kohn-Sham Density Functiona...
5,,FP7,1018691,Seamless understanding of the meaning of visua...,NM,,Collaborative Research: RI: Small: A Scalable ...
6,,H2020,1659006,This project will establish a Research Coordin...,CT,GEOINFORMATICS SEDIMENTARY GEO & PALEOBIOLOGY,RCN: RATES: Building a Spatial and Temporal Fr...
7,Other Applications NEC,FP6,0520468,Proposal Number: DEB-0520468<br/>Proposal Titl...,PA,,EID: Parasite Induced Susceptibility and Trans...
8,,H2020,1450532,"Education researchers, practitioners, industry...",MA,,Massachusetts Engineering Innovation and Disse...
9,Other Applications NEC,FP4,9409829,When an animal moves around in its environment...,CT,,Medullary Electrosensory Processing
