#Project Proposal: SBIR award prediction

#### Reminder from first proposal --> High-level view:
The Small Business Innovation Research (SBIR, https://www.sbir.gov/) program relies on an award-based system to trigger high-tech innovation in small US companies. Many agencies (Department of Agriculture, Department of Defense, ...) contribute to this fund. The SBIR program is structured in two main phases: phase I (approx. 150,000 dollars, 6 months) and phase II (approx. 1,000,000 dollars, 2 years). To apply to phase II, a company must be a phase I awardee.

The predictive question I want to answer is: will your project make it to phase II?  
The interpretative question I would like to address is: what features make you likely to be successful in applying to phase II?

####Aim of this document --> establish the presence of a signal in the data:
This document follows the pipeline I developed to evaluate a simple predictive model (Logistic Regression) on a small dataset. On the train/test split evaluated below, this exploratory work shows that the simple predictor outperforms a baseline predictor (one that randomly predicts success or failure with the baseline probability of the training set): about 0.75 test accuracy versus 0.58. Further ideas to increase predictive power are listed at the end.

Data and code I used for this proposal can be found on GitHub, at https://github.com/AnnaVM/SBIR_Project

In [1]:
import pandas as pd
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier

# the imports below come from code in the GitHub repo (https://github.com/AnnaVM/SBIR_Project)

In [2]:
cd code

/Users/AnnaVMS/Desktop/Galvanize/SBIR-project/code


In [3]:
from prepare_data import subset_data
from model import Model

In [4]:
cd ..

/Users/AnnaVMS/Desktop/Galvanize/SBIR-project


###Loading a subset of the data

For this proposal, I worked only on a subset of my data: awards from 2012 to 2015 for one specific agency, the Department of Defense ('dod').
- A first converter turns the .xlsx files into .csv files (not shown here, available on the GitHub repo).
- A second function, `subset_data`, selects the subset (code and csv files are available on the GitHub repo).

In [5]:
df = subset_data('dod', 2012, '/Users/AnnaVMS/Desktop/test2')
#update path according to download

['2015-12-14_award_export_10001TO15000.csv', '2015-12-14_award_export_1TO5000.csv', '2015-12-14_award_export_5001TO10000.csv']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [6]:
print 'number of phase I projects considered: ',len(df.to_phase_II)

number of phase I projects considered:  1016


In [7]:
print 'number of phase I projects with successful application to phase II: ',sum(df.to_phase_II)

number of phase I projects with successful application to phase II:  305


In [8]:
print 'number of phase I projects without a phase II: ',len(df.to_phase_II)-sum(df.to_phase_II)

number of phase I projects without a phase II:  711


In [9]:
#quick overview of the dataset
print 'information in the database: \n',df.columns

information in the database: 
Index([u'index', u'Company', u'Award Title', u'Agency', u'Branch', u'Phase',
       u'Program', u'Agency Tracking #', u'Contract', u'Award Start Date',
       u'Award Close Date', u'Solicitation #', u'Solicitation Year',
       u'Topic Code', u'Award Year', u'Award Amount', u'DUNS',
       u'Hubzone Owned', u'Socially and Economically Disadvantaged',
       u'Woman Owned', u'# Employees', u'Company Website', u'Address1',
       u'Address2', u'City', u'State', u'Zip', u'Contact Name',
       u'Contact Title', u'Contact Phone', u'Contact Email', u'PI Name',
       u'PI Title', u'PI Phone', u'PI Email', u'RI Name', u'RI POC Name',
       u'RI POC Phone', u'Research Keywords', u'Abstract', u'to_phase_II'],
      dtype='object')


- The last column, 'to_phase_II', indicates whether a phase II was obtained. It is used as the label to predict in my model.
- Features can be chosen from the rest of the columns (here for instance: 'Solicitation Year', 'Award Amount', 'Hubzone Owned', 'Socially and Economically Disadvantaged', 'Woman Owned', '# Employees') or engineered (from 'Abstract', I made 'Abstract Length', a simple character count, and 'Topic of Abstract', obtained through NMF topic modeling).
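
A minimal sketch of how such features could be derived with pandas, on a toy frame. The `as_int` suffix mirrors the column names that appear in the coefficient table further down; the toy rows and the exact way `Abstract Length` is computed are illustrative assumptions, not the code from the repo.

```python
import pandas as pd

# Toy stand-in for the SBIR export; values are made up for illustration
toy = pd.DataFrame({
    'Woman Owned': ['Y', 'N', 'N'],
    'Hubzone Owned': ['N', 'N', 'Y'],
    'Abstract': ['Compact laser source.',
                 'Radar tracking algorithms for space objects.',
                 ''],
})

# Map the Y/N flags to 1/0 in new columns
for col in ['Woman Owned', 'Hubzone Owned']:
    toy[col + ' as_int'] = toy[col].map({'Y': 1, 'N': 0})

# Simple character count of the abstract
toy['Abstract Length'] = toy['Abstract'].str.len()

print(toy[['Woman Owned as_int', 'Hubzone Owned as_int', 'Abstract Length']])
```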

###Defining a first model, based on this subset of data

In [10]:
# define a training and a testing set (through indices):
# 5 shuffled folds over the 1016 projects, keeping only the first split
kf = KFold(1016, 5, shuffle=True)
train_index, test_index = next(iter(kf))
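
The `sklearn.cross_validation` module imported in this notebook was removed from later scikit-learn releases. As a sketch, assuming Python 3 and a recent scikit-learn, an equivalent first fold can be drawn with `sklearn.model_selection.KFold`. The `random_state` value is an arbitrary addition for reproducibility and is not part of the original notebook, so the indices will differ from the ones used here.

```python
import numpy as np
from sklearn.model_selection import KFold

n_projects = 1016  # number of phase I projects in the subset

# random_state is an assumption, added so the split is reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# split() only needs an array-like of the right length to generate indices
train_index, test_index = next(kf.split(np.arange(n_projects)))

print(len(train_index), len(test_index))
```

The first fold holds 204 test projects and 812 training projects, which matches the test-set size reported later in this notebook (61 + 143).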

####The Pipeline:
1- Processing the text of the Abstract of submissions
- tfidf vectorization of the text (5000 words in the dictionary, default tokenization of sklearn)
- topic modeling: NMF (5 components, no optimization done here)

2- Preparing data for model
- using dummy variables for topics
- mapping Y/N to 1/0
- getting the abstract length in characters

3- Running the Logistic Regression
- Standard scaling of the data
- Logistic Regression (lasso regularization)
--> outputs the score
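
The three steps can be sketched end to end with scikit-learn on a toy corpus. This is an illustration under stated assumptions, not the `Model` class from the repo: the abstracts and labels are made up, each abstract is assigned its highest-weighted NMF topic before dummy encoding (one possible reading of "dummy variables for topics"), and everything is fitted on the whole toy set rather than on a training fold.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy abstracts and labels, made up for illustration only
abstracts = [
    'laser diode with high optical power',
    'radar sensor for target tracking',
    'cyber security training for network analysis',
    'fuel cell power and energy storage',
    'damage modeling software for aircraft materials',
    'optical fiber laser at swir wavelength',
    'sensor data algorithms for space objects',
    'machine learning support for network security',
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])
n_topics = 5

# Step 1: tf-idf (vocabulary capped at 5000 terms), then NMF topic modeling
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(abstracts)
nmf = NMF(n_components=n_topics, random_state=0, max_iter=500)
W = nmf.fit_transform(X_tfidf)  # document-topic weights

# Step 2: dummy variables for the dominant topic, plus abstract length
dominant_topic = W.argmax(axis=1)
topic_dummies = np.eye(n_topics)[dominant_topic]
abstract_length = np.array([[len(a)] for a in abstracts], dtype=float)
X = np.hstack([topic_dummies, abstract_length])

# Step 3: standard scaling, then L1-regularized (lasso) logistic regression
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_scaled, labels)

print(clf.score(X_scaled, labels))
```

In a real run the vectorizer, NMF, and scaler would be fitted on the training indices only and then applied to the test indices, so that no information leaks from the test set.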

In [11]:
model_test = Model(df, train_index, test_index)

In [12]:
#step 1
model_test.process_text('Abstract')

In [13]:
#step 2
model_test.prepare_LogReg()

In [14]:
#step 3
model_test.perform_LogReg()

(0.73891625615763545, 0.58019414662416691, 0.74509803921568629)

- The first score is the accuracy of the Logistic Regression classifier on the training set.
- The second score is the accuracy of the baseline predictor (one that randomly predicts success or failure with the baseline probability of the training set).
- The third score is the accuracy of the Logistic Regression classifier on the test set.
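
As a check on the second score: a predictor that answers "success" with probability p_train, independently of the input, is correct on a test set with success rate p_test with probability p_train * p_test + (1 - p_train) * (1 - p_test). The sketch below (Python 3) evaluates this with the counts reported in this notebook: 305 successes among 1016 projects overall and 61 among the 204 test projects, hence 244 among the 812 training projects. Whether `perform_LogReg` computes its baseline this way is an assumption; the function name is introduced for illustration.

```python
def random_baseline_accuracy(p_train, p_test):
    """Expected accuracy of guessing 'success' with probability p_train."""
    return p_train * p_test + (1 - p_train) * (1 - p_test)

p_train = 244 / 812  # successes in the training fold (305 - 61 out of 1016 - 204)
p_test = 61 / 204    # successes in the test fold

baseline = random_baseline_accuracy(p_train, p_test)

# For comparison: always predicting "no phase II" on this test fold
majority = 1 - p_test

print(baseline, majority)
```

The expression gives about 0.580, in line with the second score above. Always predicting failure would score about 0.701 on this fold, a stricter reference point that the Logistic Regression test accuracy (0.745) still exceeds.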

In [15]:
labels = model_test.LogReg_ytest
predicted_labels = model_test.model_LogReg.predict(model_test.LogReg_Xtest)

In [16]:
def tp_fp(labels, predicted_labels):
    labels = np.array(labels)
    predicted_labels = np.array(predicted_labels)
    tp = sum((labels == predicted_labels) & (labels == 1))
    tn = sum((labels == predicted_labels) & (labels == 0))
    fp = sum((labels != predicted_labels) & (predicted_labels == 1))
    fn = sum((labels != predicted_labels) & (predicted_labels == 0))
    return tp, tn, fp, fn

In [17]:
tp, tn, fp, fn = tp_fp(labels, predicted_labels)

In [18]:
print sum(model_test.LogReg_ytest)
print len(model_test.LogReg_ytest)-sum(model_test.LogReg_ytest)

61
143


In [19]:
print tp, tn, fp, fn

22 130 13 39


In [20]:
accuracy = (tp+tn)*1./(tp+tn+fp+fn)
recall = tp*1./(tp + fn)
precision = tp*1./(tp + fp)
print accuracy, recall, precision

0.745098039216 0.360655737705 0.628571428571
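
The same three metrics can be wrapped in a small helper, shown here as a sketch (Python 3) on the counts printed above (tp=22, tn=130, fp=13, fn=39). The name `summary_metrics` is introduced for illustration only.

```python
def summary_metrics(tp, tn, fp, fn):
    """Return (accuracy, recall, precision) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no actual or no predicted positives)
    recall = tp / (tp + fn) if (tp + fn) else float('nan')
    precision = tp / (tp + fp) if (tp + fp) else float('nan')
    return accuracy, recall, precision

accuracy, recall, precision = summary_metrics(tp=22, tn=130, fp=13, fn=39)
print(accuracy, recall, precision)
```

With these counts the classifier finds 22 of the 61 actual phase II projects (recall about 0.36), and 22 of its 35 positive predictions are correct (precision about 0.63).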


In [21]:
rf = RandomForestClassifier()

In [22]:
rf.fit(model_test.LogReg_Xtrain, model_test.LogReg_ytrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [23]:
rf.score(model_test.LogReg_Xtest, model_test.LogReg_ytest)

0.7009803921568627

In [24]:
predicted_rf_labels = rf.predict(model_test.LogReg_Xtest)

In [25]:
tp, tn, fp, fn = tp_fp(labels, predicted_rf_labels)

In [26]:
accuracy = (tp+tn)*1./(tp+tn+fp+fn)
recall = tp*1./(tp + fn)
precision = tp*1./(tp + fp)
print accuracy, recall, precision

0.700980392157 0.377049180328 0.5


####Some information on the models (coefficients from LogReg and topics from NMF)

In [27]:
# the coefficients of the Logistic Regression (lasso regularized)
pd.DataFrame( {'criterion:': model_test.list_columns[:-1], 
               'coeffs': model_test.model_LogReg.coef_[0]} )

| | coeffs | criterion: |
|---|---|---|
| 0 | 0.0 | topic 4 |
| 1 | -0.010458 | topic 3 |
| 2 | -0.183616 | topic 2 |
| 3 | 0.173449 | topic 1 |
| 4 | 0.108927 | topic 0 |
| 5 | 0.00967 | Solicitation Year |
| 6 | 0.695898 | Award Amount |
| 7 | 0.076544 | Hubzone Owned as_int |
| 8 | -0.098911 | Socially and Economically Disadvantaged as_int |
| 9 | 0.106902 | Woman Owned as_int |


In [28]:
#topics defined with NMF:
for i in xrange(5):
    print model_test.get_top_n_words_for_topic(i)

[u'sensor' u'data' u'sensors' u'space' u'tracking' u'target' u'objects'
 u'algorithms' u'radar' u'detection']
[u'power' u'high' u'energy' u'technology' u'fuel' u'design' u'low'
 u'applications' u'antenna' u'performance']
[u'training' u'data' u'information' u'network' u'analysis' u'security'
 u'learning' u'cyber' u'support' u'based']
[u'laser' u'optical' u'high' u'lasers' u'wavelength' u'diode' u'spectral'
 u'swir' u'silicon' u'fiber']
[u'model' u'models' u'aircraft' u'materials' u'damage' u'modeling' u'phase'
 u'process' u'plume' u'software']
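
For reference, the top words of a topic are usually read off the NMF `components_` matrix by sorting one row by weight. A sketch with a hand-made matrix follows; `top_n_words` is a hypothetical stand-in, and the actual `get_top_n_words_for_topic` in the repo may be implemented differently.

```python
import numpy as np

def top_n_words(components, vocabulary, topic, n=3):
    """Return the n highest-weighted words for one topic (one row of components)."""
    order = np.argsort(components[topic])[::-1]  # word indices by descending weight
    return [vocabulary[i] for i in order[:n]]

# Hand-made 2-topic x 5-word matrix, for illustration only
vocabulary = ['laser', 'sensor', 'radar', 'optical', 'fiber']
components = np.array([[0.9, 0.0, 0.1, 0.7, 0.5],
                       [0.0, 0.8, 0.6, 0.1, 0.0]])

print(top_n_words(components, vocabulary, topic=0))
print(top_n_words(components, vocabulary, topic=1))
```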


###Further Work

Easy but time-consuming developments:
- adding other features (State...)
- vectorizing and topic modeling for other text inputs (Title, Keywords, Contact Title...)
- adding further company information
- tuning regularization on Logistic Regression
- running other models (getting feature importance with Random Forest for interpretation, trying to increase predictive power with SVM)

Harder routes:
- developing an analysis of the tone of the text (how technical, assured, detail-oriented, or goal-oriented it is), along the lines of sentiment analysis

###Bibliography
A paper featured on Kaggle, used as a source of inspiration:
- http://cs.stanford.edu/~althoff/raop-dataset/altruistic_requests_icwsm.pdf