## Tobias Reaper

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---

# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model, and which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" information from the future?

---

In [1]:
# Three musketeers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Extras
import pandas_profiling
import janitor
import os

# Plotly imports
import plotly.express as px
import plotly.figure_factory as ff

In [2]:
# Jupyter + Plotly imports (if running in Colab or Visual Studio Code, comment out the 3 lines below)
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode()  # Set plotly to notebook mode / work offline

In [3]:
# the basix
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [4]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [5]:
# Set pandas display options to allow for more columns and rows
pd.options.display.max_columns = 200
pd.options.display.max_rows = 200

---

## 🔬👀 Explorations 🔭👀

In [6]:
# define directory path to the data
datapath = "../data/NSF_Awards"
filename = "2018-nsf_awards.csv"  # testing it out first with only one year - 2018
fullpath = os.path.join(datapath, filename)  # full filepath

print(fullpath)

../data/NSF_Awards/2018-nsf_awards.csv


In [12]:
# Read in the csv and do initial cleanup with pyjanitor
df_0 = (pd
        .read_csv(fullpath)  # read the csv
        .clean_names()  # remove whitespace, punctuation, capitalization
#         .remove_empty()  # removes rows/cols that are completely empty
       )

In [13]:
print(df_0.shape)
df_0.head()

(12603, 45)


Unnamed: 0,awardee,doing_business_as_name,pd_pi_name,pd_pi_phone,pd_pi_email,co_pd_s_co_pi_s_,award_date,estimated_total_award_amount,funds_obligated_to_date,start_date,end_date,transaction_type,agency,awarding_agency_code,funding_agency_code,cfda_number,primary_program_source,award_title_or_description,federal_award_id_number,duns_id,parent_duns_id,program,program_officer_name,program_officer_phone,program_officer_email,awardee_street,awardee_city,awardee_state,awardee_zip,awardee_county,awardee_country,awardee_cong_district,primary_organization_name,primary_street,primary_city,primary_state,primary_zip,primary_county,primary_country,primary_cong_district,abstract_at_time_of_award,publications_produced_as_a_result_of_this_research,publications_produced_as_conference_proceedings,projectoutcomesreport,unnamed_44
0,"UNIVERSITY OF CALIFORNIA, LOS ANGELES",University of California-Los Angeles,Seth G Claudepierre,(303) 641-2461,seth.claudepierre@gmail.com,,"=""09/21/2018""","=""$64,446""","=""$64,446""","=""08/01/2018""","=""07/31/2019""",Grant,NSF,"=""4900""","=""4900""","=""47.050""",040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: GEM--The Effect of M...,"=""1854440""","=""092530369""","=""071549000""",MAGNETOSPHERIC PHYSICS,Carrie E. Black,(703) 292-8519,cblack@nsf.gov,10889 Wilshire Boulevard,LOS ANGELES,CA,90095-1406,Los Angeles,US,"=""33""",University of California-Los Angeles,,,CA,90095-1406,Los Angeles,US,"=""33""",This project aims to address the electromagnet...,,,,
1,UNIVERSITY OF WASHINGTON,University of Washington,Mo Li,(612) 638-8958,moli96@uw.edu,,"=""09/21/2018""","=""$149,665""","=""$149,665""","=""09/01/2018""","=""08/31/2020""",Grant,NSF,"=""4900""","=""4900""","=""47.041""",040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: Conformal and robust ...,"=""1854974""","=""605799469""","=""042803536""","COMMS, CIRCUITS & SENS SYS",Shubhra Gangopadhyay,(703) 292-2485,sgangopa@nsf.gov,4333 Brooklyn Ave NE,Seattle,WA,98195-0001,Seattle,US,"=""07""",University of Washington,,,WA,98195-0001,Seattle,US,"=""07""",The program aims to pioneer a flexible optical...,,,,
2,CLEMSON UNIVERSITY,Clemson University,Yongjia Song,(864) 656-9832,yongjis@clemson.edu,,"=""09/21/2018""","=""$84,855""","=""$84,855""","=""09/01/2018""","=""05/31/2019""",Grant,NSF,"=""4900""","=""4900""","=""47.041""",040100 NSF RESEARCH & RELATED ACTIVIT,An Adaptive Partition-based Approach for Solvi...,"=""1854960""","=""042629816""","=""042629816""",OE Operations Engineering,Georgia-Ann Klutke,(703) 292-8360,gaklutke@nsf.gov,230 Kappa Street,CLEMSON,SC,29634-5701,,US,"=""03""",Clemson University,230 Kappa Street,Clemson,SC,29634-0001,Clemson,US,"=""03""",Stochastic programs are popular models for pro...,,,,
3,TEXAS A&M ENGINEERING EXPERIMENT STATION,Texas A&M Engineering Experiment Station,Ya Wang,(631) 632-8322,ya.s.wang@stonybrook.edu,,"=""09/21/2018""","=""$500,000""","=""$500,000""","=""09/01/2018""","=""07/31/2023""",Grant,NSF,"=""4900""","=""4900""","=""47.041""",040100 NSF RESEARCH & RELATED ACTIVIT,CAREER: Understanding Dynamics of Ultra-small ...,"=""1851635""","=""847205572""","=""847205572""","Dynamics, Control and System D",Irina Dolinskaya,(703) 292-7078,idolinsk@nsf.gov,TEES State Headquarters Bldg.,College Station,TX,77845-4645,College Station,US,"=""17""",Texas A&M University,,College Station,TX,77843-3578,College Station,US,"=""17""",This Faculty Early Career Development Program ...,,,,
4,UNIVERSITY OF PITTSBURGH THE,University of Pittsburgh,Masoud Barati,(225) 578-1054,mbarati@lsu.edu,,"=""09/21/2018""","=""$199,893""","=""$199,893""","=""08/28/2018""","=""08/31/2020""",Grant,NSF,"=""4900""","=""4900""","=""47.041""",040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: A Global Algorithm for...,"=""1851602""","=""004514360""","=""004514360""","ENERGY,POWER,ADAPTIVE SYS",Anil Pahwa,(703) 292-2285,apahwa@nsf.gov,University Club,Pittsburgh,PA,15213-2303,Pittsburgh,US,"=""14""",University of Pittsburgh,,Pittsburgh,PA,15213-2303,Pittsburgh,US,"=""14""",Non-convex programming involves optimization p...,,,,


In [37]:
def wrangle(data):
    """Another one o' dem virtual lassoooos."""
    
    # Prevent the slice/copy warning
    data = data.copy()

    # Columns with the same mistake: ="value"
    
    cols_to_date = [  # to be converted to date
        "award_date",
        "start_date",
        "end_date",
    ]

    cols_to_int = [  # numeric-looking columns: punctuation stripped here, still strings afterwards
        "estimated_total_award_amount",
        "funds_obligated_to_date",
        "awarding_agency_code",
        "funding_agency_code",
        "cfda_number",
        "federal_award_id_number",
        "duns_id",
        "parent_duns_id",
        "awardee_cong_district",
        "primary_cong_district",
    ]

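    for col in cols_to_date + cols_to_int:
        # Strip the ="..." wrapper and any dollar sign from both ends
        data[col] = data[col].str.strip(to_strip='="$')
        # regex=False treats "." and "," as literal characters on every pandas version
        # (note: this also turns a cfda_number such as 47.050 into 47050)
        data[col] = data[col].str.replace(".", "", regex=False)
        data[col] = data[col].str.replace(",", "", regex=False)

    # Convert date columns to datetime
    # (infer_datetime_format is deprecated since pandas 2.0, where inference is the default)
    for col in cols_to_date:
        data[col] = pd.to_datetime(data[col])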
        
    return data
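
To see what the string cleanup in `wrangle` does to a single value, here is a minimal sketch on a toy Series. The three values are copied from the raw `head()` output above; the Series itself and the variable names are only for illustration.

```python
import pandas as pd

# Toy values mimicking the ="value" export pattern seen in the raw CSV
raw = pd.Series(['="$64,446"', '="09/21/2018"', '="47.050"'])

# Remove any of the characters = " $ from both ends of each string
stripped = raw.str.strip('="$')

# Remove periods and commas as literal characters (not regex patterns)
cleaned = (
    stripped
    .str.replace(".", "", regex=False)
    .str.replace(",", "", regex=False)
)

print(cleaned.tolist())  # ['64446', '09/21/2018', '47050']
```

The date passes through untouched because it contains neither character, while the CFDA-style value loses its decimal point, which may or may not be what you want for that column.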

In [38]:
# Apply above wranglemations to the dataframe
df_1 = wrangle(df_0)

In [47]:
df_1.head()

Unnamed: 0,awardee,doing_business_as_name,pd_pi_name,pd_pi_phone,pd_pi_email,co_pd_s_co_pi_s_,award_date,estimated_total_award_amount,funds_obligated_to_date,start_date,end_date,transaction_type,agency,awarding_agency_code,funding_agency_code,cfda_number,primary_program_source,award_title_or_description,federal_award_id_number,duns_id,parent_duns_id,program,program_officer_name,program_officer_phone,program_officer_email,awardee_street,awardee_city,awardee_state,awardee_zip,awardee_county,awardee_country,awardee_cong_district,primary_organization_name,primary_street,primary_city,primary_state,primary_zip,primary_county,primary_country,primary_cong_district,abstract_at_time_of_award,publications_produced_as_a_result_of_this_research,publications_produced_as_conference_proceedings,projectoutcomesreport,unnamed_44
0,"UNIVERSITY OF CALIFORNIA, LOS ANGELES",University of California-Los Angeles,Seth G Claudepierre,(303) 641-2461,seth.claudepierre@gmail.com,,2018-09-21,64446,64446,2018-08-01,2019-07-31,Grant,NSF,4900,4900,47050,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: GEM--The Effect of M...,1854440,92530369,71549000,MAGNETOSPHERIC PHYSICS,Carrie E. Black,(703) 292-8519,cblack@nsf.gov,10889 Wilshire Boulevard,LOS ANGELES,CA,90095-1406,Los Angeles,US,33,University of California-Los Angeles,,,CA,90095-1406,Los Angeles,US,33,This project aims to address the electromagnet...,,,,
1,UNIVERSITY OF WASHINGTON,University of Washington,Mo Li,(612) 638-8958,moli96@uw.edu,,2018-09-21,149665,149665,2018-09-01,2020-08-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: Conformal and robust ...,1854974,605799469,42803536,"COMMS, CIRCUITS & SENS SYS",Shubhra Gangopadhyay,(703) 292-2485,sgangopa@nsf.gov,4333 Brooklyn Ave NE,Seattle,WA,98195-0001,Seattle,US,7,University of Washington,,,WA,98195-0001,Seattle,US,7,The program aims to pioneer a flexible optical...,,,,
2,CLEMSON UNIVERSITY,Clemson University,Yongjia Song,(864) 656-9832,yongjis@clemson.edu,,2018-09-21,84855,84855,2018-09-01,2019-05-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,An Adaptive Partition-based Approach for Solvi...,1854960,42629816,42629816,OE Operations Engineering,Georgia-Ann Klutke,(703) 292-8360,gaklutke@nsf.gov,230 Kappa Street,CLEMSON,SC,29634-5701,,US,3,Clemson University,230 Kappa Street,Clemson,SC,29634-0001,Clemson,US,3,Stochastic programs are popular models for pro...,,,,
3,TEXAS A&M ENGINEERING EXPERIMENT STATION,Texas A&M Engineering Experiment Station,Ya Wang,(631) 632-8322,ya.s.wang@stonybrook.edu,,2018-09-21,500000,500000,2018-09-01,2023-07-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,CAREER: Understanding Dynamics of Ultra-small ...,1851635,847205572,847205572,"Dynamics, Control and System D",Irina Dolinskaya,(703) 292-7078,idolinsk@nsf.gov,TEES State Headquarters Bldg.,College Station,TX,77845-4645,College Station,US,17,Texas A&M University,,College Station,TX,77843-3578,College Station,US,17,This Faculty Early Career Development Program ...,,,,
4,UNIVERSITY OF PITTSBURGH THE,University of Pittsburgh,Masoud Barati,(225) 578-1054,mbarati@lsu.edu,,2018-09-21,199893,199893,2018-08-28,2020-08-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: A Global Algorithm for...,1851602,4514360,4514360,"ENERGY,POWER,ADAPTIVE SYS",Anil Pahwa,(703) 292-2285,apahwa@nsf.gov,University Club,Pittsburgh,PA,15213-2303,Pittsburgh,US,14,University of Pittsburgh,,Pittsburgh,PA,15213-2303,Pittsburgh,US,14,Non-convex programming involves optimization p...,,,,


In [None]:
df_1.dtypes

In [48]:
def wrangle_ints(data):
    """Another one o' dem virtual lassoooos."""
    
    # Prevent the slice/copy warning
    data = data.copy()

    cols_to_int = [  # to be converted to integer
        "estimated_total_award_amount",
        "funds_obligated_to_date",
    ]

    # Convert columns to integer
    for col in cols_to_int:
        data[col] = data[col].astype(int)
        
    return data
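
`astype(int)` raises as soon as a column contains a missing or non-numeric value. That does not seem to happen for the two columns above, but it could for other years of data. A hedged alternative, sketched on made-up values, is `pd.to_numeric` with `errors="coerce"` followed by the nullable `Int64` dtype:

```python
import pandas as pd

# Hypothetical column with one missing and one unparseable entry
amounts = pd.Series(["64446", "149665", None, "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(amounts, errors="coerce")

# The nullable "Int64" dtype stores integers and keeps missing values as <NA>
as_int = numeric.astype("Int64")

print(as_int.dtype)         # Int64
print(as_int.isna().sum())  # 2
```

Coercion silently hides bad values, so it is worth counting the resulting missing entries before moving on.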

In [49]:
df_2 = wrangle_ints(df_1)

In [50]:
df_2.head()

Unnamed: 0,awardee,doing_business_as_name,pd_pi_name,pd_pi_phone,pd_pi_email,co_pd_s_co_pi_s_,award_date,estimated_total_award_amount,funds_obligated_to_date,start_date,end_date,transaction_type,agency,awarding_agency_code,funding_agency_code,cfda_number,primary_program_source,award_title_or_description,federal_award_id_number,duns_id,parent_duns_id,program,program_officer_name,program_officer_phone,program_officer_email,awardee_street,awardee_city,awardee_state,awardee_zip,awardee_county,awardee_country,awardee_cong_district,primary_organization_name,primary_street,primary_city,primary_state,primary_zip,primary_county,primary_country,primary_cong_district,abstract_at_time_of_award,publications_produced_as_a_result_of_this_research,publications_produced_as_conference_proceedings,projectoutcomesreport,unnamed_44
0,"UNIVERSITY OF CALIFORNIA, LOS ANGELES",University of California-Los Angeles,Seth G Claudepierre,(303) 641-2461,seth.claudepierre@gmail.com,,2018-09-21,64446,64446,2018-08-01,2019-07-31,Grant,NSF,4900,4900,47050,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: GEM--The Effect of M...,1854440,92530369,71549000,MAGNETOSPHERIC PHYSICS,Carrie E. Black,(703) 292-8519,cblack@nsf.gov,10889 Wilshire Boulevard,LOS ANGELES,CA,90095-1406,Los Angeles,US,33,University of California-Los Angeles,,,CA,90095-1406,Los Angeles,US,33,This project aims to address the electromagnet...,,,,
1,UNIVERSITY OF WASHINGTON,University of Washington,Mo Li,(612) 638-8958,moli96@uw.edu,,2018-09-21,149665,149665,2018-09-01,2020-08-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: Conformal and robust ...,1854974,605799469,42803536,"COMMS, CIRCUITS & SENS SYS",Shubhra Gangopadhyay,(703) 292-2485,sgangopa@nsf.gov,4333 Brooklyn Ave NE,Seattle,WA,98195-0001,Seattle,US,7,University of Washington,,,WA,98195-0001,Seattle,US,7,The program aims to pioneer a flexible optical...,,,,
2,CLEMSON UNIVERSITY,Clemson University,Yongjia Song,(864) 656-9832,yongjis@clemson.edu,,2018-09-21,84855,84855,2018-09-01,2019-05-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,An Adaptive Partition-based Approach for Solvi...,1854960,42629816,42629816,OE Operations Engineering,Georgia-Ann Klutke,(703) 292-8360,gaklutke@nsf.gov,230 Kappa Street,CLEMSON,SC,29634-5701,,US,3,Clemson University,230 Kappa Street,Clemson,SC,29634-0001,Clemson,US,3,Stochastic programs are popular models for pro...,,,,
3,TEXAS A&M ENGINEERING EXPERIMENT STATION,Texas A&M Engineering Experiment Station,Ya Wang,(631) 632-8322,ya.s.wang@stonybrook.edu,,2018-09-21,500000,500000,2018-09-01,2023-07-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,CAREER: Understanding Dynamics of Ultra-small ...,1851635,847205572,847205572,"Dynamics, Control and System D",Irina Dolinskaya,(703) 292-7078,idolinsk@nsf.gov,TEES State Headquarters Bldg.,College Station,TX,77845-4645,College Station,US,17,Texas A&M University,,College Station,TX,77843-3578,College Station,US,17,This Faculty Early Career Development Program ...,,,,
4,UNIVERSITY OF PITTSBURGH THE,University of Pittsburgh,Masoud Barati,(225) 578-1054,mbarati@lsu.edu,,2018-09-21,199893,199893,2018-08-28,2020-08-31,Grant,NSF,4900,4900,47041,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: A Global Algorithm for...,1851602,4514360,4514360,"ENERGY,POWER,ADAPTIVE SYS",Anil Pahwa,(703) 292-2285,apahwa@nsf.gov,University Club,Pittsburgh,PA,15213-2303,Pittsburgh,US,14,University of Pittsburgh,,Pittsburgh,PA,15213-2303,Pittsburgh,US,14,Non-convex programming involves optimization p...,,,,


In [None]:
df_2.isnull().sum()
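
With 45 columns, raw null counts are easier to read as sorted percentages. The sketch below shows one way to do that on a toy frame; the column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame: one fully empty column, one partly empty, one complete
toy = pd.DataFrame({
    "awardee_state": ["CA", "WA", None, "TX"],
    "unnamed_44": [np.nan, np.nan, np.nan, np.nan],
    "agency": ["NSF", "NSF", "NSF", "NSF"],
})

missing = (
    toy.isnull().mean()            # fraction of missing values per column
    .mul(100)                      # express as a percentage
    .sort_values(ascending=False)  # worst columns first
    .rename("pct_missing")
)

print(missing)
```

Columns at or near 100% are candidates for dropping outright, which is what pyjanitor's `remove_empty()` would do for the completely empty ones.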

In [53]:
def drop_cols(data):
    """Another one o' dem virtual lassoooos."""
    
    # Prevent the slice/copy warning
    data = data.copy()

    cols_to_keep = [  # to be converted to integer
        "doing_business_as_name",
        "award_date",
        "estimated_total_award_amount",
        "funds_obligated_to_date",
        "start_date",
        "end_date",
        "transaction_type",
        "agency",
        "cfda_number",
        "primary_program_source",
        "award_title_or_description",
        "federal_award_id_number",
        "program",
        "awardee_city",
        "awardee_state",
        "awardee_zip",
        "awardee_county",
        "awardee_country",
        "awardee_cong_district",
        "abstract_at_time_of_award",
    ]

    data = data[cols_to_keep]
        
    return data

In [54]:
df_3 = drop_cols(df_2)

In [55]:
df_3.head()

Unnamed: 0,doing_business_as_name,award_date,estimated_total_award_amount,funds_obligated_to_date,start_date,end_date,transaction_type,agency,cfda_number,primary_program_source,award_title_or_description,federal_award_id_number,program,awardee_city,awardee_state,awardee_zip,awardee_county,awardee_country,awardee_cong_district,abstract_at_time_of_award
0,University of California-Los Angeles,2018-09-21,64446,64446,2018-08-01,2019-07-31,Grant,NSF,47050,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: GEM--The Effect of M...,1854440,MAGNETOSPHERIC PHYSICS,LOS ANGELES,CA,90095-1406,Los Angeles,US,33,This project aims to address the electromagnet...
1,University of Washington,2018-09-21,149665,149665,2018-09-01,2020-08-31,Grant,NSF,47041,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: Conformal and robust ...,1854974,"COMMS, CIRCUITS & SENS SYS",Seattle,WA,98195-0001,Seattle,US,7,The program aims to pioneer a flexible optical...
2,Clemson University,2018-09-21,84855,84855,2018-09-01,2019-05-31,Grant,NSF,47041,040100 NSF RESEARCH & RELATED ACTIVIT,An Adaptive Partition-based Approach for Solvi...,1854960,OE Operations Engineering,CLEMSON,SC,29634-5701,,US,3,Stochastic programs are popular models for pro...
3,Texas A&M Engineering Experiment Station,2018-09-21,500000,500000,2018-09-01,2023-07-31,Grant,NSF,47041,040100 NSF RESEARCH & RELATED ACTIVIT,CAREER: Understanding Dynamics of Ultra-small ...,1851635,"Dynamics, Control and System D",College Station,TX,77845-4645,College Station,US,17,This Faculty Early Career Development Program ...
4,University of Pittsburgh,2018-09-21,199893,199893,2018-08-28,2020-08-31,Grant,NSF,47041,040100 NSF RESEARCH & RELATED ACTIVIT,Collaborative Research: A Global Algorithm for...,1851602,"ENERGY,POWER,ADAPTIVE SYS",Pittsburgh,PA,15213-2303,Pittsburgh,US,14,Non-convex programming involves optimization p...


In [56]:
# Check out non-numeric columns
df_3.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

Unnamed: 0,count,unique,top,freq,first,last
agency,12603,1,NSF,12603,NaT,NaT
transaction_type,12603,2,Grant,12504,NaT,NaT
primary_program_source,12603,4,040100 NSF RESEARCH & RELATED ACTIVIT,11461,NaT,NaT
awardee_country,12603,6,US,12588,NaT,NaT
cfda_number,12603,9,47049,2751,NaT,NaT
awardee_cong_district,12586,53,02,1584,NaT,NaT
awardee_state,12588,53,CA,1602,NaT,NaT
end_date,12603,79,2021-08-31 00:00:00,1635,2018-01-31,2024-08-31
start_date,12603,169,2018-09-01 00:00:00,2373,2014-06-30,2019-08-01
award_date,12603,265,2018-08-07 00:00:00,194,2017-10-12,2018-09-21
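
The summary above shows `agency` with a single unique value, so it cannot help any model. A minimal sketch of finding and dropping such constant columns follows; the toy frame and the "Other" label are made up for illustration.

```python
import pandas as pd

# Toy frame: "agency" is constant, the other columns vary
toy = pd.DataFrame({
    "agency": ["NSF", "NSF", "NSF"],
    "transaction_type": ["Grant", "Grant", "Other"],  # "Other" is a made-up label
    "awardee_state": ["CA", "WA", "SC"],
})

# Count distinct non-null values per column
n_unique = toy.nunique()

# Columns with at most one distinct value carry no signal
constant_cols = n_unique[n_unique <= 1].index.tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols)  # ['agency']
```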


In [2]:
df_3.describe()

NameError: name 'df_3' is not defined

In [1]:
# define the target
target = "estimated_total_award_amount"

In [None]:
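# Features: drop the target, plus funds_obligated_to_date, which equals the
# target in every row shown above and would leak it
X = df_3.drop(columns=[target, "funds_obligated_to_date"])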
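
Because the target is a continuous dollar amount, this is a regression problem. The sketch below shows one possible setup, not a decision made in this notebook: a time-based train/validate/test split on `award_date` and a mean baseline scored with mean absolute error. The data is synthetic, and the 60/20/20 split and the choice of MAE are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the awards table (all values are made up)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "award_date": pd.date_range("2018-01-01", periods=100, freq="D"),
    "estimated_total_award_amount": rng.integers(50_000, 500_000, size=100),
})
target = "estimated_total_award_amount"

# Time-based split: oldest awards train the model, newest are held out
toy = toy.sort_values("award_date")
train, val, test = toy.iloc[:60], toy.iloc[60:80], toy.iloc[80:]

# Baseline: always predict the mean of the training target
baseline = train[target].mean()

# Mean absolute error of that constant prediction on the validation set
mae = np.abs(val[target] - baseline).mean()
print(f"Baseline validation MAE: ${mae:,.0f}")
```

Any real model should beat this baseline on the validation set before the test set is touched. Award amounts tend to be skewed, so a log-transformed target is another option worth trying.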