# Allen Majewski 20200107

Notes:

This is a git repository with the following structure:

`README.md`

`src/`

`-- taxitip_model.ipynb` (this file)

`data/`

   `-- 2015-<MM>_100k.csv` 
   
   `-- 2015_weather.csv`
   
   `-- holidays.csv`
   
The data files are listed in `.gitignore`, so they are not pushed.

The discussion will follow inline with the code.

Below is the problem statement.



# dotdata ML task

### Data
	1. NYC taxi trip history data from 2015/01 – 2015/12
	2. Daily weather data from 2015/01 – 2015/12
	3. Public holiday data for the year of 2015

## Problem
Build an ML model to predict whether each trip has over 20% tip rate or not.

### Expected outputs

	1. Please submit the code you developed to build predictive model(s) as Jupyter notebook(s) on Python3 kernel.
	2. Please summarize the key conclusion of your analysis. This report should include the following:
		a. The most significant features that affect each trip’s tip percentage.
		b. The performance of your predictive model(s), and suggest what additional dataset you’d like to include to improve the performance of your models.

### Note
	1.  Please assume the client wants to get the model with high prediction accuracy.
	2.  Please assume the code would be reviewed by your team members, and be further developed.

In [101]:
import numpy as np
import pandas as pd
import os
import sys
import random
# import mysql.connector
# import json
# import math
# import datetime
# import dateutil
# import argparse
import matplotlib.pyplot as plt
# from operator import add, truediv
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge, RidgeCV, LinearRegression
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated private namespace


def get_min_max(seq):
    min_ = min(seq)
    max_ = max(seq)
    seq = (seq-min_)/(max_ - min_)
    return [(min_,max_),seq]

def lmap(func, alist):
    return list(map(func, alist))


os.chdir('../data')

### load up fare data

`bigdf` concatenates the twelve monthly samples (100k trips each) into a single DataFrame.

In [102]:
dfs = []

for infile in sorted([ _ for _ in os.listdir('.') if _.endswith('100k.csv')]):
    sys.stdout.write('\r'+f'loading {infile}')
    df = pd.read_csv(infile)
    dfs.append(df)
    
bigdf = pd.concat(dfs, ignore_index=True)

bigdf.tail(5)

loading 2015-12_100k.csv

Unnamed: 0,VendorID,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,pickup_zip,pickup_borough,pickup_neighborhood,dropoff_zip,dropoff_borough,dropoff_neighborhood
1199995,2,2015-12-16 19:00:26,2015-12-16 19:08:39,2,1.25,-73.974289,40.779854,1,N,-73.975204,...,1.4,0.0,0.3,10.7,10023,Manhattan,Upper West Side,10025,Manhattan,Upper West Side
1199996,1,2015-12-07 06:41:33,2015-12-07 06:45:03,1,0.7,-73.991707,40.74987,1,N,-73.980362,...,0.0,0.0,0.3,5.8,10001,Manhattan,Chelsea and Clinton,10016,Manhattan,Gramercy Park and Murray Hill
1199997,1,2015-12-02 09:18:35,2015-12-02 09:30:59,1,1.4,-73.955223,40.773376,1,N,-73.968201,...,0.0,0.0,0.3,10.3,10028,Manhattan,Upper East Side,10022,Manhattan,Gramercy Park and Murray Hill
1199998,1,2015-12-29 09:02:47,2015-12-29 09:08:49,1,0.6,-73.990509,40.742191,1,N,-73.979141,...,0.0,0.0,0.3,6.3,10010,Manhattan,Gramercy Park and Murray Hill,10010,Manhattan,Gramercy Park and Murray Hill
1199999,2,2015-12-09 21:57:56,2015-12-09 22:18:33,2,10.0,-73.870743,40.773689,1,N,-73.86084,...,0.0,5.54,0.3,36.34,11369,Queens,West Queens,10462,Bronx,Southeast Bronx




## Defining the objective: 20% of what?

We measure the tip as a fraction of the fare *without taxes*, as discussed with Michal over email. Since $\text{total_amount}$ includes tax, tip, and all other charges, let's define 

$\text{user_fare} = \text{total_amount} - \text{tip_amount} - \text{mta_tax}$

Then tip fraction is just

$\text{tip_fraction} = \text{tip_amount}/\text{user_fare}$

In [103]:
USER_FARE = np.array(bigdf.total_amount) - np.array(bigdf.tip_amount) - np.array(bigdf.mta_tax)
TIP_FRACTION = np.array(bigdf.tip_amount)/USER_FARE

bigdf['user_fare']    = USER_FARE
bigdf['tip_fraction'] = TIP_FRACTION
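
The two definitions above can be checked on a few toy rows. The sketch below is illustrative only: the values are made up rather than taken from the dataset, the helper name `add_tip_fraction` is hypothetical, and masking non-positive fares to NaN is an assumption about how degenerate rows might be handled, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values, not taken from the dataset).
toy = pd.DataFrame({
    "total_amount": [12.0, 9.8, 0.0],
    "tip_amount":   [2.0, 0.0, 0.0],
    "mta_tax":      [0.5, 0.5, 0.0],
})

def add_tip_fraction(df):
    """Return a copy of df with user_fare and tip_fraction columns added."""
    out = df.copy()
    out["user_fare"] = out["total_amount"] - out["tip_amount"] - out["mta_tax"]
    # Assumption: mask non-positive fares so the division gives NaN instead of inf.
    safe_fare = out["user_fare"].where(out["user_fare"] > 0)
    out["tip_fraction"] = out["tip_amount"] / safe_fare
    return out

toy = add_tip_fraction(toy)
print(toy)
```

Rows with a NaN `tip_fraction` would then need an explicit decision (drop them or label them 0) before training.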



## Treat as regression problem that results in a classifier

#### we add a boolean value for tip above or below threshold of 20%

We can and will use regression here to predict the tip fraction, but the task itself is to classify whether the tip is at or above a threshold (20%). Let us then implement:

`initialize:` 

$\text{tip_over_twenty_percent} \leftarrow 0$.


`if:`

$\text{tip_fraction} \ge 0.2$

`then:`

$\text{tip_over_twenty_percent} \leftarrow 1$



In [104]:
# Vectorized flag: 1 where tip_fraction >= 0.2, else 0.
# NaN compares False, so it maps to 0, as in an explicit loop.
TIP_OVER_TWENTY = (bigdf.tip_fraction.values >= 0.2).astype(int)

bigdf['tip_over_twenty_percent'] = TIP_OVER_TWENTY

bigdf.head(5)

Unnamed: 0,VendorID,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,...,total_amount,pickup_zip,pickup_borough,pickup_neighborhood,dropoff_zip,dropoff_borough,dropoff_neighborhood,user_fare,tip_fraction,tip_over_twenty_percent
0,2,2015-01-06 11:39:29,2015-01-06 11:49:15,1,1.78,-73.999619,40.743599,1,N,-73.992203,...,9.8,10011,Manhattan,Chelsea and Clinton,10036,Manhattan,Chelsea and Clinton,9.3,0.0,0
1,1,2015-01-13 09:18:29,2015-01-13 09:23:40,1,2.1,-73.981956,40.77829,1,N,-73.962173,...,10.75,10023,Manhattan,Upper West Side,10024,Manhattan,Upper West Side,7.8,0.314103,1
2,2,2015-01-16 07:15:44,2015-01-16 07:26:42,1,2.33,-73.991188,40.742226,1,N,-73.981613,...,11.8,10010,Manhattan,Gramercy Park and Murray Hill,10019,Manhattan,Chelsea and Clinton,10.3,0.097087,0
3,1,2015-01-23 11:56:05,2015-01-23 12:13:20,1,2.1,-73.959297,40.763336,1,N,-73.979996,...,13.3,10065,Manhattan,Upper East Side,10023,Manhattan,Upper West Side,12.8,0.0,0
4,1,2015-01-24 10:11:48,2015-01-24 10:18:32,2,0.9,-73.971832,40.764751,1,N,-73.984047,...,7.3,10065,Manhattan,Upper East Side,10019,Manhattan,Chelsea and Clinton,6.8,0.0,0
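
The "regress, then threshold" idea described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the features and target are synthetic stand-ins generated on the spot (not the taxi data), and `Ridge` with `alpha=1.0` is an arbitrary choice rather than a tuned model.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-in features and tip fractions (not the taxi data).
X = rng.normal(size=(1000, 3))
true_w = np.array([0.05, -0.03, 0.02])
y_fraction = 0.18 + X @ true_w + rng.normal(scale=0.02, size=1000)

# Simple hold-out split: first 800 rows for training, last 200 for testing.
X_train, X_test = X[:800], X[800:]
y_train, y_test = y_fraction[:800], y_fraction[800:]

# Step 1: regress the continuous tip fraction.
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Step 2: turn predicted fractions into class labels with the 20% threshold.
THRESHOLD = 0.2
pred_label = (model.predict(X_test) >= THRESHOLD).astype(int)
true_label = (y_test >= THRESHOLD).astype(int)

accuracy = (pred_label == true_label).mean()
print(f"accuracy on held-out synthetic rows: {accuracy:.3f}")
```

One consequence of this design is that the regression loss treats all errors alike, while the classification only cares about errors that cross the 0.2 boundary, so a direct classifier may be worth comparing against.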


A small test of the above manipulations

In [105]:
TEST_SIZE=50000

for i in range(TEST_SIZE):
    
    # randint includes its upper bound, which could index one past the last row
    row_idx = random.randrange(len(bigdf))
    sys.stdout.write(f'\rtesting row {row_idx}')

    row = bigdf.iloc[row_idx]
    user_fare = row.user_fare
    tip_amount = row.tip_amount
    tip_fraction = row.tip_fraction
    tip_over_twenty_percent = row.tip_over_twenty_percent

    assert(tip_fraction == tip_amount/user_fare)
    
    if tip_fraction <0.2:
        assert(tip_over_twenty_percent == 0)
    else:
        assert(tip_over_twenty_percent == 1)

testing row 437999
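
The same consistency check can also be run over every row at once instead of a random sample. The sketch below is an assumption-labeled alternative, not the notebook's test: the helper name `check_tip_columns` is hypothetical, the toy values are made up, and `np.isclose(..., equal_nan=True)` is used so that rows where the fraction is NaN do not fail the comparison.

```python
import numpy as np
import pandas as pd

def check_tip_columns(df, threshold=0.2):
    """Return True if tip_fraction and the boolean flag agree with their definitions."""
    expected_fraction = df["tip_amount"] / df["user_fare"]
    fractions_ok = np.isclose(
        df["tip_fraction"], expected_fraction, equal_nan=True
    ).all()
    expected_flag = (df["tip_fraction"] >= threshold).astype(int)
    flags_ok = (df["tip_over_twenty_percent"] == expected_flag).all()
    return bool(fractions_ok and flags_ok)

# Toy frame with made-up values, built the same way as the derived columns above.
toy = pd.DataFrame({
    "tip_amount": [2.0, 0.0, 1.0],
    "user_fare":  [9.5, 9.3, 5.0],
})
toy["tip_fraction"] = toy["tip_amount"] / toy["user_fare"]
toy["tip_over_twenty_percent"] = (toy["tip_fraction"] >= 0.2).astype(int)

print(check_tip_columns(toy))
```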

## Use of auxiliary data: holidays, weather


### examine holiday data

In [92]:
holidays = pd.read_csv('holidays.csv', sep=';')
holidays.head(10)

Unnamed: 0,Date,Holiday
0,01.01.15,New Years Day
1,19.01.15,Martin Luther King Jr. Day
2,12.02.15,Lincoln's Birthday
3,16.02.15,Presidents' Day
4,10.05.15,Mother's Day
5,25.05.15,Memorial Day
6,21.06.15,Father's Day
7,03.07.15,Independence Day (observed)
8,07.09.15,Labor Day
9,12.10.15,Columbus Day


### plan to consider holidays/dates:

1) add a boolean `is_holiday` to the principal data

2) additionally, add `day_of_the_week` (Monday through Sunday, `M T W R F Sa Su`), to be one-hot encoded
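
The holiday plan above might look like the following sketch. It uses small toy stand-ins for the trip and holiday tables, assumes the `dd.mm.yy` date format seen in `holidays.csv`, and the helper name `add_calendar_features` and the `dow_` column prefix are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the trip and holiday tables (illustrative rows only).
trips = pd.DataFrame({
    "pickup_datetime": [
        "2015-01-01 11:39:29",
        "2015-01-13 09:18:29",
        "2015-05-25 07:15:44",
    ],
})
holidays = pd.DataFrame({
    "Date": ["01.01.15", "19.01.15", "25.05.15"],
    "Holiday": ["New Years Day", "Martin Luther King Jr. Day", "Memorial Day"],
})

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def add_calendar_features(trips, holidays):
    """Return a copy of trips with is_holiday and one-hot day-of-week columns."""
    out = trips.copy()
    pickup = pd.to_datetime(out["pickup_datetime"])
    # Holiday dates are stored as dd.mm.yy strings.
    holiday_dates = pd.to_datetime(holidays["Date"], format="%d.%m.%y")
    # Compare on calendar date only by dropping the time of day.
    out["is_holiday"] = pickup.dt.normalize().isin(holiday_dates).astype(int)
    # A categorical with all seven names guarantees seven dummy columns,
    # even if some weekday never appears in the data.
    dow = pd.Series(
        pd.Categorical(pickup.dt.day_name(), categories=DAY_NAMES),
        index=out.index,
    )
    dummies = pd.get_dummies(dow, prefix="dow", dtype=int)
    return pd.concat([out, dummies], axis=1)

features = add_calendar_features(trips, holidays)
print(features.T)
```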



### examine weather data

In [32]:
weather = pd.read_csv('2015_weather.csv', sep=';')
weather.head(10)

Unnamed: 0,pickup_date,avg_temp_C,Rain,Fog,Snow
0,01.01.15,1,,,
1,02.01.15,4,,,
2,03.01.15,3,1.0,,1.0
3,04.01.15,9,1.0,,
4,05.01.15,2,,,
5,06.01.15,-6,,,1.0
6,07.01.15,-9,,,
7,08.01.15,-9,,,
8,09.01.15,-3,,,1.0
9,10.01.15,-7,,,




#### plan to consider weather data:
* impute 0 where `Rain`, `Fog`, or `Snow` is NaN and use these columns directly as features, since they are boolean flags
* consider minmax-scaled temperature in deg C, and also temperature anomaly
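
A minimal sketch of the weather plan above, using the first four rows shown in the `weather.head(10)` output. The helper name `prepare_weather` and the output column names are hypothetical, and "temperature anomaly" is assumed here to mean deviation from the mean temperature over the table; another baseline (for example a monthly mean) could be substituted.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# First four rows of the weather table as displayed above.
weather = pd.DataFrame({
    "pickup_date": ["01.01.15", "02.01.15", "03.01.15", "04.01.15"],
    "avg_temp_C":  [1, 4, 3, 9],
    "Rain":        [np.nan, np.nan, 1.0, 1.0],
    "Fog":         [np.nan, np.nan, np.nan, np.nan],
    "Snow":        [np.nan, np.nan, 1.0, np.nan],
})

def prepare_weather(weather):
    """Return a copy with imputed flags, scaled temperature, and an anomaly column."""
    out = weather.copy()
    out["pickup_date"] = pd.to_datetime(out["pickup_date"], format="%d.%m.%y")
    # NaN means the event did not occur, so impute 0 and store as integers.
    flags = ["Rain", "Fog", "Snow"]
    out[flags] = out[flags].fillna(0).astype(int)
    # Min-max scale temperature to [0, 1].
    scaler = MinMaxScaler()
    out["temp_scaled"] = scaler.fit_transform(out[["avg_temp_C"]]).ravel()
    # Assumption: anomaly = deviation from the mean over the whole table.
    out["temp_anomaly"] = out["avg_temp_C"] - out["avg_temp_C"].mean()
    return out

prepared = prepare_weather(weather)
print(prepared)
```

The prepared table could then be joined to the trips on calendar date, once the pickup timestamp has been reduced to a date.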



In [1]:
import sys
import ast
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector
from sqlalchemy import create_engine
from operator import add, truediv
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge, RidgeCV, LinearRegression
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated private namespace
from collections import OrderedDict

port = 3306
username = "rhombus"
password = "Rhombus_2019z"
end_point = "afwic.c9fkygyhkkab.us-gov-west-1.rds.amazonaws.com"
afwic = mysql.connector.connect(user=username, password=password, host=end_point, port=port)
afwiccon = create_engine('mysql+mysqldb://{}:{}@{}:{}'.format(username,password,end_point,port))

username = "rhombus"
password = "rhombuspower"
end_point = "quantum.c9fkygyhkkab.us-gov-west-1.rds.amazonaws.com"
quantum = mysql.connector.connect(user=username, password=password, host=end_point, port=port)
quantumcon = create_engine('mysql+mysqldb://{}:{}@{}:{}'.format(username,password,end_point,port))

df_availability = pd.read_sql(''' SELECT * FROM LIMS_EV.AFKCA_MCH_PSH_AVAILABILITY ''',afwic)
df_availability2= pd.read_sql(''' SELECT * FROM LIMS_EV.AFKCA_MCH_PSH_AVAILABILITY_AVG''',afwic)
df_budget = pd.read_sql(''' SELECT * FROM PROGRAMMING_UI.DT_ABIDES_AFKCA ''',quantum)
df_afkca = pd.read_sql(''' SELECT * FROM PLANNING_UI_DEPLOY.LOOKUP_CONNECTION_AFKCA ''',quantum)
saveeq = {'A010A':'A-10','A010C':'A-10',
       'AC130H':'AC-130H/U','AC130J':'AC-130J','AC130U':'AC-130H/U','AC130W':'AC-130W',
       'B001B':'B-1','B002A':'B-2',
       'B052G':'B-52','B052H':'B-52','B052C':'B-52','B052D':'B-52','B052E':'B-52','B052F':'B-52',
       'C012A':'C-12','C012C':'C-12','C012D':'C-12','C012F':'C-12','C012J':'C-12',
       'C017A':'C-17',
       'C020A':'C-20','C020B':'C-20','C020C':'C-20','C020E':'C-20','C020H':'C-20','C020K':'C-20',
       'C021A':'C-21',
       'C032A':'C-32','C032B':'C-32',
       'C037A':'C-37','C037B':'C-37',
       'C040A':'C-40','C040B':'C-40','C040C':'C-40',
       'C130H':'C-130H','C130J':'C-130J',
       'CV022B':'CV-22',
       'E003A':'E-3','E003B':'E-3','E003C':'E-3','E003G':'E-3',
       'E004B':'E-4','E008A':'E-8','E008C':'E-8',
       'EC130E':'EC-130 CCALL','EC130H':'EC-130 CCALL','EC130J':'EC-130 CCALL',
       'F015C':'F-15CD','F015D':'F-15CD','F015E':'F-15E',
       'F016C':'F-16CD','F016D':'F-16CD',
       'F022A':'F-22',
       'F035A':'F-35A',
       'F117A':'F-117',
       'HC130J':'HC-130J',
       'HC130N':'HC-130N/P',
       'HC130P':'HC-130N/P',
       'HH060A':'HH-60','HH060G':'HH-60','HH060U':'HH-60',
       'KC010A':'KC-10',
       'KC135A':'KC-135','KC135D':'KC-135','KC135E':'KC-135','KC135R':'KC-135','KC135T':'KC-135','KC135Q':'KC-135',
       'LC130H':'C-130H',
       'MC012W':'MC-12',
       'MC130H':'MC-130H','MC130J':'MC-130J',
       'MQ001B':'MQ-1','MQ009A':'MQ-9',
       'OC135B':'ARMS CONTROL (OC-135B)',
       'RC026B':'RC-26',
       'RQ004A':'RQ-4','RQ004B':'RQ-4',
       'T001A':'T-1',
       'T006A':'T-6',
       'T038A':'T-38','T038C':'T-38',
       'TH001H':'TH-1','TH001F':'TH-1',
       'U002S':'U-2','TU002R':'U-2','U002R':'U-2',
       'UH001H':'UH-1N NDO','UH001N':'UH-1N NDO','UH001V':'UH-1N NDO','UH001F':'UH-1N NDO','UH001P':'UH-1N NDO',
       'VC025A':'VC-25A',
       'WC130J':'WC-130J',
       'WC135B':'WC-135','WC135C':'WC-135','WC135W':'WC-135'}
lut = pd.DataFrame.from_dict(saveeq,orient='index')
lut.columns = ['AFKCA']
lut.index.names = ['EQUIPMENT_DESIGNATOR']
lut.head()




afkca_type_dict = {'BOMBERS': ['B-1', 'B-2', 'B-52'],
 'C2ISR': ['ARMS CONTROL (OC-135B)', 'E-3', 'E-4', 'E-8', 'MC-12', 'MQ-1', 'MQ-9', 'RC-26', 'RQ-4', 'U-2', 'WC-130J'],
 'CSAR': ['HC-130J', 'HC-130N/P'],
 'EW ASSETS': ['EC-130 CCALL'],
 'FIGHTERS': ['A-10', 'F-117', 'F-15CD', 'F-15E', 'F-16CD', 'F-22', 'F-35A'],
 'OSA/EA': ['C-12', 'C-20', 'C-21', 'C-32', 'C-37', 'C-40', 'VC-25A'],
 'ROTARY WING': ['HH-60', 'UH-1N NDO'],
 'SOF': ['CV-22', 'MC-130J'],
 'STRATLIFT': ['C-17'],
 'TACLIFT': ['C-130H', 'C-130J'],
 'TANKERS': ['KC-10', 'KC-135'],
 'TRAINERS': ['T-1', 'T-38', 'T-6', 'TH-1']}

# ALL_ACS = pd.read_sql('''SELECT * FROM PROGRAMMING_UI.DT_ABIDES_AFKCA where OPS = 'ACS';''', quantum)
# ALL_SSS = pd.read_sql('''SELECT * FROM PROGRAMMING_UI.DT_ABIDES_AFKCA where OPS = 'SSS';''', quantum)


# def extract_afkca(x):
#     x = x.replace('{','[').replace('}',']')
#     x = '{{{}}}'.format(x[1:-1])
#     x = ast.literal_eval(x)['AFKCA']
#     return x

def extract_afkca(x):
    if x is None:
        return []
    else:
        x = x.replace('{','[').replace('}',']')
        x = '{{{}}}'.format(x[1:-1])
        x = ast.literal_eval(x)['AFKCA']
        return x


def extract_budget(afkca='F-15CD', kind='MILPERS', retlist=None):
    def lmap(func, alist):
        return list(map(func, alist))

    if retlist is None:
        return np.array(lmap(float, df_budget[df_budget['AFKCA'] == afkca].iloc[0][kind].split(',')))

    elif retlist>0:
        return lmap(float, df_budget[df_budget['AFKCA'] == afkca].iloc[0][kind].split(','))


def get_min_max(seq):
    min_ = min(seq)
    max_ = max(seq)
    seq = (seq-min_)/(max_ - min_)
    return [(min_,max_),seq]


def get_deltas(seq):
    # first differences: seq[i+1] - seq[i]
    return np.diff(np.asarray(seq))



def get_availability(afkca='F-15CD', start_year=2001, end_year=2018):
    return np.array(list(df_availability[df_availability.AFKCA==afkca].AVAILABILITY)[start_year-2000:end_year-1999])



def get_scatter_df(afkca_='F-15CD',suppress_cols=[],use_avg=False):

    # availability df adf
    
    adf=df_availability[df_availability.AFKCA==afkca_]#[['FISCAL_YEAR','AVAILABILITY']]
    adf=adf[adf.FISCAL_YEAR > 2000]
    
    if use_avg:
        adf=df_availability2[df_availability2.AFKCA==afkca_]#[['FISCAL_YEAR','AVAILABILITY']]
        adf=adf[adf.FISCAL_YEAR > 2000]
    
    # budget df bdf
    b=OrderedDict()
    b['FY']                    = range(2001,2019)
    b['OM']                    = extract_budget(afkca=afkca_,kind='OM')[:-5]
   
    b['MILPERS']               = extract_budget(afkca=afkca_,kind='MILPERS')[:-5]
    b['AFKCA_TAI']             = extract_budget(afkca=afkca_, kind='FORCES_TAI')[:-5]
    
    bdf=pd.DataFrame.from_dict(b)
    
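    df=bdf[bdf.FY>=min(adf.FISCAL_YEAR)].copy()  # copy so the new column is not set on a slice
    df['AVAIL'] = list(adf.AVAILABILITY)

    df=df[df.AFKCA_TAI>0].copy()  # copy again before adding the ratio columns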
    df['OM/TAI']                = np.array(df['OM'])/np.array(df['AFKCA_TAI'])
    df['MILPERS/TAI']           = np.array(df['MILPERS'])/np.array(df['AFKCA_TAI'])
    
    df = df[['FY', 'OM', 'MILPERS', 'AFKCA_TAI','OM/TAI', 'MILPERS/TAI','AVAIL']]
    df = df.drop(columns=suppress_cols)

    return df #.drop(columns=suppress_cols)


afk='F-22'
print(afk)

pd.plotting.scatter_matrix(get_scatter_df(afk,suppress_cols=['AFKCA_TAI', 'OM/TAI', 'MILPERS/TAI']))#'OM','MILPERS']))#, 'AFKCA_TAI']))
plt.show()

plt.hist(list(get_scatter_df(afk).FY),bins=range(2001,2019))
plt.show()

def get_connected_df(afkca_='F-15CD', use_avg=False, ret='all'):
    

    # availability df adf
    adf=df_availability[df_availability.AFKCA==afkca_]
    adf=adf[adf.FISCAL_YEAR > 2000]

    if use_avg:
        adf=df_availability2[df_availability2.AFKCA==afkca_]
        adf=adf[adf.FISCAL_YEAR > 2000]

    # budget df bdf
    b=OrderedDict()
    b['FY']                    = range(2001,2019)
    b['AFKCA_TAI']             = extract_budget(afkca=afkca_, kind='FORCES_TAI')[:-5]
    b['OM']                    = extract_budget(afkca=afkca_,kind='OM')[:-5]
    b['MILPERS']               = extract_budget(afkca=afkca_,kind='MILPERS')[:-5]
    b['AFKCA_TAI']             = extract_budget(afkca=afkca_, kind='FORCES_TAI')[:-5]
    bdf=pd.DataFrame.from_dict(b)


    ACS_connections = extract_afkca(df_afkca[df_afkca.AFKCA == afkca_].CONNECTED_AFKCA_ACS.iloc[0]) 
    ACS_BUDGETS = OrderedDict()
    for afk in ACS_connections:
        milpers = extract_budget(afkca=afk, kind='MILPERS')
        om      = extract_budget(afkca=afk, kind='OM')
        rdte    = extract_budget(afkca=afk, kind='RDTE')
        proc    = extract_budget(afkca=afk, kind='PROCUREMENT')
        milcon  = extract_budget(afkca=afk, kind='MILCON')
        other   = extract_budget(afkca=afk, kind='OTHER')
    #       ACS_BUDGETS[afk] = 1/TOTAL_TAI*(milpers + om + rdte + proc + milcon + other)
        ACS_BUDGETS[afk] =             (milpers + om + rdte + proc + milcon + other)[:-5]
    ACS_BUDGETS['FY']    = range(2001,2019)
    ACS_df = pd.DataFrame.from_dict(ACS_BUDGETS)

    SSS_connections = extract_afkca(df_afkca[df_afkca.AFKCA == afkca_].CONNECTED_AFKCA_SSS.iloc[0])
    SSS_BUDGETS = OrderedDict()
    for afk in SSS_connections:
        milpers = extract_budget(afkca=afk, kind='MILPERS')
        om      = extract_budget(afkca=afk, kind='OM')
        rdte    = extract_budget(afkca=afk, kind='RDTE')
        proc    = extract_budget(afkca=afk, kind='PROCUREMENT')
        milcon  = extract_budget(afkca=afk, kind='MILCON')
        other   = extract_budget(afkca=afk, kind='OTHER')
    #       SSS_BUDGETS[afk] = 1/TOTAL_TAI*(milpers + om + rdte + proc + milcon + other)
        SSS_BUDGETS[afk] = (milpers + om + rdte + proc + milcon + other)[:-5]
    SSS_BUDGETS['FY']    = range(2001,2019)
    SSS_df = pd.DataFrame.from_dict(SSS_BUDGETS)


    bdf=bdf[bdf.FY>=min(adf.FISCAL_YEAR)]
    ACS_df=ACS_df[ACS_df.FY>=min(adf.FISCAL_YEAR)]
    SSS_df=SSS_df[SSS_df.FY>=min(adf.FISCAL_YEAR)]

    df=pd.DataFrame()
    df['FY']      = adf.FISCAL_YEAR
    df['AVAIL']   = adf.AVAILABILITY
    # df['OM']      = bdf.OM
    # df['MILPERS'] = bdf.MILPERS

    df = df.merge(bdf,    on='FY')
    df = df.merge(ACS_df, on='FY')
    df = df.merge(SSS_df, on='FY')

    if ret=='all':
        return df
    elif ret=='adf':
        return adf
    elif ret=='bdf':
        return bdf
    elif ret=='ACS':
        return ACS_df
    elif ret=='SSS':
        return SSS_df
    
    

def get_all_ACS_SSS_df(afkca_='F-15CD', use_avg=True, ret='all'):
    

    # availability df adf

    if use_avg:
        adf=df_availability2[df_availability2.AFKCA==afkca_]
        adf=adf[adf.FISCAL_YEAR > 2000]
    else:
        adf=df_availability[df_availability.AFKCA==afkca_]
        adf=adf[adf.FISCAL_YEAR > 2000]

    # budget df bdf
    b=OrderedDict()
    b['FY']                    = range(2001,2019)
    b['AFKCA_TAI']             = extract_budget(afkca=afkca_, kind='FORCES_TAI')[:-5]
    
    b['OM']                    = extract_budget(afkca=afkca_, kind='OM')[:-5]
    b['OM_CIVPERS']            = extract_budget(afkca=afkca_, kind='OM_CIVPERS')[:-5]
    b['OM_FHP']                = extract_budget(afkca=afkca_, kind='OM_FHP')[:-5]
    b['OM_FUEL']               = extract_budget(afkca=afkca_, kind='OM_FUEL')[:-5]
    b['OM_WSS']                = extract_budget(afkca=afkca_, kind='OM_WSS')[:-5]
    b['OM_REMAINING']          = extract_budget(afkca=afkca_, kind='OM_REMAINING')[:-5]
    
    b['MILPERS']               = extract_budget(afkca=afkca_, kind='MILPERS')[:-5]
    b['MILPERS_ENLISTED']      = extract_budget(afkca=afkca_, kind='MILPERS_ENLISTED')[:-5]
    b['MILPERS_OFFICER']       = extract_budget(afkca=afkca_, kind='MILPERS_OFFICER')[:-5]
    b['MILPERS_REMAINING']     = extract_budget(afkca=afkca_, kind='MILPERS_REMAINING')[:-5]
    
    bdf=pd.DataFrame.from_dict(b)


    #ACS_connections = extract_afkca(df_afkca[df_afkca.AFKCA == afkca_].CONNECTED_AFKCA_ACS.iloc[0])
    # 48 of these
    ACS_afkcas = sorted(list(df_budget[df_budget.OPS=='ACS'].AFKCA))
    ACS_BUDGETS = OrderedDict()
    for afk in ACS_afkcas:
        milpers = extract_budget(afkca=afk, kind='MILPERS')
        om      = extract_budget(afkca=afk, kind='OM')
        rdte    = extract_budget(afkca=afk, kind='RDTE')
        proc    = extract_budget(afkca=afk, kind='PROCUREMENT')
        milcon  = extract_budget(afkca=afk, kind='MILCON')
        other   = extract_budget(afkca=afk, kind='OTHER')
    #       ACS_BUDGETS[afk] = 1/TOTAL_TAI*(milpers + om + rdte + proc + milcon + other)
        ACS_BUDGETS[afk] =             (milpers + om + rdte + proc + milcon + other)[:-5]
    ACS_BUDGETS['FY']    = range(2001,2019)
    ACS_df = pd.DataFrame.from_dict(ACS_BUDGETS)

    #SSS_connections = extract_afkca(df_afkca[df_afkca.AFKCA == afkca_].CONNECTED_AFKCA_SSS.iloc[0])
    SSS_afkcas = sorted(list(df_budget[df_budget.OPS=='SSS'].AFKCA))
    SSS_BUDGETS = OrderedDict()
    for afk in SSS_afkcas:
        milpers = extract_budget(afkca=afk, kind='MILPERS')
        om      = extract_budget(afkca=afk, kind='OM')
        rdte    = extract_budget(afkca=afk, kind='RDTE')
        proc    = extract_budget(afkca=afk, kind='PROCUREMENT')
        milcon  = extract_budget(afkca=afk, kind='MILCON')
        other   = extract_budget(afkca=afk, kind='OTHER')
    #       SSS_BUDGETS[afk] = 1/TOTAL_TAI*(milpers + om + rdte + proc + milcon + other)
        SSS_BUDGETS[afk] = (milpers + om + rdte + proc + milcon + other)[:-5]
    SSS_BUDGETS['FY']    = range(2001,2019)
    SSS_df = pd.DataFrame.from_dict(SSS_BUDGETS)


    bdf=bdf[bdf.FY>=min(adf.FISCAL_YEAR)]
    ACS_df=ACS_df[ACS_df.FY>=min(adf.FISCAL_YEAR)]
    SSS_df=SSS_df[SSS_df.FY>=min(adf.FISCAL_YEAR)]

    df=pd.DataFrame()
    df['FY']      = adf.FISCAL_YEAR
    df['AVAIL']   = adf.AVAILABILITY
    # df['OM']      = bdf.OM
    # df['MILPERS'] = bdf.MILPERS

    df = df.merge(bdf,    on='FY')
    df = df.merge(ACS_df, on='FY')
    df = df.merge(SSS_df, on='FY')

    if ret=='all':
        return df
    elif ret=='adf':
        return adf
    elif ret=='bdf':
        return bdf
    elif ret=='ACS':
        return ACS_df
    elif ret=='SSS':
        return SSS_df
    

# aircraft = sorted(list(set(lut.AFKCA)))
# aircraft.remove('ARMS CONTROL (OC-135B)')
    
# for afk in aircraft:
#     print(afk)
#     try:
#         df=get_all_ACS_SSS_df(afk).corr()
#     except:
#         e = sys.exc_info()[0]
#         print(e)
    
# df

# df = get_all_ACS_SSS_df('F-15CD')
# df
    




F-22


<matplotlib.figure.Figure at 0x10c112240>

<matplotlib.figure.Figure at 0x1a1c886ba8>