# Classifying Party Affiliation Based on Stocks

### Introduction
Examining how representatives' stock trades correlate with their party affiliation is an interesting task. It lets us see whether there is any relationship between the laws and bills they pass and the stocks they invest in. If such a pattern exists, being able to identify the party behind a set of trades would help check that no single party is disproportionately pushing restrictions to further its members' earnings from stock investments. In this project, we attempt to predict a representative's party affiliation from their stock trades, the amount of money involved in each trade, and several other features included in the dataset.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import requests
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

### Data Cleaning

In [2]:
#read in dataframe from url
url = 'https://house-stock-watcher-data.s3-us-west-2.amazonaws.com/data/all_transactions.json'
all_transactions = requests.get(url).json() 
df = pd.DataFrame(all_transactions)

In [3]:
#replace the '--' placeholder and any None values with np.nan so missing data is represented uniformly
df = df.replace('--', np.nan)
df = df.fillna(np.nan)

In [4]:
#convert dates into datetime objects
df['disclosure_date'] = pd.to_datetime(df['disclosure_date'])
df['transaction_date'] = pd.to_datetime(df['transaction_date'], errors = 'coerce')
#additionally, create a transaction_year column so that we can keep only trades from 2018 onward
df['transaction_year'] = df['transaction_date'].dt.year
df = df[df['transaction_year'] >= 2018]

In [5]:
#drop the ptr_link, asset_description and owner columns
df = df.drop(['ptr_link','asset_description','owner'],axis=1)
#read in the representatives' parties from a text file
with open('representative_party.txt') as f:
    lines = f.readlines()
#skip the scraped page residue so that only the name and party lines remain
clean = []
for i in lines:
    if 'Transactions' in i or 'View' in i or 'photo' in i:
        continue
    clean.append(i)
#pair each line with the line that follows it: representative name -> party
representative_dict = {}
for i in range(0, len(clean) - 1, 2):
    representative_dict[clean[i].strip()] = clean[i + 1].strip()

In [None]:
#build a dataframe of representatives and their parties, and merge it with the original dataframe
representatives = pd.DataFrame({
    'representative': list(representative_dict.keys()),
    'party': list(representative_dict.values()),
})
df = df.merge(representatives, on = 'representative')
#keep only the first word of the party entry when it has exactly three words; otherwise mark the party as missing
df['party'] = df['party'].apply(lambda x: x.split()[0] if len(x.split()) == 3 else None)
#drop the trades whose representative has no party
df = df[df['party'].notna()]
df


### Baseline Model
To label each trade, we first scraped the representatives' parties from the representative summary in the website's HTML source (https://housestockwatcher.com/summary_by_rep) and stored them in the text file representative_party.txt. For the baseline model, we included three of the dataset's columns: amount (ordinal), type (nominal), and cap_gains_over_200_usd (nominal). We left out ptr_link and asset_description because ptr_link is tied to the representative and asset_description is tied to the ticker. We also chose not to include the owner column.

We encoded amount ordinally, since its values are brackets describing how much money was invested. For type, we decided on one-hot encoding, as its categories have no order (the type_conv function below assigns integer codes instead). Finally, we encoded cap_gains_over_200_usd as a binary 0/1 value. With this pipeline, the model's accuracy ranged from around 58 to 72 percent. We consider this low: it is not far from a 50/50 guess, and we were hoping for something closer to 80%.
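The encoding choices described above can be sketched with scikit-learn's built-in encoders. This is a minimal illustration on made-up rows, not the notebook's pipeline: the toy data, the `amount_order` list, and the variable names are assumptions, and the real dataset contains more amount brackets than the three shown here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy rows for illustration only; values are assumptions, not taken from the dataset.
toy = pd.DataFrame({
    'amount': ['$1,001 - $15,000', '$15,001 - $50,000',
               '$50,001 - $100,000', '$1,001 - $15,000'],
    'type': ['purchase', 'sale_full', 'exchange', 'purchase'],
    'cap_gains_over_200_usd': [False, True, False, False],
})

# Explicit bracket order (smallest to largest) so the ordinal codes follow the money involved.
amount_order = ['$1,001 - $15,000', '$15,001 - $50,000', '$50,001 - $100,000']

encoder = ColumnTransformer(
    transformers=[
        ('amount', OrdinalEncoder(categories=[amount_order]), ['amount']),
        ('type', OneHotEncoder(handle_unknown='ignore'), ['type']),
        ('cap', 'passthrough', ['cap_gains_over_200_usd']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Columns: 1 ordinal amount code, 3 one-hot type indicators, 1 binary flag.
encoded = encoder.fit_transform(toy)
```

Passing an explicit category order keeps the ordinal codes tied to the size of the bracket rather than to how often each bracket appears.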

In [7]:
#encode the amount brackets as integers; the codes follow frequency rank (0 = most common bracket)
def label_conv(df):
    keys = list(df['amount'].value_counts().index)
    values = np.arange(0,len(keys))
    conv= dict(zip(keys,values))
    return df['amount'].transform(lambda x: conv.get(x)).to_numpy().reshape(-1,1)
#encode the transaction types as integer labels, again by frequency rank
def type_conv(df):
    keys = list(df['type'].value_counts().index)
    values = np.arange(0,len(keys))
    conv= dict(zip(keys,values))
    return df['type'].transform(lambda x: conv.get(x)).to_numpy().reshape(-1,1)
#converts true into 1 and false into 0
def cap_conv(df):
    return df['cap_gains_over_200_usd'].transform(lambda x: 1 if x else 0).to_numpy().reshape(-1,1)

In [11]:
#create the training and testing sets through train_test_split
train_X,test_X,train_Y,test_Y = train_test_split(df[df.columns[:-1]],df.party)
#Transformers that will be used in the columntransformer
label_transformer = Pipeline([
    ('label',FunctionTransformer(label_conv))
])
type_transformer = Pipeline([
    ('type',FunctionTransformer(type_conv))
])
cap_transformer = Pipeline([
    ('cap',FunctionTransformer(cap_conv))
])
preproc = ColumnTransformer(
    transformers=[
    ('s1',label_transformer,['amount']),
    ('s2',type_transformer,['type']),
    ('s3',cap_transformer,['cap_gains_over_200_usd']),
])
p2 = Pipeline(steps = [('preprocessor',preproc),('classifier',KNeighborsClassifier())])
p2.fit(train_X,train_Y)
p2.score(test_X,test_Y)
#fairly low accuracy; we want at least 70%

0.6696908602150538

### Final Model
Our final model adds two engineered features built from the ticker and transaction_date columns. For the ticker feature, we group trades by ticker and party and count them, which gives each party's share of the trades in every ticker and lets us measure each ticker's political lean. We keep every ticker in which one party accounts for more than 60% of the trades, a threshold we found to work best after testing different values, and then encode each ticker as a number according to the party it leans toward.

For transaction_date, we followed the same approach and computed the proportions of Democratic and Republican trades made in each month. Our metric for a month's party lean is the absolute difference between the two proportions, and we flagged months whose difference exceeded .00143951, the value we read from the np.percentile output as the 75th-percentile cutoff. We then hard-coded the flagged months and their party leanings by inspecting the values.

For the model type, we chose the KNN classifier because our features are primarily ordinal, which we expected to suit KNN's distance-based predictions. The parameters that performed best under GridSearchCV were algorithm='brute', leaf_size=1, n_neighbors=13, p=2, and weights='distance'.
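As a rough illustration of the ticker feature described above, the sketch below computes each party's share of trades per ticker with `pd.crosstab` and applies the 60% threshold. The toy tickers, the `THRESHOLD` constant, and the `ticker_lean` helper are hypothetical names used only for this example, and `pd.crosstab` is a swapped-in shortcut rather than the groupby code used in this notebook.

```python
import pandas as pd

# Made-up trades for illustration; tickers and parties are assumptions.
trades = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC', 'CCC', 'CCC'],
    'party': ['Democrat', 'Democrat', 'Republican',
              'Republican', 'Republican',
              'Democrat', 'Republican', 'Democrat', 'Republican'],
})

# Share of each ticker's trades made by each party (every row sums to 1).
shares = pd.crosstab(trades['ticker'], trades['party'], normalize='index')

THRESHOLD = 0.6  # a ticker "leans" toward a party above this share
democrat_leaning = shares.index[shares['Democrat'] > THRESHOLD].tolist()
republican_leaning = shares.index[shares['Republican'] > THRESHOLD].tolist()


def ticker_lean(ticker):
    """Return 1 for Democrat-leaning, 2 for Republican-leaning, 0 otherwise."""
    if ticker in democrat_leaning:
        return 1
    if ticker in republican_leaning:
        return 2
    return 0


trades['ticker_lean'] = trades['ticker'].map(ticker_lean)
```

Because the shares are derived from the party labels themselves, computing them on the training split only would keep test labels out of the feature.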

In [12]:
#Using the ticker column, we measure each ticker's lean toward a party
#number of trades per ticker across all parties
filtered = df.groupby('ticker').count()['disclosure_year']
#number of trades per ticker for each party
democrat = df[df['party'] == 'Democrat'].groupby('ticker').count()
republican = df[df['party'] == 'Republican'].groupby('ticker').count()
libertarian = df[df['party'] == 'Libertarian'].groupby('ticker').count()
#each party's share of the trades in every ticker
ratio_D = democrat['disclosure_year'].divide(filtered).rename('Democrat')
ratio_R = republican['disclosure_year'].divide(filtered).rename('Republican')
ratio_L = libertarian['disclosure_year'].divide(filtered).rename('Libertarian')
final = pd.concat([ratio_D, ratio_R, ratio_L], axis = 1)
#keep only the shares above the 60% threshold; everything else becomes NaN
final = final[final > .6]
tickers = final.dropna(how = 'all')
#This contains the Democrat-leaning tickers
democrat = tickers[['Democrat']].dropna().index.to_list()
#This contains the Republican-leaning tickers
republican = tickers[['Republican']].dropna().index.to_list()

In [None]:
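#count trades per calendar month and party
test = df.copy(deep = True)
test['transaction_date'] = df['transaction_date'].dt.month
x = test[['transaction_date', 'party', 'disclosure_year']].groupby(['transaction_date', 'party']).agg('count')
x = x['disclosure_year']
demo = []
repub = []
#7504 and 4394 are hard-coded divisors, presumably each party's total number of trades
for i in range(len(x)//2):
    demo.append(x.loc[(i+1, 'Democrat')] / 7504)
    repub.append(x.loc[(i+1, 'Republican')] / 4394)
ratios = pd.DataFrame({'democrat': demo, 'republican': repub})
#absolute difference between the two parties' monthly proportions
#note: np.percentile expects q on a 0-100 scale, so q = [.25, .5, .75] requests the 0.25th, 0.5th and 0.75th percentiles
np.percentile(abs(ratios.diff(axis = 1)).republican, q = [.25, .5, .75])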

In [13]:
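#New features that we included, built from the ticker and transaction_date columns
def encode_ticker(df):
    #1 for Democrat-leaning tickers, 2 for Republican-leaning tickers, 0 otherwise
    return df['ticker'].transform(lambda x: 1 if x in democrat else(2 if x in republican else 0)).to_numpy().reshape(-1,1)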
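def encode_day(df):
    #group the transaction months into three codes: 0 for March and May, 1 for October, 2 for every other month
    return df['transaction_date'].dt.month.transform(lambda x: 0 if x in [3,5] else(1 if x==10 else 2)).to_numpy().reshape(-1,1)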

In [14]:
train_X,test_X,train_Y,test_Y = train_test_split(df[df.columns[:-1]],df.party)
#Final Pipeline, using KNN parameters that were given from the GridSearchCV
knn=KNeighborsClassifier(algorithm='brute',
 leaf_size=1,
 n_neighbors=13,
 p=2,
 weights='distance')
label_transformer = Pipeline([
    ('label',FunctionTransformer(label_conv))
])
type_transformer = Pipeline([
    ('type',FunctionTransformer(type_conv))
])
cap_transformer = Pipeline([
    ('cap',FunctionTransformer(cap_conv))
])
ticker_transformer = Pipeline([
    ('ticker',FunctionTransformer(encode_ticker))
])
day_transformer = Pipeline([
    ('day',FunctionTransformer(encode_day))
])
preproc = ColumnTransformer(
    transformers=[
    ('s1',label_transformer,['amount']),
    ('s2',type_transformer,['type']),
    ('s3',cap_transformer,['cap_gains_over_200_usd']),
     #additional feature 1
    ('s4',ticker_transformer,['ticker']),
    #additional feature 2
    ('s5',day_transformer,['transaction_date']) 
])
p3 = Pipeline(steps = [('preprocessor',preproc),('classifier',knn)])
p3.fit(train_X,train_Y)
p3.score(test_X,test_Y)

0.8185483870967742

In [40]:
#create a deep copy of df with the transformations applied so that we can run GridSearch
temp = df.copy(deep=True)
temp['amount'] = label_conv(temp)
temp['type'] = type_conv(temp)
temp['cap_gains_over_200_usd'] = cap_conv(temp)
temp['ticker'] = encode_ticker(temp)
temp['transaction_date'] = encode_day(temp)
x = temp[['amount','type','cap_gains_over_200_usd','ticker','transaction_date']]
y= df.party

In [None]:
# Split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(x,y)
knn = KNeighborsClassifier()
param_grid = {
    'n_neighbors' : list(range(1,30)),
    'weights'     : ['uniform', 'distance'],
    'algorithm'   : ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size'   : list(range(1,10)),
    'p'           : [1,2]
}
# 2-fold cross-validated grid search over param_grid, scored by accuracy
grid = GridSearchCV(knn, param_grid, cv=2, scoring='accuracy', return_train_score=False)
grid_search=grid.fit(x_train, y_train)
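
Once a search like the one above has been fit, the selected settings can be read from the fitted object. The sketch below is a small, self-contained illustration on synthetic data: the data from `make_classification`, the reduced `small_grid`, and the variable names are assumptions chosen to keep the example fast, not the grid or data used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded features (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A deliberately small grid so the example runs quickly.
small_grid = {
    'n_neighbors': [3, 7, 13],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

search = GridSearchCV(KNeighborsClassifier(), small_grid, cv=2, scoring='accuracy')
search.fit(X_train, y_train)

best_params = search.best_params_         # dict with the best combination found
best_cv_score = search.best_score_        # mean cross-validated accuracy of that combination
test_score = search.score(X_test, y_test) # accuracy of the refit best model on held-out data
```

Reporting `test_score` alongside `best_cv_score` gives a check on data the search never saw.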

### Fairness Evaluation
For our fairness evaluation, we chose to test the keyword 'Hon.' in representative names: we wanted to check whether our model's predictions are fair to individuals who do not have 'Hon.' in their names. As our parity measure we chose true positive parity, because the model should classify party affiliation equally well for representatives with and without the title. The recall_score for each group came out to .72 for 'Hon.' and .99 for no 'Hon.'. This makes sense: after some EDA, we found that almost every representative without 'Hon.' in their name was Republican, hence the very high recall. To test whether the gap could be due to chance, we ran a permutation test on the recall score and found a p-value of .01, which indicates that our model is not fair: it is biased in favor of representatives who do not have 'Hon.' in their names.
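The permutation test described above can be sketched on synthetic data as follows. Everything here is an assumption for illustration: the labels and predictions are random, the `recall_gap` helper is hypothetical, and the sketch uses binary recall for the positive class (one reading of true positive parity) with a two-sided p-value, which differs from the micro-averaged recall used in the cells below.

```python
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: group flag, true party (0/1), and predictions that agree ~80% of the time.
n = 400
has_hon = rng.random(n) < 0.7
y_true = rng.integers(0, 2, size=n)
agree = rng.random(n) < 0.8
y_pred = np.where(agree, y_true, 1 - y_true)


def recall_gap(y_true, y_pred, group):
    """Recall inside the group minus recall outside it (positive class = 1)."""
    inside = recall_score(y_true[group], y_pred[group])
    outside = recall_score(y_true[~group], y_pred[~group])
    return inside - outside


observed = recall_gap(y_true, y_pred, has_hon)

# Shuffle the group labels to simulate "the title is unrelated to model quality".
n_perm = 200
perm_gaps = np.empty(n_perm)
for i in range(n_perm):
    perm_gaps[i] = recall_gap(y_true, y_pred, rng.permutation(has_hon))

# Two-sided p-value: how often a shuffled gap is at least as extreme as the observed one.
p_value = np.mean(np.abs(perm_gaps) >= abs(observed))
```

A small `p_value` would suggest the recall gap between the groups is unlikely to arise from chance alone.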

In [17]:
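from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
#assumed definitions: the cell that built hon and no_hon is not shown, so we split on 'Hon' in the name as in the permutation test below
is_hon = df['representative'].str.contains('Hon')
hon = df[is_hon]
no_hon = df[~is_hon]
yh_true = hon.party
yn_true = no_hon.party
#fit and predict on the same rows, separately for each group
p2.fit(hon[hon.columns[:-1]],hon.party)
yh_pred = p2.predict(hon[hon.columns[:-1]])
p2.fit(no_hon[no_hon.columns[:-1]],no_hon.party)
yn_pred = p2.predict(no_hon[no_hon.columns[:-1]])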

In [18]:
confusion_matrix(yh_true,yh_pred,labels=['Democrat','Republican'])
# rows are the actual parties, columns are the predicted parties
recall_score(yh_true,yh_pred,average='micro')

0.7246401792639835

In [19]:
confusion_matrix(yn_true,yn_pred,labels=['Democrat','Republican'])
recall_score(yn_true,yn_pred,average='micro')

0.9966555183946488

In [20]:
yh_pred = p3.predict(df[df.columns[:-1]])
df['preds']=yh_pred

In [21]:
#permutation test: shuffle the Hon labels and compare the gap in micro-averaged recall between the two groups
df['Hon'] = df['representative'].apply(lambda x: 'Hon' in x)
df['preds'] = df['preds'].apply(lambda x: 0 if x == 'Democrat' else 1)
df['party'] = df['party'].apply(lambda x: 0 if x == 'Democrat' else 1)
#observed gap: recall for Hon == True minus recall for Hon == False
obs = df.groupby('Hon').apply(lambda x: recall_score(x.party, x.preds, average='micro')).diff().iloc[-1]
metrs = []
for _ in range(100):
    s = (
        df[['Hon', 'preds', 'party']]
        #assign the shuffled labels as a plain array so they are not realigned to the original index
        .assign(Hon = df.Hon.sample(frac=1.0, replace=False).to_numpy())
        .groupby('Hon')
        .apply(lambda x: recall_score(x.party, x.preds, average='micro'))
        .diff()
        .iloc[-1]
    )
    metrs.append(s)
#proportion of shuffled gaps that are at least as large as the observed gap
(np.array(metrs) >= obs).mean()

0.99

In [22]:
( #getting the accuracy of the party predictions for representatives with/without Hon
    df
    .groupby('Hon')
    .apply(lambda x: recall_score(x.party, x.preds, average='micro'))
    .rename('accuracy')
    .to_frame()
)

| Hon | accuracy |
|---|---|
| False | 0.765886 |
| True | 0.826424 |
