# Stock Trades by Members of the US House of Representatives

* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    - Can you predict the party affiliation of a representative from their stock trades?
    - Can you predict the geographic region that the representative comes from using their stock trades? E.g., west coast, east coast, south, etc.
    - Can you predict whether a particular trade is a BUY or SELL?

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
TODO

### Baseline Model
TODO

### Final Model
TODO

### Fairness Evaluation
TODO

# Code

In [6]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [50]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import Binarizer

from StdScalerByGroup import StdScalerByGroup
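
`StdScalerByGroup` is imported from a local module that is not included in this notebook. As a rough idea of what such a transformer might do, here is a minimal sketch. It assumes the first column holds the group label and every remaining column is numeric. The class name `GroupStandardScaler` and the toy data are hypothetical and are not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupStandardScaler(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: z-score numeric columns within each group."""

    def fit(self, X, y=None):
        df = pd.DataFrame(X)
        group_col = df.columns[0]              # assumption: first column is the group
        num = df[df.columns[1:]].astype(float)
        grouped = num.groupby(df[group_col])
        self.means_ = grouped.mean()
        # Population std; replace zeros so constant groups do not divide by zero.
        self.stds_ = grouped.std(ddof=0).replace(0, 1.0)
        return self

    def transform(self, X, y=None):
        df = pd.DataFrame(X)
        group_col = df.columns[0]
        num = df[df.columns[1:]].astype(float)
        # Look up each row's group statistics; unseen groups become NaN.
        means = self.means_.reindex(df[group_col]).to_numpy()
        stds = self.stds_.reindex(df[group_col]).to_numpy()
        return (num.to_numpy() - means) / stds


# Toy data, for illustration only.
toy = pd.DataFrame({'party': ['A', 'A', 'B', 'B'],
                    'amount': [1.0, 3.0, 10.0, 30.0]})
scaled = GroupStandardScaler().fit_transform(toy)
```

Because the result is computed as a new array instead of being assigned back into a slice of the input frame, this variant should not trigger pandas' `SettingWithCopyWarning`.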

### Baseline Model

In [None]:
# TODO

### Final Model

In [None]:
# TODO

### Fairness Evaluation

In [None]:
# TODO

In [13]:
swp = pd.read_csv('stock_with_party.csv')

In [14]:
swp.head()

Unnamed: 0,disclosure_year,disclosure_date,transaction_date,owner,ticker,asset_description,type,amount,representative,district,ptr_link,cap_gains_over_200_usd,current_party,avg_amount
0,2021,2021-10-04,2021-09-27,joint,BP,BP plc,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0
1,2021,2021-10-04,2021-09-13,joint,XOM,Exxon Mobil Corporation,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0
2,2021,2021-10-04,2021-09-10,joint,ILPT,Industrial Logistics Properties Trust - Common...,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0
3,2021,2021-10-04,2021-09-28,joint,PM,Phillip Morris International Inc,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0
4,2021,2021-10-04,2021-09-17,self,BLK,BlackRock Inc,sale_partial,"$1,001 - $15,000",Hon. Alan S. Lowenthal,CA47,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0


In [15]:
swp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   disclosure_year         15919 non-null  int64  
 1   disclosure_date         15919 non-null  object 
 2   transaction_date        15919 non-null  object 
 3   owner                   8499 non-null   object 
 4   ticker                  14589 non-null  object 
 5   asset_description       15915 non-null  object 
 6   type                    15919 non-null  object 
 7   amount                  15919 non-null  object 
 8   representative          15919 non-null  object 
 9   district                15919 non-null  object 
 10  ptr_link                15919 non-null  object 
 11  cap_gains_over_200_usd  15919 non-null  bool   
 12  current_party           15919 non-null  object 
 13  avg_amount              15919 non-null  float64
dtypes: bool(1), float64(1), int64(1), obje

In [None]:
def state_to_region(entry, values):
    # Return the region whose state list contains `entry`, or None if none matches.
    for key, val in values.items():
        if entry in val:
            return key
    return None

In [18]:
state_region = {'West': ['CO', 'WY', 'MT', 'ID', 'WA', 'OR', 'UT', 'NV', 'CA', 'AK', 'HI'],
    'Southwest': ['TX', 'OK', 'NM', 'AZ'],
    'Midwest': ['OH', 'IN', 'MI', 'IL', 'MO', 'WI', 'MN', 'IA', 'KS', 'NE', 'SD', 'ND'],
    'Southeast': ['WV', 'VA', 'KY', 'TN', 'NC', 'SC', 'GA', 'AL', 'MS', 'AR', 'LA', 'FL'],
    'Northeast': ['ME', 'MA', 'RI', 'CT', 'NH', 'VT', 'NY', 'PA', 'NJ', 'DE', 'MD']}
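
A region-to-states dictionary like the one above can be inverted into a state-to-region lookup and applied with `Series.map`. The sketch below uses a shortened dictionary and made-up district codes. The names `region_lists`, `state_lookup`, and `districts` are illustrative, not part of the project.

```python
import pandas as pd

# Shortened version of a {region: [states]} dictionary, for illustration only.
region_lists = {
    'West': ['CA', 'CO'],
    'Southeast': ['NC', 'FL'],
}

# Invert {region: [states]} into {state: region} so each lookup is a single dict access.
state_lookup = {state: region
                for region, states in region_lists.items()
                for state in states}

# Toy district codes; the first two characters are the state abbreviation.
districts = pd.Series(['NC05', 'CA47', 'CO07', 'ZZ01'])
regions = districts.str[:2].map(state_lookup)
```

With `map`, a code that is missing from the lookup becomes `NaN` instead of falling into a default region, which makes unmapped states easier to spot.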


In [24]:
west = ['CO', 'WY', 'MT', 'ID', 'WA', 'OR', 'UT', 'NV', 'CA', 'AK', 'HI']
southwest = ['TX', 'OK', 'NM', 'AZ']
midwest = ['OH', 'IN', 'MI', 'IL', 'MO', 'WI', 'MN', 'IA', 'KS', 'NE', 'SD', 'ND']
southeast = ['WV', 'VA', 'KY', 'TN', 'NC', 'SC', 'GA', 'AL', 'MS', 'AR', 'LA', 'FL']
northeast = ['ME', 'MA', 'RI', 'CT', 'NH', 'VT', 'NY', 'PA', 'NJ', 'DE', 'MD']

In [29]:
to_region = swp['district'].str[0:2].apply(lambda x: 'west' if x in west
                                           else 'southwest' if x in southwest
                                           else 'midwest' if x in midwest
                                           else 'southeast' if x in southeast
                                           else 'northeast')
swp['region'] = to_region
swp.head()

Unnamed: 0,disclosure_year,disclosure_date,transaction_date,owner,ticker,asset_description,type,amount,representative,district,ptr_link,cap_gains_over_200_usd,current_party,avg_amount,region
0,2021,2021-10-04,2021-09-27,joint,BP,BP plc,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0,southeast
1,2021,2021-10-04,2021-09-13,joint,XOM,Exxon Mobil Corporation,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0,southeast
2,2021,2021-10-04,2021-09-10,joint,ILPT,Industrial Logistics Properties Trust - Common...,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0,southeast
3,2021,2021-10-04,2021-09-28,joint,PM,Phillip Morris International Inc,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0,southeast
4,2021,2021-10-04,2021-09-17,self,BLK,BlackRock Inc,sale_partial,"$1,001 - $15,000",Hon. Alan S. Lowenthal,CA47,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0,west


In [88]:
len(swp['ticker'].unique())

2231

In [106]:
len(swp['type'].unique())

5

In [55]:
swp['transaction_date'] = pd.to_datetime(swp['transaction_date'])

In [56]:
swp['transaction_date'].dt.month.apply(lambda x:1 if x<=3 else 2 if x<=6 else 3 if x<=9 else 4)

0        3
1        3
2        3
3        3
4        3
        ..
15914    2
15915    2
15916    1
15917    1
15918    1
Name: transaction_date, Length: 15919, dtype: int64
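
The month-to-quarter mapping above matches the calendar quarter, which pandas exposes directly as `Series.dt.quarter`. A small check on toy dates (the `dates` values are made up for illustration):

```python
import pandas as pd

# Toy dates, one per calendar quarter.
dates = pd.to_datetime(pd.Series(['2021-09-27', '2020-04-09', '2020-03-13', '2020-12-31']))

# Quarter computed with the chained conditional on the month number.
by_lambda = dates.dt.month.apply(
    lambda m: 1 if m <= 3 else 2 if m <= 6 else 3 if m <= 9 else 4)

# Quarter computed with the built-in datetime accessor.
by_accessor = dates.dt.quarter
```

Both approaches yield the same quarter numbers for these dates, so `dt.quarter` could likely replace the lambda.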

In [68]:
swp[['transaction_date']].iloc[:,0]

0       2021-09-27
1       2021-09-13
2       2021-09-10
3       2021-09-28
4       2021-09-17
           ...    
15914   2020-04-09
15915   2020-04-09
15916   2020-03-13
15917   2020-03-13
15918   2020-03-13
Name: transaction_date, Length: 15919, dtype: datetime64[ns]

In [107]:
by_season = Pipeline([
        ('quarter', FunctionTransformer(lambda x:x.iloc[:,0].dt.month
                                        .apply(lambda x:1 if x<=3 else 2 if x<=6 else 3 if x<=9 else 4).to_frame(),
                                        validate=False)),
        ('ohe', OneHotEncoder(handle_unknown='ignore')),
    ]) #change transaction date to quarter and then onehot encode it

In [181]:
preproc = ColumnTransformer([
        ('to_quarter', by_season, ['transaction_date']),
        ('std', StdScalerByGroup(), ['current_party','avg_amount']),
        ('one_hot',  OneHotEncoder(handle_unknown='ignore'), ['disclosure_year', 'owner','type', 'current_party']),
        ('binary', Binarizer(), ['cap_gains_over_200_usd'])
        ], remainder='passthrough')
    
pl = Pipeline([
        ('pre', preproc),
        ('clf', RandomForestClassifier(max_depth=8))
    ])

In [182]:
all_owner = swp[swp['owner'].notna()]
all_owner

Unnamed: 0,disclosure_year,disclosure_date,transaction_date,owner,ticker,asset_description,type,amount,representative,district,ptr_link,cap_gains_over_200_usd,current_party,avg_amount,region
0,2021,2021-10-04,2021-09-27,joint,BP,BP plc,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0,southeast
1,2021,2021-10-04,2021-09-13,joint,XOM,Exxon Mobil Corporation,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0,southeast
2,2021,2021-10-04,2021-09-10,joint,ILPT,Industrial Logistics Properties Trust - Common...,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0,southeast
3,2021,2021-10-04,2021-09-28,joint,PM,Phillip Morris International Inc,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0,southeast
4,2021,2021-10-04,2021-09-17,self,BLK,BlackRock Inc,sale_partial,"$1,001 - $15,000",Hon. Alan S. Lowenthal,CA47,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0,west
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15909,2020,2020-06-10,2020-04-22,self,AAPL,Apple Inc.,sale_full,"$1,001 - $15,000",Hon. Ed Perlmutter,CO07,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0,west
15910,2020,2020-06-10,2020-04-22,self,COST,Costco Wholesale Corporation,sale_partial,"$1,001 - $15,000",Hon. Ed Perlmutter,CO07,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0,west
15911,2020,2020-06-10,2020-03-18,self,COST,Costco Wholesale Corporation,purchase,"$1,001 - $15,000",Hon. Ed Perlmutter,CO07,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0,west
15912,2020,2020-06-10,2020-04-22,self,FB,"Facebook, Inc. - Class A",sale_full,"$1,001 - $15,000",Hon. Ed Perlmutter,CO07,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0,west


In [189]:
X = swp[['disclosure_year', 'transaction_date', 'avg_amount','type', 'owner', 'current_party', 'cap_gains_over_200_usd']].fillna('_')

In [190]:
X

Unnamed: 0,disclosure_year,transaction_date,avg_amount,type,owner,current_party,cap_gains_over_200_usd
0,2021,2021-09-27,8000.0,purchase,joint,Republican,False
1,2021,2021-09-13,8000.0,purchase,joint,Republican,False
2,2021,2021-09-10,32500.0,purchase,joint,Republican,False
3,2021,2021-09-28,32500.0,purchase,joint,Republican,False
4,2021,2021-09-17,8000.0,sale_partial,self,Democratic,False
...,...,...,...,...,...,...,...
15914,2020,2020-04-09,8000.0,sale_partial,_,Democratic,False
15915,2020,2020-04-09,8000.0,sale_partial,_,Democratic,False
15916,2020,2020-03-13,175000.0,sale_full,_,Republican,False
15917,2020,2020-03-13,750000.0,sale_full,_,Republican,False


In [191]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    swp.region,
                                                    test_size = 0.25)
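
One thing to consider with a row-level split: each representative appears in many rows and the region is derived from the representative's district, so the same representative can land in both the training and test sets. A group-aware split is one possible alternative. The sketch below uses scikit-learn's `GroupShuffleSplit` on made-up data; the `toy` frame and its values are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy trades: two rows per (hypothetical) representative.
toy = pd.DataFrame({
    'representative': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'avg_amount': [8000.0, 32500.0, 8000.0, 8000.0, 32500.0, 8000.0, 175000.0, 8000.0],
    'region': ['west', 'west', 'southeast', 'southeast',
               'midwest', 'midwest', 'west', 'west'],
})

# Split by representative so no one appears on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(
    toy[['avg_amount']], toy['region'], groups=toy['representative']))

train_reps = set(toy['representative'].iloc[train_idx])
test_reps = set(toy['representative'].iloc[test_idx])
```

Here `test_size` is interpreted as a fraction of the groups, not of the rows.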

In [192]:
pl.fit(X_train, y_train)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub_df[col] = (sub_df[col]- sub_mean)/sub_std


Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('to_quarter',
                                                  Pipeline(steps=[('quarter',
                                                                   FunctionTransformer(func=<function <lambda> at 0x00000195249AB790>)),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['transaction_date']),
                                                 ('std', StdScalerByGroup(),
                                                  ['current_party',
                                                   'avg_amount']),
                                                 ('one_hot',
                                                  OneHotEncoder(handle_unknown='ignore'),
                 

In [193]:
pl.score(X_train, y_train)

0.5915068263673674

In [194]:
pl.score(X_test, y_test)

0.5726130653266331
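
To put an accuracy score in context, it can help to compare it against a classifier that always predicts the most common class. A minimal sketch with scikit-learn's `DummyClassifier` follows. The labels are made up, and `X_toy` and `y_toy` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels only; these are not the notebook's data.
y_toy = np.array(['west'] * 5 + ['southeast'] * 3 + ['northeast'] * 2)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent training label.
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
```

In this notebook the equivalent call would be `DummyClassifier(strategy='most_frequent').fit(X_train, y_train).score(X_test, y_test)`.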