# Stock Trades by Members of the US House of Representatives

* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    - Can you predict the party affiliation of a representative from their stock trades?
    - Can you predict the geographic region that the representative comes from using their stock trades? E.g., west coast, east coast, south, etc.
    - Can you predict whether a particular trade is a BUY or SELL?

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
TODO

### Baseline Model
TODO

### Final Model
TODO

### Fairness Evaluation
TODO

# Code

In [29]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import sys
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

### Baseline Model

In [32]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer
from sklearn.ensemble import RandomForestClassifier  # used as the classifier step of the pipeline




In [7]:
# TODO

### Final Model

In [8]:
# TODO

### Fairness Evaluation

In [9]:
# TODO

In [33]:
from sklearn.base import BaseEstimator, TransformerMixin

class StdScalerByGroup(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        :Example:
        >>> cols = {'g': ['A', 'A', 'B', 'B'], 'c1': [1, 2, 2, 2], 'c2': [3, 1, 2, 0]}
        >>> X = pd.DataFrame(cols)
        >>> std = StdScalerByGroup().fit(X)
        >>> std.grps_ is not None
        True
        """
        # X might not be a pandas DataFrame (e.g. a np.array)
        df = pd.DataFrame(X)

        # Compute and store the means/standard-deviations for each column (e.g. 'c1' and 'c2'), 
        # for each group (e.g. 'A', 'B', 'C').  
        # (Our solution uses a dictionary)
        # Group by the first column's label so that column is excluded from the aggregation
        self.grps_ = df.groupby(df.columns[0]).agg(['mean', 'std']).to_dict()

        return self

    def transform(self, X, y=None):
        """
        :Example:
        >>> cols = {'g': ['A', 'A', 'B', 'B'], 'c1': [1, 2, 3, 4], 'c2': [1, 2, 3, 4]}
        >>> X = pd.DataFrame(cols)
        >>> std = StdScalerByGroup().fit(X)
        >>> out = std.transform(X)
        >>> out.shape == (4, 2)
        True
        >>> np.isclose(out.abs(), 0.707107, atol=0.001).all().all()
        True
        """

        try:
            getattr(self, "grps_")
        except AttributeError:
            raise RuntimeError("You must fit the transformer before transforming the data!")
        
        # Hint: Define a helper function here!

        df = pd.DataFrame(X)
        name = df.columns[0]
        out_lst = []

        def find_attr(col, attr, group):
            return self.grps_[(col, attr)][group]
        
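        # Iterate over the group labels present in X
        groups = df[name].unique()
        for gr in groups:
            # Copy so the assignment below does not write into a view of df
            sub_df = df[df[name] == gr].copy()
            for col in sub_df.columns[1:]:
                sub_mean = find_attr(col, 'mean', gr)
                sub_std = find_attr(col, 'std', gr)
                sub_df[col] = (sub_df[col] - sub_mean) / sub_std
            out_lst.append(sub_df)

        # Restore the original row order so the output stays aligned with X
        out = pd.concat(out_lst).sort_index()

        return out.drop(columns=name)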
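
As a cross-check on the per-group standardization above, the same z-scores can be sketched with `groupby(...).transform`. This is only an illustrative alternative on the toy frame from the `transform` doctest, not the implementation used in the class; the variable names `grouped` and `z` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the doctest in `transform`: the first column is the group label.
X = pd.DataFrame({
    'g':  ['A', 'A', 'B', 'B'],
    'c1': [1, 2, 3, 4],
    'c2': [1, 2, 3, 4],
})

# Standardize each numeric column within its group.
# pandas' `std` is the sample standard deviation (ddof=1) by default.
grouped = X.groupby('g')[['c1', 'c2']]
z = (X[['c1', 'c2']] - grouped.transform('mean')) / grouped.transform('std')

print(z)
```

Because `transform` returns a result indexed like the input, the rows of `z` stay in the same order as the rows of `X`, which matters when the output is stacked next to other columns in a `ColumnTransformer`.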

In [10]:
swp = pd.read_csv('stock_with_party.csv')

In [11]:
swp.head()

Unnamed: 0,disclosure_year,disclosure_date,transaction_date,owner,ticker,asset_description,type,amount,representative,district,ptr_link,cap_gains_over_200_usd,current_party,avg_amount
0,2021,2021-10-04,2021-09-27,joint,BP,BP plc,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0
1,2021,2021-10-04,2021-09-13,joint,XOM,Exxon Mobil Corporation,purchase,"$1,001 - $15,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,8000.0
2,2021,2021-10-04,2021-09-10,joint,ILPT,Industrial Logistics Properties Trust - Common...,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0
3,2021,2021-10-04,2021-09-28,joint,PM,Phillip Morris International Inc,purchase,"$15,001 - $50,000",Hon. Virginia Foxx,NC05,https://disclosures-clerk.house.gov/public_dis...,False,Republican,32500.0
4,2021,2021-10-04,2021-09-17,self,BLK,BlackRock Inc,sale_partial,"$1,001 - $15,000",Hon. Alan S. Lowenthal,CA47,https://disclosures-clerk.house.gov/public_dis...,False,Democratic,8000.0


In [17]:
swp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   disclosure_year         15919 non-null  int64  
 1   disclosure_date         15919 non-null  object 
 2   transaction_date        15919 non-null  object 
 3   owner                   8499 non-null   object 
 4   ticker                  14589 non-null  object 
 5   asset_description       15915 non-null  object 
 6   type                    15919 non-null  object 
 7   amount                  15919 non-null  object 
 8   representative          15919 non-null  object 
 9   district                15919 non-null  object 
 10  ptr_link                15919 non-null  object 
 11  cap_gains_over_200_usd  15919 non-null  bool   
 12  current_party           15919 non-null  object 
 13  avg_amount              15919 non-null  float64
dtypes: bool(1), float64(1), int64(1), obje

In [12]:
swp['disclosure_date']

0        2021-10-04
1        2021-10-04
2        2021-10-04
3        2021-10-04
4        2021-10-04
            ...    
15914    2020-06-10
15915    2020-06-10
15916    2020-06-10
15917    2020-06-10
15918    2020-06-10
Name: disclosure_date, Length: 15919, dtype: object
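
The output above shows that `disclosure_date` is stored with dtype `object`, that is, as strings. A minimal sketch of parsing both date columns follows, using the first three rows from `swp.head()`. The derived column name `disclosure_lag_days` is hypothetical, and whether such a feature is available at the "time of prediction" depends on the prediction question chosen.

```python
import pandas as pd

# First three rows of the two date columns, copied from swp.head() above.
sample = pd.DataFrame({
    'disclosure_date':  ['2021-10-04', '2021-10-04', '2021-10-04'],
    'transaction_date': ['2021-09-27', '2021-09-13', '2021-09-10'],
})

# Parse the strings into datetimes; unparseable values become NaT instead of raising.
sample['disclosure_date'] = pd.to_datetime(sample['disclosure_date'], errors='coerce')
sample['transaction_date'] = pd.to_datetime(sample['transaction_date'], errors='coerce')

# Hypothetical derived feature: days between the trade and its disclosure.
sample['disclosure_lag_days'] = (
    sample['disclosure_date'] - sample['transaction_date']
).dt.days

print(sample)
```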

In [25]:
df = swp[['district']]
df.iloc[:,0].str[0:2]

0        NC
1        NC
2        NC
3        NC
4        CA
         ..
15914    CO
15915    CO
15916    TX
15917    TX
15918    TX
Name: district, Length: 15919, dtype: object

In [None]:
state_transform = Pipeline([
        ('split', FunctionTransformer(lambda x:x.iloc[:,0].str[0:2].to_frame(), validate=False)),
        ('ohe', OneHotEncoder(handle_unknown='ignore')),
    ])
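
To see what the state pipeline above produces, here is a small self-contained sketch on a toy `district` column. It uses a named function in place of the lambda (an assumption made so the step is easier to inspect and pickle). `NC05` and `CA47` come from `swp.head()`, while the other district codes are illustrative.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def district_to_state(frame):
    """Keep the first two characters of the first column (the state code)."""
    return frame.iloc[:, 0].str[:2].to_frame()

state_pipeline = Pipeline([
    ('split', FunctionTransformer(district_to_state, validate=False)),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# Toy training data: two real district codes and two illustrative ones.
train = pd.DataFrame({'district': ['NC05', 'CA47', 'CA12', 'TX22']})
encoded = state_pipeline.fit_transform(train).toarray()
states = list(state_pipeline.named_steps['ohe'].categories_[0])

# A state not seen during fit is encoded as an all-zero row because of handle_unknown='ignore'.
unseen = state_pipeline.transform(pd.DataFrame({'district': ['CO02']})).toarray()

print(states)
print(encoded)
print(unseen)
```

With `handle_unknown='ignore'`, a representative from a state absent in the training split does not raise an error at prediction time; the row simply carries no state indicator.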


In [None]:
preproc = ColumnTransformer([
        ('to_state', state_transform, ['district']), # extract and one-hot encode the state of each district
        ('std', StdScalerByGroup(), ['current_party','avg_amount']),
        # NOTE: 'Sex' is not a column of swp (see the swp.info() output); replace it with a categorical column that exists
        ('one_hot',  OneHotEncoder(handle_unknown='ignore'), ['Sex']),
        ], remainder='passthrough')

In [None]:
pl = Pipeline([
        ('pre', preproc),
        ('clf', RandomForestClassifier(max_depth=10))
    ])
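
A pipeline of this shape would typically be fit on a training split and scored on held-out rows. The sketch below shows that workflow on synthetic data only: the frame `toy`, the binary target `is_purchase`, and the simplified preprocessor `toy_preproc` are all hypothetical stand-ins, and the resulting accuracy says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Synthetic stand-in for the trades table; the values are random, not real trades.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'district': rng.choice(['NC05', 'CA47', 'TX22', 'CO02'], size=n),
    'avg_amount': rng.choice([8000.0, 32500.0], size=n),
    'is_purchase': rng.integers(0, 2, size=n),  # hypothetical binary target
})

def district_to_state(frame):
    """Keep the first two characters of the first column (the state code)."""
    return frame.iloc[:, 0].str[:2].to_frame()

# One-hot encode the state; pass the numeric amount through unchanged.
toy_preproc = ColumnTransformer([
    ('to_state', Pipeline([
        ('split', FunctionTransformer(district_to_state, validate=False)),
        ('ohe', OneHotEncoder(handle_unknown='ignore')),
    ]), ['district']),
], remainder='passthrough')

toy_pl = Pipeline([
    ('pre', toy_preproc),
    ('clf', RandomForestClassifier(max_depth=10, random_state=0)),
])

X = toy.drop(columns='is_purchase')
y = toy['is_purchase']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training split only, then score on rows the model has not seen.
toy_pl.fit(X_train, y_train)
accuracy = toy_pl.score(X_test, y_test)
print(f'held-out accuracy: {accuracy:.2f}')
```

Fitting the preprocessing steps inside the pipeline means the encoder only learns categories from the training split, which keeps information from the test rows out of the model.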