## Example: creating the required pairwise data file (y and Xs) from individual actor-level data. This file is the input to a mixed effects model that estimates, for each feature describing a pair of actors, an effect size on the rate at which the pair agrees in its decisions.

### Note: the main code used here is provided in create_pairwise_data_file.py. The files produced in this tutorial can be created by running `python create_pairwise_data_file.py --data data/example/ --output data/example/output/`, which creates and saves the pairwise data file (data/example/output/pairwise_data.csv). Results of the mixed effects model can then be obtained by running `Rscript mixed_effects_model.R data/example/decisions.csv data/example/output/pairwise_data.csv data/example/output/results.txt`

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from utils import *
from create_pairwise_data_file import *

### Required data directory structure: each data directory must have one CSV file named **decisions.csv** and one subdirectory named **features/**. None of the CSV files should have headers/column names; create the files without column names and in the format described below (data/example/ provides examples of the file formats).

**decisions.csv**: Contains the categorical *decisions* made by every *actor* on each of multiple *items*. Each row corresponds to one actor: the first column contains the actor's ID, and the subsequent columns contain the integer values of the decisions made by that actor.

**features/**: Contains multiple CSV files, where each CSV file corresponds to a particular feature and holds its values for each actor. Features can be of two types: *scalar* (one value quantifying each actor) and *vector* (a vector of more than one value quantifying each actor). NOTE: Name each of these feature files appropriately, since the file name is used as the feature name when interpreting the results.
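Before running the pipeline, the layout described above can be sanity-checked with a few lines of Python. The sketch below is only an illustration: `check_data_dir` is a hypothetical helper (it is not part of create_pairwise_data_file.py or utils.py), and the toy decisions are made up. It checks only what is stated above: that **decisions.csv** exists, has an ID column followed by integer decision columns, and that **features/** exists and contains CSV files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def check_data_dir(data_dir):
    """Return a list of problems found in a data directory (empty if none).

    Hypothetical helper for illustration; not part of the repository code.
    """
    data_dir = Path(data_dir)
    problems = []

    decisions_path = data_dir / "decisions.csv"
    if not decisions_path.is_file():
        problems.append("missing decisions.csv")
    else:
        decisions = pd.read_csv(decisions_path, header=None)
        # First column holds actor IDs; the remaining columns must be integers.
        # A stray header row would turn these columns into non-integer dtypes.
        if decisions.shape[1] < 2:
            problems.append("decisions.csv needs an ID column plus decisions")
        elif not all(
            pd.api.types.is_integer_dtype(dtype)
            for dtype in decisions.dtypes.iloc[1:]
        ):
            problems.append("decision columns must be integers")

    features_dir = data_dir / "features"
    if not features_dir.is_dir():
        problems.append("missing features/ subdirectory")
    elif not list(features_dir.glob("*.csv")):
        problems.append("features/ contains no CSV files")

    return problems


# Toy example (made-up values): three actors, four decisions, no header row.
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame([[101, 1, 0, 1, 1], [102, 1, 1, 0, 1], [103, 0, 0, 1, 0]])
    toy.to_csv(Path(tmp) / "decisions.csv", header=False, index=False)
    problems_found = check_data_dir(tmp)

print(problems_found)  # ['missing features/ subdirectory']
```

Here the toy directory deliberately lacks a **features/** subdirectory, so the check reports exactly that one problem.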

In [2]:
data_base_path = 'data/example/' #path to the data directory - formatted as described above. 

### Loading the actor-level decisions data file: in our example data, the _actors_ are legislators (US House members, 109th Congress), and the _decisions_ are their votes on congressional bills (_items_); only a few key votes are considered for every legislator in this data.

In [3]:
decisions_df = pd.read_csv(Path(data_base_path) / 'decisions.csv', header = None)
print(decisions_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 433 entries, 0 to 432
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       433 non-null    int64
 1   1       433 non-null    int64
 2   2       433 non-null    int64
 3   3       433 non-null    int64
 4   4       433 non-null    int64
 5   5       433 non-null    int64
 6   6       433 non-null    int64
 7   7       433 non-null    int64
 8   8       433 non-null    int64
 9   9       433 non-null    int64
 10  10      433 non-null    int64
 11  11      433 non-null    int64
 12  12      433 non-null    int64
 13  13      433 non-null    int64
 14  14      433 non-null    int64
 15  15      433 non-null    int64
 16  16      433 non-null    int64
 17  17      433 non-null    int64
 18  18      433 non-null    int64
 19  19      433 non-null    int64
 20  20      433 non-null    int64
 21  21      433 non-null    int64
 22  22      433 non-null    int64
 23  23      433 non

In [4]:
decisions_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,9789,0,1,0,0,0,1,0,1,1,...,0,0,1,0,1,1,0,0,0,0
1,9790,0,1,0,0,0,1,0,1,1,...,0,0,1,0,1,0,0,0,0,0
2,9738,1,1,0,0,0,1,1,1,1,...,0,0,1,0,0,0,0,1,0,0
3,9739,1,1,1,1,0,1,1,1,1,...,0,1,0,0,0,1,1,1,1,1
4,9737,1,0,1,1,0,0,1,0,1,...,1,1,0,1,1,1,1,1,1,1


#### The output above shows that the first column holds actor IDs (as required by the format), whereas all other columns represent bills; each cell is the vote ('yea' = 1, 'nay' = 0) cast on that bill by the corresponding legislator. In this data, we have votes (decisions) on 29 bills for 433 legislators (actors).

In [5]:
actor_ids = [str(actor_id) for actor_id in decisions_df[0]] #actor IDs as strings
decision_vals = decisions_df.iloc[:,1:].values #get the matrix of actors X decisions
print(decision_vals.shape)

(433, 29)


#### Next, we load all features for the individual actors. Consistent with the format, a scalar feature is stored in a CSV with just one column, whereas a vector feature has more than one column (this fact is used to load and store scalar and vector features separately).

In [6]:
print('Loading various actor-level features...')
scalar_features_to_values, vector_features_to_values = load_feature_values_dicts(data_base_path)

Loading various actor-level features...


In [7]:
print(scalar_features_to_values.keys())

dict_keys(['State', 'Party'])


In [8]:
print(vector_features_to_values.keys())

dict_keys(['Speech_TFIDF'])


#### For our example data, the features for legislators include their *State*, their *Party*, and TF-IDF vector representations of their floor speeches in the 109th congressional session (*Speech_TFIDF*). Actors can be represented with all kinds of features - scalar features can be any data type, but the vector features must contain numerical values only (and generally, real-valued). 

When computing a similarity between two vectors, such as cosine similarity, a vector of all 0s can result in null (NaN) values because its norm is zero. We recommend adding a small epsilon value to each element of the vector features (default behavior):

In [9]:
add_epsilon = True 
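The effect of the epsilon can be seen in a minimal sketch. This is not the implementation in utils.py: the function `cosine_similarity` below, the epsilon value `1e-8`, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(u, v, epsilon=0.0):
    """Cosine similarity after adding `epsilon` to every element of both vectors.

    Illustrative only; the epsilon value used by the repository may differ.
    """
    u = np.asarray(u, dtype=float) + epsilon
    v = np.asarray(v, dtype=float) + epsilon
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return float("nan")  # undefined: at least one vector has zero norm
    return float(np.dot(u, v) / denom)


zero_vec = [0.0, 0.0, 0.0]    # e.g. an actor with no recorded speech terms
speech_vec = [0.2, 0.0, 0.5]  # made-up TF-IDF weights

without_eps = cosine_similarity(zero_vec, speech_vec)
with_eps = cosine_similarity(zero_vec, speech_vec, epsilon=1e-8)
print(without_eps, with_eps)
```

Without the epsilon the similarity is undefined (NaN); with it, the all-zero vector becomes a tiny constant vector and the similarity is a finite number.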

Currently, the only similarity metric supported for vectors is cosine similarity, implemented in utils.py (however, more similarity metrics can be added and used by the user with minor modification to the main code). 

In [10]:
metric = 'Cosine'

#### For scalar features, pairwise values are obtained using a simple identity (exact-match) indicator: for example, the value of Same_State or Same_Party is 1 if the state/party values of the two actors in a pair match exactly, and -1 if not. For vector features, a similarity score between the vector representations of the two actors in a pair is computed and used as the pairwise feature value.
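As a rough illustration of the scalar case, the sketch below builds a Same_Party-style value for every unordered pair of actors. The helper name `same_value_indicator` and the toy party labels are hypothetical; the actual computation is done inside `get_pairwise_data_elements`.

```python
from itertools import combinations


def same_value_indicator(a, b):
    """Return 1 if the two scalar values match exactly, -1 otherwise."""
    return 1 if a == b else -1


# Made-up actor IDs and party labels.
party = {"101": "D", "102": "R", "103": "D"}

same_party = {
    (i, j): same_value_indicator(party[i], party[j])
    for i, j in combinations(sorted(party), 2)
}
print(same_party)
# {('101', '102'): -1, ('101', '103'): 1, ('102', '103'): -1}
```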

In [11]:
#below can take some time, depending on the number of actors in the data...

print('Computing pairwise-level data required for the codecision model...')
pairwise_actor_ids, pairwise_codecision_agreement_rate, pairwise_scalar_feature_to_same_identity_vals, pairwise_vector_feature_to_similarity_vals, removed_pairs, indiv_actors1, indiv_actors2 = get_pairwise_data_elements(actor_ids, decision_vals, scalar_features_to_values, vector_features_to_values, metric, add_epsilon)

#use above pairwise data to create a dataframe which can be stored.
pairwise_df = get_pairwise_dataframe(pairwise_actor_ids, pairwise_codecision_agreement_rate, pairwise_scalar_feature_to_same_identity_vals, pairwise_vector_feature_to_similarity_vals, indiv_actors1, indiv_actors2, metric)

Computing pairwise-level data required for the codecision model...


In [12]:
output_path = 'data/example/output/' #stores all the output files created - pairwise data, removed actor pairs, and results


#### Below, we store the pairs of legislators removed for having perfect agreement or disagreement (on their votes) following Ringe et al. (2013) [Ringe, Nils, Jennifer Nicoll Victor, and Justin H. Gross. "Keeping your friends close and your enemies closer? Information networks in legislative politics." British Journal of Political Science 43, no. 3 (2013): 601-628.]

In [13]:
#write one removed pair per line; the with-block closes the file automatically
with open(Path(output_path) / "removed_pair_actor_ids.txt", 'w') as f:
    for pair in removed_pairs:
        f.write(str(pair))
        f.write('\n')
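For intuition about which pairs end up in the removed list, here is a minimal sketch of the filtering step. It assumes the agreement rate of a pair is the fraction of items on which the two actors made the same decision; the function `agreement_rates` and the toy votes are hypothetical, and `get_pairwise_data_elements` may handle details (such as missing decisions) differently.

```python
from itertools import combinations

import numpy as np


def agreement_rates(actor_ids, decision_vals):
    """Split actor pairs into kept (0 < rate < 1) and removed (rate 0 or 1)."""
    kept, removed = {}, []
    for i, j in combinations(range(len(actor_ids)), 2):
        # Fraction of items on which the two actors made the same decision.
        rate = float(np.mean(decision_vals[i] == decision_vals[j]))
        pair = (actor_ids[i], actor_ids[j])
        if rate in (0.0, 1.0):
            removed.append(pair)  # perfect agreement or perfect disagreement
        else:
            kept[pair] = rate
    return kept, removed


toy_ids = ["101", "102", "103", "104"]
toy_votes = np.array([
    [1, 0, 1, 1],
    [1, 0, 1, 1],  # identical to actor 101
    [0, 1, 0, 0],  # opposite of actor 101
    [1, 1, 0, 1],
])

kept, removed = agreement_rates(toy_ids, toy_votes)
print(kept)
# {('101', '104'): 0.5, ('102', '104'): 0.5, ('103', '104'): 0.5}
print(removed)
# [('101', '102'), ('101', '103'), ('102', '103')]
```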

In [14]:
print('Storing pairwise-level data required for the codecision modeling...')
#store the pairwise data output - used as input to mixed effects modeling in R
pairwise_df.to_csv(Path(output_path) / "pairwise_data.csv", index=False)
print('Done.')

Storing pairwise-level data required for the codecision modeling...
Done.


#### The pairwise data is now stored in the provided output path. The next step is to run the R code that fits the mixed effects generalized linear model: `Rscript mixed_effects_model.R data/example/decisions.csv data/example/output/pairwise_data.csv data/example/output/results.txt`

## To compare how much each feature explains the co-vote agreement rate, we look at the _Fixed effects_ section of **results.txt**:

```
Fixed effects:

                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             0.302620   0.023180   13.05   <2e-16 ***
Same_State              0.055361   0.003641   15.21   <2e-16 ***
Same_Party              0.829807   0.001770  468.94   <2e-16 ***
Speech_TFIDF_Cosine_Sim 1.045378   0.034412   30.38   <2e-16 ***
```

### The table above shows that cosine similarity between speeches (represented as TF-IDF vectors) has the highest estimate (1.045), followed by same party (0.830) and same state (0.055).
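If the comparison needs to be done programmatically, the fixed-effects rows can be read from the text output with a small parser. The sketch below is one possible approach, not part of the repository: `parse_fixed_effects` and the regular expression are assumptions about the row layout shown above, and the excerpt is copied from the table rather than read from **results.txt**.

```python
import re

# Rows copied from the Fixed effects table above.
results_excerpt = """\
(Intercept)             0.302620   0.023180   13.05   <2e-16 ***
Same_State              0.055361   0.003641   15.21   <2e-16 ***
Same_Party              0.829807   0.001770  468.94   <2e-16 ***
Speech_TFIDF_Cosine_Sim 1.045378   0.034412   30.38   <2e-16 ***
"""

# name, estimate, standard error, z value (the p-value column is ignored).
row_pattern = re.compile(r"^(\S+)\s+(-?\d+\.\d+)\s+(\d+\.\d+)\s+(-?\d+\.\d+)\s")


def parse_fixed_effects(text):
    """Map each term name to its (estimate, standard error)."""
    effects = {}
    for line in text.splitlines():
        match = row_pattern.match(line.strip())
        if match:
            name, estimate, std_error, _z_value = match.groups()
            effects[name] = (float(estimate), float(std_error))
    return effects


effects = parse_fixed_effects(results_excerpt)
ranked = sorted(
    (name for name in effects if name != "(Intercept)"),
    key=lambda name: effects[name][0],
    reverse=True,
)
print(ranked)
# ['Speech_TFIDF_Cosine_Sim', 'Same_Party', 'Same_State']
```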