# The Deconfounder simulation (with the death penalty dataset)
The following code follows the steps presented by Yixin Wang in her <a href="https://github.com/Lenchickk/deconfounder_tutorial/blob/master/deconfounder_tutorial.ipynb">deconfounder tutorial</a> and applies it to our death penalty dataset.

The (eventual) purpose of this notebook is to draft an application of the deconfounder for nlp problems and extensively large datasets in general.

# Initialization steps
Now we repeat the initialization steps from Wang's code.

Warning: Python 3 is necessary (with Python 2 you will get a bunch of errors). If your normal compiler is Python 2, one possibility is to create a separate environment in Anaconda with Python 3 and then switch to it (in the following code this environment is 'revelead' by the part of the file path with /envs/py36). Importantly, it is easy to switch back if necessary. 

In [72]:
!pip install tensorflow_probability



In [13]:
import tensorflow as tf
import numpy as np
import numpy.random as npr
import pandas as pd
import tensorflow as tf
import tensorflow_probability as tfp
import statsmodels.api as sm

from tensorflow_probability import edward2 as ed
from sklearn.datasets import load_breast_cancer
from pandas.tools.plotting import scatter_matrix
from scipy import sparse, stats
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

import matplotlib
matplotlib.rcParams.update({'font.sans-serif' : 'Helvetica',
                            'axes.labelsize': 10,
                            'xtick.labelsize' : 6,
                            'ytick.labelsize' : 6,
                            'axes.titlesize' : 10})
import matplotlib.pyplot as plt

import seaborn as sns
color_names = ["windows blue",
               "amber",
               "crimson",
               "faded green",
               "dusty purple",
               "greyish"]

#type(color_names)


  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)


In [76]:
!pip show tensorflow

Name: tensorflow
Version: 1.11.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /Users/elenalabzina/anaconda2/envs/py36/lib/python3.6/site-packages
Requires: termcolor, setuptools, keras-applications, absl-py, tensorboard, astor, keras-preprocessing, gast, grpcio, six, protobuf, wheel, numpy
Required-by: tensorflow-probability


In [77]:
!pip show tensorflow_probability

Name: tensorflow-probability
Version: 0.4.0
Summary: Probabilistic modeling and statistical inference in TensorFlow
Home-page: http://github.com/tensorflow/probability
Author: Google LLC
Author-email: no-reply@google.com
License: Apache 2.0
Location: /Users/elenalabzina/anaconda2/envs/py36/lib/python3.6/site-packages
Requires: six, tensorflow, numpy
Required-by: 


In [14]:
# set random seed so everyone gets the same number
import random
randseed = 123
print("random seed: ", randseed)
random.seed(randseed)
np.random.seed(randseed)
tf.set_random_seed(randseed)

random seed:  123


# The death penalty dataset


In [15]:
import csv

datafile = '/Users/elenalabzina/Documents/GitHub/DeconfounderAnalysis/data/death-penalty-cases.csv'
csv_file = open(datafile, mode='r')
csv_reader = csv.DictReader(csv_file)
data = pd.read_csv(datafile, encoding='utf-8')

Let's have a look at our data:

In [16]:
data

Unnamed: 0,author_id,caseName,citeCount,cluster_id,court_id,dateFiled,snippet,state,year
0,,In Re Waiver of Death Penalty,8,1923143,nj,1965-09-14T00:00:00Z,N.J. 501 (1965)\n213 A.2d 20\nIN RE WAIVER OF ...,NJ,1965
1,4019.0,State v. Dixon,552,1876220,fla,1973-07-26T00:00:00Z,"whether the death penalty is, per se, unconsti...",FL,1973
2,5765.0,Jurek v. State,143,2450978,texcrimapp,1975-04-16T00:00:00Z,#39;s contention that the assessment of the de...,TX,1975
3,,In the Matter of Death Penalty Sentencing,0,891563,nm,2009-11-30T00:00:00Z,.3d 673 (2009)\n2009-NMSC-053\nIN THE MATTER O...,NM,2009
4,5758.0,Ex Parte Traxler,56,4162563,texcrimapp,1944-12-20T00:00:00Z,assume the district attorney orally waived the...,TX,1944
5,550.0,Canadian Coalition Against Death Penalty v. Ryan,0,2528242,azd,2003-05-19T00:00:00Z,"Against Death Penalty, Stop Prisoner Rape, Ci...",AZ,2003
6,,In the Matter of Death Penalty Sentencing Jury...,0,891562,nm,2009-11-30T00:00:00Z,.3d 674 (2009)\n2009-NMSC-052\nIN THE MATTER O...,NM,2009
7,,State v. Pat Bondurant (Death Penalty),0,1082874,tenncrimapp,1998-03-18T00:00:00Z,"views on the death\n\npenalty, three stated th...",TN,1998
8,,In Re Readoption With Amendments of Death Pena...,0,2351704,nj,2004-11-10T00:00:00Z,.J. 147\nIN RE READOPTION WITH AMENDMENTS OF D...,NJ,2004
9,5765.0,Ex Parte Caldwell,80,1738821,texcrimapp,1964-10-14T00:00:00Z,", 1964, this Court received the record of a de...",TX,1964


For our further analysis we will only use some of the features as independent variables. Importantly, the variables require additional preprocessing before we can start applying the deconfounder algorithm.  
    

In [17]:
df = pd.DataFrame(data[['year','state','court_id','snippet']])

In [18]:
df

Unnamed: 0,year,state,court_id,snippet
0,1965,NJ,nj,N.J. 501 (1965)\n213 A.2d 20\nIN RE WAIVER OF ...
1,1973,FL,fla,"whether the death penalty is, per se, unconsti..."
2,1975,TX,texcrimapp,#39;s contention that the assessment of the de...
3,2009,NM,nm,.3d 673 (2009)\n2009-NMSC-053\nIN THE MATTER O...
4,1944,TX,texcrimapp,assume the district attorney orally waived the...
5,2003,AZ,azd,"Against Death Penalty, Stop Prisoner Rape, Ci..."
6,2009,NM,nm,.3d 674 (2009)\n2009-NMSC-052\nIN THE MATTER O...
7,1998,TN,tenncrimapp,"views on the death\n\npenalty, three stated th..."
8,2004,NJ,nj,.J. 147\nIN RE READOPTION WITH AMENDMENTS OF D...
9,1964,TX,texcrimapp,", 1964, this Court received the record of a de..."


For year, state, and court_id we need to create dummies.  

In [19]:
dummy1 = pd.get_dummies(df['state'], drop_first=True)
dummy1.head()

Unnamed: 0,AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,...,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [20]:
dummy2 = pd.get_dummies(df['year'], drop_first=True)
dummy2.head()

Unnamed: 0,1863,1864,1865,1867,1875,1878,1882,1883,1886,1890,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
dummy3 = pd.get_dummies(df['court_id'], drop_first=True)
dummy3.head()

Unnamed: 0,ala,alacivapp,alacrimapp,alactapp,alaska,alaskactapp,almd,alnd,alsd,ared,...,wawd,wied,wis,wisctapp,wiwd,wva,wvnd,wvsd,wyd,wyo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
df = pd.concat([df, dummy1, dummy2, dummy3], axis=1)
df

Unnamed: 0,year,state,court_id,snippet,AL,AR,AZ,CA,CO,CT,...,wawd,wied,wis,wisctapp,wiwd,wva,wvnd,wvsd,wyd,wyo
0,1965,NJ,nj,N.J. 501 (1965)\n213 A.2d 20\nIN RE WAIVER OF ...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1973,FL,fla,"whether the death penalty is, per se, unconsti...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1975,TX,texcrimapp,#39;s contention that the assessment of the de...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2009,NM,nm,.3d 673 (2009)\n2009-NMSC-053\nIN THE MATTER O...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1944,TX,texcrimapp,assume the district attorney orally waived the...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,2003,AZ,azd,"Against Death Penalty, Stop Prisoner Rape, Ci...",0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,2009,NM,nm,.3d 674 (2009)\n2009-NMSC-052\nIN THE MATTER O...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1998,TN,tenncrimapp,"views on the death\n\npenalty, three stated th...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,2004,NJ,nj,.J. 147\nIN RE READOPTION WITH AMENDMENTS OF D...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1964,TX,texcrimapp,", 1964, this Court received the record of a de...",0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now let's look at the outcome variable. 

In [23]:
dfy = (data['citeCount']).values
dfy.shape, dfy[:100]

((32567,),
 array([   8,  552,  143,    0,   56,    0,    0,    0,    0,   80,  723,
          91,  374,   10,  200, 1101, 2383,  244,    4,   96,    2, 2741,
        2467, 2126, 2094,   71,   14,   56,  543, 1547,  197,  432,   32,
          18, 1523, 1400,    6,  213,  138,   59, 5330,   20,  124,   10,
         685,   76,  130,  512,  107, 1313,   75,   90,  107,   65, 1601,
         158,  298,    9,  133,    6,  855,  173,  717,   54,  704,  290,
          23,  138,   12, 1833,   72,   44,  203,   28,  153, 1315, 1243,
          80,  104,   68, 1326,   68, 1182,   23, 1273,   46,    9,  131,
         220,    9,  103,   79,  160,  127,   44,   29,   94,   31,  160,
          49]))

The distinction of our example from Wang's is that our outcome is not binary. Hence, we will need to use something else instead of the logit model for the last step of our analysis. 

# Text preprocessing of the snippet field

Now we need to prepare our data to apply the first step of the decounfounder. 
The snippet field is a textual field, hence we cannot use it right away. 

The distinction of our example from Wang's is that our outcome is not binary. Hence, we will need to use something else instead of the logit model for the last step of our analysis. 

# Text preprocessing of the snippet field

Now we need to prepare our data to apply the first step of the decounfounder. 

The distinction of our example from Wang's is that our outcome is not binary. Hence, we will need to use something else instead of the logit model for the last step of our analysis. 

# Text preprocessing of the snippet field

Now we need to prepare our data to apply the first step of the decounfounder. The first approach to quantify the text is to create dummies for the 2-grams. 

## Text Cleaning: Obtaining tokens

In [35]:
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords 
from string import ascii_lowercase
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

In [58]:
stemmer = SnowballStemmer("english", ignore_stopwords=True)
tokenizer = RegexpTokenizer(r'[a-z]+')

##setting up and extending stopwords
stop = stopwords.words('english')
stop.extend(['nevertheless','would','nether''the','in','may','also','zero','one','two','three','four','five','six','seven','eight','nine','ten','across','among','beside','however','yet','within']+list(ascii_lowercase))

  return concat([self.open(f).read() for f in fileids])
  return concat([self.open(f).read() for f in fileids])


Now let's transform the text into tokens. 

In [162]:
n = 3
tokens = []

for statement in df['snippet']:
    words = tokenizer.tokenize(statement.lower())
    #print(words)
    words_clean = []
    for word in words:
        if word in stop: 
            continue
        words_clean.append(stemmer.stem(word))
    tokens.append(words_clean)
    
#tokens
#list(ngrams(words,3))    
#text[0]
#for ngrams in text[0]:
  #  for ngram in ngrams:
  #      print(ngram)
   
    


Let's have a look at the tokens:

In [61]:
tokens[0:2]

[['waiver',
  'death',
  'penalti',
  'suprem',
  'court',
  'new',
  'jersey',
  'septemb',
  'hellip',
  'counti',
  'court',
  'judg',
  'waiver',
  'death',
  'penalti',
  'suprem',
  'court',
  'concern',
  'excess',
  'hellip',
  'case',
  'prosecutor',
  'seek',
  'death',
  'penalti',
  'cogniz',
  'fact',
  'situat',
  'hellip',
  'although',
  'prosecutor',
  'right',
  'waiv',
  'death',
  'penalti',
  'inform',
  'juri',
  'juri',
  'hellip',
  'assum',
  'prosecutor',
  'death',
  'penalti',
  'return',
  'ask'],
 ['whether',
  'death',
  'penalti',
  'per',
  'se',
  'unconstitut',
  'whether',
  'discretionari',
  'death',
  'penalti',
  'statut',
  'hellip',
  'death',
  'penalti',
  'novemb',
  'state',
  'effort',
  'reinstat',
  'death',
  'penalti',
  'discretionari',
  'hellip',
  'interest',
  'impos',
  'death',
  'penalti',
  'anoth',
  'death',
  'penalti',
  'statut',
  'differ',
  'constitut',
  'hellip',
  'impos',
  'state',
  'death',
  'penalti',
  'flori

There are many ways to quantify these data. Let's start by building 2-grams, collecting a dictionary, and creating dummies for the 2-grams occurencies. 

## Creating a dictionary for 3-grams

In [163]:
my_ngrams=[]
n=3

for token_group in tokens:
    ngram = ngrams(token_group,n)
    my_ngrams.append(ngram)

For example, the ngrams for the first document are:

In [135]:
#for gram in my_ngrams[0]:
  #  print(gram) 

Below we create a dictionary of 3-grams. Also, we create a dictionary that includes the number of occurencies to be able to filter some of the n-grams. 

In [164]:
ngrams_dict=dict()
ngrams_counter=dict()
index = -1
max_occ=1

for token_group in my_ngrams:
    for token in token_group:
        my_key = tuple(token)
        if my_key in ngrams_dict:
            ngrams_counter[my_key] = ngrams_counter[my_key]+1
            if (max_occ < ngrams_counter[my_key]): max_occ=ngrams_counter[my_key]
            continue
        index = index + 1
        ngrams_dict[my_key]=index
        ngrams_counter[my_key] = 1
        

In [165]:
import operator

ngrams_counter_sorted = sorted(ngrams_counter.items(), key=operator.itemgetter(1))

ngrams_counter_sorted

index, max_occ

(331746, 7198)

In [166]:
type(ngrams_counter_sorted)

list

In [186]:
ngrams_counter_sorted[(int)(331746*0.05)], ngrams_counter_sorted[(int)(331746*0.999)]

((('confer', 'upon', 'hellip'), 1), (('hellip', 'state', 'file'), 140))

In [187]:
#import matplotlib 

#ngrams_counter.values()
ngrams

for key in ngrams_counter.keys():
    if ngrams_counter[key]<20:
        del ngrams_counter[key]
        del ngrams_dict[key]


RuntimeError: dictionary changed size during iteration