We used three data sources, and the transcripts took up a lot of memory. Because we experimented with the data at different levels, the following steps outline how we loaded and cleaned it:

**Cornell Data**
* Create a folder with transcripts at the conversation level (entire case)
* Create a folder with transcripts at the utterance level (a few sentences)
* Create a dataframe of case information at the conversation level
* Create a dataframe of case information at the utterance level

Since we work with the data in different ways, we only add the transcripts to our dataframe right before we use them in our models. 

**Martin Quinn Data**
* Load in Martin Quinn scores and add to both transcript and utterance dataframes

**Washington University**
* Load in additional attribute data to transcript dataframes

## Cornell Data



Cornell collected Supreme Court transcripts and built a package to load them. We use it to build our dataframes.

In [None]:
#install 'convokit', Cornell's conversation-analysis package, which provides the Supreme Court corpus
!pip3 install convokit
!python3 -m spacy download en_core_web_sm

#mount our drive
from google.colab import drive
drive.mount('/content/drive')


#load packages
import pandas as pd
import numpy as np
from convokit import Corpus, download

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting convokit
  Downloading convokit-2.5.3.tar.gz (167 kB)
  Preparing metadata (setup.py) ... done
Collecting msgpack-numpy>=0.4.3.2
  Downloading msgpack_numpy-0.4.8-py2.py3-none-any.whl (6.9 kB)
Collecting dill>=0.2.9
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting clean-text>=0.1.1
  Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)
Collecting unidecode>=1.1.1
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Collecting emoji<2.0.0,>=1.0.0
  Downloading emoji-1.7.0.tar.gz (17

In [None]:
corpus = Corpus(filename=download("supreme-corpus"))

Downloading supreme-corpus to /root/.convokit/downloads/supreme-corpus
Downloading supreme-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/supreme-corpus/supreme-corpus.zip (1255.8MB)... Done


Now that we have the corpus of cases, we extract them at the conversation level. Each conversation has its own transcript, so it will be easy to match the two up later. However, because the transcripts take up so much storage, we wrote them to a folder on our Google Drive and only load them into our dataframe when building our deep neural networks (DNNs) later. The code that writes those files is below:

In [None]:
#dataframe for conversations
df = corpus.get_conversations_dataframe()


In [None]:
# add convo_id as column rather than index
df['convo_id'] = df.index
#remove index
df = df.reset_index().drop(['id'], axis=1)

In [None]:
#function to get transcript, given a convo_id
def get_transcript(convo_id, remove_return=False):

  # pull conversation object
  convo = corpus.get_conversation(convo_id)

  # from conversation object, create list of utterance objects (i.e. the text bits)
  convo_utts = list(convo.iter_utterances())
  
  # combine text data from all utterances 
  if remove_return:
    convo_transcript = [utt.text.replace('\n', ' ') for utt in convo_utts]
  else:
    convo_transcript = [utt.text for utt in convo_utts]

  # join elements of list 
  convo_transcript = ''.join(convo_transcript)
  
  return convo_transcript


# populate df with transcripts, one string per conversation
# (avoids shadowing the built-in `id` and silently swallowing errors)
df['case_transcript'] = df['convo_id'].apply(get_transcript)


# write transcripts to files, one file per case_id
# NOTE conversations that share a case_id write to the same file, so the last one wins
for case_id, transcript in zip(df['meta.case_id'], df['case_transcript']):
  # set path name
  path = f'/content/drive/MyDrive/INFO251Final/Transcripts_Case_Convo/{case_id}.txt'

  with open(path, 'w') as convo_transcript:
    convo_transcript.write(transcript)

In [None]:
df.to_csv('/content/drive/MyDrive/INFO251Final/ArgumentsTable.csv')

Because the Arguments table took up so much storage, we ended up working with dataframes that exclude the transcripts and only added the transcripts right before modeling.
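
A minimal sketch of how transcripts might be re-attached just before modeling, assuming the one-file-per-case layout written above (`<case_id>.txt`). The helper name `attach_transcripts`, the `case_id` column name, and the temporary folder standing in for the Drive folder are illustrative assumptions, not part of the notebook:

```python
import tempfile
from pathlib import Path

import pandas as pd


def attach_transcripts(df, folder, id_col="case_id"):
    """Return a copy of df with a 'case_transcript' column read from <folder>/<id>.txt.

    Rows whose file is missing get a missing value instead of raising.
    """
    folder = Path(folder)

    def read_one(case_id):
        path = folder / f"{case_id}.txt"
        return path.read_text(encoding="utf-8") if path.exists() else None

    out = df.copy()
    out["case_transcript"] = out[id_col].map(read_one)
    return out


# Toy demonstration: a temporary folder stands in for the Drive folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1955_71.txt").write_text("Mr. Chief Justice...", encoding="utf-8")
    toy = pd.DataFrame({"case_id": ["1955_71", "1955_410"], "win_side": [0, 1]})
    toy_with_text = attach_transcripts(toy, tmp)

print(toy_with_text)
```

Keeping the text out of the saved dataframe means the CSV stays small, and the transcript column only exists in memory for the cells that need it.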

Get dataframes of conversations and utterances WITHOUT actual transcripts attached

In [None]:
#dataframe for conversations
df_convo = corpus.get_conversations_dataframe()
df_convo

id,vectors,meta.case_id,meta.advocates,meta.win_side,meta.votes_side
13127,[],1955_71,"{'harry_f_murphy': {'side': 1, 'role': 'inferr...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
12997,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13024,[],1955_410,"{'howard_c_westwood': {'side': 1, 'role': 'inf...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13015,[],1955_351,"{'harry_d_graham': {'side': 3, 'role': 'inferr...",1,"{'j__john_m_harlan2': 1, 'j__hugo_l_black': 1,..."
13016,[],1955_38,"{'robert_n_gorman': {'side': 3, 'role': 'infer...",0,"{'j__john_m_harlan2': 0, 'j__hugo_l_black': 0,..."
...,...,...,...,...,...
24998,[],2019_19-635,"{'jay_alan_sekulow': {'side': 1, 'role': 'for ...",0,"{'j__john_g_roberts_jr': 0, 'j__clarence_thoma..."
24978,[],2019_19-46,"{'erica_l_ross': {'side': 1, 'role': 'Assistan...",0,"{'j__john_g_roberts_jr': 0, 'j__clarence_thoma..."
24979,[],2019_19-177,"{'christopher_g_michel': {'side': 1, 'role': '...",1,"{'j__john_g_roberts_jr': 1, 'j__clarence_thoma..."
24972,[],2019_18-1584,"{'anthony_a_yang': {'side': 1, 'role': 'for th...",1,"{'j__john_g_roberts_jr': 1, 'j__clarence_thoma..."


In [None]:
#dataframe for utterances
df_utt = corpus.get_utterances_dataframe()
df_utt.columns

Index(['timestamp', 'text', 'speaker', 'reply_to', 'conversation_id',
       'meta.case_id', 'meta.start_times', 'meta.stop_times',
       'meta.speaker_type', 'meta.side', 'meta.timestamp', 'vectors'],
      dtype='object')
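
The utterance-level transcript folder mentioned in the outline can be written the same way as the case-level one. The following is only a sketch: it assumes the utterance dataframe is indexed by utterance id and uses the `text` column listed above. The helper name `write_utterance_files`, the toy ids, and the temporary folder are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_utterance_files(df_utt, folder):
    """Write each row's 'text' to <folder>/<utterance id>.txt and return the file count."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    count = 0
    for utt_id, text in df_utt["text"].items():
        (folder / f"{utt_id}.txt").write_text(str(text), encoding="utf-8")
        count += 1
    return count


# Toy stand-in for corpus.get_utterances_dataframe(); the ids here are made up
toy_utts = pd.DataFrame(
    {"text": ["We'll hear argument next.", "Thank you, Mr. Chief Justice."]},
    index=pd.Index(["u0", "u1"], name="id"),
)

with tempfile.TemporaryDirectory() as tmp:
    n_written = write_utterance_files(toy_utts, tmp)
    first_file = Path(tmp, "u0.txt").read_text(encoding="utf-8")

print(n_written, first_file)
```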

## Martin Quinn Scores

In [None]:
# create dataframe of Martin Quinn scores to merge with dataframes
martin_quinn = pd.read_csv('/content/drive/MyDrive/INFO251Final/MartinQuinnScores.csv')

Now we restructure the dataframes and merge in the scores.

In [None]:
# set to run with utterances or conversations 
utts = False

In [None]:
def clean_utts(df_utt):
    # updating columns
    # NOTE expects a 'term_year' (or 'term') column on the incoming dataframe

    # rename columns
    df_utt = df_utt.rename(columns={'meta.votes_side': 'votes_side',
                                      'meta.win_side': 'win_side',
                                      'meta.case_id': 'case_id',
                                      'conversation_id': 'convo_id',
                                      'term_year': 'term'})

    df_utt['term'] = df_utt['term'].astype('int64')

    # add MartinQuinn Scores (the 'med' column), then rename it to 'mq_score'
    # (renaming before the merge has no effect because 'med' does not exist yet)
    df_utt = df_utt.merge(martin_quinn[['term', 'med']], on='term')
    df_utt = df_utt.rename(columns={'med': 'mq_score'})

    # drop unused columns (errors='ignore' skips any that are absent)
    # NOTE we may want to try an analysis on some of these later on
    df_utt = df_utt.drop(columns=['speaker', 
                                  'reply_to', 
                                  'timestamp', 
                                  'meta.start_times', 
                                  'meta.stop_times', 
                                  'meta.speaker_type',
                                  'meta.side',
                                  'meta.timestamp',
                                  'vectors',
                                  'Unnamed: 0'], errors='ignore')

    # add "win_side" to utterance dataframe
    df_utt = df_utt.merge(df_convo[['convo_id', 'win_side']], on='convo_id')

    df_utt = df_utt.rename(columns={'text': 'words'})

    # Remove instances where the case outcome was not a clear win for either side (win_side of -1 or 2)
    df_utt = df_utt[~df_utt['win_side'].isin([-1, 2])]

    # Remove null values from the few cases with incomplete data
    df_utt = df_utt.dropna()

    return df_utt

In [None]:
df_utt = clean_utts(df_utt)

AttributeError: ignored
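
The `AttributeError` above most likely comes from accessing `df_utt.term`: the utterance columns listed earlier include neither `term` nor `term_year`, and attribute access on a missing column raises this error. One possible way to supply the column is sketched below. It assumes the prefix of `meta.case_id` (for example `1955` in `1955_71`) is the term year; both that assumption and the helper name `add_term` are illustrative and not confirmed by the corpus documentation:

```python
import pandas as pd


def add_term(df, case_col="meta.case_id"):
    """Return a copy of df with an integer 'term' column taken from the case id prefix."""
    out = df.copy()
    # '1955_71' -> '1955' -> 1955 (assumes ids look like '<year>_<docket>')
    out["term"] = out[case_col].str.split("_").str[0].astype("int64")
    return out


toy = pd.DataFrame({"meta.case_id": ["1955_71", "2019_19-635"]})
toy = add_term(toy)
print(toy)
```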

In [None]:
#fix the one malformed score; regex=False treats the '.' characters literally
df_utt['mq_score'] = df_utt['mq_score'].str.replace('0.162.5', '0.162', regex=False)
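
A small illustration of why the pattern in this replacement deserves care. The default of the `regex` argument to `Series.str.replace` has differed between pandas versions, and under regex matching each `.` matches any character. The score strings below are made up for the demonstration:

```python
import pandas as pd

scores = pd.Series(["0.162.5", "0516215", "0.731"])

# Literal replacement: only the exact text '0.162.5' is changed
literal = scores.str.replace("0.162.5", "0.162", regex=False)

# Regex replacement: each '.' matches any character, so '0516215' also matches
as_regex = scores.str.replace("0.162.5", "0.162", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

Passing `regex=False` explicitly keeps the behavior the same regardless of the installed pandas version.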

In [None]:
#save utterance dataframe to Drive
df_utt.to_csv('/content/drive/MyDrive/INFO251Final/Utterances_Dataframe_CleanedMerged.csv')

In [None]:
def clean_convos(df_convo):
  # Remove instances where the case outcome was not a clear win for either side (win_side of -1 or 2)
  # (filtering into a copy avoids modifying the caller's dataframe in place)
  df_convo = df_convo[~df_convo['win_side'].isin([-1, 2])].copy()

  # convert term from object to int
  df_convo['term'] = df_convo['term'].astype('int64')

  # drop unused columns
  # NOTE we may want to try an analysis on some of these later on. Save memory now
  df_convo = df_convo.drop(columns=['Unnamed: 0', 'vectors', 'advocates', 'votes_side'])

  # 3 cases have null data due to oddities of transcribing data pre-digital transcripts
  df_convo = df_convo.dropna()

  return df_convo

In [None]:
df_convo = clean_convos(df_convo)

In [None]:
#fix the one malformed score; regex=False treats the '.' characters literally
df_convo['mq_score'] = df_convo['mq_score'].str.replace('0.162.5', '0.162', regex=False)

Now we merge with the Washington University data. Because we only used this data for our simple case-level models, we only need to merge it with our df_convo dataframe.

In [None]:
#load in washington data

# error with Wash U data file so we need to find file encoding and add when reading
# commented out because now that it is discovered, no need to re-run each time

#!pip install chardet

#import chardet    
#rawdata = open('/content/drive/MyDrive/INFO251Final/WashU_onerowpercaseid.csv', 'rb').read()
#result = chardet.detect(rawdata)
#charenc = result['encoding']
#print(charenc)
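
If `chardet` is not available, a rough alternative is to try a short list of candidate encodings with the standard library. This is a sketch only: the helper name `first_working_encoding`, the candidate list, and the toy file are assumptions, and a successful decode does not prove the encoding is the right one.

```python
import tempfile
from pathlib import Path


def first_working_encoding(path, candidates=("utf-8", "Windows-1252")):
    """Return the first candidate encoding that decodes the whole file, else None."""
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Toy file containing byte 0x92 (a curly apostrophe in Windows-1252), which is invalid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, "sample.csv")
    sample.write_bytes(b"caseName\nDoe\x92s Estate v. Roe\n")
    detected = first_working_encoding(sample)

print(detected)
```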

In [None]:
# create dataframe from Washington University data
df_wash = pd.read_csv('/content/drive/MyDrive/INFO251Final/WashU_onerowpercaseid.csv', encoding='Windows-1252')

In [None]:
#Transform some data to words instead of numerical values?

# Column: Issue Area
# issue area is listed as numerical value but they correspond to different categories
# create dictionary of column integers and corresponding meanings
data_issueArea = ({'Integer Value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14],
                'Issue Area':['Criminal Procedure', 'Civil Rights', 'First Amendment', 'Due Process',
                                      'Privacy', 'Attorneys', 'Unions', 'Economic Activity', 'Judicial Power',
                                      'Federalism', 'Interstate Relations', 'Federal Taxation', 'Miscellaneous',
                                      'Private Action']})
# turn into dataframe
df_issueArea = pd.DataFrame(data_issueArea)
# replace the values in Wash U with the words
df_issueArea.set_index('Integer Value', inplace=True)
df_wash['issueArea'] = df_wash['issueArea'].map(df_issueArea['Issue Area'])
df_issueArea.reset_index(inplace=True)
print(df_wash['issueArea'])






# Column: lcDispositionDirection
# lower court disposition direction is listed as numerical value but they correspond to different categories
# create dictionary of column integers and corresponding meanings
data_lcdd = ({'Integer Value':[1,2,3],
                'direction':['conservative', 'liberal', 'unspecifiable']})

# turn into dataframe
df_lcdd = pd.DataFrame(data_lcdd)
# replace the values in Wash U with the words
df_lcdd.set_index('Integer Value', inplace=True)
df_wash['lcDispositionDirection'] = df_wash['lcDispositionDirection'].map(df_lcdd['direction'])
df_lcdd.reset_index(inplace=True)
print(df_wash['lcDispositionDirection'])
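
The same code-to-label replacement can also be written with a plain dictionary passed to `Series.map`. The sketch below uses toy codes rather than the Washington University file; the variable names are illustrative, and the code `9` is deliberately made up to show what happens to values missing from the mapping:

```python
import pandas as pd

direction_labels = {1: "conservative", 2: "liberal", 3: "unspecifiable"}

# 9 is a made-up code that is not in the mapping
codes = pd.Series([1, 2, 3, 2, 9], name="lcDispositionDirection")
labels = codes.map(direction_labels)

# Codes missing from the mapping become NaN, so it is worth counting them
n_unmapped = int(labels.isna().sum())
print(labels.tolist(), n_unmapped)
```

Counting the unmapped values after the replacement is a cheap check that the lookup table covers every code in the column.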


In [None]:
#get only important columns
df_wash_important = df_wash[['caseId', 'issueArea', 'lcDispositionDirection']]

#change case column so it matches df_convo
df_wash_important = df_wash_important.rename(columns={'caseId':'case_id'})
df_wash_important['case_id'] = df_wash_important['case_id'].str.replace('-', '_')

df_wash_important.shape

In [None]:
#create additional case_id column in df_convo so that we can match in same format as df_wash_important

#because later we will need to match transcripts based on original case_id, we create a 
#duplicate column here to match and then delete extra so transcripts can be matched later
df_convo['og_case_id'] = df_convo['case_id']

In [None]:
#for df_convo, split the case_id column into two parts at the underscore, and add '0' if necessary
df_convo['case_id'] = df_convo['case_id'].apply(lambda x: '{}_{}'.format(x.split('_')[0], x.split('_')[1].zfill(3)))

#strip down any docket id info from case_id as well
df_convo['case_id'] = df_convo['case_id'].str[:8]

df_convo['case_id'].nunique()
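
To make the effect of the padding and the 8-character slice concrete, here is the same transformation as a standalone function applied to three case ids taken from the conversation table shown earlier. The function name `normalize_case_id` is illustrative:

```python
def normalize_case_id(case_id):
    """Zero-pad the docket part to 3 characters, then keep the first 8 characters."""
    parts = case_id.split("_")
    term, docket = parts[0], parts[1]
    return f"{term}_{docket.zfill(3)}"[:8]


examples = ["1955_71", "1955_410", "2019_19-635"]
normalized = [normalize_case_id(c) for c in examples]
print(normalized)
```

The first id gains a leading zero and the second is unchanged, while the hyphenated docket in the third is cut down to `2019_19-` by the slice. Whether truncated ids like this still find a partner in the Washington University `caseId` column depends on that file, which is not shown here.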

In [None]:
#merge washington data with original cornell data
df_merged = pd.merge(df_convo, df_wash_important, on='case_id')

#drop rows with no recorded outcome (the default inner merge already drops unmatched cases)
df_merged = df_merged.dropna(subset=['win_side'])

print(df_merged.head())
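
Because `pd.merge` performs an inner join by default, conversations without a matching `case_id` disappear silently. A quick way to see which ones fail to match is a left join with `indicator=True`. The frames below are toy data, not the real tables:

```python
import pandas as pd

left = pd.DataFrame({"case_id": ["1955_071", "1955_410", "2019_19-"], "win_side": [0, 1, 0]})
right = pd.DataFrame({"case_id": ["1955_071", "1955_410"], "issueArea": ["Civil Rights", "Unions"]})

# how='left' keeps every conversation; indicator=True adds a '_merge' column
checked = pd.merge(left, right, on="case_id", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "case_id"].tolist()
print(unmatched)
```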

In [None]:
#delete current case_id and replace with og_case_id so it is easier to match transcript format later
df_convo = df_convo.drop('case_id', axis=1)

df_convo = df_convo.rename(columns={'og_case_id':'case_id'})

In [None]:
df_merged.shape

In [None]:
df_merged.to_csv('/content/drive/MyDrive/INFO251Final/Outcomes_Dataframe_CleanedMerged.csv')
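
A note on the `'Unnamed: 0'` column that the cleaning functions drop: it is likely the saved index, since `to_csv` writes the index by default and `read_csv` then reads it back as an unnamed column. The sketch below shows the round trip on a toy dataframe using an in-memory buffer instead of a Drive path:

```python
import io

import pandas as pd

toy = pd.DataFrame({"case_id": ["1955_71"], "win_side": [0]})

# Default to_csv writes the index as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# Writing with index=False avoids the extra column on reload
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded_clean = pd.read_csv(buffer)

print(reloaded_default.columns.tolist())
print(reloaded_clean.columns.tolist())
```

Reading with `index_col=0` is an alternative when the file has already been written with its index.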