# Normalization and more pre-processing of the data

In this notebook, further steps are taken to pre-process and normalize the data. The data set in the _data/df_preprocessed.csv_ file was obtained by merging the data sets of the different months of the trial, built with the code of the _data_preprocessing.ipynb_ notebook (i.e. appending the data sets _January.csv_, _February.csv_, ... one under the other).
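
The merge described above can be sketched as follows. This is only an illustration, not the code of the original notebook: the folder layout, the `merge_months` helper and the toy rows are assumptions; only the monthly file names (_January.csv_, _February.csv_) come from the description.

```python
import os
import tempfile
import pandas as pd

def merge_months(folder, months):
    """Stack the monthly CSV files one under the other, in the given order."""
    frames = [pd.read_csv(os.path.join(folder, f"{month}.csv")) for month in months]
    # ignore_index=True renumbers the rows 0..n-1 in the merged frame
    return pd.concat(frames, ignore_index=True)

# Toy demonstration with two tiny, made-up monthly files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"person": ["#THE COURT:"], "speech": ["GOOD MORNING."]}).to_csv(
        os.path.join(tmp, "January.csv"), index=False)
    pd.DataFrame({"person": ["#MS. CLARK:"], "speech": ["THANK YOU."]}).to_csv(
        os.path.join(tmp, "February.csv"), index=False)
    merged = merge_months(tmp, ["January", "February"])

print(merged)
```

The months are passed explicitly rather than collected with `glob`, since `glob.glob` does not guarantee any particular order and an alphabetical sort would not be chronological.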

In [1]:
import pandas as pd
import os
import glob
import numpy as np

# 1. Cleaning and some exploratory analysis 

In [3]:
# Import the data
df = pd.read_csv('df_preprocessed.csv')
df

Unnamed: 0,person,speech,date,time
0,0,0,0,0
1,#APPEARANCES:,(APPEARANCES AS HERETOFORE NOTED.)THE JURY:)\n,"APRIL 13, 1995\n",9:20 A.M.
2,#THE COURT:,"ALL RIGHT. GOOD MORNING, COUNSEL.\n","APRIL 13, 1995\n",9:20 A.M.
3,#MR. COCHRAN:,"GOOD MORNING, YOUR HONOR.\n","APRIL 13, 1995\n",9:20 A.M.
4,#MR. SHAPIRO:,"GOOD MORNING, YOUR HONOR.\n","APRIL 13, 1995\n",9:20 A.M.
...,...,...,...,...
313894,#THE COURT:,SO YOU SHOULD ANTICIPATE THAT. ANYTHING ELSE? ...,"SEPTEMBER 29, 1995",9:04 A.M.
313895,#MS. CLARK:,"THANK YOU, YOUR HONOR.\n","SEPTEMBER 29, 1995",9:04 A.M.
313896,#THE COURT:,WE'LL BE IN RECESS.\n,"SEPTEMBER 29, 1995",9:04 A.M.
313897,#MS. CLARK:,CAN WE AT SIDEBAR WITHOUT THE COURT REPORTER?\n,"SEPTEMBER 29, 1995",9:04 A.M.


## Some generic cleaning and data exploration

In [4]:
df[df['speech'].isnull()] # check na

Unnamed: 0,person,speech,date,time
6577,#MR. SCHECK:,,"APRIL 18, 1995\n",9:05 A.M.
38275,#MR. CLARKE:,,"AUGUST 4, 1995\n",9:07 A.M.
50365,#MR. NEUFELD:,,"AUGUST 16, 1995",9:40 A.M.
50889,#MR. NEUFELD:,,"AUGUST 16, 1995",9:40 A.M.
244718,#MR. GOLDBERG:,,"MAY 1, 1995\n",10:05 A.M.
300260,#MR. BLASIER:,,"SEPTEMBER 12, 1995",9:50 A.M.


In [5]:
# Eliminate rows containing descriptions of the trial that were used for preprocessing (manually detected)
match = ['#APPEAR', '#LOS ANGELES,', '#DEPARTMENT NO', 'ALL THE MEMB', 'ALL', 'DETECTIVE', '#LADIES', '0']
# Flag the rows whose speaker label starts with one of the prefixes above, then drop them
is_description = df['person'].str.startswith(tuple(match), na=False)
df = df[~is_description].reset_index(drop=True)

In [6]:
# Check the dimension of the data set
df.shape

(312818, 4)

In [7]:
# Substitute some errors in the transcripts (manually detected)
# A greeting was merged into the speaker label: restore the plain label
df['person'] = df['person'].replace(
    ['MR. SHAPIRO:GOOD AFTERNOON, LADIES AND GENTLEMEN OF THE JURY.\n'], '#MR. SHAPIRO: ')
# Labels containing '402' are mapped to the witness label '[LUPER]'
df.loc[df['person'].str.contains('402', regex=False), 'person'] = '[LUPER]'

In [8]:
# Correct some details in the person column to make the text homogeneous
# (assign the result back: an in-place replace on a selected column may not update df)
df["person"] = df["person"].replace({
    'MS. CLARK:\n': '#MS. CLARK: ',
    '##MS. CLARK: ': '#MS. CLARK: ',
    '##MR. BLASIER: ': '#MR. BLASIER: ',
    'MR. COCHRAN:\n': '#MR. COCHRAN: ',
    'MS. LEWIS:\n' : '#MS. LEWIS: ',
    '#DET. FUHRMAN: ': "['FUHRMAN,']",
    'MS. COCHRAN:\n': '#MR. COCHRAN: ',
    'MR. SCHECK:\n':'#MR. SCHECK: ',
    '##MR. KELBERG: ': '#MR. KELBERG: ',
    '##MR. CLARKE: ': '#MR. CLARKE: ',
    '##MR. NEUFELD: ': '#MR. NEUFELD: ',
    '##MR. HARMON: ': '#MR. HARMON: ',
    'MR. GOLDBERG:\n':'#MR. GOLDBERG: ',
    'MR. GORDON:\n':'#MR. GORDON: ',
    'MR. DARDEN:\n':'#MR. DARDEN: ',
    'MR. DOUGLAS:\n':'#MR. DOUGLAS: ',
    'MR. BAILEY:\n':'#MR. BAILEY: ',
    'MR. SHAPIRO:\n':'#MR. SHAPIRO: ',
    'MS. CHAPMAN:\n':'#MS. CHAPMAN: '})
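
The manual mapping above mostly repairs two recurring patterns: a label missing its leading `#` and ending in a newline, and a label with a doubled `##`. As a sketch only, these two patterns could be handled by a small helper. `normalize_label` is a hypothetical name, the toy labels are made up, and special cases such as the `MS. COCHRAN` and `FUHRMAN` corrections would still need the explicit mapping.

```python
import pandas as pd

def normalize_label(label):
    """Return the label as '#NAME: ' (one leading '#', one trailing space)."""
    if label.startswith('['):
        # Witness labels in square brackets are left untouched
        return label
    core = label.strip().lstrip('#')   # drop surrounding whitespace and any leading '#'
    return '#' + core + ' '

labels = pd.Series(['MS. CLARK:\n', '##MR. BLASIER: ', '#THE COURT: ', '[LUPER]'])
normalized = labels.map(normalize_label)
print(normalized.tolist())
```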

In [9]:
# Exploratory analysis: number of speeches (rows) per person, i.e. non-missing values in each column
counts = df.groupby('person').count().sort_values(['speech'], ascending=False)
counts

Unnamed: 0_level_0,speech,date,time
person,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
#THE COURT:,45198,44930,45198
#MS. CLARK:,29516,29351,29516
#MR. COCHRAN:,27873,27868,27873
#MR. SCHECK:,16415,16416,16416
#MR. DARDEN:,14085,14069,14085
...,...,...,...
#JUROR NO. 353:,1,1,1
#DEPUTY RUSSELL:,1,1,1
#MR. ORMAN:,1,1,1
#MS. LEVIN:,1,1,1
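
In the table above the three columns differ slightly for some speakers (for example 16415 against 16416 for `#MR. SCHECK:`). This is because `count()` counts the non-missing values of each column, so a row with a missing speech or date is not counted in that column. The toy example below, with made-up rows, contrasts `count()` with `size()`, which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'person': ['#A:', '#A:', '#B:'],
    'speech': ['HELLO', np.nan, 'YES'],
    'date':   ['APRIL 13, 1995', 'APRIL 13, 1995', np.nan],
})

non_missing = toy.groupby('person').count()   # non-missing values per column
n_rows = toy.groupby('person').size()         # number of rows per speaker

print(non_missing)
print(n_rows)
```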


Define some lists containing the relevant groups of people. Except for the _witnesses_ list, the names were compiled manually by consulting sources about the trial:

- the __dream team__: attorneys defending OJ Simpson
- the __prosecution__: attorneys prosecuting OJ Simpson
- the __experts__: experts called by the dream team and the prosecution to give opinions about some aspects of the trial
- the __witnesses__: witnesses of the trial. Their names are enclosed in square brackets, because of how the text has been pre-processed

In [11]:
dream_team = ['SHAPIRO','COCHRAN','BAILEY', 'DERSHOWITZ', 'KARDASHIAN', 'HOLLEY', 'DOUGLAS', 'UELMEN', 'SCHECK', 'NEUFELD', 'BLASIER', 'THOMPSON', 'CHAPMAN', 'CAPLAN']
prosecution = ['CLARK', 'HODGMAN', 'DARDEN', 'KELBERG', 'HARMON', 'LEWIS', 'GORDON', 'BODIN', 'GOLDBERG', 'YOCHELSON', 'DARREL' , 'LYNCH']
experts = ['CLARKE', 'DR. LAKSHMANAN', 'MR. SIMS', 'MATHESON', 'MAZZOLA', 'DR. GERDES', 'DR. COTTON', 'DEEDRICK', 'BROCKBANK', 'LEE', 'DR. WEIR']
witness_match = ['[']

In [12]:
# Count of speech by people group
defense_count = list()
prosecution_count = list()
experts_count = list()
nrow_counts = counts.shape[0]
witness_count = list()
for row in range(0, nrow_counts):
    string = counts.index[row]
    if any(ext in string for ext in dream_team):
        defense_count.append(counts.iloc[row, 0])
    elif any(ext in string for ext in prosecution):
        prosecution_count.append(counts.iloc[row, 0])
    elif any(ext in string for ext in experts):
        experts_count.append(counts.iloc[row, 0])
    elif any(ext in string for ext in witness_match):
        witness_count.append(counts.iloc[row, 0])
defense_volume = sum(defense_count)
prosecution_volume = sum(prosecution_count)
experts_volume = sum(experts_count)
witness_volume = sum(witness_count)
count_grouped = pd.DataFrame({'defense': [defense_volume], 'prosecution': [prosecution_volume], 'experts': [experts_volume], 'witnesses':[witness_volume]})
count_grouped

Unnamed: 0,defense,prosecution,experts,witnesses
0,79383,82948,34085,35109
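
One caveat of the substring test `ext in string` used above: `'CLARK' in '#MR. CLARKE:'` is true, so a label containing CLARKE is caught by the prosecution branch before the experts branch is reached. As a hedged sketch, not the notebook's method, whole-word matching with a regular expression avoids this. `assign_group` and the shortened name lists are hypothetical.

```python
import re

groups = {
    'prosecution': ['CLARK', 'DARDEN'],
    'experts': ['CLARKE', 'MATHESON'],
}

def assign_group(label):
    """Return the first group with a name that appears as a whole word in the label."""
    for group, names in groups.items():
        for name in names:
            # \b marks a word boundary, so 'CLARK' does not match inside 'CLARKE'
            if re.search(r'\b' + re.escape(name) + r'\b', label):
                return group
    return None

print(assign_group('#MS. CLARK:'))    # prosecution
print(assign_group('#MR. CLARKE:'))   # experts
```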


In [13]:
# Remove trailing newline characters (\n) from the speech
df['speech'] = df['speech'].str.rstrip("\n")

In [14]:
# Add a column containing the count of words per document (document = speech)
df['number_of_words'] = df.speech.apply(lambda x: len(str(x).split()))
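
Note that `str(x)` turns a missing speech into the string `'nan'`, so the rows with a missing speech found earlier get a word count of 1 rather than 0. A minimal sketch on made-up rows shows this, together with a variant that counts 0 for missing values (the `count_words` helper is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'speech': ["WE'LL BE IN RECESS.", np.nan]})

def count_words(text):
    """Number of whitespace-separated tokens; 0 for missing values."""
    return len(text.split()) if isinstance(text, str) else 0

as_in_notebook = toy['speech'].apply(lambda x: len(str(x).split()))
with_guard = toy['speech'].apply(count_words)

print(as_in_notebook.tolist())  # [4, 1]
print(with_guard.tolist())      # [4, 0]
```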

In [17]:
# Eliminate documents containing non relevant and short strings:
# i.e. documents containing greetings or brief acknowledgements and having fewer than 6 words

short_patterns = 'GOOD MORNING|GOOD AFTERNOON|GOOD EVENING|YES|NO|CORRECT|ALL RIGHT|ALRIGHT'
# na=False keeps the rows with a missing speech out of the mask
is_short_filler = df['speech'].str.contains(short_patterns, na=False) & (df['number_of_words'] < 6)
df = df[~is_short_filler].reset_index(drop=True)
df

Unnamed: 0,person,speech,date,time,number_of_words
0,#THE COURT:,BACK ON THE RECORD IN THE SIMPSON MATTER. MR. ...,"APRIL 13, 1995\n",9:20 A.M.,28
1,#THE COURT:,ALL RIGHT. THE PEOPLE ARE REPRESENTED BY MR. G...,"APRIL 13, 1995\n",9:20 A.M.,48
2,#MR. GOLDBERG:,"YOUR HONOR, MY UNDERSTANDING OF THE COURT'S OR...","APRIL 13, 1995\n",9:20 A.M.,65
3,#MR. SCHECK:,"WELL, YOUR HONOR, BEFORE WE LEFT WE GAVE MR. G...","APRIL 13, 1995\n",9:20 A.M.,69
4,#THE COURT:,"WELL, AS I RECALL, AS WE ENDED THE COURT DAY I...","APRIL 13, 1995\n",9:20 A.M.,18
...,...,...,...,...,...
252003,#THE COURT:,"I'LL ISSUE A DELIBERATION SCHEDULE, AN ANTICIP...","SEPTEMBER 29, 1995",9:04 A.M.,70
252004,#THE COURT:,SO YOU SHOULD ANTICIPATE THAT. ANYTHING ELSE? ...,"SEPTEMBER 29, 1995",9:04 A.M.,13
252005,#THE COURT:,WE'LL BE IN RECESS.,"SEPTEMBER 29, 1995",9:04 A.M.,4
252006,#MS. CLARK:,CAN WE AT SIDEBAR WITHOUT THE COURT REPORTER?,"SEPTEMBER 29, 1995",9:04 A.M.,8
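
`str.contains` performs substring (regular-expression) matching, so a pattern such as `'YES|NO|CORRECT'` also matches short speeches like `I DON'T KNOW.` or `NOT YET.`, because `NO` occurs inside `KNOW` and `NOT`. Whether such speeches should be dropped is a modelling choice; the sketch below, on made-up speeches, only illustrates the difference between the plain pattern and a whole-word variant using `\b`.

```python
import pandas as pd

toy = pd.Series(["NO.", "I DON'T KNOW.", "YES, YOUR HONOR.", "NOT YET."])

plain = toy.str.contains('YES|NO|CORRECT')
# (?:...) is a non-capturing group; \b requires a word boundary on both sides
whole_word = toy.str.contains(r'\b(?:YES|NO|CORRECT)\b')

print(plain.tolist())       # [True, True, True, True]
print(whole_word.tolist())  # [True, False, True, False]
```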


In [18]:
# Eliminate part of text used for preprocessing (see ```preprocessing_transctipts_text()``` function), not relevant to text mining

# (marker, keyword): when the marker occurs in a speech, keep only the text before the keyword.
# The RE- variants come first, so that 'RECROSS'/'REDIRECT' are not cut at 'CROSS'/'DIRECT'.
markers = [
    ('CALLED AS A WITNESS BY', 'CALLED'),
    ('THE WITNESS ON THE STAND AT THE TIME', 'WITNESS'),
    ('HAVING BEEN PREVIOUSLY SWORN', 'HAVING'),
    ('RECROSS-EXAMINATIONBY', 'RECROSS'),
    ('REDIRECT EXAMINATIONBY', 'REDIRECT'),
    ('RECROSS-EXAMINATION (RESUMED)BY', 'RECROSS'),
    ('REDIRECT EXAMINATION (RESUMED)BY', 'REDIRECT'),
    ('CROSS-EXAMINATIONBY', 'CROSS'),
    ('DIRECT EXAMINATIONBY', 'DIRECT'),
    ('CROSS-EXAMINATION (RESUMED)BY', 'CROSS'),
    ('DIRECT EXAMINATION (RESUMED)BY', 'DIRECT'),
]

def strip_marker(text):
    # Leave missing values untouched
    if not isinstance(text, str):
        return text
    for marker, keyword in markers:
        # Substring test on the whole marker (not on its single characters)
        if marker in text:
            return text.partition(keyword)[0]
    return text

df['speech'] = df['speech'].apply(strip_marker)
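
Two details matter when testing for these markers. A substring test is written `marker in text`, whereas iterating over a string yields its single characters. And since `CROSS-EXAMINATIONBY` is contained in `RECROSS-EXAMINATIONBY`, the RE- variants have to be tested first; otherwise the cut happens inside the word and leaves a trailing `RE`. A minimal sketch on made-up speeches, assuming the text before the keyword is what should be kept (`cut_at_marker` is a hypothetical helper):

```python
markers = [
    ('RECROSS-EXAMINATIONBY', 'RECROSS'),   # longer RE- variant first
    ('CROSS-EXAMINATIONBY', 'CROSS'),
    ('HAVING BEEN PREVIOUSLY SWORN', 'HAVING'),
]

def cut_at_marker(text):
    """Keep only the text before the keyword of the first marker found."""
    for marker, keyword in markers:
        if marker in text:                  # substring test on the whole marker
            return text.partition(keyword)[0]
    return text

recross = cut_at_marker('NOTHING FURTHER. RECROSS-EXAMINATIONBY MR. SCHECK:')
cross = cut_at_marker('THANK YOU. CROSS-EXAMINATIONBY MS. CLARK:')
plain = cut_at_marker('NO QUESTIONS.')

print(repr(recross))  # 'NOTHING FURTHER. '
print(repr(cross))    # 'THANK YOU. '
print(repr(plain))    # 'NO QUESTIONS.'
```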

## Generate a new data frame with the above normalization and pre-processing

In [20]:
df.to_csv('df_normalized.csv', index = False) # Generate df for next tasks

In [21]:
df.head()

Unnamed: 0,person,speech,date,time,number_of_words
0,#THE COURT:,BACK ON THE RECORD IN THE SIMPSON MATTER. MR. ...,"APRIL 13, 1995\n",9:20 A.M.,28
1,#THE COURT:,ALL RIGHT. THE PEOPLE ARE REPRESENTED BY MR. G...,"APRIL 13, 1995\n",9:20 A.M.,48
2,#MR. GOLDBERG:,"YOUR HONOR, MY UNDERSTANDING OF THE COURT'S OR...","APRIL 13, 1995\n",9:20 A.M.,65
3,#MR. SCHECK:,"WELL, YOUR HONOR, BEFORE WE LEFT WE GAVE MR. G...","APRIL 13, 1995\n",9:20 A.M.,69
4,#THE COURT:,"WELL, AS I RECALL, AS WE ENDED THE COURT DAY I...","APRIL 13, 1995\n",9:20 A.M.,18
