# Data pre-processing

The main goal of this step is to import the data scraped from the website and pre-process it. This is achieved with the function ```preprocessing_transctipts_text()```, whose main steps are: 

- removing __html tags__
- removing __blank lines__
- __isolating the part of the transcript where people actually talk__ (i.e. the text between the words _SUPERIOR COURT OF THE STATE OF CALIFORNIA_, which appear at the very beginning and at the end of each transcript) from the descriptive part, and storing the two parts in separate files
- __eliminating the spaces__ between consecutive paragraphs of dialog spoken by the same person
- __isolating the person talking__ from the actual dialog; this exploits the fact that, because of how the HTML file has been pre-processed, the name of the person talking is preceded by the # symbol and followed by a colon (with regex this means isolating the following: ```re.search(r'\#.*?\: ', line)```)
- __finding the date and the time of the transcript__ (with regex)
- __creating a data frame__ with the following columns: _person_ (person talking), _speech_ (part of the dialog spoken by the person), _date_ (date of the transcript), _time_ (timestamp of the transcript). The first row of this data frame should contain the descriptive part of the transcript in the _speech_ column, with the value _DESCRIPTION_ in the _person_ column
- managing the __witness questioning__ by identifying the witness answering the questions -> SUBSTITUTE 'A: ' with the witness name. The name of the witness is retrieved by looking for particular phrases, such as _CALLED AS A WITNESS BY_, _THE WITNESS ON THE STAND AT THE TIME_, _HAVING BEEN PREVIOUSLY SWORN_, that appear before the questioning. The names of all the witnesses in the transcript are stored in a list
- managing the __attorney questioning__ by identifying the attorney asking the questions -> SUBSTITUTE 'Q:' with the name of the attorney questioning. The name of the attorney is retrieved by looking for particular phrases, such as _CROSS-EXAMINATIONBY_, _DIRECT EXAMINATIONBY_, _CROSS-EXAMINATION \(RESUMED\)BY_, _DIRECT EXAMINATION \(RESUMED\)BY_, that appear before the questioning.
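
To make the speaker and questioning steps more concrete, here is a minimal sketch of how they could work. It is not the actual body of ```preprocessing_transctipts_text()```: the helper names `split_speaker` and `substitute_qa` are hypothetical, the witness name is made up, and the sketch assumes that questions and answers are labelled `#Q:` and `#A:` once the speaker has been isolated.

```python
import re

# Same pattern as quoted in the list above: '#', then the shortest
# run of characters that ends with a colon and a space.
SPEAKER_RE = re.compile(r'\#.*?\: ')


def split_speaker(line):
    """Split a line such as '#MS. CLARK: GOOD MORNING.' into (person, speech)."""
    match = SPEAKER_RE.search(line)
    if match is None:
        return None, line  # no speaker marker on this line
    return match.group(0).strip(), line[match.end():]


def substitute_qa(rows, attorney, witness):
    """Replace the generic '#Q:' / '#A:' labels with the given names.

    `rows` is a list of (person, speech) tuples; any other label is kept.
    """
    labels = {'#Q:': attorney, '#A:': witness}
    return [(labels.get(person, person), speech) for person, speech in rows]


# Usage with made-up lines (the witness name is purely illustrative)
rows = [split_speaker('#Q: WHERE WERE YOU?'), split_speaker('#A: AT HOME.')]
print(substitute_qa(rows, '#MS. CLARK:', '#WITNESS X:'))
```

In the real transcripts the attorney and witness names would come from the phrases listed above rather than being passed in by hand.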

More details about this can be found in the report. 
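
As a rough illustration of the date and time step, the sketch below looks for values shaped like the ones in the final data set (for example _JANUARY 13, 1995_ and _9:17 A.M._). The patterns and the helper name `find_date_and_time` are assumptions made for this example, and the sample header line is invented; the regexes used in the actual function may differ.

```python
import re

MONTHS = ('JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|'
          'SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER')

# Upper-case month name, day, comma, four-digit year
DATE_RE = re.compile(rf'\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b')
# Hour and minutes followed by A.M. or P.M.
TIME_RE = re.compile(r'\b\d{1,2}:\d{2} [AP]\.M\.')


def find_date_and_time(text):
    """Return the first date and the first time found in `text` (None if absent)."""
    date = DATE_RE.search(text)
    time = TIME_RE.search(text)
    return (date.group(0) if date else None,
            time.group(0) if time else None)


# Invented header line, used only to exercise the patterns
print(find_date_and_time('LOS ANGELES, CALIFORNIA; FRIDAY, JANUARY 13, 1995 9:17 A.M.'))
```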


__Note__: I am aware that the function is not as efficient as it could be, but the data was very dirty and I didn't have much time. I spent a lot of nighttime hours writing this, so be kind and appreciate the effort.
As someone once said, _the plural of regex is regrets_, if you know, you know :)

In [1]:
import re
import pandas as pd
import os
import fnmatch
from nb_exploration_cleaning import remove_html_tags, preprocessing_transctipts_text

In [4]:
# Create the skeleton of the data frame
mydict = {'person': [0], 'speech': [0],  'date': [0], 'time': [0]}
df = pd.DataFrame(mydict)
df

Unnamed: 0,person,speech,date,time
0,0,0,0,0


In [6]:
directory = os.getcwd() # this should be the same directory where you stored the txt files obtained through scraping
df_preprocessed = preprocessing_transctipts_text(directory, df)

jan11.html.txt
jan12.html.txt
jan13.html.txt
jan23.html.txt
jan24.html.txt
jan25.html.txt
jan26.html.txt
jan30.html.txt
jan31.html.txt


In [7]:
# Sample of the final data set
df_preprocessed.iloc[1500:1600,:]

Unnamed: 0,person,speech,date,time
1500,#MS. CLARK:,I'M SORRY. I COULDN'T HEAR THE COURT.\n,"JANUARY 13, 1995",9:17 A.M.
1501,#THE COURT:,I'LL BE HERE TUESDAY IF ANYTHING COMES UP.\n,"JANUARY 13, 1995",9:17 A.M.
1502,#MR. COCHRAN:,WE'RE NOT DUE IN COURT ON TUESDAY? EXPECT THE ...,"JANUARY 13, 1995",9:17 A.M.
1503,#MS. CLARK:,WHAT ABOUT THE WALK THROUGH?\n,"JANUARY 13, 1995",9:17 A.M.
1504,#THE COURT:,WE HAVE -- WE'LL HAVE THE SET UP PHYSICALLY TO...,"JANUARY 13, 1995",9:17 A.M.
...,...,...,...,...
1595,#MS. CLARK:,"LET ME JUST MAKE SURE, YOUR HONOR.\n","JANUARY 23, 1995",9:21 A.M.
1596,#THE COURT:,"ALL RIGHT. COUNSEL, IS THERE ANY PARTICULAR RE...","JANUARY 23, 1995",9:21 A.M.
1597,#MS. CLARK:,NOT THAT I KNOW OF.\n,"JANUARY 23, 1995",9:21 A.M.
1598,#THE COURT:,352 OBJECTION ESSENTIALLY.\n,"JANUARY 23, 1995",9:21 A.M.


In [8]:
# Save the result to a CSV file
df_preprocessed.to_csv('January.csv', index = False )

In [9]:
# Final check (optional): are there still some lines where the attorney / the witness has not been identified? 
if df_preprocessed['person'].str.contains('#A:').any(): # unresolved witness labels; use '#Q:' instead to check for attorneys
    print('yes')
else:
    print('no')

no
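
If you want to check both labels in one go, a small helper along these lines could be used. This is only a sketch: `count_unresolved` is a hypothetical name, and unlike the `str.contains` check above it counts only values that are exactly `#Q:` or `#A:` after stripping whitespace.

```python
from collections import Counter

GENERIC_LABELS = ('#Q:', '#A:')


def count_unresolved(persons, labels=GENERIC_LABELS):
    """Count how many entries are still a generic question/answer label.

    `persons` can be any iterable of labels, e.g. a data frame column.
    """
    counts = Counter(str(person).strip() for person in persons)
    return {label: counts[label] for label in labels}


# Made-up labels, used only to show the shape of the result
print(count_unresolved(['#THE COURT:', '#Q:', '#A: ', '#Q:']))
```

On the real data this would be called as `count_unresolved(df_preprocessed['person'])`; a count of zero for both labels means every attorney and witness has been identified.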
