Maya Asher, 4/8/24
# Reprocessing the Santa Barbara Corpus of Spoken American English
- **NEW REPLACEMENT for progress report 2**
- **EXISTING for progress report 3**

The SBCSAE is a collection of time-aligned transcripts of audio files. Along with timestamps, the transcripts also include many non-alphabetic characters that denote different aspects of the speech. 

In this notebook, I process and clean up the raw text so that I can easily search for and locate my target words in my later analysis. Specifically, I read in the files, load each transcript into its own df, group the dfs by column count, and pickle the three resulting groups.
## Import

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import nltk
import pickle
%pprint

Pretty printing has been turned OFF


## Read in files
Originally, almost half the files raised parsing errors, so I had to go through them manually and fix the spacing. I opened each file in Atom, found the offending line, and fixed it, which usually just meant removing an extraneous tab.
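A check along these lines could point to the problem lines before opening each file by hand. It is only a sketch: `find_odd_tab_lines` is a hypothetical helper, the sample lines are made up, and it assumes that the most common tab count in a file is the correct one.

```python
from collections import Counter

def find_odd_tab_lines(lines):
    """Return (line_number, tab_count) pairs for lines whose tab count
    differs from the most common tab count in the file."""
    counts = [line.count("\t") for line in lines]
    if not counts:
        return []
    # assumption: the most frequent tab count is the intended one
    expected = Counter(counts).most_common(1)[0][0]
    return [(i, n) for i, n in enumerate(counts, start=1) if n != expected]

# made-up sample lines, not taken from the corpus
sample = [
    "0.00 1.24\tA:\tfirst line",
    "1.24 1.50\t\tsecond line",
    "1.50 2.36\t\t\tthird line has an extra tab",
]
odd_lines = find_odd_tab_lines(sample)
print(odd_lines)
```

Lines flagged this way would still need to be inspected by hand, since a differing tab count does not say which tab is the extra one.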

Now all 43 TRN files can be read in, each into its own pandas df, and the dfs are stored in the list `data_frames`.

In [2]:
# folder with necessary files
directory = "/Users/mayaasher/data_science/Stance-Taking-in-Spontaneous-Speech/data/utf-16/"
files = os.listdir(directory)

# only files ending in .trn
files = [file for file in files if file.endswith('.trn')]

# sort files based on numerical order
sorted_files = sorted(files, key=lambda x: int(x.split('.')[0][4:]))

#list to hold all dfs
data_frames = []
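The sort key above slices off a fixed-length prefix, so it depends on the exact file-name pattern. A slightly more forgiving variant is sketched here, assuming only that each name ends in a number before the extension. `trn_number` and the file names are hypothetical and for illustration only.

```python
import re

def trn_number(file_name):
    """Return the trailing number in the file name, ignoring the extension."""
    stem = file_name.rsplit(".", 1)[0]
    digits = re.search(r"(\d+)$", stem)
    if digits is None:
        raise ValueError(f"no trailing number in {file_name!r}")
    return int(digits.group(1))

# hypothetical names without zero padding, where plain string sorting goes wrong
names = ["SBC10.trn", "SBC2.trn", "SBC1.trn"]
sorted_names = sorted(names, key=trn_number)
print(sorted(names))   # string order puts SBC10 before SBC2
print(sorted_names)    # numeric order
```

Sorting numerically matters because the later cells refer to dfs by their position in `data_frames`.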

In [3]:
# read in all files IN ORDER!!!
for file_name in sorted_files:
    path = os.path.join(directory, file_name)
    try:
        # read_csv already returns a DataFrame, so no extra conversion is needed
        df = pd.read_csv(path, sep='\t', header=None, encoding='utf-16-be')
        data_frames.append(df)
    except pd.errors.ParserError as e:
        print(f"Error parsing {file_name}: {e}")
    except Exception as e:
        print(f"An error occurred while reading {file_name}: {e}")

## Column issues
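Unfortunately, the dfs have varying numbers of columns (2, 3, or 4). In the 2- and 3-column dfs, the start and end timestamps share a single column and go to the hundredths place, while in the 4-column dfs they sit in separate columns and go to the thousandths place. In the 2-column df, the speaker label and the utterance also share a column. The source files seem to follow different formatting conventions, which is probably why they parse differently.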

In [4]:
print(data_frames[12].head())
print(data_frames[11].head())
print(data_frames[13].head())

                     0                                         1
0  0.00 2.53  FRED:                                    ... Okay.
1   2.53 4.73                              One= large loan (Hx),
2   4.73 6.23                                  ... renewed (Hx),
3   6.23 8.08           ... a hundred ninety-seven= .. thousand,
4   8.08 9.23                          a hundred eighty dollars.
           0         1                                  2
0  0.00 1.24  KEVIN:     Is that just [carbonated water]?
1  0.45 1.24  WENDY:                      [No thank you].
2  1.24 1.50                                        [2No,
3  1.24 3.38  KEN:      [2(H) No this is2] crea=m [3soda.
4  1.50 2.36  WENDY:                   It's cream soda2].
       0      1        2                                                  3
0  2.660  2.805  JOANNE:                                               But,
1  2.805  4.685      NaN  so these slides <X should X> be real interesting.
2  6.140  6.325     KEN:          
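One possible way to bring the 3-column layout closer to the 4-column one is to split the combined timestamp column into separate numeric columns. This is a sketch and not something the notebook does: `split_timestamps` is a hypothetical helper, and it assumes every value in the column is a start time and an end time separated by whitespace. The two sample rows are copied from the 3-column head printed above.

```python
import pandas as pd

def split_timestamps(df, column=0):
    """Return a copy of df in which `column` (strings like "0.00 1.24")
    is replaced by two float columns named `start` and `end`."""
    # split on whitespace once, giving one column per timestamp
    parts = df[column].astype(str).str.strip().str.split(n=1, expand=True)
    out = df.drop(columns=[column]).copy()
    out.insert(0, "start", parts[0].astype(float))
    out.insert(1, "end", parts[1].astype(float))
    return out

three_col = pd.DataFrame({
    0: ["0.00 1.24", "0.45 1.24"],
    1: ["KEVIN:", "WENDY:"],
    2: ["Is that just [carbonated water]?", "[No thank you]."],
})
result = split_timestamps(three_col)
print(result.dtypes)
```

The speaker and text columns are left untouched, so the result has four columns like the 4-column dfs.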

In [5]:
for i, frame in enumerate(data_frames):
    print(i, frame.shape)

0 (1312, 3)
1 (1419, 3)
2 (1546, 3)
3 (1298, 3)
4 (826, 3)
5 (1767, 3)
6 (731, 3)
7 (1496, 3)
8 (725, 3)
9 (1107, 3)
10 (996, 3)
11 (2259, 3)
12 (1189, 2)
13 (1984, 4)
14 (1518, 4)
15 (1169, 4)
16 (566, 4)
17 (1266, 4)
18 (705, 4)
19 (1518, 4)
20 (875, 4)
21 (1214, 4)
22 (1539, 4)
23 (1845, 4)
24 (818, 4)
25 (739, 4)
26 (1330, 4)
27 (1822, 4)
28 (978, 4)
29 (719, 4)
30 (1497, 4)
31 (1431, 4)
32 (1197, 4)
33 (1162, 4)
34 (1128, 4)
35 (1273, 4)
36 (959, 4)
37 (1681, 4)
38 (1600, 4)
39 (1012, 4)
40 (982, 4)
41 (1857, 4)
42 (1013, 4)


## Column issue work-around
For the sake of searching the dfs, I'm separating them into 3 groups by column count: the single 2-column df, a list of 3-column dfs, and a list of 4-column dfs.

In [18]:
# index 12 is the only 2-column df
df_2c = data_frames[12]
# indices 0-11 have 3 columns
dfs_3c = data_frames[:12]
# indices 13-42 have 4 columns
dfs_4c = data_frames[13:]

In [21]:
# only 1 in df_2c
print(len(dfs_3c))
print(len(dfs_4c))

12
30
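Grouping by `shape[1]` instead of by hard-coded index ranges would keep the split correct even if the file list changes. This is a minimal sketch of that idea: `group_by_column_count` is a hypothetical helper, and the toy frames are stand-ins, not corpus data.

```python
from collections import defaultdict

import pandas as pd

def group_by_column_count(frames):
    """Map each column count to a list of (position, df) pairs,
    keeping the dfs in their original order."""
    groups = defaultdict(list)
    for position, frame in enumerate(frames):
        groups[frame.shape[1]].append((position, frame))
    return dict(groups)

# tiny stand-in frames with 3, 2, 4, and 3 columns
toy_frames = [
    pd.DataFrame([["a", "b", "c"]]),
    pd.DataFrame([["a", "b"]]),
    pd.DataFrame([["a", "b", "c", "d"]]),
    pd.DataFrame([["a", "b", "c"]]),
]
groups = group_by_column_count(toy_frames)
summary = {n_cols: [pos for pos, _ in members] for n_cols, members in groups.items()}
print(summary)
```

Keeping the original position next to each df makes it possible to trace a df back to its transcript after the groups are split apart.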


## Pickle
I'm pickling the dfs so they can be loaded in the later analysis.

In [24]:
with open('df2col.pkl', 'wb') as f:
    pickle.dump(df_2c, f, pickle.HIGHEST_PROTOCOL)

In [26]:
with open('df3col.pkl', 'wb') as f:
    pickle.dump(dfs_3c, f, pickle.HIGHEST_PROTOCOL)

In [27]:
with open('df4col.pkl', 'wb') as f:
    pickle.dump(dfs_4c, f, pickle.HIGHEST_PROTOCOL)
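Reading the pickles back in the analysis could look roughly like the round trip below. It is a sketch under assumptions: `save_pickle` and `load_pickle` are hypothetical helpers, the toy df stands in for the real data, and a temporary directory is used so that nothing in the project folder is overwritten.

```python
import os
import pickle
import tempfile

import pandas as pd

def save_pickle(obj, path):
    """Pickle obj to path with the highest available protocol."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Load and return the object pickled at path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# stand-in for a list of 4-column dfs, not corpus data
toy_dfs = [pd.DataFrame({0: [0.0], 1: [1.5], 2: ["A:"], 3: ["Okay."]})]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy4col.pkl")
    save_pickle(toy_dfs, path)
    restored = load_pickle(path)

print(restored[0].equals(toy_dfs[0]))
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.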