Maya Asher, 4/8/24
# Reprocessing the Santa Barbara Corpus of Spoken American English
- **NEW REPLACEMENT for progress report 2**
- **EXISTING for progress report 3**

The SBCSAE is a collection of time-aligned transcripts of audio files. Along with timestamps, the transcripts also include many non-alphabetic characters that denote different aspects of the speech. 

In this notebook, I process and clean up the raw text so that I can easily search for and locate my target words in my later analysis. Specifically, I read each file into its own df and store the dfs in a dictionary. I then separate them by column count and pickle them.
## Import

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
%pprint

Pretty printing has been turned OFF


## Read in files
Originally, there were parsing errors with almost half the files, so I had to go through them manually and fix the spacing. I opened each file in Atom, found the line with issues, and fixed it, which usually just meant removing an extraneous tab.

Now, all 43 TRN files can be read in, each into its own Pandas df, and every df is stored in the dictionary `data_frames`.
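
The manual search for problem lines could also be scripted. The following is only a sketch, not what I actually ran: `find_irregular_lines` is a hypothetical helper, and the sample lines are made up rather than taken from the corpus. It assumes that a "bad" line is one whose tab-separated field count differs from the most common count in the file.

```python
from collections import Counter

def find_irregular_lines(lines, sep="\t"):
    """Return (line_number, field_count) pairs for lines whose field count
    differs from the most common field count (hypothetical helper)."""
    counts = [line.rstrip("\n").count(sep) + 1 for line in lines]
    if not counts:
        return []
    # Assumption: the most frequent field count is the "correct" one
    expected = Counter(counts).most_common(1)[0][0]
    return [(i, n) for i, n in enumerate(counts, start=1) if n != expected]

# Made-up sample lines for illustration only
sample = [
    "0.00 1.24\tKEVIN:\tIs that just water?\n",
    "0.45 1.24\tWENDY:\tNo thank you.\n",
    "1.24 1.50\t\t\tNo,\n",  # one extraneous tab
]
print(find_irregular_lines(sample))
```

For a real file, the lines would come from something like `open(path, encoding='utf-16-be')`, and the reported line numbers could then be checked by hand in an editor.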

In [2]:
# folder with necessary files
directory = "/Users/mayaasher/data_science/Stance-Taking-in-Spontaneous-Speech/data/utf-16/"
files = os.listdir(directory)

# only files ending in .trn
files = [file for file in files if file.endswith('.trn')]

# sort files based on numerical order
sorted_files = sorted(files, key=lambda x: int(x.split('.')[0][4:]))

# dict to hold all dfs
data_frames = {}

In [3]:
# read in all files IN ORDER!!!
for file_name in sorted_files:
    try:
        # read_csv already returns a DataFrame, so no extra conversion is needed
        df = pd.read_csv(os.path.join(directory, file_name), sep='\t', header=None, encoding='utf-16-be')
        data_frames[file_name] = df
    except pd.errors.ParserError as e:
        print(f"Error parsing {file_name}: {e}")
    except Exception as e:
        print(f"An error occurred while reading {file_name}: {e}")

In [4]:
len(data_frames.keys())

43

## Column issues
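Unfortunately, the dfs have varying numbers of columns (2, 3, or 4). As the output below shows, the 2- and 3-column dfs have timestamps that go to the hundredths place, with the start and end times sharing a single column, while the 4-column dfs have timestamps that go to the thousandths place, with the start and end times in separate columns. In the 2-column df, the speaker label and the utterance also share a column. The files were probably delimited differently, which would explain the inconsistent parsing.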

In [5]:
print(data_frames['SBC014.trn'].head())
print(data_frames['SBC013.trn'].head())
print(data_frames['SBC015.trn'].head())

                     0                                         1
0  0.00 2.53  FRED:                                    ... Okay.
1   2.53 4.73                              One= large loan (Hx),
2   4.73 6.23                                  ... renewed (Hx),
3   6.23 8.08           ... a hundred ninety-seven= .. thousand,
4   8.08 9.23                          a hundred eighty dollars.
           0         1                                  2
0  0.00 1.24  KEVIN:     Is that just [carbonated water]?
1  0.45 1.24  WENDY:                      [No thank you].
2  1.24 1.50                                        [2No,
3  1.24 3.38  KEN:      [2(H) No this is2] crea=m [3soda.
4  1.50 2.36  WENDY:                   It's cream soda2].
       0      1        2                                                  3
0  2.660  2.805  JOANNE:                                               But,
1  2.805  4.685      NaN  so these slides <X should X> be real interesting.
2  6.140  6.325     KEN:          

In [6]:
for i in data_frames:
    print(i, data_frames[i].shape)

SBC001.trn (1312, 3)
SBC002.trn (1419, 3)
SBC003.trn (1546, 3)
SBC004.trn (1298, 3)
SBC005.trn (826, 3)
SBC006.trn (1767, 3)
SBC007.trn (731, 3)
SBC008.trn (1496, 3)
SBC009.trn (725, 3)
SBC010.trn (1107, 3)
SBC011.trn (996, 3)
SBC013.trn (2259, 3)
SBC014.trn (1189, 2)
SBC015.trn (1984, 4)
SBC016.trn (1518, 4)
SBC017.trn (1169, 4)
SBC018.trn (566, 4)
SBC019.trn (1266, 4)
SBC022.trn (705, 4)
SBC023.trn (1518, 4)
SBC024.trn (875, 4)
SBC029.trn (1214, 4)
SBC031.trn (1539, 4)
SBC032.trn (1845, 4)
SBC033.trn (818, 4)
SBC034.trn (739, 4)
SBC035.trn (1330, 4)
SBC036.trn (1822, 4)
SBC037.trn (978, 4)
SBC042.trn (719, 4)
SBC043.trn (1497, 4)
SBC044.trn (1431, 4)
SBC045.trn (1197, 4)
SBC047.trn (1162, 4)
SBC048.trn (1128, 4)
SBC049.trn (1273, 4)
SBC050.trn (959, 4)
SBC051.trn (1681, 4)
SBC056.trn (1600, 4)
SBC057.trn (1012, 4)
SBC058.trn (982, 4)
SBC059.trn (1857, 4)
SBC060.trn (1013, 4)
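
One possible alternative to keeping the layouts apart would be to bring the 3-column dfs into the 4-column shape by splitting the combined timestamp column. This is a sketch only, and I have not applied it to the corpus: `split_timestamps` and the column names `start`, `end`, `speaker`, and `text` are my own hypothetical choices, and the toy rows simply mimic the `SBC013.trn` output above.

```python
import pandas as pd

def split_timestamps(df):
    """Split a 'start end' string in column 0 into two float columns
    (hypothetical helper for the 3-column layout)."""
    # With no pattern, str.split breaks on any run of whitespace
    times = df[0].str.split(expand=True).astype(float)
    return pd.DataFrame({
        "start": times[0],
        "end": times[1],
        "speaker": df[1],
        "text": df[2],
    })

# Toy df shaped like the 3-column output above
toy = pd.DataFrame({
    0: ["0.00 1.24", "0.45 1.24", "1.24 1.50"],
    1: ["KEVIN:", "WENDY:", None],
    2: ["Is that just [carbonated water]?", "[No thank you].", "[2No,"],
})
norm = split_timestamps(toy)
print(norm.shape)
```

The 2-column layout would need an extra step to pull the speaker label out of the text column, which this sketch does not attempt.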


## Column issue work-around
For the sake of searching the dfs, I'm separating them out into 3 dicts depending on their column count.
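
The split can also be derived from each df's shape rather than from fixed positions in the key list, so it would still hold if a file were added or removed. A minimal sketch, assuming a dict of dfs like `data_frames`; `group_by_column_count` is a hypothetical helper and the toy dfs are stand-ins.

```python
from collections import defaultdict
import pandas as pd

def group_by_column_count(frames):
    """Map column count -> {file name: df}, keeping insertion order
    within each group (hypothetical helper)."""
    groups = defaultdict(dict)
    for name, df in frames.items():
        groups[df.shape[1]][name] = df
    return dict(groups)

# Toy stand-ins for the real dfs
toy_frames = {
    "a.trn": pd.DataFrame([[1, 2, 3]]),
    "b.trn": pd.DataFrame([[1, 2]]),
    "c.trn": pd.DataFrame([[1, 2, 3, 4]]),
    "d.trn": pd.DataFrame([[5, 6, 7]]),
}
groups = group_by_column_count(toy_frames)
for n_cols in sorted(groups):
    print(n_cols, list(groups[n_cols]))
```

With the real data, `groups[2]`, `groups[3]`, and `groups[4]` would play the roles of `df2c`, `df3c`, and `df4c`.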

In [7]:
# list of keys for each dict (positions follow the shapes printed above)
all_keys_list = list(data_frames.keys())
df2c_keys = all_keys_list[12:13]  # slice, so this is a list like the other two
df3c_keys = all_keys_list[:12]
df4c_keys = all_keys_list[13:]

In [8]:
# adding keys and vals to each dict
df3c = {key: data_frames[key] for key in df3c_keys if key in data_frames}
df4c = {key: data_frames[key] for key in df4c_keys if key in data_frames}

In [9]:
# have to do this one manually
df2c = {'SBC014.trn':data_frames['SBC014.trn']}

In [10]:
df2c.keys()

dict_keys(['SBC014.trn'])

In [11]:
df3c.keys()

dict_keys(['SBC001.trn', 'SBC002.trn', 'SBC003.trn', 'SBC004.trn', 'SBC005.trn', 'SBC006.trn', 'SBC007.trn', 'SBC008.trn', 'SBC009.trn', 'SBC010.trn', 'SBC011.trn', 'SBC013.trn'])

In [12]:
df4c.keys()

dict_keys(['SBC015.trn', 'SBC016.trn', 'SBC017.trn', 'SBC018.trn', 'SBC019.trn', 'SBC022.trn', 'SBC023.trn', 'SBC024.trn', 'SBC029.trn', 'SBC031.trn', 'SBC032.trn', 'SBC033.trn', 'SBC034.trn', 'SBC035.trn', 'SBC036.trn', 'SBC037.trn', 'SBC042.trn', 'SBC043.trn', 'SBC044.trn', 'SBC045.trn', 'SBC047.trn', 'SBC048.trn', 'SBC049.trn', 'SBC050.trn', 'SBC051.trn', 'SBC056.trn', 'SBC057.trn', 'SBC058.trn', 'SBC059.trn', 'SBC060.trn'])

In [13]:
# looks good!
print(len(df2c),'+', len(df3c), '+', len(df4c), '=', (len(df2c)+len(df3c)+len(df4c)))

1 + 12 + 30 = 43


## Pickle
I'm pickling the dicts of dfs so they can be used in the analysis.

In [14]:
# -1 selects the highest available pickle protocol
with open('df2col.pkl', 'wb') as f:
    pickle.dump(df2c, f, -1)

In [15]:
with open('df3col.pkl', 'wb') as f:
    pickle.dump(df3c, f, -1)

In [16]:
with open('df4col.pkl', 'wb') as f:
    pickle.dump(df4c, f, -1)
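
For reference, here is a sketch of how a pickled dict of dfs could be read back in a later notebook. The helpers `save_frames` and `load_frames` and the toy df are hypothetical, and the example writes to a temporary directory so it does not touch the real `.pkl` files.

```python
import os
import pickle
import tempfile
import pandas as pd

def save_frames(frames, path):
    """Pickle a dict of dfs with the highest available protocol."""
    with open(path, "wb") as f:
        pickle.dump(frames, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_frames(path):
    """Load a dict of dfs back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy dict standing in for df2c / df3c / df4c
toy = {"toy.trn": pd.DataFrame({0: ["0.00 1.24"], 1: ["A:"], 2: ["hi"]})}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    save_frames(toy, path)
    restored = load_frames(path)

# The restored df should match the original
print(restored["toy.trn"].equals(toy["toy.trn"]))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.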