# Preprocessing 

This notebook contains the preprocessing options for the data obtained from 
season_data_preparation.ipynb. The input is the following:

1.   An XLSX file grouped by segment, i.e. the output of [season_data_preparation.ipynb](https://github.com/TinfFoil/dar_tvseries/blob/main/season_data_preparation.ipynb) when option [1] is selected.




# Libraries

The main libraries used in this notebook are Scikit-Learn and NumPy, which provide tools for model training and evaluation. Regular expressions and spaCy are also used for preprocessing.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

from numpy import absolute
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import mean_absolute_error

import matplotlib.pyplot as plot
import pandas as pd
import spacy
import spacy.cli
import re

spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Importing the XLSX file


In [None]:
# Opening aligned .xlsx file 

excel_path = input('Enter .xlsx file path: ')
df = pd.read_excel(excel_path, index_col=0)

Enter .xlsx file path: /content/season_13_with_subtitles.xlsx


In [None]:
# Structure of the .xlsx file

df

Unnamed: 0,Code,Segment start,Segment end,PP,SP,MC,Segment text
0,GAS13E01,00:00:00,00:00:44,0,0,0,"<i>Previously on ""Grey's Anatomy""...</i> I wan..."
1,GAS13E01,00:00:44,00:00:49,0,0,0,♪
2,GAS13E01,00:00:49,00:02:18,0,6,0,♪ I ain't got no problem ♪ ♪ That's for real ♪...
3,GAS13E01,00:02:18,00:02:36,2,2,2,[Siren wails] Isaac: What do we got? We got a ...
4,GAS13E01,00:02:36,00:03:18,0,6,0,Two champagnes. You got it. I thought you were...
...,...,...,...,...,...,...,...
1466,GAS13E24,00:40:58,00:41:23,0,0,6,"[Engine starts] <i>Nobody wakes up thinking, ""..."
1467,GAS13E24,00:41:23,00:41:43,0,0,0,"<i>Sometimes, we wake up, we face our fears......"
1468,GAS13E24,00:41:43,00:41:47,0,0,6,<i>We take them by the hand.</i> ♪♪
1469,GAS13E24,00:41:59,00:42:10,0,0,6,"♪ So far away ♪ <i>- And we stand there, waiti..."


# Preprocessing

In this section, the following elements are **removed** from the text:
1.   Disallowed label combinations ("6 0 6")
2.   Song lyrics and markup symbols ("< i >< / i >")
3.   Boilerplate credits ("Synced & corrected by...")
4.   Off-camera speakers' names
5.   Noises between square brackets ("[Siren wails]")
6.   All punctuation except hyphens, apostrophes and colons
7.   Filler words ("Uh", "Wow")
8.   Uppercase characters (converted to lowercase)
9.   Rows shorter than two tokens
10.  Double spaces
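
To make the effect of the text-level steps concrete, here is a minimal sketch that applies them to a single made-up string. The function name `clean_segment`, the sample sentence and the shortened filler set `FILLERS` are illustrative assumptions, not part of the notebook; the regular expressions mirror the ones used in the cells below.

```python
import re

# Hypothetical, shortened filler set (the notebook uses a longer list).
FILLERS = {'uh', 'oh', 'wow', 'hey'}

def clean_segment(text: str) -> str:
    """Apply the text-level cleaning steps in the same order as the cells below."""
    text = re.sub(r'♪(.*?)♪', '', text)             # lyrics between note symbols
    text = text.replace('♪', '')                    # stray note symbols
    text = re.sub(r'</?i>', '', text)               # italic markup
    text = text.replace('- ', '')                   # dialogue dashes
    text = re.sub(r'[A-Z][a-z]+: ', '', text)       # speaker names such as "Isaac: "
    text = re.sub(r'\[.*?\]', '', text)             # bracketed sounds
    text = re.sub(r"[^a-zA-Z0-9 :\-']", ' ', text)  # remaining punctuation
    text = text.lower()
    # str.split() without arguments also collapses repeated whitespace.
    return ' '.join(word for word in text.split() if word not in FILLERS)

sample = "[Siren wails] Isaac: What do we got? <i>Uh, a crash.</i> ♪ So far away ♪"
print(clean_segment(sample))  # what do we got a crash
```

Note that the filler filter compares whole whitespace-separated tokens, which is why it has to run after punctuation removal and lowercasing.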


In [None]:
# Merging the labels into one column

df['Labels'] = df['PP'].astype(str) + ' ' + df['SP'].astype(str) + ' ' + df['MC'].astype(str)

In [None]:
# There are a few combinations to be fixed, like 2 6 0 

value_counts = df['Labels'].value_counts() 
print(value_counts[value_counts < 2]) # Label combinations that appear only once

2 6 0    1
0 3 0    1
1 1 4    1
0 6 6    1
0 0 3    1
2 3 1    1
6 0 6    1
Name: Labels, dtype: int64
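
The row indices of such rare combinations can be looked up before correcting them by hand. The following is a small sketch on a made-up frame (`toy` and its values are assumptions for illustration only); it uses the same `value_counts()` approach as above and then selects the matching rows with `isin`.

```python
import pandas as pd

# Toy data, invented for illustration; the real frame has the same three label columns.
toy = pd.DataFrame({'PP': [0, 0, 2, 0],
                    'SP': [6, 6, 6, 0],
                    'MC': [0, 0, 0, 6]})
toy['Labels'] = (toy['PP'].astype(str) + ' '
                 + toy['SP'].astype(str) + ' '
                 + toy['MC'].astype(str))

counts = toy['Labels'].value_counts()
rare = counts[counts < 2]                          # combinations that appear only once
rare_rows = toy[toy['Labels'].isin(rare.index)]    # rows carrying those combinations
print(rare_rows)
```

The index of `rare_rows` gives the positions that can then be passed to `df.loc[...]`.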


In [None]:
# Fixing the labels

df.loc[351, 'SP'] = 4
df.loc[462, 'PP'] = 3
df.loc[221, 'SP'] = 0
df.loc[168, 'MC'] = 6
df.loc[973, 'PP'] = 0
df.loc[973, 'MC'] = 0

In [None]:
# Total labeled segments

# df['Labels'] = df['PP'].astype(str) + ' ' + df['SP'].astype(str) + ' ' + df['MC'].astype(str)
# df['Labels'].value_counts().plot(kind='bar', figsize=(20, 10), fontsize=12, title='Total labeled segments');

In [None]:
# Dropping extra columns

df = df.drop(columns=['Code', 'Segment end', 'Segment start', 'Labels'])

In [None]:
# Removing symbols

symbols_regex = '♪(.*?)♪'
symbols_regex2 = '♪'
symbols_regex3 = '<i>'
symbols_regex4 = '</i>'
symbols_regex5 = '- '

df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(symbols_regex, '', x))
df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(symbols_regex2, '', x))
df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(symbols_regex3, '', x))
df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(symbols_regex4, '', x))
df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(symbols_regex5, '', x))

In [None]:
# Removing boilerplates

# Note: the unescaped "|" in boilerplate_regex2 acts as regex alternation
boilerplate_regex = r'Synced & corrected by -robtor[-]?'
boilerplate_regex2 = r'Synced & corrected by -robtor- | Resync by Alice www\.addic7ed\.com'
boilerplate_regex3 = r'www\.addic7ed\.com'

df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(boilerplate_regex, '', x))
df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(boilerplate_regex2, '', x))
df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(boilerplate_regex3, '', x))

In [None]:
# Removing speakers' names

name_regex = r'[A-Z][a-z]+: '  # a capitalized word followed by a colon and a space

df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(name_regex, '', x))

In [None]:
# Removing [sounds]

sounds_regex = r'\[.*?\]'

df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(sounds_regex, '', x))

In [None]:
# Removing all punctuation except hyphens, apostrophes and colons

df['Segment text'] = df['Segment text'].str.replace(r"[^a-zA-Z0-9 :\-']", ' ', regex=True)

In [None]:
# Lowercasing

df['Segment text'] = df['Segment text'].str.lower()

In [None]:
# Removing filler words

filler_words = ['aah', 'aaaaaaah', 'aaaahh', 'ah', 'um', 'wow', 
                'uh', 'uh-huh', 'huh', 'ugh', 'oh', 'ooh', 
                'oooh', 'hey', 'mnh', 'mm-hmm', 'mm', 'hmm', 
                'hm', 'mnhmnh', 'yeah', 'y-yeah', 'ow', 
                'who-o-o-o-a', 'whoa', 'okay', 'n-o-o', 'o-okay', 
                'mwah']

df['Segment text'] = df['Segment text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (filler_words)]))

In [None]:
# Removing double spaces

df['Segment text'] = df['Segment text'].apply(lambda x: re.sub(' +', ' ', x))

In [None]:
# Tokenizing the text and removing rows shorter than 2 tokens

nlp = spacy.load('en_core_web_sm')

df['Tokenized text'] = df['Segment text'].apply(lambda x: nlp.tokenizer(x))
df['Token count'] = df['Tokenized text'].apply(lambda x: len(x))
df = df[(df['Token count'] >= 2)]

In [None]:
# Resetting index

df = df.reset_index(drop=True)

In [None]:
# Reordering the columns

df = df.drop('Tokenized text', axis=1)
df = df.drop('Token count', axis=1)
df = df[['Segment text', 'PP', 'SP', 'MC']]

# Models

Here, the task is framed as a multioutput regression problem. Multiclass and multioutput algorithms are covered on [this page](https://scikit-learn.org/stable/modules/multiclass.html) of the Scikit-Learn documentation. Three algorithms that natively support multioutput regression are used here: [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html), [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).
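
As a rough illustration of what native multioutput support means, the sketch below fits a regressor on a two-dimensional target and gets one prediction per target column back. The toy arrays `X_toy` and `y_toy` are invented for this example and have nothing to do with the real data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (made up): 6 samples, 2 features, 3 target columns.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]], dtype=float)
y_toy = np.array([[0, 0, 6], [0, 6, 0], [2, 2, 2],
                  [6, 0, 0], [0, 0, 6], [0, 6, 0]], dtype=float)

model = KNeighborsRegressor(n_neighbors=2)
model.fit(X_toy, y_toy)                        # y is passed as a 2-D array, no wrapper needed
pred = model.predict(np.array([[1.0, 0.5]]))
print(pred.shape)                              # one row per sample, one column per target
```

Estimators without this support would need a wrapper such as `sklearn.multioutput.MultiOutputRegressor`, which fits one model per target column.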



## Vectorizing (unigrams)

Term frequency-inverse document frequency (TF-IDF), a common baseline representation, is used to vectorize the text. With `min_df=1`, only terms with a document frequency lower than 1 would be ignored, so every term that occurs in at least one segment is kept. [TfidfVectorizer()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from the Scikit-Learn library converts documents to a matrix of TF-IDF features.

In [None]:
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(df['Segment text'].values)
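
To see what `min_df` does in practice, the following sketch compares two thresholds on a three-document toy corpus (the `docs` list is invented for illustration). It also shows that the default tokenizer of `TfidfVectorizer` only keeps tokens of two or more characters, so a word like "a" never enters the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for this example.
docs = ["we got a crash", "we got it", "two champagnes"]

vec_all = TfidfVectorizer(min_df=1).fit(docs)      # keep every term seen at least once
vec_common = TfidfVectorizer(min_df=2).fit(docs)   # keep terms seen in at least 2 documents

print(sorted(vec_all.vocabulary_))
# ['champagnes', 'crash', 'got', 'it', 'two', 'we']
print(sorted(vec_common.vocabulary_))
# ['got', 'we']
```

Raising `min_df` is one way to shrink the feature space when many terms occur in a single segment only.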

A [valid representation](https://scikit-learn.org/stable/modules/multiclass.html) of a multioutput `y` is a dense matrix of shape (n_samples, n_outputs), i.e. a column-wise concatenation of 1-D target variables, here the PP, SP and MC columns.

In [None]:
y = df[['PP', 'SP', 'MC']].to_numpy()

Shape of the input vectors:

In [None]:
print('Feature vector:', X.shape)
print('Target vector:', y.shape)

Feature vector: (1322, 6548)
Target vector: (1322, 3)


## Training and evaluating (cross-validation)

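10-fold cross-validation with three repeats is used to evaluate the performance of the models. The mean absolute error (MAE) is used as the score. Because Scikit-Learn's `neg_mean_absolute_error` scorer returns negated errors, the absolute value of the scores is taken before averaging.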
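
The evaluation scheme can be tried out on synthetic data first. The sketch below is only an illustration under assumed inputs (`X_toy` and `y_toy` are randomly generated): it shows that 10 splits repeated 3 times yield 30 scores, and that each score is a single non-positive number averaged over the three target columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data (made up): 50 samples, 4 features, 3 target columns.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
y_toy = X_toy @ rng.normal(size=(4, 3)) + rng.normal(scale=0.1, size=(50, 3))

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(LinearRegression(), X_toy, y_toy,
                         scoring='neg_mean_absolute_error', cv=cv)

print(len(scores))            # 10 folds x 3 repeats = 30 scores
print((scores <= 0).all())    # the scorer returns negated MAE values
mae = np.absolute(scores)
print('MAE: %.3f (%.3f)' % (mae.mean(), mae.std()))
```

The value in parentheses is the standard deviation across the 30 folds, which gives a sense of how stable the estimate is.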

In [None]:
# k-Nearest Neighbors

model = KNeighborsRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1) 
n_scores = absolute(n_scores)
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

MAE: 2.026 (0.075)


In [None]:
# Decision Tree

model = DecisionTreeRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1) 
n_scores = absolute(n_scores)
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

MAE: 2.303 (0.135)


In [None]:
# Linear Regression

model = LinearRegression()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1) 
n_scores = absolute(n_scores)
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

MAE: 2.474 (0.169)
