<a href="https://colab.research.google.com/github/MK316/Spring2024/blob/main/Corpus/TEDdata/TED_preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TED data Text pre-processing

+ Last updated (6/4)

# 🍀Todo: Removing timestamps and parenthetical notes in the script

In [None]:
# Sample input text
text = """
The human voice: It's the instrument we all play. It's the most powerful sound in the world, probably. It's the only one that can start a war or say "I love you." And yet many people have the experience that when they speak, people don't listen to them. And why is that? How can we speak powerfully to make change in the world?

00:24
What I'd like to suggest, there are a number of habits that we need to move away from. I've assembled for your pleasure here seven deadly sins of speaking. I'm not pretending this is an exhaustive list, but these seven, I think, are pretty large habits that we can all fall into.

00:40
First, gossip. Speaking ill of somebody who's not present. Not a nice habit, and we know perfectly well the person gossiping, five minutes later, will be gossiping about us.

00:53
Second, judging. We know people who are like this in conversation, and it's very hard to listen to somebody if you know that you're being judged and found wanting at the same time.

01:03
Third, negativity. You can fall into this. My mother, in the last years of her life, became very negative, and it's hard to listen. I remember one day, I said to her, "It's October 1 today," and she said, "I know, isn't it dreadful?"

01:16
(Laughter)

01:18
It's hard to listen when somebody's that negative.

01:21
(Laughter)
"""

+ This script defines a function clean_text that takes the raw text as input. It uses the re.sub function to replace timestamps and parenthetical notes with an empty string, which removes them from the text. Once the function is defined, you can pass any text to it to get the cleaned output.

```
re.sub(r'\d{2}:\d{2}\n', '', text)
```

+ Pattern Explained:
  + \d{2}: This matches exactly two digits, the minutes part of the timestamp. The \d denotes a digit (equivalent to [0-9] for ASCII text), and {2} requires exactly two instances of the preceding element (in this case, a digit).
  + :: This matches the colon character exactly as it appears.
  + \d{2}: This again matches exactly two digits following the colon, the seconds part of the timestamp.
  + \n: This matches a newline character, so the pattern only matches timestamps that are followed immediately by a line break, which is how they appear when they label sections of the transcript. A time mentioned mid-sentence (e.g., "at 10:30 today") is left alone.
+ Replacement: '' (an empty string): every match found by the pattern is replaced with nothing, which removes it from the text.
+ Application: This line of code searches text for any sequence that matches a timestamp (e.g., "01:21") followed immediately by a newline, and removes those sequences.
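
To see what the pattern does and does not match, here is a minimal sketch on a short made-up string (the `sample` text is an assumption for illustration, not part of the TED data):

```python
import re

# Hypothetical snippet laid out like the transcript: a timestamp on its own line,
# plus a clock time that appears in the middle of a sentence
sample = "Hello.\n\n00:24\nWorld at 10:30 today.\n"

pattern = r'\d{2}:\d{2}\n'

# Only the timestamp directly followed by a newline is matched
matches = re.findall(pattern, sample)
print(matches)

# Replacing each match with '' removes the timestamp line
stripped = re.sub(pattern, '', sample)
print(repr(stripped))
```

The mid-sentence "10:30" survives because a space, not a newline, follows it.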

In [None]:
import re

def clean_text(text):
    # Remove timestamps in the format "00:00"
    text = re.sub(r'\d{2}:\d{2}\n', '', text)
    # Remove text within parentheses, e.g. "(Laughter)"
    text = re.sub(r'\(.*?\)', '', text)
    return text

cleaned_text = clean_text(text)
print(cleaned_text)
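
Removing a timestamp and a note such as "(Laughter)" leaves the surrounding blank lines behind, so the output can contain runs of empty lines. If that matters for later steps, one possible variant is sketched below. The name `clean_text_compact` and the extra blank-line step are assumptions added for illustration, not part of the original notebook.

```python
import re

def clean_text_compact(text):
    # Hypothetical variant of clean_text with one extra tidy-up step
    text = re.sub(r'\d{2}:\d{2}\n', '', text)   # timestamps followed by a newline
    text = re.sub(r'\(.*?\)', '', text)          # parenthetical notes
    text = re.sub(r'\n{3,}', '\n\n', text)       # collapse three or more newlines into two
    return text.strip()                          # drop leading/trailing whitespace

# Made-up sample in the same layout as the transcript
sample = "First.\n\n01:16\n(Laughter)\n\n01:18\nSecond.\n\n01:21\n(Laughter)\n"
print(repr(clean_text_compact(sample)))
```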


# 🍀Todo:

---
# Step by step to get a cleaned text for the text column in our csv file

+ Read csv file as data (using Github link)
+ Read Column 'Text', remove timestamps and parenthetical notes, and write the cleaned text to a new column named 'Cleanedtext01'

## Split Speaker info

+ [data sample](https://raw.githubusercontent.com/MK316/Spring2024/main/Corpus/TEDdata/sample1.csv)

In [None]:
#@markdown Read csv, add Speaker information in 'Speaker' column
import pandas as pd
import re

# Load the CSV file
url = "https://raw.githubusercontent.com/MK316/Spring2024/main/Corpus/TEDdata/sample1.csv"
# file_path = '/content/sample1.csv'  # Adjust the file path after uploading your file to Colab
data = pd.read_csv(url)

# Function to extract speaker name using regex
def extract_speaker(title):
    # This regex looks for the pattern where the name is in all capitals at the end of the string
    match = re.search(r'\n([A-Z\s]+)$', title)
    if match:
        return match.group(1).strip()  # Return the matched group, stripped of leading/trailing whitespace
    return ''  # Return empty string if no speaker name is found

# Apply the function to separate the 'Speaker' from 'Title'
data['Speaker'] = data['Title'].apply(extract_speaker)

# Remove the speaker name from the 'Title' column
data['Title'] = data['Title'].apply(lambda x: re.sub(r'\n[A-Z\s]+$', '', x))

# Insert the 'Speaker' column right after the 'Title' column
speaker_col = data.pop('Speaker')  # Remove the 'Speaker' column temporarily
data.insert(1, 'Speaker', speaker_col)  # Insert it right after 'Title' column

# Save the updated DataFrame to a new CSV file
output_file_path = '/content/sample1_speaker.csv'
data.to_csv(output_file_path, index=False)

print("Updated DataFrame:")
print(data.head())
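
The regex in extract_speaker can be tried on plain strings before applying it to the whole column. The sketch below uses made-up titles (assumptions for illustration, not rows from sample1.csv) and shows one limitation: since the character class only allows capital letters and whitespace, a name containing a hyphen or apostrophe is not recognized.

```python
import re

def extract_speaker(title):
    # Same pattern as in the cell above: a newline, then capitals/whitespace up to the end
    match = re.search(r'\n([A-Z\s]+)$', title)
    return match.group(1).strip() if match else ''

# Hypothetical titles for illustration only
with_speaker = "A Sample Talk Title\nJANE DOE"
without_speaker = "A Sample Talk Title"
hyphenated = "A Sample Talk Title\nJEAN-LUC DOE"

print(repr(extract_speaker(with_speaker)))     # speaker line found
print(repr(extract_speaker(without_speaker)))  # no newline, so no match
print(repr(extract_speaker(hyphenated)))       # hyphen is outside [A-Z\s], so no match
```

If the data contains such names, widening the character class (for example to `[A-Z\s.'-]`) is one option, at the cost of possibly matching more than intended.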


## Text Clean-up [1]: removing timestamps and parenthetical notes, and adding a new column

```
def clean_text(text):
    # Remove timestamps in the format "00:00"
    text = re.sub(r'\d{2}:\d{2}\n', '', text)
    # Remove text within parentheses, e.g. "(Laughter)"
    text = re.sub(r'\(.*?\)', '', text)
    return text
```

In [None]:
# data.head()

import pandas as pd
import re

# Assuming 'data' is your original DataFrame
df = data

def clean_text(text):
    # Remove timestamps in the format "00:00"
    text = re.sub(r'\d{2}:\d{2}\n', '', text)
    # Remove text within parentheses, e.g. "(Laughter)"
    text = re.sub(r'\(.*?\)', '', text)
    return text

# Apply the clean_text function to each element in the 'Text' column
df['Cleanedtext01'] = df['Text'].apply(clean_text)

# Compare the first 1000 characters of the first item in 'Text' and 'Cleanedtext01'
original_text = df['Text'].iloc[0][0:1000]  # First row of 'Text', truncated to 1000 characters
cleaned_text = df['Cleanedtext01'].iloc[0][0:1000]  # First row of 'Cleanedtext01', truncated to 1000 characters

print("Original Text:")
print(original_text)
print("="*50)
print("\nCleaned Text:")
print(cleaned_text)

### Checking whether the cleaning process is completed as planned

1. find the timestamp expressions
2. find parenthetical notes

In [None]:
#@markdown 1. Check the first (timestamp) for both 'Text' and 'Cleanedtext01'
import pandas as pd
import re

# Assuming 'df' is your DataFrame
def remove_and_report_timestamps(text):
    # Find all occurrences of the timestamp pattern
    matches = re.findall(r'\d{2}:\d{2}\n', text)
    # Remove the timestamp pattern
    cleaned_text = re.sub(r'\d{2}:\d{2}\n', '', text)
    return cleaned_text, matches

# Apply the function and capture the cleaned text and the matches for 'Text'
cleaned_text_original, timestamp_matches_original = remove_and_report_timestamps(df['Text'][0])

# Print the number of occurrences and list each occurrence for 'Text'
if timestamp_matches_original:
    print(f"Found {len(timestamp_matches_original)} occurrences of the timestamp pattern in original text:")
    for match in timestamp_matches_original:
        print(match.strip())  # .strip() is used to remove any trailing newline for clean display
else:
    print("No timestamp pattern found in the original text.")

# Apply the same function and capture the cleaned text and the matches for 'Cleanedtext01'
cleaned_text_cleaned, timestamp_matches_cleaned = remove_and_report_timestamps(df['Cleanedtext01'][0])

# Print the number of occurrences and list each occurrence for 'Cleanedtext01'
if timestamp_matches_cleaned:
    print(f"Found {len(timestamp_matches_cleaned)} occurrences of the timestamp pattern in cleaned text:")
    for match in timestamp_matches_cleaned:
        print(match.strip())
else:
    print("No timestamp pattern found in the cleaned text.")


In [None]:
#@markdown 2. Check the second (parenthetical notes) for both 'Text' and 'Cleanedtext01'
import pandas as pd
import re

# Assuming 'df' is your DataFrame
def remove_and_report_parentheticals(text):
    # Find all occurrences of the parenthetical pattern
    matches = re.findall(r'\(.*?\)', text)
    # Remove the parenthetical pattern
    cleaned_text = re.sub(r'\(.*?\)', '', text)
    return cleaned_text, matches

# Apply the function and capture the cleaned text and the matches for 'Text'
cleaned_text_original, paren_matches_original = remove_and_report_parentheticals(df['Text'][0])

# Print the number of occurrences and list each occurrence for 'Text'
if paren_matches_original:
    print(f"Found {len(paren_matches_original)} occurrences of the parenthetical pattern in original text:")
    for match in paren_matches_original:
        print(match.strip())  # .strip() removes any surrounding whitespace for clean display
else:
    print("No parenthetical pattern found in the original text.")

# Apply the same function and capture the cleaned text and the matches for 'Cleanedtext01'
cleaned_text_cleaned, paren_matches_cleaned = remove_and_report_parentheticals(df['Cleanedtext01'][0])

# Print the number of occurrences and list each occurrence for 'Cleanedtext01'
if paren_matches_cleaned:
    print(f"Found {len(paren_matches_cleaned)} occurrences of the parenthetical pattern in cleaned text:")
    for match in paren_matches_cleaned:
        print(match.strip())
else:
    print("No parenthetical pattern found in the cleaned text.")
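
The two checks above look only at the first row. To check every row at once, one option is to count matches per column with pandas string methods. This is a minimal sketch on a made-up two-row frame (`demo` is a hypothetical stand-in for `df`); with the real data, `demo` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical two-row frame standing in for df
demo = pd.DataFrame({
    'Text': ["Hi.\n00:24\n(Laughter)\nBye.", "No notes here."],
})

# Same two substitutions as clean_text, written with vectorized string methods
demo['Cleanedtext01'] = (
    demo['Text']
    .str.replace(r'\d{2}:\d{2}\n', '', regex=True)
    .str.replace(r'\(.*?\)', '', regex=True)
)

patterns = {'timestamp': r'\d{2}:\d{2}\n', 'parenthetical': r'\(.*?\)'}
counts = {}
for name, pat in patterns.items():
    # Series.str.count returns the number of regex matches in each row
    before = int(demo['Text'].str.count(pat).sum())
    after = int(demo['Cleanedtext01'].str.count(pat).sum())
    counts[name] = (before, after)
    print(f"{name}: {before} before cleaning, {after} after")
```

A nonzero "after" count for either pattern would mean some rows still need attention.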


# Saving the processed file

In [None]:
print(df.head())
df.to_csv("Cleanedtext01.csv", encoding = "utf-8", index=False)
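
Because the transcripts contain line breaks and quotation marks, it can be worth confirming that the text survives a write/read round trip. The sketch below uses a made-up frame and an in-memory buffer instead of a file (both are assumptions for illustration); with the real data, the same comparison could be made after reading "Cleanedtext01.csv" back in.

```python
import io
import pandas as pd

# Hypothetical cleaned texts containing a newline and embedded quotes
demo = pd.DataFrame({'Cleanedtext01': ["Line one.\nLine two.", 'He said, "hi."']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Read the CSV text back and compare with the original values
buffer.seek(0)
restored = pd.read_csv(buffer)
same = restored['Cleanedtext01'].tolist() == demo['Cleanedtext01'].tolist()
print("Round trip preserved the text:", same)
```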