## Data Ingestion
This notebook prepares the data for ingestion and database upload in the following steps:
1. Data File Creation
2. Data Chunking

## Step 1: Data File Creation
In this step, the data is read from a text file in which each speech is labelled with its speaker (delegate). This involves the following steps:
1. Import the data from the speaker-labelled speech text file
2. Split each line at the first ': ' to separate the speaker (delegate) from the speech text
3. Create a CSV with columns for id, session, meeting, speaker and text, and write each value into its corresponding column

In [1]:
text_file = '/Users/pelumioluwaabiola/Desktop/Transcriptions/Speaker-Labelled/Meeting_5_Session_3.txt'
csv_file = '/Users/pelumioluwaabiola/Desktop/Transcriptions/csv/Meeting_5_Session_3.csv'

In [5]:
import csv

def split_text(text, delimiter=': '):
    """Split speaker-labelled text into (id, session, meeting, speaker, text) tuples."""
    pairs = []
    num_session = 3  # Change this to match the session number of your file
    num_meeting = 5  # Change this to match the meeting number of your file
    numspeaker = 0  # Don't change: running count of the speaker turns found in the txt file
    for line in text.split("\n"):
        # Only lines containing the delimiter are treated as a new speaker turn
        if delimiter in line:
            numspeaker += 1

            # e.g. 'S3M51' for session 3, meeting 5, first speaker turn
            row_id = f'S{num_session}M{num_meeting}{numspeaker}'
            # Split at the first delimiter only, so any later ': ' stays in the text
            parts = line.split(delimiter, 1)
            pairs.append((row_id, num_session, num_meeting, parts[0].strip(), parts[1].strip()))
    return pairs

# Read text from file
with open(text_file, 'r', encoding='utf-8', errors='ignore') as file:
    text = file.read()

# Split text and write to CSV
parts = split_text(text)

with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Id', 'Session', 'Meeting', 'Speaker', 'Text'])
    for part in parts:
        writer.writerow(part)
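
To make the splitting rule concrete, here is a small standalone sketch. The sample transcript is invented for illustration and `split_line` is a hypothetical helper, not part of the notebook. It shows that only the first `': '` on a line separates speaker from text, and that a line without the delimiter is skipped, so a speech that wraps onto a second line would lose its continuation.

```python
def split_line(line, delimiter=': '):
    """Return (speaker, text) for a labelled line, or None if the line has no label."""
    if delimiter not in line:
        return None
    speaker, text = line.split(delimiter, 1)  # split at the first delimiter only
    return speaker.strip(), text.strip()

# Invented sample: the first line contains a second ': ' inside the speech
sample = (
    "Chairman: We resume at 3: 00 sharp.\n"
    "this continuation line has no label\n"
    "New Zealand: Thank you, Mr. Chair."
)
pairs = [p for p in map(split_line, sample.split("\n")) if p is not None]
print(pairs)
```

If the source transcripts can contain multi-line speeches, it may be worth checking how many lines are skipped before relying on the CSV.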

## Step 2: Data Chunking
This step splits the text from the CSV into chunks, which involves the following steps:
1. Import the data from the CSV into a DataFrame (df)
2. Import the recursive text splitter from LangChain
3. Read the first row of the df and break its text column down into chunks
4. Create a new df (df_chunks) with an id, session, meeting, speaker and chunk text, where each row is one chunk of the newly chunked text
5. Iterate through the remaining rows of the df, repeating step 4 into a df_temp and concatenating df_chunks with df_temp
6. Generate vector embeddings for the text and speaker of each chunk
7. Save df_chunks as CSV and JSON files

In [4]:
#import libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import re
from langchain_community.embeddings import HuggingFaceEmbeddings
import json
import pandas as pd

In [6]:
df = pd.read_csv(csv_file)
df.head()

Unnamed: 0,Id,Session,Meeting,Speaker,Text
0,S3M51,3,5,Chairman,"Good morning, distinguished delegates. Please ..."
1,S3M52,3,5,Russia,Mr. Chairman. We would like to take the floor ...
2,S3M53,3,5,Chairman,I thank the distinguished representative of th...
3,S3M54,3,5,New Zealand,"Thank you, Mr. Chair, when New Zealand conside..."
4,S3M55,3,5,Chairman,I thank the distinguished representative of Ne...


In [7]:
#define the text splitter
def chucking_text(text):
    """Split text into chunks of at most 512 characters, overlapping by up to 128 characters."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=128,
        length_function=len,
        is_separator_regex=False,
    )
    return text_splitter.split_text(text)
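
The effect of `chunk_size` and `chunk_overlap` can be seen in a simplified form below. This is a plain fixed-width sliding window, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as paragraph breaks, newlines and spaces). `sliding_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between the starts of two windows
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

In this simplified picture, the settings used in the notebook (512 and 128) would start each chunk 384 characters after the previous one; the real splitter shifts those boundaries to fall on separators.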


In [8]:
#retrieve the values in the first row of df (session, meeting and speaker as one-element lists)
text = df['Text'][0]
id = df['Id'][0]
session = [df['Session'][0]]
meeting = [df['Meeting'][0]]
speaker = [df['Speaker'][0]]
chunks = chucking_text(text)

#print id, session, meeting, speaker, text and chunks on separate lines
print('Id:', id)
print('Session:', session)
print('Meeting:', meeting)
print('Speaker:', speaker)
print('Text:', text)
print('Chunks:', chunks)

Id: S3M51
Session: [3]
Meeting: [5]
Speaker: ['Chairman']
Text: Good morning, distinguished delegates. Please take your seats. Good morning. Ladies and gentlemen, distinguished delegates. Please take your seats. We will begin now. I would now like to continue our work under agenda item 6C, which is to make recommendations on possible norms, rules and principles of responsible behaviours relating to threats by states to space systems, including, as appropriate, how they would contribute to the negotiation of legally binding instruments, including on the prevention of an arms race in outer space. This morning, we will begin our consideration of topic 2B of the indicative timetable, which is norms, rules and principles relating to counter space capabilities, including space to Earth and space to space threats. Before I give the floor to any delegation that wishes to intervene under this topic, I would like to inform that there will be a family picture taken, as is already tradition, at 1:

In [9]:
#create a df where each row holds one chunk of the text
session = session * len(chunks)
meeting = meeting * len(chunks)
speaker = speaker * len(chunks)
#chunk ids are the row id followed by the chunk number, starting at 1
chunk_ids = [f'{id}{n}' for n in range(1, len(chunks) + 1)]

df_chunks = pd.DataFrame({
    'Id': chunk_ids,
    'Session': session,
    'Meeting': meeting,
    'Speaker' : speaker,
    'Text' : chunks
})

df_chunks.head()


Unnamed: 0,Id,Session,Meeting,Speaker,Text
0,S3M511,3,5,Chairman,"Good morning, distinguished delegates. Please ..."
1,S3M512,3,5,Chairman,"as appropriate, how they would contribute to t..."
2,S3M513,3,5,Chairman,give the floor to any delegation that wishes t...
3,S3M514,3,5,Chairman,have already referred to this topic in their f...


In [10]:
#repeat the step above for the remaining rows in the df
for row in range(1, len(df)):
    text = df['Text'][row]
    id = df['Id'][row]
    session = [df['Session'][row]]
    meeting = [df['Meeting'][row]]
    speaker = [df['Speaker'][row]]
    chunks = chucking_text(text)
    session = session * len(chunks)
    meeting = meeting * len(chunks)
    speaker = speaker * len(chunks)
    #number the chunks from 1, using a name that does not overwrite the row counter
    chunk_ids = [f'{id}{n}' for n in range(1, len(chunks) + 1)]

    df_temp = pd.DataFrame({
        'Id': chunk_ids,
        'Session': session,
        'Meeting': meeting,
        'Speaker' : speaker,
        'Text' : chunks
    })
    df_chunks = pd.concat([df_chunks, df_temp])
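
One thing to watch with ids built by plain concatenation: speaker turn 1, chunk 11 and speaker turn 11, chunk 1 both produce `S3M5111`. Whether this actually occurs depends on how many speaker turns and chunks a file has. A possible alternative is sketched below with a hypothetical `make_chunk_id` helper and an assumed `-` separator; neither is part of the notebook.

```python
def concat_id(session, meeting, speaker_no, chunk_no):
    # Same scheme as the notebook: the parts are joined with no separator.
    return f"S{session}M{meeting}{speaker_no}{chunk_no}"

def make_chunk_id(session, meeting, speaker_no, chunk_no):
    # Hypothetical alternative: a separator keeps each part recoverable.
    return f"S{session}M{meeting}-{speaker_no}-{chunk_no}"

# Two different (speaker turn, chunk) pairs under each scheme
print(concat_id(3, 5, 1, 11), concat_id(3, 5, 11, 1))
print(make_chunk_id(3, 5, 1, 11), make_chunk_id(3, 5, 11, 1))
```

Changing the id format would also change the ids shown in the outputs of this notebook, so it is a design choice to make before any data is uploaded.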

In [11]:
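#convert data to vector embeddings
#load the model once here instead of reloading it on every call
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

def generate_embeddings(text):
    """Return the embedding vector (a list of floats) for a single string."""
    return embedding_model.embed_query(text)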


In [12]:
#create a new column in df_chunks holding the embedding of each chunk's text
df_chunks['TextEmbeddings'] = df_chunks['Text'].apply(generate_embeddings)



In [13]:
df_chunks['SpeakerEmbeddings'] = df_chunks['Speaker'].apply(generate_embeddings)
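
Because many rows share the same speaker, the same string is embedded repeatedly. One option is to memoise the embedding function. The sketch below uses `functools.lru_cache` with a stand-in `fake_embed` function (hypothetical, used here so the example runs without downloading a model); in the notebook the cached function would wrap `generate_embeddings` instead.

```python
from functools import lru_cache

calls = []  # records every time the underlying embedder actually runs

def fake_embed(text):
    # Stand-in for a real embedding model (assumption for illustration only).
    calls.append(text)
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

@lru_cache(maxsize=None)
def cached_embed(text):
    # Cache a tuple so callers cannot mutate the stored value.
    return tuple(fake_embed(text))

speakers = ["Chairman", "Russia", "Chairman", "New Zealand", "Chairman"]
embeddings = [list(cached_embed(s)) for s in speakers]
print(len(calls))  # number of times the embedder really ran
```

Here the embedder runs once per distinct speaker rather than once per row.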

In [14]:
df_chunks.head()

Unnamed: 0,Id,Session,Meeting,Speaker,Text,TextEmbeddings,SpeakerEmbeddings
0,S3M511,3,5,Chairman,"Good morning, distinguished delegates. Please ...","[0.07812656462192535, -0.008435213938355446, 0...","[0.028822308406233788, 0.05176027491688728, 0...."
1,S3M512,3,5,Chairman,"as appropriate, how they would contribute to t...","[0.0780441015958786, -0.014860063791275024, 0....","[0.028822308406233788, 0.05176027491688728, 0...."
2,S3M513,3,5,Chairman,give the floor to any delegation that wishes t...,"[-0.010861157439649105, 0.08269672840833664, -...","[0.028822308406233788, 0.05176027491688728, 0...."
3,S3M514,3,5,Chairman,have already referred to this topic in their f...,"[0.08796455711126328, -0.015926294028759003, 0...","[0.028822308406233788, 0.05176027491688728, 0...."
0,S3M521,3,5,Russia,Mr. Chairman. We would like to take the floor ...,"[0.07228662818670273, -0.050627969205379486, 0...","[0.04608728364109993, -0.003700980683788657, -..."


In [15]:
#read each row into a dictionary and store in a list
rows = df_chunks.to_dict(orient='records')

In [20]:
len(rows)

319

In [21]:
#save df_chunks as a csv file
df_chunks.to_csv('/Users/pelumioluwaabiola/Desktop/Transcriptions/detailed_data/Meeting_5_Session_3.csv', index=False)
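
Note that a CSV cell can only hold text, so a list-valued cell such as an embedding is written in its string form and comes back as a string when the file is read again. The standalone sketch below (standard library only, with made-up values and an in-memory buffer instead of a real file) shows the round trip and one way to recover the list with `json.loads`. It is an illustration of the format, not of what pandas does internally.

```python
import csv
import io
import json

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Id", "TextEmbeddings"])
writer.writerow(["S3M511", [0.25, -0.5, 1.0]])  # the list is stored as its str() form

buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["TextEmbeddings"]).__name__)  # the cell comes back as a string

vector = json.loads(row["TextEmbeddings"])  # parse the string back into a list of floats
print(vector)
```

The JSON file written from `rows` keeps the embeddings as real lists, which may make it the more convenient format for reloading them.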

In [19]:
#save in a json file
with open('/Users/pelumioluwaabiola/Desktop/Transcriptions/detailed_data/Meeting_5_Session_3.json', 'w') as f:
    json.dump(rows, f)
