## Data Ingestion
This file prepares the transcript data for ingestion and database upload in the following steps:
1. Data File Creation
2. Data Chunking

## Step 1 : Data File Creation 
In this step, the data is read from a text file in which each speech is labelled with the delegate who gave it. It involves the following steps:
1. Import the data from the delegate-separated speech text file
2. Split each line on the first ': ' to separate the speaker (delegate) from the speech text
3. Create a csv with columns for id, session, meeting, speaker and text, and write each speech into its corresponding columns
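
To make step 2 concrete, here is a small sketch of how a transcript line is split on the first `': '` only. The sample lines and the helper name `split_speaker_line` are invented for illustration and are not taken from the transcripts.

```python
# Hypothetical sample lines, invented for illustration only.
sample_lines = [
    "Chairman: I give the floor to the next speaker.",
    "Delegate A: We have one request: please share the agenda.",
    "(a continuation line without a speaker label)",
]

def split_speaker_line(line, delimiter=": "):
    """Return (speaker, text) for a labelled line, or None if no label is found."""
    if delimiter not in line:
        return None
    # maxsplit=1 splits on the first delimiter only, so a ': ' inside the
    # speech itself stays part of the text.
    speaker, text = line.split(delimiter, 1)
    return speaker.strip(), text.strip()

parsed = [split_speaker_line(line) for line in sample_lines]
```

A line without the delimiter yields no row, so a speech that wraps onto an unlabelled line would lose that line. It is worth checking the input file for such lines before running the conversion.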

In [11]:
text_file = '/Users/pelumioluwaabiola/Desktop/Transcriptions/Speaker-Labelled/Meeting_10_Session_3.txt'
csv_file = '/Users/pelumioluwaabiola/Desktop/Transcriptions/csv/Meeting_10_Session_3.csv'

In [12]:
import csv

def split_text(text, delimiter=': '):
    pairs = []
    num_session = 3  # Session number of the input file; change it to match your file
    num_meeting = 10  # Meeting number of the input file; change it to match your file
    numspeaker = 0  # Don't change; counts the speaker turns found in the txt file
    for line in text.split("\n"):
        if delimiter in line:
            numspeaker += 1

            # Build the row id from session, meeting and speaker-turn number
            row_id = f'S{num_session}M{num_meeting}{numspeaker}'
            # Split on the first delimiter only so a ': ' inside the speech is kept
            parts = line.split(delimiter, 1)
            pairs.append((row_id, num_session, num_meeting, parts[0].strip(), parts[1].strip()))
    return pairs

# Read text from file
with open(text_file, 'r', encoding='utf-8', errors='ignore') as file:
    text = file.read()

# Split text and write to CSV
parts = split_text(text)

with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Id', 'Session', 'Meeting', 'Speaker', 'Text'])
    for part in parts:
        writer.writerow(part)
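
The session and meeting numbers are hard-coded in `split_text` and have to be edited for every file. Since the input file above is named `Meeting_10_Session_3.txt`, one option is to read the numbers from the file name. The helper `parse_meeting_session` below is a hypothetical addition and assumes the `Meeting_<n>_Session_<n>` pattern; files named differently would need their own pattern.

```python
import re
from pathlib import PurePosixPath

def parse_meeting_session(path):
    """Extract (session, meeting) from a name like 'Meeting_10_Session_3.txt'.

    Hypothetical helper: assumes the 'Meeting_<n>_Session_<n>' naming pattern.
    """
    stem = PurePosixPath(path).stem  # file name without directory or extension
    match = re.fullmatch(r"Meeting_(\d+)_Session_(\d+)", stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {stem!r}")
    meeting, session = int(match.group(1)), int(match.group(2))
    return session, meeting

num_session, num_meeting = parse_meeting_session("Transcriptions/Meeting_10_Session_3.txt")
```

Raising an error on an unexpected name is a deliberate choice here: a silently wrong session or meeting number would end up in every row id.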

## Step 2 : Data Chunking
This step splits the data from the csv into chunks, as follows:
1. Import the data from the csv into a df
2. Import the recursive text splitter from langchain
3. Read the first row in the df and break down its text column into chunks
4. Create a new df (df_chunks) with an id, session, meeting, speaker and chunk text, where each row is one chunk of the newly chunked text
5. Iterate through the df, repeating step 4 for each row into a df_temp, and concat df_chunks with df_temp

In [1]:
#import libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
import json
import pandas as pd

In [44]:
csv_file = '/Users/pelumioluwaabiola/Desktop/Transcriptions/session2/Meeting1.csv'

In [45]:
df = pd.read_csv(csv_file)
df.head()

Unnamed: 0,Id,Session,Meeting,Speaker,Text
0,S2M11,2,1,Chairman,"Good morning, ladies and gentlemen.Excellencie..."
1,S2M12,2,1,Dr. Suzuki,"Thank you very much, Chair, and thank you very..."
2,S2M13,2,1,Chairman,"I thank Dr. Suzuki, professor from the Graduat..."
3,S2M14,2,1,Mr. James Black,"Thank you, Mr. Chairman. Good morning, everyon..."
4,S2M15,2,1,Professor Wang,"Thank you, Chair. Good morning, distinguished ..."


In [46]:
#define the text splitter
def chucking_text(text):
    # chunks of at most 512 characters, with up to 128 characters of overlap
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=128,
        length_function=len,
        is_separator_regex=False,
    )
    text_chunks = text_splitter.split_text(text)
    return text_chunks
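
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain fixed-width sliding window. This is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter prefers to break on separators such as paragraph breaks and spaces, so its chunks are often shorter than `chunk_size`. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=128):
    """Split text into fixed-width character windows that overlap.

    Simplified illustration only; it ignores word and sentence boundaries.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
    return chunks

demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

In this simplified model, the values used in the notebook would start consecutive windows 384 characters apart (512 - 128), so the last 128 characters of one chunk reappear at the start of the next.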


In [47]:
#retrieve the first row of df; session, meeting and speaker are wrapped in lists so they can be repeated per chunk
text = df['Text'][0]
id = df['Id'][0]
session = [df['Session'][0]]
meeting = [df['Meeting'][0]]
speaker = [df['Speaker'][0]]
chunks = chucking_text(text)

#print id, session, meeting, speaker, text and chunks on new lines
print('Id:', id)
print('Session:', session)
print('Meeting:', meeting)
print('Speaker:', speaker)
print('Text:', text)
print('Chunks:', chunks)

Id: S2M11
Session: [2]
Meeting: [1]
Speaker: ['Chairman']
Text: Good morning, ladies and gentlemen.Excellencies, distinguished delegates, ladies and gentlemen.I apologize for our late start, but I hope that we can start on time the rest of the week. Kindly take your seats, please.It is my honor and pleasure to declare open the second session of the open-ended working group on reducing space threats through norms, rules, and principles of responsible behaviors. But before we proceed with our agenda, I would like to express to the governments of the United Kingdom, of Great Britain and Northern Islands, and of the Commonwealth realms my heartfelt condolences on the passing of Her Majesty Queen Elizabeth II. In this regard, I would like to request delegations to observe a minute of silence in her honor. Before clarifying the organizational aspects and proceedings with our substantive work, please allow me to make some preliminary remarks in my capacity as Chair of the OEWG. This I will do

In [48]:
#create a df where each row is one chunk of the text
session = session * len(chunks)
meeting = meeting * len(chunks)
speaker = speaker * len(chunks)
#number the chunks from 1 and append the number to the row id
chunk_ids = [f'{id}{i}' for i in range(1, len(chunks) + 1)]

df_chunks = pd.DataFrame({
    'Id': chunk_ids,
    'Session': session,
    'Meeting': meeting,
    'Speaker' : speaker,
    'Text' : chunks
})

df_chunks.head()


Unnamed: 0,Id,Session,Meeting,Speaker,Text
0,S2M111,2,1,Chairman,"Good morning, ladies and gentlemen.Excellencie..."
1,S2M112,2,1,Chairman,behaviors. But before we proceed with our agen...
2,S2M113,2,1,Chairman,aspects and proceedings with our substantive w...
3,S2M114,2,1,Chairman,"of responsible behaviors. During this session,..."
4,S2M115,2,1,Chairman,from the outcome of this session will in large...


In [78]:

# Initialize an empty DataFrame to store chunked data
df_chunks = pd.DataFrame(columns=['Id', 'Session', 'Meeting', 'Speaker', 'Text'])

# Iterate through each row of the original DataFrame
for index, row in df.iterrows():
    text = row['Text']
    id_ = row['Id']
    session = row['Session']
    meeting = row['Meeting']
    speaker = row['Speaker']
    
    # Chunk the text
    chunks = chucking_text(text)
    
    # Create chunk IDs ensuring uniqueness
    chunk_ids = [f'{id_}_{i+1}' for i in range(len(chunks))]  # Use index to ensure uniqueness
    
    # Expand session, meeting, and speaker lists to match the number of chunks
    session = [session] * len(chunks)
    meeting = [meeting] * len(chunks)
    speaker = [speaker] * len(chunks)
    
    # Create a DataFrame for the chunks of the current row
    df_temp = pd.DataFrame({
        'Id': chunk_ids,
        'Session': session,
        'Meeting': meeting,
        'Speaker': speaker,
        'Text': chunks
    })
    
    # Append the chunked data to the main DataFrame
    df_chunks = pd.concat([df_chunks, df_temp])

# Reset the index of the resulting DataFrame
df_chunks.reset_index(drop=True, inplace=True)

# Print the resulting DataFrame
df_chunks.head()


Unnamed: 0,Id,Session,Meeting,Speaker,Text
0,S2M11_1,2,1,Chairman,"Good morning, ladies and gentlemen.Excellencie..."
1,S2M11_2,2,1,Chairman,behaviors. But before we proceed with our agen...
2,S2M11_3,2,1,Chairman,aspects and proceedings with our substantive w...
3,S2M11_4,2,1,Chairman,"of responsible behaviors. During this session,..."
4,S2M11_5,2,1,Chairman,from the outcome of this session will in large...


In [79]:
#check rows with duplicate values in id
duplicate_rows = df_chunks[df_chunks['Id'].duplicated(keep=False)]
duplicate_rows

Unnamed: 0,Id,Session,Meeting,Speaker,Text
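
The `_` separator in the chunk IDs is what keeps this check empty. Joining a row ID and a chunk number with no separator, as the single-row cell earlier does, can map two different (row, chunk) pairs to the same string. The sketch below demonstrates this with made-up pairs; the function names are hypothetical.

```python
def chunk_id_plain(row_id, chunk_number):
    # No separator: the boundary between row ID and chunk number is lost.
    return f"{row_id}{chunk_number}"

def chunk_id_separated(row_id, chunk_number):
    # Underscore separator, as in the loop above.
    return f"{row_id}_{chunk_number}"

# Hypothetical pairs: row S2M11 chunk 12, and row S2M111 chunk 2.
pairs = [("S2M11", 12), ("S2M111", 2)]

plain_ids = [chunk_id_plain(r, c) for r, c in pairs]
separated_ids = [chunk_id_separated(r, c) for r, c in pairs]

print(plain_ids)      # ['S2M1112', 'S2M1112'] -> collision
print(separated_ids)  # ['S2M11_12', 'S2M111_2'] -> unique
```

The row IDs built in Step 1 join the meeting number and the speaker count without a separator as well, so IDs from different files could collide in a shared collection (meeting 1, speaker 11 versus meeting 11, speaker 1).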


In [109]:
#convert text to vector embeddings
#load the model once; re-creating it inside the function reloads it for every text
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

def generate_embeddings(text):
    embeddings = embedding_model.embed_query(text)
    return embeddings


In [29]:
#optional (disabled): embed the speaker names into a SpeakerEmbeddings column
#df_chunks['SpeakerEmbeddings'] = df_chunks['Speaker'].apply(generate_embeddings)

In [81]:
#create a new column in df_chunks holding the embedding of each text chunk
df_chunks['TextEmbeddings'] = df_chunks['Text'].apply(generate_embeddings)

In [82]:
df_chunks.head()

Unnamed: 0,Id,Session,Meeting,Speaker,Text,TextEmbeddings
0,S2M11_1,2,1,Chairman,"Good morning, ladies and gentlemen.Excellencie...","[0.06073196232318878, 0.03783389925956726, 0.0..."
1,S2M11_2,2,1,Chairman,behaviors. But before we proceed with our agen...,"[0.030805662274360657, 0.019306646659970284, 0..."
2,S2M11_3,2,1,Chairman,aspects and proceedings with our substantive w...,"[0.058311160653829575, 0.03847544640302658, 0...."
3,S2M11_4,2,1,Chairman,"of responsible behaviors. During this session,...","[-0.007044042926281691, 0.07030506432056427, 0..."
4,S2M11_5,2,1,Chairman,from the outcome of this session will in large...,"[0.062239743769168854, 0.027549689635634422, 0..."


In [83]:
#get a list of all documents in the Text column
text_documents = df_chunks['Text'].tolist()
text_documents[0:3]

['Good morning, ladies and gentlemen.Excellencies, distinguished delegates, ladies and gentlemen.I apologize for our late start, but I hope that we can start on time the rest of the week. Kindly take your seats, please.It is my honor and pleasure to declare open the second session of the open-ended working group on reducing space threats through norms, rules, and principles of responsible behaviors. But before we proceed with our agenda, I would like to express to the governments of the United Kingdom, of',
 'behaviors. But before we proceed with our agenda, I would like to express to the governments of the United Kingdom, of Great Britain and Northern Islands, and of the Commonwealth realms my heartfelt condolences on the passing of Her Majesty Queen Elizabeth II. In this regard, I would like to request delegations to observe a minute of silence in her honor. Before clarifying the organizational aspects and proceedings with our substantive work, please allow me to make some preliminar

In [84]:
#get a list of all embeddings in the TextEmbeddings column
text_embeddings = df_chunks['TextEmbeddings'].tolist()
text_embeddings[0:3]

[[0.06073196232318878,
  0.03783389925956726,
  0.020920174196362495,
  0.010521446354687214,
  -0.025783361867070198,
  -0.009848172776401043,
  0.02830221876502037,
  -0.0006439101416617632,
  0.03423537686467171,
  0.021092427894473076,
  0.007729066535830498,
  -0.013106061145663261,
  -0.05244646221399307,
  -0.0018999157473444939,
  0.06203974783420563,
  -0.06058746948838234,
  0.11066429316997528,
  -0.06388664990663528,
  0.004803895018994808,
  -0.03362182527780533,
  0.01339819934219122,
  0.0008751021232455969,
  -0.04027486965060234,
  -0.004347799811512232,
  -0.007982109673321247,
  -0.024580711498856544,
  0.027292484417557716,
  0.003245210973545909,
  0.02745017781853676,
  -0.015159749425947666,
  0.07170005142688751,
  0.005119833163917065,
  -0.008901858702301979,
  -0.027876973152160645,
  2.5633760287746554e-06,
  -0.004513101652264595,
  0.005117173306643963,
  0.02619227208197117,
  0.011745836585760117,
  0.050380222499370575,
  0.009460522793233395,
  -0.0453

In [85]:
#get a list of all chunk ids in the Id column
id = df_chunks['Id'].tolist()
id[0:3]

['S2M11_1', 'S2M11_2', 'S2M11_3']

In [86]:
#read each row for session, speaker and meeting columns into a dictionary and store in a list
metadata = df_chunks[['Session', 'Speaker', 'Meeting']].to_dict('records')
metadata[0:3]

[{'Session': 2, 'Speaker': 'Chairman', 'Meeting': 1},
 {'Session': 2, 'Speaker': 'Chairman', 'Meeting': 1},
 {'Session': 2, 'Speaker': 'Chairman', 'Meeting': 1}]

In [87]:
#save df_chunks as a csv file
df_chunks.to_csv('/Users/pelumioluwaabiola/Desktop/Transcriptions/session2/CleanedMeeting1.csv', index=False)

In [19]:
#save in a json file
#with open(f"{'/Users/pelumioluwaabiola/Desktop/Transcriptions/detailed_data/Meeting_5_Session_3'}.json", "w") as f:
#    json.dump(rows, f)


## Upload to Vector Store: Chroma DB (Open Source)

Upload the data to Chroma DB, an open-source vector store. Note that Chroma DB does not support hybrid search.
1. Connect to Chroma DB
2. Create a collection (vector base) in Chroma DB
3. Add data to the vector store
4. Query the vector store: this requires embedding the query with the same embedding function used to embed the text, then passing the query embedding to Chroma

In [37]:
import chromadb


In [88]:
#connect with the chroma client (in-memory by default, so data is not persisted)
chroma_client = chromadb.Client()

In [None]:
#create a collection in chroma
collection = chroma_client.create_collection(
        name="space_collection",
        metadata={"hnsw:space": "cosine"} 
    )
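
The `hnsw:space` setting selects the distance used to compare embeddings. As a rough sketch of what cosine distance measures, assuming the common definition of one minus cosine similarity, the plain-Python function below compares small made-up vectors. Real embeddings from `all-mpnet-base-v2` have 768 dimensions; the three-element vectors here are for illustration only.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Same direction, different length: distance is (numerically) close to 0.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: distance is 1.
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because the measure depends only on direction, scaling a vector does not change its distance to other vectors.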


In [89]:
#add data to collection
collection.add(
    documents=text_documents,
    embeddings=text_embeddings,
    metadatas=metadata,
    ids=id
)
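
The call above sends every chunk in one request. For larger transcripts it may be safer to add records in batches, since vector stores commonly limit how many records one call can take. The helper `iter_batches` below is a hypothetical sketch in plain Python; the default batch size of 1000 is an arbitrary assumption, not a documented Chroma limit.

```python
def iter_batches(ids, documents, embeddings, metadatas, batch_size=1000):
    """Yield aligned slices of the four parallel lists, batch_size records at a time."""
    if not (len(ids) == len(documents) == len(embeddings) == len(metadatas)):
        raise ValueError("All input lists must have the same length")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield {
            "ids": ids[start:end],
            "documents": documents[start:end],
            "embeddings": embeddings[start:end],
            "metadatas": metadatas[start:end],
        }

# Tiny made-up records to show the batching behaviour.
demo_batches = list(iter_batches(
    ids=["a", "b", "c"],
    documents=["doc a", "doc b", "doc c"],
    embeddings=[[0.1], [0.2], [0.3]],
    metadatas=[{"Session": 2}, {"Session": 2}, {"Session": 2}],
    batch_size=2,
))
# In the notebook, each batch could be passed on with: collection.add(**batch)
```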

In [106]:
#embed the query
query = 'what was said by Australia in session 2 meeting 1'
query_embeddings = generate_embeddings(query)

In [107]:
#query the chroma db
results = collection.query(
    query_embeddings=query_embeddings,
    n_results=10
)
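
The query text mentions a session and a meeting, but semantic search alone does not restrict results to them. Chroma's `query` method accepts a `where` argument for filtering on metadata, so the `Session`, `Meeting` and `Speaker` fields stored with each chunk can narrow the search. The helper `build_where` below is a hypothetical sketch that builds such a filter; the `$and` and `$eq` operators follow Chroma's filter syntax, but they should be checked against the installed version.

```python
def build_where(session=None, meeting=None, speaker=None):
    """Build a Chroma-style metadata filter from the fields that are provided."""
    conditions = []
    if session is not None:
        conditions.append({"Session": {"$eq": session}})
    if meeting is not None:
        conditions.append({"Meeting": {"$eq": meeting}})
    if speaker is not None:
        conditions.append({"Speaker": {"$eq": speaker}})
    if not conditions:
        return None  # no filter requested
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no $and wrapper
    return {"$and": conditions}

where = build_where(session=2, meeting=1)
# In the notebook this could be used as:
# results = collection.query(query_embeddings=query_embeddings, n_results=10, where=where)
```

A speaker filter only helps if the value matches what was stored exactly, for example `'Chairman'`. Whether a delegation such as Australia appears as a `Speaker` value depends on how the transcript was labelled.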

In [108]:
#view the top 10 results retrieved for the query
results['documents']

[['This was partly thanks to the excellent level of the panelists, but what was even more important was the positive attitude of delegations. I hope that in this second session during which we will tackle even more complex and sensitive issues, we will similarly be able to keep the same constructive spirit and the wider and the rearranging participation of all delegations. If this condition is not met, then we will simply have no possibility of making any progress with our task. Contrary to the first session,',
  'from the outcome of this session will in large part depend being able to tackle further discussion on possible recommendations to face these threats. Distinguished colleagues, I am very well aware of the complex international context. It is a process which is not happening in a vacuum. that we are not immune to the global political situation. Nevertheless, with the active and constructive participation of all delegations, I am certain that we will be able to make progress in 