## ML-CHAMP Prep Work

This notebook is for testing various scripts that will be used in the QA_Pipeline project.

#### Test 1: Loading the docs

Document loading should not have to reload every doc on each run.  Instead, generate a pandas dataframe, persisted to document_index.csv, with the following columns:

        - doc_id
        - doc_filename
        - doc_contents

The project should check for new docs only when specifically requested.  A document update does not necessarily force an index update, because an existing index will not break by trying to retrieve docs that do not exist in the dataframe.  The opposite case, a newer index calling an older copy of the doc dataframe, should not occur either: there is only ever one copy of the dataframe, and it is the latest.
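
One way to express the "only when specifically requested" rule is an explicit flag on the scan step. The sketch below is a minimal illustration, not the project's actual code: the function name `find_new_files` and the `check_for_new` parameter are assumptions introduced here, and the demo runs against a throwaway temporary directory rather than the real docs folder.

```python
import os
import tempfile

def find_new_files(docs_dir, known_filenames, check_for_new=False):
    """Return file names in docs_dir that are not in known_filenames.

    Hypothetical helper: it skips the folder scan entirely unless
    check_for_new is True, so a normal load never looks for new docs.
    """
    if not check_for_new:
        return []
    return sorted(
        name for name in os.listdir(docs_dir)
        if os.path.isfile(os.path.join(docs_dir, name))
        and name not in known_filenames
    )

# Demonstration in a temporary directory with two placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write("placeholder")
    print(find_new_files(tmp, {"a.txt"}))                      # []
    print(find_new_files(tmp, {"a.txt"}, check_for_new=True))  # ['b.txt']
```

Keeping the default at `False` means a caller has to opt in to the scan, which matches the intent that loading and updating are separate actions.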

##### HOWEVER:
If you delete the document_index.csv associated with the dataframe, ALL OF THE DOC IDS WILL GET REGENERATED!!!<br>
That will mess up all of the previously created index mappings...
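
One possible way to soften this risk, shown here only as a sketch, is to derive IDs deterministically from the file name with `uuid.uuid5` instead of drawing random ones with `uuid.uuid4`. This is a different technique from the one used in this notebook, and it assumes file names are unique and never renamed. The namespace string `"ml-champ/docs"` and the helper name `stable_doc_id` are made up for illustration. Switching to it would also change the IDs already stored in document_index.csv, so existing indexes would need to be rebuilt once.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes.
DOC_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "ml-champ/docs")

def stable_doc_id(file_name: str) -> str:
    """Derive a repeatable ID from the file name (assumes names are unique and stable)."""
    return str(uuid.uuid5(DOC_NAMESPACE, file_name))

first = stable_doc_id("AI_1.txt")
second = stable_doc_id("AI_1.txt")
print(first == second)                                # True: same name, same ID
print(stable_doc_id("ClimateChange_1.txt") == first)  # False: different name
```

With IDs like these, a deleted CSV could be regenerated without changing any doc_id, as long as the files themselves keep their names.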

In [1]:
import os
import pandas as pd
import uuid

In [2]:
data_dir = 'D:/JupyterPrograms/0-CHAT_GPT/EXPERIMENTS/ML_CHAMP/data/'
docs_dir = data_dir + 'docs'
docs_csv = data_dir + 'document_index.csv'

In [3]:
# Load the documents and their id mappings
def load_documents_from_folder(folder_path):
    """Map a sequential integer ID to each file's name and content."""
    # Build the mapping locally so repeated calls do not append duplicates
    # to a shared module-level dict.
    documents_mapping = {}

    # Sort the listing so IDs are assigned in a stable order across runs
    for file_name in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                document_content = file.read()
            documents_mapping[len(documents_mapping)] = {
                'file_name': file_name,
                'content': document_content
            }

    return documents_mapping

# Load documents from the 'data/docs' folder
documents_mapping = load_documents_from_folder(docs_dir)

# Construct documents array from the values of documents_mapping
documents = [doc_info['content'] for doc_info in documents_mapping.values()]

# Print the document ID to file name mapping
for doc_id, doc_info in documents_mapping.items():
    print(f"Document ID: {doc_id}, File Name: {doc_info['file_name']}")
    
# Load the document index CSV if it exists; otherwise create an empty one
if not os.path.exists(docs_csv):
    # If the file doesn't exist, create a new DataFrame
    # Create an empty DataFrame with column names
    column_names = ['doc_id', 'doc_filename', 'doc_contents']
    df = pd.DataFrame(columns=column_names)
    
    # Save the DataFrame to the CSV file
    df.to_csv(docs_csv, index=False)
    print(f"CSV file created at: {docs_csv}")
else:
    # If the file already exists, load it into a DataFrame
    df = pd.read_csv(docs_csv)
    print(f"CSV file already exists at: {docs_csv}")
    print("Loaded existing DataFrame:")
    
df.head()  # Displaying the first few rows of the loaded DataFrame

Document ID: 0, File Name: AI_1.txt
Document ID: 1, File Name: ClimateChange_1.txt
Document ID: 2, File Name: CulturalDiversityAndTraditions_1.txt
Document ID: 3, File Name: FinancialMarkets_1.txt
Document ID: 4, File Name: HistoryAndHistoricalEvents_1.txt
Document ID: 5, File Name: Terrorism_1.txt
Document ID: 6, File Name: WorldHealthIssues_1.txt
CSV file already exists at: D:/JupyterPrograms/0-CHAT_GPT/EXPERIMENTS/ML_CHAMP/data/document_index.csv
Loaded existing DataFrame:


Unnamed: 0,doc_id,doc_filename,doc_contents
0,2d28937a-69ae-4fc4-8543-1b9b5498fab2,AI_1.txt,Artificial intelligence (AI) vs. machine learn...
1,ae79310b-1c53-4784-b813-08537e483b7e,ClimateChange_1.txt,What Is Climate Change?\nClimate change refers...
2,a97167de-a09a-4d43-92a2-efa489a237d7,CulturalDiversityAndTraditions_1.txt,Cultural diversity\n\nArticle\nTalk\nRead\nEdi...
3,b58716b5-0ef5-4967-8cfa-96d71ce04f3e,FinancialMarkets_1.txt,"Financial Markets: Role in the Economy, Import..."
4,b78ff432-aa81-4a95-b3ec-9b4d050daaae,HistoryAndHistoricalEvents_1.txt,Americans Name the 10 Most Significant Histori...


In [4]:
# File names already recorded in the DataFrame
existing_files = set(df['doc_filename'].tolist())
# Regular files currently present in the documents directory
files_to_process = [file for file in os.listdir(docs_dir) if os.path.isfile(os.path.join(docs_dir, file))]

# Function to read file contents
def read_file_contents(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        contents = file.read()
    return contents

# Check for files that are not present in the DataFrame and add them
new_entries = []
for file_name in files_to_process:
    if file_name not in existing_files:
        # Generate a unique ID for the new document
        unique_id = str(uuid.uuid4())
        
        # Read file contents
        file_path = os.path.join(docs_dir, file_name)
        contents = read_file_contents(file_path)
        
        # Prepare a new entry for the DataFrame
        new_entry = {'doc_id': unique_id, 'doc_filename': file_name, 'doc_contents': contents}
        new_entries.append(new_entry)

# Concatenate new entries to the existing DataFrame
if new_entries:
    new_df = pd.DataFrame(new_entries)
    df = pd.concat([df, new_df], ignore_index=True)

    # Save the updated DataFrame to the CSV file
    df.to_csv(docs_csv, index=False)
    print("Documents checked and DataFrame updated successfully.")
else:
    print('No new documents were found.  No DataFrame update required.')
    
df.head(10)

No new documents were found.  No DataFrame update required.


Unnamed: 0,doc_id,doc_filename,doc_contents
0,2d28937a-69ae-4fc4-8543-1b9b5498fab2,AI_1.txt,Artificial intelligence (AI) vs. machine learn...
1,ae79310b-1c53-4784-b813-08537e483b7e,ClimateChange_1.txt,What Is Climate Change?\nClimate change refers...
2,a97167de-a09a-4d43-92a2-efa489a237d7,CulturalDiversityAndTraditions_1.txt,Cultural diversity\n\nArticle\nTalk\nRead\nEdi...
3,b58716b5-0ef5-4967-8cfa-96d71ce04f3e,FinancialMarkets_1.txt,"Financial Markets: Role in the Economy, Import..."
4,b78ff432-aa81-4a95-b3ec-9b4d050daaae,HistoryAndHistoricalEvents_1.txt,Americans Name the 10 Most Significant Histori...
5,cc1ea0f4-27e6-4c6b-abd1-2d042e6ca6f7,Terrorism_1.txt,Terrorism\n\nArticle\nTalk\nRead\nView source\...
6,e5e9975d-5826-4664-83a7-92115931b302,WorldHealthIssues_1.txt,"11 global health issues to watch in 2023, acco..."


In [5]:
# Function to retrieve doc_filename and doc_contents based on doc_id
def get_info_by_doc_id(dataframe, doc_id):
    # Check if doc_id exists in the DataFrame
    if doc_id in dataframe['doc_id'].values:
        # Using .loc[] to get the information based on doc_id
        info = dataframe.loc[dataframe['doc_id'] == doc_id, ['doc_id', 'doc_filename', 'doc_contents']]
        return info.iloc[0]  # Return the information as a Series
    else:
        return None  # Return None if doc_id is not found

# Example usage (this ID is not in the DataFrame, so the lookup returns None)
given_doc_id = '8670d6cc-3f4e-459c-a659-fd1911dea0e6'  # Replace this with the desired doc_id
result = get_info_by_doc_id(df, given_doc_id)

if result is not None:
    doc_id = result['doc_id']
    doc_filename = result['doc_filename']
    doc_contents = result['doc_contents']
    print(f"Document ID: {doc_id}")
    print(f"Document Filename: {doc_filename}")
    print(f"Document Contents: {doc_contents}")
else:
    print("Document ID", given_doc_id, "not found in the DataFrame")

Document ID 8670d6cc-3f4e-459c-a659-fd1911dea0e6 not found in the DataFrame
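
The lookup above handles one ID at a time. An index will usually return several IDs at once, so a batch variant may be useful. The sketch below is one possible shape for it: the function name `get_docs_by_ids` is an assumption, and the toy frame uses made-up IDs and contents rather than the real document_index.csv. IDs that are not in the frame are reported instead of raising an error.

```python
import pandas as pd

def get_docs_by_ids(dataframe, doc_ids):
    """Return the rows whose doc_id is in doc_ids, plus the IDs that were not found."""
    found = dataframe[dataframe['doc_id'].isin(doc_ids)]
    known = set(found['doc_id'])
    missing = [doc_id for doc_id in doc_ids if doc_id not in known]
    return found, missing

# Toy frame with made-up IDs and contents, only for illustration
toy_df = pd.DataFrame({
    'doc_id': ['id-a', 'id-b'],
    'doc_filename': ['a.txt', 'b.txt'],
    'doc_contents': ['alpha', 'beta'],
})

found, missing = get_docs_by_ids(toy_df, ['id-b', 'id-zzz'])
print(found['doc_filename'].tolist())  # ['b.txt']
print(missing)                         # ['id-zzz']
```

Returning the missing IDs separately lets the caller decide whether a stale index entry is worth a warning or can be ignored.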


In [6]:
df['doc_id'].values

array(['2d28937a-69ae-4fc4-8543-1b9b5498fab2',
       'ae79310b-1c53-4784-b813-08537e483b7e',
       'a97167de-a09a-4d43-92a2-efa489a237d7',
       'b58716b5-0ef5-4967-8cfa-96d71ce04f3e',
       'b78ff432-aa81-4a95-b3ec-9b4d050daaae',
       'cc1ea0f4-27e6-4c6b-abd1-2d042e6ca6f7',
       'e5e9975d-5826-4664-83a7-92115931b302'], dtype=object)