#### This notebook processes the whole parquet file

#### Instructions before running all the cells:
1. Drag the "merged_data.parquet" file into the "azure-conversational-assistant" folder (the notebook reads it and writes "merged_data_rag.parquet")
2. Install the required packages below

In [1]:
!pip install azure-cli azure-identity pandas openai fastparquet pyarrow tqdm


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [56]:
import glob
import json
import os

import pandas as pd
from azure.identity import AzureCliCredential, get_bearer_token_provider
from openai import AzureOpenAI
from tqdm import tqdm

#### Pre-processing the merged parquet file

###### 1. Remove the rows with 'No HTML Tags' in the 'remove_type' column
###### 2. Remove the rows with 'No Extracted Content' in the 'remove_type' column
###### 3. Remove the rows with 'NaN' in the 'remove_type' column
###### 4. Remove the rows with 'Multilingual' in the 'remove_type' column
###### 5. Remove the duplicated articles with specific 'id' values
###### 6. Remove the rows with 'Duplicated Content' in the 'remove_type' column, except for specific 'id' values
###### 7. Remove the articles that are too long for the LLM input window

##### Link to the article ids to keep and remove: https://docs.google.com/spreadsheets/d/1PjRx_GkdlNZpV--Ui6sLd0Hvk-3LJ6qk/edit?gid=1214237528#gid=1214237528
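
Steps 1 to 4 all filter on the same column, so they can be written as a single mask. The following is a minimal sketch, assuming 'remove_type' stores these labels as plain strings (including the literal string 'NaN', which is what the notebook compares against). The helper name `drop_flagged_rows` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Labels in 'remove_type' that mark a row for removal (steps 1 to 4)
REMOVE_TYPES = ["No HTML Tags", "No Extracted Content", "NaN", "Multilingual"]


def drop_flagged_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows whose 'remove_type' is in REMOVE_TYPES."""
    return df[~df["remove_type"].isin(REMOVE_TYPES)].copy()


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "remove_type": ["No HTML Tags", "NaN", None, "Multilingual", "Duplicated Content"],
    }
)
cleaned = drop_flagged_rows(toy)
print(cleaned["id"].tolist())
```

Rows with a missing 'remove_type' or with a label outside the list (such as 'Duplicated Content') are kept, which matches the behaviour of the step-by-step `!=` filters.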

In [3]:
df = pd.read_parquet("merged_data.parquet")

In [4]:
df[df["remove_type"] == "excel_error"].shape[0]

0

In [5]:
df.loc[df["remove_type"] == "No HTML Tags"].shape[0]

43

In [6]:
# Remove articles with 'No HTML Tags' from the 'remove_type' column
df = df.loc[df["remove_type"] != "No HTML Tags"]

In [7]:
df.loc[df["remove_type"] == "No HTML Tags"].shape[0]

0

In [8]:
df.loc[df["remove_type"] == "No Extracted Content"].shape[0]

19

In [9]:
# Remove the rows with 'No Extracted Content' from 'remove_type' column
df = df[df["remove_type"] != "No Extracted Content"]

In [10]:
df.loc[df["remove_type"] == "No Extracted Content"].shape[0]

0

In [11]:
df.loc[df["remove_type"] == "NaN"].shape[0]

11

In [12]:
# Remove the rows with 'NaN' from 'remove_type' column
df = df[df["remove_type"] != "NaN"]

In [13]:
df.loc[df["remove_type"] == "NaN"].shape[0]

0

In [14]:
df.loc[df["remove_type"] == "Multilingual"].shape[0]

6

In [15]:
# Remove 'Multilingual' from 'remove_type' column
df = df[df["remove_type"] != "Multilingual"]

In [16]:
df.loc[df["remove_type"] == "Multilingual"].shape[0]

0

In [17]:
df.shape[0]

2534

In [18]:
df.loc[df["id"].isin([1444496, 1445828, 1445798, 1444751, 1435183, 1435188, 1434614])]

Unnamed: 0,id,content_name,title,article_category_names,cover_image_url,full_url,full_url2,friendly_url,category_description,content_body,...,extracted_content_body,l1_mappings,l2_mappings,page_views,engagement_rate,bounce_rate,exit_rate,scroll_percentage,percentage_total_views,cumulative_percentage_total_views
621,1444496,Weekend Activities: 5 Ideas for Families,Outdoor Activities for Your Children,"Exercise and Fitness,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/live-healthy/ideas-fo...,www.healthhub.sg/live-healthy/ideas-for-an-act...,ideas-for-an-active-weekend,You’ve brought your kids to the National Museu...,"b'<div class=""ExternalClass196D7C5AC7594C8E8BC...",...,The lack of insufficient outdoor activity amon...,Well-being & Lifestyle,Exercise and Fitness,2492,0.948417,0.051583,0.169743,0.311898,0.00107,0.646955
1067,1445629,Sliced Fish with Bee Hoon Soup,Sliced Fish with Bee Hoon Soup,"Food and Nutrition,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/live-healthy/fish-bee...,www.healthhub.sg/live-healthy/fish-bee-hoon-soup,fish-bee-hoon-soup,Giving a healthier twist to an old favourite,"b'<div class=""ExternalClassF5C1DD3FA7E84963A88...",...,Mouthwatering sliced fish with bee hoon soup\n...,Well-being & Lifestyle,"Food, Diet and Nutrition",789,0.880255,0.119745,0.292776,0.349493,0.000339,0.913869
1215,1443608,Mee Goreng,Mee Goreng,"Food and Nutrition,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/live-healthy/mee-goreng,www.healthhub.sg/live-healthy/mee-goreng,mee-goreng,A healthier version of the Singaporean favouri...,"b'<div class=""ExternalClass60865CF1F8FA4603ABA...",...,By KK Womens and Childrens Hospital and Ms Hen...,Well-being & Lifestyle,"Food, Diet and Nutrition",528,0.907919,0.092081,0.215909,0.276042,0.000227,0.967905
2170,1434652,3 Be's To Beat Diabetes | Diabetes Hub,3 Be's To Beat Diabetes | Diabetes Hub,,https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/programmes/diabetes-hub,www.healthhub.sg/programmes/diabetes-hub,diabetes-hub,Come explore 3 easy-to-remember ways to manage...,"b'<div class=""ExternalClassFAC31D10071B445C93D...",...,3 BE’S TO BEAT DIABETES [/programmes/diabetes-...,,,40477,0.688088,0.311912,0.603059,0.363892,0.013596,0.902092
2395,1435335,Diabetes Hub: Guide to Managing Diabetes_care-...,Diabetes Hub: Guide to Managing Diabetes,,,https://www.healthhub.sg/programmes/diabetes-h...,www.healthhub.sg/programmes/diabetes-hub/care-...,care-team-resources,National Diabetes Reference Materials - An ini...,"b'<div class=""ExternalClass0F99B891157147D4A5D...",...,- \n Home [#] > Care Team Resources\nCare Tea...,,,788,0.920596,0.079404,0.185279,0.352475,0.00021,0.9948
2396,1435183,Care team resources | Diabetes Hub_care-team-r...,Care team resources | Diabetes Hub,,,https://www.healthhub.sg/programmes/diabetes-h...,www.healthhub.sg/programmes/diabetes-hub/care-...,care-team-resources,Care team resources | Diabetes Hub,"b'<div class=""ExternalClassFDB7091A35194B5EBBA...",...,_x000D_ _x000D_ _x000D_ _x000D__x000D_ _x000D...,,,788,0.920596,0.079404,0.185279,0.352475,0.00021,0.99501


In [19]:
# Remove the duplicated articles whose 'id' is in the list below
df = df[~df["id"].isin([1444496, 1445828, 1445798, 1444751, 1435183, 1435188, 1434614])]

In [20]:
df.loc[df["id"].isin([1444496, 1445828, 1445798, 1444751, 1435183, 1435188, 1434614])]

Unnamed: 0,id,content_name,title,article_category_names,cover_image_url,full_url,full_url2,friendly_url,category_description,content_body,...,extracted_content_body,l1_mappings,l2_mappings,page_views,engagement_rate,bounce_rate,exit_rate,scroll_percentage,percentage_total_views,cumulative_percentage_total_views


In [21]:
df.loc[df["id"].isin([1495949, 1445972])]

Unnamed: 0,id,content_name,title,article_category_names,cover_image_url,full_url,full_url2,friendly_url,category_description,content_body,...,extracted_content_body,l1_mappings,l2_mappings,page_views,engagement_rate,bounce_rate,exit_rate,scroll_percentage,percentage_total_views,cumulative_percentage_total_views
1420,1497409,conversations-abt-vaping,Parenting Insights: Strategies for Conversatio...,,,https://www.healthhub.sg/live-healthy/conversa...,www.healthhub.sg/live-healthy/conversations-ab...,conversations-abt-vaping,,"b'<div class=""ExternalClassB1D1BA8198604AF5897...",...,Synopsis: Learn proactive parenting strategies...,,,144,0.902985,0.097015,0.222222,0.295139,6.2e-05,0.998076
1497,1469472,parents-model-little-habits-everyday,Parents Model the Way With Little Habits Every...,,,https://www.healthhub.sg/live-healthy/parents-...,www.healthhub.sg/live-healthy/parents-model-li...,parents-model-little-habits-everyday,Pui Yi and her husband inspire their children ...,b'<h2>Physical Activity Fun Both in and Out of...,...,Physical Activity Fun Both in and Out of the S...,,,19,0.789474,0.210526,0.315789,0.394737,8e-06,0.999909


In [22]:
df.loc[df["remove_type"] == "Duplicated Content"].shape[0]

17

In [23]:
df.shape[0]

2528

In [24]:
# Remove all 'Duplicated Content' rows from the 'remove_type' column, keeping only the articles whose 'id' is in the list below
df = df[
    (df["remove_type"] != "Duplicated Content")
    | (df["id"].isin([1495949, 1445972, 1446081, 1445629, 1443608, 1445829, 1435335, 1435331, 1434652]))
]

In [25]:
df.shape[0]

2511
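
The keep-list filter above combines two conditions: a row survives if it is not marked 'Duplicated Content', or if its 'id' is explicitly kept. Here is a small sketch of that logic on toy data. The function name `drop_duplicates_except` and the ids are hypothetical.

```python
import pandas as pd


def drop_duplicates_except(df: pd.DataFrame, keep_ids: list) -> pd.DataFrame:
    """Drop 'Duplicated Content' rows unless their 'id' is in keep_ids."""
    is_duplicate = df["remove_type"] == "Duplicated Content"
    is_kept = df["id"].isin(keep_ids)
    return df[~is_duplicate | is_kept].copy()


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "id": [10, 11, 12],
        "remove_type": ["Duplicated Content", "Duplicated Content", None],
    }
)
result = drop_duplicates_except(toy, keep_ids=[11])
print(result["id"].tolist())
```

Comparing the row count before and after the filter with the number of 'Duplicated Content' rows is a quick way to check whether any id in the keep-list actually matched a duplicate.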

In [26]:
# Save the cleaned data to a new parquet file with the name 'merged_data_rag.parquet'
df.to_parquet("merged_data_rag.parquet")

In [27]:
# Count the data in the 'content_category' column
df["content_category"].value_counts()

content_category
live-healthy-articles          1148
medications                     579
diseases-and-conditions         318
program-sub-pages               285
programs                         70
medical-care-and-facilities      58
cost-and-financing               24
health-statistics                15
support-group-and-others         14
Name: count, dtype: int64

#### Read the new RAG parquet file

In [28]:
df = pd.read_parquet("merged_data_rag.parquet")

In [29]:
df.shape[0]

2511

In [30]:
df

Unnamed: 0,id,content_name,title,article_category_names,cover_image_url,full_url,full_url2,friendly_url,category_description,content_body,...,extracted_content_body,l1_mappings,l2_mappings,page_views,engagement_rate,bounce_rate,exit_rate,scroll_percentage,percentage_total_views,cumulative_percentage_total_views
0,1435040,Breast Screening Subsidies in Singapore,Breast Screening Subsidies in Singapore,"Conditions and Illnesses,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/costs-and-financi...,www.healthhub.sg/a-z/costs-and-financing/breas...,breast-cancer-screening-subsidies,Here’s all you need to know about breast cance...,"b'<div class=""ExternalClass07C58E0D957B4AA7B14...",...,Breast cancer is the number one cancer among w...,Support & Tools,Cost and Financing,19647,0.790040,0.209960,0.596020,0.411437,0.207367,0.207367
1,1435071,Marriage and Parenthood Schemes,Marriage and Parenthood Schemes,"Body Care,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/costs-and-financi...,www.healthhub.sg/a-z/costs-and-financing/marri...,marriage_parenthood_scheme,New parents and couples looking to conceive ca...,"b'<div class=""ExternalClassE1D82270F17241E4955...",...,MediSave Maternity Package\nWith the MediSave ...,,,10173,0.725810,0.274190,0.780104,0.394795,0.107372,0.314740
2,1434993,MediSave,MediSave,"Alerts and Advisories,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/costs-and-financi...,www.healthhub.sg/a-z/costs-and-financing/medisave,medisave,MediSave is the national medical savings schem...,"b'<div class=""ExternalClass67AD25F1F8B64B349E5...",...,"What is MediSave?\nMediSave, introduced in Apr...",,,5910,0.751147,0.248853,0.576481,0.352073,0.062378,0.506581
3,1435031,Hospital Bills Financial Assistance in Singapore,Hospital Bills Financial Assistance in Singapore,"Body Care,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/costs-and-financi...,www.healthhub.sg/a-z/costs-and-financing/finan...,financial-assistance-for-local-patients-in-sin...,Having trouble paying your medical bill? Here’...,"b'<div class=""ExternalClassE335708125E743FDAA3...",...,Patients or family members who have difficulty...,,,6209,0.788974,0.211026,0.686906,0.422532,0.065534,0.380273
4,1435043,Community Health Assist Scheme (CHAS) Singapore,Community Health Assist Scheme (CHAS) Singapore,"Alerts and Advisories,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/costs-and-financi...,www.healthhub.sg/a-z/costs-and-financing/chas,chas,"With a CHAS card, all Singapore citizens can r...",b'<h2>What is the Community Health Assist Sche...,...,What is the Community Health Assist Scheme (CH...,,,6057,0.775782,0.224218,0.665016,0.405977,0.063929,0.444203
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2608,1440763,Heart Failure Transitional Care Programme,Heart Failure Transitional Care Programme,"Conditions and Illnesses,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/support-groups-an...,www.healthhub.sg/a-z/support-groups-and-others...,transitional-care-programme-for-heart-failure,The team from NUHCS gives support to heart fai...,"b'<div class=""ExternalClassFC126593610D4F0587A...",...,Heart failure is the leading cause of rehospit...,,,597,0.936236,0.063764,0.063764,0.314070,0.035043,0.876966
2609,1440791,Brain and Head Injury Support Groups,Brain and Head Injury Support Groups,"Alerts and Advisories,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/support-groups-an...,www.healthhub.sg/a-z/support-groups-and-others...,2015-NNI-support-group,Read on for a list of brain injury support gro...,"b'<div class=""ExternalClass7C92735B78174928B28...",...,Brain Tumour Society (Singapore)\nThe Brain Tu...,,,587,0.949429,0.050571,0.050571,0.293441,0.034456,0.911423
2610,1440768,Ambulatory Nutrition Support,Ambulatory Nutrition Support,"Conditions and Illnesses,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/support-groups-an...,www.healthhub.sg/a-z/support-groups-and-others...,ambulatory-nutrition-support,Read about the ambulatory support benefits one...,"b'<div class=""ExternalClass3FABAC9D59A64BCAB96...",...,The Importance of Ambulatory Nutrition Support...,,,535,0.962832,0.037168,0.037168,0.295794,0.031404,0.942827
2611,1440766,LapBandits Support Group (Singapore),LapBandits Support Group (Singapore),"Body Care,",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/a-z/support-groups-an...,www.healthhub.sg/a-z/support-groups-and-others...,singapore-lapbandits-support-group,Have you just undergone bariatric surgery for ...,"b'<div class=""ExternalClassA4C749C7DB7647FBB6D...",...,About Khoo Teck Puat Hospitals LapBandits Supp...,,,491,0.964880,0.035120,0.035120,0.273931,0.028821,0.971648


#### Count the rows that contain tables to process

In [31]:
# Find the number of rows where 'has_table' is True
num_rows_with_table = df[df["has_table"]].shape[0]

print(f"Number of rows with 'has_table' == True: {num_rows_with_table}")

Number of rows with 'has_table' == True: 270


#### Flag the articles whose content length, measured with len(), exceeds the input window

In [32]:
# # Define the length limit
# length_limit = 100000

# # Set display option to show full URLs without truncation
# pd.set_option('display.max_colwidth', None)

# # Filter and collect IDs, their content lengths, and full URLs
# exceeding_articles = []
# for _, row in df.iterrows():
#     if row['has_table']:
#         content_length = len(row['content_body'])
#         if content_length > length_limit:
#             exceeding_articles.append((
#                 int(row['id']),  # Use 'id' instead of 'article_id'
#                 content_length,
#                 row['full_url']
#             ))

# # Sort the results in descending order by content length
# exceeding_articles_sorted = sorted(exceeding_articles, key=lambda x: x[1], reverse=True)

# # Print the sorted results with full URLs
# print("Article IDs, their content lengths, and full URLs exceeding the length limit:")
# for article_id, content_length, full_url in exceeding_articles_sorted:
#     print(f"Article ID: {article_id}, Length: {content_length}")
#     print(f"Full URL: {full_url}")
#     print('-' * 80)  # Separator line for readability

# # Assuming df1 is your other DataFrame
# # Extract IDs that exceeded the length limit
# exceeding_ids = [article_id for article_id, _, _ in exceeding_articles_sorted]

# # Filter out rows from df1 where 'id' is in the exceeding_ids list
# df1_filtered = df[~df['id'].isin(exceeding_ids)]

# # Optional: Save the filtered DataFrame to a new file if needed
# # output_file_filtered = './data_processing/filtered_df1.parquet'
# # df1_filtered.to_parquet(output_file_filtered)

# print("Filtered df DataFrame where IDs exceeding the length limit have been excluded.")
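
The commented-out loop above can also be expressed without `iterrows()`. This is a sketch only, assuming the same columns ('id', 'has_table', 'content_body', 'full_url') and the same 100,000 limit. The function name `flag_long_articles` and the toy data are hypothetical. Note that `len()` counts characters (or bytes for byte strings), so it is only a rough proxy for the number of tokens the model sees.

```python
import pandas as pd


def flag_long_articles(df: pd.DataFrame, length_limit: int = 100_000) -> pd.DataFrame:
    """Return id, URL and content length of table articles longer than length_limit."""
    with_table = df[df["has_table"]]
    # len() works for both str and bytes values in 'content_body'
    lengths = with_table["content_body"].map(len)
    too_long = lengths > length_limit
    flagged = with_table.loc[too_long, ["id", "full_url"]].copy()
    flagged["content_length"] = lengths[too_long]
    return flagged.sort_values("content_length", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "has_table": [True, True, False],
        "content_body": [b"x" * 50, "y" * 5, "z" * 500],
        "full_url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    }
)
flagged = flag_long_articles(toy, length_limit=10)
print(flagged)
```

The third toy row is long but has no table, so it is not flagged, mirroring the `has_table` check in the original loop.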

#### Filter out the articles that are too long for the LLM to process because of its input window limit

In [34]:
# Define the article ID to be filtered out
article_id_to_exclude = 1435223

# Filter out the article with the specified ID
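# Take an explicit copy so that adding a column later does not trigger SettingWithCopyWarning
df_filtered = df[df["id"] != article_id_to_exclude].copy()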

# Get the number of rows in the DataFrame
num_rows = df_filtered.shape[0]

# Print the number of rows
print(f"The number of rows in the DataFrame is: {num_rows}")

The number of rows in the DataFrame is: 2510


#### Pass the table content into GPT-4o and create a new parquet with the new column "processed_table_content"

In [35]:
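# Create the credential and client once and reuse them across calls to ask()
azure_credential = AzureCliCredential()
token_provider = get_bearer_token_provider(azure_credential, "https://cognitiveservices.azure.com/.default")

openai_client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint="https://apim-jisfkas7teqvm.azure-api.net",
    azure_ad_token_provider=token_provider,
)


def ask(html_content: str) -> str:
    """Send the article HTML to the chat deployment and return the extracted tables as markdown."""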

    # Updated prompt to discourage repetition
    prompt = """
    Below is the given full article HTML. Extract the **content of the tables** and their **relevant descriptions** that help understand the tables. 
    Ensure:
    - Retain only essential markdown formatting, such as:
        - **Bold** for headers or important table titles.
        - **Tables** formatted using markdown syntax (e.g., `| Header 1 | Header 2 |`).
    - Avoid unnecessary dashes, bullet points, and extraneous markdown symbols.
    - Remove all other HTML tags.
    - Keep the output concise, accurate, and under 4,000 words. If it exceeds 4,000 words, prioritize summarization.
    - Output the response as a readable markdown string.

    {html_content}
    """

    # Prepare the messages for the API call
    query_messages = [
        {
            "role": "system",
            "content": "You are an AI assistant specialized in extracting structured content from HTML.",
        },
        {"role": "user", "content": prompt.format(html_content=html_content)},
    ]

    response = openai_client.chat.completions.create(
        messages=query_messages,
        model="chat",
        temperature=0.0,
        max_tokens=4096,
        n=1,
        seed=1234,
    )

    return response.choices[0].message.content


# Extract and process HTML tables


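def process_html_tables(row):
    """Return the model-processed table content for rows that contain a table."""
    if row["has_table"]:
        content = row["content_body"]
        # 'content_body' may be stored as bytes; decode it so the prompt receives plain HTML
        if isinstance(content, bytes):
            content = content.decode("utf-8")
        return ask(content)
    return None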


# Apply processing to the DataFrame with tqdm for progress tracking
tqdm.pandas()  # Enable progress bar for DataFrame operations
df_filtered["processed_table_content"] = df_filtered.progress_apply(process_html_tables, axis=1)

# Save to a new Parquet file
output_file = "./data_processing/new_index.parquet"
df_filtered.to_parquet(output_file)
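
Because each call to `ask()` reaches the Azure OpenAI endpoint, it can help to try the column-building logic offline first. The sketch below assumes the same 'has_table' and 'content_body' columns. `fake_ask` and `add_processed_tables` are hypothetical names, and `fake_ask` is only a stand-in that does not call any model.

```python
import pandas as pd


def fake_ask(html_content: str) -> str:
    """Offline stand-in for ask(); it only reports the length of the input."""
    return f"processed {len(html_content)} characters"


def add_processed_tables(df: pd.DataFrame, ask_fn) -> pd.DataFrame:
    """Return a copy of df with 'processed_table_content' filled for table rows only."""
    out = df.copy()  # work on a copy so the caller's DataFrame is left unchanged
    out["processed_table_content"] = None
    for idx in out.index[out["has_table"].to_numpy()]:
        content = out.at[idx, "content_body"]
        # Decode bytes so the model receives plain HTML rather than a bytes repr
        if isinstance(content, bytes):
            content = content.decode("utf-8")
        out.at[idx, "processed_table_content"] = ask_fn(content)
    return out


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "has_table": [True, False],
        "content_body": [b"<table></table>", "<p>no table</p>"],
    }
)
result = add_processed_tables(toy, fake_ask)
print(result[["id", "processed_table_content"]])
```

Passing the real `ask` in place of `fake_ask` would give the same flow as the cell above, with the model call made only for rows that contain a table.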



#### Read the post-processed parquet file

In [36]:
df1 = pd.read_parquet("data_processing/new_index.parquet")

#### Find out the length of the post-processed "processed_table_content" 

In [50]:
# Filter rows where 'has_table' is True
df_with_table = df1[df1["has_table"]].copy()  # copy so the new column is not set on a slice

# Add a new column for the character count of 'processed_table_content'
df_with_table["char_count"] = df_with_table["processed_table_content"].apply(lambda x: len(x) if pd.notnull(x) else 0)

# Sort by 'char_count' in descending order and get the top k rows
top_k_longest = df_with_table.sort_values(by="char_count", ascending=False).head(30)

# Display the result
print(top_k_longest[["id", "char_count", "friendly_url"]])

# Count the number of articles with 'char_count' over 5000
articles_over_5000 = df_with_table[df_with_table["char_count"] > 5000].shape[0]

# Display the result
print(f"Number of articles with 'processed_table_content' length over 5000: {articles_over_5000}")

           id  char_count                                       friendly_url
950   1445538       17247                             travellersurvivalguide
378   1444820       13584                     recommended_dietary_allowances
773   1442929       13557                   growing-kid-raising-healthy-kids
1073  1445346       10552                          manage_weight_healthy_way
2175  1434716       10339                                              IQuit
553   1445172       10332            nutrition-for-toddlers-25-36-months-old
1705  1440453        9875                        Insulin-Injection-Technique
372   1446016        9460  pregnancy-nutrition-during-pregnancy-eating-ri...
357   1437940        9422              admissions-and-outpatient-attendances
1545  1437970        8562                                       respite_care
724   1445669        8334                                babysfirstyearbrain
1519  1445673        8164     A Healthy Food Foundation - for Kids and Teens
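
The character count can also be computed with the vectorised string accessor instead of a lambda. This is a minimal sketch on toy data. The helper name `longest_outputs` is hypothetical, and missing values are counted as length 0, as in the cell above.

```python
import pandas as pd


def longest_outputs(df: pd.DataFrame, k: int = 30) -> pd.DataFrame:
    """Return the k rows with the longest 'processed_table_content'."""
    out = df[["id"]].copy()
    # .str.len() yields NaN for missing values, so fill those with 0
    out["char_count"] = df["processed_table_content"].str.len().fillna(0).astype(int)
    return out.nlargest(k, "char_count")


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "processed_table_content": ["| a | b |", None, "| a |"],
    }
)
top = longest_outputs(toy, k=2)
print(top)
```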



#### Inspect the post-processed "processed_table_content" of the top k articles

In [48]:
# # Step 2: Filter the DataFrame by the specific IDs
# filtered_df = df1[df1['id'].isin([1439448, 1445517])]

# # Step 3: Extract the 'processed_table_content' column for the selected rows
# content_1439448 = filtered_df[filtered_df['id'] == 1439448]['processed_table_content'].values[0]
# content_1445517 = filtered_df[filtered_df['id'] == 1445517]['processed_table_content'].values[0]

# # Step 4: Write the content to a .txt file
# with open("processed_content_1439448.txt", "w") as file1:
#     file1.write(content_1439448)

# with open("processed_content_1445517.txt", "w") as file2:
#     file2.write(content_1445517)

# print("Files saved successfully.")

Files saved successfully.


#### Manually curate the content to ensure the quality of the data for ingestion into the search index

In [49]:
# # Step 2: Define the new content for each ID
# new_content_1439448 = """
# **Week 3 of the pack:**

# | No missed tablets in the last 7 days | Take the missed tablet as soon as remembered, even if it means taking 2 tablets at the same time. Continue to take your tablets at your usual time and start the next pack right away without the 7-day tablet free period or 7-day white (inactive) tablets i.e. no gap should be left between packs. Your menses may not come until the next pack is finished, but there is no need to worry. However, if your menses do not occur after the next pack is finished, you should take a pregnancy test to make sure you are not pregnant. OR Stop taking medication from the current pack for 7 days (7-day tablet-free period). A withdrawal bleed (menses) usually occurs and then start a next pack after 7 days. |
# """

# new_content_1445517 = """
# **Suggestions for Overcoming Physical Activity Barriers**

# | Lack of time | Monitor your daily activities for one week. Identify available time slots where you can get at least 10 minutes of aerobic type physical activity. Add physical activity to your daily routine. Walk or ride your bicycle to work or to the shops, and organise your daily activities around physical activity. E.g. walk the dog, exercise while you watch TV, park farther away from your destination. Select activities requiring minimal time, such as walking, jogging or stair climbing. |
# """

# # Step 3: Update the 'processed_table_content' for the given IDs
# df1.loc[df1['id'] == 1439448, 'processed_table_content'] = new_content_1439448
# df1.loc[df1['id'] == 1445517, 'processed_table_content'] = new_content_1445517

# # Step 4: Save the updated DataFrame to a new Parquet file
# df1.to_parquet("data_processing/new_index.parquet")

# print("Updated content and saved to Parquet file successfully.")

Updated content and saved to Parquet file successfully.
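
When more than a couple of articles need manual curation, the per-id assignments can be driven from a dictionary. The sketch below assumes the same 'id' and 'processed_table_content' columns. The function name `apply_curated_content` and the toy values are hypothetical. It also raises an error for ids that are not present, since a mistyped id would otherwise update nothing without any warning.

```python
import pandas as pd


def apply_curated_content(df: pd.DataFrame, curated: dict) -> pd.DataFrame:
    """Return a copy of df with 'processed_table_content' replaced for the given ids."""
    missing = set(curated) - set(df["id"].tolist())
    if missing:
        raise KeyError(f"ids not found in DataFrame: {sorted(missing)}")
    out = df.copy()
    for article_id, content in curated.items():
        out.loc[out["id"] == article_id, "processed_table_content"] = content
    return out


# Toy data for illustration only
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "processed_table_content": ["old a", "old b"],
    }
)
updated = apply_curated_content(toy, {2: "new b"})
print(updated)
```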


#### Export the "content_body" to .txt files to verify extraction

In [51]:
# Filter the DataFrame for rows where 'has_table' is True
filtered_df1 = df1[df1["has_table"]]

# Ensure the raw_content_body directory exists
raw_content_dir = "./data_processing/raw_content_body"
os.makedirs(raw_content_dir, exist_ok=True)

# Process each article ID
for article_id in filtered_df1["id"].unique():
    # Filter the DataFrame for the current article ID
    article_df = filtered_df1[filtered_df1["id"] == article_id]

    # Extract the 'content_body' column
    processed_content = article_df["content_body"].tolist()

    # Define the output file path
    output_txt_file = f"{raw_content_dir}/raw_article_{article_id}.txt"

    # Write the content to a text file
    with open(output_txt_file, "w", encoding="utf-8") as file:
        for content in processed_content:
            # Check if content is in bytes and decode if necessary
            if isinstance(content, bytes):
                content = content.decode("utf-8")
            file.write(content + "\n")  # Write each entry on a new line

    print(f"Raw content body for article {article_id} saved to {output_txt_file}")

Raw content body for article 1435040 saved to ./data_processing/raw_content_body/raw_article_1435040.txt
Raw content body for article 1435071 saved to ./data_processing/raw_content_body/raw_article_1435071.txt
Raw content body for article 1435043 saved to ./data_processing/raw_content_body/raw_article_1435043.txt
Raw content body for article 1435005 saved to ./data_processing/raw_content_body/raw_article_1435005.txt
Raw content body for article 1434994 saved to ./data_processing/raw_content_body/raw_article_1434994.txt
Raw content body for article 1435059 saved to ./data_processing/raw_content_body/raw_article_1435059.txt
Raw content body for article 1435029 saved to ./data_processing/raw_content_body/raw_article_1435029.txt
Raw content body for article 1434998 saved to ./data_processing/raw_content_body/raw_article_1434998.txt
Raw content body for article 1437795 saved to ./data_processing/raw_content_body/raw_article_1437795.txt
Raw content body for article 1437742 saved to ./data_pr

#### Export the "processed_table_content" to .txt files to verify extraction

In [52]:
# Filter the DataFrame for rows where 'has_table' is True
filtered_df = df1[df1["has_table"]]

# Ensure the processed_table_content directory exists
processed_content_dir = "./data_processing/processed_table_content"
os.makedirs(processed_content_dir, exist_ok=True)

# Loop through each unique article ID in the filtered DataFrame
for article_id in filtered_df["id"].unique():
    # Filter the DataFrame for the specific article ID
    article_df = filtered_df[filtered_df["id"] == article_id]

    if not article_df.empty:
        # Extract the 'processed_table_content' column
        processed_content = article_df["processed_table_content"].astype(str).tolist()

        # Define the output file path
        output_txt_file = f"{processed_content_dir}/processed_table_content_{article_id}.txt"

        # Write the content to a text file
        with open(output_txt_file, "w", encoding="utf-8") as file:
            for content in processed_content:
                file.write(content + "\n")  # Write each entry on a new line

        print(f"Processed table content for article {article_id} saved to {output_txt_file}")

Processed table content for article 1435040 saved to ./data_processing/processed_table_content/processed_table_content_1435040.txt
Processed table content for article 1435071 saved to ./data_processing/processed_table_content/processed_table_content_1435071.txt
Processed table content for article 1435043 saved to ./data_processing/processed_table_content/processed_table_content_1435043.txt
Processed table content for article 1435005 saved to ./data_processing/processed_table_content/processed_table_content_1435005.txt
Processed table content for article 1434994 saved to ./data_processing/processed_table_content/processed_table_content_1434994.txt
Processed table content for article 1435059 saved to ./data_processing/processed_table_content/processed_table_content_1435059.txt
Processed table content for article 1435029 saved to ./data_processing/processed_table_content/processed_table_content_1435029.txt
Processed table content for article 1434998 saved to ./data_processing/processed_ta

#### Check that processed_table_content is added as the last column of the new index

In [53]:
# Print all columns of the DataFrame
print("Columns in DataFrame:")
for column in df1.columns:
    print(column)

Columns in DataFrame:
id
content_name
title
article_category_names
cover_image_url
full_url
full_url2
friendly_url
category_description
content_body
keywords
feature_title
pr_name
alternate_image_text
date_modified
number_of_views
last_month_view_count
last_two_months_view
content_category
to_remove
remove_type
has_table
has_image
related_sections
extracted_tables
extracted_raw_html_tables
extracted_links
extracted_headers
extracted_images
extracted_content_body
l1_mappings
l2_mappings
page_views
engagement_rate
bounce_rate
exit_rate
scroll_percentage
percentage_total_views
cumulative_percentage_total_views
processed_table_content


#### Specify the columns to extract for article content and tables

In [54]:
# columns_to_extract1 are the columns for the article content
# columns_to_extract2 are the columns for the table content
columns_to_extract1 = [
    "id",
    "title",
    "cover_image_url",
    "full_url",
    "extracted_content_body",
    "content_category",
    "category_description",
    "pr_name",
    "date_modified",
    "has_table",  # Add this column to filter rows with tables after extracting the content
]

columns_to_extract2 = [
    "id",
    "title",
    "cover_image_url",
    "full_url",
    "processed_table_content",
    "content_category",
    "category_description",
    "pr_name",
    "date_modified",
]

#### Export every article in the parquet file to JSON

In [55]:
# Specify the directory
output_directory = "./data_processing/processed_articles"

# Ensure the directory exists
os.makedirs(output_directory, exist_ok=True)

# Loop through each row in the DataFrame using iterrows()
for index, row in df1.iterrows():
    # Extract the id of the row
    row_id = row["id"]

    # Extract the specified columns for the given row
    extracted_row1 = row[columns_to_extract1]
    has_table = extracted_row1["has_table"]  # Check if the row has a table

    # Convert the extracted row to a dictionary and remove 'has_table'
    extracted_data1 = extracted_row1.drop("has_table").to_dict()

    # Rename the key to "content"
    extracted_data1["content"] = str(extracted_data1.pop("extracted_content_body"))

    # Convert the id field to string and append a suffix
    extracted_data1["id"] = str(row_id) + "_content"

    # Wrap the dictionary in a list to match the desired format
    extracted_data_list1 = [extracted_data1]

    # Create a unique filename using the row ID
    output_filename1 = f"{row_id}_content.json"  # For content body

    # Define the output path
    output_path1 = os.path.join(output_directory, output_filename1)

    # Export the extracted content body to a JSON file
    with open(output_path1, "w") as json_file:
        json.dump(extracted_data_list1, json_file, indent=4)

    # If there is a table, extract and save it as well
    if has_table:
        extracted_row2 = row[columns_to_extract2]
        extracted_data2 = extracted_row2.to_dict()
        extracted_data2["content"] = str(extracted_data2.pop("processed_table_content"))

        # Convert the id field to string and append a suffix
        extracted_data2["id"] = str(row_id) + "_table"

        # Wrap the dictionary in a list to match the desired format
        extracted_data_list2 = [extracted_data2]

        # Create a unique filename for the table content
        output_filename2 = f"{row_id}_table.json"  # For processed table content
        output_path2 = os.path.join(output_directory, output_filename2)

        # Export the processed table content to a JSON file
        with open(output_path2, "w") as json_file:
            json.dump(extracted_data_list2, json_file, indent=4)

        # Confirm the files were saved
        print(f"JSON files saved at: {output_path1} and {output_path2}")
    else:
        # Confirm the content body file was saved
        print(f"JSON file saved at: {output_path1}")

JSON files saved at: ./data_processing/processed_articles/1435040_content.json and ./data_processing/processed_articles/1435040_table.json
JSON files saved at: ./data_processing/processed_articles/1435071_content.json and ./data_processing/processed_articles/1435071_table.json
JSON file saved at: ./data_processing/processed_articles/1434993_content.json
JSON file saved at: ./data_processing/processed_articles/1435031_content.json
JSON files saved at: ./data_processing/processed_articles/1435043_content.json and ./data_processing/processed_articles/1435043_table.json
JSON files saved at: ./data_processing/processed_articles/1435005_content.json and ./data_processing/processed_articles/1435005_table.json
JSON files saved at: ./data_processing/processed_articles/1434994_content.json and ./data_processing/processed_articles/1434994_table.json
JSON file saved at: ./data_processing/processed_articles/1435035_content.json
JSON file saved at: ./data_processing/processed_articles/1435064_conten
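
The content and table branches of the export loop build their records in the same way: select columns, rename one of them to "content", and suffix the id. A small sketch of that shared step follows. The helper name `build_record` and the toy row are hypothetical, and the sketch prints the JSON instead of writing files.

```python
import json

import pandas as pd


def build_record(row: pd.Series, columns: list, content_column: str, suffix: str) -> list:
    """Build the single-item list that is written to '<id>_<suffix>.json'."""
    record = row[columns].to_dict()
    # Rename the chosen column to 'content' and store it as a string
    record["content"] = str(record.pop(content_column))
    # Make the id unique per document type, e.g. '1435040_table'
    record["id"] = f"{row['id']}_{suffix}"
    return [record]


# Toy row for illustration only
toy_row = pd.Series(
    {
        "id": 1435040,
        "title": "T",
        "processed_table_content": "| a |",
        "has_table": True,
    }
)
table_record = build_record(
    toy_row, ["id", "title", "processed_table_content"], "processed_table_content", "table"
)
print(json.dumps(table_record, indent=4))
```

Calling the same helper with the content columns (minus 'has_table'), "extracted_content_body" and the suffix "content" would produce the other record type.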

#### Remove all .json files from a directory

In [45]:
def remove_json_files(directory):
    """
    Removes all .json files from the specified directory.

    Parameters:
    directory (str): The path to the directory where .json files are to be removed.
    """
    # Create the search pattern for .json files
    search_pattern = os.path.join(directory, "*.json")

    # Get a list of all .json files in the directory
    json_files = glob.glob(search_pattern)

    # Remove each .json file
    for file in json_files:
        os.remove(file)
        print(f"Removed file: {file}")


# Specify the directory where .json files are located
output_directory = "./data_processing/processed_articles"

# Call the function to remove .json files
remove_json_files(output_directory)
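
Since the cell above deletes files immediately, a dry-run option can make it safer to rerun. This is a sketch using `pathlib`, demonstrated on a temporary directory. The function name `clear_json_files` and its `dry_run` parameter are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def clear_json_files(directory: str, dry_run: bool = True) -> list:
    """Return the names of the .json files in directory, deleting them unless dry_run."""
    matched = []
    for path in sorted(Path(directory).glob("*.json")):
        if not dry_run:
            path.unlink()
        matched.append(path.name)
    return matched


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.json").write_text("[]")
    (Path(tmp) / "b.txt").write_text("keep")
    planned = clear_json_files(tmp)  # dry run: nothing is deleted
    still_there = (Path(tmp) / "a.json").exists()
    removed = clear_json_files(tmp, dry_run=False)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(planned, still_there, removed, remaining)
```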