The purpose of this notebook is to retrieve the URLs of the program sub-pages, which are used for question generation in RAG evaluation.

This is the first notebook. Afterwards, continue with `group_program_subpages.ipynb`.

After running the cells under `Setup`, check that a folder named `program-sub-pages` has been created in this `notebooks` folder. The Excel files created later on are stored there.

<hr>

### Setup

Enable output capturing and load the `kedro.ipython` extension

In [3]:
%%capture
%load_ext kedro.ipython

Imports

In [1]:
import os
import pandas as pd

In [26]:
folder_path = 'program-sub-pages'
if not os.path.exists(folder_path):
    # If it doesn't exist, create the folder
    os.makedirs(folder_path)
    print(f"Folder created: {folder_path}")
else:
    print(f"Folder already exists: {folder_path}")

Folder already exists: program-sub-pages
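
The check-then-create step above can also be written with `exist_ok=True`, which makes `os.makedirs` do nothing when the folder is already there. A minimal sketch; the temporary parent directory is an assumption for illustration, used so the example does not touch the real `notebooks` folder.

```python
import os
import tempfile

# A throwaway parent directory stands in for the `notebooks` folder
# (assumption for illustration only).
with tempfile.TemporaryDirectory() as parent:
    folder_path = os.path.join(parent, "program-sub-pages")

    # exist_ok=True: create the folder if missing, do nothing if it exists.
    os.makedirs(folder_path, exist_ok=True)
    os.makedirs(folder_path, exist_ok=True)  # second call does not raise

    created = os.path.isdir(folder_path)
    print(f"Folder ready: {created}")
```

This drops the explicit `os.path.exists` branch, at the cost of no longer reporting whether the folder was newly created.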


<hr>

### Load Data

List all the datasets registered in the Kedro DataCatalog

In [4]:
# ruff: noqa: F821
print("Total datasets:", len(catalog.list()), "\nName of datasets:")
catalog.list()  # list all available datasets

Total datasets: 185 
Name of datasets:



[
    'all_contents',
    'missing_contents',
    'all_contents_standardized',
    'all_contents_added',
    'all_contents_extracted',
    'all_extracted_text',
    'all_contents_mapped',
    'merged_data',
    'google_analytics_data',
    'raw_word_counts',
    'log_word_counts',
    'flag_for_removal_by_type',
    'recipes_data',
    'filtered_data_with_keywords',
    'embeddings_data',
    'weighted_embeddings',
    'ground_truth_data',
    'first_level_pred_cluster',
    'first_level_metrics',
    'first_level_cluster_size',
    'first_level_clustered_nodes',
    'first_level_unclustered_nodes',
    'final_cluster_size',
    'final_predicted_cluster',
    'final_clustered_nodes',
    'final_unclustered_nodes',
    'final_metrics',
   

Retrieve <mark><b>merged_data</b></mark> dataset [Merged collection of Health Hub articles across different content categories]

In [5]:
# ruff: noqa: F821
merged_data_og = catalog.load("merged_data")

Optional: save all information on the program sub-pages to an Excel sheet, before the data cleaning/removal later on in this notebook

In [7]:
# relevant_categories = ['program-sub-pages', 'programs']
# df_programSubPages_all = merged_data_og[merged_data_og["content_category"].isin(relevant_categories)]

# writer = pd.ExcelWriter(os.path.join(folder_path, "all_programSubPages.xlsx"), engine='xlsxwriter')

# names = ['cleaned_program_subpages_url']
# dataframes = [df_programSubPages_all]

# for name, frame in zip(names, dataframes):
#    frame.to_excel(writer, sheet_name=name, index=False)

# writer.close()

<hr>

#### Initial Data Cleaning (with 'remove_type')

Remove the rows whose `remove_type` is one of: "No Extracted Content", "NaN", "No relevant content and mainly links", "Table of Contents", "No HTML Tags"

In [9]:
merged_data_og['remove_type'].value_counts()


remove_type
Below Word Count                        48
Recipe                                  48
No HTML Tags                            43
No Extracted Content                    29
Duplicated Content                      18
Infographic                             14
NaN                                     11
Multilingual                             6
No relevant content and mainly links     6
Table of Contents                        4
Services Directory                       3
Duplicated URL                           3
Name: count, dtype: int64

In [7]:
merged_data_og.columns


Index(['id', 'content_name', 'title', 'article_category_names',
       'cover_image_url', 'full_url', 'full_url2', 'friendly_url',
       'category_description', 'content_body', 'keywords', 'feature_title',
       'pr_name', 'alternate_image_text', 'date_modified', 'number_of_views',
       'last_month_view_count', 'last_two_months_view', 'content_category',
       'to_remove', 'remove_type', 'has_table', 'has_image',
       'related_sections', 'extracted_tables', 'extracted_raw_html_tables',
       'extracted_links', 'extracted_headers', 'extracted_images',
       'extracted_content_body', 'l1_mappings', 'l2_mappings', 'page_views',
       'engagement_rate'

In [11]:
values_to_drop = ["No Extracted Content", "NaN", "No relevant content and mainly links", "Table of Contents", "No HTML Tags"]

# Drop rows where 'remove_type' is in the specified list
merged_data = merged_data_og[~merged_data_og['remove_type'].isin(values_to_drop)]
# Optionally, you can also drop NaN values if required
merged_data['content_category'].value_counts()


content_category
live-healthy-articles          1142
medications                     579
diseases-and-conditions         324
program-sub-pages               303
programs                         61
medical-care-and-facilities      58
cost-and-financing               24
health-statistics                15
support-group-and-others         14
Name: count, dtype: int64
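
One caveat on the `remove_type` filter in this cell: `isin` compares against the literal string `"NaN"`, which is not the same thing as a missing value. The `value_counts()` output lists `NaN` as a category, which suggests the column stores it as a string, since real missing values are excluded from `value_counts()` by default. The sketch below uses a made-up frame (`toy` and its values are hypothetical, not taken from `merged_data`) to show the difference and how `isna()` could be combined with the mask if true missing values should be dropped as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one literal "NaN" string and one real missing value.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "remove_type": ["Recipe", "NaN", np.nan, "No HTML Tags"],
    }
)

values_to_drop = ["NaN", "No HTML Tags"]

# isin() matches the string "NaN" but not the real missing value (id 3).
in_list = toy["remove_type"].isin(values_to_drop)
kept_strings_only = toy[~in_list]

# To also drop real missing values, combine the mask with isna().
kept_strict = toy[~(in_list | toy["remove_type"].isna())]

print(kept_strings_only["id"].tolist())
print(kept_strict["id"].tolist())
```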

<hr>

#### Data Understanding

Check information of merged_data:

- No. of rows: 2520 (2613 before the initial cleaning, which removed 93 rows)
- No. of columns: 39 (37 excluding "to_remove" & "remove_type")
- Domain: Health Hub Articles

In [12]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2520 entries, 0 to 2612
Data columns (total 39 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   id                                 2520 non-null   int64  
 1   content_name                       2520 non-null   object 
 2   title                              2520 non-null   object 
 3   article_category_names             1374 non-null   object 
 4   cover_image_url                    1598 non-null   object 
 5   full_url                           2519 non-null   object 
 6   full_url2                          2520 non-null   object 
 7   friendly_url                       2520 non-null   object 
 8   category_description               2502 non-null   object 
 9   content_body                       2520 non-null   object 
 10  keywords                           1378 non-null   object 
 11  feature_title                      761 non-null    object 
 1

Check column names
- More information on the columns can be found in the README, under the dataset section

In [13]:
merged_data.columns


Index(['id', 'content_name', 'title', 'article_category_names',
       'cover_image_url', 'full_url', 'full_url2', 'friendly_url',
       'category_description', 'content_body', 'keywords', 'feature_title',
       'pr_name', 'alternate_image_text', 'date_modified', 'number_of_views',
       'last_month_view_count', 'last_two_months_view', 'content_category',
       'to_remove', 'remove_type', 'has_table', 'has_image',
       'related_sections', 'extracted_tables', 'extracted_raw_html_tables',
       'extracted_links', 'extracted_headers', 'extracted_images',
       'extracted_content_body', 'l1_mappings', 'l2_mappings', 'page_views',
       'engagement_rate'

Check the counts for the `to_remove` flag. The per-category counts are in the `content_category` output above (303 rows for program-sub-pages after the initial cleaning).

In [45]:
merged_data["to_remove"].value_counts()


to_remove
False    2380
True      140
Name: count, dtype: int64

<hr>

### Prep Dataset for relevant categories: programs & program-sub-pages

In [15]:
relevant_categories = ["programs", "program-sub-pages"]

df_programsubpages = merged_data[merged_data["content_category"].isin(relevant_categories)]
display(df_programsubpages)

Unnamed: 0,id,content_name,title,article_category_names,cover_image_url,full_url,full_url2,friendly_url,category_description,content_body,...,extracted_content_body,l1_mappings,l2_mappings,page_views,engagement_rate,bounce_rate,exit_rate,scroll_percentage,percentage_total_views,cumulative_percentage_total_views
2160,1434688,Screen for Life - National Health Screening Pr...,Screen for Life - National Health Screening Pr...,"Body Care,Conditions and Illnesses,Exercise an...",https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/programmes/Screen_for...,www.healthhub.sg/programmes/Screen_for_Life,Screen_for_Life,Subsidised health screening for Singapore Citi...,"b'<div class=""ExternalClassAF91117F1CED4D888CC...",...,We note that some users have experienced issue...,,,1069166,0.511571,0.488429,0.605897,0.310903,0.359128,0.359128
2161,1434612,"Live Well, Age Well Programme","Live Well, Age Well Programme",,https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/programmes/AAP,www.healthhub.sg/programmes/AAP,AAP,Stay healthy and active with FREE activities n...,"b'<div class=""ExternalClassBF0514EBE8F24C4A9A5...",...,"By 7 February 2023, all users must perform a o...",,,396253,0.486995,0.513005,0.771712,0.334680,0.133100,0.684081
2162,1434660,The Healthy 365 App,The Healthy 365 App,,https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/programmes/healthyliving,www.healthhub.sg/programmes/healthyliving,healthyliving,Start moving towards your health goals one day...,"b'<div class=""ExternalClassA8054F927A104E438A5...",...,[https://go.gov.sg/useh365] [https://go.gov.s...,,,571167,0.456685,0.543315,0.681652,0.312005,0.191853,0.550981
2163,1434580,"Eat, Drink, Shop Healthy Challenge","Eat, Drink, Shop Healthy Challenge",,https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/programmes/eat-drink-...,www.healthhub.sg/programmes/eat-drink-shop-hea...,eat-drink-shop-healthy-challenge,Enter the world of sure wins when you choose h...,"b'<div class=""hpb-container nutrition"" data-pa...",...,Menu [#clear]\n\nMenu [#clear]\n\nMenu [...,,,91000,0.749137,0.250863,0.771615,0.460604,0.030567,0.782880
2164,1434657,Great things start when you MOVE IT!,Great things start when you MOVE IT!,,https://ch-api.healthhub.sg/api/public/content...,https://www.healthhub.sg/programmes/LetsMoveIt,www.healthhub.sg/programmes/LetsMoveIt,LetsMoveIt,Physical activity is important to maintain a h...,"b'<div class=""hpb-container"" data-page-id=""hom...",...,[https://go.gov.sg/h365-moveit]\n\nGet Moving ...,,,101407,0.649578,0.350422,0.604731,0.354376,0.034062,0.752314
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2587,1435127,Types of diabetes | Diabetes Hub_types-of-diab...,Types of diabetes | Diabetes Hub,,,https://www.healthhub.sg/programmes/diabetes-h...,www.healthhub.sg/programmes/diabetes-hub-v2/ty...,types-of-diabetes,Did you know there's more than 2 types of diab...,"b'<div class=""ExternalClass17FC6DDA290D4A3AA72...",...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,,,0,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000
2588,1435167,Hypoglycaemia | Diabetes Hub_hypoglycaemia_Level1,Hypoglycaemia | Diabetes Hub,,,https://www.healthhub.sg/programmes/diabetes-h...,www.healthhub.sg/programmes/diabetes-hub-v2/hy...,hypoglycaemia,Find out what causes your blood sugar level to...,"b'<div class=""ExternalClass990CCF03E6B64C28A16...",...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,,,0,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000
2589,1468676,sya-test-1,Program Sub level 1,,,https://www.healthhub.sg/programmes/1test/sya-...,www.healthhub.sg/programmes/1test/sya-test-1/,sya-test-1,test1,"b'<p><a href=""""><img alt="""" src="""" style=""heig...",...,HealthHub\nRelaxation-ExerciseHealthhub\nList ...,,,0,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000
2593,1435215,If you have coronary heart disease | Diabetes ...,If you have coronary heart disease | Diabetes Hub,,,https://www.healthhub.sg/programmes/diabetes-h...,www.healthhub.sg/programmes/diabetes-hub-v2/if...,if-you-have-coronary-heart-disease,"Exercising is good for your heart, but care mu...","b'<div class=""ExternalClass96F5FEC5915C4EAC94A...",...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,,,0,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000


Check the number of articles for programs + program-sub-pages

In [17]:
len(df_programsubpages)

364

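Check information of the dataset for programs and program-sub-pages

It is important that the URLs are all non-null. Note that the output below reports 363 non-null `full_url` values for 364 rows, so one URL is missing.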

In [18]:
df_programsubpages.info()

<class 'pandas.core.frame.DataFrame'>
Index: 364 entries, 2160 to 2595
Data columns (total 39 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   id                                 364 non-null    int64  
 1   content_name                       364 non-null    object 
 2   title                              364 non-null    object 
 3   article_category_names             24 non-null     object 
 4   cover_image_url                    48 non-null     object 
 5   full_url                           363 non-null    object 
 6   full_url2                          364 non-null    object 
 7   friendly_url                       364 non-null    object 
 8   category_description               355 non-null    object 
 9   content_body                       364 non-null    object 
 10  keywords                           48 non-null     object 
 11  feature_title                      16 non-null     object 


Example of urls for the content categories

In [19]:
print(df_programsubpages["full_url"].iloc[0])

https://www.healthhub.sg/programmes/Screen_for_Life


<hr>

Check unique content names/article names

In [20]:
unique_contentName = df_programsubpages.content_name.unique()
print(len(unique_contentName))
unique_contentName

363



array(['Screen for Life - National Health Screening Programme',
       'Live Well, Age Well Programme', 'The Healthy 365 App',
       'Eat, Drink, Shop Healthy Challenge',
       'Great things start when you MOVE IT!',
       'Healthier SG | Enrol via HealthHub', 'Parent Hub',
       'National Steps Challenge™',
       'Nutrition Facts And Information On Eating Healthy',
       'STAY ONE STEP AHEAD WITH VACCINATIONS',
       "3 Be's To Beat Diabetes | Diabetes Hub",
       'Diabetes Hub: Guide to Managing Diabetes',
       'Youth Preventive Dental Service', 'HPB Rewards Programme',
       'Let’s BEAT Diabetes', 'I Quit Programme', 'HIV and AIDS',
       'Children’s Health E-Services', 'MindSG', 'Korang OK?',
       'Sexually Transmitted Infections',
       'Misuse of 

<hr>
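
The unique-name check (363 unique names for 364 rows) implies that one `content_name` occurs twice. A sketch for surfacing repeated names with `duplicated(keep=False)`, on hypothetical data (the ids and the choice of repeated name are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df_programsubpages with one repeated name.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "content_name": ["Parent Hub", "MindSG", "Parent Hub", "I Quit Programme"],
    }
)

# keep=False marks every occurrence of a repeated name, not just the later ones.
repeated = toy[toy["content_name"].duplicated(keep=False)]

n_unique = toy["content_name"].nunique()
print(repeated["id"].tolist())
print(n_unique, "unique names in", len(toy), "rows")
```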

### Prep dataframe of URLs for export

Name of relevant dataframes:
- df_programsubpages

<mark><b>Columns to include for Question Generation from URL:</b></mark>
1. id: article ID
2. content_name: name of article stored as metadata
3. title: article title [same as 2.]
4. article_category_names: categories that articles can belong to
5. full_url: article URL
6. friendly_url: file path with reference to the content category
7. category_description: brief summary of article
8. content_body: HTML element of entire article body
9. keywords: article keywords
10. feature_title: feature title of articles
11. content_category: content category that article matches on HH --> <mark><b>"programs" + "program-sub-pages"</b></mark>
12. extracted_content_body: text version of the HTML content
13. remove_type: type of removal flag assigned to the article

<hr>

#### Creation of 2 DataFrames for Program-sub-pages into Excel Workbook

1. df_programsubpages_cols: with all necessary columns
2. df_programsubpages_url: with only identifier, URL, extracted text and content category columns

In [46]:
# Check list of columns
merged_data.columns


Index(['id', 'content_name', 'title', 'article_category_names',
       'cover_image_url', 'full_url', 'full_url2', 'friendly_url',
       'category_description', 'content_body', 'keywords', 'feature_title',
       'pr_name', 'alternate_image_text', 'date_modified', 'number_of_views',
       'last_month_view_count', 'last_two_months_view', 'content_category',
       'to_remove', 'remove_type', 'has_table', 'has_image',
       'related_sections', 'extracted_tables', 'extracted_raw_html_tables',
       'extracted_links', 'extracted_headers', 'extracted_images',
       'extracted_content_body', 'l1_mappings', 'l2_mappings', 'page_views',
       'engagement_rate'

In [47]:
# Prep columns to be included
relevantAllCols = ['id', 'content_name', 'title', 'full_url', 'friendly_url', 'category_description', 'extracted_content_body', 'content_body', 'keywords', 'content_category', 'remove_type']
urlCols = ['id', 'title', 'full_url', 'extracted_content_body', 'content_category']

In [48]:
# Create the dataframes
df_programsubpages_cols = df_programsubpages[relevantAllCols]
df_programsubpages_url = df_programsubpages[urlCols]

In [49]:
df_programsubpages_url

Unnamed: 0,id,title,full_url,extracted_content_body,content_category
2160,1434688,Screen for Life - National Health Screening Pr...,https://www.healthhub.sg/programmes/Screen_for...,We note that some users have experienced issue...,programs
2161,1434612,"Live Well, Age Well Programme",https://www.healthhub.sg/programmes/AAP,"By 7 February 2023, all users must perform a o...",programs
2162,1434660,The Healthy 365 App,https://www.healthhub.sg/programmes/healthyliving,[https://go.gov.sg/useh365] [https://go.gov.s...,programs
2163,1434580,"Eat, Drink, Shop Healthy Challenge",https://www.healthhub.sg/programmes/eat-drink-...,Menu [#clear]\n\nMenu [#clear]\n\nMenu [...,programs
2164,1434657,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt,[https://go.gov.sg/h365-moveit]\n\nGet Moving ...,programs
...,...,...,...,...,...
2587,1435127,Types of diabetes | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,program-sub-pages
2588,1435167,Hypoglycaemia | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages
2589,1468676,Program Sub level 1,https://www.healthhub.sg/programmes/1test/sya-...,HealthHub\nRelaxation-ExerciseHealthhub\nList ...,program-sub-pages
2593,1435215,If you have coronary heart disease | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages


##### Export to Excel

In [25]:
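# Build the path with os.path.join so it also works outside Windows
output_path = os.path.join(folder_path, "cleaned_programSubpages.xlsx")
writer = pd.ExcelWriter(output_path, engine='xlsxwriter')

# One sheet per dataframe: URL-only view first, then all relevant columns
names = ['cleaned_program_subpages_url', 'cleaned_program_subpages_all']
dataframes = [df_programsubpages_url, df_programsubpages_cols]

for name, frame in zip(names, dataframes):
    frame.to_excel(writer, sheet_name=name, index=False)

# Closing the writer saves the workbook to disk
writer.close()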