The purpose of this notebook is to group the URLs by URL hierarchy, using the second level of the programmes path (the segment immediately after `/programmes/`).

Please check that there is a folder named `program-sub-pages` in this `notebooks` folder; the Excel files created later on are stored there.

Please run `get_program_subpages.ipynb` first to produce the Excel file that this notebook reads.

Note:
- Do not worry about the squiggly lines under the dataframe variables. The variables are defined dynamically through `globals()`, so the editor cannot see them. As long as the cells have been run, there should be no issue.

<hr>

### Setup

In [2]:
import os
import pandas as pd

In [None]:
folder_path = 'program-sub-pages'
if not os.path.exists(folder_path):
    # If it doesn't exist, create the folder
    os.makedirs(folder_path)
    print(f"Folder created: {folder_path}")
else:
    print(f"Folder already exists: {folder_path}")
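The folder check above can also be written with `pathlib`, where `mkdir(exist_ok=True)` makes the call safe to repeat. This is only a sketch: the helper name `ensure_folder` is hypothetical, and a temporary directory stands in for the real `notebooks` folder.

```python
import tempfile
from pathlib import Path

def ensure_folder(path):
    """Create `path` if needed; return True only when it was newly created."""
    folder = Path(path)
    existed = folder.is_dir()
    # exist_ok=True avoids an error when the folder is already there
    folder.mkdir(parents=True, exist_ok=True)
    return not existed

# Demonstrate inside a throwaway directory (an assumption for this example)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "program-sub-pages"
    first = ensure_folder(target)   # folder is created
    second = ensure_folder(target)  # folder already exists
print(first, second)
```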

Load the cleaned dataset of program sub-page URLs

In [3]:
# Build the path with os.path.join so it works on any operating system
df = pd.read_excel(os.path.join("program-sub-pages", "cleaned_programSubpages.xlsx"), sheet_name=0)
# Optional: load the second sheet for more information & checking on the rows
# df_cols = pd.read_excel(os.path.join("program-sub-pages", "cleaned_programSubpages.xlsx"), sheet_name=1)
df

Unnamed: 0,id,title,full_url,extracted_content_body,content_category
0,1434919,MindSG,https://www.healthhub.sg/programmes/MindSG/Car...,Caring for Ourselves\nSleeping Well\nSelect th...,program-sub-pages
1,1480345,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,[https://go.gov.sg/useh365] [https://go.gov.s...,program-sub-pages
2,1435221,Parent Hub: Student Immunisation And Screening,https://www.healthhub.sg/programmes/parent-hub...,CHILD IMMUNISATION AND SCREENING SERVICES\nThe...,program-sub-pages
3,1434809,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/tracke...,< Previous [#nscMastheadCarousel] > Next [#nsc...,program-sub-pages
4,1435018,Reduce Your Salt And Sugar Intake,https://www.healthhub.sg/programmes/nutrition-...,Menu [#] [#clear]\n\nMenu [#] [#clear]\n\n...,program-sub-pages
...,...,...,...,...,...
298,1435127,Types of diabetes | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,program-sub-pages
299,1435167,Hypoglycaemia | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages
300,1468676,Program Sub level 1,https://www.healthhub.sg/programmes/1test/sya-...,HealthHub\nRelaxation-ExerciseHealthhub\nList ...,program-sub-pages
301,1435215,If you have coronary heart disease | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages


<hr>

### Extraction of second level of URL hierarchy

In [4]:
# Copy df so that the original df stays unchanged
df_new = df.copy()

In [5]:
# Create new col to store extracted portion of url after '/programmes/'
df_new['secondLvl'] = df_new['full_url'].str.split('/programmes/').str[1].str.split('/').str[0]
df_new

Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
0,1434919,MindSG,https://www.healthhub.sg/programmes/MindSG/Car...,Caring for Ourselves\nSleeping Well\nSelect th...,program-sub-pages,MindSG
1,1480345,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,[https://go.gov.sg/useh365] [https://go.gov.s...,program-sub-pages,LetsMoveIt
2,1435221,Parent Hub: Student Immunisation And Screening,https://www.healthhub.sg/programmes/parent-hub...,CHILD IMMUNISATION AND SCREENING SERVICES\nThe...,program-sub-pages,parent-hub
3,1434809,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/tracke...,< Previous [#nscMastheadCarousel] > Next [#nsc...,program-sub-pages,nsc
4,1435018,Reduce Your Salt And Sugar Intake,https://www.healthhub.sg/programmes/nutrition-...,Menu [#] [#clear]\n\nMenu [#] [#clear]\n\n...,program-sub-pages,nutrition-hub
...,...,...,...,...,...,...
298,1435127,Types of diabetes | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,program-sub-pages,diabetes-hub-v2
299,1435167,Hypoglycaemia | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub-v2
300,1468676,Program Sub level 1,https://www.healthhub.sg/programmes/1test/sya-...,HealthHub\nRelaxation-ExerciseHealthhub\nList ...,program-sub-pages,1test
301,1435215,If you have coronary heart disease | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub-v2
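To make the extraction step concrete, here is a plain-Python sketch of the same idea for a single URL. The function name `second_level` is hypothetical. Unlike the `.str.split` chain, it matches `programmes` as a whole path segment and returns `None` when that segment is missing, so treat it as an illustration rather than an exact equivalent.

```python
from urllib.parse import urlsplit

def second_level(url, marker="programmes"):
    """Return the path segment right after `marker`, or None if it is absent."""
    # Split the URL path into non-empty segments
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if marker in segments:
        idx = segments.index(marker)
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None

examples = [
    "https://www.healthhub.sg/programmes/nsc/support/",
    "https://www.healthhub.sg/programmes/AAP/stroke/",
    "https://www.healthhub.sg/",  # no '/programmes/' segment
]
results = [second_level(u) for u in examples]
print(results)
```

Checking for `None` (or `NaN` in the pandas version) before grouping helps catch URLs that do not follow the expected pattern.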


<hr>

### Data Understanding

- No. of unique values for `secondLvl` = No. of unique groups

In [7]:
df_new

Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
0,1434919,MindSG,https://www.healthhub.sg/programmes/MindSG/Car...,Caring for Ourselves\nSleeping Well\nSelect th...,program-sub-pages,MindSG
1,1480345,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,[https://go.gov.sg/useh365] [https://go.gov.s...,program-sub-pages,LetsMoveIt
2,1435221,Parent Hub: Student Immunisation And Screening,https://www.healthhub.sg/programmes/parent-hub...,CHILD IMMUNISATION AND SCREENING SERVICES\nThe...,program-sub-pages,parent-hub
3,1434809,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/tracke...,< Previous [#nscMastheadCarousel] > Next [#nsc...,program-sub-pages,nsc
4,1435018,Reduce Your Salt And Sugar Intake,https://www.healthhub.sg/programmes/nutrition-...,Menu [#] [#clear]\n\nMenu [#] [#clear]\n\n...,program-sub-pages,nutrition-hub
...,...,...,...,...,...,...
298,1435127,Types of diabetes | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,program-sub-pages,diabetes-hub-v2
299,1435167,Hypoglycaemia | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub-v2
300,1468676,Program Sub level 1,https://www.healthhub.sg/programmes/1test/sya-...,HealthHub\nRelaxation-ExerciseHealthhub\nList ...,program-sub-pages,1test
301,1435215,If you have coronary heart disease | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub-v2


Check the unique values for second level

In [8]:
unique_secondLvl = df_new['secondLvl'].unique()
unique_counts = df_new['secondLvl'].value_counts()
print(f"Number of unique values: {len(unique_secondLvl)}")
print("\nCounts for each unique value:\n", unique_counts)

Number of unique values: 25

Counts for each unique value:
 secondLvl
parent-hub                   71
MindSG                       70
diabetes-hub                 39
diabetes-hub-v2              25
nsc                          13
pressure-injury              13
nutrition-hub                11
LetsMoveIt                    8
AAP                           7
healthhub--parenting          7
healthia                      6
IQuit                         5
korangok                      4
MoveIt                        4
indian_outreach               3
Screen_for_Life               3
10-fun-ways-to-get-active     3
1test-Program                 2
1test                         2
ga-testing                    2
howareyoudoing                1
get-healthhub-track           1
healthhub-rewards             1
1test-221123                  1
Copy-of-1test                 1
Name: count, dtype: int64


- Hence, there are 25 groups for now
- Further analysis is needed to understand the groups, especially the smaller groups with only 1 or 2 pages
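One possible way to isolate the small groups for closer inspection is to filter on `value_counts()`. The sketch below uses a toy dataframe as a stand-in for `df_new`, and the threshold of 2 is an assumption taken from the note above.

```python
import pandas as pd

# Toy stand-in for df_new (group names borrowed from the counts above)
toy = pd.DataFrame({
    "secondLvl": ["nsc", "nsc", "nsc", "1test", "1test",
                  "ga-testing", "ga-testing", "howareyoudoing"]
})

counts = toy["secondLvl"].value_counts()
# Keep only the groups with at most 2 rows
small_groups = counts[counts <= 2].index.tolist()
# Select the rows that belong to those small groups
small_rows = toy[toy["secondLvl"].isin(small_groups)]
print(sorted(small_groups))
print(len(small_rows))
```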

<hr>

### Form Grouped Dataframes

In [9]:
grpList = []
# Create respective dfs for each unique group in `secondLvl`
for count, group in enumerate(unique_secondLvl, start=1):
    # Replace characters that are invalid in a variable name
    grpName = group.replace('-', '_').replace(' ', '_')
    # Create new df for each group
    globals()[f"df_{grpName}"] = df_new[df_new['secondLvl'] == group]
    print(f"{count}. Created df_{grpName}")
    grpList.append(f"df_{grpName}")

# Arrange grpList in descending order of group size
grpList = sorted(grpList, key=lambda x: len(globals()[x]), reverse=True)
print("\n", grpList)

1. Created df_MindSG
2. Created df_LetsMoveIt
3. Created df_parent_hub
4. Created df_nsc
5. Created df_nutrition_hub
6. Created df_healthhub_rewards
7. Created df_Screen_for_Life
8. Created df_IQuit
9. Created df_AAP
10. Created df_MoveIt
11. Created df_diabetes_hub
12. Created df_pressure_injury
13. Created df_korangok
14. Created df_howareyoudoing
15. Created df_indian_outreach
16. Created df_get_healthhub_track
17. Created df_healthhub__parenting
18. Created df_diabetes_hub_v2
19. Created df_healthia
20. Created df_1test_Program
21. Created df_1test
22. Created df_ga_testing
23. Created df_1test_221123
24. Created df_10_fun_ways_to_get_active
25. Created df_Copy_of_1test

 ['df_parent_hub', 'df_MindSG', 'df_diabetes_hub', 'df_diabetes_hub_v2', 'df_nsc', 'df_pressure_injury', 'df_nutrition_hub', 'df_LetsMoveIt', 'df_AAP', 'df_healthhub__parenting', 'df_healthia', 'df_IQuit', 'df_MoveIt', 'df_korangok', 'df_Screen_for_Life', 'df_indian_outreach', 'df_10_fun_ways_to_get_active', 'df_1t
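As an alternative to creating variables through `globals()`, the groups could be kept in a dictionary keyed by the `secondLvl` value, which also avoids the editor warnings mentioned in the note at the top. This is a sketch on a toy dataframe, not the approach used in this notebook; the names `groups` and `ordered` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_new
toy = pd.DataFrame({
    "secondLvl": ["nsc", "AAP", "nsc", "IQuit", "nsc", "AAP"],
})

# One dataframe per group, keyed by the original group name
groups = {name: frame for name, frame in toy.groupby("secondLvl", sort=False)}

# Group names in descending order of group size
ordered = sorted(groups, key=lambda k: len(groups[k]), reverse=True)
print(ordered)
print(len(groups["nsc"]))
```

With a dictionary, the original group names can be used directly as keys, so no character replacement is needed.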

<hr>

### Group Understanding

In [10]:
# Loop through the list and display the first few rows of each DataFrame
for index, df_name in enumerate(grpList, start=1):
    print(f"{index}. DataFrame: {df_name}")
    globals()[df_name].info()
    display(globals()[df_name].head())
    print("*"*150)

1. DataFrame: df_parent_hub
<class 'pandas.core.frame.DataFrame'>
Index: 71 entries, 2 to 212
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      71 non-null     int64 
 1   title                   71 non-null     object
 2   full_url                71 non-null     object
 3   extracted_content_body  71 non-null     object
 4   content_category        71 non-null     object
 5   secondLvl               71 non-null     object
dtypes: int64(1), object(5)
memory usage: 3.9+ KB


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
2,1435221,Parent Hub: Student Immunisation And Screening,https://www.healthhub.sg/programmes/parent-hub...,CHILD IMMUNISATION AND SCREENING SERVICES\nThe...,program-sub-pages,parent-hub
16,1434753,Parent Hub: 0-2 Years - Healthy Diet,https://www.healthhub.sg/programmes/parent-hub...,fo \n MEAL TIMES To view all content in this s...,program-sub-pages,parent-hub
18,1435359,Parent Hub: 0-2 Years,https://www.healthhub.sg/programmes/parent-hub...,Here's How to Team Up with Your Wife for Paren...,program-sub-pages,parent-hub
19,1435231,"Parent Hub: Student Health Centre, Dental Centre",https://www.healthhub.sg/programmes/parent-hub...,STUDENT HEALTH CENTRE AND STUDENT DENTAL CENTR...,program-sub-pages,parent-hub
25,1434755,Parent Hub: 3-6 Years,https://www.healthhub.sg/programmes/parent-hub...,Handy Guide to Screen Use\nYour handy guide to...,program-sub-pages,parent-hub


******************************************************************************************************************************************************
2. DataFrame: df_MindSG
<class 'pandas.core.frame.DataFrame'>
Index: 70 entries, 0 to 249
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      70 non-null     int64 
 1   title                   70 non-null     object
 2   full_url                70 non-null     object
 3   extracted_content_body  70 non-null     object
 4   content_category        70 non-null     object
 5   secondLvl               70 non-null     object
dtypes: int64(1), object(5)
memory usage: 3.8+ KB


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
0,1434919,MindSG,https://www.healthhub.sg/programmes/MindSG/Car...,Caring for Ourselves\nSleeping Well\nSelect th...,program-sub-pages,MindSG
6,1434871,MindSG,https://www.healthhub.sg/programmes/MindSG/Dis...,Are we giving the right support?\nLearn how we...,program-sub-pages,MindSG
9,1435236,MindSG,https://www.healthhub.sg/programmes/MindSG/Sle...,[#helplines] [#helplines]\n\nSleep Tracking F...,program-sub-pages,MindSG
10,1435243,MindSG,https://www.healthhub.sg/programmes/MindSG/See...,Seeking Support\nChoose what youd like to read...,program-sub-pages,MindSG
11,1434875,MindSG,https://www.healthhub.sg/programmes/MindSG/Abo...,What is Mental Well-being\nChoose what youd li...,program-sub-pages,MindSG


******************************************************************************************************************************************************
3. DataFrame: df_diabetes_hub
<class 'pandas.core.frame.DataFrame'>
Index: 39 entries, 45 to 242
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      39 non-null     int64 
 1   title                   39 non-null     object
 2   full_url                39 non-null     object
 3   extracted_content_body  39 non-null     object
 4   content_category        39 non-null     object
 5   secondLvl               39 non-null     object
dtypes: int64(1), object(5)
memory usage: 2.1+ KB


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
45,1435164,Self-monitoring of blood sugar | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub
86,1435281,Diabetes Hub: Guide to Managing Diabetes,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub
87,1435129,Be Aware - What is diabetes,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub
94,1435171,Hyperglycaemia | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,program-sub-pages,diabetes-hub
98,1435188,Download resources | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub


******************************************************************************************************************************************************
4. DataFrame: df_diabetes_hub_v2
<class 'pandas.core.frame.DataFrame'>
Index: 25 entries, 261 to 301
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      25 non-null     int64 
 1   title                   25 non-null     object
 2   full_url                25 non-null     object
 3   extracted_content_body  25 non-null     object
 4   content_category        25 non-null     object
 5   secondLvl               25 non-null     object
dtypes: int64(1), object(5)
memory usage: 1.4+ KB


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
261,1435175,Travelling overseas | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,program-sub-pages,diabetes-hub-v2
263,1435159,Sleeping well | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub-v2
264,1435136,Healthy eating | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub-v2
266,1435156,Stigma of diabetes | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BES TO BEAT DIABETES [/programmes/diabetes-h...,program-sub-pages,diabetes-hub-v2
267,1435147,Avoid smoking and drinking | Diabetes Hub,https://www.healthhub.sg/programmes/diabetes-h...,3 BEaTMS TO BEAT DIABETES [/programmes/diabete...,program-sub-pages,diabetes-hub-v2


******************************************************************************************************************************************************
5. DataFrame: df_nsc
<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, 3 to 215
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      13 non-null     int64 
 1   title                   13 non-null     object
 2   full_url                13 non-null     object
 3   extracted_content_body  13 non-null     object
 4   content_category        13 non-null     object
 5   secondLvl               13 non-null     object
dtypes: int64(1), object(5)
memory usage: 728.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
3,1434809,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/tracke...,< Previous [#nscMastheadCarousel] > Next [#nsc...,program-sub-pages,nsc
15,1434811,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/support/,Previous [#nscMastheadCarousel] > Next [#nscMa...,program-sub-pages,nsc
38,1434803,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/corpor...,< /> Previous [#nscMastheadCarousel] > Next [#...,program-sub-pages,nsc
122,1434969,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/corpor...,<Previous [#nscMastheadCarousel] >Next [#nscMa...,program-sub-pages,nsc
193,1434823,National Steps Challenge™,https://www.healthhub.sg/programmes/nsc/commun...,The National Steps Challenge\nSeason 5 Communi...,program-sub-pages,nsc


******************************************************************************************************************************************************
6. DataFrame: df_pressure_injury
<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, 83 to 192
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      13 non-null     int64 
 1   title                   13 non-null     object
 2   full_url                13 non-null     object
 3   extracted_content_body  13 non-null     object
 4   content_category        13 non-null     object
 5   secondLvl               13 non-null     object
dtypes: int64(1), object(5)
memory usage: 728.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
83,1435123,Pressure Injury Hub,https://www.healthhub.sg/programmes/pressure-i...,Menu - Pressure Injury Hub\n- Preventing Press...,program-sub-pages,pressure-injury
119,1435275,Pressure Injury Hub,https://www.healthhub.sg/programmes/pressure-i...,Menu - Pressure Injury Hub\n- Preventing Press...,program-sub-pages,pressure-injury
124,1435149,Pressure Injury Hub,https://www.healthhub.sg/programmes/pressure-i...,Menu - Pressure Injury Hub\n- Preventing Press...,program-sub-pages,pressure-injury
170,1435116,Pressure Injury Hub,https://www.healthhub.sg/programmes/pressure-i...,Menu - Pressure Injury Hub\n- Preventing Press...,program-sub-pages,pressure-injury
181,1435120,Pressure Injury Hub,https://www.healthhub.sg/programmes/pressure-i...,Menu - Pressure Injury Hub\n- Preventing Press...,program-sub-pages,pressure-injury


******************************************************************************************************************************************************
7. DataFrame: df_nutrition_hub
<class 'pandas.core.frame.DataFrame'>
Index: 11 entries, 4 to 218
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      11 non-null     int64 
 1   title                   11 non-null     object
 2   full_url                11 non-null     object
 3   extracted_content_body  11 non-null     object
 4   content_category        11 non-null     object
 5   secondLvl               11 non-null     object
dtypes: int64(1), object(5)
memory usage: 616.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
4,1435018,Reduce Your Salt And Sugar Intake,https://www.healthhub.sg/programmes/nutrition-...,Menu [#] [#clear]\n\nMenu [#] [#clear]\n\n...,program-sub-pages,nutrition-hub
5,1472348,Nutri-Grade,https://www.healthhub.sg/programmes/nutrition-...,Menu [#clear]\n\nMenu [#clear]\n\nMenu [...,program-sub-pages,nutrition-hub
7,1435021,Make Healthy Food & Grocery Choices,https://www.healthhub.sg/programmes/nutrition-...,Menu [#clear]\n\nResources\nPick up useful t...,program-sub-pages,nutrition-hub
8,1435017,Nutritious Foods For A Healthy Diet,https://www.healthhub.sg/programmes/nutrition-...,Menu [#clear]\n\nEat More\nEat more nutritio...,program-sub-pages,nutrition-hub
12,1435014,Easy Healthy Recipes,https://www.healthhub.sg/programmes/nutrition-...,Menu [#clear]\n\nRecipes\nNeed a little culi...,program-sub-pages,nutrition-hub


******************************************************************************************************************************************************
8. DataFrame: df_LetsMoveIt
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, 1 to 197
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      8 non-null      int64 
 1   title                   8 non-null      object
 2   full_url                8 non-null      object
 3   extracted_content_body  8 non-null      object
 4   content_category        8 non-null      object
 5   secondLvl               8 non-null      object
dtypes: int64(1), object(5)
memory usage: 448.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
1,1480345,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,[https://go.gov.sg/useh365] [https://go.gov.s...,program-sub-pages,LetsMoveIt
39,1435259,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,Your quick guide to book an event\nFor a pdf v...,program-sub-pages,LetsMoveIt
52,1435239,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,Ready for a bigger challenge? Explore a variet...,program-sub-pages,LetsMoveIt
58,1435229,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,Well done on taking the first steps to an acti...,program-sub-pages,LetsMoveIt
97,1435235,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/LetsMoveIt...,"Now that youre prepared to get active, complet...",program-sub-pages,LetsMoveIt


******************************************************************************************************************************************************
9. DataFrame: df_AAP
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, 24 to 115
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      7 non-null      int64 
 1   title                   7 non-null      object
 2   full_url                7 non-null      object
 3   extracted_content_body  7 non-null      object
 4   content_category        7 non-null      object
 5   secondLvl               7 non-null      object
dtypes: int64(1), object(5)
memory usage: 392.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
24,1434907,"See, Hear & Eat Better",https://www.healthhub.sg/programmes/AAP/functi...,"Project Silver Screen is an affordable, nation...",program-sub-pages,AAP
48,1434895,7 Easy Exercises to an Active Lifestyle,https://www.healthhub.sg/programmes/AAP/easy-e...,Back to Healthy Ageing [/programmes/Healthy_Ag...,program-sub-pages,AAP
61,1434903,You can spot a stroke,https://www.healthhub.sg/programmes/AAP/stroke/,Back to Healthy Ageing [/programmes/Healthy_Ag...,program-sub-pages,AAP
76,1434905,Age Healthier When You Cook Right And Eat Smart,https://www.healthhub.sg/programmes/AAP/nutrit...,Back to Healthy Ageing [http://www.healthhub.s...,program-sub-pages,AAP
80,1434901,You can prevent falls,https://www.healthhub.sg/programmes/AAP/falls-...,Back to Healthy Ageing [http://www.healthhub.s...,program-sub-pages,AAP


******************************************************************************************************************************************************
10. DataFrame: df_healthhub__parenting
<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, 254 to 260
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      7 non-null      int64 
 1   title                   7 non-null      object
 2   full_url                7 non-null      object
 3   extracted_content_body  7 non-null      object
 4   content_category        7 non-null      object
 5   secondLvl               7 non-null      object
dtypes: int64(1), object(5)
memory usage: 392.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
254,1434831,Be Gentle With Yourself,https://www.healthhub.sg/programmes/healthhub-...,Excited to hold your baby in your arms but fee...,program-sub-pages,healthhub--parenting
255,1434921,Happy 1-Year Old!,https://www.healthhub.sg/programmes/healthhub-...,Your baby is 1 year old! YouaTMve made it!\nFo...,program-sub-pages,healthhub--parenting
256,1434889,Is Your Baby Throwing Things Around?,https://www.healthhub.sg/programmes/healthhub-...,Cannot determine when your baby is cute or nau...,program-sub-pages,healthhub--parenting
257,1434873,Separation Anxiety,https://www.healthhub.sg/programmes/healthhub-...,Preparing to go back to work? Help baby cope w...,program-sub-pages,healthhub--parenting
258,1434911,"Time For Baby, Time For Mummy And Daddy",https://www.healthhub.sg/programmes/healthhub-...,"Is your baby anxious around strangers, clingin...",program-sub-pages,healthhub--parenting


******************************************************************************************************************************************************
11. DataFrame: df_healthia
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 262 to 302
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      6 non-null      int64 
 1   title                   6 non-null      object
 2   full_url                6 non-null      object
 3   extracted_content_body  6 non-null      object
 4   content_category        6 non-null      object
 5   secondLvl               6 non-null      object
dtypes: int64(1), object(5)
memory usage: 336.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
262,1434805,Dr Kimberly Douglas,https://www.healthhub.sg/programmes/healthia/d...,[/programs/Lists/Program Sub Pages/index.html]...,program-sub-pages,healthia
272,1434777,About Us,https://www.healthhub.sg/programmes/healthia/a...,[/programs/Lists/Program Sub Pages/index.html]...,program-sub-pages,healthia
285,1434801,level-1,https://www.healthhub.sg/programmes/healthia/l...,athis is level 1 content,program-sub-pages,healthia
289,1434796,Our Team,https://www.healthhub.sg/programmes/healthia/t...,[/programs/Lists/Program Sub Pages/index.html]...,program-sub-pages,healthia
295,1434807,level-2,https://www.healthhub.sg/programmes/healthia/l...,athis is level 2 content,program-sub-pages,healthia


******************************************************************************************************************************************************
12. DataFrame: df_IQuit
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 22 to 244
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      5 non-null      int64 
 1   title                   5 non-null      object
 2   full_url                5 non-null      object
 3   extracted_content_body  5 non-null      object
 4   content_category        5 non-null      object
 5   secondLvl               5 non-null      object
dtypes: int64(1), object(5)
memory usage: 280.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
22,1435084,Vape,https://www.healthhub.sg/programmes/IQuit/e-cig/,Toggle to find out more. I Quit Programme [/pr...,program-sub-pages,IQuit
55,1435096,Vape,https://www.healthhub.sg/programmes/IQuit/e-ci...,Toggle to find out more. I Quit Programme [/pr...,program-sub-pages,IQuit
178,1500502,Vape,https://www.healthhub.sg/programmes/IQuit/e-ci...,I Quit Programme [/programmes/88/IQuit/#home]\...,program-sub-pages,IQuit
243,1476504,Vape,https://www.healthhub.sg/programmes/IQuit/e-ci...,I Quit Programme [/programmes/88/IQuit/#home]\...,program-sub-pages,IQuit
244,1435093,Vape,https://www.healthhub.sg/programmes/IQuit/e-ci...,I Quit Programme [/programmes/88/IQuit/#home]\...,program-sub-pages,IQuit


******************************************************************************************************************************************************
13. DataFrame: df_MoveIt
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 37 to 250
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      4 non-null      int64 
 1   title                   4 non-null      object
 2   full_url                4 non-null      object
 3   extracted_content_body  4 non-null      object
 4   content_category        4 non-null      object
 5   secondLvl               4 non-null      object
dtypes: int64(1), object(5)
memory usage: 224.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
37,1435347,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/MoveIt/mov...,"Starting 12 July 2021, participants who do not...",program-sub-pages,MoveIt
62,1435340,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/MoveIt/mov...,"Starting 12 July 2021, participants who do not...",program-sub-pages,MoveIt
65,1434986,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/MoveIt/mov...,"Starting 12 July 2021, participants who do not...",program-sub-pages,MoveIt
250,1435344,Great things start when you MOVE IT!,https://www.healthhub.sg/programmes/MoveIt/mov...,In line with MOHs advisory on\n20 July 2021\no...,program-sub-pages,MoveIt


******************************************************************************************************************************************************
14. DataFrame: df_korangok
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 100 to 184
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      4 non-null      int64 
 1   title                   4 non-null      object
 2   full_url                4 non-null      object
 3   extracted_content_body  4 non-null      object
 4   content_category        4 non-null      object
 5   secondLvl               4 non-null      object
dtypes: int64(1), object(5)
memory usage: 224.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
100,1435115,Korang OK? - Fun Ways to Stay Active,https://www.healthhub.sg/programmes/korangok/f...,[/programmes/korangok/]\n- Home\n- Healthier C...,program-sub-pages,korangok
179,1435109,Korang OK? - Mental Well-being,https://www.healthhub.sg/programmes/korangok/m...,[/programmes/korangok/]\n- Home\n- Healthier C...,program-sub-pages,korangok
180,1435113,Korang OK? - IQuit for Good,https://www.healthhub.sg/programmes/korangok/q...,[/programmes/korangok/]\n- Home\n- Healthier C...,program-sub-pages,korangok
184,1435111,Korang OK? - Screen for Life,https://www.healthhub.sg/programmes/korangok/s...,[/programmes/korangok/]\n- Home\n- Healthier C...,program-sub-pages,korangok


******************************************************************************************************************************************************
15. DataFrame: df_Screen_for_Life
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 21 to 32
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      3 non-null      int64 
 1   title                   3 non-null      object
 2   full_url                3 non-null      object
 3   extracted_content_body  3 non-null      object
 4   content_category        3 non-null      object
 5   secondLvl               3 non-null      object
dtypes: int64(1), object(5)
memory usage: 168.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
21,1435054,Screen for Life - National Health Screening Pr...,https://www.healthhub.sg/programmes/Screen_for...,START YOUR SCREENING JOURNEY HERE\n- 18 - 24 y...,program-sub-pages,Screen_for_Life
29,1435061,Screen for Life - National Health Screening Pr...,https://www.healthhub.sg/programmes/Screen_for...,We have received feedback from some members of...,program-sub-pages,Screen_for_Life
32,1435053,Screen for Life - National Health Screening Pr...,https://www.healthhub.sg/programmes/Screen_for...,Download our Screen for Life booklet: English ...,program-sub-pages,Screen_for_Life


******************************************************************************************************************************************************
16. DataFrame: df_indian_outreach
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 186 to 252
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      3 non-null      int64 
 1   title                   3 non-null      object
 2   full_url                3 non-null      object
 3   extracted_content_body  3 non-null      object
 4   content_category        3 non-null      object
 5   secondLvl               3 non-null      object
dtypes: int64(1), object(5)
memory usage: 168.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
186,1434980,Take the first step with your loved ones,https://www.healthhub.sg/programmes/indian_out...,- Menu\n- Home\n- Healthy Eating\n- Physical A...,program-sub-pages,indian_outreach
251,1434979,Take the first step with your loved ones,https://www.healthhub.sg/programmes/indian_out...,- Menu\n- Home\n- Healthy Eating\n- Physical A...,program-sub-pages,indian_outreach
252,1434983,Take the first step with your loved ones,https://www.healthhub.sg/programmes/indian_out...,- Menu\n- Home\n- Healthy Eating\n- Physical A...,program-sub-pages,indian_outreach


******************************************************************************************************************************************************
17. DataFrame: df_10_fun_ways_to_get_active
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 282 to 293
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      3 non-null      int64 
 1   title                   3 non-null      object
 2   full_url                3 non-null      object
 3   extracted_content_body  3 non-null      object
 4   content_category        3 non-null      object
 5   secondLvl               3 non-null      object
dtypes: int64(1), object(5)
memory usage: 168.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
282,1435336,test page level 1 - 14-10-2020,https://www.healthhub.sg/programmes/10-fun-way...,athis is a test page,program-sub-pages,10-fun-ways-to-get-active
291,1435343,test page level 2-14-10-2020,https://www.healthhub.sg/programmes/10-fun-way...,atest page level 2,program-sub-pages,10-fun-ways-to-get-active
293,1435339,test page level 3-14-10-2020,https://www.healthhub.sg/programmes/10-fun-way...,atest page level 3,program-sub-pages,10-fun-ways-to-get-active


******************************************************************************************************************************************************
18. DataFrame: df_1test_Program
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 265 to 280
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      2 non-null      int64 
 1   title                   2 non-null      object
 2   full_url                2 non-null      object
 3   extracted_content_body  2 non-null      object
 4   content_category        2 non-null      object
 5   secondLvl               2 non-null      object
dtypes: int64(1), object(5)
memory usage: 112.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
265,1474295,1tes23,https://www.healthhub.sg/programmes/1test-Prog...,EDSH_Challenge_Event_Locations [https://ch-api...,program-sub-pages,1test-Program
280,1468228,1test programsub1,https://www.healthhub.sg/programmes/1test-Prog...,IQuit-Terms-and-Conditions_8Feb24 [https://ch-...,program-sub-pages,1test-Program


******************************************************************************************************************************************************
19. DataFrame: df_1test
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 276 to 300
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      2 non-null      int64 
 1   title                   2 non-null      object
 2   full_url                2 non-null      object
 3   extracted_content_body  2 non-null      object
 4   content_category        2 non-null      object
 5   secondLvl               2 non-null      object
dtypes: int64(1), object(5)
memory usage: 112.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
276,1468237,1test-sub-sya333,https://www.healthhub.sg/programmes/1test/1tes...,Sub-program entry - 15/11/2023 - sya33333 [htt...,program-sub-pages,1test
300,1468676,Program Sub level 1,https://www.healthhub.sg/programmes/1test/sya-...,HealthHub\nRelaxation-ExerciseHealthhub\nList ...,program-sub-pages,1test


******************************************************************************************************************************************************
20. DataFrame: df_ga_testing
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 277 to 287
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      2 non-null      int64 
 1   title                   2 non-null      object
 2   full_url                2 non-null      object
 3   extracted_content_body  2 non-null      object
 4   content_category        2 non-null      object
 5   secondLvl               2 non-null      object
dtypes: int64(1), object(5)
memory usage: 112.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
277,1435245,Persona B,https://www.healthhub.sg/programmes/ga-testing...,a-,program-sub-pages,ga-testing
287,1435247,Persona A,https://www.healthhub.sg/programmes/ga-testing...,a-,program-sub-pages,ga-testing


******************************************************************************************************************************************************
21. DataFrame: df_healthhub_rewards
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 17 to 17
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      1 non-null      int64 
 1   title                   1 non-null      object
 2   full_url                1 non-null      object
 3   extracted_content_body  1 non-null      object
 4   content_category        1 non-null      object
 5   secondLvl               1 non-null      object
dtypes: int64(1), object(5)
memory usage: 56.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
17,1434795,HPB Rewards Programme,https://www.healthhub.sg/programmes/healthhub-...,Frequently Asked Questions\n\nHPB Healthy 365 ...,program-sub-pages,healthhub-rewards


******************************************************************************************************************************************************
22. DataFrame: df_howareyoudoing
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 141 to 141
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      1 non-null      int64 
 1   title                   1 non-null      object
 2   full_url                1 non-null      object
 3   extracted_content_body  1 non-null      object
 4   content_category        1 non-null      object
 5   secondLvl               1 non-null      object
dtypes: int64(1), object(5)
memory usage: 56.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
141,1434977,Take the first step with your loved ones,https://www.healthhub.sg/programmes/howareyoud...,- Menu\n- Home\n- Healthy Eating\n- Physical A...,program-sub-pages,howareyoudoing


******************************************************************************************************************************************************
23. DataFrame: df_get_healthhub_track
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 253 to 253
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      1 non-null      int64 
 1   title                   1 non-null      object
 2   full_url                1 non-null      object
 3   extracted_content_body  1 non-null      object
 4   content_category        1 non-null      object
 5   secondLvl               1 non-null      object
dtypes: int64(1), object(5)
memory usage: 56.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
253,1435311,FAQs,https://www.healthhub.sg/programmes/get-health...,HealthHub Track Login: Facebook and Email Logi...,program-sub-pages,get-healthhub-track


******************************************************************************************************************************************************
24. DataFrame: df_1test_221123
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 279 to 279
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      1 non-null      int64 
 1   title                   1 non-null      object
 2   full_url                1 non-null      object
 3   extracted_content_body  1 non-null      object
 4   content_category        1 non-null      object
 5   secondLvl               1 non-null      object
dtypes: int64(1), object(5)
memory usage: 56.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
279,1482060,test72,https://www.healthhub.sg/programmes/1test-2211...,School Beverage List Apr 2024 As of Mar 2024 [...,program-sub-pages,1test-221123


******************************************************************************************************************************************************
25. DataFrame: df_Copy_of_1test
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 292 to 292
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      1 non-null      int64 
 1   title                   1 non-null      object
 2   full_url                1 non-null      object
 3   extracted_content_body  1 non-null      object
 4   content_category        1 non-null      object
 5   secondLvl               1 non-null      object
dtypes: int64(1), object(5)
memory usage: 56.0+ bytes


Unnamed: 0,id,title,full_url,extracted_content_body,content_category,secondLvl
292,1470565,test article,https://www.healthhub.sg/programmes/Copy-of-1t...,Website List_As of 31 Oct 2023 [https://ch-api...,program-sub-pages,Copy-of-1test


******************************************************************************************************************************************************


#### Summary of each Group

Legend:
- *, #, ~: to be grouped together
- -: to be removed

1. parent-hub:              taking care of children
2. MindSG:                  mental well being
3. diabetes-hub:            *diabetes
4. diabetes-hub-v2:         *diabetes, URLs return a 404 error, can be combined with the group above
5. pressure-injury:         pressure injury
6. nsc:                     national steps challenge
7. nutrition-hub:           nutrition
8. LetsMoveIt:              #physical activity
9. AAP:                     live well, age well (elderly)
10. healthhub--parenting:    taking care of baby
11. IQuit:                   ~vaping
12. MoveIt:                  #physical activity
13. korangok:                malay community
14. indian_outreach:         indian community
15. Screen_for_Life:         screenings
16. healthia:                healthia clinic information (take note, because the info seems off)
17. 1test-Program:            - API test calls (remove)
18. 1test:                    - API test calls (remove)
19. 1test-221123:             - API test calls (remove)
20. healthhub-rewards:        hh rewards
21. get-healthhub-track:      hh track
22. howareyoudoing:           - menu list (remove)
23. Copy-of-1test:            ~parent child vaping
24. ga-testing:               - gibberish content (remove)
25. 10-fun-ways-to-get-active: - test content (remove)

Hence, 16 final groups are expected: 25 groups, minus the 6 marked for removal, minus the 3 that are merged into another group.
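
The expected count can be double-checked with a few lines of plain Python. This is only a sketch: the group names are copied from the summary list, while `to_remove` and `merge_into` are illustrative variable names that restate the legend and are not used elsewhere in this notebook.

```python
# Second-level groups, as listed in the summary
all_groups = [
    "parent-hub", "MindSG", "diabetes-hub", "diabetes-hub-v2", "pressure-injury",
    "nsc", "nutrition-hub", "LetsMoveIt", "AAP", "healthhub--parenting",
    "IQuit", "MoveIt", "korangok", "indian_outreach", "Screen_for_Life",
    "healthia", "1test-Program", "1test", "1test-221123", "healthhub-rewards",
    "get-healthhub-track", "howareyoudoing", "Copy-of-1test", "ga-testing",
    "10-fun-ways-to-get-active",
]

# Groups marked "-" in the legend (to be removed)
to_remove = {
    "1test-Program", "1test", "1test-221123",
    "howareyoudoing", "ga-testing", "10-fun-ways-to-get-active",
}

# Groups marked *, # or ~ that are folded into another group
merge_into = {
    "diabetes-hub-v2": "diabetes-hub",
    "MoveIt": "LetsMoveIt",
    "Copy-of-1test": "IQuit",
}

# Drop the removed groups, then map each merged group onto its target
final_groups = {merge_into.get(g, g) for g in all_groups if g not in to_remove}
print(len(final_groups))  # 16
```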

<hr>

### Data Preparation

Final Clean, Removal & Merge of Dfs

#### Initial Group List

In [11]:
grpList

['df_parent_hub',
 'df_MindSG',
 'df_diabetes_hub',
 'df_diabetes_hub_v2',
 'df_nsc',
 'df_pressure_injury',
 'df_nutrition_hub',
 'df_LetsMoveIt',
 'df_AAP',
 'df_healthhub__parenting',
 'df_healthia',
 'df_IQuit',
 'df_MoveIt',
 'df_korangok',
 'df_Screen_for_Life',
 'df_indian_outreach',
 'df_10_fun_ways_to_get_active',
 'df_1test_Program',
 'df_1test',
 'df_ga_testing',
 'df_healthhub_rewards',
 'df_howareyoudoing',
 'df_get_healthhub_track',
 'df_1test_221123',
 'df_Copy_of_1test']

#### Removal of redundant groups

In [12]:
# Remove 7 groups: the 6 test/junk groups, plus df_Copy_of_1test, which is merged into the IQuit group in the next step
exclude = ['df_1test_Program', 'df_1test', 'df_1test_221123', 'df_howareyoudoing', 'df_10_fun_ways_to_get_active', 'df_ga_testing',  'df_Copy_of_1test']
filtered_grpList = [name for name in grpList if name not in exclude]
print(len(filtered_grpList))
filtered_grpList

18


['df_parent_hub',
 'df_MindSG',
 'df_diabetes_hub',
 'df_diabetes_hub_v2',
 'df_nsc',
 'df_pressure_injury',
 'df_nutrition_hub',
 'df_LetsMoveIt',
 'df_AAP',
 'df_healthhub__parenting',
 'df_healthia',
 'df_IQuit',
 'df_MoveIt',
 'df_korangok',
 'df_Screen_for_Life',
 'df_indian_outreach',
 'df_healthhub_rewards',
 'df_get_healthhub_track']

#### Merge common groups together

- diabetes_hub
- moveIt
- vaping

In [13]:
# Run this cell once only: re-running it appends the merged group names to filtered_grpList again
df_diabetes_hub_concatenated = pd.concat([df_diabetes_hub, df_diabetes_hub_v2], axis=0)
df_LetsMoveIt_concatenated = pd.concat([df_LetsMoveIt, df_MoveIt], axis=0)
df_iquit_concatenated = pd.concat([df_IQuit, df_Copy_of_1test], axis=0)

newGrps = ['df_diabetes_hub_concatenated', 'df_LetsMoveIt_concatenated', 'df_iquit_concatenated']
for grp in newGrps:
    print(grp)
    filtered_grpList.append(grp)

exclude_mergedGrps = ['df_diabetes_hub', 'df_diabetes_hub_v2', 'df_LetsMoveIt', 'df_MoveIt', 'df_IQuit', 'df_Copy_of_1test']
final_grpList = [df for df in filtered_grpList if df not in exclude_mergedGrps]

len(final_grpList)

df_diabetes_hub_concatenated
df_LetsMoveIt_concatenated
df_iquit_concatenated


16

In [14]:
final_grpList

['df_parent_hub',
 'df_MindSG',
 'df_nsc',
 'df_pressure_injury',
 'df_nutrition_hub',
 'df_AAP',
 'df_healthhub__parenting',
 'df_healthia',
 'df_korangok',
 'df_Screen_for_Life',
 'df_indian_outreach',
 'df_healthhub_rewards',
 'df_get_healthhub_track',
 'df_diabetes_hub_concatenated',
 'df_LetsMoveIt_concatenated',
 'df_iquit_concatenated']
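
The merge cell above has to be run only once because it appends names to a list. One possible alternative, sketched here with toy DataFrames, is to keep the groups in a dict keyed by name and merge through a small function, so that re-running has no side effects. `merge_groups` and the toy `id` values are assumptions made for illustration; they are not part of this notebook.

```python
import pandas as pd

# Toy stand-ins for the per-group DataFrames (illustrative values only)
groups = {
    "diabetes_hub": pd.DataFrame({"id": [1, 2]}),
    "diabetes_hub_v2": pd.DataFrame({"id": [3]}),
    "nsc": pd.DataFrame({"id": [4]}),
}


def merge_groups(groups, merges):
    """Return a new dict in which each list of source groups is concatenated under a new key."""
    merged = dict(groups)  # shallow copy, the input dict is left untouched
    for new_name, sources in merges.items():
        # Take out whichever source groups are still present
        parts = [merged.pop(src) for src in sources if src in merged]
        if parts:
            merged[new_name] = pd.concat(parts, axis=0)
    return merged


merges = {"diabetes_hub_concatenated": ["diabetes_hub", "diabetes_hub_v2"]}
result = merge_groups(groups, merges)
print(sorted(result))  # ['diabetes_hub_concatenated', 'nsc']
```

Calling `merge_groups(result, merges)` a second time returns the same set of groups, since the source groups are no longer present and nothing is appended.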

#### Export to xlsx

In [15]:
# Sort the DataFrame names by the number of rows in each DataFrame (largest first)
sorted_final_grpList = sorted(final_grpList, key=lambda df_name: len(globals()[df_name]), reverse=True)
print(sorted_final_grpList)

['df_parent_hub', 'df_MindSG', 'df_diabetes_hub_concatenated', 'df_nsc', 'df_pressure_injury', 'df_LetsMoveIt_concatenated', 'df_nutrition_hub', 'df_AAP', 'df_healthhub__parenting', 'df_healthia', 'df_iquit_concatenated', 'df_korangok', 'df_Screen_for_Life', 'df_indian_outreach', 'df_healthhub_rewards', 'df_get_healthhub_track']


In [16]:
# keys = DataFrame names, values = DataFrame objects
dataframes = {name: globals()[name] for name in sorted_final_grpList}

# Save all DataFrames to Excel file
with pd.ExcelWriter(r"program-sub-pages\grouped_programSubpages.xlsx", engine='xlsxwriter') as writer:
    # group_df is used so that the `df` loaded at the top of the notebook is not overwritten
    for df_name, group_df in dataframes.items():
        sheet_name = df_name.replace('df_', '')
        group_df.to_excel(writer, sheet_name=sheet_name, index=False)

print("DataFrames saved to 'grouped_programSubpages.xlsx'.")

DataFrames saved to 'grouped_programSubpages.xlsx'.
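
Excel limits sheet names to 31 characters and does not allow the characters `[ ] : * ? / \`. The current group names all fit, but if longer second-level names appear later, a small guard along these lines could be applied before calling `to_excel`. `safe_sheet_name` is a hypothetical helper, not something defined in this notebook.

```python
import re

# Excel limits sheet names to 31 characters and rejects the characters [ ] : * ? / \
MAX_SHEET_NAME_LEN = 31
INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")


def safe_sheet_name(df_name):
    """Hypothetical helper: turn a DataFrame variable name into a valid sheet name."""
    # Strip only a leading 'df_' (str.replace would also remove later occurrences)
    name = df_name[3:] if df_name.startswith("df_") else df_name
    # Replace characters that Excel does not accept in sheet names
    name = INVALID_SHEET_CHARS.sub("_", name)
    # Truncate to the maximum length Excel allows
    return name[:MAX_SHEET_NAME_LEN]


for example in ["df_diabetes_hub_concatenated", "df_healthhub__parenting"]:
    print(example, "->", safe_sheet_name(example))
```

Note that truncation could make two long names collide, so a real implementation would also need to check for duplicates.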


<hr>

### Important Note

Further manual checking and cleaning was done on this Excel file. Some `extracted_content_body` values did not make sense. As topic/keyword modelling will be performed on this attribute, I have decided to remove those rows.

Removed rows under the respective sheets are as follows:
1. diabetes-hub: rows with `id` `1435297`, `1435302` & `1435299`
    - `extracted_content_body` is only ‘aaa’ or ‘aa’
2. healthia: rows with `id` `1434801`, `1434807` & `1434815`
    - `extracted_content_body` is ‘athis is level 1 content’
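
The removal itself was done by hand in Excel. For reproducibility, the same step could be scripted roughly as follows. This is a sketch only: the ids come from the list above, while `drop_ids`, the dict keys and the toy sheet (including id `999`) are placeholders for illustration.

```python
import pandas as pd

# ids listed in the note above, keyed by the sheet they were removed from
ids_to_drop = {
    "diabetes-hub": [1435297, 1435302, 1435299],
    "healthia": [1434801, 1434807, 1434815],
}


def drop_ids(sheet_df, ids):
    """Return a copy of sheet_df without the rows whose `id` is in ids."""
    return sheet_df[~sheet_df["id"].isin(ids)].reset_index(drop=True)


# Toy stand-in for one sheet; id 999 and its text are placeholders, not real rows
toy_sheet = pd.DataFrame({
    "id": [1435297, 999],
    "extracted_content_body": ["aaa", "placeholder text"],
})

cleaned = drop_ids(toy_sheet, ids_to_drop["diabetes-hub"])
print(cleaned["id"].tolist())  # [999]
```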