In [26]:
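import os

import pandas as pd

# List the files available in the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv("/kaggle/input/master-tbl/Master_Tbl.csv")

# Drop the unnamed index columns and the helper column left over from the Excel export
df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
df = df.drop(columns=["Column1"], errors="ignore")
df.head()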

/kaggle/input/master-tbl/Master_Tbl.csv


Unnamed: 0,demographic_type,demographic_group,high_priority_pct,weighted_base,Priority_strength,Round_up
0,Age,16-24,68.79,314.0,Moderate priority,68.8
1,Age,25-34,78.14,398.0,High priority,78.1
2,Age,35-44,77.27,374.0,High priority,77.3
3,Age,45-54,79.4,403.0,High priority,79.4
4,Age,55-64,86.85,365.0,Very high priority,86.9


## Data quality checks
Before analysis, I validated that:
- Key fields contain no missing values
- The expected demographic types (Age, Gender, Education) are present
- A weighted base is available for each segment (used to flag small bases)

I also removed unneeded columns and rows containing NaN values, and confirmed that all expected columns and values were visible.
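The checks listed above could be written as explicit assertions so that they fail loudly if the source file changes. The following is a minimal sketch on a four-row sample copied from the tables in this notebook, not the notebook's own validation code. The names `sample`, `key_fields`, `expected_types`, and `MIN_BASE` are introduced here for illustration, and the threshold of 100 is assumed to match the base-size filter used later.

```python
import pandas as pd

# Small sample copied from the tables in this notebook (not the full file).
sample = pd.DataFrame({
    "demographic_type": ["Age", "Age", "Gender", "Education"],
    "demographic_group": ["16-24", "75+", "Male", "Graduates"],
    "high_priority_pct": [68.79, 76.0, 78.44, 84.38],
    "weighted_base": [314.0, 25.0, 1067.0, 653.0],
})

key_fields = ["demographic_group", "high_priority_pct", "weighted_base"]
expected_types = {"Age", "Gender", "Education"}
MIN_BASE = 100  # assumed threshold, matching the later base-size filter

# 1. No missing values in key fields
missing_counts = sample[key_fields].isna().sum()
assert missing_counts.sum() == 0, missing_counts

# 2. Expected demographic types exist
found_types = set(sample["demographic_type"].unique())
assert expected_types <= found_types, expected_types - found_types

# 3. Flag (rather than drop) segments with a small weighted base
small_base = sample.loc[
    sample["weighted_base"] < MIN_BASE, "demographic_group"
].tolist()
print("Small-base segments:", small_base)
```

Against the full table, `sample` would be replaced by `df`.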


In [33]:
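import sqlite3

# Normalise column names, then inspect the frame before cleaning
df.columns = df.columns.str.strip()
df.info()

# Keep only rows with a demographic group and KPI values; drop the rounding helper column
df = df.dropna(subset=["demographic_group", "high_priority_pct", "weighted_base"]).reset_index(drop=True)
df = df.drop(columns=["Round_up"], errors="ignore")
df.info()

# Load the cleaned table into an in-memory SQLite database for the SQL analysis
conn = sqlite3.connect(":memory:")
df.to_sql("Master_Tbl", conn, index=False, if_exists="replace")
pd.read_sql_query("SELECT * FROM Master_Tbl;", conn)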


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   demographic_type   11 non-null     object 
 1   demographic_group  11 non-null     object 
 2   high_priority_pct  11 non-null     float64
 3   weighted_base      11 non-null     float64
 4   Priority_strength  11 non-null     object 
dtypes: float64(2), object(3)
memory usage: 572.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   demographic_type   11 non-null     object 
 1   demographic_group  11 non-null     object 
 2   high_priority_pct  11 non-null     float64
 3   weighted_base      11 non-null     float64
 4   Priority_strength  11 non-null     object 
dtypes: float64(2), object(3)
memory usage: 572.0+ bytes


Unnamed: 0,demographic_type,demographic_group,high_priority_pct,weighted_base,Priority_strength
0,Age,16-24,68.79,314.0,Moderate priority
1,Age,25-34,78.14,398.0,High priority
2,Age,35-44,77.27,374.0,High priority
3,Age,45-54,79.4,403.0,High priority
4,Age,55-64,86.85,365.0,Very high priority
5,Age,65-74,88.33,300.0,Very high priority
6,Age,75+,76.0,25.0,High priority
7,Gender,Male,78.44,1067.0,High priority
8,Gender,Female,81.23,1092.0,High priority
9,Education,Graduates,84.38,653.0,High priority


## Data cleaning and validation

The raw dataset contained blank separator rows and intermediate helper columns generated during Excel-based preparation.  
To ensure analytical integrity, only rows containing valid demographic groups and KPI values were retained.

Specifically:
- Rows missing demographic group, priority percentage, or weighted base values were removed
- Helper columns used during preparation (e.g. rounded values) were dropped
- The resulting table represents a clean, reporting-ready fact table with one row per demographic segment

This step ensures all subsequent SQL analysis is based on complete and reliable data.
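To make the cleaning rules above concrete, here is a small sketch that reproduces them on a three-row frame: two rows copied from the table earlier in this notebook, with an illustrative blank separator row between them. It also adds a check that is not in the original cell, namely that each (type, group) pair appears only once. The names `raw`, `required`, `clean`, and `segment_key` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Two rows copied from the table above, separated by a blank row of the kind
# the Excel export left behind (the blank row itself is illustrative).
raw = pd.DataFrame({
    "demographic_type": ["Age", np.nan, "Age"],
    "demographic_group": ["16-24", np.nan, "25-34"],
    "high_priority_pct": [68.79, np.nan, 78.14],
    "weighted_base": [314.0, np.nan, 398.0],
    "Round_up": [68.8, np.nan, 78.1],
})

required = ["demographic_group", "high_priority_pct", "weighted_base"]

# Drop incomplete rows, then the helper column, then rebuild the index
clean = (
    raw.dropna(subset=required)
       .drop(columns=["Round_up"], errors="ignore")
       .reset_index(drop=True)
)

rows_removed = len(raw) - len(clean)
print(f"Removed {rows_removed} incomplete row(s)")

# One row per demographic segment: no repeated (type, group) pairs
segment_key = ["demographic_type", "demographic_group"]
assert not clean.duplicated(subset=segment_key).any()
```

Logging the number of rows removed makes it easier to notice if a future extract loses more rows than expected.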


In [34]:
pd.read_sql_query("""
SELECT demographic_group,
       high_priority_pct,
       weighted_base
FROM Master_Tbl
WHERE demographic_type = 'Age'
ORDER BY high_priority_pct DESC;
""", conn)


Unnamed: 0,demographic_group,high_priority_pct,weighted_base
0,65-74,88.33,300.0
1,55-64,86.85,365.0
2,45-54,79.4,403.0
3,25-34,78.14,398.0
4,35-44,77.27,374.0
5,75+,76.0,25.0
6,16-24,68.79,314.0


## Analysis 1: Cyber security priority by age group

This analysis explores how the perceived importance of cyber security varies by age group.
Understanding these differences helps identify segments that may benefit from targeted awareness activity.


In [36]:
pd.read_sql_query("""
SELECT demographic_group,
       high_priority_pct,
       weighted_base
FROM Master_Tbl
WHERE demographic_type = 'Age'
  AND weighted_base >= 100
ORDER BY high_priority_pct DESC;
""", conn)


Unnamed: 0,demographic_group,high_priority_pct,weighted_base
0,65-74,88.33,300.0
1,55-64,86.85,365.0
2,45-54,79.4,403.0
3,25-34,78.14,398.0
4,35-44,77.27,374.0
5,16-24,68.79,314.0


## Analysis 2: Age groups with reliable sample sizes

To ensure robust reporting, age groups with a weighted base below 100 are excluded. In this dataset that removes only the 75+ group (weighted base 25).
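An alternative to dropping small-base groups is to keep them and label them, so a reader can still see the figure alongside a caution. The sketch below shows one way to do this with a SQL `CASE` expression and a bound parameter for the threshold. It uses its own in-memory connection and a table named `age_demo`, both hypothetical, populated with the age rows shown in this notebook.

```python
import sqlite3

import pandas as pd

# Age rows copied from the query output in this notebook
age = pd.DataFrame({
    "demographic_group": ["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"],
    "high_priority_pct": [68.79, 78.14, 77.27, 79.4, 86.85, 88.33, 76.0],
    "weighted_base": [314.0, 398.0, 374.0, 403.0, 365.0, 300.0, 25.0],
})

conn_demo = sqlite3.connect(":memory:")  # separate from the notebook's `conn`
age.to_sql("age_demo", conn_demo, index=False)

# Label each group instead of filtering; :min_base is bound at query time
flagged = pd.read_sql_query(
    """
    SELECT demographic_group,
           high_priority_pct,
           weighted_base,
           CASE WHEN weighted_base < :min_base THEN 'Small base'
                ELSE 'OK'
           END AS base_flag
    FROM age_demo
    ORDER BY high_priority_pct DESC;
    """,
    conn_demo,
    params={"min_base": 100},
)
print(flagged)
```

Passing the threshold as a parameter keeps the reporting rule in one place if the minimum base changes.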

In [37]:
pd.read_sql_query("""
SELECT demographic_type,
       demographic_group,
       high_priority_pct,
       weighted_base
FROM Master_Tbl
WHERE high_priority_pct < 75
ORDER BY high_priority_pct ASC;
""", conn)


Unnamed: 0,demographic_type,demographic_group,high_priority_pct,weighted_base
0,Age,16-24,68.79,314.0


## Analysis 3: Groups with lower cyber security prioritisation

Groups in which fewer than 75% rate cyber security as a high priority may reflect lower awareness or higher risk. Only the 16-24 age group (68.79%) falls below this threshold.


In [38]:
pd.read_sql_query("""
SELECT demographic_type,
       ROUND(AVG(high_priority_pct), 2) AS avg_high_priority_pct,
       MIN(high_priority_pct) AS min_priority,
       MAX(high_priority_pct) AS max_priority
FROM Master_Tbl
GROUP BY demographic_type
ORDER BY avg_high_priority_pct DESC;
""", conn)


Unnamed: 0,demographic_type,avg_high_priority_pct,min_priority,max_priority
0,Education,81.2,78.02,84.38
1,Gender,79.84,78.44,81.23
2,Age,79.25,68.79,88.33


## Analysis 4: Summary by demographic type

This section summarises cyber security priority at the level of each demographic type, reporting the unweighted mean, minimum, and maximum of the group percentages.
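The `AVG` in the summary query treats every group equally, so the 75+ group (weighted base 25) counts as much as the 45-54 group (weighted base 403). One possible refinement, which is not part of the original analysis, is a base-weighted mean computed as `SUM(pct * base) / SUM(base)`. The sketch below shows both side by side on the eleven rows from this notebook, using a hypothetical table name `fact_demo` and its own connection.

```python
import sqlite3

import pandas as pd

# The eleven segment rows shown in this notebook's outputs
rows = [
    ("Age", "16-24", 68.79, 314.0),
    ("Age", "25-34", 78.14, 398.0),
    ("Age", "35-44", 77.27, 374.0),
    ("Age", "45-54", 79.4, 403.0),
    ("Age", "55-64", 86.85, 365.0),
    ("Age", "65-74", 88.33, 300.0),
    ("Age", "75+", 76.0, 25.0),
    ("Gender", "Male", 78.44, 1067.0),
    ("Gender", "Female", 81.23, 1092.0),
    ("Education", "Graduates", 84.38, 653.0),
    ("Education", "Non-graduates", 78.02, 1506.0),
]
fact = pd.DataFrame(
    rows,
    columns=["demographic_type", "demographic_group", "high_priority_pct", "weighted_base"],
)

conn_demo = sqlite3.connect(":memory:")  # separate from the notebook's `conn`
fact.to_sql("fact_demo", conn_demo, index=False)

# Simple mean next to a mean weighted by each group's weighted base
summary = pd.read_sql_query(
    """
    SELECT demographic_type,
           ROUND(AVG(high_priority_pct), 2) AS simple_avg_pct,
           ROUND(SUM(high_priority_pct * weighted_base) / SUM(weighted_base), 2)
               AS weighted_avg_pct,
           SUM(weighted_base) AS total_base
    FROM fact_demo
    GROUP BY demographic_type
    ORDER BY weighted_avg_pct DESC;
    """,
    conn_demo,
)
print(summary)
```

Where the two averages differ noticeably, the difference comes from groups whose base is much larger or smaller than the others in the same demographic type.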


In [41]:
dim_demo = pd.DataFrame({
    "demographic_type": ["Age", "Gender", "Education"],
    "description": [
        "Age group of respondents",
        "Self-reported gender",
        "Highest education level attained"
    ]
})

dim_demo.to_sql(
    "dim_demographic",
    conn,
    index=False,
    if_exists="replace"
)
pd.read_sql_query("""
SELECT m.demographic_type,
       d.description,
       m.demographic_group,
       m.high_priority_pct,
       m.weighted_base
FROM Master_Tbl m
JOIN dim_demographic d
  ON m.demographic_type = d.demographic_type
ORDER BY m.demographic_type, m.high_priority_pct DESC;
""", conn)


Unnamed: 0,demographic_type,description,demographic_group,high_priority_pct,weighted_base
0,Age,Age group of respondents,65-74,88.33,300.0
1,Age,Age group of respondents,55-64,86.85,365.0
2,Age,Age group of respondents,45-54,79.4,403.0
3,Age,Age group of respondents,25-34,78.14,398.0
4,Age,Age group of respondents,35-44,77.27,374.0
5,Age,Age group of respondents,75+,76.0,25.0
6,Age,Age group of respondents,16-24,68.79,314.0
7,Education,Highest education level attained,Graduates,84.38,653.0
8,Education,Highest education level attained,Non-graduates,78.02,1506.0
9,Gender,Self-reported gender,Female,81.23,1092.0


## Dimensional modelling and JOIN

To reflect a reporting-ready data design, a demographic dimension table (`dim_demographic`) was created and joined to the KPI fact table (`Master_Tbl`) on `demographic_type`.
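An inner `JOIN` silently drops fact rows whose `demographic_type` has no match in the dimension table. A quick way to check for this is a `LEFT JOIN` that keeps only the unmatched rows. The sketch below is illustrative only: the tables `fact_demo` and `dim_demo` are hypothetical, and the "Region" row is made up to show what an unmatched type would look like.

```python
import sqlite3

import pandas as pd

conn_demo = sqlite3.connect(":memory:")  # separate from the notebook's `conn`

# Fact rows: "Region"/"North" is a made-up row with no dimension entry
pd.DataFrame({
    "demographic_type": ["Age", "Gender", "Region"],
    "demographic_group": ["16-24", "Male", "North"],
}).to_sql("fact_demo", conn_demo, index=False)

# Dimension rows, using the descriptions from this notebook
pd.DataFrame({
    "demographic_type": ["Age", "Gender", "Education"],
    "description": [
        "Age group of respondents",
        "Self-reported gender",
        "Highest education level attained",
    ],
}).to_sql("dim_demo", conn_demo, index=False)

# Fact rows that would be lost by an inner JOIN
unmatched = pd.read_sql_query(
    """
    SELECT m.demographic_type,
           COUNT(*) AS n_rows
    FROM fact_demo m
    LEFT JOIN dim_demo d
      ON m.demographic_type = d.demographic_type
    WHERE d.demographic_type IS NULL
    GROUP BY m.demographic_type;
    """,
    conn_demo,
)
print(unmatched)
```

An empty result means every fact row has a matching dimension row, so the inner join loses nothing.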


## Final summary

This notebook demonstrates how a complex government survey cross-tabulation can be transformed into a clean, reporting-ready dataset and analysed using SQL.

Key steps included:
- Cleaning Excel-derived artefacts and removing non-analytical rows
- Structuring the data as a single fact table (`Master_Tbl`) with clear demographic dimensions
- Applying SQL `WHERE` clauses to isolate specific demographic groups and enforce reporting rules such as minimum base sizes
- Using `GROUP BY` to summarise performance across demographic categories
- Designing the data model to be compatible with downstream BI tools such as Power BI

### Key insights
- Cyber security is considered a high personal priority by most demographic groups, but levels of prioritisation vary notably by age and education.
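- Prioritisation is highest among the 55-64 and 65-74 age groups (86.85% and 88.33%) and lowest among 16-24s (68.79%), indicating a potential area for targeted awareness activity. The 75+ figure (76.0%) rests on a weighted base of 25 and should be treated with caution.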
- Applying base-size thresholds improves the robustness and reliability of reported insights.

### Next steps
- Extend the dataset to include additional survey questions (e.g. perceived risk or confidence in government response) to support multi-metric reporting.
- Build an interactive Power BI dashboard using the same fact and dimension structure.
- Apply the same methodology to other public sector survey datasets to support risk, assurance, and performance reporting.
