# 📊 **QA Script for Population Projections**

## 📝 **Introduction**
This notebook aims to identify outliers in the GLA population projection data. The analysis involves loading the dataset, preprocessing the data, defining utility functions, performing outlier detection, and presenting the results through visualisations and Dash applications in an HTML report.

### 🎯 **Goals**
The analysis will focus on the following objectives:
- **Load** the population projections dataset.
- **Define** utility functions.
- **Process** the data.
- Perform **basic checks** on the dataset:
  - Range of years covered.
  - Missing values.
  - Duplicates.
  - Descriptive statistics.
  - Breakdown by components.
  - Detecting negative values.
  - Age group ranges.
- **Outlier Detection** over time for each component (Population Consistency Over Time):
  - Identify outliers using **Z-scores** and **Robust Z-scores**.
  - Analyse by **component**, **ward**, and **borough**.
  - Handle **infinite values** separately.
- **Total Population Outliers**:
  - Use Z-scores and Robust Z-scores for comparison.
  - Perform **cross-sectional comparisons**: Examine changes between boroughs and wards for a given year.
  - Conduct **temporal comparisons**: Measure percentage changes between years for both boroughs and wards.
  - Handle **infinite values** separately.
- **Gender Outliers**:
  - Investigate abnormal **gender ratios**.
  - Analyse by component.
  - Adjust the **outlier standard deviation thresholds** as needed based on different components.
- **Key Visualisations**:
  - Display the distribution of components.
  - Group data by **age ranges**.
  - Visualise **yearly totals**.
  - Show yearly total trends over time, broken down by components.
- **Dash Apps**:
  - Population pyramid app.
  - Line graph ward app.
  - Ward distribution app.
- **Produce HTML report**:
  - Format tables for the HTML report.
  - Produce the HTML layout.
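
The cross-sectional and temporal comparisons listed above can be sketched on a toy table. This is a minimal illustration rather than the notebook's implementation: the area labels and population figures are made up, and the column names `gss_code`, `year` and `population` are assumed to match those used by the functions later in the notebook.

```python
import pandas as pd

# Made-up totals for two illustrative areas over three years.
totals = pd.DataFrame({
    "gss_code": ["area_A"] * 3 + ["area_B"] * 3,
    "year": [2021, 2022, 2023] * 2,
    "population": [100.0, 110.0, 121.0, 300.0, 300.0, 330.0],
})

# Temporal comparison: year-on-year change within each area.
# Sorting first ensures pct_change compares consecutive years.
totals = totals.sort_values(["gss_code", "year"])
totals["pct_change_temporal"] = totals.groupby("gss_code")["population"].pct_change()

# Cross-sectional comparison: deviation from the mean of all areas in the same year.
year_mean = totals.groupby("year")["population"].transform("mean")
totals["pct_diff_cross"] = (totals["population"] - year_mean) / year_mean

print(totals)
```

The first year of each area has no temporal value (`NaN`) because there is no earlier year to compare against.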

---

## 📂 **Datasets**
How should the dataset be structured?

### Dataset 1
The functions are designed to take datasets in the form produced by the GLA Population Projection Workflow. In the current iteration of this workflow, the first dataset contains several key columns and components as outlined below:

#### Main Columns
- **gss_code**: Borough geocode (e.g., GSS code).
- **la_name**: Local Authority name.
- **year**: The year of the population data.
- **sex**: The gender (male/female).
- **age**: The age group or specific age.
- **value**: Count of the population in the given category.
- **gss_code_ward**: Geocode for the ward.
- **ward_name**: Name of the ward.

#### Component Column
The `component` column identifies which population-related metric each row holds:
- **net-flow**: Population migration inflow minus migration outflow.
- **population**: Total population.
- **birth**: Number of births.
- **deaths**: Number of deaths.
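
Before running any checks it can help to confirm that a loaded file has the columns described above. The sketch below is one way to do that. The helper `missing_columns` and the set `EXPECTED_COLUMNS` are hypothetical names introduced here, and the lower-case column names are an assumption based on how the functions in this notebook refer to them (`year`, `sex`, `age`, `component`).

```python
import pandas as pd

# Assumed column names (lower-cased); adjust to match the actual workflow output.
EXPECTED_COLUMNS = {
    "gss_code", "la_name", "gss_code_ward", "ward_name",
    "year", "sex", "age", "component", "value",
}

def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df, sorted alphabetically."""
    return sorted(set(expected) - set(df.columns))

# Tiny illustrative frame that lacks the two ward columns.
sample = pd.DataFrame({
    "gss_code": ["area_A"], "la_name": ["Example"], "year": [2022],
    "sex": ["female"], "age": [0], "component": ["population"], "value": [10.0],
})
print(missing_columns(sample))  # ['gss_code_ward', 'ward_name']
```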

### Dataset 2

The second dataset examined for outliers is the average household size (AHS) dataset. It is analysed using the `find_ahs_outliers_with_context` function.

The dataset contains only the average household size by ward from 2021 to 2050.



---

## 🛠️ **Structure**
1. [**Load Data**: the population projections dataset.](#load-data)
2. [**Define Utility Functions** for effective use.](#define-utility-functions)
3. [**Process Data**](#process-data)
4. [**Basic Checks** on the dataset.](#basic-checks)
5. [**Population Consistency Over Time** for each component.](#population-consistency-over-time)
6. [**Total Population Outliers**](#total-population-outliers)
7. [**Gender Outliers**](#gender-outliers)
8. [**Key Visualisations**](#key-visualisations)
9. [**Dash Apps**](#Dash-apps)
10. [**Collate Outliers** to determine key outlier rows.](#collate-outliers)
11. [**Average Household Size** Outliers.](#average-household-size_outliers)
12. [**HTML** Report.](#produce-html-report)



## Load Data
This section will cover how to load and preprocess the dataset.

---


In [213]:
import pandas as pd
import numpy as np
import pyreadr
import os
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px
from typing import Dict
from scipy.stats import zscore
from scipy.stats import skew
import plotly.io as pio
import dash_bootstrap_components as dbc




In [310]:
#the older version of data
#combined_10yr_fert_old_version = pd.read_csv(r'C:\Users\user\Documents\population_data\combined_10yr_central_fert\combined_10yr_central_fert.csv').iloc[:, 1:]

In [5]:
#in component column replace 'popn' with 'population'
#combined_10yr_fert_old_version['component'] = combined_10yr_fert_old_version['component'].replace('popn', 'population')

In [6]:
#combined_10yr_fert = combined_10yr_fert_old_version.copy()

In [106]:
combined_10yr_fert = pd.read_csv(r"C:\Users\user\Documents\population_data\long_format_combined_components_10yr_central_fert_2022.csv")

## Define Utility Functions
Define utility functions that will be used for various parts of the analysis.

---

In [107]:
def view_descriptive_statistics(df, columns):
    """
    Calculate descriptive statistics, including mean, median, and mode, for specified columns in a DataFrame.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    columns (list): List of columns for which to calculate the statistics.
    
    Returns:
    pd.DataFrame: DataFrame containing the descriptive statistics including median and mode.
    """
    # Get descriptive statistics using describe()
    descriptive_stats = df[columns].describe()

    # Calculate median for each column
    median = df[columns].median()

    # Calculate mode for each column (in case of multiple modes, take the first one)
    mode = df[columns].mode().iloc[0]

    # Add median and mode to the descriptive statistics DataFrame
    descriptive_stats.loc['median'] = median
    descriptive_stats.loc['mode'] = mode

    # Return the combined descriptive statistics
    return descriptive_stats
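
As a quick illustration of why the median and mode are added to the output of `describe()`, the sketch below repeats the same steps on a small, made-up table with one large value. The column names and numbers are hypothetical.

```python
import pandas as pd

# Made-up counts; the value 10 pulls the mean of 'births' away from the median.
toy = pd.DataFrame({"births": [1, 2, 2, 3, 10], "deaths": [0, 1, 1, 1, 4]})

stats_table = toy.describe()
stats_table.loc["median"] = toy.median()
stats_table.loc["mode"] = toy.mode().iloc[0]  # first mode if there are several

print(stats_table.loc[["mean", "median", "mode"]])
```

For `births` the mean is 3.6 while the median and mode are both 2, which is the kind of gap that hints at a skewed column.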

In [108]:
def create_age_bins(df, age_column='age', bins=None, labels=None):
    """
    Create age bins for the specified age column in the given DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the age data.
    age_column (str): The name of the column containing age data. Default is 'age'.
    bins (list): A list of bin edges for categorising ages. Default is None.
    labels (list): A list of labels for the bins. Default is None.

    Returns:
    pd.DataFrame: A copy of the DataFrame in which the age column is replaced by the binned age categories.
    """
    
    # If bins and labels are not provided, set default values
    if bins is None:
        bins = [-1, 18, 30, 40, 50, 60, 70, 80, 89, 90]
    
    if labels is None:
        labels = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-89', '90+']
    
    # Create a copy of the original DataFrame to avoid modifying it directly
    binned_df = df.copy()
    
    # Create age bins
    binned_df[age_column] = pd.cut(binned_df[age_column], bins=bins, labels=labels)
    
    return binned_df

# Example usage:
# combined_10yr_fert_agebins = create_age_bins(combined_10yr_fert)
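
To see how the default bin edges behave, the sketch below applies the same `bins` and `labels` to a handful of ages. Because `pd.cut` uses right-closed intervals by default, age 18 falls in '0-18' and age 90 in '90+', while any age above 90 lies outside the last edge and becomes `NaN`. The sample ages are made up for illustration.

```python
import pandas as pd

bins = [-1, 18, 30, 40, 50, 60, 70, 80, 89, 90]
labels = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-89', '90+']

ages = pd.Series([0, 18, 19, 89, 90, 95])
binned = pd.cut(ages, bins=bins, labels=labels)

# Age 95 is beyond the final bin edge, so it is not assigned a label.
print(binned.tolist())  # ['0-18', '0-18', '19-30', '81-89', '90+', nan]
```

If the data caps age at 90 this has no effect; otherwise a wider final edge would be needed to keep older ages in the '90+' group.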


In [109]:
# def calculate_zscores_and_find_outliers(df, component_columns, handle_inf=True, Geography='borough', z_score_threshold=2, For_population_totals=False, population_analysis_type='cross-sectional'):
#     """
#     Computes z-scores or robust z-scores (depending on distribution) for the respective columns,
#     and returns DataFrames containing outliers for either the component columns (handle_inf=True) 
#     or the percentage change columns (handle_inf=False).

#     Parameters:
#     df (pd.DataFrame): The input DataFrame containing population data component value columns.
#     component_columns (str or list): A single column name or a list of column names to be analysed.
#     handle_inf (bool): If True, uses the component columns to determine outliers. 
#                        If False, uses percentage change columns to determine outliers.
#     Geography (str): Specifies whether to group by 'borough' (using 'gss_code') or 'ward' (using 'gss_code_ward').
#                      Default is 'borough'.
#     For_population_totals (bool): If True, calculates total population sums and percentage changes before proceeding 
#                                   to z-score and outlier analysis.
#     population_analysis_type (str): Specifies whether to do 'cross-sectional' or 'temporal' analysis for population totals.
#                                     Default is 'cross-sectional'.
#     z_score_threshold (float or int): The threshold to consider as an outlier based on the z-score. Default is 2.

#     Returns:
#     dict: A dictionary containing DataFrames with outliers for each respective column based on the z-score threshold.
#     """
    
#     # If a single column name is provided as a string, convert it to a list
#     if isinstance(component_columns, str):
#         component_columns = [component_columns]

#     outliers_dict = {}
#     z_score_type = {}  # Dictionary to store which method was used

#     # Grouping and pivot based on the Geography parameter
#     if Geography == 'borough':
#         geo_column = 'gss_code'
#     elif Geography == 'ward':
#         geo_column = 'gss_code_ward'
#     else:
#         raise ValueError("Geography must be either 'borough' or 'ward'.")

#     # Automatically create pct_change_columns based on component_columns
#     pct_change_columns = [f"{col}_pct_change" for col in component_columns]

#     #Calculate the percentage change for the component columns
#     df[pct_change_columns] = df.groupby([geo_column, 'sex', 'age'])[component_columns].pct_change().abs()
#     #pivoted = df.pivot_table(index=[geo_column, 'sex', 'age', 'year'], columns='component', values='value').reset_index()

#     # If For_population_totals is True, calculate population totals and percentage changes
#     if For_population_totals:
#         # Filter the DataFrame to only keep rows where the 'component' is 'population'
#         # population_df = pivoted[pivoted['component'] == 'population']
        
#         # Pivot the data to get 'population' values by geography, sex, age, and year
#         # pivoted = population_df.pivot_table(index=[geo_column, 'sex', 'age', 'year'], columns='component', values='value').reset_index()
        
#         # Ensure that there's a 'population' column in the resulting DataFrame
#         if 'population' in df.columns:
#             df['population'] = df['population']  # Extract the 'population' column
#         else:
#             raise ValueError("The 'population' column is missing after pivoting.")

#         # Group by geography and year, and sum the population values, create total population dfs for crossectional and temporal analysis change
#         population_sum = df.groupby([geo_column, 'year'])['population'].sum().reset_index()

#         population_sum_time = population_sum.copy()
#         population_sum_cross = population_sum.copy()

#         if population_analysis_type == 'temporal':
#             # Temporally: Calculate the population change over the years for each gss_code or ward
#             population_sum_time['population_pct_change_temporal'] = population_sum_time.groupby(geo_column)['population'].pct_change() * 100
#         elif population_analysis_type == 'cross-sectional':
#             # Cross-sectionally: Compare the population between different gss_code or wards for the same year and compare to the mean
#             population_sum_cross['population_mean'] = population_sum_cross.groupby('year')['population'].transform('mean')
#             population_sum_cross['population_pct_change_cross'] = ((population_sum_cross['population'] - population_sum_cross['population_mean']) / population_sum_cross['population_mean']) * 100
#         else:
#             raise ValueError("population_analysis_type must be either 'cross-sectional' or 'temporal'.")

#     # Now, decide how to determine outliers based on handle_inf (handling the infinite values, i.e. no percentage change value) and pct_change_columns
#     if handle_inf:
#         # Outliers based on component columns value (not percentage change)
#         for comp_col, pct_change_col in zip(component_columns, pct_change_columns):
#             if pct_change_col in df.columns:
#                 # Filter rows where percentage change columns have inf or -inf
#                 df_filtered = df.replace([np.inf, -np.inf], np.nan).dropna(subset=[comp_col, pct_change_col])

#                 if not df_filtered.empty:
#                     # Check if the column is normally distributed using skewness
#                     skewness = df_filtered[comp_col].skew()

#                     if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
#                         df_filtered['z_score'] = stats.zscore(df_filtered[comp_col])
#                         outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
#                         z_score_type[comp_col] = 'Normal Z-Score'
#                         print(f"{comp_col} used Normal Z-Score.")
#                     else:
#                         # Use Robust Z-score (based on median and MAD) for non-normal distribution
#                         median = df_filtered[comp_col].median()
#                         mad = stats.median_abs_deviation(df_filtered[comp_col])
#                         df_filtered['robust_z_score'] = (df_filtered[comp_col] - median) / mad
#                         outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
#                         z_score_type[comp_col] = 'Robust Z-Score'
#                         print(f"{comp_col} used Robust Z-Score.")

#                     # Store the outliers for this component column
#                     outliers_dict[comp_col] = outliers
#                 else:
#                     outliers_dict[comp_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
#             else:
#                 print(f"{comp_col} does not exist in DataFrame")

#     else:
#         # Outliers based on pct_change_columns
#         for pct_change_col in pct_change_columns:
#             # For population totals choose the correct DataFrame (from above) based on the selected column
#             if population_analysis_type == 'temporal' and 'population_pct_change_temporal' in pct_change_col:
#                 df_filtered = population_sum_time  # Use the temporal population data
#             elif population_analysis_type == 'cross-sectional' and 'population_pct_change_cross' in pct_change_col:
#                 df_filtered = population_sum_cross  # Use the cross-sectional population data
#             else:
#                 df_swarm_filtered = df  # Default to the original df if other percentage columns are provided

#             #pct column for component column zscores
#             if pct_change_col in df_filtered.columns:
#                 # Replace inf and -inf with NaN and work on the entire DataFrame after cleaning
#                 df_filtered = df_filtered.replace([np.inf, -np.inf], np.nan).dropna(subset=[pct_change_col])

#                 if not df_filtered.empty:
#                     # Check if the column is normally distributed using skewness
#                     skewness = df_filtered[pct_change_col].skew()

#                     if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
#                         df_filtered['z_score'] = stats.zscore(df_filtered[pct_change_col])
#                         outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
#                         z_score_type[pct_change_col] = 'Normal Z-Score'
#                         print(f"{pct_change_col} used Normal Z-Score.")
#                     else:
#                         # Use Robust Z-score (based on median and MAD) for non-normal distribution
#                         median = df_filtered[pct_change_col].median()
#                         mad = stats.median_abs_deviation(df_filtered[pct_change_col])
#                         df_filtered['robust_z_score'] = (df_filtered[pct_change_col] - median) / mad
#                         outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
#                         z_score_type[pct_change_col] = 'Robust Z-Score'
#                         print(f"{pct_change_col} used Robust Z-Score.")

#                     # Store the outliers for this percentage change column
#                     outliers_dict[pct_change_col] = outliers
#                 else:
#                     outliers_dict[pct_change_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
#             else:
#                 print(f"{pct_change_col} does not exist in DataFrame")

#     # Sort each DataFrame in outliers_dict by 'z_score' or 'robust_z_score' in descending order in one line
#     for key in outliers_dict.keys():
#         outliers_dict[key] = outliers_dict[key].sort_values(by='z_score' if 'z_score' in outliers_dict[key].columns else 'robust_z_score', ascending=False)

#     return outliers_dict  # Returning z_score_type for further use if needed
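
The choice between the two scoring methods can be shown in isolation. The sketch below is a simplified stand-in for the logic in these functions, not a replacement for them: `choose_score` is a hypothetical helper, the 0.5 skewness cut-off is taken from the code above, and the input values are made up.

```python
import pandas as pd
from scipy import stats

def choose_score(values: pd.Series, skew_cutoff: float = 0.5) -> pd.Series:
    """Standard z-scores for roughly symmetric data, robust z-scores otherwise."""
    if abs(values.skew()) < skew_cutoff:
        # Mean and population standard deviation, matching scipy.stats.zscore defaults.
        return (values - values.mean()) / values.std(ddof=0)
    # Median and (unscaled) median absolute deviation.
    median = values.median()
    mad = stats.median_abs_deviation(values)
    return (values - median) / mad

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
skewed = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

print(choose_score(symmetric).round(3).tolist())
print(choose_score(skewed).tolist())  # [-2.0, -1.0, 0.0, 1.0, 97.0]
```

In the skewed series the single large value barely moves the median or the MAD, so it stands out far more clearly than it would under a mean-based score.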


In [110]:
def calculate_zscores_and_find_outliers(df, component_columns, handle_inf=True, Geography='borough', z_score_threshold=2,
                                        For_population_totals=False, population_analysis_type='cross-sectional'):
    """
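    Computes z-scores or robust z-scores for the specified columns and identifies outliers based on the provided criteria.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing population data component value columns.
                       Note: percentage change columns are added to this DataFrame in place.
    component_columns (str or list): A single column name or a list of column names to be analysed.
    handle_inf (bool): If True, uses the component columns to determine outliers.
                       If False, uses percentage change columns to determine outliers.
    Geography (str): 'borough' (groups by 'gss_code') or 'ward' (groups by 'gss_code_ward'). Default is 'borough'.
    z_score_threshold (float or int): Absolute z-score above which a row is treated as an outlier. Default is 2.
    For_population_totals (bool): If True, calculates total population sums and percentage changes first.
    population_analysis_type (str): 'cross-sectional' or 'temporal' analysis for population totals.

    Returns:
    dict: Maps each analysed column name to a DataFrame of its outliers, sorted by z-score in descending order.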
    """

    # Guard clauses to ensure valid inputs
    if not component_columns:
        raise ValueError("component_columns cannot be empty.")
    if z_score_threshold < 0:
        raise ValueError("z_score_threshold must be non-negative.")

    # Convert single column name to a list if necessary
    if isinstance(component_columns, str):
        component_columns = [component_columns]

    outliers_dict = {}
    z_score_type = {}  # Track z-score type for each column

    # Choose geographical grouping based on 'Geography' parameter
    geo_column = 'gss_code' if Geography == 'borough' else 'gss_code_ward' if Geography == 'ward' else None
    if not geo_column:
        raise ValueError("Geography must be either 'borough' or 'ward'.")

    pct_change_columns = [f"{col}_pct_change" for col in component_columns]
    df[pct_change_columns] = df.groupby([geo_column, 'sex', 'age'])[component_columns].pct_change().abs()

    # Helper function: pick the z-score method from the skewness of the column
    def calculate_outliers(column, data):
        skewness = data[column].skew()
        if abs(skewness) < 0.5:  # Roughly symmetric: standard z-score (mean and standard deviation)
            data['z_score'] = stats.zscore(data[column])
            outliers = data[data['z_score'].abs() > z_score_threshold]
            z_score_type[column] = 'Normal Z-Score'
        else:  # Skewed: robust z-score (median and median absolute deviation)
            median, mad = data[column].median(), stats.median_abs_deviation(data[column])
            data['robust_z_score'] = (data[column] - median) / mad
            outliers = data[data['robust_z_score'].abs() > z_score_threshold]
            z_score_type[column] = 'Robust Z-Score'
        return outliers

    if For_population_totals:
        if 'population' not in df.columns:
            raise ValueError("The 'population' column is required for population totals analysis.")
        
        # Process population totals for temporal or cross-sectional analysis
        population_sum = df.groupby([geo_column, 'year'])['population'].sum().reset_index()
        population_sum_time, population_sum_cross = population_sum.copy(), population_sum.copy()
        
        if population_analysis_type == 'temporal':
            population_sum_time['population_pct_change_temporal'] = population_sum_time.groupby(geo_column)['population'].pct_change().abs()
            population_sum_time = population_sum_time.sort_values(by='population_pct_change_temporal', ascending=False)
            print(population_sum_time)
        elif population_analysis_type == 'cross-sectional':
            population_sum_cross['population_mean'] = population_sum_cross.groupby('year')['population'].transform('mean')
            population_sum_cross['population_pct_change_cross'] = ((population_sum_cross['population'] - population_sum_cross['population_mean']) / 
                                                             population_sum_cross['population_mean']) * 100
        else:
            raise ValueError("population_analysis_type must be either 'cross-sectional' or 'temporal'.")

    # Outlier detection based on handle_inf flag
    if handle_inf:
        # Based on component columns
        for comp_col in component_columns:
            if comp_col in df.columns:
                df_filtered = df.replace([np.inf, -np.inf], np.nan).dropna(subset=[comp_col])
                if not df_filtered.empty:
                    outliers_dict[comp_col] = calculate_outliers(comp_col, df_filtered)
    else:
        # Based on pct_change columns
        for pct_change_col in pct_change_columns:
            df_filtered = (population_sum_time if population_analysis_type == 'temporal' and 'population_pct_change_temporal' in pct_change_col
                           else population_sum_cross if population_analysis_type == 'cross-sectional' and 'population_pct_change_cross' in pct_change_col
                           else df.replace([np.inf, -np.inf], np.nan).dropna(subset=[pct_change_col]))

            if not df_filtered.empty:
                outliers_dict[pct_change_col] = calculate_outliers(pct_change_col, df_filtered)

    # Sort each DataFrame in outliers_dict by the z-score type
    for key, df_outliers in outliers_dict.items():
        z_type_column = 'z_score' if 'z_score' in df_outliers.columns else 'robust_z_score'
        outliers_dict[key] = df_outliers.sort_values(by=z_type_column, ascending=False)

    return outliers_dict
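
One edge case worth knowing about in the robust z-score is a median absolute deviation (MAD) of zero, which happens when more than half of the values are identical and turns the division into `inf` or `NaN`. The sketch below shows one possible guard. It is an optional variation, not what the function above does: `robust_z` is a hypothetical helper, returning `NaN` is only one of several reasonable choices, and `scale="normal"` (a documented option of `scipy.stats.median_abs_deviation`) rescales the MAD so that scores are comparable with standard z-scores, which also changes which rows cross a given threshold.

```python
import numpy as np
import pandas as pd
from scipy import stats

def robust_z(values: pd.Series) -> pd.Series:
    """Robust z-score with a normal-consistent MAD and a guard for MAD == 0."""
    median = values.median()
    mad = stats.median_abs_deviation(values, scale="normal")
    if mad == 0:
        # No spread around the median: scores are undefined, so return NaN.
        return pd.Series(np.nan, index=values.index)
    return (values - median) / mad

mostly_zero = pd.Series([0.0, 0.0, 0.0, 0.0, 5.0])  # MAD is 0 here
spread = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

print(robust_z(mostly_zero).isna().all())  # True
print(robust_z(spread).round(2).tolist())
```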

In [111]:
# def calculate_zscores_and_find_outliers_percentage_change(df, component_columns, Geography='ward', percentage_change_threshold=0.05, For_population_totals=False, population_analysis_type='cross-sectional'):
#     """
#     Computes outliers based on absolute percentage change for the specified columns, this does not work for change from 0 values as this produces an 
#     infinite values as the percentage change. Instead these are handle using the function calculate_zscores_and_find_outliers_value.
#     Outliers are detected using a percentage change threshold instead of z-scores.
#     Optionally includes population totals analysis for either temporal or cross-sectional data.
#     """
#     # Guard clauses to ensure valid inputs
#     if not component_columns:
#         raise ValueError("component_columns cannot be empty.")
#     if percentage_change_threshold < 0:
#         raise ValueError("percentage_change_threshold must be non-negative.")
#     if For_population_totals and 'population' not in df.columns:
#         raise ValueError("The 'population' column is required for population totals analysis.")

#     # Convert single column name to a list if necessary
#     if isinstance(component_columns, str):
#         component_columns = [component_columns]

#     outliers_dict = {}

#     # Choose geographical grouping based on 'Geography' parameter
#     geo_column = 'gss_code' if Geography == 'borough' else 'gss_code_ward' if Geography == 'ward' else None
#     if not geo_column:
#         raise ValueError("Geography must be either 'borough' or 'ward'.")

#     # Calculate absolute percentage change for each component column
#     percentage_change_columns = [f"{col}_percentage_change" for col in component_columns]
#     df[percentage_change_columns] = df.groupby([geo_column, 'sex', 'age'])[component_columns].pct_change().abs()

#     # If For_population_totals is True, calculate population-based percentage change
#     if For_population_totals:
#         population_sum = df.groupby([geo_column, 'year'])['population'].sum().reset_index()
#         population_sum_time, population_sum_cross = population_sum.copy(), population_sum.copy()

#         if population_analysis_type == 'temporal':
#             population_sum_time['population_pct_change_temporal'] = population_sum_time.groupby(geo_column)['population'].pct_change().abs() * 100
#         elif population_analysis_type == 'cross-sectional':
#             population_sum_cross['population_mean'] = population_sum_cross.groupby('year')['population'].transform('mean')
#             population_sum_cross['population_pct_change_cross'] = ((population_sum_cross['population'] - population_sum_cross['population_mean']).abs() / population_sum_cross['population_mean']) * 100
#         else:
#             raise ValueError("population_analysis_type must be either 'cross-sectional' or 'temporal'.")

#     # Outlier detection based on absolute percentage change threshold
#     for pct_change_col in percentage_change_columns:
#         df_filtered = df.dropna(subset=[pct_change_col])  # Dropping NaNs before analysis

#         if not df_filtered.empty:
#             # Detect outliers based on the absolute percentage change threshold
#             outliers = df_filtered[df_filtered[pct_change_col] > percentage_change_threshold]

#             if not outliers.empty:
#                 outliers_dict[pct_change_col] = outliers

#     # If For_population_totals is True, include population-based outliers
#     if For_population_totals:
#         # Process population percentage change columns
#         population_pct_change_columns = []
#         if population_analysis_type == 'temporal':
#             population_pct_change_columns.append('population_pct_change_temporal')
#         elif population_analysis_type == 'cross-sectional':
#             population_pct_change_columns.append('population_pct_change_cross')

#         for population_pct_change_col in population_pct_change_columns:
#             df_filtered = df.dropna(subset=[population_pct_change_col])  # Dropping NaNs before analysis

#             if not df_filtered.empty:
#                 # Detect population outliers based on the absolute percentage change threshold
#                 outliers = df_filtered[df_filtered[population_pct_change_col] > percentage_change_threshold]

#                 if not outliers.empty:
#                     outliers_dict[population_pct_change_col] = outliers

#     # Sort each DataFrame in outliers_dict by percentage change
#     for key, df_outliers in outliers_dict.items():
#         outliers_dict[key] = df_outliers.sort_values(by=key, ascending=False)

#     return outliers_dict
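
The reason infinite values need separate handling is that a percentage change from 0 is undefined and pandas reports it as `inf`. The sketch below shows this on a made-up ward series and then applies a 5% threshold after replacing the infinite value. The ward label and birth counts are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gss_code_ward": ["ward_A"] * 4,
    "year": [2021, 2022, 2023, 2024],
    "births": [0.0, 4.0, 5.0, 5.1],
})

toy["births_pct_change"] = toy.groupby("gss_code_ward")["births"].pct_change().abs()

# The step from 0 to 4 produces an infinite percentage change.
has_inf = np.isinf(toy["births_pct_change"]).any()

# Replace by assignment so the column is updated regardless of pandas copy semantics.
toy["births_pct_change"] = toy["births_pct_change"].replace([np.inf, -np.inf], np.nan)

# Comparisons with NaN are False, so only finite changes above 5% are flagged.
flagged = toy[toy["births_pct_change"] > 0.05]
print(flagged)
```

Only the 2023 row (a 25% change) is flagged; the 2022 row is left for the value-based check because its percentage change was infinite.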




In [215]:
def calculate_zscores_and_find_outliers_percentage_change(
    df, component_columns, Geography='ward', percentage_change_threshold=0.05, 
    For_population_totals=False, population_analysis_type='cross-sectional'
):
    """
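    Computes outliers based on absolute percentage change for the specified columns, excluding infinite values.
    Outliers are detected using a percentage change threshold instead of z-scores.
    Optionally includes population totals analysis for either temporal or cross-sectional data.

    Returns:
    pd.DataFrame: If For_population_totals is True, the population totals whose percentage change exceeds the threshold.
    dict: Otherwise, a mapping from each percentage change column to a DataFrame of its outliers.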
    """
    # Guard clauses to ensure valid inputs
    if not component_columns:
        raise ValueError("component_columns cannot be empty.")
    if percentage_change_threshold < 0:
        raise ValueError("percentage_change_threshold must be non-negative.")
    if For_population_totals and 'population' not in df.columns:
        raise ValueError("The 'population' column is required for population totals analysis.")

    # Convert single column name to a list if necessary
    if isinstance(component_columns, str):
        component_columns = [component_columns]

    outliers_dict = {}

    # Choose geographical grouping based on 'Geography' parameter
    geo_column = 'gss_code' if Geography == 'borough' else 'gss_code_ward' if Geography == 'ward' else None
    if not geo_column:
        raise ValueError("Geography must be either 'borough' or 'ward'.")
    

    # Calculate absolute percentage change for each component column
    percentage_change_columns = [f"{col}_pct_change" for col in component_columns]
    df[percentage_change_columns] = df.groupby([geo_column, 'sex', 'age'])[component_columns].pct_change().abs()

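    # Exclude infinite values from the percentage change columns.
    # Assign the result back rather than using inplace=True on a column selection,
    # which may not update df under pandas copy-on-write.
    for col in percentage_change_columns:
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)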

    for pct_change_col in percentage_change_columns:
        df_filtered = df.dropna(subset=[pct_change_col])  # Dropping NaNs before analysis

        if not df_filtered.empty:
            # Detect outliers based on the absolute percentage change threshold
            outliers = df_filtered[df_filtered[pct_change_col] > percentage_change_threshold]

            if not outliers.empty:
                outliers_dict[pct_change_col] = outliers

    # If For_population_totals is True, calculate population-based percentage change
    if For_population_totals:
        population_sum = df.groupby([geo_column, 'year'])['population'].sum().reset_index()
        population_sum_time, population_sum_cross = population_sum.copy(), population_sum.copy()
     

        if population_analysis_type == 'temporal':
            # Temporally: Calculate the population change over the years for each gss_code or ward
            population_sum_time['population_pct_change_temporal'] = population_sum_time.groupby(geo_column)['population'].pct_change().abs()
            print('df_time_pct', population_sum_time)
            df_outliers = population_sum_time[population_sum_time['population_pct_change_temporal'] > percentage_change_threshold]
            # Sort descending so the largest changes come first
            df_outliers = df_outliers.sort_values(by='population_pct_change_temporal', ascending=False)
            print('df_time_threshold', df_outliers)
        elif population_analysis_type == 'cross-sectional':
            # Cross-sectionally: Compare the population between different gss_code or wards for the same year and compare to the mean
            population_sum_cross['population_mean_for_year'] = population_sum_cross.groupby('year')['population'].transform('mean')
            population_sum_cross['population_pct_change_cross'] = ((population_sum_cross['population'] - population_sum_cross['population_mean_for_year']).abs() / population_sum_cross['population_mean_for_year']) 
            print('df_cross_pct', population_sum_cross)
            df_outliers = population_sum_cross[population_sum_cross['population_pct_change_cross'] > percentage_change_threshold]
            # Sort descending so the largest differences come first
            df_outliers = df_outliers.sort_values(by='population_pct_change_cross', ascending=False)
            print('df_cross_threshold', df_outliers)
        else:
            raise ValueError("population_analysis_type must be either 'cross-sectional' or 'temporal'.")
        
        return df_outliers

    # Sort each DataFrame in outliers_dict by percentage change

    for key, df_outliers in outliers_dict.items():
        outliers_dict[key] = df_outliers.sort_values(by=key, ascending=False)

    return outliers_dict

def calculate_zscores_and_find_outliers_value(df, component_columns, Geography='borough', z_score_threshold=2):
    """
    Computes z-scores or robust z-scores for the specified columns and identifies outliers.
    This function handles cases where the percentage change is infinite, such as where values change from 0 or to 1.
    Percentage change is not applicable in these cases, so the values are used directly.
   
    """
    # Guard clauses to ensure valid inputs
    if not component_columns:
        raise ValueError("component_columns cannot be empty.")
    if z_score_threshold < 0:
        raise ValueError("z_score_threshold must be non-negative.")

    # Convert single column name to a list if necessary
    if isinstance(component_columns, str):
        component_columns = [component_columns]

    outliers_dict = {}
    z_score_type = {}  # Track z-score type for each column

    # Choose geographical grouping based on 'Geography' parameter
    geo_column = 'gss_code' if Geography == 'borough' else 'gss_code_ward' if Geography == 'ward' else None
    if not geo_column:
        raise ValueError("Geography must be either 'borough' or 'ward'.")

    percentage_change_columns = [f"{col}_pct_change" for col in component_columns]
    df[percentage_change_columns] = df.groupby([geo_column, 'sex', 'age'])[component_columns].pct_change().abs()

    # Helper function for z-score calculation based on skewness
    def calculate_outliers(column, data):
        skewness = data[column].skew()
        if abs(skewness) < 0.5:  # Normal distribution
            data['z_score'] = stats.zscore(data[column])
            outliers = data[data['z_score'].abs() > z_score_threshold]
            z_score_type[column] = 'Normal Z-Score'
        else:  # Non-normal distribution: Robust Z-score
            median, mad = data[column].median(), stats.median_abs_deviation(data[column])
            data['robust_z_score'] = (data[column] - median) / mad
            outliers = data[data['robust_z_score'].abs() > z_score_threshold]
            z_score_type[column] = 'Robust Z-Score'
        return outliers

    # Outlier detection based on handling inf
    for comp_col in component_columns:
        if comp_col in df.columns:
            df_filtered = df.replace([np.inf, -np.inf], np.nan).dropna(subset=[comp_col]) #replacing inf values with NaN before proceeding
            if not df_filtered.empty:
                outliers_dict[comp_col] = calculate_outliers(comp_col, df_filtered)

    # Sort each DataFrame in outliers_dict by the z-score type
    for key, df_outliers in outliers_dict.items():
        z_type_column = 'z_score' if 'z_score' in df_outliers.columns else 'robust_z_score'
        outliers_dict[key] = df_outliers.sort_values(by=z_type_column, ascending=False)

    return outliers_dict
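
The choice between a standard and a robust z-score can be seen on a small sample. The sketch below is an assumption-laden illustration rather than the function above: `score_by_skewness` is a hypothetical helper, it uses pandas only, and it adds a guard for a zero MAD that the original helper does not have.

```python
import numpy as np
import pandas as pd

def score_by_skewness(values: pd.Series, skew_limit: float = 0.5) -> pd.DataFrame:
    """Score values with a z-score if roughly symmetric, else a robust z-score."""
    values = values.replace([np.inf, -np.inf], np.nan).dropna()
    if abs(values.skew()) < skew_limit:
        kind = "z_score"
        score = (values - values.mean()) / values.std(ddof=0)
    else:
        kind = "robust_z_score"
        median = values.median()
        mad = (values - median).abs().median()
        # Guard against a zero MAD, which would otherwise give inf or NaN scores
        score = (values - median) / mad if mad != 0 else values * 0.0
    return pd.DataFrame({"value": values, "score": score, "kind": kind})

# Hypothetical right-skewed sample: one value far above the rest
sample = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])
scored = score_by_skewness(sample)
flagged = scored[scored["score"].abs() > 3]
```

Because the sample is strongly skewed, the median and MAD are used, so the single extreme value does not inflate the scale it is measured against.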

In [216]:
def robust_z_score(series):
    """Calculate robust z-score using median and MAD."""
    median = series.median()
    mad = (series - median).abs().median()
    return 0.6745 * (series - median) / mad if mad != 0 else np.zeros_like(series)

def gender_outliers(df, component_columns, geography='borough', outlier_std={'births': 2, 'deaths': 5, 'netflow': 2, 'population': 5}):
    """
    Processes gender data for either wards or boroughs, and calculates outliers for specified components.
    Uses either z-score or robust z-score based on skewness.

    Parameters:
    - df: pandas DataFrame containing the raw data
    - component_columns: list or single component column name(s) for which ratios and outliers need to be calculated
    - geography: str, either 'ward' or 'borough', default is 'borough'
    - outlier_std: dict specifying how many standard deviations to use for each component's threshold calculation.
    
    Returns:
    - outliers_dict: dictionary of outlier DataFrames for each component
    """

    # Check geography type and set index columns accordingly
    if geography == 'ward':
        geo_col = 'gss_code_ward'
    else:
        geo_col = 'gss_code'
    
    # Step 1: Create the pivot table
    gender_pivot = df.pivot_table(
        index=[geo_col, 'year', 'age', 'component'],  # Geography column and other grouping columns
        columns='sex',                                # Columns for sex (male, female)
        values='value',                               # Values (count of males and females)
        aggfunc='sum'                                 # Aggregation function (sum)
    ).reset_index()
    
    # Step 2: Calculate the ratio of females to males
    gender_pivot['ratio_female_to_male'] = gender_pivot['female'] / gender_pivot['male']
    
    # Handle division by zero: treat infinite ratios as missing values.
    # Assigning the result back avoids the pandas chained-assignment warning.
    gender_pivot['ratio_female_to_male'] = gender_pivot['ratio_female_to_male'].replace([np.inf, -np.inf], np.nan)
    
    # Step 3: Pivot again to spread component values into columns
    gender_pivot = gender_pivot.pivot(
        index=[geo_col, 'year', 'age'], 
        columns='component', 
        values='ratio_female_to_male'
    ).reset_index()
    
    # Step 4: Convert the specified component columns to numeric
    for component in component_columns:
        gender_pivot[component] = pd.to_numeric(gender_pivot[component], errors='coerce')
    
    # Step 5: Detect skewness and calculate either z-score or robust z-score for each component
    outliers_dict = {}
    for component in component_columns:
        component_data = gender_pivot[component].dropna()

        # Calculate skewness
        component_skewness = skew(component_data)
        
        # Decide whether to use z-score or robust z-score
        if abs(component_skewness) > 0.5:
            # Use robust z-score if skewness is high
            z_column = 'robust_z_score'
            gender_pivot[z_column] = robust_z_score(gender_pivot[component])
        else:
            # Use standard z-score if skewness is low
            z_column = 'z_score'
            gender_pivot[z_column] = zscore(gender_pivot[component], nan_policy='omit')
        
        # Set the outlier threshold based on user input (default 2, 5, etc. depending on the component)
        threshold = outlier_std.get(component, 2)
        
        # Step 6: Identify outliers where absolute z-score is greater than the threshold
        outliers = gender_pivot[gender_pivot[z_column].abs() > threshold]
        
        # Step 7: Keep only relevant columns in the output (geo_col, year, age, z_score, and the component itself)
        outliers = outliers[[geo_col, 'year', 'age', component, z_column]]
        
        # Store the outliers DataFrame in the outliers dictionary
        outliers_dict[component] = outliers.reset_index(drop=True)
    
    return outliers_dict
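
The first two steps of `gender_outliers` (pivoting sex into columns and taking the female-to-male ratio) can be tried on a few rows. This is a minimal sketch with invented codes and values, assuming the same long format as the projection data.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format rows in the same shape as the projection data
rows = pd.DataFrame({
    "gss_code": ["E1"] * 4 + ["E2"] * 2,
    "year": [2020] * 6,
    "age": [0, 0, 1, 1, 0, 0],
    "sex": ["female", "male", "female", "male", "female", "male"],
    "component": ["population"] * 6,
    "value": [50.0, 100.0, 30.0, 0.0, 80.0, 80.0],
})

# Spread sex into columns, summing any repeated rows
wide = rows.pivot_table(
    index=["gss_code", "year", "age", "component"],
    columns="sex", values="value", aggfunc="sum",
).reset_index()

# Female-to-male ratio; a male count of zero yields inf, which is treated as missing
ratio = wide["female"] / wide["male"]
wide["ratio_female_to_male"] = ratio.replace([np.inf, -np.inf], np.nan)
```

Rows with a zero male count end up as NaN and are therefore excluded from the skewness and z-score calculations that follow.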



In [217]:
def tally_outliers(outlier_dfs: Dict[str, pd.DataFrame], 
                   tally_by_age: bool = False, 
                   geography: str = 'ward') -> pd.DataFrame:
    """
    Tally occurrences of the same 'gss_code_ward' or 'gss_code', 
    'year', and optionally 'age' across multiple DataFrames, 
    and merge with the total outlier score from 'robust_z_score' or 'z_score'.

    Parameters:
    ----------
    outlier_dfs : Dict[str, pd.DataFrame]
        A dictionary where keys are DataFrame names and values are the DataFrames themselves.
        Each DataFrame should contain 'gss_code_ward' or 'gss_code', 'year', and optionally 'age' columns.

    tally_by_age : bool, optional
        If True, the function will tally occurrences based on 'gss_code_ward' or 'gss_code', 'year', and 'age'.
        If False, it will tally based only on 'gss_code_ward' or 'gss_code' and 'year'. Default is False.

    geography : str, optional
        Specifies the geographical code to use. Set to 'borough' to use 'gss_code' 
        or 'ward' to use 'gss_code_ward'. Default is 'ward'.

    Returns:
    -------
    pd.DataFrame
        A DataFrame with counts of occurrences for each combination of 
        'gss_code_ward' or 'gss_code', 'year', and 'age' (if applicable), 
        along with a total count across all DataFrames and the total outlier score.
    """
    
    # Validate input
    if not isinstance(outlier_dfs, dict):
        raise ValueError("Expected 'outlier_dfs' to be a dictionary of DataFrames.")
    
    if geography not in ['borough', 'ward']:
        raise ValueError("geography must be either 'borough' or 'ward'.")

    # Initialise an empty DataFrame for the combined tally
    combined_tally = pd.DataFrame()

    # Define the geography column based on the geography argument
    geo_column = 'gss_code' if geography == 'borough' else 'gss_code_ward'

    # Loop through each DataFrame in the outlier_dfs dictionary
    for df_name, df in outlier_dfs.items():
        if not isinstance(df, pd.DataFrame):
            raise ValueError(f"Expected '{df_name}' to be a pandas DataFrame.")
        
        # Ensure 'age' is consistently typed if tally_by_age is True
        if tally_by_age and 'age' in df.columns:
            df['age'] = df['age'].astype(str)

        # Select the required columns based on the tally_by_age flag
        group_cols = [geo_column, 'year'] + (['age'] if tally_by_age and 'age' in df.columns else [])

        # Ensure selected columns exist in the DataFrame
        missing_cols = [col for col in group_cols if col not in df.columns]
        if missing_cols:
            raise KeyError(f"The DataFrame '{df_name}' is missing columns: {missing_cols}")

        # Remove duplicate rows based on the selected grouping columns
        subset = df[group_cols].drop_duplicates()
        
        # Add a count column for tallying occurrences
        subset['count'] = 1

        # Aggregate the counts by the grouping columns
        grouped = subset.groupby(group_cols).count().reset_index()

        # Rename the count column to indicate which DataFrame it came from
        grouped.rename(columns={'count': df_name}, inplace=True)

        # Merge the current tally with the combined tally
        if combined_tally.empty:
            combined_tally = grouped
        else:
            combined_tally = pd.merge(combined_tally, grouped, on=group_cols, how='outer')

    # Fill NaN values with 0 (indicating no occurrences in that DataFrame)
    combined_tally.fillna(0, inplace=True)

    # Calculate the total tally across all DataFrame columns
    tally_columns = combined_tally.columns.difference(group_cols)
    combined_tally['total'] = combined_tally[tally_columns].sum(axis=1)

    # Sort the DataFrame by the total count in descending order
    combined_tally.sort_values(by='total', ascending=False, inplace=True)

    # --- New logic to handle multiple score column names ---

    # Concatenate all the DataFrames in the outlier_dfs dictionary
    combined_df = pd.concat(outlier_dfs.values())

    # Build a unified 'outlier_score' column. The concatenated DataFrames may carry either
    # score type, so use 'robust_z_score' where present and fall back to 'z_score' row by row.
    score_columns = [col for col in ('robust_z_score', 'z_score') if col in combined_df.columns]
    if not score_columns:
        raise ValueError("None of the DataFrames contain 'robust_z_score' or 'z_score'.")
    combined_df['outlier_score'] = combined_df[score_columns].bfill(axis=1).iloc[:, 0]

    # Group by 'gss_code_ward' (or 'gss_code'), 'year', and 'age', and sum the outlier scores
    group_outliers = combined_df.groupby(
        [geo_column, 'year'] + (['age'] if tally_by_age and 'age' in combined_df.columns else []),
        as_index=False).agg(total_outlier_score=('outlier_score', lambda x: x.abs().sum())) # use abs() so that both negative and positive outliers contribute their magnitude

    # Merge the total outlier score with the combined_tally DataFrame
    final_df = pd.merge(
        combined_tally, group_outliers, 
        on=[geo_column, 'year'] + (['age'] if tally_by_age and 'age' in group_outliers.columns else []), 
        how='left'
    )

    return final_df
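
The tallying step can be sketched on two small tables. This is a simplified illustration of the idea, not a replacement for `tally_outliers`: the component names and ward codes are invented, and the score aggregation is left out.

```python
import pandas as pd

# Hypothetical outlier tables, one per component
outlier_dfs = {
    "births": pd.DataFrame({"gss_code_ward": ["W1", "W2"], "year": [2020, 2020]}),
    "deaths": pd.DataFrame({"gss_code_ward": ["W1", "W1"], "year": [2020, 2021]}),
}
keys = ["gss_code_ward", "year"]

tally = None
for name, frame in outlier_dfs.items():
    # One flag per (ward, year) that appears in this table
    flags = frame[keys].drop_duplicates().assign(**{name: 1})
    tally = flags if tally is None else tally.merge(flags, on=keys, how="outer")

# A missing flag means "not an outlier in that table"
names = list(outlier_dfs)
tally[names] = tally[names].fillna(0).astype(int)
tally["total"] = tally[names].sum(axis=1)
tally = tally.sort_values("total", ascending=False).reset_index(drop=True)
```

The outer merge keeps every (ward, year) pair that is flagged in at least one table, so the `total` column counts how many components flag the same place and year.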


In [218]:
def find_ahs_outliers_absolute_value(df, z_threshold=3):
    """
    Identifies outliers in the average household size (ahs) dataframe based on z-scores and returns a DataFrame with additional context.
    
    Args:
    - df (pd.DataFrame): DataFrame with household size data. First column should be 'gss_code_ward'.
    - z_threshold (float): Threshold for z-scores to determine outliers. Default is 3.
    
    Returns:
    - pd.DataFrame: DataFrame containing gss_code_ward, year, z_score, outlier value, year_before, year_after,
                    average_for_year (across wards), and average_for_ward (across years).
    """
    
    # Step 1: Calculate z-scores for each year column (excluding 'gss_code_ward')
    z_scores = df.iloc[:, 1:].apply(zscore)

    # Step 2: Identify the outliers (z-scores > z_threshold or z-scores < -z_threshold)
    outliers = z_scores[(z_scores > z_threshold) | (z_scores < -z_threshold)]

    # Step 3: Combine 'gss_code_ward' with outliers to present them in a readable format
    outliers_cleaned = pd.concat([df['gss_code_ward'], outliers], axis=1)

    # Step 4: Drop rows with no outliers
    outliers_cleaned = outliers_cleaned.dropna(how='all', subset=outliers_cleaned.columns[1:])

    # Step 5: Melt the DataFrame to have 'gss_code_ward', 'year', and 'z_score' columns
    outliers_melted = outliers_cleaned.melt(id_vars='gss_code_ward', var_name='year', value_name='z_score')

    # Step 6: Drop NaN values (non-outliers) and sort by z_score in descending order
    outliers_sorted = outliers_melted.dropna().sort_values(by='z_score', ascending=False)

    # Convert the 'year' column to integer for easier manipulation
    outliers_sorted['year'] = outliers_sorted['year'].astype(int)

    # Step 7: Create new columns for the value of AHS for outlier year, previous year, and next year
    outliers_sorted['outlier_value'] = outliers_sorted.apply(
        lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], str(row['year'])].values[0], axis=1)

    outliers_sorted['year_before_value'] = outliers_sorted.apply(
        lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], str(row['year'] - 1)].values[0]
        if str(row['year'] - 1) in df.columns else None, axis=1)

    outliers_sorted['year_after_value'] = outliers_sorted.apply(
        lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], str(row['year'] + 1)].values[0]
        if str(row['year'] + 1) in df.columns else None, axis=1)

    # Step 8: Calculate the average for that year across all gss_code_wards
    outliers_sorted['average_for_year'] = outliers_sorted.apply(
        lambda row: df[str(row['year'])].mean(), axis=1)

    # Step 9: Calculate the average for that gss_code_ward across all years
    outliers_sorted['average_for_ward'] = outliers_sorted.apply(
        lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], df.columns[1:]].mean(axis=1).values[0], axis=1)

    # Step 10: Create the final outlier_sorted_context DataFrame
    outlier_sorted_context = outliers_sorted[['gss_code_ward', 'year', 'z_score', 'outlier_value', 'year_before_value', 'year_after_value', 'average_for_year', 'average_for_ward']]

    return outlier_sorted_context

# Example usage:
#outlier_sorted_context = find_ahs_outliers_absolute_value(average_household_size, z_threshold=3)

# If you want to use a different threshold, e.g., z-score threshold of 2.5:
#outlier_sorted_context = find_ahs_outliers_absolute_value(average_household_size, z_threshold=2.5)

# Display the results
#print(outlier_sorted_context)
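
The core of the wide-to-long outlier search (z-score each year column, mask the non-outliers, then melt) can be shown on a five-ward table. The data and the 1.8 threshold are invented for illustration; with so few wards a threshold of 3 could never be reached.

```python
import pandas as pd

# Hypothetical wide table: one row per ward, one column per year
ahs = pd.DataFrame({
    "gss_code_ward": ["W1", "W2", "W3", "W4", "W5"],
    "2020": [2.4, 2.5, 2.6, 2.5, 4.0],
    "2021": [2.4, 2.5, 2.6, 2.5, 2.5],
})
year_cols = ahs.columns[1:]

# Z-score each year column across wards (population standard deviation, ddof=0)
z = (ahs[year_cols] - ahs[year_cols].mean()) / ahs[year_cols].std(ddof=0)

# Keep only scores beyond the threshold, then reshape to one row per (ward, year)
z_threshold = 1.8  # assumed low threshold so the small sample flags something
flagged = (
    pd.concat([ahs["gss_code_ward"], z.where(z.abs() > z_threshold)], axis=1)
    .melt(id_vars="gss_code_ward", var_name="year", value_name="z_score")
    .dropna()
)
```

After the melt, the `year` column holds the original column labels as strings, which is why the function above converts it with `astype(int)` before looking up neighbouring years.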


In [219]:
def find_ahs_outliers_pct_change(df, pct_change_threshold=0.3):
        """
        Identifies outliers in the average household size (ahs) dataframe based on percentage change and returns a DataFrame with additional context.
        
        Args:
        - df (pd.DataFrame): DataFrame with household size data. First column should be 'gss_code_ward'.
        - pct_change_threshold (float): Threshold for percentage change to determine outliers. Default is 0.3 (30%).
        
        Returns:
        - pd.DataFrame: DataFrame containing gss_code_ward, year, pct_change, outlier value, year_before, year_after,
                        average_for_year (across wards), and average_for_ward (across years).
        """
        
        # Step 1: Calculate percentage change for each year column (excluding 'gss_code_ward')
        pct_changes = df.iloc[:, 1:].pct_change(axis=1)
        
        # Step 2: Identify the outliers (percentage change > pct_change_threshold or < -pct_change_threshold)
        outliers = pct_changes[(pct_changes > pct_change_threshold) | (pct_changes < -pct_change_threshold)]
  
        # Step 3: Combine 'gss_code_ward' with outliers to present them in a readable format
        outliers_cleaned = pd.concat([df['gss_code_ward'], outliers], axis=1)

        # Step 4: Drop rows with no outliers
        outliers_cleaned = outliers_cleaned.dropna(how='all', subset=outliers_cleaned.columns[1:])
        
        # Step 5: Melt the DataFrame to have 'gss_code_ward', 'year', and 'pct_change' columns
        outliers_melted = outliers_cleaned.melt(id_vars='gss_code_ward', var_name='year', value_name='pct_change')

        # Step 6: Drop NaN values (non-outliers) and sort by pct_change in descending order
        outliers_sorted = outliers_melted.dropna().sort_values(by='pct_change', ascending=False)

        # Convert the 'year' column to integer for easier manipulation
        outliers_sorted['year'] = outliers_sorted['year'].astype(int)

        # Step 7: Create new columns for the value of AHS for outlier year, previous year, and next year
        outliers_sorted['outlier_value'] = outliers_sorted.apply(
            lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], str(row['year'])].values[0], axis=1)

        outliers_sorted['year_before_value'] = outliers_sorted.apply(
            lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], str(row['year'] - 1)].values[0]
            if str(row['year'] - 1) in df.columns else None, axis=1)

        outliers_sorted['year_after_value'] = outliers_sorted.apply(
            lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], str(row['year'] + 1)].values[0]
            if str(row['year'] + 1) in df.columns else None, axis=1)

        # Step 8: Calculate the average for that year across all gss_code_wards
        outliers_sorted['average_for_year'] = outliers_sorted.apply(
            lambda row: df[str(row['year'])].mean(), axis=1)

        # Step 9: Calculate the average for that gss_code_ward across all years
        outliers_sorted['average_for_ward'] = outliers_sorted.apply(
            lambda row: df.loc[df['gss_code_ward'] == row['gss_code_ward'], df.columns[1:]].mean(axis=1).values[0], axis=1)

        # Step 10: Create the final outlier_sorted_context DataFrame
        outlier_sorted_context = outliers_sorted[['gss_code_ward', 'year', 'pct_change', 'outlier_value', 'year_before_value', 'year_after_value', 'average_for_year', 'average_for_ward']]

        return outlier_sorted_context

    # Example usage:
    # outlier_sorted_context = find_ahs_outliers_pct_change(average_household_size, pct_change_threshold=0.2)

    # If you want to use a different threshold, e.g., percentage change threshold of 15%:
    # outlier_sorted_context = find_ahs_outliers_pct_change(average_household_size, pct_change_threshold=0.15)

    # Display the results
    # print(outlier_sorted_context)

## Process data

---

In [220]:
combined_10yr_fert

Unnamed: 0,gss_code,la_name,gss_code_ward,ward_name,age,sex,year,value,component
0,E09000001,City of London,E09000001,City of London,0,male,2012,24.000,births
1,E09000002,Barking and Dagenham,E05014053,Abbey,0,male,2012,50.489,births
2,E09000002,Barking and Dagenham,E05014054,Alibon,0,male,2012,94.217,births
3,E09000002,Barking and Dagenham,E05014055,Barking Riverside,0,male,2012,80.720,births
4,E09000002,Barking and Dagenham,E05014056,Beam,0,male,2012,61.713,births
...,...,...,...,...,...,...,...,...,...
14656715,E09000033,Westminster,E05013805,Regent's Park,90,female,2050,-4.960,netflow
14656716,E09000033,Westminster,E05013806,St James's,90,female,2050,-1.935,netflow
14656717,E09000033,Westminster,E05013807,Vincent Square,90,female,2050,-2.918,netflow
14656718,E09000033,Westminster,E05013808,West End,90,female,2050,-1.486,netflow


In [221]:
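#split ward and borough dataframes
#NOTE: both filters below are identical (rows where gss_code_ward is not missing), so the two dataframes hold the same rows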
combined_10yr_fert_ward = combined_10yr_fert[~combined_10yr_fert['gss_code_ward'].isna()]
combined_10yr_fert_borough = combined_10yr_fert[~combined_10yr_fert['gss_code_ward'].isna()]

In [222]:
#bin ages
combined_10yr_fert_agebins = create_age_bins(combined_10yr_fert)

In [223]:
#separate components into columns
combined_10yr_fert_agebins_component_columns = combined_10yr_fert_agebins.pivot_table(index=['gss_code','gss_code_ward','sex', 'age','year'], columns='component', values='value').reset_index()
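
One detail of this pivot is worth checking: `pivot_table` aggregates with the mean unless `aggfunc` is given. The sketch below uses invented rows and assumes that `create_age_bins` only relabels ages and leaves one row per single year of age; if it already aggregates, the distinction does not arise.

```python
import pandas as pd

# Hypothetical rows: two single ages that fall into the same age bin
binned = pd.DataFrame({
    "gss_code_ward": ["W1", "W1"],
    "age": ["0-18", "0-18"],
    "year": [2020, 2020],
    "component": ["population", "population"],
    "value": [10.0, 30.0],
})
index_cols = ["gss_code_ward", "age", "year"]

# Default aggregation is the mean of the rows sharing an index
as_mean = binned.pivot_table(
    index=index_cols, columns="component", values="value"
).reset_index()

# Passing aggfunc="sum" gives the bin total instead
as_sum = binned.pivot_table(
    index=index_cols, columns="component", values="value", aggfunc="sum"
).reset_index()
```

Under that assumption the binned values are per-age averages rather than bin totals, which matters when they are later summed into population totals.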





In [224]:
### Create ward and borough names lookup

In [225]:
name_lookup = combined_10yr_fert[['gss_code', 'la_name', 'gss_code_ward', 'ward_name']]
#remove duplicate rows so each code/name combination appears once
name_lookup = name_lookup.drop_duplicates()
#remove nan values
name_lookup = name_lookup.dropna()

In [226]:
# Identify ward names shared by more than one ward and make them unique by appending the local authority name
is_shared_name = name_lookup['ward_name'].duplicated(keep=False)
name_lookup['ward_name_unique'] = name_lookup['ward_name'].where(
    ~is_shared_name,
    name_lookup['ward_name'] + ' (' + name_lookup['la_name'] + ')'
)

## Basic Checks
Perform basic checks on the dataset, including checking for missing values, duplicates, and descriptive statistics.

---

In [227]:
#max and min year (returned in that order)
def get_year_range(df):
    return df['year'].max(), df['year'].min()

In [228]:
#year ranges
print("Complete year range:", get_year_range(combined_10yr_fert))
print("Year range for wards:", get_year_range(combined_10yr_fert_ward))
print("Year range for boroughs:", get_year_range(combined_10yr_fert_borough))

Complete year range: (2050, 2011)
Year range for wards: (2050, 2011)
Year range for boroughs: (2050, 2011)


#### missing values

In [229]:
missing_values = combined_10yr_fert.isnull().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 gss_code         0
la_name          0
gss_code_ward    0
ward_name        0
age              0
sex              0
year             0
value            0
component        0
dtype: int64


#### duplicates

In [230]:
duplicates = combined_10yr_fert.duplicated().sum()
print("Number of duplicate rows:", duplicates)

Number of duplicate rows: 0


##### Descriptive data

##### Description by components

In [231]:
combined_10yr_fert_agebins_component_columns

component,gss_code,gss_code_ward,sex,age,year,births,deaths,netflow,population
0,E09000001,E09000001,female,0-18,2011,,,,19.421053
1,E09000001,E09000001,female,0-18,2012,32.0,0.000,0.631579,20.368421
2,E09000001,E09000001,female,0-18,2013,36.0,0.000,-1.526316,19.947368
3,E09000001,E09000001,female,0-18,2014,26.0,0.000,0.052632,19.473684
4,E09000001,E09000001,female,0-18,2015,30.0,0.000,-0.210526,19.578947
...,...,...,...,...,...,...,...,...,...
489595,E09000033,E05013809,male,90+,2046,,7.843,0.884000,19.960000
489596,E09000033,E05013809,male,90+,2047,,8.132,0.889000,20.990000
489597,E09000033,E05013809,male,90+,2048,,8.482,0.901000,22.225000
489598,E09000033,E05013809,male,90+,2049,,8.726,0.916000,23.188000


#### check for negative values in columns

In [232]:
# Checking which components contain negative values
negative_values = combined_10yr_fert[combined_10yr_fert['value'] < 0]
print('components with negative values:', negative_values['component'].unique())

components with negative values: ['netflow']


#### check age range

In [233]:
#print true if max age is 90 and min age is 0
print('Max age is 90 and min age is 0:', (combined_10yr_fert['age'].max() == 90) & (combined_10yr_fert['age'].min() == 0))

Max age is 90 and min age is 0: True


#### population totals for London



In [234]:
combined_10yr_fert_population = combined_10yr_fert[combined_10yr_fert['component'] == 'population']
#group by year and sum the values
population_totals = combined_10yr_fert_population.groupby('year')['value'].sum().reset_index()

In [235]:
population_totals_table = population_totals.rename(columns={'value': 'London population'})
population_totals_table = population_totals_table.set_index(['year', 'London population']).T

In [236]:
population_totals_table

year,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,...,2041,2042,2043,2044,2045,2046,2047,2048,2049,2050
London population,8204370.111,8320159.907,8437903.985,8546343.943,8660700.871,8747370.096,8778769.052,8834085.043,8888756.948,8866224.970,...,9688577.960,9720239.347,9749794.920,9777145.575,9802285.089,9825197.665,9846042.093,9864922.467,9881938.765,9897222.063


## Population Consistency Over Time

---

In [237]:
#use binned ages: this evens out large fluctuations between single years of age that are likely to be unusually high, e.g. 18 year olds moving to university
combined_10yr_fert_agebins_component_columns

component,gss_code,gss_code_ward,sex,age,year,births,deaths,netflow,population
0,E09000001,E09000001,female,0-18,2011,,,,19.421053
1,E09000001,E09000001,female,0-18,2012,32.0,0.000,0.631579,20.368421
2,E09000001,E09000001,female,0-18,2013,36.0,0.000,-1.526316,19.947368
3,E09000001,E09000001,female,0-18,2014,26.0,0.000,0.052632,19.473684
4,E09000001,E09000001,female,0-18,2015,30.0,0.000,-0.210526,19.578947
...,...,...,...,...,...,...,...,...,...
489595,E09000033,E05013809,male,90+,2046,,7.843,0.884000,19.960000
489596,E09000033,E05013809,male,90+,2047,,8.132,0.889000,20.990000
489597,E09000033,E05013809,male,90+,2048,,8.482,0.901000,22.225000
489598,E09000033,E05013809,male,90+,2049,,8.726,0.916000,23.188000


In [238]:
outliers_dict_borough = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'population'], 
    handle_inf=False, 
    Geography='borough',
    z_score_threshold=3,
    For_population_totals=False
    )




The default fill_method='ffill' in DataFrameGroupBy.pct_change is deprecated and will be removed in a future version. Either fill in any non-leading NA values prior to calling pct_change or specify 'fill_method=None' to not fill NA values.
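
The warning above follows from calling `pct_change` on grouped data without `fill_method`. A minimal sketch of the option the message suggests, on invented data, assuming that gaps should stay missing rather than be forward-filled:

```python
import numpy as np
import pandas as pd

series = pd.DataFrame({
    "gss_code": ["E1", "E1", "E1", "E1"],
    "year": [2011, 2012, 2013, 2014],
    "births": [10.0, np.nan, 12.0, 15.0],
})

# fill_method=None leaves gaps as NaN rather than forward-filling them first
change = series.groupby("gss_code")["births"].pct_change(fill_method=None).abs()
```

With forward filling, the 2012 gap would be treated as 10.0 and the 2013 change computed against it. With `fill_method=None`, both 2012 and 2013 stay NaN, so the results can differ from the run shown here wherever a component has missing years.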



In [239]:
outliers_dict_ward = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'population'], 
    handle_inf=False, 
    Geography='ward',
    z_score_threshold=3,
    For_population_totals=False
    )







In [240]:
outliers_dict_borough_inf_values = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'population'], 
    handle_inf=True, 
    Geography='borough',
    z_score_threshold=3,
    For_population_totals=False
    )







In [241]:
outliers_dict_ward_inf_values = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'population'], 
    handle_inf=True, 
    Geography='ward',
    z_score_threshold=3,
    For_population_totals=False)







In [242]:
births_pct_change_borough_outlier_df = outliers_dict_borough['births_pct_change']
deaths_pct_change_borough_outlier_df = outliers_dict_borough['deaths_pct_change']
netflow_pct_change_borough_outlier_df = outliers_dict_borough['netflow_pct_change']
population_pct_change_borough_outlier_df = outliers_dict_borough['population_pct_change']

In [243]:
births_pct_change_ward_outlier_df = outliers_dict_ward['births_pct_change']
deaths_pct_change_ward_outlier_df = outliers_dict_ward['deaths_pct_change']
netflow_pct_change_ward_outlier_df = outliers_dict_ward['netflow_pct_change']
population_pct_change_ward_outlier_df = outliers_dict_ward['population_pct_change']

In [244]:
births_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['births']
deaths_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['deaths']
netflow_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['netflow']
population_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['population']

In [245]:
births_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['births']
deaths_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['deaths']
netflow_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['netflow']   
population_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['population']

### Corrected function: a simplified version to help users understand

In [246]:
outliers_dict_ward = calculate_zscores_and_find_outliers_percentage_change(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'population'], 
    Geography='ward',
    percentage_change_threshold=0.05,
    )






A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
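
The chained-assignment warning is raised by calls of the form `df[col].replace(..., inplace=True)`. A minimal sketch of the reassignment form the message recommends, using an invented `ratios` frame:

```python
import numpy as np
import pandas as pd

ratios = pd.DataFrame({"ratio": [0.9, np.inf, 1.1, -np.inf]})

# Assign the result back to the column instead of calling an inplace method on df[col]
ratios["ratio"] = ratios["ratio"].replace([np.inf, -np.inf], np.nan)
```

This form operates on the original DataFrame directly, so it keeps working under the pandas 3.0 behaviour described in the warning.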





In [247]:
outliers_dict_ward_inf_values = calculate_zscores_and_find_outliers_value(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'population'], 
    Geography='ward',
    z_score_threshold=3,
    )







In [248]:
#extract dfs from dictionary
births_pct_change_ward_outlier_df = outliers_dict_ward['births_pct_change']
deaths_pct_change_ward_outlier_df = outliers_dict_ward['deaths_pct_change']
netflow_pct_change_ward_outlier_df = outliers_dict_ward['netflow_pct_change']
population_pct_change_ward_outlier_df = outliers_dict_ward['population_pct_change']

In [249]:
#extract dfs from dictionary
births_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['births']
deaths_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['deaths']
netflow_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['netflow']   
population_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['population']

---

## Total Population Outliers
Detect total population outliers using Z-scores and Robust Z-scores, and perform cross-sectional and temporal comparisons.

- **Cross-sectional comparisons**: using total population (popn), examine differences between borough and ward totals for a given year.
- **Temporal comparisons**: measure percentage changes between yearly totals for both boroughs and wards.

The totals are taken per geographical boundary and per year, i.e. `groupby('gss_code')['popn']` and `groupby('year')['popn']`.
---

In [250]:
combined_10yr_fert_agebins_component_columns

component,gss_code,gss_code_ward,sex,age,year,births,deaths,netflow,population,births_pct_change,deaths_pct_change,netflow_pct_change,population_pct_change
0,E09000001,E09000001,female,0-18,2011,,,,19.421053,,,,
1,E09000001,E09000001,female,0-18,2012,32.0,0.000,0.631579,20.368421,,,,0.048780
2,E09000001,E09000001,female,0-18,2013,36.0,0.000,-1.526316,19.947368,0.125000,,3.416667,0.020672
3,E09000001,E09000001,female,0-18,2014,26.0,0.000,0.052632,19.473684,0.277778,,1.034483,0.023747
4,E09000001,E09000001,female,0-18,2015,30.0,0.000,-0.210526,19.578947,0.153846,,5.000000,0.005405
...,...,...,...,...,...,...,...,...,...,...,...,...,...
489595,E09000033,E05013809,male,90+,2046,,7.843,0.884000,19.960000,,0.045873,0.000000,0.061307
489596,E09000033,E05013809,male,90+,2047,,8.132,0.889000,20.990000,,0.036848,0.005656,0.051603
489597,E09000033,E05013809,male,90+,2048,,8.482,0.901000,22.225000,,0.043040,0.013498,0.058838
489598,E09000033,E05013809,male,90+,2049,,8.726,0.916000,23.188000,,0.028767,0.016648,0.043330


In [251]:
total_pop_temporal_outliers_ward = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['population'], 
    handle_inf=False, 
    Geography='ward',
    For_population_totals=True,
    population_analysis_type='temporal'
    )





      gss_code_ward  year   population  population_pct_change_temporal
21044     E05013925  2015   449.540413                        4.830075
21041     E05013925  2012    45.305355                        3.323123
23846     E05014015  2017   205.619105                        0.716484
6771      E05013516  2022   887.879825                        0.395847
23847     E05014015  2018   285.615117                        0.389050
...             ...   ...          ...                             ...
27000     E05014116  2011   872.153431                             NaN
27040     E05014117  2011   587.925335                             NaN
27080     E05014118  2011   758.158815                             NaN
27120     E05014119  2011  1002.042897                             NaN
27160     E09000001  2011   720.248246                             NaN

[27200 rows x 4 columns]


In [252]:
total_pop_temporal_outliers_ward_df  = total_pop_temporal_outliers_ward['population_pct_change']

In [253]:
total_pop_temporal_outliers_ward_df

component,gss_code,gss_code_ward,sex,age,year,births,deaths,netflow,population,births_pct_change,deaths_pct_change,netflow_pct_change,population_pct_change,robust_z_score
158730,E09000011,E05014090,female,90+,2021,,3.157000,-1.996000,14.902000,,0.626876,0.778247,30.841880,3072.492486
368681,E09000025,E05013925,female,19-30,2012,,0.000000,16.587500,17.295667,,,,22.068578,2198.110802
303449,E09000021,E05013941,female,90+,2020,,4.124000,-1.092000,10.422000,,0.222180,0.802030,20.849057,2076.568475
369041,E09000025,E05013925,male,19-30,2012,,0.000000,14.357750,15.107667,,,,19.145794,1906.814641
159090,E09000011,E05014090,male,90+,2021,,1.686000,-0.984000,9.000000,,0.750333,1.275014,17.987342,1791.358750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33820,E09000004,E05011219,male,90+,2031,,21.493000,-1.038000,75.842000,,0.041219,0.076512,0.033417,2.000142
156504,E09000011,E05014087,female,71-80,2035,,1.514100,-1.481100,85.317400,,0.029790,0.012139,0.033416,2.000109
276427,E09000019,E05013707,male,81-89,2038,,1.524444,-0.034556,14.868444,,0.014043,0.502400,0.033416,2.000079
387621,E09000027,E05013775,female,71-80,2032,,0.417700,-0.287600,49.432100,,0.006264,0.078365,0.033416,2.000060


In [254]:
total_pop_cross_sectional_outliers_ward = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['population'], 
    handle_inf=False, 
    Geography='ward',
    For_population_totals=True,
    population_analysis_type='cross-sectional'
    )





In [255]:
total_pop_cross_sectional_outliers_ward_df = total_pop_cross_sectional_outliers_ward['population_pct_change']

In [256]:
total_pop_temporal_outliers_ward_df

component,gss_code,gss_code_ward,sex,age,year,births,deaths,netflow,population,births_pct_change,deaths_pct_change,netflow_pct_change,population_pct_change,robust_z_score
158730,E09000011,E05014090,female,90+,2021,,3.157000,-1.996000,14.902000,,0.626876,0.778247,30.841880,3072.492486
368681,E09000025,E05013925,female,19-30,2012,,0.000000,16.587500,17.295667,,,,22.068578,2198.110802
303449,E09000021,E05013941,female,90+,2020,,4.124000,-1.092000,10.422000,,0.222180,0.802030,20.849057,2076.568475
369041,E09000025,E05013925,male,19-30,2012,,0.000000,14.357750,15.107667,,,,19.145794,1906.814641
159090,E09000011,E05014090,male,90+,2021,,1.686000,-0.984000,9.000000,,0.750333,1.275014,17.987342,1791.358750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33820,E09000004,E05011219,male,90+,2031,,21.493000,-1.038000,75.842000,,0.041219,0.076512,0.033417,2.000142
156504,E09000011,E05014087,female,71-80,2035,,1.514100,-1.481100,85.317400,,0.029790,0.012139,0.033416,2.000109
276427,E09000019,E05013707,male,81-89,2038,,1.524444,-0.034556,14.868444,,0.014043,0.502400,0.033416,2.000079
387621,E09000027,E05013775,female,71-80,2032,,0.417700,-0.287600,49.432100,,0.006264,0.078365,0.033416,2.000060


### Adjusted simplified function for totals

In [257]:
Total_population_percentage_yearly_change_for_wards = calculate_zscores_and_find_outliers_percentage_change(
    combined_10yr_fert_agebins_component_columns, 
    ['population'], 
    Geography='ward',
    For_population_totals=True,
    population_analysis_type='temporal',
    percentage_change_threshold=0.05)




A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
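
A minimal sketch of the change this warning asks for, assuming the offending line inside the function is an in-place `fillna` on a single column (the actual call is not shown in this section, so the method and fill value here are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"population": [10.0, None, 12.0]})

# Pattern that triggers the warning, because it acts on a temporary copy:
# toy["population"].fillna(0, inplace=True)

# Assign the result back to the column instead
toy["population"] = toy["population"].fillna(0)
```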





df_time_pct       gss_code_ward  year   population  population_pct_change_temporal
0         E05009317  2011  1700.857837                             NaN
1         E05009317  2012  1690.477789                        0.006103
2         E05009317  2013  1718.920073                        0.016825
3         E05009317  2014  1705.484188                        0.007816
4         E05009317  2015  1639.378261                        0.038761
...             ...   ...          ...                             ...
27195     E09000001  2046  1245.595551                        0.006557
27196     E09000001  2047  1252.237239                        0.005332
27197     E09000001  2048  1261.019272                        0.007013
27198     E09000001  2049  1269.515882                        0.006738
27199     E09000001  2050  1279.148061                        0.007587

[27200 rows x 4 columns]
df_time_threshold       gss_code_ward  year   population  population_pct_change_temporal
21044     E05013925  

In [258]:
Total_population_percentage_yearly_change_for_wards

Unnamed: 0,gss_code_ward,year,population,population_pct_change_temporal
21044,E05013925,2015,449.540413,4.830075
21041,E05013925,2012,45.305355,3.323123
23846,E05014015,2017,205.619105,0.716484
6771,E05013516,2022,887.879825,0.395847
23847,E05014015,2018,285.615117,0.389050
...,...,...,...,...
26049,E05014092,2020,1057.558981,0.050238
2171,E05009401,2022,769.763158,0.050209
25733,E05014084,2024,644.563658,0.050196
25735,E05014084,2026,710.448171,0.050144


In [259]:
Total_population_percentage_change_form_ward_avg = calculate_zscores_and_find_outliers_percentage_change(
    combined_10yr_fert_agebins_component_columns, 
    ['population'], 
    Geography='ward',
    For_population_totals=True,
    population_analysis_type='cross-sectional',
    percentage_change_threshold=1)









df_cross_pct       gss_code_ward  year   population  population_mean_for_year  \
0         E05009317  2011  1700.857837               1091.090039   
1         E05009317  2012  1690.477789               1109.071144   
2         E05009317  2013  1718.920073               1126.258599   
3         E05009317  2014  1705.484188               1143.568042   
4         E05009317  2015  1639.378261               1158.694114   
...             ...   ...          ...                       ...   
27195     E09000001  2046  1245.595551               1395.728767   
27196     E09000001  2047  1252.237239               1403.660697   
27197     E09000001  2048  1261.019272               1411.804023   
27198     E09000001  2049  1269.515882               1420.309149   
27199     E09000001  2050  1279.148061               1429.379855   

       population_pct_change_cross  
0                         0.558861  
1                         0.524228  
2                         0.526221  
3                     

In [260]:
Total_population_percentage_change_form_ward_avg

Unnamed: 0,gss_code_ward,year,population,population_mean_for_year,population_pct_change_cross
24593,E05014055,2044,5291.083901,1380.870145,2.831703
24594,E05014055,2045,5314.329596,1388.059545,2.828603
24592,E05014055,2043,5248.904905,1373.725117,2.820928
24595,E05014055,2046,5329.608444,1395.728767,2.818513
24596,E05014055,2047,5339.246146,1403.660697,2.803801
...,...,...,...,...,...
79,E05009318,2050,2883.596364,1429.379855,1.017376
15598,E05013736,2049,2852.847612,1420.309149,1.008610
20225,E05013904,2036,2637.056774,1314.294253,1.006443
15588,E05013736,2039,2694.162001,1344.808224,1.003380


## Gender Outliers
Investigate gender outliers, focusing on abnormal gender ratios and adjusting thresholds as needed.

---
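
As a rough illustration of the idea, the sketch below computes a female-to-male ratio per ward and scores it with a robust Z-score based on the median and the median absolute deviation (MAD). The definition used inside `gender_outliers` is not shown in this section, so the ratio direction, the 0.6745 scaling constant and the toy values are all assumptions.

```python
import pandas as pd

# Toy births by ward and sex for one year (hypothetical values)
toy = pd.DataFrame({
    "gss_code_ward": ["W1", "W2", "W3", "W4", "W5"] * 2,
    "sex": ["female"] * 5 + ["male"] * 5,
    "births": [95.0, 96.0, 94.0, 95.0, 200.0] + [100.0] * 5,
})

# One row per ward with a column per sex, then the female-to-male ratio
wide = toy.pivot(index="gss_code_ward", columns="sex", values="births")
ratio = wide["female"] / wide["male"]

# Robust Z-score: distance from the median, scaled by the MAD
median = ratio.median()
mad = (ratio - median).abs().median()
robust_z = 0.6745 * (ratio - median) / mad

# Per-component threshold, mirroring the outlier_std argument
outlier_std = {"births": 2}
flagged = robust_z[robust_z.abs() > outlier_std["births"]]
```

When ratios are almost identical across wards the MAD is close to zero, so even tiny deviations produce very large scores. That may be worth bearing in mind when reading the magnitudes in the outputs below.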

In [261]:
#single year age 
component_columns = ['births', 'deaths', 'netflow', 'population']
gender_outlier_dictionary = gender_outliers(combined_10yr_fert, component_columns, geography='ward', outlier_std={'births': 2, 'deaths': 5, 'netflow': 2, 'population': 5})






Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`



In [262]:
#binned age
component_columns = ['births', 'deaths', 'netflow', 'population']
gender_outlier_dictionary = gender_outliers(combined_10yr_fert_agebins, component_columns, geography='ward', outlier_std={'births': 2, 'deaths': 5, 'netflow': 2, 'population': 5})







In [263]:
#extract dfs from dictionary
births_gender_outliers_df = gender_outlier_dictionary['births']
deaths_gender_outliers_df = gender_outlier_dictionary['deaths']
netflow_gender_outliers_df = gender_outlier_dictionary['netflow']
population_gender_outliers_df = gender_outlier_dictionary['population']

In [264]:
births_gender_outliers_df

component,gss_code_ward,year,age,births,robust_z_score
0,E05009317,2012,0-18,0.973134,2501.286342
1,E05009317,2013,0-18,1.139178,22513.691666
2,E05009317,2014,0-18,1.205300,30482.965256
3,E05009317,2015,0-18,0.752435,-24098.454837
4,E05009317,2016,0-18,0.912557,-4799.746457
...,...,...,...,...,...
7807,E09000001,2032,0-18,0.952362,-2.317024
7808,E09000001,2034,0-18,0.952407,3.154138
7809,E09000001,2037,0-18,0.952400,2.292207
7810,E09000001,2039,0-18,0.952354,-3.282278


In [265]:
gender_outlier_dictionary

{'births': component gss_code_ward  year   age    births  robust_z_score
 0             E05009317  2012  0-18  0.973134     2501.286342
 1             E05009317  2013  0-18  1.139178    22513.691666
 2             E05009317  2014  0-18  1.205300    30482.965256
 3             E05009317  2015  0-18  0.752435   -24098.454837
 4             E05009317  2016  0-18  0.912557    -4799.746457
 ...                 ...   ...   ...       ...             ...
 7807          E09000001  2032  0-18  0.952362       -2.317024
 7808          E09000001  2034  0-18  0.952407        3.154138
 7809          E09000001  2037  0-18  0.952400        2.292207
 7810          E09000001  2039  0-18  0.952354       -3.282278
 7811          E09000001  2043  0-18  0.952407        3.178440
 
 [7812 rows x 5 columns],
 'deaths': component gss_code_ward  year    age    deaths  robust_z_score
 0             E05009317  2013    90+  2.756165        7.922055
 1             E05009317  2014  31-40  3.091483        9.232518
 2  

## Key Visualisations
Visualise important trends in the dataset, including distribution by components, age groups, yearly totals, and population pyramids.

---


#### Yearly totals by component visualisation

In [266]:
yearly_totals = combined_10yr_fert.groupby(['year','component'])['value'].sum().reset_index()
print("Yearly totals by year and component:\n", yearly_totals)

Yearly totals by year and component:
      year   component        value
0    2011  population  8204370.111
1    2012      births   134036.981
2    2012      deaths    47569.032
3    2012     netflow    29311.513
4    2012  population  8320159.907
..    ...         ...          ...
152  2049  population  9881938.765
153  2050      births   113319.794
154  2050      deaths    69183.781
155  2050     netflow   -28852.607
156  2050  population  9897222.063

[157 rows x 3 columns]


In [267]:
# Create the line plot with markers
fig = px.line(
    yearly_totals, 
    x='year', 
    y='value', 
    color='component', 
    markers=True,
    title="Yearly Totals by Component Over the Years",
    labels={'value': 'Total Value', 'year': 'Year'},
    color_discrete_sequence=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']  # Four distinct colors
)

# Update layout for refined appearance
fig.update_layout(
    title={
        'text': "Yearly Totals by Component Over the Years",
        'x': 0.5,  # Center the title
        'xanchor': 'center',  # Anchor the title to the center
        'yanchor': 'top'
    },
    title_font=dict(size=20, color='#333333', family='Arial'),
    xaxis_title='Year',
    yaxis_title='Total Value',
    legend_title='<b>Component</b><br>(Please click components<br> to toggle visibility)',
    width=900,
    height=600,
    font=dict(family="Arial", size=14, color="#333333"),
    plot_bgcolor='#f7f7f7',
    paper_bgcolor='#ffffff',
    xaxis=dict(showgrid=True, gridcolor='lightgrey', zeroline=False),
    yaxis=dict(showgrid=True, gridcolor='lightgrey'),
    legend=dict(
        orientation="h",
        yanchor="top",
        y=-0.15,
        x=0.5,
        xanchor="center",
        font=dict(size=12)
    ),
    margin=dict(l=60, r=60, t=60, b=60)  # Set margins to adjust spacing around the plot
)

# Show the figure
fig.show()

# Save the figure as an HTML file for embedding
pio.write_html(fig, 'interactive_line_plot.html', include_plotlyjs='cdn')



## Dash Apps

### Population pyramid app

In [268]:
#for unit age
combined_10yr_fert_population = combined_10yr_fert[combined_10yr_fert['component'] == 'population']
population_pyramids_unit_age = combined_10yr_fert_population.copy()
population_pyramids_unit_age = population_pyramids_unit_age[~population_pyramids_unit_age['gss_code_ward'].isna()]

In [269]:
population_pyramids_unit_age

population_pyramids_unit_age.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 4950400 entries, 4879680 to 9830079
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   gss_code       object 
 1   la_name        object 
 2   gss_code_ward  object 
 3   ward_name      object 
 4   age            int64  
 5   sex            object 
 6   year           int64  
 7   value          float64
 8   component      object 
dtypes: float64(1), int64(2), object(6)
memory usage: 1.8 GB


In [270]:
# Borough (la_name) total for each year, sex and age, summed across wards
population_pyramids_unit_age['LA_total'] = population_pyramids_unit_age.groupby(['la_name', 'year', 'sex', 'age'])['value'].transform('sum')


In [None]:
# Initialise the Dash app with a Bootstrap theme
app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

# Define layout with CSS styling
app.layout = html.Div([

    # Header section
    html.Div([
        html.H1("Population Pyramid Analysis", 
                style={'textAlign': 'center', 'color': '#333333', 'margin-top': '20px', 'font-size': '30px'}),
        html.P("Explore demographic change by ward", 
               style={'textAlign': 'center', 'fontSize': '18px', 'color': '#7F8C8D', 'margin-top': '10px'}),
    ], style={'padding': '20px', 'backgroundColor': '#ECF0F1'}),

    # Dropdown Container
    html.Div([
        dbc.Row([
            dbc.Col([
                html.Label("Select Borough", style={'font-size': '16px', 'margin-bottom': '5px', 'font-weight': 'bold'}),
                dcc.Dropdown(
                    id='location-dropdown',
                    options=[{'label': loc, 'value': loc} for loc in population_pyramids_unit_age['la_name'].unique()],
                    value=population_pyramids_unit_age['la_name'].unique()[0],
                    style={'width': '100%', 'font-size': '16px'}
                )
            ], width=6),
            
            dbc.Col([
                html.Label("Select Ward", style={'font-size': '16px', 'margin-bottom': '5px', 'font-weight': 'bold'}),
                dcc.Dropdown(
                    id='ward-dropdown',
                    style={'width': '100%', 'font-size': '16px'}
                )
            ], width=6)
        ], className="row justify-content-center"),
    ], className="container", style={'padding': '20px'}),
    
    # Graph
    dcc.Graph(id='population-pyramid'),
    
], style={'padding': '20px', 'font-family': 'Arial', 'background-color': '#f7f7f7'})

# Update ward dropdown based on selected borough
@app.callback(
    Output('ward-dropdown', 'options'),
    Output('ward-dropdown', 'value'),
    Input('location-dropdown', 'value')
)
def update_ward_dropdown(selected_location):
    filtered_data = population_pyramids_unit_age[population_pyramids_unit_age['la_name'] == selected_location]
    ward_options = [{'label': ward, 'value': ward} for ward in filtered_data['ward_name'].unique()]
    ward_options.append({'label': 'All Wards', 'value': 'All Wards'})  # Add "All Wards" as the last option
    return ward_options, ward_options[0]['value']  # Set the first actual ward as default

# Generate the population pyramid chart
@app.callback(
    Output('population-pyramid', 'figure'),
    Input('location-dropdown', 'value'),
    Input('ward-dropdown', 'value')
)
def update_pyramid(selected_location, selected_ward):
    borough_data = population_pyramids_unit_age[population_pyramids_unit_age['la_name'] == selected_location]

    # For "All Wards", sum the wards to one row per year, sex and age.
    # Plotting LA_total on every ward row would stack the borough total once per ward.
    if selected_ward == 'All Wards':
        filtered_data = borough_data.groupby(['year', 'sex', 'age'], as_index=False)['value'].sum()
        filtered_data['ward_name'] = 'All Wards'  # keeps the hover_data column available
        title = f'Population Pyramid for {selected_location} - All Wards'

    # For a specific ward (copy so the sign flip below does not touch the source frame)
    else:
        filtered_data = borough_data[borough_data['ward_name'] == selected_ward].copy()
        title = f'Population Pyramid for {selected_location} - {selected_ward}'

    # Prepare data for the pyramid: females are plotted on the negative side of the axis
    filtered_data['value'] = filtered_data['value'].where(filtered_data['sex'] != 'female', -filtered_data['value'])

    # Symmetric x-axis range based on the largest bar on either side
    max_value = filtered_data['value'].abs().max()
    range_x = [-max_value * 1.2, max_value * 1.2]

    fig = px.bar(
        filtered_data,
        x='value',
        y='age',
        color='sex',
        animation_frame='year',
        orientation='h',
        title=title,
        labels={'value': 'Population', 'age': 'Age'},
        color_discrete_map={'male': '#1f77b4', 'female': '#ff7f0e'},
        height=600,
        range_x=range_x,
        hover_data={'ward_name': True}
    )

    # Custom layout for enhanced appearance
    fig.update_layout(
        barmode='relative',
        title={'text': title, 'x':0.5, 'xanchor': 'center', 'yanchor': 'top'},
        title_font=dict(size=20, color='#333333', family='Arial'),
        xaxis_title='Population',
        yaxis_title='Age',
        font=dict(family="Arial", size=14, color="#333333"),
        xaxis=dict(showgrid=True, gridcolor='lightgrey', zeroline=False),
        yaxis=dict(showgrid=True, gridcolor='lightgrey'),
        plot_bgcolor='#f7f7f7',
        paper_bgcolor='#ffffff',
        showlegend=True,
        legend=dict(title="Gender", orientation="h", y=1.1, x=0.5, xanchor="center")
    )

    # Set a default animation speed for transitions
    fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 500  # Default animation frame duration
    fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 250  # Transition duration

    return fig

if __name__ == '__main__':
    app.run(debug=True, port=1223)
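
The data preparation behind the pyramid can be checked without starting the Dash server. The sketch below uses a hypothetical helper, `prepare_pyramid_data`, and toy values that are not from the projections. It sums wards to one row per year, sex and age, mirrors the female counts onto the negative side of the axis, and derives a symmetric axis range.

```python
import pandas as pd

def prepare_pyramid_data(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate to year/sex/age and mirror female counts onto the negative axis."""
    out = df.groupby(["year", "sex", "age"], as_index=False)["value"].sum()
    out["value"] = out["value"].where(out["sex"] != "female", -out["value"])
    return out

# Two hypothetical wards in one borough, one year and one age
toy = pd.DataFrame({
    "ward_name": ["A", "A", "B", "B"],
    "year": [2021] * 4,
    "sex": ["female", "male", "female", "male"],
    "age": [30] * 4,
    "value": [10.0, 12.0, 5.0, 6.0],
})

pyramid = prepare_pyramid_data(toy)

# Symmetric range from the largest bar on either side, with 20% headroom
max_value = pyramid["value"].abs().max()
range_x = [-max_value * 1.2, max_value * 1.2]
```

Taking the maximum of the absolute values keeps the range wide enough when the largest bar is on the female (negative) side.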


In [272]:
#add ward and borough names
combined_10yr_fert_agebins_component_columns_with_names = pd.merge(combined_10yr_fert_agebins_component_columns, name_lookup, on='gss_code_ward', how='left')

### Line graph app

In [273]:
data = combined_10yr_fert_agebins_component_columns_with_names.copy()

In [274]:
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import dash_bootstrap_components as dbc
import plotly.express as px

# Initialise the Dash app with a Bootstrap theme
app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

# Colors for 'male' and 'female'
gender_colors = {
    'male': '#1f77b4',  # blue
    'female': '#ff7f0e'  # orange
}

# App layout with updated styling
app.layout = html.Div([
    # Header section
    html.Div([
        html.H1("Ward Demographic Dashboard", 
                style={'textAlign': 'center', 'color': '#333333', 'margin-top': '20px', 'font-size': '30px'}),
        html.P("Explore demographic trends by ward and component", 
               style={'textAlign': 'center', 'fontSize': '18px', 'color': '#7F8C8D', 'margin-top': '10px'}),
    ], style={'padding': '20px', 'backgroundColor': '#ECF0F1'}),
    
    # Dropdowns and Controls Section
    html.Div([
        dbc.Row([
            dbc.Col([
                html.Label("Select Ward", style={'font-size': '16px', 'font-weight': 'bold'}),
                dcc.Dropdown(
                    id='ward-dropdown',
                    options=[{'label': ward, 'value': ward} for ward in data['ward_name_unique'].unique()],
                    value=data['ward_name_unique'].unique()[0],
                    style={'font-size': '16px', 'margin-bottom': '20px'}
                )
            ], width=6),
            
            dbc.Col([
                html.Label("Select Component", style={'font-size': '16px', 'font-weight': 'bold'}),
                dcc.Dropdown(
                    id='component-dropdown',
                    options=[
                        {'label': 'Births', 'value': 'births'},
                        {'label': 'Deaths', 'value': 'deaths'},
                        {'label': 'Netflow', 'value': 'netflow'},
                        {'label': 'Population', 'value': 'population'},
                        {'label': 'Births % Change', 'value': 'births_pct_change'},
                        {'label': 'Deaths % Change', 'value': 'deaths_pct_change'},
                        {'label': 'Netflow % Change', 'value': 'netflow_pct_change'},
                        {'label': 'Population % Change', 'value': 'population_pct_change'}
                    ],
                    value='births',
                    style={'font-size': '16px'}
                )
            ], width=6)
        ], style={'margin-bottom': '20px'}),

        # Dropdown for selecting age group
        html.Div([
            html.Label("Select Age Group", style={'font-size': '16px', 'font-weight': 'bold'}),
            dcc.Dropdown(id='age-dropdown', style={'font-size': '16px'})
        ], style={'padding': '10px'}),

        # Radio buttons for selecting plot type
        html.Div([
            html.Label("Select Plot Type", style={'font-size': '16px', 'font-weight': 'bold'}),
            dcc.RadioItems(
                id='plot-type',
                options=[
                    {'label': 'Line Plot', 'value': 'line'},
                    {'label': 'Box Plot', 'value': 'box'}
                ],
                value='line',
                labelStyle={'display': 'inline-block', 'margin-right': '10px'}
            ),
        ], style={'padding': '10px'}),
    ], style={'padding': '20px', 'backgroundColor': '#f7f7f7'}),
    
    # Graph placeholder
    dcc.Graph(id='plot', style={'margin-top': '20px'})
], style={'background-color': '#f7f7f7', 'font-family': 'Arial', 'padding': '20px'})

# Callback to update age dropdown options based on selected component
@app.callback(
    Output('age-dropdown', 'options'),
    Output('age-dropdown', 'value'),
    Input('component-dropdown', 'value')
)
def update_age_dropdown(selected_component):
    if selected_component == 'births':
        # Births are only recorded against the youngest age bin ('0-18'), labelled here as age 0
        options = [{'label': 'Age 0', 'value': '0-18'}]
        value = '0-18'
    else:
        # Show all available age options and "All Ages"
        options = [{'label': age, 'value': age} for age in data['age'].unique()] + [{'label': 'All Ages', 'value': 'all'}]
        value = 'all'
    return options, value

# Update graph based on dropdowns and selected plot type
@app.callback(
    Output('plot', 'figure'),
    [Input('ward-dropdown', 'value'),
     Input('component-dropdown', 'value'),
     Input('age-dropdown', 'value'),
     Input('plot-type', 'value')]
)
def update_plot(selected_ward, selected_component, selected_age, selected_plot_type):
    # Filter data by selected ward
    filtered_data = data[data['ward_name_unique'] == selected_ward]
    
    # Filter by age group if not "all"
    if selected_age != 'all':
        filtered_data = filtered_data[filtered_data['age'] == selected_age]
    
    # Common layout settings for a light grey grid, applied to both plot types
    grid_layout = {
        'xaxis': dict(showgrid=True, gridcolor='lightgrey', zeroline=False),
        'yaxis': dict(showgrid=True, gridcolor='lightgrey'),
        'plot_bgcolor': '#f7f7f7',
        'paper_bgcolor': '#ffffff',
    }

    if selected_plot_type == 'line':
        if selected_age == 'all':
            # Aggregate data by year and sex for "All Ages"
            aggregated_data = filtered_data.groupby(['year', 'sex'])[selected_component].sum().reset_index()
            fig = px.line(
                aggregated_data,
                x='year',
                y=selected_component,
                color='sex',
                color_discrete_map=gender_colors,
                title=f"{selected_component.capitalize()} Trend Over Time in {selected_ward} (All Ages)"
            )
        else:
            # Create a line plot without aggregation
            fig = px.line(
                filtered_data,
                x='year',
                y=selected_component,
                color='sex',
                color_discrete_map=gender_colors,
                title=f"{selected_component.capitalize()} Trend Over Time in {selected_ward}"
            )

        # Apply the common grey grid layout
        fig.update_layout(**grid_layout)

    elif selected_plot_type == 'box':
        # Create a box plot
        fig = px.box(
            filtered_data,
            x='year',
            y=selected_component,
            color='sex',
            color_discrete_map=gender_colors,
            title=f"{selected_component.capitalize()} Distribution by Gender in {selected_ward} (Box Plot)",
            hover_data={'age': True}
        )

        # Apply the common grey grid layout
        fig.update_layout(**grid_layout)

    return fig

# Run the app
if __name__ == '__main__':
    app.run(debug=True, port=1333)
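
With "All Ages" selected, `update_plot` sums the chosen column across age groups. That works for counts, but for the `*_pct_change` columns a sum of per-age percentage changes is not the percentage change of the total. One possible approach, sketched below with a hypothetical helper `all_ages_series` and toy values, is to sum the underlying counts first and recompute the change from those totals. The absolute value mirrors how the percentage-change columns appear in this notebook's outputs, which is an inference rather than something confirmed by the function source.

```python
import pandas as pd

SUFFIX = "_pct_change"

def all_ages_series(df: pd.DataFrame, component: str) -> pd.DataFrame:
    """Sum counts over age groups; recompute percentage change from the totals."""
    base = component[: -len(SUFFIX)] if component.endswith(SUFFIX) else component
    totals = (
        df.groupby(["sex", "year"], as_index=False)[base].sum()
        .sort_values(["sex", "year"])
    )
    if component.endswith(SUFFIX):
        totals[component] = (
            totals.groupby("sex")[base].pct_change(fill_method=None).abs()
        )
    return totals

# Toy ward: two age groups, two years (hypothetical values)
toy = pd.DataFrame({
    "year": [2012, 2012, 2013, 2013],
    "sex": ["female"] * 4,
    "age": ["0-18", "19-30", "0-18", "19-30"],
    "population": [100.0, 300.0, 110.0, 300.0],
    "population_pct_change": [None, None, 0.10, 0.0],
})

result = all_ages_series(toy, "population_pct_change")
```

In this toy case the total moves from 400 to 410, a change of 2.5%, whereas summing the per-age changes would report 10%.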




### All ward distribution app 

In [275]:
df_swarm = combined_10yr_fert_agebins_component_columns_with_names.copy()

# Preprocess the data: Ensure the relevant columns are numeric
columns_to_visualize = ['births', 'deaths', 'netflow', 'population', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change']
df_swarm[columns_to_visualize] = df_swarm[columns_to_visualize].apply(pd.to_numeric, errors='coerce')

# Initialise the Dash app with a Bootstrap theme
app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])

# Dropdown options for columns
column_options = [{'label': col, 'value': col} for col in columns_to_visualize]

# Colors for 'male' and 'female'
gender_colors = {
    'male': '#1f77b4',  # blue
    'female': '#ff7f0e'  # orange
}

# Only population has 2011 data, so drop 2011 to ensure every animation frame has data for all components
df_swarm = df_swarm[df_swarm['year'] > 2011]

# Create the swarm plot function
def create_swarm_plot(selected_column):
    # Symmetric y-axis range based on the largest absolute value,
    # so large negative values (e.g. netflow) are not clipped
    max_value = df_swarm[selected_column].abs().max()
    range_y = [-max_value * 1.2, max_value * 1.2]
    
    # Create the figure with animation
    fig = px.strip(
        df_swarm,  # Use the full DataFrame for animation across years
        x='age',  # Age on the x-axis
        y=selected_column,  # Selected column on the y-axis
        color='sex',  # Color by sex
        color_discrete_map=gender_colors,
        animation_frame='year',  # Animation by year
        title=f"{selected_column.capitalize()} by Age Over Time",
        hover_data=['ward_name', 'la_name', 'sex'],  # Hover data
        category_orders={'age': sorted(df_swarm['age'].unique())},  # Sort ages on x-axis
        range_y=range_y
    )
    
    # Update layout for readability and consistency with the population pyramid
    fig.update_layout(
        xaxis_title="Age Group",
        yaxis_title=selected_column.capitalize(),
        template="plotly_white",  # Lighter background for consistency
        height=600,
        font=dict(family="Arial, sans-serif", size=14, color="#333333"),  # Consistent font
        margin=dict(l=50, r=50, t=50, b=50),
        plot_bgcolor='#f7f7f7',  # Same background color as the population pyramid
        paper_bgcolor='#ffffff',  # Same background as the graph paper
        showlegend=True,
        title={
            'text': f"{selected_column.capitalize()} by Age Over Time",
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(size=20, color='#333333')  # Same title styling as pyramid
        },
        xaxis=dict(
            showgrid=True, 
            gridcolor='lightgrey', 
            zeroline=False
        ),
        yaxis=dict(
            showgrid=True, 
            gridcolor='lightgrey'
        ),
        legend=dict(
            title="Gender",
            orientation="h",  # Horizontal legend like in the population pyramid
            y=1.1,
            x=0.5,  # Centre the legend horizontally (paired with xanchor="center")
            xanchor="center"
        )
    )
    
    return fig

# Define the layout of the app
app.layout = html.Div([
    # Header section
    html.Div([
        html.H1("Interactive Distribution of Components by Age", 
                style={'textAlign': 'center', 'color': '#333333', 'marginTop': '20px', 'fontSize': '30px'}),
        html.P("Select the component to visualise, animated by year", 
               style={'textAlign': 'center', 'fontSize': '18px', 'color': '#7F8C8D', 'marginTop': '10px'}),
    ], style={'padding': '20px', 'backgroundColor': '#ECF0F1'}),
    
    # Dropdown for column selection
    html.Div([
        dbc.Row([
            dbc.Col([
                dbc.Label("Select Metric", style={'fontSize': '16px'}),
                dcc.Dropdown(
                    id='column-dropdown',
                    options=column_options,
                    value=columns_to_visualize[0],  # Default value
                    style={'width': '100%'}
                ),
            ], width=6)
        ], style={'padding': '20px'}),
    ]), 
    
    # Graph section with loading spinner
    dbc.Row([
        dbc.Col([
            dcc.Loading(
                id="loading-spinner",
                type="circle",  # Use a circular spinner
                children=[
                    dcc.Graph(id='swarm-plot'),
                ]
            ),
        ], width=12)
    ], style={'padding': '20px'}),
], style={'padding': '20px', 'fontFamily': 'Arial', 'backgroundColor': '#f7f7f7'})

# Callback to update the plot based on user input
@app.callback(
    Output('swarm-plot', 'figure'),
    [Input('column-dropdown', 'value')]
)
def update_swarm_plot(selected_column):
    return create_swarm_plot(selected_column)

# Run the Dash app
if __name__ == '__main__':
    app.run_server(debug=True, port=2000)




## Collate Outliers
Summarize and compile all detected outliers for further analysis.

---
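
As an illustration of the collation step, the sketch below builds a tally from an explicit dictionary of outlier DataFrames instead of scanning `globals()`. The helper name `sketch_tally`, the toy frames, and the choice of key columns are assumptions made for this example; the notebook's own `tally_outliers` function may work differently.

```python
import pandas as pd

def sketch_tally(frames, keys=("gss_code_ward", "year", "age")):
    """Hypothetical helper: mark which outlier tables flag each key combination."""
    keys = list(keys)
    flagged = []
    for name, df in frames.items():
        # One row per distinct key combination, tagged with the source table name
        flagged.append(df[keys].drop_duplicates().assign(source=name))
    long = pd.concat(flagged, ignore_index=True)

    # Indicator matrix: 1 if the source flagged that key combination, else 0
    wide = pd.crosstab([long[k] for k in keys], long["source"]).reset_index()
    wide.columns.name = None

    # Total number of sources that flagged each key combination
    source_cols = [c for c in wide.columns if c not in keys]
    wide["total"] = wide[source_cols].sum(axis=1)
    return wide.sort_values("total", ascending=False, ignore_index=True)

# Toy data standing in for the real outlier tables
toy_frames = {
    "births_outliers": pd.DataFrame(
        {"gss_code_ward": ["W1", "W2"], "year": [2030, 2030], "age": ["0-18", "0-18"]}
    ),
    "deaths_outliers": pd.DataFrame(
        {"gss_code_ward": ["W1"], "year": [2030], "age": ["0-18"]}
    ),
}
toy_tally = sketch_tally(toy_frames)
print(toy_tally)
```

Passing the DataFrames in explicitly keeps the tally reproducible: the result no longer depends on which variables happen to exist in the notebook session.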

In [276]:
# Filter the global variables to find dataframes with 'outlier' in their name
outlier_dfs = {name: df for name, df in globals().items() if isinstance(df, pd.DataFrame) and 'outlier' in name.lower()}

# Display the name, columns, and length of each dataframe
for name, df in outlier_dfs.items():
    print(f"DataFrame Name: {name}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Length: {len(df)}")
    print("\n" + "-"*50 + "\n")

DataFrame Name: births_pct_change_borough_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'population', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change', 'robust_z_score']
Length: 16284

--------------------------------------------------

DataFrame Name: deaths_pct_change_borough_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'population', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change', 'robust_z_score']
Length: 117951

--------------------------------------------------

DataFrame Name: netflow_pct_change_borough_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'population', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change', 'robust_z_score']
Length: 142008

--------------------------------------------------

In [277]:
# Ward-level DataFrames only:
# filter the global variables for DataFrames with 'outlier' in their name, excluding any with 'borough'
outlier_dfs_wards = {name: df for name, df in globals().items() 
               if isinstance(df, pd.DataFrame) 
               and 'outlier' in name.lower() 
               and 'borough' not in name.lower()}
# Display the name, columns, and length of each dataframe
for name, df in outlier_dfs_wards.items():
    print(f"DataFrame Name: {name}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Length: {len(df)}")
    print("\n" + "-"*50 + "\n")

DataFrame Name: births_pct_change_ward_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'population', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change']
Length: 11604

--------------------------------------------------

DataFrame Name: deaths_pct_change_ward_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'population', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change']
Length: 116544

--------------------------------------------------

DataFrame Name: netflow_pct_change_ward_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'population', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change']
Length: 295885

--------------------------------------------------

DataFrame Name: population_pct_change_ward_outlier_df
Columns

In [278]:
# Look at the first 'age' value in each DataFrame to check that ages are stored as binned strings
age_snippets = {df_name: df['age'].iloc[0] for df_name, df in outlier_dfs_wards.items()}
age_snippets

{'births_pct_change_ward_outlier_df': '0-18',
 'deaths_pct_change_ward_outlier_df': '41-50',
 'netflow_pct_change_ward_outlier_df': '81-89',
 'population_pct_change_ward_outlier_df': '90+',
 'births_pct_change_ward_inf_outlier_df': '0-18',
 'deaths_pct_change_ward_inf_outlier_df': '90+',
 'netflow_pct_change_ward_inf_outlier_df': '19-30',
 'population_pct_change_ward_inf_outlier_df': '31-40',
 'total_pop_temporal_outliers_ward_df': '90+',
 'total_pop_cross_sectional_outliers_ward_df': '90+',
 'births_gender_outliers_df': '0-18',
 'deaths_gender_outliers_df': '90+',
 'netflow_gender_outliers_df': '81-89',
 'population_gender_outliers_df': '81-89'}

In [279]:
# Initialise an empty list to store relevant rows from all DataFrames
all_rows = []

# Loop through all outlier dataframes
for name, df in outlier_dfs_wards.items():
    # Check that the 'gss_code_ward' and 'year' columns exist
    if 'gss_code_ward' in df.columns and 'year' in df.columns:
        # Select the relevant columns: 'gss_code_ward', 'year' and 'age' (if it exists)
        cols = ['gss_code_ward', 'year']
        if 'age' in df.columns:
            cols.append('age')

        # Append the relevant data from the current DataFrame to the list
        all_rows.append(df[cols])

# Concatenate all collected data into one DataFrame
if all_rows:
    combined_df = pd.concat(all_rows)

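    # Group by 'gss_code_ward', 'year', and 'age' (where present) and count occurrences.
    # Build the key list explicitly rather than reusing `cols` left over from the last loop iteration.
    group_cols = [c for c in ['gss_code_ward', 'year', 'age'] if c in combined_df.columns]
    tally_df = combined_df.groupby(group_cols).size().reset_index(name='count')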

    # Sort the result by count in descending order
    tally_df = tally_df.sort_values(by='count', ascending=False)

    # Display the tally DataFrame (pandas truncates long output)
    print(tally_df)
else:
    print("No relevant data found.")





       gss_code_ward  year    age  count
221265     E05014055  2036   0-18     18
221211     E05014055  2030   0-18     18
221283     E05014055  2038   0-18     18
221220     E05014055  2031   0-18     18
221256     E05014055  2035   0-18     18
...              ...   ...    ...    ...
151491     E05013767  2043  41-50      0
151492     E05013767  2043  51-60      0
151497     E05013767  2044   0-18      0
151500     E05013767  2044  41-50      0
122400     E05013687  2011   0-18      0

[244800 rows x 4 columns]


In [280]:
#tally all outlier dfs
df_tally = tally_outliers(outlier_dfs_wards, tally_by_age=True)




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy






In [281]:
df_tally

Unnamed: 0,gss_code_ward,year,age,births_pct_change_ward_outlier_df,deaths_pct_change_ward_outlier_df,netflow_pct_change_ward_outlier_df,population_pct_change_ward_outlier_df,births_pct_change_ward_inf_outlier_df,deaths_pct_change_ward_inf_outlier_df,netflow_pct_change_ward_inf_outlier_df,population_pct_change_ward_inf_outlier_df,total_pop_temporal_outliers_ward_df,total_pop_cross_sectional_outliers_ward_df,births_gender_outliers_df,deaths_gender_outliers_df,netflow_gender_outliers_df,population_gender_outliers_df,total,total_outlier_score
0,E05013993,2013,90+,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,11.0,490.263068
1,E05013988,2014,90+,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,11.0,458.350286
2,E05013560,2013,90+,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,11.0,210.484843
3,E05011476,2028,90+,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,11.0,343.177278
4,E05013631,2013,90+,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,11.0,278.483357
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
216826,E05013703,2048,0-18,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.000000
216827,E05013703,2047,51-60,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.234116
216828,E05013703,2047,41-50,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.000000
216829,E05013703,2047,31-40,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,25.429758


In [282]:
# potential alterations:
# using single year of age vs binned ages 
# add weighting to final tally
# remove certain age groups like 90+ from outlier detection
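
One of the alterations listed above, adding weighting to the final tally, could be sketched as follows. The helper name `sketch_weighted_total`, the `weighted_total` column, and the weight values are hypothetical; nothing here is taken from the notebook's `tally_outliers` implementation.

```python
import pandas as pd

def sketch_weighted_total(tally, indicator_cols, weights, default_weight=1.0):
    """Hypothetical helper: weighted sum of 0/1 indicator columns."""
    # Any indicator column without an explicit weight falls back to default_weight
    w = pd.Series({col: weights.get(col, default_weight) for col in indicator_cols})
    out = tally.copy()
    # Multiply each indicator column by its weight, then sum across columns
    out["weighted_total"] = out[indicator_cols].mul(w, axis="columns").sum(axis=1)
    return out.sort_values("weighted_total", ascending=False, ignore_index=True)

# Toy tally with two indicator columns (column names borrowed from the tally output above)
toy_tally_unweighted = pd.DataFrame({
    "gss_code_ward": ["W1", "W2"],
    "births_gender_outliers_df": [1.0, 0.0],
    "total_pop_temporal_outliers_ward_df": [0.0, 1.0],
})
toy_weighted = sketch_weighted_total(
    toy_tally_unweighted,
    indicator_cols=["births_gender_outliers_df", "total_pop_temporal_outliers_ward_df"],
    weights={"total_pop_temporal_outliers_ward_df": 2.0},  # assumed weight, for illustration only
)
print(toy_weighted)
```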


## Average Household Size Outliers
This section uses the functions `find_ahs_outliers_absolute_value` and `find_ahs_outliers_pct_change` to produce dataframes of outliers in average household size.
The output also includes contextual values so that comparisons can be made more easily: the values for the year before and the year after, so that the change can be seen, together with the average household size for that ward and the average for that year across all wards.
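
To make the shape of this output concrete, here is a minimal sketch of how such a table could be built from the wide ward-by-year data. It assumes that z-scores are calculated across wards within each year; the function name `sketch_ahs_outliers` and the toy data are illustrative only and are not the notebook's `find_ahs_outliers_absolute_value`.

```python
import pandas as pd

def sketch_ahs_outliers(wide, z_threshold=3.0):
    """Hypothetical helper: flag ward-years far from that year's cross-ward mean."""
    long = wide.melt(id_vars="gss_code_ward", var_name="year", value_name="outlier_value")
    long["year"] = long["year"].astype(int)
    long = long.sort_values(["gss_code_ward", "year"], ignore_index=True)

    by_year = long.groupby("year")["outlier_value"]
    by_ward = long.groupby("gss_code_ward")["outlier_value"]

    # Context columns: the average across wards for the year, and across years for the ward
    long["average_for_year"] = by_year.transform("mean")
    long["average_for_ward"] = by_ward.transform("mean")
    # Assumed definition: distance from the year's mean in units of the year's standard deviation
    long["z_score"] = (long["outlier_value"] - long["average_for_year"]) / by_year.transform("std")
    # Neighbouring years within the same ward
    long["year_before_value"] = by_ward.shift(1)
    long["year_after_value"] = by_ward.shift(-1)

    return long[long["z_score"].abs() > z_threshold]

# Toy data: ward E is unusually large in 2021 only
toy_ahs = pd.DataFrame({
    "gss_code_ward": ["A", "B", "C", "D", "E"],
    "2021": [2.0, 2.1, 2.2, 2.1, 4.0],
    "2022": [2.0, 2.1, 2.2, 2.1, 2.1],
})
# With only five wards a z-score cannot reach 3, so the toy example uses a lower threshold
toy_flagged = sketch_ahs_outliers(toy_ahs, z_threshold=1.5)
print(toy_flagged)
```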

In [283]:
#read in rds file
average_household_size = pyreadr.read_r(r"C:\Users\user\Documents\population_data\2022_Identified_Capacity_15yr_high_fert\2022_Identified_Capacity_15yr_high_fert\ahs.rds")
average_household_size = average_household_size[None]

In [284]:
#remove nan columns
average_household_size = average_household_size.dropna(axis=1, how='all')

In [285]:
average_household_size

Unnamed: 0,gss_code_ward,2021,2022,2023,2024,2025,2026,2027,2028,2029,...,2041,2042,2043,2044,2045,2046,2047,2048,2049,2050
0,E09000001,1.729055,2.141426,2.087426,2.040820,2.014128,1.994125,1.979868,1.968333,1.958878,...,1.910785,1.920089,1.927280,1.933181,1.938368,1.942776,1.946773,1.950291,1.953340,1.956169
1,E05014053,2.813790,2.548360,2.275661,2.153346,2.133935,2.105798,2.090412,2.078165,2.067971,...,2.135847,2.145677,2.156537,2.168107,2.179532,2.191080,2.202550,2.213891,2.224858,2.235705
2,E05014054,2.909020,2.883138,2.922277,2.923762,2.903617,2.882361,2.865726,2.849186,2.833638,...,2.799580,2.807506,2.818172,2.830171,2.843024,2.855550,2.867720,2.879492,2.890862,2.901789
3,E05014055,2.951577,2.634936,2.445635,2.358832,2.298595,2.248707,2.224863,2.213345,2.205129,...,2.170113,2.238887,2.279179,2.306654,2.327505,2.345364,2.360961,2.375091,2.388186,2.400488
4,E05014056,3.048425,2.873637,2.762628,2.698691,2.621724,2.577278,2.570153,2.579542,2.593865,...,2.627652,2.621036,2.617148,2.614639,2.612952,2.611757,2.610252,2.608712,2.606719,2.604483
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
675,E05013805,2.090312,2.182428,2.214119,2.199514,2.189446,2.177585,2.166442,2.155516,2.145647,...,2.083577,2.088475,2.093266,2.097676,2.101798,2.105602,2.109047,2.112118,2.114787,2.117179
676,E05013806,1.805705,1.890946,1.862247,1.822636,1.809524,1.797709,1.787583,1.779312,1.772270,...,1.716521,1.720707,1.725001,1.728974,1.732693,1.735995,1.739187,1.741991,1.744623,1.747075
677,E05013807,2.004284,2.053970,2.069157,2.044950,2.034003,2.020535,2.006764,1.992909,1.982163,...,1.908214,1.912048,1.916078,1.919994,1.923638,1.926997,1.930111,1.932949,1.935549,1.937979
678,E05013808,1.816990,1.941948,1.968739,1.952043,1.940812,1.928824,1.917613,1.907717,1.899374,...,1.845124,1.848929,1.853117,1.857303,1.861331,1.865219,1.868803,1.872123,1.875104,1.877877


In [286]:
# Example usage:
anomalous_average_household_size_absolute = find_ahs_outliers_absolute_value(average_household_size, z_threshold=3)



In [287]:
anomalous_average_household_size_absolute

Unnamed: 0,gss_code_ward,year,z_score,outlier_value,year_before_value,year_after_value,average_for_year,average_for_ward
34,E05013921,2023,4.354520,4.086808,4.036352,4.074413,2.574048,3.821120
21,E05013921,2022,4.335116,4.036352,3.995211,4.086808,2.552756,3.821120
47,E05013921,2024,4.311738,4.074413,4.086808,4.045974,2.562898,3.821120
60,E05013921,2025,4.257276,4.045974,4.074413,4.007388,2.551880,3.821120
73,E05013921,2026,4.195264,4.007388,4.045974,3.971685,2.535572,3.821120
...,...,...,...,...,...,...,...,...
172,E05013538,2034,3.001468,3.497292,3.509477,3.489739,2.454628,3.508954
77,E05014015,2026,-3.018934,1.476446,1.519301,1.454275,2.535572,1.488101
116,E05014015,2029,-3.037740,1.432531,1.441933,1.453046,2.487824,1.488101
90,E05014015,2027,-3.042164,1.454275,1.476446,1.441933,2.519411,1.488101


In [288]:
# Unique ward codes among the absolute-value average household size outliers
print("outlier wards:", anomalous_average_household_size_absolute['gss_code_ward'].unique())

outlier wards: ['E05013921' 'E05013913' 'E05013914' 'E05013537' 'E05011248' 'E05013514'
 'E05013926' 'E05011240' 'E05013539' 'E05013538' 'E05014064' 'E05013909'
 'E05014015']


In [289]:
anomalous_average_household_size_pct_change = find_ahs_outliers_pct_change(average_household_size, pct_change_threshold=0.10)

In [290]:
anomalous_average_household_size_pct_change 

Unnamed: 0,gss_code_ward,year,pct_change,outlier_value,year_before_value,year_after_value,average_for_year,average_for_ward
11,E05013516,2022,0.33597,2.95953,2.215267,2.972539,2.552756,2.978746
8,E09000001,2022,0.238495,2.141426,1.729055,2.087426,2.552756,1.949733
12,E05013653,2022,0.232125,2.261578,1.83551,2.28082,2.552756,2.107772
15,E05013796,2022,0.128527,2.381155,2.109967,2.354674,2.552756,2.190323
13,E05013570,2022,0.114253,3.213425,2.883927,3.267856,2.552756,3.219093
14,E05009336,2022,0.102414,2.673119,2.424788,2.648123,2.552756,2.491788
17,E05014053,2023,-0.10701,2.275661,2.54836,2.153346,2.574048,2.171645
10,E05014055,2022,-0.107278,2.634936,2.951577,2.445635,2.552756,2.26552
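
The `pct_change` values in the table above are consistent with the year-on-year change `outlier_value / year_before_value - 1` (for example, 2.95953 / 2.215267 - 1 ≈ 0.336). A minimal sketch of a threshold check of this kind follows; `sketch_pct_change_outliers` and the toy data are assumptions for illustration, not the notebook's `find_ahs_outliers_pct_change`.

```python
import pandas as pd

def sketch_pct_change_outliers(wide, pct_change_threshold=0.10):
    """Hypothetical helper: flag ward-years whose change on the previous year exceeds the threshold."""
    values = wide.set_index("gss_code_ward")
    # Divide each year's column by the previous year's column; the first year has no predecessor
    changes = values / values.shift(axis=1) - 1
    long = changes.reset_index().melt(
        id_vars="gss_code_ward", var_name="year", value_name="pct_change"
    )
    long["year"] = long["year"].astype(int)
    # NaN changes (first year) never pass the comparison, so they are dropped here too
    return long[long["pct_change"].abs() > pct_change_threshold]

# Toy data: ward A jumps by 25% in 2022, ward B falls by about 14% in 2023
toy_wide = pd.DataFrame({
    "gss_code_ward": ["A", "B"],
    "2021": [2.0, 2.0],
    "2022": [2.5, 2.1],
    "2023": [2.5, 1.8],
})
toy_changes = sketch_pct_change_outliers(toy_wide, pct_change_threshold=0.10)
print(toy_changes)
```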


## Produce HTML report

#### Table formatting for HTML

In [291]:
# Define shortened DataFrames for the report.
# Keep only the first 50 rows: there are too many otherwise, and rows further down are progressively less important.
tally_df_20 = df_tally.head(50)  # First 50 rows of the outlier tally (despite the `_20` suffix)
anomalous_average_household_size_absolute_df = anomalous_average_household_size_absolute.head(50)  # First 50 absolute-value household size outliers
anomalous_average_household_size_pct_change_df = anomalous_average_household_size_pct_change  # All percentage-change household size outliers

In [292]:
name_lookup = combined_10yr_fert[['gss_code', 'la_name', 'gss_code_ward', 'ward_name']]
# Drop duplicate rows so that each borough/ward combination appears once
name_lookup = name_lookup.drop_duplicates()
# Drop rows with missing values
name_lookup = name_lookup.dropna()

In [293]:
# Identify ward names shared by more than one ward and make them unique by appending the local authority name
is_shared_name = name_lookup['ward_name'].duplicated(keep=False)  # True for every ward whose name occurs more than once
name_lookup['ward_name_unique'] = np.where(
    is_shared_name,
    name_lookup['ward_name'].astype(str) + ' (' + name_lookup['la_name'].astype(str) + ')',
    name_lookup['ward_name'].astype(str),
)


In [294]:
# Inspect the lookup: ward names shared across boroughs now carry the borough name in brackets
name_lookup

Unnamed: 0,gss_code,la_name,gss_code_ward,ward_name,ward_name_unique
0,E09000001,City of London,E09000001,City of London,City of London
1,E09000002,Barking and Dagenham,E05014053,Abbey,Abbey (Barking and Dagenham)
2,E09000002,Barking and Dagenham,E05014054,Alibon,Alibon
3,E09000002,Barking and Dagenham,E05014055,Barking Riverside,Barking Riverside
4,E09000002,Barking and Dagenham,E05014056,Beam,Beam
...,...,...,...,...,...
675,E09000033,Westminster,E05013805,Regent's Park,Regent's Park (Westminster)
676,E09000033,Westminster,E05013806,St James's,St James's
677,E09000033,Westminster,E05013807,Vincent Square,Vincent Square
678,E09000033,Westminster,E05013808,West End,West End


In [295]:
# Merge each DataFrame with name_lookup to attach borough and ward names
tally_df_20 = pd.merge(tally_df_20, name_lookup, on='gss_code_ward', how='left')
anomalous_average_household_size_absolute_df = pd.merge(anomalous_average_household_size_absolute_df, name_lookup, on='gss_code_ward', how='left')
anomalous_average_household_size_pct_change_df = pd.merge(anomalous_average_household_size_pct_change_df, name_lookup, on='gss_code_ward', how='left')
Total_population_percentage_yearly_change_for_wards_df = pd.merge(Total_population_percentage_yearly_change_for_wards, name_lookup, on='gss_code_ward', how='left')
Total_population_percentage_change_form_ward_avg_df = pd.merge(Total_population_percentage_change_form_ward_avg, name_lookup, on='gss_code_ward', how='left')


In [296]:
# Drop the original ward_name column; the disambiguated ward_name_unique is used instead
tally_df_20 = tally_df_20.drop(columns=['ward_name'])
anomalous_average_household_size_absolute_df = anomalous_average_household_size_absolute_df.drop(columns=['ward_name'])
anomalous_average_household_size_pct_change_df = anomalous_average_household_size_pct_change_df.drop(columns=['ward_name'])
# Rename columns to report-friendly headings (ward_name_unique becomes the displayed ward name)
tally_df_20 = tally_df_20.rename(columns={'gss_code': 'Borough Code', 'gss_code_ward': 'Ward Code', 'ward_name_unique': 'Ward Name', 'la_name': 'Borough Name', 'year': 'Year', 'age': 'Age', 'total': 'Total Occurrence', 'total_outlier_score': 'Total Outlier Score'})
anomalous_average_household_size_absolute_df = anomalous_average_household_size_absolute_df.rename(columns={'gss_code': 'Borough Code', 'gss_code_ward': 'Ward Code', 'ward_name_unique': 'Unique Ward Name', 'la_name': 'Borough Name', 'year': 'Year', 'z_score': 'Z-Score', 'outlier_value': 'Outlier Value', 'year_before_value': 'Year Before Value', 'year_after_value': 'Year After Value', 'average_for_year': 'Average for Year', 'average_for_ward': 'Average for Ward'})
anomalous_average_household_size_pct_change_df = anomalous_average_household_size_pct_change_df.rename(columns={'gss_code': 'Borough Code', 'gss_code_ward': 'Ward Code', 'ward_name_unique': 'Unique Ward Name', 'la_name': 'Borough Name', 'year': 'Year', 'outlier_value': 'Outlier Value', 'year_before_value': 'Year Before Value', 'year_after_value': 'Year After Value', 'average_for_year': 'Average for Year', 'average_for_ward': 'Average for Ward'})
Total_population_percentage_yearly_change_for_wards_df = Total_population_percentage_yearly_change_for_wards_df.rename(columns={'gss_code': 'Borough Code','gss_code_ward': 'Ward Code', 'ward_name_unique': 'Ward Name', 'la_name': 'Borough Name', 'year': 'Year'})
Total_population_percentage_change_form_ward_avg_df = Total_population_percentage_change_form_ward_avg_df.rename(columns={'gss_code': 'Borough Code','gss_code_ward': 'Ward Code', 'ward_name_unique': 'Ward Name', 'la_name': 'Borough Name', 'year': 'Year'})

In [297]:
Total_population_percentage_yearly_change_for_wards_df.drop(columns=['ward_name'], inplace=True)
Total_population_percentage_change_form_ward_avg_df.drop(columns=['ward_name'], inplace=True)

In [298]:
#order columns
tally_df_20 = tally_df_20[['Borough Name','Borough Code','Ward Name','Ward Code',
 'Year',
 'Age',
 'births_pct_change_ward_outlier_df',
 'deaths_pct_change_ward_outlier_df',
 'netflow_pct_change_ward_outlier_df',
 'population_pct_change_ward_outlier_df',
 'births_pct_change_ward_inf_outlier_df',
 'deaths_pct_change_ward_inf_outlier_df',
 'netflow_pct_change_ward_inf_outlier_df',
 'population_pct_change_ward_inf_outlier_df',
 'total_pop_temporal_outliers_ward_df',
 'total_pop_cross_sectional_outliers_ward_df',
 'births_gender_outliers_df',
 'deaths_gender_outliers_df',
 'netflow_gender_outliers_df',
 'population_gender_outliers_df',
 'Total Occurrence',
 'Total Outlier Score',]]

anomalous_average_household_size_absolute_df = anomalous_average_household_size_absolute_df[['Borough Name','Borough Code', 'Unique Ward Name','Ward Code',
 'Year',
 'Z-Score',
 'Outlier Value',
 'Year Before Value',
 'Year After Value',
 'Average for Year',
 'Average for Ward',
]]

anomalous_average_household_size_pct_change_df = anomalous_average_household_size_pct_change_df[['Borough Name',
    'Borough Code',
    'Unique Ward Name',
    'Ward Code',
    'Year',
    'pct_change',
    'Outlier Value',
    'Year Before Value',
    'Year After Value',
    'Average for Year',
    'Average for Ward']]

Total_population_percentage_yearly_change_for_wards_df = Total_population_percentage_yearly_change_for_wards_df[['Borough Name',
    'Borough Code',
    'Ward Name',
    'Ward Code',
    'Year',
    'population',
    'population_pct_change_temporal',
    ]]

Total_population_percentage_change_form_ward_avg_df = Total_population_percentage_change_form_ward_avg_df[['Borough Name',
    'Borough Code',
    'Ward Name',
    'Ward Code',
    'Year',
    'population',
    'population_mean_for_year',
    'population_pct_change_cross',
    ]]


In [299]:
# Sort each DataFrame in descending order by its key column
# Sort by population_pct_change_temporal
Total_population_percentage_yearly_change_for_wards_df = Total_population_percentage_yearly_change_for_wards_df.sort_values(by='population_pct_change_temporal', ascending=False)
# Sort by population_pct_change_cross
Total_population_percentage_change_form_ward_avg_df = Total_population_percentage_change_form_ward_avg_df.sort_values(by='population_pct_change_cross', ascending=False)

#### HTML layout

In [300]:
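# Get the directory where the notebook file is located.
# Note: os.path.abspath() resolves a relative filename against the current working directory,
# so this only points at the notebook's folder if the kernel was started there.
script_dir = os.path.dirname(os.path.abspath('QA_population_projection.ipynb'))
# Set the working directory to 'Projections_QA' so the HTML report is saved in the correct location
os.chdir(script_dir)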

In [301]:
# Convert DataFrames to HTML
tally_df_html = tally_df_20.to_html()
average_household_size_absolute_html = anomalous_average_household_size_absolute_df.to_html()
anomalous_average_household_size_pct_change_df_html = anomalous_average_household_size_pct_change_df.to_html()
df1_html = Total_population_percentage_yearly_change_for_wards_df.to_html(index=False, classes='data')
df2_html = Total_population_percentage_change_form_ward_avg_df.to_html(index=False, classes='data')   
population_totals_table_html = population_totals_table.to_html(index=False, classes='data') 



In [302]:
#describe table for all numeric columns
component_columns_describe_value_columns_describe = combined_10yr_fert_agebins_component_columns[['births', 'deaths', 'netflow', 'population']].describe()
#remove inf and nan values from pct change columns
combined_10yr_fert_agebins_component_columns_no_nans = combined_10yr_fert_agebins_component_columns.replace([np.inf, -np.inf], np.nan).dropna()
#describe percentage change columns
pct_change_columns_describe = combined_10yr_fert_agebins_component_columns_no_nans[['births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'population_pct_change']].describe()
# Concatenate horizontally (side by side)
combined_describe = pd.concat([component_columns_describe_value_columns_describe, pct_change_columns_describe], axis=1)

In [303]:
combined_describe_html = combined_describe.to_html()

In [304]:
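# Count missing values per column and render the result as an HTML table
missing_values = combined_10yr_fert.isnull().sum().to_frame()
missing_values_df = missing_values.rename(columns={0: "missing values"})
missing_values_df = missing_values_df.to_html()  # Note: from here on this is an HTML string, not a DataFrame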



In [305]:
# Perform checks and convert results to HTML-safe strings
year_range_fert = get_year_range(combined_10yr_fert)
year_range_ward = get_year_range(combined_10yr_fert_ward)
year_range_boroughs = get_year_range(combined_10yr_fert_boroughs)

missing_values = combined_10yr_fert.isnull().sum().to_frame().to_html()  # Convert to HTML format
duplicates = combined_10yr_fert.duplicated().sum()

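# Check for negative values and list every component in which they occur
negative_values = combined_10yr_fert[combined_10yr_fert['value'] < 0]
negative_components = negative_values['component'].unique()
# Join all affected components; fall back to 'None' so that an empty result does not raise an IndexError
negative_component_str = ', '.join(map(str, negative_components)) if len(negative_components) else 'None'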

# Check that ages span the expected range of 0 to 90
max_age = combined_10yr_fert['age'].max()
min_age = combined_10yr_fert['age'].min()
age_check = (max_age == 90) and (min_age == 0)


In [307]:
# What is the current working directory (i.e. where will the HTML report be saved)?
os.getcwd()

'c:\\Users\\user\\Documents\\Python Scripts\\Projections_QA'

In [308]:
# Prepare all checks as strings to inject into HTML
check_results = f"""
<h2>Basic sense check</h2>
<h4>Year ranges</h4>
<p>Complete year range: {year_range_fert}</p>
<p>Year range for wards: {year_range_ward}</p>
<p>Year range for boroughs: {year_range_boroughs}</p>

<h4>Missing Values</h4>
<p>Missing values per column:</p>
{missing_values_df}

<h4>Duplicates</h4>
<p>Number of duplicate rows: {duplicates}</p>

<h4>Negative Values</h4>
<p>Components that contain negative values: {negative_component_str}</p>


<h4>Age Range Check</h4>
<p>Max age: {max_age}</p>
<p>Min age: {min_age}</p>

<h4>Descriptive Statistics</h4>
<div class="table-container">
    {combined_describe_html}
</div>

<h4>Population Totals for London</h4>
<div class="table-container">
    {population_totals_table_html}
</div>
"""

In [309]:
# Define the HTML content with embedded JavaScript for tab functionality
html_content = f"""
<html>
<head>
    <title>Population Projections Report</title>
    <link rel="icon" href="https://resource.esriuk.com/wp-content/uploads/2017/06/GLA-Logo-Resized.png" type="image/png">
    
    <!-- Include jQuery and DataTables CSS & JS -->
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
    <link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.13.4/css/jquery.dataTables.min.css">
    <script type="text/javascript" charset="utf8" src="https://cdn.datatables.net/1.13.4/js/jquery.dataTables.min.js"></script>

    <style>
        /* General Styling */
        body {{
            font-family: Arial, sans-serif;
            color: #333;
            background-color: #f4f4f9;
            margin: 0;
            padding: 0;
        }}
        .container {{
            width: 80%;
            margin: auto;
            background-color: #fff;
            padding: 20px;
            box-shadow: 0px 0px 15px rgba(0, 0, 0, 0.1);
            border-radius: 8px;
        }}
        .logo {{
            display: block;
            margin: 0 auto;
            width: 250px;
        }}
        h1, h2 {{
            text-align: center;
            color: #2c3e50;
        }}
        h1 {{
            font-size: 2.8em;
            border-bottom: 2px solid #2c3e50;
            padding-bottom: 10px;
            margin-bottom: 20px;
        }}
        h2 {{
            margin-top: 30px;
        }}
        .tabs {{
            display: flex;
            justify-content: center;
            margin-bottom: 20px;
        }}
        .tab {{
            margin: 0 10px;
            padding: 10px 20px;
            cursor: pointer;
            background-color: #2c3e50;
            color: white;
            border-radius: 5px;
        }}
        .tab:hover {{
            background-color: #34495e;
        }}
        .tab.active {{
            background-color: #1abc9c;
        }}
        .tab-content {{
            display: none;
        }}
        .tab-content.active {{
            display: block;
        }}
        .table-container {{
            max-height: 400px;
            overflow-y: auto;
            overflow-x: auto;
            border: 1px solid #ddd;
            padding: 5px;
        }}
        table {{
            width: 100%;
            border-collapse: collapse;
            margin: 20px 0;
            font-size: 1em;
        }}
        table, th, td {{
            border: 1px solid #ddd;
            padding: 12px;
        }}
        th {{
            background-color: #2c3e50;
            color: #fff;
        }}
        tr:nth-child(even) {{
            background-color: #f2f2f2;
        }}
    </style>
</head>
<body>
    <div class="container">
        <!-- Company Logo -->
        <img src="https://resource.esriuk.com/wp-content/uploads/2017/06/GLA-Logo-Resized.png" alt="Company Logo" class="logo">

        <!-- Report Title -->
        <h1>Population Projections Outlier Report</h1>

        <!-- Add Check Results -->
        {check_results}

        <!-- Tallied Dataframe -->
        <h2>Tallied Outlier Table</h2>
        <div class="table-container">
            {tally_df_html}
        </div>

        <!-- Household Size Tables with Tabs -->
        <h2>Household Size Analysis</h2>
        <div class="tabs household-size-tabs">
            <div class="tab active" data-tab="tab1">Household Size Absolute</div>
            <div class="tab" data-tab="tab2">Household Size Percentage Change</div>
        </div>

        <div class="tab-content household-size-tab-content active" id="tab1">
            <h2>Average Household Size Absolute Table</h2>
            <div class="table-container">
                {average_household_size_absolute_html}
            </div>
        </div>
        <div class="tab-content household-size-tab-content" id="tab2">
            <h2>Average Household Size Percentage Change Table</h2>
            <div class="table-container">
                {anomalous_average_household_size_pct_change_df_html}
            </div>
        </div>

        <!-- Population Analysis Tables with Tabs -->
        <h2>Total Yearly Population Tables</h2>
        <div class="tabs population-analysis-tabs">
            <div class="tab active" data-tab="tab3">Total Ward Population Yearly Change</div>
            <div class="tab" data-tab="tab4">Total Population Compared to Average Ward Population</div>
        </div>

        <div class="tab-content population-analysis-tab-content active" id="tab3">
            <h2>Total Population Percentage Change Per Year by Ward</h2>
            <div class="table-container">
                {df1_html}
            </div>
        </div>
        <div class="tab-content population-analysis-tab-content" id="tab4">
            <h2>Annual Percentage Change in Total Population Compared to the Average Population of All Wards</h2>
            <div class="table-container">
                {df2_html}
            </div>
        </div>

        <!-- Swarm plot -->
        <h2>Visualising Potential Outliers</h2>
        <div>
            <iframe src="http://localhost:2000" width="100%" height="800" frameborder="0"></iframe>
        </div>

        <!-- Population Pyramid Visualisation -->
        <h2>Population Pyramid</h2>
        <div>
            <iframe src="http://localhost:1223" width="100%" height="800" frameborder="0"></iframe>
        </div>

        <!-- Component Trends Over Time -->
        <h2>Component Trends Over Time</h2>
        <div>
            <iframe src="http://localhost:1333" width="100%" height="800" frameborder="0"></iframe>
        </div>

        <!-- Interactive Line Plot -->
        <h2>Component Trends Over Time Totals</h2>
        <div>
            <iframe src="interactive_line_plot.html" width="100%" height="600" frameborder="0"></iframe>
        </div>

        <!-- Footer Section -->
        <div class="footer">
            <p>This is an automated report produced by the Greater London Authority (GLA). It highlights areas that may warrant further investigation for outliers and potential errors in the population projections. It is not guaranteed to capture every error that may exist within the dataset.</p>
            <p>If you require further information, please email <a href="mailto:Sebastian.Heslin-Rees@london.gov.uk">Sebastian.Heslin-Rees@london.gov.uk</a>.</p>
        </div>
    </div>

    <!-- DataTable and Tab Script -->
    <script>
        $(document).ready(function() {{
            // Initialize DataTables
            $('table').DataTable();

            // Tab functionality for Household Size
            $('.household-size-tabs .tab').on('click', function() {{
                var tabId = $(this).data('tab');
                $('.household-size-tabs .tab').removeClass('active');
                $(this).addClass('active');
                $('.household-size-tab-content').removeClass('active');
                $('#' + tabId).addClass('active');
            }});

            // Tab functionality for Population Analysis
            $('.population-analysis-tabs .tab').on('click', function() {{
                var tabId = $(this).data('tab');
                $('.population-analysis-tabs .tab').removeClass('active');
                $(this).addClass('active');
                $('.population-analysis-tab-content').removeClass('active');
                $('#' + tabId).addClass('active');
            }});
        }});
    </script>
</body>
</html>
"""

# Save the report to an HTML file (explicit UTF-8 so the output does not depend on the platform default encoding)
with open("population_projections_report.html", "w", encoding="utf-8") as f:
    f.write(html_content)

print("HTML report with tabs generated successfully!")


HTML report with tabs generated successfully!
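
Three of the report's iframes point at `localhost` ports (2000, 1223 and 1333), so those panels only render while the corresponding Dash apps are running on the same machine. The sketch below is one possible pre-flight check, not part of the original workflow: `IframeSrcCollector`, `local_iframe_ports` and `port_is_open` are hypothetical helper names, and it assumes the apps listen on `127.0.0.1`.

```python
import socket
from html.parser import HTMLParser
from urllib.parse import urlparse


class IframeSrcCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def local_iframe_ports(html):
    """Return the ports of iframes served from localhost (hypothetical helper)."""
    collector = IframeSrcCollector()
    collector.feed(html)
    ports = []
    for src in collector.sources:
        parsed = urlparse(src)
        # Relative paths such as "interactive_line_plot.html" have no hostname.
        if parsed.hostname in ("localhost", "127.0.0.1") and parsed.port:
            ports.append(parsed.port)
    return ports


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Small stand-in for the report HTML; in the notebook this would be html_content.
sample = (
    '<iframe src="http://localhost:2000"></iframe>'
    '<iframe src="interactive_line_plot.html"></iframe>'
)
for port in local_iframe_ports(sample):
    status = "reachable" if port_is_open(port) else "not reachable"
    print(f"Dash app on port {port}: {status}")
```

Running the same loop over `html_content` before writing the file would flag any Dash app that has not been started. A report shared with someone else will still show empty frames for those panels, because `localhost` resolves to the reader's own machine.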
