# 📊 **QA script for Population Projections**

---

## 📝 **Introduction**
This notebook aims to identify outliers in the GLA population projection data. The analysis involves loading the dataset, preprocessing the data, defining utility functions, performing outlier detection, and presenting the results through visualizations.

### 🎯 **Goals**
The analysis will focus on the following objectives:
- **Load and preprocess** the population projections dataset.
- **Define utility functions** reused across the analysis.
- Perform **basic checks** on the dataset:
  - Range of years covered.
  - Missing values.
  - Duplicates.
  - Descriptive statistics.
  - Breakdown by components.
  - Detecting negative values.
  - Age group ranges.
- **Outlier Detection** over time for each component:
  - Identify outliers using **Z-scores** and **Robust Z-scores**.
  - Analyze by **component**, **ward**, and **borough**.
  - Handle **infinite values** separately.
- **Total Outliers**:
  - Use Z-scores and Robust Z-scores for comparison.
  - Perform **cross-sectional comparisons**: Examine changes between boroughs and wards for a given year.
  - Conduct **temporal comparisons**: Measure percentage changes between years for both boroughs and wards.
  - Handle **infinite values** separately.
- **Gender Outliers**:
  - Investigate abnormal **gender ratios**.
  - Analyze by component.
  - Adjust the **outlier standard deviation thresholds** as needed based on different components.
- **Key Visualizations**:
  - Display the distribution of components.
  - Group data by **age ranges**.
  - Visualize **yearly totals**.
  - Show yearly total trends over time, broken down by components.
  - Create **population pyramids**.
- **Collate Outliers**: Summarize and determine the key outlier rows.

---

## 📂 **Dataset**
The dataset used in this analysis contains population projections for wards and boroughs, including population counts, births, deaths and net flows.
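
As a rough illustration of the layout described above, the sketch below builds a tiny stand-in table. The column names are taken from the output shown later in this notebook. The rows themselves are invented for illustration and are not real projections.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the projections table (rows are illustrative, not real data).
toy = pd.DataFrame({
    "gss_code": ["E09000001", "E09000001", "E09000033", "E09000033"],
    "la_name": ["City of London", "City of London", "Westminster", "Westminster"],
    "year": [2012, 2012, 2050, 2050],
    "sex": ["female", "male", "female", "male"],
    "age": [0, 0, 85, 85],
    "value": [32.0, 24.0, 16.9, 13.3],
    "component": ["births", "births", "popn", "popn"],
    "gss_code_ward": [np.nan, np.nan, "E05013809", "E05013809"],
    "ward_name": [np.nan, np.nan, "Westbourne", "Westbourne"],
})

# Borough-level rows carry no ward code; ward-level rows do.
toy_boroughs = toy[toy["gss_code_ward"].isna()]
toy_wards = toy[toy["gss_code_ward"].notna()]

print(len(toy_boroughs), len(toy_wards))
print(sorted(toy["component"].unique()))
```

The table is in long format: one row per area, year, sex, age and component, with the number itself in `value`.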

---

## 🛠️ **Structure**
1. [**Load and Preprocess** the population projections dataset.](#load-and-preprocess)
2. [**Define Utility Functions** for effective use.](#define-utility-functions)
3. [**Basic Checks** on the dataset.](#basic-checks)
4. [**Population Consistency Over Time** for each component.](#outlier-detection)
5. [**Total Outliers**](#total-outliers)
6. [**Gender Outliers**](#gender-outliers)
7. [**Key Visualisations**](#key-visualisations)
8. [**Collate Outliers** to determine key outlier rows.](#collate-outliers)



## Load and Preprocess
This section will cover how to load and preprocess the dataset.

---


In [157]:
import pandas as pd
import numpy as np
import pyreadr
import os
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px

In [7]:
os.chdir(r'C:\Users\user\Documents\population_data\combined_10yr_central_fert')
combined_10yr_fert = pd.read_csv('combined_10yr_central_fert.csv').iloc[:, 1:]



In [5]:
# Split into borough-level rows (no ward code) and ward-level rows
combined_10yr_fert_boroughs = combined_10yr_fert[combined_10yr_fert['gss_code_ward'].isna()]
combined_10yr_fert_ward = combined_10yr_fert[~combined_10yr_fert['gss_code_ward'].isna()]

In [19]:
combined_10yr_fert_agebins = create_age_bins(combined_10yr_fert)

In [20]:
combined_10yr_fert_agebins

| | gss_code | la_name | year | sex | age | value | component | gss_code_ward | ward_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | E09000001 | City of London | 2012.0 | female | 0-18 | 32.0 | births | NaN | NaN |
| 1 | E09000001 | City of London | 2012.0 | male | 0-18 | 24.0 | births | NaN | NaN |
| 2 | E09000001 | City of London | 2013.0 | female | 0-18 | 36.0 | births | NaN | NaN |
| 3 | E09000001 | City of London | 2013.0 | male | 0-18 | 35.0 | births | NaN | NaN |
| 4 | E09000001 | City of London | 2014.0 | female | 0-18 | 26.0 | births | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15428057 | E09000033 | Westminster | 2050.0 | male | 81-89 | 16.9 | popn | E05013809 | Westbourne |
| 15428058 | E09000033 | Westminster | 2050.0 | male | 81-89 | 13.3 | popn | E05013809 | Westbourne |
| 15428059 | E09000033 | Westminster | 2050.0 | male | 81-89 | 11.0 | popn | E05013809 | Westbourne |
| 15428060 | E09000033 | Westminster | 2050.0 | male | 81-89 | 9.9 | popn | E05013809 | Westbourne |


## Define Utility Functions
Define utility functions that will be used for various parts of the analysis.

---

In [82]:
def view_descriptive_statistics(df, columns):
    """
    Calculate descriptive statistics, including mean, median, and mode, for specified columns in a DataFrame.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    columns (list): List of columns for which to calculate the statistics.
    
    Returns:
    pd.DataFrame: DataFrame containing the descriptive statistics including median and mode.
    """
    # Get descriptive statistics using describe()
    descriptive_stats = df[columns].describe()

    # Calculate median for each column
    median = df[columns].median()

    # Calculate mode for each column (in case of multiple modes, take the first one)
    mode = df[columns].mode().iloc[0]

    # Add median and mode to the descriptive statistics DataFrame
    descriptive_stats.loc['median'] = median
    descriptive_stats.loc['mode'] = mode

    # Return the combined descriptive statistics
    return descriptive_stats
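
To see what the two extra rows look like, here is a minimal sketch that repeats the same three steps on an invented column. The name `toy` and its numbers are assumptions made for illustration only.

```python
import pandas as pd

# Invented values: four small numbers and one large one.
toy = pd.DataFrame({"value": [1.0, 2.0, 2.0, 3.0, 10.0]})

summary = toy[["value"]].describe()                   # count, mean, std, min, quartiles, max
summary.loc["median"] = toy[["value"]].median()       # one median per column
summary.loc["mode"] = toy[["value"]].mode().iloc[0]   # first mode if there are several

print(summary)
```

The mean (3.6) is pulled up by the single large value, while the median and mode both stay at 2. A gap like this is a quick hint that a column is skewed.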

In [10]:
def create_age_bins(df, age_column='age', bins=None, labels=None):
    """
    Create age bins for the specified age column in the given DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the age data.
    age_column (str): The name of the column containing age data. Default is 'age'.
    bins (list): A list of bin edges for categorizing ages. Default is None.
    labels (list): A list of labels for the bins. Default is None.

    Returns:
    pd.DataFrame: The DataFrame with a new 'age' column containing binned age data.
    """
    
    # If bins and labels are not provided, set default values
    if bins is None:
        bins = [-1, 18, 30, 40, 50, 60, 70, 80, 89, 90]
    
    if labels is None:
        labels = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-89', '90+']
    
    # Create a copy of the original DataFrame to avoid modifying it directly
    binned_df = df.copy()
    
    # Create age bins
    binned_df[age_column] = pd.cut(binned_df[age_column], bins=bins, labels=labels)
    
    return binned_df

# Example usage:
# combined_10yr_fert_agebins = create_age_bins(combined_10yr_fert)
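
The default bin edges are easiest to check on a handful of ages. The sketch below applies the same `bins` and `labels` with `pd.cut` to a made-up series. The age 95 is included only to show what happens above the last edge; whether the real data contains ages above 90 is not something this notebook states.

```python
import pandas as pd

bins = [-1, 18, 30, 40, 50, 60, 70, 80, 89, 90]
labels = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-89', '90+']

# Intervals are right-inclusive: (-1, 18], (18, 30], ..., (89, 90].
ages = pd.Series([0, 18, 19, 89, 90, 95])
binned = pd.cut(ages, bins=bins, labels=labels)

print(binned.tolist())
# ['0-18', '0-18', '19-30', '81-89', '90+', nan]
```

Because the last edge is 90, the '90+' bin only captures the value 90 itself, and anything larger becomes NaN. That is fine if 90 is used as an open-ended top age. Otherwise the last edge could be replaced with `float('inf')`.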


In [140]:
def calculate_zscores_and_find_outliers(df, component_columns, handle_inf=True, Geography='borough', z_score_threshold=2, For_population_totals=False): 
    """
    Computes z-scores or robust z-scores (depending on distribution) for the respective columns,
    and returns DataFrames containing outliers for either the component columns (handle_inf=True) 
    or the percentage change columns (handle_inf=False).
    
    Parameters:
    df (pd.DataFrame): The input DataFrame containing population data component value columns.
    component_columns (str or list): A single column name or a list of column names to be analysed.
    handle_inf (bool): If True, uses the component columns to determine outliers. 
                       If False, uses percentage change columns to determine outliers.
    Geography (str): Specifies whether to group by 'borough' (using 'gss_code') or 'ward' (using 'gss_code_ward').
                     Default is 'borough'.
    For_population_totals (bool): If True, calculates total population sums and percentage changes before proceeding 
                                  to z-score and outlier analysis.
    z_score_threshold (float or int): The threshold to consider as an outlier based on the z-score. Default is 2.
    
    Returns:
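    tuple: (outliers_dict, z_score_type). outliers_dict maps each analysed column to a DataFrame of its outliers
           based on the z-score threshold; z_score_type maps each column to the method used
           ('Normal Z-Score' or 'Robust Z-Score').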
    """
    
    # If a single column name is provided as a string, convert it to a list
    if isinstance(component_columns, str):
        component_columns = [component_columns]

    outliers_dict = {}
    z_score_type = {}  # Dictionary to store which method was used

    # Grouping and pivot based on the Geography parameter
    if Geography == 'borough':
        geo_column = 'gss_code'
    elif Geography == 'ward':
        geo_column = 'gss_code_ward'
    else:
        raise ValueError("Geography must be either 'borough' or 'ward'.")

    # Automatically create pct_change_columns based on component_columns
    pct_change_columns = [f"{col}_pct_change" for col in component_columns]

    # Calculate the percentage change for the component columns
    df[pct_change_columns] = df.groupby([geo_column, 'sex', 'age'])[component_columns].pct_change().abs()

    pivoted = df.pivot_table(index=[geo_column, 'sex', 'age', 'year'], columns='component', values='value').reset_index()

    # If For_population_totals is True, calculate population totals and percentage changes
    if For_population_totals:

        # Group by geography and year, and sum the population values
        population_sum = pivoted.groupby([geo_column, 'year'])['popn'].sum().reset_index()

        population_sum_time = population_sum.copy()
        population_sum_cross = population_sum.copy()    

        # Temporally: Calculate the population change over the years for each gss_code or ward
        population_sum_time['popn_pct_change_temporal'] = population_sum_time.groupby(geo_column)['popn'].pct_change() * 100

        # Cross-sectionally: Compare the population between different gss_code or wards for the same year and compare to the mean
        population_sum_cross['popn_mean'] = population_sum_cross.groupby('year')['popn'].transform('mean')
        population_sum_cross['popn_pct_change_cross'] = ((population_sum_cross['popn'] - population_sum_cross['popn_mean']) / population_sum_cross['popn_mean']) * 100

    # Now, decide how to determine outliers based on handle_inf and pct_change_columns
    if handle_inf:
        # Outliers based on component columns
        for comp_col, pct_change_col in zip(component_columns, pct_change_columns):
            if pct_change_col in df.columns:
                if handle_inf:
                    # Keep only rows whose percentage change is inf or -inf
                    # (.copy() so that score columns can be added without a SettingWithCopyWarning)
                    df_filtered = df[df[pct_change_col].isin([np.inf, -np.inf])].copy()
                else:
                    # Replace inf and -inf with NaN and work on entire DataFrame after cleaning
                    df_filtered = df.replace([np.inf, -np.inf], np.nan).dropna(subset=[comp_col, pct_change_col])

                print(df_filtered)

                if not df_filtered.empty:
                    # Check if the column is normally distributed using skewness
                    skewness = df_filtered[comp_col].skew()

                    if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
                        df_filtered['z_score'] = stats.zscore(df_filtered[comp_col])
                        outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
                        z_score_type[comp_col] = 'Normal Z-Score'
                        print(f"{comp_col} used Normal Z-Score.")
                    else:
                        # Use Robust Z-score (based on median and MAD) for non-normal distribution
                        median = df_filtered[comp_col].median()
                        mad = stats.median_abs_deviation(df_filtered[comp_col])
                        df_filtered['robust_z_score'] = (df_filtered[comp_col] - median) / mad
                        outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
                        z_score_type[comp_col] = 'Robust Z-Score'
                        print(f"{comp_col} used Robust Z-Score.")

                    # Store the outliers for this component column
                    outliers_dict[comp_col] = outliers
                else:
                    outliers_dict[comp_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
            else:
                print(f"{comp_col} does not exist in DataFrame")

    else:
        # Outliers based on pct_change_columns
        for pct_change_col in pct_change_columns:
            # Choose the correct DataFrame based on the column
            if pct_change_col == 'popn_pct_change_temporal':
                df_filtered = population_sum_time  # Use the temporal population data
                print(df_filtered)
            elif pct_change_col == 'popn_pct_change_cross':
                df_filtered = population_sum_cross  # Use the cross-sectional population data
                print(df_filtered)
            else:
                df_filtered = df  # Default to the original df if other percentage columns are provided

            if pct_change_col in df_filtered.columns:
                # Replace inf and -inf with NaN and work on the entire DataFrame after cleaning
                df_filtered = df_filtered.replace([np.inf, -np.inf], np.nan).dropna(subset=[pct_change_col])

                if not df_filtered.empty:
                    # Check if the column is normally distributed using skewness
                    skewness = df_filtered[pct_change_col].skew()
                    print(pct_change_col)

                    if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
                        df_filtered['z_score'] = stats.zscore(df_filtered[pct_change_col])
                        outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
                        z_score_type[pct_change_col] = 'Normal Z-Score'
                        print(f"{pct_change_col} used Normal Z-Score.")
                    else:
                        # Use Robust Z-score (based on median and MAD) for non-normal distribution
                        median = df_filtered[pct_change_col].median()
                        mad = stats.median_abs_deviation(df_filtered[pct_change_col])
                        df_filtered['robust_z_score'] = (df_filtered[pct_change_col] - median) / mad
                        outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
                        z_score_type[pct_change_col] = 'Robust Z-Score'
                        print(f"{pct_change_col} used Robust Z-Score.")

                    # Store the outliers for this percentage change column
                    outliers_dict[pct_change_col] = outliers
                else:
                    outliers_dict[pct_change_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
            else:
                print(f"{pct_change_col} does not exist in DataFrame")

    return outliers_dict, z_score_type  # Returning z_score_type for further use if needed
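
The core rule inside this function (pick a normal or a robust z-score from the skewness, then flag values beyond the threshold) can be tried in isolation. The sketch below is a simplified stand-in, not the function above: `flag_outliers`, `skew_cutoff` and the sample numbers are hypothetical, and the guard for a zero MAD is an addition that the notebook code does not have.

```python
import numpy as np
import pandas as pd
from scipy import stats

def flag_outliers(values, threshold=2.0, skew_cutoff=0.5):
    """Return (boolean flags, method name) for a 1-D list of numbers."""
    s = pd.Series(values, dtype=float)
    if abs(s.skew()) < skew_cutoff:
        # Roughly symmetric data: classic z-score (mean and standard deviation).
        scores = pd.Series(np.asarray(stats.zscore(s)), index=s.index)
        method = "Normal Z-Score"
    else:
        # Skewed data: robust z-score (median and median absolute deviation).
        mad = stats.median_abs_deviation(s)
        if mad == 0:
            # Assumption: with MAD = 0 the score is undefined, so flag nothing.
            return pd.Series(False, index=s.index), "Robust Z-Score"
        scores = (s - s.median()) / mad
        method = "Robust Z-Score"
    return scores.abs() > threshold, method

skewed_flags, skewed_method = flag_outliers([10, 11, 10, 11, 10, 11, 10, 100])
even_flags, even_method = flag_outliers([1, 2, 3, 4, 5])

print(skewed_method, skewed_flags.tolist())
print(even_method, even_flags.tolist())
```

In the skewed sample only the value 100 is flagged. Note that this robust score is not rescaled to be comparable with a standard deviation, so the same threshold is stricter for the robust branch than for the normal one.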



In [199]:
import pandas as pd
import numpy as np
from scipy import stats

def calculate_zscores_and_find_outliers(df, component_columns, handle_inf=True, Geography='borough', z_score_threshold=2, For_population_totals=False, population_analysis_type='cross-sectional'):
    """
    Computes z-scores or robust z-scores (depending on distribution) for the respective columns,
    and returns DataFrames containing outliers for either the component columns (handle_inf=True) 
    or the percentage change columns (handle_inf=False).

    Parameters:
    df (pd.DataFrame): The input DataFrame containing population data component value columns.
    component_columns (str or list): A single column name or a list of column names to be analysed.
    handle_inf (bool): If True, uses the component columns to determine outliers. 
                       If False, uses percentage change columns to determine outliers.
    Geography (str): Specifies whether to group by 'borough' (using 'gss_code') or 'ward' (using 'gss_code_ward').
                     Default is 'borough'.
    For_population_totals (bool): If True, calculates total population sums and percentage changes before proceeding 
                                  to z-score and outlier analysis.
    population_analysis_type (str): Specifies whether to do 'cross-sectional' or 'temporal' analysis for population totals.
                                    Default is 'cross-sectional'.
    z_score_threshold (float or int): The threshold to consider as an outlier based on the z-score. Default is 2.

    Returns:
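    tuple: (outliers_dict, z_score_type). outliers_dict maps each analysed column to a DataFrame of its outliers
           based on the z-score threshold; z_score_type maps each column to the method used
           ('Normal Z-Score' or 'Robust Z-Score').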
    """
    
    # If a single column name is provided as a string, convert it to a list
    if isinstance(component_columns, str):
        component_columns = [component_columns]

    outliers_dict = {}
    z_score_type = {}  # Dictionary to store which method was used

    # Grouping and pivot based on the Geography parameter
    if Geography == 'borough':
        geo_column = 'gss_code'
    elif Geography == 'ward':
        geo_column = 'gss_code_ward'
    else:
        raise ValueError("Geography must be either 'borough' or 'ward'.")

    # Automatically create pct_change_columns based on component_columns
    pct_change_columns = [f"{col}_pct_change" for col in component_columns]

    # If For_population_totals is True, calculate population totals and percentage changes
    if For_population_totals:

        # Ensure that the input DataFrame has a 'popn' column to aggregate
        if 'popn' not in df.columns:
            raise ValueError("The 'popn' column is missing from the DataFrame.")

        # Group by geography and year, and sum the population values
        population_sum = df.groupby([geo_column, 'year'])['popn'].sum().reset_index()

        population_sum_time = population_sum.copy()
        population_sum_cross = population_sum.copy()

        if population_analysis_type == 'temporal':
            # Temporally: Calculate the population change over the years for each gss_code or ward
            population_sum_time['popn_pct_change_temporal'] = population_sum_time.groupby(geo_column)['popn'].pct_change() * 100
            print(population_sum_time.head())
        elif population_analysis_type == 'cross-sectional':
            # Cross-sectionally: Compare the population between different gss_code or wards for the same year and compare to the mean
            population_sum_cross['popn_mean'] = population_sum_cross.groupby('year')['popn'].transform('mean')
            population_sum_cross['popn_pct_change_cross'] = ((population_sum_cross['popn'] - population_sum_cross['popn_mean']) / population_sum_cross['popn_mean']) * 100
            print(population_sum_cross.head())
        else:
            raise ValueError("population_analysis_type must be either 'cross-sectional' or 'temporal'.")

    # Now, decide how to determine outliers based on handle_inf and pct_change_columns
    if handle_inf:
        # Outliers based on component columns
        for comp_col, pct_change_col in zip(component_columns, pct_change_columns):
            if pct_change_col in df.columns:
                # Replace inf and -inf with NaN, then drop rows where either column is missing
                df_filtered = df.replace([np.inf, -np.inf], np.nan).dropna(subset=[comp_col, pct_change_col])

                if not df_filtered.empty:
                    # Check if the column is normally distributed using skewness
                    skewness = df_filtered[comp_col].skew()

                    if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
                        df_filtered['z_score'] = stats.zscore(df_filtered[comp_col])
                        outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
                        z_score_type[comp_col] = 'Normal Z-Score'
                    else:
                        # Use Robust Z-score (based on median and MAD) for non-normal distribution
                        median = df_filtered[comp_col].median()
                        mad = stats.median_abs_deviation(df_filtered[comp_col])
                        df_filtered['robust_z_score'] = (df_filtered[comp_col] - median) / mad
                        outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
                        z_score_type[comp_col] = 'Robust Z-Score'

                    # Store the outliers for this component column
                    outliers_dict[comp_col] = outliers
                else:
                    outliers_dict[comp_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
            else:
                print(f"{comp_col} does not exist in DataFrame")

    else:
        # Outliers based on pct_change_columns
        for pct_change_col in pct_change_columns:
            # Choose the correct DataFrame based on the column
            if population_analysis_type == 'temporal' and 'popn_pct_change_temporal' in pct_change_col:
                df_filtered = population_sum_time  # Use the temporal population data
            elif population_analysis_type == 'cross-sectional' and 'popn_pct_change_cross' in pct_change_col:
                df_filtered = population_sum_cross  # Use the cross-sectional population data
            else:
                df_filtered = df  # Default to the original df if other percentage columns are provided

            if pct_change_col in df_filtered.columns:
                # Replace inf and -inf with NaN and work on the entire DataFrame after cleaning
                df_filtered = df_filtered.replace([np.inf, -np.inf], np.nan).dropna(subset=[pct_change_col])

                if not df_filtered.empty:
                    # Check if the column is normally distributed using skewness
                    skewness = df_filtered[pct_change_col].skew()

                    if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
                        df_filtered['z_score'] = stats.zscore(df_filtered[pct_change_col])
                        outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
                        z_score_type[pct_change_col] = 'Normal Z-Score'
                    else:
                        # Use Robust Z-score (based on median and MAD) for non-normal distribution
                        median = df_filtered[pct_change_col].median()
                        mad = stats.median_abs_deviation(df_filtered[pct_change_col])
                        df_filtered['robust_z_score'] = (df_filtered[pct_change_col] - median) / mad
                        outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
                        z_score_type[pct_change_col] = 'Robust Z-Score'

                    # Store the outliers for this percentage change column
                    outliers_dict[pct_change_col] = outliers
                else:
                    outliers_dict[pct_change_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
            else:
                print(f"{pct_change_col} does not exist in DataFrame")

    return outliers_dict, z_score_type  # Returning z_score_type for further use if needed
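
The two population-total comparisons are easier to follow on a small table. The sketch below uses invented totals for two hypothetical areas, `A` and `B`, and mirrors the two formulas: change against the previous year within an area, and deviation from the all-area mean within a year. The explicit sort is an assumption added here, since `pct_change` compares each row with the row before it in its group.

```python
import pandas as pd

totals = pd.DataFrame({
    "gss_code": ["A", "A", "A", "B", "B", "B"],   # hypothetical area codes
    "year": [2020, 2021, 2022, 2020, 2021, 2022],
    "popn": [100.0, 110.0, 121.0, 300.0, 330.0, 363.0],
})

# Make sure years are in order within each area before taking differences.
totals = totals.sort_values(["gss_code", "year"]).reset_index(drop=True)

# Temporal: percentage change from the previous year, per area.
totals["popn_pct_change_temporal"] = totals.groupby("gss_code")["popn"].pct_change() * 100

# Cross-sectional: percentage deviation from the mean of all areas in the same year.
year_mean = totals.groupby("year")["popn"].transform("mean")
totals["popn_pct_change_cross"] = (totals["popn"] - year_mean) / year_mean * 100

print(totals.round(2))
```

Both areas grow by 10% a year, so the temporal column shows nothing unusual, while the cross-sectional column puts `A` 50% below and `B` 50% above the yearly mean. The first year of each area has no previous value and is NaN in the temporal column.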


In [178]:
import pandas as pd
import numpy as np
from scipy import stats

def calculate_zscores_and_find_outliers(df, component_columns, handle_inf=True, Geography='borough', z_score_threshold=2, For_population_totals=False, population_analysis_type='cross-sectional'):
    """
    Computes z-scores or robust z-scores (depending on distribution) for the respective columns,
    and returns DataFrames containing outliers for either the component columns (handle_inf=True) 
    or the percentage change columns (handle_inf=False).

    Parameters:
    df (pd.DataFrame): The input DataFrame containing population data component value columns.
    component_columns (str or list): A single column name or a list of column names to be analysed.
    handle_inf (bool): If True, uses the component columns to determine outliers. 
                       If False, uses percentage change columns to determine outliers.
    Geography (str): Specifies whether to group by 'borough' (using 'gss_code') or 'ward' (using 'gss_code_ward').
                     Default is 'borough'.
    For_population_totals (bool): If True, calculates total population sums and percentage changes before proceeding 
                                  to z-score and outlier analysis.
    population_analysis_type (str): Specifies whether to do 'cross-sectional' or 'temporal' analysis for population totals.
                                    Default is 'cross-sectional'.
    z_score_threshold (float or int): The threshold to consider as an outlier based on the z-score. Default is 2.

    Returns:
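    tuple: (outliers_dict, z_score_type). outliers_dict maps each analysed column to a DataFrame of its outliers
           based on the z-score threshold; z_score_type maps each column to the method used
           ('Normal Z-Score' or 'Robust Z-Score').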
    """
    
    # If a single column name is provided as a string, convert it to a list
    if isinstance(component_columns, str):
        component_columns = [component_columns]

    outliers_dict = {}
    z_score_type = {}  # Dictionary to store which method was used

    # Grouping and pivot based on the Geography parameter
    if Geography == 'borough':
        geo_column = 'gss_code'
    elif Geography == 'ward':
        geo_column = 'gss_code_ward'
    else:
        raise ValueError("Geography must be either 'borough' or 'ward'.")

    # Automatically create pct_change_columns based on component_columns
    pct_change_columns = [f"{col}_pct_change" for col in component_columns]

    # Calculate the percentage change for the component columns
    df[pct_change_columns] = df.groupby([geo_column, 'sex', 'age'])[component_columns].pct_change().abs()
    pivoted = df.pivot_table(index=[geo_column, 'sex', 'age', 'year'], columns='component', values='value').reset_index()

    # If For_population_totals is True, calculate population totals and percentage changes
    if For_population_totals:
        # Filter the long-format DataFrame to only keep rows where the 'component' is 'popn'
        # (the pivoted table has no 'component' column, so the filter must run on df)
        popn_df = df[df['component'] == 'popn']
        
        # Pivot the data to get 'popn' values by geography, sex, age, and year
        pivoted = popn_df.pivot_table(index=[geo_column, 'sex', 'age', 'year'], columns='component', values='value').reset_index()
        
        # Ensure that there's a 'popn' column in the resulting DataFrame
        if 'popn' not in pivoted.columns:
            raise ValueError("The 'popn' column is missing after pivoting.")

        # Group by geography and year, and sum the population values
        population_sum = pivoted.groupby([geo_column, 'year'])['popn'].sum().reset_index()

        population_sum_time = population_sum.copy()
        population_sum_cross = population_sum.copy()

        if population_analysis_type == 'temporal':
            # Temporally: Calculate the population change over the years for each gss_code or ward
            population_sum_time['popn_pct_change_temporal'] = population_sum_time.groupby(geo_column)['popn'].pct_change() * 100
        elif population_analysis_type == 'cross-sectional':
            # Cross-sectionally: Compare the population between different gss_code or wards for the same year and compare to the mean
            population_sum_cross['popn_mean'] = population_sum_cross.groupby('year')['popn'].transform('mean')
            population_sum_cross['popn_pct_change_cross'] = ((population_sum_cross['popn'] - population_sum_cross['popn_mean']) / population_sum_cross['popn_mean']) * 100
        else:
            raise ValueError("population_analysis_type must be either 'cross-sectional' or 'temporal'.")

    # Now, decide how to determine outliers based on handle_inf and pct_change_columns
    if handle_inf:
        # Outliers based on component columns
        for comp_col, pct_change_col in zip(component_columns, pct_change_columns):
            if pct_change_col in df.columns:
                # Replace inf and -inf with NaN, then drop rows where either column is missing
                df_filtered = df.replace([np.inf, -np.inf], np.nan).dropna(subset=[comp_col, pct_change_col])

                if not df_filtered.empty:
                    # Check if the column is normally distributed using skewness
                    skewness = df_filtered[comp_col].skew()

                    if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
                        df_filtered['z_score'] = stats.zscore(df_filtered[comp_col])
                        outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
                        z_score_type[comp_col] = 'Normal Z-Score'
                    else:
                        # Use Robust Z-score (based on median and MAD) for non-normal distribution
                        median = df_filtered[comp_col].median()
                        mad = stats.median_abs_deviation(df_filtered[comp_col])
                        df_filtered['robust_z_score'] = (df_filtered[comp_col] - median) / mad
                        outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
                        z_score_type[comp_col] = 'Robust Z-Score'

                    # Store the outliers for this component column
                    outliers_dict[comp_col] = outliers
                else:
                    outliers_dict[comp_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
            else:
                print(f"{comp_col} does not exist in DataFrame")

    else:
        # Outliers based on pct_change_columns
        for pct_change_col in pct_change_columns:
            # Choose the correct DataFrame based on the column
            if population_analysis_type == 'temporal' and 'popn_pct_change_temporal' in pct_change_col:
                df_filtered = population_sum_time  # Use the temporal population data
            elif population_analysis_type == 'cross-sectional' and 'popn_pct_change_cross' in pct_change_col:
                df_filtered = population_sum_cross  # Use the cross-sectional population data
            else:
                df_filtered = df  # Default to the original df if other percentage columns are provided

            if pct_change_col in df_filtered.columns:
                # Replace inf and -inf with NaN and work on the entire DataFrame after cleaning
                df_filtered = df_filtered.replace([np.inf, -np.inf], np.nan).dropna(subset=[pct_change_col])

                if not df_filtered.empty:
                    # Check if the column is normally distributed using skewness
                    skewness = df_filtered[pct_change_col].skew()

                    if abs(skewness) < 0.5:  # If skewness is less than 0.5, use normal Z-score
                        df_filtered['z_score'] = stats.zscore(df_filtered[pct_change_col])
                        outliers = df_filtered[df_filtered['z_score'].abs() > z_score_threshold]  # Z-score > threshold
                        z_score_type[pct_change_col] = 'Normal Z-Score'
                    else:
                        # Use Robust Z-score (based on median and MAD) for non-normal distribution
                        median = df_filtered[pct_change_col].median()
                        mad = stats.median_abs_deviation(df_filtered[pct_change_col])
                        df_filtered['robust_z_score'] = (df_filtered[pct_change_col] - median) / mad
                        outliers = df_filtered[df_filtered['robust_z_score'].abs() > z_score_threshold]  # Robust Z-score > threshold
                        z_score_type[pct_change_col] = 'Robust Z-Score'

                    # Store the outliers for this percentage change column
                    outliers_dict[pct_change_col] = outliers
                else:
                    outliers_dict[pct_change_col] = pd.DataFrame()  # Return empty DataFrame if no rows found
            else:
                print(f"{pct_change_col} does not exist in DataFrame")

    return outliers_dict, z_score_type  # Returning z_score_type for further use if needed
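
The robust branch above scores each value as its distance from the median in units of the median absolute deviation (MAD). The following is a minimal standalone sketch of that calculation, not the notebook's own helper: the function name `robust_z_scores` and the sample data are hypothetical, and the MAD is computed with pandas, which matches `stats.median_abs_deviation` at its default `scale=1.0`. It also guards against a MAD of zero, which would otherwise turn every score into `inf` or `NaN`.

```python
import numpy as np
import pandas as pd


def robust_z_scores(values: pd.Series) -> pd.Series:
    """Return (x - median) / MAD for finite values, or all-NaN if the MAD is zero."""
    # Drop infinities and missing values before computing the statistics
    clean = values.replace([np.inf, -np.inf], np.nan).dropna()
    median = clean.median()
    # Median absolute deviation, unscaled
    mad = (clean - median).abs().median()
    if mad == 0:
        # More than half the values equal the median, so no meaningful scale exists
        return pd.Series(np.nan, index=clean.index)
    return (clean - median) / mad


# Hypothetical percentage-change values with one extreme value and one infinity
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.inf])
scores = robust_z_scores(sample)
flagged = scores[scores.abs() > 3]  # same style of threshold as z_score_threshold=3
print(flagged)
```

Returning `NaN` when the MAD is zero is one possible design choice; another is to fall back to the standard Z-score for that column.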


In [83]:
def gender_outliers(df, component_columns, geography='borough', outlier_std=None):
    """
    Processes gender data for either wards or boroughs, and calculates outliers for specified components.
    
    Parameters:
    - df: pandas DataFrame containing the raw data
    - component_columns: list of component column names (or a single name) for which ratios and outliers are calculated
    - geography: str, either 'ward' or 'borough', default is 'borough'
    - outlier_std: dict specifying how many standard deviations to use for each component's threshold calculation.
      Defaults to {'births': 2, 'deaths': 5, 'netflow': 2, 'popn': 5}; components not listed use 2.
    
    Returns:
    - outliers_dict: dictionary of outlier DataFrames for each component
    """
    # Avoid a mutable default argument and accept a single component name, as documented above
    if outlier_std is None:
        outlier_std = {'births': 2, 'deaths': 5, 'netflow': 2, 'popn': 5}
    if isinstance(component_columns, str):
        component_columns = [component_columns]

    # Check geography type and set index columns accordingly
    if geography == 'ward':
        geo_col = 'gss_code_ward'
    else:
        geo_col = 'gss_code'
    
    # Step 1: Create the pivot table
    gender_pivot = df.pivot_table(
        index=[geo_col, 'year', 'age', 'component'],  # Geography column and other grouping columns
        columns='sex',                                # Columns for sex (male, female)
        values='value',                               # Values (count of males and females)
        aggfunc='sum'                                 # Aggregation function (sum)
    ).reset_index()
    
    # Step 2: Calculate the ratio of females to males
    gender_pivot['ratio_female_to_male'] = gender_pivot['female'] / gender_pivot['male']
    
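    # Handle division by zero: replace infinite ratios with NaN.
    # Assign the result back instead of using inplace=True on a column selection,
    # which triggers a pandas FutureWarning and will stop working in pandas 3.0.
    gender_pivot['ratio_female_to_male'] = gender_pivot['ratio_female_to_male'].replace([np.inf, -np.inf], np.nan)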
    
    # Step 3: Pivot again to spread component values into columns
    gender_pivot = gender_pivot.pivot(
        index=[geo_col, 'year', 'age'], 
        columns='component', 
        values='ratio_female_to_male'
    ).reset_index()
    
    # Step 4: Convert the specified component columns to numeric
    for component in component_columns:
        gender_pivot[component] = pd.to_numeric(gender_pivot[component], errors='coerce')
    
    # Step 5: Calculate mean and standard deviation for each component
    means = {}
    stds = {}
    for component in component_columns:
        means[component] = gender_pivot[component].mean()
        stds[component] = gender_pivot[component].std()
    
    # Step 6: Set thresholds for outliers using mean ± specified standard deviations
    thresholds = {}
    for component in component_columns:
        high_threshold = means[component] + outlier_std.get(component, 2) * stds[component]
        low_threshold = means[component] - outlier_std.get(component, 2) * stds[component]
        thresholds[component] = (low_threshold, high_threshold)
    
    # Step 7: Identify outliers for each component
    outliers_dict = {}
    for component in component_columns:
        low_threshold, high_threshold = thresholds[component]
        outliers = gender_pivot[(gender_pivot[component] > high_threshold) | (gender_pivot[component] < low_threshold)]
        outliers_dict[component] = outliers.reset_index(drop=True)
    
    return outliers_dict
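
To make the thresholding in `gender_outliers` concrete, here is a small sketch under stated assumptions: the toy data, the borough codes `A`, `B`, `C`, and the single threshold `k = 2` are invented for illustration and cover only one component. It follows the same idea as the function above: pivot by sex, take the female-to-male ratio, and flag ratios further than `k` standard deviations from the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three boroughs, four ages, equal numbers of females and males
rows = []
for gss in ['A', 'B', 'C']:
    for age in range(4):
        rows.append({'gss_code': gss, 'year': 2020, 'age': age, 'sex': 'female', 'value': 50.0})
        rows.append({'gss_code': gss, 'year': 2020, 'age': age, 'sex': 'male', 'value': 50.0})
toy = pd.DataFrame(rows)

# Make one cell implausible: nine females per male
toy.loc[(toy['gss_code'] == 'C') & (toy['age'] == 3) & (toy['sex'] == 'female'), 'value'] = 450.0

# One row per geography/year/age with 'female' and 'male' columns
ratios = toy.pivot_table(
    index=['gss_code', 'year', 'age'], columns='sex', values='value', aggfunc='sum'
).reset_index()
ratios['ratio_female_to_male'] = (ratios['female'] / ratios['male']).replace([np.inf, -np.inf], np.nan)

# Flag ratios more than k standard deviations away from the mean
k = 2
mean = ratios['ratio_female_to_male'].mean()
std = ratios['ratio_female_to_male'].std()
outliers = ratios[(ratios['ratio_female_to_male'] - mean).abs() > k * std]
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme ratios, a very small sample can hide an outlier; this is one reason the notebook allows a different number of standard deviations per component.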


## Basic Checks
Perform basic checks on the dataset, including checking for missing values, duplicates, and descriptive statistics.

---
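
The individual checks in this section can also be bundled into one helper so they are easy to rerun on each dataset. This is only a sketch: the function name `basic_checks` and the example DataFrame are hypothetical, while the column names (`year`, `age`, `value`, `component`) follow the dataset used in this notebook.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame) -> dict:
    """Collect the basic QA checks into one dictionary."""
    # Components that contain at least one negative value
    negative = df.loc[df['value'] < 0, 'component'].unique().tolist()
    return {
        'year_range': (df['year'].min(), df['year'].max()),
        'missing_per_column': df.isnull().sum().to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
        'components_with_negatives': negative,
        'age_range_ok': bool(df['age'].min() == 0 and df['age'].max() == 90),
    }


# Hypothetical miniature dataset
example = pd.DataFrame({
    'year': [2011, 2012, 2012],
    'age': [0, 45, 90],
    'value': [10.0, -2.5, 3.0],
    'component': ['popn', 'netflow', 'deaths'],
})
report = basic_checks(example)
print(report)
```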

In [3]:
# Return the (max, min) year covered by the DataFrame
def get_year_range(df):
    return df['year'].max(), df['year'].min()

In [7]:
#year ranges
print(get_year_range(combined_10yr_fert))
print(get_year_range(combined_10yr_fert_ward))
print(get_year_range(combined_10yr_fert_boroughs))

(2050.0, 2002.0)
(2050.0, 2011.0)
(2050.0, 2002.0)


#### Missing values

In [8]:
missing_values = combined_10yr_fert.isnull().sum()
print("Missing values per column:\n", missing_values)

Missing values per column:
 gss_code              0
la_name               0
year                  0
sex                   0
age                   0
value                 0
component             0
gss_code_ward    771342
ward_name        771342
dtype: int64


#### duplicates

In [9]:
duplicates = combined_10yr_fert.duplicated().sum()
print("Number of duplicate rows:", duplicates)

Number of duplicate rows: 0


#### Descriptive data

In [10]:
combined_10yr_fert.describe()

Unnamed: 0,year,age,value
count,15428060.0,15428060.0,15428060.0
mean,2030.736,44.83779,48.7953
std,11.43556,26.35879,222.7471
min,2002.0,0.0,-713.0
25%,2021.0,22.0,0.0
50%,2031.0,45.0,0.8
75%,2041.0,68.0,47.9
max,2050.0,90.0,5922.6


#### Description by components

In [11]:
# Group by 'component' column
grouped = combined_10yr_fert.groupby('component')

# Apply describe to each group
described_groups = grouped.describe()

In [12]:
described_groups

component,year count,year mean,year std,year min,year 25%,year 50%,year 75%,year max,age count,age mean,...,age 75%,age max,value count,value mean,value std,value min,value 25%,value 50%,value 75%,value max
births,55614.0,2031.0,11.25473,2012.0,2021.0,2031.0,2041.0,2050.0,55614.0,0.0,...,0.0,0.0,55614.0,162.293288,377.997791,0.0,59.9,81.6,110.9,3264.0
deaths,5060874.0,2031.0,11.25463,2012.0,2021.0,2031.0,2041.0,2050.0,5060874.0,45.0,...,68.0,90.0,5060874.0,0.885798,6.327541,0.0,0.0,0.1,0.6,825.2
netflow,5120934.0,2030.712656,11.499358,2002.0,2021.0,2031.0,2041.0,2050.0,5120934.0,45.0,...,68.0,90.0,5120934.0,-0.182117,20.742161,-713.0,-2.3,-0.4,1.2,1938.0
popn,5190640.0,2030.5,11.543397,2011.0,2020.75,2030.5,2040.25,2050.0,5190640.0,45.0,...,68.0,90.0,5190640.0,142.610704,363.229346,0.0,45.0,71.2,103.6,5922.6


#### check for negative values in columns

In [13]:
# Check which components contain negative values
negative_values = combined_10yr_fert[combined_10yr_fert['value'] < 0]
print('components with negative values:', negative_values['component'].unique())

components with negative values: ['netflow']


#### check age range

In [14]:
# Print True if the maximum age is 90 and the minimum age is 0
print('Max age is 90 and min age is 0:', (combined_10yr_fert['age'].max() == 90) and (combined_10yr_fert['age'].min() == 0))

Max age is 90 and min age is 0: True


## Population Consistency Over Time

---

##### Place ages in bins. This evens out large fluctuations between single years of age where values are likely to be unusually high, e.g. 18-year-olds moving to university
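
`create_age_bins` is defined earlier in the notebook and is not reproduced here. As a rough illustration of the idea only, the sketch below bins single years of age with `pd.cut`. The labels `0-18`, `19-30`, `31-40` and `90+` appear in the outputs further down; the intermediate edges and labels, and the function name `bin_ages`, are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Right-inclusive bin edges: (-1, 18], (18, 30], ... ; the middle edges are assumed
AGE_EDGES = [-1, 18, 30, 40, 50, 60, 70, 80, 89, np.inf]
AGE_LABELS = ['0-18', '19-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-89', '90+']


def bin_ages(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the numeric 'age' column replaced by age-band labels."""
    out = df.copy()
    out['age'] = pd.cut(out['age'], bins=AGE_EDGES, labels=AGE_LABELS).astype(str)
    return out


# Hypothetical ages at the bin boundaries
ages = pd.DataFrame({'age': [0, 18, 19, 40, 89, 90], 'value': [1.0] * 6})
binned = bin_ages(ages)
print(binned)
```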

In [17]:
combined_10yr_fert_agebins = create_age_bins(combined_10yr_fert)

#### Separate components into columns

In [24]:
combined_10yr_fert_agebins_component_columns = combined_10yr_fert_agebins.pivot_table(index=['gss_code','gss_code_ward','sex', 'age','year'], columns='component', values='value').reset_index()



In [25]:
combined_10yr_fert_agebins_component_columns

component,gss_code,gss_code_ward,sex,age,year,births,deaths,netflow,popn
0,E09000001,E09000001,female,0-18,2011.0,,,,19.421053
1,E09000001,E09000001,female,0-18,2012.0,32.0,0.0,0.631579,20.368421
2,E09000001,E09000001,female,0-18,2013.0,36.0,0.0,-1.526316,19.947368
3,E09000001,E09000001,female,0-18,2014.0,26.0,0.0,0.052632,19.473684
4,E09000001,E09000001,female,0-18,2015.0,30.0,0.0,-0.210526,19.578947
...,...,...,...,...,...,...,...,...,...
489595,E09000033,E05013809,male,90+,2046.0,,7.8,0.900000,20.000000
489596,E09000033,E05013809,male,90+,2047.0,,8.1,0.900000,21.000000
489597,E09000033,E05013809,male,90+,2048.0,,8.5,0.900000,22.200000
489598,E09000033,E05013809,male,90+,2049.0,,8.7,0.900000,23.200000


In [None]:
# List of percentage change columns
pct_change_columns = ['births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'popn_pct_change']

# Call the function
descriptive_stats = view_descriptive_statistics(pivoted, pct_change_columns)

# Display the descriptive statistics
print(descriptive_stats)

In [40]:
outliers_dict_borough, z_score_type = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'popn'], 
    handle_inf=False, 
    Geography='borough',
    z_score_threshold=3,
    For_population_totals=False
    )



births_pct_change
births_pct_change used Robust Z-Score.
deaths_pct_change
deaths_pct_change used Robust Z-Score.
netflow_pct_change
netflow_pct_change used Robust Z-Score.
popn_pct_change
popn_pct_change used Robust Z-Score.


In [48]:
outliers_dict_ward, z_score_type = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'popn'], 
    handle_inf=False, 
    Geography='ward',
    z_score_threshold=3,
    For_population_totals=False
    )



births_pct_change
births_pct_change used Robust Z-Score.
deaths_pct_change
deaths_pct_change used Robust Z-Score.
netflow_pct_change
netflow_pct_change used Robust Z-Score.
popn_pct_change
popn_pct_change used Robust Z-Score.


In [79]:
outliers_dict_borough_inf_values, z_score_type = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'popn'], 
    handle_inf=True, 
    Geography='borough',
    z_score_threshold=3,
    For_population_totals=False
    )



component   gss_code gss_code_ward   sex   age    year  births  deaths  \
465482     E09000032     E05014015  male  0-18  2013.0     2.1     0.0   
465484     E09000032     E05014015  male  0-18  2015.0     0.2     0.0   

component   netflow      popn  births_pct_change  deaths_pct_change  \
465482     0.026316  1.300000                inf                NaN   
465484     0.457895  1.584211                inf                NaN   

component  netflow_pct_change  popn_pct_change  
465482               0.912281         0.020661  
465484               1.121951         0.127341  
births used Robust Z-Score.
component   gss_code gss_code_ward     sex    age    year  births    deaths  \
51         E09000001     E09000001  female  19-30  2022.0     NaN  0.083333   
85         E09000001     E09000001  female  31-40  2016.0     NaN  0.100000   
89         E09000001     E09000001  female  31-40  2020.0     NaN  0.100000   
92         E09000001     E09000001  female  31-40  2023.0     NaN  0.010

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['robust_z_score'] = (df_filtered[comp_col] - median) / mad

In [75]:
outliers_dict_ward_inf_values, z_score_type = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['births', 'deaths', 'netflow', 'popn'], 
    handle_inf=True, 
    Geography='ward',
    z_score_threshold=3,
    For_population_totals=False)



component   gss_code gss_code_ward   sex   age    year  births  deaths  \
465482     E09000032     E05014015  male  0-18  2013.0     2.1     0.0   
465484     E09000032     E05014015  male  0-18  2015.0     0.2     0.0   

component   netflow      popn  births_pct_change  deaths_pct_change  \
465482     0.026316  1.300000                inf                NaN   
465484     0.457895  1.584211                inf                NaN   

component  netflow_pct_change  popn_pct_change  
465482               0.912281         0.020661  
465484               1.121951         0.127341  
births used Robust Z-Score.
component   gss_code gss_code_ward     sex    age    year  births    deaths  \
51         E09000001     E09000001  female  19-30  2022.0     NaN  0.083333   
85         E09000001     E09000001  female  31-40  2016.0     NaN  0.100000   
89         E09000001     E09000001  female  31-40  2020.0     NaN  0.100000   
92         E09000001     E09000001  female  31-40  2023.0     NaN  0.010


component   gss_code gss_code_ward   sex  age    year  births  deaths  \
356372     E09000025     E05013907  male  90+  2023.0     NaN     0.0   
413972     E09000028     E05011114  male  90+  2023.0     NaN     0.6   

component  netflow  popn  births_pct_change  deaths_pct_change  \
356372         0.5   0.5                NaN                1.0   
413972         0.2   2.6                NaN                inf   

component  netflow_pct_change  popn_pct_change  
356372                   1.25              inf  
413972                    inf              inf  
popn used Robust Z-Score.




In [47]:
outliers_dict_borough

{'births_pct_change': component   gss_code gss_code_ward     sex   age    year  births    deaths  \
 2          E09000001     E09000001  female  0-18  2013.0    36.0  0.000000   
 3          E09000001     E09000001  female  0-18  2014.0    26.0  0.000000   
 4          E09000001     E09000001  female  0-18  2015.0    30.0  0.000000   
 5          E09000001     E09000001  female  0-18  2016.0    38.0  0.000000   
 6          E09000001     E09000001  female  0-18  2017.0    35.0  0.000000   
 ...              ...           ...     ...   ...     ...     ...       ...   
 489248     E09000033     E05013809    male  0-18  2019.0    73.8  0.000000   
 489250     E09000033     E05013809    male  0-18  2021.0    57.8  0.000000   
 489251     E09000033     E05013809    male  0-18  2022.0    54.0  0.052632   
 489252     E09000033     E05013809    male  0-18  2023.0    60.2  0.021053   
 489253     E09000033     E05013809    male  0-18  2024.0    61.9  0.021053   
 
 component   netflow       po

In [46]:
births_pct_change_borough_outlier_df = outliers_dict_borough['births_pct_change']
deaths_pct_change_borough_outlier_df = outliers_dict_borough['deaths_pct_change']
netflow_pct_change_borough_outlier_df = outliers_dict_borough['netflow_pct_change']
popn_pct_change_borough_outlier_df = outliers_dict_borough['popn_pct_change']

In [182]:
births_pct_change_ward_outlier_df = outliers_dict_ward['births_pct_change']
deaths_pct_change_ward_outlier_df = outliers_dict_ward['deaths_pct_change']
netflow_pct_change_ward_outlier_df = outliers_dict_ward['netflow_pct_change']
popn_pct_change_ward_outlier_df = outliers_dict_ward['popn_pct_change']

In [80]:
births_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['births']
deaths_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['deaths']
netflow_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['netflow']
popn_pct_change_borough_inf_outlier_df = outliers_dict_borough_inf_values['popn']

In [201]:
births_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['births']
deaths_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['deaths']
netflow_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['netflow']   
popn_pct_change_ward_inf_outlier_df = outliers_dict_ward_inf_values['popn']

## Total Outliers
Detect outliers in total population (`popn`) using Z-scores and Robust Z-scores, with two kinds of comparison:

- **Cross-sectional comparisons**: compare borough and ward totals with each other for a given year.
- **Temporal comparisons**: measure the percentage change between yearly totals for both boroughs and wards.

Totals are computed per geographical boundary and per year, i.e. `groupby('gss_code')['popn']` and `groupby('year')['popn']`.

---
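
The two comparisons can be sketched on a toy table as follows. The formulas are inferred from the printed columns in the outputs below (for example, a ward total of 1700.93 against a yearly mean of 1091.09 gives 55.89), so treat them as an assumption about what `calculate_zscores_and_find_outliers` does internally; the ward codes `W1` and `W2` and their values are invented.

```python
import pandas as pd

# Hypothetical ward totals for two wards and two years
totals = pd.DataFrame({
    'gss_code_ward': ['W1', 'W1', 'W2', 'W2'],
    'year': [2011, 2012, 2011, 2012],
    'popn': [100.0, 110.0, 300.0, 290.0],
}).sort_values(['gss_code_ward', 'year'])

# Temporal: percentage change from the previous year within each ward
totals['popn_pct_change_temporal'] = totals.groupby('gss_code_ward')['popn'].pct_change() * 100

# Cross-sectional: percentage difference from the mean of all wards in the same year
totals['popn_mean'] = totals.groupby('year')['popn'].transform('mean')
totals['popn_pct_change_cross'] = (totals['popn'] - totals['popn_mean']) / totals['popn_mean'] * 100

print(totals)
```

The first year of each ward has no previous value, so its temporal change is `NaN`, as in the first row of the temporal output.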

In [198]:
total, z_score_type = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['popn'], 
    handle_inf=False, 
    Geography='ward',
    For_population_totals=True,
    population_analysis_type='temporal'
    )

  gss_code_ward    year         popn  popn_pct_change_temporal
0     E05009317  2011.0  1700.927778                       NaN
1     E05009317  2012.0  1690.413480                 -0.618151
2     E05009317  2013.0  1719.007105                  1.691517
3     E05009317  2014.0  1705.480351                 -0.786893
4     E05009317  2015.0  1639.372456                 -3.876204


In [200]:
total, z_score_type = calculate_zscores_and_find_outliers(
    combined_10yr_fert_agebins_component_columns, 
    ['popn'], 
    handle_inf=False, 
    Geography='ward',
    For_population_totals=True,
    population_analysis_type='cross-sectional'
    )

  gss_code_ward    year         popn    popn_mean  popn_pct_change_cross
0     E05009317  2011.0  1700.927778  1091.088114              55.892797
1     E05009317  2012.0  1690.413480  1109.064110              52.418013
2     E05009317  2013.0  1719.007105  1126.260870              52.629568
3     E05009317  2014.0  1705.480351  1143.567766              49.136798
4     E05009317  2015.0  1639.372456  1158.691562              41.484801


{'popn_pct_change': component   gss_code gss_code_ward     sex    age    year  births  deaths  \
 1          E09000001     E09000001  female   0-18  2012.0    32.0     0.0   
 11         E09000001     E09000001  female   0-18  2022.0    34.0     0.0   
 12         E09000001     E09000001  female   0-18  2023.0    25.9     0.0   
 41         E09000001     E09000001  female  19-30  2012.0     NaN     0.0   
 43         E09000001     E09000001  female  19-30  2014.0     NaN     0.0   
 ...              ...           ...     ...    ...     ...     ...     ...   
 489595     E09000033     E05013809    male    90+  2046.0     NaN     7.8   
 489596     E09000033     E05013809    male    90+  2047.0     NaN     8.1   
 489597     E09000033     E05013809    male    90+  2048.0     NaN     8.5   
 489598     E09000033     E05013809    male    90+  2049.0     NaN     8.7   
 489599     E09000033     E05013809    male    90+  2050.0     NaN     9.3   
 
 component   netflow       popn  births_pct

## Gender Outliers
Investigate gender outliers, focusing on abnormal gender ratios and adjusting thresholds as needed.

---

In [150]:
component_columns = ['births', 'deaths', 'netflow', 'popn']
gender_outlier_dictionary = gender_outliers(combined_10yr_fert, component_columns, geography='borough', outlier_std={'births': 2, 'deaths': 5, 'netflow': 2, 'popn': 5})



In [152]:
gender_outlier_dictionary
births_gender_outliers_df = gender_outlier_dictionary['births']
deaths_gender_outliers_df = gender_outlier_dictionary['deaths']
netflow_gender_outliers_df = gender_outlier_dictionary['netflow']
popn_gender_outliers_df = gender_outlier_dictionary['popn']

## Key Visualisations
Visualise important trends in the dataset, including distribution by components, age groups, yearly totals, and population pyramids.

---


#### Yearly totals by component visualisation

In [165]:
yearly_totals = combined_10yr_fert.groupby(['year','component'])['value'].sum().reset_index()
print("Yearly totals by year and component:\n", yearly_totals)

Yearly totals by year and component:
        year component       value
0    2002.0   netflow      7036.6
1    2003.0   netflow    -37021.3
2    2004.0   netflow    -16399.9
3    2005.0   netflow     21802.2
4    2006.0   netflow      9229.4
..      ...       ...         ...
162  2049.0      popn  19763868.6
163  2050.0    births    226639.0
164  2050.0    deaths    137879.9
165  2050.0   netflow    -57727.1
166  2050.0      popn  19794455.5

[167 rows x 3 columns]


In [166]:

# Create a line chart using Plotly Express
fig = px.line(
    yearly_totals, 
    x='year', 
    y='value', 
    color='component', 
    markers=True,
    title="Yearly Totals by Component Over the Years",
    labels={'value': 'Total Value', 'year': 'Year'},
    
)

# Update layout for better appearance
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Total Value',
    legend_title='Component',
    width=900,
    height=600
)

# Show the figure
fig.show()

#### population pyramid

In [162]:
#for unit age
combined_10yr_fert_popn = combined_10yr_fert[combined_10yr_fert['component'] == 'popn']
population_pyramids_unit_age = combined_10yr_fert_popn.copy()
population_pyramids_unit_age = population_pyramids_unit_age[~population_pyramids_unit_age['gss_code_ward'].isna()]

In [163]:
population_pyramids_unit_age

Unnamed: 0,gss_code,la_name,year,sex,age,value,component,gss_code_ward,ward_name
10477662,E09000001,City of London,2011.0,female,0.0,37.0,popn,E09000001,City of London
10477663,E09000001,City of London,2011.0,female,1.0,34.0,popn,E09000001,City of London
10477664,E09000001,City of London,2011.0,female,2.0,28.0,popn,E09000001,City of London
10477665,E09000001,City of London,2011.0,female,3.0,18.0,popn,E09000001,City of London
10477666,E09000001,City of London,2011.0,female,4.0,21.0,popn,E09000001,City of London
...,...,...,...,...,...,...,...,...,...
15428057,E09000033,Westminster,2050.0,male,86.0,16.9,popn,E05013809,Westbourne
15428058,E09000033,Westminster,2050.0,male,87.0,13.3,popn,E05013809,Westbourne
15428059,E09000033,Westminster,2050.0,male,88.0,11.0,popn,E05013809,Westbourne
15428060,E09000033,Westminster,2050.0,male,89.0,9.9,popn,E05013809,Westbourne


In [164]:
app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Dropdown(
        id='location-dropdown',
        options=[{'label': loc, 'value': loc} for loc in population_pyramids_unit_age['la_name'].unique()] + [{'label': 'London Total (All LAs)', 'value': 'London Total (All LAs)'}],
        value=population_pyramids_unit_age['la_name'].unique()[0]
    ),
    dcc.Dropdown(
        id='ward-dropdown'
    ),
    dcc.Graph(id='population-pyramid')
])

@app.callback(
    Output('ward-dropdown', 'options'),
    Output('ward-dropdown', 'value'),
    Input('location-dropdown', 'value')
)
def update_ward_dropdown(selected_location):
    if selected_location == 'London Total (All LAs)':
        # If "London Total (All LAs)" is selected, offer only the 'All Wards' option
        return [{'label': 'All Wards', 'value': 'All Wards'}], 'All Wards'
    else:
        # Filter data for the selected location to get ward names
        filtered_data = population_pyramids_unit_age[population_pyramids_unit_age['la_name'] == selected_location]
        
        # Get unique ward names for the selected location
        ward_options = [{'label': ward, 'value': ward} for ward in filtered_data['ward_name'].unique()]
        
        # Add 'All Wards' option
        ward_options.insert(0, {'label': 'All Wards', 'value': 'All Wards'})
        
        # Return the options and set the default value to 'All Wards'
        return ward_options, 'All Wards'

@app.callback(
    Output('population-pyramid', 'figure'),
    Input('location-dropdown', 'value'),
    Input('ward-dropdown', 'value')
)
def update_pyramid(selected_location, selected_ward):
    if selected_location == 'London Total (All LAs)':
        # Combine data for all locations
        filtered_data = population_pyramids_unit_age.copy()
        title = 'Population Pyramid for London Total (All LAs)'
    elif selected_ward == 'All Wards':
        # Combine data for all wards in the selected location
        filtered_data = population_pyramids_unit_age[population_pyramids_unit_age['la_name'] == selected_location]
        title = f'Population Pyramid for {selected_location} - All Wards'
    else:
        # Filter data for the selected location and ward
        filtered_data = population_pyramids_unit_age[
            (population_pyramids_unit_age['la_name'] == selected_location) & 
            (population_pyramids_unit_age['ward_name'] == selected_ward)
        ]
        title = f'Population Pyramid for {selected_location} - {selected_ward}'

    # Negate female population values to create a pyramid.
    # Work on a copy so the filtered slice can be modified without a SettingWithCopyWarning.
    filtered_data = filtered_data.copy()
    filtered_data.loc[filtered_data['sex'] == 'female', 'value'] *= -1

    # Create a plotly express bar chart with animation
    fig = px.bar(
        filtered_data,
        x='value',
        y='age',
        color='sex',
        animation_frame='year',
        orientation='h',
        title=title,
        labels={'value': 'Population', 'age': 'Age'},
        color_discrete_map={'male': 'blue', 'female': 'pink'},
        height=600,
        range_x=[-filtered_data['value'].abs().max()*1.2, filtered_data['value'].abs().max()*1.2],  # symmetric x-axis, kept consistent over time for comparison
        hover_data={'ward_name': True}
    )

    # Update layout for better appearance
    fig.update_layout(
        barmode='relative',
        xaxis_title='Population',
        yaxis_title='Age',
        showlegend=True,
        width=800,
    )
    return fig

if __name__ == '__main__':
    app.run_server(debug=True, port=1223)





## Collate Outliers
Summarize and compile all detected outliers for further analysis.

---
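
The cells below collect outlier DataFrames by scanning `globals()` for names containing "outlier". As an alternative sketch, assuming the DataFrames are instead gathered into an explicit dictionary (here the hypothetical `outlier_frames`, filled with invented rows), the tally and the list of contributing DataFrames can be produced in a single vectorised step.

```python
import pandas as pd

# Hypothetical stand-ins for the outlier DataFrames produced earlier in the notebook
outlier_frames = {
    'births_pct_change_borough_outlier_df': pd.DataFrame(
        {'gss_code': ['E09000001', 'E09000002'], 'year': [2021, 2022], 'age': ['0-18', '90+']}
    ),
    'popn_pct_change_borough_outlier_df': pd.DataFrame(
        {'gss_code': ['E09000001'], 'year': [2021], 'age': ['0-18']}
    ),
}

key_cols = ['gss_code', 'year', 'age']

# Stack the key columns of every frame and remember which frame each row came from
stacked = pd.concat(
    [frame[key_cols].assign(source=name) for name, frame in outlier_frames.items()],
    ignore_index=True,
)

# Count occurrences per key and list the contributing DataFrames
tally = (
    stacked.groupby(key_cols)
    .agg(
        count=('source', 'size'),
        dataframes=('source', lambda s: ', '.join(sorted(set(s)))),
    )
    .reset_index()
    .sort_values('count', ascending=False)
)
print(tally)
```

An explicit dictionary avoids picking up stale variables left over from earlier runs of the notebook, which a `globals()` scan cannot distinguish from current results.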

In [167]:
# Filter the global variables to find dataframes with 'outlier' in their name
outlier_dfs = {name: df for name, df in globals().items() if isinstance(df, pd.DataFrame) and 'outlier' in name.lower()}

# Display the name, columns, and length of each dataframe
for name, df in outlier_dfs.items():
    print(f"DataFrame Name: {name}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"Length: {len(df)}")
    print("\n" + "-"*50 + "\n")

DataFrame Name: births_pct_change_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'popn', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'popn_pct_change', 'robust_z_score']
Length: 16305

--------------------------------------------------

DataFrame Name: deaths_pct_change_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'popn', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'popn_pct_change', 'robust_z_score']
Length: 117296

--------------------------------------------------

DataFrame Name: netflow_pct_change_outlier_df
Columns: ['gss_code', 'gss_code_ward', 'sex', 'age', 'year', 'births', 'deaths', 'netflow', 'popn', 'births_pct_change', 'deaths_pct_change', 'netflow_pct_change', 'popn_pct_change', 'robust_z_score']
Length: 138078

--------------------------------------------------

DataFrame Name: popn_pct_change_outlier_df
Columns: ['gss_

In [168]:
# Initialise an empty list to store relevant rows from all DataFrames
all_rows = []

# Loop through all outlier dataframes
for name, df in outlier_dfs.items():
    # Check if 'gss_code' and 'year' columns exist
    if 'gss_code' in df.columns and 'year' in df.columns:
        # Select the relevant columns: 'gss_code', 'year' and 'age' (if it exists)
        cols = ['gss_code', 'year']
        if 'age' in df.columns:
            cols.append('age')

        # Append the relevant data from the current DataFrame to the list
        all_rows.append(df[cols])

# Concatenate all collected data into one DataFrame
if all_rows:
    combined_df = pd.concat(all_rows)

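    # Group by 'gss_code', 'year', and 'age' (where present) and count occurrences.
    # `cols` from the loop above only reflects the last DataFrame visited, so derive the keys from combined_df.
    group_cols = [c for c in ['gss_code', 'year', 'age'] if c in combined_df.columns]
    tally_df = combined_df.groupby(group_cols).size().reset_index(name='count')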

    # Sort the result by count in descending order
    tally_df = tally_df.sort_values(by='count', ascending=False)

    # Display the top rows of the tally DataFrame
    print(tally_df)
else:
    print("No relevant data found.")

        gss_code    year   age  count
2679   E09000008  2022.0   90+    417
2668   E09000008  2021.0   90+    415
2638   E09000008  2018.0   90+    412
2605   E09000008  2015.0   90+    399
7975   E09000022  2022.0   90+    396
...          ...     ...   ...    ...
10774  E09000029  2034.0  90.0      1
9150   E09000025  2028.0  27.0      1
10764  E09000029  2033.0  90.0      1
10754  E09000029  2032.0  90.0      1
0      E09000001  2011.0   2.0      1

[12366 rows x 4 columns]


In [169]:
from collections import defaultdict

# Initialize a dictionary to store the occurrences
occurrences = defaultdict(lambda: {'count': 0, 'dataframes': set()})

# Loop through all outlier dataframes
for name, df in outlier_dfs.items():
    # Check if 'gss_code' and 'year' columns exist
    if 'gss_code' in df.columns and 'year' in df.columns:
        # Select the relevant columns: 'gss_code', 'year' and 'age' (if it exists)
        cols = ['gss_code', 'year']
        if 'age' in df.columns:
            cols.append('age')

        # Iterate through the rows as plain tuples (considerably faster than iterrows)
        for key in df[cols].itertuples(index=False, name=None):
            occurrences[key]['count'] += 1  # Increment the count for this combination
            occurrences[key]['dataframes'].add(name)  # Record which dataframe it came from

In [170]:
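# Convert the occurrences dictionary from the previous cell into a DataFrame.
# `results` and `columns` must be built first; referencing them undefined raises a NameError.
# Assumes every key has the same length as `cols` (the key columns of the last dataframe processed).
results = [
    list(key) + [value['count'], sorted(value['dataframes'])]
    for key, value in occurrences.items()
]
columns = cols + ['count', 'dataframes']
tally_df = pd.DataFrame(results, columns=columns)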

In [None]:
import pandas as pd
from collections import defaultdict

# Initialize a defaultdict to track occurrences of each combination per DataFrame
occurrences = defaultdict(lambda: defaultdict(int))

# Loop through all outlier dataframes
for name, df in outlier_dfs.items():
    # Check if 'gss_code' and 'year' columns exist
    if 'gss_code' in df.columns and 'year' in df.columns:
        # Select the relevant columns: 'gss_code', 'year' and 'age' (if it exists)
        cols = ['gss_code', 'year']
        if 'age' in df.columns:
            cols.append('age')

        # Iterate through the rows as plain tuples (considerably faster than iterrows)
        for key in df[cols].itertuples(index=False, name=None):
            occurrences[key][name] += 1  # Increment the count for the specific dataframe

# Convert the defaultdict to a list of dictionaries for easier conversion to a DataFrame
results = []

# Get the list of all dataframe names
all_dataframe_names = list(outlier_dfs.keys())

# Populate the results list with the occurrences
for key, value in occurrences.items():
    # Start with the key (gss_code, year, age)
    row = list(key)

    # Add the counts for each dataframe, setting to 0 if not present
    for df_name in all_dataframe_names:
        row.append(value.get(df_name, 0))

    # Add the total count across all dataframes
    total_count = sum(row[len(key):])  # Sum from the index after the key columns to the end
    row.append(total_count)

    results.append(row)

# Create column names from the key columns, the dataframe names, and a 'total' column
# (assumes every outlier dataframe contributed the same key columns as the last one processed)
columns = cols + all_dataframe_names + ['total']

# Convert results list into a DataFrame
tally_df = pd.DataFrame(results, columns=columns)

# Sort the DataFrame by the 'total' column in descending order
tally_df = tally_df.sort_values(by='total', ascending=False)
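
The per-DataFrame tally above can also be sketched without Python-level row loops, which may matter when individual outlier DataFrames hold more than 100,000 rows. The helper name `tally_outliers` and the `source` column are hypothetical, introduced only for this illustration. The sketch assumes each outlier DataFrame either carries all three key columns or is skipped, and that rows with a missing key value can be dropped (the default `groupby` behaviour).

```python
import pandas as pd


def tally_outliers(outlier_dfs, key_cols=("gss_code", "year", "age")):
    """Hypothetical helper: count each key combination per outlier DataFrame."""
    key_cols = list(key_cols)
    frames = []
    for name, df in outlier_dfs.items():
        # Assumption: skip any DataFrame that lacks one of the key columns
        if not all(col in df.columns for col in key_cols):
            continue
        part = df[key_cols].copy()
        part["source"] = name  # remember which DataFrame each row came from
        frames.append(part)

    if not frames:
        return pd.DataFrame(columns=key_cols + ["total"])

    combined = pd.concat(frames, ignore_index=True)

    # One row per key combination, one column per source DataFrame.
    # Rows with a missing key value are dropped by groupby's default behaviour.
    counts = (
        combined.groupby(key_cols + ["source"], sort=False)
        .size()
        .unstack("source", fill_value=0)
    )
    counts["total"] = counts.sum(axis=1)

    return counts.sort_values("total", ascending=False).reset_index()
```

Calling `tally_outliers(outlier_dfs)` should give the same kind of table as the loop-based cell: the key columns, one count column per outlier DataFrame that had all key columns, and a `total` column sorted in descending order.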
