# Comparison of Population Estimates and Methods for Afghanistan 

##  GHS Population dataset
* The GHS population dataset is available [here](https://ghsl.jrc.ec.europa.eu/download.php?ds=pop)
* Has a 5-year temporal resolution from 1975 to 2030
* The raw global census data harmonized by CIESIN (GPWv4.11) is disaggregated to grid cells
* The population growth rate is also from CIESIN
* GHS-Built-S, based on Sentinel/Landsat imagery, is used as the target layer for disaggregation of the population estimates
* The estimates are scaled to match:
    * The total population time series at ’city’ level from the extended database feeding the UN World Urbanisation Prospects 2018
    * The total population time series at country level provided by the UN World Population Prospects 2019

##  WorldPop
* The WorldPop population dataset is available [here](https://hub.worldpop.org/geodata/listing?id=29)
* A Random Forest model is used to generate weights for dasymetric redistribution
* The predictors used include night-time lights, topographic variables, land cover and climatic variables
* The output of the model is used as a weighting layer for dasymetric mapping
* Growth rates for rural and urban areas are from the UN World Urbanization Prospects Database, 2012 version (UNPD)
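
The weighting step above can be illustrated with a minimal sketch. It assumes a single census unit whose total is split across its grid cells in proportion to a weighting layer. The function name `dasymetric_redistribute` and the weights are hypothetical stand-ins, not WorldPop's actual implementation or model output.

```python
def dasymetric_redistribute(unit_total, cell_weights):
    """Split one census unit's total across its cells in proportion to the weights."""
    weight_sum = sum(cell_weights)
    if weight_sum <= 0:
        raise ValueError("at least one cell weight must be positive")
    return [unit_total * w / weight_sum for w in cell_weights]


# Hypothetical weights for a four-cell census unit (stand-ins for model output).
cell_weights = [0.5, 1.5, 0.0, 2.0]
cell_population = dasymetric_redistribute(10_000, cell_weights)
print(cell_population)
```

A cell with zero weight receives no population, and the cell values always sum back to the census unit total.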

## WorldPop Population Adjusted to Match UNPD Estimate
* The WorldPop UNPD-adjusted dataset is available [here](https://hub.worldpop.org/geodata/summary?id=49924)
* The population source is 2010 estimates from UN Urban and Rural Population by Age and Sex (URPAS)
* The Built-Settlement Growth Model is used to identify unsettled pixels
* Dasymetric redistribution based on Random Forest is used for mapping
* The country total population is adjusted to match the official United Nations population estimates
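
The national adjustment in the last bullet amounts to multiplying every cell by one scaling factor. The sketch below is an illustration under that assumption only. `adjust_to_national_total` and the sample values are hypothetical, and the real product may handle no-data cells differently.

```python
import math


def adjust_to_national_total(cell_values, national_total):
    """Rescale gridded estimates so that they sum to a given national total."""
    # Cells without data are stored as NaN here and are excluded from the sum.
    modelled_total = sum(v for v in cell_values if not math.isnan(v))
    if modelled_total <= 0:
        raise ValueError("modelled total must be positive")
    factor = national_total / modelled_total
    # NaN multiplied by the factor stays NaN, so no-data cells are preserved.
    return [v * factor for v in cell_values]


# Hypothetical gridded estimates and a hypothetical official national total.
modelled = [120.0, 80.0, float("nan"), 200.0]
adjusted = adjust_to_national_total(modelled, 500.0)
print(adjusted)
```

Because the same factor is applied everywhere, the relative distribution between cells is unchanged and only the total moves.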


## Landscan
* The landscan population data is available [here](https://landscan.ornl.gov/)
* Uses a top-down modeling approach to disaggregate census data
* Spatial data and high-resolution imagery are used for multi-variable dasymetric modeling
* The ML models are country specific

## GPW
* The GPW v4 is available [here](https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-adjusted-to-2015-unwpp-country-totals-rev11/data-download)
* Raw census estimates are extrapolated to obtain data for the target years 2000, 2005, 2010, 2015 and 2020.
* The estimates are proportionally allocated to raster cells using a uniform areal-weighting approach.
* The estimated populations have been nationally adjusted to match the United Nations World Population Prospects: The 2015 Revision (United Nations, 2015).
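
Uniform areal weighting assumes that population density is constant within a census unit, so each cell receives population in proportion to the area of the unit it covers. The sketch below shows that idea with hypothetical numbers. `areal_weighting` is an illustrative name, not a GPW function.

```python
def areal_weighting(unit_population, unit_area, overlap_areas):
    """Allocate a unit's population to cells assuming uniform density within the unit."""
    density = unit_population / unit_area
    return [density * area for area in overlap_areas]


# Hypothetical unit of 3 km2 with 3,000 people that covers two full
# 1 km2 cells and half of two further cells.
overlaps_km2 = [1.0, 1.0, 0.5, 0.5]
allocated = areal_weighting(3000, 3.0, overlaps_km2)
print(allocated)
```

Unlike the dasymetric approaches used by the other datasets, no ancillary data such as settlement layers influences the allocation.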

## NSIA
* The population data is extracted from this [pdf](http://nsia.gov.af:8080/wp-content/uploads/2022/05/%D8%A8%D8%B1%D8%A2%D9%88%D8%B1%D8%AF-%D9%86%D9%81%D9%88%D8%B3-%D8%B3%D8%A7%D9%84-1401_compressed.pdf)
* The 2003 - 05 household listing data was used to estimate the population for 2022 - 23
* The formula used to estimate population is $P_{t} = P_{0}e^{rt}$ where $P_{t}$ is estimated population, $P_{0}$ is the base year population, $r$ is population growth rate and $t$ is time interval
* The population for 2004 is considered as the base year.
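
The exponential growth formula above can be sketched as follows. The base population and growth rate are hypothetical values chosen for illustration, not NSIA figures. The second function inverts the formula to recover the growth rate implied by two population counts.

```python
import math


def project_population(base_population, growth_rate, years):
    """Apply P_t = P_0 * exp(r * t)."""
    return base_population * math.exp(growth_rate * years)


def implied_growth_rate(base_population, later_population, years):
    """Invert the formula: r = ln(P_t / P_0) / t."""
    return math.log(later_population / base_population) / years


# Hypothetical inputs: 1,000,000 people in the base year, 2% annual growth,
# projected 18 years ahead (for example from 2004 to 2022).
projected = project_population(1_000_000, 0.02, 18)
recovered_rate = implied_growth_rate(1_000_000, projected, 18)
print(round(projected), round(recovered_rate, 4))
```

Because the base year is almost two decades before the target year, small errors in $r$ compound: the projected total is multiplied by $e^{\Delta r \cdot t}$ for an error of $\Delta r$ in the growth rate.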

## FlowMinder
* The population data is extracted from this [pdf](https://www.ipcinfo.org/fileadmin/user_upload/ipcinfo/docs/IPC_Afghanistan_AcuteFoodInsec_2022Mar_2022Nov_report.pdf). The method used to produce the Flowminder estimates is not clear from an online search; it seems that mobile phone data is one of the inputs.

In [16]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [17]:
#import warnings
#warnings.simplefilter(action='ignore')

In [18]:
import os

In [19]:
import numpy as np
import pandas as pd
import geopandas as gpd
from rasterstats import zonal_stats
import luigi

from luigi.util import requires
from kiluigi.tasks import Task, ExternalTask
from kiluigi.targets import CkanTarget, LocalTarget, IntermediateTarget
from luigi.configuration import get_config
import statsmodels.formula.api as smf
from shapely import wkt

In [20]:
class PullFlowMinderPopulation(ExternalTask):
    """
    Pull flow minder population count
    """
    def output(self):
        return {2022: CkanTarget(
        dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"},
        resource={"id": "8d8bd8ce-57f1-4746-a91a-a4df43a72cfc"}
        ),
        2021: CkanTarget(
        dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"},
        resource={"id": "b4548813-2779-4a2f-bad4-022d59ba0db8"}
        ),
        }

    
class PullNSIAPopulation(ExternalTask):
    """
    Pull NSIA population estimate for 2020-23, 2021-24 and 2022-25
    """
    def output(self):
        return CkanTarget(
        dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"},
        resource={"id": "5363e5a0-498f-4791-ace0-468ed0784354"}
            
        )

In [21]:
@requires(PullFlowMinderPopulation)
class FlowMinderPopData(Task):
    def output(self):
        return IntermediateTarget(task=self, timeout=360000)
    
    @staticmethod
    def get_admin_d_pop(x):
        x=x['flow']
        x = x.split(" ")
        name = x[0]
        try:
            pop = int(float(x[1].replace(",", "")))
        except ValueError:
            name = f"{x[0]} {x[1]}"
            pop = int(float(x[2].replace(",", "")))
        return name, pop
    
    def process(self, df, dataset=2022):
        if  dataset==2022:
            df[['admin1', 'flowminder']] = df.apply(lambda x: self.get_admin_d_pop(x), axis=1, result_type='expand')
            df = df.drop("flow", axis=1)
            df['name'] = df['admin1'].str.replace(" Urban", "")
            df = df.groupby(by='name', as_index=False)['flowminder'].sum()
        else:
            df = df.drop(0, axis=0)
            df = df.drop(['Admin0_Name', 'Admin0_Code', 'Admin1_Code'], axis=1)
            rename_map = {i:f"{i.strip()}_2021" for i in df.columns}
            rename_map.update({"Admin1_Name": "name", 'T_TL_2021': 'flowminder_2021', "M_TL_2021": "flowminder_2021_male", "F_TL_2021": "flowminder_2021_female"})
            df = df.rename(columns=rename_map)
            for i in df.columns:
                if i != "name":
                    df[i] = df[i].astype(float)
            df['name'] = df['name'].replace({'Hilmand': 'Helmand', 'Maidan Wardak': 'Wardak', 'Sar-e-Pul': 'Sari pul'})
        return df

    def run(self):
        df = pd.read_excel(self.input()[2022].get(), header=None, names=["flow"])
        df = self.process(df)
        df_21 = pd.read_csv(self.input()[2021].get())
        df_21 = self.process(df_21, 2021)
        df = df.merge(df_21, on='name', how='left')
        with self.output().open("w") as out:
            out.write(df)

In [22]:
@requires(PullNSIAPopulation)
class ReadNSIAPopulation(Task):
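    """
    Read the NSIA population estimates, harmonise the province names and
    convert the counts from thousands to persons
    """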
    def output(self):
        return IntermediateTarget(task=self, timeout=360000)
    
    @staticmethod
    def process(df):
        df = df.rename(columns={i: i.strip() for i in df.columns})
        df['Province'] = df['Province'].str.strip()
        df['Province'] = df['Province'].replace({'Helmand': 'Hilmand', 'Herat': 'Hirat', 'Kunarha': 'Kunar', 'Nooristan': 'Nuristan', 'Urozgan': 'Uruzgan'})
        df = df.rename(columns={'Province': 'name', '1399/ 2020-23': 'nsia_2020', '1400/ 2021-24':  'nsia_2021', '1401/ 2022-25':  'nsia_2022'})
        for i in ['nsia_2020', 'nsia_2021', 'nsia_2022']:
            df[i] = df[i]* 1000
        return df
    
    def run(self):
        df = pd.read_excel(self.input().get())
        df = self.process(df)
        with self.output().open("w") as out:
            out.write(df)    

In [23]:
class NSIAPopData(ExternalTask):
    def output(self):
        return IntermediateTarget(task=self, timeout=360000)
    
    def process(self):
        nsia_pop_map = {
            "Badakhshan": 1091.76,
            "Badghis": 569.15,
            "Baghlan": 1053.20,
            "Balkh": 1578.51,
            "Bamyan": 513.19,
            "Daykundi": 534.68,
            "Farah": 583.42,
            "Faryab": 1_150.15,
            "Ghazni": 1_411.38,
            "Ghor": 791.48,
            "Helmand": 1498.48,
            "Hirat": 2234.66,
            "Jawzjan": 625.07,
            "Kabul": 5572.63,
            "Kandahar": 1464.89,
            "Kapisa": 505.50,
            "Khost": 659.10,
            "Kunar": 517.18,
            "Kunduz": 1_184.02,
            "Laghman": 510.93,
            "Logar": 449.81,
            "Nangarhar": 1_769.99,
            "Nimroz": 190.43,
            "Nuristan": 169.58,
            "Paktika": 802.86,
            "Paktya": 633.87,
            "Panjsher": 175.91,
            "Parwan": 764.58,
            "Samangan": 446.10,
            "Sari pul": 643.53,
            "Takhar": 1_133.57,
            "Uruzgan": 451.64,
            "Wardak": 683.54,
            "Zabul": 398.05,
        }
        df = pd.DataFrame([nsia_pop_map]).T
        df = df.reset_index()
        df.columns = ["name", "nsia"]
        df["nsia"] = df["nsia"] * 1000
        return df
    
    def run(self):
        df = self.process()
        with self.output().open("w") as out:
            out.write(df)

In [24]:
class PullFDWPopulationData(ExternalTask):
    def output(self):
        return CkanTarget(
            dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"},
            resource={"id": "d18c10c0-ba12-4460-9a1c-138e225d4dbf"}
        )


@requires(PullFDWPopulationData)
class ReadFDWPopData(Task):
    def output(self):
        return IntermediateTarget(task=self, timeout=36000)
    def run(self):
        df = pd.read_json(self.input().get())
        df['admin_1'] = df['admin_1'].replace({'Maydan Wardak': 'Maidan Wardak', 'Sari Pul': 'Sar-e-Pul'})
        df = df.groupby(by=['admin_1', 'period_date'], as_index=False)['value'].sum()
        df = pd.pivot_table(df, index=['admin_1'], columns='period_date')
        df.columns = [i[1]  for i in df.columns]
        df = df.reset_index()
        df = df.rename(columns={'2004-12-31': 'FDW_2004', '2011-12-31': 'FDW_2011', 'admin_1': 'name'})
        with self.output().open("w") as out:
            out.write(df)

In [25]:
class Admin1GeoJson(ExternalTask):
    def output(self):
        return CkanTarget(
        dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"},
        resource={"id": "288a0783-01bd-4e7d-b51d-ddf4431b0874"}
        )

In [26]:
@requires(FlowMinderPopData, Admin1GeoJson, ReadNSIAPopulation, ReadFDWPopData)
class Admin1GeomToPopData(Task):
    def output(self):
        return IntermediateTarget(task=self, timeout=360000)
    
    def process(self, gdf, df_minder):
        df = df_minder.copy()
        df["name"] = df["name"].replace(
            {"Helmand": "Hilmand", "Sari pul": "Sar-e-Pul", "Wardak": "Maidan Wardak"}
        )
        gdf = gdf.merge(df, on="name", how="outer")
        return gdf
    
    def run(self):
        with self.input()[0].open() as src:
            df_minder = src.read() 
        gdf = gpd.read_file(self.input()[1].get())
        
        with self.input()[2].open() as src:
            df_nsia_3_levels = src.read()
        
        gdf = self.process(gdf, df_minder)
        gdf = gdf.merge(df_nsia_3_levels, on='name', how='outer')
        
        with self.input()[3].open() as src:
            df_fdw = src.read()
        
        gdf = gdf.merge(df_fdw, on='name', how='outer')
        with self.output().open("w") as out:
            out.write(gdf) 

In [27]:
class RasterPopData(ExternalTask):
    def output(self):
        return {
            "landscan_2020": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "e0aeee8b-8b28-4394-b6b6-0a9917dc228a"}),
            "landscan_2021": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "b39afc1d-d367-4a67-8c88-2a6b4af5375c"}),
            "world_pop_2020_UNadj": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "ea289d79-b4ac-4da6-8550-38486bc245ff"}),
            "world_pop_2019_UNadj": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "874b00cd-729a-49af-b5b8-e33198ad27a7"}),
            "world_pop_2020": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "a53f74a1-e02f-477c-a0da-86aa904bf9f8"}),
            "world_pop_2019": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "0b19ec59-dc97-4e80-91b7-d7be37f11106"}),
            "gpw_2020": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "99253feb-20bc-4086-9741-95304f050334"}),
            "ghs_2020": CkanTarget(dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"}, resource={"id": "436eadef-947c-4238-a1bc-d21c33db6ff0"}),
        }

In [28]:
@requires(Admin1GeomToPopData, RasterPopData)
class ExtractZonalStats(Task):
    def output(self):
        return IntermediateTarget(task=self, timeout=360000)
    
    @staticmethod
    def get_zonal_stats(gdf, file_target, var_name):
        # Sum the raster cells whose centres fall inside each polygon
        stats = zonal_stats(gdf, file_target.get(), stats="sum", all_touched=False, geojson_out=True)
        gdf = gpd.GeoDataFrame.from_features(stats)
        gdf = gdf.rename(columns={"sum": var_name})
        return gdf
    
    def run(self):
        with self.input()[0].open() as src:
            gdf = src.read()
        for var_name, file_target in self.input()[1].items():
            print(f"Computing zonal stats for {var_name}")
            gdf = self.get_zonal_stats(gdf, file_target, var_name)
        with self.output().open("w") as out:
            out.write(gdf)  

Cache the output of the `ExtractZonalStats` in ckan [here](https://data.kimetrica.com/dataset/afghanistan-population-datasets/resource/3238efb3-9543-43f6-a149-cbd2a2b4d117) as it takes some time to compute.
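
The caching idea can be sketched locally as a compute-once helper. This is a generic illustration that uses only the standard library and a temporary file, not the CKAN upload itself. `load_or_compute`, `slow_zonal_stats` and the rows it returns are hypothetical placeholders.

```python
import csv
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_rows):
    """Return the cached rows if the CSV exists, otherwise compute and store them."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        rows = compute_rows()
        with cache_path.open("w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    # Always read from disk so both code paths return the same string-valued rows.
    with cache_path.open(newline="") as src:
        return [dict(row) for row in csv.DictReader(src)]


calls = []


def slow_zonal_stats():
    # Stand-in for the expensive zonal statistics step; the values are placeholders.
    calls.append(1)
    return [{"name": "Province A", "pop": 100}, {"name": "Province B", "pop": 40}]


cache_file = Path(tempfile.mkdtemp()) / "zonal_stats.csv"
first = load_or_compute(cache_file, slow_zonal_stats)
second = load_or_compute(cache_file, slow_zonal_stats)
print(len(calls))
```

The second call reads the file instead of recomputing, which is the same role the cached CKAN resource plays for `PopZonalStats`.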

In [29]:
class PullPreprocessedData(ExternalTask):
    def output(self):
        return CkanTarget(
            dataset={"id": "37b098b1-0ee9-4fb0-8411-88f3cebd776a"},
            resource={"id": "3238efb3-9543-43f6-a149-cbd2a2b4d117"}
        )

@requires(PullPreprocessedData)
class PopZonalStats(Task):
    def output(self):
        return IntermediateTarget(task=self, timeout=360000)
    
    def run(self):
        df = pd.read_csv(self.input().get())
        with self.output().open("w") as out:
            out.write(df)

In [30]:
luigi.build([PopZonalStats()], local_scheduler=True)

True

In [31]:
df = PopZonalStats().output().open().read()
df.shape

(34, 101)

In [32]:
import dash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output
import plotly.subplots as sp
import plotly.express as px
from dash import dash_table

In [33]:
def plot_total_population():
    df = PopZonalStats().output().open().read()
    var_names = ["flowminder_2022","landscan_2021",'flowminder_2021', "landscan_2020", "world_pop_2020", "world_pop_2019",  "world_pop_2020_UNadj", "world_pop_2019_UNadj","gpw_2020", "ghs_2020",
             'nsia_2020', 'nsia_2021', 'nsia_2022', 'FDW_2004', 'FDW_2011']
    df = pd.melt(df, id_vars=['admin_1'], value_vars=var_names)
    df = df.groupby(by=['variable'], as_index=False)['value'].sum()
    df = df.sort_values(by=['value'], ascending=False)
    fig = px.bar(df, x='variable', y='value', barmode='group', title="Total Population")
    fig.update_layout(legend=dict(yanchor="top", y=1, xanchor="left", x=0.9), xaxis_title='Population Source', yaxis_title='Total Population')
    fig.update(layout = dict(title=dict(x=0.5), margin=dict(t=30)))
    return fig

def plot_heatmap():
    df = PopZonalStats().output().open().read()
    var_names = ["flowminder_2022","flowminder_2021", "landscan_2021", "landscan_2020", "world_pop_2020", "world_pop_2019",  "world_pop_2020_UNadj", "world_pop_2019_UNadj","gpw_2020", "ghs_2020",
                'nsia_2020', 'nsia_2021', 'nsia_2022', 'FDW_2004', 'FDW_2011']
    corr = df[var_names].corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    mask = pd.DataFrame(mask, index=corr.index, columns=corr.columns, dtype=bool)
    plot_data = np.ma.masked_where(np.asarray(mask), corr)
    plot_data[plot_data.mask]=np.nan
    plot_data = pd.DataFrame(plot_data.data, index=corr.index, columns=corr.columns)
    fig = px.imshow(plot_data, title='Population Correlation Matrix at Admin 1', color_continuous_scale='reds') # width=600, height=600
    fig.update(layout = dict(title=dict(x=0.5), margin=dict(t=30)))
    fig.update_xaxes(tickangle=45)
    return fig 

def plot_admin1_pop():
    df = PopZonalStats().output().open().read()
    var_names = ["flowminder_2022",'nsia_2022', "landscan_2021",  "world_pop_2020_UNadj","gpw_2020", 'flowminder_2021', 'FDW_2011']
    df = df.sort_values(by=['flowminder_2022'], ascending=False)
    fig = px.bar(df, x='name', y=var_names, barmode='group', title="Population At Admin 1 Level")
    fig.update_layout(legend=dict(yanchor="top", y=1, xanchor="left", x=0.9), xaxis_title='Admin 1', yaxis_title='Population Count')
    fig.update(layout = dict(title=dict(x=0.5), margin=dict(t=30)))
    return fig


In [34]:
def plot_polygon(pop_source='nsia_2022', title="NSIA Population Count Estimate", range_color=[0, 7_000_000], cmap='reds', pct=False):
    df = PopZonalStats().output().open().read()
    df['geometry'] = gpd.GeoSeries.from_wkt(df['geometry'])
    df = gpd.GeoDataFrame(df, geometry='geometry')
    df = df.rename(columns={'name':'Admin 1'})
    df.index = df['Admin 1']
    fig = px.choropleth_mapbox(df, geojson=df.geometry, 
                               locations=df.index, color=pop_source,
                               #height=500, width=600,
                               mapbox_style="carto-positron",
                               color_continuous_scale=cmap,
                               center = {"lat": 34.15, "lon":65.22},
                               zoom=4,
                              range_color=range_color)
    fig.update_geos(fitbounds="locations", visible=False)
    fig.update_layout(
        title_text=title
    )
    fig.update(layout = dict(title=dict(x=0.5)))
    fig.update_layout(
        margin={"r":0,"t":30,"l":10,"b":10},
        coloraxis_colorbar={
            'title':'Population'})
    if pct:
        fig.update_traces(hovertemplate='Admin 1 =%{location}<br>Pct Change =%{z:.0%}<extra></extra>')
    else:
        fig.update_traces(hovertemplate='Admin 1 =%{location}<br>Pop Count =%{z}<extra></extra>')
    return fig

In [35]:
df = PopZonalStats().output().open().read()
df = df.sort_values(by=['nsia_2022'], ascending=False)

In [36]:
def plot_population_bar(baseline='nsia_2022', scenario='landscan_2021', title="Population count At Admin 1"):
    df = PopZonalStats().output().open().read()
    df = df.sort_values(by=['name'], ascending=True)
    df = df.rename(columns={'name':'Admin 1'})
    df.index = df['Admin 1']
    fig = px.bar(df, x='Admin 1', y=[baseline, scenario], barmode='group', title=title)#, height=500, width=800)
    fig.update(layout = dict(title=dict(x=0.5), margin=dict(t=30)))
    fig.update_layout(legend=dict(yanchor="top", y=0.9, xanchor="left", x=0.8), xaxis_title='Admin 1', yaxis_title='Population Count')
    return fig

In [37]:
def plot_population_difference_bar(baseline='nsia_2022', scenario='landscan_2021'):
    df = PopZonalStats().output().open().read()
    df = df.sort_values(by=['name'], ascending=True)
    #df = df.rename(columns={'name':'Admin 1'})
    df.index = df['name']
    df['pct_change'] = (df[scenario]- df[baseline])/df[baseline]
    #df['pct_change'] = df['pct_change'] * 100
    df['test'] = df['pct_change'].abs()
    fig = px.bar(df, x='name', y=['pct_change'], title=f"Pct Change from {baseline} to {scenario} At Admin 1")#, height=500, width=800)
    fig.update(layout = dict(title=dict(x=0.5), margin=dict(t=30, b=0, l=0, r=0)))
    fig.update_layout(showlegend=False, yaxis=dict(tickformat=".0%"), xaxis_title='Admin 1', yaxis_title='Pct Change')
    fig.update_traces(hovertemplate='Admin 1 =%{customdata}<br>Pct Change=%{value:.0%}<br><extra></extra>', customdata=df['name'])
    fig.update_xaxes(tickangle=45)
    return fig

In [38]:
demographic_map = {"Female": {"baseline": "nsia_2022_female", "scenario": "flowminder_2021_female"}, "Male": {"baseline": "nsia_2022_male", "scenario": "flowminder_2021_male"}}

In [39]:
def plot_total_pop_demo(gender="Female"):
    baseline = demographic_map[gender]['baseline']
    scenario = demographic_map[gender]['scenario']
    df = PopZonalStats().output().open().read()
    plot_df = pd.melt(df, id_vars=['admin_1'], value_vars=[baseline, scenario])
    plot_df = plot_df.groupby(by=['variable'], as_index=False)['value'].sum()
    plot_df = plot_df.sort_values(by=['variable'], ascending=True)
    fig = px.bar(plot_df, x='variable', y='value', barmode='group', title="Total Population")
    fig.update_layout(legend=dict(yanchor="top", y=1, xanchor="left", x=0.9), xaxis_title='Population Source', yaxis_title='Total Population')
    fig.update(layout = dict(title=dict(x=0.5), margin=dict(t=30)))
    return fig

In [40]:
def scatter_plot_w_reg(gender="Female"):
    baseline = demographic_map[gender]['baseline']
    scenario = demographic_map[gender]['scenario']
    df = PopZonalStats().output().open().read()
    lm = smf.ols(formula=f'{baseline} ~ {scenario}', data=df).fit()
    title = f'{baseline} = {round(lm.params.iloc[0], 2)} + {round(lm.params.iloc[1], 4)}*{scenario} \n (R squared = {round(lm.rsquared, 4)})'
        
    fig = px.scatter(df, x=scenario, y=baseline, trendline='ols', trendline_color_override='red', range_x=[0, 3_500_000], range_y=[0, 3_500_000], title=title)
    fig.update(layout = dict(title=dict(x=0.5), margin=dict(t=30, b=0, l=0, r=0)))
    return fig

In [41]:
def plot_population_bar_demo(gender='Female'):
    baseline = demographic_map[gender]['baseline']
    scenario = demographic_map[gender]['scenario']
    plot = plot_population_bar(baseline, scenario)
    return plot

In [42]:
def plot_population_difference_bar_demo(gender='Female'):
    baseline = demographic_map[gender]['baseline']
    scenario = demographic_map[gender]['scenario']
    plot = plot_population_difference_bar(baseline, scenario)
    return plot

In [43]:
drop_down_map = {"Flowminder 2022":"flowminder_2022",  "Flowminder 2021":"flowminder_2021", "Landscan 2021":"landscan_2021",   "WorldPop 2020 UNadj":"world_pop_2020_UNadj",
                  "GPW 2020":"gpw_2020",  "NSIA 2020":'nsia_2020',  "NSIA 2021":'nsia_2021',  "NSIA 2022":'nsia_2022', 'FDW 2004':'FDW_2004', 'FDW 2011':'FDW_2011'}

In [44]:
def total_pct_change(baseline="flowminder_2022"):
    df = PopZonalStats().output().open().read()
    var_names = ["flowminder_2022","landscan_2021",'flowminder_2021', "landscan_2020", "world_pop_2020", "world_pop_2019",  "world_pop_2020_UNadj", "world_pop_2019_UNadj","gpw_2020", "ghs_2020",
                'nsia_2020', 'nsia_2021', 'nsia_2022', 'FDW_2004', 'FDW_2011']
    df_total = df[var_names].sum()
    df_total = pd.DataFrame(df_total).T
    df_nsia = df_total.copy()

    for i in df_nsia.columns:
        if i != baseline:
            df_nsia[i] = (df_nsia[i] - df_nsia[baseline]) / df_nsia[baseline]

    df_nsia = df_nsia.drop(baseline, axis=1)
    df_nsia = df_nsia.T
    df_nsia = df_nsia.reset_index()
    df_nsia.columns = ['Population Source', 'Pct Change']
    df_nsia = df_nsia.sort_values('Pct Change', ascending=False)
    df_nsia['Pct Change'] = df_nsia['Pct Change'].apply(lambda x: f"{x:.0%}")
    df_nsia['Population Source'] = df_nsia['Population Source'].apply(lambda x: x.replace("_", " ").title())
    df_nsia = df_nsia.set_index('Population Source')
    return df_nsia.T

In [45]:
df_flow = total_pct_change(baseline="flowminder_2022")
df_nsia = total_pct_change(baseline="nsia_2022")

In [46]:
def corr_table(baseline="flowminder_2022"):
    df = PopZonalStats().output().open().read()
    var_names = ["flowminder_2022","flowminder_2021", "landscan_2021", "landscan_2020", "world_pop_2020", "world_pop_2019",  "world_pop_2020_UNadj", "world_pop_2019_UNadj","gpw_2020", "ghs_2020",
                'nsia_2020', 'nsia_2021', 'nsia_2022', 'FDW_2004', 'FDW_2011']
    corr = df[var_names].corr()
    flow_corr = corr[baseline]
    flow_corr.index = [i.replace("_", " ").title() for i in flow_corr.index]
    flow_corr = flow_corr.sort_values(ascending=False)
    flow_corr = pd.DataFrame(flow_corr).T
    flow_corr = flow_corr.drop(baseline.replace("_", " ").title(), axis=1)
    return flow_corr.round(5)

In [47]:
flow_corr = corr_table(baseline="flowminder_2022")
nsia_corr = corr_table(baseline="nsia_2022")

In [53]:
var_list = ["flowminder_2022", "landscan_2021",  "world_pop_2020_UNadj","gpw_2020", 'nsia_2021', 'nsia_2022', 'FDW_2004', 'FDW_2011', "flowminder_2021"]

fig1 = plot_total_population()

fig2 = plot_heatmap()

fig3 = plot_admin1_pop()

app = dash.Dash()

app.layout = html.Div([
    dcc.Tabs(id="tabs", children=[
        dcc.Tab(label='Description', children=[
            html.Div([
                dcc.Markdown('''
                    # Population Data Overview

                    This project is designed to compare and analyze population data for Afghanistan from various sources. 
                    The goal is to provide a comprehensive overview of the country's population, identify disparities and
                    inconsistencies in the data, and gain insights into the factors that contribute to these differences.
                    
                    The data sources used are:
                    ##  GHS Population dataset

                    * The GHS population dataset is available [here](https://ghsl.jrc.ec.europa.eu/download.php?ds=pop)
                    * Has a 5-year temporal resolution from 1975 to 2030
                    * The raw global census data harmonized by CIESIN (GPWv4.11) is disaggregated to grid cells
                    * The population growth rate is also from CIESIN
                    * GHS-Built-S, based on Sentinel/Landsat imagery, is used as the target layer for disaggregation of the population estimates
                    * The estimates are scaled to match:
                        * The total population time series at ’city’ level from the extended database feeding the UN World Urbanisation Prospects 2018
                        * The total population time series at country level provided by the UN World Population Prospects 2019


                    ##  WorldPop

                    * The WorldPop population dataset is available [here](https://hub.worldpop.org/geodata/listing?id=29)
                    * A Random Forest model is used to generate weights for dasymetric redistribution
                    * The predictors used include night-time lights, topographic variables, land cover and climatic variables
                    * The output of the model is used as a weighting layer for dasymetric mapping
                    * Growth rates for rural and urban areas are from the UN World Urbanization Prospects Database, 2012 version (UNPD)


                    ## WorldPop Population Adjusted to Match UNPD Estimate

                    * The WorldPop UNPD-adjusted dataset is available [here](https://hub.worldpop.org/geodata/summary?id=49924)
                    * The population source is 2010 estimates from UN Urban and Rural Population by Age and Sex (URPAS)
                    * The Built-Settlement Growth Model is used to identify unsettled pixels
                    * Dasymetric redistribution based on Random Forest is used for mapping
                    * The country total population is adjusted to match the official United Nations population estimates


                    ## Landscan

                    * The landscan population data is available [here](https://landscan.ornl.gov/)
                    * Uses a top-down modeling approach to disaggregate census data
                    * Spatial data and high-resolution imagery are used for multi-variable dasymetric modeling
                    * The ML models are country specific


                    ## GPW

                    * The GPW v4 is available [here](https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-adjusted-to-2015-unwpp-country-totals-rev11/data-download)
                    * Raw census estimates are extrapolated to obtain data for the target years 2000, 2005, 2010, 2015 and 2020.
                    * The estimates are proportionally allocated to raster cells using a uniform areal-weighting approach.
                    * The estimated populations have been nationally adjusted to match the United Nations World Population Prospects: The 2015 Revision (United Nations, 2015).


                    ## NSIA

                    * The population data is extracted from this [pdf](http://nsia.gov.af:8080/wp-content/uploads/2022/05/%D8%A8%D8%B1%D8%A2%D9%88%D8%B1%D8%AF-%D9%86%D9%81%D9%88%D8%B3-%D8%B3%D8%A7%D9%84-1401_compressed.pdf)
                    * The 2003 - 05 household listing data was used to estimate the population for 2022 - 23
                    * The formula used to estimate population is $P_{t} = P_{0}e^{rt}$ where $P_{t}$ is estimated population, $P_{0}$ is the base year population, $r$ is population growth rate and $t$ is time interval


                    ## FlowMinder
                    * The population data is extracted from this [pdf](https://www.ipcinfo.org/fileadmin/user_upload/ipcinfo/docs/IPC_Afghanistan_AcuteFoodInsec_2022Mar_2022Nov_report.pdf). The method used to produce the Flowminder estimates is not clear from an online search; it seems that mobile phone data is one of the inputs.
                                        
                ''', mathjax=True)
            ], style={'display':'inline-block', 'margin-left':'10px', 'margin-right':'10px'})
        ]),
        dcc.Tab(label='Population Data Overview', children=[
            html.Div([
                html.Div([
                    dcc.Graph(figure=fig1)
                ], style={'width': '70%', 'display': 'inline-block'}),
                html.Div([
                    dcc.Graph(figure=fig2)
                ], style={'width': '30%', 'display': 'inline-block'})
            ]),
            html.Div([
                html.Div([
                    dcc.Graph(figure=fig3)
                ], style={'width': '100%', 'display': 'inline-block'})
            ])
        ]),
        dcc.Tab(label='Population Data Comparison', children=[
            html.Div([
                html.Div([
                    dcc.Dropdown(
                        id='baseline_source',
                        options=[{'label': k, 'value': v} for k, v in drop_down_map.items()],
                        value='flowminder_2022'
                    ),
                    dcc.Dropdown(
                        id='scenario_source',
                        options=[{'label': k, 'value':v} for k, v in drop_down_map.items()],
                        value='nsia_2022'
                    )
                ], style={'width': '10%', 'display': 'inline-block', 'vertical-align':'top'}),
                html.Div([
                    dcc.Graph(id='baseline_map'),
                ], style={'width': '45%', 'display': 'inline-block', 'height': '100%', 'vertical-align':'top'}),
                html.Div([
                    dcc.Graph(id='scenario_map'),
                ], style={'width': '45%', 'display': 'inline-block', 'height': '100%', 'vertical-align':'top'}),
            ]),
            html.Div([
                html.Div([], style={'width': '10%', 'display': 'inline-block'}),
                html.Div([
                    dcc.Graph(id='difference_map'),
                ], style={'width': '45%', 'display': 'inline-block', 'height': '100%'}),
                html.Div([
                    dcc.Graph(id='ratio_map'),
                ], style={'width': '45%', 'display': 'inline-block', 'height': '100%'}),
            ])
        ]),
        dcc.Tab(label='Population Data Regression', children=[
            html.Div([
                html.Div([
                    html.Label('Baseline:'),
                    dcc.Dropdown(
                        id='baseline-dropdown',
                        options=[{'label': var, 'value': var} for var in var_list],
                        value='nsia_2022'
                        )]),
                html.Div([
                    dcc.Graph(id='scatter-plots')
                    ], style={'width': '100%', 'display': 'inline-block', 'height': '95%', 'vertical-align':'top'})
                ])
        ]),
        dcc.Tab(label='Population Demographic', children=[
            html.Div([
                html.Div([
                    dcc.Dropdown(
                        id='demographic_dropdowm',
                        options=[{'label': i, 'value': i} for i in ['Female', 'Male']],
                        value='Female'
                    ),
                ]),
                html.Div([
                    dcc.Graph(id='total-pop-demographic'),
                ], style={'width': '15%', 'display': 'inline-block', 'height': '100%', 'vertical-align':'top'}),
                html.Div([
                    dcc.Graph(id='pop-demographic-admin1'),
                ], style={'width': '85%', 'display': 'inline-block', 'height': '100%', 'vertical-align':'top'}),
            ]),
            html.Div([
                html.Div([
                    dcc.Graph(id='difference_map_demographic'),
                ], style={'width': '50%', 'display': 'inline-block', 'height': '100%'}),
                html.Div([
                    dcc.Graph(id='regression_demographic'),
                ], style={'width': '50%', 'display': 'inline-block', 'height': '100%'}),
            ])
        ]),
        dcc.Tab(label='Summary', children=[
            html.H3("Data availability"),
            html.P("""The Flowminder population data we have access to is available only at admin level 1, so the datasets cannot be compared at lower administrative levels."""),
            html.Br(),
            html.H3('Total Population Percentage change'),
            dash_table.DataTable(df_nsia.to_dict('records'),[{"name": i, "id": i} for i in df_nsia.columns], id='tbl_1', style_cell={'textAlign': 'left'}, style_header={"backgroundColor": "paleturquoise", "textAlign": "center"}),
            html.P('Table 1: Total population percentage change from NSIA 2022'),
            html.Br(),
            dash_table.DataTable(df_flow.to_dict('records'),[{"name": i, "id": i} for i in df_flow.columns], id='tbl_2', style_cell={'textAlign': 'left'},style_header={"backgroundColor": "paleturquoise", "textAlign": "center"}),
            html.P('Table 2: Total population percentage change from Flowminder 2022'),
            html.Br(),
            html.H3('Population correlation matrix at Admin 1'),
            html.P("The correlation coefficients are positive and close to 1, which indicates a strong positive linear relationship: as the population from the NSIA/Flowminder source increases, the population from the other sources also tends to increase."),
            dash_table.DataTable(nsia_corr.to_dict('records'),[{"name": i, "id": i} for i in nsia_corr.columns], id='tbl_3', style_cell={'textAlign': 'left'},style_header={"backgroundColor": "paleturquoise", "textAlign": "center"}),
            html.P('Table 3: Correlation between NSIA 2022 and other population datasets'),
            html.Br(),
            dash_table.DataTable(flow_corr.to_dict('records'),[{"name": i, "id": i} for i in flow_corr.columns], id='tbl_4', style_cell={'textAlign': 'left'},style_header={"backgroundColor": "paleturquoise", "textAlign": "center"}),
            html.P('Table 4: Correlation between Flowminder 2022 and other population datasets'),

        ]),
    ]),
])

@app.callback(
    Output('baseline_map', 'figure'),
    [Input('baseline_source', 'value')])
def update_baseline_map(pop_source):
    return plot_polygon(pop_source, f"{pop_source} Population Count Estimate")

@app.callback(
    Output('scenario_map', 'figure'),
    [Input('scenario_source', 'value')])
def update_scenario_map(pop_source):
    return plot_polygon(pop_source, f"{pop_source} Population Count Estimate")

@app.callback(
    Output('difference_map', 'figure'),
    [Input('baseline_source', 'value'), Input('scenario_source', 'value')])
def update_difference_map(baseline_source, scenario_source):
    fig = plot_population_difference_bar(baseline_source, scenario_source)
    #diff_data = df[scenario_source]- df[baseline_source]
    #fig = plot_polygon(diff_data, f"Difference between {baseline_source} and {scenario_source} ")
    return fig

@app.callback(
    Output('ratio_map', 'figure'),
    [Input('baseline_source', 'value'), Input('scenario_source', 'value')])
def update_pct_change_map(baseline_source, scenario_source):
    df = PopZonalStats().output().open().read()
    pct_change = (df[scenario_source]- df[baseline_source])/df[baseline_source]
    fig = plot_polygon(pct_change, f"Pct Change From {baseline_source} to {scenario_source}", [-1, 1], "RdBu", True)
    return fig

@app.callback(
    Output('scatter-plots', 'figure'),
    [Input('baseline-dropdown', 'value')]
)
def update_scatter_plots(baseline):
    df = PopZonalStats().output().open().read()
    var_list = ["flowminder_2022", "landscan_2021", "world_pop_2020_UNadj", "gpw_2020", 'nsia_2020', 'nsia_2022', 'world_pop_2020', 'flowminder_2021', 'FDW_2011']
    # Sort so the subplot order is the same on every run (set order is arbitrary).
    plot_list = sorted(set(var_list) - {baseline})

    def scatter_plot_title(var):
        # Fit baseline ~ var by OLS and report the fitted line and R squared.
        lm = smf.ols(formula=f'{baseline} ~ {var}', data=df).fit()
        intercept, slope = lm.params.iloc[0], lm.params.iloc[1]
        return f'{baseline} = {round(intercept, 2)} + {round(slope, 4)}*{var} <br> (R squared = {round(lm.rsquared, 4)})'

    fig = sp.make_subplots(rows=4, cols=2, subplot_titles=[scatter_plot_title(i) for i in plot_list])
    # One subplot per comparison dataset, filled row by row; the 4 x 2 grid has 8 slots.
    for idx, var in enumerate(plot_list[:8]):
        row, col = divmod(idx, 2)
        scatter = px.scatter(df, x=var, y=baseline, trendline='ols', trendline_color_override='red', range_x=[0, 7_000_000], range_y=[0, 7_000_000])
        # data[0] holds the scatter points and data[1] the OLS trendline.
        fig.add_trace(scatter.data[0], row=row + 1, col=col + 1)
        fig.add_trace(scatter.data[1], row=row + 1, col=col + 1)
    
    fig.update_layout(width=1800,height=1000)
    return fig

@app.callback(
    Output('total-pop-demographic', 'figure'),
    [Input('demographic_dropdowm', 'value')]
)
def update_total_pop_demo(demographic_dropdowm):
    fig = plot_total_pop_demo(demographic_dropdowm)
    return fig

@app.callback(
    Output('pop-demographic-admin1', 'figure'),
    [Input('demographic_dropdowm', 'value')]
)
def update_admin1_pop_demo(demographic_dropdowm):
    fig = plot_population_bar_demo(demographic_dropdowm)
    return fig

@app.callback(
    Output('difference_map_demographic', 'figure'),
    [Input('demographic_dropdowm', 'value')]
)
def update_diff_map_demographic(demographic_dropdowm):
    fig = plot_population_difference_bar_demo(demographic_dropdowm)
    return fig
    
@app.callback(
    Output('regression_demographic', 'figure'),
    [Input('demographic_dropdowm', 'value')]
)
def update_regression_demographic(demographic_dropdowm):
    fig = scatter_plot_w_reg(demographic_dropdowm)
    return fig



if __name__ == '__main__':
    app.run_server(debug=True, use_reloader=False, port=9185, host='0.0.0.0')

Dash is running on http://0.0.0.0:9185/

 * Serving Flask app '__main__'
 * Debug mode: on
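
The percentage change mapped in the comparison tab and the correlation coefficients reported in the Summary tab can be reproduced on a small table. The sketch below is a minimal illustration, assuming a DataFrame with one row per admin 1 unit and one column per population source. The province names and values are invented for the example, and `pct_change_from_baseline` is a hypothetical helper that is not part of the dashboard code.

```python
import pandas as pd

# Illustrative admin 1 totals: the province names and values are made up.
df = pd.DataFrame(
    {
        "nsia_2022": [100_000.0, 250_000.0, 400_000.0],
        "flowminder_2022": [110_000.0, 240_000.0, 500_000.0],
    },
    index=["Province A", "Province B", "Province C"],
)


def pct_change_from_baseline(data, baseline, scenario):
    """Relative change of `scenario` with respect to `baseline`, per row."""
    return (data[scenario] - data[baseline]) / data[baseline]


# Same formula as the ratio map: (scenario - baseline) / baseline.
pct_change = pct_change_from_baseline(df, "nsia_2022", "flowminder_2022")

# Pearson correlation between the two sources (the pandas default method).
corr = df["nsia_2022"].corr(df["flowminder_2022"])

print(pct_change.round(3))
print(round(corr, 3))
```

A positive value means the scenario source reports more people than the baseline for that unit. A correlation close to 1 only says that the sources rank and scale the units similarly, so it can coexist with large percentage differences in individual provinces.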


## Next steps
* Identify key insights
* Add population data from AsiaPop and GRUMP

## References