# Load Data

> Functions that load the data for the map.   

**Contents**

`Solomon Geospatial Data`
- `SolomonGeo`: A class that cleans the Solomon Islands census and geography data
- `SolomonGeo.read_test`: Loads and transforms the test data
- `SolomonGeo.get_geojson`: Returns the geo_df as a GeoJSON dataset

In [None]:
#| default_exp load_data

In [None]:
#| export
from nbdev.showdoc import *
import geopandas as gpd
import pandas as pd
from git import Repo
import json
from fastcore import *
from fastcore.basics import patch

In [None]:
#| hide
repo = Repo('.', search_parent_directories=True)
fp = str(repo.working_tree_dir) + "/testData/"

# Check that the files exist using fastcore (both census and geo)

## Solomon Geospatial Data
> Load the geography and census data
### Geography Data
Solomon Islands geography data is organised at the following levels:
 - adm0 - The country as a whole, Solomon Islands
 - adm1 - The province, e.g. Honiara, Malaita
 - adm2 - The constituency, e.g. Central Honiara
 - adm3 - The ward, the smallest geography I am reporting, e.g. Cruz

### Census Data
Solomon Islands census data is taken from the 2009 and 2019 censuses:
 - 2009: We have the total population for each of the administrative regions
 - 2019: Data is only available down to the province level

In [None]:
#| export
class SolomonGeo:
    # TODO work out how to format the attributes
    # Look at nbdev docs maybe?
    # TODO change all data to int?
    # TODO - should I make this a dataclass for the auto functionality? potentially should try it out
    '''
    Load the Solomon Islands geography data
    Attributes:
        geo_df    Geopandas dataframe containing geographies and census data
    '''
    def __init__(self, 
                geo_df:gpd.GeoDataFrame): # A geopandas dataset containing population and geography boundaries for each aggregation
        self.geo_df = geo_df

    @classmethod
    def read_test(cls,
                 )-> 'SolomonGeo': # A SolomonGeo object built from the local test data
        '''
        Initialise the object using the local testing data
        '''
        # TODO might need to further abstract this concatenation process
        df, geo = cls.extract_from_file('ward', '2009')
        gdf_ward = cls.transform('ward', '2009', df, geo)
        
        df, geo = cls.extract_from_file('constituency', '2009')
        gdf_const = cls.transform('constituency', '2009', df, geo)
        
        # Append the datasets together
        geo_df = pd.concat([gdf_ward, gdf_const])

        return cls(
            geo_df = geo_df
        )

    @classmethod
    def extract_from_file(cls, 
                            aggregation:str, # Indicates the aggregation of the data
                            year:str, # The year of that data, only relevant for census data
                 ) -> (pd.DataFrame, 
                      gpd.GeoDataFrame): # Returns input pandas and geopandas datasets
        '''
        Extract and return input datasets from file
        '''
        repo = Repo('.', search_parent_directories=True)
        pw = str(repo.working_tree_dir) + "/testData/"
        return (
            pd.read_csv(pw + 'sol_census_' + aggregation + '_' + year + '.csv'), 
            gpd.read_file(pw + 'sol_geo_' + aggregation + '.json')
        )

    @classmethod
    def transform(cls, 
            aggregation:str, # Indicates the aggregation of the data
            year:str, # The year of that data, only relevant for census data
            df:pd.DataFrame, # Uncleaned input census dataset
            geo:gpd.GeoDataFrame, # Uncleaned input geospatial dataset
           )-> gpd.GeoDataFrame: # The geopandas dataset for given aggregation
        '''
        Transform given raw input dataset into a cleaned and combined geopandas dataframe
        '''
        # Clean the geospatial dataframe
        # Rename columns and keep only the necessary ones. Note that id can be province id, constituency id etc.
        geo.columns = geo.columns.str.replace(r'^[a-zA-Z]+name$', 'geo_name', case = False, regex = True)
        # TODO this assume the key column is the first one (which so far it is...)
        geo.rename(columns = {geo.columns[0]:'id'}, inplace=True)
        geo = geo.loc[:, ['id', 'geo_name', 'geometry']]
        
        # Add a column that indicates level of aggregation and one for the year
        geo.loc[:, 'agg'] = aggregation
        geo.loc[:, 'year'] = year
        
        # Clean the census data
        df = df.dropna()
        # Rename columns to be consistent across geography
        df.columns = df.columns.str.replace(r'^[a-zA-Z]+_name$', 'geo_name', case = False, regex = True)
        # id needs to change types twice so that it is a string of an int
        df = df.astype({'id': 'int', 'male_pop':'int', 'female_pop':'int', 'total_pop':'int'})
        df = df.astype({'id': 'str'})
        
        # Merge the data together
        geo_df = geo.merge(df, on=['id', 'geo_name']).set_index("geo_name")
        return geo_df
        

In [None]:
#| hide
show_doc(SolomonGeo)

---

[source](https://github.com/Gippers/SolomonIslandsDataMap/blob/main/SolomonIslandsDataMap/load_data.py#L16){target="_blank" style="float:right; font-size:smaller"}

### SolomonGeo

>      SolomonGeo (geo_df:geopandas.geodataframe.GeoDataFrame)

Load the Solomon Islands geography data
Attributes:
    geo_df    Geopandas dataframe containing geographies and census data

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| geo_df | GeoDataFrame | A geopandas dataset containing population and geography boundaries for each aggregation |
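
The cleaning steps inside `SolomonGeo.transform` can be sketched on toy data. This is a minimal illustration, not the exported implementation: it uses plain pandas frames, the raw column names (`WardID`, `WardName`, `ward_name`) are hypothetical, the `geometry` values are placeholder strings, and the assumption that raw census ids arrive as floats is mine. The ward names, ids and population counts are the ones shown for Nggosi and Cruz in the test output of this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the raw inputs; the real files may differ
geo = pd.DataFrame({
    "WardID": ["1001", "1004"],
    "WardName": ["Nggosi", "Cruz"],
    "geometry": ["<polygon>", "<polygon>"],  # placeholder, not real geometry
})
census = pd.DataFrame({
    "id": [1001.0, 1004.0, None],  # assumed to load as floats
    "ward_name": ["Nggosi", "Cruz", None],
    "male_pop": [5240.0, 125.0, None],
    "female_pop": [4822.0, 107.0, None],
    "total_pop": [10062.0, 232.0, None],
})

# Geography: rename the "<level>name" column and treat the first column as the key
geo.columns = geo.columns.str.replace(r"^[a-zA-Z]+name$", "geo_name", case=False, regex=True)
geo = geo.rename(columns={geo.columns[0]: "id"})

# Census: drop incomplete rows, rename "<level>_name", then cast id float -> int -> str
census = census.dropna()
census.columns = census.columns.str.replace(r"^[a-zA-Z]+_name$", "geo_name", case=False, regex=True)
census = census.astype({"id": "int", "male_pop": "int", "female_pop": "int", "total_pop": "int"})
census = census.astype({"id": "str"})

# Merge on the shared key columns and index by the geography name
merged = geo.merge(census, on=["id", "geo_name"]).set_index("geo_name")
```

Casting `id` through `int` first matters when the column is a float: converting `1004.0` straight to a string would keep the decimal part and no longer match the geography key.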

In [None]:
#| hide
show_doc(SolomonGeo.read_test)

---

[source](https://github.com/Gippers/SolomonIslandsDataMap/blob/main/SolomonIslandsDataMap/load_data.py#L31){target="_blank" style="float:right; font-size:smaller"}

### SolomonGeo.read_test

>      SolomonGeo.read_test ()

Initialise the object using the local testing data
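
`read_test` repeats the same extract and transform calls for each aggregation before concatenating the results. One possible way to abstract that step is to loop over the aggregation names, sketched below. `load_level` is a hypothetical stand-in for `extract_from_file` followed by `transform`, and the toy rows only reuse names and ids that appear in the test output of this notebook.

```python
import pandas as pd

# Toy rows keyed by aggregation; a stand-in for the real test files
_TOY = {
    "ward": {"geo_name": ["Cruz"], "id": ["1004"]},
    "constituency": {"geo_name": ["Central Honiara"], "id": ["1042"]},
}

def load_level(aggregation: str, year: str) -> pd.DataFrame:
    """Hypothetical helper: return one cleaned frame for a single aggregation."""
    df = pd.DataFrame(_TOY[aggregation]).set_index("geo_name")
    df["agg"] = aggregation
    df["year"] = year
    return df

# Build every level the same way, then append the frames together
levels = ["ward", "constituency"]
combined = pd.concat([load_level(agg, "2009") for agg in levels])
```

Adding another level, such as the province, would then only mean extending `levels`, assuming matching input files exist.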

In [None]:
#| export
@patch
def get_geojson(self:SolomonGeo, 
               ) -> dict: # Geo JSON formatted dataset
    '''
    A getter method for the SolomonGeo class that returns a Geo JSON formatted dataset
    '''
    return json.loads(self.geo_df.to_json())

In [None]:
#| hide
show_doc(SolomonGeo.get_geojson)

---

[source](https://github.com/Gippers/SolomonIslandsDataMap/blob/main/SolomonIslandsDataMap/load_data.py#L93){target="_blank" style="float:right; font-size:smaller"}

### SolomonGeo.get_geojson

>      SolomonGeo.get_geojson ()

A getter method for the SolomonGeo class that returns a Geo JSON formatted dataset
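
To show what a caller might do with the returned dictionary, here is a small sketch using only the standard library. The sample string is hand-written and is assumed to mirror the `FeatureCollection` layout that `GeoDataFrame.to_json()` typically produces (index value as the feature `id`, remaining columns under `properties`). The coordinates are simplified placeholders and `population_by_feature` is a hypothetical helper.

```python
import json

# Hand-written sample; assumed shape of the output, with placeholder coordinates
sample = '''
{"type": "FeatureCollection",
 "features": [
   {"id": "Cruz", "type": "Feature",
    "properties": {"id": "1004", "agg": "ward", "year": "2009",
                   "male_pop": 125, "female_pop": 107, "total_pop": 232},
    "geometry": {"type": "MultiPolygon",
                 "coordinates": [[[[159.95, -9.43], [159.96, -9.43],
                                   [159.96, -9.44], [159.95, -9.43]]]]}}
 ]}
'''
geojson = json.loads(sample)

def population_by_feature(geojson: dict) -> dict:
    """Map each feature's id to its total_pop property."""
    return {f["id"]: f["properties"]["total_pop"] for f in geojson["features"]}

totals = population_by_feature(geojson)
```

Because `geo_df` is indexed by `geo_name`, the feature `id` in this sketch is the geography name, while the numeric key sits inside `properties`.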

# Testing

In [None]:
import warnings

def my_formatwarning(message, category, filename, lineno, line=None):
  # formatwarning must return the text to show, so build a string instead of printing
  # lineno is the line number you are looking for
  return f"{message} {category}\nfile: {filename} line number: {lineno}\n"

warnings.formatwarning = my_formatwarning

In [None]:
sol_geo = SolomonGeo.read_test()

                         id  \
geo_name                      
West Guadalcanal        634   
Temotu Vatu             950   
South Guadalcanal       635   
East Guadalcanal        636   
North West Choiseul     102   
East AreAre             726   
North West Guadalcanl   633   
East Choiseul           103   
West Areare             727   
Small Malaita           728   
Malaita Outer Island    729   
East Malaita            722   
East Kwaio              724   
South Choiseul          101   
West Kwaio              725   
West Kwara'ae           720   
Central Kwara'ae        721   
Rennell-Bellona         532   
Marovo                  212   
Temotu Pele             948   
North New Georgia       210   
Fataleka                719   
East Honiara           1041   
Central Honiara        1042   
West Honiara           1043   
North Malaita           716   
Nggela                  430   
Central Guadalcanal     640   
North Vella la Vella    205   
West Makira             844   
South Ve

NameError: name 'geo_df' is not defined

In [None]:
sol_geo.geo_df

| geo_name | id | geometry | agg | year | male_pop | female_pop | total_pop |
| -- | -- | -- | -- | -- | -- | -- | -- |
| Nggosi | 1001 | MULTIPOLYGON (((159.92475 -9.42335, 159.92641 ... | ward | 2009 | 5240 | 4822 | 10062 |
| Mbumburu | 1002 | MULTIPOLYGON (((159.94507 -9.43466, 159.94519 ... | ward | 2009 | 1912 | 1713 | 3625 |
| Rove - Lengakiki | 1003 | MULTIPOLYGON (((159.94362 -9.42779, 159.94382 ... | ward | 2009 | 1464 | 1149 | 2613 |
| Cruz | 1004 | MULTIPOLYGON (((159.95288 -9.43426, 159.95284 ... | ward | 2009 | 125 | 107 | 232 |
| Vavaea | 1005 | MULTIPOLYGON (((159.95970 -9.42874, 159.95877 ... | ward | 2009 | 3788 | 3208 | 6996 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Duff Islands | 913 | MULTIPOLYGON (((167.24674 -9.93260, 167.24704 ... | ward | 2009 | 262 | 249 | 511 |
| Utupua | 914 | MULTIPOLYGON (((166.50509 -11.30801, 166.50353... | ward | 2009 | 586 | 582 | 1168 |
| Vanikoro | 915 | MULTIPOLYGON (((166.98326 -11.67945, 166.98327... | ward | 2009 | 625 | 668 | 1293 |
| Tikopia | 916 | MULTIPOLYGON (((168.84036 -12.28504, 168.84096... | ward | 2009 | 604 | 681 | 1285 |


In [None]:
test = sol_geo.get_geojson()

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()