#  EPA-122A *Spatial* Data Science


## Assignment 2: Geographic Visualisation


---



# ``Instructions``

This assignment puts together what you learned in **Weeks 3-4**. Assignment 3 will build upon what you do in Assignment 2.

_Note:_ Go through **labs and homeworks 03-04** before starting this assignment.

#### 1.1 Submission

Please submit your results via Brightspace under **Assignment 02** as a single file, named as in this example:

```text
firstname_secondname_thirdname_lastname_02.html

```

**If your file is not named in lowercase letters as shown above, the script that compiles all 200 assignments will not read it and you will miss out on the grade. I don't want that, so be exceptionally careful to name it properly. Don't worry if you misspell your name. I want to avoid a situation where I have 200 assignments all called assignment_02.html.**
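
The naming rule can be sketched in Python as a quick check before uploading. The helper `submission_filename` is hypothetical and not part of any course tooling; it only lowercases the name parts and joins them with underscores.

```python
def submission_filename(*name_parts: str, assignment: int = 2) -> str:
    """Build a lowercase file name such as firstname_lastname_02.html.

    Hypothetical helper for illustration; not part of the course tooling.
    """
    # Lowercase each part and turn inner spaces (e.g. "van der") into underscores
    cleaned = ["_".join(part.lower().split()) for part in name_parts if part.strip()]
    return "_".join(cleaned) + f"_{assignment:02d}.html"


print(submission_filename("Ada", "Lovelace"))
```

Comparing the printed name with the name of the exported file is enough; the helper does not rename anything on disk.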

Please **do not** submit any data or files other than the ``html file``.

Don't worry if your submission _renders without the images_ **after** you have submitted the file on Brightspace. That is a Brightspace issue with viewing your own submission; when I download all assignments as a batch, I get all your images and code as you intended to submit. So make sure that your HTML shows everything you want us to see **before you submit**.

#### 1.2 How do you convert to HTML?

There are two ways:

1. From a running notebook, use the File menu in the main menu bar of JupyterLab:
    * File &rightarrow; Export Notebook As... &rightarrow; Export Notebook to HTML
2. From a terminal or command line, run:
    * ``jupyter nbconvert --to html <notebook_name>.ipynb``

#### 1.3 Learning Objectives

This assignment is designed to support three different learning objectives. After completing the following exercises you will be able to:

* Combine different datasets
* Explore and visualise geographic data
* Plot (graphs, scatter plots, choropleths, etc.) and discuss (observations, outliers or relationships) important information about the data using the `principles of graphical excellence` and `guidelines of exploratory data analysis`.

<br/>

***

# ``Critical Data Science``

Throughout the assignment, we encourage you to critically reflect on your choices during the Data Science process. To help you set up your Critical Data Science process, we have provided you with a 'Guide on Critical Data Science'. Section 3.2 contains a step-by-step approach with key considerations for each part of the data science process. The guide can be found [here](https://epa122a.github.io/resources_index.html#a-guide-to-critical-data-science). Below, you will find specific questions which you can use to reflect on your data science choices.

<br/>

***

# ``Problem``

`Problem Statement`:
- For this assignment you will use the SHRUG (Socioeconomic High-resolution Rural-Urban Geographic) Platform for India to formulate a hypothesis/RQ and conduct an exploratory data analysis.
- To formulate the hypothesis, provide at least `two measurements` that may be related to each other (for example: your hypothesis is that areas with `high air pollution` also have relatively high `night light pollution`), and explain why. For example, higher night light pollution entails densification and nighttime activity. This is possible when many people are clustered together for the exchange of goods and services in a city, ultimately leading to more pollution due to the use of resources, traffic and mobility, use of water, discharge of pollutants from cars and factories, etc.
- Be explicit about how you define these measurements using markdown cells (for example: how do you measure the air pollution, and how do you measure the levels of light pollution during the night?). Do you use variables to proxy any of these effects?
- Observe that the measurements have a normative value attached to them (for example: according to your hypothesis, `high levels` of air pollution in an area are of `more` interest). Please do not assume that there is only one normative definition of a certain measurement and skip your reasoning.
- On the basis of the hypothesis and its associated measurements, you will conduct some exploratory/spatial data analysis and provide a reflection of how your hypothesis manifests spatially, using maps and other aiding plots.

_Note:_ I am not looking for mathematical equations as justification, but you are welcome to also form simple relations and show them in markdown.

A formalised example:


> I hypothesise that districts in India with relatively high air pollution also have high levels of light pollution, due to high levels of [economic activity](https://www.imf.org/en/Publications/fandd/issues/2019/09/satellite-images-at-night-and-economic-growth-yao).

>Definitions of metrics:
>- I will measure these effects at district level for all variables.
>- Surface PM2.5 pollution (estimated annual ground-level fine particulate matter (PM2.5))
>- Night time lights (night time luminosity)
>- Facebook relative wealth index (measured as an index between 0 and 1). The index is determined by a machine-learning model that collects target variables from traditional survey data spatially linked to features constructed from non-traditional data like high-resolution satellite imagery, data from mobile phone networks, and topographic maps, as well as aggregated and deidentified connectivity data from Facebook. Used in this analysis as a proxy for wealth and economic activity. This is arguably an oversimplification of the concept of inequality but is considered a suitable approximation given the available datasets.

<br/>

***

# ``Tasks``

For your convenience, the assignment has been divided into the following tasks,

**Exercise 1**:
1. Formulate a hypothesis for this assignment as explained above in `problem statement`.
2. Using a critical data science lens, evaluate your hypothesis and contrast it with your previous assignment (A1). The following questions can be used as guides to carry out this task.
    - Were there any design choices you would make differently this time? Why? (Because of data availability/ methodology for using certain columns as a proxy?)
    - Explain what your dependent variable is. Explain your choice of independent variables.
    - Would you wish to include any variables that are not available? How is the inclusion beneficial for the hypothesis? Are there any variables (proxies) present with which you could replace these missing values?
    - Would that have an effect on your outcomes? Think about bias in the data. Explain your reasoning.
    - Reflect on any cases in which this hypothesis will be rejected? Why?
    - Reflect on cases in which the hypothesis will be falsely rejected or falsely accepted? Think about bias in the data and in your own reasoning.
    - Reflect if there are any important perspectives that you are not taking into account by choosing this hypothesis?

**Exercise 2**:
1. Use two datasets: merge a shapefile and a csv file.
2. Clean your data and make it tabular for your own good! (think about weeks 1-2 and assignment 1)
3. Carry out an exploratory data analysis (EDA).
4. Report on your hypothesis results, both in the relationships between the variables and in the spatial manifestation of the outcome.
    * Use at least **3 figures** to support your analysis. Think about exploratory data analysis (build data, clean data, explore global/group properties).
    * These figures should follow the principles of graphical excellence. Using markdown, write explicitly under **each** figure at least **3 principles of excellence** you have used to make it.
    * Create **choropleths** to display **region-specific information** (e.g. population, voting choice or job availability) together with **some other elements like the sea, canals, centroids, or amenities** (you may try OpenStreetMap data, using `osmnx`).
    * Be careful with the use of color (and try to be inclusive of color-blind people).
    * Use **one method** from the lectures to discuss what you observe for your variable(s). Examples:
        * local or global spatial autocorrelation
        * network measures
        * spatial weights / spatial lag
        * binning
        * feature scaling, normalisation or standardisation
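
As an illustration of the last two methods in the list, the following is a minimal sketch of standardisation, min-max scaling and quantile binning with pandas. The values and the name `toy_variable` are made up for the example and are not taken from SHRUG.

```python
import pandas as pd

# Toy values standing in for any numeric column (illustrative only, not SHRUG data)
values = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0, 12.0], name="toy_variable")

# Standardisation (z-scores): subtract the mean, divide by the standard deviation
z_scores = (values - values.mean()) / values.std(ddof=0)

# Min-max scaling to the [0, 1] range
min_max = (values - values.min()) / (values.max() - values.min())

# Quantile binning into three classes with the same number of observations
classes = pd.qcut(values, q=3, labels=["low", "medium", "high"])

print(pd.DataFrame({"value": values, "z": z_scores, "scaled": min_max, "class": classes}))
```

Quantile classes are a common choice for choropleths because every class holds the same number of areas, but they can hide large gaps between values; which scheme fits best depends on the distribution of the variable.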


**Exercise 3**:
1. Critically reflect on your EDA and visualizations following the questions below.
    - Do they show the relevant information in a precise manner or is the data skewed or biased due to certain choices that you have made in the EDA?
    - Could they be further reduced/enhanced? (Refer to the [Critical Data Science Handbook](https://epa122a.github.io/resources_index.html#a-guide-to-critical-data-science) where needed.)

***Remember to always document your code! Justify everything you do (cleaning data, analysing data, exploring data, defining hypothesis or measurements, etc.) using markdown cells as you go through the notebook.***

<br/>

***

# ``Data``

Information about the SHRUG can be found [here](https://docs.devdatalab.org/). On this [website](https://www.devdatalab.org/shrug_download/) you can download data for your variables of interest (in csv format) and the shapefiles that you can use for mapping the variables. Put the data in a convenient location on your computer or laptop, ideally in a folder called **data** which is next to this **jupyter notebook**. Make sure you’ve set your working directory in the [correct manner](https://www.delftstack.com/howto/python/relative-path-in-python/).

These are big files, and it may take a while to load them onto your laptop and into Python (running in the JupyterLab environment).
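
A minimal sketch of the folder layout described above, using `pathlib`. It assumes the notebook is started from the folder that contains **data**; `DATA_DIR` and `data_path` are hypothetical names introduced for this example.

```python
from pathlib import Path

# Assumption: the notebook is started from the folder that contains "data"
DATA_DIR = Path.cwd() / "data"


def data_path(file_name: str) -> Path:
    """Return the path of a file inside the data folder (hypothetical helper)."""
    return DATA_DIR / file_name


print("Working directory:", Path.cwd())
print("Data folder exists:", DATA_DIR.is_dir())
print(data_path("example.csv"))
```

If the second line prints `False`, the working directory is not the notebook folder, and relative paths such as `data/...` will fail to resolve.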

As mentioned above in the problem introduction, you will use at least two datasets.

1. **First Dataset:** Download Shapefiles of SHRUG for the geographic level of your assignment (shrid, district, subdistrict)
2. **Second Dataset:** Get a second dataset of your choice in SHRUG using the links above (curate this dataset as you like)

<br/>

***

# ``Start your analysis``

### Hypothesis on Corruption in Rural Constituencies

I hypothesize that rural constituencies with a higher prevalence of criminal charges against elected representatives exhibit lower levels of public service delivery, indicative of mismanagement and corruption, while accounting for geographic challenges and socioeconomic disparities.

---

##### **Corruption Proxy**
- `mean_num_crim`: Average number of criminal charges against elected representatives → *Reflects governance quality and potential corruption.*

---

##### **Public Service Delivery**
- `pc11_vd_power_all`: Power supply for all users [0 or 1] → *Measures accessibility and reliability of basic utilities, indicating public infrastructure quality.*
- `pc11_vd_wat_tap_trt`: Access to treated tap water [0 or 1] → *Captures access to clean drinking water, a key sanitation and health service.*
- `pc11_vd_rd_all_wthr`: All-weather roads [0 or 1] → *Proxy for the quality of transport infrastructure.*

---

##### **Control Variables**
- **Wealth Index**
  - `facebook_mean_rwi`: Mean relative wealth index → *Reflects wealth distribution and economic disparity across constituencies.*
- **Terrain Ruggedness**
  - `tri_mean`: Mean terrain ruggedness index → *Captures geographic challenges that may affect infrastructure development and service delivery.*


---



Unfortunately, the binary design of public service delivery variables, such as access to electricity or treated water, simplifies complex realities into a "yes" or "no" framework. This overlooks important aspects like quality or sufficiency. For example, having electricity might not mean reliable power supply, leading to misunderstandings about actual service levels. Analyzing each variable separately also makes it harder to see overall service delivery patterns.

To address this, the binary variables are combined into a **public service index**, representing the proportion of services available in each constituency. This simplification improves analysis but assumes all services are equally important, which may not reflect their actual impact (e.g., clean water vs. electricity). The **dependent variable** is this composite index. The **independent variable**, `mean_num_crim`, serves as a proxy for corruption, assuming more criminal charges indicate weaker governance. **Control variables** like terrain ruggedness (`tri_mean`), population size, and wealth index (`facebook_mean_rwi`) account for geographic and socioeconomic factors that also influence service delivery.
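
A minimal sketch of how such an equally weighted index could be computed with pandas. The rows are toy values rather than SHRUG data, the column names follow the list of service variables above, and the name `public_service_index` and the handling of missing values (averaging over the indicators that are observed) are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for constituencies (illustrative values, not SHRUG data)
services = pd.DataFrame(
    {
        "pc11_vd_power_all": [1, 0, 1, np.nan],
        "pc11_vd_wat_tap_trt": [1, 0, np.nan, np.nan],
        "pc11_vd_rd_all_wthr": [1, 1, 0, np.nan],
    }
)

# Equal weights: the index is the share of available services in each row.
# mean(axis=1) skips NaN, so a row is averaged over its observed indicators;
# a row with no observed indicator stays NaN.
services["public_service_index"] = services.mean(axis=1)

print(services)
```

Averaging over observed indicators keeps more rows but makes the index less comparable between constituencies with different amounts of missing data; dropping incomplete rows is the stricter alternative.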

##### **Missing Variables and Proxies**
Key missing variables include:
- **Public Budget Allocation:** A direct measure of service investment.
- **Poverty Rate:** A more specific indicator of economic hardship.
- **Service Quality Metrics:** Measures of reliability or adequacy.

Proxies like the wealth index and binary access indicators partially address these gaps, but their limitations may bias outcomes. Missing data risks overstating corruption’s role if service delivery depends more on unmeasured factors, like regional policies or historical investments.

##### **Hypothesis Rejection and Bias**
The hypothesis could be rejected if service delivery is driven more by factors like regional policies, funding, or terrain ruggedness than by corruption. Additionally, rejection may occur if there are significant **time-related effects** not captured in the analysis, such as gradual improvements in service delivery due to long-term investments or reforms that mask the immediate impact of corruption. False rejection might occur if `mean_num_crim` poorly represents actual corruption, while false acceptance could result from unmeasured confounders, like state policies or funding patterns, creating spurious correlations.


##### **Conclusion**
The binary design simplifies the analysis but has limitations that the composite index only partially addresses. Including variables such as public budgets or service quality metrics, and accounting for temporal and cultural factors, would improve accuracy and give a more complete picture of public service delivery in rural constituencies.


---

In [26]:
import geopandas as gpd
import pandas as pd

In [27]:
def process_dataset(file_path, relevant_columns, merge_key):
    """
    Load a single CSV from the data folder and keep only the relevant columns plus the merge key.

    Args:
        file_path (str): Name of the CSV file inside the data folder.
        relevant_columns (list): List of columns to retain from the dataset.
        merge_key (str): Column to use as the common key for merging datasets.

    Returns:
        pd.DataFrame or None: Filtered dataframe with the merge key and relevant columns,
        or None if the file or one of the columns is missing.
    """
    try:
        # Load the dataset
        df = pd.read_csv("data/" + file_path)

        # Add merge key to the relevant columns if it's not already included
        # (this builds a new list, so the caller's list is left unchanged)
        if merge_key not in relevant_columns:
            relevant_columns = [merge_key] + relevant_columns

        # Filter the relevant columns; copy so later edits do not touch the full frame
        return df[relevant_columns].copy()

    except (FileNotFoundError, KeyError) as e:
        # Missing file or missing column: report it and return None so the caller can check
        print(f"Error processing file {file_path}: {e}")
        return None



In [28]:
"""
Criminal Charges of Politicians:
Measuring the extent of criminal charges among politicians as a direct indicator of corruption.
- `mean_num_crim`: Average number of criminal charges against elected representatives 

source: affidavits_ac
"""

criminal_dataset = process_dataset('affidavits_ac.csv', ['year', 'mean_num_crim'], 'ac08_id')

criminal_dataset.head()

|   | ac08_id | year | mean_num_crim |
|---|---|---|---|
| 0 | 2008-01-001 | 2008.0 | 0.0 |
| 1 | 2008-01-001 | 2014.0 | 0.0 |
| 2 | 2008-01-002 | 2008.0 | 0.066667 |
| 3 | 2008-01-002 | 2014.0 | 0.0 |
| 4 | 2008-01-003 | 2008.0 | 0.2 |
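
The preview shows more than one row per `ac08_id` (one per election year), so the table has to be collapsed to one row per constituency before it can be joined to tables that have a single row per constituency. The sketch below shows two possible ways to do that on the previewed rows; which one suits the hypothesis is a design choice, and both options are assumptions rather than the method used in this notebook.

```python
import pandas as pd

# Rows copied from the preview above
criminal_preview = pd.DataFrame(
    {
        "ac08_id": ["2008-01-001", "2008-01-001", "2008-01-002", "2008-01-002", "2008-01-003"],
        "year": [2008.0, 2014.0, 2008.0, 2014.0, 2008.0],
        "mean_num_crim": [0.0, 0.0, 0.066667, 0.0, 0.2],
    }
)

# Option A (assumption): average the charges over all election years
crim_mean = criminal_preview.groupby("ac08_id", as_index=False)["mean_num_crim"].mean()

# Option B (assumption): keep only the most recent election per constituency
crim_latest = (
    criminal_preview.sort_values("year")
    .drop_duplicates("ac08_id", keep="last")
    .reset_index(drop=True)
)

print(crim_mean)
print(crim_latest)
```

Option B keeps the measurement closest in time to later datasets, while Option A smooths out single elections; either way the choice should be stated in a markdown cell.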


In [29]:
"""
Public service delivery
Measuring the availability of basic utilities such as electricity and treated water, which may indicate governance quality.
- pc11_vd_power_all: Power supply for all users [0 or 1]
- pc11_vd_wat_tap_trt: Access to treated tap water [0 or 1]
- pc11_vd_rd_all_wthr: All-weather roads [0 or 1]

source: pc11_vd_clean_con08
"""

public_delivery_service = process_dataset('pc11_vd_clean_con08.csv', ['pc11_vd_power_all', 'pc11_vd_wat_tap_trt', 'pc11_vd_rd_all_wthr'], 'ac08_id')

public_delivery_service.head(50)

# Check which values the power supply indicator takes (expected: 0, 1 or missing)
unique_values = public_delivery_service['pc11_vd_power_all'].unique()
print(unique_values)


[ 1. nan  0.]


In [30]:
"""
Wealth Disparity:
Measuring the distribution of wealth within a region and disparities that might arise due to corruption.
- facebook_mean_rwi: Mean relative wealth index of a region.

source: facebook_rwi_con08
"""

wealth_disparity_dataset = process_dataset('facebook_rwi_con08.csv', ['facebook_mean_rwi'], 'ac08_id')

wealth_disparity_dataset.head()

|   | ac08_id | facebook_mean_rwi |
|---|---|---|
| 0 | 2008-01-001 | -0.203812 |
| 1 | 2008-01-002 | 0.06598 |
| 2 | 2008-01-003 | -0.072646 |
| 3 | 2008-01-004 | 0.123039 |
| 4 | 2008-01-005 | 0.300398 |
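
Since all tables share the key `ac08_id`, they can be combined with `DataFrame.merge`. The sketch below uses toy frames with illustrative values; the `validate` and `indicator` arguments are standard pandas options that help catch duplicated keys and unmatched constituencies.

```python
import pandas as pd

# Toy frames with the same key column as the previews above (values illustrative)
crime = pd.DataFrame(
    {"ac08_id": ["2008-01-001", "2008-01-002", "2008-01-003"], "mean_num_crim": [0.0, 0.03, 0.2]}
)
wealth = pd.DataFrame(
    {"ac08_id": ["2008-01-001", "2008-01-002", "2008-01-004"], "facebook_mean_rwi": [-0.2, 0.07, 0.12]}
)

# validate raises an error if a key is duplicated on either side;
# indicator adds a "_merge" column showing where each row came from
merged = crime.merge(wealth, on="ac08_id", how="left", validate="one_to_one", indicator=True)

# Constituencies present in the left table but missing from the right one
unmatched = merged.loc[merged["_merge"] == "left_only", "ac08_id"].tolist()

print(merged)
print("Unmatched keys:", unmatched)
```

A left join keeps every constituency of the first table, so missing matches show up as `NaN` rather than silently disappearing.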


In [31]:
"""
Terrain Ruggedness:
Analyzing terrain ruggedness as a geographic barrier that may influence access to public services and infrastructure.
- tri_mean: Mean terrain ruggedness index, representing geographic challenges within regions.

Source:
- Terrain Ruggedness: terrain_ruggedness.csv
"""

# Load terrain ruggedness dataset
terrain_ruggedness_dataset = process_dataset('terrain_ruggedness.csv', ['tri_mean'], 'ac08_id')

# Display the first few rows of the dataset
terrain_ruggedness_dataset.head()


Error processing file terrain_ruggedness.csv: [Errno 2] No such file or directory: 'data/terrain_ruggedness.csv'


AttributeError: 'NoneType' object has no attribute 'head'
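
The `AttributeError` follows from the first message: `process_dataset` returned `None` because the file was not found, and `None` has no `.head()`. A minimal sketch of how to look for the right file name and avoid the second error is given below. Both helpers are hypothetical, and the search fragment `"rug"` is only a guess at part of the real file name.

```python
from pathlib import Path


def list_csv_candidates(data_dir, keyword):
    """Return CSV file names in data_dir that contain keyword (hypothetical helper)."""
    folder = Path(data_dir)
    if not folder.is_dir():
        return []
    return sorted(
        path.name for path in folder.glob("*.csv") if keyword.lower() in path.name.lower()
    )


def safe_head(df, n=5):
    """Preview a dataframe, or explain that loading failed instead of raising."""
    if df is None:
        print("Dataset was not loaded; check the file name in the data folder.")
        return None
    return df.head(n)


# "rug" is only a guess at a fragment of the real file name
print(list_csv_candidates("data", "rug"))
print(safe_head(None))
```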

In [32]:
"""
Shapefile

"""


# Load the shapefile
shapefile = gpd.read_file("data/India_AC.shp")

# Summarise the columns and preview the first rows
shapefile.info()
shapefile.head()


<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 4182 entries, 0 to 4181
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   OBJECTID    4182 non-null   int32   
 1   ST_CODE     4182 non-null   int64   
 2   ST_NAME     4182 non-null   object  
 3   DT_CODE     4112 non-null   float64 
 4   DIST_NAME   4110 non-null   object  
 5   AC_NO       4182 non-null   int64   
 6   AC_NAME     4148 non-null   object  
 7   PC_NO       4182 non-null   int64   
 8   PC_NAME     4148 non-null   object  
 9   PC_ID       4182 non-null   int64   
 10  STATUS      524 non-null    object  
 11  Shape_Leng  4182 non-null   float64 
 12  Shape_Area  4182 non-null   float64 
 13  geometry    4182 non-null   geometry
dtypes: float64(3), geometry(1), int32(1), int64(4), object(5)
memory usage: 441.2+ KB


|   | OBJECTID | ST_CODE | ST_NAME | DT_CODE | DIST_NAME | AC_NO | AC_NAME | PC_NO | PC_NAME | PC_ID | STATUS | Shape_Leng | Shape_Area | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13 | NAGALAND | 1.0 | MON | 41 | Tizit | 1 | NAGALAND | 1301 | Pre delimitation | 1.381854 | 0.055845 | POLYGON ((94.94575 26.93518, 94.9551 26.93975,... |
| 1 | 1 | 13 | NAGALAND | 1.0 | MON | 43 | Tapi | 1 | NAGALAND | 1301 | Pre delimitation | 1.056157 | 0.030387 | POLYGON ((95.22324 26.75964, 95.2176 26.75589,... |
| 2 | 1 | 13 | NAGALAND | 1.0 | MON | 42 | Wakching | 1 | NAGALAND | 1301 | Pre delimitation | 0.980303 | 0.018828 | POLYGON ((94.86775 26.82831, 94.87219 26.82334... |
| 3 | 1 | 13 | NAGALAND | 2.0 | TUENSANG | 49 | Tamlu | 1 | NAGALAND | 1301 | Pre delimitation | 1.133296 | 0.021899 | POLYGON ((94.73863 26.76868, 94.74029 26.77594... |
| 4 | 1 | 13 | NAGALAND | 3.0 | MOKOKCHUNG | 21 | Tuli | 1 | NAGALAND | 1301 | Pre delimitation | 0.965989 | 0.022397 | POLYGON ((94.73863 26.76868, 94.73627 26.74956... |
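
The shapefile has no `ac08_id` column, so a key has to be built before it can be joined to the CSV tables. The sketch below assumes, based only on values such as `2008-01-001` in the previews, that `ac08_id` has the form `2008-<two-digit state code>-<three-digit constituency number>`. This pattern is an assumption and should be checked against the SHRUG documentation before it is relied on.

```python
import pandas as pd

# Attribute rows copied from the shapefile preview above (geometry omitted)
shape_attrs = pd.DataFrame(
    {"ST_CODE": [13, 13, 13], "AC_NO": [41, 43, 42], "AC_NAME": ["Tizit", "Tapi", "Wakching"]}
)

# ASSUMPTION: ac08_id = "2008-" + zero-padded state code + "-" + zero-padded AC number
shape_attrs["ac08_id"] = (
    "2008-"
    + shape_attrs["ST_CODE"].astype(str).str.zfill(2)
    + "-"
    + shape_attrs["AC_NO"].astype(str).str.zfill(3)
)

print(shape_attrs)
```

The same column assignment works on a GeoDataFrame, and `shapefile.merge(table, on="ac08_id", how="left")` keeps the geometry column. Rows whose `STATUS` is `Pre delimitation` may not correspond to the 2008 boundaries, so it is worth counting how many keys actually match after the merge.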


In [None]:
"""
Public Sector Employment:
Measuring the size of public sector employment, which could indicate patronage networks tied to corruption.
- ec13_emp_gov: Total employment in public sector jobs.

source: ec13_con08
"""

public_employment_dataset = process_dataset('ec13_con08.csv', ['ec13_emp_gov'], 'ac08_id')
public_employment_dataset.head()