# GIS Data Science Assignment: Climate Change Analysis in Tanzania

In this assignment, you will analyze climate change patterns in Tanzania using GIS data. You will work with spatial data to understand, visualize, and analyze climate trends across different regions of Tanzania.

## Setup
First, install any missing dependencies, then import the necessary libraries:

In [1]:
# Run this cell to install any missing dependencies
!pip install geopandas matplotlib numpy pandas seaborn folium mapclassify xarray rasterio contextily



In [1]:
# Import necessary libraries
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import os
from matplotlib.colors import ListedColormap
import contextily as ctx

# Set plotting parameters
plt.rcParams['figure.figsize'] = (12, 8)
sns.set(style="whitegrid")

In [4]:
!pwd

/Users/dmachuve/Dropbox/Projects/OmdenaSchool/ClimateKIC/week8/gis-data-science-twiga2


## Part 1: GIS Data Basics

### Task 1.1: Load the Tanzania Shapefile
Load the Tanzania administrative boundaries shapefile and examine its structure.

In [2]:
# TODO: Load the Tanzania shapefile
# Hint: Use gpd.read_file() to load the shapefile

tz_shapefile = r"/Users/dmachuve/Dropbox/Projects/OmdenaSchool/ClimateKIC/week8/gis-data-science-twiga2/data/tanzania_regions.shp"
gdf = gpd.read_file(tz_shapefile)

# Function to display basic information about a GeoDataFrame
def describe_geodataframe(gdf):
    """Display basic information about a GeoDataFrame.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame to describe
    
    Returns:
    dict: A dictionary containing basic information about the GeoDataFrame
    """
    orign_crs = gdf.crs  # original CRS (None if the file defines none)
    # Areas in square kilometres need a CRS measured in metres.
    # EPSG:3857 uses metres but is not equal-area, so the total is approximate.
    area_gdf = gdf
    if orign_crs is not None and orign_crs.is_geographic:
        area_gdf = gdf.to_crs(epsg=3857)
    total_area = area_gdf.area.sum() / 1_000_000  # m² to km²
    
    info = {
        'crs': orign_crs,  # TODO: Get the coordinate reference system
        'geometry_type': gdf.geometry.type.unique().tolist(),  # TODO: Get the geometry type
        'num_features': len(gdf),  # TODO: Get the number of features
        'attributes': gdf.columns.tolist(),  # TODO: Get the attribute column names
        'total_area': total_area,  # TODO: Calculate the total area in square kilometers
        'bounds': gdf.total_bounds  # TODO: Get the bounds of the dataset
    }
    return info

# Call the function with your loaded shapefile
tz_info = describe_geodataframe(gdf)  # keep the dict; print() itself returns None
print(tz_info)

{'crs': <Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
, 'geometry_type': ['Polygon'], 'num_features': 31, 'attributes': ['REGION_NAM', 'REGION_COD', 'ZONE', 'LAND_AREA_', 'POPULATION', 'POP_DENSIT', 'ELEVATION_', 'DIST_TO_CO', 'geometry'], 'total_area': np.float64(340445.3190935693), 'bounds': array([ 30.05418469, -11.96787178,  40.30646215,  -1.00507558])}


### Task 1.2: Understand Coordinate Reference Systems
Explain the current CRS and reproject the data to a suitable projection for Tanzania.

In [24]:
# TODO: Identify the current CRS and explain why it might not be optimal for Tanzania

# TODO: Reproject the data to a more appropriate CRS for Tanzania
# Hint: Consider using EPSG:21037 (Arc 1960 / UTM zone 37S) which is suitable for Tanzania
tz_projected = gdf.to_crs(epsg=21037)

tz_info2 =describe_geodataframe(gdf) # original gdf info

# TODO: Compare the original and reprojected data
# Hint: Create a function that compares areas before and after reprojection
def compare_projections(original_gdf, reprojected_gdf):
    """Compare the original and reprojected GeoDataFrames.
    
    Parameters:
    original_gdf (GeoDataFrame): The original GeoDataFrame
    reprojected_gdf (GeoDataFrame): The reprojected GeoDataFrame
    
    Returns:
    dict: A dictionary containing comparison metrics
    """
    # Reuse describe_geodataframe() so the original area is derived from the
    # GeoDataFrame passed in rather than from a global variable
    original_area = describe_geodataframe(original_gdf)['total_area']
    new_area = reprojected_gdf.area.sum() / 1_000_000  # m² to km²

    comparison = {
        'original_crs': original_gdf.crs,
        'new_crs': reprojected_gdf.crs,
        'original_area': original_area,  # total area (km²) for the original data
        'new_area': new_area,  # total area (km²) in the new projection
    }
    # Percentage difference relative to the original area
    area_diff = (original_area - new_area) / original_area * 100
    comparison['percent_difference'] = f"{area_diff:.2f}%"
    return comparison

# Call the comparison function
projection_comparison = compare_projections(gdf, tz_projected)  # keep the dict; print() returns None
print(projection_comparison)

{'original_crs': <Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
, 'new_crs': <Projected CRS: EPSG:21037>
Name: Arc 1960 / UTM zone 37S
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Kenya - south of equator and east of 36°E; Tanzania - east of 36°E.
- bounds: (36.0, -11.75, 41.6, 0.0)
Coordinate Operation:
- name: UTM zone 37S
- method: Transverse Mercator
Datum: Arc 1960
- Ellipsoid: Clarke 1880 (RGS)
- Prime Meridian: Greenwich
, 'original_area': np.float64(340445.3190935693), 'new_area': np.float64(334462.65500737564), 'percent_difference': '1.76%'}
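
The `Area of Use` printed above shows that EPSG:21037 covers Tanzania only east of 36°E, while the dataset bounds from Task 1.1 start near 30°E, so the western regions fall outside the zone this CRS was designed for. The sketch below is one way to sanity-check such a choice. It assumes a plain bounding box built from the bounds reported in Task 1.1 as a stand-in for the real polygons, and it uses EPSG:6933 (an equal-area projection) as an assumed reference for area; neither appears in the original assignment. `estimate_utm_crs()` is the GeoPandas helper that suggests a UTM zone from the extent of the data.

```python
import geopandas as gpd
from shapely.geometry import box

# Stand-in geometry: the bounding box reported by describe_geodataframe(),
# not the real region polygons.
bbox = gpd.GeoDataFrame(
    {"name": ["Tanzania bounding box"]},
    geometry=[box(30.05, -11.97, 40.31, -1.01)],
    crs="EPSG:4326",
)

# Ask GeoPandas which UTM zone best matches the extent of the data.
suggested_utm = bbox.estimate_utm_crs()
print(suggested_utm.to_epsg(), suggested_utm.name)

# Compare the area in Web Mercator (EPSG:3857) with an equal-area CRS (EPSG:6933).
area_web_mercator = bbox.to_crs(epsg=3857).area.sum() / 1e6  # km²
area_equal_area = bbox.to_crs(epsg=6933).area.sum() / 1e6    # km²
inflation_pct = (area_web_mercator - area_equal_area) / area_equal_area * 100
print(f"Web Mercator overstates the area by about {inflation_pct:.2f}%")
```

Because Tanzania spans more than one UTM zone, any single zone distorts the regions farthest from its central meridian; an equal-area CRS is usually the safer choice when the quantity of interest is area.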


## Part 2: Data Loading and Processing

### Task 2.1: Load Climate Data
Load the provided climate data for Tanzania and examine its structure.

In [29]:
# TODO: Load the climate data CSV file
climate_data = pd.read_csv("/Users/dmachuve/Dropbox/Projects/OmdenaSchool/ClimateKIC/week8/gis-data-science-twiga2/data/tanzania_annual_climate_data.csv")

# TODO: Display the first few rows and basic statistics of the climate data
# Hint: Use .head(), .describe(), and .info() methods

# TODO: Check for missing values and handle them appropriately
def check_missing_values(df):
    """Check for missing values in a DataFrame and return a summary.
    
    Parameters:
    df (DataFrame): The DataFrame to check
    
    Returns:
    DataFrame: A summary of missing values by column
    """
    # df.info() writes its report directly and returns None,
    # so it is not wrapped in print()
    print("Dataset Information:")
    df.info()
    print("Missing values by column")
    missing = df.isnull().sum()
    print(missing)
    # Return the summary promised by the docstring
    return pd.DataFrame({
        'missing_count': missing,
        'missing_percent': missing / len(df) * 100,
    })

missing_summary = check_missing_values(climate_data)

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 744 entries, 0 to 743
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   REGION_CODE             744 non-null    object 
 1   REGION_NAME             744 non-null    object 
 2   YEAR                    744 non-null    int64  
 3   ZONE                    744 non-null    object 
 4   ANNUAL_AVG_TEMP_C       744 non-null    float64
 5   MAX_TEMP_C              744 non-null    float64
 6   MIN_TEMP_C              744 non-null    float64
 7   ANNUAL_PRECIP_MM        744 non-null    float64
 8   ANNUAL_RAIN_DAYS        744 non-null    int64  
 9   ANNUAL_HEAVY_RAIN_DAYS  744 non-null    int64  
 10  ANNUAL_DROUGHT_INDEX    744 non-null    float64
 11  ELEVATION_M             744 non-null    float64
 12  DISTANCE_TO_COAST_KM    744 non-null    float64
dtypes: float64(7), int64(3), object(3)
memory usage: 75.7+ KB
Missing val
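
The `info()` report above shows 744 non-null values in every column, so nothing needs filling in this dataset. If gaps did appear, one possible way to handle them is to interpolate along the years within each region. The sketch below illustrates that idea on a small hypothetical table; the values are invented for the example and the choice of linear interpolation is an assumption, not a requirement of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps; the real dataset reports no missing values.
sample = pd.DataFrame({
    "REGION_CODE": ["AR", "AR", "AR", "DS", "DS", "DS"],
    "YEAR": [2000, 2001, 2002, 2000, 2001, 2002],
    "ANNUAL_AVG_TEMP_C": [21.0, np.nan, 23.0, 26.0, np.nan, 27.0],
})

# Sort by year within each region so that neighbouring rows are neighbouring years.
sample = sample.sort_values(["REGION_CODE", "YEAR"])

# Interpolate linearly inside each region, so values never leak across regions.
sample["ANNUAL_AVG_TEMP_C"] = (
    sample.groupby("REGION_CODE")["ANNUAL_AVG_TEMP_C"]
    .transform(lambda s: s.interpolate())
)
print(sample)
```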

### Task 2.2: Join Climate Data with Spatial Data
Merge the climate data with the Tanzania shapefile based on a common identifier.

In [30]:
# TODO: Identify the common field between the climate data and the shapefile
# Hint: Look for a region/district identifier in both datasets

# TODO: Join the climate data with the shapefile
# Hint: Use the merge() or join() method
tz_climate = climate_data
gdf = tz_projected

print(tz_climate.columns)
print(gdf.columns)

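# The common identifier is the region code: 'REGION_CODE' in the climate data and
# 'REGION_COD' in the shapefile (REGION_NAME/REGION_NAM and ZONE also correspond)
tz_climate = tz_climate.rename(columns={'REGION_CODE': 'REGION_COD'})

# A left join keeps every region; each region gains one row per year of climate data.
# 'ZONE' exists in both tables, so the result holds 'ZONE_x' and 'ZONE_y'.
merged_tzgdf = gdf.merge(tz_climate, on='REGION_COD', how='left')

# Save the merged dataset (the shapefile format limits field names to 10 characters,
# so longer column names are shortened on save)
merged_tzgdf.to_file("merged_tz_climate.shp")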

# TODO: Verify the join was successful by checking the shape and contents of the result
def verify_join(original_gdf, joined_gdf, climate_df):
    """Verify that the join between spatial and climate data was successful.
    
    Parameters:
    original_gdf (GeoDataFrame): The original spatial GeoDataFrame
    joined_gdf (GeoDataFrame): The joined GeoDataFrame
    climate_df (DataFrame): The climate DataFrame
    
    Returns:
    dict: A dictionary containing verification metrics
    """
    verification = {
        'original_features': len(original_gdf),  # features in the original GeoDataFrame
        'joined_features': len(joined_gdf),  # rows after the join
        'climate_records': len(climate_df),  # records in the climate DataFrame
        'joined_columns': joined_gdf.columns.tolist(),  # column names after the join
        # Assumed criterion: in a one-to-many left join where every region has climate
        # records, each climate record should appear exactly once in the result
        'is_successful': len(joined_gdf) == len(climate_df)
    }
    return verification

join_verification = verify_join(gdf, merged_tzgdf, tz_climate)

Index(['REGION_CODE', 'REGION_NAME', 'YEAR', 'ZONE', 'ANNUAL_AVG_TEMP_C',
       'MAX_TEMP_C', 'MIN_TEMP_C', 'ANNUAL_PRECIP_MM', 'ANNUAL_RAIN_DAYS',
       'ANNUAL_HEAVY_RAIN_DAYS', 'ANNUAL_DROUGHT_INDEX', 'ELEVATION_M',
       'DISTANCE_TO_COAST_KM'],
      dtype='object')
Index(['REGION_NAM', 'REGION_COD', 'ZONE', 'LAND_AREA_', 'POPULATION',
       'POP_DENSIT', 'ELEVATION_', 'DIST_TO_CO', 'geometry'],
      dtype='object')
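
A row count alone does not reveal which records failed to match. As a minimal sketch, pandas' `merge()` can report this itself through its `indicator` and `validate` parameters. The two small tables below are hypothetical stand-ins for the shapefile attributes and the climate data; the `XX` code is invented to show what an unmatched record looks like.

```python
import pandas as pd

# Hypothetical stand-ins for the shapefile attributes and the climate table.
regions = pd.DataFrame({"REGION_COD": ["AR", "DS", "DO"]})
climate = pd.DataFrame({
    "REGION_COD": ["AR", "AR", "DS", "DS", "XX"],
    "YEAR": [2000, 2001, 2000, 2001, 2000],
    "ANNUAL_AVG_TEMP_C": [21.71, 22.45, 26.1, 26.3, 20.0],
})

# validate raises an error if a region code is duplicated on the left side;
# indicator adds a "_merge" column recording where each row came from.
merged = regions.merge(
    climate, on="REGION_COD", how="left",
    validate="one_to_many", indicator=True,
)

# Regions that received no climate records at all.
unmatched_regions = merged.loc[merged["_merge"] == "left_only", "REGION_COD"].tolist()

# Climate codes that do not exist in the spatial data (dropped by the left join).
orphan_codes = sorted(set(climate["REGION_COD"]) - set(regions["REGION_COD"]))

print("Regions without climate data:", unmatched_regions)
print("Climate codes without a region:", orphan_codes)
```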


In [31]:
tz_climate.head()

| | REGION_CODE | REGION_NAME | YEAR | ZONE | ANNUAL_AVG_TEMP_C | MAX_TEMP_C | MIN_TEMP_C | ANNUAL_PRECIP_MM | ANNUAL_RAIN_DAYS | ANNUAL_HEAVY_RAIN_DAYS | ANNUAL_DROUGHT_INDEX | ELEVATION_M | DISTANCE_TO_COAST_KM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AR | Arusha | 2000 | Northern | 21.71 | 26.6 | 16.99 | 6918.2 | 107 | 21 | 0.105833 | 920.2 | 469.3 |
| 1 | AR | Arusha | 2001 | Northern | 22.45 | 27.81 | 17.78 | 6333.0 | 104 | 15 | 0.428333 | 920.2 | 469.3 |
| 2 | AR | Arusha | 2002 | Northern | 22.81 | 27.3 | 17.74 | 10795.2 | 144 | 41 | 0.0 | 920.2 | 469.3 |
| 3 | AR | Arusha | 2003 | Northern | 21.38 | 25.77 | 16.39 | 5592.0 | 62 | 10 | 0.899167 | 920.2 | 469.3 |
| 4 | AR | Arusha | 2004 | Northern | 21.67 | 26.89 | 16.57 | 10407.2 | 150 | 39 | 0.0 | 920.2 | 469.3 |


In [32]:
gdf.head()

| | REGION_NAM | REGION_COD | ZONE | LAND_AREA_ | POPULATION | POP_DENSIT | ELEVATION_ | DIST_TO_CO | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Arusha | AR | Northern | 8238.33 | 523949 | 63.6 | 920.2 | 469.3 | POLYGON ((-5772.502 9729969.442, -35445.56 966... |
| 1 | Dar es Salaam | DS | Eastern | 7765.76 | 311157 | 40.07 | 1298.9 | 257.1 | POLYGON ((467450.916 9825892.725, 444928.169 9... |
| 2 | Dodoma | DO | Central | 10271.91 | 784652 | 76.39 | 1790.4 | 416.6 | POLYGON ((222290.011 8779222.413, 198470.279 8... |
| 3 | Geita | GE | Lake | 11836.41 | 565591 | 47.78 | 1440.9 | 460.9 | POLYGON ((-202602.361 9156239.463, -215020.089... |
| 4 | Iringa | IR | Southern Highlands | 18649.73 | 214091 | 11.48 | 476.1 | 82.6 | POLYGON ((-249740.672 8938318.243, -301201.682... |


## Part 3: Data Visualization

### Task 3.1: Create a Choropleth Map
Create a choropleth map showing average temperature across Tanzania regions.

In [None]:
# TODO: Create a choropleth map of average temperature by region
# Hint: Use the .plot() method with the column parameter
def create_choropleth(gdf, column, title, cmap='viridis', figsize=(12, 8)):
    """Create a choropleth map for a GeoDataFrame.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame to plot
    column (str): The column to use for coloring
    title (str): The title of the map
    cmap (str or Colormap): The colormap to use
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
    # TODO: Implement this function
    pass

temp_map = None
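
Since the joined data holds one row per region and year, a choropleth of average temperature first needs a single value per region. The following is a minimal sketch of that approach, not a reference solution: the three square "regions" and their temperatures are hypothetical stand-ins, and averaging over all years is an assumed choice.

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import box

# Hypothetical stand-ins: three square "regions" and two years of climate records.
regions = gpd.GeoDataFrame(
    {"REGION_COD": ["AR", "DS", "DO"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)
climate = pd.DataFrame({
    "REGION_COD": ["AR", "AR", "DS", "DS", "DO", "DO"],
    "YEAR": [2000, 2001] * 3,
    "ANNUAL_AVG_TEMP_C": [21.0, 23.0, 26.0, 27.0, 22.0, 22.0],
})

# Collapse the yearly records to one mean value per region before mapping.
mean_temp = climate.groupby("REGION_COD", as_index=False)["ANNUAL_AVG_TEMP_C"].mean()
regions_temp = regions.merge(mean_temp, on="REGION_COD", how="left")

# Colour each polygon by its mean temperature and attach a colour bar.
fig, ax = plt.subplots(figsize=(12, 8))
regions_temp.plot(
    column="ANNUAL_AVG_TEMP_C", cmap="viridis", legend=True,
    legend_kwds={"label": "Mean annual temperature (°C)"},
    edgecolor="black", linewidth=0.5, ax=ax,
)
ax.set_title("Mean annual temperature by region")
ax.set_axis_off()
```

Plotting a single year instead (for example by filtering on `YEAR` before the merge) follows the same pattern.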

### Task 3.2: Create a Time Series Visualization
Create a time series visualization showing temperature trends over time for selected regions.

In [None]:
# TODO: Select a few representative regions for the time series
# Hint: Choose regions from different parts of the country
selected_regions = None

# TODO: Filter the climate data for these regions
region_climate_data = None

# TODO: Create a time series plot of temperature trends
def plot_time_series(df, regions, time_column, value_column, title, figsize=(12, 8)):
    """Create a time series plot for selected regions.
    
    Parameters:
    df (DataFrame): The DataFrame containing the time series data
    regions (list): The list of regions to include
    time_column (str): The column containing time information
    value_column (str): The column containing the values to plot
    title (str): The title of the plot
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
    # TODO: Implement this function
    pass

temp_time_series = None
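
One straightforward way to draw such a time series is to plot one line per region on shared axes. The sketch below assumes a small hypothetical table with the same column names as the climate data; the two regions and their values are chosen only for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly values for two regions.
df = pd.DataFrame({
    "REGION_NAME": ["Arusha"] * 3 + ["Dodoma"] * 3,
    "YEAR": [2000, 2001, 2002] * 2,
    "ANNUAL_AVG_TEMP_C": [21.7, 22.4, 22.8, 23.1, 23.0, 23.6],
})
selected = ["Arusha", "Dodoma"]

fig, ax = plt.subplots(figsize=(12, 8))
for region in selected:
    # Sort by year so the line is drawn in chronological order.
    subset = df[df["REGION_NAME"] == region].sort_values("YEAR")
    ax.plot(subset["YEAR"], subset["ANNUAL_AVG_TEMP_C"], marker="o", label=region)

ax.set_xlabel("Year")
ax.set_ylabel("Annual average temperature (°C)")
ax.set_title("Temperature trends for selected regions")
ax.legend(title="Region")
```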

### Task 3.3: Create an Interactive Map
Create an interactive map showing climate data using Folium.

In [None]:
# TODO: Convert the projected GeoDataFrame to WGS84 for use with Folium
tz_wgs84 = None

# TODO: Create an interactive map using Folium
def create_interactive_map(gdf, column, popup_columns, title, center=None, zoom_start=6):
    """Create an interactive map using Folium.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame to map (must be in WGS84)
    column (str): The column to use for coloring
    popup_columns (list): Columns to include in the popup
    title (str): The title of the map
    center (tuple): The center coordinates as (lat, lon)
    zoom_start (int): The initial zoom level
    
    Returns:
    folium.Map: The created interactive map
    """
    # TODO: Implement this function
    pass

interactive_map = None
# Display the map
# interactive_map

## Part 4: Climate Change EDA

### Task 4.1: Analyze Temperature Trends
Analyze the trends in temperature across Tanzania over time.

In [None]:
# TODO: Calculate temperature trends for each region
def calculate_temperature_trends(df, region_column, year_column, temp_column):
    """Calculate temperature trends for each region.
    
    Parameters:
    df (DataFrame): The DataFrame containing climate data
    region_column (str): The column containing region identifiers
    year_column (str): The column containing year information
    temp_column (str): The column containing temperature values
    
    Returns:
    DataFrame: A DataFrame containing trend information for each region
    """
    # TODO: Implement this function using linear regression or other trend analysis methods
    pass

temp_trends = None

# TODO: Visualize the temperature trends
def plot_temperature_trends(trends_df, region_column, trend_column, title, figsize=(12, 8)):
    """Plot temperature trends by region.
    
    Parameters:
    trends_df (DataFrame): The DataFrame containing trend information
    region_column (str): The column containing region identifiers
    trend_column (str): The column containing trend values
    title (str): The title of the plot
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
    # TODO: Implement this function
    pass

trends_plot = None
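
A simple way to quantify a trend is to fit a straight line to temperature against year for each region and read off the slope. The sketch below does this with `np.polyfit`; the function name `linear_trend_by_region`, the output column names, and the demo data are all hypothetical, and ordinary least squares is an assumed choice among several possible trend methods.

```python
import numpy as np
import pandas as pd

def linear_trend_by_region(df, region_column, year_column, temp_column):
    """Fit temperature = slope * year + intercept separately for each region."""
    rows = []
    for region, group in df.groupby(region_column):
        slope, intercept = np.polyfit(
            group[year_column].to_numpy(dtype=float),
            group[temp_column].to_numpy(dtype=float),
            deg=1,
        )
        rows.append({
            region_column: region,
            "trend_c_per_decade": slope * 10,  # slope is in °C per year
            "n_years": len(group),
        })
    return pd.DataFrame(rows)

# Hypothetical data: one region warming by 0.05 °C/year, one cooling by 0.02 °C/year.
years = np.arange(2000, 2010)
demo = pd.DataFrame({
    "REGION_NAME": ["Warming"] * 10 + ["Cooling"] * 10,
    "YEAR": np.tile(years, 2),
    "ANNUAL_AVG_TEMP_C": np.concatenate([
        21.0 + 0.05 * (years - 2000),
        25.0 - 0.02 * (years - 2000),
    ]),
})
trends = linear_trend_by_region(demo, "REGION_NAME", "YEAR", "ANNUAL_AVG_TEMP_C")
print(trends)
```

The slope alone says nothing about statistical significance; `scipy.stats.linregress` returns a p-value alongside the slope if that is needed.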

### Task 4.2: Identify Climate Change Hotspots
Identify regions in Tanzania that are experiencing the most significant climate change.

In [None]:
# TODO: Define criteria for climate change hotspots
# Hint: Consider temperature trends, precipitation changes, extreme weather events, etc.

# TODO: Implement a function to identify hotspots based on your criteria
def identify_hotspots(climate_gdf, criteria_columns, threshold_values):
    """Identify climate change hotspots based on specified criteria.
    
    Parameters:
    climate_gdf (GeoDataFrame): The GeoDataFrame containing climate and spatial data
    criteria_columns (list): The columns to use as criteria
    threshold_values (dict): A dictionary mapping criteria columns to threshold values
    
    Returns:
    GeoDataFrame: A GeoDataFrame containing only the hotspot regions
    """
    # TODO: Implement this function
    pass

hotspots = None

# TODO: Visualize the identified hotspots
hotspot_map = None
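
One possible reading of "hotspot" is a region that meets or exceeds a threshold on every chosen indicator. The sketch below implements that rule on a plain DataFrame; the helper name `flag_hotspots`, the indicator columns, and the threshold values are hypothetical and purely illustrative, not recommended criteria.

```python
import pandas as pd

def flag_hotspots(df, thresholds):
    """Keep rows whose value meets or exceeds the threshold in every listed column."""
    mask = pd.Series(True, index=df.index)
    for column, threshold in thresholds.items():
        mask &= df[column] >= threshold
    return df[mask]

# Hypothetical per-region indicators; the thresholds are illustrative only.
indicators = pd.DataFrame({
    "REGION_NAME": ["A", "B", "C"],
    "trend_c_per_decade": [0.45, 0.10, 0.38],
    "mean_drought_index": [0.60, 0.70, 0.20],
})
hot = flag_hotspots(
    indicators,
    {"trend_c_per_decade": 0.3, "mean_drought_index": 0.5},
)
print(hot["REGION_NAME"].tolist())
```

Boolean indexing works the same way on a GeoDataFrame, so the filtered result can be passed directly to `.plot()` to map the hotspots.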

### Task 4.3: Regional Climate Variation Analysis
Analyze how climate variables vary across different regions of Tanzania.

In [None]:
# TODO: Calculate regional statistics for climate variables
def calculate_regional_stats(gdf, region_column, climate_columns):
    """Calculate statistics for climate variables by region.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame containing climate and spatial data
    region_column (str): The column containing region identifiers
    climate_columns (list): The columns containing climate variables
    
    Returns:
    DataFrame: A DataFrame containing statistics for each region and climate variable
    """
    # TODO: Implement this function
    pass

regional_stats = None

# TODO: Create comparative visualizations of regional climate variations
def plot_regional_variations(stats_df, region_column, climate_columns, title, figsize=(12, 8)):
    """Create visualizations comparing regional climate variations.
    
    Parameters:
    stats_df (DataFrame): The DataFrame containing regional statistics
    region_column (str): The column containing region identifiers
    climate_columns (list): The columns containing climate variables
    title (str): The title of the plot
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
    # TODO: Implement this function
    pass

variations_plot = None
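
Regional statistics of this kind can be computed with a single `groupby` and `agg` call. The sketch below uses a small hypothetical table with the same column names as the climate data; the choice of mean, standard deviation, minimum, and maximum is an assumption about which statistics are useful.

```python
import pandas as pd

# Hypothetical records: two regions, two years each.
demo = pd.DataFrame({
    "REGION_NAME": ["Arusha", "Arusha", "Dodoma", "Dodoma"],
    "ANNUAL_AVG_TEMP_C": [21.0, 23.0, 23.0, 24.0],
    "ANNUAL_PRECIP_MM": [900.0, 1100.0, 550.0, 650.0],
})
climate_columns = ["ANNUAL_AVG_TEMP_C", "ANNUAL_PRECIP_MM"]

# One row per region, one column per (variable, statistic) pair.
stats = demo.groupby("REGION_NAME")[climate_columns].agg(["mean", "std", "min", "max"])

# Flatten the two-level column index into names like "ANNUAL_AVG_TEMP_C_mean".
stats.columns = ["_".join(col) for col in stats.columns]
stats = stats.reset_index()
print(stats)
```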

## Conclusion

### Task 5: Summarize Findings
Summarize your key findings from the climate change analysis.

**TODO: Write a summary of your findings here.**

Your summary should include:
1. Key observations about temperature trends
2. Identified climate change hotspots
3. Notable regional variations
4. Potential implications for Tanzania
5. Recommendations for further analysis