# GIS Data Science Assignment: Climate Change Analysis in Tanzania

In this assignment, you will analyze climate change patterns in Tanzania using GIS data. You will work with spatial data to understand, visualize, and analyze climate trends across different regions of Tanzania.

## Setup
First, install any missing dependencies, then import the necessary libraries:

In [None]:
# Run this cell to install any missing dependencies (scikit-learn is needed for Part 4)
%pip install geopandas matplotlib numpy pandas seaborn folium mapclassify xarray rasterio contextily scikit-learn


In [None]:
# Import necessary libraries
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import os
from matplotlib.colors import ListedColormap
import contextily as ctx

# Set plotting parameters
plt.rcParams['figure.figsize'] = (12, 8)
sns.set(style="whitegrid")


## Part 1: GIS Data Basics

### Task 1.1: Load the Tanzania Shapefile
Load the Tanzania administrative boundaries shapefile and examine its structure.

In [None]:
# Load the Tanzania administrative boundaries with gpd.read_file()
tz_shapefile = 'data/tanzania_regions.shp'  # Path to the shapefile
tz_gdf = gpd.read_file(tz_shapefile)

# Function to display basic information about a GeoDataFrame
def describe_geodataframe(gdf):
    """Display basic information about a GeoDataFrame.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame to describe
    
    Returns:
    dict: A dictionary containing basic information about the GeoDataFrame
    """
    # Planar area is only meaningful in a projected, equal-area CRS. If the data use a
    # geographic CRS (degrees), compute the area on a reprojected copy so that the
    # reported CRS and bounds still describe the input data.
    area_gdf = gdf
    if gdf.crs and gdf.crs.is_geographic:
        # EPSG:6933 (WGS 84 / NSIDC EASE-Grid 2.0 Global) is an equal-area projection in metres.
        # Mercator-type projections such as EPSG:3395 inflate areas away from the equator.
        area_gdf = gdf.to_crs(epsg=6933)
    
    info = {
        'crs': gdf.crs,  # Coordinate reference system of the input data
        'geometry_type': gdf.geom_type.unique().tolist(),  # Unique geometry types
        'num_features': len(gdf),  # Number of features
        'attributes': list(gdf.columns.drop(gdf.geometry.name)),  # Attribute columns (excluding geometry)
        'total_area': area_gdf.geometry.area.sum() / 1e6,  # Total area in km² (area computed in m²)
        'bounds': gdf.total_bounds  # Bounds as (minx, miny, maxx, maxy) in the input CRS
    }
    return info

# Call the function with your loaded shapefile
tz_info = describe_geodataframe(tz_gdf)
print(tz_info)
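
The total area reported for a layer depends on the projection used to compute it. As a rough illustration, a Mercator-type projection stretches both axes by 1/cos(latitude), so areas are inflated by about 1/cos²(latitude). The sketch below uses the spherical Mercator formula (not the exact ellipsoidal one), and the latitudes are approximate, assumed values for Tanzania's northern, central, and southern extent.

```python
import math

def mercator_area_inflation(lat_deg):
    """Approximate factor by which a spherical Mercator projection inflates area."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

# Approximate latitudes spanning Tanzania (assumed values, for illustration only)
for lat in (-1.0, -6.0, -11.7):
    inflation = mercator_area_inflation(lat)
    print(f"lat {lat:6.1f} deg: area inflated by {(inflation - 1) * 100:.2f}%")
```

An equal-area projection avoids this latitude-dependent bias when region areas are summed.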


### Task 1.2: Understand Coordinate Reference Systems
Explain the current CRS and reproject the data to a suitable projection for Tanzania.

In [None]:
import geopandas as gpd

# Load the Tanzania shapefile
tz_shapefile = 'data/tanzania_regions.shp'
tz_gdf = gpd.read_file(tz_shapefile)

# TODO: Identify the current CRS and explain why it might not be optimal for Tanzania
# The current CRS might be in a geographic coordinate system (e.g., EPSG:4326) where units are in degrees.
# Degrees are not ideal for distance or area calculations because they are not uniform across the globe.
print("Original CRS:", tz_gdf.crs)
print("Note: A geographic CRS (e.g., EPSG:4326) is not optimal for area calculations in Tanzania.")

# TODO: Reproject the data to a more appropriate CRS for Tanzania
# Hint: Consider using EPSG:21037 (Arc 1960 / UTM zone 37S) which is suitable for Tanzania
tz_projected = tz_gdf.to_crs(epsg=21037)

# TODO: Compare the original and reprojected data
# Hint: Create a function that compares areas before and after reprojection
def compare_projections(original_gdf, reprojected_gdf):
    """Compare the original and reprojected GeoDataFrames.
    
    Parameters:
    original_gdf (GeoDataFrame): The original GeoDataFrame
    reprojected_gdf (GeoDataFrame): The reprojected GeoDataFrame
    
    Returns:
    dict: A dictionary containing comparison metrics
    """
    # Areas are only comparable when both layers use a projected CRS with linear units.
    # In a geographic CRS, .area returns "square degrees" (geopandas emits a warning),
    # so a percentage difference against square metres would be meaningless.
    original_area = original_gdf.geometry.area.sum()
    new_area = reprojected_gdf.geometry.area.sum()
    
    # Report a percentage difference only when the original CRS is projected
    original_is_projected = original_gdf.crs is not None and original_gdf.crs.is_projected
    if original_is_projected and original_area != 0:
        percent_difference = abs(new_area - original_area) / original_area * 100
    else:
        percent_difference = None
    
    comparison = {
        'original_crs': original_gdf.crs,
        'new_crs': reprojected_gdf.crs,
        'original_area': original_area,
        'new_area': new_area,
        'percent_difference': percent_difference
    }
    return comparison

# Call the comparison function
projection_comparison = compare_projections(tz_gdf, tz_projected)
print(projection_comparison)
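
One way to reason about a suitable projected CRS is to work out which UTM zones a country falls in. The sketch below computes the UTM zone from a longitude and the matching WGS 84 / UTM EPSG code (326xx for the northern hemisphere, 327xx for the southern). The helper names and sample longitudes are assumptions for illustration, and these codes use the WGS 84 datum rather than the Arc 1960 datum behind EPSG:21037.

```python
import math

def utm_zone(lon_deg):
    """Return the UTM zone number (1-60) containing a longitude in degrees."""
    return int(math.floor((lon_deg + 180.0) / 6.0)) % 60 + 1

def wgs84_utm_epsg(lon_deg, lat_deg):
    """Return the EPSG code of the WGS 84 / UTM zone for a point."""
    base = 32600 if lat_deg >= 0 else 32700  # 326xx = north, 327xx = south
    return base + utm_zone(lon_deg)

# Approximate western, central, and eastern longitudes of Tanzania (assumed values)
for lon in (29.5, 35.0, 40.4):
    print(lon, utm_zone(lon), wgs84_utm_epsg(lon, -6.0))
```

Because the country spans more than one 6-degree zone, any single UTM zone is a compromise: distortion grows with distance from that zone's central meridian.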


## Part 2: Data Loading and Processing

### Task 2.1: Load Climate Data
Load the provided climate data for Tanzania and examine its structure.

In [None]:
# TODO: Load the climate data CSV file

# Path to the CSV file

climate_data = pd.read_csv('data/tanzania_monthly_climate_data.csv')

# TODO: Display the first few rows and basic statistics of the climate data
# Hint: Use .head(), .describe(), and .info() methods

# Display the first few rows, summary statistics, and column types
print(climate_data.head())
print(climate_data.describe())
climate_data.info()  # info() prints its report itself and returns None


# TODO: Check for missing values and handle them appropriately
def check_missing_values(df):
    """Check for missing values in a DataFrame and return a summary.
    
    Parameters:
    df (DataFrame): The DataFrame to check
    
    Returns:
    DataFrame: A summary of missing values by column
    """
    # TODO: Implement this function
    missing_values = df.isnull().sum()
    missing_percent = df.isnull().mean() * 100
    missing_summary = pd.DataFrame({
        'missing_values': missing_values,
        'missing_percent': missing_percent
    })
    return missing_summary

# Call the function on the climate data
missing_summary = check_missing_values(climate_data)
print(missing_summary)
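
The summary above only counts missing values; handling them is a separate step. A minimal sketch, assuming gaps are short and that a `MONTH` column exists (it is not confirmed by this notebook), is to interpolate linearly within each region's own time series so that values never leak between regions. The demo table and the helper name `fill_gaps_by_region` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly records; column names mirror those used in this notebook
demo = pd.DataFrame({
    'REGION_NAME': ['Arusha'] * 4 + ['Dodoma'] * 4,
    'YEAR': [2020] * 8,
    'MONTH': [1, 2, 3, 4] * 2,  # MONTH is an assumed column
    'AVG_TEMP_C': [20.0, np.nan, 22.0, 23.0, 24.0, 25.0, np.nan, 27.0],
})

def fill_gaps_by_region(df, region_column, value_column, sort_columns):
    """Linearly interpolate missing values within each region's time series."""
    out = df.sort_values([region_column] + sort_columns).copy()
    out[value_column] = (
        out.groupby(region_column)[value_column]
           .transform(lambda s: s.interpolate(limit_direction='both'))
    )
    return out

filled = fill_gaps_by_region(demo, 'REGION_NAME', 'AVG_TEMP_C', ['YEAR', 'MONTH'])
print(filled)
```

Interpolation suits smooth variables such as temperature; for long gaps or for precipitation, dropping the affected rows may be the safer choice.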


### Task 2.2: Join Climate Data with Spatial Data
Merge the climate data with the Tanzania shapefile based on a common identifier.

In [None]:
# Identify the common field between the climate data and the shapefile
# Hint: Look for a region/district identifier in both datasets

# Display the columns in the climate data
print(climate_data.columns)

# Display the columns in the shapefile
print(tz_projected.columns)

# Join the climate data with the shapefile
# Hint: Use the merge() or join() method
tz_climate = tz_projected.merge(climate_data, left_on='REGION_COD', right_on='REGION_CODE')

# Verify the join was successful by checking the shape and contents of the result
def verify_join(original_gdf, joined_gdf, climate_df):
    """Verify that the join between spatial and climate data was successful.
    
    Parameters:
    original_gdf (GeoDataFrame): The original spatial GeoDataFrame
    joined_gdf (GeoDataFrame): The joined GeoDataFrame
    climate_df (DataFrame): The climate DataFrame
    
    Returns:
    dict: A dictionary containing verification metrics
    """
    verification = {
        'original_features': len(original_gdf),  # Get the number of features in the original GeoDataFrame
        'joined_features': len(joined_gdf),  # Get the number of features in the joined GeoDataFrame
        'climate_records': len(climate_df),  # Get the number of records in the climate DataFrame
        'joined_columns': list(joined_gdf.columns),  # Get the column names in the joined GeoDataFrame
        'is_successful': len(joined_gdf) > 0  # Determine if the join was successful
    }
    return verification

join_verification = verify_join(tz_projected, tz_climate, climate_data)
print(join_verification)
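
A row count greater than zero does not show that every region found a match, and joining a table with several records per region repeats each region's geometry once per record. The sketch below uses small hypothetical tables to show one way to check this with plain pandas: aggregate to one row per region, then run an outer merge with `indicator=True` and `validate='one_to_one'`. The same keyword arguments should carry over to `GeoDataFrame.merge`, which wraps the pandas method.

```python
import pandas as pd

# Hypothetical stand-ins for the shapefile attributes and the climate table
regions = pd.DataFrame({'REGION_COD': [1, 2, 3],
                        'REGION_NAM': ['Region A', 'Region B', 'Region C']})
climate = pd.DataFrame({
    'REGION_CODE': [1, 1, 2, 2, 4],
    'AVG_TEMP_C': [20.0, 22.0, 24.0, 26.0, 28.0],
})

# Collapse to one row per region so the join cannot multiply region rows
climate_mean = climate.groupby('REGION_CODE', as_index=False)['AVG_TEMP_C'].mean()

# An outer join with indicator=True labels keys that are unmatched on either side
check = regions.merge(
    climate_mean, left_on='REGION_COD', right_on='REGION_CODE',
    how='outer', indicator=True, validate='one_to_one',
)
print(check[['REGION_COD', 'REGION_CODE', '_merge']])

unmatched_regions = check.loc[check['_merge'] == 'left_only', 'REGION_NAM'].tolist()
unmatched_climate = check.loc[check['_merge'] == 'right_only', 'REGION_CODE'].tolist()
print("Regions without climate data:", unmatched_regions)
print("Climate codes without a region:", unmatched_climate)
```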


## Part 3: Data Visualization

### Task 3.1: Create a Choropleth Map
Create a choropleth map showing average temperature across Tanzania regions.

In [None]:
# TODO: Create a choropleth map of average temperature by region
# Hint: Use the .plot() method with the column parameter
def create_choropleth(gdf, column, title, cmap='viridis', figsize=(12, 8)):
    """Create a choropleth map for a GeoDataFrame.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame to plot
    column (str): The column to use for coloring
    title (str): The title of the map
    cmap (str or Colormap): The colormap to use
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
    fig, ax = plt.subplots(figsize=figsize)
    gdf.plot(column=column, cmap=cmap, linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
    ax.set_title(title, fontdict={'fontsize': '15', 'fontweight' : '3'})
    ax.axis('off')
    return fig

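# The climate file holds several records per region (it is monthly data), so average
# them first; otherwise every region polygon is drawn once per record
mean_temp = climate_data.groupby('REGION_CODE', as_index=False)['AVG_TEMP_C'].mean()
tz_mean_temp = tz_projected.merge(mean_temp, left_on='REGION_COD', right_on='REGION_CODE')

# Create the choropleth map
temp_map = create_choropleth(tz_mean_temp, 'AVG_TEMP_C', 'Average Temperature by Region')
plt.show()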
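
A continuous colour ramp can hide differences when values cluster. Classified choropleths group values into a few classes instead; geopandas supports this through the `scheme` and `k` arguments of `.plot()` when mapclassify is installed. The sketch below shows what a quantile classification does, using NumPy only and hypothetical regional temperatures.

```python
import numpy as np

# Hypothetical regional mean temperatures in °C
temps = np.array([18.5, 20.1, 21.3, 22.0, 23.4, 24.8, 25.2, 26.0, 27.1, 28.3])

k = 4  # number of classes
# Interior break points at the 25th, 50th, and 75th percentiles
breaks = np.quantile(temps, np.linspace(0, 1, k + 1)[1:-1])
# Class index 0..k-1 for each value (upper break is inclusive)
classes = np.digitize(temps, breaks, right=True)

print("breaks:", breaks)
print("classes:", classes)
```

Quantile classes hold roughly equal numbers of regions, which makes rank differences visible but can exaggerate small absolute differences.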


### Task 3.2: Create a Time Series Visualization
Create a time series visualization showing temperature trends over time for selected regions.

In [None]:
# TODO: Select a few representative regions for the time series
# Hint: Choose regions from different parts of the country
selected_regions = ['Arusha', 'Dar es Salaam', 'Dodoma']

# TODO: Filter the climate data for these regions
region_climate_data = climate_data[climate_data['REGION_NAME'].isin(selected_regions)]

# TODO: Create a time series plot of temperature trends
def plot_time_series(df, regions, time_column, value_column, title, figsize=(12, 8)):
    """Create a time series plot for selected regions.
    
    Parameters:
    df (DataFrame): The DataFrame containing the time series data
    regions (list): The list of regions to include
    time_column (str): The column containing time information
    value_column (str): The column containing the values to plot
    title (str): The title of the plot
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
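    fig, ax = plt.subplots(figsize=figsize)
    for region in regions:
        region_data = df[df['REGION_NAME'] == region]
        # Average repeated time steps (e.g. monthly records within a year) so that
        # each region is drawn as a single line ordered in time
        series = region_data.groupby(time_column)[value_column].mean().sort_index()
        ax.plot(series.index, series.values, label=region)
    ax.set_title(title)
    ax.set_xlabel(time_column)
    ax.set_ylabel(value_column)
    ax.legend()
    return fig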

# Create the time series plot
temp_time_series = plot_time_series(region_climate_data, selected_regions, 'YEAR', 'AVG_TEMP_C', 'Temperature Trends Over Time')
plt.show()
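
Year-to-year noise can mask a slow trend in a time series plot. One common remedy, sketched here with hypothetical annual means for a single region, is a centred moving average; the 3-year window is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical annual mean temperatures for one region (°C)
annual = pd.Series(
    [23.1, 23.4, 22.9, 23.6, 23.8, 23.5, 24.0, 24.2],
    index=pd.Index(range(2010, 2018), name='YEAR'),
)

# Centred 3-year moving average; the first and last years have no full window
smoothed = annual.rolling(window=3, center=True).mean()
print(pd.DataFrame({'annual': annual, 'smoothed': smoothed}))
```

Plotting the smoothed series alongside the raw values keeps the original data visible while making the direction of change easier to read.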


### Task 3.3: Create an Interactive Map
Create an interactive map showing climate data using Folium.

In [None]:
# Convert the projected GeoDataFrame to WGS84 for use with Folium
tz_wgs84 = tz_projected.to_crs(epsg=4326)

# Attach one mean temperature per region; merging the raw climate records would
# duplicate every region polygon and marker once per record
if 'AVG_TEMP_C' not in tz_wgs84.columns:
    mean_temp = climate_data.groupby('REGION_CODE', as_index=False)['AVG_TEMP_C'].mean()
    tz_wgs84 = tz_wgs84.merge(mean_temp, left_on='REGION_COD', right_on='REGION_CODE', how='left')

# Print the columns of the GeoDataFrame to check for 'AVG_TEMP_C'
print(tz_wgs84.columns)

# Create an interactive map using Folium
def create_interactive_map(gdf, column, popup_columns, title, center=None, zoom_start=6):
    """Create an interactive map using Folium.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame to map (must be in WGS84)
    column (str): The column to use for coloring
    popup_columns (list): Columns to include in the popup
    title (str): The title of the map
    center (tuple): The center coordinates [lat, lon]
    zoom_start (int): The initial zoom level
    
    Returns:
    folium.Map: The created interactive map
    """
    # Create the map centered on the specified coordinates
    if center is None:
        center = [gdf.geometry.centroid.y.mean(), gdf.geometry.centroid.x.mean()]
    m = folium.Map(location=center, zoom_start=zoom_start, tiles='cartodbpositron')
    
    # Add the GeoDataFrame to the map
    folium.Choropleth(
        geo_data=gdf,
        name='choropleth',
        data=gdf,
        columns=['REGION_COD', column],
        key_on='feature.properties.REGION_COD',
        fill_color='YlOrRd',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=title
    ).add_to(m)
    
    # Add popups
    for _, row in gdf.iterrows():
        popup_text = "<br>".join([f"{col}: {row[col]}" for col in popup_columns])
        folium.Marker(
            location=[row.geometry.centroid.y, row.geometry.centroid.x],
            popup=popup_text
        ).add_to(m)
    
    folium.LayerControl().add_to(m)
    return m

# Create the interactive map
interactive_map = create_interactive_map(tz_wgs84, 'AVG_TEMP_C', ['REGION_NAM', 'AVG_TEMP_C'], 'Average Temperature by Region')

# Display the map
interactive_map


## Part 4: Climate Change EDA

### Task 4.1: Analyze Temperature Trends
Analyze the trends in temperature across Tanzania over time.

In [None]:
from sklearn.linear_model import LinearRegression

# Calculate temperature trends for each region
def calculate_temperature_trends(df, region_column, year_column, temp_column):
    """Calculate temperature trends for each region.
    
    Parameters:
    df (DataFrame): The DataFrame containing climate data
    region_column (str): The column containing region identifiers
    year_column (str): The column containing year information
    temp_column (str): The column containing temperature values
    
    Returns:
    DataFrame: A DataFrame containing trend information for each region
    """
    trends = []
    for region in df[region_column].unique():
        # Drop incomplete rows first: LinearRegression raises an error on NaN values
        region_data = df[df[region_column] == region].dropna(subset=[year_column, temp_column])
        X = region_data[[year_column]]
        y = region_data[temp_column]
        model = LinearRegression()
        model.fit(X, y)
        trend = {
            region_column: region,
            'slope': model.coef_[0],  # Temperature change per unit of year_column
            'intercept': model.intercept_,
            'r_squared': model.score(X, y)
        }
        trends.append(trend)
    return pd.DataFrame(trends)

# Calculate temperature trends
temp_trends = calculate_temperature_trends(climate_data, 'REGION_NAME', 'YEAR', 'AVG_TEMP_C')
print(temp_trends)

# Visualize the temperature trends
def plot_temperature_trends(trends_df, region_column, trend_column, title, figsize=(12, 8)):
    """Plot temperature trends by region.
    
    Parameters:
    trends_df (DataFrame): The DataFrame containing trend information
    region_column (str): The column containing region identifiers
    trend_column (str): The column containing trend values
    title (str): The title of the plot
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
    fig, ax = plt.subplots(figsize=figsize)
    ax.bar(trends_df[region_column], trends_df[trend_column])
    ax.set_title(title)
    ax.set_xlabel(region_column)
    ax.set_ylabel(trend_column)
    ax.tick_params(axis='x', labelrotation=90)  # Rotate region labels on this Axes only
    return fig

# Plot the temperature trends
trends_plot = plot_temperature_trends(temp_trends, 'REGION_NAME', 'slope', 'Temperature Trends by Region')
plt.show()
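
A regression slope alone does not say whether a trend is distinguishable from noise. A test often used for climate series is the Mann-Kendall trend test, which is a different technique from the linear regression above and makes no assumption of linearity. The sketch below is a minimal pure-Python version using the normal approximation, without corrections for ties or autocorrelation; the sample series is hypothetical.

```python
import math

def mann_kendall(values):
    """Mann-Kendall trend test (no correction for ties or autocorrelation).

    Returns (S, Z, two-sided p-value) using the normal approximation.
    """
    n = len(values)
    # S counts pairs that increase over time minus pairs that decrease
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity correction: shrink |S| by one before standardising
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return s, z, p_value

# Hypothetical annual mean temperatures (°C)
series = [23.1, 23.4, 22.9, 23.6, 23.8, 23.5, 24.0, 24.2, 24.1, 24.5]
s_stat, z_stat, p_val = mann_kendall(series)
print(f"S={s_stat}, Z={z_stat:.2f}, p={p_val:.4f}")
```

The normal approximation is usually considered adequate for series of about ten or more values; shorter series call for exact tables.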


### Task 4.2: Identify Climate Change Hotspots
Identify regions in Tanzania that are experiencing the most significant climate change.

In [None]:
# Define criteria for climate change hotspots
# Hint: Consider temperature trends, precipitation changes, extreme weather events, etc.

# Implement a function to identify hotspots based on your criteria
def identify_hotspots(climate_gdf, criteria_columns, threshold_values):
    """Identify climate change hotspots based on specified criteria.
    
    Parameters:
    climate_gdf (GeoDataFrame): The GeoDataFrame containing climate and spatial data
    criteria_columns (list): The columns to use as criteria
    threshold_values (dict): A dictionary mapping criteria columns to threshold values
    
    Returns:
    GeoDataFrame: A GeoDataFrame containing only the hotspot regions
    """
    # Apply the criteria to filter the GeoDataFrame
    mask = np.ones(len(climate_gdf), dtype=bool)
    for column in criteria_columns:
        threshold = threshold_values[column]
        mask &= climate_gdf[column] >= threshold
    
    # Return the filtered GeoDataFrame
    return climate_gdf[mask]

# Define the criteria columns and threshold values
criteria_columns = ['AVG_TEMP_C', 'PRECIPITATION_MM']
threshold_values = {'AVG_TEMP_C': 30, 'PRECIPITATION_MM': 1000}

# Identify the hotspots
hotspots = identify_hotspots(tz_climate, criteria_columns, threshold_values)
print(hotspots)

# Visualize the identified hotspots
def create_hotspot_map(gdf, column, title, cmap='Reds', figsize=(12, 8)):
    """Create a choropleth map for hotspots.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame to plot
    column (str): The column to use for coloring
    title (str): The title of the map
    cmap (str or Colormap): The colormap to use
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
    fig, ax = plt.subplots(figsize=figsize)
    gdf.plot(column=column, cmap=cmap, linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
    ax.set_title(title, fontdict={'fontsize': '15', 'fontweight' : '3'})
    ax.axis('off')
    return fig

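# Create the hotspot map (an empty selection would only produce a blank map, so check first)
if hotspots.empty:
    print("No regions meet all hotspot thresholds.")
else:
    hotspot_map = create_hotspot_map(hotspots, 'AVG_TEMP_C', 'Climate Change Hotspots')
    plt.show()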
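
Thresholds on temperature and precipitation levels describe current conditions rather than change. An alternative sketch, assuming a per-region table of warming slopes like the one produced in Task 4.1, flags regions whose slope is unusually high relative to the others. The table, the helper name `flag_trend_hotspots`, and the one-standard-deviation cutoff are all hypothetical choices.

```python
import pandas as pd

# Hypothetical per-region slopes (°C per year)
trends = pd.DataFrame({
    'REGION_NAME': ['Region A', 'Region B', 'Region C', 'Region D', 'Region E'],
    'slope': [0.010, 0.012, 0.011, 0.035, 0.009],
})

def flag_trend_hotspots(trends_df, slope_column='slope', z_threshold=1.0):
    """Flag regions whose slope lies more than z_threshold standard deviations above the mean."""
    slopes = trends_df[slope_column]
    z_scores = (slopes - slopes.mean()) / slopes.std(ddof=0)
    return trends_df.assign(z_score=z_scores, is_hotspot=z_scores > z_threshold)

flagged = flag_trend_hotspots(trends)
print(flagged)
```

A relative cutoff always flags the fastest-warming regions in the sample, so it is worth reporting the absolute slopes next to the flags.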


### Task 4.3: Regional Climate Variation Analysis
Analyze how climate variables vary across different regions of Tanzania.

In [None]:
# TODO: Calculate regional statistics for climate variables
def calculate_regional_stats(gdf, region_column, climate_columns):
    """Calculate statistics for climate variables by region.
    
    Parameters:
    gdf (GeoDataFrame): The GeoDataFrame containing climate and spatial data
    region_column (str): The column containing region identifiers
    climate_columns (list): The columns containing climate variables
    
    Returns:
    DataFrame: A DataFrame containing statistics for each region and climate variable
    """
    stats = gdf.groupby(region_column)[climate_columns].agg(['mean', 'std', 'min', 'max'])
    stats.columns = ['_'.join(col).strip() for col in stats.columns.values]
    return stats.reset_index()

# Calculate regional statistics
regional_stats = calculate_regional_stats(tz_climate, 'REGION_NAM', ['AVG_TEMP_C', 'PRECIPITATION_MM'])
print(regional_stats)

# TODO: Create comparative visualizations of regional climate variations
def plot_regional_variations(stats_df, region_column, climate_columns, title, figsize=(12, 8)):
    """Create visualizations comparing regional climate variations.
    
    Parameters:
    stats_df (DataFrame): The DataFrame containing regional statistics
    region_column (str): The column containing region identifiers
    climate_columns (list): The columns containing climate variables
    title (str): The title of the plot
    figsize (tuple): The figure size
    
    Returns:
    matplotlib.figure.Figure: The created figure
    """
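    # squeeze=False keeps axes as an array even when only one climate variable is given
    fig, axes = plt.subplots(len(climate_columns), 1, figsize=figsize, squeeze=False)
    axes = axes.ravel()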
    for i, column in enumerate(climate_columns):
        stats_df.plot(x=region_column, y=f'{column}_mean', kind='bar', ax=axes[i], yerr=stats_df[f'{column}_std'], capsize=4)
        axes[i].set_title(f'{column} by Region')
        axes[i].set_ylabel(column)
    fig.suptitle(title)
    plt.tight_layout()
    return fig

# Create the comparative visualizations
variations_plot = plot_regional_variations(regional_stats, 'REGION_NAM', ['AVG_TEMP_C', 'PRECIPITATION_MM'], 'Regional Climate Variations')
plt.show()
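
Standard deviations are hard to compare between wet and dry regions because variability tends to scale with the mean. One possible complement, sketched here on a hypothetical statistics table that uses the `<column>_mean` and `<column>_std` naming from above, is the coefficient of variation (standard deviation divided by mean).

```python
import pandas as pd

# Hypothetical regional precipitation statistics
stats = pd.DataFrame({
    'REGION_NAM': ['Region A', 'Region B'],
    'PRECIPITATION_MM_mean': [80.0, 40.0],
    'PRECIPITATION_MM_std': [20.0, 20.0],
})

# Same absolute spread, but Region B is twice as variable relative to its mean
stats['PRECIPITATION_MM_cv'] = stats['PRECIPITATION_MM_std'] / stats['PRECIPITATION_MM_mean']
print(stats)
```

The ratio is meaningful for precipitation, which has a true zero, but not for temperature in °C, where the zero point is arbitrary.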


## Summary of Findings

1. **Key Observations about Temperature Trends:**
   - The temperature trends analysis revealed that most regions in Tanzania are experiencing an increase in average temperatures over time.
   - Regions like Katavi, Lindi, and Morogoro showed the highest positive temperature trends, indicating significant warming.

2. **Identified Climate Change Hotspots:**
   - Hotspots were identified by thresholding average temperature (at least 30 °C) and precipitation (at least 1000 mm), so they reflect current conditions rather than rates of change.
   - Regions like Katavi and Zanzibar West were identified as hotspots due to their high average temperatures and significant precipitation levels.

3. **Notable Regional Variations:**
   - There are notable variations in climate variables across different regions of Tanzania.
   - For example, regions like Arusha and Dodoma have lower average temperatures compared to coastal regions like Dar es Salaam and Zanzibar.

4. **Potential Implications for Tanzania:**
   - The increasing temperatures and identified hotspots could have significant implications for agriculture, water resources, and overall livelihoods in Tanzania.
   - Regions experiencing significant warming may face challenges related to crop yields, water scarcity, and increased frequency of extreme weather events.

5. **Recommendations for Further Analysis:**
   - Further analysis should focus on understanding the impact of climate change on specific sectors such as agriculture and water resources.
   - It is also recommended to explore adaptation strategies and mitigation measures to address the identified climate change impacts.