# **Implement** a Du Bois Choropleth Map with **Python**:

## Map of Black Population

<div>
<img src="https://github.com/ajstarks/dubois-data-portraits/blob/master/plate02/original-plate-02.jpg?raw=true" width="700" />
</div>

<b>Plate 02</b>

### This interactive exercise is inspired by the annual #DuBoisChallenge

The #DuBoisChallenge is a call to scientists, students, and community members to recreate, adapt, and share on social media the data visualizations created by W.E.B. Du Bois and his collaborators in 1900. Before doing the interactive exercise, please read this article about the [Du Bois Challenge](https://nightingaledvs.com/the-dubois-challenge/). You can find the latest Du Bois visualizations by searching for the [#DuBoisChallenge2025](https://github.com/ajstarks/dubois-data-portraits/tree/master/challenge/2025) hashtag on social media (Twitter, Bluesky, Instagram, etc.). You can also use the hashtag to share your own recreations.

### In this interactive exercise, you will:
1. Learn how to create a variation of a **choropleth map**.
2. Learn and modify code in the programming language **Python**.
3. Learn how to write spatial and statistical code to:
    * create visualizations that consistently and accurately represent your data
    * create a transparent record of exactly how you visualized something
    * make it easy for you or others to recreate or modify your visualization
4. Your instructor may also ask you to answer questions and submit screenshots as you go in a parallel Catcourses (or other Canvas system) course.

### You will learn how to use the *Python* programming language by creating a choropleth map:
1. You will recreate Du Bois' visualization of Black population across US states. Du Bois created the visualization in 1900.
2. You will reproduce Du Bois' visualization using data on Black population in the US today. Du Bois used cartography to show the geographical patterns of enslavement and emancipation for Black Americans.
3. An important context of Du Bois's map is the history of the African slave trade. The first visual in the 1900 exhibition is a map of the world with lines showing the transatlantic slave trade. In the U.S. maps, the Black population clusters in the southern states and the state of Georgia. Here is Du Bois' Plate 1, which prefaces Plate 2:

<div>
<img src="https://github.com/ajstarks/dubois-data-portraits/blob/master/plate01/original-plate-01.jpg?raw=true" width="700"/>
</div>
<b>Plate 01</b>

### 1. How to use this interactive **Jupyter Notebook**

If you know how to use Python on your own computer with another code runner you can copy, paste, and edit code there. If you have Jupyter Lab, you can also download this Notebook to use it on your own computer. You can download the Notebook or view a non-interactive version of this Notebook by clicking here.

Grey cells in the *Notebook* like the one below are code cells where you will write and edit **Python** Code. To try it out:

1. Click your cursor on the grey cell below. After you click on it, it will change to white to indicate you are editing it.
2. After you click on the cell below and it turns white, type ```2+2``` to use Python as a calculator.
3. After typing ```2+2```, click the <span class="play-button">&#9654;</span> play button at the top of this page.

### 2. Keeping track of your work and using **Python** outside of this notebook.

If you leave this Notebook idle or close/re-open it, your work will not be saved in the Notebook. But you can export an HTML file showing your work at any time. You can then open and browse the HTML file in any web browser.

And after completing the Notebook exercises, you can export a final HTML file to submit for any course assignments using this Notebook.

To export the *Notebook*, click the **File** dropdown above, select **Export File As** and then select **HTML** as shown here:

<div>
<img src="https://github.com/HigherEdData/Du-Bois-STEM/blob/main/readings-images/htmlexport.jpg?raw=true" width="500" />
</div>

### 3. Getting hints and answers.

Sometimes, the code cells will already have code in them that you will be asked to edit or run by clicking the play button. In the process, you can click on <span class="play-button">&#9654;</span> dropdown buttons like the ones below to get hints and answers. For example:

1. Click on the cell below with ```3+3=``` and click the play triangle above. You should get an error message highlighted in pink saying "SyntaxError: cannot assign to expression". To complete this activity, you'll want to get each code cell to run without a pink error message.
2. Based on the ```2+2``` code you tried above, try to edit the 3+3= code to get it to report the sum of 3+3 without the error message. For a hint, click the first <span class="play-button">&#9654;</span> button below.

<details> <summary>Click this triangle for a hint.</summary>

Try deleting the ```=``` sign in the cell and click play again. If that doesn't work, click the next triangle below for the answer.

</details>

<details> <summary>Click this triangle for the answer.</summary>

**Answer:** Delete all the text in the cell below and write or paste this answer in the cell: ```3+3``` before clicking play again.

</details>

In [2]:
print(3+3)

6

### 4. Reading and writing comments that explain your code

In code cells, we can write **comment** text that explains our code. We put a ```#```  before **comment** text to tell Python that the text is not code it should execute. Any text after a ```#``` on a given line will be treated as a **comment**. In Jupyter, comment text after a ```#``` will be displayed in a dark turquoise color. To see how this works, try the following below:

1. Try to run the code below. You should get an error message because the comment text "This is code that adds 2+2" is not Python code and doesn't have a #  sign in front of it.
2. Add a #  sign before "This is code that adds 2+2". This should change the color of the text to turquoise like the text after the 2+2 where there is already a #  sign.
3. Click the <span class="play-button">&#9654;</span> at the top of the notebook and the code should run and output **4** below.

<details> <summary><strong>Hints:</strong></summary>
    
**Answer:** ```print(2+2)```
    
</details>

In [None]:
This is code that adds 2+2

print(2+2) # the result of 2 +2 should be 4

### 5. Importing libraries
In **Python**, users can import libraries, which provide a range of tools we can call on in our code. These are the most common libraries for statistics, mapping, and visualization:
* **pandas** is a library for data management and statistical analysis.
* **geopandas** is a library with geospatial abilities.
* **matplotlib** is a library for data visualization; we will be using its pyplot, colors, and patches modules.

In today's workbook, we will rely on these libraries.

Once you have the appropriate **Python** libraries, you call on these libraries at the beginning of your **Python** code script. There are three ways to call libraries:

1. Call on the library the basic way: import library_name
2. Call on the library and add aliases to call it easily: import library_name as lib
3. Call on a specific module or function from a library: import library.module as mod, or from library import function
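
The three import styles can be sketched with modules from Python's standard library, so the cell runs even before pandas or geopandas are installed. The modules `math` and `statistics` are stand-ins chosen only for illustration; they are not used elsewhere in this notebook. The third style is shown with `from ... import ...`, the form that works for an individual function.

```python
# 1. Basic import: refer to tools with the full library name
import math

# 2. Import with an alias: a shorter name for the same library
import statistics as stats

# 3. Import one specific function from a library
from math import sqrt

values = [4, 9, 16]

root_basic = math.sqrt(values[0])   # full name: math.sqrt
mean_alias = stats.mean(values)     # alias: stats.mean
root_direct = sqrt(values[2])       # imported function used directly

print(root_basic, mean_alias, root_direct)
```

Aliases such as `pd`, `gpd`, and `plt` are conventions rather than requirements, but using them makes your code easier for others to read.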

In [2]:
#importing necessary libraries
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.patches as mpatches

### 6. Tabular Data and Reading Du Bois' data into a **Pandas** Data Frame

In order to make data visuals, we first need some sort of tabular data. Tabular data is typically stored in an Excel file (.xls) or a comma-separated values file (.csv). Anthony Starks has published tabular data to recreate the Du Bois data portraits on [GitHub](https://github.com/ajstarks/dubois-data-portraits). This exercise recreates *Plate 02*, so we will read in that data using **Pandas**.

**Pandas** is the most common library for data management and analysis in **Python**. The next step is to read the data into a **Pandas** dataframe, which we will name ```data```. **Pandas** allows us to read a comma-separated values (csv) file into a dataframe. File paths should be written inside quotation marks:

```df = pd.read_csv("web_address_with_data/data_file_name.csv")```

* <b>```df.head()```</b>

We use the **Pandas** method ```df.head()``` to check that the data has been read correctly. ```df.head()``` shows the first five rows of the data:

```df.head()```
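
As a minimal sketch of both steps, the cell below reads a small CSV held in memory instead of a web address, then previews it. The column names mirror this exercise, but the state names and rows are hypothetical stand-ins, not Du Bois' data.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on the web
csv_text = """STATENAM,Population
Alpha,"UNDER - 10,000"
Beta,"10,000 - 25,000"
Gamma,"25,000 - 50,000"
Delta,"50,000 - 100,000"
Epsilon,"100,000 - 200,000"
Zeta,"200,000 - 300,000"
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
df_example = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
first_rows = df_example.head()
print(first_rows)
print(df_example.shape)
```

The quotation marks around each category keep the comma inside a label such as "10,000 - 25,000" from being read as a column separator.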

<details> <summary><strong>Hints:</strong></summary>

When reading in a dataframe with Pandas, you should specify Pandas with its alias:
    
**Answer:** ```data = pd.read_csv(url)```
    
</details>

In [None]:
url="https://github.com/HigherEdData/Du-Bois-STEM/raw/refs/heads/main/data/d_popmap.csv"
data = __.read_csv(url)
data.head()

### 7. Spatial Data and Using **GeoPandas** to read, plot, and edit a shapefile

Tabular data is presented in a table and spatial data is tabular data presented in geographic units. A common way to use spatial data is to use a *shapefile* which is a data file with geographical components. 

**Geopandas** is a library with geospatial data tools. __[NHGIS](https://www.nhgis.org/)__ provides free historical shapefiles for non-commercial use. Note that *Plate 02* is a 1900 map of the United States.

The Du Bois **Plate 02** has two interesting features. First, the map shows the contiguous U.S., meaning it does not include Alaska and Hawaii. Second, the area known today as Oklahoma is divided in two: Oklahoma and Indian Territory.

* <b>Reading geospatial data</b>

Using **GeoPandas**, you read shapefiles into geospatial dataframes using ```gpd.read_file()```. File paths should be written inside quotation marks. Again, we can use ```gdf.head()``` to check that the data has been read. Notice that the geometry column holds the geographical reference.

```gdf = gpd.read_file(fp)```

* <b>Plotting geospatial data</b>

Once a geospatial dataframe is loaded, use ```gdf.plot()``` to visually examine the shapefile.

```gdf.plot()```

<details> <summary><strong>Hints:</strong></summary>

When reading the shapefile, you call on the GeoPandas library using the alias:
    
**Answer:** ```gdf = gpd.read_file(fp)```
    
</details>

In [None]:
fp_map="https://github.com/HigherEdData/Du-Bois-STEM/raw/refs/heads/main/data/us_state_1900/us_state_1900_reduced.shp"
gdf = ___.read_file(fp_map)
gdf.head()

In [None]:
gdf.plot()

### 8. Merging table data with the shapefile
Often, we need to link a dataframe to a spatial dataframe. Using **GeoPandas**, we can easily merge table data with the shapefile using the merge command. When merging two dataframes, it is important to identify the key variable that will link the two datasets. In this example, the key variable is STATENAM.

```gdf_new = spatial_df.merge(df, on='key_variable')```
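
Merging on a GeoDataFrame follows the same interface as pandas `DataFrame.merge`, so the idea can be sketched with two plain pandas dataframes. The table contents below are hypothetical. The `how='left'` and `indicator=True` options are not used in this notebook; they are shown as one possible way to check for keys that fail to match.

```python
import pandas as pd

# Hypothetical stand-in for the shapefile's attribute table
shapes = pd.DataFrame({
    "STATENAM": ["Alpha", "Beta", "Gamma"],
    "area_id": [1, 2, 3],
})

# Hypothetical table data; "Gamma" is missing on purpose
table = pd.DataFrame({
    "STATENAM": ["Alpha", "Beta"],
    "Population": ["UNDER - 10,000", "10,000 - 25,000"],
})

# Default merge is an inner join: only keys found in both tables survive
inner = shapes.merge(table, on="STATENAM")

# A left join keeps every shape; indicator=True adds a "_merge" column
# that flags rows with no match in the table data
checked = shapes.merge(table, on="STATENAM", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "STATENAM"].tolist()

print(len(inner), unmatched)
```

A check like this is useful because a state whose name is spelled differently in the two files would silently drop off the map after an inner merge.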

<details> <summary><strong>Hints:</strong></summary>

Look at the example code.
    
**Answer:**   
```gdf_new = gdf.merge(data, on='STATENAM')```
    
</details>

In [None]:
gdf_new = gdf._____(data, on='STATENAM')
gdf_new.head()

### 9. **Understanding Choropleth Maps** and Using **GeoPandas** to plot a **choropleth map**
**Choropleth maps** are a popular way to visualize spatial data. They show data across geographic units by coloring each unit according to its value, for example the count of a species across zip codes or the average household income by census tract. A common choropleth map shades higher numbers in darker colors and lower numbers in lighter colors.

The word choropleth comes from the Greek choro, meaning "region." These maps are designed to show data at the regional level, so they’re best used with data tied to specific areal units—think polygons on a map, like the shape of Merced County or the state of California.

Because choropleth maps aggregate data by geographic unit, we often need to group our data into a smaller number of categories, or classes. This step is called *classification*. Each class gets its own color, and all regions that fall within the same class are shaded the same. This helps simplify complex data so that it’s easier to interpret visually.

While modern mapping software allows us to create unclassed maps—where every unique value gets its own symbol or color—there are still good reasons to use classed choropleths. Grouping data reduces visual clutter and helps map readers focus on patterns rather than being overwhelmed by too much detail. It’s a balance between simplification and meaning.

According to Dr. Erin Hestir, there are three key steps to creating effective choropleth maps:

1. **How many classes should I use?** Too many, and the map gets confusing; too few, and important detail is lost.

2. **What classification method should I apply?** Options include equal interval, quantiles, natural breaks, and more—each tells a slightly different story.

3. **How should I symbolize the data?** Thoughtful color choice matters. The color scheme needs to reflect the underlying pattern without misleading the viewer.
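
To make the second question concrete, here is a minimal sketch comparing two classification methods, equal interval and quantiles, on the same numbers. The values are hypothetical, the helper `count_per_class` is written for this illustration only, and Python's built-in `statistics.quantiles` stands in for mapping software, whose exact break points may differ.

```python
import statistics

# Hypothetical percentages for ten regions, skewed toward low values
values = [1, 2, 2, 3, 4, 5, 8, 12, 25, 40]
k = 4  # number of classes

# Equal interval: split the range (max - min) into k equally wide classes
low, high = min(values), max(values)
width = (high - low) / k
equal_breaks = [low + width * i for i in range(1, k)]

# Quantiles: choose breaks so each class holds about the same number of regions
quantile_breaks = statistics.quantiles(values, n=k)

def count_per_class(data, breaks):
    """Count how many values fall into each class defined by the breaks."""
    counts = [0] * (len(breaks) + 1)
    for v in data:
        class_index = sum(v > b for b in breaks)
        counts[class_index] += 1
    return counts

equal_counts = count_per_class(values, equal_breaks)
quantile_counts = count_per_class(values, quantile_breaks)
print(equal_breaks, equal_counts)
print(quantile_breaks, quantile_counts)
```

With skewed data, equal interval places most regions in the lowest class, while quantiles spread regions evenly across classes. The same data can therefore produce two very different-looking maps.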

In sum, choropleth maps help us communicate non-spatial information—like health outcomes or economic indicators—through the spatial lens of a map. They sit at the intersection of data science, geography, and design, offering a powerful way to explore and share insights about our world.

Throughout this workbook, we will review these steps. We hope by the end of the session, you will discover why choropleth maps are an effective data visualization technique. 

It is easy to create a **choropleth map** in **GeoPandas** simply by choosing a variable.

```gdf.plot('variable')```

* <b>Modify the line color and width for sharper maps. lw means line width.</b>

```gdf.plot('variable', edgecolor='color', lw=#)```

Plot a choropleth map using the ```Population``` variable.

<details> <summary><strong>**Answer:** </strong></summary>

```gdf_new.plot('Population', edgecolor='black', lw=.1)```
    
</details>

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 10)) #this makes the blank canvas to plot our map
gdf_new.plot('______', ax=ax, edgecolor='black', lw=.1, legend = True)
ax.set_axis_off() # Remove the axes

### 10. Using dictionaries and functions in **Python**

The **choropleth map** made in Step 9 does not match the colors from the original **Plate 02**. In order to specify the color, we need to do some additional coding. 

* <b>Dictionaries in **Python**</b>

In **Python**, a dictionary is a collection of pairs, each in the form ```key:value```.

```dict1 = {'a':1, 'b':2}```
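
A short sketch of common dictionary operations follows. The dictionary name and its contents are illustrative only.

```python
# A small dictionary mapping color names to HEX codes (values are illustrative)
example_colors = {"black": "#000000", "white": "#ffffff"}

# Look up a value by its key
black_hex = example_colors["black"]

# .get() returns a fallback instead of raising KeyError for a missing key
missing_hex = example_colors.get("teal", "#ffffff")

# Add a new key:value pair
example_colors["gold"] = "#ffd700"

# Loop over keys and values together
for name, hex_code in example_colors.items():
    print(name, hex_code)
```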

* <b> **Functions** in **Python** </b>
    
In Python, you can write functions to perform a task repeatedly. For example, a function can link our data with the color and label dictionaries.
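
As a minimal sketch, the function below turns a number into a text label. The function name and the single threshold are hypothetical and chosen only to show the `def` and `return` pattern.

```python
def label_size(count):
    """Return a text label for a population count (threshold is illustrative)."""
    if count < 10_000:
        return "UNDER - 10,000"
    return "10,000 AND OVER"

# Call the same function on several inputs instead of repeating the logic
labels = [label_size(n) for n in [2_500, 10_000, 80_000]]
print(labels)
```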

* <b>Using dictionaries and functions for color and legend</b>

We can use dictionaries and functions to connect variable categories to labels and color schemes. Dictionaries are helpful here for translating color names into HEX numbers. **__[The Du Boisian Visualization Toolkit](https://www.dignityanddebt.org/projects/du-boisian-resources/)__** shares the color scheme used in the Du Bois data visualizations.

During the **#DuBoisChallenge** in 2024, **__[Edriessen published code on GitHub](https://github.com/edriessen/dubois24-python-matplotlib/tree/main)__** that uses the toolkit to write dictionaries and functions in Python. We use Edriessen's approach.

In [23]:
#This is a dictionary of color labels to HEX numbers.
dubois_colors = {
    'black': '#000000',
    'brown': '#654321',
    'tan': '#d2b48c',
    'gold': '#ffd700',
    'pink': '#ffc0cb',
    'crimson': '#dc143c',
    'green': '#00aa00',
    'blue': '#4682b4',
    'purple': '#7e6583',
    'bg': '#FAF0E6',
    'white': '#ffffff',
    'lightgrey':  '#d3d3d3',
}

#This dictionary links Population categories with colors
color_map = {
    'UNDER - 10,000': dubois_colors['lightgrey'],
    '10,000 - 25,000': dubois_colors['gold'],
    '25,000 - 50,000': dubois_colors['pink'],
    '50,000 - 100,000': dubois_colors['crimson'],
    '100,000 - 200,000': dubois_colors['tan'],
    '200,000 - 300,000': dubois_colors['blue'],
    '300,000 - 500,000': dubois_colors['brown'],
    '500,000 - 600,000': dubois_colors['white'],
    '600,000 - 750,000': dubois_colors['white'],
    '750,000 AND OVER': dubois_colors['black'],  
}

#This is a dictionary of category labels
range_map = {
    'UNDER - 10,000': 'UNDER - 10,000',
    '10,000 - 25,000': '10,000 - 25,000',
    '25,000 - 50,000': '25,000 - 50,000',
    '50,000 - 100,000': '50,000 - 100,000',
    '100,000 - 200,000': '100,000 - 200,000',
    '200,000 - 300,000': '200,000 - 300,000',
    '300,000 - 500,000': '300,000 - 500,000',
    '500,000 - 600,000': '500,000 - 600,000',
    '600,000 - 750,000': '600,000 - 750,000',
    '750,000 AND OVER': '750,000 NEGROS AND OVER'
}

def map_colors(value):
    if value not in color_map:
        return 'white'
    
    return color_map[value]

gdf_new['colors pop'] = gdf_new.apply(lambda row: map_colors(row['Population']), axis=1)

# preview colour data
gdf_new.head()

### 10.5 Update the **choropleth map**

Now that everything has been linked, we can plot a choropleth map using the Du Bois colors.

<details> <summary><strong>**Answer:** </strong></summary>

```gdf_new.plot(color=gdf_new['colors pop'], edgecolor='black', lw=.1)```
    
</details>

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gdf_new.____(color=gdf_new['colors pop'], ax=ax, edgecolor='black', lw=.1, legend = True)
ax.set_axis_off() # Remove the axes

### 11. Using **matplotlib** to make subplots

Subplots allow us to create a figure with multiple plots in a grid, and we can modify each of them. The most basic is a single plot:

```fig, ax = plt.subplots(figsize=(width, height), facecolor='color')```

<b>If we want to make a grid:</b>

```fig, axes = plt.subplots(row_num, col_num, figsize=(width, height), facecolor='color')```
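
The two calls can be sketched as follows. The sizes, the 2 x 3 grid, and the titles are arbitrary choices for illustration. Note that matplotlib's `figsize` is given as (width, height) in inches.

```python
import matplotlib.pyplot as plt

# One plot: subplots() returns a figure and a single Axes object
fig_one, ax_one = plt.subplots(figsize=(6, 4), facecolor="white")

# A 2 x 3 grid: subplots() returns a figure and a 2D array of Axes
fig_grid, axes_grid = plt.subplots(2, 3, figsize=(9, 6), facecolor="white")

# figsize is (width, height) in inches
width, height = fig_grid.get_size_inches()

# .flat walks through the grid one Axes at a time, row by row
for number, ax in enumerate(axes_grid.flat):
    ax.set_title(f"Plot {number}")

print(axes_grid.shape, width, height)
```

Each entry of the array is its own plot, so `axes_grid[0, 0]` is the top-left panel and `axes_grid[1, 2]` is the bottom-right panel.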

In order to recreate **Plate 02**, we want to plot the map in the top plot and the legend in the lower plot.

Correct the code below to make the figure 11 inches wide and 14 inches high. Also, use bg (background) for the facecolor.

<details> <summary><strong>**Answer:** </strong></summary>

```    facecolor=dubois_colors['bg']```
    
</details>

In [None]:
fig, (axs) = plt.subplots(
    2,
    1,
    figsize=(x,y),     #1: type (11,14) to indicate the size    
    gridspec_kw={
        'wspace': -0.15,
        'hspace': -.15
    },
    facecolor=dubois_colors['___'] #2: Add 'bg' as background color
)

# axs.flat lets us reach the two subplots by position: axes[0] and axes[1]
axes = axs.flat

### Step 11.1 Building visual

<b>Next, we will add the map from step 10.5 to the top subplot. We will also add the title.</b>

<details> <summary><strong>**Answer:** </strong></summary>

```fig.suptitle(r"RELATIVE NEGRO POPULATION OF THE STATES OF THE" "\n" "UNITED STATES", y=.98 , fontsize=23)```
    
</details>

In [None]:
fig, (axs) = plt.subplots(
    2,
    1,
    figsize=(11,14),         
    gridspec_kw={
        'wspace': -0.15,
        'hspace': -.15
    },
    facecolor=dubois_colors['bg']
)

axes = axs.flat

#set axes bounds
axs[1].set_xlim(0, 1)
axs[1].set_ylim(-1, 5)
axs[0].set_ylim(-1500000, 2500000)

gdf_new.plot(ax=axes[0], color=gdf_new['colors pop'], edgecolor='black', lw=.2)

# set title
fig.suptitle(r"RELATIVE NEGRO POPULATION OF THE STATES OF THE" "\n" "______", y=.98 , fontsize=23) #1: finish the title name based on Du Bois plate 02
axs[0].text(-900000,2800000, r'Recreated by _______________', fontsize=10) #2: Add your name

### Step 11.2 Using *plot()* to make pattern map

We can filter the dataframe with a condition and then use the *plot()* command to make a pattern/texture map. The original *Plate 02* has two pattern categories: "600,000 - 750,000" uses a grid and "500,000 - 600,000" uses slashes. In *matplotlib*, you can use "+" for a grid and chr(92) for "\".

````gdf[gdf["variable"] == "SINGLE_CATEGORY"].plot(facecolor="color", hatch="pattern")````
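
The backslash is Python's escape character, which is why the hatch is written with `chr(92)` instead of typing the character directly. A small sketch of what `chr(92)` produces and how repetition builds the hatch string (variable names are illustrative):

```python
# chr(92) returns the character with code point 92, which is the backslash
backslash = chr(92)

# In a normal string literal the backslash must be escaped by doubling it
same_as_escaped = (backslash == "\\")

# Repeating a hatch character makes the pattern denser in matplotlib
dense_slash = 7 * chr(92)
dense_grid = 5 * "+"

print(backslash, same_as_escaped, len(dense_slash), dense_grid)
```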

<details> <summary><strong>**Answer:** </strong></summary>

````gdf_new[gdf_new["Population"] == "500,000 - 600,000"].plot(ax=axes[0], facecolor="white", hatch= 7*chr(92))````

````gdf_new[gdf_new["Population"] == "600,000 - 750,000"].plot(ax=axes[0], facecolor="white", hatch="+++++")````
    
</details>

In [None]:
fig, (axs) = plt.subplots(
    2,
    1,
    figsize=(11,14),         
    gridspec_kw={
        'wspace': -0.15,
        'hspace': -.15
    },
    facecolor=dubois_colors['bg']
)

axes = axs.flat

axs[1].set_xlim(0, 1)
axs[1].set_ylim(-1, 5)
axs[0].set_ylim(-1500000, 2500000)

gdf_new.plot(ax=axes[0], color=gdf_new['colors pop'], edgecolor='black', lw=.2)
gdf_new[gdf_new["Population"] == "CATEGORY"].plot(ax=axes[0], facecolor="white", hatch= 7*chr(92)) #2: Add category name for the slash
gdf_new[gdf_new["Population"] == "CATEGORY"].plot(ax=axes[0], facecolor="white", hatch="+++++") #3: Add category name for the grid

fig.suptitle(r"RELATIVE NEGRO POPULATION OF THE STATES OF THE" "\n" "UNITED STATES", y=.98, fontsize=23)
axs[0].text(-900000,2800000, r'Recreated by __________', fontsize=10) #1: Add your name

### Step 11.3 Using loop to make the legend

Next, we make a legend. We use a loop to set up the legend in the lower subplot.

* <b> Loops </b>

Loops are a useful tool in coding for repeating an action until a condition is satisfied.

<i> For loops </i>
```for``` loops run a set of actions for each item in an array or list.

<i> If statements </i>
```if``` statements run a set of actions only IF a condition is satisfied. You can set multiple conditions using ```elif``` and ```else```.
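
The pattern used for the legend, a `for` loop with `if`/`elif`/`else` inside it, can be sketched on a plain list. The category letters and the column rules here are hypothetical and only mirror the shape of the legend code.

```python
# Hypothetical category labels, in legend order
categories = ["A", "B", "C", "D", "E", "F"]

right_column, left_column, hatched = [], [], []

# enumerate() gives both the position (index) and the item (name)
for index, name in enumerate(categories):
    if index < 3:
        # the first three entries go in the right-hand column
        right_column.append(name)
    elif index == 5:
        # one special case: left-hand column, drawn with a pattern
        left_column.append(name)
        hatched.append(name)
    else:
        # everything else goes in the left-hand column
        left_column.append(name)

print(right_column, left_column, hatched)
```

Python checks the conditions from top to bottom and runs only the first branch that matches, so the order of the `if` and `elif` lines matters.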

<details> <summary><strong>**Answer:** </strong></summary>

````ax.axis('off')````
    
</details>

In [None]:
fig, (axs) = plt.subplots(
    2,
    1,
    figsize=(11,14),         
    gridspec_kw={
        'wspace': -0.15,
        'hspace': -.15
    },
    facecolor=dubois_colors['bg']
)

axes = axs.flat

axs[1].set_xlim(0, 1)
axs[1].set_ylim(-1, 5)
axs[0].set_ylim(-1500000, 2500000)

gdf_new.plot(ax=axes[0], color=gdf_new['colors pop'], edgecolor='black', lw=.2)
gdf_new[gdf_new["Population"] == "500,000 - 600,000"].plot(ax=axes[0], facecolor="white", hatch= 7*chr(92))
gdf_new[gdf_new["Population"] == "600,000 - 750,000"].plot(ax=axes[0], facecolor="white", hatch="+++++")

axs[1].set_xlim(0, 1)
axs[1].set_ylim(-1, 5)
axs[0].set_ylim(-1500000, 2500000)

for index, color in enumerate(color_map):
    if index < 5:
        axes[1].add_patch(plt.Rectangle((0.6, index-.5), .05,.5, facecolor=color_map[color], edgecolor='black'))
        axes[1].annotate(range_map[color], (.7,index-.35))
    elif index == 7:
        axes[1].add_patch(plt.Rectangle((0.2, index-5.5), .05,.5, hatch=7*chr(92), facecolor=color_map[color], edgecolor='black'))
        axes[1].annotate(range_map[color], (.3,index-5.35))
    elif index == 8:
        axes[1].add_patch(plt.Rectangle((0.2, index-5.5), .05,.5, hatch='+++++', facecolor=color_map[color], edgecolor='black'))
        axes[1].annotate(range_map[color], (.3,index-5.35))
    else:
        axes[1].add_patch(plt.Rectangle((0.2, index-5.5), .05,.5, facecolor=color_map[color], edgecolor='black'))
        axes[1].annotate(range_map[color], (.3,index-5.35))

for ax in axes:
    ax.axis('___') #2: type 'off' to remove axes
    
fig.suptitle(r"RELATIVE NEGRO POPULATION OF THE STATES OF THE" "\n" "UNITED STATES", y=.98, fontsize=20)
axs[0].text(-900000,2800000, r'Recreated by __________', fontsize=10) #1: Add your name

### Step 12 Expanding on spatial data using Modern Data

Recreating Du Bois portraits using modern data allows us to apply important spatial analysis techniques such as rates/proportions, data classification, and visual accessibility. For these exercises, we will be using 2020 data from the Census. As mentioned above, __[NHGIS](https://www.nhgis.org/)__ provides historical data and shapefiles for free for non-commercial use, including the Census data and shapefile used here.

There are two updated datasets:
* 2020 Black population by state: d_2020_updated.csv
* Shapefile of 2020 states: us_state_2020_reduced.shp

Using pandas and geopandas, do the following steps:
* Step a: Read the csv file
* Step b: Read the state 2020 shapefile and plot the map.
* Step c: Merge the dataframe with the shapefile using key variable STATENAM.
* Step d: Plot a choropleth map using 2020 number of Black residents

<details> <summary><strong>**Answer:** </strong></summary>

```df = pd.read_csv(fp_data)```
    
```df.head()```
    
</details>

In [None]:
#Step a: Read the csv file
fp_data="https://github.com/HigherEdData/Du-Bois-STEM/raw/refs/heads/main/data/d_2020_updated.csv"
df = ___.read_csv(fp_data)
df.head()

<details> <summary><strong>**Answer:** </strong></summary>

```gdf = gpd.read_file(fp_map)```
    
```gdf.head()```
    
```gdf.plot()```
    
</details>

In [None]:
#Step b: Read the state 2020 shapefile and plot the map.
fp_map="https://github.com/HigherEdData/Du-Bois-STEM/raw/refs/heads/main/data/us_state_2020/us_state_2020_reduced.shp"
gdf = ___.read_file(fp_map)
gdf.head()

<details> <summary><strong>**Answer:** </strong></summary>

```gdf_recreate = gdf.merge(df, on='STATENAM')```
```gdf_recreate.head()```
    
</details>

In [None]:
#Step c: Merge the dataframe with the shapefile using STATENAM
gdf_recreate = gdf.merge(df, on='___')
gdf_recreate.head()

<details> <summary><strong>**Answer:** </strong></summary>

```fig, ax = plt.subplots(1, 1, figsize=(10, 10))```
```gdf_recreate.plot('blkpop', ax=ax, edgecolor='black', lw=.1, legend = True)```
```ax.set_axis_off() # Remove the axes```
    
</details>

In [None]:
#Step d: Plot a choropleth map using 2020 number of Black residents
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gdf_recreate.plot('____', ax=ax, edgecolor='black', lw=.1, legend = True) #type 'blkpop' as the value for choropleth map
ax.set_axis_off() # Remove the axes
plt.tight_layout()

Take a moment to examine the map above. The lighter-colored states indicate areas with a higher number of people identifying as Black. For example, states like California and Texas stand out with relatively higher counts. However, it's important to consider that these states also have the largest overall populations, which naturally contributes to higher absolute numbers. Keep in mind that this map shows **counts**, not **proportions**—so larger states may appear prominent simply because more people live there overall.

## Step 13 Normalization of Data

Raw or total values—like those shown in the map above—can be useful, but they often lead to misleading interpretations when used in choropleth maps. This is because larger areas tend to have higher totals simply due to their size, not necessarily because the variable of interest is more concentrated there.

For example, California may show a high total population of people identifying as Black—not necessarily because it has a higher proportion, but because it's a large state with a large overall population. In contrast, smaller states like Maryland may have lower totals even if the proportion is higher.

This makes it difficult to fairly compare values between areas of different sizes. To address this, we normalize the data—typically by converting raw totals into rates, percentages, or densities. Normalization allows us to make more meaningful comparisons by accounting for differences in population size or geographic area.

### Understanding Rates and Proportions in Mapping

 When working with aggregated data, it's important to understand how **rates** and **proportions** help us make fair comparisons across different regions.
 - A **rate** expresses the relationship between two quantities, typically as a value per a larger base – for example, **population density** (people per square kilometer) or a **drug overdose death rate** ([26.9 deaths per 100,000 people in California](https://www.cdc.gov/nchs/pressroom/states/california/ca.htm)).
 - A **proportion** shows how a part relates to the whole, often represented as a percentage. For instance, according to the [2024 US Census](https://www.census.gov/quickfacts/fact/table/CA/SEX255223), **50.1% of California's population is female.**

 Both rates and proportions are forms of **data normalization.** They adjust for differences in population or area size, allowing for more meaningful comparisons in choropleth maps and other forms of spatial analysis.

### Normalize Black Population by Total Population

Now, let’s explore what happens when we normalize the Black population by the total population of each state. Instead of looking at raw counts or density, we calculate the **proportion of Black individuals within each state's population.** This gives us a percentage that reflects how large the Black population is **relative to the total population** in each state.

By mapping these proportions, we gain insight into **demographic representation**—how prevalent a group is within a given area—rather than just how many people live there.

$$Percent\% = 100*\frac{count}{totcount}$$

In our dataframe, ```blkpop``` is the number of Black residents and ```totpop``` is the total population. How can we make a new variable for the percent of Black residents?

<details> <summary><strong>**Answer:** </strong></summary>

```gdf_recreate['blkpct'] = (gdf_recreate['blkpop'] / gdf_recreate['totpop']) * 100```
    
</details>

In [None]:
# Calculate the proportion of Black population and add it to the attribute table
gdf_recreate['blkpct'] = (gdf_recreate['____'] / gdf_recreate['____']) * 100
gdf_recreate.head()

In [None]:
#Replot the data
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gdf_recreate.plot('blkpct', ax=ax, edgecolor='black', lw=.1, legend = True) #type 'blkpct' as the value for choropleth map
ax.set_axis_off() # Remove the axes
plt.tight_layout()

By calculating and mapping proportions, we gain a deeper understanding of demographic representation across geographic regions. While raw counts and densities can show where populations are large or concentrated, proportions allow us to see where a group makes up a significant share of the total population, regardless of overall size or area. **In this map, we see that the southeastern states have higher proportions of Black residents than the rest of the country.** This type of normalization is essential when comparing areas with different population totals, ensuring that our interpretations are fair, meaningful, and grounded in context. As you continue exploring choropleth maps, remember that the way data is scaled (whether through totals, rates, or proportions) directly influences the story your map tells.

## Step 14 Data Classification

When visualizing geographic data, especially through choropleth maps, we often need to group similar values together into categories. This process—called **data classification**—involves assigning attribute values to a limited number of meaningful groups, or classes. Classification simplifies continuous or complex datasets into interpretable patterns, making it easier to compare and communicate information through maps.

In the context of cartography and spatial analysis, classification is particularly important for **quantitative attributes** (such as population density or income), where values exist on ordinal, interval, or ratio scales and can be ordered. Grouping these values into classes helps reduce visual complexity while still preserving important trends and differences. This is not only useful for single-map interpretation, but also for **temporal or spatial comparison**, such as examining changes in Black population density across centuries or between regions.

Data classification also has implications beyond visual clarity. The choice of classification method directly shapes how the user perceives the underlying data. For example, classifying continuous population values into five categories can reveal regional trends that might be obscured in an unclassed map. Similarly, choosing one classification scheme over another—like quantiles versus natural breaks—can highlight or mask certain spatial patterns.

Ultimately, the goal of classification is to **enhance map readability while retaining analytical integrity.** As you’ll see in this section, selecting the number of classes and how data is grouped isn't arbitrary—it’s a crucial design decision that affects both the **accuracy** and **interpretability** of your map.
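
At its core, classification maps each value to the index of the class it falls into. The sketch below shows that idea using only the Python standard library. The break values and sample percentages are hypothetical, and treating each break as the inclusive upper bound of its class is an assumption made for this example.

```python
from bisect import bisect_left

# Hypothetical upper bounds for four classes, in percent.
breaks = [5, 15, 30, 100]

def classify(value, breaks):
    """Return the index of the first class whose upper bound is >= value."""
    return bisect_left(breaks, value)

sample = [2.1, 5.0, 12.4, 37.9]  # made-up percentages
classes = [classify(v, breaks) for v in sample]

for value, cls in zip(sample, classes):
    print(f"{value:5.1f}% -> class {cls}")
```

Note that 2.1 and 5.0 land in the same class even though one is more than twice the other. Where the breaks sit decides which differences the map shows and which it hides.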


## Choosing the Number of Classes

The number of classes you select should be informed by the spatial distribution of the data and the audience's ability to interpret the map. Too many classes can overwhelm the viewer, while too few can oversimplify the story the data is telling.

- **Too many classes** (e.g., 9 or more) can clutter the map, make colors harder to distinguish, and demand greater cognitive effort from the reader.

- **Too few classes** (e.g., 2 or 3) can obscure important variation and may group very different values into the same category.

*Note: Data quality or uncertainty should not influence your choice of class number—classification is about visualization, not about error correction.*

Below is a code example that demonstrates the visual impact of choosing different numbers of classes on a choropleth map.

In [None]:
import mapclassify  # GeoPandas relies on mapclassify for the `scheme` option
import matplotlib.pyplot as plt

# Number of quantile classes to compare, with a title for each map
class_titles = {3: "Three Classes", 5: "Five Classes", 9: "Nine Classes"}

# One row of three maps, one per class count
fig, axs = plt.subplots(1, 3, figsize=(18, 6))

for ax, (k, title) in zip(axs, class_titles.items()):
    gdf_recreate.plot(
        column='blkpct',
        scheme='Quantiles',
        k=k,
        legend=True,
        ax=ax,
        edgecolor='0.9'
    )
    ax.set_title(title)
    ax.axis('off')

plt.tight_layout()
plt.suptitle("Impact of Class Number on Choropleth Map Readability", fontsize=16, y=1.05)
plt.show()

What patterns do you notice across the three maps?

## Common Data Classification Methods

Once you decide to classify your data, the next important step is choosing a **classification method.** Each method follows a different logic for dividing the data into classes, and the method you choose can significantly influence how patterns are perceived on the map. Below are some of the most commonly used approaches:

1. **Equal Interval**
This method divides the range of values into **equal-sized intervals.** Each class spans the same numerical distance, regardless of how the data are distributed.

  - Best for: Uniformly distributed data

  - Drawback: Can obscure meaningful variation in skewed data

2. **Quantiles**
Quantiles divide the dataset so that each class contains an **equal number of features** (e.g., states or counties).

  - Best for: Highlighting relative ranking or comparison

  - Drawback: Class breaks may fall between very different values

3. **Natural Breaks (Jenks)**
This method identifies natural groupings and gaps in the data using statistical optimization. It minimizes variance within classes and maximizes variance between them.

  - Best for: Skewed or clustered data

  - Drawback: Classes may be unevenly populated

4. **Standard Deviation**
Class breaks are based on how far values deviate from the mean. This is useful when comparing values to an average or identifying outliers.

  - Best for: Normally distributed data or emphasizing deviation

  - Drawback: Not intuitive for general audiences

5. **Manual Classification**
You choose the class breaks yourself based on domain knowledge or meaningful thresholds.

  - Best for: Custom comparisons or policy-relevant thresholds

  - Drawback: Can introduce subjectivity or bias
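
To make the first two methods concrete, the sketch below computes equal-interval and quantile breaks by hand on a small, right-skewed sample. The values are made up, the helper function names are hypothetical, and the quantile interpolation rule used here (`method="inclusive"` in the standard library) is an assumption that may differ slightly from what a mapping library uses.

```python
import statistics

# Made-up, right-skewed percentages for ten hypothetical areas.
values = [1, 2, 2, 3, 4, 5, 7, 10, 25, 40]
k = 4  # number of classes

def equal_interval_breaks(values, k):
    """Upper bounds of k classes that each span the same numeric width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]

def quantile_breaks(values, k):
    """Upper bounds of k classes that each hold roughly the same number of values."""
    cuts = statistics.quantiles(values, n=k, method="inclusive")
    return cuts + [max(values)]

def class_counts(values, breaks):
    """Count the values in each class (above the previous break, at or below this one)."""
    counts, prev = [], float("-inf")
    for b in breaks:
        counts.append(sum(prev < v <= b for v in values))
        prev = b
    return counts

ei = equal_interval_breaks(values, k)
q = quantile_breaks(values, k)
print("Equal interval:", ei, class_counts(values, ei))
print("Quantiles:     ", q, class_counts(values, q))
```

On this sample, equal interval places eight of the ten values in the first class and leaves the second class empty, while quantiles spread the values almost evenly but put 10 and 40 in the same class. These are the drawbacks listed above for each method.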

Each method reveals different patterns in the same dataset. It's important to experiment and think critically about which classification best supports your map’s purpose. The code below applies the first four methods in Python through `GeoPandas`, which relies on the `mapclassify` library for its `scheme` option.



In [None]:
# Classification methods to compare, keyed by display name.
# The values are the scheme names that GeoPandas passes on to mapclassify.
schemes = {
    "Equal Interval": "equalinterval",
    "Quantiles": "quantiles",
    "Natural Breaks (Jenks)": "naturalbreaks",
    "Standard Deviation": "stdmean"
}

# Set up a 2x2 grid of maps
fig, axs = plt.subplots(2, 2, figsize=(14, 10))
axs = axs.flatten()

# Plot each classification method
for ax, (name, scheme_name) in zip(axs, schemes.items()):
    # Standard deviation classes come from the mean and spread, so no k is passed
    kwds = {} if name == "Standard Deviation" else {'k': 5}

    gdf_recreate.plot(
        column='blkpct',
        scheme=scheme_name,
        classification_kwds=kwds,
        legend=True,
        edgecolor='0.9',
        linewidth=0.5,
        ax=ax
    )
    ax.set_title(name)
    ax.axis('off')

plt.tight_layout()
plt.suptitle("Choropleth Comparison Using Different Classification Methods", fontsize=16, y=1.03)
plt.show()
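
The loop above does not pass `k` for the Standard Deviation scheme, because its breaks come from the mean and standard deviation rather than from a requested class count. Here is a rough sketch of that idea in plain Python with made-up values. The multiples `(-2, -1, 1, 2)` and the use of the population standard deviation are assumptions for this illustration, and the defaults in `mapclassify.StdMean` may differ.

```python
import statistics

values = [1, 2, 2, 3, 4, 5, 7, 10, 25, 40]  # made-up percentages

def std_mean_breaks(values, multiples=(-2, -1, 1, 2)):
    """Class upper bounds at mean + m * std for each multiple m, plus the maximum."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    breaks = [mean + m * std for m in multiples]
    if breaks[-1] < max(values):
        breaks.append(max(values))  # top class catches values beyond the last multiple
    return mean, std, breaks

mean, std, breaks = std_mean_breaks(values)
print(f"mean={mean:.2f}, std={std:.2f}")
print("breaks:", [round(b, 2) for b in breaks])
```

With skewed data like this sample, the two lowest breaks fall below zero, so those classes stay empty. That is one reason the method suits roughly normal data better than skewed data.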

## Step 14. Putting it all together
With these classification practices in mind, we will now recreate the choropleth map. The original plate had 10 classes, but here we simplify to five. We will use quantiles (quintiles, since there are five classes) to highlight the relative comparison among states.

In [None]:
gdf_recreate['quintiles'] = pd.qcut(gdf_recreate['blkpct'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4','Q5'])
print(gdf_recreate.head())

In [None]:
# Replot the data, coloring each state by its quintile class
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gdf_recreate.plot(column='quintiles', ax=ax, edgecolor='black', lw=0.1, legend=True)
ax.set_axis_off()  # Remove the axes
ax.set_title("Quintiles of Percent of Black Population by State", fontsize=14)
plt.tight_layout()
plt.show()

## Step 15. Export a final HTML file of your Notebook

Now that you're done, remember to export a final HTML file showing your work and displaying your name within the visualizations you created.

As noted above, you can export the Notebook by clicking the **File** dropdown above, selecting **Export File As** and then selecting **HTML** as shown here:
<div>
<img src="https://github.com/HigherEdData/Du-Bois-STEM/blob/main/readings-images/htmlexport.jpg?raw=true" width="500" />
</div>

# More Resources and References

- [GitHub repository for the #DuBoisChallenge2024](https://github.com/ajstarks/dubois-data-portraits/blob/master/challenge/2024/README.md)
- [Du Bois Challenge 2024 Recap](https://speakerdeck.com/ajstarks/du-bois-challenge-2024-recap)
- [2024 Du Bois Challenge using R Programming](https://medium.com/illumination/2024-du-bois-challenge-using-r-programming-02af8afa5626)
- [Developing Du Bois’s Data Portraits with Python and Matplotlib](https://www.edriessen.com/2024/02/07/developing-du-boiss-data-portraits-with-python-and-matplotlib/)
- [Three Tricks I Learned In The Du Bois Data Visualization Challenge](https://nightingaledvs.com/recreating-historical-dataviz-three-tricks-i-learned-in-the-du-bois-data-visualization-challenge/)
- [Molly Kuhs Du Bois Challenge repo](https://github.com/makuhs/DuboisChallenge)
- [#DuBoisChallenge2024 using Python and Matplotlib](https://github.com/edriessen/dubois24-python-matplotlib)
- [#DuBoisChallenge2024 using R](https://github.com/sndaba/2024DuBoisChallengeInRstats/tree/main)
- [#DuBoisChallenge2024 using Tableau](https://public.tableau.com/app/profile/camaal.moten7357/vizzes)