# Exploring Air Force personnel data

By: Vatsal Vinay Parikh

In this project, we'll explore a US Department of Defense personnel demographics dataset. The publicly available dataset was taken [from data.gov](https://catalog.data.gov/dataset/personnel-trends-by-gender-race), and has been cleaned and tidied, so we can get straight into exploratory data analysis.

The dataset contains counts of military personnel by gender, race, and paygrade. It was compiled in March 2010.

## 1: Import the packages

Today, we'll be using `pandas` for data manipulation and calculations, and `plotly.express` for visualization.

- Import the `pandas` package using the alias `pd`.
- Import the `plotly.express` package using the alias `px`.

In [1]:
# Import the pandas package
import pandas as pd

# Import the plotly express package
import plotly.express as px

## 2: Read in the dataset

The demographics dataset is contained in a CSV file named `"dod_demographics.csv"`.

- Use `pandas` to read this CSV file. Assign it to a variable named `dod_demographics`.

In [3]:
# Import the demographic data from "dod_demographics.csv"
dod_demographics = pd.read_csv('dod_demographics.csv')

# See the result
dod_demographics

Unnamed: 0,service,gender,race,hispanicity,paygrade,count
0,Army,MALE,AMI/ALN,HISP,O01,2
1,Army,MALE,AMI/ALN,NON-HISP,O01,38
2,Army,MALE,ASIAN,HISP,O01,0
3,Army,MALE,ASIAN,NON-HISP,O01,361
4,Army,MALE,BLACK,HISP,O01,18
...,...,...,...,...,...,...
3243,Coast Guard,FEMALE,P/I,NON-HISP,E09,0
3244,Coast Guard,FEMALE,WHITE,HISP,E09,0
3245,Coast Guard,FEMALE,WHITE,NON-HISP,E09,13
3246,Coast Guard,FEMALE,UNK,HISP,E09,0


The dataset has 6 columns.

- **service**: Army, Navy, Marine Corps, Air Force, Coast Guard. (Space Force didn't exist when the dataset was compiled.)
- **gender**: MALE or FEMALE.
- **race**: AMI/ALN, ASIAN, BLACK, MULTI, P/I, WHITE, UNK.
- **hispanicity**: HISP, NON-HISP.
- **paygrade**: Enlisted grades E00 to E09, Warrant Officer grades W01 to W05, Officer grades O01 to O10.
- **count**: number of personnel in that demographic.
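
One way to check a column description like the one above is to list the distinct values in each categorical column. The sketch below does this on a tiny made-up dataframe (`toy`, with invented values) that only borrows the column names; with the real data you would run the same dictionary comprehension on `dod_demographics`.

```python
import pandas as pd

# Hypothetical miniature stand-in for dod_demographics (all values are invented)
toy = pd.DataFrame({
    "service": ["Army", "Army", "Air Force", "Air Force"],
    "gender": ["MALE", "FEMALE", "MALE", "FEMALE"],
    "race": ["WHITE", "BLACK", "ASIAN", "WHITE"],
    "hispanicity": ["NON-HISP", "HISP", "NON-HISP", "NON-HISP"],
    "paygrade": ["O01", "E05", "E09", "O03"],
    "count": [10, 0, 3, 7],
})

# Collect the distinct values of each categorical column, sorted for readability
categorical_columns = ["service", "gender", "race", "hispanicity", "paygrade"]
distinct_values = {col: sorted(toy[col].unique()) for col in categorical_columns}

for col, values in distinct_values.items():
    print(col, values)
```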

## 3: Get the Air Force subset of the dataset

The dataset contains data for all the services, but we only want to analyze the Air Force data.

- Query `dod_demographics` for rows where the `service` is equal to `"Air Force"`. Assign to `air_force`.

In [5]:
# Query dod_demographics for rows in the "Air Force" service
air_force = dod_demographics.query('service == "Air Force"')

# See the results
air_force

Unnamed: 0,service,gender,race,hispanicity,paygrade,count
840,Air Force,MALE,AMI/ALN,HISP,O01,0
841,Air Force,MALE,AMI/ALN,NON-HISP,O01,36
842,Air Force,MALE,ASIAN,HISP,O01,2
843,Air Force,MALE,ASIAN,NON-HISP,O01,252
844,Air Force,MALE,BLACK,HISP,O01,6
...,...,...,...,...,...,...
2655,Air Force,FEMALE,P/I,NON-HISP,E09,1
2656,Air Force,FEMALE,WHITE,HISP,E09,1
2657,Air Force,FEMALE,WHITE,NON-HISP,E09,160
2658,Air Force,FEMALE,UNK,HISP,E09,3
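
For comparison, the same row filter can be written with a boolean mask instead of `.query()`. This sketch uses a small invented dataframe (`toy`) rather than the real data. It also shows that filtering keeps the original index labels, which is why the Air Force rows above start at 840 rather than 0.

```python
import pandas as pd

# Hypothetical toy data (invented values) with two of the real column names
toy = pd.DataFrame({
    "service": ["Army", "Air Force", "Navy", "Air Force"],
    "count": [5, 8, 2, 1],
})

# Filter with a query string, as in the cell above
via_query = toy.query('service == "Air Force"')

# Filter with a boolean mask; this selects the same rows
via_mask = toy[toy["service"] == "Air Force"]

print(via_query.equals(via_mask))
print(list(via_query.index))  # original row labels are preserved
```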


## 4: Start exploring! How much data do we have?

We're ready to start asking some questions about the dataset. A good way to start exploring data is to ask how much data you have for different groups. Let's calculate the personnel count by gender.

- Group `air_force` by `"gender"` and calculate the sum of the `"count"`s.

In [6]:
# Group air_force by gender and calculate the total count
air_force.groupby("gender")[["count"]].sum()

gender,count
FEMALE,64200
MALE,267286


Let's visualize these total counts using a bar plot. Plotly prefers to have all variables in the plot as columns in the dataframe, so we need an additional step of resetting the index.

- Copy and paste the previous code, then reset the index and assign to `total_counts_by_gender`.

In [7]:
# Redo the previous analysis, then reset the index.
total_counts_by_gender = air_force.groupby("gender")["count"].sum().reset_index()

# See the result
total_counts_by_gender

Unnamed: 0,gender,count
0,FEMALE,64200
1,MALE,267286
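
As an alternative to resetting the index afterwards, `groupby` accepts `as_index=False`, which keeps the grouping variable as an ordinary column. The sketch below compares the two routes on an invented dataframe (`toy`). It also selects the `"count"` column before summing, which keeps text columns such as `race` out of the result regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "gender": ["MALE", "FEMALE", "MALE", "FEMALE"],
    "race": ["WHITE", "WHITE", "BLACK", "ASIAN"],
    "count": [10, 4, 6, 1],
})

# Two steps: aggregate, then move the group labels out of the index
two_step = toy.groupby("gender")["count"].sum().reset_index()

# One step: ask groupby to keep "gender" as a regular column
one_step = toy.groupby("gender", as_index=False)["count"].sum()

print(one_step)
```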


- Using `total_counts_by_gender`, draw a bar plot of `count` versus `gender`.

In [8]:
# Using total_counts_by_gender, draw a bar plot of count versus gender
px.bar(total_counts_by_gender, x = "gender", y = "count")

Bar plots work with vertical or horizontal bars (and often horizontal bars make the plot easier to read). Try swapping the x- and y-axes on the previous plot.

- Redraw the previous plot with the x- and y-axes swapped.

In [9]:
# Redraw the previous plot with the axes swapped
px.bar(total_counts_by_gender, x = "count", y = "gender")

## 5: How much data do we have by race?

See if you can repeat the total count analysis, this time breaking down the data by race.

- Calculate the total count of personnel by race.
- Draw a bar plot of the counts by race.

In [11]:
# Calculate the total count of personnel by race
total_counts_by_race = air_force.groupby("race")["count"].sum().reset_index()

# Draw a bar plot of the counts by race
px.bar(total_counts_by_race, x = "count", y = "race")

- Optional bonus task: the bar plot is easier to read if the bars are shown in order from longest to shortest. Sort the values of the counts and redraw the plot.

In [13]:
# Sort the counts in ascending order; plotly draws the first row at the bottom
# of a horizontal bar plot, so the longest bar ends up at the top
total_counts_by_race_sorted = total_counts_by_race.sort_values("count")

# Redraw the bar plot using the sorted counts
px.bar(total_counts_by_race_sorted, x = "count", y = "race")

## 6: Exploring the highest paygrades by group

Let's take a look at the highest paygrades of personnel for different groups. These questions are easiest to answer if we first sort the dataset by the values of paygrade, then filter the dataset for rows where the count is positive.

- Sort `air_force` by the values of `paygrade`.
- Query for rows where the count is greater than zero.
- Assign the result with no zero counts to `air_force_nz`.

In [14]:
# Sort air_force by the values of paygrade, and query for positive counts
air_force_nz = air_force.sort_values('paygrade').query('count > 0')

# See the result
air_force_nz

Unnamed: 0,service,gender,race,hispanicity,paygrade,count
2405,Air Force,FEMALE,WHITE,NON-HISP,E00,1
2391,Air Force,MALE,WHITE,NON-HISP,E00,4
2427,Air Force,FEMALE,BLACK,NON-HISP,E01,379
2428,Air Force,FEMALE,MULTI,HISP,E01,1
2429,Air Force,FEMALE,MULTI,NON-HISP,E01,76
...,...,...,...,...,...,...
1089,Air Force,FEMALE,WHITE,NON-HISP,O09,1
1075,Air Force,MALE,WHITE,NON-HISP,O09,33
1074,Air Force,MALE,WHITE,HISP,O09,1
1069,Air Force,MALE,BLACK,NON-HISP,O09,1
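
Sorting paygrades as plain strings works within each category because the codes are zero-padded. The short sketch below illustrates this, along with two caveats: unpadded codes would sort incorrectly, and warrant officer codes (`W…`) sort alphabetically after officer codes (`O…`). Whether the second caveat matters depends on whether the subset contains any warrant officer rows.

```python
# Zero-padded codes sort in the expected order within each letter group
paygrades = ["O10", "E09", "W01", "O01", "E00", "W05"]
sorted_paygrades = sorted(paygrades)
print(sorted_paygrades)  # ['E00', 'E09', 'O01', 'O10', 'W01', 'W05']

# Without zero-padding, string comparison puts "E10" before "E9"
unpadded = sorted(["E9", "E10"])
print(unpadded)  # ['E10', 'E9']
```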


Now we can start answering questions. Let's find the highest paygrade held by a female. Since the dataset is sorted by paygrade, the answer is in the last row for females.

- Using `air_force_nz`, query for rows where `gender` is equal to `"FEMALE"`.
- Get the "tail" (last row) of the result.

In [17]:
# Find the row with the female with the highest paygrade
air_force_nz.query('gender == "FEMALE"').tail(1)

Unnamed: 0,service,gender,race,hispanicity,paygrade,count
1089,Air Force,FEMALE,WHITE,NON-HISP,O09,1


Your turn. What is the highest paygrade held by anyone Hispanic?

In [18]:
# Find the row with the Hispanic person with the highest paygrade
air_force_nz.query('hispanicity == "HISP"').tail(1)

Unnamed: 0,service,gender,race,hispanicity,paygrade,count
1074,Air Force,MALE,WHITE,HISP,O09,1
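
Note that `.tail(1)` returns a single row even when several groups share the highest paygrade. A minimal sketch, on invented data, of how to return every row tied at the top paygrade:

```python
import pandas as pd

# Hypothetical stand-in for air_force_nz (invented values), sorted by paygrade
toy_nz = pd.DataFrame({
    "gender": ["MALE", "FEMALE", "MALE", "FEMALE"],
    "race": ["WHITE", "WHITE", "BLACK", "ASIAN"],
    "paygrade": ["O09", "O07", "O09", "O07"],
    "count": [3, 1, 1, 2],
}).sort_values("paygrade")

females = toy_nz.query('gender == "FEMALE"')

# The maximum of the zero-padded paygrade strings is the highest grade present
top_paygrade = females["paygrade"].max()

# Keep every row at that paygrade, not just the last one
top_rows = females[females["paygrade"] == top_paygrade]
print(top_rows)
```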


## 7: What's the distribution of paygrades?

Let's take a look at the distribution of paygrades. This time, it's easier to visualize using a line plot.

- Calculate the total count by paygrade. Assign to `total_counts_by_paygrade`.
- Using `total_counts_by_paygrade`, draw a line plot of `count` versus `paygrade`.

In [19]:
# Calculate the total count by paygrade
total_counts_by_paygrade = air_force_nz.groupby("paygrade")["count"].sum().reset_index()

# Draw a line plot of count vs. paygrade
px.line(total_counts_by_paygrade, x = "paygrade", y = "count")
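
Grouping by paygrade returns the groups in alphabetical order of the codes. If a different order is wanted, for example enlisted, then warrant officer, then officer, one possible approach is an ordered categorical. The sketch below uses invented data, and the list `rank_order` is a hypothetical ordering chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data (invented values)
toy = pd.DataFrame({
    "paygrade": ["O01", "E05", "W02", "E09", "O01"],
    "count": [4, 10, 2, 3, 1],
})

# Hypothetical rank order: enlisted, then warrant officer, then officer
rank_order = ["E05", "E09", "W02", "O01"]
toy["paygrade"] = pd.Categorical(toy["paygrade"], categories=rank_order, ordered=True)

# groupby sorts its keys by category order rather than alphabetically
totals = toy.groupby("paygrade", observed=True)["count"].sum().reset_index()
print(totals)
```

The rows of `totals` come out in the order given by `rank_order`, so a plot drawn from them should follow that order as well.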