<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#OPM-Employment-Data-(Non-DOD)" data-toc-modified-id="OPM-Employment-Data-(Non-DOD)-1">OPM Employment Data (Non-DOD)</a></span><ul class="toc-item"><li><span><a href="#Imports-and-Constants" data-toc-modified-id="Imports-and-Constants-1.1">Imports and Constants</a></span></li><li><span><a href="#Functions-for-Easier-Loading" data-toc-modified-id="Functions-for-Easier-Loading-1.2">Functions for Easier Loading</a></span><ul class="toc-item"><li><span><a href="#Dynamic-Data" data-toc-modified-id="Dynamic-Data-1.2.1">Dynamic Data</a></span></li><li><span><a href="#Status-Data" data-toc-modified-id="Status-Data-1.2.2">Status Data</a></span></li></ul></li><li><span><a href="#Analysis" data-toc-modified-id="Analysis-1.3">Analysis</a></span></li></ul></li></ul></div>

# OPM Employment Data (Non-DOD)
<hr style='height:3px'>
This notebook helps with analyzing the OPM non-DOD data and includes helper functions to make loading it easier. The data is BuzzFeed's OPM data, hosted <a href="https://archive.org/details/opm-federal-employment-data/docs/2015-02-11-opm-foia-response/">here</a> on the Internet Archive. To download it, <a href="https://archive.org/compress/opm-federal-employment-data" target="_blank">click this link</a>. It's about 34 GB. The download will be structured the same way as the data I am using here (made explicit at the bottom of this cell).

The data outside the 1973-09 to 2014-06 range is horribly arranged, so I've forgone making the load functions handle it. If you need to load that data, feel free to contact me and I'll write and send over a script that can handle it. As far as I can tell, there are no dynamic files after that end date, just status files.


This is using:<br>
Python 3.8.3 or higher<br>
Pandas 1.0.5 or higher


The folder containing the data is structured as:
    <ul>
        <li>.\opm-federal-employment-data</li>
        <ul>
            <li>data</li>
            <ul>
                <li>1973-09-to-2014-06</li>
                <ul>
                    <li>dod</li>
                    <ul>
                        <li>Irrelevant for us</li>
                    </ul>
                    <li>non-dod</li>
                    <ul>
                        <li>dynamic</li>
                        <ul>
                            <li>\*.NONDOD.FO05M3.TXT</li>
                        </ul>
                        <li>status</li>
                        <ul>
                            <li>Status_Non_DoD_\*.txt</li>
                        </ul>
                    </ul>
                    <li>SCTFILE.TXT</li>
                </ul>
                <li>2014-09-to-2016-09</li>
                <ul>
                    <li>Irrelevant for us</li>
                </ul>
                <li>2016-12-to-2017-03</li>
                <ul>
                    <li>Irrelevant for us</li>
                </ul>
            </ul>
            <li>docs</li>
            <ul>
                <li>Irrelevant for us</li>
            </ul>
        </ul>
    </ul>

## Imports and Constants

In [6]:
import pandas as pd
from collections import OrderedDict
import os


dynamic_dir = os.path.join(".", "opm-federal-employment-data",
                           "data", "1973-09-to-2014-06", "non-dod", "dynamic")

status_dir = os.path.join(".", "opm-federal-employment-data",
                          "data", "1973-09-to-2014-06", "non-dod", "status")

fwf_columns = OrderedDict([
    ('Pseudo ID', (0, 9)),
    ('Name', (9, 32)),
    ('File Date', (32, 40)),
    ('SubAgency', (40, 44)),
    ('Duty Station', (44, 53)),
    ('Age Range', (53, 59)),
    ('Education Level', (59, 61)),
    ('Pay Plan', (61, 63)),
    ('Grade', (63, 65)),
    ('LOS Level', (65, 71)),
    ('Occupation', (71, 75)),
    ('PATCO', (75, 76)),
    ('Adjusted Basic Pay', (76, 82)),
    ('Supervisory Status', (82, 83)),
    ('TOA', (83, 85)),
    ('Work Schedule', (85, 86)),
    ('NSFTP Indicator', (86, 87))
])
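
To make the column layout concrete, here is a minimal sketch of how the (start, end) pairs carve up a line. The pairs are zero-based and end-exclusive, which is how `pd.read_fwf` interprets `colspecs`. The record, the `demo_columns` subset, and the `split_fixed_width` helper are made up for illustration; they are not part of the loaders in this notebook, and the sample values are not real OPM data.

```python
from collections import OrderedDict

# A shortened copy of the layout above. Each value is a (start, end) pair,
# zero-based and end-exclusive, like a Python slice.
demo_columns = OrderedDict([
    ('Pseudo ID', (0, 9)),
    ('Name', (9, 32)),
    ('File Date', (32, 40)),
    ('SubAgency', (40, 44)),
])

# Made-up record, padded so each field fills its slot exactly.
demo_line = "000000001" + "DOE,JANE".ljust(23) + "19820331" + "AG01"


def split_fixed_width(line, columns):
    """Cut one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip()
            for name, (start, end) in columns.items()}


record = split_fixed_width(demo_line, demo_columns)
print(record)
```

Because the end offset of one field is the start offset of the next, the slots tile the line with no gaps or overlaps.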

## Functions for Easier Loading
<hr>
The goal here is to abstract away the bad naming scheme of the files. The data is partitioned by quarter. You can run the functions without arguments to load just the first file in each category (status vs. dynamic). I recommend this so you can get your bearings with the data without much wait time.
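
As a rough illustration of the two naming schemes, the sketch below pulls the date out of each kind of file name and sorts by it. The file names are hypothetical examples built from the patterns in the folder listing (`*.NONDOD.FO05M3.TXT` and `Status_Non_DoD_*.txt`), and the two helper functions are not part of the loaders. They use only the standard library and assume the date sits at the start of a dynamic name and just before the extension of a status name.

```python
from datetime import datetime


def dynamic_file_date(filename):
    """Date from a dynamic file name, which starts with YYYY-MM."""
    return datetime.strptime(filename[:7], "%Y-%m")


def status_file_date(filename):
    """Date from a status file name, which ends with YYYY_MM before '.txt'."""
    return datetime.strptime(filename[-11:-4], "%Y_%m")


# Hypothetical names that follow the two patterns in the folder listing.
dynamic_names = ["1982-06.NONDOD.FO05M3.TXT", "1982-03.NONDOD.FO05M3.TXT"]
status_names = ["Status_Non_DoD_1973_12.txt", "Status_Non_DoD_1973_09.txt"]

# Sorting by the parsed date puts the quarters in chronological order.
sorted_dynamic = sorted(dynamic_names, key=dynamic_file_date)
sorted_status = sorted(status_names, key=status_file_date)
print(sorted_dynamic)
print(sorted_status)
```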

Heads up that, for me, 16 GB of RAM wasn't enough for ~1980-2005. If you need to run code on data that's larger than your RAM, a cloud compute solution can work: download the archive there using cURL in the terminal with the link https ://archive .org/compress/opm-federal-employment-data (I put spaces in the link so you don't accidentally click on it and start downloading the files again).
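
If you only need a summary rather than the full table, another option is to read a file line by line and keep a running tally, so memory use stays small no matter how large the file is. The sketch below is one way that could look. It uses only the standard library, the `count_field` helper is hypothetical, and the records are made up with just the 'SubAgency' slot (40, 44) filled in. For a real file you would pass an open file handle instead of the `io.StringIO` stand-in.

```python
import io
from collections import Counter


def count_field(lines, start, end):
    """Tally the values found at line[start:end], one line at a time."""
    counts = Counter()
    for line in lines:
        value = line[start:end].strip()
        if value:  # skip lines that are blank in this slot
            counts[value] += 1
    return counts


# Stand-in for an open data file. The records are made up; only the
# 'SubAgency' slot (40, 44) is populated.
fake_file = io.StringIO(
    " " * 40 + "AG01\n" +
    " " * 40 + "AG02\n" +
    " " * 40 + "AG01\n"
)

subagency_counts = count_field(fake_file, 40, 44)
print(subagency_counts.most_common())
```

Only the counter lives in memory, so the same loop can be run over every quarterly file in turn and the tallies added together.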

### Dynamic Data
The dynamic data is a quarterly report.

In [25]:
def load_dynamic(start_date: str = "1982-03", end_date: str = "1982-04") -> pd.DataFrame:
    """Loads the dynamic files for the given date range (inclusive). Keep in mind, the data can get
    sizeable really quickly. Dates are passed in a YYYY-MM format in the defaults, but anything that
    pd.to_datetime can parse should work. The reports are quarterly, so the months should be December,
    March, June, or September. The start and end date defaults were chosen because they are dates
    with available data."""
    # Parse first so the check below compares dates, not raw strings
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    if start_date > end_date:
        raise ValueError(
            "Remember to have an end date *after* the start date. No file loading was done."
        )
    files_to_load = []  # files are not sorted in the directory, so the data would be messier
    for file in os.listdir(dynamic_dir):
        file_date = pd.to_datetime(file[:7])  # dynamic file names start with YYYY-MM
        if start_date <= file_date <= end_date:
            files_to_load.append(file)
    files_to_load = sorted(files_to_load, key=lambda x: pd.to_datetime(x[:7]))
    # DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
    frames = [pd.read_fwf(os.path.join(dynamic_dir, file),
                          colspecs=list(fwf_columns.values()),
                          header=None,
                          names=list(fwf_columns.keys()))
              for file in files_to_load]
    if not frames:  # pd.concat raises on an empty list
        return pd.DataFrame()
    return pd.concat(frames)

### Status Data
The status data is a quarterly report.

In [37]:
def load_status(start_date: str = "1973-09", end_date: str = "1973-12") -> pd.DataFrame:
    """Loads the status files for the given date range (inclusive). Keep in mind, the data can get
    sizeable really quickly. Dates are passed in a YYYY-MM format in the defaults, but anything that
    pd.to_datetime can parse should work. The reports are quarterly, so the months should be December,
    March, June, or September. The start and end date defaults were chosen because they are dates
    with available data."""
    # Parse first so the check below compares dates, not raw strings
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    if start_date > end_date:
        raise ValueError(
            "Remember to have an end date *after* the start date. No file loading was done."
        )
    files_to_load = []  # files are not sorted in the directory, so the data would be messier
    for file in os.listdir(status_dir):
        # status file names end with YYYY_MM just before the ".txt" extension
        file_date = pd.to_datetime("-".join(file[-11:-4].split("_")))
        if start_date <= file_date <= end_date:
            files_to_load.append(file)
    # the sort key must use its own argument x, not the leftover loop variable
    files_to_load = sorted(files_to_load, key=lambda x: pd.to_datetime(
        "-".join(x[-11:-4].split("_"))))
    # DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
    frames = [pd.read_fwf(os.path.join(status_dir, file),
                          colspecs=list(fwf_columns.values()),
                          header=None,
                          names=list(fwf_columns.keys()))
              for file in files_to_load]
    if not frames:  # pd.concat raises on an empty list
        return pd.DataFrame()
    return pd.concat(frames)

## Analysis
<hr style="height:3px">