# Tutorial for Analyzing LMS Files

## Purpose
This notebook is designed to help analysts understand and access the files created by the Ed-Fi LMS Toolkit extractor utilities.

## Pandas

We'll be using [Pandas](https://pandas.pydata.org/), a widely used, high-performance library for data analysis in Python. If you're new to Python and Pandas and want some additional reading, you might be interested in these two books by Jake VanderPlas, both available for free online:

* [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/)
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

## Getting Started

This notebook was developed with Python 3.8 and might not work with earlier versions. The source code directory for this notebook contains `poetry.lock` and `pyproject.toml` files, which [Poetry](https://python-poetry.org/) uses to manage package dependencies. Ideally, you will install Poetry and then run `poetry install` from a command prompt to install the required packages. If you would rather not use Poetry, you can install the dependencies manually with pip:

```bash
# Optional step if NOT using poetry
pip install pandas
pip install ipykernel
pip install jupyter --user
```

We recommend installing these packages in a virtual environment, which Poetry handles for you. Alternatively, you can use the equivalent commands in Anaconda.

This notebook can be run from within Visual Studio Code or it can run in a browser window by executing the following command:

```bash
# With Poetry
poetry run jupyter notebook

# Without Poetry
jupyter notebook
```
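
Once the notebook is running, it can be worth confirming that the kernel meets the Python 3.8 baseline mentioned earlier. The following is only a minimal sketch; the helper name `meets_minimum_version` is hypothetical and is not part of the LMS Toolkit.

```python
import sys
from typing import Tuple


def meets_minimum_version(minimum: Tuple[int, int] = (3, 8)) -> bool:
    """Return True when the running interpreter is at least `minimum`."""
    # Compare only (major, minor); the patch level does not matter here
    return sys.version_info[:2] >= minimum


if not meets_minimum_version():
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Warning: Python {current} is older than 3.8; some cells may fail")
```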

## Understanding the Filesystem

The LMS Extractors output a number of discrete files, corresponding to concepts in the Ed-Fi LMS Unified Data Model:

* Activities
* Assignments
* Attendance (_only with Schoology_)
* Grades
* Sections
* Section Associations (aka _enrollments_)
* Submissions
* Users

Each file contains all of the current data for the given model, so you only need to read one file to get a complete snapshot of a single resource. There is a catch, however: some of these concepts are section-specific, for example assignments. Rather than try to store all assignment data _for all sections_ in a single file, we create _one assignment file per section_. The same is true for activities, attendance, grades, and section associations. Furthermore, submissions depend on assignments, so there is one submissions file per assignment. The file layout mirrors this hierarchy: there is a directory for each resource type, and dependent resources are nested under directories named for the given section or assignment.

This convention may seem a little strange for a human, but it is very easy to navigate for the computer. Each directory may have multiple files, one for each time you run the extractor, but you only need to load the most recent file to get the current snapshot. We make that easy by using the date and time as the file name. Thus after running once, you may end up with files like this, where 12345 is the source system identifier for a unique section and 67890 is the source system identifier for a unique assignment:

![Sample file layout](filesystem.svg)

Note: `base_directory` is whatever directory was specified in the configuration when running the extractor utility.
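
To inspect the layout of your own extractor output, one option is to list every CSV file under `base_directory`. The sketch below uses only the standard library; the function name `list_csv_files` is hypothetical and is not part of the toolkit.

```python
import os
from typing import List


def list_csv_files(base_directory: str) -> List[str]:
    """Return every CSV path under base_directory, relative to it, sorted."""
    found = []
    for root, _dirs, files in os.walk(base_directory):
        for name in files:
            if name.endswith(".csv"):
                full_path = os.path.join(root, name)
                # Relative paths make the resource/section nesting easier to read
                found.append(os.path.relpath(full_path, base_directory))
    return sorted(found)


# Example usage (replace the argument with your configured output directory):
# for path in list_csv_files("path/to/base_directory"):
#     print(path)
```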

Presumably the extractor utility will be run on a periodic basis, for instance weekly or daily. In that case each directory will have multiple files. Since the filenames have the date and time embedded in them, sorting on the file names will make it easy to pick up only the most recent file, regardless of whether or not some other process has modified the file and thus altered the operating system date on the file.
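
The reason sorting works is that the timestamp components are zero-padded and ordered from year down to second, so alphabetical order matches chronological order. A small illustration, using file names in the same format as the sample output (the function name `newest_file_name` is hypothetical):

```python
from typing import List


def newest_file_name(file_names: List[str]) -> str:
    """Pick the most recent name from zero-padded timestamp file names."""
    if not file_names:
        raise ValueError("file_names must not be empty")
    # String comparison is enough: no date parsing or file metadata needed
    return max(file_names)


names = [
    "2020-09-17-15-04-23.csv",
    "2020-09-21-11-08-34.csv",
    "2020-09-18-15-05-01.csv",
]
print(newest_file_name(names))  # 2020-09-21-11-08-34.csv
```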

## Helper Functions

### Filesystem Helpers

Below you will find a set of functions to help you navigate this filesystem.

LANGUAGE USE NOTE: in these examples we use Python's optional [type hint system](https://docs.python.org/3.8/library/typing.html) to help us all recognize what data types are being passed into functions and returned by them.

In [4]:
import os


def _get_newest_file(directory: str) -> str:
    # File names are timestamps, so the greatest name is the most recent extract
    files = [f for f in os.scandir(directory) if f.name.endswith(".csv")]
    if not files:
        raise FileNotFoundError(f"No CSV files found in {directory}")

    return max(files, key=lambda f: f.name).path


def _get_file_for_section(base_directory: str, section_id: int, file_type: str) -> str:
    # Section-specific resources live under a "section=<id>" directory
    return _get_newest_file(os.path.join(base_directory, f"section={section_id}", file_type))


def get_users_file(base_directory: str) -> str:
    return _get_newest_file(os.path.join(base_directory, "users"))


def get_sections_file(base_directory: str) -> str:
    return _get_newest_file(os.path.join(base_directory, "sections"))


def get_section_associations_file(base_directory: str, section_id: int) -> str:
    return _get_file_for_section(base_directory, section_id, "section-associations")


def get_activities_file(base_directory: str, section_id: int) -> str:
    return _get_file_for_section(base_directory, section_id, "activities")


def get_assignments_file(base_directory: str, section_id: int) -> str:
    return _get_file_for_section(base_directory, section_id, "assignments")


def get_grades_file(base_directory: str, section_id: int) -> str:
    return _get_file_for_section(base_directory, section_id, "grades")
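
The helpers above do not cover submissions, which are nested one level deeper because there is one submissions file per assignment. The sketch below shows how such a helper might look. It is self-contained, so `newest_csv` stands in for `_get_newest_file`. The directory names `assignment=<id>` and `submissions` are assumptions modeled on the `section=<id>` pattern; check them against your own extractor output before relying on this.

```python
import os


def newest_csv(directory: str) -> str:
    """Stand-in for `_get_newest_file`: newest CSV by timestamped file name."""
    names = [n for n in os.listdir(directory) if n.endswith(".csv")]
    if not names:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    return os.path.join(directory, max(names))


def get_submissions_file(base_directory: str, section_id: int, assignment_id: int) -> str:
    # ASSUMPTION: submissions are stored at
    #   <base_directory>/section=<section_id>/assignment=<assignment_id>/submissions
    # This layout is a guess, not confirmed by the toolkit documentation.
    directory = os.path.join(
        base_directory,
        f"section={section_id}",
        f"assignment={assignment_id}",
        "submissions",
    )
    return newest_csv(directory)


# Example usage, with the placeholder identifiers from the layout description:
# get_submissions_file("path/to/base_directory", 12345, 67890)
```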

#### Function Validation

The code below validates the output from these functions.

In [6]:
sample_dir = os.path.join("..", "..", "docs", "sample-out")

# Users
expected_file = os.path.join(sample_dir, "users", "2020-09-18-15-05-01.csv")
actual_file = get_users_file(sample_dir)
assert expected_file == actual_file, f"Users > Expected: {expected_file}, Actual: {actual_file}"

# Sections
expected_file = os.path.join(sample_dir, "sections", "2020-09-17-15-04-23.csv")
actual_file = get_sections_file(sample_dir)
assert expected_file == actual_file, f"Sections > Expected: {expected_file}, Actual: {actual_file}"

# Activities for section 2385758954
expected_file = os.path.join(sample_dir, "section=2385758954", "activities", "2020-09-21-11-08-34.csv")
actual_file = get_activities_file(sample_dir, 2385758954)
assert expected_file == actual_file, f"Activities > Expected: {expected_file}, Actual: {actual_file}"

# Assignments for section 123456780
expected_file = os.path.join(sample_dir, "section=123456780", "assignments", "2020-09-18-15-04-24.csv")
actual_file = get_assignments_file(sample_dir, 123456780)
assert expected_file == actual_file, f"Assignments > Expected: {expected_file}, Actual: {actual_file}"

# Grades for section 123456780
expected_file = os.path.join(sample_dir, "section=123456780", "grades", "2020-09-18-15-04-24.csv")
actual_file = get_grades_file(sample_dir, 123456780)
assert expected_file == actual_file, f"Grades > Expected: {expected_file}, Actual: {actual_file}"

# Section Associations for section 123456789
expected_file = os.path.join(sample_dir, "section=123456789", "section-associations", "2020-09-18-15-04-24.csv")
actual_file = get_section_associations_file(sample_dir, 123456789)
assert expected_file == actual_file, f"Section Associations > Expected: {expected_file}, Actual: {actual_file}"


print("All good")

All good


### Loading Files into DataFrames

Next, let's create a few functions that leverage the filesystem helpers to read files into Pandas DataFrames.

In [None]:
import pandas as pd  # Aliasing as `pd` is a common practice in the industry
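
As a starting point, a loader can be as small as a wrapper around `pd.read_csv`. The sketch below is an illustration, not the notebook's own implementation; the function name `read_lms_file` is hypothetical, and it makes no assumptions about column names or data types in the extractor files.

```python
import pandas as pd


def read_lms_file(file_path: str) -> pd.DataFrame:
    """Read one extractor CSV file into a DataFrame."""
    # Default parsing only; pass dtype= or parse_dates= to pd.read_csv
    # once you know the columns in your files.
    return pd.read_csv(file_path)


# Example usage with the filesystem helpers defined earlier in this notebook:
# users_df = read_lms_file(get_users_file(base_directory))
# users_df.head()
```

Keeping the file lookup and the CSV parsing in separate functions means each one can be checked on its own, in the same way the filesystem helpers were validated above.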

