# Tutorial for Analyzing LMS Files

## Purpose
This notebook is designed to help analysts understand and access the files created by the Ed-Fi LMS Toolkit extractor utilities.

## Pandas

We'll be using [Pandas](https://pandas.pydata.org/), the industry-standard tool for high-performance data analysis in Python. If you're new to Python and Pandas and want to do some additional reading, you might be interested in these two books by Jake VanderPlas, both available for free:

* [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/)
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

## Getting Started

This notebook was developed with Python 3.8 and might not work with earlier versions. The source code directory for this notebook contains `poetry.lock` and `pyproject.toml` files, which [Poetry](https://python-poetry.org/) uses to manage dependencies. Ideally, you will install Poetry and then run `poetry install` from a command prompt to install the required packages. If you would like to run without Poetry, you can install the dependencies manually with Pip:

```bash
# Optional step if NOT using poetry
pip install pandas
pip install ipykernel
pip install jupyter
```

It is recommended that these be installed in a virtual environment, which Poetry handles for you. Alternatively, you can use equivalent commands in Anaconda.

This notebook can be run from within Visual Studio Code or it can run in a browser window by executing the following command:

```bash
# With Poetry
poetry run jupyter notebook

# Without Poetry
jupyter notebook
```

## Understanding the Filesystem

The LMS Extractors output a number of discrete files, corresponding to concepts in the Ed-Fi LMS Unified Data Model:

* Activities
* Assignments
* Attendance (_only with Schoology_)
* Grades
* Sections
* Section Associations (aka _enrollments_)
* Submissions
* Users

Each file contains all of the current data for the given model, so that you only need to read one file to get a complete snapshot for a single resource. But there is a catch: some of these concepts are section-specific, for example assignments. Rather than try to store all assignment data _for all sections_ in a single file, we create _one assignment file per section_. The same is true for activities, attendance, grades, and section associations. Furthermore, submissions are dependent on assignments, so there is one submissions file per assignment. The file layout mirrors this hierarchy: there is a directory for each resource type, and dependent resources are nested under directories named for the given section or assignment.

This convention may seem a little strange for a human, but it is very easy to navigate for the computer. Each directory may have multiple files, one for each time you run the extractor, but you only need to load the most recent file to get the current snapshot. We make that easy by using the date and time as the file name. Thus after running once, you may end up with files like this, where 12345 is the source system identifier for a unique section and 67890 is the source system identifier for a unique assignment:

![Sample file layout](filesystem.svg)

Note: `base_directory` is whatever directory was specified in the configuration when running the extractor utility.

Presumably the extractor utility will be run on a periodic basis, for instance weekly or daily. In that case each directory will have multiple files. Since the filenames have the date and time embedded in them, sorting on the file names will make it easy to pick up only the most recent file, regardless of whether or not some other process has modified the file and thus altered the operating system date on the file.
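To illustrate why sorting on the name is more robust than relying on the operating system timestamp, here is a minimal sketch. The file name pattern used below (`YYYY-MM-DD-HH-MM-SS.csv`) is an assumption made for illustration; the extractors may use a different pattern, but any zero-padded timestamp with the most significant part first sorts the same way.

```python
import os
import tempfile
import time

# ASSUMPTION: file names are zero-padded timestamps such as
# 2020-09-18-12-34-50.csv. The exact pattern used by the extractors may differ.
names = [
    "2020-08-20-12-34-50.csv",
    "2020-09-18-12-34-50.csv",
    "2020-09-04-00-01-02.csv",
]

with tempfile.TemporaryDirectory() as directory:
    for name in names:
        with open(os.path.join(directory, name), "w") as f:
            f.write("placeholder\n")

    # Simulate another process touching the oldest file after the fact.
    oldest = os.path.join(directory, "2020-08-20-12-34-50.csv")
    one_hour_ahead = time.time() + 3600
    os.utime(oldest, (one_hour_ahead, one_hour_ahead))

    entries = list(os.scandir(directory))
    newest_by_name = max(entries, key=lambda e: e.name).name
    newest_by_mtime = max(entries, key=lambda e: e.stat().st_mtime).name

print(newest_by_name)   # the 2020-09-18 file, as intended
print(newest_by_mtime)  # the 2020-08-20 file, misled by the altered timestamp
```

Sorting by name keeps selecting the intended snapshot even though the modification time of an older file has changed.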

## Helper Functions

### Accessing Files

Below you will find a set of functions to help you navigate this filesystem.

LANGUAGE USE NOTE: in these examples we use Python's optional [type hint system](https://docs.python.org/3.8/library/typing.html) to help us all recognize what data types are being passed into functions and returned by them.

In [7]:
import os

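def _get_newest_file(directory: str) -> str:
    # File names embed the date and time, so sorting the names in
    # descending order puts the most recent file first.
    files = [(f.path, f.name) for f in os.scandir(directory) if f.name.endswith(".csv")]
    if not files:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    files = sorted(files, key=lambda x: x[1], reverse=True)

    return files[0][0]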

# Note: section_id may be a str or an int; it is only interpolated into the path.
def _get_file_for_section(base_directory: str, section_id, file_type: str) -> str:
    return _get_newest_file(os.path.join(base_directory, f"section={section_id}", file_type))

def get_users_file(base_directory: str) -> str:
    return _get_newest_file(os.path.join(base_directory, "users"))

def get_sections_file(base_directory: str) -> str:
    return _get_newest_file(os.path.join(base_directory, "sections"))

def get_section_associations_file(base_directory: str, section_id) -> str:
    return _get_file_for_section(base_directory, section_id, "section-associations")

def get_activities_file(base_directory: str, section_id) -> str:
    return _get_file_for_section(base_directory, section_id, "activities")

def get_assignments_file(base_directory: str, section_id) -> str:
    return _get_file_for_section(base_directory, section_id, "assignments")

def get_grades_file(base_directory: str, section_id) -> str:
    return _get_file_for_section(base_directory, section_id, "grades")
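
The helpers above do not cover submissions, which are stored per assignment rather than per section. A possible helper is sketched below. The directory names (`assignment=<id>` nested under `section=<id>`, with a `submissions` folder inside) are assumptions modeled on the section-level helpers; check them against the file layout diagram before relying on this. The small `_newest_csv` function repeats the logic of `_get_newest_file` only so that the sketch runs on its own.

```python
import os

def _newest_csv(directory: str) -> str:
    # Same idea as _get_newest_file: timestamped names sort chronologically.
    names = sorted(n for n in os.listdir(directory) if n.endswith(".csv"))
    if not names:
        raise FileNotFoundError(f"No CSV files found in {directory}")
    return os.path.join(directory, names[-1])

def get_submissions_file(base_directory: str, section_id, assignment_id) -> str:
    # ASSUMPTION: submissions live under
    #   <base_directory>/section=<id>/assignment=<id>/submissions
    # Adjust the path parts if the real layout differs.
    directory = os.path.join(
        base_directory,
        f"section={section_id}",
        f"assignment={assignment_id}",
        "submissions",
    )
    return _newest_csv(directory)
```

With the identifiers used in the layout example, a call would look like `get_submissions_file(base_directory, 12345, 67890)`.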

### Loading Files into DataFrames

Next, let's create a few functions that leverage the filesystem helpers to read files into Pandas DataFrames.

In [23]:
import pandas as pd  # Aliasing as `pd` is a common practice in the industry

# Note: parse_dates=True would only try to parse the index, so the date
# columns are named explicitly to have them loaded as datetimes.
def get_all_users(base_directory: str) -> pd.DataFrame:
    file = get_users_file(base_directory)

    return pd.read_csv(file, engine="c", parse_dates=["CreateDate", "LastModifiedDate"])

def get_all_sections(base_directory: str) -> pd.DataFrame:
    file = get_sections_file(base_directory)

    return pd.read_csv(file, engine="c", parse_dates=["CreateDate", "LastModifiedDate"])

#### Function Validation

The next code cells validate the output from these functions and show you the available data.

In [30]:
from IPython.display import display, Markdown

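# `sample_dir` is not defined in this section; it must already hold the path
# to the extractor's `base_directory` before this cell is run.
users_df = get_all_users(sample_dir)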
display(Markdown("### Users"))
display(users_df)

sections_df = get_all_sections(sample_dir)
display(Markdown("### Sections"))
display(sections_df)

### Users

| Unnamed: 0 | SourceSystemIdentifier | SourceSystem | UserRole | LocalUserIdentifier | SISUserIdentifier | Name | EmailAddress | EntityStatus | CreateDate | LastModifiedDate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 100032890 | Schoology | student | mary.archer | 604863 | Mary Archer | Mary.Archer@studentgps.org | Archived | 2020-08-20 12:34:50 | 2020-09-18 12:34:50 |
| 1 | 100032891 | Schoology | student | kyle.hughes | 604874 | Kyle Hughes | Kyle.Hughes@studentgps.org | Archived | 2020-08-20 12:34:50 | 2020-09-18 12:34:50 |
| 2 | 100032892 | Schoology | student | peter.nash | 604918 | Peter Ivan Nash | Peter.Nash@studentgps.org | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 3 | 100032893 | Schoology | student | larry.mahoney | 604927 | Larry Mahoney | Larry.Mahoney@studentgps.org | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 4 | 100032894 | Schoology | student | roland.phillips | 604938 | Roland Phillips | Roland.Phillips@studentgps.org | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 5 | 100032895 | Schoology | student | stephen.caldwell | 604969 | Stephen Caldwell | Stephen.Caldwell@studentgps.org | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 6 | 100032896 | Schoology | student | olivia.hardy | 604974 | Olivia Doris Hardy | Olivia.Hardy@studentgps.org | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 7 | 100032897 | Schoology | student | micheal.turner | 605015 | Micheal Turner | Micheal.Turner@studentgps.org | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 8 | 100032898 | Schoology | teacher | kelley.christian | 207270 | Kelley Heidi Christian | Kelley.Christian@studentgps.org | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 9 | 100032899 | Schoology | teacher | sara.preston | 207268 | Sara Stacy Preston | Sara.Preston@studentgps.org | Archived | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |


### Sections

| Unnamed: 0 | SourceSystemIdentifier | SourceSystem | SISSectionIdentifier | Title | SectionDescription | Term | LMSSectionStatus | EntityStatus | CreateDate | LastModifiedDate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 123456780 | Google Classroom | 25590100102Trad220ALG112011 | ALG-1 | Algebra I | 255901001_2020_2019-2020_Fall,255901001_2020_2... | Archived | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 1 | 123456789 | Google Classroom | 25590100101Trad120ENG112011 | ENG-1 | English/Language Arts I (9th grade) | 255901001_2021_2020-2021_Fall,255901001_2021_2... | Active | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 2 | 123456790 | Google Classroom | 25590100102Trad220ALG112011 | ALG-1 | Algebra I | 255901001_2021_2020-2021_Fall,255901001_2021_2... | Active | Active | 2020-08-20 12:34:50 | 2020-08-20 12:34:50 |
| 3 | 123456791 | Google Classroom | | ENG-STAFF-1 | English language arts staff meeting | | Unpublished | Active | 2020-09-03 01:02:03 | 2020-09-04 00:01:02 |
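
As a quick sanity check on the users data, one option is to count users by role and status. The sketch below rebuilds only the `UserRole` and `EntityStatus` columns from the ten sample users shown above so that it runs on its own; in the notebook you could pass `users_df` to the function instead. The function name `count_by_role_and_status` is made up for this example.

```python
import pandas as pd

# UserRole and EntityStatus values copied from the ten sample users above.
sample_users = pd.DataFrame(
    {
        "UserRole": ["student"] * 8 + ["teacher"] * 2,
        "EntityStatus": ["Archived", "Archived"] + ["Active"] * 6 + ["Active", "Archived"],
    }
)

def count_by_role_and_status(users: pd.DataFrame) -> pd.DataFrame:
    # pd.crosstab counts the rows for each (UserRole, EntityStatus) pair.
    return pd.crosstab(users["UserRole"], users["EntityStatus"])

counts = count_by_role_and_status(sample_users)
print(counts)
```

A table like this makes it easy to spot unexpected values, such as a role or status that should not appear in the export.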


PENDING: what comes next? Do something that demonstrates merging all assignments into a single DataFrame. Maybe do a lightweight analysis.
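
One possible way to approach the merge is to read the newest assignments file for every section and concatenate the results. The following is a minimal sketch, not a finished implementation: the function name `get_all_assignments` and the added `SectionIdFromPath` column are hypothetical, and the sketch assumes only the `section=<id>/assignments` layout used by the helper functions earlier in this notebook. It repeats the newest-file logic so that it runs on its own.

```python
import os
from typing import Iterable

import pandas as pd

def get_all_assignments(base_directory: str, section_ids: Iterable) -> pd.DataFrame:
    frames = []
    for section_id in section_ids:
        directory = os.path.join(base_directory, f"section={section_id}", "assignments")
        if not os.path.isdir(directory):
            continue  # no assignments were exported for this section

        # Timestamped file names sort chronologically, so the last one is newest.
        names = sorted(n for n in os.listdir(directory) if n.endswith(".csv"))
        if not names:
            continue

        df = pd.read_csv(os.path.join(directory, names[-1]))
        # HYPOTHETICAL column: records which section directory each row came from.
        df["SectionIdFromPath"] = section_id
        frames.append(df)

    if not frames:
        return pd.DataFrame()

    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

In the notebook, this could be called as `get_all_assignments(sample_dir, sections_df["SourceSystemIdentifier"])`, after which the usual Pandas grouping and filtering tools apply to the combined DataFrame.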