# Manitoba

In this notebook, high school enrollment for the Manitoba area is scraped. Accessing the data is the most straightforward of all the reciprocity areas. However, the data comes in different formats from year to year and needs to be cleaned dynamically.

First the appropriate packages are loaded.

In [1]:
import datetime
import time
import wget
import numpy as np
import pandas as pd

Next, the year is computed relative to the current date and used to complete the URL, so the notebook downloads the most recent enrollment file each year without being edited.

In [2]:
Year = str(datetime.datetime.now().year - 1)
File = 'http://www.edu.gov.mb.ca/k12/finance/sch_enrol/enrolment_' + Year + '.xlsx'
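The URL pattern above can be wrapped in a small helper so that both years are built the same way. This is only a sketch: the name `enrollment_url` and its `years_back` and `today` parameters are hypothetical, and it assumes the URL pattern stays the same from year to year.

```python
import datetime

BASE_URL = 'http://www.edu.gov.mb.ca/k12/finance/sch_enrol/enrolment_'

def enrollment_url(years_back, today=None):
    """Build the enrollment file URL for the year `years_back` years ago.

    Hypothetical helper; `today` can be fixed to make results reproducible.
    """
    today = today or datetime.date.today()
    return BASE_URL + str(today.year - years_back) + '.xlsx'

# A fixed date is passed here so the example does not depend on when it runs
print(enrollment_url(1, today=datetime.date(2018, 6, 1)))
# http://www.edu.gov.mb.ca/k12/finance/sch_enrol/enrolment_2017.xlsx
```

Calling `enrollment_url(1)` and `enrollment_url(2)` without `today` would give the two URLs used in this notebook.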

wget downloads the file and returns the local filename, so the object that stores the result of the download can also be used to read the file into memory.

In [3]:
ThisYearDownload = wget.download(File)

100% [............................................................................] 516251 / 516251
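When the notebook is re-run, it may be preferable not to fetch the same file again. The sketch below shows one way to do that, using the standard library's `urllib.request.urlretrieve` in place of wget. The function name `download_once` and the destination filename are assumptions made for illustration.

```python
import os
import urllib.request

def download_once(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper that swaps wget for the standard library.
    """
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

# Possible usage in this notebook (not executed here):
# ThisYearDownload = download_once(File, 'enrolment_' + Year + '.xlsx')
```

Like `wget.download`, the helper returns the local path, so the result can be passed straight to `pd.read_excel`.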

Above, the most recent file is downloaded (referred to as "this year" in the code); below, the file from the year before it is downloaded. The change in enrollment will be an important metric to forecast enrollment from the area for the upcoming admissions cycle.

In [4]:
LsYear = str(datetime.datetime.now().year - 2)
File = 'http://www.edu.gov.mb.ca/k12/finance/sch_enrol/enrolment_' + LsYear + '.xlsx'
LastYearDownload = wget.download(File)

100% [............................................................................] 516931 / 516931

Below we clean the data file from this year. When the code is moved into production, the following can be wrapped in a function so that it does not need to be repeated to clean last year's data file. There are multiple tables in each sheet of the Excel file, and the first few sheets describe the data. Below we dynamically extract the tables and combine them.

In [58]:
# Clean This Year Enrollment File
appended_data = []
for i in range(7, 39):
    Enrollment = pd.read_excel(ThisYearDownload, 
                               sheet_name = str(i),  # 'sheetname' in older pandas
                               header = None, skiprows = 4)
    Enrollment = Enrollment.dropna(axis = 'columns',how='all')
    # Rename columns as numeric order 
    Enrollment = Enrollment.rename(columns=
                                   {x:y for x,y 
                                    in zip(Enrollment.columns,
                                           range(0,len(Enrollment.columns)))})
    # Remove missing schools and aggregates
    Enrollment = Enrollment[(Enrollment[0].notnull()) & 
                            (Enrollment[0].str.contains('TOTAL') == False ) & 
                            (Enrollment[0].str.contains('SCHOOL NAME') == False) & 
                            (Enrollment[0].str.contains('DIVISION') == False )]
    
    # Subset to only juniors and seniors
    Enrollment = Enrollment[(Enrollment[15] > 0) |
                            (Enrollment[16] > 0)]
    # Subset to only appropriate columns
    Enrollment = Enrollment[[0, 1, 15, 16]]
    # Combine data
    appended_data.append(Enrollment)
# Convert to Pandas data frame
Enrollment = pd.concat(appended_data, axis=0)

After the data has been cleaned, the school and community names are stripped of surrounding whitespace and the columns are renamed appropriately.

In [59]:
Enrollment[0] = Enrollment[0].str.strip()
Enrollment[1] = Enrollment[1].str.strip()
EnrollmentThisYear = Enrollment.rename(columns = {0:'SCHOOL NAME', 
                             1:'COMMUNITY', 
                             15:'Juniors This Year', 
                             16:'Seniors This Year'})
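As noted before the cleaning cell, these steps could be wrapped in a function for production so they are written only once. The following is a minimal sketch of that idea, not the notebook's actual production code. The names `clean_sheet` and `clean_workbook` and their parameters are hypothetical, and `'DIVISION'` is assumed to be the intended spelling of the row marker.

```python
import pandas as pd

def clean_sheet(raw, junior_label, senior_label):
    """Clean one raw sheet that was read with header=None (hypothetical helper)."""
    # Remove empty columns, then renumber the remaining ones 0..n-1
    df = raw.dropna(axis='columns', how='all')
    df.columns = range(df.shape[1])
    # Remove missing schools, header rows, and aggregates
    name = df[0]
    keep = name.notnull()
    for marker in ('TOTAL', 'SCHOOL NAME', 'DIVISION'):
        # na=True marks non-string cells as matches so they are dropped too
        keep = keep & ~name.str.contains(marker, na=True, regex=False)
    df = df[keep]
    # Keep schools with juniors or seniors, and only the needed columns
    df = df[(df[15] > 0) | (df[16] > 0)]
    df = df[[0, 1, 15, 16]].copy()
    df[0] = df[0].str.strip()
    df[1] = df[1].str.strip()
    return df.rename(columns={0: 'SCHOOL NAME', 1: 'COMMUNITY',
                              15: junior_label, 16: senior_label})

def clean_workbook(path, junior_label, senior_label, sheets=range(7, 39)):
    """Apply clean_sheet to each numbered sheet and stack the results."""
    frames = []
    for i in sheets:
        raw = pd.read_excel(path, sheet_name=str(i), header=None, skiprows=4)
        frames.append(clean_sheet(raw, junior_label, senior_label))
    return pd.concat(frames, axis=0)
```

With these helpers, each year would reduce to a single call such as `clean_workbook(ThisYearDownload, 'Juniors This Year', 'Seniors This Year')`.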

Below the data file for last year is cleaned.

In [61]:
# Clean Last Year Enrollment File
appended_data = []
for i in range(7, 39):
    Enrollment = pd.read_excel(LastYearDownload, 
                               sheet_name = str(i),  # 'sheetname' in older pandas
                               header = None, 
                               skiprows = 4)
    # Remove empty columns
    Enrollment = Enrollment.dropna(axis = 'columns',how='all')
    # Rename columns as numeric order 
    Enrollment = Enrollment.rename(columns=
                                   {x:y for x,y 
                                    in zip(Enrollment.columns,
                                           range(0,len(Enrollment.columns)))})
    Enrollment = Enrollment[(Enrollment[0].notnull()) & 
                            (Enrollment[0].str.contains('TOTAL') == False) & 
                            (Enrollment[0].str.contains('SCHOOL NAME') == False ) & 
                            (Enrollment[0].str.contains('DIVISION') == False )]
    Enrollment = Enrollment[(Enrollment[15] > 0) |
                            (Enrollment[16] > 0)]
    Enrollment = Enrollment[[0, 1, 15, 16]]
    appended_data.append(Enrollment)
Enrollment = pd.concat(appended_data, axis=0)

The code below extracts the school districts. It is not needed for the current admissions data model, but it is kept here in case the districts are added to the model at a later date.

In [62]:
SchoolDistrict = pd.read_excel(LastYearDownload, sheet_name = '7', skiprows = 2)
# Remove empty columns and rows
SchoolDistrict = SchoolDistrict.dropna(axis = 'columns',how='all')
SchoolDistrict = SchoolDistrict.dropna(axis = 'rows',how='all')

Below, the cleaned enrollment file for last year is reformatted in the same way as this year's.

In [64]:
Enrollment[0] = Enrollment[0].str.strip()
Enrollment[1] = Enrollment[1].str.strip()
EnrollmentLastYear = Enrollment.rename(columns = {0:'SCHOOL NAME', 
                             1:'COMMUNITY', 
                             15:'Juniors Last Year', 
                             16:'Seniors Last Year'})

To get an idea of the layout, the first few rows are displayed below.

In [65]:
EnrollmentLastYear.head()

| | SCHOOL NAME | COMMUNITY | Juniors Last Year | Seniors Last Year |
|---|---|---|---|---|
| 1 | Acadia Colony School | Carberry ¹ | 2.0 | 3.0 |
| 3 | Carberry Collegiate | Carberry | 45.0 | 34.0 |
| 7 | Neepawa Area Collegiate | Neepawa | 74.0 | 92.0 |
| 9 | Riverbend Colony School | Carberry ¹ | 0.0 | 3.0 |
| 10 | Riverside Colony School | Neepawa ¹ | 1.0 | 0.0 |


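Next, this year's file and last year's file are joined on school name and community. Because the join is an inner join, only schools present in both years are kept.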

In [67]:
Manitoba = pd.merge(EnrollmentThisYear, EnrollmentLastYear, 
                    how = 'inner', on = ['SCHOOL NAME', 'COMMUNITY']).drop_duplicates()
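An inner join silently drops any school that appears in only one of the two years, for example after a rename or a closure. To see which schools would be lost, one option is an outer join with `indicator=True`. The sketch below uses made-up toy frames, not the Manitoba data, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only
this_year = pd.DataFrame({'SCHOOL NAME': ['A', 'B'],
                          'COMMUNITY': ['X', 'Y'],
                          'Juniors This Year': [10, 20]})
last_year = pd.DataFrame({'SCHOOL NAME': ['B', 'C'],
                          'COMMUNITY': ['Y', 'Z'],
                          'Juniors Last Year': [18, 30]})

# The '_merge' column records whether each row came from one side or both
check = pd.merge(this_year, last_year, how='outer',
                 on=['SCHOOL NAME', 'COMMUNITY'], indicator=True)

# Rows that an inner join would have dropped
unmatched = check[check['_merge'] != 'both']
print(unmatched[['SCHOOL NAME', 'COMMUNITY', '_merge']])
```

Here schools `A` and `C` are reported as unmatched, while `B` would survive the inner join.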

Here the enrollment change for the junior and senior classes is calculated as the ratio of this year's enrollment to last year's, so a value above 1 indicates growth.

In [70]:
Manitoba['JuniorDelta'] = Manitoba['Juniors This Year']/Manitoba['Juniors Last Year']
Manitoba['SeniorDelta'] = Manitoba['Seniors This Year']/Manitoba['Seniors Last Year']
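Because the change is computed by division, a school with zero students in a class last year produces `inf` (or `NaN` when both years are zero) rather than an error. Riverbend Colony School, with 0.0 juniors last year, is such a case. The sketch below shows one possible way to handle this by treating those ratios as missing. The toy values and the choice to use `NaN` are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
demo = pd.DataFrame({'Juniors This Year': [50.0, 3.0, 0.0],
                     'Juniors Last Year': [45.0, 0.0, 0.0]})

ratio = demo['Juniors This Year'] / demo['Juniors Last Year']
# 3.0 / 0.0 gives inf and 0.0 / 0.0 gives NaN; mark both as missing
demo['JuniorDelta'] = ratio.replace([np.inf, -np.inf], np.nan)
print(demo)
```

Whether missing values, a cap, or some other rule is appropriate depends on how the forecast uses these ratios.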

Finally, the cleaned data set is written to a .csv file.

In [71]:
Manitoba.to_csv('Manitoba.csv', index = False)