In [11]:
import pandas as pd
import numpy as np
import random
import os
from collections import Counter

# implementing random seed for control (both the stdlib and NumPy generators)
sd = 42
random.seed(sd)
np.random.seed(sd)

# school names for consistency 
schools = ['Oakland Middle School',
    'Siegel Middle School',
    'Whitworth-Buchanan Middle School',
    'Christiana Middle School',
    'Smyrna Middle School',
    'Stewarts Creek Middle School',
    'Rockvale Middle School',
    'Rocky Fork Middle School',
    'Blackman Middle School',
    'Thurman Francis Arts Academy',
    'Rock Springs Middle School',
    'LaVergne Middle School'
]

# get event year
try:
    event_year = int(input('Enter 4-digit year of YouScience Event\n'))
except ValueError:
    event_year = 2022
print(f"Running for YS Career Fair {event_year}")

# preparing directories
log_pathway = f'../Logs/{event_year}-TEST'
os.makedirs(log_pathway, exist_ok=True)
for school in schools:
    path = f'../YouScienceData/Schedules/{event_year}-TEST/{school}'
    pathway_path = f'{path}/Pathway_Rosters'
    # exist_ok=True makes reruns safe; creating pathway_path also creates path
    os.makedirs(path, exist_ok=True)
    os.makedirs(pathway_path, exist_ok=True)

Running for YS Career Fair 2022


# FUNCTIONS DOCUMENTATION

IN_PROCESS: Updated from the Alpha series. The file system is now organized by {event_year}.
<table>
    <tr>
        <th>Term</th>
        <th>Definition/Explanation</th>
    </tr>
    <tr>
        <td>POS</td>
        <td>"Pathways of Study". Without other context, it is used synonymously with pathway(s). These are essentially the potential course mappings that students would follow for their high school credits.</td>
    </tr>
    <tr>
        <td>Cluster</td>
        <td>From YouScience, focus areas matched to students based on their perceived aptitude and interests. Prioritized by RuCo ranking system (see beta_get_criteria.ipynb).</td>
    </tr>
</table>

### Format: <strong>Function_Name</strong> (<i>parameters</i>), description, <i>-f- required files</i>, <i>-o- required objects</i>.</br>

Clarify: -f- required file, -F- required file folder, -fnc- required function, -o- required object

<ul>
    <li><strong>get_POS_from_clusters</strong>(<i>school</i>), returns school-specific df of rows that contain student info and ranked POS matches or 0's. 0's indicate either school-unsupported YS clusters or students with no YS results. </br></br><i>-f- direct_join_prepared.xlsx, -F- YouScienceData/YS_Criteria_by_School</i></li></br>
    <li><strong>get_POS_demand</strong>(<i>school_pos_df</i>), tabulates and returns the demand Counter object (dict-like) of each POS offered at a given school. The flag for demand met is deprecated here.</br></br><i>-o- school_pos_df</i></li></br>
    <li><strong>uncouple_POS_matches</strong>(<i>demand_counter_object</i>), called by initialize_block_rosters(). Returns a descending-ordered list of individual school-specific pathways by breaking up the grouped pathways from direct_join_prepared.xlsx.</br></br><i>-o- demand Counter object</i></li></br>
    <li><strong>extract_capacity_vector</strong>(<i>school</i>), reads in capacity_vectors.csv and extracts the school's capacity vector object as a dict.</br></br><i>-f- YouScienceData/Reports/{event_year}/capacity_vectors.csv</i></li></br>
    <li><strong>initialize_block_rosters</strong>(<i>school, demand_object</i>), where the demand object should be the school's Counter object returned by get_POS_demand(). Returns a structured dictionary for the school's pathway rosters and a capacity vector dictionary with the assigned POS per room.</br></br><i>-fnc- uncouple_POS_matches(), -fnc- extract_capacity_vector(), -o- demand Counter object</i></li></br>
</ul>
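
As a minimal sketch of the tallying that get_POS_demand() describes, the toy example below counts each POS once per student and keeps 0 as the bucket for unmatched choices. The student names and the 'Coding' pathway are made up for illustration, and plain lists stand in for the DataFrame rows.

```python
from collections import Counter

# Hypothetical students: each list holds the six ranked POS matches,
# with 0 marking a choice that did not translate to a school-offered POS.
ranked_matches = {
    'student_a': ['BioSTEM', 'Coding', 'BioSTEM', 0, 0, 0],
    'student_b': ['Coding', 0, 0, 0, 0, 0],
    'student_c': [0, 0, 0, 0, 0, 0],
}

demand = Counter()
for matches in ranked_matches.values():
    # a set removes repeats, so each student counts once per POS (and once for 0)
    demand.update(set(matches))

print(demand.most_common())
```

Here student_a matches 'BioSTEM' twice but only adds one to its demand, and the 0 key ends up counting the students who have at least one unmatched choice.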

In [71]:
# functions
def get_POS_from_clusters(school):
    # read in the school's YouScience ranked criteria file
    ys_path = f'../YouScienceData/YS_Criteria_by_School/{event_year}/{school} YSCriteria.csv'
    ys_match_df = pd.read_csv(ys_path)
    # read in the appropriate translation columns from direct_join_prepared.xlsx
    dj_path = '../direct_join_prepared.xlsx'
    djp_df = pd.read_excel(dj_path)
    djp_df = djp_df[['YouScience Clusters', school]]

    # create replacement dictionary
    # key/value = YS cluster/school's corresponding POS (0 when the cell is empty)
    to_replace = {}
    for cluster, pos in zip(djp_df['YouScience Clusters'], djp_df[school]):
        to_replace[cluster] = 0 if pd.isna(pos) else pos

    # at some point '0' is introduced somewhere. I suspect from the added positions in updated rosters
    to_replace['0'] = 0

    return ys_match_df.replace(to_replace=to_replace).drop('Unnamed: 0', axis=1)

def get_POS_demand(school_pos_df):
    # previously, returned the number of lg rooms needed to calculate capacity vectors later
    # capacity vectors are now generated by beta_capacity_report.py
    ranks = ['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth']
    C = Counter()
    for _, student in school_pos_df[ranks].iterrows():
        # a set removes the redundancy of double matches,
        # so each student counts once per POS
        C.update(set(student))

    return C

def uncouple_POS_matches(demand_counter_object):
    # Non-Empty matches only
    NE_pos_list = []
    for pos_group, count in demand_counter_object.most_common():
        # skipping 0, which tallies ranked choices that didn't translate to a school-offered pathway
        if pos_group == 0:
            continue

        # Break up the coupled POS groups into separate, equally-demanded pathways.
        # An uncoupled POS splits into a one-item list, so it passes in as is.
        for pos in pos_group.split(', '):
            NE_pos_list.append([pos, count])

    return NE_pos_list

def extract_capacity_vector(school):
    # read in capacity vector object
    df = pd.read_csv(f'../YouScienceData/Reports/{event_year}/capacity_vectors.csv')
    df.drop("Unnamed: 0", axis=1, inplace=True)

    # initial object is one long str
    vector_str = df.loc[df.School == school, "Capacity_Vector"].values[0]
    if vector_str == '{}':
        print(f'{school} missing data.')
        # NOTE: returns 0 rather than a dict, so callers must check before use
        return 0
    # strip the outer braces, then split into a list of 'room: cap' strings
    vector_list = vector_str.strip('{}').split(', ')

    # transforming into actionable dictionary vector
    vector_dict = {}
    for item in vector_list:
        k, v = item.split(': ')
        try:
            # numeric room numbers become int keys
            vector_dict[int(k)] = int(v)
        except ValueError:
            # named rooms: shave off the surrounding quotations
            vector_dict[k[1:-1]] = int(v)

    # order rooms by capacity, largest first
    return dict(sorted(vector_dict.items(), key=lambda item: item[1], reverse=True))

def initialize_block_rosters(school, demand_object):
    # initialize dictionary for POS rosters per school
    pos_rosters = {}

    # extract capacity vector dictionary for school
    cap_vector_dict = extract_capacity_vector(school)
    if not cap_vector_dict:
        # extract_capacity_vector() signals missing data with 0
        raise ValueError(f'No capacity vector available for {school}')
    # list of rooms, in DESC order of capacity
    rooms = list(cap_vector_dict)
    assigned_cap_vector_dict = {}
    # assumes the school provides at least as many rooms as it has pathways
    for i, (pos, _) in enumerate(uncouple_POS_matches(demand_object)):
        # each couplet is a [POS, demand value] pair, in DESC order of demand
        pos_rosters[pos] = {
            'All': [],
            'B1': [],
            'B2': [],
            'B3': [],
            'B4': [],
        }

        # restructuring cap vector {room: int(cap)} -> {room: [int(cap), str(POS assignment)]}
        assigned_cap_vector_dict[rooms[i]] = [cap_vector_dict[rooms[i]], pos]

    return pos_rosters, assigned_cap_vector_dict

In [38]:
school_pos_df = get_POS_from_clusters(school=schools[0])


In [None]:
demand = {}
demand[schools[0]] = get_POS_demand(school_pos_df)
uncouple_POS_matches(demand['Oakland Middle School'])

In [18]:
vector_dict = extract_capacity_vector(school=schools[0])

In [72]:
all_rosters, all_capacity_vectors = {}, {}
all_rosters[schools[0]], all_capacity_vectors[schools[0]] = initialize_block_rosters(schools[0], demand['Oakland Middle School'])

In [None]:
all_rosters['Oakland Middle School']

# Thoughts and Development

## get_POS_demand()
<ul>
    <li><strong>get_POS_demand()</strong>: previously it returned the number of lg rooms for the purpose of creating capacity vectors</li></br>
    <li>Currently, I'm thinking of reading in the capacity report outside of the iterative loop, then extracting the relevant capacity vector per school.</li></br>
    <li>The question is whether identifying if demand per POS can be met adds value for now. I think it does, and the move is to read in the appropriate vector here and compare.</li>
</ul>

### Decision

It is inappropriate to do this here, before the bundled POS are uncoupled; any assignment or check at this stage is premature and misleading. For example, C.most_common()[1] for schools[0] is a 3-bundled POS. Comparing len(rooms) to len(C.most_common()) gives 19 to 13, but there are actually 19 POS pathways once the bundles are split. That is not 6 extra rooms; it is the minimum viable number.</br></br> <strong><u>!!!FLAG!!!</u></strong></br>
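
To make the miscount concrete, here is a small self-contained sketch with made-up demand numbers. The bundled key and the pathway names other than 'BioSTEM' are hypothetical, and `uncouple` is a simplified stand-in for uncouple_POS_matches().

```python
from collections import Counter

# Hypothetical demand: the second key bundles three pathways in one string.
demand = Counter({
    'BioSTEM': 40,
    'Coding, Web Design, Cybersecurity': 35,
    0: 12,
})

def uncouple(demand_counter):
    """Split bundled keys into [pathway, demand] pairs, skipping the 0 key."""
    pairs = []
    for pos_group, count in demand_counter.most_common():
        if pos_group == 0:
            continue
        for pos in pos_group.split(', '):
            pairs.append([pos, count])
    return pairs

# counting keys before uncoupling undercounts the pathways that need rooms
bundled_count = len([key for key in demand if key != 0])
uncoupled = uncouple(demand)
print(bundled_count, len(uncoupled))
```

Two non-zero keys turn into four pathways after uncoupling, so any room comparison made against the bundled Counter would overstate the spare rooms.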

### Future Development

Possibility: add a clause that lets the algorithm be greedy when the capacity report shows that the school has provided extra rooms.
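
One possible reading of "greedy" is sketched below; it is an assumption, not a settled design. Every pathway first gets one room, then each extra room goes to the pathway with the largest gap between demand and seats assigned so far. The function name `assign_rooms_greedily`, the room names, and the demand numbers are all hypothetical.

```python
def assign_rooms_greedily(rooms, pathways):
    """rooms: {room: capacity}; pathways: [[pos, demand], ...] in DESC demand order.

    Returns {room: [capacity, pos]}.
    """
    if not pathways:
        return {}
    # largest rooms first
    ordered_rooms = sorted(rooms, key=rooms.get, reverse=True)
    if len(ordered_rooms) < len(pathways):
        raise ValueError('fewer rooms than pathways')

    demand = {pos: count for pos, count in pathways}
    seats = {pos: 0 for pos in demand}
    assigned = {}

    # first pass: one room per pathway, biggest room to biggest demand
    for room, (pos, _) in zip(ordered_rooms, pathways):
        assigned[room] = [rooms[room], pos]
        seats[pos] += rooms[room]

    # greedy pass: each extra room goes to the pathway with the most unmet demand
    for room in ordered_rooms[len(pathways):]:
        neediest = max(demand, key=lambda pos: demand[pos] - seats[pos])
        assigned[room] = [rooms[room], neediest]
        seats[neediest] += rooms[room]

    return assigned


rooms = {'Gym': 60, 'Library': 40, 101: 35, 102: 30}
pathways = [['BioSTEM', 120], ['Coding', 50]]
result = assign_rooms_greedily(rooms, pathways)
print(result)
```

This sketch ignores the four-block structure (B1 to B4), so "unmet demand" here is only a rough proxy for what the scheduler would actually need.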

## Capacity Vector structure
<ul>
    <li>Previously, the capacity vector was a simple array of the length of the school's provided rooms with entries of 50/35 per lg/sm room distinction.</li></br>
    <li>In an attempt both to handle the much broader range of room capacities and to make assigning rooms easier in the scheduling process, a school's capacity vector, as built by beta_capacity_report.py, is now a dictionary with room names/numbers as keys and the corresponding capacities as values.</li></br>
    <li><strong>IDEA:</strong> add the assigned pathway to the respective key/value pair. For ex: {'Library':40} -> {'Library':[40, 'BioSTEM']}</li></br>
</ul>
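
The structure above can be sketched end to end as follows. This assumes the vector is stored as the text of a Python dict literal, in which case the standard library's ast.literal_eval can parse it; the room names, capacities, and pathways other than 'Library', 40, and 'BioSTEM' are invented for illustration.

```python
import ast

# Assumed stored form of one school's vector (a dict literal saved as text).
vector_str = "{'Library': 40, 101: 35, 'Gym': 60}"

# literal_eval only parses Python literals; it does not execute code
capacity_vector = ast.literal_eval(vector_str)

# order rooms by capacity, largest first
rooms_desc = sorted(capacity_vector, key=capacity_vector.get, reverse=True)

# hypothetical pathways, already in descending order of demand
pathways = ['BioSTEM', 'Coding', 'Welding']

# {room: cap} -> {room: [cap, assigned POS]}
assigned = {
    room: [capacity_vector[room], pos]
    for room, pos in zip(rooms_desc, pathways)
}
print(assigned)
```

Pairing with zip stops at the shorter of the two lists, so a school with fewer rooms than pathways would leave some pathways without a room rather than raising an error.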

### Decision - Delayed

CONCERN: we do not currently have enough information for this year's event. Going off of last year's data, we can decide which room assignments could be changed, but this year's data does not include pathway-to-room assignments.
As is, the assigned capacity vector(s) can be exported as a DF, ready to be read in by the schedule-building script, and included in the schools' folders for organizational purposes.
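
A minimal sketch of that export step, assuming the {room: [capacity, POS]} layout described above. The column names 'Room', 'Capacity', and 'POS' are placeholders, and the standard csv module is used here so the example stays self-contained; the same list of records could be passed to pd.DataFrame() instead.

```python
import csv
import io

# Hypothetical assigned capacity vector for one school.
assigned_vector = {'Gym': [60, 'BioSTEM'], 'Library': [40, 'Coding']}

# one flat record per room, which is the shape a DataFrame or CSV expects
records = [
    {'Room': room, 'Capacity': cap, 'POS': pos}
    for room, (cap, pos) in assigned_vector.items()
]

# write to an in-memory buffer; a real run would open a file in the school's folder
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Room', 'Capacity', 'POS'])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```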