# Weighted Projections

### What is Weighted Projections?
Weighted Projections is a method for predicting how many students will enroll in a course from how many have enrolled in it in the past. It assumes that future cohorts will enroll in each course in roughly the same proportions as past cohorts did.

### How are the weights calculated?
Each weight is the proportion of enrollment in a small subset of the data relative to total new enrollment. The subsets are defined by Student Admit Term, Course Registration Term, Program, Course, Course Section Modality, and Student Visa Status. The weights are calculated as follows:

$$\mathscr{Weight}_{\tiny{eat,rt,prog,crs,mod,visa}} = \frac{\mathscr{CrsDemandHist}_{\tiny{at,rt,prog,crs,mod,visa}}}{\mathscr{NewAdmissions}_{\tiny{at, prog}}}$$

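Finally, each prediction is the weight multiplied by the expected number of new admissions. Here $at$ is the historical admit term the demand history is taken from, and $eat$ is the expected (future) admit term being projected. The predictions are calculated as follows: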

$$\mathscr{Prediction}_{\tiny{eat,rt,prog,crs,mod,visa}} = \mathscr{Weight}_{\tiny{eat,rt,prog,crs,mod,visa}} \times \mathscr{ExpectedNewAdmissions}_{\tiny{eat, prog}}$$

$$\mathscr{Prediction}_{\tiny{eat,rt,prog,crs,mod,visa}} = \mathscr{CrsDemandHist}_{\tiny{at,rt,prog,crs,mod,visa}} \times \frac{\mathscr{ExpectedNewAdmissions}_{\tiny{eat, prog}}}{\mathscr{NewAdmissions}_{\tiny{at, prog}}}$$
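
As a quick check of the formulas, the sketch below plugs in figures that appear later in this notebook: 131 students from the Fall 2021 cohort registered for AIT 580 in their first term, that cohort had 149 new admissions, and 100 new admissions are expected for Fall 2023. The function names `weight` and `prediction` are illustrative only and are not part of the project's modules.

```python
def weight(crs_demand_hist: float, new_admissions: float) -> float:
    """Share of a past admit cohort that registered for a course in a given term."""
    if new_admissions <= 0:
        raise ValueError("new_admissions must be positive")
    return crs_demand_hist / new_admissions


def prediction(crs_demand_hist: float, new_admissions: float,
               expected_new_admissions: float) -> float:
    """Projected demand: the historical weight scaled by the expected cohort size."""
    return weight(crs_demand_hist, new_admissions) * expected_new_admissions


w = weight(131, 149)
p = prediction(131, 149, 100)
print(round(w, 4), round(p, 2))  # 0.8792 87.92
```

The result agrees with the Fall 2023 projection for AIT 580 shown in the predictions table further down.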

### What are the benefits of using Weighted Projections?
- It takes into account the enrollment behavior of student cohorts in the past.
- It makes predictions for future student cohorts, enabling insights for course demand throughout various semesters within a student enrollment lifecycle.
- It is simple to implement and extends easily as new factors are added.

### What are the drawbacks of using Weighted Projections?
- It assumes that the enrollment behavior of student cohorts in the past will be similar to the enrollment behavior of student cohorts in the future.
- It does not account for shifts in enrollment behavior, which can arise from:
    - courses being added to or removed from the curriculum
    - new grants being awarded to certain programs or courses
    - changes in job demand for certain programs or courses

### How can Weighted Projections be improved in the future?
- Weights can be averaged over multiple years to smooth out year-to-year changes in enrollment behavior. Alternatively, regression models can be fitted to predict the weights for future semesters.
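
One way to read the averaging suggestion is to combine several admit cohorts before forming the weight. The sketch below uses invented cohort figures (hypothetical, not taken from the enrollment data) and compares a simple mean of per-cohort weights with a pooled weight in which larger cohorts carry more influence.

```python
# Hypothetical (admit_term, course demand, cohort size) figures for one course.
cohorts = [
    (201970, 80, 100),
    (202070, 60, 80),
    (202170, 120, 150),
]

# Option 1: average the per-cohort weights; every cohort counts equally.
per_cohort = [demand / admits for _, demand, admits in cohorts]
mean_weight = sum(per_cohort) / len(per_cohort)

# Option 2: pool the cohorts; larger cohorts carry more influence.
pooled_weight = sum(d for _, d, _ in cohorts) / sum(a for _, _, a in cohorts)

expected_new_admissions = 100
print(f"mean weight   : {mean_weight:.4f} -> {mean_weight * expected_new_admissions:.1f}")
print(f"pooled weight : {pooled_weight:.4f} -> {pooled_weight * expected_new_admissions:.1f}")
```

Which variant is preferable depends on whether small cohorts are considered as informative as large ones; the notebook itself uses a single cohort.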

---
# Code: Weighted Projections
## Initialization

In [1]:
# importing the required libraries
import os
import pandas as pd

# moving three levels up so that the `Code` package and the `Data` folder resolve
os.chdir(os.path.join("..", "..", ".."))

# importing custom modules
from Code.src.modules.db_ops import *
from Code.src.modules.dataManager import DataManager
from Code.src.modules.eda import *

# initializing the DataManager
DM = DataManager()

In [2]:
# importing the data for analysis
df_finalEnrollment = DM.get_data('EnrollmentFinalStatus', 'pkl', 'processed')
db_finalEnrollment = DM.get_data('EnrollmentFinalStatus', 'db', 'processed')

In [3]:
# Sanity check: distinct students per admit term for one program
db_finalEnrollment.runQuery(""" --sql
    SELECT stu_admit_term_desc, COUNT(DISTINCT stu_id)
    FROM EnrollmentFinalStatus
    WHERE
        stu_prog_desc = 'MS Data Analytics Engineering'
        AND stu_admit_term_name IN ('Fall', 'Spring')
    GROUP BY stu_admit_term_desc
    ORDER BY stu_admit_term_code
""")

Unnamed: 0,stu_admit_term_desc,COUNT(DISTINCT stu_id)
0,Spring 2012,1
1,Fall 2013,1
2,Spring 2014,1
3,Fall 2014,9
4,Spring 2015,2
5,Fall 2015,27
6,Spring 2016,35
7,Fall 2016,92
8,Spring 2017,45
9,Fall 2017,119


### Configuring the model

In [4]:
def get_term_desc(term_code):
    """Convert a term code such as 202370 into a description such as 'Fall 2023'."""
    if term_code == 202010:
        return 'Spring 2020 - COVID-19'
    term_year = int(term_code) // 100
    # Suffix 70 marks Fall; every other suffix is treated as Spring
    term_name = 'Fall' if int(term_code) % 100 == 70 else 'Spring'
    return f"{term_name} {term_year}"

def get_term_code(term_desc):
    """Convert a description such as 'Fall 2023' into a term code such as 202370."""
    if term_desc == 'Spring 2020 - COVID-19':
        return 202010
    term_name, term_year = term_desc.split(' ')[:2]
    term_suffix = 70 if term_name == 'Fall' else 10
    return int(term_year) * 100 + term_suffix

get_term_desc(202110)

'Spring 2021'

TODO: The hard-coded list of future semesters in the code below is a stopgap. The future semesters should be auto-populated from the latest registration term in the data.

In [5]:
# Getting all the semesters
semesters = db_finalEnrollment.runQuery("""--sql
    SELECT reg_term_desc, reg_term_code
    FROM EnrollmentFinalStatus
    WHERE
        reg_term_year > 2017
    GROUP BY reg_term_desc, reg_term_code
    ORDER BY reg_term_code
""")
semesters['reg_term_code'] = semesters['reg_term_code'].astype(int)

future_semesters = [202370, 202410, 202470, 202510]
future_semesters = pd.DataFrame(future_semesters, columns=['reg_term_code'])
future_semesters['reg_term_desc'] = future_semesters['reg_term_code'].apply(get_term_desc)
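
One possible way to derive the future terms from the data, rather than listing them by hand, is to step forward from the latest registration term. The sketch below assumes term codes follow the `YYYY10` (Spring) and `YYYY70` (Fall) pattern used in this notebook and that only those two terms are projected. `next_term_codes` is a hypothetical helper, not part of the project's modules.

```python
def next_term_codes(latest_code: int, n: int) -> list[int]:
    """Return the n Spring/Fall term codes that follow latest_code.

    Assumes codes look like YYYY10 (Spring) or YYYY70 (Fall).
    """
    year, suffix = divmod(int(latest_code), 100)
    if suffix not in (10, 70):
        raise ValueError(f"unexpected term suffix: {suffix}")
    codes = []
    for _ in range(n):
        if suffix == 10:
            suffix = 70                      # Spring -> Fall of the same year
        else:
            year, suffix = year + 1, 10      # Fall -> Spring of the next year
        codes.append(year * 100 + suffix)
    return codes


print(next_term_codes(202310, 4))  # [202370, 202410, 202470, 202510]
```

In this notebook the starting point would presumably be `semesters['reg_term_code'].max()`. If the data also contain other term suffixes (for example summer terms), the latest code would need to be filtered to Spring or Fall first.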

### Taking inputs

In [6]:
# Defining the model inputs
inputs = {
    'eat_desc'      : "Fall 2023",                                    # Accepted Inputs: "[Fall/Spring] [YYYY]"
    'prog_desc'     : "MS Data Analytics Engineering",                # Accepted Inputs: "[prog_desc]" OR "All"
    'courses'       : ['AIT 580', 'STAT 515', 'OR 531', 'CS 504'],    # Accepted Inputs: list of "[crs]" OR "All"
    'mod'           : "combined",                                     # Accepted Inputs: "combined" OR "separate"
    'visa'          : 'F1 Visa',                                      # Accepted Inputs: "[stu_visa]" OR "combined"
    'ExpN_eat'      : 100                                             # Accepted Inputs: "[int]"
}

In [7]:
# Building the model parameters
model_params = {
    'eat_desc'          : inputs['eat_desc'],
    'eat_code'          : get_term_code(inputs['eat_desc']),
    'prog_desc'         : inputs['prog_desc'],
    'courses'           : inputs['courses'],
    'mod'               : inputs['mod'],
    'visa'              : inputs['visa'],
    'ExpN_eat'          : inputs['ExpN_eat'],
    'enr_hist'          : 4 if inputs['visa'] == 'F1 Visa' else 8,   # number of registration terms of history to use
    'filters'           : {
        'prog_filter'   : inputs['prog_desc'] != 'All',
        'course_filter' : inputs['courses'] != 'All',
        'visa_filter'   : inputs['visa'] != 'combined'
    },
    'groupby'           : {
        'mod_groupby'   : inputs['mod'] != 'combined'
    }
}

### Model Calculations

- $at$          : Latest suitable (historical) admit term
- $eat$         : Expected admit term being projected
- $N_{at}$      : New admissions in the $at$ term
- $ExpN_{eat}$  : Expected new admissions for the $eat$ term
- $rat$         : Ratio of $ExpN_{eat}$ to $N_{at}$

In [8]:
# Getting the latest suitable term code for the model
def get_latest_at(eat_code):

    # List of eligible semesters
    term_list = semesters[
        semesters['reg_term_code'] % 100 == eat_code % 100
    ] \
        .reg_term_code.tolist()

    # Earliest Data Point Available
    at_code_min = term_list[0]

    # Latest term that still leaves enr_hist registration terms of history
    if semesters.reg_term_code.max() % 100 == eat_code % 100:
        at_code_max_1 = semesters.reg_term_code[ semesters.index.max() - model_params['enr_hist'] ]
    else:
        at_code_max_1 = semesters.reg_term_code[ semesters.index.max() + 1 - model_params['enr_hist'] ]

    # At least 2 terms (one academic year) before the Expected Admit Term
    at_code_max_2 = eat_code - 100

    # Actual stu_admit_term_code: as late as both limits allow, but never before the earliest data point
    best_at = max(at_code_min, min(at_code_max_1, at_code_max_2))
    print(f"at_code_min \t: {at_code_min}  |  (Earliest Data Point Available)",
        f"\nat_code_max_1 \t: {at_code_max_1}  |  (Latest reg_term_code for the given enr_hist)",
        f"\nat_code_max_2 \t: {at_code_max_2}  |  (At least 2 terms before the Expected Admit Term)",
        f"\nBest_at \t: {best_at}  |  (Best admit_term for the given criteria.)"
    )
    return best_at

model_params['at_code'] = get_latest_at(model_params['eat_code'])
model_params['at_desc'] = get_term_desc(model_params['at_code'])

at_code_min 	: 201870  |  (Earliest Data Point Available) 
at_code_max_1 	: 202170  |  (Latest reg_term_code for the given enr_hist) 
at_code_max_2 	: 202270  |  (At least 2 terms before the Expected Admit Term) 
Best_at 	: 202170  |  (Best admit_term for the given criteria.)
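
The selection rule can also be restated as a pure function that takes the term list and history length as arguments instead of reading them from notebook globals. This is a sketch under assumptions, not the notebook's own code: `latest_suitable_admit_term` is a hypothetical name, and the term list is assumed to contain only Spring (10) and Fall (70) codes from Spring 2018 through Spring 2023. For the inputs used here it arrives at the same term as the output above.

```python
def latest_suitable_admit_term(term_codes, eat_code, enr_hist):
    """Pick the historical admit term to base the weights on.

    term_codes : registration term codes present in the data, sorted ascending
    eat_code   : expected admit term code, e.g. 202370
    enr_hist   : number of registration terms of history required
    """
    season = eat_code % 100
    same_season = [c for c in term_codes if c % 100 == season]
    if not same_season:
        raise ValueError("no historical term matches the season of eat_code")

    earliest = same_season[0]
    # Same-season terms followed by enough data to cover enr_hist terms.
    with_history = [c for i, c in enumerate(term_codes)
                    if c % 100 == season and i + enr_hist <= len(term_codes)]
    latest_with_history = max(with_history, default=earliest)
    one_year_before = eat_code - 100

    return max(earliest, min(latest_with_history, one_year_before))


# Assumed term list: Spring 2018 through Spring 2023, Spring (10) and Fall (70) only.
terms = [y * 100 + s for y in range(2018, 2024) for s in (10, 70)][:-1]
print(latest_suitable_admit_term(terms, eat_code=202370, enr_hist=4))  # 202170
```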


In [9]:
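def list_to_str(l):
    """
    Create a SQL-style list "('a', 'b', 'c', 'd')" from ['a', 'b', 'c', 'd'].
    Unlike str(tuple(l)), a single-element list yields "('a')" without a trailing comma.
    """
    return "(" + ", ".join(repr(x) for x in l) + ")"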

# Building the SQL WHERE conditions

# Program Filter
if model_params['filters']['prog_filter']:
    prog_cond = f"AND stu_prog_desc = '{model_params['prog_desc']}'"
else:
    prog_cond = "-- No program filters"

# Course Filter
if model_params['filters']['course_filter']:
    course_cond = f"AND crs IN {list_to_str(model_params['courses'])}"
else:
    course_cond = "-- No course filters"

# Visa Filter
if model_params['filters']['visa_filter']:
    visa_cond = f"AND stu_visa = '{model_params['visa']}'"
else:
    visa_cond = "-- No visa filters"

# Mod Groupby
if model_params['groupby']['mod_groupby']:
    mod_groupby = ",crs_sect_modality"
else:
    mod_groupby = "-- No modality grouping"
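
The conditions above are assembled with f-strings, which works for trusted notebook inputs but breaks on values that contain quotes. If the underlying driver supports bound parameters, the same optional filters can be expressed with placeholders. The sketch below uses Python's built-in `sqlite3` module and a tiny in-memory table with made-up rows purely for illustration. Whether `runQuery` accepts parameters is not shown in this notebook, and `build_filters` is a hypothetical helper.

```python
import sqlite3

def build_filters(prog_desc="All", courses="All", visa="combined"):
    """Return (sql_fragment, params) for the optional WHERE conditions."""
    clauses, params = [], []
    if prog_desc != "All":
        clauses.append("AND stu_prog_desc = ?")
        params.append(prog_desc)
    if courses != "All":
        placeholders = ", ".join("?" for _ in courses)
        clauses.append(f"AND crs IN ({placeholders})")
        params.extend(courses)
    if visa != "combined":
        clauses.append("AND stu_visa = ?")
        params.append(visa)
    return "\n".join(clauses), params

# Toy table standing in for EnrollmentFinalStatus (made-up rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EnrollmentFinalStatus "
            "(stu_id INTEGER, stu_prog_desc TEXT, crs TEXT, stu_visa TEXT)")
con.executemany(
    "INSERT INTO EnrollmentFinalStatus VALUES (?, ?, ?, ?)",
    [(1, "MS Data Analytics Engineering", "AIT 580", "F1 Visa"),
     (2, "MS Data Analytics Engineering", "CS 504", "F1 Visa"),
     (3, "MS Data Analytics Engineering", "AIT 580", "Other"),
     (4, "Some Other Program", "AIT 580", "F1 Visa")],
)

where, params = build_filters("MS Data Analytics Engineering", ["AIT 580"], "F1 Visa")
sql = f"SELECT COUNT(DISTINCT stu_id) FROM EnrollmentFinalStatus WHERE 1 = 1\n{where}"
count = con.execute(sql, params).fetchone()[0]
print(count)  # 1
```

The `WHERE 1 = 1` prefix lets every optional clause start with `AND`, so no special handling is needed when all filters are switched off.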

In [10]:
# Querying the database for the number of new enrollments in `at` Academic Term
query = f"""--sql
    SELECT COUNT(DISTINCT stu_id) AS new_enrollments
    FROM EnrollmentFinalStatus
    WHERE
        stu_admit_term_code = {model_params['at_code']}
        {prog_cond}
        {visa_cond}
    GROUP BY
        stu_admit_term_code
"""

# Calculating number of new enrollments in `at` Academic Term
model_params['N_at'] = db_finalEnrollment.runQuery(query).new_enrollments[0]

# Calculating the ratio of Expected new enrollments in `eat` Academic Term to `at` Academic Term
model_params['rat_eat_at'] = model_params['ExpN_eat'] / model_params['N_at']


print(query, model_params['N_at'], model_params['rat_eat_at'], sep = '\n\n')

--sql
    SELECT COUNT(DISTINCT stu_id) AS new_enrollments
    FROM EnrollmentFinalStatus
    WHERE
        stu_admit_term_code = 202170
        AND stu_prog_desc = 'MS Data Analytics Engineering'
        AND stu_visa = 'F1 Visa'
    GROUP BY
        stu_admit_term_code


149

0.6711409395973155


In [11]:
# Getting the list of semesters to be considered for calculating the historical enrollment demand
reg_terms = semesters.reg_term_code.tolist()
start = reg_terms.index(model_params['at_code'])
term_list = reg_terms[start : start + model_params['enr_hist']]

In [12]:
# Getting CourseDemandHistory for the `at` Academic Term
query = f"""--sql
    SELECT crs, reg_term_code
        {',crs_sect_modality' if model_params['groupby']['mod_groupby'] else '--'}
        ,COUNT(DISTINCT stu_id) AS demand
    FROM EnrollmentFinalStatus
    WHERE
        stu_admit_term_code = {model_params['at_code']}
        AND reg_term_code IN {list_to_str(term_list)}
        {prog_cond}
        {course_cond}
        {visa_cond}
    GROUP BY
        crs, reg_term_code
        {mod_groupby}
"""

print(query)

--sql
    SELECT crs, reg_term_code
        --
        ,COUNT(DISTINCT stu_id) AS demand
    FROM EnrollmentFinalStatus
    WHERE
        stu_admit_term_code = 202170
        AND reg_term_code IN (202170, 202210, 202270, 202310)
        AND stu_prog_desc = 'MS Data Analytics Engineering'
        AND crs IN ('AIT 580', 'STAT 515', 'OR 531', 'CS 504')
        AND stu_visa = 'F1 Visa'
    GROUP BY
        crs, reg_term_code
        -- No modality grouping



In [13]:
# Querying the database for CourseDemandHistory for the `at` Academic Term with the above conditions, then:
#   1. setting the index to crs and reg_term_code
#   2. unstacking reg_term_code to get the demand for each semester in pivot table format, filling the missing values with 0
#   3. resetting the index so that crs becomes a regular column again

df_CrsDemHist = db_finalEnrollment.runQuery(query)
df_CrsDemHist = (
    df_CrsDemHist
    .set_index(['crs', 'reg_term_code'])
    .unstack(fill_value=0)
    .reset_index()
)

df_CrsDemHist.columns = df_CrsDemHist.columns.droplevel(0)
df_CrsDemHist.columns.name = None
df_CrsDemHist.rename({'' : 'crs'}, axis=1, inplace=True)
df_CrsDemHist

Unnamed: 0,crs,202170,202210,202270,202310
0,AIT 580,131,21,1,0
1,CS 504,116,46,12,1
2,OR 531,94,51,0,0
3,STAT 515,132,13,0,0
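
The same reshaping could also be written with `DataFrame.pivot_table`, which names the index, columns, and values explicitly and needs less column clean-up afterwards. The sketch below rebuilds the long-format query result from the non-zero cells of the table above, on the assumption that zero cells correspond to rows missing from the query output.

```python
import pandas as pd

# Long-format rows as the demand query would return them
# (reconstructed from the non-zero cells of the table above).
long_rows = pd.DataFrame(
    [("AIT 580", 202170, 131), ("AIT 580", 202210, 21), ("AIT 580", 202270, 1),
     ("CS 504", 202170, 116), ("CS 504", 202210, 46), ("CS 504", 202270, 12),
     ("CS 504", 202310, 1),
     ("OR 531", 202170, 94), ("OR 531", 202210, 51),
     ("STAT 515", 202170, 132), ("STAT 515", 202210, 13)],
    columns=["crs", "reg_term_code", "demand"],
)

wide = (
    long_rows
    .pivot_table(index="crs", columns="reg_term_code", values="demand",
                 aggfunc="sum", fill_value=0)
    .reset_index()
)
wide.columns.name = None  # drop the leftover "reg_term_code" axis label
print(wide)
```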


In [14]:
# Generating the predictions dataframe for the `eat` term by multiplying the Enrollment Demand History of the `at` cohort by rat_eat_at
df_Pred = df_CrsDemHist.copy()
df_Pred = pd.merge(
    df_Pred.iloc[:, :1],
    df_Pred.iloc[:, 1:] \
        .apply(lambda x: x * model_params['rat_eat_at'], axis=1),
    left_index=True, right_index=True
)

# Renaming the columns in the df_Pred dataframe with term_list
col_list = df_Pred.columns[1:].tolist()
new_col_list = []
for col_name in col_list:
    new_col_name = int(col_name) + model_params['eat_code'] - int(df_Pred.columns[1])
    new_col_name = get_term_desc(new_col_name)
    new_col_list.append(new_col_name)
df_Pred.columns = ['crs'] + new_col_list

df_Pred

Unnamed: 0,crs,Fall 2023,Spring 2024,Fall 2024,Spring 2025
0,AIT 580,87.919463,14.09396,0.671141,0.0
1,CS 504,77.852349,30.872483,8.053691,0.671141
2,OR 531,63.087248,34.228188,0.0,0.0
3,STAT 515,88.590604,8.724832,0.0,0.0
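
The scaling and relabelling steps can be condensed into one function. The sketch below is an alternative formulation under assumptions, not the notebook's code: `project` and `term_desc` are hypothetical names, `term_desc` is a simplified stand-in for `get_term_desc` without the Spring 2020 special case, and the input frame is typed in from the demand history table shown earlier.

```python
import pandas as pd

def term_desc(code: int) -> str:
    """Simplified stand-in for get_term_desc (no Spring 2020 special case)."""
    return f"{'Fall' if code % 100 == 70 else 'Spring'} {code // 100}"

def project(hist: pd.DataFrame, ratio: float, at_code: int, eat_code: int) -> pd.DataFrame:
    """Scale historical demand by ratio and shift term columns from at_code to eat_code."""
    pred = hist.set_index("crs") * ratio
    shift = eat_code - at_code   # a whole number of years when both terms share a season
    pred.columns = [term_desc(int(c) + shift) for c in pred.columns]
    return pred.reset_index()

hist = pd.DataFrame({
    "crs": ["AIT 580", "CS 504", "OR 531", "STAT 515"],
    202170: [131, 116, 94, 132],
    202210: [21, 46, 51, 13],
    202270: [1, 12, 0, 0],
    202310: [0, 1, 0, 0],
})
pred = project(hist, ratio=100 / 149, at_code=202170, eat_code=202370)
print(pred.round(2))
```

Shifting every column by the same offset only yields valid term codes when `at_code` and `eat_code` fall in the same season, which the term selection earlier in the notebook enforces.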


### Calculating absolute error: (WIP)

In [15]:
model_params['eat_code'] in semesters.reg_term_code.tolist()

False

In [16]:
# Getting the signed difference between the predictions and the `at` cohort's demand history (columns aligned by position as Sem1..SemN)
df_Diff = df_Pred.copy()
df_Diff.columns = ['crs'] + ['Sem'+str(i+1) for i in range(df_Diff.shape[1]-1)]
df_temp = df_CrsDemHist.copy()
df_temp.columns = ['crs'] + ['Sem'+str(i+1) for i in range(df_temp.shape[1]-1)]

# Subtracting df_temp (history) from df_Diff (predictions)
df_Diff = df_Diff.iloc[:, :1].merge(df_Diff.iloc[:, 1:].sub(df_temp.iloc[:, 1:]), left_index=True, right_index=True)
df_Diff

Unnamed: 0,crs,Sem1,Sem2,Sem3,Sem4
0,AIT 580,-43.080537,-6.90604,-0.328859,0.0
1,CS 504,-38.147651,-15.127517,-3.946309,-0.328859
2,OR 531,-30.912752,-16.771812,0.0,0.0
3,STAT 515,-43.409396,-4.275168,0.0,0.0
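
The difference above compares the Fall 2023 projection with the Fall 2021 history it was derived from, because Fall 2023 registrations are not yet in the data (the membership check in cell 15 returns `False`). Once actual registrations exist, an error summary could look like the following sketch. The `actual` figures are invented placeholders, and the metric choices (mean absolute error and weighted absolute percentage error) are suggestions rather than the project's method.

```python
# Projected first-term demand (rounded from the predictions table above).
predicted = {"AIT 580": 87.92, "CS 504": 77.85, "OR 531": 63.09, "STAT 515": 88.59}
# Hypothetical actual registrations; replace with real counts when available.
actual = {"AIT 580": 92, "CS 504": 70, "OR 531": 66, "STAT 515": 85}

# Absolute error per course.
abs_err = {crs: abs(predicted[crs] - actual[crs]) for crs in predicted}

# Mean absolute error across courses.
mae = sum(abs_err.values()) / len(abs_err)

# Weighted absolute percentage error: total absolute error over total actual demand.
wape = sum(abs_err.values()) / sum(actual.values())

for crs, err in abs_err.items():
    print(f"{crs:<9} abs error = {err:.2f}")
print(f"MAE  = {mae:.2f}")
print(f"WAPE = {wape:.1%}")
```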


In [17]:
# Renaming the columns of df_CrsDemHist by replacing each term code with its term description (df_Pred was already renamed above)
# df_Pred.columns = ['crs'] + [get_term_desc(i) for i in df_Pred.columns.to_list()[1:]]
df_CrsDemHist.columns = ['crs'] + [get_term_desc(i) for i in df_CrsDemHist.columns.to_list()[1:]]

In [18]:
df_Pred.to_csv(os.path.join('Data', '02_processed', 'test_predictions.csv'), index=False)