# Attempt at Item-Item Collaborative Filtering #

Information about Item-Item CF can be found here: 
https://towardsdatascience.com/item-based-collaborative-filtering-in-python-91f747200fab

WHY ARE WE DOING THIS

Item-Item Collaborative Filtering is useful for determining the similarities between a pair of items among a set of people. In this project, we have determined that it may be useful to find the similarities between WAITLISTING, COURSE ENROLLMENT, and COURSE AVAILABILITY among different CORE classes. 
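
To make the idea of item-item similarity concrete, here is a minimal sketch on a toy matrix. The course and group names and all of the numbers are made up for illustration; only `pdist` and `squareform` are the same SciPy functions this notebook imports later.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: rows are items (courses), columns are the
# "people" (three made-up groups) who rated them.
toy = pd.DataFrame(
    [[1.0, 0.0, 1.0],
     [2.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]],
    index=["COURSE_A", "COURSE_B", "COURSE_C"],
    columns=["GROUP_1", "GROUP_2", "GROUP_3"],
)

# pdist returns the cosine *distance*; 1 - distance is the cosine similarity.
toy_sim = pd.DataFrame(
    1 - squareform(pdist(toy, "cosine")),
    index=toy.index,
    columns=toy.index,
)
print(toy_sim.round(3))
```

COURSE_A and COURSE_B point in the same direction, so they come out as maximally similar, while COURSE_C shares no raters with them and comes out as dissimilar.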

This is helpful because it can highlight student behavior from an UNSUPERVISED perspective. COURSE DEMAND can be used as a measuring stick for the desirability of a CORE class, and can help answer the following questions:
1) Which classes are NEW students going to pick initially, assuming that they will NOT take classes that overwhelm them?
2) Roughly which classes will INTERNATIONAL students LIKELY pick PER semester, depending on the waitlist?

COURSE DEMAND is defined as follows...

(COURSE_ENROLLMENT-COURSE_AVAILABILITY)+COURSE_WAITLIST
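
As a worked example of the definition above, here is a small sketch. The function name `course_demand` and the seat counts are hypothetical and are not taken from the project data.

```python
def course_demand(enrollment: int, availability: int, waitlist: int) -> int:
    """(COURSE_ENROLLMENT - COURSE_AVAILABILITY) + COURSE_WAITLIST."""
    return (enrollment - availability) + waitlist

# Hypothetical sections (numbers invented for illustration).
full_section = course_demand(enrollment=30, availability=0, waitlist=5)
half_empty_section = course_demand(enrollment=10, availability=20, waitlist=0)
print(full_section, half_empty_section)
```

Under this definition a full section with a waitlist scores positive, while a section with more open seats than enrolled students scores negative.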

This information is grouped by EVERY available professor teaching a KNOWN or UNKNOWN class.

OUR INPUT WILL BE AS FOLLOWS:

PROFESSOR/COURSE
COURSE_1 COURSE_DEMAND
COURSE_2 COURSE_DEMAND
...
COURSE_N COURSE_DEMAND

OUR OUTPUT WILL BE AS FOLLOWS:
For a list of professors within a semester, provide a prediction for the NEXT semester
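
The notebook does not get as far as the prediction step, so the following is only a sketch of one common way item-item CF turns similarities into predictions: a similarity-weighted average of the other items' scores. The function name `predict_scores`, the toy matrices, and the choice of weighting are all assumptions, not the project's method.

```python
import numpy as np

def predict_scores(ratings: np.ndarray, item_sim: np.ndarray) -> np.ndarray:
    """Similarity-weighted average of the other items' ratings.

    ratings:  (n_items, n_raters) matrix of observed scores
    item_sim: (n_items, n_items) item-item similarity matrix
    """
    weights = item_sim.astype(float).copy()
    np.fill_diagonal(weights, 0.0)        # an item should not predict itself
    norm = np.abs(weights).sum(axis=1, keepdims=True)
    norm[norm == 0] = 1.0                 # isolated items get a prediction of 0
    return (weights @ ratings) / norm

# Hypothetical data: items 0 and 1 are fully similar, item 2 is unrelated.
toy_ratings = np.array([[1.0, 0.0],
                        [3.0, 0.0],
                        [0.0, 2.0]])
toy_item_sim = np.array([[1.0, 1.0, 0.0],
                         [1.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
toy_pred = predict_scores(toy_ratings, toy_item_sim)
print(toy_pred)
```

Each item's predicted row is built only from its neighbors, so item 0 inherits item 1's scores and vice versa, while item 2 has no neighbors to draw on.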

In [154]:
import sqlite3
import pandas as pd
import os
pd.options.mode.chained_assignment = None

In [330]:
from scipy.spatial.distance import pdist,squareform
import numpy as np

First, we have to select information based on the instructor, their course, and their section id. Cumulative term will be taken as a means to 

In [30]:
process_path = os.path.join(os.sep, "home", "jupyter", "Team-Prophecy", "Data", "02_processed", "intermediate.db")
print(process_path)

/home/jupyter/Team-Prophecy/Data/02_processed/intermediate.db


In [31]:
process_connection = sqlite3.connect(process_path)

In [323]:
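# Drop any previous results table so the CREATE TABLE below starts clean.
# IF EXISTS avoids the error on a first run without hiding other failures.
process_connection.execute("DROP TABLE IF EXISTS resultant_iicf_values;")
process_connection.commit()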




In [None]:
#THIS IS EXPLICITLY FOR OUR RESULTANT_VALUES
process_connection.execute("CREATE TABLE resultant_iicf_values("
                   "result_id INTEGER PRIMARY KEY AUTOINCREMENT, "
                   "crs_name TEXT NOT NULL, "
                   "crs_sect_id INTEGER NOT NULL, "
                   "course_demand INTEGER NOT NULL "
                   ");")
process_connection.commit()

In [37]:
# Count waitlisted ('W') and registered ('R') students per term, course, and section.
wait_stat = pd.DataFrame(process_connection.execute("""
    SELECT reg_term_code, crs, sect_id, COUNT(reg_final_status) AS amt
    FROM registration_status rs
    WHERE reg_final_status = 'W'
    GROUP BY reg_term_code, crs, sect_id
""").fetchall(),columns=["reg_term_code", "crs", "sect_id", "Waitlist"])
registered_stat = pd.DataFrame(process_connection.execute("""
    SELECT reg_term_code, crs, sect_id, COUNT(reg_final_status) AS amt
    FROM registration_status rs
    WHERE reg_final_status = 'R'
    GROUP BY reg_term_code, crs, sect_id
""").fetchall(),columns=["reg_term_code", "crs", "sect_id","Registered"])
#dropped_stat = pd.DataFrame(process_connection.execute("""
#    SELECT reg_term_code, crs, sect_id, COUNT(reg_final_status) AS amt
#    FROM registration_status rs
#    GROUP BY reg_term_code, crs, sect_id, reg_final_status
#    HAVING reg_final_status == 'D'
#""").fetchall(),columns=["reg_term_code", "crs", "sect_id","Dropped"])

total_stat = registered_stat.merge(wait_stat, on=["reg_term_code", "crs", "sect_id"], how='left').fillna(0)
#total_stat = total_stat.merge(dropped_stat, on=["reg_term_code", "crs", "sect_id"], how='left').fillna(0)
#total_stat.sort_values(by="Dropped",ascending=False).head()

In [116]:
total_stat.head()

Unnamed: 0,reg_term_code,crs,sect_id,Registered,Waitlist
0,201770,AIT512,001,109,0.0
1,201770,AIT524,001,160,1.0
2,201770,AIT524,DL3,94,0.0
3,201770,AIT542,DL1,150,0.0
4,201770,AIT580,001,265,32.0


Next, we do some mild cleaning of the data.

Here's fundamentally what we want:
1) At least two students enrolled/available for a given class.
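
The requirement above could be expressed as a simple filter. This is a sketch under an assumed reading of "enrolled/available" (enrolled plus available seats is at least two); the rows, the `min_seats` name, and the threshold logic are hypothetical, with column names borrowed from the query in the next cell.

```python
import pandas as pd

# Hypothetical rows mirroring the cum_seat_enroll / cum_seat_avail columns.
sections = pd.DataFrame({
    "crs": ["AIT502", "AIT502", "AIT524"],
    "sect_id": ["P01", "DL1", "001"],
    "cum_seat_enroll": [0, 12, 1],
    "cum_seat_avail": [1, 15, 0],
})

# Assumed reading: keep a section when enrolled + available seats >= 2.
min_seats = 2
seat_total = sections["cum_seat_enroll"] + sections["cum_seat_avail"]
kept = sections[seat_total >= min_seats]
print(kept)
```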

In [253]:
instr_count_df = pd.DataFrame(process_connection.execute("""
    SELECT sc.cum_term, i_n.instr_home_org, sc.cum_instr, sc.cum_term_code, sc.cum_sect_id, 
        sc.cum_seat_enroll+abs(sc.cum_seat_avail)+sc.cum_seat_wait as cum_seat_total,
        IIF(sc.cum_seat_avail < 0, sc.cum_seat_enroll+abs(sc.cum_seat_avail),sc.cum_seat_enroll) as cum_seat_enroll,
        IIF(sc.cum_seat_avail < 0, 0, sc.cum_seat_avail) as cum_seat_avail
    FROM semester_course_offerings sc
    INNER JOIN instructors i_n ON sc.cum_instr == i_n.instr_name
""").fetchall(),columns = ["reg_term_code","home_organization","instr","crs","sect_id","cum_seat_total","cum_seat_enroll","cum_seat_avail"])

In [254]:
instr_count_df["dem_ratio"] = instr_count_df["cum_seat_enroll"]/instr_count_df["cum_seat_total"]

In [127]:
#instr_count_df["norm_dem_ratio"] = (instr_count_df["dem_ratio"]-min(instr_count_df["dem_ratio"]))/(max(instr_count_df["dem_ratio"])-min(instr_count_df["dem_ratio"]))

In [336]:
instr_count_df.sort_values("crs").head()

Unnamed: 0,reg_term_code,home_organization,instr,crs,sect_id,cum_seat_total,cum_seat_enroll,cum_seat_avail,dem_ratio
8599,201910,IST Department,"Butu, Emilia Virginia",AIT502,P01,4,0,4,0.0
13757,202070,No Value,No Value,AIT502,P01,2,0,2,0.0
13756,202070,IST Department,"Boicu, Mihai",AIT502,DL1,27,12,15,0.444444
13755,202070,No Value,No Value,AIT502,001,23,0,23,0.0
9407,201940,No Value,No Value,AIT502,A01,30,0,30,0.0


In [337]:
# .copy() makes this an independent frame, so the column edits below do not act on a view.
f_instr_count_df = instr_count_df[["reg_term_code","home_organization","instr","crs","sect_id","dem_ratio"]].copy()

In [312]:
#f_instr_count_df["reg_term_code"] = f_instr_count_df["reg_term_code"].astype(str)+"-"+f_instr_count_df["home_organization"].astype(str)
#f_instr_count_df = f_instr_count_df.drop(["home_organization"],axis=1)

In [338]:
f_instr_count_df["crs"] = f_instr_count_df["crs"].astype(str)+"-"+f_instr_count_df["sect_id"].astype(str)
f_instr_count_df = f_instr_count_df.drop(["sect_id"],axis=1)

In [339]:
reg_term_code_list = f_instr_count_df["reg_term_code"].unique().tolist()

In [366]:
dept_list = f_instr_count_df["home_organization"].unique().tolist()

In [340]:
crs_list = f_instr_count_df["crs"].unique().tolist()

If it doesn't work, it doesn't work, but Item-Item CF was presented poorly. Dan's feedback was "I don't need recommendations, I need factual definitive information."
- Tell DAN what to WATCH

In [367]:
r_df = f_instr_count_df.groupby("reg_term_code")
sam_df = None
sim_sam_matrix = None
for reg in range(0,len(reg_term_code_list)):    
    #r_df["row_number"] = r_df.reset_index().index
    #r_df=f_instr_count_df.set_index("crs").T
    #r_df = r_df[r_df["instr"] == "No Value"]
    """
    prior_sam_df = r_df.get_group(reg_term_code_list[reg-1])
    prior_sam_df = prior_sam_df.pivot(index=["crs"],columns="home_organization",values="dem_ratio").fillna(0)
    
    prior_sim_sam_matrix = 1-squareform(pdist(prior_sam_df,"cosine"))
    """
    
    sam_df = r_df.get_group(reg_term_code_list[reg])
    sam_df = sam_df.pivot(index=["crs"],columns="home_organization",values="dem_ratio").fillna(0)
    
    sim_sam_matrix = 1-squareform(pdist(sam_df,"cosine"))
    #We now tackle different departments for each course
    for d in dept_list:
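        # After the pivot, home_organization is the column axis rather than a column,
        # so (assumed intent) a department is excluded by dropping its column if present.
        dept_rating = sam_df.drop(columns=[d], errors="ignore")
        dept_rating_dos = sam_df.drop(columns=[d], errors="ignore")
        prediction_rating = dept_rating.copy()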
        
    break
sam_df

crs,AFC Operations,AVP Strategic Advancement Systems,Affiliates,Aquatic Administration,Bioengineering Department,Biology Department,C4I Center,CEC Deans Office Admin,CEC Graduate Stud Services,CEC Instructional Lab Expenses,...,STEM Camps,School of Systems Biology,Statistics Department,Stdt Ctrs Event Operations,Stdt Govt Campus Ministry,Systems Eng and Ops Research Dept,Tech Talent Investment Program,Telecommunication Program,Wiley DL CEC Data Analytics,Wiley DL CEC MS AIT
AIT524-DL1,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
AIT580-001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
AIT580-002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
AIT580-621,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
AIT580-P01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCOM690-003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0,0.0
TCOM690-004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
TCOM690-DL1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
TCOM698-001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0
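
Most rows in the pivot above are entirely zero, and cosine distance is undefined for an all-zero vector, so those rows can come back as NaN from `pdist`. The sketch below shows one possible way to handle this by dropping all-zero rows before computing similarities. The course names, departments, and ratios are made up, and dropping the rows is an assumption rather than a decision the notebook has made.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical long-format rows shaped like f_instr_count_df for one term.
toy_term = pd.DataFrame({
    "crs": ["X100-001", "X100-002", "Y200-001", "Y200-002"],
    "home_organization": ["Dept A", "Dept A", "Dept B", "Dept B"],
    "dem_ratio": [0.5, 1.0, 0.25, 0.0],
})

toy_pivot = toy_term.pivot(
    index="crs", columns="home_organization", values="dem_ratio"
).fillna(0)

# Keep only rows with at least one non-zero entry; an all-zero row has
# no direction, so its cosine similarity to anything is undefined.
nonzero = toy_pivot.loc[(toy_pivot != 0).any(axis=1)]

toy_cos = pd.DataFrame(
    1 - squareform(pdist(nonzero, "cosine")),
    index=nonzero.index,
    columns=nonzero.index,
)
print(toy_cos.round(3))
```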
