# Major Map: An AI-Powered Academic Planner and Predictor for SJSU Students

## 1. Data Loading

In [50]:
!rm -rf major_map
!git clone https://github.com/7HE-LUCKY-FISH/major_map.git

Cloning into 'major_map'...
remote: Enumerating objects: 5215, done.
remote: Counting objects: 100% (4736/4736), done.
remote: Compressing objects: 100% (3550/3550), done.
remote: Total 5215 (delta 1126), reused 4663 (delta 1094), pack-reused 479 (from 1)
Receiving objects: 100% (5215/5215), 16.98 MiB | 14.19 MiB/s, done.
Resolving deltas: 100% (1259/1259), done.


In [51]:
!ls /content/

major_map  sample_data


In [52]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from pathlib import Path

In [53]:
# Path to repo in Colab
data_dir = Path("major_map/data/csv_data")

# Get all csv files
csv_files = sorted(data_dir.glob("*.csv"))

print("Found CSV files:")
for f in csv_files:
    print(" -", f.name)

# Combine all the csv files into one
# Check method number 4: https://medium.com/@stella96joshua/how-to-combine-multiple-csv-files-using-python-for-your-analysis-a88017c6ff9e
df_original = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

Found CSV files:
 - Fall-2022.csv
 - Fall-2023.csv
 - Fall-2024.csv
 - Spring-2022.csv
 - Spring-2023.csv
 - Spring-2024.csv
 - Spring-2025.csv


## 2. Data Understanding

### 2a. Basic Inspection

In [54]:
df_original.head()

Unnamed: 0,Section,Number,Mode,Title,Satifies,Unit,Type,Days,Times,Instructor,Location,Dates,Seats,Year,Semester
0,BIOL 10 (Section 01),40529,In Person,The Living World,GE: B2,3,LEC,TR,09:00AM-10:15AM,Allison Harness,SCI164,08/19/22-12/06/22,59,2022,Fall
1,BIOL 10 (Section 03),40060,In Person,The Living World,GE: B2,3,LEC,MW,10:30AM-11:45AM,Phillip Hawkins,SCI164,08/19/22-12/06/22,42,2022,Fall
2,BIOL 10 (Section 04),47603,Fully Online,The Living World,GE: B2,3,LEC,TBA,TBA,Phillip Hawkins,ONLINE,08/19/22-12/06/22,6,2022,Fall
3,BIOL 10 (Section 99),41828,Fully Online,The Living World,GE: B2,3,LEC,TBA,TBA,Mary Poffenroth,ONLINE,08/19/22-12/06/22,1,2022,Fall
4,CHEM 1A (Section 01),40081,In Person,General Chemistry,GE: B1+B3,5,LEC,MWF,09:30AM-10:20AM,Resa Kelly,SCI142,08/19/22-12/06/22,0,2022,Fall


In [55]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4007 entries, 0 to 4006
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Section     4007 non-null   object
 1   Number      4007 non-null   int64 
 2   Mode        4007 non-null   object
 3   Title       4007 non-null   object
 4   Satifies    1615 non-null   object
 5   Unit        4007 non-null   int64 
 6   Type        4007 non-null   object
 7   Days        4007 non-null   object
 8   Times       4007 non-null   object
 9   Instructor  4007 non-null   object
 10  Location    3945 non-null   object
 11  Dates       4007 non-null   object
 12  Seats       4007 non-null   int64 
 13  Year        4007 non-null   int64 
 14  Semester    4007 non-null   object
dtypes: int64(4), object(11)
memory usage: 469.7+ KB


In [56]:
df_original['Year'].value_counts()

Year,count
2024,1191
2022,1127
2023,1115
2025,574


In [57]:
df_original['Semester'].value_counts()

Semester,count
Spring,2160
Fall,1847


In [58]:
df_original['Instructor'].nunique()

646

In [59]:
df_original['Section'].nunique()

980

### 2b. Check Missing / Special Values

In [60]:
df_original['Times'].value_counts().head(10)

Times,count
10:30AM-11:45AM,371
12:00PM-01:15PM,366
09:00AM-10:15AM,316
01:30PM-02:45PM,300
03:00PM-04:15PM,263
04:30PM-05:45PM,185
06:00PM-08:45PM,138
TBA,102
09:00AM-11:45AM,102
03:00PM-05:45PM,97


In [61]:
df_original['Days'].value_counts().head(10)

Days,count
MW,1120
TR,989
F,400
T,343
W,310
R,289
M,229
MTWR,113
TBA,102
MWF,86


In [62]:
print(df_original['Instructor'].value_counts())
print('\n-----------------------------\n')
print(df_original['Section'].value_counts())
print('\n-----------------------------\n')
print(df_original['Semester'].value_counts())

Instructor
Richard Low                                 53
Padmavati Tanniru                           52
Alla Petrosyan                              51
Olga Kovaleva                               48
Medha Bodas                                 47
                                            ..
Neomi Millan                                 1
Peter Beyersdorf / Kenneth Wharton           1
Ehsan Khatami                                1
Nargis Adham / Azadeh Shahid Faylienejad     1
Resa Kelly                                   1
Name: count, Length: 646, dtype: int64

-----------------------------

Section
ENGR 100W (Section 16)    14
ENGR 100W (Section 14)    14
ENGR 100W (Section 06)    14
ENGR 100W (Section 18)    11
ENGR 100W (Section 12)    10
                          ..
CMPE 165 (Section 80)      1
CS 100W (Section 84)       1
CS 146 (Section 81)        1
CS 146 (Section 82)        1
CS 147 (Section 81)        1
Name: count, Length: 980, dtype: int64

-----------------------------

Sem

## 3. Data Preprocessing

In this section we go through each feature and decide how best to preprocess it so that it is useful for the machine learning model.

In [63]:
df_preprocess = df_original.copy()
df_preprocess.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4007 entries, 0 to 4006
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Section     4007 non-null   object
 1   Number      4007 non-null   int64 
 2   Mode        4007 non-null   object
 3   Title       4007 non-null   object
 4   Satifies    1615 non-null   object
 5   Unit        4007 non-null   int64 
 6   Type        4007 non-null   object
 7   Days        4007 non-null   object
 8   Times       4007 non-null   object
 9   Instructor  4007 non-null   object
 10  Location    3945 non-null   object
 11  Dates       4007 non-null   object
 12  Seats       4007 non-null   int64 
 13  Year        4007 non-null   int64 
 14  Semester    4007 non-null   object
dtypes: int64(4), object(11)
memory usage: 469.7+ KB


### 3a. "Section" Feature

Ex: BIOL 10 (Section 01)
* Convert every value in "Section" to a string, just in case. Anything that is not already a string becomes one; note that a missing value (NaN) would become the literal string "nan", so the missing count is taken before the conversion.

* Remove leading and trailing whitespace from each string so that "Section" can later be split into "Dept" and "CourseNumber", or combined into a single "Course".

In [64]:
# Count missing values first: after astype(str), NaN becomes the string "nan" and isna() no longer detects it
print("Missing Section:", df_preprocess['Section'].isna().sum())
df_preprocess['Section'] = df_preprocess['Section'].astype(str).str.strip()
print(df_preprocess['Section'].head())

Missing Section: 0
0    BIOL 10 (Section 01)
1    BIOL 10 (Section 03)
2    BIOL 10 (Section 04)
3    BIOL 10 (Section 99)
4    CHEM 1A (Section 01)
Name: Section, dtype: object
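
The NaN-to-"nan" behavior mentioned above is easy to overlook, so here is a minimal standard-library sketch of it. The `clean_section` helper is hypothetical (it is not part of this notebook); it only illustrates keeping missing values distinguishable while still stripping whitespace.

```python
import math

def clean_section(value):
    """Return a stripped string, or None when the value is missing (hypothetical helper)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return str(value).strip()

missing = float("nan")
as_text = str(missing)  # plain string conversion yields the text "nan"

# One padded section label and one missing value
cleaned = [clean_section(v) for v in ["  BIOL 10 (Section 01) ", missing]]
print(as_text, cleaned)
```

In this dataset "Section" has no missing values, so the distinction does not change any result here.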


### 3b. "Number" Feature

Ex: 40529
* Preprocess "Number" by ensuring it is an integer with no missing values.

* "Number" might not be used at all, because it is not really a meaningful or important feature.

In [65]:
df_preprocess['Number'] = df_preprocess['Number'].astype(int)
print("Missing Number:", df_preprocess['Number'].isna().sum())
print(df_preprocess['Number'].head())


Missing Number: 0
0    40529
1    40060
2    47603
3    41828
4    40081
Name: Number, dtype: int64


### 3c. "Mode" Feature

Ex: "In Person", "Fully Online", "Hybrid"

* There are only 3 unique values for "Mode". We will one-hot encode it later.

In [66]:
print("Unique Mode values:", df_preprocess['Mode'].unique())

Unique Mode values: ['In Person' 'Fully Online' 'Hybrid']
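
As a rough illustration of what one-hot encoding will do to "Mode", here is a small sketch in plain Python. The `one_hot_mode` function and the `MODES` list are assumptions made for this example only; in pandas, `pd.get_dummies` produces the same kind of 0/1 columns.

```python
# The three values printed by the cell above
MODES = ["In Person", "Fully Online", "Hybrid"]

def one_hot_mode(mode, categories=MODES):
    """Return a 0/1 list with a single 1 at the position of `mode` (illustrative helper)."""
    if mode not in categories:
        raise ValueError(f"Unknown mode: {mode!r}")
    return [1 if mode == c else 0 for c in categories]

encoded = one_hot_mode("Fully Online")
print(encoded)
```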


### 3d. "Title" Feature

Ex: "The Living World", "General Chemistry"

* Also not a very important feature, but we clean it just in case.

In [67]:
# Count missing values first: after astype(str), NaN becomes the string "nan" and isna() no longer detects it
print("Missing Title:", df_preprocess['Title'].isna().sum())
df_preprocess['Title'] = df_preprocess['Title'].astype(str).str.strip()
print(df_preprocess['Title'].head())

Missing Title: 0
0     The Living World
1     The Living World
2     The Living World
3     The Living World
4    General Chemistry
Name: Title, dtype: object


### 3e. "Satifies" Feature

* Can be used to track whether a section satisfies a GE area. Missing values (NaN) are filled with "MajorOnly".

In [68]:
print("Unique Satifies values:", df_preprocess['Satifies'].unique())
df_preprocess['Satifies'] = df_preprocess['Satifies'].fillna('MajorOnly')
print('------------------------------------------------------------')
print("Missing Satifies:", df_preprocess['Satifies'].isna().sum())

Unique Satifies values: ['GE: B2' 'GE: B1+B3' nan 'GE: S' 'GE: V' 'GE: WID' 'GE: A2' 'GE: C2'
 'GE: E' 'GE: WID+R' 'GE: B4' 'GE: 5B' 'GE: 5A+5C' 'GE: 4' 'GE: 3'
 'GE: 1A' 'GE: 3B' 'GE: WID+3' 'GE: 2']
------------------------------------------------------------
Missing Satifies: 0
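
Some values above, such as "GE: B1+B3", appear to list more than one area. The sketch below shows one way such a value could be split into individual areas; `parse_ge_areas` is a hypothetical helper, and treating "+" as the separator is an assumption based only on the values printed above.

```python
def parse_ge_areas(satifies):
    """Split a value such as 'GE: B1+B3' into ['B1', 'B3']; non-GE values give []."""
    text = str(satifies).strip()
    if not text.startswith("GE:"):
        return []
    # Assumption: '+' separates multiple GE areas within one value
    return [area.strip() for area in text[len("GE:"):].split("+") if area.strip()]

examples = {v: parse_ge_areas(v) for v in ["GE: B2", "GE: B1+B3", "GE: WID+R", "MajorOnly"]}
print(examples)
```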


### 3f. "Unit" Feature

* The unit can tell us about the hours and type of the course/section.

In [69]:
print(df_preprocess['Unit'].value_counts().sort_index())

Unit
0     905
1     234
2      60
3    2405
4     384
5      19
Name: count, dtype: int64


### 3g. "Type" Feature

* The types are LEC, SEM, and LAB. We will one-hot encode them later.

In [70]:
print("Unique Type values:", df_preprocess['Type'].unique())

Unique Type values: ['LEC' 'SEM' 'LAB']


### 3h. "Days" Feature

* Gives us the day pattern. TBA is a special flag, usually used for online and asynchronous classes.

In [71]:
print("Missing Days:", df_preprocess['Days'].isna().sum())
print("Unique Days values:", df_preprocess['Days'].unique())
print(df_preprocess['Days'].value_counts().sort_index())

Missing Days: 0
Unique Days values: ['TR' 'MW' 'TBA' 'MWF' 'F' 'T' 'W' 'R' 'M' 'S' 'MTWR']
Days
F        400
M        229
MTWR     113
MW      1120
MWF       86
R        289
S         26
T        343
TBA      102
TR       989
W        310
Name: count, dtype: int64


### 3i. "Times" Feature

* Gives the time range (start time and end time) of a section.

In [72]:
print("Missing Times:", df_preprocess['Times'].isna().sum())
print(df_preprocess['Times'].value_counts())

Missing Times: 0
Times
10:30AM-11:45AM    371
12:00PM-01:15PM    366
09:00AM-10:15AM    316
01:30PM-02:45PM    300
03:00PM-04:15PM    263
                  ... 
01:30PM-02:24PM      1
03:00PM-03:45PM      1
12:00AM-01:15AM      1
06:00PM-08:20PM      1
08:00AM-09:15AM      1
Name: count, Length: 114, dtype: int64


### 3j. "Instructor" Feature

* Gives us the professor's name.

In [73]:
print("Missing Instructor:", df_preprocess['Instructor'].isna().sum())
print("Number of unique instructors:", df_preprocess['Instructor'].nunique())
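# Note: `.unique` is referenced without parentheses, so this prints the bound method's repr
# (as in the output below) rather than the names; call `.unique()` to list them
print(df_preprocess['Instructor'].unique)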

Missing Instructor: 0
Number of unique instructors: 646
<bound method Series.unique of 0                 Allison Harness
1                 Phillip Hawkins
2                 Phillip Hawkins
3                 Mary Poffenroth
4                      Resa Kelly
                  ...            
4002    Azadeh Shahid Faylienejad
4003      Vakini Santhanakrishnan
4004               Ramen Bahuguna
4005                 Nargis Adham
4006               Ramen Bahuguna
Name: Instructor, Length: 4007, dtype: object>


### 3k. "Location" Feature

* Gives us the room the lecture is held in. Online classes are labelled "ONLINE".

In [74]:
print("Missing Location before fill:", df_preprocess['Location'].isna().sum())
df_preprocess['Location'] = df_preprocess['Location'].fillna('Unknown')
print("Missing Location after fill:", df_preprocess['Location'].isna().sum())
print('---------------------------------------------------------------')
print(df_preprocess['Location'].value_counts())
print('---------------------------------------------------------------')
print(df_preprocess['Location'].unique())

Missing Location before fill: 62
Missing Location after fill: 0
---------------------------------------------------------------
Location
ONLINE    366
MH424     143
MH224     105
MH323     101
ENG392     97
         ... 
BBC203      1
DMH161      1
BBC126      1
BBC107      1
ENG336      1
Name: count, Length: 195, dtype: int64
---------------------------------------------------------------
['SCI164' 'ONLINE' 'SCI142' 'MD101' 'DH412' 'DH506' 'DH507' 'ENG325'
 'ENG405' 'ENG337' 'ENG489' 'ENG343' 'ENG341' 'ENG286' 'ENG301' 'ENG331'
 'BBC003' 'ENG288' 'ENG206' 'CL222' 'DMH234' 'MH323' 'MH424' 'MH523'
 'MH222' 'MH223' 'SCI311' 'WSQ109' 'BBC202' 'MH225' 'MH422' 'MH233'
 'CL243' 'SCI258' 'DH450' 'DH351' 'BBC004' 'Unknown' 'ENG189' 'ENG258'
 'ENG290' 'ENG305' 'ENG345' 'ENG307' 'ENG238' 'ENG317' 'ENG319' 'ENG321'
 'ENG244' 'ENG289' 'ENG291' 'ENG376' 'BBC124' 'SH411' 'BBC121' 'BBC128'
 'BBC122' 'SH348' 'DMH354' 'BBC123' 'BBC221' 'CL316' 'BBC130' 'CL225B'
 'SH444' 'CL225A' 'BBC225' 'DMH347' 'SH4

### 3l. "Dates" Feature

* Not really important, since we already have Year and Semester.

In [75]:
print("Missing Dates:", df_preprocess['Dates'].isna().sum())
print(df_preprocess['Dates'].value_counts())

Missing Dates: 0
Dates
08/21/24-12/09/24    631
08/21/23-12/06/23    626
08/19/22-12/06/22    590
01/23/25-05/12/25    574
01/24/24-05/13/24    560
01/26/22-05/16/22    537
01/25/23-05/15/23    489
Name: count, dtype: int64


### 3m. "Seats" Feature

* Number of seats left in a section. Not really useful, so we will drop this feature.

In [76]:
df_preprocess['Seats'] = df_preprocess['Seats'].astype(int)
print("Missing Seats:", df_preprocess['Seats'].isna().sum())
print(df_preprocess['Seats'].describe())

Missing Seats: 0
count    4007.000000
mean        3.279760
std         9.260627
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max       181.000000
Name: Seats, dtype: float64


### 3n. "Year" Feature

* Calendar Year: should be only 2022, 2023, 2024 and 2025

In [77]:
print("Missing Year:", df_preprocess['Year'].isna().sum())
print("Years:", df_preprocess['Year'].unique())

Missing Year: 0
Years: [2022 2023 2024 2025]


### 3o. "Semester" Feature

* Term name: We only focus on Fall and Spring (no Winter or Summer)

In [78]:
print("Missing Semester:", df_preprocess['Semester'].isna().sum())
print("Semesters:", df_preprocess['Semester'].unique())

Missing Semester: 0
Semesters: ['Fall' 'Spring']


In [79]:
df_preprocess.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4007 entries, 0 to 4006
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Section     4007 non-null   object
 1   Number      4007 non-null   int64 
 2   Mode        4007 non-null   object
 3   Title       4007 non-null   object
 4   Satifies    4007 non-null   object
 5   Unit        4007 non-null   int64 
 6   Type        4007 non-null   object
 7   Days        4007 non-null   object
 8   Times       4007 non-null   object
 9   Instructor  4007 non-null   object
 10  Location    4007 non-null   object
 11  Dates       4007 non-null   object
 12  Seats       4007 non-null   int64 
 13  Year        4007 non-null   int64 
 14  Semester    4007 non-null   object
dtypes: int64(4), object(11)
memory usage: 469.7+ KB


## 4. Data Engineering

In [80]:
# Make a copy for data engineering
df_engineer = df_preprocess.copy()

Originally, we have a total of 15 features:
1. Section          (original)
2. Number           (original, not used as feature)
3. Mode             (original, categorical feature)
4. Title            (original, usually not used as feature in v1)
5. Satifies         (original, optional categorical feature)
6. Unit             (original, numeric feature)
7. Type             (original, categorical feature)
8. Days             (original, used to build Slot, not as input feature)
9. Times            (original, used to build StartMinutes/Slot)
10. Instructor      (original, used to create the Instructor_ID target)
11. Location        (original)
12. Dates           (original, not used as feature)
13. Seats           (original)
14. Year            (original, numeric feature)
15. Semester        (original, categorical feature)

New features we will get from data engineering:
1. Dept            (engineered from Section)
2. CourseNumber    (engineered from Section)
3. CourseCode      (engineered from Dept + CourseNumber, optional)
4. HasGE           (engineered from Satifies, 0/1 flag, optional)
5. StartMinutes    (engineered from Times)
6. EndMinutes      (engineered from Times)
7. DurationMinutes (engineered from EndMinutes - StartMinutes)
8. Slot            (engineered from Days + StartMinutes + EndMinutes, e.g. "MWF_570_620")
9. Instructor_ID   (engineered target from Instructor)
10. Slot_ID        (engineered target from Slot)
11. CourseCode_ID  (label-encoded from CourseCode)
12. Building       (engineered from Location, e.g. "SCI", "MD", "ONLINE")
13. Term           (engineered from Year + Semester, e.g. "2022_Fall")
14. SemesterIndex  (engineered from Term, 0,1,2,... in time order)

---






### 4a. Section → Dept, CourseNumber, and CourseCode

In [81]:
section = df_engineer['Section'].str.extract(r'^(\w+)\s+([^\s]+)')

df_engineer['Dept'] = section[0]
df_engineer['CourseNumber'] = section[1]
df_engineer['CourseCode'] = df_engineer['Dept'].astype(str) + ' ' + df_engineer['CourseNumber'].astype(str)
df_engineer[['Section', 'Dept', 'CourseNumber', 'CourseCode']].head()

Unnamed: 0,Section,Dept,CourseNumber,CourseCode
0,BIOL 10 (Section 01),BIOL,10,BIOL 10
1,BIOL 10 (Section 03),BIOL,10,BIOL 10
2,BIOL 10 (Section 04),BIOL,10,BIOL 10
3,BIOL 10 (Section 99),BIOL,10,BIOL 10
4,CHEM 1A (Section 01),CHEM,1A,CHEM 1A
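
To see what the extraction pattern captures on a single value, here is a minimal sketch using the standard `re` module with the same regular expression. The `split_section` helper is hypothetical and exists only for this illustration.

```python
import re

# Same pattern as in the cell above: a word, whitespace, then a run of non-space characters
SECTION_RE = re.compile(r"^(\w+)\s+([^\s]+)")

def split_section(section):
    """Return (dept, course_number), or (None, None) if the pattern does not match."""
    m = SECTION_RE.match(section.strip())
    return (m.group(1), m.group(2)) if m else (None, None)

dept, number = split_section("CHEM 1A (Section 01)")
course_code = f"{dept} {number}"
print(dept, number, course_code)
```

A value with no space-separated course number would not match. With `str.extract`, such rows get NaN in both columns, which the later `astype(str)` would turn into the text "nan", so it may be worth checking for unmatched rows before building "CourseCode".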


### 4b. Times → StartMinutes, EndMinutes, and DurationMinutes

* Start and end are expressed as minutes after midnight (12:00 AM). TBA rows get -1 in all three columns.

In [82]:
def parse_time_range(s):
  if s == "TBA" or "-" not in s:
    return -1, -1, -1

  # s looks like "09:00AM-10:15AM"
  # split the start and end with '-'
  start_str, end_str = s.split('-')

  # parse each part as time
  # format='%I:%M%p'
  # %I = hour in 12-hour clock (01-12)
  # %M = minutes (00-59)
  # %p = AM/PM
  start_dt = pd.to_datetime(start_str, format='%I:%M%p')
  end_dt   = pd.to_datetime(end_str,   format='%I:%M%p')

  # convert to "minutes since midnight"
  start_min = start_dt.hour * 60 + start_dt.minute
  end_min = end_dt.hour * 60 + end_dt.minute

  # section duration
  duration_min = end_min - start_min

  return start_min, end_min, duration_min

# Apply to df_engineer['Times'] and create three new columns
df_engineer[['StartMinutes', 'EndMinutes', 'DurationMinutes']] = df_engineer['Times'].apply(
    lambda s: pd.Series(parse_time_range(s))
)

df_engineer[['Times', 'StartMinutes', 'EndMinutes', 'DurationMinutes']].head()


Unnamed: 0,Times,StartMinutes,EndMinutes,DurationMinutes
0,09:00AM-10:15AM,540,615,75
1,10:30AM-11:45AM,630,705,75
2,TBA,-1,-1,-1
3,TBA,-1,-1,-1
4,09:30AM-10:20AM,570,620,50
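
The conversion to minutes after midnight can be checked by hand on a single time string. Below is a minimal sketch using `datetime.strptime` from the standard library with the same `%I:%M%p` format; `minutes_since_midnight` is a name chosen for this example, not a function from the notebook.

```python
from datetime import datetime

def minutes_since_midnight(clock):
    """Convert a 12-hour clock string such as '09:30AM' to minutes after midnight."""
    t = datetime.strptime(clock.strip(), "%I:%M%p")
    return t.hour * 60 + t.minute

start = minutes_since_midnight("09:30AM")   # 9 * 60 + 30
end = minutes_since_midnight("10:20AM")     # 10 * 60 + 20
duration = end - start

noon = minutes_since_midnight("12:00PM")      # 12 PM is hour 12
midnight = minutes_since_midnight("12:00AM")  # 12 AM is hour 0
print(start, end, duration, noon, midnight)
```

The 12 AM case matters here: the "Times" value counts include one `12:00AM-01:15AM` row, which parses as a class starting at minute 0. That may be a data-entry issue in the source schedule and could be worth a manual check.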


### 4c. Days + StartMinutes + EndMinutes → Slot

* Combines the day pattern with the start and end minutes (TBA rows become "<Days>_TBA"). This feature is important for schedule planning later.

* We will later turn it into Slot_ID.

In [83]:
def make_slot(row):
    days = str(row['Days']).strip()
    start = int(row['StartMinutes'])
    end   = int(row['EndMinutes'])

    # If this was a TBA row that you encoded as -1
    if start == -1 or end == -1:
        return days + '_TBA'

    return f"{days}_{start}_{end}"

df_engineer['Slot'] = df_engineer.apply(make_slot, axis=1)
df_engineer[['Days', 'Times', 'StartMinutes', 'Slot']].head()

Unnamed: 0,Days,Times,StartMinutes,Slot
0,TR,09:00AM-10:15AM,540,TR_540_615
1,MW,10:30AM-11:45AM,630,MW_630_705
2,TBA,TBA,-1,TBA_TBA
3,TBA,TBA,-1,TBA_TBA
4,MWF,09:30AM-10:20AM,570,MWF_570_620
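
Since the slot string is meant to support schedule planning, one natural use is checking whether two sections clash. The sketch below is only an illustration under stated assumptions: `parse_slot` and `slots_conflict` are hypothetical helpers, each day is assumed to be a single letter (M, T, W, R, F, S), and TBA slots are assumed never to conflict.

```python
def parse_slot(slot):
    """Split 'TR_540_615' into (set of day letters, start, end); TBA slots give None."""
    days, _, rest = slot.partition("_")
    if days == "TBA" or rest == "TBA":
        return None
    start, end = (int(x) for x in rest.split("_"))
    return set(days), start, end

def slots_conflict(a, b):
    """True when two slots share a day and their minute ranges overlap."""
    pa, pb = parse_slot(a), parse_slot(b)
    if pa is None or pb is None:
        return False  # assumption: unscheduled (TBA) sections never conflict
    days_a, start_a, end_a = pa
    days_b, start_b, end_b = pb
    return bool(days_a & days_b) and start_a < end_b and start_b < end_a

print(slots_conflict("TR_540_615", "T_600_660"))   # shares Tuesday, times overlap
print(slots_conflict("TR_540_615", "MW_540_615"))  # same time, no shared day
```

Back-to-back classes (one ending at the minute the next begins) are treated as compatible because the comparison is strict.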


### 4d. Satifies → HasGE

* Pretty straightforward: a section whose "Satifies" value starts with "GE:" satisfies a GE area (HasGE = 1); everything else, including "MajorOnly", gets 0. This feature can be useful for building roadmaps as well as schedules later.


In [84]:
print(df_engineer['Satifies'].unique())
print('---------------------------------------------')
# 1 if it's a GE area ("GE: ..."), 0 if MajorOnly or anything else
df_engineer['HasGE'] = df_engineer['Satifies'].astype(str).str.startswith('GE:').astype(int)
print('---------------------------------------------')
print(df_engineer[['Satifies', 'HasGE']].head())
print('---------------------------------------------')
print(df_engineer['HasGE'].value_counts())

['GE: B2' 'GE: B1+B3' 'MajorOnly' 'GE: S' 'GE: V' 'GE: WID' 'GE: A2'
 'GE: C2' 'GE: E' 'GE: WID+R' 'GE: B4' 'GE: 5B' 'GE: 5A+5C' 'GE: 4'
 'GE: 3' 'GE: 1A' 'GE: 3B' 'GE: WID+3' 'GE: 2']
---------------------------------------------
---------------------------------------------
    Satifies  HasGE
0     GE: B2      1
1     GE: B2      1
2     GE: B2      1
3     GE: B2      1
4  GE: B1+B3      1
---------------------------------------------
HasGE
0    2392
1    1615
Name: count, dtype: int64


### 4e. Location → Building

* Ex: SCI164 becomes SCI, while ONLINE stays ONLINE
* Note: MD is Morris Dailey Auditorium; the SJSU website has changed the name to Town Hall


In [85]:
def get_building(location):
  location = str(location).strip()
  if location in ['ONLINE', 'Unknown']:
    return location

  prefix = ''
  for ch in location:
    if ch.isalpha():
      prefix += ch
    else:
      break
  return prefix if prefix else 'Unknown'

df_engineer['Building'] = df_engineer['Location'].apply(get_building)
print(df_engineer['Building'].value_counts())

Building
ENG        1159
MH          662
DH          516
ONLINE      366
SCI         314
BBC         281
SH          188
CL          176
WSQ         138
DMH          90
Unknown      62
MD           24
ISB          15
YUH           9
HGH           2
CCB           2
IS            2
DBH           1
Name: count, dtype: int64
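
The same prefix extraction can be written more compactly with a regular expression. This is a sketch of an equivalent approach, not the notebook's own method; `building_prefix` is a hypothetical name, and the pattern only covers ASCII letters, which is sufficient for the room codes shown above.

```python
import re

def building_prefix(location):
    """Return the leading letters of a room code, e.g. 'SCI164' -> 'SCI'."""
    location = str(location).strip()
    if location in ("ONLINE", "Unknown"):
        return location
    m = re.match(r"[A-Za-z]+", location)
    return m.group(0) if m else "Unknown"

rooms = ["SCI164", "CL225B", "ONLINE", "Unknown", "101"]
buildings = [building_prefix(r) for r in rooms]
print(buildings)
```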


### 4f. Year + Semester → Term

In [86]:
df_engineer['Term'] = df_engineer['Year'].astype(str) + '_' + df_engineer['Semester'].astype(str)
df_engineer[['Year', 'Semester', 'Term']].head()

Unnamed: 0,Year,Semester,Term
0,2022,Fall,2022_Fall
1,2022,Fall,2022_Fall
2,2022,Fall,2022_Fall
3,2022,Fall,2022_Fall
4,2022,Fall,2022_Fall


### 4g. Year + Semester → SemesterIndex

* Sort the unique terms and assign 0, 1, 2, ... in time order

In [87]:
sem_order = {'Spring': 0, 'Fall': 1}

# Unique (Year, Semester) combos
term_df = (
    df_engineer[['Year', 'Semester']]
    .drop_duplicates()
    .copy()
)


# Sort by Year, then Spring/Fall
term_df['sem_order'] = term_df['Semester'].map(sem_order)
term_df = term_df.sort_values(['Year', 'sem_order']).reset_index(drop=True)

# Assign 0,1,2,... as SemesterIndex
term_df['SemesterIndex'] = term_df.index

print(term_df)

# Merge into df_engineer
df_engineer = df_engineer.merge(
    term_df[['Year', 'Semester', 'SemesterIndex']],
    on=['Year', 'Semester'],
    how='left'
)

print(
    df_engineer[['Year', 'Semester', 'SemesterIndex']]
    .drop_duplicates()
    .sort_values('SemesterIndex')
)


   Year Semester  sem_order  SemesterIndex
0  2022   Spring          0              0
1  2022     Fall          1              1
2  2023   Spring          0              2
3  2023     Fall          1              3
4  2024   Spring          0              4
5  2024     Fall          1              5
6  2025   Spring          0              6
      Year Semester  SemesterIndex
1847  2022   Spring              0
0     2022     Fall              1
2384  2023   Spring              2
590   2023     Fall              3
2873  2024   Spring              4
1216  2024     Fall              5
3433  2025   Spring              6


### 4h. Label Encoding Targets: Instructor, Slot, and CourseCode
* Encode each unique instructor, slot, and course code as an integer ID using `LabelEncoder`.

In [88]:
instr_le = LabelEncoder()
df_engineer['Instructor_ID'] = instr_le.fit_transform(df_engineer['Instructor'])

print("Number of unique instructors:", df_engineer['Instructor_ID'].nunique())
df_engineer[['Instructor', 'Instructor_ID']].head()

Number of unique instructors: 646


Unnamed: 0,Instructor,Instructor_ID
0,Allison Harness,35
1,Phillip Hawkins,439
2,Phillip Hawkins,439
3,Mary Poffenroth,363
4,Resa Kelly,471


In [89]:
# LabelEncode Slot → Slot_ID
slot_le = LabelEncoder()
df_engineer['Slot_ID'] = slot_le.fit_transform(df_engineer['Slot'])

print("Number of unique slots:", df_engineer['Slot_ID'].nunique())
df_engineer[['Slot', 'Slot_ID']].head()

Number of unique slots: 261


Unnamed: 0,Slot,Slot_ID
0,TR_540_615,180
1,MW_630_705,90
2,TBA_TBA,168
3,TBA_TBA,168
4,MWF_570_620,63


In [90]:
course_le = LabelEncoder()
df_engineer['CourseCode_ID'] = course_le.fit_transform(df_engineer['CourseCode'])

print("Number of unique course codes:", df_engineer['CourseCode_ID'].nunique())
df_engineer[['CourseCode', 'CourseCode_ID']].head()

Number of unique course codes: 92


Unnamed: 0,CourseCode,CourseCode_ID
0,BIOL 10,0
1,BIOL 10,0
2,BIOL 10,0
3,BIOL 10,0
4,CHEM 1A,1
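
For intuition about what the encoders above are doing, here is a plain-Python sketch of the same idea: each distinct label is mapped to an integer according to its sorted position, which is also how `LabelEncoder` orders its classes. The `fit_label_map` helper and the sample values are made up for this illustration.

```python
def fit_label_map(values):
    """Map each distinct value to an integer, in sorted order of the labels."""
    classes = sorted(set(values))
    return {label: idx for idx, label in enumerate(classes)}, classes

codes = ["CHEM 1A", "BIOL 10", "BIOL 10", "CS 146"]
to_id, classes = fit_label_map(codes)

ids = [to_id[c] for c in codes]          # forward mapping: label -> integer ID
decoded = [classes[i] for i in ids]      # inverse mapping: integer ID -> label
print(ids, decoded)
```

The integers are arbitrary identifiers rather than ordered quantities. The fitted encoders (`instr_le`, `slot_le`, `course_le`) should be kept around, since their `inverse_transform` method is what turns predicted IDs back into instructor names, slots, and course codes.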


In [91]:
print(df_engineer.info())
print("-------------------------------------------")
df_engineer.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4007 entries, 0 to 4006
Data columns (total 29 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Section          4007 non-null   object
 1   Number           4007 non-null   int64 
 2   Mode             4007 non-null   object
 3   Title            4007 non-null   object
 4   Satifies         4007 non-null   object
 5   Unit             4007 non-null   int64 
 6   Type             4007 non-null   object
 7   Days             4007 non-null   object
 8   Times            4007 non-null   object
 9   Instructor       4007 non-null   object
 10  Location         4007 non-null   object
 11  Dates            4007 non-null   object
 12  Seats            4007 non-null   int64 
 13  Year             4007 non-null   int64 
 14  Semester         4007 non-null   object
 15  Dept             4007 non-null   object
 16  CourseNumber     4007 non-null   object
 17  CourseCode       4007 non-null   

Unnamed: 0,Section,Number,Mode,Title,Satifies,Unit,Type,Days,Times,Instructor,...,EndMinutes,DurationMinutes,Slot,HasGE,Building,Term,SemesterIndex,Instructor_ID,Slot_ID,CourseCode_ID
0,BIOL 10 (Section 01),40529,In Person,The Living World,GE: B2,3,LEC,TR,09:00AM-10:15AM,Allison Harness,...,615,75,TR_540_615,1,SCI,2022_Fall,1,35,180,0
1,BIOL 10 (Section 03),40060,In Person,The Living World,GE: B2,3,LEC,MW,10:30AM-11:45AM,Phillip Hawkins,...,705,75,MW_630_705,1,SCI,2022_Fall,1,439,90,0
2,BIOL 10 (Section 04),47603,Fully Online,The Living World,GE: B2,3,LEC,TBA,TBA,Phillip Hawkins,...,-1,-1,TBA_TBA,1,ONLINE,2022_Fall,1,439,168,0
3,BIOL 10 (Section 99),41828,Fully Online,The Living World,GE: B2,3,LEC,TBA,TBA,Mary Poffenroth,...,-1,-1,TBA_TBA,1,ONLINE,2022_Fall,1,363,168,0
4,CHEM 1A (Section 01),40081,In Person,General Chemistry,GE: B1+B3,5,LEC,MWF,09:30AM-10:20AM,Resa Kelly,...,620,50,MWF_570_620,1,SCI,2022_Fall,1,471,63,1


# Major Map: Neural Network Models

* MLP
* RNN
* LSTM
* GRU
* CNN
* TCN
* Transformer
* Temporal Fusion Transformer

Important features: Mode, Unit, Type, Year, Semester, Dept, CourseCode, StartMinutes, EndMinutes, DurationMinutes, HasGE, Building, Term, SemesterIndex, Instructor_ID, Slot_ID, CourseCode_ID

Useful: Dept, Mode, Type

In [92]:
# Neural network dataframe
df_nn = df_engineer.copy()

## Two-head MLP

Goal: given a course and a future term, predict the “offering pair”:
* Which professor teaches it
* Which time slot it happens in

Inputs:
* CourseCode_ID (main)
* SemesterIndex (time)

Outputs:
* Instructor_ID
* Slot_ID
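
Because a course can plausibly be taught by several instructors or in several slots, the evaluation code further down reports top-3 accuracy alongside top-1 accuracy for each output. The following sketch shows what that metric means on a toy example in plain Python; the scores and class IDs are invented for illustration and do not come from the dataset.

```python
def topk_accuracy(scores, targets, k=3):
    """Fraction of rows whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)

# Toy scores for 3 offerings over 4 hypothetical instructor IDs
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # true class 2 is ranked third
    [0.50, 0.30, 0.15, 0.05],  # true class 3 is ranked last
]
targets = [1, 2, 3]

top1 = topk_accuracy(scores, targets, k=1)
top3 = topk_accuracy(scores, targets, k=3)
print(top1, top3)
```

Here only the first row is a top-1 hit, while the first two rows are top-3 hits, so top-3 accuracy is always at least as high as top-1.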



In [93]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# ------------- 1) Select columns and clean -------------
df = df_nn.copy()

needed = ["CourseCode_ID", "SemesterIndex", "Instructor_ID", "Slot_ID"]
for c in needed:
    if c not in df.columns:
        raise ValueError(f"Missing column: {c}")


# Make sure they're integers where needed
df["CourseCode_ID"]  = df["CourseCode_ID"].astype(int)
df["Instructor_ID"]  = df["Instructor_ID"].astype(int)
df["Slot_ID"]        = df["Slot_ID"].astype(int)
df["SemesterIndex"]  = df["SemesterIndex"].astype(float)

# ------------- 2) Time-based split -------------
cutoff = df["SemesterIndex"].quantile(0.6)
train_df = df[df["SemesterIndex"] <= cutoff].copy()
test_df  = df[df["SemesterIndex"] >  cutoff].copy()

print("Train semesters:", sorted(train_df["SemesterIndex"].unique()))
print("Test semesters:",  sorted(test_df["SemesterIndex"].unique()))

# ------------- 3) Scaling -------------
scaler = StandardScaler()
train_sem = scaler.fit_transform(train_df[["SemesterIndex"]].values)
test_sem  = scaler.transform(test_df[["SemesterIndex"]].values)

# ------------- 4) Vocabulary sizes (embedding table and output heads) -------------
n_courses = int(df["CourseCode_ID"].max()) + 1
n_instr   = int(df["Instructor_ID"].max()) + 1
n_slots   = int(df["Slot_ID"].max()) + 1

print("n_courses:", n_courses, "n_instr:", n_instr, "n_slots:", n_slots)

# ------------- 5) Dataset / DataLoader -------------
class CourseDataset(Dataset):
    def __init__(self, course_ids, sem_scaled, y_instr, y_slot):
        self.course_ids = torch.tensor(course_ids, dtype=torch.long)
        self.sem = torch.tensor(sem_scaled, dtype=torch.float32)  # shape (N,1)
        self.y_instr = torch.tensor(y_instr, dtype=torch.long)
        self.y_slot = torch.tensor(y_slot, dtype=torch.long)

    def __len__(self):
        return len(self.course_ids)

    def __getitem__(self, idx):
        return self.course_ids[idx], self.sem[idx], self.y_instr[idx], self.y_slot[idx]

train_ds = CourseDataset(
    train_df["CourseCode_ID"].values,
    train_sem,
    train_df["Instructor_ID"].values,
    train_df["Slot_ID"].values
)

test_ds = CourseDataset(
    test_df["CourseCode_ID"].values,
    test_sem,
    test_df["Instructor_ID"].values,
    test_df["Slot_ID"].values
)

train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
test_loader  = DataLoader(test_ds, batch_size=512, shuffle=False)

# ------------- 6) Two-head MLP model -------------
class TwoHeadMLP(nn.Module):
    def __init__(self, n_courses, n_instr, n_slots, embed_dim=16):
        super().__init__()
        self.emb = nn.Embedding(n_courses, embed_dim)

        # input = embedding + 1 numeric feature
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + 1, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 32),
            nn.ReLU(),
        )

        self.head_instr = nn.Linear(32, n_instr)
        self.head_slot  = nn.Linear(32, n_slots)

    def forward(self, course_id, sem_scaled):
        # course_id: (B,), sem_scaled: (B,1)
        e = self.emb(course_id)  # (B, embed_dim)
        x = torch.cat([e, sem_scaled], dim=1)  # (B, embed_dim+1)
        h = self.mlp(x)
        logits_instr = self.head_instr(h)
        logits_slot  = self.head_slot(h)
        return logits_instr, logits_slot

embed_dim = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TwoHeadMLP(n_courses, n_instr, n_slots, embed_dim=embed_dim).to(device)

# Loss for each head
crit_instr = nn.CrossEntropyLoss()
crit_slot  = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ------------- 7) Metrics -------------
@torch.no_grad()
def topk_accuracy(logits, y_true, k=3):
    # logits: (B,C), y_true: (B,)
    topk = torch.topk(logits, k=k, dim=1).indices  # (B,k)
    correct = (topk == y_true.unsqueeze(1)).any(dim=1).float().mean().item()
    return correct

@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    total = 0
    instr_correct1 = 0
    slot_correct1 = 0
    instr_top3_sum = 0.0
    slot_top3_sum = 0.0

    for course_id, sem_scaled, y_instr, y_slot in loader:
        course_id = course_id.to(device)
        sem_scaled = sem_scaled.to(device)
        y_instr = y_instr.to(device)
        y_slot = y_slot.to(device)

        logits_instr, logits_slot = model(course_id, sem_scaled)

        # Top-1
        instr_pred = logits_instr.argmax(dim=1)
        slot_pred  = logits_slot.argmax(dim=1)
        instr_correct1 += (instr_pred == y_instr).sum().item()
        slot_correct1  += (slot_pred == y_slot).sum().item()

        # Top-3
        instr_top3_sum += topk_accuracy(logits_instr, y_instr, k=3) * len(course_id)
        slot_top3_sum  += topk_accuracy(logits_slot,  y_slot,  k=3) * len(course_id)

        total += len(course_id)

    return {
        "instr_acc": instr_correct1 / total,
        "slot_acc": slot_correct1 / total,
        "instr_top3": instr_top3_sum / total,
        "slot_top3": slot_top3_sum / total
    }

# ------------- 8) Train loop -------------
def train_one_epoch(model, loader):
    model.train()
    total_loss = 0.0
    total = 0

    for course_id, sem_scaled, y_instr, y_slot in loader:
        course_id = course_id.to(device)
        sem_scaled = sem_scaled.to(device)
        y_instr = y_instr.to(device)
        y_slot = y_slot.to(device)

        optimizer.zero_grad()
        logits_instr, logits_slot = model(course_id, sem_scaled)

        loss_instr = crit_instr(logits_instr, y_instr)
        loss_slot  = crit_slot(logits_slot, y_slot)

        loss = loss_instr + loss_slot
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * len(course_id)
        total += len(course_id)

    return total_loss / total

# ------------- 9) Run epochs -------------

# variables for early stopping
best_score = -1.0
patience = 8
bad = 0
best_state = None

for epoch in range(1, 201):  # up to 200 epochs; early stopping may end training sooner
    train_loss = train_one_epoch(model, train_loader)
    metrics = evaluate(model, test_loader)

    # Early-stopping score: instructor top-3 accuracy.
    # Caveat: it is measured on the test split, so the best score is optimistically biased.
    score = metrics["instr_top3"]

    print(
        f"Epoch {epoch:02d} | loss {train_loss:.4f} | "
        f"instr_acc {metrics['instr_acc']:.4f} top3 {metrics['instr_top3']:.4f} | "
        f"slot_acc {metrics['slot_acc']:.4f} top3 {metrics['slot_top3']:.4f}"
    )

    # checks for early stopping
    if score > best_score:
        best_score = score
        bad = 0
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    else:
        bad += 1
        if bad >= patience:
            print(f"Early stopping. Best instr_top3={best_score:.4f}")
            break

# Restore best weights
model.load_state_dict(best_state)

Train semesters: [np.float64(0.0), np.float64(1.0), np.float64(2.0), np.float64(3.0), np.float64(4.0)]
Test semesters: [np.float64(5.0), np.float64(6.0)]
n_courses: 92 n_instr: 646 n_slots: 261
Epoch 01 | loss 12.0199 | instr_acc 0.0025 top3 0.0141 | slot_acc 0.0216 top3 0.0564
Epoch 02 | loss 11.9114 | instr_acc 0.0133 top3 0.0315 | slot_acc 0.0382 top3 0.0946
Epoch 03 | loss 11.7308 | instr_acc 0.0149 top3 0.0299 | slot_acc 0.0398 top3 0.1129
Epoch 04 | loss 11.4026 | instr_acc 0.0158 top3 0.0340 | slot_acc 0.0589 top3 0.1386
Epoch 05 | loss 10.9124 | instr_acc 0.0149 top3 0.0340 | slot_acc 0.0423 top3 0.1162
Epoch 06 | loss 10.4912 | instr_acc 0.0174 top3 0.0365 | slot_acc 0.0407 top3 0.1187
Epoch 07 | loss 10.1971 | instr_acc 0.0216 top3 0.0523 | slot_acc 0.0423 top3 0.1178
Epoch 08 | loss 9.9665 | instr_acc 0.0274 top3 0.0772 | slot_acc 0.0415 top3 0.1178
Epoch 09 | loss 9.7522 | instr_acc 0.0357 top3 0.0863 | slot_acc 0.0382 top3 0.1170
Epoch 10 | loss 9.5372 | instr_acc 0.0456 t

<All keys matched successfully>
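
As a sanity check on the log above, the summed cross-entropy of a model that spreads probability uniformly over every class is ln(n_instr) + ln(n_slots). The sketch below computes that reference value from the class counts printed above (646 instructors, 261 slots), together with the top-3 hit rate of a random guess. The helper name `uniform_baseline` is made up for illustration and is not part of the notebook.

```python
import math

def uniform_baseline(n_classes, k=3):
    """Cross-entropy and top-k hit rate of a uniform guess over n_classes."""
    return math.log(n_classes), min(k, n_classes) / n_classes

# Class counts taken from the printed output: n_instr=646, n_slots=261
ce_instr, top3_instr = uniform_baseline(646)
ce_slot, top3_slot = uniform_baseline(261)

# The training loop adds the two head losses, so the baselines add as well
summed_ce = ce_instr + ce_slot
print(f"uniform summed loss: {summed_ce:.4f}")
print(f"chance top-3: instr {top3_instr:.4f}, slot {top3_slot:.4f}")
```

The epoch-1 training loss of 12.0199 sits at this uniform level (about 12.04), so the loss values are best read relative to that starting point rather than in absolute terms.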

### Single-Head MLP

In [94]:
df["Pair_ID"] = list(zip(df["Instructor_ID"], df["Slot_ID"]))
df["Pair_ID"] = pd.factorize(df["Pair_ID"])[0]
print("Unique pairs:", df["Pair_ID"].nunique())

Unique pairs: 2238
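
Because the model will predict a `Pair_ID`, it helps to be able to map an ID back to its (instructor, slot) pair. The sketch below shows one way to do that on a toy frame, assuming a `groupby(...).ngroup()` encoding instead of the notebook's `pd.factorize` call. With `sort=False` the IDs follow order of first appearance. The toy values and the `decode_pair` helper are hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are made up)
toy = pd.DataFrame({"Instructor_ID": [7, 7, 3, 7], "Slot_ID": [1, 2, 1, 1]})

# One integer per distinct (Instructor_ID, Slot_ID) combination,
# numbered in order of first appearance because sort=False
toy["Pair_ID"] = toy.groupby(["Instructor_ID", "Slot_ID"], sort=False).ngroup()

# Lookup table: Pair_ID -> original instructor and slot IDs
lookup = toy.drop_duplicates("Pair_ID").set_index("Pair_ID")[["Instructor_ID", "Slot_ID"]]

def decode_pair(pair_id):
    """Return the (instructor_id, slot_id) tuple behind a Pair_ID."""
    row = lookup.loc[pair_id]
    return int(row["Instructor_ID"]), int(row["Slot_ID"])

print(toy["Pair_ID"].tolist())
print(decode_pair(2))
```

The same lookup can turn the top-3 predicted `Pair_ID`s into readable instructor and time-slot suggestions.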


In [95]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

torch.manual_seed(42)
np.random.seed(42)

# ===== 1) Clean + create IDs =====
df = df_nn.copy()
needed = ["CourseCode_ID", "SemesterIndex", "Instructor_ID", "Slot_ID", "Dept", "Mode", "Type"]
df = df.dropna(subset=needed).copy()

df["CourseCode_ID"] = df["CourseCode_ID"].astype(int)
df["Instructor_ID"] = df["Instructor_ID"].astype(int)
df["Slot_ID"] = df["Slot_ID"].astype(int)
df["SemesterIndex"] = df["SemesterIndex"].astype(float)

# Factorize categorical strings to compact integer IDs
df["Dept_ID"] = pd.factorize(df["Dept"].astype(str))[0].astype(int)
df["Mode_ID"] = pd.factorize(df["Mode"].astype(str))[0].astype(int)
df["Type_ID"] = pd.factorize(df["Type"].astype(str))[0].astype(int)

n_depts = int(df["Dept_ID"].max()) + 1
n_modes = int(df["Mode_ID"].max()) + 1
n_types = int(df["Type_ID"].max()) + 1

# ===== 2) Create Pair_ID target =====
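# Wrap the tuples in a Series: passing a plain list to pd.factorize is deprecated
# in recent pandas versions and emits a FutureWarning.
pairs = pd.Series(list(zip(df["Instructor_ID"], df["Slot_ID"])), index=df.index)
df["Pair_ID"], pair_uniques = pd.factorize(pairs)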
df["Pair_ID"] = df["Pair_ID"].astype(int)

n_pairs = int(df["Pair_ID"].max()) + 1
print("Unique pairs:", n_pairs)

# ===== 3) Time split =====
cutoff = df["SemesterIndex"].quantile(0.8)
train_df = df[df["SemesterIndex"] <= cutoff].copy()
test_df  = df[df["SemesterIndex"] >  cutoff].copy()

# ===== 4) Scale numeric =====
scaler = StandardScaler()
train_sem = scaler.fit_transform(train_df[["SemesterIndex"]].values)
test_sem  = scaler.transform(test_df[["SemesterIndex"]].values)

# Size the embedding table as max ID + 1 so every ID has a row even if the IDs have gaps.
# CourseCode_ID should already be 0..N-1 from LabelEncoder, so this is a safeguard:
n_courses = int(df["CourseCode_ID"].max()) + 1

print("n_courses:", n_courses, "n_depts:", n_depts, "n_modes:", n_modes, "n_types:", n_types)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ===== 5) Dataset =====
class PairDataset(Dataset):
    def __init__(self, df_part, sem_scaled):
        self.course = torch.tensor(df_part["CourseCode_ID"].values, dtype=torch.long)
        self.dept   = torch.tensor(df_part["Dept_ID"].values, dtype=torch.long)
        self.mode   = torch.tensor(df_part["Mode_ID"].values, dtype=torch.long)
        self.type   = torch.tensor(df_part["Type_ID"].values, dtype=torch.long)
        self.sem    = torch.tensor(sem_scaled, dtype=torch.float32)  # (N,1)
        self.y      = torch.tensor(df_part["Pair_ID"].values, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.course[idx], self.dept[idx], self.mode[idx], self.type[idx], self.sem[idx], self.y[idx]

train_ds = PairDataset(train_df, train_sem)
test_ds  = PairDataset(test_df,  test_sem)

train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=512, shuffle=False)

# ===== 6) Model =====
class PairMLP(nn.Module):
    def __init__(self, n_courses, n_depts, n_modes, n_types, n_pairs,
                 course_dim=16, dept_dim=8, mode_dim=4, type_dim=4):
        super().__init__()
        self.emb_course = nn.Embedding(n_courses, course_dim)
        self.emb_dept   = nn.Embedding(n_depts,  dept_dim)
        self.emb_mode   = nn.Embedding(n_modes,  mode_dim)
        self.emb_type   = nn.Embedding(n_types,  type_dim)

        in_dim = course_dim + dept_dim + mode_dim + type_dim + 1  # +1 for SemesterIndex
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.15),
        )
        self.out = nn.Linear(64, n_pairs)

    def forward(self, course_id, dept_id, mode_id, type_id, sem_scaled):
        e_course = self.emb_course(course_id)
        e_dept   = self.emb_dept(dept_id)
        e_mode   = self.emb_mode(mode_id)
        e_type   = self.emb_type(type_id)

        x = torch.cat([e_course, e_dept, e_mode, e_type, sem_scaled], dim=1)
        h = self.mlp(x)
        return self.out(h)

model = PairMLP(n_courses, n_depts, n_modes, n_types, n_pairs).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

@torch.no_grad()
def topk_acc(logits, y_true, k=3):
    topk = torch.topk(logits, k=k, dim=1).indices
    return (topk == y_true.unsqueeze(1)).any(dim=1).float().mean().item()

@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    total = 0
    correct1 = 0
    top3_sum = 0.0
    for course, dept, mode, typ, sem, y in loader:
        course = course.to(device)
        dept   = dept.to(device)
        mode   = mode.to(device)
        typ    = typ.to(device)
        sem    = sem.to(device)
        y      = y.to(device)

        logits = model(course, dept, mode, typ, sem)
        pred = logits.argmax(dim=1)
        correct1 += (pred == y).sum().item()

        top3_sum += topk_acc(logits, y, k=3) * len(y)
        total += len(y)

    return {"acc": correct1 / total, "top3": top3_sum / total}

def train_one_epoch(model, loader):
    model.train()
    total_loss = 0.0
    total = 0
    for course, dept, mode, typ, sem, y in loader:
        course = course.to(device)
        dept   = dept.to(device)
        mode   = mode.to(device)
        typ    = typ.to(device)
        sem    = sem.to(device)
        y      = y.to(device)

        optimizer.zero_grad()
        logits = model(course, dept, mode, typ, sem)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * len(y)
        total += len(y)

    return total_loss / total

# ===== 7) Train with early stopping =====
best = -1.0
patience = 8
bad = 0
best_state = None

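for epoch in range(1, 201):  # up to 200 epochs, subject to early stopping
    train_loss = train_one_epoch(model, train_loader)
    # Caveat: model selection uses the test split, so the reported best top-3 is optimistic
    metrics = evaluate(model, test_loader)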

    print(f"Epoch {epoch:02d} | loss {train_loss:.4f} | acc {metrics['acc']:.4f} | top3 {metrics['top3']:.4f}")

    if metrics["top3"] > best:
        best = metrics["top3"]
        bad = 0
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    else:
        bad += 1
        if bad >= patience:
            print(f"Early stopping. Best top3={best:.4f}")
            break

model.load_state_dict(best_state)
final_metrics = evaluate(model, test_loader)
print("Final (best) test:", final_metrics)


Unique pairs: 2238
n_courses: 92 n_depts: 13 n_modes: 3 n_types: 3


Epoch 01 | loss 7.7227 | acc 0.0000 | top3 0.0017
Epoch 02 | loss 7.6554 | acc 0.0070 | top3 0.0157
Epoch 03 | loss 7.5324 | acc 0.0087 | top3 0.0139
Epoch 04 | loss 7.2981 | acc 0.0087 | top3 0.0139
Epoch 05 | loss 7.0016 | acc 0.0209 | top3 0.0470
Epoch 06 | loss 6.6923 | acc 0.0314 | top3 0.0645
Epoch 07 | loss 6.3875 | acc 0.0331 | top3 0.0749
Epoch 08 | loss 6.0723 | acc 0.0557 | top3 0.0906
Epoch 09 | loss 5.7559 | acc 0.0592 | top3 0.1150
Epoch 10 | loss 5.4794 | acc 0.0627 | top3 0.1272
Epoch 11 | loss 5.2052 | acc 0.0714 | top3 0.1429
Epoch 12 | loss 4.9507 | acc 0.0662 | top3 0.1446
Epoch 13 | loss 4.7628 | acc 0.0784 | top3 0.1620
Epoch 14 | loss 4.5871 | acc 0.0836 | top3 0.1533
Epoch 15 | loss 4.4519 | acc 0.0889 | top3 0.1655
Epoch 16 | loss 4.3202 | acc 0.0767 | top3 0.1620
Epoch 17 | loss 4.2126 | acc 0.0836 | top3 0.1690
Epoch 18 | loss 4.1407 | acc 0.0906 | top3 0.1620
Epoch 19 | loss 4.0699 | acc 0.0819 | top3 0.1742
Epoch 20 | loss 3.9976 | acc 0.0854 | top3 0.1725
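
Both training loops above choose the best epoch using the test loader. One possible alternative, sketched here under the assumption that semesters are identified by integer indices as in `SemesterIndex`, is to reserve the most recent training semester as a validation split and keep the final semester(s) untouched until the end. The function name `split_semesters` and its parameters are hypothetical.

```python
def split_semesters(semesters, n_val=1, n_test=1):
    """Split distinct semester indices chronologically into train/val/test lists."""
    ordered = sorted(set(semesters))
    if len(ordered) < n_val + n_test + 1:
        raise ValueError("not enough distinct semesters for a three-way split")
    n = len(ordered)
    train = ordered[: n - n_test - n_val]       # oldest semesters
    val = ordered[n - n_test - n_val : n - n_test]  # used for early stopping
    test = ordered[n - n_test :]                # newest, evaluated once at the end
    return train, val, test

# Seven semesters, indexed 0..6 as in the notebook's output
train_sems, val_sems, test_sems = split_semesters([0, 1, 2, 3, 4, 5, 6])
print(train_sems, val_sems, test_sems)
```

Rows could then be selected with `df["SemesterIndex"].isin(val_sems)` and wrapped in their own `DataLoader`, so that early stopping never sees the test semesters.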


In [96]:
train_pairs = set(train_df["Pair_ID"].unique())
test_pairs  = set(test_df["Pair_ID"].unique())

unseen = [p for p in test_pairs if p not in train_pairs]
print("Test unique pairs:", len(test_pairs))
print("Unseen-in-train test pairs:", len(unseen))
print("Fraction unseen:", len(unseen)/len(test_pairs))


Test unique pairs: 485
Unseen-in-train test pairs: 211
Fraction unseen: 0.4350515463917526
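
The count above weights every distinct pair equally. Since the classifier has no training signal for a pair that never appears in the training semesters, the share of test rows whose pair was seen in training gives a rough upper bound on achievable test accuracy. The sketch below computes that row-level share on made-up data; `seen_pair_ceiling` is a hypothetical helper.

```python
def seen_pair_ceiling(train_pairs, test_pairs):
    """Fraction of test rows whose pair also occurs somewhere in training."""
    if len(test_pairs) == 0:
        raise ValueError("test_pairs is empty")
    seen = set(train_pairs)
    hits = sum(1 for p in test_pairs if p in seen)
    return hits / len(test_pairs)

# Toy rows (made-up values); each entry is one course section's pair
train_rows = [(7, 1), (7, 2), (3, 1), (7, 1)]
test_rows = [(7, 1), (7, 1), (9, 4), (3, 1)]

ceiling = seen_pair_ceiling(train_rows, test_rows)
print(f"upper bound on test accuracy: {ceiling:.2f}")
```

On the real data the inputs would be `train_df["Pair_ID"].tolist()` and `test_df["Pair_ID"].tolist()`.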
