<a href="https://colab.research.google.com/github/7HE-LUCKY-FISH/major_map/blob/hoang-test/notebooks/Major_Map_Hoang_Nguyen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Major Map: An AI-Powered Academic Planner and Predictor for SJSU Students

## 1. Data Loading

In [21]:
!rm -rf major_map
!git clone https://github.com/7HE-LUCKY-FISH/major_map.git

Cloning into 'major_map'...
remote: Enumerating objects: 325, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 325 (delta 33), reused 21 (delta 19), pack-reused 260 (from 1)
Receiving objects: 100% (325/325), 809.01 KiB | 1.57 MiB/s, done.
Resolving deltas: 100% (101/101), done.


In [22]:
!ls /content/

major_map  sample_data


In [23]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from pathlib import Path

In [24]:
# Path to repo in Colab
data_dir = Path("major_map/data/csv_data")

# Get all csv files
csv_files = sorted(data_dir.glob("*.csv"))

print("Found CSV files:")
for f in csv_files:
    print(" -", f.name)

# Combine all the CSV files into a single DataFrame
# (see method 4 in https://medium.com/@stella96joshua/how-to-combine-multiple-csv-files-using-python-for-your-analysis-a88017c6ff9e)
df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

Found CSV files:
 - Fall-2022.csv
 - Fall-2023.csv
 - Fall-2024.csv
 - Spring-2022.csv
 - Spring-2023.csv
 - Spring-2024.csv
 - Spring-2025.csv
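
After concatenation it is no longer obvious which file a row came from. The sketch below is one possible way to keep that provenance and cross-check it against the `Year` and `Semester` columns. The `load_with_source` helper, the `source_file` column, and the two tiny stand-in CSV files are assumptions made for illustration; they are not part of the repository.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_with_source(csv_dir: Path) -> pd.DataFrame:
    """Concatenate every CSV in csv_dir, tagging each row with its file stem."""
    frames = []
    for path in sorted(csv_dir.glob("*.csv")):
        frame = pd.read_csv(path)
        # Hypothetical helper column, e.g. "Fall-2022"
        frame["source_file"] = path.stem
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny stand-in files so the sketch runs without the repository.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    pd.DataFrame({"Year": [2022], "Semester": ["Fall"]}).to_csv(
        tmp_dir / "Fall-2022.csv", index=False
    )
    pd.DataFrame({"Year": [2023], "Semester": ["Spring"]}).to_csv(
        tmp_dir / "Spring-2023.csv", index=False
    )
    demo = load_with_source(tmp_dir)

# Rebuild "<Semester>-<Year>" from the columns and compare with the file stem.
expected_stem = demo["Semester"] + "-" + demo["Year"].astype(str)
mismatches = demo[expected_stem != demo["source_file"]]
print(f"{len(mismatches)} rows disagree with their file name")
```

On the real data, calling `load_with_source(data_dir)` in place of the list comprehension should give the same rows plus the extra column, assuming every file follows the `<Semester>-<Year>.csv` naming seen above.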


## 2. Data Understanding

### 2a. Basic Inspection

In [25]:
df.head()

| | Section | Number | Mode | Title | Satifies | Unit | Type | Days | Times | Instructor | Location | Dates | Seats | Year | Semester |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BIOL 10 (Section 01) | 40529 | In Person | The Living World | GE: B2 | 3.0 | LEC | TR | 09:00AM-10:15AM | Allison Harness | SCI164 | 08/19/22-12/06/22 | 59 | 2022 | Fall |
| 1 | BIOL 10 (Section 03) | 40060 | In Person | The Living World | GE: B2 | 3.0 | LEC | MW | 10:30AM-11:45AM | Phillip Hawkins | SCI164 | 08/19/22-12/06/22 | 42 | 2022 | Fall |
| 2 | BIOL 10 (Section 04) | 47603 | Fully Online | The Living World | GE: B2 | 3.0 | LEC | TBA | TBA | Phillip Hawkins | ONLINE | 08/19/22-12/06/22 | 6 | 2022 | Fall |
| 3 | BIOL 10 (Section 99) | 41828 | Fully Online | The Living World | GE: B2 | 3.0 | LEC | TBA | TBA | Mary Poffenroth | ONLINE | 08/19/22-12/06/22 | 1 | 2022 | Fall |
| 4 | CHEM 1A (Section 01) | 40081 | In Person | General Chemistry | GE: B1+B3 | 5.0 | LEC | MWF | 09:30AM-10:20AM | Resa Kelly | SCI142 | 08/19/22-12/06/22 | 0 | 2022 | Fall |


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4007 entries, 0 to 4006
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Section     4007 non-null   object 
 1   Number      4007 non-null   int64  
 2   Mode        4007 non-null   object 
 3   Title       4007 non-null   object 
 4   Satifies    1615 non-null   object 
 5   Unit        4007 non-null   float64
 6   Type        4007 non-null   object 
 7   Days        4007 non-null   object 
 8   Times       4007 non-null   object 
 9   Instructor  4007 non-null   object 
 10  Location    3945 non-null   object 
 11  Dates       4007 non-null   object 
 12  Seats       4007 non-null   int64  
 13  Year        4007 non-null   int64  
 14  Semester    4007 non-null   object 
dtypes: float64(1), int64(3), object(11)
memory usage: 469.7+ KB
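
The `df.info()` output shows that `Satifies` has 1615 non-null values out of 4007 (2392 missing) and `Location` has 3945 (62 missing). A compact per-column summary can make this easier to read. The sketch below also counts a placeholder string, because the data uses `TBA` rather than NaN for unscheduled sections. The `missing_summary` helper and the toy `sample` frame are hypothetical and exist only to make the example runnable.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame, placeholder: str = "TBA") -> pd.DataFrame:
    """Count NaN cells and placeholder strings for every column."""
    n_missing = frame.isna().sum()
    n_placeholder = frame.eq(placeholder).sum()
    summary = pd.DataFrame({"n_missing": n_missing, "n_placeholder": n_placeholder})
    summary["pct_missing"] = (n_missing / len(frame) * 100).round(1)
    return summary


# Toy frame (not real rows) reusing value styles seen in df.head().
sample = pd.DataFrame(
    {
        "Satifies": ["GE: B2", None, "GE: B1+B3", None],
        "Days": ["TR", "TBA", "MWF", "TBA"],
        "Location": ["SCI164", "ONLINE", "SCI142", None],
    }
)
summary = missing_summary(sample)
print(summary)
```

Running `missing_summary(df)` on the full DataFrame should reproduce the missing counts implied by `df.info()`, while the placeholder column surfaces values that `isna()` alone would not catch.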


In [28]:
df['Year'].value_counts()



In [29]:
df['Semester'].value_counts()

| Semester | count |
|---|---|
| Spring | 2160 |
| Fall | 1847 |


In [30]:
df['Instructor'].nunique()

646

In [31]:
df['Section'].nunique()

980
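
The 980 unique `Section` values combine a course code and a section number in one string, such as `BIOL 10 (Section 01)`. Splitting them apart may make it easier to count distinct courses. The following is a minimal sketch using `Series.str.extract`; it assumes every value follows the `SUBJECT NUMBER (Section NN)` pattern seen in the outputs, and the new column names are illustrative only. Rows that do not match the pattern come back as NaN.

```python
import pandas as pd

# Example strings copied from the notebook outputs.
sections = pd.Series(
    [
        "BIOL 10 (Section 01)",
        "CHEM 1A (Section 01)",
        "ENGR 100W (Section 14)",
    ]
)

# Named groups become the column names of the extracted DataFrame.
pattern = (
    r"^(?P<Subject>[A-Z]+)\s+(?P<CourseNumber>\w+)"
    r"\s+\(Section\s+(?P<SectionNumber>\w+)\)$"
)
parts = sections.str.extract(pattern)
parts["Course"] = parts["Subject"] + " " + parts["CourseNumber"]
print(parts)
```

On the full data, `df['Section'].str.extract(pattern)` followed by `parts['Course'].nunique()` would give the number of distinct courses, provided the pattern assumption holds.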

### 2b. Check Missing / Special Values

In [37]:
df['Times'].value_counts().head(10)

| Times | count |
|---|---|
| 10:30AM-11:45AM | 371 |
| 12:00PM-01:15PM | 366 |
| 09:00AM-10:15AM | 316 |
| 01:30PM-02:45PM | 300 |
| 03:00PM-04:15PM | 263 |
| 04:30PM-05:45PM | 185 |
| 06:00PM-08:45PM | 138 |
| 09:00AM-11:45AM | 102 |
| TBA | 102 |
| 03:00PM-05:45PM | 97 |
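
`Times` is stored as a single `start-end` string, with `TBA` standing in for unscheduled sections (102 rows). One way to turn it into numeric features is sketched below. It assumes every scheduled value follows the `HH:MMAM-HH:MMPM` style shown above; the `StartMinutes` and `DurationMinutes` names are made up for this example. Values that do not match, including `TBA`, end up as NaN.

```python
import pandas as pd

# Example strings copied from the value_counts output.
times = pd.Series(["09:00AM-10:15AM", "06:00PM-08:45PM", "TBA"])

# Pull out the two clock times; non-matching rows (e.g. "TBA") become NaN.
bounds = times.str.extract(
    r"^(?P<Start>\d{2}:\d{2}[AP]M)-(?P<End>\d{2}:\d{2}[AP]M)$"
)

# 12-hour clock with AM/PM marker; unparseable values are coerced to NaT.
time_format = "%I:%M%p"
start = pd.to_datetime(bounds["Start"], format=time_format, errors="coerce")
end = pd.to_datetime(bounds["End"], format=time_format, errors="coerce")

parsed = pd.DataFrame(
    {
        "StartMinutes": start.dt.hour * 60 + start.dt.minute,  # minutes after midnight
        "DurationMinutes": (end - start).dt.total_seconds() / 60,
    }
)
print(parsed)
```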


In [38]:
df['Days'].value_counts().head(10)

| Days | count |
|---|---|
| MW | 1120 |
| TR | 989 |
| F | 400 |
| T | 343 |
| W | 310 |
| R | 289 |
| M | 229 |
| MTWR | 113 |
| TBA | 102 |
| MWF | 86 |
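
`Days` packs several meeting days into one string (`MW`, `MTWR`, and so on), so a model cannot use it directly. A possible encoding is one boolean column per day, sketched below. It assumes the letter codes are M, T, W, R (Thursday), and F, which matches the ten most frequent values shown but may not cover every value in the data. Note that `TBA` contains the letter `T`, so it has to be masked out before the substring check. The `meets_*` and `MeetingsPerWeek` names are illustrative.

```python
import pandas as pd

# Example strings copied from the value_counts output.
days = pd.Series(["MW", "TR", "F", "MTWR", "TBA", "MWF"])

# Assumed day codes; R is taken to mean Thursday.
day_codes = ["M", "T", "W", "R", "F"]

# "TBA" would otherwise be read as a Tuesday meeting.
is_scheduled = days != "TBA"

flags = pd.DataFrame(
    {
        f"meets_{code}": is_scheduled & days.str.contains(code, regex=False)
        for code in day_codes
    }
)
flags["MeetingsPerWeek"] = flags.sum(axis=1)
print(flags)
```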


In [52]:
print(df['Instructor'].value_counts())
print('\n-----------------------------\n')
print(df['Section'].value_counts())
print('\n-----------------------------\n')
print(df['Semester'].value_counts())

Instructor
Richard Low              53
Padmavati Tanniru        52
Alla Petrosyan           51
Olga Kovaleva            48
Medha Bodas              47
                         ..
Vishwa Samirbhai Shah     1
Paul Varun Guddeti        1
Neomi Millan              1
Nathan Samarasena         1
Resa Kelly                1
Name: count, Length: 646, dtype: int64

-----------------------------

Section
ENGR 100W (Section 14)    14
ENGR 100W (Section 16)    14
ENGR 100W (Section 06)    14
ENGR 100W (Section 18)    11
ENGR 100W (Section 12)    10
                          ..
ENGL 1B (Section 26)       1
EE 120 (Section 07)        1
EE 120 (Section 10)        1
EE 120 (Section 09)        1
EE 120 (Section 08)        1
Name: count, Length: 980, dtype: int64

-----------------------------

Semester
Spring    2160
Fall      1847
Name: count, dtype: int64
