Project Overview: Build an ML system that predicts where courses from different universities are eligible for transfer credit using NLP and classification algorithms.

In [1]:
'''Core Environment Setup'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
import os
warnings.filterwarnings('ignore')

In [2]:
'''NLP Environment Setup'''
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from difflib import SequenceMatcher
import nltk
import spacy

In [3]:
'''Deep Learning Environment Setup'''
from transformers import AutoTokenizer, AutoModel
import torch

In [4]:
'''Plotting'''
plt.style.use('seaborn-v0_8')
sb.set_palette("husl")

In [5]:
'''NLTK Data'''
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sarathivelmurugan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sarathivelmurugan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sarathivelmurugan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Project Configuration can now be completed. Data loading and exploration is the next step.

In [6]:
'''Sample Data Function'''

def create_sample_data():
#Sample Purdue Courses
        purdue_samp = {'course_code': ['CS 180', 'CS 250', 'CS159'], 
               'title': ['Problem Solving and Object-Oriented Programming', 
                        'Computer Architecture',
                        'C Programming (Applications for Engineers)'], 
                'description':['Intro to programming using Python',
                               'Computer organization and architecture...',
                               'Intro to programming using C'],
                'credits':[4, 4, 3],
                'department':['CS', 'CS', 'CS'],
                'level':['Intro', 'Intermediate', 'Intro']
        }

        #Sample Berkeley Courses
        berkeley_samp = { 'course_code':['CS 61A', 'CS 61C', 'CS 164', 'MATH 1A', 'PHYS 7A'],
                'title': ['Structure and Interpretation of Computer Programs',
                        'Machine Structures', 
                        'Programming Languages and Compilers',
                        'Calculus',
                        'Physics for Scientists and Engineers'],
                'description': ['Introduction to programming and computer science...',
                       'Machine structures, assembly language...',
                       'Survey of programming languages, compilers...',
                       'Differential and integral calculus...',
                       'Mechanics, oscillations, waves...'],
                'credits':[4, 4, 4, 4, 4],
                'department':['CS', 'CS', 'CS', 'MATH', 'PHYS'],
                'level':['Intro', 'Intermediate', 'Advanced', 'Intro', 'Intro']

        }

        return pd.DataFrame(berkeley_samp), pd.DataFrame(purdue_samp)

In [7]:
'''Data Loading'''
berkCS_courses_indices = np.arange(2000, 2099).tolist()

def load_course_data():
    try:
        purdueCS_courses = pd.read_csv('Course_CSV_Files/Purdue_CS_Courses_CSV.csv')
        berkeley_courses = pd.read_csv('Course_CSV_Files/UCB_Courses.csv', header=0)
        berkeley_courses = berkeley_courses.iloc[berkCS_courses_indices]

        print(f"Purdue CS courses loaded: {len(purdueCS_courses)}")
        print(f"Berkeley courses loaded: {len(berkeley_courses)}")

        return purdueCS_courses, berkeley_courses
    
    except FileNotFoundError:
        print("CSV Files not found. Sample data will be created for demo.")
        return create_sample_data()
    

In [8]:
'''Loading and Displaying Basic Info for Purdue'''
purdue_df, berkeley_df = load_course_data()

print("\n=== PURDUE COURSES PREVIEW ===")
display(purdue_df)
print(f"\nShape: {purdue_df.shape}")
print(f"Columns: {list(purdue_df.columns)}")

print("\n=== BERKELEY COURSES PREVIEW ===")
berkeley_df.head()
display(berkeley_df)
print(f"\nShape: {berkeley_df.shape}")
print(f"Columns: {list(berkeley_df.columns)}")

Purdue CS courses loaded: 1803
Berkeley courses loaded: 99

=== PURDUE COURSES PREVIEW ===


Unnamed: 0,Id,Number,SubjectId,Title,CreditHours,Description
0,97744585-87e3-4616-8a1f-bff2ab88471b,9200,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Professional Practice II,0,
1,631d471f-f14b-47b1-a43f-1efae5ec584a,9300,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Professional Practice III,0,
2,5237a73f-f2db-4130-8cc8-33f03a1bab55,9400,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Professional Practice IV,0,
3,1782c85f-49f5-4b08-ad99-62910a6794bd,9500,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Professional Practice V,0,
4,5f26945a-ff4b-429a-8cb7-37cdba96e319,10100,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Digital Literacy,3,
...,...,...,...,...,...,...
1798,8840e6c2-6020-4eb8-8dd4-15bafaadfb24,69000,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Software Trust Management,3,
1799,1b868766-2557-40b9-b224-61976374b9aa,69000,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Git Based Data Model For Nosql,3,
1800,e02608d1-96d1-4516-85d8-f3343852e40c,69000,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Cryptography II,3,
1801,70f6cb0f-86e6-4ec4-893d-eb6d22698488,69800,86ad8a59-6ddc-4067-9f6b-169c8eec86a6,Research MS Thesis,1,



Shape: (1803, 6)
Columns: ['Id', 'Number', 'SubjectId', 'Title', 'CreditHours', 'Description']

=== BERKELEY COURSES PREVIEW ===


Unnamed: 0,Subject,Course Number,Department(s),Credits - Units - Minimum Units,Credits - Units - Maximum Units,Terms Offered,Course Description,Cross-Listed Course(s),Repeat Rules,Repeat Rule: Special Circumstances,Offering Information,Additional Offering Information
2000,COMPSCI,150,Electrical Engineering and Computer Sciences,5,5,-,Basic building blocks and design methods to co...,-,Course is not repeatable for credit.,-,-,-
2001,COMPSCI,152,Electrical Engineering and Computer Sciences,4,4,-,"Instruction set architecture, microcoding, pip...",-,Course is not repeatable for credit.,-,-,-
2002,COMPSCI,160,Electrical Engineering and Computer Sciences,4,4,-,"The design, implementation, and evaluation of ...",-,Course is not repeatable for credit.,-,-,-
2003,COMPSCI,161,Electrical Engineering and Computer Sciences,4,4,-,Introduction to computer security. Cryptograph...,-,Course is not repeatable for credit.,-,-,-
2004,COMPSCI,162,Electrical Engineering and Computer Sciences,4,4,-,Basic concepts of operating systems and system...,-,Course is not repeatable for credit.,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...
2094,COMPSCI,H196A,Electrical Engineering and Computer Sciences,1,4,-,Thesis work under the supervision of a faculty...,-,Course is not repeatable for credit.,-,-,-
2095,COMPSCI,H196B,Electrical Engineering and Computer Sciences,1,4,-,Thesis work under the supervision of a faculty...,-,Course is not repeatable for credit.,-,-,-
2096,COMPSCI,W10,Electrical Engineering and Computer Sciences,4,4,-,This course meets the programming prerequisite...,-,Course is not repeatable for credit.,-,-,-
2097,COMPSS,201,Computational Social Science Graduate Group,3,3,-,The Master of Computational Social Science pro...,-,Course is not repeatable for credit.,-,-,-



Shape: (99, 12)
Columns: ['Subject', 'Course Number', 'Department(s)', 'Credits - Units - Minimum Units', 'Credits - Units - Maximum Units', 'Terms Offered', 'Course Description', 'Cross-Listed Course(s)', 'Repeat Rules', 'Repeat Rule: Special Circumstances', 'Offering Information', 'Additional Offering Information']


Data preprocessing and Cleaning Section. Course IDs and others need to be separated and/or removed.

In [9]:
'''Data Cleaning for Purdue'''
#Cleaning and standardizing course data for Purdue

purdueDfCopy = purdue_df.copy() #creates a copy dataframe that will not change the original

#University Identifier
purdueDfCopy['university'] = 'Purdue University West Lafayette'

#Course codes
purdueDfCopy['course_code'] = purdue_df['SubjectId'].str.strip().str.upper()

#Clean titles and descriptions
purdueDfCopy['Title'] = purdue_df['Title'].str.strip()
purdueDfCopy['Description'] = purdue_df['Description'].fillna("").str.strip()

#Credit standardization
purdueDfCopy['credits'] = pd.to_numeric(purdue_df['CreditHours'], errors='coerce')

#Combined Text for NLP
purdueDfCopy['combined_text'] = purdueDfCopy['Title'] + ' ' + purdueDfCopy['Description']

#Course Level/Num
purdueDfCopy['Course Num'] = purdue_df['Number']

#Drop Extra Columns
purdueDfCopy = purdueDfCopy.drop(['Id', 'Number', 'SubjectId', 'Title', 'CreditHours', 'Description'], axis=1)

purdue_copy = purdueDfCopy.to_csv('Purdue CSV File With Required Columns', index=False)

display(purdueDfCopy)
    

Unnamed: 0,university,course_code,credits,combined_text,Course Num
0,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,0,Professional Practice II,9200
1,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,0,Professional Practice III,9300
2,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,0,Professional Practice IV,9400
3,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,0,Professional Practice V,9500
4,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,3,Digital Literacy,10100
...,...,...,...,...,...
1798,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,3,Software Trust Management,69000
1799,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,3,Git Based Data Model For Nosql,69000
1800,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,3,Cryptography II,69000
1801,Purdue University West Lafayette,86AD8A59-6DDC-4067-9F6B-169C8EEC86A6,1,Research MS Thesis,69800


In [10]:
'''Data Cleaning for Berkeley'''
#Cleaning and standardizing course data for Berkeley CS Courses

berkDfCopy = berkeley_df.copy()

#University Identifier
berkDfCopy['University'] = 'UC Berkeley'

#Clean titles and descriptions
berkDfCopy['Title'] = berkeley_df['Subject'].str.strip()
berkDfCopy['Description'] = berkeley_df['Course Description'].fillna("").str.strip()

#Credit standardization
berkDfCopy['credits'] = berkeley_df['Credits - Units - Minimum Units']

#Combined Text for NLP
berkDfCopy['combined_text'] = berkDfCopy['Title'] + ' ' + berkDfCopy['Description']

#Course Level/Num
berkDfCopy['Course Num'] = berkeley_df['Course Number']

#Drop Extra Columns
berkDfCopy = berkDfCopy.drop(['Credits - Units - Maximum Units', 'Terms Offered', 'Cross-Listed Course(s)', 
                              'Repeat Rules', 'Repeat Rule: Special Circumstances', 'Offering Information', 'Additional Offering Information',
                              'Subject', 'Course Number', 'Credits - Units - Minimum Units'], axis=1)

berkeley_copy = berkDfCopy.to_csv('Berkeley CSV File With Required Columns', index=False)

display(berkDfCopy)
    

Unnamed: 0,Department(s),Course Description,University,Title,Description,credits,combined_text,Course Num
2000,Electrical Engineering and Computer Sciences,Basic building blocks and design methods to co...,UC Berkeley,COMPSCI,Basic building blocks and design methods to co...,5,COMPSCI Basic building blocks and design metho...,150
2001,Electrical Engineering and Computer Sciences,"Instruction set architecture, microcoding, pip...",UC Berkeley,COMPSCI,"Instruction set architecture, microcoding, pip...",4,"COMPSCI Instruction set architecture, microcod...",152
2002,Electrical Engineering and Computer Sciences,"The design, implementation, and evaluation of ...",UC Berkeley,COMPSCI,"The design, implementation, and evaluation of ...",4,"COMPSCI The design, implementation, and evalua...",160
2003,Electrical Engineering and Computer Sciences,Introduction to computer security. Cryptograph...,UC Berkeley,COMPSCI,Introduction to computer security. Cryptograph...,4,COMPSCI Introduction to computer security. Cry...,161
2004,Electrical Engineering and Computer Sciences,Basic concepts of operating systems and system...,UC Berkeley,COMPSCI,Basic concepts of operating systems and system...,4,COMPSCI Basic concepts of operating systems an...,162
...,...,...,...,...,...,...,...,...
2094,Electrical Engineering and Computer Sciences,Thesis work under the supervision of a faculty...,UC Berkeley,COMPSCI,Thesis work under the supervision of a faculty...,1,COMPSCI Thesis work under the supervision of a...,H196A
2095,Electrical Engineering and Computer Sciences,Thesis work under the supervision of a faculty...,UC Berkeley,COMPSCI,Thesis work under the supervision of a faculty...,1,COMPSCI Thesis work under the supervision of a...,H196B
2096,Electrical Engineering and Computer Sciences,This course meets the programming prerequisite...,UC Berkeley,COMPSCI,This course meets the programming prerequisite...,4,COMPSCI This course meets the programming prer...,W10
2097,Computational Social Science Graduate Group,The Master of Computational Social Science pro...,UC Berkeley,COMPSS,The Master of Computational Social Science pro...,3,COMPSS The Master of Computational Social Scie...,201


Feature Engineering for Course Prediction using classes.

In [11]:
'''Feature Engineering Class'''
class CourseFeature:

    def __init__(self):
        self.tfidf_vectorizer = None
        self.scaler = StandardScaler()

    #Extracting Text features From Course Texts
    def extract_text_features(self, texts):
        if self.tfidf_vectorizer is None:
            self.tfidf_vectorizer = TfidfVectorizer(
                max_features = 5000, 
                stop_words = 'english',
                ngram_range = (1,2),
                min_df = 2,
                max_df = 0.95)
            
            tfidf_features = self.tfidf_vectorizer.fit_transform(texts)
        else:
            tfidf_features = self.tfidf_vectorizer.transform(texts)

        return tfidf_features.toarray()

    #Extracting Number Features
    def extract_num_features(self, df):
        features = []
        
        #Credit Hours
        features.append(df['credits'].values.reshape(-1, 1))

        #Course Number (Normalized)
        course_num_norm = df['Course Num'].fillna(0).values.reshape(-1, 1)
        features.append(course_num_norm)

        return np.concatenate(features, axis=1)
    
    #Similarity measures between course
    def compute_similarity_features(self, text1, text2):
        #TF-IDF course similarity
        combined_texts = [text1, text2]
        tfidf_matrix = self.tfidf_vectorizer.transform(combined_texts)
        cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

        #Similarity between strings
        string_sim = SequenceMatcher(None, text1.lower(), text2.lower()).ratio() #As a ratio
        
        return [cosine_sim, string_sim]

In [12]:
#Initializing feature engineer class
feature_engineer = CourseFeature()

print("Extracting features....")

#Text Features
purdueT_features = feature_engineer.extract_text_features(purdueDfCopy['combined_text'])
berkT_features = feature_engineer.extract_text_features(berkDfCopy['combined_text'])

#Numerical Features
purdueN_features = feature_engineer.extract_num_features(purdueDfCopy)
berkN_features = feature_engineer.extract_num_features(berkDfCopy)

#Purdue text and numerical features shape
print(f"Purdue text features shape: {purdueT_features.shape}")
print(f"Berkeley text features shape: {berkT_features.shape}")

#Berkeley text and numerical features shape
print(f"Purdue numerical features shape: {purdueN_features.shape}")
print(f"Berkeley numerical features shape: {berkN_features.shape}")

Extracting features....
Purdue text features shape: (1803, 1562)
Berkeley text features shape: (99, 1562)
Purdue numerical features shape: (1803, 2)
Berkeley numerical features shape: (99, 2)


Ground Truth Label Generation. This is in order to see if two courses in Purdue and Berkeley are the same. This will act as data labeling for prediction.

In [13]:
'''Creating Training Pairs'''

def create_training_pairs(purdue_df, berkeley_df):
    #Synthetic labels will be created based on similarity
    n_positive = 100
    n_negative = 300
    pairs = []
    labels = []

    #Positive Equivalent Pairs for Labeling as tuples with indexes
    positive_pair_index = [
        (0, 0), #purdue[0] matches berkeley[0]
        (1, 1), #purdue[1] matches berkeley[1]
        (2, 2),
        (3, 3),
        (4, 4)
    ]

    for purdue_idx, berk_idx in positive_pair_index:
        if purdue_idx < len(purdue_df) and berk_idx < len(berkeley_df):
            pairs.append((purdue_idx, berk_idx))
            labels.append(1) #1 would mean that they are equivalent

    #Generating negative pairs (courses that are not equivalent)
    np.random.seed(42)
    for i in range(n_negative):
        purdue_idx = np.random.randint(0, len(purdue_df))
        berk_idx = np.random.randint(0, len(berkeley_df))

        #Avoiding positive pairs
        if (purdue_idx, berk_idx) not in positive_pair_index:
            pairs.append((purdue_idx, berk_idx))
            labels.append(0) #0 if courses are not equivalent

    return pairs, labels

In [14]:
'''Generating Training Pairs'''
course_pairs, equivalency_labels = create_training_pairs(purdueDfCopy, berkDfCopy)

print(f"Generated {len(course_pairs)} total course pairs for training")
print(f"Positive examples (equivalent courses): {sum(equivalency_labels)}")
print(f"Negative examples (not equivalent courses): {len(equivalency_labels) - sum(equivalency_labels)}")

Generated 305 total course pairs for training
Positive examples (equivalent courses): 5
Negative examples (not equivalent courses): 300


Creating vectors for model training.

In [15]:
'''Numeeric Feature Correction '''
def safe_to_numeric(arr):
    arr = np.array(arr).flatten()
    safeList = pd.to_numeric(pd.Series(arr), errors="coerce").fillna(0).to_numpy()
    return safeList

In [16]:
'''Pair Features'''
def create_pair_features(pairs, purdue_features_text, berk_features_text, 
                         purdue_features_num, berk_features_num, purdue_df, berk_df):
    pair_features = []

    for purdue_idx, berk_idx in pairs:
        #Text Feature differences and similarities
        p_text_feat = purdue_features_text[purdue_idx]
        b_text_feat = berk_features_text[berk_idx]

        #Computing feature differences
        text_diff = np.abs(p_text_feat - b_text_feat)
        text_sim = p_text_feat * b_text_feat #Element-wise Product

        #Numeric Feature Conversion (for codes like 16B)
        p_num_feat = safe_to_numeric(purdue_features_num[purdue_idx])
        b_num_feat = safe_to_numeric(berk_features_num[berk_idx])

        #Numerical feature differences
        #p_num_feat = np.array(purdue_features_num[purdue_idx], dtype=float)
        #b_num_feat = np.array(berk_features_num[berk_idx], dtype=float)

        #Make shorter feature vectors to match longer vectors
        max_len = max(len(p_num_feat), len(b_num_feat))
        p_num_padded = np.pad(p_num_feat, (0, max_len - len(p_num_feat)))
        b_num_padded = np.pad(b_num_feat, (0, max_len - len(b_num_feat)))

        num_diff = np.abs(p_num_padded - b_num_padded)

        #Combining features
        combined_feats = np.concatenate([text_diff, 
                                         text_sim[:100], #Limits features to prevent large vectors
                                         num_diff])
        pair_features.append(combined_feats)

    return np.array(pair_features)

In [17]:
'''Feature Matrix'''
print("Creating feature matrix for course pairs....")
X = create_pair_features(course_pairs, purdueT_features, berkT_features,
                         purdueN_features, berkN_features, purdueDfCopy, berkDfCopy)

y = np.array(equivalency_labels)

print(f"Feature matrix shape: {X.shape}")
print(f"Labels shape: {y.shape}")

Creating feature matrix for course pairs....
Feature matrix shape: (305, 1664)
Labels shape: (305,)


Model Training and Evaluation

In [47]:
'''Multi-model Ensemble Class'''
class EquivalencyPredictor:
    def __init__(self):
        self.models = {
            'logistic': LogisticRegression(random_state=42, max_iter=1000), #Logistic Regression Model
            'random_forest': RandomForestClassifier(n_estimators=100, random_state=42), #Random Forest Model
            'svm': SVC(probability=True, random_state=42), #SVM
            'naive_bayes': MultinomialNB() #Naive Bayes Model
        }
        self.trained_models = {}
        self.scaler = StandardScaler()

    '''Training models and storing results'''
    def train_models(self, X_train, y_train):
        #Scaling Features
        X_train_scaled = self.scaler.fit_transform(X_train)

        for name, model in self.models.items():
            print(f"Training {name}....")

        #Negative Values for Naive Bayes
        if name == 'naive_bayes':
            X_train_nb = X_train_scaled - X_train_scaled.min() + 1
            model.fit(X_train_nb, y_train)
        else:
            model.fit(X_train_scaled, y_train)
        
        self.trained_models[name] = model

    print("All models have been trained!")

    '''Evaluating Training of Models'''
    def eval_models(self, X_test, y_test):
        X_test_scaled = self.scaler.transform(X_test)
        results = {}

        for name, model in self.trained_models.items():
            #Negative Values for Naive Bayes
            if name == 'naive_bayes':
                X_test_nb = X_test_scaled - X_test_scaled.min() + 1
                y_pred = model.predict(X_test_nb)
                if hasattr(model, "predict_proba"):
                    y_prob = model.predict_proba(X_test_nb)[:, 1]
                else:
                    y_prob = None
            else:
                y_pred = model.predict(X_test_scaled)
                if hasattr(model, "predict_proba"):
                    y_prob = model.predict_proba(X_test_scaled)[:, 1]
                elif hasattr(model, "decision_function"):
                    y_prob = model.decision_function(X_test_scaled)
                else:
                    y_prob = None

            #Calculating Metrics
            if y_prob is not None and len(np.unique(y_test)) > 1:
                auc = roc_auc_score(y_test, y_prob)
            else:
                auc = float("nan")

            results[name] = {
                'accuracy': accuracy_score(y_test, y_pred),
                'precision': precision_score(y_test, y_pred, zero_division=0),
                'recall': recall_score(y_test, y_pred, zero_division=0),
                'f1': f1_score(y_test, y_pred, zero_division=0),
                'auc': roc_auc_score(y_test, y_prob),
                'y_pred': y_pred.tolist(),
                'y_true': y_test.tolist()
            }

        return results

    

All models have been trained!


In [48]:
'''Splitting Data for Training and Testing'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [49]:
'''Output'''
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

Training set: 244 samples
Test set: 61 samples


In [50]:
'''Initializing and Training Equivalency Predictor'''
predictor = EquivalencyPredictor() #Equivalency Predictor Class
predictor.train_models(X_train, y_train)

#Evaluate Models
print("\nEvaluating models....")
evaluation_results = predictor.eval_models(X_test, y_test)

#Display Results
resultsDf = pd.DataFrame(evaluation_results).T
print("\n MODEL PERFORMANCE COMPARISION")
print(resultsDf.round(4))

Training logistic....
Training random_forest....
Training svm....
Training naive_bayes....

Evaluating models....

 MODEL PERFORMANCE COMPARISION
             accuracy precision recall   f1  auc  \
naive_bayes  0.983607       0.0    0.0  0.0  1.0   

                                                        y_pred  \
naive_bayes  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   

                                                        y_true  
naive_bayes  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  


The training indicates that there is perfect model prediction, and this is never the case with Naive Bayes. A 98.36% accuracy is very imbalanced. Naive Bayes not predicting any positive classes is also suspicious.

AUC being 1.0 is perfect separation, which does not make sense since Naive Bayes did not predict positive values.

In [52]:
'''Testing Class Balance'''
print("Overall class distribution:", np.bincount(y))

Overall class distribution: [300   5]


Since the overall class distribution is massive (300 vs 5), this is the root cause of the prediction imbalance.

In [51]:
'''Checking Distribution After Stratified Split'''
print("Train:", np.bincount(y_train))
print("Test:", np.bincount(y_test)) 

Train: [240   4]
Test: [60  1]


With this dataset distribution in mind, adding a confusion matrix can help with understanding the true negatives and positives, and false negatives and positives.