## Research Question

**How are institutional financial resources and spending efficiency related to graduation outcomes across colleges?**

**Prediction task:** Is this institution a “high-performing” graduation institution (top quartile) or not?

**Question the model answers:** Based on institutional characteristics and resources, can we classify whether a college belongs to the top quartile of graduation outcomes?

## Target Variable

To model this question as a classification problem, I define a binary target variable,
`high_grad_rate`, which equals 1 if an institution’s graduation rate at 150% of normal
time (`grad_150_value`) is in the top quartile of all institutions, and 0 otherwise.
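
As a small illustration of this definition, the sketch below labels a made-up list of graduation rates using the 75th percentile as the cutoff. The values in `rates` are invented for the example, and `statistics.quantiles(..., method="inclusive")` is used only as a plain-Python stand-in for the quantile call in the notebook; it interpolates linearly between data points.

```python
import statistics

# Hypothetical graduation rates (percent) for eight institutions
rates = [10, 20, 30, 40, 50, 60, 70, 80]

# 75th percentile (third quartile cut point), linear interpolation
threshold = statistics.quantiles(rates, n=4, method="inclusive")[2]

# Strictly greater than the cutoff -> 1 (top quartile), otherwise 0
high_grad_rate = [int(r > threshold) for r in rates]

print(threshold)       # 62.5
print(high_grad_rate)  # [0, 0, 0, 0, 0, 0, 1, 1]
```

Because the comparison is strict (`>`), an institution exactly at the cutoff is labeled 0, so the positive share can fall slightly below 25% when several institutions tie at the threshold.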


## Q2 — Build a kNN model (k = 3) 

In [1]:
# 1) Imports
# ----------------------------
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier


In [2]:
def prep_college_data_for_knn(
    path="cc_institution_details.csv",
    test_size=0.2,
    random_state=42
):
    df = pd.read_csv(path)
    df.replace("NULL", np.nan, inplace=True)

    # Convert selected columns to categorical
    for col in ["state", "level", "control", "basic"]:
        if col in df.columns:
            df[col] = df[col].astype("category")

    # Convert hbcu/flagship to 0/1
    for col in ["hbcu", "flagship"]:
        if col in df.columns:
            df[col] = df[col].notna().astype(int)

    # Clean counted_pct
    if "counted_pct" in df.columns:
        df["counted_pct"] = df["counted_pct"].astype(str).str.split("|", expand=True)[0]
        df["counted_pct"] = pd.to_numeric(df["counted_pct"], errors="coerce") / 100.0

    # Drop columns you don’t want
    drop_cols = [
        "index", "unitid", "site", "vsa_year",
        "vsa_grad_after4_first", "vsa_grad_elsewhere_after4_first",
        "vsa_enroll_after4_first", "vsa_enroll_elsewhere_after4_first",
        "vsa_grad_after6_first", "vsa_grad_elsewhere_after6_first",
        "vsa_enroll_after6_first", "vsa_enroll_elsewhere_after6_first",
        "vsa_grad_after4_transfer", "vsa_grad_elsewhere_after4_transfer",
        "vsa_enroll_after4_transfer", "vsa_enroll_elsewhere_after4_transfer",
        "vsa_grad_after6_transfer", "vsa_grad_elsewhere_after6_transfer",
        "vsa_enroll_after6_transfer", "vsa_enroll_elsewhere_after6_transfer",
        "med_sat_value", "med_sat_percentile",
        "chronname", "city", "nicknames", "long_x", "lat_y"
    ]
    drop_cols = [c for c in drop_cols if c in df.columns]
    df = df.drop(columns=drop_cols)

    # Drop rows missing the outcome (must exist to build target)
    df = df.dropna(subset=["grad_150_value"])

    # Create target (top quartile)
    thr = df["grad_150_value"].quantile(0.75)
    df["high_grad_rate"] = (df["grad_150_value"] > thr).astype(int)
    df = df.drop(columns=["grad_150_value"])

    # One-hot encode categoricals
    cat_cols = list(df.select_dtypes(include=["category"]).columns)
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

    # Drop any leftover string/object columns (safer than coercing everything)
    obj_cols = df.select_dtypes(include=["object", "string"]).columns
    df = df.drop(columns=obj_cols)

    # Now drop remaining missing rows (kNN/scaler cannot handle NaNs)
    df = df.dropna()

    # DEBUG: check dataset size before split
    #print("Rows after cleaning:", df.shape[0])
    #print("Columns after encoding:", df.shape[1])

    # Split
    X = df.drop(columns=["high_grad_rate"])
    y = df["high_grad_rate"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        stratify=y,
        random_state=random_state
    )

    # Scale (fit on train only)
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    return X_train_scaled, X_test_scaled, y_train, y_test






In [3]:
# Run the data preparation pipeline
X_train, X_test, y_train, y_test = prep_college_data_for_knn()

# Train kNN with k = 3
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train, y_train)


KNeighborsClassifier(n_neighbors=3)


A k-Nearest Neighbors (kNN) classification model with k = 3 was trained to predict the binary target variable `high_grad_rate`. Because kNN is distance-based, all categorical variables were one-hot encoded, and features were scaled with Min–Max scaling fit on the training data only to avoid data leakage.
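
To make the "fit on the training data only" step concrete, here is a minimal plain-Python sketch of Min–Max scaling. The helper names `fit_min_max` and `apply_min_max` and the feature values are hypothetical; the notebook itself relies on `MinMaxScaler`.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from training values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values using previously learned bounds."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of a single feature
train_feature = [2.0, 4.0, 6.0, 10.0]
test_feature = [3.0, 12.0]  # 12.0 lies outside the training range

lo, hi = fit_min_max(train_feature)  # bounds come from the training data
train_scaled = apply_min_max(train_feature, lo, hi)
test_scaled = apply_min_max(test_feature, lo, hi)

print(train_scaled)  # [0.0, 0.25, 0.5, 1.0]
print(test_scaled)   # [0.125, 1.25]
```

A scaled test value above 1 (or below 0) is expected whenever a test observation falls outside the training range. It is a consequence of not letting the test set influence the scaling bounds.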

## Q3 — DataFrame with actuals, predictions, and probabilities

In [4]:
# Predict the class (0/1) for each institution in the test set
y_pred = knn3.predict(X_test)

# Predict probability of the positive class (high_grad_rate = 1)
# predict_proba returns [P(class=0), P(class=1)] for each row
y_prob_pos = knn3.predict_proba(X_test)[:, 1]

# Create the required dataframe
q3_results = pd.DataFrame({
    "actual": y_test.values,
    "predicted": y_pred,
    "prob_positive": y_prob_pos
})

# Display a preview
q3_results.head()

| | actual | predicted | prob_positive |
|---|---|---|---|
| 0 | 0 | 0 | 0.0 |
| 1 | 0 | 0 | 0.0 |
| 2 | 0 | 0 | 0.0 |
| 3 | 0 | 0 | 0.0 |
| 4 | 0 | 0 | 0.0 |


In the five rows previewed above, the k = 3 model predicts every college as not
having a high graduation rate, each with a predicted probability of 0.0. A
probability of 0.0 means that none of the three nearest neighbors in the
training data is in the top quartile. With k = 3 and uniform weights, the
predicted probability can only be 0, 1/3, 2/3, or 1, so exact zeros are expected
for many colleges when about 75% of institutions belong to class 0.

Because the preview raised the question of whether the model was working correctly,
I tested a larger value of k (k = 10) as a quick check. With k = 10, the model
produced finer-grained probability values (such as 0.2), showing that some nearby
colleges do have high graduation rates. This is consistent with the data preparation
and model working as expected: the coarse probabilities at k = 3 follow from how
few neighbors vote and from the imbalanced target, not from an error in the pipeline.
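
The idea that a kNN probability is simply the share of positive neighbors can be sketched in a few lines of plain Python. The points, labels, and the helper `knn_positive_share` below are made up for illustration and assume Euclidean distance with equally weighted neighbors.

```python
import math


def knn_positive_share(train_points, train_labels, query, k):
    """Share of the k nearest training points (Euclidean) labeled 1."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest = order[:k]
    return sum(train_labels[i] for i in nearest) / k


# Hypothetical scaled training data: two features per institution
train_points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
                (0.30, 0.30), (0.80, 0.90), (0.85, 0.80)]
train_labels = [0, 0, 0, 1, 1, 1]
query = (0.12, 0.18)

p3 = knn_positive_share(train_points, train_labels, query, k=3)
p5 = knn_positive_share(train_points, train_labels, query, k=5)

print(p3)  # 0.0 -> the three closest points are all class 0
print(p5)  # 0.4 -> two of the five closest points are class 1
```

The same query point moves from 0.0 to a non-zero probability once more neighbors are allowed to vote, which mirrors the k = 3 versus k = 10 comparison in this section.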

In [5]:
# ============================================================
# checking kNN with higher k (diagnostic only)
# ============================================================

# Try a higher k
knn10 = KNeighborsClassifier(n_neighbors=10)
knn10.fit(X_train, y_train)

# Predict probabilities on test set
y_prob_pos_k10 = knn10.predict_proba(X_test)[:, 1]
# Apply a 0.5 cutoff by hand; with an even k, a 5-of-10 tie (probability 0.5) is labeled 1 here
y_pred_k10 = (y_prob_pos_k10 >= 0.5).astype(int)

# Look at first few rows
pd.DataFrame({
    "actual": y_test.values,
    "predicted_k10": y_pred_k10,
    "prob_positive_k10": y_prob_pos_k10
}).head()


| | actual | predicted_k10 | prob_positive_k10 |
|---|---|---|---|
| 0 | 0 | 0 | 0.0 |
| 1 | 0 | 0 | 0.0 |
| 2 | 0 | 0 | 0.0 |
| 3 | 0 | 0 | 0.0 |
| 4 | 0 | 0 | 0.2 |


## Q4 — Effect of k on Threshold and Confusion Matrix
#### If you adjusted the k hyperparameter what do you think would happen to the threshold function? Would the confusion matrix look the same at the same threshold levels or not? Why or why not?

Changing the value of k changes how predicted probabilities are calculated in a
kNN model. With a small k, predictions depend on only a few nearby points, which
often leads to extreme probabilities such as 0.0 or 1.0. As k increases, the
model averages over more neighbors, producing smoother probability values that
are closer to the middle.

Because the predicted probabilities change when k changes, using the same
classification threshold will not produce the same confusion matrix. Even if
the threshold stays fixed, different probability values will cross the
threshold for different observations, leading to changes in false positives
and false negatives. As such, adjusting k directly affects how the threshold
behaves and how the confusion matrix looks.
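
A short sketch can show why a fixed threshold does not mean the same thing at different values of k. It assumes equally weighted neighbors, so the predicted probability is always a multiple of 1/k. The helper names are hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def attainable_probabilities(k):
    """Possible positive-class probabilities with k equally weighted neighbors."""
    return [Fraction(i, k) for i in range(k + 1)]


def votes_needed(k, threshold):
    """Smallest number of positive neighbors whose share is >= threshold."""
    return next(i for i in range(k + 1) if Fraction(i, k) >= threshold)


half = Fraction(1, 2)
three_fifths = Fraction(3, 5)

for k in (3, 5, 10):
    probs = ", ".join(str(p) for p in attainable_probabilities(k))
    print(f"k={k}: probabilities {{{probs}}}")
    print(f"  votes needed at 0.5: {votes_needed(k, half)} of {k}")
    print(f"  votes needed at 0.6: {votes_needed(k, three_fifths)} of {k}")
```

At a threshold of 0.5, k = 3 requires 2 of 3 positive neighbors (a share of about 0.67), while k = 10 requires only 5 of 10. For k = 5, thresholds of 0.5 and 0.6 both require 3 positive neighbors, which is consistent with the identical k = 5 rows at those two thresholds in the grid search output later in the notebook.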

## Q5. Evaluate the results using the confusion matrix. 
#### Then "walk" through your question and summarize what concerns or positive elements you have about the model as it relates to your question.

In [6]:
from sklearn.metrics import confusion_matrix, classification_report

y_pred = knn3.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)

print(classification_report(y_test, y_pred))


[[311  20]
 [ 30  81]]
              precision    recall  f1-score   support

           0       0.91      0.94      0.93       331
           1       0.80      0.73      0.76       111

    accuracy                           0.89       442
   macro avg       0.86      0.83      0.84       442
weighted avg       0.88      0.89      0.89       442



My confusion matrix is:

                        [[311  20]
                         [ 30  81]]

Meaning: 
- **311** colleges correctly predicted as low graduation (class 0)
- **81** colleges correctly predicted as high graduation (class 1)
- **20** colleges incorrectly predicted as high (false positives)
- **30** colleges incorrectly predicted as low (false negatives)

**Support** (row totals of the confusion matrix):

- Class 0: 311 + 20 = 331
- Class 1: 30 + 81 = 111
- Total: 442

**Prevalence** = proportion of the positive class in the test set: 111 / 442 = **0.251 ≈ 25%**

**Baseline accuracy** = accuracy from predicting the majority class (class 0) for every institution: 331 / 442 = **0.749 ≈ 75%**

**Accuracy = 0.89** from the classification report, calculated as (311 + 81) / 442 = 392 / 442 ≈ 0.887

For class 1 (high graduation):
- **Precision = 0.80**
- **Recall = 0.73**

**Recall (0.73)**

Recall = TP / (TP + FN) = 81 / (81 + 30) = 81 / 111 ≈ 0.73

The model correctly identifies 73% of high-performing institutions, so it misses 27% of them.

**Precision (0.80)**

Precision = TP / (TP + FP) = 81 / (81 + 20) = 81 / 101 ≈ 0.80

When the model predicts a college is high-performing, it is correct 80% of the time.
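
The hand calculations above can be checked with a few lines of plain Python. The counts are copied from the confusion matrix printed in this section; the variable names are only illustrative.

```python
# Counts from the confusion matrix above, laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 311, 20, 30, 81
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
prevalence = (tp + fn) / total
# Accuracy of always predicting the larger class
baseline_accuracy = max(tn + fp, tp + fn) / total

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("prevalence", prevalence),
    ("baseline accuracy", baseline_accuracy),
]:
    print(f"{name}: {value:.3f}")
```

This prints 0.887, 0.802, 0.730, 0.251, and 0.749 respectively, matching the rounded values reported above.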


**Research question**: How are institutional financial resources and spending efficiency related to graduation outcomes?

**What the model tests**: Can institutional characteristics predict whether a college is in the top quartile of graduation rates?

* The model performs better than the baseline (89% vs. 75% accuracy).

* Precision for class 1 is 80%.

* Recall for class 1 is 73%.

* It correctly classifies 89% of test institutions overall.

This suggests that institutional features carry predictive signal.


My confusion matrix means that 311 colleges were correctly predicted as low graduation institutions (class 0), and 81 colleges were correctly predicted as high graduation institutions (class 1). The model incorrectly labeled 20 low-performing institutions as high-performing (false positives) and failed to identify 30 high-performing institutions (false negatives).

There are 111 high-graduation institutions out of 442 total institutions in the test set, so the prevalence of the positive class is approximately 25%. A baseline model that predicts the majority class (low graduation) for every observation would achieve an accuracy of about 75%. In contrast, the kNN model achieves 89% accuracy, which is an improvement of about 14 percentage points over the baseline and suggests that the model is capturing real structure in the data.

For high-graduation institutions (class 1), the recall is 0.73, meaning the model correctly identifies 73% of institutions in the top quartile of graduation rates. The precision is 0.80, meaning that when the model predicts an institution is high-performing, it is correct 80% of the time. This indicates that the model is reasonably effective at identifying high-performing institutions, although it still misses about 27% of them.

In relation to my research question, how institutional financial resources and spending efficiency relate to graduation outcomes, these results suggest that institutional characteristics do contain predictive information about whether a college is in the top quartile of graduation performance. Institutions with similar financial and structural characteristics tend to have similar graduation outcomes. However, the model is based on similarity between institutions and does not prove that financial resources directly cause higher graduation rates. The results show predictive patterns rather than cause-and-effect relationships.

## Q6. Create two functions:

#### Function 1 — Clean + split (train/test)

In [17]:

def clean_split_college_knn(path="cc_institution_details.csv", test_size=0.2, random_state=42):
    df = pd.read_csv(path)
    df = df.replace("NULL", np.nan)

    # categorical columns
    for col in ["state", "level", "control", "basic"]:
        if col in df.columns:
            df[col] = df[col].astype("category")

    # hbcu/flagship -> 0/1
    for col in ["hbcu", "flagship"]:
        if col in df.columns:
            df[col] = df[col].notna().astype(int)

    # clean counted_pct safely (escape pipe)
    if "counted_pct" in df.columns:
        part = df["counted_pct"].astype(str).str.split(r"\|", expand=True)[0]
        df["counted_pct"] = pd.to_numeric(part, errors="coerce") / 100.0

    # drop columns
    drop_cols = [
        "index", "unitid", "site", "vsa_year",
        "vsa_grad_after4_first", "vsa_grad_elsewhere_after4_first",
        "vsa_enroll_after4_first", "vsa_enroll_elsewhere_after4_first",
        "vsa_grad_after6_first", "vsa_grad_elsewhere_after6_first",
        "vsa_enroll_after6_first", "vsa_enroll_elsewhere_after6_first",
        "vsa_grad_after4_transfer", "vsa_grad_elsewhere_after4_transfer",
        "vsa_enroll_after4_transfer", "vsa_enroll_elsewhere_after4_transfer",
        "vsa_grad_after6_transfer", "vsa_grad_elsewhere_after6_transfer",
        "vsa_enroll_after6_transfer", "vsa_enroll_elsewhere_after6_transfer",
        "med_sat_value", "med_sat_percentile",
        "chronname", "city", "nicknames", "long_x", "lat_y"
    ]
    drop_cols = [c for c in drop_cols if c in df.columns]
    df = df.drop(columns=drop_cols)

    # must have grad_150_value to define target
    df = df.dropna(subset=["grad_150_value"])

    # create target (top quartile)
    thr = df["grad_150_value"].quantile(0.75)
    df["high_grad_rate"] = (df["grad_150_value"] > thr).astype(int)
    df = df.drop(columns=["grad_150_value"])  # avoid leakage

    # one-hot encode categoricals
    cat_cols = list(df.select_dtypes(include=["category"]).columns)
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

    # drop leftover string columns
    obj_cols = df.select_dtypes(include=["object", "string"]).columns
    df = df.drop(columns=obj_cols)

    # drop remaining missing
    df = df.dropna()

    X = df.drop(columns=["high_grad_rate"])
    y = df["high_grad_rate"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )

    # scale using train only
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, X_test, y_train.to_numpy(), y_test.to_numpy()


#### Function 2 — Train/test with adjustable k + threshold

In [8]:
def run_knn_k_threshold(X_train, y_train, X_test, y_test, k=3, threshold=0.5):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)

    prob_pos = model.predict_proba(X_test)[:, 1]
    pred = (prob_pos >= threshold).astype(int)

    cm = confusion_matrix(y_test, pred)  # [[TN, FP],[FN, TP]]
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    results_df = pd.DataFrame({
        "actual": y_test,
        "predicted": pred,
        "prob_positive": prob_pos
    })

    summary = {
        "k": k,
        "threshold": threshold,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "tn": tn, "fp": fp, "fn": fn, "tp": tp
    }

    return results_df, cm, summary


#### Use to optimize: test several k + threshold combos

In [9]:
def grid_search_knn(X_train, y_train, X_test, y_test, k_values, thresholds):
    all_rows = []
    for k in k_values:
        for t in thresholds:
            _, _, summary = run_knn_k_threshold(X_train, y_train, X_test, y_test, k=k, threshold=t)
            all_rows.append(summary)
    return pd.DataFrame(all_rows).sort_values(["accuracy", "recall"], ascending=False)


#### Run grid search

In [10]:
X_train, X_test, y_train, y_test = clean_split_college_knn()

k_values = [1, 3, 5, 7, 10, 15, 25]
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]

search_df = grid_search_knn(X_train, y_train, X_test, y_test, k_values, thresholds)
search_df.head(10)


| | k | threshold | accuracy | precision | recall | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|---|
| 17 | 7 | 0.5 | 0.902715 | 0.895349 | 0.693694 | 322 | 9 | 34 | 77 |
| 26 | 15 | 0.4 | 0.900452 | 0.791304 | 0.81982 | 307 | 24 | 20 | 91 |
| 12 | 5 | 0.5 | 0.89819 | 0.866667 | 0.702703 | 319 | 12 | 33 | 78 |
| 13 | 5 | 0.6 | 0.89819 | 0.866667 | 0.702703 | 319 | 12 | 33 | 78 |
| 22 | 10 | 0.5 | 0.893665 | 0.847826 | 0.702703 | 317 | 14 | 33 | 78 |
| 23 | 10 | 0.6 | 0.893665 | 0.910256 | 0.63964 | 324 | 7 | 40 | 71 |
| 27 | 15 | 0.5 | 0.891403 | 0.862069 | 0.675676 | 319 | 12 | 36 | 75 |
| 25 | 15 | 0.3 | 0.88914 | 0.738462 | 0.864865 | 297 | 34 | 15 | 96 |
| 31 | 25 | 0.4 | 0.88914 | 0.77193 | 0.792793 | 305 | 26 | 23 | 88 |
| 30 | 25 | 0.3 | 0.886878 | 0.732824 | 0.864865 | 296 | 35 | 15 | 96 |


In [11]:
best = search_df.iloc[0]
best_k = int(best["k"])
best_t = float(best["threshold"])

best_results, best_cm, best_summary = run_knn_k_threshold(
    X_train, y_train, X_test, y_test, k=best_k, threshold=best_t
)

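# Only the last expression in a cell is displayed, so this tuple is not shown in the output below;
# wrap it in print() or display() to inspect the summary and confusion matrix.
best_summary, best_cm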
best_results.head()


| | actual | predicted | prob_positive |
|---|---|---|---|
| 0 | 0 | 0 | 0.0 |
| 1 | 0 | 0 | 0.0 |
| 2 | 0 | 0 | 0.0 |
| 3 | 0 | 0 | 0.0 |
| 4 | 0 | 0 | 0.0 |


In [14]:
best_results[best_results["predicted"] == 1].head()


| | actual | predicted | prob_positive |
|---|---|---|---|
| 10 | 1 | 1 | 1.0 |
| 18 | 1 | 1 | 0.714286 |
| 29 | 0 | 1 | 0.714286 |
| 31 | 1 | 1 | 0.714286 |
| 35 | 1 | 1 | 0.571429 |


## Q7 How well does the model perform?

Original model:

* k = 3, threshold = 0.5

* Accuracy ≈ 0.89

* Precision ≈ 0.80

* Recall ≈ 0.73

Best-accuracy model:

* k = 7, threshold = 0.5

* Accuracy ≈ 0.903

* Precision ≈ 0.895

* Recall ≈ 0.694

Higher-recall model with nearly the same accuracy:

* k = 15, threshold = 0.4

* Accuracy ≈ 0.900

* Precision ≈ 0.791

* Recall ≈ 0.820

(The highest recall among the ten rows shown is ≈ 0.865, at threshold 0.3 with k = 15 or k = 25, at the cost of lower accuracy and precision.)

Overall: 
- Lower thresholds (0.3–0.4) increased recall but produced more false positives.

- Higher thresholds (0.6–0.7) increased precision but reduced recall.

- At threshold 0.5, increasing k from 3 to 7 raised accuracy slightly (≈ 0.887 to ≈ 0.903); larger values of k did not improve it further.


Overall, the model performs well. The original k = 3 model achieved 89% accuracy, which was already a clear improvement over the 75% baseline accuracy. After testing multiple combinations of k values and probability thresholds, the best-performing model achieved approximately 90% accuracy (k = 7, threshold = 0.5), a modest improvement over the original model. Because k and the threshold were chosen by comparing results on the test set itself, this figure is likely somewhat optimistic.

Adjusting k and the threshold did help improve performance slightly, but more importantly, it changed the balance between precision and recall. Increasing k made predictions more stable by averaging over more neighboring institutions, which slightly improved accuracy and precision. Adjusting the threshold changed how sensitive the model was to identifying high-performing institutions. Lower thresholds increased recall (identifying more high-performing colleges) but also increased false positives. Higher thresholds improved precision but reduced recall.

This shows that the interaction between k and the classification threshold affects how the model behaves. The improvements in accuracy were relatively small, but changing these parameters allowed for better control over the tradeoff between identifying high-performing institutions and avoiding incorrect positive predictions.
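
One way to summarize this tradeoff in a single number is the F1 score, which already appears in the classification report from Q5. The sketch below recomputes it from counts copied from the Q5 confusion matrix and the grid search output. Ranking by F1 instead of accuracy is an assumption made here for illustration, not the criterion used in the grid search.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score (harmonic mean of precision and recall) from raw counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Counts copied from the outputs above
candidates = {
    "k=3, t=0.5": {"tp": 81, "fp": 20, "fn": 30},
    "k=7, t=0.5": {"tp": 77, "fp": 9, "fn": 34},
    "k=15, t=0.4": {"tp": 91, "fp": 24, "fn": 20},
}

f1_scores = {name: f1_from_counts(**counts) for name, counts in candidates.items()}

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

This prints roughly 0.764, 0.782, and 0.805. Under F1 the k = 15, threshold = 0.4 model ranks above the best-accuracy model, so the preferred setting depends on whether missed high-performing institutions or false alarms matter more.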

## Q8 Choose another variable as the target in the dataset

In [22]:
def clean_split_college_retain(
    path="cc_institution_details.csv",
    test_size=0.2,
    random_state=42
):
    # --------------------------------------------------
    # 1. Load the dataset and clean obvious missing values
    # --------------------------------------------------
    df = pd.read_csv(path)
    df = df.replace("NULL", np.nan)

    # --------------------------------------------------
    # 2. Convert selected columns to categorical type
    #    These are institutional descriptors, not numeric quantities
    # --------------------------------------------------
    for col in ["state", "level", "control", "basic"]:
        if col in df.columns:
            df[col] = df[col].astype("category")

    # --------------------------------------------------
    # 3. Convert hbcu and flagship indicators to 0/1
    #    Missing values mean "No"
    # --------------------------------------------------
    for col in ["hbcu", "flagship"]:
        if col in df.columns:
            df[col] = df[col].notna().astype(int)

    # --------------------------------------------------
    # 4. Clean counted_pct column
    #    Some entries look like "85|year"
    #    We split at "|" and keep only the numeric percentage
    # --------------------------------------------------
    if "counted_pct" in df.columns:
        part = df["counted_pct"].astype(str).str.split(r"\|", expand=True)[0]
        df["counted_pct"] = pd.to_numeric(part, errors="coerce") / 100.0

    # --------------------------------------------------
    # 5. Drop columns that are irrelevant or highly missing
    # --------------------------------------------------
    drop_cols = [
        "index", "unitid", "site", "vsa_year",
        "vsa_grad_after4_first", "vsa_grad_elsewhere_after4_first",
        "vsa_enroll_after4_first", "vsa_enroll_elsewhere_after4_first",
        "vsa_grad_after6_first", "vsa_grad_elsewhere_after6_first",
        "vsa_enroll_after6_first", "vsa_enroll_elsewhere_after6_first",
        "vsa_grad_after4_transfer", "vsa_grad_elsewhere_after4_transfer",
        "vsa_enroll_after4_transfer", "vsa_enroll_elsewhere_after4_transfer",
        "vsa_grad_after6_transfer", "vsa_grad_elsewhere_after6_transfer",
        "vsa_enroll_after6_transfer", "vsa_enroll_elsewhere_after6_transfer",
        "med_sat_value", "med_sat_percentile",
        "chronname", "city", "nicknames", "long_x", "lat_y"
    ]
    drop_cols = [c for c in drop_cols if c in df.columns]
    df = df.drop(columns=drop_cols)

    # --------------------------------------------------
    # 6. Define NEW TARGET VARIABLE: Retention Rate
    #    We classify institutions in the top quartile
    #    of retention rate as 1, others as 0
    # --------------------------------------------------
    df = df.dropna(subset=["retain_value"])

    threshold = df["retain_value"].quantile(0.75)

    df["high_retain"] = (df["retain_value"] > threshold).astype(int)

    # Drop the original retain_value column to prevent leakage
    df = df.drop(columns=["retain_value"])

    # --------------------------------------------------
    # 7. One-hot encode categorical variables
    # --------------------------------------------------
    cat_cols = list(df.select_dtypes(include=["category"]).columns)
    df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

    # --------------------------------------------------
    # 8. Drop any remaining string columns
    # --------------------------------------------------
    obj_cols = df.select_dtypes(include=["object", "string"]).columns
    df = df.drop(columns=obj_cols)

    # --------------------------------------------------
    # 9. Remove any remaining missing values
    # --------------------------------------------------
    df = df.dropna()

    # --------------------------------------------------
    # 10. Split into features (X) and target (y)
    # --------------------------------------------------
    X = df.drop(columns=["high_retain"])
    y = df["high_retain"]

    # --------------------------------------------------
    # 11. Train/Test split (stratified)
    # --------------------------------------------------
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        stratify=y,
        random_state=random_state
    )

    # --------------------------------------------------
    # 12. Scale features using MinMaxScaler
    #     Fit ONLY on training data to avoid leakage
    # --------------------------------------------------
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, X_test, y_train.to_numpy(), y_test.to_numpy()


In [23]:
X_train2, X_test2, y_train2, y_test2 = clean_split_college_retain()


In [24]:
search_df2 = grid_search_knn(
    X_train2, y_train2, X_test2, y_test2,
    k_values=[1,3,5,7,10,15,25],
    thresholds=[0.3,0.4,0.5,0.6,0.7]
)

search_df2.head()


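| | k | threshold | accuracy | precision | recall | tn | fp | fn | tp |
|---|---|---|---|---|---|---|---|---|---|
| 23 | 10 | 0.6 | 0.911765 | 0.884298 | 0.810606 | 296 | 14 | 25 | 107 |
| 28 | 15 | 0.6 | 0.909502 | 0.896552 | 0.787879 | 298 | 12 | 28 | 104 |
| 17 | 7 | 0.5 | 0.904977 | 0.857143 | 0.818182 | 292 | 18 | 24 | 108 |
| 22 | 10 | 0.5 | 0.902715 | 0.82963 | 0.848485 | 287 | 23 | 20 | 112 |
| 27 | 15 | 0.5 | 0.89819 | 0.832061 | 0.825758 | 288 | 22 | 23 | 109 |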


Target variable:

* `retain_value` (top quartile classified as 1 = high retention)

Best-performing model:

* k = 10

* Threshold = 0.6

* Accuracy ≈ 0.912

* Precision ≈ 0.884

* Recall ≈ 0.811

Confusion matrix components (best model):

* True Negatives (TN) = 296

* False Positives (FP) = 14

* False Negatives (FN) = 25

* True Positives (TP) = 107

Compared to the graduation model:

* Graduation best accuracy ≈ 0.903

* Retention best accuracy ≈ 0.912

* Retention recall (≈ 0.81) is higher than the recall of the best-accuracy graduation model (≈ 0.69) and of most of the other tuned graduation models

Overall: 

* Increasing k improved stability of predictions.

* A slightly higher threshold (0.6) improved precision while maintaining strong recall.

* The retention model produced slightly stronger overall performance than the graduation model.

For the second model, I selected `retain_value` as the target variable and defined institutions in the top quartile of retention as 1. After optimization, the best-performing model achieved approximately 91% accuracy. The model correctly identified about 81% of high-retention institutions and was correct about 88% of the time when predicting high retention.

Compared to the graduation model, the retention model performed slightly better overall. This suggests that institutional characteristics may be somewhat more predictive of short-term retention outcomes than long-term graduation outcomes. As with the previous model, the results show predictive relationships rather than causal conclusions.
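
Since the two targets do not have the same class balance in their test sets, it can help to compare each model against its own majority-class baseline. The sketch below does this using counts copied from the two grid search outputs; the helper name `accuracy_and_baseline` is hypothetical.

```python
def accuracy_and_baseline(tn, fp, fn, tp):
    """Return (model accuracy, majority-class baseline accuracy)."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    baseline = max(tn + fp, fn + tp) / total
    return accuracy, baseline


# Best-accuracy rows from the two grid searches above
grad_acc, grad_base = accuracy_and_baseline(tn=322, fp=9, fn=34, tp=77)
ret_acc, ret_base = accuracy_and_baseline(tn=296, fp=14, fn=25, tp=107)

print(f"graduation: accuracy {grad_acc:.3f}, baseline {grad_base:.3f}, "
      f"gain {grad_acc - grad_base:.3f}")
print(f"retention:  accuracy {ret_acc:.3f}, baseline {ret_base:.3f}, "
      f"gain {ret_acc - ret_base:.3f}")
```

With these counts the retention test set contains 132 positive institutions out of 442 (about 30%), so its baseline is about 0.701 compared with 0.749 for graduation. Measured as a gain over baseline, the retention model's advantage (about 0.210 versus 0.154) is larger than the raw accuracy difference suggests.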