# Assignment 6 - Supervised Learning

Note: Put your code in functions and use default arguments where they make sense. You are now expected to use at least some functions in your code to make it reusable. We will check that there is at least one function in your code when grading.

Extra credit: Use type hints in your functions to make sure you are using the right types when you call the functions.
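
As a rough illustration of the two notes above, here is a minimal sketch of a reusable function with type hints and default arguments. The function name `split_features_label`, its parameters, and the toy column names are hypothetical and not part of the assignment.

```python
from typing import Sequence

import pandas as pd


def split_features_label(
    df: pd.DataFrame,
    label_col: str = "label",
    drop_cols: Sequence[str] = (),
) -> tuple[pd.DataFrame, pd.Series]:
    """Return (features, label), dropping the label and any extra columns from the features."""
    features = df.drop(columns=[label_col, *drop_cols])
    return features, df[label_col]


# Toy data; the column names are made up for illustration.
toy = pd.DataFrame({"id": [1, 2], "income": [10.0, 20.0], "label": [0, 1]})
X_toy, y_toy = split_features_label(toy, drop_cols=["id"])
print(list(X_toy.columns))
```

The defaults let the common case (`label_col="label"`, nothing extra to drop) be called with a single argument, while the hints document what each argument is expected to be.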

1.) Clean your dataset to turn categorical values into numerical ones. One-hot encoding is likely the answer, but it depends on the dataset. Your data may have ordinal columns, for example, where one-hot encoding is less appropriate.
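
To make the distinction concrete, here is a small sketch on a toy frame: an ordinal column is mapped to ranks, while a nominal column is one-hot encoded. The column names `size` and `color` and their categories are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy data; the column names and categories are hypothetical.
toy = pd.DataFrame({
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
    "color": ["red", "blue", "red"],       # nominal: no natural order
})

# Ordinal column: map each level to its rank so the order is preserved.
size_order = {"small": 1, "medium": 2, "large": 3}
toy["size"] = toy["size"].map(size_order)

# Nominal column: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
print(encoded)
```

One-hot encoding the ordinal column instead would discard the fact that "large" is further from "small" than "medium" is.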

In [42]:
import numpy as np
import pandas as pd

cc_main_df = pd.read_csv("../Misc./Credit_card.csv")
cc_label_df = pd.read_csv("../Misc./Credit_card_label.csv")
cc_combined_df = pd.merge(cc_main_df, cc_label_df, on = "Ind_ID", how = "inner")

In [43]:
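# Forward-fill missing values (fillna(method="ffill") is deprecated in recent pandas).
cc_combined_df = cc_combined_df.ffill()
# EDUCATION is ordinal, so map its levels to ranks instead of one-hot encoding it.
ordinal_mappers = {"EDUCATION": {"Lower secondary": 1, "Secondary / secondary special": 2, "Incomplete higher": 3, "Higher education": 4, "Academic degree": 5}}
cc_combined_df = cc_combined_df.replace(ordinal_mappers)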

one_hot_columns = ["GENDER", "Car_Owner", "Propert_Owner", "Type_Income", "Marital_status", "Housing_type", "Type_Occupation"]
cc_combined_df = pd.get_dummies(cc_combined_df, columns = one_hot_columns)



2.) Perform univariate linear regression on the dataset. Select your variable to predict. How well did this model perform? Is this a good approach for this dataset? Why or why not? 

In [61]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

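def linear_regression_train(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    """Standardize the training features and fit a linear regression on them."""
    # The scaler is fitted here but not returned, so callers have to scale test data themselves.
    sc = StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    regression = LinearRegression().fit(X_train_scaled, y_train)
    return regression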

def regression_performance(regression: LinearRegression, X_test_scaled: np.ndarray, y_test: np.ndarray) -> tuple:
    y_predicted = regression.predict(X_test_scaled)
    r2 = r2_score(y_test, y_predicted)
    mse = mean_squared_error(y_test, y_predicted)
    return r2, mse

X = cc_combined_df.drop(columns = ['Ind_ID', 'EMAIL_ID', 'label'])
y = cc_combined_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 9383028)

regression_model = linear_regression_train(X_train, y_train)

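# Caution: this fits a second scaler on the test set, so the test features are
# standardized with different statistics than the ones the model was trained on.
sc = StandardScaler()
X_test_scaled = sc.fit_transform(X_test)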

r2_score_value, mse_score = regression_performance(regression_model, X_test_scaled, y_test)
print("R-squared Score:", r2_score_value)
print("Mean Squared Error:", mse_score)

R-squared Score: -2.7780671829797753e+25
Mean Squared Error: 2.7592448061821554e+24


- (Non-normalized R-squared Score and Mean Squared Error of 0.102 and 0.095, respectively... SEE PROBLEM 5)

This is NOT a good approach for this dataset: the target is BINARY, and linear regression predicts values well outside the 0-to-1 (False or True) range. Those extreme values are NOT interpretable... what could "45" mean in the context of this dataset? Instead, we should use logistic regression, whose outputs are interpretable as probabilities between 0 and 1.
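
The point about out-of-range predictions can be sketched on synthetic data. Everything below (the single feature, the way the label is generated, the extreme inputs) is an assumption made for illustration and is unrelated to the credit card dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (an assumption for illustration): one feature, binary label.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

# Score two inputs far away from the bulk of the training data.
x_extreme = np.array([[-10.0], [10.0]])
linear_pred = linear.predict(x_extreme)
logistic_prob = logistic.predict_proba(x_extreme)[:, 1]

print("linear:", linear_pred)      # a straight line is not confined to [0, 1]
print("logistic:", logistic_prob)  # probabilities always stay within [0, 1]
```

The linear model extrapolates along a straight line, so its output leaves the 0-to-1 range once the input is extreme enough, whereas `predict_proba` is bounded by construction.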

3.) Perform KNN on this dataset. As part of this, write a function that selects the optimal value of k. How well did this model perform?

In [59]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def train_knn(X_train: np.ndarray, y_train: np.ndarray, k: int) -> KNeighborsClassifier:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    return knn

def select_optimal_k(X_train: np.ndarray, y_train: np.ndarray, max_k: int = 13, cv: int = 5) -> int:
    optimal_k = 1
    max_score = 0
    for k in range(1, max_k + 1):
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X_train, y_train, cv = cv)
        avg_score = np.mean(scores)
        if avg_score > max_score:
            max_score = avg_score
            optimal_k = k
    return optimal_k

optimal_k = select_optimal_k(X_train, y_train)

knn = train_knn(X_train, y_train, optimal_k)

performance = knn.score(X_test, y_test)
print("Optimal K is:", optimal_k)
print("Accuracy:", performance)

Optimal K is: 12
Accuracy: 0.8817204301075269


- (Non-normalized accuracy of 0.879... SEE PROBLEM 5)

KNN performs well here, reaching about 88% accuracy on the test set with k = 12. This is due in large part to the advantage it has in reducing noise by voting over only the nearest "neighborhood" in a (relatively) large dataset.
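
The search over k can also be expressed with scikit-learn's `GridSearchCV`, which runs the same kind of cross-validated loop. This is only a sketch on synthetic data from `make_classification`; the sample size, feature count, and variable names are assumptions, not the assignment's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Try k = 1..13 with 5-fold cross-validation on the training data only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 14))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # uses the model refit with the best k
print("Best k:", best_k)
print("Test accuracy:", test_accuracy)
```

Because the grid search only sees the training split, the held-out test accuracy remains an honest estimate for the chosen k.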

4.) Work with your dataset to perform logistic regression. How well did this perform?

In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

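def logistic_regression_train(X_train: np.ndarray, y_train: np.ndarray) -> LogisticRegression:
    """Standardize the training features and fit a logistic regression on them."""
    # The scaler is fitted here but not returned, so callers have to scale test data themselves.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    logistic_reg = LogisticRegression()
    logistic_reg.fit(X_train_scaled, y_train)
    return logistic_reg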

def logistic_regression_performance(logistic_reg: LogisticRegression, X_test_scaled: np.ndarray, y_test: np.ndarray) -> float:
    y_pred = logistic_reg.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

X = cc_combined_df.drop(columns = ['Ind_ID', 'EMAIL_ID', 'label'])
y = cc_combined_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 18030)

logistic_reg_model = logistic_regression_train(X_train, y_train)

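# Caution: this fits a second scaler on the test set, so the test features are
# standardized with different statistics than the ones the model was trained on.
scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)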

accuracy = logistic_regression_performance(logistic_reg_model, X_test_scaled, y_test)
print(accuracy)

0.886021505376344


- (Non-normalized accuracy of 0.879... SEE PROBLEM 5)

As we hypothesized earlier, logistic regression performed SIGNIFICANTLY better than linear regression, since it squeezes its outputs into our range of 0 to 1, and it even scored slightly higher than our arguably more complex KNN model (0.886 vs. 0.882 accuracy)!

5.) Perform normalization on your dataset. Does it change the performance for 2-4? What is the best measure of performance for your dataset (accuracy or something else) and why?

After returning to the linear regression and changing the function to use normalized data (see above), the R-squared fell drastically from 0.102 to -2.778e+25 and the mean squared error rose from 0.095 to 2.759e+24, meaning the performance went from bad to horrific. Normalization alone should not change an ordinary least squares fit this much, so this is likely due, in large part, to how the scaling was applied: the test set is standardized with its own StandardScaler rather than the one fitted on the training data, and the redundant one-hot columns can leave the model with huge, unstable coefficients that amplify any mismatch between training and test features.
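
One common way to guarantee that test data is scaled with the training statistics is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data from `make_regression` (an assumption for illustration, not the credit card data) and compares the scaled pipeline against a plain fit.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# fit() learns the scaling statistics from the training data only;
# predict() then reuses those same statistics on the test data.
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
plain_model = LinearRegression().fit(X_tr, y_tr)

r2_scaled = r2_score(y_te, scaled_model.predict(X_te))
r2_plain = r2_score(y_te, plain_model.predict(X_te))
print("R-squared with scaling:", r2_scaled)
print("R-squared without scaling:", r2_plain)
```

On well-conditioned data like this, the two scores should agree up to floating-point error, which is what one would expect from consistently applied standardization in ordinary least squares.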

In the case of logistic regression and KNN, normalizing nudged our accuracy in the right direction (from 0.879 to 0.886 and 0.882, respectively), likely because it puts all features on the same scale, preventing features with wider ranges from overwhelming those with narrower ones.
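
The effect of feature ranges on a distance-based model can be sketched with synthetic data: one feature carries the signal on a small scale, the other is pure noise on a huge scale. All names and numbers below are assumptions chosen to exaggerate the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 600
y_demo = rng.integers(0, 2, size=n)
informative = rng.normal(loc=2.0 * y_demo - 1.0, scale=0.5)  # small range, carries the signal
noise = rng.normal(scale=1000.0, size=n)                      # huge range, no signal
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Without scaling, Euclidean distances are dominated by the noisy wide-range feature.
unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
# With scaling, both features contribute to the distance on comparable terms.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

acc_unscaled = unscaled.score(X_te, y_te)
acc_scaled = scaled.score(X_te, y_te)
print("Accuracy without scaling:", acc_unscaled)
print("Accuracy with scaling:", acc_scaled)
```

The gap on real data is usually much smaller than in this deliberately extreme setup, which fits the modest improvement reported above.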

In the context of this problem, accuracy would almost certainly be the best measure of performance for the dataset, since we're trying to determine how many people actually received credit cards based on our models (binary outcomes counted as TP, FP, and so on), not the distance of our predictions from the true values, as in linear regression.
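
The TP and FP counts mentioned above come from a confusion matrix, and accuracy is one summary of it. Here is a small sketch with made-up labels and predictions (hypothetical values, not results from this dataset) showing how the counts, accuracy, precision, and recall relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, made up for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
```

Reporting the individual counts next to accuracy shows which kind of error the model makes, which accuracy on its own does not reveal.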