In [None]:
from IA3_skeleton_code import *

df_train = pd.read_csv("IA3-train.csv")
df_val = pd.read_csv("IA3-dev.csv")

tfidfvectorizer = TfidfVectorizer(use_idf=True, lowercase=True)

X_train = tfidfvectorizer.fit_transform(df_train['text'])
y_train = df_train['sentiment']

X_val = tfidfvectorizer.transform(df_val['text'])
y_val = df_val['sentiment']

In [None]:

exp = range(-4,5)
cVals = list(map(lambda x: 10**(x), exp))

acc_train = {}
acc_val = {}
nSVMs = {}

print("Training SVM with Quadratic Kernel")
for c in cVals:
    print("Training on c =", c)
    acc_train[c], acc_val[c], nSVMs[c] = trainSVM(X_train, y_train, X_val, y_val, c, "poly", 2)

In [None]:
plotAcc(exp, acc_train, acc_val, "Quadratic SVM", "qsvmAcc.jpg")
plotNSV(exp, nSVMs, "Quadratic SVM", "qsvmNSV.jpg")

In [None]:
tmpQTA, tmpQVA, tmpQSVs = optimizeC(X_train, y_train, X_val, y_val, 0.3, 0.7, "poly", 2)

In [None]:
plotOptC(tmpQTA, "Training", "optCtrain.jpg", None, None)

In [None]:
plotOptC(tmpQVA, "Validation", "optCval.jpg", (0.5, 0.55), (0.916, 0.918))

### Old results (non-balanced class)
It seems that the ideal c for the quadratic kernel falls within c $\in$ \[0.1,1\]. However, values of c below 0.8 give validation accuracy below 90% while training accuracy reaches around 99%. Optimizing c on validation accuracy gives c $\in$ \[0.9,1\], but the model is clearly overfitting, with near-100% training accuracy. The roughly 10% gap between training and validation accuracy suggests overfitting to the training data, likely due to the high number of support vectors (probably around 6300 for a model in this c range). Optimal c = 0.95.

### New results (balanced class)
Now the best range for c (in terms of validation accuracy) was c $\in$ \[0.3, 0.7\]. This resulted in extremely high training accuracy (a sign of overfitting) but also raised validation accuracy, which peaked at c of around 0.52 (the highest validation accuracy, 91.72%).
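
For reference, scikit-learn documents `class_weight="balanced"` as weighting each class by `n_samples / (n_classes * np.bincount(y))`. A minimal sketch of that arithmetic in plain Python follows; the helper name `balanced_class_weights` and the 90/10 label split are assumptions for illustration, not the IA3 data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for the given labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels: 90 negatives (class 0) and 10 positives (class 1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# The minority class receives the larger weight, so its errors cost more during training.
print(weights)
```

With these hypothetical counts, errors on the minority class are penalized nine times as heavily as errors on the majority class.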

In [None]:
n_train = len(y_train)
n_val = len(y_val)

balanced_optimalQSVM = SVC(C=0.7, kernel="poly", degree=2, max_iter=25000, class_weight="balanced")
unbalanced_optimalQSVM = SVC(C=0.7, kernel="poly", degree=2, max_iter=25000)
balanced_optimalQSVM.fit(X_train, y_train)
unbalanced_optimalQSVM.fit(X_train, y_train)


balanced_y_pred_train = balanced_optimalQSVM.predict(X_train)
balanced_y_pred_val = balanced_optimalQSVM.predict(X_val)

unbalanced_y_pred_train = unbalanced_optimalQSVM.predict(X_train)
unbalanced_y_pred_val = unbalanced_optimalQSVM.predict(X_val)

In [None]:
from sklearn.metrics import balanced_accuracy_score, accuracy_score, recall_score, cohen_kappa_score
from sklearn.metrics import precision_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

balanced_rc = recall_score(y_val, balanced_y_pred_val)
balanced_pr = precision_score(y_val, balanced_y_pred_val)
balanced_rAcc = accuracy_score(y_val, balanced_y_pred_val)
balanced_bAcc = balanced_accuracy_score(y_val, balanced_y_pred_val)
balanced_kappa = cohen_kappa_score(y_val, balanced_y_pred_val)

unbalanced_rc = recall_score(y_val, unbalanced_y_pred_val)
unbalanced_pr = precision_score(y_val, unbalanced_y_pred_val)
unbalanced_rAcc = accuracy_score(y_val, unbalanced_y_pred_val)
unbalanced_bAcc = balanced_accuracy_score(y_val, unbalanced_y_pred_val)
unbalanced_kappa = cohen_kappa_score(y_val, unbalanced_y_pred_val)

print("Balanced")
print("Validation Recall:", balanced_rc)
print("Validation Precision:", balanced_pr)
print("Validation Accuracy:", balanced_rAcc)
print("Balanced Validation Accuracy:", balanced_bAcc)
print("Kappa:", balanced_kappa)

print("Unbalanced")

print("Validation Recall:", unbalanced_rc)
print("Validation Precision:", unbalanced_pr)
print("Validation Accuracy:", unbalanced_rAcc)
print("Balanced Validation Accuracy:", unbalanced_bAcc)
print("Kappa:", unbalanced_kappa)

In [None]:
print(classification_report(y_val, balanced_y_pred_val))
print(classification_report(y_val, unbalanced_y_pred_val))

In [None]:
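# y_pred_val is never defined at notebook level; plot the balanced model's validation predictions
ConfusionMatrixDisplay(confusion_matrix(y_val, balanced_y_pred_val)).plot()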

### Before using balanced class weights (original results)
All metrics for the validation data:
- The optimal model accuracy (non-balanced) was 89.88%
- The balanced accuracy (accounting for the imbalance of classes) was 77.05%
- The accuracy for class 0 (majority class) was 99.14%
- The accuracy for class 1 (minority class) was 54.96%

The model is clearly overfitting to the training data and focusing on the majority class, leaving minority-class performance close to random guessing. We may need to change the way we evaluate and pick models (not sure if we can do that in this report; it seems more like a training issue than an evaluation issue, and I don't believe changing the value of c will account for class imbalance). The training below uses SVC with the "balanced" class weight so that both classes are treated evenly.
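
The figures listed above are consistent with balanced accuracy being the unweighted mean of the per-class accuracies (that is, per-class recall). A small arithmetic check using the two reported values is sketched below; the helper name `mean_per_class_recall` is an assumption for illustration.

```python
def mean_per_class_recall(per_class_recall):
    """Balanced accuracy as the unweighted mean of per-class recall."""
    return sum(per_class_recall) / len(per_class_recall)


# Per-class accuracies reported above for the non-balanced model.
recall_class0 = 0.9914  # majority class
recall_class1 = 0.5496  # minority class

bal_acc = mean_per_class_recall([recall_class0, recall_class1])
print(f"Balanced accuracy: {bal_acc:.4f}")
```

Because each class contributes equally to the mean, the weak minority-class figure pulls the balanced accuracy well below the plain accuracy.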

### After using balanced class weights (new results)
- The optimal model accuracy (non-balanced) was 91.72%
- The balanced accuracy (accounting for the imbalance of classes) was 82.84%
- The accuracy for class 0 (majority class) was 98.12%
- The accuracy for class 1 (minority class) was 67.65%


We should probably include some discussion of how we changed from non-balanced to balanced class weights for the SVC model. Assigning class 0 (negative tweets) to all observations in the validation data gives approximately 89% accuracy, so the original quadratic model (89.88%) was barely better than that trivial baseline.
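
The majority-class baseline described above can be illustrated with a small sketch. The labels here are hypothetical (89 negatives, 11 positives), chosen only to mirror the approximate 89% figure, and the helper names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(y_true, y_pred, cls):
    """Fraction of samples of class `cls` that were predicted as `cls`."""
    idx = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in idx) / len(idx)


# Hypothetical validation labels: 89% class 0, 11% class 1.
y_true = [0] * 89 + [1] * 11
# Baseline "model": always predict the majority class.
y_pred = [0] * len(y_true)

baseline_acc = accuracy(y_true, y_pred)
baseline_bal = (recall_for(y_true, y_pred, 0) + recall_for(y_true, y_pred, 1)) / 2

print("Plain accuracy:", baseline_acc)
print("Balanced accuracy:", baseline_bal)
```

Plain accuracy rewards this baseline for the class imbalance alone, while balanced accuracy drops to chance level, which is why the latter is the more informative metric here.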

# Part 3 - SVM with RBF

In [None]:
tfidfvectorizer = TfidfVectorizer(use_idf=True, lowercase=True)

X_train = tfidfvectorizer.fit_transform(df_train['text'])
y_train = df_train['sentiment']

X_val = tfidfvectorizer.transform(df_val['text'])
y_val = df_val['sentiment']

acc_train = {}
acc_val = {}
nSVs = {}

In [None]:
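def trainSVMrbf(X_train, y_train, X_val, y_val, c, g):
    # Fit a class-balanced RBF SVM; return (train accuracy, validation accuracy, per-class SV counts).
    # The validation parameter was named x_val, so the body silently read the global X_val instead.
    n_val = len(y_val)
    n_train = len(y_train)

    svm = SVC(C=c, kernel="rbf", gamma=g, max_iter=25000, class_weight="balanced")
    svm.fit(X_train, y_train)

    y_pred_train = svm.predict(X_train)
    y_pred_val = svm.predict(X_val)

    # A nonzero difference between prediction and label marks a misclassified sample
    acc_train = (n_train - np.count_nonzero(y_pred_train - y_train)) / n_train
    acc_val = (n_val - np.count_nonzero(y_pred_val - y_val)) / n_val
    return acc_train, acc_val, svm.n_support_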

In [None]:
gRange = [10**i for i in range(-5,2)]
cRange = [10**i for i in range(-4,5)]

for c in cRange:
    for g in gRange:
        print("Training on c =", c, "| gamma =", g)
        acc_train[(c,g)], acc_val[(c,g)], nSVs[(c,g)] = trainSVMrbf(X_train, y_train, X_val, y_val, c, g)

- It appears that high values of c (>= 10) and high values of gamma (= 10) do not converge even with the maximum number of iterations set to 25000 (more than enough for all other combinations).
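
As background for the gamma axis of this grid, the RBF kernel is $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$. The sketch below evaluates it for two hypothetical points over the same gamma values used above. It only illustrates what gamma controls and is not offered as an explanation of the non-convergence.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Two hypothetical orthogonal unit vectors (squared distance = 2).
x = [1.0, 0.0]
z = [0.0, 1.0]

for g in [10 ** i for i in range(-5, 2)]:
    # Small gamma: distinct points still look similar (value near 1).
    # Large gamma: distinct points look unrelated (value near 0).
    print(f"gamma={g:<8} K(x, z)={rbf_kernel(x, z, g):.6g}")
```

At the largest gamma, each point is similar only to itself, so the decision boundary can bend around individual training samples.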

In [None]:
# Convert dictionaries for training and validation accuracy to matrix of x=gamma, y=c
tAccMat = pd.DataFrame(np.nan, index=cRange, columns=gRange)
for i,c in enumerate(cRange):
    newRow = [acc_train[(c, g)] for g in gRange]
    tAccMat.iloc[i] = newRow
    
vAccMat = pd.DataFrame(np.nan, index=cRange, columns=gRange)
for i,c in enumerate(cRange):
    newRow = [acc_val[(c, g)] for g in gRange]
    vAccMat.iloc[i] = newRow

In [None]:
plotHeatMap(tAccMat, "Training", "rbfHeatTrain.jpg")
plotHeatMap(vAccMat, "Validation", "rbfHeatVal.jpg")

#### Q1 - Optimize C & Gamma
- Optimizing based on accuracy from above heatmap (validation).
- Validation accuracy generally falls between 0.8952 and 0.9128 in the region where training accuracy is still comparable (before the model overfits). I chose gamma and c from this region so that the search stays ahead of the point where the model starts overfitting.
- Best combination in form of (c, gamma) from above: (10, 0.001), (10, 0.01)
    - Optimizing in this range: c $\in [9,11]$ and gamma $\in [0.005, 0.04]$
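
The coarse-to-fine step above can be sketched as follows. The dictionary values are hypothetical stand-ins for the coarse-grid validation accuracies (not the notebook's results), and the names `coarse_val_acc`, `c_fine`, and `g_fine` are assumptions for illustration.

```python
# Hypothetical coarse-grid validation accuracies keyed by (c, gamma).
coarse_val_acc = {
    (1, 0.01): 0.900,
    (10, 0.001): 0.905,
    (10, 0.01): 0.912,
    (100, 0.01): 0.908,
}

# Pick the (c, gamma) key with the highest validation accuracy.
best_c, best_gamma = max(coarse_val_acc, key=coarse_val_acc.get)

# Fine grids covering the ranges stated above, built from integer steps so the
# values (and therefore any dictionary keys made from them) avoid float drift.
c_fine = [9 + 0.25 * k for k in range(9)]            # 9.0, 9.25, ..., 11.0
g_fine = [round(0.005 * k, 3) for k in range(1, 9)]  # 0.005, 0.01, ..., 0.04

print("coarse best:", (best_c, best_gamma))
print("fine c grid:", c_fine)
print("fine gamma grid:", g_fine)
```

Building the grids from integer multiples is one way to keep lookups such as `acc[(c, g)]` reliable without rounding each key afterwards.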

In [None]:
opt_tAcc = {}
opt_vAcc = {}

gRange = np.arange(0.005, 0.041, 0.005)
cRange = np.arange(9, 11.1, 0.25)

for c in cRange:
    for g in gRange:
        g = round(g, 3) # fix floating point error
        print("Training on c =", c, "| gamma =", g)
        opt_tAcc[(c,g)], opt_vAcc[(c,g)], _ = trainSVMrbf(X_train, y_train, X_val, y_val, c, g)

In [None]:
# Convert dictionaries for training and validation accuracy to matrix of x=gamma, y=c
gColNames = [round(g,3) for g in gRange]
tAccMat = pd.DataFrame(np.nan, index=cRange, columns=gColNames)
for i,c in enumerate(cRange):
    newRow = [opt_tAcc[(c, round(g,3))] for g in gRange]
    tAccMat.iloc[i] = newRow
    
vAccMat = pd.DataFrame(np.nan, index=cRange, columns=gColNames)
for i,c in enumerate(cRange):
    newRow = [opt_vAcc[(c, round(g,3))] for g in gRange]
    vAccMat.iloc[i] = newRow
    
plotHeatMap(tAccMat, "Training", "rbfHeatOptTrain.jpg")
plotHeatMap(vAccMat, "Validation", "rbfHeatOptVal.jpg")

The best model (without overfitting) uses $c=10$ and $\gamma=0.02$. Its training accuracy is 95.33% and its validation accuracy is 92.12% (some overfitting, but minimal). Even as the model begins to overfit (last 2 columns), validation accuracy barely rises above that of the optimal model (about +0.04% while training accuracy increases by 4.92%).
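
One way to make the "best without overfitting" choice explicit is to cap the train/validation gap and take the highest validation accuracy among the remaining candidates. This is only a sketch of that idea, not necessarily how the choice above was made: the 0.05 gap threshold is an assumption, and every entry except the reported (10, 0.02) pair is hypothetical.

```python
# (c, gamma) -> (training accuracy, validation accuracy)
results = {
    (10, 0.005): (0.9200, 0.9100),  # hypothetical
    (10, 0.02): (0.9533, 0.9212),   # values reported above
    (10, 0.04): (0.9900, 0.9216),   # hypothetical overfit candidate
}

MAX_GAP = 0.05  # assumed tolerance for train-minus-validation accuracy


def select_model(results, max_gap):
    """Return the key with the best validation accuracy whose gap is within max_gap."""
    eligible = {
        params: (train, val)
        for params, (train, val) in results.items()
        if train - val <= max_gap
    }
    return max(eligible, key=lambda params: eligible[params][1])


chosen = select_model(results, MAX_GAP)
print("Chosen (c, gamma):", chosen)
```

Under these assumptions the overfit candidate is excluded despite its slightly higher validation accuracy, which matches the trade-off described above.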

#### Q2 - Accuracy with fixed gamma and across c

In [None]:
# No code needed (use original heatmap looking across each column - fixed gamma - top->down)

#### Q3 - Accuracy with fixed c and across gamma

In [None]:
# No code needed (use original heatmap looking across each row - fixed c - left->right)

#### Q4 - Plot support vectors across c with fixed gamma = 0.1

In [None]:
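# cRange now holds the fine grid from Q1, so rebuild the original log-spaced c values for the lookup
fixedG = [sum(nSVs[(10**i, 0.1)]) for i in range(-4, 5)]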
SVplot(range(-4,5), fixedG, "C (Gamma = 0.1)", "cSVcount.jpg")

#### Q5 - Plot support vectors across gamma with fixed c = 10

In [None]:
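# gRange now holds the fine grid from Q1, so rebuild the original log-spaced gamma values for the lookup
fixedC = [sum(nSVs[(10, 10**i)]) for i in range(-5, 2)]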
SVplot(range(-5,2), fixedC, "Gamma (C=10)", "gammaSVcount.jpg")