In [None]:
''' Libraries here '''

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

In [None]:
''' Loading in our data '''
test_df = pd.read_csv('/content/test.csv')
train_df = pd.read_csv('/content/train.csv')
sample_submission_data = pd.read_csv('/content/submit.csv')

''' ctrl + / for mass comment on codes, can view and hide comment column '''
# print(train_df.head(5)) # id, title, author, text, label (1: unreliable, 0: reliable)
# print('~~~~~~~~~~~~')
# print(sample_submission_data.head(5)) # id, label

# Final Project for DSCI 320, Fall 2024

In this project we will

1) Implement a very basic version of a support vector machine.
2) Apply a support vector machine to a classification problem.
3) Apply a neural network to the same classification problem.

We have three classification problems for you to choose from. The first is the identification of fake news, with the dataset available at:

https://www.kaggle.com/competitions/fake-news/data (this is the one we're doing)

Dataset information:

train.csv: A full training dataset with the following attributes:
- id: unique id for a news article
- title: the title of a news article
- author: author of the news article
- text: the text of the article; could be incomplete
- label: a label that marks the article as potentially unreliable (1: unreliable, 0: reliable)

test.csv: A testing dataset with all the same attributes as train.csv, without the label.

submit.csv: A sample submission that you can

For any chosen problem, data normalization will always be helpful. Those who understand the singular value decomposition (SVD) well could try using SVD first to reduce the dimensionality before moving on to classification with an SVM or a neural network.
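The normalization and SVD steps mentioned above can be sketched with NumPy alone. This is only an illustration under stated assumptions: the `features` matrix is a hypothetical random stand-in for whatever numeric features are extracted from the articles, and the choice of `k = 2` retained components is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a numeric feature matrix:
# rows are samples, columns are features (not the real dataset).
features = rng.normal(loc=5.0, scale=2.0, size=(20, 6))

# Normalize each column to zero mean and unit standard deviation.
mean = features.mean(axis=0)
std = features.std(axis=0)
normalized = (features - mean) / std

# Thin SVD: normalized = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(normalized, full_matrices=False)

# Keep the k leading right singular vectors and project onto them.
k = 2  # arbitrary choice for this sketch
reduced = normalized @ Vt[:k].T

# Fraction of the total squared singular values kept by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(reduced.shape, round(float(explained), 3))
```

The reduced matrix has one row per sample and `k` columns, and it could be passed to a classifier in place of the full feature matrix.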

## Part 1: Naïve implementation of support vector machine

Here we will create a small 2-D dataset and implement a naïve algorithm to find the hyperplane that separates the two clusters in the dataset. Let the hyperplane be $y = w^{T}x$, where $x$ is the extended variable padded with 1s; the algorithm is given in steps 1 to 4 below.

A simple way to generate small datasets is to use make_blobs in sklearn. This way one can generate a given number of points in the 2-D plane, clustered around given centers, for example:

        from sklearn.datasets import make_blobs
        centers = [(1,1),(-1,-1)] # centers of the two clusters
        cluster_std = [0.2,0.2] # standard deviation of the two clusters
        X, Y=make_blobs(n_samples=10, cluster_std=cluster_std, centers=centers,
                        n_features=2, random_state=1) # two clusters, 10 points in total.

Please note that the labels of the two clusters will be 0 and 1 in this case, and you will have to change them to be 1 and −1, so they are consistent with the algorithm. For this part, please make a few figures showing the improved separation of clusters as the iterations continue and the final state of the separation. Attach these figures to your project report.

1) Define learning rate $l_{r} = 0.1$
2) Define expand factor $f_{e} = 0.9$
3) Define reduce factor $f_{r} = 1.1$
4) Pick an arbitrary data point $(x, y)$ and determine whether it is misclassified

        if classified correctly then
                if margin too small then
                        w ← w + l_r · f_r · y x
                else if margin too wide then
                        w ← f_e · w
                end if
        else
                w ← w + l_r · y x
        end if
        go to 4
        
and continue the process until convergence, or a preset number of iterations is reached.
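One way to collect the per-iteration figures requested for this part is to keep a snapshot of $w$ after every pass over the data. The sketch below is a minimal illustration, not a reference solution: the function names `naive_svm_step` and `train_with_history` are hypothetical, the four data points are made up, and the threshold `target = 1.0` used to decide whether a margin is "too small" or "too wide" is an assumption, since the algorithm above does not specify one.

```python
import numpy as np

def naive_svm_step(w, x, y, lr=0.1, fe=0.9, fr=1.1, target=1.0):
    """Apply one update of the rule above to w for a single point (x, y)."""
    margin = y * np.dot(w, x)
    if margin > 0:                    # classified correctly
        if margin < target:           # margin too small (assumed threshold)
            return w + lr * fr * y * x
        if margin > target:           # margin too wide (assumed threshold)
            return fe * w
        return w
    return w + lr * y * x             # misclassified

def train_with_history(X, Y, w, n_epochs=20):
    """Run n_epochs passes over the data, keeping a copy of w after each pass."""
    history = [w.copy()]
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            w = naive_svm_step(w, x, y)
        history.append(w.copy())
    return w, history

# Made-up points, already padded with a trailing 1 for the bias term.
X_demo = np.array([[1.0, 1.0, 1.0], [1.2, 0.8, 1.0],
                   [-1.0, -1.0, 1.0], [-0.9, -1.1, 1.0]])
Y_demo = np.array([1, 1, -1, -1])

w_final, history = train_with_history(X_demo, Y_demo, np.zeros(3), n_epochs=20)
print(len(history), w_final)
```

Each entry of `history` is a weight vector, so the separating line can be drawn for selected passes to show how the separation changes during training.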

In [None]:
"""
We want to implement a basic version of an SVM algorithm on a simple 2D dataset (???)
I don't think we're using our data sets here, just making a synthetic one...
need clusters, centers, X,Y, hyperplane. Use make_blobs in sklearn
"""
centers = [(1,1), (-1,-1)] # centers of the two clusters
cluster_std = [0.2, 0.2] # standard deviation of the two clusters
X, Y = make_blobs(n_samples=10, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1) # two clusters, 10 points in total.

Y = np.where(Y == 0, -1, 1) # "you will have to change them to be 1 and −1", np.where(condition, x, y)

# print(X)
# print('~~~~~~~~~')
# print(Y)
# print('~~~~~~~~~')
# plt.scatter(X[:, 0], X[:, 1], c = Y) # plotting x and y on scatter plot to observe
# plt.show()

X = np.hstack([X, np.ones((X.shape[0], 1))]) # "x  is the extended variable padded with 1s", bias term
# print(X) # new column of 1 values the length of X

''' y = (w^T)x, w is the vector of weights (slope), x is the data point, y is the label... I think'''
w = np.random.randn(3) # 3 for columns + bias term

def func(X, Y, w): # functional margin of a point: y * (w . x), positive when the point is on the correct side
  hyperplane = Y * np.dot(w, X)
  return hyperplane

lr = 0.1 # learning rate
fe = 0.9 # expand factor
fr = 1.1 # reduce factor
nmax = 100 # max iterations, I wasn't sure if this should be 100 or 1000

def svm(X, Y, w, lr, fe, fr, nmax):

  point_classification = np.zeros(len(X), dtype=bool) # seeing if the point is correctly classified at the end of loops

  for i in range(nmax):
    for j in range(len(X)):
      current_X = X[j]
      current_Y = Y[j]
      margin = func(current_X, current_Y, w) # set equal to hyperplane functions above
      if (margin > 0): # "if classified correctly then"
        point_classification[j] = True # point was correctly classified
        if (margin < 1): # "if Margin too small then"
          w = w + (lr * fr * current_Y * current_X) # "w ← w + lr · fr · yx"
        elif (margin > 1): # "else if Margin too wide then"
          w = fe * w # "w ← fe · w"
      else:
        w = w + (lr * current_Y * current_X) # "w ← w + lr · yx" (additive update, not a product)
        point_classification[j] = False # point was misclassified

  return w, point_classification # return the final weight vector and the final classification

''' Pick an arbitrary data point (x,y) and determine whether it is misclassified '''
test_point_w, test_classification = svm(X, Y, w, lr, fe, fr, nmax)

for l, classified in enumerate(test_classification): # I want to test all ten points instead of one for comparison
  current_point = X[l, :-1]
  point_class = Y[l] # label of the current point
  if classified:
      print(f"Our arbitrary data point {l} (X = {current_point}, Y = {point_class}) is classified correctly.")
  else:
      print(f"Our arbitrary data point {l} (X = {current_point}, Y = {point_class}) is misclassified.")

correct_points = X[test_classification == True]
misclassified_points = X[test_classification == False]
plt.scatter(correct_points[:, 0], correct_points[:, 1], c = 'blue', label = "Correctly Classified")
plt.scatter(misclassified_points[:, 0], misclassified_points[:, 1], c = 'red', label = "Misclassified")
x_vals = np.linspace(-3, 3, 100)
y_vals = (-test_point_w[0] * x_vals - test_point_w[2]) / test_point_w[1]  # Compute corresponding y values for the hyperplane equation
plt.plot(x_vals, y_vals, color='green', label='Hyperplane')
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('SVM Calculate Hyperplane and Point Classifications')
plt.legend()
plt.grid(alpha = 0.5) # making the grid lines a little transparent bc I think it looks nicer :)

'''
I'm guessing that there are some misclassified points because the testing data is so small.
The data points are fixed by random_state=1, but w is initialized randomly, so the hyperplane and the
  classifications change every time you run this code. If we put this in our report
  then I vote we choose a graph with more correctly classified points lol
'''

plt.show() 

## Part 2: Classification using support vector machine

In this step we will use the support vector machine module in sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), which also lets you use various types of kernels. You can then compare results with and without kernels. You are expected to:

1) Load and understand the dataset.
2) Implement SVM in sklearn for the multi-class classification.
3) Understand the functionality of various kernels for SVM and compare their performance for the problem. Quantify the accuracy of your classifications.
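The kernel comparison in step 3 could look roughly like the sketch below. It runs on synthetic blobs rather than the project dataset, so the accuracies it prints say nothing about the real problem; the train/test split, the scaling step, and the default `SVC` hyperparameters are all assumptions made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the real features (assumption).
X, y = make_blobs(n_samples=200, centers=[(2, 2), (-2, -2)],
                  cluster_std=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Train one SVC per kernel and record its accuracy on the held-out split.
accuracies = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = SVC(kernel=kernel)
    model.fit(X_train_s, y_train)
    accuracies[kernel] = accuracy_score(y_test, model.predict(X_test_s))

for kernel, acc in accuracies.items():
    print(f"{kernel}: {acc:.3f}")
```

Here the `linear` kernel plays the role of the "without kernel" baseline, and the other entries show how each nonlinear kernel compares on the same split.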

In [None]:
## code chunk

## Part 3: Classification Using Neural Network

In this step you will implement a neural network for classification. You are not required to build your own neural network and its training. Instead, you may use the Multilayer Perceptron Classifier in sklearn. More detailed information about the classification function can be found at https://scikit-learn.org/stable/modules/neural_networks_supervised.html and https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html. Alternatively, you could work with PyTorch (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html). In either case, we might want to walk through the samples given in these introductory pages and make sure they run as expected. Then read the sample code very carefully and understand each line and every single feature or parameter of the method before moving on to build our own implementation and applications. These features or parameters include:

1) Depth and width of the network.
2) Solver of the minimization problem.
3) Target function and strength of regularization.
4) Batch size.
5) Initial and adaptive learning rate.
6) Initialization of the network.
7) Momentum method.
8) Termination of the training.
9) Use of well-trained network for prediction on test dataset.
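As a rough guide, the items in the list above map onto `MLPClassifier` arguments as sketched below. The data are synthetic blobs and every parameter value is an arbitrary placeholder chosen for illustration, not a recommended setting for the project dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the real features (assumption).
X, y = make_blobs(n_samples=300, centers=[(2, 2), (-2, -2)],
                  cluster_std=0.8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)

clf = MLPClassifier(
    hidden_layer_sizes=(16, 8),  # 1) depth (two hidden layers) and width (16, 8)
    solver="sgd",                # 2) solver of the minimization problem
    alpha=1e-4,                  # 3) L2 regularization strength (the loss itself is log-loss)
    batch_size=32,               # 4) batch size
    learning_rate_init=0.01,     # 5) initial learning rate
    learning_rate="adaptive",    # 5) adaptive schedule (only used with solver="sgd")
    random_state=0,              # 6) seeds the weight initialization
    momentum=0.9,                # 7) momentum (only used with solver="sgd")
    max_iter=500,                # 8) termination: cap on the number of epochs
    tol=1e-4,                    # 8) termination: minimum improvement in the loss
    n_iter_no_change=10,         # 8) termination: patience before stopping
)
clf.fit(scaler.transform(X_train), y_train)

# 9) use the trained network for prediction on the held-out split
predictions = clf.predict(scaler.transform(X_test))
test_accuracy = clf.score(scaler.transform(X_test), y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

With `solver="adam"` the `momentum` and `learning_rate` arguments are ignored, which is why this sketch uses `"sgd"` to exercise every item in the list.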

Your datasets contain both train and test sets, so you will be able to quantify your classification.

At the end of these three parts, each group is expected to:

1) Describe the problem and your algorithm.
2) Describe your implementation of the algorithm, major steps, and key parameters.
3) Describe the training process. Quantify the accuracy of your classifications.
4) Wrap up the results and finish a project report with eight or more pages (excluding your code), including diagrams, tables, or figures.
5) Python code will be submitted separately.

In [None]:
## code chunk

In [None]:
## Caitlyn's test chunk