# Feature Selection

In this class, we're going to address the problem of WAY TOO MANY features.

Think Bag of Words for Fake News detection. Our vocabulary was WAY too big, which caused all kinds of problems!

So we're going to look at a few feature selection techniques and then you can try them out on the fake news data!

In [None]:
!brew install python

In [None]:
!pip3 install scikit-learn
!pip3 install matplotlib
!pip3 install pandas

In [None]:
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
print(X)
sel = VarianceThreshold(threshold=0.8*(1-0.8))
X_new = sel.fit_transform(X)
print('-'*60)
print(X_new)

## Removing Features with Low Variance

* Variance is a statistical measure of the spread of a set of data points: it is the average squared distance of each value from the mean. The larger the variance, the more the values vary across the dataset.
* Features with low variance might not contain much information. If a feature has almost the same value for all the samples in a dataset, it might not be very discriminative.
* VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by \\(p(1 - p)\\), so we can select using the threshold 0.8 * (1 - 0.8) = 0.16, which is the value passed to VarianceThreshold in the code cell above.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html


As expected, VarianceThreshold has removed the first column, which has a probability p=5/6 > 0.8 of containing a zero.
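
To see where that decision comes from, here is a small sketch that recomputes the per-column variance of the toy matrix with NumPy and compares it to the threshold. The variable names (`p`, `variances`, `keep`) are only for illustration; this mirrors the idea behind VarianceThreshold and is not its actual implementation.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

p = X.mean(axis=0)            # fraction of ones in each column
variances = p * (1 - p)       # Bernoulli variance p * (1 - p)
threshold = 0.8 * (1 - 0.8)   # 0.16

keep = variances > threshold  # True for columns whose variance exceeds the threshold
print(variances)
print(keep)
```

The first column has p = 1/6, so its variance is 5/36, which is below 0.16; the other two columns are kept.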


## Univariate Feature Selection

* Univariate feature selection is a statistical method used in machine learning and data analysis to select the most informative features from a dataset. It evaluates each feature individually to determine the strength of the relationship of the feature with the response variable. The main idea is to keep the best features based on certain criteria and discard the less informative ones.

Scikit-learn exposes feature selection routines as objects that implement the transform method:

- SelectKBest removes all but the k highest scoring features.
- SelectPercentile removes all but a user-specified highest scoring percentage of features.
- SelectFpr, SelectFdr, and SelectFwe apply a common univariate statistical test to each feature and select by false positive rate, false discovery rate, or family-wise error, respectively.
- GenericUnivariateSelect performs univariate feature selection with a configurable strategy. This makes it possible to pick the best univariate selection strategy with a hyper-parameter search estimator.
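
As a rough illustration of the selectors listed above, the sketch below runs SelectPercentile and GenericUnivariateSelect on the Iris data with the ANOVA F-score (f_classif). The choices `percentile=50` and `param=2` are arbitrary values picked for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectPercentile,
    f_classif,
)

X, y = load_iris(return_X_y=True)

# Keep the top 50% of the features, ranked by their F-score.
pct = SelectPercentile(f_classif, percentile=50).fit(X, y)
print(pct.scores_)                    # one score per feature
print(pct.get_support(indices=True))  # indices of the kept features

# Same idea through the generic interface: mode picks the strategy,
# param is the argument for that strategy (here k=2 for "k_best").
gen = GenericUnivariateSelect(f_classif, mode="k_best", param=2).fit(X, y)
print(gen.get_support(indices=True))
```

Both selectors should agree here, since half of four features and k=2 describe the same cut.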

### Iris Dataset
https://en.wikipedia.org/wiki/Iris_flower_data_set

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html


In [None]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)
print(X.shape)


X_new = SelectKBest(f_classif, k=3).fit_transform(X, y)

print(X_new.shape)

## L1-based Feature Selection

L1-based feature selection refers to the use of L1 regularization (also known as Lasso regularization) to induce sparsity in the coefficients of a linear model, which in turn can be used to select important features. When L1 regularization is applied, many feature coefficients become exactly zero, effectively excluding those features from the model. The features with non-zero coefficients are considered to be the selected or important features.

---
* L1 Regularization: In linear regression, the objective is to minimize the sum of squared residuals. With L1 regularization, an additional term is added to this objective: the sum of the absolute values of the coefficients (i.e., model weights), multiplied by a regularization parameter (often denoted as α or λ).

* The objective becomes: Minimize(sum of squared residuals + λ × sum of absolute values of coefficients)

* As λ increases, the penalty for non-zero coefficients also increases, pushing more coefficients to become exactly zero.
---
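
The effect of λ can be sketched with scikit-learn's Lasso on synthetic data. The dataset, the list of alpha values, and the `counts` dictionary are assumptions made for this example. Note that scikit-learn calls the parameter alpha and scales the squared-error term by 1/(2n), so the numbers are not directly comparable to the λ in the formula above, but the trend is the same.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, only 5 of which actually influence y.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

counts = {}
for alpha in [0.01, 1.0, 10.0, 1000.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    counts[alpha] = int(np.sum(model.coef_ != 0))  # coefficients not pushed to zero
    print(f"alpha={alpha}: {counts[alpha]} non-zero coefficients")
```

With a very large alpha the penalty dominates and every coefficient ends up at exactly zero.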

When the goal is to reduce the dimensionality of the data before using another classifier, linear models penalized with the L1 norm can be combined with SelectFromModel to keep only the features with non-zero coefficients. Sparse estimators useful for this purpose are Lasso for regression, and LogisticRegression and LinearSVC for classification:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel
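
Here is a minimal sketch of how SelectFromModel decides which features to keep for an L1-penalized LinearSVC. It assumes the behavior described in the scikit-learn documentation: for a 2D coef_, the importance of a feature is the L1 norm of its coefficients across classes, and the default threshold for L1-penalized estimators is 1e-5. The names `importance` and `mask` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
lsvc = LinearSVC(penalty="l1", C=0.01, dual=False, max_iter=10000).fit(X, y)

# One row of coefficients per class; sum the absolute values per feature.
importance = np.abs(lsvc.coef_).sum(axis=0)

selector = SelectFromModel(lsvc, prefit=True)
mask = selector.get_support()  # True for the features that are kept

print(importance)
print(mask)
```

A feature is dropped only when its coefficient is (numerically) zero for every class.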

A **Support Vector Machine (SVM)** is a supervised machine learning algorithm used for classification tasks (and sometimes for regression). It aims to find the best boundary (or hyperplane) that separates data points from different classes. Here’s a breakdown of how an SVM classifier works:

**Key Concepts**

**Hyperplane:**

* In SVM, the goal is to find a hyperplane that best divides a dataset into classes.

* In a 2D space, this hyperplane is a line, in 3D space, it's a plane, and in higher dimensions, it becomes a hyperplane.

For example, in a 2D feature space with two features \\(x_1\\) and \\(x_2\\), the decision boundary is defined by:

\\(w_1 ⋅ x_1 + w_2 ⋅ x_2 + b = 0\\)

where:

- \\(w_1\\) and \\(w_2\\) are the weights (or coefficients) for the features \\(x_1\\) and \\(x_2\\).
- \\(b\\) is the intercept (bias).
- The values of \\(x_1\\) and \\(x_2\\) determine the input data points.


* The best hyperplane is the one that maximizes the margin between the two classes.


**Margin:**

* The margin is the distance between the hyperplane and the nearest data points from each class. These nearest points are called support vectors.

* SVM maximizes this margin to improve the generalization ability of the model. The larger the margin, the better the separation between classes.
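
The decision boundary formula and the notion of distance to the hyperplane can be sketched in a few lines of NumPy. The weights, intercept, and points below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

w = np.array([1.0, -1.0])  # hypothetical weights w_1, w_2
b = 0.5                    # hypothetical intercept

points = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 0.5]])

scores = points @ w + b                         # w_1 * x_1 + w_2 * x_2 + b
sides = np.sign(scores)                         # +1 / -1: side of the boundary, 0: on it
distances = np.abs(scores) / np.linalg.norm(w)  # distance of each point to the hyperplane

print(scores)
print(sides)
print(distances)
```

The margin of a trained SVM is this distance evaluated at the support vectors.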

**Support Vectors:**

Support vectors are the data points that are closest to the hyperplane. These points are critical because the position of the hyperplane depends on them.
Even if other data points are changed, as long as the support vectors stay the same, the hyperplane remains unchanged.
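
A tiny experiment can illustrate this: fit a linear SVM on four points, then drop the two points that are not support vectors and fit again. The toy data and the very large C (used here to approximate a hard margin) are assumptions made for this sketch.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard margin on this separable toy set.
full = SVC(kernel="linear", C=1e6).fit(X, y)
print(full.support_vectors_)  # the points closest to the boundary

# Remove the two outer points (not support vectors) and refit.
reduced = SVC(kernel="linear", C=1e6).fit(X[1:3], y[1:3])

print(full.coef_, full.intercept_)
print(reduced.coef_, reduced.intercept_)
```

Both fits should give (up to numerical tolerance) the same hyperplane, because only the two inner points determine it.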

**Linear and Non-Linear Classification:**

Linear SVM: If the data is linearly separable, meaning you can separate the classes with a straight line or hyperplane, SVM works by finding that optimal hyperplane.

Non-Linear SVM: When data is not linearly separable, SVM uses the kernel trick to map the data into a higher-dimensional space where it becomes linearly separable. Popular kernels include:
* Polynomial kernel
* Radial Basis Function (RBF) kernel

**SVM Variants**

**LinearSVC:**

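This is scikit-learn's linear SVM classifier. It learns a linear decision boundary only (no kernel trick) and is faster on large datasets than a non-linear SVM. Its penalty parameter can be set to 'l1', which is what makes it usable for L1-based feature selection.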

**Soft Margin SVM:**

In real-world data, it's common to have some overlap between classes. Soft Margin SVM allows some misclassification of points but tries to minimize it. The C parameter controls this trade-off between maximizing the margin and minimizing the classification error.
* A large C value aims for fewer misclassifications but at the cost of a smaller margin.
* A small C value allows a larger margin but permits more misclassifications.
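
The trade-off controlled by C can be observed on overlapping data. In the sketch below the dataset (make_blobs with a large cluster_std) and the two C values are arbitrary choices for illustration; the margin width is computed as 2 / ||w|| from the fitted linear SVM.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

margins = {}
for C in [0.01, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(clf.coef_)  # margin width = 2 / ||w||
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```

The small C should report the wider margin, at the price of more points falling inside it.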

**Kernel SVM:**

If the data cannot be separated by a straight line, kernel methods (like RBF or polynomial) allow the SVM to classify non-linear data by transforming the input space into higher dimensions.
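
As a quick sketch of why kernels matter, the example below compares a linear and an RBF SVM on two concentric circles, a dataset that no straight line can separate. The dataset parameters are assumptions for this example, and the accuracies are measured on the training data only, so they illustrate expressiveness rather than generalization.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Inner circle = one class, outer circle = the other class.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel training accuracy: {linear_acc:.2f}")
print(f"RBF kernel training accuracy:    {rbf_acc:.2f}")
```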

In [None]:
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
print(X.shape)

lsvc = LinearSVC(penalty='l1', C=0.01, dual=False, max_iter=10000)
lsvc = lsvc.fit(X,y)

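print(lsvc.coef_)  # one row of weights per class (3 classes x 4 features)

# With an L1-penalized estimator, SelectFromModel keeps the features whose
# coefficient norm across classes is at least 1e-5 (its default threshold in this case).
model = SelectFromModel(lsvc, prefit=True)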
X_new = model.transform(X)

print(X_new.shape)

## Tree-based Feature Selection
* Tree-based feature selection leverages the structure and properties of tree-based machine learning models, such as decision trees, random forests, and gradient-boosted trees, to rank or select important features. These models inherently perform feature selection by choosing the best features to split on at each node of the tree.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn-ensemble-extratreesclassifier
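
To make the ranking idea concrete, the following sketch prints the Iris features sorted by importance and checks which ones SelectFromModel keeps. It assumes the documented default for estimators without an L1 penalty, where the threshold is the mean importance; `random_state=0` is added here only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = clf.feature_importances_  # one value per feature, summing to 1

# Print the features from most to least important.
for name, score in sorted(zip(iris.feature_names, importances),
                          key=lambda pair: -pair[1]):
    print(f"{name:20s} {score:.3f}")

# Features with importance at or above the mean are kept.
kept = SelectFromModel(clf, prefit=True).get_support()
print(kept)
```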


In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
print(X.shape)

clf = ExtraTreesClassifier(n_estimators=100).fit(X,y)
print(clf.feature_importances_)

X_new = SelectFromModel(clf, prefit=True).transform(X)

print(X_new.shape)

# Homework: Try Feature Selection on Fake News Dataset (Due next Monday)


Use feature selection techniques to reduce the size of the vocabulary! Try any of the above techniques or other ones you feel like: https://scikit-learn.org/stable/modules/feature_selection.html

After feature selection, use our classification models (decision tree, KNN, neural network) to classify the Fake/Real News data (use 80% for training and 20% for testing, as usual).
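
One possible shape for such an experiment is sketched below on a tiny made-up corpus. Everything in it is an assumption for illustration: the texts and labels are invented, CountVectorizer stands in for the hand-built bag of words used in this notebook, and chi2 with k=5 is just one of many scoring choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the news articles (1 = fake, 0 = real).
texts = [
    "shocking secret cure doctors hate",
    "you will not believe this shocking trick",
    "secret plot exposed by anonymous insider",
    "senate passes budget bill after debate",
    "central bank holds interest rates steady",
    "court rules on budget dispute after appeal",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag of words -> keep the 5 best-scoring words -> classifier.
pipe = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=5),
    DecisionTreeClassifier(random_state=0),
)
pipe.fit(texts, labels)

vocab_size = len(pipe.named_steps["countvectorizer"].vocabulary_)
n_selected = int(pipe.named_steps["selectkbest"].get_support().sum())
print(f"vocabulary: {vocab_size} words, selected: {n_selected}")
```

Putting the selector inside a pipeline means it is fitted only on the data passed to fit, which keeps the test split out of the selection step.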

## data generation

In [None]:
def clean_text(text):
  # lower case
  text = text.lower()
  # initializing punctuations string
  punctuations = r'''1234567890!@#$%^&*()-=_+[]{}\|;"':,./<>?`~“”’‘'''  # raw triple-quoted string, so the backslash is kept literally
  for element in punctuations:
    text = text.replace(element, '')
  # split into words
  text=text.strip().split()

  # remove links:
  text = [x for x in text if 'www' not in x] # short-way of looping operations for list data
  text = [x for x in text if 'http' not in x]

  return text

In [None]:
# load the fake and real news articles

import pandas

folder_path = '/workspaces/CS-345_UNR_FA2024/Homework 9/' # dataset folder path (adjust to your environment)
news_folder_path = folder_path + 'news_dataset/'

df_fake = pandas.read_csv(news_folder_path + "Fake.csv")['text']
df_real = pandas.read_csv(news_folder_path + "True.csv")['text']
print('number of fake news samples: {}'.format(len(df_fake)))
print('number of true news samples: {}'.format(len(df_real)))


In [None]:
# Get vocab words for Fake
word_dict = {}
for text in df_fake:
  text = clean_text(text)
  for word in text:
    # the first occurrence starts the count at 1
    word_dict[word] = word_dict.get(word, 0) + 1
# Get vocab words for Real
for text in df_real:
  text = clean_text(text)
  for word in text:
    word_dict[word] = word_dict.get(word, 0) + 1

# Remove words that occur min_thresh times or fewer, or more than max_thresh times.
# We still need to shorten the feature dimension manually, as Google Colab cannot deal with this large feature space.
vocab = list(word_dict)
print("Vocabulary Length Before Min/Max Removal:", len(vocab))

min_thresh = 100
max_thresh = 5000
for word in vocab:
  if word_dict[word] <= min_thresh or word_dict[word] > max_thresh:
    word_dict.pop(word)

vocab = list(word_dict)
print("Vocabulary Length After Min/Max Removal:", len(vocab))

In [None]:
from collections import defaultdict

# Function to process the texts and generate feature vectors
def process_texts(texts, label):
    X_data = []  # To store feature vectors
    y_data = []  # To store labels

    # Process each text
    for text in texts:
        # Clean the text
        text = clean_text(text)

        # Use defaultdict to avoid checking if word exists
        article_dict = defaultdict(int)

        # Count word occurrences
        for word in text:
            article_dict[word] += 1

        # Turn the count dictionary into a list of values corresponding to the vocabulary
        article_list = [article_dict[word] for word in vocab]

        # Append the feature vector and the label
        X_data.append(article_list)
        y_data.append(label)

    return X_data, y_data

# Initialize data containers
X_all = []
y_all = []

# Process both fake and real news texts and collect the data
X_fake, y_fake = process_texts(df_fake, 1)  # Label 1 for fake news
X_real, y_real = process_texts(df_real, 0)  # Label 0 for real news

# Combine the data
X_all.extend(X_fake)
X_all.extend(X_real)
y_all.extend(y_fake)
y_all.extend(y_real)

# At this point, X_all contains the feature vectors, and y_all contains the corresponding labels

In [None]:
# Convert X_all and y_all to appropriate formats if necessary (e.g., NumPy arrays)
import numpy as np
X_all = np.array(X_all)
y_all = np.array(y_all)

# Verify the shape of the dataset
print(f"Shape of feature dataset: {X_all.shape}")
print(f"Shape of labels: {y_all.shape}")


## without feature selection

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

scaler = StandardScaler()

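# Standardize the whole dataset.
# Note: fitting the scaler before the split lets test-set statistics leak into training;
# fitting it on X_train only would be the stricter setup.
X_all_scaled = scaler.fit_transform(X_all)

# Split the data into training (80%) and testing (20%)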

X_train, X_test, y_train, y_test = train_test_split(X_all_scaled, y_all, test_size=0.2, random_state=42)

# Verify the shape of the data
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")


# model training
nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5,2), batch_size=128, max_iter=1000, random_state=1)

nn = nn.fit(X_train, y_train)
print('training loss', nn.best_loss_)
plt.plot(nn.loss_curve_, '-.', color='red', label='nn 2')
plt.ylabel('training loss')
plt.xlabel('# of iterations')

# testing (X_test was already standardized before the split, so it is not scaled again)
y_pred = nn.predict(X_test)
print('testing accuracy', accuracy_score(y_test, y_pred))

In [None]:
# testing during the training

import numpy as np
from sklearn.metrics import log_loss

nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5, 2), batch_size=128,
                   max_iter=1, warm_start=True, random_state=1)

# Lists to store training and testing accuracy
test_accuracies = []
train_accuracies = []

# Lists to store training and testing loss
train_losses = []
test_losses = []

# Train the model for multiple iterations
for i in range(100):  # 100 iterations

    # Incremental Training: partial_fit allows you to
    # incrementally fit the model and track the performance after each iteration.

    if i == 0:
        nn.partial_fit(X_train, y_train, classes=np.unique(y_train))
    else:
        nn.partial_fit(X_train, y_train)

    # Append the current training loss (nn.loss_)
    train_losses.append(nn.loss_)

    # Evaluate on training set
    train_pred = nn.predict(X_train)
    train_acc = accuracy_score(y_train, train_pred)
    train_accuracies.append(train_acc)

    # Evaluate on test set
    test_pred = nn.predict(X_test)
    test_acc = accuracy_score(y_test, test_pred)
    test_accuracies.append(test_acc)

    y_test_prob = nn.predict_proba(X_test)  # Get predicted probabilities
    test_loss = log_loss(y_test, y_test_prob)  # Compute the log loss for the test set
    test_losses.append(test_loss)

In [None]:
# Plot the test accuracy over iterations
plt.plot(test_accuracies, label="Test Accuracy")
plt.plot(train_accuracies, label="Train Accuracy")
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Test and Train Accuracy over Iterations')
plt.legend()
plt.show()

# Plot the training and testing loss over iterations
plt.plot(train_losses, label="Training Loss")
plt.plot(test_losses, label="Testing Loss")
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training and Testing Loss over Iterations')
plt.legend()
plt.show()

## with feature selection

In [None]:
## feature selection
import pandas
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# scaler = StandardScaler()
# X_all_scaled = scaler.fit_transform(X_all)

lsvc = LinearSVC(penalty='l1', C=0.01, dual=False, max_iter=10000)
lsvc = lsvc.fit(X_all_scaled,y_all)

# L1-penalized estimator: features whose |coefficient| is below the default threshold of 1e-5 are dropped.
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X_all_scaled)

print('shape of new data features: {}'.format(X_new.shape))


In [None]:
# Split the data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X_new, y_all, test_size=0.2, random_state=42)

# Verify the shape of the data
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")


# model training
nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5,2), batch_size=128, max_iter=1000, random_state=1)

nn = nn.fit(X_train, y_train)
print(nn.best_loss_)
plt.plot(nn.loss_curve_, '-.', color='red', label='nn 2')
plt.ylabel('training loss')
plt.xlabel('# of iterations')

# testing after training
y_pred = nn.predict(X_test)
print('testing accuracy', accuracy_score(y_test, y_pred))

### testing during the training

In [52]:
# testing during the training

import numpy as np
from sklearn.metrics import log_loss

nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5, 2), batch_size=128,
                   max_iter=1, warm_start=True, random_state=1)

# Lists to store training and testing accuracy
test_accuracies = []
train_accuracies = []

# Lists to store training and testing loss
train_losses = []
test_losses = []

# Train the model for multiple iterations
for i in range(100):  # 100 iterations

    # Incremental Training: partial_fit allows you to
    # incrementally fit the model and track the performance after each iteration.

    if i == 0:
        nn.partial_fit(X_train, y_train, classes=np.unique(y_train))
    else:
        nn.partial_fit(X_train, y_train)

    # Append the current training loss (nn.loss_)
    train_losses.append(nn.loss_)

    # Evaluate on training set
    train_pred = nn.predict(X_train)
    train_acc = accuracy_score(y_train, train_pred)
    train_accuracies.append(train_acc)

    # Evaluate on test set
    test_pred = nn.predict(X_test)
    test_acc = accuracy_score(y_test, test_pred)
    test_accuracies.append(test_acc)

    y_test_prob = nn.predict_proba(X_test)  # Get predicted probabilities
    test_loss = log_loss(y_test, y_test_prob)  # Compute the log loss for the test set
    test_losses.append(test_loss)

In [None]:
# Plot the test accuracy over iterations
plt.plot(test_accuracies, label="Test Accuracy")
plt.plot(train_accuracies, label="Train Accuracy")
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Test and Train Accuracy over Iterations')
plt.legend()
plt.show()

# Plot the training and testing loss over iterations
plt.plot(train_losses, label="Training Loss")
plt.plot(test_losses, label="Testing Loss")
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training and Testing Loss over Iterations')
plt.legend()
plt.show()

## feature selection without standardization

In [None]:
## feature selection without standardization
lsvc = LinearSVC(penalty='l1', C=0.01, dual=False, max_iter=10000)
lsvc = lsvc.fit(X_all,y_all)  # fit the selector on the raw (unscaled) counts

# The mask learned on the raw counts is applied to the standardized features,
# so the network below still trains on scaled inputs.
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X_all_scaled)

print('shape of new data features: {}'.format(X_new.shape))


In [None]:
# Split the data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X_new, y_all, test_size=0.2, random_state=42)

# Verify the shape of the data
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")


# model training
nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5,2), batch_size=128, max_iter=1000, random_state=1)

nn = nn.fit(X_train, y_train)
print(nn.best_loss_)
plt.plot(nn.loss_curve_, '-.', color='red', label='nn 2')
plt.ylabel('training loss')
plt.xlabel('# of iterations')

# testing after training
y_pred = nn.predict(X_test)
print('testing accuracy', accuracy_score(y_test, y_pred))

In [56]:
# testing during the training

import numpy as np
from sklearn.metrics import log_loss

nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5, 2), batch_size=128,
                   max_iter=1, warm_start=True, random_state=1)

# Lists to store training and testing accuracy
test_accuracies = []
train_accuracies = []

# Lists to store training and testing loss
train_losses = []
test_losses = []

# Train the model for multiple iterations
for i in range(100):  # 100 iterations

    # Incremental Training: partial_fit allows you to
    # incrementally fit the model and track the performance after each iteration.

    if i == 0:
        nn.partial_fit(X_train, y_train, classes=np.unique(y_train))
    else:
        nn.partial_fit(X_train, y_train)

    # Append the current training loss (nn.loss_)
    train_losses.append(nn.loss_)

    # Evaluate on training set
    train_pred = nn.predict(X_train)
    train_acc = accuracy_score(y_train, train_pred)
    train_accuracies.append(train_acc)

    # Evaluate on test set
    test_pred = nn.predict(X_test)
    test_acc = accuracy_score(y_test, test_pred)
    test_accuracies.append(test_acc)

    y_test_prob = nn.predict_proba(X_test)  # Get predicted probabilities
    test_loss = log_loss(y_test, y_test_prob)  # Compute the log loss for the test set
    test_losses.append(test_loss)

In [None]:
# Plot the test accuracy over iterations
plt.plot(test_accuracies, label="Test Accuracy")
plt.plot(train_accuracies, label="Train Accuracy")
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Test and Train Accuracy over Iterations')
plt.legend()
plt.show()

# Plot the training and testing loss over iterations
plt.plot(train_losses, label="Training Loss")
plt.plot(test_losses, label="Testing Loss")
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training and Testing Loss over Iterations')
plt.legend()
plt.show()

## how to improve it? A better feature selection method?

In [None]:
# Homework: try other feature selection methods and re-train the NN model
# to see which feature selection works best on this dataset.

# Univariate feature selection on the fake news features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

print(X_all_scaled.shape)

# k is a free choice (500 is only an example); cap it at the number of available features
k = min(500, X_all_scaled.shape[1])
X_new = SelectKBest(f_classif, k=k).fit_transform(X_all_scaled, y_all)

print(X_new.shape)

In [None]:
# Split the data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X_new, y_all, test_size=0.2, random_state=42)

# model training
nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5,2), batch_size=128, max_iter=1000, random_state=1)

nn = nn.fit(X_train, y_train)
print(nn.best_loss_)
plt.plot(nn.loss_curve_, '-.', color='red', label='nn 2')
plt.ylabel('training loss')
plt.xlabel('# of iterations')

# testing after training
y_pred = nn.predict(X_test)
print('testing accuracy', accuracy_score(y_test, y_pred))

In [None]:
# testing during the training

import numpy as np
from sklearn.metrics import log_loss

nn = MLPClassifier(solver='adam', hidden_layer_sizes=(5, 2), batch_size=128,
                   max_iter=1, warm_start=True, random_state=1)

# Lists to store training and testing accuracy
test_accuracies = []
train_accuracies = []

# Lists to store training and testing loss
train_losses = []
test_losses = []

# Train the model for multiple iterations
for i in range(100):  # 100 iterations

    # Incremental Training: partial_fit allows you to
    # incrementally fit the model and track the performance after each iteration.

    if i == 0:
        nn.partial_fit(X_train, y_train, classes=np.unique(y_train))
    else:
        nn.partial_fit(X_train, y_train)

    # Append the current training loss (nn.loss_)
    train_losses.append(nn.loss_)

    # Evaluate on training set
    train_pred = nn.predict(X_train)
    train_acc = accuracy_score(y_train, train_pred)
    train_accuracies.append(train_acc)

    # Evaluate on test set
    test_pred = nn.predict(X_test)
    test_acc = accuracy_score(y_test, test_pred)
    test_accuracies.append(test_acc)

    y_test_prob = nn.predict_proba(X_test)  # Get predicted probabilities
    test_loss = log_loss(y_test, y_test_prob)  # Compute the log loss for the test set
    test_losses.append(test_loss)

In [None]:
# Plot the test accuracy over iterations
plt.plot(test_accuracies, label="Test Accuracy")
plt.plot(train_accuracies, label="Train Accuracy")
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Test and Train Accuracy over Iterations')
plt.legend()
plt.show()

# Plot the training and testing loss over iterations
plt.plot(train_losses, label="Training Loss")
plt.plot(test_losses, label="Testing Loss")
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training and Testing Loss over Iterations')
plt.legend()
plt.show()