Task 1 \\
Implement a Python class named LinearSVC which learns a linear Support Vector Classifier (SVC) from a set of training data. The class is required to have the following components: \\


*   A constructor `__init__` which initializes an SVC using the given learning rate, number of epochs, and random seed (similar to the perceptron class in our textbook).
*   A training function `fit` which trains the SVC on a given labeled dataset. We consider the soft-margin SVC with a hinge loss. You are required to integrate L2 regularization and expose the regularization parameter to users.
*   A function `net_input` which computes the preactivation value for a given input sample.
*   A function `predict` which generates the prediction for a given input sample.
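
For reference, one common way to write the soft-margin objective with an L2 penalty is sketched below. The symbol $\lambda$ for the regularization strength is an assumption introduced here for illustration; the task only requires that such a parameter be exposed. Labels are assumed to be $y_i \in \{-1, 1\}$ and $\eta$ denotes the learning rate.

$$
J(\mathbf{w}, b) = \frac{\lambda}{2}\lVert \mathbf{w} \rVert^2 + \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\bigr)
$$

For a single sample $(\mathbf{x}_i, y_i)$ with per-sample objective $J_i = \frac{\lambda}{2}\lVert \mathbf{w} \rVert^2 + \max\bigl(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\bigr)$, a subgradient is

$$
\nabla_{\mathbf{w}} J_i =
\begin{cases}
\lambda \mathbf{w} - y_i \mathbf{x}_i & \text{if } y_i(\mathbf{w}^T\mathbf{x}_i + b) < 1 \\
\lambda \mathbf{w} & \text{otherwise}
\end{cases}
\qquad
\nabla_{b} J_i =
\begin{cases}
-y_i & \text{if } y_i(\mathbf{w}^T\mathbf{x}_i + b) < 1 \\
0 & \text{otherwise}
\end{cases}
$$

and a stochastic update step is $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} J_i$, $b \leftarrow b - \eta \nabla_{b} J_i$.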





In [None]:
import numpy as np

class LinearSVC:
  """Linear soft-margin Support Vector Classifier trained with per-sample
  subgradient descent on the hinge loss.

  Parameters
  ----------
  eta : float
    Learning rate.
  n_iter : int
    Number of epochs (passes over the training data).
  random_state : int
    Seed for the random weight initialization.
  """
  def __init__(self, eta=0.001, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

  def fit(self, X, y, C=1):
      """Fit training data.

      Trains a soft-margin SVC with hinge loss and L2 regularization.

      Parameters
      ----------
      X : {array-like}, shape = [n_examples, n_features]
        Training vectors, where n_examples is the number of
        examples and n_features is the number of features.
      y : array-like, shape = [n_examples]
        Target values; labels must be -1 or 1.
      C : float
        Regularization parameter, should be greater than 0. In this
        version C scales the L2 penalty on the weights, so a larger C
        means stronger regularization.

      Returns
      -------
      self : object

      """
      rgen = np.random.RandomState(self.random_state)
      self.w_ = rgen.normal(loc=0.0, scale=0.01,
                            size=X.shape[1])
      self.b_ = np.float64(0.)  # np.float_ was removed in NumPy 2.0
      self.errors_ = []

      for _ in range(self.n_iter):
          errors = 0
          for xi, target in zip(X, y):

              # Predicted score (wx+b)
              predicted_score = np.dot(self.w_ ,xi) + self.b_

              # Hinge Loss
              hinge_loss = max(0, 1 - target * predicted_score)
              if hinge_loss > 0:
                # Gradients w.r.t weights and bias
                dw = -target * xi + C * self.w_
                db = -target
                errors += 1
              else:
                dw = C * self.w_
                db = 0
              # Updating weights with L2-regularization and bias
              self.w_ -= self.eta * dw
              self.b_ -= self.eta * db

          self.errors_.append(errors)
      return self

  def net_input(self, X):
    """Compute the preactivation value w^T x + b for the input sample(s)."""
    return np.dot(X, self.w_) + self.b_

  def predict(self, X):
    """Return the predicted class label (1 or -1) for the input sample(s)."""
    return np.where(self.net_input(X) >= 0.0, 1, -1)

In [None]:
import numpy as np

class LinearSVC:
  """Linear soft-margin Support Vector Classifier trained with per-sample
  subgradient descent on the hinge loss.

  Parameters
  ----------
  eta : float
    Learning rate.
  n_iter : int
    Number of epochs (passes over the training data).
  random_state : int
    Seed for the random weight initialization.
  """
  def __init__(self, eta=0.001, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

  def fit(self, X, y, C=1):
      """Fit training data.

      Trains a soft-margin SVC with hinge loss and L2 regularization.

      Parameters
      ----------
      X : {array-like}, shape = [n_examples, n_features]
        Training vectors, where n_examples is the number of
        examples and n_features is the number of features.
      y : array-like, shape = [n_examples]
        Target values; labels must be -1 or 1.
      C : float
        Regularization parameter, should be greater than 0. In this
        version C scales the hinge-loss term, so a larger C means
        weaker regularization relative to the data term.

      Returns
      -------
      self : object

      """
      n_samples = X.shape[0]
      rgen = np.random.RandomState(self.random_state)
      self.w_ = rgen.normal(loc=0.0, scale=0.01,
                            size=X.shape[1])
      self.b_ = np.float64(0.)  # np.float_ was removed in NumPy 2.0
      self.errors_ = []

      for _ in range(self.n_iter):
          errors = 0
          for xi, target in zip(X, y):

              # Margin (y'*(wx+b))
              margin = target * (np.dot(self.w_ ,xi) + self.b_)

              # Hinge Loss
              # hinge_loss = max(0, 1 - margin)
              if margin < 1:
                # Gradients w.r.t weights and bias
                dw = self.w_/n_samples - C * target * xi
                db = -C * target
                errors += 1
              else:
                dw = self.w_ / n_samples
                db = 0.0
              # Updating weights and bias
              self.w_ -= self.eta * dw
              self.b_ -= self.eta * db
          self.errors_.append(errors)
      return self

  def net_input(self, X):
    """Compute the preactivation value w^T x + b for the input sample(s)."""
    return np.dot(X, self.w_) + self.b_

  def predict(self, X):
    """Return the predicted class label (1 or -1) for the input sample(s)."""
    return np.where(self.net_input(X) >= 0.0, 1, -1)
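
The `errors_` list above counts margin violations per epoch. As a complementary check, the regularized objective itself can be tracked. The following is a minimal sketch, assuming labels in {-1, 1} and a mean-hinge-loss formulation; the helper name `svc_objective` and the parameter `lam` (L2 strength) are hypothetical and not part of the class above.

```python
import numpy as np

def svc_objective(w, b, X, y, lam=1.0):
    """Mean hinge loss plus (lam / 2) * ||w||^2 for labels y in {-1, 1}.

    `lam` is an assumed name for the L2 regularization strength.
    """
    margins = y * (X @ w + b)                # y_i * (w^T x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # zero once the margin reaches 1
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# Tiny hand-checkable example: only the third sample violates the margin
X_demo = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y_demo = np.array([1, -1, 1])
w_demo = np.array([1.0, 0.0])
print(svc_objective(w_demo, 0.0, X_demo, y_demo, lam=1.0))
```

If this value is evaluated after each epoch, it should generally trend downward for a sufficiently small learning rate, although per-sample updates can make it fluctuate.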

Task 2 \\
Write a Python function `make_classification` which generates a set of linearly separable data
based on a random separation hyperplane. We learned that a $(d-1)$-dimensional hyperplane can be defined
as the set of points in the Euclidean space $\mathbb{R}^d$ satisfying an equation $\bar{a}^T \bar{x} = b$, i.e., $\{\bar{x} \in \mathbb{R}^d \mid \bar{a}^T \bar{x} = b\}$. For
simplicity, we assume that $b = 0$, so the hyperplane is determined by a random vector $\bar{a}$. We use this
idea to design the following algorithm to generate random data which are linearly separable:

*   Step 1. Randomly generate a $d$-dimensional vector $\bar{a}$.
*   Step 2. Randomly select $n$ samples $\bar{x}_1, \ldots, \bar{x}_n$ in the range $[-u, u]$ in each dimension. You may use a uniform or Gaussian distribution to do so.
*   Step 3. Give each $\bar{x}_i$ a label $y_i$ such that $y_i = -1$ if $\bar{a}^T \bar{x}_i < 0$, and $y_i = 1$ otherwise.

Therefore, your function should have the following parameters, which should be given by the user: d, n, u, and a
random seed for reproducing the data. You additionally need to split the dataset into a training dataset
(70%) and a test dataset (30%). You may use the scikit-learn function to do so, but make sure that you
specify the random seed so that the split is reproducible.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

def make_classification(n, d, u=1.0, random_seed=1):
    """
    Generates a set of linearly separable data based on a random separation hyperplane
    and splits it into a training set (70%) and a test set (30%).

    Parameters:
    - n: Number of samples to generate.
    - d: Dimensionality of the data.
    - u: Range for generating random samples in each dimension (default is 1.0).
    - random_seed: Seed for data generation and for the train/test split (default is 1).

    Returns (in this order):
    - X_test, X_train: Test and training samples.
    - y_train, y_test: Training and test labels (-1 or 1).
    - X: A numpy array of shape (n, d) containing all generated samples.
    - y: A numpy array of shape (n,) containing the labels (-1 or 1) for each sample.
    """

    # Set random seed for reproducibility
    np.random.seed(random_seed)

    # Step 1: Randomly generate a d-dimensional vector a
    a = np.random.randn(d)

    # Step 2: Randomly select n samples in the range [-u, u] in each dimension
    X = np.random.uniform(low=-u, high=u, size=(n, d))

    # Step 3: Assign labels based on the position relative to the hyperplane
    y = np.where(np.dot(X, a) < 0, -1, 1)

    # Split into training (70%) and test (30%) data, reusing the user's seed
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)

    # Note the order: X_test comes before X_train
    return X_test, X_train, y_train, y_test, X, y

# Example usage:
n = 100  # Number of samples
d = 2    # Dimensionality
u = 5.0  # Range for generating samples

X_test, X_train, y_train, y_test,X, y = make_classification(n, d, u, random_seed=1)

def export_data_numpy(X_train, X_test, y_train, y_test, X, y, filename="dataset"):
    """Save the train/test splits and the full dataset as CSV files."""
    np.savetxt(f"{filename}_X_train.csv", X_train, delimiter=",")
    np.savetxt(f"{filename}_X_test.csv", X_test, delimiter=",")
    np.savetxt(f"{filename}_y_train.csv", y_train, delimiter=",")
    np.savetxt(f"{filename}_y_test.csv", y_test, delimiter=",")

    # Export full dataset (X, y), passed explicitly instead of read from globals
    np.savetxt(f"{filename}_X_full.csv", X, delimiter=",")
    np.savetxt(f"{filename}_y_full.csv", y, delimiter=",")

    print('Exported Dataset')

# Example usage:
export_data_numpy(X_train, X_test, y_train, y_test, X, y)


Exported Dataset
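
Files written with `np.savetxt` are plain text, so the matching reader is `np.loadtxt`; `np.load` is meant for NumPy's binary `.npy`/`.npz` files. The following is a minimal round-trip sketch using a temporary directory and small made-up arrays (the `demo_*` names are hypothetical and unrelated to the exported dataset files).

```python
import os
import tempfile
import numpy as np

# Small made-up arrays standing in for a feature matrix and label vector
X_demo = np.array([[0.5, -1.25], [3.0, 4.0]])
y_demo = np.array([1, -1])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "demo_X.csv")
    y_path = os.path.join(tmp_dir, "demo_y.csv")

    # Write as comma-separated text; labels are stored as integers
    np.savetxt(x_path, X_demo, delimiter=",")
    np.savetxt(y_path, y_demo, delimiter=",", fmt="%d")

    # Read the text files back with the matching text reader
    X_back = np.loadtxt(x_path, delimiter=",")
    y_back = np.loadtxt(y_path, delimiter=",", dtype=int)

print(X_back.shape, y_back)
```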


Task 3 \\
Investigate the scalability of the LinearSVC class you have implemented. You may consider datasets built from combinations of the following scales: d = 10, 50, 100, 500, 1000 and n = 500, 1000, 5000,
10000, 100000. Please feel free to adjust the scales according to your computer's configuration; however, the time costs should be clearly different. Make sure that you use the same dataset for each combination.
This can be controlled by using the same random seed (see textbook).

In [None]:
import time

# Define the combinations of n (samples) and d (features)
n_values = [500, 1000, 5000, 10000, 100000]
d_values = [10, 50, 100, 500, 1000]

# Store results
results = []

# Iterate over all combinations of n and d
for nv in n_values:
    for dv in d_values:
        # Generate a linearly separable dataset of this size; the fixed seed
        # makes the dataset for each (n, d) combination reproducible
        X_test, X_train, y_train, y_test, _, _ = make_classification(nv, dv, u=5.0, random_seed=1)

        # Initialize LinearSVC. n_iter is kept small because each epoch is a
        # pure-Python loop over all training samples; adjust to your machine
        model = LinearSVC(eta=0.01, n_iter=10, random_state=1)

        # Measure training time on the training split only
        start_time = time.perf_counter()
        model.fit(X_train, y_train, C=1)
        end_time = time.perf_counter()

        execution_time = end_time - start_time

        y_pred = model.predict(X_test)
        accuracy = np.mean(y_pred == y_test)

        # Store results
        results.append((nv, dv, execution_time, accuracy))

# Print results in a table
print("\nResults:")
print("n\t\td\t\tTraining Time (s)\tTest Accuracy")
print("-------------------------------------------------------------")
for result in results:
    print(f"{result[0]}\t\t{result[1]}\t\t{result[2]:.4f}\t\t\t{result[3]:.3f}")
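
When interpreting the timings, note that much of the cost of a per-sample training loop comes from Python-level iteration. As a point of comparison only, the sketch below times a vectorized full-batch subgradient step on a small synthetic dataset. This is a different technique from the per-sample updates in `LinearSVC`, and the names `batch_subgradient_epoch` and `lam` are hypothetical, introduced here for illustration.

```python
import time
import numpy as np

def batch_subgradient_epoch(w, b, X, y, eta=0.1, lam=0.01):
    """One full-batch subgradient step on mean hinge loss + (lam / 2) * ||w||^2."""
    margins = y * (X @ w + b)
    active = margins < 1.0                     # samples violating the margin
    # Sum of y_i * x_i over the active samples, averaged over all samples
    grad_w = lam * w - (X[active].T @ y[active]) / len(y)
    grad_b = -y[active].sum() / len(y)
    return w - eta * grad_w, b - eta * grad_b

# Small synthetic, linearly separable dataset (sizes chosen arbitrarily)
rng = np.random.default_rng(1)
a_true = rng.normal(size=20)
X_demo = rng.uniform(-1.0, 1.0, size=(2000, 20))
y_demo = np.where(X_demo @ a_true < 0, -1, 1)

w, b = np.zeros(20), 0.0
start = time.perf_counter()
for _ in range(200):
    w, b = batch_subgradient_epoch(w, b, X_demo, y_demo)
elapsed = time.perf_counter() - start

accuracy = np.mean(np.where(X_demo @ w + b >= 0.0, 1, -1) == y_demo)
print(f"200 vectorized epochs: {elapsed:.4f} s, training accuracy {accuracy:.3f}")
```

Because each step here uses all samples at once, its timings are not directly comparable epoch for epoch with the per-sample loop; it mainly illustrates how the cost per pass scales with n and d once the inner loop is delegated to NumPy.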