# Homework 2

Note: Feel free to add any imports you need in the cells below.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Fix the seed and the random state
seed=42
random_state=42
np.random.seed(seed)

# Cubics (5 Points)



In this task, our goal is to construct a **classifier** that implements **logistic regression** to **predict the nature of the roots** of a **cubic equation** of the form:

$$
x^3 + p x + q = 0
$$

Note that the $x^2$ term is omitted **for simplicity**: this reduces the number of parameters to two while still allowing the equation to exhibit a rich variety of root structures.
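As a brief justification, here is the standard substitution that removes the quadratic term from a general monic cubic. The coefficient names $a_2, a_1, a_0$ and the variable $t$ are introduced here only for illustration and do not appear elsewhere in the assignment.

$$
\begin{aligned}
x^3 + a_2 x^2 + a_1 x + a_0 &= 0, \qquad x = t - \frac{a_2}{3} \\
\Longrightarrow \quad t^3 + p\,t + q &= 0, \qquad p = a_1 - \frac{a_2^2}{3}, \quad q = \frac{2 a_2^3}{27} - \frac{a_1 a_2}{3} + a_0
\end{aligned}
$$

Since the shift $x = t - a_2/3$ is real, it does not change how many roots are real, so restricting attention to the form without an $x^2$ term loses no generality for this classification task.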

Specifically, we aim to classify each equation into one of two categories based on the nature of its roots:

- **Three distinct real roots**
- **One real root and a pair of complex conjugate roots**

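This classification is determined by the sign of the **discriminant** $\Delta = -4p^3 - 27q^2$ of the cubic equation: there are three distinct real roots when $\Delta > 0$, and one real root plus a pair of complex conjugate roots when $\Delta < 0$. Our objective is to train a model that learns the relationship between the coefficients $p$ and $q$ and the corresponding root structure, effectively learning to **rediscover the discriminant-based decision boundary** through data.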
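The link between the sign of the discriminant and the number of real roots can be checked numerically. The sketch below is only an illustration: the helper names `count_real_roots` and `discriminant`, the tolerance `tol`, and the example coefficient pairs are assumptions made here, not part of the assignment.

```python
import numpy as np

def count_real_roots(p, q, tol=1e-9):
    """Count the real roots of x^3 + p x + q = 0 using numerical root finding."""
    roots = np.roots([1.0, 0.0, p, q])
    # A root is treated as real when its imaginary part is negligible.
    return int(np.sum(np.abs(roots.imag) < tol))

def discriminant(p, q):
    """Discriminant of the cubic x^3 + p x + q."""
    return -4.0 * p**3 - 27.0 * q**2

# Hypothetical example coefficients chosen for illustration.
examples = [(-3.0, 1.0), (3.0, 1.0), (-7.0, 6.0), (1.0, -2.0)]
for p_val, q_val in examples:
    delta = discriminant(p_val, q_val)
    n_real = count_real_roots(p_val, q_val)
    print(f"p={p_val:+.1f}, q={q_val:+.1f}: discriminant={delta:+.1f}, real roots={n_real}")
```

In each example, a positive discriminant should coincide with three real roots and a negative one with a single real root.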

In [None]:
# Generate dataset
n_samples = 10000
MAX_VAL = 10
p = np.random.uniform(-MAX_VAL, MAX_VAL, n_samples)
q = np.random.uniform(-MAX_VAL, MAX_VAL, n_samples)
# x^3 + p x + q = 0


# Calculate the discriminant of x^3 + p x + q (vectorized over all samples)
delta_dep = -4 * p**3 - 27 * q**2

# Label data: 1 for three distinct real roots (discriminant > 0), 0 for one real root
labels_dep = np.where(delta_dep > 0, 1, 0)

# Create dataframe
df_dep = pd.DataFrame({'p': p, 'q': q, 'label': labels_dep})

# Prepare data
X_dep = df_dep[['p', 'q']]
y_dep = df_dep['label']

In [None]:
# Count the occurrences of each label
label_counts = y_dep.value_counts()

# Create the pie chart
plt.figure(figsize=(6, 6))
plt.pie(label_counts, labels=label_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Labels (y)')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

In [None]:
# accuracy_score is used in the evaluation loop below
from sklearn.metrics import accuracy_score

# Split data into train and test sets
X_train, X_test, y_train, y_test = #<YOUR CODE HERE>

# Initialize lists to store accuracies and polynomial degrees
degrees = range(1, 6)
accuracies = []

# Train and evaluate polynomial logistic regression models for different degrees
for degree in degrees:
    #<YOUR CODE HERE>
    # Try to use make_pipeline


    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f'Degree: {degree}, Accuracy: {accuracy}')

# Plot the results
plt.plot(degrees, accuracies, marker='o')
plt.xlabel('Polynomial Degree')
plt.ylabel('Accuracy')
plt.title('Accuracy of Polynomial Logistic Regression vs. Degree')
plt.grid(True)
plt.show()

In [None]:
best_degree = #<YOUR CODE HERE>

best_model = #<YOUR CODE HERE>

# Predict on the test set
y_pred = best_model.predict(X_test)

In [None]:
plt.figure(figsize=(8, 6))

# Scatter plot of test data points
plt.scatter(X_test['p'], X_test['q'], c=y_pred, cmap='viridis', label='Predicted Labels', alpha=0.5)

# Generate points for the zero-discriminant curve -4 p^3 - 27 q^2 = 0.
# Real solutions exist only for p <= 0, so restrict the range to avoid NaNs from np.sqrt.
p_curve = np.linspace(-MAX_VAL, 0, 400)
q_curve = np.sqrt(-4 * p_curve**3 / 27)
plt.plot(p_curve, q_curve, color='red', label=r'$-4p^3 - 27q^2 = 0$')
plt.plot(p_curve, -q_curve, color='red')


plt.xlabel('p')
plt.ylabel('q')
plt.title('2D Scatter Plot of Predictions with Curve')
plt.legend()
plt.grid(True)
plt.show()

# $2 \times 2$ Matrices Classification (5 Points)


In this task, we aim to develop a **classifier** that can distinguish between **$2 \times 2$ integer matrices** that are **singular** (i.e., have determinant zero) and those that are **non-singular** (i.e., have non-zero determinant).

Each matrix has entries drawn from the integer range $-10$ to $10$, giving us $21^4$ possible matrices in total. For a $2 \times 2$ matrix:

$$
\begin{bmatrix}
a & b \\
c & d
\end{bmatrix},
$$

the determinant is calculated as:

$$
\det = ad - bc
$$

Notably, the vast majority of matrices in this space have a **non-zero determinant**. The matrices for which $ad = bc$ (i.e., the determinant is zero) form a small minority, which creates a **class imbalance** problem.

The goal is to train a binary **logistic regression** classifier to predict whether a given matrix is singular or not. However, because the dataset is imbalanced, a naive model might achieve high **accuracy** simply by always predicting the majority class (non-singular). This would be misleading, as it would fail to detect the minority class (singular matrices), which is often the class of interest in applications such as system solvability, matrix inversion, or numerical stability.



In [None]:
import numpy as np
from itertools import product

def generate_integer_dataset(value_range=(-10, 10)):
    """Enumerate every 2x2 integer matrix (a, b, c, d) with entries in value_range.

    Returns the flattened matrices and their labels:
    0 if the matrix is singular (det == 0), 1 otherwise.
    """
    values = range(value_range[0], value_range[1] + 1)
    all_matrices = np.array(list(product(values, repeat=4)))  # columns: a, b, c, d

    # Compute all determinants at once instead of looping over rows
    a, b, c, d = all_matrices.T
    det = a * d - b * c
    labels = np.where(det == 0, 0, 1)

    return all_matrices, labels

In [None]:
X_int, y_int = generate_integer_dataset()

labels, counts = np.unique(y_int, return_counts=True)

# Plot pie chart
plt.figure(figsize=(6, 6))
plt.pie(counts, labels=[f'Class {label}' for label in labels], autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title('Distribution of Labels in y_int')
plt.axis('equal')  # Equal aspect ratio makes the pie a circle.
plt.show()

We enumerated all such matrices, and the resulting classes are **highly imbalanced**. This means we need to be **especially cautious** when evaluating model performance.

For example, a naive classifier that always predicts class `1` might achieve a **low overall error**, simply because class `1` dominates the dataset. However, this does **not** indicate that the classifier is actually effective—it merely reflects the class imbalance.


To better assess performance in the presence of class imbalance, we consider several standard classification metrics. In this context:

- Class **1** is treated as the **positive class**  
- Class **0** is treated as the **negative class**


<center>
<img src="https://drive.google.com/uc?export=view&id=1SuL3Apx6Vx3tml5qC8H6MNYVPWFVUlOc" width=600 />
</center>



Here are the key metrics:

- **Accuracy**:  
  The proportion of correctly classified instances out of all samples.  
  $$
  \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}}
  $$  
  ⚠️ Accuracy is **not reliable** in imbalanced datasets, as a model can achieve high accuracy by always predicting the majority class.

- **Precision** (for class 1):  
  The proportion of predicted class `1` instances that are actually correct.  
  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
  $$

- **Recall** (for class 1):  
  The proportion of actual class `1` instances that are correctly predicted.  
  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  $$

- **F1 Score** (for class 1):  
  The harmonic mean of precision and recall.  
  $$
  \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  $$

🟢 **Precision**, **Recall**, and **F1 Score** are more meaningful than accuracy in imbalanced settings. However, by default they are computed with class `1` as the positive class, a convention that assumes the minority class is labeled `1`.

Since in our task **class `0` is the underrepresented class**, we flip the perspective and evaluate the metrics **with respect to class `0`**. In `scikit-learn`, this can be done as follows:


```python
from sklearn.metrics import f1_score
f1_score(y_test, y_pred, pos_label=0)
```
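To see why the choice of `pos_label` matters, the following sketch evaluates a classifier that always predicts the majority class. The toy labels (5 samples of class `0`, 95 of class `1`) and the variable names are assumptions made for illustration; they are not the assignment's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels (assumed for illustration): 5 singular (0) and 95 non-singular (1).
y_true = np.array([0] * 5 + [1] * 95)

# A naive classifier that always predicts the majority class 1.
y_majority = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_f1_class1 = f1_score(y_true, y_majority)  # default pos_label=1
baseline_recall_class0 = recall_score(y_true, y_majority, pos_label=0)
# zero_division=0 silences the warning raised when class 0 is never predicted.
baseline_f1_class0 = f1_score(y_true, y_majority, pos_label=0, zero_division=0)

print(f"Accuracy:            {baseline_accuracy:.3f}")
print(f"F1 (pos_label=1):    {baseline_f1_class1:.3f}")
print(f"Recall (pos_label=0): {baseline_recall_class0:.3f}")
print(f"F1 (pos_label=0):    {baseline_f1_class0:.3f}")
```

Accuracy and the class-`1` F1 score both look strong for this useless classifier, while the metrics computed with `pos_label=0` expose that it never detects a singular matrix.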

**Another important consideration is how we split the dataset into training and test sets. It is essential that both sets reflect the same class distribution as the full dataset. This ensures that performance metrics are not skewed by an uneven class ratio in either the training or test split.**
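A minimal sketch of a stratified split, using the `stratify` argument of `train_test_split`. The toy arrays `X_toy` and `y_toy` (with a 5% minority class) are hypothetical stand-ins created only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples with 4 integer features, 5% of them in class 0.
rng = np.random.default_rng(0)
X_toy = rng.integers(-10, 11, size=(1000, 4))
y_toy = np.array([0] * 50 + [1] * 950)

# stratify=y_toy asks the splitter to preserve the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("minority share, full :", np.mean(y_toy == 0))
print("minority share, train:", np.mean(y_tr == 0))
print("minority share, test :", np.mean(y_te == 0))
```

With stratification, the three printed shares should agree up to rounding; without it, the minority share in a small test set can drift noticeably from split to split.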


In [None]:
# We look at several metrics, because accuracy alone can be misleading on imbalanced data.
# If any of the metrics below are unfamiliar, look them up (search engine, GPT, or the scikit-learn documentation).
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report
)

# Split the data into training and testing sets with stratified sampling

X_train, X_test, y_train, y_test = #<YOUR CODE HERE> Don't forget to stratify!


# Train and evaluate polynomial logistic regression models for degrees 1 to 4
degrees = range(1, 5)
accuracies = []
f1_scores = []
precisions = []
recalls = []

for degree in degrees:
    model = #<YOUR CODE HERE>

    #<YOUR CODE HERE>


    # Evaluate metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, pos_label=0)
    precision = precision_score(y_test, y_pred, pos_label=0)
    recall = recall_score(y_test, y_pred, pos_label=0)

    accuracies.append(accuracy)
    f1_scores.append(f1)
    precisions.append(precision)
    recalls.append(recall)

    print(f"\nDegree {degree} classification report:")
    print(classification_report(y_test, y_pred))

# Plot
plt.plot(degrees, accuracies, marker='o', label='Accuracy')
plt.plot(degrees, f1_scores, marker='s', label='F1 Score')
plt.plot(degrees, precisions, marker='^', label='Precision')
plt.plot(degrees, recalls, marker='v', label='Recall')
plt.xlabel('Polynomial Degree')
plt.ylabel('Score')
plt.title('Model Evaluation Metrics')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# best_model = #<YOUR CODE HERE>
y_pred = best_model.predict(X_test)

def plot_pie_charts(y_true, y_pred):
    """For each class, plot the share of its samples that were classified correctly."""
    # Convert to arrays so positional comparison also works for pandas Series
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(y_true)

    fig, axes = plt.subplots(1, len(labels), figsize=(18, 6))  # Adjust figure size as needed

    for ax, label in zip(np.atleast_1d(axes), labels):
        in_class = y_true == label
        n_correct = np.sum(in_class & (y_pred == label))
        n_total = np.sum(in_class)

        ax.pie([n_correct, n_total - n_correct], labels=['Correct', 'Incorrect'],
               autopct='%1.1f%%', startangle=90, colors=['lightgreen', 'lightcoral'])
        ax.set_title(f"Class {label}")

    plt.show()

plot_pie_charts(y_test, y_pred)
print(classification_report(y_test, y_pred))


**We expect your model's F1 score to be higher than 0.97.**

Write code that identifies which features (including polynomial combinations) have the most significant impact on the model by examining the learned weights of the logistic regression step. Output the features sorted by the absolute value of their weights.

If you use `make_pipeline`, you will need to access the logistic regression step of the pipeline.




In [None]:
weights = dict() # Example: {"a^1": 1.0, "b^1":5.0, "a^3c^1":0.5}
#<YOUR CODE HERE>

Explain the weights you see:
YOUR TEXT HERE

# Question $2^*$ (1 Point)

What would happen if we tried to train a classifier on randomly sampled real-valued matrices?

By "randomly sampled," we mean that each element of the matrix is drawn independently from a uniform distribution.

Answer: YOUR TEXT HERE