Q2

1. Accuracy: Best used when false positives and false negatives have comparable consequences, and the class distribution is balanced.
Example (Weather Prediction): Predicting if it will rain tomorrow (yes/no) in a region where rain occurs about 50% of the time. Since false alarms and missed rain predictions have similar effects, accuracy serves as a suitable metric.

2. Sensitivity (Recall): Measures the proportion of actual positives that are correctly detected. Best used when false negatives are the costlier error.
Example (Medical Diagnosis): Screening for cancer in patients. A missed cancer diagnosis (false negative) is far more serious than a false alarm (false positive). High sensitivity ensures most cancer cases are identified.

3. Specificity: Measures the proportion of actual negatives that are correctly identified. Best used when false positives are the costlier error.
Example (Spam Filtering): Preventing legitimate emails from being marked as spam. Here, false positives (valid emails flagged as spam) are more problematic than missing a few spam emails, making high specificity crucial.

4. Precision: Evaluates the proportion of predicted positives that are true positives.
Example (Fraud Detection): Flagging fraudulent transactions in financial systems. False positives (legitimate transactions flagged as fraud) can cause customer dissatisfaction and operational challenges. High precision minimizes unnecessary alerts.
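
The four definitions above can be written directly in terms of confusion-matrix counts. The following is a minimal sketch: the function name classification_metrics and the example counts are made up for illustration and do not come from any dataset used in this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity and precision from raw counts."""
    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g. no predicted positives)
        return numerator / denominator if denominator > 0 else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # share of actual positives found
        "specificity": ratio(tn, tn + fp),   # share of actual negatives found
        "precision": ratio(tp, tp + fp),     # share of predicted positives that are right
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and precision share the same numerator (TP) but divide by different totals: actual positives for sensitivity, predicted positives for precision.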

Q4

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
ab_reduced_noNaN_train, ab_reduced_noNaN_test = train_test_split(ab_reduced_noNaN, test_size=0.2, random_state=42)

# Reporting the number of observations in each set
print("Number of observations in the training set:", len(ab_reduced_noNaN_train))
print("Number of observations in the test set:", len(ab_reduced_noNaN_test))


y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']: This applies one-hot encoding to the categorical variable Hard_or_Paper, converting it into a binary format. The resulting variable, y, is set to 1 for "Hardcover" ('H') and 0 for all other categories.

X = ab_reduced_noNaN[['List Price']]: This extracts the List Price column from the dataset, creating a DataFrame X to be used as the input feature for model training and testing.
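
To make the two steps concrete, here is a small sketch on a hypothetical four-row table (the values are invented, only the column names match the notebook). Depending on the pandas version, the dummy columns may come back as booleans rather than integers, so the sketch casts to int explicitly; that cast is an addition for illustration and is not part of the original code.

```python
import pandas as pd

# Hypothetical toy data that mimics the two relevant columns
toy = pd.DataFrame({
    "Hard_or_Paper": ["H", "P", "P", "H"],
    "List Price": [25.0, 9.99, 12.5, 30.0],
})

# One indicator column per category ('H' and 'P'); keep only the 'H' column
dummies = pd.get_dummies(toy["Hard_or_Paper"])
y_toy = dummies["H"].astype(int)

# Double brackets keep X as a 2-D DataFrame, which scikit-learn expects
X_toy = toy[["List Price"]]

print(y_toy.tolist())
print(X_toy.shape)
```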

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Preparing the data for training
# Note: X and y are built from the full dataset, so the test rows are also
# seen during training; use ab_reduced_noNaN_train here for a held-out evaluation.
y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']
X = ab_reduced_noNaN[['List Price']]

# Initializing and training the DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X, y)

# Visualizing the tree
plt.figure(figsize=(10, 6))
tree.plot_tree(clf, feature_names=['List Price'], class_names=['Paper', 'Hard'], filled=True)
plt.show()


Train-Test Split:

Discussed how to divide a dataset into training and testing sets using an 80/20 split via train_test_split.
Provided code to create the training (ab_reduced_noNaN_train) and testing (ab_reduced_noNaN_test) datasets and to count the number of observations in each.

Understanding Data Preparation:

Explained the following steps:
y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']: Converts the categorical variable "Hard_or_Paper" into a binary format, with 1 for "hardcover" and 0 for "paperback."
X = ab_reduced_noNaN[['List Price']]: Extracts the "List Price" column to use as the feature for model training.

Training a Decision Tree:

Provided code to initialize and train a DecisionTreeClassifier (clf) using the "List Price" feature to predict the book format.
Set the tree's maximum depth to 2 for simplicity.

Visualizing the Decision Tree:

Shared code to use tree.plot_tree for visualizing the decision-making process. This includes:
Splits based on "List Price" thresholds.
Class predictions at the terminal nodes.

Interpreting the Decision Tree:
Summarized how the tree uses splits in the "List Price" feature to predict whether a book is "hardcover" or "paperback."
Highlighted how terminal nodes indicate class probabilities and predictions.
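
As a rough illustration of reading splits and terminal-node probabilities, the sketch below fits a depth-2 tree on hypothetical prices and prints it as text with export_text. The data, the variable names, and the resulting threshold are invented for the example and say nothing about the actual book dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: cheaper books are paperback (0), pricier ones hardcover (1)
X_toy = [[5.0], [8.0], [10.0], [12.0], [25.0], [28.0], [30.0], [35.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

toy_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
toy_clf.fit(X_toy, y_toy)

# Text rendering of the fitted tree: each line is a threshold on List Price,
# and each leaf shows the predicted class
print(export_text(toy_clf, feature_names=["List Price"]))

# Class probabilities at the leaves reached by two new prices
new_prices = [[9.0], [27.0]]
print(toy_clf.predict_proba(new_prices))
print(toy_clf.predict(new_prices))
```

Each row of predict_proba is the class mix of the training observations in the leaf a book lands in, and predict returns the class with the larger share.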

Q5

In [None]:
import matplotlib.pyplot as plt
from sklearn import tree

# Defining X and y (three features this time)
# Note: as with clf, the full dataset is used here, so the test rows are also
# seen during training; use ab_reduced_noNaN_train for a held-out evaluation.
X = ab_reduced_noNaN[['NumPages', 'Thick', 'List Price']]
y = pd.get_dummies(ab_reduced_noNaN["Hard_or_Paper"])['H']

# Initialize and train the model
clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf2.fit(X, y)

# Simple tree visualization
plt.figure(figsize=(20, 10))
tree.plot_tree(clf2, feature_names=['NumPages', 'Thick', 'List Price'], class_names=['Paper', 'Hard'], filled=True)
plt.show()

from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'max_depth': [2, 3, 4, 5, 6]}

# Initialize GridSearchCV
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')

# Fit the grid search
grid_search.fit(X, y)

# Best parameters
print("Best max_depth:", grid_search.best_params_['max_depth'])
print("Best accuracy:", grid_search.best_score_)

Q6

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
import numpy as np

# Prepare the test data
y_actual = pd.get_dummies(ab_reduced_noNaN_test["Hard_or_Paper"])['H']
X_clf_test = ab_reduced_noNaN_test[['List Price']]
X_clf2_test = ab_reduced_noNaN_test[['NumPages', 'Thick', 'List Price']]

# Generate predictions for both models
y_predicted_clf = clf.predict(X_clf_test)
y_predicted_clf2 = clf2.predict(X_clf2_test)

# Compute confusion matrices
confusion_matrix_clf = confusion_matrix(y_actual, y_predicted_clf)
confusion_matrix_clf2 = confusion_matrix(y_actual, y_predicted_clf2)

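# Define a function to calculate performance metrics
# sklearn's confusion_matrix puts actual classes in rows and predicted classes
# in columns, so with labels [0, 1] the layout is [[TN, FP], [FN, TP]].
def calculate_metrics(conf_matrix):
    TP = conf_matrix[1, 1]
    TN = conf_matrix[0, 0]
    FP = conf_matrix[0, 1]
    FN = conf_matrix[1, 0]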

    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    accuracy = (TP + TN) / conf_matrix.sum()

    return np.round([sensitivity, specificity, accuracy], 3)

# Evaluate both models
metrics_for_clf = calculate_metrics(confusion_matrix_clf)
metrics_for_clf2 = calculate_metrics(confusion_matrix_clf2)

# Display the results
print("Confusion Matrix for clf:\n", confusion_matrix_clf)
print("Sensitivity, Specificity, Accuracy for clf:", metrics_for_clf)

print("\nConfusion Matrix for clf2:\n", confusion_matrix_clf2)
print("Sensitivity, Specificity, Accuracy for clf2:", metrics_for_clf2)

Confusion Matrix and Metrics for Models:

Confusion Matrix for clf:
[[40  4]
 [ 3 17]]

Sensitivity: 0.85
Specificity: 0.909
Accuracy: 0.891

Confusion Matrix for clf2:
[[42  2]
 [ 2 18]]

Sensitivity: 0.9
Specificity: 0.955
Accuracy: 0.938

Model Analysis:

The confusion matrix provides the counts for True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
Sensitivity: Reflects the model's ability to correctly identify hardcover books.
Specificity: Indicates how well the model identifies paperback books.
Accuracy: Represents the overall share of books, hardcover or paperback, that the model classifies correctly.

Summary of Evaluation:

Objective:
Compare the performance of two classification models (clf and clf2) using confusion matrices and calculate key metrics.

Key Concepts:

Positive (P): Hardcover (1).
Negative (N): Paperback (0).

Confusion Matrix Terms:
TP (True Positive): Correctly predicted hardcover books.
TN (True Negative): Correctly predicted paperback books.
FP (False Positive): Predicted hardcover, but the actual book is paperback.
FN (False Negative): Predicted paperback, but the actual book is hardcover.
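
The terms above map onto the reported matrices as follows. This sketch re-derives the metrics from the two test-set confusion matrices listed in this section, assuming the usual scikit-learn layout (rows are actual classes, columns are predicted classes, in the order paperback then hardcover). The helper name unpack and the summary dictionary are illustrative only.

```python
# Reported test-set confusion matrices: rows = actual, columns = predicted,
# class order [paperback (0), hardcover (1)]
reported = {
    "clf": [[40, 4], [3, 17]],
    "clf2": [[42, 2], [2, 18]],
}

def unpack(matrix):
    """Split a 2x2 confusion matrix into TN, FP, FN, TP."""
    (tn, fp), (fn, tp) = matrix
    return tn, fp, fn, tp

summary = {}
for name, matrix in reported.items():
    tn, fp, fn, tp = unpack(matrix)
    summary[name] = {
        "sensitivity": round(tp / (tp + fn), 3),
        "specificity": round(tn / (tn + fp), 3),
        "accuracy": round((tp + tn) / (tp + tn + fp + fn), 3),
        "errors": fp + fn,  # total misclassified test books
    }
    print(name, summary[name])
```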
 
Implementation:

Used the test dataset (ab_reduced_noNaN_test) for predictions.
Generated confusion matrices for clf and clf2.
Calculated sensitivity, specificity, and accuracy using a helper function, rounding results to three decimal places.

Results:

Presented confusion matrices for both models.
Reported sensitivity, specificity, and accuracy for each. Model clf2 demonstrates slightly better performance across all metrics.

Q7

The differences between the two confusion matrices arise from the features each model uses. The first model (clf) relies on a single feature, List Price, while the second (clf2) uses three: NumPages, Thick, and List Price. With more information available, and a deeper tree (max_depth=4 versus 2), clf2 can capture more complex patterns. Both models were built from the same dataset with missing values removed, so the comparison is not affected by missing data. On the test set, clf2 makes 4 misclassifications (2 false positives, 2 false negatives) compared with 7 for clf (4 false positives, 3 false negatives), which is reflected in its higher sensitivity, specificity, and accuracy.