# Model Visualization
- Accuracy Tables: Detailed model performance tables for both training and validation accuracy.
- Final Plots: Accuracy comparison between Logistic Regression and KNN.

In [None]:
import matplotlib.pyplot as plt

# Print the accuracy tables (built in earlier cells) before plotting
print(accuracies_table_1)
print(accuracies_table_2)

fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Logistic Regression Plot
axs[0].plot(accuracies_table_1['l2_penalty'], accuracies_table_1['train_accuracy'], color='red', label='Train Accuracy')
axs[0].plot(accuracies_table_1['l2_penalty'], accuracies_table_1['validation_accuracy'], color='blue', linestyle='--', label='Validation Accuracy')
axs[0].set_xscale('log')
axs[0].set_ylim(0.75, 1.0)
axs[0].set_title('Logistic Regression Accuracy')
axs[0].set_xlabel('L2 Penalty (Log Scale)')
axs[0].set_ylabel('Accuracy')
axs[0].legend()
axs[0].grid(True)

# KNN Plot
axs[1].plot(accuracies_table_2['n_neighbors'], accuracies_table_2['train_accuracy'], color='green', label='Train Accuracy')
axs[1].plot(accuracies_table_2['n_neighbors'], accuracies_table_2['validation_accuracy'], color='purple', linestyle='--', label='Validation Accuracy')
axs[1].set_xlim(3, 27)
axs[1].set_ylim(0.75, 1.0)
axs[1].set_title('KNN Accuracy')
axs[1].set_xlabel('Number of Neighbors')
axs[1].set_ylabel('Accuracy')
axs[1].legend()
axs[1].grid(True)

plt.tight_layout()
plt.show()

# Project Analysis

In this project, I developed and evaluated two classification models, Logistic Regression and K-Nearest Neighbors (KNN), each tuned over a range of values for a single hyperparameter, to predict outcomes from the provided dataset. For Logistic Regression, I tried six L2 regularization penalties (0.001, 0.01, 0.1, 1, 10, and 100); for KNN, I evaluated five neighbor counts (5, 10, 15, 20, and 25).
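The sweep over both grids can be sketched as follows. This is a minimal illustration rather than the project's actual training code: it assumes scikit-learn and pandas, uses synthetic data from `make_classification` in place of the real dataset, and assumes the L2 penalty maps to scikit-learn's `C` as `C = 1 / penalty`. The helper `sweep` is a hypothetical name; the column names are chosen to match the plotting cell above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameter grids taken from the text above.
L2_PENALTIES = [0.001, 0.01, 0.1, 1, 10, 100]
N_NEIGHBORS = [5, 10, 15, 20, 25]

# Synthetic stand-in data (assumption): the project's real dataset is not used here.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def sweep(make_model, param_name, values):
    """Fit one model per value and record train/validation accuracy."""
    rows = []
    for value in values:
        model = make_model(value).fit(X_train, y_train)
        rows.append({
            param_name: value,
            "train_accuracy": model.score(X_train, y_train),
            "validation_accuracy": model.score(X_val, y_val),
        })
    return pd.DataFrame(rows)


# Assumption: the L2 penalty maps to scikit-learn's C as C = 1 / penalty.
accuracies_table_1 = sweep(
    lambda penalty: LogisticRegression(C=1.0 / penalty, max_iter=1000),
    "l2_penalty", L2_PENALTIES,
)
accuracies_table_2 = sweep(
    lambda k: KNeighborsClassifier(n_neighbors=k),
    "n_neighbors", N_NEIGHBORS,
)
```

Building both tables through one helper keeps the column layout identical, so the same plotting code works for either model.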

Initially, on the raw data with no pre-processing, the baseline model accuracy was 0.51642, leaving significant room for improvement. To improve performance, I built a pre-processing pipeline that imputed the column mean for NaN entries and one-hot encoded categorical variables to convert them into a numerical format. I also excluded the `userid_DI` feature, focusing on the most informative features for training.
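The three pre-processing steps can be sketched with pandas. This is an illustrative sketch, not the project's pipeline: only `userid_DI` is a column name from the project, while `grade`, `country`, the toy values, and the `preprocess` helper are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): only `userid_DI` comes from the project;
# `grade` and `country` are hypothetical example columns.
raw = pd.DataFrame({
    "userid_DI": ["u1", "u2", "u3", "u4"],
    "grade": [0.8, np.nan, 0.4, 0.6],
    "country": ["US", "IN", np.nan, "US"],
})


def preprocess(df):
    """Drop the ID column, mean-impute numeric NaNs, one-hot encode categoricals."""
    df = df.drop(columns=["userid_DI"])
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    # get_dummies encodes the remaining text columns; a missing category
    # becomes a row of zeros across that column's indicator columns.
    return pd.get_dummies(df, dtype=float)


features = preprocess(raw)
print(features)
```

In a real train/validation split, the imputation means would normally be computed on the training set only and then reused on the validation set, to avoid leaking validation information.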

After pre-processing, I trained and validated both models and observed how accuracy varied with each hyperparameter. For Logistic Regression, smaller L2 penalties yielded higher validation accuracy, meaning the model fit better with weaker regularization (equivalently, larger values of `C`, the inverse regularization strength). For KNN, the effect of the neighbor count on validation accuracy was less predictable: accuracy fluctuated as the number of neighbors increased.
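The link between the L2 penalty and `C` can be written out explicitly. This assumes the penalty $\lambda$ multiplies the squared weight norm and that `C` follows the scikit-learn convention of inverse regularization strength; $\ell$ denotes the logistic loss, and the symbols $w$, $b$, and $\lambda$ are notation introduced here for illustration.

$$
\min_{w,b}\; \sum_{i=1}^{n} \ell\big(y_i,\, w^\top x_i + b\big) + \frac{\lambda}{2}\lVert w\rVert_2^2
\quad\Longleftrightarrow\quad
\min_{w,b}\; C\sum_{i=1}^{n} \ell\big(y_i,\, w^\top x_i + b\big) + \frac{1}{2}\lVert w\rVert_2^2,
\qquad C = \frac{1}{\lambda}
$$

Dividing the first objective by $\lambda$ gives the second, so both have the same minimizer. Under this scaling, the penalties 0.001 through 100 correspond to `C` values of 1000 down to 0.01. The exact constant depends on whether the loss is summed or averaged over samples, so the mapping should be read as a convention rather than a fixed rule.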

Ultimately, Logistic Regression was the stronger model, consistently achieving the highest validation accuracy across the hyperparameter settings tested. It met the project’s target of a validation accuracy of 0.95 or higher. Careful hyperparameter tuning, combined with the pre-processing steps above, improved model performance and satisfied the assignment's criteria.
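Picking the best setting from an accuracy table and checking it against the 0.95 target can be sketched as below. This assumes the tables are pandas DataFrames with a `validation_accuracy` column, as in the plotting cell. The `best_setting` helper is a hypothetical name, and the numbers in `example` are placeholders for illustration only, not the project's results.

```python
import pandas as pd

TARGET_ACCURACY = 0.95  # target stated in the text


def best_setting(table, param_name, target=TARGET_ACCURACY):
    """Return (best parameter value, its validation accuracy, meets-target flag)."""
    best = table.loc[table["validation_accuracy"].idxmax()]
    accuracy = float(best["validation_accuracy"])
    return best[param_name], accuracy, accuracy >= target


# Placeholder numbers for illustration only; these are NOT the project's results.
example = pd.DataFrame({
    "l2_penalty": [0.001, 0.01, 0.1],
    "train_accuracy": [0.99, 0.98, 0.97],
    "validation_accuracy": [0.96, 0.95, 0.93],
})
penalty, accuracy, meets_target = best_setting(example, "l2_penalty")
print(penalty, accuracy, meets_target)
```

Selecting on validation accuracy rather than training accuracy matters here, because training accuracy tends to favor the least regularized model regardless of how well it generalizes.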

# Ethical Implications

This project raises several ethical concerns, particularly around using a predictive model to maximize revenue by encouraging students to complete a paid certification program. A focus on profit could lead to data bias, damage to the platform’s credibility, and potential discrimination.

Firstly, the use of predictive models with a primary profit-driven motive risks embedding historical bias into decision-making. Historical bias arises when data reflects past inequities or societal biases, often related to certain demographics. Even if the data is accurate and well-sampled, it can still carry harmful biases that perpetuate inequalities. In this case, the model might inadvertently favor students who already have better access to education, thereby reinforcing existing disparities rather than providing equitable opportunities.

Secondly, prioritizing profit over ethical considerations could undermine the credibility and integrity of the online education platform. Users may perceive the platform as focused solely on revenue rather than on promoting fair access to education. To maintain trust, the platform should prioritize values such as fairness, transparency, and ethical development in deploying predictive models. This would demonstrate a commitment to using data-driven insights responsibly to support all learners equitably.

In conclusion, implementing predictive modeling in this context requires careful consideration of ethical principles to avoid perpetuating discrimination, ensure transparency, and uphold the platform’s reputation.