# 1. Bias-Variance Tradeoff, Overfitting & Underfitting
You are working on a predictive model to detect fraudulent transactions using a decision tree classifier. Initially, you trained a very deep tree that performs well on the training set but poorly on new data. To address this, you tried a shallow tree, but its performance was suboptimal across both training and test sets.

Questions:
- How would you describe the bias-variance tradeoff in this situation?
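  The bias-variance tradeoff is the balance between underfitting (high bias: the model is too simple to capture the pattern) and overfitting (high variance: the model fits noise in the training data and does not generalize).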
- Which of the models is suffering from high variance, and which one has high bias?
  The deep tree suffers from high variance (overfitting): it has low training error but high test error. The shallow tree suffers from high bias (underfitting): it has high error on both the training and test sets.
- What strategies can you use to achieve a balance between bias and variance?
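  To balance the two, we can tune the tree's complexity (for example, limit the maximum depth or prune the tree) and use cross-validation to pick the setting that generalizes best. Selecting meaningful features, removing noise, and creating new informative features also improve generalization.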
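
The effect of tree depth on training and test accuracy can be sketched as follows. This is a minimal illustration, not the fraud model itself: the data comes from `make_classification` as a synthetic stand-in (an assumption, since the real transactions are not shown), and the depth values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fraud data (assumption: the real data is not available).
# flip_y=0.1 adds label noise that a very deep tree can memorize.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, "
          f"test={results[depth][1]:.3f}")
```

A large gap between training and test accuracy points to high variance; low accuracy on both points to high bias.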
    

# 2. Accuracy, Precision, Recall, and Error Rate
A healthcare startup is building a machine learning model to predict whether a patient has a rare disease based on their medical records. The dataset is highly imbalanced, with only 2% of cases being positive. Your team trained a logistic regression model and reported 98% accuracy.

Questions:
- Why is accuracy not the best metric in this case?
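- Accuracy is misleading on an imbalanced dataset: with only 2% positive cases, a model that predicts "negative" for every patient already reaches 98% accuracy while detecting no one with the disease.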
- How would you calculate and interpret precision and recall in this scenario?
- Precision is true positives divided by true positives plus false positives, TP / (TP + FP); it tells us how many of the predicted positive cases actually have the disease. Recall is true positives divided by true positives plus false negatives, TP / (TP + FN); it measures how many of the actual positive cases were correctly identified.
- If the recall is 70% and precision is 30%, what does this indicate about the model’s performance?
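- A recall of 70% means the model finds 70% of the patients who have the disease and misses 30%. A precision of 30% means only 30% of the patients it flags are actually sick, so 70% of its positive predictions are false alarms. The model favors recall at the cost of many incorrect positive predictions, which could lead to unnecessary medical tests and anxiety for patients.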
- What steps would you take to improve the model’s predictive ability?
- I would extract relevant medical indicators to improve the model and remove noisy or redundant features that contribute to false positives. I would also balance the dataset with resampling (oversampling the positive class or undersampling the negative class).
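
The precision and recall formulas can be checked with a small worked example. The counts below are hypothetical, chosen only so that 2% of patients are positive, recall is 70%, and precision is 30%; they are not taken from the startup's data.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: 1500 patients, 30 of them (2%) have the disease.
total = 1500
tp, fn, fp = 21, 9, 49          # 21 detected, 9 missed, 49 false alarms
tn = total - tp - fn - fp       # remaining healthy patients correctly cleared

precision, recall = precision_recall(tp, fp, fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
# Accuracy of a model that predicts "negative" for every patient
baseline_accuracy = (tn + fp) / total

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"accuracy={accuracy:.3f}, all-negative baseline={baseline_accuracy:.3f}")
```

In this example the all-negative baseline has higher accuracy than the model even though it detects no positive cases, which is why accuracy alone is not informative here.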
    

# 3. Area Under the ROC Curve (AUC-ROC)
You built two classifiers to detect spam emails: Model A (a random forest) and Model B (a k-NN model). After evaluation, Model A has an AUC-ROC score of 0.85, while Model B has a score of 0.65.

Questions:
- What does the AUC-ROC score indicate about the performance of each model?
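- The AUC-ROC measures the model's ability to distinguish between spam and non-spam emails across all classification thresholds; it is not the same as accuracy. A score of 0.5 corresponds to random guessing and 1.0 to perfect separation, so Model A (0.85) separates the classes well, while Model B (0.65) is only moderately better than chance.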
- If you were to deploy one of the models in a real-world email filtering system, which would you choose and why?
- I would deploy Model A because its AUC-ROC is higher than Model B's, so it is better at separating spam from legitimate emails.
- Suppose Model B’s AUC-ROC improves to 0.75 after hyperparameter tuning. How would you further assess whether it is ready for deployment?
- I would compare precision, recall, and F1-score, and check the false positive and false negative rates. Finally, to see whether it is ready for deployment, I would conduct real-world testing to collect data on the model's performance.
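
One way to read an AUC-ROC score is as the probability that a randomly chosen spam email gets a higher score than a randomly chosen non-spam email. The sketch below computes that pairwise fraction by hand on four made-up scores and compares it with `roc_auc_score` from scikit-learn; the labels and scores are illustrative only, and `pairwise_auc` is a helper written for this example.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels (1 = spam) and classifier scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

manual_auc = pairwise_auc(y_true, scores)
sklearn_auc = roc_auc_score(y_true, scores)
print(manual_auc, sklearn_auc)
```

Three of the four spam/non-spam pairs are ranked correctly, so both values come out to 0.75.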

# 4. k-NN Distance Metrics and Feature Scaling
Scenario: You are working with a k-NN classifier on a dataset with mixed features, including age (years), income ($), and number of purchases. After training the model, you notice that income has a dominant effect on the distance calculations.

Questions:
- Why does income have a dominant effect on k-NN distance calculations?
- k-NN uses distance calculations to determine how close data points are. Income has a dominant effect because it has a much larger numerical range than the other features, so differences in income outweigh differences in age or number of purchases.
- How does feature scaling (e.g., Min-Max Scaling, Standardization) help in k-NN?
- Scaling puts all features on a comparable range so that they contribute equally to the distance calculation, preventing features with larger ranges from dominating.
- Suppose you are using Euclidean distance. Would feature scaling be necessary? Why or why not?
- Yes. Euclidean distance sums squared differences across features, so without scaling, features with large values such as income dominate the distance and reduce the impact of smaller-scaled features like age.
- If the dataset contains categorical variables, how can k-NN handle them?
- k-NN requires a numerical representation of all features, so categorical variables must be encoded before applying distance-based calculations. k-NN can handle them with one-hot encoding or binary encoding.
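
The dominance of income can be made concrete by measuring how much of the squared Euclidean distance between two customers comes from the income column, before and after standardization. The four customers below are hypothetical, and `income_share` is a helper defined only for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age (years), income ($), number of purchases]
X = np.array([
    [25, 50_000, 5],
    [60, 51_000, 40],
    [40, 80_000, 10],
    [30, 30_000, 20],
], dtype=float)

def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by income (column 1)."""
    sq_diff = (a - b) ** 2
    return sq_diff[1] / sq_diff.sum()

# Customers 0 and 1 differ a lot in age and purchases but little in income
raw_share = income_share(X[0], X[1])

# Standardize each column to zero mean and unit variance, then compare again
X_scaled = StandardScaler().fit_transform(X)
scaled_share = income_share(X_scaled[0], X_scaled[1])

print(f"income share of squared distance, raw:    {raw_share:.4f}")
print(f"income share of squared distance, scaled: {scaled_share:.4f}")
```

On the raw data almost all of the distance comes from a $1,000 income difference; after scaling, the large differences in age and purchases drive the distance instead.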
    

# 5. Choosing the Optimal k Value

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Define range for k values
k_values = list(range(1, 21))
cv_scores = []
# Note: a low k means high variance (overfitting); a high k means high bias (underfitting)
# Perform cross-validation
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Select the best k
best_k = k_values[np.argmax(cv_scores)]
print(f"Best k: {best_k}")
    

Best k: 3


Questions:
- What does `cross_val_score` do in this code?
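- `cross_val_score` performs cross-validation: it splits the training data into `cv` folds, trains the model on all but one fold, scores it on the held-out fold, and repeats so that each fold is used once for validation. It returns one score per fold, which the code averages to estimate how well each `k` generalizes.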
- Why do we perform 5-fold cross-validation to choose `k`?
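- Scoring each `k` on five different held-out folds gives a more reliable estimate than a single train/validation split, which reduces the variance of the evaluation. This lets us choose a `k` that balances overfitting and underfitting without touching the test set.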
- What does the `np.argmax(cv_scores)` function return, and why is it useful?
- `np.argmax(cv_scores)` returns the index of the highest mean cross-validation score. Because `cv_scores[i]` corresponds to `k_values[i]`, that index selects the `k` with the best performance.
- What happens if `k` is set too high? How would it affect bias and variance?
- If `k` is too high, the bias increases because the model becomes overly simplistic: each prediction averages over many neighbors and drifts toward the majority class. The variance decreases, so predictions are more stable.
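
The code in this section selects `k` but never uses the held-out test set. A possible follow-up, sketched below under the same split and settings, is to refit the model with the selected `k` and score it once on `X_test`. The names `final_model` and `test_accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Mean 5-fold cross-validation accuracy for each candidate k
k_values = list(range(1, 21))
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train,
                    cv=5, scoring="accuracy").mean()
    for k in k_values
]

# np.argmax returns the first index of the maximum, so ties go to the smaller k
best_k = k_values[int(np.argmax(cv_scores))]

# Refit on the full training set and evaluate once on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"Best k: {best_k}, test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the selection loop means the final score is an estimate of performance on unseen data rather than on data that influenced the choice of `k`.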


# 6. Effect of Different Distance Metrics

Questions:
- What is the difference between Euclidean and Manhattan distances?
- The difference is in how the distance between two points is measured. Euclidean distance is the straight-line distance (the square root of the sum of squared coordinate differences), and Manhattan distance is the sum of the absolute differences between coordinates.
- In which cases would Manhattan distance be preferable over Euclidean?
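- One case is a high-dimensional dataset, where Euclidean distance can be distorted by irrelevant dimensions. Manhattan distance is also less sensitive to a single large difference because it does not square the differences.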
- How would using Minkowski distance generalize both Euclidean and Manhattan distances?
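- Minkowski distance has a parameter p: it raises the absolute coordinate differences to the power p, sums them, and takes the p-th root. Setting p = 1 gives Manhattan distance and p = 2 gives Euclidean distance, so both are special cases.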
- How do different distance metrics affect k-NN classification in high-dimensional data?
- Euclidean distance becomes less effective in high dimensions due to the curse of dimensionality: distances between points tend to become similar. Manhattan distance is often more robust in high-dimensional space because it does not amplify large differences in a single feature. Lastly, Minkowski distance is flexible because its parameter lets us choose the metric.
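
The relationship between the three metrics can be sketched with a small helper. The function `minkowski` below is written for this example (it is not a library call), and the two points are arbitrary.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0, 0], [3, 4]
manhattan = minkowski(a, b, p=1)   # |3| + |4| = 7
euclidean = minkowski(a, b, p=2)   # sqrt(3^2 + 4^2) = 5

print(f"p=1 (Manhattan): {manhattan}")
print(f"p=2 (Euclidean): {euclidean}")
```

In scikit-learn, `KNeighborsClassifier` exposes the same idea through its `metric='minkowski'` and `p` parameters, so switching between Manhattan and Euclidean distance is a matter of setting `p=1` or `p=2`.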


# 7. Handling Imbalanced Data in k-NN

Scenario: You are applying k-NN to a medical dataset where 95% of patients are healthy (negative class) and only 5% have a rare disease (positive class). The model achieves high accuracy but fails to detect positive cases.

Questions:
- Why does k-NN struggle with imbalanced datasets?
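- k-NN relies on majority voting among the nearest neighbors. When 95% of the samples are negative, most neighborhoods are dominated by negative samples, so positive cases are outvoted.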
- What are some techniques to handle imbalanced classes in k-NN? (e.g., weighted voting, SMOTE)
- Techniques for handling imbalanced classes include oversampling the minority class (for example, with SMOTE), undersampling the majority class, and weighting the neighbors' votes.
- How does setting the `weights='distance'` parameter in `KNeighborsClassifier` help in this scenario?
- `weights='distance'` ensures that closer neighbors have more influence, which reduces the impact of faraway negative-class samples.
- If you were to use precision-recall curves instead of accuracy, how would that impact model evaluation?
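- Accuracy is misleading on imbalanced datasets: predicting only negative cases gives 95% accuracy despite failing to detect any positive cases. Precision-recall curves focus on the positive class and show the tradeoff between precision and recall at each threshold, so the evaluation would reveal that the model misses positive cases.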
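
The effect of `weights='distance'` can be seen on a tiny hand-built example. The one-dimensional data below is hypothetical: a single positive sample sits very close to the query point, while the other two of its three nearest neighbors are negatives that are much farther away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D data: one positive sample (label 1) among five negatives
X = np.array([[0.1], [1.0], [1.2], [2.0], [3.0], [4.0]])
y = np.array([1, 0, 0, 0, 0, 0])
query = np.array([[0.0]])  # its 3 nearest neighbors are 0.1 (pos), 1.0 and 1.2 (neg)

# Uniform voting: two negatives outvote one positive
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X, y)
# Distance weighting: each vote counts 1/distance, so the close positive wins
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)

uniform_pred = int(uniform.predict(query)[0])
distance_pred = int(distance.predict(query)[0])
print(f"uniform vote: {uniform_pred}, distance-weighted vote: {distance_pred}")
```

With uniform voting the query is classified as negative; with distance weighting the nearby positive sample carries a weight of 1/0.1 = 10 against roughly 1.8 for the two negatives combined, so the query is classified as positive. Distance weighting helps only when positive samples are actually close to the query, so it is usually combined with resampling rather than used alone.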



# 8. k-NN vs. Tree-Based Classifiers

Scenario: You need to classify images into categories, and you are choosing between k-NN and Decision Trees.

Questions:
- What are the main advantages of k-NN compared to tree-based classifiers like Decision Trees or Random Forests?
- The advantage of k-NN is that it makes no assumptions about the data distribution, which makes it useful for complex, irregular patterns.
- When would tree-based models be a better choice than k-NN?
- Tree-based models would be a better choice when the dataset has missing data or outliers.
- k-NN requires storing all training data for predictions. How does this impact its computational efficiency compared to tree-based classifiers?
- k-NN has almost no training cost, but it is very slow at prediction time for large datasets because it must compute distances to all training samples for every new prediction, and it must keep the whole training set in memory. Tree-based classifiers require more computation during training but make faster predictions once trained.
- How would a high-dimensional feature space affect k-NN’s performance?
- In a high-dimensional feature space, the computational cost increases because each distance calculation involves a large number of features. Distances also become less informative as the dimension grows, which tends to reduce k-NN's accuracy.
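
The way distances lose contrast in high dimensions can be illustrated with random data. The sketch below draws uniformly random points (an assumption; real image features are not uniform) and compares the distance from a query to its nearest and farthest point. The helper `nearest_to_farthest_ratio` and the chosen dimensions are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_features, n_points=500):
    """Ratio of nearest to farthest Euclidean distance from a random query."""
    points = rng.random((n_points, n_features))   # uniform in [0, 1)^n_features
    query = rng.random(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in [2, 10, 100, 1000]}
for d, ratio in ratios.items():
    print(f"{d:>4} features: nearest/farthest distance ratio = {ratio:.3f}")
```

A ratio close to 1 means the nearest neighbor is barely closer than the farthest point, so the "nearest" neighbors carry little information. Reducing the number of features (for example, with feature selection or dimensionality reduction) before applying k-NN is a common way to counter this.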

