<img src="https://github.com/PKhosravi-CityTech/LightCnnRad/raw/main/Images/BioMindLogo.png" alt="BioMind AI Lab Logo" width="150" height="150" align="left" style="margin-bottom: 40px;"> **Repository Developed by Pegah Khosravi, Principal Investigator of the BioMind AI Lab**

Welcome to this repository! This notebook is designed to provide hands-on experience and foundational knowledge in machine learning. It is part of our journey to explore key ML concepts, algorithms, and applications. Whether you're a PhD or master's student, this repository aims to support your learning goals and encourage critical thinking about machine learning systems.


# Midterm

## Midterm Description

The midterm is designed to take 30 minutes. If you are experiencing high stress or need additional time, you may take up to 1 hour to finish.

## Midterm Structure
- 6 Multiple-choice questions
- 1 True/False question
- 1 Fill-in-the-blank question (with two blanks)
- 2 Essay questions

## Instructions
- For the multiple-choice questions, select one of the four options and write your answer as A, B, C, or D (capital letters preferred).
- For the fill-in-the-blank question, provide one word per blank, separated by a comma (e.g., word1, word2).
- For the essay questions, write your explanation and code in Google Colab or any Python-compatible platform.
- This is an open-book midterm, so you may use textbooks or other resources. Work independently and do not use OpenAI tools or similar AI assistants.

Follow these instructions carefully. Good luck!

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

# Function to create a question-answer toggle button
def create_question(question_text, answer_text):
    button = widgets.Button(description="Click to Reveal Answer")
    output = widgets.Output()

    def reveal_answer(b):
        with output:
            output.clear_output()
            display(Markdown(f"**Answer:** {answer_text}"))

    button.on_click(reveal_answer)
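    # Markdown collapses single newlines, so convert them to hard line breaks
    # (two trailing spaces) to keep each answer option on its own line
    formatted_question = question_text.replace("\n", "  \n")
    display(Markdown(f"**{formatted_question}**"), button, output)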

# Updated list of questions and answers
qa_pairs = [
    ("Question 1: In the context of bias-variance tradeoff, what happens when a model has high bias?\n"
     "A) The model underfits the data\n"
     "B) The model overfits the data\n"
     "C) The model generalizes well but is computationally expensive\n"
     "D) The model uses too many parameters, leading to poor interpretability", "A"),

    ("Question 2: What is a key advantage of hierarchical clustering over k-Means?\n"
     "A) It does not require the number of clusters to be specified in advance\n"
     "B) It is always computationally more efficient\n"
     "C) It only works well with large datasets\n"
     "D) It is a supervised learning algorithm", "A"),

    ("Question 3: Which dimensionality reduction technique uses linear discriminants to separate different classes in a dataset?\n"
     "A) Principal Component Analysis (PCA)\n"
     "B) t-SNE\n"
     "C) Linear Discriminant Analysis (LDA)\n"
     "D) Singular Value Decomposition (SVD)", "C"),

    ("Question 4: Which of the following is NOT a characteristic of a Decision Tree?\n"
     "A) It can handle both categorical and numerical data\n"
     "B) It requires feature scaling for optimal performance\n"
     "C) It can suffer from overfitting if not properly pruned\n"
     "D) It uses impurity measures such as Gini impurity and entropy", "B"),

    ("Question 5: Which clustering algorithm assigns data points based on their density rather than distance-based centroids?\n"
     "A) k-Means\n"
     "B) DBSCAN\n"
     "C) Hierarchical Clustering\n"
     "D) PCA", "B"),

    ("Question 6: Which method is most appropriate for detecting non-linear decision boundaries in classification problems?\n"
     "A) Linear Regression\n"
     "B) Logistic Regression\n"
     "C) Support Vector Machines (SVM) with kernel functions\n"
     "D) k-Means Clustering", "C"),

    ("Question 7: True/False? t-SNE and UMAP are primarily used for classification tasks rather than data visualization.", "False"),

    ("Question 8: Fill in the Blank → In boosting algorithms like AdaBoost and XGBoost, each new model is trained to correct the ______ of the previous models by focusing on the ______ examples.",
     "errors, misclassified"),
]

# Display all questions with interactive answer buttons
for question, answer in qa_pairs:
    create_question(question, answer)
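
Question 6 above concerns non-linear decision boundaries. As a minimal sketch (not part of the graded material), the snippet below trains the same SVM with a linear kernel and with an RBF kernel on synthetic concentric rings. The `make_circles` data, the variable names, and the split sizes are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (illustrative assumption): two concentric rings
# that no single straight line can separate
X_rings, y_rings = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_rings, y_rings, test_size=0.25, random_state=0)

# Same classifier, two kernels: a linear boundary versus an RBF (non-linear) boundary
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On data like this, the kernel is the only thing that changes between the two models, so any gap in accuracy comes from the shape of the decision boundary each kernel can represent.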

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

# Function to create a question-answer toggle button
def create_question(question_text, answer_text):
    button = widgets.Button(description="Click to Reveal Answer")
    output = widgets.Output()

    def reveal_answer(b):
        with output:
            output.clear_output()
            display(Markdown(f"**Answer:**\n\n{answer_text}"))

    button.on_click(reveal_answer)
    display(Markdown(question_text), button, output)

# Revised question formatting for proper Markdown rendering
question_9 = (
    "**Question 9: Essay - Compare and Contrast k-Means Clustering and k-Nearest Neighbors (k-NN)**\n\n"
    "1. **Explain the differences in how they work.**\n"
    "2. **Discuss the types of datasets where each algorithm performs well.**\n"
    "3. **Provide an example of a real-world use case where one would be preferred over the other.**"
)

answer_9 = (
    "**k-Means and k-Nearest Neighbors (k-NN)** are two fundamental machine learning algorithms, but they serve different purposes and operate in distinct ways.\n\n"
    "**k-Means** is an **unsupervised learning algorithm** used for clustering, meaning it finds patterns and groups in data without predefined labels. It works by initializing k centroids, assigning data points to the nearest centroid, and iteratively updating the centroids until clusters stabilize. This makes it particularly useful for identifying natural structures within large, unlabeled datasets.\n\n"
    "In contrast, **k-NN** is a **supervised learning algorithm** used for classification and regression. Given a new data point, k-NN finds the k closest labeled points and assigns the most common class (or averages their values for regression). Unlike k-Means, which fits centroids to the whole dataset during training, k-NN has no real training phase: it stores the labeled data and classifies one instance at a time. This makes it flexible but computationally expensive at prediction time, especially as dataset size increases.\n\n"
    "**k-Means performs best** in scenarios where we need to discover hidden patterns within data, such as customer segmentation in marketing. For example, a retail company might analyze spending behaviors and apply k-Means to group customers into categories like budget shoppers, frequent buyers, and luxury consumers. This allows businesses to develop targeted marketing strategies.\n\n"
    "On the other hand, **k-NN** is more suitable when labels are known and we want to predict a class for new data points. For instance, in medical diagnosis, a k-NN model can classify patients as low, medium, or high risk for heart disease based on past patient records. By comparing a new patient's features (such as cholesterol levels and blood pressure) with similar cases, the model makes a classification without needing complex model training.\n\n"
    "While both algorithms rely on distance-based calculations, their practical use cases differ. **k-Means** is an efficient and scalable algorithm that works well with large datasets but requires the number of clusters to be defined in advance. In contrast, **k-NN** is adaptable to new data but is computationally expensive because it must compare each new point to the entire dataset.\n\n"
    "In summary, **k-Means** is ideal for grouping unlabeled data, while **k-NN** excels in classification tasks where labeled data is available. The choice between them depends on whether the goal is to find hidden patterns or make predictions based on past examples."
)

# Display the question
create_question(question_9, answer_9)
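
The contrast described in the Question 9 answer can be sketched in a few lines. This is a minimal illustration rather than part of the graded answer: the synthetic `make_blobs` data, the variable names, and the choices `n_clusters=3` and `n_neighbors=5` are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (illustrative assumption): three blobs with known labels
X_blobs, y_blobs = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_blobs, y_blobs, test_size=0.2, random_state=42)

# k-Means is unsupervised: it never sees the labels, and k is the number of clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_tr)

# k-NN is supervised: it needs the labels, and k is the number of neighbors that vote
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)
knn_accuracy = knn.score(X_te, y_te)

print("k-Means centroids shape:", kmeans.cluster_centers_.shape)
print("k-NN test accuracy:", knn_accuracy)
```

Note that the cluster IDs returned by k-Means are arbitrary integers and need not match the true label values, which is why k-Means output is usually inspected or profiled rather than scored with accuracy.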

In [None]:
import ipywidgets as widgets
from IPython.display import display, Markdown, clear_output

def create_question(question_text, answer_text):
    button = widgets.Button(description="Click to Reveal Answer")
    output = widgets.Output()

    def reveal_answer(b):
        with output:
            output.clear_output()
            display(Markdown(f"**Answer:**\n\n{answer_text}"))

    button.on_click(reveal_answer)
    display(Markdown(question_text), button, output)

question_10 = (
    "**Question 10: Essay & Code - Parkinson’s Disease Detection with SVM, PCA, and t-SNE Visualization**\n\n"
    "You are given the Parkinson’s Disease Detection Dataset from the UCI Machine Learning Repository. This dataset contains 22 numerical biomedical voice features extracted from voice recordings of patients with and without Parkinson’s disease. Your task is to predict whether a person has Parkinson’s disease (1) or not (0) using a Support Vector Machine (SVM) classifier—both before and after applying Principal Component Analysis (PCA) for dimensionality reduction. Finally, apply t-SNE to visualize the dataset in 2D after reducing it with PCA.\n\n"
    "**Dataset URL:** [Parkinson’s Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data)\n\n"
    "**Your Task:**\n\n"
    "1. Load the dataset and inspect its structure.\n"
    "2. Preprocess the data by removing any unnecessary columns (e.g., patient names).\n"
    "3. Standardize all numerical features.\n"
    "4. Split the dataset into training (80%) and testing (20%) sets.\n"
    "5. Train a Support Vector Machine (SVM) classifier on the full dataset (22 features) and evaluate its performance.\n"
    "6. Apply PCA and reduce the dataset to the top 10 principal components.\n"
    "7. Train a second SVM model on the PCA-transformed dataset and compare the performance metrics (accuracy, precision, recall, F1-score).\n"
    "8. Final Task: Apply t-SNE on the PCA-reduced dataset to visualize the data distribution in 2D."
)

answer_10 = (
    "**Python Implementation**\n\n"
    "```python\n"
    "import pandas as pd\n"
    "import numpy as np\n"
    "import matplotlib.pyplot as plt\n"
    "import seaborn as sns\n"
    "from sklearn.model_selection import train_test_split\n"
    "from sklearn.preprocessing import StandardScaler\n"
    "from sklearn.decomposition import PCA\n"
    "from sklearn.svm import SVC\n"
    "from sklearn.manifold import TSNE\n"
    "from sklearn.metrics import accuracy_score, classification_report\n\n"
    "# Load the dataset\n"
    "url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data\"\n"
    "data = pd.read_csv(url)\n\n"
    "# Drop the 'name' column\n"
    "data.drop(columns=['name'], inplace=True)\n\n"
    "# Define features and target\n"
    "X = data.drop(columns=['status'])\n"
    "y = data['status']\n\n"
    "# Standardize features\n"
    "scaler = StandardScaler()\n"
    "X_scaled = scaler.fit_transform(X)\n\n"
    "# Split data into training and testing sets\n"
    "X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)\n\n"
    "# Train SVM before PCA\n"
    "svm_full = SVC(kernel='rbf', random_state=42)\n"
    "svm_full.fit(X_train, y_train)\n"
    "y_pred_full = svm_full.predict(X_test)\n\n"
    "print('SVM (Before PCA) Accuracy:', accuracy_score(y_test, y_pred_full))\n"
    "print(classification_report(y_test, y_pred_full))\n\n"
    "# Apply PCA\n"
    "pca = PCA(n_components=10)\n"
    "X_train_pca = pca.fit_transform(X_train)\n"
    "X_test_pca = pca.transform(X_test)\n\n"
    "# Train SVM after PCA\n"
    "svm_pca = SVC(kernel='rbf', random_state=42)\n"
    "svm_pca.fit(X_train_pca, y_train)\n"
    "y_pred_pca = svm_pca.predict(X_test_pca)\n\n"
    "print('SVM (After PCA) Accuracy:', accuracy_score(y_test, y_pred_pca))\n"
    "print(classification_report(y_test, y_pred_pca))\n\n"
    "# t-SNE Visualization\n"
    "tsne = TSNE(n_components=2, perplexity=30, random_state=42)\n"
    "X_tsne = tsne.fit_transform(X_train_pca)\n\n"
    "plt.figure(figsize=(10, 6))\n"
    "# Map the 0/1 targets to readable names so seaborn builds a correctly labeled legend\n"
    "status_labels = y_train.map({0: 'Healthy (0)', 1: 'Parkinson’s (1)'})\n"
    "sns.scatterplot(x=X_tsne[:,0], y=X_tsne[:,1], hue=status_labels, palette='coolwarm', alpha=0.7)\n"
    "plt.title('t-SNE Visualization (After PCA)')\n"
    "plt.xlabel('t-SNE Component 1')\n"
    "plt.ylabel('t-SNE Component 2')\n"
    "plt.show()\n"
    "```"
)

create_question(question_10, answer_10)
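
When scaling and PCA are combined with a classifier, one common way to make sure the preprocessing statistics are learned only from training data is to wrap all steps in a scikit-learn `Pipeline`. The sketch below is a hedged illustration on synthetic data, not the reference answer: the `make_classification` dataset, the variable names, and the use of 5-fold cross-validation are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in (illustrative assumption): 22 numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=200, n_features=22, n_informative=8, random_state=42
)

# Scaler, PCA, and SVM are refit inside every fold, so each fold's
# held-out data never influences the preprocessing statistics
pipe = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

With the real dataset, the same `pipe` object could be fit on the training split and scored on the test split in place of the separate scaler, PCA, and SVM objects.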

### 📥 Download the Parkinson’s Disease Dataset from the UCI Repository and Load Directly into Colab  
#### 🔗 Dataset Link: [Parkinson’s Disease Detection Dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data"
data = pd.read_csv(url)

# Drop the 'name' column
data.drop(columns=['name'], inplace=True)

# Define features and target
X = data.drop(columns=['status'])
y = data['status']

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train SVM before PCA
svm_full = SVC(kernel='rbf', random_state=42)
svm_full.fit(X_train, y_train)
y_pred_full = svm_full.predict(X_test)

print('SVM (Before PCA) Accuracy:', accuracy_score(y_test, y_pred_full))
print(classification_report(y_test, y_pred_full))

# Apply PCA
pca = PCA(n_components=10)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Train SVM after PCA
svm_pca = SVC(kernel='rbf', random_state=42)
svm_pca.fit(X_train_pca, y_train)
y_pred_pca = svm_pca.predict(X_test_pca)

print('SVM (After PCA) Accuracy:', accuracy_score(y_test, y_pred_pca))
print(classification_report(y_test, y_pred_pca))

# t-SNE Visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_train_pca)

plt.figure(figsize=(10, 6))
# Map the 0/1 targets to readable names so seaborn builds a correctly labeled legend
status_labels = y_train.map({0: 'Healthy (0)', 1: 'Parkinson’s (1)'})
sns.scatterplot(x=X_tsne[:,0], y=X_tsne[:,1], hue=status_labels, palette='coolwarm', alpha=0.7)
plt.title('t-SNE Visualization (After PCA)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
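
Step 7 of the task asks for a comparison of accuracy, precision, recall, and F1-score between the two models. One possible way to line these up side by side is sketched below. The helper `summarize_metrics` and the small hand-written label lists are hypothetical and exist only to keep the example self-contained.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for the positive class (1)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels and predictions, used only to demonstrate the table layout
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
pred_full_demo = [1, 1, 1, 0, 1, 1, 0, 1]  # one false positive
pred_pca_demo = [1, 0, 1, 0, 0, 1, 0, 1]   # one false negative

comparison = pd.DataFrame({
    "SVM (22 features)": summarize_metrics(y_true_demo, pred_full_demo),
    "SVM (10 PCs)": summarize_metrics(y_true_demo, pred_pca_demo),
}).T

print(comparison.round(3))
```

In the notebook itself, the same helper could be called as `summarize_metrics(y_test, y_pred_full)` and `summarize_metrics(y_test, y_pred_pca)`.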