# Module 3: Conventional Machine Learning

Datasets:
1. **[Wisconsin breast cancer dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data)**
2. **[Knee MRI image in DICOM format](https://drive.google.com/file/d/18szKEsfWko9PnuG__SuAVNQLhTVv-1Ha/view?usp=sharing)**

## 3.1 Classification

Useful Packages/Libraries: **numpy, pandas, scikit-learn**

**Assignment 3.1**

*Our goal is to use different features obtained from the cell nuclei present in the image (e.g. radius, area, smoothness, etc.) to predict whether the given sample/image corresponds to a malignant or a benign tumor.*

*   **Load the breast cancer dataset using pandas.**
  *  Extract the diagnosis feature values as a numpy array, **Y**
  *  Remove the diagnosis and id features from the dataframe
  *  Extract the remaining feature values as a numpy array/matrix, **X**
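
The loading steps above might look like the following minimal sketch. The small in-memory CSV is a hypothetical stand-in (made-up rows, three of the feature columns) so the snippet runs without the downloaded file; with the real file, pass its path to `pd.read_csv`. Encoding malignant as 1 and benign as 0 is an assumption, not a requirement of the assignment.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; the rows below are illustrative only.
csv_text = """id,diagnosis,radius_mean,texture_mean,area_mean
101,M,17.99,10.38,1001.0
102,B,13.54,14.36,566.3
103,M,20.57,17.77,1326.0
104,B,12.45,15.70,477.1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Extract the diagnosis column as a numpy array (assumed encoding: M -> 1, B -> 0).
Y = df["diagnosis"].map({"M": 1, "B": 0}).to_numpy()

# Remove the label and identifier columns, then extract the feature matrix.
X = df.drop(columns=["diagnosis", "id"]).to_numpy()

print(X.shape, Y)
```

If the real file turns out to contain an all-empty column, `df.dropna(axis=1, how="all")` is one way to remove it before extracting **X**.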

*   **Data setup and preprocessing**
  *  Randomly split X into X_train and X_test such that 70% of the data is in X_train and the remaining 30% is in X_test. **Note: Save the split indices**
  *  Use the above indices to split Y into Y_train and Y_test. **Hint:** Use the **train_test_split** function from scikit-learn to simultaneously do both of these tasks.
  *  Scale the data (column-wise) in X_train to zero mean and unit standard deviation. **Hint:** Use the **StandardScaler** class from sklearn.preprocessing for this task. ***Why do we scale the data?***
  *  Use the standard scaler fitted on X_train to transform X_test. ***Why do we do this?***
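
One possible way to do the split and the scaling is sketched below. It assumes scikit-learn's bundled copy of the Wisconsin dataset (`load_breast_cancer`) as a stand-in for the CSV; note that this copy encodes malignant as 0 and benign as 1. Passing an index array through `train_test_split` is one way to keep the split indices, and `random_state=0` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for X and Y loaded from the CSV.
X, Y = load_breast_cancer(return_X_y=True)

# Splitting an index array together with X and Y records which rows went where.
indices = np.arange(len(X))
X_train, X_test, Y_train, Y_test, idx_train, idx_test = train_test_split(
    X, Y, indices, test_size=0.3, random_state=0
)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the scaler is fitted on X_train only, the columns of `X_test_scaled` will generally not have exactly zero mean or unit standard deviation.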

*   **Set up the following classifiers**
    * SVM (use a linear kernel)
    * Decision trees
    * Random forest (50 trees, i.e. `n_estimators=50`)
    * k nearest neighbors (k=5, i.e. `n_neighbors=5`)
    * Logistic regression
  
  Here is example code for setting up a support vector machine (SVM). Use this as a reference when implementing the other classifiers.

    ```python
    from sklearn.svm import SVC
    mySVM = SVC(kernel='linear',probability=True)
    ```
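
Following the same pattern, the full set of classifiers could be collected in a dictionary, as in this sketch. The dictionary name `classifiers`, the `random_state` values, and `max_iter=1000` for logistic regression are illustrative assumptions; only the kernel, the number of trees, and k come from the list above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One entry per classifier in the list; keys are free-form display names.
classifiers = {
    "SVM": SVC(kernel="linear", probability=True),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    # A larger max_iter is assumed here to give the solver room to converge.
    "Logistic regression": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    print(name, type(clf).__name__)
```

Keeping the models in one dictionary makes it easy to loop over them for training and evaluation, since all of them share the `fit` / `predict` interface.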

*   **Train each classifier**

  Here is an example for training support vector machines. Use this as a reference to train each classifier.

    ```python
    mySVM.fit(X_train,Y_train)
    ```
*   **Evaluate the trained classifiers**. For each classifier, perform the following operations:
  *   Calculate true positives, false positives, true negatives and false negatives. *Hint: Use the classifier's `predict` method on X_test and compare the predictions with Y_test.*
  *   Visualize the above values on a confusion matrix.
  *   Plot the receiver operating characteristic (ROC) curve. *Hint: use `RocCurveDisplay.from_estimator` from `sklearn.metrics`; the older `plot_roc_curve` helper was removed in scikit-learn 1.2.*
  *   Calculate AUROC: area under the ROC curve
  *   Plot the precision-recall (PR) curve and calculate AUPRC.
  *   Which classifier had the best performance? Which classifier had the worst performance?
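
The numeric part of the evaluation can be sketched as follows for the SVM; plotting is left out so the snippet stays short. It assumes scikit-learn's bundled copy of the dataset (where class 1 is benign, so "positive" here means benign) and uses `average_precision_score` as one common estimate of AUPRC. Both choices are assumptions rather than the only valid approach.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset as a stand-in for the preprocessed CSV data.
X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

clf = SVC(kernel="linear", probability=True, random_state=0).fit(X_train, Y_train)

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp.
Y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred).ravel()

# Ranking metrics need continuous scores: the predicted probability of class 1.
scores = clf.predict_proba(X_test)[:, 1]
auroc = roc_auc_score(Y_test, scores)
auprc = average_precision_score(Y_test, scores)

print(f"TP={tp} FP={fp} TN={tn} FN={fn} AUROC={auroc:.3f} AUPRC={auprc:.3f}")
```

For the plots, `ConfusionMatrixDisplay.from_predictions`, `RocCurveDisplay.from_estimator` and `PrecisionRecallDisplay.from_estimator` in `sklearn.metrics` (scikit-learn 1.0 or later) take the same inputs.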


In [None]:
# Assignment 3.1

# Load the breast cancer dataset using pandas.

# Data setup and preprocessing

# Set up the following classifiers

# Train each classifier

# Evaluate the trained classifiers.


## 3.2 Hyperparameter Optimization

Useful Packages/Libraries: **numpy, pandas, scikit-learn, imbalanced-learn**

**Assignment 3.2**

*Our goal is to optimize the classification results obtained in Assignment 3.1*. We will work with the SVM classifier.

*   **k-fold cross validation**
  *  Perform 10-fold cross validation on X_train and record the results: (Sensitivity, Specificity, AUROC)
  
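    Useful link: [https://scikit-learn.org/0.17/modules/generated/sklearn.cross_validation.KFold.html](https://scikit-learn.org/0.17/modules/generated/sklearn.cross_validation.KFold.html) (this page documents scikit-learn 0.17; in current releases, `KFold` is imported from `sklearn.model_selection`)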
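
A minimal sketch of the 10-fold loop is given below. It assumes the bundled dataset as a stand-in for the training split, and it refits the scaler inside each fold through a pipeline so that the held-out fold never influences the scaling; if X_train has already been scaled as in Assignment 3.1, the plain SVM can be used instead. With the bundled labels, sensitivity refers to class 1 (benign).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for X_train and Y_train.
X_train, Y_train = load_breast_cancer(return_X_y=True)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
results = []  # one (sensitivity, specificity, AUROC) tuple per fold

for fit_idx, val_idx in kf.split(X_train):
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    model.fit(X_train[fit_idx], Y_train[fit_idx])

    pred = model.predict(X_train[val_idx])
    tn, fp, fn, tp = confusion_matrix(Y_train[val_idx], pred, labels=[0, 1]).ravel()

    # decision_function gives a continuous score suitable for the ROC curve.
    score = model.decision_function(X_train[val_idx])
    auroc = roc_auc_score(Y_train[val_idx], score)

    results.append((tp / (tp + fn), tn / (tn + fp), auroc))

mean_sens, mean_spec, mean_auroc = np.mean(results, axis=0)
print(f"sensitivity={mean_sens:.3f} specificity={mean_spec:.3f} AUROC={mean_auroc:.3f}")
```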

*   **Working with imbalanced datasets**
  *  Use SMOTE to balance the dataset.

    ```python
    # SMOTE is provided by the imbalanced-learn package (imported as imblearn)
    from imblearn.over_sampling import SMOTE
    sm = SMOTE(random_state=10)
    X_resampled, Y_resampled = sm.fit_resample(X_train, Y_train)
    ```
  *   Perform 10-fold cross validation on the resampled dataset and record the results.
  *   Train SVM on the resampled dataset and record the results on the test dataset. What do you observe?
  *   Do a literature review to list and describe all the methods that can be used for dealing with imbalanced datasets.

*   **Hyperparameter optimization:** Use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to optimize the following SVM hyperparameters on X_train
  * Kernel: Linear, RBF
  * C: 1, 10

  Use the following code snippet to initialize the parameter grid:
  ```python
  parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
  ```
  * Use the optimal hyperparameters obtained from the above experiment to train a new SVM model on X_train and test it on X_test. What do you observe?
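
The grid search could be run roughly as in the sketch below, which again assumes the bundled dataset as a stand-in and uses 10 folds with the default scoring (accuracy); both are illustrative choices. With the default `refit=True`, `GridSearchCV` retrains the best configuration on all of X_train, so `best_estimator_` can be evaluated on X_test directly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset as a stand-in for the preprocessed CSV data.
X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# Every kernel/C combination is scored with 10-fold cross-validation on X_train.
search = GridSearchCV(SVC(), parameters, cv=10)
search.fit(X_train, Y_train)

print("best parameters:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))

# best_estimator_ has already been refitted on the whole training set.
test_accuracy = search.best_estimator_.score(X_test, Y_test)
print("test accuracy:", round(test_accuracy, 3))
```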

*   **Feature selection**:
  * Use recursive feature elimination to identify the top features using cross-validation. Use this [link](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py) as reference. **Note: Apply this only to X_train**.
  * Use the top features to re-train the model on X_train and evaluate it on X_test. **Note: Do not forget to remove the unused features from X_test**.
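
As a rough illustration of these two steps, the sketch below uses `RFECV` with a linear SVM, assuming the bundled dataset as a stand-in, 5 folds, and accuracy as the selection criterion. The fitted selector's `transform` method applies the same column mask to X_train and X_test, which takes care of removing the unused features from both.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset as a stand-in for the preprocessed CSV data.
X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# A linear kernel exposes coef_, which RFE uses to rank the features.
selector = RFECV(SVC(kernel="linear"), step=1, cv=5, scoring="accuracy")
selector.fit(X_train, Y_train)  # fitted on the training data only

# Apply the same feature mask to both splits.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

final_model = SVC(kernel="linear").fit(X_train_sel, Y_train)
print("selected features:", selector.n_features_)
print("test accuracy:", round(final_model.score(X_test_sel, Y_test), 3))
```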



  

In [None]:
# Assignment 3.2

# k-fold cross validation

# Working with imbalanced datasets

# Hyperparameter optimization

# Feature selection:


## 3.3 Clustering

Useful Packages/Libraries: **numpy, pandas, scikit-learn, pydicom**

**Assignment 3.3**


*   **Load the knee MRI using pydicom**.
  * Save the 2D pixel array in the variable K2
  * Save the shape of K2 as S2
  * Flatten the 2D array into a 1D array, K11
  * Find the length of K11, L1
  * Reshape K11 to (L1, 1) and save the result as K1
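
The array manipulations in this step can be sketched with numpy alone. A small random image is used here as a hypothetical stand-in so the snippet runs without the DICOM file; with the real file, K2 would come from pydicom's `dcmread(...).pixel_array`, as noted in the comment. Storing the reshaped array as `K1` is an assumption made to match the name used in the k-means example.

```python
import numpy as np

# With the real file, the 2D array would be read via pydicom, for example:
#   import pydicom
#   K2 = pydicom.dcmread("knee.dcm").pixel_array   # hypothetical file name
# A synthetic image stands in here so the snippet is self-contained.
rng = np.random.default_rng(0)
K2 = rng.integers(0, 256, size=(64, 48)).astype(np.float64)

S2 = K2.shape            # original image shape, needed to undo the flattening
K11 = K2.flatten()       # 1D array with one entry per pixel
L1 = len(K11)            # number of pixels
K1 = K11.reshape(L1, 1)  # one sample per row, one feature (intensity) per column

print(S2, L1, K1.shape)
```

The `(L1, 1)` shape is what scikit-learn estimators expect: each pixel is treated as a sample with a single feature.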

*   **Set up the following clustering algorithms**
    * k-means
    * mini-batch k-means
    * hierarchical clustering
    * spectral clustering
    
    Here is example code for setting up **k-means**. Use this as a reference when implementing the other algorithms.

    ```python
    from sklearn.cluster import KMeans
    myKmeans = KMeans(n_clusters = 3)
    ```
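
The four algorithms could be set up along the following lines. The dictionary name `clusterers`, the Ward linkage, the nearest-neighbors affinity, and the `random_state` and `n_init` values are illustrative assumptions, not requirements of the assignment.

```python
from sklearn.cluster import (
    AgglomerativeClustering,
    KMeans,
    MiniBatchKMeans,
    SpectralClustering,
)

n_clusters = 3

# One entry per algorithm in the list; keys are free-form display names.
clusterers = {
    "k-means": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
    "mini-batch k-means": MiniBatchKMeans(n_clusters=n_clusters, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
    "spectral": SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors", random_state=0
    ),
}

for name, model in clusterers.items():
    print(name, type(model).__name__)
```

Hierarchical and spectral clustering scale much worse with the number of samples than the two k-means variants, so running them on every pixel of a full-resolution image may require downsampling the image first.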

*   **Fit each clustering algorithm to the data**

  Here is an example for **kmeans**.

    ```python
    myKmeans.fit(K1)
    ```
*   **Visualize the results**. For each algorithm, perform the following operations:
  *   Reshape the clustered array to S2 and visualize the clustered image alongside K2.
  *   Try different numbers of clusters (2-5) and compare the results.
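
The reshaping step and the loop over cluster counts can be sketched as below for k-means. A synthetic two-region image is a hypothetical stand-in for the knee MRI, and the dictionary `segmentations` is an illustrative name; plotting is omitted to keep the snippet dependency-free.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in image: dark left half, bright right half, plus mild noise.
rng = np.random.default_rng(0)
K2 = np.zeros((20, 30))
K2[:, 15:] = 100.0
K2 += rng.normal(0.0, 1.0, size=K2.shape)

S2 = K2.shape
K1 = K2.reshape(-1, 1)  # one row per pixel, one intensity feature

segmentations = {}
for k in range(2, 6):  # cluster counts 2, 3, 4, 5
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(K1)
    # Each pixel's cluster label goes back to its original image position.
    segmentations[k] = labels.reshape(S2)

print({k: seg.shape for k, seg in segmentations.items()})
```

Each entry of `segmentations` has the same shape as K2, so it can be displayed next to the original with, for example, matplotlib's `imshow`.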


In [None]:
#Assignment 3.3

# Load the knee MRI

# Set up the clustering algorithms

# Fit the clustering algorithms

# Visualize the results

## 3.4 Dimensionality Reduction / Representation Learning

Useful Packages/Libraries: **numpy, pandas, scikit-learn**

**Assignment 3.4**


*   **Load the breast cancer data using pandas**.
  *  Extract the diagnosis feature values as a numpy array, **Y**
  *  Remove the diagnosis and id features from the dataframe
  *  Extract the remaining feature values as a numpy array/matrix, **X**  
*   **Scale / Normalize each column in X to zero mean and unit standard deviation**.  
  
*   **Set up the following dimensionality reduction algorithms with the number of dimensions set to 2**
    * PCA
    * t-SNE
    
*   **Fit each dimensionality reduction to X**

*   **Visualize the resulting 2-dimensional embeddings using a scatterplot**
*   **Color-code your scatterplot with the diagnosis labels**
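
The embedding steps might be sketched as follows, assuming scikit-learn's bundled copy of the dataset as a stand-in for the CSV (there, label 0 is malignant and 1 is benign). The `random_state` for t-SNE is an arbitrary choice; t-SNE results vary with the seed and with the perplexity setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset as a stand-in for X and Y loaded from the CSV.
X, Y = load_breast_cancer(return_X_y=True)

# Scale each column to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X)

# Both methods map the 30 features down to 2 dimensions.
X_pca = PCA(n_components=2).fit_transform(X_scaled)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)

print(X_pca.shape, X_tsne.shape)
```

Either embedding can then be drawn with matplotlib, for example `plt.scatter(X_pca[:, 0], X_pca[:, 1], c=Y)`, which colors each point by its diagnosis label.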



In [None]:
#Assignment 3.4

# Load the breast cancer data

# Scale the data

# Set up the DR algorithms

# Fit the dimensionality reduction algorithms

# Visualize the results

