# Final Notebook for DSC478 Project
Names: Matt Soria, Jack Leniart, Francisco Lozano  

## Table of Contents
* [Overview](#overview)  
* [Task 1 - Preprocessing](#task1)  
* [Task 2 - Exploratory Data Analysis](#task2)  
* [Task 3 - Benchmark Model](#task3)  
* [Task 4 - Models](#task4)
    * [Xgboost](#xgboost)
    * [Knn](#knn)
    * [SVM](#svm)
    * [Evaluation](#evaluation)
* [Task 5 - Application](#task5)
* [Conclusion](#conclusion)

## Overview <a class="anchor" id="overview"></a>
This notebook includes both our Executive Summary (covered in the Overview and Conclusion) and our final report. This format was approved by the professor.  

Note: The images will not display if the html file is opened from within Jupyter. Open the html file outside of Jupyter and make sure the images folder is in the same directory as the .html file.  

For our project we decided to create a classifier for a Kaggle dataset that has bio-signals along with information on whether individuals are smokers or non-smokers. Our classifiers will attempt to identify whether or not a patient is a smoker based on the 22 features provided.
(https://www.kaggle.com/datasets/gauravduttakiit/smoker-status-prediction-using-biosignals?resource=download&select=train_dataset.csv).  

We originally chose this dataset because we recognized the potential ability to build a model that would be used in healthcare.  One application of our model could be used by epidemiologists - attempting to understand population health trends/information - without directly having access to whether or not a person was a smoker.   

We created 5 tasks in order to build the classifiers, select the best, and create an application that would allow users to interact with the model.  
* Task 1 - Preprocessing
* Task 2 - Exploratory Analysis
* Task 3 - Benchmark Model
* Task 4 - Models
* Task 5 - Application

We stuck pretty close to our proposal. However, during model evaluation Francisco decided to test whether combining all of our models into an ensemble would give better results. After putting this together we were able to confirm that the results did improve with this approach. So instead of selecting one best model, we built an ensemble of all three of our models.  

Finally, we were able to export the ensemble model and build an application that used the ensemble. This would give users, epidemiologists, institutions, etc. the ability to use our model and predict whether or not a patient is a smoker.  

## Task 1 - Preprocessing <a class="anchor" id="task1"></a>
[Link to notebook](1_preprocess.ipynb)  
[Link to html](1_preprocess.html)

Feature list of dataset
* age : 5-years gap
* height(cm)
* weight(kg)
* waist(cm) : Waist circumference length
* eyesight(left)
* eyesight(right)
* hearing(left)
* hearing(right)
* systolic : Blood pressure
* relaxation : Blood pressure
* fasting blood sugar
* Cholesterol : total
* triglyceride
* HDL : cholesterol type
* LDL : cholesterol type
* hemoglobin
* Urine protein
* serum creatinine
* AST : glutamic oxaloacetic transaminase type
* ALT : glutamic pyruvic transaminase type
* Gtp : γ-GTP
* dental caries

Label:
* smoking: label for this dataset. 0 = non-smoker, and 1 = smoker

The distribution of our labels was slightly uneven - we had more non smokers than smokers in both train/test.

<img src="images/1_preprocess/label_distribution.png" alt="image" style="height: 400px;"/>

We had a total of 38,984 records and 22 features. After reviewing the dataset we did not see any need to convert our categorical features to dummies. So we went right into doing a train/test split of 80/20. We decided to normalize our data with scikit-learn's MinMaxScaler.

To do this, we fit the scaler on our training set and then used it to transform both our training and test sets.  
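
The split and scaling steps can be sketched as follows. This is a minimal illustration on synthetic data, not the code from the preprocessing notebook. The random data, `random_state`, and `stratify` setting are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the bio-signal table (hypothetical data, not the Kaggle file).
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(1000, 22))
y = rng.integers(0, 2, size=1000)

# 80/20 split; random_state and stratify are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then apply it to both splits
# so that no information from the test set leaks into the scaling.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the scaler only sees the training data, values in the scaled test set can fall slightly outside the [0, 1] range.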

Finally we exported the data so it could be accessible for other steps. However, each team member also had the flexibility of doing their own preprocessing if they wanted in order to best work with their model.

## Task 2 - Exploratory Data Analysis <a class="anchor" id="task2"></a>
[Link to notebook](2_0_exploratory_data_analysis.ipynb)  
[Link to html](2_0_exploratory_data_analysis.html)  

Our broad goal for Exploratory Data Analysis was to become more familiar with the values in our dataset. We set out to achieve that by looking at things like descriptive statistics, distributions, and correlations.  

The first thing we checked was the data types of the 23 variables in our dataset. All of the variables were assigned numeric data types (float and int) when they were read into our notebook from the external file.

<img src="images/2_eda/dtypes.png" alt="image"  style="height: 350px;"/>

While all fields had numerical values, we knew that there were some categorical variables in the dataset. So we performed further analysis to identify the categorical variables.  

The next thing we checked was the descriptive statistics for all of the variables. There are two main takeaways from the descriptive statistics: the dataset appears to have five categorical variables (including our target variable), and there are significant differences in the ranges of values of our variables.  

We will need to be mindful of the categorical variables when conducting further analysis and interpreting results. We should also normalize (or scale) our data. This will prevent some of the larger values we see from skewing our results.  

Next, we generated box plots and histograms to review the distribution of values for each variable.

<img src="images/2_eda/histogram1.png" alt="image"  style="height: 500px;"/>
<img src="images/2_eda/histogram2.png" alt="image"  style="height: 500px;"/>
<img src="images/2_eda/histogram3.png" alt="image"  style="height: 350px;"/>

Most of the distributions are approximately normal. Some are slightly skewed, but nothing that is too concerning.  

A few of the categorical variables in our dataset, like hearing (left), hearing (right), and dental caries, are easy to spot when looking at the histograms.  

Our target variable, smoking, is also categorical. There are two possible classes: non-smoker (0) and smoker (1). For this variable, we looked at the value counts to see the size of each class.  

<img src="images/2_eda/counts.png" alt="image"  style="height: 150px;"/>

To better understand how the variables in our dataset are related to each other, we looked at the Pearson correlation coefficients for each pair of variables. We visualized these values in a correlation matrix to make them easier to interpret.  

<img src="images/2_eda/correlation.png" alt="image"  style="height: 400px;"/>
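
As a rough sketch of how a Pearson correlation matrix like this can be computed with pandas: the columns below are synthetic, and the relationship between height and weight is made up for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic columns (hypothetical values): weight is built to depend on height,
# while "noise" is unrelated to both.
rng = np.random.default_rng(0)
height = rng.normal(165, 9, size=500)
weight = 0.9 * height - 85 + rng.normal(0, 8, size=500)
df = pd.DataFrame(
    {"height(cm)": height, "weight(kg)": weight, "noise": rng.normal(size=500)}
)

# Pairwise Pearson correlation coefficients for every pair of columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

The resulting DataFrame is symmetric with ones on the diagonal, which is the form a heatmap plot expects.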

There are some intuitive correlations in our dataset like the positive correlations between: height and weight, weight and waist, systolic and relaxation (both blood pressure measures), cholesterol and LDL, and AST and ALT (both transaminase enzymes).  

We also see a negative correlation between age and height, which makes sense because humans become shorter in old age and our dataset contains observations for people up to 85 years old.  

The variable HDL (sometimes referred to as the “good” cholesterol) has negative correlations with weight, waist, and triglyceride (a type of fat found in the blood).

After conducting a correlation analysis, we proceeded with data grouping analysis. Initially, we employed K-means clustering and aimed to determine the optimal number of clusters (k). To accomplish this, we utilized both the silhouette score and the elbow method. The silhouette score suggested that either 2 or 3 clusters could be optimal, with 3 clusters exhibiting a lower within-cluster sum of squares. However, the elbow method indicated that the optimal k value was 4, as observed from the graph where the elbow occurred at k=4.

Interestingly, one of the cluster counts suggested by the silhouette score (2) coincided with the number of labels. Consequently, we opted for 2 clusters to explore whether data points within clusters shared the same label. However, assessing homogeneity scores revealed that neither 2 nor 3 clusters effectively separated the data into distinct clusters corresponding to different classes or labels, as both scenarios yielded homogeneity scores below 0.01.

<img src="images/2_eda/kmeans_graph.png" alt="image"  style="height: 400px;"/>
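
The cluster analysis described above can be sketched roughly as follows. The data is synthetic, and the range of k values and `n_init` setting are assumptions for the example, not the settings from our notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import homogeneity_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-class data standing in for the scaled bio-signals (hypothetical).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

results = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    results[k] = {
        # Within-cluster sum of squares, the quantity plotted for the elbow method.
        "inertia": km.inertia_,
        # Mean silhouette coefficient: higher means better separated clusters.
        "silhouette": silhouette_score(X_scaled, km.labels_),
        # 1.0 would mean every cluster contains members of a single class only.
        "homogeneity": homogeneity_score(y, km.labels_),
    }

for k, r in results.items():
    print(k, round(r["inertia"], 1), round(r["silhouette"], 3), round(r["homogeneity"], 3))
```

A homogeneity score near zero, like the ones we saw, means the clusters carry almost no information about the class labels.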

Subsequently, we employed Principal Component Analysis (PCA) to visualize the data in a 2D plane. This involved selecting 2 principal components and performing PCA on the scaled dataset. Upon plotting the principal components along with their labels, we observed clustering of data points, although the labels appeared to be mixed within the clusters. Notably, while the points seemed to fall into linearly separable groups, using only the two principal components in a Support Vector Machine (SVM) might not yield accurate results because the labels are blended.

 <img src="images/2_eda/pca_clust.png" alt="image"  style="height: 400px;"/>

## Task 3 - Benchmark Model <a class="anchor" id="task3"></a>
[Link to notebook](3_benchmark_model.ipynb)  
[Link to html](3_benchmark_model.html)  

We decided to create a benchmark model that our main models in Task 4 could be compared against.

We used a decision tree for our model because of its relative simplicity, the ability to visualize the decision tree, and the ability to calculate the feature importance.

We started out by creating a basic model with default parameters. This resulted in a training accuracy of 100% and a test accuracy of about 74%. This was clearly overfitting, so we decided to use grid search with cross validation.

<img src="images/3_benchmark_model/gridsearch.png" alt="image"  style="height: 300px;"/>

With the parameters found in the grid search we got a model with 73% train and test accuracy.
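
A minimal sketch of this workflow is shown below: an unconstrained tree that memorizes the training data, followed by a grid search that limits the tree's complexity. The synthetic data and the parameter grid are assumptions for illustration. The grid we actually searched is the one shown in the image above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (hypothetical) so that an unconstrained tree overfits.
X, y = make_classification(
    n_samples=1000, n_features=22, n_informative=8, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Default tree: grows until every training sample is classified correctly.
base = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("base  train/test:", base.score(X_train, y_train), base.score(X_test, y_test))

# Illustrative grid; 5-fold cross validation picks the combination with the best accuracy.
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 10, 50]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

tuned = search.best_estimator_
print("best params:", search.best_params_)
print("tuned train/test:", tuned.score(X_train, y_train), tuned.score(X_test, y_test))
```

The fitted tree's `feature_importances_` attribute is what a feature importance plot is typically built from.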

<img src="images/3_benchmark_model/feature_importance.png" alt="image" style="height: 600px;"/>

Looking at precision/recall we noticed the recall was much lower for smokers.

Finally, the top 3 features of the model were:
* height
* Gtp
* hemoglobin

## Task 4 - Models <a class="anchor" id="task4"></a>

### Xgboost <a class="anchor" id="xgboost"></a>
[Link to notebook](4_modeling_xgboost.ipynb)  
[Link to html](4_modeling_xgboost.html)  
Created by Matt.  

I decided to use xgboost for my classifier based on its speed and performance on a wide range of machine learning tasks. Xgboost is not part of scikit-learn, so we needed to install it as a separate package.

I used the normalized data from the preprocessing step to start out with my model.  

I started out by creating an Xgboost model with default parameters to see what the training/test accuracy would be. This resulted in a training/test accuracy of 87% and 77% respectively. A pretty good start compared to our benchmark model. However, when looking at the feature importance I noticed that height was the most important feature.

<img src="images/4_modeling_xgboost/base_feature_importance.png" alt="image" style="height: 400px;"/>

I didn't think this made sense in the context of the problem we are trying to solve - height shouldn't have any importance when trying to determine if someone is a smoker or not - so I decided to remove it from the feature list and test the model again.

Our test score dropped about 1% but that is a good trade off for a more generalized model.

Next, I used gridsearch, with cross validation, to find the best parameters.  

<img src="images/4_modeling_xgboost/gridsearch.png" alt="image" style="height: 300px;"/>

We got a tighter split between train/test and our test accuracy stayed consistent around 76.5%.

Best Parameters:
* colsample_bytree = 0.8  
* gamma = 0  
* learning_rate = 0.1  
* max_depth = 5  
* min_child_weight = 1  
* n_estimators = 1000  
* reg_alpha = 0  
* reg_lambda = 0  
* subsample = 0.9

Finally, I attempted to use feature selection, with the parameters from gridsearch, to see if this would have any impact on the model. I varied the percentage of features kept in 5% increments.

The best model removed the following features: hearing(left), Urine protein, hearing(right). This keeps 86% of the features (18 of the 21 that remained after height had already been removed).

<img src="images/4_modeling_xgboost/feature_selection_results.png" alt="image" style="height: 600px;"/>

With feature selection we were able to increase the test score to 77% while keeping the difference between train and test accuracy below 10%. Because this produced the highest test score with a good bias/variance trade-off, I decided to use this as the best model for xgboost.

Best Xgboost Model:
- Used gridsearch to find best parameters - see parameters above
- Used feature selection to keep 86% of features -> dropping height, hearing(left), Urine protein, hearing(right)
- Training accuracy = 85%
- Test accuracy = 77%

### Knn <a class="anchor" id="knn"></a>
[Link to notebook](4_modeling_knn.ipynb)  
[Link to html](4_modeling_knn.html)  
Created by Jack.  

Since the goal of our project is to create models that can accurately predict a categorical variable, I chose to create a K-Nearest Neighbors classifier.
To start this task, I defined a KNN model using the default parameters of the KNeighborsClassifier() constructor in scikit-learn. This will serve as the baseline KNN model.


<img src="images/4_modeling_knn/base_test_report.png" alt="image" style="height: 250px;"/>

The baseline model accurately predicted the target class for 72% of the observations from the test data we set aside during preprocessing.
Next, I used GridSearch with 5-fold cross-validation to try and identify the best values to use for the parameters n_neighbors (k) and weights.  

<img src="images/4_modeling_knn/gridsearch.png" alt="image" style="height: 300px;"/>

GridSearch determined that k = 82 and distance weights were the best parameters.
To go one step further, I also ran cross-validation using the same combinations of possible parameters to see if I would reach the same conclusion as GridSearch.  

It should come as no surprise that my cross-validation yielded the same results as GridSearch. The highest accuracy we recorded was 77.5%, which corresponded with the model that used k= 82 and distance weights.

<img src="images/4_modeling_knn/cv_acc.png" alt="image" style="height: 350px;"/>

One insight that was gained from my cross-validation was that the accuracy noticeably levels off once we reach k = 30. The additional increases in accuracy are minimal. If we wanted to use a smaller value of k for any reason, we would not be sacrificing much accuracy.  
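
The manual cross-validation over k and the weighting scheme can be sketched like this. The data is synthetic and the candidate values are a small illustrative subset, not the full range of values I tested.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the normalized training data (hypothetical).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# Mean 5-fold accuracy for every (k, weights) combination.
cv_acc = {}
for weights in ("uniform", "distance"):
    for k in (1, 5, 15, 30):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        cv_acc[(k, weights)] = cross_val_score(
            model, X, y, cv=5, scoring="accuracy"
        ).mean()

# Combination with the highest mean accuracy.
best_k, best_weights = max(cv_acc, key=cv_acc.get)
print(best_k, best_weights, round(cv_acc[(best_k, best_weights)], 3))
```

Plotting the values in `cv_acc` against k gives an accuracy curve like the one above, which makes it easy to see where the gains level off.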

Next, I wanted to see how the model would perform after applying a feature reduction technique. For this, I used Principal Component Analysis.
After calculating the principal components, I reviewed the explained variance ratios.

<img src="images/4_modeling_knn/pca_variance.png" alt="image" style="height: 100px;"/>

There does appear to be a knee after the fourth component. However, I wanted to select a number of components that would explain a greater ratio of the variance in the dataset. So I decided to choose the first eight components, which in total explain about 91.6% of the variance.  

I transformed the normalized training dataset using the eight principal components and then fit a new KNN model (using the best parameters identified by GridSearch) to the transformed data.  

I applied the same transformation to the set aside test data and then ran the model against the transformed test data.
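
A rough sketch of the PCA steps is below. It uses synthetic data, and it picks the number of components with a cumulative variance threshold (0.90, an assumption for the example), whereas I chose eight components by inspecting the ratios. The KNN parameters are also placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (hypothetical), split and scaled the same way as in preprocessing.
X, y = make_classification(n_samples=500, n_features=22, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Cumulative explained variance across all components.
full_pca = PCA().fit(X_train_scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90
n_components = int(np.searchsorted(cumulative, threshold)) + 1

# Fit PCA on the training data only, then apply the same projection to the test data.
pca = PCA(n_components=n_components).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn = KNeighborsClassifier(n_neighbors=15, weights="distance").fit(X_train_pca, y_train)
pca_test_accuracy = knn.score(X_test_pca, y_test)
print(n_components, round(pca_test_accuracy, 3))
```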

<img src="images/4_modeling_knn/pca_test_preds.png" alt="image" style="height: 250px;"/>

The model accurately predicted the target class for 70% of the observations in the test data. This was just slightly worse than our baseline model (72% accuracy) and not far away from our best accuracy we saw during cross-validation (77.5%). This is promising performance considering this model was using a dataset with fewer dimensions.  

Lastly, I defined our best KNN model based on all the analysis I completed.  

After defining the model and fitting it to the normalized training data, I then ran it against the set aside test data.  

<img src="images/4_modeling_knn/test_class.png" alt="image" style="height: 250px;"/>

The best model accurately predicted the target class for 79% of the observations in the test data. This was even better than the best accuracy we recorded during cross-validation.  

<img src="images/4_modeling_knn/matrix.png" alt="image" style="height: 350px;"/>

### SVM <a class="anchor" id="svm"></a>
[Link to notebook](4_modeling_svm.ipynb)  
[Link to html](4_modeling_svm.html)  
Created by Francisco.

In my analysis, I explored Support Vector Machine (SVM) models, focusing on both Linear and Radial Basis Function (RBF) kernels, after observing the 2D plot of our dataset in Exploratory Data Analysis (EDA). The plot suggested that the data could be split, just not with 2 principal components. It is possible that the variance lost in the projection caused the labels to be mixed together. Because of this, I wanted to explore whether an SVM model could find this split using the original features instead of the principal components.

#### Linear SVM <a class="anchor" id="lin_svm"></a>

Starting with a Linear SVM, I established a baseline model with default settings using sklearn, achieving a validation accuracy of 73%. This performance was comparable to our benchmark model. Next, I fine-tuned the Linear SVM using grid search with cross-validation, focusing on the regularization parameter (C). This adjusts the penalty for misclassification, with smaller values promoting a smoother decision boundary and potentially better generalization, while larger values aim for stricter classification of training data, which may lead to overfitting.
Through this process, I discovered that a C value of 10 yielded optimal results, balancing between bias and variance.

 <img src="images/4_modeling_svm/linear_svm_gridsearch.png" alt="image"  style="height: 400px;"/>

Using this parameter, the fine-tuned model did ~0.01 better in accuracy on the validation split. The validation and training splits also received similar accuracy scores, so there was no sign of overfitting. Moreover, I analyzed feature importance in the Linear SVM, identifying top influential features such as ALT, Gtp, and hemoglobin. Utilizing this insight, we retrained the model with feature reduction, dropping less impactful features. However, this led to a slight decrease in accuracy to 72%, which is worse than our benchmark model. Given this, I decided to explore alternative SVM kernels, considering the potential for non-linear decision boundaries.

 <img src="images/4_modeling_svm/linear_svm_ft_imp.png" alt="image"  style="height: 400px;"/>
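
One common way to rank features with a linear SVM is by the absolute value of its learned coefficients, which is meaningful when the features share a scale. The sketch below shows that approach on synthetic data. It is an assumption that this matches the exact method in the SVM notebook, and the feature names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data with placeholder feature names (hypothetical).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_scaled = MinMaxScaler().fit_transform(X)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# A linear-kernel SVC exposes one weight per feature in coef_ (shape (1, n_features)
# for a binary problem).
model = SVC(kernel="linear", C=10).fit(X_scaled, y)
importance = np.abs(model.coef_).ravel()

# Rank features from most to least influential.
ranking = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Features at the bottom of the ranking are the natural candidates to drop when testing feature reduction.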


#### RBF SVM <a class="anchor" id="rbf_svm"></a>


Moving on to RBF SVM, I observed promising initial results with a baseline accuracy of 74% on the validation split. This kernel was chosen due to its similarity to KNN and its robustness to outliers. Subsequently, we fine-tuned the RBF SVM using grid search with cross validation, focusing on both the regularization parameter (C) and the kernel coefficient (gamma). The best parameters were C=10 and gamma='scale'. With 'scale', gamma is computed from the number of features and the variance of the data, which keeps gamma appropriately scaled relative to the data. Using these two parameters, the fine-tuned model did ~1% better in accuracy on the validation split.

 <img src="images/4_modeling_svm/rbf_svm_gridsearch.png" alt="image"  style="height: 400px;"/>
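
To make the gamma='scale' setting concrete: scikit-learn documents it as 1 / (n_features * X.var()). The sketch below, on synthetic data, computes that value by hand and checks that a model using it behaves the same as one using gamma='scale'.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data (hypothetical) standing in for the training features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# gamma='scale' as documented by scikit-learn: 1 / (n_features * X.var()).
gamma_manual = 1.0 / (X.shape[1] * X.var())

svc_scale = SVC(kernel="rbf", C=10, gamma="scale").fit(X, y)
svc_manual = SVC(kernel="rbf", C=10, gamma=gamma_manual).fit(X, y)

# Both models should produce the same decision values on the same data.
same = np.allclose(svc_scale.decision_function(X), svc_manual.decision_function(X))
print(gamma_manual, same)
```

Since the value depends on the variance of the input, it changes if the features are rescaled, which is why it adapts to the data.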

Upon validating the model on the test split, I obtained a consistent accuracy score of 74%, indicating robust performance without signs of overfitting. Additionally, the model demonstrated aptitude in capturing underlying patterns within the data, as evidenced by its performance on both training and test splits.

 <img src="images/4_modeling_svm/rbf_svm_cm.png" alt="image"  style="height: 400px;"/>

From the confusion matrix, we can see that my best model (RBF SVM) has better overall performance in predicting non-smokers (0).


### Evaluation <a class="anchor" id="evaluation"></a>
[Link to notebook](4_modeling_combine_all_models.ipynb)  
[Link to html](4_modeling_combine_all_models.html)  

After completing our model building process, we conducted a comparative analysis of the performance of each model. All three models we developed exhibited superior accuracy scores compared to our benchmark model. Notably, the K-Nearest Neighbors (KNN) model outperformed the others, while the XGBoost model displayed slight indications of overfitting.


| Model                         | Train Score | Test Score |
|-------------------------------|-------------|------------|
| Decision Tree (Benchmark)     | 0.729       | 0.73       |
| KNN                           | 0.775       | 0.79       |
| RBF SVM                       | 0.758       | 0.74       |
| XGBoost                       | 0.857       | 0.772      |

>Note: Accuracy scores are presented

Additionally, we observed that the SVM and XGBoost models identified different patterns as the most important features. Recognizing this divergence, we explored the potential of combining our models to leverage their diverse pattern capturing abilities. The aim was to create a comprehensive ensemble model that merges the strengths of each individual model, potentially leading to enhanced overall performance.

#### Ensemble Classifier <a class="anchor" id="en_class"></a>

Since our task involved classification, we considered two options: a Voting Classifier or a Stacking Classifier. The two differ as follows:

- **Voting Classifier**:
  - Offers both hard and soft voting variants.
  - Hard voting selects the majority vote as the final prediction, while soft voting averages the probability scores predicted by individual models.
- **Stacking Classifier**:
  - Utilizes the outputs of individual models as inputs for another model, typically referred to as the "stacked model".
  - We employed logistic regression with an L2 penalty as our stacked model.
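
The three ensemble variants can be sketched with scikit-learn as follows. The data is synthetic, the base model parameters are placeholders, and a scikit-learn GradientBoostingClassifier is swapped in for XGBoost so the example has no extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data (hypothetical) with a train/validation split.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)


def base_models():
    """Fresh, unfitted base estimators (placeholder parameters)."""
    return [
        ("knn", KNeighborsClassifier(n_neighbors=15, weights="distance")),
        # probability=True is needed so the SVM can take part in soft voting.
        ("svm", SVC(kernel="rbf", C=10, gamma="scale", probability=True, random_state=0)),
        # Stand-in for XGBoost in this sketch.
        ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ]


ensembles = {
    # Majority vote over predicted labels.
    "hard_voting": VotingClassifier(base_models(), voting="hard"),
    # Average of predicted class probabilities.
    "soft_voting": VotingClassifier(base_models(), voting="soft"),
    # Base model outputs feed a logistic regression (L2 penalty is its default).
    "stacking": StackingClassifier(base_models(), final_estimator=LogisticRegression(), cv=5),
}

scores = {
    name: model.fit(X_train, y_train).score(X_val, y_val)
    for name, model in ensembles.items()
}
print(scores)
```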

We tested these approaches to determine the most effective ensemble method:

| Model                         | Train Score | Validation Score |
|-------------------------------|-------------|------------------|
| Hard Voting Classifier        | 0.776       | 0.788            |
| Soft Voting Classifier        | 0.783       | 0.796            |
| Stacking Classifier           | 0.779       | 0.794            |

>Note: Accuracy scores are presented

Our analysis revealed that the Soft Voting Classifier emerged as the most effective ensemble model. It exhibited minimal signs of overfitting and achieved the highest accuracy score. Subsequently, we evaluated this model on the test split and obtained an accuracy score of 0.788, further affirming its robustness and generalization capability.

#### Final Evaluation <a class="anchor" id="fin_evaluation"></a>

After creating the ensemble classifier, we revisited all of our models.

| Model                         | Train Score | Test Score |
|-------------------------------|-------------|------------|
| Decision Tree (Benchmark)     | 0.729       | 0.73       |
| KNN                           | 0.775       | 0.79       |
| RBF SVM                       | 0.758       | 0.74       |
| XGBoost                       | 0.857       | 0.772      |
| Soft Voting Classifier        | 0.783       | 0.796      |

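>Note: Accuracy scores are presented. The Soft Voting Classifier row repeats its validation score (0.796); its accuracy on the test split was 0.788.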

Upon evaluation, it became evident that the Soft Voting Classifier outperformed all individual models, showcasing superior accuracy without any discernible signs of overfitting. This ensemble model effectively merged the strengths of each individual model, resulting in enhanced predictive performance.

 <img src="images/4_modeling_combine_all_models/best_model_cm.png" alt="image"  style="height: 400px;"/>


```
              precision    recall  f1-score   support

  non-smoker       0.83      0.84      0.83      4933
      smoker       0.72      0.69      0.71      2864

    accuracy                           0.79      7797
   macro avg       0.77      0.77      0.77      7797
weighted avg       0.79      0.79      0.79      7797
```

## Task 5 - Application <a class="anchor" id="task5"></a>
[Link to app](https://huggingface.co/spaces/FranciscoLozDataScience/smoker_model)

Once we determined our best model, we embarked on creating our application. Our objective was to design a user-friendly interface where users could interact with the model's inputs and observe the predicted labels it generated. To achieve this, we turned to Gradio for its seamless configuration for creating AI apps, and HuggingFace for its integration with Gradio and its capability for free hosting.

In creating the app, we employed the Python library joblib to save our best model and our min-max scaler. This approach eliminated the need for retraining the model each time the app was accessed.
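
Saving and reloading the model and scaler with joblib can be sketched as below. The file names, the small stand-in model, and the `predict_smoker` helper are hypothetical and only illustrate the pattern. They are not the code from our app.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Small stand-in model and scaler trained on synthetic data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
scaler = MinMaxScaler().fit(X)
model = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X), y)

# Persist both objects once, after training (file names are placeholders).
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.joblib")
scaler_path = os.path.join(out_dir, "scaler.joblib")
joblib.dump(model, model_path)
joblib.dump(scaler, scaler_path)

# In the app: load the saved objects instead of retraining.
loaded_model = joblib.load(model_path)
loaded_scaler = joblib.load(scaler_path)


def predict_smoker(raw_features):
    """Scale one row of raw inputs with the saved scaler and return 0 or 1."""
    row = np.asarray(raw_features, dtype=float).reshape(1, -1)
    return int(loaded_model.predict(loaded_scaler.transform(row))[0])


print(predict_smoker(X[0]))
```

Saving the scaler alongside the model matters because new inputs must be scaled exactly the way the training data was.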

This application serves as a valuable tool for epidemiologists, enabling them to comprehend population health trends and information even when direct access to an individual's smoking status is unavailable. Additionally, it can be utilized to analyze older historical data where smoking status is unknown, and contacting individuals included in the dataset is not feasible.

 <img src="images/other/app_ui.png" alt="image"  style="height: 400px;"/>

## Conclusion <a class="anchor" id="conclusion"></a>

Wrapping up our project.  

The problem we attempted to solve was classifying smokers based on 22 bio-signals. The dataset was available through Kaggle.

We wanted to be able to create a model that would be used in healthcare.  

All group members learned how powerful ensemble models can be, especially when the different models capture different patterns, which we saw with feature importance.

One application for this type of model would be for epidemiologists to use this model when creating studies or looking at population health trends. If they didn't have access to whether or not a person was a smoker they'd be able to use our model for that classification.  

Another application would be for health insurance companies. They could use our model to help with premiums and risk. The premium and risk of a smoker vs non-smoker should be different.