### **Description**

Fantastic work on the Data Analysis & EDA! You've cleaned the data and know it well. Now, it's time to use that data to **build, train, and test our predictive models**.

This is where we go from being a data analyst to a data scientist. Your goal is to experiment with different techniques to find the best-performing machine learning model that can accurately predict whether a patient has heart disease.

---

### **Checklist & Key Steps**

**1. Final Data Preprocessing (The Experiment)**
This is a critical step where your choices will directly impact the model.
* **Handle Categorical Data (The Challenge):**
    * Your models need numbers, not text. You must convert categorical columns (like 'sex', 'cp', 'fbs', etc.) into numeric ones.
    * **I do not want you to use just one method.** I want you to experiment and see the results. You must try at least **two** different encoding strategies and compare their impact. For example:
        * **Strategy A: One-Hot Encoding** (e.g., using `pd.get_dummies`). This creates new "dummy" columns for each category.
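        * **Strategy B: Label Encoding** (e.g., using `LabelEncoder`). This converts each category into a single integer (e.g., 'Female'=0, 'Male'=1, since `LabelEncoder` assigns codes in sorted order). Keep in mind that these integers imply an ordering that may not exist in the data.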
* **Apply Your Cleaning Strategy:** Implement the plan you made for handling outliers or transformed features from Task 1.
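* **Feature Scaling:** This is critical! Models such as Logistic Regression and SVM are sensitive to features being on different scales. Use `StandardScaler` from scikit-learn to scale your numeric features. (Note: Apply scaling *after* your train/test split to avoid data leakage: fit the scaler on the training set only, then use it to transform both the training and test sets.)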
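
As a rough illustration of the two encoding strategies, here is a minimal sketch. The tiny DataFrame, its column names, and its category values are made up for the example and are not the real dataset; swap in your cleaned data from Task 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the heart-disease data (values and labels are assumptions).
df = pd.DataFrame({
    "age": [63, 45, 58, 51, 39, 67],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "cp": ["typical", "atypical", "non-anginal", "asymptomatic", "atypical", "typical"],
    "target": [1, 0, 1, 0, 0, 1],
})
categorical_cols = ["sex", "cp"]

# Strategy A: one-hot encoding, one 0/1 column per category.
df_onehot = pd.get_dummies(df, columns=categorical_cols, dtype=int)

# Strategy B: label encoding, one integer code per category (sorted order).
df_label = df.copy()
for col in categorical_cols:
    df_label[col] = LabelEncoder().fit_transform(df_label[col])

print(df_onehot.columns.tolist())
print(df_label[categorical_cols])
```

Keeping both DataFrames around makes it easy to run the same models on each and compare the results later.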

**2. Prepare for Training**
* **Define X and y:** Separate your data into `X` (all the features/columns for predicting) and `y` (the 'target' column you are trying to predict).
* **Train/Test Split:** Split your `X` and `y` data into a training set and a testing set (e.g., 80% for training, 20% for testing) using `train_test_split`. **Do this *before* scaling.**
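
One possible way to do the split and then scale without leakage is sketched below. The data is randomly generated and the column names (`age`, `chol`, `sex`) are placeholders; `stratify=y` and `random_state=42` are optional choices, not requirements of the task.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your encoded DataFrame.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(54, 9, size=100),
    "chol": rng.normal(245, 50, size=100),
    "sex": rng.integers(0, 2, size=100),
    "target": rng.integers(0, 2, size=100),
})
numeric_cols = ["age", "chol"]  # assumed list of numeric features

X = df.drop(columns="target")
y = df["target"]

# Split first (80% train, 20% test), keeping the class balance similar in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train.shape, X_test.shape)
```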

**3. Model Training (Baseline)**
Let's train a few different types of models to see what works best. Train these models using the default settings for *each* of your encoding strategies (e.g., train a Logistic Regression on your One-Hot-Encoded data, and *another* Logistic Regression on your Label-Encoded data).
* Train a **Logistic Regression** model.
* Train a **Decision Tree** model.
* Train a **Support Vector Machine (SVM)** model.
* Train a **Random Forest Classifier** model.
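
The loop below is a hedged sketch of how the baseline runs could be organized: every model is trained once per encoding. The data is synthetic, the `cp` categories are invented, and wrapping each model in a pipeline with `StandardScaler` (which here scales every column, including the encoded ones) is a simplification chosen for brevity, not a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with one numeric and one categorical feature.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(54, 9, n),
    "cp": rng.choice(["typical", "atypical", "non-anginal", "asymptomatic"], n),
})
y = ((df["age"] > 54) | (df["cp"] == "asymptomatic")).astype(int)

# The same features under both encoding strategies.
encoded = {
    "one-hot": pd.get_dummies(df, columns=["cp"], dtype=int),
    "label": df.assign(cp=LabelEncoder().fit_transform(df["cp"])),
}

# Default settings, apart from a fixed seed for reproducibility.
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

fitted = {}
for enc_name, X in encoded.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    for model_name, model in models.items():
        # The scaler inside the pipeline is fitted on the training data only.
        pipe = make_pipeline(StandardScaler(), clone(model))
        fitted[(enc_name, model_name)] = pipe.fit(X_train, y_train)

print(sorted(fitted))
```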

**4. Model Evaluation (Baseline)**
For each model you trained, use the *test set* to see how well it performs. We need to know more than just "accuracy."
* **Accuracy:** How many predictions were correct overall?
* **Precision, Recall, and F1-Score:** Get these from a `classification_report`.
    * **Recall** is very important here: How many *actual* heart disease cases did we catch? (We don't want to miss any!)
* **Confusion Matrix:** Create one for each model. This shows you exactly *what types* of mistakes the model is making.
* **ROC-AUC Score:** This gives you a single number to judge how well a model can separate the two classes (disease vs. no disease).
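
The metrics above could be computed along these lines. This is only a sketch on synthetic data from `make_classification`, with a single Logistic Regression standing in for whichever model you are evaluating; the class names passed to `target_names` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data as a placeholder.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard class predictions
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(f"Accuracy: {acc:.3f}")
print(classification_report(y_test, y_pred, target_names=["no disease", "disease"]))
print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)
print(f"ROC-AUC: {auc:.3f}")
```

Note that `SVC` with default settings has no `predict_proba`; for ROC-AUC you can pass the output of `model.decision_function(X_test)` to `roc_auc_score` instead.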

**5. Hyperparameter Tuning**
* **Select Your Best Combination:** Based on your evaluation (especially Recall and ROC-AUC scores), pick your **best-performing combination** of (Encoding Strategy + Model).
* **Tune:** Use `GridSearchCV` or `RandomizedSearchCV` to automatically find the *best settings* (hyperparameters) for your chosen model to maximize its performance.
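
A minimal tuning sketch with `GridSearchCV` follows. It assumes Random Forest was the chosen model and uses recall as the selection metric; the parameter grid is deliberately small and illustrative, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic placeholder data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid; the values are assumptions, not recommended settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 4],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # optimize for catching positive cases
    cv=5,              # 5-fold cross-validation on the training set only
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

The test set is left untouched during the search, so it can still give an honest final score for `best_model`.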

**6. Final Evaluation & Feature Importance**
* **Final Scores:** Evaluate your *tuned* model on the test set. How much did its scores improve from the baseline?
* **Feature Importance:** Now, let's look *inside* the model.
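    * If you used Logistic Regression or an SVM with a linear kernel, look at the `model.coef_` attribute. (The default `SVC` uses an RBF kernel, which does not expose `coef_`.)
    * If you used Decision Tree or Random Forest, look at the `model.feature_importances_` attribute.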
* Create a bar chart to visualize the top 5-10 most important features. What does the model "think" is the best predictor of heart disease?
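
One way to build the bar chart is sketched below, assuming a Random Forest as the final model. The data is synthetic and the feature names are hypothetical placeholders; with a real DataFrame you would use `X_train.columns` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data with made-up feature names.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Pair each importance with its feature name and keep the top 5.
importances = pd.Series(model.feature_importances_, index=feature_names)
top = importances.sort_values(ascending=False).head(5)

# Horizontal bar chart, largest importance at the top.
fig, ax = plt.subplots(figsize=(6, 3))
top.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Importance")
ax.set_title("Top 5 features (Random Forest)")
fig.tight_layout()
```

For a linear model, a comparable ranking can be built from the absolute values of `model.coef_[0]`, provided the features were scaled first.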

---

### **Deliverables**

* A GitHub link to your Jupyter notebook containing all your code for preprocessing, training, and evaluation.
* **Important:** Please include text/markdown cells with your observations:
    * **Crucially: How did your choice of *categorical encoding* (Strategy A vs. B) affect your models' performance? Which was better and why do you think that is?**
    * Which model performed best at the baseline?
    * How much did hyperparameter tuning help?
    * Which model is your *final choice* and **why**?
    * What were the most important features your model found? Did they match your hypotheses from Task 1?

---

### **Resources**

* **Tools:** You will be using **scikit-learn** for almost everything here.
* **Key functions to look up:**
    * `pd.get_dummies` (for One-Hot)
    * `LabelEncoder`
    * `train_test_split`
    * `StandardScaler`
    * `LogisticRegression`, `DecisionTreeClassifier`, `SVC`, `RandomForestClassifier`
    * `classification_report`, `confusion_matrix`, `roc_auc_score`
    * `GridSearchCV`, `RandomizedSearchCV`

* **Start Date:** Friday, Nov 7
* **First follow-up:** Sunday morning, Nov 9
* **Second follow-up:** Tuesday morning, Nov 11
* **Deadline:** Wednesday, Nov 12 at 11:59


