# Machine Learning Project
# Haberman

## importing relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import LabelEncoder

Attribute Information:
   1. Age of patient at time of operation (numerical)
   2. Patient's year of operation (year - 1900, numerical)
   3. Number of positive axillary nodes detected (numerical)
   4. Survival status (class attribute)
         * 1 = the patient survived 5 years or longer
         * 2 = the patient died within 5 years
         
        
When cancer isn't detected in the lymph nodes, the test results are negative, indicating that none of the lymph nodes show signs of cancer. Conversely, if cancer cells are present in the lymph nodes, the results are considered positive. 

## Loading dataset

In [2]:
import pandas as pd

dataset = "haberman.csv"
nam = ['age_patient_operate_time', 'year_operate_time', 'axillary_node_num', 'survive_after5years']
myData = pd.read_csv(dataset, names=nam)
myData

Here's the typical order of operations in a machine learning project:

1. Data Collection: Gather our dataset.

2. Data Preprocessing:
    * Handle missing values (if any).
    * Encoding categorical variables (if applicable).
    * Feature engineering (if needed).
    * Scaling numerical features.<br><br>

3. Data Splitting: Split the data into training and testing sets.

4. Model Building: Choose a machine learning model and train it on the training data.

5. Model Evaluation: Evaluate the model's performance on the testing data.

6. Model Tuning: Adjust hyperparameters and make other model improvements based on evaluation results.

7. Final Model: Train the final model on the entire dataset (including training and testing data).

8. Inference: Use the final model for predictions on new, unseen data.

Source link: https://www.linkedin.com/pulse/unlock-power-machine-learning-data-science-ai-inbuiltdata-1f/

## Explore the Dataset

In [3]:
myData.info()

In [4]:
myData.describe()

## Data Preprocessing:
All the features in our dataset are numerical, so there is no need to encode categorical features. The check below confirms that there are no missing values either, so nothing has to be imputed. Feature scaling is an optional preprocessing step and is discussed later in this notebook.

## Missing values

In [5]:
import numpy as np

myData.isna().sum()

## Visualization 

### Pie chart to show the percentage of survival 

In [6]:
import matplotlib.pyplot as plt

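# sort_index() orders the classes as 1, 2 so the counts line up with the hard-coded labels below
sur = myData["survive_after5years"].value_counts().sort_index()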

plt.pie(x=sur, labels=["Survived after 5 years", "Died within 5 years"], colors=["#98FB98", "#FFC0CB"], autopct="%1.0f%%")
plt.title("Survival information")

plt.show()

In [7]:
myData.survive_after5years.value_counts()

In this case, we can see that there is a significant class imbalance in the dataset. The class "1" has many more data points (patients who survived) compared to class "2" (patients who did not survive). Class imbalance is a common issue in machine learning, and handling it effectively is important for building accurate models. Imbalanced datasets can lead to biased model performance.
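
One common way to account for class imbalance is to weight each class inversely to its frequency. The sketch below is a minimal illustration, assuming a small made-up label list (not the Haberman data); the formula `n_samples / (n_classes * class_count)` is the usual "balanced" weighting heuristic, and the variable names are hypothetical.

```python
from collections import Counter

# Illustrative labels only (not the Haberman data): 1 = survived, 2 = died
labels = [1, 1, 1, 1, 1, 1, 2, 2]

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Share of each class in the data
proportions = {cls: n / n_samples for cls, n in counts.items()}

# "Balanced" weights: the rarer a class is, the larger its weight
class_weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

print("Proportions:", proportions)
print("Class weights:", class_weights)
```

With these toy labels the minority class receives a weight three times larger than the majority class, so its errors would count more during training.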

### Scatter plot to visualize the age distribution


In [8]:
age = myData["age_patient_operate_time"].value_counts().sort_index()

# Extract ages (x-axis) and counts (y-axis)
ages = age.index
counts = age.values

# Create a scatter plot
plt.scatter(ages, counts, color="red")  # we can adjust the color as needed
plt.title("Age of Patients in the Time of Operation")
plt.xlabel("Age")
plt.ylabel("Count")


print(f"Youngest patient was {myData['age_patient_operate_time'].min()} years old.")
print(f"Oldest patient was {myData['age_patient_operate_time'].max()} years old.")
print(f"Mean of the ages in patients was {myData['age_patient_operate_time'].mean():.1f} years.")

plt.show()

### Histogram chart to visualize the count of positive nodes.

In [9]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.hist(myData['axillary_node_num'], bins=20, color='darkred', edgecolor='black')
plt.xlabel('Number of Positive Axillary Nodes')
plt.ylabel('Count')
plt.title('Histogram of Number of Positive Nodes')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


## Univariate analysis

Univariate analysis is like looking at one thing at a time. In this project we have the "haberman" dataset with several pieces of information: the age, the year of operation, and the number of positive lymph nodes detected for each patient. When we perform univariate analysis, we're interested in just one of these pieces of information at a time.

For example let's say we want to do univariate analysis on the "age" variable. We'd focus only on the ages of the patients and see what we can learn from that one piece of information. We might do the following:

PDF (Probability Density Function): We create a chart that shows how the patients are distributed across ages, that is, which ages are more common and which are rarer. For example, we could find that there are many patients in their 40s and 50s in the dataset, and fewer in their 30s or 70s. This gives us an idea of the distribution of ages in the data.

CDF (Cumulative Distribution Function): This chart shows us how many patients are younger or older than a specific age. For instance, we might see that about 70% of patients are younger than 60 years old. This helps us understand the cumulative distribution of ages.

Boxplot: A boxplot gives us a quick summary of the age data. It shows the median age (the middle value), the spread of ages (interquartile range), and any outliers. We can see if the ages are concentrated around a certain range or if there are outliers that represent unusual cases.

Violin Plot: This plot is like a combination of a boxplot and a PDF. It shows the distribution of ages, giving us an idea of where the ages are most concentrated and if there are multiple age "peaks" in the data.

In each of these cases, we're examining just one variable (age) to understand the patterns and characteristics of that specific variable. Univariate analysis helps us get a basic understanding of the dataset without considering the relationships between variables or the causes of certain outcomes. It's a fundamental step in data analysis to see what we have before delving into more complex analyses.
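
To make the CDF idea concrete, here is a minimal sketch of an empirical CDF: for a given age it returns the fraction of patients at or below that age. The ages are illustrative values, not taken from `haberman.csv`, and the `ecdf` helper is a hypothetical name.

```python
from bisect import bisect_right

# Illustrative ages only (not taken from haberman.csv)
ages = [30, 34, 38, 42, 46, 50, 54, 58, 62, 70]
sorted_ages = sorted(ages)


def ecdf(value):
    """Fraction of observations less than or equal to `value`."""
    return bisect_right(sorted_ages, value) / len(sorted_ages)


print(f"Share of patients aged 50 or younger: {ecdf(50):.0%}")
```

Reading the CDF at a single age answers questions such as "what share of patients is at most 50 years old" directly, which is harder to read off a histogram.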

Source link: https://medium.com/aiguys/beginner-friendly-exploratory-data-analysis-on-haberman-breast-cancer-survival-dataset-4da95f314ad


### PDF(Probability Density Function)

In [10]:
# Visualize how deaths and survivals are distributed across ages
# stat="density" makes the y-axis match the "Density" label (histplot plots counts by default)

sns.FacetGrid(myData, hue="survive_after5years", height=5).map(sns.histplot, "age_patient_operate_time", kde=True, stat="density").add_legend()
plt.title("Histogram of age")
plt.ylabel("Density")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

In [11]:
# Visualize how deaths and survivals are distributed across years of operation
# stat="density" makes the y-axis match the "Density" label (histplot plots counts by default)

sns.FacetGrid(myData, hue="survive_after5years", height=5).map(sns.histplot, "year_operate_time", kde=True, stat="density").add_legend()
plt.title("Histogram of years")
plt.ylabel("Density")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

In [12]:
# Visualize how deaths and survivals are distributed across the number of positive nodes
# stat="density" makes the y-axis match the "Density" label (histplot plots counts by default)

sns.FacetGrid(myData, hue="survive_after5years", height=5).map(sns.histplot, "axillary_node_num", kde=True, stat="density").add_legend()
plt.title("Histogram of number of nodes")
plt.ylabel("Density")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

### Boxplots
The boxplot provides a statistical summary of the data as follows:

The rectangular box represents the interquartile range (IQR), with the bottom and top edges of the box corresponding to the 1st (25th percentile) and 3rd (75th percentile) quartiles, respectively.

The horizontal line inside the box represents the median (50th percentile) of the data.

These characteristics of the boxplot help visualize the central tendency, spread, and skewness of the data distribution.
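
The quantities a boxplot draws can also be computed by hand. The sketch below uses illustrative ages (not the Haberman data) and assumes the common 1.5 × IQR rule for flagging outliers; the variable names are hypothetical.

```python
import statistics

# Illustrative ages only (not taken from haberman.csv)
ages = [30, 34, 38, 42, 46, 50, 54, 58, 83]

# Quartiles: 25th percentile, median, 75th percentile
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Points beyond 1.5 * IQR from the box are usually drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers)
```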

In [None]:
sns.boxplot(x = "survive_after5years", y = "age_patient_operate_time", hue = "survive_after5years", data = myData, palette= {1: "palegreen", 2: "pink"}).set_title("Box plot for survival and age")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

sns.boxplot(x = "survive_after5years", y = "year_operate_time", hue = "survive_after5years", data = myData, palette= {1: "palegreen", 2: "pink"}).set_title("Box plot for survival and year of operation")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

sns.boxplot(x = "survive_after5years", y = "axillary_node_num", hue = "survive_after5years", data = myData, palette= {1: "palegreen", 2: "pink"}).set_title("Box plot for survival and number of nodes")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

### Violin Plot

In [None]:
sns.violinplot(x = "survive_after5years", y = "age_patient_operate_time", hue = "survive_after5years", data = myData, palette= {1: "palegreen", 2: "pink"})
plt.title("Violin plot for survival and age")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

sns.violinplot(x = "survive_after5years", y = "year_operate_time", hue = "survive_after5years", data = myData, palette= {1: "palegreen", 2: "pink"})
plt.title("Violin plot for survival and year of operation")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

sns.violinplot(x = "survive_after5years", y = "axillary_node_num", hue = "survive_after5years", data = myData, palette= {1: "palegreen", 2: "pink"})
plt.title("Violin plot for survival and number of nodes")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()


## Bivariate Analysis:

Bivariate analysis is a fundamental form of quantitative statistical analysis that focuses on the examination of two variables, typically represented as X and Y. Its main purpose is to discover how these two things are connected in the real world. In simpler terms, it's about understanding how two variables relate to each other through data and numbers.

### Scatter Plot:

A scatter plot serves as a valuable visual tool for illustrating the connection between two numerical variables or attributes. Typically, it is created before diving into tasks such as calculating linear correlations or fitting regression lines. The pattern that emerges from a scatter plot reveals insights into the nature of the relationship between the two variables, whether it's linear or non-linear, and provides clues about the strength of that relationship.

In [None]:
# Create a scatter plot
sns.scatterplot(data=myData, hue="survive_after5years", palette= {1: "green", 2: "red"}, x="age_patient_operate_time", y="survive_after5years")
plt.title("Scatter Plot: Age vs. Survival")
plt.xlabel("Age at Operation Time", color="black")
plt.ylabel("Survival")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

sns.scatterplot(data=myData, hue="survive_after5years", palette= {1: "green", 2: "red"}, x="year_operate_time", y="survive_after5years")
plt.title("Scatter Plot: Year of operation vs. Survival")
plt.xlabel("Year of Operation")
plt.ylabel("Survival")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

sns.scatterplot(data=myData, hue="survive_after5years", palette= {1: "green", 2: "red"}, x="axillary_node_num", y="survive_after5years")
plt.title("Scatter Plot: Number of nodes vs. Survival")
plt.xlabel("Number of nodes")
plt.ylabel("Survival")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

sns.scatterplot(data=myData,  hue="survive_after5years", palette= {1: "green", 2: "red"}, x="age_patient_operate_time", y="axillary_node_num")
plt.title("Scatter Plot: Age of patient vs. Number of positive nodes")
plt.xlabel("Age of patient")
plt.ylabel("Positive Nodes")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

#### Result:
It appears that the majority of patients have no detected positive lymph nodes. In this scatter plot, the green and red points are closely mixed together, making it difficult to determine a patient's survival based solely on this 2-D plot of Age of Patient and Positive nodes. To make informed decisions, we need to explore all possible combinations of features. There are three such combinations (excluding the 'Survival' class attribute), and to analyze these combinations, we employ the concept of Pair-Plot.
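
The count of three feature combinations follows from choosing 2 features out of 3. A minimal sketch, using the column names defined when the dataset was loaded:

```python
from itertools import combinations

# The three feature columns (the class attribute is excluded)
features = ["age_patient_operate_time", "year_operate_time", "axillary_node_num"]

# Every unordered pair of features is one 2-D scatter plot
feature_pairs = list(combinations(features, 2))

for x_name, y_name in feature_pairs:
    print(f"{x_name} vs. {y_name}")
```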

### Pair plot

A pair plot is like a visual cheat sheet for data. It's a way to quickly see how all the different things in our data relate to each other. It's like having a bunch of scatter plots all at once, so we can spot patterns and connections between different pieces of information. In other words, it helps us understand how everything in our data works together visually.


In [None]:
# Create a pairplot to visualize scatter plots for all pairs of features
sns.pairplot(myData, hue="survive_after5years", markers=["o", "s"], palette= {1: "green", 2: "red"})
plt.suptitle("Pairwise Scatter Plots for Haberman Dataset")
print("Explanation of Survival status (class attribute): \n1 = the patient survived 5 years or longer and \n2 = the patient died within 5 years")
plt.show()

In [None]:
sns.pairplot(data=myData)
plt.show()

## Conclusion:

* The dataset we have is unbalanced, meaning it doesn't have an equal number of data points for each category or class.
* It's also not easy to draw a straight line to separate the different classes because there's a lot of overlap in the data points.
* To work with this dataset, we can't rely on simple rules like "if this, then that." Instead, we'll need to use more advanced techniques to handle the complexity of the data.
* Neither age nor year of operation provides significant insight, because their distributions are quite similar for patients who survived and patients who did not.
* The only feature that appears to be valuable in determining the survival status of patients is the number of positive lymph nodes. There's a noticeable difference in the distributions for the two groups. Notably, a significant portion of surviving patients has zero positive lymph nodes.
* When examining the distribution of years, it's apparent that the number of individuals who did not survive experiences a sudden increase and decrease between 1958 and 1960. Additionally, there's a higher number of non-survivors in the year 1965.

### Creating new data frames to show characteristics of patients who survived and patients who did not


In [None]:
survived=myData.loc[myData["survive_after5years"]==1]
not_survived=myData.loc[myData["survive_after5years"]==2]

print("Characteristics of people who survived 5 years or longer:")
survived.describe()

In [None]:
print("Characteristics of people who died within 5 years:")
not_survived.describe()

#### Result:
After examining both tables, it's apparent that the statistics are quite alike for most of the features, with one notable exception being positive lymph nodes.

The average number of positive lymph nodes is higher for individuals who passed away within 5 years compared to those who survived for more than 5 years.

Based on my analysis, it's reasonable to conclude that patients with more axillary nodes detected are more likely to pass away within 5 years. Therefore, individuals with fewer positive lymph nodes tend to have a better chance of survival.

## Correlation Matrix

We use a correlation matrix in the Haberman dataset to see how the different pieces of information are related. It helps us figure out if there are any connections or patterns between the columns, like age, number of positive lymph nodes, and survival status. If two things are highly correlated, it means they tend to change together. Note that survival status is coded as 1 (survived) and 2 (died), so a positive correlation with it points towards the "died" class. This helps us understand the data better and make informed decisions, especially in tasks like predicting whether a patient will survive or not.

In [None]:
correlation_matrix = myData.corr().round(2)

display(correlation_matrix)

plt.figure(figsize=(8, 6))

# Create a heatmap using Seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)

# Add a title
plt.title("Correlation Heatmap")

# Display the heatmap
plt.show()


### Result:

This matrix shows the pairwise correlations between the variables in the dataset. Each cell in the matrix represents the correlation coefficient between two variables. The correlation coefficient can range from -1 to 1, indicating the strength and direction of the relationship between variables:

    * If the correlation coefficient is close to 1, it indicates a strong positive correlation, meaning that as one variable increases, the other tends to increase as well.

    * If the correlation coefficient is close to -1, it indicates a strong negative correlation, meaning that as one variable increases, the other tends to decrease.

    * If the correlation coefficient is close to 0, it indicates a weak or no correlation between the variables.
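
As a minimal sketch of what each cell of the matrix contains, the Pearson coefficient can be computed by hand. The data below is made up for illustration (not the Haberman values), and `pearson` is a hypothetical helper. The example also shows that the sign depends on how the class attribute is coded: with 1 = survived and 2 = died, a positive coefficient means higher values go together with the "died" class.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: node counts and status (1 = survived, 2 = died)
nodes = [0, 1, 0, 3, 10, 15]
status = [1, 1, 1, 2, 2, 2]

r_status = pearson(nodes, status)

# Recoding the target as 1 = survived, 0 = died flips the sign
survived_flag = [1 if s == 1 else 0 for s in status]
r_flag = pearson(nodes, survived_flag)

print(f"nodes vs. status code:   {r_status:.2f}")
print(f"nodes vs. survived flag: {r_flag:.2f}")
```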

Here's the interpretation of the correlation matrix:

1. **Age_patient_operate_time vs. Age_patient_operate_time**: The correlation between a variable and itself is always 1, indicating a perfect positive correlation, which is expected.

2. **Age_patient_operate_time vs. Year_operate_time**: The correlation coefficient is 0.09, which is relatively low. This suggests a **weak positive** correlation between a patient's age at the time of operation and the year of the operation. In other words, there is a slight tendency for older patients to have undergone surgery in more recent years.

3. **Age_patient_operate_time vs. Axillary_node_num**: The correlation coefficient is -0.06, which indicates a **very weak negative** correlation. This suggests that there is almost no relationship between a patient's age and the number of axillary nodes detected.

4. **Age_patient_operate_time vs. Survive_after5years**: The correlation coefficient is 0.07, which also indicates a **weak positive** correlation. Because survival status is coded as 1 = survived and 2 = died, a positive value means that older patients are very slightly more likely to fall into class 2 (died within five years). The relationship is too weak to be meaningful on its own.

5. **Year_operate_time vs. Year_operate_time**: As with the first variable, the correlation between a variable and itself is always 1.

6. **Year_operate_time vs. Axillary_node_num**: The correlation coefficient is -0.00, indicating **no significant correlation** between the year of operation and the number of axillary nodes.

7. **Year_operate_time vs. Survive_after5years**: The correlation coefficient is also -0.00, indicating **no significant correlation** between the year of operation and survival status.

8. **Axillary_node_num vs. Axillary_node_num**: As expected, the correlation between a variable and itself is 1.

9. **Axillary_node_num vs. Survive_after5years**: The correlation coefficient is 0.29, indicating a **moderate positive** correlation. This suggests that as the number of axillary nodes increases, the likelihood of not surviving beyond five years after surgery also increases.

10. **Survive_after5years vs. Survive_after5years**: The correlation between a variable and itself is always 1.

In summary, this correlation matrix provides insights into the relationships between variables in the "Haberman" dataset. Notably, **there is a weak positive correlation between patient age and the year of operation, and a moderate positive correlation (0.29) between the number of axillary nodes and the survival status code. Since status 2 means the patient died, more positive nodes are associated with a lower likelihood of surviving beyond five years after surgery. However, the correlations in this dataset are generally weak or close to zero, suggesting that these variables are not strongly interrelated.**

## Feature Scaling:

**In many cases, we may not need feature scaling for logistic regression, especially if the features are already on similar scales. Logistic regression is not as sensitive to feature scaling as some other algorithms like Support Vector Machines or k-Nearest Neighbors. Feature scaling won't always improve the model's performance significantly, and it's not a strict requirement for logistic regression. Whether or not we scale features depends on the specific characteristics of our dataset and the behavior of our model. As a best practice, I try both scaled and unscaled features and compare the model's performance to see if scaling provides any advantage for the particular problem.**

**Depending on the machine learning algorithm we plan to use, it might be beneficial to scale our features. Feature scaling can help algorithms converge faster and perform better. Common scaling methods are Min-Max scaling (scaling features to a range of [0, 1]) and standardization (scaling features to have a mean of 0 and a standard deviation of 1). However, I noticed that including it doesn't have a significant impact on the model's performance here. Therefore, I made the decision to remove it from my project.**

Feature scaling is an important thing we do before teaching computers to learn from data. It's like making sure all the numbers in our data are on the same scale. This helps our computer learn without favoring big numbers over small ones. Feature scaling is extra important when our data has numbers that cover a wide range. Let's see why feature scaling is so important and when we should use it.

1. Equalizing the Impact of Features:

Some computer programs that learn from data use measurements that involve distances, like how far apart things are. But when these measurements involve numbers that are very different, it can cause problems. For example, if one number is between 0 and 1, and another number is between 0 and 1000, the big number can have too much control over the program. Feature scaling is a way to fix this issue and make sure all the numbers have a fair say in the program's decisions.

2. Faster Convergence:

Methods that use gradients, such as gradient descent, work better when we adjust the size of our features. When we don't do this adjustment, the process of finding the best solution might be slower and not very steady.

3. Improved Model Performance:

Some methods, such as linear regression trained with regularization or gradient descent and K-Means clustering, can be affected by how big or small the numbers in the data are. To make these methods work better and give more accurate predictions, we can adjust the size of the numbers in the data.

4. Interpretability:

Scaling features makes it easier to understand how different features affect a model and which ones are more important. When features have very different sizes, figuring out their importance can be tricky.
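
To make the first point (equalizing the impact of features) concrete, the sketch below compares Euclidean distances before and after rescaling. The three points are invented for illustration: the first feature lies in [0, 1] and the second in [0, 1000], and the `rescale` helper is hypothetical and simply divides each feature by its assumed range.

```python
import math

# Illustrative points: (feature in [0, 1], feature in [0, 1000])
a = (0.0, 500.0)
b = (1.0, 510.0)  # as different from `a` as possible in feature 1
c = (0.0, 600.0)  # identical to `a` in feature 1, slightly different in feature 2

# Raw distances are dominated by the large-range feature
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)


def rescale(point):
    """Divide each feature by its assumed range so both lie in [0, 1]."""
    return (point[0] / 1.0, point[1] / 1000.0)


scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```

Before scaling, `b` looks like the nearest neighbour of `a`; after scaling, `c` does. A distance-based method such as K-Nearest Neighbors could therefore give different answers depending on whether the features were scaled.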

When should we use feature scaling:

* If we're using algorithms that depend on measuring distances or gradient descent for improving the model (like K-Nearest Neighbors, Support Vector Machines, K-Means clustering, Principal Component Analysis).
* When our data has features with different types of measurements or units.
* When we want to improve how well our machine learning model works and how quickly it learns.
* If we're using techniques that punish large coefficient values (like L1 and L2 regularization).


Common Feature Scaling Methods:

Several common methods for feature scaling include:

1. Standardization (Z-score normalization): Rescales the data to have a mean of 0 and a standard deviation of 1. It is advantageous when the data approximates a normal distribution.

2. Min-Max Scaling: Transforms data to a specified range, often [0, 1] or [-1, 1], while preserving the original data distribution.

3. Robust Scaling: This method adjusts data by using the median and the interquartile range (the range covered by the middle 50% of the values). It keeps the scaled data from being affected too much by unusual or extreme numbers.

For example, imagine we have a list of people's salaries in a company. Some people earn really high salaries, like the CEO, while others earn average salaries. If we want to find the average salary for the company, we could just add up all the salaries and divide by the number of people. But this method can be heavily influenced by the CEO's extremely high salary, so the average might not really represent what most people in the company earn. To avoid this problem, we can use a method that looks at the middle values, like the median (the middle number when all salaries are lined up from lowest to highest) and the range between the 25th percentile and the 75th percentile. These middle values are less affected by extreme salaries, making our calculation more robust, or resistant, to these outliers.

When we decide how to adjust our data, we should think about the kind of data we have and the needs of the machine learning algorithm we're using. To make the best choice, we can check how each scaling method affects our model's performance for our specific dataset and problem.
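
The three methods can be sketched in a few lines. The salaries below are invented for illustration (in thousands, with one extreme value standing in for the CEO), and the variable names are hypothetical. Standardization here uses the population standard deviation, which is an assumption about the convention.

```python
import statistics

# Illustrative salaries in thousands; the last value plays the role of the CEO
salaries = [30, 35, 40, 45, 50, 55, 60, 65, 1000]

# 1. Standardization: subtract the mean, divide by the standard deviation
mean = statistics.fmean(salaries)
std = statistics.pstdev(salaries)
standardized = [(s - mean) / std for s in salaries]

# 2. Min-Max scaling: map the smallest value to 0 and the largest to 1
lowest, highest = min(salaries), max(salaries)
min_max = [(s - lowest) / (highest - lowest) for s in salaries]

# 3. Robust scaling: subtract the median, divide by the interquartile range
q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
robust = [(s - median) / (q3 - q1) for s in salaries]

print("min-max:", [round(v, 3) for v in min_max])
print("robust: ", [round(v, 3) for v in robust])
```

With Min-Max scaling the single extreme salary squeezes all ordinary salaries into a tiny interval near 0, while robust scaling keeps them spread between -1 and 1 and only the extreme value ends up far away.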



Source link: https://towardsdatascience.com/what-is-feature-scaling-why-is-it-important-in-machine-learning-2854ae877048


# Train Test Split for Logistic Regression

In [None]:
# Import necessary libraries for machine learning
from sklearn.model_selection import train_test_split

# Data Splitting
# Split the data into features (X) and the target variable (y)
X1 = myData.drop("survive_after5years", axis=1)
y1 = myData["survive_after5years"]

# Split the data into a training set and a testing set (e.g., 70% training, 30% testing)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.3, random_state=42)


# Build Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Model Building
# Choose a machine learning model (in our case Logistic Regression)
logistic_model = LogisticRegression()

# Train the model on the training data
logistic_model.fit(X_train1, y_train1)


In [None]:
# Model Evaluation
# Make predictions on the testing data
y_pred_logistic = logistic_model.predict(X_test1)

## Classification Metrics Evaluation for Logistic Model

In [None]:
# Calculate accuracy and other classification metrics
accuracy_logistic = accuracy_score(y_test1, y_pred_logistic)
conf_matrix_logistic = confusion_matrix(y_test1, y_pred_logistic)
class_report_logistic = classification_report(y_test1, y_pred_logistic)

# Print the results for Logistic Regression
print("\n**Logistic Regression Model Results:**\n")
print("Accuracy:", accuracy_logistic)
print("\nConfusion Matrix:")
print(conf_matrix_logistic)
print("\nClassification Report:\n")
print(class_report_logistic)


### Confusion matrix Explanation:

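True / False tells us whether the prediction was correct or incorrect.<br>
Positive / Negative tells us which class was predicted.<br>
Rows are the actual classes and columns are the predicted classes.

If Survived is our positive class and Died is our negative class, then:
##### [[TP: Survived predicted as Survived;   FN: Survived predicted as Died]
##### [FP: Died predicted as Survived;   TN: Died predicted as Died]]<br>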
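
The layout can be checked by building a small confusion matrix by hand. This is a minimal sketch with made-up labels (not the model's predictions), following the convention that rows are actual classes and columns are predicted classes; the variable names are hypothetical.

```python
# Illustrative labels only: 1 = survived (positive class), 2 = died (negative class)
labels = [1, 2]
y_true = [1, 1, 1, 2, 2, 1]
y_pred = [1, 2, 1, 1, 2, 1]

# matrix[i][j] = number of samples with actual class i and predicted class j
index = {label: i for i, label in enumerate(labels)}
matrix = [[0] * len(labels) for _ in labels]
for actual, predicted in zip(y_true, y_pred):
    matrix[index[actual]][index[predicted]] += 1

# With "survived" as the positive class, the four cells unpack as follows
(tp, fn), (fp, tn) = matrix

print(matrix)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```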


In [None]:
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay

cm_display_logistic = ConfusionMatrixDisplay(confusion_matrix=conf_matrix_logistic, display_labels=["Survived", "Died"])
cm_display_logistic.plot()


### Results (Classification Metrics Evaluation for Logistic Model):

Model Evaluation Metrics:

**Accuracy**: The model achieved an accuracy of approximately 73.91%, indicating that it correctly classified 73.91% of cases in the test dataset.
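
Because the classes are imbalanced, accuracy is easier to judge next to a simple baseline. The sketch below assumes the confusion matrix implied by the counts reported in this section (rows are actual classes, columns are predicted classes) and compares the model with a hypothetical baseline that always predicts "survived".

```python
# Confusion matrix implied by the counts reported in this section
# rows: actual [survived, died]; columns: predicted [survived, died]
conf_matrix = [[61, 5],
               [19, 7]]

total = sum(sum(row) for row in conf_matrix)

# Correct predictions sit on the diagonal
correct = conf_matrix[0][0] + conf_matrix[1][1]
model_accuracy = correct / total

# A baseline that always predicts "survived" is right for every actual survivor
actual_survived = sum(conf_matrix[0])
baseline_accuracy = actual_survived / total

print(f"Model accuracy:    {model_accuracy:.4f}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

The gap between the two numbers is small, which is one reason the per-class metrics below are more informative than accuracy alone.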


**Confusion Matrix**:

For Survived =><br>
True Positives (TP): 61<br>
True Negatives (TN): 7<br>
False Positives (FP): 19<br>
False Negatives (FN): 5<br>
Precision survived => 61/(61 + 19) = 0.76  [correctly predicted survived/(correctly predicted survived + incorrectly predicted survived)]<br>

For Died => <br>
True Positives (TP): 7<br>
True Negatives (TN): 61<br>
False Positives (FP): 5<br>
False Negatives (FN): 19<br>
Precision Died => 7/(7 + 5) = 0.58 [correctly predicted died/(correctly predicted died + Incorrectly predicted died)]<br>

Source link: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9826726
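
The per-class precision and recall follow directly from these counts. A minimal sketch, assuming the confusion matrix implied by the numbers above (rows are actual classes, columns are predicted classes); the helper name is hypothetical.

```python
# rows: actual [survived, died]; columns: predicted [survived, died]
conf_matrix = [[61, 5],
               [19, 7]]


def precision_recall(matrix, cls):
    """Precision and recall for class index `cls` (0 = survived, 1 = died)."""
    tp = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column sum: TP + FP
    actually_cls = sum(matrix[cls])                     # row sum: TP + FN
    return tp / predicted_as_cls, tp / actually_cls


precision_survived, recall_survived = precision_recall(conf_matrix, 0)
precision_died, recall_died = precision_recall(conf_matrix, 1)

print(f"Survived: precision={precision_survived:.2f}, recall={recall_survived:.2f}")
print(f"Died:     precision={precision_died:.2f}, recall={recall_died:.2f}")
```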

**Precision**: Precision measures the ratio of correctly predicted positive observations (TP) to the total predicted positive observations (TP + FP). In this case, precision is higher for class "1" (patients who survived for 5 years or longer), indicating that when the model predicts a patient to survive, it is correct 76% of the time. In simpler terms, if the model says a patient will survive, there's a 76% chance that the model is right. A high precision in this context is crucial because it means that the model doesn't make many false positive predictions. It's like the doctor making sure they're really accurate when they say a patient will be fine. In medical applications, high precision is valuable because it minimizes the chances of telling a patient they will survive when they won't. It's about being trustworthy and accurate when making positive predictions, which can have a significant impact on patient care and decisions.

**Recall**: Recall assesses the ratio of correctly predicted positive observations (TP) to all observations in the actual class (TP + FN). In this case, recall is higher for class "1," suggesting that the model excels at identifying patients who actually survived. Recall for "class 1" answers the following question: "Out of all the patients who actually survived for 5 years or longer (the true members of 'class 1'), how many did our model correctly identify?" In other words, recall tells us how good our model is at capturing or "recalling" the patients who truly survived for a long time. It's like counting how many people the model correctly recognizes as the ones who had a positive outcome, in this case, surviving for 5 years or more. Now, let's move to "class 2." Recall for "class 2" addresses a similar question: "Out of all the patients who unfortunately died within 5 years (the true members of 'class 2'), how many did our model correctly identify?" For "class 2," recall helps us understand how well the model can identify individuals who experienced a negative outcome, which, in this context, means not surviving for 5 years.
Here's why it's important: A high recall value indicates that the model is good at finding the individuals who truly belong to a particular class, whether it's the "survived" class or the "not survived" class. In our case, we mentioned that recall is higher for "class 1," which means that the model is better at identifying patients who actually survived for 5 years or longer. It excels at capturing those positive cases. However, for "class 2," recall is lower, suggesting that the model is not as good at identifying patients who did not survive within 5 years. It might miss some of the negative cases. This information is valuable because it helps us understand the strengths and weaknesses of our model in differentiating between these two classes, which is essential when dealing with medical data and making predictions about patient outcomes.

**F1-Score**: The F1-score is the harmonic mean of precision and recall, so it summarizes both in a single number per class. For class "1," the F1-score is higher, meaning the model is relatively better at classifying patients who survived.

The F1-score ranges from 0 to 1, with higher values indicating better performance:

* **High F1 Score**: A value close to 1 means both precision and recall are high: the model makes accurate positive predictions and misses few positive cases.
* **Low F1 Score**: A value close to 0 means precision, recall, or both are low: the model makes many wrong positive predictions, misses many actual positive cases, or both.
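To make the definitions above concrete, here is a minimal sketch that computes precision, recall, and F1 for one class from raw counts. The counts are illustrative values chosen for this example, not output from the notebook.

```python
# Illustrative counts only (an assumption for this sketch, not notebook output).
tp, fp, fn = 60, 18, 6

precision = tp / (tp + fp)   # correct positives / all predicted positives
recall = tp / (tp + fn)      # correct positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because it is a harmonic mean, the F1-score always lies between precision and recall and sits closer to the smaller of the two.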


**Support**: Support is the number of occurrences of each class in the true labels being evaluated. In our case there are two classes: "class 1" (patients who survived for 5 years or longer) and "class 2" (patients who died within 5 years), and support tells us how many patients belong to each.

**Macro Avg**: The macro average is computed as the average of precision, recall, and F1-score for **both classes**. In this case, it's lower due to class "2" having lower precision, recall, and F1-score.

**Weighted Avg**: The weighted average is similar to the macro average, but each class's score is weighted by its support, so it accounts for the imbalance in class distribution. It is a single value per metric, not a per-class value. In our dataset, "class 1" (patients who survived for 5 years or longer) appears more frequently, so its scores contribute more to the weighted average. Because "class 1" also has the higher scores, the weighted average comes out higher than the macro average.
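The difference between the two averages can be sketched with NumPy. The per-class F1-scores and supports used here are the values reported for the K-Nearest Neighbors model later in this notebook; the variable names are only for illustration.

```python
import numpy as np

# Per-class F1-scores and supports (class 1, class 2), taken from the
# K-Nearest Neighbors classification report later in this notebook.
f1_per_class = np.array([0.86, 0.50])
support = np.array([66, 26])

macro_f1 = f1_per_class.mean()                            # every class counts equally
weighted_f1 = np.average(f1_per_class, weights=support)   # classes weighted by support

print(f"macro avg F1    = {macro_f1:.2f}")
print(f"weighted avg F1 = {weighted_f1:.2f}")
```

The weighted average is pulled toward the class 1 score because class 1 has the larger support.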

Source link: https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/

**Summary**:

Based on these metrics, the model performs better in predicting patients who survived for 5 years or longer (class "1") compared to those who did not survive (class "2"). However, there is room for improvement, especially in correctly identifying patients who did not survive, as evidenced by the lower recall and F1-score for class "2."

To enhance the model's performance, consider further model tuning, experimenting with different algorithms, or exploring additional feature engineering techniques, particularly for class "2" (patients who did not survive within 5 years). Addressing the class imbalance may also be advantageous for improving overall model performance.
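One possible way to address the class imbalance is the `class_weight="balanced"` option of scikit-learn's `LogisticRegression`. The sketch below is only an illustration under stated assumptions: it uses synthetic stand-in data with a roughly 3:1 class ratio (not the Haberman data), and the names `X_demo`, `y_demo`, `plain`, and `balanced` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 3 features, about 73% / 27% class split.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=0,
)
y_demo = y_demo + 1  # relabel {0, 1} -> {1, 2} to mirror the notebook's target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall for the minority class (label 2) with and without class weighting.
recall_plain = recall_score(y_te, plain.predict(X_te), pos_label=2)
recall_balanced = recall_score(y_te, balanced.predict(X_te), pos_label=2)
print(f"class 2 recall, unweighted: {recall_plain:.2f}")
print(f"class 2 recall, balanced:   {recall_balanced:.2f}")
```

Class weighting usually trades some class 1 recall for better class 2 recall, so both should be checked before adopting it.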

## ROC Curve for evaluating the Logistic Model

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import LabelEncoder


# Convert labels from {1, 2} to {0, 1}
label_encoder = LabelEncoder()
label_encoder.fit(y_test1)  # Fit the label encoder on y_test1
y_test1_binary = label_encoder.transform(y_test1)  # Now you can transform y_test1

# Logistic Regression ROC Curve
y_prob_logistic = logistic_model.predict_proba(X_test1)[:, 1]  # Probability of the second class (label 2, encoded as 1 above)
fpr_logistic, tpr_logistic, thresholds_logistic = roc_curve(y_test1_binary, y_prob_logistic)
roc_auc_logistic = roc_auc_score(y_test1_binary, y_prob_logistic)


# Plot ROC Curves
plt.figure(figsize=(8, 6))

# Logistic Regression ROC Curve
plt.plot(fpr_logistic, tpr_logistic, color='blue', lw=2, label='Logistic Regression (AUC = %0.2f)' % roc_auc_logistic)

# Random Guess Line
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
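When reading an ROC curve built this way, it helps to confirm which `predict_proba` column belongs to which label. The sketch below uses a tiny made-up dataset labelled {1, 2} (an assumption for illustration; `X_toy`, `y_toy`, and `toy_model` are hypothetical names) to show that the columns follow the order of `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny illustrative data labelled {1, 2}, like the notebook's target column.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

toy_model = LogisticRegression().fit(X_toy, y_toy)
print("Column order of predict_proba:", toy_model.classes_)

proba = toy_model.predict_proba(X_toy)
col_class2 = list(toy_model.classes_).index(2)  # look the column up instead of hard-coding it
p_class2 = proba[:, col_class2]
print("P(class 2) per sample:", np.round(p_class2, 2))
```

With labels {1, 2}, column index 1 therefore holds the probability of label 2, which matches the label that `LabelEncoder` maps to 1.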


## Cross-Validation for evaluating Logistic Model

In [None]:
from sklearn.model_selection import cross_val_score, cross_val_predict, StratifiedKFold

# Perform 5-fold cross-validation (we can change the number of folds if needed)
scores_logistic = cross_val_score(logistic_model, X1, y1, cv=5, scoring='accuracy')

# Print the cross-validation scores
print("Cross-Validation Scores:", scores_logistic)
print("Mean Accuracy:", np.mean(scores_logistic))

#### We can also get cross-validated predictions and performance metrics for each fold:

In [None]:
from sklearn.base import clone

# Perform cross-validation with predictions
y_pred_cv_logistic = cross_val_predict(logistic_model, X1, y1, cv=5)

# Compute performance metrics for each fold.
# Fold-specific names and a cloned model are used so that the earlier
# X_test1 / y_test1 split and the fitted logistic_model are not overwritten.
for i, (train, test) in enumerate(StratifiedKFold(n_splits=5).split(X1, y1)):
    X_train_fold, X_test_fold = X1.iloc[train], X1.iloc[test]
    y_train_fold, y_test_fold = y1.iloc[train], y1.iloc[test]
    
    fold_model = clone(logistic_model).fit(X_train_fold, y_train_fold)
    y_pred_fold = fold_model.predict(X_test_fold)
    
    accuracy = accuracy_score(y_test_fold, y_pred_fold)
    confusion = confusion_matrix(y_test_fold, y_pred_fold)
    report = classification_report(y_test_fold, y_pred_fold)
    
    print(f"Fold {i + 1} - Accuracy: {accuracy}")
    print(f"Fold {i + 1} - Confusion Matrix:\n{confusion}")
    print(f"Fold {i + 1} - Classification Report:\n{report}")

### Results (Cross-Validation for evaluating Logistic Model)

Cross-validation is a common technique for assessing the performance of a machine learning model. In this analysis, we performed 5-fold cross-validation to evaluate the accuracy, confusion matrices, and classification reports for each fold and the overall mean accuracy.

**Overall Mean Accuracy**: 
The mean accuracy across all 5 folds of the cross-validation is approximately 74.83%.

**Fold 1 Results**

Fold 1 Accuracy: 75.81%

Fold 1 Classification Report:

* Precision (Class 1): 76%
* Recall (Class 1): 98%
* F1-Score (Class 1): 85%
* Precision (Class 2): 75%
* Recall (Class 2): 18%
* F1-Score (Class 2): 29%


**Fold 2 Results**

Fold 2 Accuracy: 75.41%

Fold 2 Classification Report:

* Precision (Class 1): 76%
* Recall (Class 1): 98%
* F1-Score (Class 1): 85%
* Precision (Class 2): 67%
* Recall (Class 2): 12%
* F1-Score (Class 2): 21%

**Fold 3 Results**

Fold 3 Accuracy: 72.13%

Fold 3 Classification Report:

* Precision (Class 1): 75%
* Recall (Class 1): 93%
* F1-Score (Class 1): 83%
* Precision (Class 2): 40%
* Recall (Class 2): 12%
* F1-Score (Class 2): 19%

**Fold 4 Results**

Fold 4 Accuracy: 75.41%

Fold 4 Classification Report:

* Precision (Class 1): 77%
* Recall (Class 1): 96%
* F1-Score (Class 1): 85%
* Precision (Class 2): 60%
* Recall (Class 2): 19%
* F1-Score (Class 2): 29%

**Fold 5 Results**

Fold 5 Accuracy: 75.41%

Fold 5 Classification Report:

* Precision (Class 1): 78%
* Recall (Class 1): 93%
* F1-Score (Class 1): 85%
* Precision (Class 2): 57%
* Recall (Class 2): 25%
* F1-Score (Class 2): 35%

**Conclusion**

The cross-validation results for the logistic regression model on the Haberman dataset indicate that the model's performance is relatively consistent across the 5 folds, with accuracies ranging from 72.13% to 75.81%. The mean accuracy of approximately 74.83% suggests that the model is performing moderately well on this dataset.

In terms of class-specific performance, the model exhibits higher precision and recall for Class 1 (patients who survived 5 or more years) compared to Class 2 (patients who died within 5 years). This indicates that the model is better at correctly classifying patients with longer survival times.

The F1-scores for Class 2 are much lower (19% to 35%), driven mainly by low recall (12% to 25%), which may be due to class imbalance in the dataset.

In conclusion, the logistic regression model shows a moderate level of accuracy in predicting the survival status of breast cancer patients based on the Haberman dataset. While it performs well in identifying patients who survived 5 or more years, it has room for improvement in correctly classifying patients with shorter survival times. Further model optimization and evaluation may be necessary to enhance its predictive performance, particularly for Class 2.
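Since accuracy alone hides the weak Class 2 results, cross-validation can also report Class 2 recall directly. The following is a minimal sketch using `cross_validate` with a custom scorer; it runs on synthetic stand-in data (an assumption, not the Haberman data), and the names `X_demo`, `y_demo`, and `scoring` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption) with labels {1, 2}.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=0,
)
y_demo = y_demo + 1

# Track accuracy and the recall of the minority class (label 2) in one pass.
scoring = {
    "accuracy": "accuracy",
    "recall_class2": make_scorer(recall_score, pos_label=2),
}
cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring=scoring
)

print("Mean accuracy:      ", cv_results["test_accuracy"].mean())
print("Mean class 2 recall:", cv_results["test_recall_class2"].mean())
```

This avoids the manual loop when only summary scores per fold are needed.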

# Train Test Split for SVM Model

In [None]:
# Import necessary libraries for machine learning
from sklearn.model_selection import train_test_split

# Data Splitting
# Split the data into features (X) and the target variable (y)
X2 = myData.drop("survive_after5years", axis=1)
y2 = myData["survive_after5years"]

# Split the data into a training set and a testing set (e.g., 70% training, 30% testing)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3, random_state=42)


# Build SVM Model

In [None]:
from sklearn.svm import SVC

# SVM Model
svm_model = SVC()
svm_model.fit(X_train2, y_train2)
y_pred_svm = svm_model.predict(X_test2)

## Classification Metrics Evaluation for SVM Model

In [None]:
# Calculate accuracy and other classification metrics
accuracy_svm = accuracy_score(y_test2, y_pred_svm)
conf_matrix_svm = confusion_matrix(y_test2, y_pred_svm)
class_report_svm = classification_report(y_test2, y_pred_svm, zero_division=np.nan)


# Print the results for SVM
print("\n**SVM Model Results:**\n")
print("Accuracy:", accuracy_svm)
print("\nConfusion Matrix:")
print(conf_matrix_svm)
print("\nClassification Report:\n")
print(class_report_svm)


In [None]:
cm_display_svm = ConfusionMatrixDisplay(confusion_matrix=conf_matrix_svm, display_labels=["Survived", "Died"])
cm_display_svm.plot()

### Results (Classification Metrics Evaluation for SVM Model):

**Accuracy**: 0.7174 (71.74%): 

This is the overall accuracy of the SVM model. It tells us that the model correctly predicted the class labels for 66 of the 92 test data points, or approximately 71.74%.

**Confusion Matrix**:

For Survived =><br>
True Positives (TP): 66<br>
True Negatives (TN): 0<br>
False Positives (FP): 26<br>
False Negatives (FN): 0<br>
Precision survived => 66/(66 + 26) = 0.72 [correctly predicted survived/(correctly predicted survived + incorrectly predicted survived)]

For Died =><br>
True Positives (TP): 0<br>
True Negatives (TN): 66<br>
False Positives (FP): 0<br>
False Negatives (FN): 26<br>
Precision Died => 0/(0 + 0) = undefined [correctly predicted died/(correctly predicted died + Incorrectly predicted died)]
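The per-class counts can be derived mechanically from the confusion matrix. This sketch assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and uses the SVM test results described here: 66 actual survivors and 26 actual deaths, all predicted as "Survived". The dictionary `counts` is a hypothetical helper for this example.

```python
import numpy as np

# Rows = actual class, columns = predicted class, in the order [Survived, Died].
cm = np.array([[66, 0],
               [26, 0]])
labels = ["Survived", "Died"]

counts = {}
for k, name in enumerate(labels):
    tp = int(cm[k, k])                 # actual k, predicted k
    fp = int(cm[:, k].sum()) - tp      # predicted k, but actually another class
    fn = int(cm[k, :].sum()) - tp      # actual k, but predicted another class
    tn = int(cm.sum()) - tp - fp - fn  # everything else
    # Precision is undefined (nan) when the class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    counts[name] = (tp, fp, fn, tn)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} precision={precision:.2f}")
```

Treating each class in turn as the "positive" class this way makes it clear that one class's false positives are the other class's false negatives.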


**Classification Report**:

**Precision**: Precision measures how many of the predicted positive instances were actually positive. For class 1, precision is 0.72, which means that 72% of the instances predicted as class 1 were actually class 1. However, for class 2, precision is undefined (0/0) because the model did not predict class 2 for any instance.

**Recall**: Recall (or sensitivity) measures how many of the actual positive instances were correctly predicted as positive by the model. For class 1, recall is 1, indicating that 100% of the actual class 1 instances were correctly identified. However, for class 2, recall is 0.00, indicating that none of the actual class 2 instances were correctly identified.

**F1-Score**: The F1-score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall. For class 1, the F1-score is 0.84, and for class 2, it is undefined due to a division by zero.

**Support**: Support is the number of instances in each class. There are 66 instances of class 1 and 26 instances of class 2.

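**Accuracy**: The overall accuracy is the fraction of correctly classified instances, which the report rounds to 0.72 (72%).

**Macro Avg**: This is the unweighted average of precision, recall, and F1-score across both classes. In this case, the macro-averaged precision is 0.72, recall is 0.50, and F1-score is 0.84. The precision and F1-score averages equal the class 1 values because the undefined class 2 values are left out of the average (the report is computed with `zero_division=np.nan`); only recall, at (1.00 + 0.00)/2 = 0.50, reflects both classes.

**Weighted Avg**: This is the weighted average of precision, recall, and F1-score, where each class's score is weighted by its support. In this case, the weighted-averaged precision is 0.72, recall is 0.72, and F1-score is 0.84. As with the macro average, the precision and F1-score figures cover class 1 only.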

Overall, this SVM model predicts class 1 for every test instance. That gives perfect recall for class 1 but zero recall for class 2, so the 72% accuracy simply mirrors the share of class 1 in the test set (66 of 92). The model has no ability to identify class 2, possibly due to class imbalance or other factors. Further analysis and model tuning are needed to improve the model's performance, especially for class 2.
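Two adjustments that are often tried in this situation are feature scaling (the default RBF kernel is sensitive to feature scale) and class weighting. The sketch below combines both in a pipeline. It is an illustration on synthetic stand-in data, not a result for the Haberman dataset, and the names `X_demo`, `y_demo`, and `svm_scaled` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption) with labels {1, 2} and a 3:1 imbalance.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=0,
)
y_demo = y_demo + 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Scale the features, then fit an SVM that weights classes inversely to their frequency.
svm_scaled = make_pipeline(StandardScaler(), SVC(class_weight="balanced"))
svm_scaled.fit(X_tr, y_tr)

cm_demo = confusion_matrix(y_te, svm_scaled.predict(X_te), labels=[1, 2])
print(cm_demo)
```

Putting the scaler inside the pipeline means it is fit on the training data only, which also keeps later cross-validation free of leakage.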






## ROC Curve for evaluating SVM Model

In [None]:
# Convert labels from {1, 2} to {0, 1}
label_encoder = LabelEncoder()
label_encoder.fit(y_test2)  # Fit the label encoder on y_test2
y_test2_binary = label_encoder.transform(y_test2)  # Now you can transform y_test2

# SVM ROC Curve
y_prob_svm = svm_model.decision_function(X_test2)  # Use decision_function for SVM
fpr_svm, tpr_svm, thresholds_svm = roc_curve(y_test2_binary, y_prob_svm)
roc_auc_svm = roc_auc_score(y_test2_binary, y_prob_svm)

# Plot ROC Curves
plt.figure(figsize=(8, 6))

# SVM ROC Curve
plt.plot(fpr_svm, tpr_svm, color='blue', lw=2, label='SVM (AUC = %0.2f)' % roc_auc_svm)

# Random Guess Line
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()


## Cross-Validation for evaluating SVM Model

In [None]:
# Perform 5-fold cross-validation (we can change the number of folds if needed)
scores_SVM = cross_val_score(svm_model, X2, y2, cv=5, scoring='accuracy')

# Print the cross-validation scores
print("Cross-Validation Scores:", scores_SVM)
print("Mean Accuracy:", np.mean(scores_SVM))

##### We can also get cross-validated predictions and performance metrics for each fold:

In [None]:
from sklearn.base import clone

# Perform 5-fold cross-validation with predictions
y_pred_cv_svm = cross_val_predict(svm_model, X2, y2, cv=5)

# Compute performance metrics for each fold.
# Fold-specific names and a cloned model are used so that the earlier
# X_test2 / y_test2 split and the fitted svm_model are not overwritten.
for i, (train, test) in enumerate(StratifiedKFold(n_splits=5).split(X2, y2)):
    X_train_fold, X_test_fold = X2.iloc[train], X2.iloc[test]
    y_train_fold, y_test_fold = y2.iloc[train], y2.iloc[test]
    
    fold_model = clone(svm_model).fit(X_train_fold, y_train_fold)
    y_pred_fold = fold_model.predict(X_test_fold)
    
    accuracy = accuracy_score(y_test_fold, y_pred_fold)
    confusion = confusion_matrix(y_test_fold, y_pred_fold)
    report = classification_report(y_test_fold, y_pred_fold, zero_division=np.nan)
    
    print(f"Fold {i + 1} - Accuracy: {accuracy}")
    print(f"Fold {i + 1} - Confusion Matrix:\n{confusion}")
    print(f"Fold {i + 1} - Classification Report:\n{report}")

### Results (Cross-Validation for evaluating SVM Model):

**Overall Mean Accuracy**: 
The mean accuracy across all 5 folds of the cross-validation is approximately 72.88%. It is important to note that the mean accuracy is relatively low, suggesting that the SVM model may not be performing well on this dataset.

**Fold 1 Results**

Fold 1 Accuracy: 72.58%

Fold 1 Classification Report:

    Precision (Class 1): 73%
    Recall (Class 1): 100%
    F1-Score (Class 1): 84%
    Precision (Class 2): undefined (no samples predicted as Class 2)
    Recall (Class 2): 0%
    F1-Score (Class 2): 0%
    
**Fold 2 Results**

Fold 2 Accuracy: 73.77%

Fold 2 Classification Report:

    Precision (Class 1): 74%
    Recall (Class 1): 100%
    F1-Score (Class 1): 85%
    Precision (Class 2): undefined (no samples predicted as Class 2)
    Recall (Class 2): 0%
    F1-Score (Class 2): 0%
    
**Fold 3 Results**

Fold 3 Accuracy: 72.13%

Fold 3 Classification Report:

    Precision (Class 1): 73%
    Recall (Class 1): 98%
    F1-Score (Class 1): 84%
    Precision (Class 2): 0%
    Recall (Class 2): 0%
    F1-Score (Class 2): undefined (precision and recall are both 0)

**Fold 4 Results**

Fold 4 Accuracy: 73.77%

Fold 4 Classification Report:

    Precision (Class 1): 74%
    Recall (Class 1): 100%
    F1-Score (Class 1): 85%
    Precision (Class 2): undefined (no samples predicted as Class 2)
    Recall (Class 2): 0%
    F1-Score (Class 2): 0%
    
**Fold 5 Results**

Fold 5 Accuracy: 72.13%

Fold 5 Classification Report:

    Precision (Class 1): 73%
    Recall (Class 1): 98%
    F1-Score (Class 1): 84%
    Precision (Class 2): 0%
    Recall (Class 2): 0%
    F1-Score (Class 2): undefined (precision and recall are both 0)
    
**Conclusion**

The cross-validation results show that the SVM model isn't performing well on the Haberman dataset. In every fold, Class 1 recall is 98% to 100% and Class 2 recall is 0%: the model labels almost all patients as Class 1 (survived 5 or more years) and never correctly identifies a Class 2 patient. The average accuracy of about 72.88% is therefore roughly what always predicting Class 1 would give, not evidence of real predictive skill. To improve the model's performance, more analysis and adjustments are needed, including addressing the class imbalance issue and fine-tuning the model's settings.
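A majority-class baseline makes this comparison explicit. The sketch below uses scikit-learn's `DummyClassifier`, which ignores the features and always predicts the most frequent class. It assumes the class counts usually cited for the Haberman dataset (225 survivors, 81 deaths); those counts and the names `y_base`, `X_base`, and `baseline` are assumptions for this example rather than values printed in this section.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Assumed class counts: 225 patients in class 1 and 81 in class 2.
y_base = np.array([1] * 225 + [2] * 81)
X_base = np.zeros((len(y_base), 1))  # placeholder features; the dummy model ignores them

# Always predict the most frequent class and score it with the same 5-fold setup.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_base, y_base, cv=5, scoring="accuracy")

print("Baseline fold accuracies:", np.round(baseline_scores, 4))
print("Baseline mean accuracy:  ", baseline_scores.mean())
```

A model is only useful on this dataset if it clearly beats this baseline or trades a little accuracy for meaningful Class 2 recall.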

# Train Test Split for K-Nearest Neighbors

In [None]:
# Data Splitting
# Split the data into features (X) and the target variable (y)
X3 = myData.drop("survive_after5years", axis=1)
y3 = myData["survive_after5years"]

# Split the data into a training set and a testing set (e.g., 70% training, 30% testing)
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.3, random_state=42)


# Build K-Nearest Neighbors Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# K-Nearest Neighbors Model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train3, y_train3)
y_pred_knn = knn_model.predict(X_test3)

## Classification Metrics Evaluation for K-Nearest Neighbors Model

In [None]:
# Calculate accuracy and other classification metrics
accuracy_knn = accuracy_score(y_test3, y_pred_knn)
conf_matrix_knn = confusion_matrix(y_test3, y_pred_knn)
class_report_knn = classification_report(y_test3, y_pred_knn)

# Print the results for K-Nearest Neighbors
print("K-Nearest Neighbors Model Results:")
print("Accuracy:", accuracy_knn)
print("\nConfusion Matrix:")
print(conf_matrix_knn)
print("\nClassification Report:\n")
print(class_report_knn)


In [None]:
cm_display_knn = ConfusionMatrixDisplay(confusion_matrix=conf_matrix_knn, display_labels=["Survived", "Died"])
cm_display_knn.plot()

## Results (Classification Metrics Evaluation for K-Nearest Neighbors Model):

**Accuracy**: The accuracy of the model is 0.7826, which means that it correctly classified 78.26% of the total instances in the dataset.

**Confusion Matrix**:

For Survived =><br>
True Positives (TP): 62<br>
True Negatives (TN): 10<br>
False Positives (FP): 16<br>
False Negatives (FN): 4<br>
Precision survived => 62/(62 + 16) = 0.79 [correctly predicted survived/(correctly predicted survived + incorrectly predicted survived)]

For Died =><br>
True Positives (TP): 10<br>
True Negatives (TN): 62<br>
False Positives (FP): 4<br>
False Negatives (FN): 16<br>
Precision Died => 10/(10 + 4) = 0.71 [correctly predicted died/(correctly predicted died + Incorrectly predicted died)]

**Classification Report**:

The per-class values from the classification report are listed below, followed by an interpretation of each metric.

**Class 1**:
* Precision: 0.79
* Recall: 0.94
* F1-score: 0.86
* Support: 66

**Class 2**:
* Precision: 0.71
* Recall: 0.38
* F1-score: 0.50
* Support: 26


**Precision**: Precision measures the ability of the model to correctly classify instances of a particular class. A precision of 0.79 for Class 1 means that 79% of instances classified as Class 1 were correct, while a precision of 0.71 for Class 2 means that 71% of instances classified as Class 2 were correct.

**Recall**: Recall, also known as sensitivity or true positive rate, assesses the model's ability to identify all instances of a given class. A recall of 0.94 for Class 1 indicates that 94% of actual Class 1 instances were correctly classified, while a recall of 0.38 for Class 2 means that only 38% of actual Class 2 instances were identified.

**F1-Score**: The F1-score is the harmonic mean of precision and recall and provides a balance between these two metrics. An F1-score of 0.86 for Class 1 indicates a good balance between precision and recall, while an F1-score of 0.50 for Class 2 is pulled down by the low recall (0.38), despite a precision of 0.71.

**Conclusion**:
In summary, the K-Nearest Neighbors model performed reasonably well, with an accuracy of 78.26%. It showed good precision, recall, and F1-score for Class 1 but weaker results for Class 2, where recall (0.38) and F1-score (0.50) are low. These results suggest that the model may need further tuning or that Class 2 samples are harder to classify accurately.
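One form the suggested tuning could take is a grid search over the number of neighbors, with feature scaling inside a pipeline because KNN relies on distances. This is a sketch on synthetic stand-in data, not a tuned result for the Haberman dataset; the candidate values for `n_neighbors` and the names `pipe`, `param_grid`, and `search` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption) with labels {1, 2}.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=0,
)
y_demo = y_demo + 1

pipe = Pipeline([
    ("scale", StandardScaler()),      # put all features on a comparable scale
    ("knn", KNeighborsClassifier()),
])
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11, 15]}

# Balanced accuracy averages the recall of both classes, so Class 2 is not ignored.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="balanced_accuracy")
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["knn__n_neighbors"])
print("Best balanced accuracy:", search.best_score_)
```

Scoring on balanced accuracy rather than plain accuracy is a design choice here, made so the search does not simply favor the majority class.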




## ROC Curve for K-Nearest Neighbors Model

In [None]:
# Convert labels from {1, 2} to {0, 1}
label_encoder = LabelEncoder()
label_encoder.fit(y_test3)  # Fit the label encoder on y_test3
y_test3_binary = label_encoder.transform(y_test3)  # Now you can transform y_test3

# KNN ROC Curve
y_prob_knn = knn_model.predict_proba(X_test3)[:, 1]  # Probability of the second class (label 2, encoded as 1 above)
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test3_binary, y_prob_knn)
roc_auc_knn = roc_auc_score(y_test3_binary, y_prob_knn)

# Plot ROC Curves for KNN
plt.figure(figsize=(8, 6))

# KNN ROC Curve
plt.plot(fpr_knn, tpr_knn, color='green', lw=2, label='KNN (AUC = %0.2f)' % roc_auc_knn)

# Random Guess Line
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve for KNN')
plt.legend(loc='lower right')
plt.show()


## Cross-Validation for evaluating K-Nearest Neighbors Model

In [None]:
# Perform 5-fold cross-validation (we can change the number of folds if needed)
scores_knn = cross_val_score(knn_model, X3, y3, cv=5, scoring='accuracy')

# Print the cross-validation scores
print("Cross-Validation Scores:", scores_knn)
print("Mean Accuracy:", np.mean(scores_knn))

##### We can also get cross-validated predictions and performance metrics for each fold:

In [None]:
from sklearn.base import clone

# Perform 5-fold cross-validation with predictions
y_pred_cv_knn = cross_val_predict(knn_model, X3, y3, cv=5)

# Compute performance metrics for each fold.
# Fold-specific names and a cloned model are used so that the earlier
# X_test3 / y_test3 split and the fitted knn_model are not overwritten.
for i, (train, test) in enumerate(StratifiedKFold(n_splits=5).split(X3, y3)):
    X_train_fold, X_test_fold = X3.iloc[train], X3.iloc[test]
    y_train_fold, y_test_fold = y3.iloc[train], y3.iloc[test]
    
    fold_model = clone(knn_model).fit(X_train_fold, y_train_fold)
    y_pred_fold = fold_model.predict(X_test_fold)
    
    accuracy = accuracy_score(y_test_fold, y_pred_fold)
    confusion = confusion_matrix(y_test_fold, y_pred_fold)
    report = classification_report(y_test_fold, y_pred_fold, zero_division=np.nan)
    
    print(f"Fold {i + 1} - Accuracy: {accuracy}")
    print(f"Fold {i + 1} - Confusion Matrix:\n{confusion}")
    print(f"Fold {i + 1} - Classification Report:\n{report}")

### Results (Cross-Validation for evaluating K-Nearest Neighbors Model):

**Overall Mean Accuracy**: 

The mean accuracy across all 5 folds of the cross-validation is approximately 72.54%. This suggests that the KNN model exhibits a moderate level of accuracy on this dataset.

**Fold 1 Results**

Fold 1 Accuracy: 75.81%

Fold 1 Classification Report:

    Precision (Class 1): 75%
    Recall (Class 1): 100%
    F1-Score (Class 1): 86%
    Precision (Class 2): 100%
    Recall (Class 2): 12%
    F1-Score (Class 2): 21%

**Fold 2 Results**

Fold 2 Accuracy: 57.38%

Fold 2 Classification Report:

    Precision (Class 1): 79%
    Recall (Class 1): 58%
    F1-Score (Class 1): 67%
    Precision (Class 2): 32%
    Recall (Class 2): 56%
    F1-Score (Class 2): 41%

**Fold 3 Results**

Fold 3 Accuracy: 80.33%

Fold 3 Classification Report:

    Precision (Class 1): 84%
    Recall (Class 1): 91%
    F1-Score (Class 1): 87%
    Precision (Class 2): 67%
    Recall (Class 2): 50%
    F1-Score (Class 2): 57%

**Fold 4 Results**

Fold 4 Accuracy: 75.41%

Fold 4 Classification Report:

    Precision (Class 1): 81%
    Recall (Class 1): 87%
    F1-Score (Class 1): 84%
    Precision (Class 2): 54%
    Recall (Class 2): 44%
    F1-Score (Class 2): 48%

**Fold 5 Results**

Fold 5 Accuracy: 73.77%

Fold 5 Classification Report:

    Precision (Class 1): 76%
    Recall (Class 1): 93%
    F1-Score (Class 1): 84%
    Precision (Class 2): 50%
    Recall (Class 2): 19%
    F1-Score (Class 2): 27%


**Conclusion** 

The cross-validation results for the KNN model on the Haberman dataset indicate that the model's performance varies across the five folds. The mean accuracy of approximately 72.54% suggests that the model achieves moderate accuracy on average. However, there are notable differences in precision, recall, and F1-scores between the two classes, indicating that the model has varying success in distinguishing between patients who survived 5 or more years (Class 1) and those who did not (Class 2).

In Fold 3, the model achieves the highest accuracy and well-balanced performance in terms of precision, recall, and F1-scores for both classes. On the other hand, in Fold 2, the model struggles with a lower accuracy and less balanced performance, especially for Class 2, which is reflected in lower F1-scores.

In conclusion, the KNN model demonstrates moderate accuracy in predicting the survival status of breast cancer patients in the Haberman dataset. However, the model's performance varies across different folds, indicating that its effectiveness can depend on the specific data split. Further analysis and fine-tuning of the model may be necessary to enhance its predictive performance and improve the balance between precision and recall for both classes.
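Because the KNN scores depend so much on the split, one option is to repeat the stratified cross-validation with shuffling and look at the spread of scores. The sketch below uses `RepeatedStratifiedKFold` on synthetic stand-in data (an assumption, not the Haberman data); the number of repeats and the names `rskf` and `repeated_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption) with labels {1, 2}.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=0,
)
y_demo = y_demo + 1

# 5 folds repeated 10 times with different shuffles gives 50 accuracy estimates.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
repeated_scores = cross_val_score(
    KNeighborsClassifier(), X_demo, y_demo, cv=rskf, scoring="accuracy"
)

print("Mean accuracy:", repeated_scores.mean())
print("Std of accuracy:", repeated_scores.std())
```

Reporting the standard deviation alongside the mean shows how much of a difference between models could be due to the split alone.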


## Compare my three machine learning models (Logistic Regression, SVM, and K-Nearest Neighbors) and find the best one:

In this report, I compare the performance of three classification models - Logistic Regression, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) - on the Haberman dataset. The primary objective is to identify the best-performing model based on a variety of classification metrics.

### 1. Compare my models in terms of accuracy, confusion matrix, and classification report:

The **logistic regression model** has an accuracy of 73.9%. It does a good job for Class 1 but performs worse for Class 2.

The **SVM model** has an accuracy of 71.7%. It shows a high recall for Class 1 but fails to correctly classify any samples of Class 2, leading to undefined precision and F1-score for Class 2.

The **K-Nearest Neighbors model** achieves an accuracy of 78.3%. It has reasonable precision for both classes (0.79 and 0.71), but recall is far lower for Class 2 (0.38) than for Class 1 (0.94).

Among the three models, **K-Nearest Neighbors (KNN)** stands out with the highest accuracy (78.3%).

The Logistic Regression model also performs reasonably well but is outperformed by KNN in terms of accuracy and F1-scores for both classes.

The SVM model has a high recall for Class 1 but completely fails to classify any samples from Class 2, making it unsuitable for this dataset.

In conclusion, **based on the evaluation of classification metrics**, the **K-Nearest Neighbors (KNN) model** is the better choice compared to the other two models. It achieves **the highest accuracy and the best trade-off between precision and recall across the two classes**, although its recall for Class 2 remains low.
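Accuracy can be misleading on imbalanced data, so it is worth placing it next to balanced accuracy (the mean of the per-class recalls). The sketch below computes both from the SVM and KNN test confusion matrices reported above, assuming scikit-learn's layout of rows as actual classes and columns as predicted classes. The `summary` dictionary is a hypothetical helper.

```python
import numpy as np

# Test-set confusion matrices reported above (rows = actual, columns = predicted).
confusion = {
    "SVM": np.array([[66, 0], [26, 0]]),
    "KNN": np.array([[62, 4], [16, 10]]),
}

summary = {}
for name, cm in confusion.items():
    accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
    per_class_recall = np.diag(cm) / cm.sum(axis=1)  # recall for class 1 and class 2
    summary[name] = (accuracy, per_class_recall.mean())
    print(f"{name}: accuracy={accuracy:.3f} balanced accuracy={per_class_recall.mean():.3f}")
```

The SVM's balanced accuracy of 0.5 is what a model with no discriminating ability achieves, even though its plain accuracy looks respectable.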

### 2. Compare my models in terms of ROC Curve:

In [None]:
# Convert labels from {1, 2} to {0, 1}
label_encoder = LabelEncoder()
y_test1_binary = label_encoder.fit_transform(y_test1)  # Convert y_test1
y_test2_binary = label_encoder.transform(y_test2)  # Convert y_test2 (use transform, not fit_transform)
y_test3_binary = label_encoder.transform(y_test3)  # Convert y_test3 (use transform, not fit_transform)

# Logistic Regression ROC Curve
y_prob_logistic = logistic_model.predict_proba(X_test1)[:, 1]  # Probability of the second class (label 2, encoded as 1 above)
fpr_logistic, tpr_logistic, thresholds_logistic = roc_curve(y_test1_binary, y_prob_logistic)
roc_auc_logistic = roc_auc_score(y_test1_binary, y_prob_logistic)

# SVM ROC Curve
y_prob_svm = svm_model.decision_function(X_test2)  # Use decision_function for SVM
fpr_svm, tpr_svm, thresholds_svm = roc_curve(y_test2_binary, y_prob_svm)
roc_auc_svm = roc_auc_score(y_test2_binary, y_prob_svm)

# KNN ROC Curve
y_prob_knn = knn_model.predict_proba(X_test3)[:, 1]  # Probability of the second class (label 2, encoded as 1 above)
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test3_binary, y_prob_knn)
roc_auc_knn = roc_auc_score(y_test3_binary, y_prob_knn)

# Plot ROC Curves
plt.figure(figsize=(8, 6))

# Plot ROC curves for each model
plt.plot(fpr_logistic, tpr_logistic, color='blue', lw=2, label='Logistic Regression (AUC = %0.2f)' % roc_auc_logistic)
plt.plot(fpr_svm, tpr_svm, color='red', lw=2, label='SVM (AUC = %0.2f)' % roc_auc_svm)
plt.plot(fpr_knn, tpr_knn, color='green', lw=2, label='KNN (AUC = %0.2f)' % roc_auc_knn)

# Random Guess Line
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()


### 3. Compare my models in terms of Cross-Validation:

In this part I will compare three different machine learning models: Logistic Regression, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) for the task of classifying patients in the Haberman dataset. I will evaluate the models' performance using cross-validation and provide reasoning to determine which model is the most suitable for this classification task.

**Logistic Model:** 

    Mean Accuracy: 0.7483
    Fold Accuracy Range: 0.7213 to 0.7581
    
Confusion Matrices and Classification Reports: The model demonstrates moderate accuracy that is fairly consistent across folds. The classification report highlights that the model is better at predicting Class 1 (survived) but has challenges with Class 2 (not survived), as evidenced by the low recall for Class 2 (12% to 25%).

**SVM Model:**

    Mean Accuracy: 0.7288
    Fold Accuracy Range: 0.7213 to 0.7377
    
Confusion Matrices and Classification Reports: The SVM model exhibits lower accuracy than the Logistic Model. It is worth noting that the model predicts Class 1 for almost every sample and never correctly identifies a Class 2 sample in any fold, leading to poor overall performance.

**K-Nearest Neighbors (KNN) Model:**

    Mean Accuracy: 0.7254
    Fold Accuracy Range: 0.5738 to 0.8033
    
Confusion Matrices and Classification Reports: The KNN model shows the widest accuracy range among the three models. While it achieves high accuracy in some folds, it also has the lowest accuracy in others. Its recall for Class 2 (12% to 56%) is well above SVM's and, in three of the five folds, above the Logistic Model's, but its accuracy is the least stable.


**Comparative Analysis:**

* The Logistic Model outperforms both SVM and KNN in terms of mean accuracy, demonstrating a consistent accuracy score around 0.75.
* The Logistic Model's recall for Class 2 (not survived) is low (12% to 25%): better than SVM's 0%, but below KNN's in three of the five folds.
* The SVM Model consistently fails to predict Class 2, rendering it unsuitable for this dataset.
* The KNN Model shows a variable performance across different folds, with the ability to achieve high accuracy but also falling short in some instances.
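The comparison above can be checked by summarizing the per-fold numbers reported in the three cross-validation results sections. The sketch below only aggregates those reported percentages; the dictionary names are hypothetical.

```python
import numpy as np

# Per-fold accuracy (%) as reported in the cross-validation results above.
fold_accuracy = {
    "Logistic": [75.81, 75.41, 72.13, 75.41, 75.41],
    "SVM": [72.58, 73.77, 72.13, 73.77, 72.13],
    "KNN": [75.81, 57.38, 80.33, 75.41, 73.77],
}
# Per-fold recall for Class 2 (%) as reported above.
fold_recall_class2 = {
    "Logistic": [18, 12, 12, 19, 25],
    "SVM": [0, 0, 0, 0, 0],
    "KNN": [12, 56, 50, 44, 19],
}

means = {}
for name in fold_accuracy:
    acc = np.array(fold_accuracy[name])
    rec = np.array(fold_recall_class2[name])
    means[name] = (acc.mean(), acc.max() - acc.min(), rec.mean())
    print(f"{name}: mean accuracy={acc.mean():.2f}% "
          f"accuracy range={acc.max() - acc.min():.2f} "
          f"mean Class 2 recall={rec.mean():.1f}%")
```

Laying the numbers side by side shows the trade-off: the Logistic Model has the highest mean accuracy, while KNN has the highest mean Class 2 recall and the widest accuracy range.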

**Conclusion and Recommendation:**

Based on the **cross-validation** results, the **Logistic Model** appears to be the most suitable choice for the Haberman dataset classification task. It demonstrates **the highest mean accuracy** and **stable accuracy across folds**. However, it's important to note that even the Logistic Model struggles with predicting Class 2 accurately, and KNN recalls Class 2 better in most folds, indicating the complexity of the classification problem.

If improving the classification of Class 2 is a critical objective, further **model tuning**, **feature engineering**, or **alternative algorithms** might be considered. Additionally, **the dataset size**, **imbalance between the classes**, and **feature selection** can play a significant role in model performance and should be taken into account when making a final decision.

In summary, the **Logistic Model** is the best performer among the three models, but further refinements may be needed to improve the classification of Class 2 in the Haberman dataset.

## Conclusion
