![predicting_breast_cancer.png](attachment:9ce50cae-602b-496b-9612-258f480208c9.png)

# Predicting Breast Cancer

## Using 10 Models to predict the onset of Breast Cancer in the next year.

#### Original notebook downloaded, adapted and localized from KAGGLE - https://www.kaggle.com/code/sandragracenelson/Breast-cancer-prediction
#### by Joe Eberle started on 04-10-2023 - https://github.com/JoeEberle/ - josepheberle@outlook.com

![sample.png](attachment:b338f069-a8a2-4e55-8e5f-0302f9c069ed.png)

## Explanation of Each Model 

1. **Logistic Regression**: A linear model used for binary classification that estimates the probability of a sample belonging to a particular class.

2. **Decision Tree**: A tree-like model that splits the data into subsets based on the value of input features, making decisions based on feature values to classify instances.

3. **K-Nearest Neighbor (KNN)**: A non-parametric method used for classification by finding the 'k' nearest data points in the feature space and assigning the most common class among them to the query point.

4. **Gaussian Naive Bayes**: A probabilistic classifier based on Bayes' theorem that assumes the features are independent of each other and that each feature follows a normal distribution within each class.

5. **Multinomial Naive Bayes**: Similar to Gaussian Naive Bayes but specifically designed for classification tasks with discrete features, such as word counts in text classification.

6. **Support Vector Classifier (SVC)**: A supervised learning algorithm that finds the hyperplane that best separates classes in a high-dimensional space, often used for binary classification.

7. **Random Forest**: An ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

8. **XGBoost**: An optimized gradient boosting library that implements machine learning algorithms under the Gradient Boosting framework, known for its speed and performance in handling large datasets.

9. **Multi-layer Perceptron (MLP)**: A type of artificial neural network composed of multiple layers of nodes (neurons) that can learn non-linear relationships between input and output data.

10. **Gradient Boosting Classifier**: A machine learning technique that builds an ensemble of weak learners (typically decision trees) in a sequential manner, with each tree correcting the errors of its predecessors, resulting in a strong predictive model.

## Variables analyzed with the machine learning algorithms

1. Gender
2. Age
3. Smoking
4. Yellow fingers
5. Anxiety
6. Peer pressure
7. Chronic disease
8. Fatigue
9. Allergy
10. Wheezing
11. Alcohol consuming
12. Coughing
13. Shortness of breath
14. Swallowing difficulty
15. Chest pain
16. Breast cancer

In [1]:
first_install = False 
if first_install:
    !pip install schedule
    !pip install zipp

In [2]:
import os
import schedule
from datetime import datetime
import pandas as pd 
import quick_logger as ql
import talking_code as tc 
import file_manager as fm 
import time

#Importing Libraries (pandas is already imported above)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#For ignoring warning
import warnings
warnings.filterwarnings("ignore")

print(f"Libraries imported successfully on {datetime.now().date()} at {datetime.now().time()}") 

Libraries imported successfully on 2024-04-12 at 15:36:52.629930


## Optional Step 0 - Initiate Configuration Settings and name the overall solution

In [3]:
import configparser 
config = configparser.ConfigParser()
cfg = config.read('config.ini')  
solution_name = 'predicting_breast_cancer'

## Optional Step 0 - Initiate Logging and debugging 

In [4]:
# Establish the Python Logger  
import logging # built in python library that does not need to be installed 
import quick_logger as ql

global start_time 
start_time = ql.set_start_time()
logging = ql.create_logger_start(solution_name, start_time) 
ql.set_speaking_log(False)
ql.set_speaking_steps(False)
ql.pvlog('info',f'Process {solution_name} Step 0 - Initializing and starting Logging Process.') 


Process predicting_breast_cancer Step 0 - Initializing and starting Logging Process.


## Step 1  - Read in Sample Data Set 

In [6]:
sample_file_name = 'survey lung cancer.csv'
df=pd.read_csv(sample_file_name)   # Kaggle data sets need to be localized 
print(f"The sample data file is '{sample_file_name}' ")
print(f"The sample data set contains {df.shape[0]} rows and {df.shape[1]} columns")
print(f"The sample data set contains {df.duplicated().sum()} duplicate rows, leaving {df.shape[0]-df.duplicated().sum()} unique rows" )
ql.pvlog('info',f'Process {solution_name} Step 1  - Read in Sample Data Set .') 

df

FileNotFoundError: [Errno 2] No such file or directory: 'survey lung cancer.csv'

## Step 2.1  - Data Cleansing - Label encode
#### Converts the 2/1 codes to 1/0 in this dataset 

In [None]:
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
# Columns to label encode; names are copied exactly, including trailing spaces
columns_to_encode = ['GENDER', 'Breast_CANCER', 'SMOKING', 'YELLOW_FINGERS', 'ANXIETY',
                     'PEER_PRESSURE', 'CHRONIC DISEASE', 'FATIGUE ', 'ALLERGY ', 'WHEEZING',
                     'ALCOHOL CONSUMING', 'COUGHING', 'SHORTNESS OF BREATH',
                     'SWALLOWING DIFFICULTY', 'CHEST PAIN']
for column in columns_to_encode:
    df[column]=le.fit_transform(df[column])
df
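To illustrate what the label encoding step does, here is a minimal sketch on made-up values. The `toy` DataFrame and its `TARGET` column are hypothetical and are not taken from the survey file. `LabelEncoder` sorts the distinct values of a column and numbers them from 0, which is why 1/2 become 0/1 and 'NO'/'YES' become 0/1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows, not taken from the survey file
toy = pd.DataFrame({
    'GENDER': ['M', 'F', 'F', 'M'],
    'SMOKING': [2, 1, 1, 2],
    'TARGET': ['YES', 'NO', 'YES', 'YES'],
})

le_demo = LabelEncoder()
encoded = toy.copy()
for column in encoded.columns:
    encoded[column] = le_demo.fit_transform(encoded[column])
    # classes_ lists the original values in the order they were numbered
    print(column, list(le_demo.classes_))

print(encoded)
```

Because the encoder is refitted for every column, `classes_` only describes the most recently encoded column; keep one encoder per column if the mapping has to be inverted later.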

## Step 2.2  - Data Cleansing - Remove duplicates

In [None]:
df=df.drop_duplicates()     #Removing Duplicates
print(f"The sample data set contains {df.shape[0]} rows and {df.shape[1]} columns")
print(f"The sample data set contains {df.duplicated().sum()} duplicate rows, leaving {df.shape[0]-df.duplicated().sum()} unique rows" )

ql.pvlog('info',f'Process {solution_name} Step 2.2  - Data Cleansing - Remove duplicates.')

In [None]:
def get_column_statistics(df):
    total_rows = len(df)
    column_details = []

    for column in df.columns:
        filled_rows = df[column].count()
        filled_percentage = (filled_rows / total_rows) * 100
        dtype = df[column].dtype
        null_values = total_rows - filled_rows
        column_details.append({
            'Column': column,
            'Filled Rows': filled_rows,
            'Filled Percentage': f"{filled_percentage:.0f}%",
            'Data Type': dtype,
            'Null Values': null_values
        })
    
    df_column_stats = pd.DataFrame(column_details)
    return df_column_stats


df_column_stats = get_column_statistics(df)
df_column_stats.head(20)


## Step 3  - Examine Data Set 

In [None]:
# Column types, non-null counts and null values per column
df.info()
df.isnull().sum()

In [None]:
# Summary statistics for each column
df.describe()

## Step 4 - Examine Target Variable 

In [None]:
#Let's check the distribution of the target variable.
target_variable = 'Breast_CANCER'
sns.countplot(x=target_variable, data=df)
plt.title('Target Distribution');

In [None]:
df[target_variable].value_counts()

In [None]:
# function for plotting
def plot_independent_to_target(col, df=df):
    return df.groupby(col)[target_variable].value_counts(normalize=True).unstack().plot(kind='bar', figsize=(6,3))
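The plotting helper relies on a grouped, normalized value count. As a rough sketch of the table it builds before plotting, the example below uses a hypothetical toy DataFrame (the `COUGHING` and `TARGET` values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data: one binary feature and a binary target
toy = pd.DataFrame({
    'COUGHING': [0, 0, 0, 0, 1, 1, 1, 1],
    'TARGET':   [0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of each target class within each feature value; every row sums to 1
shares = (
    toy.groupby('COUGHING')['TARGET']
       .value_counts(normalize=True)
       .unstack()
)
print(shares)
```

Each row of `shares` is one bar group in the chart, so a feature whose rows look alike carries little information about the target.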

## Step 5 - Plot all independent variables relationship to target variable

In [None]:
ql.pvlog('info',f'Process {solution_name} Step 5 - Plot all independent variables relationship to target variable')
for independent_variable in df.columns:
    if independent_variable != target_variable:
        plot_independent_to_target(independent_variable)                                           

## Step 7 - Drop the columns that don't help the analysis

In [None]:
ql.pvlog('info',f'Process {solution_name} Step 7 - Drop the columns that do not help the analysis') 
df_new=df.drop(columns=['GENDER','AGE', 'SMOKING', 'SHORTNESS OF BREATH', 'PEER_PRESSURE'])
df_new

In [None]:
#Finding Correlation
cn=df_new.corr()
cn

## Correlation Heatmap 

In [None]:
#Correlation 
cmap=sns.diverging_palette(260,-10,s=50, l=75, n=6,
as_cmap=True)
plt.subplots(figsize=(18,18))
sns.heatmap(cn,cmap=cmap,annot=True, square=True)
plt.show()
ql.pvlog('info',f'Process {solution_name} Step 7 - Correlation Heatmap')

## Step 8 - Feature Engineering

In [None]:
ql.pvlog('info',f'Process {solution_name} Step 8 - Feature Engineering')
df_new['ANXYELFIN']=df_new['ANXIETY']*df_new['YELLOW_FINGERS']
df_new

In [None]:
#Splitting independent and dependent variables
X = df_new.drop('Breast_CANCER', axis = 1)
y = df_new['Breast_CANCER']

In [None]:
!pip install imbalanced-learn

In [None]:
# from imblearn.over_sampling import ADASYN


In [None]:
# adasyn = ADASYN(random_state=42)
# X, y = adasyn.fit_resample(X, y)

In [None]:
len(X)

## 10 machine learning classification models using the Scikit-learn and XGBoost libraries in Python

1. Logistic Regression
2. Decision Tree
3. K-Nearest Neighbor
4. Gaussian Naive Bayes
5. Multinomial Naive Bayes
6. Support Vector Classifier
7. Random Forest
8. XGBoost
9. Multi-layer Perceptron
10. Gradient Boosting Classifier

# **Logistic Regression**

**Logistic Regression**: A linear model used for binary classification that estimates the probability of a sample belonging to a particular class.
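As a minimal sketch of how the probability estimate arises, the example below fits a model on synthetic data from `make_classification` (a stand-in, not the notebook's `X` and `y`) and reproduces `predict_proba` by passing the linear score through the sigmoid function. The names `X_demo`, `y_demo` and `demo_model` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the notebook's X and y are not used here
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_model = LogisticRegression(random_state=0)
demo_model.fit(X_demo, y_demo)

# Linear score w.x + b for each sample
scores = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
# The sigmoid maps the score to the estimated probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))

sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

`predict` then simply labels a sample as class 1 when this probability exceeds 0.5, which is the same as the linear score being positive.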

In [None]:
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Logistic Regression')
#Splitting data for training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.25, random_state=0)

In [None]:
#Fitting training data to the model
from sklearn.linear_model import LogisticRegression
lr_model=LogisticRegression(random_state=0)
lr_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_lr_pred= lr_model.predict(X_test)
y_lr_pred

In [None]:
#Model accuracy
from sklearn.metrics import classification_report, accuracy_score, f1_score
lr_cr=classification_report(y_test, y_lr_pred)
print(lr_cr)
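The classification report summarizes precision, recall and F1 score per class. As a small sketch of where those numbers come from, the example below derives them from a confusion matrix for hypothetical labels and predictions (`y_true` and `y_hat` are invented for illustration and are unrelated to the notebook's results).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_hat = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# The hand-computed values agree with scikit-learn's own functions
print(np.isclose(precision, precision_score(y_true, y_hat)),
      np.isclose(recall, recall_score(y_true, y_hat)),
      np.isclose(f1, f1_score(y_true, y_hat)))
```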

# **Decision Tree**

 **Decision Tree**: A tree-like model that splits the data into subsets based on the value of input features, making decisions based on feature values to classify instances.
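The tree in this section is built with `criterion='entropy'`, which means candidate splits are compared by how much they reduce entropy. The sketch below computes that reduction (information gain) by hand for two hypothetical binary features; the helper functions and toy arrays are assumptions made for illustration, not part of scikit-learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting `labels` on the values of `feature`."""
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # Weight each subset's entropy by the share of samples it holds
        weighted += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted

# Hypothetical toy data
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
useful = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # separates the classes perfectly
noisy = np.array([1, 0, 1, 0, 1, 0, 1, 0])    # only weakly related to the labels

print("gain (useful):", information_gain(useful, labels))
print("gain (noisy): ", information_gain(noisy, labels))
```

A tree grown with the entropy criterion would pick the `useful` feature first because its split yields the larger gain.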

In [None]:
#Fitting training data to the model
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Decision Tree')
from sklearn.tree import DecisionTreeClassifier
dt_model= DecisionTreeClassifier(criterion='entropy', random_state=0)  
dt_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_dt_pred= dt_model.predict(X_test)
y_dt_pred

In [None]:
#Model accuracy
dt_cr=classification_report(y_test, y_dt_pred)
print(dt_cr)

# **K Nearest Neighbor**

K-Nearest Neighbor (KNN): A non-parametric method used for classification by finding the 'k' nearest data points in the feature space and assigning the most common class among them to the query point.
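The voting rule described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation with Euclidean distance, not scikit-learn's internal code; the function `knn_predict` and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(points, labels, query, k=5):
    """Predict the class of `query` by majority vote among its k nearest points."""
    distances = np.linalg.norm(points - query, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(labels[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))
```

Euclidean distance corresponds to the `metric='minkowski', p=2` setting used for the classifier in this section.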

In [None]:
#Fitting K-NN classifier to the training set  
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - K Nearest Neighbor')
from sklearn.neighbors import KNeighborsClassifier  
knn_model= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )  
knn_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_knn_pred= knn_model.predict(X_test)
y_knn_pred

In [None]:
#Model accuracy
knn_cr=classification_report(y_test, y_knn_pred)
print(knn_cr)

# **Gaussian Naive Bayes**

In [None]:
#Fitting Gaussian Naive Bayes classifier to the training set  
from sklearn.naive_bayes import GaussianNB
gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_gnb_pred= gnb_model.predict(X_test)
y_gnb_pred

In [None]:
#Model accuracy
gnb_cr=classification_report(y_test, y_gnb_pred)
print(gnb_cr)
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Gaussian Naive Bayes')

# **Multinomial Naive Bayes**
**Multinomial Naive Bayes**: Similar to Gaussian Naive Bayes but specifically designed for classification tasks with discrete features, such as word counts in text classification.

In [None]:
#Fitting Multinomial Naive Bayes classifier to the training set  
from sklearn.naive_bayes import MultinomialNB
mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_mnb_pred= mnb_model.predict(X_test)
y_mnb_pred

In [None]:
#Model accuracy
mnb_cr=classification_report(y_test, y_mnb_pred)
print(mnb_cr)
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Multinomial Naive Bayes')

# **Support Vector Classifier**
**Support Vector Classifier (SVC)**: A supervised learning algorithm that finds the hyperplane that best separates classes in a high-dimensional space, often used for binary classification.


In [None]:
#Fitting SVC to the training set  
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_svc_pred= svc_model.predict(X_test)
y_svc_pred

In [None]:
#Model accuracy
svc_cr=classification_report(y_test, y_svc_pred)
print(svc_cr)
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Support Vector Classifier')

# **Random Forest**

In [None]:
#Training
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_rf_pred= rf_model.predict(X_test)
y_rf_pred

In [None]:
#Model accuracy
rf_cr=classification_report(y_test, y_rf_pred)
print(rf_cr)
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Random Forest')

# **XGBoost**

In [None]:
upgrading_xgboost = False
if upgrading_xgboost: 
    !pip install xgboost
    !pip install --upgrade xgboost
    !pip install --upgrade pandas

In [None]:
# from xgboost import XGBClassifier
# xgb_model = XGBClassifier()
# xgb_model.fit(X_train, y_train)

In [None]:
# #Predicting result using testing data
# y_xgb_pred= xgb_model.predict(X_test)
# y_xgb_pred

In [None]:
# #Model accuracy
# xgb_cr=classification_report(y_test, y_xgb_pred)
# print(xgb_cr)
# ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - XGBoost')

# **Multi-layer Perceptron classifier**

In [None]:
#Training a neural network model
from sklearn.neural_network import MLPClassifier
mlp_model = MLPClassifier()
mlp_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_mlp_pred= mlp_model.predict(X_test)
y_mlp_pred

In [None]:
#Model accuracy
mlp_cr=classification_report(y_test, y_mlp_pred)
print(mlp_cr)
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Multi-layer Perceptron classifier')

# **Gradient Boosting**

In [None]:
#Training
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)

In [None]:
#Predicting result using testing data
y_gb_pred= gb_model.predict(X_test)
y_gb_pred

In [None]:
#Model accuracy
gb_cr=classification_report(y_test, y_gb_pred)
print(gb_cr)
ql.pvlog('info',f'Process {solution_name} Machine Learning Model  - Gradient Boosting')

## Cross Validation 

K-Fold cross validation is a popular technique used in machine learning for model evaluation and selection. It involves dividing a dataset into K subsets of equal size, called folds. The algorithm then trains and evaluates the model K times, each time using a different fold as the validation set and the remaining K-1 folds as the training set.

During each iteration of K-Fold cross validation, the model is trained on K-1 folds and evaluated on the remaining fold. The performance metrics are then averaged over all K iterations to obtain an estimate of the model's overall performance.

K-Fold cross validation is a robust method for model evaluation because every observation is used for training and, exactly once, for validation. It does not prevent overfitting by itself, but it makes overfitting easier to detect and provides a more reliable estimate of the model's performance than a single train-test split.

Typically, values of K between 5 and 10 are used for K-Fold cross validation, but the optimal value of K may vary depending on the size and complexity of the dataset, as well as the type of model being evaluated.

Here are some general guidelines that can help you choose an appropriate value of K:

1. For small datasets, it is recommended to use a larger value of K, such as 10. Each model is then trained on a larger share of the limited data (90% when K is 10), so the performance estimate is closer to that of a model trained on the full dataset.

2. For larger datasets, a smaller value of K can be used, such as 5. Each training set is already large enough to be representative, and a smaller value of K leaves more data in each validation fold, so each fold's score is less noisy.

3. For models that are computationally expensive or time-consuming to train, a smaller value of K is preferred to reduce the overall training time.

4. It's also essential to note that the choice of K should not be based solely on the accuracy of the model. Other metrics, such as precision, recall, and F1 score, should also be considered, as they can provide valuable insights into the performance of the model.
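To make the trade-off between K values concrete, the sketch below splits a hypothetical dataset of 100 rows with K=5 and K=10 and records the size of each training and validation set. The array `X_demo` is a placeholder introduced for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 100
X_demo = np.arange(n_samples).reshape(-1, 1)   # placeholder feature matrix

fold_sizes = {}
for k in (5, 10):
    kf_demo = KFold(n_splits=k, shuffle=True, random_state=42)
    # One (training size, validation size) pair per fold
    sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf_demo.split(X_demo)]
    fold_sizes[k] = sizes
    print(f"K={k}: {len(sizes)} folds, train/validation size per fold = {sizes[0]}")
```

With 100 rows, K=5 trains each model on 80 rows and validates on 20, while K=10 trains on 90 and validates on 10, at the cost of fitting twice as many models.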

In [None]:
## K-fold Cross Validation 

In [None]:
# K-Fold Cross Validation

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)


# Logistic regression model
lr_model_scores = cross_val_score(lr_model,X, y, cv=kf)

# Decision tree model
dt_model_scores = cross_val_score(dt_model,X, y, cv=kf)

# KNN model
knn_model_scores = cross_val_score(knn_model,X, y, cv=kf)

# Gaussian naive bayes model
gnb_model_scores = cross_val_score(gnb_model,X, y, cv=kf)

# Multinomial naive bayes model
mnb_model_scores = cross_val_score(mnb_model,X, y, cv=kf)

# Support Vector Classifier model
svc_model_scores = cross_val_score(svc_model,X, y, cv=kf)

# Random forest model
rf_model_scores = cross_val_score(rf_model,X, y, cv=kf)

# XGBoost model
# xgb_model_scores = cross_val_score(xgb_model,X, y, cv=kf)

# Multi-layer perceptron model
mlp_model_scores = cross_val_score(mlp_model,X, y, cv=kf)

# Gradient boost model
gb_model_scores = cross_val_score(gb_model,X, y, cv=kf)


print("Logistic regression models' average accuracy:", np.mean(lr_model_scores))
print("Decision tree models' average accuracy:", np.mean(dt_model_scores))
print("KNN models' average accuracy:", np.mean(knn_model_scores))
print("Gaussian naive bayes models' average accuracy:", np.mean(gnb_model_scores))
print("Multinomial naive bayes models' average accuracy:", np.mean(mnb_model_scores))
print("Support Vector Classifier models' average accuracy:", np.mean(svc_model_scores))
print("Random forest models' average accuracy:", np.mean(rf_model_scores))
# print("XGBoost models' average accuracy:", np.mean(xgb_model_scores))
print("Multi-layer perceptron models' average accuracy:", np.mean(mlp_model_scores))
print("Gradient boost models' average accuracy:", np.mean(gb_model_scores))
ql.pvlog('info',f'Process {solution_name}  K-fold Cross Validation ')

In [None]:
# Stratified K-Fold Cross Validation (each fold keeps the class proportions of y)

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

k = 10
kf = StratifiedKFold(n_splits=k)


# Logistic regression model
lr_model_scores = cross_val_score(lr_model,X, y, cv=kf)

# Decision tree model
dt_model_scores = cross_val_score(dt_model,X, y, cv=kf)

# KNN model
knn_model_scores = cross_val_score(knn_model,X, y, cv=kf)

# Gaussian naive bayes model
gnb_model_scores = cross_val_score(gnb_model,X, y, cv=kf)

# Multinomial naive bayes model
mnb_model_scores = cross_val_score(mnb_model,X, y, cv=kf)

# Support Vector Classifier model
svc_model_scores = cross_val_score(svc_model,X, y, cv=kf)

# Random forest model
rf_model_scores = cross_val_score(rf_model,X, y, cv=kf)

# XGBoost model
# xgb_model_scores = cross_val_score(xgb_model,X, y, cv=kf)

# Multi-layer perceptron model
mlp_model_scores = cross_val_score(mlp_model,X, y, cv=kf)

# Gradient boost model
gb_model_scores = cross_val_score(gb_model,X, y, cv=kf)


print("Logistic regression models' average accuracy:", np.mean(lr_model_scores))
print("Decision tree models' average accuracy:", np.mean(dt_model_scores))
print("KNN models' average accuracy:", np.mean(knn_model_scores))
print("Gaussian naive bayes models' average accuracy:", np.mean(gnb_model_scores))
print("Multinomial naive bayes models' average accuracy:", np.mean(mnb_model_scores))
print("Support Vector Classifier models' average accuracy:", np.mean(svc_model_scores))
print("Random forest models' average accuracy:", np.mean(rf_model_scores))
# print("XGBoost models' average accuracy:", np.mean(xgb_model_scores))
print("Multi-layer perceptron models' average accuracy:", np.mean(mlp_model_scores))
print("Gradient boost models' average accuracy:", np.mean(gb_model_scores))
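The difference between plain and stratified K-Fold matters most when the classes are imbalanced. As a minimal sketch, the example below counts how many minority-class samples land in each validation fold for a hypothetical label vector of 90 positives followed by 10 negatives (`y_demo` and `minority_counts` are invented for illustration).

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 positives followed by 10 negatives
y_demo = np.array([1] * 90 + [0] * 10)
X_demo = np.zeros((len(y_demo), 1))   # features are irrelevant for the split itself

def minority_counts(splitter):
    """Number of minority-class samples in each validation fold."""
    return [int((y_demo[val_idx] == 0).sum())
            for _, val_idx in splitter.split(X_demo, y_demo)]

plain = minority_counts(KFold(n_splits=5))
stratified = minority_counts(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Without shuffling, plain K-Fold puts all ten minority samples into the last fold, whereas the stratified splitter spreads them evenly, so every fold is scored on both classes.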

## Step 0 - Process End - display log

In [None]:
# Calculate and classify the process performance 
status = ql.calculate_process_performance(solution_name, start_time) 
print(ql.append_log_file(solution_name))  
