![predicting_lung_cancer.png](attachment:ec1ef66f-c36f-4a5a-a013-55a155405d9a.png)

# Predicting Lung Cancer

## Using 10 Models to predict the onset of Lung Cancer in the next year.

#### Original notebook downloaded, adapted, and localized from Kaggle - https://www.kaggle.com/code/sandragracenelson/lung-cancer-prediction
#### by Joe Eberle, started on 04-10-2023 - https://github.com/JoeEberle/ - josepheberle@outlook.com

## 10 machine learning classification models in Python, using scikit-learn and XGBoost

1. Logistic Regression
2. Decision Tree
3. K-Nearest Neighbor
4. Gaussian Naive Bayes
5. Multinomial Naive Bayes
6. Support Vector Classifier
7. Random Forest
8. XGBoost
9. Multi-layer Perceptron
10. Gradient Boosting Classifier
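
As a rough sketch of how the ten models listed above might be instantiated, the snippet below builds a dictionary of untrained classifiers. It assumes scikit-learn is installed and treats the separate `xgboost` package as optional; the dictionary keys, the `models` name, and `max_iter=1000` are illustrative choices, not settings taken from the original notebook.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Untrained classifiers keyed by the names used in the list above.
# Hyperparameters are library defaults unless stated (assumption, not the notebook's settings).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Support Vector Classifier": SVC(),
    "Random Forest": RandomForestClassifier(),
    "Multi-layer Perceptron": MLPClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
}

# XGBoost ships in its own package, so it is added only when available.
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier()
except ImportError:
    print("xgboost is not installed; skipping XGBoost")

for name in models:
    print(name)
```

Keeping the models in one dictionary makes it easy to loop over them later with the same `fit` and `predict` calls.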

## Variables analyzed by the machine learning algorithms

1. Gender
2. Age
3. Smoking
4. Yellow fingers
5. Anxiety
6. Peer pressure
7. Chronic disease
8. Fatigue
9. Allergy
10. Wheezing
11. Alcohol consuming
12. Coughing
13. Shortness of breath
14. Swallowing difficulty
15. Chest pain
16. Lung cancer (the target the models predict)
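
The list above mixes fifteen input variables with the one outcome to predict. A minimal sketch of that split is shown below; the variable names are copied from the list, while `split_record`, the example record, and its 0/1 encoding are hypothetical and may not match the column names or value coding in the actual dataset file.

```python
# Names copied from the variable list above; the dataset's real column headers may differ.
VARIABLES = [
    "Gender", "Age", "Smoking", "Yellow fingers", "Anxiety", "Peer pressure",
    "Chronic disease", "Fatigue", "Allergy", "Wheezing", "Alcohol consuming",
    "Coughing", "Shortness of breath", "Swallowing difficulty", "Chest pain",
    "Lung cancer",
]
TARGET = "Lung cancer"
FEATURES = [name for name in VARIABLES if name != TARGET]


def split_record(record):
    """Return (feature values in FEATURES order, target value) for one record."""
    features = [record[name] for name in FEATURES]
    return features, record[TARGET]


# Made-up record purely for illustration (0 = no, 1 = yes is an assumed encoding).
example = {name: 0 for name in FEATURES}
example.update({"Age": 60, "Smoking": 1, "Lung cancer": 1})

x, y = split_record(example)
print(len(x), "features; target =", y)
```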

## Explanation of Each Model 

1. **Logistic Regression**: A linear model used for binary classification that estimates the probability of a sample belonging to a particular class.

2. **Decision Tree**: A tree-like model that splits the data into subsets based on the value of input features, making decisions based on feature values to classify instances.

3. **K-Nearest Neighbor (KNN)**: A non-parametric method used for classification by finding the 'k' nearest data points in the feature space and assigning the most common class among them to the query point.

4. **Gaussian Naive Bayes**: A probabilistic classifier based on Bayes' theorem that assumes the features are conditionally independent given the class and that each continuous feature follows a Gaussian (normal) distribution within each class.

5. **Multinomial Naive Bayes**: A Naive Bayes variant designed for discrete count features, such as word counts in text classification.

6. **Support Vector Classifier (SVC)**: A supervised learning algorithm that finds the hyperplane that best separates classes in a high-dimensional space, often used for binary classification.

7. **Random Forest**: An ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

8. **XGBoost**: An optimized gradient boosting library that implements machine learning algorithms under the Gradient Boosting framework, known for its speed and performance in handling large datasets.

9. **Multi-layer Perceptron (MLP)**: A type of artificial neural network composed of multiple layers of nodes (neurons) that can learn non-linear relationships between input and output data.

10. **Gradient Boosting Classifier**: A machine learning technique that builds an ensemble of weak learners (typically decision trees) in a sequential manner, with each tree correcting the errors of its predecessors, resulting in a strong predictive model.
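
To make one of these descriptions concrete, here is a minimal pure-Python sketch of the K-Nearest Neighbor rule from item 3: find the `k` closest training points and return the most common label among them. The function name `knn_predict` and the toy points are invented for illustration; this is not the scikit-learn implementation, which adds indexing structures and tie handling.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature data, made up for illustration only.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["no", "no", "yes", "yes", "yes"]

prediction = knn_predict(points, labels, query=(1.0, 0.9), k=3)
print(prediction)
```

An odd `k` avoids tied votes when there are only two classes.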

![image.png](attachment:ce1df4cb-8488-4df6-bfe9-467f037e196f.png)

In [1]:
first_install = False 
if first_install:
    !pip install schedule
    !pip install zipp

In [1]:
import os
import schedule
from datetime import datetime
import pandas as pd 
import quick_logger as ql
import talking_code as tc 
import file_manager as fm 
import time
print(f"Libraries imported successfully on {datetime.now().date()} at {datetime.now().time()}") 

Libraries imported successfully on 2024-03-17 at 22:14:05.996959


## Optional Step 0 - Initiate Configuration Settings and name the overall solution

In [3]:
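import configparser 
config = configparser.ConfigParser()
# read() returns the list of files it parsed; a missing config.ini gives an empty list, not an error
cfg = config.read('config.ini')  
solution_name = 'predicting_lung_cancer'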
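
Because `ConfigParser.read` silently skips files it cannot find, it can help to load defaults first and check whether `config.ini` was actually read. The sketch below shows one way to do that with the standard library; the `load_config` helper, the `solution` section, and the `name` key are hypothetical and not part of the original notebook.

```python
import configparser


def load_config(path="config.ini", defaults=None):
    """Load defaults, then overlay values from `path` if the file exists."""
    parser = configparser.ConfigParser()
    if defaults:
        parser.read_dict(defaults)
    # read() returns the list of files it managed to parse.
    files_read = parser.read(path)
    return parser, bool(files_read)


# Hypothetical section and key names, used only for this illustration.
defaults = {"solution": {"name": "predicting_lung_cancer"}}
config, found = load_config("config.ini", defaults)

solution_name = config.get("solution", "name", fallback="predicting_lung_cancer")
print(f"config.ini found: {found}; solution name: {solution_name}")
```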

## Optional Step 0 - Initiate Logging and debugging 

In [4]:
# Establish the Python Logger  
import logging # built in python library that does not need to be installed 
import quick_logger as ql

global start_time 
start_time = ql.set_start_time()
logging = ql.create_logger_start(solution_name, start_time) 
ql.set_speaking_log(False)
ql.set_speaking_steps(False)
ql.pvlog('info',f'Process {solution_name} Step 0 - Initializing and starting Logging Process.') 

Process solution_temple Step 0 - Initializing and starting Logging Process.
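
The `quick_logger` module is the author's own helper and its internals are not shown here. As a rough stand-in, the sketch below uses only the standard `logging` module to produce lines in the same `timestamp - LEVEL - message` layout that appears in the log output of this notebook. The `create_logger` function and its `log_path` parameter are assumptions made for illustration.

```python
import logging


def create_logger(solution_name, log_path=None):
    """Return a named logger that writes 'timestamp - LEVEL - message' lines."""
    logger = logging.getLogger(solution_name)
    logger.setLevel(logging.INFO)
    # Attach a handler only once so repeated cell runs do not duplicate lines.
    if not logger.handlers:
        handler = logging.FileHandler(log_path) if log_path else logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


solution_name = "predicting_lung_cancer"
logger = create_logger(solution_name)
logger.info("Process %s Step 0 - Initializing and starting Logging Process.", solution_name)
```

Binding the result to `logger` rather than `logging` keeps the standard module name available for later use.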


## Step 0 - Process End - display log

In [5]:
# Calculate and classify the process performance 
status = ql.calculate_process_performance(solution_name, start_time) 
print(ql.append_log_file(solution_name))  

2024-03-15 10:39:07,381 - INFO - START solution_temple Start Time = 2024-03-15 10:39:07
2024-03-15 10:39:07,381 - INFO - solution_temple Step 0 - Initialize the configuration file parser
2024-03-15 10:39:07,382 - INFO - Process solution_temple Step 0 - Initializing and starting Logging Process.
2024-03-15 10:39:07,391 - INFO - PERFORMANCE solution_temple The total process duration was:0.01
2024-03-15 10:39:07,391 - INFO - PERFORMANCE solution_temple Stop Time = 2024-03-15 10:39:07
2024-03-15 10:39:07,391 - INFO - PERFORMANCE solution_temple Short process duration less than 3 Seconds:0.01
2024-03-15 10:39:07,391 - INFO - PERFORMANCE solution_temple Performance optimization is not reccomended
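
The log above reports a total duration and labels it as a short process of less than 3 seconds. A minimal sketch of that kind of timing and classification follows; only the 3-second threshold comes from the log message, while `classify_duration`, the second label, and the timed workload are assumptions and not the behavior of `ql.calculate_process_performance`.

```python
import time

SHORT_PROCESS_SECONDS = 3.0  # threshold taken from the log message above


def classify_duration(duration_seconds):
    """Label a run as short or not, using the 3-second threshold."""
    if duration_seconds < SHORT_PROCESS_SECONDS:
        return "Short process duration less than 3 Seconds"
    # Hypothetical label; the source log only shows the short case.
    return "Process duration of 3 Seconds or more"


start = time.perf_counter()
total = sum(range(100_000))  # stand-in workload for the steps being timed
duration = time.perf_counter() - start

print(f"The total process duration was: {duration:.2f}")
print(classify_duration(duration))
```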


