# Introduction

In this section, we use the <a href="Lung_Cancer_Dataset.csv">Lung cancer dataset</a> to build a machine learning model with the scikit-learn decision tree regressor. The model gives a predictive screening of whether a patient is likely to have lung cancer, based on their symptoms.

## Process

The steps for this section are:

<ul>
<li>Import the libraries (scikit-learn, joblib, and Pandas) and load the dataset</li>
<li>Data cleaning and preparing</li>
<li>Build the prediction model</li>
<li>Calculate the accuracy of the model</li>
<li>Export the model</li>
<li>Use machine learning explainability techniques to understand the model better</li>
</ul>

# Import libraries and dataset

In this section, we will need scikit-learn, Pandas, and joblib.

In [173]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
import joblib
lung_cancer_df = pd.read_csv("Lung_Cancer_Dataset.csv")

We will also take a look at the first rows of the DataFrame:

In [166]:
lung_cancer_df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


# Data cleaning and preparing

We will start by checking whether the DataFrame contains any null values:

In [167]:
lung_cancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            

Since there are no null values in the DataFrame, we can skip the cleaning. We still need to replace some values to make the DataFrame easier to use: every column except `AGE` has only two categories, so we will turn each category's value into either 0 or 1.

In [None]:
replacements = {
    "YES": 1,  # Lung cancer: yes
    "M": 1,    # Replace "M" with 1
    2: 1,      # 2 indicates that the patient is suffering from the symptom
    "NO": 0,   # Lung cancer: no
    "F": 0,    # Replace "F" with 0
    1: 0,      # 1 indicates that the patient is not suffering from the symptom
}
# The mapping is applied to every column in one pass
lung_cancer_df = lung_cancer_df.replace(replacements)
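
Because `replace` is applied to the whole DataFrame, the `2: 1` and `1: 0` rules would also rewrite an `AGE` of 1 or 2 if one ever appeared. The following is a minimal sketch of a column-by-column alternative. The three-row `sample` frame and the names `encoded` and `symptom_cols` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical three-row sample that uses the same encoding as the dataset
sample = pd.DataFrame({
    "GENDER": ["M", "F", "M"],
    "AGE": [69, 59, 2],  # an age of 2 is contrived, to show the pitfall
    "SMOKING": [1, 2, 2],
    "LUNG_CANCER": ["YES", "NO", "NO"],
})

encoded = sample.copy()
# Map the two text columns explicitly
encoded["GENDER"] = encoded["GENDER"].map({"M": 1, "F": 0})
encoded["LUNG_CANCER"] = encoded["LUNG_CANCER"].map({"YES": 1, "NO": 0})
# Shift only the symptom columns from 1/2 to 0/1, leaving AGE untouched
symptom_cols = [c for c in encoded.columns if c not in ("GENDER", "AGE", "LUNG_CANCER")]
encoded[symptom_cols] = encoded[symptom_cols] - 1

print(encoded)
```

With the real dataset this should give the same result as the single `replace` call as long as no age equals 1 or 2.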


Now we can take a look at our DataFrame again

In [169]:
lung_cancer_df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,1,69,0,1,1,0,0,1,0,1,1,1,1,1,1,1
1,1,74,1,0,0,0,1,1,1,0,0,0,1,1,1,1
2,0,59,0,0,0,1,0,1,0,1,0,1,1,0,1,0
3,1,63,1,1,1,0,0,0,0,0,1,0,0,1,1,0
4,0,63,0,1,0,0,0,0,0,1,0,1,1,0,0,0


# Build the prediction model

In [None]:
X = lung_cancer_df.drop("LUNG_CANCER", axis=1)  # Input data (drop returns a new DataFrame)
y = lung_cancer_df["LUNG_CANCER"]  # Output data
# Split the data into 80% training and 20% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()  # Instantiate a new decision tree regressor model
model.fit(X_train, y_train)  # Fit the model on the training data
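
The cell above fits a regressor to a 0/1 target. For a yes/no label, `DecisionTreeClassifier` is the more conventional scikit-learn estimator, and it also exposes class probabilities. The sketch below shows that variant on synthetic data from `make_classification`, not on the lung cancer dataset; all variable names and the `random_state` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 rows, 15 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)  # seeded so reruns give the same tree
clf.fit(X_tr, y_tr)

demo_predictions = clf.predict(X_te)          # class labels, 0 or 1
demo_probabilities = clf.predict_proba(X_te)  # one column per class
print(demo_predictions[:5])
print(demo_probabilities[:5])
```

Swapping the estimator into the notebook may change the reported accuracy, so it would need to be re-evaluated rather than assumed.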

# Calculate the accuracy of the model

In [None]:
predictions = model.predict(X_test) 
accuracy_score(y_test, predictions)

0.967741935483871

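An accuracy of almost 97% is fairly high: with 309 rows and a 20% test split, it corresponds to 60 correct predictions out of 62 test patients. Even so, the model should only be treated as an overall screening aid, not as a diagnosis, because it was evaluated on a single small test split.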
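
`accuracy_score` expects class labels. It works with the regressor here because a fully grown tree fitted on 0/1 targets typically predicts exactly 0.0 or 1.0, but a fractional prediction would make it raise an error. The sketch below, on small made-up arrays (`y_true` and `raw` are hypothetical, not model output), thresholds the predictions first and also computes a confusion matrix and recall, which can be more informative than accuracy alone for a screening task.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true labels and raw regressor outputs
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 1])
raw = np.array([1.0, 1.0, 0.5, 0.0, 1.0, 1.0, 0.0, 1.0])

# Turn continuous outputs into class labels with a 0.5 threshold
y_pred = (raw >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)            # share of correct predictions
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # rows: true, columns: predicted
rec = recall_score(y_true, y_pred)              # share of true positives that were found

print(acc)  # 0.75
print(cm)   # [[1 1]
            #  [1 5]]
print(rec)
```

In this toy example six of eight predictions are correct, and one of the six positive cases is missed.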

# Export the model

We will now export our model so that we can use it later, whenever and wherever we need it.

In [175]:
joblib.dump(model, "lung_cancer_detector.joblib")

['lung_cancer_detector.joblib']
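
An exported model can be read back with `joblib.load`. The following is a minimal round-trip sketch using a tiny stand-in model and a temporary directory; `demo_model`, `demo_model.joblib`, and the two-row training data are placeholders for illustration, not the notebook's model.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.tree import DecisionTreeRegressor

# Tiny stand-in model trained on two made-up rows
demo_model = DecisionTreeRegressor(random_state=0).fit([[0, 0], [1, 1]], [0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"
    joblib.dump(demo_model, model_path)  # write the fitted model to disk
    restored = joblib.load(model_path)   # read it back as a new object

# The restored model predicts like the original one
restored_prediction = restored.predict([[1, 1]])
print(restored_prediction)
```

A joblib file is a pickle under the hood, so it should only be loaded from a trusted source, ideally with the same scikit-learn version that created it.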

# Conclusion

In this section we built a machine learning model that predicts whether a patient is likely to have lung cancer, using the information it learned from the dataset. We will also build an interactive lung cancer screening in Python using our model. If you want to try it, open and run <a href="Interactive_screening.py">this file</a>.