## Lung Cancer Prediction


 Aim : Prediction of Lung Cancer

Data source : https://www.kaggle.com/datasets/thedevastator/cancer-patients-and-air-pollution-a-new-link/data


### Why lung cancer
 Lung cancer is the leading cause of cancer death worldwide, accounting for 1.59 million deaths in 2018. The majority of lung cancer cases are attributed to smoking, but exposure to air pollution is also a risk factor. A new study has found that air pollution may be linked to an increased risk of lung cancer, even in nonsmokers.

The study, which was published in the journal Nature Medicine, looked at data from over 462,000 people in China who were followed for an average of six years. The participants were divided into two groups: those who lived in areas with high levels of air pollution and those who lived in areas with low levels of air pollution.

The researchers found that the people in the high-pollution group were more likely to develop lung cancer than those in the low-pollution group. They also found that the risk was higher in nonsmokers than smokers, and that the risk increased with age.

While this study does not prove that air pollution causes lung cancer, it does suggest that there may be a link between the two. More research is needed to confirm these findings and to determine what effect different types and levels of air pollution may have on lung cancer risk.




### Description
This dataset contains information on patients with lung cancer, including their age, gender, air pollution exposure, alcohol use, dust allergy, occupational hazards, genetic risk, chronic lung disease, balanced diet, obesity, smoking, passive smoking, chest pain, coughing of blood, fatigue, weight loss, shortness of breath, wheezing, swallowing difficulty, clubbing of finger nails, and snoring.



### Features info

Age : The age of the patient (Numeric)

Gender : The gender of the patient (Categorical)

Air Pollution : The level of air pollution exposure of the patient (Categorical)

Alcohol use : The level of alcohol use of the patient (Categorical)

Dust Allergy : The level of dust allergy of the patient (Categorical)

OccuPational Hazards : The level of occupational hazards of the patient (Categorical)

Genetic Risk : The level of genetic risk of the patient (Categorical)

Chronic Lung Disease :  The level of chronic lung disease of the patient (Categorical)

Balanced Diet : The level of balanced diet of the patient (Categorical)

Obesity : The level of obesity of the patient (Categorical)

Smoking : The level of smoking of the patient (Categorical)

Passive Smoker : The level of passive smoking exposure of the patient (Categorical)

Chest Pain : The level of chest pain of the patient (Categorical)

Coughing of Blood : The level of coughing of blood of the patient (Categorical)

Fatigue : The level of fatigue of the patient (Categorical)

Weight Loss : The level of weight loss of the patient (Categorical)

Shortness of Breath : The level of shortness of breath of the patient (Categorical)

Wheezing : The level of wheezing of the patient (Categorical)

Swallowing Difficulty : The level of swallowing difficulty of the patient (Categorical)

Clubbing of Finger Nails : The level of clubbing of finger nails of the patient (Categorical)



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kstest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
interp = np.interp  # scipy.interp is deprecated and missing from recent SciPy releases; np.interp takes the same arguments
sns.set()


## Data Load & Checking

Flow Chart

[Data pre-filtering & Modifying]

1. Load data
2. Check sparsity
(No scaling needed)
3. Check the severity ('Level') of lung cancer
4. Remap levels to an index [High : 2, Medium : 1, Low : 0]

[Find out influential factors]

1. Check correlation between features
2. Keep features with corr > 0.5

[Model modifying]

1. Find the best parameters
2. Apply them to the model
3. Cross-validation (5-fold)

[Accuracy Check]

1. Confusion Matrix
2. ROC Curve


In [None]:
df = pd.read_csv('cancer patient data sets (2).csv')
df # Patients : 1000 , Features = 26

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['Level'].value_counts()

In [None]:
map = {'High' : 2, 'Medium' : 1, 'Low' : 0}
# Assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Level'] = df['Level'].replace(map)
df['Level'].unique()
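The remapping step can also be written with `Series.map`, which leaves `NaN` for any label missing from the dictionary and so makes unexpected categories easy to spot. The sketch below is a minimal illustration on toy labels; `levels` and `level_to_index` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy labels standing in for the 'Level' column.
levels = pd.Series(['Low', 'High', 'Medium', 'High'], name='Level')

# Same ordering as the notebook: Low < Medium < High.
level_to_index = {'Low': 0, 'Medium': 1, 'High': 2}

# Series.map yields NaN for labels that are not keys of the dictionary,
# so a typo or an unexpected category is caught by the check below.
encoded = levels.map(level_to_index)
assert encoded.notna().all(), 'found a label that is not in level_to_index'

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 2, 1, 2]
```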

In [None]:
df = df.drop('Patient Id', axis = 1)
df = df.drop('index', axis = 1)
df

In [None]:
# Check the distribution of levels

plt.figure(figsize=(11, 4))
plt.pie(df['Level'].value_counts(), labels=df['Level'].value_counts().index,
        autopct=lambda p: f'{p:.2f}%\n{p * sum(df["Level"].value_counts()) / 100:,.0f}')
plt.show()

In [None]:
fig, ax = plt.subplots(ncols=4, nrows=6, figsize=(20, 20))
ax = ax.flatten()

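# Translate the numeric levels back to their names for the x-axis
level_names = df['Level'].replace(dict(zip(map.values(), map.keys())))

for i, col in enumerate(df.columns):
    # order fixes the category order; hue_order only has an effect when hue is set
    sns.violinplot(x=level_names, y=col, data=df,
                   order=['Low', 'Medium', 'High'], palette='turbo', ax=ax[i])
    ax[i].set_title(col.title())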

plt.tight_layout(pad=0.1, w_pad=0.2, h_pad=2.5)
plt.show()

## Finding which features have the most impact on lung cancer

A correlation test between features is needed.

### [ Correlation method ]

1. Pearson
2. Spearman
3. Kendall
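The three methods listed above can be compared on toy data. The sketch below assumes a synthetic pair of variables (`x`, `y`) with a monotonic but non-linear relationship, which is where rank-based methods and Pearson tend to disagree; none of these names or values come from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic ordinal scores from 1 to 8 (an assumption, loosely imitating ordinal scales).
x = rng.integers(1, 9, size=500)
# y grows monotonically but non-linearly with x, plus a little noise.
y = x ** 3 + rng.normal(0, 5, size=500)

toy = pd.DataFrame({'x': x, 'y': y})

results = {}
for method in ['pearson', 'spearman', 'kendall']:
    # Series.corr accepts the method name directly.
    results[method] = toy['x'].corr(toy['y'], method=method)
    print(f'{method:>8}: {results[method]:.3f}')
```

Pearson measures linear association, while Spearman and Kendall work on ranks, so they are less sensitive to the curvature in this toy relationship.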

Before checking correlation:

Normality must be checked first with the Kolmogorov-Smirnov test.

In [None]:
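# Normality check

alpha = 0.05 # Significance level

for features in df.columns:
    # kstest(..., 'norm') compares against N(0, 1), so standardize first
    z = (df[features] - df[features].mean()) / df[features].std()
    statistic, pvalue = kstest(z, 'norm')

    # H0: the feature is normally distributed; a small p-value rejects H0
    if pvalue > alpha:
        print(f'{features}: normally distributed (p = {pvalue:.3g})')
    else:
        print(f'{features}: not normally distributed (p = {pvalue:.3g})')



Features that pass this test can be compared with Pearson.

Features that fail it, which is likely for ordinal scores on a short integer scale, are better suited to Spearman or Kendall. The heatmaps below use Pearson, the default of `df.corr()`.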
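To make the direction of the test explicit: the null hypothesis of the Kolmogorov-Smirnov test is that the sample follows the reference distribution, so a p-value above the cutoff means normality is not rejected. The sketch below illustrates this on synthetic samples; `looks_normal`, `gaussian`, and `ordinal` are hypothetical names introduced only for this example.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)

def looks_normal(values, alpha=0.05):
    """Return True when the K-S test does not reject normality at level alpha."""
    values = np.asarray(values, dtype=float)
    # Standardize, because 'norm' refers to the standard normal N(0, 1).
    z = (values - values.mean()) / values.std(ddof=1)
    return bool(kstest(z, 'norm').pvalue > alpha)

# Synthetic samples (assumptions for illustration only).
gaussian = rng.normal(loc=50, scale=10, size=1000)  # continuous, bell-shaped
ordinal = rng.integers(1, 9, size=1000)             # integer scores from 1 to 8

print('gaussian sample looks normal:', looks_normal(gaussian))
print('ordinal sample looks normal: ', looks_normal(ordinal))
```

Estimating the mean and standard deviation from the same sample makes the plain K-S test conservative, so this is a rough screen rather than a rigorous normality test.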

In [None]:
# Correlation Check between features

plt.figure(figsize = (20,10))
sns.heatmap(df.corr(), cmap = 'RdYlBu', annot = True)
plt.show()

In [None]:
sns.heatmap(df.corr()[['Level']].sort_values(by = 'Level', ascending = False), annot = True, cmap = 'RdYlBu')

In [None]:
# Correlation cutoff: keep features with corr > 0.5

df = df[['Level', 'Coughing of Blood', 'Dust Allergy', 'Passive Smoker', 'OccuPational Hazards', 'Air Pollution', 'chronic Lung Disease', 'Shortness of Breath']]
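A cutoff like this can also be applied programmatically instead of typing column names by hand. The sketch below is one possible way to do it on a synthetic table; the column names `strong`, `weak`, and `unrelated` and the variable `threshold` are assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the patient table: one target and three features.
level = rng.integers(0, 3, size=n)
toy = pd.DataFrame({
    'Level': level,
    'strong': level * 2 + rng.normal(0, 0.5, size=n),  # tightly linked to Level
    'weak': level + rng.normal(0, 5, size=n),          # mostly noise
    'unrelated': rng.normal(0, 1, size=n),             # independent of Level
})

threshold = 0.5

# Correlation of every feature with the target, excluding the target itself.
corr_with_level = toy.corr()['Level'].drop('Level')

# Keep the features whose absolute correlation exceeds the cutoff.
selected = corr_with_level[corr_with_level.abs() > threshold].index.tolist()
print(selected)
```

Using the absolute value keeps strongly negative correlations as well, which a plain `> 0.5` comparison would drop.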

In [None]:
# Data splitting

y = df.pop('Level')
X = df
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size = 0.25, shuffle = True, random_state = 42)
print(f'X_train : {X_train.shape} and X_test : {X_test.shape}')
print(f'y_train : {y_train.shape} and y_test : {y_test.shape}')

## Check explained variance by number of PCA components

In [None]:
for i in range(2, 7):
    pca = PCA(n_components = i)
    pca.fit(X_train)
    print(f'{i} components explain {pca.explained_variance_ratio_.sum() * 100:.2f}% of the variance')
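The same idea can be expressed as a cumulative explained-variance curve with a target share. The sketch below runs on synthetic data with seven columns driven by two latent factors; `X_toy`, `target`, and `n_components` are hypothetical names, and the 95% target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 7 observed columns generated from 2 latent factors plus small noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 7))
X_toy = latent @ mixing + rng.normal(scale=0.1, size=(300, 7))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target = 0.95
n_components = int(np.searchsorted(cumulative, target) + 1)

print('cumulative explained variance:', cumulative.round(3))
print('components needed for 95%:', n_components)
```

scikit-learn can make the same selection directly: passing a float between 0 and 1, as in `PCA(n_components=0.95)`, keeps just enough components to explain that share of the variance.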

In [None]:
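# The target has three levels (High, Medium, Low), so we use a multinomial logistic regression model

# Finding the best parameters
# Only valid solver/penalty pairs are listed: 'liblinear' cannot fit a multinomial model,
# and among the remaining solvers only 'saga' supports 'l1' and 'elasticnet'.
# penalty=None needs scikit-learn >= 1.2; l1_ratio=0.5 is an assumed value that elasticnet requires.
common = {'C': [0.1, 1.0, 10.0], 'class_weight': [None, 'balanced'], 'random_state': [42]}
param_grid_multi = [
    {**common, 'penalty': ['l2', None], 'solver': ['lbfgs', 'newton-cg', 'sag', 'saga']},
    {**common, 'penalty': ['l1'], 'solver': ['saga']},
    {**common, 'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.5]},
]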

model = LogisticRegression(multi_class = 'multinomial', max_iter = 3000)

grid_search = GridSearchCV(model, param_grid_multi, cv = 5, scoring='accuracy')

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

In [None]:
## Checking Accuracy

# Multinomial classifier

# Note: scikit-learn ignores C when penalty is None
classifer_multi = LogisticRegression(multi_class = 'multinomial', C = 0.1, class_weight = 'balanced', penalty = None, solver = 'lbfgs', random_state = 42, max_iter = 3000)

classifer_multi.fit(X_train, y_train)

predictions_multi = classifer_multi.predict(X_test)


print('Multinomial Classifier Test Accuracy', classifer_multi.score(X_test, y_test))

In [None]:
# Cross Validation

cv_scores = cross_val_score(classifer_multi, X_train, y_train, cv=5)

print('Cross-Validation Scores :', cv_scores)
print('Average Cross-Validation Score:', np.mean(cv_scores))

In [None]:
# Confusion matrix

confusion_mat = confusion_matrix(y_test, predictions_multi)
sns.heatmap(confusion_mat, annot = True, fmt = 'd',
            cmap = 'RdYlBu',
            xticklabels= ['Low', 'Medium', 'High'],
            yticklabels = ['Low', 'Medium', 'High'])

print(classification_report(y_test, predictions_multi))
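The per-class numbers in a classification report can be derived by hand from the confusion matrix. The sketch below uses a made-up 3-class matrix (the counts are hypothetical, not results from this notebook) and follows scikit-learn's convention of rows for true classes and columns for predicted classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
cm = np.array([
    [50,  5,  0],
    [ 4, 60,  6],
    [ 0,  3, 72],
])

true_positives = np.diag(cm)

# Precision: correct predictions of a class divided by everything predicted as that class.
precision = true_positives / cm.sum(axis=0)
# Recall: correct predictions of a class divided by everything truly in that class.
recall = true_positives / cm.sum(axis=1)
# Accuracy: all correct predictions divided by all samples.
accuracy = true_positives.sum() / cm.sum()

for name, p, r in zip(['Low', 'Medium', 'High'], precision, recall):
    print(f'{name:>6}: precision = {p:.3f}, recall = {r:.3f}')
print(f'accuracy = {accuracy:.3f}')  # accuracy = 0.910
```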

In [None]:
# One-hot encode labels

y_test_binarized = label_binarize(y_test, classes=[0, 1, 2]) # 2 : High, 1 : Medium, 0 : Low

# Calculate ROC curve for each class

fpr = dict()
tpr = dict()
roc_auc = dict()

probs = classifer_multi.predict_proba(X_test)
num_classes = 3
for i in range(num_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_binarized[:, i], probs[:, i], drop_intermediate = False)
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area

fpr["micro"], tpr["micro"], _ = roc_curve(y_test_binarized.ravel(), probs.ravel(), drop_intermediate = False)
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Compute macro-average ROC curve and ROC area

all_fpr = np.unique(np.concatenate([fpr[i] for i in range(num_classes)]))
mean_tpr = np.zeros_like(all_fpr)
for i in range(num_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])
mean_tpr /= num_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot ROC curve

plt.figure(figsize=(8, 6))
plt.plot(fpr["micro"], tpr["micro"], label=f'micro-average ROC curve (area = {roc_auc["micro"]:0.2f})', color='deeppink', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"], label=f'macro-average ROC curve (area = {roc_auc["macro"]:0.2f})', color='navy', linestyle=':', linewidth=4)

colors = ['aqua', 'darkorange', 'cornflowerblue']
for i, color in zip(range(num_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label=f'ROC curve of class {i} (area = {roc_auc[i]:0.2f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) for Multi-Class')
plt.legend(loc='lower right')
plt.show()
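As a cross-check on the plotted areas, scikit-learn's `roc_auc_score` can compute a one-vs-rest macro AUC directly from the predicted probabilities. The sketch below runs on synthetic three-class data from `make_classification`; all variable names (`X_toy`, `y_toy`, `clf`, `probs_toy`, `macro_auc`) are assumptions for this example, and the resulting value says nothing about the lung cancer model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem with 7 features (an assumption for illustration).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42,
)

clf = LogisticRegression(max_iter=3000).fit(X_tr, y_tr)
probs_toy = clf.predict_proba(X_te)  # shape (n_samples, 3), rows sum to 1

# One-vs-rest AUC per class, averaged with equal weight per class.
macro_auc = roc_auc_score(y_te, probs_toy, multi_class='ovr', average='macro')
print(f'macro one-vs-rest AUC: {macro_auc:.3f}')
```

This macro value is the plain mean of the per-class AUCs, so it can differ slightly from the area under an interpolated macro-average curve.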
