Lung Cancer Prediction

Step 1: Getting the data

In [105]:
import pandas as pd

data = pd.read_csv('dataseter.csv')

data.head()


,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC_DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL_CONSUMING,COUGHING,SHORTNESS_OF_BREATH,SWALLOWING_DIFFICULTY,CHEST_PAIN,LUNG_CANCER
0,M,65,Yes,Yes,Yes,No,No,Yes,No,No,No,No,No,No,Yes,NO
1,F,55,Yes,No,No,Yes,Yes,No,No,No,Yes,Yes,Yes,No,No,NO
2,F,78,No,No,Yes,Yes,Yes,No,Yes,No,Yes,Yes,No,Yes,Yes,YES
3,M,60,No,Yes,Yes,Yes,No,Yes,No,Yes,Yes,No,Yes,No,No,YES
4,F,80,Yes,Yes,No,Yes,Yes,No,Yes,No,Yes,Yes,Yes,Yes,No,NO


Step 2: Data Exploration

In [106]:
data.describe()

,AGE
count,3000.0
mean,55.169
std,14.723746
min,30.0
25%,42.0
50%,55.0
75%,68.0
max,80.0


The ages span a wide range, from 30 to 80. How does age play a role in this problem?
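One way to start answering the age question is to group patients into age bands and compare the share with lung cancer in each band. The sketch below is a minimal illustration, not part of the original notebook: the function name `cancer_rate_by_age`, the bin edges, and the toy rows are all assumptions, and it expects the raw (unencoded) 'YES'/'NO' labels.

```python
import pandas as pd

def cancer_rate_by_age(df, bin_edges=(29, 40, 50, 60, 70, 80)):
    """Share of LUNG_CANCER == 'YES' rows within each age band (hypothetical helper)."""
    has_cancer = df["LUNG_CANCER"].str.upper().eq("YES")
    # Bands are right-inclusive: (29, 40], (40, 50], ...
    age_band = pd.cut(df["AGE"], bins=list(bin_edges))
    # observed=False keeps empty bands in the result (their rate is NaN)
    return has_cancer.groupby(age_band, observed=False).mean()

# Hypothetical toy rows, not taken from dataseter.csv
toy = pd.DataFrame({
    "AGE": [35, 38, 45, 65, 78, 80],
    "LUNG_CANCER": ["NO", "YES", "NO", "YES", "YES", "NO"],
})
age_rates = cancer_rate_by_age(toy)
print(age_rates)
```

If the rates are roughly flat across bands on the real data, age alone is unlikely to be a strong predictor.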

In [107]:
print(data.shape)

(3000, 16)


In [108]:
data.dtypes

GENDER                   object
AGE                       int64
SMOKING                  object
YELLOW_FINGERS           object
ANXIETY                  object
PEER_PRESSURE            object
CHRONIC_DISEASE          object
FATIGUE                  object
ALLERGY                  object
WHEEZING                 object
ALCOHOL_CONSUMING        object
COUGHING                 object
SHORTNESS_OF_BREATH      object
SWALLOWING_DIFFICULTY    object
CHEST_PAIN               object
LUNG_CANCER              object
dtype: object

Almost all of the features are objects, so we will have to convert them to binary values.

In [109]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

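# Note: looping over every column also re-encodes AGE,
# so ages 30-80 show up as 0-50 in the outputs that follow.
for column in data.columns:
    data[column] = encoder.fit_transform(data[column])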

data.head()

,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC_DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL_CONSUMING,COUGHING,SHORTNESS_OF_BREATH,SWALLOWING_DIFFICULTY,CHEST_PAIN,LUNG_CANCER
0,1,35,1,1,1,0,0,1,0,0,0,0,0,0,1,0
1,0,25,1,0,0,1,1,0,0,0,1,1,1,0,0,0
2,0,48,0,0,1,1,1,0,1,0,1,1,0,1,1,1
3,1,30,0,1,1,1,0,1,0,1,1,0,1,0,0,1
4,0,50,1,1,0,1,1,0,1,0,1,1,1,1,0,0
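In the table above, AGE has been encoded along with everything else (65 became 35, 55 became 25), because the loop runs over all columns. If the real ages are needed later, one alternative is to encode only the non-numeric columns. The sketch below is an assumption on our part rather than the notebook's method: it uses pandas category codes instead of `LabelEncoder`, and `encode_text_columns` is a hypothetical helper name. The toy frame copies a few columns from the first three rows of the `data.head()` output in Step 1.

```python
import pandas as pd

def encode_text_columns(df):
    """Integer-encode non-numeric columns; leave numeric ones (e.g. AGE) untouched."""
    encoded = df.copy()
    for column in encoded.select_dtypes(exclude="number").columns:
        # Category codes follow sorted order, so F/M -> 0/1 and No/Yes -> 0/1
        encoded[column] = encoded[column].astype("category").cat.codes
    return encoded

# A few columns from the first three rows shown by data.head()
toy = pd.DataFrame({
    "GENDER": ["M", "F", "F"],
    "AGE": [65, 55, 78],
    "SMOKING": ["Yes", "Yes", "No"],
    "LUNG_CANCER": ["NO", "NO", "YES"],
})
encoded_toy = encode_text_columns(toy)
print(encoded_toy)
```

The 0/1 codes match the ones in the encoded table, while AGE keeps its original values.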


In [110]:
data.isnull().sum()

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC_DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL_CONSUMING        0
COUGHING                 0
SHORTNESS_OF_BREATH      0
SWALLOWING_DIFFICULTY    0
CHEST_PAIN               0
LUNG_CANCER              0
dtype: int64

In [111]:
data['GENDER'].value_counts()

GENDER
1    1514
0    1486
Name: count, dtype: int64

In [112]:
print(data['AGE'].value_counts())

AGE
24    73
37    71
45    71
17    70
39    70
48    69
25    69
49    68
23    66
2     66
9     65
7     65
47    64
16    64
33    63
20    63
3     62
40    61
38    61
5     60
18    60
6     60
41    59
46    59
1     59
26    58
32    58
27    58
22    57
50    57
31    57
30    57
4     57
12    56
11    56
19    56
34    56
29    55
21    55
13    55
14    54
8     53
10    51
28    51
44    50
35    49
36    48
15    47
42    46
43    43
0     42
Name: count, dtype: int64


In [113]:
print(data.duplicated().sum())


2
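The check reports 2 duplicated rows, and the notebook does not say what to do with them. If we decide they are accidental repeats rather than two patients with identical answers (a judgment call the data alone cannot settle), they can be dropped with `drop_duplicates`. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows, not taken from dataseter.csv
toy = pd.DataFrame({
    "AGE": [40, 40, 52],
    "SMOKING": [1, 1, 0],
    "LUNG_CANCER": [0, 0, 1],
})

n_duplicates = int(toy.duplicated().sum())
# keep="first" (the default) retains one copy of each repeated row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```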


In [114]:
print(data.corr())

                         GENDER       AGE   SMOKING  YELLOW_FINGERS   ANXIETY  \
GENDER                 1.000000  0.010966  0.028505       -0.014412  0.023891   
AGE                    0.010966  1.000000  0.020289       -0.016101 -0.030051   
SMOKING                0.028505  0.020289  1.000000       -0.001497 -0.055562   
YELLOW_FINGERS        -0.014412 -0.016101 -0.001497        1.000000  0.012342   
ANXIETY                0.023891 -0.030051 -0.055562        0.012342  1.000000   
PEER_PRESSURE         -0.010019 -0.003850 -0.032041        0.011394 -0.024692   
CHRONIC_DISEASE       -0.008488  0.025655  0.045697       -0.013216  0.016903   
FATIGUE               -0.002193  0.002322  0.019635       -0.010761 -0.006250   
ALLERGY               -0.013211 -0.024915  0.004908       -0.005709 -0.001174   
WHEEZING               0.009284  0.018212  0.000571        0.004151 -0.016065   
ALCOHOL_CONSUMING      0.009173  0.020704  0.003022        0.027167 -0.001542   
COUGHING               0.005

No strong correlations are visible: every off-diagonal coefficient in the portion shown is below 0.06 in absolute value.
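The full matrix is hard to read when printed, and what we mostly care about is each feature's correlation with LUNG_CANCER. A sketch of pulling out just that column and ranking it by absolute value is below. `correlation_with_target` is a hypothetical helper name and the toy frame is made up for illustration.

```python
import pandas as pd

def correlation_with_target(df, target="LUNG_CANCER"):
    """Correlation of every other column with the target, strongest first."""
    target_corr = df.corr()[target].drop(target)
    # Rank by magnitude but keep the sign in the returned values
    order = target_corr.abs().sort_values(ascending=False).index
    return target_corr.loc[order]

# Hypothetical toy data: SMOKING tracks the target exactly, COUGHING not at all
toy = pd.DataFrame({
    "SMOKING": [0, 1, 0, 1],
    "COUGHING": [0, 0, 1, 1],
    "LUNG_CANCER": [0, 1, 0, 1],
})
ranked = correlation_with_target(toy)
print(ranked)
```

On the encoded notebook data this would be called as `correlation_with_target(data)`.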

In [123]:
for feature in data.columns:
    if feature not in ['AGE', 'LUNG_CANCER','GENDER']:
        total_count = data[data[feature] == 1].shape[0]
        lung_cancer_count = data[(data[feature] == 1) & (data['LUNG_CANCER'] == 1)].shape[0]
        print(f"'{feature}': {lung_cancer_count}/ {total_count}")



'SMOKING': 762/ 1527
'YELLOW_FINGERS': 728/ 1458
'ANXIETY': 779/ 1518
'PEER_PRESSURE': 779/ 1503
'CHRONIC_DISEASE': 752/ 1471
'FATIGUE': 773/ 1531
'ALLERGY': 744/ 1480
'WHEEZING': 792/ 1508
'ALCOHOL_CONSUMING': 795/ 1526
'COUGHING': 718/ 1468
'SHORTNESS_OF_BREATH': 779/ 1536
'SWALLOWING_DIFFICULTY': 781/ 1531
'CHEST_PAIN': 759/ 1504



For each feature, these numbers show how many of the patients who answered 'Yes' also have lung cancer:

(have cancer and 'Yes' for feature) / (total 'Yes' for feature)
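Raw counts are easier to compare as rates, and a rate among the 'Yes' answers only means something next to the rate among the 'No' answers. In the output above every ratio works out to roughly 0.49 to 0.53 (for example 762/1527 is about 0.499), so the comparison matters. The sketch below computes both rates per feature. It assumes the 0/1 encoding used in the notebook; `cancer_rate_by_answer` and the toy rows are hypothetical.

```python
import pandas as pd

def cancer_rate_by_answer(df, target="LUNG_CANCER", skip=("AGE", "GENDER")):
    """Per feature: share of target == 1 among rows answering 1 (yes) and 0 (no)."""
    rows = {}
    for feature in df.columns:
        if feature == target or feature in skip:
            continue
        # Mean of a 0/1 target within each group is the cancer rate for that group
        rate = df.groupby(feature)[target].mean()
        rows[feature] = {
            "rate_if_yes": rate.get(1, float("nan")),
            "rate_if_no": rate.get(0, float("nan")),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Hypothetical toy rows, not taken from dataseter.csv
toy = pd.DataFrame({
    "AGE": [40, 52, 61, 70],
    "SMOKING": [1, 1, 0, 0],
    "LUNG_CANCER": [1, 0, 0, 0],
})
rates = cancer_rate_by_answer(toy)
print(rates)
```

A feature whose two rates are nearly equal carries little information about the target on its own.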