In [218]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns


Loading the dataset.

In [219]:
data = pd.read_csv('/kaggle/input/lung-cancer/survey lung cancer.csv')

Inspecting the data and checking for missing or categorical values.

In [220]:
data.head()

In [221]:
data.isnull().sum()

In [222]:
data.select_dtypes(include=['object']).columns

The columns with categorical values need to be transformed.

The data is split before the transformation to avoid leakage, and only the symptoms of lung cancer are used to build the model.

In [223]:
data.columns

In [224]:
# Names are copied exactly as listed by data.columns (note the trailing space in 'FATIGUE ').
symptoms = ['YELLOW_FINGERS', 'ANXIETY', 'FATIGUE ', 'WHEEZING', 'COUGHING',
            'SHORTNESS OF BREATH', 'SWALLOWING DIFFICULTY', 'CHEST PAIN']

In [225]:
X = data[symptoms]
y = data.LUNG_CANCER

In [226]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
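
Checking what `stratify` does. This is a minimal sketch on synthetic labels, not the survey data: the names `toy_X` and `toy_y` and the 80/20 class balance are assumptions made for illustration. With stratification, the class shares should be the same in the training and test subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (80 'YES', 20 'NO'); illustrative only.
toy_y = pd.Series(['YES'] * 80 + ['NO'] * 20)
toy_X = pd.DataFrame({'feature': range(100)})

# Same call pattern as above: default test size (25%) and a stratified split.
Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, stratify=toy_y, random_state=42)

# Share of each class in both subsets.
train_share = ytr.value_counts(normalize=True)
test_share = yte.value_counts(normalize=True)
print(train_share, test_share, sep='\n')
```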

Transforming the label to numerical values with LabelEncoder.

In [227]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)
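
Inspecting the mapping learned by the encoder. This is a small sketch on toy labels, assuming the target takes the values 'YES' and 'NO'; the `toy_` names are hypothetical and stand in for the real columns. `classes_` is sorted, so the position of each class is its integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the LUNG_CANCER column (assumed values).
toy_train = pd.Series(['YES', 'NO', 'YES', 'YES'])
toy_test = pd.Series(['NO', 'YES'])

toy_le = LabelEncoder()
toy_train_enc = toy_le.fit_transform(toy_train)  # mapping learned from the training labels only
toy_test_enc = toy_le.transform(toy_test)        # same mapping reused on the test labels

# Build a readable label -> code dictionary from the sorted classes.
mapping = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(mapping)
```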

Defining the model. The target is a category, so this is a classification problem, and a random forest classifier is used.

In [228]:
model = RandomForestClassifier(n_estimators = 100, random_state = 42)

Fitting the data to the Random Forest model.

In [229]:
model.fit(X_train,y_train)

Predicting with the Random Forest model.

In [230]:
y_pred = model.predict(X_test)

Evaluating the random forest model.

In [231]:
print(f'The accuracy score of this Random Forest model is {100 * accuracy_score(y_test, y_pred):.1f}%')

In [232]:
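print('The confusion matrix of this Random Forest model')
# Show the original class names on the axes instead of the encoded integers.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=le.classes_)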
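
Looking beyond accuracy. When one class is much more frequent than the other, accuracy alone can look better than the per-class performance. The sketch below uses made-up encoded labels (assumed: 1 = cancer, 0 = no cancer), not this notebook's predictions, to show how balanced accuracy and per-class recall can be read alongside accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy encoded labels; illustrative values only.
toy_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

acc = accuracy_score(toy_true, toy_pred)               # share of correct predictions
bal_acc = balanced_accuracy_score(toy_true, toy_pred)  # mean of the per-class recalls
recall_neg = recall_score(toy_true, toy_pred, pos_label=0)  # recall for the minority class

print(f'accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}, recall (class 0)={recall_neg:.2f}')
```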

The model appears to perform well on the test set, so the next step is to compute the feature importances.

In [233]:
features_importance = pd.DataFrame(
    {'Symptoms': X.columns, 'Importance': model.feature_importances_}
).sort_values('Importance', ascending=False)
features_importance
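
Cross-checking the importances. `feature_importances_` is impurity-based and normalized to sum to 1. One possible cross-check, not used in this notebook, is permutation importance on held-out data. The sketch below runs on synthetic data from `make_classification`, so the `toy_` names and all values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the eight symptom columns (assumption).
toy_X, toy_y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, stratify=toy_y, random_state=42)

toy_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# Impurity-based importances are normalized across features.
impurity_total = toy_model.feature_importances_.sum()

# Permutation importance: mean drop in test-set score when one feature is shuffled.
perm = permutation_importance(toy_model, Xte, yte, n_repeats=10, random_state=42)

print(impurity_total)
print(perm.importances_mean)
```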

A bar plot makes these importances easier to compare.

In [234]:
# Set the figure size before plotting; changing rcParams afterwards does not resize the current figure.
plt.figure(figsize=(15, 7))
sns.barplot(x=features_importance.Symptoms, y=features_importance.Importance)
plt.title('Visualizing symptom importance', fontsize=40, fontweight='bold')
plt.xlabel('Symptoms', fontsize=45)
plt.ylabel('Importance', fontsize=45)
# Rotate the tick labels so the long symptom names do not overlap.
plt.xticks(rotation=45, horizontalalignment='right', fontweight='light', fontsize='x-large')

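In this random forest model, the symptoms with the highest feature importance are yellow fingers, coughing and chest pain. These importances describe which symptoms the model relies on most for this dataset, not which symptoms matter most clinically.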