## Introduction
An effective cancer prediction system lets people learn their cancer risk at low cost and helps them make appropriate decisions based on that risk. The data was collected from an online lung cancer prediction system website.

- Total number of attributes: 16
- Number of instances: 284

Attribute information:

- Gender: M(male), F(female)
- Age: Age of the patient
- Smoking: YES=2, NO=1
- Yellow fingers: YES=2, NO=1
- Anxiety: YES=2, NO=1
- Peer_pressure: YES=2, NO=1
- Chronic Disease: YES=2, NO=1
- Fatigue: YES=2, NO=1
- Allergy: YES=2, NO=1
- Wheezing: YES=2, NO=1
- Alcohol: YES=2, NO=1
- Coughing: YES=2, NO=1
- Shortness of Breath: YES=2, NO=1
- Swallowing Difficulty: YES=2, NO=1
- Chest pain: YES=2, NO=1
- Lung Cancer: YES, NO

## Importing necessary libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report,confusion_matrix
import seaborn as sns
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import os
%matplotlib inline
sns.set_theme(color_codes=True, style='darkgrid', 
              palette='deep', font='sans-serif')

## Exploratory Data Analysis

In [2]:
df = pd.read_csv("../input/lung-cancer/survey lung cancer.csv")
df.head()

In [3]:
df.info()

In [4]:
df.describe()

In [5]:
df.columns

In [6]:
df.isnull().sum()

In [7]:
df.shape

In [8]:
df.duplicated().sum()

In [9]:
df = df.drop_duplicates()

In [10]:
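# GENDER and LUNG_CANCER are still strings here, so limit the correlation to numeric columns
corr = df.corr(numeric_only=True)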
corr

## Data Visualization

In [11]:
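# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(data=df, x="AGE", kde=True);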

In [12]:
sns.boxplot(y = 'AGE', data = df);

In [13]:
sns.countplot(x="LUNG_CANCER", data=df);

In [14]:
r = df.groupby('LUNG_CANCER')['LUNG_CANCER'].count()
plt.pie(r, explode=[0.05, 0.1], labels=['No', 'Yes'], radius=1.5, autopct='%1.1f%%',  shadow=True);

As we can see, only 13.8% of the records belong to the no-cancer class, so the dataset is heavily imbalanced. To compensate, we will oversample the minority class using the SMOTE technique.
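
To make the idea concrete, here is a minimal sketch of the interpolation step behind SMOTE, written with NumPy only. It is an illustration under simplifying assumptions, not the `imblearn` implementation used later in this notebook: the function name `smote_like_oversample`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority samples
    and one of their k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random base sample per new row
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position along the segment, in [0, 1)

    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority class: the four corners of the unit square (made-up data)
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_minority, n_new=6, k=2)
print(synthetic.shape)
```

Each synthetic row lies on the line segment between a real minority sample and one of its nearest minority neighbours, so the new points stay inside the region that class already covers.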

In [15]:
fig, axes = plt.subplots(4, 4, figsize=(25, 15))
fig.suptitle('Different feature distributions')

axes = axes.reshape(16,)

for i,column in enumerate(df.columns):
    sns.histplot(ax = axes[i],data = df, x= column)

In [16]:
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(numeric_only=True), annot=True);

## Preparing data for model

In [17]:
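# drop_duplicates left gaps in the index; assign the result so the reset actually takes effect
df = df.reset_index(drop=True)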

In [18]:
le = LabelEncoder()

In [19]:
df1 = df.copy(deep=True)

In [20]:
df1.GENDER = le.fit_transform(df1.GENDER)
df1.LUNG_CANCER = le.fit_transform(df1.LUNG_CANCER)
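
`LabelEncoder` assigns integer codes in sorted order of the class labels, so with the values listed in the introduction `F` becomes 0 and `M` becomes 1, while `NO` becomes 0 and `YES` becomes 1. The small check below runs on made-up values; the names `demo_le`, `demo_gender`, and `mapping` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

demo_gender = ["M", "F", "F", "M"]  # made-up values, not rows from the survey
demo_le = LabelEncoder()
demo_codes = demo_le.fit_transform(demo_gender)

# classes_ holds the original labels; the label at position i is encoded as i
mapping = {str(label): code for code, label in enumerate(demo_le.classes_)}
print(mapping)
print(demo_codes)
```

Since the cell above reuses a single `le` for both columns, `le.classes_` afterwards reflects only `LUNG_CANCER`. A separate encoder per column would be needed to decode `GENDER` later.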

In [21]:
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df2 = pd.DataFrame(scaler.fit_transform(df1),columns=df.columns,index=df.index)

In [22]:
X = df2.drop('LUNG_CANCER',axis=1)
y = df2.LUNG_CANCER
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 50)

In [23]:
sm = SMOTE(random_state = 500)
X_res, y_res = sm.fit_resample(X_train, y_train)
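
Note that the cells above scale the whole table, target included, before splitting, so the test rows influence the scaler's minimum and maximum. A possible leak-free variant is to split first and fit the scaler on the training features only. The sketch below shows that order on a synthetic stand-in table; the frame `demo` and every name ending in `_demo`, `_tr`, or `_te` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the survey data (made-up values, not the real file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AGE": rng.integers(30, 80, size=50),
    "SMOKING": rng.integers(1, 3, size=50),
    "LUNG_CANCER": rng.integers(0, 2, size=50),
})

X_demo = demo.drop("LUNG_CANCER", axis=1)
y_demo = demo["LUNG_CANCER"]  # the target stays an unscaled integer label

# Split first, so the test rows are never seen while fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=50
)

scaler_demo = MinMaxScaler()
X_tr_scaled = pd.DataFrame(
    scaler_demo.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
# Reuse the training min/max for the test rows (transform only, no fit)
X_te_scaled = pd.DataFrame(
    scaler_demo.transform(X_te), columns=X_te.columns, index=X_te.index
)
print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order, scaled test values may fall slightly outside the 0 to 1 range when a test row exceeds the training extremes, which is expected. SMOTE would then be applied to `X_tr_scaled` and `y_tr` only, as the notebook already does with its training split.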

## Training the model

In [24]:
model = XGBClassifier(learning_rate=0.2,n_estimators=5000,use_label_encoder=False,random_state=40)
model.fit(X_res, y_res)
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
accuracy

In [25]:
print(classification_report(y_test,y_pred))
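
Because the classes are imbalanced, overall accuracy can hide weak performance on the smaller class. The sketch below uses `confusion_matrix` (already imported at the top) to show how per-class recall is read off the matrix. The labels are made up for illustration and are not the model's actual predictions; the names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 8 positive and 2 negative cases, with one negative misclassified
y_true_demo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred_demo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes, both ordered 0 then 1
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

accuracy_demo = np.trace(cm_demo) / cm_demo.sum()       # correct / total
recall_demo = np.diag(cm_demo) / cm_demo.sum(axis=1)    # correct / true count, per class

print(cm_demo)
print(accuracy_demo, recall_demo)
```

In this toy case accuracy is 0.9 while recall on class 0 is only 0.5, which is why the per-class rows of `classification_report` are worth reading alongside the single accuracy number.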

## Conclusion
In conclusion, since the amount of data is small, there was only so much we could do. Additionally, the class imbalance reduced the number of usable training samples, which might lead to inconsistent results.