# <center> BREAST CANCER PREDICTION
    
![](https://www.dadberg.com/wp-content/uploads/2021/04/f-958x575.png)
    
- Breast cancer is the most common invasive cancer in women and the second leading cause of cancer death in women after lung cancer.
- The Wisconsin Breast Cancer dataset comes from the UCI Machine Learning Repository. Using the Breast Cancer Wisconsin (Diagnostic) dataset, we can build a classifier that helps diagnose patients by predicting whether a tumour is benign or malignant.
- In this notebook I compare twelve commonly used classifiers on the task of classifying a tumour as benign or malignant.
- Before modelling, I standardize the features and use PCA (Principal Component Analysis), a dimensionality-reduction technique, to visualise the data in two dimensions.

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [None]:
#reading the csv data file
df =  pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.head()

## Exploring the data

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df = df.drop(['Unnamed: 32'],axis=1)

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
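# Correlate numeric columns only; 'diagnosis' is still a string column at this point
df.select_dtypes(include='number').corr()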

In [None]:
# Plot a heatmap to visualise the correlation between the numeric features
plt.figure(figsize=(20, 10))
heatmap = sb.heatmap(df.select_dtypes(include='number').corr(), cmap='BrBG', annot=True)
# Give the heatmap a title; pad sets the distance between the title and the top of the plot.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
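A 30-feature heatmap is hard to read cell by cell, so it can help to list the strongly correlated pairs directly. The sketch below does this on a small synthetic frame (an assumption for illustration, not the real data) in which perimeter and area are derived from radius; the 0.9 threshold is also an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few geometric features (NOT the real dataset).
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius,
    "area_mean": np.pi * radius ** 2,
    "texture_mean": rng.normal(20, 4, size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flatten to (feature_a, feature_b) -> |correlation| and keep the strong pairs.
high_pairs = upper.stack().loc[lambda s: s > 0.9].sort_values(ascending=False)
print(high_pairs)
```

On the real data, replacing `toy` with the numeric columns of `df` should surface the radius, perimeter and area groups that show up as dark blocks in the heatmap.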

In [None]:
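# Check the class balance of the dependent feature (diagnosis)
plt.figure(figsize=(10, 8))
# Colour the points by class; palette only takes effect when hue is set
sb.scatterplot(y=df.index, x=df.diagnosis, hue=df.diagnosis, palette='BrBG')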
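The class balance can also be read off as numbers rather than from a plot. This is a minimal sketch that assumes the class counts documented for the Wisconsin Diagnostic dataset (357 benign, 212 malignant); in the notebook, `df['diagnosis']` would take the place of the hand-built series.

```python
import pandas as pd

# Assumption: documented class counts for the Wisconsin Diagnostic dataset;
# swap in df['diagnosis'] to check your own copy of the data.
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

counts = diagnosis.value_counts()
shares = diagnosis.value_counts(normalize=True).round(3)
print(pd.DataFrame({"count": counts, "share": shares}))
```

The classes are moderately imbalanced, which is worth keeping in mind when reading plain accuracy scores.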

In [None]:
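# Check the distribution of a few features
fig, axes = plt.subplots(2, 3, figsize=(20, 8))
# Histogram with a kernel density estimate for each feature (histplot needs seaborn >= 0.11)
sb.histplot(df['area_mean'], kde=True, stat='density', ax=axes[0,0])
sb.histplot(df['radius_mean'], kde=True, stat='density', ax=axes[0,1])
sb.histplot(df['texture_mean'], kde=True, stat='density', ax=axes[0,2])
sb.histplot(df['perimeter_mean'], kde=True, stat='density', ax=axes[1,0])
sb.histplot(df['smoothness_mean'], kde=True, stat='density', ax=axes[1,1])
sb.histplot(df['concavity_mean'], kde=True, stat='density', ax=axes[1,2])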

In [None]:
#CONVERTING THE CATEGORICAL DATA TO NUMERICAL
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
df['diagnosis'] = le.fit_transform(df['diagnosis'])

In [None]:
df['diagnosis']
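`LabelEncoder` assigns integer codes in sorted order of the class labels, so it is worth confirming which number stands for which diagnosis before reading any confusion matrix. A small sketch on four hand-written labels (`le_demo` is a throwaway name for illustration):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(["M", "B", "B", "M"])

# classes_ is sorted, and each class is encoded as its position in that array.
mapping = {str(label): int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())
```

With this dataset's labels, benign (`B`) becomes 0 and malignant (`M`) becomes 1.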

![](https://miro.medium.com/max/2000/1*KdvxqXIOkb9JY_BeUWvpxg.jpeg)

Principal Component Analysis (PCA) is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set.
PCA is a flexible tool and is especially useful when features are strongly correlated (multicollinearity). Note that standard PCA expects complete numeric input, so missing values and categorical data have to be imputed or encoded first. The goal is to extract the important information from the data and express it as a set of summary indices called principal components.
To sum up, the idea of PCA is simple: reduce the number of variables in a data set while preserving as much information as possible.
- Read more about PCA in detail [here](https://builtin.com/data-science/step-step-explanation-principal-component-analysis)
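To make the idea concrete, here is a rough from-scratch sketch of the usual PCA steps (standardize, covariance matrix, eigen-decomposition, projection) using NumPy only. The toy data is synthetic and built so that two of its three features are almost redundant; it is an illustration of the technique, not the notebook's own pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 samples, 3 features, with the second nearly a multiple of the first.
base = rng.normal(size=100)
data = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])

# 1. Standardize each feature to zero mean and unit variance.
z = (data - data.mean(axis=0)) / data.std(axis=0)
# 2. Covariance matrix of the standardized features.
cov = np.cov(z, rowvar=False)
# 3. Eigen-decomposition; eigh returns ascending eigenvalues, so sort descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 4. Project onto the top two principal components.
projected = z @ eigvecs[:, :2]
explained_ratio = eigvals / eigvals.sum()
print(projected.shape, explained_ratio.round(3))
```

Because two features carry nearly the same information, the first component alone captures most of the variance, which is exactly the situation in which dropping dimensions loses little.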

![](https://ars.els-cdn.com/content/image/1-s2.0-S2214784516300147-gr1.jpg)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
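scaler = StandardScaler()
# Scale only the 30 measurement columns; the id and the diagnosis label are not features
features = df.drop(columns=['id', 'diagnosis'])
scaler.fit(features)

In [None]:
scaled_data = scaler.transform(features)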

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

In [None]:
pca.fit(scaled_data)

In [None]:
x_pca = pca.transform(scaled_data)

In [None]:
scaled_data.shape, x_pca.shape

In [None]:
pca_df = pd.DataFrame(data = x_pca, columns = ['principal component 1', 'principal component 2'])
pca_df
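Keeping two components is convenient for plotting, but it is useful to know how much of the variance they retain. The sketch below checks this with `explained_variance_ratio_`; it assumes scikit-learn's bundled copy of the Wisconsin Diagnostic data (`load_breast_cancer`) as a stand-in for `data.csv`, and `pca_demo` is a name introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the Wisconsin Diagnostic data
# stands in for data.csv (same 30 features, no id column).
features = load_breast_cancer().data

scaled = StandardScaler().fit_transform(features)
pca_demo = PCA(n_components=2).fit(scaled)

ratios = pca_demo.explained_variance_ratio_
print("per component:", ratios.round(3))
print("total retained:", ratios.sum().round(3))
```

In the notebook itself, `pca.explained_variance_ratio_` gives the same information for the fitted `pca` object.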

## Visualising PCA

In [None]:
plt.figure(figsize=(16,8))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer",fontsize=20)
targets = [0,1]
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = df['diagnosis'] == target
    plt.scatter(pca_df.loc[indicesToKeep, 'principal component 1']
               , pca_df.loc[indicesToKeep, 'principal component 2'], c = color, s = 50)

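# LabelEncoder sorts the classes, so 0 is benign (B) and 1 is malignant (M)
plt.legend(['Benign (0)', 'Malignant (1)'], prop={'size': 15})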

## Split data into train and test

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df[['radius_mean', 'texture_mean', 'perimeter_mean','area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean','radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se','compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst','perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst','symmetry_worst', 'fractal_dimension_worst']]
# Use a 1-D Series as the target; scikit-learn expects y of shape (n_samples,)
Y = df['diagnosis']

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3)

In [None]:
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
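With imbalanced classes, a purely random split can leave the test set with a different class mix from the training set, and without a seed the split changes on every run. As a hedged alternative, `train_test_split` accepts `stratify` and `random_state`; the toy arrays below are made up to mimic roughly a 63/37 class ratio, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the dataset's imbalance (63% class 0, 37% class 1).
y_toy = np.array([0] * 63 + [1] * 37)
X_toy = np.arange(200).reshape(100, 2)

# stratify keeps the class ratio in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)
print("overall:", y_toy.mean(), "train:", y_tr.mean().round(3), "test:", y_te.mean().round(3))
```

Applied to this notebook, that would mean passing `stratify=Y` and a fixed `random_state` to the existing call.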

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
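The scaler is fitted on the training set only and then reused on the test set, so no information from the test rows leaks into preprocessing. The NumPy sketch below spells out what `fit_transform` and `transform` amount to, on random toy arrays (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(loc=50, scale=10, size=(80, 3))
test = rng.normal(loc=50, scale=10, size=(20, 3))

# "fit": learn per-feature statistics from the training rows only.
mean_ = train.mean(axis=0)
std_ = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean_) / std_
test_scaled = (test - mean_) / std_

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately centred, which is expected.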

### 1) Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()

In [None]:
LR.fit(X_train,Y_train)

In [None]:
Y_LR = LR.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix
acc_LR = accuracy_score(Y_test, Y_LR)
print('ACCURACY SCORE: ',acc_LR)
cm_LR = confusion_matrix(Y_test,Y_LR)
print('CONFUSION MATRIX: \n',cm_LR)
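For a diagnostic task, accuracy alone hides how many malignant cases are missed, so it can help to derive precision and recall from the confusion matrix. The numbers below are hypothetical, not results from this notebook; the layout (rows are true classes, columns are predicted classes) is the one `confusion_matrix` uses for labels `[0, 1]`.

```python
import numpy as np

# Hypothetical confusion matrix, laid out as scikit-learn returns it for labels [0, 1]:
# rows are true classes, columns are predicted classes.
cm = np.array([[105, 2],
               [3, 61]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted malignant cases, how many are malignant
recall = tp / (tp + fn)      # of the truly malignant cases, how many are caught
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the false negatives (`fn`) are the malignant tumours predicted as benign, usually the costliest kind of error in this setting.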

### 2) KNN Classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Find the best number of neighbors (scored on the test set, so the estimate is optimistic)
no_of_neighbors_and_accuracies = {}
for i in range(1,15):
    knn = KNeighborsClassifier(n_neighbors = i)
    knn.fit(X_train,Y_train)
    Y_knn = knn.predict(X_test)
    score = accuracy_score(Y_test, Y_knn)  # signature is accuracy_score(y_true, y_pred)
    no_of_neighbors_and_accuracies[i] = score

In [None]:
no_of_neighbors_and_accuracies

- In this run, n_neighbors = 3 gives the highest accuracy, so we'll use n_neighbors = 3. Because the train/test split is random, the best value may differ on a rerun.
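Choosing `n_neighbors` by test-set accuracy reuses the test set for tuning. One common alternative, sketched here under assumptions, is to pick the value by cross-validation instead. The example uses scikit-learn's bundled copy of the dataset as a stand-in for the notebook's data; in the notebook only the training portion would be passed in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the data stands in for the training set.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

cv_scores = {}
for k in range(1, 15):
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```

The test set is then touched only once, for the final evaluation of the chosen model.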

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train,Y_train)

In [None]:
Y_knn = knn.predict(X_test)

In [None]:
acc_knn = accuracy_score(Y_test, Y_knn)
print('ACCURACY SCORE: ',acc_knn)
cm_knn = confusion_matrix(Y_test,Y_knn)
print('CONFUSION MATRIX: \n',cm_knn)

### 3) Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion='gini')

In [None]:
dtc.fit(X_train,Y_train)

In [None]:
Y_dtc = dtc.predict(X_test)

In [None]:
acc_dtc = accuracy_score(Y_test, Y_dtc)
print('ACCURACY SCORE: ',acc_dtc)
cm_dtc = confusion_matrix(Y_test,Y_dtc)
print('CONFUSION MATRIX: \n',cm_dtc)

### 4) Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(criterion='entropy')

In [None]:
rfc.fit(X_train, Y_train)

In [None]:
Y_rfc = rfc.predict(X_test)

In [None]:
acc_rfc = accuracy_score(Y_test, Y_rfc)
print('ACCURACY SCORE: ',acc_rfc)
cm_rfc = confusion_matrix(Y_test,Y_rfc)
print('CONFUSION MATRIX: \n',cm_rfc)

### 5) Support Vector Machine

In [None]:
from sklearn.svm import SVC
svc = SVC()

In [None]:
svc.fit(X_train,Y_train)

In [None]:
Y_svc = svc.predict(X_test)

In [None]:
acc_svc = accuracy_score(Y_test, Y_svc)
print('ACCURACY SCORE: ',acc_svc)
cm_svc = confusion_matrix(Y_test,Y_svc)
print('CONFUSION MATRIX: \n',cm_svc)

### 6) Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB  
gnb = GaussianNB() 

In [None]:
gnb.fit(X_train, Y_train)

In [None]:
Y_gnb = gnb.predict(X_test) 

In [None]:
acc_gnb = accuracy_score(Y_test, Y_gnb)
print('ACCURACY SCORE: ',acc_gnb)
cm_gnb = confusion_matrix(Y_test,Y_gnb)
print('CONFUSION MATRIX: \n',cm_gnb)

### 7) Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()

In [None]:
gbc.fit(X_train,Y_train)

In [None]:
Y_gbc = gbc.predict(X_test)

In [None]:
acc_gbc = accuracy_score(Y_test, Y_gbc)
print('ACCURACY SCORE: ',acc_gbc)
cm_gbc = confusion_matrix(Y_test,Y_gbc)
print('CONFUSION MATRIX: \n',cm_gbc)

### 8) Stochastic Gradient Descent

In [None]:
from sklearn.linear_model import SGDClassifier
sgdc = SGDClassifier()

In [None]:
# fit() returns the fitted estimator, not predictions, so don't assign it to Y_sgdc
sgdc.fit(X_train,Y_train)

In [None]:
Y_sgdc = sgdc.predict(X_test)

In [None]:
acc_sgdc = accuracy_score(Y_test, Y_sgdc)
print('ACCURACY SCORE: ',acc_sgdc)
cm_sgdc = confusion_matrix(Y_test,Y_sgdc)
print('CONFUSION MATRIX: \n',cm_sgdc)

### 9) AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
adb = AdaBoostClassifier()

In [None]:
adb.fit(X_train,Y_train)

In [None]:
Y_adb = adb.predict(X_test)

In [None]:
acc_adb = accuracy_score(Y_test, Y_adb)
print('ACCURACY SCORE: ',acc_adb)
cm_adb = confusion_matrix(Y_test,Y_adb)
print('CONFUSION MATRIX: \n',cm_adb)

### 10) XGBoost

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()

In [None]:
xgb.fit(X_train,Y_train)

In [None]:
Y_xgb = xgb.predict(X_test)

In [None]:
acc_xgb = accuracy_score(Y_test, Y_xgb)
print('ACCURACY SCORE: ',acc_xgb)
cm_xgb = confusion_matrix(Y_test,Y_xgb)
print('CONFUSION MATRIX: \n',cm_xgb)

### 11) CatBoost

In [None]:
from catboost import CatBoostClassifier
cb = CatBoostClassifier()

In [None]:
cb.fit(X_train,Y_train)

In [None]:
Y_cb = cb.predict(X_test)

In [None]:
acc_cb = accuracy_score(Y_test, Y_cb)
print('ACCURACY SCORE: ',acc_cb)
cm_cb = confusion_matrix(Y_test,Y_cb)
print('CONFUSION MATRIX: \n',cm_cb)

### 12) LightGBM

In [None]:
from lightgbm import LGBMClassifier

In [None]:
lg = LGBMClassifier()

In [None]:
lg.fit(X_train,Y_train)

In [None]:
Y_lg = lg.predict(X_test)

In [None]:
acc_lg = accuracy_score(Y_test, Y_lg)
print('ACCURACY SCORE: ',acc_lg)
cm_lg = confusion_matrix(Y_test,Y_lg)
print('CONFUSION MATRIX: \n',cm_lg)

In [None]:
# Collect the test accuracy of every model and rank them
models = pd.DataFrame({
    'Model': ['Logistic Regression','KNN','Decision Tree','Random Forest','Support Vector Machines',
              'Naive Bayes','Gradient Boosting','Stochastic Gradient Descent','AdaBoost','XGBoost','CatBoost','LightGBM'],
    'Score': [acc_LR, acc_knn, acc_dtc,acc_rfc, acc_svc, acc_gnb, acc_gbc, acc_sgdc, acc_adb, acc_xgb, acc_cb, acc_lg]})
models.sort_values(by='Score', ascending=False)
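A ranking based on a single random split can shift from run to run. As a rough cross-check, the same kind of table can be built from cross-validated accuracy. The sketch below covers only four of the scikit-learn models for brevity and assumes scikit-learn's bundled copy of the dataset; `candidates` and `cv_table` are names introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy of the data stands in for data.csv.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machines": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, estimator in candidates.items():
    # Scale inside the pipeline so every fold is standardized on its own training part.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
    rows.append({"Model": name, "Mean accuracy": scores.mean(), "Std": scores.std()})

cv_table = pd.DataFrame(rows).sort_values(by="Mean accuracy", ascending=False)
print(cv_table.to_string(index=False))
```

The standard deviation column gives a sense of how much of the gap between two models is just split-to-split noise.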

## We can conclude that, in this run, Logistic Regression and SVM give the highest accuracy of the twelve models, 98.8%

### Thank you. Consider **UPVOTING** if you find it useful :)