<h1><center>🎗️Breast Cancer Data Analysis🔎</center></h1>
<h3><center>🩺(Prediction at the end)🔮</center></h3>
<center><img src= "https://media.slidesgo.com/storage/4701966/breast-cancer-case1617872724.jpg" alt ="Breast cancer" style='width: 600px;'></center>

<h3>Overview</h3>
<p>
Breast cancer is cancer that forms in the cells of the breasts.

After skin cancer, breast cancer is the most common cancer diagnosed in women in the United States. Breast cancer can occur in both men and women, but it's far more common in women.

Substantial support for breast cancer awareness and research funding has helped create advances in the diagnosis and treatment of breast cancer.
    
Breast cancer survival rates have increased, and the number of deaths associated with this disease is steadily declining, largely due to factors such as earlier detection, a new personalized approach to treatment and a better understanding of the disease.
</p>

<h3>What are the symptoms of breast cancer?</h3>
<p>
Signs and symptoms of breast cancer may include:

- A breast lump or thickening that feels different from the surrounding tissue

- Change in the size, shape or appearance of a breast

- Changes to the skin over the breast, such as dimpling
    
- A newly inverted nipple
    
- Peeling, scaling, crusting or flaking of the pigmented area of skin surrounding the nipple (areola) or breast skin
    
- Redness or pitting of the skin over your breast, like the skin of an orange
</p>

# Exploratory Data Analysis

## Aim :
- Understand the data ("A small step forward is better than a big one backwards")
- Begin to develop a modelling strategy

## Features

- Patient_ID: unique identifier of a patient
- Age: age at diagnosis (Years)
- Gender: Male/Female
- Protein1, Protein2, Protein3, Protein4: expression levels (undefined units)
- Tumour_Stage: I, II, III
- Histology: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Mucinous Carcinoma
- ER status: Positive/Negative
- PR status: Positive/Negative
- HER2 status: Positive/Negative
- Surgery_type: Lumpectomy, Simple Mastectomy, Modified Radical Mastectomy, Other
- Date_of_Surgery: date on which surgery was performed (DD-MON-YY)
- Date_of_Last_Visit: date of the last visit (DD-MON-YY); can be null if the patient did not visit again after the surgery
- Patient_Status: Alive/Dead; can be null if the patient did not visit again after the surgery and no information is available on whether the patient is alive or dead
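The two date columns are dropped later in this notebook, but the DD-MON-YY format described above can be parsed if a follow-up duration is ever needed. The sketch below is only an illustration: the rows in `toy` are made up, `followup_days` is a hypothetical column name, and it assumes English month abbreviations such as `Jan`.

```python
import pandas as pd

# Hypothetical rows mimicking the DD-MON-YY format described above (not real patients).
toy = pd.DataFrame({
    "Date_of_Surgery": ["15-Jan-17", "26-Apr-17", "08-Sep-17"],
    "Date_of_Last_Visit": ["19-Jun-17", "09-Nov-18", None],
})

# Parse both columns; a missing visit becomes NaT instead of raising an error.
for col in ["Date_of_Surgery", "Date_of_Last_Visit"]:
    toy[col] = pd.to_datetime(toy[col], format="%d-%b-%y", errors="coerce")

# Follow-up length in days (NaN when the patient never came back).
toy["followup_days"] = (toy["Date_of_Last_Visit"] - toy["Date_of_Surgery"]).dt.days
print(toy)
```

A duration like this could be tried as an extra feature, keeping in mind that it is missing exactly for the patients whose status is most uncertain.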

## Base Checklist
#### Shape Analysis :
- **target feature** : Patient_Status
- **rows and columns** : 341 , 16
- **features types** : qualitatives : 11 , quantitatives : 5
- **NaN analysis** :
    - NaN (1 feature > 5 % of NaN)

#### Features Analysis :
- **Target Analysis** :
    - Balanced (Yes/No) : No
    - Percentages : 79% Alive

In [None]:
!pip install dataprep

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

from dataprep.eda import create_report
from dataprep.eda import plot_missing
from dataprep.eda import plot_correlation
from dataprep.eda import plot

from sklearn.model_selection import train_test_split

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_curve
from sklearn.model_selection import learning_curve, cross_val_score, GridSearchCV

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler,StandardScaler,MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

## Dataset Analysis

In [None]:
data = pd.read_csv('../input/breastcancerdataset/BRCA.csv')
df = data.copy()
pd.set_option('display.max_row',df.shape[0])
pd.set_option('display.max_column',df.shape[1]) 
df.head()

In [None]:
(df.isna().sum()/df.shape[0]*100).sort_values(ascending=False)

In [None]:
plot_missing(df)

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(df.isna(),cbar=False)
plt.show()

In [None]:
print('There are' , df.shape[0] , 'rows')
print('There are' , df.shape[1] , 'columns')

### Checking for duplicates

In [None]:
df.duplicated().sum()

In [None]:
df.loc[df.duplicated(keep=False),:]

In [None]:
df.drop_duplicates(keep='first',inplace=True)
df.shape

<h1><center><font size="30">Target Distribution</font></center></h1>

In [None]:
df = data.copy()
df = df.drop(['Patient_ID','Date_of_Surgery','Date_of_Last_Visit'],axis=1)
df['Patient_Status'].value_counts(normalize=True) # Imbalanced classes

In [None]:
target_dist = df['Patient_Status'].value_counts()

fig, ax = plt.subplots(1, 1, figsize=(8,5))

barplot = plt.bar(target_dist.index, target_dist, color = 'lightgreen', alpha = 0.8)
barplot[1].set_color('darkred')

ax.set_title('Target Distribution')
# Look the share up by label rather than by position, and round it for display
percentage = df['Patient_Status'].value_counts(normalize=True)['Alive']*100
ax.annotate("Percentage of Alive patients: {:.1f}%".format(percentage),
              xy=(0, 0),xycoords='axes fraction', 
              xytext=(0,-50), textcoords='offset points',
              va="top", ha="left", color='grey',
              bbox=dict(boxstyle='round', fc="w", ec='w'))

plt.xlabel('Target', fontsize = 12, weight = 'bold')
plt.show()

# Resampling

A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).

![](https://raw.githubusercontent.com/rafjaa/machine_learning_fecib/master/src/static/img/resampling.png)

Despite the advantage of balancing classes, these techniques also have their weaknesses (there is no free lunch). The simplest implementation of over-sampling is to duplicate random records from the minority class, which can cause overfitting. In under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.

Let's implement a basic example, which uses the <code>DataFrame.sample</code> method to draw random samples from each class:

In [None]:
# Class count
count_class_0, count_class_1 = df['Patient_Status'].value_counts()

# Divide by class
df_class_0 = df[df['Patient_Status'] == 'Alive']
df_class_1 = df[df['Patient_Status'] == 'Dead']

print(count_class_0)
print(count_class_1)

In [None]:
df_class_0_under = df_class_0.sample(count_class_1,random_state=42)
df_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print('Random under-sampling:')
print(df_under['Patient_Status'].value_counts())

df_under['Patient_Status'].value_counts().plot(kind='bar', title='Count (target)');
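The over-sampling counterpart described above can be sketched with the same `DataFrame.sample` method, this time drawing from the minority class with replacement. This is only a minimal illustration on a made-up frame: `toy`, `majority`, `minority` and `toy_over` are hypothetical names, not objects used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical imbalanced frame standing in for df (8 'Alive', 2 'Dead').
toy = pd.DataFrame({
    "Age": [45, 52, 61, 38, 70, 55, 49, 66, 58, 73],
    "Patient_Status": ["Alive"] * 8 + ["Dead"] * 2,
})

majority = toy[toy["Patient_Status"] == "Alive"]
minority = toy[toy["Patient_Status"] == "Dead"]

# Draw minority rows with replacement until both classes have the same size.
minority_over = minority.sample(len(majority), replace=True, random_state=42)
toy_over = pd.concat([majority, minority_over], axis=0)

print("Random over-sampling:")
print(toy_over["Patient_Status"].value_counts())
```

If this were used for modelling, it would probably be safer to over-sample only the training split, so that duplicated rows cannot end up in both the train and test sets.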

<h1><center><font size="30">Categorical Features</font></center></h1>

In [None]:
for col in df.select_dtypes("object"):
    print(f'{col :-<50} {df[col].unique()}')

In [None]:
fig, ax = plt.subplots(4,2, figsize=(30, 30))
i=0
sns.set(font_scale = 1.5)
for col in df.select_dtypes('object'): 
    # Recent seaborn versions require keyword arguments for the plotted variable
    sns.countplot(x=col, hue='Patient_Status', data=df_under, ax=ax[i//2][i%2])
    i=i+1
plt.show()

<h1><center><font size="30">Continuous Features</font></center></h1>

In [None]:
Alive_df = df[df['Patient_Status']=="Alive"]
Dead_df = df[df['Patient_Status']=="Dead"]
sns.set(font_scale = 1.5)
fig, ax = plt.subplots(2,3, figsize=(30, 15))
i=0
for col in df.select_dtypes(include=['float64','int64']):
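    # distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
    sns.histplot(Alive_df[col].dropna(),kde=True,stat='density',label='Alive',ax=ax[i//3][i%3])
    sns.histplot(Dead_df[col].dropna(),kde=True,stat='density',label='Dead',ax=ax[i//3][i%3])
    i=i+1
fig.legend(labels=['Alive','Dead'],fontsize='22')
plt.show()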

# A bit of data engineering ...

In [None]:
def encoding(df):
    code = {'FEMALE':0,
            'MALE':1,
            'III':3,
            'II':2,
            'I':1,
            'Infiltrating Ductal Carcinoma':0,
            'Mucinous Carcinoma':1,
            'Infiltrating Lobular Carcinoma':2,
            'Negative':0,
            'Positive':1,
            'Modified Radical Mastectomy':0,
            'Lumpectomy':1,
            'Simple Mastectomy':2,
            'Other':3,
            'Alive':1,
            'Dead':0
           }
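    for col in df.select_dtypes('object'):
        # Plain column assignment lets pandas store the mapped values as numbers
        df[col] = df[col].map(code)
    return df

def imputation(df):
    # Rows with an unknown outcome are dropped rather than imputed as 'Alive'
    df = df.dropna(subset=['Patient_Status'])
    df = df.fillna(df.median(numeric_only=True))
    df = df.dropna()
    return df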

def feature_engineering(df):
    useless_columns = ['Patient_ID','Date_of_Surgery','Date_of_Last_Visit','ER status','PR status']
    df = df.drop(useless_columns,axis=1)
    return df

def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    
    X = df.drop('Patient_Status',axis=1)
    y = df['Patient_Status']    

    return df,X,y

In [None]:
df = data.copy()
df,X,y=preprocessing(df)

In [None]:
# Class count
count_class_0, count_class_1 = df['Patient_Status'].value_counts()

# Divide by class
df_class_0 = df[df['Patient_Status'] == 1]
df_class_1 = df[df['Patient_Status'] == 0]

df_class_0_under = df_class_0.sample(count_class_1,random_state=42)
df_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print('Random under-sampling:')
print(df_under['Patient_Status'].value_counts())

# Plot the class counts after under-sampling
df_under['Patient_Status'].value_counts().plot(kind='bar', title='Count (target)');

### Comments
We can now analyze categorical features as quantitative features

In [None]:
sns.heatmap(df_under.corr())

In [None]:
sns.pairplot(df, height=2)

### Comments
Considering the correlations shown above, it seems really difficult to find a difference between Dead and Alive patients...

#### Let's find out !

# Modelling

In [None]:
trainset, testset = train_test_split(df_under, test_size=0.2, random_state=0)
fig, ax = plt.subplots(1,2, figsize=(10, 5))
sns.countplot(x='Patient_Status', data=trainset, ax=ax[0], palette="Set3").set_title('TrainSet')
sns.countplot(x='Patient_Status', data=testset, ax=ax[1], palette="Set2").set_title('TestSet')

In [None]:
X_train = trainset.drop(['Patient_Status'],axis=1)
y_train = trainset['Patient_Status']
X_test = testset.drop(['Patient_Status'],axis=1)
y_test = testset['Patient_Status']

In [None]:
preprocessor = make_pipeline(RobustScaler())

PCAPipeline = make_pipeline(preprocessor, PCA(n_components=3,random_state=42))

RandomPipeline = make_pipeline(preprocessor,RandomForestClassifier(random_state=42))
AdaPipeline = make_pipeline(preprocessor,AdaBoostClassifier(random_state=42))
SVMPipeline = make_pipeline(preprocessor,SVC(random_state=42,probability=True))
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier())
LRPipeline = make_pipeline(preprocessor,LogisticRegression(solver='sag',random_state=42))

## PCA Analysis

In [None]:
PCA_df = pd.DataFrame(PCAPipeline.fit_transform(X_train))
y_train = y_train.astype(int)
y_train.reset_index(drop=True, inplace=True)
PCA_df = pd.concat([PCA_df, y_train], axis=1, ignore_index=True )
PCA_df.head()

In [None]:
plt.figure(figsize=(8,8))
sns.scatterplot(x=PCA_df[0],y=PCA_df[1],hue=PCA_df[3],palette=sns.color_palette("tab10", 2))
plt.show()

In [None]:
import plotly.express as px
figure1 = px.scatter_3d(PCA_df,
        x=0, 
        y=1, 
        z=2, 
        color = 3,
                       width=600, height=800)
figure1.update_traces(marker=dict(size=5,
                              line=dict(width=0.2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

figure1.show()

### Remark :

#### The classes are really hard to classify looking at the graph above...
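Before reading too much into a 2D or 3D projection, it can help to check how much of the total variance the retained components actually capture. The sketch below assumes the same `RobustScaler` + `PCA(n_components=3)` pipeline as above, but runs on synthetic data (`X_toy` is a hypothetical stand-in for `X_train`), so the printed numbers say nothing about the BRCA dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for X_train (100 rows, 10 features); not the BRCA data.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 10))

pca_pipeline = make_pipeline(RobustScaler(), PCA(n_components=3, random_state=42))
pca_pipeline.fit(X_toy)

# Share of the total variance captured by each retained component.
ratios = pca_pipeline.named_steps["pca"].explained_variance_ratio_
print("Explained variance ratio per component:", ratios)
print("Cumulative share:", ratios.sum())
```

If the cumulative share is low, overlapping classes in the scatter plots may simply mean that the projection discards most of the information.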

# Training models
## Models overview

In [None]:
dict_of_models = {'RandomForest': RandomPipeline,
'AdaBoost': AdaPipeline,
'SVM': SVMPipeline,
'KNN': KNNPipeline,
'LR': LRPipeline}

In [None]:
def evaluation(model):
    # calculating the probabilities
    y_pred_proba = model.predict_proba(X_test)

    # finding the predicted values
    y_pred = np.argmax(y_pred_proba,axis=1)
    print('Accuracy = ', accuracy_score(y_test, y_pred))
    print('-')
    print(confusion_matrix(y_test,y_pred))
    print('-')
    print(classification_report(y_test,y_pred))
    print('-')
    
    N, train_score, test_score = learning_curve(model, X_train, y_train, 
                                               cv=4, scoring='f1', 
                                               train_sizes=np.linspace(0.1,1,10))
    plt.figure(figsize=(5,5))
    plt.plot(N, train_score.mean(axis=1), label='train score')
    plt.plot(N, test_score.mean(axis=1), label='validation score')
    plt.legend()
    plt.show()

In [None]:
sns.set(font_scale = 1)
for name, model in dict_of_models.items():
    print('---------------------------------')
    print(name)
    model.fit(X_train,y_train)
    evaluation(model)

## Using RandomForest

In [None]:
RandomPipeline.fit(X_train, y_train)
evaluation(RandomPipeline)

In [None]:
y_pred_prob = RandomPipeline.predict_proba(X_test)[:,1]

fpr,tpr,thresholds=roc_curve(y_test,y_pred_prob)

plt.plot(fpr,tpr,label='RandomForest ROC Curve')
plt.xlabel("False Survivor Rate")
plt.ylabel("True Survivor Rate")
plt.title("RandomForest ROC Curve")
plt.show()

### Optimization

In [None]:
from sklearn.model_selection import RandomizedSearchCV
RandomPipeline.get_params().keys()

In [None]:
hyper_params = {
    'randomforestclassifier__n_estimators':[10,100,150,250,400,600],
    'randomforestclassifier__criterion':['gini','entropy'],
    'randomforestclassifier__min_samples_split':[2,6,12],
    'randomforestclassifier__min_samples_leaf':[1,4,6,10],
    'randomforestclassifier__max_features':['sqrt','log2',None],
    'randomforestclassifier__verbose':[0],  # logging level only, not worth searching over
    'randomforestclassifier__class_weight':['balanced','balanced_subsample'],
    'randomforestclassifier__n_jobs':[-1],
}

In [None]:
RF_grid = RandomizedSearchCV(RandomPipeline,hyper_params,scoring='accuracy',n_iter=40,random_state=42)
RF_grid.fit(X_train,y_train)

In [None]:
print(RF_grid.best_params_)

In [None]:
best_forest = (RF_grid.best_estimator_)
best_forest.fit(X_train,y_train)
# calculating the probabilities
y_pred_proba = best_forest.predict_proba(X_test)
# finding the predicted values
y_pred = np.argmax(y_pred_proba,axis=1)

N, train_score, test_score = learning_curve(best_forest, X_train, y_train, 
                                           cv=4, scoring='f1', 
                                           train_sizes=np.linspace(0.1,1,10))

In [None]:
print('Accuracy = ', accuracy_score(y_test, y_pred))
print('-')
print(confusion_matrix(y_test,y_pred))
print('-')
print(classification_report(y_test,y_pred))
print('-')
    
plt.figure(figsize=(5,5))
plt.plot(N, train_score.mean(axis=1), label='train score')
plt.plot(N, test_score.mean(axis=1), label='validation score')
plt.legend()
plt.title('f1 score')
plt.show()

## Using KNN

In [None]:
err = []
  
for i in range(1, 40):
    
    model = make_pipeline(preprocessor,KNeighborsClassifier(n_neighbors = i))
    model.fit(X_train, y_train)
    pred_i = model.predict(X_test)
    err.append(np.mean(pred_i != y_test))
  
plt.figure(figsize =(10, 8))
plt.plot(range(1, 40), err, color ='blue',
                linestyle ='dashed', marker ='o',
         markerfacecolor ='blue', markersize = 8)
  
plt.title('Mean Err = f(K)')
plt.xlabel('K')
plt.ylabel('Mean Err')
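The curve above picks K by looking at the test set, which tends to make the final test score optimistic. One possible alternative, sketched below under the assumption that the same `RobustScaler` + `KNeighborsClassifier` pipeline is used, is to choose K by cross-validation on the training data only. The data here is synthetic (`X_toy`, `y_toy` are hypothetical stand-ins for `X_train`, `y_train`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for (X_train, y_train); not the BRCA data.
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=42)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    model = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=k))
    # 4-fold cross-validated accuracy, computed on the training data only.
    scores = cross_val_score(model, X_toy, y_toy, cv=4, scoring="accuracy")
    cv_error.append(1 - scores.mean())

# K with the lowest mean cross-validated error.
best_k = k_values[int(np.argmin(cv_error))]
print("Best K by cross-validation:", best_k)
```

The test set is then touched only once, to report the score of the K selected this way.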

In [None]:
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier(n_neighbors = 5))
KNNPipeline.fit(X_train, y_train)

In [None]:
evaluation(KNNPipeline)

## Using XGBoost

In [None]:
import xgboost as xgb
gbm = xgb.XGBClassifier(
     learning_rate = 0.15,
     n_estimators= 3000,
     max_depth= 16,
     min_child_weight= 2,
     #gamma=1,
     gamma=0.9,                        
     subsample=0.8,
     colsample_bytree=0.8,
     objective= 'binary:logistic',
     eval_metric = 'logloss',
     nthread= -1,
     scale_pos_weight=1).fit(X_train, y_train)
evaluation(gbm)

## Using SVM

In [None]:
SVMPipeline.fit(X_train, y_train)
evaluation(SVMPipeline)

In [None]:
y_pred_prob = SVMPipeline.predict_proba(X_test)[:,1]

fpr,tpr,thresholds=roc_curve(y_test,y_pred_prob)

plt.plot(fpr,tpr,label='SVM ROC Curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("SVM ROC Curve")
plt.show()

# Tuning Threshold

In [None]:
best_classifier = KNNPipeline

thresholds = [0.3,0.4,0.5,0.6,0.7,0.8]
best_t = 0.3
best_acc = 0
for t in thresholds:
    y_pred = (best_classifier.predict_proba(X_test)[:,1] >= t).astype(int)
    acc = accuracy_score(y_test, y_pred)
    if acc > best_acc:
        best_acc=acc
        best_t=t

In [None]:
print('Accuracy on test set :',round(best_acc*100),"%")
print('Best threshold :',best_t)
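Because the threshold above is selected on the test set, the reported accuracy is likely to be slightly optimistic. A hedged alternative is to hold out a separate validation split for the threshold search and keep the test set for the final number. The sketch below uses synthetic data and hypothetical names (`X_toy`, `X_val`, `clf`), and only illustrates the idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the under-sampled data; not the BRCA data.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Three-way split: fit on train, pick the threshold on validation, report on test.
X_tmp, X_te, y_tmp, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

clf = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

# Evaluate each candidate threshold on the validation split only.
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
val_proba = clf.predict_proba(X_val)[:, 1]
val_acc = [accuracy_score(y_val, (val_proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(val_acc))]

# The test split is used once, with the threshold chosen above.
test_pred = (clf.predict_proba(X_te)[:, 1] >= best_t).astype(int)
test_acc = accuracy_score(y_te, test_pred)
print("Best threshold:", best_t, "| test accuracy:", test_acc)
```

With only about 130 rows after under-sampling, a three-way split leaves very little data per part, so cross-validation may be a more practical way to apply the same idea.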

# Training Artificial Neural Network

In [None]:
# Importing the Keras libraries and packages
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

In [None]:
X_train.shape

In [None]:
# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 16, kernel_initializer = 'uniform', activation = 'relu', input_dim = 10))
classifier.add(Dropout(0.2))
# Adding the second hidden layer
classifier.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(0.2))
# Adding the third hidden layer
classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(0.2))
# Adding the output layer (no dropout after it, so predictions are not randomly zeroed during training)
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
callback = tf.keras.callbacks.EarlyStopping(monitor='accuracy', patience=80)
history = classifier.fit(X_train, y_train, batch_size = 10, epochs = 100, callbacks=[callback])

In [None]:
classifier.save('1rst-model.h5')

In [None]:
accuracy = history.history['accuracy']
loss = history.history['loss']

In [None]:
plt.figure(figsize=(15,10))

plt.subplot(2, 2, 1)
plt.plot(accuracy, label = "Training accuracy")
plt.legend()
plt.title("Training accuracy")


plt.subplot(2,2,2)
plt.plot(loss, label = "Training loss")
plt.legend()
plt.title("Training loss")

plt.show()

# Conclusion

#### According to the results shown above, none of these models (RF, AdaBoost, KNN, SVM, XGBoost, LR, ANN) manages to separate Dead from Alive patients.
#### The best we can do is roughly a 1-in-2 chance of guessing right...

## Hypothesis

- The features have no impact on the target
- There aren't enough rows in the dataset (need more people)
- The dataset isn't representative of the population
- As we undersampled the dataset, we only have 66*2 rows in the end. I could have tried to oversample instead
