# CASE STUDY: BREAST CANCER CLASSIFICATION

## STEP #1: PROBLEM STATEMENT

- Predict whether a breast cancer diagnosis is benign or malignant from a set of measured features
- 30 features are used: the mean, standard error, and worst (largest) value of each of these 10 measurements:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

- The two classes are linearly separable using all 30 input features
- Number of Instances: 569
- Class Distribution: 212 Malignant, 357 Benign
- Target class:
         - Malignant
         - Benign


https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
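
The figures quoted above can be checked against the copy of the dataset that ships with scikit-learn. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names `n_samples`, `n_features` and `class_counts` are only illustrative.

```python
# Sanity-check the dataset size and class distribution stated above.
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# data has one row per instance and one column per feature
n_samples, n_features = cancer.data.shape

# target holds integer class codes; target_names maps code -> label
counts = np.bincount(cancer.target)
class_counts = {str(name): int(count)
                for name, count in zip(cancer.target_names, counts)}

print("instances:", n_samples)
print("features:", n_features)
print("class counts:", class_counts)
```

Note that in scikit-learn's encoding the code 0 stands for malignant and 1 for benign, which matters when reading the plots later on.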


## STEP #2: IMPORTING DATA

In [None]:
# import libraries 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline

In [None]:
# Import the cancer data from the scikit-learn library
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [None]:
cancer

In [None]:
cancer.keys()

In [None]:
print(cancer['DESCR'])

In [None]:
print(cancer['target_names'])

In [None]:
print(cancer['feature_names'])

In [None]:
cancer['data'].shape

In [None]:
df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))

In [None]:
df_cancer.head()

In [None]:
df_cancer.tail()
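
The `target` column built above stores the class as 0.0 or 1.0, which is easy to misread in tables and plot legends. The sketch below adds a readable label next to it; the column name `diagnosis` is a hypothetical choice for illustration and is not used elsewhere in this notebook.

```python
# Attach a human-readable class label alongside the numeric target.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df_cancer = pd.DataFrame(
    np.c_[cancer['data'], cancer['target']],
    columns=np.append(cancer['feature_names'], ['target']),
)

# target is stored as float after np.c_, so cast to int before indexing;
# target_names[0] is 'malignant' and target_names[1] is 'benign'
codes = df_cancer['target'].astype(int).to_numpy()
df_cancer['diagnosis'] = np.asarray(cancer['target_names'])[codes]

print(df_cancer[['target', 'diagnosis']].head())
print(df_cancer['diagnosis'].value_counts())
```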

## STEP #3: VISUALIZING THE DATA

In [None]:
sns.pairplot(df_cancer, hue = 'target', 
             vars = ['mean radius', 'mean texture', 'mean area', 'mean perimeter', 'mean smoothness'] )

In [None]:
# Count the samples in each class (recent seaborn versions expect x= and data= keywords)
sns.countplot(x = 'target', data = df_cancer)

In [None]:
sns.scatterplot(x = 'mean area', 
                y = 'mean smoothness', 
                hue = 'target', 
                data = df_cancer)

In [None]:
# Let's check the correlation between the variables 
# Strong correlation between mean radius and mean perimeter, and between mean area and mean perimeter
plt.figure(figsize=(20,10)) 
sns.heatmap(df_cancer.corr(), annot=True) 
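
With 31 columns the annotated heatmap is dense, so it can help to read the correlations of interest directly. A small sketch, assuming the same scikit-learn dataset; the `pairs` list and the `df` and `corr` names are illustrative only.

```python
# Print the Pearson correlation for the feature pairs called out above.
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])

# Full pairwise correlation matrix of the 30 features
corr = df.corr()

pairs = [
    ('mean radius', 'mean perimeter'),
    ('mean area', 'mean perimeter'),
]
for a, b in pairs:
    print(f"{a} vs {b}: {corr.loc[a, b]:.3f}")
```

Values close to 1 are expected here because perimeter and area are both largely determined by the radius.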

## STEP #4: MODEL TRAINING (FINDING A PROBLEM SOLUTION)

In [None]:
# Let's drop the target label column
X = df_cancer.drop(['target'],axis=1)