# In-depth Analysis of Breast Cancer Dataset

<hr>

## Breast Cancer in US

<ul>
    <li>In 2019, an estimated 268,600 new cases of invasive breast cancer will be diagnosed in women in the U.S. as well as 62,930 new cases of non-invasive (in situ) breast cancer.</li>
    <li>62% of breast cancer cases are diagnosed at a localized stage, for which the 5-year survival rate is 99%.</li>
    <li>This year, an estimated 41,760 women will die from breast cancer in the U.S.</li>
    <li>Although rare, men get breast cancer too. The lifetime risk for U.S. men is about 1 in 1,000.</li>
    <li>An estimated 2,670 men will be diagnosed with breast cancer this year in the United States and approximately 500 will die.</li>
    <li>1 in 8 women in the United States will develop breast cancer in her lifetime.</li>
    <li>Breast cancer is the most common cancer in American women, except for skin cancers.</li>
    <li>There are over 3.5 million breast cancer survivors in the United States.</li>
    <li>On average, every 2 minutes a woman is diagnosed with breast cancer in the United States.</li>
    <li>Female breast cancer represents 15.2% of all new cancer cases in the U.S.</li>
</ul>

Source : <a href="https://www.nationalbreastcancer.org/breast-cancer-facts#:~:text=In%202019%2C%20an%20estimated%20268%2C600,year%20survival%20rate%20is%2099%25." style="text-decoration: none;">NationalBreastCancer.org</a>

<hr>

## About Data Set

The dataset was obtained from Kaggle; the link is provided <a href="https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/version/2" style="text-decoration: none;">HERE</a>. It contains diagnostic information for patients in Wisconsin whose breast masses were classified as malignant or benign.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is that described in: K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.

## Attribute Information
<ul>
    <li>ID Number</li>
    <li>Diagnosis (M = malignant, B = benign)</li>
    <li>Columns 3 - 32 contain ten features, each computed using three metrics (Mean, Standard Error and Worst). Each metric accounts for 10 columns, giving 30 columns in total:</li>
    <ol>
        <li>radius (mean of distances from center to points on the perimeter)</li>
        <li>texture (standard deviation of gray-scale values)</li>
        <li>perimeter</li>
        <li>area</li>
        <li>smoothness (local variation in radius lengths)</li>
        <li>compactness (perimeter^2 / area - 1.0)</li>
        <li>concavity (severity of concave portions of the contour)</li>
        <li>concave points (number of concave portions of the contour)</li>
        <li>symmetry</li>
        <li>fractal dimension ("coastline approximation" - 1)</li>
    </ol>
    
</ul>

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
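
The field layout described above can be sketched in a few lines. The column names follow the pattern visible in the CSV header (for example `radius_mean`, `concave points_mean`); the helper names `base_features`, `stats` and `field_of` are hypothetical and used only for illustration.

```python
# Ten base measurements, in the order listed above (spelled as in the CSV header).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three summary statistics, in the order they appear in the file.
stats = ["mean", "se", "worst"]

# All ten means first, then all ten standard errors, then all ten worst values.
feature_columns = [f"{name}_{stat}" for stat in stats for name in base_features]

# Fields are counted from 1 and fields 1-2 hold id and diagnosis,
# so the feature at 0-based position k sits in field k + 3.
field_of = {col: k + 3 for k, col in enumerate(feature_columns)}

print(len(feature_columns))
print(field_of["radius_mean"], field_of["radius_se"], field_of["radius_worst"])
```

The second line printed is `3 13 23`, matching the fields quoted in the text.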

<hr>

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
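
The class counts above imply a moderately imbalanced target. A quick arithmetic check, using only the two counts quoted above (the variable names are illustrative):

```python
# Class counts as stated in the dataset description.
counts = {"benign": 357, "malignant": 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Roughly 63% of the samples are benign and 37% malignant, which is worth keeping in mind when reading accuracy figures later on.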

<hr>

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

In [2]:
# Read CSV File / DataSet
df = pd.read_csv('data/data.csv', index_col=False)

# 1. Cleaning Data

In [3]:
# View DataSet
df.head(5)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
# check null values and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [5]:
# check for duplicate rows
df.duplicated().sum()

0

In [6]:
# Drop every column that holds only a single distinct value.
def remove_unique(x):
    '''
    Author : Niladri Ghosh
    Email : niladri1406@gmail.com
    
    This function takes a single argument x, which should be a data frame, checks each column for the case where
    only a single value exists in the whole column, drops every such column in place, and finally prints out the dropped columns.
    
    '''
    
    uni = x.nunique()
    # columns whose number of distinct (non-null) values is exactly 1
    constant_cols = uni[uni == 1].index.tolist()
    for col in constant_cols:
        print([col])
    x.drop(columns=constant_cols, inplace=True)

In [7]:
# remove columns that hold a single unique value and print out their names
remove_unique(df)

#### Since no column holds a single repeated value, no columns were dropped.
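
One caveat about the check above: `DataFrame.nunique()` ignores NaN by default, so a column that is entirely empty reports 0 distinct values, not 1, and is not caught by a test for exactly one unique value. A minimal sketch on a made-up toy frame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: one constant column, one fully missing column, one ordinary column.
toy = pd.DataFrame({
    "constant": [7, 7, 7],
    "all_missing": [np.nan, np.nan, np.nan],
    "varied": [1, 2, 3],
})

default_counts = toy.nunique()               # NaN is ignored
with_nan_counts = toy.nunique(dropna=False)  # NaN counts as a value

# Columns carrying no information: at most one distinct value, NaN included.
uninformative = with_nan_counts[with_nan_counts <= 1].index.tolist()

print(default_counts.to_dict())
print(with_nan_counts.to_dict())
print(uninformative)
```

If `Unnamed: 32` is entirely NaN, as the preview of the first rows suggests, this would explain why it survives the check and has to be dropped by name.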

In [8]:
# remove the useless columns
df.drop(['id','Unnamed: 32'], axis=1, inplace=True)
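
A column named like `Unnamed: 32` typically appears when every line of a CSV, header included, ends with a trailing comma: pandas then sees one extra field that has no name and no values. Whether that is exactly what happened in `data/data.csv` is an assumption. The sketch below reproduces the effect on a tiny in-memory CSV with made-up rows.

```python
import io

import pandas as pd

# Made-up three-column CSV in which every line ends with a trailing comma.
raw = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n"
toy = pd.read_csv(io.StringIO(raw))

print(list(toy.columns))

# Drop the nameless column only if it is completely empty.
empty_unnamed = [
    c for c in toy.columns
    if c.startswith("Unnamed") and toy[c].isna().all()
]
toy = toy.drop(columns=empty_unnamed)
print(toy.shape)
```

Selecting the column by pattern and emptiness, instead of by its exact name, keeps the cleanup working if the number of columns in the file changes.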

In [9]:
# check shape of dataset
df.shape

(569, 31)

## Observations:

> The following observations can be drawn from the checks above:
><ul>
    <li>There are no null values in the dataset apart from the automatically generated column Unnamed: 32, which is removed.</li>
    <li>No duplicate rows are present.</li>
    <li>Every column has an appropriate data type.</li>
    <li>The id column holds patient identifiers and the Unnamed: 32 column was generated automatically while reading the CSV file; both are removed because they are of no use for the analysis.</li>
    <li>After cleaning out these minor issues, our dataset has 569 rows and 31 columns.</li>
</ul>

__Since the dataset has no other issues, no further cleaning is needed.__


### Data Slicing

As the calculations have been performed using three metrics, we split the data into three parts, viz. mean, se and worst. This simplifies the work that follows.

In [10]:
# slicing and creating new dataframes
df_mean = df.iloc[:,np.r_[0:1,1:11]]
df_se = df.iloc[:,np.r_[0:1,11:21]]
df_worst = df.iloc[:,np.r_[0:1,21:31]]
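
The positional slices above depend on the exact column order. As a hedged alternative, the same split can be expressed by column-name suffix. The helper `slice_by_suffix` and the toy frame below are hypothetical and only illustrate the idea; the numeric values are placeholders, not real measurements.

```python
import pandas as pd

# Toy frame with a small subset of the real column names and placeholder values.
toy = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [1.0, 2.0],
    "texture_mean": [3.0, 4.0],
    "radius_se": [5.0, 6.0],
    "radius_worst": [7.0, 8.0],
})


def slice_by_suffix(frame, suffix, keep=("diagnosis",)):
    """Return the `keep` columns plus every column ending in `_<suffix>`."""
    cols = list(keep) + [c for c in frame.columns if c.endswith(f"_{suffix}")]
    # .copy() makes the slice independent of the original frame.
    return frame[cols].copy()


toy_mean = slice_by_suffix(toy, "mean")
toy_se = slice_by_suffix(toy, "se")
print(list(toy_mean.columns))
print(list(toy_se.columns))
```

Selecting by name makes the intent visible in the code and keeps working if columns are reordered, at the cost of relying on the `_mean`, `_se` and `_worst` naming convention.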

In [11]:
# rename columns
def rename_columns(x):
    '''
    
    Author : Niladri Ghosh
    Email : niladri1406@gmail.com
    
    The function takes a single argument, a dataframe, and renames its columns to a simpler format.
    e.g. the column "radius_mean" is renamed to "radius". If a name contains multiple "_", only the
    part from the last "_" onward is removed, so "fractal_dimension_mean" becomes
    "fractal_dimension". Simply put, it removes the last "_" and every character after it.
    
    
    '''
    
    new_label = []
    for i in x.columns:
        if '_' in str(i):
            loca = i.rfind('_')
            new_label.append(i[:loca])
        else:
            new_label.append(i)
    x.columns = new_label

In [12]:
# rename the columns of each sliced dataframe and store each one in its own csv file
arr = [[df_mean,"data_mean"], [df_se,"data_se"], [df_worst,"data_worst"]]
for i,j in arr:
    rename_columns(i)
    i.to_csv("data/"+j+".csv", index=False)
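
To make the effect of the renaming concrete, the sketch below applies the same rule (cut at the last underscore) to a few column names taken from the dataset. `strip_suffix` is a hypothetical stand-in that uses `str.rsplit` instead of `rfind`; it is not the function defined above.

```python
def strip_suffix(name):
    """Drop the last underscore and everything after it; other names are unchanged."""
    return name.rsplit("_", 1)[0]


examples = ["diagnosis", "radius_mean", "concave points_se", "fractal_dimension_worst"]
renamed = [strip_suffix(n) for n in examples]
print(renamed)
```

After renaming, the three files share identical column names, so the file name (`data_mean`, `data_se`, `data_worst`) is the only record of which statistic each file holds.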
