# Notebook 1: Data Wrangling (Cleaning)

Table of Contents:
* [Data Import and its overview](#1) 
* [Data Cleaning](#2)

In [1]:
# import the libraries
import numpy as np 
import pandas as pd
from tabulate import tabulate # to print pretty dataframe 

## Data Import and its overview<a id=1></a>  
Data can be imported from Kaggle in several ways:  
* Download the data, then import it from a local file
* Import the data through the Kaggle API
* Read the data from a URL (the method used here)
* Read the data using the `gcs_path` (Google Cloud Storage path) of Kaggle datasets
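
Because `pd.read_csv` accepts both local paths and URLs, the first and third options can be combined. The following is a minimal sketch and not part of the original notebook: `load_with_cache` and `cache_path` are hypothetical names, and it assumes the source is a plain CSV file.

```python
from pathlib import Path

import pandas as pd


def load_with_cache(source, cache_path):
    """Read a CSV from `source` (URL or path), reusing a local copy if one exists."""
    cache = Path(cache_path)
    if cache.exists():
        # A local copy is already present, so no download is needed.
        return pd.read_csv(cache)
    df = pd.read_csv(source)  # pandas accepts URLs as well as file paths
    cache.parent.mkdir(parents=True, exist_ok=True)
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(cache, index=False)
    return df
```

A call such as `load_with_cache(data_url, "data/wdbc_data.csv")` would download the file once and read the local copy afterwards; the cache location here is only an example.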

In [2]:
data_url = "https://raw.githubusercontent.com/Gkchandora/Breast_Cancer_Prediction/main/Dataset/Data/wdbc_data.csv"
df = pd.read_csv(data_url)

In [3]:
# glimpse of data set
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
# Number of instances and features in the data
if __name__=="__main__":
    print(f" Total number of instances = {df.shape[0]}\n",
          f"Number of features = {len(df.columns)}\n"
          " ----------------------------------------------------\n"
          f" The features of the data are :\n {df.columns.values}",)

 Total number of instances = 569
 Number of features = 33
 ----------------------------------------------------
 The features of the data are :
 ['id' 'diagnosis' 'radius_mean' 'texture_mean' 'perimeter_mean'
 'area_mean' 'smoothness_mean' 'compactness_mean' 'concavity_mean'
 'concave points_mean' 'symmetry_mean' 'fractal_dimension_mean'
 'radius_se' 'texture_se' 'perimeter_se' 'area_se' 'smoothness_se'
 'compactness_se' 'concavity_se' 'concave points_se' 'symmetry_se'
 'fractal_dimension_se' 'radius_worst' 'texture_worst' 'perimeter_worst'
 'area_worst' 'smoothness_worst' 'compactness_worst' 'concavity_worst'
 'concave points_worst' 'symmetry_worst' 'fractal_dimension_worst'
 'Unnamed: 32']


In [5]:
# data types
df.dtypes.sort_values() # alternative: df.info(), which prints more detail than needed here

id                           int64
symmetry_worst             float64
concave points_worst       float64
concavity_worst            float64
compactness_worst          float64
smoothness_worst           float64
area_worst                 float64
perimeter_worst            float64
texture_worst              float64
radius_worst               float64
fractal_dimension_se       float64
symmetry_se                float64
concave points_se          float64
concavity_se               float64
compactness_se             float64
fractal_dimension_worst    float64
smoothness_se              float64
perimeter_se               float64
texture_se                 float64
radius_se                  float64
fractal_dimension_mean     float64
symmetry_mean              float64
concave points_mean        float64
concavity_mean             float64
compactness_mean           float64
smoothness_mean            float64
area_mean                  float64
perimeter_mean             float64
texture_mean               float64

As we can see, the data types are:
* `id` is `int64`
* `diagnosis` is `object`
* all remaining features are `float64`
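
For a wide table, a per-dtype count can be easier to scan than the full listing. The sketch below uses a small toy frame (its values are illustrative only and are not taken from the output above) to show one way of summarising dtypes and picking out the non-numeric columns.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2],
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 13.54],
    "texture_mean": [10.38, 14.36],
})

# Count how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()

# List the non-numeric columns, which usually need encoding before modelling.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(non_numeric)
```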

## Data Cleaning<a id=2></a>

In [6]:
# Inspection of missing data

# Total missing values and column wise missing counts etc
if __name__=="__main__":

    mis_val_count = df.isna().sum(axis = 0)

    # Total missing values in entire data
    print(f"Total missing values = {mis_val_count.sum(axis=0)}")

    # Column with missing counts > 0
    mis_val_col = mis_val_count[mis_val_count>0]
    mis_val_col = pd.DataFrame(mis_val_col, columns = ["# Missing Counts"])
    mis_val_col.index.name = "Features"
    print(tabulate(mis_val_col, headers='keys', tablefmt='psql'))

Total missing values = 569
+-------------+--------------------+
| Features    |   # Missing Counts |
|-------------+--------------------|
| Unnamed: 32 |                569 |
+-------------+--------------------+


In [7]:
# Drop columns that carry no predictive information: id, Unnamed: 32
df.drop(["id", "Unnamed: 32"], inplace = True, axis = 1)
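
Dropping by name works when the empty column is known in advance. As a hedged alternative, entirely empty columns can be removed without naming them by using `dropna(axis=1, how="all")`. The toy frame below is an assumption made for illustration and is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame with one identifier column and one entirely empty column.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop every column whose values are all missing.
cleaned = toy.dropna(axis=1, how="all")

# Drop the identifier column; errors="ignore" skips names that are not present.
cleaned = cleaned.drop(columns=["id"], errors="ignore")

print(cleaned.columns.tolist())
```

Both calls return new frames, so `toy` itself is left unchanged, unlike the `inplace = True` call above.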

In [8]:
# Rename the diagnosis column to target
df.rename(columns = {"diagnosis":"target"}, inplace = True)

In [9]:
# Check whether the string "?" occurs in the data; as a string placeholder it would not be counted as missing by isna() above
df[ (df == "?").sum(axis = 1) > 0]

Unnamed: 0,target,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
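
If a file did use `"?"` as a placeholder, one option is to declare it at read time through the `na_values` parameter of `pd.read_csv`, so that it is parsed as `NaN` and picked up by `isna()`. The two-row CSV below is made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV text in which "?" marks a missing measurement.
csv_text = "target,radius_mean\nM,17.99\nB,?\n"

# Without na_values the "?" is kept as text and the column is not numeric.
as_text = pd.read_csv(io.StringIO(csv_text))

# With na_values the "?" becomes NaN and the column is parsed as float.
as_nan = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

print(as_nan["radius_mean"].isna().sum())
```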


$\color{red}{\textrm{Note}}$ :  
Check that the entries in each column are plausible; for example, `radius_mean` must be greater than zero. Building a sound model requires this kind of domain knowledge about the data.

In [10]:
# Implementation of the note above: identify rows where radius_mean <= 0 or radius_worst <= 0

df[(df.radius_mean <= 0) | (df.radius_worst <= 0)]

Unnamed: 0,target,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
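
The same check can be extended to several columns at once. This is a minimal sketch: `rows_with_nonpositive` is a hypothetical helper, and the choice of which columns must be strictly positive (here, size measurements such as radius and area) is an assumption that depends on domain knowledge.

```python
import pandas as pd


def rows_with_nonpositive(frame, columns):
    """Return the rows in which any of the given columns is <= 0."""
    mask = (frame[columns] <= 0).any(axis=1)
    return frame[mask]


# Toy frame with two deliberately invalid rows; values are illustrative only.
toy = pd.DataFrame({
    "radius_mean": [17.99, 0.0, 11.42],
    "area_mean": [1001.0, 1326.0, -5.0],
})

bad = rows_with_nonpositive(toy, ["radius_mean", "area_mean"])
print(bad)
```

An empty result means every listed column is strictly positive in every row.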


*The data is now clean. The next step is EDA for model building: [EDA Notebook](https://github.com/Gkchandora/Breast_Cancer_Prediction/blob/main/Model_Builing_Steps/NB_2_EDA.ipynb)*