# **(Part 1 Data Collection Notebook)**

## Objectives

- Upload dataset to repository.
- Load data and save it under inputs/datasets/raw/data.csv
- Inspect the data and save it under inputs/datasets/data_clean_id/data.csv

## Inputs

Download the dataset from Kaggle and import data.csv.

## Outputs

Generate datasets:

- inputs/datasets/raw/data.csv
- inputs/datasets/data_clean_id/data.csv

## Additional Comments


# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Breast-Cancer-Prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Breast-Cancer-Prediction'
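One caveat with the cells above is that re-running them moves the working directory up one more level each time. The sketch below shows one possible guard. It assumes the notebooks live in a folder named jupyter_notebooks, as the printed path suggests, and the helper name project_root is hypothetical.

```python
import os


def project_root(cwd: str, notebooks_folder: str = "jupyter_notebooks") -> str:
    """Return the parent of cwd only when cwd is the notebooks folder.

    Hypothetical helper: any other path is returned unchanged, so calling
    it twice does not climb above the project root.
    """
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    return cwd


# Usage in the notebook (commented out so the sketch has no side effects):
# os.chdir(project_root(os.getcwd()))
```

With this guard, running the cell a second time leaves the working directory at the project root.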

# 1 Problem Statement

Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.

# 1.1 Expected outcome

Fine-needle aspiration (FNA) is a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore, or swelling) with a fine needle similar to a blood sample needle. Given the results of a breast FNA test, the expected outcome is a model that can classify a breast tumor into one of two classes:

* 1 = Malignant (cancerous): present
* 0 = Benign (not cancerous): absent
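The two classes can be produced from the raw labels with a simple mapping. This is a minimal sketch on toy data, assuming the raw diagnosis column stores the letters M and B; the names label_map and encoded are illustrative only.

```python
import pandas as pd

# Toy diagnosis column; the real one is read from data.csv
diagnosis = pd.Series(["M", "B", "B", "M"], name="diagnosis")

# Map the raw letters to the numeric classes listed above
label_map = {"M": 1, "B": 0}
encoded = diagnosis.map(label_map)

# Any value outside the mapping becomes NaN, so check before modelling
assert encoded.notna().all()
print(encoded.tolist())  # [1, 0, 0, 1]
```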

# 1.2 Objective

Since the labels in the data are discrete, the prediction falls into two categories (i.e. malignant or benign). In machine learning, this is a classification problem.

Thus, the goal is to classify whether the breast cancer is benign or malignant and to predict the recurrence or non-recurrence of malignant cases after a certain period. To achieve this, we use machine learning classification methods to fit a function that can predict the discrete class of a new input.

# 1.3 Identify data sources

The Breast Cancer dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.

The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M = malignant, B = benign), respectively.

Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.
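The column layout described above can be separated into identifier, target, and features as follows. This is a sketch on a two-row toy frame with only two of the 30 feature columns; the variable names are illustrative.

```python
import pandas as pd

# Toy frame with the same leading layout as data.csv: id, diagnosis, features
df_toy = pd.DataFrame(
    {
        "id": [842302, 842517],
        "diagnosis": ["M", "M"],
        "radius_mean": [17.99, 20.57],
        "texture_mean": [10.38, 17.77],
    }
)

ids = df_toy["id"]                                   # column 1: sample identifier
target = df_toy["diagnosis"]                         # column 2: class label
features = df_toy.drop(columns=["id", "diagnosis"])  # remaining columns: model inputs

print(features.columns.tolist())
```

Keeping the id column out of the feature set matters because it carries no diagnostic information.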

# Getting Started: Load libraries and set options

In [4]:
#load libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns


The cell below imports all the scikit-learn modules used throughout the project. It includes modules that help build classifiers such as Logistic Regression, K-Nearest Neighbors, and Random Forest.

In addition, to fit the models well, the hyperparameters of the relevant models are tuned to find the best combination using the available search methods, for which the model_selection module is imported.

Finally, given the large number of features in the dataset, dimensionality reduction was judged appropriate. A feature selection method is therefore used to protect the model from the curse of dimensionality, for which the feature_selection module is imported.

The necessary imports are listed below.

In [5]:
# For train/test splitting
from sklearn.model_selection import train_test_split

# For Building Classifier Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# For evaluating the accuracy of the classifiers
from sklearn.metrics import classification_report , mean_squared_error , confusion_matrix
from sklearn import metrics

# For hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

# For Cross Validation
from sklearn.model_selection import KFold, StratifiedKFold,cross_val_score, cross_val_predict

# For Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
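As a rough sketch of how these imports might fit together (not the notebook's actual modelling code), the example below selects features with SelectKBest and chi2, then tunes a KNeighborsClassifier with GridSearchCV. It assumes scikit-learn's bundled load_breast_cancer data in place of data.csv; note that this copy encodes the classes as 0 = malignant, 1 = benign. The grid values are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# chi2 needs non-negative inputs, which holds for these measurements
pipeline = Pipeline(
    [
        ("select", SelectKBest(score_func=chi2)),
        ("knn", KNeighborsClassifier()),
    ]
)

# Small illustrative grid; the values are assumptions, not tuned choices
param_grid = {"select__k": [5, 10], "knn__n_neighbors": [3, 5]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Placing the feature selection step inside the pipeline means it is refit on each training fold, so the held-out fold does not influence which features are kept.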

# Load Dataset

We are using the following data from Kaggle: [Kaggle URL](https://www.kaggle.com/datasets/vijayaadithyanvg/breast-cancer-prediction). I downloaded it as a CSV file and imported the file into the repo.

Kaggle, an open source dataset platform, was used to access this dataset for a classification problem: building a classifier that predicts whether a tumor is malignant or benign based on certain attributes.

In [6]:
# Read the file "data.csv" into a DataFrame.
df = pd.read_csv("inputs/datasets/raw/data.csv", index_col=False)
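A load step like the one above can be made a little more defensive by checking that the expected columns are present. The sketch below uses an in-memory CSV so it runs on its own; the helper load_dataset and the REQUIRED_COLUMNS set are hypothetical, not part of the notebook.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = {"id", "diagnosis"}  # assumption: minimal columns to check


def load_dataset(source) -> pd.DataFrame:
    """Read a CSV (path or file-like) and verify the expected columns exist."""
    frame = pd.read_csv(source)
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return frame


# In-memory stand-in for inputs/datasets/raw/data.csv
sample_csv = io.StringIO(
    "id,diagnosis,radius_mean\n842302,M,17.99\n842517,M,20.57\n"
)
df_sample = load_dataset(sample_csv)
print(df_sample.shape)  # (2, 3)
```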

# Inspecting the data

The first step is to visually inspect the new dataset. There are multiple ways to achieve this:

* The easiest is to request the first few records using the DataFrame method df.head(). By default, df.head() returns the first 5 rows of the DataFrame object df (excluding the header row).

* Alternatively, one can use df.tail() to return the last five rows of the DataFrame.

* For both the head and tail methods, the number of records can be specified by passing the required number in the parentheses when calling either method.
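The inspection options listed above can be tried on a small toy frame. This sketch uses ten made-up rows rather than the real dataset.

```python
import pandas as pd

# Toy frame with ten rows standing in for the real dataset
df_toy = pd.DataFrame({"value": range(10)})

first_default = df_toy.head()    # first 5 rows by default
last_default = df_toy.tail()     # last 5 rows by default
first_three = df_toy.head(3)     # first 3 rows
last_two = df_toy.tail(2)        # last 2 rows

print(len(first_default), len(last_default), len(first_three), len(last_two))
```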


In [7]:
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


As per the description provided on [Kaggle](https://www.kaggle.com/datasets/shubamsumbria/breast-cancer-prediction), the dataset has been derived from a digitized image of an FNA of a breast mass.

**Attribute Information:**

* Diagnosis (1 = Malignant, 0 = Benign)

#### The ten real-valued features computed for each cell nucleus (columns 3–32) are as follows:

* radius: mean of the distances from the center to points on the perimeter.
* texture: standard deviation of gray-scale values.
* perimeter
* area
* smoothness: local variation in radius lengths.
* compactness: perimeter² / area - 1.0.
* concavity: severity of concave portions of the contour.
* concave points: number of concave portions of the contour.
* symmetry
* fractal dimension: "coastline approximation" - 1.
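The ten measurements above account for all 30 feature columns because each one appears with three suffixes in the dataset: mean, se, and worst. The sketch below rebuilds the column names from those two lists; the variable names are illustrative.

```python
# The ten base measurements listed above, spelled as in the column names
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each measurement is reported three times per sample
statistics = ["mean", "se", "worst"]

# Same ordering as the CSV: all *_mean columns, then *_se, then *_worst
feature_columns = [f"{name}_{stat}" for stat in statistics for name in base_features]

print(len(feature_columns))  # 30
```

A list like this can be compared against df.columns to confirm that no feature column is missing or misspelled.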

#### Dimension Of The Dataset

In [9]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")

The dataset has 569 rows and 32 columns


#### Checking The Features and Their Types 

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [12]:
print("The columns present in our dataset are as follows : \n")
for col in df.columns:
    print(col , end = ", ")

The columns present in our dataset are as follows : 

id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst, 

### Handling Missing Values 

#### Checking The Amount of Null Values By Columns 

In [13]:
df.isnull().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

#### Checking The Amount of Missing Values By Columns

In [14]:
df.isna().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

Both outputs above show a count of zero for every column, so the dataset has no missing or null values. In pandas, isnull() is an alias of isna(), so the two checks are equivalent.
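To see what these checks report when a value is actually missing, here is a small sketch on a toy frame with one deliberate NaN. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value
df_toy = pd.DataFrame(
    {"radius_mean": [17.99, np.nan, 19.69], "diagnosis": ["M", "M", "B"]}
)

missing_counts = df_toy.isna().sum()          # missing values per column
missing_share = df_toy.isna().mean() * 100    # same, as a percentage

# isnull() is an alias of isna(), so both give identical results
assert df_toy.isnull().equals(df_toy.isna())

print(missing_counts.to_dict())
```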

---

# Push files to Repo


In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)

# Save the inspected dataset without writing the DataFrame index
df.to_csv("outputs/datasets/cleaned/data.csv", index=False)
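The save step can be checked with a quick round trip. This sketch writes a toy frame into a temporary directory that stands in for the repository root, then reads it back; the folder layout mirrors the cell above and the data is illustrative.

```python
import os
import tempfile

import pandas as pd

df_toy = pd.DataFrame({"id": [842302, 842517], "diagnosis": ["M", "M"]})

# A temporary directory stands in for the repository root
with tempfile.TemporaryDirectory() as root:
    out_dir = os.path.join(root, "outputs", "datasets", "cleaned")
    os.makedirs(out_dir, exist_ok=True)   # creates the nested folders
    os.makedirs(out_dir, exist_ok=True)   # safe to call again

    out_path = os.path.join(out_dir, "data.csv")
    df_toy.to_csv(out_path, index=False)  # index=False avoids an extra column

    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

Writing with index=False keeps the saved file free of an extra unnamed index column when it is loaded again.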