# **(Part 1 Data Collection Notebook)**

## Objectives

- Upload dataset to repository.
- Load data and save it under inputs/datasets/raw/data.csv
- Inspect the data and save it under inputs/datasets/data_clean_id/data.csv

## Inputs

Download the dataset from Kaggle and import data.csv.

## Outputs

Generate datasets:

- inputs/datasets/raw/data.csv
- inputs/datasets/data_clean_id/data.csv

## Additional Comments


# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory
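
To see what os.path.dirname() returns before changing directories, here is a small sketch on a made-up path. The path `/workspace/project/jupyter_notebooks` is hypothetical and used only for illustration.

```python
import os

# Hypothetical notebook folder, used only to illustrate the call
example_dir = "/workspace/project/jupyter_notebooks"

# dirname() strips the last path component, giving the parent folder
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Passing that parent path to os.chdir() is what moves the notebook's working directory up one level.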

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# 1 Problem Statement

Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.

# 1.1 Expected outcome

The input is the result of a breast fine-needle aspiration (FNA) test, a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore, or swelling) with a fine needle similar to a blood sample needle. Given these results, the goal is to build a model that can classify a breast tumor into one of two classes:

* 1 = Malignant (Cancerous) - Present
* 0 = Benign (Not Cancerous) - Absent

# 1.2 Objective

Since the labels in the data are discrete, the prediction falls into two categories (i.e. malignant or benign). In machine learning, this is a classification problem.

Thus, the goal is to classify whether the breast cancer is benign or malignant and predict the recurrence and non-recurrence of malignant cases after a certain period. To achieve this, we use machine learning classification methods to fit a function that can predict the discrete class of a new input.

# 1.3 Identify data sources

The Breast Cancer dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.

The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M = malignant, B = benign), respectively.

Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.
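
The letter codes described here (M/B) and the numeric classes from the expected outcome (1/0) can be reconciled with a simple mapping. This is a minimal sketch on a toy frame: the column name `diagnosis`, the new column `target`, and the sample values are assumptions for illustration, not taken from the real file.

```python
import pandas as pd

# Toy data standing in for the real file; the column name is assumed
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Map the letter codes to the numeric classes: 1 = malignant, 0 = benign
toy["target"] = toy["diagnosis"].map({"M": 1, "B": 0})
print(toy)
```

If the downloaded file already stores the diagnosis as 1/0, this step is unnecessary.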

# Getting Started: Load libraries and set options

In [None]:
#load libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns


The project relies on several sklearn modules throughout. These include the modules that help in building classifiers such as Logistic Regression, K-Nearest Neighbors Classifier and Random Forest Classifier.

Alongside that, in order to best fit the model, the hyperparameters of the relevant models are tuned to find the best combination using the available search methods, for which the model_selection module is imported.

Finally, given the large number of features in the dataset, it was considered worthwhile to apply dimensionality reduction. A feature selection method is therefore used to mitigate the curse of dimensionality, for which the feature_selection module is imported.
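
A minimal sketch of the imports this description implies is shown here. The text names only the classifiers and the two modules, so the specific search and selection classes (`GridSearchCV`, `train_test_split`, `SelectKBest`, `f_classif`) are assumptions about what the project might use.

```python
# Classifiers named in the text
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Hyperparameter search (GridSearchCV is an assumed choice of search method)
from sklearn.model_selection import GridSearchCV, train_test_split

# Feature selection (SelectKBest with f_classif is an assumed choice)
from sklearn.feature_selection import SelectKBest, f_classif
```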
  

# Load Dataset

We are using the following data from Kaggle: [Kaggle URL](https://www.kaggle.com/datasets/vijayaadithyanvg/breast-cancer-prediction). I downloaded it as a CSV file and added that file to the repo.

Kaggle, an open-source dataset platform, was used to access this dataset for a classification problem: building a classifier that predicts whether a tumor is malignant or benign based on certain attributes.

In [None]:
# Read the file "data.csv" into a DataFrame (the contents are inspected below).
df = pd.read_csv("inputs/datasets/raw/data.csv", index_col=False)

# Inspecting the data

The first step is to visually inspect the new data set. There are multiple ways to achieve this:

* The easiest is to request the first few records using the DataFrame method df.head(). By default, df.head() returns the first 5 rows from the DataFrame object df (excluding the header row).

* Alternatively, one can also use df.tail() to return the last five rows of the DataFrame.

* For both head and tail methods, there is an option to specify the number of records by including the required number in between the parentheses when calling either method.
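
As a small illustration of passing a row count to both methods, here is a sketch on a toy frame. The frame `toy` and its `value` column are made up for the example and stand in for the real dataset.

```python
import pandas as pd

# Toy frame with ten rows, standing in for the real dataset
toy = pd.DataFrame({"value": range(10)})

first_three = toy.head(3)  # first 3 rows instead of the default 5
last_two = toy.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```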


In [None]:
df.head()

As per the description provided on [Kaggle](https://www.kaggle.com/datasets/shubamsumbria/breast-cancer-prediction), the dataset has been derived from digitized images of a fine-needle aspirate (FNA) of a breast mass.

**Attribute Information:**

* Diagnosis (1 = Malignant, 0 = Benign)

#### Ten real-valued features computed for each cell nucleus (columns 3–32) refer to the following details:

* radius: mean of distances from the center to points on the perimeter.
* texture: standard deviation of gray-scale values.
* perimeter
* area
* smoothness: local variation in radius lengths.
* compactness: perimeter² / area - 1.0.
* concavity: severity of concave portions of the contour.
* concave points: number of concave portions of the contour.
* symmetry
* fractal dimension: “coastline approximation” - 1.
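
As a worked example of the compactness formula, a circle makes a convenient sanity check because its value does not depend on the radius. The function name `compactness` is hypothetical, and this check only exercises the formula; it is not part of how the dataset was produced.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# A circle of radius r has perimeter 2*pi*r and area pi*r^2,
# so its compactness is 4*pi - 1 regardless of r
r = 2.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)
print(circle)
```

Less regular outlines have a longer perimeter for the same area, so they score higher than the circle.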

#### Dimension Of The Dataset

In [None]:
# df.shape is a (rows, columns) tuple, so report both values
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")

#### Checking The Features and Their Types 

In [None]:
df.info()

In [None]:
print("The columns present in our dataset are as follows:\n")
# Join the column names into one comma-separated line
print(", ".join(df.columns))

### Handling Missing Values 

#### Checking The Amount of Null Values By Columns 

In [None]:
df.isnull().sum()

#### Checking The Amount of Missing Values By Columns

In [None]:
df.isna().sum()

From the above outputs, we know that there are no missing or null values.
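
In pandas, isnull() is an alias of isna(), so the two checks above report the same counts. A minimal sketch on a toy frame with one deliberately missing value shows this; the frame and its column names `a` and `b` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# isnull() is an alias of isna(), so both give identical counts
missing_per_column = toy.isna().sum()
same_counts = missing_per_column.equals(toy.isnull().sum())

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print(same_counts, total_missing)
```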

---

# Push files to Repo


In [None]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs("outputs/datasets/cleaned", exist_ok=True)

# Save the inspected data without writing the DataFrame index
df.to_csv("outputs/datasets/cleaned/data.csv", index=False)
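
One way to confirm that a saved file can be read back is a quick round trip. The sketch below uses a temporary folder and a toy frame so it does not touch the repository; the column names `diagnosis` and `radius_mean` and the values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the inspected dataset (column names assumed)
toy = pd.DataFrame({"diagnosis": [1, 0], "radius_mean": [17.99, 13.54]})

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the folder layout used above inside a throwaway directory
    out_dir = os.path.join(tmp, "outputs", "datasets", "cleaned")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "data.csv")

    # Write without the index, then read the file back
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if the reloaded frame differs from the original
pd.testing.assert_frame_equal(reloaded, toy)
print(reloaded.shape)
```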