# Breast Cancer Wisconsin (Diagnostic) Data Analysis

## Overview

This project applies data science and machine learning to a critical real-world problem: classifying breast masses as benign or malignant. Using the Breast Cancer Wisconsin (Diagnostic) Data Set, we apply several predictive modeling techniques to classify cases. Our objective is to formulate a meaningful commercial question, select and implement appropriate machine learning models, and evaluate their potential impact on the diagnostic process and on commercial applications in the medical field.

## Data Set

The features were computed from digitized images of fine needle aspirates (FNA) of breast masses and describe characteristics of the cell nuclei present. The data set originates from the UCI Machine Learning Repository and is also available through the UW CS ftp server. It contains 569 samples with 32 attributes each: an ID number, a diagnosis, and 30 real-valued features derived from the radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension of the cell nuclei.

**Attributes:**

1) ID number
2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

For each image, the mean, the standard error, and the "worst" value
(the mean of the three largest values) of each feature were computed,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, and field 23 is Worst Radius.
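
The field numbering described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the project code: the names `BASE_FEATURES`, `STATISTICS`, `field_number`, and `field_layout` are hypothetical, and the sketch assumes that each block of ten fields follows the feature order listed above, which is consistent with the three example fields given for radius.

```python
# Base features, assumed to appear in this order within each block of ten.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]

# The three statistics computed per feature, in field order.
STATISTICS = ["mean", "se", "worst"]

# Fields 1 and 2 hold the ID number and the diagnosis.
FIRST_FEATURE_FIELD = 3


def field_number(feature: str, statistic: str) -> int:
    """Return the 1-based field number of a feature/statistic pair."""
    block = STATISTICS.index(statistic)
    offset = BASE_FEATURES.index(feature)
    return FIRST_FEATURE_FIELD + block * len(BASE_FEATURES) + offset


def field_layout() -> dict[int, str]:
    """Map every feature field number (3 through 32) to a label."""
    return {
        field_number(feature, stat): f"{feature} ({stat})"
        for stat in STATISTICS
        for feature in BASE_FEATURES
    }
```

Calling `field_number("radius", "se")` reproduces the "field 13 is Radius SE" example from the description.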

All feature values are recorded with four significant digits.

Missing attribute values: none


Data Source: [Breast Cancer Wisconsin (Diagnostic) Data Set on Kaggle](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data)
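
Once the CSV is downloaded, it could be loaded along the following lines. This is a hedged sketch rather than the project's actual loading code: the helper name `load_diagnostic_data` is hypothetical, and the column names `id` and `diagnosis`, as well as the presence of an empty trailing column, are assumptions about the Kaggle file that should be checked against the download.

```python
import pandas as pd


def load_diagnostic_data(source):
    """Read the CSV and return (features, labels).

    `source` can be a file path or any file-like object accepted by
    pandas.read_csv.
    """
    df = pd.read_csv(source)

    # Drop columns that are entirely empty (the Kaggle file is assumed to
    # carry one such trailing column) and the non-predictive ID column.
    df = df.dropna(axis="columns", how="all")
    df = df.drop(columns=["id"], errors="ignore")

    # Encode the diagnosis as 1 = malignant, 0 = benign.
    labels = df.pop("diagnosis").map({"M": 1, "B": 0})
    return df, labels
```

A typical call would be `features, labels = load_diagnostic_data("data.csv")`, where `data.csv` is an assumed local file name for the downloaded data.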

## Project Structure

- **Data Exploration:** Initial analysis to understand the dataset's characteristics and distribution.
- **Preprocessing:** Data cleaning and normalization to prepare for model training.
- **Model Selection:** Evaluating various machine learning models to find the most suitable for our data and objective.
- **Model Training and Evaluation:** Implementing the selected model using Python and Jupyter Notebooks, followed by rigorous evaluation to assess its performance.
- **Impact Analysis:** Assessing the model's commercial impact, particularly its potential to improve breast cancer diagnosis.
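
The preprocessing, training, and evaluation steps above might look roughly like the sketch below. It makes two assumptions that are not part of this project: it uses the copy of the data set bundled with scikit-learn (`load_breast_cancer`) instead of the Kaggle CSV, and it uses logistic regression purely as an illustrative baseline, not as the model the project selects.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy stands in for the Kaggle CSV.
# In this copy the target is encoded as 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set so both classes keep their proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Standardization and the model live in one pipeline, so the scaler is
# fit on the training split only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
# Recall on the malignant class: the share of malignant cases detected.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

print(f"accuracy: {accuracy:.3f}")
print(f"malignant recall: {malignant_recall:.3f}")
```

Reporting recall on the malignant class alongside accuracy is a deliberate choice here, since a missed malignant case is usually the costlier error in a diagnostic setting.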

## Tools and Technologies

- **Python:** The primary programming language for data analysis and machine learning model implementation.
- **Jupyter Notebooks:** For interactive code execution, visualization, and documentation.
- **Pandas & NumPy:** For data manipulation and numerical computation.
- **Scikit-learn:** For applying machine learning algorithms.
- **Matplotlib & Seaborn:** For data visualization.

## Getting Started

To work with this project, you need Python installed on your system. Download the data set from the Kaggle link above, then install the required libraries, including Jupyter:

```bash
pip install numpy pandas matplotlib seaborn scikit-learn jupyter
```
