# Life Cycle of a Machine Learning Project
- 1. Understanding the Problem Statement
- 2. Data Collection
- 3. Data Checks to Perform
- 4. Exploratory Data Analysis
- 5. Data Pre-Processing
- 6. Model Training
- 7. Choosing the Best Model

## 1. Problem Statement

- `Problem Statement`: The task is to develop a machine learning model that classifies breast tumors as malignant (cancerous) or benign (non-cancerous) from features extracted from digitized images of breast mass samples. The goal is to support medical professionals in the early detection and diagnosis of breast cancer by giving them a reliable aid for treatment planning and decision-making.

- `Dataset`: The dataset is the "Breast Cancer Wisconsin (Diagnostic) Dataset" provided by the University of Wisconsin Hospitals, Madison. It contains measurements derived from digitized images of breast mass samples. Each sample is labeled malignant or benign.


In [8]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Print the part of the dataset description that starts at "Creator"
# and ends just before the first occurrence of " topic".
descr = data.DESCR
start_index = descr.find("Creator")
end_index = descr.find(" topic")

if start_index != -1 and end_index != -1:
    creator_section = descr[start_index:end_index]
    print(creator_section)
else:
    print("Creator section not found.")


Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robu

## Objective:

- The objective is to train a machine learning model using the breast cancer dataset to accurately classify future breast mass samples as malignant or benign. The model should generalize well to unseen data and achieve a high level of accuracy, sensitivity, and specificity in detecting breast cancer.
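
The three metrics named in the objective can be made concrete with a small sketch. It treats `malignant` as the positive class and uses made-up labels purely for illustration; the helper name `binary_metrics` is hypothetical and the numbers are not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive="malignant"):
    """Return accuracy, sensitivity, and specificity for two-class labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of malignant samples detected
        "specificity": tn / (tn + fp),  # share of benign samples correctly cleared
    }

# Hypothetical labels, for illustration only.
y_true = ["malignant", "malignant", "malignant", "benign", "benign", "benign", "benign", "benign"]
y_pred = ["malignant", "malignant", "benign", "benign", "benign", "benign", "benign", "malignant"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a screening setting, sensitivity is usually watched most closely, because a missed malignant sample is costlier than a false alarm. The sketch assumes both classes occur in `y_true`; otherwise one of the divisions would fail.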

## 2.1 Import Data and Required Packages
## Importing Pandas, NumPy, and Matplotlib

In [1]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Display all columns
pd.set_option("display.max_columns", None)
%matplotlib inline

In [3]:
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['target'])

df = pd.concat([X, y], axis=1)
df['target'] = df['target'].replace({0: 'malignant', 1: 'benign'})

## Showing Top 5 Records

In [4]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,malignant
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,malignant
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,malignant
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,malignant
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,malignant


## Showing a Random Sample of the Data

In [5]:
df.sample(5)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
303,10.49,18.61,66.86,334.3,0.1068,0.06678,0.02297,0.0178,0.1482,0.066,0.1485,1.563,1.035,10.08,0.008875,0.009362,0.01808,0.009199,0.01791,0.003317,11.06,24.54,70.76,375.4,0.1413,0.1044,0.08423,0.06528,0.2213,0.07842,benign
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,0.3345,0.8902,2.217,27.19,0.00751,0.03345,0.03672,0.01137,0.02165,0.005082,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,malignant
206,9.876,17.27,62.92,295.4,0.1089,0.07232,0.01756,0.01952,0.1934,0.06285,0.2137,1.342,1.517,12.33,0.009719,0.01249,0.007975,0.007527,0.0221,0.002472,10.42,23.22,67.08,331.6,0.1415,0.1247,0.06213,0.05588,0.2989,0.0738,benign
557,9.423,27.88,59.26,271.3,0.08123,0.04971,0.0,0.0,0.1742,0.06059,0.5375,2.927,3.618,29.11,0.01159,0.01124,0.0,0.0,0.03004,0.003324,10.49,34.24,66.5,330.6,0.1073,0.07158,0.0,0.0,0.2475,0.06969,benign
207,17.01,20.26,109.7,904.3,0.08772,0.07304,0.0695,0.0539,0.2026,0.05223,0.5858,0.8554,4.106,68.46,0.005038,0.01503,0.01946,0.01123,0.02294,0.002581,19.8,25.05,130.0,1210.0,0.1111,0.1486,0.1932,0.1096,0.3275,0.06469,malignant


## Shape of the Data

In [15]:
df.shape

(569, 31)

## No Missing Values in the Dataset

## 2.2 Dataset information

In [16]:
descr = data.DESCR
start_index = descr.find("_breast_cancer_dataset")
end_index = descr.find("Summary Statistics:")

if start_index != -1 and end_index != -1:
    summary_statistics = descr[start_index:end_index]
    print(summary_statistics)
else:
    print("Summary Statistics section not found.")


_breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius,

## 3. Data Checks to perform
- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of the dataset
- Check the categories present in each categorical column
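
The checks listed above can be gathered into one helper, sketched here on a tiny made-up frame rather than the real data. The function name `run_basic_checks` and the toy rows are assumptions for illustration; only standard pandas calls are used, and summary statistics and category listings are left to their own steps.

```python
import pandas as pd

def run_basic_checks(frame):
    """Collect missing-value, duplicate, dtype, and uniqueness checks."""
    return {
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }

# Hypothetical rows: one missing value and one repeated row.
toy = pd.DataFrame({
    "mean radius": [17.99, 20.57, 20.57, None],
    "target": ["malignant", "malignant", "malignant", "benign"],
})

report = run_basic_checks(toy)
for name, result in report.items():
    print(name, result)
```

Running the same helper on `df` should report zeros for missing values and duplicates if the individual checks in the following subsections hold.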

## 3.1 Check Missing Values

In [17]:
df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

## No Missing Values Found

## 3.2 Check Duplicates

In [18]:
df.duplicated().sum()

0

## No Duplicate Rows Found

## 3.3 Check data type

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

## Converting Each Column to a Suitable Data Type

In [35]:

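def convert_columns(dataframe):
    """Return a copy with float64 columns downcast and object columns as categories."""
    converted_df = dataframe.copy()

    for column in converted_df.columns:
        column_data_type = converted_df[column].dtype

        if column_data_type == 'float64':
            column_min = converted_df[column].min()
            column_max = converted_df[column].max()

            if column_min >= 0 and column_max <= 1:
                # Values confined to [0, 1] are stored as float32.
                converted_df[column] = converted_df[column].astype('float32')
            else:
                # Caution: float16 keeps only about 3 significant decimal digits
                # and overflows above 65504, so values are rounded (386.1 is
                # stored as 386.0) and sums over a column can become inf.
                converted_df[column] = converted_df[column].astype('float16')

        elif column_data_type == 'object':
            # String labels such as 'target' become a categorical column.
            converted_df[column] = converted_df[column].astype('category')

    return converted_df

# Replace df with its converted copy.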
df = convert_columns(df)
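
Because float16 trades a lot of precision for memory, a more conservative variant is sketched below. It is not the conversion used in this notebook: the function name `downcast_to_float32` and the toy rows are hypothetical, and the only change is that every float64 column goes to float32.

```python
import numpy as np
import pandas as pd

def downcast_to_float32(frame):
    """Hypothetical alternative: float64 -> float32, object -> category."""
    out = frame.copy()
    for column in out.columns:
        if out[column].dtype == "float64":
            # float32 keeps about 7 significant digits, enough for these features
            out[column] = out[column].astype("float32")
        elif out[column].dtype == "object":
            out[column] = out[column].astype("category")
    return out

# Toy frame for illustration; column names mirror the dataset.
toy = pd.DataFrame({
    "mean area": [1001.0, 386.1, 2501.0],
    "mean smoothness": [0.1184, 0.1425, 0.0526],
    "target": pd.Series(["malignant", "malignant", "benign"], dtype="object"),
})

converted = downcast_to_float32(toy)
print(converted.dtypes)
print(converted["mean area"].tolist())
```

float32 still halves the memory of the float columns, and with 569 rows the extra saving from float16 is negligible compared with the rounding it introduces.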


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   mean radius              569 non-null    float16 
 1   mean texture             569 non-null    float16 
 2   mean perimeter           569 non-null    float16 
 3   mean area                569 non-null    float16 
 4   mean smoothness          569 non-null    float32 
 5   mean compactness         569 non-null    float32 
 6   mean concavity           569 non-null    float32 
 7   mean concave points      569 non-null    float32 
 8   mean symmetry            569 non-null    float32 
 9   mean fractal dimension   569 non-null    float32 
 10  radius error             569 non-null    float16 
 11  texture error            569 non-null    float16 
 12  perimeter error          569 non-null    float16 
 13  area error               569 non-null    float16 
 14  smoothness

## Saving Preprocessed Data

In [33]:
# A raw string keeps the backslashes in the Windows path from being read as escapes.
df.to_csv(r"F:\project1\sklearn-Diabets-Deployment\datasets\Proecessed_breast_cancer_data.csv", index=False)
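
One thing to keep in mind when saving to CSV: the file stores text only, so the converted dtypes are not preserved and pandas infers them again on load. The sketch below shows this with an in-memory buffer standing in for the file on disk; the two-row frame is made up for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "mean area": pd.Series([1001.0, 1326.0], dtype="float32"),
    "target": pd.Categorical(["malignant", "benign"]),
})

buffer = io.StringIO()  # in-memory stand-in for a CSV file
frame.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dtypes are inferred from the text

buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"mean area": "float32", "target": "category"})

print(plain.dtypes)
print(typed.dtypes)
```

Passing `dtype=` to `read_csv` restores the intended types; a binary format would be another option if the dtypes need to survive without extra arguments.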

## Copying the Converted Dataset

In [22]:
df_copy = df.copy()

In [23]:
df_copy

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.984375,10.382812,122.81250,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,1.094727,0.905273,8.585938,153.375000,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.375000,17.328125,184.62500,2019.0,0.16220,0.665527,0.711914,0.2654,0.4601,0.11890,malignant
1,20.562500,17.765625,132.87500,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,0.543457,0.733887,3.398438,74.062500,0.005225,0.01308,0.01860,0.01340,0.01389,0.003532,24.984375,23.406250,158.75000,1956.0,0.12380,0.186646,0.241577,0.1860,0.2750,0.08902,malignant
2,19.687500,21.250000,130.00000,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,0.745605,0.787109,4.585938,94.000000,0.006150,0.04006,0.03832,0.02058,0.02250,0.004571,23.562500,25.531250,152.50000,1709.0,0.14440,0.424561,0.450439,0.2430,0.3613,0.08758,malignant
3,11.421875,20.375000,77.56250,386.0,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,0.495605,1.156250,3.445312,27.234375,0.009110,0.07458,0.05661,0.01867,0.05963,0.009208,14.906250,26.500000,98.87500,567.5,0.20980,0.866211,0.687012,0.2575,0.6638,0.17300,malignant
4,20.296875,14.343750,135.12500,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,0.757324,0.781250,5.437500,94.437500,0.011490,0.02461,0.05688,0.01885,0.01756,0.005115,22.546875,16.671875,152.25000,1575.0,0.13740,0.204956,0.399902,0.1625,0.2364,0.07678,malignant
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.562500,22.390625,142.00000,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,1.175781,1.255859,7.671875,158.750000,0.010300,0.02891,0.05198,0.02454,0.01114,0.004239,25.453125,26.406250,166.12500,2027.0,0.14100,0.211304,0.410645,0.2216,0.2060,0.07115,malignant
565,20.125000,28.250000,131.25000,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,0.765625,2.462891,5.203125,99.062500,0.005769,0.02423,0.03950,0.01678,0.01898,0.002498,23.687500,38.250000,155.00000,1731.0,0.11660,0.192261,0.321533,0.1628,0.2572,0.06637,malignant
566,16.593750,28.078125,108.31250,858.0,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,0.456299,1.075195,3.425781,48.562500,0.005903,0.03731,0.04730,0.01557,0.01318,0.003892,18.984375,34.125000,126.68750,1124.0,0.11390,0.309326,0.340332,0.1418,0.2218,0.07820,malignant
567,20.593750,29.328125,140.12500,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,0.726074,1.594727,5.773438,86.250000,0.006522,0.06158,0.07117,0.01664,0.02324,0.006185,25.734375,39.406250,184.62500,1821.0,0.16500,0.868164,0.938477,0.2650,0.4087,0.12400,malignant


## 3.4 Check the number of unique values of each column

In [42]:
# Step 1: Select integer columns
int_columns = list(df_copy.select_dtypes(include=['int']).columns)

# Step 2: Select categorical columns
category_columns = list(df_copy.select_dtypes(include=['category']).columns)

# Step 3: Combine integer and categorical columns
categorical_features = int_columns + category_columns

# Step 4: Print unique values for each categorical feature
for column in categorical_features:
    unique_values = df_copy[column].unique()
    print(f"Unique values for column '{column}': {unique_values}")

Unique values for column 'target': ['malignant', 'benign']
Categories (2, object): ['benign', 'malignant']
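
`unique()` lists the labels but not how often each occurs. A short sketch with `value_counts` adds the class frequencies; the five labels below are made up, so the printed counts are not the dataset's actual class balance.

```python
import pandas as pd

# Hypothetical labels, for illustration only.
labels = pd.Series(
    ["malignant", "benign", "benign", "malignant", "benign"],
    dtype="category",
)

counts = labels.value_counts()                # absolute frequency per class
shares = labels.value_counts(normalize=True)  # relative frequency per class

print(counts)
print(shares)
```

Applied to `df_copy['target']`, the same two calls would show whether the classes are balanced, which matters later when choosing evaluation metrics.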


## 3.5 Check statistics of data set

In [43]:
df_copy.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.132812,19.296875,91.9375,inf,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,0.405029,1.216797,2.867188,40.3125,0.007041,0.025478,0.031894,0.011796,0.020542,0.003795,16.265625,25.671875,107.25,inf,0.132369,0.25415,0.272217,0.114606,0.290076,0.083946
std,3.525391,4.300781,24.296875,inf,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,0.277344,0.551758,2.021484,inf,0.003003,0.017908,0.030186,0.00617,0.008266,0.002646,4.832031,6.148438,33.59375,inf,0.022832,0.157349,0.208618,0.065732,0.061867,0.018061
min,6.980469,9.710938,43.78125,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.111511,0.360107,0.756836,6.800781,0.001713,0.002252,0.0,0.0,0.007882,0.000895,7.929688,12.023438,50.40625,185.25,0.07117,0.027283,0.0,0.0,0.1565,0.05504
25%,11.703125,16.171875,75.1875,420.25,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,0.232422,0.833984,1.606445,17.84375,0.005169,0.01308,0.01509,0.007638,0.01516,0.002248,13.007812,21.078125,84.125,515.5,0.1166,0.147217,0.114502,0.06493,0.2504,0.07146
50%,13.367188,18.84375,86.25,551.0,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,0.324219,1.108398,2.287109,24.53125,0.00638,0.02045,0.02589,0.01093,0.01873,0.003187,14.96875,25.40625,97.6875,686.5,0.1313,0.211914,0.226685,0.09993,0.2822,0.08004
75%,15.78125,21.796875,104.125,782.5,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,0.479004,1.473633,3.357422,45.1875,0.008146,0.03245,0.04205,0.01471,0.02348,0.004558,18.796875,29.71875,125.375,1084.0,0.146,0.339111,0.382812,0.1614,0.3179,0.09208
max,28.109375,39.28125,188.5,2500.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,2.873047,4.886719,21.984375,542.0,0.03113,0.1354,0.396,0.05279,0.07895,0.02984,36.03125,49.53125,251.25,4256.0,0.2226,1.057617,1.251953,0.291,0.6638,0.2075
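
The `inf` entries in the `mean` and `std` rows are a side effect of the float16 conversion, not a property of the data. float16 cannot represent values above 65504, and summing several hundred area values exceeds that limit. The sketch below reproduces the effect with NumPy on a made-up column of 569 identical values; it also shows the rounding that turns 386.1 into 386.0 in the converted frame.

```python
import numpy as np

# Hypothetical column: 569 values of a few hundred each, similar in scale to "mean area".
values = np.full(569, 655.0)

with np.errstate(over="ignore"):  # silence the overflow warning for the demo
    half_sum = values.astype(np.float16).sum()
single_sum = values.astype(np.float32).sum()

print(float(np.finfo(np.float16).max))  # largest finite float16 value
print(half_sum, single_sum)             # the float16 sum overflows, the float32 sum does not
print(float(np.float16(386.1)))         # nearest value float16 can store
```

Computing the statistics on float32 or float64 columns, for example with `df_copy.astype('float64', errors='ignore')` restricted to the numeric columns, avoids the overflow.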
