Name:- Ankit Kuamr
Roll no.:- 1323578
Trainer Name:- Lokesh Sir

## Project Title- "Breast cancer classification using machine learning algorithms (Decision Tree, Random Forest, SVM, and Naive Bayes)"

## PROBLEM STATEMENT:

Breast cancer is one of the most common types of cancer among women worldwide. Early detection and accurate diagnosis are crucial for effective treatment and improving survival rates. This project aims to develop and compare the performance of various machine learning algorithms to classify breast cancer tumors as either malignant or benign. The algorithms include Decision Tree, Random Forest, Support Vector Machine (SVM), Kernel SVM, and Naive Bayes.

The goal is to identify the most effective model for accurate classification, providing insights into the strengths and weaknesses of each algorithm. This will assist medical practitioners in making informed decisions based on the predictions of the models.


## Description:

Breast cancer is the most common cancer among women worldwide. It accounts for 25% of all cancer cases and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area.

The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). The task is to complete the analysis of classifying these tumors using machine learning (with SVMs) and the Breast Cancer Wisconsin (Diagnostic) Dataset.

## Acknowledgements:

This dataset was obtained from Kaggle.

## Objective:

Understand the dataset and clean it up (if required).

Build classification models to predict whether a tumor is malignant or benign.

Fine-tune the hyperparameters and compare the evaluation metrics of the various classification algorithms.

## Project Methodology

### Step 1: Import Dependencies

In [141]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,  confusion_matrix, accuracy_score

### Step 2: Reading and Loading Dataset

In [142]:
df=pd.read_csv('BreastCancer.csv')

### Step 3: Apply EDA

In [143]:
df.describe()

In [144]:
df.shape

(569, 32)

In [145]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

The columns listed above contain no missing values (569 non-null entries each), and no obvious bias is apparent in the dataset.
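
A statement about balance or missing data can be backed up by counting samples per class and null values. The sketch below is only an illustration: it assumes scikit-learn's bundled copy of the Breast Cancer Wisconsin (Diagnostic) data (`load_breast_cancer`) as a stand-in for `BreastCancer.csv`, which is not reproduced here. That copy encodes malignant as 0 and benign as 1, the reverse of the mapping used in Step 4, so the labels are converted back to `M`/`B` first.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: sklearn's bundled copy stands in for BreastCancer.csv.
data = load_breast_cancer()

# sklearn encodes 0 = malignant, 1 = benign; restore the M/B letters.
labels = pd.Series(data.target).map({0: "M", 1: "B"})

# Absolute and relative class frequencies.
counts = labels.value_counts()
share = labels.value_counts(normalize=True).round(3)
print(pd.DataFrame({"count": counts, "share": share}))

# Total number of missing feature values across all columns.
features = pd.DataFrame(data.data, columns=data.feature_names)
missing = features.isna().sum().sum()
print("missing feature values:", missing)
```

In the bundled copy this reports 357 benign and 212 malignant samples, so the classes are moderately imbalanced rather than evenly split, which is worth keeping in mind when reading accuracy figures.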

### Step 4: Label Encoding

In [146]:
df["diagnosis"]=df["diagnosis"].map({"M":1,"B":0})
x=df.drop(columns=["diagnosis","id"])
y=df["diagnosis"]
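
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a stray label would silently become a missing target. A minimal check, using a small hypothetical frame (the rows and the lowercase `"m"` are made up for illustration), might look like this:

```python
import pandas as pd

# Hypothetical rows; "m" is a deliberately malformed label.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "m"]})

# Same mapping as in the cell above: malignant -> 1, benign -> 0.
encoded = toy["diagnosis"].map({"M": 1, "B": 0})

# Labels that were not covered by the mapping end up as NaN.
unmapped = toy.loc[encoded.isna(), "diagnosis"].unique().tolist()
print("unmapped labels:", unmapped)
```

Running the same `isna()` check on `df["diagnosis"]` right after the mapping would confirm that every row received a 0 or 1.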

### Step 5: Splitting Data into Training and Testing Sets

In [147]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=101)
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
#x_train
y

Unnamed: 0,diagnosis
0,1
1,1
2,1
3,1
4,1
...,...
564,1
565,1
566,1
567,1
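
The split above fits the scaler on the training rows only, which avoids leaking test statistics into training. One possible variation, sketched here under assumptions, adds `stratify=y` so both subsets keep the same class ratio and wraps the scaler and a classifier in a pipeline so the two steps cannot be applied out of order. The data come from scikit-learn's bundled `load_breast_cancer` copy (an assumption, since `BreastCancer.csv` is not available here), and the linear SVM is just a placeholder estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset stands in for BreastCancer.csv.
X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y
)

# The pipeline fits the scaler on the training data only and
# reuses those statistics when transforming the test data.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(X_train, y_train)

print("class-1 share in train:", round(y_train.mean(), 3))
print("class-1 share in test: ", round(y_test.mean(), 3))
test_accuracy = pipe.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```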


### Feature Engineering

In [148]:
x_train
y_train

Unnamed: 0,diagnosis
464,0
454,0
447,0
363,0
241,0
...,...
552,0
393,1
75,1
337,1


### Step 6: Apply Various Machine Learning Algorithms

#### 1. Decision Tree

In [149]:
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier(random_state=41)

model.fit(x_train,y_train)
y_pred=model.predict(x_test)
print("classification report ",classification_report(y_test,y_pred))
print("confusion matrix ",confusion_matrix(y_test,y_pred))
print(f"accuracy {accuracy_score(y_test,y_pred)}")

classification report                precision    recall  f1-score   support

           0       0.94      0.92      0.93        88
           1       0.88      0.91      0.89        55

    accuracy                           0.92       143
   macro avg       0.91      0.91      0.91       143
weighted avg       0.92      0.92      0.92       143

confusion matrix  [[81  7]
 [ 5 50]]
accuracy 0.916083916083916
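
The figures in the report can be recovered by hand from the confusion matrix. In scikit-learn's layout the rows are the true classes and the columns the predicted classes, so with the Step 4 encoding (0 = benign, 1 = malignant) the matrix above contains 5 malignant tumors predicted as benign. A short worked calculation:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[81, 7],
               [5, 50]])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
sensitivity = tp / (tp + fn)         # recall for the malignant class
specificity = tn / (tn + fp)         # recall for the benign class
precision = tp / (tp + fp)           # share of malignant predictions that are right

print(f"accuracy    {accuracy:.3f}")
print(f"sensitivity {sensitivity:.3f}")
print(f"specificity {specificity:.3f}")
print(f"precision   {precision:.3f}")
```

For a diagnostic task the sensitivity is usually the number to watch, because a false negative means a malignant tumor goes undetected.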


#### 2. Random Forest

In [150]:
from sklearn.ensemble import RandomForestClassifier
rnd_for=RandomForestClassifier(random_state=101)

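rnd_for.fit(x_train,y_train)
# Predict with the random forest (rnd_for); model is the decision tree from the previous cell
y_pred=rnd_for.predict(x_test)
print("classification report ",classification_report(y_test,y_pred))
print("confusion matrix ",confusion_matrix(y_test,y_pred))
print(f"accuracy {accuracy_score(y_test,y_pred)}")

Note: the output below was recorded when this cell called model.predict(x_test), so it repeats the decision tree figures and does not reflect the random forest. Re-run the corrected cell to obtain the random forest metrics.

classification report                precision    recall  f1-score   support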

           0       0.94      0.92      0.93        88
           1       0.88      0.91      0.89        55

    accuracy                           0.92       143
   macro avg       0.91      0.91      0.91       143
weighted avg       0.92      0.92      0.92       143

confusion matrix  [[81  7]
 [ 5 50]]
accuracy 0.916083916083916


#### 3. SVM

In [151]:
from sklearn.svm import SVC
svm_mod=SVC(kernel="linear",random_state=101)
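svm_mod.fit(x_train,y_train)
svm_pred=svm_mod.predict(x_test)
# Evaluate svm_pred; y_pred still holds the predictions from an earlier cell
print("classification report ",classification_report(y_test,svm_pred))
print("confusion matrix ",confusion_matrix(y_test,svm_pred))
print(f"accuracy {accuracy_score(y_test,svm_pred)}")

Note: the output below was recorded when the metrics were computed on y_pred instead of svm_pred, so it repeats the decision tree figures and does not reflect the SVM. Re-run the corrected cell to obtain the SVM metrics.

classification report                precision    recall  f1-score   support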

           0       0.94      0.92      0.93        88
           1       0.88      0.91      0.89        55

    accuracy                           0.92       143
   macro avg       0.91      0.91      0.91       143
weighted avg       0.92      0.92      0.92       143

confusion matrix  [[81  7]
 [ 5 50]]
accuracy 0.916083916083916
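
The problem statement also lists Kernel SVM, and the objectives mention hyperparameter tuning, but this walkthrough has no cell for either. The following is a minimal sketch of how both could be combined, not the author's method: an RBF-kernel `SVC` tuned with `GridSearchCV`. The dataset is scikit-learn's bundled `load_breast_cancer` copy (an assumption), its labels are flipped to match the Step 4 encoding (1 = malignant), and the grid values are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset stands in for BreastCancer.csv.
X, y_raw = load_breast_cancer(return_X_y=True)
y = 1 - y_raw  # sklearn uses 0 = malignant; flip so that 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using only its own training portion.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.001],
}

# 5-fold cross-validated search, scored by F1 on the malignant class.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best CV F1:     ", round(search.best_score_, 3))
test_f1 = search.score(X_test, y_test)
print("test F1:        ", round(test_f1, 3))
```

The same pattern applies to the other classifiers by swapping the estimator and the parameter grid, for example `max_depth` for a decision tree or `n_estimators` for a random forest.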


#### 4. Naive Bayes

In [152]:
from sklearn.naive_bayes import GaussianNB

# Initialize and train the Naive Bayes model
nb_model = GaussianNB()
nb_model.fit(x_train, y_train)

# Make predictions on the test data
nb_predictions = nb_model.predict(x_test)

# Evaluate the model
print("Naive Bayes")
print("Accuracy:", accuracy_score(y_test, nb_predictions))
print("Confusion Matrix:\n", confusion_matrix(y_test, nb_predictions))
print("Classification Report:\n", classification_report(y_test, nb_predictions))

Naive Bayes
Accuracy: 0.9300699300699301
Confusion Matrix:
 [[82  6]
 [ 4 51]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.93      0.94        88
           1       0.89      0.93      0.91        55

    accuracy                           0.93       143
   macro avg       0.92      0.93      0.93       143
weighted avg       0.93      0.93      0.93       143
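
Since the stated goal is to compare the algorithms, a single 143-sample test split gives a fairly noisy ranking. One hedged alternative is to score every model with the same stratified cross-validation folds. The sketch below assumes scikit-learn's bundled `load_breast_cancer` copy in place of `BreastCancer.csv` and reuses the estimator settings from the cells above; the fold count and seed are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled dataset stands in for BreastCancer.csv.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=41),
    "Random Forest": RandomForestClassifier(random_state=101),
    "Linear SVM": SVC(kernel="linear", random_state=101),
    "Naive Bayes": GaussianNB(),
}

# Identical folds for every model make the scores comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)

results = {}
for name, estimator in models.items():
    # Scaling is refit inside each fold to avoid leakage.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:<14} {scores.mean():.3f} +/- {scores.std():.3f}")

best = max(results, key=results.get)
print("highest mean accuracy:", best)
```

Differences smaller than the reported standard deviation should not be read as one model being better than another.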

