
### Import Libraries

This cell imports the necessary libraries for data manipulation, visualization, and machine learning.

- `pandas` is used for data manipulation and analysis.
- `numpy` is used for numerical operations.
- `matplotlib.pyplot` and `seaborn` are used for creating visualizations.
- `warnings` is used to suppress warning messages. The blanket filter below hides every warning, including any raised later by scikit-learn.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') # Ignore warning messages
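
As a hedged alternative to silencing everything, the sketch below suppresses a single warning category inside a limited scope using the standard library's `warnings.catch_warnings`. The function `noisy_step` is hypothetical and only stands in for a library call that emits a warning.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Suppress only DeprecationWarning, and only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy_step()

# Record warnings instead of printing them, to show that the same call
# still warns once the narrow filter above is no longer active.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

Scoping the filter this way keeps other diagnostics, such as solver convergence messages, visible in the rest of the notebook.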

### Load and Display Data

This cell loads the dataset from a CSV file into a pandas DataFrame and displays the first 5 rows to get a glimpse of the data.

In [None]:
# Load the dataset
data = pd.read_csv('/content/data (1).csv')
# Display the first 5 rows
data.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


### Display Last Rows of Data

This cell displays the last 5 rows of the dataset to check the end of the data.

In [None]:
# Display the last 5 rows
data.tail()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
564,926424,M,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,
565,926682,M,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,
567,927241,M,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,
568,92751,B,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,


### Check Data Shape

This cell shows the number of rows and columns in the dataset.

In [None]:
# Check the shape of the DataFrame (rows, columns)
data.shape

(569, 33)

### Get Data Information

This cell provides a concise summary of the DataFrame, including the data types of each column and the number of non-null values.

In [None]:
# Get information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

### Check for Duplicate Rows

This cell counts the duplicate rows in the dataset.

In [None]:
# Check for duplicate rows
data.duplicated().sum()

np.int64(0)

### Check for Missing Values

This cell counts the missing values in each column of the dataset.

In [None]:
# Check for missing values in each column
data.isnull().sum()

Unnamed: 0,0
id,0
diagnosis,0
radius_mean,0
texture_mean,0
perimeter_mean,0
area_mean,0
smoothness_mean,0
compactness_mean,0
concavity_mean,0
concave points_mean,0
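
With 33 columns, the full per-column listing is long. A minimal sketch, using a small toy DataFrame with made-up values rather than the real dataset, shows one way to keep only the columns that actually contain missing values:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one partly and one fully empty column.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, np.nan, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

missing = toy.isnull().sum()

# Keep only the columns with at least one missing value, largest count first.
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```

Applied to `data`, the same two lines would presumably surface only the empty `Unnamed: 32` column.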


### Display Descriptive Statistics

This cell generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's numerical columns.

In [None]:
# Display descriptive statistics
data.describe()

,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,


### Display Column Names

This cell displays the names of all columns in the DataFrame.

In [None]:
# Display column names
data.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

### Drop Unnecessary Column

This cell removes the 'Unnamed: 32' column as it contains only missing values and is not useful for the analysis.

In [None]:
# Drop the 'Unnamed: 32' column as it is empty
data = data.drop(columns=['Unnamed: 32'])
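
Dropping by name works when the empty column is known in advance. As a sketch of a more general approach, `DataFrame.dropna(axis=1, how="all")` removes every column whose values are all missing. The toy frame below uses made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one entirely empty column.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop every column in which all values are missing.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Columns with only some missing values are left untouched by `how="all"`, so this does not silently discard partially filled features.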

### Display Column Names After Dropping

This cell displays the column names again to confirm that the 'Unnamed: 32' column has been removed.

In [None]:
# Display column names after dropping the column
data.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

### Separate Features and Target Variable

This cell separates the features (independent variables) from the target variable (dependent variable). 'diagnosis' is the target variable, and 'id' is also excluded as it's just an identifier.

In [None]:
# Separate features (X) and target variable (Y)
X = data.drop(columns=['diagnosis','id'])
Y = data['diagnosis']
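
Before modeling, it can help to look at the class balance of the target and, optionally, to encode the labels numerically. The sketch below uses a toy Series with made-up labels; the mapping of 'M' to 1 and 'B' to 0 is an assumed convention, not something the notebook itself uses.

```python
import pandas as pd

# Toy labels (hypothetical) in the same 'M'/'B' encoding as the diagnosis column.
Y_toy = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# Share of each class in the target.
print(Y_toy.value_counts(normalize=True))

# Optional numeric encoding: malignant -> 1, benign -> 0 (assumed convention).
Y_encoded = Y_toy.map({"M": 1, "B": 0})
print(Y_encoded.tolist())
```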

### Check the Shape of Features and Target

This cell displays the shapes of the feature DataFrame (X) and the target Series (Y) to confirm the separation was successful.

In [None]:
# Check the shape of the features and target
X.shape, Y.shape

((569, 30), (569,))

### Split Data into Training and Testing Sets

This cell splits the data into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate its performance. A `test_size` of 0.2 means 20% of the data will be used for testing. `random_state` ensures reproducibility of the split.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
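
The split above does not pass `stratify`, so the class proportions in the two sets can drift from those of the full dataset. A minimal sketch of a stratified split follows; the toy data (100 rows, 37 labelled 'M') is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 37 labelled 'M' and 63 labelled 'B'.
X_toy = pd.DataFrame({"feature": np.arange(100, dtype=float)})
Y_toy = pd.Series(["M"] * 37 + ["B"] * 63)

# stratify=Y_toy keeps the M/B ratio roughly equal in both subsets.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=2, stratify=Y_toy
)

print(Y_tr.value_counts(normalize=True))
print(Y_te.value_counts(normalize=True))
```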

### Train Logistic Regression Model

This cell initializes and trains a Logistic Regression model on the training data. Logistic Regression is a common algorithm for binary classification problems like this one (predicting malignant or benign).

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, Y_train)
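
The features span very different scales (for example, `area_mean` in the hundreds versus `smoothness_mean` near 0.1), and with default settings the solver may stop before converging on such data. One common variation, sketched here, standardizes the features inside a pipeline. It uses scikit-learn's bundled breast cancer dataset as a stand-in, on the assumption (not confirmed by the notebook) that it matches the CSV; note that the bundled target is encoded as 0/1 rather than 'M'/'B'. The names `X_demo`, `y_demo`, and `scaled_model` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled breast cancer dataset
# (569 rows, 30 features). Assumed, not confirmed, to match the CSV.
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# Standardize each feature, then fit logistic regression on the scaled data.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

print("Test accuracy:", scaled_model.score(X_te, y_te))
```

Keeping the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into training.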

### Evaluate Model on Training Data

This cell evaluates the trained model's accuracy on the training data. A training accuracy much higher than the test accuracy would suggest overfitting.

In [None]:
from sklearn.metrics import accuracy_score

# Predict on the training data
X_train_prediction = model.predict(X_train)
# Calculate accuracy on training data
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy on training data = ', training_data_accuracy)

Accuracy on training data =  0.9560439560439561


### Evaluate Model on Test Data

This cell evaluates the trained model's accuracy on the unseen test data. This gives a more realistic estimate of how the model will perform on new data.

In [None]:
# Predict on the test data
X_test_prediction = model.predict(X_test)
# Calculate accuracy on test data
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy on test data = ', test_data_accuracy)

Accuracy on test data =  0.9122807017543859
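
Accuracy alone does not show which kind of error the model makes, and for a diagnosis task a missed malignant case is usually the costlier mistake. The sketch below computes a confusion matrix and the recall for the 'M' class on a small set of made-up labels, not on the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true and predicted labels, for illustration only.
y_true = ["M", "M", "B", "B", "B", "M"]
y_pred = ["M", "B", "B", "B", "M", "M"]

# Rows are true classes, columns are predicted classes, in the order B, M.
cm = confusion_matrix(y_true, y_pred, labels=["B", "M"])
print(cm)

# Fraction of truly malignant cases that were predicted as malignant.
malignant_recall = recall_score(y_true, y_pred, pos_label="M")
print(malignant_recall)
```

In the notebook, the same calls would take `Y_test` and `X_test_prediction` as arguments.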


### Prepare Input Data for Prediction

This cell creates a tuple with sample input data for making a prediction. The data represents the 30 feature values of a new, unseen case, listed in the same order as the columns of `X`.

In [None]:
# Sample input data for prediction
input_data = (13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,
              0.1885,0.05766,0.2699,0.7886,2.058,23.56,0.008462,0.0146,
              0.02387,0.01315,0.0198,0.0023,15.11,19.26,99.7,711.2,0.144,
              0.1773,0.239,0.1288,0.2977,0.07259)

### Convert Input Data to NumPy Array

This cell converts the input data tuple into a NumPy array so that it can be reshaped into the 2D layout the model expects.

In [None]:
# Convert the input data tuple to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)
input_data_as_numpy_array

array([1.354e+01, 1.436e+01, 8.746e+01, 5.663e+02, 9.779e-02, 8.129e-02,
       6.664e-02, 4.781e-02, 1.885e-01, 5.766e-02, 2.699e-01, 7.886e-01,
       2.058e+00, 2.356e+01, 8.462e-03, 1.460e-02, 2.387e-02, 1.315e-02,
       1.980e-02, 2.300e-03, 1.511e+01, 1.926e+01, 9.970e+01, 7.112e+02,
       1.440e-01, 1.773e-01, 2.390e-01, 1.288e-01, 2.977e-01, 7.259e-02])

### Reshape Input Data

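This cell reshapes the 1D array of 30 values into a 2D array of shape (1, 30). In `reshape(1, -1)`, the `-1` tells NumPy to infer that dimension from the number of elements. The model expects a 2D array of shape (n_samples, n_features), even for a single instance.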

In [None]:
# Reshape the NumPy array as we are predicting for one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)
input_data_reshaped

array([[1.354e+01, 1.436e+01, 8.746e+01, 5.663e+02, 9.779e-02, 8.129e-02,
        6.664e-02, 4.781e-02, 1.885e-01, 5.766e-02, 2.699e-01, 7.886e-01,
        2.058e+00, 2.356e+01, 8.462e-03, 1.460e-02, 2.387e-02, 1.315e-02,
        1.980e-02, 2.300e-03, 1.511e+01, 1.926e+01, 9.970e+01, 7.112e+02,
        1.440e-01, 1.773e-01, 2.390e-01, 1.288e-01, 2.977e-01, 7.259e-02]])
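
The effect of `reshape(1, -1)` is easiest to see by comparing shapes before and after. The array below holds placeholder values, one per feature.

```python
import numpy as np

# Placeholder values standing in for one sample with 30 features.
sample = np.arange(30, dtype=float)
print(sample.shape)  # (30,)

# One row, with the column count inferred from the 30 elements.
row = sample.reshape(1, -1)
print(row.shape)  # (1, 30)
```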

### Make Prediction on New Data

This cell uses the trained model to make a prediction on the reshaped input data. The output will be either 'M' (Malignant) or 'B' (Benign).

In [None]:
# Make a prediction using the trained model
prediction = model.predict(input_data_reshaped)
prediction

array(['B'], dtype=object)
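
A hard label hides how confident the model is. As a sketch, `predict_proba` returns one probability per class, in the order given by `classes_`. The example again uses scikit-learn's bundled breast cancer dataset as a stand-in (assumed, not confirmed, to match the CSV), where the target is encoded as 0 for malignant and 1 for benign. Passing the sample as a single-row DataFrame also keeps the feature names the model was trained with. The names `demo_model` and `sample` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; in this bundled dataset 0 = malignant and 1 = benign.
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)

demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_model.fit(X_demo, y_demo)

# One sample as a single-row DataFrame that keeps the training column names.
sample = X_demo.iloc[[0]]

# One probability per class, ordered as in demo_model.classes_.
proba = demo_model.predict_proba(sample)
for label, p in zip(demo_model.classes_, proba[0]):
    print(label, round(float(p), 3))
```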

### Display Prediction Result

This cell checks the prediction result and prints a readable message stating whether the case is predicted to be malignant or benign.

In [None]:
# Check the prediction result and print the corresponding message
if prediction[0] == 'M':
    print('The Breast Cancer is Malignant')
else:
    print('The Breast Cancer is Benign')

The Breast Cancer is Benign
