# BREAST CANCER PREDICTION MODEL

In [None]:
# importing the necessary packages 
import numpy as np
import pandas as pd

We import NumPy and pandas because the rest of the notebook depends on them.
Next, we load the dataset from the CSV file and print its first five rows to check that it was read correctly.

In [None]:
file_path = "home/ankit/Breast_Cancer_Prediction_Model/Breast-Cancer-Prediction-Model-using-Machine-Learning/breast cancer.csv"
breast = pd.read_csv(file_path)
breast.head()

A short guide to the table above:

_id = a unique identifier for each patient_

**diagnosis = the target label that the model will learn to predict**

Apart from diagnosis, all the other columns are used as input features.

In [None]:
breast['diagnosis'].value_counts()

The output above shows how many samples fall into each diagnosis class.

'B' means the tumour is benign and 'M' means it is malignant.

So there are only two possible outputs, which makes this a binary classification problem.

In [None]:
breast.shape

In [None]:
breast.isnull().sum()

In [None]:
# print the duplicate count explicitly; a cell only displays its last expression
print(breast.duplicated().sum())
breast.info()

The duplicate count shows that there are no duplicated rows in the dataset.

The commands above (shape, isnull, duplicated and info) give an overview of the dataset: its size, missing values, duplicate rows and column types.

In [None]:
# drop the unnecessary 'Unnamed: 32' column, since it would only add noise to the data
breast.drop(columns=['Unnamed: 32'], inplace=True)

In [None]:
breast.describe()

The describe function gives summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column of the dataset.

## Encoding the Target Variable

The output variable holds the labels B and M. These need to be converted to the binary values 0 and 1 so that the model can work with them.

sklearn's LabelEncoder can do this encoding, but here we map the values manually instead.

In [None]:
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# without using sklearn we can also manually map the output data 

breast['diagnosis'] = breast['diagnosis'].map({'M' : 1 , 'B' : 0})

In [None]:
breast
breast['diagnosis'].value_counts()
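
As a quick check of the encoding step, the sketch below compares the manual mapping with sklearn's LabelEncoder on a few made-up labels. The `labels` series is an illustrative stand-in, not the real diagnosis column. LabelEncoder sorts the class names, so it should give the same 0/1 codes as the manual map here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the 'diagnosis' column.
labels = pd.Series(["M", "B", "B", "M", "B"])

# Manual mapping, as used in this notebook.
manual = labels.map({"M": 1, "B": 0})

# LabelEncoder sorts the class names, so 'B' becomes 0 and 'M' becomes 1.
le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))
print(manual.tolist())
print(encoded.tolist())
```

The manual map has the advantage that the meaning of 0 and 1 is written out explicitly in the code.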

## Splitting Data into Training Sets and Testing sets 

Here we separate the data into inputs and output. The 31 columns other than diagnosis are the inputs, and diagnosis is the output.

In [None]:
# every column apart from diagnosis is an input, so it goes into X
X = breast.drop(['diagnosis'], axis=1)
# the diagnosis column is the output
Y = breast['diagnosis']

Now we need to split the data into training and testing sets. This is done with train_test_split from sklearn.

In [None]:
from sklearn.model_selection import train_test_split
# used for data splitting 

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [None]:
X_train.shape

In [None]:
X_test.shape

With test_size=0.2, 20% of the data goes to the test set and the remaining 80% goes to the training set. Setting random_state=42 makes the split reproducible.
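
The effect of test_size can be checked on a small made-up array. This is only a sketch: `X_demo` and `y_demo` are illustrative names, not part of the notebook. It also passes the optional `stratify` argument, which keeps the class ratio the same in both sets and may be worth considering when the two classes are unevenly sized.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, and an evenly balanced 0/1 label.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 80 rows end up in the training set and 20 rows in the test set.
print(X_tr.shape, X_te.shape)
# With stratify, the test set keeps the 50/50 class balance.
print(int(y_te.sum()), len(y_te))
```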

## Feature Scaling 

The feature values vary widely in scale. To standardize them, we apply feature scaling.

In [None]:
from sklearn.preprocessing import StandardScaler
# this package is used for feature scaling
sc = StandardScaler()

In [None]:
# fit the scaler on the training data only, then apply it to both sets
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

We use StandardScaler from sklearn: it is fitted on the training set only and then used to transform both the training and the testing inputs.
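
To make the scaling step concrete, here is a minimal sketch on made-up numbers (the `train` and `test` arrays are illustrative, not taken from the dataset). It shows that StandardScaler subtracts each column's training mean and divides by its training standard deviation, and that the test rows are scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(train)                      # learns the mean and std of the training columns
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and std

# The same result computed by hand with NumPy.
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0))
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set from leaking into the model.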

## Training the Model 

We will use logistic regression: the model is fitted on the training set and then used to predict the test set.

In [None]:
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()

# imported the package and created an object to use it in the code 

In [None]:
lg.fit(X_train,Y_train)
y_pred = lg.predict(X_test)

After prediction, we need to check how accurately the model predicts.

In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
accuracy_score(Y_test, y_pred)
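
Accuracy is the fraction of predictions that match the true labels. The sketch below works this out on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not results from this model). It also prints a confusion matrix, which is an addition not used elsewhere in this notebook; it separates false negatives (missed malignant cases) from false positives, something accuracy alone does not show.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = malignant, 0 = benign.
y_true_demo = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred_demo = [0, 0, 1, 1, 0, 1, 0, 1]

# 6 of the 8 predictions match the true labels.
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

print(acc)
print(cm)
print("false negatives:", fn, "false positives:", fp)
```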

## Creation of the Prediction System

In [None]:
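# fill in the feature values here, in the same column order as X and already
# scaled the same way as the training data (for example, a row of X_train)
input_user = ()
np_df = np.asarray(input_user)
prediction = lg.predict(np_df.reshape(1, -1))

if prediction[0] == 1:
    print("Malignant tumour. Chances of having breast cancer. Take care of yourself")
else:
    print("Benign tumour. No chances of cancer. Please enjoy your life!!! ")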

Before predicting, we use NumPy to convert the input tuple into an array and reshape it into a single row of shape (1, number of features). This is needed because predict expects a 2D array with one row per sample.
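
The reshape step can be seen on its own with a short made-up tuple (`input_demo` is illustrative and has only four values, far fewer than the real feature count):

```python
import numpy as np

# Made-up input with 4 values; the real input needs one value per feature.
input_demo = (0.5, -1.2, 3.0, 0.0)

arr = np.asarray(input_demo)   # 1D array of shape (4,)
row = arr.reshape(1, -1)       # 2D array with one row; -1 lets NumPy infer the column count

print(arr.shape)
print(row.shape)
```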

## Rough Input if required 

A row of X_train can be used as a sample input when testing the prediction system, since it is already scaled. The row below can be copied into input_user.

In [None]:
X_train[14]

## Integration of this model to Website 

This is done with the pickle library, which needs to be imported. We open a pickle file in write-binary ('wb') mode and dump the trained model into it, so that the website can load it later.

In [None]:
import pickle
# save the trained model in write-binary mode; 'with' closes the file afterwards
with open('breast_cancer_model.pkl', 'wb') as f:
    pickle.dump(lg, f)
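
On the website side, the saved file has to be loaded back before it can predict. The sketch below shows one possible round trip on made-up toy data, under two assumptions that are not part of this notebook: the file name `demo_model.pkl` is hypothetical, and the scaler is saved together with the model. Saving the scaler may be worth doing because the model was trained on scaled inputs, so new raw inputs need the same scaling.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up, clearly separable toy data (not the breast cancer dataset).
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression().fit(scaler_demo.transform(X_demo), y_demo)

# Save the scaler and the model together in one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump({"scaler": scaler_demo, "model": model_demo}, f)

# Load them back in read-binary mode, as the website would.
with open(path, "rb") as f:
    bundle = pickle.load(f)

# Scale a new raw input with the saved scaler, then predict.
new_row = np.array([[11.5]])
pred = bundle["model"].predict(bundle["scaler"].transform(new_row))
print(pred)
```

Pickle files can run arbitrary code when loaded, so only load files that come from a trusted source.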