# AI FOR SOCIAL GOOD: WOMEN CODERS' BOOTCAMP
## Breast cancer prediction 

> [Merishna Singh Suwal](https://www.linkedin.com/in/merishna-ss/) and [Pragyan Subedi](https://www.linkedin.com/in/pragyanbo/)

Basic notebook commands:
- Shift+Enter: Execute a cell
- a: Add new cell above
- b: Add new cell below
- x: Cut the cell
- c: Copy the cell
- m: Markdown
- y: Code
- z: Undo

**Fork this Notebook** to get started.

Steps in any Machine Learning classification problem:
- Exploring the dataset
- Preprocessing the dataset and feature selection
- Splitting the dataset into training and testing set
- Building the model
- Evaluating the model

### Importing the necessary libraries

Data used:

https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset 

In [None]:
import pandas as pd 
import seaborn as sns # for data visualization
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline

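# If running on Google Colaboratory, upload the CSV from your machine.
# The import is guarded so that this cell also runs outside Colab.
try:
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    uploaded = None  # not on Colab; the CSV is read from disk below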

#### Reading the dataset using pandas

In [None]:
df = pd.read_csv("../input/Breast_cancer_data.csv", delimiter=",")

In [None]:
df.head() #gives first 5 entries of a dataframe by default

#### Checking the columns

In [None]:
df.columns

### Data dictionary
- diagnosis: The diagnosis of breast tissues (1 = malignant, 0 = benign)
- mean_radius: mean of distances from center to points on the perimeter
- mean_texture: standard deviation of gray-scale values
- mean_perimeter: mean size of the core tumor
- mean_area
- mean_smoothness: mean of local variation in radius lengths


---
**Always make a habit of checking for null values in a dataset**

In [None]:
df.isnull().sum()

Most datasets that we work with will not be as clean as this one. **Data cleaning** is an important part of any Data Science problem. Go through [this](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values) exercise to learn how to handle missing values in a dataset.
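As a rough illustration of what handling missing values can look like, here is a minimal sketch on a made-up toy frame. The column names mirror this dataset, but the values and the gaps are hypothetical (the real data has no nulls), and dropping rows or filling with the median are only two of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are invented purely for illustration.
toy = pd.DataFrame({
    "mean_radius": [14.2, np.nan, 20.1, 11.8],
    "mean_texture": [19.5, 21.0, np.nan, 17.3],
    "diagnosis": [0, 0, 1, 0],
})

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the median of its column.
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of introducing estimated values.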

**Now, we will look at the distribution of classes (malignant and benign) in our dataset.**

In [None]:
count = df.diagnosis.value_counts()
count

**The distribution can also be visualized with the pandas `plot` method, which draws the chart using matplotlib.**

In [None]:
count.plot(kind='bar')
plt.title("Distribution of malignant (1) and benign (0) tumors")
plt.xlabel("Diagnosis")
plt.ylabel("Count");

---
### Target variable/ class
The goal of our predictor is to decide, on the basis of the available data, whether a breast tumor is
- Malignant (1), i.e. harmful, or
- Benign (0), i.e. not harmful.

Hence, our target class is the `diagnosis` column.

In [None]:
y_target = df['diagnosis']

### Feature Selection

Now, among all the available features, we need to select the best set in order to train our predictor. A typical dataset might have 30, 100, or even more features. In such cases, feature selection plays an important role in the accuracy of the prediction.

Let's see what features are available on our dataset.

In [None]:
df.columns.values

**Let us now draw a pairplot of the features to see which ones best separate the two classes of our problem.**

In [None]:
df['target'] = df['diagnosis'].map({0:'B',1:'M'}) # map numeric labels to 'B' (benign) / 'M' (malignant) for plotting

In [None]:
g = sns.pairplot(df.drop('diagnosis', axis = 1), hue="target", palette='prism');

**The features mean_perimeter and mean_texture seem to be most relevant**

In [None]:
sns.scatterplot(x='mean_perimeter', y = 'mean_texture', data = df, hue = 'target', palette='prism');

In [None]:
features = ['mean_perimeter', 'mean_texture']

In [None]:
X_feature = df[features]

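**Taking all features** (optional: uncomment the line below; note that the decision-boundary plots later in this notebook assume exactly two features)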

In [None]:
# X_feature = df.drop(['target','diagnosis'], axis = 1)

### Splitting the data into training and test set
We hold out part of the data to assess the predictive performance of the models and to judge how they perform on new data they have not seen, known as the test set. The classifier is first trained on the training set (here 70% of the data) and then evaluated on the test set (the remaining 30%, which the classifier has not seen), on which the accuracy is computed. This is a single hold-out split; k-fold cross-validation extends the same idea by repeating it over several different splits.

<img src = "https://mapr.com/blog/churn-prediction-sparkml/assets/Picture14.png">

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_feature, y_target, test_size=0.3, random_state=42)
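For comparison, here is a hedged sketch of a stratified hold-out split next to 5-fold cross-validation. It assumes scikit-learn's built-in `load_breast_cancer` data as a stand-in for the Kaggle CSV, so the column names contain spaces and the label coding is reversed (0 = malignant, 1 = benign). The `stratify` argument, the `max_iter` value, and the choice of 5 folds are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data (assumption): scikit-learn's copy of the Wisconsin dataset.
# Here the target is coded 0 = malignant, 1 = benign.
data = load_breast_cancer(as_frame=True)
X = data.data[["mean perimeter", "mean texture"]]
y = data.target

# Hold-out split: stratify=y keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print("Train rows:", len(X_tr), "Test rows:", len(X_te))

# 5-fold cross-validation: every row is used for testing exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

A single split is quick and easy to reason about; averaging over folds gives an estimate that depends less on which rows happened to land in the test set.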

---
### Binary classification using Logistic Regression

Logistic Regression is mostly used for binary classification, where the dependent variable (target) is dichotomous, i.e. it takes one of two values (yes or no).
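To give a feel for how a logistic regression model turns features into a yes/no answer, here is a minimal sketch in plain Python. The weights and bias are hypothetical numbers chosen only for illustration, not values fitted to this dataset. The model computes a weighted sum of the features, squashes it into a probability with the sigmoid function, and thresholds that probability at 0.5.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for (mean_perimeter, mean_texture); not fitted values.
weights = [0.05, 0.10]
bias = -7.0

def predict_proba(features):
    """Estimated probability that the sample belongs to class 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, threshold=0.5):
    """Return class 1 if the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(features) >= threshold else 0

print(predict_proba([130.0, 25.0]), predict([130.0, 25.0]))
print(predict_proba([70.0, 15.0]), predict([70.0, 15.0]))
```

Training a real model amounts to finding the weights and bias that make these probabilities match the labels in the training set as closely as possible.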

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
model = LogisticRegression()

The model is fit on the training set.

In [None]:
model.fit(X_train, y_train)

**Plotting decision boundaries for 2 features**

In [None]:
# !pip install mlxtend  # uncomment and run first if mlxtend is not installed

In [None]:
from mlxtend.plotting import plot_decision_regions

In [None]:
plot_decision_regions(X_train.values, y_train.values, clf=model, legend=2)
plt.title("Decision boundary for Logistic Regression (Train)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture");

**Predictions are made on the test set**

In [None]:
y_pred = model.predict(X_test)

**Accuracy**

The predicted values and the actual test values are compared to compute the accuracy.

In [None]:
acc = accuracy_score(y_test, y_pred)
print("Accuracy score using Logistic Regression (%):", acc*100)
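The comparison behind the accuracy score can be written out by hand. The sketch below uses two short, made-up label lists (hypothetical values, not results from this notebook) and counts how often the prediction matches the true label.

```python
# Hypothetical labels, for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_hat = [1, 0, 1, 1, 0, 0, 0, 0]

# Count the positions where prediction and truth agree.
correct = sum(t == p for t, p in zip(y_true, y_hat))

# Accuracy is the fraction of correct predictions.
accuracy = correct / len(y_true)
print("Correct:", correct, "of", len(y_true))
print("Accuracy:", accuracy)  # 6 of 8 match, so 0.75
```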

In [None]:
plot_decision_regions(X_test.values, y_test.values, clf=model, legend=2)
plt.title("Decision boundary for Logistic Regression (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture");

### Confusion matrix

<img src = "https://www.dataschool.io/content/images/2015/01/confusion_matrix2.png">

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
conf_mat = confusion_matrix(y_test, y_pred)

In [None]:
conf_mat
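To show how the four cells of a confusion matrix are counted, here is a small sketch with made-up labels (hypothetical values, not results from this notebook). The rows-by-columns layout follows scikit-learn's convention for binary labels: `[[TN, FP], [FN, TP]]`. Precision and recall are included as two common metrics derived from those counts.

```python
# Hypothetical labels, for illustration only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_hat = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # malignant, predicted malignant
tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign, predicted benign
fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign, predicted malignant
fn = sum(t == 1 and p == 0 for t, p in pairs)  # malignant, predicted benign

# Same layout as sklearn.metrics.confusion_matrix for labels [0, 1].
matrix = [[tn, fp], [fn, tp]]
print(matrix)  # [[4, 2], [1, 3]]

precision = tp / (tp + fp)  # of the predicted malignant cases, how many were right
recall = tp / (tp + fn)     # of the truly malignant cases, how many were found
print("Precision:", precision, "Recall:", recall)  # 0.6 and 0.75
```

In a medical setting a false negative (a malignant tumor predicted as benign) is usually the costlier mistake, which is why recall is worth checking alongside accuracy.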

---
### Binary classification using K Nearest Neighbours

The KNN algorithm is based on feature similarity: how closely an out-of-sample point resembles the points in our training set determines how we classify it.

<img src = "https://cdn-images-1.medium.com/max/800/0*Sk18h9op6uK9EpT8.">
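The idea can be sketched in a few lines of plain Python: measure the distance from a new point to every training point, keep the `k` closest, and take a majority vote over their labels. The training points below are hypothetical (mean_perimeter, mean_texture) pairs made up for illustration, and `knn_predict` is a helper written for this sketch, not a scikit-learn function.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Labels of the k closest points.
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: (mean_perimeter, mean_texture) and class labels.
train_points = [(60.0, 14.0), (65.0, 16.0), (70.0, 15.0),
                (120.0, 24.0), (130.0, 26.0), (125.0, 22.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (68.0, 15.5), k=3))   # 0
print(knn_predict(train_points, train_labels, (122.0, 23.0), k=3))  # 1
```

`KNeighborsClassifier` uses 5 neighbours by default; since the vote depends on raw distances, features with larger numeric ranges have more influence on the result.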

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier()

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
acc = accuracy_score(y_test, y_pred)
print("Accuracy score using KNN (%):", acc*100)

In [None]:
confusion_matrix(y_test, y_pred)

**Plotting the decision boundaries**

In [None]:
plot_decision_regions(X_train.values, y_train.values, clf=clf, legend=2)
plt.title("Decision boundary using KNN (Train)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture");

In [None]:
plot_decision_regions(X_test.values, y_test.values, clf=clf, legend=2)
plt.title("Decision boundary using KNN (Test)")
plt.xlabel("mean_perimeter")
plt.ylabel("mean_texture");

### THANK YOU