

---


# Using the Kaggle API to download the dataset

  What is Kaggle?
  
  ***Kaggle is an online community of data scientists and machine learners, owned by Google LLC. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.***
  
*   First, you need to create an account on Kaggle
*   From your account details, create an API token, which downloads a JSON file (`kaggle.json`)
*   Then you can use the username and key from that file to run the commands provided by Kaggle





In [0]:
!mkdir -p /root/.kaggle/
!touch /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!echo '{"username":"maibot","key":"6d074d5b96d5e062ee89f0330b8a4cc2"}' > /root/.kaggle/kaggle.json
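
A token typed directly into a cell is easy to leak when the notebook is shared. As a minimal sketch of the same setup in Python, the credentials could be passed in from outside the notebook instead. The function name `write_kaggle_token` and the environment variable names in the usage comment are choices made for this example, not something taken from the cell above.

```python
import json
import os
from pathlib import Path


def write_kaggle_token(username, key, config_dir):
    """Write kaggle.json into config_dir so that only the owner can read it."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    token_path = config_dir / "kaggle.json"
    token_path.touch(exist_ok=True)                # like `touch`
    os.chmod(token_path, 0o600)                    # like `chmod 600`
    token_path.write_text(json.dumps({"username": username, "key": key}))
    return token_path


# Hypothetical usage, assuming both environment variables are set:
# write_kaggle_token(os.environ["KAGGLE_USERNAME"],
#                    os.environ["KAGGLE_KEY"],
#                    Path.home() / ".kaggle")
```

The steps follow the same order as the shell cell: create the file, restrict its permissions, then write the credentials into it.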

---
# Retrieving the dataset from Kaggle



In [0]:
!kaggle datasets download -d uciml/breast-cancer-wisconsin-data
!unzip breast-cancer-wisconsin-data.zip

Downloading breast-cancer-wisconsin-data.zip to /content
  0% 0.00/48.0k [00:00<?, ?B/s]
100% 48.0k/48.0k [00:00<00:00, 41.8MB/s]
Archive:  breast-cancer-wisconsin-data.zip
  inflating: data.csv                


---

# Loading the dataset

This dataset provides features of tumors in breast tissue, collected from over 500 patients.
A dataset is an essential component of any deep learning project; without one, the model*
has nothing to train on.

> **Model**: The term model refers to the model artifact that is created by the training process.

In [0]:
import pandas as pd

data = pd.read_csv("data.csv")
data.drop(["Unnamed: 32"], axis=1, inplace=True)  # drop a corrupted column in the dataset that we don't need
data.drop(["id"], axis=1, inplace=True)  # also drop the id column, which has nothing to do with breast cancer
data.head()
#data.count(axis = 'columns')

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Here, we are splitting the data into ***features*** and ***diagnosis***.

This is a **crucial step** to be able to train the model. *Why?*


> The model needs some features about a breast cancer case and needs to associate those features with the diagnosis so it can find a correlation.

In the code, we replace every ***malignant*** diagnosis (`M`) with *1*
and every ***benign*** diagnosis (`B`) with *0*

> *(i.e. 1 for yes, 0 for no).*







In [0]:

data["diagnosis"] = [1 if var == "M" else 0 for var in data["diagnosis"]]
y = data["diagnosis"].values
data.drop(["diagnosis"],axis=1,inplace=True)
x = data
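
The list comprehension above maps every value other than `"M"` to 0, so a typo or a missing value would silently count as benign. A stricter variant could look like the sketch below. The names `LABELS` and `encode_diagnosis` are made up for this example, and it works on a plain Python list rather than on the DataFrame column.

```python
# Explicit mapping from the diagnosis codes to numeric labels
LABELS = {"M": 1, "B": 0}


def encode_diagnosis(values):
    """Map 'M' to 1 and 'B' to 0, and fail loudly on anything else."""
    unknown = sorted({str(v) for v in values if v not in LABELS})
    if unknown:
        raise ValueError(f"unexpected diagnosis labels: {unknown}")
    return [LABELS[v] for v in values]


print(encode_diagnosis(["M", "B", "B", "M"]))  # [1, 0, 0, 1]
```

With this version an unexpected label raises an error instead of ending up in the benign class.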

In the part ***TRAIN TEST SPLIT***

We are splitting the dataset into 2 parts: **train** and **test**,
with 80% of the cases used as train data and 20% as test data.

(*E.g. if we had a dataset consisting of 10,000 cases, the train data would consist of 8,000 cases and the test data of 2,000.*)


> The reason for this is that our model needs one dataset to train on and a separate dataset to test how much it learned from training.
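
To make the 80/20 idea concrete, here is a minimal pure-Python sketch of such a split. It is not how scikit-learn implements `train_test_split`; the function name `split_80_20` and its parameters are made up for this illustration.

```python
import random


def split_80_20(cases, test_size=0.2, seed=0):
    """Shuffle the cases and cut them into a train part and a test part."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split on every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_80_20(range(10_000))
print(len(train), len(test))  # 8000 2000
```

Shuffling before cutting matters: without it, the test part would simply be the first rows of the file, which may not be representative of the whole dataset.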

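In the part ***STANDARD SCALING***

We are converting all of our features onto the same scale.
Each feature is shifted to a mean of 0 and rescaled to a standard deviation of 1, so most values ***roughly*** fall between **-1** and **1**.

*(E.g. if a feature has a mean of 50 and a standard deviation of 6 in the train data, a value of 56 becomes (56 - 50) / 6 = 1.0. It means the same thing, but on a smaller scale.)*

The scaler is fitted on the train data only and then applied to the test data, so no information from the test data leaks into training.

> The simple reason for standard scaling is that our features come on very different scales (area_mean is in the hundreds or thousands, smoothness_mean is below 1). Training usually converges faster and more reliably when all inputs are on a comparable scale, and no feature dominates just because its numbers are bigger.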
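
Standard scaling computes `z = (x - mean) / std` for each feature. Below is a minimal pure-Python sketch for a single feature, assuming the population standard deviation (which is what `StandardScaler` uses). The function name `standard_scale` and the sample values are made up for illustration.

```python
from statistics import mean, pstdev


def standard_scale(train_values, values):
    """Scale `values` using the mean and standard deviation of `train_values`."""
    mu = mean(train_values)
    sigma = pstdev(train_values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


train_feature = [44.0, 56.0]  # mean 50, standard deviation 6
scaled = standard_scale(train_feature, [56.0, 50.0])
print(scaled)  # [1.0, 0.0]
```

The statistics always come from the train values, even when the values being scaled belong to the test data, which mirrors calling `fit_transform` on the train data and `transform` on the test data.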


In [0]:
#TRAIN TEST SPLIT
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.2,random_state = 0)

#STANDARD SCALING
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

---
# Training the model

In [0]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
#%% NORMAL TRAIN
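model = Sequential()
# `kernel_initializer` is the Keras 2+ name of the old `init` argument
model.add(Dense(units=16, kernel_initializer='random_uniform', activation='relu', input_dim=30))  # 30 input features
model.add(Dense(units=32, kernel_initializer='random_uniform', activation='relu'))
model.add(Dense(units=64, kernel_initializer='random_uniform', activation='relu'))
model.add(Dropout(0.25))  # randomly drop 25% of the units during training
model.add(Dense(units=1, kernel_initializer='random_uniform', activation='sigmoid'))  # output: probability of a malignant diagnosis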
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

In [0]:
results = model.fit(x_train, y_train, batch_size = 16, epochs = 50,validation_data = (x_test,y_test))
loss = results.history['loss']
val_loss = results.history['val_loss']
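# older Keras versions store accuracy under 'acc', newer ones under 'accuracy'
acc = results.history.get('accuracy', results.history.get('acc'))
val_acc = results.history.get('val_accuracy', results.history.get('val_acc'))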
epochs = range(1,len(loss) + 1)

In [0]:
import matplotlib.pyplot as plt
plt.subplot(211)
plt.plot(epochs,loss,'r--')
plt.plot(epochs,val_loss,'b-')
plt.title("Loss Graph")
plt.legend(["Training loss","Test loss"])
plt.xlabel('Epoch')
plt.xticks(range(0, len(loss) + 1, 10))  # a tick every 10 epochs instead of a single tick at 1
plt.ylabel('Loss')


plt.subplot(212)
plt.plot(epochs,acc,'r--')
plt.plot(epochs,val_acc,'g-')
plt.title("Accuracy Graph")
plt.legend(["Training accuracy","Test accuracy"])
plt.xlabel('Epoch')
plt.xticks(range(0, len(acc) + 1, 10))  # a tick every 10 epochs instead of a single tick at 1
plt.ylabel('Accuracy')

plt.show()
