### First let's import the required libraries

In [10]:
import pandas as pd # To handle the data easily using the DataFrame structure
import numpy as np # To work with the data as arrays
from sklearn.model_selection import train_test_split # For splitting the data into train and test subsets
from sklearn import svm # To model the dataset with a Support Vector Machine
from sklearn.metrics import f1_score # To evaluate the results with the F1 score

### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### In this case study, we're going to predict whether a cell sample is benign or malignant.

### The dataset contains records for over 200 cell samples. Each row describes one sample with the following fields:

#### Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size,
#### Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class

### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Let's download our dataset, which originates from the Machine Learning Repository of the University of California, Irvine.

In [4]:
!wget -O Hücreler.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv

--2020-05-08 05:53:42--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20675 (20K) [text/csv]
Saving to: ‘Hücreler.csv’


2020-05-08 05:53:43 (236 KB/s) - ‘Hücreler.csv’ saved [20675/20675]



In [5]:
df = pd.read_csv("Hücreler.csv")
df.head()

        ID  Clump  UnifSize  UnifShape  MargAdh  SingEpiSize  BareNuc  BlandChrom  NormNucl  Mit  Class
0  1000025      5         1          1        1            2        1           3         1    1      2
1  1002945      5         4          4        5            7       10           3         2    1      2
2  1015425      3         1          1        1            2        2           3         1    1      2
3  1016277      6         8          8        1            3        4           3         7    1      2
4  1017023      4         1          1        3            2        1           3         1    1      2


### Data Preprocessing

#### Support Vector Machines require numerical input, so we first check the data type of each column and convert any non-numerical column to numbers.

In [6]:
df.dtypes

ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object

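#### The "BareNuc" (Bare Nuclei) column has dtype object, which means it contains some non-numerical entries. We drop the rows with those entries and cast the column to integers.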

In [7]:
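# Keep only the rows whose 'BareNuc' value parses as a number;
# .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()].copy()
df['BareNuc'] = df['BareNuc'].astype('int')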
df.dtypes

ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int64
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object
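
To make the filtering step above easier to follow, here is a minimal sketch on a toy column. The toy data and the `'?'` placeholder are assumptions for illustration only; the notebook does not show which entries of `BareNuc` are non-numerical.

```python
import pandas as pd

# Hypothetical toy data mimicking 'BareNuc': numbers stored as strings,
# plus one assumed '?' placeholder standing in for a non-numerical entry
toy = pd.DataFrame({"BareNuc": ["1", "10", "?", "4"], "Class": [2, 4, 2, 4]})

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
parsed = pd.to_numeric(toy["BareNuc"], errors="coerce")
mask = parsed.notnull()

# Keep only the parseable rows, then cast the column to integers
clean = toy[mask].copy()
clean["BareNuc"] = clean["BareNuc"].astype("int")

print(parsed.tolist())
print(len(toy) - len(clean), "row(s) dropped")
```

Printing the number of dropped rows is a cheap way to confirm that the filter removes only a small part of the data.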

In [8]:
x = np.asarray(df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']])
y = np.asarray(df['Class'])

### Splitting our dataset into Train / Test 

In [9]:
# Split the dataset into two subsets: 80% for training and 20% for testing.
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=4)
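
As a quick sanity check on the split, the sketch below runs `train_test_split` on synthetic stand-in data and inspects the resulting shapes. The arrays `x_demo` and `y_demo` are hypothetical, and the `stratify` argument is an optional addition that the notebook itself does not use; it keeps the class proportions the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with 9 features scored 1 to 10,
# and labels 2 / 4 as in the 'Class' column
rng = np.random.default_rng(0)
x_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([2] * 65 + [4] * 35)

# Same 80/20 split as above; stratify=y_demo is an assumption added here
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo
)

print("Train:", x_tr.shape, y_tr.shape)
print("Test: ", x_te.shape, y_te.shape)
print("Share of class 4 in test:", (y_te == 4).mean())
```

The same two shape checks can be run on `xTrain` and `xTest` to confirm that the real split has the expected sizes.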

### Modeling

In [38]:
model = svm.SVC(kernel='rbf')
model.fit(xTrain, yTrain)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [39]:
yHat = model.predict(xTest)
print(yHat[0:10])
print(yTest[0:10])

[2 4 2 4 2 2 2 2 4 2]
[2 4 2 4 2 2 2 2 4 2]


### Model Evaluation

In [40]:
f1_score(yTest, yHat, average='weighted') 

0.9639038982104676

#### --- An F1 score close to 1 indicates that the model predicts the class of unseen cell samples quite accurately on this test set. ---
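
To show what `average='weighted'` computes, here is a small sketch that rebuilds the weighted F1 score from the per-class scores. The labels `y_true` and `y_pred` are hypothetical and are not taken from the notebook's test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels for illustration (2 and 4 as in the 'Class' column)
y_true = np.array([2, 2, 2, 2, 2, 2, 4, 4, 4, 4])
y_pred = np.array([2, 2, 2, 2, 2, 4, 4, 4, 4, 2])

# Per-class F1 scores, in the order given by `labels`
per_class = f1_score(y_true, y_pred, average=None, labels=[2, 4])

# Support = number of true samples in each class
support = np.array([(y_true == 2).sum(), (y_true == 4).sum()])

# Weighted F1 = support-weighted mean of the per-class F1 scores
manual = (per_class * support).sum() / support.sum()

print(confusion_matrix(y_true, y_pred, labels=[2, 4]))
print(manual, f1_score(y_true, y_pred, average="weighted"))
```

Looking at the confusion matrix next to the score shows which class the errors fall in, which a single number cannot tell you.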