<a href="https://colab.research.google.com/github/AtharvaTambat/WnCC-SoC-2022-QML/blob/main/SVM_Breast_Cancer_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Importing the breast cancer data from the .csv file into a pandas DataFrame**

In [107]:
import pandas as pd
import numpy as np

df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/data.csv")


# **2. Making random groups of parameters to see which group classifies breast cancer best**

A simple way to choose a subset of the features on which to perform SVM classification is:

1. The 30 parameters/ features are randomly divided into 5 groups of 6 parameters each, so that the original column order does not bias the grouping.
2. An SVM is trained on each group using the first (4/5)th of the dataset (training set) and scored on the remaining (1/5)th of the dataset (test set).
3. The group with the highest score is chosen as the set of parameters used for future predictions.
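
The grouping and the split boundary described in the steps above can be sketched with NumPy alone. This is a minimal sketch, not the notebook's own code: the helper names `make_groups` and `split_point` are hypothetical, the fixed seed is only there to make the grouping repeatable, and the row count of 569 is an assumption about `data.csv`.

```python
import numpy as np

def make_groups(n_features=30, n_groups=5, seed=0):
    """Shuffle the feature indices and split them into equal-sized groups."""
    rng = np.random.default_rng(seed)       # seeded so the grouping is repeatable
    indices = rng.permutation(n_features)   # random order of 0 .. n_features-1
    return np.split(indices, n_groups)      # needs n_features % n_groups == 0

def split_point(n_rows, train_fraction=4.0 / 5.0):
    """Row index that separates the training rows from the test rows."""
    return int(n_rows * train_fraction)

groups = make_groups()
n_rows = 569                  # assumed number of rows in data.csv
cut = split_point(n_rows)

print([len(g) for g in groups])   # size of each group
print(cut, n_rows - cut)          # training rows, test rows
```

Every feature index lands in exactly one group, so the five groups together cover all 30 features without overlap.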


 






In [108]:
# Store the labels of the 30 parameters with respect to which classification is to be done
# (the first two columns of the file are not features)
parameters = df.columns[2:32]

# Shuffle the column indices to ensure random allotment
# of columns into 5 groups of 6 parameters each
label_parameters = np.arange(30)
np.random.shuffle(label_parameters)

# Split the indices into 5 equal groups of 6 parameters each
groups = np.split(label_parameters, 5)

In [109]:
# Print the sets of groups of parameters/ features
for i in range(5):
  print("Group",(i+1),": ")
  for j in range(6):
    print(parameters[groups[i][j]])
  
  print("\n")



Group 1 : 
fractal_dimension_se
concave points_worst
smoothness_mean
concavity_worst
symmetry_mean
concavity_se


Group 2 : 
area_mean
concave points_se
smoothness_se
texture_worst
fractal_dimension_worst
texture_mean


Group 3 : 
symmetry_worst
perimeter_se
radius_mean
radius_worst
concavity_mean
compactness_se


Group 4 : 
smoothness_worst
radius_se
texture_se
compactness_worst
concave points_mean
perimeter_worst


Group 5 : 
perimeter_mean
compactness_mean
area_se
area_worst
fractal_dimension_mean
symmetry_se




# **3. Classifying the data using SVM and testing the classification using the test dataset**

In [110]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

score = [0.0] * 5     # one score per group
clf_objects = []      # trained classifiers, one per group

Y = np.array(df["diagnosis"])
n_train = int(len(Y)*4.0/5.0)   # number of rows in the training dataset

for i in range(5):
  X = np.array(df.iloc[:,groups[i]+2])

  # To extract first (4/5)th data as training dataset
  X_train = X[0:n_train]
  Y_train = Y[0:n_train]

  # To extract next (1/5)th data as test dataset
  X_test = X[n_train:]
  Y_test = Y[n_train:]

  # Train the SVM
  clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
  clf.fit(X_train, Y_train)
  clf_objects.append(clf)   # list.append keeps the pipeline object intact

  # To calculate how good the set of 6 parameters in the given iteration is
  # at identifying breast cancer
  score[i] = clf.score(X_test, Y_test)

  print("Score for Group no.",i+1,"is: ", score[i])

Score for Group no. 1 is:  0.9210526315789473
Score for Group no. 2 is:  0.9298245614035088
Score for Group no. 3 is:  0.9736842105263158
Score for Group no. 4 is:  0.9122807017543859
Score for Group no. 5 is:  0.9473684210526315
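
The scores above come from a single, unshuffled split, so they may depend on which rows happen to fall in the last fifth of the file. As a possible cross-check, and not what the notebook does, the sketch below scores one group with 5-fold cross-validation. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `data.csv`, and the choice of the first six columns is arbitrary and purely illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Arbitrary illustrative group: the first six feature columns
X_group = X[:, :6]

# Same pipeline as in the notebook; the scaler is refit inside every fold
clf = make_pipeline(StandardScaler(), SVC(gamma="auto"))

# 5-fold cross-validation (stratified by default for classifiers)
cv_scores = cross_val_score(clf, X_group, y, cv=5)

print("Fold scores:", cv_scores)
print("Mean:", cv_scores.mean(), "Std:", cv_scores.std())
```

Comparing the mean and spread across folds gives a rough idea of how much of the difference between groups is due to the particular split.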


# **4. Showing the best set of parameters/ features for classifying the tumor**

In [111]:
print("The best set of parameters for classifying breast cancer is: \n")
best = np.argmax(score)      # This is group index, not group no.

for i in range(6):
  print(parameters[groups[best]][i])

The best set of parameters for classifying breast cancer is: 

symmetry_worst
perimeter_se
radius_mean
radius_worst
concavity_mean
compactness_se
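
As a small worked check of the selection step, the sketch below applies `np.argmax` to the five scores printed in section 3. `np.argmax` returns a zero-based index, so the group number is that index plus one.

```python
import numpy as np

# Scores copied from the run shown in section 3 (Groups 1 to 5)
scores = [0.9210526315789473, 0.9298245614035088, 0.9736842105263158,
          0.9122807017543859, 0.9473684210526315]

best = int(np.argmax(scores))   # zero-based index of the first maximum
print("Best group index:", best)
print("Best group no.:", best + 1)
```

For these scores the index is 2, that is, Group 3, which matches the features listed above.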


# **5. Randomly selecting a test case and classifying it**

In [112]:
# Extract data of the best parameters for breast cancer classification
Y = np.array(df["diagnosis"])
X = np.array(df.iloc[:,groups[best]+2])

n_train = int(len(Y)*4.0/5.0)   # number of rows in the training dataset

# To extract first (4/5)th data as training dataset
X_train = X[0:n_train]
Y_train = Y[0:n_train]

# To extract next (1/5)th data as test dataset
X_test = X[n_train:]
Y_test = Y[n_train:]

i = np.random.randint(0, len(Y_test)) # Randomly select a testcase from the test dataset

# Retrain an SVM on the best group of parameters
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, Y_train)

prediction = clf.predict(X_test) # Predicts every testcase; the result of the ith testcase is printed below

print("Features:-")
for j in range(6):
  print((j+1),".",parameters[groups[best]][j],":",X_test[i][j])
print("\n")
print("SVM Prediction:", prediction[i])
print("Actual result from test dataset:", Y_test[i])




Features:-
1 . symmetry_worst : 0.2871
2 . perimeter_se : 2.548
3 . radius_mean : 7.76
4 . radius_worst : 9.456
5 . concavity_mean : 0.0
6 . compactness_se : 0.00466


SVM Prediction: B
Actual result from test dataset: B
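
The cell above predicts the whole test set and then picks out one result. If only one test case is needed, the classifier can be given that single row, provided it stays two-dimensional. The sketch below illustrates this on synthetic stand-in data (an assumption: two well-separated clusters with 6 features, not the real dataset).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: 40 "B" rows around 0 and 40 "M" rows around 4
X_b = rng.normal(0.0, 1.0, size=(40, 6))
X_m = rng.normal(4.0, 1.0, size=(40, 6))
X = np.vstack([X_b, X_m])
Y = np.array(["B"] * 40 + ["M"] * 40)

clf = make_pipeline(StandardScaler(), SVC(gamma="auto"))
clf.fit(X, Y)

i = 3
# predict() expects a 2-D array, so keep the row as shape (1, 6)
single = clf.predict(X[i:i + 1])                # slicing keeps two dimensions
also_single = clf.predict(X[i].reshape(1, -1))  # equivalent explicit reshape

print("Prediction for row", i, ":", single[0])
```

Passing `X[i]` directly would hand `predict` a one-dimensional array, which scikit-learn rejects, hence the slice or the reshape.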
