# **Breast tumor data analysis: finding the best-fit correlation threshold between diagnosis and the other features**

### Data source
 
The dataset contains 569 samples of malignant and benign tumor cells.


*   The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively.
*   Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.
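
As a quick sanity check on these figures, the sketch below loads the copy of the Wisconsin Diagnostic Breast Cancer data that ships with scikit-learn. It assumes the bundled copy is the same data as the CSV used here. The bundled copy has no ID column and encodes the target the other way round (0 = malignant, 1 = benign), so it is only used to check the size and the class balance.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled dataset holds the same samples as data.csv.
bundled = load_breast_cancer()

n_samples, n_features = bundled.data.shape

# Count how many samples fall into each target class.
class_counts = {
    str(name): int(count)
    for name, count in zip(bundled.target_names, np.bincount(bundled.target))
}

print(n_samples, n_features)
print(class_counts)
```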

### Importing all the necessary libraries to perform the analysis
1. Pandas: data processing, CSV file I/O, and SQL-like data manipulation

2. Matplotlib: used for plotting graphs

3. Sklearn.naive_bayes: to apply Naive Bayes

4. Sklearn.linear_model: to apply logistic regression

5. Sklearn.metrics: to check the recall, precision, F1-score and accuracy of the model

In [34]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Reading the CSV file from the Drive location into a DataFrame using Pandas.

In [35]:
data = pd.read_csv("/content/drive/MyDrive/data.csv")

Displaying the data read from the csv file.

In [36]:
data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


The data is divided into training and testing sets:

*   *Training data* - contains 70% of the original data.
*   *Testing data* - contains the remaining 30% of the original data.

Computing the number of rows that corresponds to each percentage.

In [37]:
training_data_len = int(0.7*len(data))
testing_data_len = int(0.3*len(data))

In [38]:
training_data_len

398

In [39]:
testing_data_len

170
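
Because both lengths are truncated with `int`, they add up to 568 rather than 569. A minimal sketch of the arithmetic, with the row count taken from the data source description; `n_rows`, `unassigned`, and `remainder_len` are illustrative names that do not appear in the notebook:

```python
n_rows = 569  # number of samples stated in the data source section

training_len = int(0.7 * n_rows)  # 398.3 is truncated to 398
testing_len = int(0.3 * n_rows)   # 170.7 is truncated to 170

# Both values are rounded down, so one row is not covered by either length.
unassigned = n_rows - (training_len + testing_len)

# Taking the remainder instead guarantees that the two parts cover every row.
remainder_len = n_rows - training_len

print(training_len, testing_len, unassigned, remainder_len)
```

The later cells slice every remaining row into the testing set, so it ends up with 171 rows, which matches the total support in the classification reports.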

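Separating the rows by diagnosis: the first `training_data_len//2` (199) *benign* rows and the first 199 *malignant* rows form a class-balanced training set. All remaining rows of each class go to the testing set, which is therefore not balanced.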

In [40]:
neg_class_data = data[data["diagnosis"] == 'B']
pos_class_data = data[data["diagnosis"] == 'M']
train_neg_class_data = neg_class_data.iloc[0:training_data_len//2,:]
train_pos_class_data = pos_class_data.iloc[0:training_data_len//2,:]

test_neg_class_data = neg_class_data.iloc[training_data_len//2:,:]
test_pos_class_data = pos_class_data.iloc[training_data_len//2:,:]
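
To see what this slicing does to the class balance, here is a small sketch on a made-up 11-row frame (`toy` and its values are invented for illustration). It follows the same steps as the cell above.

```python
import pandas as pd

# Made-up data: 7 benign and 4 malignant rows.
toy = pd.DataFrame({"diagnosis": list("BBBBBBB") + list("MMMM")})

toy_training_len = int(0.7 * len(toy))  # 7
half = toy_training_len // 2            # 3 rows per class go to training

neg = toy[toy["diagnosis"] == "B"]
pos = toy[toy["diagnosis"] == "M"]

# The first `half` rows of each class are used for training, the rest for testing.
toy_train = pd.concat([neg.iloc[:half], pos.iloc[:half]])
toy_test = pd.concat([neg.iloc[half:], pos.iloc[half:]])

train_counts = toy_train["diagnosis"].value_counts().to_dict()
test_counts = toy_test["diagnosis"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

The training part is balanced, while the testing part keeps whatever is left of each class. If a split that preserves the original class ratio in both parts is preferred, scikit-learn's `train_test_split` with its `stratify` argument is one alternative.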

Displaying the section of training data with **benign** diagnosis.

In [41]:
train_neg_class_data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
19,8510426,B,13.540,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.047810,...,19.26,99.70,711.2,0.14400,0.17730,0.23900,0.12880,0.2977,0.07259,
20,8510653,B,13.080,15.71,85.63,520.0,0.10750,0.12700,0.04568,0.031100,...,20.49,96.09,630.5,0.13120,0.27760,0.18900,0.07283,0.3184,0.08183,
21,8510824,B,9.504,12.44,60.34,273.9,0.10240,0.06492,0.02956,0.020760,...,15.66,65.13,314.9,0.13240,0.11480,0.08867,0.06227,0.2450,0.07773,
37,854941,B,13.030,18.42,82.61,523.8,0.08983,0.03766,0.02562,0.029230,...,22.81,84.46,545.9,0.09701,0.04619,0.04833,0.05013,0.1987,0.06169,
46,85713702,B,8.196,16.84,51.71,201.9,0.08600,0.05943,0.01588,0.005917,...,21.96,57.26,242.2,0.12970,0.13570,0.06880,0.02564,0.3105,0.07409,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
355,9010258,B,12.560,19.07,81.92,485.8,0.08760,0.10380,0.10300,0.043910,...,22.43,89.02,547.4,0.10960,0.20020,0.23880,0.09265,0.2121,0.07188,
356,9010259,B,13.050,18.59,85.09,512.0,0.10820,0.13040,0.09603,0.056030,...,24.85,94.22,591.2,0.13430,0.26580,0.25730,0.12580,0.3113,0.08317,
357,901028,B,13.870,16.21,88.52,593.7,0.08743,0.05492,0.01502,0.020880,...,25.58,96.74,694.4,0.11530,0.10080,0.05285,0.05556,0.2362,0.07113,
358,9010333,B,8.878,15.49,56.74,241.0,0.08293,0.07698,0.04721,0.023810,...,17.70,65.27,302.0,0.10150,0.12480,0.09441,0.04762,0.2434,0.07431,


Displaying the section of training data with **malignant** diagnosis.

In [42]:
train_pos_class_data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.3001,0.14710,...,17.33,184.60,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.80,1956.0,0.1238,0.1866,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.1974,0.12790,...,25.53,152.50,1709.0,0.1444,0.4245,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.2414,0.10520,...,26.50,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.1980,0.10430,...,16.67,152.20,1575.0,0.1374,0.2050,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499,91485,M,20.59,21.24,137.80,1320.0,0.10850,0.16440,0.2188,0.11210,...,30.76,163.20,1760.0,0.1464,0.3597,0.5179,0.2113,0.2480,0.08999,
501,91504,M,13.82,24.49,92.33,595.9,0.11620,0.16810,0.1357,0.06759,...,32.94,106.00,788.0,0.1794,0.3966,0.3381,0.1521,0.3651,0.11830,
503,915143,M,23.09,19.83,152.10,1682.0,0.09342,0.12750,0.1676,0.10030,...,23.87,211.50,2782.0,0.1199,0.3625,0.3794,0.2264,0.2908,0.07277,
509,915460,M,15.46,23.95,103.80,731.3,0.11830,0.18700,0.2030,0.08520,...,36.33,117.70,909.4,0.1732,0.4967,0.5911,0.2163,0.3013,0.10670,


Combining the **benign** and **malignant** subsets into the final training and testing sets.

In [43]:
training_data = pd.concat([train_neg_class_data,train_pos_class_data])
testing_data = pd.concat([test_neg_class_data,test_pos_class_data])

Dropping the column with undesirable values to avoid any discrepancies.
> "Unnamed: 32" is the last column. It contains only 'NaN' (Not a Number) values, which can lead to wrong calculations.

In [44]:
training_data.drop([data.columns[32]],axis=1,inplace=True)
testing_data.drop([data.columns[32]],axis=1,inplace=True)
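
The column is addressed by position here (`data.columns[32]`). A sketch of an alternative that does not depend on the column position, using `DataFrame.dropna` to remove every column that is entirely NaN; the small frame below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame with one column that holds nothing but NaN.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# axis=1 drops columns; how="all" drops only columns where every value is NaN.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```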

### **Refining both training and testing data via feature selection.**

Encoding the diagnosis from string to integer type so that it can be used in mathematical calculations.

Here we are using binary label encoding to set:


*   Malignant type = 1
*   Benign type = 0



In [45]:
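# assign the result back: replace(..., inplace=True) on a column selection may not update the DataFrame in newer pandas versions
training_data[data.columns[1]] = training_data[data.columns[1]].map({'B': 0, 'M': 1})

In [46]:
testing_data[data.columns[1]] = testing_data[data.columns[1]].map({'B': 0, 'M': 1})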

Constructing a correlation matrix for every pair of features.

In [47]:
corr_matrix = training_data.corr()
corr_matrix

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
id,1.0,0.045047,0.090787,0.167022,0.091058,0.120059,0.0335,0.03449,0.07878,0.068188,...,0.099812,0.117864,0.100667,0.13131,0.028039,0.01593,0.04471,0.051267,-0.048499,0.000955
diagnosis,0.045047,1.0,0.724298,0.523033,0.736404,0.686193,0.37045,0.600565,0.666892,0.757983,...,0.764562,0.553388,0.772223,0.704527,0.455838,0.60206,0.660176,0.801852,0.403745,0.345429
radius_mean,0.090787,0.724298,1.0,0.423461,0.997694,0.987369,0.163584,0.503963,0.666143,0.819427,...,0.965959,0.379369,0.962388,0.938114,0.133019,0.399732,0.512911,0.733373,0.142004,0.007183
texture_mean,0.167022,0.523033,0.423461,1.0,0.430737,0.406719,0.075724,0.332135,0.374872,0.389672,...,0.453145,0.905372,0.456969,0.429743,0.188699,0.360214,0.380656,0.408845,0.170121,0.186314
perimeter_mean,0.091058,0.736404,0.997694,0.430737,1.0,0.986057,0.203135,0.556726,0.707207,0.849163,...,0.965354,0.386399,0.967227,0.937632,0.164292,0.442629,0.550522,0.760646,0.167662,0.051014
area_mean,0.120059,0.686193,0.987369,0.406719,0.986057,1.0,0.167526,0.489072,0.670129,0.813356,...,0.956464,0.355407,0.953579,0.953814,0.127434,0.370314,0.493947,0.704323,0.112782,-0.005618
smoothness_mean,0.0335,0.37045,0.163584,0.075724,0.203135,0.167526,1.0,0.658811,0.557233,0.56667,...,0.206653,0.130719,0.236436,0.196102,0.787594,0.504358,0.476919,0.510499,0.39847,0.521051
compactness_mean,0.03449,0.600565,0.503963,0.332135,0.556726,0.489072,0.658811,1.0,0.890881,0.835291,...,0.530167,0.333853,0.584806,0.496415,0.555464,0.868557,0.82354,0.815929,0.51736,0.681613
concavity_mean,0.07878,0.666892,0.666143,0.374872,0.707207,0.670129,0.557233,0.890881,1.0,0.919576,...,0.670765,0.357934,0.711359,0.652521,0.466058,0.742074,0.879514,0.850542,0.403292,0.513388
concave points_mean,0.068188,0.757983,0.819427,0.389672,0.849163,0.813356,0.56667,0.835291,0.919576,1.0,...,0.821021,0.371817,0.848307,0.793424,0.457648,0.664058,0.751607,0.904517,0.364338,0.367768
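
Each entry of this matrix is a Pearson correlation coefficient, which `DataFrame.corr` computes by default. As a sketch of what a single entry means, the snippet below computes the coefficient by hand for an invented feature and an invented 0/1 label, then compares it with the pandas result:

```python
import math
import pandas as pd

feature = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up feature values
label = pd.Series([0, 0, 1, 1, 1])              # made-up 0/1 diagnosis

# Deviations of each variable from its mean.
dx = feature - feature.mean()
dy = label - label.mean()

# Pearson r: sum of products of deviations, divided by the square root of
# the product of the two sums of squared deviations.
manual_r = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())
pandas_r = feature.corr(label)

print(round(manual_r, 4), round(pandas_r, 4))
```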


In order to select the features on which the diagnosis depends most strongly, features are selected progressively over a range of threshold values for the Pearson coefficient between diagnosis and each other feature.

The threshold on this Pearson coefficient ranges from 0.1 to 0.9.

list_of_list_of_features stores one list of features for each threshold value: the features whose coefficient exceeds that threshold.

In [48]:
list_of_list_of_features = []
for i in range(1,10):
   list_of_list_of_features.append(training_data.columns[corr_matrix[training_data.columns[1]] > i/10])
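
The effect of raising the threshold can be sketched on a small invented frame (`toy`, `strong`, `medium`, and `weak` are made-up names and values). The `diagnosis` column always passes, because its correlation with itself is 1.

```python
import pandas as pd

# Made-up data: three features with different correlations to the label.
toy = pd.DataFrame({
    "diagnosis": [0, 0, 0, 1, 1, 1],
    "strong": [1, 2, 3, 7, 8, 9],
    "medium": [1, 2, 3, 2, 3, 4],
    "weak": [1, 3, 2, 2, 3, 1],
})

# Correlation of every column with the label column.
corr_with_diagnosis = toy.corr()["diagnosis"]

# Keep the columns whose coefficient exceeds each threshold.
selected = {}
for threshold in (0.1, 0.6):
    mask = corr_with_diagnosis > threshold
    selected[threshold] = list(corr_with_diagnosis[mask].index)

print(selected)
```

The comparison is signed, so a feature with a strong negative correlation would be dropped. Comparing `corr_with_diagnosis.abs()` against the threshold would keep such a feature, if that is the intent.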

Storing the data corresponding to each list of features in list_of_list_of_features for both training and testing data.

In [49]:
list_of_filtered_training_data = []
list_of_filtered_testing_data = []
for i in range(0,9):
  list_of_filtered_training_data.append(training_data[list_of_list_of_features[i]])
  list_of_filtered_testing_data.append(testing_data[list_of_list_of_features[i]])

Using the Naive Bayes algorithm to sequentially train a model on each filtered training set, then computing the precision and recall of its predictions on the matching testing set.

At each step, the sum of precision and recall is compared with the maximum found so far in order to determine the best-fit threshold for the correlation coefficient.
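
The macro-averaged precision and recall used for this comparison are the unweighted means of the per-class values. A minimal sketch with invented labels, using the `output_dict=True` option of `classification_report`:

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 0, 1, 1]  # made-up ground truth
y_pred = [0, 0, 0, 1, 1, 0]  # made-up predictions

# output_dict=True returns the report as nested dictionaries instead of text.
report = classification_report(y_true=y_true, y_pred=y_pred, output_dict=True)

macro_precision = report["macro avg"]["precision"]
macro_recall = report["macro avg"]["recall"]

# The quantity that is compared across thresholds.
score = macro_precision + macro_recall

print(report["0"]["precision"], report["1"]["precision"])
print(macro_precision, macro_recall, score)
```

The threshold whose model gives the highest such sum is kept as the best fit.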

In [69]:
max_recall = 0  # initialising maximum recall value to 0
max_precision = 0  # initialising maximum precision value to 0
best_fit_thresold_value = 0  # initialising the best-fit Pearson threshold (to be estimated) to 0

for i in range(0,8):  # thresholds 0.1 to 0.8; the 0.9 subset (index 8) is not evaluated
  answer = list_of_filtered_training_data[i][data.columns[1]]
  # iloc[:,1:] skips the diagnosis column, which comes first in each filtered subset
  input_features = list_of_filtered_training_data[i].iloc[:,1:]
  naive_bayes_algo = GaussianNB()
  naive_bayes_algo.fit(X=input_features,y=answer)
  testing_answers = list_of_filtered_testing_data[i][data.columns[1]]
  testing_questions = list_of_filtered_testing_data[i].iloc[:,1:]
  exam_answers = naive_bayes_algo.predict(testing_questions)
  print('Classification Report for Pearson Coefficient = ' ,(i+1)/10, '\n')
  print(classification_report(y_true=testing_answers,y_pred=exam_answers) + '\n')
  report_dict = classification_report(y_true=testing_answers, y_pred=exam_answers, output_dict=True)
  # named precision_recall_sum so that the built-in sum() is not shadowed
  precision_recall_sum = report_dict['macro avg']['precision'] + report_dict['macro avg']['recall']
  if precision_recall_sum > max_precision + max_recall:
    max_precision = report_dict['macro avg']['precision']
    max_recall = report_dict['macro avg']['recall']
    best_fit_thresold_value = (i+1)/10
print('The best fit value of correlation coeffient for the tumor analysis = ',best_fit_thresold_value)

Classification Report for Pearson Coefficient =  0.1 

              precision    recall  f1-score   support

           0       0.99      0.96      0.97       158
           1       0.61      0.85      0.71        13

    accuracy                           0.95       171
   macro avg       0.80      0.90      0.84       171
weighted avg       0.96      0.95      0.95       171


Classification Report for Pearson Coefficient =  0.2 

              precision    recall  f1-score   support

           0       0.99      0.96      0.97       158
           1       0.61      0.85      0.71        13

    accuracy                           0.95       171
   macro avg       0.80      0.90      0.84       171
weighted avg       0.96      0.95      0.95       171


Classification Report for Pearson Coefficient =  0.3 

              precision    recall  f1-score   support

           0       0.99      0.96      0.97       158
           1       0.61      0.85      0.71        13

    accuracy   