# Machine Learning Lab 4 - Predicting Breast Cancer
Submitted By <br/>
Name: **Rathod Nishit Shailesh** <br/>
Register Number: **19112014** <br/>
Class: **5 BSc Data Science** <br/>

<hr/>

## Lab Overview

#### About the Dataset
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
- ftp.cs.wisc.edu - cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository:
1. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1. ID number
2. Diagnosis (M = malignant, B = benign)
3. Ten real-valued features are computed for each cell nucleus:
    1. radius (mean of distances from center to points on the perimeter)
    1. texture (standard deviation of gray-scale values)
    1. perimeter
    1. area
    1. smoothness (local variation in radius lengths)
    1. compactness (perimeter^2 / area - 1.0)
    1. concavity (severity of concave portions of the contour)
    1. concave points (number of concave portions of the contour)
    1. symmetry
    1. fractal dimension ("coastline approximation" - 1)

- The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

- All feature values are recoded with four significant digits.

- Missing attribute values: none

- Class distribution: 357 benign, 212 malignant
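
The field layout described above (ten base measurements, each summarised as mean, standard error and "worst") can be sketched in a few lines. This is only an illustration: `base_features`, `statistics` and `field_name` are names introduced here, and the generated strings are assumed to follow the column naming of the CSV used in this lab.

```python
# Ten base measurements computed for each cell nucleus
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three summary statistics computed per image
statistics = ["mean", "se", "worst"]

# All ten means come first, then the ten standard errors, then the ten "worst" values
feature_columns = [f"{feature}_{stat}" for stat in statistics for feature in base_features]

def field_name(field_number):
    """Map a 1-based field number to a feature name (fields 1 and 2 are ID and diagnosis)."""
    return feature_columns[field_number - 3]

print(len(feature_columns))                             # 30
print(field_name(3), field_name(13), field_name(23))    # radius_mean radius_se radius_worst
```

The offset of 3 reflects that the ID number and the diagnosis occupy the first two fields.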

#### Objective
- Get familiar with the problem statement, know the dataset thoroughly, analyse the dataset by exploring its hidden insights through visuals, and train and test models for accurate classification of breast cancer.

#### Problem Definition
- Understand the Dataset & Features.
- Preprocess the Data to Obtain a Clean, Structured Dataset.
- Perform Statistical Data Analysis and Derive Valuable Inferences.
- Perform Exploratory Data Analysis and Derive Valuable Insights.
- Train and Test Different Classification Models for Better Prediction.

#### Approach
This is an extension of the Problem Definition. It describes the process/approach followed to address the problem definition above.

- Step 1: Know the dataset thoroughly.
- Step 2: Perform preprocessing on data.
- Step 3: Import the necessary libraries as and when they are needed to plot graphs and evaluate the models.
- Step 4: Perform Statistical Data Analysis and Derive Valuable Inferences.
- Step 5: Perform Exploratory Data Analysis and Derive Valuable Insights.
- Step 6: Train and Test through Different Classification Models for Better Breast Cancer Prediction.
- Step 7: Help doctors with insights for predicting whether a patient is diagnosed with breast cancer or not.

#### Sections
The sections of this lab, as developed in the notebook below, are:
1. Lab Overview
1. Dataset Overview
1. Data Analyst Process
1. About Different Classification Models
1. Implementation and Evaluation of Different Classification Models
1. Conclusion

#### References
1. https://pandas.pydata.org/
1. https://matplotlib.org/
1. https://seaborn.pydata.org/
1. https://plotly.com/
1. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
1. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
1. https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
1. https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("BreastCancer.csv")

In [3]:
df.shape

(569, 33)

In [4]:
df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

In [5]:
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [6]:
df.tail()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
564,926424,M,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,
565,926682,M,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,
567,927241,M,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,
568,92751,B,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,


In [7]:
df.drop('Unnamed: 32', axis = 1, inplace = True)
df.drop('id', axis = 1, inplace = True)
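
As a hedged alternative to naming the empty column explicitly, the sketch below drops every column that contains only missing values and then removes the identifier. It runs on a small toy frame, not on `BreastCancer.csv`; the `id` values are made up, and it assumes that `Unnamed: 32` is entirely empty, as the `head()` and `tail()` outputs suggest.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of the raw file (id values are hypothetical)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.21, 12.76],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Drop columns that are entirely NaN, then drop the identifier,
# which carries no predictive information
cleaned = toy.dropna(axis=1, how="all").drop(columns=["id"])

print(list(cleaned.columns))   # ['diagnosis', 'radius_mean']
```

Returning a new frame instead of using `inplace = True` keeps the raw data available for later inspection.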

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

In [9]:
df.sample(5)

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
543,B,13.21,28.06,84.88,538.4,0.08671,0.06877,0.02987,0.03275,0.1628,...,14.37,37.17,92.48,629.6,0.1072,0.1381,0.1062,0.07958,0.2473,0.06443
68,B,9.029,17.33,58.79,250.5,0.1066,0.1413,0.313,0.04375,0.2111,...,10.31,22.65,65.5,324.7,0.1482,0.4365,1.252,0.175,0.4228,0.1175
362,B,12.76,18.84,81.87,496.6,0.09676,0.07952,0.02688,0.01781,0.1759,...,13.75,25.99,87.82,579.7,0.1298,0.1839,0.1255,0.08312,0.2744,0.07238
24,M,16.65,21.38,110.0,904.6,0.1121,0.1457,0.1525,0.0917,0.1995,...,26.46,31.56,177.0,2215.0,0.1805,0.3578,0.4695,0.2095,0.3613,0.09564
397,B,12.8,17.46,83.05,508.3,0.08044,0.08895,0.0739,0.04083,0.1574,...,13.74,21.06,90.72,591.0,0.09534,0.1812,0.1901,0.08296,0.1988,0.07053


In [10]:
df.describe()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [11]:
df.isnull().sum()

diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
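
Besides missing values, the class distribution quoted in the overview (357 benign, 212 malignant) is worth checking. The sketch below is not part of the original notebook: it uses the copy of the same dataset bundled with scikit-learn (`load_breast_cancer`) rather than `BreastCancer.csv`, and maps scikit-learn's numeric target back to the `M`/`B` labels used here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# scikit-learn encodes 0 = malignant and 1 = benign; map back to M / B
diagnosis = pd.Series(data.target, name="diagnosis").map({0: "M", 1: "B"})

counts = diagnosis.value_counts()
print(counts)
print((counts / counts.sum()).round(3))   # share of each class
```

On the lab's own DataFrame, the equivalent call would be `df['diagnosis'].value_counts()`.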

## Exploratory Data Analysis

$ ! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 

$ from pandas_profiling import ProfileReport

$ profile = ProfileReport(df)

$ profile.to_notebook_iframe()

$ profile.to_file(output_file = "19112014_NishitRathod_EDA_BreastCancer.html")

## Implementing Classification Models

#### Separating the Target Variable from the Features

In [12]:
Y = df['diagnosis']
X = df.drop(['diagnosis'], axis=1)

In [13]:
X.sample(5)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
131,15.46,19.48,101.7,748.9,0.1092,0.1223,0.1466,0.08087,0.1931,0.05796,...,19.26,26.0,124.9,1156.0,0.1546,0.2394,0.3791,0.1514,0.2837,0.08019
200,12.23,19.56,78.54,461.0,0.09586,0.08087,0.04187,0.04107,0.1979,0.06013,...,14.44,28.36,92.15,638.4,0.1429,0.2042,0.1377,0.108,0.2668,0.08174
139,11.28,13.39,73.0,384.8,0.1164,0.1136,0.04635,0.04796,0.1771,0.06072,...,11.92,15.77,76.53,434.0,0.1367,0.1822,0.08669,0.08611,0.2102,0.06784
135,12.77,22.47,81.72,506.3,0.09055,0.05761,0.04711,0.02704,0.1585,0.06065,...,14.49,33.37,92.04,653.6,0.1419,0.1523,0.2177,0.09331,0.2829,0.08067
144,10.75,14.97,68.26,355.3,0.07793,0.05139,0.02251,0.007875,0.1399,0.05688,...,11.95,20.72,77.79,441.2,0.1076,0.1223,0.09755,0.03413,0.23,0.06769


In [14]:
Y.sample(5)

488    B
487    M
208    B
445    B
207    M
Name: diagnosis, dtype: object

#### Train Test Split

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 8)
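
Because the classes are not equally frequent, one optional variation (not used in this lab) is a stratified split, which keeps the benign/malignant ratio nearly identical in the training and test sets. The sketch below uses a synthetic label column with the class counts stated in the overview; `labels` and `features` are placeholder names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts from the overview (357 B, 212 M)
labels = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")
features = pd.DataFrame({"row": range(len(labels))})   # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.30, random_state=8, stratify=labels
)

# The class shares in both parts should be almost the same
print(y_tr.value_counts(normalize=True).round(3))
print(y_te.value_counts(normalize=True).round(3))
```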

In [16]:
X_train.sample(5)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
171,13.43,19.63,85.84,565.4,0.09048,0.06288,0.05858,0.03438,0.1598,0.05671,...,17.98,29.87,116.6,993.6,0.1401,0.1546,0.2644,0.116,0.2884,0.07371
518,12.88,18.22,84.45,493.1,0.1218,0.1661,0.04825,0.05303,0.1709,0.07253,...,15.05,24.37,99.31,674.7,0.1456,0.2961,0.1246,0.1096,0.2582,0.08893
352,25.73,17.46,174.2,2010.0,0.1149,0.2363,0.3368,0.1913,0.1956,0.06121,...,33.13,23.58,229.3,3234.0,0.153,0.5937,0.6451,0.2756,0.369,0.08815
158,12.06,12.74,76.84,448.6,0.09311,0.05241,0.01972,0.01963,0.159,0.05907,...,13.14,18.41,84.08,532.8,0.1275,0.1232,0.08636,0.07025,0.2514,0.07898
166,10.8,9.71,68.77,357.6,0.09594,0.05736,0.02531,0.01698,0.1381,0.064,...,11.6,12.02,73.66,414.0,0.1436,0.1257,0.1047,0.04603,0.209,0.07699


In [17]:
X_test.sample(5)

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
537,11.69,24.44,76.37,406.4,0.1236,0.1552,0.04515,0.04531,0.2131,0.07405,...,12.98,32.19,86.12,487.7,0.1768,0.3251,0.1395,0.1308,0.2803,0.0997
327,12.03,17.93,76.09,446.0,0.07683,0.03892,0.001546,0.005592,0.1382,0.0607,...,13.07,22.25,82.74,523.4,0.1013,0.0739,0.007732,0.02796,0.2171,0.07037
538,7.729,25.49,47.98,178.8,0.08098,0.04878,0.0,0.0,0.187,0.07285,...,9.077,30.92,57.17,248.0,0.1256,0.0834,0.0,0.0,0.3058,0.09938
175,8.671,14.45,54.42,227.2,0.09138,0.04276,0.0,0.0,0.1722,0.06724,...,9.262,17.04,58.36,259.2,0.1162,0.07057,0.0,0.0,0.2592,0.07848
338,10.05,17.53,64.41,310.8,0.1007,0.07326,0.02511,0.01775,0.189,0.06331,...,11.16,26.84,71.98,384.0,0.1402,0.1402,0.1055,0.06499,0.2894,0.07664


In [18]:
Y_train.sample(5)

337    M
242    B
164    M
322    B
512    M
Name: diagnosis, dtype: object

In [19]:
Y_test.sample(5)

122    M
328    M
207    M
442    B
436    B
Name: diagnosis, dtype: object

### Logistic Regression

#### Fitting and Predicting 

In [20]:
from sklearn.linear_model import LogisticRegression
logistic_regressor = LogisticRegression()

In [21]:
logistic_regressor.fit(X_train, Y_train)

LogisticRegression()

In [22]:
predA = logistic_regressor.predict(X_test)
y_prob = logistic_regressor.predict_proba(X_test)

In [23]:
predA[0:5]

array(['B', 'B', 'B', 'B', 'B'], dtype=object)

In [24]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, predA)  # signature is (y_true, y_pred)

0.9532163742690059

In [25]:
y_prob[0:5]

array([[0.99752283, 0.00247717],
       [0.98704683, 0.01295317],
       [0.99537735, 0.00462265],
       [0.99631079, 0.00368921],
       [0.99835446, 0.00164554]])
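
The two columns returned by `predict_proba` follow the order of the fitted model's `classes_` attribute, which for string labels is sorted alphabetically. The sketch below illustrates this on synthetic one-feature data (the values are made up and only loosely resemble `radius_mean`), so it shows the mechanism and not the lab's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic feature: "B" samples centred at 12, "M" samples centred at 18
x = np.concatenate([rng.normal(12, 1.5, 50), rng.normal(18, 1.5, 50)]).reshape(-1, 1)
y = np.array(["B"] * 50 + ["M"] * 50)

clf = LogisticRegression(max_iter=1000).fit(x, y)
proba = clf.predict_proba(x[:5])

print(clf.classes_)                        # column order of predict_proba
m_column = list(clf.classes_).index("M")   # look the column up instead of hard-coding it
p_malignant = proba[:, m_column]
print(p_malignant)
```

Read this way, the first column of `y_prob` above is the probability of `B`, which agrees with the first five predictions all being `B`.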

### K-Nearest Neighbours

#### Fitting and Predicting 

In [26]:
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(n_neighbors = 2)
KNN.fit(X_train, Y_train)
predB = KNN.predict(X_test)
accuracy_score(Y_test, predB)  # signature is (y_true, y_pred)

0.935672514619883
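
K-Nearest Neighbours is distance-based, and the `describe()` output shows features on very different scales (`area_mean` reaches 2501 while `smoothness_mean` stays below 0.2), so large-valued features dominate the distance. A possible variation, not used in this lab, is to standardise the features first. The sketch below assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of `BreastCancer.csv`; no accuracy figure is claimed for it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=8
)

# The scaler is fitted on the training part only, then applied to the test part
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
scaled_knn.fit(X_tr, y_tr)

scaled_accuracy = scaled_knn.score(X_te, y_te)
print(scaled_accuracy)
```

Wrapping both steps in a pipeline avoids leaking test-set statistics into the scaler.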

### Decision Tree 

#### Fitting and Predicting 

In [27]:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
DT.fit(X_train,Y_train)
predC = DT.predict(X_test)
accuracy_score(Y_test, predC)  # signature is (y_true, y_pred)

0.9298245614035088

### Classification Report 

In [28]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, predA))

              precision    recall  f1-score   support

           B       0.96      0.96      0.96       105
           M       0.94      0.94      0.94        66

    accuracy                           0.95       171
   macro avg       0.95      0.95      0.95       171
weighted avg       0.95      0.95      0.95       171



### Confusion Matrix

In [29]:
from sklearn.metrics import confusion_matrix

In [30]:
CM = confusion_matrix(Y_test, predA)
CM

array([[101,   4],
       [  4,  62]], dtype=int64)

In [31]:
Accuracy = (CM[0][0] + CM[1][1]) / (CM[0][0] + CM[1][1] + CM[0][1] + CM[1][0])
Accuracy

0.9532163742690059

In [32]:
ErrorRate = (CM[0][1] + CM[1][0]) / (CM[0][0] + CM[1][1] + CM[0][1] + CM[1][0])
ErrorRate

0.04678362573099415

In [33]:
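# Rows of CM are true labels and columns are predictions; 'B' (index 0) is treated as the positive class
# Sensitivity = TP / (TP + FN), i.e. the sum over the true-'B' row
Sensitivity = CM[0][0]/(CM[0][0] + CM[0][1])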
Sensitivity

0.9619047619047619

In [34]:
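# Specificity = TN / (TN + FP), i.e. the sum over the true-'M' row (with 'B' as the positive class)
Specificity = CM[1][1]/(CM[1][1] + CM[1][0])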
Specificity

0.9393939393939394

In [35]:
# Recall = TP / (TP + FN), identical to sensitivity
Recall = CM[0][0]/(CM[0][0] + CM[0][1])
Recall

0.9619047619047619

In [36]:
# Precision = TP / (TP + FP), i.e. the sum over the predicted-'B' column
Precision = CM[0][0]/(CM[0][0] + CM[1][0])
Precision

0.9619047619047619

In [37]:
F1Score = (2*(Precision*Recall))/(Precision + Recall)
F1Score

0.9619047619047619
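
The metrics above can be collected in one place by first naming the four cells of the confusion matrix. The sketch below hard-codes the matrix reported for Logistic Regression and treats `B` (row and column 0) as the positive class, which is an assumption matching the way the cells are indexed in this section. Because the two off-diagonal counts are both 4, precision and recall coincide here.

```python
# Confusion matrix from the Logistic Regression run: rows = true, columns = predicted
cm = [[101, 4],
      [4, 62]]

# With 'B' (index 0) as the positive class
tp, fn = cm[0]   # true B predicted B, true B predicted M
fp, tn = cm[1]   # true M predicted B, true M predicted M

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
sensitivity = tp / (tp + fn)          # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1_score = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [("accuracy", accuracy), ("error rate", error_rate),
                    ("sensitivity", sensitivity), ("specificity", specificity),
                    ("precision", precision), ("f1", f1_score)]:
    print(f"{name:12s} {value:.4f}")
```

In a diagnostic setting it is often more informative to treat `M` as the positive class; swapping the roles of the two rows and columns gives those figures.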

### Evaluating the Effect of Parameters For Logistic Regression

In [38]:
from sklearn.metrics import accuracy_score

In [39]:
def doLogisticRegression(X, Y, test_size = 0.30, random_state = 42, penalty='l2', solver='lbfgs'):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = random_state)
    
    logistic_regressor = LogisticRegression(penalty = penalty, solver = solver)
    logistic_regressor.fit(X_train, Y_train)
    y_pred = logistic_regressor.predict(X_test)
    
    acc_score = accuracy_score(Y_test, y_pred)
    
    return acc_score

In [40]:
df2 = pd.DataFrame(columns = ['Test Size', 'Random States', 'Penalty', 'Solvers', 'Accuracy'])
df2

Unnamed: 0,Test Size,Random States,Penalty,Solvers,Accuracy


In [41]:
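# 'none' is the string accepted by the scikit-learn version used here; newer releases expect None
penalties = ['none', 'l2']
test_size = [0.30, 0.25, 0.20]
random_states = [21, 42, 84]
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

for t_size in test_size:
    for r_state in random_states:
        for penalty in penalties:
            for solver in solvers:
                # NOTE: `solver` is not passed to doLogisticRegression, so every fit uses
                # the function's default solver ('lbfgs'); the 'Solvers' column below only
                # labels the loop iteration.
                accuracy = doLogisticRegression(X, Y, t_size, r_state, penalty)
                BreastCancerResults = {}
                BreastCancerResults['Test Size'] = t_size
                BreastCancerResults['Random States'] = r_state
                BreastCancerResults['Penalty'] = penalty
                BreastCancerResults['Solvers'] = solver
                BreastCancerResults['Accuracy'] = accuracy

                # DataFrame.append was removed in pandas 2.0; on newer versions,
                # collect the rows in a list and build the DataFrame once.
                df2 = df2.append(BreastCancerResults, ignore_index = True)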

In [42]:
df2.sample(10)

Unnamed: 0,Test Size,Random States,Penalty,Solvers,Accuracy
23,0.3,84,none,sag,0.959064
54,0.25,84,none,saga,0.965035
84,0.2,84,none,saga,0.973684
27,0.3,84,l2,liblinear,0.959064
77,0.2,42,l2,liblinear,0.964912
9,0.3,21,l2,saga,0.947368
25,0.3,84,l2,newton-cg,0.959064
43,0.25,42,none,sag,0.965035
68,0.2,21,l2,sag,0.921053
58,0.25,84,l2,sag,0.965035
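
For comparison, the following sketch shows one way a solver sweep could forward the solver to the model explicitly. It is an illustration under stated assumptions and not the code that produced the table above: it uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`), keeps the default L2 penalty because not every solver supports every penalty, standardises the features so that `sag` and `saga` converge, and gathers the rows in a list before building the DataFrame.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

rows = []
for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    # The solver is passed to the estimator, so each row really uses it
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver=solver, max_iter=5000),
    )
    model.fit(X_tr, y_tr)
    rows.append({"Solvers": solver, "Accuracy": model.score(X_te, y_te)})

# Build the results table once, from the collected rows
solver_results = pd.DataFrame(rows)
print(solver_results)
```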


### Evaluating the Effect of Parameters For K Nearest Neighbours

In [43]:
def doKNearestNeighbour(X, Y, test_size = 0.20, randomstate = 8, nn = 5):
    # Split the data, fit a KNN classifier with `nn` neighbours and return the test accuracy
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = randomstate)
    cls1 = KNeighborsClassifier(n_neighbors = nn)
    cls1.fit(X_train, Y_train)
    pred1 = cls1.predict(X_test)
    acc_score1 = accuracy_score(Y_test, pred1)
    return acc_score1

In [44]:
# Parameter grid for the KNN experiment
test_size = [0.30, 0.25, 0.20, 0.10]
random_states = [8, 27, 42]
n_neighbours = [2, 3, 4, 5]

In [45]:
df3 = pd.DataFrame(columns = ['Test Size', 'Random States','Number of neighbours','K-nearest neighbour Accuracy'])

In [46]:
for t_size in test_size:
    for r_state in random_states:
        for neigh in n_neighbours:
            a1 = doKNearestNeighbour(X, Y, t_size, r_state,neigh)
            KNearestNeighbours = {} 
            KNearestNeighbours['Test Size'] = t_size
            KNearestNeighbours['Random States'] = r_state
            KNearestNeighbours['Number of neighbours'] = neigh
            KNearestNeighbours['K-nearest neighbour Accuracy'] = a1

            df3 = df3.append(KNearestNeighbours, ignore_index = True)

In [47]:
df3.sample(10)

Unnamed: 0,Test Size,Random States,Number of neighbours,K-nearest neighbour Accuracy
22,0.25,42.0,4.0,0.951049
5,0.3,27.0,3.0,0.923977
43,0.1,27.0,5.0,0.947368
9,0.3,42.0,3.0,0.94152
2,0.3,8.0,4.0,0.947368
31,0.2,27.0,5.0,0.929825
0,0.3,8.0,2.0,0.935673
4,0.3,27.0,2.0,0.906433
28,0.2,27.0,2.0,0.885965
26,0.2,8.0,4.0,0.964912
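
A random sample of rows is hard to compare by eye, so one way to read such a sweep is to aggregate the accuracy per number of neighbours. The sketch below does this for the ten sampled rows shown above only (the full 48-row grid is not reproduced here), so the result is illustrative and not a conclusion about the whole experiment. The shortened column name `Accuracy` is introduced for brevity.

```python
import pandas as pd

# The ten sampled rows from the output above (neighbours, accuracy)
sampled = pd.DataFrame({
    "Number of neighbours": [4, 3, 5, 3, 4, 5, 2, 2, 2, 4],
    "Accuracy": [0.951049, 0.923977, 0.947368, 0.94152, 0.947368,
                 0.929825, 0.935673, 0.906433, 0.885965, 0.964912],
})

# Mean, spread and number of runs per value of k
summary = (
    sampled.groupby("Number of neighbours")["Accuracy"]
    .agg(["mean", "min", "max", "count"])
)
best_k = summary["mean"].idxmax()

print(summary.round(4))
print("k with the highest mean accuracy in this sample:", best_k)
```

On the full results, the same `groupby` could be applied to `df3` with its original column name.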


### Evaluating the Effect of Parameters For Decision Tree

In [48]:
def doDecisionTree(X, Y, test_size = 0.20, randomstate = 8, c = 'gini', mf = 'auto'):
    # Split the data, fit a decision tree with the given criterion and max_features,
    # and return the test accuracy
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = randomstate)
    cls2 = DecisionTreeClassifier(criterion = c, max_features = mf)
    cls2.fit(X_train, Y_train)
    pred2 = cls2.predict(X_test)
    acc_score2 = accuracy_score(Y_test, pred2)
    return acc_score2

In [49]:
# Parameter grid for the Decision Tree experiment
test_size = [0.30, 0.25, 0.20, 0.10]
random_states = [8, 27, 42]

criterions = ['gini', 'entropy']
# 'auto' behaves like 'sqrt' for DecisionTreeClassifier in the scikit-learn version used here;
# newer releases no longer accept it
maxfeatures = ['auto', 'sqrt', 'log2']

In [50]:
df4 = pd.DataFrame(columns = ['Test Size', 'Random States','Decision Tree Accuracy','Criterions','Max features'])

In [51]:
for t_size in test_size:
    for r_state in random_states:
        for crs in criterions:
            for mfs in maxfeatures:
                     
                a2 = doDecisionTree(X, Y, t_size, r_state, crs, mfs)
                DecisionTree = {} 
                DecisionTree['Test Size'] = t_size
                DecisionTree['Random States'] = r_state
                DecisionTree['Decision Tree Accuracy'] = a2
                DecisionTree['Criterions'] = crs
                DecisionTree['Max features'] = mfs

                df4 = df4.append(DecisionTree, ignore_index = True)

In [52]:
df4.sample(10)

Unnamed: 0,Test Size,Random States,Decision Tree Accuracy,Criterions,Max features
23,0.25,8,0.937063,entropy,log2
66,0.1,42,0.912281,gini,auto
14,0.3,42,0.94152,gini,log2
51,0.2,42,0.938596,entropy,auto
29,0.25,27,0.909091,entropy,log2
68,0.1,42,0.947368,gini,log2
65,0.1,27,0.929825,entropy,log2
17,0.3,42,0.929825,entropy,log2
59,0.1,8,0.912281,entropy,log2
27,0.25,27,0.916084,entropy,auto
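
When `max_features` is smaller than the number of features, the tree considers a random subset of features at each split, so two fits on the same data can differ unless the classifier itself is seeded. The sketch below, which assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) and a helper named `fit_and_score` introduced here, shows that fixing `random_state` on the classifier makes a run repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=8
)

def fit_and_score(seed):
    """Fit a tree with a fixed seed and return its test accuracy."""
    tree = DecisionTreeClassifier(
        criterion="gini", max_features="sqrt", random_state=seed
    )
    tree.fit(X_tr, y_tr)
    return tree.score(X_te, y_te)

first_run = fit_and_score(0)
second_run = fit_and_score(0)
print(first_run, second_run)   # identical, because the seed is the same
```

The `random_state` passed to `train_test_split` only fixes the split; the classifier needs its own seed for the accuracies in a sweep to be reproducible.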


## Conclusion

In this lab, we studied the breast cancer dataset and its variables, preprocessed the data by dropping the `id` and `Unnamed: 32` columns, and explored it for insights. We then implemented and evaluated three classification models. On a 70/30 split with `random_state = 8`, Logistic Regression reached an accuracy of about 95.3%, K-Nearest Neighbours (k = 2) about 93.6%, and Decision Tree about 93.0%. Varying the test size, random state and model parameters moved the accuracies in the sampled results between roughly 0.89 and 0.97. Models of this kind can help doctors predict whether a patient's breast mass is malignant or benign.