# PROBLEM STATEMENT

In [None]:
Citation Request:
   This breast cancer database was obtained from the University of Wisconsin
   Hospitals, Madison from Dr. William H. Wolberg.  If you publish results
   when using this database, then please include this information in your
   acknowledgements.  Also, please cite one or more of:

   1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear 
      programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

   2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of 
      pattern separation for medical diagnosis applied to breast cytology", 
      Proceedings of the National Academy of Sciences, U.S.A., Volume 87, 
      December 1990, pp 9193-9196.

   3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition 
      via linear programming: Theory and application to medical diagnosis", 
      in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
      Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

   4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming 
      discrimination of two linearly inseparable sets", Optimization Methods
      and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

1. Title: Wisconsin Breast Cancer Database (January 8, 1991)

2. Sources:
   -- Dr. William H. Wolberg (physician)
      University of Wisconsin Hospitals
      Madison, Wisconsin
      USA
   -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
      Received by David W. Aha (aha@cs.jhu.edu)
   -- Date: 15 July 1992

3. Past Usage:

   Attributes 2 through 10 have been used to represent instances.
   Each instance has one of 2 possible classes: benign or malignant.

   1. Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of 
      pattern separation for medical diagnosis applied to breast cytology. In
      Proceedings of the National Academy of Sciences, 87,
      9193-9196.
      -- Size of data set: only 369 instances (at that point in time)
      -- Collected classification results: 1 trial only
      -- Two pairs of parallel hyperplanes were found to be consistent with
         50% of the data
         -- Accuracy on remaining 50% of dataset: 93.5%
      -- Three pairs of parallel hyperplanes were found to be consistent with
         67% of data
         -- Accuracy on remaining 33% of dataset: 95.9%

   2. Zhang, J. (1992). Selecting typical instances in instance-based
      learning.  In Proceedings of the Ninth International Machine
      Learning Conference (pp. 470-479).  Aberdeen, Scotland: Morgan
      Kaufmann.
      -- Size of data set: only 369 instances (at that point in time)
      -- Applied 4 instance-based learning algorithms 
      -- Collected classification results averaged over 10 trials
      -- Best accuracy result: 
         -- 1-nearest neighbor: 93.7%
         -- trained on 200 instances, tested on the other 169
      -- Also of interest:
         -- Using only typical instances: 92.2% (storing only 23.1 instances)
         -- trained on 200 instances, tested on the other 169

4. Relevant Information:

   Samples arrive periodically as Dr. Wolberg reports his clinical cases.
   The database therefore reflects this chronological grouping of the data.
   This grouping information appears immediately below, having been removed
   from the data itself:

     Group 1: 367 instances (January 1989)
     Group 2:  70 instances (October 1989)
     Group 3:  31 instances (February 1990)
     Group 4:  17 instances (April 1990)
     Group 5:  48 instances (August 1990)
     Group 6:  49 instances (Updated January 1991)
     Group 7:  31 instances (June 1991)
     Group 8:  86 instances (November 1991)
     -----------------------------------------
     Total:   699 points (as of the donated database on 15 July 1992)

   Note that the results summarized above in Past Usage refer to a dataset
   of size 369, while Group 1 has only 367 instances.  This is because it
   originally contained 369 instances; 2 were removed.  The following
   statements summarize changes to the original Group 1's set of data:

   #####  Group 1 : 367 points: 200B 167M (January 1989)
   #####  Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
   #####  Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
   #####                  : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
   #####                  : Changed 0 to 1 in field 6 of sample 1219406
   #####                  : Changed 0 to 1 in field 8 of following sample:
   #####                  : 1182404,2,3,1,1,1,2,0,1,1,1

5. Number of Instances: 699 (as of 15 July 1992)

6. Number of Attributes: 10 plus the class attribute

7. Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        (2 for benign, 4 for malignant)

8. Missing attribute values: 16

   There are 16 instances in Groups 1 to 6 that contain a single missing 
   (i.e., unavailable) attribute value, now denoted by "?".  

9. Class distribution:
 
   Benign: 458 (65.5%)
   Malignant: 241 (34.5%)


# IMPORTING BASIC LIBRARIES

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# READING THE DATASET

In [74]:
data=pd.read_table(r"C:\Users\Pragya\Downloads\breast+cancer+wisconsin+original\breast-cancer-wisconsin.data",header=None,delimiter=",")
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3,1,1,1,3,2,1,1,1,2
695,841769,2,1,1,1,2,1,1,1,1,2
696,888820,5,10,10,3,7,3,8,10,2,4
697,897471,4,8,6,4,3,4,10,6,1,4


# ASSIGNING COLUMN NAMES

In [75]:
data.columns=['Sample_code_number','Clump_Thickness','Uniformity_of_Cell_Size','Uniformity_of_Cell_Shape',
               'Marginal_Adhesion','Single_Epithelial_Cell_Size','Bare_Nuclei','Bland_Chromatin','Normal_Nucleoli',
               'Mitoses','Class']     
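The two steps above (reading the file, then naming the columns) can also be done in a single call. The sketch below is one possible variant, not the notebook's own code: it reads from an in-memory sample instead of the local file path, and the third sample row is a hypothetical record added only to show how `na_values="?"` turns the missing-value marker into `NaN` at load time.

```python
import io
import pandas as pd

COLUMNS = ['Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
           'Uniformity_of_Cell_Shape', 'Marginal_Adhesion',
           'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin',
           'Normal_Nucleoli', 'Mitoses', 'Class']

# In-memory stand-in for breast-cancer-wisconsin.data.
# The first two rows appear in the dataset; the third is hypothetical.
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "9999999,8,4,5,1,2,?,7,3,1,4\n"
)

# names= assigns the column labels, na_values= maps "?" to NaN while parsing
df = pd.read_csv(sample, header=None, names=COLUMNS, na_values="?")

print(df.dtypes["Bare_Nuclei"])        # numeric dtype instead of object
print(df["Bare_Nuclei"].isna().sum())  # number of missing entries
```

Loading the column as numeric from the start avoids a separate replace step and keeps the 1 to 10 scale intact.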

# EDA

**1. HEAD**

In [44]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


**2. SHAPE**

In [45]:
data.shape

(699, 11)

Shape of data set is:  
--> 699 rows   
--> 11 columns

**3. COLUMNS**

In [53]:
data.columns

Index(['Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
       'Uniformity_of_Cell_Shape', 'Marginal_Adhesion',
       'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin',
       'Normal_Nucleoli', 'Mitoses', 'Class'],
      dtype='object')

**4. DUPLICATED VALUES**

In [12]:
data.duplicated().sum()

235

I am not dropping the duplicate entries because the dataset is already very small.  
That is why I am keeping every row to preserve the data.
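The number reported by `duplicated()` depends on which columns are compared. As a small illustration on made-up data (the frame `toy` below is hypothetical and not taken from the dataset), counting duplicates with and without the identifier column can give different results:

```python
import pandas as pd

# Hypothetical toy frame: two rows share an ID, and feature values repeat
toy = pd.DataFrame({
    "Sample_code_number": [101, 102, 103, 103],
    "Clump_Thickness":    [5,   5,   3,   3],
    "Class":              [2,   2,   2,   2],
})

# Rows that repeat an earlier row in every column, ID included
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier row once the ID column is ignored
feature_dupes = int(toy.drop(columns="Sample_code_number").duplicated().sum())

print(full_row_dupes, feature_dupes)
```

With small integer-valued features, many distinct samples can share the same feature vector, so a high feature-only count does not by itself mean the records are erroneous copies.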

**5. DESCRIPTIVE STATISTICS**

In [46]:
data.describe()

Unnamed: 0,0,1,2,3,4,5,7,8,9,10
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,1071704.0,4.41774,3.134478,3.207439,2.806867,3.216023,3.437768,2.866953,1.589413,2.689557
std,617095.7,2.815741,3.051459,2.971913,2.855379,2.2143,2.438364,3.053634,1.715078,0.951273
min,61634.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,870688.5,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0
50%,1171710.0,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0
75%,1238298.0,6.0,5.0,5.0,4.0,4.0,5.0,4.0,1.0,4.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


**6. SUMMARIZED INFORMATION**

In [47]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       699 non-null    int64 
 1   1       699 non-null    int64 
 2   2       699 non-null    int64 
 3   3       699 non-null    int64 
 4   4       699 non-null    int64 
 5   5       699 non-null    int64 
 6   6       699 non-null    object
 7   7       699 non-null    int64 
 8   8       699 non-null    int64 
 9   9       699 non-null    int64 
 10  10      699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB


**7. DATA TYPES**

In [48]:
data.dtypes

0      int64
1      int64
2      int64
3      int64
4      int64
5      int64
6     object
7      int64
8      int64
9      int64
10     int64
dtype: object

**8. ANOMALY DETECTION**

In [13]:
for i in data.columns:
    print({i:data[i].unique()})

{('Clump_Thickness',): array([ 5,  3,  6,  4,  8,  1,  2,  7, 10,  9], dtype=int64)}
{('Uniformity_of_Cell_Size',): array([ 4,  1,  8, 10,  2,  3,  7,  5,  6,  9], dtype=int64)}
{('Uniformity_of_Cell_Shape',): array([ 4,  1,  8, 10,  2,  3,  5,  6,  7,  9], dtype=int64)}
{('Marginal_Adhesion',): array([ 5,  1,  3,  8, 10,  4,  6,  2,  9,  7], dtype=int64)}
{('Single_Epithelial_Cell_Size',): array([ 7,  2,  3,  1,  6,  4,  5,  8, 10,  9], dtype=int64)}
{('Bare_Nuclei',): array(['10', '2', '4', '1', '3', '9', '7', '?', '5', '8', '6'],
      dtype=object)}
{('Bland_Chromatin',): array([ 3,  9,  1,  2,  4,  5,  7,  8,  6, 10], dtype=int64)}
{('Normal_Nucleoli',): array([ 2,  1,  7,  4,  5,  3, 10,  6,  9,  8], dtype=int64)}
{('Mitoses',): array([ 1,  5,  4,  2,  3,  7, 10,  8,  6], dtype=int64)}
{('Class',): array([2, 4], dtype=int64)}


Here an anomaly is detected in the form of "?" in the Bare_Nuclei column.  
We need to deal with it.  
So I am replacing "?" with np.nan and filling it with the mode value.

**9. CODE FOR REPLACING "?" WITH NP.NAN**

In [76]:
data.replace("?",np.nan,inplace=True)
data.isnull().sum()

Sample_code_number              0
Clump_Thickness                 0
Uniformity_of_Cell_Size         0
Uniformity_of_Cell_Shape        0
Marginal_Adhesion               0
Single_Epithelial_Cell_Size     0
Bare_Nuclei                    16
Bland_Chromatin                 0
Normal_Nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64

**10. MISSING VALUES**

In [77]:
data.isnull().sum()

Sample_code_number              0
Clump_Thickness                 0
Uniformity_of_Cell_Size         0
Uniformity_of_Cell_Shape        0
Marginal_Adhesion               0
Single_Epithelial_Cell_Size     0
Bare_Nuclei                    16
Bland_Chromatin                 0
Normal_Nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64

# DATA PREPROCESSING

**FEATURE SELECTION**

In [78]:
data.drop(['Sample_code_number'],axis=1,inplace=True)

Here I am dropping Sample_code_number because it is a high-cardinality identifier, not a predictive feature.

**HANDLING MISSING VALUES WITH MODE (CATEGORICAL) OR MEDIAN (NUMERICAL)**

In [79]:
for x in data.columns:
    if data[x].dtype=='object' or data[x].dtype=='bool':      # categorical (here: Bare_Nuclei): fill with the mode
        data[x]=data[x].fillna(data[x].mode()[0])
    elif data[x].dtype=='int64' or data[x].dtype=='float64':    # numerical: fill with the rounded median
        data[x]=data[x].fillna(round(data[x].median()))
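To make the mode-based fill concrete, here is a minimal sketch on a made-up column (the values in `col` are hypothetical). It assigns the result of `fillna` back to a variable rather than relying on `inplace=True` on a column selection, which newer pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Hypothetical string-typed column with one missing entry
col = pd.Series(["1", "10", np.nan, "1", "2"], name="Bare_Nuclei")

most_common = col.mode()[0]       # most frequent value, ignoring NaN
filled = col.fillna(most_common)  # returns a new Series; col is unchanged

print(most_common)
print(filled.tolist())
```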

**RECHECKING MISSING VALUE**

In [80]:
data.isnull().sum()

Clump_Thickness                0
Uniformity_of_Cell_Size        0
Uniformity_of_Cell_Shape       0
Marginal_Adhesion              0
Single_Epithelial_Cell_Size    0
Bare_Nuclei                    0
Bland_Chromatin                0
Normal_Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

**CONVERTING CATEGORICAL DATA TO NUMERICAL DATA**

Here I am using LabelEncoder for the encoding.

In [81]:
colname=[]
for x in data.columns:
    if data[x].dtype=='object':
        colname.append(x)
print(colname)



# Apply fit_transform to every object-typed column to convert it from categorical to numerical.
# Note: LabelEncoder sorts string labels lexicographically, so '10' is encoded as 1, not 10.
from sklearn.preprocessing import LabelEncoder
 
le=LabelEncoder()
 
for x in colname:
    data[x]=le.fit_transform(data[x])


['Bare_Nuclei']
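Because Bare_Nuclei holds digits stored as strings, LabelEncoder orders the labels as text, which places '10' between '1' and '2'. The sketch below shows that effect next to `pd.to_numeric`, an alternative that keeps the original 1 to 10 scale. It is offered as an option to consider, not as the method used for the results in this notebook; the list `values` is a small illustrative sample.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative sample containing every level of the 1-10 scale as strings
values = ["1", "10", "2", "4", "1", "3", "5", "6", "7", "8", "9"]

# LabelEncoder assigns codes in sorted (lexicographic) label order
encoded = LabelEncoder().fit_transform(values)

# pd.to_numeric parses the digits, so the ordinal scale is preserved;
# errors="coerce" would turn a non-numeric marker such as "?" into NaN
numeric = pd.to_numeric(pd.Series(values), errors="coerce")

print(list(encoded[:5]))
print(numeric[:5].tolist())
```

The first five encoded values match the Bare_Nuclei column shown in the `data.head()` output of this notebook, where the original value 10 appears as 1.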


In [60]:
data.head()

Unnamed: 0,Sample_code_number,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,0,3,1,1,2
1,1002945,5,4,4,5,7,1,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,0,3,1,1,2


**CREATING X AND Y**

In [82]:
X = data.values[:,0:-1]
Y = data.values[:,-1]

**CHECKING SHAPE OF X AND Y**

In [83]:
print(X.shape)
print(Y.shape)

(699, 9)
(699,)


**SCALING THE DATA**

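For scaling I am using StandardScaler. Note that the transformed array below is only displayed, not assigned back to X, so the later cells work with the unscaled values.

In [84]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()     
scaler.fit(X)                 # learn the per-column mean and standard deviation
scaler.transform(X)           # returns a scaled copy; X itself is left unchanged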


array([[ 0.20693572, -0.69999505, -0.74329904, ..., -0.17966213,
        -0.61182504, -0.34391178],
       [ 0.20693572,  0.28384518,  0.2668747 , ..., -0.17966213,
        -0.28411186, -0.34391178],
       [-0.50386559, -0.69999505, -0.74329904, ..., -0.17966213,
        -0.61182504, -0.34391178],
       ...,
       [ 0.20693572,  2.25152563,  2.28722218, ...,  1.87236122,
         2.33759359,  0.23956962],
       [-0.14846494,  1.59563215,  0.94032386, ...,  2.69317056,
         1.02674087, -0.34391178],
       [-0.14846494,  1.59563215,  1.61377302, ...,  2.69317056,
         0.37131451, -0.34391178]])
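A common variant is to fit the scaler on the training split only and to store the transformed arrays in variables, so that the models actually receive scaled inputs and no test-set statistics leak into training. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the notebook's X and Y).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 rows, 9 integer features on a 1-10 scale
rng = np.random.default_rng(10)
X_demo = rng.integers(1, 11, size=(50, 9)).astype(float)
y_demo = rng.choice([2, 4], size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit on training rows only, keep the result
X_te_scaled = scaler.transform(X_te)      # reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(X_tr_scaled.std(axis=0).round(6))   # approximately 1 per column
```

Tree-based models are largely insensitive to feature scaling, while distance-based and margin-based models such as KNN and SVC usually are not.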

**SPLITTING DATA INTO TEST AND TRAIN**

In [85]:
# split the data into train (80%) and test (20%) sets
from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state = 10)  

**CHECKING SHAPE OF TRAIN AND TEST**

In [86]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(559, 9)
(140, 9)
(559,)
(140,)


# MODEL BUILDING

# LOGISTIC REGRESSION

**CREATING MODEL OBJECT**

In [87]:
from sklearn.linear_model import LogisticRegression
# create a model
classifier=LogisticRegression(random_state=10)

**FITTING MODEL OBJECT**

In [88]:
# fitting training data to the model
classifier.fit(X_train,Y_train)

**GENERATING BETA PARAMETERS**

In [90]:
# intercept and coefficients of the fitted logistic regression equation
print(classifier.intercept_)   
print(classifier.coef_)     

[-10.36793979]
[[ 0.6069971  -0.01324661  0.49332059  0.25846548  0.11772542  0.4312262
   0.51954504  0.11566553  0.78105356]]
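The intercept and coefficients define a linear score z = b0 + b1*x1 + ... + b9*x9, and the predicted probability of the second class is 1 / (1 + exp(-z)). The following sketch checks that relationship on synthetic data (`X_demo` and `y_demo` are assumptions made for illustration, with three features instead of nine).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: label 4 when the feature sum is large, else 2
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(200, 3)).astype(float)
y_demo = np.where(X_demo.sum(axis=1) > 16, 4, 2)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score from the fitted parameters, then the logistic function
z = clf.intercept_[0] + X_demo @ clf.coef_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba corresponds to clf.classes_[1]
p_sklearn = clf.predict_proba(X_demo)[:, 1]

print(clf.classes_)
print(np.allclose(p_manual, p_sklearn))
```

A positive coefficient therefore raises the predicted probability of the class listed second in `classes_` as that feature increases.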


**PREDICTING VALUE OF TEST DATA**

In [91]:
Y_pred=classifier.predict(X_test)
print(Y_pred)     

[4 2 2 2 2 4 2 2 4 2 4 2 2 2 2 2 2 2 4 4 2 4 4 2 2 4 4 4 2 2 2 2 4 4 2 4 4
 2 2 4 2 2 2 2 4 2 2 4 2 4 2 2 2 2 2 2 4 2 4 4 2 2 2 2 4 4 2 2 2 2 2 2 2 2
 2 4 4 4 2 4 2 2 2 2 2 2 2 2 4 2 4 2 4 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2
 2 4 2 4 4 2 2 4 2 2 2 2 4 2 2 2 2 4 4 4 2 2 2 4 2 4 2 2 2]


**GENERATING PROBABILITY MATRIX**

In [93]:
# probability matrix 
Y_pred_prob = classifier.predict_proba(X_test)
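Each row of the probability matrix holds one probability per class, in the order given by `classes_` (here 2, then 4). `predict` picks the more probable class, but the matrix also allows a custom cut-off, for example to flag more cases as malignant. The probabilities and the 0.30 threshold below are hypothetical values chosen for illustration.

```python
import numpy as np

classes = np.array([2, 4])       # 2 = benign, 4 = malignant

# Hypothetical probability matrix: columns follow the order of `classes`
proba = np.array([[0.90, 0.10],
                  [0.65, 0.35],
                  [0.20, 0.80]])

# Default behaviour: pick the class with the larger probability
default_pred = classes[proba.argmax(axis=1)]

# Custom rule: call a sample malignant once P(malignant) reaches the threshold
threshold = 0.30                 # hypothetical, would need validation
cautious_pred = np.where(proba[:, 1] >= threshold, 4, 2)

print(default_pred)
print(cautious_pred)
```

Lowering the threshold trades more false positives for fewer missed malignant cases.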

**GENERATING EVALUATION MATRIX**

In [94]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
# confusion matrix
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))


# accuracy_score
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)


# The model is 95.71% accurate: of the 140 test observations,
# 134 are predicted correctly and 6 are misclassified.

[[95  3]
 [ 3 39]]
Classification report: 
              precision    recall  f1-score   support

           2       0.97      0.97      0.97        98
           4       0.93      0.93      0.93        42

    accuracy                           0.96       140
   macro avg       0.95      0.95      0.95       140
weighted avg       0.96      0.96      0.96       140

Accuracy of the model:  0.9571428571428572
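The figures in the report can be reproduced by hand from the confusion matrix, whose rows are the actual classes (2, 4) and whose columns are the predicted classes. A short worked calculation, treating class 4 (malignant) as the positive class:

```python
import numpy as np

# Confusion matrix printed above: rows = actual (2, 4), columns = predicted (2, 4)
cfm = np.array([[95, 3],
                [3, 39]])

tn, fp, fn, tp = cfm.ravel()

accuracy = (tp + tn) / cfm.sum()      # (39 + 95) / 140
precision_malignant = tp / (tp + fp)  # 39 / 42
recall_malignant = tp / (tp + fn)     # 39 / 42

print(round(accuracy, 4))
print(round(precision_malignant, 2), round(recall_malignant, 2))
```

In a screening setting, the three false negatives (malignant samples predicted as benign) are usually the more costly kind of error, so recall for class 4 deserves attention alongside accuracy.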


I am not tuning this model because it already gives good accuracy.

# DECISION TREE

**CREATING MODEL OBJECT**

In [106]:
#predicting using the Decision_Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DT=DecisionTreeClassifier(random_state=10, 
                                         criterion="gini")

**FITTING MODEL OBJECT**

In [107]:
model_DT.fit(X_train,Y_train)

**PREDICTING VALUE OF X_TEST**

In [108]:
Y_pred=model_DT.predict(X_test)
#print(Y_pred)

**COMPARING Y_ACTUAL WITH Y_PREDICTED**

In [None]:
#print(list(zip(Y_test,Y_pred)))

**GENERATING EVALUATION MATRIX**

In [109]:
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
#confusion matrix
print(confusion_matrix(Y_test,Y_pred))
print(accuracy_score(Y_test,Y_pred))
print(classification_report(Y_test,Y_pred))

[[97  1]
 [ 5 37]]
0.9571428571428572
              precision    recall  f1-score   support

           2       0.95      0.99      0.97        98
           4       0.97      0.88      0.93        42

    accuracy                           0.96       140
   macro avg       0.96      0.94      0.95       140
weighted avg       0.96      0.96      0.96       140



**CHECKING WHETHER THE MODEL IS OVERFITTED**

In [110]:
model_DT.score(X_train,Y_train)

1.0

The training accuracy is 100% while the test accuracy is 95.7%, which suggests the tree is overfitted, so I need to prune it.

# PRUNING DECISION TREE

In [112]:
#predicting using the Decision_Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DT=DecisionTreeClassifier(random_state=10, 
                                         criterion="gini",
                                         splitter="best", 
                                         min_samples_leaf=3,
                                         min_samples_split=5,
                                         max_depth=10, 
                                        #max_leaf_nodes=100,
                                         # max_features=0.6
                                         )
#min_samples_leaf, min_samples_split, max_depth, max_features, max_leaf_nodes

#fit the model on the data and predict the values
model_DT.fit(X_train,Y_train)
Y_pred=model_DT.predict(X_test)


from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
#confusion matrix
print(confusion_matrix(Y_test,Y_pred))
print(accuracy_score(Y_test,Y_pred))
print(classification_report(Y_test,Y_pred))

[[95  3]
 [ 5 37]]
0.9428571428571428
              precision    recall  f1-score   support

           2       0.95      0.97      0.96        98
           4       0.93      0.88      0.90        42

    accuracy                           0.94       140
   macro avg       0.94      0.93      0.93       140
weighted avg       0.94      0.94      0.94       140



**CHECKING AGAIN FOR OVERFITTING**

In [113]:
model_DT.score(X_train,Y_train)

0.9767441860465116

The training accuracy is now 97.7% instead of 100%, so the pruned tree no longer memorizes the training set, although its test accuracy fell slightly to 94.3%.
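A single train/test comparison can be noisy on a dataset this small. One way to make the overfitting check more robust is to compare training and validation accuracy across several folds. The sketch below does this with `cross_validate` on synthetic data from `make_classification` (an assumption standing in for X and Y), using the same pruning parameters as above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the nine-feature dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=10)

candidates = {
    "unpruned": DecisionTreeClassifier(random_state=10),
    "pruned": DecisionTreeClassifier(random_state=10, min_samples_leaf=3,
                                     min_samples_split=5, max_depth=10),
}

scores = {}
for name, tree in candidates.items():
    cv = cross_validate(tree, X_demo, y_demo, cv=5, return_train_score=True)
    # Mean accuracy on the training folds and on the held-out folds
    scores[name] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(name, "train:", round(scores[name][0], 3),
          "validation:", round(scores[name][1], 3))
```

A large gap between the two means points to overfitting; a training score below 100% alone does not rule it out.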

# RANDOM FOREST

In [116]:
#predicting using the Random_Forest_Classifier
from sklearn.ensemble import RandomForestClassifier

model_RandomForest=RandomForestClassifier(n_estimators=100,
                                          random_state=10, bootstrap=True,
                                         n_jobs=-1)

#fit the model on the data and predict the values
model_RandomForest.fit(X_train,Y_train)

Y_pred=model_RandomForest.predict(X_test)


from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
#confusion matrix
print(confusion_matrix(Y_test,Y_pred))
print(accuracy_score(Y_test,Y_pred))
print(classification_report(Y_test,Y_pred))

[[95  3]
 [ 0 42]]
0.9785714285714285
              precision    recall  f1-score   support

           2       1.00      0.97      0.98        98
           4       0.93      1.00      0.97        42

    accuracy                           0.98       140
   macro avg       0.97      0.98      0.97       140
weighted avg       0.98      0.98      0.98       140



# EXTRATREES

In [117]:
#predicting using the Extra_Trees_Classifier
from sklearn.ensemble import ExtraTreesClassifier

model_EXT=ExtraTreesClassifier(n_estimators=300, random_state=10, bootstrap=True)

#fit the model on the data and predict the values
model_EXT.fit(X_train,Y_train)

Y_pred=model_EXT.predict(X_test)



from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
#confusion matrix
print(confusion_matrix(Y_test,Y_pred))
print(accuracy_score(Y_test,Y_pred))
print(classification_report(Y_test,Y_pred))

[[95  3]
 [ 0 42]]
0.9785714285714285
              precision    recall  f1-score   support

           2       1.00      0.97      0.98        98
           4       0.93      1.00      0.97        42

    accuracy                           0.98       140
   macro avg       0.97      0.98      0.97       140
weighted avg       0.98      0.98      0.98       140



# COMBINING MULTIPLE MODELS AT ONCE

In [118]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# first, initialize the classifiers
tree= DecisionTreeClassifier(random_state=10) # using the random state for reproducibility
knn= KNeighborsClassifier(n_neighbors=5,metric='euclidean')
svm= SVC(kernel="rbf", gamma=0.1, C=90,random_state=10)
logreg=LogisticRegression(multi_class="multinomial",random_state=10)



# now, create a list with the objects 
models= [tree, knn, svm, logreg]



from sklearn.metrics import confusion_matrix, accuracy_score,classification_report

for model in models:
    model.fit(X_train, Y_train) # fit the model
    Y_pred= model.predict(X_test) # then predict on the test set
    accuracy= accuracy_score(Y_test, Y_pred) 
    clf_report= classification_report(Y_test, Y_pred) 
    print(confusion_matrix(Y_test,Y_pred))
    print("The accuracy of the ",type(model).__name__, " model is ", accuracy*100 )
    print("Classification report:\n", clf_report)
    print("\n")

[[97  1]
 [ 5 37]]
The accuracy of the  DecisionTreeClassifier  model is  95.71428571428572
Classification report:
               precision    recall  f1-score   support

           2       0.95      0.99      0.97        98
           4       0.97      0.88      0.93        42

    accuracy                           0.96       140
   macro avg       0.96      0.94      0.95       140
weighted avg       0.96      0.96      0.96       140



[[95  3]
 [ 1 41]]
The accuracy of the  KNeighborsClassifier  model is  97.14285714285714
Classification report:
               precision    recall  f1-score   support

           2       0.99      0.97      0.98        98
           4       0.93      0.98      0.95        42

    accuracy                           0.97       140
   macro avg       0.96      0.97      0.97       140
weighted avg       0.97      0.97      0.97       140



[[92  6]
 [ 0 42]]
The accuracy of the  SVC  model is  95.71428571428572
Classification report:
               p

# CONCLUSION

**LOGISTIC REGRESSION**   

Classification report: 
              precision    recall  f1-score   support

           2       0.97      0.97      0.97        98
           4       0.93      0.93      0.93        42

    accuracy                           0.96       140
   macro avg       0.95      0.95      0.95       140
weighted avg       0.96      0.96      0.96       140

Accuracy of the model:  0.9571428571428572

**DECISION TREE**  

ACCURACY OF THE MODEL:  0.9571428571428572  
              precision    recall  f1-score   support

           2       0.95      0.99      0.97        98
           4       0.97      0.88      0.93        42

    accuracy                           0.96       140
   macro avg       0.96      0.94      0.95       140
weighted avg       0.96      0.96      0.96       140


**PRUNED DECISION TREE**  

ACCURACY OF MODEL: 0.9428571428571428  
              precision    recall  f1-score   support

           2       0.95      0.97      0.96        98
           4       0.93      0.88      0.90        42

    accuracy                           0.94       140
   macro avg       0.94      0.93      0.93       140
weighted avg       0.94      0.94      0.94       140


**EXTRA TREES**  

ACCURACY OF MODEL : 0.9785714285714285  
              precision    recall  f1-score   support

           2       1.00      0.97      0.98        98
           4       0.93      1.00      0.97        42

    accuracy                           0.98       140
   macro avg       0.97      0.98      0.97       140
weighted avg       0.98      0.98      0.98       140


**RANDOM FOREST**  

ACCURACY OF MODEL : 0.9785714285714285  
              precision    recall  f1-score   support

           2       1.00      0.97      0.98        98
           4       0.93      1.00      0.97        42

    accuracy                           0.98       140
   macro avg       0.97      0.98      0.97       140
weighted avg       0.98      0.98      0.98       140


OUT OF ALL THE MODELS, EXTRA TREES AND RANDOM FOREST GIVE THE HIGHEST ACCURACY (97.86%).
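The comparison can be summarized in one table built from the confusion matrices printed in the model sections. The sketch below computes accuracy and malignant-class recall for each model; the matrices are copied from the outputs in this notebook, while the table layout and the display names are choices made for this illustration. The logistic regression from the combined loop is left out because its output is not shown in full.

```python
import numpy as np
import pandas as pd

# Test-set confusion matrices from the outputs in this notebook
# (rows: actual 2/4, columns: predicted 2/4)
matrices = {
    "Logistic Regression":  [[95, 3], [3, 39]],
    "Decision Tree":        [[97, 1], [5, 37]],
    "Pruned Decision Tree": [[95, 3], [5, 37]],
    "Random Forest":        [[95, 3], [0, 42]],
    "Extra Trees":          [[95, 3], [0, 42]],
    "KNN":                  [[95, 3], [1, 41]],
    "SVC":                  [[92, 6], [0, 42]],
}

rows = []
for name, m in matrices.items():
    m = np.array(m)
    rows.append({
        "model": name,
        "accuracy": np.trace(m) / m.sum(),         # correct / total
        "malignant_recall": m[1, 1] / m[1].sum(),  # TP / (TP + FN)
    })

summary = pd.DataFrame(rows).sort_values(
    "accuracy", ascending=False, ignore_index=True)
print(summary.round(4))
```

All figures come from one 140-sample test split, so small differences between models may not hold on a different split.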