# Option 2 Python Homework 5
## Mammographic Dataset

This dataset contains 961 instances of masses detected in mammograms, with the following
attributes:
1. BI-RADS assessment: 1 to 5 (ordinal)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=No or malignant=Yes (binary)

BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive"
attribute, so we discard it. The age, shape, margin, and density attributes are the features
we build our model with, and "severity" is the class we attempt to predict from those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't handle
well, they are close enough to ordinal that we shouldn't just discard them. The "shape" codes, for
example, are ordered increasingly from round to irregular.
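
For comparison, the usual alternative to treating nominal codes as ordinal is one-hot encoding. The sketch below is not what this notebook does; it only illustrates the idea with `pandas.get_dummies` on a few hypothetical rows that reuse the Shape and Margin codes listed above.

```python
import pandas as pd

# Hypothetical toy rows (not the real file) using the Shape/Margin codes above.
toy = pd.DataFrame({
    "Age": [67, 58, 28],
    "Shape": [3, 4, 1],
    "Margin": [5, 5, 1],
})

# One indicator column per observed category; Age is left untouched.
encoded = pd.get_dummies(toy, columns=["Shape", "Margin"])
print(encoded.columns.tolist())
```

One-hot encoding removes the implied ordering at the cost of more columns, so whether it helps here would have to be checked with cross validation.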

#### Importing required libraries and loading the data

Since BI-RADS is not a "predictive" attribute, the column is left out while loading the file.

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier  
data = pd.read_csv(
    "mammographic.csv",
    names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'],
    usecols=['Age', 'Shape', 'Margin', 'Density', 'Severity'],
)
data.head()

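|   | Age | Shape | Margin | Density | Severity |
|---|-----|-------|--------|---------|----------|
| 0 | 67  | 3     | 5      | 3       | yes      |
| 1 | 43  | 1     | 1      | ?       | yes      |
| 2 | 58  | 4     | 5      | 3       | yes      |
| 3 | 28  | 1     | 1      | 3       | no       |
| 4 | 74  | 1     | 5      | ?       | yes      |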


#### Converting Field Severity to numerical Data

In [2]:
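def severity_to_numeric(x):
    # Map the class label to 1 (malignant) or 0 (benign).
    if x == 'yes':
        return 1
    if x == 'no':
        return 0
    # Any other label falls through and returns None.
data['Severity'] = data['Severity'].apply(severity_to_numeric)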
data.head()

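|   | Age | Shape | Margin | Density | Severity |
|---|-----|-------|--------|---------|----------|
| 0 | 67  | 3     | 5      | 3       | 1        |
| 1 | 43  | 1     | 1      | ?       | 1        |
| 2 | 58  | 4     | 5      | 3       | 1        |
| 3 | 28  | 1     | 1      | 3       | 0        |
| 4 | 74  | 1     | 5      | ?       | 1        |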


#### Converting "?" to NaN from NumPy library

The data needs to be cleaned: many rows contain missing data. Some columns need to be
transformed to numerical data. Techniques such as KNN also require the input data to be
normalized first. (Hint: use preprocessing.StandardScaler()). Show your data after it has been
preprocessed. If none of the techniques described below achieves around 80% accuracy,
examine your data again to see if there is anything you can improve.

Apply the following supervised learning techniques to your preprocessed data set, and see which
one yields the highest accuracy as measured with K-Fold cross validation (K=10).

In [3]:
data1 = data.replace("?",np.nan)
data1.head()

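|   | Age | Shape | Margin | Density | Severity |
|---|-----|-------|--------|---------|----------|
| 0 | 67  | 3     | 5      | 3.0     | 1        |
| 1 | 43  | 1     | 1      | NaN     | 1        |
| 2 | 58  | 4     | 5      | 3.0     | 1        |
| 3 | 28  | 1     | 1      | 3.0     | 0        |
| 4 | 74  | 1     | 5      | NaN     | 1        |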
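
The "?" placeholders can also be turned into NaN while the file is parsed, using the `na_values` argument of `pd.read_csv`. The sketch below uses a small in-memory stand-in for `mammographic.csv`; the three rows (including the BI-RADS values) are hypothetical and only mimic the column layout.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for mammographic.csv (same column order).
raw = io.StringIO(
    "5,67,3,5,3,yes\n"
    "4,43,1,1,?,yes\n"
    "5,58,4,5,3,yes\n"
)
cols = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]

# na_values="?" converts the placeholder to NaN during parsing, so Density
# is read as a numeric (float) column instead of a column of strings.
df = pd.read_csv(raw, names=cols, usecols=cols[1:], na_values="?")
print(df.dtypes)
print(df.isna().sum())
```

Reading the columns as numbers from the start avoids carrying string values such as `'67'` into the later cells.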


#### Dropping all the rows containing NaN in them

In [4]:
dataA = data1.dropna()
print(dataA.head())

   Age Shape Margin Density  Severity
0   67     3      5       3         1
2   58     4      5       3         1
3   28     1      1       3         0
8   57     1      5       3         1
10  76     1      4       3         1
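
Before dropping rows it can help to see how many values are missing per column and how many rows are lost. The sketch below shows this on a hypothetical four-row frame; it also casts the remaining string values to integers, which the notebook itself leaves to the scaler.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: numeric values stored as strings, one missing Density.
toy = pd.DataFrame({
    "Age": ["67", "43", "58", "28"],
    "Density": ["3", np.nan, "3", "3"],
    "Severity": [1, 1, 1, 0],
})

print(toy.isna().sum())                      # missing values per column

# Drop incomplete rows, then make the numeric types explicit.
clean = toy.dropna().astype({"Age": int, "Density": int})
print(len(toy) - len(clean), "row(s) dropped")
print(clean.dtypes)
```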


#### Dividing input features and target variable 

#### Standardizing features by removing the mean and scaling to unit variance 

In [5]:
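# The first four columns are the input features; the last one is the target.
fields = list(dataA.columns[0:4])
factors = dataA[fields].values
print(factors)
severity = list(dataA.columns[4:5])
targetVariable = dataA[severity].values
targetVariable = targetVariable.ravel()  # flatten (n, 1) to (n,)
print(targetVariable)
# The feature values are still strings here; the scaler casts them to floats.
scaler  = StandardScaler()
factors = scaler.fit_transform(factors)
print(factors)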

[['67' '3' '5' '3']
 ['58' '4' '5' '3']
 ['28' '1' '1' '3']
 ...
 ['64' '4' '5' '3']
 ['66' '4' '5' '3']
 ['62' '3' '3' '3']]
[1 1 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 1 0
 1 0 0 0 0 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0
 1 1 1 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1
 0 0 1 0 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0
 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 1 1 1 1
 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1
 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1
 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1
 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 0 0 1 0 1 0 1 1 1 0 1 1 1 1 0 0 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 0 0 1 1 1 0 1 0 0 1 1
 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1
 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 0 1 0 0 0 0 1 0 
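
As a small worked check of what "removing the mean and scaling to unit variance" means, the sketch below compares `StandardScaler` with the formula `(x - mean) / std` computed by hand per column. The three rows are hypothetical values, not taken from the cleaned data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature rows (two columns only, for readability).
X = np.array([[67.0, 3.0], [58.0, 4.0], [28.0, 1.0]])

# Column-wise standardization by hand; np.std defaults to the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```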

#### Train data and Test data split with 3:1 ratio 

In [6]:
trainX, testX, trainY, testY = model_selection.train_test_split(factors,targetVariable, test_size=0.25, random_state=1)

#### Decision Tree implementation

• Create a single train/test split of your data. Set aside 75% for training, and 25% for
testing. Create a DecisionTreeClassifier and fit it to your training data. Measure the
accuracy of the resulting decision tree model using your test data. (Hint: you don’t have
to visualize the tree; use the score method to get the accuracy.)

In [7]:
#DECISION TREE
# Only criterion differs from the defaults. The min_impurity_split and presort
# arguments were removed in later scikit-learn releases, so they are omitted.
decTree = tree.DecisionTreeClassifier(criterion='entropy')
decTree1 = decTree.fit(trainX,trainY)
predY = decTree1.predict(testX)
decTreeScore = metrics.accuracy_score(testY, predY)
print("Score:-",decTreeScore)

Score:- 0.7211538461538461




#### Decision Tree implementation with K-Fold CV = 10

• Use K-Fold cross validation to get a measure of your model’s accuracy (K=10). (Hint:
use model_selection.cross_val_score)

In [8]:
from sklearn import model_selection
DecTree_scores = model_selection.cross_val_score(decTree1,factors,targetVariable,cv=10)
print(DecTree_scores)
print("Average Score after CV:-",DecTree_scores.mean())

[0.67857143 0.77108434 0.71084337 0.75903614 0.77108434 0.69879518
 0.72289157 0.77108434 0.77108434 0.71084337]
Average Score after CV:- 0.7365318416523235
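
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling, and it refits a fresh copy of the estimator on each fold. The sketch below illustrates the first point on synthetic data from `make_classification` (an assumption made only so the example runs on its own); the fixed `random_state` on the tree is also an addition for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Fixed random_state so both runs build identical trees.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)

implicit = cross_val_score(clf, X, y, cv=10)
explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
print(np.allclose(implicit, explicit))
```

Passing a `StratifiedKFold(n_splits=10, shuffle=True, random_state=...)` object instead would shuffle the rows before folding, which matters if the file is ordered in some way.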




#### Random Forest implementation

• Create a RandomForestClassifier using n_estimators=10 and use K-Fold cross validation
to get a measure of the accuracy (K=10).

In [9]:
#RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
RanFor = RandomForestClassifier(n_estimators=10)
RanFor1= RanFor.fit(trainX,trainY)
predY = RanFor1.predict(testX)
metrics.accuracy_score(testY, predY)

0.7067307692307693

#### Random Forest implementation with K-Fold CV = 10

In [10]:
from sklearn import model_selection
RanFor_scores = model_selection.cross_val_score(RanFor1,factors,targetVariable,cv=10)
print(RanFor_scores)
print(RanFor_scores.mean())

[0.69047619 0.77108434 0.78313253 0.74698795 0.79518072 0.6746988
 0.77108434 0.77108434 0.71084337 0.6746988 ]
0.7389271371199081
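
Neither the decision tree nor the random forest above sets `random_state`, so their scores change from run to run. A minimal sketch of making the forest reproducible, again on synthetic stand-in data rather than the mammographic file:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

def forest_scores(seed):
    # Same seed -> same bootstrap samples and feature choices in every tree.
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    return cross_val_score(model, X, y, cv=10)

run_a = forest_scores(0)
run_b = forest_scores(0)
print(np.array_equal(run_a, run_b))
```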


#### K Nearest Neighbor Classifier implementation

• Create a neighbors.KNeighborsClassifier and use K-Fold cross validation to get a
measure of the accuracy (K=10).

In [11]:
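# KNN
KNN = KNeighborsClassifier(n_neighbors=10)
KNN1 = KNN.fit(trainX, trainY)
# Note: cross validation here runs on the training split only, whereas the
# decision tree and random forest cells cross-validate on the full data set.
KNN_scores = model_selection.cross_val_score(KNN1, trainX, trainY, cv=10, scoring='accuracy')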
print(KNN_scores.mean())

0.8040706605222734
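
The notebook standardizes all rows once, before any split, so each validation fold has already influenced the scaler's mean and variance. One common way to avoid that is to put the scaler and the classifier in a pipeline so the scaler is refit on the training part of every fold. The sketch below shows the pattern on synthetic stand-in data; it is an alternative setup, not the one used to produce the score above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# The scaler is fit inside each fold, on that fold's training rows only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
scores = cross_val_score(pipe, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```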


#### K Nearest Neighbor Classifier implementation with neighbors ranging from 1 to 50

• Try different values of K. Write a for loop to run KNN with K values ranging from 1 to
50 and see if K makes a substantial difference. Make a note of the best performance you
could get out of KNN. 

In [12]:
CV_scores = []
# Note: range(1, 50) covers K = 1..49; use range(1, 51) to include K = 50.
for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    KNN_scores = model_selection.cross_val_score(knn, trainX, trainY, cv=10, scoring='accuracy')
    CV_scores.append(KNN_scores.mean())
print(max(CV_scores))

0.8041218637992833


#### Extracting maximum value and n_neighbors where max accuracy is obtained

In [13]:
A = max(CV_scores)
for i in range(0, 49): 
    if CV_scores[i] == A:
        print("At K =",i+1," KNN gives maximum accuracy of",CV_scores[i]*100,"percent")

At K = 9  KNN gives maximum accuracy of 80.41218637992833 percent
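
The same lookup can be written without a second loop by using `np.argmax`, which returns the position of the first maximum. The scores below are hypothetical illustrative values, not results from this data set.

```python
import numpy as np

# Hypothetical mean CV accuracies for K = 1, 2, 3, 4 (illustrative only).
cv_scores = [0.74, 0.78, 0.80, 0.79]
k_values = range(1, len(cv_scores) + 1)

best_index = int(np.argmax(cv_scores))   # position of the highest score
best_k = k_values[best_index]            # translate position back to K
print("Best K:", best_k, "accuracy:", cv_scores[best_index])
```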


#### Naive Bayes implementation

• Create a naive_bayes.MultinomialNB and use K-Fold cross validation to get a measure of
the accuracy (K=10).

In [18]:
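#Naive Bayes
# MultinomialNB rejects negative inputs, so the standardized features
# are rescaled to the [0, 1] range first.
NB_scaler = MinMaxScaler()
factors = NB_scaler.fit_transform(factors)
# Note: this split uses random_state=0, unlike the earlier split (random_state=1).
trainX1, testX1, trainY1, testY1 = model_selection.train_test_split(factors,targetVariable, test_size=0.25, random_state=0)
NB = MultinomialNB().fit(trainX1,trainY1)
predY = NB.predict(testX1)
# NB.score(testX1, testY1) returns the same accuracy.
metrics.accuracy_score(testY1, predY)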

0.7884615384615384
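
The reason for the extra `MinMaxScaler` step can be seen in a small sketch: `MultinomialNB` raises a `ValueError` when it is given negative feature values, which standardized data always contains, while min-max scaled data lies in [0, 1]. The four rows below are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature rows and labels.
X = np.array([[67.0, 3.0], [58.0, 4.0], [28.0, 1.0], [43.0, 1.0]])
y = np.array([1, 1, 0, 0])

# Standardized features contain negative values, which MultinomialNB refuses.
standardized = StandardScaler().fit_transform(X)
try:
    MultinomialNB().fit(standardized, y)
    rejected = False
except ValueError:
    rejected = True

# Min-max scaling maps every column to [0, 1], so fitting succeeds.
rescaled = MinMaxScaler().fit_transform(X)
model = MultinomialNB().fit(rescaled, y)
print(rejected, rescaled.min(), rescaled.max())
```

`MultinomialNB` is designed for count-like features, so a variant such as `GaussianNB` may be a more natural fit for continuous inputs; that would need to be verified on this data.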

#### Naive Bayes implementation with K-Fold CV = 10

In [19]:
NB_scores = model_selection.cross_val_score(NB,factors,targetVariable,cv=10)
print(NB_scores)
print(NB_scores.mean())

[0.73809524 0.77108434 0.79518072 0.8313253  0.8313253  0.77108434
 0.71084337 0.75903614 0.89156627 0.71084337]
0.7810384394721744


| Model              | Decision Tree | Random Forest | KNN                  | Naive Bayes |
|--------------------|---------------|---------------|----------------------|-------------|
| Single 75/25 split | 72.1%         | 70.7%         | 80.4% (CV, K=10)     | 78.8%       |
| 10-fold CV         | 73.7%         | 73.9%         | 80.4% (CV, best K=9) | 78.1%       |