## Sruthi Pusuluri G01149012
# Mammographic dataset
This dataset contains 961 instances of masses detected in mammograms, with the following attributes:
1. BI-RADS assessment: 1 to 5 (ordinal)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=No or malignant=Yes (binary)

BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

The data needs to be cleaned: many rows contain missing data. Some columns need to be transformed to numerical data. Techniques such as KNN also require the input data to be normalized first. (Hint: use preprocessing.StandardScaler()). Show your data after preprocessing. If none of the techniques described below achieves around 80% accuracy, examine your data again to see if there is anything you can improve.

Apply the following supervised learning techniques to your preprocessed data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10).
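
To make the evaluation protocol concrete, here is a minimal sketch of what 10-fold cross validation does under the hood. It assumes a synthetic dataset from `make_classification` as a stand-in for the preprocessed mammographic features; the names `X_demo`, `y_demo`, and `fold_scores` are illustrative and not part of the notebook. Note that passing an integer such as `cv=10` to `cross_val_score` with a classifier uses stratified folds, so the sketch passes an explicit `KFold` object to keep both computations on identical splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

model = DecisionTreeClassifier(max_depth=4, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Manual loop: train on 9 folds, score on the held-out fold, repeat 10 times
fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# cross_val_score performs the same loop on a fresh clone of the model per fold
cv_scores = cross_val_score(model, X_demo, y_demo, cv=kfold)

print("Manual mean accuracy:         ", np.mean(fold_scores))
print("cross_val_score mean accuracy:", np.mean(cv_scores))
```

The reported accuracy is the mean over the ten held-out folds, so every sample is used for testing exactly once.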


In [3]:
import pandas as pd
from pandas import read_csv
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

In [4]:
m_cols = ['assessment', 'age', 'shape','margin','density','severity']
mammo = pd.read_csv('mammographic.csv', sep=',', names=m_cols, usecols=range(6))
print(mammo.head())
print(mammo.dtypes)
# Remove the BI-RADS assessment; .copy() makes mammo_df an independent frame
mammo_df = mammo[['age', 'shape','margin','density','severity']].copy()
print(mammo_df.head())
# Every column is read as object (string) because missing values are stored as "?"
# Convert the listed columns to numeric types in place
def coerce_cols_to_numeric(df, col_list):
    df[col_list] = df[col_list].apply(pd.to_numeric, errors='coerce')
# errors='coerce' turns every "?" into NaN
coerce_cols_to_numeric(mammo_df, ['age'])

  assessment age shape margin density severity
0          5  67     3      5       3      yes
1          4  43     1      1       ?      yes
2          5  58     4      5       3      yes
3          4  28     1      1       3       no
4          5  74     1      5       ?      yes
assessment    object
age           object
shape         object
margin        object
density       object
severity      object
dtype: object
  age shape margin density severity
0  67     3      5       3      yes
1  43     1      1       ?      yes
2  58     4      5       3      yes
3  28     1      1       3       no
4  74     1      5       ?      yes


In [5]:
print(mammo_df.dtypes)
print(mammo_df.head())
print(mammo_df.describe())

age         float64
shape        object
margin       object
density      object
severity     object
dtype: object
    age shape margin density severity
0  67.0     3      5       3      yes
1  43.0     1      1       ?      yes
2  58.0     4      5       3      yes
3  28.0     1      1       3       no
4  74.0     1      5       ?      yes
              age
count  956.000000
mean    55.487448
std     14.480131
min     18.000000
25%     45.000000
50%     57.000000
75%     66.000000
max     96.000000


In [6]:
# Mark missing entries ("?") as NaN, then fill each column with its median
mammo_df[['shape','margin','density']] = mammo_df[['shape','margin','density']].replace("?", np.nan)
mammo_df.fillna(mammo_df.median(), inplace=True)
print(mammo_df.isnull().sum())
print(mammo_df.head())
# Convert the remaining string columns to numeric types
coerce_cols_to_numeric(mammo_df, ['shape','margin','density'])

age         0
shape       0
margin      0
density     0
severity    0
dtype: int64
    age shape margin density severity
0  67.0     3      5       3      yes
1  43.0     1      1       3      yes
2  58.0     4      5       3      yes
3  28.0     1      1       3       no
4  74.0     1      5       3      yes
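
The cell above takes the median while three columns still hold strings, which depends on how a given pandas version treats object columns. A minimal sketch of the same cleaning in a version-independent order (coerce to numeric first, then impute) might look as follows. The tiny `raw` frame is a hypothetical sample in the layout of the raw file, not data from the notebook, and `clean` is an illustrative name.

```python
import pandas as pd

# Tiny hypothetical sample in the same layout as the raw file ("?" marks missing)
raw = pd.DataFrame({
    'age':      ['67', '43', '?', '28', '74'],
    'shape':    ['3', '1', '4', '?', '1'],
    'margin':   ['5', '1', '5', '1', '?'],
    'density':  ['3', '?', '3', '3', '?'],
    'severity': ['yes', 'yes', 'yes', 'no', 'yes'],
})

feature_cols = ['age', 'shape', 'margin', 'density']
clean = raw.copy()

# Step 1: coerce the feature columns to numeric; errors='coerce' turns "?" into NaN
clean[feature_cols] = clean[feature_cols].apply(pd.to_numeric, errors='coerce')

# Step 2: fill each feature column with its own median (only numeric columns involved)
clean[feature_cols] = clean[feature_cols].fillna(clean[feature_cols].median())

print(clean.dtypes)
print(clean)
```

Restricting the median to `feature_cols` keeps the string-valued `severity` column out of the computation entirely.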


In [7]:
scaler = StandardScaler()
scaled_mammo = scaler.fit_transform(mammo_df[['age','shape','margin','density']])
scaled_mammo = pd.DataFrame(scaled_mammo, columns=['age','shape','margin','density'])
scaled_mammo.head()

        age     shape    margin   density
0  0.796984  0.220384  1.436762  0.224804
1 -0.865610 -1.415052 -1.183216  0.224804
2  0.173511  1.038102  1.436762  0.224804
3 -1.904732 -1.415052 -1.183216  0.224804
4  1.281908 -1.415052  1.436762  0.224804


In [8]:
df_mammo = pd.concat([mammo_df[['severity']], scaled_mammo], axis=1,sort=False)
df_mammo.head()

  severity       age     shape    margin   density
0      yes  0.796984  0.220384  1.436762  0.224804
1      yes -0.865610 -1.415052 -1.183216  0.224804
2      yes  0.173511  1.038102  1.436762  0.224804
3       no -1.904732 -1.415052 -1.183216  0.224804
4      yes  1.281908 -1.415052  1.436762  0.224804
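
`pd.concat(..., axis=1)` aligns rows by index label. That works here because no rows were dropped, so both frames share the default 0..n-1 index. If rows with missing values were dropped instead of imputed, the scaled frame would need to carry the original index. A minimal sketch of that variant, using a hypothetical four-row frame `demo` with gaps in its index:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical cleaned frame with a non-default index, as if some rows had been dropped
demo = pd.DataFrame(
    {
        'age': [67.0, 43.0, 58.0, 28.0],
        'shape': [3.0, 1.0, 4.0, 1.0],
        'severity': ['yes', 'yes', 'yes', 'no'],
    },
    index=[0, 1, 3, 7],
)
num_cols = ['age', 'shape']

# Reuse the original column names and index so the rows line up in the concat
scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo[num_cols]),
    columns=num_cols,
    index=demo.index,
)
combined = pd.concat([demo[['severity']], scaled], axis=1)

print(combined)
# StandardScaler uses the population standard deviation, hence ddof=0 in this check
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Without `index=demo.index`, the concat would produce extra rows filled with NaN wherever the two indexes disagree.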


## Decision tree
• Create a single train/test split of your data. Set aside 75% for training, and 25% for testing. Create a DecisionTreeClassifier and fit it to your training data. Measure the accuracy of the resulting decision tree model using your test data. (Hint: you don’t have to visualize the tree; use the score method to get the accuracy.)
• Use K-Fold cross validation to get a measure of your model’s accuracy (K=10). (Hint: use model_selection.cross_val_score)

In [9]:
#split dataset in features and target variable
feature_cols = ['age','shape','margin','density']
X = mammo_df[feature_cols] # Features
y = mammo_df.severity # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1) # 75% training and 25% test

In [10]:
# Create the decision tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4)

# Train the decision tree classifier
clf = clf.fit(X_train,y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

In [11]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7966804979253111
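
The second bullet asks for a 10-fold cross-validated accuracy of the decision tree in addition to the single-split score. A minimal sketch of that step follows, assuming a synthetic dataset from `make_classification` in place of the mammographic features; `X_demo`, `y_demo`, and `tree_scores` are illustrative names, and the printed numbers are not results for the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features and the severity label (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0, random_state=1
)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=1)

# Single 75/25 split, scored with the estimator's own score method
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)
single_split_acc = tree.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross validation; each fold trains a fresh clone of the tree
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=10)

print("Single-split accuracy:", single_split_acc)
print("10-fold mean accuracy:", np.mean(tree_scores))
print("10-fold std deviation:", np.std(tree_scores))
```

The spread across folds gives a sense of how much a single split score can vary with the choice of test rows.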


## Random forest
• Create a RandomForestClassifier using n_estimators=10 and use K-Fold cross validation to get a measure of the accuracy (K=10).


In [12]:
# Create a random forest classifier with 10 trees
clf=RandomForestClassifier(n_estimators=10)

# Train the model on the training set
clf.fit(X_train,y_train)

# Predict the response for the test set
y_pred=clf.predict(X_test)
# 10-Fold Cross validation
print ("Accuracy:",np.mean(cross_val_score(clf, X_train, y_train, cv=10)))

Accuracy: 0.762617638862092


## Naive Bayes
• Create a naive_bayes.MultinomialNB and use K-Fold cross validation to get a measure of the accuracy (K=10).

In [13]:
# MultinomialNB needs non-negative features, so this uses the unscaled split from In [9]
clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)

# 10-Fold Cross validation
print ("Accuracy:",np.mean(cross_val_score(clf, X_train, y_train, cv=10)))

Accuracy: 0.7335825455013184
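
MultinomialNB only accepts non-negative feature values, which is why it cannot be fit on the standardized columns. The sketch below illustrates this on a few hypothetical rows and shows `MinMaxScaler` as one possible way to rescale while staying non-negative. This is a swapped-in alternative, not what the notebook does, and `X_demo`, `y_demo`, and `rejected` are illustrative names.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature rows (age, shape, margin, density) and labels
X_demo = np.array(
    [[67, 3, 5, 3], [43, 1, 1, 3], [58, 4, 5, 3],
     [28, 1, 1, 3], [74, 1, 5, 2], [35, 2, 1, 3]],
    dtype=float,
)
y_demo = np.array(['yes', 'yes', 'yes', 'no', 'yes', 'no'])

# Standardized features contain negative values, which MultinomialNB rejects
X_std = StandardScaler().fit_transform(X_demo)
try:
    MultinomialNB().fit(X_std, y_demo)
    rejected = False
except ValueError as err:
    rejected = True
    print("MultinomialNB refused standardized input:", err)

# Min-max scaling maps every feature into [0, 1], so the fit succeeds
X_minmax = MinMaxScaler().fit_transform(X_demo)
nb = MultinomialNB().fit(X_minmax, y_demo)
print(nb.predict(X_minmax))
```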


## KNN
• Create a neighbors.KNeighborsClassifier and use K-Fold cross validation to get a measure of the accuracy (K=10).
• Try different values of K. Write a for loop to run KNN with K values ranging from 1 to 50 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [14]:
X = df_mammo[feature_cols] # Features
y = df_mammo.severity # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1) # 75% training and 25% test

In [15]:
clf = KNeighborsClassifier(n_neighbors=15)

# Train the model on the training set
clf.fit(X_train,y_train)

# Predict the response for the test set
y_pred=clf.predict(X_test)
# 10-Fold Cross validation
print ("Accuracy:",np.mean(cross_val_score(clf, X_train, y_train, cv=10)))

Accuracy: 0.7974186442858061


In [16]:
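# Note: clf stays fixed at n_neighbors=15; this loop varies cv, the number of
# cross-validation folds (2 to 49), not the number of neighbors
acc = []
for i in range(2, 50):
    acc.append(np.mean(cross_val_score(clf, X_train, y_train, cv=i)))
# The "best k-value" printed here is therefore the best fold count
print("Best k-value: ",acc.index(max(acc))+2)
print("Accuracy:",max(acc))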

Best k-value:  45
Accuracy: 0.8067374727668842
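
The task statement asks for the number of neighbors to range from 1 to 50 while the cross validation stays at 10 folds. A minimal sketch of that loop follows, assuming a synthetic dataset from `make_classification` in place of the scaled mammographic features; `X_demo`, `y_demo`, `mean_acc`, and `best_n` are illustrative names, and the printed values are not results for the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features and the severity label (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=1
)

# Vary the number of neighbors from 1 to 50; keep the fold count fixed at 10
mean_acc = {}
for n in range(1, 51):
    knn = KNeighborsClassifier(n_neighbors=n)
    mean_acc[n] = np.mean(cross_val_score(knn, X_demo, y_demo, cv=10))

best_n = max(mean_acc, key=mean_acc.get)
print("Best n_neighbors:", best_n)
print("Accuracy:", mean_acc[best_n])
```

Storing the scores in a dictionary keyed by `n_neighbors` avoids the index arithmetic needed to map a list position back to the parameter value.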
