### Understanding the problem
For this program we will be working with a dataset, breast-cancer.csv, whose attributes describe tumors found in the breast. Along with the attributes of each tumor, the dataset has an output, the diagnosis column, which records whether each tumor was malignant or benign. We will use all of the attributes (shown below) to train a model that predicts whether a given tumor is malignant or benign, which makes this a binary classification problem. We will split the dataset into training and testing subsets in order to train and test the model. After training, we will calculate the model's accuracy score and generate a classification report containing its precision and recall.

1) Getting the data

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
# Importing dataset
dataSet = pd.read_csv('./data/breast-cancer.csv')

# Shuffling dataset
data = dataSet.sample(frac=1).reset_index(drop=True)

# Showing all current attributes in the Dataset 
data.columns.tolist()

['id',
 'diagnosis',
 'radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'fractal_dimension_mean',
 'radius_se',
 'texture_se',
 'perimeter_se',
 'area_se',
 'smoothness_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'symmetry_se',
 'fractal_dimension_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst']

In [3]:
# Showing a few rows of the dataset
data[:5]

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,867387,B,15.71,13.93,102.0,761.7,0.09462,0.09462,0.07135,0.05933,...,17.5,19.25,114.3,922.8,0.1223,0.1949,0.1709,0.1374,0.2723,0.07071
1,911296201,M,17.08,27.15,111.2,930.9,0.09898,0.111,0.1007,0.06431,...,22.96,34.49,152.1,1648.0,0.16,0.2444,0.2639,0.1555,0.301,0.0906
2,874839,B,12.3,15.9,78.83,463.7,0.0808,0.07253,0.03844,0.01654,...,13.35,19.59,86.65,546.7,0.1096,0.165,0.1423,0.04815,0.2482,0.06306
3,89382602,B,12.76,13.37,82.29,504.1,0.08794,0.07948,0.04052,0.02548,...,14.19,16.4,92.04,618.8,0.1194,0.2208,0.1769,0.08411,0.2564,0.08253
4,8712766,M,17.47,24.68,116.1,984.6,0.1049,0.1603,0.2159,0.1043,...,23.14,32.33,155.3,1660.0,0.1376,0.383,0.489,0.1721,0.216,0.093


2) Exploring our data  
Upon reviewing the dataset, I see that I'm fortunate to have a dataset that is very easy to work with. This dataset, obtained from Kaggle, was designed to be convenient for training models. If you would like to see the dataset for yourself, it can be found at https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset?select=breast-cancer.csv. Despite its convenience, there are some aspects that I would like to update. We will drop the id column, as those values are entirely independent of each diagnosis and keeping them could negatively affect our accuracy. Furthermore, as the rows above show, the diagnosis (our output) is a string: M for malignant and B for benign. I prefer to change these values to 1 for malignant and 0 for benign, because I find integers easier to work with when drawing conclusions about the performance of our model.

In [4]:
# Dropping id column and giving our updated 
# dataset a new reference variable
myData = data.drop(['id'], axis=1)

In [5]:
# Presenting statistical information of the attributes
myData.describe()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [6]:
# Updating diagnosis values: 1 for malignant and 0 for benign
myData['diagnosis'] = myData['diagnosis'].apply(lambda x: x.replace("B", "0"))
myData['diagnosis'] = myData['diagnosis'].apply(lambda x: x.replace("M", "1"))

# Keeping datatypes consistent
myData = myData.astype(float)

# Presenting the same 5 rows of the dataset as before to verify that M is now 1, B is now 0, and the ID column is removed.
myData[:5]

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,0.0,15.71,13.93,102.0,761.7,0.09462,0.09462,0.07135,0.05933,0.1816,...,17.5,19.25,114.3,922.8,0.1223,0.1949,0.1709,0.1374,0.2723,0.07071
1,1.0,17.08,27.15,111.2,930.9,0.09898,0.111,0.1007,0.06431,0.1793,...,22.96,34.49,152.1,1648.0,0.16,0.2444,0.2639,0.1555,0.301,0.0906
2,0.0,12.3,15.9,78.83,463.7,0.0808,0.07253,0.03844,0.01654,0.1667,...,13.35,19.59,86.65,546.7,0.1096,0.165,0.1423,0.04815,0.2482,0.06306
3,0.0,12.76,13.37,82.29,504.1,0.08794,0.07948,0.04052,0.02548,0.1601,...,14.19,16.4,92.04,618.8,0.1194,0.2208,0.1769,0.08411,0.2564,0.08253
4,1.0,17.47,24.68,116.1,984.6,0.1049,0.1603,0.2159,0.1043,0.1538,...,23.14,32.33,155.3,1660.0,0.1376,0.383,0.489,0.1721,0.216,0.093
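The two `replace` calls followed by `astype(float)` could also be written as a single dictionary lookup. The sketch below is one possible alternative, not the code used above; the toy frame `toy` and the mapping name `label_map` are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset.
toy = pd.DataFrame({
    "id": [867387, 911296201, 874839],
    "diagnosis": ["B", "M", "B"],
    "radius_mean": [15.71, 17.08, 12.30],
})

# Drop the identifier, then map each label to an integer in one step.
label_map = {"B": 0, "M": 1}
toy = toy.drop(columns=["id"])
toy["diagnosis"] = toy["diagnosis"].map(label_map)

print(toy["diagnosis"].tolist())
```

With `map`, any label missing from the dictionary becomes `NaN`, so an unexpected value in the column is easy to spot.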


In [8]:
# Populating X and y values as we prepare to split our data
# (column 0 is the diagnosis; every remaining column is a feature)
X = myData.iloc[:, 1:].values
y = myData.iloc[:, 0].values

# Updating y (diagnosis) values to integers since we do not need floats for them
y = y.astype(int)

# Printing the shapes of X and y to make sure we have the right number of columns and rows for each
# (Should be 30 columns for X and 1 for y)
# As we can see below, we have the right dimensions!
print(np.shape(X))
print(np.shape(y))

# Double checking our y to make sure nothing unexpected happened. Looks good!
print(y)

(569, 30)
(569,)
[0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 1 0
 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0 1 0
 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0
 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0
 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1
 0 0 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 1 0 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1
 1 0 1 0 0 1 0 0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 1 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 1 0
 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1
 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 0 0 1 1 1 1 1
 1 0 0 1 0 1 0 0 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0
 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0
 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1 1 0 0 0 0 0 1 1 1 0 1
 0 1 0 1
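Positional slicing works here because `diagnosis` is the first column. A name-based selection, sketched below on a made-up frame, would not depend on column order, and `np.bincount` gives a quick count of each class. The frame `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the label in the first column, as in myData.
toy = pd.DataFrame({
    "diagnosis": [0.0, 1.0, 0.0, 1.0],
    "radius_mean": [15.71, 17.08, 12.30, 17.47],
    "texture_mean": [13.93, 27.15, 15.90, 24.68],
})

# Features: every column except the label. Target: the label as integers.
X_toy = toy.drop(columns=["diagnosis"]).to_numpy()
y_toy = toy["diagnosis"].to_numpy().astype(int)

print(X_toy.shape, y_toy.shape)
# Count how many samples fall in each class (index 0 = benign, 1 = malignant).
print(np.bincount(y_toy))
```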

3) Preparing the data

In [9]:
# Splitting the data into training and testing
# I prefer the 80/20 split
x_train, x_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=10
)
print(len(myData))
print(len(x_train))
print(len(x_test))

569
455
114
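The sizes printed above follow from the 0.2 test fraction: 20% of 569 is 113.8, which scikit-learn rounds up to 114 test rows, leaving 455 for training. One optional refinement, not used in this notebook, is passing `stratify=y` so that both subsets keep the same class proportions. A minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 0 and 40 of class 1.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo keeps the 60/40 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(np.bincount(y_tr), np.bincount(y_te))
```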


4) Training and Evaluating the model using logistic regression  
I decided to use logistic regression for this dataset for three main reasons. First, it is rather straightforward. Second, its implementation is simple. Finally, of the models I have worked with, I found logistic regression to be the most effective on binary classification problems.
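For intuition, logistic regression turns a weighted sum of the features into a probability with the sigmoid function and predicts the positive class when that probability exceeds 0.5. The sketch below illustrates the idea with made-up numbers; `w`, `b`, and the sample rows are hypothetical and are not the coefficients learned by scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Return class labels (0 or 1) and probabilities for each row of X."""
    probs = sigmoid(X @ w + b)
    return (probs > threshold).astype(int), probs

# Hypothetical weights for two features; not fitted to any data.
w = np.array([0.8, -0.5])
b = -1.0
X_rows = np.array([[3.0, 1.0], [0.5, 2.0]])

labels, probs = predict(X_rows, w, b)
print(labels, probs.round(3))
```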

In [10]:
# Importing Logistic Regression
from sklearn.linear_model import LogisticRegression

# Importing Classification Report
from sklearn.metrics import classification_report

In [11]:
# Must increase number of iterations for this particular dataset to ensure convergence
lr = LogisticRegression(solver='lbfgs', max_iter=4000)

In [12]:
# Creating a fit for our model
lr.fit(x_train,y_train)

# Generating predictions for the diagnosis
predictions = lr.predict(x_test)

# Calculating a score for how accurate our model's predictions were compared to the actual data
score = lr.score(x_test,y_test)
print('Score: ',"%.2f" %(score * 100),'%')

Score:  94.74 %


In [13]:
# Generating Classification Report
report = classification_report(y_test, predictions)
print(report)

              precision    recall  f1-score   support

           0       0.94      0.99      0.96        74
           1       0.97      0.88      0.92        40

    accuracy                           0.95       114
   macro avg       0.95      0.93      0.94       114
weighted avg       0.95      0.95      0.95       114
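The figures in the report can be reproduced by hand from the four confusion-matrix counts. The counts below are not printed by the notebook; they are inferred as one set consistent with the report (support of 74 and 40) and the 94.74% score above, so treat them as an assumption.

```python
# Assumed counts, treating malignant (1) as the positive class.
tp, fn = 35, 5   # malignant tumors predicted correctly / missed
tn, fp = 73, 1   # benign tumors predicted correctly / flagged as malignant

precision = tp / (tp + fp)   # share of predicted malignant that were malignant
recall = tp / (tp + fn)      # share of actual malignant that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy * 100:.2f}%")
```

Under these assumed counts, the lower recall for class 1 corresponds to 5 malignant tumors being classified as benign.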



5) Conclusions  
While the precision, recall, and accuracy scores fluctuate each time the runtime is restarted, the highest scores I was able to achieve were an accuracy of 98%, with precision and recall of 0.98 and 0.96 respectively. I'm satisfied with these results, as this project took a fair amount of time to get here, and seeing it finish with such high accuracy is very satisfying. In the future, I would like to code my own logistic regression instead of using a library. I would also have enjoyed evaluating this dataset with different methods (such as Random Trees or KNN) to see which could achieve the highest accuracy. Above all, I would love to see whether I could push my accuracy above 98% every time the runtime is restarted; my runs ranged from 93% to 98%, and I believe that trying other methods could be a way to achieve that.
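The run-to-run fluctuation comes from the shuffle step: `dataSet.sample(frac=1)` is called without a seed, so although `train_test_split` uses `random_state=10`, different rows land in the test set on each run. Passing `random_state` to `sample` as well would make the results repeatable. A sketch on a small made-up frame (the seed value 10 is arbitrary):

```python
import pandas as pd

# Hypothetical frame standing in for the loaded CSV.
frame = pd.DataFrame({"id": range(10), "diagnosis": list("BMBBMBMBBM")})

# Same seed, same order: the shuffle becomes reproducible.
first = frame.sample(frac=1, random_state=10).reset_index(drop=True)
second = frame.sample(frac=1, random_state=10).reset_index(drop=True)

print(first.equals(second))
```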