<div class="alert alert-warning">
DATA PREPARATION ON MAMMOGRAPHIC MASSES

<br>Data Acquired From University of California, Irvine Machine Learning Repository
<br>Additional Data Information in the Link Below:
<br>
[http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.names](http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.names)

The data were used to evaluate how effective radiological assessments are at diagnosing breast cancer in women who have breast tumors.
</div>

In [58]:
# Packages
import numpy as np
import pandas as pd

# Get Packages
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

In [59]:
# URL DATA
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data"

# Downloading Data
Mamm = pd.read_csv(url, header=None)

# Replacing Default Column Names (0, 1, 2, 3, 4, 5)
Mamm.columns = ["BI_RADS", "Age", "Shape", "Margin", "Density", "Severity"]

Mamm.head()

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,5,67,3,5,3,1
1,4,43,1,1,?,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
4,5,74,1,5,?,1


In [60]:
# Preliminary EDA
display(Mamm.shape)
display(Mamm.dtypes)

(961, 6)

BI_RADS     object
Age         object
Shape       object
Margin      object
Density     object
Severity     int64
dtype: object
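
The object dtypes above come from the "?" placeholders visible in the head output. As a minimal sketch (not the approach this notebook takes), the placeholders can be counted per column, or turned into NaN at load time with the `na_values` argument of `pd.read_csv`. The in-memory `sample_csv` below is an assumption for illustration: it holds only the five rows shown above, not the full file.

```python
import io
import pandas as pd

# First five rows of the raw file, copied from the Mamm.head() output above.
sample_csv = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
    "4,28,1,1,3,0\n"
    "5,74,1,5,?,1\n"
)
cols = ["BI_RADS", "Age", "Shape", "Margin", "Density", "Severity"]

# Default parsing: a column containing "?" is read as strings.
raw = pd.read_csv(sample_csv, header=None, names=cols)
question_marks = (raw == "?").sum()  # number of "?" entries per column

# Alternative parsing: treat "?" as missing so the column is numeric from the start.
sample_csv.seek(0)
parsed = pd.read_csv(sample_csv, header=None, names=cols, na_values="?")

print(question_marks)
print(parsed.dtypes)
```

On this five-row sample only Density contains placeholders; in the full file the other feature columns do as well, which is why they also load as object.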

<div class="alert alert-warning">
DATA PROCESSING:
<br>* Replace Unusable Entries with Null (NaN)
<br>* Change Data Types
<br>* Correct Unexpected Values (Outliers)
<br>* Decode Categorical Data
<br>* Consolidate Categories in Categorical Data
</div>

In [61]:
# Coerce All Columns to Numeric Data
# Coercion Turns Non-Numeric Entries (the "?" Placeholders) into NaN in Every Column
# Missing Categories Also Become NaN, Since the Categories are Encoded as Integers
Mamm = Mamm.apply(pd.to_numeric, errors="coerce")

Mamm.head(5)


Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


In [62]:
# Checking Data Types Again
display(Mamm.dtypes)

BI_RADS     float64
Age         float64
Shape       float64
Margin      float64
Density     float64
Severity      int64
dtype: object
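
A quick way to confirm that coercion only nulled the expected placeholders is to compare null counts before and after. The sketch below is a hypothetical check on a small stand-in Series (the five Density values from the head output), not a step from the original notebook.

```python
import pandas as pd

# Stand-in for one raw column: the five Density values shown in the head output.
density_raw = pd.Series(["3", "?", "3", "3", "?"], name="Density")

# errors="coerce" converts anything that cannot be parsed as a number into NaN.
density_num = pd.to_numeric(density_raw, errors="coerce")

# Values that were present as text before coercion but are NaN afterwards.
n_coerced = int(density_num.isna().sum() - density_raw.isna().sum())
coerced_values = density_raw[density_num.isna() & density_raw.notna()].unique()

print(n_coerced, "values were coerced to NaN:", list(coerced_values))
```

If `coerced_values` ever lists something other than "?", that entry deserves a manual look before it is silently treated as missing.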

In [63]:
# Replacing Outliers
# Cap "BI_RADS" Values to Range 1 to 5
Mamm.BI_RADS = np.clip(Mamm.BI_RADS, a_min = 1, a_max = 5)
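
To see what clipping does, here is a small sketch on made-up values (the out-of-range numbers 0, 55 and 6 are illustrative assumptions, not taken from the output above). It uses `Series.clip`, which bounds values in the same way as `np.clip` and leaves NaN untouched, so missing entries still need to be imputed separately.

```python
import numpy as np
import pandas as pd

# Illustrative BI_RADS values: three out of range, one missing.
bi_rads = pd.Series([0.0, 4.0, 55.0, np.nan, 6.0])

# Bound every value to the closed interval [1, 5]; NaN stays NaN.
clipped = bi_rads.clip(lower=1, upper=5)

# Count how many non-missing entries were actually changed by the clip.
n_changed = int((bi_rads.notna() & (bi_rads != clipped)).sum())

print(clipped.tolist())
print(n_changed, "values were clipped")
```

Counting the changed entries before overwriting the column makes it easy to report how many records were affected.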

<div class="alert alert-warning">

**CONSOLIDATING & DECODING CATEGORY COLUMNS**


**Original Shape Category Coding:**
<br>round = 1
<br>oval = 2
<br>lobular = 3
<br>irregular = 4

**Original Margin Category Coding:**
<br>circumscribed = 1
<br>microlobulated = 2
<br>obscured = 3
<br>ill-defined = 4
<br>spiculated = 5


</div>

In [64]:
# The Category Columns are Decoded & Categories are Consolidated
# Shape Variable Decoded As:
shape_decoding = {
    1: 'oval',
    2: 'oval',
    3: 'lobular',
    4: 'irregular'
}
Mamm.Shape = Mamm.Shape.replace(shape_decoding)

# Margin Variable Decoded As:
margin_decoding = {
    1: 'circumscribed',
    2: 'ill-defined',
    3: 'ill-defined',
    4: 'ill-defined',
    5: 'spiculated'
}
Mamm.Margin = Mamm.Margin.replace(margin_decoding)

Mamm.head(5)

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,lobular,spiculated,3.0,1
1,4.0,43.0,oval,circumscribed,,1
2,5.0,58.0,irregular,spiculated,3.0,1
3,4.0,28.0,oval,circumscribed,3.0,0
4,5.0,74.0,oval,spiculated,,1
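
Because `replace` leaves any value without a dictionary entry unchanged, an unexpected code would survive decoding as a stray number. The following is a minimal sketch of a check for that, using a toy Series with a made-up invalid code (7.0); the name `unexpected` is hypothetical.

```python
import numpy as np
import pandas as pd

# Same consolidation as above: round (1) and oval (2) both become 'oval'.
shape_decoding = {1: "oval", 2: "oval", 3: "lobular", 4: "irregular"}

# Toy column with one missing value and one made-up invalid code (7.0).
shape_codes = pd.Series([3.0, 1.0, 7.0, np.nan, 2.0])
decoded = shape_codes.replace(shape_decoding)

# Anything that is neither missing nor a known label was not decoded.
known_labels = sorted(set(shape_decoding.values()))
present = decoded.dropna()
unexpected = present[~present.isin(known_labels)]

print(decoded.tolist())
print("undecoded values:", unexpected.tolist())
```

An empty `unexpected` means every non-missing code was covered by the dictionary.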


In [65]:
# EDA
display(Mamm.shape)

# Distribution of Nulls Among Columns
Mamm.isna().sum(axis = 0)

(961, 6)

BI_RADS      2
Age          5
Shape       31
Margin      48
Density     76
Severity     0
dtype: int64

In [66]:
# Dropping Rows w/ Multiple Missing Values
# thresh=5 Keeps Rows w/ at Least 5 Non-Null Values (of 6), i.e. Drops Rows Missing 2 or More Values
Mamm = Mamm.dropna(thresh=5)

# Show the Shape of the Data
display(Mamm.shape)

# Show the Distribution of Nulls Among the Columns
Mamm.isna().sum(axis = 0)

(931, 6)

BI_RADS      1
Age          5
Shape       17
Margin      22
Density     56
Severity     0
dtype: int64
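
The `thresh` argument counts non-null values per row rather than nulls, which is easy to misread. A small sketch on a toy frame (values invented for illustration) shows which rows survive `thresh=5` when there are six columns.

```python
import numpy as np
import pandas as pd

# Toy frame with the same six columns: row 0 is complete,
# row 1 is missing one value, row 2 is missing two values.
toy = pd.DataFrame({
    "BI_RADS":  [5, 4, np.nan],
    "Age":      [67, 43, np.nan],
    "Shape":    [3, 1, 4],
    "Margin":   [5, 1, 5],
    "Density":  [3, np.nan, 3],
    "Severity": [1, 1, 0],
})

# Number of non-null entries in each row.
non_null_per_row = toy.notna().sum(axis=1)

# Keep rows that have at least 5 non-null values.
kept = toy.dropna(thresh=5)

print(non_null_per_row.tolist())
print(kept.index.tolist())
```

Note that the surviving rows keep their original index labels, so the index has gaps after rows are dropped.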

In [67]:
# Determining the Imputation Values for Age
# Replace Missing Age Values w/ Median
MedianAge = np.nanmedian(Mamm.loc[:,"Age"])
HasNanAge = pd.isnull(Mamm.loc[:,"Age"])
print('Now we replace', HasNanAge.sum(),'missing age values with the age median (', MedianAge, ')')
Mamm.loc[HasNanAge, "Age"] = MedianAge
Mamm.isna().sum(axis=0)

Now we replace 5 missing age values with the age median ( 57.0 )


BI_RADS      1
Age          0
Shape       17
Margin      22
Density     56
Severity     0
dtype: int64

In [68]:
# Impute Missing Values for BI_RADS & Density
# Median Imputation for BI_RADS
median_bi_rads = np.nanmedian(Mamm.loc[:,"BI_RADS"])

# Median Imputation for Density
median_density = np.nanmedian(Mamm.loc[:,"Density"])

# Distribution of Nulls
has_nan_bi_rads = pd.isnull(Mamm.loc[:, "BI_RADS"])
has_nan_density = pd.isnull(Mamm.loc[:, "Density"])

# BI_RADS 
print('Now we replace', has_nan_bi_rads.sum(),'missing BI_RADS values with the BI_RADS median (', median_bi_rads, ')')
Mamm.loc[has_nan_bi_rads, "BI_RADS"] = median_bi_rads

print("\n**********************************\n")

# Density 
print('Now we replace', has_nan_density.sum(),'missing Density values with the Density median (', median_density, ')')
Mamm.loc[has_nan_density, "Density"] = median_density

# Numbers of Nulls Per Column After Imputation
print("\n********* REMAINING NULLS ********")
Mamm.isna().sum(axis=0)

Now we replace 1 missing BI_RADS values with the BI_RADS median ( 4.0 )

**********************************

Now we replace 56 missing Density values with the Density median ( 3.0 )

********* REMAINING NULLS ********


BI_RADS      0
Age          0
Shape       17
Margin      22
Density      0
Severity     0
dtype: int64
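
The per-column median imputation above can also be written more compactly. This is a sketch under the assumption that every listed column should get its own median; it runs on a toy frame with invented values rather than on `Mamm`.

```python
import numpy as np
import pandas as pd

# Toy numeric columns with one missing value each.
toy = pd.DataFrame({
    "BI_RADS": [5.0, 4.0, np.nan, 4.0],
    "Density": [3.0, np.nan, 3.0, 2.0],
})

# DataFrame.median skips NaN by default and returns one median per column.
medians = toy[["BI_RADS", "Density"]].median()

# fillna with a Series fills each column using the value under its own label.
imputed = toy.fillna(medians)

print(medians.to_dict())
print(imputed.isna().sum().to_dict())
```

The explicit version used in the notebook has the advantage of printing how many values were replaced in each column.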

In [69]:
# Replacing Missing Values for the 2 Categorical Columns
print("*********SHAPE*******")
# Determine Distribution of Categories for Shape
display(Mamm.Shape.value_counts())

print("\n**********************************\n")

# Common "Shape" Value
common_val_shape = Mamm.Shape.value_counts().idxmax()

# Replace Nulls in Shape w/ the Most Common Category of Shape
HasNanShape = pd.isnull(Mamm.loc[:,"Shape"])
print('Now we replace', HasNanShape.sum(),'missing Shape values with the most common Shape category (', common_val_shape, ')')
Mamm.loc[HasNanShape, "Shape"] = common_val_shape


print("\n*********MARGIN*******\n")
# Determine Distribution of Categories for Margin
display(Mamm.Margin.value_counts())

print("\n**********************************\n")

# Common "Margin" Value
common_val_margin = Mamm.Margin.value_counts().idxmax()

# Replace Nulls in Margin w/ the Most Common Category of Margin
HasNanMargin = pd.isnull(Mamm.loc[:,"Margin"])
print('Now we replace', HasNanMargin.sum(),'missing Margin values with the most common Margin category (', common_val_margin, ')')
Mamm.loc[HasNanMargin, "Margin"] = common_val_margin

# Distribution of Nulls
Mamm.isna().sum(axis=0)

# Count How Often Each Unique Row (Combination of All Column Values) Occurs
print("\n*********DISTRIBUTION OF CATEGORIES*******\n")
display(Mamm.value_counts())

*********SHAPE*******


Shape
oval         422
irregular    399
lobular       93
Name: count, dtype: int64


**********************************

Now we replace 17 missing Shape values with the most common Shape category ( oval )

*********MARGIN*******



Margin
ill-defined      417
circumscribed    357
spiculated       135
Name: count, dtype: int64


**********************************

Now we replace 22 missing Margin values with the most common Margin category ( ill-defined )

*********DISTRIBUTION OF CATEGORIES*******



BI_RADS  Age   Shape      Margin         Density  Severity
5.0      66.0  irregular  ill-defined    3.0      1           11
4.0      45.0  oval       circumscribed  3.0      0           10
         59.0  oval       circumscribed  3.0      0            9
         56.0  oval       circumscribed  3.0      0            9
         63.0  oval       circumscribed  3.0      0            8
                                                              ..
         58.0  irregular  spiculated     3.0      0            1
         57.0  oval       spiculated     3.0      0            1
                          ill-defined    1.0      0            1
                          circumscribed  2.0      0            1
5.0      96.0  lobular    ill-defined    3.0      1            1
Name: count, Length: 524, dtype: int64
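
Calling `value_counts()` on the whole DataFrame counts unique rows, which is why the output above has 524 entries. If the goal is the category distribution per column after imputation, a sketch like the following may be closer to it. The toy frame and the names `modes` and `per_column_counts` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy categorical columns with one missing value each.
toy = pd.DataFrame({
    "Shape":  ["oval", "irregular", np.nan, "oval"],
    "Margin": ["ill-defined", np.nan, "ill-defined", "spiculated"],
})

# Most frequent category per column (value_counts ignores NaN by default).
modes = {col: toy[col].value_counts().idxmax() for col in ["Shape", "Margin"]}

# fillna with a dict fills each column with its own replacement value.
filled = toy.fillna(modes)

# One frequency table per column instead of one table of unique rows.
per_column_counts = {col: filled[col].value_counts() for col in filled.columns}

for col, counts in per_column_counts.items():
    print(col, counts.to_dict())
```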

In [70]:
# Checking the Data Types
display(Mamm.dtypes)
Mamm.head()

BI_RADS     float64
Age         float64
Shape        object
Margin       object
Density     float64
Severity      int64
dtype: object

Unnamed: 0,BI_RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,lobular,spiculated,3.0,1
1,4.0,43.0,oval,circumscribed,3.0,1
2,5.0,58.0,irregular,spiculated,3.0,1
3,4.0,28.0,oval,circumscribed,3.0,0
4,5.0,74.0,oval,spiculated,3.0,1


In [71]:
# One Hot Encode the Categorical Variables

# Changing "Shape" & "Margin" Data Type from "object" to "category"
# This Conversion is Not Required for One-Hot Encoding; It is Kept for Future Changes
Mamm[['Shape', 'Margin']] = Mamm[['Shape', 'Margin']].astype('category')

display(Mamm.dtypes)

# One-hot-encode
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
onehot = OneHotEncoder(sparse_output = False)
onehot.fit(Mamm[["Shape", "Margin"]])

# Create Column Names
col_names = onehot.get_feature_names_out(["Shape", "Margin"])

# Add one-hot-encoded columns to dataframe
# Reuse Mamm's index: dropna left gaps in it, and concat aligns rows on index labels
mamm_onehot = pd.DataFrame(onehot.transform(Mamm[["Shape", "Margin"]]), columns = col_names, index = Mamm.index)

Mamm = pd.concat([Mamm, mamm_onehot], axis=1)

# Drop original categorical columns
Mamm.drop(columns=['Shape', 'Margin'], inplace=True)

# Show the first few rows
Mamm.head()

BI_RADS      float64
Age          float64
Shape       category
Margin      category
Density      float64
Severity       int64
dtype: object



Unnamed: 0,BI_RADS,Age,Density,Severity,Shape_irregular,Shape_lobular,Shape_oval,Margin_circumscribed,Margin_ill-defined,Margin_spiculated
0,5.0,67.0,3.0,1,0.0,1.0,0.0,0.0,0.0,1.0
1,4.0,43.0,3.0,1,0.0,0.0,1.0,1.0,0.0,0.0
2,5.0,58.0,3.0,1,1.0,0.0,0.0,0.0,0.0,1.0
3,4.0,28.0,3.0,0,0.0,0.0,1.0,1.0,0.0,0.0
4,5.0,74.0,3.0,1,0.0,0.0,1.0,0.0,0.0,1.0
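
One pitfall when attaching encoder output to a DataFrame that has had rows dropped is index alignment: `pd.concat(..., axis=1)` matches rows by index label, not by position. The sketch below uses a toy frame (values invented for illustration) to show what happens when the encoded frame gets a fresh default index, and how passing the original index avoids it.

```python
import pandas as pd

# Toy frame whose index has a gap at label 1, as it would after dropna.
df = pd.DataFrame({"Severity": [1, 0, 1]}, index=[0, 2, 3])

# Encoded columns built without an index get labels 0, 1, 2.
encoded_default = pd.DataFrame({"Shape_oval": [1.0, 0.0, 1.0]})
misaligned = pd.concat([df, encoded_default], axis=1)

# Encoded columns built with the original index line up row for row.
encoded_aligned = pd.DataFrame({"Shape_oval": [1.0, 0.0, 1.0]}, index=df.index)
aligned = pd.concat([df, encoded_aligned], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
print(misaligned["Severity"].dtype, aligned["Severity"].dtype)
```

A telltale symptom of misalignment is an integer column such as Severity turning into floats after the concat, because the extra rows introduce NaN.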
