# Data Processing

In [1]:
import pandas as pd
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()

from sklearn.model_selection import cross_val_score

### Mammogram Dataset

We begin by importing the mammogram data as a DataFrame, treating '?' as NaN. DataFrames provide methods such as dropna() for removing NaN values.

In [2]:
mammogram_df = pd.read_csv('mammographic_masses.data.txt', header=0, names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'], na_values=['?'])
mammogram_df.head()

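|   | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| 0 | 4.0 | 43.0 | 1.0 | 1.0 | NaN | 1 |
| 1 | 5.0 | 58.0 | 4.0 | 5.0 | 3.0 | 1 |
| 2 | 4.0 | 28.0 | 1.0 | 1.0 | 3.0 | 0 |
| 3 | 5.0 | 74.0 | 1.0 | 5.0 | NaN | 1 |
| 4 | 4.0 | 65.0 | 1.0 | NaN | 3.0 | 0 |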


Off the bat we can see there are missing data points. Let's see the extent of the damage.

In [3]:
mammogram_df.describe()

|   | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| count | 958.0 | 955.0 | 929.0 | 912.0 | 884.0 | 960.0 |
| mean | 4.347599 | 55.475393 | 2.721206 | 2.79386 | 2.910633 | 0.4625 |
| std | 1.783838 | 14.482917 | 1.243428 | 1.565702 | 0.380647 | 0.498852 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 45.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |


We can see from the counts that a number of data points are missing, especially in the Density column (884 non-missing values, against 960 for Severity).
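The counts from describe() can also be turned into explicit missing-value totals. The following is a minimal sketch on a made-up toy frame (toy_df and its values are illustrative assumptions, not the mammogram data); the same calls should apply to mammogram_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mammogram_df; the values are made up for illustration
toy_df = pd.DataFrame({
    'Age': [43.0, 58.0, np.nan, 74.0],
    'Density': [np.nan, 3.0, 3.0, np.nan],
    'Severity': [1, 1, 0, 1],
})

# Number of missing entries in each column
missing_counts = toy_df.isna().sum()

# Fraction of rows that have at least one missing entry,
# i.e. the share of rows that dropna() would discard
rows_with_missing = toy_df.isna().any(axis=1).mean()

print(missing_counts)
print(rows_with_missing)
```

Looking at the share of affected rows, rather than only per-column counts, gives a better idea of how much data a row-wise drop would cost.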

In [4]:
mammogram_df.dropna(inplace=True)

We want to be careful that we don't introduce bias by simply dropping values! Alternatively, we could impute the missing values using the column mean, but this would still add some bias.
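For comparison, mean imputation could look roughly like the sketch below. It uses a made-up toy frame (toy_df and imputed_df are illustrative names, not part of the notebook) and shows one source of the bias mentioned above: filling with the mean shrinks the spread of a column.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the values are made up for illustration
toy_df = pd.DataFrame({
    'Age': [40.0, np.nan, 60.0],
    'Density': [3.0, 3.0, np.nan],
})

# Replace each missing entry with the mean of its column
imputed_df = toy_df.fillna(toy_df.mean())

# The imputed column has a smaller standard deviation than the observed values
std_before = toy_df['Age'].std()      # computed over the non-missing values only
std_after = imputed_df['Age'].std()

print(imputed_df)
print(std_before, std_after)
```

Whether a mean is even meaningful depends on the column: for categorical codes, a most-frequent-value fill may be a more sensible assumption.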

In [5]:
mammogram_df.describe()

|   | BI-RADS | Age | Shape | Margin | Density | Severity |
|---|---|---|---|---|---|---|
| count | 829.0 | 829.0 | 829.0 | 829.0 | 829.0 | 829.0 |
| mean | 4.393245 | 55.768396 | 2.781665 | 2.810615 | 2.915561 | 0.484922 |
| std | 1.889394 | 14.675456 | 1.243088 | 1.566276 | 0.351136 | 0.500074 |
| min | 0.0 | 18.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 4.0 | 46.0 | 2.0 | 1.0 | 3.0 | 0.0 |
| 50% | 4.0 | 57.0 | 3.0 | 3.0 | 3.0 | 0.0 |
| 75% | 5.0 | 66.0 | 4.0 | 4.0 | 3.0 | 1.0 |
| max | 55.0 | 96.0 | 4.0 | 5.0 | 4.0 | 1.0 |


The standard deviations of the NaN-removed dataset are similar to those of the entire dataset, including on the Density column (0.351 vs 0.381), which had the most missing data points. The means are also close, although the Severity mean shifts slightly, from 0.4625 to 0.4849. This suggests that we won't introduce much bias by simply dropping the rows with incomplete data. The count is now identical across all columns (829), so every remaining row is complete and the dataset should be nice and clean.
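The before/after comparison can be placed side by side rather than read off two separate describe() outputs. This is only a sketch on made-up toy data (toy_df, complete_df and comparison are illustrative names); with the real data, toy_df would be the frame before dropna().

```python
import numpy as np
import pandas as pd

# Toy stand-in for the data before dropping rows; values are made up
toy_df = pd.DataFrame({
    'Age': [43.0, 58.0, 28.0, 74.0, 65.0],
    'Density': [np.nan, 3.0, 3.0, np.nan, 3.0],
    'Severity': [1, 1, 0, 1, 0],
})

# Non-inplace dropna keeps the original frame available for comparison
complete_df = toy_df.dropna()

# One row per column, with statistics before and after dropping
comparison = pd.DataFrame({
    'mean_all': toy_df.mean(),
    'mean_complete': complete_df.mean(),
    'std_all': toy_df.std(),
    'std_complete': complete_df.std(),
})

print(comparison)
```

A large gap between the two mean columns for the label would hint that the dropped rows are not missing at random.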

Let's break the dataset into NumPy feature and label arrays for later consumption.

BI-RADS (col 0) is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it.

In [6]:
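# Keep Age, Shape, Margin and Density; skip BI-RADS (first column) and Severity (last column)
mammogram_features = mammogram_df.columns[1:-1]
mammogram_x = mammogram_df[mammogram_features].values
# Severity is the label
mammogram_y = mammogram_df['Severity'].values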

Since some models require normalised features, we should go ahead and prepare that now.

In [8]:
mammogram_x_scaled = scaler.fit_transform(mammogram_x)
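As a sanity check on what the scaling step does, the sketch below compares StandardScaler against the manual formula (x - mean) / std on a small made-up array (toy_x is an illustrative assumption, not the mammogram features). Each column should end up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; the values are made up for illustration
toy_x = np.array([
    [43.0, 1.0],
    [58.0, 4.0],
    [28.0, 1.0],
    [74.0, 1.0],
])

toy_scaled = StandardScaler().fit_transform(toy_x)

# Manual standardisation, column by column.
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses as well.
manual = (toy_x - toy_x.mean(axis=0)) / toy_x.std(axis=0)

print(np.allclose(toy_scaled, manual))
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```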

### Iris Dataset

In [11]:
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [12]:
iris_df.describe()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


This dataset looks clean and ready to use. Let's extract the features and labels into NumPy arrays.

In [13]:
iris_features = iris.feature_names
iris_x = iris_df[iris_features].values
iris_y = iris.target

iris_x_scaled = scaler.fit_transform(iris_x)
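One thing to keep in mind when a single StandardScaler instance is shared between datasets: every call to fit_transform refits the scaler, so its stored statistics only ever describe the most recently fitted data. Arrays that were already scaled are unaffected, but the scaler can no longer be used to transform new samples from the earlier dataset. The sketch below illustrates this on made-up arrays (first, second and shared are illustrative names).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy datasets with very different scales; values are made up
first = np.array([[10.0], [20.0], [30.0]])
second = np.array([[1.0], [2.0], [3.0]])

shared = StandardScaler()

shared.fit_transform(first)
mean_after_first = shared.mean_.copy()   # statistics learned from `first`

shared.fit_transform(second)
mean_after_second = shared.mean_.copy()  # now replaced by statistics from `second`

print(mean_after_first, mean_after_second)
```

If new samples need to be transformed later, keeping one scaler per dataset is a simple way to avoid the overwrite.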