#### Data Understanding

In the data understanding step, the data is evaluated to determine what must be cleaned, normalized, or transformed.

#### Steps:
- 1 - Rename columns
- 2 - Evaluate for Missing (NaN) values
- 3 - Evaluate for Duplicate rows

#### Import Package

In [1]:
import pandas as pd

In [2]:
masses = pd.read_csv('mammographic_masses.data.txt')
masses

Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0
...,...,...,...,...,...,...
955,4,47,2,1,3,0
956,4,56,4,5,3,1
957,4,64,4,5,3,0
958,5,66,4,5,3,1


In [3]:
# The file has no header row, so supply column names explicitly and read '?' as NaN
masses = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names=['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
masses

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1
...,...,...,...,...,...,...
956,4.0,47.0,2.0,1.0,3.0,0
957,4.0,56.0,4.0,5.0,3.0,1
958,4.0,64.0,4.0,5.0,3.0,0
959,5.0,66.0,4.0,5.0,3.0,1
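
The difference between the two reads can be reproduced on a small in-memory sample. This is only a sketch: the three records are copied from the outputs above, and `sample`, `naive`, and `named` are illustrative names that are not part of the notebook.

```python
import io
import pandas as pd

# Three records in the same layout as the data file (no header row).
sample = "5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n"
cols = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity']

# Default read: the first record is consumed as the header row.
naive = pd.read_csv(io.StringIO(sample))

# Supplying names keeps every record; na_values turns '?' into NaN.
named = pd.read_csv(io.StringIO(sample), na_values=['?'], names=cols)

print(len(naive), len(named))          # row counts of the two reads
print(named['density'].isna().sum())   # number of '?' entries read as NaN
```

With `names` supplied, pandas treats every line as data, which is why the first record reappears as row 0 in the second read.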


In [4]:
masses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   BI-RADS   959 non-null    float64
 1   age       956 non-null    float64
 2   shape     930 non-null    float64
 3   margin    913 non-null    float64
 4   density   885 non-null    float64
 5   severity  961 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 45.2 KB


#### Evaluate for Missing (NaN) values

In [5]:
# Total null (NaN) values per column; these will be set to zero when the data is cleaned.
masses.isna().sum()  # cleaning step: masses.fillna(0)

BI-RADS      2
age          5
shape       31
margin      48
density     76
severity     0
dtype: int64
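
To put these counts in proportion, they can be expressed as a share of all rows. The sketch below copies the counts from the output above and the row count (961) from `masses.info()`; `missing`, `n_rows`, and `pct_missing` are illustrative names.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above.
missing = pd.Series({'BI-RADS': 2, 'age': 5, 'shape': 31,
                     'margin': 48, 'density': 76, 'severity': 0})
n_rows = 961  # row count reported by masses.info()

# Percentage of rows with a missing value, per column.
pct_missing = (missing / n_rows * 100).round(2)
print(pct_missing.sort_values(ascending=False))
```

On the actual frame, `masses.isna().mean() * 100` should give the same percentages without copying the counts by hand.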

In [6]:
# Show every row with a null in any feature column to check whether the missing values are randomly distributed
masses.loc[(masses['BI-RADS'].isnull()) |
           (masses['age'].isnull()) |
           (masses['shape'].isnull()) |
           (masses['margin'].isnull()) |
           (masses['density'].isnull())]

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1


Note: The missing values appear to be randomly distributed. Rather than dropping the affected rows (an option worth exploring to see how the results differ), the approach here is to fill the missing values with zeros.
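
The two options mentioned in the note can be compared side by side. This is a minimal sketch on a toy frame with made-up values, not the notebook's cleaning step; `toy`, `filled`, and `dropped` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as `masses` (values are illustrative).
toy = pd.DataFrame({
    'age': [43.0, 74.0, np.nan, 65.0],
    'density': [np.nan, 3.0, 3.0, np.nan],
    'severity': [1, 1, 0, 0],
})

filled = toy.fillna(0)   # keeps every row; each NaN becomes 0
dropped = toy.dropna()   # keeps only the rows with no NaN

print(filled.shape, dropped.shape)
```

Filling keeps the sample size but changes column statistics such as the mean, while dropping leaves the remaining values untouched but shrinks the sample.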

#### Evaluate for Duplicate rows

In [7]:
masses.duplicated().sum()  # count of repeated rows; these need to be examined before deciding how to handle them

272

#### Next steps:
- 1 - Replace missing (NaN) values with zeros.
- 2 - Understand the 272 rows considered duplicate rows.
- 3 - Determine a resolution on the duplicate rows.
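
One possible way to approach steps 2 and 3 is to list every member of each duplicated group and count how often each distinct row occurs. The sketch below uses a toy frame with made-up values; `toy`, `n_repeats`, `all_dupes`, and `group_sizes` are illustrative names.

```python
import pandas as pd

# Toy frame standing in for `masses` (values are illustrative).
toy = pd.DataFrame({
    'BI-RADS': [4, 4, 5, 4, 5],
    'age': [43, 43, 58, 43, 58],
    'severity': [1, 1, 1, 1, 0],
})

# duplicated() flags only the repeats after the first occurrence.
n_repeats = toy.duplicated().sum()

# keep=False flags every member of a duplicated group, including the first.
all_dupes = toy[toy.duplicated(keep=False)]

# Count how often each distinct duplicated row occurs.
# dropna=False (pandas >= 1.1) keeps groups whose key contains NaN.
group_sizes = all_dupes.groupby(list(toy.columns), dropna=False).size()

print(n_repeats, len(all_dupes))
print(group_sizes)
```

Because `duplicated()` does not count the first occurrence, the 272 reported above is the number of repeats, not the number of rows involved. With only six low-cardinality columns, identical rows may also be distinct patients who happen to share the same values, which is worth checking before removing anything.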

#### <a href="https://github.com/ElenaE873/classifying_predicting_mammogrammasses/blob/main/Project%20Log.ipynb">Go Back to Project log</a>