#### Data Cleaning

In the data cleaning and normalizing step, it's important to evaluate logically which areas need "fixes" or "resolutions" so that the dataset ends up in the state we are trying to achieve. It's also important to make decisions that do not lose data irresponsibly or change the data in ways that misrepresent it.

#### Next steps:
- 1 - Replace missing (NaN) values with zeros.
- 2 - Understand the 272 rows considered duplicate rows.
- 3 - Determine a resolution on the duplicate rows.

#### Replacing Missing (NaN) Values with Zeros

In [1]:
import pandas as pd

In [2]:
# Import the dataset as before, reading '?' as NaN and supplying the column names
masses = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
masses

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1
...,...,...,...,...,...,...
956,4.0,47.0,2.0,1.0,3.0,0
957,4.0,56.0,4.0,5.0,3.0,1
958,4.0,64.0,4.0,5.0,3.0,0
959,5.0,66.0,4.0,5.0,3.0,1


In [3]:
masses_rev = masses.fillna(0)  # Turn NaN into zeros, keeping in mind that 131 rows now contain zeros that were originally NaN
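
Before filling, it can help to confirm how many values are actually missing. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the real dataset). It uses `isna()` to count missing values per column and rows with at least one missing value, then checks that `fillna(0)` leaves none behind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `masses`; the values are made up.
toy = pd.DataFrame({
    'age': [67.0, 43.0, np.nan, 28.0],
    'density': [3.0, np.nan, 3.0, np.nan],
    'severity': [1, 1, 0, 0],
})

missing_per_column = toy.isna().sum()              # NaN count in each column
rows_with_missing = toy.isna().any(axis=1).sum()   # rows with at least one NaN

toy_filled = toy.fillna(0)                         # same call as the cell above
remaining = toy_filled.isna().sum().sum()          # total NaN left after filling

print(missing_per_column)
print(rows_with_missing, remaining)
```

Running the same two counts on `masses` would show how many rows the fill touches in the real data.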

#### Evaluating Duplicates Further

In [4]:
masses.duplicated().sum()

272

In [5]:
#Find duplicate rows based on all columns
allduplicatedrows = masses[masses.duplicated()]

In [6]:
# Raise the display limit so every row flagged as a duplicate is shown, so we can examine whether duplicates based on all columns truly exist.
pd.set_option('display.max_rows', allduplicatedrows.shape[0]+1)
print(allduplicatedrows)

     BI-RADS   age  shape  margin  density  severity
52       4.0  23.0    1.0     1.0      NaN         0
95       5.0  54.0    4.0     4.0      3.0         1
116      4.0  45.0    2.0     1.0      NaN         0
127      4.0  40.0    1.0     1.0      NaN         0
134      5.0  66.0    4.0     4.0      3.0         1
139      5.0  67.0    3.0     5.0      3.0         1
152      4.0  66.0    1.0     1.0      3.0         0
160      5.0  67.0    4.0     4.0      3.0         1
167      4.0  64.0    1.0     1.0      3.0         0
174      4.0  54.0    1.0     1.0      3.0         0
176      4.0  62.0    2.0     1.0      NaN         0
194      4.0  50.0    1.0     1.0      3.0         0
197      5.0  66.0    4.0     4.0      3.0         1
215      4.0  46.0    1.0     1.0      3.0         0
217      4.0  57.0    1.0     1.0      3.0         0
221      4.0  45.0    2.0     1.0      3.0         0
226      5.0  63.0    4.0     4.0      3.0         1
234      5.0  64.0    4.0     5.0      3.0    

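Note: These rows do not appear to be true duplicate records. The dataset has no patient identifier, so different patients can share the same values in every column. We are okay to proceed without dropping any of the rows.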
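
By default, `duplicated()` flags only the second and later copies of a repeated row, so a printout of the flagged rows never shows a row next to the one it repeats. The following is a minimal sketch, again on a made-up frame (`toy` is hypothetical), of using `keep=False` to list every member of each repeated group side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; rows 0/2 and rows 1/3 are identical in every column.
toy = pd.DataFrame({
    'BI-RADS': [4.0, 5.0, 4.0, 5.0, 4.0],
    'age': [23.0, 54.0, 23.0, 54.0, 40.0],
    'density': [np.nan, 3.0, np.nan, 3.0, 3.0],
    'severity': [0, 1, 0, 1, 0],
})

# Default keep='first': only the later copies are counted.
later_copies = toy.duplicated().sum()

# keep=False flags every row that has at least one identical twin;
# sorting places the members of each repeated group next to each other.
all_copies = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))

print(later_copies)
print(all_copies)
```

Note that `duplicated()` treats NaN values in the same position as equal, so rows with a missing `density` can still be flagged.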

In [7]:
masses

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1
...,...,...,...,...,...,...
956,4.0,47.0,2.0,1.0,3.0,0
957,4.0,56.0,4.0,5.0,3.0,1
958,4.0,64.0,4.0,5.0,3.0,0
959,5.0,66.0,4.0,5.0,3.0,1


#### Reviewing Final Data Set from Cleaning

In [8]:
masses_rev

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,0.0,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,0.0,1
...,...,...,...,...,...,...
956,4.0,47.0,2.0,1.0,3.0,0
957,4.0,56.0,4.0,5.0,3.0,1
958,4.0,64.0,4.0,5.0,3.0,0
959,5.0,66.0,4.0,5.0,3.0,1


#### Exporting the Clean Dataset

In [9]:
masses_clean = "Clean_Masses_Data.csv"
print("Saving file: '{}'".format(masses_clean))
masses_rev.to_csv(masses_clean, index=False, encoding='utf-8')
print("File Saved...")

Saving file: 'Clean_Masses_Data.csv'
File Saved...
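
One way to check an export like this is to read the file back and compare it with the frame that was written. This is a minimal sketch under assumptions: `toy` and the file name `toy_clean.csv` are hypothetical, and a temporary directory is used so no real file is overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for `masses_rev`; the values are made up.
toy = pd.DataFrame({
    'age': [67.0, 43.0],
    'density': [3.0, 0.0],
    'severity': [1, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clean.csv')        # hypothetical file name
    toy.to_csv(path, index=False, encoding='utf-8')  # same options as the cell above
    reloaded = pd.read_csv(path)                     # read the file back

same_shape = reloaded.shape == toy.shape             # no rows or columns lost
no_missing = int(reloaded.isna().sum().sum()) == 0   # no NaN reintroduced
same_values = reloaded.equals(toy)                   # values and dtypes match

print(same_shape, no_missing, same_values)
```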


#### <a href="https://github.com/ElenaE873/classifying_predicting_mammogrammasses/blob/main/Project%20Log.ipynb">Go Back to Project log</a>