# Cleaning Practice
Let's practice handling missing values and duplicate rows using `cancer_data_means.csv`, the file you created and saved in the "Assessing and Building Intuition" notebook a few pages back. If you saved the file there, it should still be available in this workspace to load into this notebook.

In [15]:
# import pandas and load cancer data
import pandas as pd

df = pd.read_csv('cancer_data_means.csv')
# check which columns have missing values with info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
id                        569 non-null int64
diagnosis                 569 non-null object
radius_mean               569 non-null float64
texture_mean              548 non-null float64
perimeter_mean            569 non-null float64
area_mean                 569 non-null float64
smoothness_mean           521 non-null float64
compactness_mean          569 non-null float64
concavity_mean            569 non-null float64
concave_points_mean       569 non-null float64
symmetry_mean             504 non-null float64
fractal_dimension_mean    569 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.4+ KB


In [16]:
# use means to fill in missing values
text_mean = df['texture_mean'].mean()
smooth_mean = df['smoothness_mean'].mean()
symm_mean = df['symmetry_mean'].mean()

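# assign the result back rather than calling fillna(..., inplace=True) on a
# selected column, which may not update df in newer pandas versions
df['texture_mean'] = df['texture_mean'].fillna(text_mean)
df['smoothness_mean'] = df['smoothness_mean'].fillna(smooth_mean)
df['symmetry_mean'] = df['symmetry_mean'].fillna(symm_mean)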

# confirm your correction with info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
id                        569 non-null int64
diagnosis                 569 non-null object
radius_mean               569 non-null float64
texture_mean              569 non-null float64
perimeter_mean            569 non-null float64
area_mean                 569 non-null float64
smoothness_mean           569 non-null float64
compactness_mean          569 non-null float64
concavity_mean            569 non-null float64
concave_points_mean       569 non-null float64
symmetry_mean             569 non-null float64
fractal_dimension_mean    569 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.4+ KB
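The cell above fills three columns one at a time. As a minimal sketch of a more compact alternative, `fillna` also accepts a Series of per-column values, so every numeric column can be filled with its own mean in one call. The `toy` frame and its values are made up for illustration and are not taken from the cancer data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cancer data (values are invented for illustration)
toy = pd.DataFrame({
    'diagnosis': ['M', 'B', 'B', 'M'],
    'texture_mean': [10.0, np.nan, 14.0, 12.0],
    'symmetry_mean': [0.2, 0.4, np.nan, np.nan],
})

# Means of the numeric columns only; NaN values are skipped by default
col_means = toy.mean(numeric_only=True)

# Passing a Series fills each column with the value at the matching label;
# columns without an entry (here 'diagnosis') are left untouched
filled = toy.fillna(col_means)

print(filled)
print(filled.isnull().sum())
```

Filling with the mean keeps the column average unchanged but shrinks its variance, so it is worth noting how many values were imputed before relying on spread-based statistics.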


In [17]:
# count the unique values in each column
df.nunique()

id                        562
diagnosis                   2
radius_mean               451
texture_mean              460
perimeter_mean            516
area_mean                 532
smoothness_mean           435
compactness_mean          530
concavity_mean            530
concave_points_mean       536
symmetry_mean             395
fractal_dimension_mean    494
dtype: int64

In [18]:
# count the rows that repeat an earlier row
df.duplicated().sum()

5

In [19]:
# check for duplicates in the data
df[df.duplicated()]

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 202 | 852552 | M | 16.65 | 21.38 | 110.0 | 904.6 | 0.1121 | 0.1457 | 0.1525 | 0.0917 | 0.1995 | 0.0633 |
| 325 | 89511502 | B | 12.67 | 17.3 | 81.25 | 489.9 | 0.1028 | 0.07664 | 0.03193 | 0.02107 | 0.1707 | 0.05984 |
| 345 | 898677 | B | 10.26 | 14.71 | 66.2 | 321.6 | 0.09882 | 0.09159 | 0.03581 | 0.02037 | 0.1633 | 0.07005 |
| 489 | 9113846 | B | 12.27 | 29.97 | 77.42 | 465.4 | 0.07699 | 0.03398 | 0.0 | 0.0 | 0.1701 | 0.0596 |
| 558 | 925277 | B | 14.59 | 22.68 | 96.39 | 657.1 | 0.08473 | 0.133 | 0.1029 | 0.03736 | 0.1454 | 0.06147 |


In [20]:
# drop duplicates
df.drop_duplicates(inplace=True)
df.iloc[200:205]

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 877501 | B | 12.23 | 19.56 | 78.54 | 461.0 | 0.096087 | 0.08087 | 0.04187 | 0.04107 | 0.1979 | 0.06013 |
| 201 | 877989 | M | 17.54 | 19.32 | 115.1 | 951.6 | 0.08968 | 0.1198 | 0.1036 | 0.07488 | 0.181091 | 0.05491 |
| 203 | 87880 | M | 13.81 | 23.75 | 91.56 | 597.8 | 0.1323 | 0.1768 | 0.1558 | 0.09176 | 0.181091 | 0.07421 |
| 204 | 87930 | B | 12.47 | 18.6 | 81.09 | 481.9 | 0.09965 | 0.1058 | 0.08005 | 0.03821 | 0.1925 | 0.06373 |
| 205 | 879523 | M | 15.12 | 16.68 | 98.78 | 716.6 | 0.08876 | 0.09588 | 0.0755 | 0.04079 | 0.1594 | 0.05986 |


In [21]:
# confirm correction by rechecking for duplicates in the data
df.duplicated().any()

False
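Called without arguments, `duplicated()` flags only rows that match an earlier row in every column. The `nunique()` output above lists 562 unique `id` values, while 569 rows minus 5 exact duplicates leaves 564, which suggests a few ids may still appear on more than one row. A minimal sketch of how the `subset` and `keep` parameters can surface such cases, using a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Toy frame (values invented for illustration): id 2 repeats exactly,
# id 3 repeats with a different measurement
toy = pd.DataFrame({
    'id': [1, 2, 2, 3, 3],
    'radius_mean': [10.0, 12.5, 12.5, 9.1, 9.4],
})

# Rows identical to an earlier row in every column
full_dupes = toy.duplicated()

# Rows that repeat an earlier id, even when other columns differ
id_dupes = toy.duplicated(subset=['id'])

# keep=False flags every member of a repeated group, not just the later copies
all_id_dupes = toy.duplicated(subset=['id'], keep=False)

# drop_duplicates returns a new frame and keeps the first occurrence by default
deduped = toy.drop_duplicates()

print(full_dupes.sum(), id_dupes.sum(), all_id_dupes.sum(), len(deduped))
```

Whether rows that share an id but differ elsewhere should be dropped depends on the dataset, so inspecting them with `toy[all_id_dupes]` before deleting anything is usually the safer choice.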

## Renaming Columns
Since we previously trimmed the dataset to include only the means of the tumor features, the "_mean" suffix on each feature name is redundant and just adds typing in later analysis. Let's build a list of new labels to assign to our columns.

In [22]:
# remove "_mean" from column names
new_labels = []
for col in df.columns:
    if col.endswith('_mean'):
        new_labels.append(col[:-5])  # drop the 5-character "_mean" suffix
    else:
        new_labels.append(col)

# new labels for our columns
new_labels

['id',
 'diagnosis',
 'radius',
 'texture',
 'perimeter',
 'area',
 'smoothness',
 'compactness',
 'concavity',
 'concave_points',
 'symmetry',
 'fractal_dimension']

In [23]:
# assign new labels to columns in dataframe
df.columns = new_labels

# display first few rows of dataframe to confirm changes
df.head()

| | id | diagnosis | radius | texture | perimeter | area | smoothness | compactness | concavity | concave_points | symmetry | fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 19.293431 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.096087 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 |
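The renaming loop can also be written with pandas string methods. The sketch below is one possible alternative, not the notebook's method: it strips a trailing "_mean" with an anchored regular expression and applies the result through `rename`. The `toy` frame and the column `area_mean_rank` are hypothetical, included only to show that a "_mean" in the middle of a name is left alone.

```python
import pandas as pd

# Empty toy frame; 'area_mean_rank' is a hypothetical column name for illustration
toy = pd.DataFrame(columns=['id', 'diagnosis', 'radius_mean',
                            'concave_points_mean', 'area_mean_rank'])

# The $ anchor restricts the match to a "_mean" at the very end of the name
stripped = toy.columns.str.replace(r'_mean$', '', regex=True)

# rename returns a new frame, leaving the original column labels intact
renamed = toy.rename(columns=dict(zip(toy.columns, stripped)))

print(list(renamed.columns))
```

Because `rename` returns a new DataFrame, the old and new labels can be compared before committing to the change.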


In [24]:
# save this for later
df.to_csv('cancer_data_edited.csv', index=False)
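Passing `index=False` keeps the row index out of the saved file. As a small illustration with a made-up `toy` frame and temporary file names (both assumptions, not part of the notebook), the sketch below writes the same data with and without the index and compares the columns that come back from `read_csv`.

```python
import os
import tempfile

import pandas as pd

# Toy frame (values invented for illustration)
toy = pd.DataFrame({'id': [1, 2], 'radius': [10.0, 12.5]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, 'with_index.csv')
    without_index = os.path.join(tmp, 'without_index.csv')

    toy.to_csv(with_index)                    # index written as an unnamed first column
    toy.to_csv(without_index, index=False)    # index omitted

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'id', 'radius']
print(cols_without)  # ['id', 'radius']
```

Without `index=False`, each save and reload cycle would add another "Unnamed" column to the data.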