# Cleaning Practice

Let's first practice handling missing values and duplicate data using the `cancer_data_means.csv` file.


In [3]:
# import pandas and load cancer data
import pandas as pd
df = pd.read_csv('cancer_data_means.csv')
# check which columns have missing values
df.columns[df.isna().any()]



Index(['texture_mean', 'smoothness_mean', 'symmetry_mean'], dtype='object')

In [5]:
# fill missing values in each affected column with that column's mean
for col in df.columns[df.isna().any()]:
    col_mean = df[col].mean()
    df[col] = df[col].fillna(col_mean)

# confirm the correction: no column should contain missing values now
df.columns[df.isna().any()]

Index([], dtype='object')
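
The column-by-column loop above can also be written as a single `fillna` call. The following is a minimal sketch on a small made-up dataframe (`toy` and its values are illustrative assumptions, not taken from `cancer_data_means.csv`); it relies on `fillna` accepting a Series whose labels are matched against column names.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file; the values are made up for illustration.
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "B"],
    "texture_mean": [10.0, np.nan, 14.0, 12.0],
    "smoothness_mean": [0.10, 0.20, np.nan, 0.30],
})

# Means of the numeric columns only; NaN values are skipped by default.
col_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value at the matching label.
# Columns without an entry in col_means (here "diagnosis") are left untouched.
filled = toy.fillna(col_means)

print(filled.isna().sum())
```

Whether the loop or the one-liner reads better is a matter of taste; both leave non-numeric columns alone as long as they contain no missing values.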

In [8]:
# how many duplicate rows are there?
df.duplicated().sum()

5

In [11]:
# drop duplicate rows, keeping the first occurrence of each
df = df.drop_duplicates()


In [12]:
# confirm correction by rechecking for duplicates in the data
df.duplicated().sum()

0
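
By default, `duplicated()` flags a row only when every column matches an earlier row. As a hedged illustration of how the `subset` and `keep` parameters change that, here is a small sketch on made-up data (`toy` and its values are assumptions for demonstration only):

```python
import pandas as pd

# Made-up rows: the two id=2 rows are fully identical,
# while the two id=3 rows differ in "radius".
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "radius": [10.0, 11.0, 11.0, 12.0, 12.5],
})

# Full-row comparison: only the second id=2 row repeats an earlier row.
n_full = toy.duplicated().sum()

# Comparing on "id" alone also flags the second id=3 row.
n_by_id = toy.duplicated(subset=["id"]).sum()

# keep=False marks every member of a duplicate group, which helps when inspecting them.
groups = toy[toy.duplicated(keep=False)]

# Drop full-row duplicates and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_full, n_by_id, len(groups), len(deduped))
```

Inspecting the flagged rows with `keep=False` before dropping them is a cheap way to check that they really are redundant.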

## Renaming Columns

Since we previously reduced our dataset to include only the means of the tumor features, the "\_mean" suffix at the end of each feature name is unnecessary and just takes extra time to type in later analysis. Rename the columns of the dataframe to remove "\_mean".


In [15]:
# rename the columns of the dataframe (remove "_mean" from each column name where present)
df = df.rename(columns=lambda x: x.replace("_mean", ""))
df.columns

Index(['id', 'diagnosis', 'radius', 'texture', 'perimeter', 'area',
       'smoothness', 'compactness', 'concavity', 'concave_points', 'symmetry',
       'fractal_dimension'],
      dtype='object')
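
Note that `str.replace` removes "\_mean" wherever it occurs in a name, not only at the end. That makes no difference for the columns shown here, but a sketch with a hypothetical column (`area_mean_error` is invented for illustration and does not exist in this dataset) shows how an anchored regular expression would restrict the change to a trailing suffix:

```python
import re
import pandas as pd

# Empty frame with made-up column names; "area_mean_error" is hypothetical.
toy = pd.DataFrame(columns=["id", "radius_mean", "area_mean_error"])

# str.replace removes every occurrence of "_mean", including one in the middle.
replaced = toy.rename(columns=lambda name: name.replace("_mean", ""))

# The "$" anchor limits the substitution to a "_mean" at the very end of the name.
anchored = toy.rename(columns=lambda name: re.sub(r"_mean$", "", name))

print(list(replaced.columns))  # ['id', 'radius', 'area_error']
print(list(anchored.columns))  # ['id', 'radius', 'area_mean_error']
```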

In [16]:
# display first few rows of the dataframe to confirm changes
df.head()

| | id | diagnosis | radius | texture | perimeter | area | smoothness | compactness | concavity | concave_points | symmetry | fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 19.293431 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.096087 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 |


In [10]:
# save this for later as a CSV file named "cancer_data_edited.csv", with index=False (why?)
df.to_csv("cancer_data_edited.csv", index=False)



Setting `index=False` in the `to_csv()` call excludes the row index from the output CSV file. If we don’t set this parameter, pandas writes the row index as an extra first column in the output file.

If we don’t need to keep track of the row index in our CSV file, it’s usually a good idea to set `index=False` to keep the output file clean and simple.
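
To see the effect concretely, the sketch below writes a tiny made-up dataframe to an in-memory string instead of a file (the `toy` frame and the use of `io.StringIO` are assumptions for illustration) and reads it back both ways:

```python
import io
import pandas as pd

# Two illustrative rows; any small dataframe would do.
toy = pd.DataFrame({"id": [842302, 842517], "diagnosis": ["M", "M"]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = toy.to_csv()                # row index written as an unnamed first column
without_index = toy.to_csv(index=False)  # row index omitted

# Read both versions back to compare the resulting columns.
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))     # ['Unnamed: 0', 'id', 'diagnosis']
print(list(reloaded_without.columns))  # ['id', 'diagnosis']
```

The stray `Unnamed: 0` column is what the saved index turns into on reload, and it would then need to be dropped or passed to `index_col` each time the file is read.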