# Pandas - Cleaning Data

Data cleaning means fixing bad data in your data set.

Bad data could be:

1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates

# 1. Empty cells

One way to deal with empty cells is to **remove rows** that contain empty cells.

In [None]:
#DO NOT RUN THIS
# Return a new Data Frame with no empty cells:
import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()

print(new_df.to_string())

**Note:** By default, the **dropna()** method returns a new DataFrame, and will not change the original.
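The cell above depends on data.csv, so here is a minimal self-contained sketch of the same behaviour. The sample values are hypothetical and only serve to show that dropna() leaves the original DataFrame untouched.

```python
import pandas as pd

# Hypothetical sample data; None becomes an empty (NaN) cell in a numeric column.
df = pd.DataFrame({
    "Duration": [60, 45, None, 30],
    "Calories": [409.1, None, 340.0, 195.1],
})

# dropna() returns a new DataFrame without the rows that contain empty cells.
new_df = df.dropna()

print(len(df))      # the original still holds every row
print(len(new_df))  # only the complete rows remain
```

Rows 1 and 2 each contain one empty cell, so only rows 0 and 3 survive in new_df.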

If you want to change the original DataFrame, use the **inplace = True** argument:\
Remove all rows with NULL values:

In [None]:
#DO NOT RUN THIS
import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string())

**Note:** With **inplace = True**, **dropna()** does NOT return a new DataFrame (it returns None). Instead, it **removes all rows containing NULL values** from the original DataFrame.

### Replace Empty Values

The **fillna()** method allows us to replace empty cells with a value:

In [None]:
#DO NOT RUN THIS
import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)
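As a rough illustration that does not need data.csv, the sketch below applies fillna() to a small hypothetical DataFrame. Note that the same value lands in every column that has an empty cell, which is rarely what you want for columns with different meanings.

```python
import pandas as pd

# Hypothetical sample data with one empty cell in each column.
df = pd.DataFrame({
    "Duration": [60, None, 45],
    "Calories": [409.1, 479.0, None],
})

# Every empty cell, regardless of column, is replaced with 130.
df.fillna(130, inplace=True)

print(df)
```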

### Replace Only For Specified Columns
Replace NULL values in the "Calories" column with the number 130:

In [None]:
#DO NOT RUN THIS
import pandas as pd

df = pd.read_csv('data.csv')

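# Assign the result back to the column; an inplace fill on a selected
# column may not update df in newer pandas versions.
df["Calories"] = df["Calories"].fillna(130)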
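Another way to limit the replacement to specific columns is to pass fillna() a dictionary that maps column names to fill values. The following is a small sketch on hypothetical sample data, not the contents of data.csv.

```python
import pandas as pd

# Hypothetical sample data with one empty cell in each column.
df = pd.DataFrame({
    "Duration": [60, None, 45],
    "Calories": [409.1, 479.0, None],
})

# Only the columns named in the dictionary are filled.
filled = df.fillna({"Calories": 130})

print(filled)
```

The empty cell in "Duration" is left alone because that column is not listed in the dictionary.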

### Replace Using Mean, Median, or Mode

Pandas provides the **mean()**, **median()**, and **mode()** methods to calculate the respective values for a specified column:



In [None]:
# DO NOT RUN THIS
#Calculate the MEAN, and replace any empty values with it:
import pandas as pd

df = pd.read_csv('data.csv')

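# You can use median() instead of mean().
# For the mode, use mode()[0], because mode() returns a Series.
x = df["Calories"].mean()

# Assign the result back to the column instead of filling inplace.
df["Calories"] = df["Calories"].fillna(x)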
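To make the difference between the three statistics concrete, here is a small sketch on a hypothetical "Calories" column. It also shows why the mode needs an extra `[0]`: mode() returns a Series, since several values can be tied for most frequent.

```python
import pandas as pd

# Hypothetical sample column with one empty cell.
calories = pd.Series([100.0, 200.0, 200.0, None, 400.0], name="Calories")

mean_value = calories.mean()      # empty cells are skipped: (100 + 200 + 200 + 400) / 4
median_value = calories.median()  # middle value of the sorted non-empty cells
mode_value = calories.mode()[0]   # first entry of the Series of most frequent values

# Fill the empty cell with the mean; swap in median_value or mode_value as needed.
filled = calories.fillna(mean_value)

print(mean_value, median_value, mode_value)
```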



# 2. Data in wrong format

Let's try to convert all cells in the 'Date' column into dates.

Pandas has a **to_datetime()** function for this:

In [None]:
#DO NOT RUN THIS
import pandas as pd

df = pd.read_csv('data.csv')

df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())
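Real data often contains cells that cannot be read as a date at all. One possible way to handle them, sketched here on hypothetical sample values, is to pass `errors="coerce"` so that unparseable cells become NaT (the datetime equivalent of an empty cell) and can then be removed with dropna().

```python
import pandas as pd

# Hypothetical sample values: two valid dates, one empty cell, one unparseable string.
df = pd.DataFrame({"Date": ["2020/12/01", "2020/12/02", None, "not a date"]})

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising an error.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Count the cells that could not be converted.
n_bad = int(df["Date"].isna().sum())
print(n_bad)

# Remove the rows that have no usable date.
df.dropna(subset=["Date"], inplace=True)

print(df.to_string())
```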

# 3. Wrong data

One way to fix wrong values is to **replace** them with something else.

In [None]:
#DO NOT RUN THIS
#Set "Duration" = 45 in row 7:

df.loc[7, 'Duration'] = 45

Loop through all rows and, if the value in "Duration" is higher than 120, set it to 120:

In [None]:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
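The loop above works, but on larger data sets the same rule is usually expressed without a loop. The sketch below, using hypothetical sample values, shows two common alternatives: a Boolean mask with loc, and the clip() method.

```python
import pandas as pd

# Hypothetical sample data containing two values above the limit.
df = pd.DataFrame({"Duration": [60, 450, 45, 120, 300]})

# Select every row where "Duration" is above 120 and overwrite it in one step.
df.loc[df["Duration"] > 120, "Duration"] = 120

# clip() gives the same result by capping all values at an upper bound.
clipped = pd.Series([60, 450, 45, 120, 300]).clip(upper=120)

print(df["Duration"].tolist())
print(clipped.tolist())
```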

Delete rows where "Duration" is higher than 120:

In [None]:
for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
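Removing rows can also be written as a filter instead of a loop. This is a minimal sketch on hypothetical sample data: it builds a Boolean mask of the rows to discard and keeps everything else.

```python
import pandas as pd

# Hypothetical sample data containing two rows above the limit.
df = pd.DataFrame({
    "Duration": [60, 450, 45, 120, 300],
    "Calories": [409.1, 1200.0, 282.4, 500.0, 1000.0],
})

# True for every row whose "Duration" is higher than 120.
too_long = df["Duration"] > 120

# Keep only the rows that are NOT flagged.
df = df[~too_long]

print(df.to_string())
```

Like the loop, this keeps rows where "Duration" is empty, because a comparison with an empty cell evaluates to False.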

# 4. Removing Duplicates

To discover duplicates, we can use the **duplicated()** method.

The **duplicated()** method returns a Boolean value for each row:

In [None]:
# Returns True for every row that is a duplicate, otherwise False:

print(df.duplicated())

To remove duplicates, use the **drop_duplicates()** method.

In [None]:
df.drop_duplicates(inplace = True)

Remember: The **inplace = True** argument makes sure that the method does NOT return a new DataFrame; instead, it removes all duplicates from the original DataFrame.
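Here is a small self-contained sketch of both methods on hypothetical sample data in which row 2 repeats row 0. By default, the first occurrence is kept and only the later copies are flagged.

```python
import pandas as pd

# Hypothetical sample data; row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Calories": [409.1, 282.4, 409.1, 195.1],
})

# True marks a row that repeats an earlier row.
flags = df.duplicated()
print(flags.tolist())

# Remove the flagged rows from the original DataFrame.
df.drop_duplicates(inplace=True)
print(df.to_string())
```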

# Pandas - Data Correlations

### Finding Relationships
A great aspect of the Pandas module is the **corr()** method.

The **corr()** method calculates the correlation between each pair of columns in your data set.

In [None]:
# Show the relationship between the columns:

df.corr()
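To get a feel for the output, the sketch below runs corr() on hypothetical sample data built so that the relationships are exact. The result is a table with one row and one column per input column, where values range from -1 (perfect negative relationship) to 1 (perfect positive relationship).

```python
import pandas as pd

# Hypothetical sample data with deliberately exact relationships:
# "Calories" is always 5 * "Duration", and "Pulse" falls steadily as "Duration" rises.
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150.0, 225.0, 300.0, 450.0],
    "Pulse": [130, 120, 110, 90],
})

corr = df.corr()

print(corr)
```

In this constructed example, "Duration" and "Calories" correlate at 1, while "Duration" and "Pulse" correlate at -1. Real data will normally give values somewhere in between.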

# Pandas - Plotting

Pandas uses the **plot()** method to create diagrams.\
We can use **Pyplot**, a submodule of the **Matplotlib library**, to visualize the diagram on the screen.

In [None]:
#DO NOT RUN THIS
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot()

plt.show()

### Scatter Plot
Specify that you want a scatter plot with the kind argument:

**kind = 'scatter'**

A scatter plot needs an **x-axis and a y-axis.**

In [None]:
#DO NOT RUN THIS
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show()
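The following sketch draws the same kind of scatter plot from hypothetical in-memory data, so it does not need data.csv. It also keeps the Axes object that plot() returns, which is one way to check or adjust the axis labels afterwards.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data: longer workouts burn more calories.
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150.0, 240.0, 310.0, 460.0],
})

# plot() returns a Matplotlib Axes object; x and y name the columns to use.
ax = df.plot(kind="scatter", x="Duration", y="Calories")

# The axis labels default to the column names.
print(ax.get_xlabel(), ax.get_ylabel())

plt.show()
```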

### Histogram
Use the kind argument to specify that you want a histogram:

**kind = 'hist'**

A histogram needs only **one column.**

In [None]:
df["Duration"].plot(kind = 'hist')
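As a self-contained sketch, the histogram below is built from a hypothetical "Duration" column. The `bins` argument, which is passed through to Matplotlib, sets how many intervals the values are grouped into; the choice of 3 here is only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data: six workout durations in minutes.
df = pd.DataFrame({"Duration": [30, 45, 45, 60, 60, 90]})

# Group the values into 3 intervals and draw one bar per interval.
ax = df["Duration"].plot(kind="hist", bins=3)

# Each bar's height is the number of values that fall into its interval.
counts = [int(bar.get_height()) for bar in ax.patches]
print(counts)

plt.show()
```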