
# Pandas - Data Cleaning Tutorial
In this lecture, we will explore how to clean data using the Pandas library in Python.

We will cover the following topics:
- Removing Empty Cells
- Fixing Data with Wrong Format
- Fixing Wrong Data
- Removing Duplicates


In [None]:

import pandas as pd

# Sample data
data = {
    "Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60, 60, 60, 60, 60, 60, 60, 60, 60, 45, 60, 45, 60, 45, 60, 45, 60, 60, 60, 60, 60, 60, 60],
    "Date": [None, '2020/12/02', None, '2020/12/04', '2020/12/05', '2020/12/06', '2020/12/07', '2020/12/08', '2020/12/09', '2020/12/10',
             '2020/12/11', '2020/12/12', '2020/12/12', '2020/12/13', '2020/12/14', '2020/12/15', '2020/12/16', '2020/12/17', '2020/12/18', '2020/12/19',
             '2020/12/20', '2020/12/21', None, '2020/12/23', '2020/12/24', '2020/12/25', '20201226', '2020/12/27', '2020/12/28', '2020/12/29', '2020/12/30', '2020/12/31'],
    "Calories": [409.1, 479.0, 340.0, 282.4, 406.0, 300.0, 374.0, 253.3, 195.1, 269.0, 329.3, 250.7, 250.7, 345.3, 379.3, 275.0, 215.2, 300.0, None, 323.0,
                 243.0, 364.2, 282.0, 300.0, 246.0, 334.5, 250.0, 241.0, None, 280.0, 380.3, 243.0]
}

# Build the starting DataFrame
df = pd.DataFrame(data)

In [None]:
df.info()

In [None]:
# What do you notice about the DataFrame? What needs to be cleaned up?
df
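
A few summary calls can help answer that question before scrolling through the rows. The sketch below uses a small, hypothetical frame named `sample` (not the tutorial's `df`) so it can run on its own; its column names simply mirror the ones above.

```python
import pandas as pd

# Hypothetical miniature of the tutorial data, for illustration only
sample = pd.DataFrame({
    "Duration": [60, 450, 45, 45, 30],
    "Date": ["2020/12/01", None, "20201203", "20201203", "2020/12/05"],
    "Calories": [409.1, 253.3, 282.4, 282.4, None],
})

# Count missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())
print(duplicate_rows)

# The maximum of a numeric column is a quick way to spot outliers
longest_duration = int(sample["Duration"].max())
print(longest_duration)
```

Each result points at one of the cleaning steps covered in the sections that follow.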



## 1. Removing Empty Cells
Empty cells (which pandas stores as `NaN` or `None`) can give you incorrect results when you analyze data.

### Removing Empty Cells
You can remove rows with empty cells using the `dropna()` method:


In [None]:
# Remove rows with empty cells
df = df.dropna()
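
Dropping every incomplete row is the bluntest option. As a hedged sketch of two common alternatives, the example below limits the check to one column with `dropna(subset=...)` and fills a numeric gap with `fillna()`. The frame `sample` and the choice of the column mean as the fill value are assumptions made for illustration, not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Calories": [300.0, 250.0, None],
})

# Drop a row only when its 'Date' is missing
dated = sample.dropna(subset=["Date"])
print(dated)

# Fill the remaining gap in 'Calories' with the column mean
filled = dated.fillna({"Calories": dated["Calories"].mean()})
print(filled)
```

Whether to drop or fill depends on how much data you can afford to lose and whether a substitute value is meaningful for the column.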

In [None]:
# What do you notice about the index numbers? What else still needs cleaning?
df
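
After `dropna()`, the index keeps its original labels, so it has gaps where rows were removed. If consecutive labels are wanted, `reset_index(drop=True)` renumbers the rows. The sketch uses a hypothetical frame `sample`; the tutorial's `df` is left as is here, because the later sections refer to rows by their original labels.

```python
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({"Calories": [409.1, None, 340.0, None, 406.0]})

# Removing rows leaves gaps in the index labels
cleaned = sample.dropna()
print(cleaned.index.tolist())

# drop=True discards the old labels instead of keeping them as a column
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```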

In [None]:
df.info()


## 2. Fixing Data with Wrong Format
Sometimes, data is stored in the wrong format and needs correction. For instance, every value in the 'Date' column should be a date written in the same format.

### Example:
```plaintext
Rows 0, 2, and 22 had empty dates; dropna() already removed them in step 1.
Row 26 has a date in the wrong format ('20201226').
```

We can use the `pd.to_datetime()` function to convert the 'Date' column into datetime values:


In [None]:
# Convert the 'Date' column into datetime format

# Make sure all values are strings
dates = df['Date'].astype(str)

# Identify values that are in 'YYYYMMDD' format (8 digits, no slashes)
mask = dates.str.match(r'^\d{8}$')
print(mask)

# Parse the 8-digit values with their own format
compact = pd.to_datetime(dates.where(mask), format='%Y%m%d', errors='coerce')

# Parse the remaining 'YYYY/MM/DD' values; errors='coerce' turns anything
# that does not match into NaT
slashed = pd.to_datetime(dates.where(~mask), format='%Y/%m/%d', errors='coerce')

# Combine both results so the column ends up with a true datetime dtype
df['Date'] = slashed.fillna(compact)


Passing `errors='coerce'` to `pd.to_datetime()` tells pandas to handle invalid dates gracefully instead of raising an error. Here's what it does:

If a value in the 'Date' column can't be converted into a valid datetime (e.g., due to typos, wrong format, or missing values), it will be replaced with NaT (Not a Time), which is the datetime equivalent of NaN.
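
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a hypothetical Series named `raw_dates`; the explicit `format` string is an assumption chosen to match the slash-separated dates used in this tutorial.

```python
import pandas as pd

# Hypothetical input: one valid date, one typo, one missing value
raw_dates = pd.Series(["2020/12/01", "not a date", None])

# Unparseable or missing entries become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%Y/%m/%d", errors="coerce")
print(parsed)

# NaT counts as missing, so the usual missing-value tools apply
valid_only = parsed.dropna()
print(valid_only)
```

Without `errors='coerce'`, the second entry would raise an exception and stop the conversion.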

In [None]:
df


## 3. Fixing Wrong Data
Sometimes, data values may be wrong, such as a duration of 450 minutes in a dataset where most durations are between 30 and 60 minutes.

### Example:
We will cap every duration above 120 minutes at 120, which brings the 450-minute value in row 7 back into a plausible range:


In [None]:

# Fix wrong data in 'Duration' column: cap values above 120 minutes at 120
df.loc[df['Duration'] > 120, 'Duration'] = 120
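
Capping is one of several options. The sketch below shows two others on a hypothetical frame `sample`: overwriting a single cell by its index label, and dropping the implausible rows altogether. The replacement value 45 and the 120-minute threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({"Duration": [60, 450, 45, 300]})

# Option 1: overwrite one known-bad cell by its index label
fixed = sample.copy()
fixed.loc[1, "Duration"] = 45
print(fixed)

# Option 2: keep only rows at or below the threshold
trimmed = sample[sample["Duration"] <= 120]
print(trimmed)
```

Overwriting a single cell suits small datasets where the correct value is known; rule-based capping or dropping scales better to large ones.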

In [None]:
df


## 4. Removing Duplicates
Duplicate rows are rows that have been registered more than once in the dataset.

### Example:
Rows 11 and 12 are duplicates.

To remove duplicates, we use the `drop_duplicates()` method:


In [None]:

# Remove duplicates
df_no_duplicates = df.drop_duplicates()

# Show the cleaned DataFrame
print(df_no_duplicates.to_string())
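
Before dropping anything, it can help to see which rows pandas considers duplicates. The sketch below uses `duplicated()` for that and the `keep` parameter of `drop_duplicates()` to choose which copy survives; the frame `sample` is hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({
    "Date": ["2020/12/11", "2020/12/12", "2020/12/12"],
    "Calories": [329.3, 250.7, 250.7],
})

# True marks a row that repeats an earlier one
flags = sample.duplicated()
print(flags)

# keep='last' retains the final copy instead of the first
deduped = sample.drop_duplicates(keep="last")
print(deduped)
```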
