# Slide 1: Introduction to Data Cleaning

Data cleaning is a crucial step in data analysis. It involves identifying and correcting errors, inconsistencies, and inaccuracies in a dataset so that the results built on it are reliable. Python libraries such as pandas and NumPy provide efficient tools for this work, which makes data cleaning in Python a core skill for data scientists and analysts.


In [4]:
# Example: Creating a dataset and checking for missing values
import pandas as pd
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

# Count missing values in each column
missing_values = df.isnull().sum()

print("Missing values in each column:")
print(missing_values)

Missing values in each column:
A    1
B    1
C    0
dtype: int64
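
Raw counts are harder to compare once columns have many rows, so it can help to look at the share of missing values and at the affected rows themselves. The following is a minimal sketch on the same toy data; the names `missing_share` and `rows_with_gaps` are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 11, 12]})

# isna() yields a boolean frame; its column-wise mean is the fraction of missing entries
missing_share = df.isna().mean()

# Select the rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]

print("Share of missing values per column:")
print(missing_share)
print("\nRows with at least one missing value:")
print(rows_with_gaps)
```

Inspecting the affected rows before deciding how to treat them often shows whether the gaps are random or concentrated in particular records.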


# Slide 2: Handling Missing Data

One common issue in datasets is missing values. pandas provides several ways to handle them, such as dropping rows that contain missing values with `dropna()` or filling the gaps with a substitute value, for example the column mean, with `fillna()`.

In [2]:
# Example: Handling missing data
import pandas as pd
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

# Drop rows with missing values
df_dropped = df.dropna()

# Fill missing values with the mean of the column
df_filled = df.fillna(df.mean())

print("Original DataFrame:")
print(df)
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
print("\nDataFrame after filling missing values with column means:")
print(df_filled)

Original DataFrame:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  7.0  11
3  4.0  8.0  12

DataFrame after dropping rows with missing values:
     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12

DataFrame after filling missing values with column means:
          A         B   C
0  1.000000  5.000000   9
1  2.000000  6.666667  10
2  2.333333  7.000000  11
3  4.000000  8.000000  12
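
Dropping every incomplete row or filling with the mean are not the only options. As a sketch under the assumption that column `A` is the one that must be present, the calls below restrict the drop to that column, fill with the median instead of the mean, and fill each column with its own value. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 11, 12]})

# Drop a row only when column 'A' is missing; gaps in other columns are left alone
df_subset = df.dropna(subset=['A'])

# Fill with the column median, which is less sensitive to extreme values than the mean
df_median = df.fillna(df.median())

# Fill each column with its own replacement value
df_custom = df.fillna({'A': 0, 'B': df['B'].mean()})

print(df_subset)
print(df_median)
print(df_custom)
```

Which strategy is appropriate depends on why the values are missing, so treat these as starting points rather than defaults.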


# Slide 3: Removing Duplicates

Duplicate entries can skew your analysis and lead to incorrect conclusions. Python's pandas library offers simple methods to identify and remove duplicate rows from your dataset.

In [5]:
# Example: Removing duplicate rows
import pandas as pd

# Create a sample dataset with duplicate rows
data = {'A': [1, 2, 2, 3, 4], 'B': [5, 6, 6, 7, 8]}
df = pd.DataFrame(data)

# Identify duplicate rows
duplicates = df.duplicated()

# Remove duplicate rows
df_unique = df.drop_duplicates()

print("Original DataFrame:")
print(df)
print("\nDuplicate rows:")
print(duplicates)
print("\nDataFrame after removing duplicates:")
print(df_unique)

Original DataFrame:
   A  B
0  1  5
1  2  6
2  2  6
3  3  7
4  4  8

Duplicate rows:
0    False
1    False
2     True
3    False
4    False
dtype: bool

DataFrame after removing duplicates:
   A  B
0  1  5
1  2  6
3  3  7
4  4  8
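
Rows are often duplicates only with respect to a key column rather than across every column. The sketch below uses a hypothetical `id`/`amount` table (not from the original example) to show the `subset` and `keep` parameters of `drop_duplicates()` and `duplicated()`.

```python
import pandas as pd

# Hypothetical records: id 2 appears twice with different amounts
df = pd.DataFrame({'id': [1, 2, 2, 3], 'amount': [10, 20, 25, 30]})

# No two rows are fully identical, so the default call removes nothing
df_default = df.drop_duplicates()

# Treat rows as duplicates when 'id' matches, keeping the last occurrence
df_last = df.drop_duplicates(subset=['id'], keep='last')

# keep=False flags every member of a duplicate group, which is useful for inspection
all_dupes = df[df.duplicated(subset=['id'], keep=False)]

print(df_last)
print(all_dupes)
```

Whether the first or last occurrence is the right one to keep depends on how the data was collected, so it is worth inspecting `all_dupes` before removing anything.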


# Slide 4: Handling Outliers

Outliers can significantly distort an analysis and should be handled carefully. One common method is the Interquartile Range (IQR) technique: with IQR = Q3 - Q1, values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are treated as outliers and removed.

In [6]:
# Example: Handling outliers using IQR
import pandas as pd
import numpy as np

# Create a sample dataset with outliers
data = {'values': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]}
df = pd.DataFrame(data)

# Calculate Q1, Q3, and IQR
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_clean = df[(df['values'] >= lower_bound) & (df['values'] <= upper_bound)]

print("Original DataFrame:")
print(df)
print("\nDataFrame after removing outliers:")
print(df_clean)

Original DataFrame:
   values
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9     100

DataFrame after removing outliers:
   values
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
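
Removing rows is not always desirable, for example when the other columns of an outlier row are still useful. One alternative, sketched here on the same data, is to flag the outliers and cap them at the IQR bounds with `clip()`. The column name `values_capped` is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'values': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

# Same IQR bounds as in the example above
q1 = df['values'].quantile(0.25)
q3 = df['values'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag outliers without dropping them (between() is inclusive on both ends)
is_outlier = ~df['values'].between(lower_bound, upper_bound)

# Cap values at the bounds instead of removing the rows
df['values_capped'] = df['values'].clip(lower=lower_bound, upper=upper_bound)

print(df[is_outlier])
print(df)
```

Capping keeps the row count unchanged but alters the data, so it is worth recording which values were modified.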


# Slide 5: Data Type Conversion

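Ensuring correct data types is crucial for accurate analysis: numbers stored as strings cannot be summed or compared numerically. pandas exposes the current types through `dtypes` and converts columns with methods such as `astype()`.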

In [7]:
# Example: Converting data types
import pandas as pd

# Create a sample dataset with mixed data types
data = {'A': ['1', '2', '3'], 'B': ['4.5', '5.5', '6.5'], 'C': ['True', 'False', 'True']}
df = pd.DataFrame(data)

# Check initial data types
print("Initial data types:")
print(df.dtypes)

# Convert data types
df['A'] = df['A'].astype(int)
df['B'] = df['B'].astype(float)
# astype(bool) would turn every non-empty string into True, including 'False',
# so map the string labels to booleans explicitly
df['C'] = df['C'].map({'True': True, 'False': False})

# Check converted data types
print("\nConverted data types:")
print(df.dtypes)

print("\nConverted DataFrame:")
print(df)

Initial data types:
A    object
B    object
C    object
dtype: object

Converted data types:
A      int32
B    float64
C       bool
dtype: object

Converted DataFrame:
   A    B      C
0  1  4.5   True
1  2  5.5  False
2  3  6.5   True
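
`astype()` raises an error as soon as one entry cannot be converted. For columns that may contain stray text, a more forgiving option is `pd.to_numeric()` with `errors='coerce'`, sketched below on a hypothetical series that is not part of the original example.

```python
import pandas as pd

# Hypothetical column where one entry cannot be parsed as a number
raw = pd.Series(['1', '2.5', 'n/a'])

# errors='coerce' replaces unparseable entries with NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')

# The entries that failed to convert can then be inspected or handled as missing data
failed = raw[numeric.isna()]

print(numeric)
print(failed)
```

The coerced values become ordinary missing values, so the techniques from Slide 2 apply to them afterwards.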


# Slide 6: String Cleaning and Normalization

String data often requires cleaning and normalization to ensure consistency. This includes tasks like removing whitespace, converting to lowercase, and handling special characters.

In [8]:
# Example: String cleaning and normalization
import pandas as pd

# Create a sample dataset with messy string data
data = {'names': [' John ', 'JANE', 'bob ', ' Alice']}
df = pd.DataFrame(data)

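# Clean and normalize strings: strip surrounding whitespace, lowercase,
# then capitalize the first letter
df['names'] = df['names'].str.strip().str.lower().str.capitalize()

# df was modified in place, so print the raw input dictionary for comparison
print("Original data:")
print(data)
print("\nCleaned DataFrame:")
print(df)

Original data: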
{'names': [' John ', 'JANE', 'bob ', ' Alice']}

Cleaned DataFrame:
   names
0   John
1   Jane
2    Bob
3  Alice
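
`str.capitalize()` only uppercases the first character of the whole string, so multi-word values and repeated inner whitespace need a little more work. A possible sketch, using hypothetical full names that do not appear in the original example:

```python
import pandas as pd

# Hypothetical names with extra spaces, a tab, and inconsistent casing
names = pd.Series(['  john   smith ', 'JANE  DOE', 'bob\tjones'])

cleaned = (
    names.str.strip()                           # remove leading and trailing whitespace
         .str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into one space
         .str.title()                           # capitalize the first letter of each word
)

print(cleaned.tolist())
```

Note that `str.title()` is a simple rule and will mis-case names such as "McDonald", so real name data may need extra handling.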


# Slide 7: Handling Date and Time Data

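Dates often arrive as plain strings, which cannot be sorted chronologically or used in date arithmetic. Converting them to datetime values with `pd.to_datetime()` makes components such as year, month, and day available through the `.dt` accessor.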

In [None]:
# Example: Handling date and time data
import pandas as pd

# Create a sample dataset with date strings
data = {'dates': ['2023-01-01', '2023-02-15', '2023-03-30']}
df = pd.DataFrame(data)

# Convert string to datetime
df['dates'] = pd.to_datetime(df['dates'])

# Extract various components
df['year'] = df['dates'].dt.year
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
df['day_of_week'] = df['dates'].dt.day_name()

# Drop the day_of_week helper column again; the original 'dates' column is kept
df.drop(columns=["day_of_week"], inplace=True)

print("Processed DataFrame:")
print(df)

Processed DataFrame:
       dates  year  month  day
0 2023-01-01  2023      1    1
1 2023-02-15  2023      2   15
2 2023-03-30  2023      3   30
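
Real date columns often contain entries that are malformed or impossible. As a sketch, assuming the expected layout is `YYYY-MM-DD`, passing an explicit `format` together with `errors='coerce'` turns such entries into `NaT` instead of raising. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical input: one valid date, one impossible date, one non-date string
raw = pd.Series(['2023-01-01', '2023-02-30', 'not a date'])

# Entries that do not match the format or are not real dates become NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Keep the original strings that failed to parse for later review
failed = raw[parsed.isna()]

print(parsed)
print(failed.tolist())
```

`NaT` behaves like a missing value, so `isna()`, `dropna()`, and `fillna()` work on the parsed column as they do for numeric data.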
