# Data Cleaning
This section covers essential steps for cleaning data in Python using pandas.

In [None]:
# Import necessary libraries
import pandas as pd  # For data manipulation
import numpy as np   # For numerical operations

## Loading Data
Let's load your dataset into a pandas DataFrame. This is the first step in any data analysis workflow.

In [None]:
# Load a CSV file into a DataFrame
# Replace 'your_data.csv' with your actual file path

df = pd.read_csv('your_data.csv')
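
If the file marks missing entries with a placeholder string, `read_csv` can convert them to `NaN` at load time through its `na_values` parameter. The sketch below is only an illustration: the in-memory CSV text, the column names, and the placeholder `"missing"` are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content held in memory so the example runs without a file
csv_text = "id,score\n1,10\n2,missing\n3,30\n"

# na_values lists extra strings that should be read as NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

print(toy)
print(toy.dtypes)
```

Because the placeholder is turned into `NaN` while reading, the `score` column is parsed as a numeric column rather than as text.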

## Previewing the Data
It's important to preview the first few rows to understand the structure of your dataset.

In [None]:
# Display the first five rows of the DataFrame
df.head()

## Handling Missing Values
Let's identify and handle missing values in your dataset.

In [None]:
# Check for missing values in each column
df.isnull().sum()

### Filling Missing Values
One common approach is to fill missing values with the mean of the column. A mean only exists for numeric columns, so text columns are left unchanged.

In [None]:
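# Fill missing values in numeric columns with each column's mean
# (numeric_only=True is needed because pandas 2.0+ raises a TypeError on text columns)
df_filled = df.fillna(df.mean(numeric_only=True))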
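
As a small illustration of mean imputation, the sketch below builds a toy DataFrame and fills only its numeric column. The column names `age` and `city` and their values are hypothetical, chosen just to show that text columns keep their gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one text column, each with a gap
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Oslo", None, "Rome"],
})

# numeric_only=True restricts the mean to numeric columns, so 'city' is skipped
col_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column named in the Series index
toy_filled = toy.fillna(col_means)

print(toy_filled)
```

The missing `age` becomes the mean of the observed values, while the missing `city` stays empty and would need a different strategy, such as a constant label.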

## Removing Duplicates
Duplicate rows can skew your analysis. Let's remove them.

In [None]:
# Remove duplicate rows from the DataFrame
df_no_duplicates = df.drop_duplicates()
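
By default `drop_duplicates` only removes rows that match an earlier row in every column. The sketch below, on hypothetical toy data, counts such rows first and then shows the `subset` and `keep` parameters, which may be useful when a row counts as a duplicate if it merely repeats a key column.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical; row 2 repeats only the id
toy = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "value": ["a", "a", "b", "c"],
})

# Count rows that fully duplicate an earlier row
n_full_dupes = int(toy.duplicated().sum())

# Default behaviour: compare all columns, keep the first occurrence
deduped_all = toy.drop_duplicates()

# Compare only 'id' and keep the last occurrence of each id
deduped_by_id = toy.drop_duplicates(subset=["id"], keep="last")

print(n_full_dupes)
print(deduped_all)
print(deduped_by_id)
```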

## Correcting Data Types
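Columns read from text files often arrive as strings (`object` dtype). Convert them to numeric or datetime types before doing arithmetic, sorting, or date-based filtering on them.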

In [None]:
# Convert a column to numeric; values that cannot be parsed become NaN
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')

In [None]:
# Convert a column to datetime; values that cannot be parsed become NaT
df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')
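
Because `errors='coerce'` silently turns unparseable values into missing ones, it can help to check which entries were lost in the conversion. The following is a minimal sketch on a hypothetical Series; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw column: one unparseable entry ("n/a") and one genuine gap (None)
raw = pd.Series(["1.5", "2", "n/a", None], dtype=object)

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present before conversion but are NaN afterwards were coerced
newly_missing = converted.isna() & raw.notna()

print(converted)
print(raw[newly_missing])
```

Inspecting `raw[newly_missing]` shows the original strings that failed to parse, which is often enough to decide whether they are typos, placeholders, or a different format.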

## Handling Outliers
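Outliers can distort summary statistics such as the mean. Let's define a function to remove them using the IQR method, which keeps only values that lie within 1.5 times the interquartile range (IQR) of the first and third quartiles.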

In [None]:
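# Function to remove outliers using the IQR method
def remove_outliers_iqr(df, column):
    """Return the rows of df whose value in column lies within 1.5 * IQR of the quartiles."""
    Q1 = df[column].quantile(0.25)  # first quartile
    Q3 = df[column].quantile(0.75)  # third quartile
    IQR = Q3 - Q1                   # interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Rows where the value is NaN fail both comparisons and are dropped as well
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]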

In [None]:
# Remove outliers from a specific column
df_no_outliers = remove_outliers_iqr(df, 'numeric_column')
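
To make the bounds concrete, here is a worked example on hypothetical toy data. The function is repeated so the block runs on its own, and the column name `x` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Hypothetical toy data: one extreme value (100) and one missing value
toy = pd.DataFrame({"x": [10, 12, 11, 13, 12, 100, np.nan]})

# quantile() ignores NaN, so the quartiles come from the six observed values:
# Q1 = 11.25, Q3 = 12.75, IQR = 1.5, giving bounds of 9.0 and 15.0
kept = remove_outliers_iqr(toy, "x")

print(kept)
```

The value 100 falls outside the bounds and is removed. The row with the missing value is removed too, because a comparison with `NaN` is always false, so it may be worth handling missing values before filtering.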

## String Cleaning
String columns often contain extra spaces or inconsistent casing. Let's clean them.

In [None]:
# Remove leading/trailing whitespace and convert to lowercase
df['string_column'] = df['string_column'].str.strip().str.lower()
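
A quick way to see the effect of this cleanup is to compare the number of distinct values before and after. The sketch below uses a hypothetical Series whose entries differ only in whitespace and casing.

```python
import pandas as pd

# Hypothetical toy data: the same name written three ways, plus a missing entry
toy = pd.Series(["  Alice ", "ALICE", "alice", None], dtype=object)

# The .str accessor skips missing values, so the gap stays missing
cleaned = toy.str.strip().str.lower()

print(toy.nunique(), cleaned.nunique())
print(cleaned)
```

Three apparent categories collapse into one after cleaning, which matters for later steps such as grouping or encoding.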

## Renaming Columns
Rename columns for clarity and consistency.

In [None]:
# Rename columns using a dictionary
df = df.rename(columns={'old_name': 'new_name'})

## Encoding Categorical Variables
Convert categorical columns to one-hot indicator columns (one new column per category) so that machine learning models can use them.

In [None]:
# One-hot encode a categorical column using pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['categorical_column'])
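
To show what `get_dummies` produces, the sketch below encodes a hypothetical toy DataFrame; the column names `size` and `price` are made up for the example. It also shows the `drop_first` option, which drops one indicator per encoded column and may be preferable for models that are sensitive to perfectly correlated inputs.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
toy = pd.DataFrame({"size": ["S", "M", "S"], "price": [1, 2, 3]})

# Each category of 'size' becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["size"])

# drop_first=True omits the first category's indicator
reduced = pd.get_dummies(toy, columns=["size"], drop_first=True)

print(encoded.columns.tolist())
print(reduced.columns.tolist())
```

The original `size` column is replaced by `size_M` and `size_S`, while `price` passes through unchanged.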

## Feature Scaling
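Standardize numerical features to zero mean and unit variance. This helps models that are sensitive to feature scale, such as those based on distances or gradient descent.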

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Scale a numeric column (double brackets select a 2D DataFrame, which the scaler expects)
df[['numeric_column']] = scaler.fit_transform(df[['numeric_column']])
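
With its default settings, `StandardScaler` subtracts the column mean and divides by the population standard deviation. The sketch below reproduces that formula with pandas alone on a hypothetical Series, purely to illustrate what the transformation does; it is not a replacement for the scaler.

```python
import pandas as pd

# Hypothetical toy values
toy = pd.Series([2.0, 4.0, 6.0])

# ddof=0 selects the population standard deviation (pandas defaults to ddof=1)
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z)
print(z.mean(), z.std(ddof=0))
```

After the transformation the values have a mean of zero and a population standard deviation of one, up to floating-point rounding.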

## Dropping Unnecessary Columns
Remove columns that are not needed for analysis.

In [None]:
# Drop columns by name
df = df.drop(columns=['unnecessary_column1', 'unnecessary_column2'])

---

# Advanced Data Cleaning
In this section, we will cover more advanced data cleaning techniques to further improve your dataset.

## Visualizing Outliers
Before removing outliers, it's helpful to visualize them. Let's use a boxplot to see the distribution.

In [None]:
import matplotlib.pyplot as plt

# Create a boxplot for a numeric column
plt.figure(figsize=(8, 4))
df.boxplot(column=['numeric_column'])
plt.title('Boxplot of Numeric Column')
plt.show()

## Advanced Imputation Techniques
Filling with the mean ignores the order of the rows. For ordered data such as time series, interpolation or forward/backward fill often gives more plausible values because they use neighboring observations.

In [None]:
# Interpolate missing values linearly between neighboring rows (intended for numeric columns)
df_interpolated = df.interpolate()

In [None]:
# Forward fill: carry the last valid value forward
# (fillna(method='ffill') is deprecated in recent pandas versions)
df_ffill = df.ffill()

In [None]:
# Backward fill: pull the next valid value backward
# (fillna(method='bfill') is deprecated in recent pandas versions)
df_bfill = df.bfill()
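
The three methods give different results for the same gap. The sketch below compares them side by side on a hypothetical Series with two consecutive missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a two-value gap in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

interp = s.interpolate()  # linear by default: evenly spaced between 1.0 and 4.0
ff = s.ffill()            # repeats the last valid value before the gap
bf = s.bfill()            # repeats the next valid value after the gap

comparison = pd.DataFrame({
    "original": s,
    "interpolate": interp,
    "ffill": ff,
    "bfill": bf,
})
print(comparison)
```

Interpolation assumes a smooth change across the gap, whereas forward and backward fill assume the value stays constant until the next observation. Which assumption fits depends on the data.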

## Feature Engineering
Create new features from existing data to enhance your analysis or model.

In [None]:
# Create a new column based on existing columns
df['total'] = df['column1'] + df['column2']