# **Duplicates**

Explanation:
- df.duplicated(): Returns a boolean Series that is True for every row that repeats an earlier row. By default, it compares all columns and leaves the first occurrence marked False.
- sum(): Counts the number of True values in the Series returned by df.duplicated(), which corresponds to the number of duplicate rows.
- df.drop_duplicates(): Returns a new DataFrame with the duplicate rows removed; the original is left unchanged unless you reassign the result. By default, it keeps the first occurrence of each duplicate row and removes the subsequent ones.
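
As a minimal sketch of the three calls above, the snippet below runs them on a small made-up DataFrame. The name `toy` and its columns are hypothetical and chosen only for illustration; rows 0 and 2 are deliberately identical.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical across all columns.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "age": [30, 25, 30, 41],
})

# True only for rows that repeat an earlier row (the first occurrence stays False).
dup_mask = toy.duplicated()

# True counts as 1, so the sum is the number of duplicate rows.
n_dups = int(dup_mask.sum())

# Returns a new DataFrame; the first occurrence of each row is kept.
deduped = toy.drop_duplicates()

print(dup_mask.tolist())       # [False, False, True, False]
print(n_dups)                  # 1
print(deduped.index.tolist())  # [0, 1, 3]
```

Note that the original index labels are preserved after dropping rows; call `reset_index(drop=True)` on the result if you need a contiguous index.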

Using this approach, you can clean your dataset by removing rows that are exact duplicates across all columns, so that each row in your DataFrame is unique. This is a useful data preprocessing step before feeding the data into a machine learning model, because repeated rows give some observations extra weight and can end up in both the training and test splits.
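
The defaults described above compare every column and keep the first occurrence. When duplicates should instead be defined by a few key columns, both methods accept the `subset` and `keep` parameters. The sketch below assumes a hypothetical `records` DataFrame with made-up `ticket` and `fare` columns; it is an illustration, not part of the Titanic workflow that follows.

```python
import pandas as pd

# Hypothetical data: no two rows are identical, but ticket "A1" appears twice.
records = pd.DataFrame({
    "ticket": ["A1", "A1", "B2", "C3"],
    "fare": [7.25, 8.05, 7.25, 9.50],
})

# Comparing all columns finds nothing, because the fares differ.
full_row_dups = int(records.duplicated().sum())

# keep=False marks every member of a duplicate group, which is handy for inspection.
same_ticket = records[records.duplicated(subset=["ticket"], keep=False)]

# keep="last" retains the final row for each ticket instead of the first.
last_per_ticket = records.drop_duplicates(subset=["ticket"], keep="last")

print(full_row_dups)                   # 0
print(same_ticket.index.tolist())      # [0, 1]
print(last_per_ticket.index.tolist())  # [1, 2, 3]
```

Inspecting the flagged rows with `keep=False` before dropping anything is a reasonable habit, since it shows whether the repeated keys are true duplicates or distinct records that happen to share a value.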

In [None]:
import pandas as pd
from sklearn.datasets import fetch_openml
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

# Load the Titanic dataset
titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.data

# Check the number of duplicates
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# If duplicates exist, remove them
if num_duplicates > 0:
    df = df.drop_duplicates()
    print("Duplicates removed.")
else:
    print("No duplicates found.")

# Optionally, check the shape of the DataFrame after removing duplicates
print(f"DataFrame shape after removing duplicates: {df.shape}")


Number of duplicate rows: 0
No duplicates found.
DataFrame shape after removing duplicates: (1309, 13)
