# Finding Duplicates 

##### Data wrangling is a critical step in preparing datasets for analysis, and handling duplicates plays a key role in ensuring data accuracy.

###### Install the needed library

In [None]:
!pip install pandas
!pip install matplotlib

###### Import library

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


## Load the dataset into a dataframe

In [None]:
# Load the dataset directly from the URL
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv"
df = pd.read_csv(file_path)
# Display the first few rows
print(df.head())

### Identify and Analyze Duplicates

#### Task 1: Identify Duplicate Rows
##### 1: Count the number of duplicate rows in the dataset.
##### 2: Display the first few duplicate rows to understand their structure.

In [None]:
# 1. Count fully duplicated rows (the first occurrence of each row is not counted)
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")
# 2. Display the first few duplicate rows
duplicate_rows = df[df.duplicated()]
print("First few duplicate rows:")
print(duplicate_rows.head())
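
The count above depends on the `keep` argument of `duplicated()`. The sketch below uses a small made-up DataFrame (not the survey data) to show the difference between counting only the later copies, which is the default, and counting every row that belongs to a duplicate group.

```python
import pandas as pd

# Toy data (hypothetical, not from the survey): rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 1],
    "Country": ["India", "Kenya", "India", "India"],
})

# Default keep='first': the first occurrence is not flagged, only later copies
later_copies = toy.duplicated().sum()

# keep=False: every member of a duplicate group is flagged, including the first
all_copies = toy.duplicated(keep=False).sum()

print(f"Later copies only: {later_copies}")
print(f"All rows in duplicate groups: {all_copies}")
```

The default count is the number of rows that `drop_duplicates()` would remove, while `keep=False` is more useful when inspecting the duplicate groups themselves.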

### Task 2: Analyze Characteristics of Duplicates
##### 1-Identify duplicate rows based on a subset of columns, such as MainBranch, Employment, and RemoteWork.
##### 2-Analyze the characteristics of those rows and determine which of the remaining columns frequently have identical values across them.



In [None]:
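## Write your code here
# Step 1: Find all rows that share the same values in the selected columns
# (Age is used here in place of Employment; adjust subset_cols as needed)
subset_cols = ['MainBranch', 'Age', 'RemoteWork']
dups = df[df.duplicated(subset=subset_cols, keep=False)]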

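# Step 2: Classify each column by how many distinct values it has among the
# flagged rows (NaN is counted as a value in every branch for consistency)
report = {
    col: (
        "All values identical" if dups[col].nunique(dropna=False) == 1
        else f"Low variation ({dups[col].nunique(dropna=False)} unique values)" if dups[col].nunique(dropna=False) <= 5
        else f"High variation ({dups[col].nunique(dropna=False)} unique values)"
    )
    for col in df.columns
}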

similarity_df = pd.DataFrame(report.items(), columns=['Column', 'DuplicateRowSimilarity']).sort_values('DuplicateRowSimilarity')
similarity_df.head(20)
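
The report above counts distinct values across all flagged rows at once. A complementary view, sketched here on made-up data, is to look inside each duplicate group and ask how often a column holds a single value within the group. The toy column values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): rows 0-1 and rows 2-3 form two duplicate groups
toy = pd.DataFrame({
    "MainBranch": ["Dev", "Dev", "Student", "Student", "Dev"],
    "Age": ["25-34", "25-34", "18-24", "18-24", "35-44"],
    "RemoteWork": ["Remote", "Remote", "Hybrid", "Hybrid", "Remote"],
    "Country": ["India", "India", "Kenya", "Brazil", "India"],
    "EdLevel": ["BSc", "BSc", "BSc", "BSc", "MSc"],
})
subset_cols = ["MainBranch", "Age", "RemoteWork"]

# Keep every row that shares its subset values with at least one other row
dups = toy[toy.duplicated(subset=subset_cols, keep=False)]
other_cols = [c for c in toy.columns if c not in subset_cols]

# Number of distinct values per column inside each duplicate group
per_group = dups.groupby(subset_cols, dropna=False)[other_cols].nunique(dropna=False)

# Share of groups in which a column holds exactly one value
identical_share = (per_group == 1).mean().sort_values(ascending=False)
print(identical_share)
```

A share close to 1.0 suggests the column usually agrees within a group, which supports treating those rows as true repeats rather than coincidental matches.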

### Task 3: Visualize Duplicates Distribution

##### 1-Create visualizations to show the distribution of duplicates across different categories.
##### 2-Use bar charts or pie charts to represent the distribution of duplicates by Country and Employment.

In [None]:
## Write your code here
# Find all duplicate rows (include all duplicates, not just the later ones)
duplicates = df[df.duplicated(keep=False)]

# -------------------- Bar Chart: Duplicates by Country --------------------
country_counts = duplicates['Country'].value_counts().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(country_counts.index, country_counts.values, color='skyblue')
plt.title('Duplicate Rows by Country', fontsize=14)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Number of Duplicates', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# -------------------- Pie Chart: Duplicates by Employment --------------------
# Count duplicates by employment
employment_counts = duplicates['Employment'].value_counts()

# Create the pie chart
plt.figure(figsize=(16, 6))
wedges, texts, autotexts = plt.pie(
    employment_counts,
    autopct='%1.1f%%',
    startangle=140,
    colors=plt.cm.tab10.colors,
    textprops={'fontsize': 10}
)

# Place the legend to the right of the pie
plt.legend(
    wedges,
    employment_counts.index,
    title="Employment",
    loc="center left",
    bbox_to_anchor=(1, 0.5),
    fontsize=9
)

plt.title('Duplicates by Employment')
plt.axis('equal')  # To make it circular
plt.tight_layout()
plt.show()
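
When a category such as Country has many distinct values, the bar chart can become crowded. One possible approach, sketched here with made-up counts, is to keep the top few categories and fold the rest into an "Other" bucket before plotting. The names `top_n` and `"Other"` are illustrative choices, not part of the lab.

```python
import pandas as pd

# Hypothetical duplicate counts per country (not taken from the survey data)
counts = pd.Series({"India": 12, "Kenya": 7, "Brazil": 3, "Peru": 2, "Fiji": 1})
top_n = 3

# Sort so the largest categories come first, then split into head and tail
counts = counts.sort_values(ascending=False)
top = counts.head(top_n)
other_total = counts.iloc[top_n:].sum()

# Append a single "Other" bucket only if the tail is non-empty
if other_total > 0:
    top = pd.concat([top, pd.Series({"Other": other_total})])

print(top)
```

The resulting Series can be passed to `plt.bar(top.index, top.values)` or `plt.pie(top)` in the same way as `country_counts` and `employment_counts`.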

## Task 4: Strategic Removal of Duplicates

#### Decide which columns are critical for defining uniqueness in the dataset.
#### Remove duplicates based on a subset of columns if complete row duplication is not a good criterion.


In [None]:
# Show duplicates based on important identifying fields
duplicates = df[df.duplicated(subset=['Age', 'Country', 'EdLevel', 'YearsCode', 'DevType'], keep=False)]
print("Potential Duplicates Found:")
print(duplicates)

# Remove duplicates
df_unique = df.drop_duplicates(subset=['Age', 'Country', 'EdLevel', 'YearsCode', 'DevType'], keep='first')

# Check how many rows were removed
print(f"\nDuplicates removed: {len(df) - len(df_unique)}")

# Optionally display cleaned data
display(df_unique.head())
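
To verify the removal, it can help to check that no duplicates remain on the chosen columns and that the number of removed rows matches the number flagged beforehand. The following is a minimal sketch on made-up data; the column names mirror the subset used above, but the values are hypothetical.

```python
import pandas as pd

key_cols = ["Age", "Country", "EdLevel", "YearsCode", "DevType"]

# Toy data (hypothetical): rows 0 and 1 match on every key column
toy = pd.DataFrame({
    "Age": ["25-34", "25-34", "18-24"],
    "Country": ["India", "India", "Kenya"],
    "EdLevel": ["BSc", "BSc", "MSc"],
    "YearsCode": ["5", "5", "2"],
    "DevType": ["Back-end", "Back-end", "Student"],
    "SurveyEase": ["Easy", "Hard", "Easy"],
})

# Rows flagged before removal (later copies only, matching keep='first')
expected_removed = toy.duplicated(subset=key_cols).sum()

toy_unique = toy.drop_duplicates(subset=key_cols, keep="first")
removed = len(toy) - len(toy_unique)

# After removal, no key combination should appear more than once
remaining = toy_unique.duplicated(subset=key_cols).sum()

assert remaining == 0, "duplicates still present on key columns"
assert removed == expected_removed, "removed rows do not match flagged rows"
print(f"Removed {removed} row(s); {remaining} duplicate(s) remain")
```

Note that with `keep='first'` the surviving row keeps its own values in the non-key columns (here `SurveyEase`), so the later copy's differing values are discarded.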


## Verify and Document Duplicate Removal Process

#### Task 5: Documentation
##### 1: Document the process of identifying and removing duplicates.

In [None]:
# Write your explanation here
1: Document the process of identifying and removing duplicates
The following steps were used in this notebook to identify and remove duplicate records from the dataset:

Step 1: Load the dataset
The dataset was read directly from a cloud URL using pandas.read_csv() and loaded into a DataFrame:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv"
df = pd.read_csv(file_path)
Step 2: Identify duplicate rows
Fully duplicated rows were counted across the entire dataset:
duplicate_count = df.duplicated().sum()
Step 3: Display duplicate rows
The first few duplicate rows were printed for manual inspection:
duplicate_rows = df[df.duplicated()]
print(duplicate_rows.head())
Step 4: Identify duplicates based on specific columns
To refine the detection, a subset of columns ('MainBranch', 'Age', 'RemoteWork') was selected, and every row sharing the same values in that subset was flagged (keep=False):
subset_cols = ['MainBranch', 'Age', 'RemoteWork']
dups = df[df.duplicated(subset=subset_cols, keep=False)]
Step 5: Analyze similarities in duplicate groups
A summary was created that classifies each column by how much its values vary across the flagged rows:

report = {
    col: (
        "All values identical" if dups[col].nunique(dropna=False) == 1
        else f"Low variation ({dups[col].nunique(dropna=False)} unique values)" if dups[col].nunique(dropna=False) <= 5
        else f"High variation ({dups[col].nunique(dropna=False)} unique values)"
    )
    for col in df.columns
}
similarity_df = pd.DataFrame(report.items(), columns=['Column', 'DuplicateRowSimilarity'])


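Step 6: Remove duplicates
Duplicates were removed in Task 4, keeping the first occurrence of each combination of the identifying columns:

df_unique = df.drop_duplicates(subset=['Age', 'Country', 'EdLevel', 'YearsCode', 'DevType'], keep='first')
The number of removed rows was then checked with len(df) - len(df_unique).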




##### Explain the reasoning behind selecting specific columns for identifying and removing duplicates.

In [None]:
# Write your explanation here
2: Explain the reasoning behind selecting specific columns for identifying and removing duplicates
The subset used for identifying duplicates included:

subset_cols = ['Age', 'Country', 'EdLevel', 'YearsCode', 'DevType']
The reasoning for selecting these specific columns is as follows:

Relevance to respondent profile: These columns describe a respondent's demographic and professional profile (age, country, education level, years of coding experience, and developer type). Responses that share the same values in all of these fields are likely repeated entries from the same individual.

Simplification of comparison: Focusing on five core features simplifies the identification process without removing valid variations in less critical fields such as SurveyEase.

Avoiding over-deletion: Not all duplicate rows in a dataset are exactly the same across all columns. By narrowing the check to key identifying attributes, you reduce the risk of removing distinct entries that only share values in unrelated fields.

Strategic cleanup: This approach helps target accidental repeat submissions or bot entries with minimal noise, especially in survey-based datasets.
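
The trade-off described above can be explored by comparing how many rows different candidate subsets would flag. This is a rough sketch on made-up data; the candidate names and values are assumptions for illustration, and the same comparison could be run on `df` with any column combinations of interest.

```python
import pandas as pd

# Toy data (hypothetical): three respondents share Age and Country,
# but only two of them also share DevType
toy = pd.DataFrame({
    "Age": ["25-34", "25-34", "25-34", "18-24"],
    "Country": ["India", "India", "India", "Kenya"],
    "DevType": ["Back-end", "Back-end", "Front-end", "Student"],
})

candidates = {
    "Age only": ["Age"],
    "Age + Country": ["Age", "Country"],
    "Age + Country + DevType": ["Age", "Country", "DevType"],
}

# Rows each subset would drop with keep='first' (later copies only)
flagged = {
    name: int(toy.duplicated(subset=cols).sum())
    for name, cols in candidates.items()
}

for name, n in flagged.items():
    print(f"{name}: {n} row(s) flagged")
```

Narrow subsets flag more rows and risk deleting distinct respondents, while wider subsets flag fewer rows and are less likely to over-delete.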


### Summary and Next Steps
#### In this lab, you focused on identifying and analyzing duplicate rows within the dataset.

#### You employed various techniques to explore the nature of duplicates and applied strategic methods for their removal.
##### For additional analysis, consider investigating the impact of duplicates on specific analyses and how their removal affects the results.
##### This version of the lab is more focused on duplicate analysis and handling, providing a structured approach to deal with duplicates in a dataset effectively.
