## Introduction

Your manager is impressed with your progress but points out that the data is messy. Before we can analyze it effectively, we need to clean and structure the data properly.

Your task is to:

1. Handle missing values
2. Remove duplicate or inconsistent data
3. Standardize the data format

# Task 1: Identify Issues in the Data


Your manager provides you with an example dataset in which some records are incomplete or incorrect.

### Problems:


1. User ID 3 has an empty name.
2. User ID 4 has a duplicate friend entry.
3. User ID 5 has no connections or liked pages (inactive user).
4. The pages list contains duplicate page IDs.
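
To make these problems concrete, here is a minimal sketch that scans a dataset for them. The `sample` records and the `find_issues` helper are hypothetical and are not the manager's actual file. The sketch assumes each user has `id`, `name`, `friends`, and `liked_pages` keys and each page has an `id`, which is the shape the cleaning code in Task 2 expects.

```python
from collections import Counter

# Hypothetical sample that mirrors the four problems listed above.
sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": []},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}


def find_issues(data):
    """Return human-readable descriptions of the problems found in data."""
    issues = []
    for user in data["users"]:
        if not user["name"].strip():
            issues.append(f"User ID {user['id']} has an empty name.")
        if len(user["friends"]) != len(set(user["friends"])):
            issues.append(f"User ID {user['id']} has a duplicate friend entry.")
        if not user["friends"] and not user["liked_pages"]:
            issues.append(f"User ID {user['id']} is inactive.")
    # Count how often each page ID occurs to spot repeated pages.
    page_counts = Counter(page["id"] for page in data["pages"])
    for page_id, count in page_counts.items():
        if count > 1:
            issues.append(f"Page ID {page_id} appears {count} times.")
    return issues


for issue in find_issues(sample):
    print(issue)
```

Running the scan before cleaning gives a checklist of what the cleaning step has to fix.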

# Task 2: Clean the Data


We will:

1. Remove users with missing names.
2. Remove duplicate friend entries.
3. Remove inactive users (users with no friends and no liked pages).
4. Deduplicate pages based on IDs.

```python
import json


def clean_data(data):
    # Remove users whose name is missing, empty, or only whitespace
    data["users"] = [
        user for user in data["users"] if (user.get("name") or "").strip()
    ]

    # Remove duplicate friends while preserving their original order
    for user in data["users"]:
        user["friends"] = list(dict.fromkeys(user.get("friends", [])))

    # Remove inactive users (no friends and no liked pages)
    data["users"] = [
        user for user in data["users"]
        if user["friends"] or user.get("liked_pages")
    ]

    # Remove duplicate pages; when an ID repeats, the last entry wins
    unique_pages = {}
    for page in data["pages"]:
        unique_pages[page["id"]] = page
    data["pages"] = list(unique_pages.values())

    return data


# Load data2.json, clean it, and save the result
with open("data2.json") as f:
    data = json.load(f)

data = clean_data(data)

with open("cleaned_data2.json", "w") as f:
    json.dump(data, f, indent=4)

print("Data has been cleaned successfully!")
```

Output:

```
Data has been cleaned successfully!
```
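
As a sanity check, the cleaned output can be inspected again to confirm that none of the four problems remain. The following is a minimal sketch, not part of the original notebook: `check_cleaned` is a hypothetical helper, and the small `cleaned` dictionary stands in for the contents of `cleaned_data2.json`.

```python
def check_cleaned(data):
    """Return a list of rule violations; an empty list means the data looks clean."""
    problems = []
    for user in data["users"]:
        uid = user.get("id")
        friends = user.get("friends", [])
        if not (user.get("name") or "").strip():
            problems.append(f"user {uid}: missing name")
        if len(friends) != len(set(friends)):
            problems.append(f"user {uid}: duplicate friends")
        if not friends and not user.get("liked_pages"):
            problems.append(f"user {uid}: inactive")
    page_ids = [page["id"] for page in data["pages"]]
    if len(page_ids) != len(set(page_ids)):
        problems.append("pages: duplicate IDs")
    return problems


# Hypothetical stand-in for the data loaded from cleaned_data2.json.
cleaned = {
    "users": [{"id": 1, "name": "Amit", "friends": [2], "liked_pages": [101]}],
    "pages": [{"id": 101, "name": "Python Developers"}],
}
assert check_cleaned(cleaned) == []
print("No remaining issues found.")
```

In practice, `cleaned` would be loaded from the saved file with `json.load`, and a non-empty result would point to the rule that the cleaning step missed.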
