# 🛒 Customer Data Preprocessing & Quality Assessment

**Objective:** Clean, format, and standardize raw customer data from an e-commerce platform to ensure data integrity for downstream analytics and predictive modeling.

### 🛠️ Tech Stack
* **Language:** Python
* **Key Concepts:** Data Structures (Lists), String Manipulation, Data Type Casting, Exception Handling, Data Aggregation.

In [4]:
# Raw dataset provided by the client
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]],
    ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]],
    ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]],
    ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]],
    ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]],
    ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]],
    ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]],
]

print(f"Total records in raw data: {len(users)}")

Total records in raw data: 10


### 🧹 Data Cleaning Pipeline
We identified several data quality issues:
1. Names contain stray whitespace, underscores, and inconsistent capitalization.
2. Age is stored as a float instead of an integer.
3. First and last names need to be separated for better querying.
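
The issues listed above can be detected programmatically before any cleaning runs. The following is a minimal sketch, not part of the original notebook: the helper name `find_issues` is hypothetical, and the two sample records are taken from the raw dataset, truncated to their first three fields.

```python
# Hypothetical helper: reports quality issues for one raw record.
def find_issues(record):
    """Return a list of quality issues found in one raw record."""
    _, name, age = record[:3]
    issues = []
    if name != name.strip():
        issues.append("surrounding whitespace in name")
    if "_" in name:
        issues.append("underscore in name")
    if not isinstance(age, int):
        issues.append("age is not an integer")
    return issues


# Two records from the raw dataset, truncated to ID, name, and age
sample = [
    ['32415', ' mike_reed ', 32.0],
    ['31980', 'kate morgan', 24.0],
]

for record in sample:
    print(record[0], find_issues(record))
```

Running a check like this first gives a rough count of how many records each cleaning step will actually touch.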

The following script processes the raw data and standardizes all entries.

In [5]:
# Processing the data and handling exceptions
users_clean = []

for user in users:
    # 1. ID Extraction
    user_id = user[0]
    
    # 2. String manipulation: Cleaning names
    raw_name = user[1].strip().replace('_', ' ').lower()
    split_name = raw_name.split()
    
    # 3. Data Type Casting: Converting age to integer with Error Handling
    # int() raises ValueError for non-numeric strings and TypeError for None
    try:
        user_age = int(user[2])
    except (ValueError, TypeError):
        print(f"Error: Age for user {user_id} must be a numerical value.")
        user_age = None
        
    # 4. Extracting categories and spending
    categories = user[3]
    spendings = user[4]
    
    # Appending cleaned data
    users_clean.append([user_id, split_name, user_age, categories, spendings])

# Sorting the final dataset by user_id (explicit key, so only the ID is compared)
users_clean.sort(key=lambda record: record[0])

print("Data cleaning completed successfully. First 3 clean records:")
print(users_clean[:3])

Data cleaning completed successfully. First 3 clean records:
[['31980', ['kate', 'morgan'], 24, ['CLOTHES', 'BOOKS'], [439, 390]], ['32156', ['john', 'doe'], 37, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]], ['32415', ['mike', 'reed'], 32, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]]]
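
None of the ten client records has a malformed age, so the error-handling branch never runs on this dataset. The sketch below exercises it on two made-up records. The function name `clean_record` and both records are hypothetical. Note that `int(None)` raises `TypeError` rather than `ValueError`, which is why the sketch catches both.

```python
# Hypothetical refactor of the per-record cleaning steps into one function.
def clean_record(record):
    """Clean one raw record and return it in the users_clean layout."""
    user_id, raw_name, raw_age, categories, spendings = record
    name_parts = raw_name.strip().replace('_', ' ').lower().split()
    try:
        age = int(raw_age)
    except (ValueError, TypeError):
        # Non-numeric string -> ValueError; None -> TypeError
        age = None
    return [user_id, name_parts, age, categories, spendings]


# Made-up malformed records, NOT part of the client dataset
bad_records = [
    ['99001', ' test_user ', 'unknown', ['BOOKS'], [10]],
    ['99002', 'Another User', None, ['HOME'], [20]],
]

for record in bad_records:
    print(clean_record(record))
```

In both cases the age falls back to `None` while the rest of the record is still cleaned, so one bad field does not discard the whole row.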


### 📊 Aggregation & Reporting
With the data cleaned, we can aggregate it. The cell below totals one customer's spending and formats it as a one-line summary.

In [6]:
# Example: Calculating total spendings for a specific user and formatting the output
sample_user = users_clean[0]
user_id = sample_user[0]
first_name = sample_user[1][0].capitalize()
age = sample_user[2]
total_spend = sum(sample_user[4])

# Generating a formatted executive summary
summary = f"User {user_id} is {first_name}, who is {age} years old and has spent a total of ${total_spend}."
print(summary)

User 31980 is Kate, who is 24 years old and has spent a total of $829.
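
The same idea extends from one customer to the whole dataset. As a minimal sketch, the code below totals spending per category across the three cleaned records printed earlier. It assumes that each entry in the spending list lines up by position with the entry in the category list, which the source data suggests but does not state. The variable names are illustrative.

```python
from collections import defaultdict

# The three cleaned records printed earlier in the notebook
records = [
    ['31980', ['kate', 'morgan'], 24, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ['john', 'doe'], 37, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32415', ['mike', 'reed'], 32, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
]

# Assumption: categories[i] corresponds to spendings[i]
category_totals = defaultdict(int)
for _, _, _, categories, spendings in records:
    for category, amount in zip(categories, spendings):
        category_totals[category] += amount

# Report categories from highest to lowest total spend
ranked = sorted(category_totals.items(), key=lambda item: item[1], reverse=True)
for category, total in ranked:
    print(f"{category}: ${total}")
```

Pairing the two lists with `zip` keeps each amount attached to its category without index bookkeeping. Passing the full `users_clean` list instead of `records` would give totals for all ten customers.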
