# Data Preprocessing of Machine Learning Solution Project Dataset

This notebook cleans and standardizes **three related datasets**:

- **Demographics** (`customer_demographics_contaminated (1).csv`): CustomerID, Age, Gender, Location, IncomeLevel, SignupDate
- **Transactions** (`customer_transactions_contaminated.csv`): CustomerID, TransactionID, TransactionDate, Amount, ProductCategory, PaymentMethod
- **Social Media Interactions** (`social_media_interactions_contaminated (1).csv`): CustomerID, InteractionID, InteractionDate, Platform, InteractionType, Sentiment

You'll find each step explained in **Markdown** (why the step matters) followed by **executable Python code** with comments.

**High-level goals:**  
- Ensure consistent column names and data types  
- Handle missing values, duplicates, outliers  
- Normalize categories (e.g., gender, platforms, payment methods)  
- Parse dates and numeric amounts robustly  
- Save **cleaned** CSVs plus a **joined master** dataset

## STEP 1 — Loading the CSVs
In this step, the three raw CSV datasets are imported into pandas DataFrames. Displaying their shapes and initial rows allows us to verify that the files were successfully loaded and to obtain an initial view of their structure.

In [1]:
# STEP 1: Load the CSV files into pandas

import pandas as pd
import numpy as np

# Load each dataset into a DataFrame
demo_raw   = pd.read_csv("customer_demographics_contaminated (1).csv")      # Customer demographic information
txn_raw    = pd.read_csv("customer_transactions_contaminated.csv")          # Customer transaction history
social_raw = pd.read_csv("social_media_interactions_contaminated (1).csv")  # Social media interactions

# Display the shape of each DataFrame (rows, columns) to confirm successful loading
print("Demographics dataset shape:", demo_raw.shape)
print("Transactions dataset shape:", txn_raw.shape)
print("Social Media dataset shape:", social_raw.shape)

# Preview the first five rows of each dataset to examine their structure and values
print("\n=== DEMOGRAPHICS (first 5 rows) ===")
display(demo_raw.head())

print("\n=== TRANSACTIONS (first 5 rows) ===")
display(txn_raw.head())

print("\n=== SOCIAL MEDIA (first 5 rows) ===")
display(social_raw.head())


Demographics dataset shape: (3200, 6)
Transactions dataset shape: (3200, 6)
Social Media dataset shape: (3200, 6)

=== DEMOGRAPHICS (first 5 rows) ===


Unnamed: 0,CustomerID,Age,Gender,Location,IncomeLevel,SignupDate
0,9207fa75-5758-48d1-94ad-19c041e0520f,51.0,Female,Jensenberg,Low,2022-11-17
1,5fb09cd8-a473-46f7-80bd-6e49cf509078,,Female,Castilloport,High,2020-07-21
2,c139496e-cc89-498a-bd90-1fb4627b6cff,37.0,Male,Lake Jennifertown,,2021-01-01
3,50118139-7264-428f-81cc-a25fddc5d6dd,44.0,Male,Port Carl,Medium,2024-06-10
4,7d1f2bbc-8d16-4fbc-9b37-ece3324e8ed4,50.0,Female,Jessebury,High,2023-08-24



=== TRANSACTIONS (first 5 rows) ===


Unnamed: 0,CustomerID,TransactionID,TransactionDate,Amount,ProductCategory,PaymentMethod
0,60567026-f719-4cd6-849e-137e86d8938f,5ff75116-0a50-4d04-80fb-31e5ccbb0769,2024-05-15,117.64,Clothing,PayPal
1,4090ba85-b111-4f75-a792-c777965f5255,2c39b9fe-ff57-4d39-9321-9f5cdf187aa1,2023-04-26,466.14,Health & Beauty,Bank Transfer
2,9223891b-73ff-4d5c-b8ae-13ece82ee28b,f79588dd-3db9-4ffa-97f8-7de0e64259f1,2022-09-23,563.99,Clothing,Debit Card
3,9243eebc-938f-480c-8564-16d503d250de,401c0fc9-60df-4455-ad78-67c132f9897d,2024-04-15,254.44,Automotive,PayPal
4,6e3e8eb8-bc0f-4ffe-9f74-5d5efec9502f,2034aebc-8280-4254-a667-92bcd1c2be4f,2024-06-03,590.52,Home & Garden,Bank Transfer



=== SOCIAL MEDIA (first 5 rows) ===


Unnamed: 0,CustomerID,InteractionID,InteractionDate,Platform,InteractionType,Sentiment
0,2dcb9523-356b-40b2-a67b-1f27797de261,e5d15761-d0a7-4329-89e3-79a892c56097,2023-07-11,,Comment,
1,e12c37b3-7d4d-472f-9fd8-0df2cb3001aa,02f9f376-70ae-4fcd-9070-1db977939948,2023-07-06,Twitter,Share,
2,08a911a3-65e6-4f5d-a6a1-ae7ddcbe28a2,a83fa04c-f109-4f24-8ce1-2078154f6a1c,2024-05-24,Instagram,Comment,Neutral
3,efdfdfc9-5dbb-4478-911a-101a390a0285,28a69c4b-a2e4-4c74-a130-1132d7733fdf,2023-11-01,Instagram,Like,Neutral
4,ca1e90f6-0e5f-492e-ab92-252ff540da18,d9d1c6f8-5e15-4738-b52b-13c2982420cc,2023-07-08,Instagram,Like,


## STEP 2 — Quick Checks (columns, types, missing values)
This step examines the basic structure of each dataset (column names, data types, and missing values). These diagnostics inform the subsequent cleaning operations such as type parsing, imputation, and normalization.

In [2]:
# STEP 2: Structural checks (columns, types, missing values)

# Column names
print("Demographics columns:", list(demo_raw.columns))
print("Transactions columns:", list(txn_raw.columns))
print("Social columns:", list(social_raw.columns))

# Data types
print("\nDemographics dtypes:\n", demo_raw.dtypes)
print("\nTransactions dtypes:\n", txn_raw.dtypes)
print("\nSocial dtypes:\n", social_raw.dtypes)

# Missing values per column
print("\nMissing values (Demographics):\n", demo_raw.isna().sum())
print("\nMissing values (Transactions):\n", txn_raw.isna().sum())
print("\nMissing values (Social):\n", social_raw.isna().sum())


Demographics columns: ['CustomerID', 'Age', 'Gender', 'Location', 'IncomeLevel', 'SignupDate']
Transactions columns: ['CustomerID', 'TransactionID', 'TransactionDate', 'Amount', 'ProductCategory', 'PaymentMethod']
Social columns: ['CustomerID', 'InteractionID', 'InteractionDate', 'Platform', 'InteractionType', 'Sentiment']

Demographics dtypes:
 CustomerID     object
Age            object
Gender         object
Location       object
IncomeLevel    object
SignupDate     object
dtype: object

Transactions dtypes:
 CustomerID         object
TransactionID      object
TransactionDate    object
Amount             object
ProductCategory    object
PaymentMethod      object
dtype: object

Social dtypes:
 CustomerID         object
InteractionID      object
InteractionDate    object
Platform           object
InteractionType    object
Sentiment          object
dtype: object

Missing values (Demographics):
 CustomerID       0
Age            291
Gender           0
Location         0
IncomeLevel    30

## STEP 3 — Standardize Column Names
Column names are standardized to a consistent snake_case convention. Consistent naming reduces the likelihood of downstream errors and simplifies reference in code and documentation.

In [3]:
# STEP 3: Standardize column names to snake_case

# Create normalized copies
demo = demo_raw.rename(columns={
    "CustomerID": "customer_id",
    "Age": "age",
    "Gender": "gender",
    "Location": "location",
    "IncomeLevel": "income_level",
    "SignupDate": "signup_date"
})

txn = txn_raw.rename(columns={
    "CustomerID": "customer_id",
    "TransactionID": "transaction_id",
    "TransactionDate": "transaction_date",
    "Amount": "amount",
    "ProductCategory": "product_category",
    "PaymentMethod": "payment_method"
})

social = social_raw.rename(columns={
    "CustomerID": "customer_id",
    "InteractionID": "interaction_id",
    "InteractionDate": "interaction_date",
    "Platform": "platform",
    "InteractionType": "interaction_type",
    "Sentiment": "sentiment"
})

# Verify standardized names
print(list(demo.columns))
print(list(txn.columns))
print(list(social.columns))

['customer_id', 'age', 'gender', 'location', 'income_level', 'signup_date']
['customer_id', 'transaction_id', 'transaction_date', 'amount', 'product_category', 'payment_method']
['customer_id', 'interaction_id', 'interaction_date', 'platform', 'interaction_type', 'sentiment']


## STEP 4 — Clean Text Fields and Harmonize Null Tokens
Text fields are trimmed to remove extraneous whitespace, and common null tokens (e.g., “NA”, “none”, “-”) are converted to NaN. This ensures that missingness is represented uniformly across datasets.

In [13]:
# STEP 4: Trim whitespace and convert common null tokens to NaN

# Convert common null-like tokens to real NaN so dropna will catch them
for df in [demo, txn, social]:
    for c in df.select_dtypes(include="object").columns:
        df[c] = df[c].replace(["nan", "NaN", "Nan", "NULL", "None", ""], np.nan)

import numpy as np

# Define tokens that should be treated as missing values
null_tokens = {"", "na", "n/a", "none", "null", "nan", "-", "--", "unknown", "missing"}

# Apply to all object (string) columns in each dataset
for df in [demo, txn, social]:
    for col in df.select_dtypes(include="object").columns:
        # Strip leading/trailing whitespace
        df[col] = df[col].astype(str).str.strip()
        # Replace listed tokens with NaN
        df[col] = df[col].replace({t: np.nan for t in null_tokens})
        # Convert literal strings "nan"/"None" to NaN
        df[col] = df[col].replace({"nan": np.nan, "None": np.nan})

# Print first 10 rows after cleaning
print("Demo dataset (first 10 rows after cleaning):")
print(demo.head(10))

print("\nTransaction dataset (first 10 rows after cleaning):")
print(txn.head(10))

print("\nSocial dataset (first 10 rows after cleaning):")
print(social.head(10))

Demo dataset (first 10 rows after cleaning):
                             customer_id    age  gender            location  \
0   9207fa75-5758-48d1-94ad-19c041e0520f   51.0  Female          Jensenberg   
3   50118139-7264-428f-81cc-a25fddc5d6dd   44.0    Male           Port Carl   
4   7d1f2bbc-8d16-4fbc-9b37-ece3324e8ed4   50.0  Female           Jessebury   
5   2de49c7c-32ae-4ba8-b058-622a090d7094   53.0  Female          Emilyville   
10  c96a5ee9-f1a6-416a-adc6-1c8b128c7399  150.0    Male          Hansontown   
13  8602d631-457c-49c1-8b59-8efb2a4448d4   51.0    Male     East Keithville   
15  56f11a95-76f1-4a97-b38f-db1dc95da1ed   59.0  Female         East Nathan   
16  3f520998-2bc4-4f38-af82-5ab2de339984   59.0  Female           Smithside   
17  b2f4c25e-be11-4912-9d14-5c288616e56e   29.0    Male  South Timothyhaven   
18  8de7560f-370d-4cfd-8135-20a3de237264   31.0  Female         Colemanstad   

   income_level signup_date  
0           Low  2022-11-17  
3        Medium  2024-06-

  df[c] = df[c].replace(["nan", "NaN", "Nan", "NULL", "None", ""], np.nan)
  df[c] = df[c].replace(["nan", "NaN", "Nan", "NULL", "None", ""], np.nan)
  df[c] = df[c].replace(["nan", "NaN", "Nan", "NULL", "None", ""], np.nan)


## STEP 5 — Remove Exact Duplicate Rows
Exact duplicate rows are removed to avoid double-counting and ensure the reliability of aggregations and analyses.

In [14]:
# STEP 5: Drop exact duplicates

demo_before = len(demo)
demo = demo.drop_duplicates()
print("Demographics duplicates removed:", demo_before - len(demo))

txn_before = len(txn)
txn = txn.drop_duplicates()
print("Transactions duplicates removed:", txn_before - len(txn))

social_before = len(social)
social = social.drop_duplicates()
print("Social duplicates removed:", social_before - len(social))

Demographics duplicates removed: 0
Transactions duplicates removed: 0
Social duplicates removed: 0


## STEP 6 — Parse Dates and Coerce Numeric Amounts
Date columns are converted into proper datetime format, and the amount column is coerced into numeric values. Using .loc ensures we avoid chained assignment warnings in pandas.

In [6]:
# STEP 6: Parse dates and convert amount to numeric (revised)

# Convert date-like columns safely using .loc
if "signup_date" in demo.columns:
    demo.loc[:, "signup_date"] = pd.to_datetime(demo["signup_date"], errors="coerce")

if "transaction_date" in txn.columns:
    txn.loc[:, "transaction_date"] = pd.to_datetime(txn["transaction_date"], errors="coerce")

if "interaction_date" in social.columns:
    social.loc[:, "interaction_date"] = pd.to_datetime(social["interaction_date"], errors="coerce")

# Safely handle the 'amount' column
if "amount" in txn.columns:
    # Step 1: work on a temporary string series
    amt_str = txn["amount"].astype(str)
    # Step 2: remove currency symbols, spaces, and commas
    amt_str = (amt_str.str.replace("$", "", regex=False)
                        .str.replace("₱", "", regex=False)
                        .str.replace(",", "", regex=False)
                        .str.replace(" ", "", regex=False))
    # Step 3: convert cleaned strings to numeric
    txn.loc[:, "amount"] = pd.to_numeric(amt_str, errors="coerce")

# Quick verification
print(txn[["amount"]].head())

# Print first 10 entries of datetime columns
if "signup_date" in demo.columns:
    print("\nFirst 10 signup_date entries:")
    print(demo["signup_date"].head(10))

if "transaction_date" in txn.columns:
    print("\nFirst 10 transaction_date entries:")
    print(txn["transaction_date"].head(10))

if "interaction_date" in social.columns:
    print("\nFirst 10 interaction_date entries:")
    print(social["interaction_date"].head(10))

   amount
0  117.64
1  466.14
2  563.99
3  254.44
4  590.52

First 10 signup_date entries:
0    2022-11-17 00:00:00
1    2020-07-21 00:00:00
2    2021-01-01 00:00:00
3    2024-06-10 00:00:00
4    2023-08-24 00:00:00
5    2022-02-13 00:00:00
6    2019-12-08 00:00:00
7    2022-04-26 00:00:00
8    2022-04-17 00:00:00
9    2024-02-18 00:00:00
Name: signup_date, dtype: object

First 10 transaction_date entries:
0    2024-05-15 00:00:00
1    2023-04-26 00:00:00
2    2022-09-23 00:00:00
3    2024-04-15 00:00:00
4    2024-06-03 00:00:00
5    2024-04-07 00:00:00
6    2024-01-12 00:00:00
7    2023-03-10 00:00:00
8    2024-01-26 00:00:00
9    2023-06-15 00:00:00
Name: transaction_date, dtype: object

First 10 interaction_date entries:
0    2023-07-11 00:00:00
1    2023-07-06 00:00:00
2    2024-05-24 00:00:00
3    2023-11-01 00:00:00
4    2023-07-08 00:00:00
5    2023-12-18 00:00:00
6    2023-11-15 00:00:00
7    2024-03-29 00:00:00
8    2024-05-02 00:00:00
9    2023-11-18 00:00:00
Name: interactio

## STEP 7 — Normalize Categorical Values
Categorical values (e.g., gender, payment method, product category, platform, sentiment) are normalized to consistent labels. Assignments use .loc to avoid chained-assignment warnings and ensure changes apply to the intended DataFrame.

In [7]:
# Defensive copy to avoid any view-related warnings
demo   = demo.copy()
txn    = txn.copy()
social = social.copy()

# --- Gender (demographics) ---
def normalize_gender(x):
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    if s in ["m", "male"]:
        return "Male"
    if s in ["f", "female"]:
        return "Female"
    if s in ["other", "nonbinary", "non-binary", "nb"]:
        return "Other"
    return str(x).title()

if "gender" in demo.columns:
    demo.loc[:, "gender"] = demo["gender"].map(normalize_gender)

# --- Payment method & product category (transactions) ---
if "payment_method" in txn.columns:
    txn.loc[:, "payment_method"] = (
        txn["payment_method"].astype(str).str.strip().str.title()
    )

if "product_category" in txn.columns:
    txn.loc[:, "product_category"] = (
        txn["product_category"].astype(str).str.strip().str.title()
    )

# --- Platform & interaction type (social) ---
if "platform" in social.columns:
    social.loc[:, "platform"] = (
        social["platform"].astype(str).str.strip().str.title()
    )

if "interaction_type" in social.columns:
    social.loc[:, "interaction_type"] = (
        social["interaction_type"].astype(str).str.strip().str.title()
    )

# --- Sentiment (social) ---
def normalize_sentiment(x):
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    if s in ["pos", "positive", "1", "+", "good"]:
        return "Positive"
    if s in ["neu", "neutral", "0", "meh"]:
        return "Neutral"
    if s in ["neg", "negative", "-1", "-", "bad"]:
        return "Negative"
    return str(x).title()

if "sentiment" in social.columns:
    social.loc[:, "sentiment"] = social["sentiment"].map(normalize_sentiment)


## STEP 8 — Handle Missing Values
Missing values are addressed by dropping

In [8]:
# Make sure numeric columns are coerced properly
if "age" in demo.columns:
    demo["age"] = pd.to_numeric(demo["age"], errors="coerce")

if "amount" in txn.columns:
    txn["amount"] = pd.to_numeric(txn["amount"], errors="coerce")

# Drop rows with missing essential fields
demo = demo.dropna(subset=["customer_id", "age", "gender", "location"])
txn = txn.dropna(subset=["customer_id", "transaction_id", "amount"])
social = social.dropna(subset=["customer_id", "interaction_id", "interaction_date"])

# Recreate refund flag
if "amount" in txn.columns:
    txn["is_refund"] = txn["amount"] < 0
    
demo = demo.dropna().copy()
txn = txn.dropna().copy()
social = social.dropna().copy()

# Recreate refund flag (in case df changed)
if "amount" in txn.columns:
    txn["is_refund"] = txn["amount"] < 0

# Re-check
print("Remaining NaNs (Demographics):\n", demo.isna().sum())
print("\nRemaining NaNs (Transactions):\n", txn.isna().sum())
print("\nRemaining NaNs (Social):\n", social.isna().sum())


Remaining NaNs (Demographics):
 customer_id     0
age             0
gender          0
location        0
income_level    0
signup_date     0
dtype: int64

Remaining NaNs (Transactions):
 customer_id         0
transaction_id      0
transaction_date    0
amount              0
product_category    0
payment_method      0
is_refund           0
dtype: int64

Remaining NaNs (Social):
 customer_id         0
interaction_id      0
interaction_date    0
platform            0
interaction_type    0
sentiment           0
dtype: int64


## STEP 9 — Mitigate Outliers via Light Clipping
Extreme outliers in numeric fields can distort statistical summaries and models. Here, values are clipped to the 1st and 99th percentiles for age and amount to reduce undue influence.

In [15]:
# STEP 9 — Drop Outliers (Abnormal Values)

# DEMOGRAPHICS: drop unrealistic ages (<0 or >120)
if "age" in demo.columns:
    before = len(demo)
    demo = demo[(demo["age"] >= 0) & (demo["age"] <= 120)]
    print(f"Demo: dropped {before - len(demo)} rows with invalid ages")

# TRANSACTIONS: drop negative amounts (if refunds not required in this intro task)
if "amount" in txn.columns:
    before = len(txn)
    txn = txn[txn["amount"] >= 0]
    print(f"Txn: dropped {before - len(txn)} rows with negative amounts")

# SOCIAL: drop negative interaction counts if any (defensive)
for col in ["likes","comments","shares","score"]:
    if col in social.columns:
        before = len(social)
        social = social[social[col].isna() | (social[col] >= 0)]
        print(f"Social: dropped {before - len(social)} rows with invalid {col}")


Demo: dropped 48 rows with invalid ages
Txn: dropped 34 rows with negative amounts


## STEP 10 — Enforce Identifier Uniqueness
Primary identifiers are enforced as unique within their respective tables. This step removes any duplicated keys to maintain entity integrity and prevent ambiguity in joins.

In [16]:
if "customer_id" in demo.columns:
    before = len(demo)
    demo = demo.drop_duplicates(subset=["customer_id"])
    print("Duplicate customer_id removed (demo):", before - len(demo))

if "transaction_id" in txn.columns:
    before = len(txn)
    txn = txn.drop_duplicates(subset=["transaction_id"])
    print("Duplicate transaction_id removed (txn):", before - len(txn))

if "interaction_id" in social.columns:
    before = len(social)
    social = social.drop_duplicates(subset=["interaction_id"])
    print("Duplicate interaction_id removed (social):", before - len(social))

Duplicate customer_id removed (demo): 0
Duplicate transaction_id removed (txn): 0
Duplicate interaction_id removed (social): 0


## STEP 11 — Build a Customer-Level Master Table
A customer-level master table is constructed by left-joining demographic records with aggregated transaction and social metrics. This unified view facilitates downstream analysis, reporting, and modeling.

In [17]:
# Aggregate transactions per customer
if set(["customer_id", "amount", "transaction_id"]).issubset(txn.columns):
    txn_agg = (txn.groupby("customer_id")
                 .agg(total_spend=("amount", "sum"),
                      avg_spend=("amount", "mean"),
                      txn_count=("transaction_id", "nunique"),
                      refund_count=("is_refund", "sum"))
                 .reset_index())
else:
    txn_agg = pd.DataFrame(columns=["customer_id", "total_spend", "avg_spend", "txn_count", "refund_count"])

# Aggregate social interactions per customer
if set(["customer_id", "interaction_id"]).issubset(social.columns):
    if "platform" in social.columns:
        social_agg = (social.groupby("customer_id")
                        .agg(interactions=("interaction_id", "nunique"),
                             platforms=("platform", "nunique"))
                        .reset_index())
    else:
        social_agg = (social.groupby("customer_id")
                        .agg(interactions=("interaction_id", "nunique"))
                        .reset_index())
else:
    social_agg = pd.DataFrame(columns=["customer_id", "interactions", "platforms"])

# Start from demographics (one row per customer), then left-join aggregates
master = demo.copy()
if not txn_agg.empty:
    master = master.merge(txn_agg, on="customer_id", how="left")
if not social_agg.empty:
    master = master.merge(social_agg, on="customer_id", how="left")

print("Master dataset shape:", master.shape)
display(master.head(10))

Master dataset shape: (2335, 12)


Unnamed: 0,customer_id,age,gender,location,income_level,signup_date,total_spend,avg_spend,txn_count,refund_count,interactions,platforms
0,9207fa75-5758-48d1-94ad-19c041e0520f,51.0,Female,Jensenberg,Low,2022-11-17,660.87,660.87,1.0,0.0,2.0,2.0
1,50118139-7264-428f-81cc-a25fddc5d6dd,44.0,Male,Port Carl,Medium,2024-06-10,,,,,2.0,2.0
2,7d1f2bbc-8d16-4fbc-9b37-ece3324e8ed4,50.0,Female,Jessebury,High,2023-08-24,733.46,733.46,1.0,0.0,2.0,2.0
3,2de49c7c-32ae-4ba8-b058-622a090d7094,53.0,Female,Emilyville,Low,2022-02-13,891.74,445.87,2.0,0.0,3.0,2.0
4,8602d631-457c-49c1-8b59-8efb2a4448d4,51.0,Male,East Keithville,High,2022-04-17,815.21,815.21,1.0,0.0,2.0,2.0
5,56f11a95-76f1-4a97-b38f-db1dc95da1ed,59.0,Female,East Nathan,Medium,2020-01-02,,,,,3.0,2.0
6,3f520998-2bc4-4f38-af82-5ab2de339984,59.0,Female,Smithside,High,2023-10-06,183.91,183.91,1.0,0.0,,
7,b2f4c25e-be11-4912-9d14-5c288616e56e,29.0,Male,South Timothyhaven,High,2023-07-09,533.94,533.94,1.0,0.0,1.0,1.0
8,8de7560f-370d-4cfd-8135-20a3de237264,31.0,Female,Colemanstad,Low,2020-12-01,,,,,,
9,fcb513ca-6a74-4745-b855-6cc3e891cc02,53.0,Female,East Stuart,Medium,2022-01-09,,,,,,


## STEP 12 — Persist the Cleaned Outputs
Finally, the cleaned datasets and the integrated master table are exported to CSV files. Persisting these outputs enables reproducibility and reuse in analytics and BI tools.

In [18]:
demo.to_csv("cleaned_customer_demographics.csv", index=False)
txn.to_csv("cleaned_customer_transactions.csv", index=False)
social.to_csv("cleaned_social_media_interactions.csv", index=False)
master.to_csv("cleaned_master.csv", index=False)

print("Saved the following files:")
print(" - cleaned_customer_demographics.csv")
print(" - cleaned_customer_transactions.csv")
print(" - cleaned_social_media_interactions.csv")
print(" - cleaned_master.csv")

Saved the following files:
 - cleaned_customer_demographics.csv
 - cleaned_customer_transactions.csv
 - cleaned_social_media_interactions.csv
 - cleaned_master.csv
