# Combine Cleaned Brand Data — 5G Phones ≥ ₹20,000

This notebook reads all cleaned CSV files from `data/clean/`, sets a `brand` column from each filename, concatenates the files, converts the price and spec columns to numeric types, removes duplicates (based on name, RAM, and storage, keeping the cheapest listing), prints basic sanity checks, and saves the final dataset to:

`data/processed/all_brands_20k.csv`

In [40]:
import pandas as pd
import os
from pathlib import Path

CLEAN_DIR = Path("data/clean")
PROCESSED_DIR = Path("data/processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print("Clean folder:", CLEAN_DIR.resolve())
print("Processed folder:", PROCESSED_DIR.resolve())

Clean folder: C:\Users\Admin\Mobile_Phones_Analysis\data\clean
Processed folder: C:\Users\Admin\Mobile_Phones_Analysis\data\processed


In [41]:
# Sort the listing so the files are always processed in the same order
csv_files = sorted(f for f in os.listdir(CLEAN_DIR) if f.endswith(".csv"))

if not csv_files:
    raise FileNotFoundError("No CSV files found in data/clean/. Check your folder.")

print("Found CSVs:")
for f in csv_files:
    print("-", f)

Found CSVs:
- oneplus_20k_clean.csv
- oppo_20k_clean.csv
- pixel_20k_clean.csv
- realme_20k_clean.csv
- samsung_phones_20k_clean.csv
- vivo_20k_clean.csv
- xiaomi_20k_clean.csv


In [42]:
# Display names that str.title() gets wrong
brand_fix = {"Oneplus": "OnePlus"}

dfs = []

for file in csv_files:
    # The brand is the first underscore-separated token of the filename
    brand = file.split("_")[0].strip().title()  # Samsung, Pixel, Vivo, etc.
    df = pd.read_csv(CLEAN_DIR / file)
    df["brand"] = brand_fix.get(brand, brand)
    dfs.append(df)

len(dfs)

7
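
The filename-to-brand step can also be written as a small function, which makes it easy to test on its own. This is only a sketch: `brand_from_filename` and `BRAND_FIXES` are hypothetical names, and it assumes every file follows the `<brand>_..._clean.csv` pattern seen in the listing above.

```python
from pathlib import Path

# Hypothetical lookup for brands whose casing str.title() gets wrong
BRAND_FIXES = {"Oneplus": "OnePlus"}

def brand_from_filename(filename: str) -> str:
    """Return a display brand name from a file named '<brand>_..._clean.csv'."""
    token = Path(filename).stem.split("_")[0].strip().title()
    return BRAND_FIXES.get(token, token)

examples = [
    "oneplus_20k_clean.csv",
    "samsung_phones_20k_clean.csv",
    "pixel_20k_clean.csv",
]
brands = [brand_from_filename(f) for f in examples]
print(brands)  # ['OnePlus', 'Samsung', 'Pixel']
```

A brand whose name contains an underscore would be cut at the first token, so a new file of that kind would need an entry in the lookup or a different naming scheme.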

In [43]:
combined_df = pd.concat(dfs, ignore_index=True)
print("Combined shape:", combined_df.shape)
combined_df.head()

Combined shape: (1451, 9)


| | name | price | specs | ram_gb | storage_gb | display_inch | battery_mah | camera_mp | brand |
|---|---|---|---|---|---|---|---|---|---|
| 0 | OnePlus Nord 5 (Phantom Grey, 256 GB) | 31190 | ['8 GB RAM \| 256 GB ROM', '17.35 cm (6.83 inch... | 8.0 | 256.0 | 6.83 | 6800.0 | 50 | OnePlus |
| 1 | OnePlus Nord CE5 5G (Black Infinity, 256 GB) | 26904 | ['8 GB RAM \| 256 GB ROM', '17.2 cm (6.77 inch)... | 8.0 | 256.0 | 6.77 | 7100.0 | 50 | OnePlus |
| 2 | OnePlus Nord CE4 (Dark Chrome, 128 GB) | 20280 | ['8 GB RAM \| 128 GB ROM', '17.02 cm (6.7 inch)... | 8.0 | 128.0 | 6.7 | 5500.0 | 50 | OnePlus |
| 3 | OnePlus 13s 5G (Green Silk, 256 GB) | 50958 | ['12 GB RAM \| 256 GB ROM', '16.05 cm (6.32 inc... | 12.0 | 256.0 | 6.32 | 5850.0 | 50 | OnePlus |
| 4 | OnePlus Nord 5 5G (Dry Ice, 256 GB) | 31884 | ['8 GB RAM \| 256 GB ROM', '17.35 cm (6.83 inch... | 8.0 | 256.0 | 6.83 | 6800.0 | 50 | OnePlus |


In [44]:
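# Strip currency symbols and thousands separators, then convert to numbers.
# Note: the pattern also removes decimal points, so it assumes whole-rupee prices.
combined_df["price"] = pd.to_numeric(
    combined_df["price"].astype(str).str.replace(r"[^0-9]", "", regex=True),
    errors="coerce",  # empty or unparseable strings become NaN
)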

combined_df = combined_df.dropna(subset=["price"])  # Remove rows with missing price
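
To see how price strings behave under this kind of cleaning, the sketch below compares a digits-only pattern with one that keeps the decimal point. The sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up price strings covering the usual cases
raw = pd.Series(["₹31,190", "26904", "20280.0", None, "N/A"])

# Pattern 1: drop every non-digit character
digits_only = raw.astype(str).str.replace(r"[^0-9]", "", regex=True)

# Pattern 2: drop everything except digits and the decimal point
keep_decimal = raw.astype(str).str.replace(r"[^0-9.]", "", regex=True)

# Empty strings left after stripping are coerced to NaN
prices_digits = pd.to_numeric(digits_only, errors="coerce")
prices_decimal = pd.to_numeric(keep_decimal, errors="coerce")

print(pd.DataFrame({"raw": raw, "digits_only": prices_digits, "keep_decimal": prices_decimal}))
```

The digits-only pattern turns `"20280.0"` into `202800`, a tenfold error, so it is only safe when prices are stored as whole numbers. Keeping the decimal point avoids that, but it would misread a prefix such as `Rs.`; which pattern fits depends on how the source files format prices.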

In [45]:
numeric_cols = ["ram_gb", "storage_gb", "display_inch", "battery_mah", "camera_mp"]

for col in numeric_cols:
    combined_df[col] = pd.to_numeric(combined_df[col], errors="coerce")
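
Because `errors="coerce"` silently turns unparseable values into `NaN`, it can be useful to count how many values were lost in each column. The following is a minimal sketch on a toy frame; the column values are invented and `lost` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({
    "ram_gb": ["8", "12", "unknown"],    # one value cannot be parsed
    "battery_mah": [5000, None, 6800],   # one value is already missing
})

lost = {}
for col in toy.columns:
    converted = pd.to_numeric(toy[col], errors="coerce")
    # Count values that were present before conversion but are NaN afterwards
    lost[col] = int((converted.isna() & toy[col].notna()).sum())
    toy[col] = converted

print(lost)  # {'ram_gb': 1, 'battery_mah': 0}
```

A non-zero count points at values that need a closer look in the upstream cleaning step rather than at genuinely missing specs.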

In [46]:
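import re

def clean_model_name(name):
    """Drop parenthesised suffixes such as '(Phantom Grey, 256 GB)' and tidy whitespace."""
    if not isinstance(name, str):
        return name

    name = re.sub(r"\(.*?\)", "", name)                   # remove "(colour, storage)" parts
    name = re.sub(r"\s+", " ", name).strip().rstrip(",")  # collapse spaces, trim trailing comma

    return name

combined_df["name"] = combined_df["name"].apply(clean_model_name)

# Deduplicate on cleaned name, RAM, and storage. Sorting by price first means
# keep="first" retains the cheapest listing of each variant.
combined_df = (
    combined_df.sort_values("price")
              .drop_duplicates(subset=["name", "ram_gb", "storage_gb"], keep="first")
              .reset_index(drop=True)
)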
print("After de-duplication:", combined_df.shape)

After de-duplication: (577, 9)
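
The effect of name cleaning followed by price-sorted de-duplication is easiest to see on a toy example. The listings below are invented; `strip_parens` is a hypothetical stand-in for the name-cleaning function in the cell above.

```python
import re
import pandas as pd

def strip_parens(name: str) -> str:
    """Remove parenthesised suffixes and collapse whitespace."""
    name = re.sub(r"\(.*?\)", "", name)
    return re.sub(r"\s+", " ", name).strip().rstrip(",")

listings = pd.DataFrame({
    "name": [
        "Phone A (Black, 256 GB)",
        "Phone A (Green, 256 GB)",   # same variant, different colour
        "Phone A (Black, 256 GB)",   # same model, more RAM
        "Phone B (Blue, 128 GB)",
    ],
    "ram_gb": [8.0, 8.0, 12.0, 8.0],
    "storage_gb": [256.0, 256.0, 256.0, 128.0],
    "price": [31884, 31190, 35999, 20280],
})

listings["name"] = listings["name"].apply(strip_parens)

# Sort ascending by price so keep="first" retains the cheapest duplicate
deduped = (
    listings.sort_values("price")
            .drop_duplicates(subset=["name", "ram_gb", "storage_gb"], keep="first")
            .reset_index(drop=True)
)
print(deduped)
```

The two 8 GB / 256 GB colour variants of "Phone A" collapse into one row at the lower price, while the 12 GB variant survives as a separate row. Names that differ outside the parentheses (for example with and without a "5G" suffix) are still treated as different models.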


In [47]:
print("Brand distribution:\n", combined_df["brand"].value_counts())
print("\nPrice summary:\n", combined_df["price"].describe())

Brand distribution:
 brand
Samsung    115
Vivo       100
OnePlus     75
Realme      73
Xiaomi      72
Oppo        71
Pixel       71
Name: count, dtype: int64

Price summary:
 count       577.000000
mean      33630.414211
std       22504.129048
min       20000.000000
25%       21999.000000
50%       24999.000000
75%       34990.000000
max      216999.000000
Name: price, dtype: float64
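
The printed summaries have to be checked by eye. As a possible complement, the checks can be expressed as code that reports problems explicitly. This is a sketch, not part of the original notebook: `check_dataset` is a hypothetical helper, and the 20,000 threshold is taken from the notebook title.

```python
import pandas as pd

def check_dataset(df: pd.DataFrame, min_price: int = 20_000) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df["price"].isna().any():
        problems.append("missing prices")
    if (df["price"] < min_price).any():
        problems.append(f"prices below {min_price}")
    if df.duplicated(subset=["name", "ram_gb", "storage_gb"]).any():
        problems.append("duplicate (name, ram_gb, storage_gb) rows")
    if df["brand"].isna().any():
        problems.append("missing brand labels")
    return problems

# Invented rows: one clean frame, one with a low price and a repeated row
good = pd.DataFrame({
    "name": ["Phone A", "Phone B"],
    "ram_gb": [8.0, 8.0],
    "storage_gb": [256.0, 128.0],
    "price": [31190.0, 20280.0],
    "brand": ["OnePlus", "Samsung"],
})
bad = pd.concat([good, good.iloc[[0]]], ignore_index=True)
bad.loc[1, "price"] = 15000.0

print(check_dataset(good))  # []
print(check_dataset(bad))
```

Calling `check_dataset(combined_df)` before saving would surface these issues without relying on reading the summary tables.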


In [48]:
output_path = PROCESSED_DIR / "all_brands_20k.csv"
combined_df.to_csv(output_path, index=False)

print("Saved final dataset to:", output_path)

Saved final dataset to: data\processed\all_brands_20k.csv
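
A quick way to confirm the save worked is to read the file back and compare its shape and columns. The sketch below does this in a temporary directory with invented rows, so it does not touch `data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "name": ["Phone A", "Phone B"],
    "price": [31190.0, 20280.0],
    "brand": ["OnePlus", "Samsung"],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "processed" / "all_brands_20k.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out, index=False)   # index=False keeps the row index out of the file
    reloaded = pd.read_csv(out)

roundtrip_ok = reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)
print("Round trip OK:", roundtrip_ok)
```

Saving with `index=False` matters here: if the index were written, reading the file back with the default settings would add an extra `Unnamed: 0` column.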
