# Mobile Trend Analysis

This notebook is a self-contained Kaggle-ready analysis combining preprocessing, processing, visualization, and trend analysis for a mobile phone dataset.

**What it includes:**
- Full function definitions embedded (no external .py files required)
- Preprocessing of raw `mobile.csv`
- Data processing into launched and upcoming/rumored subsets
- Visualizations and brand-family trend analysis

**Notes:** The notebook expects the raw CSV to be available at one of these paths:
- `/mnt/data/mobile.csv` (useful for local testing)
- `data/raw/mobile.csv` (Kaggle typical structure)

If the file is not found, the notebook will raise a clear error.

---

In [1]:

# Setup: imports and path handling
import os
import shutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

# Configure matplotlib for notebooks
%matplotlib inline

# Create expected directory structure
os.makedirs('data/raw', exist_ok=True)
os.makedirs('data/preprocess', exist_ok=True)
os.makedirs('data/processed', exist_ok=True)
os.makedirs('data/figures', exist_ok=True)

# Try to locate the raw data (supports both local and Kaggle paths)
possible_locations = ['/mnt/data/mobile.csv', 'data/raw/mobile.csv', '/kaggle/input/mobile.csv']
found = None
for p in possible_locations:
    if os.path.isfile(p):
        found = p
        break

if found is None:
    raise FileNotFoundError(
        "Raw mobile CSV not found. Please upload your raw `mobile.csv` to the notebook and ensure it is at "
        "`/mnt/data/mobile.csv` or `data/raw/mobile.csv`."
    )

# If it's at /mnt/data/mobile.csv, copy it into data/raw for the pipeline
if found != 'data/raw/mobile.csv':
    shutil.copy(found, 'data/raw/mobile.csv')
    print(f"Copied {found} -> data/raw/mobile.csv")
else:
    print("Using data/raw/mobile.csv")

RAW_PATH = 'data/raw/mobile.csv'


Using data/raw/mobile.csv


In [2]:

# == Preprocessing functions (embedded from preprocess.py) ==
def load_mobile_data(raw_path='data/raw/mobile.csv'):
    if not os.path.isfile(raw_path):
        raise FileNotFoundError(f"Raw file not found: {raw_path}")
    df_mobile = pd.read_csv(raw_path)
    return df_mobile

def rename_columns(df):
    # Rename columns that actually exist in the input
    col_map = {
        'Name': 'Brand Name',
        'Spec Score': 'Spec Score',
        'rating': 'Rating',
        'price': 'Price',
        'img': 'Image Preview',
        'tag': 'Tag',
        'sim': 'SIM / Network',
        'processor': 'Processor',
        'storage': 'Storage',
        'battery': 'Battery',
        'display': 'Display',
        'camera': 'Camera',
        'memoryExternal': 'Memory External',
        'version': 'OS Version',
        'fm': 'FM Radio',
    }
    existing_map = {k: v for k, v in col_map.items() if k in df.columns}
    return df.rename(columns=existing_map)

def initial_cleaning(df):
    # Work on a copy to avoid chained-assignment issues
    df = df.copy()
    # Preserve Image Preview content if present
    if 'Image Preview' in df.columns:
        image_col = df['Image Preview']
    # Drop FM Radio if present
    if 'FM Radio' in df.columns:
        df = df.drop(columns=['FM Radio'])
    # Replace blank-only strings with NaN
    df = df.replace(r'^\s*$', np.nan, regex=True)
    # Determine essential columns to keep rows for (at least these must be present)
    required = []
    for col in ['Brand Name', 'Price', 'Spec Score', 'Rating', 'Tag']:
        if col in df.columns:
            required.append(col)
    if required:
        df = df.dropna(subset=required, how='any')
    # Restore Image Preview position/content if it existed
    if 'Image Preview' in locals():
        df['Image Preview'] = image_col.reindex(df.index)
    return df

def standardize_and_fill(df):
    df = df.copy()
    # Text columns to normalize (do not lowercase Image Preview to keep URLs)
    text_cols = ['Brand Name', 'Tag', 'SIM / Network', 'Processor', 'Storage', 'Battery', 'Display', 'Camera', 'Memory External', 'OS Version']
    for col in text_cols:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip().str.lower()
    # Numeric conversions with safe defaults
    if 'Price' in df.columns:
        df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
        if df['Price'].notna().any():
            df['Price'] = df['Price'].fillna(df['Price'].mean())
    if 'Spec Score' in df.columns:
        df['Spec Score'] = pd.to_numeric(df['Spec Score'], errors='coerce')
        if df['Spec Score'].notna().any():
            df['Spec Score'] = df['Spec Score'].fillna(df['Spec Score'].mean())
    if 'Rating' in df.columns:
        df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
        if df['Rating'].notna().any():
            df['Rating'] = df['Rating'].fillna(df['Rating'].mean())
    # Fill common text columns with Unknown if missing
    for col in ['Battery', 'Storage', 'Processor', 'SIM / Network', 'Display', 'Camera', 'Memory External', 'OS Version']:
        if col in df.columns:
            df[col] = df[col].fillna('Unknown')
    return df

def split_processor(df):
    if 'Processor' not in df.columns:
        return df
    df = df.copy()
    # split into up to 3 parts; handle variable lengths safely
    parts = df['Processor'].astype(str).str.split(',', n=2)
    df['Processor Name'] = parts.str[0].str.strip().replace('nan', np.nan)
    df['Processor Type'] = parts.str[1].str.strip().replace('nan', np.nan)
    speed = parts.str[2].str.strip().replace('nan', np.nan)
    # normalize speed: remove non-numeric/period and ghz, then append ' GHz' if value found
    def norm_speed(s):
        if pd.isna(s):
            return np.nan
        s = s.lower()
        s = re.sub(r'ghz', '', s)
        s = re.sub(r'[^0-9\.]', '', s)
        return (s + ' GHz') if s else np.nan
    df['Processor Speed'] = speed.apply(norm_speed)
    df = df.drop(columns=['Processor'])
    return df

def split_sim_network(val):
    try:
        parts = [p.strip().lower() for p in str(val).split(',') if p.strip() != '']
    except Exception:
        parts = []
    sim_type = None
    extra_feature = None
    if 'volte' in parts:
        volte_idx = parts.index('volte')
        sim_type = ', '.join(parts[:volte_idx + 1]) if parts[:volte_idx + 1] else None
        extra_feature = ', '.join(parts[volte_idx + 1:]) if len(parts) > volte_idx + 1 else None
    else:
        sim_type = ', '.join(parts) if parts else None
        extra_feature = None
    return pd.Series([sim_type, extra_feature])

def split_sim(df):
    if 'SIM / Network' not in df.columns:
        return df
    df = df.copy()
    df[['SIM Type', 'Extra Feature']] = df['SIM / Network'].apply(split_sim_network)
    df = df.drop(columns=['SIM / Network'])
    return df

def split_storage(df):
    if 'Storage' not in df.columns:
        return df
    df = df.copy()
    # split into up to 2 parts (RAM, Internal Storage)
    parts = df['Storage'].astype(str).str.split(',', n=1)
    df['RAM'] = parts.str[0].str.strip().str.lower().replace('nan', np.nan)
    df['Internal Storage'] = parts.str[1].str.strip().str.lower().replace('nan', np.nan)
    df = df.drop(columns=['Storage'])
    return df

def extract_battery_capacity(battery):
    if pd.isna(battery):
        return None
    battery = str(battery).lower()
    match = re.search(r'(\d{3,5})\s*mah', battery)
    if match:
        capacity = match.group(1)
        watt_match = re.search(r'(\d{1,3})\s*w', battery)
        if watt_match:
            return f"{capacity}mAh {watt_match.group(1)}W"
        return f"{capacity}mAh"
    return None

def extract_battery_feature(battery):
    if pd.isna(battery):
        return 'Unknown'
    battery = str(battery).lower()
    if any(x in battery for x in ['fast', 'quick', 'turbo', 'super', 'warp']):
        return 'Fast Charging'
    return 'Standard Charging'

def split_battery(df):
    if 'Battery' not in df.columns:
        return df
    df = df.copy()
    df['Battery Capacity'] = df['Battery'].apply(extract_battery_capacity)
    df['Battery Feature'] = df['Battery'].apply(extract_battery_feature)
    # keep original Battery column removed to avoid redundancy
    df = df.drop(columns=['Battery'])
    return df

def split_display(display):
    display = '' if pd.isna(display) else str(display).lower()
    size_match = re.search(r'(\d+(\.\d+)?)\s*inch', display)
    size = f"{size_match.group(1)} inch" if size_match else None
    res_match = re.search(r'(\d{3,4})\s*[x×]\s*(\d{3,4})\s*(?:px)?', display)
    resolution = None
    if res_match:
        resolution = f"{res_match.group(1)}x{res_match.group(2)}"
    hz_match = re.search(r'(\d{2,3})\s*hz', display)
    if hz_match:
        hz = f"{hz_match.group(1)} Hz"
        resolution = f"{resolution}, {hz}" if resolution else hz
    feature = 'with punch hole' if 'punch hole' in display else 'no punch hole'
    return pd.Series([size, resolution, feature])

def split_display_col(df):
    if 'Display' not in df.columns:
        return df
    df = df.copy()
    df[['Display Size', 'Display Resolution', 'Display Feature']] = df['Display'].apply(split_display)
    df = df.drop(columns=['Display'])
    return df

def clean_memory_external(df):
    if 'Memory External' not in df.columns:
        return df
    df = df.copy()
    vals = df['Memory External'].astype(str).str.strip().str.lower()
    vals = vals.replace({'yes': 'supported', 'y': 'supported', 'true': 'supported', 'no': 'not supported', 'n': 'not supported', 'false': 'not supported'})
    vals = vals.where(~vals.isin(['nan', 'none', 'none', 'unknown']), other='unknown')
    df['Memory External'] = vals
    return df

def rearrange_columns(df):
    preferred = ['Brand Name', 'Spec Score', 'Rating', 'Price',
                'Tag', 'Processor Name', 'Processor Type', 'Processor Speed',
                'RAM', 'Internal Storage',
                'Battery Capacity', 'Battery Feature',
                'SIM Type', 'Extra Feature',
                'Display Size', 'Display Resolution', 'Display Feature',
                'Memory External', 'OS Version', 'Camera', 'Image Preview']
    # keep only columns that exist and preserve their order
    cols = [c for c in preferred if c in df.columns]
    # append any remaining columns at the end to avoid data loss
    remaining = [c for c in df.columns if c not in cols]
    df = df[cols + remaining]
    return df

def categorize_by_tag(df):
    if 'Tag' not in df.columns:
        return {}
    categories = {}
    for tag in df['Tag'].astype(str).str.lower().unique():
        categories[tag] = df[df['Tag'].astype(str).str.lower() == tag].copy()
    return categories

def save_categories(df, out_dir='data/preprocess'):
    os.makedirs(out_dir, exist_ok=True)
    if 'Tag' not in df.columns:
        return None, None
    df = df.copy()
    df['Tag'] = df['Tag'].astype(str).str.lower()
    launched_df = df[df['Tag'] == 'launched']
    upcoming_rumored_df = df[df['Tag'].isin(['upcoming', 'rumored'])]
    launched_path = os.path.join(out_dir, 'mobile_launched.csv')
    upcoming_path = os.path.join(out_dir, 'mobile_upcoming_rumored.csv')
    launched_df.to_csv(launched_path, index=False)
    upcoming_rumored_df.to_csv(upcoming_path, index=False)
    return launched_path, upcoming_path

def preprocess_mobile_data(raw_path='data/raw/mobile.csv', preprocess_dir='data/preprocess'):
    os.makedirs(preprocess_dir, exist_ok=True)
    df_mobile = load_mobile_data(raw_path)
    df_rename = rename_columns(df_mobile)
    df_mobile_cleaned = initial_cleaning(df_rename)
    cleaned_path = os.path.join(preprocess_dir, 'mobile_cleaned.csv')
    df_mobile_cleaned.to_csv(cleaned_path, index=False)

    df_mobile = pd.read_csv(cleaned_path)
    df_mobile = standardize_and_fill(df_mobile)
    df_mobile = split_processor(df_mobile)
    df_mobile = split_sim(df_mobile)
    df_mobile = split_storage(df_mobile)
    df_mobile = split_battery(df_mobile)
    df_mobile = split_display_col(df_mobile)
    df_mobile = clean_memory_external(df_mobile)
    df_mobile = rearrange_columns(df_mobile)
    final_cleaned_path = os.path.join(preprocess_dir, 'mobile_final_cleaned.csv')
    df_mobile.to_csv(final_cleaned_path, index=False)

    df_mobile_cleaned = pd.read_csv(final_cleaned_path)
    save_categories(df_mobile_cleaned, preprocess_dir)

    return df_mobile_cleaned


In [3]:

# == Data processing functions (embedded from data_process.py) ==
def _safe_read_csv(path):
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input file not found: {path}")
    return pd.read_csv(path)

def _ensure_output_dir(path):
    out_dir = os.path.dirname(path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

def _first_match_in_list(text, choices):
    text = '' if pd.isna(text) else str(text)
    tl = text.lower()
    for fam in choices:
        if fam.lower() in tl:
            return fam
    return 'Unknown'

def _extract_first_number(text):
    if pd.isna(text):
        return None
    m = re.search(r'\d+\.?\d*', str(text))
    return float(m.group(0)) if m else None

def get_display_size_range(text):
    size = _extract_first_number(text)
    if size is None:
        return 'Unknown'
    if size < 5.0:
        return 'Less than 5 inch'
    if 5.0 <= size < 6.0:
        return '5 to 6 inch'
    if 6.0 <= size < 7.0:
        return '6 to 7 inch'
    return 'More than 7 inch'

def get_battery_capacity_range(text):
    if pd.isna(text):
        return 'Unknown'
    m = re.search(r'(\d{3,5})', str(text))
    if not m:
        return 'Unknown'
    try:
        capacity = int(m.group(1))
    except ValueError:
        return 'Unknown'
    if capacity < 3000:
        return 'Low (<3000mAh)'
    if 3000 <= capacity < 4000:
        return 'Medium (3000 to 4000mAh)'
    if 4000 <= capacity < 5000:
        return 'High (4000 to 5000mAh)'
    return 'Very High (>=5000mAh)'

def _safe_dropna(df, subset):
    cols = [c for c in subset if c in df.columns]
    if not cols:
        return df
    return df.dropna(subset=cols, how='any')

def process_launched_data(input_path='data/preprocess/mobile_launched.csv', output_path='data/preprocess/mobile_launched_cleaned.csv'):
    df_launched = _safe_read_csv(input_path)
    df_launched_cleaned = _safe_dropna(df_launched, ['Brand Name', 'Spec Score', 'Rating', 'Price', 'Processor Name', 'Image Preview'])

    brand_families = ['Alcatel', 'Apple', 'Google', 'Infinix', 'IQOO', 'Itel', 'Motorola',
                      'Nokia', 'OnePlus', 'Oppo', 'Poco', 'Realme', 'Samsung', 'Tecno', 'Vivo',
                      'Xiaomi', 'ZTE']
    df_launched_cleaned['Brand Family'] = df_launched_cleaned.get('Brand Name', pd.Series()).apply(lambda t: _first_match_in_list(t, brand_families) if not pd.isna(t) else 'Unknown')

    processor_families = ['Snapdragon', 'Dimensity', 'Helio', 'Exynos', 'MediaTek', 'Bionic', 'Tensor', 'Unisoc', 'Tiger', 'Intel', 'AMD', 'Qualcomm']
    df_launched_cleaned['Processor Family'] = df_launched_cleaned.get('Processor Name', pd.Series()).apply(lambda t: _first_match_in_list(t, processor_families) if not pd.isna(t) else 'Unknown')

    # Display Size Range
    if 'Display Size' in df_launched_cleaned.columns:
        df_launched_cleaned['Display Size Range'] = df_launched_cleaned['Display Size'].apply(get_display_size_range)
    else:
        df_launched_cleaned['Display Size Range'] = 'Unknown'

    # Battery Capacity Range
    if 'Battery Capacity' in df_launched_cleaned.columns:
        df_launched_cleaned['Battery Capacity Range'] = df_launched_cleaned['Battery Capacity'].apply(get_battery_capacity_range)
    else:
        df_launched_cleaned['Battery Capacity Range'] = 'Unknown'

    _ensure_output_dir(output_path)
    df_launched_cleaned.to_csv(output_path, index=False)
    return df_launched_cleaned

def process_upcoming_data(input_path='data/preprocess/mobile_upcoming_rumored.csv', output_path='data/preprocess/mobile_upcoming_cleaned.csv'):
    df_upcoming = _safe_read_csv(input_path)
    df_upcoming_cleaned = _safe_dropna(df_upcoming, ['Brand Name', 'Spec Score', 'Rating', 'Price', 'Processor Name', 'Image Preview'])

    brand_families = ['Alcatel', 'Apple', 'Google', 'Infinix', 'HTC', 'Honor', 'IQOO', 'Itel', 'Lava', 'Moondrop', 'Motorola',
                      'Nokia', 'Nubia', 'OnePlus', 'Oppo', 'Poco', 'Realme', 'Sharp', 'Samsung', 'Sony Xperia', 'Tecno', 'Tesla', 'Vivo',
                      'Xiaomi', 'ZTE']
    df_upcoming_cleaned['Brand Family'] = df_upcoming_cleaned.get('Brand Name', pd.Series()).apply(lambda t: _first_match_in_list(t, brand_families) if not pd.isna(t) else 'Unknown')

    processor_families = ['Snapdragon', 'Dimensity', 'Helio', 'Exynos', 'MediaTek', 'Bionic', 'Tensor', 'Unisoc', 'Tiger', 'Intel', 'AMD', 'Qualcomm', 'Apple', 'Xring']
    df_upcoming_cleaned['Processor Family'] = df_upcoming_cleaned.get('Processor Name', pd.Series()).apply(lambda t: _first_match_in_list(t, processor_families) if not pd.isna(t) else 'Unknown')

    if 'Display Size' in df_upcoming_cleaned.columns:
        df_upcoming_cleaned['Display Size Range'] = df_upcoming_cleaned['Display Size'].apply(get_display_size_range)
    else:
        df_upcoming_cleaned['Display Size Range'] = 'Unknown'

    if 'Battery Capacity' in df_upcoming_cleaned.columns:
        df_upcoming_cleaned['Battery Capacity Range'] = df_upcoming_cleaned['Battery Capacity'].apply(get_battery_capacity_range)
    else:
        df_upcoming_cleaned['Battery Capacity Range'] = 'Unknown'

    _ensure_output_dir(output_path)
    df_upcoming_cleaned.to_csv(output_path, index=False)
    return df_upcoming_cleaned


In [4]:

# == Visualization functions (embedded from visualization.py) ==
def _ensure_output_dir_vis(path):
    if not path:
        return
    os.makedirs(path, exist_ok=True)

def _save_or_show(fig_name, save_dir, show):
    if save_dir:
        _ensure_output_dir_vis(save_dir)
        path = os.path.join(save_dir, fig_name)
        try:
            plt.savefig(path, dpi=150, bbox_inches='tight')
        except Exception:
            pass
    if show:
        try:
            plt.show()
        except Exception:
            pass
    plt.close()

def _plot_hist(df, col, title, xlabel, save_dir, show):
    if col not in df.columns or df[col].dropna().empty:
        return
    data = pd.to_numeric(df[col], errors='coerce').dropna()
    if data.empty:
        return
    plt.figure(figsize=(12, 6))
    sns.histplot(data, bins=30, kde=True)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.tight_layout()
    _save_or_show(f"{col.replace(' ', '_')}_hist.png", save_dir, show)

def _plot_count(df, col, title, xlabel, save_dir, show):
    if col not in df.columns or df[col].dropna().empty:
        return
    plt.figure(figsize=(14, 7))
    order = df[col].value_counts().index
    try:
        sns.countplot(data=df, x=col, order=order)
    except Exception:
        sns.countplot(data=df, x=col)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.tight_layout()
    _save_or_show(f"{col.replace(' ', '_')}_count.png", save_dir, show)

def visualize_launched_phones(df_launched, save_dir='data/figures', show=False):
    if df_launched is None or df_launched.empty:
        print("No launched data to visualize.")
        return
    print(df_launched.head())
    print('=' * 50)

    _plot_hist(df_launched, 'Spec Score', 'Distribution of Specification Scores in Launched Phones', 'Specification Score', save_dir, show)
    _plot_hist(df_launched, 'Price', 'Distribution of Prices in Launched Phones', 'Price', save_dir, show)
    _plot_hist(df_launched, 'Rating', 'Distribution of Ratings in Launched Phones', 'Rating', save_dir, show)

    _plot_count(df_launched, 'Brand Family', 'Count of Launched Phones by Brand Family', 'Brand Family', save_dir, show)
    _plot_count(df_launched, 'Processor Family', 'Count of Launched Phones by Processor Family', 'Processor Family', save_dir, show)
    _plot_count(df_launched, 'RAM', 'Count of Launched Phones by RAM', 'RAM', save_dir, show)
    _plot_count(df_launched, 'Internal Storage', 'Count of Launched Phones by Internal Storage', 'Internal Storage', save_dir, show)
    _plot_count(df_launched, 'Battery Capacity Range', 'Count of Launched Phones by Battery Capacity Range', 'Battery Capacity Range', save_dir, show)

    print("Visualizations for launched phones completed.")
    print('=' * 50)

def visualize_upcoming_phones(df_upcoming_rumored, save_dir='data/figures', show=False):
    if df_upcoming_rumored is None or df_upcoming_rumored.empty:
        print("No upcoming/rumored data to visualize.")
        return
    print(df_upcoming_rumored.head())
    print('=' * 50)

    _plot_hist(df_upcoming_rumored, 'Spec Score', 'Distribution of Specification Scores in Upcoming and Rumored Phones', 'Specification Score', save_dir, show)
    _plot_hist(df_upcoming_rumored, 'Price', 'Distribution of Prices in Upcoming and Rumored Phones', 'Price', save_dir, show)
    _plot_hist(df_upcoming_rumored, 'Rating', 'Distribution of Ratings in Upcoming and Rumored Phones', 'Rating', save_dir, show)

    _plot_count(df_upcoming_rumored, 'Brand Family', 'Count of Upcoming and Rumored Phones by Brand Family', 'Brand Family', save_dir, show)
    _plot_count(df_upcoming_rumored, 'Processor Family', 'Count of Upcoming and Rumored Phones by Processor Family', 'Processor Family', save_dir, show)
    _plot_count(df_upcoming_rumored, 'RAM', 'Count of Upcoming and Rumored Phones by RAM', 'RAM', save_dir, show)
    _plot_count(df_upcoming_rumored, 'Internal Storage', 'Count of Upcoming and Rumored Phones by Internal Storage', 'Internal Storage', save_dir, show)
    _plot_count(df_upcoming_rumored, 'Battery Capacity Range', 'Count of Upcoming and Rumored Phones by Battery Capacity Range', 'Battery Capacity Range', save_dir, show)
    _plot_count(df_upcoming_rumored, 'Display Size Range', 'Count of Upcoming and Rumored Phones by Display Size Range', 'Display Size Range', save_dir, show)

    print("Visualizations for upcoming and rumored phones completed.")
    print('=' * 50)

def load_launched_data(path='data/preprocess/mobile_launched_cleaned.csv'):
    if not os.path.isfile(path):
        raise FileNotFoundError(f"File not found: {path}")
    return pd.read_csv(path)

def load_upcoming_data(path='data/preprocess/mobile_upcoming_cleaned.csv'):
    if not os.path.isfile(path):
        raise FileNotFoundError(f"File not found: {path}")
    return pd.read_csv(path)


In [5]:

# == Trend analysis & prediction functions (embedded from mobile_prediction.py) ==
def _safe_read_csv_mp(path):
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input file not found: {path}")
    return pd.read_csv(path)

def _ensure_output_dir_mp(path):
    out_dir = os.path.dirname(path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

def _safe_mode(series, default='Unknown'):
    s = series.dropna()
    if s.empty:
        return default
    m = s.mode()
    return m.iloc[0] if not m.empty else default

def process_mobile_trends(input_path, output_path):
    df = _safe_read_csv_mp(input_path)

    # Safe numeric extraction for RAM and Storage
    if 'RAM' in df.columns:
        ram_extracted = df['RAM'].astype(str).str.extract(r'(\d+\.?\d*)')[0]
        df['RAM_GB'] = pd.to_numeric(ram_extracted, errors='coerce')
    else:
        df['RAM_GB'] = np.nan

    if 'Internal Storage' in df.columns:
        stor_extracted = df['Internal Storage'].astype(str).str.extract(r'(\d+\.?\d*)')[0]
        df['Storage_GB'] = pd.to_numeric(stor_extracted, errors='coerce')
    else:
        df['Storage_GB'] = np.nan

    # Normalize Price to numeric safely
    if 'Price' in df.columns:
        df['Price_numeric'] = pd.to_numeric(df['Price'], errors='coerce')
    else:
        df['Price_numeric'] = np.nan

    # Create price bins only when we have numeric prices
    bins = [0, 2000, 4000, 6000, 8000, 12000, float("inf")]
    labels = ["0-2K(Low)", "2K-4K(Low)", "4K-6K(Mid)", "6K-8K(Mid)", "8K-12K(High)", ">=12K(High)"]
    if df['Price_numeric'].notna().any():
        df['Price Range'] = pd.cut(df['Price_numeric'], bins=bins, labels=labels)
    else:
        df['Price Range'] = pd.Series([np.nan] * len(df))

    # Group and aggregate with safe functions
    agg_dict = {
        "Spec Score": "mean",
        "Rating": "mean",
        "Price_numeric": "mean",
        "Price Range": lambda x: _safe_mode(x, default="Unknown"),
        "Processor Family": lambda x: _safe_mode(x, default="Unknown"),
        "RAM_GB": lambda x: _safe_mode(x, default=np.nan),
        "Storage_GB": lambda x: _safe_mode(x, default=np.nan),
    }

    # Ensure Brand Family exists to group by; if not, create Unknown group
    if 'Brand Family' not in df.columns:
        df['Brand Family'] = 'Unknown'

    trend_df = df.groupby("Brand Family").agg(agg_dict).reset_index()

    # Clean up numeric columns and formatting
    numeric_cols = ["Spec Score", "Rating", "Price_numeric"]
    for c in numeric_cols:
        if c in trend_df.columns:
            trend_df[c] = pd.to_numeric(trend_df[c], errors='coerce')

    if "Spec Score" in trend_df.columns:
        trend_df["Spec Score"] = trend_df["Spec Score"].round(2)
    if "Rating" in trend_df.columns:
        trend_df["Rating"] = trend_df["Rating"].round(2)
    if "Price_numeric" in trend_df.columns:
        trend_df = trend_df.rename(columns={"Price_numeric": "Price"})
        # Format price where present, otherwise keep NaN
        trend_df["Price"] = trend_df["Price"].apply(lambda x: f"{x:,.2f}" if pd.notna(x) else "")

    # Ensure output dir and save
    _ensure_output_dir_mp(output_path)
    trend_df = trend_df.sort_values(by="Spec Score", ascending=False, na_position='last')
    trend_df.to_csv(output_path, index=False)

    return trend_df

def visualize_trends(trend_df, title="Trends in Mobile Phones by Brand Family", save_path=None):
    if trend_df is None or trend_df.empty:
        print("No trend data to plot.")
        return
    plt.figure(figsize=(12, 6))
    try:
        sns.lineplot(data=trend_df, x="Brand Family", y="Spec Score", marker="o")
        plt.title(title)
        plt.xlabel("Brand Family")
        plt.ylabel("Specification Score")
        plt.xticks(rotation=45)
        plt.grid(True)
        plt.tight_layout()
        if save_path:
            _ensure_output_dir_mp(save_path)
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        else:
            plt.show()
    except Exception as e:
        print("Plotting failed:", e)
    finally:
        plt.close()

def analyze_mobile_trends():
    processed_dir = 'data/processed'
    os.makedirs(processed_dir, exist_ok=True)

    # Launched Phones
    launched_path = 'data/preprocess/mobile_launched_cleaned.csv'
    launched_output = os.path.join(processed_dir, 'brand_family_trends.csv')
    launched_trends = None
    try:
        launched_trends = process_mobile_trends(launched_path, launched_output)
        print("Brand family trends saved to", launched_output)
        print(launched_trends.head())
    except FileNotFoundError:
        print(f"Launched input file not found: {launched_path}")
    except Exception as e:
        print("Error processing launched trends:", e)
    print('=' * 50)

    # Upcoming Phones
    upcoming_path = 'data/preprocess/mobile_upcoming_cleaned.csv'
    upcoming_output = os.path.join(processed_dir, 'upcoming_brand_family_trends.csv')
    upcoming_trends = None
    try:
        upcoming_trends = process_mobile_trends(upcoming_path, upcoming_output)
        print("Upcoming and Rumored brand family trends saved to", upcoming_output)
        if upcoming_trends is not None:
            print("Top 10 Upcoming Brands by Spec Score:")
            print(upcoming_trends.head(10))
    except FileNotFoundError:
        print(f"Upcoming input file not found: {upcoming_path}")
    except Exception as e:
        print("Error processing upcoming trends:", e)
    print('=' * 50)

    # Visualize (save to file to avoid GUI blocking in headless environments)
    if upcoming_trends is not None and not upcoming_trends.empty:
        viz_path = os.path.join(processed_dir, 'upcoming_trends_spec_score.png')
        visualize_trends(upcoming_trends, title="Trends in Upcoming Mobile Phones by Brand Family", save_path=viz_path)

    print("Mobile trends analysis completed.")
    print('=' * 50)
    return launched_trends, upcoming_trends


In [6]:

# === Run full pipeline ===
print("Starting preprocessing...")
df_final = preprocess_mobile_data(RAW_PATH)
print("Preprocessing finished. Sample:")
display(df_final.head())

# Save final cleaned file path
final_cleaned_path = 'data/preprocess/mobile_final_cleaned.csv'
print("Final cleaned saved to:", final_cleaned_path)

print("\nProcessing categories into launched/upcoming files...")
launched_path, upcoming_path = save_categories(df_final, out_dir='data/preprocess')
print("Launched path:", launched_path)
print("Upcoming/Rumored path:", upcoming_path)

# Further cleaning for launched/upcoming
if launched_path and os.path.isfile(launched_path):
    df_launched_cleaned = process_launched_data(launched_path, output_path='data/preprocess/mobile_launched_cleaned.csv')
    print("Launched cleaned sample:")
    display(df_launched_cleaned.head())
else:
    print("No launched data to process.")

if upcoming_path and os.path.isfile(upcoming_path):
    df_upcoming_cleaned = process_upcoming_data(upcoming_path, output_path='data/preprocess/mobile_upcoming_cleaned.csv')
    print("Upcoming cleaned sample:")
    display(df_upcoming_cleaned.head())
else:
    print("No upcoming/rumored data to process.")

# Visualizations
print("\nCreating visualizations...")
try:
    visualize_launched_phones(df_launched_cleaned, save_dir='data/figures', show=False)
except Exception as e:
    print("Visualization for launched phones failed:", e)
try:
    visualize_upcoming_phones(df_upcoming_cleaned, save_dir='data/figures', show=False)
except Exception as e:
    print("Visualization for upcoming phones failed:", e)

# Trend analysis
print("\nAnalyzing brand-family trends...")
launched_trends, upcoming_trends = analyze_mobile_trends()
print("Launched trends (if generated):")
if launched_trends is not None:
    display(launched_trends.head())
print("Upcoming trends (if generated):")
if upcoming_trends is not None:
    display(upcoming_trends.head())

print("\nAll outputs saved under data/ (preprocess, processed, figures).")


Starting preprocessing...
Preprocessing finished. Sample:


Unnamed: 0,Brand Name,Spec Score,Rating,Price,Tag,Processor Name,Processor Type,Processor Speed,RAM,Internal Storage,...,Battery Feature,SIM Type,Extra Feature,Display Size,Display Resolution,Display Feature,Memory External,OS Version,Camera,Image Preview
0,oppo reno 14 pro 5g,89,4.65,41990,upcoming,dimensity 8450,octa core,3.25 GHz,12 gb ram,256 gb inbuilt,...,Fast Charging,"dual sim, 3g, 4g, 5g, volte","wi-fi, nfc, ir blaster",6.83 inch,"1272x2800, 120 Hz",with punch hole,unknown,android v15,50 mp + 50 mp + 50 mp triple rear & 50 mp fron...,https://cdn1.smartprix.com/rx-is822PXo3-w280-h...
1,oppo reno 14 5g,87,4.75,32990,upcoming,dimensity 8350,octa core,3.35 GHz,8 gb ram,256 gb inbuilt,...,Fast Charging,"dual sim, 3g, 4g, 5g, volte","wi-fi, nfc, ir blaster",6.59 inch,"1256x2760, 120 Hz",with punch hole,unknown,android v15,50 mp + 50 mp + 8 mp triple rear & 50 mp front...,https://cdn1.smartprix.com/rx-iRGgfcGDH-w280-h...
2,poco f7 5g,83,4.75,31999,launched,snapdragon 8s gen4,octa core,3.2 GHz,12 gb ram,256 gb inbuilt,...,Fast Charging,"dual sim, 3g, 4g, 5g, volte","wi-fi, nfc, ir blaster",6.83 inch,"1280x2772, 120 Hz",with punch hole,memory card not supported,android v15,50 mp + 8 mp dual rear & 20 mp front camera,https://cdn1.smartprix.com/rx-icmgBU9Q2-w280-h...
3,vivo x200 fe,89,4.65,49990,upcoming,dimensity 9300 plus,octa core,3.25 GHz,12 gb ram,256 gb inbuilt,...,Fast Charging,"dual sim, 3g, 4g, 5g, volte","wi-fi, nfc, ir blaster",6.31 inch,"1216x2640, 120 Hz",with punch hole,memory card not supported,android v15,50 mp + 50 mp + 8 mp triple rear & 50 mp front...,https://cdn1.smartprix.com/rx-iHI7IaQgQ-w280-h...
4,oppo k13x 5g,73,4.2,11999,launched,dimensity 6300,octa core,2.4 GHz,4 gb ram,128 gb inbuilt,...,Fast Charging,"dual sim, 3g, 4g, 5g, volte",wi-fi,6.67 inch,"720x1604, 120 Hz",with punch hole,"memory card supported, upto 2 tb",android v15,50 mp + 2 mp dual rear & 8 mp front camera,https://cdn1.smartprix.com/rx-iXUulomIY-w280-h...


Final cleaned saved to: data/preprocess/mobile_final_cleaned.csv

Processing categories into launched/upcoming files...
Launched path: data/preprocess\mobile_launched.csv
Upcoming/Rumored path: data/preprocess\mobile_upcoming_rumored.csv
Launched cleaned sample:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_launched_cleaned['Brand Family'] = df_launched_cleaned.get('Brand Name', pd.Series()).apply(lambda t: _first_match_in_list(t, brand_families) if not pd.isna(t) else 'Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_launched_cleaned['Processor Family'] = df_launched_cleaned.get('Processor Name', pd.Series()).apply(lambda t: _first_match_in_list(t, processor_families) if not pd.isna(t) else 'Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_in

Unnamed: 0,Brand Name,Spec Score,Rating,Price,Tag,Processor Name,Processor Type,Processor Speed,RAM,Internal Storage,...,Display Resolution,Display Feature,Memory External,OS Version,Camera,Image Preview,Brand Family,Processor Family,Display Size Range,Battery Capacity Range
0,poco f7 5g,83,4.75,31999,launched,snapdragon 8s gen4,octa core,3.2 GHz,12 gb ram,256 gb inbuilt,...,"1280x2772, 120 Hz",with punch hole,memory card not supported,android v15,50 mp + 8 mp dual rear & 20 mp front camera,https://cdn1.smartprix.com/rx-icmgBU9Q2-w280-h...,Poco,Snapdragon,6 to 7 inch,Very High (>=5000mAh)
1,oppo k13x 5g,73,4.2,11999,launched,dimensity 6300,octa core,2.4 GHz,4 gb ram,128 gb inbuilt,...,"720x1604, 120 Hz",with punch hole,"memory card supported, upto 2 tb",android v15,50 mp + 2 mp dual rear & 8 mp front camera,https://cdn1.smartprix.com/rx-iXUulomIY-w280-h...,Oppo,Dimensity,6 to 7 inch,Very High (>=5000mAh)
2,vivo y400 pro 5g,79,4.1,24999,launched,dimensity 7300,octa core,2.5 GHz,8 gb ram,128 gb inbuilt,...,"1080x2392, 120 Hz",with punch hole,unknown,android v15,50 mp + 2 mp dual rear & 32 mp front camera,https://cdn1.smartprix.com/rx-iJGGh862X-w280-h...,Vivo,Dimensity,6 to 7 inch,Very High (>=5000mAh)
3,vivo t4 lite 5g,73,4.1,9999,launched,dimensity 6300,octa core,2.4 GHz,4 gb ram,128 gb inbuilt,...,"720x1600, 90 Hz",no punch hole,"memory card supported, upto 2 tb",android v15,50 mp + 2 mp dual rear & 5 mp front camera,https://cdn1.smartprix.com/rx-ic3ZJSusM-w280-h...,Vivo,Dimensity,6 to 7 inch,Very High (>=5000mAh)
4,vivo t4 ultra,85,4.5,37999,launched,dimensity 9300 plus,octa core,3.25 GHz,8 gb ram,256 gb inbuilt,...,"1260x2800, 120 Hz",with punch hole,memory card not supported,android v15,50 mp + 50 mp + 8 mp triple rear & 32 mp front...,https://cdn1.smartprix.com/rx-izSx0A8ZX-w280-h...,Vivo,Dimensity,6 to 7 inch,Very High (>=5000mAh)


Upcoming cleaned sample:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_upcoming_cleaned['Brand Family'] = df_upcoming_cleaned.get('Brand Name', pd.Series()).apply(lambda t: _first_match_in_list(t, brand_families) if not pd.isna(t) else 'Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_upcoming_cleaned['Processor Family'] = df_upcoming_cleaned.get('Processor Name', pd.Series()).apply(lambda t: _first_match_in_list(t, processor_families) if not pd.isna(t) else 'Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_in

Unnamed: 0,Brand Name,Spec Score,Rating,Price,Tag,Processor Name,Processor Type,Processor Speed,RAM,Internal Storage,...,Display Resolution,Display Feature,Memory External,OS Version,Camera,Image Preview,Brand Family,Processor Family,Display Size Range,Battery Capacity Range
0,oppo reno 14 pro 5g,89,4.65,41990,upcoming,dimensity 8450,octa core,3.25 GHz,12 gb ram,256 gb inbuilt,...,"1272x2800, 120 Hz",with punch hole,unknown,android v15,50 mp + 50 mp + 50 mp triple rear & 50 mp fron...,https://cdn1.smartprix.com/rx-is822PXo3-w280-h...,Oppo,Dimensity,6 to 7 inch,Very High (>=5000mAh)
1,oppo reno 14 5g,87,4.75,32990,upcoming,dimensity 8350,octa core,3.35 GHz,8 gb ram,256 gb inbuilt,...,"1256x2760, 120 Hz",with punch hole,unknown,android v15,50 mp + 50 mp + 8 mp triple rear & 50 mp front...,https://cdn1.smartprix.com/rx-iRGgfcGDH-w280-h...,Oppo,Dimensity,6 to 7 inch,Very High (>=5000mAh)
2,vivo x200 fe,89,4.65,49990,upcoming,dimensity 9300 plus,octa core,3.25 GHz,12 gb ram,256 gb inbuilt,...,"1216x2640, 120 Hz",with punch hole,memory card not supported,android v15,50 mp + 50 mp + 8 mp triple rear & 50 mp front...,https://cdn1.smartprix.com/rx-iHI7IaQgQ-w280-h...,Vivo,Dimensity,6 to 7 inch,Very High (>=5000mAh)
3,xiaomi redmi turbo 4 pro,83,4.2,23990,upcoming,snapdragon 8s gen4,octa core,3.2 GHz,12 gb ram,256 gb inbuilt,...,"1280x2772, 120 Hz",with punch hole,unknown,android v15,50 mp + 8 mp dual rear & 20 mp front camera,https://cdn1.smartprix.com/rx-ixV8Z9NfY-w280-h...,Xiaomi,Snapdragon,6 to 7 inch,Very High (>=5000mAh)
4,vivo s30 pro mini,89,4.65,40990,upcoming,dimensity 9300 plus,octa core,3.4 GHz,12 gb ram,256 gb inbuilt,...,"1216x2640, 120 Hz",with punch hole,memory card not supported,android v15,50 mp + 50 mp + 8 mp triple rear & 50 mp front...,https://cdn1.smartprix.com/rx-ihxl2mWnr-w280-h...,Vivo,Dimensity,6 to 7 inch,Very High (>=5000mAh)



Creating visualizations...
         Brand Name  Spec Score  Rating  Price       Tag       Processor Name  \
0        poco f7 5g          83    4.75  31999  launched   snapdragon 8s gen4   
1      oppo k13x 5g          73    4.20  11999  launched       dimensity 6300   
2  vivo y400 pro 5g          79    4.10  24999  launched       dimensity 7300   
3   vivo t4 lite 5g          73    4.10   9999  launched       dimensity 6300   
4     vivo t4 ultra          85    4.50  37999  launched  dimensity 9300 plus   

  Processor Type Processor Speed        RAM Internal Storage  ...  \
0      octa core         3.2 GHz  12 gb ram   256 gb inbuilt  ...   
1      octa core         2.4 GHz   4 gb ram   128 gb inbuilt  ...   
2      octa core         2.5 GHz   8 gb ram   128 gb inbuilt  ...   
3      octa core         2.4 GHz   4 gb ram   128 gb inbuilt  ...   
4      octa core        3.25 GHz   8 gb ram   256 gb inbuilt  ...   

  Display Resolution  Display Feature                   Memory Externa

Unnamed: 0,Brand Family,Spec Score,Rating,Price,Price Range,Processor Family,RAM_GB,Storage_GB
8,OnePlus,85.14,4.34,43517.76,>=12K(High),Snapdragon,8.0,128.0
0,Alcatel,84.0,4.38,20999.0,>=12K(High),Dimensity,6.0,128.0
2,Google,83.12,4.48,64154.62,>=12K(High),Tensor,8.0,128.0
1,Apple,82.23,4.4,85186.42,>=12K(High),Bionic,8.0,128.0
4,Infinix,81.39,4.36,17299.39,>=12K(High),Dimensity,8.0,256.0


Upcoming trends (if generated):


Unnamed: 0,Brand Family,Spec Score,Rating,Price,Price Range,Processor Family,RAM_GB,Storage_GB
17,Sharp,87.0,4.25,29990.0,>=12K(High),Snapdragon,12,256.0
3,Honor,85.5,4.45,56490.75,>=12K(High),Snapdragon,12,256.0
12,OnePlus,83.76,4.31,45932.76,>=12K(High),Snapdragon,12,256.0
18,Sony Xperia,83.33,4.48,97326.33,>=12K(High),Snapdragon,12,64.0
11,Nubia,82.7,4.3,43791.8,>=12K(High),Snapdragon,8,256.0



All outputs saved under data/ (preprocess, processed, figures).


## Notes & Next Steps

- You can inspect generated CSVs in `data/preprocess` and `data/processed` and figures in `data/figures`.
- If you upload this notebook to Kaggle, make sure to upload `mobile.csv` as a dataset or add it to the notebook's files.
- Want me to also add example EDA tables, additional charts, or model training for price/spec prediction? Ask and I'll extend the notebook.

---

*Generated automatically — ready to run on CPU-only Kaggle environment.*