# Jumia Phone Price Prediction

## Business Understanding
Retailers on Jumia's e-commerce platform struggle to set optimal prices: the marketplace is highly competitive, with over 100,000 retailers, and evaluating competitors' prices is time-consuming. Jumia has tasked us with analyzing the phone catalog data and developing a predictive model that provides data-driven insights, enabling sellers to set competitive prices and maximize profitability ahead of the November Black Friday Big Sale. The model is expected to reduce the effort retailers/sellers spend determining the optimal average price of a product they intend to list on the platform.
 
The objectives of our project are outlined below:
* Identify factors contributing to higher product visibility and marketability on Jumia's first pages.
* Explore the relationship between phone features and customer reviews.
* Develop a predictive model to recommend competitive, optimal pricing that promotes first-page placement.
* Assess the potential relationship between buyer ratings and product pricing.



## Data Understanding
The data we used was scraped on 31st October 2024 from the Jumia Kenya e-commerce platform, specifically from the smartphones category, sorted by popularity from the 1st to the last page. This gave us 12,000 listed devices. The Python code used to scrape the data is stored in a separate file, **scrapped_data.ipynb**. The packages used included Beautiful Soup and Pandas. We saved the data in CSV format on our local machine as jumia_phones.csv, which contains the features outlined below:

**Name** This describes the brand and the features of the phone.

**Price** This describes the current price the phone retails at.

**Old Price** This describes the previous price of the phone.

**Discount** The % discount calculated

**Rating** The buyer's explicit rating of the product and service.

**Number of Reviews** The number of reviews from possible buyers.

**Search Ranking** The page and position of the product in terms of listing and popularity.

 The Name column contains unstructured text, combining brand names and product specifications (e.g., “Samsung Galaxy A12, 5000mAh, 128GB ROM, 6GB RAM”). To transform these into separate, structured attributes, we shall use Regex as it allows for consistent pattern matching, enabling the extraction of information such as battery capacity (e.g., numbers followed by "mAh") and storage (e.g., "GB" or "MB"), making data more structured and accessible for analysis.
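As a minimal sketch of this idea, the snippet below applies two simple patterns to the example title quoted above. The patterns and variable names are illustrative assumptions, not the exact ones used later in the notebook.

```python
import re

# Example listing title quoted in the paragraph above
name = "Samsung Galaxy A12, 5000mAh, 128GB ROM, 6GB RAM"

# Battery capacity: digits followed by "mAh" (case-insensitive)
battery = re.search(r"(\d+)\s?mAh", name, flags=re.IGNORECASE)

# Memory-like tokens: digits followed by "GB" or "MB"
storage = re.findall(r"(\d+)\s?(GB|MB)", name, flags=re.IGNORECASE)

battery_mah = int(battery.group(1)) if battery else None
print(battery_mah)  # 5000
print(storage)      # [('128', 'GB'), ('6', 'GB')]
```

Note that the unit alone does not say whether a GB value is RAM or storage, so a later step still has to tell them apart.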

Data Limitations:

* Dynamic Pricing: Prices on e-commerce platforms fluctuate frequently. Therefore, the scraped prices reflect only the prices at the time of scraping and may not represent current or future values.

* Incomplete or Inconsistent Data: Due to the variety of phone models and brands, some listings may lack uniform information (e.g., missing battery details or memory specifications), which could lead to variability in the parsed features.

* Unverified Ratings and Reviews: Ratings and reviews might be biased or manipulated, affecting any insights or model predictions derived from them.

* Potential Duplicate Listings: Duplicate or near-duplicate entries may exist if the same model is listed by multiple sellers, which could influence popularity and ranking statistics.
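One way to gauge the duplicate-listing limitation is to count repeated rows. The sketch below assumes duplicates can be approximated by identical `Name` and `Price` values; the mini-sample is hypothetical and only mimics the scraped columns.

```python
import pandas as pd

# Hypothetical mini-sample that mimics two of the scraped columns
sample = pd.DataFrame({
    "Name": ['Itel S23 6.6", 128GB + 4GB RAM',
             'Itel S23 6.6", 128GB + 4GB RAM',
             "Samsung Galaxy A05"],
    "Price": ["KSh 10,000", "KSh 10,000", "KSh 14,000"],
})

# Flag every row whose (Name, Price) pair occurs more than once
dup_mask = sample.duplicated(subset=["Name", "Price"], keep=False)
n_dups = int(dup_mask.sum())

# Number of listings per distinct product name
listings_per_name = sample["Name"].value_counts()

print(n_dups)
print(listings_per_name)
```

Whether such rows should be dropped or kept (for example, as separate sellers of the same model) is a modeling decision that this check alone does not settle.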




### Import Relevant Libraries

In [42]:
# Import libraries for inspecting, loading, cleaning and visualizing data
import pandas as pd
import numpy as np
#Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#regular expression
import re

## Data Preparation


### 1. Loading and Inspecting Data
Here we shall load the data using the pandas library imported.

Thereafter we shall inspect our files using the pandas attributes and methods.

In [43]:
# Load the dataset
phone_df = pd.read_csv('jumia_phones.csv')
# View the first 5 rows of phone_df
phone_df.head()

Unnamed: 0,Name,Price,Old Price,Discount,Rating,Number of Reviews,Search Ranking
0,"XIAOMI Redmi A3, 6.71"", 3GB RAM + 64GB (Dual S...","KSh 11,000",,,4.1 out of 5,4.1 out of 5(220),"Page 1, Rank 1"
1,"Tecno Spark 20, Android 13, 6.6"", 128GB + 4GB ...","KSh 12,925","KSh 15,000",14%,4.4 out of 5,4.4 out of 5(135),"Page 1, Rank 2"
2,"Itel S23 6.6"", 128GB + 4GB RAM, 50MP Camera, (...","KSh 10,000",,,4.2 out of 5,4.2 out of 5(151),"Page 1, Rank 3"
3,"Samsung Galaxy A05, 6.7'' 4GB RAM + 128GB ROM ...","KSh 14,000",,,4.5 out of 5,4.5 out of 5(29),"Page 1, Rank 4"
4,"Itel S23 6.6"", 128GB + 4GB RAM, 50MP Camera, (...","KSh 10,000",,,4.3 out of 5,4.3 out of 5(249),"Page 1, Rank 5"


In [44]:
#To see the column names
phone_df.columns

Index(['Name', 'Price', 'Old Price', 'Discount', 'Rating', 'Number of Reviews',
       'Search Ranking'],
      dtype='object')

In [45]:
#To inspect the size of the df
phone_df.shape

(12000, 7)

In [46]:
#To inspect the detailed information of the dataset 
phone_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Name               12000 non-null  object
 1   Price              12000 non-null  object
 2   Old Price          8101 non-null   object
 3   Discount           8101 non-null   object
 4   Rating             11700 non-null  object
 5   Number of Reviews  11700 non-null  object
 6   Search Ranking     12000 non-null  object
dtypes: object(7)
memory usage: 656.4+ KB


In [47]:
phone_df.describe()

Unnamed: 0,Name,Price,Old Price,Discount,Rating,Number of Reviews,Search Ranking
count,12000,12000,8101,8101,11700,11700,12000
unique,40,34,19,22,12,38,12000
top,"Itel A18 5.0"", 32GB + 1GB RAM, 2400mAh - Lime ...","KSh 10,000","KSh 29,999",45%,4.3 out of 5,4.6 out of 5(9),"Page 166, Rank 4"
freq,300,1500,1200,900,1800,600,1


Summary Findings From Data Loading and Inspection:

* We have 7 columns in our DataFrame: 'Name', 'Price', 'Old Price', 'Discount', 'Rating', 'Number of Reviews' and 'Search Ranking'.

* Our dataset has 12,000 rows, indicating 12,000 phone listings.

* All the columns have the object datatype.

* 'Old Price' and 'Discount' each have 8,101 non-null values, and 'Rating' and 'Number of Reviews' each have 11,700, so these four columns contain missing values.

* There are only 40 unique names across the 12,000 listings, meaning the same phone is listed many times. There are 34 unique prices, 19 unique old prices, 22 unique discount percentages, 12 unique ratings and 38 unique values in the 'Number of Reviews' column. 'Search Ranking' has 12,000 unique values, so each listing has its own page and rank position.

* The most frequently listed product is an "Itel A18" listing (300 occurrences). KSh 10,000 is the most common price (1,500 listings), and the most common old price is KSh 29,999 (1,200 listings). The most common discount is 45% (900 listings), which suggests aggressive pricing to attract buyers. The most common rating is 4.3 out of 5 (1,800 listings), indicating a generally positive perception of the products.
The most common value in 'Number of Reviews' is "4.6 out of 5(9)" (600 listings), that is, a 4.6 rating from 9 reviews. This column repeats the rating text, so the review count in parentheses still has to be extracted.
Because every 'Search Ranking' value is unique, the "top" value reported by describe() ("Page 166, Rank 4", frequency 1) carries no special meaning.
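Since every column is stored as object, the values need to be parsed before any numeric analysis. The following is a minimal sketch, assuming the string formats visible in the head() output above; the helper `to_number` and the new column names are hypothetical, and two sample rows stand in for `phone_df`.

```python
import pandas as pd

# Two rows copied from the head() output above, used as a stand-in for phone_df
sample = pd.DataFrame({
    "Price": ["KSh 11,000", "KSh 12,925"],
    "Old Price": [None, "KSh 15,000"],
    "Discount": [None, "14%"],
    "Rating": ["4.1 out of 5", "4.4 out of 5"],
    "Number of Reviews": ["4.1 out of 5(220)", "4.4 out of 5(135)"],
    "Search Ranking": ["Page 1, Rank 1", "Page 1, Rank 2"],
})

def to_number(series, pattern):
    """Extract the first capture group of `pattern` and convert it to float."""
    extracted = series.str.extract(pattern, expand=False)
    return pd.to_numeric(extracted.str.replace(",", "", regex=False), errors="coerce")

clean = pd.DataFrame({
    "price_ksh": to_number(sample["Price"], r"([\d,]+)"),
    "old_price_ksh": to_number(sample["Old Price"], r"([\d,]+)"),
    "discount_pct": to_number(sample["Discount"], r"(\d+)%"),
    "rating": to_number(sample["Rating"], r"([\d.]+) out of"),
    "n_reviews": to_number(sample["Number of Reviews"], r"\((\d+)\)"),
    "page": to_number(sample["Search Ranking"], r"Page (\d+)"),
    "rank": to_number(sample["Search Ranking"], r"Rank (\d+)"),
})

# Consistency check: recompute the discount from the old and current price
recomputed = ((clean["old_price_ksh"] - clean["price_ksh"])
              / clean["old_price_ksh"] * 100).round()
print(clean)
print(recomputed)
```

For the second row, (15,000 - 12,925) / 15,000 is about 13.8%, which rounds to the listed 14%. Missing old prices and discounts stay as NaN rather than being filled in.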


### 3. Feature Splitting
Here we intend to split the Name column into several features containing the brand, screen size, RAM, ROM, color and warranty.

In [48]:
# Define regex pattern to extract the leading brand and model text
pattern = r"""
    (?P<brand>[\w\s]+)(?=,\s|\s|$)  # Leading words and spaces (brand and model), followed by a comma, a space or end of string
"""

# Compile the regex with verbose mode for readability
pattern_brand = re.compile(pattern, re.VERBOSE | re.IGNORECASE)

# Function to extract features
def extract_features(name):
    match = pattern_brand.search(name)
    if match:
        return match.groupdict()
    else:
        return {
            'brand': None,
        }

# Apply the function to each entry in the 'Name' column
brand_df = phone_df['Name'].apply(extract_features).apply(pd.Series)

# Display the resulting DataFrame
brand_df.head()  

Unnamed: 0,brand
0,XIAOMI Redmi A3
1,Tecno Spark 20
2,Itel S23
3,Samsung Galaxy A05
4,Itel S23


In [49]:
# Define regex pattern to extract screen size
screen_size_pattern = r"""
    (?P<screen_size>\d+(\.\d+)?['\"]{1,2})  # Matches numbers with optional decimal followed by " or '
"""
# Compile the regex with verbose mode for readability
pattern_size = re.compile(screen_size_pattern, re.VERBOSE | re.IGNORECASE)

# Function to extract screen size
def extract_features(name):
    match = pattern_size.search(name)
    if match:
        return match.groupdict()
    else:
        return {
            'screen_size': None,
        }

# Apply the function to each entry in the 'Name' column
size_df = phone_df['Name'].apply(extract_features).apply(pd.Series)

# Display the resulting DataFrame
size_df.head()

Unnamed: 0,screen_size
0,"6.71"""
1,"6.6"""
2,"6.6"""
3,6.7''
4,"6.6"""


In [50]:
# Define regex pattern to extract RAM without requiring whitespace
ram_pattern = r"""
    \b(?P<RAM>\d\s?GB)\b  # Matches a single digit followed by optional whitespace and then "GB"
"""

# Compile the regex with verbose mode for readability
pattern_ram = re.compile(ram_pattern, re.VERBOSE | re.IGNORECASE)

# Function to extract RAM
def extract_ram(name):
    match = pattern_ram.search(name)
    if match:
        return match.groupdict()
    else:
        return {
            'RAM': None,
        }
# Apply the function to each entry in the 'Name' column
ram_df = phone_df['Name'].apply(extract_ram).apply(pd.Series)

# Display the resulting DataFrame
ram_df.head()

Unnamed: 0,RAM
0,3GB
1,4GB
2,4GB
3,4GB
4,4GB


In [51]:
# Define regex pattern to extract ROM
rom_pattern = r"""
    \b(?P<ROM>\d{2,}\s?GB)\b  # Matches two or more digits followed by optional whitespace and then "GB"
"""

# Compile the regex with verbose mode for readability
pattern_rom = re.compile(rom_pattern, re.VERBOSE | re.IGNORECASE)

# Function to extract ROM
def extract_rom(name):
    match = pattern_rom.search(name)
    if match:
        return match.groupdict()
    else:
        return {
            'ROM': None,
        }
# Apply the function to each entry in the 'Name' column
rom_df = phone_df['Name'].apply(extract_rom).apply(pd.Series)

# Display the resulting DataFrame
print(rom_df.head())

     ROM
0   64GB
1  128GB
2  128GB
3  128GB
4  128GB
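The two patterns above separate RAM from ROM by digit count alone, so a title with double-digit RAM would be misread: in a hypothetical title such as "12GB RAM + 256GB ROM", the RAM pattern finds nothing and the ROM pattern picks up "12GB". A possible alternative, sketched below under the assumption that titles label at least the RAM value, keys on the "RAM"/"ROM" words instead. The function `extract_memory` and its output keys are hypothetical.

```python
import re

# A GB value, optionally followed by a "RAM" or "ROM" label
GB_TOKEN = re.compile(r"(\d+)\s?GB(?:\s+(RAM|ROM))?", re.IGNORECASE)

def extract_memory(name):
    """Return RAM and ROM sizes in GB, using the labels when present."""
    ram = rom = None
    unlabeled = []
    for size, label in GB_TOKEN.findall(name):
        size = int(size)
        label = label.upper()
        if label == "RAM" and ram is None:
            ram = size
        elif label == "ROM" and rom is None:
            rom = size
        elif not label:
            unlabeled.append(size)
    # Assumption: an unlabeled GB value is storage when no ROM label is present
    if rom is None and unlabeled:
        rom = max(unlabeled)
    return {"RAM_GB": ram, "ROM_GB": rom}

print(extract_memory('XIAOMI Redmi A3, 6.71", 3GB RAM + 64GB'))
print(extract_memory('Tecno Spark 20, Android 13, 6.6", 128GB + 4GB RAM'))
print(extract_memory("12GB RAM + 256GB ROM"))  # hypothetical title
```

The result can be expanded into columns the same way as the other extractors, for example with `phone_df['Name'].apply(extract_memory).apply(pd.Series)`.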


In [61]:
# Define regex pattern to extract Color
color_pattern = r"""
    [,-]\s*                               # Matches a comma or hyphen followed by optional whitespace
    (?P<Color>([A-Z][a-z]+(?:\s[A-Z][a-z]+)*))  # Matches one or more capitalized words (color names)
    \s*(?:\+.*)?                          # Matches any additional text (like + Smart Watch & Buds)
    (?=\s*\(|$)                           # Lookahead for an open parenthesis or end of string
"""

# Compile the regex with verbose mode for readability
pattern_color = re.compile(color_pattern, re.VERBOSE)

# Function to extract Color
def extract_color(name):
    match = pattern_color.search(name)  # Search for the pattern
    if match:
        return {
            'Color': match.group('Color').strip(),  # Return the extracted color
        }
    return {
        'Color': None,
    }

# Apply the function to each entry in the 'Name' column
color_df = phone_df['Name'].apply(extract_color).apply(pd.Series)
# Display the resulting DataFrame
color_df.head()

Unnamed: 0,Color
0,Midnight Black
1,Gravity Black
2,Mystery White
3,Black
4,Starry Black


In [67]:
# Define regex pattern for extracting Warranty
warranty_pattern = r"(\d+)\s*(?:YR|WRTY)"

# Function to extract Warranty
def extract_warranty(name):
    # Find warranty
    warranty_match = re.search(warranty_pattern, name)
    return warranty_match.group(1) if warranty_match else None

# Apply the function to each entry in the 'Name' column
warranty_df = phone_df['Name'].apply(extract_warranty)

# Display the resulting DataFrame
warranty_df.head()

0       2
1       1
2       1
3    None
4    None
Name: Name, dtype: object

Now let's combine these extracted feature Series into a single DataFrame and concatenate it with the rest of the columns.
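A minimal sketch of this step using `pd.concat` is shown below. Small stand-in objects with abbreviated values from the outputs above replace the real ones, and the names `features` and `phone_features_df` are hypothetical. Note that `warranty_df` is a Series still named "Name", so it is renamed before joining.

```python
import pandas as pd

# Stand-ins for the objects built above (first two rows, abbreviated titles)
phone_df = pd.DataFrame({
    "Name": ['XIAOMI Redmi A3, 6.71", 3GB RAM + 64GB',
             'Tecno Spark 20, Android 13, 6.6", 128GB + 4GB'],
    "Price": ["KSh 11,000", "KSh 12,925"],
})
brand_df = pd.DataFrame({"brand": ["XIAOMI Redmi A3", "Tecno Spark 20"]})
size_df = pd.DataFrame({"screen_size": ['6.71"', '6.6"']})
ram_df = pd.DataFrame({"RAM": ["3GB", "4GB"]})
rom_df = pd.DataFrame({"ROM": ["64GB", "128GB"]})
color_df = pd.DataFrame({"Color": ["Midnight Black", "Gravity Black"]})
warranty_df = pd.Series(["2", "1"], name="Name")  # Series named "Name", as in the output above

# Join the extracted features side by side; all share the same row index
features = pd.concat(
    [brand_df, size_df, ram_df, rom_df, color_df, warranty_df.rename("Warranty")],
    axis=1,
)

# Attach the features to the original columns
phone_features_df = pd.concat([phone_df, features], axis=1)
print(phone_features_df.columns.tolist())
```

Concatenating along `axis=1` aligns rows by index, so this relies on none of the pieces having been filtered or re-indexed since extraction.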

## Modeling

## Evaluation

## Deployment

## Conclusions & Recommendations