# **Reviews vs Reality.**

## **Business Understanding.**
### Background.
E-commerce is slowly but steadily climbing the ladder to become one of Kenya's economic backbones. Businesses are shifting from traditional trade, which meant setting up a shop and waiting for customers to walk in looking for what they need, to digital trade, where all a seller needs to do is open an account on a social media platform, post what they are selling and wait for notifications from interested buyers. No need for a physical shop and no pressure, just an internet connection and the comfort of home.

However, strong growth in an industry comes with stiff competition. Every seller wants to be the best, to sell the most and to earn the most, which pushes some sellers to the extreme of buying fake reviews.

### Problem Statement.
In Kenya, entrepreneurs, both young and old, are turning to platforms such as Jumia, Kilimall, Jiji and PigiaMe to sell products and make ends meet. For many, it is not only a business but a means of survival.

The rise of fake reviews, however, threatens honest businesspeople. Beyond reports of the practice, some companies have openly stated that they sell online reviews. This creates an unfair playing field, because the vendors buying reviews are often better funded than those who do not, and it denies struggling vendors a way to earn a living.

As a result of fake reviews:
- Good products go unseen. Ranking algorithms normally favour products with high ratings, so genuine sellers who rely on real customer feedback get buried in search results.

- Honest sellers lose customers, because products with few reviews are perceived as lower quality.

Online platforms need a way to detect mismatches between reviews and ratings, to surface truly trustworthy sellers and to protect buyers and sellers alike. This project seeks to help both buyers and sellers judge whether a product's reviews can be trusted.

### Project Objectives.
This project aims to:
- Use Natural Language Processing and sentiment analysis to spot suspicious products whose ratings do not match what people are really saying.
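
To make the objective concrete, here is a minimal sketch of a rating-versus-sentiment mismatch check. It is not the method used later in this project: the word lists `POSITIVE` and `NEGATIVE`, the helper names `sentiment_score` and `is_mismatch`, and the thresholds (4 stars and above counts as a high rating, 2 and below as a low one) are all assumptions made for illustration. A real analysis would likely replace the toy lexicon with an established sentiment model.

```python
import re

# Hypothetical mini-lexicon, for illustration only; a real project would
# use a trained model or an established sentiment lexicon instead.
POSITIVE = {"good", "great", "like", "love", "okay", "worth", "excellent"}
NEGATIVE = {"bad", "poor", "fake", "not", "disappointed", "worst", "broken"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def is_mismatch(rating: int, text: str) -> bool:
    """Flag a high rating with negative text, or a low rating with positive text."""
    score = sentiment_score(text)
    return (rating >= 4 and score < 0) or (rating <= 2 and score > 0)


# (rating, review_text) pairs: the first two are taken from the scraped
# Jumia reviews, the third is made up to show a mismatch.
samples = [
    (1, "The material is bad.not what I expected"),
    (5, "The quality is good. It's worth the price"),
    (5, "Bad quality, not what I ordered"),
]
flags = [is_mismatch(rating, text) for rating, text in samples]
print(flags)  # [False, False, True]
```

Counting words this way is deliberately crude (it treats "not bad" as negative, for example), but it shows the shape of the check: score the text, compare it with the star rating, and flag disagreements for a closer look.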

### Success criteria.
When fake visibility wins over genuine value, everyone is affected. This project will be successful if:

- Sellers get equal visibility based on real customer feedback.

- Customers are protected against scams.

- Companies can protect their customers from falling victim to fake reviews.

### Stakeholders.
- Customers: The individuals or businesses purchasing goods or services through the platform.

- Sellers/Merchants: Businesses or individuals who list and sell products or services on the platform. 

- Platform Providers: The company or organization that owns and operates the online commerce platform.

- Regulatory Bodies: Government agencies and other organizations that set rules and standards for online commerce, including consumer protection.

- Investors: Individuals or organizations that have provided funding for the platform. 

- Government and International Organizations: These entities are stakeholders due to their role in setting and enforcing regulations related to online commerce. 

### Project plan.

## **Data Understanding.**
### Data Source.
The data used in this project was collected from publicly accessible product pages on `Jumia Kenya` using web scraping techniques. Reviews were gathered from selected products in three categories: fashion, appliances, and other electronics.

The scraping was performed using Python libraries such as requests and BeautifulSoup as evident in the `Scraper` folder, and all information collected is visible to any user visiting the site. No login or bypassing of protections was required.

This data was collected strictly for educational and research purposes, with the intent of exploring the relationship between product star ratings and actual customer sentiment. It is not affiliated with, endorsed by, or intended to defame Jumia or any of its sellers. It simply aims to uncover insights from publicly available information and promote transparency in digital marketplaces.

The dataset is under the file path `Data/`.

### Why is the data suitable for this project?
Jumia Kenya is one of the most popular e-commerce platforms in Kenya. Data from the platform provides not only a larger pool of sellers but also a richer variety of reviews, which allows for a more comprehensive understanding of consumer behavior.

- The largest e-commerce companies have millions or even billions of transactions, providing a vast dataset to analyze. This allows for more robust statistical analysis and reduces the chance of drawing inaccurate conclusions based on limited data.

- Large companies also serve a customer base that is diverse in demographics, geographic location and purchasing habits. This diversity helps in identifying patterns that might be missed in smaller datasets focused on specific niches.

### Exploring the dataset for understanding.
In this section we will be carrying out both qualitative and quantitative analysis to understand the structure of the dataset as well as identify areas that would impact our analysis if left unchecked or simply not fixed.

#### Import dependencies and loading the dataset.
In this section we import the dependencies used throughout the project for data cleaning, exploratory data analysis, NLP and so on. We also load the scraped dataset.

We load the dataset using the pandas read_csv() function.


In [15]:
# Import dependencies.
import pandas as pd
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset.
reviews = pd.read_csv('../Data/Raw/jumia_reviews.csv')

#### Qualitative Analysis.
Before our analysis, we explore the data to understand its structure and contents. We also check that the content is of good enough quality for analysis.

We will use the .head() method to access the first 5 rows of the data. This will help us understand what columns we have and what type of values they contain.

In [16]:
# Dataset preview.
reviews.head()

| | product_name | category | review_title | review_text | rating | review_date | verified |
|---|---|---|---|---|---|---|---|
| 0 | Berrykey Hawaiian Shirt | fashion | big size not cotton | Not cotton | 1 | 19-06-2025 | Verified Purchase |
| 1 | Berrykey Hawaiian Shirt | fashion | Not satisfied | The material is bad.not what I expected | 1 | 13-06-2025 | Verified Purchase |
| 2 | Berrykey Hawaiian Shirt | fashion | I like it | It is okay | 5 | 12-05-2025 | Verified Purchase |
| 3 | Berrykey Hawaiian Shirt | fashion | I like it | The quality is good. It's worth the price | 5 | 22-04-2025 | Verified Purchase |
| 4 | Berrykey Hawaiian Shirt | fashion | good | Good | 5 | 27-01-2025 | Verified Purchase |


From the output above, the data has 7 columns and contains product review data. To further explore the dataset's structure, we will use the pandas .info() method. This method gives a concise summary of the data: the number of rows and columns, the number of non-null values, and each column's data type.

In [17]:
# Get dataset's qualitative summary
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362 entries, 0 to 361
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   product_name  362 non-null    object
 1   category      362 non-null    object
 2   review_title  362 non-null    object
 3   review_text   362 non-null    object
 4   rating        362 non-null    int64 
 5   review_date   362 non-null    object
 6   verified      362 non-null    object
dtypes: int64(1), object(6)
memory usage: 19.9+ KB


The dataset contains data stored in 362 rows in columns named product_name, category, review_title, review_text, rating, review_date and verified. Out of the 7 columns, only 1 is stored as an integer and the rest are stored as objects. Additionally, the data seems to have no null values.

Next, we confirm that the data has no null values and check for duplicate records, using the pandas .isnull() and .duplicated() methods. This is a crucial step, as duplicates and null values reduce the quality, accuracy and reliability of the data and lead to inaccurate results in the analysis.

In [18]:
# Check for null values
reviews.isnull().sum()

product_name    0
category        0
review_title    0
review_text     0
rating          0
review_date     0
verified        0
dtype: int64

In [19]:
# Check for duplicated records.
print(f'The reviews dataset contains {reviews.duplicated().sum()} duplicated record.')

The reviews dataset contains 1 duplicated record.


The dataset contains no null values but has 1 duplicated record.
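
Before removing a duplicate it can help to look at it. The sketch below uses a small made-up DataFrame (`toy`, with invented rows and a subset of the real column names) rather than the scraped data, and shows one way to display every member of a duplicate group and then drop the repeats.

```python
import pandas as pd

# Toy stand-in for the scraped reviews (made-up rows, a subset of the columns).
toy = pd.DataFrame({
    "product_name": ["Shirt A", "Shirt A", "Shirt A", "Kettle B"],
    "review_text": ["Good", "Good", "Not cotton", "Works well"],
    "rating": [5, 5, 1, 4],
})

# keep=False marks every member of a duplicate group, not only the repeats,
# so the original row and its copy can be inspected side by side.
duplicate_rows = toy[toy.duplicated(keep=False)]
print(duplicate_rows)

# drop_duplicates() keeps the first occurrence of each group by default.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduplicated))  # 4 -> 3
```

Inspecting the rows first matters for review data: two different customers may plausibly leave the same short review on the same day, so whether an identical row is a scraping artifact or a genuine second review is a judgment call.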

#### Quantitative Analysis.
Here we get the dataset's statistical summary using the .describe() method. For numeric columns, this method generates descriptive statistics: the count, a measure of central tendency (the mean), percentiles (including the median as the 50th percentile), a measure of spread (the standard deviation), and the minimum and maximum.

In [20]:
# Get dataset's statistical summary.
reviews.describe()

| | rating |
|---|---|
| count | 362.0 |
| mean | 4.20442 |
| std | 1.205795 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 5.0 |


Most ratings in the dataset are positive: the mean rating is 4.20, the 25th percentile is 4.0 and the median is 5.0, so at least three quarters of the reviews give 4 or 5 stars and at least half give 5 stars.
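
The summary statistics hint at the skew, and a per-star breakdown shows it directly. The sketch below uses made-up ratings (the `ratings` Series is an assumption standing in for the notebook's rating column) to show how the share of each star level could be computed.

```python
import pandas as pd

# Made-up ratings for illustration; in the notebook this would be reviews["rating"].
ratings = pd.Series([5, 5, 5, 4, 5, 1, 4, 5, 2, 5], name="rating")

# Share of reviews at each star level, ordered from lowest to highest rating.
share = ratings.value_counts(normalize=True).sort_index()
print(share)

# Fraction of reviews rated 4 stars or higher.
positive_share = (ratings >= 4).mean()
print(f"{positive_share:.0%} of the toy ratings are 4 or 5 stars")
```

A distribution this lopsided is worth keeping in mind later: with few low ratings, any model or rule that compares sentiment with stars will see far more positive examples than negative ones.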

#### Data Quality Issues.
Data quality issues are characterized by inaccuracies, incompleteness and inconsistencies. If left unresolved, they can severely impact our analysis, leading to flawed conclusions and poor decision-making, among other problems.

Our data understanding section has helped us identify some of the issues in the dataset, including:
- Inaccuracies - the review_date column is stored as an object even though it holds dates.

- Duplicates - we found one duplicated record in the dataset.
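
As a preview of how the date issue might be handled, the sketch below converts day-month-year strings to a datetime column. The sample values are copied from the dataset preview; the small `toy` DataFrame and the choice of an explicit `format` string are assumptions for illustration, not necessarily what the cleaning section does.

```python
import pandas as pd

# Sample values copied from the dataset preview.
toy = pd.DataFrame({
    "review_date": ["19-06-2025", "13-06-2025", "12-05-2025"],
    "rating": [1, 1, 5],
})

# An explicit format avoids day/month ambiguity:
# 12-05-2025 is parsed as 12 May, not 5 December.
toy["review_date"] = pd.to_datetime(toy["review_date"], format="%d-%m-%Y")

print(toy.dtypes)
print(toy["review_date"].dt.month.tolist())  # [6, 6, 5]
```

Once the column is a true datetime, reviews can be sorted chronologically or grouped by month, which is not reliable while the dates are plain strings.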

Addressing these problems is crucial for maintaining data integrity and ensuring reliable results across all business operations. In the next section, we will be cleaning the data and preparing it for analysis.

## **Data Preparation.**
