# Exploring the Jumia Smart TV Catalogue Dataset 

## by (Olatunji Jola)

## Introduction
> The Jumia smart TV dataset contains 1,615 smart TV listings scraped from the Jumia website, with 7 variables per listing. The data was scraped with Scrapy, saved as a JSON file named `jumia_tv_catalogue.json`, and given descriptive column names. In this notebook, I clean the dataset and then analyse it for patterns. The aim is market research on the best smart TVs in different categories (size, features, ratings, and pricing) and on which sellers on Jumia have the best offerings in terms of quality, ratings, and pricing.

In [89]:
# import all packages and set plots to be embedded inline 
import pandas as pd
from pandas import json_normalize
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sb
import re

%matplotlib inline


In [25]:
# load the scraped JSON file
with open('jumia/jumia_tv_catalogue.json', 'r') as file:
    smart_tv_catalogue = json.load(file)

In [29]:
# loading the json file into a dataframe
pd_tv_catalogue = json_normalize(smart_tv_catalogue)

### Preliminary Wrangling 

In this section, I make observations about the dataset to identify what needs cleaning.

In [30]:
pd_tv_catalogue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1615 entries, 0 to 1614
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           1615 non-null   object
 1   brand          1609 non-null   object
 2   price          1609 non-null   object
 3   ratings        1615 non-null   object
 4   specification  1615 non-null   object
 5   Seller_name    1615 non-null   object
 6   seller_rating  1387 non-null   object
dtypes: object(7)
memory usage: 88.4+ KB
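
The non-null counts above imply 6 missing values each in `brand` and `price` (1615 - 1609) and 228 in `seller_rating` (1615 - 1387). A minimal sketch of how such counts can be read off directly, using a small hypothetical frame (`sample` and its values are made up for illustration, not taken from the catalogue):

```python
import pandas as pd

# Hypothetical three-row stand-in for the scraped catalogue
sample = pd.DataFrame({
    "name": ["TV A", "TV B", "TV C"],
    "brand": ["LG", None, "Samsung"],
    "seller_rating": ["100%", None, None],
})

# Count missing values per column
missing = sample.isna().sum()
print(missing)
```

Running the same `isna().sum()` call on `pd_tv_catalogue` should give the per-column gaps without subtracting by hand.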


In [39]:
pd_tv_catalogue.sample(20)

Unnamed: 0,name,brand,price,ratings,specification,Seller_name,seller_rating
1296,"High Teck 32"" INCHES SMART FULL HD LED TV WITH...",High Teck,"₦ 97,799",0 out of 5,"[32"" SMART FULL HD LED, Browser: Yes, Wifi: Ye...",Generator Store,60%
681,Samsung 65 Inch Ultra Slim Premium UHD Class H...,Samsung,"₦ 820,000",0 out of 5,[],Bryan's store,100%
1550,Felicity Solar 3KVA 24V Hybrid Inverter With I...,Felicity Solar,"₦ 330,000",0 out of 5,"[Pure Sine Wave Inverter , 100A built in MPPT ...",360degree,100%
384,LG 65'' 4K OLED Smart AI ThinQ Built In Satell...,LG,"₦ 2,410,000",0 out of 5,"[65 inch TV, SELF LIT OLED TV, OLED Display, W...",Donhenries,100%
228,"UFC 55"" INCHES TV SMART Full HD LED 4K Television",UFC,"₦ 196,000",0 out of 5,"[UFC Smart TV ., RAM 1.5G 4K HDR, 55 INCHES{4K...",1 billion store,94%
834,"Infinity 50"" INCH SMART Full HD LED 4K SCREEN ...",Infinity,"₦ 183,900",0 out of 5,"[Operating System: Android, WiFi: Yes, PlaySto...",CALCULUX RESOURCES,100%
280,TCL 55 Inch Ultra Slim Smart UHD Android 4K TV,TCL,"₦ 400,000",0 out of 5,"[55 inch LED UHD Android TV, OS Android P UI S...",Bryan's store,100%
1052,Samsung 65 Inch AU8000 Crystal UHD 4K Smart TV...,Samsung,"₦ 820,000",0 out of 5,"[65 inch 4K UHD 3840 x 2160 LED Panel, HDR10, ...",Kaylas Mart Electronics,
452,Samsung 65 Inch UHD Certified LED Crystal 2021...,Samsung,"₦ 820,000",0 out of 5,"[65 inch 4K UHD 3840 x 2160 LED Panel, HDR10, ...",Smart Center,98%
1089,"Samsung 55''Smart UHD 4K TV-Netflix,Youtube,Ap...",Samsung,"₦ 429,900",0 out of 5,[D],Jamiu Lagos,100%


#### How many unique brands are in the dataset?

In [50]:
# Checking for unique brands

pd_tv_catalogue['brand'].unique()

array(['LG', 'Amani', 'Hisense', 'Polystar', 'Infinity', 'Energy', 'TCL',
       'Vision', 'WEYON', 'Rock', 'Transparent', 'UFC', 'Samsonic',
       'Sonix', 'Syinix', 'Infinix', 'Amaz', 'Samsung', 'Sony', 'MK',
       'BUC', 'Maxi', 'Konka', 'Felicity Solar', 'itel', 'High Teck',
       'Bruhm', 'Vitek', None, 'High', 'Google', 'Skyworth', 'XTRAPOWER',
       'Royal', 'Famicare', 'Dexter', 'Sanyo', 'Panasonic', 'Jvc',
       'Delta', 'Mercedes Amg', 'XIAOMI', 'Rilsopower', 'Luminous',
       'Nexus'], dtype=object)

##### Observation
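The array has 45 entries, but one of them is `None` (listings with a missing brand), which leaves 44 named brands. Felicity Solar is not a TV brand, so the dataset includes some listings that are not TVs. `High` also appears alongside `High Teck` and may be a truncated form of the same brand.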
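
The brand count depends on whether missing values are included. A small sketch of the difference, assuming a hypothetical `brands` series rather than the real column:

```python
import pandas as pd

# Hypothetical brand column with one missing value and one repeat
brands = pd.Series(["LG", "High Teck", None, "High", "LG"])

# unique() keeps None as its own entry
n_with_missing = len(brands.unique())

# nunique() drops missing values by default
n_named = brands.nunique()

print(n_with_missing, n_named)
```

On the catalogue itself, `pd_tv_catalogue['brand'].nunique()` would be the equivalent call for counting named brands only.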

#### Check for Duplicates

In [60]:
# columns to check for duplicated values
column_names =  ['name', 'brand', 'price', 'ratings', 'Seller_name', 'seller_rating']

# checking for duplicates
duplicates = pd_tv_catalogue.duplicated(subset = column_names, keep = False)

pd_tv_catalogue[duplicates].sort_values(by = 'name')

Unnamed: 0,name,brand,price,ratings,specification,Seller_name,seller_rating
626,"Amani 32"" INCHES SMART FULL HD LED TV WITH 1 Y...",Amani,"₦ 82,500",0 out of 5,"[32"" SMART FULL HD LED, Browser: Yes, Wifi: Ye...",JennyLink Store,60%
220,"Amani 32"" INCHES SMART FULL HD LED TV WITH 1 Y...",Amani,"₦ 82,500",0 out of 5,"[32"" SMART FULL HD LED, Browser: Yes, Wifi: Ye...",JennyLink Store,60%
629,Amaz Real 32''inch Dual Glass TV +18 Months Wa...,Amaz,"₦ 70,080",0 out of 5,[],AMAZ official shop,60%
738,Amaz Real 32''inch Dual Glass TV +18 Months Wa...,Amaz,"₦ 70,080",0 out of 5,[],AMAZ official shop,60%
1373,"Hisense 43” Smart Frameless TV+Netflix,Youtube...",Hisense,"₦ 155,000",0 out of 5,"[Screen Size: 43″ Smart tv, Screen Type: LED B...",Kris Global Electronics,50%
1547,"Hisense 43” Smart Frameless TV+Netflix,Youtube...",Hisense,"₦ 155,000",0 out of 5,"[Screen Size: 43″ Smart tv, Screen Type: LED B...",Kris Global Electronics,50%
752,"Hisense 55"" ULED 4K Smart TV With Free Wall Br...",Hisense,"₦ 435,500",0 out of 5,[],JennyLink Store,60%
914,"Hisense 55"" ULED 4K Smart TV With Free Wall Br...",Hisense,"₦ 435,500",0 out of 5,[],JennyLink Store,60%
1330,Hisense 55-inch Android 4K Smart ULED Premium ...,Hisense,"₦ 435,000",0 out of 5,[],JennyLink Store,60%
322,Hisense 55-inch Android 4K Smart ULED Premium ...,Hisense,"₦ 435,000",0 out of 5,[],JennyLink Store,60%


In [64]:
# number of duplicated rows, not counting the first occurrence of each
pd_tv_catalogue.duplicated(subset = column_names, keep = 'first').sum()

30

##### Observation

30 rows repeat an earlier row on the six compared columns (`specification` is not part of the comparison).
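
The two `duplicated` calls above differ only in `keep`. A minimal sketch on a hypothetical frame (`listings` is made up) shows why the first call displays every copy while the second counts only the extra ones:

```python
import pandas as pd

# Hypothetical listings: "TV A" appears three times with the same price
listings = pd.DataFrame({
    "name": ["TV A", "TV A", "TV A", "TV B"],
    "price": ["₦ 82,500", "₦ 82,500", "₦ 82,500", "₦ 70,080"],
    "specification": [["32 inch"], ["32 inch"], [], ["43 inch"]],
})

# Compare on scalar columns only; the list-valued column is left out
subset = ["name", "price"]

# keep=False flags every member of a duplicated group
all_copies = listings.duplicated(subset=subset, keep=False)

# keep='first' flags only the repeats after the first occurrence
repeats = listings.duplicated(subset=subset, keep="first")

# Dropping duplicates removes exactly the rows flagged by keep='first'
deduped = listings.drop_duplicates(subset=subset)

print(all_copies.sum(), repeats.sum(), len(deduped))
```

The `specification` column holds Python lists, which are not hashable, so it is presumably excluded from `column_names` for that reason; that is an assumption about the intent, not something the notebook states.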

### Cleaning

In this section the observed issues with the dataset will be addressed.

The following cleaning operations will be carried out.

- Drop duplicated rows
- Remove entries that are not TVs
- Shorten the `name` column
- Convert the `Seller_name` column to lowercase
- Make the `price` column numeric and state the currency in the column title
- Make the `seller_rating` column numeric and state the unit of measurement in the column title
- Make the `ratings` column numeric
- Address missing values
- Expand the `specification` column, focusing on key metrics, e.g. ports, screen_resolution, screen_types, and screen_size
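
As a rough illustration of the numeric conversions in this list, the sketch below works on a hypothetical frame whose values follow the formats seen in the sample output (`₦ 97,799`, `0 out of 5`, `60%`). The output column names `price_ngn`, `rating_out_of_5`, `seller_rating_pct`, and `seller_name` are assumptions for illustration, not names used later in this notebook.

```python
import pandas as pd

# Hypothetical rows in the same string formats as the scraped data
raw = pd.DataFrame({
    "price": ["₦ 97,799", "₦ 2,410,000", None],
    "ratings": ["0 out of 5", "4.5 out of 5", "3 out of 5"],
    "seller_rating": ["60%", None, "100%"],
    "Seller_name": ["Generator Store", "Donhenries", "Bryan's store"],
})

clean = pd.DataFrame()

# Strip everything except digits and the decimal point, then convert
clean["price_ngn"] = pd.to_numeric(
    raw["price"].str.replace(r"[^\d.]", "", regex=True), errors="coerce"
)

# Take the number that precedes "out of"
clean["rating_out_of_5"] = pd.to_numeric(
    raw["ratings"].str.extract(r"([\d.]+)\s+out of", expand=False),
    errors="coerce",
)

# Remove the trailing percent sign
clean["seller_rating_pct"] = pd.to_numeric(
    raw["seller_rating"].str.rstrip("%"), errors="coerce"
)

# Lowercase seller names
clean["seller_name"] = raw["Seller_name"].str.lower()

print(clean)
```

Using `errors="coerce"` turns anything that cannot be parsed into `NaN`, so missing prices and seller ratings stay missing rather than raising an error.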



#### Drop duplicates

In [75]:
#drop duplicate rows
pd_tv_catalogue.drop_duplicates(subset = column_names, inplace= True)

#### Test

In [77]:
pd_tv_catalogue.duplicated(subset = column_names).sum()

0

#### Drop Entries that are not TVs

Most entries that are TVs contain at least one of the following keywords in their names, in any letter case:

- TV, inch or inches, the inch mark (`"`), UHD, 4K, Full HD, Television, or Smart
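
To see how a keyword filter of this kind behaves, here is a small sketch on three hypothetical product names (`names` is made up; the pattern mirrors the keyword list above):

```python
import re
import pandas as pd

# Hypothetical product names: two TVs and one non-TV item
names = pd.Series([
    'Brand X 43" Smart Frameless TV',
    "Brand Y 55 Inch UHD Android",
    "Brand Z 3KVA Hybrid Inverter",
])

# One alternative per keyword; matching ignores letter case
pattern = r'tv|inch|"|uhd|4k|full hd|television|smart'
is_tv = names.str.contains(pattern, flags=re.IGNORECASE, regex=True)

# Keep only names that matched at least one keyword
tv_only = names[is_tv]
print(tv_only.tolist())
```

Because `str.contains` matches substrings, a keyword such as `tv` or `smart` also matches inside longer words, so a non-TV listing that mentions one of them would pass the filter. A manual check of the remaining rows is still worthwhile.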

In [97]:
# Keep only rows whose name matches at least one TV keyword
pattern = r'TV|inch|"|uhd|4k|full hd|television|smart'
is_tv = pd_tv_catalogue.name.str.contains(pattern, flags=re.IGNORECASE, regex=True)
pd_tv_catalogue.drop(pd_tv_catalogue[~is_tv].index, inplace=True)

#### Test

In [102]:
# Recompute the mask on the filtered frame; the result should be empty
pd_tv_catalogue[~pd_tv_catalogue.name.str.contains(pattern, flags=re.IGNORECASE, regex=True)]


Unnamed: 0,name,brand,price,ratings,specification,Seller_name,seller_rating
