#### Importing the required libraries

In [1]:
import gzip
import json
import pandas as pd

#### Opening the json.gz file and reading the dataset

In [2]:
with gzip.open('/content/brands.json.gz', 'rb') as f:
    file_contents = f.read()

decoded_contents = file_contents.decode('utf-8')
data = []
decoder = json.JSONDecoder()

# Parse each JSON object in the file
while decoded_contents:
    obj, idx = decoder.raw_decode(decoded_contents)
    data.append(obj)
    decoded_contents = decoded_contents[idx:].lstrip()

# Create a DataFrame from the parsed JSON objects
brands = pd.DataFrame(data)
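
If the file happens to be newline-delimited JSON (one object per line), which is an assumption about `brands.json.gz` and not something the cell above confirms, the same parse could be sketched by streaming lines instead of repeatedly re-slicing the decoded string. The helper name `read_json_lines_gz` and the small demo file are hypothetical and exist only for illustration.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_json_lines_gz(path):
    """Read a gzip file holding one JSON object per line into a DataFrame."""
    records = []
    # 'rt' opens the gzip stream in text mode, so lines arrive already decoded
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)


# Tiny self-made file (not the real dataset), used only to exercise the helper
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write('{"name": "A", "topBrand": false}\n')
    f.write('{"name": "B"}\n')

demo = read_json_lines_gz(demo_path)
```

Keys missing from an object simply become missing values in the resulting DataFrame. If the objects are not separated by newlines, the `raw_decode` loop above remains the safer choice.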

In [3]:
brands

Unnamed: 0,_id,barcode,category,categoryCode,cpg,name,topBrand,brandCode
0,{'$oid': '601ac115be37ce2ead437551'},511111019862,Baking,BAKING,"{'$id': {'$oid': '601ac114be37ce2ead437550'}, ...",test brand @1612366101024,False,
1,{'$oid': '601c5460be37ce2ead43755f'},511111519928,Beverages,BEVERAGES,"{'$id': {'$oid': '5332f5fbe4b03c9a25efd0ba'}, ...",Starbucks,False,STARBUCKS
2,{'$oid': '601ac142be37ce2ead43755d'},511111819905,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, ...",test brand @1612366146176,False,TEST BRANDCODE @1612366146176
3,{'$oid': '601ac142be37ce2ead43755a'},511111519874,Baking,BAKING,"{'$id': {'$oid': '601ac142be37ce2ead437559'}, ...",test brand @1612366146051,False,TEST BRANDCODE @1612366146051
4,{'$oid': '601ac142be37ce2ead43755e'},511111319917,Candy & Sweets,CANDY_AND_SWEETS,"{'$id': {'$oid': '5332fa12e4b03c9a25efd1e7'}, ...",test brand @1612366146827,False,TEST BRANDCODE @1612366146827
...,...,...,...,...,...,...,...,...
1162,{'$oid': '5f77274dbe37ce6b592e90c0'},511111116752,Baking,BAKING,"{'$ref': 'Cogs', '$id': {'$oid': '5f77274dbe37...",test brand @1601644365844,,
1163,{'$oid': '5dc1fca91dda2c0ad7da64ae'},511111706328,Breakfast & Cereal,,"{'$ref': 'Cogs', '$id': {'$oid': '53e10d6368ab...",Dippin Dots® Cereal,,DIPPIN DOTS CEREAL
1164,{'$oid': '5f494c6e04db711dd8fe87e7'},511111416173,Candy & Sweets,CANDY_AND_SWEETS,"{'$ref': 'Cogs', '$id': {'$oid': '5332fa12e4b0...",test brand @1598639215217,,TEST BRANDCODE @1598639215217
1165,{'$oid': '5a021611e4b00efe02b02a57'},511111400608,Grocery,,"{'$ref': 'Cogs', '$id': {'$oid': '5332f5f6e4b0...",LIPTON TEA Leaves,False,LIPTON TEA Leaves


In [4]:
brands.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167 entries, 0 to 1166
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   _id           1167 non-null   object
 1   barcode       1167 non-null   object
 2   category      1012 non-null   object
 3   categoryCode  517 non-null    object
 4   cpg           1167 non-null   object
 5   name          1167 non-null   object
 6   topBrand      555 non-null    object
 7   brandCode     933 non-null    object
dtypes: object(8)
memory usage: 73.1+ KB


##### The output above shows many missing values in the `category`, `categoryCode`, `topBrand`, and `brandCode` features. Let us dig deeper into them.

#### Checking for null values

In [5]:
brands.isnull().sum()

_id               0
barcode           0
category        155
categoryCode    650
cpg               0
name              0
topBrand        612
brandCode       234
dtype: int64

#### Checking the percentage of null values

In [6]:
(brands.isnull().sum()/brands.shape[0]) * 100

_id              0.000000
barcode          0.000000
category        13.281919
categoryCode    55.698372
cpg              0.000000
name             0.000000
topBrand        52.442159
brandCode       20.051414
dtype: float64
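
The counts and percentages from the two cells above can also be viewed side by side. The following is a minimal sketch, not part of the original analysis; the helper name `null_summary` and the toy DataFrame are hypothetical, and the column names in the toy data merely echo those in `brands`.

```python
import pandas as pd


def null_summary(df):
    """Return null counts and null percentages per column, worst first."""
    summary = pd.DataFrame({
        "null_count": df.isnull().sum(),
        # mean() of a boolean mask is the fraction of True values
        "null_pct": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("null_pct", ascending=False)


# Toy data standing in for the real dataset
demo = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "categoryCode": ["X", None, None, None],
    "topBrand": [True, None, False, None],
})
summary = null_summary(demo)
```

Calling `null_summary(brands)` would, under the same assumptions, put the most incomplete columns at the top of the table.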

#### Checking for duplicated values

##### To inspect duplicate rows in the dataset, we first have to change the data type of a few features, such as `_id` and `cpg`. These features hold dictionaries, which should be converted to strings.

In [7]:
dict_columns = ['_id', 'cpg']  # Specify the columns that contain dictionaries
for column in dict_columns:
    brands[column] = brands[column].apply(json.dumps)
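
The conversion matters because Python dictionaries are unhashable, so `duplicated()` cannot compare them directly. One subtlety is that two dictionaries with the same content but a different key order serialize to different strings. The sketch below is a variant of the cell above that passes `sort_keys=True` to `json.dumps` and works on a copy; the helper name `stringify_dict_columns` and the toy rows are hypothetical.

```python
import json

import pandas as pd


def stringify_dict_columns(df, columns):
    """Return a copy of df with the given dict columns serialized to JSON strings."""
    out = df.copy()
    for column in columns:
        # sort_keys=True makes the string independent of key order
        out[column] = out[column].apply(lambda v: json.dumps(v, sort_keys=True))
    return out


# Toy rows: the first two are equal except for the key order inside 'cpg'
demo = pd.DataFrame({
    "_id": [{"$oid": "a1"}, {"$oid": "a1"}, {"$oid": "b2"}],
    "cpg": [
        {"$ref": "Cogs", "$id": "x"},
        {"$id": "x", "$ref": "Cogs"},
        {"$id": "y"},
    ],
    "name": ["A", "A", "B"],
})

flat = stringify_dict_columns(demo, ["_id", "cpg"])
dup_mask = flat.duplicated()
```

Working on a copy leaves the original dictionaries intact in case they are needed later, for example to extract the `$oid` values.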

In [8]:
brands.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1162    False
1163    False
1164    False
1165    False
1166    False
Length: 1167, dtype: bool

In [9]:
duplicates = brands.duplicated()
duplicated_rows = brands[duplicates]
duplicated_rows

Unnamed: 0,_id,barcode,category,categoryCode,cpg,name,topBrand,brandCode


## Data quality issues found in the dataset are:

1. The brands dataset contains missing values. The `categoryCode` and `topBrand` features have the most, at about 56% and 52% respectively. The `category` and `brandCode` features also have missing values, at about 13% and 20% respectively.

2. There are no duplicated rows in the brands dataset.