#### Importing the required libraries

In [1]:
import gzip
import json
import pandas as pd

#### Opening the json.gz file and reading the dataset

In [2]:
with gzip.open('/content/users.json.gz', 'rb') as f:
    file_contents = f.read()

decoded_contents = file_contents.decode('utf-8')
data = []
decoder = json.JSONDecoder()

# Parse the concatenated JSON objects one at a time
while decoded_contents:
    obj, idx = decoder.raw_decode(decoded_contents)
    data.append(obj)
    decoded_contents = decoded_contents[idx:].lstrip()

# Build a DataFrame from the parsed records
users = pd.DataFrame(data)
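
The loop above copies the remaining string after every decoded object, which can get slow on large files. A minimal sketch of the same idea that tracks a position instead is shown here. The helper name `parse_concatenated_json` and the two sample records are hypothetical, and an in-memory buffer stands in for the real `.json.gz` file.

```python
import gzip
import io
import json


def parse_concatenated_json(text):
    """Decode a string that holds several JSON objects written back to back."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so step past it first
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the object and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


# Hypothetical sample data written to an in-memory gzip stream
sample = '{"role": "consumer", "state": "WI"}\n{"role": "fetch-staff"}\n'
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    f.write(sample)
buffer.seek(0)

# Opening in text mode ("rt") decodes UTF-8 while reading
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    records = parse_concatenated_json(f.read())
```

The resulting `records` list can be passed to `pd.DataFrame` in the same way as `data` above.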

In [3]:
users

Unnamed: 0,_id,active,createdDate,lastLogin,role,signUpSource,state
0,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
1,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
2,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
3,{'$oid': '5ff1e1eacfcf6c399c274ae6'},True,{'$date': 1609687530554},{'$date': 1609687530597},consumer,Email,WI
4,{'$oid': '5ff1e194b6a9d73a3a9f1052'},True,{'$date': 1609687444800},{'$date': 1609687537858},consumer,Email,WI
...,...,...,...,...,...,...,...
490,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
491,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
492,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,
493,{'$oid': '54943462e4b07e684157a532'},True,{'$date': 1418998882381},{'$date': 1614963143204},fetch-staff,,


In [4]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   _id           495 non-null    object
 1   active        495 non-null    bool  
 2   createdDate   495 non-null    object
 3   lastLogin     433 non-null    object
 4   role          495 non-null    object
 5   signUpSource  447 non-null    object
 6   state         439 non-null    object
dtypes: bool(1), object(6)
memory usage: 23.8+ KB


##### The output above shows missing values in the `lastLogin`, `signUpSource`, and `state` features. Let us dig deeper into them.

#### Checking for null values

In [5]:
users.isnull().sum()

_id              0
active           0
createdDate      0
lastLogin       62
role             0
signUpSource    48
state           56
dtype: int64

#### Checking the percentage of null values

In [6]:
(users.isnull().sum()/users.shape[0]) * 100

_id              0.000000
active           0.000000
createdDate      0.000000
lastLogin       12.525253
role             0.000000
signUpSource     9.696970
state           11.313131
dtype: float64
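
Because `isnull()` returns booleans, the same percentages can also be obtained by taking the column-wise mean. The sketch below uses a small hypothetical DataFrame (`toy`) rather than the real users data.

```python
import pandas as pd

# Hypothetical toy data: two of the four `state` values are missing
toy = pd.DataFrame({
    "role": ["consumer", "consumer", "fetch-staff", "consumer"],
    "state": ["WI", None, None, "AL"],
})

# The mean of a boolean column is the fraction of True values
null_pct = toy.isnull().mean() * 100

# Same arithmetic as the lastLogin figure above: 62 missing out of 495 rows
last_login_pct = 62 / 495 * 100
```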

#### Checking for duplicated values

##### To inspect the duplicate rows in the dataset, we first have to change the data type of a few features: `_id`, `createdDate`, and `lastLogin`. These features hold dictionaries, which are not hashable, so they must be converted to strings before `duplicated()` can compare rows.

In [7]:
dict_columns = ['_id', 'createdDate', 'lastLogin']  # Specify the columns that contain dictionaries
for column in dict_columns:
    users[column] = users[column].apply(json.dumps)
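
One side effect of `json.dumps` is that a missing value (NaN) becomes the string `NaN`, so it is no longer counted as null afterwards. An alternative sketch is to unwrap the `$oid` and `$date` keys into plain values instead. It assumes the `$date` numbers are Unix timestamps in milliseconds, which the source does not state, and the helper `unwrap` and the `toy` DataFrame are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows shaped like the users data; one lastLogin is missing
toy = pd.DataFrame({
    "_id": [{"$oid": "5ff1e194b6a9d73a3a9f1052"},
            {"$oid": "54943462e4b07e684157a532"}],
    "createdDate": [{"$date": 1609687444800}, {"$date": 1418998882381}],
    "lastLogin": [{"$date": 1609687537858}, None],
})


def unwrap(value, key):
    """Return value[key] for a dict, or None for anything else (e.g. NaN)."""
    return value.get(key) if isinstance(value, dict) else None


toy["_id"] = toy["_id"].apply(lambda v: unwrap(v, "$oid"))
for column in ["createdDate", "lastLogin"]:
    millis = toy[column].apply(lambda v: unwrap(v, "$date"))
    # Assumption: the values are epoch milliseconds
    toy[column] = pd.to_datetime(millis, unit="ms")
```

After this step the columns hold hashable scalars, so `duplicated()` works on them and missing logins stay missing (`NaT`).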

In [8]:
users.duplicated()

0      False
1       True
2       True
3      False
4       True
       ...  
490     True
491     True
492     True
493     True
494     True
Length: 495, dtype: bool

In [9]:
duplicates = users.duplicated()
duplicated_rows = users[duplicates]
duplicated_rows

Unnamed: 0,_id,active,createdDate,lastLogin,role,signUpSource,state
1,"{""$oid"": ""5ff1e194b6a9d73a3a9f1052""}",True,"{""$date"": 1609687444800}","{""$date"": 1609687537858}",consumer,Email,WI
2,"{""$oid"": ""5ff1e194b6a9d73a3a9f1052""}",True,"{""$date"": 1609687444800}","{""$date"": 1609687537858}",consumer,Email,WI
4,"{""$oid"": ""5ff1e194b6a9d73a3a9f1052""}",True,"{""$date"": 1609687444800}","{""$date"": 1609687537858}",consumer,Email,WI
5,"{""$oid"": ""5ff1e194b6a9d73a3a9f1052""}",True,"{""$date"": 1609687444800}","{""$date"": 1609687537858}",consumer,Email,WI
8,"{""$oid"": ""5ff1e194b6a9d73a3a9f1052""}",True,"{""$date"": 1609687444800}","{""$date"": 1609687537858}",consumer,Email,WI
...,...,...,...,...,...,...,...
490,"{""$oid"": ""54943462e4b07e684157a532""}",True,"{""$date"": 1418998882381}","{""$date"": 1614963143204}",fetch-staff,,
491,"{""$oid"": ""54943462e4b07e684157a532""}",True,"{""$date"": 1418998882381}","{""$date"": 1614963143204}",fetch-staff,,
492,"{""$oid"": ""54943462e4b07e684157a532""}",True,"{""$date"": 1418998882381}","{""$date"": 1614963143204}",fetch-staff,,
493,"{""$oid"": ""54943462e4b07e684157a532""}",True,"{""$date"": 1418998882381}","{""$date"": 1614963143204}",fetch-staff,,


##### The users dataset contains 283 duplicated rows.
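
The duplicate count can be read directly by summing the boolean mask, and `drop_duplicates()` keeps the first occurrence of each row. The sketch below illustrates both on a small hypothetical DataFrame (`toy`), not on the real users data.

```python
import pandas as pd

# Hypothetical toy data: row "a" appears three times, row "b" twice
toy = pd.DataFrame({
    "_id": ["a", "a", "b", "a", "b"],
    "role": ["consumer", "consumer", "fetch-staff", "consumer", "fetch-staff"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# Keep only the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
n_unique = len(deduplicated)
```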

## Data quality issues found in the dataset

1. A large share of the users dataset is duplicated. Of the 495 rows, 283 (about 57%) are duplicates, which leaves only 212 unique rows.

2. The users dataset also has missing values. The `lastLogin` feature has 62 missing values (about 12.5%), followed by `state` with 56 (about 11.3%) and `signUpSource` with 48 (about 9.7%).

