# Raw Data EDA

Goal: Validate the quality and structure of the data in the raw JSON files.

Datasets:
- customers.json
- products.json
- countries.json
- sales.json
- orders.json

Design Notes:
- The following helper functions have been created:
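  - load_variable_json - loads JSON files of varying layout (well-formed JSON, JSON Lines, or comma-separated objects without enclosing brackets) into a DataFrame.
  - is_json_line - tests whether a non-empty line starts with "{" or "[", i.e. looks like a JSON object or array.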



## Imports and Context Setting

### Imports

In [0]:
import json
import pandas as pd

### Context Setting

In [0]:


pd.set_option("display.max_rows", 10)
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)


## Helper Functions


### load_variable_json

In [0]:
def load_variable_json(raw_data_path:str, verbose:bool = True) -> pd.DataFrame:
    '''Load a JSON file that may be well-formed JSON, JSON Lines, or comma-separated objects.'''
    with open(raw_data_path, 'r') as f:
        raw = f.read().strip()

    # Try properly formed JSON
    try:
        parsed = json.loads(raw)
        # A top-level list is already a list of records; wrap anything else
        # (a single object or a scalar) so DataFrame receives a list.
        data = parsed if isinstance(parsed, list) else [parsed]
        if verbose:
            print(f"[load_variable_json] parsed as properly formed JSON from: {raw_data_path}")
        return pd.DataFrame(data)
    except json.JSONDecodeError:
        pass

    # Try JSON lines format
    lines = [ln.strip() for ln in raw.splitlines() if is_json_line(ln)]
    json_line_objects = []
    json_line_objects_ok = True
    for line in lines:
        try:
            obj = json.loads(line)
            json_line_objects.append(obj)
        except json.JSONDecodeError:
            json_line_objects_ok = False
            break
    if json_line_objects_ok and json_line_objects:
        if verbose:
            print(f"[load_variable_json] parsed as JSONL (one JSON object per line) from: {raw_data_path}")
        return pd.DataFrame(json_line_objects)
    
    # Fall back to improperly formed JSON: comma-separated objects without enclosing brackets
    cleaned = raw

    # Strip trailing commas at end of file
    while cleaned.endswith(","):
        cleaned = cleaned[:-1].rstrip()

    cleaned = "[" + cleaned + "]"
    data = json.loads(cleaned)

    if verbose:
        print(f"[load_variable_json] Parsed as 'concatenated JSON objects with commas' from: {raw_data_path}")

    return pd.DataFrame(data)

### is_json_line

In [0]:
def is_json_line(line: str) -> bool:
    '''Return True if a non-empty line starts with "{" or "[" (a JSON object or array).'''
    line = line.strip()
    if not line:
        return False
    return line.startswith("{") or line.startswith("[")
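
To make the three layouts concrete, here is a small standard-library sketch. The sample strings are invented for illustration and are not taken from the actual data files; it only shows why a single json.loads call succeeds on the first layout and why the other two need the fallbacks used in load_variable_json.

```python
import json

# Hypothetical samples of the three layouts (not real file contents).
proper_json = '[{"Id": 1}, {"Id": 2}]'
json_lines = '{"Id": 1}\n{"Id": 2}'
comma_separated = '{"Id": 1},\n{"Id": 2},'


def parses_as_single_document(text: str) -> bool:
    """Return True when json.loads accepts the whole text as one document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Only the first layout is a valid JSON document on its own.
single_doc_results = [
    parses_as_single_document(text)
    for text in (proper_json, json_lines, comma_separated)
]

# JSON Lines: every non-empty line is parsed as its own document.
jsonl_records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

# Comma-separated objects: drop the trailing comma, then wrap in brackets.
wrapped_records = json.loads("[" + comma_separated.strip().rstrip(",") + "]")

print(single_doc_results)
print(jsonl_records == wrapped_records)
```

Both fallbacks recover the same list of records, which is why the loader can return a DataFrame regardless of which layout a file uses.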


## Load Datasets



In [0]:
raw_file_path = "../data/"

customers_df = load_variable_json(raw_file_path + "customers.json")
products_df = load_variable_json(raw_file_path + "products.json")
countries_df = load_variable_json(raw_file_path + "countries.json")
sales_df = load_variable_json(raw_file_path + "sales.json")
orders_df = load_variable_json(raw_file_path + "orders.json")
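
As a quick load check, the shapes of all frames can be gathered into one table. This is a sketch only: summarize_frames is a hypothetical helper that is not part of the notebook, and the demo frames below are tiny stand-ins for the real datasets.

```python
import pandas as pd


def summarize_frames(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Return one row per dataset with its row and column counts."""
    rows = [
        {"dataset": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows).set_index("dataset")


# Stand-in frames for illustration; in the notebook pass the loaded frames instead,
# e.g. {"customers": customers_df, "products": products_df, ...}.
demo_frames = {
    "customers": pd.DataFrame({"CustomerId": [1, 2], "Name": ["A", "B"]}),
    "orders": pd.DataFrame({"OrderId": [10], "CustomerId": [1], "Date": ["2024-01-01"]}),
}
summary = summarize_frames(demo_frames)
print(summary)
```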





## Analyze Datasets

### customers_df
Checks integrity of the customers_df dataset, especially the fields required for the dashboards:
- CustomerId
- Name
- Country

Checks performed:
1. Row/Column count
2. Expected columns
3. Data types
4. Dataset sample
5. Null checks
6. Key column (CustomerId) uniqueness
7. Quality checks:
    - Active flag encoding
    - Sanity check Country values
    - Sanity check City values
    - Sanity check Name length
    - Sanity check Email format
8. Referential integrity check
    - customers_df.Country -> countries_df.Country

****


#### Check customers_df row and column count (based on file row count) to verify load

In [0]:
customers_df.shape


#### Check customers_df expected columns exist

In [0]:
customers_df.columns
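
Rather than comparing the column list by eye, the comparison can be made explicit. The sketch below assumes the expected customer columns are the ones named in the checks above; check_columns is a hypothetical helper, and the demo frame is invented.

```python
import pandas as pd


def check_columns(df: pd.DataFrame, expected: list[str]) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) column names relative to `expected`."""
    actual = set(df.columns)
    expected_set = set(expected)
    return sorted(expected_set - actual), sorted(actual - expected_set)


# Assumption: these are the columns referenced by the checks in this notebook;
# the real file may contain more.
expected_customer_columns = ["CustomerId", "Name", "Country", "City", "Email", "Active"]

# Invented stand-in for customers_df.
demo_customers = pd.DataFrame(
    {"CustomerId": [1], "Name": ["Ann"], "Country": ["US"], "Phone": ["555"]}
)
missing, unexpected = check_columns(demo_customers, expected_customer_columns)
print("missing:", missing)
print("unexpected:", unexpected)
```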

#### Check customers_df columns match expected type

In [0]:
customers_df.dtypes

#### Sample customers_df content

In [0]:
customers_df.sample(5, random_state = 13)

#### Check all customers_df columns for nulls, focusing on CustomerId, Name, Country

In [0]:

customers_df.isna().sum()

#### Check customers_df key field (CustomerId) is unique

In [0]:
customers_df['CustomerId'].is_unique
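
If is_unique returns False, the next question is which rows collide. A minimal sketch on invented data, using DataFrame.duplicated with keep=False so that every member of a duplicated group is shown:

```python
import pandas as pd

# Invented stand-in for customers_df with one duplicated key.
demo = pd.DataFrame(
    {"CustomerId": [1, 2, 2, 3], "Name": ["Ann", "Bob", "Bobby", "Cy"]}
)

key_is_unique = demo["CustomerId"].is_unique

# keep=False marks all rows sharing a key, not only the later repeats.
duplicate_rows = demo[demo.duplicated(subset="CustomerId", keep=False)]
print(key_is_unique)
print(duplicate_rows)
```

The same pattern applies to the other key columns checked in this notebook (ProductId, Country, SaleId, OrderId).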


#### Check customers_df Active flag values

In [0]:
customers_df['Active'].value_counts(dropna=False)

#### Check customers_df Country values

In [0]:
customers_df['Country'].value_counts(dropna=False)

#### Check customers_df City values

In [0]:
customers_df['City'].value_counts(dropna=False)

#### Check customers_df Name value lengths

In [0]:
customers_df['Name'].str.len().describe()



#### Check customers_df Email format


In [0]:
customers_df['Email'].str.contains('@', na=False).mean()
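
Checking for "@" alone lets values such as "bob@" through. A slightly stricter sketch follows; the pattern is deliberately loose (something@something.tld, no whitespace) and is an assumption for illustration, not a full RFC 5322 validator. The sample addresses are invented.

```python
import pandas as pd

# Invented sample values, including a null and two malformed addresses.
emails = pd.Series(["ann@example.com", "bob@", "no-at-sign", None, "c.d@mail.co"])

# Loose shape check: local part, "@", domain, ".", suffix; no spaces or extra "@".
pattern = r"[^@\s]+@[^@\s]+\.[^@\s]+"

# na=False counts missing emails as invalid, matching the original check.
looks_valid = emails.str.fullmatch(pattern, na=False).astype(bool)
invalid = emails[~looks_valid]

print(looks_valid.mean())
print(invalid)
```

Listing the invalid values, rather than only the share of valid ones, makes it easier to decide whether the failures are nulls or malformed strings.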

#### Referential integrity check: customers_df.Country -> countries_df.Country

In [0]:
customers_df[~customers_df.Country.isin(countries_df.Country)]
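
The isin pattern used for the referential integrity checks also flags rows whose foreign key is null, which the null checks already report. A sketch of a reusable helper that lists only the distinct non-null orphan keys follows; orphan_keys is a hypothetical name and the frames are invented.

```python
import pandas as pd


def orphan_keys(child: pd.Series, parent: pd.Series) -> list:
    """Return the sorted distinct non-null child values with no match in parent."""
    present = child.dropna()
    return sorted(set(present[~present.isin(parent)]))


# Invented stand-ins for customers_df and countries_df.
demo_customers = pd.DataFrame({"CustomerId": [1, 2, 3], "Country": ["US", "XX", None]})
demo_countries = pd.DataFrame({"Country": ["US", "GB"]})

orphans = orphan_keys(demo_customers["Country"], demo_countries["Country"])
print(orphans)
```

An empty list means every populated key resolves to the parent dataset.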


### products_df
Checks integrity of the products_df dataset. 


Checks performed:
1. Row/Column count
2. Expected columns
3. Data types
4. Dataset sample
5. Null checks
6. Key column (ProductId) uniqueness
7. Quality checks:
    - Sanity check Name values
    - Check to see if Name is unique
    - Sanity check WeightGrams values
8. Referential integrity check
    - products_df.ManufacturedCountry -> countries_df.Country
   
  



#### Check products_df row and column count (based on file row count) to verify load

In [0]:
products_df.shape

#### Check products_df expected columns exist

In [0]:
products_df.columns

#### Check products_df columns match expected type

In [0]:
products_df.dtypes

#### Sample products_df content

In [0]:
products_df.sample(5, random_state = 13)

#### Check all products_df columns for nulls

In [0]:
products_df.isna().sum()

#### Check products_df key field (ProductId) is unique

In [0]:
products_df['ProductId'].is_unique


#### Check products_df Name values

In [0]:
products_df['Name'].value_counts(dropna=False)

#### Check products_df Name uniqueness

In [0]:
products_df['Name'].is_unique

#### Check products_df WeightGrams for negative values, outliers and implausible values



In [0]:
products_df['WeightGrams'].describe()
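
describe() gives the range, but it does not say which rows are suspect. One common way to flag them is the interquartile range (IQR) rule, sketched below on invented weights. The multiplier k=1.5 is the conventional default rather than a value from this project, and flag_suspect_values is a hypothetical helper.

```python
import pandas as pd


def flag_suspect_values(values: pd.Series, k: float = 1.5) -> pd.DataFrame:
    """Flag non-positive values and values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame(
        {
            "value": values,
            "non_positive": values <= 0,
            "iqr_outlier": (values < q1 - k * iqr) | (values > q3 + k * iqr),
        }
    )


# Invented weights in grams: one negative value and one implausibly large value.
demo_weights = pd.Series([120, 150, 130, 140, -5, 90000])
report = flag_suspect_values(demo_weights)
print(report[report["non_positive"] | report["iqr_outlier"]])
```

The same helper could be applied to sales_df Quantity, where zero or negative quantities are equally worth a look.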

#### Referential integrity check: products_df.ManufacturedCountry -> countries_df.Country

In [0]:
products_df[~products_df.ManufacturedCountry.isin(countries_df.Country)] 

### countries_df
Checks integrity of the countries_df dataset. It contains many fields, but we focus on those used to build the dashboard:
- Country
- Name

Checks performed:
1. Row/Column count
2. Expected columns
3. Data types
4. Dataset sample
5. Null checks
6. Key column (Country) uniqueness
7. Quality checks:
    - Sanity check Name values
    - Name unique check
  



#### Check countries_df row and column count (based on file row count) to verify load

In [0]:
countries_df.shape


#### Check countries_df expected columns exist

In [0]:
countries_df.columns

#### Check countries_df columns match expected type


In [0]:
countries_df.dtypes

#### Sample countries_df content

In [0]:
countries_df.sample(5, random_state = 13)

#### Check all countries_df columns for nulls, focusing on Country, Name and Region

In [0]:
countries_df.isna().sum()

#### Check countries_df key field (Country) is unique

In [0]:
countries_df['Country'].is_unique

#### Check countries_df Name values

In [0]:
countries_df['Name'].value_counts(dropna=False)

#### Check countries_df Name uniqueness

In [0]:
countries_df['Name'].is_unique

### sales_df
Checks integrity of the sales_df dataset. 


Checks performed:
1. Row/Column count
2. Expected columns
3. Data types
4. Dataset sample
5. Null checks
6. Key column (SaleId) uniqueness
7. Quality checks:
    - Sanity check Quantity values
  
    
8. Referential integrity check
    - sales_df.OrderId -> orders_df.OrderId
    - sales_df.ProductId -> products_df.ProductId
  
   
  



#### Check sales_df row and column count (based on file row count) to verify load

In [0]:
sales_df.shape

#### Check sales_df expected columns exist


In [0]:
sales_df.columns

#### Check sales_df columns match expected type


In [0]:

sales_df.dtypes

#### Sample sales_df content


In [0]:
sales_df.sample(5, random_state = 13)

#### Check all sales_df columns for nulls, focusing on SaleId, OrderId, ProductId and Quantity


In [0]:
sales_df.isna().sum()

#### Check sales_df key field (SaleId) is unique


In [0]:
sales_df['SaleId'].is_unique

#### Check sales_df Quantity for negative values, outliers and implausible values

In [0]:
sales_df['Quantity'].describe()

#### Referential integrity check: sales_df.OrderId -> orders_df.OrderId

In [0]:
sales_df[~sales_df.OrderId.isin(orders_df.OrderId)]

#### Referential integrity check: sales_df.ProductId -> products_df.ProductId

In [0]:
sales_df[~sales_df.ProductId.isin(products_df.ProductId)]

### orders_df
Checks integrity of the orders_df dataset. 


Checks performed:
1. Row/Column count
2. Expected columns
3. Data types
4. Dataset sample
5. Null checks
6. Key column (OrderId) uniqueness
7. Quality checks:
    - Check orders_df.Date to ensure it converts to a datetime datatype
    - Check orders_df.Date min/max values
    
8. Referential integrity check
    - orders_df.CustomerId -> customers_df.CustomerId
  
   
  



#### Check orders_df row and column count (based on file row count) to verify load

In [0]:
orders_df.shape

#### Check orders_df expected columns exist

In [0]:
orders_df.columns

#### Check orders_df columns match expected type


In [0]:
orders_df.dtypes


#### Sample orders_df content

In [0]:
orders_df.sample(5, random_state=13)


#### Check all orders_df columns for nulls

In [0]:
orders_df.isna().sum()

#### Check orders_df key field (OrderId) is unique


In [0]:
orders_df['OrderId'].is_unique

#### Check orders_df Date converts to a proper datetime type

In [0]:
# Date is stored as object dtype in the DataFrame, so convert it to datetime
orders_df['Date'] = pd.to_datetime(orders_df['Date'], errors='coerce')
# Check for parsing failures
orders_df['Date'].isna().sum()
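
Because the conversion overwrites the column, the null count after parsing mixes two cases: dates that were already missing and values that could not be parsed. A sketch that separates them by keeping the raw values alongside the parsed ones (sample values are invented):

```python
import pandas as pd

# Invented raw values: one valid date, one junk string, one null, one impossible date.
raw_dates = pd.Series(["2024-01-15", "not a date", None, "2024-02-30"])

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Null before parsing: missing in the source. Null only after parsing: bad value.
was_missing = raw_dates.isna()
failed_to_parse = parsed.isna() & ~was_missing
bad_values = raw_dates[failed_to_parse]

print("missing in source:", int(was_missing.sum()))
print("failed to parse:", bad_values.tolist())
```

In the notebook this would mean parsing into a new Series (or a new column) before replacing orders_df['Date'], so the offending raw strings stay available for inspection.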


#### Check orders_df Date min/max values


In [0]:
orders_df['Date'].min(), orders_df['Date'].max()


#### Referential integrity check: orders_df.CustomerId -> customers_df.CustomerId



In [0]:
orders_df[~orders_df.CustomerId.isin(customers_df.CustomerId)] 