# **Data Profiling and Quality Report**


---



#### 1. Import Libraries: Load pandas and NumPy for data handling.

#### 2. Load Dataset: Read the CSV into df (low_memory=False makes pandas infer each column's dtype from the whole file rather than chunk by chunk, which avoids mixed-type columns at the cost of more memory).

#### 3. Preview Data: df.head() shows the first five rows.

#### 4. Dataset Size: df.shape gives the number of rows and columns.

#### 5. Column Info: df.dtypes shows the data type of each column.

#### 6. Missing Values: df.isnull().sum() counts missing values (NaN/None) per column.

#### 7. Missing Data Map: df.isna() returns a boolean mask marking where values are missing.

In [6]:
# Import Libraries
import numpy as np
import pandas as pd

In [4]:
# Load your log file
df = pd.read_csv("/content/logs_preprocessed.csv", low_memory=False)

In [5]:
print(df.head()) # head() returns the first five rows by default

     run_date dut suite                                 filename  line_number  \
0  2024-07-22  tc   tcp  tc_func_tcp_tfg_001_20240722-113936.log          194   
1  2024-07-22  tc   tcp  tc_func_tcp_tfg_001_20240722-113936.log          195   
2  2024-07-22  tc   tcp  tc_func_tcp_tfg_001_20240722-113936.log          198   
3  2024-07-22  tc   tcp  tc_func_tcp_tfg_001_20240722-113936.log          202   
4  2024-07-22  tc   tcp  tc_func_tcp_tfg_001_20240722-113936.log          265   

                 timestamp                             test_case_id status  \
0  2024-07-22 11:39:38.710  tc_func_tcp_tfg_001_20240722-113936.log   PASS   
1  2024-07-22 11:39:38.812  tc_func_tcp_tfg_001_20240722-113936.log   PASS   
2  2024-07-22 11:39:39.123  tc_func_tcp_tfg_001_20240722-113936.log   PASS   
3  2024-07-22 11:39:40.125  tc_func_tcp_tfg_001_20240722-113936.log   FAIL   
4  2024-07-22 11:39:43.465  tc_func_tcp_tfg_001_20240722-113936.log   FAIL   

                                           e

In [7]:
print("Dataset Shape:", df.shape) # (number of rows, number of columns)
print("\nColumn Info:") # Heading for the dtype listing
print(df.dtypes) # Data type of each column

Dataset Shape: (124684, 12)

Column Info:
run_date        object
dut             object
suite           object
filename        object
line_number      int64
timestamp       object
test_case_id    object
status          object
error_msg       object
config          object
raw_line        object
os_version      object
dtype: object
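
The listing above shows run_date and timestamp stored as object (plain strings). A minimal sketch of how such a column could be parsed into a datetime type follows. It assumes the timestamp format seen in the preview rows; the toy frame and the malformed value are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy sample in the same format as the preview rows; the last value is deliberately malformed.
sample = pd.DataFrame({
    "timestamp": ["2024-07-22 11:39:38.710", "2024-07-22 11:39:40.125", "not a time"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising an exception.
parsed = pd.to_datetime(
    sample["timestamp"], format="%Y-%m-%d %H:%M:%S.%f", errors="coerce"
)

# Number of values that failed to parse; a non-zero count points at malformed timestamps.
n_unparsed = int(parsed.isna().sum())

print(parsed.dtype)
print(n_unparsed)
```

Counting the NaT values after parsing is a cheap way to check whether a string column that looks complete actually holds valid timestamps.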


In [11]:
print(df.isnull().sum()) # Count missing values per column


run_date        0
dut             0
suite           0
filename        0
line_number     0
timestamp       0
test_case_id    0
status          0
error_msg       0
config          0
raw_line        0
os_version      0
dtype: int64
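
Step 7 of the outline mentions df.isna() as a missing data map, but no cell above uses it directly. The sketch below shows what that mask looks like on a small toy frame; the column names mirror the dataset, while the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are invented for this example.
toy = pd.DataFrame({
    "status": ["PASS", "FAIL", None],
    "error_msg": [np.nan, "timeout", np.nan],
})

# Boolean mask with the same shape as the frame: True marks a missing cell.
missing_mask = toy.isna()

# Rows that contain at least one missing value.
rows_with_gaps = toy[missing_mask.any(axis=1)]

print(missing_mask)
print(rows_with_gaps.index.tolist())
```

Summing the mask column-wise gives the same counts as df.isnull().sum(), since isnull() is an alias of isna().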


## **Coverage, Missing Values, and Duplicate Values**

---
#### This code builds a missing-values report for the DataFrame: for each column it counts the missing entries and expresses them as a percentage of all rows. The report shows which columns have gaps and how large they are, which helps decide where cleaning or imputation is needed.


In [12]:
# Calculate the count of missing (NA) values per column
missing_report = df.isna().sum().reset_index()

# Rename columns for clear display: 'Column' for column names, 'Missing_Count' for missing value counts
missing_report.columns = ["Column","Missing_Count"]

# Calculate the percentage of missing values relative to total rows for each column
missing_report["Missing_%"] = (missing_report["Missing_Count"] / len(df)) * 100

# Print a report sorted by descending missing percentage to highlight worst columns first
print("\nMissing Values Report:")
print(missing_report.sort_values("Missing_%", ascending=False))



Missing Values Report:
          Column  Missing_Count  Missing_%
0       run_date              0        0.0
1            dut              0        0.0
2          suite              0        0.0
3       filename              0        0.0
4    line_number              0        0.0
5      timestamp              0        0.0
6   test_case_id              0        0.0
7         status              0        0.0
8      error_msg              0        0.0
9         config              0        0.0
10      raw_line              0        0.0
11    os_version              0        0.0
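
A report of all zeros only means no cell is NaN or None. isna() does not count empty strings, whitespace, or placeholder text. The sketch below is one possible way to look for such hidden gaps; the placeholder tokens and the toy values are assumptions for illustration, not values observed in this dataset.

```python
import pandas as pd

# Toy frame; the values and the placeholder tokens below are hypothetical.
toy = pd.DataFrame({
    "error_msg": ["timeout", "", "   ", "N/A", "connection reset"],
    "line_number": [1, 2, 3, 4, 5],
})
placeholders = {"", "n/a", "none", "null", "unknown"}

# Only non-numeric columns can hide blank or placeholder text.
text_cols = toy.select_dtypes(exclude="number").columns

# Normalise each value (strip whitespace, lower-case) and count placeholder matches per column.
hidden_missing = {
    col: int(toy[col].astype(str).str.strip().str.lower().isin(placeholders).sum())
    for col in text_cols
}

print(hidden_missing)
```

If a check like this finds matches, converting them to NaN before building the report would make the missing counts reflect them.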


In [13]:
# Count the number of duplicate rows in the DataFrame
duplicate_count = df.duplicated().sum()

# Print the total count of duplicate rows found
print(f"\nDuplicate Rows: {duplicate_count}")



Duplicate Rows: 0
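
df.duplicated() with no arguments only flags rows that match an earlier row in every column. The sketch below shows a narrower check on a subset of columns. Treating filename and line_number as an identifying key is an assumption made for illustration, and the toy rows are invented.

```python
import pandas as pd

# Toy frame; the first two rows share a key but differ in status.
toy = pd.DataFrame({
    "filename": ["a.log", "a.log", "b.log"],
    "line_number": [194, 194, 194],
    "status": ["PASS", "FAIL", "PASS"],
})

# No row is an exact copy of another, so the full-row check finds nothing.
full_dupes = int(toy.duplicated().sum())

# keep=False flags every member of a repeated key, not just the later copies.
key_dupes = toy[toy.duplicated(subset=["filename", "line_number"], keep=False)]

print(full_dupes)
print(key_dupes)
```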


#### This code calculates data coverage per column: it counts the non-missing values in each column and expresses them as a percentage of all rows. It then prints the report sorted by coverage percentage, lowest first, so that columns with incomplete data appear at the top.

In [14]:
# Create a new DataFrame summarizing columns, count of non-missing values, and coverage percentage
coverage_report = pd.DataFrame({
    "Column": df.columns,
    "NonNull_Count": df.notna().sum(),
    "Coverage_%": (df.notna().sum() / len(df)) * 100
})

# Print a heading for the coverage report
print("\nData Coverage:")

# Print the coverage report sorted by lowest coverage percentage first
print(coverage_report.sort_values("Coverage_%"))



Data Coverage:
                    Column  NonNull_Count  Coverage_%
run_date          run_date         124684       100.0
dut                    dut         124684       100.0
suite                suite         124684       100.0
filename          filename         124684       100.0
line_number    line_number         124684       100.0
timestamp        timestamp         124684       100.0
test_case_id  test_case_id         124684       100.0
status              status         124684       100.0
error_msg        error_msg         124684       100.0
config              config         124684       100.0
raw_line          raw_line         124684       100.0
os_version      os_version         124684       100.0
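
Coverage is the complement of the missing percentage, so the two reports carry the same information. A compact way to compute either one is to take the mean of the boolean mask, as sketched here on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame; half of the os_version values are missing on purpose.
toy = pd.DataFrame({
    "status": ["PASS", "FAIL", "PASS", "FAIL"],
    "os_version": ["1.0", np.nan, np.nan, "1.1"],
})

# The mean of a boolean mask is the fraction of True values, so this is coverage in percent.
coverage_pct = toy.notna().mean() * 100

# Missing percentage computed the same way; each column's two figures add up to 100.
missing_pct = toy.isna().mean() * 100

print(coverage_pct.sort_values())
```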


## **Critical Fields**

---

#### This code filters the coverage report down to a set of critical columns that matter for machine learning readiness. It then prints the coverage details for these fields, giving a quick view of their completeness.

In [15]:
# Define a list of critical columns needed for ML or core analysis
critical_fields = ["test_case_id", "dut", "suite", "config", "error_msg", "status", "run_date", "timestamp"]

# Filter the coverage report DataFrame to include only these critical columns
critical_report = coverage_report[coverage_report["Column"].isin(critical_fields)]

# Print the coverage details for the critical fields
print("\nCritical Fields Coverage (ML readiness):")
print(critical_report)



Critical Fields Coverage (ML readiness):
                    Column  NonNull_Count  Coverage_%
run_date          run_date         124684       100.0
dut                    dut         124684       100.0
suite                suite         124684       100.0
timestamp        timestamp         124684       100.0
test_case_id  test_case_id         124684       100.0
status              status         124684       100.0
error_msg        error_msg         124684       100.0
config              config         124684       100.0
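
One caveat with filtering by isin(): a critical field that is absent from the DataFrame does not show up as 0% coverage, it simply does not appear in the output. The sketch below shows one way to surface such fields. The toy coverage table and the shortened field list are assumptions for illustration.

```python
import pandas as pd

# Toy coverage table; "config" is left out on purpose to show the failure mode.
coverage = pd.DataFrame({
    "Column": ["test_case_id", "status", "raw_line"],
    "Coverage_%": [100.0, 100.0, 100.0],
})
critical_fields = ["test_case_id", "status", "config"]

# isin() keeps only rows that match, so an absent field vanishes from the result.
critical = coverage[coverage["Column"].isin(critical_fields)]

# Compare against the requested list to find fields that are not in the table at all.
known_columns = set(coverage["Column"])
absent = [name for name in critical_fields if name not in known_columns]

print(critical["Column"].tolist())
print(absent)
```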


## **Identify Gaps**

---
#### This code checks for potential data quality issues and prints warnings about key areas that need attention. It highlights missing values in important columns such as DUT version, OS version, and error messages, and warns if duplicate rows are present. The final print statement marks the completion of the profiling step.


In [17]:
# Print header for missing data and quality gaps report
print("\nPotential Gaps to Fix:")

# Check if 'dut_version' column exists and has missing values, warn about its reliability
if "dut_version" in df.columns and df["dut_version"].isna().sum() > 0:
    print("- DUT Version has missing values → may not be reliable for ML.")

# Check if 'os_version' exists and has missing values, print count needing better extraction
if "os_version" in df.columns and df["os_version"].isna().sum() > 0:
    print(f"- OS Version missing in {df['os_version'].isna().sum()} rows → need better extraction.")

# Warn if any logs have missing error messages
if df["error_msg"].isna().sum() > 0:
    print("- Some logs have no error messages.")

# Warn if any duplicate rows exist in dataset
if duplicate_count > 0:
    print("- Dataset has duplicate rows, consider removing them.")

# Indicate profiling process is complete
print("\nProfiling Complete.")



Potential Gaps to Fix:

Profiling Complete.
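
The checks above can also be collected into a reusable helper that returns the findings as a list, which is easier to test than printed text. This is a minimal sketch, not the notebook's own method: the function name find_gaps, the message wording, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def find_gaps(frame, columns):
    """Return one message per listed column that is absent or has missing values,
    plus one message if the frame contains fully duplicated rows."""
    gaps = []
    for col in columns:
        if col not in frame.columns:
            gaps.append(f"{col}: column not present")
            continue
        n_missing = int(frame[col].isna().sum())
        if n_missing > 0:
            gaps.append(f"{col}: missing in {n_missing} rows")
    n_dupes = int(frame.duplicated().sum())
    if n_dupes > 0:
        gaps.append(f"{n_dupes} duplicate rows")
    return gaps

# Toy frame: one missing os_version, and the third row repeats the first.
toy = pd.DataFrame({
    "os_version": ["1.0", np.nan, "1.0"],
    "error_msg": ["timeout", "reset", "timeout"],
})
gaps = find_gaps(toy, ["dut_version", "os_version", "error_msg"])
print(gaps)
```

An empty list corresponds to the empty "Potential Gaps to Fix" section printed above.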
