# Profile Reports
**Description**: This notebook generates a pandas-profiling `ProfileReport` for each dataset. **Note**: when running the notebook, do not generate the profile reports for all datasets at once. Instead, generate the report for one dataset at a time to save memory.

**Author**: Marang Mutloatse

**Version**: 0.0.1

**Status**: Development

## Import Libraries

In [1]:
from pandas_profiling import ProfileReport
import pandas as pd
import yaml
import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Working Functions

In [2]:
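def load_excel_sheet(path: str):
    """Load the first sheet of an Excel workbook; return None on failure."""
    if path is None:
        return None
    try:
        # Parse the first sheet, starting from the first row
        xls = pd.ExcelFile(path)
        return xls.parse(skiprows=0)
    except Exception as e:
        print(f"Exception on loading excel spreadsheet with error: {e}")
        return None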
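The notebook reads both Excel and CSV files, switching by hand between `load_excel_sheet` and `pd.read_csv`. A minimal sketch of a loader that picks the reader from the file extension is shown below. The names `detect_file_kind` and `load_dataset` are hypothetical and not part of the original notebook; pandas is imported inside the loader so the extension check can be used on its own.

```python
import os

# Extensions assumed from the file names used in this notebook
EXCEL_EXTENSIONS = {".xlsx", ".xls"}
CSV_EXTENSIONS = {".csv"}


def detect_file_kind(path: str) -> str:
    """Return 'excel' or 'csv' based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXCEL_EXTENSIONS:
        return "excel"
    if ext in CSV_EXTENSIONS:
        return "csv"
    raise ValueError(f"Unsupported file type: {ext!r}")


def load_dataset(path: str):
    """Load a dataset with the pandas reader that matches its extension."""
    import pandas as pd  # imported lazily; the notebook already imports it

    if detect_file_kind(path) == "excel":
        # read_excel reads the first sheet by default
        return pd.read_excel(path)
    return pd.read_csv(path)
```

Unlike `load_excel_sheet`, this sketch lets read errors propagate instead of printing them, so a failed load stops the cell immediately.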

## Loading Config File

In [3]:
# Get parent directory
user_dev_path = os.path.dirname(os.getcwd())

# Load config file
config_path = os.path.join(user_dev_path, 'config_LTFU.yaml')
try:
    with open(config_path, 'r') as file:
        config = yaml.safe_load(file)
except Exception as e:
    print(f'Error reading the config file, {e}')
    raise  # later cells need `config`, so stop here

## Get Parameters from Config file

In [4]:
# Get root path parameter
input_root = config['rise_files']['raw_data_path']

# Get filenames
patient_file = config['rise_files']['raw_patient_file']
lab_file = config['rise_files']['raw_lab_file']
clinic_file = config['rise_files']['raw_clinic_file']

pharmacy_adamawa_file = config['rise_files']['raw_pharmacy_adamawa_file']
pharmacy_akwa_ibom_file = config['rise_files']['raw_pharmacy_akwa_ibom_file']
pharmacy_cross_river_file = config['rise_files']['raw_pharmacy_cross_river_file']
pharmacy_niger_file = config['rise_files']['raw_pharmacy_niger_file']

eac_file = config['rise_files']['raw_eac_file']
otz_file = config['rise_files']['raw_otz_file']

# Dataset Paths
patient_input = input_root + patient_file
adamawa_input = input_root + pharmacy_adamawa_file
akwa_ibom_input = input_root + pharmacy_akwa_ibom_file
cross_river_input = input_root + pharmacy_cross_river_file
niger_input = input_root + pharmacy_niger_file
eac_input = input_root + eac_file
otz_input = input_root + otz_file
lab_input = input_root + lab_file
clinic_input = input_root + clinic_file
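The path construction above can also be sketched as one helper that checks the config keys first and joins paths with `os.path.join`, so a missing key or a root path without a trailing slash is caught early. The helper name `build_dataset_paths` and the short dataset names are assumptions for illustration; the config keys are the ones read in the cell above.

```python
import os

# Short dataset name -> key under config['rise_files'] (keys taken from this notebook)
DATASET_KEYS = {
    "patient": "raw_patient_file",
    "lab": "raw_lab_file",
    "clinic": "raw_clinic_file",
    "adamawa": "raw_pharmacy_adamawa_file",
    "akwa_ibom": "raw_pharmacy_akwa_ibom_file",
    "cross_river": "raw_pharmacy_cross_river_file",
    "niger": "raw_pharmacy_niger_file",
    "eac": "raw_eac_file",
    "otz": "raw_otz_file",
}


def build_dataset_paths(config: dict) -> dict:
    """Return {dataset name: full path}, raising KeyError if any key is missing."""
    files = config["rise_files"]
    required = ["raw_data_path", *DATASET_KEYS.values()]
    missing = [key for key in required if key not in files]
    if missing:
        raise KeyError(f"Missing config keys under 'rise_files': {missing}")
    root = files["raw_data_path"]
    # os.path.join adds a separator only when the root does not already end with one
    return {name: os.path.join(root, files[key]) for name, key in DATASET_KEYS.items()}
```

With this helper, `paths = build_dataset_paths(config)` would give `paths["patient"]`, `paths["lab"]`, and so on in place of the individual `*_input` variables.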

In [5]:
# Print paths
print(f"Patient data path is: {patient_input}\n")
print(f"Lab data path is: {lab_input}\n")
print(f"Clinic data path is: {clinic_input}\n")
print(f"EAC data path is: {eac_input}\n")
print(f"OTZ data path is: {otz_input}\n")
print(f"Adamawa pharmacy data path is: {adamawa_input}\n")
print(f"Akwa Ibom pharmacy data path is: {akwa_ibom_input}\n")
print(f"Cross River pharmacy data path is: {cross_river_input}\n")
print(f"Niger data path is: {niger_input}")

Patient data path is: /data/rise_data/PatientDemographicsData.xlsx

Lab data path is: /data/rise_data/All LaboratoryData_Flat File.csv

Clinic data path is: /data/rise_data/ClinicData.xlsx

EAC data path is: /data/rise_data/EacData.xlsx

OTZ data path is: /data/rise_data/OtzData.xlsx

Adamawa pharmacy data path is: /data/rise_data/PharmacyData_Adamawa.xlsx

Akwa Ibom pharmacy data path is: /data/rise_data/PharmacyData_Akwa Ibom.xlsx

Cross River pharmacy data path is: /data/rise_data/PharmacyData_Cross River.xlsx

Niger data path is: /data/rise_data/PharmacyData_Niger.csv


## Loading Data

In [6]:
# Load one dataset at a time: uncomment the line for the dataset to profile
df = load_excel_sheet(patient_input)

# df = load_excel_sheet(eac_input)
# df = load_excel_sheet(otz_input)
# df = load_excel_sheet(adamawa_input)
# df = load_excel_sheet(akwa_ibom_input)

# Load CSV files - uncomment the line for the dataset to profile

# df = pd.read_csv(lab_input)
# df = pd.read_csv(niger_input)
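The comment/uncomment workflow above keeps only one dataframe in memory per run. A rough sketch of the same idea as a loop is shown below: each dataset is loaded, profiled, and released before the next one is read. The function `profile_one_at_a_time` and its `load` and `make_report` parameters are hypothetical and not part of the original notebook; the output name pattern `<name>_Summary.html` is modelled on the file names used in this notebook. If memory is still tight, running one dataset per kernel session as described at the top remains the safer option.

```python
import gc
from typing import Any, Callable, Dict, List


def profile_one_at_a_time(
    datasets: Dict[str, str],
    load: Callable[[str], Any],
    make_report: Callable[[Any, str], None],
) -> List[str]:
    """Profile datasets sequentially, holding only one in memory at a time."""
    written = []
    for name, path in datasets.items():
        df = load(path)
        if df is None:
            # load_excel_sheet returns None when loading fails
            print(f"Skipping {name}: could not load {path}")
            continue
        output_file = f"{name}_Summary.html"
        make_report(df, output_file)
        written.append(output_file)
        # Drop the reference and collect garbage before loading the next dataset
        del df
        gc.collect()
    return written
```

In this notebook, `load` could be `load_excel_sheet` and `make_report` could be `lambda df, out: ProfileReport(df).to_file(output_file=out)`.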

## Create Profile Report

In [11]:
prof = ProfileReport(df)

prof.to_file(output_file='PatientDemographics_Summary.html')

# prof.to_file(output_file='LaboratorySummary.html')
# prof.to_file(output_file='EAC_Summary.html')
# prof.to_file(output_file='OTZ_Summary.html')
# prof.to_file(output_file='Pharmacy_adamawa_Summary.html')
# prof.to_file(output_file='Pharmacy_Cross_River_Summary.html')
# prof.to_file(output_file='Pharmacy_Niger_Summary.html')
# Very large dataset:
# prof.to_file(output_file='Pharmacy_Akwa_Ibom_Summary.html')

(using `df.profile_report(correlations={"cramers": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'No data; `observed` has size 0.')
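
The warning above suggests disabling the Cramér's V correlation, and the Akwa Ibom pharmacy file is flagged as very large. A minimal sketch of report settings that cover both cases follows. The helper names and the `large_threshold` row count are assumptions for illustration, not values from the original notebook. The `correlations` setting is the one quoted in the warning, and `minimal=True` is the pandas-profiling option that skips the most expensive computations; both may behave differently across library versions.

```python
def build_profile_kwargs(n_rows: int, large_threshold: int = 100_000) -> dict:
    """Build keyword arguments for ProfileReport based on dataset size."""
    # Skip the Cramér's V correlation, as suggested by the warning text
    kwargs = {"correlations": {"cramers": {"calculate": False}}}
    if n_rows > large_threshold:
        # Minimal mode turns off expensive computations for large datasets
        kwargs["minimal"] = True
    return kwargs


def make_profile(df, title: str):
    """Create a ProfileReport using the size-dependent settings above."""
    from pandas_profiling import ProfileReport  # already imported in this notebook

    return ProfileReport(df, title=title, **build_profile_kwargs(len(df)))
```

A call such as `make_profile(df, "Patient Demographics").to_file(output_file='PatientDemographics_Summary.html')` would then replace the plain `ProfileReport(df)` call.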