# Talking Therapies
Data exploration worksheet.

## Data load
- Function imports
- Base table loads

### Set up

In [0]:
# flags controlling which optional steps run in this worksheet

flag_data_check = False # should data checks be performed?
flag_data_display = False # should data be displayed?
flag_data_export = True # should data be exported?

In [0]:
# imports
import pyspark.sql.functions as F
from pyspark.sql import Window as W

In [0]:
# set path to iapt referrals
lake = "udalstdatacuratedprod.dfs.core.windows.net"
container = "restricted"
file = "/patientlevel/MESH/IAPT/IDS101referral/Published/2/"
iapt_path_referral = "abfss://" + container + "@" + lake + file

# set path to iapt mpi
lake = "udalstdatacuratedprod.dfs.core.windows.net"
container = "restricted"
file = "/patientlevel/MESH/IAPT/IDS001mpi/Published/3/"
iapt_path_mpi = "abfss://" + container + "@" + lake + file

# set path to iapt contacts
lake = "udalstdatacuratedprod.dfs.core.windows.net"
container = "restricted"
file = "/patientlevel/MESH/IAPT/IDS201carecontact/Published/2/"
iapt_path_contact = "abfss://" + container + "@" + lake + file

# set path to organisation details
lake = "udalstdatacuratedprod.dfs.core.windows.net"
container = "unrestricted"
file = "/reference/UKHD/ODS_API/vwOrganisation_SCD_IsLatestEqualsOneWithRole"
ods_path = "abfss://" + container + "@" + lake + file

# set path to lake mart
# https://udalstdataanalysisprod.dfs.core.windows.net/analytics-projects/StrategyUnit/StrategyUnit/Evaluation/E004546-0001_Talking_Therapies/
lake = "udalstdataanalysisprod.dfs.core.windows.net"
container = "analytics-projects"
folder = "/StrategyUnit/StrategyUnit/Evaluation/E004546-0001_Talking_Therapies/"
eval_path = "abfss://" + container + "@" + lake + folder


### IAPT referrals

In [0]:
iapt_referrals = spark.read.option("header", "true").option("recursiveFileLookup", "true").parquet(iapt_path_referral)

if flag_data_display:
    iapt_referrals.limit(10).display()

In [0]:
if flag_data_check:
    check_iapt_referral_id_unique = iapt_referrals \
        .groupBy("PathwayID").count().filter(F.col("count") > 1)

    if flag_data_display:
        display(check_iapt_referral_id_unique)

In [0]:
if flag_data_check:
    iapt_referrals.filter(F.col("PathwayID") == "6d6a4bfda814475ba409586a19dce42e").orderBy("EFFECTIVE_FROM").display()

There are multiple records per `PathwayID` because the same referral is reported again in each subsequent monthly submission. There also appear to be exact duplicates of some records (same `EFFECTIVE_FROM` timestamp, same `UDALFieldID` and same `UniqueID_IDS101`, so that ID is not unique in practice). Plan:
- Keep the record with the latest `EFFECTIVE_FROM` timestamp per `PathwayID`, using a row-number approach so that exactly one record per `PathwayID` is treated as the official version.

## Update 2025-07-11
Since restricting the referral path to 'Published' data and selecting a specific published version, `UniqueID_IDS101` is unique across the records. There are still multiple records per `PathwayID`, but these all appear to be submissions made over successive months while the referral was active, which is to be expected.
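
The "latest record per `PathwayID`" plan above can be sketched in plain Python. This is only an illustration on hypothetical rows (the `status` field and the helper name `latest_per_key` are made up for the example), not the Spark implementation used in this worksheet.

```python
from datetime import datetime

# Hypothetical rows: two monthly submissions for p1, one for p2
rows = [
    {"PathwayID": "p1", "EFFECTIVE_FROM": datetime(2024, 1, 1), "status": "open"},
    {"PathwayID": "p1", "EFFECTIVE_FROM": datetime(2024, 3, 1), "status": "closed"},
    {"PathwayID": "p2", "EFFECTIVE_FROM": datetime(2024, 2, 1), "status": "open"},
]

def latest_per_key(records, key, order_by):
    """Keep one record per key: the one with the greatest order_by value."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        # strictly greater, so the first-seen record wins when timestamps tie
        if current is None or record[order_by] > current[order_by]:
            latest[record[key]] = record
    return list(latest.values())

latest = latest_per_key(rows, "PathwayID", "EFFECTIVE_FROM")
```

One design point worth noting: in this sketch a tie on `EFFECTIVE_FROM` is resolved by input order, whereas a Spark `row_number()` over a window picks arbitrarily among tied rows unless a second ordering column is supplied.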

In [0]:
# define a window spec: one partition per PathwayID, newest record first
window_spec = W.partitionBy("PathwayID").orderBy(F.col("EFFECTIVE_FROM").desc())

# keep the latest record for each PathwayID
iapt_referrals_latest = (
    iapt_referrals
        .withColumn("row_number", F.row_number().over(window_spec)) # assign 1,2,3... within each id
        .filter(F.col("row_number") == 1) # keep only top row per id
        .drop("row_number")
)

if flag_data_display:
    iapt_referrals_latest.limit(10).display()

In [0]:
if flag_data_check:
    check_iapt_referral_id_unique_v2 = iapt_referrals_latest \
        .groupBy("PathwayID").count().filter(F.col("count") > 1)

    if flag_data_display:
        display(check_iapt_referral_id_unique_v2)

Yes, they're unique now.

### IAPT MPI
Get some details for the patient, such as deprivation (based on LSOA), gender identity, language and ethnicity.

In [0]:
iapt_mpi = spark.read.option("header", "true").option("recursiveFileLookup", "true").parquet(iapt_path_mpi)

if flag_data_display:
    iapt_mpi.limit(10).display()

In [0]:
if flag_data_check:
    check_mpi_id_unique = iapt_mpi.groupBy("Person_ID", "UniqueSubmissionID").count().filter(F.col("count") > 1)

    if flag_data_display:
        display(check_mpi_id_unique)

No, it appears the combination of `Person_ID` and `UniqueSubmissionID` is not unique within this table.

**IMPORTANT NOTE**

We probably want to get the details for each person at the point of their discharge, or the latest submission if their referral is still active.

This means we need to join this data to the referrals table on both `Person_ID` and `UniqueSubmissionID` to ensure the patient details are contemporaneous with the referral.
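
To illustrate why both keys are needed, here is a minimal plain-Python sketch of a left join on the composite key. The rows are hypothetical and the helper `left_join` is made up for the example; only the column names come from this worksheet.

```python
# Hypothetical MPI rows: the same person appears in two submissions
mpi = [
    {"Person_ID": "A", "UniqueSubmissionID": "100", "IndicesOfDeprivationDecile": 2},
    {"Person_ID": "A", "UniqueSubmissionID": "200", "IndicesOfDeprivationDecile": 5},
]

# Hypothetical referrals: p2 has no matching MPI record
referrals = [
    {"PathwayID": "p1", "Person_ID": "A", "UniqueSubmissionID": "200"},
    {"PathwayID": "p2", "Person_ID": "B", "UniqueSubmissionID": "300"},
]

# index MPI rows by the composite key (Person_ID, UniqueSubmissionID)
mpi_by_key = {(m["Person_ID"], m["UniqueSubmissionID"]): m for m in mpi}

def left_join(referral):
    """Attach the MPI details from the same submission, or None if unmatched."""
    key = (referral["Person_ID"], referral["UniqueSubmissionID"])
    match = mpi_by_key.get(key, {})
    return {**referral, "IndicesOfDeprivationDecile": match.get("IndicesOfDeprivationDecile")}

joined = [left_join(r) for r in referrals]
```

Joining on `Person_ID` alone would match referral `p1` to both MPI rows for person `A`, duplicating the referral and mixing deprivation values from different points in time.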

In [0]:
if flag_data_check:
    iapt_mpi.filter((F.col("Person_ID") == "N6STX48XQU3Y8JT") & (F.col("UniqueSubmissionID") == "9534275")).display()

Note on duplicate MPI records:

This is fine and probably what we want. Because there is an MPI record per submission, we can link MPI to referrals on both `Person_ID` and `UniqueSubmissionID`, thereby getting the LSOA (address) that applied at the time of the referral submission.

See the 'Joins' section below for details of how this was achieved.

### Provider details
We want some details about the organisation that provided Talking Therapies services to patients.

In [0]:
ods = spark.read.option("header","true").option("recursiveFileLookup","true").parquet(ods_path)

if flag_data_display:
    ods.limit(10).display()

In [0]:
if flag_data_check:
    check_ods_codes_unique = ods \
        .groupBy("ODS_Code").count().filter(F.col("count") > 1)

    if flag_data_display:
        display(check_ods_codes_unique)

Yes, all ODS codes are unique to the ODS table.

### IAPT contacts
Load in care contact details

In [0]:
iapt_contacts = spark.read.option("header", "true").option("recursiveFileLookup", "true").parquet(iapt_path_contact)

if flag_data_display:
    iapt_contacts.limit(10).display()

In [0]:
if flag_data_check:
    check_contact_codes_unique = iapt_contacts \
        .groupBy("Unique_CareContactID").count().filter(F.col("count") > 1)

    if flag_data_display:
        display(check_contact_codes_unique)

In [0]:
if flag_data_display:
    # there are some duplicates, some are quite large. Let's take a look at one
    iapt_contacts.filter(F.col("Unique_CareContactID") == "RX3YE6090764").display()

OK, so it looks like a `Unique_CareContactID` can refer to multiple records. This appears to be due to submissions over multiple months, and the details **can** change, e.g. in the example above the contact date and time change between submissions.

Speculation: the clinician changed some (but not all) details for a contact, possibly to correct an erroneous record. In this case, it looks like they changed the date and time of the appointment and also changed attendance from 'attended' to 'cancelled' by the patient.

In [0]:
# define a window spec: one partition per care contact, newest record first
window_spec = W.partitionBy("Unique_CareContactID").orderBy(F.col("EFFECTIVE_FROM").desc())

# keep the latest record for each care contact
iapt_contacts_latest = (
    iapt_contacts
        .withColumn("row_number", F.row_number().over(window_spec)) # assign 1,2,3... within each id
        .filter(F.col("row_number") == 1) # keep only top row per id
        .drop("row_number")
)

if flag_data_display:
    iapt_contacts_latest.limit(10).display()

In [0]:
if flag_data_check:
    check_contact_codes_unique_v2 = iapt_contacts_latest \
        .groupBy("Unique_CareContactID").count().filter(F.col("count") > 1)

    if flag_data_display:
        display(check_contact_codes_unique_v2)

Yes, they are unique now. :)

## Joins

### Referrals - MPI

In [0]:
# get a truncated version of the MPI containing gender, deprivation and ethnicity
iapt_mpi_short = iapt_mpi.select(
    # keys
    "Person_ID", 
    "UniqueSubmissionID",
    "EFFECTIVE_FROM",
    "RecordNumber",
    # details
    "Gender",
    "GenderIdentity",
    "IndicesOfDeprivationDecile",
    "IMD_YEAR",
    "EthnicCategory",
    "Validated_EthnicCategory",
    "EthnicCategory2021"
).distinct()

In [0]:
# There are some records that appear to show multiple records for the same person / unique submission ID
# These need to be consolidated to one row per person / unique submission ID combo.

# define a window spec: one partition per person / submission, newest record first
window_spec = W.partitionBy("Person_ID", "UniqueSubmissionID").orderBy(F.col("EFFECTIVE_FROM").desc(), F.col("RecordNumber").desc())

# keep the latest record for each person / submission combination
iapt_mpi_short_unique = (
    iapt_mpi_short
        .withColumn("row_number", F.row_number().over(window_spec)) # assign 1,2,3... within each id
        .filter(F.col("row_number") == 1) # keep only top row per id
        .drop("row_number")
)

if flag_data_display:
    iapt_mpi_short_unique.limit(10).display()

In [0]:
if flag_data_check:
    check_mpi_short_multiples = iapt_mpi_short_unique \
        .groupBy("Person_ID", "UniqueSubmissionID").count().filter(F.col("count") > 1) 

    if flag_data_display:
        display(check_mpi_short_multiples.orderBy("count", ascending=False))

Great news - we have unique MPI records for each Person/Submission ID combo. This df is ready for joining with the referrals records.

In [0]:
iapt_referrals_latest_test = iapt_referrals_latest \
  .join(iapt_mpi_short_unique, 
        on = (
          (iapt_referrals_latest["Person_ID"] == iapt_mpi_short_unique["Person_ID"]) &
          (iapt_referrals_latest["UniqueSubmissionID"] == iapt_mpi_short_unique["UniqueSubmissionID"])
        ),
        how = "left"
  )

if flag_data_display:
    iapt_referrals_latest_test.limit(10).display()

In [0]:
if flag_data_check:
    n_rows_referrals = iapt_referrals_latest.count()
    n_rows_referrals_test = iapt_referrals_latest_test.count()
    print(f"Original: {n_rows_referrals:,}")
    print(f"Test:     {n_rows_referrals_test:,}")

Great news - the number of referrals remains constant after the left-join.

In [0]:
iapt_referrals_latest = iapt_referrals_latest_test

### Referrals - ODS

In [0]:
iapt_referrals_latest = iapt_referrals_latest.join(
    ods, 
    on = iapt_referrals_latest["OrgID_Provider"] == ods["ODS_Code"], 
    how = "left"
)

if flag_data_display:
    iapt_referrals_latest.limit(10).display()

In [0]:
if flag_data_check:
    
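    # use a separate name so the full `ods` table is still available for the contacts join
    ods_providers = iapt_referrals_latest.select("ODS_Code", "Name").withColumnRenamed("Name", "ODS_Name").distinct()
    
    if flag_data_display:
        ods_providers.display()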
else:
    display("Data check flag is set to False")

All `ODS_Code` values appear to have corresponding `ODS_Name` values, indicating all provider codes have been identified.
There are 181 providers of Talking Therapies listed. This appears to be the right order of magnitude when compared with the range of providers reported in the IAPT dashboard.

In [0]:
if flag_data_check:
    # see if there are any nulls in the joined ODS `Name` field - i.e. checking for unmatched `OrgID_Provider`
    null_count = iapt_referrals_latest.groupBy("OrgID_Provider").agg(F.count(F.when(F.col("Name").isNull(), True)).alias("null_ods_names"))

    if flag_data_display:
        display(null_count)

else:
    display("Data check flag is set to False")

All provider codes appear to be matched with corresponding provider names.

### Contacts - ODS
Link the organisation name to the care contact

In [0]:
iapt_contacts_latest = iapt_contacts_latest.join(
    ods, 
    on = iapt_contacts_latest["OrgID_Provider"] == ods["ODS_Code"], 
    how = "left"
)

if flag_data_display:
    iapt_contacts_latest.limit(10).display()

# Explore

## Referrals

### Age distribution

In [0]:
if flag_data_check:
    referral_age_distribution = iapt_referrals_latest \
        .groupBy(F.col("Age_ReferralRequest_ReceivedDate")) \
            .agg(F.countDistinct(F.col("Unique_ServiceRequestID")).alias("unique_referrals"))\
                .orderBy("Age_ReferralRequest_ReceivedDate")

    if flag_data_display:
        referral_age_distribution.display()
        
else:
    display("Data check flag is set to False")

Interesting: several hundred referrals were received for patients aged 0, which doesn't look right. There is also a (relatively) small number of referrals for people aged 0 to 15, presumably reflecting data quality issues.
The bulk of referrals ramp up from age 16 onwards, which fits descriptions of NHS TT as an adult service that can also take adolescents.


In [0]:
# Let's see what happens if we focus on referrals received after 1 January 2022
if flag_data_check:
    referral_age_distribution_since2022 = iapt_referrals \
        .filter(F.col("ReferralRequestReceivedDate") > F.lit("2022-01-01")) \
            .groupBy(F.col("Age_ReferralRequest_ReceivedDate")) \
                .agg(F.countDistinct(F.col("Unique_ServiceRequestID")).alias("unique_referrals"))\
                    .orderBy("Age_ReferralRequest_ReceivedDate")

    if flag_data_display:
        referral_age_distribution_since2022.display()
        
else:
    display("Data check flag is set to False")

There are still quite a few referrals for people aged 0 to 15 in this sample.
The number of referrals ramps up from age 16 onward, which fits expectations.

### Referrals per month
Count unique referrals by the month in which the referral was received and by the month in which it was discharged. These counts will be the denominators for the aggregate summaries per provider.

In [0]:
# work out the year-month for referrals received and discharged
iapt_referrals_latest = iapt_referrals_latest \
    .withColumn("calc_Referral_Received_YM", F.date_format("ReferralRequestReceivedDate", "yyyy-MM")) \
        .withColumn("calc_Referral_Discharged_YM", F.date_format("ServDischDate", "yyyy-MM"))

if flag_data_display:
    iapt_referrals_latest \
        .select("Unique_ServiceRequestID", "ReferralRequestReceivedDate", "ServDischDate", "calc_Referral_Received_YM", "calc_Referral_Discharged_YM") \
            .limit(10).display()

In [0]:
if flag_data_check:
  # Count referrals received
  iapt_referrals_received_count = iapt_referrals_latest \
    .groupBy("calc_Referral_Received_YM") \
      .agg(F.countDistinct("PathwayID").alias("referrals_received_count")) \
        .orderBy("calc_Referral_Received_YM")
  # Count referrals discharged
  iapt_referrals_discharged_count = iapt_referrals_latest \
    .groupBy("calc_Referral_Discharged_YM") \
      .agg(F.countDistinct("PathwayID").alias("referrals_discharged_count")) \
        .orderBy("calc_Referral_Discharged_YM")
else:
  display("Flag Data Check is set to False")

In [0]:
if flag_data_check:
    iapt_referrals_received_count.display()
else:
    display("Flag Data Check is set to False")

In [0]:
if flag_data_check:
    iapt_referrals_discharged_count.display()
else:
    display("Flag Data Check is set to False")

**Referrals over time - comments**
- Some referrals were recorded as received as far back as the 1940s.
- These are not just a few extreme outliers: referrals appear in ones and twos per month in each decade (1950s, 1960s, 1970s, 1980s, etc.). Possibly linked with the person's year of birth?
- The count of referrals begins to ramp up from 2019 onwards, reaching the current steady state of around 140k referrals per month by about September 2020.

In [0]:
if flag_data_check:
  # Count distinct pathwayid by YearMonth of discharge
  iapt_referrals_per_month_provider = iapt_referrals_latest \
    .groupBy("calc_Referral_Discharged_YM", "ODS_Code", "Name") \
      .agg(F.countDistinct("PathwayID").alias("referral_count")) \
        .orderBy(F.desc("calc_Referral_Discharged_YM"), "ODS_Code")
  
  if flag_data_display:
    iapt_referrals_per_month_provider.display()

else:
  display("Flag Data Check is set to False")

## Contacts

### Contacts per month
Looking to calculate some matching variables based on a profile of contacts.

In [0]:
# work out the year-month for contacts
iapt_contacts_latest = iapt_contacts_latest \
    .withColumn("calc_Contact_YM", F.date_format("CareContDate", "yyyy-MM"))

if flag_data_display:
    iapt_contacts_latest \
        .select("Unique_ServiceRequestID", "Unique_CareContactID", "CareContDate", "calc_Contact_YM") \
            .limit(10).display()

In [0]:
if flag_data_display:
    iapt_contacts_latest \
        .groupBy("calc_Contact_YM") \
        .agg(F.countDistinct("Unique_CareContactID").alias("contacts_count")) \
            .orderBy("calc_Contact_YM").display()

### Contact location

In [0]:
if flag_data_check:
    # get the total number of contacts (denominator)
    total_contacts_count = (
        iapt_contacts_latest
        # limit to attended contacts (5) or attended late contacts (6)
        .filter(F.col("AttendOrDNACode").isin("5", "6"))
        .select(F.countDistinct("Unique_CareContactID")).collect()[0][0]
    )

    # summarise contacts by location
    ( iapt_contacts_latest
        # limit to attended contacts (5) or attended late contacts (6)
        .filter(F.col("AttendOrDNACode").isin("5", "6"))
        
        # count activity by location
        .groupBy("ActLocTypeCode")
        .agg(F.countDistinct("Unique_CareContactID").alias("contacts_count"))
        .withColumn("contacts_rate", F.col("contacts_count") / total_contacts_count)
        .withColumn("contacts_perc", F.format_number("contacts_rate", 2))
        .orderBy(F.col("contacts_count").desc())
        .display()
    )

This is interesting: there are many more contacts in non-hospital settings than I was expecting:
- 40% of attended contacts have a recorded location of `X01` (Other locations not elsewhere classified),
- 15% are `A01` (Patient's home), 
- 14% `null`, 
- 10% `A04` (Other patient related location),
- 9% `B01` (Primary care health centre),
- 3% `E01` (Out-Patient clinic),
- 2% `C01` (GP)

... very few seem to be in locations that would pose logistical issues in attending.

In [0]:
if flag_data_check:
    # let's simplify the locations to improve readability
    act_loc_type_mapping = {
        "Patient main residence": ["A01", "A02", "A03", "A04"],
        "Health Centre premises": ["B01", "B02"],
        "GP / Dentist / Ophthalmic": ["C01", "C02", "C03"],
        "Walk in Centres": ["D01", "D02", "D03"],
        "Hospital premises": ["E01", "E02", "E03", "E04", "E99"],
        "Hospice": ["F01"],
        "Nursing / Residential home": ["G01", "G02", "G03", "G04"],
        "Day Centre premises": ["H01"],
        "Resource Centre premises": ["J01"],
        "Children and Family premises": ["K01", "K02"],
        "Educational premises": ["L01", "L02", "L03", "L04", "L05", "L06", "L99"],
        "Justice premises": ["M01", "M02", "M03", "M04", "M05", "M06", "M07"],
        "Public locations": ["N01", "N02", "N03", "N04", "N05"],
        "Other locations": ["X01"]
    }

    # convert the dictionary to a dataframe
    mapping_list = [(code, group) for group, codes in act_loc_type_mapping.items() for code in codes]
    mapping_df = spark.createDataFrame(mapping_list, ["ActLocTypeCode", "calc_LocationGroup"])

    # summarise contacts by location group
    ( iapt_contacts_latest
        # limit to attended contacts (5) or attended late contacts (6)
        .filter(F.col("AttendOrDNACode").isin("5", "6"))

        # add the location groups
        .join(mapping_df, on="ActLocTypeCode", how="left")
        
        # count activity by location group
        .groupBy("calc_LocationGroup")
        .agg(F.countDistinct("Unique_CareContactID").alias("contacts_count"))
        .withColumn("contacts_rate", F.col("contacts_count") / total_contacts_count)
        .withColumn("contacts_perc", F.format_number("contacts_rate", 2))
        .orderBy(F.col("contacts_count").desc())
        .display()
    )

OK, the list now looks like this:
- 40% Other locations
- 24% Patient main residence
- 14% `null`
- 9% Health centre
- 5% Hospital
- 4% Public locations
- 2% GP

It is not clear what the `Other` locations are, or why they aren't covered by the extensive list of codes. Query: is this a data quality issue?

The `null` records are also prominent, being the third most frequently recorded option. Again, this raises queries about data quality.

Patient main residence seems quite high at 24% of contacts. Perhaps this is a result of telephone contacts?
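
The grouping step can be illustrated without Spark. The sketch below uses a subset of the location groups from this worksheet and a hypothetical list of contact codes (`Z99` is an invented, unmapped code) to show how shares are derived and how missing values behave.

```python
from collections import Counter

# Subset of the location groups used in this worksheet
groups = {
    "Patient main residence": ["A01", "A02", "A03", "A04"],
    "Hospital premises": ["E01", "E02", "E03", "E04", "E99"],
    "Other locations": ["X01"],
}

# invert the mapping: location code -> group label
code_to_group = {code: group for group, codes in groups.items() for code in codes}

# Hypothetical attended contacts; None means no location was recorded,
# "Z99" stands in for a code that is missing from the mapping
contact_codes = ["X01", "A01", "A04", None, "E01", "X01", "Z99", "X01"]

# unmapped and missing codes both fall into the None group
counts = Counter(code_to_group.get(code) for code in contact_codes)
total = len(contact_codes)
shares = {group: count / total for group, count in counts.items()}
```

Note that the `None` group here combines contacts with no recorded location and contacts whose code is absent from the mapping. The same would be true of a left join onto a mapping table, so it may be worth checking the two cases separately before reading the `null` share as purely missing data.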

# Matching variables
We proposed the following measures as matching variables in the 'Matching Variables' .pptx file (July 2025):

Referrals:
- **Number of referrals discharged** each month (proxy for size of service).
- Proportion of referrals for people **aged 25 years and younger** at referral.
- Proportion of referrals for people **aged 60 years and older** at referral.
- Proportion of referrals for people whose **gender identity is female**.
- Proportion of referrals for people whose **LSOA of residence is among the 20% most deprived** in England.
- Proportion of referrals for people whose **LSOA of residence is among the 20% least deprived** in England.
- Proportion of referrals for people whose **broad ethnic background is `White`**.
- Proportion of referrals where the **referral-to-treatment wait time is within six weeks**.
- Proportion of referrals where there was a **step-up to high-intensity** therapy.

Contacts:
- Proportion of care contacts where the **therapist has attained an NHS TT qualification**.
- Proportion of care contacts conducted **on hospital premises**.
- Proportion of care contacts conducted **face-to-face**.
- Proportion of care contacts conducted **outside of weekdays, 9am to 5pm**.
- Proportion of care contacts conducted **in English**.
- Proportion of care contacts conducted **with an interpreter present**.
- Proportion of care contacts delivered **as internet enabled therapy**.
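
Each proportion listed above is a numerator count divided by a denominator count for a provider and month. A minimal sketch, assuming hypothetical provider-month rows and a made-up helper `proportion`, shows the calculation and one way to handle months with no activity.

```python
def proportion(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero or missing."""
    if not denominator:
        return None
    return numerator / denominator

# Hypothetical provider-month rows; the field names are illustrative only
provider_months = [
    {"ODS_Code": "AAA", "month": "2024-01", "discharges_count": 200, "aged_under_26": 50},
    {"ODS_Code": "BBB", "month": "2024-01", "discharges_count": 0, "aged_under_26": 0},
]

for row in provider_months:
    row["prop_aged_under_26"] = proportion(row["aged_under_26"], row["discharges_count"])
```

Returning `None` rather than zero for an empty month keeps "no discharges" distinguishable from "no discharges in this age group", which matters if the proportions are later averaged across months.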

#### Referrals

In [0]:
if flag_data_display:
    # remind myself what fields are and the data they hold
    iapt_referrals_latest.limit(10).display()

In [0]:
# I'm preparing this list of matching variables in advance of sign-off by TT clients to save time
# NB, this section may need revisiting once final set of matching variables agreed
matching_referrals = (
    iapt_referrals_latest
        # only work with records that have been discharged
        .filter(F.col("ServDischDate").isNotNull())
        # coalesce gender identity and gender to get a single value - i.e. fill in gaps in gender identity with values from gender
        .withColumn("calc_gender_identity", F.coalesce(F.col("GenderIdentity"), F.col("Gender")))
        # flag records where there is a step-up from low-intensity to high-intensity therapy
        .withColumn("calc_step_up_therapy_flag", F.when(F.col("LowIntensityTherapy_FirstDate") < F.col("HighIntensityTherapy_FirstDate"), True).otherwise(False))
        # calculate matching variables
        .groupBy("calc_Referral_Discharged_YM", "ODS_Code", "Name").agg(

            # Total referrals discharged in the month
            F.countDistinct("PathwayID").alias("discharges_count"),

            # Outcome 1 - proportion of referrals where 6 or more completed appointments (1 assessment and 5 treatment)
            F.countDistinct(F.when(F.col("TreatmentCareContact_Count") >= 6, F.col("PathwayID"))).alias("o1_discharges_6_or_more_completed_appointments"),

            # Matching 1 - referrals discharged for people aged 25 years and younger on the date of referral
            F.countDistinct(F.when(F.col("Age_ReferralRequest_ReceivedDate") <= 25, F.col("PathwayID"))).alias("m1_discharges_aged_under_26_at_referral"),

            # Matching 2 - referrals discharged for people aged 60 years and older on the date of referral
            F.countDistinct(F.when(F.col("Age_ReferralRequest_ReceivedDate") >= 60, F.col("PathwayID"))).alias("m2_discharges_aged_60_plus_at_referral"),

            # Matching 3 - referrals discharged for people whose gender identity is female (2)
            F.countDistinct(F.when(F.col("calc_gender_identity") == "2", F.col("PathwayID"))).alias("m3_discharges_female"),

            # Matching 4 - referrals discharged for people whose LSOA of residence is in the 20% most deprived
            F.countDistinct(F.when(F.col("IndicesOfDeprivationDecile").isin(1,2), F.col("PathwayID"))).alias("m4_discharges_20pc_most_deprived"),

            # Matching 5 - referrals discharged for people whose LSOA of residence is in the 20% least deprived
            F.countDistinct(F.when(F.col("IndicesOfDeprivationDecile").isin(9,10), F.col("PathwayID"))).alias("m5_discharges_20pc_least_deprived"),

            # Matching 6 - referrals discharged for people whose ethnicity is 'White', i.e. codes A, B or C
            F.countDistinct(F.when(F.col("Validated_EthnicCategory").isin("A", "B", "C"), F.col("PathwayID"))).alias("m6_discharges_white_ethnicity"),

            # Matching 7 - referrals discharged where there was a step-up from low-intensity to high-intensity therapy
            F.countDistinct(F.when(F.col("calc_step_up_therapy_flag") == True, F.col("PathwayID"))).alias("m7_discharges_step_up_therapy")
        )
)

if flag_data_display:
    matching_referrals.display()

In [0]:
if flag_data_export:
    matching_referrals.write.mode("overwrite").parquet(eval_path + "matching_referrals.parquet")

#### Contacts

Contacts:
- Proportion of care contacts where the **therapist has attained an NHS TT qualification**.
- Proportion of care contacts conducted **on hospital premises**.
- Proportion of care contacts conducted **face-to-face**.
- Proportion of care contacts conducted **outside of weekdays, 9am to 5pm**.
- Proportion of care contacts conducted **in English**.
- Proportion of care contacts conducted **with an interpreter present**.
- Proportion of care contacts delivered **as internet enabled therapy**.
- Proportion of care contacts delivered **to an individual patient**.

In [0]:
if flag_data_display:
  # remind myself what fields are and the data they hold
  iapt_contacts_latest.limit(10).display()

In [0]:
matching_contacts = (
    iapt_contacts_latest
        
        # limit to attended contacts (5) or attended late contacts (6)
        .filter(F.col("AttendOrDNACode").isin("5", "6"))

        # calculate matching variables
        .groupBy("calc_Contact_YM", "ODS_Code", "Name").agg(

            # denominator
            F.countDistinct("Unique_CareContactID").alias("contacts_count"),

            # Matching 1 - contacts where the therapist attained an NHS TT qualification
            #F.countDistinct(F.when(F.col("TherapistAttainedNHS_TT_Qualification") == True, F.col("Unique_CareContactID"))).alias("m1_contacts_therapist_attained_nhs_tt_qualification")

            # Matching 2 - contacts conducted on hospital premises
            F.countDistinct(F.when(
                F.col("ActLocTypeCode").isin("E01", "E02", "E03", "E04", "E99"), 
                F.col("Unique_CareContactID")
                )
            ).alias("m2_contacts_hospital_premises"),

            # Matching 3 - contacts conducted face-to-face
            F.countDistinct(F.when(
                F.col("ConsMechanism").isin("01"),
                F.col("Unique_CareContactID")
                )
            ).alias("m3_contacts_face_to_face"),

            # Matching 4 - contacts conducted outside of weekdays, 9am-5pm
            # the weekday (Mon-Fri), 09:00-16:59 condition is negated so the count matches the alias
            F.countDistinct(F.when(
                ~(
                    F.weekday(F.col("CareContDate")).isin([0, 1, 2, 3, 4]) & 
                    F.hour(F.col("CareContTime")).between(9, 16)
                ),
                F.col("Unique_CareContactID")
                )
            ).alias("m4_contacts_outside_weekdays_9am_5pm"),

            # Matching 5 - contacts conducted in English,
            F.countDistinct(F.when(
                F.col("LanguageCodeTreat").isin(["eng"]),
                F.col("Unique_CareContactID")
                )
            ).alias("m5_contacts_conducted_in_english"),

            # Matching 6 - contacts with interpreter present
            F.countDistinct(F.when(
                F.col("InterpreterPresentInd").isin([1, 2, 3]),
                F.col("Unique_CareContactID")
                )
            ).alias("m6_contacts_interpreter_present")
        )
)
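
The 'outside of weekdays, 9am to 5pm' measure can be illustrated in plain Python. This sketch assumes that 5pm itself counts as outside core hours (so core hours are 09:00 to 16:59, Monday to Friday); that boundary is an assumption and should be confirmed against the agreed definition. The function name is made up for the example.

```python
from datetime import date, time

def is_outside_core_hours(contact_date, contact_time):
    """True when a contact falls outside Monday-Friday, 09:00 to 16:59."""
    is_weekday = contact_date.weekday() < 5  # Monday = 0 ... Sunday = 6
    in_hours = time(9, 0) <= contact_time < time(17, 0)
    return not (is_weekday and in_hours)

# 7 July 2025 is a Monday, 12 July 2025 is a Saturday
monday_morning = is_outside_core_hours(date(2025, 7, 7), time(10, 30))
monday_evening = is_outside_core_hours(date(2025, 7, 7), time(17, 0))
saturday_morning = is_outside_core_hours(date(2025, 7, 12), time(10, 30))
```

Writing the rule as the negation of "weekday and in hours" means weekend contacts are counted as outside core hours regardless of the time of day.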

In [0]:
if flag_data_export:
    matching_contacts.write.mode("overwrite").parquet(eval_path + "matching_contacts.parquet")

In [0]:
# NB, this is an early attempt at working with data, so may not be the best approach.
# Assumption: `iapt_referrals_latest` is the intended input table.

# Filter for UsePathway_Flag = 'True'
filtered_df = iapt_referrals_latest.filter(F.col("UsePathway_Flag") == "True")

# Calculate the mean number of days from referral received to first therapy session
mean_diff = filtered_df.select(
    F.mean(F.datediff(F.col("TherapySEssionFirstDate"), F.col("ReferralRequestReceivedDate"))).alias("mean_diff")
)

display(mean_diff)