# Check CiC Dataset

#### Getting to Know the Dataset

In [None]:
# Load Libraries
library(dplyr)
library(here)
library(tidyr)
library(ggplot2)
library(bigrquery)
bq_auth()

In [None]:
# Pull Data
# Store the project ID
project_id <- "yhcr-prd-phm-bia-core"

# Store Tables of Interest
targetdb1 <- 'yhcr-prd-phm-bia-core.CB_FDM_ChildrensSocialCare'
targetdb1 <- gsub(' ', '', targetdb1)
print(targetdb1)

# Create SQL command
sql1 <- paste0('select * from ', targetdb1, '.tbl_CiC limit 10000;')

# Run Query
tb1 <- bq_project_query(project_id, sql1)

# Load into Dataframe
table <- bq_table_download(tb1)

table


In [None]:
# Look at summary
summary(table)

In [None]:
# Check unique entries for EthnicOrigin
unique(table$EthnicOrigin)

# Count number of unique entries for EthnicOrigin
length(unique(table$EthnicOrigin))

There are 16 ethnic origin categories including 'Information Not Yet Obtained'.

In [None]:
# Check unique entries for PCArea_Home
unique(table$PCArea_Home)

# Count number of unique entries for PCArea_Home
length(unique(table$PCArea_Home))

There are 37 unique area codes. Some appear to be outside Bradford. The code OO00 appears to mark records where the area is not available.

In [None]:
#### Check for Errors in Entries

In [None]:
# Check if all person_ids are unique
length(unique(table$person_id))

# Number of records
nrow(table)

There are more records than person_ids in the dataset. There are 895 records and 817 unique person_ids, meaning that 78 records are additional entries for a person_id that already appears.

In [None]:
# Count number of occurrences of each person_id and keep those appearing more than once
person_id_counts <- table %>%
  group_by(person_id) %>%
  summarise(count = n()) %>%
  filter(count > 1)

nrow(person_id_counts)

65 person_ids have at least one additional record.

#### Check if any additional entries are duplicate records

In [None]:
# Identify any duplicate records

table_dist <- table %>%
  distinct()

nrow(table_dist)


Two records were exact duplicates. Removing them leaves 893 records, of which 76 are additional entries for a person_id that already appears. This needs to be considered in further analyses.
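To see which rows are exact duplicates rather than only counting them, a minimal base R sketch follows. The `toy` data frame and its values are hypothetical stand-ins for `table`, made up purely for illustration.

```r
# Hypothetical toy data (not from the CiC table): rows 1 and 2 are exact duplicates
toy <- data.frame(
  person_id   = c(1, 1, 2, 2),
  PCArea_Home = c("BD1", "BD1", "BD2", "BD3")
)

# Flag every copy of a duplicated row (first occurrence and its repeats)
dup_rows <- toy[duplicated(toy) | duplicated(toy, fromLast = TRUE), ]

# Number of rows that removing exact duplicates would drop
n_removed <- nrow(toy) - nrow(unique(toy))

dup_rows
n_removed
```

Inspecting the flagged rows before dropping them helps confirm that they are true duplicates and not distinct records that happen to share every recorded value.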


#### Checking reason for multiple entries
The next step is to determine whether the additional entries are separate care episodes or duplicates with differing variables (e.g. differing EthnicOrigin or PCArea_Home).
Person_ids with additional entries were selected, and their records were compared to see which variables differed between them.

In [None]:
# Keep the records of person_ids that have more than one entry
multiple_person_ids <- table_dist %>%
  group_by(person_id) %>%
  filter(n() > 1) %>%
  ungroup()

In [None]:
# Compare differences within each person_id
df_char <- multiple_person_ids %>%
  mutate(across(everything(), as.character))

# Pivot longer to compare entries
pivot_long <- df_char %>%
  pivot_longer(cols = -person_id, names_to = "Variable", values_to = "Value")

# Identify differences within each person_id:
# Different is TRUE when a variable takes more than one distinct value for that person
differences <- pivot_long %>%
  group_by(person_id, Variable) %>%
  summarise(Different = n_distinct(Value) > 1, .groups = "drop") %>%
  filter(Different)

print(differences)

In [None]:
# Summarise the number of differences for each variable
differences_summary <- differences %>%
  group_by(Variable) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

print(differences_summary)

# Plot the summary
plot <- ggplot(differences_summary, aes(x = reorder(Variable, Count), y = Count)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(title = "Differences Within person_id for Each Variable",
       x = "Variable",
       y = "Number of person_id with Differences") +
  theme_minimal()

plot

When checking which variables differed at least once within each person_id, every person_id showed at least one difference in the care start and end dates, suggesting that the additional records are different episodes of care. However, some person_ids also showed different entries in PCArea_Home.
To determine whether any additional entries were repeats of the same care episode with only PCArea_Home changed, further checks were undertaken.

In [None]:
# Identify person_id with differences in PCArea_Home
differences_PCArea_Home <- table_dist %>%
  group_by(person_id) %>%
  filter(n_distinct(PCArea_Home) > 1) %>%
  ungroup()

# Check if these person_id have any repeated StartDate or EndDate
differences_dates <- differences_PCArea_Home %>%
  group_by(person_id) %>%
  summarise(
    start_dates_same = any(duplicated(StartDate)),
    end_dates_same = any(duplicated(EndDate))
  ) %>%
  filter(start_dates_same | end_dates_same)

# Count how many person_ids have a repeated StartDate or EndDate
true_count <- differences_dates %>%
  summarise(Count = n())

print("person_id with the same StartDate or EndDate for those with differences in PCArea_Home:")
print(differences_dates)

print("Number of person_id with the same StartDate or EndDate:")
print(true_count)

All entries with a different PCArea_Home also had different start and end dates. This indicates that none of them are duplicate entries of the same care episode, and reassures us that the additional entries for these person_ids are indeed additional care episodes.

The fact that PCArea_Home differs for some people is important to consider during analyses.

#### Overlapping dates
Next, I checked whether any of the additional entries have overlapping dates, which would indicate data entry errors.

In [None]:
# Function to check for overlapping dates within one person's episodes.
# Two episodes overlap when each one starts on or before the other ends.
check_overlap <- function(dates) {
  n <- nrow(dates)
  if (n < 2) return(FALSE) # a single episode cannot overlap with anything
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (dates$StartDate[i] <= dates$EndDate[j] && dates$EndDate[i] >= dates$StartDate[j]) {
        return(TRUE)
      }
    }
  }
  FALSE
}


# Check for overlapping dates within these person_ids
overlapping_dates <- multiple_person_ids %>%
  group_by(person_id) %>%
  summarise(overlap = check_overlap(cur_data())) %>%
  filter(overlap == TRUE)


print("person_id with overlapping StartDate and EndDate:")
print(overlapping_dates)

No dates overlapped among the additional entries.
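The same check can be written without a nested loop. This is a sketch that assumes no missing dates and that each episode ends on or after it starts. Once a person's episodes are sorted by StartDate, an overlap exists if and only if some episode starts on or before the previous episode's EndDate. The `toy` data frame and its values are hypothetical, made up for illustration.

```r
library(dplyr)

# Hypothetical toy episodes (not from the CiC table);
# person 2 is given overlapping episodes on purpose
toy <- data.frame(
  person_id = c(1, 1, 2, 2),
  StartDate = as.Date(c("2018-01-10", "2019-02-01", "2019-03-02", "2019-06-01")),
  EndDate   = as.Date(c("2018-06-01", "2019-08-15", "2019-12-24", "2020-01-31"))
)

toy_overlaps <- toy %>%
  group_by(person_id) %>%
  arrange(StartDate, .by_group = TRUE) %>%
  # Compare each episode's start with the previous episode's end (NA for the first episode)
  mutate(overlaps_previous = StartDate <= dplyr::lag(EndDate)) %>%
  summarise(overlap = any(overlaps_previous, na.rm = TRUE)) %>%
  filter(overlap)

toy_overlaps
```

As in the loop version, episodes that end and restart on the same day count as overlapping because the comparison uses `<=`.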

Future analyses may consider creating a "care_episodes" variable to indicate the sequence of care episodes. For example, a person who has been in care 5 times would have values of 1 to 5 in the care_episodes column, representing the order of their episodes. However, this numbering only reflects the episodes recorded in this dataset; children may have been in care before the period it covers.
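One possible way to build such a variable is sketched below, assuming that episodes are ordered by StartDate. The `toy` data frame and its values are hypothetical. The column is named `care_episodes` to match the text above.

```r
library(dplyr)

# Hypothetical toy episodes (not from the CiC table), deliberately out of date order
toy <- data.frame(
  person_id = c(1, 1, 2, 1),
  StartDate = as.Date(c("2020-05-01", "2018-01-10", "2019-03-02", "2021-07-15")),
  EndDate   = as.Date(c("2020-09-30", "2018-06-01", "2019-12-24", "2022-01-05"))
)

# Number each person's episodes in order of StartDate
toy_episodes <- toy %>%
  group_by(person_id) %>%
  arrange(StartDate, .by_group = TRUE) %>%
  mutate(care_episodes = row_number()) %>%
  ungroup()

toy_episodes
```

Applied to the real data, the same pipeline would presumably start from `table_dist` (the de-duplicated records) so that exact duplicates do not inflate the episode numbers.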