## AFCARS DATASET DESCRIPTION

Data description and the proposed filling method

The dataset we received appears to be a part of the AFCARS dataset. The complete dataset description can be found at https://www.ndacan.acf.hhs.gov/datasets/dataset-details.cfm?ID=274 or [AFCARS Foster Care File Codebook (PDF)](https://www.ndacan.acf.hhs.gov/datasets/pdfs_user_guides/afcars-foster-care-file-codebook.pdf).

The proposed filling method tries to replace each blank with the most likely value for that column, as listed in the last column of the table below. To run the processing script, set ORIGINAL_DATASET_FILE to the path of the original dataset; the processed file is written to the path given in PROCESSED_DATASET_FILE.

| #    | Column   | Non-null | Null   | Missing Rate | Description                                                  | Preprocessing     |
| ---- | -------- | -------- | ------ | ------------ | ------------------------------------------------------------ | ----------------- |
| 0    | FY       | 631893   | 0      | 0.000        | The Federal Fiscal Year of this dataset. All 2020 in this file |                   |
| 1    | Version  | 631893   | 0      | 0.000        | All 1 in this file                                           | IGNORE            |
| 2    | State    | 631893   | 0      | 0.000        | State code, 1-72; Massachusetts (MA) is 25                   | keep only 25, then drop |
| 3    | St       | 631893   | 0      | 0.000        | 52 unique state names                                        | IGNORE            |
| 4    | RepDatYr | 631893   | 0      | 0.000        | Reporting Period End Date: Year. 2020 or 2021 in this file   |                   |
| 5    | RepDatMo | 631893   | 0      | 0.000        | Reporting Period End Date: Month                             |                   |
| 6    | FIPSCode | 631884   | 9      | 0.000        | Local Agency FIPS Code. The 5-digit Federal Information  Processing Standard (FIPS) of the county or equivalent unit which has  responsibility for the case. | IGNORE            |
| 7    | RecNumbr | 631893   | 0      | 0.000        | Record Number (AFCARS ID)                                    | IGNORE            |
| 8    | Sex      | 631805   | 88     | 0.000        | The biological sex of the child. 1: Male, 2: Female. "Other" or "unknown" is recorded as missing/blank/null | fill blank with 3 (other/unknown) |
| 9    | AmIAKN   | 625469   | 6424   | 0.010        | Child Race American Indian or Alaska Native, 0: Does Not Apply, 1:  Applies | fill blank with 0 |
| 10   | Asian    | 625463   | 6430   | 0.010        | Child Race Asian, 0: Does Not Apply, 1: Applies              | fill blank with 0 |
| 11   | BlkAfrAm | 625464   | 6429   | 0.010        | Child Race Black or African American, 0: Does Not Apply, 1:  Applies | fill blank with 0 |
| 12   | HawaiiPI | 625463   | 6430   | 0.010        | Child Race Native Hawaiian or Other Pacific Islander, 0: Does  Not Apply, 1: Applies | fill blank with 0 |
| 13   | White    | 625469   | 6424   | 0.010        | Child Race White, 0: Does Not Apply, 1: Applies              | fill blank with 0 |
| 14   | UnToDetm | 625429   | 6464   | 0.010        | Child Race Unable to Determine, 0: Does Not Apply, 1: Applies | fill blank with 0 |
| 15   | HisOrgin | 622904   | 8989   | 0.014        | Child Hispanic or Latino Ethnicity, 1: Yes, 2: No, 3: Unable  to Determine | fill blank with 2 |
| 16   | ClinDis  | 617037   | 14856  | 0.024        | Child Has Been Clinically Diagnosed with Disability, 1:  Yes, 2:No, 3:Unable to Determine | fill blank with 2 |
| 17   | MR       | 621448   | 10445  | 0.017        | Mental Retardation, 0: Does Not Apply, 1: Applies            | fill blank with 0 |
| 18   | VisHear  | 621447   | 10446  | 0.017        | Visually or Hearing Impaired, 0: Does Not Apply, 1: Applies  | fill blank with 0 |
| 19   | PhyDis   | 621448   | 10445  | 0.017        | Physically Disabled, 0: Does Not Apply, 1: Applies           | fill blank with 0 |
| 20   | EmotDist | 621448   | 10445  | 0.017        | Emotionally Disturbed, 0: Does Not Apply, 1: Applies         | fill blank with 0 |
| 21   | OtherMed | 621449   | 10444  | 0.017        | Other Medically Diagnosed Condition Requiring Special Care, 0:  Does Not Apply, 1: Applies | fill blank with 0 |
| 22   | EverAdpt | 611970   | 19923  | 0.032        | Child Has Previously Been Adopted, 1: Yes, 2: No, 3: Unable to Determine | fill blank with 2 |
| 23   | AgeAdopt | 517505   | 114388 | 0.181        | Age on Date of Legal Adoption, 0: Not Applicable, 1: Less than 2 years old, 2: 2-5 years old, 3: 6-12 years old, 4: 13 years or older, 5: Unable to determine | fill blank with 0 |
| 24   | TotalRem | 631289   | 604    | 0.001        | Total Number of Removals from Home                           | fill blank with 0 |
| 25   | NumPlep  | 629931   | 1962   | 0.003        | Number of Placement Settings in Current FC Episode           | fill blank with 0 |
| 26   | ManRem   | 629872   | 2021   | 0.003        | Manner of removal from home for current episode, 1: Voluntary,  2: Court Ordered, 3: Not Yet Determined | fill blank with 3 |
| 27   | PhyAbuse | 629349   | 2544   | 0.004        | Removal Reason-Physical Abuse, 0: Does Not Apply, 1: Applies | fill blank with 0 |
| 28   | SexAbuse | 629351   | 2542   | 0.004        | Removal Reason-Sexual Abuse, 0: Does Not Apply, 1: Applies   | fill blank with 0 |
| 29   | Neglect  | 629349   | 2544   | 0.004        | Removal Reason-Neglect, 0: Does Not Apply, 1: Applies        | fill blank with 0 |
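
The counts in the table can be reproduced with a few lines of pandas. The following is a minimal sketch, not the script used to build the table: the helper name `missing_summary` and the toy frame are assumptions for illustration, and the missing rate is taken here as null count divided by total row count.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column non-null count, null count, and missing rate (null / total rows)."""
    total = len(frame)
    nulls = frame.isna().sum()
    return pd.DataFrame(
        {
            "non_null": total - nulls,
            "null": nulls,
            "missing_rate": (nulls / total).round(3),
        }
    )


# Toy stand-in for the real file (hypothetical values, two of the AFCARS columns).
toy = pd.DataFrame(
    {
        "Sex": [1, 2, None, 1],
        "AgeAdopt": [0, None, None, 3],
    }
)

summary = missing_summary(toy)
print(summary)
```

On the real data, the same helper would be called on the frame returned by `pd.read_csv(ORIGINAL_DATASET_FILE)`.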

In [22]:
ORIGINAL_DATASET_FILE = "dataset/DATA2020.CSV"
PROCESSED_DATASET_FILE = "dataset/AFCARS2020-preprocessed.csv"

In [23]:
import pandas as pd

df = pd.read_csv(ORIGINAL_DATASET_FILE)

In [24]:
# Preprocessing: keep only Massachusetts rows (State == 25), then drop the ignored columns.
# .copy() makes the filtered frame independent of the original, which avoids
# SettingWithCopyWarning in the column assignments below.
df = df[df["State"] == 25].copy()
df = df.drop(columns=["Version", "State", "St", "FIPSCode", "RecNumbr"])

df["Sex"] = df["Sex"].fillna(3).astype("int")

df["AmIAKN"] = df["AmIAKN"].fillna(0).astype("int")

df["Asian"] = df["Asian"].fillna(0).astype("int")

df["BlkAfrAm"] = df["BlkAfrAm"].fillna(0).astype("int")

df["HawaiiPI"] = df["HawaiiPI"].fillna(0).astype("int")

df["White"] = df["White"].fillna(0).astype("int")

df["UnToDetm"] = df["UnToDetm"].fillna(0).astype("int")

df["HisOrgin"] = df["HisOrgin"].fillna(2).astype("int")

df["ClinDis"] = df["ClinDis"].fillna(2).astype("int")

df["MR"] = df["MR"].fillna(0).astype("int")

df["VisHear"] = df["VisHear"].fillna(0).astype("int")

df["PhyDis"] = df["PhyDis"].fillna(0).astype("int")

df["EmotDist"] = df["EmotDist"].fillna(0).astype("int")

df["OtherMed"] = df["OtherMed"].fillna(0).astype("int")

df["EverAdpt"] = df["EverAdpt"].fillna(2).astype("int")

df["AgeAdopt"] = df["AgeAdopt"].fillna(0).astype("int")

df["TotalRem"] = df["TotalRem"].fillna(0).astype("int")

df["NumPlep"] = df["NumPlep"].fillna(0).astype("int")

df["ManRem"] = df["ManRem"].fillna(3).astype("int")

df["PhyAbuse"] = df["PhyAbuse"].fillna(0).astype("int")

df["SexAbuse"] = df["SexAbuse"].fillna(0).astype("int")

df["Neglect"] = df["Neglect"].fillna(0).astype("int")
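
The per-column fills above all follow the same pattern, so they can also be driven from a single mapping that mirrors the last column of the table. This is a sketch under assumptions: `FILL_DEFAULTS` and `fill_blanks` are hypothetical names, only a subset of the columns is listed, and the toy frame stands in for the real data.

```python
import pandas as pd

# Hypothetical mapping: column name -> value used for blanks (subset of the table).
FILL_DEFAULTS = {"Sex": 3, "HisOrgin": 2, "AgeAdopt": 0, "ManRem": 3}


def fill_blanks(frame: pd.DataFrame, defaults: dict) -> pd.DataFrame:
    """Return a copy with blanks replaced by the per-column default, cast to int."""
    out = frame.copy()
    for column, value in defaults.items():
        out[column] = out[column].fillna(value).astype("int")
    return out


# Toy stand-in for the filtered frame; None marks a blank cell.
toy = pd.DataFrame(
    {
        "Sex": [1.0, None, 2.0],
        "HisOrgin": [None, 1.0, 2.0],
        "AgeAdopt": [None, None, 4.0],
        "ManRem": [2.0, None, 1.0],
    }
)

filled = fill_blanks(toy, FILL_DEFAULTS)
print(filled)
```

Keeping the defaults in one dictionary makes it easier to compare the code against the table when a fill value changes.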


In [25]:
df.describe(include="all")


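|       | FY     | RepDatYr    | RepDatMo | Sex      | AmIAKN   | Asian    | BlkAfrAm | HawaiiPI | White    | UnToDetm | ... | EmotDist | OtherMed | EverAdpt | AgeAdopt | TotalRem | NumPlep  | ManRem   | PhyAbuse | SexAbuse | Neglect  |
| ----- | ------ | ----------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | --- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| count | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | ... | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 | 14289.0 |
| mean  | 2020.0 | 2020.016866 | 7.791098 | 1.489747 | 0.020645 | 0.016446 | 0.272027 | 0.002939 | 0.680314 | 0.123382 | ... | 0.122122 | 0.102246 | 2.012527 | 0.213521 | 1.34894 | 3.633704 | 1.971307 | 0.078172 | 0.018266 | 0.689761 |
| std   | 0.0 | 0.128774 | 2.406735 | 0.500612 | 0.142199 | 0.127188 | 0.44502 | 0.054138 | 0.466371 | 0.328886 | ... | 0.327438 | 0.302983 | 0.231354 | 0.951348 | 0.670642 | 3.819053 | 0.200475 | 0.268451 | 0.133916 | 0.462607 |
| min   | 2020.0 | 2020.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 2020.0 | 2020.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 50%   | 2020.0 | 2020.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 |
| 75%   | 2020.0 | 2020.0 | 9.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 4.0 | 2.0 | 0.0 | 0.0 | 1.0 |
| max   | 2020.0 | 2021.0 | 9.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 3.0 | 5.0 | 9.0 | 49.0 | 3.0 | 1.0 | 1.0 | 1.0 |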
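
The min and max rows of this summary give a rough range check on the coded columns. A stricter check compares every value against the codes documented in the description table. The sketch below is an assumption-laden illustration: `ALLOWED_CODES` and `unexpected_codes` are hypothetical names, the code sets cover only a few columns (with 3 included for Sex because blanks are filled with 3), and the toy frame deliberately contains one invalid value.

```python
import pandas as pd

# Hypothetical subset of allowed codes, taken from the description table.
ALLOWED_CODES = {
    "Sex": {1, 2, 3},
    "HisOrgin": {1, 2, 3},
    "AgeAdopt": {0, 1, 2, 3, 4, 5},
    "ManRem": {1, 2, 3},
}


def unexpected_codes(frame: pd.DataFrame, allowed: dict) -> dict:
    """Map each column to the sorted list of values outside its allowed code set."""
    problems = {}
    for column, codes in allowed.items():
        found = set(frame[column].unique().tolist()) - codes
        if found:
            problems[column] = sorted(found)
    return problems


# Toy frame with one out-of-range AgeAdopt value (7).
toy = pd.DataFrame(
    {
        "Sex": [1, 2, 3],
        "HisOrgin": [2, 2, 1],
        "AgeAdopt": [0, 5, 7],
        "ManRem": [1, 2, 3],
    }
)

problems = unexpected_codes(toy, ALLOWED_CODES)
print(problems)
```

An empty dictionary means every checked column contains only documented codes.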


In [26]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 14289 entries, 329 to 306109
Data columns (total 25 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   FY        14289 non-null  int64
 1   RepDatYr  14289 non-null  int64
 2   RepDatMo  14289 non-null  int64
 3   Sex       14289 non-null  int64
 4   AmIAKN    14289 non-null  int64
 5   Asian     14289 non-null  int64
 6   BlkAfrAm  14289 non-null  int64
 7   HawaiiPI  14289 non-null  int64
 8   White     14289 non-null  int64
 9   UnToDetm  14289 non-null  int64
 10  HisOrgin  14289 non-null  int64
 11  ClinDis   14289 non-null  int64
 12  MR        14289 non-null  int64
 13  VisHear   14289 non-null  int64
 14  PhyDis    14289 non-null  int64
 15  EmotDist  14289 non-null  int64
 16  OtherMed  14289 non-null  int64
 17  EverAdpt  14289 non-null  int64
 18  AgeAdopt  14289 non-null  int64
 19  TotalRem  14289 non-null  int64
 20  NumPlep   14289 non-null  int64
 21  ManRem    14289 non-null  int64
 22  

In [27]:
df.to_csv(PROCESSED_DATASET_FILE, index=False)


In [28]:
# load the preprocessed data
df = pd.read_csv(PROCESSED_DATASET_FILE)
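
Because the file is written with `index=False`, the original row labels (329 to 306109 in the `df.info()` output) are not saved, and the reloaded frame gets a fresh 0-based index. A quick round-trip check, sketched here on a toy frame with an in-memory buffer instead of the real file paths, confirms that values and integer dtypes survive the save and reload. The frame contents and variable names are illustrative assumptions.

```python
import io

import pandas as pd

# Toy stand-in for the processed frame, with non-contiguous row labels
# like those left behind by the State == 25 filter.
processed = pd.DataFrame(
    {"FY": [2020, 2020], "Sex": [1, 3], "AgeAdopt": [0, 2]},
    index=[329, 4012],
)

# Write to an in-memory buffer instead of PROCESSED_DATASET_FILE.
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The row labels are dropped on save, so compare against a reset index.
roundtrip_ok = reloaded.equals(processed.reset_index(drop=True))
print(roundtrip_ok)
```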
