# Detention By Nationality Analysis

The full methodology for this analysis is available [here](../methodology.md).

## Load the data

In [1]:
import pandas as pd
import sys
sys.path.append("../utils")
import loaders

*Note: `loaders` is a custom module that handles basic data loading. It is available [here](https://github.com/BuzzFeedNews/2015-08-immigration/blob/master/utils/loaders.py).*

In [2]:
first_scheduled_proceeding = pd.read_csv("../data/first-scheduled-proceeding.csv", 
     parse_dates=["ADJ_DATE"],
     dtype={ "IDNCASE": str, "IDNPROCEEDING": str })

*Note: `first-scheduled-proceeding.csv` is a pre-processed data file. The code that creates it from `tbl_schedule.csv` is available [here](../utils/generate-first-scheduled-proceeding.py).*

In [3]:
nationality_table = loaders.load_file("tblLookupNationality.csv")

In [4]:
case_date_list = [
    "E_28_DATE",
    "DATE_OF_ENTRY",
    "C_BIRTHDATE",
    "C_RELEASE_DATE",
    "DATE_DETAINED",
    "DATE_RELEASED"
]

In [5]:
_cases = loaders.load_file("A_tblCase.csv",
    parse_dates=case_date_list,
    dtype={ "IDNCASE": str })

In [6]:
_cases["GENDER"] = _cases["GENDER"].fillna("UNK")

In [7]:
_charges = loaders.load_file("B_tblProceedCharges.csv",
    dtype={ "IDNCASE": str, "IDNPROCEEDING": str })

Skipping line 1165848: expected 5 fields, saw 6

Skipping line 1433634: expected 5 fields, saw 6

Skipping line 2646392: expected 5 fields, saw 6

Skipping line 2847501: expected 5 fields, saw 6

Skipping line 2947399: expected 5 fields, saw 6

Skipping line 3131015: expected 5 fields, saw 6



*Note: Six of the more than 8 million rows in the charges table contain malformed data stemming from extra tab characters. The loader skips those rows and emits the warning messages above.*
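A minimal sketch of how such rows could be located before loading, assuming the file is tab-delimited (as the note above implies) and that a well-formed row has five fields, as the warnings report. The sample text and the `find_malformed_rows` helper are hypothetical and are not part of `loaders`.

```python
import csv
import io

def find_malformed_rows(handle, expected_fields, delimiter="\t"):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [
        (line_number, len(row))
        for line_number, row in enumerate(reader, start=1)
        if len(row) != expected_fields
    ]

# Made-up sample: the third line carries a stray tab, giving six fields.
sample = io.StringIO(
    "a\tb\tc\td\te\n"
    "1\t2\t3\t4\t5\n"
    "1\t2\t\t3\t4\t5\n"
)
malformed = find_malformed_rows(sample, expected_fields=5)
print(malformed)
```

Running a check like this on the real file would show whether the skipped rows can be repaired by hand rather than dropped.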

## Process the data

Join the various tables and prepare them for analysis.

In [8]:
charges_group = _charges.groupby([ "IDNCASE", "IDNPROCEEDING" ])

In [9]:
charge_lists = pd.DataFrame({
    "charge_list": charges_group["CHARGE"].apply("|".join)
}).reset_index()

In [10]:
charge_lists.head()

| | IDNCASE | IDNPROCEEDING | charge_list |
|---|---|---|---|
| 0 | 2046920 | 3200048 | 212a06Ai |
| 1 | 2046921 | 3200049 | 212a06Ai |
| 2 | 2046922 | 3200050 | 212a06Ai |
| 3 | 2046923 | 3200051 | 212a06Ci |
| 4 | 2046923 | 3525150 | 212a06Ci |
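To illustrate the grouping step on a proceeding with more than one charge, here is a small sketch on made-up rows. The case numbers and charge codes are invented for the example; only the column names come from the cells above.

```python
import pandas as pd

# Toy charges table: case "1" has two charges on the same proceeding.
toy_charges = pd.DataFrame({
    "IDNCASE": ["1", "1", "2"],
    "IDNPROCEEDING": ["10", "10", "20"],
    "CHARGE": ["212a06Ai", "237a02Aiii", "212a06Ai"],
})

# One row per (case, proceeding), with charges joined by "|".
toy_charge_lists = (
    toy_charges
    .groupby(["IDNCASE", "IDNPROCEEDING"])["CHARGE"]
    .apply("|".join)
    .reset_index(name="charge_list")
)
print(toy_charge_lists)
```

Because the charges end up in a single delimited string, later cells can test for a charge with a substring match.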


In [11]:
assert(charge_lists["IDNCASE"].nunique() == 5033293)
assert(len(first_scheduled_proceeding) == 5045511)

From the numbers above: a small fraction of cases, approximately 0.2%, have a scheduled proceeding but no charges.
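The arithmetic behind that figure can be sketched as follows. It assumes each row of `first_scheduled_proceeding` corresponds to exactly one case, which the assertions above do not themselves verify.

```python
# Counts taken from the assertions in the cell above.
cases_with_charges = 5033293
first_proceedings = 5045511

# Cases that have a scheduled proceeding but no row in the charges table.
missing = first_proceedings - cases_with_charges
share_percent = 100.0 * missing / first_proceedings
print(missing, "{0:.2f}%".format(share_percent))
```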

In [12]:
cases_with_first_proceeding = first_scheduled_proceeding\
    .merge(charge_lists, how="left", on=[ "IDNCASE", "IDNPROCEEDING" ])\
    .merge(_cases, how="left", on="IDNCASE", suffixes=["_schedule", "_case"])

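Legal representatives file the EOIR-28 form to notify the court that they represent a given immigrant.

`ADJ_DATE` in this table indicates the date of the case's first proceeding. A case is flagged as having legal representation at its first proceeding when `E_28_DATE` falls on or before `ADJ_DATE`.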

In [13]:
cases_with_first_proceeding["legal_rep_at_first_proceeding"] = cases_with_first_proceeding\
    .apply(lambda x: x["E_28_DATE"] <= x["ADJ_DATE"], axis=1)
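As a sketch of how this flag behaves, the same comparison can be written as a vectorized column comparison, shown here on made-up dates. Note that a missing `E_28_DATE` compares as `False`, so cases with no form on file count as unrepresented. This is an illustrative alternative, not the code used in the analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    # Filed before the hearing, never filed, filed after the hearing.
    "E_28_DATE": pd.to_datetime(["2010-01-05", None, "2010-03-01"]),
    "ADJ_DATE": pd.to_datetime(["2010-02-01", "2010-02-01", "2010-02-01"]),
})

# Comparisons against NaT evaluate to False.
toy["legal_rep_at_first_proceeding"] = toy["E_28_DATE"] <= toy["ADJ_DATE"]
flags = toy["legal_rep_at_first_proceeding"].tolist()
print(flags)
```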

## Select non-criminal removal cases between Jan. 1, 2003 and Jan. 1, 2015

In [14]:
selected_cases = cases_with_first_proceeding[
     # Select cases with first-scheduled-hearing dates in 2003–2014
    (cases_with_first_proceeding["ADJ_DATE"] >= "2003-01-01") &
    (cases_with_first_proceeding["ADJ_DATE"] < "2015-01-01") &
    # Remove unaccompanied children
    (cases_with_first_proceeding["CASEPRIORITY_CODE"] != "UC") & 
    # Keep only "removal" cases
    (cases_with_first_proceeding["CASE_TYPE"] == "RMV")
].copy()
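A small sketch of how this filter treats edge cases, using made-up rows: the date window is half-open (Jan. 1, 2003 is included, Jan. 1, 2015 is not), and a missing `CASEPRIORITY_CODE` passes the `!= "UC"` test. The toy values are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "ADJ_DATE": pd.to_datetime(
        ["2002-12-31", "2003-01-01", "2014-12-31", "2015-01-01", "2010-06-15"]
    ),
    "CASEPRIORITY_CODE": [None, None, "UC", None, None],
    "CASE_TYPE": ["RMV"] * 5,
})

start, end = pd.Timestamp("2003-01-01"), pd.Timestamp("2015-01-01")
mask = (
    (toy["ADJ_DATE"] >= start) &
    (toy["ADJ_DATE"] < end) &
    (toy["CASEPRIORITY_CODE"] != "UC") &   # missing codes are kept
    (toy["CASE_TYPE"] == "RMV")
)
kept = toy[mask].index.tolist()
print(kept)
```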

In [15]:
selected_cases["has_criminal_charge"] = (
    selected_cases["charge_list"].str.contains("237a02") |
    selected_cases["charge_list"].str.contains("212a02")
)
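Cases without any charges have a missing `charge_list`, and `str.contains` returns a missing value for them by default. The sketch below, on made-up charge strings, shows one way to make that explicit with `na=False` and a single pattern. This is an assumed variant, not the cell's actual code, and it treats charge-less cases as non-criminal.

```python
import pandas as pd

toy_charge_list = pd.Series([
    "212a06Ai",              # no criminal-ground prefix
    "237a02Aiii|212a06Ai",   # contains 237a02
    "212a02AiI",             # contains 212a02
    None,                    # proceeding with no charges
])

# One regular expression covering both prefixes; missing values become False.
toy_has_criminal = toy_charge_list.str.contains("237a02|212a02", na=False)
print(toy_has_criminal.tolist())
```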

In [16]:
selected_cases["detained"] = selected_cases["CUSTODY"].map({"N": 0, "D": 1, "R": 1})

In [17]:
non_crim_selected_cases = selected_cases[~selected_cases["has_criminal_charge"]].copy()

## Overall detention rate for non-Mexican non-criminal cases

In [18]:
non_crim_non_mex = non_crim_selected_cases[non_crim_selected_cases["NAT"] != "MX"]
print("{0:.1f}%".format(100 * non_crim_non_mex["detained"].mean()))

39.1%


## Calculate detention rates by nationality

In [19]:
non_crim_custody_by_nationality = non_crim_selected_cases.groupby(["NAT", "CUSTODY"])\
    .size()\
    .unstack()\
    .fillna(0)

In [20]:
non_crim_custody_by_nationality["total"] = non_crim_custody_by_nationality.sum(axis=1)

In [21]:
non_crim_custody_by_nationality["percent_detained"] = non_crim_custody_by_nationality\
    .apply(lambda x: round(100.0 * (x["D"] + x["R"]) / x["total"], 1), axis=1)
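The three cells above can be sketched end to end on made-up data. The nationality and custody values below are invented, and the percentage is computed with column arithmetic rather than a row-wise `apply`, which should give the same result.

```python
import pandas as pd

toy = pd.DataFrame({
    "NAT": ["MX", "MX", "MX", "MX", "GT", "GT"],
    "CUSTODY": ["D", "D", "R", "N", "N", "N"],
})

# Count cases per nationality and custody status; absent combinations become 0.
counts = toy.groupby(["NAT", "CUSTODY"]).size().unstack().fillna(0)
counts = counts.reindex(columns=["N", "D", "R"], fill_value=0)

counts["total"] = counts[["N", "D", "R"]].sum(axis=1)
counts["percent_detained"] = (
    100.0 * (counts["D"] + counts["R"]) / counts["total"]
).round(1)
print(counts)
```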

In [22]:
nationality_table.set_index("NAT_CODE")["NAT_NAME"].head()

NAT_CODE
??          UNKNOWN NATIONALITY
AB                        ARUBA
AC          ANTIGUA AND BARBUDA
AF                  AFGHANISTAN
AG                      ALGERIA
Name: NAT_NAME, dtype: object

In [23]:
# Add full country names
non_crim_custody_by_nationality["NAT_NAME"] = non_crim_custody_by_nationality\
    .join(nationality_table.set_index("NAT_CODE")[["NAT_NAME"]])["NAT_NAME"]

In [24]:
main_columns = ["N", "D", "R", "total", "percent_detained", "NAT_NAME"]
large_nationalities = non_crim_custody_by_nationality[
    non_crim_custody_by_nationality["total"] > 20000
].sort_values("percent_detained", ascending=False)[main_columns]

## Table: Per-Nationality Detention Rate

In [25]:
large_nationalities

| NAT | N | D | R | total | percent_detained | NAT_NAME |
|---|---|---|---|---|---|---|
| MX | 211607 | 500627 | 123631 | 835865 | 74.7 | MEXICO |
| GT | 82227 | 73429 | 53525 | 209181 | 60.7 | GUATEMALA |
| EC | 12188 | 5179 | 10663 | 28030 | 56.5 | ECUADOR |
| HO | 105350 | 52544 | 47282 | 205176 | 48.7 | HONDURAS |
| IN | 14196 | 1793 | 11174 | 27163 | 47.7 | INDIA |
| DR | 10795 | 5740 | 3831 | 20366 | 47.0 | DOMINICAN REPUBLIC |
| ES | 140820 | 48522 | 68096 | 257438 | 45.3 | EL SALVADOR |
| BR | 43364 | 10442 | 5760 | 59566 | 27.2 | BRAZIL |
| CO | 25681 | 3284 | 3165 | 32130 | 20.1 | COLOMBIA |
| CH | 75969 | 3064 | 14431 | 93464 | 18.7 | CHINA |
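As a quick consistency check, the `total` and `percent_detained` columns can be re-derived from the `N`, `D` and `R` counts printed above. The sketch below does this for three of the rows in plain Python.

```python
# (N, D, R, total, percent_detained) copied from the table above.
rows = {
    "MX": (211607, 500627, 123631, 835865, 74.7),
    "GT": (82227, 73429, 53525, 209181, 60.7),
    "CH": (75969, 3064, 14431, 93464, 18.7),
}

recomputed = {}
for nat, (n, d, r, total, published) in rows.items():
    assert n + d + r == total
    # Detained at any point: D or R.
    recomputed[nat] = round(100.0 * (d + r) / total, 1)

print(recomputed)
```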


## Regression Analysis of Removal Cases

The regression below analyzes the relationship between detention and the following factors:

* Nationality
* Whether the case includes any criminal charges
* Whether the immigrant had legal representation at their first scheduled proceeding
* The gender of the immigrant ("UNK" if not listed)

Note that being detained at any point (**D** or **R** in the `CUSTODY` column) counts as detention in our analysis.

In [26]:
import statsmodels.api as sm
import scipy.stats
import patsy

In [27]:
regression_cases = selected_cases.copy()

In [28]:
# Create dummy variables for nationalities with at least 20,000 non-criminal cases
top_nat_names = []
for tn in large_nationalities.index:
    c_name = "IS_{0}".format(tn)
    top_nat_names.append(c_name)
    regression_cases[c_name] = regression_cases["NAT"].apply(lambda x: 1 if x == tn else 0)
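A sketch of the same indicator-column idea on made-up data, using a vectorized comparison in place of the row-wise `apply`. The code `"XX"` is a hypothetical nationality outside the top list. Rows like it get 0 in every `IS_` column, so all other nationalities together form the regression's reference group.

```python
import pandas as pd

toy = pd.DataFrame({"NAT": ["MX", "GT", "XX", "MX"]})
top_codes = ["MX", "GT"]

toy_dummy_names = []
for code in top_codes:
    column = "IS_{0}".format(code)
    toy_dummy_names.append(column)
    # 1 where the nationality matches, 0 otherwise.
    toy[column] = (toy["NAT"] == code).astype(int)

print(toy)
```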

In [29]:
base_formula = 'detained ~ legal_rep_at_first_proceeding + has_criminal_charge + GENDER'
formula = "{0} + {1}".format(base_formula, "+".join(top_nat_names))
y,x = patsy.dmatrices(formula, regression_cases, return_type="dataframe")
est1 = sm.Logit(y,x).fit()
est1.summary()

Optimization terminated successfully.
         Current function value: 0.558104
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | detained | No. Observations: | 2618476 |
| Model: | Logit | Df Residuals: | 2618459 |
| Method: | MLE | Df Model: | 16 |
| Date: | Fri, 21 Aug 2015 | Pseudo R-squ.: | 0.1813 |
| Time: | 11:01:44 | Log-Likelihood: | -1461400.0 |
| converged: | True | LL-Null: | -1784900.0 |
| | | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -1.4167 | 0.009 | -152.946 | 0.000 | -1.435 -1.399 |
| legal_rep_at_first_proceeding[T.True] | -0.8076 | 0.005 | -168.949 | 0.000 | -0.817 -0.798 |
| has_criminal_charge[T.True] | 1.8889 | 0.005 | 352.255 | 0.000 | 1.878 1.899 |
| GENDER[T.M] | 1.4407 | 0.010 | 144.531 | 0.000 | 1.421 1.460 |
| GENDER[T.UNK] | 0.5915 | 0.009 | 67.052 | 0.000 | 0.574 0.609 |
| IS_MX | 1.8861 | 0.004 | 457.676 | 0.000 | 1.878 1.894 |
| IS_GT | 1.2365 | 0.006 | 219.363 | 0.000 | 1.225 1.248 |
| IS_EC | 1.1034 | 0.013 | 88.065 | 0.000 | 1.079 1.128 |
| IS_HO | 0.7721 | 0.006 | 138.517 | 0.000 | 0.761 0.783 |
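Logit coefficients are on the log-odds scale, so exponentiating one gives an odds ratio relative to the reference group, holding the other variables fixed. The sketch below applies that to a few coefficients copied from the output above. It illustrates how to read the numbers and is not a step in the original analysis.

```python
import math

# Coefficients copied from the regression output above.
coefficients = {
    "legal_rep_at_first_proceeding[T.True]": -0.8076,
    "has_criminal_charge[T.True]": 1.8889,
    "IS_MX": 1.8861,
}

# exp(coef) is the multiplicative change in the odds of detention.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print("{0}: {1:.2f}".format(name, ratio))
```

An odds ratio above 1 indicates higher odds of detention than the reference group, and a ratio below 1 indicates lower odds.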

