Since its inception, the National Domestic Workers Alliance has collected survey data about domestic workers and low-propensity voters of color. With the creation of a data department (myself and a former organizer), the organization gained the capacity to use this data for predictive analytics. Two models emerged: one to estimate the likelihood that a respondent is a domestic worker (defined by participation in the care economy) and one to estimate the likelihood that a member engages with the organization's primary purpose.



In [None]:
from ndwa import worker, meng

In the domestic worker identification project, a machine learning model was trained on a large dataset of voter, demographic, immigration, and consumer data to predict which respondents were most likely to identify as domestic workers. The model was then fine-tuned on a smaller dataset of voter, demographic, and consumer data from a 2020 COVID survey.

The first step of the project was to preprocess the data: cleaning and normalizing it, then encoding the categorical variables. The demographic and consumer data were used as features, and the target variable was whether or not the respondent replied “Yes” to the general survey question “Are you a domestic worker?”.

In [None]:
import numpy as np
import pandas as pd

# load data
df_train = pd.read_csv('survey_training.csv')
df_scoring = pd.read_csv('survey_scoring.csv')


In [None]:
# preview data
df_train.head()

In [None]:
# check for missing values
df_train.isnull().sum()

In [None]:
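# clean dataset
# (defined locally; a same-named import from eda would be shadowed by this definition)

def clean_dataset(df):
    """Return a copy of df with missing values filled and non-finite rows dropped."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # fill missing values without modifying the caller's DataFrame
    df = df.fillna(0)
    # drop rows that still contain NaN or +/-inf in any column
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].copy()

df_train_clean = clean_dataset(df_train)
df_scoring_clean = clean_dataset(df_scoring)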

In [None]:
# define feature columns
feature_cols = ['immigration_status', 'zip_code', 'member_status', 'voter_propensity', 'race', 'ethnicity']

# scale features (assumes the columns above are already numerically encoded)
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
df_train_clean[feature_cols] = scaler.fit_transform(df_train_clean[feature_cols])
df_scoring_clean[feature_cols] = scaler.transform(df_scoring_clean[feature_cols])
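
The preprocessing step described earlier mentions encoding categorical variables, but the cells above only show scaling. The following is a minimal sketch of one way to do both at once, assuming scikit-learn's `ColumnTransformer`; the toy values (`"A"`, `"B"`, `"C"`), the split into `categorical_cols` and `numeric_cols`, and the variable names are hypothetical and not taken from the survey data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "immigration_status": ["A", "B", "A", "C"],
    "race": ["A", "B", "C", "A"],
    "voter_propensity": [0.2, 0.8, 0.5, 0.1],
})

# Assumed split of the feature list into categorical and numeric columns
categorical_cols = ["immigration_status", "race"]
numeric_cols = ["voter_propensity"]

# One-hot encode the categorical columns and standardize the numeric ones;
# sparse_threshold=0 forces a dense array as output
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0,
)

X_encoded = preprocess.fit_transform(toy)
print(X_encoded.shape)
```

Fitting the transformer on the training data and calling only `transform` on the scoring data keeps the two sets on the same encoding and scale, mirroring how `scaler` is used above.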

In [None]:
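from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# define X and y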
X = df_train_clean[feature_cols]
y = df_train_clean['domestic_worker']

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# train logistic regression model
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)

# train decision tree model
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

A machine learning model was then trained on the large dataset using a gradient boosting algorithm and fine-tuned on the smaller dataset from the 2020 COVID survey. Fine-tuning meant re-training on the smaller dataset while keeping what the initial model had already learned as the starting point, rather than starting from an untrained model. This allowed the model to adapt quickly to the new dataset while still leveraging the knowledge learned from the larger one.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# train gradient boosting algorithm
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
gb.score(X_test, y_test)
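
The cell above fits the model once and does not show the fine-tuning step. One way it could be sketched, assuming scikit-learn's `warm_start` option (which keeps the already fitted trees and adds new ones on the next call to `fit`), is shown below. The data is synthetic, and the names `X_large`, `X_small`, `gb_ft`, and `gb_scratch` as well as the hyperparameters are illustrative assumptions, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (hypothetical): a large general set and a small follow-up set
X_all, y_all = make_classification(
    n_samples=2300, n_features=6, n_informative=4, random_state=0
)
X_large, y_large = X_all[:2000], y_all[:2000]
X_small, y_small = X_all[2000:], y_all[2000:]
X_small_train, X_small_test, y_small_train, y_small_test = train_test_split(
    X_small, y_small, test_size=0.3, random_state=42, stratify=y_small
)

# Step 1: fit 100 boosting stages on the large dataset
gb_ft = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=1,
    random_state=0, warm_start=True,
)
gb_ft.fit(X_large, y_large)

# Step 2: keep those 100 stages and fit 50 additional stages on the small dataset
gb_ft.set_params(n_estimators=150)
gb_ft.fit(X_small_train, y_small_train)

# Baseline: the same architecture trained from scratch on the small dataset only
gb_scratch = GradientBoostingClassifier(
    n_estimators=150, learning_rate=0.1, max_depth=1, random_state=0
).fit(X_small_train, y_small_train)

# Compare both on the small dataset's hold-out split
acc_finetuned = gb_ft.score(X_small_test, y_small_test)
acc_scratch = gb_scratch.score(X_small_test, y_small_test)
print(f"fine-tuned: {acc_finetuned:.3f}  from scratch: {acc_scratch:.3f}")
```

Which of the two models scores higher depends on how similar the two datasets are, so the comparison is worth running on the real hold-out set rather than assuming an outcome.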

The fine-tuned model was then evaluated on a hold-out test set from the 2020 COVID survey and was found to have improved performance compared to a model trained from scratch on the smaller dataset.

In [None]:
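# apply the model to the 2020 COVID survey hold-out set
# (X_covid and y_covid are assumed to be prepared the same way as X and y above)
covid_pred = gb.predict(X_covid)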

# model score for smaller dataset
gb.score(X_covid, y_covid)
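
Since the stated goal is a likelihood for each respondent rather than a yes/no label, scoring would typically use predicted probabilities. The sketch below assumes `predict_proba` on a fitted gradient boosting model; the synthetic data, the `respondent_id` and `dw_likelihood` column names, and the `ranked` table are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (hypothetical): 400 training rows, 100 rows to score
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
model = GradientBoostingClassifier(
    n_estimators=100, max_depth=1, random_state=0
).fit(X_demo[:400], y_demo[:400])

# Column 1 of predict_proba holds the probability of the positive class (label 1)
scores = pd.DataFrame({
    "respondent_id": np.arange(400, 500),
    "dw_likelihood": model.predict_proba(X_demo[400:])[:, 1],
})

# Rank respondents from most to least likely
ranked = scores.sort_values("dw_likelihood", ascending=False).reset_index(drop=True)
print(ranked.head())
```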

This project demonstrates the effectiveness of transfer learning for tabular data, as the domestic worker model was able to leverage the knowledge learned from the larger dataset to improve its performance on the smaller dataset from 2020.