_Transform the data so that it can be modeled, stating assumptions and simplifications as you go. What can you notice about the data ahead of the modeling process?_

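Before modeling, the data must be transformed into a format the model can use: handling missing and non-finite values, converting categorical variables (such as `instructor`) into numeric ones, and scaling numeric variables where the model is sensitive to scale. The missing-value counts below show that four columns in the training set have gaps: `hs_gpa` (20), `good_behavior` (100), `instructor` (50), and `sensitive_10` (595). The same four columns have gaps in the scoring set. The main assumptions and simplifications are that values are missing at random and that the relationships between the features and the outcome are simple enough for the chosen models. The function `clean_dataset` below fills missing values with 0 and drops any row that contains an infinite value. Filling with 0 is a simplification: it treats a missing GPA as a GPA of 0, which can distort the fitted model.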

In [62]:
import pandas as pd
import numpy as np
from numpy import inf
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,accuracy_score

# load data
df_train = pd.read_csv('br_takehome_exam_2022_training.csv')
df_scoring = pd.read_csv('br_takehome_exam_2022_scoring.csv')

# preview data
df_train.head()


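| Unnamed: 0 | job_aptitude_exam | same_industry | unexcused_absences | hs_gpa | job_offered | good_behavior | high_school | enrolled_late | instructor | sensitive_01 | ... | sensitive_14 | sensitive_15 | sensitive_17 | sensitive_18 | sensitive_19 | sensitive_20 | sensitive_22 | sensitive_23 | sensitive_24 | sensitive_25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 96.0 | 0 | 1 | 2.255 | 1 | 0.0 | 0 | 0 | inst_9 | 0.25 | ... | -0.6843 | 1 | 0 | 1 | 0.82 | 1 | 0 | 0.33 | 1 | 0.44 |
| 1 | 88.0 | 0 | 2 | 2.772 | 1 | 1.0 | 1 | 0 | inst_5 | 0.36 | ... | -3.93832 | 1 | 0 | 0 | 0.91 | 0 | 0 | -0.35 | 0 | 0.83 |
| 2 | 90.0 | 0 | 0 | 3.74 | 1 | 0.0 | 2 | 0 | inst_6 | 0.77 | ... | -1.3366 | 1 | 1 | 1 | -0.88 | 1 | 0 | -1.18 | 1 | -0.82 |
| 3 | 122.0 | 0 | 2 | 3.206 | 1 | 0.0 | 3 | 0 | inst_5 | 0.39 | ... | -0.32674 | 0 | 1 | 1 | -0.98 | 1 | 0 | -0.74 | 0 | -0.63 |
| 4 | 82.0 | 0 | 1 | 2.837 | 1 | 1.0 | 4 | 0 | inst_2 | 0.85 | ... | 0.90079 | 1 | 0 | 0 | -0.41 | 0 | 1 | -0.56 | 1 | 1.63 |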


In [63]:
# check for missing values
df_train.isnull().sum()

job_aptitude_exam       0
same_industry           0
unexcused_absences      0
hs_gpa                 20
job_offered             0
good_behavior         100
high_school             0
enrolled_late           0
instructor             50
sensitive_01            0
sensitive_02            0
sensitive_03            0
sensitive_04            0
sensitive_05            0
sensitive_06            0
sensitive_07            0
sensitive_08            0
sensitive_09            0
sensitive_10          595
sensitive_11            0
sensitive_12            0
sensitive_13            0
sensitive_14            0
sensitive_15            0
sensitive_17            0
sensitive_18            0
sensitive_19            0
sensitive_20            0
sensitive_22            0
sensitive_23            0
sensitive_24            0
sensitive_25            0
dtype: int64

In [75]:
# clean dataset
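def clean_dataset(df):
    """Return a copy of df with NaNs filled with 0 and rows containing +/-inf dropped."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # fillna without inplace returns a new frame, so the caller's data is left unchanged
    df = df.fillna(0)
    # keep only rows with no infinite values (NaNs were already filled above)
    indices_to_keep = ~df.isin([np.inf, -np.inf]).any(axis=1)
    # .copy() avoids SettingWithCopyWarning when columns are reassigned later
    return df[indices_to_keep].copy()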

df_train_clean = clean_dataset(df_train)

# np.any(np.isnan(df_train.hs_gpa)) # True
# np.any(np.isnan(df_train_clean.hs_gpa)) # False

# np.all(np.isfinite(df_train.hs_gpa)) # False
# np.all(np.isfinite(df_train_clean.hs_gpa)) # True

# np.any(np.isfinite(df_train.hs_gpa)) # True
# np.any(np.isfinite(df_train_clean.hs_gpa)) # True


# np.isfinite(df_train.hs_gpa).sum() # 980 # 1000
# np.isfinite(df_train_clean.hs_gpa).sum() # 344
# df_train_clean.hs_gpa.count() # 344

# np.isnan(df_train.hs_gpa).sum() # 20
# np.isnan(df_train_clean.hs_gpa).sum() # 0

# check for missing values
df_train_clean.isnull().sum()


job_aptitude_exam     0
same_industry         0
unexcused_absences    0
hs_gpa                0
job_offered           0
good_behavior         0
high_school           0
enrolled_late         0
instructor            0
sensitive_01          0
sensitive_02          0
sensitive_03          0
sensitive_04          0
sensitive_05          0
sensitive_06          0
sensitive_07          0
sensitive_08          0
sensitive_09          0
sensitive_10          0
sensitive_11          0
sensitive_12          0
sensitive_13          0
sensitive_14          0
sensitive_15          0
sensitive_17          0
sensitive_18          0
sensitive_19          0
sensitive_20          0
sensitive_22          0
sensitive_23          0
sensitive_24          0
sensitive_25          0
dtype: int64

In [65]:
# check for missing values
df_scoring.isnull().sum()

job_aptitude_exam        0
same_industry            0
unexcused_absences       0
hs_gpa                  89
good_behavior          479
high_school              0
enrolled_late            0
instructor             238
sensitive_01             0
sensitive_02             0
sensitive_03             0
sensitive_04             0
sensitive_05             0
sensitive_06             0
sensitive_07             0
sensitive_08             0
sensitive_09             0
sensitive_10          2883
sensitive_11             0
sensitive_12             0
sensitive_13             0
sensitive_14             0
sensitive_15             0
sensitive_17             0
sensitive_18             0
sensitive_19             0
sensitive_20             0
sensitive_22             0
sensitive_23             0
sensitive_24             0
sensitive_25             0
dtype: int64

In [76]:
# clean scoring dataset
df_scoring_clean = clean_dataset(df_scoring)

# check for missing values
df_scoring_clean.isnull().sum()

job_aptitude_exam     0
same_industry         0
unexcused_absences    0
hs_gpa                0
good_behavior         0
high_school           0
enrolled_late         0
instructor            0
sensitive_01          0
sensitive_02          0
sensitive_03          0
sensitive_04          0
sensitive_05          0
sensitive_06          0
sensitive_07          0
sensitive_08          0
sensitive_09          0
sensitive_10          0
sensitive_11          0
sensitive_12          0
sensitive_13          0
sensitive_14          0
sensitive_15          0
sensitive_17          0
sensitive_18          0
sensitive_19          0
sensitive_20          0
sensitive_22          0
sensitive_23          0
sensitive_24          0
sensitive_25          0
dtype: int64
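
The cleaning step above fills every gap with 0. As a rough sketch of one alternative, assuming a small made-up frame (`toy`) and a hypothetical helper `impute_and_encode` that is not part of the original notebook, numeric gaps can be filled with the column median plus a missingness flag, and the categorical `instructor` column can be one-hot encoded:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up for illustration).
toy = pd.DataFrame({
    "hs_gpa": [2.3, np.nan, 3.7, 3.2],
    "good_behavior": [0.0, 1.0, np.nan, 0.0],
    "instructor": ["inst_9", None, "inst_6", "inst_5"],
})

def impute_and_encode(df, numeric_cols, categorical_cols):
    """Hypothetical helper: median-impute numerics, flag missingness, one-hot encode categoricals."""
    out = df.copy()
    for col in numeric_cols:
        # Record which rows were missing before filling them in
        out[col + "_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        # Treat a missing category as its own level
        out[col] = out[col].fillna("unknown")
    return pd.get_dummies(out, columns=categorical_cols, dtype=int)

prepared = impute_and_encode(toy, ["hs_gpa", "good_behavior"], ["instructor"])
print(prepared)
```

In a real workflow the medians and category levels would be learned on the training set and reused on the scoring set, so that both sets are transformed identically.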

_Train 2 or 3 models, and compare them in terms of how well they fit the data. Are there advantages or disadvantages to the approaches you’ve taken?_

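Here, a logistic regression model and a decision tree were trained on three features (`hs_gpa`, `high_school`, and `unexcused_absences`) and compared on a held-out 30% test split. Fit can be compared using accuracy, precision, recall, and related metrics. As shown below, `logreg_acc = 0.92` and `dt_acc = 0.8033`, so logistic regression generalizes better on this split. Accuracy alone can be misleading if most trainees were offered a job, so the classification report and confusion matrix are worth reading alongside it. Logistic regression is simple and its coefficients are directly interpretable, but it assumes a linear relationship between the features and the log-odds. A decision tree can capture non-linear relationships and interactions without feature scaling, and a shallow tree is easy to read. An unconstrained tree such as the default `DecisionTreeClassifier()` used here tends to overfit and becomes hard to interpret as it grows deep.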

In [77]:
# define feature columns
feature_cols = ['hs_gpa', 'high_school', 'unexcused_absences']

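# scale features
# note: the scaler is fit on all training rows before the train/test split below,
# so test-set statistics influence the scaling (a mild form of data leakage)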
scaler = preprocessing.StandardScaler()
df_train_clean[feature_cols] = scaler.fit_transform(df_train_clean[feature_cols])
df_scoring_clean[feature_cols] = scaler.transform(df_scoring_clean[feature_cols])


In [78]:
# define X and y
X = df_train_clean[feature_cols]
y = df_train_clean['job_offered']

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# train logistic regression model
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)

# train decision tree model
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)


DecisionTreeClassifier()

In [79]:
# calculate accuracy of logistic regression model
logreg_acc = logreg.score(X_test, y_test)

# calculate accuracy of decision tree model
dt_acc = dt.score(X_test, y_test)

# compare model accuracy
print("Logistic regression accuracy:", logreg_acc)
print("Decision tree accuracy:", dt_acc)

# overall model comparison
logregmod = logreg.predict(X_test)
print(classification_report(y_test, logregmod))
print(confusion_matrix(y_test, logregmod))
print(accuracy_score(y_test, logregmod))

Logistic regression accuracy: 0.92
Decision tree accuracy: 0.8033333333333333
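
The comparison above relies on a single train/test split and on a scaler fit before the split. As a minimal sketch of a more robust comparison, assuming synthetic data in place of the exam data (`X_demo` and `y_demo` are made up, and `max_depth=3` is an illustrative choice), the scaler can be placed inside a pipeline and each model evaluated with 5-fold cross-validated ROC AUC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and outcome (not the exam data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] - 0.5 * X_demo[:, 2] + rng.normal(size=500) > 0).astype(int)

models = {
    # The scaler sits inside the pipeline, so it is re-fit on each training fold only
    "logreg": make_pipeline(StandardScaler(), LogisticRegression()),
    # Trees do not need scaling; max_depth limits overfitting
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over folds gives a less noisy estimate than one split, and the spread across folds indicates how much of the gap between the models could be due to chance.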


_Choose the best model you fit above and produce scores for every row in the provided scoring set. Compare the distribution of population scores to the training sample mean and briefly discuss your findings. What do you observe about the two populations? How might you change your analysis if you had more time/resources in light of this?_

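Logistic regression had the higher test accuracy (0.92 versus 0.80), so it is the model to use for scoring. Applying `logreg.predict_proba` to the cleaned, scaled scoring set gives each individual's estimated probability of being offered a job. The distribution of these scores can then be compared with the training sample mean of `job_offered`; a large gap would suggest that the two populations differ in their feature distributions. One thing is already visible in the missing-value counts: the same four columns have gaps in both sets, and the counts are several times larger in the scoring set, although this may simply reflect a larger scoring set. With more time and resources, the next steps would be to check each feature for distribution shift between the two sets, replace the fill-with-0 imputation with something less crude, use more of the available features, and train and compare additional models.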
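
The scoring step described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual result: the arrays `X_tr`, `y_tr`, and `X_sc` are synthetic, and the scoring population is shifted on purpose so that the comparison has something to show.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Synthetic "training" and "scoring" populations; the scoring set is deliberately shifted.
X_tr = rng.normal(loc=0.0, size=(400, 3))
y_tr = (X_tr[:, 0] + rng.normal(size=400) > 0).astype(int)
X_sc = rng.normal(loc=0.5, size=(2000, 3))

model = LogisticRegression().fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class
scores = model.predict_proba(X_sc)[:, 1]

train_rate = y_tr.mean()
print(f"training mean of outcome: {train_rate:.3f}")
print(f"scoring mean score:       {scores.mean():.3f}")
print("scoring score quantiles (10%, 50%, 90%):", np.quantile(scores, [0.1, 0.5, 0.9]).round(3))

# Difference in feature means between the two populations, one value per feature
shift = X_sc.mean(axis=0) - X_tr.mean(axis=0)
print("feature mean shift (scoring - training):", shift.round(3))
```

If the mean score differs noticeably from the training mean, the per-feature shift is a first place to look for the cause.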

**Question 1.**
_Let’s say a client comes to you and asks for “the effect of absences (one of the variables in the dataset) on being offered a job.” What are some ways you might go about providing that? How would you communicate uncertainty around the ‘effect size’?_


To estimate the effect of absences on being offered a job, one approach is to fit a logistic regression with `unexcused_absences` as the independent variable and `job_offered` as the dependent variable. The coefficient on absences is the change in the log-odds of being offered a job for a one-unit increase in absences; exponentiating it gives the odds ratio.


    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix, accuracy_score

    # df_train is the training data loaded earlier; scikit-learn expects a
    # 2-D feature matrix, hence the double brackets
    x = df_train[['unexcused_absences']]
    y = df_train['job_offered']
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)
    logmodel = LogisticRegression()
    logmodel.fit(x_train, y_train)

    # coefficient: change in log-odds of a job offer per additional absence
    print(logmodel.coef_[0][0])

    predictions = logmodel.predict(x_test)
    print(classification_report(y_test, predictions))
    print(confusion_matrix(y_test, predictions))
    print(accuracy_score(y_test, predictions))
    

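A single-variable regression only describes an association, because absences may be correlated with other factors that also affect job offers. Another way would be to add the likely confounders as covariates, or to use methods such as propensity score matching or instrumental variables to control for them. Uncertainty around the effect size can be communicated by reporting a confidence interval for the coefficient (or for the odds ratio), obtained from the model's standard errors or from a bootstrap, alongside a hypothesis test of whether the effect differs from zero.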
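
One way to attach an interval to the coefficient is a percentile bootstrap. The sketch below is illustrative only: the data are simulated with a known slope of -0.4, `fit_slope` is a hypothetical helper, and a large `C` is used so that scikit-learn's default L2 penalty has a negligible effect on the estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Simulated data: each extra absence lowers the log-odds of an offer by 0.4.
absences = rng.poisson(1.5, size=800)
true_logit = 1.0 - 0.4 * absences
offered = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
X_abs = absences.reshape(-1, 1).astype(float)

def fit_slope(X, y):
    """Hypothetical helper: fit a nearly unpenalized logistic regression, return the slope."""
    return LogisticRegression(C=1e6, max_iter=1000).fit(X, y).coef_[0, 0]

point = fit_slope(X_abs, offered)

# Percentile bootstrap: resample rows with replacement and refit each time
n = len(offered)
boot = []
for _ in range(300):
    idx = rng.integers(0, n, size=n)
    boot.append(fit_slope(X_abs[idx], offered[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"log-odds change per absence: {point:.3f} (95% interval {lo:.3f} to {hi:.3f})")
print(f"odds ratio per absence:      {np.exp(point):.3f} (95% interval {np.exp(lo):.3f} to {np.exp(hi):.3f})")
```

Reporting the odds ratio with its interval is usually easier for a client to read than the raw log-odds coefficient.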

**Question 2.** 
_Let's say your client is interested in understanding the uncertainty of the predictions produced by your model at the individual level. As an example, your model might say that person i has probability of being offered a job 0.78. How do you calculate uncertainty on that quantity?_

To calculate uncertainty on the probability of being offered a job, one approach is bootstrapping. This involves resampling the training data with replacement many times, fitting the model to each resample, and recording each fitted model's predicted probability for the individual. The standard deviation of those predicted probabilities, or a percentile interval across the resamples, gives an estimate of the uncertainty in the model's prediction for that person. A Bayesian approach can also be used, in which the posterior distribution of the model parameters induces a posterior distribution for each predicted probability.
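
The bootstrap procedure described above can be sketched as follows, assuming synthetic training data and a made-up feature vector `person_i` for the individual of interest:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Synthetic training data and a hypothetical individual (not taken from the exam data).
X_tr = rng.normal(size=(600, 3))
y_tr = (X_tr[:, 0] + 0.5 * X_tr[:, 1] + rng.normal(size=600) > 0).astype(int)
person_i = np.array([[0.8, 0.2, -0.1]])

n = len(y_tr)
n_boot = 500
boot_probs = np.empty(n_boot)
for b in range(n_boot):
    # Resample training rows with replacement and refit the model
    idx = rng.integers(0, n, size=n)
    model_b = LogisticRegression().fit(X_tr[idx], y_tr[idx])
    # Predicted probability of the positive class for this individual
    boot_probs[b] = model_b.predict_proba(person_i)[0, 1]

lo, hi = np.percentile(boot_probs, [2.5, 97.5])
print(f"mean predicted probability: {boot_probs.mean():.3f}")
print(f"bootstrap std:              {boot_probs.std():.3f}")
print(f"95% percentile interval:    {lo:.3f} to {hi:.3f}")
```

This interval reflects how much the estimated probability would change with a different training sample. It does not describe the randomness of the individual's actual outcome, which remains a yes-or-no event.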

**Question 3.**
_Let’s say a domain expert thinks that the model is not performing well for a subset of the population (e.g. folks with a low GPA). How would you check to see if the model is performing well among subpopulations of your training data?_

Two ways to check the model's performance among subpopulations are subgroup analysis and fairness analysis. Subgroup analysis computes the same metrics (accuracy, precision, recall, calibration) on held-out data within each subpopulation (e.g. low GPA) and compares them with the overall figures. Fairness analysis computes a disparity measure (e.g. the difference in false positive rates between subpopulations) to check whether the model treats different subpopulations comparably. Small subgroups give noisy estimates, so each metric should be reported with its sample size. These analyses can reveal bias or poor performance within specific subpopulations and point toward possible causes, such as too few training examples or heavier imputation in that group.
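
A subgroup analysis of this kind can be sketched as below. The frame `eval_df` is simulated: its predictions are deliberately made noisier for low-GPA rows so that the report has a gap to show, and the GPA band cut-offs are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(4)
# Simulated held-out labels and GPAs (not the exam data).
eval_df = pd.DataFrame({
    "hs_gpa": rng.uniform(1.5, 4.0, size=1000),
    "y_true": rng.integers(0, 2, size=1000),
})
# Simulated predictions: wrong 35% of the time for GPA below 2.5, 10% otherwise.
flip_prob = np.where(eval_df["hs_gpa"] < 2.5, 0.35, 0.10)
flip = rng.random(1000) < flip_prob
eval_df["y_pred"] = np.where(flip, 1 - eval_df["y_true"], eval_df["y_true"])

# Bucket GPA into bands (cut-offs are illustrative)
eval_df["gpa_band"] = pd.cut(
    eval_df["hs_gpa"], bins=[0, 2.5, 3.25, 4.0], labels=["low", "mid", "high"]
)

# Compute the same metrics within each band, along with the band's size
rows = []
for band, g in eval_df.groupby("gpa_band", observed=True):
    rows.append({
        "gpa_band": band,
        "n": len(g),
        "accuracy": accuracy_score(g["y_true"], g["y_pred"]),
        "recall": recall_score(g["y_true"], g["y_pred"]),
    })
report = pd.DataFrame(rows).set_index("gpa_band")
print(report.round(3))
```

With real data, `y_pred` would come from the fitted model on a held-out split, and the per-band sample size `n` helps judge whether a gap is meaningful or just noise.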