Rebecca Black • May 6, 2017

## Predicting Blood Donations

This is an analysis of a dataset from drivendata.org. The goal is to predict the probability that a blood donor makes a donation in March 2007. This is a small dataset (576 observations in the training set) with the following variables:
+ Months since Last Donation
+ Number of Donations
+ Total Volume Donated (in c.c.)
+ Months Since First Donation
+ Whether the donor gave blood in March 2007

The deliverable for the analysis is a list of donors along with their probability of donating blood in March 2007.

This analysis was done in R.

###### Read in the data

In [7]:
blood=read.csv('Training.csv', header=TRUE)  # training set, includes the outcome column
test=read.csv('Test.csv', header=TRUE)       # test set used for the predictions

### Print the initial structure and variables and the first few rows

In [8]:
dim(blood)
names(blood)
head(blood)

X,Months.since.Last.Donation,Number.of.Donations,Total.Volume.Donated..c.c..,Months.since.First.Donation,Made.Donation.in.March.2007
619,2,50,12500,98,1
664,0,13,3250,28,1
441,1,16,4000,35,1
160,2,20,5000,45,1
358,1,24,6000,77,0
335,4,4,1000,4,0


### Now for some E.D.A

##### Some summary statistics

In [9]:
#Let's change the last column to a factor (since this is a binary variable)
blood$Made.Donation.in.March.2007=as.factor(blood$Made.Donation.in.March.2007)

#Now we print the structure, summary, and first few rows
str(blood)
summary(blood)
head(blood)

'data.frame':	576 obs. of  6 variables:
 $ X                          : int  619 664 441 160 358 335 47 164 736 436 ...
 $ Months.since.Last.Donation : int  2 0 1 2 1 4 2 1 5 0 ...
 $ Number.of.Donations        : int  50 13 16 20 24 4 7 12 46 3 ...
 $ Total.Volume.Donated..c.c..: int  12500 3250 4000 5000 6000 1000 1750 3000 11500 750 ...
 $ Months.since.First.Donation: int  98 28 35 45 77 4 14 35 98 4 ...
 $ Made.Donation.in.March.2007: Factor w/ 2 levels "0","1": 2 2 2 2 1 1 2 1 2 1 ...


       X         Months.since.Last.Donation Number.of.Donations
 Min.   :  0.0   Min.   : 0.000             Min.   : 1.000     
 1st Qu.:183.8   1st Qu.: 2.000             1st Qu.: 2.000     
 Median :375.5   Median : 7.000             Median : 4.000     
 Mean   :374.0   Mean   : 9.439             Mean   : 5.427     
 3rd Qu.:562.5   3rd Qu.:14.000             3rd Qu.: 7.000     
 Max.   :747.0   Max.   :74.000             Max.   :50.000     
 Total.Volume.Donated..c.c.. Months.since.First.Donation
 Min.   :  250               Min.   : 2.00              
 1st Qu.:  500               1st Qu.:16.00              
 Median : 1000               Median :28.00              
 Mean   : 1357               Mean   :34.05              
 3rd Qu.: 1750               3rd Qu.:49.25              
 Max.   :12500               Max.   :98.00              
 Made.Donation.in.March.2007
 0:438                      
 1:138                      
                            
                            
        

X,Months.since.Last.Donation,Number.of.Donations,Total.Volume.Donated..c.c..,Months.since.First.Donation,Made.Donation.in.March.2007
619,2,50,12500,98,1
664,0,13,3250,28,1
441,1,16,4000,35,1
160,2,20,5000,45,1
358,1,24,6000,77,0
335,4,4,1000,4,0


This is all very straightforward: there are no missing values and no suspicious summary statistics. So let's move on to the analysis.
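The "no missing values" observation can be checked directly rather than read off the summary. The sketch below is a minimal example that rebuilds only the six rows printed by `head(blood)`; the name `blood_head` is illustrative and not part of the original notebook. On the full data, the same two calls would be applied to `blood`.

```r
# Stand-in for `blood`: the six rows printed by head(blood) above.
blood_head <- data.frame(
  X = c(619, 664, 441, 160, 358, 335),
  Months.since.Last.Donation = c(2, 0, 1, 2, 1, 4),
  Number.of.Donations = c(50, 13, 16, 20, 24, 4),
  Total.Volume.Donated..c.c.. = c(12500, 3250, 4000, 5000, 6000, 1000),
  Months.since.First.Donation = c(98, 28, 35, 45, 77, 4),
  Made.Donation.in.March.2007 = factor(c(1, 1, 1, 1, 0, 0))
)

# Number of missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(blood_head))
print(na_counts)

# TRUE only if every row is fully observed.
no_missing <- all(complete.cases(blood_head))
print(no_missing)
```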

### The question we want to answer with this analysis

What is the probability a donor will give blood in March 2007?

To answer this question, I'll run a logistic regression analysis.

In [10]:
fit <- glm(Made.Donation.in.March.2007~Months.since.Last.Donation+Number.of.Donations+
           Total.Volume.Donated..c.c..+Months.since.First.Donation,
           data=blood,family=binomial())
summary(fit) # display results
pred=predict(fit, newdata=test, type="response") # predicted values


Call:
glm(formula = Made.Donation.in.March.2007 ~ Months.since.Last.Donation + 
    Number.of.Donations + Total.Volume.Donated..c.c.. + Months.since.First.Donation, 
    family = binomial(), data = blood)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5102  -0.8079  -0.5273  -0.2427   2.5545  

Coefficients: (1 not defined because of singularities)
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -0.585643   0.201818  -2.902  0.00371 ** 
Months.since.Last.Donation  -0.091026   0.018955  -4.802 1.57e-06 ***
Number.of.Donations          0.129921   0.029102   4.464 8.03e-06 ***
Total.Volume.Donated..c.c..        NA         NA      NA       NA    
Months.since.First.Donation -0.018797   0.006588  -2.853  0.00433 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 634.29  on 575  degrees of freedom
Residual deviance: 556.6

“prediction from a rank-deficient fit may be misleading”
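The "1 not defined because of singularities" note and the rank-deficiency warning suggest that one predictor is an exact linear combination of the others. In the rows printed by `head(blood)`, Total Volume is always 250 c.c. times Number of Donations, which would explain the `NA` coefficient. A minimal sketch of that check, using only those six rows (the vector names are illustrative, not from the original notebook):

```r
# Values copied from the six rows printed by head(blood).
donations <- c(50, 13, 16, 20, 24, 4)
volume    <- c(12500, 3250, 4000, 5000, 6000, 1000)

# If volume is an exact multiple of donations, the ratio is one constant.
volume_per_donation <- unique(volume / donations)
print(volume_per_donation)

# TRUE means volume carries no information beyond the donation count.
exact_multiple <- all(volume == 250 * donations)
print(exact_multiple)

# On the full data frame, the equivalent check would be:
# with(blood, all(Total.Volume.Donated..c.c.. == 250 * Number.of.Donations))
```

When two predictors are perfectly collinear like this, `glm` keeps one and drops the other, so removing the redundant column by hand should not change the fitted probabilities.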

In [11]:
write.csv(pred, file = "Pred.csv")  # write.csv returns NULL, so there is nothing to assign

This result put me in the top 11% of the competition submissions. Not bad, but I'll try a subset of the predictor variables to see if I can get a bit higher.

In [12]:
fit2 <- glm(Made.Donation.in.March.2007~Months.since.Last.Donation+Number.of.Donations+
           Months.since.First.Donation,
           data=blood,family=binomial())
summary(fit2) # display results
pred2=predict(fit2, newdata=test, type="response") # predicted values from the reduced model


Call:
glm(formula = Made.Donation.in.March.2007 ~ Months.since.Last.Donation + 
    Number.of.Donations + Months.since.First.Donation, family = binomial(), 
    data = blood)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5102  -0.8079  -0.5273  -0.2427   2.5545  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 -0.585643   0.201818  -2.902  0.00371 ** 
Months.since.Last.Donation  -0.091026   0.018955  -4.802 1.57e-06 ***
Number.of.Donations          0.129921   0.029102   4.464 8.03e-06 ***
Months.since.First.Donation -0.018797   0.006588  -2.853  0.00433 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 634.29  on 575  degrees of freedom
Residual deviance: 556.61  on 572  degrees of freedom
AIC: 564.61

Number of Fisher Scoring iterations: 5

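And this kept me in the ~89th percentile, so I'll stop here. The unchanged ranking is no surprise: Total Volume Donated is an exact multiple of Number of Donations (250 c.c. per donation in every row shown above), so R had already dropped it from the first fit as a singularity, and the two models have identical coefficients. Both give a log loss of 0.4457, which is very close to optimal.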
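For reference, log loss for a binary outcome is usually defined as the negative mean log-likelihood of the true labels under the predicted probabilities. The sketch below is a generic implementation, not DrivenData's scoring code. The function name `log_loss`, the clipping constant `eps`, and the toy vectors are assumptions for illustration. The 0.4457 score itself cannot be reproduced this way, because the competition presumably scores against test labels that are not in `Test.csv`.

```r
# Binary log loss: -mean(y * log(p) + (1 - y) * log(1 - p)).
# Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs.
log_loss <- function(actual, predicted, eps = 1e-15) {
  p <- pmin(pmax(predicted, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

# Toy example with made-up labels and probabilities.
y <- c(1, 0, 1, 0)
p <- c(0.9, 0.1, 0.8, 0.3)
toy_loss <- log_loss(y, p)
print(toy_loss)

# An uninformative prediction of 0.5 for everyone gives log(2).
baseline <- log_loss(c(1, 0), c(0.5, 0.5))
print(baseline)
```

Lower values are better, and confident wrong predictions are penalized heavily, which is why the clipping step matters.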

The preceding analysis was primarily intended for simple prediction of blood donation for existing blood donors. However, depending on the demographics of the blood donor pool, we may be able to draw some meaningful conclusions from the logistic regression coefficients. That is, for donors similar in profile to the donor pool, we might infer that, with all other variables held constant, each additional donation multiplies the odds of donating blood in March 2007 by exp(0.129921) = 1.138738, an increase of about 14%. Similarly, each additional month since the last donation multiplies the odds by exp(-0.091026) ≈ 0.913, a decrease of about 9%.
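The odds-ratio reading of the coefficients can be reproduced by exponentiating the estimates. The sketch below hard-codes the three estimates from the `summary(fit2)` output so that it runs on its own; the names `coefs`, `odds_ratios`, and `pct_change` are illustrative. With the fitted object available, `exp(coef(fit2))` would give the same numbers, with the intercept included.

```r
# Coefficient estimates copied from the summary(fit2) output above.
coefs <- c(
  Months.since.Last.Donation  = -0.091026,
  Number.of.Donations         =  0.129921,
  Months.since.First.Donation = -0.018797
)

# Exponentiating a logistic regression coefficient gives the multiplicative
# change in the odds for a one-unit increase in that predictor.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# The same effect expressed as a percentage change in the odds.
pct_change <- 100 * (odds_ratios - 1)
print(round(pct_change, 1))
```

Values above 1 indicate higher odds of donating and values below 1 indicate lower odds. These are changes in odds, not in probability.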

Care must of course be taken to analyze the characteristics of the original donor pool before making such inferences, since the demographics of this pool may not allow us to extrapolate the findings to the larger population.