## Introduction:

- Write a function called Sample that takes a Cdf and an integer n, and returns a list of n values chosen at random from the Cdf.

- Using the distribution of birth weights from the NSFG, generate a random sample with 1000 elements. Compute the CDF of the sample. Make a plot that shows the original CDF and the CDF of the random sample. For large values of n, the distributions should be the same.


## Solution steps:

- Choose a random probability in the range 0–1.
- Use Cdf.Value to find the value in the distribution that corresponds to the probability you chose.
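The two steps above amount to inverse transform sampling. Here is a minimal sketch, assuming Julia 1.x and a CDF stored as two parallel vectors: `xs` (sorted distinct values) and `ps` (their cumulative probabilities). The names `cdf_value` and `sample_cdf` and the toy numbers are hypothetical, chosen only for illustration; they are not part of the notebook.

```julia
using Random

# Hypothetical stand-in for Cdf.Value: return the smallest x in xs whose
# cumulative probability is at least p.
function cdf_value(xs, ps, p)
    i = searchsortedfirst(ps, p)       # first index with ps[i] >= p
    return xs[min(i, length(xs))]      # clamp in case ps[end] is slightly below 1
end

# Draw n values: pick a uniform probability in [0, 1), then invert the CDF.
function sample_cdf(xs, ps, n; rng = Random.default_rng())
    return [cdf_value(xs, ps, rand(rng)) for _ in 1:n]
end

# Toy distribution with made-up numbers (not NSFG data).
xs = [1, 2, 3]
ps = [0.2, 0.7, 1.0]
draws = sample_cdf(xs, ps, 1000; rng = MersenneTwister(42))
```

Every element of `draws` is one of the values in `xs`, and values with a larger jump in `ps` are drawn more often.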

In [1]:
using StatsBase
using DataFrames
using Plots

First, let's construct the CDF for the birth weights in the NSFG dataset. The Sample function is defined and called after that.

In [2]:
data = [("caseid",1,12,Int), ("nbrnaliv",22,22,Int),("babysex",56,56,Int), ("birthwgt_lb",57,58,Int),("birthwgt_oz",59,60,Int), ("prglength", 275,276,Int), ("outcome", 277,277, Int), ("birthord",278,279,Int), ("agepreg", 284,287,Int), ("finalwgt",423,440,Float64)]

10-element Array{Tuple{String,Int64,Int64,DataType},1}:
 ("caseid",1,12,Int64)       
 ("nbrnaliv",22,22,Int64)    
 ("babysex",56,56,Int64)     
 ("birthwgt_lb",57,58,Int64) 
 ("birthwgt_oz",59,60,Int64) 
 ("prglength",275,276,Int64) 
 ("outcome",277,277,Int64)   
 ("birthord",278,279,Int64)  
 ("agepreg",284,287,Int64)   
 ("finalwgt",423,440,Float64)

In [3]:
df_=DataFrame()

In [4]:
map(x->df_[Symbol(x[1])]=Vector{x[4]}(0),data)

10-element Array{Array{T,1},1}:
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Float64[]

In [5]:
open("2002FemPreg.dat","r") do io
    for l in eachline(io)
        if length(l)>100
            for (i,case) in enumerate(data)
                # parse the fixed-width field; fall back to NA if it is blank or malformed
                val2push = try
                    parse(case[4], l[case[2]:case[3]])
                catch
                    NA
                end
                push!(df_[Symbol(case[1])], val2push)
            end
        end
    end
end
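The cell above relies on `try`/`catch` and `NA`, which matches the older Julia and DataFrames versions this notebook was written for. As a rough sketch of the same idea in Julia 1.x, a field can be parsed with `tryparse` and replaced by `missing` when it is blank or malformed. The helper name `parse_field` and the sample record are hypothetical, and the sketch assumes ASCII input so that byte positions equal character positions.

```julia
# Hypothetical helper: parse columns lo:hi of a fixed-width line as type T,
# returning `missing` when the field is absent, blank, or not parseable.
function parse_field(::Type{T}, line::AbstractString, lo::Int, hi::Int) where {T}
    hi > lastindex(line) && return missing      # line too short for this field
    val = tryparse(T, strip(line[lo:hi]))       # `nothing` on failure
    return val === nothing ? missing : val
end

line = " 813 x"                      # made-up record, not from 2002FemPreg.dat
lb  = parse_field(Int, line, 1, 2)   # field " 8"
oz  = parse_field(Int, line, 3, 4)   # field "13"
bad = parse_field(Int, line, 5, 6)   # field " x" is not a number
```

Returning `missing` instead of throwing keeps the row count intact, so each column stays aligned with the others.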

In [6]:
df_

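| Row | caseid | nbrnaliv | babysex | birthwgt_lb | birthwgt_oz | prglength | outcome | birthord | agepreg | finalwgt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 8 | 13 | 39 | 1 | 1 | 3316 | 6448.271111704751 |
| 2 | 1 | 1 | 2 | 7 | 14 | 39 | 1 | 2 | 3925 | 6448.271111704751 |
| 3 | 2 | 3 | 1 | 9 | 2 | 39 | 1 | 1 | 1433 | 12999.542264385902 |
| 4 | 2 | 1 | 2 | 7 | 0 | 39 | 1 | 2 | 1783 | 12999.542264385902 |
| 5 | 2 | 1 | 2 | 6 | 3 | 39 | 1 | 3 | 1833 | 12999.542264385902 |
| 6 | 6 | 1 | 1 | 8 | 9 | 38 | 1 | 1 | 2700 | 8874.440799222995 |
| 7 | 6 | 1 | 2 | 9 | 9 | 40 | 1 | 2 | 2883 | 8874.440799222995 |
| 8 | 6 | 1 | 2 | 8 | 6 | 42 | 1 | 3 | 3016 | 8874.440799222995 |
| 9 | 7 | 1 | 1 | 7 | 9 | 39 | 1 | 1 | 2808 | 6911.879920534536 |
| 10 | 7 | 1 | 2 | 6 | 10 | 35 | 1 | 2 | 3233 | 6911.879920534536 |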


In [7]:
class_weights = Int[]

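for i = 1:size(df_, 1)
    if df_[i, 7] == 1  # column 7 is outcome; 1 means a live birth
        try
            # column 5 is birthwgt_oz, the ounces part of the birth weight
            append!(class_weights, df_[i, 5])
        catch
            # skip rows whose value cannot be appended (for example NA)
        end
    end
end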
class_weights

9148-element Array{Int64,1}:
 13
 14
  2
  0
  3
  9
  9
  6
  9
 10
 13
  0
  0
  ⋮
  0
 13
  9
  2
  7
  0
  0
  6
  6
  3
  8
  8

In [8]:
c = countmap(class_weights)

Dict{Int64,Int64} with 48 entries:
  2                   => 604
  76091456            => 1
  11                  => 557
  76088568            => 1
  76096976            => 1
  76011456            => 1
  72325104            => 1
  8                   => 756
  68719476916         => 1
  75924848            => 1
  76069392            => 1
  140690640205664     => 1
  14                  => 475
  4294967296          => 1
  75971456            => 1
  336                 => 1
  75981504            => 1
  4                   => 525
  4658083153315464688 => 2
  76098208            => 1
  13                  => 487
  75961040            => 1
  117                 => 1
  99                  => 46
  76097104            => 1
  ⋮                   => ⋮

In [9]:
runsum = 0
xs = []
cs = []

for (value, count) in sort(c)
    runsum = runsum + count
    append!(xs, value)
    append!(cs, runsum)
end

total = float(runsum)
cdf = [c/total for c in cs]   # cumulative probabilities (ps); cdf[end] == 1

48-element Array{Float64,1}:
 0.000655881
 0.116747   
 0.161347   
 0.227372   
 0.285636   
 0.343026   
 0.401509   
 0.479012   
 0.533778   
 0.616419   
 0.671622   
 0.723546   
 0.784434   
 ⋮          
 0.998688   
 0.998798   
 0.998907   
 0.999016   
 0.999125   
 0.999235   
 0.999344   
 0.999453   
 0.999563   
 0.999672   
 0.999891   
 1.0        
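The running-sum loop above can also be written with `cumsum`. This is a small sketch, assuming Julia 1.x, that returns both the sorted distinct values and their cumulative probabilities. The helper name `make_cdf` and the toy input are hypothetical.

```julia
# Hypothetical helper: build an empirical CDF as (xs, ps), where xs holds the
# sorted distinct values and ps[i] is the fraction of values <= xs[i].
function make_cdf(vals)
    counts = Dict{eltype(vals),Int}()
    for v in vals
        counts[v] = get(counts, v, 0) + 1    # frequency of each value
    end
    xs = sort(collect(keys(counts)))
    cum = cumsum([counts[x] for x in xs])    # running totals in sorted order
    return xs, cum ./ cum[end]               # normalise so the last entry is 1
end

xs, ps = make_cdf([2, 1, 2, 3, 2])           # toy data, not NSFG values
```

Keeping `xs` alongside `ps` makes it possible to plot the CDF against the actual values and to map a probability back to a value.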

In [10]:
length(cdf)

48

In [12]:
plot(cdf)
xlabel!("Class Weights")
ylabel!("CDFs")

This matches the plot from the Python version.

## Sample elements from the CDF list

I'll generate elements from the list (the exercise asks for 1000; the cells below use n = 50).

In [13]:
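function Sample(cdf, n)
    # Returns the first k entries of cdf, where k is drawn uniformly from 1:n.
    # Note: this is a random-length prefix of the CDF values, not n values drawn
    # at random from the distribution; it throws a BoundsError if k > length(cdf).
    return cdf[1:rand(1:n)]
end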

Sample (generic function with 1 method)

In [14]:
n = 50
cdf_vals = Sample(cdf, n)

47-element Array{Float64,1}:
 0.000655881
 0.116747   
 0.161347   
 0.227372   
 0.285636   
 0.343026   
 0.401509   
 0.479012   
 0.533778   
 0.616419   
 0.671622   
 0.723546   
 0.784434   
 ⋮          
 0.998579   
 0.998688   
 0.998798   
 0.998907   
 0.999016   
 0.999125   
 0.999235   
 0.999344   
 0.999453   
 0.999563   
 0.999672   
 0.999891   

In [15]:
plot(cdf_vals)
xlabel!("index")
ylabel!("cdf values")

As n increases, the shape of the sampled CDF approaches the shape of the population CDF.
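One way to check this claim numerically is to measure the largest vertical gap between the population CDF and the CDF of a random sample for several values of n. The sketch below assumes Julia 1.x and uses a synthetic population as a stand-in for the NSFG weights. All function names are hypothetical.

```julia
using Random

# Empirical CDF as (xs, ps): sorted distinct values and cumulative probabilities.
function make_cdf(vals)
    xs = sort(unique(vals))
    ps = [count(v -> v <= x, vals) for x in xs] ./ length(vals)
    return xs, ps
end

# Inverse transform sampling: invert the CDF at n uniform random probabilities.
sample_cdf(xs, ps, n, rng) =
    [xs[min(searchsortedfirst(ps, rand(rng)), length(xs))] for _ in 1:n]

# Largest absolute difference between the two CDFs, evaluated at each x in xs.
function max_cdf_gap(xs, ps, draws)
    sample_ps = [count(v -> v <= x, draws) for x in xs] ./ length(draws)
    return maximum(abs.(ps .- sample_ps))
end

rng = MersenneTwister(1)
population = rand(rng, 0:15, 5000)     # synthetic stand-in, not NSFG data
xs, ps = make_cdf(population)

for n in (10, 100, 1000, 10000)
    gap = max_cdf_gap(xs, ps, sample_cdf(xs, ps, n, rng))
    println("n = $n, max CDF gap = $(round(gap, digits = 4))")
end
```

The gap varies from run to run, but it tends to shrink as n grows.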

Let's draw a number of sample elements from class_weights, compute their CDF, and plot it to compare with the CDF of the full population of weights.

In [16]:
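num = 1000
# Takes the first k weights in file order, with k drawn at random from 1:num.
# The result is displayed but not assigned to a variable.
class_weights[1:rand(1:num)]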

783-element Array{Int64,1}:
 13
 14
  2
  0
  3
  9
  9
  6
  9
 10
 13
  0
  0
  ⋮
  8
  8
  0
  6
 13
 14
  3
 13
  0
  2
  6
  0

In [17]:
# counts the full class_weights vector; the slice from In [16] was not stored
c_ = countmap(class_weights)

Dict{Int64,Int64} with 48 entries:
  2                   => 604
  76091456            => 1
  11                  => 557
  76088568            => 1
  76096976            => 1
  76011456            => 1
  72325104            => 1
  8                   => 756
  68719476916         => 1
  75924848            => 1
  76069392            => 1
  140690640205664     => 1
  14                  => 475
  4294967296          => 1
  75971456            => 1
  336                 => 1
  75981504            => 1
  4                   => 525
  4658083153315464688 => 2
  76098208            => 1
  13                  => 487
  75961040            => 1
  117                 => 1
  99                  => 46
  76097104            => 1
  ⋮                   => ⋮

In [18]:
runsum = 0
xs = []
cs = []

for (value, count) in sort(c_)
    runsum = runsum + count
    append!(xs, value)
    append!(cs, runsum)
end

total = float(runsum)
cdf_ = [i/total for i in cs]   # cumulative probabilities (ps); cdf_[end] == 1

48-element Array{Float64,1}:
 0.000655881
 0.116747   
 0.161347   
 0.227372   
 0.285636   
 0.343026   
 0.401509   
 0.479012   
 0.533778   
 0.616419   
 0.671622   
 0.723546   
 0.784434   
 ⋮          
 0.998688   
 0.998798   
 0.998907   
 0.999016   
 0.999125   
 0.999235   
 0.999344   
 0.999453   
 0.999563   
 0.999672   
 0.999891   
 1.0        

In [19]:
n = 50
cdf_vals_ = Sample(cdf_, n)

48-element Array{Float64,1}:
 0.000655881
 0.116747   
 0.161347   
 0.227372   
 0.285636   
 0.343026   
 0.401509   
 0.479012   
 0.533778   
 0.616419   
 0.671622   
 0.723546   
 0.784434   
 ⋮          
 0.998688   
 0.998798   
 0.998907   
 0.999016   
 0.999125   
 0.999235   
 0.999344   
 0.999453   
 0.999563   
 0.999672   
 0.999891   
 1.0        

In [20]:
plot(cdf_vals_)
xlabel!("index")
ylabel!("cdf values")

The curve is almost identical to the population's CDF, as expected, because c_ was counted from the full class_weights vector.