## Introduction:

How much did you weigh at birth? If you don’t know, call your mother or someone else who knows. Using the pooled data (all live births), compute the distribution of birth weights and use it to find your percentile rank. 

If you were a first baby, find your percentile rank in the distribution for first babies. Otherwise use the distribution for others. If you are in the 90th percentile or higher, call your mother back and apologize.
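
The percentile-rank step described above can be sketched as follows. This is a minimal illustration, assuming birth weights are available as a plain vector of total ounces; the function name `percentile_rank` and the sample weights are hypothetical and are not taken from the dataset.

```julia
# Percentile rank: the percentage of values less than or equal to x.
function percentile_rank(vals, x)
    n_le = count(v -> v <= x, vals)
    return 100 * n_le / length(vals)
end

# Hypothetical birth weights in ounces (illustration only, not NSFG data)
sample_weights = [88, 104, 112, 117, 120, 125, 131, 140]

# A baby of 7 lb 8 oz weighs 7 * 16 + 8 = 120 oz
my_weight = 7 * 16 + 8
my_rank = percentile_rank(sample_weights, my_weight)
println(my_rank)   # 5 of the 8 sample weights are <= 120, so this prints 62.5
```

With the real data, `vals` would be the combined `birthwgt_lb * 16 + birthwgt_oz` values for either first babies or others, depending on which group applies.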

In [3]:
using DataFrames
using Plots
using StatsBase   # provides countmap, used below

## Columns Extraction:

In [4]:
# Each entry: (column name, first character, last character, type) in the fixed-width file
data = [("caseid",1,12,Int), ("nbrnaliv",22,22,Int),("babysex",56,56,Int), ("birthwgt_lb",57,58,Int),("birthwgt_oz",59,60,Int), ("prglength", 275,276,Int), ("outcome", 277,277, Int), ("birthord",278,279,Int), ("agepreg", 284,287,Int), ("finalwgt",423,440,Float64)]

10-element Array{Tuple{String,Int64,Int64,DataType},1}:
 ("caseid",1,12,Int64)       
 ("nbrnaliv",22,22,Int64)    
 ("babysex",56,56,Int64)     
 ("birthwgt_lb",57,58,Int64) 
 ("birthwgt_oz",59,60,Int64) 
 ("prglength",275,276,Int64) 
 ("outcome",277,277,Int64)   
 ("birthord",278,279,Int64)  
 ("agepreg",284,287,Int64)   
 ("finalwgt",423,440,Float64)

In [5]:
df_=DataFrame()

In [6]:
map(x->df_[Symbol(x[1])]=Vector{x[4]}(0),data)

10-element Array{Array{T,1},1}:
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Int64[]  
 Float64[]

In [7]:
open("2002FemPreg.dat","r") do io
    for l in eachline(io)
        if length(l) > 100   # ignore short lines (e.g. blank or truncated)
            for case in data
                # Parse the fixed-width field; fall back to NA if parsing fails
                val2push = try
                    parse(case[4], l[case[2]:case[3]])
                catch
                    NA
                end
                push!(df_[Symbol(case[1])], val2push)
            end
        end
    end
end

In [8]:
df_

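| Row | caseid | nbrnaliv | babysex | birthwgt_lb | birthwgt_oz | prglength | outcome | birthord | agepreg | finalwgt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 8 | 13 | 39 | 1 | 1 | 3316 | 6448.271111704751 |
| 2 | 1 | 1 | 2 | 7 | 14 | 39 | 1 | 2 | 3925 | 6448.271111704751 |
| 3 | 2 | 3 | 1 | 9 | 2 | 39 | 1 | 1 | 1433 | 12999.542264385902 |
| 4 | 2 | 1 | 2 | 7 | 0 | 39 | 1 | 2 | 1783 | 12999.542264385902 |
| 5 | 2 | 1 | 2 | 6 | 3 | 39 | 1 | 3 | 1833 | 12999.542264385902 |
| 6 | 6 | 1 | 1 | 8 | 9 | 38 | 1 | 1 | 2700 | 8874.440799222995 |
| 7 | 6 | 1 | 2 | 9 | 9 | 40 | 1 | 2 | 2883 | 8874.440799222995 |
| 8 | 6 | 1 | 2 | 8 | 6 | 42 | 1 | 3 | 3016 | 8874.440799222995 |
| 9 | 7 | 1 | 1 | 7 | 9 | 39 | 1 | 1 | 2808 | 6911.879920534536 |
| 10 | 7 | 1 | 2 | 6 | 10 | 35 | 1 | 2 | 3233 | 6911.879920534536 |
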

In [9]:
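first_ = []    # pregnancy lengths (weeks) of first babies
others_ = []   # pregnancy lengths (weeks) of all other babies
for i = 1:size(df_, 1)
    if df_[i, :outcome] == 1          # outcome 1 = live birth
        if df_[i, :birthord] == 1     # first baby
            push!(first_, df_[i, :prglength])
        else
            push!(others_, df_[i, :prglength])
        end
    end
end
first_   # 4413 entries: the live-born first babies, matching the expected count
others_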

4735-element Array{Any,1}:
 39
 39
 39
 40
 42
 35
 37
 33
 39
 39
 39
 37
 40
  ⋮
 39
 39
 44
 40
 39
 39
 39
 39
 39
 39
 39
 39

## Calculating CDFs:

### 1. First babies:

In [10]:
c = countmap(first_)  # number of first babies at each pregnancy length

Dict{Any,Int64} with 31 entries:
  47 => 1
  32 => 55
  40 => 536
  39 => 2114
  21 => 1
  46 => 1
  43 => 87
  26 => 16
  0  => 1
  35 => 159
  42 => 205
  34 => 29
  29 => 9
  25 => 1
  17 => 1
  22 => 3
  44 => 23
  24 => 7
  37 => 208
  28 => 24
  38 => 272
  20 => 1
  45 => 6
  23 => 1
  31 => 15
  ⋮  => ⋮

In [18]:
## Compute the empirical CDF from a dictionary of value => count
function calc_cdf(counts)
    runsum = 0
    xs = []
    cs = []

    # Visit the values in ascending order, accumulating the running count
    for value in sort(collect(keys(counts)))
        runsum += counts[value]
        push!(xs, value)
        push!(cs, runsum)
    end

    total = float(runsum)
    cdf = [n / total for n in cs]   # cumulative fractions; cdf[end] == 1

    return xs, cdf
end



calc_cdf (generic function with 1 method)

In [19]:
k, v = calc_cdf(c)

(Any[0,17,20,21,22,23,24,25,26,27  …  39,40,41,42,43,44,45,46,47,48],[0.000226603,0.000453206,0.00067981,0.000906413,0.00158622,0.00181283,0.00339905,0.00362565,0.0072513,0.00747791  …  0.723091,0.84455,0.926127,0.972581,0.992295,0.997507,0.998867,0.999094,0.99932,1.0])

In [30]:
plot(k, v, label="First babies")
xlabel!("Pregnancy length (weeks)")
ylabel!("CDF")

### 2. Other babies:

In [31]:
c_o = countmap(others_)

Dict{Any,Int64} with 32 entries:
  18 => 1
  32 => 60
  50 => 2
  40 => 580
  39 => 2579
  21 => 1
  43 => 61
  9  => 1
  26 => 19
  35 => 152
  42 => 123
  34 => 31
  25 => 2
  29 => 12
  19 => 1
  17 => 1
  22 => 4
  44 => 23
  24 => 6
  37 => 247
  4  => 1
  28 => 8
  38 => 335
  45 => 4
  31 => 12
  ⋮  => ⋮

In [32]:
k_o, v_o = calc_cdf(c_o)

(Any[4,9,13,17,18,19,21,22,24,25  …  38,39,40,41,42,43,44,45,48,50],[0.000211193,0.000422386,0.00063358,0.000844773,0.00105597,0.00126716,0.00147835,0.00232313,0.00359029,0.00401267  …  0.239071,0.783738,0.90623,0.954171,0.980148,0.993031,0.997888,0.998733,0.999578,1.0])

In [33]:
plot!(k_o, v_o, label="Others")
xlabel!("Pregnancy length (weeks)")
ylabel!("CDF")

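The red curve is the CDF for other babies; the blue curve is the CDF for first babies. Both are CDFs of pregnancy length (`prglength`, in weeks) for live births.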
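
To compare the two groups numerically rather than by eye, one option is to read each CDF off at a given pregnancy length. The following is a rough sketch, assuming `xs` is sorted in ascending order and `cdf` holds the matching cumulative fractions, as returned by `calc_cdf`. The helper name `cdf_at` and the toy numbers are hypothetical.

```julia
# Evaluate a step-function CDF at x: return the cumulative fraction of the
# largest tabulated value that is <= x, or 0.0 if x is below all of them.
function cdf_at(xs, cdf, x)
    i = searchsortedlast(xs, x)   # index of the last element <= x, or 0
    return i == 0 ? 0.0 : cdf[i]
end

# Toy values for illustration only (not the NSFG results)
toy_xs  = [37, 38, 39, 40]
toy_cdf = [0.1, 0.3, 0.8, 1.0]

println(cdf_at(toy_xs, toy_cdf, 38))   # fraction lasting 38 weeks or less
println(cdf_at(toy_xs, toy_cdf, 36))   # below the smallest tabulated value
```

Calling `cdf_at(k, v, 38)` and `cdf_at(k_o, v_o, 38)` would then give the fraction of first and other babies, respectively, born at 38 weeks or earlier.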