# Solutions to Problem Set 2

*Stats 507, Fall 2021*

Shihao Wu, PhD student

## Question 0 - Code review warmup

Many organizations that produce software use code reviews to help maintain a high-quality code base. A good code review should address the following questions:

1. Does the code work? Does it do what it is supposed to?
2. How is the style of the code? Does it follow the style guidelines?
3. Is the code clearly structured and easy to understand?
4. Is the code efficient? If clarity is sacrificed for efficiency, are there comments that help to alleviate this?

In this question, you will interpret a code snippet you did not write and then write a “code review” for that snippet.

### Code snippet

In [1]:
sample_list = [(1, 3, 5), (0, 1, 2), (1, 9, 8)]
op = []
for m in range(len(sample_list)):
    li = [sample_list[m]]
        for n in range(len(sample_list)):
            if (sample_list[m][0] == sample_list[n][0] and
                    sample_list[m][3] != sample_list[n][3]):
                li.append(sample_list[n])
        op.append(sorted(li, key=lambda dd: dd[3], reverse=True)[0])
res = list(set(op))

IndentationError: unexpected indent (<ipython-input-1-537e3dd702ce>, line 5)

### Code review

**a.** Given a list of tuples, the code aims to find, for each distinct first element, the tuple with the largest third element among all tuples that share that first element. A few suggestions for the code writer follow.

**b.** The code does not currently work due to 'unexpected indent' and 'tuple index out of range' errors. To fix the first, the writer should use consistent indentation (preferably 4 spaces per level) throughout the code and indent only *where a new block begins*. For the second, note that the indices of a length-$p$ list/tuple **a** in Python start at $0$ and end at $p-1$, so **a**\[p\] raises an <code>IndexError</code>. To get the $p$th item in **a**, use **a**\[p-1\]. When iterating over the sample list, the writer iterated over indices. Iterating directly over the values in the list is simpler and more efficient. The code is well structured and its intent is not too difficult to understand. However, a comment explaining the <code>lambda</code> function used as the sort key would make the code easier to follow.
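
The two points about indexing and iteration can be illustrated with a short sketch. The variable names below (`first`, `by_index`, `by_value`) are illustrative and are not part of the reviewed snippet.

```python
sample_list = [(1, 3, 5), (0, 1, 2), (1, 9, 8)]
first = sample_list[0]

# A length-3 tuple has valid indices 0, 1, 2; index 3 is out of range.
try:
    first[3]
    error_name = None
except IndexError as err:
    error_name = type(err).__name__
print(error_name)

# The third (last) item lives at index p - 1 = 2 for p = 3.
third_item = first[len(first) - 1]

# Iterating over indices versus iterating over the values directly.
by_index = [sample_list[m][2] for m in range(len(sample_list))]
by_value = [tup[2] for tup in sample_list]
print(third_item, by_index == by_value)
```

Both comprehensions produce the same list; the second avoids the index bookkeeping.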

## Imports

The remaining questions will use the following imports.

In [10]:
# modules: --------------------
import numpy as np
import pandas as pd
import datetime
from IPython.display import display, HTML
# --------------------

## Question 1 - List of Tuples

Write a function that uses NumPy and a list comprehension to 
generate a random list of <code>n</code> k-tuples containing integers 
ranging from <code>low</code> to <code>high</code>. Choose an appropriate name 
for your function, and reasonable default values for <code>k</code>, 
<code>low</code>, and <code>high</code>.

Use <code>assert</code> to test that your function returns a list of tuples.

In [26]:
def gen_lt(n, k=1, low=0, high=100):
    """
    Generate a random list of tuples.

    Generate a random list of n k-tuples containing integers ranging
    from low (inclusive) to high (exclusive).

    Parameters
    ----------
    n : int
        The length of the list.
    k : int
        The length of each tuple within the list.
    low, high : int
        Bounds of the random integers: each value v satisfies
        low <= v < high.

    Returns
    -------
    list of tuples
        The generated list of n k-tuples.

    """
    gen_list = [tuple(np.random.randint(low=low, high=high, size=k)) 
                for i in range(n)]
    return gen_list

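# A non-empty list is always truthy, so test the elements with all().
gen_out = gen_lt(10)
assert isinstance(gen_out, list)
assert all(isinstance(tup, tuple) for tup in gen_out)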
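
As a variation, the same kind of list can be built from a single NumPy call followed by a list comprehension over the rows. This is only a sketch: the function name `gen_lt_rows`, the `seed` parameter, and the use of the Generator API (`np.random.default_rng`) are assumptions for illustration, not part of the assignment. As with `np.random.randint`, the upper bound `high` is exclusive here.

```python
import numpy as np


def gen_lt_rows(n, k=1, low=0, high=100, seed=None):
    """Hypothetical variant of gen_lt that draws all n * k integers at once."""
    rng = np.random.default_rng(seed)
    # One (n, k) array of integers in [low, high); 'high' is exclusive.
    draws = rng.integers(low, high, size=(n, k))
    # Convert each row to a tuple of plain Python ints.
    return [tuple(int(v) for v in row) for row in draws]


sample = gen_lt_rows(5, k=3, seed=0)
assert isinstance(sample, list)
assert all(isinstance(tup, tuple) and len(tup) == 3 for tup in sample)
```

Drawing one array avoids calling the random number generator once per tuple, which may matter for large `n`.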

## Question 2 - Refactor the Snippet

In this question, you will write functions to accomplish the 
goal you concisely described in part “a” of the warm up.

a. Encapsulate the code snippet from the warmup into a function 
that parameterizes the role of <code>0</code> and <code>3</code> and is otherwise 
unchanged. Choose appropriate names for these parameters.

In [27]:
def res(in_list, uni_idx, com_idx):
    """
    Reduce a list of tuples to a shorter list.

    Among all tuples that share the same element at position
    'uni_idx', keep those with the largest element at position
    'com_idx'.

    Parameters
    ----------
    in_list : list of tuples
        The list to reduce.
    uni_idx : int
        The index of the element that identifies a group of tuples.
    com_idx : int
        The index of the element compared within each group; the
        tuples with the largest such element are kept.

    Returns
    -------
    list of tuples
        The selected tuples.

    """
    op = []
    for m in range(len(in_list)):
        li = [in_list[m]]
        for n in range(len(in_list)):
            if (in_list[m][uni_idx] == in_list[n][uni_idx] and
                    in_list[m][com_idx] != in_list[n][com_idx]):
                li.append(in_list[n])
        op.append(sorted(li, key=lambda dd: dd[com_idx], reverse=True)[0])
    res = list(set(op))
    return res

res([(1, 3, 5), (0, 1, 2), (1, 9, 8)], 0, 2)

[(0, 1, 2), (1, 9, 8)]

b. Write an improved version of the function from part a that
implements the suggestions from the code review you wrote
in part b of the warmup.

In [72]:
def res_imp(in_list, uni_idx, com_idx):
    """
    Improved version of res(in_list, uni_idx, com_idx).

    Reduce a list of tuples to a shorter list: among all tuples that
    share the same element at position 'uni_idx', keep those with the
    largest element at position 'com_idx'.

    Parameters
    ----------
    in_list : list of tuples
        The list to reduce.
    uni_idx : int
        The index of the element that identifies a group of tuples.
    com_idx : int
        The index of the element compared within each group; the
        tuples with the largest such element are kept.

    Returns
    -------
    list of tuples
        The selected tuples.

    """
    op = []
    for tup1 in in_list:
        li = [tup1]
        for tup2 in in_list:
            if (tup1[uni_idx] == tup2[uni_idx] and
                    tup1[com_idx] != tup2[com_idx]):
                li.append(tup2)
        
        # Sort the candidates by their 'com_idx' element in descending
        # order and keep the first one, i.e. the tuple with the largest
        # 'com_idx' element among those sharing tup1's 'uni_idx' element.
        op.append(sorted(li, key=lambda dd: dd[com_idx], reverse=True)[0])
    res_imp = list(set(op))
    return res_imp

res_imp([(1, 3, 5), (0, 1, 2), (1, 9, 8)], 0, 2)

[(0, 1, 2), (1, 9, 8)]

c. Write a function from scratch to accomplish the same task 
as the previous two parts. Your solution should traverse
the input list of tuples no more than twice. Hint: consider 
using a dictionary or a default dictionary in your solution.

In [73]:
def res_dic(in_list, uni_idx, com_idx, dic=None):
    """
    Improved version of res_imp(in_list, uni_idx, com_idx).

    Uses a dictionary so that the input list of tuples is traversed
    no more than twice. Among all tuples that share the same element
    at position 'uni_idx', keep the one with the largest element at
    position 'com_idx'.

    Parameters
    ----------
    in_list : list of tuples
        The list to reduce.
    uni_idx : int
        The index of the element that identifies a group of tuples.
    com_idx : int
        The index of the element compared within each group; the
        tuple with the largest such element is kept.
    dic : dict, optional
        A dictionary grouping the tuples by their 'uni_idx' element.
        If None (the default), it is built from 'in_list'.

    Returns
    -------
    list of tuples
        The selected tuples.

    """
    if dic is None:
        # First pass: group the tuples by their 'uni_idx' element.
        dic = {}
        for tup in in_list:
            dic.setdefault(tup[uni_idx], []).append(tup)
    # Second pass: keep the tuple with the largest 'com_idx' element
    # in each group (the first such tuple if there are ties).
    op = [max(group, key=lambda dd: dd[com_idx]) for group in dic.values()]
    res_dic = list(set(op))
    return res_dic

res_dic([(1, 3, 5), (0, 1, 2), (1, 9, 8)], 0, 2)

[(0, 1, 2), (1, 9, 8)]
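
The dictionary idea from the hint can also be taken one step further by storing only the current best tuple per key, so the input is traversed once and no per-group lists are kept. The sketch below is an illustration under that assumption; the name `res_onepass` is hypothetical and not part of the assignment. On ties it keeps the first tuple seen.

```python
def res_onepass(in_list, uni_idx, com_idx):
    """Hypothetical single-pass variant: keep the best tuple per key."""
    best = {}
    for tup in in_list:
        key = tup[uni_idx]
        # Replace the stored tuple only if this one is strictly larger,
        # so the first tuple seen wins when 'com_idx' elements are tied.
        if key not in best or tup[com_idx] > best[key][com_idx]:
            best[key] = tup
    return list(best.values())


out = res_onepass([(1, 3, 5), (0, 1, 2), (1, 9, 8)], 0, 2)
print(sorted(out))  # [(0, 1, 2), (1, 9, 8)]
```

The result is sorted before printing only to make the output order predictable.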

d. Using the function you wrote in question 1 to generate a list 
of tuples as input(s), run and summarize a small Monte Carlo 
study comparing the execution times of the three functions 
above (a-c).

In [157]:
n_choices = [10, 30, 100, 300, 1000, 3000, 10000, 30000]
time_res = []
time_imp = []
time_dic = []

for n in n_choices:
    time_imp_int = []
    time_dic_int = []
    time_res_int = []
    for i in range(10):  # 10 Monte Carlo repetitions
        list_t = gen_lt(n=n, k=10, low=0, high=100)
        start = datetime.datetime.now()
        res(in_list=list_t, uni_idx=1, com_idx=3)
        time_res_int.append(datetime.datetime.now() - start)
        list_t = gen_lt(n=n, k=10, low=0, high=100)
        start = datetime.datetime.now()
        res_imp(in_list=list_t, uni_idx=1, com_idx=3)
        time_imp_int.append(datetime.datetime.now() - start)
        list_t = gen_lt(n=n, k=10, low=0, high=100)
        start = datetime.datetime.now()
        res_dic(in_list=list_t, uni_idx=1, com_idx=3)
        time_dic_int.append(datetime.datetime.now()-start)
    time_res.append(str(np.mean(time_res_int)))
    time_imp.append(str(np.mean(time_imp_int)))
    time_dic.append(str(np.mean(time_dic_int)))


    
tab = pd.DataFrame(
    {
     "n": n_choices,
     "res() time" : time_res,
     "res_imp() time" : time_imp,
     "res_dic() time" : time_dic,
     }
    )


display(HTML(tab.to_html(index=False)))

| n | res() time | res_imp() time | res_dic() time |
|---|---|---|---|
| 10 | 0:00:00.000057 | 0:00:00.000036 | 0:00:00.000036 |
| 30 | 0:00:00.000222 | 0:00:00.000149 | 0:00:00.000073 |
| 100 | 0:00:00.001370 | 0:00:00.000906 | 0:00:00.000135 |
| 300 | 0:00:00.008627 | 0:00:00.005689 | 0:00:00.000259 |
| 1000 | 0:00:00.100651 | 0:00:00.062807 | 0:00:00.000792 |
| 3000 | 0:00:00.914204 | 0:00:00.564544 | 0:00:00.002309 |
| 10000 | 0:00:56.733748 | 0:00:06.335263 | 0:00:00.008054 |
| 30000 | 0:01:34.931884 | 0:01:00.207007 | 0:00:00.026140 |
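
For short durations like these, `time.perf_counter` from the standard library is a common alternative to `datetime.datetime.now()`, since it is a high-resolution clock intended for interval timing. The sketch below is illustrative only: the helper name `mean_runtime` is hypothetical, and `sorted` and `max` stand in for the three functions compared above. It also times both functions on the same input, so that differences reflect the functions rather than the random data.

```python
import time
from statistics import mean


def mean_runtime(func, *args, reps=10):
    """Hypothetical helper: mean wall-clock seconds over 'reps' calls."""
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    return mean(times)


# Stand-in functions timed on one shared input list.
data = list(range(10_000, 0, -1))
t_sorted = mean_runtime(sorted, data)
t_max = mean_runtime(max, data)
print(f"sorted: {t_sorted:.6f} s, max: {t_max:.6f} s")
```

The returned values are plain floats in seconds, which can go directly into a data frame without string conversion.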


## Question 3 

In this question you will use Pandas to read, clean, and append several data files from the National Health and Nutrition Examination Survey (NHANES). We will use the data you prepare in this question as the starting point for analyses in one or more future problem sets. For this problem, you should use the four cohorts spanning the years 2011-2018.

a. Use Python and Pandas to read and append the demographic datasets keeping only columns containing the unique ids (SEQN), age (RIDAGEYR), race and ethnicity (RIDRETH3), education (DMDEDUC2), and marital status (DMDMARTL), along with the following variables related to the survey weighting: (RIDSTATR, SDMVPSU, SDMVSTRA, WTMEC2YR, WTINT2YR). Add an additional column identifying to which cohort each case belongs. Rename the columns with literate variable names using all lower case and convert each column to an appropriate type. Finally, save the resulting data frame to a serialized “round-trip” format of your choosing (e.g. pickle, feather, or parquet).

In [45]:
# Read each cohort's demographic file, keep the requested columns, and
# label each case with its cohort.
demo_cols = ['SEQN', 'RIDAGEYR', 'RIDRETH3', 'DMDEDUC2', 'DMDMARTL',
             'RIDSTATR', 'SDMVPSU', 'SDMVSTRA', 'WTMEC2YR', 'WTINT2YR']
demo_files = {
    'DEMO_G.XPT': '2011-2012',
    'DEMO_H.XPT': '2013-2014',
    'DEMO_I.XPT': '2015-2016',
    'DEMO_J.XPT': '2017-2018',
}

demo_frames = []
for file_name, cohort in demo_files.items():
    df = pd.read_sas(file_name)[demo_cols].copy()
    df['cohort'] = cohort
    demo_frames.append(df)

# Append the cohorts and number the rows 1, 2, ..., N across cohorts.
df_demo = pd.concat(demo_frames, ignore_index=True)
df_demo.index = df_demo.index + 1

df_demo = df_demo.set_axis(["id", "age", "race", 
                            "education", "marital", 
                            "examination",
                            "pseudo-sup",
                            "pseudo-stra",
                            "mec",
                            "interviewed", "cohort"], axis=1)
# Convert the id to an integer before casting it to a string so that it
# reads '62161' rather than '62161.0'.
df_demo = df_demo.astype({'id': 'int64'})
df_demo = df_demo.astype({
    'id': 'string', 'race': 'string', 'education': 'string',
    'marital': 'string', 'examination': 'string', 'mec': 'string',
    'interviewed': 'string',
})


print(df_demo.dtypes)
df_demo.to_pickle('df_demo.pickle')
df_demo

id              string
age            float64
race            string
education       string
marital         string
examination     string
pseudo-sup     float64
pseudo-stra    float64
mec             string
interviewed     string
cohort          object
dtype: object


| | id | age | race | education | marital | examination | pseudo-sup | pseudo-stra | mec | interviewed | cohort |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 62161 | 22.0 | 3.0 | 3.0 | 5.0 | 2.0 | 1.0 | 91.0 | 104236.582554 | 102641.406474 | 2011-2012 |
| 2 | 62162 | 3.0 | 1.0 | | | 2.0 | 3.0 | 92.0 | 16116.35401 | 15457.736897 | 2011-2012 |
| 3 | 62163 | 14.0 | 6.0 | | | 2.0 | 3.0 | 90.0 | 7869.485117 | 7397.684828 | 2011-2012 |
| 4 | 62164 | 44.0 | 3.0 | 4.0 | 1.0 | 2.0 | 1.0 | 94.0 | 127965.226204 | 127351.373299 | 2011-2012 |
| 5 | 62165 | 14.0 | 4.0 | | | 2.0 | 2.0 | 90.0 | 13384.042162 | 12209.74498 | 2011-2012 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39152 | 102952 | 70.0 | 6.0 | 3.0 | 1.0 | 2.0 | 2.0 | 138.0 | 18338.711104024176 | 16896.27620310902 | 2017-2018 |
| 39153 | 102953 | 42.0 | 1.0 | 3.0 | 4.0 | 2.0 | 2.0 | 137.0 | 63661.95157344447 | 61630.3800130472 | 2017-2018 |
| 39154 | 102954 | 41.0 | 4.0 | 5.0 | 5.0 | 2.0 | 1.0 | 144.0 | 17694.78334581792 | 17160.895268622276 | 2017-2018 |
| 39155 | 102955 | 14.0 | 4.0 | | | 2.0 | 1.0 | 136.0 | 14871.83963566592 | 14238.445922281095 | 2017-2018 |
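
A "round-trip" format is one from which the data frame can be read back with its values, index, and column types intact. As a minimal sketch of how this could be checked, the example below writes a small made-up frame (a stand-in for `df_demo`, not real NHANES data) to a temporary pickle file and compares it with what is read back. The names `toy` and `restored` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Small synthetic stand-in for df_demo; the values are made up.
toy = pd.DataFrame(
    {
        "id": pd.Series(["62161", "62162"], dtype="string"),
        "age": [22.0, 3.0],
        "cohort": ["2011-2012", "2011-2012"],
    },
    index=[1, 2],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if values, index, or dtypes differ.
pd.testing.assert_frame_equal(toy, restored)
print("round trip preserved the data frame")
```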


b. Repeat part a for the oral health and dentition data (OHXDEN_*.XPT) retaining the following variables: SEQN, OHDDESTS, tooth counts (OHXxxTC), and coronal cavities (OHXxxCTC).
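
The `xx` in `OHXxxTC` and `OHXxxCTC` stands for a zero-padded two-digit tooth number. As a minimal sketch, such names can be generated with an f-string format specifier. The tooth range (1 to 31) and the three teeth without a retained `CTC` column mirror the choices made in this solution and are not a statement about the full NHANES codebook. The variable names are illustrative.

```python
teeth = range(1, 32)   # tooth numbers used in this solution
no_ctc = {1, 16, 17}   # teeth whose CTC column this solution drops

# '{i:02d}' pads the tooth number with a leading zero to two digits.
tc_cols = [f"OHX{i:02d}TC" for i in teeth]
ctc_cols = [f"OHX{i:02d}CTC" for i in teeth if i not in no_ctc]
cols = ["SEQN", "OHDDESTS"] + tc_cols + ctc_cols

print(len(cols), tc_cols[0], tc_cols[-1], ctc_cols[0])
```

This lists all tooth count columns before the coronal caries columns, whereas the solution interleaves them; the set of names is the same.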

In [46]:
# Build the list of columns to keep.
col = ['SEQN', 'OHDDESTS']
for i in range(1, 10):
    col.append('OHX0'+str(i)+'TC')
    col.append('OHX0'+str(i)+'CTC')
for i in range(10, 32):
    col.append('OHX'+str(i)+'TC')
    col.append('OHX'+str(i)+'CTC')
col.remove("OHX16CTC")
col.remove("OHX17CTC")
col.remove("OHX01CTC")

# Read each cohort's oral health file, keep the requested columns, and
# label each case with its cohort.
oral_files = {
    'OHXDEN_G.XPT': '2011-2012',
    'OHXDEN_H.XPT': '2013-2014',
    'OHXDEN_I.XPT': '2015-2016',
    'OHXDEN_J.XPT': '2017-2018',
}

oral_frames = []
for file_name, cohort in oral_files.items():
    df = pd.read_sas(file_name)[col].copy()
    df['cohort'] = cohort
    oral_frames.append(df)

# Append the cohorts and number the rows 1, 2, ..., N across cohorts.
df_oral = pd.concat(oral_frames, ignore_index=True)
df_oral.index = df_oral.index + 1
df_oral.rename(columns={'SEQN':'id'}, inplace=True)


print(df_oral.dtypes)
df_oral.to_pickle('df_oral.pickle')
df_oral

id          float64
OHDDESTS    float64
OHX01TC     float64
OHX02TC     float64
OHX02CTC     object
             ...   
OHX30TC     float64
OHX30CTC     object
OHX31TC     float64
OHX31CTC     object
cohort       object
Length: 62, dtype: object


| | id | OHDDESTS | OHX01TC | OHX02TC | OHX02CTC | OHX03TC | OHX03CTC | OHX04TC | OHX04CTC | OHX05TC | ... | OHX27CTC | OHX28TC | OHX28CTC | OHX29TC | OHX29CTC | OHX30TC | OHX30CTC | OHX31TC | OHX31CTC | cohort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 62161.0 | 1.0 | 4.0 | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'Z' | 2.0 | b'S' | 2011-2012 |
| 2 | 62162.0 | 1.0 | 4.0 | 4.0 | b'U' | 4.0 | b'U' | 1.0 | b'D' | 1.0 | ... | b'D' | 1.0 | b'D' | 1.0 | b'D' | 4.0 | b'U' | 4.0 | b'U' | 2011-2012 |
| 3 | 62163.0 | 1.0 | 4.0 | 2.0 | b'S' | 2.0 | b'Y' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'Y' | 2.0 | b'S' | 2011-2012 |
| 4 | 62164.0 | 1.0 | 4.0 | 2.0 | b'Z' | 2.0 | b'Z' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'Z' | 2.0 | b'Z' | 2011-2012 |
| 5 | 62165.0 | 1.0 | 4.0 | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2011-2012 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35905 | 102952.0 | 1.0 | 2.0 | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2017-2018 |
| 35906 | 102953.0 | 1.0 | 2.0 | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'Z' | 2017-2018 |
| 35907 | 102954.0 | 1.0 | 2.0 | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'F' | 2.0 | b'S' | 2.0 | b'S' | 2017-2018 |
| 35908 | 102955.0 | 1.0 | 4.0 | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | ... | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'S' | 2.0 | b'Z' | 2017-2018 |
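
The coronal caries columns above display as bytes (e.g. `b'S'`) because the character data were read without decoding. As a minimal sketch, on a made-up column rather than the real data, such values can be decoded to ordinary strings with the `.str.decode` accessor. The frame `toy` is illustrative.

```python
import pandas as pd

# Made-up stand-in for one coronal caries column as read from the file.
toy = pd.DataFrame({"OHX02CTC": [b"S", b"U", b"Z", None]})

# Decode bytes to str; the missing value stays missing.
toy["OHX02CTC"] = toy["OHX02CTC"].str.decode("utf-8")
print(toy["OHX02CTC"].tolist())
```

`pd.read_sas` also accepts an `encoding` argument, which may be a way to obtain decoded strings at read time instead.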


c. In your notebook, report the number of cases there are in the two datasets above.

In [25]:
print(f"The number of cases in the first dataset is {df_demo.shape[0]}.")
print(f"The number of cases in the second dataset is {df_oral.shape[0]}.")

The number of cases in the first dataset is 39156.
The number of cases in the second dataset is 35909.
