# Assessment of Missingness

In [2]:
%load_ext autoreload
%autoreload 2

In [143]:
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.express as px
pd.options.plotting.backend = 'plotly'
from itertools import chain

from utils.eda import *
from utils.dsc80_utils import *
from utils.graph import *
from utils.missing_m import *

##### Copy the cell below to apply all the `eda` transformations to the DataFrame

In [60]:
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')
step0 = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id')
df = (step0
      .pipe(initial)
      .pipe(transform_df)
      .pipe(outlier)
      #.pipe(group_recipe)
      #.pipe(group_user)
)

In [61]:
display_df(df)

Unnamed: 0,name,minutes,contributor_id,recipe_date,...,sodium,protein,sat_fat,carbs
0,1 brownies in the world best ever,40,985201,2008-10-27,...,3.0,3.0,19.0,6.0
10,50 chili for the crockpot,345,2628680,2013-05-28,...,48.0,52.0,21.0,4.0
11,50 chili for the crockpot,345,2628680,2013-05-28,...,48.0,52.0,21.0,4.0
...,...,...,...,...,...,...,...,...,...
234426,cookies by design sugar shortbread cookies,20,506822,2008-04-15,...,4.0,4.0,11.0,6.0
234427,cookies by design sugar shortbread cookies,20,506822,2008-04-15,...,4.0,4.0,11.0,6.0
234428,cookies by design sugar shortbread cookies,20,506822,2008-04-15,...,4.0,4.0,11.0,6.0


## NMAR Analysis
**Analysis**:
Recall, to determine whether data are likely NMAR, you must reason about the data generating process; you cannot conclude that data are likely NMAR solely by looking at your data. As such, there’s no code to write here (and hence, nothing to put in your notebook).

- **NMAR**: Not Missing At Random. The chance that a value in this column is missing depends on the value itself.
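
To make the definition concrete, here is a small simulated sketch contrasting NMAR with MAR. Everything in it is an assumption for illustration: the toy columns `n_steps` and `desc_length`, the thresholds, and the missingness probabilities are invented and are not taken from the recipes data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Toy data: both columns are hypothetical, not from the recipes dataset.
toy = pd.DataFrame({
    "n_steps": rng.integers(1, 30, size=n),
    "desc_length": rng.integers(0, 500, size=n).astype(float),
})

# NMAR: the chance of being missing depends on the value itself
# (here, short descriptions are more likely to be dropped).
p_nmar = np.where(toy["desc_length"] < 100, 0.5, 0.05)
nmar = toy["desc_length"].mask(rng.random(n) < p_nmar)

# MAR: the chance of being missing depends only on another, observed column.
p_mar = np.where(toy["n_steps"] < 5, 0.5, 0.05)
mar = toy["desc_length"].mask(rng.random(n) < p_mar)

# Only a simulation lets us look at the true values behind the missing cells.
true_mean_missing_nmar = toy.loc[nmar.isna(), "desc_length"].mean()
true_mean_observed_nmar = toy.loc[nmar.notna(), "desc_length"].mean()
```

In real data the values behind the missing cells are not available, which is why NMAR has to be argued from the data generating process rather than tested.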

**Report**:
State whether you believe there is a column in your dataset that is NMAR. Explain your reasoning and any additional data you might want to obtain that could explain the missingness (thereby making it MAR). Make sure to explicitly use the term “NMAR.”


### What is missing?

In [63]:
display_df(pd.DataFrame(df.isna().sum()), 23)

Unnamed: 0,0
name,1
minutes,0
contributor_id,0
recipe_date,0
tags,0
n_steps,0
steps,0
description,51
ingredients,0
n_ingredients,0
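
The table above lists every column, including those with no missing values. A minimal sketch of a summary that keeps only the columns with missing values and adds their share of rows is shown here; `missing_summary` is a hypothetical helper (it is not part of `utils`), and the toy frame is invented.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and share of missing values per column,
    keeping only columns that have at least one missing value."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "prop_missing": counts / len(frame),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame with invented values.
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "minutes": [10, 20, 30, 40],
    "description": [None, None, "x", "y"],
})
summary = missing_summary(toy)
```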


### Assessing Missingness for `avg_rating`: MAR? (missing 1)

In [53]:
display_df(pd.DataFrame(df[df['avg_rating'].isna()].iloc[0]), 23)

Unnamed: 0,144443
name,napa dave s individual breakfast casseroles
minutes,45
contributor_id,238966
recipe_date,2008-07-21 00:00:00
tags,"[60-minutes-or-less, time-to-make, course, pre..."
n_steps,6
steps,[grease 4 individual sized baking dishes with ...
description,these are great for guests to grab on the go! ...
ingredients,"[country sausage, eggs, cheddar cheese, potato..."
n_ingredients,5


### Assessing Missingness for `review` (missing 28)

In [58]:
display_df(pd.DataFrame(df[df['review'].isna()].iloc[3]), 23)

Unnamed: 0,25108
name,big boy original double decker hamburger class...
minutes,22
contributor_id,60650
recipe_date,2008-12-03 00:00:00
tags,"[30-minutes-or-less, time-to-make, course, mai..."
n_steps,12
steps,[divide the beef into 2 equal patties and pres...
description,this is from www.top secret restaurant recipes...
ingredients,"[ground beef, mayonnaise, relish, tomato sauce..."
n_ingredients,8


In [36]:
df[df['review'].isna()]['rating'].sum()

128

In [37]:
28*5

140
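
The two cells above compare the rating total over rows with a missing `review` (128) against the largest total possible if each of the 28 rows were a 5-star rating (28 × 5 = 140). If all 28 rows do carry a rating, that is a mean of about 4.57. The same comparison can be written as a mean rating per group; the sketch below uses invented toy values, not the notebook's data.

```python
import pandas as pd

# Toy data (invented values), standing in for the `review` and `rating` columns.
toy = pd.DataFrame({
    "review": ["good", None, "ok", None, "bad", None],
    "rating": [4, 5, 3, 5, 1, 4],
})

# Split rows by whether `review` is missing and compare mean ratings.
review_missing = toy["review"].isna()
mean_when_missing = toy.loc[review_missing, "rating"].mean()
mean_when_present = toy.loc[~review_missing, "rating"].mean()
```

Comparing means rather than sums avoids having to account for the very different group sizes.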

### Assessing Missingness for `name` (missing 1)

In [24]:
display_df(pd.DataFrame(df[df['name'].isna()].iloc[0]), 23)

Unnamed: 0,687
name,
minutes,10
contributor_id,779451
recipe_date,2009-04-27 00:00:00
tags,"[15-minutes-or-less, time-to-make, course, pre..."
n_steps,6
steps,"[in a bowl , combine ingredients except for ol..."
description,-------------
ingredients,"[lemon, honey, horseradish mustard, garlic clo..."
n_ingredients,10


In [26]:
df[df['name'].isna()]['review']

687    This was great! Thanx. It was the only one wit...
Name: review, dtype: object

### Assessing Missingness for `description` (missing 51)

In [12]:
df[df['description'].isna()]

Unnamed: 0,name,minutes,contributor_id,recipe_date,...,sodium,protein,sat_fat,carbs
8869,apricot gorgonzola crescent appetizers,40,991676,2008-10-22,...,7.0,6.0,13.0,4.0
8870,apricot gorgonzola crescent appetizers,40,991676,2008-10-22,...,7.0,6.0,13.0,4.0
8871,apricot gorgonzola crescent appetizers,40,991676,2008-10-22,...,7.0,6.0,13.0,4.0
...,...,...,...,...,...,...,...,...,...
232318,yukon gold potatoes jacques pepin style,20,714468,2009-08-25,...,9.0,14.0,20.0,16.0
232319,yukon gold potatoes jacques pepin style,20,714468,2009-08-25,...,9.0,14.0,20.0,16.0
232320,yukon gold potatoes jacques pepin style,20,714468,2009-08-25,...,9.0,14.0,20.0,16.0


### Conclusion of NMAR Analysis
- `name`: MCAR
- `description`: NMAR
- `review`: NMAR
- `recipe_id`, `user_id`, `avg_rating`, and `review_date` all appear to be missing in the same row (index `144443`): MAR
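
One plausible explanation for several columns being missing in the same row, not confirmed by this notebook, is the `how='left'` merge: a recipe with no matching interaction gets `NaN` in every column that comes from the interactions table. The sketch below shows this on invented toy tables and checks that the right-hand columns are missing in exactly the same rows. Which of the notebook's columns actually originate from the interactions table depends on the `utils` transformations, which are not shown here.

```python
import pandas as pd

# Toy stand-ins for the two raw tables (values invented).
recipes = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 3],
    "user_id": [10, 11, 12],
    "rating": [5, 4, 3],
})

# Same merge shape as the notebook: keep every recipe, reviewed or not.
# `indicator=True` adds a `_merge` column saying where each row came from.
merged = recipes.merge(
    interactions, how="left", left_on="id", right_on="recipe_id", indicator=True
)

# Recipes with no interactions get NaN in every right-hand column at once.
right_cols = ["recipe_id", "user_id", "rating"]
masks = merged[right_cols].isna()
same_rows = bool(masks.eq(masks.iloc[:, 0], axis=0).all().all())
unmatched_ids = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
```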

## Missingness Dependency
**Analysis**:
Pick a column in the dataset with non-trivial missingness to analyze, and perform permutation tests to analyze the dependency of the missingness of this column on other columns.

- Specifically, find at least one other column that the missingness of your selected column does depend on, and at least one other column that the missingness of your selected column does not depend on.

- **Tip**: Make sure you know the difference between the different types of missingness before approaching that section. Many students in the past have lost credit for mistaking one type of missingness for another.

- Note that some datasets may have special requirements for this section; look at the “Special Considerations” section of your chosen dataset for more details.
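
The permutation test described above can be sketched as follows. This is a minimal version under stated assumptions: `missingness_perm_test` is a hypothetical helper (it is not the notebook's `mar_check`), the test statistic is the absolute difference in group means, and the toy columns `depends`, `independent`, and `y` are invented.

```python
import numpy as np
import pandas as pd

def missingness_perm_test(frame, missing_col, other_col, n_reps=1000, seed=0):
    """Hypothetical helper: p-value for the null 'missingness of missing_col
    does not depend on other_col', using the absolute difference in means."""
    rng = np.random.default_rng(seed)
    is_missing = frame[missing_col].isna().to_numpy()
    values = frame[other_col].to_numpy(dtype=float)

    def stat(labels):
        return abs(values[labels].mean() - values[~labels].mean())

    observed = stat(is_missing)
    # Shuffle the missingness labels to simulate the null hypothesis.
    simulated = np.array([stat(rng.permutation(is_missing)) for _ in range(n_reps)])
    p_value = (simulated >= observed).mean()
    return observed, simulated, p_value

# Toy data: `y` is missing more often when `depends` is positive,
# and its missingness is unrelated to `independent`.
rng = np.random.default_rng(42)
n = 2000
toy = pd.DataFrame({
    "depends": rng.normal(size=n),
    "independent": rng.normal(size=n),
})
p_miss = np.where(toy["depends"] > 0, 0.4, 0.05)
toy["y"] = np.where(rng.random(n) < p_miss, np.nan, 1.0)

obs_dep, sim_dep, p_dep = missingness_perm_test(toy, "y", "depends")
obs_ind, sim_ind, p_ind = missingness_perm_test(toy, "y", "independent")
```

A small p-value for one column and a large one for another is the pattern the analysis asks for. The difference in means only detects a shift in location, so a different statistic may be needed when the two distributions differ mainly in shape.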

**Report**:
Present and interpret the results of your missingness permutation tests with respect to your data and question. Embed a plotly plot related to your missingness exploration; ideas include:
• The distribution of column Y when column X is missing and the distribution of column Y when column X is not missing, as was done in Lecture 8.
• The empirical distribution of the test statistic used in one of your permutation tests, along with the observed statistic.


### Decision Rule for `description`
Let's assume that the missingness of the `description` column is related to the `n_steps` column.

In [163]:
mar_check(df, 'description', 'n_steps')

This result comes entirely from simulation. The distribution of `n_steps` looks about the same whether `description` is missing or not, so `n_steps` is probably not a column that the missingness of `description` depends on (no evidence of **MAR** with respect to `n_steps`).

### Decision Rule for `review`

In [181]:
mar_check(df, 'review', 'rating')

Interesting graph for the missingness of `review` against `rating`. Should we use the K-L statistic instead?
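
The note above asks about a statistic that compares whole distributions. Taking "K-L" at face value as the Kullback-Leibler divergence (it may instead have meant the Kolmogorov-Smirnov statistic), here is a hedged sketch that computes it for a discrete column such as `rating`, alongside the total variation distance (TVD) as another common choice. All helper names and toy values are assumptions for illustration, and the `eps` smoothing is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def group_distributions(frame, missing_col, cat_col):
    """Proportion of each cat_col value, split by whether missing_col is missing."""
    labels = frame[missing_col].isna().map({True: "missing", False: "present"})
    counts = pd.crosstab(frame[cat_col], labels)
    counts = counts.reindex(columns=["present", "missing"], fill_value=0)
    return counts / counts.sum()

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence KL(p || q); eps avoids log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy data (invented values): ratings skew high when the review is missing.
toy = pd.DataFrame({
    "rating": [5, 5, 5, 4, 5, 4, 3, 2, 1, 3],
    "review": [None, None, None, None, "a", "b", "c", "d", "e", "f"],
})
dist = group_distributions(toy, "review", "rating")
observed_tvd = tvd(dist["missing"], dist["present"])
observed_kl = kl_divergence(dist["missing"], dist["present"])
```

Either value could serve as the test statistic in a permutation test by recomputing it after shuffling the missingness labels. TVD is bounded between 0 and 1 and symmetric, while the K-L divergence is asymmetric and sensitive to categories with zero counts, which is why the smoothing term is needed.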