# **Missing Data Mechanisms**
When we find missing values in our data, they are usually caused by one of three mechanisms outlined by Rubin (1976). It is very important that you understand them, as they have a major influence on how you conduct your analysis. They also determine whether it is appropriate to impute missing values. The following [link](https://stefvanbuuren.name/fimd/sec-MCAR.html) gives a nice explanation of the area. The three mechanisms are as follows:

# **Missing Completely at Random (MCAR)**
Missing Completely at Random is pretty straightforward. It means what it says: the propensity for a data point to be missing is completely random.

There’s no relationship between whether a data point is missing and any values in the data set, missing or observed.

The missing data are just a random subset of the data.
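
As a minimal sketch of MCAR, the snippet below deletes values from a synthetic variable using a coin flip that ignores the data entirely. The variable name `score`, the 20% missing rate and the random seed are illustrative assumptions, not part of the class-grades data used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical complete data: 1,000 exam-style scores.
full = pd.Series(rng.normal(loc=70, scale=15, size=1000), name="score")

# MCAR: every value has the same 20% chance of being deleted,
# regardless of its own value or of any other variable.
mcar_mask = rng.random(full.size) < 0.20
observed = full.mask(mcar_mask)

print(f"fraction missing:      {observed.isna().mean():.2f}")
print(f"mean of full data:     {full.mean():.2f}")
print(f"mean of observed data: {observed.mean():.2f}")
```

Because the deleted values are a random subset, the mean of the observed values should stay close to the mean of the full data; only precision is lost.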

# **Missing at Random (MAR)**

This is where the unfortunate names come in.

Missing at Random means that the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data.

Whether or not someone answered #13 on your survey has nothing to do with the missing values, but it does have to do with the values of some other variable.

A better name would actually be Missing Conditionally at Random, because the missingness is conditional on another variable.  But that’s not what Rubin originally picked, and it would really mess up the acronyms at this point.

The idea is, if we can control for this conditional variable, we can get a random subset.

A related case arises when questions that were never asked could determine whether people answer the question at hand.

You can imagine that good techniques for data that is missing at random need to incorporate variables that are related to the missingness.
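
The idea of conditioning on the variable that drives the missingness can be sketched with simulated data. Everything here is an assumption made for illustration: the column names `midterm` and `final`, the linear relationship between them, and the missing rates of 50% and 10%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000

# Hypothetical marks: final is positively related to midterm.
midterm = rng.normal(65, 15, n)
final = 0.8 * midterm + rng.normal(15, 8, n)
df = pd.DataFrame({"midterm": midterm, "final": final})

# MAR: the chance that final is missing depends only on the observed midterm.
# Students in the lower half on the midterm skip the final far more often.
df["low_midterm"] = df["midterm"] < df["midterm"].median()
p_missing = np.where(df["low_midterm"], 0.5, 0.1)
df["final_obs"] = df["final"].mask(rng.random(n) < p_missing)

# Ignoring midterm, the observed mean overstates the true mean of final.
print("true mean:    ", round(df["final"].mean(), 2))
print("observed mean:", round(df["final_obs"].mean(), 2))

# Within each midterm group the missing values are a random subset,
# so the observed group means track the true group means.
by_group = df.groupby("low_midterm")[["final", "final_obs"]].mean()
print(by_group.round(2))
```

In this sketch the overall observed mean is biased, while the within-group means are not. That is why methods for MAR data need to include the variables related to the missingness.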

# **Missing Not at Random (MNAR)**

Data are missing not at random (MNAR) when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables. An example is when data are missing on IQ and only the people with low IQ values have missing observations for this variable. A problem with the MNAR mechanism is that it is impossible to verify that scores are MNAR without knowing the missing values.
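
The IQ example can be sketched with simulated scores. The cut-off of 90 and the 60% missing rate below it are arbitrary assumptions chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical complete IQ scores.
iq = pd.Series(rng.normal(100, 15, 2000), name="iq")

# MNAR: only low scores can go missing, so the missingness depends on
# the very values we fail to observe.
p_missing = np.where(iq < 90, 0.6, 0.0)
iq_obs = iq.mask(rng.random(iq.size) < p_missing)

print(f"true mean:     {iq.mean():.2f}")
print(f"observed mean: {iq_obs.mean():.2f}")
print(f"largest score that went missing: {iq[iq_obs.isna()].max():.2f}")
```

In practice only `iq_obs` would be available, so the upward bias in the observed mean could not be detected from the data alone.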


* Can you think of examples of each mechanism?
* If I have a variable that is not missing at random, can I impute the missing values?

Now I would like you to download the data from this [link](https://openmv.net/info/class-grades). Analyse the data and record which variables have missing values. The code below should get you going. Can you tell which type of missing-data mechanism is occurring for the Final exam mark?





In [24]:
import pandas as pd
from io import StringIO

csv_text = """
Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
05,57.14,34.09,64.38,51.48,52.50
08,95.05,105.49,67.50,99.07,68.33
08,83.70,83.17,30.0,63.15,48.89
07,81.22,96.06,49.38,105.93,80.56
08,91.32,93.64,95.0,107.41,73.89
07,95.0,92.58,93.12,97.78,68.06
08,95.05,102.99,56.25,99.07,50.0
07,72.85,86.85,60.0,,56.11
08,84.26,93.10,47.50,18.52,50.83
07,90.10,97.55,51.25,88.89,63.61
07,80.44,90.20,75.0,91.48,39.72
06,86.26,80.60,74.38,87.59,77.50
08,97.16,103.71,72.50,93.52,63.33
07,91.28,83.53,81.25,99.81,92.22
08,84.80,89.08,44.38,16.91,35.83
07,93.83,95.43,88.12,80.93,90.0
08,84.80,89.08,47.50,16.91,53.33
04,92.01,102.52,38.75,86.11,49.17
08,55.14,81.85,75.0,56.11,62.50
08,93.04,82.93,79.38,83.33,91.11
08,63.40,86.21,63.12,72.78,,
08,75.27,97.52,63.12,61.11,66.11
08,63.78,76.21,39.38,42.22,34.44
07,80.44,90.20,46.25,91.48,72.22
07,53.36,82.01,74.38,102.59,56.39
06,91.28,95.24,82.50,97.59,92.78
08,82.45,86.65,93.12,85.56,89.17
08,75.27,86.67,69.38,61.11,88.89
08,91.32,94.89,76.25,107.41,85.56
07,91.62,65.18,71.88,90.0,45.56
07,98.58,102.46,67.50,97.59,63.33
07,86.26,88.57,70.0,87.59,55.0
08,67.29,95.64,48.12,72.22,43.33
07,98.58,91.03,101.25,104.26,107.78
08,85.42,95.67,56.25,103.52,64.72
05,88.09,63.39,74.38,93.70,50.83
06,95.05,70.24,52.50,52.41,47.78
07,89.89,57.97,32.50,85.19,51.67
06,90.74,89.64,61.25,90.0,,
07,95.0,94.36,89.38,100.93,85.0
06,28.14,58.51,72.50,53.70,68.33
07,95.14,82.67,110.0,89.81,90.83
07,92.01,112.58,86.25,86.11,83.33
07,86.26,74.66,85.0,64.07,82.22
06,57.14,34.09,66.88,51.48,55.83
07,93.83,57.32,28.12,77.96,45.56
08,68.95,65.11,44.38,57.41,65.28
08,85.01,98.47,91.25,83.33,72.22
08,95.90,99.99,95.62,105.56,102.22
08,92.46,95.75,61.88,83.33,48.89
08,96.73,88.11,71.88,97.41,65.56
08,83.70,83.17,60.62,63.15,57.78
07,95.14,94.01,99.38,100.0,95.0
07,98.58,88.30,90.62,100.93,99.17
08,71.79,102.87,54.37,21.53,36.11
08,71.79,101.68,75.0,21.53,49.44
08,87.93,106.53,37.50,97.41,28.06
08,87.93,108.97,28.75,87.96,47.78
08,68.95,65.11,40.0,57.41,78.89
07,72.85,86.85,41.25,60.37,46.67
08,71.79,102.87,41.88,24.77,,
08,92.02,97.76,46.25,47.22,60.56
07,90.33,87.56,68.75,77.96,58.33
07,95.0,94.36,90.62,100.93,101.11
07,91.28,108.71,96.25,99.81,88.89
08,97.0,103.02,93.12,106.48,94.44
08,93.01,104.18,55.0,96.85,67.22
08,92.02,100.58,54.37,63.89,63.89
07,100.83,105.57,101.25,104.44,108.89
08,80.53,92.80,51.25,72.78,66.67
08,90.98,97.55,86.25,88.89,90.0
08,93.59,103.83,92.50,96.85,87.22
08,97.33,100.42,69.38,102.59,83.06
07,84.26,91.31,63.12,83.33,75.56
08,84.26,96.66,52.50,83.33,50.0
07,93.83,102.19,106.25,94.44,102.78
08,75.27,86.67,70.0,71.85,80.0
08,92.02,100.58,73.12,63.89,65.28
08,97.16,103.71,83.75,95.93,78.89
08,66.17,93.68,71.88,42.22,61.39
08,81.22,91.95,79.38,105.93,90.0
07,74.29,65.70,78.75,103.52,55.0
08,97.33,106.74,76.88,108.89,83.89
04,86.86,62.64,92.50,85.19,62.78
06,95.60,61.40,64.38,99.81,42.78
04,87.93,99.47,53.12,87.96,61.11
06,98.49,95.43,42.50,24.77,39.44
07,74.35,92.93,86.25,78.70,73.89
07,86.29,88.81,83.12,77.96,75.83
08,97.0,100.52,64.38,90.74,58.61
08,97.33,106.74,81.25,108.89,71.11
08,96.41,103.71,56.25,95.93,66.39
07,95.60,82.28,76.88,108.33,78.33
08,87.52,91.58,56.25,71.85,85.0
08,96.73,103.71,45.0,93.52,61.94
07,85.34,80.54,41.25,93.70,39.72
08,89.94,102.77,87.50,90.74,87.78
07,95.60,76.13,66.25,99.81,85.56
08,63.40,97.37,73.12,72.78,77.22

"""
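# Rows with a missing Final mark end in ',,', which gives them one more field
# than the header; drop the extra comma so every row has six fields.
csv_text = csv_text.replace(',,\n', ',\n')

df3 = pd.read_csv(StringIO(csv_text))
print(df3.isna().sum())   # number of missing values per column
print(df3.describe())     # summary statistics of the observed values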

Prefix        0
Assignment    0
Tutorial      0
Midterm       0
TakeHome      1
Final         3
dtype: int64
          Prefix  Assignment    Tutorial     Midterm    TakeHome       Final
count  99.000000   99.000000   99.000000   99.000000   98.000000   96.000000
mean    7.313131   85.491717   89.731111   68.049495   80.828469   68.414375
std     0.932918   12.597694   15.071556   19.376074   23.808806   18.801087
min     4.000000   28.140000   34.090000   28.120000   16.910000   28.060000
25%     7.000000   80.875000   83.350000   52.810000   66.015000   53.122500
50%     8.000000   89.940000   93.100000   69.380000   87.960000   66.250000
75%     8.000000   95.000000  100.550000   82.810000   98.747500   84.167500
max     8.000000  100.830000  112.580000  110.000000  108.890000  108.890000


We can see that there are 3 values missing in the Final column and 1 in the TakeHome column. Draw a basic boxplot of Assignment, Tutorial and Midterm, split by whether the Final mark is missing or not.


In [25]:

# Flag each row: 'Y' if the Final mark is missing, 'N' otherwise.
df3['Missing'] = df3['Final'].isna().map({True: 'Y', False: 'N'})
print(df3.loc[df3['Missing'] == 'Y', 'Missing'])




20    Y
38    Y
60    Y
Name: Missing, dtype: object


In [26]:
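import matplotlib.pyplot as plt

# Reshape so that a mark sits in column 'Y' when Final is missing
# and in column 'N' otherwise.
dfassign = df3.pivot(columns='Missing', values=['Assignment'])['Assignment']
print(dfassign['Y'])

# Start a new figure for each variable; otherwise the three boxplots
# are drawn on top of each other on the same axes.
plt.figure()
plt.title('Assignment')
dfassign.boxplot(column=['Y', 'N'], grid=False)

dfassign = df3.pivot(columns='Missing', values=['Tutorial'])['Tutorial']
plt.figure()
plt.title('Tutorial')
dfassign.boxplot(column=['Y', 'N'], grid=False)

dfassign = df3.pivot(columns='Missing', values=['Midterm'])['Midterm']
plt.figure()
plt.title('Midterm')
dfassign.boxplot(column=['Y', 'N'], grid=False)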

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
      ..
94   NaN
95   NaN
96   NaN
97   NaN
98   NaN
Name: Y, Length: 99, dtype: float64


<Axes: >

If you run the analysis above on each of the input variables, you will see no real evidence that the missingness of the Final mark depends on them, so MAR does not look like a better description than MCAR for the Final column. The Final column also contains some low observed marks, so it is unlikely that marks are missing because of the value of the Final mark itself (MNAR), although this cannot be verified without knowing the missing values. With only 3 missing values the evidence is thin, but it is reasonable to treat the Final mark as MCAR.
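
A numeric summary can complement the boxplots. The helper below, `summarise_by_missingness`, is a hypothetical name introduced here for illustration. The six-row demo frame copies the Midterm and Final values of the three rows with a missing Final and of the first three rows of the table above, purely to keep the sketch self-contained.

```python
import pandas as pd

def summarise_by_missingness(df, target, predictors):
    """Compare predictors between rows where `target` is missing (Y) or present (N)."""
    flag = df[target].isna().map({True: "Y", False: "N"}).rename("Missing")
    return df.groupby(flag)[predictors].agg(["count", "mean", "median"])

# Six rows taken from the class-grades table, for demonstration only.
demo = pd.DataFrame({
    "Midterm": [63.12, 61.25, 41.88, 64.38, 67.50, 30.00],
    "Final": [None, None, None, 52.50, 68.33, 48.89],
})

summary = summarise_by_missingness(demo, "Final", ["Midterm"])
print(summary)
```

On the full data the same call would be `summarise_by_missingness(df3, "Final", ["Assignment", "Tutorial", "Midterm"])`. With only 3 rows in the missing group, any difference between the groups should be read with caution.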

Would you impute the final mark? Discuss this amongst yourselves.