# **Univariate approach to Imputation**
The next 3 steps are taken from the [scikit-learn documentation](https://scikit-learn.org/stable/modules/impute.html). The first of these is a simple tool to help you implement univariate imputation. Remember, this approach is safest when the values are missing completely at random (MCAR). scikit-learn's SimpleImputer has options to replace missing values with a constant, the mean, the median or the mode. Before you implement this process it is important to visualise the data. For example, when a variable is normally distributed you should probably use the mean. If the data are skewed or contain outliers, use the median or the mode. The code in the next section gives a simple example of this. I would strongly recommend that you experiment with it so you can understand the implications of your choice.
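
As a quick illustration of the four options, the sketch below runs each SimpleImputer strategy on a small made-up column that contains one missing entry and one outlier. The data values and the constant `0.0` are assumptions chosen only for illustration; they do not come from the data set used in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data: one missing entry and one outlier (50.0).
x = np.array([[1.0], [2.0], [2.0], [3.0], [np.nan], [50.0]])

fills = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    # fill_value is only needed for the 'constant' strategy.
    kwargs = {"fill_value": 0.0} if strategy == "constant" else {}
    toy_imp = SimpleImputer(missing_values=np.nan, strategy=strategy, **kwargs)
    filled = toy_imp.fit_transform(x)
    fills[strategy] = filled[4, 0]  # the value substituted for the NaN
    print(strategy, fills[strategy])
```

With an outlier present, the mean is pulled towards it while the median and the mode are not, which is why the choice of strategy matters for skewed data.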


In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

y=np.array([[780,750,690,710,680,730,690,720,740,900,950,975,995,1000,1010,1020],
    [5.1,4.5,np.nan,3.3,3.6,9.3,6.7,2.8,5.4,np.nan,7.8,np.nan,np.nan,10.1,6.7,np.nan],
    [78000,75000,100000,71000,68000,70000,69000,72000,74000,69000,102000,101000,79000,114000,101000,95000],
    [0.5,0.55,0.1,0.6,0.7,0.45,0.56,0.73,0.45,0.67,0.43,0.23,0.78,0.42,0.36,0.23]])
# SimpleImputer expects one variable per column, so transpose to shape (16, 4).
y=y.transpose()
# Replace every missing value with the mean of its column.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
y=imp.fit_transform(y)
print(y)

We have to transpose the matrix so that each variable is a column, which is the layout SimpleImputer expects.
You will notice that the missing values in the second column have now been replaced by the column mean, approximately 5.936. We now fit a simple regression model and note the results.
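
If you want to confirm which value was substituted, one option is to inspect the fitted imputer's `statistics_` attribute, which stores one fill value per column. The sketch below repeats only the variable with missing values; the names `x2` and `check_imp` are introduced here purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The variable with missing values, copied from the data above, as one column.
x2 = np.array([[5.1, 4.5, np.nan, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4, np.nan,
                7.8, np.nan, np.nan, 10.1, 6.7, np.nan]]).T

check_imp = SimpleImputer(missing_values=np.nan, strategy="mean")
x2_filled = check_imp.fit_transform(x2)

# statistics_ holds the fill value learned for each column during fit.
fill = check_imp.statistics_[0]
print(round(fill, 4))
# Count the NaNs remaining after imputation.
print(np.isnan(x2_filled).sum())
```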

In [None]:
import statsmodels.formula.api as sm
import statsmodels.stats.stattools as st
import statsmodels.stats.api as sms
import pandas as pd
df=pd.DataFrame(y)
df.columns=['X1','X2','X3','Y']

formula_str="Y~X1 + X2 +X3"

result=sm.ols(formula=formula_str,data=df).fit()
print(result.summary())

Now we will re-run the analysis, but this time with the median as our estimate of the missing values.

In [None]:
y=np.array([[780,750,690,710,680,730,690,720,740,900,950,975,995,1000,1010,1020],
    [5.1,4.5,np.nan,3.3,3.6,9.3,6.7,2.8,5.4,np.nan,7.8,np.nan,np.nan,10.1,6.7,np.nan],
    [78000,75000,100000,71000,68000,70000,69000,72000,74000,69000,102000,101000,79000,114000,101000,95000],
    [0.5,0.55,0.1,0.6,0.7,0.45,0.56,0.73,0.45,0.67,0.43,0.23,0.78,0.42,0.36,0.23]])

y=y.transpose()
# Same procedure as before, but fill with the column median instead of the mean.
imp = SimpleImputer(missing_values=np.nan, strategy='median')
y=imp.fit_transform(y)
print(y)
df=pd.DataFrame(y)
df.columns=['X1','X2','X3','Y']

formula_str="Y~X1 + X2 +X3"

result=sm.ols(formula=formula_str,data=df).fit()
print(result.summary())

There isn't much difference between the results.

Insert a number of high values into the experience variable and re-run your analysis. What happens?
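
A possible starting point for this exercise is sketched below. It assumes the variable with missing values is the one to modify, and it overwrites two observed entries with the made-up values 40.0 and 45.0; both choices are hypothetical, so substitute your own. The helper `fill_value` is also introduced only for illustration. The sketch compares the fill values only and leaves re-running the regression to you.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The variable with missing values, copied from the data above.
x2 = np.array([5.1, 4.5, np.nan, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4, np.nan,
               7.8, np.nan, np.nan, 10.1, 6.7, np.nan])

# Hypothetical choice: overwrite two observed entries with unusually high values.
x2_high = x2.copy()
x2_high[[0, 1]] = [40.0, 45.0]

def fill_value(column, strategy):
    """Return the value SimpleImputer would substitute for NaN in a 1-D column."""
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputer.fit(column.reshape(-1, 1))
    return imputer.statistics_[0]

for name, col in (("original", x2), ("with high values", x2_high)):
    print(name, "mean:", fill_value(col, "mean"), "median:", fill_value(col, "median"))
```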