# Stochastic Regression Imputation

The groundwater dataset has a significant fraction of NA values.<br> In this, we'll be using stochastic regression imputation in order to fill them.<br><br>

First, the necessary libraries are imported.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

The missing values are filled in by using linear regression, using records who don't have missing values<br>
as the training set. The parameters are obtained and and used to fit the records with the missing data.

In [2]:
def linReg(x_train, y_train, x_test):
    ones = np.ones([x_train.shape[0], 1])
    x_train = np.concatenate((ones, x_train), axis=1)
    ones = np.ones([x_test.shape[0], 1])
#  x_test = np.reshape(x_test,[x_test.shape[0],1])
    x_test = np.concatenate((ones, x_test), axis=1)
    temp_1 = np.dot(x_train.T, x_train)
    temp_1 = temp_1.astype(np.float64)
    temp_1 = np.linalg.pinv(temp_1)
    temp_2 = np.dot(x_train.T, y_train)
    theta = np.dot(temp_1, temp_2)
    preds = np.dot(x_test, theta)
    return preds

In the next couple cells, all the files are iterated through and compiled into one large dataset, and <br>
unecessary features are removed.

In [3]:
places =['andhra_pradesh','assam','bihar','chattissgarh','daman_diu_dadra_nagar_haveli','goa','himachal_pradesh','kerala','lakshadweep','madhya_pradesh','maharashtra','odisha','pondicherry','punjab','rajasthan','tripura','uttar_pradesh_uttarakhand','west_bengal']

In [4]:
df1 = pd.DataFrame()
df2 = pd.DataFrame()

for place in places:
    
    df = pd.read_csv("../data_folder/groundwater_folder/ground_water_quality_in_"+place+"-2014.csv",encoding = "unicode_escape")
    df["State"] = place
    print(place)
   
    df1 = df1.append(df, ignore_index=True)
    if df1.empty:
        df1 = df
        
df2['State'] = df1['State']
df = df1[df1.columns.difference(["STATION CODE","LOCATIONS","STATE"])];
df = df.drop(["State"],1)

andhra_pradesh
assam
bihar
chattissgarh
daman_diu_dadra_nagar_haveli
goa
himachal_pradesh
kerala
lakshadweep
madhya_pradesh
maharashtra
odisha
pondicherry
punjab
rajasthan
tripura
uttar_pradesh_uttarakhand
west_bengal


In [5]:
df.isnull().sum().sum()

1836

Now that all the datasets are combined into one, we can begin imputing. <br><br>

## Mean imputation
First, a function is defined in order to impute a missing value with the mean of the feature.<br>
We require this since there are several cases where more than one feature is missing per record.

In [6]:
def mean_imputation(df, feature):

    number_missing = df[feature].isnull().sum()
    observed_values = df.loc[df[feature].notnull(), feature]
    df.loc[df[feature].isnull(),feature + '_imp'] = np.mean(observed_values)
    
    return df


In [7]:
missing_columns = df.loc[:, df.isnull().any()].columns.values
print(missing_columns)


['B.O.D. (mg/l) : Max : < 3 mg/l' 'B.O.D. (mg/l) : Mean : < 3 mg/l'
 'B.O.D. (mg/l) : Min : < 3 mg/l' 'CONDUCTIVITY (µmhos/cm) : Max'
 'CONDUCTIVITY (µmhos/cm) : Mean' 'CONDUCTIVITY (µmhos/cm) : Min'
 'FECAL COLIFORM (MPN/100ml) : Max : < 2500 MPN/100ml'
 'FECAL COLIFORM (MPN/100ml) : Mean : < 2500 MPN/100ml'
 'FECAL COLIFORM (MPN/100ml) : Min : < 2500 MPN/100ml'
 'TEMPERATURE ºC : Max' 'TEMPERATURE ºC : Mean' 'TEMPERATURE ºC : Min'
 'TOTAL COLIFORM (MPN/100ml) : Max : < 5000 MPN/100ml'
 'TOTAL COLIFORM (MPN/100ml) : Mean : < 5000 MPN/100ml'
 'TOTAL COLIFORM (MPN/100ml) : Min : < 5000 MPN/100ml']


We make a copy of the missing columns, to use it for the next step, regression imputation.

In [8]:
for feature in missing_columns:
    df[feature + '_imp'] = df[feature]
    df = mean_imputation(df, feature)


For each missing column, we use the mean imputed values as a part of the train set.
The parameters are obtained and are used on the variable test, which contain the records<br>
with the missing values. This is iterated through all the features.

## Stochastic Regression Imputation

Regression imputation leaves us with points that lie on the regression hyperplane.
This isn't acceptable since it would overestimate the correlation between features.
To overcome this, random values are generated with a normal distribution of the residual<br>
variance of each feature.

In [9]:
for feature in missing_columns:
    remove = np.append(missing_columns,np.array(feature+"_imp"))
    x_train = df.loc[df[feature].notnull(),df.columns.difference(remove)]
    y_train = pd.DataFrame(df.loc[df[feature].notnull(),feature])
    std = np.var(y_train)
    
    test =  pd.DataFrame(df.loc[df[feature].isnull(),df.columns.difference(remove)])
    
    xt = x_train.values
    yt = y_train.values
    tt = test.values;
    preds = np.array(linReg(xt,yt,tt)) + np.random.normal(0,std,[test.shape[0],1])
    preds = np.reshape(preds,[preds.shape[0],])
    df.loc[df[feature].isnull(),feature] = preds 
    
            
df[df < 0] = 0

    

Since linear regression isn't a perfect fit, some of the data points go slightly below zero. <br>
These data points are made zero, which is very common in the dataset.<br><br>


The additional features designed are dropped, followed by addition of columns with strings in them.

In [10]:
for feature in missing_columns:
    df = df.drop([feature+"_imp"],1)
 

In [11]:
df["STATION CODE"]= df1["STATION CODE"]
df["LOCATIONS"] = df1["LOCATIONS"]
df["State"] = df2["State"]

In [12]:
df.columns

Index(['B.O.D. (mg/l) : Max : < 3 mg/l', 'B.O.D. (mg/l) : Mean : < 3 mg/l',
       'B.O.D. (mg/l) : Min : < 3 mg/l', 'CONDUCTIVITY (µmhos/cm) : Max',
       'CONDUCTIVITY (µmhos/cm) : Mean', 'CONDUCTIVITY (µmhos/cm) : Min',
       'FECAL COLIFORM (MPN/100ml) : Max : < 2500 MPN/100ml',
       'FECAL COLIFORM (MPN/100ml) : Mean : < 2500 MPN/100ml',
       'FECAL COLIFORM (MPN/100ml) : Min : < 2500 MPN/100ml',
       'NITRATE- N+ NITRITE-N (mg/l) : Max',
       'NITRATE- N+ NITRITE-N (mg/l) : Mean',
       'NITRATE- N+ NITRITE-N (mg/l) : Min', 'TEMPERATURE ºC : Max',
       'TEMPERATURE ºC : Mean', 'TEMPERATURE ºC : Min',
       'TOTAL COLIFORM (MPN/100ml) : Max : < 5000 MPN/100ml',
       'TOTAL COLIFORM (MPN/100ml) : Mean : < 5000 MPN/100ml',
       'TOTAL COLIFORM (MPN/100ml) : Min : < 5000 MPN/100ml',
       'pH : Max : 6.5-8.5', 'pH : Mean : 6.5-8.5', 'pH : Min : 6.5-8.5',
       'STATION CODE', 'LOCATIONS', 'State'],
      dtype='object')

Finally, the file is saved, ready to undergo PCA.

In [13]:
df.to_csv('../data_folder/filled_groundwater.csv')

In [14]:
df

Unnamed: 0,B.O.D. (mg/l) : Max : < 3 mg/l,B.O.D. (mg/l) : Mean : < 3 mg/l,B.O.D. (mg/l) : Min : < 3 mg/l,CONDUCTIVITY (µmhos/cm) : Max,CONDUCTIVITY (µmhos/cm) : Mean,CONDUCTIVITY (µmhos/cm) : Min,FECAL COLIFORM (MPN/100ml) : Max : < 2500 MPN/100ml,FECAL COLIFORM (MPN/100ml) : Mean : < 2500 MPN/100ml,FECAL COLIFORM (MPN/100ml) : Min : < 2500 MPN/100ml,NITRATE- N+ NITRITE-N (mg/l) : Max,...,TEMPERATURE ºC : Min,TOTAL COLIFORM (MPN/100ml) : Max : < 5000 MPN/100ml,TOTAL COLIFORM (MPN/100ml) : Mean : < 5000 MPN/100ml,TOTAL COLIFORM (MPN/100ml) : Min : < 5000 MPN/100ml,pH : Max : 6.5-8.5,pH : Mean : 6.5-8.5,pH : Min : 6.5-8.5,STATION CODE,LOCATIONS,State
0,166.846451,0.000000,0.000000,2050.0,2050.0,2050.0,2.496179e+08,1.251922e+08,7.711193e+07,2.50,...,23.0,4.000000e+02,4.000000e+02,4.000000e+02,7.0,7.0,7.0,1513.0,"B W. - KRISHNA MURTHY, D.NO. 48-16-43 AUTONAGA...",andhra_pradesh
1,234.275167,31.027603,22.787722,1639.0,1639.0,1639.0,0.000000e+00,1.925763e+07,0.000000e+00,1.20,...,23.0,4.000000e+02,4.000000e+02,4.000000e+02,7.3,7.3,7.3,1514.0,"B/W. - VIJAY KUMAR AUTONAGAR VIJAYAWADA, KRISH...",andhra_pradesh
2,0.600000,0.600000,0.600000,1569.0,1569.0,1569.0,0.000000e+00,1.784477e+08,1.204883e+07,2.80,...,19.0,4.000000e+02,4.000000e+02,4.000000e+02,7.4,7.4,7.4,1515.0,"B/W. - NAGARAM(V), PALVONCHA, KHAMMAM DIST., A.P.",andhra_pradesh
3,0.000000,55.282050,0.000000,1119.0,1119.0,1119.0,1.828890e+08,5.651051e+07,2.591676e+07,0.50,...,29.0,4.000000e+02,4.000000e+02,4.000000e+02,7.4,7.4,7.4,1516.0,B W OF NAVLOK GARDENS NELLORE AP,andhra_pradesh
4,1.000000,0.900000,0.800000,1950.0,1687.0,1520.0,0.000000e+00,0.000000e+00,0.000000e+00,8.00,...,27.0,4.000000e+01,3.200000e+01,2.000000e+01,8.1,7.9,7.2,1517.0,"B/W. - TUNGBHADRA RIVER NEAR KURNOOL, A.P.",andhra_pradesh
5,1.600000,1.400000,1.000000,5000.0,3720.0,2150.0,0.000000e+00,0.000000e+00,0.000000e+00,14.00,...,27.0,5.000000e+01,3.600000e+01,2.000000e+01,8.2,7.9,7.8,1518.0,"B/W. - NANDYAL, KURNOOL DIST., A.P.",andhra_pradesh
6,1.800000,1.400000,1.100000,1510.0,1131.0,730.0,0.000000e+00,0.000000e+00,0.000000e+00,3.10,...,27.0,3.000000e+01,3.000000e+01,3.000000e+01,8.1,7.9,7.8,1519.0,"B/W. - NAGIRI, CHITTOOR DIST., A.P",andhra_pradesh
7,1.500000,1.300000,1.100000,1100.0,966.0,723.0,0.000000e+00,0.000000e+00,0.000000e+00,11.50,...,27.0,4.000000e+01,3.400000e+01,2.000000e+01,8.3,7.9,7.6,1520.0,"B/W. - SWARNAMUKHI RIVER, SRIKALAHASTI, CHITTO...",andhra_pradesh
8,1.000000,1.000000,1.000000,1790.0,1790.0,1790.0,9.000000e+00,9.000000e+00,9.000000e+00,5.40,...,25.0,1.200000e+02,1.200000e+02,1.200000e+02,7.7,7.7,7.7,1521.0,"O/W. - NEAR RAMA TEMPLE , WARD NO.2 , MINDI , ...",andhra_pradesh
9,1.200000,1.100000,1.000000,1343.0,962.0,771.0,2.300000e+01,1.600000e+01,4.000000e+00,1.80,...,25.0,1.500000e+02,8.400000e+01,4.000000e+00,7.7,7.4,7.2,1522.0,"O/W. PEDDANUYYI - VIZIANAGARAM, A.P",andhra_pradesh
