# **Water Quality Prediction**
Water is one of the most essential resources for life. As development accelerates, however, we waste water and contaminate it with harmful materials, which leaves it impure and unfit for use. That is why it is important to know the quality of water. This kernel predicts the water quality index (WQI) and the quality status of water from several parameters that affect water quality.
In this notebook I clean the data and perform exploratory data analysis. I then calculate the WQI myself, because the data does not contain a column that can be used as the prediction target.
Finally, I build three prediction models: a non-deep-learning linear regression model, a deep-learning-based regression model, and a logistic regression model. The first and third are built with Spark ML; the second is built with Keras.

# **Table of Contents**
* [Setting up the environment](#1)
* [Importing Libraries](#2)
* [Uploading the data](#3)
* [Data Cleaning](#4)
* [EDA](#5)
* [Feature Engineering](#6)
* [Model Creation](#7)

<a id=1></a>
# **Setting up the environment**

#### Before starting, we have to change the Java version: with version 11 we get errors and cannot use PySpark properly. So we remove Java 11 and install Java 8.

In [None]:
! apt remove -y openjdk-11-jre-headless

In [None]:
!apt install -y openjdk-8-jdk openjdk-8-jre

#### Next, we install PySpark.

In [None]:
!pip install pyspark

<a id=2></a>
# **Importing libraries**

In [None]:
import os
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import warnings
warnings.filterwarnings("ignore")

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import udf, concat, col, lit
from pyspark.sql.types import *
import pyspark.sql.functions as F

# Start a local Spark context on all available cores, then get the session and SQL context.
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(sc)

<a id=3></a>
# **Uploading the data**
#### Then we load the data into a Spark DataFrame.

In [None]:
df = spark.read.format("csv").option("header", "true").load('../input/water-quality-data/waterquality.csv')
gdf = gpd.read_file('../input/india-states/Igismap/Indian_States.shp')

In [None]:
df.show(5)

In [None]:
df.dtypes

<a id=4></a>
# **Data Cleaning**

In [None]:
from pyspark.sql.types import FloatType

#### As we observed above, all the columns have the string data type, but calculating the water quality index requires floats. So we cast the required columns to the float data type.

In [None]:
# Cast every measurement column from string to float.
numeric_cols = ["TEMP", "pH", "DO", "CONDUCTIVITY", "BOD", "NITRATE_N_NITRITE_N", "FECAL_COLIFORM"]
for c in numeric_cols:
    df = df.withColumn(c, df[c].cast(FloatType()))
df.dtypes

#### The TOTAL_COLIFORM column is not required, so we drop it.

In [None]:
df=df.drop('TOTAL_COLIFORM')

#### Now we remove every row that has a null value in any of the required columns. To apply a SQL query, we first register the DataFrame as a temporary view and then issue the query. This cleaning step matters because it helps the model work better.

In [None]:
df.createOrReplaceTempView("df_sql")

In [None]:
df_clean = spark.sql('''Select * from df_sql where TEMP is not null and DO is not null 
                        and pH is not null and BOD is not null and CONDUCTIVITY is not null
                        and NITRATE_N_NITRITE_N is not null and FECAL_COLIFORM is not null''')

<a id=5></a>
# **EDA**
### Let's visualize our data.

In [None]:
df_clean.createOrReplaceTempView("df_sql")

In [None]:
do = spark.sql("Select DO from df_sql")
do = do.rdd.map(lambda row : row.DO).collect()
ph = spark.sql("Select pH from df_sql")
ph = ph.rdd.map(lambda row : row.pH).collect()
bod = spark.sql("Select BOD from df_sql")
bod = bod.rdd.map(lambda row : row.BOD).collect()
nn = spark.sql("Select NITRATE_N_NITRITE_N from df_sql")
nn = nn.rdd.map(lambda row : row.NITRATE_N_NITRITE_N).collect()

In [None]:
fig,ax = plt.subplots(num=None,figsize=(14,6), dpi=80, facecolor='w', edgecolor='k')
size=len(do)
ax.plot(range(0,size), do, color='blue', animated=True, linewidth=1, label='Dissolved Oxygen')
ax.plot(range(0,size), ph, color='red', animated=True, linewidth=1, label='pH')
fig,ax2 = plt.subplots(num=None,figsize=(14,6), dpi=80, facecolor='w', edgecolor='k')
ax2.plot(range(0,size), bod, color='orange', animated=True, linewidth=1, label='BOD')
ax2.plot(range(0,size), nn, color='green', animated=True, linewidth=1, label='NN')
legend=ax.legend()
legend=ax2.legend()

In [None]:
con = spark.sql("Select CONDUCTIVITY from df_sql")
con = con.rdd.map(lambda row : row.CONDUCTIVITY).collect()
fec = spark.sql("Select FECAL_COLIFORM from df_sql")
fec = fec.rdd.map(lambda row : row.FECAL_COLIFORM).collect()

In [None]:
fig,ax = plt.subplots(num=None,figsize=(14,6), dpi=80, facecolor='w', edgecolor='k')
ax.plot(range(0,size), con, color='blue', animated=True, linewidth=1)
fig,ax2 = plt.subplots(num=None,figsize=(14,6), dpi=80, facecolor='w', edgecolor='k')
ax2.plot(range(0,size), fec, color='red', animated=True, linewidth=1)

<a id=6></a>
# **Feature Engineering**

#### Let us convert our data to a pandas DataFrame. To train a model we need a target to predict, which is not in the data, so we have to calculate the water quality index. That takes many steps, but with pandas it can be done easily and in fewer steps. We will also be able to view our data in tabular form more effectively.

In [None]:
df=df_clean.toPandas()
df.dtypes

### Initialization

In [None]:
start=0
end=448
station=df.iloc[start:end, 0]
location=df.iloc[start:end, 1]
state=df.iloc[start:end, 2]
do=df.iloc[start:end, 4].astype(np.float64)
ph=df.iloc[start:end, 5]
co=df.iloc[start:end, 6].astype(np.float64)
bod=df.iloc[start:end, 7].astype(np.float64)
na=df.iloc[start:end, 8].astype(np.float64)
# Slice from the same start row as the other columns; starting at row 2
# would leave the first two rows without a fecal coliform value after the concat.
fc=df.iloc[start:end, 9].astype(np.float64)



In [None]:
df=pd.concat([station,location,state,do,ph,co,bod,na,fc],axis=1)
df.columns = ['station','location','state','do','ph','co','bod','na','fc']

### The Water Quality Index is calculated by linearly aggregating each quality rating with its weight:
#### WQI = ∑ (qn × Wn)
#### where qn = quality rating for the nth water quality parameter and Wn = unit weight for the nth parameter.
#### Although there is a standard formula for calculating qn, it could not be applied in this case, so we applied a standard range-based method that assigns each parameter a quality rating according to the band its value falls in.
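As a rough illustration of the aggregation described above, the sketch below computes WQI = ∑ (qn × Wn) for a single sample in plain Python. The weights are the ones this notebook applies in its weighting cell; the sample ratings and the helper name `weighted_wqi` are made up for the example.

```python
# Unit weights per rated parameter, as used in this notebook's weighting cell.
WEIGHTS = {"npH": 0.165, "ndo": 0.281, "nbdo": 0.234, "nec": 0.009, "nna": 0.028, "nco": 0.281}

def weighted_wqi(ratings):
    """Return sum(q_n * W_n) over all parameters (hypothetical helper)."""
    return sum(ratings[name] * weight for name, weight in WEIGHTS.items())

# Hypothetical sample: every parameter rated 100 except dissolved oxygen (80).
sample = {"npH": 100, "ndo": 80, "nbdo": 100, "nec": 100, "nna": 100, "nco": 100}
print(round(weighted_wqi(sample), 2))
```

Because the six weights add up to 0.998, a sample rated 100 on every parameter tops out at a WQI of about 99.8 rather than exactly 100.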

In [None]:
# Each band widens outward from the ideal range, so a pH between the old
# band edges (for example 6.95) is rated instead of falling through to 0.
df['npH']=df.ph.apply(lambda x: (100 if (8.5>=x>=7)  
                                 else(80 if  (8.6>=x>=6.8) 
                                      else(60 if (8.8>=x>=6.7) 
                                          else(40 if (9>=x>=6.5)
                                              else 0)))))

In [None]:
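# Bands are contiguous: values such as 5.05, or a 5.1 that became 5.0999999
# after the float cast, are rated 80 instead of falling through to 0.
df['ndo']=df.do.apply(lambda x:(100 if (x>=6)  
                                 else(80 if  (x>5) 
                                      else(60 if (x>4)
                                          else(40 if (x>=3) 
                                              else 0)))))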

In [None]:
df['nco']=df.fc.apply(lambda x:(100 if (5>=x>=0)  
                                 else(80 if  (50>=x>=5) 
                                      else(60 if (500>=x>=50)
                                          else(40 if (10000>=x>=500) 
                                              else 0)))))

In [None]:
df['nbdo']=df.bod.apply(lambda x:(100 if (3>=x>=0)  
                                 else(80 if  (6>=x>=3) 
                                      else(60 if (80>=x>=6)
                                          else(40 if (125>=x>=80) 
                                              else 0)))))

In [None]:
df['nec']=df.co.apply(lambda x:(100 if (75>=x>=0)  
                                 else(80 if  (150>=x>=75) 
                                      else(60 if (225>=x>=150)
                                          else(40 if (300>=x>=225) 
                                              else 0)))))

In [None]:
df['nna']=df.na.apply(lambda x:(100 if (20>=x>=0)  
                                 else(80 if  (50>=x>=20) 
                                      else(60 if (100>=x>=50)
                                          else(40 if (200>=x>=100) 
                                              else 0)))))

df.head()
df.dtypes
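The rating cells above all follow the same pattern: find the first band whose upper bound the value does not exceed. A minimal sketch of that lookup with the standard-library `bisect` module is shown below, using the BOD cut points from this notebook. The helper name `band_rating` is hypothetical, and giving missing or negative readings the rating 0 mirrors what the nested conditionals do.

```python
import math
from bisect import bisect_left

def band_rating(x, upper_bounds, scores=(100, 80, 60, 40, 0)):
    """Rate x by the first band whose upper bound it does not exceed (hypothetical helper)."""
    if math.isnan(x) or x < 0:
        return 0  # missing or negative readings get the lowest rating
    return scores[bisect_left(upper_bounds, x)]

BOD_BOUNDS = [3, 6, 80, 125]  # same cut points as the BOD rating cell
print([band_rating(v, BOD_BOUNDS) for v in (2.0, 3.0, 4.5, 80.0, 200.0)])
```

With pandas this could be applied as `df.bod.apply(lambda v: band_rating(v, BOD_BOUNDS))`. Because the bands come from one sorted list, no value can fall into a gap between them.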

#### Now we apply the WQI formula: we multiply each quality rating by its weight and then sum all the values.

In [None]:
df['wph']=df.npH * 0.165
df['wdo']=df.ndo * 0.281
df['wbdo']=df.nbdo * 0.234
df['wec']=df.nec* 0.009
df['wna']=df.nna * 0.028
df['wco']=df.nco * 0.281
df['wqi']=df.wph+df.wdo+df.wbdo+df.wec+df.wna+df.wco 
df

#### Then we classify the water on the basis of its water quality index.

In [None]:
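# Bands are contiguous, so a fractional WQI such as 25.5 or 75.4 lands in the
# next band instead of falling through to 'Unsuitable'.
df['quality']=df.wqi.apply(lambda x:('Excellent' if (25>=x>=0)  
                                 else('Good' if  (50>=x>25) 
                                      else('Poor' if (75>=x>50)
                                          else('Very Poor' if (100>=x>75) 
                                              else 'Unsuitable')))))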

#### Let's visualize the water quality index in each state of India.

In [None]:
#renaming state names
# Assign the result back: an in-place replace on a selected column may not update the frame in newer pandas.
gdf['st_nm'] = gdf['st_nm'].replace({"Andaman & Nicobar Island": "Andaman and Nicobar Islands",
                                     "Arunanchal Pradesh": "Arunachal Pradesh",
                                     'Dadara & Nagar Havelli':'Dadra and Nagar Haveli and Daman and Diu',
                                     'Jammu & Kashmir':'Jammu and Kashmir',
                                     'NCT of Delhi':'Delhi'})
df['state'] = df['state'].replace({"TAMILNADU": "TAMIL NADU"})

#Capitalizing only the first letter of each word
df['state'] = df['state'].str.title()

In [None]:
gdf = gdf.rename(columns={"st_nm": "state"})
merged = pd.merge(gdf, df , how='outer', on='state')
merged['coords'] = merged['geometry'].apply(lambda x: x.representative_point().coords[:])
merged['coords'] = [coords[0] for coords in merged['coords']]
merged = merged.drop_duplicates(subset ="state") 

sns.set_context("talk")
sns.set_style("dark")
cmap = 'Blues'
figsize = (20, 15)
ax = merged.plot(column= 'wqi', cmap=cmap, 
                          figsize=figsize, scheme='User_Defined',
                          classification_kwds=dict(bins=[0,25,50,75,100]),
                          edgecolor='black', legend = True)
for idx, row in merged.iterrows():
    ax.text(row.coords[0], row.coords[1], s=row['wqi'], horizontalalignment='center', bbox={'facecolor': 'yellow', 'alpha':0.8, 'pad': 1, 'edgecolor':'blue'})

ax.get_legend().set_title('Water Quality Index')
ax.set_title("Water Quality Index in each state ", size = 25)

ax.set_axis_off()
plt.axis('equal')
plt.show()

#### Let us convert the data back to a Spark DataFrame for the next steps.

In [None]:
spark_df = sqlContext.createDataFrame(df)

In [None]:
spark_df.show()

In [None]:
spark_df.createOrReplaceTempView("df_sql")

In [None]:
State = spark.sql("Select state from df_sql")
State = State.rdd.map(lambda row : row.state).collect()

In [None]:
Wqi = spark.sql("Select wqi from df_sql")
Wqi = Wqi.rdd.map(lambda row : row.wqi).collect()

In [None]:
plt.barh(State,Wqi)

plt.xlabel("WQI")
plt.ylabel("STATES")


plt.show()

<a id=7></a>
# **Model Creation**
#### Now we apply machine learning and deep learning algorithms to predict the WQI and the water quality.

## Non Deep Learning Based Linear Regression Model

#### First, the columns required to predict the WQI are combined into a single feature vector using VectorAssembler. Then we normalize the data using Normalizer.

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer

vectorAssembler = VectorAssembler(inputCols=["npH","ndo","nbdo","nec","nna","nco"], outputCol="features")
normalizer = Normalizer(inputCol="features",outputCol="features_norm")
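Spark's `Normalizer` rescales each row vector to unit p-norm, with p = 2 by default, rather than standardizing each column. The plain-Python sketch below mimics that per-row behaviour for illustration only. `l2_normalize` is a hypothetical helper, the sample row is made up, and returning an all-zero vector unchanged is a choice of this sketch.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; return an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return list(vector)
    return [v / norm for v in vector]

# Hypothetical row of quality ratings in the order npH, ndo, nbdo, nec, nna, nco.
row = [100, 80, 100, 60, 100, 40]
unit_row = l2_normalize(row)
print(unit_row)
print(math.sqrt(sum(v * v for v in unit_row)))  # length of the rescaled row
```

One consequence worth keeping in mind: two rows whose ratings are proportional to each other map to the same normalized vector, so the overall magnitude of the ratings is not visible to the model after this step.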

#### Then we import LinearRegression from pyspark.ml.regression and apply it to our normalized data. After that, we import Pipeline from pyspark.ml and include all the preceding steps in the pipeline.

In [None]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features_norm",labelCol="wqi",maxIter=10,regParam=0.3,elasticNetParam=0.2)

In [None]:
from pyspark.ml import Pipeline

In [None]:
pipeline = Pipeline(stages=[vectorAssembler,normalizer,lr])

#### Before training, the data is randomly split into two parts so that overfitting can be detected on data the model has not seen, and then the model is trained.

In [None]:
train_data,test_data=spark_df.randomSplit([0.8,0.2])

In [None]:
model = pipeline.fit(train_data)

In [None]:
predictions = model.transform(train_data)

In [None]:
predictions.select("wqi","prediction").show()

#### Now we check the performance of our model.

In [None]:
model.stages[2].summary.r2
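`summary.r2` reports R² on the training data. To see how well the fit generalizes, the same statistic can be computed on predictions for `test_data`; in Spark ML, `RegressionEvaluator` with `metricName="r2"` is the usual tool for that. The formula itself is sketched below in plain Python, with made-up numbers and a hypothetical helper name.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up held-out WQI values and the corresponding model predictions.
actual = [90.0, 80.0, 70.0, 60.0]
predicted = [88.0, 82.0, 69.0, 61.0]
print(r_squared(actual, predicted))
```

A value close to the training R² suggests the model generalizes; a much lower value on held-out rows points to overfitting.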

## Deep Learning Based Linear Regression Model
#### Here we first collect our data in array form. To reduce the number of steps, we convert the data to a pandas DataFrame.

In [None]:
df = spark_df.toPandas()

In [None]:
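# Select the six quality ratings (features) and the WQI (target) by name rather than by position.
data = df[["npH","ndo","nco","nbdo","nec","nna"]].values
pred = df[["wqi"]].values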

In [None]:
from sklearn.model_selection import train_test_split 
data_train,data_test,pred_train,pred_test = train_test_split(data,pred,test_size=0.20,random_state=1)
pred_train.shape

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense

#### Then we initialize the model and add layers to it. Afterwards, the model is compiled with the Adam optimizer and mean squared error as the loss function, and then it is trained.

In [None]:
model2 = Sequential()
model2.add(Dense(350,input_dim=6, activation='relu'))
model2.add(Dense(350,activation='relu'))
model2.add(Dense(350,activation='relu'))
model2.add(Dense(1,activation='linear'))

In [None]:
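# Build the optimizer once and pass the instance to compile(); otherwise these settings are never used.
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model2.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['mse'])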

In [None]:
model2.summary()
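The parameter counts that `model2.summary()` prints can be checked by hand: a Dense layer with n inputs and m units has n × m weights plus m biases. A small sketch for the architecture defined above (the helper name is made up):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# Input width followed by the number of units in each Dense layer of model2.
layer_sizes = [6, 350, 350, 350, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

That is roughly a quarter of a million trainable parameters for a dataset of a few hundred rows, which is one reason to watch the loss on held-out data as well as on the training set.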

In [None]:
perform = model2.fit(data_train,pred_train,epochs=50,batch_size=32)

In [None]:
prediction = model2.predict(data_train)

#### Now we check the performance of our model.

In [None]:
plt.plot(perform.history['loss'])

In [None]:
plt.plot(pred_train,'bo',prediction,'g+')

## Water Quality Prediction
#### After predicting the water quality index, we now classify water on the basis of its WQI and predict its quality.

In [None]:
spark_df = sqlContext.createDataFrame(df)

## Logistic Regression Model
#### Here we create a logistic regression model because the target, the quality class, is categorical rather than a continuous value.

In [None]:
from pyspark.ml.feature import StringIndexer

#### Our quality column contains string values, so we first index them using StringIndexer. Then the columns required to predict water quality are combined into a single feature vector using VectorAssembler, and we normalize the data using Normalizer.

In [None]:
indexer = StringIndexer(inputCol="quality",outputCol="label")
vectorAssembler2 = VectorAssembler(inputCols=["npH","ndo","nbdo","nec","nna","nco","wqi"], outputCol="features2")
normalizer2 = Normalizer(inputCol="features2",outputCol="features_norm2")
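By default `StringIndexer` assigns index 0 to the most frequent label, 1 to the next most frequent, and so on, so the numeric labels depend on the class counts in the data it is fitted on. The sketch below imitates that ordering in plain Python on made-up labels. `frequency_index` is a hypothetical helper, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to its rank by descending frequency (ties: alphabetical)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

# Made-up quality labels for illustration.
sample_quality = ["Very Poor", "Poor", "Very Poor", "Good", "Poor", "Very Poor"]
mapping = frequency_index(sample_quality)
print(mapping)
```

Since the order can change with every random split, the fitted `StringIndexerModel` exposes the order it actually used through its `labels` attribute, which is the safer source when translating predictions back to strings.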

#### Then we import LogisticRegression from pyspark.ml.classification and apply it to our normalized data. After that, we include all the preceding steps in a pipeline.

In [None]:
from pyspark.ml.classification import LogisticRegression

In [None]:
lor = LogisticRegression(featuresCol="features_norm2",labelCol="label",maxIter=10)

In [None]:
pipeline2 = Pipeline(stages=[indexer,vectorAssembler2,normalizer2,lor])

In [None]:
train_data,test_data=spark_df.randomSplit([0.8,0.2])

In [None]:
model3 = pipeline2.fit(train_data)

In [None]:
predictions2 = model3.transform(train_data)

#### Now let us check our predictions.

In [None]:
predictions2.select("label","prediction").show()

#### Now we check the performance of our model.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Named `evaluator` so that Python's built-in eval() is not shadowed.
evaluator = MulticlassClassificationEvaluator().setMetricName('accuracy').setLabelCol('label').setPredictionCol('prediction')
evaluator.evaluate(predictions2)
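Accuracy is the share of rows whose prediction equals the label. Since one overall number can hide weak classes, a per-class tally is a useful companion. Below is a minimal plain-Python sketch on made-up label and prediction pairs; the helper name is hypothetical.

```python
from collections import Counter

def accuracy_and_confusion(labels, predictions):
    """Return overall accuracy and a Counter keyed by (label, prediction) pairs."""
    pairs = list(zip(labels, predictions))
    correct = sum(1 for label, pred in pairs if label == pred)
    return correct / len(pairs), Counter(pairs)

# Made-up indexed labels and predictions.
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 0, 1, 0, 2, 1]
accuracy, confusion = accuracy_and_confusion(labels, predictions)
print(accuracy)
print(confusion[(1, 0)])  # rows of class 1 predicted as class 0
```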

#### The quality column is in string format, so we convert the predicted values, which are numeric indices, back to their string labels and compare them with the actual data.

In [None]:
# Read the label order from the fitted StringIndexer (the first pipeline stage) instead of hard-coding it.
names = model3.stages[0].labels

In [None]:
predictions2.createOrReplaceTempView("predictions2_sql")

In [None]:
pred = spark.sql("Select prediction from predictions2_sql")
pred = pred.rdd.map(lambda row : int(row.prediction)).collect()
qua = spark.sql("Select quality from predictions2_sql")
qua = qua.rdd.map(lambda row : row.quality).collect()

In [None]:
for x in range(min(100, len(pred))):  # print at most 100 rows, fewer if the split is smaller
    print("Predicted:", names[pred[x]], "Actual:", qua[x])

**If you have any suggestions or doubts about any step in this notebook, please comment below and I will try to resolve them.<br> Also, if you want to know more about Spark ML, or if you are new to it, you can view my other notebook: https://www.kaggle.com/utcarshagrawal/titanic-spark-ml-magic-eda-feature-engineering/notebook.<br> That notebook works as a tutorial for beginners.**
