# Drinking water potability

First, we import the dataset and look at its first rows.

In [15]:
import pandas as pd

df = pd.read_csv("water_potability.csv")

df.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


## Data preprocessing

As the preview above shows, the dataset contains several `NaN` (*not a number*) values, so we need to clean the data by removing the rows that contain nulls. We then apply a technique known as *undersampling*, in which we equalize the number of instances in the two classes by removing instances from the class that has more of them.

In [18]:
df.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64

In [20]:
df = df.dropna()
df.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64
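Since `dropna()` discards every row that has at least one missing value, it can help to check how many rows are lost before committing to it. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `clean`, and `rows_dropped` are illustrative assumptions, not the water dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values (not the real dataset)
toy = pd.DataFrame({
    "ph": [np.nan, 3.7, 8.1, 8.3],
    "Sulfate": [368.5, np.nan, np.nan, 356.9],
    "Potability": [0, 0, 0, 1],
})

# Count missing values per column before removing anything
missing_per_column = toy.isnull().sum()

# Drop every row that contains at least one NaN and measure the loss
rows_before = len(toy)
clean = toy.dropna()
rows_dropped = rows_before - len(clean)

print(missing_per_column)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

If the share of dropped rows turns out to be large, that is a sign that removing nulls may cost a substantial part of the data.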

In [21]:
import numpy as np

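# Row labels of the majority class (not potable) and size of the minority class (potable)
not_potable_indices = df[df.Potability == 0].index
number_of_potables = df[df.Potability == 1].shape[0]
# Randomly keep as many non-potable rows as there are potable ones (no seed is set, so the draw differs between runs)
random_indices = np.random.choice(not_potable_indices, number_of_potables, replace=False)
df_not_potables = df.loc[random_indices]
df_potables = df[df.Potability == 1]
# Recombine both classes into a single balanced frame
df = pd.concat([df_potables, df_not_potables])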

df.Potability.value_counts()

0    811
1    811
Name: Potability, dtype: int64
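The undersampling cell above draws rows with an unseeded `np.random.choice`, so each run keeps a different subset of the majority class. A reproducible variant could look like the sketch below. It assumes pandas 1.1 or later (for `groupby(...).sample`); the helper name `undersample` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def undersample(frame, label, seed=0):
    """Downsample every class to the size of the smallest class."""
    minority_size = frame[label].value_counts().min()
    # Sample the same number of rows from each class, with a fixed seed
    return frame.groupby(label).sample(n=minority_size, random_state=seed)

# Hypothetical toy frame: four non-potable rows, two potable rows
toy = pd.DataFrame({
    "ph": [6.5, 7.0, 7.2, 8.1, 6.9, 7.4],
    "Potability": [0, 0, 0, 0, 1, 1],
})

balanced = undersample(toy, "Potability")
print(balanced["Potability"].value_counts())
```

With a fixed `seed`, repeated calls return the same rows, which makes the downstream scores comparable between runs.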

## Splitting the data into training and test sets
We split the data into `X`, the instances, and `y`, the values of the "Potability" class.

In [22]:
from sklearn.model_selection import train_test_split

X = df.drop('Potability', axis=1).values
y = df['Potability'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0xB33F)
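A plain random split does not guarantee that the 50/50 class ratio obtained by undersampling is preserved in the training and test parts. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic data and hypothetical names (`X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, `y_te`) purely to illustrate the effect; it is not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic balanced data: 50 samples per class, 4 features (illustrative only)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array([0] * 50 + [1] * 50)

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0xB33F, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```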

## Training and testing
### Bernoulli Naive Bayes

In [25]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import precision_score, accuracy_score

BNB = BernoulliNB()
BNB.fit(X_train, y_train)

y_predicted = BNB.predict(X_test)

precision = precision_score(y_test, y_predicted, average='macro')
accuracy = accuracy_score(y_test, y_predicted)
print(f"Precision: {precision} | Accuracy: {accuracy}")

Precision: 0.2402464065708419 | Accuracy: 0.4804928131416838
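The reported macro precision is exactly half of the accuracy, which is what one would expect if the model predicted a single class for every test sample. A plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that `BernoulliNB` binarizes its inputs with the default threshold `binarize=0.0`: the columns in the preview are positive measurements, so nearly every feature would become 1 and all samples would look alike. The sketch below reproduces this effect on synthetic data (all names and values are illustrative) and contrasts it with `GaussianNB`, a variant meant for continuous features that the original notebook does not use.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Synthetic, strictly positive features (illustrative only, not the water data)
X_class0 = rng.uniform(1.0, 3.0, size=(60, 3))
X_class1 = rng.uniform(2.0, 4.0, size=(40, 3))
X_toy = np.vstack([X_class0, X_class1])
y_toy = np.array([0] * 60 + [1] * 40)

# BernoulliNB's default binarize=0.0 maps every value greater than 0 to 1,
# so all rows collapse to the same binary vector
binarized = (X_toy > 0.0).astype(int)
print("distinct binarized rows:", len(np.unique(binarized, axis=0)))

bnb_pred = BernoulliNB().fit(X_toy, y_toy).predict(X_toy)
gnb_pred = GaussianNB().fit(X_toy, y_toy).predict(X_toy)

print("classes predicted by BernoulliNB:", np.unique(bnb_pred))
print("classes predicted by GaussianNB:", np.unique(gnb_pred))
```

Because every binarized row is identical, `BernoulliNB` cannot separate the samples and falls back to one class, while `GaussianNB` still uses the actual feature values.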


### Random Forest

In [27]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=0xB33F)
RF.fit(X_train, y_train)

y_predicted = RF.predict(X_test)

precision = precision_score(y_test, y_predicted, average='macro')
accuracy = accuracy_score(y_test, y_predicted)
print(f"Precision: {precision} | Accuracy: {accuracy}")

Precision: 0.6095985155195682 | Accuracy: 0.6098562628336756
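Both models are scored with `precision_score(..., average='macro')`, which is the unweighted mean of the per-class precisions. As a rough illustration of what that number means, the sketch below recomputes it by hand with NumPy. The helper `macro_precision` and the toy labels are hypothetical, and a class that is never predicted is counted as 0 here, which is an assumption matching scikit-learn's default behavior.

```python
import numpy as np

def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision over all labels seen."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for label in np.union1d(y_true, y_pred):
        predicted_as_label = y_pred == label
        if predicted_as_label.sum() == 0:
            # The class was never predicted: its precision is counted as 0
            per_class.append(0.0)
        else:
            # Fraction of predictions for this label that were correct
            per_class.append(float((y_true[predicted_as_label] == label).mean()))
    return sum(per_class) / len(per_class)

y_true_toy = [0, 0, 1, 1, 1]
mixed = macro_precision(y_true_toy, [0, 1, 1, 1, 0])
single_class = macro_precision(y_true_toy, [0, 0, 0, 0, 0])
print(mixed, single_class)
```

In the second call every sample is predicted as class 0, so one of the two per-class terms is 0 and the macro precision drops to half of the accuracy.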
