# Assignment
For this assignment I'm working with a new dataset that I've been meaning to parse. It is mostly numerical, but it contains one important categorical variable. The objective is to find the variables that best predict the arsenic content of water samples from the city of Durango, Mexico. I'm going to format this one for publication.

# Predictors of arsenic content in groundwater
\[Details about where the data came from\]

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)  # Unlimited columns

## Data cleanup
The column names and category labels in this dataset are in Spanish, and several columns contain missing values.

In [3]:
df = pd.read_excel('RESULTADOS MUESTREO DURANGO GLOBAL FINAL 2018 REV1.xlsx',
               sheet_name='Resultados Muestreo Durango')

In [9]:
print(df.shape)
df.head()

(146, 22)


Unnamed: 0,Municipio,Localidad,Coordenadas,Unnamed: 3,Muestra,FECHA DE MUESTREO,pH,Conductividad (μs/cm),As (μg/L),Flúor (mg/L),Na+ (mg/L),K+ (mg/L),Fe+ (mg/L),Ca+ (mg/L),Mg+ (mg/L),NO3- (mg/L),Cl- (mg/L),CO3-2 (mg/L),HCO3- (mg/L),Alcalinidad total (mg CaCO3/L),SO4,Tipo de Agua
0,,,Longitud,Latitud,,NaT,,,,,,,,,,,,,,,,
1,Durango,El Nayar,-104.695,23.9629,60.0,2017-08-08,8.14,337.0,61.5,3.15,40.922,1.4075,0.007,15.0015,0.2675,3.25,4.14,0.0,97.0,97.0,70.615,BICARBONATADA SODICA
2,Durango,Sebastián Lerdo de Tejada,-104.64,23.9572,61.0,2017-08-08,8.11,406.0,38.5,2.6,45.885,0.61,,20.153,0.0645,2.1,2.04,0.0,122.0,122.0,79.445,BICARBONATADA SODICA
3,Durango,Felipe Ángeles,-104.557,23.9351,62.0,2017-08-08,8.375,384.1,26.5,1.4,38.536,6.3665,,21.809,1.189,1.35,2.325,0.0,140.0,140.0,53.73,BICARBONATADA SODICA
4,Durango,Villa Montemorelos,-104.482,23.9918,63.0,2017-08-08,8.5,557.5,23.5,1.2,31.6805,6.735,,32.833,6.5825,4.4,4.68,0.0,206.5,206.5,60.245,BICARBONATADA CALCICA Y/O MAGNESICA


In [43]:
# Rename all columns to simpler English names
df2 = df.rename(
    {'Municipio':'municipality',
     'Localidad':'town',
     'Coordenadas':'longitude',
     'Unnamed: 3':'latitude',
     'Muestra':'id',
     'FECHA DE MUESTREO ':'sampling_date',
     'pH':'pH',
     'Conductividad (μs/cm)':'conductivity',
     'As (μg/L)':'As',
     'Flúor (mg/L)':'F',
     'Na+ (mg/L)':'Na',
     'K+    (mg/L)':'K',
     'Fe+ (mg/L)':'Fe',
     'Ca+ (mg/L)':'Ca',
     'Mg+ (mg/L)':'Mg',
     'NO3- (mg/L)':'nitrate',
     'Cl- (mg/L)':'Cl',
     ' CO3-2 (mg/L)':'carbonate',
     'HCO3- (mg/L)':'bicarbonate',
     'Alcalinidad total                (mg CaCO3/L)':'total_alcalinity',
     'SO4':'sulfate',
     'Tipo de Agua':'water_type'}, axis='columns')

# The first row is a leftover sub-header (it only holds the Longitud/Latitud labels), so drop it
df2 = df2.drop(index=0)

# The sample id is only a label and shouldn't have predictive power, so drop it
df2 = df2.drop(columns='id')
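
Several of the original headers contain irregular whitespace (for example `'K+    (mg/L)'` and `' CO3-2 (mg/L)'`), so the rename mapping above has to reproduce those spaces exactly. One possible way to make this less fragile is to normalize the headers before renaming. The following is only a sketch: `normalize_columns` is a hypothetical helper, not part of the original notebook, and the toy frame `demo` just reuses three of the header strings with values taken from the first sample.

```python
import re
import pandas as pd

def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column labels have whitespace collapsed and trimmed.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = frame.copy()
    out.columns = [re.sub(r"\s+", " ", str(col)).strip() for col in frame.columns]
    return out

# Toy frame reusing three of the irregular headers from the spreadsheet
demo = pd.DataFrame(
    [[1.4075, 0.0, 97.0]],
    columns=[
        "K+    (mg/L)",
        " CO3-2 (mg/L)",
        "Alcalinidad total                (mg CaCO3/L)",
    ],
)

demo_clean = normalize_columns(demo)
print(list(demo_clean.columns))
```

With normalized headers, the keys of the rename dictionary could be written with single spaces and no leading or trailing blanks.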

In [44]:
df2.head()

Unnamed: 0,municipality,town,longitude,latitude,sampling_date,pH,conductivity,As,F,Na,K,Fe,Ca,Mg,nitrate,Cl,carbonate,bicarbonate,total_alcalinity,sulfate,water_type
1,Durango,El Nayar,-104.695,23.9629,2017-08-08,8.14,337.0,61.5,3.15,40.922,1.4075,0.007,15.0015,0.2675,3.25,4.14,0.0,97.0,97.0,70.615,BICARBONATADA SODICA
2,Durango,Sebastián Lerdo de Tejada,-104.64,23.9572,2017-08-08,8.11,406.0,38.5,2.6,45.885,0.61,,20.153,0.0645,2.1,2.04,0.0,122.0,122.0,79.445,BICARBONATADA SODICA
3,Durango,Felipe Ángeles,-104.557,23.9351,2017-08-08,8.375,384.1,26.5,1.4,38.536,6.3665,,21.809,1.189,1.35,2.325,0.0,140.0,140.0,53.73,BICARBONATADA SODICA
4,Durango,Villa Montemorelos,-104.482,23.9918,2017-08-08,8.5,557.5,23.5,1.2,31.6805,6.735,,32.833,6.5825,4.4,4.68,0.0,206.5,206.5,60.245,BICARBONATADA CALCICA Y/O MAGNESICA
5,Durango,Belisario Domínguez,-104.509,24.0266,2017-08-08,8.33,326.1,97.5,5.95,45.6745,1.6525,0.0795,8.519,0.21,0.52,5.39,0.0,83.0,83.0,59.88,BICARBONATADA SODICA
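
Because the dropped first row held the strings `Longitud` and `Latitud`, pandas most likely read the two coordinate columns as `object` rather than as floats, and dropping the row does not change that. The sketch below shows one way to convert them with `pd.to_numeric`. It runs on a toy frame (`raw`, built from the first two samples shown above) rather than on the real spreadsheet, and the assumption about the dtypes should be checked with `df2.dtypes` before applying it to `df2`.

```python
import pandas as pd

# Toy frame mimicking the layout: row 0 is the sub-header, the rest are samples
raw = pd.DataFrame({
    "longitude": ["Longitud", -104.695, -104.64],
    "latitude": ["Latitud", 23.9629, 23.9572],
    "pH": [None, 8.14, 8.11],
})

# Drop the sub-header row, as in the cleanup cell
samples = raw.drop(index=0)

# Convert the coordinate columns to floats; anything unparseable becomes NaN
for col in ["longitude", "latitude"]:
    samples[col] = pd.to_numeric(samples[col], errors="coerce")

print(samples.dtypes)
```

Using `errors="coerce"` turns any stray text into `NaN` instead of raising, which makes leftover non-numeric cells show up in a later `isnull().sum()` check.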


In [7]:
df.isnull().sum()

Municipio                                          1
Localidad                                          1
Coordenadas                                        0
Unnamed: 3                                         0
Muestra                                            1
FECHA DE MUESTREO                                  1
pH                                                 1
Conductividad (μs/cm)                              1
As (μg/L)                                          1
Flúor (mg/L)                                       1
Na+ (mg/L)                                         1
K+    (mg/L)                                       1
Fe+ (mg/L)                                       129
Ca+ (mg/L)                                         1
Mg+ (mg/L)                                         1
NO3- (mg/L)                                        1
Cl- (mg/L)                                         1
 CO3-2 (mg/L)                                      1
HCO3- (mg/L)                                  