## Final Assessment Programming 1
By: Jacob Menzinga (357758)

##### Introduction
Research question

##### Hypotheses
A higher amount of lead in wastewater correlates with a higher incidence of violent crime.

##### Data sources
1. <a href="https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=7477&_theme=309">Wastewater treatment in the Netherlands</a><br>
From this dataset the following features were selected:

    - Onderwerpen -> Aanvoer van afvalwater -> Hoeveelheden:
        - Volume afvalwater in 1000 m3
        - Lood in kg

    - Regios:
        - Provincies

    - Perioden:
        - from 2010 up to and including 2020

2. <a href="https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=83648NED&_theme=406">Registered crime in the Netherlands</a><br>
From this dataset the following features were selected:

    - Onderwerpen -> Geregistreerde misdrijven:
        - Geregistreerde misdrijven per 1000 inw.

    - Soort misdrijf:
        - 111 Diefstal en inbraak met geweld
        - 15 Afpersing en afdreiging
        - 21 Vernieling en beschadiging
        - 221 Openlijke geweldpleging
        - 23 Brandstichting / ontploffing
        - 3 Gewelds- en seksuele misdrijven
        - 7 Vuurwapenmisdrijven

    - Regios:
        - Provincies

    - Perioden:
        - from 2010 up to and including 2020
    


##### Reading in the data

In [None]:
# Imports
import yaml

import pandas as pd
import numpy as np

from bokeh.io import output_notebook, show
from bokeh.plotting import figure
output_notebook()

import hvplot.pandas

##### Supporting Functions

In [None]:
def check_data(df):
    """
    A function to check any dataframe for:
        - Missing data
        - Datatypes
        - Descriptive statistics
        
    It then prints its findings.

    Args:
        df (pd.DataFrame): Any dataframe.
    """
    missing_data = df.isna().sum()
    if missing_data.values.sum() == 0:
        print('Missing data:')
        print('No missing data :)')
        missing_loc = 'None'
    
    else:
        # missing_data['perc. of data'] = df.isna().sum()/(len(df))*100
        missing_loc = df[df.isnull().any(axis=1)]
        print(f"Missing data per column:\n{missing_data}\n")
        print("The missing data is located in the following rows:")
        print(missing_loc)
    
    
    dtypes = df.dtypes
    print('\nData types:')
    print(dtypes)
    
    describe = df.describe()
    print('\nDescription of the dataframe:')
    print(describe)
    

##### Importing and cleaning data

In [None]:
with open('config.yaml') as stream:
    config = yaml.safe_load(stream)
    
crime_df = pd.read_csv(config['crime'], delimiter=';')
lead_df = pd.read_csv(config['lead'], delimiter=';')

First I'll have a look at crime_df

In [None]:
crime_df.rename(columns= {'SoortMisdrijf':'Crime',
                         'RegioS':'Region', 'Perioden':'Year',
                         'GeregistreerdeMisdrijvenPer1000Inw_3':'Incidence'},
                inplace=True)
crime_df

In [None]:
check_data(crime_df)

Two things I took away from the data check:
1) There are a lot of missing values in the PV99 region. I looked this region code up in the metadata file of the crime dataset (also downloadable from the above link); it is a category for 'uncategorisable' data, so I will drop these rows.

2) I want to turn the Year and Incidence columns into int and float dtypes, respectively.

In [None]:
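# Dropping the PV99 region (the code is padded with whitespace in the raw data).
# .copy() avoids a SettingWithCopyWarning when columns are reassigned below.
crime_df = crime_df[crime_df['Region'].str.strip() != 'PV99'].copy()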

In [None]:
# Checking the values in the Incidence and Year columns
print(crime_df['Incidence'].unique())
print(crime_df['Year'].unique())

In [None]:
# Replacing the '       .' value with 0
# TODO: explain why.
crime_df['Incidence'] = crime_df['Incidence'].str.replace('       .', '0', regex=False)

# Typecasting the Year and Incidence columns
crime_df['Year'] = crime_df['Year'].str.replace('JJ00','').astype(int)
crime_df['Incidence'] = crime_df['Incidence'].astype(float)
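
Replacing the '.' placeholder with 0 means an unreported value is treated as a real incidence of zero. As an alternative sketch (not what this notebook does), `pd.to_numeric` with `errors='coerce'` turns the placeholder into NaN, so the value stays visible as missing. The values below are made up for illustration.

```python
import pandas as pd

# Made-up values mimicking the CBS export: padded numbers and a '.' placeholder
raw = pd.Series(['     1.2', '       .', '     0.4'])

# errors='coerce' converts anything that cannot be parsed as a number into NaN
incidence = pd.to_numeric(raw.str.strip(), errors='coerce')
print(incidence)
```

Whether 0 or NaN is the better choice depends on what the placeholder means in the CBS metadata.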

In [None]:
# Now that that's done, I'll run the check data again to see if I got rid of 
# all the missing data

check_data(crime_df)

Now it's time for lead_df.

In [None]:
lead_df.rename(columns={'RegioS':'Region', 'Perioden':'Year',
                        'VolumeAfvalwater_43':'Vol_Wastewater', 
                        'Lood_52':'Lead'}, inplace= True)
lead_df

In [None]:
check_data(lead_df)

In this dataframe I want to change the Year and Lead columns to integers.

In [None]:
print(lead_df['Lead'].unique())

In [None]:
# Replacing the '       .' value with NaN
lead_df['Lead'] = lead_df['Lead'].replace('       .', np.nan, regex=False)

# Typecasting the Year and Lead columns
lead_df['Year'] = lead_df['Year'].str.replace('JJ00','').astype(int)
lead_df['Lead'] = lead_df['Lead'].astype(float) # Float for now because NaN can't be int.

# Filling the NaN with an interpolated value
lead_df['Lead'] = lead_df['Lead'].interpolate().astype(int)
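
`Series.interpolate()` fills a gap linearly from the neighbouring rows, regardless of which region those rows belong to. A gap at the edge of one region could therefore be filled using a value from another region. A minimal sketch of interpolating within each region instead, using made-up numbers and assuming the rows are ordered by year within each region:

```python
import numpy as np
import pandas as pd

# Made-up data: one missing Lead value per region
toy = pd.DataFrame({
    'Region': ['PV20', 'PV20', 'PV20', 'PV21', 'PV21', 'PV21'],
    'Year':   [2010, 2011, 2012, 2010, 2011, 2012],
    'Lead':   [100.0, np.nan, 300.0, 10.0, np.nan, 30.0],
})

# Interpolate each region separately so values never leak across regions
toy['Lead'] = toy.groupby('Region')['Lead'].transform(lambda s: s.interpolate())
print(toy)
```

A NaN in the first year of a region would stay NaN with this approach, which would need handling before casting to int.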

Now that we have the volume of wastewater (in 1000 m3) and the amount of lead in that water (in kg), I would like to create a column with the amount of lead per m3 of water.

In [None]:
lead_df['lead_per_m3'] = lead_df['Lead'] / lead_df['Vol_Wastewater']
# Lead is in kg and Vol_Wastewater in 1000 m3, so the ratio above is in g per m3.
# Multiplying by 1000 expresses it in mg per m3.
lead_df['lead_per_m3'] = lead_df['lead_per_m3']*1000
lead_df
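
The units of this column are easy to get wrong, so here is a quick worked check with made-up numbers. It assumes, as stated in the data sources, that lead is reported in kg and wastewater volume in units of 1000 m3.

```python
# Made-up example: 500 kg of lead in 250_000 (x 1000 m3) of wastewater
lead_kg = 500
volume_1000_m3 = 250_000

grams = lead_kg * 1000                  # kg -> g
cubic_metres = volume_1000_m3 * 1000    # 1000 m3 -> m3

grams_per_m3 = grams / cubic_metres
print(grams_per_m3)          # equal to lead_kg / volume_1000_m3
print(grams_per_m3 * 1000)   # the same quantity in milligrams per m3
```

The two factors of 1000 cancel, so dividing kg by 1000 m3 already gives grams per m3; the extra multiplication by 1000 gives milligrams per m3. The scaling does not affect any correlation computed later.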

Now I'm going to have a look at the Region column in both DataFrames, since this is the feature I'll be merging on

In [None]:
print(f"""
Crime regions:
{crime_df['Region'].unique()}

Lead regions:
{lead_df['Region'].unique()}""")

There clearly is some whitespace that needs removing.

In [None]:
crime_df['Region'] = crime_df['Region'].str.replace(r'\s','', regex=True)
lead_df['Region'] = lead_df['Region'].str.replace(r'\s','', regex=True)

In [None]:
print(f"""
Crime regions:
{crime_df['Region'].unique()}

Lead regions:
{lead_df['Region'].unique()}""")

The crime_df has the different types of crime in one column. I would like each crime to be a separate feature, with the incidence as its value.

In [None]:
crime_df = crime_df.set_index(['Region','Year']).pivot(columns='Crime', values='Incidence').reset_index()
crime_df
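
To illustrate what this reshaping does, here is a small sketch on made-up data. It passes the index columns directly to `pivot` rather than calling `set_index` first, which should give the same result in pandas 1.1 or newer.

```python
import pandas as pd

# Made-up long-format data: one row per region, year and crime type
long_df = pd.DataFrame({
    'Region': ['PV20', 'PV20', 'PV21', 'PV21'],
    'Year': [2010, 2010, 2010, 2010],
    'Crime': ['CRI1110', 'CRI3000', 'CRI1110', 'CRI3000'],
    'Incidence': [0.5, 6.0, 0.3, 5.0],
})

# Wide format: one row per region and year, one column per crime type
wide_df = long_df.pivot(index=['Region', 'Year'], columns='Crime',
                        values='Incidence').reset_index()
wide_df.columns.name = None  # drop the leftover 'Crime' label on the column axis
print(wide_df)
```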

In [None]:
crime_df['Total_incidence'] = crime_df[[
    'CRI1110', 'CRI1500', 'CRI2100', 'CRI2200', 'CRI2300','CRI3000', 'CRI7000']].sum(axis=1)
crime_df

Now both dataframes are ready to be merged!

In [None]:
lead_df

In [None]:
lead_crime_df = lead_df.merge(crime_df, how='inner', on=['Region', 'Year'])
lead_crime_df = lead_crime_df.drop(['Vol_Wastewater', 'Lead', 'ID'], axis=1)

lead_crime_df
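
An inner merge silently drops any Region and Year combination that appears in only one of the two dataframes. One way to check for this, sketched here on made-up data, is an outer merge with `indicator=True`; `validate='one_to_one'` additionally raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Made-up frames: PV21 only has lead data, PV22 only has crime data
left = pd.DataFrame({'Region': ['PV20', 'PV21'], 'Year': [2010, 2010],
                     'lead_per_m3': [2.0, 3.0]})
right = pd.DataFrame({'Region': ['PV20', 'PV22'], 'Year': [2010, 2010],
                      'Total_incidence': [12.0, 9.0]})

checked = left.merge(right, how='outer', on=['Region', 'Year'],
                     validate='one_to_one', indicator=True)

# Rows that would be lost in an inner merge
unmatched = checked[checked['_merge'] != 'both']
print(unmatched)
```

If `unmatched` is empty for the real data, the inner merge above loses nothing.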

Above is the dataframe I'll be working with to answer my research question. Below I'll provide some information

In [None]:
hvexplorer = hvplot.explorer(lead_crime_df)
hvexplorer
# Check for normal distribution
# Check for independence 
# Seaborn Pairplot / heatmap
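
The checks listed in the comments above are still open. As one possible starting point for the hypothesis itself, the sketch below computes a Pearson correlation and a Spearman rank correlation, which does not assume normally distributed data. The values are made up; the column names follow the merged dataframe.

```python
import pandas as pd

# Made-up values standing in for lead_crime_df
toy = pd.DataFrame({
    'lead_per_m3':     [1.0, 2.0, 3.0, 4.0, 5.0],
    'Total_incidence': [10.0, 12.0, 11.0, 15.0, 18.0],
})

# Pearson correlation (the default of Series.corr)
pearson = toy['lead_per_m3'].corr(toy['Total_incidence'])

# Spearman correlation is the Pearson correlation of the ranks
spearman = toy['lead_per_m3'].rank().corr(toy['Total_incidence'].rank())

print(round(pearson, 3), round(spearman, 3))
```

A correlation on the pooled data mixes differences between provinces with changes over time, so it may be worth looking at the relationship per region as well.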