In [2]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [13]:
path = "ER Visits in DK (23 Categories).csv"
df = pd.read_csv(path, header = 0)
df

Unnamed: 0,Gender,Age,Region,Diagnosis,2019,2020,2021,2022
0,Men,0 years,Region Hovedstaden,Infectious diseases,248,147,274,316
1,Men,0 years,Region Hovedstaden,Diseases of lung,322,225,387,377
2,Men,0 years,Region Hovedstaden,Diseases of the nervous system,47,76,43,70
3,Men,0 years,Region Hovedstaden,Diseases of the heart and great vessels,27,40,28,30
4,Men,0 years,Region Hovedstaden,"Diseases of arteries, veins, lymph",..,4,5,..
...,...,...,...,...,...,...,...,...
4554,Women,85 years and over,Region Nordjylland,Sterilization,0,0,0,0
4555,Women,85 years and over,Region Nordjylland,Concussion,3,5,8,13
4556,Women,85 years and over,Region Nordjylland,Poisonings,7,5,7,9
4557,Women,85 years and over,Region Nordjylland,Live-born children,0,0,0,0


## Data Cleaning 

First, I will learn more about my dataframe.

In [14]:
df.columns

Index(['Gender', 'Age', 'Region', 'Diagnosis', '2019', '2020', '2021', '2022'], dtype='object')

In [15]:
df.info()  # parentheses are needed; bare df.info only prints the bound method and the full frame

### Missing Values

In this dataset, null values are labeled as "..". I will change them to NaN so that I can work with them more easily.

In [17]:
df.replace('..', np.nan, inplace = True)

In [19]:
df.isnull().sum()

Gender         0
Age            0
Region         0
Diagnosis      0
2019         240
2020         215
2021         241
2022         232
dtype: int64
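As an alternative to replacing the placeholder after loading, the marker can be declared while the file is parsed. The sketch below uses a tiny made-up CSV held in memory (not the real file) to show the `na_values` parameter of `pd.read_csv`. The names `sample_csv` and `sample` are only for illustration.

```python
import io
import pandas as pd

# Tiny made-up sample that mimics the layout of the real file
sample_csv = io.StringIO(
    "Gender,Diagnosis,2019,2020\n"
    "Men,Example A,..,4\n"
    "Women,Example B,7,..\n"
)

# na_values tells read_csv to treat ".." as missing while parsing
sample = pd.read_csv(sample_csv, na_values="..")

print(sample.isnull().sum())
print(sample.dtypes)
```

With this approach the year columns are parsed as numbers straight away, so a separate type conversion would likely not be needed.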

I will fill null values with the row mean, so that each imputed value comes from the same combination of gender, age, region, and diagnosis.

However, the dataset contains a few rows with zero or only one yearly value reported. I will remove these rows, because a row mean based on at most one observation could distort the analysis.

In [24]:
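# Keep rows with at least 6 non-null cells (4 labels + at least 2 yearly values).
# dropna returns a new dataframe, so the result must be assigned back.
df = df.dropna(thresh = 6)
df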

Unnamed: 0,Gender,Age,Region,Diagnosis,2019,2020,2021,2022
0,Men,0 years,Region Hovedstaden,Infectious diseases,248,147,274,316
1,Men,0 years,Region Hovedstaden,Diseases of lung,322,225,387,377
2,Men,0 years,Region Hovedstaden,Diseases of the nervous system,47,76,43,70
3,Men,0 years,Region Hovedstaden,Diseases of the heart and great vessels,27,40,28,30
4,Men,0 years,Region Hovedstaden,"Diseases of arteries, veins, lymph",,4,5,
...,...,...,...,...,...,...,...,...
4554,Women,85 years and over,Region Nordjylland,Sterilization,0,0,0,0
4555,Women,85 years and over,Region Nordjylland,Concussion,3,5,8,13
4556,Women,85 years and over,Region Nordjylland,Poisonings,7,5,7,9
4557,Women,85 years and over,Region Nordjylland,Live-born children,0,0,0,0
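To make the `thresh` logic concrete, here is a minimal sketch on made-up rows (the frame `toy` and its values are hypothetical). Each row has four label columns, so `thresh=6` keeps only rows with at least two reported yearly values.

```python
import numpy as np
import pandas as pd

# Made-up rows: four label columns plus four yearly counts
toy = pd.DataFrame({
    "Gender": ["Men", "Men", "Men"],
    "Age": ["0 years"] * 3,
    "Region": ["Region A"] * 3,
    "Diagnosis": ["two values", "one value", "no values"],
    "2019": [np.nan, np.nan, np.nan],
    "2020": [4.0, 4.0, np.nan],
    "2021": [5.0, np.nan, np.nan],
    "2022": [np.nan, np.nan, np.nan],
})

# thresh=6 keeps rows with at least 6 non-null cells:
# 4 labels + at least 2 yearly counts
kept = toy.dropna(thresh=6)

# dropna returns a new DataFrame; toy itself is unchanged
print(len(toy), len(kept))
print(kept["Diagnosis"].tolist())
```

Only the row labeled "two values" survives, and `toy` still has all three rows because `dropna` does not modify the frame in place by default.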


In [26]:
df.dtypes

Gender       object
Age          object
Region       object
Diagnosis    object
2019         object
2020         object
2021         object
2022         object
dtype: object

In [40]:
# Changing the datatype of the year columns to numeric
columns_to_convert = ['2019', '2020', '2021', '2022']

for col in columns_to_convert:
    df[col] = df[col].astype('float')

In [41]:
df.dtypes

Gender        object
Age           object
Region        object
Diagnosis     object
2019         float64
2020         float64
2021         float64
2022         float64
dtype: object
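A related option is `pd.to_numeric` with `errors="coerce"`, which converts text to numbers and turns anything unparseable into NaN. This is a sketch on a made-up Series (`raw` is hypothetical), not the conversion used above.

```python
import pandas as pd

# Made-up column stored as text, including the ".." placeholder
raw = pd.Series(["248", "..", "5"])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
print(numeric.dtype)
```

One trade-off: coercion silently hides any other malformed value as well, so `astype('float')`, which raises an error on bad input, is the stricter choice once the placeholders are already NaN.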

In [42]:
# Filling in the NaN values with the row mean (mean skips NaN by default)
numeric_cols = df.columns[4:]
row_means = df[numeric_cols].mean(axis = 1) 

for col in numeric_cols:
    df[col] = df[col].fillna(row_means)

In [49]:
# Confirming that the row mean has been applied correctly in the dataframe
df.head(10)

Unnamed: 0,Gender,Age,Region,Diagnosis,2019,2020,2021,2022
0,Men,0 years,Region Hovedstaden,Infectious diseases,248.0,147.0,274.0,316.0
1,Men,0 years,Region Hovedstaden,Diseases of lung,322.0,225.0,387.0,377.0
2,Men,0 years,Region Hovedstaden,Diseases of the nervous system,47.0,76.0,43.0,70.0
3,Men,0 years,Region Hovedstaden,Diseases of the heart and great vessels,27.0,40.0,28.0,30.0
4,Men,0 years,Region Hovedstaden,"Diseases of arteries, veins, lymph",4.5,4.0,5.0,4.5
5,Men,0 years,Region Hovedstaden,Varicose veins in leg,0.0,0.0,0.0,0.0
6,Men,0 years,Region Hovedstaden,Diseases of blood and lymphatic tissue,8.0,3.0,18.0,19.0
7,Men,0 years,Region Hovedstaden,Diseases of the stomach and intestines,112.0,142.0,147.0,126.0
8,Men,0 years,Region Hovedstaden,Diseases of the urinary tract,117.0,137.0,133.0,107.0
9,Men,0 years,Region Hovedstaden,Diseases of the musculoskeletal system,25.0,21.0,25.0,34.0
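Row 4 is a useful check. In the raw preview it had values only for 2020 and 2021 (4 and 5), so the row mean is (4 + 5) / 2 = 4.5, which is what now appears in 2019 and 2022. The sketch below reproduces that single row in isolation; the variable `row4` is just for illustration.

```python
import numpy as np
import pandas as pd

years = ["2019", "2020", "2021", "2022"]

# Values of row 4 before filling: missing, 4, 5, missing
row4 = pd.DataFrame([[np.nan, 4.0, 5.0, np.nan]], columns=years)

# mean(axis=1) skips NaN by default, so this is (4 + 5) / 2
row_means = row4.mean(axis=1)

# fillna with a Series aligns on the row index, column by column
for col in years:
    row4[col] = row4[col].fillna(row_means)

print(row4)
```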


Now I will save this dataframe for analysis and visualization.

In [51]:
df.to_csv('ER_Visits_2019-2022_cleaned.csv')
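Whether the row index is written affects how the file reads back later. By default `to_csv` writes the index as an extra unnamed first column, which `read_csv` then labels `Unnamed: 0`. The sketch below shows both variants on a made-up one-row frame, using in-memory buffers instead of real files.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Gender": ["Men"], "2019": [248.0]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the original row numbers are not needed downstream, passing `index=False` keeps the saved file's columns identical to the dataframe's.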

### Summary

Rows with fewer than two reported yearly values were removed to prevent bias, and the remaining null values were filled with the row mean.
