#### Importing important data
In this code I am trying to import the data as efficiently as possible.
The data I am particularly interested in can be found in the Google Sheets document "interesting data".
The categories that are defined so far are:
###### Demographics
1. Literacy
2. population density
3. slum population
4. Urban population share
###### Deaths
5. covid infections
6. covid deaths
7. dates
8. district id
###### Vaccination
9. site vaccination progress
10. first dose vaccination
11. dates
12. district id
###### Age
13. age in 5-year categories, to make a map of where older people live

In [2]:
import pandas as pd
import numpy as np

In [3]:
# We need to be able to access the right folder. My files are in the folders below; adjust the paths to your own setup.
# The 'r' in front of each string makes it a 'raw' string, so the backslashes
# in the Windows paths are not interpreted as escape sequences.

# Base directory
base_dir =       r"C:\Users\danie\OneDrive\Bureaublad\Coding\EPA introduction to datascience\Intro to datascience project"

# Subfolders (relative to the base directory) that contain the CSV files
covid_folder =  r"\covid_data\covid\csv"
demog_folder =  r"\covid_data\demography\csv"

# This section identifies the actual files
deaths_file =   r"\covid_infected_deaths_pc11.csv"
vacc_file =     r"\covid_vaccination.csv"
demog_file =    r"\pc11_demographics_district.csv"
age_file =      r"\age_bins_district_t_pc11.csv"

deaths_dir = base_dir + covid_folder + deaths_file
vacc_dir =   base_dir + covid_folder + vacc_file
demog_dir =  base_dir + demog_folder + demog_file
age_dir =    base_dir + demog_folder + age_file

dirlist = [deaths_dir, vacc_dir, demog_dir, age_dir]
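
The string concatenation above only works with Windows-style separators. As a minimal sketch of an alternative, the same paths can be built with `pathlib`, which picks the separator for the current operating system. The base directory `project` is a hypothetical placeholder, and the `paths` dict is a name introduced here for illustration; the folder and file names are the ones used in this notebook.

```python
from pathlib import Path

# Hypothetical base directory for illustration; replace it with your own project folder.
base_dir = Path("project")

# Subfolders, joined with the / operator instead of string concatenation
covid_folder = base_dir / "covid_data" / "covid" / "csv"
demog_folder = base_dir / "covid_data" / "demography" / "csv"

# One entry per dataset, keyed by the same names used for the dataframes
paths = {
    "Deaths": covid_folder / "covid_infected_deaths_pc11.csv",
    "Vaccination": covid_folder / "covid_vaccination.csv",
    "Demographics": demog_folder / "pc11_demographics_district.csv",
    "Age": demog_folder / "age_bins_district_t_pc11.csv",
}

for name, path in paths.items():
    print(name, "->", path)
```

`pd.read_csv` accepts `Path` objects directly, so the rest of the notebook would not need to change.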

In [4]:
deaths_frame = pd.read_csv(deaths_dir)
vacc_frame   = pd.read_csv(vacc_dir)
demog_frame  = pd.read_csv(demog_dir)
age_frame    = pd.read_csv(age_dir)

framedict = {"Deaths" : deaths_frame, 
             "Vaccination" : vacc_frame,
             "Demographics" : demog_frame, 
             "Age" : age_frame}

In [5]:
print("The keys of the dataframes are: \n")
for key, value in framedict.items():
    print(key, ":")
    print('length: ',len(framedict[key]))
    print(value.keys())

The keys of the dataframes are: 

Deaths :
length:  263862
Index(['pc11_state_id', 'pc11_district_id', 'date', 'total_cases',
       'total_deaths'],
      dtype='object')
Vaccination :
length:  210103
Index(['lgd_state_id', 'lgd_state_name', 'lgd_district_id',
       'lgd_district_name', 'date', 'total_individuals_registered',
       'total_sessions_conducted', 'total_sites', 'total_covaxin',
       'total_covishield', 'first_dose_admin', 'second_dose_admin', 'male_vac',
       'female_vac', 'trans_vac', 'state', 'district', 'bad_flg_covishield',
       'bad_flg_covaxin'],
      dtype='object')
Demographics :
length:  643
Index(['pc11_state_id', 'pc11_district_id', 'pc11_urb_share', 'pc11_slum_pop',
       'pc11_vd_area', 'pc11_td_area', 'pc11_tot_area', 'pc11_pop_dens',
       'pc11r_pca_tot_p', 'pc11u_pca_tot_p', 'pc11_pca_tot_p',
       'pc11r_pca_tot_m', 'pc11u_pca_tot_m', 'pc11_pca_tot_m',
       'pc11r_pca_tot_f', 'pc11u_pca_tot_f', 'pc11_pca_tot_f',
       'pc11r_pca_p_lit', 'p
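
The loading and summary steps above repeat the same call once per file. A possible shortcut, sketched here under the assumption that all files are plain CSVs readable with default settings, is to load them in one dict comprehension. The helper `load_frames` and the two tiny CSV files are made up for this example so that it runs without the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_frames(paths):
    """Read every CSV in `paths` (name -> file path) into a dict of dataframes."""
    return {name: pd.read_csv(path) for name, path in paths.items()}


# Demo with two tiny made-up CSV files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "deaths.csv").write_text("pc11_district_id,total_deaths\n1,0\n2,3\n")
    (tmp / "demog.csv").write_text("pc11_district_id,pc11_pop_dens\n1,120.5\n")
    demo_frames = load_frames({"Deaths": tmp / "deaths.csv",
                               "Demographics": tmp / "demog.csv"})

# Summarise each dataframe: number of rows and column names
for name, frame in demo_frames.items():
    print(name, ":", len(frame), "rows, columns:", list(frame.columns))
```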

In [6]:
# Load the table of interesting variables, which lists the column codes we want to keep per dataset
interesting_dir = base_dir + r"\variablecodes.xlsm"
interesting_frame = pd.read_excel(interesting_dir)
interesting_frame

Unnamed: 0,label,dataset,code,folder,remarks
0,Literacy,Demographics,pc11_pca_p_lit,demography,
1,population density,Demographics,pc11_pop_dens,demography,
2,slum population,Demographics,pc11_slum_pop,demography,only for urban
3,Urban population share,Demographics,pc11_urb_share,demography,
4,covid infections,Deaths,total_cases,covid,
5,covid deaths,Deaths,total_deaths,covid,
6,dates,Deaths,date,covid,
7,district id,Deaths,lgd_district_id,covid,
8,site vaccination progress,Vaccination,total_sites,covid,
9,first dose vaccination,Vaccination,first_dose_admin,covid,
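
The table above can drive the actual column selection. The following is a minimal sketch, assuming the `dataset` values match the keys of the dataframe dict and the `code` values are column names. The helper `select_interesting` is hypothetical, and the demo uses two sample rows rather than the real files. Codes that do not exist in a dataframe are skipped and reported; in the table above this applies to `lgd_district_id`, which is listed for Deaths while the Deaths columns printed earlier contain `pc11_district_id`.

```python
import pandas as pd


def select_interesting(frames, codes):
    """Keep, per dataset, only the columns listed in the `code` column of `codes`."""
    selected = {}
    for dataset, group in codes.groupby("dataset"):
        if dataset not in frames:
            continue
        frame = frames[dataset]
        wanted = [c for c in group["code"] if c in frame.columns]
        missing = [c for c in group["code"] if c not in frame.columns]
        if missing:
            print(f"{dataset}: codes not found and skipped: {missing}")
        selected[dataset] = frame[wanted]
    return selected


# Small stand-ins for the variable-code table and the Deaths dataframe
codes = pd.DataFrame({
    "dataset": ["Deaths", "Deaths", "Deaths", "Deaths"],
    "code": ["total_cases", "total_deaths", "date", "lgd_district_id"],
})
frames = {"Deaths": pd.DataFrame({
    "pc11_state_id": [1, 1],
    "pc11_district_id": [1, 1],
    "date": ["30jan2020", "02feb2020"],
    "total_cases": [0.0, 0.0],
    "total_deaths": [0.0, 0.0],
})}

selected = select_interesting(frames, codes)
print(selected["Deaths"])
```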


In [7]:
framedict['Deaths']

Unnamed: 0,pc11_state_id,pc11_district_id,date,total_cases,total_deaths
0,1,1,30jan2020,0.0,0.0
1,1,1,02feb2020,0.0,0.0
2,1,1,03feb2020,0.0,0.0
3,1,1,02mar2020,0.0,0.0
4,1,1,03mar2020,0.0,0.0
...,...,...,...,...,...
263857,35,999,09apr2021,5161.0,62.0
263858,35,999,10apr2021,5175.0,62.0
263859,35,999,11apr2021,5190.0,62.0
263860,35,999,12apr2021,5201.0,62.0
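
The `date` column holds strings such as `30jan2020`, so comparisons on it are plain text matches. As a sketch, assuming every value follows the day / abbreviated English month / year pattern seen above, the column can be converted to real datetimes, which also allows selecting date ranges. The sample rows below are made up for illustration.

```python
import pandas as pd

# Made-up sample in the same layout as the Deaths dataframe
sample = pd.DataFrame({
    "pc11_district_id": [1, 1, 1],
    "date": ["30jan2020", "02feb2020", "03mar2020"],
    "total_cases": [0.0, 0.0, 0.0],
})

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
sample["date"] = pd.to_datetime(sample["date"], format="%d%b%Y")

# With real datetimes, a range can be selected instead of one exact string
start, end = pd.Timestamp("2020-02-01"), pd.Timestamp("2020-03-01")
february = sample[(sample["date"] >= start) & (sample["date"] < end)]
print(february)
```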


In [23]:
# We can filter the Deaths dataframe on a specific date.
# Note that this overwrites framedict['Deaths'] with the filtered rows.
framedict['Deaths'] = framedict['Deaths'][framedict['Deaths']['date'] == '30jan2020']

# Pivot on the row index: each row forms its own group, so no rows are merged;
# only the numeric columns are kept, and the original row numbers stay on the left.
deaths_per_district = framedict['Deaths'].pivot_table(index = framedict['Deaths'].index, columns = [])

# Alternative that puts the district id on the left instead:
#deaths_per_district = pd.pivot_table(framedict['Deaths'],index = 'pc11_district_id')

#deaths_per_district.to_csv('pivotted.csv')

print('length dataframe is: ',len(deaths_per_district.index))
# Drop the odd district with id 999 (.copy() avoids a SettingWithCopyWarning on the next line)
deaths_per_district = deaths_per_district[deaths_per_district['pc11_district_id'] != 999].copy()
deaths_per_district['censuscode'] = deaths_per_district['pc11_district_id']
deaths_per_district

length dataframe is:  642


Unnamed: 0,pc11_district_id,pc11_state_id,total_cases,total_deaths,censuscode
0,1,1,0.0,0.0,1
411,2,1,0.0,0.0,2
822,3,1,0.0,0.0,3
1233,4,1,0.0,0.0,4
1644,5,1,0.0,0.0,5
...,...,...,...,...,...
260985,635,34,0.0,0.0,635
261396,636,34,0.0,0.0,636
261807,637,34,0.0,0.0,637
262629,639,35,0.0,0.0,639
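
The result above still carries the original row numbers on the left. If the goal is a table with districts as rows, one option is to pivot on the district id and spread the dates over the columns. This is only a sketch of that idea, not what the cell above does, and the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample in the same layout as the Deaths dataframe
sample = pd.DataFrame({
    "pc11_district_id": [1, 1, 2, 2, 999],
    "date": ["30jan2020", "02feb2020", "30jan2020", "02feb2020", "30jan2020"],
    "total_deaths": [0.0, 1.0, 0.0, 2.0, 5.0],
})

# Drop district id 999, as the notebook does, before reshaping
sample = sample[sample["pc11_district_id"] != 999]

# One row per district, one column per date
deaths_wide = sample.pivot_table(index="pc11_district_id", columns="date",
                                 values="total_deaths", aggfunc="sum")
print(deaths_wide)
```

Because the dates are still strings here, the columns are sorted alphabetically rather than chronologically; converting the `date` column to datetimes first would give a chronological order.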
