# Data collection and cleaning

**Goal:** combine the different data sources into a single dataset with around 10 features and at least 1,000 entries

**Data Preparation tasks:**
- [x] Aggregation of data sources and cleaning of column names
- [x] Merging data sources
- [x] Checking missing values and interpolation
- [x] Creating calculated columns
- [ ] Add web-scraping of doctolib (?)

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 35)

In [2]:
# Reading the dataset
apl = pd.read_excel('../data/raw_data/apl-drees.xlsx',sheet_name='APL_2018',header=7, index_col=None)
print("Shape:",apl.shape)
apl.head()

Shape: (34990, 5)


Unnamed: 0,Code commune INSEE,Communes,APL aux médecins généralistes,APL aux médecins généralistes de moins de 65 ans,Population standardisée 2016 pour la médecine générale
0,,,En nombre de consultations/visites accessibles...,En nombre de consultations/visites accessibles...,En nombre d'habitants standardisés
1,1001.0,L'Abergement-Clémenciat,2.396,2.112,761.728
2,1002.0,L'Abergement-de-Varey,2.721,2.634,241.621
3,1004.0,Ambérieu-en-Bugey,4.335,4.271,13798.6
4,1005.0,Ambérieux-en-Dombes,4.279,4.028,1634.55


In [3]:
# Dropping the first row (units) and the last 2 columns, which we won't use
apl.drop(0,axis=0, inplace=True)
apl.drop(columns=apl.columns[-2:], inplace=True)

In [4]:
# Cleaning column names to make them easier to work with
apl = apl.rename(columns={'Code commune INSEE':'CODGEO','APL aux médecins généralistes':'APL'})

# Converting CODGEO to string so that it matches the key used in later merges
apl.CODGEO = apl.CODGEO.astype(str)
apl = apl.convert_dtypes()
apl.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34989 entries, 1 to 34989
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   CODGEO    34989 non-null  string 
 1   Communes  34989 non-null  string 
 2   APL       34989 non-null  float64
dtypes: float64(1), string(2)
memory usage: 1.1 MB


### Secondary data source: communes comparateur

Features to get from this data source:
- Population
- Population density (inhabitants/km², i.e. population / area)
- Population growth
- Median standard of living
- Unemployment rate
- Share of secondary residences (%)
- Share of vacant residences (%)
- Share of city amenities (shops, services, transport - %)
- Share of city amenities (administration, social, health, education - %)

In [5]:
# Reading dataset
com = pd.read_excel('../data/raw_data/base_cc_comparateur.xls',sheet_name='COM',header=5)
print("Shape:",com.shape)
com.head()

Shape: (34953, 36)


Unnamed: 0,CODGEO,LIBGEO,REG,DEP,P16_POP,P11_POP,SUPERF,NAIS1116,DECE1116,P16_MEN,NAISD18,DECESD18,P16_LOG,P16_RP,P16_RSECOCC,P16_LOGVAC,P16_RP_PROP,...,MED16,TP6016,P16_EMPLT,P16_EMPLT_SAL,P11_EMPLT,P16_POP1564,P16_CHOM1564,P16_ACT1564,ETTOT15,ETAZ15,ETBE15,ETFZ15,ETGU15,ETGZ15,ETOQ15,ETTEF115,ETTEFP1015
0,1001,L'Abergement-Clémenciat,84,1,767,780,15.95,41,25,306.0,10,9,348.0,306.0,16.0,26.0,260.0,...,22679.0,,91.887002,58.8732,73.924501,463.0,33.0,376.0,50.0,11.0,3.0,5.0,24.0,5.0,7.0,10.0,0.0
1,1002,L'Abergement-de-Varey,84,1,243,234,9.15,21,7,101.0,1,1,169.0,101.0,52.0,16.0,86.0,...,24382.083333,,14.035558,5.042181,14.970338,144.0,10.0,123.0,19.0,2.0,3.0,0.0,11.0,3.0,3.0,2.0,0.0
2,1004,Ambérieu-en-Bugey,84,1,14081,13839,24.6,1114,595,6348.757303,229,121,7126.116028,6348.757303,120.067029,657.291696,2870.483803,...,19721.0,17.0,7504.161673,6739.551642,7748.960974,8968.158251,1079.621428,6697.333122,1337.0,7.0,52.0,131.0,907.0,290.0,240.0,399.0,109.0
3,1005,Ambérieux-en-Dombes,84,1,1671,1600,15.92,101,42,640.0,21,12,686.62406,640.0,12.433083,34.190977,487.453143,...,23378.0,,300.815019,226.350827,289.347965,1071.339842,68.015722,864.845592,141.0,14.0,7.0,27.0,78.0,20.0,15.0,27.0,5.0
4,1006,Ambléon,84,1,110,112,5.88,9,6,53.0,0,1,74.0,53.0,12.0,9.0,38.0,...,,,6.0,4.0,5.060654,72.0,8.0,58.0,7.0,0.0,0.0,0.0,5.0,1.0,2.0,0.0,0.0


Information on selected columns:
- P16_POP: population in 2016
- P11_POP: population in 2011
- SUPERF: area (in km²)
- NAIS1116: number of births between 01/01/2011 and 01/01/2016
- P16_LOG: number of dwellings in 2016
- P16_RSECOCC: secondary residences and occasional dwellings in 2016
- P16_LOGVAC: vacant dwellings in 2016
- MED16: median standard of living in 2016
- P16_POP1564: number of people aged 15 to 64 in 2016
- P16_CHOM1564: number of unemployed people aged 15 to 64 in 2016
- ETTOT15: total active establishments as of 31 December 2015
- ETGU15: active establishments in trade, transport and miscellaneous services as of 31/12/2015
- ETOQ15: active establishments in public administration as of 31/12/2015

In [6]:
# Selecting the interesting columns
sub_com = com[['CODGEO','P16_POP','SUPERF','P11_POP','P16_CHOM1564','P16_POP1564','P16_RSECOCC','P16_LOG',
     'P16_LOGVAC','ETGU15','ETOQ15','ETTOT15','MED16','NAIS1116']]

# Converting dtypes to ensure matching between CODGEO columns - dtype of CODGEO should be string
sub_com = sub_com.convert_dtypes()
sub_com.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34953 entries, 0 to 34952
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   CODGEO        34953 non-null  string 
 1   P16_POP       34953 non-null  Int64  
 2   SUPERF        34953 non-null  float64
 3   P11_POP       34953 non-null  Int64  
 4   P16_CHOM1564  34953 non-null  float64
 5   P16_POP1564   34953 non-null  float64
 6   P16_RSECOCC   34953 non-null  float64
 7   P16_LOG       34953 non-null  float64
 8   P16_LOGVAC    34953 non-null  float64
 9   ETGU15        34952 non-null  Int64  
 10  ETOQ15        34952 non-null  Int64  
 11  ETTOT15       34952 non-null  Int64  
 12  MED16         31360 non-null  float64
 13  NAIS1116      34953 non-null  Int64  
dtypes: Int64(6), float64(7), string(1)
memory usage: 3.9 MB


In [7]:
# Reading the sheet that holds the municipal arrondissements (boroughs)
arr = pd.read_excel('../data/raw_data/base_cc_comparateur.xls',sheet_name='ARM',header=5)
print("Shape:",arr.shape)
arr.head()

Shape: (45, 36)


Unnamed: 0,CODGEO,LIBGEO,REG,DEP,P16_POP,P11_POP,SUPERF,NAIS1116,DECE1116,P16_MEN,NAISD18,DECESD18,P16_LOG,P16_RP,P16_RSECOCC,P16_LOGVAC,P16_RP_PROP,...,MED16,TP6016,P16_EMPLT,P16_EMPLT_SAL,P11_EMPLT,P16_POP1564,P16_CHOM1564,P16_ACT1564,ETTOT15,ETAZ15,ETBE15,ETFZ15,ETGU15,ETGZ15,ETOQ15,ETTEF115,ETTEFP1015
0,13201,Marseille 1er Arrondissement,93,13,40202,38356,1.8,3235,1297,20907.324265,810,248,24032.214901,20907.324265,1376.561611,1748.329025,5698.233834,...,14263.0,41,21078.77085,17851.441776,20271.036676,27639.35566,4653.81235,17283.392706,9103,3,379,520,7359,2133,842,2533,357
1,13202,Marseille 2e Arrondissement,93,13,24888,24634,5.04,2280,855,11968.287204,435,240,15583.372968,11968.287204,984.838324,2630.24744,2739.015422,...,14213.0,41,20714.053396,19014.806825,17285.338049,16374.621635,2642.706978,10415.987202,4059,2,143,286,3214,839,414,1089,359
2,13203,Marseille 3e Arrondissement,93,13,47773,44600,2.6,5360,1493,20125.693789,1167,308,23878.076491,20125.693789,513.420908,3238.961794,4474.01123,...,11864.0,54,16758.946923,15088.660594,14050.133977,30755.262119,5185.169504,17042.27993,3238,3,120,406,2227,751,482,776,171
3,13204,Marseille 4e Arrondissement,93,13,48074,47953,2.9,3595,2389,24802.157019,740,483,27937.190309,24802.157019,577.503561,2557.529729,10869.527042,...,18304.0,23,13663.898533,11645.46116,16819.479929,29883.621171,3946.08043,21804.204368,4323,2,173,485,2811,743,852,862,121
4,13205,Marseille 5e Arrondissement,93,13,46274,46180,2.24,3184,1831,26113.025103,638,323,29706.195298,26113.025103,1339.004333,2254.165862,9246.743577,...,18626.0,23,21263.677813,18986.468321,21156.136781,32032.532168,3619.039234,22280.646343,4350,3,208,442,2792,582,905,741,126


In [8]:
# Selecting the interesting columns (.copy() gives an independent frame that can be modified safely)
sub_arr = arr[['CODGEO','P16_POP','SUPERF','P11_POP','P16_CHOM1564','P16_POP1564','P16_RSECOCC','P16_LOG',
     'P16_LOGVAC','ETGU15','ETOQ15','ETTOT15','MED16','NAIS1116']].copy()

sub_arr['CODGEO'] = sub_arr['CODGEO'].astype(str)

# Converting dtypes to ensure matching between CODGEO columns - dtype of CODGEO should be string
sub_arr = sub_arr.convert_dtypes()
sub_arr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   CODGEO        45 non-null     string 
 1   P16_POP       45 non-null     Int64  
 2   SUPERF        45 non-null     float64
 3   P11_POP       45 non-null     Int64  
 4   P16_CHOM1564  45 non-null     float64
 5   P16_POP1564   45 non-null     float64
 6   P16_RSECOCC   45 non-null     float64
 7   P16_LOG       45 non-null     float64
 8   P16_LOGVAC    45 non-null     float64
 9   ETGU15        45 non-null     Int64  
 10  ETOQ15        45 non-null     Int64  
 11  ETTOT15       45 non-null     Int64  
 12  MED16         45 non-null     float64
 13  NAIS1116      45 non-null     Int64  
dtypes: Int64(6), float64(7), string(1)
memory usage: 5.3 KB


In [9]:
# Concatenate both dataframes before merging
sub_com_arr = pd.concat([sub_com, sub_arr])

In [10]:
# Merging the dataframes
df_merged = pd.merge(apl,sub_com_arr,'left', on='CODGEO')
print(df_merged.shape)
df_merged.head()

(34989, 16)


Unnamed: 0,CODGEO,Communes,APL,P16_POP,SUPERF,P11_POP,P16_CHOM1564,P16_POP1564,P16_RSECOCC,P16_LOG,P16_LOGVAC,ETGU15,ETOQ15,ETTOT15,MED16,NAIS1116
0,1001,L'Abergement-Clémenciat,2.396,767,15.95,780,33.0,463.0,16.0,348.0,26.0,24,7,50,22679.0,41
1,1002,L'Abergement-de-Varey,2.721,243,9.15,234,10.0,144.0,52.0,169.0,16.0,11,3,19,24382.083333,21
2,1004,Ambérieu-en-Bugey,4.335,14081,24.6,13839,1079.621428,8968.158251,120.067029,7126.116028,657.291696,907,240,1337,19721.0,1114
3,1005,Ambérieux-en-Dombes,4.279,1671,15.92,1600,68.015722,1071.339842,12.433083,686.62406,34.190977,78,15,141,23378.0,101
4,1006,Ambléon,0.912,110,5.88,112,8.0,72.0,12.0,74.0,9.0,5,2,7,,9
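
A left merge keeps every row of `apl`, so codes that find no match only show up later as missing values. As a minimal sketch on toy data (the frames `left` and `right` and their values are made up for illustration), the `indicator` and `validate` options of `pd.merge` can surface such mismatches right away:

```python
import pandas as pd

# Toy stand-ins for apl (left) and sub_com_arr (right); values are illustrative only
left = pd.DataFrame({"CODGEO": ["01001", "01002", "13201"], "APL": [2.4, 2.7, 3.1]})
right = pd.DataFrame({"CODGEO": ["01001", "13201"], "P16_POP": [767, 40202]})

# validate="many_to_one" raises if CODGEO is duplicated in the right-hand frame;
# indicator=True adds a _merge column telling where each row was found
checked = pd.merge(left, right, how="left", on="CODGEO",
                   validate="many_to_one", indicator=True)

# Codes present in the left frame but absent from the right one
unmatched = checked.loc[checked["_merge"] == "left_only", "CODGEO"].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that the key columns are aligned; a long one usually points to a dtype or zero-padding mismatch.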


### Secondary data source: evolution structures

Features to get from this data source:
- Age distribution of the population
- Mobility rate (derived from the population aged 1 or over located 1 year earlier)
- Socio-professional category

In [11]:
# Reading dataset
evol = pd.read_csv('../data/raw_data/base-cc-evol-struct-pop-2016-csv/base-cc-evol-struct-pop-2016.CSV',sep=';')
print("Shape:",evol.shape)
evol.head()

Shape: (34998, 209)


Unnamed: 0,CODGEO,P16_POP,P16_POP0014,P16_POP1529,P16_POP3044,P16_POP4559,P16_POP6074,P16_POP7589,P16_POP90P,P16_POPH,P16_H0014,P16_H1529,P16_H3044,P16_H4559,P16_H6074,P16_H7589,P16_H90P,...,C11_POP2554_CS1,C11_POP2554_CS2,C11_POP2554_CS3,C11_POP2554_CS4,C11_POP2554_CS5,C11_POP2554_CS6,C11_POP2554_CS7,C11_POP2554_CS8,C11_POP55P,C11_POP55P_CS1,C11_POP55P_CS2,C11_POP55P_CS3,C11_POP55P_CS4,C11_POP55P_CS5,C11_POP55P_CS6,C11_POP55P_CS7,C11_POP55P_CS8
0,1001,767.0,161.0,102.0,132.0,189.0,125.0,53.0,5.0,392.0,84.0,55.0,65.0,99.0,61.0,26.0,2.0,...,16.0,4.0,64.0,80.0,84.0,52.0,4.0,0.0,224.0,0.0,0.0,16.0,4.0,20.0,12.0,164.0,8.0
1,1002,243.0,55.0,28.0,70.0,37.0,34.0,17.0,2.0,123.0,26.0,10.0,38.0,17.0,22.0,9.0,1.0,...,0.0,4.0,20.0,52.0,0.0,20.0,0.0,8.0,72.0,0.0,0.0,0.0,0.0,0.0,8.0,64.0,0.0
2,1004,14081.0,2791.29495,2892.682129,2749.298629,2511.528964,1960.65755,1027.80429,147.733487,6813.623449,1401.716081,1469.282436,1392.441526,1243.397931,874.229525,387.356864,45.199087,...,0.0,262.255278,612.878739,1392.727209,1407.876604,1295.654765,39.170957,514.592587,3622.056361,0.0,40.354913,78.925152,136.259277,149.815525,170.1642,2800.673852,245.863442
3,1005,1671.0,342.918984,257.492465,330.675662,386.691402,239.848234,102.969273,10.403981,842.093176,175.520797,135.99427,149.374143,203.59126,124.580453,49.921027,3.111226,...,0.0,63.366337,95.049505,190.09901,186.138614,126.732673,7.920792,27.722772,388.118812,0.0,7.920792,7.920792,7.920792,7.920792,11.881188,312.871287,31.683168
4,1006,110.0,12.0,16.0,15.0,29.0,27.0,10.0,1.0,62.0,8.0,12.0,9.0,17.0,12.0,4.0,0.0,...,0.0,0.0,4.0,8.0,8.0,8.0,0.0,8.0,52.0,0.0,0.0,0.0,8.0,8.0,0.0,36.0,0.0


Information on selected columns:
- P16_POP01P: number of people aged 1 or over located 1 year earlier, in 2016
- P16_POP0014: population aged 0-14 in 2016
- P16_POP1529: population aged 15-29 in 2016
- P16_POP3044: population aged 30-44 in 2016
- P16_POP4559: population aged 45-59 in 2016
- P16_POP6074: population aged 60-74 in 2016
- P16_POP7589: population aged 75-89 in 2016
- P16_POP90P: population aged 90 or over in 2016
- C16_POP15P: population aged 15 or over in 2016
- C16_POP15P_CS1: population aged 15 or over, farmers, in 2016
- C16_POP15P_CS2: population aged 15 or over, craftspeople, shopkeepers and business owners, in 2016
- C16_POP15P_CS3: population aged 15 or over, managers and higher intellectual professions, in 2016
- C16_POP15P_CS4: population aged 15 or over, intermediate professions, in 2016
- C16_POP15P_CS5: population aged 15 or over, employees, in 2016
- C16_POP15P_CS6: population aged 15 or over, manual workers, in 2016
- C16_POP15P_CS7: population aged 15 or over, retirees, in 2016
- C16_POP15P_CS8: population aged 15 or over, others without professional activity, in 2016

In [12]:
# Selecting the interesting columns (.copy() gives an independent frame that can be modified safely)
sub_evol = evol[['CODGEO','P16_POP01P']+list(evol.columns[2:9])+list(evol.columns[51:60])].copy()

# Left-padding CODGEO values that have only 4 characters with a 0
sub_evol['CODGEO'] = sub_evol['CODGEO'].astype(str).str.zfill(5)

# Converting dtypes to ensure matching between CODGEO columns - dtype of CODGEO should be string
sub_evol = sub_evol.convert_dtypes()
sub_evol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34998 entries, 0 to 34997
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CODGEO          34998 non-null  string 
 1   P16_POP01P      34998 non-null  float64
 2   P16_POP0014     34998 non-null  float64
 3   P16_POP1529     34998 non-null  float64
 4   P16_POP3044     34998 non-null  float64
 5   P16_POP4559     34998 non-null  float64
 6   P16_POP6074     34998 non-null  float64
 7   P16_POP7589     34998 non-null  float64
 8   P16_POP90P      34998 non-null  float64
 9   C16_POP15P      34998 non-null  float64
 10  C16_POP15P_CS1  34998 non-null  float64
 11  C16_POP15P_CS2  34998 non-null  float64
 12  C16_POP15P_CS3  34998 non-null  float64
 13  C16_POP15P_CS4  34998 non-null  float64
 14  C16_POP15P_CS5  34998 non-null  float64
 15  C16_POP15P_CS6  34998 non-null  float64
 16  C16_POP15P_CS7  34998 non-null  float64
 17  C16_POP15P_CS8  34998 non-null  float64


In [13]:
# Merging the dataframes
df_merged_2 = pd.merge(df_merged,sub_evol,'left', on='CODGEO')
print(df_merged_2.shape)
df_merged_2.head()

(34989, 33)


Unnamed: 0,CODGEO,Communes,APL,P16_POP,SUPERF,P11_POP,P16_CHOM1564,P16_POP1564,P16_RSECOCC,P16_LOG,P16_LOGVAC,ETGU15,ETOQ15,ETTOT15,MED16,NAIS1116,P16_POP01P,P16_POP0014,P16_POP1529,P16_POP3044,P16_POP4559,P16_POP6074,P16_POP7589,P16_POP90P,C16_POP15P,C16_POP15P_CS1,C16_POP15P_CS2,C16_POP15P_CS3,C16_POP15P_CS4,C16_POP15P_CS5,C16_POP15P_CS6,C16_POP15P_CS7,C16_POP15P_CS8
0,1001,L'Abergement-Clémenciat,2.396,767,15.95,780,33.0,463.0,16.0,348.0,26.0,24,7,50,22679.0,41,750.0,161.0,102.0,132.0,189.0,125.0,53.0,5.0,605.0,15.0,20.0,75.0,95.0,100.0,125.0,145.0,30.0
1,1002,L'Abergement-de-Varey,2.721,243,9.15,234,10.0,144.0,52.0,169.0,16.0,11,3,19,24382.083333,21,238.0,55.0,28.0,70.0,37.0,34.0,17.0,2.0,195.0,0.0,20.0,15.0,25.0,40.0,10.0,65.0,20.0
2,1004,Ambérieu-en-Bugey,4.335,14081,24.6,13839,1079.621428,8968.158251,120.067029,7126.116028,657.291696,907,240,1337,19721.0,1114,13867.484077,2791.29495,2892.682129,2749.298629,2511.528964,1960.65755,1027.80429,147.733487,11273.704757,2.804736,300.150468,782.328599,1940.196262,1830.925313,1797.133429,2789.120337,1831.045614
3,1005,Ambérieux-en-Dombes,4.279,1671,15.92,1600,68.015722,1071.339842,12.433083,686.62406,34.190977,78,15,141,23378.0,101,1654.524661,342.918984,257.492465,330.675662,386.691402,239.848234,102.969273,10.403981,1374.056524,5.194086,61.990298,108.503109,237.313443,247.598151,237.081832,321.931876,154.443729
4,1006,Ambléon,0.912,110,5.88,112,8.0,72.0,12.0,74.0,9.0,5,2,7,,9,107.0,12.0,16.0,15.0,29.0,27.0,10.0,1.0,90.0,0.0,0.0,5.0,25.0,15.0,15.0,25.0,5.0


### Secondary data source: equipements

Features to get from this data source:
- Level of medical education = number of health education establishments

(other possible features: Number of leisure establishments, Number of healthcare establishments)

In [14]:
# Reading the dataset
eqmt = pd.read_csv('../data/raw_data/bpe18_ensemble_csv/bpe18_ensemble.csv',sep=';')
print("Shape:",eqmt.shape)
eqmt.head()

Shape: (1035564, 7)


Unnamed: 0,REG,DEP,DEPCOM,DCIRIS,AN,TYPEQU,NB_EQUIP
0,84,1,1001,1001,2018,A401,2
1,84,1,1001,1001,2018,A404,4
2,84,1,1001,1001,2018,A504,1
3,84,1,1001,1001,2018,A507,1
4,84,1,1001,1001,2018,B203,1


In [15]:
# Aggregating the number of health education establishments
education_health = eqmt[eqmt.TYPEQU=='C402'].groupby('DEPCOM').NB_EQUIP.agg('sum').reset_index()

# Renaming DEPCOM as CODGEO to match the merged dataframe
education_health = education_health.rename(columns={'DEPCOM':'CODGEO'})

# Left-padding codes that have only 4 characters with a 0
education_health['CODGEO'] = education_health['CODGEO'].astype(str).str.zfill(5)

# Converting dtypes to ensure matching between CODGEO columns - dtype of CODGEO should be string
education_health = education_health.convert_dtypes()
education_health.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 524 entries, 0 to 523
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   CODGEO    524 non-null    string
 1   NB_EQUIP  524 non-null    Int64 
dtypes: Int64(1), string(1)
memory usage: 8.8 KB
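
Codes that were parsed as integers lose their leading zero, which is why they are padded back to 5 characters before merging. A minimal sketch of that normalisation on made-up values, using `str.zfill` as one possible way to do the padding:

```python
import pandas as pd

# Mixed input as it can come out of a CSV: integers without a leading zero and string codes
raw_codes = pd.Series([1001, 13201, "2A004", "01004"], dtype=object)

# Cast everything to str, then left-pad with zeros up to 5 characters
codes = raw_codes.astype(str).str.zfill(5)
print(codes.tolist())
```

Codes that already have 5 characters, including the alphanumeric Corsican ones, are left unchanged.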


In [16]:
# Merging the dataframes 
df_merged_3 = pd.merge(df_merged_2, education_health,'left', on='CODGEO')
print(df_merged_3.shape)
df_merged_3.head()

(34989, 34)


Unnamed: 0,CODGEO,Communes,APL,P16_POP,SUPERF,P11_POP,P16_CHOM1564,P16_POP1564,P16_RSECOCC,P16_LOG,P16_LOGVAC,ETGU15,ETOQ15,ETTOT15,MED16,NAIS1116,P16_POP01P,P16_POP0014,P16_POP1529,P16_POP3044,P16_POP4559,P16_POP6074,P16_POP7589,P16_POP90P,C16_POP15P,C16_POP15P_CS1,C16_POP15P_CS2,C16_POP15P_CS3,C16_POP15P_CS4,C16_POP15P_CS5,C16_POP15P_CS6,C16_POP15P_CS7,C16_POP15P_CS8,NB_EQUIP
0,1001,L'Abergement-Clémenciat,2.396,767,15.95,780,33.0,463.0,16.0,348.0,26.0,24,7,50,22679.0,41,750.0,161.0,102.0,132.0,189.0,125.0,53.0,5.0,605.0,15.0,20.0,75.0,95.0,100.0,125.0,145.0,30.0,
1,1002,L'Abergement-de-Varey,2.721,243,9.15,234,10.0,144.0,52.0,169.0,16.0,11,3,19,24382.083333,21,238.0,55.0,28.0,70.0,37.0,34.0,17.0,2.0,195.0,0.0,20.0,15.0,25.0,40.0,10.0,65.0,20.0,
2,1004,Ambérieu-en-Bugey,4.335,14081,24.6,13839,1079.621428,8968.158251,120.067029,7126.116028,657.291696,907,240,1337,19721.0,1114,13867.484077,2791.29495,2892.682129,2749.298629,2511.528964,1960.65755,1027.80429,147.733487,11273.704757,2.804736,300.150468,782.328599,1940.196262,1830.925313,1797.133429,2789.120337,1831.045614,
3,1005,Ambérieux-en-Dombes,4.279,1671,15.92,1600,68.015722,1071.339842,12.433083,686.62406,34.190977,78,15,141,23378.0,101,1654.524661,342.918984,257.492465,330.675662,386.691402,239.848234,102.969273,10.403981,1374.056524,5.194086,61.990298,108.503109,237.313443,247.598151,237.081832,321.931876,154.443729,
4,1006,Ambléon,0.912,110,5.88,112,8.0,72.0,12.0,74.0,9.0,5,2,7,,9,107.0,12.0,16.0,15.0,29.0,27.0,10.0,1.0,90.0,0.0,0.0,5.0,25.0,15.0,15.0,25.0,5.0,


In [17]:
# Filling NaN values with 0: a commune absent from the aggregation has no health education establishment.
df_merged_3.NB_EQUIP = df_merged_3.NB_EQUIP.fillna(0)
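
The filter, aggregate, merge and fill sequence used for this feature can be reproduced on toy data. The sketch below is only an illustration: the frames `equip` and `communes` and their values are invented, and only the column names follow the notebook.

```python
import pandas as pd

# Toy equipment table in the same layout as the BPE extract (values are invented)
equip = pd.DataFrame({
    "DEPCOM": ["01004", "01004", "13201", "13201"],
    "TYPEQU": ["C402", "A401", "C402", "C402"],
    "NB_EQUIP": [1, 3, 2, 1],
})
communes = pd.DataFrame({"CODGEO": ["01001", "01004", "13201"]})

# Keep one equipment type, sum the counts per commune, and align the key name
counts = (equip[equip["TYPEQU"] == "C402"]
          .groupby("DEPCOM", as_index=False)["NB_EQUIP"].sum()
          .rename(columns={"DEPCOM": "CODGEO"}))

# The left merge keeps communes without any matching equipment; their count becomes 0
result = communes.merge(counts, how="left", on="CODGEO")
result["NB_EQUIP"] = result["NB_EQUIP"].fillna(0).astype(int)
print(result)
```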

In [18]:
df_merged_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34989 entries, 0 to 34988
Data columns (total 34 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CODGEO          34989 non-null  string 
 1   Communes        34989 non-null  string 
 2   APL             34989 non-null  float64
 3   P16_POP         34989 non-null  Int64  
 4   SUPERF          34989 non-null  float64
 5   P11_POP         34989 non-null  Int64  
 6   P16_CHOM1564    34989 non-null  float64
 7   P16_POP1564     34989 non-null  float64
 8   P16_RSECOCC     34989 non-null  float64
 9   P16_LOG         34989 non-null  float64
 10  P16_LOGVAC      34989 non-null  float64
 11  ETGU15          34988 non-null  Int64  
 12  ETOQ15          34988 non-null  Int64  
 13  ETTOT15         34988 non-null  Int64  
 14  MED16           31402 non-null  float64
 15  NAIS1116        34989 non-null  Int64  
 16  P16_POP01P      34989 non-null  float64
 17  P16_POP0014     34989 non-null 

In [19]:
# Saving the dataframe as is, in case the raw merged data needs to be retrieved later
df_merged_3.to_csv('../data/medical_desert_raw_data.csv',index=False)

________________________
## Cleaning missing values

In [20]:
df_clean = df_merged_3.copy()

In [21]:
df_clean.isna().sum()

CODGEO               0
Communes             0
APL                  0
P16_POP              0
SUPERF               0
P11_POP              0
P16_CHOM1564         0
P16_POP1564          0
P16_RSECOCC          0
P16_LOG              0
P16_LOGVAC           0
ETGU15               1
ETOQ15               1
ETTOT15              1
MED16             3587
NAIS1116             0
P16_POP01P           0
P16_POP0014          0
P16_POP1529          0
P16_POP3044          0
P16_POP4559          0
P16_POP6074          0
P16_POP7589          0
P16_POP90P           0
C16_POP15P           0
C16_POP15P_CS1       0
C16_POP15P_CS2       0
C16_POP15P_CS3       0
C16_POP15P_CS4       0
C16_POP15P_CS5       0
C16_POP15P_CS6       0
C16_POP15P_CS7       0
C16_POP15P_CS8       0
NB_EQUIP             0
dtype: int64

In [22]:
mis_value = df_clean[df_clean.MED16.isna()]
mis_value

Unnamed: 0,CODGEO,Communes,APL,P16_POP,SUPERF,P11_POP,P16_CHOM1564,P16_POP1564,P16_RSECOCC,P16_LOG,P16_LOGVAC,ETGU15,ETOQ15,ETTOT15,MED16,NAIS1116,P16_POP01P,P16_POP0014,P16_POP1529,P16_POP3044,P16_POP4559,P16_POP6074,P16_POP7589,P16_POP90P,C16_POP15P,C16_POP15P_CS1,C16_POP15P_CS2,C16_POP15P_CS3,C16_POP15P_CS4,C16_POP15P_CS5,C16_POP15P_CS6,C16_POP15P_CS7,C16_POP15P_CS8,NB_EQUIP
4,01006,Ambléon,0.912,110,5.88,112,8.000000,72.000000,12.000000,74.000000,9.000000,5,2,7,,9,107.000000,12.000000,16.000000,15.000000,29.000000,27.000000,10.000000,1.000000,90.000000,0.000000,0.000000,5.000000,25.000000,15.000000,15.000000,25.000000,5.000000,0
16,01019,Armix,0.000,26,6.82,20,1.083333,16.250000,14.482759,30.537356,4.137931,1,1,5,,2,26.000000,3.250000,4.333333,0.000000,6.500000,9.750000,2.166667,0.000000,27.083333,10.833333,0.000000,0.000000,0.000000,0.000000,0.000000,16.250000,0.000000,0
19,01023,Asnières-sur-Saône,1.033,63,4.68,73,0.952985,41.982834,7.000000,42.636561,8.000000,1,2,6,,5,62.047015,11.435818,6.670894,16.200743,14.346272,10.534333,3.811939,0.000000,57.179092,4.764924,4.764924,4.764924,4.764924,4.764924,23.824622,9.529849,0.000000,0
46,01051,Bolozon,0.495,89,4.92,92,8.000000,50.000000,40.000000,89.000000,4.000000,4,1,9,,7,86.000000,12.000000,8.000000,11.000000,24.000000,24.000000,9.000000,1.000000,80.000000,0.000000,5.000000,0.000000,10.000000,20.000000,15.000000,25.000000,5.000000,0
59,01066,La Burbanche,0.659,71,10.82,74,7.000000,44.000000,53.000000,92.000000,2.000000,4,1,10,,6,71.000000,10.000000,10.000000,8.000000,20.000000,14.000000,8.000000,1.000000,60.000000,0.000000,0.000000,0.000000,5.000000,5.000000,5.000000,35.000000,10.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34960,97357,Grand-Santi,0.000,7428,2112.00,5526,549.985364,3428.881619,39.102113,1543.476039,28.153521,29,7,43,,966,7229.749462,3836.041326,1673.405080,1154.329746,516.943607,177.999139,63.951786,5.329316,3480.043048,0.000000,15.987947,37.305209,74.610418,175.867413,106.586311,0.000000,3069.685751,0
34961,97358,Saint-Élie,0.000,147,5680.00,420,0.000000,145.243028,0.000000,61.546682,21.136323,4,2,13,,0,147.000000,0.000000,28.697211,56.223108,54.466135,7.613546,0.000000,0.000000,137.629482,2.928287,14.641434,2.928287,5.856574,8.784861,87.848606,2.928287,11.713147,0
34962,97360,Apatou,0.489,8826,2020.00,6975,2532.332345,4855.294508,18.153821,1827.354629,21.179458,28,13,54,,832,8606.161309,3732.023485,2400.429131,1531.542877,791.419286,282.649745,80.607520,7.327956,5024.884355,5.234255,5.234255,26.171273,277.415490,198.901672,57.576800,41.874036,4412.476575,0
34963,97361,Awala-Yalimapo,0.894,1393,187.40,1305,405.000000,818.000000,11.000000,355.000000,30.000000,7,4,15,,110,1359.000000,493.000000,333.000000,257.000000,194.000000,87.000000,27.000000,2.000000,920.000000,0.000000,10.000000,5.000000,40.000000,180.000000,110.000000,65.000000,510.000000,0


In [23]:
df_clean[(df_clean.CODGEO.str.contains("^97",regex=True))&(df_clean.P16_POP<1000)]

Unnamed: 0,CODGEO,Communes,APL,P16_POP,SUPERF,P11_POP,P16_CHOM1564,P16_POP1564,P16_RSECOCC,P16_LOG,P16_LOGVAC,ETGU15,ETOQ15,ETTOT15,MED16,NAIS1116,P16_POP01P,P16_POP0014,P16_POP1529,P16_POP3044,P16_POP4559,P16_POP6074,P16_POP7589,P16_POP90P,C16_POP15P,C16_POP15P_CS1,C16_POP15P_CS2,C16_POP15P_CS3,C16_POP15P_CS4,C16_POP15P_CS5,C16_POP15P_CS6,C16_POP15P_CS7,C16_POP15P_CS8,NB_EQUIP
34916,97208,Fonds-Saint-Denis,1.808,761,24.28,843,89.155903,467.43788,36.333021,449.872627,52.566924,22,9,56,16520.0,37,759.912733,80.457766,103.290376,71.759629,219.627957,179.312095,92.417705,14.134472,695.850952,21.745342,32.618013,16.309007,97.85404,114.163047,81.545033,255.507771,76.108698,0
34919,97211,Grand'Rivière,0.329,703,16.6,567,79.932945,398.639942,12.691348,411.499986,79.07686,21,8,39,14023.913043,24,698.900875,93.255102,102.478134,79.932945,171.138484,122.973761,116.825073,16.396501,625.116618,10.247813,10.247813,0.0,35.867347,102.478134,112.725948,194.708455,158.841108,0
34943,97301,Régina,0.0,911,12130.0,904,188.267944,582.678664,85.818382,405.185191,45.974133,24,9,78,,101,899.607146,247.535645,234.661293,147.196311,168.713926,85.964262,26.928564,0.0,681.873947,62.14284,5.17857,15.53571,46.60713,119.910821,113.928539,72.320888,246.24945,0
34956,97314,Ouanary,0.0,182,1080.0,109,17.29,101.01,0.0,38.239709,6.389709,2,2,12,,1,177.45,71.89,27.3,36.4,30.03,13.65,2.73,0.0,109.2,4.55,0.0,0.0,13.65,36.4,9.1,9.1,36.4,0
34957,97352,Saül,0.0,151,4475.0,153,15.894737,95.368421,1.285196,64.801631,12.851961,5,3,14,,12,149.013158,40.730263,30.796053,20.861842,36.756579,17.881579,2.980263,0.993421,139.078947,0.0,14.901316,0.0,14.901316,34.769737,14.901316,9.934211,49.671053,0
34961,97358,Saint-Élie,0.0,147,5680.0,420,0.0,145.243028,0.0,61.546682,21.136323,4,2,13,,0,147.0,0.0,28.697211,56.223108,54.466135,7.613546,0.0,0.0,137.629482,2.928287,14.641434,2.928287,5.856574,8.784861,87.848606,2.928287,11.713147,0


In [26]:
# Function returning the communes of the same department (same first 2 characters of CODGEO)
# whose population is below pop + coef (only an upper bound is applied).
# A larger coef widens the pool of comparable communes, which helps for small communes.
def get_similar(codgeo, pop, coef):
    same_department = df_clean.CODGEO.str.startswith(codgeo[:2])
    return df_clean[same_department & (df_clean.P16_POP < (pop + coef))]

# Testing the function
mis_value[1:3].apply(lambda x: get_similar(x['CODGEO'],x['P16_POP'],200).MED16.median(),axis=1)


16    21393.333333
19    21633.910714
dtype: float64

In [27]:
# Filling missing values with the median of similar communes - WARNING: takes time to run
mis_value = df_clean[df_clean.MED16.isna()]

df_clean.MED16 = df_clean.MED16.fillna(mis_value.apply(lambda x: get_similar(x['CODGEO'],x['P16_POP'],200).MED16.median(),axis=1))



In [28]:
# Repeating the step for the last 3 rows with a larger similarity coefficient
# (the previous coef did not leave enough comparable values)
mis_value = df_clean[df_clean.MED16.isna()]

df_clean.MED16 = df_clean.MED16.fillna(mis_value.apply(lambda x: get_similar(x['CODGEO'],x['P16_POP'],1000).MED16.median(),axis=1))
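
The row-by-row `apply` re-filters the whole dataframe for every missing value, which is why it is slow. As a hedged sketch of one possible speed-up, the same rule (same department, population below `pop + coef`) can be applied after slicing each department only once. The helper `fill_by_department` and the toy values are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: two departments ("01" and "02"), two communes with a missing median
toy = pd.DataFrame({
    "CODGEO": ["01001", "01002", "01003", "01004", "02001", "02002"],
    "P16_POP": [100, 150, 5000, 120, 300, 250],
    "MED16": [20000.0, 22000.0, 30000.0, np.nan, 18000.0, np.nan],
})

def fill_by_department(df, coef):
    """Median of MED16 among communes of the same department with P16_POP < pop + coef."""
    out = df["MED16"].copy()
    dept = df["CODGEO"].str[:2]
    for _, group in df.groupby(dept):          # each department is sliced only once
        missing = group[group["MED16"].isna()]
        for idx, row in missing.iterrows():
            similar = group.loc[group["P16_POP"] < row["P16_POP"] + coef, "MED16"]
            out.loc[idx] = similar.median()    # NaN values are skipped by median()
    return out

filled = fill_by_department(toy, coef=200)
print(filled.tolist())
```

Rows whose pool of similar communes contains no known value stay missing, so a second pass with a larger `coef` would still be needed for them.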

In [29]:
# Filling the last 3 missing columns with the mode among similar communes
row = 29876
missing_col = df_clean.loc[row][df_clean.loc[row].isna()].index

for col in missing_col:
    df_clean.loc[row, col] = get_similar(df_clean.loc[row, 'CODGEO'],
                                         df_clean.loc[row, 'P16_POP'], 200)[col].mode()[0]
                              

In [30]:
df_clean.isna().sum()

CODGEO            0
Communes          0
APL               0
P16_POP           0
SUPERF            0
P11_POP           0
P16_CHOM1564      0
P16_POP1564       0
P16_RSECOCC       0
P16_LOG           0
P16_LOGVAC        0
ETGU15            0
ETOQ15            0
ETTOT15           0
MED16             0
NAIS1116          0
P16_POP01P        0
P16_POP0014       0
P16_POP1529       0
P16_POP3044       0
P16_POP4559       0
P16_POP6074       0
P16_POP7589       0
P16_POP90P        0
C16_POP15P        0
C16_POP15P_CS1    0
C16_POP15P_CS2    0
C16_POP15P_CS3    0
C16_POP15P_CS4    0
C16_POP15P_CS5    0
C16_POP15P_CS6    0
C16_POP15P_CS7    0
C16_POP15P_CS8    0
NB_EQUIP          0
dtype: int64

______________________________________
## Calculate new columns

The objective is to derive calculated columns from the data retrieved from the different sources. Expressing values in relative terms (rates and percentages) makes communes of different sizes easier to compare.
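
As a worked example, the sketch below computes three such ratios for the first commune of the merged table (population 767 in 2016 and 780 in 2011, area 15.95 km², 33 unemployed among 463 people aged 15 to 64). The helper name `annual_growth_pct` is hypothetical and only used for illustration.

```python
def annual_growth_pct(pop_end, pop_start, years):
    """Compound annual growth rate, in percent."""
    return ((pop_end / pop_start) ** (1 / years) - 1) * 100

density = 767 / 15.95                               # inhabitants per km²
growth = annual_growth_pct(767, 780, 2016 - 2011)   # % per year over 2011-2016
unemployment = 33 / 463 * 100                       # % of the 15-64 population

print(round(density, 2), round(growth, 3), round(unemployment, 2))
```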

In [31]:
new_df = df_clean.copy()

In [32]:
# Calculated metrics
new_df['density_area'] = new_df.P16_POP / new_df.SUPERF
new_df['annual_pop_growth'] = ((new_df.P16_POP/new_df.P11_POP)**(1/(2016-2011))-1)*100
new_df['unemployment_rate'] = (new_df.P16_CHOM1564/new_df.P16_POP1564)*100
new_df['secondary_residence_rate'] = (new_df.P16_RSECOCC/new_df.P16_LOG)*100
new_df['vacant_residence_rate'] = (new_df.P16_LOGVAC/new_df.P16_LOG)*100
new_df['active_local_business_rate'] = (new_df.ETGU15/new_df.ETTOT15)*100
new_df['city_social_amenities_rate'] = (new_df.ETOQ15/new_df.ETTOT15)*100
new_df['0_14_pop_rate'] = (new_df.P16_POP0014/new_df.P16_POP)*100
new_df['15_59_pop_rate'] = ((new_df.P16_POP1529+new_df.P16_POP3044+new_df.P16_POP4559)/new_df.P16_POP)*100
new_df['60+_pop_rate'] = ((new_df.P16_POP6074+new_df.P16_POP7589+new_df.P16_POP90P)/new_df.P16_POP)*100
new_df['mobility_rate'] = ((new_df.P16_POP-new_df.P16_POP01P)/new_df.P16_POP)*100
new_df['average_birth_rate'] = (new_df.NAIS1116/(2016-2011))/(new_df[['P16_POP','P11_POP']].mean(axis=1))*100

for i in range(1,9): 
    new_df[f'CSP{i}_rate'] = ((new_df[f'C16_POP15P_CS{i}']/new_df.C16_POP15P)*100)

# Filling NaN values of the CSP rates with 0: for a few communes the population aged 15 or over (C16_POP15P) is 0
new_df = new_df.fillna(0)

new_df = new_df.rename(columns={'MED16':'median_living_standard','NB_EQUIP':'healthcare_education_establishments'})
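
One caveat with ratio columns: with plain float columns, a zero denominator yields `NaN` for 0/0 but `inf` for a non-zero numerator, and `fillna(0)` only replaces the former. A minimal sketch on made-up values, assuming that infinite rates should also be treated as missing:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"num": [1.0, 0.0, 3.0], "den": [2.0, 0.0, 0.0]})

rate = toy["num"] / toy["den"] * 100          # 50.0, NaN (0/0), inf (3/0)

# Turn +/-inf into NaN first, so that a single fillna covers both cases
rate_clean = rate.replace([np.inf, -np.inf], np.nan).fillna(0)
print(rate_clean.tolist())
```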

In [33]:
final_df = new_df[list(new_df.columns[:4])+['median_living_standard']+list(new_df.columns[-21:])]
print(final_df.shape)
final_df.head()

(34989, 26)


Unnamed: 0,CODGEO,Communes,APL,P16_POP,median_living_standard,healthcare_education_establishments,density_area,annual_pop_growth,unemployment_rate,secondary_residence_rate,vacant_residence_rate,active_local_business_rate,city_social_amenities_rate,0_14_pop_rate,15_59_pop_rate,60+_pop_rate,mobility_rate,average_birth_rate,CSP1_rate,CSP2_rate,CSP3_rate,CSP4_rate,CSP5_rate,CSP6_rate,CSP7_rate,CSP8_rate
0,1001,L'Abergement-Clémenciat,2.396,767,22679.0,0,48.087774,-0.335578,7.12743,4.597701,7.471264,48.0,14.0,20.990874,55.149935,23.859192,2.216428,1.060116,2.479339,3.305785,12.396694,15.702479,16.528926,20.661157,23.966942,4.958678
1,1002,L'Abergement-de-Varey,2.721,243,24382.083333,0,26.557377,0.757662,6.944444,30.769231,9.467456,57.894737,15.789474,22.633745,55.555556,21.8107,2.057613,1.761006,0.0,10.25641,7.692308,12.820513,20.512821,5.128205,33.333333,10.25641
2,1004,Ambérieu-en-Bugey,4.335,14081,19721.0,0,572.398374,0.347315,12.038385,1.684887,9.223702,67.838444,17.950636,19.82313,57.904337,22.272533,1.516341,1.595989,0.024879,2.662394,6.93941,17.209926,16.240671,15.94093,24.740051,16.241738
3,1005,Ambérieux-en-Dombes,4.279,1671,23378.0,0,104.962312,0.872154,6.34866,1.810755,4.979578,55.319149,10.638298,20.521782,58.339888,21.13833,0.985957,1.235096,0.378011,4.511481,7.896554,17.27101,18.019503,17.254154,23.429304,11.239984
4,1006,Ambléon,0.912,110,21660.0,0,18.707483,-0.359722,11.111111,16.216216,12.162162,71.428571,28.571429,10.909091,54.545455,34.545455,2.727273,1.621622,0.0,0.0,5.555556,27.777778,16.666667,16.666667,27.777778,5.555556


In [36]:
# Saving the cleaned dataframe
final_df.to_csv('../data/medical_desert_clean.csv',index=False)
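
When a saved file like this one is read back, a purely numeric `CODGEO` column is parsed as integers by default and the leading zeros are lost again. A minimal sketch using an in-memory buffer instead of the real file (the values are illustrative); passing `dtype` for the key column is one way to keep the codes as strings:

```python
import io
import pandas as pd

saved = pd.DataFrame({"CODGEO": ["01001", "13201"], "APL": [2.396, 3.1]})

buffer = io.StringIO()
saved.to_csv(buffer, index=False)

# Default parsing infers an integer column: "01001" comes back as 1001
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Forcing the dtype keeps the 5-character codes intact
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"CODGEO": str})

print(default_read["CODGEO"].tolist(), string_read["CODGEO"].tolist())
```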