<h1 style="text-align: center;">Pre-processing data about the <em>communes</em></h1>

<em><strong>Note:</strong><br>
This notebook is part of a project about the French counties and house prices. Please read the <a href="https://github.com/Ashish-3/House-prices-in-France/blob/master/Readme.md">readme</a> file for more information.</em>



In this notebook we will pre-process three datasets to collect the following information about the communes:
- geography
- demography
- standard of living

This data will be used later in the data analysis step.

## Importing libraries

In [2]:
import numpy as np
import pandas as pd 
from statsmodels.stats.weightstats import DescrStatsW # For dealing with weighted stats

# Uncomment to show all the rows and columns of a DataFrame
#pd.set_option('display.max_rows', 500)
#pd.set_option('display.max_columns', 500)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import seaborn as sns

import time

## Importing and pre-processing "geo-data"

In [3]:
# Import CSV data
communes= pd.read_csv(r'data/correspondance-code-insee-code-postal.csv', 
                      usecols= ['Code INSEE',
                               'Code Postal',
                               'Commune',
                               'Superficie',
                               'geo_point_2d',
                               'Code Département',
                                'Code Région'],
                      sep=';')
print(communes.shape)
communes.head()

(36742, 7)


Unnamed: 0,Code INSEE,Code Postal,Commune,Superficie,geo_point_2d,Code Département,Code Région
0,59276,59287,GUESNAIN,405.0,"50.3483461671,3.14816711364",59,31
1,88128,88210,DENIPAIRE,702.0,"48.3398168543,6.96189290978",88,41
2,57538,57170,PETTONCOURT,493.0,"48.7881263595,6.41106883774",57,41
3,54459,54630,RICHARDMENIL,707.0,"48.5958649273,6.17614506031",54,41
4,35022,35190,BECHEREL,55.0,"48.2965087434,-1.9428236566",35,53


In [3]:
# Convert the area of the communes from hectares to km2 (1 km2 = 100 ha)
communes['Superficie_km2']=communes['Superficie']/100
communes.drop(['Superficie'], axis=1, inplace=True)

# Splitting the coordinates into two columns 'lat' and 'lng'
communes[['lat','lng']]=communes.geo_point_2d.str.split(',', expand=True)
communes.drop(['geo_point_2d'], axis=1, inplace=True)
communes.head(3)

Unnamed: 0,Code INSEE,Code Postal,Commune,Code Département,Code Région,Superficie_km2,lat,lng
0,59276,59287,GUESNAIN,59,31,4.05,50.3483461671,3.14816711364
1,88128,88210,DENIPAIRE,88,41,7.02,48.3398168543,6.96189290978
2,57538,57170,PETTONCOURT,57,41,4.93,48.7881263595,6.41106883774
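
Note that `str.split` returns strings, so `lat` and `lng` are text columns at this point (which is why `describe` reports `unique` and `top` for them further down). If numeric coordinates are needed later, for distance computations or maps, one possible approach is to cast them right after the split. The sketch below uses a small toy frame (`toy` is a hypothetical name) rather than the real `communes` data:

```python
import pandas as pd

# Toy frame with the same 'geo_point_2d' layout as the communes data
toy = pd.DataFrame({'geo_point_2d': ['50.3483461671,3.14816711364',
                                     '48.2965087434,-1.9428236566']})

# str.split returns strings, so cast both new columns to float
toy[['lat', 'lng']] = toy['geo_point_2d'].str.split(',', expand=True).astype(float)
toy = toy.drop(columns='geo_point_2d')

print(toy.dtypes)
```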


In [4]:
communes.describe(include='all')

Unnamed: 0,Code INSEE,Code Postal,Commune,Code Département,Code Région,Superficie_km2,lat,lng
count,36742.0,36742.0,36742,36742.0,36742.0,36742.0,36742.0,36742.0
unique,36742.0,6101.0,34130,97.0,,,36742.0,36742.0
top,13062.0,51300.0,SAINTE-COLOMBE,62.0,,,43.5270584933,-0.573849021265
freq,1.0,46.0,14,895.0,,,1.0,1.0
mean,,,,,49.431141,17.357863,,
std,,,,,25.472893,144.47828,,
min,,,,,1.0,0.02,,
25%,,,,,25.0,6.45,,
50%,,,,,43.0,10.81,,
75%,,,,,73.0,18.5,,


## Importing and pre-processing age and population data

In [4]:
# Import CSV data
population= pd.read_csv(r'data/t-popmun-2016-com.csv', encoding = "ISO-8859-1",
                        usecols=['com_code','com_type','popmun_age','popmun_sexe','popmun_nb'],
                        dtype={'com_code':'object',
                               'com_type':'object',
                               'popmun_age':'float64',
                               'popmun_sexe':'object',
                               'popmun_nb':'float64'},
                       )
print(population.shape[0])
population.head()

7069596


Unnamed: 0,com_code,com_type,popmun_age,popmun_sexe,popmun_nb
0,1001,COM,0.0,F,3.0
1,1001,COM,0.0,M,13.0
2,1001,COM,1.0,F,5.0
3,1001,COM,1.0,M,1.0
4,1001,COM,2.0,F,6.0
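
INSEE commune codes are normally five characters long, while the `com_code` values displayed above (for example `1001`) have four. This may only be a display effect, but if leading zeros were actually lost in a source file, the codes would no longer match those of the other datasets when merging. A minimal sketch of one possible repair, assuming the codes are stored as strings (the sample values are illustrative only):

```python
import pandas as pd

# Illustrative codes: one that lost its leading zero, one intact, one Corsican
codes = pd.Series(['1001', '59276', '2A004'], name='com_code')

# Left-pad with zeros up to 5 characters; codes that are already 5 long are unchanged
codes_padded = codes.str.zfill(5)

print(codes_padded.tolist())
```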


Creating a multi-index and summing the male and female figures together:

In [63]:
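# Select 'popmun_nb' explicitly so that the text columns are not aggregated by recent pandas versions
grouped_pop=population.groupby(['com_code','popmun_age'])[['popmun_nb']].sum()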
grouped_pop.head(10)

com_code,popmun_age,popmun_nb
1001,0.0,16.0
1001,1.0,6.0
1001,2.0,7.0
1001,3.0,5.0
1001,4.0,6.0
1001,5.0,7.0
1001,6.0,9.0
1001,7.0,8.0
1001,8.0,10.0
1001,9.0,11.0


Let's use the statsmodels package to get the mean and median age for each commune. Its `DescrStatsW` class can deal with weighted observations:

In [64]:
# Initialize variables
start = time.time()
age_stats={}

for com in grouped_pop.index.get_level_values(level='com_code').unique() : # Loop over all the commune codes
    wq = DescrStatsW(data=grouped_pop.loc[com].index, weights=grouped_pop.loc[com].values ) # Weighted statistics: the ages are weighted by the head count
    age_stats[com] = [wq.mean,wq.std,wq.quantile(0.5, return_pandas = False)[0] ,wq.nobs ] # Store the mean, standard deviation, median and population in the age_stats dictionary

# Transform the dictionary into a proper DataFrame
com_age=pd.DataFrame.from_dict(age_stats, orient='index',columns=['age_mean', 'age_std', 'age_median', 'population'])
com_age.reset_index(inplace=True)
com_age.rename(columns={'index': "Code INSEE"}, inplace=True)

end = time.time()
temps=end - start
print('Processing time:',temps, 'seconds')

  return self.sum / self.sum_weights


Processing time: 147.2564263343811 seconds


In [6]:
print(com_age.shape)
com_age.head()

(34998, 5)


Unnamed: 0,Code INSEE,age_mean,age_std,age_median,population
0,1001,40.814863,23.714339,44.0,767.0
1,1002,38.740741,23.894028,38.0,243.0
2,1004,38.397455,23.933946,37.0,14081.0
3,1005,39.125199,23.160941,41.0,1671.0
4,1006,47.0,22.27269,49.0,110.0
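
The loop above took about 147 seconds. The weighted mean and standard deviation can also be obtained without a Python loop, from three grouped sums per commune. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it assumes the population standard deviation (`ddof=0`, the `DescrStatsW` default), leaves the weighted median out, and runs on a small toy stand-in for `grouped_pop` (`toy_pop`, `flat`, `sums` and `toy_age` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Toy stand-in for grouped_pop: head counts indexed by (com_code, popmun_age)
toy_pop = pd.DataFrame(
    {'popmun_nb': [1.0, 2.0, 1.0, 3.0, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [('A', 0.0), ('A', 1.0), ('A', 2.0), ('B', 10.0), ('B', 20.0)],
        names=['com_code', 'popmun_age']))

flat = toy_pop.reset_index()
flat['w_age'] = flat['popmun_age'] * flat['popmun_nb']         # weight * age
flat['w_age2'] = flat['popmun_age'] ** 2 * flat['popmun_nb']   # weight * age^2

# One pass over the data: total weight, sum(w * age) and sum(w * age^2) per commune
sums = flat.groupby('com_code')[['popmun_nb', 'w_age', 'w_age2']].sum()

toy_age = pd.DataFrame({
    'age_mean': sums['w_age'] / sums['popmun_nb'],
    'population': sums['popmun_nb'],
})
# Population variance: E[age^2] - E[age]^2, clipped at 0 to absorb rounding errors
variance = (sums['w_age2'] / sums['popmun_nb'] - toy_age['age_mean'] ** 2).clip(lower=0)
toy_age['age_std'] = np.sqrt(variance)

print(toy_age)
```

On the real data, `toy_pop` would be replaced by `grouped_pop`; it would be prudent to compare the result with `com_age` on a few communes before relying on it.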


In [5]:
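#com_age.to_csv(r'com_age.csv')
# Reload the cached result; read 'Code INSEE' as text so that it can be compared and merged with the other string codes
com_age= pd.read_csv(r'com_age.csv', index_col=0, dtype={'Code INSEE': str})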

Some communes returned NaN for age_mean and age_std. This is because some communes are uninhabited; let's confirm this:

In [67]:
# Collect the codes of the communes that have a NaN in any of the statistics columns
stat_columns = com_age.columns[1:]
uninhabited_com = com_age.loc[com_age[stat_columns].isna().any(axis=1), 'Code INSEE'].tolist()
print('These commune codes have missing statistics and no population:',
      "\n", uninhabited_com)

# Select the rows whose 'Code INSEE' is one of the values of the uninhabited_com list
com_age[com_age['Code INSEE'].isin(uninhabited_com)]

These commune codes have missing statistics and no population: 
 ['55039', '55050', '55139', '55189', '55239', '55307']


Unnamed: 0,Code INSEE,age_mean,age_std,age_median,population
20075,55039,,,0.5,0.0
20086,55050,,,0.5,0.0
20161,55139,,,0.5,0.0
20204,55189,,,0.5,0.0
20241,55239,,,0.5,0.0
20300,55307,,,0.5,0.0
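
The NaN values come from a division of zero by zero: with a total weight (population) of 0, the weighted mean `sum / sum_weights` is undefined, which is also what the warning printed during the loop points to. A minimal sketch of the mechanism on toy data (the codes and the `age_sum` column are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Code INSEE': ['00001', '00002'],
                    'age_sum': [2400.0, 0.0],       # sum of age * head count
                    'population': [60.0, 0.0]})

# Weighted mean = sum(age * count) / sum(count); 0.0 / 0.0 gives NaN in pandas
toy['age_mean'] = toy['age_sum'] / toy['population']

# Testing the population directly is an equivalent way to find the empty communes
empty = toy.loc[toy['population'] == 0, 'Code INSEE'].tolist()
print(toy)
print(empty)
```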


## Importing and pre-processing household income data

In [5]:
# Import CSV data
revenues= pd.read_csv(r'data/cc_filosofi_2017_COM.CSV' , 
                      usecols=['CODGEO','NBPERSMENFISC17','NBMENFISC17','MED17','RD17'],
                      sep=';', 
                      dtype= {'CODGEO':'object',
                             'NBPERSMENFISC17':'float64',
                              'NBMENFISC17':'float64',
                              'MED17':'float64',
                              'RD17':'float64'})

revenues.rename(columns={'CODGEO':'Code INSEE',
                        'NBMENFISC17':'Fhousehold',
                        'NBPERSMENFISC17':'person_p_Fhousehold',
                        'MED17':'revenue_median',
                        'RD17':'revenue_inequalities'},
                inplace=True)

print(revenues.shape)
revenues.head()

(34931, 5)


Unnamed: 0,Code INSEE,Fhousehold,person_p_Fhousehold,revenue_median,revenue_inequalities
0,1001,317.0,802.0,23310.0,
1,1002,107.0,258.0,24290.0,
2,1004,6505.0,14567.0,19860.0,3.2
3,1005,649.0,1700.0,23370.0,
4,1006,49.0,106.0,23970.0,


In [6]:
# Read and display the description of the columns
pd.read_csv(r'data/meta_cc_filosofi_2017_COM.CSV' , sep=';', dtype=str).head(28)

Unnamed: 0,COD_VAR,LIB_VAR,LIB_VAR_LONG,COD_MOD,LIB_MOD,TYPE_VAR,LONG_VAR
0,NBMENFISC17,Nombre de ménages fiscaux,Nombre de ménages fiscaux,,,NUM,7
1,NBPERSMENFISC17,Nombre de personnes dans les ménages fiscaux,Nombre de personnes dans les ménages fiscaux,,,NUM,7
2,MED17,Médiane du niveau de vie (€),Médiane du niveau de vie (€),,,NUM,5
3,PIMP17,Part des ménages fiscaux imposés (%),Part des ménages fiscaux imposés (%),,,NUM,4
4,TP6017,Taux de pauvreté-Ensemble (%),Taux de pauvreté-Ensemble (%),,,NUM,4
5,TP60AGE117,Taux de pauvreté-moins de 30 ans (%),Taux de pauvreté des personnes dans les ménage...,,,NUM,4
6,TP60AGE217,Taux de pauvreté-30 à 39 ans (%),Taux de pauvreté des personnes dans les ménage...,,,NUM,4
7,TP60AGE317,Taux de pauvreté-40 à 49 ans (%),Taux de pauvreté des personnes dans les ménage...,,,NUM,4
8,TP60AGE417,Taux de pauvreté-50 à 59 ans (%),Taux de pauvreté des personnes dans les ménage...,,,NUM,4
9,TP60AGE517,Taux de pauvreté-60 à 74 ans (%),Taux de pauvreté des personnes dans les ménage...,,,NUM,4
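
According to the metadata above, `NBPERSMENFISC17` is the total number of persons living in fiscal households, so the renamed column `person_p_Fhousehold` holds a total, not a per-household figure. If an average household size is wanted, it can be derived by dividing the two columns. The sketch below uses the first rows shown earlier as toy data; `household_size` is a hypothetical column name that is not part of the notebook.

```python
import pandas as pd

# First three rows of the revenues table shown above, as toy data
toy_rev = pd.DataFrame({'Code INSEE': ['1001', '1002', '1004'],
                        'Fhousehold': [317.0, 107.0, 6505.0],
                        'person_p_Fhousehold': [802.0, 258.0, 14567.0]})

# Average number of persons per fiscal household
toy_rev['household_size'] = toy_rev['person_p_Fhousehold'] / toy_rev['Fhousehold']

print(toy_rev.round(2))
```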


## Merging the collected DataFrames into a single DataFrame

In [9]:
all_data=pd.merge(communes,com_age,on='Code INSEE', how='outer', indicator=True)
all_data.rename(columns={'_merge':'merge1'}, inplace=True)
all_data=pd.merge(all_data,revenues,on='Code INSEE', how='outer', indicator=True)
all_data.rename(columns={'_merge':'merge2'}, inplace=True)

# Create a 'density' column
all_data['density']=all_data.population/all_data.Superficie_km2

# Reset index after merge
all_data.reset_index(drop=True, inplace=True)
all_data.head()

Unnamed: 0,Code INSEE,Code Postal,Commune,Code Département,Code Région,Superficie_km2,lat,lng,age_mean,age_std,age_median,population,merge1,Fhousehold,person_p_Fhousehold,revenue_median,revenue_inequalities,merge2,density
0,59276,59287,GUESNAIN,59,31.0,4.05,50.3483461671,3.14816711364,40.85186,24.642605,41.0,4651.0,both,1884.0,4624.0,17270.0,2.6,both,1148.395062
1,88128,88210,DENIPAIRE,88,41.0,7.02,48.3398168543,6.96189290978,43.832158,22.810009,48.0,246.0,both,108.0,246.0,21720.0,,both,35.042735
2,57538,57170,PETTONCOURT,57,41.0,4.93,48.7881263595,6.41106883774,40.315972,23.739636,41.0,288.0,both,109.0,298.0,20970.0,,both,58.41785
3,54459,54630,RICHARDMENIL,54,41.0,7.07,48.5958649273,6.17614506031,46.335881,23.049189,51.0,2358.0,both,1020.0,2395.0,25960.0,2.4,both,333.521924
4,35022,35190,BECHEREL,35,53.0,0.55,48.2965087434,-1.9428236566,41.722078,25.14457,42.0,673.0,both,293.0,632.0,19260.0,,both,1223.636364


In [10]:
all_data.to_csv(r'all_data.csv', index=False)
#all_data=pd.read_csv(r'all_data.csv')

In [11]:
# Let's analyse the merging operation and see if all went well
not_complete=all_data[(all_data['merge1']!='both') & (all_data['merge2']!='both')]
print('\n','Shape all_data :' , all_data.shape,
      'communes shape :', communes.shape, '\n',
      'com_age shape :', com_age.shape ,'\n',
      'revenues shape :' ,revenues.shape,'\n',
      'Rows not successfully merged :' ,not_complete.shape,'\n',
      'Rows in communes but not in com_age :', not_complete[not_complete.merge1=='left_only'].shape[0],'\n',
      'Rows in com_age but not in communes:', not_complete[not_complete.merge1=='right_only'].shape[0],'\n',
      'Rows in all_data but not in revenues :', not_complete[not_complete.merge2=='left_only'].shape[0],'\n',
      'Rows in revenues but not in all_data :', not_complete[not_complete.merge2=='right_only'].shape[0],'\n',
      'Rows with no Code INSEE :', all_data['Code INSEE'].isna().sum()
     )



 Shape all_data : (36748, 19) communes shape : (36742, 8) 
 com_age shape : (34998, 5) 
 revenues shape : (34931, 5) 
 Rows not successfully merged : (1750, 19) 
 Rows in communes but not in com_age : 1750 
 Rows in com_age but not in communes: 0 
 Rows in all_data but not in revenues : 1750 
 Rows in revenues but not in all_data : 0 
 Rows with no Code INSEE : 0


We can conclude that the merge went well: the age and revenue statistics were added to the commune DataFrame. 1,750 rows of the commune DataFrame received neither age nor revenue statistics. This can be explained, at least in part, by the fact that some of those communes are in French overseas territories and were therefore missing from the initial revenues and com_age DataFrames. Note that all_data has six more rows than communes (36,748 versus 36,742); because the counts above are computed on not_complete only, they do not show where these rows come from.
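
A complementary check is to count the merge indicators on the full merged frame rather than on a filtered subset, so that rows matched in only one of the two merges also show up. The sketch below illustrates the idea on three toy frames (`left`, `mid` and `right` are hypothetical stand-ins for `communes`, `com_age` and `revenues`); it passes the column name directly to `indicator`, which avoids the separate rename step.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
mid = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})
right = pd.DataFrame({'key': ['a', 'd', 'e'], 'z': [100, 400, 500]})

# indicator accepts a column name, so each merge gets its own indicator column
merged = pd.merge(left, mid, on='key', how='outer', indicator='merge1')
merged = pd.merge(merged, right, on='key', how='outer', indicator='merge2')

# Count every outcome on the full frame; dropna=False also counts the rows
# that only entered during the second merge (their merge1 value is missing)
print(merged['merge1'].value_counts(dropna=False))
print(merged['merge2'].value_counts(dropna=False))
```

Here key `d` is absent from `left` but present in both other frames, which is exactly the kind of row that a filter requiring both indicators to differ from `'both'` would leave out.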