### DATA FROM CATASTRO INSPIRE
___________________________________________________________________________________________________

Obtained through QGIS after experimenting for a while with:
- WMS service: http://ovc.catastro.meh.es/cartografia/INSPIRE/spadgcwms.aspx
- WFS buildings: http://ovc.catastro.meh.es/INSPIRE/wfsBU.aspx?
- All INSPIRE services: http://www.catastro.minhap.es/webinspire/index.html

*Note: the WMS/WFS channels return the layers as they were before harmonization with the European INSPIRE programme, which may be why they do not work well.
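One way to see which layers the WFS endpoint advertises is to request its capabilities document. The sketch below only builds the request URL with the standard OGC parameters `service=WFS` and `request=GetCapabilities`; the helper name `build_capabilities_url` is hypothetical, and whether the Catastro server answers this request as expected has not been checked here.

```python
from urllib.parse import urlencode

# Base endpoint taken from the list above (the trailing "?" is stripped below)
WFS_BUILDINGS_URL = "http://ovc.catastro.meh.es/INSPIRE/wfsBU.aspx?"


def build_capabilities_url(base_url: str) -> str:
    """Return a WFS GetCapabilities URL for the given endpoint (hypothetical helper)."""
    params = {"service": "WFS", "request": "GetCapabilities"}
    return f"{base_url.rstrip('?')}?{urlencode(params)}"


capabilities_url = build_capabilities_url(WFS_BUILDINGS_URL)
print(capabilities_url)
```

The resulting URL can be opened in a browser or passed to QGIS to compare the advertised layers with the downloaded files.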

Several GitHub projects focus on querying the Catastro with Python:
- **[pyCatastro](https://github.com/gisce/pycatastro)**: lets you run queries in API format.
>>- Does not allow bulk download of data by municipality and province
>>- Returns dictionaries
>>- To download Madrid, I would have to fetch (1) all streets and (2) all street-type abbreviations and then iterate, making a huge number of calls (not a good idea)

- **[catastro-lib-python](https://github.com/sperea/catastro-lib-python)**: appears to be a predecessor of the previous one. Not tested, since it stopped being updated in 2018
- **[Catastro Inspire Downloader](https://github.com/geomatico/cidownloader)**: made by the same cartographers and developers as the QGIS plugin; it downloads data in .geopackage format
>>- Allows downloading by province, municipality and projection
>>- It is rather inconsistent: when downloading Madrid it sometimes returns only CadastralParcel or BuildingParts data. Compared with the GMLs included in the ZIPs that can be downloaded separately, the data are incomplete. This is possibly a problem with the GDAL library.


In [1]:
import pandas as pd
import numpy as np

import geopandas as gpd
import geojson
import folium

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

pd.set_option('display.max_rows', 1000)

### AVAILABLE LAYERS

Given the volume of data, the inspection is carried out on a subset of each layer

In [2]:
CATASTRO_PATH = '../data/raw/catastro'

# The available data includes 5 layers in geojson format (3 building layers, 2 cadastral parcel layers)

building_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.building.geojson", rows = 25000)
buildingParts_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.buildingpart.geojson", rows = 25000) # pt 1
otherConstruction_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.otherconstruction.geojson", rows = 25000)

cadastralParcel = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.CP.28900.cadastralparcel.geojson", rows = 25000)
cadastralZoning = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.CP.28900.cadastralzoning.geojson", rows = 25000)
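The five file names above follow the same pattern, so the paths can be generated instead of typed out. This is a minimal sketch, assuming the pattern `A.ES.SDGC.<theme>.<municipality>.<layer>.geojson` inferred from those names; `LAYER_THEMES` and `layer_path` are hypothetical names introduced for illustration.

```python
from pathlib import Path

CATASTRO_PATH = "../data/raw/catastro"

# Layer name -> INSPIRE theme code, as seen in the file names above
LAYER_THEMES = {
    "building": "BU",
    "buildingpart": "BU",
    "otherconstruction": "BU",
    "cadastralparcel": "CP",
    "cadastralzoning": "CP",
}


def layer_path(root: str, municipality: str, layer: str) -> Path:
    """Build the path of one layer file (hypothetical helper)."""
    theme = LAYER_THEMES[layer]
    return Path(root) / f"A.ES.SDGC.{theme}.{municipality}.{layer}.geojson"


paths = {layer: layer_path(CATASTRO_PATH, "28900", layer) for layer in LAYER_THEMES}
for layer, path in paths.items():
    print(layer, path.as_posix())
```

Each path could then be passed to `gpd.read_file(path, rows=25000)` exactly as in the cell above.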

Memory usage check

In [3]:
print(f"Building Layer total memory usage: \t\t\t{building_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Building Parts Layer total memory usage: \t\t{buildingParts_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Other Construction Layer total memory usage: \t\t{otherConstruction_df.memory_usage(index=True).sum()/1000} \t\tKbytes")

print(f"Cadastral Parcel Layer total memory usage: \t\t{cadastralParcel.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Cadastral Zoning Layer total memory usage: \t\t{cadastralZoning.memory_usage(index=True).sum()/1000} \tKbytes")

Building Layer total memory usage: 			4825.128 	Kbytes
Building Parts Layer total memory usage: 		2625.128 	Kbytes
Other Construction Layer total memory usage: 		774.16 		Kbytes
Cadastral Parcel Layer total memory usage: 		2200.128 	Kbytes
Cadastral Zoning Layer total memory usage: 		1269.136 	Kbytes
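The figures above come from `memory_usage` without `deep=True`, which for object columns counts only the pointers and not the strings they refer to. The sketch below compares both measurements on a toy frame; `memory_report_kb` and the toy data are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd


def memory_report_kb(frames: dict) -> pd.DataFrame:
    """Return shallow and deep memory usage in kB per layer (hypothetical helper)."""
    rows = []
    for name, frame in frames.items():
        shallow = frame.memory_usage(index=True).sum() / 1000
        deep = frame.memory_usage(index=True, deep=True).sum() / 1000
        rows.append({"layer": name, "shallow_kB": shallow, "deep_kB": deep})
    return pd.DataFrame(rows).set_index("layer")


# Toy stand-in for a cadastral layer; the id column is forced to object dtype
toy_layer = pd.DataFrame({
    "localId": pd.Series(["000200100VK48E", "000200500VK56E"], dtype=object),
    "numberOfDwellings": [0, 1],
})

report = memory_report_kb({"toy": toy_layer})
print(report)
```

For layers dominated by string identifiers the deep figure is the more realistic one, at the cost of a slower computation.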


## 1.1 Building Layer

In QGIS this layer represents the built-up part of the plots.

In [4]:
building_df.shape

(25000, 25)

In [5]:
building_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 25 columns):
 #   Column                                   Non-Null Count  Dtype   
---  ------                                   --------------  -----   
 0   gml_id                                   25000 non-null  object  
 1   beginLifespanVersion                     25000 non-null  object  
 2   conditionOfConstruction                  25000 non-null  object  
 3   beginning                                25000 non-null  object  
 4   end                                      25000 non-null  object  
 5   endLifespanVersion                       13 non-null     object  
 6   informationSystem                        25000 non-null  object  
 7   reference                                25000 non-null  object  
 8   localId                                  25000 non-null  object  
 9   namespace                                25000 non-null  object  
 10  horizontalGeometryEstimate

In [6]:
building_df.isna().sum()

gml_id                                         0
beginLifespanVersion                           0
conditionOfConstruction                        0
beginning                                      0
end                                            0
endLifespanVersion                         24987
informationSystem                              0
reference                                      0
localId                                        0
namespace                                      0
horizontalGeometryEstimatedAccuracy            0
horizontalGeometryEstimatedAccuracy_uom        0
horizontalGeometryReference                    0
referenceGeometry                              0
currentUse                                   279
numberOfBuildingUnits                          0
numberOfDwellings                              0
numberOfFloorsAboveGround                  25000
documentLink                                   0
format                                         0
sourceStatus        

**GETTING RID OF USELESS COLUMNS**

In [7]:
# DROP NULL COLUMNS
# There are a couple of columns that do not offer any information

building_nullCols = ['numberOfFloorsAboveGround', 'endLifespanVersion']
building_linkCols = ['documentLink', 'format', 'informationSystem']
building_measureCols = ['horizontalGeometryEstimatedAccuracy', 'value_uom', 'sourceStatus',
                        'horizontalGeometryEstimatedAccuracy_uom', 'horizontalGeometryReference', 
                        'officialAreaReference']

building_df.drop(building_nullCols,    axis=1, inplace = True)
building_df.drop(building_linkCols,    axis=1, inplace = True)
building_df.drop(building_measureCols, axis=1, inplace = True)

# horizontalGeometryEstimatedAccuracy is always 0.1m accuracy
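The columns dropped above were picked by hand from the `isna()` and `info()` outputs. A minimal sketch of how candidates could be found automatically is given below, assuming that "useless" means either entirely null or holding a single repeated value; `uninformative_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def uninformative_columns(df: pd.DataFrame, skip=("geometry",)):
    """Return (all-null columns, constant columns), ignoring the columns in `skip`."""
    candidates = [col for col in df.columns if col not in skip]
    all_null = [col for col in candidates if df[col].isna().all()]
    constant = [
        col for col in candidates
        if col not in all_null and df[col].nunique(dropna=False) == 1
    ]
    return all_null, constant


# Toy frame mimicking the kinds of columns found in the building layer
toy = pd.DataFrame({
    "localId": ["000200100VK48E", "000200500VK56E", "000205600VK56E"],
    "numberOfFloorsAboveGround": [None, None, None],
    "format": ["text/html", "text/html", "text/html"],
})

all_null, constant = uninformative_columns(toy)
print(all_null, constant)
```

The geometry column is skipped because counting unique geometries is not what this check is meant for.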

In [8]:
# CORRECTING GROSS FLOOR AREA
# officialAreaReference - value - value_uom are columns that refer to the same information

building_df.rename(columns={"value": "grossFloorArea"}, inplace = True)
# building_df.drop(['officialAreaReference'], axis=1, inplace = True)

In [9]:
# SIMPLIFYING DATE COLUMNS
# Differences between them might be due to the gap between registration and CFO.
# The end column makes no sense in the context of this project

building_dateCols = ['beginLifespanVersion', 'beginning', 'end']

# ARE beginning and end the same?

# Vectorized comparison (same result as a row-wise apply, but faster)
building_df['Equal_beg_end'] = building_df['beginning'] == building_df['end']
building_df['Equal_beg_end'].value_counts()

True     24166
False      834
Name: Equal_beg_end, dtype: int64

Let's look at the rest of the DATE columns.
The [reference documentation](http://www.catastro.minhap.es/webinspire/documentos/Conjuntos%20de%20datos.pdf) states:
>- **beginLifespanVersion**: date on which the record was registered in the cadastral database.
>- **dateOfConstruction**: structure that defines the construction date. It consists of two attributes, **bu-core2d:beginning and bu-core2d:end**; the values are the construction dates of each construction unit, and if there is more than one, **"beginning" holds the oldest and "end" the most recent**. They always refer to 1 January.

**______**
For this project, only the beginning column is of interest

In [10]:
# It is not clear what this means; these could be buildings that were demolished or abandoned
# Let's look at the first row, reference = 000207800VK56E
# Sede Electronica del Catastro says that this building was built in 2004, the *end* year
# The most plausible cause is that the building is abandoned, under construction, or in a reconstruction phase

# If so, conditionOfConstruction should be different
# Let's check that

# Count the condition of every building whose beginning and end dates differ
mismatch_conditions = building_df.loc[~building_df['Equal_beg_end'], 'conditionOfConstruction'].value_counts()

c_functional = mismatch_conditions.get('functional', 0)
c_declined = mismatch_conditions.get('declined', 0)
c_ruin = mismatch_conditions.get('ruin', 0)

print("FOR BUILDINGS WITH THE DIFFERENT BEGINNING AND END DATES. CONDITION OF CONSTRUCTION")
print(f"Functional : \t{c_functional}")
print(f"Declined : \t{c_declined}")
print(f"Ruins : \t{c_ruin}")

# There should be another category:
# buildings that may have been given a new cadastral reference for unknown reasons

FOR BUILDINGS WITH THE DIFFERENT BEGINNING AND END DATES. CONDITION OF CONSTRUCTION
Functional : 	833
Declined : 	0
Ruins : 	1
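The same breakdown can be obtained for matching and non-matching dates at once with a cross-tabulation. The sketch below uses a small made-up frame with the same column names; the labels "same" and "different" are introduced only for readability.

```python
import pandas as pd

# Toy rows using the column names of the building layer (values are made up)
toy = pd.DataFrame({
    "beginning": ["1995-01-01T00:00:00", "1990-01-01T00:00:00",
                  "2000-01-01T00:00:00", "1968-01-01T00:00:00"],
    "end": ["1995-01-01T00:00:00", "2004-01-01T00:00:00",
            "2000-01-01T00:00:00", "1970-01-01T00:00:00"],
    "conditionOfConstruction": ["declined", "functional", "-", "ruin"],
})

# Label each row by whether its two construction dates coincide
date_match = (toy["beginning"] == toy["end"]).map({True: "same", False: "different"})

# Rows: date match; columns: condition of construction
condition_by_match = pd.crosstab(date_match, toy["conditionOfConstruction"])
print(condition_by_match)
```

On the real layer this would show in one table whether the "different" group is dominated by functional buildings, as the counts above suggest.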


In [11]:
# LETS SEE CONDITION OF CONSTRUCTION COLUMN
building_df['conditionOfConstruction'].value_counts()

functional    24592
-               248
declined        109
ruin             51
Name: conditionOfConstruction, dtype: int64

In [12]:
# LET'S ADD another category for demolished buildings that reflects beginning != end
# therefore, buildings where construction has taken place are usually functional buildings
# 
# As stated above, only the BEGINNING column matters for the purposes of this project

building_df.drop(['Equal_beg_end', 'end', 'beginLifespanVersion'], axis = 1, inplace = True)

In [13]:
import datetime as dt

building_df['dateOfConstruction'] = building_df['beginning'].apply(lambda x: 
                                    dt.datetime.strptime(x,'%Y-%m-%dT%H:%M:%S'))

# Out of bounds nanosecond timestamp: 1640-01-01 00:00:00
# pandas nanosecond timestamps only cover 1677-09-21 to 2262-04-11,
# so, given the nature of the data, pandas datetime methods cannot be used in this case

def get_yearofConstruction(strng):
    """
    Input:  string
    Output: year as string
    """
    first_w = strng.split('T')[0]
    # Not using datetime from pandas, not valid for this case
    return first_w.split('-')[0]

building_df['yearOfConstruction'] = building_df['beginning'].apply(get_yearofConstruction)
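Python's own `datetime` does not share the pandas nanosecond limit, so the year can also be parsed (rather than split out of the string) and returned as an integer. A minimal sketch, assuming every value follows the `%Y-%m-%dT%H:%M:%S` layout used above; `year_of_construction` is a hypothetical alternative to `get_yearofConstruction`.

```python
from datetime import datetime


def year_of_construction(timestamp: str) -> int:
    """Parse an ISO-like timestamp and return its year as an integer."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S").year


# 1640 is outside the pandas nanosecond range but fine for the standard library
years = [year_of_construction(ts) for ts in ("1640-01-01T00:00:00", "2004-01-01T00:00:00")]
print(years)
```

Parsing also raises a `ValueError` on malformed strings, which the plain `split` version would silently accept.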

In [14]:
building_df.drop(['beginning', 'dateOfConstruction'], axis = 1, inplace = True)

**____________________________________________________________________________________________________________**

In [15]:
# SOME COLUMNS MAY DUPLICATE INFORMATION
# To join with the rest of the data, parcels, building parts, etc. must share an ID
# Which column is the ID?

cols_id = ['gml_id', 'reference', 'localId', 'namespace']

# Is reference == localId?
# Check whether values are unique or repeated (this has to be contrasted with the rest of the geojson files)

building_df[cols_id].describe()

# All elements are unique (as expected)
# THEY SHOULD COINCIDE WITH THE PARCELS
# gml_id is left for later

| | gml_id | reference | localId | namespace |
|---|---|---|---|---|
| count | 25000 | 25000 | 25000 | 25000 |
| unique | 25000 | 25000 | 25000 | 1 |
| top | ES.SDGC.BU.0068401VK4606G | 1006105VK5810E | 1006105VK5810E | ES.SDGC.BU |
| freq | 1 | 1 | 1 | 25000 |


In [16]:
# Are reference and localId the same?
(building_df['reference'] == building_df['localId']).value_counts() # All True
# What is referenceGeometry?
building_df['referenceGeometry'].value_counts() # All True --> dropping

# Dropping reference, namespace and referenceGeometry
building_df.drop(['reference', 'namespace', 'referenceGeometry'], axis = 1, inplace = True)

In [17]:
building_df.head(10)

| | gml_id | conditionOfConstruction | localId | currentUse | numberOfBuildingUnits | numberOfDwellings | grossFloorArea | geometry | yearOfConstruction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ES.SDGC.BU. VK4700H | - | VK4700H | | 0 | 0 | 0 | POLYGON ((440433.629 4470953.961, 440432.170 4... | 2000 |
| 1 | ES.SDGC.BU.00006Z8VK4800A | - | 00006Z8VK4800A | | 0 | 0 | 0 | POLYGON ((440020.830 4479859.707, 440020.064 4... | 2000 |
| 2 | ES.SDGC.BU.000200100VK48E | functional | 000200100VK48E | 2_agriculture | 1 | 0 | 26 | POLYGON ((441657.574 4487050.292, 441658.164 4... | 1968 |
| 3 | ES.SDGC.BU.000200500VK56E | declined | 000200500VK56E | 3_industrial | 1 | 0 | 646 | MULTIPOLYGON (((451584.920 4467181.410, 451586... | 1995 |
| 4 | ES.SDGC.BU.000205600VK56E | declined | 000205600VK56E | 3_industrial | 1 | 0 | 202 | POLYGON ((451579.400 4467169.610, 451582.370 4... | 1995 |
| 5 | ES.SDGC.BU.000205700VK56E | declined | 000205700VK56E | 3_industrial | 1 | 0 | 105 | POLYGON ((451576.440 4467163.260, 451576.930 4... | 1995 |
| 6 | ES.SDGC.BU.000205800VK56E | declined | 000205800VK56E | 3_industrial | 1 | 0 | 120 | POLYGON ((451573.050 4467156.010, 451576.440 4... | 1995 |
| 7 | ES.SDGC.BU.000205900VK56E | declined | 000205900VK56E | 3_industrial | 1 | 0 | 120 | POLYGON ((451569.670 4467148.760, 451570.230 4... | 1995 |
| 8 | ES.SDGC.BU.000206000VK56E | declined | 000206000VK56E | 3_industrial | 1 | 0 | 120 | POLYGON ((451566.290 4467141.520, 451569.670 4... | 1995 |
| 9 | ES.SDGC.BU.000206100VK56E | declined | 000206100VK56E | 3_industrial | 1 | 0 | 177 | POLYGON ((451560.380 4467128.720, 451561.530 4... | 1995 |


**____________________________________________________________________________________________________________**

**____________________________________________________________________________________________________________**

## 1.2 Building Parts Layer

In [18]:
buildingParts_df.shape

(25000, 14)

In [19]:
buildingParts_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 14 columns):
 #   Column                                   Non-Null Count  Dtype   
---  ------                                   --------------  -----   
 0   gml_id                                   25000 non-null  object  
 1   beginLifespanVersion                     25000 non-null  object  
 2   conditionOfConstruction                  0 non-null      object  
 3   localId                                  25000 non-null  object  
 4   namespace                                25000 non-null  object  
 5   horizontalGeometryEstimatedAccuracy      25000 non-null  float64 
 6   horizontalGeometryEstimatedAccuracy_uom  25000 non-null  object  
 7   horizontalGeometryReference              25000 non-null  object  
 8   referenceGeometry                        25000 non-null  bool    
 9   numberOfFloorsAboveGround                25000 non-null  int64   
 10  heightBelowGround         

In [20]:
buildingParts_df.isna().sum()

gml_id                                         0
beginLifespanVersion                           0
conditionOfConstruction                    25000
localId                                        0
namespace                                      0
horizontalGeometryEstimatedAccuracy            0
horizontalGeometryEstimatedAccuracy_uom        0
horizontalGeometryReference                    0
referenceGeometry                              0
numberOfFloorsAboveGround                      0
heightBelowGround                              0
heightBelowGround_uom                          0
numberOfFloorsBelowGround                      0
geometry                                       0
dtype: int64

In [21]:
buildingParts_df.describe()

| | horizontalGeometryEstimatedAccuracy | numberOfFloorsAboveGround | heightBelowGround | numberOfFloorsBelowGround |
|---|---|---|---|---|
| count | 25000.0 | 25000.0 | 25000.0 | 25000.0 |
| mean | 0.1 | 3.778 | 2.27088 | 0.75696 |
| std | 1.3878070000000002e-17 | 2.747504 | 2.871751 | 0.95725 |
| min | 0.1 | 0.0 | 0.0 | 0.0 |
| 25% | 0.1 | 1.0 | 0.0 | 0.0 |
| 50% | 0.1 | 3.0 | 0.0 | 0.0 |
| 75% | 0.1 | 6.0 | 3.0 | 1.0 |
| max | 0.1 | 23.0 | 30.0 | 10.0 |


In [22]:
buildingParts_df['namespace'].value_counts()

ES.SDGC.BU    25000
Name: namespace, dtype: int64

In [23]:
# DROPPING USELESS COLUMNS
buildingParts_measureCols = ['heightBelowGround_uom','horizontalGeometryEstimatedAccuracy',
                             'horizontalGeometryEstimatedAccuracy_uom', 'horizontalGeometryReference',
                             'referenceGeometry']
buildingParts_nullCols = ['conditionOfConstruction']
buildingParts_repCols = ['namespace']

buildingParts_df.drop(buildingParts_measureCols, axis=1, inplace = True)
buildingParts_df.drop(buildingParts_nullCols,    axis=1, inplace = True)
buildingParts_df.drop(buildingParts_repCols,     axis=1, inplace = True)

In [24]:
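# GETTING YEAR OF CONSTRUCTION
# Note: this layer has no 'beginning' column, so the year comes from beginLifespanVersion,
# i.e. the date the record was registered in the cadastral database, not the construction date
buildingParts_df['yearOfConstruction'] = buildingParts_df['beginLifespanVersion'].apply(get_yearofConstruction)
buildingParts_df.drop(['beginLifespanVersion'], axis = 1, inplace = True)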

In [25]:
# SEPARATING localId from its parts
def get_part(x):
    """Return the part number of a localId, e.g. '000200500VK56E_part2' -> '2'."""
    return x.split('_part')[1]

def get_ID(x):
    """Return the cadastral reference that precedes the first underscore."""
    return x.split('_')[0]

buildingParts_df['locadID_parts'] = buildingParts_df['localId'].apply(get_part).astype(dtype = 'int64')
buildingParts_df['locadID'] = buildingParts_df['localId'].apply(get_ID)
buildingParts_df.drop(['localId'], axis = 1, inplace = True)

In [26]:
buildingParts_df.head(7)

# gml_id still needs cleaning (splitting); since the part number is already in localId, it is not needed

| | gml_id | numberOfFloorsAboveGround | heightBelowGround | numberOfFloorsBelowGround | geometry | yearOfConstruction | locadID_parts | locadID |
|---|---|---|---|---|---|---|---|---|
| 0 | ES.SDGC.BU.000200100VK48E_part1 | 1 | 0 | 0 | POLYGON ((441657.574 4487050.292, 441658.164 4... | 2004 | 1 | 000200100VK48E |
| 1 | ES.SDGC.BU.000200500VK56E_part1 | 1 | 0 | 0 | POLYGON ((451584.920 4467181.410, 451586.250 4... | 2013 | 1 | 000200500VK56E |
| 2 | ES.SDGC.BU.000200500VK56E_part2 | 1 | 0 | 0 | POLYGON ((451612.580 4467215.370, 451618.650 4... | 2013 | 2 | 000200500VK56E |
| 3 | ES.SDGC.BU.000205600VK56E_part1 | 1 | 0 | 0 | POLYGON ((451579.400 4467169.610, 451582.370 4... | 2013 | 1 | 000205600VK56E |
| 4 | ES.SDGC.BU.000205700VK56E_part1 | 1 | 0 | 0 | POLYGON ((451576.440 4467163.260, 451576.930 4... | 2013 | 1 | 000205700VK56E |
| 5 | ES.SDGC.BU.000205800VK56E_part1 | 1 | 0 | 0 | POLYGON ((451573.050 4467156.010, 451576.440 4... | 2013 | 1 | 000205800VK56E |
| 6 | ES.SDGC.BU.000205900VK56E_part1 | 1 | 0 | 0 | POLYGON ((451569.670 4467148.760, 451570.230 4... | 2013 | 1 | 000205900VK56E |
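Splitting on fixed characters works for the identifiers shown here, but it fails silently on anything with a different shape. A stricter variant is sketched below, assuming every building-part identifier looks like `<reference>_part<number>`; the pattern `PART_ID` and the function `split_part_id` are hypothetical names.

```python
import re

# <cadastral reference>_part<number>, e.g. 000200500VK56E_part2
PART_ID = re.compile(r"^(?P<reference>[^_]+)_part(?P<part>\d+)$")


def split_part_id(local_id: str):
    """Return (reference, part number) or raise if the identifier has another shape."""
    match = PART_ID.match(local_id)
    if match is None:
        raise ValueError(f"unexpected building part id: {local_id!r}")
    return match["reference"], int(match["part"])


reference, part = split_part_id("000200500VK56E_part2")
print(reference, part)
```

Raising on unexpected identifiers makes it easier to spot rows that do not follow the naming seen in this sample.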


**________________________**

**________________________**

## 1.3 Other Construction Layer

In [27]:
otherConstruction_df.shape

(13822, 7)

In [28]:
otherConstruction_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 13822 entries, 0 to 13821
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   gml_id                   13822 non-null  object  
 1   beginLifespanVersion     13822 non-null  object  
 2   conditionOfConstruction  0 non-null      object  
 3   localId                  13822 non-null  object  
 4   namespace                13822 non-null  object  
 5   constructionNature       13822 non-null  object  
 6   geometry                 13822 non-null  geometry
dtypes: geometry(1), object(6)
memory usage: 756.0+ KB


In [29]:
otherConstruction_df.isna().sum()

gml_id                         0
beginLifespanVersion           0
conditionOfConstruction    13822
localId                        0
namespace                      0
constructionNature             0
geometry                       0
dtype: int64

In [30]:
otherConstruction_df['conditionOfConstruction'].value_counts()

Series([], Name: conditionOfConstruction, dtype: int64)

In [31]:
otherConstruction_df[['beginLifespanVersion', 'constructionNature', 'localId']].describe()

| | beginLifespanVersion | constructionNature | localId |
|---|---|---|---|
| count | 13822 | 13822 | 13822 |
| unique | 2191 | 1 | 13822 |
| top | 2003-01-02T00:00:00 | openAirPool | 7803808VK3870D_PI.10582 |
| freq | 899 | 13822 | 1 |


In [32]:
# All data corresponds to open-air pools
# I don't need this dataset to begin with
# BUT let's clean it anyway

In [33]:
otherConstruction_Cols = ['conditionOfConstruction', 'constructionNature', 'namespace']
otherConstruction_df.drop(otherConstruction_Cols, axis = 1, inplace = True)

In [34]:
otherConstruction_df.head(3)

| | gml_id | beginLifespanVersion | localId | geometry |
|---|---|---|---|---|
| 0 | ES.SDGC.BU.0006601VK3800E_PI.1 | 2017-07-07T00:00:00 | 0006601VK3800E_PI.1 | POLYGON ((429964.360 4480456.150, 429961.960 4... |
| 1 | ES.SDGC.BU.0007201VK3800E_PI.2 | 2003-01-02T00:00:00 | 0007201VK3800E_PI.2 | POLYGON ((429883.850 4480596.541, 429888.120 4... |
| 2 | ES.SDGC.BU.0007202VK3800E_PI.3 | 2006-06-19T00:00:00 | 0007202VK3800E_PI.3 | POLYGON ((429967.989 4480563.520, 429968.469 4... |


In [35]:
# Using same functions as in Building Parts
def get_PI(x):
    return x.split('_')[1]

# Separating localID

otherConstruction_df['locadID_PI'] = otherConstruction_df['localId'].apply(get_PI) #not really useful
otherConstruction_df['locadID'] = otherConstruction_df['localId'].apply(get_ID)

otherConstruction_df.drop(['localId', 'locadID_PI'], axis = 1, inplace = True)

In [36]:
# CLEANING DATE
otherConstruction_df['yearOfConstruction'] = otherConstruction_df['beginLifespanVersion'].apply(get_yearofConstruction)
otherConstruction_df.drop(['beginLifespanVersion'], axis = 1, inplace = True)

In [37]:
# CLEANING gml_id
otherConstruction_df['gml_id'] = otherConstruction_df['gml_id'].apply(get_ID)

In [38]:
otherConstruction_df.head(3)

| | gml_id | geometry | locadID | yearOfConstruction |
|---|---|---|---|---|
| 0 | ES.SDGC.BU.0006601VK3800E | POLYGON ((429964.360 4480456.150, 429961.960 4... | 0006601VK3800E | 2017 |
| 1 | ES.SDGC.BU.0007201VK3800E | POLYGON ((429883.850 4480596.541, 429888.120 4... | 0007201VK3800E | 2003 |
| 2 | ES.SDGC.BU.0007202VK3800E | POLYGON ((429967.989 4480563.520, 429968.469 4... | 0007202VK3800E | 2006 |


In [61]:
display(otherConstruction_df['locadID'].value_counts())

3320722VK4832A    72
6243101VK4864C    38
5744401VK4854D    36
5942301VK4854D    33
6442501VK4864C    28
                  ..
4080904VK4748A     1
8281909VK4788A     1
8603831VK3880D     1
0598604VK5709H     1
8316316VK3881E     1
Name: locadID, Length: 11994, dtype: int64

In [70]:
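# Does a locadID from the pools layer exist in the previous datasets?
# Note: 8281909VK4788A appears only once above; the locadID with the most pools is 3320722VK4832A
# Only the first 25000 rows of each layer were loaded, so an empty result does not prove the parcel is missing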
display(buildingParts_df[buildingParts_df['locadID'] == '8281909VK4788A'])
display(building_df[building_df['localId'] == '8281909VK4788A'])
display(cadastralParcel[cadastralParcel['localId'] == '8281909VK4788A'])

| | gml_id | numberOfFloorsAboveGround | heightBelowGround | numberOfFloorsBelowGround | geometry | yearOfConstruction | locadID_parts | locadID |
|---|---|---|---|---|---|---|---|---|


| | gml_id | conditionOfConstruction | localId | currentUse | numberOfBuildingUnits | numberOfDwellings | grossFloorArea | geometry | yearOfConstruction |
|---|---|---|---|---|---|---|---|---|---|


| | gml_id | areaValue | localId | label | pos | geometry | yearOfConstruction |
|---|---|---|---|---|---|---|---|
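The three lookups above can be folded into one helper that reports which layers contain a given reference. This is only a sketch on toy frames: `layers_containing` is a hypothetical helper, and the id column names follow the ones used in this notebook (`localId` for buildings and parcels, `locadID` for parts and pools).

```python
import pandas as pd


def layers_containing(reference: str, layers: dict) -> list:
    """Return the names of the layers whose id column contains `reference`.

    `layers` maps a layer name to a (DataFrame, id_column) pair.
    """
    return [
        name
        for name, (frame, id_column) in layers.items()
        if frame[id_column].eq(reference).any()
    ]


# Toy stand-ins for the cleaned layers (values are illustrative)
toy_buildings = pd.DataFrame({"localId": ["000200500VK56E"]})
toy_parts = pd.DataFrame({"locadID": ["000200500VK56E", "000200500VK56E"]})
toy_pools = pd.DataFrame({"locadID": ["8281909VK4788A"]})

layers = {
    "building": (toy_buildings, "localId"),
    "buildingParts": (toy_parts, "locadID"),
    "otherConstruction": (toy_pools, "locadID"),
}

found = layers_containing("8281909VK4788A", layers)
print(found)
```

Because each layer was read with `rows = 25000`, a reference missing from the result may simply lie outside the loaded sample.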


**________________________**

**________________________**

## 1.4 Cadastral Parcel Layer

In [39]:
cadastralParcel.shape

(25000, 11)

In [40]:
cadastralParcel.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   gml_id                      25000 non-null  object  
 1   areaValue                   25000 non-null  int64   
 2   areaValue_uom               25000 non-null  object  
 3   beginLifespanVersion        25000 non-null  object  
 4   endLifespanVersion          0 non-null      object  
 5   localId                     25000 non-null  object  
 6   namespace                   25000 non-null  object  
 7   label                       25000 non-null  object  
 8   nationalCadastralReference  25000 non-null  object  
 9   pos                         25000 non-null  object  
 10  geometry                    25000 non-null  geometry
dtypes: geometry(1), int64(1), object(9)
memory usage: 2.1+ MB


In [41]:
cadastralParcel.isna().sum()

gml_id                            0
areaValue                         0
areaValue_uom                     0
beginLifespanVersion              0
endLifespanVersion            25000
localId                           0
namespace                         0
label                             0
nationalCadastralReference        0
pos                               0
geometry                          0
dtype: int64

In [42]:
cadastralParcel['endLifespanVersion'].value_counts()

Series([], Name: endLifespanVersion, dtype: int64)

In [43]:
cadastralParcel.drop(['endLifespanVersion'], axis = 1, inplace = True)

In [44]:
cadastralParcel_measureCols = ['areaValue_uom', 'namespace']
cadastralParcel.drop(cadastralParcel_measureCols, axis = 1, inplace = True)

In [45]:
# CLEANING YEAR (same as 1.3 - 1.2 - 1.1)
cadastralParcel['yearOfConstruction'] = cadastralParcel['beginLifespanVersion'].apply(get_yearofConstruction)
cadastralParcel.drop(['beginLifespanVersion'], axis = 1, inplace = True)

In [46]:
# Is localId the same as nationalCadastralReference?

cadastralParcel['Equality'] = cadastralParcel['localId'] == cadastralParcel['nationalCadastralReference']
cadastralParcel['Equality'].value_counts() # All true

True    25000
Name: Equality, dtype: int64

In [47]:
cadastralParcel.drop(['Equality', 'nationalCadastralReference'], axis = 1, inplace = True)

In [48]:
cadastralParcel.head(4)

| | gml_id | areaValue | localId | label | pos | geometry | yearOfConstruction |
|---|---|---|---|---|---|---|---|
| 0 | ES.SDGC.CP.000200500VK56E | 1268 | 000200500VK56E | 5 | 451607.03 4467199.27 | MULTIPOLYGON (((451599.360 4467174.940, 451584... | 2013 |
| 1 | ES.SDGC.CP.000205600VK56E | 297 | 000205600VK56E | 56 | 451592.44 4467170.83 | MULTIPOLYGON (((451593.000 4467163.250, 451579... | 2013 |
| 2 | ES.SDGC.CP.000205700VK56E | 155 | 000205700VK56E | 57 | 451587.93 4467161.77 | MULTIPOLYGON (((451590.040 4467156.910, 451576... | 2013 |
| 3 | ES.SDGC.CP.000205800VK56E | 174 | 000205800VK56E | 58 | 451584.62 4467155.04 | MULTIPOLYGON (((451586.650 4467149.650, 451573... | 2013 |


In [60]:
display(cadastralParcel['localId'].value_counts())

28900J00600005    1
0381316VK4708A    1
0399911VK4709G    1
0298720VK4709G    1
0606503VK4700F    1
                 ..
0906307VK5800F    1
0805101VK5800F    1
0187806VK4708E    1
0391404VK4709A    1
0992405VK4809D    1
Name: localId, Length: 25000, dtype: int64

**________________________**

**________________________**

## 1.5 Cadastral Zoning Layer

In [49]:
cadastralZoning.shape

(12202, 13)

In [50]:
cadastralZoning.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 12202 entries, 0 to 12201
Data columns (total 13 columns):
 #   Column                           Non-Null Count  Dtype   
---  ------                           --------------  -----   
 0   gml_id                           12202 non-null  object  
 1   beginLifespanVersion             12202 non-null  object  
 2   endLifespanVersion               0 non-null      object  
 3   estimatedAccuracy                12202 non-null  float64 
 4   estimatedAccuracy_uom            12202 non-null  object  
 5   localId                          12202 non-null  object  
 6   namespace                        12202 non-null  object  
 7   label                            12202 non-null  object  
 8   LocalisedCharacterString         12202 non-null  object  
 9   nationalCadastalZoningReference  12202 non-null  object  
 10  originalMapScaleDenominator      12202 non-null  int64   
 11  pos                              12202 non-null  object  
 

In [51]:
cadastralZoning.isna().sum()

gml_id                                 0
beginLifespanVersion                   0
endLifespanVersion                 12202
estimatedAccuracy                      0
estimatedAccuracy_uom                  0
localId                                0
namespace                              0
label                                  0
LocalisedCharacterString               0
nationalCadastalZoningReference        0
originalMapScaleDenominator            0
pos                                    0
geometry                               0
dtype: int64

In [59]:
# Run this cell before the drop in the next cell: once those columns are removed,
# the lookups raise the KeyError shown below
display(cadastralZoning['estimatedAccuracy'].value_counts())
display(cadastralZoning['estimatedAccuracy_uom'].value_counts())
display(cadastralZoning['originalMapScaleDenominator'].value_counts())
display(cadastralZoning['LocalisedCharacterString'].value_counts()) # are they all the same?

KeyError: 'estimatedAccuracy'

In [54]:
cadastralZoning_uselessCols = ['estimatedAccuracy_uom', 'originalMapScaleDenominator', 'estimatedAccuracy', 
                              'namespace', 'endLifespanVersion']
cadastralZoning.drop(cadastralZoning_uselessCols, axis = 1, inplace = True)

In [76]:
# Is localId == nationalCadastalZoningReference?
(cadastralZoning['localId'] == cadastralZoning['nationalCadastalZoningReference']).value_counts() # All true
cadastralZoning.drop(['nationalCadastalZoningReference'], axis = 1, inplace = True)

In [71]:
# CHANGE YEAR (same as rest)
cadastralZoning['yearOfConstruction'] = cadastralZoning['beginLifespanVersion'].apply(get_yearofConstruction)
cadastralZoning.drop(['beginLifespanVersion'], axis = 1, inplace = True)

In [77]:
cadastralZoning.head(4)

| | gml_id | localId | label | LocalisedCharacterString | pos | geometry | yearOfConstruction |
|---|---|---|---|---|---|---|---|
| 0 | ES.SDGC.CP.Z.28900A000 | 28900A000 | 0 | POLIGONO | 440302.42 4492704.6 | MULTIPOLYGON (((440301.968 4492704.637, 440302... | 2019 |
| 1 | ES.SDGC.CP.Z.28900A001 | 28900A001 | 1 | POLIGONO | 444909.27 4497953.5 | MULTIPOLYGON (((444555.092 4499363.444, 444568... | 2019 |
| 2 | ES.SDGC.CP.Z.28900A002 | 28900A002 | 2 | POLIGONO | 446474.92 4496527.94 | MULTIPOLYGON (((445948.540 4496495.370, 445945... | 2013 |
| 3 | ES.SDGC.CP.Z.28900A003 | 28900A003 | 3 | POLIGONO | 447749.11 4494280.98 | MULTIPOLYGON (((449091.096 4493635.359, 449089... | 2014 |


**_______________**

**_______________**

### CHECKING MEMORY USAGE

In [80]:
print(f"Building Layer total memory usage: \t\t\t{building_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Building Parts Layer total memory usage: \t\t{buildingParts_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Other Construction Layer total memory usage: \t\t{otherConstruction_df.memory_usage(index=True).sum()/1000} \tKbytes")

print(f"Cadastral Parcel Layer total memory usage: \t\t{cadastralParcel.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Cadastral Zoning Layer total memory usage: \t\t{cadastralZoning.memory_usage(index=True).sum()/1000} \t\tKbytes")

Building Layer total memory usage: 			1800.128 	Kbytes
Building Parts Layer total memory usage: 		1600.128 	Kbytes
Other Construction Layer total memory usage: 		442.432 	Kbytes
Cadastral Parcel Layer total memory usage: 		1400.128 	Kbytes
Cadastral Zoning Layer total memory usage: 		683.44 		Kbytes


--- INITIALLY ---

- Building Layer total memory usage: 4825.128 Kbytes
- Building Parts Layer total memory usage: 2625.128 Kbytes
- Other Construction Layer total memory usage: 774.16 Kbytes
- Cadastral Parcel Layer total memory usage: 2200.128 Kbytes
- Cadastral Zoning Layer total memory usage: 1269.136 Kbytes

In [83]:
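## CHANGE
# Each value is the current size as a percentage of the initial size (lower is better):
# bytes / (10 * initial Kbytes) == 100 * current / initial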

print(f"Building Layer memory optimization: \t\t\t{np.round(building_df.memory_usage(index=True).sum()/(10*4825.128), 2)} \t%")
print(f"Building Parts Layer memory optimization: \t\t{np.round(buildingParts_df.memory_usage(index=True).sum()/(10*2625.128), 2)} \t%")
print(f"Other Construction Layer memory optimization: \t\t{np.round(otherConstruction_df.memory_usage(index=True).sum()/(10*774.16), 2)} \t%")

print(f"Cadastral Parcel Layer memory optimization: \t\t{np.round(cadastralParcel.memory_usage(index=True).sum()/(10*2200.128), 2)} \t%")
print(f"Cadastral Zoning Layer memory optimization: \t\t{np.round(cadastralZoning.memory_usage(index=True).sum()/(10*1269.136), 2)} \t%")

Building Layer memory optimization: 			37.31 	%
Building Parts Layer memory optimization: 		60.95 	%
Other Construction Layer memory optimization: 		57.15 	%
Cadastral Parcel Layer memory optimization: 		63.64 	%
Cadastral Zoning Layer memory optimization: 		53.85 	%
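The percentages above express the size that remains after cleaning, so the saving is their complement. The sketch below recomputes the reduction from the two sets of printed totals; `reduction_pct` is a hypothetical helper, and the figures are copied from the outputs in this notebook.

```python
# Totals in Kbytes, copied from the two memory reports above
INITIAL_KB = {
    "Building": 4825.128,
    "Building Parts": 2625.128,
    "Other Construction": 774.16,
    "Cadastral Parcel": 2200.128,
    "Cadastral Zoning": 1269.136,
}
FINAL_KB = {
    "Building": 1800.128,
    "Building Parts": 1600.128,
    "Other Construction": 442.432,
    "Cadastral Parcel": 1400.128,
    "Cadastral Zoning": 683.44,
}


def reduction_pct(initial_kb: float, final_kb: float) -> float:
    """Return the memory saved as a percentage of the initial size."""
    return round(100 * (1 - final_kb / initial_kb), 2)


reductions = {layer: reduction_pct(INITIAL_KB[layer], FINAL_KB[layer]) for layer in INITIAL_KB}
for layer, pct in reductions.items():
    print(f"{layer}: {pct} % saved")
```

Keeping the initial totals in a dictionary also avoids hard-coding them inside each `print` call.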


**_______________**

**_______________**

In [84]:
### CHECKING WHICH COLUMN IS BEST FOR CONNECTING TABLES IN THE DATABASE
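One way to compare candidate key columns is to measure how many identifiers two layers share. The sketch below uses an outer merge with `indicator=True` on toy frames; the identifiers are illustrative, and using `localId` as the join column is an assumption to be verified against the full layers.

```python
import pandas as pd

# Toy stand-ins for two cleaned layers sharing a candidate key
buildings = pd.DataFrame({"localId": ["000200500VK56E", "000205600VK56E", "0006601VK3800E"]})
parcels = pd.DataFrame({"localId": ["000200500VK56E", "000205600VK56E", "000205700VK56E"]})

# The outer merge keeps every id; the _merge column records where each one came from
overlap = buildings.merge(parcels, on="localId", how="outer", indicator=True)
coverage = overlap["_merge"].value_counts()
print(coverage)
```

A key column is a good candidate when most rows fall under `both` and few under `left_only` or `right_only`.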