### DATA FROM CATASTRO INSPIRE
___________________________________________________________________________________________________

Obtained through QGIS after experimenting for a while with:
- WMS service: http://ovc.catastro.meh.es/cartografia/INSPIRE/spadgcwms.aspx
- WFS buildings: http://ovc.catastro.meh.es/INSPIRE/wfsBU.aspx?
- All INSPIRE services: http://www.catastro.minhap.es/webinspire/index.html

*Note: the WMS/WFS channels return the layers as they were before harmonization with the European INSPIRE programme, which may be why they do not work well.*

Several GitHub projects focus on querying the Catastro with Python:
- **[pyCatastro](https://github.com/gisce/pycatastro)**: allows API-style queries.
>>- Does not allow bulk download of data by municipality and province
>>- Returns dictionaries
>>- To download Madrid, I would need to get (1) all the streets and (2) all the street-type abbreviations, then iterate over them, making a huge number of calls (not a good idea)

- **[catastro-lib-python](https://github.com/sperea/catastro-lib-python)**: appears to be a predecessor of the previous one. Not tested, since it stopped being updated in 2018
- **[Catastro Inspire Downloader](https://github.com/geomatico/cidownloader)**: made by the same cartographers and developers as the QGIS plugin; it downloads data in .geopackage format
>>- Allows downloading by province, municipality and projection
>>- It is inconsistent: when downloading Madrid, it sometimes returns only CadastralParcel or only BuildingParts data. Compared with the GMLs included in the ZIPs that can be downloaded separately, the data are incomplete. Possibly a problem with the GDAL library.


In [1]:
import pandas as pd
import numpy as np

import geopandas as gpd
import geojson

pd.set_option('display.max_rows', 1000)

### CAPAS DISPONIBLES

Given the volume of data, the inspection is carried out on a subset of it.

In [2]:
CATASTRO_PATH = '../data/raw/catastro'

# The available data comes as 5 layers in GeoJSON format (3 building layers, 2 cadastral parcel layers)

#building_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.building.geojson", rows = 25000)
buildingParts_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.buildingpart.geojson", rows = 25000) # pt 1
#otherConstruction_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.otherconstruction.geojson", rows = 25000)
cadastralParcel = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.CP.28900.cadastralparcel.geojson", rows = 25000)
cadastralZoning = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.CP.28900.cadastralzoning.geojson", rows = 25000)

Memory usage check

In [3]:
#print(f"Building Layer total memory usage: \t\t\t{building_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Building Parts Layer total memory usage: \t\t{buildingParts_df.memory_usage(index=True).sum()/1000} \tKbytes")
#print(f"Other Construction Layer total memory usage: \t\t{otherConstruction_df.memory_usage(index=True).sum()/1000} \t\tKbytes")
print(f"Cadastral Parcel Layer total memory usage: \t\t{cadastralParcel.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Cadastral Zoning Layer total memory usage: \t\t{cadastralZoning.memory_usage(index=True).sum()/1000} \tKbytes")

Building Parts Layer total memory usage: 		2625.128 	Kbytes
Cadastral Parcel Layer total memory usage: 		2200.128 	Kbytes
Cadastral Zoning Layer total memory usage: 		1269.136 	Kbytes


In [4]:
## LAYER NAMES (printed by the helper functions further down)
# TODO: wrap each layer in an object so that filename == name
#building_df.name = 'BU_ALL'
buildingParts_df.name = 'BU_PARTS'
#otherConstruction_df.name = 'BU_OTHER'
cadastralParcel.name = 'CAD_PARCEL'
cadastralZoning.name = 'CAD_ZONING'

## 1.1 Building Layer

In QGIS this layer represents the built-up part of the plots.

In [None]:
display(building_df.info())
display(building_df.shape)

In [None]:
building_df.isna().sum()

**GETTING RID OF USELESS COLUMNS**

In [None]:
# DROP NULL COLUMNS
# There are a couple of columns that do not offer any information

building_nullCols = ['numberOfFloorsAboveGround', 'endLifespanVersion']
building_linkCols = ['documentLink', 'format', 'informationSystem']
building_measureCols = ['horizontalGeometryEstimatedAccuracy', 'value_uom', 'sourceStatus',
                        'horizontalGeometryEstimatedAccuracy_uom', 'horizontalGeometryReference', 
                        'officialAreaReference']

building_df.drop(building_nullCols,    axis=1, inplace = True)
building_df.drop(building_linkCols,    axis=1, inplace = True)
building_df.drop(building_measureCols, axis=1, inplace = True)

# horizontalGeometryEstimatedAccuracy is always 0.1m accuracy

In [None]:
# CORRECTING GROSS FLOOR AREA
# officialAreaReference - value - value_uom are columns that refer to the same information

building_df.rename(columns={"value": "grossFloorArea"}, inplace = True)
# building_df.drop(['officialAreaReference'], axis=1, inplace = True)

In [None]:
# SIMPLIFYING DATE COLUMNS
# differences between them might be due to the gap between registration and the CFO.
# the end column makes no sense in the context of this project

building_dateCols = ['beginLifespanVersion', 'beginning', 'end']

# ARE beginning and end the same?

building_df['Equal_beg_end'] = building_df.apply(lambda x: x['beginning'] == x['end'], axis = 1)
building_df['Equal_beg_end'].value_counts()

Let's look at the rest of the date columns.
The [documentation](http://www.catastro.minhap.es/webinspire/documentos/Conjuntos%20de%20datos.pdf) says:
>- **beginLifespanVersion**: date on which the record was registered in the cadastral database.
>- **dateOfConstruction**: structure that defines the construction date. It is made up of two attributes, **bu-core2d:beginning and bu-core2d:end**; the values are the construction dates of each construction unit. If there is more than one, **the "beginning" field holds the oldest and the "end" field the most recent**. They always refer to 1 January.

**______**
For this project, only the beginning column is of interest.
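
Since both fields are timestamp strings that always fall on 1 January, the year can be read straight from the string. A minimal sketch using only the standard library; the helper name `year_from_iso` and the sample values are illustrative and not taken from the dataset:

```python
from datetime import datetime

def year_from_iso(stamp: str) -> int:
    """Return the year of a 'YYYY-MM-DDTHH:MM:SS' string as an int."""
    # The standard-library datetime accepts years 1..9999,
    # so very old construction dates are not a problem here.
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").year

# Hypothetical sample values in the format used by the 'beginning' column
samples = ["1640-01-01T00:00:00", "2004-01-01T00:00:00"]
years = [year_from_iso(s) for s in samples]
print(years)
```

Parsing validates the format as a side effect: a malformed string raises `ValueError` instead of silently producing a wrong year.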

In [None]:
# Not sure what this means; these could be buildings that are demolished or abandoned
# Let's look at the first row, reference = 000207800VK56E
# The Sede Electronica del Catastro says this building was built in 2004, the *end* year
# The most plausible cause is that the building is abandoned, under construction, or in a reconstruction phase

# conditionOfConstruction should differ for these rows
# Let's check that

c_functional, c_declined, c_ruin = 0, 0, 0

for ref in building_df[building_df['Equal_beg_end'] == False]['reference'].tolist() :
    if building_df[building_df['reference'] == ref]['conditionOfConstruction'].tolist()[0] == 'functional':
        c_functional += 1
    elif building_df[building_df['reference'] == ref]['conditionOfConstruction'].tolist()[0] == 'declined':
        c_declined += 1
    elif building_df[building_df['reference'] == ref]['conditionOfConstruction'].tolist()[0] == 'ruin':
        c_ruin += 1
    else:
        pass

print("CONDITION OF CONSTRUCTION FOR BUILDINGS WITH DIFFERENT BEGINNING AND END DATES")
print(f"Functional : \t{c_functional}")
print(f"Declined : \t{c_declined}")
print(f"Ruins : \t{c_ruin}")

# There should be another category
# Buildings that may have another new cadastral reference for unknown reasons
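
The loop above can also be written as a boolean mask followed by `value_counts`, which avoids re-filtering the frame for every reference. A small sketch on a toy frame: the rows are invented for illustration, only the column names come from the notebook.

```python
import pandas as pd

# Toy data with the same column names as building_df (values are invented)
toy = pd.DataFrame({
    "reference": ["A", "B", "C", "D", "E"],
    "beginning": ["1950-01-01T00:00:00", "1990-01-01T00:00:00",
                  "2000-01-01T00:00:00", "1970-01-01T00:00:00",
                  "1960-01-01T00:00:00"],
    "end":       ["2004-01-01T00:00:00", "1990-01-01T00:00:00",
                  "2010-01-01T00:00:00", "1980-01-01T00:00:00",
                  "1999-01-01T00:00:00"],
    "conditionOfConstruction": ["functional", "functional", "ruin",
                                "declined", "functional"],
})

# Rows whose 'beginning' and 'end' dates differ
differs = toy["beginning"] != toy["end"]

# Count the conditions only among those rows
counts = toy.loc[differs, "conditionOfConstruction"].value_counts()
print(counts.to_dict())
```

Any category not anticipated by the `if/elif` chain would also show up in `counts`, which helps when looking for the "missing" category mentioned in the comments.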

In [None]:
# LETS SEE CONDITION OF CONSTRUCTION COLUMN
building_df['conditionOfConstruction'].value_counts()

In [None]:
# LET'S ADD another category for demolished buildings, reflecting beginning != end
# therefore, the buildings where further construction has taken place are usually functional buildings
#
# As stated above, only the BEGINNING column is of interest for this project

building_df.drop(['Equal_beg_end', 'end', 'beginLifespanVersion'], axis = 1, inplace = True)

In [None]:
import datetime as dt

building_df['dateOfConstruction'] = building_df['beginning'].apply(lambda x: 
                                    dt.datetime.strptime(x,'%Y-%m-%dT%H:%M:%S'))

# Out of bounds nanosecond timestamp: 1640-01-01 00:00:00
# pandas nanosecond timestamps only cover roughly the years 1677 to 2262,
# so, given the nature of the data, pandas datetime methods cannot be used here

def get_yearofConstruction(strng):
    """
    Input:  string
    Output: year as string
    """
    first_w = strng.split('T')[0]
    # Not using datetime from pandas, not valid for this case
    return first_w.split('-')[0]

building_df['yearOfConstruction'] = building_df['beginning'].apply(get_yearofConstruction)

In [None]:
building_df.drop(['beginning', 'dateOfConstruction'], axis = 1, inplace = True)

**____________________________________________________________________________________________________________**

In [None]:
# THERE ARE COLUMNS THAT MAY DUPLICATE INFORMATION
# To join with the rest of data, parcels, buildingparts... share an ID
# Which col is the ID ??

cols_id = ['gml_id', 'reference', 'localId', 'namespace']

# Reference == localID ??
# See whether values are unique or repeated (this has to be cross-checked against the other geojson files)

building_df[cols_id].describe()

# All elements are unique (as expected)
# SHOULD COINCIDE WITH PARCELS
# gml_id for later

In [None]:
# Are reference and localID the same? 
building_df.apply(lambda x: x['reference'] == x['localId'], axis = 1).value_counts() # TRUE
# What is reference Geometry ?
building_df['referenceGeometry'].value_counts() # All true --> Dropping

# Dropping reference and namespace
building_df.drop(['reference', 'namespace', 'referenceGeometry'], axis = 1, inplace = True)

In [None]:
building_df.head(2)

**____________________________________________________________________________________________________________**

**____________________________________________________________________________________________________________**

## 1.2 Building Parts Layer

In [5]:
display(buildingParts_df.info())
display(buildingParts_df.shape)

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 14 columns):
 #   Column                                   Non-Null Count  Dtype   
---  ------                                   --------------  -----   
 0   gml_id                                   25000 non-null  object  
 1   beginLifespanVersion                     25000 non-null  object  
 2   conditionOfConstruction                  0 non-null      object  
 3   localId                                  25000 non-null  object  
 4   namespace                                25000 non-null  object  
 5   horizontalGeometryEstimatedAccuracy      25000 non-null  float64 
 6   horizontalGeometryEstimatedAccuracy_uom  25000 non-null  object  
 7   horizontalGeometryReference              25000 non-null  object  
 8   referenceGeometry                        25000 non-null  bool    
 9   numberOfFloorsAboveGround                25000 non-null  int64   
 10  heightBelowGround         

None

(25000, 14)

In [6]:
import re #findall #split # sub

# ------------

def str_forUniques(num):
    if num == 0: return 'ALL NULLS'
    else: return 'Unique items'

def checking_forUniques(gdf):
    """
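    input:  GeoDataFrame with a .name attribute
    output: list of non-geometry columns holding at most one distinct value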
    """
    cols_with_one_element = []
    print(f"-- Checking for UNIQUES in {gdf.name} --")
    
    for i,col in enumerate(gdf.columns.tolist()):
        if (col != 'geometry'):
            unique_len = len(gdf[str(col)].value_counts().tolist())
            
            if unique_len == 0: print(f"{i+1}.{col}:\t\t\t{unique_len}\t{str_forUniques(unique_len)}")                
            elif len(col) <= 12 and unique_len != 0: print(f"{i+1}.{col}:\t\t\t\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 12 < len(col) <= 20 and unique_len != 0: print(f"{i+1}.{col}:\t\t\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 20 < len(col) <= 28 and unique_len != 0: print(f"{i+1}.{col}:\t\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 28 < len(col) <= 36 and unique_len != 0: print(f"{i+1}.{col}:\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 36 < len(col) and unique_len != 0: print(f"{i+1}.{col}:\t{unique_len}\t{str_forUniques(unique_len)}")
            else: pass
            
            if (unique_len == 1) or (unique_len == 0): cols_with_one_element.append(col)
            else: pass            
        else: pass
        
    print(f"-- Finished task --\n")
    return cols_with_one_element
    
# ------------

def get_part(x):
    """
    input:  string of the form ID_partXX
    output: XX as int
    Extracts the numeric part XX from ID_partXX
    """
    part_str = x.split('_')[1]
    return int(part_str.split('t')[1])

def get_ID(x):
    """
    input: localID_partXX
    output: localID
    """
    return x.split('_')[0]

def separate_parts(gdf, col):
    if len(re.findall(r"_", gdf[col].tolist()[0])) != 0:
        splited_col_name = re.split(r"_", gdf[col].tolist()[0])
        part_title = re.findall(r"\D+", splited_col_name[1])
        
        gdf[col + f'_{part_title[0]}'] = gdf[col].apply(get_part).astype(dtype = 'int64')
        gdf[col] = gdf[col].apply(get_ID)
    
# ------------

def get_year(strng):
    """
    Input:  string
    Output: year as string
    
    Note_____________________________________________________________
    Pandas nanosecond timestamps only cover roughly the years 1677 to 2262
    Using pandas Timestamps would require defining a custom Period
    String operations seem easier in this case
    """
    first_w = strng.split('T')[0]
    return first_w.split('-')[0]

def getYearOfConstruction(gdf, LifeSpanCol = 'beginLifespanVersion', drop_col = True):
    """
    Cleaning Datetime
    """
    print(f"-- Extract YEAR OF CONSTRUCTION from {gdf.name} --")
    gdf['yearOfConstruction'] = gdf[LifeSpanCol].apply(get_year)
    
    if drop_col:
        print(f"Droping col {LifeSpanCol}")
        # Drop from the frame that was passed in, not from the global buildingParts_df
        gdf.drop([LifeSpanCol], axis = 1, inplace = True)
    else: pass
    
    print(f"-- Finished task --\n") 
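
The ID helpers above assume every value contains an underscore followed by a label and a number. As a hedged alternative, one regular expression can return the base ID, the label and the number together, and pass through values without a suffix. The function name `split_suffixed_id` and the `_partN` sample ID are assumptions made for illustration.

```python
import re

# Matches '<id>_<label><number>', e.g. 'ES.SDGC.BU.000200500VK56E_part2'
SUFFIX_RE = re.compile(r"^(?P<base>[^_]+)_(?P<label>\D+)(?P<num>\d+)$")

def split_suffixed_id(value):
    """Return (base_id, label, number); label and number are None without a suffix."""
    match = SUFFIX_RE.match(value)
    if match is None:
        # No '_labelNN' suffix: return the value unchanged
        return value, None, None
    return match["base"], match["label"], int(match["num"])

with_part = split_suffixed_id("ES.SDGC.BU.000200500VK56E_part2")
without_part = split_suffixed_id("ES.SDGC.BU.000200500VK56E")
print(with_part)
print(without_part)
```

Returning `None` for missing suffixes avoids the `IndexError` that `x.split('_')[1]` raises on an ID without an underscore.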

In [7]:
from itertools import combinations
from itertools import chain

def check_allTrue(gdf, col1, col2):
    
    print(f"-- Checking if PAIRS are ALL TRUE {gdf.name} --")

    # some columns are heights and others are numbers of floors; multiplying by 3 reconciles them
    if False not in gdf.apply(lambda x: (x[col1] == x[col2]) or (x[col1] == 3*x[col2]) or (3*x[col1] == x[col2]),
                              axis = 1).value_counts().index.tolist():
        
        print(f"All True --\n-- Droping {col2}")
        gdf.drop([col2], axis = 1, inplace = True)
    else:
        print(f"Pass \tThere are inequalities between columns")


def checking_forIdenCols(gdf, drop_cols = True):
    """
    Note_____________________________________________________________
    Columns with the same number of unique values are taken as a hint
    that they carry the same (or nearly the same) information, so to
    simplify the database the second column of each such pair is dropped
    """
    print(f"-- Checking for SAME LEN COLS in {gdf.name} --")
    # 1 // creating vars for search
    cols = [col for col in gdf.columns.tolist() if col != 'geometry']
    len_unique_cols = [len(gdf[col].value_counts().tolist()) for col in cols]
    equal_cols, del_cols = [], []
    
    # 2 // creating pairs of columns that are suspect of giving the same information
    for tup_len, tup_col in zip(list(combinations(len_unique_cols, 2)), list(combinations(cols, 2))):
        if tup_len[0] == tup_len[1]: equal_cols.append([tup_col[0], tup_col[1]])
        else: pass
        
    # 3 // if drop_cols, drop the second column of each pair (row values are NOT compared here)
    if drop_cols and len(equal_cols) != 0:        
        for pair in equal_cols:            
            gdf.drop(pair[1], axis = 1, inplace = True)
            del_cols.append(pair.pop(1))
            
        print(f"Deleted   columns: \t\t{del_cols}")
        print(f"Remaining columns: \t\t{list(chain.from_iterable(equal_cols))}")
    
    elif len(equal_cols) == 0: print('List to return is empty')
    else: print(f"Remaining columns: \t\t{list(chain.from_iterable(equal_cols))}")

    print(f"-- Finished task --\n")
    return list(chain.from_iterable(equal_cols))           
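
Equal unique counts are only a hint: two unrelated columns can share the same count. A hedged sketch of a stricter check that compares the values row by row before anything is dropped; the helper name `identical_column_pairs` and the toy frame are hypothetical.

```python
import pandas as pd
from itertools import combinations

def identical_column_pairs(df):
    """Return pairs of columns whose values are equal row by row."""
    pairs = []
    for left, right in combinations(df.columns, 2):
        # Series.equals compares values element-wise, treats NaNs in the
        # same position as equal, and requires matching dtypes.
        if df[left].equals(df[right]):
            pairs.append((left, right))
    return pairs

toy = pd.DataFrame({
    "localId":   ["a", "b", "c"],
    "reference": ["a", "b", "c"],   # true duplicate of localId
    "floors":    [1, 2, 3],         # same unique count, different content
})
pairs = identical_column_pairs(toy)
print(pairs)
```

In the toy frame all three columns have three unique values, so a count-based check would flag every pair, while the value comparison keeps only `localId` and `reference`.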
    

In [8]:
# CHECKING cols that could be deleted
cols_to_del = checking_forUniques(buildingParts_df)

# DROPPING USELESS COLUMNS
buildingParts_df.drop(cols_to_del, axis=1, inplace = True)

# SEPARATE PARTS IF _
separate_parts(gdf = buildingParts_df, col = 'localId')
separate_parts(gdf = buildingParts_df, col = 'gml_id')

# CHECKING AND DROPPING IDENTICAL COLUMNS
checking_forIdenCols(buildingParts_df, drop_cols = True)

-- Checking for UNIQUES in BU_PARTS --
1.gml_id:					25000	Unique items
2.beginLifespanVersion:				974	Unique items
3.conditionOfConstruction:			0	ALL NULLS
4.localId:					25000	Unique items
5.namespace:					1	Unique items
6.horizontalGeometryEstimatedAccuracy:		1	Unique items
7.horizontalGeometryEstimatedAccuracy_uom:	1	Unique items
8.horizontalGeometryReference:			1	Unique items
9.referenceGeometry:				1	Unique items
10.numberOfFloorsAboveGround:			21	Unique items
11.heightBelowGround:				11	Unique items
12.heightBelowGround_uom:			1	Unique items
13.numberOfFloorsBelowGround:			11	Unique items
-- Finished task --

-- Checking for SAME LEN COLS in BU_PARTS --
Deleted   columns: 		['localId', 'numberOfFloorsBelowGround', 'gml_id_part']
Remaining columns: 		['gml_id', 'heightBelowGround', 'localId_part']
-- Finished task --



['gml_id', 'heightBelowGround', 'localId_part']

In [9]:
getYearOfConstruction(buildingParts_df, 
                      LifeSpanCol = 'beginLifespanVersion', drop_col = True)

-- Extract YEAR OF CONSTRUCTION from BU_PARTS --
Droping col beginLifespanVersion
-- Finished task --



In [60]:
def rename_cols(gdf):
    dict_cols_to_rename = {'gml_id': 'localID',
                           'localId_part': 'localID_part',
                           'localId_PI': 'localID_pool',
                           'numberOfFloorsAboveGround': 'numFloors_AG',
                           'numberOfFloorsBelowGround': 'numFloors_BG',
                           'heightAboveGround': 'height_AG',
                           'heightBelowGround': 'height_BG'}

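    # rename returns a new frame by default; inplace = True is needed to modify gdf itself
    gdf.rename(columns = dict_cols_to_rename, inplace = True) # before: after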

In [10]:
buildingParts_df.head(2)

Unnamed: 0,gml_id,numberOfFloorsAboveGround,heightBelowGround,geometry,localId_part,yearOfConstruction
0,ES.SDGC.BU.000200100VK48E,1,0,"POLYGON ((441657.574 4487050.292, 441658.164 4...",1,2004
1,ES.SDGC.BU.000200500VK56E,1,0,"POLYGON ((451584.920 4467181.410, 451586.250 4...",1,2013


In [78]:
dict_cols_to_rename = {'gml_id': 'localID',
                       'localId_part': 'localID_part',
                       'localId_PI': 'localID_pool',
                       'numberOfFloorsAboveGround': 'numFloors_AG',
                       'numberOfFloorsBelowGround': 'numFloors_BG',
                       'heightAboveGround': 'height_AG',
                       'heightBelowGround': 'height_BG'}

buildingParts_df.rename(columns = dict_cols_to_rename) # before: after

Unnamed: 0,localID,numFloors_AG,height_BG,geometry,localID_part,yearOfConstruction
0,ES.SDGC.BU.000200100VK48E,1,0,"POLYGON ((441657.574 4487050.292, 441658.164 4...",1,2004
1,ES.SDGC.BU.000200500VK56E,1,0,"POLYGON ((451584.920 4467181.410, 451586.250 4...",1,2013
2,ES.SDGC.BU.000200500VK56E,1,0,"POLYGON ((451612.580 4467215.370, 451618.650 4...",2,2013
3,ES.SDGC.BU.000205600VK56E,1,0,"POLYGON ((451579.400 4467169.610, 451582.370 4...",1,2013
4,ES.SDGC.BU.000205700VK56E,1,0,"POLYGON ((451576.440 4467163.260, 451576.930 4...",1,2013
...,...,...,...,...,...,...
24995,ES.SDGC.BU.0201602VK4800A,4,0,"POLYGON ((440172.963 4479980.283, 440169.101 4...",4,2004
24996,ES.SDGC.BU.0201604VK4800A,1,0,"POLYGON ((440184.243 4479979.654, 440189.373 4...",1,2002
24997,ES.SDGC.BU.0201605VK4800A,1,0,"POLYGON ((440190.893 4479982.524, 440193.763 4...",1,2002
24998,ES.SDGC.BU.0201609VK4800A,2,0,"POLYGON ((440188.113 4479964.823, 440187.803 4...",1,2002


In [61]:
buildingParts_df.head(3)

['gml_id', 'numberOfFloorsAboveGround', 'heightBelowGround', 'geometry', 'localId_part', 'yearOfConstruction']


Unnamed: 0,gml_id,numberOfFloorsAboveGround,heightBelowGround,geometry,localId_part,yearOfConstruction
0,ES.SDGC.BU.000200100VK48E,1,0,"POLYGON ((441657.574 4487050.292, 441658.164 4...",1,2004
1,ES.SDGC.BU.000200500VK56E,1,0,"POLYGON ((451584.920 4467181.410, 451586.250 4...",1,2013
2,ES.SDGC.BU.000200500VK56E,1,0,"POLYGON ((451612.580 4467215.370, 451618.650 4...",2,2013


**________________________**

**________________________**

## 1.3 Other Construction Layer

In [None]:
otherConstruction_df.shape

In [None]:
otherConstruction_df.info()

In [None]:
otherConstruction_df.isna().sum()

In [None]:
otherConstruction_df['conditionOfConstruction'].value_counts()

In [None]:
otherConstruction_df[['beginLifespanVersion', 'constructionNature', 'localId']].describe()

In [None]:
# All data corresponds to open-air POOLS
# I don't need this dataset to begin with
# BUT let's clean it anyway

In [None]:
otherConstruction_Cols = ['conditionOfConstruction', 'constructionNature', 'namespace']
otherConstruction_df.drop(otherConstruction_Cols, axis = 1, inplace = True)

In [None]:
otherConstruction_df.head(3)

In [None]:
# Using same functions as in Building Parts
def get_PI(x):
    return x.split('_')[1]

# Separating localId

otherConstruction_df['localID_PI'] = otherConstruction_df['localId'].apply(get_PI) # not really useful
otherConstruction_df['localID'] = otherConstruction_df['localId'].apply(get_ID)

otherConstruction_df.drop(['localId', 'localID_PI'], axis = 1, inplace = True)

In [None]:
# CLEANING DATE
otherConstruction_df['yearOfConstruction'] = otherConstruction_df['beginLifespanVersion'].apply(get_yearofConstruction)
otherConstruction_df.drop(['beginLifespanVersion'], axis = 1, inplace = True)

In [None]:
# CLEANING gml_id
otherConstruction_df['gml_id'] = otherConstruction_df['gml_id'].apply(get_ID)

In [None]:
otherConstruction_df.head(3)

**________________________**

**________________________**

## 1.4 Cadastral Parcel Layer

In [None]:
display(cadastralParcel.info())
display(cadastralParcel.shape)

In [None]:
# IS END LIFE SPAN VERSION EMPTY?
cadastralParcel['endLifespanVersion'].value_counts() # Empty
# DROP USELESS COLUMNS
cadastralParcel_dropCols = ['endLifespanVersion', 'areaValue_uom', 'namespace']
cadastralParcel.drop(cadastralParcel_dropCols, axis = 1, inplace = True)

In [None]:
# CLEANING YEAR (same as 1.3 - 1.2 - 1.1)
# using get_yearofConstruction
cadastralParcel['yearOfConstruction'] = cadastralParcel['beginLifespanVersion'].apply(get_yearofConstruction)
cadastralParcel.drop(['beginLifespanVersion'], axis = 1, inplace = True)

In [None]:
def drop_allTrue(df, col1, col2):
    if False not in df.apply(lambda x : x[col1] == x[col2], axis = 1).value_counts().index.tolist():
        print(f"-- All True --\n-- Droping {col2} --")
        df.drop([col2], axis = 1, inplace = True)
    else:
        print(f"-- Pass \nThere are inequalities between columns --")

In [None]:
# IS LOCAL_ID THE SAME AS NATIONAL CADASTRAL REFERENCE ??
# If true --> drop column
drop_allTrue(cadastralParcel, 'localId', 'nationalCadastralReference')
# DROP NATIONAL CADASTRAL REFERENCE

In [None]:
cadastralParcel.head(4)

**________________________**

**________________________**

## 1.5 Cadastral Zoning Layer

In [None]:
display(cadastralZoning.info())
display(cadastralZoning.shape)

In [None]:
checking_forUniques(cadastralZoning)

In [None]:
display(cadastralZoning['estimatedAccuracy'].value_counts())
display(cadastralZoning['estimatedAccuracy_uom'].value_counts())
display(cadastralZoning['originalMapScaleDenominator'].value_counts())
display(cadastralZoning['LocalisedCharacterString'].value_counts()) # are they the same ??

In [None]:
cadastralZoning_uselessCols = ['estimatedAccuracy_uom', 'originalMapScaleDenominator', 'estimatedAccuracy', 
                              'namespace', 'endLifespanVersion']
cadastralZoning.drop(cadastralZoning_uselessCols, axis = 1, inplace = True)

In [None]:
# is localID == nationalCadastalZoningReference
cadastralZoning.apply(lambda x : x['localId'] == x['nationalCadastalZoningReference'], axis = 1).value_counts() # All true
cadastralZoning.drop(['nationalCadastalZoningReference'], axis = 1, inplace = True)

In [None]:
# CHANGE YEAR (same as rest)
cadastralZoning['yearOfConstruction'] = cadastralZoning['beginLifespanVersion'].apply(get_yearofConstruction)
cadastralZoning.drop(['beginLifespanVersion'], axis = 1, inplace = True)

In [None]:
cadastralZoning.head(4)

**_______________**

**_______________**

### CHECKING MEMORY USAGE

In [None]:
print(f"Building Layer total memory usage: \t\t\t{building_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Building Parts Layer total memory usage: \t\t{buildingParts_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Other Construction Layer total memory usage: \t\t{otherConstruction_df.memory_usage(index=True).sum()/1000} \tKbytes")

print(f"Cadastral Parcel Layer total memory usage: \t\t{cadastralParcel.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Cadastral Zoning Layer total memory usage: \t\t{cadastralZoning.memory_usage(index=True).sum()/1000} \t\tKbytes")

--- INITIALLY ---

Building Layer total memory usage: 			4825.128 	Kbytes
Building Parts Layer total memory usage: 		2625.128 	Kbytes
Other Construction Layer total memory usage: 	774.16 		Kbytes
Cadastral Parcel Layer total memory usage: 		2200.128 	Kbytes
Cadastral Zoning Layer total memory usage: 		1269.136 	Kbytes

In [None]:
## CHANGE: current size as a percentage of the initial size (initial values in Kbytes, from the list above)

print(f"Building Layer memory optimization: \t\t\t{np.round(building_df.memory_usage(index=True).sum()/(10*4825.128), 2)} \t%")
print(f"Building Parts Layer memory optimization: \t\t{np.round(buildingParts_df.memory_usage(index=True).sum()/(10*2625.128), 2)} \t%")
print(f"Other Construction Layer memory optimization: \t\t{np.round(otherConstruction_df.memory_usage(index=True).sum()/(10*774.16), 2)} \t%")

print(f"Cadastral Parcel Layer memory optimization: \t\t{np.round(cadastralParcel.memory_usage(index=True).sum()/(10*2200.128), 2)} \t%")
print(f"Cadastral Zoning Layer memory optimization: \t\t{np.round(cadastralZoning.memory_usage(index=True).sum()/(10*1269.136), 2)} \t%")
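
Dividing the current byte count by `10 * initial_Kbytes` gives the current size as a percentage of the initial one, so a lower number means a larger saving. A minimal sketch that makes this explicit; the function name and the current size are hypothetical, and the initial value is the Building Parts figure reported earlier.

```python
def remaining_pct(current_bytes, initial_kbytes):
    """Current size as a percentage of the initial size."""
    initial_bytes = initial_kbytes * 1000  # the notebook reports bytes / 1000
    return round(100 * current_bytes / initial_bytes, 2)

initial_kb = 2625.128      # Building Parts layer, initial report
current_b = 1_312_564      # hypothetical size after cleaning, in bytes

pct = remaining_pct(current_b, initial_kb)
print(f"remaining: {pct} %   saved: {round(100 - pct, 2)} %")
```

Reporting both numbers avoids reading "50 %" as either half kept or half saved without saying which.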

**_______________**

**_______________**

### CHECKING WHICH COLUMN IS BEST TO CONNECT TABLES IN THE DATABASE

In [None]:
BU_id = np.array(building_df['gml_id']); display(BU_id); display(len(BU_id))
BP_id = np.array(buildingParts_df['gml_id']); display(BP_id); display(len(BP_id))
BO_id = np.array(otherConstruction_df['gml_id']); display(BO_id); display(len(BO_id))
CP_id = np.array(cadastralParcel['gml_id']); display(CP_id); display(len(CP_id))
CZ_id = np.array(cadastralZoning['gml_id']); display(CZ_id); display(len(CZ_id))
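
One way to compare candidate join keys is to reduce each `gml_id` to its bare reference and look at set overlaps between layers. This is a sketch under stated assumptions: the building references are the ones shown in the outputs earlier with an assumed `_partN` suffix, the parcel IDs are hypothetical, and `cadastral_ref` is an illustrative helper.

```python
def cadastral_ref(gml_id):
    """Strip any '_suffix' and the dotted namespace prefix from a gml_id."""
    # Assumption: ids look like '<namespace>.<reference>[_suffix]'
    return gml_id.split("_", 1)[0].rsplit(".", 1)[-1]

bp_ids = ["ES.SDGC.BU.000200100VK48E_part1",
          "ES.SDGC.BU.000200500VK56E_part1",
          "ES.SDGC.BU.000200500VK56E_part2"]
# Hypothetical parcel ids, only to illustrate the comparison
cp_ids = ["ES.SDGC.CP.000200100VK48E", "ES.SDGC.CP.000999900VK00X"]

bp_refs = {cadastral_ref(i) for i in bp_ids}
cp_refs = {cadastral_ref(i) for i in cp_ids}

shared = bp_refs & cp_refs          # references present in both layers
only_parts = bp_refs - cp_refs      # building parts without a matching parcel
print(sorted(shared))
print(sorted(only_parts))
```

A large `shared` set relative to either layer suggests the stripped reference is a usable key; many unmatched references point to a different key or to incomplete downloads.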