### DATA FROM CATASTRO INSPIRE
___________________________________________________________________________________________________

Obtained through QGIS after experimenting for a while with:
- WMS service: http://ovc.catastro.meh.es/cartografia/INSPIRE/spadgcwms.aspx
- WFS buildings: http://ovc.catastro.meh.es/INSPIRE/wfsBU.aspx?
- All INSPIRE services: http://www.catastro.minhap.es/webinspire/index.html

*Note: the WMS/WFS channels return the layers as they were before harmonization with the European INSPIRE programme, which may be the reason they do not work well.

Several GitHub projects focus on querying the Catastro with Python:
- **[pyCatastro](https://github.com/gisce/pycatastro)**: allows API-style queries.
>>- Does not allow a full download of the data by municipality and province
>>- Returns dictionaries
>>- To download Madrid, I would have to fetch (1) all streets and (2) all street-type abbreviations (siglas), then iterate, making a huge number of calls (not a good idea)

- **[catastro-lib-python](https://github.com/sperea/catastro-lib-python)**: appears to be a predecessor of the previous one. It has not been tested, since it stopped being updated in 2018
- **[Catastro Inspire Downloader](https://github.com/geomatico/cidownloader)**: made by the same cartographers and developers as the QGIS plugin; it downloads data in .geopackage format
>>- Allows downloading by province, municipality and projection
>>- It is inconsistent: when downloading Madrid it sometimes returns only CadastralParcel or only BuildingParts data. Compared with the GMLs included in the ZIPs that can be downloaded separately, the data is incomplete. This is possibly a problem with the GDAL library.



In [1]:
import re
from itertools import combinations
from itertools import chain

import pandas as pd
import numpy as np

import geopandas as gpd
import geojson

pd.set_option('display.max_rows', 1000)

In [2]:
RENAMING_OF_COLS = {'gml_id': 'ID',
                    'localId_part': 'ID_part',
                    'localId_PI': 'ID_pool',
                    'numberOfFloorsAboveGround': 'nFloors_AG',
                    'numberOfFloorsBelowGround': 'nFloors_BG',
                    'heightAboveGround': 'height_AG',
                    'heightBelowGround': 'height_BG',
                    'areaValue': 'area_m2p',
                    'value': 'area_m2c'}

GEOMETRY_COLS = ['geometry', 'pos']

### FUNCTIONAL PIPELINE PROTOTYPING

In [3]:
# ------------------------------------------------------------ checking which columns should be purged

def str_forUniques(num):
    """
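    Return a label for the number of unique values counted in checking_forUniques
    """
    if num == 0: return 'ALL NULLS'
    else: return 'Unique items'

def checking_forUniques(gdf):
    """
    input:  GeoDataFrame with a .name attribute
    output: list of non-geometry columns with at most one distinct non-null value
    """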
    cols_with_one_element = []
    
    print(f"\n-------------------- Current Layers in {gdf.name} ------------------------")
    print(f"------------------------------------------------------------------------")
    
    for i,col in enumerate(gdf.columns.tolist()):
        if (col not in GEOMETRY_COLS):
            unique_len = len(gdf[str(col)].value_counts().tolist())
            
            if unique_len == 0: 
                print(f"{i+1}. {col}:\t\t\t{unique_len}\t{str_forUniques(unique_len)}")                
            elif len(col) <= 12 and unique_len != 0: 
                print(f"{i+1}. {col}:\t\t\t\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 12 < len(col) <= 19 and unique_len != 0: 
                print(f"{i+1}. {col}:\t\t\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 19 < len(col) <= 28 and unique_len != 0: 
                print(f"{i+1}. {col}:\t\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 28 < len(col) <= 36 and unique_len != 0: 
                print(f"{i+1}. {col}:\t\t{unique_len}\t{str_forUniques(unique_len)}")
            elif 36 < len(col) and unique_len != 0: 
                print(f"{i+1}. {col}:\t{unique_len}\t{str_forUniques(unique_len)}")
            else: pass
            
            if (unique_len == 1) or (unique_len == 0): cols_with_one_element.append(col)
            else: pass            
        else: pass
    
    print(f"------------------------------------------------------------------------\n")
    return cols_with_one_element

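def droping_DupCols(gdf, drop_cols = True):
    """
    Drop, in place, the columns reported by checking_forUniques:
    those that are all nulls or hold a single repeated value
    """   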
    if drop_cols:
        cols_to_drop = checking_forUniques(gdf)
        
        print(f"-------------- Droping DUPLICATED COLUMNS in {gdf.name} ------------------")
        [print(f'{i+1}. {col}\v') for i, col in enumerate(cols_to_drop)] # repr without new line
        
        gdf.drop(cols_to_drop, 
                 axis=1, inplace = True)
        
        print(f"-- Finished task -----------------------------------------------------\n")
    else: pass
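The screening done by `checking_forUniques` can also be expressed with `DataFrame.nunique`, which ignores nulls by default. The block below is a minimal sketch of that idea, not the notebook's pipeline: the helper name `constant_or_empty_columns` and the toy values are made up for illustration, and only the `geometry`/`pos` exclusion is taken from `GEOMETRY_COLS` above.

```python
import pandas as pd


def constant_or_empty_columns(df, skip=("geometry", "pos")):
    """Return the columns holding at most one distinct non-null value."""
    # Leave out geometry-like columns, mirroring GEOMETRY_COLS
    candidates = df.drop(columns=[c for c in skip if c in df.columns])
    # nunique() counts distinct non-null values per column
    distinct = candidates.nunique(dropna=True)
    return distinct[distinct <= 1].index.tolist()


# Toy frame (illustrative values only, not real cadastral data)
toy = pd.DataFrame({
    "localId": ["a", "b", "c"],
    "namespace": ["same", "same", "same"],
    "endLifespanVersion": [None, None, None],
    "pos": ["1 2", "3 4", "5 6"],
})

print(constant_or_empty_columns(toy))
```

Returning the list without printing or dropping anything keeps the check reusable; the caller can then decide whether to pass it to `drop`.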

In [143]:
# ------------------------------------------------------------ separate ID_parts if needed

def get_part(x):
    """
    input: string shaped like ID_partXX (or with a '.' right before XX)
    output: XX as int
    Get the numeric item XX from an ID_partXX string
    """
    part_str = x.split('_')[1]
    
    if '.' in part_str: return int(part_str.split('.')[1])
    elif 't' in part_str: return int(part_str.split('t')[1])
    else: raise ValueError(f"Could not find a part number in {x!r}")

def get_ID(x):
    """
    input: localID_partXX
    output: localID
    """
    return x.split('_')[0]

def separate_parts(gdf, cols = ['']):
    """
    If it is a geodf with ID_partXX then both parts are separated in different cols
    This is necessary to be able to join gdfs
    """
    print(f"-------------- Checking for COLS to separate in {gdf.name} --------------")
    assert type(cols) == list
    
    c = 0
    for col in cols:
        # check membership first so that a missing column does not raise a KeyError
        if col in gdf.columns and '_' in str(gdf[col].iloc[0]):
    
            print(f"{c+1}. {col}\t\t Separated")            
            splited_col_name = re.split(r"_", gdf[col].tolist()[0])
            part_title = re.findall(r"\D+", splited_col_name[1])

            gdf[col + f'_{part_title[0]}'] = gdf[col].apply(get_part).astype(dtype = 'int64')
            gdf[col] = gdf[col].apply(get_ID)
            c += 1
        else: print(f"No columns to separate")
            
    print(f"-- Finished task -----------------------------------------------------\n")
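The splitting done by `get_part` and `get_ID` can be sketched with a single regular expression that fails loudly on unexpected input. This is only an illustration under assumptions: the identifier shapes `<localId>_part<N>` and `<localId>_PI.<N>` are inferred from the column names seen in this notebook, and `PART_ID_PATTERN` and `split_part_id` are hypothetical names.

```python
import re

# Assumed shape: "<localId>_<kind><number>", with an optional dot before the number
PART_ID_PATTERN = re.compile(
    r"^(?P<local_id>[^_]+)_(?P<kind>[A-Za-z]+)\.?(?P<number>\d+)$"
)


def split_part_id(raw_id):
    """Split e.g. 'X_part2' or 'X_PI.1' into (local_id, kind, number)."""
    match = PART_ID_PATTERN.match(raw_id)
    if match is None:
        # Raise instead of returning None, so bad IDs are noticed early
        raise ValueError(f"Unrecognised part identifier: {raw_id!r}")
    return match["local_id"], match["kind"], int(match["number"])


print(split_part_id("000200500VK56E_part2"))
print(split_part_id("0006601VK3800E_PI.1"))
```

With this pattern the kind comes back as `PI` without the trailing dot, which would line up with the `localId_PI` key in `RENAMING_OF_COLS`.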

In [144]:
# ------------------------------------------------------------ datetime operations

def get_year(strng):
    """
    Input:  timestamp string such as 'YYYY-MM-DDTHH:MM:SS'
    Output: year as string
    
    Note_____________________________________________________________
    pandas Timestamps only cover years roughly within 1677 - 2262
    Using pandas for older dates would require a custom Period
    String operations seem easier in this case
    """
    first_w = strng.split('T')[0]
    return first_w.split('-')[0]

def getYearOfConstruction(gdf, LifeSpanCol = 'beginLifespanVersion', drop_col = True):
    """
    Cleaning Datetime
    """
    print(f"-- Getting YEAR OF CONSTRUCTION in {gdf.name} --------------------------")
    
    if LifeSpanCol in gdf.columns.tolist(): 
        gdf['yearOfConstruction'] = gdf[LifeSpanCol].apply(get_year)
    
        if drop_col:
            print(f"Droping col {drop_col}: \t{LifeSpanCol}")
            gdf.drop(LifeSpanCol, axis = 1, inplace = True)        

        else: print(f"Droping col {drop_col}: \t{LifeSpanCol}")
    
    print(f"-- Finished task -----------------------------------------------------\n")

In [145]:
# ------------------------------------------------------------ Droping duplicated columns

def check_allTrue(gdf, col1, col2):
    """
    This function is used in ....
    """
    print(f"-- Checking if PAIRS are ALL TRUE {gdf.name} ---------------")
    
    # some columns are heights and others are numbers of floors; multiplying by 3 reconciles them
    if False not in gdf.apply(lambda x: (x[col1] == x[col2]) or (x[col1] == 3*x[col2]) or (3*x[col1] == x[col2]),
                              axis = 1).value_counts().index.tolist():
        
        print(f"All True --\n-- Droping {col2}")
        gdf.drop([col2], axis = 1, inplace = True)
    else:
        print(f"Pass \tThere are inequalities between columns")


def checking_forIdenCols(gdf, drop_cols = True):
    """
    Note_____________________________________________________________
    Columns with the same number of unique values are suspected of
    giving the same (or nearly the same) information; to simplify
    the database, the second column of each such pair is purged
    """
    print(f"------------- Checking for SAME LEN COLS in {gdf.name} -----------------")
    
    # 1 // creating vars for search
    cols = [col for col in gdf.columns.tolist() if (col not in GEOMETRY_COLS)]
    len_unique_cols = [len(gdf[col].value_counts().tolist()) for col in cols]
    equal_cols, del_cols = [], []
    
    # 2 // creating pairs of columns that are suspect of giving the same information
    for tup_len, tup_col in zip(list(combinations(len_unique_cols, 2)), list(combinations(cols, 2))):
        if tup_len[0] == tup_len[1]:
            equal_cols.append([tup_col[0], tup_col[1]])
        else: pass

    # 3 // if drop_cols, drop the second column of each suspect pair (row-level equality is not checked here)
    if drop_cols and len(equal_cols) != 0:        
        for pair in equal_cols:
            
            if pair[1] not in del_cols:
                gdf.drop(pair[1], axis = 1, inplace = True)
                del_cols.append(pair.pop(1))
            else: pass
                
        print(f"1. Deleted   columns: ") # repr without new line
        [print(f'\t\t\t{i+1}. {col} \v') for i, col in enumerate(del_cols)]; print('\n')
        
        print(f"2. Remaining columns: ") # repr without new line
        [print(f'\t\t\t{i+1}. {col} \v') for i, col in enumerate(list(chain.from_iterable(equal_cols)))]
        
    # 4 // printing columns that remain after purging
    elif len(equal_cols) == 0: print('List to return is empty')
    else: 
        print(f"Remaining columns: \n") # repr without new line
        [print(f'\t\t\t{i+1}. {col} \v') for i, col in enumerate(list(chain.from_iterable(equal_cols)))]

    print(f"-- Finished task -----------------------------------------------------\n")
    return list(chain.from_iterable(equal_cols))

In [146]:
# ------------------------------------------------------------ Unify ID columns if gml_id is dropped
    
def get_strPoint(x):
    """
    Return the last dot-separated part of a gml_id (the cadastral ID); helper for shorten_localID
    """
    return x.split('.')[-1]

def shorten_localID(gdf, cols_to_shorten = ['gml_id']):
    """
    If localId is dropped in favor of gml_id,
    the namespace prefix is stripped from the ID
    """
    print(f"------- Checking for ID col to shorten in {gdf.name} -------------------")
    
    shorted_localID = np.vectorize(get_strPoint)   
    for col_ID in cols_to_shorten:
        if col_ID in gdf.columns.tolist():
            print(f'Shortening columns: {col_ID}')
            gdf[col_ID] = shorted_localID(gdf[col_ID])
            
        else: print(f'Nothing to shorten')
    
    print(f"-- Finished task -----------------------------------------------------\n")
    
# ------------------------------------------------------------ LAST STEP, unify columns names

def rename_cols(gdf):
    """
    If col not in dict, then pass.
    This is used to unify all geojson
    """
    print(f"--------------- Renaming cols in {gdf.name} ----------------------------")
    
    dict_cols_to_rename = RENAMING_OF_COLS # dict
    cols_to_rename = [col for col in gdf.columns.tolist() if col in dict_cols_to_rename.keys()]
    
    gdf.rename(columns = dict_cols_to_rename, inplace = True) # before: after
            
    print(f"1. Initial name: ") # repr without new line
    [print(f'\t\t\t{i+1}. {col} \v') for i, col in enumerate(cols_to_rename)]

    print(f"2. Final name: ") # repr without new line
    [print(f'\t\t\t{i+1}. {dict_cols_to_rename[col]} \v') for i, col in enumerate(cols_to_rename)]
    
    print(f"-- Finished task -----------------------------------------------------\n")

In [147]:
def rawData_infoCleaning(gdf, 
                         drop_cols = True, 
                         cols_to_separate = ['localId', 'gml_id'],
                         datetime_col = 'beginLifespanVersion'):
    """
    Pipeline towards clearer data
    """
    print(f"Initiating cleaning pipeline -----------------------------------------\n")
    
    # -- 1 -- SEARCHING FOR COLS WITHOUT DATA IN {gdf.name} -----------------------
    droping_DupCols(gdf, drop_cols = drop_cols)
    # -- 2 -- SEARCHING FOR UNIQUE COLS {gdf.name} --------------------------------
    checking_forUniques(gdf)
    # -- 3 -- SEARCHING FOR COLS TO SEPARATE {gdf.name} ---------------------------
    separate_parts(gdf = gdf, cols = cols_to_separate)
    # -- 4 -- SEARCHING FOR DUPLICATED INFO {gdf.name} ----------------------------
    checking_forIdenCols(gdf, drop_cols = drop_cols)
    # -- 5 -- REFORMATTING DATA IN {gdf.name} -------------------------------------
    getYearOfConstruction(gdf, LifeSpanCol = datetime_col, drop_col = drop_cols)
    shorten_localID(gdf)
    # -- 6 -- RENAMING INFORMATION IN {gdf.name} ----------------------------------
    rename_cols(gdf)
    

    print(f"Closing cleaning pipeline --------------------------------------------\n")

### AVAILABLE LAYERS

Given the volume of data, the inspection is carried out on a subset of it

In [148]:
CATASTRO_PATH = '../data/raw/catastro'

# The available data contains 5 layers in geojson format

#building_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.building.geojson", rows = 25000)
#buildingParts_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.buildingpart.geojson", rows = 25000) # pt 1
otherConstruction_df = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.BU.28900.otherconstruction.geojson", rows = 25000)
cadastralParcel = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.CP.28900.cadastralparcel.geojson", rows = 25000)
cadastralZoning = gpd.read_file(f"{CATASTRO_PATH}/A.ES.SDGC.CP.28900.cadastralzoning.geojson", rows = 25000)

Memory usage check

In [149]:
#print(f"Building Layer total memory usage: \t\t\t{building_df.memory_usage(index=True).sum()/1000} \tKbytes")
#print(f"Building Parts Layer total memory usage: \t\t{buildingParts_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Other Construction Layer total memory usage: \t\t{otherConstruction_df.memory_usage(index=True).sum()/1000} \t\tKbytes")
print(f"Cadastral Parcel Layer total memory usage: \t\t{cadastralParcel.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Cadastral Zoning Layer total memory usage: \t\t{cadastralZoning.memory_usage(index=True).sum()/1000} \tKbytes")

Other Construction Layer total memory usage: 		774.16 		Kbytes
Cadastral Parcel Layer total memory usage: 		2200.128 	Kbytes
Cadastral Zoning Layer total memory usage: 		1269.136 	Kbytes


In [150]:
## NAMES OF THE GEODATAFRAMES
# TODO: in an object-based version, make the name match the filename
#building_df.name = 'BU_ALL'
#buildingParts_df.name = 'BU_PARTS'
otherConstruction_df.name = 'BU_OTHER'
cadastralParcel.name = 'CAD_PARCEL'
cadastralZoning.name = 'CAD_ZONING'

## 1.1 Building Layer

In QGIS, this layer represents the built-up part of the plots.

In [None]:
display(building_df.info())
display(building_df.shape)

In [None]:
building_df.isna().sum()

**GETTING RID OF USELESS COLUMNS**

In [None]:
# DROP NULL COLUMNS
# There are a couple of columns that do not offer any information

building_nullCols = ['numberOfFloorsAboveGround', 'endLifespanVersion']
building_linkCols = ['documentLink', 'format', 'informationSystem']
building_measureCols = ['horizontalGeometryEstimatedAccuracy', 'value_uom', 'sourceStatus',
                        'horizontalGeometryEstimatedAccuracy_uom', 'horizontalGeometryReference', 
                        'officialAreaReference']

building_df.drop(building_nullCols,    axis=1, inplace = True)
building_df.drop(building_linkCols,    axis=1, inplace = True)
building_df.drop(building_measureCols, axis=1, inplace = True)

# horizontalGeometryEstimatedAccuracy is always 0.1m accuracy

In [None]:
# CORRECTING GROSS FLOOR AREA
# officialAreaReference - value - value_uom are columns that refer to the same information

building_df.rename(columns={"value": "grossFloorArea"}, inplace = True)
# building_df.drop(['officialAreaReference'], axis=1, inplace = True)

In [None]:
# SIMPLIFYING DATE COLUMNS
# differences between them might be due to the gap between registration and CFO.
# the end column is not meaningful in the context of this project

building_dateCols = ['beginLifespanVersion', 'beginning', 'end']

# ARE beginning and end the same

building_df['Equal_beg_end'] = building_df.apply(lambda x: x['beginning'] == x['end'], axis = 1)
building_df['Equal_beg_end'].value_counts()

Let's look at the rest of the DATE columns.
The [bibliography](http://www.catastro.minhap.es/webinspire/documentos/Conjuntos%20de%20datos.pdf) says:
>- **beginLifespanVersion**: date on which the record was registered in the cadastral database.
>- **dateOfConstruction**: structure that defines the date of construction. It consists of two attributes, **bu-core2d:beginning and bu-core2d:end**; the values are the construction dates of each construction unit. If there is more than one, **the "beginning" field holds the oldest and the "end" field the most recent**. They always refer to 1 January.

**______**
For this project, only the beginning column is of interest

In [None]:
# Not sure what this means; these could be buildings that were demolished or abandoned
# Let's look at the first row, reference = 000207800VK56E
# Sede Electronica del Catastro says that this building was built in 2004, the *end* year
# The most plausible cause is that the building is abandoned, under construction, or in a reconstruction phase

# conditionOfConstruction should be different
# Let's check that

c_functional, c_declined, c_ruin = 0, 0, 0

for ref in building_df[building_df['Equal_beg_end'] == False]['reference'].tolist() :
    if building_df[building_df['reference'] == ref]['conditionOfConstruction'].tolist()[0] == 'functional':
        c_functional += 1
    elif building_df[building_df['reference'] == ref]['conditionOfConstruction'].tolist()[0] == 'declined':
        c_declined += 1
    elif building_df[building_df['reference'] == ref]['conditionOfConstruction'].tolist()[0] == 'ruin':
        c_ruin += 1
    else:
        pass

print(f"FOR BUILDINGS WITH DIFFERENT BEGINNING AND END DATES: CONDITION OF CONSTRUCTION")
print(f"Functional : \t{c_functional}")
print(f"Declined : \t{c_declined}")
print(f"Ruins : \t{c_ruin}")

# There should be another category
# Buildings that may have another new cadastral reference for unknown reasons
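The per-reference loop above can also be written as a boolean filter followed by `value_counts`. The sketch below runs on a made-up toy frame rather than on `building_df`, and it matches the loop only under the assumption that each `reference` appears once.

```python
import pandas as pd

# Toy stand-in for building_df (illustrative values only)
toy = pd.DataFrame({
    "reference": ["r1", "r2", "r3", "r4"],
    "beginning": ["1990-01-01T00:00:00", "2000-01-01T00:00:00",
                  "1950-01-01T00:00:00", "1980-01-01T00:00:00"],
    "end": ["1990-01-01T00:00:00", "2004-01-01T00:00:00",
            "1975-01-01T00:00:00", "1999-01-01T00:00:00"],
    "conditionOfConstruction": ["functional", "functional", "ruin", "functional"],
})

# Keep only the rows whose beginning and end dates differ
mismatched = toy[toy["beginning"] != toy["end"]]

# Count each construction condition among those rows
condition_counts = mismatched["conditionOfConstruction"].value_counts()
print(condition_counts)
```

Categories that never occur are simply absent from the result, so `condition_counts.get("declined", 0)` is a safe way to read a count that may be zero.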

In [None]:
# LETS SEE CONDITION OF CONSTRUCTION COLUMN
building_df['conditionOfConstruction'].value_counts()

In [None]:
# LET'S ADD another category for demolished buildings, one that reflects beginning != end
# therefore, buildings where construction has taken place are usually functional buildings
# 
# As stated above, only the BEGINNING column matters for the purposes of this project

building_df.drop(['Equal_beg_end', 'end', 'beginLifespanVersion'], axis = 1, inplace = True)

In [None]:
import datetime as dt

building_df['dateOfConstruction'] = building_df['beginning'].apply(lambda x: 
                                    dt.datetime.strptime(x,'%Y-%m-%dT%H:%M:%S'))

# Out of bounds nanosecond timestamp: 1640-01-01 00:00:00
# pandas requires dates to fall roughly within the years 1677 - 2262,
# Given the nature of the data, pandas datetime methods cannot be used in this case

def get_yearofConstruction(strng):
    """
    Input:  string
    Output: year as string
    """
    first_w = strng.split('T')[0]
    # Not using datetime from pandas, not valid for this case
    return first_w.split('-')[0]

building_df['yearOfConstruction'] = building_df['beginning'].apply(get_yearofConstruction)
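The out-of-bounds limit belongs to pandas' nanosecond `Timestamp`, not to Python's standard `datetime`, which supports years 1 to 9999. As a hedged alternative to string splitting, a value can be parsed on its own and reduced to its year before it reaches a datetime column. `year_from_iso` is a hypothetical helper name; the format string is the one used in the cell above.

```python
from datetime import datetime


def year_from_iso(timestamp):
    """Return the year of a 'YYYY-MM-DDTHH:MM:SS' string as an int."""
    # Standard-library datetime accepts years well before 1677
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S").year


print(year_from_iso("1640-01-01T00:00:00"))
print(year_from_iso("2004-01-01T00:00:00"))
```

Unlike the string split, this also validates the timestamp: a malformed value raises `ValueError` instead of yielding an arbitrary fragment.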

In [None]:
building_df.drop(['beginning', 'dateOfConstruction'], axis = 1, inplace = True)

**____________________________________________________________________________________________________________**

In [None]:
# THERE ARE COLUMNS THAT MAY DUPLICATE INFORMATION
# To join with the rest of data, parcels, buildingparts... share an ID
# Which col is the ID ??

cols_id = ['gml_id', 'reference', 'localId', 'namespace']

# Reference == localID ?? 
# See if the values are unique or repeated (this has to be cross-checked with the rest of the geojson files)

building_df[cols_id].describe()

# All elements are unique (as expected)
# SHOULD COINCIDE WITH PARCELS
# gml_id for later

In [None]:
# Are reference and localID the same? 
building_df.apply(lambda x: x['reference'] == x['localId'], axis = 1).value_counts() # TRUE
# What is reference Geometry ?
building_df['referenceGeometry'].value_counts() # All true --> Dropping

# Dropping reference and namespace
building_df.drop(['reference', 'namespace', 'referenceGeometry'], axis = 1, inplace = True)

In [None]:
building_df.head(2)

**____________________________________________________________________________________________________________**

**____________________________________________________________________________________________________________**

## 1.2 Building Parts Layer

In [151]:
print(f"{buildingParts_df.info()}\n")
print(f"Shape of {buildingParts_df.name}: \t\t\t     {buildingParts_df.shape}")

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   gml_id                     25000 non-null  object  
 1   beginLifespanVersion       25000 non-null  object  
 2   localId                    25000 non-null  object  
 3   numberOfFloorsAboveGround  25000 non-null  int64   
 4   heightBelowGround          25000 non-null  int64   
 5   numberOfFloorsBelowGround  25000 non-null  int64   
 6   geometry                   25000 non-null  geometry
dtypes: geometry(1), int64(3), object(3)
memory usage: 1.3+ MB
None

Shape of BU_PARTS: 			     (25000, 7)


In [152]:
rawData_infoCleaning(buildingParts_df, 
                     drop_cols = True, 
                     cols_to_separate = ['localId', 'gml_id'], 
                     datetime_col = 'beginLifespanVersion')

Initiating cleaning pipeline -----------------------------------------


-------------------- Current Layers in BU_PARTS ------------------------
------------------------------------------------------------------------
1. gml_id:					25000	Unique items
2. beginLifespanVersion:			974	Unique items
3. localId:					25000	Unique items
4. numberOfFloorsAboveGround:			21	Unique items
5. heightBelowGround:				11	Unique items
6. numberOfFloorsBelowGround:			11	Unique items
------------------------------------------------------------------------

-------------- Droping DUPLICATED COLUMNS in BU_PARTS ------------------
-- Finished task -----------------------------------------------------


-------------------- Current Layers in BU_PARTS ------------------------
------------------------------------------------------------------------
1. gml_id:					25000	Unique items
2. beginLifespanVersion:			974	Unique items
3. localId:					25000	Unique items
4. numberOfFloorsAboveGround:			21	Unique items
5. h

In [153]:
buildingParts_df.head(3)

Unnamed: 0,ID,nFloors_AG,height_BG,geometry,ID_part,yearOfConstruction
0,000200100VK48E,1,0,"POLYGON ((441657.574 4487050.292, 441658.164 4...",1,2004
1,000200500VK56E,1,0,"POLYGON ((451584.920 4467181.410, 451586.250 4...",1,2013
2,000200500VK56E,1,0,"POLYGON ((451612.580 4467215.370, 451618.650 4...",2,2013


**________________________**

**________________________**

## 1.3 Other Construction Layer

In [154]:
# All data corresponds to OPEN-AIR POOLS
# I don't need this dataset to begin with
# BUT let's clean it anyway

In [155]:
display(otherConstruction_df.info())
display(otherConstruction_df.shape)

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 13822 entries, 0 to 13821
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   gml_id                   13822 non-null  object  
 1   beginLifespanVersion     13822 non-null  object  
 2   conditionOfConstruction  0 non-null      object  
 3   localId                  13822 non-null  object  
 4   namespace                13822 non-null  object  
 5   constructionNature       13822 non-null  object  
 6   geometry                 13822 non-null  geometry
dtypes: geometry(1), object(6)
memory usage: 756.0+ KB


None

(13822, 7)

In [156]:
rawData_infoCleaning(otherConstruction_df, 
                     drop_cols = True, 
                     cols_to_separate = ['localId', 'gml_id'], 
                     datetime_col = 'beginLifespanVersion')

Initiating cleaning pipeline -----------------------------------------


-------------------- Current Layers in BU_OTHER ------------------------
------------------------------------------------------------------------
1. gml_id:					13822	Unique items
2. beginLifespanVersion:			2191	Unique items
3. conditionOfConstruction:			0	ALL NULLS
4. localId:					13822	Unique items
5. namespace:					1	Unique items
6. constructionNature:				1	Unique items
------------------------------------------------------------------------

-------------- Droping DUPLICATED COLUMNS in BU_OTHER ------------------
1. conditionOfConstruction
2. namespace
3. constructionNature
-- Finished task -----------------------------------------------------


-------------------- Current Layers in BU_OTHER ------------------------
------------------------------------------------------------------------
1. gml_id:					13822	Unique items
2. beginLifespanVersion:			2191	Unique items
3. localId:					13822	Unique items
------

In [157]:
otherConstruction_df.head(3)

Unnamed: 0,ID,geometry,localId_PI.,yearOfConstruction
0,0006601VK3800E,"POLYGON ((429964.360 4480456.150, 429961.960 4...",1,2017
1,0007201VK3800E,"POLYGON ((429883.850 4480596.541, 429888.120 4...",2,2003
2,0007202VK3800E,"POLYGON ((429967.989 4480563.520, 429968.469 4...",3,2006


**________________________**

## 1.4 Cadastral Parcel Layer

In [158]:
display(cadastralParcel.info())
display(cadastralParcel.shape)

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   gml_id                      25000 non-null  object  
 1   areaValue                   25000 non-null  int64   
 2   areaValue_uom               25000 non-null  object  
 3   beginLifespanVersion        25000 non-null  object  
 4   endLifespanVersion          0 non-null      object  
 5   localId                     25000 non-null  object  
 6   namespace                   25000 non-null  object  
 7   label                       25000 non-null  object  
 8   nationalCadastralReference  25000 non-null  object  
 9   pos                         25000 non-null  object  
 10  geometry                    25000 non-null  geometry
dtypes: geometry(1), int64(1), object(9)
memory usage: 2.1+ MB


None

(25000, 11)

In [159]:
rawData_infoCleaning(cadastralParcel, 
                     drop_cols = True, 
                     cols_to_separate = ['localId', 'gml_id'], 
                     datetime_col = 'beginLifespanVersion')

Initiating cleaning pipeline -----------------------------------------


-------------------- Current Layers in CAD_PARCEL ------------------------
------------------------------------------------------------------------
1. gml_id:					25000	Unique items
2. areaValue:					5910	Unique items
3. areaValue_uom:				1	Unique items
4. beginLifespanVersion:			1446	Unique items
5. endLifespanVersion:			0	ALL NULLS
6. localId:					25000	Unique items
7. namespace:					1	Unique items
8. label:					768	Unique items
9. nationalCadastralReference:			25000	Unique items
------------------------------------------------------------------------

-------------- Droping DUPLICATED COLUMNS in CAD_PARCEL ------------------
1. areaValue_uom
2. endLifespanVersion
3. namespace
-- Finished task -----------------------------------------------------


-------------------- Current Layers in CAD_PARCEL ------------------------
------------------------------------------------------------------------
1. gml_id:					

In [160]:
cadastralParcel.head(4)

Unnamed: 0,ID,area_m2p,label,pos,geometry,yearOfConstruction
0,000200500VK56E,1268,5,451607.03 4467199.27,"MULTIPOLYGON (((451599.360 4467174.940, 451584...",2013
1,000205600VK56E,297,56,451592.44 4467170.83,"MULTIPOLYGON (((451593.000 4467163.250, 451579...",2013
2,000205700VK56E,155,57,451587.93 4467161.77,"MULTIPOLYGON (((451590.040 4467156.910, 451576...",2013
3,000205800VK56E,174,58,451584.62 4467155.04,"MULTIPOLYGON (((451586.650 4467149.650, 451573...",2013


**________________________**

## 1.5 Cadastral Zoning Layer

In [161]:
display(cadastralZoning.info())
display(cadastralZoning.shape)

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 12202 entries, 0 to 12201
Data columns (total 13 columns):
 #   Column                           Non-Null Count  Dtype   
---  ------                           --------------  -----   
 0   gml_id                           12202 non-null  object  
 1   beginLifespanVersion             12202 non-null  object  
 2   endLifespanVersion               0 non-null      object  
 3   estimatedAccuracy                12202 non-null  float64 
 4   estimatedAccuracy_uom            12202 non-null  object  
 5   localId                          12202 non-null  object  
 6   namespace                        12202 non-null  object  
 7   label                            12202 non-null  object  
 8   LocalisedCharacterString         12202 non-null  object  
 9   nationalCadastalZoningReference  12202 non-null  object  
 10  originalMapScaleDenominator      12202 non-null  int64   
 11  pos                              12202 non-null  object  
 

None

(12202, 13)

In [162]:
## are they the same? ---> YES, they are
display(cadastralZoning['estimatedAccuracy'].value_counts())
display(cadastralZoning['originalMapScaleDenominator'].value_counts())
display(cadastralZoning['LocalisedCharacterString'].value_counts()) # are they the same ??

cadastralZoning.apply(lambda x : x['estimatedAccuracy'] == 0.5 and \
                                 x['LocalisedCharacterString'] == 'MANZANA ' and \
                                 x['originalMapScaleDenominator'] == 500,
                      axis = 1).value_counts() # All true
print("""
Nota_______________________________________________________________________ \n
estimatedAccuracy == originalMapScaleDenominator == LocalisedCharacterString\n
              0.5 == 500                     500 == MANZANA                 \n
              1.0 == 5000                   5000 == POLIGONO                \n
""")

0.5    12111
1.0       91
Name: estimatedAccuracy, dtype: int64

500     12111
5000       91
Name: originalMapScaleDenominator, dtype: int64

MANZANA      12111
POLIGONO        91
Name: LocalisedCharacterString, dtype: int64


Nota_______________________________________________________________________ 

estimatedAccuracy == originalMapScaleDenominator == LocalisedCharacterString

              0.5 == 500                     500 == MANZANA                 

              1.0 == 5000                   5000 == POLIGONO                
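Matching value counts suggest, but do not prove, that the three columns encode the same information. One way to confirm it is to check that the distinct value pairs form a one-to-one mapping. The sketch below is an assumption-laden illustration: `is_one_to_one` is a hypothetical helper and the toy frame only mimics the values shown above.

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """True when every value of col_a pairs with exactly one value of col_b, and vice versa."""
    # Distinct (col_a, col_b) combinations present in the data
    pairs = df[[col_a, col_b]].drop_duplicates()
    # One-to-one means no value repeats on either side of the pairs
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)


# Toy frame mimicking the cadastral zoning values (illustrative only)
toy = pd.DataFrame({
    "estimatedAccuracy": [0.5, 0.5, 1.0],
    "originalMapScaleDenominator": [500, 500, 5000],
    "LocalisedCharacterString": ["MANZANA", "MANZANA", "POLIGONO"],
    "label": ["1", "2", "3"],
})

print(is_one_to_one(toy, "estimatedAccuracy", "originalMapScaleDenominator"))
print(is_one_to_one(toy, "estimatedAccuracy", "label"))
```

A check like this covers every value combination at once, so it does not depend on hard-coding a single triple such as 0.5 / 500 / MANZANA.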




In [163]:
rawData_infoCleaning(cadastralZoning, 
                     drop_cols = True, 
                     cols_to_separate = ['localId', 'gml_id'], 
                     datetime_col = 'beginLifespanVersion')

Initiating cleaning pipeline -----------------------------------------


-------------------- Current Layers in CAD_ZONING ------------------------
------------------------------------------------------------------------
1. gml_id:					12202	Unique items
2. beginLifespanVersion:			2381	Unique items
3. endLifespanVersion:			0	ALL NULLS
4. estimatedAccuracy:				2	Unique items
5. estimatedAccuracy_uom:			1	Unique items
6. localId:					12202	Unique items
7. namespace:					1	Unique items
8. label:					11507	Unique items
9. LocalisedCharacterString:			2	Unique items
10. nationalCadastalZoningReference:		12202	Unique items
11. originalMapScaleDenominator:			2	Unique items
------------------------------------------------------------------------

-------------- Droping DUPLICATED COLUMNS in CAD_ZONING ------------------
1. endLifespanVersion
2. estimatedAccuracy_uom
3. namespace
-- Finished task -----------------------------------------------------


-------------------- Current Layers in CA
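
The log above reports `endLifespanVersion`, `estimatedAccuracy_uom` and `namespace` as dropped, and all three have either a single unique value or only nulls. A minimal sketch of that kind of step follows. It is not the body of `rawData_infoCleaning`, which is defined elsewhere in this notebook and may work differently. The helper name `drop_constant_columns` and the toy values are hypothetical.

```python
import pandas as pd

def drop_constant_columns(df):
    """Return (copy of df without constant columns, list of dropped names).

    A column counts as constant when it holds at most one distinct
    non-null value; all-null columns therefore qualify as well.
    """
    constant = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    return df.drop(columns=constant), constant

# Toy frame with made-up values that mimics the situation in the log.
toy = pd.DataFrame({
    "gml_id": ["a", "b", "c"],
    "endLifespanVersion": [None, None, None],   # all nulls
    "namespace": ["ns", "ns", "ns"],            # single value
    "estimatedAccuracy": [0.5, 1.0, 0.5],       # two values, kept
})

cleaned, dropped = drop_constant_columns(toy)
print("Dropped:", dropped)
print("Kept:", list(cleaned.columns))
```

Returning a new frame instead of modifying in place makes it easy to compare the layer before and after the step.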

In [164]:
cadastralZoning.head()

Unnamed: 0,ID,estimatedAccuracy,label,pos,geometry,yearOfConstruction
0,28900A000,1.0,0,440302.42 4492704.6,"MULTIPOLYGON (((440301.968 4492704.637, 440302...",2019
1,28900A001,1.0,1,444909.27 4497953.5,"MULTIPOLYGON (((444555.092 4499363.444, 444568...",2019
2,28900A002,1.0,2,446474.92 4496527.94,"MULTIPOLYGON (((445948.540 4496495.370, 445945...",2013
3,28900A003,1.0,3,447749.11 4494280.98,"MULTIPOLYGON (((449091.096 4493635.359, 449089...",2014
4,28900A004,1.0,4,447065.76 4493887.71,"MULTIPOLYGON (((445797.560 4495048.620, 445790...",2013


**_______________**

**_______________**

### CHECKING MEMORY USAGE

In [165]:
#print(f"Building Layer total memory usage: \t\t\t{building_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Building Parts Layer total memory usage: \t\t{buildingParts_df.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Other Construction Layer total memory usage: \t\t{otherConstruction_df.memory_usage(index=True).sum()/1000} \tKbytes")

print(f"Cadastral Parcel Layer total memory usage: \t\t{cadastralParcel.memory_usage(index=True).sum()/1000} \tKbytes")
print(f"Cadastral Zoning Layer total memory usage: \t\t{cadastralZoning.memory_usage(index=True).sum()/1000} \t\tKbytes")

Building Parts Layer total memory usage: 		1200.128 	Kbytes
Other Construction Layer total memory usage: 		442.432 	Kbytes
Cadastral Parcel Layer total memory usage: 		1200.128 	Kbytes
Cadastral Zoning Layer total memory usage: 		585.824 		Kbytes


--- INITIALLY---

    Building Layer total memory usage: 			     4825.128 	Kbytes
    Building Parts Layer total memory usage: 		 2625.128 	Kbytes
    Other Construction Layer total memory usage: 	 774.16 	Kbytes
    Cadastral Parcel Layer total memory usage: 		 2200.128 	Kbytes
    Cadastral Zoning Layer total memory usage: 		 1269.136 	Kbytes

In [166]:
## CHANGE: memory still in use, as a percentage of the initial usage
## bytes / (10 * initial_Kbytes) == 100 * bytes / initial_bytes

#print(f"Building Layer memory vs. initial: \t\t\t{np.round(building_df.memory_usage(index=True).sum()/(10*4825.128), 2)} \t%")
print(f"Building Parts Layer memory vs. initial: \t\t{np.round(buildingParts_df.memory_usage(index=True).sum()/(10*2625.128), 2)} \t%")
print(f"Other Construction Layer memory vs. initial: \t\t{np.round(otherConstruction_df.memory_usage(index=True).sum()/(10*774.16), 2)} \t%")

print(f"Cadastral Parcel Layer memory vs. initial: \t\t{np.round(cadastralParcel.memory_usage(index=True).sum()/(10*2200.128), 2)} \t%")
print(f"Cadastral Zoning Layer memory vs. initial: \t\t{np.round(cadastralZoning.memory_usage(index=True).sum()/(10*1269.136), 2)} \t%")

Building Parts Layer memory vs. initial: 		45.72 	%
Other Construction Layer memory vs. initial: 		57.15 	%
Cadastral Parcel Layer memory vs. initial: 		54.55 	%
Cadastral Zoning Layer memory vs. initial: 		46.16 	%
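
The percentages above are the share of the initial memory that each layer still uses, so the actual saving is 100 minus that figure. A small sketch of both calculations is given below, using hypothetical helpers `memory_kb` and `reduction_pct`. It also passes `deep=True`, because by default `DataFrame.memory_usage` counts only the pointers of `object` columns and not the strings they refer to. Whether that matters for these layers is an assumption that would need checking on the real data.

```python
import pandas as pd

def memory_kb(df, deep=True):
    """Total memory of df in Kbytes (1 Kbyte = 1000 bytes, as above)."""
    return df.memory_usage(index=True, deep=deep).sum() / 1000

def reduction_pct(before_kb, after_kb):
    """Percentage of memory saved when going from before_kb to after_kb."""
    return round(100 * (1 - after_kb / before_kb), 2)

# Figures taken from the Building Parts layer reported above.
saved = reduction_pct(2625.128, 1200.128)
print(f"Building Parts Layer memory saved: {saved} %")

# Toy frame (made-up values) showing the effect of deep=True on object columns.
toy = pd.DataFrame({"label": pd.Series(["MANZANA"] * 1000, dtype=object)})
shallow_kb = memory_kb(toy, deep=False)
deep_kb = memory_kb(toy, deep=True)
print(f"shallow: {shallow_kb} Kbytes, deep: {deep_kb} Kbytes")
```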


**_______________**

**_______________**

### CHECKING WHICH COLUMN IS BEST TO CONNECT TABLES IN THE DATABASE

In [None]:
BU_id = np.array(building_df['gml_id']); display(BU_id); display(len(BU_id))
BP_id = np.array(buildingParts_df['gml_id']); display(BP_id); display(len(BP_id))
BO_id = np.array(otherConstruction_df['gml_id']); display(BO_id); display(len(BO_id))
CP_id = np.array(cadastralParcel['gml_id']); display(CP_id); display(len(CP_id))
CZ_id = np.array(cadastralZoning['gml_id']); display(CZ_id); display(len(CZ_id))
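
Reading the ID arrays by eye gets hard at this size. One possible way to judge a candidate join column is to check whether it is unique and non-null within a layer, and what fraction of one layer's IDs also appear in another. The sketch below does that on made-up identifiers. `key_report` and `shared_fraction` are hypothetical helpers, and the toy IDs are not real cadastral references.

```python
import pandas as pd

def key_report(ids):
    """Summarise whether a sequence of ids could serve as a primary key."""
    s = pd.Series(ids)
    return {
        "rows": len(s),
        "unique": int(s.nunique()),
        "is_key": bool(s.is_unique and s.notna().all()),
    }

def shared_fraction(child_ids, parent_ids):
    """Fraction of child ids that also appear among parent ids."""
    child = pd.Series(child_ids)
    return float(child.isin(set(parent_ids)).mean())

# Toy identifiers: two building parts belong to parcel P1, one id has no parcel.
parcel_ids = ["P1", "P2", "P3"]
part_ids = ["P1", "P1", "P2", "P9"]

print(key_report(parcel_ids))   # candidate primary key
print(key_report(part_ids))     # repeated ids, so only usable as a foreign key
print(shared_fraction(part_ids, parcel_ids))
```

A column that is a key in one layer and is almost fully contained in it from another layer is a reasonable candidate for the relation between the two tables.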