# Cleaning the datasets for Registro Público de Concesiones and zipcodes

This notebook contains all the cleaning applied to the scraped datasets and combines them into a single dataset.

- [Zipcodes](https://xn--cdigospostales-lob.es/listado-de-codigos-postales-de-espana/)
- [Main dataset](https://sedeaplicaciones.minetur.gob.es/RPC_Consulta)
    - Main page dataset
    - Pop-up datasets

## Packages used

- datetime
- numpy
- pandas
- fuzzywuzzy
- math

# Import packages

In [1]:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from datetime import datetime as dt
import math

# Import data

In [2]:
# Main dataset
main_df = pd.read_csv('/Users/niko/Documents/Personal/GitHub/RadioLinkConcessionsSpain/RegistroPublicoConcesiones_General.csv')
# Zipcodes dataset
zipcodes = pd.read_csv('/Users/niko/Downloads/listado-codigos-postales-con-LatyLon.csv',sep=';',dtype={0:str, 1:str, 2:str, 3:object,4:object})
# Pop-up
df = pd.read_csv('/Users/niko/Downloads/RegistroPublicoConcesiones_Frequencies_Castilla_LaMancha.csv',sep=';')
# Population per region
population = pd.read_excel('/Users/niko/Downloads/pobmun21.xlsx',dtype={'CODIGO POSTAL':object})

# Clean the datasets

## Clean the pop-up data
The pop-up data from the 'Consulta del Registro Público de Concesiones' has no zipcodes, longitude or latitude, so it is fuzzy-joined with the zipcodes dataset.<br>
The fuzzy join first cross joins both datasets, then computes a fuzzy ratio for each pair of place names and keeps only the pairs whose ratio exceeds `fuzzy_ratio`.<br>
Duplicate rows are then dropped and only the wanted columns are kept.<br>
The output is a new dataset that includes the zipcode, longitude and latitude per tower.
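The cross join, score and filter steps can be sketched on plain lists. This is a minimal illustration, not the notebook's implementation: it uses the standard library's `difflib.SequenceMatcher` as a rough stand-in for `fuzz.token_sort_ratio`, keeps the best-scoring candidate per name (the notebook keeps the first row that passes the threshold), and the place names are made up.

```python
from difflib import SequenceMatcher
from itertools import product


def token_sort_score(a: str, b: str) -> int:
    """Rough 0-100 similarity: lowercase, sort the tokens, then compare."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def fuzzy_join(left, right, threshold=80):
    """Cross join two lists of names and keep the best match above threshold."""
    best = {}
    # product() yields every (left, right) pair, i.e. a cross join
    for name, candidate in product(left, right):
        score = token_sort_score(name, candidate)
        if score > threshold and score > best.get(name, ("", 0))[1]:
            best[name] = (candidate, score)
    return best


towers = ["Toledo", "Real Ciudad", "Atlantis"]   # hypothetical Municipio values
places = ["Ciudad Real", "Toledo", "Albacete"]   # hypothetical poblacion values
matches = fuzzy_join(towers, places)
```

Names with no candidate above the threshold (here `"Atlantis"`) are simply absent from `matches`, which mirrors how unmatched towers drop out of the joined dataset.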

In [50]:
# main_df.head(5)

In [51]:
# df.head(50)
# df.shape

In [3]:
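# Clean all the pop-up datasets
# Strip whitespace from every (string) column and turn empty strings into NaN
df = df.apply(lambda x: x.str.strip()).replace('', np.nan)
# Forward-fill missing values from the previous row
df = df.ffill()
# Split the raw value into its numeric part and its unit
df[['Frequencias', 'Tipo']] = df['Frecuencia'].str.split(' ', n=1, expand=True)
del df['Frecuencia']
# Spanish number format: '.' is the thousands separator, ',' the decimal mark
df['Frequencias'] = df['Frequencias'].str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype('float')
# Divide MHz values by 1000 so they share a unit with the other rows; other units are left unchanged
df['Frequencias'] = np.where(df['Tipo'] == 'MHz',
                             df['Frequencias'] / 1000,
                             df['Frequencias'])
del df['Tipo']
# Rename Provincia so it stays distinguishable after the later joins
df = df.rename(columns={"Provincia": "referencia_Provincia"})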
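The frequency parsing in the cell above can be checked on single values. The helper below is a hypothetical sketch, assuming the raw strings look like `'1.234,5 MHz'` and that any unit other than MHz is already GHz. Neither assumption is confirmed by the source data.

```python
def parse_frequency_ghz(raw: str) -> float:
    """Convert a string such as '1.234,5 MHz' to a float, assumed to be in GHz."""
    value, unit = raw.strip().split(" ", 1)
    # Spanish number format: '.' groups thousands, ',' marks decimals
    number = float(value.replace(".", "").replace(",", "."))
    # MHz values are divided by 1000; any other unit is returned unchanged
    return number / 1000 if unit == "MHz" else number


mhz_value = parse_frequency_ghz("1.234,5 MHz")
ghz_value = parse_frequency_ghz("7,5 GHz")
```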

### Cleaning of zipcodes

In [53]:
# zipcodes.head(50)

In [4]:
# Define the columns that we want to include in the final dataset
zipcodes_columns = ['codigopostalid','lat','lon']
df_columns = [ 'Referencia','Comunidad','referencia_Provincia','Municipio','Frequencias']
all_new_columns = df_columns+zipcodes_columns

### Join pop-ups and zipcodes

In [5]:
# Define the fuzzy ratio used to include the data
fuzzy_ratio = 80

# Drop duplicate zipcode rows per poblacion (Municipio), keeping the first occurrence
zipcodes = zipcodes.drop_duplicates(subset=['poblacion'], keep='first')

# Create new column to use to join both datasets
zipcodes['merge']='all'
df['merge']='all'

# Join both datasets per row
all_datasets = pd.merge(df,zipcodes,on='merge')

# Create list of tuples based on the columns that we want to use for the join
datasets_tuple = all_datasets[['Municipio', 'poblacion']].apply(tuple, axis=1).tolist()

# Compute the fuzzy ratio for each tuple in the list
all_datasets['ratio'] = [fuzz.token_sort_ratio(*i) for i in datasets_tuple]

# Exclude the pairs with a low match ratio; the threshold is set low because some correct matches have a low score
all_datasets = all_datasets[all_datasets.ratio>fuzzy_ratio]

# Drop all duplicates based on the defined columns and keep all the wanted ones
final_df = all_datasets.drop_duplicates(subset=['Referencia','Municipio','Frequencias'])
final_df = final_df[all_new_columns]

In [6]:
# Intermediate step to save the data
final_df.to_csv('/Users/niko/Documents/Personal/GitHub/RadioLinkConcessionsSpain/popup_zipcodes.csv',sep=';')

### Join pop-ups + zipcodes with population

In [7]:
# Define the fuzzy ratio used to include the data
fuzzy_ratio = 80

# Create new column to use to join both datasets
population['merge']='all'
final_df['merge']='all'

# Join both datasets per row
df_with_pop = pd.merge(final_df,population,on='merge')
del df_with_pop['merge']

# Create list of tuples based on the columns that we want to use for the join
datasets_tuple_with_pop = df_with_pop[['Municipio', 'NOMBRE']].apply(tuple, axis=1).tolist()

# Compute the fuzzy ratio for each tuple in the list
df_with_pop['ratio'] = [fuzz.token_sort_ratio(*i) for i in datasets_tuple_with_pop]

# Exclude the pairs with a low match ratio; the threshold is set low because some correct matches have a low score
df_with_pop = df_with_pop[df_with_pop.ratio>fuzzy_ratio]

# Drop all duplicates based on the defined columns and keep all the wanted ones
final_df_with_pop = df_with_pop.drop_duplicates(subset=['Referencia','Municipio','Frequencias'])

final_df_with_pop = final_df_with_pop[['Referencia','Comunidad','referencia_Provincia','Municipio','Frequencias','codigopostalid','lat','lon','POB21']]

In [8]:
# Intermediate step to save the data
final_df_with_pop.to_csv('/Users/niko/Documents/Personal/GitHub/RadioLinkConcessionsSpain/popup_zipcodes_with_population.csv',sep=';')

In [182]:
# final_df.head()
# final_df.shape

In [137]:
# final_df.tail()

## Clean the main dataset


In [9]:
# Remove the rows that are unnecessary
# Function to return a list with the unique values of a column
def unique(list1):
    x = np.array(list1)
    return list(np.unique(x))

# Creates list of the returned values
list_cities = unique(main_df.Localidad)

# Keep only text values (the city names), dropping numeric and blank entries
new_list_cities = []
for i in list_cities:
    if not i.isnumeric() and i != ' ':
        new_list_cities.append(i)

# Keep only the rows whose Localidad is inside new_list_cities
main_df = main_df.loc[main_df['Localidad'].isin(new_list_cities)]

# --- Work on booleans

# Fill in False to all nulls for specific boolean columns
main_df[['Susceptible cesion','Susceptible mutualizacion','Obtenido por transferencia']] = \
                            main_df[['Susceptible cesion','Susceptible mutualizacion','Obtenido por transferencia']].fillna(False)
                            
main_df['Susceptible cesion'] = main_df['Susceptible cesion'].replace("true", True)
main_df['Obtenido por transferencia'] = main_df['Obtenido por transferencia'].replace("Detalle", True)

# --- Work on the dates

# Select columns that contain dates
date_columns = ['F. Caducidad','F. Concesion']

# Transform date objects to datetime
main_df[date_columns] = main_df[date_columns].apply(pd.to_datetime, errors='coerce',infer_datetime_format=True)

# New features: concession length and time left, in days, months and years
# Months and years are approximations based on the average month (30.44 days) and year (365.25 days)
main_df['total_dia_concesion'] = round((main_df['F. Caducidad'] - main_df['F. Concesion']).dt.days,0).fillna(0).apply(np.int64)
main_df['total_mes_concesion'] = round((main_df['F. Caducidad'] - main_df['F. Concesion']).dt.days/30.44).fillna(0).apply(np.int64)
main_df['total_año_concesion'] = round((main_df['F. Caducidad'] - main_df['F. Concesion']).dt.days/365.25).fillna(0).apply(np.int64)

main_df['left_dia_concesion'] = round((main_df['F. Caducidad'] - dt.now()).dt.days,0).fillna(0).apply(np.int64)
main_df['left_mes_concesion'] = round((main_df['F. Caducidad'] - dt.now()).dt.days/30.44).fillna(0).apply(np.int64)
main_df['left_año_concesion'] = round((main_df['F. Caducidad'] - dt.now()).dt.days/365.25).fillna(0).apply(np.int64)

# Rename column Provincia; after the join with the reference dataset it would otherwise be unclear which is which
main_df = main_df.rename(columns={"Provincia": "main_Provincia"})
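A duration in days can be turned into approximate months and years by dividing by the average month and year length. The worked example below uses made-up dates in place of real `F. Concesion` and `F. Caducidad` values and is only a sanity check of the arithmetic.

```python
from datetime import date

granted = date(2015, 3, 1)   # hypothetical F. Concesion
expires = date(2025, 3, 1)   # hypothetical F. Caducidad

# Ten calendar years containing three leap days (2016, 2020, 2024)
total_days = (expires - granted).days
approx_months = round(total_days / 30.44)    # average month: 365.25 / 12 days
approx_years = round(total_days / 365.25)    # average year including leap years
```

Dividing by 12 or 360 instead would overstate these values, so a ten-year concession is a convenient case to test against.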

## Joined datasets
Joining both datasets to create the final dataset that needs to be standardized.

In [16]:
# Inner join to only get the match that is on Referencia in both datasets
df_joined = pd.merge(main_df,final_df_with_pop, how='inner', left_on = 'Referencia', right_on = 'Referencia')
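An inner join silently drops every `Referencia` that appears in only one of the two datasets, so it can be worth checking how many keys are lost. The sketch below shows the idea with plain sets and made-up reference values. On the real frames the sets could presumably be built with `set(main_df['Referencia'])` and `set(final_df_with_pop['Referencia'])`.

```python
main_refs = {"A-001", "A-002", "A-003"}    # hypothetical Referencia values in the main dataset
popup_refs = {"A-002", "A-003", "B-900"}   # hypothetical Referencia values in the pop-up dataset

kept = main_refs & popup_refs                # keys that survive the inner join
dropped_from_main = main_refs - popup_refs   # main rows with no pop-up match
dropped_from_popup = popup_refs - main_refs  # pop-up rows with no main match
```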

In [17]:
# df_joined.shape

(870, 26)

In [18]:
# Intermediate step to save the data
df_joined.to_csv('/Users/niko/Documents/Personal/GitHub/RadioLinkConcessionsSpain/df_joined.csv',sep=';')