# **Introducing relationships between variables**

Because our database contains few relationships, we decided to add new columns that make relationships between variables easier to establish.

## *Execution requirements*

In [None]:
!pip3 install pyspark
!pip3 install duckdb

import duckdb
import matplotlib.pyplot as plt
import pandas as pd
import pyspark
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, collect_list, count, lit, mean, monotonically_increasing_id,
    rand, to_json, udf, when,
)
from pyspark.sql.types import DoubleType, IntegerType, StringType
from pyspark.sql.window import Window




In [None]:
!git clone https://github.com/OscarMoliina/betterlifebetterhealth.git

fatal: destination path 'betterlifebetterhealth' already exists and is not an empty directory.


In [None]:
spark = SparkSession.builder \
    .appName("Afegir dades") \
    .config("spark.jars", "/content/betterlifebetterhealth/src/utils/duckdb.jar") \
    .getOrCreate()

result = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:duckdb:/content/betterlifebetterhealth/data/db/exploitation_zone.db") \
    .option("driver", "org.duckdb.DuckDBDriver") \
    .option("dbtable", "join_table") \
    .load()

## *Imports and exports between countries*

Next we show the process we followed to add each country's imports and exports, expressed as the sum of transactions.

In [None]:
# Read the CSV file with each country's trade partners and the transactions recorded from 2016 to 2021
df = spark.read.csv("/content/betterlifebetterhealth/data/csv/import_exp_intra_extra_eu.csv", header=True, inferSchema=True)

# Intended to drop the unnamed index column, which carries no information.
# Spark reads that column as '_c0' rather than 'Unnamed: 0', so this call is a no-op;
# '_c0' is dropped in a later cell
df = df.drop('Unnamed: 0')

# Show the DataFrame
df.show()

+---+----+------------+----+---------------+----------------+----------------+----------------+---------------+---------------+----+------+
|_c0|iso2|partner_iso2|flow|       sum_2016|        sum_2017|        sum_2018|        sum_2019|       sum_2020|       sum_2021|ISO3|  name|
+---+----+------------+----+---------------+----------------+----------------+----------------+---------------+---------------+----+------+
|  0|  FR|          NL|   1|4.0947003235E10| 4.5718182715E10| 4.6458798651E10| 4.6644030672E10|4.3815998442E10|5.3880952276E10| FRA|France|
|  1|  FR|          NL|   2| 1.619427874E10| 1.7146236298E10| 1.7883851153E10| 1.7785714636E10|1.6174795048E10|2.0019063594E10| FRA|France|
|  2|  FR|          DE|   1| 9.972720463E10|1.02263568892E11|1.04036942689E11|1.03079522792E11|8.8717998604E10|1.0178303436E11| FRA|France|
|  3|  FR|          DE|   2|7.2269280722E10| 6.9627188729E10|  7.147691828E10| 7.0764166344E10|6.2174692857E10|7.0772838844E10| FRA|France|
|  4|  FR|          

In [None]:
# Compute the mean of the columns 'sum_2016' to 'sum_2021' and add it as the new column 'mitjana_sum'
# We could not find records for earlier years, so we use the mean of later years. If this study were extended
# beyond an academic setting, how well these years represent the study period would need a closer review
df = df.withColumn('mitjana_sum', (col('sum_2016') + col('sum_2017') + col('sum_2018') + col('sum_2019') + col('sum_2020') + col('sum_2021')) / 6)

# Drop the columns we no longer need, to simplify the dataset
df = df.drop('sum_2016', 'sum_2017', 'sum_2018', 'sum_2019', 'sum_2020', 'sum_2021', 'ISO3', 'name', '_c0')

# Show the DataFrame
df.show()

+----+------------+----+--------------------+
|iso2|partner_iso2|flow|         mitjana_sum|
+----+------------+----+--------------------+
|  FR|          NL|   1|    4.62441609985E10|
|  FR|          NL|   2|    1.75339899115E10|
|  FR|          DE|   1|    9.99347119945E10|
|  FR|          DE|   2|6.951418096266667E10|
|  FR|          IT|   1|4.517880949633333...|
|  FR|          IT|   2|3.585052119333333...|
|  FR|          GB|   1|2.238572949033333...|
|  FR|          GB|   2|3.115842170016666...|
|  FR|          IE|   1| 7.184255083166667E9|
|  FR|          IE|   2|       3.379785954E9|
|  FR|          DK|   1|      3.6769071865E9|
|  FR|          DK|   2|3.0566810623333335E9|
|  FR|          GR|   1|1.1480185673333333E9|
|  FR|          GR|   2|      2.3651218625E9|
|  FR|          PT|   1| 6.466880818833333E9|
|  FR|          PT|   2| 5.432393887333333E9|
|  FR|          ES|   1|4.119813002066666...|
|  FR|          ES|   2|3.594075796683333...|
|  FR|          BE|   1|5.68198371
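
The averaging step can be checked by hand with a minimal plain-Python sketch (no Spark). The six values are copied from the first row of the raw table (FR, NL, flow 1); the helper name `mean_of_years` is hypothetical. It returns `None` when any year is missing, which mirrors how adding a null column value in Spark makes the whole sum null.

```python
from typing import Optional, Sequence


def mean_of_years(values: Sequence[Optional[float]]) -> Optional[float]:
    """Average the yearly sums; return None if any year is missing."""
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)


# sum_2016 .. sum_2021 for the row (iso2=FR, partner_iso2=NL, flow=1)
fr_nl = [4.0947003235e10, 4.5718182715e10, 4.6458798651e10,
         4.6644030672e10, 4.3815998442e10, 5.3880952276e10]

fr_nl_mean = mean_of_years(fr_nl)
print(fr_nl_mean)  # 46244160998.5, shown by Spark as 4.62441609985E10
```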

Note that the generated dataset has a column named **flow**, which indicates the direction of the relationship. For example, for the first rows:
  * **Flow** = 1 --> the relationship is FR -> NL, that is, France is the exporter and the Netherlands is the importer.
  * **Flow** = 2 --> the relationship is FR <- NL, the reverse of the previous case.

To simplify, we drop this column and swap the values of the two country columns where needed.
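
As a rough illustration of that swap, here is a plain-Python sketch (no Spark) that turns each `(iso2, partner_iso2, flow)` row into an exporter -> importer pair. The `Trade` tuple and the `normalize` helper are hypothetical names, and the two sample rows are taken from the table above.

```python
from typing import NamedTuple


class Trade(NamedTuple):
    exporter: str
    importer: str
    mean_sum: float


def normalize(iso2: str, partner_iso2: str, flow: int, mean_sum: float) -> Trade:
    """Return the row as exporter -> importer, following the flow convention."""
    if flow == 2:
        # flow 2: the partner exports to the reporting country
        return Trade(partner_iso2, iso2, mean_sum)
    # flow 1: the reporting country exports to the partner
    return Trade(iso2, partner_iso2, mean_sum)


rows = [
    ("FR", "NL", 1, 4.62441609985e10),
    ("FR", "NL", 2, 1.75339899115e10),
]
normalized = [normalize(*row) for row in rows]
for trade in normalized:
    print(trade.exporter, "->", trade.importer, trade.mean_sum)
```

Once every row reads exporter -> importer, the `flow` column carries no extra information and can be dropped.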

In [None]:
# For flow == 2 the partner is the exporter, so the two country columns are swapped
df = df.withColumn('new_iso2', when(df['flow'] == 2, df['partner_iso2']).otherwise(df['iso2']))
# Take the importer from the reporter column, so the swap is valid for any reporter and not only 'FR'
df = df.withColumn('new_partner_iso2', when(df['flow'] == 2, df['iso2']).otherwise(df['partner_iso2']))

# Drop the original columns
df = df.drop('iso2', 'partner_iso2', 'flow')

# Rename the new columns
df = df.withColumnRenamed('new_iso2', 'iso2').withColumnRenamed('new_partner_iso2', 'partner_iso2')

# Show the DataFrame
df.show()

+--------------------+----+------------+
|         mitjana_sum|iso2|partner_iso2|
+--------------------+----+------------+
|    4.62441609985E10|  FR|          NL|
|    1.75339899115E10|  NL|          FR|
|    9.99347119945E10|  FR|          DE|
|6.951418096266667E10|  DE|          FR|
|4.517880949633333...|  FR|          IT|
|3.585052119333333...|  IT|          FR|
|2.238572949033333...|  FR|          GB|
|3.115842170016666...|  GB|          FR|
| 7.184255083166667E9|  FR|          IE|
|       3.379785954E9|  IE|          FR|
|      3.6769071865E9|  FR|          DK|
|3.0566810623333335E9|  DK|          FR|
|1.1480185673333333E9|  FR|          GR|
|      2.3651218625E9|  GR|          FR|
| 6.466880818833333E9|  FR|          PT|
| 5.432393887333333E9|  PT|          FR|
|4.119813002066666...|  FR|          ES|
|3.594075796683333...|  ES|          FR|
|5.681983716016666...|  FR|          BE|
|    3.40062710425E10|  BE|          FR|
+--------------------+----+------------+
only showing top

We now face a problem: to *join* this data with the original database, both must identify countries in the same format. To achieve this, we use the Wikipedia dataset of ISO codes to translate the codes into country names.
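
The translation behaves like a left join against a lookup table. A minimal plain-Python sketch, assuming a tiny hand-written excerpt of the ISO table (the dictionary `iso_names` and the code `"ZZ"` are illustrative, not part of the data): codes without an entry come back as `None`, which corresponds to a null country name after a left join.

```python
from typing import List, Optional, Tuple

# Tiny excerpt of the Alpha-2 code -> English short name mapping
iso_names = {"FR": "France", "NL": "Netherlands", "DE": "Germany"}


def to_country_names(
    pairs: List[Tuple[str, str]]
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Translate (iso2, partner_iso2) pairs; unknown codes become None."""
    return [(iso_names.get(a), iso_names.get(b)) for a, b in pairs]


# "ZZ" is a made-up code used only to show the unmatched case
pairs = [("FR", "NL"), ("DE", "FR"), ("ZZ", "FR")]
named = to_country_names(pairs)
print(named)
```

Rows whose code has no match keep their trade value but lose the country name, so they need to be dropped or inspected before joining on the name.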

In [None]:
# Read the CSV file with the ISO code data
df_iso_codes = spark.read.csv("/content/betterlifebetterhealth/data/csv/wikipedia-iso-country-codes.csv", header=True, inferSchema=True)

# Show the DataFrame
df_iso_codes.show()

+-----------------------------+------------+------------+------------+-------------+
|English short name lower case|Alpha-2 code|Alpha-3 code|Numeric code|   ISO 3166-2|
+-----------------------------+------------+------------+------------+-------------+
|                  Afghanistan|          AF|         AFG|           4|ISO 3166-2:AF|
|                Åland Islands|          AX|         ALA|         248|ISO 3166-2:AX|
|                      Albania|          AL|         ALB|           8|ISO 3166-2:AL|
|                      Algeria|          DZ|         DZA|          12|ISO 3166-2:DZ|
|               American Samoa|          AS|         ASM|          16|ISO 3166-2:AS|
|                      Andorra|          AD|         AND|          20|ISO 3166-2:AD|
|                       Angola|          AO|         AGO|          24|ISO 3166-2:AO|
|                     Anguilla|          AI|         AIA|         660|ISO 3166-2:AI|
|                   Antarctica|          AQ|         ATA|        

In [None]:
from pyspark.sql.functions import col, broadcast

# Create an auxiliary DataFrame with the country names
iso_names = df_iso_codes.select('Alpha-2 code', 'English short name lower case') \
                        .withColumnRenamed('Alpha-2 code', 'iso2') \
                        .withColumnRenamed('English short name lower case', 'country')

# Attach the country name for each ISO-2 code in the trade dataset
df = df.join(broadcast(iso_names), df['iso2'] == iso_names['iso2'], 'left') \
       .drop(iso_names['iso2']) \
       .withColumnRenamed('country', 'iso2_country')

df = df.join(broadcast(iso_names), df['partner_iso2'] == iso_names['iso2'], 'left') \
       .drop(iso_names['iso2']) \
       .withColumnRenamed('country', 'partner_iso2_country')

df.show()


+--------------------+----+------------+--------------+--------------------+
|         mitjana_sum|iso2|partner_iso2|  iso2_country|partner_iso2_country|
+--------------------+----+------------+--------------+--------------------+
|    4.62441609985E10|  FR|          NL|        France|         Netherlands|
|    1.75339899115E10|  NL|          FR|   Netherlands|              France|
|    9.99347119945E10|  FR|          DE|        France|             Germany|
|6.951418096266667E10|  DE|          FR|       Germany|              France|
|4.517880949633333...|  FR|          IT|        France|               Italy|
|3.585052119333333...|  IT|          FR|         Italy|              France|
|2.238572949033333...|  FR|          GB|        France|      United Kingdom|
|3.115842170016666...|  GB|          FR|United Kingdom|              France|
| 7.184255083166667E9|  FR|          IE|        France|             Ireland|
|       3.379785954E9|  IE|          FR|       Ireland|              France|

In [None]:
# Drop the original code columns
exports_df = df.drop('iso2', 'partner_iso2')
exports_df.show()

+--------------------+--------------+--------------------+
|         mitjana_sum|  iso2_country|partner_iso2_country|
+--------------------+--------------+--------------------+
|    4.62441609985E10|        France|         Netherlands|
|    1.75339899115E10|   Netherlands|              France|
|    9.99347119945E10|        France|             Germany|
|6.951418096266667E10|       Germany|              France|
|4.517880949633333...|        France|               Italy|
|3.585052119333333...|         Italy|              France|
|2.238572949033333...|        France|      United Kingdom|
|3.115842170016666...|United Kingdom|              France|
| 7.184255083166667E9|        France|             Ireland|
|       3.379785954E9|       Ireland|              France|
|      3.6769071865E9|        France|             Denmark|
|3.0566810623333335E9|       Denmark|              France|
|1.1480185673333333E9|        France|              Greece|
|      2.3651218625E9|        Greece|              Franc

In [None]:
# Inspect the original DataFrame so that we can join on its 'Country' column
result.show()


In [None]:
# Build a list of the unique countries in the 'Country' column of the 'result' DataFrame
unique_countries = result.select("Country").distinct().rdd.flatMap(lambda x: x).collect()

# Filter 'exports_df' so that both 'iso2_country' and 'partner_iso2_country' are in 'unique_countries'
filtered_exports_df = exports_df.filter(
    (col("iso2_country").isin(unique_countries)) &
    (col("partner_iso2_country").isin(unique_countries))
)

# Show the filtered result
filtered_exports_df.show()


+--------------------+--------------+--------------------+
|         mitjana_sum|  iso2_country|partner_iso2_country|
+--------------------+--------------+--------------------+
|    4.62441609985E10|        France|         Netherlands|
|    1.75339899115E10|   Netherlands|              France|
|    9.99347119945E10|        France|             Germany|
|6.951418096266667E10|       Germany|              France|
|4.517880949633333...|        France|               Italy|
|3.585052119333333...|         Italy|              France|
|2.238572949033333...|        France|      United Kingdom|
|3.115842170016666...|United Kingdom|              France|
| 7.184255083166667E9|        France|             Ireland|
|       3.379785954E9|       Ireland|              France|
|      3.6769071865E9|        France|             Denmark|
|3.0566810623333335E9|       Denmark|              France|
|1.1480185673333333E9|        France|              Greece|
|      2.3651218625E9|        Greece|              Franc

In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank

# Define a window per 'iso2_country', ordered by 'mitjana_sum' in descending order
windowSpec = Window.partitionBy("iso2_country").orderBy(col("mitjana_sum").desc())

# Add a 'rank' column that ranks the rows within each window.
# Note: the ranking runs on 'exports_df', not on 'filtered_exports_df'
ranked_exports_df = exports_df.withColumn("rank", rank().over(windowSpec))

# Keep only the top-ranked row (the highest value) for each 'iso2_country'
highest_exports_df = ranked_exports_df.filter(col("rank") == 1).drop("rank")
highest_exports_df.show()

+--------------------+-------------------+--------------------+
|         mitjana_sum|       iso2_country|partner_iso2_country|
+--------------------+-------------------+--------------------+
|       4.298435183E9|               NULL|              France|
| 6.751955533333333E7|        Afghanistan|              France|
|1.3797336723333333E9|            Albania|              France|
| 4.731788065166667E9|            Algeria|              France|
|            963359.0|     American Samoa|              France|
|        9.28728823E8|            Andorra|              France|
|1.3103392146666667E9|             Angola|              France|
|           3625983.5|           Anguilla|              France|
|   9365310.333333334|         Antarctica|              France|
|1.5346806233333334E8|Antigua and Barbuda|              France|
|2.4844731536666665E9|          Argentina|              France|
|1.7494520783333334E8|            Armenia|              France|
|1.2095208033333333E8|              Arub
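
The ranking step keeps, for each exporting country, the partner with the largest mean. A plain-Python sketch of the same idea (no Spark), using four rows from the earlier output; the helper name `top_partners` is hypothetical. Like `rank()`, it keeps every row that ties for first place, so a country can end up with more than one row.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[float, str, str]  # (mitjana_sum, iso2_country, partner_iso2_country)


def top_partners(rows: List[Row]) -> Dict[str, List[Row]]:
    """Keep, per exporter, the row(s) with the largest mean (ties are kept)."""
    best: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        mean_sum, exporter, _partner = row
        current = best[exporter]
        if not current or mean_sum > current[0][0]:
            best[exporter] = [row]
        elif mean_sum == current[0][0]:
            current.append(row)
    return dict(best)


rows = [
    (4.62441609985e10, "France", "Netherlands"),
    (9.99347119945e10, "France", "Germany"),
    (1.75339899115e10, "Netherlands", "France"),
    (6.951418096266667e10, "Germany", "France"),
]
top = top_partners(rows)
print(top["France"])  # [(99934711994.5, 'France', 'Germany')]
```

If exactly one row per country is required, ties have to be broken explicitly, for example with a secondary sort key.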

In [None]:
cleaned_df = highest_exports_df.dropna()

In [None]:
cleaned_df.show()

+--------------------+-------------------+--------------------+
|         mitjana_sum|       iso2_country|partner_iso2_country|
+--------------------+-------------------+--------------------+
| 6.751955533333333E7|        Afghanistan|              France|
|1.3797336723333333E9|            Albania|              France|
| 4.731788065166667E9|            Algeria|              France|
|            963359.0|     American Samoa|              France|
|        9.28728823E8|            Andorra|              France|
|1.3103392146666667E9|             Angola|              France|
|           3625983.5|           Anguilla|              France|
|   9365310.333333334|         Antarctica|              France|
|1.5346806233333334E8|Antigua and Barbuda|              France|
|2.4844731536666665E9|          Argentina|              France|
|1.7494520783333334E8|            Armenia|              France|
|1.2095208033333333E8|              Aruba|              France|
| 9.431808756833334E9|          Australi

In [None]:
# 'result' (the original table) and 'cleaned_df' (top export partner per country)
# are already loaded as Spark DataFrames in the existing session

# Perform the join
joined_df = result.join(
    cleaned_df,
    result.Country == cleaned_df.iso2_country,
    "left"  # Use "left" to keep all records from the original DataFrame
)

# Select the required columns, including the new ones from cleaned_df
df_final = joined_df.select(
    result["*"],  # Include all columns from the original dataset
    cleaned_df["partner_iso2_country"],
    cleaned_df["mitjana_sum"]
)

# Show the final DataFrame
df_final.show()


## *Classifying countries by region*

We add a relationship that divides Europe into four zones: Eastern Europe, Southern Europe, Western Europe and Northern Europe. The code that follows departs from this list in two places: Turkey is assigned to a separate Asia group, and Kosovo is not listed, so it falls under 'Other'.

* **Northern Europe:** Sweden, Finland, Norway, Denmark, Iceland, Estonia, Latvia, Lithuania
* **Eastern Europe:** Poland, Slovakia, Czech Republic, Hungary, Romania, Bulgaria, Belarus, Ukraine, Moldova, Russia, Albania, Kosovo, North Macedonia, Montenegro, Serbia, Bosnia and Herzegovina
* **Southern Europe:** Italy, Spain, Portugal, Greece, Turkey, Cyprus, Malta, Croatia, Slovenia
* **Western Europe:** Germany, France, Belgium, Netherlands, Austria, Switzerland, Luxembourg, United Kingdom, Ireland
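
One way to sketch this grouping is a dictionary lookup in plain Python (no Spark). The country lists are copied from the notebook's code cell, so Turkey appears in the Asia group and Kosovo is absent; the names `REGIONS` and `COUNTRY_TO_REGION` and the overlap check are illustrative additions, not part of the original notebook.

```python
REGIONS = {
    "Europe North": ["Sweden", "Finland", "Norway", "Denmark", "Iceland",
                     "Estonia", "Latvia", "Lithuania"],
    "Europe East": ["Poland", "Slovakia", "Hungary", "Romania", "Bulgaria",
                    "Belarus", "Ukraine", "Albania", "Moldova", "Czech Republic",
                    "Russia", "North Macedonia", "Montenegro", "Serbia",
                    "Bosnia and Herzegovina"],
    "Europe South": ["Italy", "Spain", "Portugal", "Greece", "Cyprus", "Malta",
                     "Croatia", "Slovenia"],
    "Europe West": ["Germany", "France", "Belgium", "Netherlands", "Austria",
                    "Switzerland", "Luxembourg", "United Kingdom", "Ireland"],
    "Asia": ["Turkey", "Israel", "Georgia", "Azerbaijan", "Armenia", "Kazakhstan",
             "Uzbekistan", "Turkmenistan", "Tajikistan", "Kyrgyzstan"],
}

# Invert the mapping once: country name -> region label
COUNTRY_TO_REGION = {
    country: region
    for region, countries in REGIONS.items()
    for country in countries
}

# Sanity check: a country listed in two regions would shrink the inverted mapping
assert len(COUNTRY_TO_REGION) == sum(len(c) for c in REGIONS.values())


def assign_region(country: str) -> str:
    """Return the region label, or 'Other' for countries not listed."""
    return COUNTRY_TO_REGION.get(country, "Other")


print(assign_region("Sweden"))  # Europe North
print(assign_region("Kosovo"))  # Other
```

The lookup is an exact string match, so the names must be spelled exactly as in the `Country` column.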

In [None]:
# Lists of countries per region (including Asia)
north_europe = ['Sweden', 'Finland', 'Norway', 'Denmark', 'Iceland', 'Estonia', 'Latvia', 'Lithuania']
east_europe = ['Poland', 'Slovakia', 'Hungary', 'Romania', 'Bulgaria', 'Belarus', 'Ukraine', 'Albania', 'Moldova', 'Czech Republic', 'Russia', 'North Macedonia', 'Montenegro', 'Serbia', 'Bosnia and Herzegovina']
south_europe = ['Italy', 'Spain', 'Portugal', 'Greece', 'Cyprus', 'Malta', 'Croatia', 'Slovenia']
west_europe = ['Germany', 'France', 'Belgium', 'Netherlands', 'Austria', 'Switzerland', 'Luxembourg', 'United Kingdom', 'Ireland']
asia = ['Turkey', 'Israel', 'Georgia', 'Azerbaijan', 'Armenia', 'Kazakhstan', 'Uzbekistan', 'Turkmenistan', 'Tajikistan', 'Kyrgyzstan']

# Function that assigns a region to each country
def assign_region(country):
    if country in north_europe:
        return 'Europe North'
    elif country in east_europe:
        return 'Europe East'
    elif country in south_europe:
        return 'Europe South'
    elif country in west_europe:
        return 'Europe West'
    elif country in asia:
        return 'Asia'
    else:
        return 'Other'

region_udf = udf(assign_region, StringType())

# Create a new column holding the region
df_final = df_final.withColumn('Region', region_udf(df_final['Country']))

## *EU membership*
We add a relationship that distinguishes the countries that belong to the European Union from those that do not.
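
A membership label of this kind reduces to a set lookup. The sketch below is plain Python (no Spark); the 27 names are the ones used in the notebook's list, while the helper `label_membership` and its `name` parameter are hypothetical and only show how one function could serve several organisations.

```python
from typing import FrozenSet

EU_COUNTRIES: FrozenSet[str] = frozenset([
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic",
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
    "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta",
    "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
    "Spain", "Sweden",
])


def label_membership(country: str, members: FrozenSet[str], name: str) -> str:
    """Return '<name> Member' or 'Non-<name> Member' using an exact name match."""
    return f"{name} Member" if country in members else f"Non-{name} Member"


print(label_membership("France", EU_COUNTRIES, "EU"))   # EU Member
print(label_membership("Norway", EU_COUNTRIES, "EU"))   # Non-EU Member
# Exact matching: an alternative spelling is not recognised
print(label_membership("Czechia", EU_COUNTRIES, "EU"))  # Non-EU Member
```

Because the match is exact, a country spelled differently in the data would silently be labelled as a non-member, so the spellings are worth checking against the `Country` column.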

In [None]:
eu_countries = [
    'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark',
    'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy',
    'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal',
    'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden'
]

# Function that determines whether a country belongs to the EU
def is_eu_member(country):
    return 'EU Member' if country in eu_countries else 'Non-EU Member'

# UDF to apply the function in Spark
eu_member_udf = udf(is_eu_member, StringType())

df_final = df_final.withColumn('EU Membership', eu_member_udf(df_final['Country']))



## *NATO membership*
NATO, the North Atlantic Treaty Organization, is an international military alliance established in 1949, in the context of the Cold War, as a response to the influence and threat of communist expansion led by the USSR. We add membership as a separate column to define another group of countries.


In [None]:
# List of NATO countries (the 30 members before Finland and Sweden joined in 2023 and 2024)
nato_countries = [
    'Albania', 'Belgium', 'Bulgaria', 'Canada', 'Croatia', 'Czech Republic', 'Denmark',
    'Estonia', 'France', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Italy', 'Latvia',
    'Lithuania', 'Luxembourg', 'Montenegro', 'Netherlands', 'North Macedonia', 'Norway',
    'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Turkey', 'United Kingdom', 'United States'
]

# Function that determines whether a country belongs to NATO
def is_nato_member(country):
    return 'NATO Member' if country in nato_countries else 'Non-NATO Member'

# UDF to apply the function in Spark
nato_member_udf = udf(is_nato_member, StringType())

df_final = df_final.withColumn('NATO Membership', nato_member_udf(df_final['Country']))

## *Bordering countries*

In [None]:
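# Read the country borders file (one row per pair of bordering countries)
fronteres = spark.read.option("header", "true").csv('/content/betterlifebetterhealth/data/csv/GEODATASOURCE-COUNTRY-BORDERS.CSV')

In [None]:
# Collect the bordering countries of each country into a single list
borders_grouped = fronteres.groupBy("country_name").agg(collect_list("country_border_name").alias("border_countries"))

In [None]:
# Left join, so countries without border data are kept
df_final = df_final.join(borders_grouped, df_final['Country'] == borders_grouped['country_name'], 'left')

In [None]:
# Drop the duplicated join key
df_final = df_final.drop("country_name")

In [None]:
# Serialise the list as a JSON string so it can be stored in a single column
df_final = df_final.withColumn("border_countries", to_json(col("border_countries")))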
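
The grouping and serialisation above can be pictured with a small plain-Python sketch (no Spark). The four border rows are illustrative examples rather than rows read from the CSV file, and the exact JSON spacing may differ from what Spark's `to_json` produces.

```python
import json
from collections import defaultdict

# (country_name, country_border_name) pairs, one row per border
border_rows = [
    ("France", "Spain"),
    ("France", "Belgium"),
    ("Spain", "France"),
    ("Spain", "Portugal"),
]

# Group the neighbours of each country into a list
grouped = defaultdict(list)
for country, neighbour in border_rows:
    grouped[country].append(neighbour)

# Serialise each list as a JSON string, one value per country
border_json = {country: json.dumps(names) for country, names in grouped.items()}
print(border_json["France"])  # ["Spain", "Belgium"]

# The string can be turned back into a list when needed
print(json.loads(border_json["Spain"]))  # ['France', 'Portugal']
```

Storing the list as a JSON string keeps the column a plain text type, which is easier to write to a relational table or a CSV file than an array column.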

In [None]:
df_final.columns

['Country',
 'Year',
 'Area_Km2',
 'CBR',
 'CDR',
 'Deaths',
 'E0',
 'Medage',
 'MR0_4',
 'Pop_Dens',
 'GSCA',
 'Schizophrenia (%)',
 'Bipolar disorder (%)',
 'Eating disorders (%)',
 'Anxiety disorders (%)',
 'Drug use disorders (%)',
 'Depression (%)',
 'Alcohol use disorders (%)',
 'Total population',
 'Population density, pers per sq km',
 'Total population, male (%)',
 'Total population, female (%)',
 'Mean age of women at birth of first child',
 'Women in the Labour Force, Percent of corresponding total for both sexes',
 'Female tertiary students, percent of total',
 'Female legislators, senior officials and managers, percent of total',
 'Female professionals, percent of total for both sexes',
 'Female clerks, percent of total for both sexes',
 'Female craft and related workers, percent of total for both sexes',
 'Female plant and machine operators and assemblers, percent of total for both sexes',
 'Female members of parliament, percent of total',
 'Total employment, growth rate'

In [None]:
df_final.show()


## *Immigration*

In [None]:
# Build every ordered (Origin, Destination) pair of distinct countries
countries_df = df_final.select("Country").distinct()
origins = countries_df.withColumnRenamed("Country", "Origin")
destinations = countries_df.withColumnRenamed("Country", "Destination")
immigration_df = origins.crossJoin(destinations)
immigration_df = immigration_df.where(immigration_df["Origin"] != immigration_df["Destination"])
# Placeholder values: 'Immigrants' is a seeded pseudo-random integer in [0, 10000), not real migration data
immigration_df = immigration_df.withColumn("Immigrants", (rand(seed=1234) * 10000).cast("integer"))
immigration_df.show()

+------+------------+----------+
|Origin| Destination|Immigrants|
+------+------------+----------+
|Sweden|      Turkey|      7151|
|Sweden|     Germany|      8334|
|Sweden|      France|      2093|
|Sweden|      Greece|      2352|
|Sweden|    Slovakia|      8935|
|Sweden|     Belgium|      4203|
|Sweden|     Albania|      5003|
|Sweden|     Finland|      4713|
|Sweden|     Belarus|      5746|
|Sweden|       Malta|      6553|
|Sweden|  Tajikistan|      5974|
|Sweden|     Croatia|      4554|
|Sweden|       Italy|      5296|
|Sweden|   Lithuania|      6262|
|Sweden|      Norway|      1459|
|Sweden|Turkmenistan|      6688|
|Sweden|       Spain|      1841|
|Sweden|     Denmark|      5184|
|Sweden|     Ireland|      5914|
|Sweden|     Ukraine|      5332|
+------+------------+----------+
only showing top 20 rows
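
The table above is a cross join of the country list with itself, minus the pairs where origin and destination coincide, filled with random values. A plain-Python sketch of the same construction (no Spark), assuming a three-country list for brevity; the numbers it produces are placeholders and will not match the ones generated by Spark's `rand`.

```python
import itertools
import random

countries = ["Sweden", "Turkey", "Germany"]

# Seeded generator, so the placeholder values are reproducible
rng = random.Random(1234)

# permutations(..., 2) yields every ordered pair of distinct countries
immigration = [
    (origin, destination, rng.randrange(10000))
    for origin, destination in itertools.permutations(countries, 2)
]

for origin, destination, immigrants in immigration:
    print(origin, destination, immigrants)
```

For `n` countries this yields `n * (n - 1)` rows, so the table grows quadratically with the number of countries.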



In [None]:
# Attach the columns of the origin country to every (Origin, Destination) pair
df_final = immigration_df.join(df_final, immigration_df["Origin"] == df_final["Country"], "left")

In [None]:
df_final.show()


In [None]:
df_final.write \
    .format("jdbc") \
    .option("url", "jdbc:duckdb:exploitation_zone_2.db") \
    .option("dbtable", "relations_added") \
    .option("driver", "org.duckdb.DuckDBDriver") \
    .mode("append") \
    .save()

In [None]:
df_f = df_final.toPandas()
# Save the DataFrame to a CSV file
df_f.to_csv('/content/betterlifebetterhealth/data/csv/df_relations.csv', index=False)


### **Saving the db to GitHub:**

In [None]:
# Move into the database directory of the cloned repository
%cd betterlifebetterhealth/data/db

# Copy the new .db files from /content into this directory
!cp -R /content/*.db .

/content/betterlifebetterhealth/data/db


In [None]:
# CODE INCLUDED IN THE DOCUMENTATION; OMITTED HERE FOR USER PRIVACY

[main 50cd8a3] Update db files new
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 data/db/exploitation_zone_2.db
Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 2 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 197.17 KiB | 5.19 MiB/s, done.
Total 5 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/OscarMoliina/betterlifebetterhealth.git
   1fa8450..50cd8a3  main -> main
