# What factors influence the technological innovation of countries?

In this notebook, we will analyze the factors that influence the technological innovation of countries.

## 1. Data Integration

In this section, we will integrate three datasets containing information about the technological innovation of countries into the main dataset. The new datasets are:

- `journal_articles.csv`
- `exports_percentages.csv`
- `exports_values.csv`


First, let's import the necessary libraries and the main dataset, which has already been cleaned and preprocessed.

In [36]:
import polars as pl

In [37]:
cleaned_data: pl.DataFrame = pl.read_csv("../data/cleaned/data.csv")

# Get the first 5 rows
cleaned_data.head()

Country,Density(P/Km2),Abbreviation,Agricultural Land(%),Land Area(Km2),Armed Forces size,Birth Rate,Co2-Emissions,CPI,CPI Change (%),Currency-Code,Fertility Rate,Forested Area (%),Gasoline Price,GDP,Gross primary education enrollment (%),Gross tertiary education enrollment (%),Infant mortality,Largest city,Life expectancy,Maternal mortality ratio,Minimum wage,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban population
str,i64,str,f64,i64,i64,f64,i64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,str,f64,i64,f64,f64,f64,i64,f64,f64,f64,f64,i64
"""Afghanistan""",60,"""AF""",58.1,652230,323000.0,32.49,8672,149.9,2.3,"""AFN""",4.47,2.1,0.7,19101000000.0,104.0,9.7,47.9,"""Kabul""",64.5,638.0,0.43,78.4,0.28,38041754,48.9,9.3,71.4,11.12,9797273
"""Albania""",105,"""AL""",43.1,28748,9000.0,11.78,4536,119.05,1.4,"""ALL""",1.62,28.1,1.36,15278000000.0,107.0,55.0,7.8,"""Tirana""",78.5,15.0,1.12,56.9,1.2,2854191,55.7,18.6,36.6,12.33,1747593
"""Algeria""",18,"""DZ""",17.4,2381741,317000.0,24.28,150006,151.36,2.0,"""DZD""",3.02,0.8,0.28,169990000000.0,109.9,51.4,20.1,"""Algiers""",76.7,112.0,0.95,28.1,1.72,43053054,41.2,37.2,66.1,11.7,31510100
"""Andorra""",164,"""AD""",40.0,468,,7.2,469,,,"""EUR""",1.27,34.0,1.51,3154100000.0,106.4,,2.7,"""Andorra la Vella""",,,6.63,36.4,3.33,77142,,,,,67873
"""Angola""",26,"""AO""",47.5,1246700,117000.0,40.73,34693,261.73,17.1,"""AOA""",5.52,46.3,0.97,94635000000.0,113.5,9.3,51.6,"""Luanda""",60.8,241.0,0.71,33.4,0.21,31825295,77.5,9.2,49.1,6.89,21061025


Now, we will load the new datasets and check their contents. The new information includes:

- Scientific and technical journal articles.
- High-technology exports (current US$).
- High-technology exports (% of manufactured exports).

In [38]:
journal_articles: pl.DataFrame = pl.read_csv(
    "../data/raw/technological_innovation/journal_articles.csv"
)
exports_values: pl.DataFrame = pl.read_csv(
    "../data/raw/technological_innovation/exports_values.csv"
)
exports_percentages: pl.DataFrame = pl.read_csv(
    "../data/raw/technological_innovation/exports_percentages.csv"
)

# Select the columns that we need
journal_articles = journal_articles.select(["Country Name", "2022 [YR2022]"])
exports_values = exports_values.select(["Country Name", "2022 [YR2022]"])
exports_percentages = exports_percentages.select(["Country Name", "2022 [YR2022]"])

# Rename the columns
journal_articles = journal_articles.rename(
    {"Country Name": "Country", "2022 [YR2022]": "Tech journal articles"}
)
exports_values = exports_values.rename(
    {"Country Name": "Country", "2022 [YR2022]": "High-technology exports ($)"}
)
exports_percentages = exports_percentages.rename(
    {"Country Name": "Country", "2022 [YR2022]": "High-technology exports (%)"}
)

# Join the dataframes
technology_innovation = journal_articles.join(exports_values, on="Country")
technology_innovation = technology_innovation.join(exports_percentages, on="Country")

# Change ".." values to None
technology_innovation = technology_innovation.select(pl.all().replace("..", None))

# Cast the string columns to float/integer and round the floats to 2 decimal places
technology_innovation = technology_innovation.with_columns(
    pl.col("Tech journal articles").cast(pl.Float64).round(2),
    pl.col("High-technology exports ($)").cast(pl.Int64),
    pl.col("High-technology exports (%)").cast(pl.Float64).round(2),
)

# Get the first 5 rows
technology_innovation.head()

Country,Tech journal articles,High-technology exports ($),High-technology exports (%)
str,f64,i64,f64
"""Afghanistan""",169.19,,
"""Albania""",238.59,886411.0,0.06
"""Algeria""",7606.65,,
"""American Samoa""",,,
"""Andorra""",9.6,49533520.0,13.31


Now, we can try to join the new technology innovation data with the main dataset.

In [39]:
# Left join the dataframes
data = cleaned_data.join(technology_innovation, on="Country", how="left")

# Keep only the rows with a missing value in the column "Tech journal articles"
null = data.filter(pl.col("Tech journal articles").is_null())

# Count the number of null values
null_rows = null.shape[0]
print(f"Number of null values: {null_rows}")

# Write the rows with null values to a CSV file
null.write_csv("../data/raw/technological_innovation/null_values.csv")

Number of null values: 30


As we can see, the column `Tech journal articles` has 30 missing values after the join. This happens mainly because some countries have a different name in the technology innovation data. We will fix this by renaming those countries in the technology innovation data; they can be easily identified in the CSV file to which we have written the rows with null values.

In [40]:
# Rename countries
technology_innovation = technology_innovation.with_columns(
    technology_innovation["Country"].replace(
        [
            "Bahamas, The",
            "Brunei Darussalam",
            "Cote d'Ivoire",
            "Cabo Verde",
            "Congo, Rep.",
            "Congo, Dem. Rep.",
            "Czechia",
            "Egypt, Arab Rep.",
            "Gambia, The",
            "Iran, Islamic Rep.",
            "Ireland",
            "Kyrgyz Republic",
            "Lao PDR",
            "Micronesia, Fed. Sts.",
            "Korea, Dem. People's Rep.",
            "West Bank and Gaza",
            "Russian Federation",
            "St. Kitts and Nevis",
            "St. Lucia",
            "St. Vincent and the Grenadines",
            "Sao Tome and Principe",
            "Slovak Republic",
            "Korea, Rep.",
            "Syrian Arab Republic",
            "Timor-Leste",
            "Turkiye",
            "Venezuela, RB",
            "Viet Nam",
            "Yemen, Rep.",
        ],
        [
            "The Bahamas",
            "Brunei",
            "Ivory Coast",
            "Cape Verde",
            "Republic of the Congo",
            "Democratic Republic of the Congo",
            "Czech Republic",
            "Egypt",
            "The Gambia",
            "Iran",
            "Republic of Ireland",
            "Kyrgyzstan",
            "Laos",
            "Federated States of Micronesia",
            "North Korea",
            "Palestinian National Authority",
            "Russia",
            "Saint Kitts and Nevis",
            "Saint Lucia",
            "Saint Vincent and the Grenadines",
            "São Tomé and Principe",
            "Slovakia",
            "South Korea",
            "Syria",
            "East Timor",
            "Turkey",
            "Venezuela",
            "Vietnam",
            "Yemen",
        ],
    )
)

Now, we can try to join the data again.

In [41]:
# Left join the dataframes
data = cleaned_data.join(technology_innovation, on="Country", how="left")

# Keep only the rows with a missing value in the column "Tech journal articles"
null = data.filter(pl.col("Tech journal articles").is_null())

# Count the number of null values
null_rows = null.shape[0]
print(f"Number of null values: {null_rows}")

# Show the null values
null.head()

Number of null values: 1


Country,Density(P/Km2),Abbreviation,Agricultural Land(%),Land Area(Km2),Armed Forces size,Birth Rate,Co2-Emissions,CPI,CPI Change (%),Currency-Code,Fertility Rate,Forested Area (%),Gasoline Price,GDP,Gross primary education enrollment (%),Gross tertiary education enrollment (%),Infant mortality,Largest city,Life expectancy,Maternal mortality ratio,Minimum wage,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban population,Tech journal articles,High-technology exports ($),High-technology exports (%)
str,i64,str,f64,i64,i64,f64,i64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,str,f64,i64,f64,f64,f64,i64,f64,f64,f64,f64,i64,f64,i64,f64
"""Vatican City""",2003,"""VA""",,0,,,,,,"""EUR""",,0.0,,,,,,,,,,,,836,,,,,,,,


The only remaining missing value in the column `Tech journal articles` is for the country `Vatican City` because it is not present in the technology innovation data. We will exclude this country from the analysis.

In [42]:
# Remove "Vatican City" from the data
data = data.filter(pl.col("Country") != "Vatican City")

We have used the column `Tech journal articles` to correct the country names because it had no missing values for the countries that were present in the original data. Now, we will check whether there are any missing values in the columns `High-technology exports (%)` and `High-technology exports ($)`.

In [43]:
# Keep only the rows with a missing value in the column "High-technology exports (%)"
null = data.filter(pl.col("High-technology exports (%)").is_null())

# Count the number of null values
null_rows = null.shape[0]
print(f"Number of null values: {null_rows}")

Number of null values: 49


Since 49 rows have missing values in the column `High-technology exports (%)`, we unfortunately have to exclude them from the analysis.

In [44]:
# Remove rows with missing values
data = data.filter(pl.col("High-technology exports (%)").is_not_null())

Now, we check the column `High-technology exports ($)`.

In [45]:
# Keep only the rows with a missing value in the column "High-technology exports ($)"
null = data.filter(pl.col("High-technology exports ($)").is_null())

# Count the number of null values
null_rows = null.shape[0]
print(f"Number of null values: {null_rows}")

Number of null values: 0


There are no additional missing values in the `High-technology exports ($)` column, so we have completed the integration of the technology innovation data into the original dataset.

In [46]:
# Show data size
print(data.shape)

# Get the first 5 rows
data.head()

(145, 33)


Country,Density(P/Km2),Abbreviation,Agricultural Land(%),Land Area(Km2),Armed Forces size,Birth Rate,Co2-Emissions,CPI,CPI Change (%),Currency-Code,Fertility Rate,Forested Area (%),Gasoline Price,GDP,Gross primary education enrollment (%),Gross tertiary education enrollment (%),Infant mortality,Largest city,Life expectancy,Maternal mortality ratio,Minimum wage,Out of pocket health expenditure,Physicians per thousand,Population,Population: Labor force participation (%),Tax revenue (%),Total tax rate,Unemployment rate,Urban population,Tech journal articles,High-technology exports ($),High-technology exports (%)
str,i64,str,f64,i64,i64,f64,i64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,str,f64,i64,f64,f64,f64,i64,f64,f64,f64,f64,i64,f64,i64,f64
"""Albania""",105,"""AL""",43.1,28748,9000.0,11.78,4536,119.05,1.4,"""ALL""",1.62,28.1,1.36,15278000000.0,107.0,55.0,7.8,"""Tirana""",78.5,15.0,1.12,56.9,1.2,2854191,55.7,18.6,36.6,12.33,1747593,238.59,886411,0.06
"""Andorra""",164,"""AD""",40.0,468,,7.2,469,,,"""EUR""",1.27,34.0,1.51,3154100000.0,106.4,,2.7,"""Andorra la Vella""",,,6.63,36.4,3.33,77142,,,,,67873,9.6,49533520,13.31
"""Angola""",26,"""AO""",47.5,1246700,117000.0,40.73,34693,261.73,17.1,"""AOA""",5.52,46.3,0.97,94635000000.0,113.5,9.3,51.6,"""Luanda""",60.8,241.0,0.71,33.4,0.21,31825295,77.5,9.2,49.1,6.89,21061025,44.99,77204455,22.06
"""Antigua and Barbuda""",223,"""AG""",20.5,443,0.0,15.33,557,113.81,1.2,"""XCD""",1.99,22.3,0.99,1727800000.0,105.0,24.8,5.0,"""St. John's, Saint John""",76.9,42.0,3.04,24.3,2.76,97118,,16.5,43.0,,23800,7.02,0,0.0
"""Argentina""",17,"""AR""",54.3,2780400,105000.0,17.02,201348,232.75,53.5,"""ARS""",2.26,9.8,1.1,449660000000.0,109.7,90.0,8.8,"""Buenos Aires""",76.5,39.0,3.35,17.6,3.96,44938712,61.3,10.1,106.3,9.79,41339571,9122.18,663679613,4.75
