<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# Creating a custom ID using string functions
© ExploreAI Academy

In this notebook, we will learn how to use string functions to create a customised country ID column. We will also use string functions to give our customised column uniformity.



> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

In this train, we will learn how to:
- Use string functions to create a customised country ID column.
- Use string functions to give our customised column uniformity. 

## Overview

Suppose that, as a general guideline, the project we are working on requires that all data IDs be created by combining existing table attributes. While working with the `Access_to_Basic_Services` table, we decide to form custom country IDs by combining the country name, year, and population size in millions for each entry.


## Connecting to the MySQL database

We'll start by connecting to the `united_nations` database. To connect to the MySQL server, run the cells below.

In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [4]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with your connection password.
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://root:password@localhost:3306/united_nations

'Connected: root@united_nations'


To build the custom ID, we first select the columns we want to combine, namely `Country_name`, `Time_period`, and `Est_population_in_millions`, from the `Access_to_Basic_Services` table.

In [5]:
%%sql

SELECT 
	DISTINCT Country_name, 
	Time_period, 
	Est_population_in_millions
FROM 
	united_nations.Access_to_Basic_Services
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/united_nations
5 rows affected.


Country_name,Time_period,Est_population_in_millions
Kazakhstan,2015,17.542806
Kazakhstan,2016,17.794055
Kazakhstan,2017,18.037776
Kazakhstan,2018,18.276452
Kazakhstan,2019,18.513673


## Exercise


### 1. Combine the columns

Create the new ID by combining the columns `Country_name`, `Time_period`, and `Est_population_in_millions`. Save this as `Country_id`.
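
As a rough illustration of what combining columns into one string looks like, here is a minimal Python sketch that uses the standard-library `sqlite3` module instead of the local MySQL server. This is an assumption made only to keep the example self-contained: SQLite's `||` operator stands in for MySQL's `CONCAT()`, and the literal values are taken from the sample rows above rather than read from the real table.

```python
import sqlite3

# Throwaway in-memory SQLite database (assumption: used only as a stand-in for MySQL).
conn = sqlite3.connect(":memory:")

# SQLite's || operator plays the role of MySQL's CONCAT() in this sketch.
# The integer year is converted to text automatically during concatenation.
country_id = conn.execute(
    "SELECT 'Country: ' || ? || ', Time period: ' || ?",
    ("Kazakhstan", 2015),
).fetchone()[0]

print(country_id)
```

In the exercise itself, the same idea is expressed with `CONCAT()` over the three table columns.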

In [11]:
%%sql
# Add your code here
SELECT
    DISTINCT Country_name, 
    Time_period, 
    Est_population_in_millions,
    CONCAT('Country: ', Country_name , ', Time period: ',Time_period , ', Est pop: ' , Est_population_in_millions) AS Country_id
FROM
    united_nations.Access_to_Basic_Services;

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Est_population_in_millions,Country_id
Kazakhstan,2015,17.542806,"Country: Kazakhstan, Time period: 2015, Est pop: 17.542806"
Kazakhstan,2016,17.794055,"Country: Kazakhstan, Time period: 2016, Est pop: 17.794055"
Kazakhstan,2017,18.037776,"Country: Kazakhstan, Time period: 2017, Est pop: 18.037776"
Kazakhstan,2018,18.276452,"Country: Kazakhstan, Time period: 2018, Est pop: 18.276452"
Kazakhstan,2019,18.513673,"Country: Kazakhstan, Time period: 2019, Est pop: 18.513673"
Kazakhstan,2020,18.755666,"Country: Kazakhstan, Time period: 2020, Est pop: 18.755666"
Kyrgyzstan,2015,,
Kyrgyzstan,2016,,
Kyrgyzstan,2017,,
Kyrgyzstan,2018,,


### 2. Replace the NULL values

This method works until one of the combined columns contains a `NULL` (`None`) value. `CONCAT` returns `NULL` if any of its arguments is `NULL`, so a single `NULL` makes the entire `Country_id` string `NULL`, as the Kyrgyzstan rows above show. That's not what we want.

Refine the solution by replacing all `NULL` values of the combined columns with the word `UNKNOWN`.
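
The effect of a `NULL` on concatenation, and how `IFNULL` guards against it, can be sketched in isolation. The example below is a hedged illustration, not the exercise solution: it assumes SQLite (via Python's `sqlite3` module) as a stand-in for MySQL, with `||` in place of `CONCAT()`. Both propagate `NULL` in the same way, and both provide `IFNULL`.

```python
import sqlite3

# In-memory SQLite database (assumption: stand-in for the MySQL server).
conn = sqlite3.connect(":memory:")

raw_id, safe_id = conn.execute(
    """
    SELECT
        -- Any NULL operand turns the whole concatenation into NULL.
        'Country: ' || 'Kyrgyzstan' || ', Est pop: ' || NULL,
        -- IFNULL swaps the NULL for a placeholder before concatenating.
        'Country: ' || 'Kyrgyzstan' || ', Est pop: ' || IFNULL(NULL, 'UNKNOWN')
    """
).fetchone()

print(raw_id)   # SQL NULL arrives in Python as None
print(safe_id)
```

Wrapping each column in `IFNULL(column, 'UNKNOWN')` inside `CONCAT()` applies the same guard to every part of the ID.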



In [32]:
%%sql
# Add your code here
SELECT
    DISTINCT Country_name, 
    Time_period, 
    Est_population_in_millions,
    CONCAT
    (
        'Country: ',             
        IFNULL(Country_name,'UNKNOWN'), 
        ' Time period: ',
        IFNULL(Time_period,'UNKNOWN') , 
        ', Est pop: ', 
        IFNULL(Est_population_in_millions,'UNKNOWN')  
    ) 
    AS 
        Country_id
FROM
    united_nations.Access_to_Basic_Services

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Est_population_in_millions,Country_id
Kazakhstan,2015,17.542806,"Country: Kazakhstan Time period: 2015, Est pop: 17.542806"
Kazakhstan,2016,17.794055,"Country: Kazakhstan Time period: 2016, Est pop: 17.794055"
Kazakhstan,2017,18.037776,"Country: Kazakhstan Time period: 2017, Est pop: 18.037776"
Kazakhstan,2018,18.276452,"Country: Kazakhstan Time period: 2018, Est pop: 18.276452"
Kazakhstan,2019,18.513673,"Country: Kazakhstan Time period: 2019, Est pop: 18.513673"
Kazakhstan,2020,18.755666,"Country: Kazakhstan Time period: 2020, Est pop: 18.755666"
Kyrgyzstan,2015,,"Country: Kyrgyzstan Time period: 2015, Est pop: UNKNOWN"
Kyrgyzstan,2016,,"Country: Kyrgyzstan Time period: 2016, Est pop: UNKNOWN"
Kyrgyzstan,2017,,"Country: Kyrgyzstan Time period: 2017, Est pop: UNKNOWN"
Kyrgyzstan,2018,,"Country: Kyrgyzstan Time period: 2018, Est pop: UNKNOWN"


### 3a. Make the custom IDs consistent

Let's make the custom IDs consistent. We will do this in two steps. First, convert all the letters in each ID to uppercase.


In [33]:
%%sql
# Add your code here
SELECT
    DISTINCT Country_name, 
    Time_period, 
    Est_population_in_millions,
    UPPER
    (
        CONCAT
        (
            'Country: ',             
            IFNULL(Country_name,'UNKNOWN'), 
            ' Time period: ',
            IFNULL(Time_period,'UNKNOWN') , 
            ', Est pop: ', 
            IFNULL(Est_population_in_millions,'UNKNOWN')  
        )
    )
    AS 
        Country_id
FROM
    united_nations.Access_to_Basic_Services

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Est_population_in_millions,Country_id
Kazakhstan,2015,17.542806,"COUNTRY: KAZAKHSTAN TIME PERIOD: 2015, EST POP: 17.542806"
Kazakhstan,2016,17.794055,"COUNTRY: KAZAKHSTAN TIME PERIOD: 2016, EST POP: 17.794055"
Kazakhstan,2017,18.037776,"COUNTRY: KAZAKHSTAN TIME PERIOD: 2017, EST POP: 18.037776"
Kazakhstan,2018,18.276452,"COUNTRY: KAZAKHSTAN TIME PERIOD: 2018, EST POP: 18.276452"
Kazakhstan,2019,18.513673,"COUNTRY: KAZAKHSTAN TIME PERIOD: 2019, EST POP: 18.513673"
Kazakhstan,2020,18.755666,"COUNTRY: KAZAKHSTAN TIME PERIOD: 2020, EST POP: 18.755666"
Kyrgyzstan,2015,,"COUNTRY: KYRGYZSTAN TIME PERIOD: 2015, EST POP: UNKNOWN"
Kyrgyzstan,2016,,"COUNTRY: KYRGYZSTAN TIME PERIOD: 2016, EST POP: UNKNOWN"
Kyrgyzstan,2017,,"COUNTRY: KYRGYZSTAN TIME PERIOD: 2017, EST POP: UNKNOWN"
Kyrgyzstan,2018,,"COUNTRY: KYRGYZSTAN TIME PERIOD: 2018, EST POP: UNKNOWN"


### 3b. Make the custom IDs consistent

The second way we'll make the IDs consistent is by ensuring that all the IDs have the same length.

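Use the first four characters of the `Country_name` and `Time_period` fields, then the last seven characters of `Est_population_in_millions`. Because the placeholder `UNKNOWN` is also seven characters long, rows with a missing population still produce an ID of the same length.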
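
To see how positive and negative start positions behave when slicing strings, here is a small sketch under the same assumption as before: SQLite (through Python's `sqlite3` module) stands in for MySQL, `SUBSTR` stands in for `SUBSTRING`, and `||` stands in for `CONCAT()`. The helper `build_id` is hypothetical and exists only for this illustration. Values are passed as text for simplicity.

```python
import sqlite3

# In-memory SQLite database (assumption: stand-in for the MySQL server).
conn = sqlite3.connect(":memory:")


def build_id(country, year, population):
    """Hypothetical helper: build a fixed-length ID from three text values."""
    return conn.execute(
        """
        SELECT
            SUBSTR(IFNULL(UPPER(?), 'UNKNOWN'), 1, 4)      -- first 4 characters
            || SUBSTR(IFNULL(?, 'UNKNOWN'), 1, 4)          -- first 4 characters
            || SUBSTR(IFNULL(?, 'UNKNOWN'), -7, 7)         -- last 7 characters
        """,
        (country, year, population),
    ).fetchone()[0]


known = build_id("Kazakhstan", "2015", "17.542806")
missing = build_id("Kyrgyzstan", "2015", None)  # None is sent as SQL NULL

print(known, len(known))
print(missing, len(missing))
```

A negative start position counts from the end of the string, which is why `-7, 7` picks out the final seven characters regardless of how many digits come before the decimal point.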

In [65]:
%%sql
# Add your code here
SELECT
    DISTINCT Country_name, 
    Time_period, 
    Est_population_in_millions,
    CONCAT
    (             
        SUBSTRING(IFNULL(UPPER(Country_name),'UNKNOWN'),1,4),
        SUBSTRING(IFNULL(Time_period,'UNKNOWN'),1,4),
        SUBSTRING(IFNULL(Est_population_in_millions,'UNKNOWN'),-7,7)  
    ) 
    AS 
        Country_id
FROM
    united_nations.Access_to_Basic_Services

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Est_population_in_millions,Country_id
Kazakhstan,2015,17.542806,KAZA2015.542806
Kazakhstan,2016,17.794055,KAZA2016.794055
Kazakhstan,2017,18.037776,KAZA2017.037776
Kazakhstan,2018,18.276452,KAZA2018.276452
Kazakhstan,2019,18.513673,KAZA2019.513673
Kazakhstan,2020,18.755666,KAZA2020.755666
Kyrgyzstan,2015,,KYRG2015UNKNOWN
Kyrgyzstan,2016,,KYRG2016UNKNOWN
Kyrgyzstan,2017,,KYRG2017UNKNOWN
Kyrgyzstan,2018,,KYRG2018UNKNOWN



<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>