# SQL for Beginners with a CSV file
## <span style="color:green;">  Case study: Refugee analysis </span>

In this notebook we explore data on refugee populations using SQL. Our data source is a CSV file downloaded from the World Bank website. This exploratory analysis includes SQL statements such as `SELECT`, `WHERE`, `ORDER BY`, `GROUP BY` and `LAG`.

At the end of the notebook, you can find links to other relevant notebooks using SQL for data analysis. If you are looking for a tutorial focused on each SQL function individually, you might want to check out [SQL for Beginners](https://www.kaggle.com/code/tejota/sql-for-beginners-with-bigquery).

Should you find any errors, have suggestions for improvement or just want to share a different perspective, please feel free to reach out in the comments.



## Table of contents <a class="anchor"  id="Index"></a>


* [Introduction](#i)
* [Summary table](#t)
* [Set up the environment - import csv file](#s)
* [Queries](#q)
    * [1. What is the time period?](#1)
    * [2. Top5 of countries taking the most refugees](#2)
        * [2.1 2022](#21)
    * [3. Top5 source countries of refugees](#3)
        * [3.1 2022 and the war in Ukraine](#31)
        * [3.2 2021 - Syria, Africa and Afghanistan](#32)
        * [3.3 TBT 1990](#33)
        * [3.4 2015-2022: Syria, always Syria](#34)
        * [3.5 Over 50 million refugees](#35)
    * [4. Impact of the war in Ukraine on bordering countries](#4)
        * [4.1 2021 vs. 2022](#41)
        * [4.2 Increment (abs)](#42)
        * [4.3 Increment (%)](#43)
 
* [More notebooks on SQL](#m)    

# [**Introduction**](#Index)  <a class="anchor"  id="i"></a>

Our first tasks are to import the pandas and pandasql libraries and load the CSV file. Then we check the structure of the file: the number of rows and the column names. We will use SQL to try to answer the following questions:
* Which countries are taking the most refugees?
* Which are the top countries refugees are coming from?
* How did the situation change in 2022 in the countries bordering Ukraine?

# [**Summary table**](#Index)  <a class="anchor"  id="t"></a>

Query|SQL                                                    
:----|:---------------------------------
1    | SELECT, DISTINCT, FROM, AS
2.1  | WHERE ==, ORDER BY, LIMIT
3.4  | MAX, GROUP BY, BETWEEN
3.5  | SUM, AND
4.1  | WHERE IN
4.2  | LAG OVER(PARTITION BY) 

# [**Set up the environment - import csv file**](#Index)  <a class="anchor"  id="s"></a>

In [1]:
import pandas as pd
import pandasql as psql


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



/kaggle/input/refugee-population/WB_refugees_1990_2022.csv
/kaggle/input/refugee-population/793a9834-3373-433e-9482-162a37bbe1d0_Series - Metadata.csv
/kaggle/input/refugee-population/793a9834-3373-433e-9482-162a37bbe1d0_Data_wide.csv
/kaggle/input/refugee-population/871fdd65-7f81-4e78-8ec0-8c05fd44cbf3_Data_long.csv


The file we need is "WB_refugees_1990_2022.csv". If the path is different when you copy this notebook, use the path printed by the code cell above.

In [2]:
file_path = '/kaggle/input/refugee-population/WB_refugees_1990_2022.csv'

table1 = pd.read_csv(file_path)

table1.head(3)

Unnamed: 0,Time,Country Name,Refugee population by country or territory of origin,Refugee population by country or territory of asylum
0,1990,Afghanistan,6339095.0,50.0
1,1990,Albania,1822.0,
2,1990,Algeria,18.0,169110.0


In [3]:
table1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2706 entries, 0 to 2705
Data columns (total 4 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   Time                                                  2706 non-null   int64  
 1   Country Name                                          2706 non-null   object 
 2   Refugee population by country or territory of origin  2210 non-null   float64
 3   Refugee population by country or territory of asylum  2010 non-null   float64
dtypes: float64(2), int64(1), object(1)
memory usage: 84.7+ KB


Our file contains a table with four columns and 2706 rows.

Columns: 
Time, Country Name, Refugee population by country or territory of **origin** and Refugee population by country or territory of **asylum**

# [**Queries**](#Index)  <a class="anchor"  id="q"></a>

## [**1. What is the time period?**](#Index)  <a class="anchor"  id="1"></a>



`SELECT` allows us to get specific columns of a table.

`FROM` indicates the table that will be our source. 

`DISTINCT` excludes duplicates.

`SELECT *` selects all columns.

`SELECT column1, column2, ...` selects specific columns, separated by commas.
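
To see what `DISTINCT` and `AS` do on a small scale, here is a minimal sketch. It assumes Python's built-in `sqlite3` module instead of pandasql (pandasql runs its queries on SQLite, so the SQL behaves the same) and a tiny hand-built `table1` with a few rows that appear in this notebook's outputs.

```python
import sqlite3

# In-memory SQLite database standing in for the DataFrame (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE table1 (Time INTEGER, "Country Name" TEXT)')
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1990, "Afghanistan"), (1990, "Albania"), (1990, "Algeria"),
     (2021, "Poland"), (2022, "Poland")],
)

# Without DISTINCT: one value per row, duplicates included.
all_years = [row[0] for row in conn.execute("SELECT Time FROM table1")]

# With DISTINCT: each year appears once. AS renames the output column.
cursor = conn.execute("SELECT DISTINCT Time AS Year FROM table1 ORDER BY Year")
column_name = cursor.description[0][0]
distinct_years = [row[0] for row in cursor]

print(len(all_years))    # 5 rows in total
print(distinct_years)    # [1990, 2021, 2022]
print(column_name)       # Year
```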

In [4]:
query1 = psql.sqldf(
            """
            SELECT DISTINCT Time AS Year
            FROM table1
            """)

query1

Unnamed: 0,Year
0,1990
1,2000
2,2013
3,2014
4,2015
5,2016
6,2017
7,2018
8,2019
9,2020


We have 12 years represented in our dataset: 1990, 2000 and every year from 2013 to 2022.


## [**2. Top5 of countries taking the most refugees**](#Index)  <a class="anchor"  id="2"></a>

`WHERE` - allows us to filter the data selected. 

> SELECT column1, column2 FROM table1 WHERE columni = a


> Note: The column used to filter in the WHERE clause does not need to be one of the columns in the SELECT clause.


> For example, SELECT day, month FROM date_table WHERE year = 2023

`AS` creates an alias. Here it lets us change the name of a field in the output.

`ORDER BY` sorts the data. The default order is ascending (`ASC`). To sort from largest to smallest or from Z to A, add `DESC` after the field name.

`LIMIT` restricts the number of rows returned.
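
The four clauses above can be combined in a single query. The sketch below is only an illustration: it assumes Python's built-in `sqlite3` module rather than pandasql, and a hypothetical table `asylum` filled with a few values taken from this notebook's outputs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name and short column names, chosen for this sketch.
conn.execute("CREATE TABLE asylum (Time INTEGER, Country TEXT, Refugees REAL)")
conn.executemany(
    "INSERT INTO asylum VALUES (?, ?, ?)",
    [(2022, "Germany", 2075445), (2022, "Turkiye", 3568259),
     (2022, "Jordan", 3062851), (2021, "Poland", 4875)],
)

rows = conn.execute(
    """
    SELECT Country AS [Country of asylum]   -- AS renames the output column
    FROM asylum
    WHERE Time = 2022                       -- filter on a column that is not selected
    ORDER BY Refugees DESC                  -- largest value first
    LIMIT 2                                 -- keep only the first two rows
    """
).fetchall()

print(rows)  # [('Turkiye',), ('Jordan',)]
```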


### [**2.1 2022**](#Index)  <a class="anchor"  id="21"></a>

In [5]:
query2 = psql.sqldf(
            """
            SELECT 
                    Time AS Year,
                    [Country Name] AS [Country of asylum],
                    ([Refugee population by country or territory of asylum]/1000000) AS [Refugee population (millions)]
            FROM       
                    table1
            WHERE 
                    Time == 2022
            ORDER BY 
                    [Refugee population by country or territory of asylum] DESC
            LIMIT 5
            """)

query2

Unnamed: 0,Year,Country of asylum,Refugee population (millions)
0,2022,Turkiye,3.568259
1,2022,"Iran, Islamic Rep.",3.425091
2,2022,Jordan,3.062851
3,2022,West Bank and Gaza,2.454258
4,2022,Germany,2.075445


## [**3. Top5 source countries of refugees**](#Index)  <a class="anchor"  id="3"></a>

### [**3.1 2022 and the war in Ukraine**](#Index)  <a class="anchor"  id="31"></a>

In [6]:
query31 = psql.sqldf(
            """
            SELECT 
                    Time AS Year,
                    [Country Name] AS [Country of origin],
                    ([Refugee population by country or territory of origin]/1000000) AS [Refugee population (millions)]
            FROM       
                    table1
            WHERE 
                    Time == 2022
            ORDER BY 
                    [Refugee population by country or territory of origin] DESC
            LIMIT 5
            """)

query31

Unnamed: 0,Year,Country of origin,Refugee population (millions)
0,2022,Syrian Arab Republic,6.547818
1,2022,Ukraine,5.67988
2,2022,Afghanistan,5.661675
3,2022,South Sudan,2.294983
4,2022,Myanmar,1.253111


### [**3.2 2021 - Syria, Africa and Afghanistan**](#Index)  <a class="anchor"  id="32"></a>

In [7]:
query32 = psql.sqldf(
            """
            SELECT 
                    Time AS Year,
                    [Country Name] AS [Country of origin],
                    ([Refugee population by country or territory of origin]/1000000) AS [Refugee population (millions)]
            FROM       
                    table1
            WHERE 
                    Time == 2021
            ORDER BY 
                    [Refugee population by country or territory of origin] DESC
            LIMIT 5
            """)

query32

Unnamed: 0,Year,Country of origin,Refugee population (millions)
0,2021,Syrian Arab Republic,6.848865
1,2021,Africa Eastern and Southern,6.146257
2,2021,Afghanistan,2.712869
3,2021,South Sudan,2.362759
4,2021,Africa Western and Central,1.675916


### [**3.3 TBT 1990**](#Index)  <a class="anchor"  id="33"></a>

In [8]:
query33 = psql.sqldf(
            """
            SELECT 
                    [Country Name] AS [Country of origin],
                    Time AS Year,
                    ([Refugee population by country or territory of origin]/1000000) AS [Refugee population (millions)]
            FROM       
                    table1
            WHERE 
                    Time == 1990
            ORDER BY 
                    [Refugee population by country or territory of origin] DESC,
                    Time
            LIMIT 5
            """)

query33

Unnamed: 0,Country of origin,Year,Refugee population (millions)
0,Afghanistan,1990,6.339095
1,Ethiopia,1990,1.345928
2,Mozambique,1990,1.247991
3,Iraq,1990,1.133805
4,Liberia,1990,0.735687


### [**3.4 2015-2022: Syria, always Syria**](#Index)  <a class="anchor"  id="34"></a>

In [9]:
query34 = psql.sqldf(
            """
            SELECT 
                    Time AS Year,
                    [Country Name],
                    MAX([Refugee population by country or territory of origin]/1000000) AS [Refugee population (millions)]
            FROM       
                    table1
            WHERE
                    Time BETWEEN 2015 AND 2022
            GROUP BY
                    Time
        
          
            """)

query34

Unnamed: 0,Year,Country Name,Refugee population (millions)
0,2015,Syrian Arab Republic,4.873236
1,2016,Syrian Arab Republic,5.524511
2,2017,Syrian Arab Republic,6.310498
3,2018,Syrian Arab Republic,6.654374
4,2019,Syrian Arab Republic,6.615249
5,2020,Syrian Arab Republic,6.70291
6,2021,Syrian Arab Republic,6.848865
7,2022,Syrian Arab Republic,6.547818
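
The query above groups by `Time` but also selects `[Country Name]`, which is neither grouped nor aggregated. This works because pandasql runs on SQLite: when a query uses `MAX` (or `MIN`), SQLite fills such "bare" columns from the row that holds the maximum. Many other databases reject this pattern, so treat it as SQLite-specific. A minimal sketch, assuming the built-in `sqlite3` module and a hypothetical table `origin` with values from the outputs above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with values (in millions) copied from sections 3.1 and 3.2.
conn.execute("CREATE TABLE origin (Time INTEGER, Country TEXT, Millions REAL)")
conn.executemany(
    "INSERT INTO origin VALUES (?, ?, ?)",
    [(2021, "Afghanistan", 2.712869), (2021, "Syrian Arab Republic", 6.848865),
     (2021, "South Sudan", 2.362759),
     (2022, "Ukraine", 5.67988), (2022, "Syrian Arab Republic", 6.547818),
     (2022, "Afghanistan", 5.661675)],
)

rows = conn.execute(
    """
    SELECT Time AS Year,
           Country,                       -- bare column: taken from the MAX row (SQLite only)
           MAX(Millions) AS [Refugee population (millions)]
    FROM origin
    WHERE Time BETWEEN 2021 AND 2022      -- BETWEEN includes both endpoints
    GROUP BY Time                         -- one output row per year
    ORDER BY Year
    """
).fetchall()

for row in rows:
    print(row)
```

Each year yields one row, and the country shown is the one with the largest value for that year.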


### [**3.5 Over 50 million refugees**](#Index)  <a class="anchor"  id="35"></a>

In [10]:
query35 = psql.sqldf(
            """
            SELECT 
                    [Country Name],
                    SUM([Refugee population by country or territory of origin]/1000000) AS [Refugee population (millions)]
            FROM       
                    table1
            WHERE
                    [Country Name] == 'Syrian Arab Republic'
                    AND
                    Time BETWEEN 2015 AND 2022
            GROUP BY
                    [Country Name]
        
          
            """)

query35

Unnamed: 0,Country Name,Refugee population (millions)
0,Syrian Arab Republic,50.077461
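
As a cross-check of the total, the yearly values from section 3.4 can be summed with the same `SUM`, `AND` and `BETWEEN` pattern. This is a sketch under assumptions: it uses the built-in `sqlite3` module instead of pandasql and a hypothetical table `origin` holding values already expressed in millions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE origin (Time INTEGER, Country TEXT, Millions REAL)")
# Syrian values for 2015-2022 from the output of section 3.4,
# plus one Afghan row so that the country filter has something to exclude.
syria = [4.873236, 5.524511, 6.310498, 6.654374,
         6.615249, 6.70291, 6.848865, 6.547818]
rows_in = [(2015 + i, "Syrian Arab Republic", v) for i, v in enumerate(syria)]
rows_in.append((2022, "Afghanistan", 5.661675))
conn.executemany("INSERT INTO origin VALUES (?, ?, ?)", rows_in)

country, total = conn.execute(
    """
    SELECT Country, SUM(Millions)
    FROM origin
    WHERE Country = 'Syrian Arab Republic'   -- both conditions must hold (AND)
      AND Time BETWEEN 2015 AND 2022
    GROUP BY Country
    """
).fetchone()

print(country, round(total, 6))  # Syrian Arab Republic 50.077461
```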


## [**4. Impact of the war in Ukraine on bordering countries**](#Index)  <a class="anchor"  id="4"></a>

For the purpose of this analysis, Russia and Belarus are not included in the group of countries bordering Ukraine.

### [**4.1 2021 vs. 2022**](#Index)  <a class="anchor"  id="41"></a>

In [11]:
query41 = psql.sqldf(
            """
            SELECT  
                    [Country Name] as [Country of Asylum],
                    Time as Year,
                    [Refugee population by country or territory of asylum] AS [Refugee population ]
            FROM    
                    table1 
            WHERE
                    [Country Name] in ('Poland', 'Hungary', 'Moldova', 'Slovak Republic', 'Romania')
                    AND
                    Time in (2021, 2022)
            Order By 
                    [Country Name]
                
            """)

query41

Unnamed: 0,Country of Asylum,Year,Refugee population
0,Hungary,2021,5676.0
1,Hungary,2022,35370.0
2,Moldova,2021,349.0
3,Moldova,2022,105374.0
4,Poland,2021,4875.0
5,Poland,2022,971129.0
6,Romania,2021,4200.0
7,Romania,2022,105621.0
8,Slovak Republic,2021,1046.0
9,Slovak Republic,2022,96563.0
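
`IN` is shorthand for a chain of `OR` comparisons on the same column. The sketch below, which assumes the built-in `sqlite3` module and a hypothetical table `asylum` with a few values from the output above, runs both forms and shows that they return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asylum (Time INTEGER, Country TEXT, Refugees REAL)")
conn.executemany(
    "INSERT INTO asylum VALUES (?, ?, ?)",
    [(2021, "Poland", 4875), (2022, "Poland", 971129),
     (2021, "Moldova", 349), (2022, "Moldova", 105374),
     (2022, "Germany", 2075445)],   # not in the list, so it is filtered out
)

with_in = conn.execute(
    """
    SELECT Country, Time, Refugees FROM asylum
    WHERE Country IN ('Poland', 'Moldova') AND Time IN (2021, 2022)
    ORDER BY Country, Time
    """
).fetchall()

with_or = conn.execute(
    """
    SELECT Country, Time, Refugees FROM asylum
    WHERE (Country = 'Poland' OR Country = 'Moldova')
      AND (Time = 2021 OR Time = 2022)
    ORDER BY Country, Time
    """
).fetchall()

print(with_in == with_or)  # True
print(with_in)
```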


### [**4.2 Increment (abs)**](#Index)  <a class="anchor"  id="42"></a>

In [12]:
query42 = psql.sqldf("""
            
    SELECT  
        [Country Name],
        Time AS Year,
        [Refugee population by country or territory of asylum],
        [Refugee population by country or territory of asylum] - LAG([Refugee population by country or territory of asylum],1)
                OVER
                 (PARTITION BY [Country Name] ORDER BY Time) AS Increment
        
    FROM
        table1
    WHERE
        Time in (2021, 2022)
        AND
        [Country Name] in ('Poland', 'Hungary', 'Moldova', 'Slovak Republic', 'Romania')
    ORDER BY
        [Country NAME],Time
                    
            """)

query42

Unnamed: 0,Country Name,Year,Refugee population by country or territory of asylum,Increment
0,Hungary,2021,5676.0,
1,Hungary,2022,35370.0,29694.0
2,Moldova,2021,349.0,
3,Moldova,2022,105374.0,105025.0
4,Poland,2021,4875.0,
5,Poland,2022,971129.0,966254.0
6,Romania,2021,4200.0,
7,Romania,2022,105621.0,101421.0
8,Slovak Republic,2021,1046.0,
9,Slovak Republic,2022,96563.0,95517.0
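
`LAG(column, 1)` returns the value of `column` from the previous row within the same partition, and `NULL` when there is no previous row. "Previous" is only well defined when the window has its own `ORDER BY`, so the sketch below includes one. It assumes the built-in `sqlite3` module (window functions need SQLite 3.25 or later) and a hypothetical table `asylum` with values from the output above, inserted out of order on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asylum (Time INTEGER, Country TEXT, Refugees REAL)")
conn.executemany(
    "INSERT INTO asylum VALUES (?, ?, ?)",
    [(2022, "Poland", 971129), (2021, "Poland", 4875),    # deliberately unsorted
     (2021, "Hungary", 5676), (2022, "Hungary", 35370)],
)

rows = conn.execute(
    """
    SELECT Country,
           Time AS Year,
           Refugees - LAG(Refugees, 1)
               OVER (PARTITION BY Country   -- restart for every country
                     ORDER BY Time)         -- previous row = previous year
               AS Increment
    FROM asylum
    ORDER BY Country, Time
    """
).fetchall()

for row in rows:
    print(row)
# ('Hungary', 2021, None)
# ('Hungary', 2022, 29694.0)
# ('Poland', 2021, None)
# ('Poland', 2022, 966254.0)
```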


### [**4.3 Increment (%)**](#Index)  <a class="anchor"  id="43"></a>

In [13]:
query44 = psql.sqldf("""
            
            SELECT  
                    [Country Name],
                    Time AS Year,
                    [Refugee population by country or territory of asylum],
                    [Refugee population by country or territory of asylum] 
                    - 
                    LAG([Refugee population by country or territory of asylum],1)
                        OVER
                            (PARTITION BY [Country Name] ORDER BY Time) AS Increment,
                            
                    ([Refugee population by country or territory of asylum] 
                    - 
                    LAG([Refugee population by country or territory of asylum],1)
                        OVER
                            (PARTITION BY [Country Name] ORDER BY Time))*100/LAG([Refugee population by country or territory of asylum],1)
                        OVER
                            (PARTITION BY [Country Name] ORDER BY Time) AS [Increment %]        
        
            FROM
                    table1
            WHERE
                    Time in (2021, 2022)
                    AND
                    [Country Name] in ('Poland', 'Hungary', 'Moldova', 'Slovak Republic', 'Romania')
            ORDER BY
                    [Country Name],Time
                    
            """)

query44

Unnamed: 0,Country Name,Year,Refugee population by country or territory of asylum,Increment,Increment %
0,Hungary,2021,5676.0,,
1,Hungary,2022,35370.0,29694.0,523.150106
2,Moldova,2021,349.0,,
3,Moldova,2022,105374.0,105025.0,30093.123209
4,Poland,2021,4875.0,,
5,Poland,2022,971129.0,966254.0,19820.594872
6,Romania,2021,4200.0,,
7,Romania,2022,105621.0,101421.0,2414.785714
8,Slovak Republic,2021,1046.0,,
9,Slovak Republic,2022,96563.0,95517.0,9131.644359
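
The percentage is the increment divided by the previous year's value, times 100. The query above repeats the `LAG` expression three times. One possible alternative, shown here only as a sketch, computes the previous value once in a common table expression (`WITH`) and reuses it. It assumes the built-in `sqlite3` module (SQLite 3.25 or later) and a hypothetical table `asylum` with values from the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asylum (Time INTEGER, Country TEXT, Refugees REAL)")
conn.executemany(
    "INSERT INTO asylum VALUES (?, ?, ?)",
    [(2021, "Hungary", 5676), (2022, "Hungary", 35370),
     (2021, "Poland", 4875), (2022, "Poland", 971129)],
)

rows = conn.execute(
    """
    WITH with_previous AS (
        SELECT Country,
               Time,
               Refugees,
               LAG(Refugees, 1) OVER (PARTITION BY Country ORDER BY Time) AS Previous
        FROM asylum
    )
    SELECT Country,
           Time AS Year,
           Refugees - Previous AS Increment,
           (Refugees - Previous) * 100 / Previous AS [Increment %]
    FROM with_previous
    WHERE Previous IS NOT NULL      -- drop the first year, which has nothing to compare to
    ORDER BY Country
    """
).fetchall()

for country, year, increment, pct in rows:
    print(country, year, increment, round(pct, 2))
```

For Hungary this gives an increment of 29694 and roughly 523.15 %, in line with the table above.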


## [**More notebooks on SQL**](#Index)  <a class="anchor"  id="m"></a>


#### [SQL for Beginners with BigQuery](https://www.kaggle.com/code/tejota/sql-for-beginners-with-bigquery)

#### [How to query in a Kaggle notebook](https://www.kaggle.com/code/tejota/sql-how-to-query-in-a-kaggle-notebook)

#### [How to join data from different files](https://www.kaggle.com/code/tejota/sql-how-to-join-data-from-different-datasets)

#### [Pandas: SQL vs. Python](https://www.kaggle.com/code/tejota/pandas-cheatsheet-sql-vs-python-beginners)

#### [Date & Time](https://www.kaggle.com/code/tejota/sql-date-and-time)

#### [Gender statistics | Angola, Cabo Verde, Guinea Bissau, Mozambique and Sao Tome and Principe](https://www.kaggle.com/code/tejota/sql-gender-stats-angola-cabo-verde)

#### Back to [**Table of contents**](#Index)