# Data Exploration with Spark

The downloaded folder contains six CSV files, but this project uses only three of them: WDICountry.csv, WDIData.csv, and WDISeries.csv.

In [160]:
%%bash
ls ./Downloads/WDI_csv/

WDICountry-Series.csv
WDICountry.csv
WDIData.csv
WDIFootNote.csv
WDISeries-Time.csv
WDISeries.csv


In [1]:
import os
from IPython.display import display, HTML
import pandas as pd
import pyspark 

# Import SparkSession, the entry point for the DataFrame and SQL APIs
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("Data Exploration").getOrCreate()

### Begin with the "WDICountry.csv" dataset
I will look at its:
1. Format
2. Structure
3. Size
4. Dimensions

#### Read the file into a Spark DataFrame

In [6]:
country = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./Downloads/WDI_csv/WDICountry.csv")

#### Inspect the schema

In [11]:
country.printSchema()

root
 |-- Country Code: string (nullable = true)
 |-- Short Name: string (nullable = true)
 |-- Table Name: string (nullable = true)
 |-- Long Name: string (nullable = true)
 |-- 2-alpha code: string (nullable = true)
 |-- Currency Unit: string (nullable = true)
 |-- Special Notes: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Income Group: string (nullable = true)
 |-- WB-2 code: string (nullable = true)
 |-- National accounts base year: string (nullable = true)
 |-- National accounts reference year: string (nullable = true)
 |-- SNA price valuation: string (nullable = true)
 |-- Lending category: string (nullable = true)
 |-- Other groups: string (nullable = true)
 |-- System of National Accounts: string (nullable = true)
 |-- Alternative conversion factor: string (nullable = true)
 |-- PPP survey year: string (nullable = true)
 |-- Balance of Payments Manual in use: string (nullable = true)
 |-- External debt Reporting status: string (nullable = true)
 |-- Syst

#### Take a look at the first 5 rows 

In [80]:
# Display the first rows of a Spark DataFrame as a formatted pandas table
def showDF(df, limitRows=15, truncate=True):
    if truncate:
        # Cut long cell values at 50 characters
        pd.set_option('display.max_colwidth', 50)
        display(df.limit(limitRows).toPandas())
    else:
        # None removes the column-width limit (-1 is deprecated since pandas 1.0)
        pd.set_option('display.max_colwidth', None)
        pd.set_option('display.max_rows', limitRows)
        display(df.limit(limitRows).toPandas())
        pd.reset_option('display.max_rows')

In [81]:
showDF(country, limitRows = 5, truncate = False)

Unnamed: 0,Country Code,Short Name,Table Name,Long Name,2-alpha code,Currency Unit,Special Notes,Region,Income Group,WB-2 code,...,Government Accounting concept,IMF data dissemination standard,Latest population census,Latest household survey,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,_c30
0,ABW,Aruba,Aruba,Aruba,AW,Aruban florin,,Latin America & Caribbean,High income,AW,...,,Enhanced General Data Dissemination System (e-GDDS),2010,,,Yes,,,2016.0,
1,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,,South Asia,Low income,AF,...,Consolidated central government,Enhanced General Data Dissemination System (e-GDDS),1979,"Demographic and Health Survey, 2015","Integrated household survey (IHS), 2016/17",,,,2017.0,
2,AGO,Angola,Angola,People's Republic of Angola,AO,Angolan kwanza,,Sub-Saharan Africa,Lower middle income,AO,...,Budgetary central government,Enhanced General Data Dissemination System (e-GDDS),2014,"Demographic and Health Survey, 2015/16","Integrated household survey (IHS), 2008/09",,,,2017.0,
3,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,...,Consolidated central government,Enhanced General Data Dissemination System (e-GDDS),2011,"Demographic and Health Survey, 2017/18","Living Standards Measurement Study Survey (LSMS), 2012",Yes,2012.0,2013.0,2017.0,
4,AND,Andorra,Andorra,Principality of Andorra,AD,Euro,,Europe & Central Asia,High income,AD,...,,,2011. Population data compiled from administrative registers.,,,Yes,,,,
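
The last column in the output above, `_c30`, has the kind of positional name Spark assigns when a header cell is empty. One plausible cause, assumed here and not verified against the file, is a trailing comma at the end of each line. A minimal sketch of a helper that drops such columns follows; `unnamed_columns` and `drop_unnamed_columns` are hypothetical names, not part of PySpark.

```python
import re

# Matches the positional names Spark gives to columns without a header, e.g. "_c30"
_UNNAMED = re.compile(r"_c\d+")

def unnamed_columns(columns):
    """Return the column names that look like Spark's positional placeholders."""
    return [c for c in columns if _UNNAMED.fullmatch(c)]

def drop_unnamed_columns(df):
    """Return df without placeholder columns; df is expected to be a Spark DataFrame."""
    extra = unnamed_columns(df.columns)
    # Leave the DataFrame untouched when there is nothing to drop
    return df.drop(*extra) if extra else df
```

For example, `country = drop_unnamed_columns(country)` would remove `_c30` before any further queries.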


#### How many records are in this dataset?

In [30]:
country.count()

263
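
The row count is only half of the dataset's dimensions; the column count can be read from `DataFrame.columns`. A small sketch that reports both, where `shape` is a hypothetical helper name borrowed from pandas terminology:

```python
def shape(df):
    """Return (number of rows, number of columns) for a Spark DataFrame."""
    # count() runs a Spark job, while columns is schema metadata and cheap to read
    return (df.count(), len(df.columns))
```

Calling `shape(country)` would return the 263 rows counted above together with the number of columns.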

#### What are the different regions that the countries belong to?

In [None]:
# Register the DataFrame as a session-scoped temporary view for SQL queries
country.createOrReplaceTempView("country")

In [46]:
distinctRegions = """
SELECT DISTINCT
    Region
FROM 
    country
"""
showDF(spark.sql(distinctRegions))

Unnamed: 0,Region
0,South Asia
1,
2,Sub-Saharan Africa
3,Europe & Central Asia
4,North America
5,East Asia & Pacific
6,Middle East & North Africa
7,Latin America & Caribbean
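
The blank row in this output is most likely a null `Region`. Counting rows per value shows how many records fall into that group; those are presumably aggregate entries rather than individual countries, which is an assumption worth checking. A minimal sketch that builds the query text follows. `quote_identifier` and `value_counts_sql` are hypothetical helpers, and the backtick quoting follows Spark SQL's rule for identifiers that contain spaces.

```python
def quote_identifier(name):
    """Wrap a column or view name in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"

def value_counts_sql(view, column):
    """Build a query that counts rows per distinct value of column, nulls included."""
    col = quote_identifier(column)
    return (
        f"SELECT {col}, COUNT(*) AS n "
        f"FROM {quote_identifier(view)} "
        f"GROUP BY {col} "
        f"ORDER BY n DESC"
    )

print(value_counts_sql("country", "Region"))
```

The resulting string can be passed to `spark.sql` and displayed with `showDF`, in the same way as the other queries in this notebook.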


#### What are the different income groups?

In [54]:
incomeGroups = """
SELECT DISTINCT
    `Income Group`
FROM 
    country
"""
showDF(spark.sql(incomeGroups))

Unnamed: 0,Income Group
0,Lower middle income
1,
2,High income
3,Upper middle income
4,Low income


### Next, I will investigate the "WDISeries.csv" dataset

#### Read the file into a Spark DataFrame

In [141]:
series = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./Downloads/WDI_csv/WDISeries.csv")

#### Inspect the schema

In [137]:
series.printSchema()

root
 |-- Series Code: string (nullable = true)
 |-- Topic: string (nullable = true)
 |-- Indicator Name: string (nullable = true)
 |-- Short definition: string (nullable = true)
 |-- Long definition: string (nullable = true)
 |-- Unit of measure: string (nullable = true)
 |-- Periodicity: string (nullable = true)
 |-- Base Period: string (nullable = true)
 |-- Other notes: string (nullable = true)
 |-- Aggregation method: string (nullable = true)
 |-- Notes from original source: string (nullable = true)
 |-- General comments: string (nullable = true)
 |-- Source: string (nullable = true)
 |-- Related source links: string (nullable = true)
 |-- Other web links: string (nullable = true)
 |-- Related indicators: string (nullable = true)
 |-- License Type: string (nullable = true)



#### First rows of the dataset

In [142]:
showDF(series)

Unnamed: 0,Series Code,Topic,Indicator Name,Unit of measure,Periodicity,Base Period,Aggregation method,General comments,Source,Related source links,Other web links,Related indicators,License Type
0,AG.AGR.TRAC.NO,Environment: Agricultural production,"Agricultural machinery, tractors",,Annual,,Sum,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
1,AG.CON.FERT.PT.ZS,Environment: Agricultural production,Fertilizer consumption (% of fertilizer produc...,,Annual,,Weighted average,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
2,AG.CON.FERT.ZS,Environment: Agricultural production,Fertilizer consumption (kilograms per hectare ...,,Annual,,Weighted average,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
3,AG.LND.AGRI.K2,Environment: Land use,Agricultural land (sq. km),,Annual,,Sum,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
4,AG.LND.AGRI.ZS,Environment: Land use,Agricultural land (% of land area),,Annual,,Weighted average,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
5,AG.LND.ARBL.HA,Environment: Land use,Arable land (hectares),,Annual,,,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
6,AG.LND.ARBL.HA.PC,Environment: Land use,Arable land (hectares per person),,Annual,,Weighted average,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
7,AG.LND.ARBL.ZS,Environment: Land use,Arable land (% of land area),,Annual,,Weighted average,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
8,AG.LND.CREL.HA,Environment: Agricultural production,Land under cereal production (hectares),,Annual,,Sum,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0
9,AG.LND.CROP.ZS,Environment: Land use,Permanent cropland (% of land area),,Annual,,Weighted average,,"Food and Agriculture Organization, electronic ...",,,,CC BY-4.0


#### How many records are in this dataset?

In [143]:
series.count()

1437

#### What are the different periodicities and aggregation methods?

In [161]:
# Register the DataFrame as a session-scoped temporary view for SQL queries
series.createOrReplaceTempView("series")

In [148]:
periodicities = """
SELECT DISTINCT
    Periodicity
FROM 
    series
"""
showDF(spark.sql(periodicities))

Unnamed: 0,Periodicity
0,Annual
1,
2,Quarterly (represented as Annual)


In [150]:
aggregation = """
SELECT DISTINCT
    `Aggregation Method`
FROM 
    series
"""
showDF(spark.sql(aggregation))

Unnamed: 0,Aggregation Method
0,
1,Weighted average
2,Simple average
3,Gap-filled total
4,Median
5,Unweighted average
6,Linear mixed-effect model estimates
7,"World Bank, International Debt Statistics."
8,Sum
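
The value `World Bank, International Debt Statistics.` reads like a source citation rather than an aggregation method, which suggests that some rows were split or shifted during parsing. A common cause, assumed here rather than confirmed, is quoted text fields that contain line breaks. The sketch below enables Spark's `multiLine` CSV option and treats a doubled quote as an escaped quote; `read_csv_multiline` is a hypothetical helper name.

```python
def read_csv_multiline(spark, path):
    """Read a CSV file whose quoted fields may span several lines."""
    return (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("multiLine", "true")  # let a quoted field continue past a newline
        .option("escape", '"')        # treat "" inside a quoted field as a literal quote
        .csv(path)
    )
```

If the assumption holds, `read_csv_multiline(spark, "./Downloads/WDI_csv/WDISeries.csv").count()` should return fewer rows than the 1437 reported above, because continuation lines would no longer be counted as separate records.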


### Next, I will investigate the "WDIData.csv" dataset

#### Read the data into a Spark DataFrame

In [156]:
indicators = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./Downloads/WDI_csv/WDIData.csv")

#### Inspect the schema

In [157]:
indicators.printSchema()

root
 |-- Country Name: string (nullable = true)
 |-- Country Code: string (nullable = true)
 |-- Indicator Name: string (nullable = true)
 |-- Indicator Code: string (nullable = true)
 |-- 1960: double (nullable = true)
 |-- 1961: double (nullable = true)
 |-- 1962: double (nullable = true)
 |-- 1963: double (nullable = true)
 |-- 1964: double (nullable = true)
 |-- 1965: double (nullable = true)
 |-- 1966: double (nullable = true)
 |-- 1967: double (nullable = true)
 |-- 1968: double (nullable = true)
 |-- 1969: double (nullable = true)
 |-- 1970: double (nullable = true)
 |-- 1971: double (nullable = true)
 |-- 1972: double (nullable = true)
 |-- 1973: double (nullable = true)
 |-- 1974: double (nullable = true)
 |-- 1975: double (nullable = true)
 |-- 1976: double (nullable = true)
 |-- 1977: double (nullable = true)
 |-- 1978: double (nullable = true)
 |-- 1979: double (nullable = true)
 |-- 1980: double (nullable = true)
 |-- 1981: double (nullable = true)
 |-- 1982: double (null

#### First 5 rows of the dataset

In [158]:
showDF(indicators, limitRows = 5)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,_c64
0,Arab World,ARB,"2005 PPP conversion factor, GDP (LCU per inter...",PA.NUS.PPP.05,,,,,,,...,,,,,,,,,,
1,Arab World,ARB,"2005 PPP conversion factor, private consumptio...",PA.NUS.PRVT.PP.05,,,,,,,...,,,,,,,,,,
2,Arab World,ARB,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,82.783289,83.120303,83.533457,83.897596,84.171599,84.510171,,,,
3,Arab World,ARB,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,86.428272,87.070576,88.176836,87.342739,89.130121,89.678685,90.273687,,,
4,Arab World,ARB,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,73.942103,75.244104,77.162305,75.538976,78.741152,79.665635,80.749293,,,


#### How many records are in this dataset?

In [159]:
indicators.count()

377256
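
Each row of this dataset holds one country and indicator pair, with a separate column per year. For filtering or aggregating by year, a long layout with one row per country, indicator, and year is often easier to query. The sketch below uses Spark SQL's `stack` generator. `year_columns`, `stack_years_expr`, and `to_long` are hypothetical helpers, and the assumption is that every year column has an all-digit name, as in the schema above. Each year column is cast to double so that the stacked values share one type.

```python
def year_columns(columns):
    """Return the column names made up only of digits, e.g. '1960'."""
    return [c for c in columns if c.isdigit()]

def stack_years_expr(years):
    """Build a stack(...) expression that turns year columns into (year, value) rows."""
    if not years:
        raise ValueError("no year columns to stack")
    pairs = ", ".join(f"'{y}', CAST(`{y}` AS DOUBLE)" for y in years)
    return f"stack({len(years)}, {pairs}) AS (year, value)"

def to_long(df):
    """Reshape a wide WDI-style Spark DataFrame to one row per year."""
    return df.selectExpr(
        "`Country Code`",
        "`Indicator Code`",
        stack_years_expr(year_columns(df.columns)),
    )
```

With this in place, `to_long(indicators).where("value IS NOT NULL").count()` would give the number of populated observations rather than the number of country and indicator pairs.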