# Kuala Lumpur School of AI

## Introduction to Data Analysis

### Outline 

1) Load Data Set

2) Data Cleaning and Exploration

3) Data Analysis
   * Q1: How does the population growth differ in the US and in China?
   * Q2: How does the GDP Per Capita differ in the US and China?
   * Q3: What is the relationship between GDP Per Capita and Life Expectancy?
   
4) Real World Case Study (If we have time)
   * Q: What's the richest country in the world on a per-person basis?

### Step 1: Loading Our Data Set

Before loading our data set, we must identify the *path* to it on our computer.

We can check our current working directory with the Linux command 'pwd'. In a Jupyter notebook, the leading '!' runs the command in the shell.

!pwd

Let's create a **variable** called *file_path*:
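
As a minimal sketch, the path can be built from the current working directory with Python's *pathlib*. The folder and file name below are an assumption taken from the path used in the case study later in this notebook ('data/world_countries.csv'); adjust them to wherever your file actually lives.

```python
from pathlib import Path

# Assumption: the CSV sits in a "data" folder inside the current working
# directory, matching the path used in the case study section.
file_path = Path.cwd() / "data" / "world_countries.csv"

print(file_path)           # full path built from the working directory
print(file_path.exists())  # True only if the file is really there
```

If the second line prints False, the file is somewhere else and *file_path* needs to be changed.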

Now, let's import the **pandas** library (a Python library used for data manipulation and analysis) to help us load our data set.

Once imported, we will use pandas (here abbreviated as *pd*) to load our .csv file.
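
The loading step might look like the sketch below. To keep it runnable without the workshop file, it reads a tiny in-memory CSV with made-up values; the column names follow the ones used later in this notebook, and the *continent* column is an assumption. With the real file you would pass *file_path* to *pd.read_csv* instead.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file (not real figures).
# Column names follow the notebook; 'continent' is an assumption.
csv_text = """country,year,population,continent,lifeExpectancy,gdpPerCapita
Malaysia,2002,24000000,Asia,73.0,10000.0
Malaysia,2007,25000000,Asia,74.0,12000.0
"""

# pd.read_csv accepts a file path or a file-like object.
# With the real data set: data = pd.read_csv(file_path)
data = pd.read_csv(io.StringIO(csv_text))

# Preview the first rows
print(data.head())
```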

**Congratulations! You have loaded your first data set with pandas!**

### Step 2: Data Cleaning and Exploration

   * Data cleaning is where you will spend roughly 70% of your time as a data analyst/scientist. 
   * The objective of this morning session is to get a **feel** for analytics, so we will use a clean data set.
   * Later, in the evening session, you will experience how to *clean* data before any exploration.
   
   

#### Data types

* object = strings, e.g. 'Malaysia'
* int64 = integers, e.g. 2019
* float64 = floating-point numbers, e.g. 25.1

* More info on Python, pandas and NumPy data types:
http://pbpython.com/pandas_dtypes.html

* The link also shows how to convert from one data type to another.
* This is usually part of the *data cleaning* process, e.g. *"1984"* may be stored as a **string** when you want it as an **integer**. 
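
A small sketch of checking and converting data types, using a hypothetical three-column frame (the column names and values are made up for illustration). It uses *.dtypes* to inspect the types and *.astype* to turn the string years into integers.

```python
import pandas as pd

# Hypothetical frame: the years were stored as strings by mistake
df = pd.DataFrame({
    "country": ["Malaysia", "Malaysia"],
    "year": ["1984", "2019"],
    "value": [25.1, 30.0],
})

# Inspect the data type of every column
print(df.dtypes)

# Convert the 'year' column from strings to 64-bit integers
df["year"] = df["year"].astype("int64")
print(df.dtypes)
```

After the conversion, arithmetic on the *year* column (for example subtracting two years) works as expected.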

#### Dimensions of the Pandas Data Frame

This means that the Pandas Data Frame has 1704 rows × 6 columns

#### Missing values?

We have no null values! Let's move on. 

If you are curious to know what we should do with null values, we will explore this in the second session.
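
For reference, the dimension and missing-value checks can be sketched as follows. The frame here is hypothetical and deliberately contains one missing value so that the count is not zero; on the workshop data set the same calls would report 1704 rows, 6 columns and no nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing population value
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [1.0, np.nan, 3.0],
})

# .shape is a (rows, columns) tuple
print(df.shape)

# .isnull() marks missing cells; .sum() counts them per column
null_counts = df.isnull().sum()
print(null_counts)
```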

#### Slicing data

Now let's explore how to access individual columns and rows within our data set.

We know that our data set has 1704 rows and 6 columns.

How do I access row 1, column 2? 

We use the *.iloc* indexer, which stands for *integer location*: it selects rows and columns by position, counting from 0.

In [None]:
#Here we are accessing row 0 and column 0

In [None]:
#Rows 0 to 3 and column 0

In [None]:
#Rows 0 to 3 and all columns 

In [None]:
#Rows 3 to 7 and columns 2 to 3 
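
The four selections described in the cells above might look like this sketch, shown on a hypothetical 8-row frame. Note that *.iloc* slices exclude the stop position, so *0:3* returns positions 0, 1 and 2; the exact slices used in the workshop are an assumption.

```python
import pandas as pd

# Hypothetical frame with 8 rows and 4 columns (made-up values)
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "year": [2007] * 8,
    "population": [10, 20, 30, 40, 50, 60, 70, 80],
    "gdpPerCapita": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

cell = df.iloc[0, 0]              # row 0, column 0
first_rows_col0 = df.iloc[0:3, 0] # positions 0, 1, 2 of column 0
first_rows = df.iloc[0:3, :]      # positions 0, 1, 2 of every column
block = df.iloc[3:7, 2:4]         # rows 3 to 6, columns 2 and 3

print(cell)
print(block)
```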

**Pandas Series and Data Frames**

In [None]:
# Several Rows of One Column

In [None]:
# Several Rows and Several Columns 

In [None]:
# One Row and One Column
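
A short sketch of what each kind of selection returns, again on a hypothetical frame: several rows of one column give a **Series**, several rows and columns give a **Data Frame**, and a single row and column give a plain value.

```python
import pandas as pd

# Hypothetical frame (made-up values)
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [10, 20, 30, 40],
    "gdpPerCapita": [1.0, 2.0, 3.0, 4.0],
})

series = df.iloc[0:3, 0]    # several rows of one column -> Series
frame = df.iloc[0:3, 0:2]   # several rows and columns   -> DataFrame
value = df.iloc[0, 0]       # one row and one column     -> single value

print(type(series))
print(type(frame))
print(type(value))
```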

### Step 3: Data Analysis

The two main methods we are going to use today are summary statistics and visualisations.

* Let's analyse the data for Malaysia

First we need to extract the Malaysian data from the data set

Now, suppose we want to know: what is the average life expectancy, population and GDP per capita in Malaysia?
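
One way to sketch both steps is a boolean filter followed by *.mean()*. The frame below is a hypothetical stand-in with made-up values; the column names follow the ones used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the workshop data set (made-up values)
data = pd.DataFrame({
    "country": ["Malaysia", "Malaysia", "Japan", "Japan"],
    "year": [2002, 2007, 2002, 2007],
    "population": [20, 30, 100, 110],
    "lifeExpectancy": [70.0, 74.0, 80.0, 82.0],
    "gdpPerCapita": [1000.0, 3000.0, 20000.0, 30000.0],
})

# Keep only the rows where the country is Malaysia
malaysia = data[data.country == "Malaysia"]

# Average of the three columns of interest
means = malaysia[["lifeExpectancy", "population", "gdpPerCapita"]].mean()
print(means)
```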

Statistics are very useful for describing data but can be hard to understand without visualization.

**Let's do some basic visualisation**

First, import **matplotlib's pyplot** (abbreviated as *plt*) into our Jupyter Notebook

**Simple line plot**

In [None]:
# Create a list of integers
x = [1,2,3]
y = [1,2,3]
z = [4,5,6] #add a 3rd variable

#Plot 
plt.plot(x,y)
#plt.plot(x,z) #plot both lines

#Add Title
plt.title('Title of our diagram')

#Add labels
plt.xlabel('x')
plt.ylabel('y')
#plt.ylabel('y and z') #change y label to include y and z

#Add legends
plt.legend(["this is y"])
#plt.legend(["this is y","this is z"])  #change legend to include both lines

#Show diagram
plt.show()

**Let's do some analysis!**

**Q1: How does the population growth differ in the US and in China?**

Step 1. Extract data from the US and China

Step 2. Plot the population data over time

In [None]:
plt.figure(figsize=(16,8))

# Add Here
# Add Here

plt.title('Population growth in US and China')
plt.legend(['United States','China'])
plt.xlabel("year")
plt.ylabel("population")

plt.show()
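
A possible way to fill in the two steps is sketched below. The frame is a stand-in that uses only the rounded start and end figures quoted in the insights that follow, so the lines are straight; with the full data set the same filter and *plt.plot* calls apply.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data built from the rounded figures quoted in this notebook
data = pd.DataFrame({
    "country": ["United States", "United States", "China", "China"],
    "year": [1952, 2007, 1952, 2007],
    "population": [150e6, 300e6, 550e6, 1.3e9],
})

# Step 1: extract the rows for each country
us = data[data.country == "United States"]
china = data[data.country == "China"]

# Step 2: plot population (in millions) over time, one line per country
plt.figure(figsize=(16, 8))
plt.plot(us.year, us.population / 10**6)
plt.plot(china.year, china.population / 10**6)
plt.title("Population growth in US and China")
plt.legend(["United States", "China"])
plt.xlabel("year")
plt.ylabel("population (millions)")
plt.show()
```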

**Conclusive Insights**

* The US population grew from about 150 million to about 300 million between 1952 and 2007 (just over 50 years)
* China's population grew from about 550 million to about 1.3 billion over the same period

This is great but, how do we get more insight from this data?
* What is the relative growth? Is it higher in the US or China?

In [None]:
#US Population 

us[['year','population']] #Data frame to Show the year and population in the US
#us.population #Series of the population in the US

Let's divide each year's population by the population in the first year that we have.

What is the first population? 
* It is the population in 1952, the first year in the data set. 
* How do we access this data?
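
The first value of a column can be read with *.iloc[0]*, and dividing the whole column by it gives the relative growth. This is a sketch on a hypothetical frame with rounded, partly made-up values, so its result (a doubling) is not the exact figure from the real data.

```python
import pandas as pd

# Hypothetical US rows: rounded start and end values, made-up middle value
us = pd.DataFrame({
    "year": [1952, 1980, 2007],
    "population": [150e6, 225e6, 300e6],
})

# Population in the first year of the data
first_population = us.population.iloc[0]

# Each year's population as a percentage of the first year (first year = 100)
us_growth = us.population / first_population * 100
print(us_growth)
```

A final value of 200 means the population is 200% of its starting level, i.e. it grew by 100%.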

In [None]:
#This shows the population of each year relative to the population in 1952.

The result is each year's population compared with the 1952 population.

From here we can see that in just over *50 years*, the population in the US grew by **91%**

Impressive! But what about China?

In [None]:
#This shows the population of each year relative to the population in 1952.

The population in China grew by **137%**!


**Let's visualise this result**

In [None]:
plt.figure(figsize=(16,8))

# Add Here
# Add Here

plt.title('Relative population growth by percentage in the US and China')
plt.legend(['United States','China'])

plt.xlabel("year")
plt.ylabel("population growth ")
plt.show()

**Conclusive Insights**
* The population in the US grew **91%, from 150 million to 300 million, in just over 50 years** 
* The population in China grew **137%, from 550 million to 1.3 billion, in just over 50 years**

We now know the absolute growth amount and the growth rates of each country, allowing us to compare both countries.

In principle, we can do this for all countries by writing a Python function and showing the results in a table or a visualization. I will leave that to your imagination.
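
As a starting point, such a function could look like the sketch below. The name *relative_growth* and its arguments are hypothetical, and the frame uses only the rounded figures quoted above, so the percentages differ slightly from those computed on the real data.

```python
import pandas as pd

def relative_growth(df, column):
    """Percent change of `column` between each country's first and last year."""
    ordered = df.sort_values("year")
    grouped = ordered.groupby("country")[column]
    first = grouped.first()   # value in the earliest year per country
    last = grouped.last()     # value in the latest year per country
    growth = (last / first - 1) * 100
    return growth.sort_values(ascending=False)

# Stand-in data built from the rounded figures quoted in this notebook
data = pd.DataFrame({
    "country": ["United States", "United States", "China", "China"],
    "year": [1952, 2007, 1952, 2007],
    "population": [150e6, 300e6, 550e6, 1.3e9],
})

growth_table = relative_growth(data, "population")
print(growth_table)
```

The same function works for any numeric column, for example *gdpPerCapita*.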

**Q2: How does the GDP Per Capita differ in the US and China?**

Here we have two of the world's strongest economies. But when did China's economy start to boom? Is it doing better than the US?

What is GDP Per Capita? 

* GDP Per Capita is the annual GDP of a country divided by its population.

* i.e. GDP Per Capita = Annual GDP / Population

https://www.investopedia.com/terms/g/gdp.asp

Let's do the same analysis for the US and China as we did before

**1. Absolute Growth**

In [None]:
plt.figure(figsize=(16,8))

# Add Here
# Add Here

plt.title('GDP Per Capita Growth in US and China')
plt.legend(['United States','China'])
plt.xlabel("year")
plt.ylabel("GDP per Capita ($)")

plt.show()

**2. Relative Growth**

In [None]:
# Add here
# Add here

plt.figure(figsize=(16,8))
plt.plot(us.year, us_growth)
plt.plot(china.year, china_growth)
plt.title("GDP Per Capita Relative Growth in U.S and China")
plt.legend(["US","China"])
plt.xlabel("Year")
plt.ylabel("GDP per Capita relative to first year")
plt.show()


**Conclusive Insights**

* The GDP Per Capita in the US grew **200%**, from **14,000 USD** to **43,000 USD**, in just over *50 years* 
* The GDP Per Capita in China grew **1140%**, from **400 USD** to **5,000 USD**, in just over *50 years*


* Remember, GDP per capita is the GDP of a country divided by its population
* The US has about 300 million people and China has about 1.3 billion people
* I will leave it to you to explore the difference in total GDP between the two countries
* More data and info on the economies of USA and China in this link: https://www.visualcapitalist.com/china-vs-united-states-a-tale-of-two-economies/
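
As a rough starting point for that exploration, total GDP can be recovered by multiplying GDP per capita by population. The sketch below uses only the rounded 2007 figures quoted above, so the results are approximations rather than official statistics.

```python
# Rounded 2007 figures quoted in this notebook
gdp_per_capita = {"United States": 43_000, "China": 5_000}             # USD per person
population = {"United States": 300_000_000, "China": 1_300_000_000}    # people

# GDP = GDP per capita * population
total_gdp = {c: gdp_per_capita[c] * population[c] for c in gdp_per_capita}

for country, gdp in total_gdp.items():
    print(f"{country}: {gdp / 10**12:.1f} trillion USD")
```

With these rounded inputs the US total comes out at roughly twice China's, even though its GDP per capita is more than eight times higher.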


**Q3: What is the relationship between GDP Per Capita and Life Expectancy?**

In [None]:
data_07 = data[data.year == 2007]
plt.figure(figsize=(16,8))
plt.scatter(data_07.gdpPerCapita, data_07.lifeExpectancy, 5)
plt.title('GDP Per Capita and Life Expectancy in 2007')
plt.xlabel("GDP Per Capita ($)")
plt.ylabel('Life Expectancy (Age)')
plt.show()

* Life expectancy increases dramatically at the lower end of the GDP per capita range; above roughly $10,000, however, a large increase in GDP per capita brings only a slow increase in life expectancy.

* The relationship between the two variables is non-linear.


Let's check the correlation between GDP Per Capita and Life Expectancy

In [None]:
data_07.gdpPerCapita.corr(data_07.lifeExpectancy) 

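The correlation is moderate. The coefficient computed by *.corr* (Pearson, by default) measures *linear* association, so it understates a relationship that is clearly non-linear.

How can we change this non-linear relationship to become linear?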

* Log scales?

In [None]:
import numpy as np

np.log10([1, 10, 100, 1000]) #Log scale: 10^0 = 1, 10^1 = 10, 10^2 = 100, 10^3 = 1000, so this returns 0, 1, 2, 3

Let's plot the data using a log scale for the GDP Per Capita and check the correlation

In [None]:
plt.figure(figsize=(16,8))
plt.scatter(np.log10(data_07.gdpPerCapita), data_07.lifeExpectancy)
plt.title('GDP Per Capita and Life Expectancy in 2007')
plt.xlabel("GDP Per Capita (log10 of $)")
plt.ylabel('Life Expectancy (Age)')
plt.show()

In [None]:
np.log10(data_07.gdpPerCapita).corr(data_07.lifeExpectancy) #When using the log scale
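
To see why the log scale helps, here is a sketch on synthetic data in which life expectancy is constructed to be exactly linear in log10 of GDP per capita. All values are made up for illustration; real data is noisier, so the improvement is smaller.

```python
import numpy as np
import pandas as pd

# Synthetic GDP per capita values (made up)
gdp = pd.Series([500.0, 1000.0, 5000.0, 10000.0, 40000.0])

# Life expectancy built to be linear in log10(gdp)
life = 10 * np.log10(gdp) + 30

# Pearson correlation on the raw scale and on the log scale
raw_corr = gdp.corr(life)
log_corr = np.log10(gdp).corr(life)

print(raw_corr)
print(log_corr)
```

The log-scale correlation is essentially 1 here because the relationship was built that way, while the raw-scale correlation is lower for the very same points.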

In [None]:
years_sorted = sorted(set(data.year)) #set of years

for given_year in years_sorted:  #iterate through the years
    data_year = data[data.year == given_year] #choose data of given year
    plt.figure(figsize=(16,8)) #figure size
    #plt.scatter(data_year.gdpPerCapita, data_year.lifeExpectancy, 5) #plot absolute relationship of gdp n LE
    plt.scatter(np.log10(data_year.gdpPerCapita), data_year.lifeExpectancy, 5) #plot on log scale for gdp where 3 = 10^3 = 1000
    plt.title(given_year)
    #plt.xlim(0,60000)
    plt.xlim(2,5)
    plt.ylim(25, 85)   
    plt.xlabel('GDP Per Capita (log10 of $)')
    plt.ylabel('Life Expectancy')
    #plt.show()
    #plt.savefig(str(given_year), dpi = 200) #save figure
    plt.savefig('log_' + str(given_year), dpi = 200) #save figure
    plt.close() #Close the figure so figures do not pile up in memory

**Conclusive Insights**

* As the years have passed, life expectancy around the world has increased (Modern medicine?)
* The GDP per capita of most countries has also increased (Globalization?)

### Step 4: Case Study 

#### Q: What's the richest country in the world on a per-person basis?

In [None]:
# Load data 
data = pd.read_csv('data/world_countries.csv')

#Preview data
data.head()

In [None]:
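country_mean = data.groupby(['country']).mean(numeric_only=True) #numeric_only=True skips text columns, which newer pandas versions refuse to average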

In [None]:
country_mean.head()

In [None]:
#Find the 10 countries with the highest mean GDP per capita

top10 = country_mean.sort_values('gdpPerCapita', ascending = False).head(10)
top10
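
The group, average and sort steps can be checked on a tiny hypothetical frame, where the result is easy to verify by hand. Country names and values are made up; the *continent* column is included to show why *numeric_only=True* is useful.

```python
import pandas as pd

# Hypothetical data: three countries, two years each (made-up values)
data = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "continent": ["X", "X", "Y", "Y", "Z", "Z"],
    "year": [2002, 2007, 2002, 2007, 2002, 2007],
    "gdpPerCapita": [10.0, 20.0, 50.0, 70.0, 30.0, 30.0],
})

# Mean of every numeric column per country; the text column is skipped
country_mean = data.groupby("country").mean(numeric_only=True)

# Highest mean GDP per capita first, keep the top 2
top2 = country_mean.sort_values("gdpPerCapita", ascending=False).head(2)
print(top2)
```

Note that this ranks countries by their average over all years in the data, not by the most recent year.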

In [None]:
# Visualize

x = range(10)
plt.figure(figsize=(16,12))
plt.bar(x, top10.gdpPerCapita)
plt.xticks(x, top10.index, rotation = 'vertical')
plt.ylabel("GDP Per Capita")
plt.title('Top 10 Countries by GDP per Capita')
plt.legend(['GDP Per Capita'])

plt.show()

#### Is Kuwait really the richest country in the world (on a per-person basis)?

In [None]:
kuwait = data[data.country == 'Kuwait']
kuwait.head()

In [None]:
plt.figure(figsize=(16,12))
plt.plot(kuwait.year, kuwait.gdpPerCapita)
plt.title("GDP per Capita in Kuwait from 1952 to 2007")
plt.xlabel("year")
plt.ylabel('GDP PerCapita in USD')
plt.show()

#### Since GDP per capita is GDP per person, let's analyze the GDP per capita, GDP and population of Kuwait

In [None]:
plt.figure(figsize=(16,12))

plt.subplot(311)
plt.title("GDP Per Capita")
plt.ylabel("GDP PER CAP")
plt.plot(kuwait.year, kuwait.gdpPerCapita)

plt.subplot(312)
plt.title("GDP in Billions")
plt.ylabel("GDP in BILLIONS")
plt.plot(kuwait.year, kuwait.gdpPerCapita * kuwait.population /10**9)

plt.subplot(313)
plt.title("Population in Millions")
plt.ylabel("Population in Millions")
plt.plot(kuwait.year, kuwait.population / 10**6)

plt.tight_layout()
plt.show()

#### Relative growth of Kuwait's GDP compared with its population

In [None]:
plt.figure(figsize=(16,12))
plt.plot(kuwait.year, kuwait.population / kuwait.population.iloc[0] * 100)

kuwait_gdp = kuwait.population * kuwait.gdpPerCapita
plt.plot(kuwait.year, kuwait_gdp / kuwait_gdp.iloc[0] * 100)
plt.plot(kuwait.year, kuwait.gdpPerCapita / kuwait.gdpPerCapita.iloc[0] * 100)

plt.title('GDP and Population Growth in Kuwait ')
plt.legend(['Population','GDP','GDP Per Capita'])

plt.show()

#### Let's compare this with the US

In [None]:
us = data[data.country == "United States"]

In [None]:
plt.figure(figsize=(16,12))
plt.plot(us.year, us.gdpPerCapita)
plt.plot(kuwait.year, kuwait.gdpPerCapita)
plt.title("GDP per Capita in Kuwait and US from 1952 to 2007")
plt.xlabel("year")
plt.ylabel('GDP PerCapita in USD')
plt.legend(['US','Kuwait'])
plt.show()

* So, is Kuwait really the richest country on a per-person basis? 
* A country like the US has a steadily growing GDP and roughly 100 times the population of Kuwait, but is everyone rich? Are more people rich?
* A country like Kuwait has a large amount of wealth and a very small population. Does that mean each person has a better chance of getting a slice of the pie? 
* In the end, you decide.

* GDP Per Capita 2017 

https://www.statista.com/statistics/270180/countries-with-the-largest-gross-domestic-product-gdp-per-capita/

* GDP 2017 

https://www.statista.com/statistics/268173/countries-with-the-largest-gross-domestic-product-gdp/

**Thank you**