# Assignment 1: Introduction to Data Science and Python

Student name | Hours spent on the tasks
------------ | -------------
Lenia Malki | 8
Maële Belmont | 8


To do:
- show available year list
- reference
- questions

## Setup
Python modules need to be loaded to solve the tasks below.

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression

## Task 1 - Download some data related to GDP per capita and life expectancy <br>
#### A. Write a Python program that draws a scatter plot of GDP per capita vs life expectancy. State any assumptions and motivate decisions that you make when selecting data to be plotted, and in combining data. [1p]



When extracting the data, we chose not to include population size as a parameter. As we are primarily interested in exploring the relationship between life expectancy and GDP per capita, we did not consider population size to contribute to this relationship. Excluding it makes it easier to focus on the other two variables.

In terms of countries, we removed data points with null values for the chosen year, since a country cannot be plotted without both a GDP per capita value and a life expectancy value. Among the available data, we chose to focus on the data from 2018. The reason lies in the assumption that recent data, in this context, is more accurate and more widely available than data from, for example, the 1900s.

In order to visualize the data in a readable manner, we chose to group countries together by their respective continents and color code these. To minimize cluttering, we used Plotly to create an interactive plot, which displays more information (country and exact value of life expectancy and GDP) when the cursor is on a dot. 

The program is not limited to one specific year: any year between 1543 and 2018 for which data is available can be studied, although only one year is plotted at a time.
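
The selection steps described above (pick one year, attach a continent to every country, drop incomplete rows) can be sketched compactly with pandas. This is only a minimal sketch under stated assumptions: the table `raw`, its values, the country names and the helper `select_year` are all hypothetical and made up for illustration; only the column names follow the downloaded file.

```python
import pandas as pd

# Hypothetical miniature of the downloaded table; all values are made up.
raw = pd.DataFrame({
    "Entity": ["Aland", "Aland", "Borduria", "Borduria", "Cascadia", "Cascadia"],
    "Year": [2015, 2018, 2015, 2018, 2015, 2018],
    "GDP per capita": [1000.0, 1100.0, 20000.0, None, 5000.0, 5200.0],
    "Life expectancy": [60.0, 61.0, 80.0, 80.5, 70.0, 70.4],
    "Continent": ["Africa", None, "Europe", None, "Americas", None],
})

def select_year(data, year, reference_year=2015):
    """Return the rows for `year` with continents filled in and incomplete rows dropped."""
    # Build an Entity -> Continent lookup from the one year that carries continent labels.
    reference = data.loc[data["Year"] == reference_year].dropna(subset=["Continent"])
    continent_of = reference.set_index("Entity")["Continent"]
    # Keep only the requested year and attach the continent to each country.
    selected = data.loc[data["Year"] == year].copy()
    selected["Continent"] = selected["Entity"].map(continent_of)
    # Drop countries that lack any of the values needed for the plot.
    return selected.dropna().reset_index(drop=True)

subset_2018 = select_year(raw, 2018)
print(subset_2018)
```

In this made-up table, Borduria has no GDP per capita value for 2018 and is therefore left out of the result. Using a lookup with `map` replaces a pairwise comparison of all countries with a single pass.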


In [8]:
def GDPperCapita(csv_file, column1, column2, title):
    """
    input: 
    - csv_file (string): name of csv file
    - column1 (string): name of column for data GDP per capita
    - column2 (string): name of column for data other than GDP per capita (ex: 'Life expectancy')
    - title (string): Title of the plot
    output:
    - dataframe containing data for the chosen year without NaN values
    """
    #Read the csv file containing the downloaded data
    rawData = pd.read_csv(csv_file)
    
    #Remove rows containing NaN value in the 'GDP per capita' or 'Life expectancy'
    noNaN = rawData.dropna(subset=[column1, column2])
    
    #Print list with years where data is available
    yearList = []
    for i in range(len(noNaN)):
        if noNaN['Year'].values[i] not in yearList:
            yearList.append(noNaN['Year'].values[i])
    yearList.sort() #sort it in ascending order (in place; sort() returns None)
    print('---------------------------------------------------------------------------------------------------------------------')
    print('List of years for which data is available: \n')
    print(yearList)
    print('---------------------------------------------------------------------------------------------------------------------')

    #Create a new table with the data for year of interest chosen by the user
    year = int(input('Year of interest for the data "%s": \n' % (title)))
    print('---------------------------------------------------------------------------------------------------------------------')
    yearData = rawData.loc[rawData['Year']==year].reset_index()
        
    #Replace NaN values in 'Continent' by the continent value for every country
    year2015 = rawData.loc[rawData['Year']==2015].reset_index() #continents are defined only in 2015, but we want them to be defined for all the years.
    
    for i in range(len(yearData)):
        for j in range(len(year2015)):
            if yearData['Entity'][i] == year2015['Entity'][j]: #compare the 'Entity' of the chosen year to the 'Entity' in 2015 and if they are the same...
                yearData.at[i, 'Continent'] = year2015['Continent'][j] #...assign the 'Continent' value of 2015 to the 'Continent' value of the chosen year

    #Create a new table with only the relevant columns
    relevantColumns = yearData.loc[:, ['Entity','Year', column1, column2, 'Continent']]

    #Create a new table without rows containing NaN values
    finalData = relevantColumns.dropna()
    
    return finalData

In [9]:
def GDP(csv_file, column1, column2, title):
    """
    input: 
    - csv_file (string): name of csv file
    - column1 (string): name of column for data GDP per capita
    - column2 (string): name of column for data other than GDP
    - title (string): Title of the plot
    output:
    - dataframe containing data for the chosen year without NaN values
    """
    #Read the csv file containing the downloaded data
    rawData = pd.read_csv(csv_file)
    
    #Create a new table with the data for year of interest chosen by the user
    year = int(input('Year of interest for the data "%s": \n' % (title)))
    print('---------------------------------------------------------------------------------------------------------------------')
    yearData = rawData.loc[rawData['Year']==year].reset_index()
        
    #Replace NaN values in 'Continent' by the continent value for every country
    year2015 = rawData.loc[rawData['Year']==2015].reset_index() #continents are defined only in 2015, but we want them to be defined for all the years.
    
    for i in range(len(yearData)):
        for j in range(len(year2015)):
            if yearData['Entity'][i] == year2015['Entity'][j]: #compare the 'Entity' of the chosen year to the 'Entity' in 2015 and if they are the same...
                yearData.at[i, 'Continent'] = year2015['Continent'][j] #...assign the 'Continent' value of 2015 to the 'Continent' value of the chosen year

    yearData['GDP'] = yearData[column1]*yearData['Population (historical estimates)']
    
    #Create a new table with only the relevant columns
    relevantColumns = yearData.loc[:, ['Entity','Year', 'GDP', column2, 'Continent']]

    #Create a new table without rows containing NaN values
    finalData = relevantColumns.dropna()
    
    return finalData

In [10]:
def plotFigure(finalData, xlabel, ylabel, title, min_x, max_x, min_y, max_y):
    """
    input: 
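    - finalData: dataframe to be plotted (third column on the x-axis, fourth column on the y-axis)
    - xlabel: label displayed under x-axis
    - ylabel: label displayed next to y-axis
    - title: title of the plot
    - min_x, max_x: margins subtracted from the smallest and added to the largest x value to set the x-axis range
    - min_y, max_y: margins subtracted from the smallest and added to the largest y value to set the y-axis range
    output:
    - figure with the plot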
    """ 
    ###Plot 
    try:
        #Set axis limits
        year = finalData['Year'].values[0]
        column1=finalData.columns[2]
        column2=finalData.columns[3]
        minimum_x = finalData[column1].min()-min_x
        maximum_x = finalData[column1].max()+max_x
        minimum_y = finalData[column2].min()-min_y 
        maximum_y = finalData[column2].max()+max_y

        #Create Plotly figure
        fig = px.scatter(finalData, x=column1, 
                     y=column2, 
                     color='Continent', hover_data=['Entity'], 
                     log_x=True, title=f'{title}, {year}',
                     labels={column1: xlabel, column2: ylabel},
                     trendline='ols', 
                     trendline_options=dict(log_x=True), trendline_scope='overall', trendline_color_override='black', 
                     range_x=[minimum_x, maximum_x], range_y=[minimum_y, maximum_y])
        #Display figure
        fig.show()
    except IndexError: #raised when finalData is empty, i.e. there is no data for the chosen year
        print('---------------------------------------------------------------------------------------------------------------------')
        print('No data available for this year, please run the cell again and try with another year.\n')

In [11]:
def aboveSD(finalData, ylabel):
    """
    inputs: 
    - finalData: dataframe containing the data of interest
    - ylabel: name of the quantity, used in the printed message
    output:
    - displays the rows whose value is at least one standard deviation above the mean
    """ 
    column1=finalData.columns[2]
    column2=finalData.columns[3]
    
    #Print useful statistical values
    mean = finalData[column2].mean()
    sd = finalData[column2].std()
    oneSDAboveMean = mean + sd

    Filter = finalData[column2] >= oneSDAboveMean
    tableSDAboveMean = finalData[['Entity','Year', column2]]

    print('Dataframe with countries having', ylabel ,'at least one standard deviation above the mean.')
    display(tableSDAboveMean[Filter])

---

In [12]:
plotFigure(GDPperCapita('life-expectancy-vs-gdp-per-capita.csv',
          'GDP per capita',
          'Life expectancy',
          'Life expectancy vs. GDP per capita'), 'GDP per capita ($)', 
           'Life expectancy (years)', 'Life expectancy vs. GDP per capita', 100, 10000, 1, 5)

---------------------------------------------------------------------------------------------------------------------
List of year where data is available for the plot: 

[1543, 1548, 1553, 1558, 1563, 1568, 1573, 1578, 1583, 1588, 1593, 1598, 1603, 1608, 1613, 1618, 1623, 1628, 1633, 1638, 1643, 1648, 1653, 1658, 1663, 1668, 1673, 1678, 1683, 1688, 1693, 1698, 1703, 1708, 1713, 1718, 1723, 1728, 1733, 1738, 1743, 1748, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1775, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784, 1785, 1786, 1787, 1788, 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797, 1798, 1799, 1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 

#### B. Consider whether the results obtained seem reasonable and discuss what might be the explanation for the results you obtained. [1p]

The results show a clear trend: countries with a higher GDP per capita tend to also have a higher life expectancy. However, this does not mean that it is always true. We can for example see that Saudi Arabia has one of the higher GDP per capita values at approximately \\$50,000 and a life expectancy of 75 years. On the same y-coordinate, we find Honduras, one of Latin America's poorest countries, with a GDP per capita of approximately \\$5,000. So although the overall trend between GDP per capita and life expectancy is positive, individual countries can deviate considerably from the regression line. Generally speaking, countries with a higher GDP per capita might be able to provide better health care and better living standards for their population, resulting in a higher life expectancy. One must however consider that factors other than GDP per capita also affect life expectancy.
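
To make the idea of deviating from the trend concrete, the following is a minimal sketch, assuming a straight-line fit of life expectancy against log10(GDP per capita), which is a straight line on a plot with a logarithmic x-axis. The country labels and all values are made up for illustration and are not taken from the data set.

```python
import numpy as np

# Hypothetical (made-up) values, not taken from the data set.
countries = ["A", "B", "C", "D", "E"]
gdp_per_capita = np.array([1000.0, 5000.0, 10000.0, 50000.0, 100000.0])
life_expectancy = np.array([60.0, 75.0, 72.0, 75.0, 84.0])

# Fit life expectancy as a linear function of log10(GDP per capita).
log_gdp = np.log10(gdp_per_capita)
slope, intercept = np.polyfit(log_gdp, life_expectancy, deg=1)

# Residual = observed value minus the value predicted by the fitted line.
residuals = life_expectancy - (slope * log_gdp + intercept)
for name, residual in zip(countries, residuals):
    print(f"{name}: {residual:+.2f} years relative to the trend")
```

Countries B and D have the same life expectancy but very different GDP per capita, so B ends up above the fitted line and D below it, which mirrors the comparison between Honduras and Saudi Arabia.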

#### C. Did you do any data cleaning (e.g., by removing entries that you think are not useful) for the task of drawing scatter plot(s) and the task of answering the questions d, e, f, and g? If so, explain what kind of entries that you chose to remove and why. If not, explain why you did not need to. [0.5p]

As mentioned in question 1.a, we decided to leave out the population size data in order to focus only on the relationship between GDP per capita and life expectancy. Data points with null values, for example a missing GDP per capita or life expectancy value, were also removed, since a country cannot be placed in the scatter plot without both values. Lastly, we decided to only collect and visualize the data from 2018. The reasoning behind this is the assumption that the quality and availability of recent data, within this context, is better than that of much earlier years. That being said, we do not believe there to be big differences between nearby years.

#### D. Which countries have a life expectancy higher than one standard deviation above the mean? [0.5p]

With a standard deviation of approximately 7.747 and a mean of approximately 72.66, a life expectancy must be at least 80.41 years to lie one standard deviation above the mean. The countries that meet this threshold are listed in the table below. The result applies only to the year supplied by the user; in this case, the input year was 2018.
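
As a quick check of the arithmetic: the threshold is the mean plus one standard deviation, and a threshold of 80.41 with a standard deviation of 7.747 implies a mean of about 72.66. The sketch below repeats this calculation and then applies the same rule to a small set of made-up values; the dictionary `life_expectancy` is hypothetical and not taken from the data set.

```python
import statistics

# Standard deviation quoted in the text, and the mean implied by the
# quoted threshold (80.41 - 7.747 is approximately 72.66).
mean_2018 = 72.66
sd_2018 = 7.747
threshold_2018 = mean_2018 + sd_2018
print(round(threshold_2018, 2))

# Hypothetical life expectancies (made-up values).
life_expectancy = {"A": 60.0, "B": 70.0, "C": 72.0, "D": 74.0, "E": 84.0}

values = list(life_expectancy.values())
mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1), like the pandas default for .std()
threshold = mean + sd

# Keep the entries that lie at least one standard deviation above the mean.
above = [name for name, value in life_expectancy.items() if value >= threshold]
print(above)
```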


In [13]:
aboveSD(GDPperCapita('life-expectancy-vs-gdp-per-capita.csv',
          'GDP per capita',
          'Life expectancy',
          'Life expectancy vs. GDP per capita'), 'Life expectancy vs. GDP per capita')

---------------------------------------------------------------------------------------------------------------------
List of year where data is available for the plot: 

[1543, 1548, 1553, 1558, 1563, 1568, 1573, 1578, 1583, 1588, 1593, 1598, 1603, 1608, 1613, 1618, 1623, 1628, 1633, 1638, 1643, 1648, 1653, 1658, 1663, 1668, 1673, 1678, 1683, 1688, 1693, 1698, 1703, 1708, 1713, 1718, 1723, 1728, 1733, 1738, 1743, 1748, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1775, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784, 1785, 1786, 1787, 1788, 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797, 1798, 1799, 1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843, 1844, 1845, 1846, 

Index | Entity | Year | Life expectancy
----- | ------ | ---- | ---------------
14 | Australia | 2018 | 83.281
15 | Austria | 2018 | 81.434
22 | Belgium | 2018 | 81.468
39 | Canada | 2018 | 82.315
56 | Cyprus | 2018 | 80.828
60 | Denmark | 2018 | 80.784
78 | Finland | 2018 | 81.736
81 | France | 2018 | 82.541
87 | Germany | 2018 | 81.18
90 | Greece | 2018 | 82.072


#### E. Which countries have high life expectancy but have low GDP (per capita)? [0.5p]

The answer depends on how high life expectancy and low GDP per capita are defined. If we were to define low GDP per capita as one standard deviation below the mean, the threshold would be negative. This is because the standard deviation is greater than the mean, indicating a high variance between data points. Another way of defining low GDP per capita is to select the data points whose GDP per capita is below the median.

For 2018, the median GDP per capita is 12080.49 and the median life expectancy is 74.39. Countries with a GDP per capita below the median and a life expectancy above the median are located toward the upper left corner of the graph.
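
The reasoning above can be illustrated with a minimal sketch on made-up numbers: when a few very rich countries create a long right tail, the standard deviation exceeds the mean, so "mean minus one standard deviation" is negative, whereas the median still splits the countries in half. The dictionaries and their values are hypothetical and not taken from the data set.

```python
import statistics

# Hypothetical values (made up); country E creates a long right tail in GDP per capita.
gdp_per_capita = {"A": 1000.0, "B": 2000.0, "C": 3000.0, "D": 4000.0, "E": 90000.0}
life_expectancy = {"A": 62.0, "B": 79.0, "C": 71.0, "D": 78.0, "E": 82.0}

gdp_values = list(gdp_per_capita.values())
gdp_mean = statistics.mean(gdp_values)
gdp_sd = statistics.stdev(gdp_values)

# The standard deviation exceeds the mean here, so this threshold is negative
# and no country could ever fall below it.
print(gdp_mean - gdp_sd)

# Median-based definition: low GDP per capita and high life expectancy.
gdp_median = statistics.median(gdp_values)
le_median = statistics.median(life_expectancy.values())
high_le_low_gdp = [
    country for country in gdp_per_capita
    if gdp_per_capita[country] < gdp_median and life_expectancy[country] > le_median
]
print(high_le_low_gdp)
```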

In [11]:
def medianFilter (finalData): 
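    """
    input:
    - finalData: dataframe containing the data of interest
    output:
    - prints both medians and the countries with GDP per capita below and life expectancy above the median
    """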
    
    GDPmedian = finalData['GDP per capita'].median()
    LEmedian = finalData['Life expectancy'].median()
    
    print(GDPmedian)
    print(LEmedian)
    
    Filter1 = finalData['GDP per capita'] < GDPmedian
    Filter2 = finalData['Life expectancy'] > LEmedian
    
    combinedFilter = Filter1 & Filter2
    filteredColumns = finalData[['Entity','Year', 'GDP per capita','Life expectancy']]
    
    print(filteredColumns[combinedFilter])


In [12]:
medianFilter(GDPperCapita('life-expectancy-vs-gdp-per-capita.csv',
          'GDP per capita',
          'Life expectancy', 'Countries with high life expectancy and low GDP'))

Year of interest for the data "Countries with high life expectancy and low GDP": 
2018
---------------------------------------------------------------------------------------------------------------------
12080.4908
74.3865
                     Entity  Year  GDP per capita  Life expectancy
2                   Albania  2018      11104.1665           78.458
11                  Armenia  2018      11454.4251           74.945
20                 Barbados  2018      11995.1868           79.081
29   Bosnia and Herzegovina  2018      10460.5201           77.262
54                     Cuba  2018       8325.6313           78.726
62                 Dominica  2018       9021.1737           74.806
66                  Ecuador  2018      10638.8251           76.800
100                Honduras  2018       5041.6354           75.088
114                  Jordan  2018      11506.3383           74.405
151                 Morocco  2018       8451.1355           76.453
191             Saint Lucia  2018      

#### F. Does every strong economy (normally indicated by GDP) have high life expectancy? [1p]

#### G. Related to question f, what would happen if you use GDP per capita as an indicator of strong economy? Explain the results you obtained, and discuss any insights you get from comparing the results of g and f. [1p]

In [None]:
plotFigure(GDP('life-expectancy-vs-gdp-per-capita.csv',
          'GDP per capita',
          'Life expectancy',
          'Life expectancy vs. GDP'), 'GDP', 'Life expectancy', 
         'Life expectancy vs. GDP', pow(10,8), pow(10,12), 5, 5)

## Task 2 - Download some other data sets, e.g. related to happiness and life satisfaction, trust, corruption, etc. <br>
#### A.	Think of several meaningful questions that can be answered with these data, make several informative visualisations to answer those questions. State any assumptions and motivate decisions that you make when selecting data to be plotted, and in combining data. [2.5p] 

##### 1. Happiness vs. GDP per capita

In [None]:
plotFigure(GDPperCapita('gdp-vs-happiness.csv',
          'GDP per capita, PPP (constant 2017 international $)',
          'Life satisfaction in Cantril Ladder (World Happiness Report 2021)',
          'Happiness vs. GDP per capita'), 'GDP per capita ($)', 
           'Life satisfaction (scale: 0-10)', 'Happiness vs. GDP per capita', 30, pow(10,4), 0.5, 0.5)

###### Questions
---

##### 2. Children per woman by GDP per capita

In [None]:
plotFigure(GDPperCapita('children-per-woman-by-gdp-per-capita.csv',
          'Output-side real GDP per capita (gdppc_o) (PWT 9.1 (2019))',
          'Estimates, 1950 - 2020: Annually interpolated demographic indicators - Total fertility (live births per woman)',
          'Children per woman vs. GDP per capita'), 'GDP per capita ($)', 
           'Children per woman', 'Children per woman vs. GDP per capita', 20, pow(10,4), 0.5, 0.5)

###### Questions
---

##### 3. Share of adults who smoke vs. GDP per capita

In [None]:
plotFigure(GDPperCapita('share-of-adults-who-are-smoking-by-level-of-prosperity.csv',
           'GDP per capita, PPP (constant 2017 international $)',
          'Prevalence of current tobacco use (% of adults)',
          'Share of adults who are smoking vs. GDP per capita'), 'GDP per capita ($)', 
           'Adults who smoke (% of adults)', 'Share of adults who are smoking vs. GDP per capita', 
           25, pow(10,4), 2, 2)

###### Questions
---

##### 4. Medical doctors per 1,000 people vs. GDP per capita

In [None]:
plotFigure(GDPperCapita('medical-doctors-per-1000-people-vs-gdp-per-capita.csv',
           'GDP per capita, PPP (constant 2017 international $)',
          'Physicians (per 1,000 people)',
          'Medical doctors per 1000 people vs. GDP per capita'), 'GDP per capita ($)', 
           'Medical doctors (per 1,000 people)', 'Medical doctors per 1000 people vs. GDP per capita',
          25, pow(10,4), 1, 0.5)

###### Questions
---

##### 5. Child mortality vs. GDP per capita
> Child mortality is defined as the number of children born alive who die before their 5th birthday.

In [None]:
plotFigure(GDPperCapita('child-mortality-gdp-per-capita.csv',
           'GDP per capita',
          'Child mortality (Select Gapminder, v10) (2017)',
          'Child mortality vs. GDP per capita'), 'GDP per capita ($)', 
           'Child mortality rate (% children < 5 y.o.)', 'Child mortality vs. GDP per capita',
          25, pow(10, 4), 3, 1)

###### Questions
---

#### B. Discuss any observations that you make, or insights obtained, from the data visualisations. [2p]

## References
#### Task 1
1. Data was compiled by Our World in Data (2022): _Life expectancy vs. GDP per capita._ [online] Available at: <https://ourworldindata.org/grapher/life-expectancy-years-vs-real-gdp-per-capita-2011us> [Accessed 20 January 2022]. <br> Based on estimates by James C. Riley (2005), Estimates of Regional and Global Life Expectancy, 1800–2001, Population and Development Review, Volume 31, Issue 3, pages 537–543, September 2005; Zijdeman, Richard; Ribeira da Silva, Filipa, 2015, "Life Expectancy at Birth (Total)", http://hdl.handle.net/10622/LKYT53, IISH Dataverse, V1; and UN Population Division (2019).

#### Task 2
1. Data was compiled by Our World in Data (2022): _Self-reported Life Satisfaction vs GDP per capita._ [online] Available at: <https://ourworldindata.org/grapher/gdp-vs-happiness> [Accessed 20 January 2022].
2. Our World in Data. 2022. _Children per woman by GDP per capita._ [online] Available at: <https://ourworldindata.org/grapher/children-per-woman-by-gdp-per-capita> [Accessed 20 January 2022].
3. https://ourworldindata.org/economic-growth
4. https://ourworldindata.org/economic-growth
5. Our World in Data. 2022. _Child mortality vs GDP per capita._ [online] Available at: <https://ourworldindata.org/grapher/child-mortality-gdp-per-capita> [Accessed 20 January 2022].

- [Fundamentals of Data Visualization, Claus O. Wilke](https://clauswilke.com/dataviz/index.html)
- [Plotly](https://plotly.com/python/)
- [Plotly express - Linear regression](https://plotly.com/python/linear-fits/)