# The Dataset
Three datasets on three different topics: one for the Human Development Index (HDI), one for active satellites, and one for the historical index.

## Questions
### Human Development Index
1. Which country has the highest HDI (Human Development Index) and which has the lowest?
2. Which country has raised its HDI the most, in the period 1990 to 2014?

### Active satellites
3. Which country has the most satellites for military usage?
4. Which country has the lightest satellite and how much does it weigh?
5. Compare the usage of satellites between the 5 poorest countries and the 5 wealthiest countries, according to the HDI dataset (see first dataset). Plotting is optional.

## Question 1
Which country has the highest HDI (Human Development Index) and which has the lowest?

First we import pandas as pd and numpy as np and read the CSV file.
We convert the resulting DataFrame to a 2D NumPy array and call it dd.

We then take the unique values of the country column and of the HDI column.

We then sum the HDI values that belong to each country by iterating over all the countries.

Finally, we print which countries had the highest and lowest HDI, as well as the values.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

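# Column 1 holds the country, column 2 the HDI.

HDIs = pd.read_csv("HDI.csv")

# as_matrix() was removed in pandas 1.0; to_numpy() returns the same 2D array
dd = HDIs.to_numpy()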

In [4]:
countries = np.unique(dd[:,1])
HDI = np.unique(dd[:,2])

country = np.array([np.sum(dd[(dd[:,1] == count)][:,2]) for count in countries])

print("Country with highest HDI: " + countries[np.argmax(country)], np.amax(HDI))
print("Country with lowest HDI: " + countries[np.argmin(country)], np.amin(HDI))

Country with highest HDI: Norway 0.9440000000000001
Country with lowest HDI: Niger 0.348
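
The same lookup can also be done directly on the DataFrame, without converting it to an array. This is a minimal sketch rather than the code used above: the column names `Country` and `HDI` are assumptions (the real `HDI.csv` layout may differ), and the middle row is a made-up placeholder.

```python
import pandas as pd

# Illustrative rows only; the column names "Country" and "HDI" are assumptions
sample = pd.DataFrame({
    "Country": ["Norway", "Exampleland", "Niger"],
    "HDI": [0.944, 0.500, 0.348],
})

# idxmax/idxmin return the row label of the largest/smallest HDI value
highest = sample.loc[sample["HDI"].idxmax()]
lowest = sample.loc[sample["HDI"].idxmin()]

print("Country with highest HDI:", highest["Country"], highest["HDI"])
print("Country with lowest HDI:", lowest["Country"], lowest["HDI"])
```

Looking the row up by label keeps the country name and its value together, so the two cannot drift apart the way separate `argmax` and `amax` calls can.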


## Question 2
Which country has raised its HDI the most, in the period 1990 to 2014?

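To figure this out, we made a dictionary with the country as the key and, as the value, the 2014 HDI (column 8) minus the 1990 HDI (column 2). Because itertuples() puts the index first, these are row[9] and row[3] in the code. We could then take the maximum.

We skipped the entries that had ".." as their 1990 value.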

In [1]:
import pandas as pd
from collections import defaultdict
countries = defaultdict(lambda: 0.0)
data = pd.read_csv("./historical_index.csv")

In [2]:
for row in data.itertuples():
    # row[0] is the index, so row[3] is 1990 and row[9] is 2014
    if row[3] != "..":
        countries[row[2]] += float(row[9]) - float(row[3])

best = max(countries, key=countries.get)
print(best + " with " + str(countries[best]))

Rwanda with 0.239
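
The same calculation can be sketched without an explicit loop. This is only an illustration under assumptions: the column names `Country`, `1990` and `2014` are hypothetical, and the three rows are made up to show how ".." is handled.

```python
import pandas as pd

# Made-up sample; the column names "Country", "1990" and "2014" are assumptions
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "1990": ["0.300", "..", "0.500"],
    "2014": ["0.450", "0.600", "0.550"],
})

# errors="coerce" turns ".." into NaN instead of raising an error
start = pd.to_numeric(sample["1990"], errors="coerce")
end = pd.to_numeric(sample["2014"], errors="coerce")

# Change per country, labelled by country name
change = end - start
change.index = sample["Country"]

# idxmax skips NaN by default, so country "B" is ignored
best = change.idxmax()
print(best + " with " + str(round(change[best], 3)))
```

Converting ".." to NaN has the same effect as skipping those rows, because a difference that involves NaN never wins the maximum.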


## Question 3
Which country has the most satellites for military usage?

A dictionary is created with one entry per country as the key, each starting with a value of 0.

Then we run through the CSV file again. This time we check whether each satellite is used for military purposes;
if it is, we add 1 to the value of the country that owns the satellite. That way we can find the entry with the highest value.

In [None]:
def military_satellites(data):
    # Start every country at 0 so that countries without military satellites are still listed
    country_m = {}

    # .iloc makes the positional column lookup explicit (column 3 is the country)
    for _, row in data.iterrows():
        country_m.setdefault(row.iloc[3], 0)

    # Add 1 for every satellite whose column 4 is "Military"
    for _, row in data.iterrows():
        if row.iloc[4] == "Military":
            country_m[row.iloc[3]] += 1

    maximum = max(country_m, key=country_m.get)
    print("Country with the most satellites used for military purposes is: ")
    print(maximum + " with a total of: " + str(country_m[maximum]) + " satellites.")

Country with the most satellites used for military purposes is:
USA with a total of: 114 satellites.
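
The counting step described above can also be sketched with a filter followed by `value_counts()`. The column names `Country` and `Users` and the sample rows are assumptions made for illustration; the function above addresses the columns by position instead.

```python
import pandas as pd

# Made-up sample; the column names "Country" and "Users" are assumptions
sample = pd.DataFrame({
    "Country": ["A", "B", "A", "C", "A"],
    "Users": ["Military", "Civil", "Military", "Military", "Commercial"],
})

# Keep only the military rows, then count the rows per country
military = sample[sample["Users"] == "Military"]
counts = military["Country"].value_counts()

top = counts.idxmax()
print(top + " with a total of: " + str(counts[top]) + " satellites.")
```

Unlike the dictionary version, countries with no military satellites do not appear in `counts` at all, which does not matter when only the maximum is needed.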

## Question 4
Which country has the lightest satellite and how much does it weigh?

The first order of business is to import all the libraries that we need. For our solution to this question we will use the following:

In [5]:
import webget
import pandas as pd
import os
from urllib.parse import urlparse
import re

We will be using [pandas](http://pandas.pydata.org/) to read all the data from a CSV file and prepare it for data handling as a DataFrame object. Webget is a custom library written by us to download a file from a direct link. Next, [os](https://docs.python.org/2/library/os.html) is used to get the destination of the file in a platform-independent way. Next, [urlparse](https://docs.python.org/2/library/urlparse.html) is used in conjunction with our webget to parse the URL. Finally, [re](https://docs.python.org/2/library/re.html) is used to strip non-digits from a data entry using a regular expression.

To download the CSV file, we simply make the following implementation using webget:

In [7]:
def download(url):
    #webget.download(url)
    #return os.path.basename(urlparse(url).path)
    # Getting the csv file via webget from github downloads a malformed csv that contains HTML tags, so we have to do it manually
    return "./database.csv"
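
The commented-out lines derive the local file name from the URL. A minimal sketch of just that step follows; the helper name `filename_from_url` is hypothetical and the example.com address is a placeholder, not the real download location.

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    # Hypothetical helper: urlparse separates the path from the query string,
    # and basename keeps only the last component of that path
    return os.path.basename(urlparse(url).path)

# Placeholder URL used only for illustration
print(filename_from_url("https://example.com/data/database.csv?raw=true"))
```

Parsing the URL first means that a query string such as `?raw=true` does not end up in the file name.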

We read all the rows and columns of the CSV file into a DataFrame object using pandas:

In [None]:
def read_from_csv(filename):
    data = pd.read_csv(filename)
    data_no_nan = set_missing_values_to_0(data)
    return data_no_nan

Notice how we make use of our own custom method, set_missing_values_to_0, to replace all the NaN values in the dataset with 0:

In [9]:
def set_missing_values_to_0(data):
    return data.fillna(0.0)

The dataset has a mix of values that are floats and strings. To get around this, we need a method that strips the non-digit characters from a string entry and casts what is left to a float:

In [10]:
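def remove_non_digits_and_float_cast(rowstr, index):
    # Drop every character that is not a digit (this also removes decimal points)
    temp = re.sub("[^0-9]", "", rowstr)
    temp = temp.strip()
    # Entry 341 is blank, so float() would fail on it; treat it as 0 instead
    if index != 341:
        temp = float(temp)
    else:
        temp = 0
    return temp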

The method takes an entry from the dataset as well as an index. It strips all non-digit characters from the string using a regular expression and then checks the index. The dataset has one specific row, 341, that is blank. It seems to be a bug, but a temporary fix is to ignore that specific row for now.
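
One possible way to avoid the hard-coded row number is to treat any entry that has no digits left after stripping as missing. The sketch below is an assumption about how that could look, not the code used for the result; the function name `clean_weight` is hypothetical.

```python
import re

def clean_weight(entry):
    # Hypothetical variant: remove everything that is not a digit
    digits = re.sub("[^0-9]", "", entry)
    # A blank or digit-free entry counts as missing, like the NaN values set to 0
    if digits == "":
        return 0.0
    return float(digits)

print(clean_weight("1,500"))
print(clean_weight("   "))
```

Like the original method, this also drops decimal points, so an entry such as "1.5" would become 15.0; the character class would have to allow "." to keep decimals.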

First let us declare everything we need in a method definition

In [None]:
def ex4_lightest_satellite(data):
    satellite_owner = ""
    satellite_owner_country = ""
    satellite_name = ""
    # Starting value for the minimum search
    satellite_weight = 10000.0
    # Row 341 has a blank 'white' empty space for no apparent reason. I hard code my way around it. Help?
    index = 0

Next, we will implement the iterative process of going through the dataset and keeping track of the lowest value seen so far. Once we have iterated all the way through, we will know our winner. On the way we have to check whether each entry is a string or a float and make use of our handler method when appropriate. Add the following code to our ex4 method:

In [None]:
    for row in data.iterrows():
        #Is the entry a string? If yes, it contains non-digits we need to remove
        if type( row[1][16]) is str:
            #Call to method that handles stripping our entry using a regular expression
            temp = remove_non_digits_and_float_cast(row[1][16], index)
            #If the weight is 0, then it was one of the NaN values we handled and it doesn't count
            if temp == 0:
                pass
            else:
                #King of the Hill implementation
                if temp < satellite_weight:
                    satellite_name = row[1][0]
                    satellite_owner_country = row[1][1]
                    satellite_owner = row[1][2]
                    satellite_weight = temp
            index += 1
        else:
            #The entry is already a float, so we can apply our gymnastics immediately
            if row[1][16] != 0:
                if row[1][16] < satellite_weight:
                    satellite_name = row[1][0]
                    satellite_owner_country = row[1][1]
                    satellite_owner = row[1][2]
                    satellite_weight = row[1][16]
                index += 1
            else:
                pass

Finally, we only need to return the result of our operation. Add the following to our ex4 method:

In [None]:
    #The answer to the question is built from the variables tracked in the loop
    return satellite_owner_country + " is the country with the lightest satellite weighing in at: " \
    + str(satellite_weight) + " kilograms" + \
    ", called: " + satellite_name + ", belonging to the " + satellite_owner

### The Result
![image](http://i66.tinypic.com/15iauxs.png)
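
As an alternative to the row-by-row loop, the minimum can be sketched with pandas' own conversion helpers. This is an illustration under assumptions, not the code that produced the result above: the column names `Name`, `Country` and `Launch Mass` and all sample values are made up.

```python
import pandas as pd

# Made-up sample; the column names are assumptions, not the real headers
sample = pd.DataFrame({
    "Name": ["Sat-1", "Sat-2", "Sat-3", "Sat-4"],
    "Country": ["A", "B", "C", "D"],
    "Launch Mass": ["1,200", 5.0, " ", None],
})

# Remove thousands separators, then let unparseable entries become NaN
mass = pd.to_numeric(
    sample["Launch Mass"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# idxmin skips NaN, so blank and missing entries never win
lightest = sample.loc[mass.idxmin()]
print(lightest["Country"] + " has the lightest satellite, " + lightest["Name"]
      + ", weighing " + str(mass.min()) + " kilograms")
```

Marking missing weights as NaN instead of 0 removes the need for the extra "is it 0?" check in the loop.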

## Question 5
Compare the usage of satellites between the 5 poorest countries and the 5 wealthiest countries, according to the HDI dataset (see first dataset). Plotting is optional.