# The Dataset
Explore the distribution of marital status among the inhabitants of Copenhagen

CSV file:

http://data.kk.dk/dataset/9070067f-ab57-41cd-913e-bc37bfaf9acd/resource/9fbab4aa-1ee0-4d25-b2b4-b7b63537d2ec/download/befkbhalderkoencivst.csv

Source:

http://data.kk.dk/dataset/9070067f-ab57-41cd-913e-bc37bfaf9acd/resource/9fbab4aa-1ee0-4d25-b2b4-b7b63537d2ec

The dataset contains roughly 174,000 observations from 1992 to 2015, with the following columns:

AAR: Which year the observation was made

BYDEL: Which part of the city, described by an integer contained in the following dict: 1=Indre By, 2=Østerbro, 3=Nørrebro, 4=Vesterbro/Kgs. Enghave, 5=Valby, 6=Vanløse, 7=Brønshøj-Husum, 8=Bispebjerg, 9=Amager Øst, 10=Amager Vest, 99=Udenfor inddeling

ALDER: The age of the observed people

CIVST: Marital status, described by an upper-case character contained in the following dict: E=Widow, F=Divorced, G=Married, L=Oldest living partner, O=Dissolved partnership, P=Registered partnership, U=Unmarried

KOEN: Gender of observed people, described by an integer contained in the following dict: 1=Male, 2=Female

PERSONER: Number of observations with the given features of the row
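
To make the coded columns easier to read, the lookup tables above can be written as plain Python dicts. This is only a sketch: the column order (AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER) is assumed from the list above, the helper name `describe` is hypothetical, and the sample row is made up rather than taken from the dataset.

```python
# Lookup tables copied from the column descriptions above.
BYDEL = {
    1: "Indre By", 2: "Østerbro", 3: "Nørrebro", 4: "Vesterbro/Kgs. Enghave",
    5: "Valby", 6: "Vanløse", 7: "Brønshøj-Husum", 8: "Bispebjerg",
    9: "Amager Øst", 10: "Amager Vest", 99: "Udenfor inddeling",
}
CIVST = {
    "E": "Widow", "F": "Divorced", "G": "Married", "L": "Oldest living partner",
    "O": "Dissolved partnership", "P": "Registered partnership", "U": "Unmarried",
}
KOEN = {1: "Male", 2: "Female"}


def describe(row):
    """Turn one coded row into a readable string (hypothetical helper)."""
    # Assumed column order: AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER
    aar, bydel, alder, civst, koen, personer = row
    return "{}: {} people ({}, {}, age {}) in {}".format(
        aar, personer, KOEN[koen], CIVST[civst], alder, BYDEL[bydel])


# Made-up row for illustration only.
print(describe((2015, 1, 25, "U", 1, 100)))
```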

## Questions
1. Use matplotlib to show the distribution of the following four categories over the period 1992 to 2015:
  - Males between age 18 and 30
  - Females between age 18 and 30
  - Males age 50+
  - Females age 50+
2. Use matplotlib to plot a bar chart showing how many single males and females aged 18 to 30 are living in BYDEL 1, 2 and 3 over the period 1992 to 2015
3. Find the three most populated city parts (BYDEL) in 1992, 2000 and 2015
4. Create two pie charts showing the distribution of marital status in BYDEL 1, 2 and 3 in the years 2000 and 2015
5. Make a histogram of the age distribution in the whole municipality of Copenhagen

## Code before the questions
First we need to prepare a Python file that will contain the solutions to all of the questions above. Our solutions use the same modules throughout, so there is no need to import them again for every solution. Instead, we do all our prep work here.

### Modules

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import webget
from collections import defaultdict
from urllib.parse import urlparse
import os
import heapq

We will be using [pandas](http://pandas.pydata.org/) to read the data from a CSV file into a dataframe object. Webget is a custom library written by us that downloads the file at a direct link. [os](https://docs.python.org/2/library/os.html) is used to extract the file name in a platform-independent way, and [urlparse](https://docs.python.org/2/library/urlparse.html) is used together with webget to split the URL into its components. [heapq](https://docs.python.org/2/library/heapq.html) implements the heap queue algorithm (the data structure behind [heapsort](https://en.wikipedia.org/wiki/Heapsort)) and gives us helpers for finding the largest elements of a container. Collections contains the [defaultdict](https://docs.python.org/2/library/collections.html) object we will be using as our data structure: constructed with lambda: 0, it gives every new key a default value of 0. Finally, we have [pyplot](http://matplotlib.org/) for plotting our data.
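
As a small illustration of the defaultdict behaviour described above, the sketch below counts made-up city-part codes. The input list is invented for the example and is not taken from the dataset.

```python
from collections import defaultdict

# Every missing key starts at 0, so "+= 1" works the first time a key is seen.
counts = defaultdict(lambda: 0)

for bydel in [3, 2, 3, 1, 3]:  # made-up city-part codes
    counts[bydel] += 1         # no KeyError for a new key

print(dict(counts))  # {3: 3, 2: 1, 1: 1}
```

Writing `defaultdict(int)` should behave the same way, since `int()` also returns 0.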

### Getting data ready to handle
Before we can do any data handling, we need to fetch the data and prepare it. To do so, we will use webget to download a CSV file from a target destination:

[Destination](http://data.kk.dk/dataset/9070067f-ab57-41cd-913e-bc37bfaf9acd/resource/9fbab4aa-1ee0-4d25-b2b4-b7b63537d2ec/download/befkbhalderkoencivst.csv)

Webget has a download function that retrieves the file at a given link, and os.path.basename returns the file name from the URL path, so we know which file to open.

In [None]:
def download(link):
    # fetch the file; webget stores it locally
    webget.download(link)
    # the file name is the last component of the URL path
    return os.path.basename(urlparse(link).path)
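
Since webget is a custom module that is not shown here, the following is a hedged sketch of what a comparable download step could look like using only the standard library. The names `filename_from_url` and `download_sketch` are hypothetical and are not part of webget; the real module may behave differently.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(link):
    """Return the last path component of a URL, ignoring any query string."""
    return os.path.basename(urlparse(link).path)


def download_sketch(link, dest=None):
    """Hypothetical stand-in for webget.download: save `link` and return the file name."""
    dest = dest or filename_from_url(link)
    urlretrieve(link, dest)  # standard-library download; needs network access
    return dest


print(filename_from_url("http://example.com/a/b/file.csv?x=1"))  # file.csv
```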

Next, we prepare a dataframe object from the CSV file we downloaded. The function takes the path of a CSV file as src and returns a pandas dataframe (pandas is imported as pd).

In [None]:
def csv_to_df(src):
    return pd.read_csv(src)

### Plotting
Next we provide four plotting functions, each producing a different type of plot. You can experiment with any of them or use the ones we suggest (which best fit the questions). The first three take the same parameters: 'src', containing the data to be plotted (a dict, for instance), and an optional title to display.

In [3]:
def make_barplot(src, title="Untitled"):
    plt.bar(range(len(src)), list(src.values()), align='center')
    plt.xticks(range(len(src)), list(src.keys()))
    plt.title(title)
    plt.show()

In [4]:
def make_scatterplot(src, title="Untitled"):
    plt.scatter(list(src.keys()), list(src.values()), c=list(src.values()), cmap=plt.cm.Blues, edgecolor='none', s=40)
    plt.title(title)
    plt.show()

In [5]:
def make_plot(src, title="Untitled"):
    plt.plot(list(src.keys()), list(src.values()), linewidth=5)
    plt.title(title)
    plt.show()

The next plotting function takes a list of such dicts and an additional parameter holding the legend label for each one. It draws all of them in the same figure and is used for one of the upcoming solutions.

In [6]:
def make_multiplot(src_list, my_labels, title="Untitled"):
    # pair each data series with its legend label
    for src, label in zip(src_list, my_labels):
        plt.plot(list(src.keys()), list(src.values()), linewidth=5, label=label)
    plt.title(title)
    plt.legend(loc="upper left", bbox_to_anchor=[0, 1],ncol=2, shadow=True, title="Legend", fancybox=True)
    plt.show()

In [9]:
def just_plot(t5_k, t5_v):
    # t5_k: slice labels, t5_v: slice sizes (in the same order)
    plt.pie(t5_v, labels=t5_k, autopct='%1.1f%%', shadow=True, startangle=80)
    # equal aspect ratio so the pie is drawn as a circle
    plt.axis('equal')
    plt.show()

In [10]:
def plot_5(age_distK, age_distV):
    plt.bar(age_distK, age_distV, width=0.5, linewidth=0, align='center')
    plt.ticklabel_format(useOffset=False)
    plt.axis([0, max(age_distK) + 10, 0, 2600])
    title = 'Age distribution of {} people in the CPH municipality'.format(sum(age_distV))
    plt.title(title, fontsize=12)
    plt.xlabel("Ages", fontsize=10)
    plt.ylabel("Amount of people", fontsize=15)
    plt.tick_params(axis='both', which='major', labelsize=15)
    plt.show()

The final plot function takes two dictionaries and their labels, as well as a mandatory title parameter, and draws them as a stacked bar chart.

In [7]:
def something_plot(dict_a, dict_b, label_a, label_b, name):
    males = list(dict_a.values())
    females = list(dict_b.values())
    indexes = list(dict_a.keys())
    
    p1 = plt.bar(indexes, females, width=0.5, color="#d62728")
    p2 = plt.bar(indexes, males, width=0.5, bottom=females)
    plt.title(name)
    plt.legend((p1, p2), (label_b, label_a))
    plt.show()

In [74]:
# the following line is jupyter notebook specific 
%matplotlib inline

## Question 1
Use matplotlib to show the distribution of the following four categories over the period 1992 to 2015:
  - Males between age 18 and 30
  - Females between age 18 and 30
  - Males age 50+
  - Females age 50+
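
One possible way to prepare the data for these four categories is to add up PERSONER per year for a given gender and age range. This is a sketch rather than the authors' solution: the helper name `people_per_year` is hypothetical, the rows are assumed to be tuples in the column order AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER, and the sample rows are made up.

```python
from collections import defaultdict


def people_per_year(rows, koen, min_age, max_age=None):
    """Sum PERSONER per year for one gender and an inclusive age range."""
    totals = defaultdict(lambda: 0)
    # Assumed column order: AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER
    for aar, bydel, alder, civst, row_koen, personer in rows:
        if row_koen != koen or alder < min_age:
            continue
        if max_age is not None and alder > max_age:
            continue
        totals[aar] += personer
    return dict(totals)


# Made-up rows for illustration only.
sample = [
    (1992, 1, 25, "U", 1, 10),
    (1992, 2, 55, "G", 1, 4),
    (1992, 3, 19, "U", 2, 7),
    (1993, 1, 30, "G", 1, 3),
    (1993, 1, 31, "U", 1, 5),
]

young_males = people_per_year(sample, koen=1, min_age=18, max_age=30)
older_males = people_per_year(sample, koen=1, min_age=50)
print(young_males)  # {1992: 10, 1993: 3}
print(older_males)  # {1992: 4}
```

Four such dicts (one per category) could then be passed as src_list to make_multiplot, together with a matching list of labels.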

## Question 2
Use matplotlib to plot a bar chart showing how many single males and females aged 18 to 30 are living in BYDEL 1, 2 and 3 over the period 1992 to 2015

> We made the assumption that single means that you are neither G (Married) nor P (Registered Partnership)

![](bydel1.png)

![](bydel2.png)

![](bydel3.png)

![](alle_bydele.png)
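
The "single" assumption stated above can be expressed as a small filter. The code below is a sketch under assumptions: the helper names are hypothetical, rows are assumed to be tuples in the column order AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER, and the sample rows are made up. It is not necessarily how the plots above were produced.

```python
from collections import defaultdict

NOT_SINGLE = {"G", "P"}  # married or registered partnership


def is_single(civst):
    """Single means neither married (G) nor in a registered partnership (P)."""
    return civst not in NOT_SINGLE


def singles_per_year(rows, bydel, koen, min_age=18, max_age=30):
    """Sum PERSONER per year for singles of one gender in one city part."""
    totals = defaultdict(lambda: 0)
    # Assumed column order: AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER
    for aar, row_bydel, alder, civst, row_koen, personer in rows:
        if (row_bydel == bydel and row_koen == koen
                and min_age <= alder <= max_age and is_single(civst)):
            totals[aar] += personer
    return dict(totals)


# Made-up rows for illustration only.
sample = [
    (1992, 1, 20, "U", 1, 8),
    (1992, 1, 22, "G", 1, 3),   # married, so not counted
    (1992, 1, 29, "F", 1, 2),   # divorced counts as single
    (1992, 1, 20, "U", 2, 6),
    (1993, 2, 20, "U", 1, 9),   # other city part
]

males = singles_per_year(sample, bydel=1, koen=1)
females = singles_per_year(sample, bydel=1, koen=2)
print(males, females)  # {1992: 10} {1992: 6}
```

Two dicts like these, covering the same years, could be passed to something_plot along with their labels and a title.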

## Question 3
Find the three most populated city parts (BYDEL) in 1992, 2000 and 2015

Download befkbhalderkoencivst.csv into ./befkbhalderkoencivst.csv

- Top 3 City parts by population in 1992: ['Nørrebro', 'Østerbro', 'Vesterbro/Kgs. Enghave']
- Top 3 City parts by population in 2000: ['Østerbro', 'Nørrebro', 'Indre By']
- Top 3 City parts by population in 2015: ['Østerbro', 'Indre By', 'Nørrebro']

### Result
![image](http://i67.tinypic.com/1zlanaq.png)

### Solution
We define a function for question 3, called ex3. It takes a dataframe object (containing all the data we prepared) as its parameter. Inside, we create three defaultdicts, one per year, each given a lambda expression that makes any new key start at 0 the first time it is accessed in the for loop.

To process the data, we iterate over every row of the dataframe as a tuple. The conditionals check whether the row's year is 1992, 2000 or 2015. If so, we add one to the counter for that row's city part in the matching dict, so each dict ends up holding the number of matching rows per city part.

In [11]:
def ex3(df):
    winner_1992 = defaultdict(lambda: 0)
    winner_2000 = defaultdict(lambda: 0)
    winner_2015 = defaultdict(lambda: 0)
    
    for row in df.itertuples():
        # row[0] is the index, row[1] the year (AAR), row[2] the city part (BYDEL)
        if row[1] == 1992:
            winner_1992[row[2]] += 1
        elif row[1] == 2000:
            winner_2000[row[2]] += 1
        elif row[1] == 2015:
            winner_2015[row[2]] += 1
     
    print("Top 3 City parts by population in 1992: " + str(lookup_citypart(get_n_largest(winner_1992, 3))))
    print("Top 3 City parts by population in 2000: " + str(lookup_citypart(get_n_largest(winner_2000, 3))))
    print("Top 3 City parts by population in 2015: " + str(lookup_citypart(get_n_largest(winner_2015, 3))))
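
Note that ex3 adds 1 for every matching row, so it ranks city parts by the number of rows rather than by the PERSONER value of each row. If the intent is to rank by number of inhabitants, one alternative is to add up PERSONER instead. The sketch below shows the difference on made-up rows; the helper name `population_by_bydel` is hypothetical, the column order AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER is assumed, and on the real data this approach may produce a ranking that differs from the output shown above.

```python
import heapq
from collections import defaultdict


def population_by_bydel(rows, year):
    """Sum PERSONER per city part for a single year."""
    totals = defaultdict(lambda: 0)
    # Assumed column order: AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER
    for aar, bydel, alder, civst, koen, personer in rows:
        if aar == year:
            totals[bydel] += personer
    return dict(totals)


# Made-up rows for illustration only.
sample = [
    (2015, 1, 20, "U", 1, 50),
    (2015, 1, 21, "U", 1, 60),
    (2015, 2, 20, "U", 2, 500),
    (2015, 3, 40, "G", 1, 30),
    (2000, 3, 40, "G", 1, 999),  # other year, ignored
]

totals = population_by_bydel(sample, 2015)
print(totals)                                     # {1: 110, 2: 500, 3: 30}
print(heapq.nlargest(2, totals, key=totals.get))  # [2, 1]
```

In this sample, city part 1 has the most rows but city part 2 has the most people, which is why the two ways of counting can disagree.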

Our three defaultdicts now contain the counts, and we need the three keys with the highest count in each. For this we use the heapq module, which can find the n largest elements in a container.

We wrap this in a small function: get_n_largest takes a dictionary d and an integer n, the number of keys we want from that dict.

In [12]:
def get_n_largest(d, n):
    return heapq.nlargest(n, d, key=d.get)
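
To see what this call returns, here is a small example on a made-up dict. Iterating over a dict yields its keys, and `key=counts.get` tells heapq to compare those keys by their values.

```python
import heapq

counts = {1: 40, 2: 75, 3: 90, 4: 60}  # made-up counts per city part

top3 = heapq.nlargest(3, counts, key=counts.get)
print(top3)  # [3, 2, 4]

# For comparison: sorting all keys by value and slicing gives the same result here.
print(sorted(counts, key=counts.get, reverse=True)[:3])  # [3, 2, 4]
```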

Next, we don't want to print three integers (each corresponding to a city part); we want to print the names of those city parts as strings. For this we need another function.

We feed lookup_citypart the list of city-part numbers returned by get_n_largest. From the information provided with the dataset, we know which integer corresponds to which city part, so we can do the conversion manually.

Since the input is a collection, we need a for loop to look up each element in turn.

In [13]:
def lookup_citypart(n):
    res = []
    
    parts = {
        1: "Indre By",
        2: "Østerbro",
        3: "Nørrebro",
        4: "Vesterbro/Kgs. Enghave",
        5: "Valby",
        6: "Vanløse",
        7: "Brønshøj-Husum",
        8: "Bispebjerg",
        9: "Amager Øst",
        10: "Amager Vest",
        99: "Udenfor inddeling"
    }
    
    for x in n:
        res.append(parts[x])
    return res

## Question 4
Create two pie charts showing the distribution of marital status in BYDEL 1, 2 and 3 in the years 2000 and 2015
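
A possible starting point for this question is to add up PERSONER per marital status for the chosen city parts and year. This is a sketch only: the helper name `marital_status_counts` is hypothetical, rows are assumed to be tuples in the column order AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER, and the sample rows are made up.

```python
from collections import defaultdict


def marital_status_counts(rows, year, bydele=(1, 2, 3)):
    """Sum PERSONER per CIVST code for the given year and city parts."""
    totals = defaultdict(lambda: 0)
    # Assumed column order: AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER
    for aar, bydel, alder, civst, koen, personer in rows:
        if aar == year and bydel in bydele:
            totals[civst] += personer
    return dict(totals)


# Made-up rows for illustration only.
sample = [
    (2000, 1, 30, "U", 1, 10),
    (2000, 2, 40, "G", 2, 5),
    (2000, 3, 60, "G", 1, 5),
    (2000, 4, 40, "G", 2, 50),  # city part 4, ignored
    (2015, 1, 30, "U", 1, 7),   # other year, ignored
]

counts = marital_status_counts(sample, 2000)
total = sum(counts.values())
shares = {code: 100.0 * n / total for code, n in counts.items()}
print(counts)  # {'U': 10, 'G': 10}
print(shares)  # {'U': 50.0, 'G': 50.0}
```

The keys and values of such a dict, converted to lists, could be passed to just_plot as labels and slice sizes.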

## Question 5
Make a histogram of the age distribution in the whole municipality of Copenhagen
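
One way to prepare the input for the histogram is to add up PERSONER per age and return the ages and totals as two parallel lists. This is a sketch under assumptions: the helper name `age_distribution` is hypothetical, the optional `year` filter is an assumption (the question does not say which years to include), rows are assumed to be tuples in the column order AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER, and the sample rows are made up.

```python
from collections import defaultdict


def age_distribution(rows, year=None):
    """Return (ages, totals): PERSONER summed per age, sorted by age."""
    totals = defaultdict(lambda: 0)
    # Assumed column order: AAR, BYDEL, ALDER, CIVST, KOEN, PERSONER
    for aar, bydel, alder, civst, koen, personer in rows:
        if year is None or aar == year:
            totals[alder] += personer
    ages = sorted(totals)
    return ages, [totals[age] for age in ages]


# Made-up rows for illustration only.
sample = [
    (2015, 1, 30, "U", 1, 10),
    (2015, 2, 30, "G", 2, 5),
    (2015, 1, 18, "U", 1, 2),
    (2014, 1, 30, "U", 1, 100),
]

ages, people = age_distribution(sample, year=2015)
print(ages, people)  # [18, 30] [2, 15]
```

The two lists match the parameters of plot_5 (age_distK and age_distV). Note that plot_5 fixes the top of the y-axis at 2600, which may need adjusting depending on which years are included.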