# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code.
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [131]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
import requests
from bs4 import BeautifulSoup
import os
# from matplotlib_venn import venn2  # venn2 lives in the third-party matplotlib-venn package

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject




The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Read and clean data

In [132]:
# Loading our Excel files (criminalitydata.xlsx and population.xlsx) and previewing the first 5 rows
criminality_path = 'data/criminalitydata.xlsx'
criminality = pd.read_excel(criminality_path, skiprows=2)

pop_path = 'data/population.xlsx'
pop = pd.read_excel(pop_path, skiprows=2)

# Only the last expression in a cell is displayed, so the preview below is the population data
pop.head(5)


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,2008Q1,2008Q2,2008Q3,2008Q4,2009Q1,2009Q2,2009Q3,2009Q4,...,2020Q3,2020Q4,2021Q1,2021Q2,2021Q3,2021Q4,2022Q1,2022Q2,2022Q3,2022Q4
0,Total,Copenhagen,509861,511725,511686,516962,518574,520659,521397,526918,...,633035,637936,638117,638678,638790,643613,644431,646812,647509,652564
1,,Frederiksberg,93444,93848,93921,95005,95029,95429,95565,96720,...,104118,104351,103677,103284,102940,103782,103608,103940,104094,104801
2,,Dragør,13261,13315,13350,13452,13411,13416,13460,13550,...,14515,14497,14569,14575,14588,14616,14640,14585,14669,14631
3,,Tårnby,40016,40002,40158,40234,40214,40219,40329,40429,...,42785,42757,42670,42698,42658,42664,42723,42819,43042,43129
4,,Albertslund,27602,27564,27596,27817,27706,27807,27739,27887,...,27543,27500,27366,27192,27113,27411,27599,27548,27547,27576


In [133]:
# Cleaning our data
# Dropping the "Unnamed: 0" column, which carries no information
data = [criminality, pop]
for df in data:
    df.drop(columns="Unnamed: 0", inplace=True)


Import your data, either through an API or manually, and load it. 

In [134]:
# Renaming the "Unnamed: 1" column to "municipalities"
for df in data:
    df.rename(columns={"Unnamed: 1": "municipalities"}, inplace=True)

# Prefixing the quarterly columns in both data sets so that they don't start with a number
# (criminality columns get "crim_", population columns get "pop_")
for df, prefix in zip(data, ("crim", "pop")):
    quarter_names = {f"{h}Q{j}": f"{prefix}_{h}Q{j}"
                     for h in range(2007, 2022+1)
                     for j in range(1, 4+1)}
    # Keys that are not present as columns are silently ignored by rename
    df.rename(columns=quarter_names, inplace=True)


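# Creating the yearly totals as the sum of the four quarters
for h in range(2007, 2022+1):
    criminality[f'crim_{h}'] = (criminality[f'crim_{h}Q1'] + criminality[f'crim_{h}Q2']
                                + criminality[f'crim_{h}Q3'] + criminality[f'crim_{h}Q4'])

# The population data starts in 2008.
# Note: population is a stock, so summing the four quarterly counts gives roughly four times
# the population level; divide by 4 if a yearly average is wanted instead
for h in range(2008, 2022+1):
    pop[f'pop_{h}'] = (pop[f'pop_{h}Q1'] + pop[f'pop_{h}Q2']
                       + pop[f'pop_{h}Q3'] + pop[f'pop_{h}Q4'])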

Unnamed: 0,municipalities,crim_2007Q1,crim_2007Q2,crim_2007Q3,crim_2007Q4,crim_2008Q1,crim_2008Q2,crim_2008Q3,crim_2008Q4,crim_2009Q1,...,crim_2013,crim_2014,crim_2015,crim_2016,crim_2017,crim_2018,crim_2019,crim_2020,crim_2021,crim_2022
0,Copenhagen,21462.0,24280.0,23553.0,23782.0,21623.0,23201.0,25348.0,24929.0,22285.0,...,102478.0,97983.0,93883.0,102763.0,93446.0,89159.0,93849.0,75750.0,70167.0,81216.0
1,Frederiksberg,1907.0,2085.0,1996.0,2161.0,1801.0,1865.0,2060.0,2245.0,1845.0,...,9910.0,9211.0,8666.0,8395.0,8576.0,7230.0,7217.0,7163.0,6761.0,7798.0
2,Dragør,184.0,183.0,127.0,177.0,152.0,175.0,167.0,194.0,124.0,...,620.0,498.0,591.0,589.0,628.0,509.0,660.0,588.0,508.0,496.0
3,Tårnby,1241.0,1388.0,1371.0,1446.0,1300.0,1500.0,1604.0,1543.0,1421.0,...,7178.0,6991.0,7974.0,8447.0,8350.0,7848.0,7374.0,4656.0,4160.0,5759.0
4,Albertslund,609.0,794.0,744.0,756.0,850.0,829.0,765.0,895.0,753.0,...,3182.0,2779.0,2584.0,2767.0,2538.0,2507.0,2607.0,3373.0,2017.0,2483.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,Vesthimmerlands,633.0,621.0,673.0,686.0,642.0,660.0,838.0,898.0,742.0,...,2690.0,2404.0,2249.0,2098.0,1891.0,1877.0,1746.0,1674.0,2101.0,1590.0
98,Aalborg,4047.0,4645.0,4375.0,4891.0,4273.0,4864.0,4854.0,4801.0,4306.0,...,19730.0,18193.0,16403.0,15431.0,16463.0,15491.0,14953.0,11486.0,10359.0,14093.0
99,Unknown municipality,2204.0,1334.0,1378.0,1293.0,1171.0,1145.0,1386.0,1137.0,1244.0,...,9353.0,13471.0,19681.0,22194.0,29325.0,28324.0,35965.0,41711.0,40360.0,41431.0
100,,,,,,,,,,,...,,,,,,,,,,
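
The yearly totals can also be built without writing out each quarter by hand. The following is a minimal sketch on a toy frame: the values and the helper name `add_yearly_totals` are made up for illustration, and it assumes the quarterly columns follow the `crim_2007Q1` naming pattern created above.

```python
import pandas as pd

def add_yearly_totals(df, prefix):
    """Return a copy of df with one '<prefix>_<year>' column per year.

    Hypothetical helper: assumes quarterly columns are named
    '<prefix>_<year>Q<n>' with n in 1..4.
    """
    quarterly = df.filter(regex=rf"^{prefix}_\d{{4}}Q[1-4]$")
    # Pull the year out of each column name, e.g. 'crim_2007Q1' -> '2007'
    years = quarterly.columns.str.extract(rf"^{prefix}_(\d{{4}})Q", expand=False)
    out = df.copy()
    for year in years.unique():
        cols = quarterly.columns[years == year]
        # skipna=False propagates missing values, like adding the columns with '+'
        out[f"{prefix}_{year}"] = quarterly[cols].sum(axis=1, skipna=False)
    return out

# Toy data, not taken from the project files
toy = pd.DataFrame({
    "municipalities": ["A", "B"],
    "crim_2007Q1": [1.0, 10.0],
    "crim_2007Q2": [2.0, 20.0],
    "crim_2007Q3": [3.0, 30.0],
    "crim_2007Q4": [4.0, 40.0],
})
toy_totals = add_yearly_totals(toy, "crim")
print(toy_totals[["municipalities", "crim_2007"]])
```

Returning a copy rather than modifying the frame in place makes it safe to re-run the cell without piling up columns.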


## Explore each data set

In [145]:
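# Summary statistics (count, mean, std, min, quartiles, max) for every numeric column

criminality.describe()
# Only the last expression in a cell is displayed, so the table below is for the population data
pop.describe()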

Unnamed: 0,pop_2008Q1,pop_2008Q2,pop_2008Q3,pop_2008Q4,pop_2009Q1,pop_2009Q2,pop_2009Q3,pop_2009Q4,pop_2010Q1,pop_2010Q2,...,pop_2013,pop_2014,pop_2015,pop_2016,pop_2017,pop_2018,pop_2019,pop_2020,pop_2021,pop_2022
count,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0
mean,55311.020202,55376.424242,55444.666667,55616.111111,55671.222222,55709.969697,55751.929293,55884.151515,55906.444444,55962.030303,...,226674.2,227849.9,229353.8,231255.0,232769.7,233970.0,234943.6,235456.3,236373.7,238342.7
std,60970.757519,61129.762766,61111.373706,61715.863897,61877.645312,62079.456549,62118.680942,62752.828321,62859.302272,63102.530599,...,266058.0,270349.9,274589.7,279343.7,283906.6,288337.2,292628.4,295784.7,298249.9,301820.1
min,96.0,94.0,100.0,100.0,96.0,94.0,97.0,103.0,101.0,100.0,...,381.0,372.0,364.0,358.0,324.0,351.0,347.0,346.0,361.0,378.0
25%,29325.0,29346.5,29443.5,29547.0,29554.5,29519.0,29598.0,29613.0,29580.0,29538.0,...,116794.0,116637.0,117106.0,117854.0,118392.5,119192.0,119710.0,120365.0,121029.0,121803.5
50%,42817.0,42720.0,42746.0,42761.0,42807.0,42743.0,42759.0,42720.0,42768.0,42820.0,...,170655.0,170692.0,170822.0,171426.0,172069.0,172149.0,171993.0,172387.0,172543.0,172950.0
75%,59614.5,59694.0,59790.5,59811.5,59788.5,59709.0,59666.5,59570.5,59488.0,59445.5,...,234570.5,235336.5,236945.0,239613.5,240959.5,241298.0,241280.0,239855.0,238673.0,239566.5
max,509861.0,511725.0,511686.0,516962.0,518574.0,520659.0,521397.0,526918.0,528208.0,530902.0,...,2253072.0,2293733.0,2336757.0,2381695.0,2425726.0,2468077.0,2509283.0,2536760.0,2559198.0,2591316.0


To **explore the raw data**, you may provide **static** and **interactive plots** that show important developments.

**Interactive plot**:

In [136]:
def plot_func():
    # Function that operates on data set
    pass

widgets.interact(plot_func, 
    # Let the widget interact with data through plot_func()    
); 



Explain what you see when moving elements of the interactive plot around. 
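
One possible way to fill in `plot_func` is sketched below. It is only an illustration under assumptions: the toy values are made up, the helper `yearly_series` is hypothetical, and the data is assumed to have a `municipalities` column plus yearly columns named like `crim_2022`, as created in the cleaning step.

```python
import pandas as pd

def yearly_series(df, municipality, prefix="crim"):
    """Return the yearly values for one municipality, indexed by year.

    Hypothetical helper: assumes a 'municipalities' column and yearly
    columns named '<prefix>_<year>'.
    """
    yearly = df.set_index("municipalities").filter(regex=rf"^{prefix}_\d{{4}}$")
    series = yearly.loc[municipality]
    # Turn column names such as 'crim_2021' into the integer year 2021
    series.index = series.index.str[-4:].astype(int)
    return series

def plot_func(df, municipality):
    """Plot the yearly series for one municipality; meant to be driven by a widget."""
    import matplotlib.pyplot as plt  # already imported at the top of the notebook
    ax = yearly_series(df, municipality).plot(marker="o")
    ax.set_xlabel("year")
    ax.set_title(municipality)
    plt.show()

# Toy data, not taken from the project files
toy = pd.DataFrame({
    "municipalities": ["A", "B"],
    "crim_2021": [10.0, 100.0],
    "crim_2022": [12.0, 90.0],
})
print(yearly_series(toy, "B"))

# In a live notebook, a dropdown could drive the plot (not run here):
# widgets.interact(plot_func,
#                  df=widgets.fixed(toy),
#                  municipality=widgets.Dropdown(options=list(toy["municipalities"])))
```

Keeping the data selection in its own function means it can be checked without a notebook front end, while `plot_func` only handles drawing.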

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of an (inner) **merge**:

In [137]:
# venn2 comes from the third-party matplotlib-venn package, which must be installed separately
from matplotlib_venn import venn2

plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped')
v.get_label_by_id('110').set_text('included')
plt.show()

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 
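
The difference between an inner and a left merge, and a way to see which observations are dropped, can be sketched on toy data. The frames and values below are made up for illustration; only the `municipalities` key and the `crim_`/`pop_` naming are taken from the cleaning step above.

```python
import pandas as pd

# Toy data: "C" is missing from the population frame, "D" from the crime frame
crim_toy = pd.DataFrame({"municipalities": ["A", "B", "C"], "crim_2022": [10, 20, 30]})
pop_toy = pd.DataFrame({"municipalities": ["A", "B", "D"], "pop_2022": [100, 200, 400]})

# Inner merge: keeps only municipalities present in both frames
inner = pd.merge(crim_toy, pop_toy, on="municipalities", how="inner")

# Left merge: keeps every row of crim_toy; unmatched population values become NaN
left = pd.merge(crim_toy, pop_toy, on="municipalities", how="left")

# Outer merge with indicator=True adds a "_merge" column recording each row's origin,
# which shows exactly which observations an inner merge would throw away
outer = pd.merge(crim_toy, pop_toy, on="municipalities", how="outer", indicator=True)
dropped = outer.loc[outer["_merge"] != "both", ["municipalities", "_merge"]]

print(inner.shape, left.shape)
print(dropped)
```

Comparing the shapes before and after the merge, together with the `_merge` column, is a quick check that the resulting data set has the expected number of rows and columns.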

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.