# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
import requests
from bs4 import BeautifulSoup
import os
#from matplotlib.venn import venn2

# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import dataproject




The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Read and clean data

In [26]:
# Uploading our excel s (criminalitydata and criminalitydata2) and visualizing the first 5 rows
criminality = 'data/criminalitydata.xlsx'
criminality = pd.read_excel(criminality, skiprows=2)
criminality.head(5)

pop = 'data/population.xlsx'
pop = pd.read_excel(pop, skiprows=2)
pop.head(5)


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,2008Q1,2008Q2,2008Q3,2008Q4,2009Q1,2009Q2,2009Q3,2009Q4,...,2020Q3,2020Q4,2021Q1,2021Q2,2021Q3,2021Q4,2022Q1,2022Q2,2022Q3,2022Q4
0,Total,Copenhagen,509861,511725,511686,516962,518574,520659,521397,526918,...,633035,637936,638117,638678,638790,643613,644431,646812,647509,652564
1,,Frederiksberg,93444,93848,93921,95005,95029,95429,95565,96720,...,104118,104351,103677,103284,102940,103782,103608,103940,104094,104801
2,,Dragør,13261,13315,13350,13452,13411,13416,13460,13550,...,14515,14497,14569,14575,14588,14616,14640,14585,14669,14631
3,,Tårnby,40016,40002,40158,40234,40214,40219,40329,40429,...,42785,42757,42670,42698,42658,42664,42723,42819,43042,43129
4,,Albertslund,27602,27564,27596,27817,27706,27807,27739,27887,...,27543,27500,27366,27192,27113,27411,27599,27548,27547,27576


In [27]:
# Cleaning our data
# Deleting the column that has no information
data = [criminality, pop]
for i in data:
    i.drop("Unnamed: 0", axis = 1, inplace = True)

# Deleting the rows that has no informtion in the criminality dataset
criminality.drop(index = 101,  inplace=True)
criminality.drop(index = 100, inplace=True)

Import your data, either through an API or manually, and load it. 

In [28]:
# Renaming the "Unnamed: 1" to municipalities
for i in data:
    i.rename(columns = {"Unnamed: 1" : "municipalities"}, inplace=True)



# Renaming the rest of the columns in both databases so that they don't start with a number
for n,i in enumerate(data):
    for h in range(2007, 2022+1):
        for j in range(1, 4+1):
            if n == 0:  # This is the first dataset, i.e criminality data
                i.rename(columns={ str(str(h)+"Q"+str(j)):f'crim_{h}Q{j}'}, inplace = True)
            else:
                i.rename(columns={ str(str(h)+"Q"+str(j)):f'pop_{h}Q{j}'}, inplace = True)


# Creating the total per year
for h in range(2007, 2022+1):
    criminality[f'crim_{h}'] = criminality[f'crim_{h}Q1'] + criminality[f"crim_{h}Q2"] + criminality[f"crim_{h}Q3"] + criminality[f"crim_{h}Q4"]
    
for h in range(2008, 2022+1):
    pop[f'pop_{h}'] = pop[f'pop_{h}Q1'] + pop[f"pop_{h}Q2"] + pop[f"pop_{h}Q3"] + pop[f"pop_{h}Q4"]


#extract only the value of the years
criminality_year = criminality[['municipalities', 'crim_2007', 'crim_2008', 
                                'crim_2009', 'crim_2010', 'crim_2011', 'crim_2012', 
                                'crim_2013', 'crim_2014', 'crim_2015', 'crim_2016', 'crim_2016', 
                                'crim_2018', 'crim_2019', 'crim_2020', 'crim_2021', 'crim_2022']]

#extract only the value of the years
pop_year = pop[['municipalities', 'pop_2008', 'pop_2009', 'pop_2010', 'pop_2011', 
                'pop_2012', 'pop_2013', 'pop_2014', 'pop_2015', 'pop_2016', 
                'pop_2016', 'pop_2018', 'pop_2019', 'pop_2020', 'pop_2021', 'pop_2022']]


criminality_year

Unnamed: 0,municipalities,crim_2007,crim_2008,crim_2009,crim_2010,crim_2011,crim_2012,crim_2013,crim_2014,crim_2015,crim_2016,crim_2016.1,crim_2018,crim_2019,crim_2020,crim_2021,crim_2022
0,Copenhagen,93077.0,95101.0,96970.0,94792.0,99787.0,102135.0,102478.0,97983.0,93883.0,102763.0,102763.0,89159.0,93849.0,75750.0,70167.0,81216.0
1,Frederiksberg,8149.0,7971.0,8239.0,8809.0,9737.0,10007.0,9910.0,9211.0,8666.0,8395.0,8395.0,7230.0,7217.0,7163.0,6761.0,7798.0
2,Dragør,671.0,688.0,667.0,687.0,651.0,742.0,620.0,498.0,591.0,589.0,589.0,509.0,660.0,588.0,508.0,496.0
3,Tårnby,5446.0,5947.0,6396.0,6439.0,6424.0,6991.0,7178.0,6991.0,7974.0,8447.0,8447.0,7848.0,7374.0,4656.0,4160.0,5759.0
4,Albertslund,2903.0,3339.0,3394.0,3241.0,2854.0,2850.0,3182.0,2779.0,2584.0,2767.0,2767.0,2507.0,2607.0,3373.0,2017.0,2483.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Rebild,1306.0,1528.0,1471.0,1453.0,1600.0,1643.0,1503.0,1364.0,1257.0,1274.0,1274.0,1128.0,1097.0,1146.0,934.0,1002.0
96,Thisted,2777.0,2938.0,3003.0,2458.0,2664.0,2673.0,2456.0,2087.0,1967.0,2378.0,2378.0,2235.0,1910.0,1462.0,1637.0,1903.0
97,Vesthimmerlands,2613.0,3038.0,2971.0,2630.0,2726.0,2570.0,2690.0,2404.0,2249.0,2098.0,2098.0,1877.0,1746.0,1674.0,2101.0,1590.0
98,Aalborg,17958.0,18792.0,18928.0,18727.0,21297.0,19057.0,19730.0,18193.0,16403.0,15431.0,15431.0,15491.0,14953.0,11486.0,10359.0,14093.0


## Explore each data set

In [29]:
# Doing the statistics mean of the total years 
criminality_year.iloc[:,1:].mean()


crim_2007    5068.58
crim_2008    5391.22
crim_2009    5551.92
crim_2010    5343.01
crim_2011    5434.64
crim_2012    5187.04
crim_2013    5202.36
crim_2014    4974.75
crim_2015    4762.91
crim_2016    4879.13
crim_2016    4879.13
crim_2018    4758.74
crim_2019    4734.35
crim_2020    4210.69
crim_2021    3887.62
crim_2022    4372.54
dtype: float64

**Interpretation**

From the mean of the crime by year since 2007 until 2022, it is possible to see that it has been decreasing. This value however does not take into account the total population, hence we are, so far, only able to conclude over the realizations of crime.

In [30]:
# Mean of the population by year
pop_year.iloc[:,1:].mean()

pop_2008    221748.222222
pop_2009    223017.272727
pop_2010    224005.121212
pop_2011    224958.656566
pop_2012    225778.020202
pop_2013    226674.232323
pop_2014    227849.909091
pop_2015    229353.797980
pop_2016    231255.020202
pop_2016    231255.020202
pop_2018    233969.959596
pop_2019    234943.616162
pop_2020    235456.262626
pop_2021    236373.666667
pop_2022    238342.656566
dtype: float64

**Interpretation**
The mean of the population by municipalities has been increasing since 2008 to 2022. Giving the previous results it suggests on average that the rate of crime has been decreasing. However, this is a first approach as further and more in dept analysis will be taken.

In order to be able to **explore the raw data**, you may provide **static** and **interactive plots** to show important developments 

In [31]:
# Wide to long dataframe
criminality_year_long = pd.wide_to_long(criminality_year, stubnames='crim_', i='municipalities', j='year')
criminality_year_long.head(10)

pop_year_long = pd.wide_to_long(pop_year, stubnames='pop_', i='municipalities', j='year')
pop_year_long.head(10)
criminality_year_long.head(10)


Unnamed: 0_level_0,Unnamed: 1_level_0,crim_
municipalities,year,Unnamed: 2_level_1
Copenhagen,2007,93077.0
Frederiksberg,2007,8149.0
Dragør,2007,671.0
Tårnby,2007,5446.0
Albertslund,2007,2903.0
Ballerup,2007,4601.0
Brøndby,2007,3306.0
Gentofte,2007,5621.0
Gladsaxe,2007,4418.0
Glostrup,2007,2618.0


**Plots**

Plotting criminality and population data for the municipality "Aalborg".

In [32]:
# Resetting the index for criminality and population
criminality_year_long = criminality_year_long.reset_index()
pop_year_long = pop_year_long.reset_index()
criminality_year_long

Unnamed: 0,municipalities,year,crim_
0,Copenhagen,2007,93077.0
1,Frederiksberg,2007,8149.0
2,Dragør,2007,671.0
3,Tårnby,2007,5446.0
4,Albertslund,2007,2903.0
...,...,...,...
1595,Rebild,2022,1002.0
1596,Thisted,2022,1903.0
1597,Vesthimmerlands,2022,1590.0
1598,Aalborg,2022,14093.0


**Interactive plot** :

In [33]:
# Creating the interactive plot 

def plot_e(criminality_year_long, municipalities): 
    I = criminality_year_long['municipalities'] == municipalities
    ax=criminality_year_long.loc[I,:].plot(x='year', y='crim_', style='-o', legend=False)



widgets.interact(plot_e, 
    criminality_year_long = widgets.fixed(criminality_year_long),
    municipalities = widgets.Dropdown(description='municipalities', 
                                    options=criminality_year_long.municipalities.unique(), 
                                    value='Aalborg')
); 



###
#def plot_func():
    # Function that operates on data set
   # pass

#widgets.interact(plot_func, 
    # Let the widget interact with data through plot_func()    
#); 



interactive(children=(Dropdown(description='municipalities', index=98, options=('Copenhagen', 'Frederiksberg',…

Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of a (inner) **merge**:

In [34]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped' )
v.get_label_by_id('110').set_text('included')
plt.show()

NameError: name 'venn2' is not defined

<Figure size 1500x700 with 0 Axes>

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONLUSION.