In [1]:
import numpy as np

# Some notes on the inaugural project

Generally you did really well, and it was a treat to read your projects and see your different approaches. <br>
But there were two recurring errors I would like to note and explain, and then I have some comments on style choices.

## Recurring errors

### np.max()

Quite a few of you got into trouble using the np.max() function. This is quite understandable, given that it looks like the obvious function to use. The problem can be shown by running:

In [2]:
np.max(-1,0)

-1

Which is not what we intended. <br>
The reason is that the np.max()-function is made to take an array as input and then compare all the elements of the array and return the largest. Thus in this case it compares -1 with nothing and returns -1. The 0 is interpreted as the axis argument.
<br>
This can be seen by looking at the documentation:

In [3]:
help(np.max) # This is an easy way to access documentation
#np.amax and np.max are the same function

Help on function amax in module numpy:

amax(a, axis=None, out=None, keepdims=<no value>, initial=<no value>, where=<no value>)
    Return the maximum of an array or maximum along an axis.
    
    Parameters
    ----------
    a : array_like
        Input data.
    axis : None or int or tuple of ints, optional
        Axis or axes along which to operate.  By default, flattened input is
        used.
    
        .. versionadded:: 1.7.0
    
        If this is a tuple of ints, the maximum is selected over multiple axes,
        instead of a single axis or all the axes as before.
    out : ndarray, optional
        Alternative output array in which to place the result.  Must
        be of the same shape and buffer length as the expected output.
        See `doc.ufuncs` (Section "Output arguments") for more details.
    
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
  

In [4]:
# So to make it work you'd have to insert the values you want compared as a list:
np.max([-1,0])

0

But there are other functions that do the job more easily and work intuitively:

In [5]:
print(np.fmax(-1,0))
print(max(-1,0))

0
0
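The same idea carries over to arrays, which is typically what you need for an expression like max(x, 0) applied to many values at once. Here is a minimal sketch with made-up data. It uses np.maximum, which compares element by element just like np.fmax (the two only differ in how they treat NaN values):

```python
import numpy as np

# Made-up data, purely for illustration
x = np.array([-2.0, -0.5, 0.0, 1.5])

# Element-wise maximum of each entry and 0 (the scalar 0 is broadcast)
clipped = np.maximum(x, 0)
print(clipped)

# np.max with a single array argument reduces the whole array to one number
largest = np.max(x)
print(largest)
```

So np.maximum and np.fmax compare two inputs with each other, while np.max finds the largest element within one input.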


### Global variables

Some of you got into trouble because you used global variables inside functions instead of passing them as arguments. <br>
When making larger projects, I would recommend that every variable a function uses is passed to it as an argument, as relying on global variables can be a gut-wrenching source of errors.
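A small sketch of how this can go wrong. The function names and numbers are made up for illustration and are not taken from anyone's project:

```python
# Fragile: the function silently depends on whatever `rate` is at call time
rate = 0.25

def tax_global(income):
    return rate * income

# Safer: everything the function needs is passed in explicitly
def tax(income, rate):
    return rate * income

before = tax_global(100)
rate = 0.5               # changed somewhere else in the notebook
after = tax_global(100)  # same call, different result

explicit = tax(100, rate=0.25)
print(before, after, explicit)
```

The two calls to tax_global look identical but give different results, because the global `rate` changed in between. With the explicit version you can see from the call itself which value is used.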

## Style comments

These are some things I would recommend that you think about, but of course it is a bit more subjective what kind of style you like.

### Less is more 

If you can get the same results with the same precision in fewer lines, I would always recommend it. It makes your code easier to read and debug, and can save you a lot of time. <br>
The easiest way to do this is to write a function whenever you're doing a repetitive task, and then call it multiple times. The main place where I noticed this was problems 3) and 4). Since problem 4) is just 3) with a different parameter value, I recommend that you harness the power of programming and make one function that solves both problems. Reusing functions you have defined earlier is recommended and good practice.
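As a sketch of the pattern (the function, data and parameter values below are hypothetical and not the ones from the project):

```python
def total_tax(incomes, rate):
    """Return total tax revenue for a list of incomes at a flat rate (hypothetical example)."""
    return sum(rate * income for income in incomes)

# Made-up data
incomes = [100, 200, 400]

# The two "problems" differ only in the parameter value,
# so the same function is simply called twice
revenue_3 = total_tax(incomes, rate=0.25)
revenue_4 = total_tax(incomes, rate=0.5)
print(revenue_3, revenue_4)
```

If you later find a bug in the calculation, you only have to fix it in one place.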

### Unpacking results

This is a small thing, but in the interest of writing fewer lines of code I will just mention it: when assigning the elements of a result to multiple variables, you can save some typing by using unpacking:

In [6]:
# Initiate some random data:
result = [1,2,3]

#These many lines of code:
a = result[0]
b = result[1]
c = result[2]
print(a,b,c)

#Can be replaced with this:
a,b,c = result
print(a,b,c)

# Or if you don't want to unpack the whole list:
a, b = result[0],result[1]
print(a,b)
#although even this can be shorter:
a, b = result[0:2]
print(a,b)

1 2 3
1 2 3
1 2
1 2


In [7]:
# You can also unpack this way when using functions which return multiple outputs
# (in reality it returns a tuple, but you can unpack it as multiple variables):
def double_output_func(x1,x2):
    '''
    This function takes two arguments and returns the double of those arguments as a tuple
    
    arguments:
        x1 (int/float) : a number to double
        x2 (int/float) : a number to double
        
    returns:
        (tuple): Two elements, the inputed arguments times 2
    '''
    x1_new = x1*2
    x2_new = x2*2
    return x1_new, x2_new

res = double_output_func(1,2)
print(res)
print(type(res))

res1, res2 = double_output_func(1,2)
print(res1,res2)

(2, 4)
<class 'tuple'>
2 4


### Function documentation

The general convention, and what Jeppe prefers, is to use docstrings inside functions (as in double_output_func above). This might seem tedious at times, but it is really helpful when you look at the code later, and sticking with this style makes it faster to read, since you'll get used to common Python documentation. Another benefit is that you can look up your own documentation (which is again useful when you are writing loads of code and might forget the order of the arguments of a function you defined a week ago):

In [8]:
help(double_output_func)

Help on function double_output_func in module __main__:

double_output_func(x1, x2)
    This function takes two arguments and returns the double of those arguments as a tuple
    
    arguments:
        x1 (int/float) : a number to double
        x2 (int/float) : a number to double
        
    returns:
        (tuple): Two elements, the inputed arguments times 2



# Notes on Problem set 4

This problem set focuses on lecture 8, loading and editing datasets, and requires some essential skills which will be necessary for your data analysis project.  <br>
I've written some notes here that I hope will be useful, so you don't have to rely on the answers only. If you have any questions, you can write me at hms467@econ.ku.dk. If you're interested, you can also see my (undocumented) take on the problem set at . Generally I would recommend using the [thread](https://github.com/NumEconCopenhagen/lectures-2020/issues) that Jeppe has suggested, as it is good to have the tips and tricks collected so everybody benefits from them. You can see earlier solved issues in the 'closed' section. <br>
I would recommend, if you have not already, that you start looking at the documentation when figuring out how to use the functions correctly, as being able to do this will be really beneficial for the data project.

**Problem 2.1:**

For loading data from Statistics Denmark: see 3.1 in lecture 8 <br>
For step 2: <br>
If you're unsure how to use the .replace()-method, look up the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html). I would recommend looking at the examples at the bottom. <br> 
(btw type(nah1.variable) and type(nah1.unit) are both pandas Series, so you want to look at the Series.replace() documentation I have linked to, not the DataFrame.replace() documentation)
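A minimal sketch of Series.replace() with a dictionary. The data frame below is a made-up stand-in for nah1, and the long labels and the contents of var_dict are hypothetical, so don't expect them to match the real dataset:

```python
import pandas as pd

# Made-up stand-in for nah1 (labels are hypothetical)
nah1 = pd.DataFrame({
    'variable': ['long label for Y', 'long label for C', 'long label for Y'],
    'value': [1.0, 2.0, 3.0],
})

# Hypothetical mapping from the long labels to short names
var_dict = {'long label for Y': 'Y', 'long label for C': 'C'}

# Series.replace with a dict swaps every matching entry for its mapped value
nah1['variable'] = nah1['variable'].replace(var_dict)
print(nah1)
```

Entries that are not keys in the dictionary are left unchanged.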

For step 3: <br>
You need a boolean series that states, for each observation, whether the variable-column is recorded as Y, C, G, I, X or M. Then you can select all the observations in nah1 which satisfy this condition. <br>
Jeppe creates this by looping through var_dict. <br>
I did it using the pandas Series [.isin()-method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html). The series you wanna use it on is: nah1.variable <br>
You can also use the numpy function [isin()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.isin.html) <br>
The sign | means 'or' btw, in case you've forgotten.
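To illustrate, here is a sketch of both approaches on a made-up stand-in for nah1 (the values are invented). The loop version combines conditions with | and is only my guess at the general idea, not Jeppe's actual code:

```python
import pandas as pd

# Made-up stand-in for nah1
nah1 = pd.DataFrame({
    'variable': ['Y', 'C', 'other', 'M', 'G'],
    'value': [10, 6, 1, 3, 2],
})
wanted = ['Y', 'C', 'G', 'I', 'X', 'M']

# Approach 1: one boolean series directly from .isin()
keep = nah1['variable'].isin(wanted)

# Approach 2: start from all-False and 'or' the conditions together one by one
keep_or = pd.Series(False, index=nah1.index)
for name in wanted:
    keep_or = keep_or | (nah1['variable'] == name)

# Use the boolean series to select the rows
nah1_kept = nah1[keep]
print(nah1_kept)
```

Both boolean series select the same rows; .isin() just gets there in one line.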

For step 4, it should of course be: nah1.groupby(['variable','unit']).describe() not nah1_true. <br>
If you wanna get really technical in your discussion, you can check out the documentation for [.describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.describe.html). 
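A sketch of what the groupby-describe call returns, using a made-up stand-in for nah1 (the column name 'value' and the numbers are assumptions):

```python
import pandas as pd

# Made-up stand-in for nah1
nah1 = pd.DataFrame({
    'variable': ['Y', 'Y', 'C', 'C'],
    'unit': ['real', 'real', 'real', 'real'],
    'value': [10.0, 12.0, 6.0, 8.0],
})

# One row per (variable, unit) group, with summary statistics as columns
summary = nah1.groupby(['variable', 'unit']).describe()
print(summary)

# The columns are two-level: (original column, statistic)
mean_Y = summary.loc[('Y', 'real'), ('value', 'mean')]
print(mean_Y)
```

Note that the result has two-level columns, so a single statistic is looked up with a pair such as ('value', 'mean').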

**Problem 2.2**:

Again, use the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) <br>
For using the .join()-method, you can see 2.2.4 in lecture 8 
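As a quick reminder of how the `how` argument changes the result, here is a sketch on two tiny made-up data frames (not the ones from the problem set):

```python
import pandas as pd

# Made-up data: the key 'a' is only in left, 'd' is only in right
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

inner = pd.merge(left, right, on='key', how='inner')   # keys in both
outer = pd.merge(left, right, on='key', how='outer')   # keys in either
left_m = pd.merge(left, right, on='key', how='left')   # all keys from left

# .join() matches on the index, so the key is moved to the index first
joined = left.set_index('key').join(right.set_index('key'), how='left')

print(inner, outer, left_m, joined, sep='\n\n')
```

Rows without a match get NaN in the columns that come from the other data frame.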

**Problem 3**

For 3.1 and 3.2 there is not much to say. I would just recommend that you go through them carefully and experiment with the commands you don't understand. This is a good resource to go back to when doing the data project. <br>
For 3.3 it gets a little trickier. I would recommend that you work systematically, for example: Merge pop and prices_long. Generate log variables. Generate log-diff variables grouped on municipalities. Plot the first figure. For the second figure it's easier to create and plot the variables in one line with the agg-method, as Jeppe suggested.
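The log and log-diff steps could be sketched like this. The data frame and its column names are made up for illustration and stand in for the merged data, so adjust them to whatever your own merge produces:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the merged data (column names are assumptions)
df = pd.DataFrame({
    'municipality': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2000, 2001, 2002, 2000, 2001, 2002],
    'price': [100.0, 110.0, 121.0, 50.0, 50.0, 55.0],
})

# Sort so that differences are taken between consecutive years
df = df.sort_values(['municipality', 'year'])

# Log variable
df['log_price'] = np.log(df['price'])

# Log-difference within each municipality (first year of each group becomes NaN)
df['d_log_price'] = df.groupby('municipality')['log_price'].diff()
print(df)
```

Grouping before taking the difference matters: without the groupby, the first year of municipality B would be compared with the last year of municipality A.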

# Concluding remarks

After you've finished this problem set, you'll have had all the preparation you need for starting on the Data Analysis Project, so I suggest you get started on it, if you haven't already started thinking about what kind of project you wanna do. <br>
In the slides for exercise 7, I've listed possible data sources. Other than that, the best advice I can give you is to choose something you're interested in, but also to keep the project small (unless of course you're in quarantine and have plenty of time): loading some data, cleaning it, doing some transformations/calculations and plotting a few figures can take plenty of time because of the many potential errors, and will likely be enough for a good project as long as the quality is good.

<img src="excelling_in_data.jpeg" style="width:400px;height:400px"/>