# Data Analytics Spring 2023 &mdash; Exercises 1

### Onni Roivas (last modified: Tue Jan 10 at 10:45)

- Five problems
- Minor variations between users
- Theme: Python & Numpy (no Pandas allowed)
- Hints will be given during the opening weekend
- Deadline: about a week from the opening weekend (to be specified)
- Make a copy of the original notebook (right click & duplicate) and add your answers (new cells) there
- Please make both your code and your notebook readable
- When you are done, run the handin code cell at the end of this notebook
- The original notebook may change after publication, but the
  changes should be minor (e.g. the deadline above)
- Keep your originals up to date by running the code cell below:

In [None]:
import os
os.system('/usr/bin/bash /home/varpha/data_analytics/bin/config.sh');

## Problem 1. Documentation
- Read [this function documentation](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) and explain the function to your anonymous peer reviewer.

Please write a nice and clear explanation. Include some elementary examples.
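
As an illustration of the kind of elementary example meant here, the following minimal sketch shows how the `axis` argument changes the result of `numpy.concatenate`. The arrays `a` and `b` and the flag `mismatch_raised` are made up for this illustration only.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])           # shape (1, 2)

# axis=0: b is appended below a; the column counts must match.
rows = np.concatenate((a, b), axis=0)

# axis=1: arrays are joined side by side; the row counts must match,
# so b is transposed to shape (2, 1) first.
cols = np.concatenate((a, b.T), axis=1)

# axis=None: all inputs are flattened before joining.
flat = np.concatenate((a, b), axis=None)

# Shapes that differ outside the chosen axis are rejected.
try:
    np.concatenate((a, b), axis=1)   # 2 rows vs 1 row
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(rows.shape, cols.shape, flat.shape, mismatch_raised)
```

The point of the sketch is that the inputs only need to agree in the dimensions other than the one being joined.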

In [None]:
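# Concatenation means joining. numpy.concatenate joins a sequence of arrays along an existing axis.
# The arrays must have the same shape, except in the dimension given by axis (default: 0).
# Call signature: numpy.concatenate((a1, a2, ...), axis=0)

# Example: join two 2x3 arrays side by side (axis=1) into one 2x6 array
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9], [10, 11, 12]])
np.concatenate((a, b), axis=1)
# -> array([[ 1,  2,  3,  7,  8,  9],
#           [ 4,  5,  6, 10, 11, 12]])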

## Problem 2. Map, Lambda, Groupby
In this problem, only plain python may be used, no numpy.<br/>
The following links may be helpful:
- [sorting howto](https://docs.python.org/3/howto/sorting.html)
- [lambda sorting](https://blogboard.io/blog/knowledge/python-sorted-lambda)
- [itertools groupby](https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby).

Using the code cell below, read a csv (real wind turbine data) into a list of dicts.<br/>
Then do the following:
- a) using map and lambda, convert the timestamps into the format <b>MM/dd/yyyy HH:mm:ss</b>, e.g. 11/04/2018 09:10:43
- b) using sorted and lambda, sort the rows according to increasing rotorspeed
- c) add a column called <b><i>WindSpeed_Group</i></b> that contains a windspeed group A, B or C, where A = less than 5 mps, B = 5-10 mps, C = more than 10 mps. Use [itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby).

In your handin, include the code that does a) - c) above. No need to save the modified data. Here is the code for reading the raw data:
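
The three steps can be sketched on a few made-up rows as follows. The sample values and the helper names `reformat` and `speed_group` are assumptions for illustration, and the input timestamp format is assumed to be `YYYY-MM-DD HH:MM:SS.ffffff`; the real file may differ.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical rows, shaped like the output of csv.DictReader (all values are strings).
rows = [
    {'TimeStamp': '2018-11-04 09:10:43.000', 'WindSpeed_mps': '12.1', 'RotorSpeed_rpm': '14.9'},
    {'TimeStamp': '2018-11-04 09:20:43.000', 'WindSpeed_mps': '3.2', 'RotorSpeed_rpm': '6.0'},
    {'TimeStamp': '2018-11-04 09:30:43.000', 'WindSpeed_mps': '7.5', 'RotorSpeed_rpm': '11.3'},
    {'TimeStamp': '2018-11-04 09:40:43.000', 'WindSpeed_mps': '4.0', 'RotorSpeed_rpm': '7.1'},
]

def reformat(ts):
    """Convert an assumed input timestamp to MM/dd/yyyy HH:mm:ss."""
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f').strftime('%m/%d/%Y %H:%M:%S')

def speed_group(row):
    """A = below 5 mps, B = 5-10 mps, C = above 10 mps."""
    speed = float(row['WindSpeed_mps'])
    return 'A' if speed < 5 else ('B' if speed <= 10 else 'C')

# a) map + lambda: replace the timestamp, keep the other columns.
rows = list(map(lambda r: {**r, 'TimeStamp': reformat(r['TimeStamp'])}, rows))

# b) sorted + lambda: compare as numbers, not as strings.
rows = sorted(rows, key=lambda r: float(r['RotorSpeed_rpm']))

# c) groupby only merges adjacent items, so sort by the group key first.
#    sorted is stable, so the rotor-speed order survives inside each group.
grouped = []
for label, members in groupby(sorted(rows, key=speed_group), key=speed_group):
    grouped.extend({**r, 'WindSpeed_Group': label} for r in members)

for row in grouped:
    print(row)
```

Sorting before `groupby` matters: without it, the same group label can appear several times, once per run of adjacent rows.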

In [None]:
from getpass import getuser
import csv
from datetime import datetime
from itertools import groupby

user = getuser().upper()
csv_location = f'/home/varpha/data_analytics/private/{user}' + \
                f'/exrc_01/data/prob2_{user}.csv'
with open(csv_location) as handle:
    mydata = list(csv.DictReader(handle))

# a) Convert the timestamps to MM/dd/yyyy HH:mm:ss with map and lambda.
mydata = list(map(
    lambda x: {
        'TimeStamp': datetime.strptime(
            x['TimeStamp'], '%Y-%m-%d %H:%M:%S.%f').strftime('%m/%d/%Y %H:%M:%S'),
        'WindSpeed_mps': x['WindSpeed_mps'],
        'RotorSpeed_rpm': x['RotorSpeed_rpm']},
    mydata))

# b) Drop rows without a rotor speed, then sort by increasing rotor speed.
mydata = [row for row in mydata if row['RotorSpeed_rpm'] != '']
mydata = sorted(mydata, key=lambda x: float(x['RotorSpeed_rpm']))

# c) Classify the wind speed: A = below 5 mps, B = 5-10 mps, C = above 10 mps.
def windspeed_group_fn(row):
    speed = float(row['WindSpeed_mps'])
    if speed < 5:
        return 'A'
    elif speed <= 10:
        return 'B'
    else:
        return 'C'

mydata = [{**row, 'WindSpeed_Group': windspeed_group_fn(row)} for row in mydata]
# groupby only merges adjacent rows, so sort by the group key first.
mydata = sorted(mydata, key=lambda x: x['WindSpeed_Group'])
for key, group in groupby(mydata, key=lambda x: x['WindSpeed_Group']):
    print(key + ':\n')
    for row in group:
        print(row)

## Problem 3. Vectorization

- Some [general info](https://www.askpython.com/python-modules/numpy/vectorization-numpy)
- The code in <b>data_analytics/lib/integrator.py</b> contains rudimentary code,<br/>
  written in plain python, that numerically integrates a (math) function<br/>
  $f\colon \mathbb{R} \to \mathbb{R}$ over an interval $[a,b]$.
- Rewrite the code using numpy and vectorization.
- Introduce timings to measure the gain of vectorization.
- Use the (math) function $f(x)=- 13 x^{8} + 9 x^{7} - 13 x^{6} + 3$ and interval $[a,b] = [-7, 14]$ to test the code.
- Increase the number of subintervals in order to obtain a noticeable difference in the timings.

In your handin, include the rewritten code along with the timing measures.
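
A minimal sketch of the idea, assuming the plain-Python integrator uses the midpoint rule: the loop over subintervals is replaced by one array of midpoints on which $f$ is evaluated in a single call. The names `midpoint_loop`, `midpoint_vectorized` and `antiderivative` are made up for this illustration, and the number of subintervals is kept small so that the cell runs quickly.

```python
import time
import numpy as np

def f(x):
    return -13*x**8 + 9*x**7 - 13*x**6 + 3

def antiderivative(x):
    # Exact antiderivative of f, used only to check the numerical results.
    return -13*x**9/9 + 9*x**8/8 - 13*x**7/7 + 3*x

def midpoint_loop(f, a, b, n):
    # Plain Python: one call of f per subinterval.
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

def midpoint_vectorized(f, a, b, n):
    # NumPy: build all n midpoints at once and evaluate f on the whole array.
    width = (b - a) / n
    midpoints = a + (np.arange(n) + 0.5) * width
    return np.sum(f(midpoints)) * width

a, b, n = -7, 14, 200_000
exact = antiderivative(b) - antiderivative(a)

t0 = time.perf_counter()
loop_result = midpoint_loop(f, a, b, n)
t1 = time.perf_counter()
vec_result = midpoint_vectorized(f, a, b, n)
t2 = time.perf_counter()

print(f"exact:      {exact:.6e}")
print(f"loop:       {loop_result:.6e} in {t1 - t0:.4f} s")
print(f"vectorized: {vec_result:.6e} in {t2 - t1:.4f} s")
```

Both versions use the same midpoints, so their results should agree up to rounding; only the timings should differ. Note that the vectorized version holds all midpoints in memory at once, which limits how far the number of subintervals can be increased.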

In [None]:
import numpy as np
import time

def create_mesh(a, b, n):
    return [a+i*(b-a)/n for i in range(n)]
def integrate(f, a, b, n):
    sum_of_rectangles = 0
    left_endpoints = create_mesh(a,b,n)
    mesh_width = (b-a)/n
    for left_endpoint in left_endpoints:
        midpoint = left_endpoint + mesh_width/2
        height = f(midpoint)
        sum_of_rectangles += height * mesh_width
    return sum_of_rectangles
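def integrate_vectorized(f, a, b, n):
    # Midpoint rule, as in integrate(): evaluate f at the n subinterval midpoints.
    mesh_width = (b-a)/n
    midpoints = a + (np.arange(n) + 0.5) * mesh_width
    return np.sum(f(midpoints)) * mesh_width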
def f(x):
    return -13*x**8 + 9*x**7 - 13*x**6 + 3

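n = 100000000  # number of subintervals, shared by both versions

start = time.perf_counter()
myresult = integrate_vectorized(f, -7, 14, n)
end = time.perf_counter()
print(myresult)
print("Vectorized version took:", end - start, "seconds")

start = time.perf_counter()
myresult = integrate(f, -7, 14, n)
end = time.perf_counter()
print(myresult)
print("Original version took:", end - start, "seconds")

# The original version takes a long time because it calls f once per
# subinterval in an interpreted Python loop.
# Splitting the interval across processes (multiprocessing) is one possible further speed-up.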

## Problem 4. Numpy arrays

- The folder <b>/home/AB0410/data_analytics/private/exrc_01/data</b><br/>
  contains a csv file (<b>prob4_AB0410.csv</b>) with some weather data.
- a) Use [numpy.genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) to read the file into a 2-dimensional numpy array.<br/>
  Use dtype=str in order to not lose the headers.
- b) Use Boolean masking to drop the rows that contain <b>nan</b> entries.
- c) Convert the time entries (standard timestamp) into a human-readable format of your choice.
- d) Add a new row that contains the averages of the columns, except <b>nan</b> for the time column.

In your handin, include the code that does a) - d) above (no saved data).
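
Steps a) to d) can be sketched with NumPy alone on a small in-memory file. The column names, the sample values and the assumption that missing entries appear as the text `nan` are made up for illustration, so the real file may need adjustments. The times are formatted in UTC here so that the output does not depend on the local time zone.

```python
import io
from datetime import datetime, timezone
import numpy as np

# Hypothetical stand-in for the weather file.
csv_text = """time,temperature,humidity
1673340000,-3.5,81
1673343600,nan,79
1673347200,-2.5,83
"""

# a) Read everything as strings so that the header row is kept.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=str)
header, body = raw[0], raw[1:]

# b) Boolean mask: True for rows without any 'nan' entry.
keep = ~np.any(body == 'nan', axis=1)
body = body[keep]

# c) Convert the Unix timestamps to readable text.
values = body.astype(float)
time_idx = list(header).index('time')
readable = [datetime.fromtimestamp(t, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
            for t in values[:, time_idx]]
# A fixed-width string array would truncate the longer time strings,
# so switch to dtype=object before assigning them.
table = body.astype(object)
table[:, time_idx] = readable

# d) Append the column averages, with nan in the time column.
averages = values.mean(axis=0).astype(object)
averages[time_idx] = np.nan
table = np.vstack([header, table, averages])

print(table)
```

The switch to `dtype=object` is a design choice of this sketch: it lets strings and numbers share one array, at the cost of NumPy's fast numeric operations on the result.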

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime
from getpass import getuser
import csv

user = getuser().upper()
csv_location = f'/home/{user}/data_analytics/private/exrc_01/data/prob4_AB0410.csv'

data = pd.read_csv(csv_location)
data = data.dropna()

time_col = data['time']
time_col = [datetime.fromtimestamp(time) for time in time_col]
data['time'] = time_col

averages = data.mean(axis=0)
averages['time'] = 'nan'
data = pd.concat([data, averages.to_frame().T], ignore_index=True)

data

## Problem 5. Data download
- Start by exploring / running the code in <b>data_analytics/lib/statfi.py</b>
- Choose a topic that interests you. Then download a few megabytes of data of that topic.
- Save your data in one or several json files.

In your handin, include the code that you used (no saved data).
Also, say a few words about your experiences. What problems, if any, did you encounter?
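
The saving step can be sketched independently of the download. The example below is only an illustration: `records` is made-up data standing in for whatever was downloaded, and `save_in_chunks` is a hypothetical helper, not part of <b>statfi.py</b>.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records standing in for downloaded data.
records = [{'year': 2020 + i, 'value': 100 + i} for i in range(10)]

def save_in_chunks(records, directory, chunk_size):
    """Write the records into several JSON files, chunk_size records per file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(records), chunk_size):
        path = directory / f'part_{start // chunk_size:03d}.json'
        with open(path, 'w', encoding='utf-8') as handle:
            # ensure_ascii=False keeps characters such as ä and ö readable in the file.
            json.dump(records[start:start + chunk_size], handle, ensure_ascii=False)
        paths.append(path)
    return paths

out_dir = tempfile.mkdtemp()   # temporary directory, used only for this demo
paths = save_in_chunks(records, out_dir, chunk_size=4)

# Read the files back to check that nothing was lost.
restored = [row for p in paths for row in json.loads(p.read_text(encoding='utf-8'))]
print(len(paths), restored == records)
```

Splitting into several files is optional; a single `json.dump` call is enough when the data fits comfortably in one file.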

## Hand in your final answers by running the code cell below.
- It lists the notebooks in the same directory as the current one, and asks you to choose the one you want to hand in. (It seems that in JupyterHub one cannot refer to the current file, only to the directory of the current file.)
- Save your latest changes first, and please remove anything that may identify you to your anonymous reviewer.
- More information about the anonymous reviewing process will appear later.
- You may run the code cell as many times as you wish.
- Your permission to write the handin file ends at the deadline.

In [None]:
import sys
sys.path.append('/home/varpha/data_analytics/lib')
from handin import handin_exrc_01
handin_exrc_01()