# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates, but __you are not allowed to copy any part of your answers from your classmates or from the internet__.
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answers. Think about cases your code should handle beyond the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

# python

## question 1 (10 pts)

Create a function named `char_freq` that, given a string, returns a dictionary where each key is a character from the string and each value is the number of times that character appears in the string. Letter case should not matter, and letter keys should always be lowercase. **Do not use packages; implement it from scratch.**

For example, `char_freq("hello world")` should return a dictionary

`{'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1}`

and 

`char_freq("gattaca")`

should return

`{'g': 1, 'a': 3, 't': 2, 'c': 1}`

In [1]:
# define the function below
def char_freq(s):
    s = s.lower()  # lowercase so that letter counts ignore case
    freq = {}  # maps each character to its count
    for ch in s:
        if ch in freq:
            freq[ch] += 1  # character seen before: increment its count
        else:
            freq[ch] = 1  # first occurrence: start the count at 1
    return freq

In [2]:
# try it here
char_freq("Aa") # case should not matter

{'a': 2}

In [3]:
# (10 pts)
assert char_freq("gattaca") == {'g': 1, 'a': 3, 't': 2, 'c': 1}
assert char_freq("hello world") == {'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1}
assert char_freq("") == {}
assert char_freq("    ") == {' ': 4}
assert char_freq("AaBbCcDd") == {'a': 2, 'b': 2, 'c': 2, 'd': 2}
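
The visible tests above only use letters and spaces. As a hedged sketch of the kind of extra cases worth trying, the block below defines an equivalent counter under the hypothetical name `char_freq_sketch` (not part of the assignment) using the built-in `dict.get`, then tries digits, punctuation, and a tab character. It assumes non-letter characters are counted exactly as they appear.

```python
def char_freq_sketch(s):
    # Hypothetical helper for illustration; same contract as char_freq.
    freq = {}
    for ch in s.lower():
        # dict.get returns 0 for a character that has not been seen yet
        freq[ch] = freq.get(ch, 0) + 1
    return freq

# Digits and punctuation are counted as-is; letters are folded to lowercase.
print(char_freq_sketch("A1a!"))
# Whitespace other than a space (here a tab) is a character like any other.
print(char_freq_sketch("Tab\tTAB"))
```

Using `dict.get` with a default of 0 removes the explicit `if`/`else` branch while keeping the implementation free of imports.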

## question 2 (30 pts)

Sometimes we want to expand our feature set by applying special functions to a feature. This step is commonly known as feature engineering. For example, if we have the $age$ of people as a feature, we might want to expand it into $age$, $age^2$, and $\sqrt{age}$, hence increasing the complexity of our model.
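
To make the $age$ example concrete, here is a minimal sketch that expands a single feature by hand. The `ages` values and the variable names are made up for illustration only.

```python
import math

# Hypothetical feature values: the ages of three people.
ages = [16, 25, 36]

# One row per person, holding (age, age^2, sqrt(age)).
features = [[age, age ** 2, math.sqrt(age)] for age in ages]

for row in features:
    print(row)
```

Each original value becomes three values, so a single feature column turns into three columns.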

In this question, you will write a function `expand` that receives a matrix $X$ of $n$ rows and $p$ columns and generates a new matrix $E$ where a set of transformations $f_1, f_2, \dots, f_k$ has been applied to each column of $X$.

For example, let's assume that $X$ has $p$ columns and we have $k=3$. Then your function should generate a matrix $E$ where the $i$-th row contains the following entries, in this order (written here as a column for readability)

$$e_i = \begin{pmatrix} 
f_1(x_{i1})\\
f_2(x_{i1})\\
f_3(x_{i1})\\
f_1(x_{i2})\\
f_2(x_{i2})\\
f_3(x_{i2})\\
\dots\\
f_1(x_{ip})\\
f_2(x_{ip})\\
f_3(x_{ip})\\
\end{pmatrix}
$$

The dimensions of $E$ should be $n$ by $kp$.
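
As a small worked instance of this layout, assume (purely for illustration) $n = 2$, $p = 2$, and $k = 2$ with $f_1(x) = x$ and $f_2(x) = x^2$:

$$X = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\quad\Rightarrow\quad
E = \begin{pmatrix}
f_1(1) & f_2(1) & f_1(2) & f_2(2) \\
f_1(3) & f_2(3) & f_1(4) & f_2(4)
\end{pmatrix}
= \begin{pmatrix} 1 & 1 & 2 & 4 \\ 3 & 9 & 4 & 16 \end{pmatrix}
$$

Here $E$ is $2 \times 4$, which matches $n$ by $kp$.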

Your function should receive a list of rows as the first parameter `X`, with each row represented by a list of elements. The second parameter, `F`, should be the list of functions to apply. The function should return the matrix $E$, also represented as a list of rows.

In [119]:
# implement the functionality below. *Hint*: Use list comprehension
def expand(X, F):
    # For each row of X, apply every function in F to each element in turn,
    # so a row (x1, ..., xp) becomes (f1(x1), ..., fk(x1), ..., f1(xp), ..., fk(xp)).
    # No type checks are needed, so int, float, and empty rows are all handled.
    return [[f(x) for x in row for f in F] for row in X]

In [120]:
# try the functionality here
expand([[1,2,3,4]], [lambda x: x]) # the input and output should be identical

[[1, 2, 3, 4]]

In [121]:
# (30 pts)
assert expand([[1]], [lambda x: 1]) == [[1]]
assert expand([[1,2,3], [3,4,5]], [lambda x: x*2, lambda x: x+1]) == [[2, 2, 4, 3, 6, 4], [6, 4, 8, 5, 10, 6]]
assert expand([[0,1,0], 
               [0,1,0],
               [-1,0,1]
              ], [lambda x: x, lambda x: x**2, lambda x: x**3]) == [[0, 0, 0, 1, 1, 1, 0, 0, 0],
 [0, 0, 0, 1, 1, 1, 0, 0, 0],
 [-1, 1, -1, 0, 0, 0, 1, 1, 1]]