### <font color="brown">Problem Set 7: NumPy - Solution</font>

In [4]:
import numpy as np

---

#### Problem 1

Write a function that takes a 2D ndarray and cycles the rows up by 1 so that the first row becomes the last, the last becomes second-to-last, etc. 

In [89]:
def rowcycle(ndarr):
    cycle = list(range(1,ndarr.shape[0])) + [0]
    return ndarr[cycle]

arr2d = np.random.randint(1,13,(4,3))
print(arr2d,'\n')
print(rowcycle(arr2d))

[[4 3 6]
 [2 9 3]
 [5 9 4]
 [1 5 7]] 

[[2 9 3]
 [5 9 4]
 [1 5 7]
 [4 3 6]]
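
As an alternative to building the index list by hand, the same cycling can be sketched with `np.roll`. The function name `rowcycle_roll` and the array `demo` are illustrative only; `demo` reuses the values from the sample output above.

```python
import numpy as np

def rowcycle_roll(ndarr):
    # Shift every row up by one along axis 0; the first row wraps around to the end
    return np.roll(ndarr, -1, axis=0)

demo = np.array([[4, 3, 6],
                 [2, 9, 3],
                 [5, 9, 4],
                 [1, 5, 7]])
print(rowcycle_roll(demo))
```

Like the fancy-indexing version, `np.roll` returns a new array and leaves the input unchanged.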


---

#### Problem 2

Write a function that takes an ndarray and computes the standard deviation of the values in each row, without using the standard deviation function. It should return an array with these standard deviations. 
See https://www.mathsisfun.com/data/standard-deviation-formulas.html

In [153]:
# Building the solution, one step at a time

# 1. Sample 2D ndarray
arr = np.array([[3,1,-2],[1,8,2],[6,1,5]])
print(f'Input array:\n {arr}\n')

# 2. Mean for each row
mn = np.mean(arr,axis=1)
print(f'Row means: {mn}\n')

# 3. Reshape the means into a single column so they broadcast across each row
mn = mn.reshape(-1,1)
print(f'Row means, column vector:\n {mn}\n')

# 4. Subtract row's mean from each row value
arr1 = arr - mn
print(f'Row value minus mean:\n {arr1}\n')

# 5. Square the differences
arr1 = arr1 ** 2
print(f'Differences squared:\n {arr1}\n')

# 6. Sum the squared differences
arr1 = arr1.sum(axis=1)
print(f'Sum of squared differences:\n {arr1}\n')

# 7. Divide each by number of columns (values in each row)
arr1 = arr1 / arr.shape[1]
print(f'Divide by N={arr.shape[1]} (number of values in each row):\n {arr1}\n')

# 8. Square root of each
arr1 = np.sqrt(arr1)
print(f'Standard deviations:\n {arr1}\n')

Input array:
 [[ 3  1 -2]
 [ 1  8  2]
 [ 6  1  5]]

Row means: [0.66666667 3.66666667 4.        ]

Row means, column vector:
 [[0.66666667]
 [3.66666667]
 [4.        ]]

Row value minus mean:
 [[ 2.33333333  0.33333333 -2.66666667]
 [-2.66666667  4.33333333 -1.66666667]
 [ 2.         -3.          1.        ]]

Differences squared:
 [[ 5.44444444  0.11111111  7.11111111]
 [ 7.11111111 18.77777778  2.77777778]
 [ 4.          9.          1.        ]]

Sum of squared differences:
 [12.66666667 28.66666667 14.        ]

Divide by N=3 (number of values in each row):
 [4.22222222 9.55555556 4.66666667]

Standard deviations:
 [2.05480467 3.09120617 2.1602469 ]



In [156]:
# Verify against np standard deviation function, std
arr.std(axis=1)

array([2.05480467, 3.09120617, 2.1602469 ])

In [157]:
# Solution function
def stddev(arr):
    # reshape(-1,1) turns the row means into a column for any number of rows
    arr1 = arr - np.mean(arr,axis=1).reshape(-1,1)
    arr1 = (arr1 ** 2).sum(axis=1)/arr.shape[1]
    return np.sqrt(arr1)


In [158]:
# Test
stddev(arr)

array([2.05480467, 3.09120617, 2.1602469 ])
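
The steps above divide by N, which is the population standard deviation and matches the default of `np.std`. The linked page also covers the sample formula, which divides by N-1. A minimal sketch that supports both follows; the name `row_std`, its `ddof` parameter (mirroring the `np.std` argument of the same name), and the array `sample` are assumptions for illustration.

```python
import numpy as np

def row_std(arr, ddof=0):
    arr = np.asarray(arr, dtype=float)
    # keepdims=True keeps the row means as a column, so they broadcast across each row
    diffs = arr - arr.mean(axis=1, keepdims=True)
    # ddof=0 divides by N (population); ddof=1 divides by N-1 (sample)
    return np.sqrt((diffs ** 2).sum(axis=1) / (arr.shape[1] - ddof))

sample = np.array([[3, 1, -2, 4],
                   [1, 8, 2, 0]])   # 2 rows, 4 columns
print(row_std(sample))              # population standard deviation per row
print(row_std(sample, ddof=1))      # sample standard deviation per row
```

Because the means keep their column shape, this version does not depend on the number of rows in the input.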

---

#### Problem 3:

In data science, a popular technique called kNN, short for **k Nearest Neighbors**, is used to find the k vectors in a dataset that are most like a new vector. For instance, a vector (array) could be a user's ratings (out of 5) on a set of 10 action movies on Netflix, like so:
<pre>
  [2, 3, 4, 1, 1, 4, 3, 3, 3, 4]
</pre>

Suppose there is a large collection of such ratings posted by many users, collected in a 2D array, one row per user.
(So the array size is n x m, where n is the number of users and m is the number of movies that have been rated.)

Now suppose you have a new user input of ratings on these movies, and you want to find out which k ratings in the collection are most like the new user's ratings, i.e. which are the k-nearest neighbors of the new user, ratings-wise. This can then be used to recommend to the new user which other movies they might like, based on the likes of the k-nearest neighbors. (This is called "collaborative filtering".)

Implement a function that returns the k nearest neighbors given a 2D ratings array, a parallel user id array (the i-th entry in the user id array is the user id for the i-th row of the ratings array), a new user's ratings array, and a k value.

You must use NumPy for all computation. Use the Euclidean distance between two vectors (see https://en.wikipedia.org/wiki/Euclidean_distance#Higher_dimensions) as a measure of similarity/likeness. The smaller the distance, the more alike. 

Your returned result should include each neighbor's distance (the similarity value), as well as the user id for that distance.


In [67]:
def euclid_dist(r,u):
    # Euclidean distance between two rating vectors
    return np.sqrt(np.sum((r-u)**2))

def kNN(ratings_data, user_ids, user_ratings, k):
    # Distance from the new user's ratings to every row of ratings_data
    dist = np.empty(ratings_data.shape[0])
    for r in range(ratings_data.shape[0]):
        dist[r] = euclid_dist(ratings_data[r], user_ratings)
    # Indices of the k smallest distances, nearest first
    sortidx = np.argsort(dist)[:k]
    return [(user_ids[i], dist[i]) for i in sortidx]
        

In [68]:
# 100 random sets of user ratings on 10 movies, between 1 and 5
rdata = np.random.randint(1,6,(100,10))
# user ids, 'u1' thru 'u100'
uids = ['u'+str(i) for i in range(1,101)]
# new user ratings on 10 movies, between 1 and 5
urate = np.random.randint(1,6,10)

res=kNN(rdata,uids,urate,5)
print(res)

[('u57', 3.872983346207417), ('u54', 4.0), ('u76', 4.123105625617661), ('u63', 4.242640687119285), ('u61', 4.242640687119285)]
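
The row-by-row loop can also be replaced by broadcasting, which computes all distances in one expression. The sketch below is one possible variant, not the reference solution; `knn_vectorized` and the `demo_*` names are hypothetical, and the seeded generator is used only to make the run repeatable.

```python
import numpy as np

def knn_vectorized(ratings_data, user_ids, user_ratings, k):
    # Broadcasting subtracts the new user's ratings from every row at once
    dist = np.sqrt(((ratings_data - user_ratings) ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(dist)[:k]
    return [(user_ids[i], dist[i]) for i in nearest]

rng = np.random.default_rng(0)                       # fixed seed for repeatability
demo_ratings = rng.integers(1, 6, size=(100, 10))    # 100 users, 10 movies, ratings 1..5
demo_ids = ['u' + str(i) for i in range(1, 101)]
demo_user = rng.integers(1, 6, size=10)
demo_res = knn_vectorized(demo_ratings, demo_ids, demo_user, 5)
print(demo_res)
```

The returned list has the same structure as the loop version: `(user id, distance)` pairs ordered from nearest to farthest.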


---

#### Problem 4:

Create a 2D array of shape 5x3 containing random decimal numbers between 5 and 10. Get the positions (column indices) of the two largest numbers in each row. Then, in the generated 2D array, replace all values greater than 8 with 10 and all values less than 6 with 5.  <br>
Hint: 
1. https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
2. https://numpy.org/doc/stable/reference/generated/numpy.where.html
3. https://numpy.org/doc/stable/reference/generated/numpy.flip.html

In [None]:
# Solution 


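# Option 1: random integer part in [5, 10) plus a random fraction in [0, 1)
a = np.random.randint(low=5, high=10, size=(5,3)) + np.random.random((5,3))

# Option 2 (overwrites Option 1): draw directly from a uniform distribution on [5, 10)
a = np.random.uniform(5,10, size=(5,3))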
print(a)

max_pos = np.flip(np.argsort(a, axis=1), axis=1)
max_pos = max_pos[:,:2]
print(max_pos)

new_arr = np.where(a < 6, 5, np.where(a > 8, 10, a))
print(new_arr)

[[7.59617386 5.39732077 6.40648629]
 [6.00785031 6.38619833 6.93333009]
 [7.86349967 6.18265303 9.32175942]
 [7.25202873 8.53711484 7.4309798 ]
 [6.55980017 5.02105781 7.1242187 ]]
[[0 2]
 [2 1]
 [2 0]
 [1 2]
 [2 0]]
[[ 7.59617386  5.          6.40648629]
 [ 6.00785031  6.38619833  6.93333009]
 [ 7.86349967  6.18265303 10.        ]
 [ 7.25202873 10.          7.4309798 ]
 [ 6.55980017  5.          7.1242187 ]]
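
For comparison, the same two steps can be sketched without `np.flip` and without nested `np.where` calls: sorting the negated array gives descending order directly, and `np.select` handles several conditions in one call. The names `demo_a`, `top2`, and `clipped` are illustrative; `demo_a` copies the values printed above.

```python
import numpy as np

demo_a = np.array([[7.59617386, 5.39732077, 6.40648629],
                   [6.00785031, 6.38619833, 6.93333009],
                   [7.86349967, 6.18265303, 9.32175942],
                   [7.25202873, 8.53711484, 7.4309798 ],
                   [6.55980017, 5.02105781, 7.1242187 ]])

# Sorting the negated values ascending is the same as sorting the values descending
top2 = np.argsort(-demo_a, axis=1)[:, :2]
print(top2)

# np.select picks the first matching choice per element and falls back to the default
clipped = np.select([demo_a < 6, demo_a > 8], [5, 10], default=demo_a)
print(clipped)
```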


---

#### Problem 5:

Generate one-hot encodings for a list of values (classes). One-hot encoding and its applications are explained in the following resources: 
1. https://en.wikipedia.org/wiki/One-hot
2. https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179

Write a function that takes a 1-D list as input and returns a 2-D NumPy array whose rows are the one-hot encodings of the classes in the list. E.g., Input: ['cat','camel','dog','cat'] <br>
Output: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

In [None]:
# Solution:

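def one_hot_encoding(l):
    arr = np.array(l)
    classes = np.unique(arr)
    # Assign each class a column in order of first appearance,
    # so the function works for any hashable labels, not just small integers
    col = {}
    for v in l:
        col.setdefault(v, len(col))
    encoding = np.zeros((arr.shape[0], classes.shape[0]))
    for i, v in enumerate(l):
        encoding[i, col[v]] = 1
    return encoding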

In [None]:
# Test

l = [1,2,0,1,2]
encoding = one_hot_encoding(l)
print(encoding)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
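
A loop-free variant can be sketched with `np.unique(..., return_inverse=True)` and `np.eye`. Note the assumption: `np.unique` returns classes in sorted order, so the columns here follow sorted class order rather than order of first appearance, and the result for the example input differs from the one in the problem statement by a permutation of columns. The name `one_hot_sorted` is hypothetical.

```python
import numpy as np

def one_hot_sorted(labels):
    # classes: sorted unique labels; inverse: index of each label within classes
    classes, inverse = np.unique(np.asarray(labels), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i
    return classes, np.eye(classes.shape[0], dtype=int)[inverse]

classes, enc = one_hot_sorted(['cat', 'camel', 'dog', 'cat'])
print(classes)   # column order of the encoding
print(enc)
```

Returning `classes` alongside the encoding records which column belongs to which label.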


---

#### Problem 6: 

Mean normalization: <br>

Mean normalization is a common pre-processing technique in data science and machine learning. Write a function that replaces all NaN values in a given array with zero, then performs mean normalization, i.e., subtracts each row's mean from the values in that row of the resulting array.

In [None]:
X = np.array([[5,6,np.nan,7],[1,np.nan,0,5],[-1,5,np.nan,2]])

In [None]:
# Solution
def mean_normalize(X):
    
    return Y

#### <font color="brown">Carried over to PS 8</font>
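
Until the PS 8 solution is available, the two steps in the problem statement can be sketched as follows. This is one possible approach under stated assumptions, not the official solution: the name `mean_normalize_sketch` is hypothetical, and it relies on `np.nan_to_num` with its `nan` keyword (available in NumPy 1.17 and later).

```python
import numpy as np

def mean_normalize_sketch(X):
    # Replace NaN with 0; nan_to_num returns a copy by default, so X is untouched.
    # (It also maps +/-inf to large finite numbers, which does not matter for this input.)
    Y = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0)
    # Subtract each row's mean; keepdims=True keeps the means as a column for broadcasting
    return Y - Y.mean(axis=1, keepdims=True)

X_demo = np.array([[5, 6, np.nan, 7],
                   [1, np.nan, 0, 5],
                   [-1, 5, np.nan, 2]])
print(mean_normalize_sketch(X_demo))
```

After normalization every row sums to zero, which is a quick sanity check on the result.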

---

#### Problem 7:
Write a function that takes a 1D array and generates a 2D matrix from it using strides, with a window length of w and a stride of s.

For example, for an input array [0,1,2,...,15] with window length 4 and stride 2, the output matrix should look like [[0,1,2,3], [2,3,4,5], [4,5,6,7], ...]. In addition to the 1D array, the function should accept the stride length and the window length as parameters, with a default value of 5 for both.

#### <font color="brown">Carried over to PS 8</font>
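
Pending the PS 8 solution, a minimal sketch of the windowing idea using `numpy.lib.stride_tricks.as_strided` is shown below. The function name `strided_windows` is hypothetical, and dropping any trailing elements that do not fill a complete window is an assumption, since the problem statement does not say how to handle them.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_windows(arr, w=5, s=5):
    arr = np.ascontiguousarray(arr)
    if arr.ndim != 1 or arr.shape[0] < w:
        raise ValueError("need a 1D array with at least w elements")
    # Number of complete windows; leftover elements at the end are dropped
    n_windows = (arr.shape[0] - w) // s + 1
    step = arr.strides[0]   # bytes between consecutive elements
    # Moving down one row skips s elements; moving right one column skips one element
    return as_strided(arr, shape=(n_windows, w),
                      strides=(s * step, step), writeable=False)

windows = strided_windows(np.arange(16), w=4, s=2)
print(windows)
```

The result is a read-only view into the input's memory rather than a copy, so `writeable=False` guards against accidental writes through overlapping windows.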

---