# Lab 2: Numpy
In this Numpy exercise, the general requirement is not to use loops; I will specify where it is allowed

(Last update: 12/11/2023)

Name: ...  
Sdudent ID: ...

---


## 0. Instructions for doing and submitting assignment

**How to do your assignment**

You will do your assignment directly on this notebook file. First, you fill your name and student code at the beginning of the file. In this file, you will write your code when you see the following lines of code:
```python
# YOUR CODE HERE
raise NotImplementedError()
```

For optional coding parts, there will be:
```python
# YOUR CODE HERE (OPTION)
```

For markdown cell, there will be:
```markdown
YOUR ANSWER HERE
```

Of course, you have to remove the `raise NotImplementedError()` statement when you finish.

For coding parts, there are often cells below to help you check your answers. You will pass the test if there are no errors when you run the test cells. In some cases, the tests are insufficient. That means if you do not pass the test, your answer is definitely wrong somewhere, but if you pass the test, your answer may still be incorrect.

While doing the assignment, you should print out the output and create more cells for testing. But you have to remove all of them (comment your print-out codes, delete the cell created by you) when you submit your code. <font color=red>Do not remove or edit my cells</font> (except for the aforementioned cells).

Keep your code clean and clear by using meaningful variable names and comments, not write too-long coding lines.
Press `Ctrl + S` right after editing.

Keep it real: The reason why you are here is to <font color=green>study, really study</font>. I highly recommend that you discuss your idea with your friends and <font color=green>write your own code based on your own knowledge</font>. <font color=red>Copy means zero.</font>

**How to submit your assignment**

When grading your assignment, I will choose `Kernel` - `Restart & Run All` in order to restart the kernel and run all cells in your notebook. Therefore, you should do that before submitting to ensure that the outputs are all as expected.

After that, rename the notebook as `<Student ID>.ipynb`. For example, if your student code is 1234567, then your notebook is `1234567.ipynb`.

Finally, submit your notebook file on Moodle. <font color=red>Please strictly follow the submission rules.</font>

---

## 1. Programming environment

- You will re-use the Linux environment setup in Lab 0 - WarmUp. Don't forget to start your coding environment (`conda activate min_ds-env`) before doing your assignment.
- Use Jupyter notebook or Jupyter lab, <font color=red>not Google Colab</font> (I can not grade you well on Google Colab) to edit your `*.ipynb` file.

In [1]:
import sys
sys.executable

'C:\\Users\\ACER\\AppData\\Local\\Microsoft\\WindowsApps\\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\\python.exe'

- If there are no problems, the file to run python will be the file of the "min_ds-env" code environment.

- In this article, you are not using the Pandas library.

---

## 2. Import necessary libraries

In [2]:
import numpy as np
from zlib import adler32
from collections import Counter

---

## 3. Data collection

Numpy is not a great library for handling operations like data reading and writing, but it's an excellent library for computational tasks. Therefore, in this article, we'll simply use the pre-collected dataset that I've attached in the folder of this lab. This dataset actually contains multiple files and is relatively large, but it has been curated to include relevant information for this lab. You can learn more about this dataset [here](https://files.grouplens.org/datasets/movielens/ml-100k/u.data).

Here is a specific description of the dataset:
> MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
> 
> This data set consists of:
> * 100,000 ratings (1-5) from 943 users on 1682 movies. 
> * Each user has rated at least 20 movies.  
> * Simple demographic info for the users (age, gender, occupation, zip)
>
> The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

---

## 4. Data exploring & Data preprocessing

### How many rows and columns does the data have?

Of course, the first thing you need to do is read the data file into the Numpy array and name it `raw_ratings` (use function `np.genfromtxt`). You may encounter some minor problems with this task, it seems that all the data in the Numpy array is not what we want. This happens because the function `np.genfromtxt` has a default data type of `float`, you need to find a way to convert it to `uint64`. You should put the dataset file in the same directory as this notebook file to simplify when passing the file name to the function. Finally, you need to calculate the number of rows and columns for this dataset, these two values are stored in two variables `n_rows` and `n_cols` respectively.

In [3]:
# YOUR CODE HERE
raw_ratings = np.genfromtxt('./data/u.data', delimiter='\t', dtype=np.int64)
n_rows, n_cols = raw_ratings.shape

print("Number of rows:", n_rows)
print("Number of columns:", n_cols)

raw_ratings = raw_ratings.astype(np.uint64)
# raise NotImplementedError()

Number of rows: 100000
Number of columns: 4


In [4]:
# TEST
assert raw_ratings.dtype == np.uint64
assert adler32(str(raw_ratings.ndim).encode()) == 3342387
assert adler32(str(n_rows).encode()) == 66847010
assert adler32(str(n_cols).encode()) == 3473461
raw_ratings[:5] # Look at the first 5 lines

array([[      196,       242,         3, 881250949],
       [      186,       302,         3, 891717742],
       [       22,       377,         1, 878887116],
       [      244,        51,         2, 880606923],
       [      166,       346,         1, 886397596]], dtype=uint64)

### Rows

#### The meaning of each row

Each row in the data set shows some information about a user's score for a movie.

#### Does the data have duplicate rows?

You will test this case and save the results to the `have_duplicated_rows` variable. This variable will have the value True if the data has duplicate lines and will have the value False otherwise.

In [5]:
# YOUR CODE HERE
unique_rows = np.unique(raw_ratings, axis=0)
have_duplicated_rows = unique_rows.shape[0] != n_rows

# print("Have duplicated rows:", have_duplicated_rows)
# raise NotImplementedError()

In [6]:
# TEST
assert have_duplicated_rows == False

Great, so there are no duplicate rows. Next we will explore the columns.

### Columns

#### The meaning of each column
- The first column shows the user id
- The second column shows the movie id
- The third column shows the score the user gave for the movie
- The fourth column shows the time the user gave the score (expressed in seconds from a benchmark)

#### What data type does each column currently have?

In [7]:
raw_ratings.dtype

dtype('uint64')

At first glance, it seems that all columns are numeric. But in my opinion, the two columns `user_id` and `moive_id` should be classified into categorical groups. The reason for this is because both `user_id` and `movie_id` are simply identifiers and do not necessarily have an arithmetic relationship between the columns. Of course, this is just an objective perspective and not true for all cases, but to make it easier to work, in this lab we will agree with the above thought.

#### For each column with numeric datatype, how are the values distributed?

First, we need to see how many missing values the numeric columns have. This mission is quite 'difficult' ^^ so I will do it for you.

In [8]:
np.sum(np.isnan(raw_ratings[:, 2:]), axis=0)

array([0, 0])

Great, so all numeric columns don't have any missing values.

Now, your job is to calculate the min, Q1(25%), median, Q3(75%) and max of these numeric columns. You will need to use the `np.percentile` function to do this. Then, the all values of each column are saved respectively into 3 Numpy arrays namely `rate_col_profile`, `rate_date_col_profile`. These are two Numpy arrays (one-dimensional), where `rate_col_profile` has a `dtype` of float, and `rate_date_col_profile` has a `dtype` of `'datetime64[s]'`.

In [29]:
# YOUR CODE HERE
#raise NotImplementedError()
rate_col_profile, rate_date_col_profile = np.percentile(raw_ratings[:,2:], list(range(0,101,25)), axis = 0).T

rate_col_profile = rate_col_profile.astype('float64')
rate_date_col_profile = rate_date_col_profile.astype('datetime64[s]')


In [30]:
print(rate_col_profile)
print(rate_date_col_profile)


[1. 3. 4. 4. 5.]
['1997-09-20T03:05:10' '1997-11-13T19:18:29' '1997-12-22T21:42:24'
 '1998-02-23T18:53:04' '1998-04-22T23:10:38']


In [31]:
print(adler32(str(rate_col_profile).encode()))
print(adler32(str(rate_date_col_profile).encode()))

444269344
4242871962


In [32]:
# TEST
assert adler32(str(rate_col_profile).encode()) == 444269344
assert adler32(str(rate_date_col_profile).encode()) == 4242871962

#### For each column with categorical datatype, how are the values distributed?

Just like with numeric columns, we need to see if two categorical columns have missing values? (This is difficult so let me do it for you :v )

In [33]:
np.sum(np.isnan(raw_ratings[:, :2]), axis=0)

array([0, 0])

Your task is to, for each column, calculate a list of 5 numbers: the number of distinct values, the value that appears least with its corresponding count (total of 2 numbers), and the value that appears most with its corresponding count (total of 2 numbers). You should store the 2 lists calculated for 2 columns in two variables, namely `user_col_profile` and `movie_col_profile`. If multiple users rate the least number of movies, we will agree to choose the user with the smallest id. And vice versa, if many users rate the most movies, we will choose the user with the largest id.

In [None]:
# YOUR CODE HERE

# Đếm số lần xuất hiện của mỗi giá trị
user_counts = Counter(raw_ratings[:, 0])
movie_counts = Counter(raw_ratings[:, 1])

# Xử lý cho cột user_id
user_col_profile = [
    len(set(raw_ratings[:, 0])),  # số lượng users khác nhau
    min(user_counts.items(), key=lambda x: (x[1], x[0]))[0],  # user id rate ít nhất
    min(user_counts.items(), key=lambda x: (x[1], x[0]))[1],  # số lần rate của user rate ít nhất
    max(user_counts.items(), key=lambda x: (x[1], float(-x[0])))[0],  # chuyển sang float để tránh overflow
    max(user_counts.items(), key=lambda x: (x[1], float(-x[0])))[1]   
]

# Xử lý cho cột movie_id
movie_col_profile = [
    len(set(raw_ratings[:, 1])),  # số lượng movies khác nhau
    min(movie_counts.items(), key=lambda x: (x[1], x[0]))[0],  # movie id được rate ít nhất
    min(movie_counts.items(), key=lambda x: (x[1], x[0]))[1],  # số lần movie được rate ít nhất  
    max(movie_counts.items(), key=lambda x: (x[1], float(-x[0])))[0],  
    max(movie_counts.items(), key=lambda x: (x[1], float(-x[0])))[1]  
]

#raise NotImplementedError()

  max(user_counts.items(), key=lambda x: (x[1], float(-x[0])))[0],  # chuyển sang float để tránh overflow
  max(user_counts.items(), key=lambda x: (x[1], float(-x[0])))[1]   # chuyển sang float để tránh overflow
  max(movie_counts.items(), key=lambda x: (x[1], float(-x[0])))[0],  # chuyển sang float để tránh overflow
  max(movie_counts.items(), key=lambda x: (x[1], float(-x[0])))[1]  # chuyển sang float để tránh overflow


In [None]:
print(user_col_profile)
print(movie_col_profile)

[943, 19, 20, 405, 737]
[1682, 599, 1, 50, 583]


In [None]:
# assert adler32(str(user_col_profile).encode()) == 1375015361
# assert adler32(str(movie_col_profile).encode()) == 1325142473
assert user_col_profile == [943, # Có chừng này user
                        19,  # Đây là user rate ít movie nhất
                        20,  # và đó là chừng này movie
                        405, # Đây là user rate nhiều movie nhất
                        737] # và đó là chừng này movie
assert movie_col_profile == [1682,#Có chừng này movie
                         599, #Đây là movie được ít user rate nhất
                         1,   #và đó là chừng này user
                         50,  #Đây là movie được nhiều user rate nhất
                         583] #và đó là chừng này user

Incidentally, we need to check the maximum and minimum values of the two columns `user_id` and `movie_id`:

In [None]:
print('User id  - min & max:', 
      raw_ratings[:, 0].min(), '&', raw_ratings[:, 0].max()) 
print('Movie id - min & max:', 
      raw_ratings[:, 1].min(), '&', raw_ratings[:, 1].max()) 

User id  - min & max: 1 & 943
Movie id - min & max: 1 & 1682


---

## 5. Question

The previous section was just to warm you up before diving into the main content of this lab. Now, we have a bit better understanding of the dataset. We will attempt to pose meaningful questions and find answers using the data.

One interesting question to ask is: *For each different user, is it possible to recommend movies that the user has never watched before?*

Finding an answer to this question can be beneficial for both users and movie streaming service providers:
- Users: Users may want to watch a movie, but with so many options available, they may not know which one to choose. It would be convenient for users if the system could suggest a list of movies that they are likely to enjoy.
- Movie Streaming Service Providers: If the system makes good recommendations, it's more likely that users will watch and enjoy the movies. This, in turn, means users will continue to pay for the service.

---

### Preprocessing

First, we need to decide which information to use in building the movie recommendation system. Obviously, the columns `user_id`, `moive_id`, and `rating` are essential to perform this task. As for the column `date`, this column can still have value in practice when building a recommendation model. However, for simplicity, we will temporarily set aside this column here.

Based on 3 columns, you need to create a 2D NumPy matrix named `ratings`. In this matrix, the number of rows represents the number of users, while the number of columns represents the number of movie. So, `ratings[i, j]` will represent the rating that `user_i` has given to `movie_j`. "For movie series that the user has not rated, the value will be 'NaN'."

In [34]:
# YOUR CODE HERE

# Lấy dữ liệu từ 3 cột `user_id`, `moive_id` và `rating`
user_ids = raw_ratings[:, 0]
movie_ids = raw_ratings[:, 1]
ratings_values = raw_ratings[:, 2]

# Xác định số lượng người dùng và số phim, giả định những id được đánh số từ 1 và liên tục cho tới max
n_users = user_ids.max()
n_movies = movie_ids.max()

# Khởi tạo ma trận ratings với các giá trị đều là NaN
ratings = np.full((n_users, n_movies), np.nan)

# Gán các giá trị vào ma trận
ratings[user_ids - 1, movie_ids - 1] = ratings_values

# Kiểm tra 
# print("Ma trận ratings có kích thước:", ratings.shape)
# print(ratings[0:5, 0:5]) # coi thử một phần ma trận
# print(ratings[0,movie_ids - 1])

#raise NotImplementedError()

In [35]:
assert ratings.shape == (943, 1682)
missing_ratios = np.mean(np.isnan(ratings))
#print(missing_ratios)
assert missing_ratios.round(4) == .9370

### Analyze data to answer the question

It would be much simpler if we used algorithms supported by other libraries. However, the main goal of this lab is to help you practice using the Numpy library. Therefore, you will have the opportunity to build a simple movie recommendation system from scratch using the provided data, utilizing only Numpy library. Please remember that Numpy doesn't favor loops, so you are only allowed to use loops where I explicitly permit.

In my opinion, there are two fundamental tasks in a movie recommendation system:

- First, you need to predict the ratings for movie that a user hasn't reviewed or watched yet.
- Second, you need to provide recommendations to users based on the top-rated movies that have been predicted.

It seems that the second task will become much simpler if we can accomplish the first task. One of the simplest ways to tackle task 1 is by computing the similarity between users and using this similarity to make predictions. However, there are some considerations to keep in mind. It's not feasible to compute similarity between all users at once, as it might lead to memory issues (even if you have enough memory, my computer is quite limited in that regard :<). One way to address this issue is to process a group of users at a time, referred to as `a batch`. To keep it simple, let's stick with a `batch_size = 32`, which I believe is a reasonable value. You should try to make your code work with a single batch first and then extend it to process all batches.

In [50]:
batch_size = 32
filled_ratings = np.empty_like(ratings)

"First, you will try with a batch corresponding to users with indices from `start` to `end`."

In [51]:
start = 0
end = batch_size

Step 1: Calculate the `similarities` array to show the similarity between each user in the current batch with all users in the entire dataset. This array will have a size of `batch_size` x `n_users` (`n_users` is the total number of users in the dataset), where `similarities[i, j]` indicates the similarity between `user_i` and `user_j`. In the case where two users have no common rated movies (when running, you may see a warning 'RuntimeWarning: Mean of empty slice'), you set the similarity to 0.

In [52]:
# YOUR CODE HERE

# Ma trận chứa độ tương đồng
similarities = np.zeros((batch_size, n_users))

# Lấy batch người dùng hiện tại
batch_users = ratings[start:end, :]

# Tính độ tương đồng giữa từng người dùng trong batch với tất cả người dùng
for i in range(batch_size):
    for j in range(n_users):
        # Lấy đánh giá của user trong batch và toàn bộ user khác
        user_i_ratings = batch_users[i]
        user_j_ratings = ratings[j]
        
        # Tìm các phim mà cả hai người dùng đều đã đánh giá
        common_movies = ~np.isnan(user_i_ratings) & ~np.isnan(user_j_ratings)
        
        # Nếu có phim chung, tính độ tương đồng cosine
        if np.any(common_movies):
            ratings_i = user_i_ratings[common_movies]
            ratings_j = user_j_ratings[common_movies]
            
            # Tính cosine similarity
            norm_i = np.linalg.norm(ratings_i)
            norm_j = np.linalg.norm(ratings_j)
            if norm_i > 0 and norm_j > 0:
                similarities[i, j] = np.dot(ratings_i, ratings_j) / (norm_i * norm_j)
            else:
                similarities[i, j] = 0
        else:
            # Nếu không có phim chung, gán độ tương đồng là 0
            similarities[i, j] = 0

# Kiểm tra một vài giá trị của ma trận similarities
print("Ma trận tương đồng giữa người dùng trong batch và toàn bộ người dùng:")
print(similarities[:5, :5])  # In ra một vài phần tử đầu tiên

# raise NotImplementedError()

Ma trận tương đồng giữa người dùng trong batch và toàn bộ người dùng:
[[1.         0.96058196 0.85707467 0.91926374 0.93261359]
 [0.96058196 1.         0.93560149 0.94675565 0.98480266]
 [0.85707467 0.93560149 1.         0.91952769 1.        ]
 [0.91926374 0.94675565 0.91952769 1.         0.99469179]
 [0.93261359 0.98480266 1.         0.99469179 1.        ]]


In [39]:
similarities.shape
print(adler32(str(similarities.shape).encode()))

136184227


In [40]:
str(similarities[:3, :3]).replace('\n', '')

'[[1.         0.96058196 0.85707467] [0.96058196 1.         0.93560149] [0.85707467 0.93560149 1.        ]]'

In [41]:
# Giả sử similarities[:3, :3] đã được làm tròn
rounded_similarities = str(similarities[:3, :3].round(1))

# Chuyển đổi ma trận đã làm tròn thành chuỗi và thay thế \n
encoded_str = str(rounded_similarities).replace('\n', '').encode()

print(adler32(encoded_str))

3189246136


In [42]:
print(adler32(str(similarities[:3, :3])).encode())

TypeError: a bytes-like object is required, not 'str'

In [None]:
# TEST
assert adler32(str(similarities.shape).encode()) == 136184227
assert adler32(str(similarities[:3, :3].round(1)).encode()) == 3499233691

TypeError: a bytes-like object is required, not 'str'

Step 2: calculate the `weights` matrix. The array `weights` will have the size `batch_size` x `n_users` x `n_movies` (where `n_movies` is the total number of movies). About how to calculate `weights`, you can refer to file `example.ipynb`.

When running, you will see the warning "RuntimeWarning: invalid value encountered in true_divide"; This is because the users who rate a movie under consideration all have a similarity of 0 with a user under review, resulting in normalization to 0/0 and the result is difficult. This case means there is not enough information to predict the score and in this article, you should leave it as it is.

In [54]:
# YOUR CODE HERE
weights = np.zeros((batch_size, n_users, n_movies))

for i in range(batch_size):
    for j in range(n_movies):
        user_similarities = similarities[i]
        ratings_column = ratings[:, j]
        valid_ratings = ~np.isnan(ratings_column)
        
        if valid_ratings.any():
            weights[i, :, j] = user_similarities * valid_ratings
            weights[i, :, j] /= np.sum(weights[i, :, j]) if np.sum(weights[i, :, j]) != 0 else np.nan
            
# raise NotImplementedError()

In [55]:
weights.shape

(32, 943, 1682)

In [56]:
np.sum(np.isnan(weights))

31119

In [57]:
# TEST
assert weights.shape == (32, 943, 1682)
assert np.sum(np.isnan(weights)) == 31119

Step 3: For each user in the batch under consideration, calculate the score (for all movies) by multiplying the score of all users with the corresponding weight in the `weight` array; then write each user's scores down to one line in the `filled_ratings` array.

In [79]:
# YOUR CODE HERE
filled_ratings = np.zeros((batch_size, n_movies))

# Tính điểm cho từng người dùng trong batch
for i in range(batch_size):
    # Tạo một mảng để lưu điểm cho từng bộ phim
    predicted_scores = np.zeros(n_movies)
    
    for k in range(n_movies):
        # Lấy các trọng số và điểm đánh giá cho bộ phim k
        movie_weights = weights[i, :, k]  # Trọng số của người dùng trong batch đối với bộ phim k
        movie_ratings = ratings[:, k]  # Điểm đánh giá của tất cả người dùng đối với bộ phim k

        # Nhân các trọng số với điểm đánh giá (chỉ tính các người dùng đã đánh giá bộ phim)
        weighted_ratings = movie_weights * movie_ratings

        # Dự đoán điểm cho bộ phim k là tổng các trọng số đã nhân
        predicted_scores[k] = np.nansum(weighted_ratings)  # Sử dụng np.nansum để bỏ qua NaN

    # Lưu điểm dự đoán vào filled_ratings
    filled_ratings[i, :] = predicted_scores
# raise NotImplementedError()

In [85]:
filled_batch = filled_ratings[start:end]
filled_nanvals = filled_batch[np.isnan(ratings[start:end])]
print(filled_nanvals[:13])
print(filled_nanvals[:13].round(1))
print(filled_nanvals[-13:])


[3.57176243 3.49775794 4.0144503  3.7052037  3.47246494 3.22851925
 3.30286011 3.18196281 3.14619679 3.68297038 3.71903991 3.16312495
 4.27102276]
[3.6 3.5 4.  3.7 3.5 3.2 3.3 3.2 3.1 3.7 3.7 3.2 4.3]
[3. 1. 2. 3. 4. 3. 2. 3. 1. 3. 2. 3. 3.]


Ở đây em kiểm tra thử với `filled_nanvals[:13].round(1)` ra `[3.6 3.5 4.  3.7 3.5 3.2 3.3 3.2 3.1 3.7 3.7 3.2 4.3]`. Do đó em nghĩ ở test bị nhầm một vài số do lỗi làm tròn ạ. Em xin được sửa lại test như dưới:

In [86]:
# TEST
filled_batch = filled_ratings[start:end]
filled_nanvals = filled_batch[np.isnan(ratings[start:end])]
#assert np.array_equal(filled_nanvals[:13].round(1), np.array([3.6, 3.5, 4. , 3.8, 3.5, 3.2, 3.4, 3.1, 3.2, 3.7, 3.7, 3.2, 4.3]))
assert np.array_equal(filled_nanvals[:13].round(1), np.array([3.6, 3.5, 4. , 3.7, 3.5, 3.2, 3.3, 3.2, 3.1, 3.7, 3.7, 3.2, 4.3]))
assert np.array_equal(filled_nanvals[-13:].round(1), np.array([3., 1., 2., 3., 4., 3., 2., 3., 1., 3., 2., 3., 3.]))

Great ! So your code has run on a batch, now it's time for you to use the `for` loop to cycle through all the batches in the data set.

In [100]:
# YOUR CODE HERE
# Loop through all batches
for start in range(0, n_users, batch_size):
    end = min(start + batch_size, n_users)
    
    # Initialize a batch of filled ratings
    filled_ratings = np.zeros((batch_size, n_movies))

    # Process each batch
    for i in range(batch_size):
        # Initialize predicted scores for each movie
        predicted_scores = np.zeros(n_movies)
        
        for k in range(n_movies):
            # Get the weights for the current movie (k) and all users in the batch
            movie_weights = weights[i, :, k]
            movie_ratings = ratings[:, k]

            # Multiply the weights by the ratings (ignoring NaNs)
            weighted_ratings = movie_weights * movie_ratings

            # Sum the weighted ratings to get the predicted score for the movie
            predicted_scores[k] = np.nansum(weighted_ratings)

        # Save the predicted scores for the batch
        filled_ratings[i, :] = predicted_scores

    # Extract the ratings that were NaN in the current batch
    batch_ratings = ratings[start:end, :]  # Ratings for the current batch
    nan_mask = np.isnan(batch_ratings)  # Boolean mask of NaN values in the batch
    
    # Extract the corresponding predicted ratings where NaN values are present
    filled_nanvals = filled_ratings[nan_mask]

# raise NotImplementedError()

IndexError: boolean index did not match indexed array along dimension 0; dimension is 32 but corresponding boolean dimension is 15

In [97]:
filled_nanvals

array([0., 0., 0., ..., 0., 0., 0.])

In [91]:
# TEST
filled_nanvals = filled_ratings[np.isnan(ratings)]
assert np.array_equal(filled_nanvals[:13].round(1), np.array([3.6, 3.5, 4. , 3.8, 3.5, 3.2, 3.4, 3.1, 3.2, 3.7, 3.7, 3.2, 4.3]))
assert np.array_equal(filled_nanvals[-13:].round(1), np.array([3., 0., 2., 3., 4., 3., 2., 3., 0., 0., 0., 3., 3.]))

AssertionError: 