This assignment is a comprehensive exercise on recommender systems that applies linear algebra concepts such as vector angles, norms, and low-rank matrices. Below is guidance for each part of the assignment:

### Part (a): Movie Ratings
1. **Select Movies:**
   - As a group, agree on approximately 8 movies or TV series.
   - One group member rates all items, while another rates only half of them, keeping the other ratings secret.

### Part (b): Scaling of Ratings
2. **Reason for Scaling:**
   - Consider why ratings are scaled to the range \(-2\) to \(2\) rather than the usual \(1\) to \(5\).
   - Think about the possible angles between vectors representing movie preferences when using each scaling.
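
The effect of the scaling on angles can be checked with a small sketch. The two rating vectors below are hypothetical (they are not taken from the group's data) and `angle_degrees` is an illustrative helper. On the 1 to 5 scale every entry is positive, so the angle between two rating vectors cannot exceed 90 degrees; after centering, opposite tastes can point in opposite directions.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle between two vectors in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against round-off pushing cos slightly outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical ratings on the 1..5 scale: two people with opposite tastes
fan = np.array([5, 5, 1, 1])
critic = np.array([1, 1, 5, 5])

angle_1_to_5 = angle_degrees(fan, critic)
# Subtracting 3 maps the 1..5 scale onto -2..2
angle_centered = angle_degrees(fan - 3, critic - 3)

print(f"1..5 scale:  {angle_1_to_5:.1f} degrees")
print(f"-2..2 scale: {angle_centered:.1f} degrees")
```

With the centered scale, a rating of 0 also has a natural meaning (neutral), which is convenient when missing entries are later replaced by zeros.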

### Part (c): Average and Extreme Tastes
3. **Identify Taste Characteristics:**
   - Determine who has the "average" taste based on the ratings.
   - Identify the group member with the most extreme taste.
   - Consider using norms or other measures to quantify taste characteristics.
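
One possible way to quantify these ideas, sketched with the full ratings matrix that appears later in the notebook (rows are movies, columns are people). Treating "extreme" as the largest 2-norm and "average" as the smallest distance to the mean rating vector is an assumption made for illustration; other definitions are equally defensible.

```python
import numpy as np

names = ["Tim", "Alex", "Yuan", "Alan", "Peter"]
# Full ratings from the notebook: rows are movies, columns are people
ratings = np.array([
    [ 1,  1, 1, 2,  2],
    [ 1,  0, 2, 2,  0],
    [-2,  2, 0, 0, -1],
    [-1,  1, 0, 0, -2],
    [ 2,  2, 0, 2,  1],
    [ 2, -1, 0, 0,  2],
    [ 2,  1, 1, 1,  0],
    [ 0,  1, 2, 1,  1],
], dtype=float)

# "Extreme" taste: the person whose rating vector has the largest 2-norm
norms = np.linalg.norm(ratings, axis=0)
most_extreme = names[int(np.argmax(norms))]

# "Average" taste: the person closest to the mean rating vector
mean_vector = ratings.mean(axis=1, keepdims=True)
distances = np.linalg.norm(ratings - mean_vector, axis=0)
most_average = names[int(np.argmin(distances))]

print("most extreme:", most_extreme)
print("most average:", most_average)
```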

### Part (d): Predicting Missing Ratings
4. **Prediction Based on Norms or Angles:**
   - Explore methods to predict missing ratings using norms or angles between vectors.
   - Consider cosine similarity or other similarity metrics.
   - Evaluate how well the predictions match the true scores.
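
A minimal sketch of an angle-based predictor, assuming missing ratings are stored as NaN and that rows are movies and columns are people. The helper names `cosine_on_shared` and `predict_rating` and the small matrix are hypothetical. The prediction is a similarity-weighted average of the other people's ratings for the same movie, so a person with opposite taste (negative cosine) contributes with flipped sign.

```python
import numpy as np

def cosine_on_shared(u, v):
    """Cosine similarity over the items both people rated (NaN = unrated)."""
    shared = ~np.isnan(u) & ~np.isnan(v)
    if not shared.any():
        return 0.0
    a, b = u[shared], v[shared]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict_rating(ratings, movie, person):
    """Similarity-weighted average of other people's ratings for one movie."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[1]):
        if other == person or np.isnan(ratings[movie, other]):
            continue
        w = cosine_on_shared(ratings[:, person], ratings[:, other])
        num += w * ratings[movie, other]
        den += abs(w)
    # Fall back to a neutral rating when nobody comparable rated the movie
    return num / den if den > 0 else 0.0

# Hypothetical data: 4 movies x 3 people, person 0 has not rated movie 3
example = np.array([
    [ 2.0,    2.0, -2.0],
    [ 1.0,    1.0, -1.0],
    [-2.0,   -2.0,  2.0],
    [np.nan,  2.0, -2.0],
])
prediction = predict_rating(example, movie=3, person=0)
print(prediction)
```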

### Part (e): Low-Rank Matrices (TSVD)
5. **Predictions using Low-Rank Matrices:**
   - Apply Truncated Singular Value Decomposition (TSVD) to represent the ratings matrix as a low-rank approximation.
   - Replace missing entries with zeros and assess the predictions.
   - Compare the results with the norm or angle-based predictions from Part (d).
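
The zero-fill variant described above might look like the following sketch. The function name `tsvd_predict` and the small matrix are hypothetical; missing ratings are assumed to be stored as NaN.

```python
import numpy as np

def tsvd_predict(ratings, k):
    """Zero-fill missing entries, then return the best rank-k approximation."""
    filled = np.nan_to_num(ratings, nan=0.0)
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    # Keep the k largest singular values and their singular vectors
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Hypothetical data: 4 movies x 3 people with some ratings withheld
example = np.array([
    [ 2.0,    1.0,    np.nan],
    [ 2.0,    np.nan, -2.0],
    [np.nan,  1.0,   -2.0],
    [-2.0,   -1.0,    2.0],
])
approx = tsvd_predict(example, k=1)

# The entries of the approximation at the missing positions are the predictions
missing = np.isnan(example)
print(approx[missing])
```

Because zero is the neutral rating on the \(-2\) to \(2\) scale, zero-filling pulls the predictions toward neutral; that is worth keeping in mind when comparing against Part (d).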

### Additional Considerations:
- **Collaboration:** If working with another group, ensure effective collaboration and communication.
- **Data Representation:** Represent the movie ratings as a matrix. In the data used below, rows correspond to movies and columns correspond to group members.
- **Implementation:** Develop your own methods and code for the assignment, as specified in the instructions.

For the actual implementation, you might use NumPy for linear algebra operations in Python. It's important to interpret the results and reflect on the effectiveness of different approaches in predicting missing ratings.

This assignment provides an opportunity to apply concepts learned in class to a real-world problem, showcasing the practical use of linear algebra in recommender systems.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("./datasets/movie new.csv")

# The first data row holds the real column names; promote it to the header
df.columns = df.iloc[0]
df = df.iloc[1:]
# Drop the first column and renumber the rows from 0
df = df.iloc[:, 1:]
df = df.reset_index(drop=True)
# Convert every column to a numeric dtype
df = df.apply(pd.to_numeric)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Tim     8 non-null      int64
 1   Alex    8 non-null      int64
 2   Yuan    8 non-null      int64
 3   Alan    8 non-null      int64
 4   Peter   8 non-null      int64
dtypes: int64(5)
memory usage: 448.0 bytes


In [3]:
array_from_df = df.values
type(array_from_df)

numpy.ndarray

In [4]:
df_half = pd.read_csv("./datasets/movie person.csv")
# df_half.columns = df_half.iloc[0]
# df_half = df_half.iloc[:, :]
# df_half = df_half.reset_index().drop('index', axis=1)
df_half

Unnamed: 0,Tim,Alex,Yuan,Alan,Peter
0,,1.0,1.0,2.0,
1,1.0,0.0,,,0.0
2,,,,0.0,
3,,,0.0,,-2.0
4,2.0,2.0,0.0,,
5,,,,0.0,2.0
6,2.0,1.0,,1.0,0.0
7,0.0,,2.0,,


In [5]:
df_half.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Tim     4 non-null      float64
 1   Alex    4 non-null      float64
 2   Yuan    4 non-null      float64
 3   Alan    4 non-null      float64
 4   Peter   4 non-null      float64
dtypes: float64(5)
memory usage: 448.0 bytes


In [6]:
# Ensure every column is numeric (the columns are already float64 here)
df_half = df_half.apply(pd.to_numeric)
df_half.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Tim     4 non-null      float64
 1   Alex    4 non-null      float64
 2   Yuan    4 non-null      float64
 3   Alan    4 non-null      float64
 4   Peter   4 non-null      float64
dtypes: float64(5)
memory usage: 448.0 bytes


In [7]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


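# Calculate cosine similarity between rows (movies).
# Note: iloc[:, 2:] restricts the comparison to the last three columns (Yuan, Alan, Peter).
cosine_sim_matrix = cosine_similarity(df.iloc[:, 2:])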

# Create a DataFrame from the similarity matrix
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=df.index, columns=df.index)

print("Cosine Similarity Matrix:")
print(cosine_sim_df)


Cosine Similarity Matrix:
          0         1         2         3         4         5         6  \
0  1.000000  0.707107 -0.666667 -0.666667  0.894427  0.666667  0.707107   
1  0.707107  1.000000  0.000000  0.000000  0.632456  0.000000  1.000000   
2 -0.666667  0.000000  1.000000  1.000000 -0.447214 -1.000000  0.000000   
3 -0.666667  0.000000  1.000000  1.000000 -0.447214 -1.000000  0.000000   
4  0.894427  0.632456 -0.447214 -0.447214  1.000000  0.447214  0.632456   
5  0.666667  0.000000 -1.000000 -1.000000  0.447214  1.000000  0.000000   
6  0.707107  1.000000  0.000000  0.000000  0.632456  0.000000  1.000000   
7  0.816497  0.866025 -0.408248 -0.408248  0.547723  0.408248  0.866025   

          7  
0  0.816497  
1  0.866025  
2 -0.408248  
3 -0.408248  
4  0.547723  
5  0.408248  
6  0.866025  
7  1.000000  
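
Since the assignment asks for self-developed methods, the same quantity can be computed with NumPy alone. The sketch below is one possible implementation (the helper name `cosine_similarity_matrix` is hypothetical); it assumes no row is entirely zero. Passing the transposed matrix compares people instead of movies.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms  # assumes no all-zero rows
    return unit @ unit.T

# Full ratings from the notebook: rows are movies, columns are people
ratings = np.array([
    [ 1,  1, 1, 2,  2],
    [ 1,  0, 2, 2,  0],
    [-2,  2, 0, 0, -1],
    [-1,  1, 0, 0, -2],
    [ 2,  2, 0, 2,  1],
    [ 2, -1, 0, 0,  2],
    [ 2,  1, 1, 1,  0],
    [ 0,  1, 2, 1,  1],
], dtype=float)

# Same restriction as the cell above: movies compared on the last three columns
movie_sim = cosine_similarity_matrix(ratings[:, 2:])

# Similarity between people: transpose so that each person becomes a row
people_sim = cosine_similarity_matrix(ratings.T)

print(np.round(movie_sim, 6))
print(np.round(people_sim, 6))
```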


In [8]:
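# Function to fill NaN with the average of column and row means
def fill_nan_with_avg(dataframe):
    # Means are computed once, from the observed (non-NaN) entries only
    col_means = dataframe.mean(axis=0, numeric_only=True)
    row_means = dataframe.mean(axis=1, numeric_only=True)

    for i in range(dataframe.shape[0]):
        for j in range(dataframe.shape[1]):
            if pd.isna(dataframe.iloc[i, j]):
                # Positional access so that i and j match the iloc lookup above
                dataframe.iloc[i, j] = (col_means.iloc[j] + row_means.iloc[i]) / 2
    # Return the frame that was passed in (it is modified in place)
    return dataframe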
# Fill NaN with the average of column and row means
df_filled = fill_nan_with_avg(df_half)
# df_filled = df_half.fillna(0)
print(df_filled)

        Tim  Alex      Yuan      Alan     Peter
0  1.291667   1.0  1.000000  2.000000  0.666667
1  1.000000   0.0  0.541667  0.541667  0.000000
2  0.625000   0.5  0.375000  0.000000  0.000000
3  0.125000   0.0  0.000000 -0.125000 -2.000000
4  2.000000   2.0  0.000000  1.041667  0.666667
5  1.125000   1.0  0.875000  0.000000  2.000000
6  2.000000   1.0  0.875000  1.000000  0.000000
7  0.000000   1.0  2.000000  0.875000  0.500000
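
The same fill rule can also be written without explicit loops. The following is a sketch under the assumption that the ratings are held in a NumPy array with NaN for missing entries; `fill_nan_with_avg_vectorized` is a hypothetical name. Unlike the loop version, it returns a new array and leaves its input untouched.

```python
import numpy as np

def fill_nan_with_avg_vectorized(values):
    """Return a copy with each NaN replaced by (row mean + column mean) / 2."""
    col_means = np.nanmean(values, axis=0)           # shape (n_cols,)
    row_means = np.nanmean(values, axis=1)[:, None]  # shape (n_rows, 1)
    fallback = (row_means + col_means) / 2           # broadcasts to the full shape
    return np.where(np.isnan(values), fallback, values)

nan = np.nan
# Partially observed ratings as displayed for df_half above
observed = np.array([
    [nan, 1.0, 1.0, 2.0, nan],
    [1.0, 0.0, nan, nan, 0.0],
    [nan, nan, nan, 0.0, nan],
    [nan, nan, 0.0, nan, -2.0],
    [2.0, 2.0, 0.0, nan, nan],
    [nan, nan, nan, 0.0, 2.0],
    [2.0, 1.0, nan, 1.0, 0.0],
    [0.0, nan, 2.0, nan, nan],
])
filled = fill_nan_with_avg_vectorized(observed)
print(np.round(filled, 6))
```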


In [9]:
df

Unnamed: 0,Tim,Alex,Yuan,Alan,Peter
0,1,1,1,2,2
1,1,0,2,2,0
2,-2,2,0,0,-1
3,-1,1,0,0,-2
4,2,2,0,2,1
5,2,-1,0,0,2
6,2,1,1,1,0
7,0,1,2,1,1


In [10]:
# Compute the full SVD of the 8x5 filled matrix:
# U is 8x8, S holds the 5 singular values, VT is 5x5
U, S, VT = np.linalg.svd(df_filled)

print("U matrix:")
print(U)

print("\nS matrix (singular values):")
print(np.diag(S))

print("\nVT matrix:")
print(VT)

U matrix:
[[-0.48505609  0.11800579  0.23062228 -0.66825346  0.00096714  0.24209665
  -0.26384037 -0.35046031]
 [-0.18269191  0.16210602  0.04312251 -0.05779835 -0.61181982 -0.38658052
  -0.41457185  0.48939482]
 [-0.13529039  0.08011842 -0.03383279  0.34533579 -0.04034273 -0.56402751
  -0.20438982 -0.70238616]
 [ 0.11696334  0.69737348  0.12254148  0.38789476  0.08997808  0.43671671
  -0.36830215 -0.00685737]
 [-0.51014962  0.16652287 -0.51723215  0.02968421  0.56220397 -0.18559203
  -0.07575312  0.29555204]
 [-0.38614602 -0.55242903 -0.15816903  0.42772281 -0.21795224  0.4580823
  -0.2800371  -0.04456253]
 [-0.43472893  0.34599515 -0.06565007  0.1552723  -0.39772155  0.11382071
   0.69784752 -0.06891981]
 [-0.31782024 -0.12011322  0.79494419  0.26914693  0.30690888 -0.16080185
   0.09973109  0.22403165]]

S matrix (singular values):
[[5.50748055 0.         0.         0.         0.        ]
 [0.         2.61409383 0.         0.         0.        ]
 [0.         0.         2.04789908 0.

In [11]:
# Retain only the first singular value with the first column of U and the first row of VT

rank_1_approximation = S[0] * np.outer(U[:, 0], VT[0, :])

# Display the original matrix and the rank-1 approximation
print("Original Matrix A:")
print(df_filled)
print("\nRank-1 Approximation:")
print(rank_1_approximation)

Original Matrix A:
        Tim  Alex      Yuan      Alan     Peter
0  1.291667   1.0  1.000000  2.000000  0.666667
1  1.000000   0.0  0.541667  0.541667  0.000000
2  0.625000   0.5  0.375000  0.000000  0.000000
3  0.125000   0.0  0.000000 -0.125000 -2.000000
4  2.000000   2.0  0.000000  1.041667  0.666667
5  1.125000   1.0  0.875000  0.000000  2.000000
6  2.000000   1.0  0.875000  1.000000  0.000000
7  0.000000   1.0  2.000000  0.875000  0.500000

Rank-1 Approximation:
[[ 1.55379482  1.31532454  0.96460883  1.12917092  0.88697325]
 [ 0.58522251  0.49540488  0.36331103  0.42529183  0.33407031]
 [ 0.4333798   0.36686639  0.26904581  0.31494498  0.24739193]
 [-0.37467221 -0.317169   -0.23259964 -0.2722811  -0.21387909]
 [ 1.63417767  1.38337055  1.01451117  1.1875866   0.93285926]
 [ 1.2369532   1.04711052  0.7679109   0.89891636  0.70610636]
 [ 1.3925803   1.17885259  0.86452551  1.0120134   0.79494503]
 [ 1.01808316  0.86183178  0.63203454  0.73985953  0.58116588]]
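
A general rank-k helper makes it easy to check how much is lost by truncation. The sketch below uses a random stand-in matrix (not the ratings data) and a hypothetical helper `rank_k_approximation`. It relies on a standard property of the truncated SVD: the Frobenius-norm error of the rank-k approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Best rank-k approximation of A, plus all singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :], s

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))  # hypothetical stand-in for an 8x5 ratings matrix

errors = {}
for k in (1, 3):
    A_k, s = rank_k_approximation(A, k)
    frob_error = np.linalg.norm(A - A_k)   # Frobenius norm for 2-D input
    tail = np.sqrt(np.sum(s[k:] ** 2))     # discarded singular values
    errors[k] = (frob_error, tail)
    print(f"k={k}: error={frob_error:.6f}, tail={tail:.6f}")
```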


In [12]:
# Rank-3 approximation: sum of the first three rank-1 terms of the SVD
rank_3_approximation = (
        S[0] * np.outer(U[:, 0], VT[0, :]) +
        S[1] * np.outer(U[:, 1], VT[1, :]) +
        S[2] * np.outer(U[:, 2], VT[2, :])
)

# Display the original matrix and the rank-3 approximation
print("Original Matrix A:")
print(df_filled)
print("\nRank-3 Approximation:")
print(rank_3_approximation)

Original Matrix A:
        Tim  Alex      Yuan      Alan     Peter
0  1.291667   1.0  1.000000  2.000000  0.666667
1  1.000000   0.0  0.541667  0.541667  0.000000
2  0.625000   0.5  0.375000  0.000000  0.000000
3  0.125000   0.0  0.000000 -0.125000 -2.000000
4  2.000000   2.0  0.000000  1.041667  0.666667
5  1.125000   1.0  0.875000  0.000000  2.000000
6  2.000000   1.0  0.875000  1.000000  0.000000
7  0.000000   1.0  2.000000  0.875000  0.500000

Rank-3 Approximation:
[[ 1.42214080e+00  1.27717299e+00  1.31992370e+00  1.33524032e+00
   5.25426345e-01]
 [ 6.80386497e-01  5.11316816e-01  4.03841617e-01  5.54990656e-01
  -6.54264973e-02]
 [ 5.36028579e-01  3.88496758e-01  1.98895912e-01  3.48141620e-01
   6.95245852e-02]
 [ 9.82258509e-02 -2.32997123e-01 -1.61215723e-01  2.50387951e-01
  -1.91014237e+00]
 [ 2.29824997e+00  1.53989255e+00  1.37854441e-01  1.00612265e+00
   7.21832960e-01]
 [ 9.23961566e-01  9.95685051e-01  6.11449328e-01  4.50640173e-01
   2.07150335e+00]
 [ 1.75472967e+0

In [13]:
tsvd_rank1 = pd.DataFrame(data=rank_1_approximation, columns=df_filled.columns)
tsvd_rank1

Unnamed: 0,Tim,Alex,Yuan,Alan,Peter
0,1.553795,1.315325,0.964609,1.129171,0.886973
1,0.585223,0.495405,0.363311,0.425292,0.33407
2,0.43338,0.366866,0.269046,0.314945,0.247392
3,-0.374672,-0.317169,-0.2326,-0.272281,-0.213879
4,1.634178,1.383371,1.014511,1.187587,0.932859
5,1.236953,1.047111,0.767911,0.898916,0.706106
6,1.39258,1.178853,0.864526,1.012013,0.794945
7,1.018083,0.861832,0.632035,0.73986,0.581166


In [14]:
tsvd_rank3 = pd.DataFrame(data=rank_3_approximation, columns=df_filled.columns)
tsvd_rank3

Unnamed: 0,Tim,Alex,Yuan,Alan,Peter
0,1.422141,1.277173,1.319924,1.33524,0.525426
1,0.680386,0.511317,0.403842,0.554991,-0.065426
2,0.536029,0.388497,0.198896,0.348142,0.069525
3,0.098226,-0.232997,-0.161216,0.250388,-1.910142
4,2.29825,1.539893,0.137854,1.006123,0.721833
5,0.923962,0.995685,0.611449,0.45064,2.071503
6,1.75473,1.252179,0.693157,1.200466,-0.001748
7,0.113627,0.643621,1.954259,1.107174,0.583608


In [15]:
norm_completion = pd.read_csv('./datasets/movie person 1.csv')
norm_completion

Unnamed: 0,Tim,Alex,Yuan,Alan,Peter
0,1.291667,1.0,1.0,2.0,1.166667
1,1.0,0.0,0.541667,0.541667,0.0
2,0.625,0.5,0.375,0.0,0.5
3,1.125,1.0,0.0,0.875,-2.0
4,2.0,2.0,0.0,1.041667,1.166667
5,1.125,1.0,0.875,0.0,2.0
6,2.0,1.0,0.875,1.0,0.0
7,0.0,1.0,2.0,0.875,1.0


In [16]:
norm_completion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Tim     8 non-null      float64
 1   Alex    8 non-null      float64
 2   Yuan    8 non-null      float64
 3   Alan    8 non-null      float64
 4   Peter   8 non-null      float64
dtypes: float64(5)
memory usage: 448.0 bytes


In [17]:
norm_completion.values

array([[ 1.29166667,  1.        ,  1.        ,  2.        ,  1.16666667],
       [ 1.        ,  0.        ,  0.54166667,  0.54166667,  0.        ],
       [ 0.625     ,  0.5       ,  0.375     ,  0.        ,  0.5       ],
       [ 1.125     ,  1.        ,  0.        ,  0.875     , -2.        ],
       [ 2.        ,  2.        ,  0.        ,  1.04166667,  1.16666667],
       [ 1.125     ,  1.        ,  0.875     ,  0.        ,  2.        ],
       [ 2.        ,  1.        ,  0.875     ,  1.        ,  0.        ],
       [ 0.        ,  1.        ,  2.        ,  0.875     ,  1.        ]])

In [18]:
print(tsvd_rank1.values)

[[ 1.55379482  1.31532454  0.96460883  1.12917092  0.88697325]
 [ 0.58522251  0.49540488  0.36331103  0.42529183  0.33407031]
 [ 0.4333798   0.36686639  0.26904581  0.31494498  0.24739193]
 [-0.37467221 -0.317169   -0.23259964 -0.2722811  -0.21387909]
 [ 1.63417767  1.38337055  1.01451117  1.1875866   0.93285926]
 [ 1.2369532   1.04711052  0.7679109   0.89891636  0.70610636]
 [ 1.3925803   1.17885259  0.86452551  1.0120134   0.79494503]
 [ 1.01808316  0.86183178  0.63203454  0.73985953  0.58116588]]


In [19]:
print(tsvd_rank3.values)

[[ 1.42214080e+00  1.27717299e+00  1.31992370e+00  1.33524032e+00
   5.25426345e-01]
 [ 6.80386497e-01  5.11316816e-01  4.03841617e-01  5.54990656e-01
  -6.54264973e-02]
 [ 5.36028579e-01  3.88496758e-01  1.98895912e-01  3.48141620e-01
   6.95245852e-02]
 [ 9.82258509e-02 -2.32997123e-01 -1.61215723e-01  2.50387951e-01
  -1.91014237e+00]
 [ 2.29824997e+00  1.53989255e+00  1.37854441e-01  1.00612265e+00
   7.21832960e-01]
 [ 9.23961566e-01  9.95685051e-01  6.11449328e-01  4.50640173e-01
   2.07150335e+00]
 [ 1.75472967e+00  1.25217926e+00  6.93156769e-01  1.20046578e+00
  -1.74789322e-03]
 [ 1.13627462e-01  6.43620946e-01  1.95425908e+00  1.10717384e+00
   5.83608498e-01]]


In [20]:
np.array_equal(rank_1_approximation, tsvd_rank1)

True

In [21]:
print(array_from_df)

[[ 1  1  1  2  2]
 [ 1  0  2  2  0]
 [-2  2  0  0 -1]
 [-1  1  0  0 -2]
 [ 2  2  0  2  1]
 [ 2 -1  0  0  2]
 [ 2  1  1  1  0]
 [ 0  1  2  1  1]]


In [22]:


def calculate_mae(matrix1, matrix2):
    """
    Calculate the Mean Absolute Error (MAE) between two matrices.

    Parameters:
    - matrix1, matrix2: NumPy arrays (or DataFrames) of the same shape.

    Returns:
    - mae: mean of the absolute element-wise differences. For NumPy inputs
      this is a single scalar. If a DataFrame is involved, older pandas
      versions reduce per column for axis=None and return one value per column.
    - total_mae: sum over `mae`; identical to `mae` when it is a scalar.
    """
    # Calculate the absolute difference between corresponding elements
    absolute_difference = np.abs(matrix1 - matrix2)

    # Average the absolute differences
    mae = absolute_difference.mean(axis=None)

    # Sum the result (no effect when mae is already a scalar)
    total_mae = mae.sum()

    return mae, total_mae



In [23]:


def calculate_mse(matrix1, matrix2):
    """
    Calculate the Mean Squared Error (MSE) between two matrices.

    Parameters:
    - matrix1, matrix2: NumPy arrays (or DataFrames) of the same shape.

    Returns:
    - mse: mean of the squared element-wise differences. For NumPy inputs
      this is a single scalar. If a DataFrame is involved, older pandas
      versions reduce per column for axis=None and return one value per column.
    - total_mse: sum over `mse`; identical to `mse` when it is a scalar.
    """
    # Calculate the squared difference between corresponding elements
    squared_difference = (matrix1 - matrix2) ** 2

    # Average the squared differences
    mse = squared_difference.mean(axis=None)

    # Sum the result (no effect when mse is already a scalar)
    total_mse = mse.sum()

    return mse, total_mse


In [24]:
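# Calculate MAE and total MAE.
# norm_completion is a DataFrame, so on older pandas versions mean(axis=None)
# averages per column and the "total" is the sum of those column means.
mae_per_element_norm, total_mae_norm = calculate_mae(array_from_df, norm_completion)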

# Display the results
print("MAE for Each Element:")
print(mae_per_element_norm)
print("\nTotal MAE:", total_mae_norm)
# Calculate MSE and total MSE
mse_per_element_norm, total_mse_norm = calculate_mse(array_from_df, norm_completion)

# Display the results
print("MSE for Each Element:")
print(mse_per_element_norm)
print("\nTotal MSE:", total_mse_norm)

MAE for Each Element:
Tim      0.739583
Alex     0.437500
Yuan     0.354167
Alan     0.427083
Peter    0.312500
dtype: float64

Total MAE: 2.27083333325
MSE for Each Element:
Tim      1.532118
Alex     0.781250
Yuan     0.381076
Alan     0.478299
Peter    0.371528
dtype: float64

Total MSE: 3.5442708329791666




In [25]:
# Calculate MAE and total MAE
mae_per_element_1, total_mae_1 = calculate_mae(array_from_df, rank_1_approximation)

# Display the results
print("MAE for Each Element:")
print(mae_per_element_1)
print("\nTotal MAE:", total_mae_1)
# Calculate MSE and total MSE
mse_per_element_1, total_mse_1 = calculate_mse(array_from_df, rank_1_approximation)

# Display the results
print("MSE for Each Element:")
print(mse_per_element_1)
print("\nTotal MSE:", total_mse_1)

MAE for Each Element:
0.7763675676612106

Total MAE: 0.7763675676612106
MSE for Each Element:
0.9531633876862123

Total MSE: 0.9531633876862123


In [26]:
# Calculate MAE and total MAE
mae_per_element_3, total_mae_3 = calculate_mae(array_from_df, rank_3_approximation)

# Display the results
print("MAE for Each Element:")
print(mae_per_element_3)
print("\nTotal MAE:", total_mae_3)
# Calculate MSE and total MSE
mse_per_element_3, total_mse_3 = calculate_mse(array_from_df, rank_3_approximation)

# Display the results
print("MSE for Each Element:")
print(mse_per_element_3)
print("\nTotal MSE:", total_mse_3)

MAE for Each Element:
0.602806636010944

Total MAE: 0.602806636010944
MSE for Each Element:
0.7206486773049415

Total MSE: 0.7206486773049415
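
The errors above are averaged over all 40 entries, including the 20 ratings that were known when the matrix was completed. A stricter comparison, sketched below, scores only the entries that were hidden. The helper `heldout_mae` is hypothetical, and the zero-fill prediction is used purely as a baseline; any completed matrix of the same shape could be passed instead.

```python
import numpy as np

def heldout_mae(truth, predicted, observed):
    """MAE over entries that were hidden from the predictor (NaN in `observed`)."""
    hidden = np.isnan(observed)
    return float(np.abs(truth[hidden] - predicted[hidden]).mean())

nan = np.nan
# True ratings from the notebook: rows are movies, columns are people
truth = np.array([
    [ 1,  1, 1, 2,  2],
    [ 1,  0, 2, 2,  0],
    [-2,  2, 0, 0, -1],
    [-1,  1, 0, 0, -2],
    [ 2,  2, 0, 2,  1],
    [ 2, -1, 0, 0,  2],
    [ 2,  1, 1, 1,  0],
    [ 0,  1, 2, 1,  1],
], dtype=float)
# Partially observed ratings as displayed for df_half
observed = np.array([
    [nan, 1.0, 1.0, 2.0, nan],
    [1.0, 0.0, nan, nan, 0.0],
    [nan, nan, nan, 0.0, nan],
    [nan, nan, 0.0, nan, -2.0],
    [2.0, 2.0, 0.0, nan, nan],
    [nan, nan, nan, 0.0, 2.0],
    [2.0, 1.0, nan, 1.0, 0.0],
    [0.0, nan, 2.0, nan, nan],
])

# Baseline predictor: every missing rating is guessed as 0 (neutral)
zero_fill = np.nan_to_num(observed, nan=0.0)
baseline = heldout_mae(truth, zero_fill, observed)
print("held-out MAE of the zero-fill baseline:", baseline)
```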
