<a href="https://colab.research.google.com/github/SergeyHSE/ALSalgorithm.github.io/blob/main/ALS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we use the ALS (alternating least squares) algorithm to find similar movies and similar users,
implement the NDCG metric, and study how the dimensionality
of the latent representations affects the algorithm's performance.

Dataset: MovieLens 1M (ml-1m)

In [1]:
import zipfile
from collections import defaultdict, Counter
import datetime
from scipy import linalg
import scipy.sparse as sps
import numpy as np
import matplotlib.pyplot as plt

In [2]:
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip

--2023-12-01 12:41:37--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip’


2023-12-01 12:41:37 (17.5 MB/s) - ‘ml-1m.zip’ saved [5917549/5917549]



In [3]:
# List the archive contents and peek at the first line of each data file.

with zipfile.ZipFile("ml-1m.zip", "r") as z:
    print("files in archive")
    print(z.namelist())
    print("movies")
    with z.open("ml-1m/movies.dat") as m:
        print(str(m.readline()).split("::"))
    print("users")
    with z.open("ml-1m/users.dat") as m:
        print(str(m.readline()).split("::"))
    print("ratings")
    with z.open("ml-1m/ratings.dat") as m:
        print(str(m.readline()).split("::"))

files in archive
['ml-1m/', 'ml-1m/movies.dat', 'ml-1m/ratings.dat', 'ml-1m/README', 'ml-1m/users.dat']
movies
['b"1', 'Toy Story (1995)', 'Animation|Children\'s|Comedy\\n"']
users
["b'1", 'F', '1', '10', "48067\\n'"]
ratings
["b'1", '1193', '5', "978300760\\n'"]
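
The `b"1` and `\\n"` fragments in the output above appear because `str()` is applied to a `bytes` object, which returns its repr rather than its text. A minimal sketch of decoding before splitting is shown below. The helper name `split_record` is hypothetical, and the `iso-8859-1` default is an assumption borrowed from the parsing code later in this notebook.

```python
def split_record(raw_line: bytes, encoding: str = "iso-8859-1") -> list:
    """Decode one raw .dat line and split it on the '::' field separator."""
    # Decode bytes to text first, then drop the trailing newline and split.
    return raw_line.decode(encoding).strip().split("::")


# Sample line matching the first movies.dat record printed above.
sample = b"1::Toy Story (1995)::Animation|Children's|Comedy\n"
print(split_record(sample))
```

Decoding first keeps the byte-literal prefix and the escaped newline out of the first and last fields.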


In [None]:
# Use `z` rather than `zip` as the name, so the built-in zip() stays available
# for the parsing code below.
with zipfile.ZipFile("ml-1m.zip", "r") as z:
    # Decode the README, assuming it is UTF-8 text, and print it unchanged.
    readme_text = z.read("ml-1m/README").decode("utf-8")
    print(readme_text)

SUMMARY

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies 
made by 6,040 MovieLens users who joined MovieLens in 2000.

USAGE LICENSE

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set.  The data set may be used for any research
purposes under the following conditions:

     * The user may not state or imply any endorsement from the
       University of Minnesota or the GroupLens Research Group.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set
       (see below for citation information).

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permission
       from a facult

The archive contains three data files.
For each movie we have its MovieID, title, and genres.
For each user we have the UserID, gender (F or M), age, a coded occupation, and zip code.
Each rating record holds the UserID, MovieID, rating,
and the timestamp at which the rating was made.
Let's read the data.
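
Before parsing the full files, here is a minimal sketch of how a single ratings line maps to typed fields. The names `RatingRecord` and `parse_rating_line` are hypothetical, and interpreting the timestamp as UTC is an assumption made here for reproducibility; the parsing cell in this notebook converts it with the machine's local time zone instead.

```python
import datetime
from typing import NamedTuple


class RatingRecord(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    rated_at: datetime.datetime


def parse_rating_line(line: str) -> RatingRecord:
    """Parse a 'UserID::MovieID::Rating::Timestamp' line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    # Assumption: treat the Unix timestamp as UTC so the result does not
    # depend on the local time zone of the machine running the code.
    rated_at = datetime.datetime.fromtimestamp(
        int(timestamp), tz=datetime.timezone.utc
    )
    return RatingRecord(int(user_id), int(movie_id), int(rating), rated_at)


# First ratings.dat record, as printed in the archive preview above.
record = parse_rating_line("1::1193::5::978300760\n")
print(record)
```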

In [4]:
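# Parsed data, keyed by integer ID
movies = {}  # MovieID -> {"Title": str, "Genres": list of str}
users = {}  # UserID -> {"Gender", "Age", "Zip-code" as str, "Occupation" as int}
ratings = defaultdict(list)  # UserID -> list of (MovieID, Rating, datetime)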

with zipfile.ZipFile("ml-1m.zip", "r") as z:
    # parse movies
    with z.open("ml-1m/movies.dat") as m:
        for line in m:
            MovieID, Title, Genres = line.decode('iso-8859-1').strip().split("::")
            MovieID = int(MovieID)
            Genres = Genres.split("|")
            movies[MovieID] = {"Title": Title, "Genres": Genres}

    # parse users
    with z.open("ml-1m/users.dat") as m:
        fields = ["UserID", "Gender", "Age", "Occupation", "Zip-code"]
        for line in m:
            row = list(zip(fields, line.decode('iso-8859-1').strip().split("::")))
            data = dict(row[1:])
            data["Occupation"] = int(data["Occupation"])
            users[int(row[0][1])] = data

    # parse ratings
    with z.open("ml-1m/ratings.dat") as m:
        for line in m:
            UserID, MovieID, Rating, Timestamp = line.decode('iso-8859-1').strip().split("::")
            UserID = int(UserID)
            MovieID = int(MovieID)
            Rating = int(Rating)
            Timestamp = int(Timestamp)
            ratings[UserID].append((MovieID, Rating, datetime.datetime.fromtimestamp(Timestamp)))
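
One way to sanity-check the parsed structures is to compare their sizes with the README summary (1,000,209 ratings, 6,040 users, approximately 3,900 movies). The sketch below assumes the same shapes as `movies`, `users`, and `ratings` above; the helper name `summarize` and the toy values are hypothetical stand-ins, not real dataset contents.

```python
import datetime
from collections import defaultdict


def summarize(movies: dict, users: dict, ratings: dict) -> dict:
    """Count movies, users, users with ratings, and total rating records."""
    return {
        "movies": len(movies),
        "users": len(users),
        "users_with_ratings": len(ratings),
        "ratings": sum(len(user_ratings) for user_ratings in ratings.values()),
    }


# Toy stand-ins with the same shapes as the parsed structures.
toy_movies = {1: {"Title": "Toy Story (1995)", "Genres": ["Animation", "Comedy"]}}
toy_users = {1: {"Gender": "F", "Age": "1", "Occupation": 10, "Zip-code": "48067"}}
toy_ratings = defaultdict(list)
placeholder_time = datetime.datetime(2000, 12, 31)  # placeholder, not real data
toy_ratings[1].append((1, 5, placeholder_time))
toy_ratings[1].append((1193, 4, placeholder_time))

print(summarize(toy_movies, toy_users, toy_ratings))
```

On the real data, calling `summarize(movies, users, ratings)` after the parsing cell gives counts that can be checked against the figures quoted in the README.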