# Coursework 1: Movie ratings

This is the first coursework of ECS7023P Programming for AI and Data Science, which counts 35% towards the final grade of the module. The coursework is graded out of 100 marks.

**Deadline:** Monday 30th September, 2024 - 11.59pm

**Marking criteria:** While the most important marking criterion will be for the code to achieve the expected objective and output, marks will also be given for partial or close solutions, whereas marks can be deducted for code that is overly complex, inefficient, difficult to understand and/or to maintain.

**Use of packages:** For this exercise, in addition to the built-in python functions and elements that we have seen in the lectures (see lecture notes), you can only import the **csv** and **json** packages. No other packages are allowed. You cannot use other packages such as **pandas**, you won't get any marks for a question if you use it.

**How to submit:** You will submit a completed Jupyter notebook file with your solutions, as well as a PDF version of the Jupyter notebook which includes the outputs of your code. You need to submit the python code that produces the required answers. Answers produced through means other than python code will not be deemed acceptable.

**Note:** This is an individual coursework, the solutions you submit need to be your own and developed on your own.

## Dataset

For this exercise, you are given a dataset that contains information about a collection of movies, along with ratings assigned by 2,500 users to those movies.

The dataset contains two files:
* movies.json: a JSON file with information about movies, including their ID, title, language, release date, country(ies) of origin and genre(s).
* ratings.csv: a CSV file that contains an entry for each movie rating, where a user ID rates a movie ID with a rating on a likert scale from 1 to 5 at a particular time.

The **movieId** column in ratings.csv can be linked to the IDs within movie.json, so you know which specific movie a user is rating in each case.

**Note:** ratings.csv contains entries only for movies that each user has rated, i.e. for many movies, a user may not have entered any ratings so we don't have that information.

## Exercises

### Question 1

1. Who is the most active user with the largest number of ratings? Print the user ID and the number of ratings for this user. **(10 marks)**

In [8]:
import json
import csv
file_path = 'ratings.csv'

with open(file_path, mode='r') as file:
    csv_reader = csv.reader(file)

    count_dict = {}
    next(csv_reader)
    
    for row in csv_reader:
        if row[0] not in count_dict.keys():
            count_dict[row[0]] = 1
        else:
            count_dict[row[0]] += 1
        
        max_user = 0
        max_ratings = 0
    
    for user, ratings in count_dict.items():
        if ratings > max_ratings:
            max_ratings = ratings
            max_user = user
    print(f"User {max_user} has rated the largest number of movies, with a total of {max_ratings}.")

User 8659 has rated the largest number of movies, with a total of 3023.


### Question 2

2. What is the user who, having rated at least 25 movies, has the overall lowest rating average? Print the ID of this user and their rating average. **(10 marks)**

In [11]:
import csv

file_path = 'ratings.csv'

rating_dict = {}
for user_id, rating_count in count_dict.items():  # Reusing dictionary from Q1
    if rating_count >= 25:
        rating_dict[user_id] = [0, 0] 

with open(file_path, mode='r') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)
    for row in csv_reader:
        user_id = row[0]
        rating = float(row[2])
        if user_id in rating_dict:
            rating_dict[user_id][0] += rating 
            rating_dict[user_id][1] += 1 

for user_id in rating_dict.keys():
    if rating_dict[user_id][1] != count_dict[user_id]:
        print(f"Warning: User {user_id} has a mismatch in rating counts! "
              f"Expected {count_dict[user_id]}, but found {rating_dict[user_id][1]}")

min_avg = float('inf')
min_user = None

for user_id, (total_rating, num_ratings) in rating_dict.items():
    if num_ratings > 0:
        avg_rating = total_rating / num_ratings
        if avg_rating < min_avg:
            min_avg = avg_rating
            min_user = user_id

print(f"User {min_user} has the lowest rating average of {min_avg:.3f}.")

User 5228 has the lowest rating average of 0.834.


### Question 3

3. Given a year and a country as input, produce the statistics of genres for movies released in that year and country. To show the output of your code, print the results for 1995 as the input year and GB as the input country. **(10 marks)**

In [14]:
# Assuming that 'statistics' merely means the number of movies released per genre in the given year and country. The question is very ambiguous.
# Assuming that a movie made in both the US and GB counts as a movie made in GB.

data_list = [] # Will be a list of tuples: (genre_list, year, country_list), where genre_list and country_list are lists themselves.
file_path_2 = 'movies.json'
with open(file_path_2, 'r') as movies:
    for line in movies:
        entry = json.loads(line)
        year = entry.get('releasedate', '').split('-')[0]
        genre_list = entry.get('genres', [])
        country_list = entry.get('countries', [])
        data_list.append((genre_list, year, country_list))

def statistics(year, country): # Assuming both inputs are strings and that only a singular country is passed
    genre_dict = {} # Will have genre as key, number of movies of that genre as value
    for tuple in data_list:
        if str(year) == tuple[1] and str(country) in tuple[2]:
            for genre in tuple[0]: # For genre in genre_list
                if genre not in genre_dict:
                    genre_dict[genre] = 1
                else:
                    genre_dict[genre] += 1
    total = 0
    for count in genre_dict.values():
        total += count
    print(f"A total of {total} movies were released in {country} in {year}.") # Not 'true' total as movies with multiple genres have been recounted.
    for genre, count in genre_dict.items():
        print(f"{count} in the {genre} genre.")

statistics('1995', 'GB')

A total of 134 movies were released in GB in 1995.
4 in the Adventure genre.
5 in the Action genre.
7 in the Thriller genre.
37 in the Drama genre.
20 in the Romance genre.
4 in the History genre.
5 in the War genre.
19 in the Comedy genre.
4 in the Documentary genre.
4 in the Foreign genre.
5 in the Crime genre.
2 in the Fantasy genre.
2 in the Family genre.
2 in the Animation genre.
4 in the TV Movie genre.
3 in the Mystery genre.
4 in the Horror genre.
2 in the Science Fiction genre.
1 in the Western genre.


### Question 4

4. What is the title of the movie with the largest number of 3.5 ratings? How many 3.5 ratings does it have? **(15 marks)**

In [17]:
import json
import csv

file_path = 'ratings.csv'
with open(file_path, mode='r') as file:
    csv_reader = csv.reader(file)

    three_point_five_dict = {}
    next(csv_reader)

    for row in csv_reader:
        movie_id = row[1] 
        rating = float(row[2])
        if movie_id not in three_point_five_dict:
            three_point_five_dict[movie_id] = 0
        if rating == 3.5:
            three_point_five_dict[movie_id] += 1


movie_with_max_ratings = None
max_ratings = 0

for movie_id, count in three_point_five_dict.items():
    if count > max_ratings:
        max_ratings = count
        movie_with_max_ratings = movie_id


print(f"The movie with ID {movie_with_max_ratings} has the highest number of 3.5 ratings, with a total of {max_ratings}.")

file_path_2 = 'movies.json'
with open(file_path_2, 'r') as movies:
    for line in movies:
        entry = json.loads(line)
        movie_id = str(entry['id']) 
        if movie_id == movie_with_max_ratings:
            title = entry['title']
            print(f"And its name is {title}.")
            break 


The movie with ID 480 has the highest number of 3.5 ratings, with a total of 654.
And its name is Monsoon Wedding.


### Question 5

5. Write a python function which, given one or more countries as input parameter, produces a list of the top 5 movie titles with the highest average rating that match the country/ies. As an example to show your code, print the output for GB and US as the input countries. Note: the list of countries has to be part of the movie's countries, but not necessarily an exact match, e.g. a movie with GB, US, DE would be a match for GB, US as input parameter. **(15 marks)**

In [20]:
import csv
import json

file_path = 'ratings.csv'
rating_per_movie_id_dict = {}  # keys are movie IDs, values: [rating total, number of ratings, average rating]

with open(file_path, mode='r') as file:
    csv_reader = csv.reader(file)
    next(csv_reader) 
    for line in csv_reader:
        movie_id = int(line[0]) 
        rating = float(line[2]) 
        
        if movie_id not in rating_per_movie_id_dict:
            rating_per_movie_id_dict[movie_id] = [rating, 1, rating]
        else:
            rating_per_movie_id_dict[movie_id][0] += rating 
            rating_per_movie_id_dict[movie_id][1] += 1      
            rating_per_movie_id_dict[movie_id][2] = rating_per_movie_id_dict[movie_id][0] / rating_per_movie_id_dict[movie_id][1] 

file_path_2 = 'movies.json'
with open(file_path_2, 'r') as movies:
    for line in movies:
        entry = json.loads(line) 
        countries_of_movie_list = entry.get("countries", []) 
        title = entry.get("title", "Unknown Title") 
        movie_id = int(entry.get("id", -1))  
        
        if movie_id in rating_per_movie_id_dict:
            rating_per_movie_id_dict[movie_id].extend([countries_of_movie_list, title])

def top_five(country_list):  # Input country/ies with a list, e.g., ['GB', 'US']
    valid_movies = []
    for movie_id, movie_data in rating_per_movie_id_dict.items():
        if len(movie_data) >= 5:
            countries = movie_data[3]
            if all(country in countries for country in country_list):
                avg_rating = round(movie_data[2], 2)
                valid_movies.append((avg_rating, movie_data[4], movie_data[1]))
    
    valid_movies.sort(key=lambda x: x[0], reverse=True)
    
    result = ""
    for i in range(min(5, len(valid_movies))):
        avg_rating, title, total_ratings = valid_movies[i]
        result += f"{i + 1}. {title} - {total_ratings} - {avg_rating:.2f}\n"
    
    return result
print("Top Five Movies:")
print(top_five(['GB', 'US'])) # Output as: rank. title - number of ratings - avergae rating

Top Five Movies:
1. Batman - 1 - 5.00
2. Licence to Kill - 1 - 5.00
3. Kingdom of Heaven - 1 - 5.00
4. A Bridge Too Far - 1 - 5.00
5. Flicka - 6 - 4.92



### Question 6

6. Produce a list of all movie genres available in the dataset, with their overall average rating for each genre. Print also the name of the genre with the highest average rating. Note: ratings pertaining to movies with more than one genre contribute to the average of all the relevant genres. **(15 marks)**

In [23]:
import json

file_path_2 = 'movies.json'
with open(file_path_2, 'r') as movies:
    for line in movies:
        entry = json.loads(line)
        movie_id = int(entry.get("id", -1))
        genre_list = entry.get("genres", []) 
        
        # Reusing dictionary from last question, now adding a list of genres as the last element
        if movie_id in rating_per_movie_id_dict:
            rating_per_movie_id_dict[movie_id].append(genre_list)  # Append genre_list directly

genre_dict = {}  # Keys will be genre name, value [0] will be total rating, [1] will be number of ratings, and [2] will be the average rating

for entry in rating_per_movie_id_dict.values():
    if len(entry) >= 5:
        numb_ratings = entry[1]
        rating_total = entry[0]
        genres = entry[-1]
        for genre in genres:
            if genre not in genre_dict.keys():
                genre_dict[genre] = [rating_total, numb_ratings, rating_total/numb_ratings]
            else:
                genre_dict[genre][0] += rating_total
                genre_dict[genre][1] += numb_ratings
                genre_dict[genre][2] = genre_dict[genre][0] / genre_dict[genre][1]

genre_list = [] # This is the required list, a list of tuples: (genre, average rating).
for genre, data in genre_dict.items():
    genre_list.append((genre, data[-1]))
genre_list.sort(key=lambda x: x[1], reverse=True)

for genre, average_rating in genre_list:
    print(f"The {genre} genre has an average rating of {average_rating:.3f}.")

The TV Movie genre has an average rating of 3.660.
The War genre has an average rating of 3.618.
The Music genre has an average rating of 3.593.
The Animation genre has an average rating of 3.593.
The Fantasy genre has an average rating of 3.566.
The Horror genre has an average rating of 3.562.
The Foreign genre has an average rating of 3.554.
The Mystery genre has an average rating of 3.553.
The Adventure genre has an average rating of 3.548.
The Thriller genre has an average rating of 3.541.
The Comedy genre has an average rating of 3.536.
The Science Fiction genre has an average rating of 3.535.
The Drama genre has an average rating of 3.530.
The History genre has an average rating of 3.523.
The Family genre has an average rating of 3.522.
The Romance genre has an average rating of 3.515.
The Action genre has an average rating of 3.513.
The Western genre has an average rating of 3.513.
The Crime genre has an average rating of 3.512.
The Documentary genre has an average rating of 3.4

### Question 7

7. We want to implement a small recommender system which, given a movie as input, recommends the most similar movie. The idea behind it is to recommend the movie with the most similar rating pattern to the movie provided as input (e.g. if our input movie has been liked by some users and disliked by others, we will try to recommend one where similar users liked and disliked it). To do this, we will measure the pairwise cosine similarities between the input movie and each of the other movies in the dataset, to find the one that maximises the similarity.

The cosine similarity between two vectors (in python, lists) A and B is measured as:

$\mathbf{A} \cdot \mathbf{B} / \|\mathbf{A}\| \|\mathbf{B}\| = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \cdot \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$ 

That's the (element-wise) multiplication of both vectors, divided by the multiplication of their norms. As an example, if we have two vectors:
* A = [0, 1, 2]
* B = [1, 2, 3]

The cosine similarity between A and B is:

$sim(A, B) = (0 * 1 + 1 * 2 + 2 * 3) / (\sqrt{0^2 + 1^2 + 2^2} * \sqrt{1^2 + 2^2 + 3^2}) = 0.956182887$

See the toy example below where we have 3 movies and 5 users. Cells with a 0 indicate that the user hasn't rated that movie, whereas values 1-5 indicate that the user has rated the movie with that value:

|         | user 1 | user 2 | user 3 | user 4 | user 5 |
|---------|--------|--------|--------|--------|--------|
| movie 1 |      5 |      0 |      2 |      1 |      5 |
| movie 2 |      1 |      3 |      0 |      1 |      4 |
| movie 3 |      4 |      0 |      2 |      1 |      0 |

Let's say in this case our input movie is movie 1, so we want to find the movie that's most similar to movie 1. We would compute the pairwise cosine similarities between movie 1 (our input movie) and every other movie in the dataset:
* Cosine similarity between movie 1 and movie 2 is: 0.674699
* Cosine similarity between movie 1 and movie 3 is: 0.735612

Hence, we would recommend movie 3 as the one with the highest cosine similarity with movie 1.

To complete this question, write a python function which, given a movie as input, outputs the most similar movie as the recommended item based on highest cosine similarity with the input movie.

**(30 marks)**

NB that you can only use the *json* and *csv* packages. To calculate the square root of a value, you can use (1/2) as the exponent of a base number, e.g. 3**(1/2) calculates the square root of 3.

In [26]:
import csv

movies_dict = {}
user_ids = set()

file_path = 'ratings.csv'
with open(file_path, mode='r') as file:
    csv_reader = csv.reader(file)
    next(csv_reader)
    for row in csv_reader:
        user_id = int(row[0])
        movie_id = int(row[1])
        rating = float(row[2])
        
        if movie_id not in movies_dict:
            movies_dict[movie_id] = {}
        movies_dict[movie_id][user_id] = rating
        user_ids.add(user_id)

user_ids = sorted(user_ids)

for movie_id in movies_dict:
    for user_id in user_ids:
        if user_id not in movies_dict[movie_id]:
            movies_dict[movie_id][user_id] = 0

movie_norms = {}
for movie_id, user_ratings in movies_dict.items():
    norm = sum(rating ** 2 for rating in user_ratings.values()) ** 0.5
    movie_norms[movie_id] = norm

def similarity(input_movie_id):
    max_similarity = -1
    most_similar_movie = None
    
    input_movie_ratings = movies_dict.get(input_movie_id, {})
    input_norm = movie_norms.get(input_movie_id, 0)
    
    if not input_movie_ratings:
        print(f"Input movie {input_movie_id} has no ratings.")
        return
    
    for other_movie_id, other_movie_ratings in movies_dict.items():
        if other_movie_id != input_movie_id:
            dot_product = sum(input_movie_ratings[user] * other_movie_ratings[user] for user in user_ids)
            
            other_norm = movie_norms.get(other_movie_id, 0)
            
            if input_norm == 0 or other_norm == 0:
                similarity_score = 0
            else:
                similarity_score = dot_product / (input_norm * other_norm)
            
            if similarity_score > max_similarity:
                max_similarity = similarity_score
                most_similar_movie = other_movie_id
    
    if most_similar_movie is not None:
        print(f"The most similar movie to movie {input_movie_id} is movie {most_similar_movie} "
              f"with a similarity score of {max_similarity:.6f}")
    else:
        print("No similar movies found.")

input_movie_id = 480 
similarity(input_movie_id)


The most similar movie to movie 480 is movie 377 with a similarity score of 0.638680
