inspired by https://github.com/PacktPublishing/Python-Machine-Learning-By-Example-Third-Edition/blob/master/chapter2/movie_recommendation.py

loading required libraries

In [1]:
import numpy as np
from collections import defaultdict
import pandas as pd
import binascii

In [None]:
df = pd.read_csv('bangla movie user rating dataset.csv')
df.drop(4980)
s = set()
count = 1
for ind in df.index:
    serial = count
    i = str(df['User_name'][ind])
    if i in s:
        continue
    df.loc[ind,'U_ID'] = serial
    for ind2 in range(ind+1,8):
        j=str(df['User_name'][ind2])
        if i==j:
            df.at[ind2,'U_ID'] = serial
    s.add(i)
    count += 1

print(df.head(10))
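
The nested loop above compares each user name against the rows that follow it. As a minimal sketch, the same first-appearance numbering can be produced with `pandas.factorize`. The frame below is made-up toy data; only the `User_name` and `U_ID` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the notebook.
toy = pd.DataFrame({'User_name': ['alice', 'bob', 'alice', 'carol', 'bob']})

# factorize() returns integer codes in order of first appearance, starting at 0,
# so adding 1 gives 1-based serial numbers like the ones assigned above.
codes, uniques = pd.factorize(toy['User_name'])
toy['U_ID'] = codes + 1

print(toy)
```

This avoids the quadratic scan and does not depend on a hard-coded row count.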

function to load rating data from file and also return the number of ratings for each movie and movie_id index mapping

In [5]:
def load_rating_data(n_users, n_movies):
    """
    Load rating data from the CSV file and also return the number of ratings for each movie and movie_id index mapping
    @param n_users: number of users
    @param n_movies: number of movies that have ratings
    @return: rating data in the numpy array of [user, movie]; movie_n_rating, {movie_id: number of ratings};
             movie_id_mapping, {movie_id: column index in rating data}
    """
    data = np.zeros([n_users, n_movies], dtype=np.float32)
    movie_id_mapping = {}                # {movie_id: column index in data}
    movie_n_rating = defaultdict(int)    # {movie_id: number of ratings}
    df = pd.read_csv('bangla movie user rating dataset.csv')
    df.insert(2,'U_ID',0) # creating a column U_ID with 0 in all rows

    # now we will create user IDs in U_ID column
    count = 1
    s = set()
    for ind in df.index:
        serial = count
        i = str(df['User_name'][ind])
        if i in s:
            continue
        df.loc[ind,'U_ID'] = serial
        for ind2 in range(ind+1, len(df)):  # scan all later rows instead of a hard-coded count
            j=str(df['User_name'][ind2])
            if i==j:
                df.at[ind2,'U_ID'] = serial
        s.add(i)
        count += 1

    # saving data from pandas dataframe `df` to numpy array `data`, dictionary `movie_id_mapping`, defaultdict `movie_n_rating`
    for ind in df.index:
        user_id = df['U_ID'][ind]
        user_id = int(user_id) - 1
        movie_id = str(df['Movie_ID'][ind])
        movie_id = int(movie_id[2:])
        rating = int(df['Review Rating'][ind])
        if movie_id not in movie_id_mapping:
            movie_id_mapping[movie_id] = len(movie_id_mapping)
        data[user_id, movie_id_mapping[movie_id]] = rating
        if rating > 0:
            movie_n_rating[movie_id] += 1
    return data, movie_n_rating, movie_id_mapping
    

displaying rating info

In [3]:
def display_distribution(data):
    values, counts = np.unique(data, return_counts=True)
    for value, count in zip(values, counts):
        print(f'Number of rating {int(value)}: {count}')


loading data and displaying

In [6]:
n_users = 3487
n_movies = 805
data, movie_n_rating, movie_id_mapping = load_rating_data(n_users, n_movies)

display_distribution(data)

               User_name  U_ID
0            adnannizhum     1
1       SoumikBanerjee25     2
2           MandalBros-5     3
3     shovonbhattachrjee     4
4       anandolodh-96284     5
...                  ...   ...
4982       yash-mahendra  3483
4983     Pierre_Christen  3484
4984           miguelopp  3485
4985            dgerroll  3486
4986        a_la_bakwaas  3487

[4987 rows x 2 columns]
Number of rating 0: 2802071
Number of rating 1: 1881
Number of rating 2: 115
Number of rating 3: 146
Number of rating 4: 138
Number of rating 5: 199
Number of rating 6: 336
Number of rating 7: 470
Number of rating 8: 806
Number of rating 9: 873
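
The counts above show how sparse the rating matrix is. A quick check that uses only the numbers printed above (no access to the dataset is assumed):

```python
# Counts copied from the distribution printed above (rating value -> count).
counts = {0: 2802071, 1: 1881, 2: 115, 3: 146, 4: 138,
          5: 199, 6: 336, 7: 470, 8: 806, 9: 873}

n_cells = sum(counts.values())   # total cells, i.e. n_users * n_movies
n_known = n_cells - counts[0]    # cells that hold an actual rating

print(f'{n_known} known ratings out of {n_cells} cells '
      f'({n_known / n_cells:.2%} filled)')
```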


Since most ratings are unknown, we take the movie with the most known ratings as our target movie:

In [7]:
movie_id_most, n_rating_most = sorted(movie_n_rating.items(), key=lambda d: d[1], reverse=True)[0]
print(f'Movie ID {movie_id_most} has {n_rating_most} ratings.')

Movie ID 10834986 has 70 ratings.
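
Sorting the whole dictionary just to read its first item works, but `max` with a key function finds the same entry in a single pass. A small sketch on hypothetical counts (in the notebook the dictionary is `movie_n_rating`):

```python
# Hypothetical counts standing in for movie_n_rating.
toy_n_rating = {101: 12, 202: 70, 303: 45}

# max() over (movie_id, count) pairs, compared by the count.
movie_id_most, n_rating_most = max(toy_n_rating.items(), key=lambda kv: kv[1])
print(f'Movie ID {movie_id_most} has {n_rating_most} ratings.')
```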


The movie with ID 10834986 is the target movie, and ratings of the rest of the movies are signals. We construct the dataset accordingly:

In [8]:
X_raw = np.delete(data, movie_id_mapping[movie_id_most], axis=1)
Y_raw = data[:, movie_id_mapping[movie_id_most]]

# We discard samples without a rating for the target movie:
X = X_raw[Y_raw > 0]
Y = Y_raw[Y_raw > 0]

print('Shape of X:', X.shape)
print('Shape of Y:', Y.shape)


Shape of X: (70, 804)
Shape of Y: (70,)


Again, we take a look at the distribution of the target movie ratings:

In [9]:
display_distribution(Y)

Number of rating 1: 41
Number of rating 3: 3
Number of rating 4: 1
Number of rating 6: 2
Number of rating 7: 6
Number of rating 8: 8
Number of rating 9: 9


We can consider movies with ratings greater than 3 as being liked (being recommended):

In [10]:
recommend = 3
Y[Y <= recommend] = 0
Y[Y > recommend] = 1

n_pos = (Y == 1).sum()
n_neg = (Y == 0).sum()
print(f'{n_pos} positive samples and {n_neg} negative samples.')

26 positive samples and 44 negative samples.
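
The same thresholding can be written as a single comparison that leaves the original ratings untouched. The sketch below rebuilds a stand-in for the target ratings from the distribution printed earlier; the sample order is arbitrary and the variable names are illustrative.

```python
import numpy as np

# Rebuild the target ratings from the distribution printed earlier
# (rating value -> count); the order of samples is arbitrary here.
dist = {1: 41, 3: 3, 4: 1, 6: 2, 7: 6, 8: 8, 9: 9}
ratings = np.repeat(list(dist.keys()), list(dist.values())).astype(np.float32)

recommend = 3
labels = (ratings > recommend).astype(int)  # 1 = liked, 0 = not liked

n_pos, n_neg = int(labels.sum()), int((labels == 0).sum())
print(f'{n_pos} positive samples and {n_neg} negative samples.')
```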


As a rule of thumb in solving classification problems, we need to always analyze the label distribution and see how balanced (or imbalanced) the dataset is.

Next, to evaluate our classifier's performance, we randomly split the dataset into two sets, the training and testing sets, which simulate learning data and prediction data, respectively. Common choices for the proportion of the original dataset to include in the testing split are 20%, 25%, 33.3%, or 40%; we use 20% here. We use the train_test_split function from scikit-learn to do the random splitting. Note that it preserves the percentage of samples for each class only when the labels are passed through its stratify parameter, which the call in this cell does not do:

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print(len(Y_train), len(Y_test))

56 14


We check the training and testing sizes as follows:

In [12]:
print(len(Y_train), len(Y_test))

56 14


Note that train_test_split gives the training and testing sets the same class ratio only when it is called with stratify=Y. The call above is a plain random split, so the two class ratios may differ somewhat.
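
Class ratios are preserved across the split only when the labels are passed through the `stratify` argument of `train_test_split`. A minimal sketch with synthetic labels that have the same class counts as above (26 positive, 44 negative); `X_demo` and `Y_demo` are placeholders, not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as above: 26 positive, 44 negative.
Y_demo = np.array([1] * 26 + [0] * 44)
X_demo = np.arange(len(Y_demo)).reshape(-1, 1)  # placeholder features

# stratify=Y_demo asks for (approximately) equal class ratios in both splits.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42, stratify=Y_demo)

print('overall positive ratio:', Y_demo.mean())
print('train positive ratio:  ', Y_tr.mean())
print('test positive ratio:   ', Y_te.mean())
```

With only 70 samples the ratios cannot match exactly, but they stay within rounding of each other.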

Next, we train a Naïve Bayes model on the training set. The input features here are integer ratings from 0 to 9 (with 0 meaning no rating), as opposed to binary 0 or 1 features. Hence, we use the MultinomialNB module (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) from scikit-learn instead of the BernoulliNB module, as MultinomialNB can work with integer features. We import the module, initialize a model with a smoothing factor of 1.0 and prior learned from the training set, and train this model against the training set as follows:

In [13]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=1.0, fit_prior=True)
clf.fit(X_train, Y_train)

MultinomialNB()

Then, we use the trained model to make predictions on the testing set. We get the predicted probabilities as follows:

In [14]:
prediction_prob = clf.predict_proba(X_test)
print(prediction_prob[0:10])

[[6.25000000e-01 3.75000000e-01]
 [1.00000000e+00 1.44008814e-11]
 [7.00619741e-01 2.99380259e-01]
 [9.27397060e-01 7.26029399e-02]
 [6.25000000e-01 3.75000000e-01]
 [6.25000000e-01 3.75000000e-01]
 [5.43252988e-12 1.00000000e+00]
 [6.25000000e-01 3.75000000e-01]
 [6.25000000e-01 3.75000000e-01]
 [8.45379717e-01 1.54620283e-01]]
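
Several rows above are exactly [0.625, 0.375]. One plausible explanation, which is an assumption and has not been checked against the dataset, is that those test users rated no other movie. For an all-zero feature vector, multinomial Naive Bayes has no likelihood terms to add and returns the class prior. The toy data below is hypothetical and was chosen so that the prior is also 5/8 = 0.625.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical): 5 negative and 3 positive samples.
X_toy = np.array([[1, 0, 0], [2, 0, 0], [0, 1, 0], [1, 1, 0], [0, 2, 0],
                  [0, 0, 8], [0, 0, 9], [0, 1, 7]])
Y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1])

clf_toy = MultinomialNB(alpha=1.0, fit_prior=True).fit(X_toy, Y_toy)

# A sample with no ratings at all contributes no likelihood terms,
# so its posterior equals the learned class prior (5/8 and 3/8 here).
zero_row = np.zeros((1, 3))
zero_proba = clf_toy.predict_proba(zero_row)[0]
prior = np.exp(clf_toy.class_log_prior_)

print(zero_proba)
print(prior)
```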


We get the predicted class as follows:

In [15]:
prediction = clf.predict(X_test)
print(prediction[:10])

[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]


Finally, we evaluate the model's performance with classification accuracy, which is the proportion of correct predictions:

In [18]:
accuracy = clf.score(X_test, Y_test)
print(f'The accuracy is: {accuracy*100:.1f}%')

The accuracy is: 71.4%
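
For a classifier, `score` reports the fraction of predictions that match the true labels. On a small, imbalanced test set it helps to compare that number with the accuracy of always predicting the majority class. The arrays below are hypothetical, not the notebook's actual `Y_test` and `prediction`.

```python
import numpy as np

# Hypothetical labels and predictions for 14 test samples (not the real ones).
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

# Accuracy: fraction of positions where prediction and truth agree.
accuracy = (y_true == y_pred).mean()

# Baseline: accuracy obtained by always predicting the majority class.
pos_ratio = y_true.mean()
baseline = max(pos_ratio, 1 - pos_ratio)

print(f'accuracy: {accuracy:.3f}, majority-class baseline: {baseline:.3f}')
```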
