First line: The Pandas library, a powerful tool for data manipulation and analysis in Python, is being imported. This library facilitates operations on data structures like DataFrames for handling large datasets.

Second line: The sqrt function from the math library is imported. This function allows you to perform square root calculations on numbers.

Third line: The numpy library is imported with the alias np. NumPy is fundamental for numerical computing in Python, providing support for multi-dimensional arrays and matrices, along with mathematical functions.

Fourth line: The matplotlib.pyplot module is imported as plt. This module enables data visualization in Python, offering a range of plotting capabilities for creating graphs and charts.

Fifth line: The %matplotlib inline magic command is used in Jupyter Notebooks. It allows the graphs and plots generated by Matplotlib to be displayed directly within the notebook, making it easier to visualize the data.

This set of imports establishes a strong foundation for data analysis and visualization, setting up the environment to efficiently handle, process, and visualize data.

In [5]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

First line: This line imports movie data from a CSV file named "movies.csv" into a tabular data structure known as a DataFrame, using the Pandas library.

Second line: This line imports user ratings data from another CSV file named "ratings.csv" into another DataFrame using Pandas.

Third line: This line retrieves and displays the first five rows of the movie data to give a quick view of the structure and some examples of the data.

In [8]:
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


First line: This line extracts the year from the title column of the movies DataFrame. It uses a regular expression to find a pattern resembling a four-digit year in parentheses, like (1995), and adds it to a newly created 'year' column in the same DataFrame.

Second line: After extracting the year including parentheses, this line refines the extraction to only the four-digit year by removing the parentheses, updating the values in the 'year' column.

Third line: This line removes the extracted year (in parentheses) from the 'title' column of the movies DataFrame, cleaning up the movie titles by deleting the year section with the help of a regular expression.

Fourth line: Any leading or trailing spaces in the 'title' column are removed using a lambda function with the strip() method. This ensures that all movie titles are neatly formatted without extra spaces.

Fifth line: Displays the first five rows of the updated movies DataFrame to give the user a concise overview of the changes made to the title and year columns.

In [11]:
movies_df['year'] = movies_df.title.str.extract('(\(\d\d\d\d\))',expand=False)
movies_df['year'] = movies_df.year.str.extract('(\d\d\d\d)',expand=False)
movies_df['title'] = movies_df.title.str.replace('(\(\d\d\d\d\))', '')
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df.head()

  movies_df['year'] = movies_df.title.str.extract('(\(\d\d\d\d\))',expand=False)
  movies_df['year'] = movies_df.year.str.extract('(\d\d\d\d)',expand=False)
  movies_df['title'] = movies_df.title.str.replace('(\(\d\d\d\d\))', '')


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


First line: This line takes the 'genres' column of the movies_df DataFrame and splits each string of genres into a list of individual genres by using '|' as the delimiter. This transformation turns a single string of genres into a list, which makes it easier to handle and analyze each genre separately.

Second line: This line displays the first five rows of the updated DataFrame using the head() method. This allows you to quickly verify that the changes have been successfully applied, showing how each movie now has a list of genres instead of a single concatenated string.

By converting the genres to lists, you can perform more complex operations on each genre independently.

In [14]:
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II (1995),[Comedy],1995


Copying the DataFrame:

moviesWithGenres_df = movies_df.copy() creates a copy of movies_df named moviesWithGenres_df. This ensures that you have a separate DataFrame to manipulate without affecting the original movies_df.
Iterating and Creating Genre Columns:

The nested for loops iterate over each row in movies_df, and for each genre in the 'genres' list of that row, a new column is created if it doesn't already exist in moviesWithGenres_df.
moviesWithGenres_df.at[index, genre] = 1 sets the value to 1 in the appropriate genre column for that specific movie, indicating the presence of the genre.
Filling Missing Values with Zero:

moviesWithGenres_df = moviesWithGenres_df.fillna() replaces all NaN values with ``, indicating the absence of a genre.
Displaying the First Few Rows:

moviesWithGenres_df.head() will display the first five rows of the updated moviesWithGenres_df, showing the newly added genre columns with `` or 1 values.
Thus, this code effectively transforms the DataFrame by expanding it to include binary indicator columns for each genre, making it suitable for further analysis or machine learning tasks.

In [17]:
moviesWithGenres_df = movies_df.copy()
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, let's look at the ratings dataframe.


In [20]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


Now, let's take a look at how to implement Content-Based or Item-Item recommendation systems. This technique attempts to figure out what a users favourite aspects of an item is, and then recommends items that present those aspects. In our case, we're going to try to figure out the input's favorite genres from the movies and ratings given.

Let's begin by creating an input user to recommend movies to:

Notice: To add more movies, simply increase the amount of elements in the userInput. Feel free to add more in! Just be sure to write it in with capital letters and if a movie starts with a "The", like "The Matrix" then write it in like this: 'Matrix, The' .
Line 1:
userInput = [

In this line, a variable named userInput is defined, which stores a list of dictionaries. Each dictionary represents a movie and contains two main keys: title and rating.
Line 2:
{'title':'Breakfast Club, The', 'rating':5},

This line adds a dictionary to the list. The key title holds the name of the movie “Breakfast Club, The,” and the key rating indicates a score of 5 for this movie. This score may represent the quality or popularity of the film.
Line 3:
{'title':'Toy Story', 'rating':3.5},

Here, another dictionary is added to the list. The movie title is “Toy Story,” and its rating is 3.5. This score suggests a moderate evaluation of the film.
Line 4:
{'title':'Jumanji', 'rating':2},

This line represents another dictionary that introduces the movie “Jumanji.” The rating for this film is 2, which may indicate a lack of satisfaction or lower quality.
Line 5:
{'title':"Pulp Fiction", 'rating':5},

In this line, a dictionary for the movie “Pulp Fiction” is added. It has been given a rating of 5, indicating very high popularity and quality.
Line 6:
{'title':'Akira', 'rating':4.5}

This line adds the final dictionary to the list. The movie “Akira” is introduced with a rating of 4.5, indicating high quality and popularity for this animated film.
Line 7:
]

This line marks the end of the userInput list. The list is now complete and contains all the dictionaries related to the movies.
Line 8:
inputMovies = pd.DataFrame(userInput)

In this line, a DataFrame named inputMovies is created using the pandas library. This DataFrame is constructed from the userInput list and organizes the movie information (titles and ratings) in a tabular format.
Line 9:
inputMovies

This line simply returns the inputMovies variable, allowing you to view the table of movies and their ratings. The table is displayed in a structured and readable format, which can be used for further analysis or to present the information.
If you have any further questions or need additional clarification, feel free to ask!

In [22]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


Line 1:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

In this line, a new variable named inputId is created. It filters the DataFrame movies_df, which presumably contains a larger collection of movies.
The filtering condition uses the isin() method, which checks if the titles in movies_df['title'] are present in the list of titles from inputMovies['title'].tolist().
The tolist() method converts the titles from the inputMovies DataFrame into a standard Python list. This allows for comparison against the titles in movies_df.
As a result, inputId will contain only those rows from movies_df where the title matches any title in inputMovies. This is useful for retrieving detailed information about the movies that the user has inputted.
Line 2:
inputMovies = pd.merge(inputId, inputMovies)

This line performs a merge operation between the inputId DataFrame and the inputMovies DataFrame using the pd.merge() function from the pandas library.
By default, pd.merge() will join the two DataFrames on their common columns (in this case, the title column), combining their information into a single DataFrame.
The result is that inputMovies is updated to include additional columns from inputId, which may contain more detailed information about the movies (like genre, director, year, etc.), while still retaining the original ratings from the user input.
Line 3:
inputMovies

This line simply returns the updated inputMovies DataFrame, which now contains both the user-provided ratings and any additional information that was merged from movies_df. The output will display a table that includes both the titles, user ratings, and any other relevant movie details.

In [27]:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId, inputMovies)

inputMovies

Unnamed: 0,movieId,title,genres,year,rating


Line 1:
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]

In this line, a new variable named userMovies is created. It filters the DataFrame moviesWithGenres_df, which likely contains a list of movies along with their associated genres.
The filtering condition uses the isin() method again, which checks if the movieId values in moviesWithGenres_df are present in the list of movieId values from inputMovies['movieId'].tolist().
The tolist() method converts the movieId column from the inputMovies DataFrame into a standard Python list. This allows for checking against the movieId values in moviesWithGenres_df.
As a result, userMovies will contain only those rows from moviesWithGenres_df where the movieId matches any movieId in inputMovies. This is useful for retrieving detailed information about the movies that the user has rated, including their genres.
Line 2:
userMovies

This line simply returns the userMovies DataFrame, which now contains the filtered list of movies that the user has interacted with, along with their genres. The output will display a table that includes the relevant movie information, such as movieId, title, and genre(s), for the movies that match the user’s input.
Summary
In summary, this code snippet filters the moviesWithGenres_df DataFrame to create a new DataFrame called userMovies, which contains only the movies that the user has rated in the inputMovies DataFrame. This allows for further analysis or display of the movies along with their genres based on user input.

In [29]:
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)


inputMovies['rating']

In this line, you are accessing the rating column from the inputMovies DataFrame.
The expression inputMovies['rating'] retrieves all the values in the rating column, which contains the ratings that the user has provided for the movies in the inputMovies DataFrame.
The result of this operation will be a pandas Series object containing the ratings for each movie listed in inputMovies. This Series can be used for further analysis, such as calculating statistics (like mean or median ratings), visualizations, or other operations that may require the ratings data.

In [37]:
inputMovies['rating']

Series([], Name: rating, dtype: float64)

userMovies.transpose():

This part of the code transposes the userMovies DataFrame. Transposing means flipping the DataFrame such that the rows become columns and the columns become rows.
After transposition, each movie’s genres (or other attributes) will become individual rows, and the movieId, title, and any other attributes will become columns. This is useful for matrix operations that follow.
.dot(inputMovies['rating']):

The dot() method is used to perform a dot product between the transposed userMovies DataFrame and the inputMovies['rating'] Series.
The inputMovies['rating'] Series contains the user ratings for the movies. When you perform a dot product, you multiply the corresponding elements of the transposed userMovies DataFrame (which contains the genres or attributes of the movies) by the ratings provided in inputMovies.
The result of this dot product will be a new Series or DataFrame that aggregates the contributions of each movie’s attributes (like genres) weighted by the user’s ratings. This effectively creates a profile for the user based on their ratings and the characteristics of the movies they rated.
userProfile:

The result of the dot product is stored in the variable userProfile. This variable now contains a representation of the user’s preferences based on the movies they have rated and the attributes of those movies.
The userProfile can be used for various purposes, such as recommending new movies based on the user’s preferences, analyzing user behavior, or clustering users with similar tastes.
Summary:
In summary, this line of code constructs a user profile by calculating a weighted sum of the attributes of the movies the user has rated, using their ratings as weights. The resulting userProfile can provide insights into the user’s preferences and can be utilized in recommendation systems or for further analysis.

In [42]:
userProfile = userMovies.transpose().dot(inputMovies['rating'])
userProfile

movieId               0.0
title                 0.0
genres                0.0
year                  0.0
Adventure             0.0
Animation             0.0
Children              0.0
Comedy                0.0
Fantasy               0.0
Romance               0.0
Drama                 0.0
Action                0.0
Crime                 0.0
Thriller              0.0
Horror                0.0
Mystery               0.0
Sci-Fi                0.0
War                   0.0
Musical               0.0
Documentary           0.0
IMAX                  0.0
Western               0.0
Film-Noir             0.0
(no genres listed)    0.0
dtype: float64

genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])

In this line, a new variable named genreTable is being created. It stores a modified version of the moviesWithGenres_df DataFrame.
The method set_index() is called on moviesWithGenres_df. This method is used to set one or more columns of a DataFrame as its index. In this case, the column movieId is specified as the index.
By setting movieId as the index, the DataFrame will be organized such that each row is identified by its unique movieId. This can make it easier to look up movies based on their IDs and can improve the performance of certain operations.
The moviesWithGenres_df['movieId'] part indicates that the movieId column from the moviesWithGenres_df DataFrame is being used as the new index.
Line:
genreTable.head()

This line calls the head() method on the genreTable DataFrame. The head() method returns the first five rows of the DataFrame by default.
This is useful for quickly inspecting the structure and contents of the DataFrame, allowing you to see how the data looks after setting the index. You will be able to see the movieId as the index and the associated movie information, such as title and genres.
Summary
In summary, this code snippet creates a new DataFrame called genreTable, where the movieId column is set as the index. This makes it easier to reference and manipulate the DataFrame based on movie IDs. The head() method is then used to display the first five rows of this newly indexed DataFrame, providing a quick overview of its structure and contents.

In [47]:
#Now let's get the genres of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
genreTable.head()

Unnamed: 0_level_0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,Father of the Bride Part II (1995),[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [50]:
genreTable.shape

(9742, 24)

In this line, you are accessing a specific column from a DataFrame that contains information about movies. The column you are interested in is the one that holds the ratings given by the user for each movie.

By accessing this column, you retrieve all the ratings associated with the movies listed in the DataFrame. The result will be a collection of values that represents how each movie has been rated by the user.

Purpose
This operation allows you to perform further analysis on the ratings. For example, you could calculate statistics like the average rating, identify movies with high or low ratings, or visualize the ratings in a graph. Essentially, it provides you with the necessary data to understand user preferences regarding the movies.

In [60]:
if isinstance(userProfile, list):
    userProfile = pd.Series(userProfile)

# Convert genreTable to float, coercing errors
genreTable = genreTable.apply(pd.to_numeric, errors='coerce')

# Convert userProfile to float, coercing errors
userProfile = pd.to_numeric(userProfile, errors='coerce')

# Fill NaN values with 0
userProfile = userProfile.fillna(0)

# Reindex userProfile to match genreTable columns, filling missing values with 0
userProfile = userProfile.reindex(genreTable.columns, fill_value=0)

# Calculate the recommendation table
recommendationTable_df = ((genreTable * userProfile).sum(axis=1)) / (userProfile.sum())

# Display the first few rows of the recommendation table
recommendationTable_df.head()

movieId
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
dtype: float64

Now here's the recommendation table!


In [70]:
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II (1995),[Comedy],1995
5,6,Heat (1995),"[Action, Crime, Thriller]",1995
6,7,Sabrina (1995),"[Comedy, Romance]",1995
7,8,Tom and Huck (1995),"[Adventure, Children]",1995
8,9,Sudden Death (1995),[Action],1995
9,10,GoldenEye (1995),"[Action, Adventure, Thriller]",1995
