# Unsupervised Learning: Predict Project

### Honor Code
We, NM_3 {Onneile Molotlhanyi, Anele Bovu, Kamogelo Morole, Sibusiso Mofokeng, Felicia Vilakazi and Amukelani Mabunda}, confirm by submitting this document that the solutions in this notebook are a result of our own work and that we abide by the EDSA honour code.

### Predict Overview: Movie Recommendation

In today's tech-driven world, recommender systems are essential for helping people choose content they'll enjoy. This is especially true for movies, where there are countless options. Instead of leaving users to guess, smart algorithms suggest titles based on their preferences. We are tasked with creating a recommendation algorithm that predicts how a user will rate a movie they haven't seen, based on their past ratings. In short, the goal is a system that can predict which movies someone will love before they watch them.

## Table of Contents

1. Introduction
2. Problem Statement
3. Objective
4. Importing Packages
5. Loading Data
6. Exploratory Data Analysis (EDA)
7. Feature Engineering
8. Modelling
9. Conclusion
10. References

## 1. Introduction

In today's technology-driven world, recommender systems are critical to ensuring users can make appropriate decisions about the content they engage with daily.

Recommender systems help users find relevant items when they are choosing something online. Netflix and Amazon, for example, suggest movies and titles that might interest individual users. In education, these systems may be used to suggest learning material that could improve educational outcomes. These types of algorithms lead to service improvement and customer satisfaction. They do this by addressing the long-tail problem shown below.

<img src="https://www.golegal.co.za/wp-content/uploads/2022/04/Picture2-1.png" width="500" align="center">
    
Customers do not have the time to browse through every available product and businesses cannot simply stop supplying less popular products. A recommender system addresses the long-tail problem by recommending less popular content that the customer is likely to rate highly.

Current recommendation systems - content-based filtering and collaborative filtering - use different information sources to make recommendations [[1]](#ref1).

#### Content-based filtering

This method makes recommendations based on a user's preferences for product features. It is able to recommend new items, but the quality of its recommendations is limited by how much data it has about the user's preferences.
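As a minimal sketch of this idea, assume each movie is described only by its pipe-separated `genres` string (as in `movies.csv`). The snippet builds a binary genre matrix and ranks the other movies by cosine similarity to one the user liked. The helper name `most_similar` and the three-row sample are illustrative and not part of the project code.

```python
import numpy as np
import pandas as pd

# Tiny illustrative sample with the same columns as movies.csv
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# One binary column per genre (rows = movies)
genre_matrix = movies_sample["genres"].str.get_dummies(sep="|").to_numpy(dtype=float)

def most_similar(index, features):
    """Return the other row indices, most cosine-similar first, plus all similarities."""
    norms = np.linalg.norm(features, axis=1)
    sims = features @ features[index] / (norms * norms[index])
    order = np.argsort(-sims)
    return [int(i) for i in order if i != index], sims

# Rank the other movies by similarity to "Toy Story (1995)" (row 0)
ranked, sims = most_similar(0, genre_matrix)
print(movies_sample["title"].iloc[ranked].tolist())
```

Here "Jumanji (1995)" ranks above "Grumpier Old Men (1995)" because it shares three genres with "Toy Story (1995)" rather than one. Richer features, such as cast or plot keywords, could be added as extra columns in the same way.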

#### Collaborative filtering

Collaborative filtering mimics user-to-user recommendations. In other words, if you and your friend have similar tastes, you are likely to make recommendations the other would approve of. This method finds similar users and predicts a user's preferences as a linear, weighted combination of other users' preferences. Its limitation is that it requires a large dataset of active users who have already rated products in order to make accurate predictions. As a result, collaborative systems usually suffer from the "cold start" problem, which makes predictions for new users challenging. This is usually overcome by using content-based filtering to initiate a user profile.
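The weighted combination described above can be sketched as follows. This is one common formulation (mean-centred ratings with cosine weights), assumed here for illustration; it is not necessarily the method used later in this notebook. The toy rating matrix and the function name `predict_rating` are made up.

```python
import numpy as np

# Rows = users, columns = movies; np.nan marks "not rated" (toy data)
ratings = np.array([
    [5.0, 1.0, np.nan],   # target user: movie 2 is unseen
    [5.0, 1.0, 5.0],      # user with the same taste
    [1.0, 5.0, 1.0],      # user with the opposite taste
])

def predict_rating(user, movie, ratings):
    """Predict a rating as the user's mean plus a similarity-weighted
    combination of how far other users rated `movie` from their own means."""
    means = np.nanmean(ratings, axis=1)      # each user's average rating
    centred = ratings - means[:, None]       # deviations from that average
    numerator, denominator = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, movie]):
            continue
        # Compare the two users only on movies both have rated
        common = ~np.isnan(centred[user]) & ~np.isnan(centred[other])
        a, b = centred[user, common], centred[other, common]
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        if norm == 0:
            continue                         # no usable overlap
        weight = float(a @ b / norm)
        numerator += weight * centred[other, movie]
        denominator += abs(weight)
    if denominator == 0:
        return float(means[user])            # fall back to the user's mean
    return float(means[user] + numerator / denominator)

prediction = predict_rating(user=0, movie=2, ratings=ratings)
print(round(prediction, 2))
```

With this toy matrix the prediction is about 4.33, above the target user's mean of 3. The similar user rated the movie above their own average, and the dissimilar user rated it below theirs, so both count as evidence in favour.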

#### Hybrid systems

A combination of these two recommendation systems is called a hybrid system. Hybrid systems mix the features of the item itself with the preferences of other users [[2]](#ref2).
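One simple way to combine the two signals is a weighted blend that falls back to the content-based score for cold-start users. The sketch below assumes both models already output a predicted rating. The function name `hybrid_score`, the blend weight `alpha` and the threshold `min_ratings` are hypothetical choices for illustration.

```python
def hybrid_score(content_score, collab_score, n_user_ratings,
                 min_ratings=5, alpha=0.5):
    """Blend a content-based and a collaborative prediction.

    Users with fewer than `min_ratings` ratings (cold start), or with no
    collaborative prediction at all, get the content-based score only.
    """
    if collab_score is None or n_user_ratings < min_ratings:
        return content_score
    return alpha * content_score + (1 - alpha) * collab_score

print(hybrid_score(4.0, 3.0, n_user_ratings=20))  # established user: blended
print(hybrid_score(4.0, 3.0, n_user_ratings=2))   # new user: content only
```

In practice, `alpha` would be tuned on held-out ratings rather than fixed at 0.5.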

## 2. Problem Statement

The challenge lies in constructing a recommendation algorithm that not only assists users in discovering great movies but also accurately predicts how users will rate unseen movies based on their historical preferences. This entails investigating content-based and collaborative filtering techniques to gather insights from user behavior and preferences.

The essential part of the problem centers around the task of accurately predicting how a user will rate a movie they have not yet viewed, leveraging insights from their historical preferences. This entails developing an algorithm capable of understanding user behaviors, preferences, and patterns, and utilizing this knowledge to make personalized recommendations. The overarching goal is to enhance the user experience by providing tailored movie recommendations that resonate with individual tastes and preferences.

<img src="https://research.aimultiple.com/wp-content/webp-express/webp-images/uploads/2017/08/recommendation-system-800x450.png.webp"
     alt="Recommendation system illustration"
     style="padding-bottom: 0.5em"
     width="500" align="center"/>

## 3. Objective

The primary objective of this project is to develop a recommendation algorithm that excels in predicting user ratings for unseen movies based on historical preferences. This involves leveraging two fundamental approaches: content-based filtering and collaborative filtering. Content-based filtering focuses on analyzing the attributes and characteristics of movies, such as genre, cast, and plot, to recommend similar items to those a user has liked in the past. On the other hand, collaborative filtering draws insights from user interactions and preferences to make recommendations. By integrating these approaches, the aim is to create a recommendation system that not only accurately predicts user ratings but also offers diverse and personalized movie suggestions. Ultimately, the goal is to enhance user satisfaction and engagement by facilitating seamless and enjoyable movie discovery experiences.

## 4. Importing Packages

In this section, we import the libraries used throughout our analysis and modelling.

In [1]:
# Exploratory Data Analysis
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data Preprocessing
import random
from time import time
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import NullFormatter
from sklearn.preprocessing import StandardScaler
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# cufflinks is optional. The original run of this cell failed with
# "ModuleNotFoundError: No module named 'cufflinks'" because the package
# was not installed, so the import is guarded here.
try:
    import cufflinks as cf
except ModuleNotFoundError:
    cf = None


## 5. Loading Data

In this section, we load each of the provided datasets and preview its first few rows.

In [3]:
#Train
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\train.csv"


train = pd.read_csv(file_path)


In [4]:
df_train = train
df_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [5]:
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\tags.csv"

tags = pd.read_csv(file_path)

In [6]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [12]:
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\test.csv"

test = pd.read_csv(file_path)

In [13]:
df_test = test
df_test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [10]:
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\genome_scores.csv"

genome_scores = pd.read_csv(file_path)


In [11]:
genome_scores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [14]:
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\genome_tags.csv"

genome_tags = pd.read_csv(file_path)

In [15]:
genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [16]:
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\imdb_data.csv"

imdb_data =  pd.read_csv(file_path)

In [17]:
imdb = imdb_data
imdb.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [18]:
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\links.csv"

links = pd.read_csv(file_path)

In [19]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [20]:
file_path = "C:\\Users\\Onneile\\Downloads\\ea-movie-recommendation-predict-2023-2024\\movies.csv"

movies =  pd.read_csv(file_path)

In [21]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
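The loading cells above repeat the same absolute Windows path, which only works on one machine. A possible alternative, sketched below, keeps the folder in a single variable and loads every file in one pass. The names `DATA_DIR`, `DATASET_NAMES` and `load_datasets` are assumptions for illustration, as is the idea that all CSV files sit in one folder.

```python
from pathlib import Path

import pandas as pd

# Assumption: all CSV files sit in one folder; point DATA_DIR at your copy
DATA_DIR = Path("ea-movie-recommendation-predict-2023-2024")
DATASET_NAMES = ["train", "test", "tags", "genome_scores",
                 "genome_tags", "imdb_data", "links", "movies"]

def load_datasets(data_dir, names=DATASET_NAMES):
    """Read <name>.csv for every name and return a dict of DataFrames."""
    data_dir = Path(data_dir)
    # Fail early with one clear message instead of on the first missing file
    missing = [n for n in names if not (data_dir / f"{n}.csv").is_file()]
    if missing:
        raise FileNotFoundError(f"Missing files in {data_dir}: {missing}")
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in names}
```

Calling `data = load_datasets(DATA_DIR)` would then give `data["train"]`, `data["movies"]` and so on, and `pathlib` handles the path separators on any operating system.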


## 6. Exploratory Data Analysis


In this section, we perform an in-depth analysis of the variables in each dataset.

### 6.1 Analysing the Data
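As a starting point, the sketch below computes a few summary figures that are typically useful for rating data: the number of users and movies, the sparsity of the user-movie matrix, and the number of ratings per movie (where the long tail shows up). It runs on a made-up toy frame with the same columns as `train.csv`; the real analysis would use `df_train` instead.

```python
import pandas as pd

# Toy ratings with the same columns as train.csv (values are made up)
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 1.0],
})

n_users = ratings["userId"].nunique()
n_movies = ratings["movieId"].nunique()

# Share of all possible user-movie pairs that have no rating
sparsity = 1 - len(ratings) / (n_users * n_movies)

# Ratings per movie, most-rated first: a few popular titles, then a long tail
ratings_per_movie = ratings["movieId"].value_counts()

print(f"users={n_users}, movies={n_movies}, sparsity={sparsity:.2%}")
print(ratings["rating"].describe())
print(ratings_per_movie)
```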

### 6.2 Visualising Data

## 7. Feature Engineering

In this section, we clean the datasets and, where the EDA phase suggests it, create new features.
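The columns previewed in the loading section suggest a few candidate features. The sketch below is illustrative only: it extracts a release year from `title`, one-hot encodes `genres`, and turns `budget` into a number, using small samples copied from the `head()` outputs. The column names `year` and `budget_amount` are assumptions, and stripping the currency symbol assumes all budgets are in the same currency.

```python
import pandas as pd

# Samples copied from the movies.csv and imdb_data.csv previews
movies_sample = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})
imdb_sample = pd.DataFrame({
    "movieId": [1, 2],
    "budget": ["$30,000,000", "$65,000,000"],
})

# Release year from the "(YYYY)" suffix of the title; NaN if it is absent
movies_sample["year"] = pd.to_numeric(
    movies_sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
)

# One binary column per genre
genre_dummies = movies_sample["genres"].str.get_dummies(sep="|")

# Budget as a number: drop everything except digits and the decimal point
imdb_sample["budget_amount"] = pd.to_numeric(
    imdb_sample["budget"].str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

# Combine everything into one movie-level feature table
features = movies_sample.join(genre_dummies).merge(
    imdb_sample, on="movieId", how="left"
)
print(features.head())
```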



## 8. Modelling


In this section, we are going to create a recommendation algorithm that not only assists users in discovering great movies but also accurately predicts how users will rate unseen movies based on their historical preferences.
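Before fitting anything more elaborate, a naive baseline gives a reference point for later models. The sketch below predicts each rating as the global mean plus a per-user and a per-movie offset. It is an assumed starting point rather than the model this notebook settles on, and it uses toy data with the same columns as `train.csv` and `test.csv`. The 0.5 to 5.0 clipping range is also an assumption about the rating scale.

```python
import pandas as pd

# Toy data with the same columns as train.csv and test.csv (values made up)
train_sample = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 4.0, 2.0],
})
test_sample = pd.DataFrame({"userId": [1, 3, 4], "movieId": [30, 10, 10]})

global_mean = train_sample["rating"].mean()
# How far each user / movie sits from the global mean on average
user_bias = train_sample.groupby("userId")["rating"].mean() - global_mean
movie_bias = train_sample.groupby("movieId")["rating"].mean() - global_mean

def predict_baseline(pairs):
    """Global mean + user offset + movie offset, clipped to the assumed scale."""
    u = pairs["userId"].map(user_bias).fillna(0.0)    # unseen users: no offset
    m = pairs["movieId"].map(movie_bias).fillna(0.0)  # unseen movies: no offset
    return (global_mean + u + m).clip(0.5, 5.0)

test_sample["rating"] = predict_baseline(test_sample)
print(test_sample)
```

User 4 does not appear in the training sample, so their prediction uses only the global mean and the movie offset. This is the cold-start situation described in the introduction.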



### 8.1 Model Performance


### 8.2 Model Explanation


## 9. Conclusion


## 10. References