## BUSINESS UNDERSTANDING
The task is to build a recommendation system for a movie platform, MOVIES101.
The goal is to enhance user engagement by offering personalized movie recommendations.
This can be approached using three key methods:

Collaborative Filtering:

- User-Based: Suggests movies by finding users with similar tastes and recommending content they have enjoyed.
- Item-Based: Recommends movies that are similar to the ones the user has already watched.
- Matrix Factorization: Uses techniques like Singular Value Decomposition (SVD) to uncover latent patterns in user-movie interactions and recommends movies based on those learned patterns.
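
To make the item-based idea concrete, here is a minimal sketch that measures how similar two movies are from the ratings they share. The users, movies, and ratings are made up for illustration, and cosine similarity is only one of several possible similarity measures.

```python
from math import sqrt

# Toy ratings: {user: {movie: rating}}. All values are made up for illustration.
ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0},
    "u3": {"A": 1.0, "C": 5.0},
}

def item_vectors(ratings):
    """Invert user -> movie ratings into movie -> user rating vectors."""
    vectors = {}
    for user, user_ratings in ratings.items():
        for movie, rating in user_ratings.items():
            vectors.setdefault(movie, {})[user] = rating
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = item_vectors(ratings)
sim_ab = cosine(vectors["A"], vectors["B"])
sim_ac = cosine(vectors["A"], vectors["C"])
print(f"sim(A, B) = {sim_ab:.3f}, sim(A, C) = {sim_ac:.3f}")
```

In this toy data, movie A comes out closer to B than to C, so a user who liked A would be shown B first.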

Content-Based Filtering:

Focuses on the attributes of movies (such as genres, actors, and directors) to recommend content similar to what the user has liked in the past. Methods like TF-IDF (term frequency-inverse document frequency) can help analyze textual data like descriptions or tags associated with the movies.
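
As a rough illustration of TF-IDF on movie attributes, the sketch below weights genre terms and compares movies by cosine similarity. It uses only the standard library and three genre strings taken from the sample rows of `movies.csv`; the smoothed IDF formula and the helper names are choices made for this example, not the project's final implementation. In practice a library such as scikit-learn's `TfidfVectorizer` would likely be used instead.

```python
import math
from collections import Counter

# Toy catalogue: genre strings in the pipe-separated style of movies.csv.
movies = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
}

def tfidf_vectors(docs):
    """Return {title: {term: tf-idf weight}} for pipe-separated term strings."""
    tokenized = {title: text.split("|") for title, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for terms in tokenized.values() for term in set(terms))
    vectors = {}
    for title, terms in tokenized.items():
        counts = Counter(terms)
        vectors[title] = {
            # Smoothed IDF, so terms present in every document keep a non-zero weight.
            term: (count / len(terms)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vecs = tfidf_vectors(movies)
sim_jumanji = cosine(vecs["Toy Story (1995)"], vecs["Jumanji (1995)"])
sim_grumpier = cosine(vecs["Toy Story (1995)"], vecs["Grumpier Old Men (1995)"])
print(f"Toy Story vs Jumanji: {sim_jumanji:.3f}, vs Grumpier Old Men: {sim_grumpier:.3f}")
```

Toy Story shares three genres with Jumanji and only one with Grumpier Old Men, so the first similarity is the larger of the two.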

Hybrid Methods:

Combines collaborative and content-based filtering to create a more comprehensive recommendation system. This approach can balance the strengths of both methods, ensuring users receive diverse and relevant movie suggestions.
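
One simple way to combine the two signals is a weighted average. The sketch below assumes both scores are already on the same [0, 1] scale; the `alpha` weight and the per-movie scores are hypothetical values chosen for illustration.

```python
def hybrid_score(cf_score, content_score, alpha=0.7):
    """Blend two scores on the same scale; alpha is the collaborative weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * cf_score + (1.0 - alpha) * content_score

# Hypothetical per-movie scores, both already scaled to [0, 1].
cf_scores = {"A": 0.9, "B": 0.4, "C": 0.6}
content_scores = {"A": 0.2, "B": 0.8, "C": 0.7}

blended = {m: hybrid_score(cf_scores[m], content_scores[m]) for m in cf_scores}
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)
```

Lowering `alpha` shifts weight toward content similarity, which may help when a user has few ratings.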

By leveraging these strategies, MOVIES101 can deliver personalized recommendations that boost user satisfaction and increase viewing time.

## OBJECTIVES
1. Build a Model for Top 5 Recommendations:

- Use collaborative filtering (e.g., matrix factorization techniques like SVD) or deep learning models (e.g., neural collaborative filtering) to predict the ratings a user would give to movies.

- Sort the predicted ratings for each user and recommend the top 5 movies that the user has not yet watched.
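
The sorting step can be sketched as follows. The predicted ratings and the watched set are hypothetical; in the real pipeline they would come from the trained model and from `ratings.csv`.

```python
import heapq

def top_n_unseen(predicted, watched, n=5):
    """Return the n highest-scoring (movie_id, score) pairs the user has not watched."""
    candidates = ((movie, score) for movie, score in predicted.items() if movie not in watched)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical predicted ratings for one user, keyed by movieId.
predicted = {1: 4.8, 2: 3.1, 3: 4.5, 4: 2.0, 5: 4.9, 6: 3.7, 7: 4.2, 8: 4.0}
watched = {1, 3}  # movies the user has already rated

recommendations = top_n_unseen(predicted, watched)
print(recommendations)
```

`heapq.nlargest` avoids sorting the whole catalogue when only a handful of recommendations are needed.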

2. Tackle the Cold Start Issue for New Users:

- Content-Based Recommendations: For new users, recommend movies based on their stated preferences (e.g., genre, actors) by using content-based filtering.

- Popular Movies: Suggest currently trending or highly rated movies as initial recommendations until the system gathers more user data.

- Hybrid Approach: Combine popular and content-based recommendations to keep suggestions relevant when user data is scarce.
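
A popularity fallback for new users could look like the sketch below. The `min_ratings` threshold is an assumption added here so that a movie with a single 5-star vote does not outrank widely rated titles; the rows are hypothetical.

```python
from collections import defaultdict

def popular_movies(ratings, min_ratings=2, n=5):
    """Rank movies by mean rating, ignoring titles with fewer than min_ratings votes."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum of ratings, count]
    for _user, movie, rating in ratings:
        totals[movie][0] += rating
        totals[movie][1] += 1
    means = {m: s / c for m, (s, c) in totals.items() if c >= min_ratings}
    return sorted(means, key=means.get, reverse=True)[:n]

# Hypothetical (userId, movieId, rating) rows in the shape of ratings.csv.
rows = [
    (1, 10, 5.0), (2, 10, 4.0), (3, 10, 4.5),
    (1, 20, 5.0),  # only one vote, so movie 20 is filtered out
    (2, 30, 3.0), (3, 30, 3.5),
]

cold_start_list = popular_movies(rows)
print(cold_start_list)
```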

3. Improve Precision and Relevance:

- Implement a hybrid recommendation system combining both collaborative and content-based filtering to deliver more personalized suggestions.

- Use advanced models like Factorization Machines or Autoencoders to capture complex interactions between users and items for more precise recommendations.

4. Evaluate the System’s Performance:

- Use Root Mean Squared Error (RMSE) to evaluate the accuracy of predicted ratings compared to actual user ratings.

- Additionally, evaluate performance using Precision@K, Recall@K, and F1 Score to measure how relevant and precise the top 5 recommendations are.
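
The metrics above can be sketched in a few lines. What counts as a "relevant" movie is an assumption that has to be fixed separately (for example, held-out movies the user rated 4.0 or higher); the numbers below are illustrative only.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def precision_recall_at_k(recommended, relevant, k=5):
    """Precision@K and Recall@K for one user's ranked recommendation list."""
    top_k = recommended[:k]
    hits = sum(1 for movie in top_k if movie in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

error = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
precision, recall = precision_recall_at_k([5, 7, 8, 6, 2], relevant={5, 8, 9, 11}, k=5)
print(f"RMSE={error:.3f}  P@5={precision:.2f}  R@5={recall:.2f}  F1={f1(precision, recall):.3f}")
```

These per-user precision and recall values would normally be averaged over all test users before being reported.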

5. Feedback Mechanism:

- Implement a feature that allows users to rate the recommendations they receive.

- Use this feedback to adjust the model, for example by reweighting similar movies or users according to the user's rating patterns, so that future recommendations improve.

## DATA UNDERSTANDING
### DATA SOURCES
The project uses the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/) from the GroupLens research lab at the University of Minnesota. Given the constraints on computational resources, we're working with the "small" dataset, which contains roughly 100,000 user ratings (100,836 in the copy loaded in this notebook).

### DATA DESCRIPTION
The data folder contains several CSV files, each with its own set of columns.


movies.csv

movieId - Unique identifier for each movie.

title - The movie titles.

genres - Pipe-separated list of the genres a movie falls into (for example, Comedy|Romance).
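
Both `title` and `genres` pack more than one value into a string. The sketch below shows one way to unpack them, assuming titles follow the `Name (YYYY)` pattern seen in the sample rows; the `parse_movie` helper is hypothetical, and titles that do not match the pattern are left untouched with a year of `None`.

```python
import re

# Matches a trailing four-digit year in parentheses, e.g. "Toy Story (1995)".
TITLE_YEAR = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)\s*$")

def parse_movie(title, genres):
    """Split 'Name (YYYY)' into name and year, and pipe-separated genres into a list."""
    match = TITLE_YEAR.match(title)
    name, year = (match["title"], int(match["year"])) if match else (title, None)
    return {"title": name, "year": year, "genres": genres.split("|")}

movie = parse_movie("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy")
print(movie)
```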


ratings.csv

userId - Unique identifier for each user

movieId - Unique identifier for each movie.

rating - The user's rating of the movie on a scale from 0.5 to 5.0 in half-star increments; 5.0 is the highest and 0.5 the lowest.

timestamp - Seconds elapsed since midnight UTC on January 1, 1970 (Unix time).
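
As a quick check on the timestamp definition, the sketch below converts the first timestamp in `ratings.csv` (964982703) into a readable UTC date. The `to_utc` helper is a hypothetical name used only for this example.

```python
from datetime import datetime, timezone

def to_utc(timestamp):
    """Convert Unix seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)

rated_at = to_utc(964982703)  # timestamp of the first row in ratings.csv
print(rated_at.isoformat())
```

For a whole column, `pd.to_datetime(ratings_df["timestamp"], unit="s", utc=True)` performs the same conversion in one call.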


tags.csv

userId - Unique identifier for each user

movieId - Unique identifier for each movie.

tag - A phrase determined by the user.

timestamp - Seconds elapsed since midnight UTC on January 1, 1970 (Unix time).


links.csv

movieId - Identifier for movies used by https://movielens.org; each ID links to the movie's page on that site.

imdbId - Identifier for movies used by http://www.imdb.com; each ID links to the movie's page on that site.

tmdbId - Identifier for movies used by https://www.themoviedb.org; each ID links to the movie's page on that site.
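
The identifiers can be turned into URLs along the lines of the sketch below. The URL patterns follow the ones described in the MovieLens README and should be treated as assumptions here; `movie_links` is a hypothetical helper, and the missing-value check is included because `tmdbId` is loaded as a float column, which suggests some entries are empty.

```python
def movie_links(movie_id, imdb_id, tmdb_id=None):
    """Build external URLs for one row of links.csv."""
    links = {
        "movielens": f"https://movielens.org/movies/{movie_id}",
        # IMDb title IDs are written as 'tt' plus a zero-padded 7-digit number.
        "imdb": f"http://www.imdb.com/title/tt{int(imdb_id):07d}/",
    }
    if tmdb_id is not None and tmdb_id == tmdb_id:  # skip None and NaN
        links["tmdb"] = f"https://www.themoviedb.org/movie/{int(tmdb_id)}"
    return links

toy_story = movie_links(1, 114709, 862.0)
print(toy_story)
```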

In [19]:
# Import Libraries
import pandas as pd
import numpy as np
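# Compatibility shim: the np.int alias was removed in NumPy 1.24,
# but some older libraries still reference it.
np.int = int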
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from IPython.display import display, HTML


Loading the datasets

In [20]:
from pathlib import Path

# Folder containing the MovieLens CSV files (machine-specific; adjust as needed)
data_dir = Path(r"C:\Users\PC\Documents\Flatiron\dsc-data-science-env-config\Project_phase_4")

links_df = pd.read_csv(data_dir / "links.csv")
movies_df = pd.read_csv(data_dir / "movies.csv")
ratings_df = pd.read_csv(data_dir / "ratings.csv")
tags_df = pd.read_csv(data_dir / "tags.csv")


Displaying the first 5 rows of the datasets

In [21]:
def display_dataframes_side_by_side(*dataframes, titles=None):
    if titles is None:
        titles = [''] * len(dataframes)
    
    # Generate HTML representation for each DataFrame and associate it with a title
    html_content = []
    for dataframe, title in zip(dataframes, titles):
        html = dataframe.head().to_html(classes='dataframe', header=True)
        html_content.append(f"<h3>{title}</h3>{html}")
    
    # Combine all DataFrames into one HTML block with inline styling for side-by-side display
    combined_html = ''.join(
        f"<div style='display: inline-block; vertical-align: top; margin-right: 20px;'>{content}</div>" 
        for content in html_content
    )
    
    # Render the HTML content
    display(HTML(combined_html))

# Example usage with DataFrames
display_dataframes_side_by_side(
    movies_df, 
    ratings_df, 
    tags_df, 
    links_df, 
    titles=['Movies DataFrame', 'Ratings DataFrame', 'Tags DataFrame', 'Links DataFrame']
)


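Movies DataFrame

| | movieId | title | genres |
| --- | --- | --- | --- |
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |

Ratings DataFrame

| | userId | movieId | rating | timestamp |
| --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |

Tags DataFrame

| | userId | movieId | tag | timestamp |
| --- | --- | --- | --- | --- |
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |

Links DataFrame

| | movieId | imdbId | tmdbId |
| --- | --- | --- | --- |
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |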


In [22]:
# Display Summary Information
def display_infos(*dfs, titles=None):
    if titles is None:
        titles = [''] * len(dfs)
    
    for df, title in zip(dfs, titles):
        print(f"--- {title} ---")
        df.info()
        print("\n")

display_infos(
    movies_df, 
    ratings_df, 
    tags_df, 
    links_df, 
    titles=['Movies DataFrame Info', 'Ratings DataFrame Info', 'Tags DataFrame Info', 'Links DataFrame Info']
)

--- Movies DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


--- Ratings DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


--- Tags DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------  

In [23]:
# Display the shapes of each dataframe
for df, title in zip([movies_df, ratings_df, tags_df, links_df], 
                     ['Movies DataFrame Shape', 'Ratings DataFrame Shape', 'Tags DataFrame Shape', 'Links DataFrame Shape']):
    print(f"{title}: {df.shape}")

Movies DataFrame Shape: (9742, 3)
Ratings DataFrame Shape: (100836, 4)
Tags DataFrame Shape: (3683, 4)
Links DataFrame Shape: (9742, 3)
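
The shapes alone do not show how sparse the user-movie matrix is. A minimal sketch of that calculation, using hypothetical (userId, movieId) pairs, is given below; note that it counts only movies that have at least one rating.

```python
def matrix_density(pairs):
    """Fraction of the user x movie grid that holds a rating."""
    pairs = set(pairs)
    users = {user for user, _ in pairs}
    movies = {movie for _, movie in pairs}
    return len(pairs) / (len(users) * len(movies)) if pairs else 0.0

# Hypothetical (userId, movieId) pairs: 3 users, 4 movies, 6 ratings.
pairs = [(1, 10), (1, 20), (2, 10), (2, 30), (3, 40), (3, 10)]
density = matrix_density(pairs)
print(f"density = {density:.2%}, sparsity = {1 - density:.2%}")
```

On the real data this could be called as `matrix_density(ratings_df[["userId", "movieId"]].itertuples(index=False, name=None))`.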
