# Movies Recommender System

The main goal of this project is to develop a content-based recommender system that suggests movies to users based on their preferences and interests. By analyzing movie attributes such as descriptions and genres, I aim to provide personalized recommendations that enhance the user experience when exploring films.

In [None]:
!pip install pandas
# ast is part of the Python standard library, so it does not need to be installed
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn

In [2]:
import pandas as pd
import ast
import matplotlib.pyplot as plt
import seaborn as sns

## Data Preparation

The code reads the movies and credits data, merges them on the movie title, and keeps only the columns used to build the recommendations.

In [3]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

In [4]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [61]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

In [62]:
movies.shape

(4803, 20)

In [5]:
credits.head(4)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."


In [6]:
credits.columns

Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')

In [64]:
credits.shape

(4803, 4)

In [65]:
movies = movies.merge(credits, on='title')
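Merging on `title` can create extra rows when different films share a title, because every matching pair is joined. The sketch below uses small made-up frames (the `movies_demo` and `credits_demo` names and their values are illustrative, not taken from the dataset) to compare a title merge with a merge on the numeric key, assuming `id` in the movies file corresponds to `movie_id` in the credits file.

```python
import pandas as pd

# Toy frames standing in for the two CSV files (values are illustrative only).
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Alpha", "Beta"],  # two different films share a title
})
credits_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Alpha", "Beta"],
    "cast": ["[]", "[]", "[]"],
})

# Merging on title pairs every "Alpha" row with every other "Alpha" row.
by_title = movies_demo.merge(credits_demo, on="title")

# Merging on the numeric key keeps exactly one row per film.
by_id = movies_demo.merge(
    credits_demo.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
)

print(len(by_title), len(by_id))
```

If the key assumption holds for the real files, merging on the ID would keep the row count equal to the number of films.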

In [66]:
# Initial data exploration
print(movies.info())
print(movies.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

In [67]:
# Keep only the columns used to build the recommendation features
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [68]:
# Checking for missing values
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [69]:
movies = movies.dropna()  # only 3 rows have a missing overview, so drop them
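One detail worth keeping in mind after dropping rows: the remaining rows keep their original index labels, so a label is no longer the same as a row position. This matters later whenever a label is used to pick a row from a NumPy array built from the frame. A minimal sketch with a made-up frame (`demo` is hypothetical):

```python
import pandas as pd

# Tiny illustrative frame; the middle row has a missing overview.
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["x", None, "z"],
})
demo = demo.dropna()

# Labels survive the drop, so they skip the removed row.
labels = list(demo.index)

# Label of "C" versus its position among the remaining rows.
label_of_c = demo[demo["title"] == "C"].index[0]
position_of_c = demo["title"].eq("C").to_numpy().nonzero()[0][0]

print(labels, label_of_c, position_of_c)
```

Calling `reset_index(drop=True)` after the drop is one way to make labels and positions agree again.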

## Textual Feature Extraction
The helper functions below parse the stringified lists in each column and extract genre names, keywords, the top three actors, and director names.

In [70]:
def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L 

In [71]:
movies['genres'] = movies['genres'].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [76]:
movies['keywords'] = movies['keywords'].apply(convert)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [77]:
# Getting top 3 actors
def get_actors(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 3:
            L.append(i['name'])
        counter+=1
    return L 

In [78]:
movies['cast'] = movies['cast'].apply(get_actors)
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [79]:
movies['cast'] = movies['cast'].apply(lambda x:x[0:3])
movies['cast']

0        [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1           [Johnny Depp, Orlando Bloom, Keira Knightley]
2            [Daniel Craig, Christoph Waltz, Léa Seydoux]
3            [Christian Bale, Michael Caine, Gary Oldman]
4          [Taylor Kitsch, Lynn Collins, Samantha Morton]
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805         [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806           [Eric Mabius, Kristin Booth, Crystal Lowe]
4807            [Daniel Henney, Eliza Coupe, Bill Paxton]
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4806, dtype: object

In [80]:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 

In [81]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [82]:
movies['crew']

0                                [James Cameron]
1                               [Gore Verbinski]
2                                   [Sam Mendes]
3                            [Christopher Nolan]
4                               [Andrew Stanton]
                          ...                   
4804                          [Robert Rodriguez]
4805                              [Edward Burns]
4806                               [Scott Smith]
4807                               [Daniel Hsia]
4808    [Brian Herzlinger, Jon Gunn, Brett Winn]
Name: crew, Length: 4806, dtype: object

In [83]:
movies.sample(5)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
373,2067,Mission to Mars,When contact is lost with the crew of the firs...,[Science Fiction],"[mars, spacecraft, space travel, alien, long t...","[Gary Sinise, Tim Robbins, Don Cheadle]",[Brian De Palma]
167,7364,Sahara,Scouring the ocean depths for treasure-laden s...,"[Action, Adventure, Comedy, Drama, Mystery]","[tyrant, ironclad ship]","[Matthew McConaughey, Penélope Cruz, Steve Zahn]",[Breck Eisner]
4187,37232,Travellers and Magicians,"A young government official, named Dondup, who...","[Adventure, Drama, Foreign]","[illusion, independent film, bhutan, story in ...","[Tshewang Dendup, Sonam Lhamo, Lhakpa Dorji]",[Khyentse Norbu]
730,77174,ParaNorman,"In the town of Blithe Hollow, Norman Babcock i...","[Family, Animation, Adventure, Comedy]","[medium, stop motion, curse, jock, ghost, comm...","[Kodi Smit-McPhee, Tucker Albrizzi, Jodelle Fe...","[Sam Fell, Chris Butler]"
1200,1579,Apocalypto,"Set in the Mayan civilization, when a man's id...","[Action, Adventure, Drama, Thriller]","[loss of family, solar eclipse, slavery, jagua...","[Rudy Youngblood, Raoul Max Trujillo, Gerardo ...",[Mel Gibson]


In [84]:
# Remove spaces within names so that a multi-word name (e.g. "Sam Worthington")
# becomes a single token instead of being split into unrelated words during vectorization
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1
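To see why collapsing helps, the sketch below compares the vocabulary that `CountVectorizer` (with default settings) builds from two names with and without spaces. The two short document lists are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# With spaces, both people contribute the shared token "sam".
docs_spaced = ["Sam Worthington", "Sam Mendes"]
# With spaces removed, each full name is one distinct token.
docs_collapsed = ["SamWorthington", "SamMendes"]

spaced_vocab = sorted(CountVectorizer().fit(docs_spaced).vocabulary_)
collapsed_vocab = sorted(CountVectorizer().fit(docs_collapsed).vocabulary_)

print(spaced_vocab)     # ['mendes', 'sam', 'worthington']
print(collapsed_vocab)  # ['sammendes', 'samworthington']
```

Without collapsing, two films would look slightly similar just because an actor and a director share a first name.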

In [85]:
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)

In [86]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]


In [87]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]


In [88]:
# Combine all textual features into a single 'tags' column
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."


In [89]:
movies_with_tags = movies.drop(columns=['overview','genres','keywords','cast','crew'])
movies_with_tags.head(2)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."


In [90]:
movies_with_tags['tags'] = movies_with_tags['tags'].apply(lambda x: " ".join(x))
movies_with_tags

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."
...,...,...,...
4804,9367,El Mariachi,El Mariachi just wants to play his guitar and ...
4805,72766,Newlyweds,A newlywed couple's honeymoon is upended by th...
4806,231617,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic..."
4807,126186,Shanghai Calling,When ambitious New York attorney Sam is sent t...


## Vectorization & Similarity Calculation
CountVectorizer converts the tags into feature vectors, and cosine similarity between these vectors is calculated to find movies with similar "tags."

In [96]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

In [101]:
vector = cv.fit_transform(movies_with_tags['tags']).toarray()
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [102]:
vector.shape

(4806, 5000)

In [103]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector)
similarity

array([[1.        , 0.08858079, 0.05905386, ..., 0.02478408, 0.02599376,
        0.        ],
       [0.08858079, 1.        , 0.06451613, ..., 0.02707652, 0.        ,
        0.        ],
       [0.05905386, 0.06451613, 1.        , ..., 0.02707652, 0.        ,
        0.        ],
       ...,
       [0.02478408, 0.02707652, 0.02707652, ..., 1.        , 0.07150969,
        0.0489116 ],
       [0.02599376, 0.        , 0.        , ..., 0.07150969, 1.        ,
        0.05129892],
       [0.        , 0.        , 0.        , ..., 0.0489116 , 0.05129892,
        1.        ]])

In [105]:
similarity.shape

(4806, 4806)

## Recommendation
The recommend function sorts movies by similarity to the input title and prints the top five similar titles.

In [109]:
# a function to get movie recommendations
def recommend(movie_title):
    try:
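        # Find the row position of the movie with the given title.
        # A position is used instead of the index label because rows were dropped
        # earlier, so labels no longer line up with the rows of `similarity`.
        movie_index = (movies_with_tags['title'] == movie_title).to_numpy().nonzero()[0][0]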
        
        # Get similarity scores for all movies with respect to the chosen movie
        distances = list(enumerate(similarity[movie_index]))
        
        # Sort movies by similarity score (highest first)
        sorted_distances = sorted(distances, reverse=True, key=lambda x: x[1])
        
        # Print top 5 most similar movies (excluding the first, which is the movie itself)
        print(f"Movies similar to '{movie_title}':")
        for i in sorted_distances[1:6]:
            print(movies_with_tags.iloc[i[0]].title)
    except IndexError:
        print(f"Movie '{movie_title}' not found in the database.")

In [110]:
recommend('Gandhi')

Movies similar to 'Gandhi':
Gandhi, My Father
The Wind That Shakes the Barley
A Passage to India
Guiana 1838
Ramanujan


## Questions

1. **How did you preprocess the movie data before applying CountVectorizer?**

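I started by dropping the few rows with a missing overview. I then parsed the genre, keyword, cast, and crew columns to extract the names, kept the top three actors and the director(s), and removed spaces inside multi-word names so that each name becomes a single token. Finally, I joined these with the overview into one `tags` string per movie. Stop words are removed by CountVectorizer itself through `stop_words='english'`. This preprocessing ensured that the CountVectorizer could effectively tokenize the text and create a meaningful representation of the movies.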

Tokenization is the process of breaking down text into smaller pieces, known as tokens, which are typically individual words or phrases. For example, the sentence "The cat sat on the mat" would be tokenized into words: ["The", "cat", "sat", "on", "the", "mat"]. Tokenization is a fundamental step in processing text data because it allows us to treat each word as a distinct feature, making it possible to analyze and compare text.

In my project, I used tokenization to split movie overviews and other text-based features (such as genres and actor names) into individual words, enabling me to convert this text data into a structured format for further analysis.
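As a small illustration, the sketch below runs the example sentence through the analyzer that `CountVectorizer` builds, once with default settings and once with the built-in English stop-word list. The variable names are mine, and the default analyzer also lowercases the text and drops single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "The cat sat on the mat"

# Default analyzer: lowercases and keeps tokens of two or more word characters.
plain_tokens = CountVectorizer().build_analyzer()(sentence)

# Same analyzer with scikit-learn's built-in English stop-word list applied.
filtered_tokens = CountVectorizer(stop_words="english").build_analyzer()(sentence)

print(plain_tokens)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(filtered_tokens)  # common words such as "the" and "on" are removed
```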

The bag of words model is a way of representing text data by counting the occurrences of each word in a document, while ignoring grammar and word order. This approach converts text into a "bag" of words where each unique word becomes a feature, and the value assigned to each word is the frequency (or count) of that word in the text.

In my movie recommender system, I applied the bag of words model to create a matrix where each row represented a movie, and each column represented a unique word in the dataset. By using CountVectorizer in this way, I transformed the textual data (movie overviews, genres, cast, etc.) into numerical vectors, allowing me to calculate the cosine similarity between movies and recommend those that share similar words and themes. This way, movies with overlapping terms in their descriptions or common attributes were considered similar, enabling accurate content-based recommendations.

The CountVectorizer is used to tokenize and represent the combined text features for each movie as a bag of words.

2. **Can you describe the dataset you used for your project?**

I used the TMDB (The Movie Database) dataset, which contains detailed information about movies. It includes attributes like movie titles, overviews, genres, release dates, ratings, and keywords, among other features. For the purpose of my content-based recommender system, I focused mainly on the text-based features, such as the movie overviews and genres.

The movie overview field, which contains a synopsis or description, was particularly useful because it provided a rich source of information about the content and themes of each movie. Similarly, genres helped categorize the films based on type (e.g., drama, action, comedy), making it easier to match movies with similar characteristics.

In preprocessing, I used the overview and genre features to create a "bag of words" representation using CountVectorizer. This allowed me to create a vectorized version of each movie based on its text attributes, which I then used to calculate cosine similarity and find movies that are most similar to a given title.

In addition to the main TMDB dataset, I used the credits dataset, which provides detailed information about the cast and crew for each movie. This dataset includes columns such as movie_id, cast, and crew. The cast field contains a list of actors involved in each movie, while the crew field includes information on key members of the production team, like directors and writers.

For my recommender system, the credits dataset was valuable because cast and director data can strongly influence user preferences. For example, users may prefer movies featuring certain actors or directors. To leverage this information, I extracted the names of the main actors and director for each movie and included them as additional text features in the CountVectorizer model. This way, my system could suggest movies with similar casts or directors, providing more personalized recommendations based on users’ favorite actors or directors alongside genre and plot information.


3. **Did you consider using any other techniques for feature extraction, such as TF-IDF? Why or why not?**

Yes, I considered TF-IDF, since it weights each word by how distinctive it is across the whole dataset. However, I chose CountVectorizer for this project because I wanted to emphasize the presence of specific terms (names, genres, keywords) in each movie's tags without reweighting them by how common they are across movies, which aligned better with my recommendation approach.

4. **What is the difference between TF-IDF and CountVectorizer?**

TF-IDF (Term Frequency - Inverse Document Frequency) and Count Vectorizer are two popular techniques for converting text data into numerical features for machine learning models.

Count Vectorizer - creates a matrix that counts the frequency of each word in a document. 

- **How it works**: For each document in a corpus, Count Vectorizer produces a vector where each element represents the count of a specific word in the document.
- **Example**: If the word "cat" appears 3 times in a document and "dog" appears 2 times, the vector will include the counts (3 for "cat" and 2 for "dog").
- **When to use**: Count Vectorizer is useful for capturing the frequency of words without concern for their distinctiveness across documents, which can be helpful in simpler classification tasks or when document length is a significant feature.

TF-IDF Vectorizer - builds on the term frequency but also considers the importance of each word across all documents, giving less weight to common words that appear frequently.

- **How it works**: TF-IDF is calculated as:
  - **Term Frequency (TF)**: The frequency of a word in a document, similar to Count Vectorizer.
  - **Inverse Document Frequency (IDF)**: A measure of how unique a word is across all documents in the corpus. Rare words have higher IDF values, while common words have lower IDF values.
  - **Final Calculation**: TF-IDF for a term in a document is obtained by multiplying its TF with its IDF.
- **Example**: If "the" appears in almost every document, its IDF is low, so TF-IDF down-weights its impact. If "hemangioma" is rare, its IDF is high, giving it more weight in documents where it appears.
- **When to use**: TF-IDF is useful for cases where you want to capture word relevance within and across documents, making it a better choice for text classification, search relevance, and information retrieval tasks.

**Key Differences**
- **Focus on Term Importance**: Count Vectorizer considers only frequency, while TF-IDF considers both frequency and uniqueness.
- **Common Words**: Count Vectorizer can overemphasize common words, while TF-IDF down-weights them.
- **Interpretability**: Count Vectorizer is simpler to interpret but less informative. TF-IDF provides more insight into document content and context.
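The difference can be seen on a toy corpus. In the sketch below (three made-up documents), "movie" appears in every document and "ocean" in only one. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so the numbers differ slightly from the plain TF × IDF product described above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "movie hero space",
    "movie hero ocean",
    "movie villain space",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

movie_col = tfidf_vec.vocabulary_["movie"]
ocean_col = tfidf_vec.vocabulary_["ocean"]

# Raw counts treat both words the same in the second document (1 and 1).
print(counts[1, count_vec.vocabulary_["movie"]], counts[1, count_vec.vocabulary_["ocean"]])

# TF-IDF gives the rarer word "ocean" a larger weight than the common "movie".
print(weights[1, movie_col], weights[1, ocean_col])

# The learned IDF values show the same ordering.
print(tfidf_vec.idf_[movie_col], tfidf_vec.idf_[ocean_col])
```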

5. **Can you explain how cosine similarity works and why you chose it for your recommender system?**

Cosine similarity measures the cosine of the angle between two vectors in a high-dimensional space. In general it ranges from -1 to 1, but for non-negative count vectors like the ones used here it lies between 0 (no shared terms) and 1 (same direction). I chose it because it effectively captures the similarity in content while being robust against differences in vector magnitude. This made it particularly useful for comparing movies whose tag lists vary widely in length.
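For reference, the formula is cos(θ) = (u · v) / (‖u‖ ‖v‖). A minimal NumPy sketch with made-up vectors follows; the `cosine` helper is my own and assumes neither vector is all zeros.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1, 1, 0, 2])
b = np.array([2, 2, 0, 4])  # same direction as a, twice the length
c = np.array([0, 0, 3, 0])  # shares no non-zero positions with a

print(cosine(a, b))  # approximately 1: magnitude does not matter
print(cosine(a, c))  # 0.0: nothing in common
```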

6. **What are the advantages and limitations of using cosine similarity in this context?**

The main advantage of cosine similarity is that it focuses on the orientation of the vectors rather than their magnitudes, making it suitable for high-dimensional text data. However, a limitation is that it doesn’t account for the context or semantics of the words, meaning it might miss relationships between movies that are not directly reflected in their descriptions.

7. **How did you handle missing or irrelevant data in the TMDB dataset?**

   
I checked the selected columns for missing values and found only three movies without an overview, so I dropped those rows to ensure the integrity of the dataset. For irrelevant data, I kept only the columns that directly contribute to the recommendation process (ID, title, overview, genres, keywords, cast, and crew) and filtered out the rest to streamline the analysis.

8. **Did you encounter any challenges related to data quality, and how did you address them?**

Yes, one challenge was dealing with inconsistent formatting in movie genres and descriptions. I standardized the genres by creating a uniform list of categories and ensured that all descriptions were in the same format before processing. This helped maintain consistency and improved the accuracy of the recommendations.

9. **What metrics did you use to evaluate the performance of your recommender system?**

While I primarily relied on qualitative feedback to gauge the effectiveness of my recommendations, I also considered the precision and recall metrics that could be applied to a broader evaluation. If I had more time, I would conduct user studies or surveys to gather quantitative data on user satisfaction with the recommendations provided by the system.
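As a rough sketch of how precision and recall could be computed for a single query, the helpers below compare a recommendation list against a set of titles judged relevant. The function names and the relevance labels are hypothetical; this project did not collect such labels.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for title in recommended[:k] if title in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of all relevant titles that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for title in recommended[:k] if title in relevant)
    return hits / len(relevant)

# Made-up example: five recommendations, three titles a user marked as relevant.
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "Z"}

p = precision_at_k(recommended, relevant, k=5)
r = recall_at_k(recommended, relevant, k=5)
print(p, r)
```

Here two of the five recommendations are relevant (precision 0.4), and two of the three relevant titles were found (recall of about 0.67).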

10. **How would you assess the effectiveness of the recommendations provided by your system?**

I would assess effectiveness through user engagement metrics, such as click-through rates on recommended movies, user ratings for suggested films, and direct feedback on the relevance of the recommendations. Additionally, I would analyze how often users explore further films after receiving suggestions, indicating the value of the recommendations.

11. **Why did you choose a content-based filtering approach over collaborative filtering for your recommender system?**

I chose a content-based filtering approach because it allows for recommendations based on the attributes of the movies themselves, which is useful when user interaction data is sparse. This method also avoids issues with cold-start problems that can occur in collaborative filtering when new users or items are introduced.

12. **What challenges did you face while implementing your chosen algorithm, and how did you overcome them?**

One challenge was ensuring that the feature extraction process captured enough relevant information from the movie descriptions. I addressed this by experimenting with various preprocessing techniques, including tokenization and stop word removal, to optimize the feature set fed into the CountVectorizer, ultimately improving the quality of the cosine similarity calculations.

13. **If you had to incorporate user feedback into your recommender system, how would you approach it?**

I would implement a user feedback mechanism that allows users to rate or provide feedback on the recommendations they receive. This feedback could be used to refine the recommendation algorithm over time, perhaps by adjusting the weights of certain features or using collaborative filtering techniques to enhance personalization based on user preferences.

14. **What considerations would you have for improving user satisfaction with the recommendations?**

I would focus on diversifying the recommendations to avoid redundancy, ensuring a balance between familiar favorites and new suggestions. Additionally, incorporating user preferences and contextual factors (like time of day or mood) could further enhance the relevance of recommendations.

15. **What improvements would you consider implementing in your recommender system?**

I would consider adding a hybrid recommendation approach that combines content-based and collaborative filtering to leverage both the attributes of the movies and user interactions. Additionally, integrating machine learning models to analyze user behavior over time could help predict preferences more accurately.

16. **How could you extend your project to include hybrid recommendation methods?**

To extend the project, I could gather user ratings or interaction data and use collaborative filtering techniques alongside the existing content-based recommendations. This could involve developing a system that first provides content-based recommendations and then refines them with collaborative filtering to suggest items that similar users enjoyed.
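One simple way to sketch this two-stage idea: shortlist candidates by content similarity, then reorder the shortlist by a collaborative score. The function name, the parameters, and the scores below are all hypothetical; a real system would need an actual collaborative model to produce `collab_scores`.

```python
import numpy as np

def rerank_with_collaborative(content_scores, collab_scores, n_candidates, k):
    """Shortlist by content similarity, then reorder by collaborative score."""
    content = np.asarray(content_scores, dtype=float)
    collab = np.asarray(collab_scores, dtype=float)

    # Stage 1: indices of the n_candidates items most similar by content.
    candidates = np.argsort(-content)[:n_candidates]

    # Stage 2: reorder that shortlist by the collaborative score, highest first.
    order = candidates[np.argsort(-collab[candidates])]
    return order[:k].tolist()

# Made-up scores for four items.
content_scores = [0.9, 0.2, 0.5, 0.7]
collab_scores = [0.1, 0.8, 0.6, 0.3]

top = rerank_with_collaborative(content_scores, collab_scores, n_candidates=3, k=2)
print(top)
```

Item 1 has the best collaborative score but never reaches the shortlist, which shows how the content stage constrains what the second stage can promote.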

17. **What ethical considerations did you think about when designing your recommender system?**

I considered the potential for bias in recommendations, which could arise if the dataset reflects historical biases or fails to represent diverse perspectives. It’s crucial to ensure that the recommendations are fair and inclusive, promoting a wide range of films rather than reinforcing stereotypes.

18. **How can bias affect movie recommendations, and what steps would you take to mitigate it?**

Bias can lead to the underrepresentation of certain genres or demographics, resulting in recommendations that favor popular or mainstream content. To mitigate this, I would ensure that the dataset is diverse and representative. Implementing fairness-aware algorithms could also help balance the recommendations and avoid reinforcing existing biases.