# Model Building
In this notebook, we:
- Load the processed dataset, which includes a `tags` column for each movie.
- Vectorize the `tags` text using TF-IDF.
- Compute pairwise cosine similarity between movies.
- Build a recommendation function.
- Save the model artifacts for deployment.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
import joblib

## Load Data with Tags

In [2]:
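# The 'Unnamed: 0' column in the output comes from an unnamed first column
# in the CSV (typically a saved index); pass index_col=0 to read it as the index.
movies = pd.read_csv('../data/tags_movies.csv')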
movies.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


## TF-IDF Vectorization

In [3]:
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
vector = tfidf.fit_transform(movies['tags'])
print('TF-IDF matrix shape:', vector.shape)

TF-IDF matrix shape: (4806, 5000)
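To make the shape above easier to interpret, here is a minimal sketch that runs the same vectorizer settings on a tiny corpus. The `toy_tags` list and all `toy_*` names are hypothetical stand-ins for the `tags` column and are not part of the notebook's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for movies['tags'].
toy_tags = [
    "space marine alien planet",
    "pirate ship captain sea",
    "space alien invasion earth",
]

toy_tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
toy_vector = toy_tfidf.fit_transform(toy_tags)

# One row per document; one column per vocabulary term (capped at max_features).
print('Toy TF-IDF matrix shape:', toy_vector.shape)
print('Vocabulary:', sorted(toy_tfidf.vocabulary_))

# With the default norm='l2', every row is scaled to unit length.
toy_row_norms = np.linalg.norm(toy_vector.toarray(), axis=1)
print('Row norms:', toy_row_norms)
```

The number of columns equals the vocabulary size, so it only reaches `max_features` when the corpus contains at least that many distinct terms, as it does for the full dataset.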


## Compute Cosine Similarity

In [4]:
similarity = cosine_similarity(vector)
print('Similarity matrix shape:', similarity.shape)

Similarity matrix shape: (4806, 4806)
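As a rough illustration of what `cosine_similarity` computes, the sketch below reproduces it by hand on a small dense array: normalize each row to unit length, then take all pairwise dot products. The `toy_vectors` values are made up for the example. Assuming the matrix is stored as float64 (the scikit-learn default here), a dense 4806 x 4806 result takes roughly 4806 * 4806 * 8 bytes, about 185 MB.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vectors: rows 0 and 1 point the same way, row 2 is orthogonal.
toy_vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Cosine similarity = dot product of the L2-normalized rows.
toy_normalized = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_manual = toy_normalized @ toy_normalized.T

toy_sklearn = cosine_similarity(toy_vectors)
print(np.allclose(toy_manual, toy_sklearn))

# Estimated memory for the full dense matrix, assuming float64 entries.
full_matrix_bytes = 4806 * 4806 * 8
print(f'{full_matrix_bytes / 1e6:.0f} MB')
```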


## Testing the Model

In [5]:
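def recommend(movie_title, top_n=5):
    """Return the titles of the top_n movies most similar to movie_title."""
    matches = movies.index[movies['title'] == movie_title]
    if len(matches) == 0:
        raise ValueError(f"Movie not found: {movie_title!r}")
    index = matches[0]
    # Rank every movie by similarity to the query, highest first.
    distances = sorted(enumerate(similarity[index]), key=lambda x: x[1], reverse=True)
    # Skip the first entry, which is normally the query movie itself (similarity 1.0).
    return [movies.iloc[i].title for i, _ in distances[1:top_n + 1]]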

print(recommend('Iron Man'))

['Iron Man 2', 'Iron Man 3', 'Avengers: Age of Ultron', 'Captain America: Civil War', 'Ant-Man']
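One possible variation, sketched below on made-up data, ranks with `np.argsort` and removes the query movie explicitly rather than assuming it sorts first. The function name `recommend_argsort`, its parameters, and the `toy_*` objects are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `movies` and `similarity`.
toy_movies = pd.DataFrame({'title': ['A', 'B', 'C', 'D']})
toy_similarity = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

def recommend_argsort(movie_title, movies_df, sim, top_n=2):
    """Return up to top_n similar titles, or an empty list for an unknown title."""
    matches = np.flatnonzero((movies_df['title'] == movie_title).to_numpy())
    if matches.size == 0:
        return []
    index = matches[0]
    # Indices sorted by descending similarity to the query row.
    order = np.argsort(sim[index])[::-1]
    # Drop the query movie itself, then keep the best top_n.
    order = order[order != index][:top_n]
    return movies_df['title'].iloc[order].tolist()

print(recommend_argsort('A', toy_movies, toy_similarity))  # ['B', 'D']
```

Returning an empty list for an unknown title is a design choice for the sketch; raising an error may be preferable when a missing title indicates a bug upstream.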


## Saving Model Files

In [6]:
os.makedirs("../models", exist_ok=True)
joblib.dump({"movies": movies, "similarity": similarity}, "../models/recommender.pkl", compress=3)

['../models/recommender.pkl']
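For reference, a minimal sketch of the round trip a deployment script might perform: dump a dictionary with the same two keys, then restore it with `joblib.load`. It writes a toy artifact to a temporary directory so it does not touch `../models`; the `toy_artifact` contents are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd

# Hypothetical artifact with the same keys as the one saved above.
toy_artifact = {
    "movies": pd.DataFrame({"title": ["A", "B"]}),
    "similarity": np.array([[1.0, 0.5], [0.5, 1.0]]),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "recommender.pkl")
    joblib.dump(toy_artifact, toy_path, compress=3)
    # joblib.load returns the original dictionary, decompressing transparently.
    loaded = joblib.load(toy_path)

loaded_movies = loaded["movies"]
loaded_similarity = loaded["similarity"]
print(loaded_movies.shape, loaded_similarity.shape)
```

Because the artifact is a pickle, it should be loaded only from trusted sources and, ideally, with the same library versions used to create it.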