## Movie Tags Vectorization

In this notebook, we:
- Load the preprocessed movie dataset (`movies_merge.csv`)
- Apply `CountVectorizer` (max_features=5000, English stop words)
- Transform the `tags` column into numerical feature vectors for downstream ML models

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
# Keep the 5000 most frequent tokens and drop English stop words.
cv = CountVectorizer(max_features=5000, stop_words='english')

In [2]:
import pandas as pd
new_df = pd.read_csv('../../data/processed/movies_merge.csv')

In [3]:
cv.fit_transform(new_df['tags']).toarray().shape

(4805, 5000)

In [4]:
# Dense (n_movies, n_features) count matrix.
vectors = cv.fit_transform(new_df['tags']).toarray()

In [5]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0])

In [6]:
len(cv.get_feature_names_out())

5000

## Movie Tags Vectorization and Similarity Computation

Building on the vectors above, the remaining cells:

- Apply stemming using NLTK's `PorterStemmer` to normalize tag tokens
- Compute pairwise cosine similarity between all movies based on their tag vectors
- Inspect similarity scores as groundwork for a content-based movie recommendation system

In [7]:
import nltk
from nltk.stem.porter import PorterStemmer

In [8]:
ps = PorterStemmer()

def stem(text):
    # Stem each whitespace-separated token and rejoin into one string.
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)


In [9]:
# `vectors` (In [4]) predates this step; re-run cv.fit_transform to use the stems.
new_df['tags'] = new_df['tags'].apply(stem)

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

In [11]:
cosine_similarity(vectors)

array([[1.        , 0.08980265, 0.05986843, ..., 0.02431083, 0.02777778,
        0.        ],
       [0.08980265, 1.        , 0.06451613, ..., 0.02619813, 0.        ,
        0.        ],
       [0.05986843, 0.06451613, 1.        , ..., 0.02619813, 0.        ,
        0.        ],
       ...,
       [0.02431083, 0.02619813, 0.02619813, ..., 1.        , 0.0729325 ,
        0.04671418],
       [0.02777778, 0.        , 0.        , ..., 0.0729325 , 1.        ,
        0.05337605],
       [0.        , 0.        , 0.        , ..., 0.04671418, 0.05337605,
        1.        ]])

In [12]:
cosine_similarity(vectors).shape

(4805, 4805)
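
For intuition about what each entry of this matrix means: the cosine similarity of two count vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch with two made-up vectors (`a` and `b` are illustrative, not rows from the dataset) checks the manual formula against `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical count vectors over a six-token vocabulary.
a = np.array([1, 1, 0, 0, 1, 0])
b = np.array([0, 2, 0, 0, 1, 1])

# Manual formula: dot(a, b) / (||a|| * ||b||)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity returns the full pairwise matrix; take the (a, b) entry.
from_sklearn = cosine_similarity([a, b])[0, 1]

print(manual, from_sklearn)
```

Because count vectors are non-negative, the scores fall between 0 and 1, and every movie has similarity 1 with itself, which accounts for the ones on the diagonal.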

In [13]:
similarity = cosine_similarity(vectors)

In [14]:
similarity

array([[1.        , 0.08980265, 0.05986843, ..., 0.02431083, 0.02777778,
        0.        ],
       [0.08980265, 1.        , 0.06451613, ..., 0.02619813, 0.        ,
        0.        ],
       [0.05986843, 0.06451613, 1.        , ..., 0.02619813, 0.        ,
        0.        ],
       ...,
       [0.02431083, 0.02619813, 0.02619813, ..., 1.        , 0.0729325 ,
        0.04671418],
       [0.02777778, 0.        , 0.        , ..., 0.0729325 , 1.        ,
        0.05337605],
       [0.        , 0.        , 0.        , ..., 0.04671418, 0.05337605,
        1.        ]])

In [15]:
similarity[0].shape

(4805,)

In [16]:
# Top 5 matches for movie 0; position 0 of the ranking is the movie itself, so skip it.
sorted(enumerate(similarity[0]), reverse=True, key=lambda x: x[1])[1:6]

[(539, np.float64(0.25724787771376323)),
 (1192, np.float64(0.24873416908154544)),
 (260, np.float64(0.24759378423606915)),
 (1214, np.float64(0.24595492912420727)),
 (507, np.float64(0.24498947175305572))]
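
The ranking above could be wrapped into a small lookup helper. The following is a sketch under stated assumptions, not the notebook's own implementation: the `title` column, the `recommend` function, and the toy data are all hypothetical, chosen only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for new_df; the `title` column is an assumption.
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "tags": [
        "space alien war",
        "space alien adventure",
        "romantic comedy wedding",
        "alien war invasion",
    ],
})

toy_vectors = CountVectorizer().fit_transform(toy_df["tags"]).toarray()
toy_similarity = cosine_similarity(toy_vectors)

def recommend(title, df, similarity, top_n=5):
    """Return up to top_n titles most similar to `title`, excluding itself."""
    matches = df.index[df["title"] == title]
    if len(matches) == 0:
        raise ValueError(f"Unknown title: {title!r}")
    idx = df.index.get_loc(matches[0])
    scores = similarity[idx]
    # Sort positions by descending score and drop the query movie explicitly.
    order = [i for i in np.argsort(scores)[::-1] if i != idx][:top_n]
    return df["title"].iloc[order].tolist()

print(recommend("Movie A", toy_df, toy_similarity, top_n=2))
```

Excluding the query index explicitly, rather than slicing from position 1, avoids dropping the wrong row when another movie has identical tags and therefore also scores 1.0.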