- Imports the **pandas** library.  
- The alias **`pd`** is the conventional short name, so later calls read as `pd.read_csv(...)`.  
- pandas is used for **data manipulation, analysis, and handling**.  

In [None]:
import pandas as pd

- **Reads a CSV file** named `"hindi-movies-dataset.csv"`.  
- **Creates a DataFrame (`df`)** to store the data.  
- **`index_col=0`** sets the first column as the index.

In [None]:
df = pd.read_csv("hindi-movies-dataset.csv", index_col=0)
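
To make the effect of `index_col=0` concrete, here is a minimal sketch that reads a tiny in-memory CSV twice. The sample rows and the names `sample_csv`, `with_default`, and `with_index` are made up for illustration; only `pd.read_csv` and `index_col` come from the notebook.

```python
import io

import pandas as pd

# Hypothetical two-row CSV whose first column is an unnamed row number,
# the layout produced when a DataFrame is saved with its index.
sample_csv = ",movie_name,genre\n0,Movie A,Drama\n1,Movie B,Comedy\n"

# Without index_col, the unnamed first column becomes a regular column.
with_default = pd.read_csv(io.StringIO(sample_csv))

# With index_col=0, the first column becomes the row index instead.
with_index = pd.read_csv(io.StringIO(sample_csv), index_col=0)

print(list(with_default.columns))  # ['Unnamed: 0', 'movie_name', 'genre']
print(list(with_index.columns))    # ['movie_name', 'genre']
```

The index labels are simply whatever the file contains, so they are not guaranteed to run from 0 to n-1 without gaps.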

- **Displays the first 5 rows** of the DataFrame `df`.  
- Helps in **quickly inspecting** the dataset's structure.  
- `df.head()` shows 5 rows by default; `df.head(n)` shows the first `n` rows.

In [None]:
df.head(5)

| | movie_id | movie_name | year | genre | overview | director | cast |
|---|---|---|---|---|---|---|---|
| 0 | tt15354916 | Jawan | 2023 | Action, Thriller | A high-octane action thriller which outlines t... | Atlee | Shah Rukh Khan, Nayanthara, Vijay Sethupathi, ... |
| 1 | tt15748830 | Jaane Jaan | 2023 | Crime, Drama, Mystery | A single mother and her daughter who commit a ... | Sujoy Ghosh | Kareena Kapoor, Jaideep Ahlawat, Vijay Varma, ... |
| 2 | tt11663228 | Jailer | 2023 | Action, Comedy, Crime | A retired jailer goes on a manhunt to find his... | Nelson Dilipkumar | Rajinikanth, Mohanlal, Shivarajkumar, Jackie S... |
| 3 | tt14993250 | Rocky Aur Rani Kii Prem Kahaani | 2023 | Comedy, Drama, Family | Flamboyant Punjabi Rocky and intellectual Beng... | Karan Johar | Ranveer Singh, Alia Bhatt, Dharmendra, Shabana... |
| 4 | tt15732324 | OMG 2 | 2023 | Comedy, Drama | An unhappy civilian asks the court to mandate ... | Amit Rai | Pankaj Tripathi, Akshay Kumar, Yami Gautam, Pa... |


- **Displays metadata** about the DataFrame `df`.  
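- Includes the **number of rows and columns, each column's data type, and its non-null count**; missing values show up as a non-null count below the number of entries (here `year` has 2134 non-null values out of 2199 entries, so 65 are missing).  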
- Helps in **understanding the dataset's structure**.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2199 entries, 0 to 2199
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   movie_id    2199 non-null   object
 1   movie_name  2199 non-null   object
 2   year        2134 non-null   object
 3   genre       2199 non-null   object
 4   overview    2199 non-null   object
 5   director    2199 non-null   object
 6   cast        2199 non-null   object
dtypes: object(7)
memory usage: 137.4+ KB
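
As a complement to `df.info()`, missing values can also be counted directly per column. The sketch below uses a made-up three-row frame named `toy`, not the real dataset.

```python
import pandas as pd

# Toy frame (made-up values) with one missing year.
toy = pd.DataFrame(
    {
        "movie_name": ["Movie A", "Movie B", "Movie C"],
        "year": ["2023", None, "2021"],
    }
)

# isna() marks missing cells; sum() counts them column by column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```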


- **Removes the "year" column** from the DataFrame.  
- `axis=1` → Specifies column removal (rows use `axis=0`).  
- `inplace=True` → Modifies `df` directly instead of creating a new DataFrame.

In [None]:
df.drop(labels="year", axis=1, inplace=True)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2199 entries, 0 to 2199
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   movie_id    2199 non-null   object
 1   movie_name  2199 non-null   object
 2   genre       2199 non-null   object
 3   overview    2199 non-null   object
 4   director    2199 non-null   object
 5   cast        2199 non-null   object
dtypes: object(6)
memory usage: 120.3+ KB
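
For comparison, the same column removal can be written without `inplace=True`, in which case `drop` returns a new DataFrame and leaves the original untouched. This is only a sketch on a made-up one-row frame; `toy` and `trimmed` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"movie_name": ["Movie A"], "year": ["2023"], "genre": ["Drama"]})

# columns="year" is equivalent to labels="year", axis=1.
# Without inplace=True the result must be assigned to a name.
trimmed = toy.drop(columns="year")

print(list(toy.columns))      # ['movie_name', 'year', 'genre']
print(list(trimmed.columns))  # ['movie_name', 'genre']
```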


- **Imports `TfidfVectorizer`** → Converts text data into numerical **TF-IDF vectors** for analysis.  
- **Imports `cosine_similarity`** → Measures similarity between text vectors, useful for **recommendation systems and NLP tasks**.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

- **Creates a `TfidfVectorizer`** to convert text data into numerical form.  
- **Transforms the "genre" column** into a **TF-IDF matrix** (`tfidf_matrix_content`).  
- **Computes cosine similarity** between the genre vectors of every pair of movies, storing the resulting square matrix in `cosine_similarity_content`.  
- Helps in **finding similar movies based on genre**.

In [None]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_content = tfidf_vectorizer.fit_transform(df['genre'])
cosine_similarity_content = cosine_similarity(tfidf_matrix_content, tfidf_matrix_content)

- **Prints the cosine similarity matrix** → Shows pairwise similarity scores between movie genres.  
- **Prints the type** of `cosine_similarity_content`.  
- Expected type: **NumPy array (`<class 'numpy.ndarray'>`)**.

In [None]:
print(cosine_similarity_content)
print(type(cosine_similarity_content))

[[1.         0.         0.30289947 ... 0.         0.         0.36085961]
 [0.         1.         0.34443508 ... 0.         0.06109089 0.11857699]
 [0.30289947 0.34443508 1.         ... 0.         0.         0.69633943]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.06109089 0.         ... 0.         1.         0.08779798]
 [0.36085961 0.11857699 0.69633943 ... 0.         0.08779798 1.        ]]
<class 'numpy.ndarray'>
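
The 1s on the diagonal and the symmetry of the matrix follow from the definition of cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up vectors (the helper `cosine` is illustrative, not part of the notebook):

```python
import numpy as np

# Made-up vectors standing in for rows of a TF-IDF matrix.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine(a, b))  # approximately 1.0: same direction
print(cosine(a, c))  # 0.0: nothing in common
```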


- **Takes user input** for a movie name and stores it in `watched_movie`.  
- **Prints the entered movie name** for confirmation.

In [None]:
watched_movie = input("Enter a name of a movie already seen by a viewer: ")
print(watched_movie)

Enter a name of a movie already seen by a viewer: PK
PK


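- **Finds the row position** of the first movie whose name matches `watched_movie`.  
- The row position, not the index label, is what `cosine_similarity_content` expects, and the two differ here because the index labels have a gap (2199 entries labelled 0 to 2199).  
- **Prints the position and genre** of the watched movie, reading the genre with `.iloc`.  
- Assumes the movie exists in the dataset; otherwise the lookup raises an `IndexError`.

In [None]:
# Row position of the first exact title match (raises IndexError if there is none)
index = (df['movie_name'] == watched_movie).to_numpy().nonzero()[0][0]
print("index=", index, "Genre=", df['genre'].iloc[index])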

index= 37 Genre= Comedy, Drama, Sci-Fi
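
If the title is mistyped or absent, the lookup fails with an exception. One possible way to handle that is sketched below; `find_movie_position` is a hypothetical helper, and ignoring case and surrounding whitespace is an assumption about what a user would want, not something the notebook does.

```python
import pandas as pd


def find_movie_position(frame, title):
    """Return the row position of the first title match, or None if absent.

    Hypothetical helper: matching ignores case and surrounding whitespace.
    """
    wanted = title.strip().lower()
    mask = frame["movie_name"].str.strip().str.lower() == wanted
    positions = mask.to_numpy().nonzero()[0]
    return int(positions[0]) if len(positions) else None


# Toy frame with a gap in the index labels (label 1 is missing).
toy = pd.DataFrame({"movie_name": ["Jawan", "PK", "OMG 2"]}, index=[0, 2, 3])

print(find_movie_position(toy, " pk "))     # 1 (row position, not label 2)
print(find_movie_position(toy, "Unknown"))  # None
```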


- **Retrieves similarity scores** of the watched movie with all other movies.  
- `cosine_similarity_content[index]` gives a **1D array** where each value represents the similarity between the watched movie and another movie.  
- Useful for **movie recommendations** based on genre similarity.

In [None]:
cosine_similarity_content[index]

array([0.        , 0.04657696, 0.14779597, ..., 0.        , 0.03448698,
       0.24301588])

- **Pairs each movie index** with its similarity score to the watched movie.  
- `enumerate(cosine_similarity_content[index])` creates **(index, similarity score) tuples**.  
- **Stores the list** in `index_and_similarity` for further processing (e.g., sorting for recommendations).

In [None]:
index_and_similarity = list(enumerate(cosine_similarity_content[index]))
index_and_similarity

[(0, np.float64(0.0)),
 (1, np.float64(0.04657695792907669)),
 (2, np.float64(0.14779597120621774)),
 (3, np.float64(0.1645835610355399)),
 (4, np.float64(0.30896029395189295)),
 (5, np.float64(0.16215297645833718)),
 (6, np.float64(0.0)),
 (7, np.float64(0.0)),
 (8, np.float64(0.0)),
 (9, np.float64(0.04041851498038727)),
 (10, np.float64(0.30896029395189295)),
 (11, np.float64(0.1798692761474187)),
 (12, np.float64(0.04885381954834263)),
 (13, np.float64(0.09011822360042081)),
 (14, np.float64(0.07677681809289848)),
 (15, np.float64(0.30896029395189295)),
 (16, np.float64(0.0)),
 (17, np.float64(0.048107409733487275)),
 (18, np.float64(0.0)),
 (19, np.float64(0.03532746530585595)),
 (20, np.float64(0.04453954977160764)),
 (21, np.float64(0.17170652117430907)),
 (22, np.float64(0.04453954977160764)),
 (23, np.float64(0.0345562126335704)),
 (24, np.float64(0.30896029395189295)),
 (25, np.float64(0.05473844512530325)),
 (26, np.float64(0.14779597120621774)),
 (27, np.float64(0.0)),
 (28

- **Sorts movies by similarity score** in descending order.  
- Uses `lambda x: x[1]` to sort based on similarity values.  
- **Most similar movies appear first**, useful for recommendations.

In [None]:
sorted_movies = sorted(index_and_similarity, key=lambda x: x[1], reverse=True)
sorted_movies

[(37, np.float64(1.0)),
 (1980, np.float64(0.9647990073250792)),
 (1738, np.float64(0.9510749375107937)),
 (1931, np.float64(0.9510749375107937)),
 (830, np.float64(0.9468823110937261)),
 (83, np.float64(0.9356815559980616)),
 (448, np.float64(0.9356815559980616)),
 (1596, np.float64(0.9356815559980616)),
 (751, np.float64(0.8834968076681108)),
 (282, np.float64(0.8716776929284086)),
 (431, np.float64(0.8716776929284086)),
 (555, np.float64(0.8716776929284086)),
 (976, np.float64(0.8716776929284086)),
 (1890, np.float64(0.8716776929284086)),
 (1589, np.float64(0.8342698909326188)),
 (71, np.float64(0.8337682139859001)),
 (187, np.float64(0.8337682139859001)),
 (303, np.float64(0.8337682139859001)),
 (943, np.float64(0.8337682139859001)),
 (1491, np.float64(0.8337682139859001)),
 (1950, np.float64(0.7800850889448651)),
 (4, np.float64(0.30896029395189295)),
 (10, np.float64(0.30896029395189295)),
 (15, np.float64(0.30896029395189295)),
 (24, np.float64(0.30896029395189295)),
 (60, np.fl
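
The same ranking can also be obtained directly from the NumPy array, without building a list of tuples. This is an alternative sketch on made-up scores, not the notebook's method; `scores`, `watched_position`, and `top_n` are illustrative names.

```python
import numpy as np

# Made-up similarity scores of one watched movie (position 2) against 6 movies.
scores = np.array([0.10, 0.80, 1.00, 0.00, 0.80, 0.30])
watched_position = 2
top_n = 3

# A stable argsort on the negated scores gives descending order and keeps
# tied scores in ascending position order, like sorted(..., reverse=True).
order = np.argsort(-scores, kind="stable")
order = order[order != watched_position][:top_n]

print(order.tolist())  # [1, 4, 5]
```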

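- **Prints the top 5 recommended movies** based on genre similarity.  
- Filters out the watched movie (`i != index`) before taking the first five entries, so the watched movie does not use up one of the five slots.  
- **Displays the similarity score** and movie details using `df.iloc[]`, since `sorted_movies` holds row positions.  
- Helps suggest movies **most similar** to the watched one.

In [None]:
print("Recommended movies: ")
# Drop the watched movie first, then keep the five most similar entries.
recommendations = [(i, score) for i, score in sorted_movies if i != index][:5]
for i, score in recommendations:
    print("-----------------")
    print("Similarity = ", score)
    print(df.iloc[i])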

Recommended movies: 
-----------------
Similarity =  0.9647990073250792
movie_id                                             tt28455771
movie_name                                          Love Nation
genre                                             Drama, Sci-Fi
overview      This is a Story of Scientist Played by Deepak ...
director                                      Basith Ahmed Khan
cast                   Dharmendra, Adeeb, Maviya, Govind Namdeo
Name: 1981, dtype: object
-----------------
Similarity =  0.9510749375107937
movie_id           tt28229250
movie_name              ANOLB
genre                  Sci-Fi
overview           Add a Plot
director      Shubham Mahadev
cast            Ravish Rajora
Name: 1739, dtype: object
-----------------
Similarity =  0.9510749375107937
movie_id                                              tt6545212
movie_name                                   Leera the Soulmate
genre                                                    Sci-Fi
overview      Salm
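
The individual steps of this notebook can be collected into one function. The version below is a sketch under stated assumptions: `recommend_by_genre` is a hypothetical name, the `toy` frame is made up, and a `ValueError` for an unknown title is one possible design choice. It compares only the watched movie against all rows instead of building the full pairwise matrix.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend_by_genre(frame, title, top_n=5):
    """Return the top_n rows most similar in genre to `title` (hypothetical helper)."""
    matches = (frame["movie_name"] == title).to_numpy().nonzero()[0]
    if len(matches) == 0:
        raise ValueError(f"{title!r} is not in the dataset")
    position = int(matches[0])

    tfidf = TfidfVectorizer().fit_transform(frame["genre"])
    # One row against all rows; the slice keeps the row two-dimensional.
    scores = cosine_similarity(tfidf[position:position + 1], tfidf).ravel()

    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    picks = [(i, s) for i, s in ranked if i != position][:top_n]

    result = frame.iloc[[i for i, _ in picks]].copy()
    result["similarity"] = [float(s) for _, s in picks]
    return result


# Made-up data in the same shape as the dataset's relevant columns.
toy = pd.DataFrame(
    {
        "movie_name": ["Movie A", "Movie B", "Movie C", "Movie D"],
        "genre": ["Comedy, Drama, Sci-Fi", "Drama, Sci-Fi", "Action, Thriller", "Comedy"],
    }
)

recommendations = recommend_by_genre(toy, "Movie A", top_n=2)
print(recommendations[["movie_name", "similarity"]])
```

With the real data, a call such as `recommend_by_genre(df, watched_movie)` would stand in for the lookup, sorting, and printing cells.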