
# Movie Rating Prediction with Python

Build a model that predicts the rating of a movie based on
features like **genre**, **director**, and **actors**. You can use regression
techniques to tackle this problem.
The goal is to analyze historical movie data and develop a model
that accurately **estimates the rating given to a movie by users or
critics**.

In [200]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [201]:
#Loading dataset
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [202]:
dataset = pd.read_csv('drive/MyDrive/Colab Notebooks/CODSOFT/IMDbMoviesIndia.csv', encoding='latin-1')

The 'encoding' parameter is needed because reading the file without it raises a UnicodeDecodeError. This error occurs when a bytes object is decoded with an encoding in which those bytes are not a valid sequence. By default, `pd.read_csv` decodes files as 'UTF-8'; the parameter above switches it to 'latin-1', which maps every possible byte to a character and therefore never fails to decode.

For ref: https://sebhastian.com/unicodedecodeerror-invalid-continuation-byte/
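
As a minimal illustration of the error described above (the byte string is a made-up example, not taken from the dataset), the byte 0xE9 is 'é' in latin-1 but is not a complete UTF-8 sequence on its own:

```python
# Hypothetical sample bytes: 'Café' encoded as latin-1
raw = "Café".encode("latin-1")   # b'Caf\xe9'

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte UTF-8 sequence that is never completed
    utf8_failed = True

# latin-1 maps every byte to exactly one character, so this always succeeds
decoded = raw.decode("latin-1")
print(utf8_failed, decoded)
```

Note that latin-1 decoding succeeding does not guarantee the file was actually written in latin-1; it only guarantees that no decode error is raised.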

## Understanding Dataset

In [203]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


The dataset contains 15,509 movie entries, with many missing values in Rating, Duration, Genre, and Votes: only 7,919 entries have a rating and only 7,240 have a duration.
Also, every attribute is stored as a Python object (i.e., a string) except Rating, which is a float.

In [204]:
dataset.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali


In [205]:
# Dropping 'Year' and 'Duration'; the 'columns' argument already implies axis=1
dataset = dataset.drop(columns = ['Year', 'Duration'])

In [206]:
dataset.shape

(15509, 8)

In [207]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Genre     13632 non-null  object 
 2   Rating    7919 non-null   float64
 3   Votes     7920 non-null   object 
 4   Director  14984 non-null  object 
 5   Actor 1   13892 non-null  object 
 6   Actor 2   13125 non-null  object 
 7   Actor 3   12365 non-null  object 
dtypes: float64(1), object(7)
memory usage: 969.4+ KB


In [208]:
dataset.nunique()

Name        13838
Genre         485
Rating         84
Votes        2034
Director     5938
Actor 1      4718
Actor 2      4891
Actor 3      4820
dtype: int64

In [209]:
# Define an aggregation function for each column
# Replacing the null values with the first non-null values
agg_functions = {
    'Name': 'first',
    'Genre': 'first',
    'Rating': 'first',
    'Votes': 'first',
    'Director': 'first',
    'Actor 1': 'first',
    'Actor 2': 'first',
    'Actor 3': 'first'
}

# Group by 'Name' and applying the aggregation functions
result_data = dataset.groupby('Name', as_index=False).agg(agg_functions)
result_data.shape

(13838, 8)
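
To see what the 'first' aggregation does with missing values, here is a small sketch on made-up data (the titles and values are hypothetical). Within each group, 'first' picks the first non-null value of each column, so complementary gaps in duplicate rows are merged:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the same title appears twice with complementary gaps
toy = pd.DataFrame({
    'Name': ['A', 'A', 'B'],
    'Genre': [np.nan, 'Drama', 'Action'],
    'Rating': [7.0, np.nan, np.nan],
})

# 'first' returns the first non-null value of each column within a group
merged = toy.groupby('Name', as_index=False).agg({'Genre': 'first', 'Rating': 'first'})
print(merged)
```

A group whose values are all missing (title 'B' for 'Rating' here) still ends up as NaN, which is why nulls remain in `result_data`.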

In [210]:
result_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13838 entries, 0 to 13837
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      13838 non-null  object 
 1   Genre     12494 non-null  object 
 2   Rating    7372 non-null   float64
 3   Votes     7373 non-null   object 
 4   Director  13376 non-null  object 
 5   Actor 1   12448 non-null  object 
 6   Actor 2   11812 non-null  object 
 7   Actor 3   11165 non-null  object 
dtypes: float64(1), object(7)
memory usage: 865.0+ KB


In [211]:
result_data.nunique()

Name        13838
Genre         470
Rating         84
Votes        1947
Director     5664
Actor 1      4498
Actor 2      4693
Actor 3      4614
dtype: int64

## Data Cleaning

In [212]:
dataset['Votes'].value_counts()
# Removing thousands separators so the vote counts can be parsed as numbers
dataset['Votes'] = dataset['Votes'].str.replace(',', '')

In [213]:
dataset[dataset['Votes']=='$5.16M']

Unnamed: 0,Name,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
9500,Moonlight: Unfortunately a Love Story,Comedy,,$5.16M,Raman Bharadwaj,Kim Sharma,Shekhar Suman,Perizaad Zorabian


In [214]:
dataset = dataset.drop(dataset[dataset['Votes'] == '$5.16M'].index)

In [215]:
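# astype returns a new Series; the result is only displayed here, not assigned,
# so the 'Votes' column itself keeps dtype object
dataset['Votes'].astype(float)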

0          NaN
1          8.0
2          NaN
3         35.0
4          NaN
         ...  
15504     11.0
15505    655.0
15506      NaN
15507      NaN
15508     20.0
Name: Votes, Length: 15508, dtype: float64
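
A possible alternative to dropping the stray row by hand is sketched below on hypothetical sample strings (this is not what the notebook does). `pd.to_numeric` with `errors='coerce'` turns any unparseable entry into NaN in a single step:

```python
import pandas as pd

# Hypothetical sample of raw vote strings, including a stray '$5.16M' entry
votes = pd.Series(['8', '1,086', None, '$5.16M'])

# Strip thousands separators, then coerce anything unparseable to NaN
votes_num = pd.to_numeric(votes.str.replace(',', ''), errors='coerce')
print(votes_num)
```

The trade-off is that coercion silently hides malformed entries, whereas the explicit lookup above makes them visible before they are removed.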

In [216]:
dataset['Votes'] = dataset['Votes'].fillna(0)

In [217]:
# Checking null values present in the dataset
dataset.isnull().sum()

Name           0
Genre       1877
Rating      7589
Votes          0
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

In [218]:
# Dropping rows with null values would discard about half of the dataset, so the nulls are filled instead.
# Replacing missing 'Rating' values with the mean rating
rate_mean = dataset['Rating'].mean()
dataset['Rating'] = dataset['Rating'].fillna(rate_mean)
# Checking null values present in the dataset
dataset.isnull().sum()

Name           0
Genre       1877
Rating         0
Votes          0
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

In [219]:
# Checking for duplicate values in the dataset
dataset.duplicated().sum()

10

In [220]:
dataset.drop_duplicates(inplace = True)

In [221]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15498 entries, 0 to 15508
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15498 non-null  object 
 1   Genre     13623 non-null  object 
 2   Rating    15498 non-null  float64
 3   Votes     15498 non-null  object 
 4   Director  14976 non-null  object 
 5   Actor 1   13885 non-null  object 
 6   Actor 2   13119 non-null  object 
 7   Actor 3   12360 non-null  object 
dtypes: float64(1), object(7)
memory usage: 1.1+ MB


In [222]:
# Replacing null values of Genre with most common value
genre_mode = dataset['Genre'].mode().values[0]
print(genre_mode)

Drama


In [223]:
dataset['Genre'] = dataset['Genre'].fillna(genre_mode)

In [224]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15498 entries, 0 to 15508
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15498 non-null  object 
 1   Genre     15498 non-null  object 
 2   Rating    15498 non-null  float64
 3   Votes     15498 non-null  object 
 4   Director  14976 non-null  object 
 5   Actor 1   13885 non-null  object 
 6   Actor 2   13119 non-null  object 
 7   Actor 3   12360 non-null  object 
dtypes: float64(1), object(7)
memory usage: 1.1+ MB


In [225]:
# Replacing missing director and actor values with the most common value of each column
dir_mode = dataset['Director'].mode().values[0]
print(dir_mode)
act1_mode = dataset['Actor 1'].mode().values[0]
print(act1_mode)
act2_mode = dataset['Actor 2'].mode().values[0]
print(act2_mode)
act3_mode = dataset['Actor 3'].mode().values[0]
print(act3_mode)

Jayant Desai
Ashok Kumar
Rekha
Pran


In [226]:
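dataset['Director'] = dataset['Director'].fillna(dir_mode)
dataset['Actor 1'] = dataset['Actor 1'].fillna(act1_mode)
dataset['Actor 2'] = dataset['Actor 2'].fillna(act2_mode)
# Filling 'Actor 3' from its own column; filling it from 'Director' instead would
# overwrite every 'Actor 3' value with the director's name, as in the tables recorded below
dataset['Actor 3'] = dataset['Actor 3'].fillna(act3_mode)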

In [227]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15498 entries, 0 to 15508
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15498 non-null  object 
 1   Genre     15498 non-null  object 
 2   Rating    15498 non-null  float64
 3   Votes     15498 non-null  object 
 4   Director  15498 non-null  object 
 5   Actor 1   15498 non-null  object 
 6   Actor 2   15498 non-null  object 
 7   Actor 3   15498 non-null  object 
dtypes: float64(1), object(7)
memory usage: 1.1+ MB


In [228]:
dataset['Name'].nunique()

13837

In [229]:
# Checking for duplicate values in the dataset
dataset['Name'].duplicated().sum()

1661

In [230]:
# Remove duplicate rows based on the 'Name' column
# Taking an explicit copy so later column assignments on dataset_1 are unambiguous
dataset_1 = dataset.drop_duplicates(subset=['Name']).copy()

In [231]:
dataset_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13837 entries, 0 to 15508
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      13837 non-null  object 
 1   Genre     13837 non-null  object 
 2   Rating    13837 non-null  float64
 3   Votes     13837 non-null  object 
 4   Director  13837 non-null  object 
 5   Actor 1   13837 non-null  object 
 6   Actor 2   13837 non-null  object 
 7   Actor 3   13837 non-null  object 
dtypes: float64(1), object(7)
memory usage: 972.9+ KB


In [232]:
dataset_1['Name'].nunique()

13837

In [233]:
# Checking for duplicate values in the dataset
dataset_1['Name'].duplicated().sum()

0

In [234]:
# Checking for null values in the dataset
dataset_1.isnull().sum()

Name        0
Genre       0
Rating      0
Votes       0
Director    0
Actor 1     0
Actor 2     0
Actor 3     0
dtype: int64

## Dataset Exploration

In [235]:
# Finding the most common genres in the dataset
dataset_gen = dataset_1['Genre'].str.split(', ', expand = True)

# Stacking the genre to get a single column
stacked_genre = dataset_gen.stack()

# Getting the counts of unique genres
genre_counts = stacked_genre.value_counts()
print(genre_counts)

Drama          8062
Action         3099
Romance        2186
Comedy         1907
Thriller       1569
Crime          1188
Family          815
Musical         527
Adventure       485
Horror          477
Mystery         458
Fantasy         401
Documentary     371
Biography       198
History         191
Animation       122
Music            86
Sport            66
Sci-Fi           53
War              46
News              9
Western           5
Reality-TV        3
Short             1
dtype: int64


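Drama is the most common genre in the dataset. Part of this count comes from the missing genres that were filled with 'Drama' during cleaning.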
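
The same count can also be sketched with `explode`, shown here on a few hypothetical genre strings. Each comma-separated entry is split into a list, and every list element then gets its own row:

```python
import pandas as pd

# Hypothetical sample of the comma-separated 'Genre' column
genres = pd.Series(['Drama', 'Drama, Musical', 'Comedy, Romance', 'Drama'])

# Split each entry into a list, then give every genre its own row
genre_counts = genres.str.split(', ').explode().value_counts()
print(genre_counts)
```

This avoids the wide intermediate frame produced by `expand = True`, which may be slightly easier to read when a film has many genres.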

In [236]:
#Sorting movies based on their rating
mov_list = dataset_1.sort_values('Rating', ascending = False)
mov_list

Unnamed: 0,Name,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
8339,Love Qubool Hai,"Drama, Romance",10.0,5,Saif Ali Sayeed,Ahaan Jha,Mahesh Narayan,Saif Ali Sayeed
5410,Half Songs,"Music, Romance",9.7,7,Sriram Raja,Raj Banerjee,Emon Chatterjee,Sriram Raja
2563,Breed,Drama,9.6,48,Bobby Kumar,Bobby Kumar,Ashfaq,Bobby Kumar
11704,Ram-Path,Documentary,9.4,5,Ashish Dubey,Ishan Jacob,Rekha,Ashish Dubey
14222,The Reluctant Crime,Drama,9.4,16,Arvind Pratap,Dharmendra Ahir,Awanish Kotnal,Arvind Pratap
...,...,...,...,...,...,...,...,...
15040,Welcome to New York,"Comedy, Drama",1.6,774,Chakri Toleti,Richard Harris,Jasmine Kaur,Chakri Toleti
6744,Jimmy,"Action, Crime, Drama",1.6,249,Raj N. Sippy,Mimoh Chakraborty,Vikas Anand,Raj N. Sippy
9639,Mumbai Can Dance Saalaa,Drama,1.6,43,Sachindra Sharma,Shakti Kapoor,Prashant Narayanan,Sachindra Sharma
3618,Desh Drohi,"Action, Thriller",1.4,3899,Jagdish A. Sharma,Kamal Rashid Khan,Gracy Singh,Jagdish A. Sharma


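The ratings and vote counts are visibly mismatched: the highest-rated films have only a handful of votes (the 10.0-rated film has 5), while some of the lowest-rated films have hundreds or thousands. However, the task asks us to focus on genre, director, and actors only.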

In [None]:
dataset_1['Director'] = dataset_1['Director'].astype(str)
dataset_1['Actor 1'] = dataset_1['Actor 1'].astype(str)
dataset_1['Actor 2'] = dataset_1['Actor 2'].astype(str)
dataset_1['Actor 3'] = dataset_1['Actor 3'].astype(str)

In [None]:
dataset_1['Genre'] = dataset_1['Genre'].astype('category')

In [239]:
dataset_1

Unnamed: 0,Name,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,Drama,5.841621,0,J.S. Randhawa,Manmauji,Birbal,J.S. Randhawa
1,#Gadhvi (He thought he was Gandhi),Drama,7.000000,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Gaurav Bakshi
2,#Homecoming,"Drama, Musical",5.841621,0,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Soumyajit Majumdar
3,#Yaaram,"Comedy, Romance",4.400000,35,Ovais Khan,Prateik,Ishita Raj,Ovais Khan
4,...And Once Again,Drama,5.841621,0,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Amol Palekar
...,...,...,...,...,...,...,...,...
15504,Zulm Ko Jala Doonga,Action,4.600000,11,Mahendra Shah,Naseeruddin Shah,Sumeet Saigal,Mahendra Shah
15505,Zulmi,"Action, Drama",4.500000,655,Kuku Kohli,Akshay Kumar,Twinkle Khanna,Kuku Kohli
15506,Zulmi Raj,Action,5.841621,0,Kiran Thej,Sangeeta Tiwari,Rekha,Kiran Thej
15507,Zulmi Shikari,Action,5.841621,0,Jayant Desai,Ashok Kumar,Rekha,Jayant Desai


In [240]:
# Removing spaces and converting to lowercase so that each name becomes a single token
def clean_data(x):
  if isinstance(x, list):
    # Returning a list (not a generator) so the cleaned values can be reused
    return [i.replace(' ', '').lower() for i in x]
  if isinstance(x, str):
    return x.replace(' ', '').lower()
  # Anything that is neither a list nor a string becomes an empty string
  return ''

In [None]:
# Applying clean data function to features
features = ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']

for feature in features:
  dataset_1[feature] = dataset_1[feature].apply(clean_data)

In [242]:
dataset_1

Unnamed: 0,Name,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,drama,5.841621,0,j.s.randhawa,manmauji,birbal,j.s.randhawa
1,#Gadhvi (He thought he was Gandhi),drama,7.000000,8,gauravbakshi,rasikadugal,vivekghamande,gauravbakshi
2,#Homecoming,"drama,musical",5.841621,0,soumyajitmajumdar,sayanigupta,plabitaborthakur,soumyajitmajumdar
3,#Yaaram,"comedy,romance",4.400000,35,ovaiskhan,prateik,ishitaraj,ovaiskhan
4,...And Once Again,drama,5.841621,0,amolpalekar,rajatkapoor,rituparnasengupta,amolpalekar
...,...,...,...,...,...,...,...,...
15504,Zulm Ko Jala Doonga,action,4.600000,11,mahendrashah,naseeruddinshah,sumeetsaigal,mahendrashah
15505,Zulmi,"action,drama",4.500000,655,kukukohli,akshaykumar,twinklekhanna,kukukohli
15506,Zulmi Raj,action,5.841621,0,kiranthej,sangeetatiwari,rekha,kiranthej
15507,Zulmi Shikari,action,5.841621,0,jayantdesai,ashokkumar,rekha,jayantdesai


In [243]:
dataset_2 = dataset_1
dataset_2

Unnamed: 0,Name,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,drama,5.841621,0,j.s.randhawa,manmauji,birbal,j.s.randhawa
1,#Gadhvi (He thought he was Gandhi),drama,7.000000,8,gauravbakshi,rasikadugal,vivekghamande,gauravbakshi
2,#Homecoming,"drama,musical",5.841621,0,soumyajitmajumdar,sayanigupta,plabitaborthakur,soumyajitmajumdar
3,#Yaaram,"comedy,romance",4.400000,35,ovaiskhan,prateik,ishitaraj,ovaiskhan
4,...And Once Again,drama,5.841621,0,amolpalekar,rajatkapoor,rituparnasengupta,amolpalekar
...,...,...,...,...,...,...,...,...
15504,Zulm Ko Jala Doonga,action,4.600000,11,mahendrashah,naseeruddinshah,sumeetsaigal,mahendrashah
15505,Zulmi,"action,drama",4.500000,655,kukukohli,akshaykumar,twinklekhanna,kukukohli
15506,Zulmi Raj,action,5.841621,0,kiranthej,sangeetatiwari,rekha,kiranthej
15507,Zulmi Shikari,action,5.841621,0,jayantdesai,ashokkumar,rekha,jayantdesai


In [244]:
# Creating a single text string per movie for the count vectorizer to work with
def cv(x):
  attr = x['Genre'].lower()
  # Appending the director and actors (every column after 'Genre')
  for i in x.iloc[1:]:
    attr = attr + ' ' + str(i)
  return attr

# Taking an explicit copy so that adding the 'cv' column does not raise SettingWithCopyWarning
dataset_2 = dataset_2[['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']].copy()
dataset_2['cv'] = dataset_2.apply(cv, axis = 1)
dataset_2['cv']


0          drama j.s.randhawa manmauji birbal j.s.randhawa
1        drama gauravbakshi rasikadugal vivekghamande g...
2        drama,musical soumyajitmajumdar sayanigupta pl...
3        comedy,romance ovaiskhan prateik ishitaraj ova...
4        drama amolpalekar rajatkapoor rituparnasengupt...
                               ...                        
15504    action mahendrashah naseeruddinshah sumeetsaig...
15505    action,drama kukukohli akshaykumar twinklekhan...
15506      action kiranthej sangeetatiwari rekha kiranthej
15507      action jayantdesai ashokkumar rekha jayantdesai
15508    action,drama k.c.bokadia dharmendra jayaprada ...
Name: cv, Length: 13837, dtype: object

In [245]:
# Count Vectorization + Cosine Similarity Matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words = 'english')
count_matrix = count.fit_transform(dataset_2['cv'])
count_matrix.shape

(13837, 11813)

In [246]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [247]:
print(cosine_sim)

[[1.         0.14285714 0.13363062 ... 0.         0.         0.13363062]
 [0.14285714 1.         0.13363062 ... 0.         0.         0.13363062]
 [0.13363062 0.13363062 1.         ... 0.         0.         0.125     ]
 ...
 [0.         0.         0.         ... 1.         0.28571429 0.13363062]
 [0.         0.         0.         ... 0.28571429 1.         0.13363062]
 [0.13363062 0.13363062 0.125      ... 0.13363062 0.13363062 1.        ]]
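
The entry 0.14285714 between rows 0 and 1 can be reproduced by hand. The sketch below is a plain-Python approximation: the helper names `tokenize` and `cosine` are made up for illustration, the regular expression mimics CountVectorizer's default token pattern (words of two or more characters), and the stop-word list is not modelled because none of these tokens are English stop words. The two strings are rebuilt from the cleaned table shown earlier.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Mimics CountVectorizer's default token pattern: words of 2+ characters
    return re.findall(r"\b\w\w+\b", text.lower())

def cosine(a, b):
    # Dot product of the two count vectors divided by the product of their norms
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# 'cv' strings for rows 0 and 1, rebuilt from the cleaned table
row0 = "drama j.s.randhawa manmauji birbal j.s.randhawa"
row1 = "drama gauravbakshi rasikadugal vivekghamande gauravbakshi"

sim = cosine(row0, row1)
print(round(sim, 8))
```

Both rows share only the token 'drama', and each count vector has squared norm 7 (one token appears twice), so the similarity is 1/7. In other words, two films with the same single genre but no shared people already score about 0.14.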


In [None]:
dataset_2['Name'] = dataset_1['Name']

In [249]:
dataset_2 = dataset_2.reset_index()
dataset_2

Unnamed: 0,index,Genre,Director,Actor 1,Actor 2,Actor 3,cv,Name
0,0,drama,j.s.randhawa,manmauji,birbal,j.s.randhawa,drama j.s.randhawa manmauji birbal j.s.randhawa,
1,1,drama,gauravbakshi,rasikadugal,vivekghamande,gauravbakshi,drama gauravbakshi rasikadugal vivekghamande g...,#Gadhvi (He thought he was Gandhi)
2,2,"drama,musical",soumyajitmajumdar,sayanigupta,plabitaborthakur,soumyajitmajumdar,"drama,musical soumyajitmajumdar sayanigupta pl...",#Homecoming
3,3,"comedy,romance",ovaiskhan,prateik,ishitaraj,ovaiskhan,"comedy,romance ovaiskhan prateik ishitaraj ova...",#Yaaram
4,4,drama,amolpalekar,rajatkapoor,rituparnasengupta,amolpalekar,drama amolpalekar rajatkapoor rituparnasengupt...,...And Once Again
...,...,...,...,...,...,...,...,...
13832,15504,action,mahendrashah,naseeruddinshah,sumeetsaigal,mahendrashah,action mahendrashah naseeruddinshah sumeetsaig...,Zulm Ko Jala Doonga
13833,15505,"action,drama",kukukohli,akshaykumar,twinklekhanna,kukukohli,"action,drama kukukohli akshaykumar twinklekhan...",Zulmi
13834,15506,action,kiranthej,sangeetatiwari,rekha,kiranthej,action kiranthej sangeetatiwari rekha kiranthej,Zulmi Raj
13835,15507,action,jayantdesai,ashokkumar,rekha,jayantdesai,action jayantdesai ashokkumar rekha jayantdesai,Zulmi Shikari


In [250]:
# Function for movie recommendation based on names
def get_recommendation(name, cosine_sim = cosine_sim):

  # Index of the movie that matches the name
  indices = pd.Series(dataset_2.index, index = dataset_2['Name'])
  idx = indices[name]

  # Pairwise similarity scores of all movie
  sim_scores = list(enumerate(cosine_sim[idx]))

  # Sorting movies based on similarity scores
  sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)

  # Getting scores of the 10 most similar movies
  sim_scores = sim_scores[1:11]

  # Getting the movie indices
  movie_idx = [i[0] for i in sim_scores]

  # Top 10 most similar movies
  return dataset_2['Name'].iloc[movie_idx]

In [251]:
get_recommendation('Zulmi', cosine_sim)

9637            Phool Aur Kaante
6938                     Kohraam
13586         Yeh Dil Aashiqanaa
875                  Anari No. 1
13476          Woh Tera Naam Tha
5471       International Khiladi
11648    Showtime - A Mocumentry
5733                  Jai Kishen
7174            Lahoo Ke Do Rang
7692               Maidan-E-Jung
Name: Name, dtype: object

## Estimating movie ratings based on director, genre, and actors

In [252]:
dataset_1

Unnamed: 0,Name,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,drama,5.841621,0,j.s.randhawa,manmauji,birbal,j.s.randhawa
1,#Gadhvi (He thought he was Gandhi),drama,7.000000,8,gauravbakshi,rasikadugal,vivekghamande,gauravbakshi
2,#Homecoming,"drama,musical",5.841621,0,soumyajitmajumdar,sayanigupta,plabitaborthakur,soumyajitmajumdar
3,#Yaaram,"comedy,romance",4.400000,35,ovaiskhan,prateik,ishitaraj,ovaiskhan
4,...And Once Again,drama,5.841621,0,amolpalekar,rajatkapoor,rituparnasengupta,amolpalekar
...,...,...,...,...,...,...,...,...
15504,Zulm Ko Jala Doonga,action,4.600000,11,mahendrashah,naseeruddinshah,sumeetsaigal,mahendrashah
15505,Zulmi,"action,drama",4.500000,655,kukukohli,akshaykumar,twinklekhanna,kukukohli
15506,Zulmi Raj,action,5.841621,0,kiranthej,sangeetatiwari,rekha,kiranthej
15507,Zulmi Shikari,action,5.841621,0,jayantdesai,ashokkumar,rekha,jayantdesai


In [253]:
# Selecting the features (X) and the target (y)
X = dataset_1[['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']]
y = dataset_1['Rating']

# One-hot encoding for categorical data
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
X_enc = enc.fit_transform(X).toarray()
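
Note that OneHotEncoder treats each distinct string as its own category, so 'drama,musical' and 'drama' share no column. One possible alternative for the genre column, sketched here on hypothetical sample strings and not used in this notebook, is a multi-hot encoding with one indicator per individual genre:

```python
import pandas as pd

# Hypothetical sample of cleaned genre strings, as they look after clean_data
genre = pd.Series(['drama', 'drama,musical', 'action,drama'])

# One indicator column per individual genre instead of per combined string
genre_flags = genre.str.get_dummies(sep=',')
print(genre_flags)
```

With this layout, films that share any genre overlap in at least one column, which may give a distance-based model such as KNN more to work with.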

In [254]:
from sklearn.model_selection import train_test_split

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X_enc, y, test_size = 0.25, random_state = 42)

In [255]:
# KNN model for prediction

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors = 25)
knn.fit(X_train, y_train)

# Making predictions
y_pred = knn.predict(X_test)

In [256]:
y_pred

array([5.97396742, 5.86395656, 5.84162142, ..., 5.69963228, 6.17932971,
       6.13896199])
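
The predictions are not scored anywhere above. A minimal sketch of how they could be evaluated follows; the arrays are hypothetical stand-ins for `y_test` and `y_pred`, and the numbers are for illustration only. It computes the mean absolute error and root mean squared error, and compares against a baseline that always predicts the mean rating:

```python
import numpy as np

# Hypothetical true ratings and predictions, for illustration only
y_true = np.array([7.0, 4.4, 5.8, 6.5])
y_hat = np.array([6.0, 5.4, 5.8, 6.0])

# Mean absolute error and root mean squared error of the predictions
mae = np.mean(np.abs(y_true - y_hat))
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Baseline: always predict the mean of the true ratings
baseline = np.full_like(y_true, y_true.mean())
baseline_mae = np.mean(np.abs(y_true - baseline))
print(mae, rmse, baseline_mae)
```

The baseline comparison matters here because missing ratings were filled with the mean (5.841621), so a model can look accurate simply by predicting values close to that mean.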