# Part 2

# Data Discovery: ML

Q1. Predicting Rating of the movie (60 pts)
---


You have to use features such as Director, Top 5 Casts, pg_val, Writer, Revenue and others to predict the imdb_score of the movie. These features are spread across different files, so you need to find the join path and join the tables using similarity techniques. Provide proof to support why the chosen columns are suitable for joining the tables. Your code should be properly commented, describing each step.

Process to be followed
- Finding the join paths for the table and joining those tables
- Joining the table to create the final training data
- Feature engineering to create new features
- Data Cleaning on feature columns
- Splitting of data into training and test set
- Training a regression model (any model of your choice)
- Evaluating predictions through mean squared error
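
The last three steps (split, train, evaluate) can be sketched as follows. This is only an illustration on synthetic data: the feature matrix, the coefficients, and the choice of `LinearRegression` are assumptions, not the model or data used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined and cleaned feature table (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # e.g. three numeric features after cleaning
y = 6.0 + X @ np.array([0.5, 0.3, -0.2]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a regression model on the training rows only.
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out rows with mean squared error.
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {mse:.4f}")
```

Any regressor with `fit` and `predict` could be swapped in for `LinearRegression` without changing the rest of the sketch.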



In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import jaccard_score

In [2]:
# Join 1

'''
Following the tutorial, I join two CSV files on their best-matching columns. The first join uses info_df and values_df.
'''

info_df = pd.read_csv('data/information.csv')
values_df = pd.read_csv('data/values.csv')
best_matches = []

In [3]:
info_df.dtypes

Unnamed: 0      int64
info           object
Run Time       object
votes          object
Director       object
Top 5 Casts    object
Writer         object
year           object
names          object
dtype: object

In [4]:
values_df.dtypes

Unnamed: 0     int64
names         object
Run Time      object
imdb_score    object
val           object
Writer        object
film          object
dtype: object

In [5]:
# J1

'''
As in the tutorial, I compare every pair of string columns across the two dataframes and report the top 3 matches.

This block is commented out because it takes a long time to run and may crash the kernel.
'''
'''
# Filter string columns from both DataFrames
string_columns_info = info_df.select_dtypes(include=['object']).columns
string_columns_values = values_df.select_dtypes(include=['object']).columns

# Iterate over each combination of string columns
for col1 in string_columns_info:
    for col2 in string_columns_values:
        # Normalize values in both columns (cast to string, lowercase, strip whitespace)
        values_df1 = info_df[col1].astype(str).str.lower().str.strip()
        values_df2 = values_df[col2].astype(str).str.lower().str.strip()

        # Vectorize values from both columns
        vectorizer = TfidfVectorizer()
        values_df1_vectorized = vectorizer.fit_transform(values_df1)
        values_df2_vectorized = vectorizer.transform(values_df2)

        # Build index for NN search
        nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
        nn_search.fit(values_df2_vectorized)

        # Query and find nearest neighbors
        _, indices = nn_search.kneighbors(values_df1_vectorized)

        # Get matched values
        matched_values_df1 = values_df1.values
        matched_values_df2 = values_df2.iloc[indices.flatten()].values

        # Compute Jaccard similarity between matched values
        jaccard_similarity = jaccard_score(matched_values_df1, matched_values_df2, average='macro')

        # Update the best matches list if the current score is higher than the lowest score in the list
        if len(best_matches) < 3 or jaccard_similarity > best_matches[-1]['score']:
            best_matches.append({'score': jaccard_similarity, 'Column1': col1, 'Column2': col2})
            best_matches.sort(key=lambda x: x['score'], reverse=True)
            best_matches = best_matches[:3]

# Print the best matching pairs of columns
print("Top 3 matching pairs of columns:")
for i, match in enumerate(best_matches):
    print(f"Match {i+1}: {match}")
'''

Top 3 matching pairs of columns:
Match 1: {'score': 0.995318087608603, 'Column1': 'Writer', 'Column2': 'Writer'}
Match 2: {'score': 0.970035697585566, 'Column1': 'info', 'Column2': 'names'}
Match 3: {'score': 0.9410680729386401, 'Column1': 'votes', 'Column2': 'val'}
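
Because the full pairwise comparison is slow, a cheaper cross-check is to compare the sets of normalized values in each column pair directly. The sketch below is an alternative to the tutorial's method, not a reproduction of it: `value_set_jaccard` is a hypothetical helper and the toy frames only stand in for `info_df` and `values_df`.

```python
import pandas as pd


def value_set_jaccard(left: pd.Series, right: pd.Series) -> float:
    """Jaccard similarity between the sets of normalized values in two columns."""
    a = set(left.dropna().astype(str).str.lower().str.strip())
    b = set(right.dropna().astype(str).str.lower().str.strip())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy frames standing in for info_df and values_df (hypothetical values).
left_df = pd.DataFrame({
    "info": ["Top Gun", "Lightyear", "Spiderhead"],
    "Director": ["Tony Scott", "Angus MacLane", "Joseph Kosinski"],
})
right_df = pd.DataFrame({
    "names": ["top gun ", "Lightyear", "Avatar"],
    "film": ["Girl Hate", "Mystery Lab", "Home Team"],
})

# Score every column pair and print the three highest-scoring pairs.
scores = [
    (value_set_jaccard(left_df[c1], right_df[c2]), c1, c2)
    for c1 in left_df.columns
    for c2 in right_df.columns
]
for score, c1, c2 in sorted(scores, reverse=True)[:3]:
    print(f"{c1} ~ {c2}: {score:.2f}")
```

This only finds exact matches after normalization, so it can serve as a quick first filter before the slower TF-IDF comparison.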


In [6]:
# J1

'''
Similar to the tutorial, I compare the info and names columns of the two dataframes.
'''

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


# Normalize titles from both tables (lowercase, strip whitespace)
titles_df1 = info_df['info'].str.lower().str.strip()
titles_df2 = values_df['names'].str.lower().str.strip()

# Vectorize movie titles from both tables
vectorizer = TfidfVectorizer()
titles_df1_vectorized = vectorizer.fit_transform(titles_df1)
titles_df2_vectorized = vectorizer.transform(titles_df2)

# Build index for NN search
nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
nn_search.fit(titles_df2_vectorized)

# Query and find nearest neighbors
distances, indices = nn_search.kneighbors(titles_df1_vectorized)

# Keep only matches whose nearest-neighbor distance is below the threshold.
# Note: kneighbors returns distances (smaller means more similar), not similarities.
similarity_threshold = 0.8  # Adjust as needed

# Store the matches in a list of dictionaries
matches = []
for idx1, (distance, idx2) in enumerate(zip(distances, indices)):
    if distance[0] < similarity_threshold:
        # Positional lookup avoids rebuilding a list of all titles on every iteration
        matched_title_df1 = titles_df1.iloc[idx1]
        matched_title_df2 = titles_df2.iloc[idx2[0]]
        matches.append({'Title_Table1': matched_title_df1, 'Title_Table2': matched_title_df2})

# Convert the list of dictionaries to a DataFrame
matches_df = pd.DataFrame(matches)

# Display the resulting DataFrame
print(matches_df.head())

              Title_Table1             Title_Table2
0        top gun: maverick        top gun: maverick
1  jurassic world dominion  jurassic world dominion
2                  top gun                  top gun
3                lightyear                lightyear
4               spiderhead               spiderhead
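
With default settings, `NearestNeighbors` returns Euclidean distances and `TfidfVectorizer` L2-normalizes each row. For two non-empty vectors, a distance d therefore corresponds to a cosine similarity of 1 - d²/2, so the 0.8 cutoff accepts pairs with cosine similarity above roughly 0.68. The sketch below checks this relationship on toy titles (hypothetical values, not taken from the data files).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy titles; every word on the right also occurs on the left,
# so no right-hand row becomes an all-zero vector.
left = ["top gun: maverick", "jurassic world dominion", "lightyear"]
right = ["top gun maverick", "jurassic world", "lightyear"]

vectorizer = TfidfVectorizer()  # default norm='l2'
left_vec = vectorizer.fit_transform(left)
right_vec = vectorizer.transform(right)

nn = NearestNeighbors(n_neighbors=1).fit(right_vec)  # default metric is Euclidean
distances, indices = nn.kneighbors(left_vec)

# Convert Euclidean distance between unit vectors to cosine similarity.
cosine_from_distance = 1.0 - distances[:, 0] ** 2 / 2.0

# Compute cosine similarity directly as the row-wise dot product, for comparison.
matched_right = right_vec[indices[:, 0]]
true_cosine = np.asarray(left_vec.multiply(matched_right).sum(axis=1)).ravel()

# Cosine similarity implied by a distance threshold of 0.8.
cosine_cutoff = 1.0 - 0.8 ** 2 / 2.0

print(np.round(cosine_from_distance, 3), np.round(true_cosine, 3), cosine_cutoff)
```

The conversion does not hold for rows that vectorize to all zeros (titles with no word in the fitted vocabulary), since those rows are not unit vectors.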


In [7]:
# J1
'''
I merge the two tables on the info and names columns, chosen on the basis of their similarity score (0.970035).
'''

merged_df = pd.merge(info_df, matches_df, left_on='info', right_on='Title_Table1', how='inner')

In [8]:
# J1
values_df['names'] = values_df['names'].replace(merged_df.set_index('Title_Table2')['info'])
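
One caveat about the cells above: `matches_df` stores lowercased, stripped titles, while `info_df['info']` keeps its original casing. The merge on `info` and `Title_Table1` therefore only matches titles that were already lowercase. A minimal sketch of one way around this, assuming toy frames in place of the real tables, is to join on a separate normalized key column and leave the original columns untouched.

```python
import pandas as pd

# Toy frames standing in for info_df and values_df (hypothetical values).
info_toy = pd.DataFrame({"info": ["Top Gun: Maverick", "Lightyear"]})
values_toy = pd.DataFrame({
    "names": ["top gun: maverick ", "LIGHTYEAR"],
    "imdb_score": [8.6, 5.2],
})

# Merging on the raw columns finds nothing because casing and whitespace differ.
raw = pd.merge(values_toy, info_toy, left_on="names", right_on="info", how="inner")

# Normalized keys are used only for matching; the original columns are preserved.
info_toy["key"] = info_toy["info"].str.lower().str.strip()
values_toy["key"] = values_toy["names"].str.lower().str.strip()
joined = pd.merge(values_toy, info_toy, on="key", how="inner")

print(len(raw), len(joined))  # prints: 0 2
```

The joined frame still carries the original `info` and `names` values, so no `replace` step is needed afterwards.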

In [9]:
# J1
new_df = pd.merge(values_df, info_df, left_on='names', right_on='info', how='inner')

new_df.head()

Unnamed: 0,Unnamed: 0_x,names_x,Run Time_x,imdb_score,val,Writer_x,film,Unnamed: 0_y,info,Run Time_y,votes,Director,Top 5 Casts,Writer_y,year,names_y
0,0,Top Gun: Maverick,"$170,000,000 (estimated)",8.6,187K,Jim Cash,Acorn and the Firestorm,0,Top Gun: Maverick,"$170,000,000 (estimated)",187K,Joseph Kosinski,"['Jack Epps Jr.', 'Peter Craig', 'Tom Cruise',...",Jim Cash,-2022,Larry Charles' Dangerous World of Comedy
1,1,Jurassic World Dominion,2 hours 27 minutes,6.0,56K,Emily Carmichael,Girl Hate,1,Jurassic World Dominion,2 hours 27 minutes,56K,Colin Trevorrow,"['Colin Trevorrow', 'Derek Connolly', 'Chris P...",Emily Carmichael,-2022,Home Team
2,2,Top Gun,"$15,000,000 (estimated)",6.9,380K,Jim Cash,Poum Poum !,2,Top Gun,"$15,000,000 (estimated)",380K,Tony Scott,"['Jack Epps Jr.', 'Ehud Yonay', 'Tom Cruise', ...",Jim Cash,-1986,Theeya Velai Seiyyanum Kumaru
3,3,Lightyear,"$71,101,257",5.2,32K,Angus MacLane,"Attack, Decay, Release",3,Lightyear,"$71,101,257",32K,Angus MacLane,"['Jason Headley', 'Matthew Aldrich', 'Chris Ev...",Angus MacLane,-2022,Sivaji: The Boss
4,4,Spiderhead,not-released,5.4,23K,George Saunders,Raiders Of The Lost Gold,4,Spiderhead,not-released,23K,Joseph Kosinski,"['Rhett Reese', 'Paul Wernick', 'Chris Hemswor...",George Saunders,-2022,Mystery Lab


In [10]:
# J1

drop_columns = ['Unnamed: 0_x','film','Writer_y','info','Unnamed: 0_y']
new_df.drop(columns=drop_columns, inplace=True)
print(new_df.shape)
new_df.head()

(25424, 11)


Unnamed: 0,names_x,Run Time_x,imdb_score,val,Writer_x,Run Time_y,votes,Director,Top 5 Casts,year,names_y
0,Top Gun: Maverick,"$170,000,000 (estimated)",8.6,187K,Jim Cash,"$170,000,000 (estimated)",187K,Joseph Kosinski,"['Jack Epps Jr.', 'Peter Craig', 'Tom Cruise',...",-2022,Larry Charles' Dangerous World of Comedy
1,Jurassic World Dominion,2 hours 27 minutes,6.0,56K,Emily Carmichael,2 hours 27 minutes,56K,Colin Trevorrow,"['Colin Trevorrow', 'Derek Connolly', 'Chris P...",-2022,Home Team
2,Top Gun,"$15,000,000 (estimated)",6.9,380K,Jim Cash,"$15,000,000 (estimated)",380K,Tony Scott,"['Jack Epps Jr.', 'Ehud Yonay', 'Tom Cruise', ...",-1986,Theeya Velai Seiyyanum Kumaru
3,Lightyear,"$71,101,257",5.2,32K,Angus MacLane,"$71,101,257",32K,Angus MacLane,"['Jason Headley', 'Matthew Aldrich', 'Chris Ev...",-2022,Sivaji: The Boss
4,Spiderhead,not-released,5.4,23K,George Saunders,not-released,23K,Joseph Kosinski,"['Rhett Reese', 'Paul Wernick', 'Chris Hemswor...",-2022,Mystery Lab


In [11]:
# J1

'''
To save the merged table and reload it later, I write it to a CSV file.
'''
new_df.to_csv('data/new.csv')
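
The `Unnamed: 0` columns that appear after each reload come from `to_csv` writing the row index by default. A small sketch, using an in-memory buffer and a toy frame (both assumptions for illustration), shows the difference that `index=False` makes.

```python
import io

import pandas as pd

df = pd.DataFrame({"names_x": ["Top Gun", "Lightyear"], "imdb_score": [6.9, 5.2]})

# Default behavior: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False omits the row index, so no extra column appears on reload.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'names_x', 'imdb_score']
print(cols_without_index)  # ['names_x', 'imdb_score']
```

Writing with `index=False` would remove the need to drop `Unnamed: 0_x` and `Unnamed: 0_y` after later joins.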

In [12]:
# Join 2

'''
Similar to join 1, I am joining the new_df obtained above with the values1_df 
'''

new_df = pd.read_csv('data/new.csv')
values1_df = pd.read_csv('data/values1.csv')
best_matches = []

In [13]:
new_df.dtypes

Unnamed: 0      int64
names_x        object
Run Time_x     object
imdb_score     object
val            object
Writer_x       object
Run Time_y     object
votes          object
Director       object
Top 5 Casts    object
year           object
names_y        object
dtype: object
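
Every feature column here is still of type `object`, including `imdb_score`, `votes` and `year`, so they need to be converted before they can feed a regression model. The sketch below shows one possible cleaning approach on toy values that mimic the formats seen above (`187K`, `2.3M`, `-2022`). `parse_count` is a hypothetical helper, not part of the notebook.

```python
import math

import pandas as pd


def parse_count(text) -> float:
    """Convert strings such as '187K' or '2.3M' to a float; return NaN if unparseable."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    text = str(text).strip().upper()
    if text and text[-1] in multipliers:
        number, factor = text[:-1], multipliers[text[-1]]
    else:
        number, factor = text, 1
    try:
        return float(number.replace(",", "")) * factor
    except ValueError:
        return math.nan


# Toy rows mimicking the string formats in the merged table (hypothetical values).
toy = pd.DataFrame({
    "votes": ["187K", "2.3M", "not-released"],
    "year": ["-2022", "-1986", "-2010"],
    "imdb_score": ["8.6", "6.9", "bad"],
})

toy["votes_num"] = toy["votes"].map(parse_count)
# Strip the leading dash before converting the year to a number.
toy["year_num"] = pd.to_numeric(toy["year"].str.lstrip("-"), errors="coerce")
# Unparseable scores become NaN instead of raising an error.
toy["imdb_score_num"] = pd.to_numeric(toy["imdb_score"], errors="coerce")

print(toy[["votes_num", "year_num", "imdb_score_num"]])
```

Rows that end up as NaN can then be dropped or imputed during the data cleaning step.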

In [14]:
values1_df.dtypes

Unnamed: 0                int64
title                    object
budget                    int64
production_companies     object
popularity              float64
imdb_score              float64
names                    object
dtype: object

In [15]:
# J2

'''
As in the tutorial, I compare every pair of string columns across the two dataframes and report the top 3 matches.

This block is commented out because it takes a long time to run and may crash the kernel.
'''
'''
# Filter string columns from both DataFrames
string_columns_new = new_df.select_dtypes(include=['object']).columns
string_columns_values1 = values1_df.select_dtypes(include=['object']).columns

# Iterate over each combination of string columns
for col1 in string_columns_new:
    for col2 in string_columns_values1:
        # Normalize values in both columns (cast to string, lowercase, strip whitespace)
        values_df1 = new_df[col1].astype(str).str.lower().str.strip()
        values_df2 = values1_df[col2].astype(str).str.lower().str.strip()

        # Vectorize values from both columns
        vectorizer = TfidfVectorizer()
        values_df1_vectorized = vectorizer.fit_transform(values_df1)
        values_df2_vectorized = vectorizer.transform(values_df2)

        # Build index for NN search
        nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
        nn_search.fit(values_df2_vectorized)

        # Query and find nearest neighbors
        _, indices = nn_search.kneighbors(values_df1_vectorized)

        # Get matched values
        matched_values_df1 = values_df1.values
        matched_values_df2 = values_df2.iloc[indices.flatten()].values

        # Compute Jaccard similarity between matched values
        jaccard_similarity = jaccard_score(matched_values_df1, matched_values_df2, average='macro')

        # Update the best matches list if the current score is higher than the lowest score in the list
        if len(best_matches) < 3 or jaccard_similarity > best_matches[-1]['score']:
            best_matches.append({'score': jaccard_similarity, 'Column1': col1, 'Column2': col2})
            best_matches.sort(key=lambda x: x['score'], reverse=True)
            best_matches = best_matches[:3]

# Print the best matching pairs of columns
print("Top 3 matching pairs of columns:")
for i, match in enumerate(best_matches):
    print(f"Match {i+1}: {match}")
'''

Top 3 matching pairs of columns:
Match 1: {'score': 0.8714188534290458, 'Column1': 'names_x', 'Column2': 'title'}
Match 2: {'score': 0.17466128453262378, 'Column1': 'names_y', 'Column2': 'title'}
Match 3: {'score': 0.03987135415268349, 'Column1': 'names_x', 'Column2': 'names'}


In [16]:
# J2

'''
Similar to the tutorial, I compare the names_x and title columns of the two dataframes.
'''

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


# Normalize titles from both tables (lowercase, strip whitespace, drop missing values)
titles_df1 = new_df['names_x'].str.lower().str.strip().dropna()
titles_df2 = values1_df['title'].str.lower().str.strip().dropna()

# Vectorize movie titles from both tables
vectorizer = TfidfVectorizer()
titles_df1_vectorized = vectorizer.fit_transform(titles_df1)
titles_df2_vectorized = vectorizer.transform(titles_df2)

# Build index for NN search
nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
nn_search.fit(titles_df2_vectorized)

# Query and find nearest neighbors
distances, indices = nn_search.kneighbors(titles_df1_vectorized)

# Keep only matches whose nearest-neighbor distance is below the threshold.
# Note: kneighbors returns distances (smaller means more similar), not similarities.
similarity_threshold = 0.8  # Adjust as needed

# Store the matches in a list of dictionaries
matches = []
for idx1, (distance, idx2) in enumerate(zip(distances, indices)):
    if distance[0] < similarity_threshold:
        # Positional lookup avoids rebuilding a list of all titles on every iteration
        matched_title_df1 = titles_df1.iloc[idx1]
        matched_title_df2 = titles_df2.iloc[idx2[0]]
        matches.append({'Title_Table1': matched_title_df1, 'Title_Table2': matched_title_df2})

# Convert the list of dictionaries to a DataFrame
matches_df = pd.DataFrame(matches)

# Display the resulting DataFrame
print(matches_df.head())

              Title_Table1             Title_Table2
0        top gun: maverick        top gun: maverick
1  jurassic world dominion  jurassic world dominion
2                  top gun                  top gun
3                lightyear                lightyear
4               spiderhead               spiderhead


In [17]:
# J2
'''
I merge the two tables on the names_x and title columns, chosen on the basis of their similarity score (0.87141).
'''
merged_df = pd.merge(new_df, matches_df, left_on='names_x', right_on='Title_Table1', how='inner')

In [18]:
# J2
values1_df['title'] = values1_df['title'].replace(merged_df.set_index('Title_Table2')['names_x'])

In [19]:
# J2

new2_df = pd.merge(values1_df, new_df, left_on='title', right_on='names_x', how='inner')

new2_df.head()

Unnamed: 0,Unnamed: 0_x,title,budget,production_companies,popularity,imdb_score_x,names,Unnamed: 0_y,names_x,Run Time_x,imdb_score_y,val,Writer_x,Run Time_y,votes,Director,Top 5 Casts,year,names_y
0,0,Inception,9296,"Legendary Pictures, Syncopy, Warner Bros. Pict...",83.952,6.94,Stelvio,73,Inception,"$160,000,000 (estimated)",8.8,2.3M,Christopher Nolan,"$160,000,000 (estimated)",2.3M,Christopher Nolan,"['Leonardo DiCaprio', 'Joseph Gordon-Levitt', ...",-2010,Fary Is the New Black
1,1,Interstellar,2117,"Legendary Pictures, Syncopy, Lynda Obst Produc...",140.241,5.57,Through the Windows,8571,Interstellar,"$165,000,000 (estimated)",8.6,1.7M,Jonathan Nolan,"$165,000,000 (estimated)",1.7M,Christopher Nolan,"['Christopher Nolan', 'Matthew McConaughey', '...",-2014,The Invisible Guardian
2,2,The Dark Knight,1125,"DC Comics, Legendary Pictures, Syncopy, Isobel...",130.643,3.15,Mabo no nankai funsenki,67,The Dark Knight,"$185,000,000 (estimated)",9.0,2.6M,Jonathan Nolan,"$185,000,000 (estimated)",2.6M,Christopher Nolan,"['Christopher Nolan', 'David S. Goyer', 'Chris...",-2008,Marianne
3,3,Avatar,6484,"Dune Entertainment, Lightstorm Entertainment, ...",79.932,7.61,Tricks of Trade,89,Avatar,"$237,000,000 (estimated)",7.8,1.2M,James Cameron,"$237,000,000 (estimated)",1.2M,James Cameron,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",-2009,Mute
4,4,The Avengers,4613,Marvel Studios,98.082,6.84,The Well Builders,192,The Avengers,"$220,000,000 (estimated)",8.0,1.4M,Joss Whedon,"$220,000,000 (estimated)",1.4M,Joss Whedon,"['Zak Penn', 'Robert Downey Jr.', 'Chris Evans...",-2012,Scary Movie 4


In [20]:
# J2

drop_columns = ['Unnamed: 0_x','val','Unnamed: 0_y']
new2_df.drop(columns=drop_columns, inplace=True)
print(new2_df.shape)
new2_df.head()

(23301, 16)


Unnamed: 0,title,budget,production_companies,popularity,imdb_score_x,names,names_x,Run Time_x,imdb_score_y,Writer_x,Run Time_y,votes,Director,Top 5 Casts,year,names_y
0,Inception,9296,"Legendary Pictures, Syncopy, Warner Bros. Pict...",83.952,6.94,Stelvio,Inception,"$160,000,000 (estimated)",8.8,Christopher Nolan,"$160,000,000 (estimated)",2.3M,Christopher Nolan,"['Leonardo DiCaprio', 'Joseph Gordon-Levitt', ...",-2010,Fary Is the New Black
1,Interstellar,2117,"Legendary Pictures, Syncopy, Lynda Obst Produc...",140.241,5.57,Through the Windows,Interstellar,"$165,000,000 (estimated)",8.6,Jonathan Nolan,"$165,000,000 (estimated)",1.7M,Christopher Nolan,"['Christopher Nolan', 'Matthew McConaughey', '...",-2014,The Invisible Guardian
2,The Dark Knight,1125,"DC Comics, Legendary Pictures, Syncopy, Isobel...",130.643,3.15,Mabo no nankai funsenki,The Dark Knight,"$185,000,000 (estimated)",9.0,Jonathan Nolan,"$185,000,000 (estimated)",2.6M,Christopher Nolan,"['Christopher Nolan', 'David S. Goyer', 'Chris...",-2008,Marianne
3,Avatar,6484,"Dune Entertainment, Lightstorm Entertainment, ...",79.932,7.61,Tricks of Trade,Avatar,"$237,000,000 (estimated)",7.8,James Cameron,"$237,000,000 (estimated)",1.2M,James Cameron,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",-2009,Mute
4,The Avengers,4613,Marvel Studios,98.082,6.84,The Well Builders,The Avengers,"$220,000,000 (estimated)",8.0,Joss Whedon,"$220,000,000 (estimated)",1.4M,Joss Whedon,"['Zak Penn', 'Robert Downey Jr.', 'Chris Evans...",-2012,Scary Movie 4
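
The inner join returns 23301 rows, fewer than the 25424 rows in `new_df`. An inner join silently drops unmatched keys and repeats rows when a key is duplicated, so it can help to audit the keys before relying on the result. A minimal sketch on toy frames (hypothetical values), using the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Toy frames standing in for values1_df and new_df (hypothetical values).
left = pd.DataFrame({"title": ["Inception", "Avatar", "Avatar"], "budget": [1, 2, 3]})
right = pd.DataFrame({
    "names_x": ["Inception", "Avatar", "Top Gun"],
    "votes": ["2.3M", "1.2M", "380K"],
})

# An outer join with indicator=True labels each row by where its key was found.
audit = pd.merge(
    left, right, left_on="title", right_on="names_x", how="outer", indicator=True
)
matched = int((audit["_merge"] == "both").sum())
right_only = int((audit["_merge"] == "right_only").sum())
left_only = int((audit["_merge"] == "left_only").sum())

# Duplicate keys on either side multiply rows in the joined result.
duplicate_keys = int(left["title"].duplicated().sum())

print(matched, left_only, right_only, duplicate_keys)
```

Rows labeled `left_only` or `right_only` are the ones an inner join would discard.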


In [21]:
# J2

'''
To save the merged table and reload it later, I write it to a CSV file.
'''
new2_df.to_csv('data/new2.csv')

In [22]:
# Join 3

'''
Similar to join 2, I am joining the new2_df obtained above with the productions_df 
'''

new2_df = pd.read_csv('data/new2.csv')
productions_df = pd.read_csv('data/productions.csv')
best_matches = []

In [23]:
new2_df.dtypes

Unnamed: 0                int64
title                    object
budget                    int64
production_companies     object
popularity              float64
imdb_score_x            float64
names                    object
names_x                  object
Run Time_x               object
imdb_score_y             object
Writer_x                 object
Run Time_y               object
votes                    object
Director                 object
Top 5 Casts              object
year                     object
names_y                  object
dtype: object

In [24]:
productions_df.dtypes

Unnamed: 0               int64
films                   object
status                  object
tagline                 object
pg_val                  object
original_language       object
production_companies    object
spoken_languages        object
dtype: object

In [25]:
# J3

'''
As in the tutorial, I compare every pair of string columns across the two dataframes and report the top 3 matches.

This block is commented out because it takes a long time to run and may crash the kernel.
'''
'''
# Filter string columns from both DataFrames
string_columns_new2 = new2_df.select_dtypes(include=['object']).columns
string_columns_productions = productions_df.select_dtypes(include=['object']).columns

# Iterate over each combination of string columns
for col1 in string_columns_new2:
    for col2 in string_columns_productions:
        # Normalize values in both columns (cast to string, lowercase, strip whitespace)
        values_df1 = new2_df[col1].astype(str).str.lower().str.strip()
        values_df2 = productions_df[col2].astype(str).str.lower().str.strip()

        # Vectorize values from both columns
        vectorizer = TfidfVectorizer()
        values_df1_vectorized = vectorizer.fit_transform(values_df1)
        values_df2_vectorized = vectorizer.transform(values_df2)

        # Build index for NN search
        nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
        nn_search.fit(values_df2_vectorized)

        # Query and find nearest neighbors
        _, indices = nn_search.kneighbors(values_df1_vectorized)

        # Get matched values
        matched_values_df1 = values_df1.values
        matched_values_df2 = values_df2.iloc[indices.flatten()].values

        # Compute Jaccard similarity between matched values
        jaccard_similarity = jaccard_score(matched_values_df1, matched_values_df2, average='macro')

        # Update the best matches list if the current score is higher than the lowest score in the list
        if len(best_matches) < 3 or jaccard_similarity > best_matches[-1]['score']:
            best_matches.append({'score': jaccard_similarity, 'Column1': col1, 'Column2': col2})
            best_matches.sort(key=lambda x: x['score'], reverse=True)
            best_matches = best_matches[:3]

# Print the best matching pairs of columns
print("Top 3 matching pairs of columns:")
for i, match in enumerate(best_matches):
    print(f"Match {i+1}: {match}")
'''

Top 3 matching pairs of columns:
Match 1: {'score': 0.9737970419831751, 'Column1': 'production_companies', 'Column2': 'production_companies'}
Match 2: {'score': 0.973085983510012, 'Column1': 'title', 'Column2': 'films'}
Match 3: {'score': 0.973085983510012, 'Column1': 'names_x', 'Column2': 'films'}


In [26]:
# J3

'''
Similar to the tutorial, I compare the title and films columns of the two dataframes.
'''

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


# Normalize titles from both tables (lowercase, strip whitespace, drop missing values)
titles_df1 = new2_df['title'].str.lower().str.strip().dropna()
titles_df2 = productions_df['films'].str.lower().str.strip().dropna()

# Vectorize movie titles from both tables
vectorizer = TfidfVectorizer()
titles_df1_vectorized = vectorizer.fit_transform(titles_df1)
titles_df2_vectorized = vectorizer.transform(titles_df2)

# Build index for NN search
nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
nn_search.fit(titles_df2_vectorized)

# Query and find nearest neighbors
distances, indices = nn_search.kneighbors(titles_df1_vectorized)

# Keep only matches whose nearest-neighbor distance is below the threshold.
# Note: kneighbors returns distances (smaller means more similar), not similarities.
similarity_threshold = 0.8  # Adjust as needed

# Store the matches in a list of dictionaries
matches = []
for idx1, (distance, idx2) in enumerate(zip(distances, indices)):
    if distance[0] < similarity_threshold:
        # Positional lookup avoids rebuilding a list of all titles on every iteration
        matched_title_df1 = titles_df1.iloc[idx1]
        matched_title_df2 = titles_df2.iloc[idx2[0]]
        matches.append({'Title_Table1': matched_title_df1, 'Title_Table2': matched_title_df2})

# Convert the list of dictionaries to a DataFrame
matches_df = pd.DataFrame(matches)

# Display the resulting DataFrame
print(matches_df.head())

      Title_Table1     Title_Table2
0        inception        inception
1     interstellar     interstellar
2  the dark knight  the dark knight
3           avatar           avatar
4     the avengers     the avengers


In [27]:
# J3

'''
I merge the two tables on the title and films columns, chosen on the basis of their similarity score (0.97308).
'''

merged_df = pd.merge(new2_df, matches_df, left_on='title', right_on='Title_Table1', how='inner')

In [28]:
# J3
productions_df['films'] = productions_df['films'].replace(merged_df.set_index('Title_Table2')['title'])

In [29]:
new2_df.columns

Index(['Unnamed: 0', 'title', 'budget', 'production_companies', 'popularity',
       'imdb_score_x', 'names', 'names_x', 'Run Time_x', 'imdb_score_y',
       'Writer_x', 'Run Time_y', 'votes', 'Director', 'Top 5 Casts', 'year',
       'names_y'],
      dtype='object')

In [30]:
# J3
new3_df = pd.merge(productions_df, new2_df, left_on='films', right_on='title', how='inner')

new3_df.head()

Unnamed: 0,Unnamed: 0_x,films,status,tagline,pg_val,original_language,production_companies_x,spoken_languages,Unnamed: 0_y,title,...,names_x,Run Time_x,imdb_score_y,Writer_x,Run Time_y,votes,Director,Top 5 Casts,year,names_y
0,0,Inception,Released,Your mind is the scene of the crime.,F,en,"Legendary Pictures, Syncopy, Warner Bros. Pict...","English, French, Japanese, Swahili",0,Inception,...,Inception,"$160,000,000 (estimated)",8.8,Christopher Nolan,"$160,000,000 (estimated)",2.3M,Christopher Nolan,"['Leonardo DiCaprio', 'Joseph Gordon-Levitt', ...",-2010,Fary Is the New Black
1,1,Interstellar,Released,Mankind was born on Earth. It was never meant ...,False,en,"Legendary Pictures, Syncopy, Lynda Obst Produc...",English,1,Interstellar,...,Interstellar,"$165,000,000 (estimated)",8.6,Jonathan Nolan,"$165,000,000 (estimated)",1.7M,Christopher Nolan,"['Christopher Nolan', 'Matthew McConaughey', '...",-2014,The Invisible Guardian
2,2,The Dark Knight,Released,Welcome to a world without rules.,F,en,"DC Comics, Legendary Pictures, Syncopy, Isobel...","English, Mandarin",2,The Dark Knight,...,The Dark Knight,"$185,000,000 (estimated)",9.0,Jonathan Nolan,"$185,000,000 (estimated)",2.6M,Christopher Nolan,"['Christopher Nolan', 'David S. Goyer', 'Chris...",-2008,Marianne
3,3,Avatar,Released,Enter the world of Pandora.,F,en,"Dune Entertainment, Lightstorm Entertainment, ...","English, Spanish",3,Avatar,...,Avatar,"$237,000,000 (estimated)",7.8,James Cameron,"$237,000,000 (estimated)",1.2M,James Cameron,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",-2009,Mute
4,4,The Avengers,Released,Some assembly required.,F,en,Marvel Studios,"English, Hindi, Russian",4,The Avengers,...,The Avengers,"$220,000,000 (estimated)",8.0,Joss Whedon,"$220,000,000 (estimated)",1.4M,Joss Whedon,"['Zak Penn', 'Robert Downey Jr.', 'Chris Evans...",-2012,Scary Movie 4


In [31]:
# J3

drop_columns = ['Unnamed: 0_x','Unnamed: 0_y','status','original_language','production_companies_x','spoken_languages','Run Time_y','votes']
new3_df.drop(columns=drop_columns, inplace=True)
print(new3_df.shape)
new3_df.head()

(23301, 17)


Unnamed: 0,films,tagline,pg_val,title,budget,production_companies_y,popularity,imdb_score_x,names,names_x,Run Time_x,imdb_score_y,Writer_x,Director,Top 5 Casts,year,names_y
0,Inception,Your mind is the scene of the crime.,F,Inception,9296,"Legendary Pictures, Syncopy, Warner Bros. Pict...",83.952,6.94,Stelvio,Inception,"$160,000,000 (estimated)",8.8,Christopher Nolan,Christopher Nolan,"['Leonardo DiCaprio', 'Joseph Gordon-Levitt', ...",-2010,Fary Is the New Black
1,Interstellar,Mankind was born on Earth. It was never meant ...,False,Interstellar,2117,"Legendary Pictures, Syncopy, Lynda Obst Produc...",140.241,5.57,Through the Windows,Interstellar,"$165,000,000 (estimated)",8.6,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'Matthew McConaughey', '...",-2014,The Invisible Guardian
2,The Dark Knight,Welcome to a world without rules.,F,The Dark Knight,1125,"DC Comics, Legendary Pictures, Syncopy, Isobel...",130.643,3.15,Mabo no nankai funsenki,The Dark Knight,"$185,000,000 (estimated)",9.0,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'David S. Goyer', 'Chris...",-2008,Marianne
3,Avatar,Enter the world of Pandora.,F,Avatar,6484,"Dune Entertainment, Lightstorm Entertainment, ...",79.932,7.61,Tricks of Trade,Avatar,"$237,000,000 (estimated)",7.8,James Cameron,James Cameron,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",-2009,Mute
4,The Avengers,Some assembly required.,F,The Avengers,4613,Marvel Studios,98.082,6.84,The Well Builders,The Avengers,"$220,000,000 (estimated)",8.0,Joss Whedon,Joss Whedon,"['Zak Penn', 'Robert Downey Jr.', 'Chris Evans...",-2012,Scary Movie 4


In [32]:
# J3

'''
To save the merged table and reload it later, I write it to a CSV file.
'''
new3_df.to_csv('data/new3.csv')

In [33]:
# Join 4

'''
Similar to join 3, I am joining the new3_df obtained above with the income_df 
'''

new3_df = pd.read_csv('data/new3.csv')
income_df = pd.read_csv('data/income.csv')
best_matches = []

In [34]:
new3_df.dtypes

Unnamed: 0                  int64
films                      object
tagline                    object
pg_val                     object
title                      object
budget                      int64
production_companies_y     object
popularity                float64
imdb_score_x              float64
names                      object
names_x                    object
Run Time_x                 object
imdb_score_y               object
Writer_x                   object
Director                   object
Top 5 Casts                object
year                       object
names_y                    object
dtype: object

In [35]:
income_df.dtypes

Unnamed: 0.1              int64
Unnamed: 0                int64
revenue                  object
budget                   object
vote_count                int64
runtime                   int64
overview                 object
production_companies     object
popularity              float64
description              object
dtype: object

In [36]:
# J4

'''
As in the tutorial, I compare every pair of string columns across the two dataframes and report the top 3 matches.

This block is commented out because it takes a long time to run and may crash the kernel.
'''

'''
# Filter string columns from both DataFrames
string_columns_new3 = new3_df.select_dtypes(include=['object']).columns
string_columns_income = income_df.select_dtypes(include=['object']).columns

# Iterate over each combination of string columns
for col1 in string_columns_new3:
    for col2 in string_columns_income:
        # Normalize values in both columns (cast to string, lowercase, strip whitespace)
        values_df1 = new3_df[col1].astype(str).str.lower().str.strip()
        values_df2 = income_df[col2].astype(str).str.lower().str.strip()

        # Vectorize values from both columns
        vectorizer = TfidfVectorizer()
        values_df1_vectorized = vectorizer.fit_transform(values_df1)
        values_df2_vectorized = vectorizer.transform(values_df2)

        # Build index for NN search
        nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
        nn_search.fit(values_df2_vectorized)

        # Query and find nearest neighbors
        _, indices = nn_search.kneighbors(values_df1_vectorized)

        # Get matched values
        matched_values_df1 = values_df1.values
        matched_values_df2 = values_df2.iloc[indices.flatten()].values

        # Compute Jaccard similarity between matched values
        jaccard_similarity = jaccard_score(matched_values_df1, matched_values_df2, average='macro')

        # Update the best matches list if the current score is higher than the lowest score in the list
        if len(best_matches) < 3 or jaccard_similarity > best_matches[-1]['score']:
            best_matches.append({'score': jaccard_similarity, 'Column1': col1, 'Column2': col2})
            best_matches.sort(key=lambda x: x['score'], reverse=True)
            best_matches = best_matches[:3]

# Print the best matching pairs of columns
print("Top 3 matching pairs of columns:")
for i, match in enumerate(best_matches):
    print(f"Match {i+1}: {match}")
'''

Top 3 matching pairs of columns:
Match 1: {'score': 0.9872098131705512, 'Column1': 'tagline', 'Column2': 'overview'}
Match 2: {'score': 0.9737970419831751, 'Column1': 'production_companies_y', 'Column2': 'production_companies'}
Match 3: {'score': 0.20969583229890382, 'Column1': 'films', 'Column2': 'overview'}


In [37]:
# J4

'''
Similar to the tutorial, I compare the tagline and overview columns of the two dataframes.
'''
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


# Normalize tagline and overview text from both tables (lowercase, strip whitespace, drop missing values)
titles_df1 = new3_df['tagline'].str.lower().str.strip().dropna()
titles_df2 = income_df['overview'].str.lower().str.strip().dropna()

# Vectorize the tagline and overview text from both tables
vectorizer = TfidfVectorizer()
titles_df1_vectorized = vectorizer.fit_transform(titles_df1)
titles_df2_vectorized = vectorizer.transform(titles_df2)

# Build index for NN search
nn_search = NearestNeighbors(n_neighbors=1, algorithm='auto')
nn_search.fit(titles_df2_vectorized)

# Query and find nearest neighbors
distances, indices = nn_search.kneighbors(titles_df1_vectorized)

# Match titles based on similarity threshold
similarity_threshold = 0.8  # Adjust as needed

# Store the matches in a list of dictionaries
matches = []
for idx1, (distance, idx2) in enumerate(zip(distances, indices)):
    if distance < similarity_threshold:
        matched_title_df1 = list(titles_df1)[idx1]
        matched_title_df2 = list(titles_df2)[idx2[0]]
        matches.append({'Title_Table1': matched_title_df1, 'Title_Table2': matched_title_df2})

# Convert the list of dictionaries to a DataFrame
matches_df = pd.DataFrame(matches)

# Display the resulting DataFrame
print(matches_df.head())

                                        Title_Table1  \
0               your mind is the scene of the crime.   
1  mankind was born on earth. it was never meant ...   
2                  welcome to a world without rules.   
3                        enter the world of pandora.   
4                            some assembly required.   

                                        Title_Table2  
0               your mind is the scene of the crime.  
1  mankind was born on earth. it was never meant ...  
2                  welcome to a world without rules.  
3                        enter the world of pandora.  
4                            some assembly required.  
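The 0.8 cut-off in the cell above is applied to a distance, not to a similarity score. As a rough guide, assuming `TfidfVectorizer` keeps its default L2 normalisation and `NearestNeighbors` its default Euclidean metric, the distance d between two non-zero TF-IDF rows and their cosine similarity s satisfy d^2 = 2(1 - s). The sketch below checks that identity on two made-up unit vectors and converts the 0.8 cut-off into the equivalent cosine similarity. The helper name `cosine_from_euclidean` is illustrative only.

```python
import math

def cosine_from_euclidean(distance):
    """Cosine similarity implied by a Euclidean distance between unit vectors."""
    return 1.0 - distance ** 2 / 2.0

# Two hypothetical L2-normalised vectors (not taken from the dataset)
u = [0.6, 0.8]
v = [1.0, 0.0]

euclidean = math.dist(u, v)
cosine = sum(a * b for a, b in zip(u, v))

# The identity d^2 = 2 * (1 - s) holds for unit vectors
assert math.isclose(cosine_from_euclidean(euclidean), cosine)

# Cosine similarity that corresponds to the 0.8 distance cut-off
threshold_as_cosine = cosine_from_euclidean(0.8)
print(round(threshold_as_cosine, 2))  # 0.68
```

Under these assumptions, a distance below 0.8 corresponds to a cosine similarity above roughly 0.68. Rows of the second table that share no vocabulary with the first become all-zero vectors, and the identity does not apply to them.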


In [38]:
# J4

'''
I am merging the two tables on the tagline and overview columns, which had the highest similarity score (0.98720)
'''

merged_df = pd.merge(new3_df, matches_df, left_on='tagline', right_on='Title_Table1', how='inner')

In [39]:
# J4
income_df['overview'] = income_df['overview'].replace(merged_df.set_index('Title_Table2')['tagline'])

In [40]:
# J4
final_df = pd.merge(income_df, new3_df, left_on='overview', right_on='tagline', how='inner')

final_df.head()

Unnamed: 0,Unnamed: 0.1,Unnamed: 0_x,revenue,budget_x,vote_count,runtime,overview,production_companies,popularity_x,description,...,imdb_score_x,names,names_x,Run Time_x,imdb_score_y,Writer_x,Director,Top 5 Casts,year,names_y
0,0,0,825532764,160000000,34495,148,Your mind is the scene of the crime.,"Legendary Pictures, Syncopy, Warner Bros. Pict...",83.952,your Mind is Scene of crime .,...,6.94,Stelvio,Inception,"$160,000,000 (estimated)",8.8,Christopher Nolan,Christopher Nolan,"['Leonardo DiCaprio', 'Joseph Gordon-Levitt', ...",-2010,Fary Is the New Black
1,1,1,701M,165M,32571,169,Mankind was born on Earth. It was never meant ...,"Legendary Pictures, Syncopy, Lynda Obst Produc...",140.241,Mankind was Born on earth . it Never Meant d...,...,5.57,Through the Windows,Interstellar,"$165,000,000 (estimated)",8.6,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'Matthew McConaughey', '...",-2014,The Invisible Guardian
2,2,2,1004558K,185000000,30619,152,Welcome to a world without rules.,"DC Comics, Legendary Pictures, Syncopy, Isobel...",130.643,welcome world without rules .,...,3.15,Mabo no nankai funsenki,The Dark Knight,"$185,000,000 (estimated)",9.0,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'David S. Goyer', 'Chris...",-2008,Marianne
3,3,3,2923M,237000000,29815,162,Enter the world of Pandora.,"Dune Entertainment, Lightstorm Entertainment, ...",79.932,enter the world of pandora .,...,7.61,Tricks of Trade,Avatar,"$237,000,000 (estimated)",7.8,James Cameron,James Cameron,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",-2009,Mute
4,4,4,1518815515,220000000,29166,143,Some assembly required.,Marvel Studios,98.082,Some assembly Required .,...,6.84,The Well Builders,The Avengers,"$220,000,000 (estimated)",8.0,Joss Whedon,Joss Whedon,"['Zak Penn', 'Robert Downey Jr.', 'Chris Evans...",-2012,Scary Movie 4
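One way to sanity-check a text-key join like the one above is to normalise both keys, merge with `indicator=True`, and let `validate` raise an error if a key is duplicated. The sketch below uses a tiny made-up pair of tables; the third film and the column name `join_key` are hypothetical, and this is not the join performed in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two tables; the last row on the left has no partner
left = pd.DataFrame({
    "tagline": ["Your mind is the scene of the crime.",
                "Some assembly required.",
                "Unmatched tagline"],
    "film": ["Inception", "The Avengers", "Hypothetical Film"],
})
right = pd.DataFrame({
    "overview": ["your mind is the scene of the crime. ",
                 "SOME ASSEMBLY REQUIRED."],
    "revenue": [825532764, 1518815515],
})

def normalize(text):
    """Lowercase and strip whitespace so that near-identical keys compare equal."""
    return text.str.lower().str.strip()

left["join_key"] = normalize(left["tagline"])
right["join_key"] = normalize(right["overview"])

# validate raises MergeError if either side has duplicate keys;
# indicator adds a _merge column that shows which rows found a partner
joined = left.merge(right, on="join_key", how="left",
                    validate="one_to_one", indicator=True)
matched = joined[joined["_merge"] == "both"]

print(len(joined), len(matched))
```

Comparing the number of matched rows with the size of each input is a cheap way to spot unmatched keys or unexpected row multiplication after a join.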


In [41]:
# J4

drop_columns = ['Unnamed: 0.1','Unnamed: 0_x','Unnamed: 0_y']
final_df.drop(columns=drop_columns, inplace=True)
print(final_df.shape)
final_df.head()

(23708, 25)


Unnamed: 0,revenue,budget_x,vote_count,runtime,overview,production_companies,popularity_x,description,films,tagline,...,imdb_score_x,names,names_x,Run Time_x,imdb_score_y,Writer_x,Director,Top 5 Casts,year,names_y
0,825532764,160000000,34495,148,Your mind is the scene of the crime.,"Legendary Pictures, Syncopy, Warner Bros. Pict...",83.952,your Mind is Scene of crime .,Inception,Your mind is the scene of the crime.,...,6.94,Stelvio,Inception,"$160,000,000 (estimated)",8.8,Christopher Nolan,Christopher Nolan,"['Leonardo DiCaprio', 'Joseph Gordon-Levitt', ...",-2010,Fary Is the New Black
1,701M,165M,32571,169,Mankind was born on Earth. It was never meant ...,"Legendary Pictures, Syncopy, Lynda Obst Produc...",140.241,Mankind was Born on earth . it Never Meant d...,Interstellar,Mankind was born on Earth. It was never meant ...,...,5.57,Through the Windows,Interstellar,"$165,000,000 (estimated)",8.6,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'Matthew McConaughey', '...",-2014,The Invisible Guardian
2,1004558K,185000000,30619,152,Welcome to a world without rules.,"DC Comics, Legendary Pictures, Syncopy, Isobel...",130.643,welcome world without rules .,The Dark Knight,Welcome to a world without rules.,...,3.15,Mabo no nankai funsenki,The Dark Knight,"$185,000,000 (estimated)",9.0,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'David S. Goyer', 'Chris...",-2008,Marianne
3,2923M,237000000,29815,162,Enter the world of Pandora.,"Dune Entertainment, Lightstorm Entertainment, ...",79.932,enter the world of pandora .,Avatar,Enter the world of Pandora.,...,7.61,Tricks of Trade,Avatar,"$237,000,000 (estimated)",7.8,James Cameron,James Cameron,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",-2009,Mute
4,1518815515,220000000,29166,143,Some assembly required.,Marvel Studios,98.082,Some assembly Required .,The Avengers,Some assembly required.,...,6.84,The Well Builders,The Avengers,"$220,000,000 (estimated)",8.0,Joss Whedon,Joss Whedon,"['Zak Penn', 'Robert Downey Jr.', 'Chris Evans...",-2012,Scary Movie 4


In [42]:
# J4

'''
To avoid re-running the joins, I am saving the merged table to a csv file so that it can be read back later.
'''
final_df.to_csv('data/final_dataset.csv')

In [43]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
import seaborn as sns

In [44]:
# Final training data

final_df = pd.read_csv("data/final_dataset.csv")

final_df.head()

Unnamed: 0.1,Unnamed: 0,revenue,budget_x,vote_count,runtime,overview,production_companies,popularity_x,description,films,...,imdb_score_x,names,names_x,Run Time_x,imdb_score_y,Writer_x,Director,Top 5 Casts,year,names_y
0,0,825532764,160000000,34495,148,Your mind is the scene of the crime.,"Legendary Pictures, Syncopy, Warner Bros. Pict...",83.952,your Mind is Scene of crime .,Inception,...,6.94,Stelvio,Inception,"$160,000,000 (estimated)",8.8,Christopher Nolan,Christopher Nolan,"['Leonardo DiCaprio', 'Joseph Gordon-Levitt', ...",-2010,Fary Is the New Black
1,1,701M,165M,32571,169,Mankind was born on Earth. It was never meant ...,"Legendary Pictures, Syncopy, Lynda Obst Produc...",140.241,Mankind was Born on earth . it Never Meant d...,Interstellar,...,5.57,Through the Windows,Interstellar,"$165,000,000 (estimated)",8.6,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'Matthew McConaughey', '...",-2014,The Invisible Guardian
2,2,1004558K,185000000,30619,152,Welcome to a world without rules.,"DC Comics, Legendary Pictures, Syncopy, Isobel...",130.643,welcome world without rules .,The Dark Knight,...,3.15,Mabo no nankai funsenki,The Dark Knight,"$185,000,000 (estimated)",9.0,Jonathan Nolan,Christopher Nolan,"['Christopher Nolan', 'David S. Goyer', 'Chris...",-2008,Marianne
3,3,2923M,237000000,29815,162,Enter the world of Pandora.,"Dune Entertainment, Lightstorm Entertainment, ...",79.932,enter the world of pandora .,Avatar,...,7.61,Tricks of Trade,Avatar,"$237,000,000 (estimated)",7.8,James Cameron,James Cameron,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",-2009,Mute
4,4,1518815515,220000000,29166,143,Some assembly required.,Marvel Studios,98.082,Some assembly Required .,The Avengers,...,6.84,The Well Builders,The Avengers,"$220,000,000 (estimated)",8.0,Joss Whedon,Joss Whedon,"['Zak Penn', 'Robert Downey Jr.', 'Chris Evans...",-2012,Scary Movie 4


In [45]:
final_df.dtypes

Unnamed: 0                  int64
revenue                    object
budget_x                   object
vote_count                  int64
runtime                     int64
overview                   object
production_companies       object
popularity_x              float64
description                object
films                      object
tagline                    object
pg_val                     object
title                      object
budget_y                    int64
production_companies_y     object
popularity_y              float64
imdb_score_x              float64
names                      object
names_x                    object
Run Time_x                 object
imdb_score_y               object
Writer_x                   object
Director                   object
Top 5 Casts                object
year                       object
names_y                    object
dtype: object

In [46]:
# converting int columns to float 
int_columns = ['vote_count', 'runtime', 'budget_y']

final_df[int_columns] = final_df[int_columns].astype(float)

In [47]:
check = final_df['pg_val'].unique()
print(check)

['F' 'False' 'True' 'T']


In [48]:
# streamlining False to F and True to T
final_df['pg_val'] = final_df['pg_val'].replace({'False':'F','True':'T'})
check = final_df['pg_val'].unique()
print(check)

['F' 'T']


In [49]:
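# Using regex to process the revenue and budget_x columns

# Values such as '701M' or '1004558K' carry a suffix (M = millions, K = thousands);
# the suffix is removed, the number is multiplied accordingly, and the result is converted to float

# Deriving a new feature: net_gross = revenue - budget_x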
import re

def convert_to_numeric(value):
    pattern = r'(\d+\.?\d*)([MK])'
    match = re.match(pattern, value)
    
    if match:
        numeric_part = float(match.group(1))
        multiplier = match.group(2)

        if multiplier == 'M':
            return numeric_part * 1000000
        elif multiplier == 'K':
            return numeric_part * 1000
    else:
        return float(value)

final_df['revenue'] = final_df['revenue'].apply(convert_to_numeric)
final_df['budget_x'] = final_df['budget_x'].apply(convert_to_numeric)

final_df['net_gross'] = final_df['revenue'] - final_df['budget_x']
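As a quick check of the suffix handling, the following standalone sketch parses a few of the raw strings that appear in the revenue and budget columns. `parse_amount` and `SUFFIX_MULTIPLIERS` are hypothetical names for a variant of `convert_to_numeric` that uses `re.fullmatch` and also accepts numeric input; it is not the function used in this notebook.

```python
import re

# Multipliers for the suffixes that appear in the revenue and budget columns
SUFFIX_MULTIPLIERS = {"M": 1_000_000, "K": 1_000}

def parse_amount(value):
    """Convert strings such as '701M' or '1004558K' to a float."""
    text = str(value).strip()
    match = re.fullmatch(r"(\d+\.?\d*)([MK])", text)
    if match:
        return float(match.group(1)) * SUFFIX_MULTIPLIERS[match.group(2)]
    # No suffix: assume the value is already a plain number
    return float(text)

samples = ["825532764", "701M", "1004558K", "165M"]
parsed = [parse_amount(s) for s in samples]
print(parsed)
```

Using `fullmatch` rather than `match` means a string with trailing text after the suffix falls through to `float()` and raises an error instead of being silently accepted.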

In [50]:
final_df[['revenue','budget_x','net_gross']].head()

Unnamed: 0,revenue,budget_x,net_gross
0,825532800.0,160000000.0,665532800.0
1,701000000.0,165000000.0,536000000.0
2,1004558000.0,185000000.0,819558000.0
3,2923000000.0,237000000.0,2686000000.0
4,1518816000.0,220000000.0,1298816000.0


In [51]:
# replacing 'no-rating' values with an arbitrary placeholder of 5.0 and converting the column to float datatype

# Assign the result back instead of using inplace=True on a selected column
final_df['imdb_score_y'] = final_df['imdb_score_y'].replace('no-rating', 5.0)

# Convert 'imdb_score_y' column to float
final_df['imdb_score_y'] = final_df['imdb_score_y'].astype(float)
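A fixed placeholder pulls every unrated movie towards the same score. As an alternative sketch (an assumption, not the approach taken in this notebook), the non-numeric entries can be coerced to NaN with `pd.to_numeric` and then filled with the median of the observed ratings. The series below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample of raw scores, including one non-numeric entry
scores = pd.Series(["8.8", "no-rating", "7.8", "9.0"])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(scores, errors="coerce")

# Fill the gaps with the median of the ratings that are present
median_filled = numeric.fillna(numeric.median())

print(median_filled.tolist())
```

Dropping the NaN rows instead of filling them is another option when the column is used as a prediction target.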

In [52]:
final_df[['imdb_score_y']].head(20)

Unnamed: 0,imdb_score_y
0,8.8
1,8.6
2,9.0
3,7.8
4,8.0
5,8.0
6,8.4
7,8.8
8,8.0
9,8.9


In [53]:
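# stripping the list brackets and quotes from Top 5 Casts, leaving a comma-separated string
final_df['Top 5 Casts'] = final_df['Top 5 Casts'].str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.replace("'", "", regex=False)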

final_df[['Top 5 Casts']].head()

Unnamed: 0,Top 5 Casts
0,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio..."
1,"Christopher Nolan, Matthew McConaughey, Anne H..."
2,"Christopher Nolan, David S. Goyer, Christian B..."
3,"Sam Worthington, Zoe Saldana, Sigourney Weaver..."
4,"Zak Penn, Robert Downey Jr., Chris Evans, Scar..."


In [54]:
print(final_df.isnull().sum())

Unnamed: 0                   0
revenue                      0
budget_x                     0
vote_count                   0
runtime                      0
overview                     0
production_companies      1979
popularity_x                 0
description                  1
films                        0
tagline                      0
pg_val                       0
title                        0
budget_y                     0
production_companies_y    1975
popularity_y                 0
imdb_score_x                 0
names                        1
names_x                      0
Run Time_x                   0
imdb_score_y                 0
Writer_x                     0
Director                     0
Top 5 Casts                  0
year                       575
names_y                      0
net_gross                    0
dtype: int64


In [55]:
final_df.dropna(subset=['production_companies','production_companies_y','year'], inplace=True)

In [56]:
print(final_df.shape)

(21271, 27)


In [57]:
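from scipy.stats import zscore

# Compute z-scores for the numeric columns
remove_outliers = zscore(final_df[['vote_count', 'runtime', 'popularity_x', 'popularity_y', 'revenue', 'net_gross','budget_x','budget_y','imdb_score_y','imdb_score_x']])

# Flag rows where any column lies more than 3 standard deviations above its mean
# (only the upper tail is checked, so unusually low values are kept)
val = (remove_outliers > 3).any(axis = 1)

# Keep only the rows that were not flagged
final_df = final_df[~val]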
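The comparison `z > 3` flags only unusually high values. A minimal sketch of the difference between a one-sided and a two-sided z-score filter, using a made-up column and plain NumPy (population standard deviation, which is also the default of `scipy.stats.zscore`):

```python
import numpy as np

# Hypothetical column: 99 ordinary values plus one very high and one very low value
values = np.concatenate([np.tile([9.0, 10.0, 11.0], 33), [60.0, -40.0]])

# Population z-scores (ddof=0)
z = (values - values.mean()) / values.std()

upper_only = z > 3            # flags only the high outlier
both_tails = np.abs(z) > 3    # flags the high and the low outlier

print(upper_only.sum(), both_tails.sum())  # 1 2
```

Whether low outliers should also be removed depends on the column: counts and revenues are bounded below by zero, so the upper tail is usually the one that matters.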

In [58]:
final_df.dtypes

Unnamed: 0                  int64
revenue                   float64
budget_x                  float64
vote_count                float64
runtime                   float64
overview                   object
production_companies       object
popularity_x              float64
description                object
films                      object
tagline                    object
pg_val                     object
title                      object
budget_y                  float64
production_companies_y     object
popularity_y              float64
imdb_score_x              float64
names                      object
names_x                    object
Run Time_x                 object
imdb_score_y              float64
Writer_x                   object
Director                   object
Top 5 Casts                object
year                       object
names_y                    object
net_gross                 float64
dtype: object

In [59]:
from sklearn.model_selection import train_test_split

# Selecting the relevant columns for training and testing
selected_columns = ['Director', 'Top 5 Casts', 'revenue', 'pg_val', 'tagline', 'Writer_x', 'vote_count', 'runtime', 'popularity_x', 'imdb_score_x','net_gross']
 
data = final_df[selected_columns].dropna()

X = data.drop(columns=['imdb_score_x'])  
y = data['imdb_score_x']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)

Train set shape: (16157, 10) (16157,)
Test set shape: (4040, 10) (4040,)


In [60]:
'''
Following the regression tutorial from the previous lab, I build a preprocessing pipeline and a linear
regression model, then use it to predict on the training and test data.
'''
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Selecting numerical and categorical columns
numerical_columns = ['vote_count', 'runtime', 'popularity_x', 'revenue', 'net_gross']
categorical_columns = ['Director', 'Top 5 Casts', 'Writer_x', 'pg_val', 'tagline']

# Preprocessing pipeline for numerical and categorical columns
numerical_pipeline = Pipeline([ 
    ('imputer', SimpleImputer(strategy="median")),  # Fill missing values with median
    ('scaler', StandardScaler()),  # Scale data to have mean 0 and variance 1
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot_encoder', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical columns
])

# Full preprocessing pipeline
preprocessing_pipeline = ColumnTransformer([
    ('numerical', numerical_pipeline, numerical_columns),
    ('categorical', categorical_pipeline, categorical_columns)
])

# Initialize the linear regression model
model = LinearRegression()

# Full pipeline including preprocessing and model
pipeline = Pipeline([
    ('preprocessing', preprocessing_pipeline),
    ('model', model)
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions on the training set
train_predictions = pipeline.predict(X_train)

# Make predictions on the test set
test_predictions = pipeline.predict(X_test)

In [61]:
# Calculate mean squared error on test set
# (the earlier call passed squared=False, which returns the RMSE rather than the MSE)
test_mse = mean_squared_error(y_test, test_predictions)
print(f"Test MSE: {test_mse:.4f}")

Test MSE: 6.5153


In [62]:
import math

# RMSE is the square root of the MSE, so the root is taken exactly once
rmse = math.sqrt(test_mse)
print(f'Test RMSE: {rmse:.4f}')

Test RMSE: 2.5525
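
The two metrics differ by a single square root: MSE is the mean of the squared errors and RMSE is its square root, so the root should be taken exactly once. A minimal sketch on made-up numbers (not the model's predictions):

```python
import math

# Hypothetical ratings and predictions, chosen only to make the arithmetic easy to follow
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)   # (1 + 0 + 4) / 3
rmse = math.sqrt(mse)                             # same units as the rating

print(f"MSE: {mse:.4f}, RMSE: {rmse:.4f}")  # MSE: 1.6667, RMSE: 1.2910
```

RMSE is expressed in rating points, which makes it easier to compare against the 0 to 10 score range than the MSE.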
