# TMDB Movies, Movie Credits Analysis & Prediction

This dataset has information from The Movie Database (TMDb). It has the following 2 files:  
- tmdb_5000_credits.csv - Movie credits data  
- tmdb_5000_movies.csv - Movie metadata  

The focus of this project is to figure out whether a movie will succeed before it is released. Can we predict the success rate of a movie? Can we predict whether it will be a box office hit or bring in more revenue? Is there a magic formula for the success of a movie?

## 1) Data Preparation/Data Munging

In [None]:
#Import all required libraries for reading data, analysing and visualizing data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import json

In [None]:
credits = pd.read_csv('../input/tmdb_5000_credits.csv')
movies = pd.read_csv('../input/tmdb_5000_movies.csv')

In [None]:
credits.shape

In [None]:
movies.shape

In [None]:
credits.head(2)

In [None]:
credits.info()

Movie credits has the following 4 features for 4803 movies:   
  - movie_id - integer - corresponding to the movie id  
  - title - categorical feature - title of the movie  
  - cast - json data having the following info for the specific movie:  
        * cast_id  
        * character  
        * credit_id  
        * gender  
        * id  
        * name  
        * order  
  - crew - json data having the movie crew info for the specific movie:  
        * credit_id    
        * department    
        * gender  
        * id  
        * job  
        * name       

In [None]:
movies.head(2)

In [None]:
movies.info()

Movie info has the following 20 features for 4803 movies:   
    - budget - movie budget                
    - genres - json data having the following info for the specific movie genre  
        * id - genre id  
        * name - genre name for the specific movie  
    - homepage - URL of the movie website               
    - id - movie id                     
    - keywords - json data having the following info for the specific movie keywords                
        * id - keyword id  
        * name - keyword name for the specific movie      
    - original_language - language in which original movie was released      
    - original_title - original title of the movie         
    - overview - movie description              
    - popularity - popularity rating of the movie              
    - production_companies - json data having the following info for the production companies for the movie  
        * id - production company id  
        * name - production company name for the specific movie          
    - production_countries - json data having the following info for the production countries of the movie  
        * iso_3166_1 - ISO code for the country   
        * name - name of the country where the movie was produced  
    - release_date - release date           
    - revenue - movie revenue                 
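    - runtime - movie runtime in minutes                 
    - spoken_languages - json data having the following info for the languages spoken in the movie        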
        * iso_639_1 - Code for the language     
        * name - language name  
    - status - Movie Status - Released, Rumored, Post production                
    - tagline - Movie Tagline                
    - title - movie title                  
    - vote_average - average vote           
    - vote_count - vote count              

## 2) Data Processing

These are the steps I'm going to follow in order to process the data:  
    1) Merge movies and credit data to form one single data. Join them with movie id.  
    2) Work on the different JSON objects like Genres, cast, crew, production companies, production_countries etc.  
    3) Create new features release year and month based on Release date  

### 2.1) Merge movies & credits dataframe to a single dataframe

In [None]:
allmovies_df = pd.merge(left=movies,right=credits, left_on='id', right_on='movie_id', suffixes=('_left', '_right'))

In [None]:
allmovies_df.shape

In [None]:
allmovies_df.info()

In [None]:
# Both id and movie_id refer to the movie id, and title_left and title_right both hold the movie title.
# Drop the duplicate columns 'id' and 'title_right' from the dataframe allmovies_df.
allmovies_df.drop(['id', 'title_right'], axis=1, inplace=True)
allmovies_df = allmovies_df.rename(columns={'title_left': 'title'})
allmovies_df.head(2)
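
A join like the one above can silently duplicate or drop rows when the keys are not unique or do not match. Below is a minimal sketch, on two made-up toy frames (`movies_toy` and `credits_toy` are hypothetical names, not part of the dataset), of how the `validate` and `indicator` options of `pd.merge` could be used to check this.

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (values are illustrative only)
movies_toy = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
credits_toy = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['A', 'B', 'C']})

merged_toy = pd.merge(
    left=movies_toy, right=credits_toy,
    left_on='id', right_on='movie_id',
    suffixes=('_left', '_right'),
    validate='one_to_one',         # raises MergeError if either key has duplicates
    how='outer', indicator=True,   # the '_merge' column shows which side each row came from
)

# Rows that are missing from one of the two tables
unmatched = merged_toy[merged_toy['_merge'] != 'both']
print(len(merged_toy), len(unmatched))
```

If `unmatched` is empty and no error is raised, an inner join on the same keys keeps every movie exactly once.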

In [None]:
#Change the order of the dataframe allmovies_df
allmovies_df = allmovies_df[['movie_id', 'budget', 'title', 'original_title', 'status', 'tagline', 'release_date', 'runtime', 
               'genres', 'production_companies', 'production_countries', 'popularity', 'revenue', 'vote_average',
               'vote_count', 'cast', 'crew', 'homepage', 'keywords', 'original_language', 'overview', 'spoken_languages'
             ]]
allmovies_df.head(2)

In [None]:
allm = allmovies_df.copy() #just for backup

### 2.2) Analysis of JSON Objects - Genres, Cast, Crew, production_companies, production_countries, spoken_languages, keywords

While analysing the JSON objects, I found the following.  
1) The crew object lists the whole crew: director, editing, photography etc. I decided to pick only the details corresponding to the director.  
2) The cast object lists all the actors in the movie in order of importance. I decided to pick only the cast member with order 0 (the leading actor), as the data was becoming too large.  
    - Some of the cast names were coming out incorrectly. E.g. 'Miguel A. N\u00fa\u00f1ez, Jr.' was parsed as just 'Jr', which is meaningless on its own. To avoid this, I parse the JSON (which decodes the \u escapes) and also strip the spaces so that the entire name (first name, last name) becomes a single name.  
3) Production companies: I'm considering only the top production companies that have made box office hits. This list is provided in top30_production_companies (section 3.4).  
Hence the crew, cast and production companies objects are parsed separately from the others.  
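
To make the selection rules above concrete, here is a minimal sketch on two hypothetical, shortened JSON strings shaped like the `cast` and `crew` columns. The helper names `lead_actors` and `directors` are illustrative and are not used elsewhere in this notebook.

```python
import json

# Hypothetical, shortened cast/crew strings in the same shape as the dataset columns
cast_json = ('[{"cast_id": 1, "character": "Lead", '
             '"name": "Miguel A. N\\u00fa\\u00f1ez, Jr.", "order": 0},'
             ' {"cast_id": 2, "character": "Friend", "name": "Some Actor", "order": 1}]')
crew_json = ('[{"department": "Editing", "job": "Editor", "name": "Some Editor"},'
             ' {"department": "Directing", "job": "Director", "name": "Some Director"}]')

def lead_actors(cast_str):
    """Return the names of cast members with order == 0."""
    return [member['name'] for member in json.loads(cast_str) if member['order'] == 0]

def directors(crew_str):
    """Return the names of crew members whose job is 'Director'."""
    return [member['name'] for member in json.loads(crew_str) if member['job'] == 'Director']

print(lead_actors(cast_json))
print(directors(crew_json))
```

Because the names stay inside a Python list here, a comma inside a name (as in ", Jr.") does not split it into two entries.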

In [None]:
#parse json input
#NOTE: I'm parsing crew, cast and production companies separately.
json_columns = ['genres', 'keywords', 'production_countries', 'spoken_languages']

### JSON Decoder
`json.loads` deserializes a str containing a JSON document to a Python object.

In [None]:
# json.loads no longer accepts an `encoding` argument (removed in Python 3.9); str input is already Unicode
for column in json_columns:
    allmovies_df[column] = allmovies_df[column].apply(json.loads)
# crew, cast and production companies are parsed here too, but processed separately below
for column in ['crew', 'cast', 'production_companies']:
    allmovies_df[column] = allmovies_df[column].apply(json.loads)

### Function to process the JSON objects Genres, Keywords, Production Countries, Spoken languages.
In the columns 'genres', 'keywords', 'production_countries' and 'spoken_languages', the structure is not nested: each entry is simply an id (or code) and a name. I'm just fetching the value of the key `name` for these columns.

In [None]:
def process_jsoncols(colname):
    jsoncollist=[]
    for x in colname:
        jsoncollist.append(x['name'])
    return jsoncollist

In [None]:
for colname in json_columns:
    allmovies_df[colname] = allmovies_df[colname].apply(process_jsoncols)

In [None]:
allmovies_df[['genres', 'keywords', 'production_countries', 'spoken_languages']].head()

### Function to process the JSON object Production Companies
Here I extract the names of all production companies for each movie. Later on I consider only the top production companies that have made box office hits; that list is provided in top30_production_companies (section 3.4).

In [None]:
allmovies_df['production_companies'] = allmovies_df['production_companies'].apply(process_jsoncols)

In [None]:
allmovies_df['production_companies'].head(2)

### Function to process the JSON object Cast
I'm considering only the leading actor (order = 0), as the full list of actors is huge. Just to keep my sanity intact.

In [None]:
for index,x in zip(allmovies_df.index,allmovies_df['cast']):
    castlist=[]
    for i in range(len(x)):
        if (x[i]['order'] < 1):
            castlist.append((x[i]['name']))
    allmovies_df.loc[index,'cast']=str(castlist)

In [None]:
allmovies_df['cast'].head(2)

In [None]:
#allmovies_df['cast'] = allmovies_df['cast'].str.strip('[]').str.replace("'",'').str.replace('"','').str.replace(' ','').str.replace(',Jr.','Jr.')
allmovies_df['cast'] = allmovies_df['cast'].str.strip('[]').str.replace("'",'').str.replace('"','').str.replace(' ','')

In [None]:
#Checking to see all the information is correct.
allmovies_df[allmovies_df['cast'].isnull()]

In [None]:
allmovies_df['cast'].head(2)

### Function to process the JSON object Crew
I'm considering only the directors from the list of all movie crew

In [None]:
for index,x in zip(allmovies_df.index,allmovies_df['crew']):
    crewlist=[]
    for i in range(len(x)):
        if (x[i]['job'] == 'Director'):
            crewlist.append((x[i]['name']))
    allmovies_df.loc[index,'crew']=str(crewlist)

In [None]:
#def process_jsoncol_crew(colname):
#    crewlist=[]
#    for x in colname:
#        if x['job'] == 'Director':
#            crewlist.append(x['name'])
#            return crewlist

In [None]:
#allmovies_df['crew'] = allmovies_df['crew'].apply(process_jsoncol_crew)

In [None]:
#for index,x in zip(allmovies_df.index,allmovies_df['crew']):
#    crewlist=[]
#    for i in range(len(x)):
#        if (x[i]['job'] == 'Director'):
#            print(x[i]['job'])
#            crewlist.append((x[i]['job']))
#            print(crewlist)
#    allmovies_df.loc[index,'crew']=str(crewlist)

In [None]:
allmovies_df['crew'].head(2)

In [None]:
allmovies_df['crew'].isnull().sum()

In [None]:
#allmovies_df['cast'] = allmovies_df['cast'].str.strip('[]').str.replace("'",'').str.replace('"','').str.replace(' ','').str.replace(',Jr.','Jr.')
allmovies_df['crew'] = allmovies_df['crew'].str.strip('[]').str.replace("'",'').str.replace('"','').str.replace(' ','')

In [None]:
allmovies_df['crew'].head(2)

### Convert Pandas Dataframe Column of Lists to string. 
The impacted columns are genres, keywords, production_countries, spoken_languages and production_companies. NOTE: crew and cast are not columns of lists.

In [None]:
listcols = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']

In [None]:
for colname in listcols:
    allmovies_df[colname] = allmovies_df[colname].apply(lambda x: ','.join(map(str, x)))

In [None]:
allmovies_df.head(2)

### 2.3) Create new features release year and month based on Release date

In [None]:
from datetime import datetime
allmovies_df['release_date'] = pd.to_datetime(allmovies_df['release_date'])

In [None]:
allmovies_df['release_year'] = allmovies_df['release_date'].dt.year
allmovies_df['release_month'] = allmovies_df['release_date'].dt.month

In [None]:
allmovies_df[['release_year','release_month']].head(2)
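
Rows without a release date end up without a year and a month. A small sketch on made-up dates, using `errors='coerce'` (an optional safeguard that is not used in the cells above) so that unparseable values become missing instead of raising an error:

```python
import pandas as pd

# Toy release dates, including one missing value (illustrative data only)
dates = pd.Series(['2009-12-10', '2015-06-09', None])
parsed = pd.to_datetime(dates, errors='coerce')  # unparseable or missing values become NaT

year = parsed.dt.year    # missing dates give NaN
month = parsed.dt.month

print(year.tolist(), month.tolist())
print(parsed.isnull().sum())  # number of rows without a usable release date
```

Such rows need to be dropped or filled before the year and month can be used as model features.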

In [None]:
#another backup
afterjson = allmovies_df.copy()

## 3) Exploratory Data Analysis

1) create new dataframe with the genres related to movies and visually draw out some conclusions  
2) create new dataframe with the cast related to movies and visually draw out some conclusions  
3) create new dataframe with the crew related to movies and visually draw out some conclusions  
4) create new dataframe with the production companies and visually draw out some conclusions  
5) Draw out plots based on release year and month  

I'm going to consider only the following fields: 'movie_id', 'budget', 'title','release_year', 'release_month','revenue','vote_average','vote_count','original_language'  
Ignoring the following fields: original_title, status, tagline, release_date, runtime, popularity, homepage, overview, spoken_languages  

### 3.1) Creation of Movies genres dataframe 

In [None]:
movies_genres = pd.DataFrame(allmovies_df[['movie_id', 'budget', 'title','release_year', 'release_month','genres','revenue','vote_average','vote_count','original_language']])

In [None]:
movies_genres.head(2)

In [None]:
genres_list = set()
for sstr in allmovies_df['genres'].str.split(','):
    genres_list = set().union(sstr, genres_list)
genres_list = list(genres_list)
genres_list.remove('')
genres_list

In [None]:
#pd.Series(' '.join(movies_genres['genres']).split('|')).value_counts()

In [None]:
#pd.Series(' '.join(movies_genres['genres']).lower().split()).value_counts()[:10]

In [None]:
#Transforming categorical to one hot encoding
for genres in genres_list:
    movies_genres[genres] = movies_genres['genres'].str.contains(genres).apply(lambda x:1 if x else 0)

In [None]:
movies_genres.head(2)
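
As an alternative to the loop above, pandas offers `Series.str.get_dummies`, which splits on a separator and matches whole genre names rather than substrings. A minimal sketch on toy genre strings (made-up data, same comma-joined shape as the `genres` column):

```python
import pandas as pd

# Toy comma-joined genre strings in the same shape as the 'genres' column
genres = pd.Series(['Action,Adventure', 'Drama', '', 'Action,Drama'])

# One indicator column per distinct genre; an empty string yields a row of zeros
genre_dummies = genres.str.get_dummies(sep=',')
print(genre_dummies)

# Number of movies per genre, most frequent first
print(genre_dummies.sum().sort_values(ascending=False))
```

The result could be joined back to the movie columns with `pd.concat([...], axis=1)` if this route were taken.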

In [None]:
genre_count = []
for genre in genres_list:
    genre_count.append([genre, movies_genres[genre].values.sum()])

In [None]:
names = ['genrename','genrecount']
genre_df = pd.DataFrame(data=genre_count, columns=names)
genre_df.sort_values("genrecount", inplace=True, ascending=False)

In [None]:
genre_df.head()

In [None]:
labels=genre_df.genrename

In [None]:
plt.subplots(figsize=(10, 10))
genre_df.genrecount.plot.pie(labels = labels, autopct='%1.1f%%', shadow=False)

In [None]:
plt.subplots(figsize=(10, 10))
genre_df['genrecount'].plot.bar( align='center', alpha=0.5, color='red')
y_pos = np.arange(len(labels))
#plt.yticks(y_pos, labels)
plt.xticks(y_pos, labels)
plt.ylabel('Genres Count')

### Movie Genres conclusion:
Drama, Comedy, Thriller, Action, Romance, Adventure and Crime form the main genres of the released movies.

### 3.2) Creation of Movies cast dataframe 

I've considered the cast ONLY with leading actor roles and not in any other roles.

In [None]:
# .copy() so that new columns can be added without a SettingWithCopyWarning
movies_cast = allmovies_df[['movie_id', 'budget', 'title','release_year', 'release_month','cast','revenue','vote_average','vote_count','original_language']].copy()

In [None]:
movies_cast[movies_cast['cast'].isnull()]

In [None]:
cast_list = list(movies_cast['cast'])
cast_list

In [None]:
def count_elements(lst):
    elements = {}
    for elem in lst:
        if elem in elements.keys():
            elements[elem] +=1
        else:
            elements[elem] = 1
    return elements

In [None]:
castcount = count_elements(cast_list)
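
The counting helper above does the same job as `collections.Counter` from the standard library. A minimal sketch on a made-up list (the names and counts are illustrative only), which also skips empty entries explicitly:

```python
from collections import Counter

# Toy list of lead-actor strings; '' stands for a movie with no listed cast
cast_list_toy = ['BruceWillis', '', 'TomHanks', 'BruceWillis', '', '', 'TomHanks', 'BruceWillis']

counts = Counter(name for name in cast_list_toy if name)  # skip empty entries
top2 = counts.most_common(2)  # list of (name, count) pairs, most frequent first
print(top2)
```

With empty entries filtered out up front, `most_common(n)` returns exactly `n` names (when at least `n` distinct names exist) without needing to slice past the first element.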

I'm going to create a new cast list and consider only the top 30 actors in it.

In [None]:
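# Index 0 is skipped: it is expected to be the empty string (movies with no listed lead actor).
# Note that the slice [1:30] keeps 29 names.
top30_cast = sorted(castcount, key=castcount.get, reverse=True)[1:30]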
top30_cast

In [None]:
for cast in top30_cast:
    movies_cast[cast] = movies_cast['cast'].str.contains(cast).apply(lambda x:1 if x else 0)

In [None]:
movies_cast.head(2)

In [None]:
cast_count = []
for cast in top30_cast:
    cast_count.append([cast, movies_cast[cast].values.sum()])

In [None]:
names = ['castname','castcount']
cast_df = pd.DataFrame(data=cast_count, columns=names)
cast_df.sort_values("castcount", inplace=True, ascending=False)

In [None]:
cast_df.head()

In [None]:
cast_labels = cast_df.castname[cast_df['castcount']>15]

In [None]:
plt.subplots(figsize=(10, 10))
cast_df.castcount[cast_df['castcount']>15].plot.bar( align='center', alpha=0.5)
y_pos = np.arange(len(cast_labels))
#plt.yticks(y_pos, cast_labels)
plt.xticks(y_pos, cast_labels)
plt.ylabel('cast Count')

### Leading Actors conclusion:
The top leading actors are Bruce Willis, Robert De Niro, Nicolas Cage, Johnny Depp, Denzel Washington and Tom Hanks, each of whom has acted in 24 to 30 movies.


### 3.3) Creation of Movies director dataframe

In [None]:
# .copy() so that new columns can be added without a SettingWithCopyWarning
movies_crew = allmovies_df[['movie_id','budget','title','release_year','release_month','crew','revenue','vote_average','vote_count','original_language']].copy()

In [None]:
#movies_crew = movies_crew[movies_crew['crew'].notnull()]
movies_crew.index = pd.RangeIndex(len(movies_crew.index))
movies_crew.isnull().sum()

# Collect the director string of every movie that has one (empty strings are skipped)
crew_list = []
for i in range(len(movies_crew)):
    if movies_crew['crew'][i]:
        crew_list.append(movies_crew['crew'][i])

In [None]:
crew_list = list(movies_crew['crew'])
crew_list

In [None]:
crewcount = count_elements(crew_list)

In [None]:
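# Index 0 is skipped: it is expected to be the empty string (movies with no listed director).
# Note that the slice [1:30] keeps 29 names.
top30_crew = sorted(crewcount, key=crewcount.get, reverse=True)[1:30]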

In [None]:
for crew in top30_crew:
    movies_crew[crew] = movies_crew['crew'].str.contains(crew).apply(lambda x:1 if x else 0)

In [None]:
movies_crew.head(3)

In [None]:
crew_count = []
for crew in top30_crew:
    crew_count.append([crew, movies_crew[crew].values.sum()])

In [None]:
names = ['crewname','crewcount']
crew_df = pd.DataFrame(data=crew_count, columns=names)
crew_df.sort_values("crewcount", inplace=True, ascending=False)

In [None]:
crew_df.head()

In [None]:
crew_labels = crew_df.crewname[crew_df['crewcount']>9]

In [None]:
plt.subplots(figsize=(10, 10))
crew_df.crewcount[crew_df['crewcount']>9].plot.bar( align='center', alpha=0.5, color='purple')
y_pos = np.arange(len(crew_labels))
#plt.yticks(y_pos, crew_labels)
plt.xticks(y_pos, crew_labels)
plt.ylabel('crew Count')

### Movie Directors conclusion:
Steven Spielberg, Woody Allen, Martin Scorsese, Clint Eastwood & Robert Rodriguez are the top 5 directors, each having directed more than 16 films.

### 3.4) Creation of Production Companies Dataframe

In [None]:
# .copy() so that new columns can be added without a SettingWithCopyWarning
movies_production_companies = allmovies_df[['movie_id','budget','title','release_year','release_month','production_companies','revenue','vote_average','vote_count','original_language']].copy()

In [None]:
movies_production_companies.head(2)

In [None]:
top30_production_companies = ['Paramount Pictures','Columbia Pictures','Twentieth Century Fox Film Corporation','Metro-Goldwyn-Mayer (MGM)',
               'Marvel Studios','Walt Disney Pictures','Walt Disney','Walt Disney Animation Studios',
               'Walt Disney Studios Motion Pictures','Warner Bros.','Universal Pictures','Universal Studios',
               'Jerry Bruckheimer Films','Pixar Animation Studios','Relativity Media','Lucasfilm',
               'RKO Radio Pictures','New Line Cinema','Miramax Films','DreamWorks','DreamWorks SKG']

In [None]:
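# regex=False: names such as 'Metro-Goldwyn-Mayer (MGM)' and 'Warner Bros.' contain regex metacharacters
for production_companies in top30_production_companies:
    movies_production_companies[production_companies] = movies_production_companies['production_companies'].str.contains(production_companies, regex=False).apply(lambda x:1 if x else 0)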

In [None]:
movies_production_companies.head(2)
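
Some company names, such as 'Metro-Goldwyn-Mayer (MGM)', contain characters that have a special meaning in regular expressions, and `str.contains` treats its pattern as a regex by default. A small sketch on made-up company strings showing how the regex and literal interpretations differ:

```python
import pandas as pd

# Toy comma-joined company strings (illustrative only)
companies = pd.Series(['Metro-Goldwyn-Mayer (MGM),Columbia Pictures', 'Warner Bros.'])
name = 'Metro-Goldwyn-Mayer (MGM)'

as_regex = companies.str.contains(name)                 # '(MGM)' is read as a regex group
as_literal = companies.str.contains(name, regex=False)  # plain substring match

print(as_regex.tolist(), as_literal.tolist())
```

Note that literal matching is still substring matching, so a short name like 'Walt Disney' also matches 'Walt Disney Pictures'.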

In [None]:
production_companies_count = []
for production_companies in top30_production_companies:
    production_companies_count.append([production_companies, movies_production_companies[production_companies].values.sum()])
production_companies_count

In [None]:
names = ['production_companiesname','production_companiescount']
production_companies_df = pd.DataFrame(data=production_companies_count, columns=names)
production_companies_df.sort_values("production_companiescount", inplace=True, ascending=False)

In [None]:
production_companies_df

In [None]:
production_companies_labels = production_companies_df.production_companiesname[production_companies_df['production_companiescount']>1]

In [None]:
production_companies_df.head()

In [None]:
plt.subplots(figsize=(10, 10))
production_companies_df.production_companiescount[production_companies_df['production_companiescount']>1].plot.bar( align='center', alpha=0.5, color='red')
y_pos = np.arange(len(production_companies_labels))
#plt.yticks(y_pos, production_companies_labels)
plt.xticks(y_pos, production_companies_labels)
plt.ylabel('production_companies Count')

### Production Companies conclusion:
The main production companies that release movies are Warner Bros., Paramount Pictures, Universal Pictures, 20th Century Fox,
Columbia Pictures, New Line Cinema, Disney, Pixar, Miramax Films, DreamWorks

### 3.5) Release date analysis

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x='release_year', data=movies_genres, color='red')
plt.ylabel('Count', fontsize=12)
plt.xlabel('Movies released per year', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Movies released by year", fontsize=15)
plt.show()

In [None]:
movies_genres[['release_year', 'release_month']].groupby(['release_year'], as_index=False).count().sort_values(by='release_year', ascending=False)

In [None]:
movies_genres[['release_month', 'release_year']].groupby(['release_month'], as_index=False).count().sort_values(by='release_year', ascending=False)

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x='release_month', data=movies_genres, color='red')
plt.ylabel('Count', fontsize=12)
plt.xlabel('Movies released per month', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of Movies released by month", fontsize=15)
plt.show()

This shows that more movies are released during Dec/Jan (the combined holiday season) as well as during Sep/Oct, after school starts.

# Revenue & Budget Analysis

In [None]:
movies_genres['revenue'].plot.hist(alpha=0.5, bins=20)
plt.title('Histogram of the Revenue')
plt.xlabel("Revenue")
plt.ylabel("Frequency") 

In [None]:
movies_genres['budget'].plot.hist(alpha=0.5, bins=20)
plt.title('Histogram of the Budget')
plt.xlabel("Budget")
plt.ylabel("Frequency") 

# Linear Regression

In [None]:
# Importing modules
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn import linear_model

## Movie Genre regression indicators

In [None]:
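# Restrict to numeric columns; recent pandas versions raise an error on text columns in corr()
genre_corr = movies_genres.select_dtypes(include='number').corr()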
genre_corr['revenue'].sort_values()

Strong correlation indicators seem to be:   
Positive: vote_count, budget, Adventure, Fantasy, Action, Animation, vote_average, Family, Science Fiction  
Negative: Drama  

### Min-Max Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
# MinMaxScaler expects 2-D input, so the columns are passed as a DataFrame rather than as 1-D Series.
# Each column is still scaled independently to the range [0, 1].
scale_cols = ['budget', 'vote_average', 'vote_count']
min_max = MinMaxScaler()
movies_genres[scale_cols] = min_max.fit_transform(movies_genres[scale_cols])
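
For reference, min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min), which is what `MinMaxScaler` computes per column with its default settings. A minimal NumPy sketch on made-up budget and vote_count values:

```python
import numpy as np

# Made-up budget and vote_count values, one column per feature
features = np.array([[1.0e6, 10.0],
                     [5.0e7, 500.0],
                     [2.0e8, 1000.0]])

col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)  # each column now spans [0, 1]
print(scaled)
```

Scaling changes the size of the fitted coefficients but not the R² score of a plain linear regression.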


In [None]:
x = movies_genres[['vote_count','budget','Adventure', 'Fantasy', 'Action', 'Animation', 'vote_average', 'Family', 
                   'Science Fiction', 'Drama']]
x.head(3)

In [None]:
y = movies_genres['revenue']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
print (X_train.shape)
print (y_train.shape)
print (X_test.shape)
print (y_test.shape)

In [None]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
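
The score printed above is the coefficient of determination R² (times 100), which is what `LinearRegression.score` returns. A minimal sketch of how R² is computed, on made-up (budget, revenue) pairs and using `np.polyfit` for the least-squares line instead of scikit-learn:

```python
import numpy as np

# Made-up (budget, revenue) pairs for illustration
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
revenue = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line: revenue ~ slope * budget + intercept
slope, intercept = np.polyfit(budget, revenue, deg=1)
predicted = slope * budget + intercept

ss_res = np.sum((revenue - predicted) ** 2)       # residual sum of squares
ss_tot = np.sum((revenue - revenue.mean()) ** 2)  # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared * 100, 2))
```

A value near 100 means the line explains most of the variance in revenue; a large gap between the training and test scores would point to overfitting.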

In [None]:
pd.DataFrame(list(zip(x.columns, linear.coef_)), columns = ['features', 'coefficients'])

Budget, Animation, Family and vote_count appear to be positively associated with revenue.

In [None]:
#Regression plot between budget and revenue
plt.figure(figsize=(8,8))
sns.regplot(x=movies_genres["budget"], y=movies_genres["revenue"], fit_reg=True)

Revenue seems to increase as the budget of the movie increases (except for a few outliers).

In [None]:
movies_genres[movies_genres['revenue'] > 2500000000]

The movie Avatar seems to be the outlier, with way too high revenue. Let's try to remove the outlier.

In [None]:
plt.figure(figsize=(8,8))
sns.boxplot(x=movies_genres["Animation"], y=movies_genres["revenue"])

In [None]:
plt.figure(figsize=(8,8))
sns.regplot(x=movies_genres["vote_count"], y=movies_genres["revenue"], fit_reg=True)

In [None]:
plt.figure(figsize=(8,8))
sns.boxplot(x=movies_genres["Drama"], y=movies_genres["revenue"])

In [None]:
mov_g = movies_genres[movies_genres['revenue'] < 2500000000]
mov_g.shape

In [None]:
x = mov_g[['budget']]
y = mov_g['revenue']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
print (X_train.shape)
print (y_train.shape)
print (X_test.shape)
print (y_test.shape)

In [None]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

In [None]:
x = mov_g[['budget', 'vote_count']]
y = mov_g['revenue']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)

In [None]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

In [None]:
mov_g.info()
len(mov_g)

In [None]:
mov_g[mov_g.release_year.isnull()]

In [None]:
mov_g = mov_g.dropna(axis=0, how='any')

In [None]:
mov_g.head()

In [None]:
x = mov_g[['budget','release_year','release_month','vote_count','Animation','Thriller','Family',
           'Adventure','Western','War','Drama','Action','Mystery','Science Fiction','Documentary','Foreign','TV Movie','Fantasy',
           'Music','History','Horror','Romance','Crime','Comedy']]
y = mov_g['revenue']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)

In [None]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

In [None]:
pd.DataFrame(list(zip(x.columns, linear.coef_)), columns = ['features', 'coefficients'])

In [None]:
# `size` was renamed to `height` in seaborn 0.9
sns.pairplot(mov_g, x_vars=['budget','release_year','vote_count'], y_vars='revenue', height=7, aspect=0.7, kind='reg')

In [None]:
x = mov_g[['budget','release_year','release_month','vote_average','vote_count','Animation','Thriller','Family',
           'Adventure','Western','War','Drama','Action','Mystery','Science Fiction','Documentary','Foreign','TV Movie','Fantasy',
           'Music','History','Horror','Romance','Crime','Comedy']]
y = mov_g['revenue']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)

In [None]:
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(X_train, y_train)
#Predict Output
lin_predicted = linear.predict(X_test)

linear_score = round(linear.score(X_train, y_train) * 100, 2)
linear_score_test = round(linear.score(X_test, y_test) * 100, 2)
#Equation coefficient and Intercept
print('Linear Regression Score: \n', linear_score)
print('Linear Regression Test Score: \n', linear_score_test)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

### Conclusion:
The main indicators of movie revenue, and hence success, are the budget and vote_count.

### Next steps:
We can try to use other indicators like director, cast, production company etc to figure out how they impact the success of the movie. Hope this helps :)