# Title: TMDB Movie Data Analysis

# Problem Statement:

Movies with budgets exceeding $100 million can still underperform in the box office, suggesting that movie preferences among audiences are diverse. A production company aims to gain insights into what types of movies tend to succeed commercially, which genres are most popular, and other factors that influence a movie's performance. This analysis will assist the company in making informed decisions about investing in movie production, predicting a movie's success, and understanding audience preferences.

# Context:
In the entertainment industry, producing a successful movie requires a significant financial investment, and predicting a movie's success is challenging due to the diverse tastes of moviegoers. To mitigate risks and make strategic decisions, production companies need to analyze historical movie data to identify patterns and trends related to factors like budget, revenue, genres, and production companies.

# Need of Study:

1. To reduce financial risks associated with movie production.
2. To identify factors that contribute to a movie's commercial success.
3. To understand the preferences of different audience segments.
4. To optimize investment in movie production.
5. To predict a movie's performance based on key attributes.

# Business Objective:

The primary objective of this analysis is to leverage data-driven insights to enhance decision-making in movie production. Specific goals include:

1. Identifying the characteristics of successful movies.
2. Recognizing the most popular movie genres.
3. Analyzing the impact of factors such as budget and revenue.
4. Predicting a movie's commercial success.
5. Assisting in strategic investment decisions.
6. Understanding audience preferences for better marketing strategies.

This analysis will empower the production company to allocate resources more efficiently, reduce risks, and increase the chances of producing commercially successful movies.

# Task 1: Load the movie dataset and display basic information.

# importing required libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings 

warnings.filterwarnings('ignore')

# Reading and Understanding the data

In [23]:
# Load the dataset
movie_data = pd.read_csv('DS1_C8_V3_ND_Sprint2_Data Analysis Using Python_Dataset.csv') 
movie_data

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",10-12-2009,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",19-05-2007,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",26-10-2015,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",16-07-2012,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",07-03-2012,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,220000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",,9367,"[{""id"": 5616, ""name"": ""united states\u2013mexi...",es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]","[{""iso_3166_1"": ""MX"", ""name"": ""Mexico""}, {""iso...",04-09-1992,2040920,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238
4799,9000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,72766,[],en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],[],26-12-2011,0,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5
4800,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",http://www.hallmarkchannel.com/signedsealeddel...,231617,"[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,"[{""name"": ""Front Street Pictures"", ""id"": 3958}...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",13-10-2013,0,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6
4801,0,[],http://shanghaicalling.com/,126186,[],en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",03-05-2012,0,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7


# Basic Analysis on dataset

1. Visually inspect the first few and last few rows of the data
2. Check the shape of the data frame
3. Check the count of null values in each column
4. Inspect all the column names and cross-check with the data dictionary
5. Check the information of the data frame using the info() function

# To display top 5 rows


In [4]:
movie_data.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",10-12-2009,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",19-05-2007,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",26-10-2015,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",16-07-2012,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",07-03-2012,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


# head() returns the first 5 records of movie_data

# To display bottom 5 rows

In [6]:
movie_data.tail()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
4798,220000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",,9367,"[{""id"": 5616, ""name"": ""united states\u2013mexi...",es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]","[{""iso_3166_1"": ""MX"", ""name"": ""Mexico""}, {""iso...",04-09-1992,2040920,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238
4799,9000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,72766,[],en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],[],26-12-2011,0,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5
4800,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",http://www.hallmarkchannel.com/signedsealeddel...,231617,"[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,"[{""name"": ""Front Street Pictures"", ""id"": 3958}...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",13-10-2013,0,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6
4801,0,[],http://shanghaicalling.com/,126186,[],en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",03-05-2012,0,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7
4802,0,"[{""id"": 99, ""name"": ""Documentary""}]",,25975,"[{""id"": 1523, ""name"": ""obsession""}, {""id"": 224...",en,My Date with Drew,Ever since the second grade when he first saw ...,1.929883,"[{""name"": ""rusty bear entertainment"", ""id"": 87...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",05-08-2005,0,90.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,My Date with Drew,6.3,16


# tail() returns the last 5 records of movie_data

# To display the available columns in the dataset

In [7]:
movie_data.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

# To display Number of rows and columns in the dataset

In [8]:
print(movie_data.shape)

(4803, 20)


# The dataset contains 4803 rows and 20 columns.

# To display the basic information about the dataset

In [9]:
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

# To find the data type of each column

In [10]:
movie_data.dtypes

budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
status                   object
tagline                  object
title                    object
vote_average            float64
vote_count                int64
dtype: object

# To display the count of missing values for each column

In [4]:
print(movie_data.isnull().sum())

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64


# The homepage (3091) and tagline (844) columns have the most missing values; overview (3), runtime (2), and release_date (1) have a few.

# Visual Inspection of data

Dataset Description: 

The dataset comprises TMDB movie-related data with 4803 rows and 20 columns.

Data Size and Missing Values:

Missing values occur in five columns: homepage (3091), tagline (844), overview (3), runtime (2), and release_date (1).


# Understanding the dataset

1. Budget: The budget of a movie in dollars. (A budget value of 0 means the budget value is unknown.)
2. Genres: The genre of the movie and TMDB id (in JSON format)
3. Homepage: The official homepage URL of a movie
4. Id: TMDB id of a movie (stored as an integer)
5. Keywords: TMDB id and names of all keywords (in JSON format)
6. Original_language: Two-letter code of the original language (the language in which the movie was made), such as en for English or fr for French
7. Original_title: The original title of a movie. The title and original title of a movie may differ if the original title is not in English.
8. Popularity: Popularity of the movie (in float)
9. Overview: Brief description of the movie
10. Production_companies: All production companies' names and TMDB id of a movie (in JSON format)
11. Production_countries: Two-letter ISO 3166-1 code and the full name of the production country (in JSON format)
12. Release_date: The release date of a movie (in dd-mm-yyyy format, for example 10-12-2009)
13. Revenue: The total revenue earned by a movie (in dollars)
14. Runtime: The total runtime of a movie in minutes (stored as a float)
15. Spoken_languages: Two-letter code and the full name of each spoken language (in JSON format)
16. Status: The status of the movie (postproduction, released, or rumored)
17. Vote_average: Average vote for a movie
18. Vote_count: The total vote count for a movie
19. Tagline: Tagline of a movie
20. Title: English title of a movie
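
The release_date values shown in the outputs above follow a day-month-year layout (for example 10-12-2009), and the column is loaded as object. A minimal sketch of converting such strings to datetimes, assuming that layout holds for every row; the small frame below is a stand-in for movie_data, and release_year is a hypothetical helper column that the notebook does not create:

```python
import pandas as pd

# Stand-in sample using dates from the rows displayed earlier
sample = pd.DataFrame({
    "title": ["Avatar", "Spectre", "El Mariachi"],
    "release_date": ["10-12-2009", "26-10-2015", "04-09-1992"],
})

# Parse day-month-year strings; errors="coerce" turns unparseable values into NaT
sample["release_date"] = pd.to_datetime(
    sample["release_date"], format="%d-%m-%Y", errors="coerce"
)

# Hypothetical helper column for grouping movies by release year
sample["release_year"] = sample["release_date"].dt.year
print(sample)
```

Passing an explicit format avoids the day and month being swapped for ambiguous dates such as 10-12-2009.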

# Interpretation: 
This task loads the movie dataset, the first step in the analysis. Displaying the first and last rows, the shape (4803 rows × 20 columns), the column names and data types, and the null counts gives an initial overview of the dataset's size, content, and quality.


# Task 2: Handle null values in the dataset.

In [24]:
# Identify columns with null values
null_columns = movie_data.columns[movie_data.isna().any()]

# Fill nulls: mode for categorical (object) columns, mean for numerical columns
for column in null_columns:
    if movie_data[column].dtype == 'object':
        fill_value = movie_data[column].mode().values[0]
    else:
        fill_value = movie_data[column].mean()
    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which newer pandas versions may not propagate
    movie_data[column] = movie_data[column].fillna(fill_value)
    # Print the fill value that was actually used
    print(f"Column: {column}, Fill Value: {fill_value}")


Column: homepage, Fill Value: http://www.missionimpossible.com/
Column: overview, Fill Value:  
Column: release_date, Fill Value: 01-01-2006
Column: runtime, Fill Value: 106.87585919600083
Column: tagline, Fill Value: Based on a true story.


# Interpretation: 
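Identifying and addressing missing values is essential for data quality and analysis accuracy, and the method used depends on the data type of the column. Here the null values are confined to homepage, overview, release_date, runtime, and tagline. The budget and revenue columns contain no nulls, although zeros in those columns act as placeholders for unknown values (handled in Task 5). Mode imputation gives every movie without a homepage the same URL, http://www.missionimpossible.com/, so the filled homepage and tagline values should not be read as real attributes of those movies.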

# Method Used: 
Mean imputation for the numerical column (runtime) and mode imputation for the object columns (homepage, overview, release_date, tagline), so that no null values remain. The fill values are those printed in the output above: the runtime mean is approximately 106.88 minutes, and the most frequent overview value is itself blank.
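
As a point of comparison, the sketch below applies a different imputation policy to a toy frame. It is not the notebook's method: it assumes the median for numeric columns (less sensitive to extreme values than the mean) and an explicit "Unknown" placeholder for text columns, so that a filled value cannot be mistaken for real data. The frame, its values, and the placeholder are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_data (values are illustrative only)
toy = pd.DataFrame({
    "runtime": [90.0, np.nan, 120.0, 300.0],
    "tagline": ["A", None, "A", "B"],
})

for column in toy.columns[toy.isna().any()]:
    if pd.api.types.is_numeric_dtype(toy[column]):
        fill_value = toy[column].median()  # median of 90, 120, 300 is 120
    else:
        fill_value = "Unknown"             # assumed placeholder, not a real tagline
    toy[column] = toy[column].fillna(fill_value)

print(toy)
```

Whether a placeholder or the mode is preferable depends on how the column is used later; for free-text fields such as homepage and tagline, a placeholder keeps imputed rows distinguishable.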

# Task 3: Display movie categories with a budget greater than $220,000.

In [11]:
budget_condition = movie_data['budget'] > 220000
result = movie_data[budget_condition]['genres'].drop_duplicates()
print(result)

0       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1       [{"id": 12, "name": "Adventure"}, {"id": 14, "...
2       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3       [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4       [{"id": 28, "name": "Action"}, {"id": 12, "nam...
                              ...                        
4539    [{"id": 35, "name": "Comedy"}, {"id": 18, "nam...
4579    [{"id": 12, "name": "Adventure"}, {"id": 35, "...
4601    [{"id": 878, "name": "Science Fiction"}, {"id"...
4629    [{"id": 35, "name": "Comedy"}, {"id": 27, "nam...
4666    [{"id": 10769, "name": "Foreign"}, {"id": 99, ...
Name: genres, Length: 1009, dtype: object


# Interpretation: 
This task helps identify movie categories (genres) where budgets exceed $220,000. Understanding the distribution of budgets across genres is vital for investment decisions and understanding which genres require higher financial commitments.


# Method Used: 

This code filters the rows where the budget is greater than $220,000 and then drops duplicates in the 'genres' column. Because each 'genres' value is a JSON string listing all of a movie's genres, the result contains 1009 distinct genre combinations rather than individual genres. Genres that appear in the displayed combinations include Action, Adventure, Comedy, Science Fiction, and Foreign.
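
To list individual genres rather than whole JSON strings, the 'genres' values can be parsed and expanded to one row per genre. A minimal sketch on toy rows that mimic the column's layout; genre_names is a hypothetical helper and the rows are illustrative, not the notebook's data:

```python
import json
import pandas as pd

# Toy rows mimicking the JSON layout of the genres column
toy = pd.DataFrame({
    "budget": [237000000, 9000, 300000000],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 35, "name": "Comedy"}]',
        '[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}]',
    ],
})

def genre_names(raw):
    """Return the list of genre names encoded in one JSON string."""
    return [entry["name"] for entry in json.loads(raw)]

# Filter by budget, then expand to one row per (movie, genre)
above = toy[toy["budget"] > 220000]
expanded = above["genres"].apply(genre_names).explode().dropna()
distinct_genres = sorted(expanded.unique())
print(distinct_genres)
```

The dropna() call discards movies whose genre list is empty ("[]"), which explode() would otherwise turn into missing values.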

# Task 4: Display movie categories with revenue greater than $961,000,000.

In [12]:
revenue_condition = movie_data['revenue'] > 961000000
result = movie_data[revenue_condition]['genres'].drop_duplicates()
print(result)


0      [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3      [{"id": 28, "name": "Action"}, {"id": 80, "nam...
7      [{"id": 28, "name": "Action"}, {"id": 12, "nam...
12     [{"id": 12, "name": "Adventure"}, {"id": 14, "...
16     [{"id": 878, "name": "Science Fiction"}, {"id"...
17     [{"id": 12, "name": "Adventure"}, {"id": 28, "...
25     [{"id": 18, "name": "Drama"}, {"id": 10749, "n...
26     [{"id": 12, "name": "Adventure"}, {"id": 28, "...
28     [{"id": 28, "name": "Action"}, {"id": 12, "nam...
29     [{"id": 28, "name": "Action"}, {"id": 12, "nam...
32     [{"id": 10751, "name": "Family"}, {"id": 14, "...
42     [{"id": 16, "name": "Animation"}, {"id": 10751...
44                        [{"id": 28, "name": "Action"}]
52     [{"id": 28, "name": "Action"}, {"id": 878, "na...
65     [{"id": 18, "name": "Drama"}, {"id": 28, "name...
78     [{"id": 10751, "name": "Family"}, {"id": 12, "...
124    [{"id": 16, "name": "Animation"}, {"id": 12, "...
197    [{"id": 12, "name": "Adv

# Interpretation:
 This task identifies movie categories where revenue exceeds $961,000,000. It helps determine which genres tend to generate the highest revenues.

# Method Used: 
Filtering the rows where revenue exceeds $961,000,000 and displaying the distinct values of the 'genres' column. Genres appearing in the displayed combinations include Action, Adventure, Science Fiction, Animation, Drama, and Family.

# Task 5: Remove rows with budget and revenue values of 0.

In [12]:
movie_data = movie_data[(movie_data['budget'] != 0) & (movie_data['revenue'] != 0)]
movie_data

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",10-12-2009,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",19-05-2007,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",26-10-2015,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",16-07-2012,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",07-03-2012,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4773,27000,"[{""id"": 35, ""name"": ""Comedy""}]",http://www.miramax.com/movie/clerks/,2292,"[{""id"": 1361, ""name"": ""salesclerk""}, {""id"": 30...",en,Clerks,Convenience and video store clerks Dante and R...,19.748658,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",13-09-1994,3151130,92.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Just because they serve you doesn't mean they ...,Clerks,7.4,755
4788,12000,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 35, ""nam...",http://www.missionimpossible.com/,692,"[{""id"": 237, ""name"": ""gay""}, {""id"": 900, ""name...",en,Pink Flamingos,Notorious Baltimore criminal and underground f...,4.553644,"[{""name"": ""Dreamland Productions"", ""id"": 407}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",12-03-1972,6000000,93.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,An exercise in poor taste.,Pink Flamingos,6.2,110
4792,20000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 27, ""name...",http://www.missionimpossible.com/,36095,"[{""id"": 233, ""name"": ""japan""}, {""id"": 549, ""na...",ja,キュア,A wave of gruesome murders is sweeping Tokyo. ...,0.212443,"[{""name"": ""Daiei Studios"", ""id"": 881}]","[{""iso_3166_1"": ""JP"", ""name"": ""Japan""}]",06-11-1997,99000,111.0,"[{""iso_639_1"": ""ja"", ""name"": ""\u65e5\u672c\u8a...",Released,Madness. Terror. Murder.,Cure,7.4,63
4796,7000,"[{""id"": 878, ""name"": ""Science Fiction""}, {""id""...",http://www.primermovie.com,14337,"[{""id"": 1448, ""name"": ""distrust""}, {""id"": 2101...",en,Primer,Friends/fledgling entrepreneurs invent a devic...,23.307949,"[{""name"": ""Thinkfilm"", ""id"": 446}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",08-10-2004,424760,77.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,What happens if it actually works?,Primer,6.9,658


# Interpretation: 
Rows with zero budget or revenue indicate missing or unknown values, which can distort analyses. Removing such rows ensures data accuracy and reliability in budget and revenue-related analyses.
# Method Used: 
This code filters the DataFrame to keep only the rows where both the 'budget' and 'revenue' columns are non-zero, so a row is removed if either value is 0. The dataset is reduced to 3229 rows × 20 columns.
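
The effect of the combined condition can be checked on a toy frame. The sketch below uses made-up values and a hypothetical mask variable; it shows that a row is dropped when either column is 0, not only when both are:

```python
import pandas as pd

# Illustrative rows: only "A" has both a budget and a revenue
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100, 0, 50, 0],
    "revenue": [300, 200, 0, 0],
})

# True only where both budget and revenue are non-zero
mask = (toy["budget"] != 0) & (toy["revenue"] != 0)

# .copy() gives an independent frame, which avoids chained-assignment
# warnings if columns are added to the filtered data later
filtered = toy[mask].copy()

print(f"Rows before: {len(toy)}, after: {len(filtered)}, dropped: {(~mask).sum()}")
```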

# Task 6: List the top 10 movies with the highest revenues and the top 10 movies with the least budget.

In [6]:
top_10_revenues = movie_data.nlargest(10, 'revenue')[['title', 'revenue']]
top_10_budgets = movie_data.nsmallest(10, 'budget')[['title', 'budget']]

print("Top 10 Movies with Highest Revenues:")
print(top_10_revenues)

print("\nTop 10 Movies with Least Budget:")
print(top_10_budgets)


Top 10 Movies with Highest Revenues:
                          title     revenue
0                        Avatar  2787965087
25                      Titanic  1845034188
16                 The Avengers  1519557910
28               Jurassic World  1513528810
44                    Furious 7  1506249360
7       Avengers: Age of Ultron  1405403694
124                      Frozen  1274219009
31                   Iron Man 3  1215439994
546                     Minions  1156730962
26   Captain America: Civil War  1153304495

Top 10 Movies with Least Budget:
                   title  budget
4238        Modern Times       1
3611  A Farewell to Arms       4
3372        Split Second       7
3419        Bran Nue Dae       7
4608        The Prophecy       8
3131   Of Horses and Men      10
3137           Nurse 3-D      10
2933            F.I.S.T.      11
1912      Angela's Ashes      25
1771      The 51st State      28


# Interpretation: 
Identifying the top 10 revenue-generating movies and the top 10 movies with the lowest budgets provides insights into both commercial successes and cost-effective productions.
# Method Used: 
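Selecting the 10 rows with the largest 'revenue' using nlargest() and the 10 rows with the smallest 'budget' using nsmallest(). Avatar has the highest revenue (2,787,965,087) and Modern Times has the lowest budget (1). Budgets as small as 1 to 28 are implausible as dollar amounts and likely reflect data-entry or unit inconsistencies, so the least-budget list should be read with caution.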
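
One way to keep implausibly small budgets out of the ranking is to apply a floor before calling nsmallest(). A sketch under stated assumptions: the 1000 cutoff (MIN_PLAUSIBLE_BUDGET) is an arbitrary illustrative threshold, not something the notebook defines, and the frame reuses a few title and budget pairs from the outputs above:

```python
import pandas as pd

# Title/budget pairs taken from rows displayed earlier in the notebook
toy = pd.DataFrame({
    "title": ["Modern Times", "Primer", "Clerks", "El Mariachi", "Avatar"],
    "budget": [1, 7000, 27000, 220000, 237000000],
})

MIN_PLAUSIBLE_BUDGET = 1000  # assumed cutoff, chosen only for illustration

# Drop budgets below the floor, then take the three smallest that remain
plausible = toy[toy["budget"] >= MIN_PLAUSIBLE_BUDGET]
lowest = plausible.nsmallest(3, "budget")[["title", "budget"]]
print(lowest)
```

Any such cutoff is a judgment call; inspecting the excluded rows before discarding them is advisable.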

# Task 7: Analyze the correlation between popularity and budget.

In [7]:
correlation = movie_data['popularity'].corr(movie_data['budget'])
print(f"Correlation between Popularity and Budget: {correlation}")


Correlation between Popularity and Budget: 0.4319901360079751


# Interpretation: 
This task explores the correlation between movie popularity and budget. The coefficient of about 0.43 indicates a moderate positive linear relationship: higher-budget movies tend to be more popular, but budget alone accounts for only part of the variation. Understanding this relationship can help decide how much to invest in production to maximize popularity and, ultimately, revenue.

# Method Used:
Calculating the Pearson correlation coefficient between the 'popularity' and 'budget' columns with Series.corr(), which gives 0.4319901360079751.
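
For reference, Series.corr() computes the Pearson coefficient by default. The sketch below reproduces that calculation by hand on illustrative values (not taken from the dataset) and, as an assumption about what might be useful here, also computes a rank-based (Spearman-style) coefficient as the Pearson correlation of the ranks:

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the dataset
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = x - x.mean()
dy = y - y.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = x.corr(y)               # Pearson by default
rank_r = x.rank().corr(y.rank())   # Pearson on ranks, i.e. Spearman's rho
print(manual_r, pandas_r, rank_r)
```

A rank-based coefficient is less affected by a few very large budgets, which may matter for skewed columns such as budget and popularity.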


# Task 8: Identify and display production company names and their frequency.

In [25]:
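# NOTE: production_companies holds JSON strings, not '|'-separated names, so
# split('|') leaves each string whole and the counts below are per exact
# combination of companies (including the empty list "[]")
production_companies = movie_data['production_companies'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)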
company_counts = production_companies.value_counts()
print(company_counts)


[]                                           351
[{"name": "Paramount Pictures", "id": 4}]

# Interpretation:
Identifying production companies and the number of times they appear in the dataset helps recognize the most active production companies in the industry.
# Method Used:
Counting the occurrences of each value in the 'production_companies' column. The code splits each string on '|' with str.split, stacks the pieces into a single Series, and counts them with value_counts(). Because the column holds JSON-like lists that contain no '|' character, each full list string is counted as one entry, and the empty list '[]' (movies with no listed company) is the most frequent value at 351.
Among named entries, "Paramount Pictures" has the highest frequency, 58.
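The output above shows that each entry is a JSON-like list of objects with "name" and "id" keys. As a hedged alternative to splitting on '|', the sketch below parses each string with `json.loads` and counts the individual company names. The `sample` frame and the helper `company_names` are hypothetical and only mimic the format seen in the output.

```python
import json
import pandas as pd

# Hypothetical sample rows in the JSON-like format seen in the output above.
sample = pd.DataFrame({
    "production_companies": [
        '[{"name": "Paramount Pictures", "id": 4}]',
        '[{"name": "Paramount Pictures", "id": 4}, {"name": "Universal Pictures", "id": 33}]',
        '[]',
    ]
})

def company_names(raw):
    """Return the list of company names encoded in one JSON string."""
    return [entry["name"] for entry in json.loads(raw)]

# explode() yields one row per (movie, company); an empty list becomes NaN,
# which dropna() removes so that movies without a company are not counted.
names = sample["production_companies"].apply(company_names).explode().dropna()
name_counts = names.value_counts()
print(name_counts)
```

Counting names this way credits every company on a co-production, so the resulting counts would differ from the counts of whole list strings shown above.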

# Task 9: Display the names of the top 25 production companies based on movie count.

In [9]:
top_25_companies = company_counts.head(25)
print(top_25_companies)


[{"name": "Paramount Pictures", "id": 4}]                                                                           48
[]                                                                                                                  41
[{"name": "Universal Pictures", "id": 33}]                                                                          35
[{"name": "New Line Cinema", "id": 12}]                                                                             29
[{"name": "Columbia Pictures", "id": 5}]                                                                            28
[{"name": "Twentieth Century Fox Film Corporation", "id": 306}]                                                     25
[{"name": "Warner Bros.", "id": 6194}]                                                                              24
[{"name": "Metro-Goldwyn-Mayer (MGM)", "id": 8411}]                                                                 24
[{"name": "Walt Disney Pictures", "id": 2}]     

# Interpretation:
This task ranks production companies by the number of movies they've produced. It highlights the most prolific production companies in the industry.
# Method Used:
Taking the first 25 entries of the company counts, which value_counts() already returns in descending order. "Paramount Pictures" has the highest movie count, 48.
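Because head(25) cuts the ranking at exactly 25 rows, companies tied with the last kept count can be dropped arbitrarily. The sketch below shows one possible way to keep ties, using `Series.nlargest` with `keep="all"`. The studio names and counts are placeholders, not dataset values.

```python
import pandas as pd

# Hypothetical counts; the names are placeholders, not dataset values.
toy_counts = pd.Series({"Studio A": 5, "Studio B": 3, "Studio C": 3, "Studio D": 1})

# head(n) cuts strictly at n rows, so one of the two tied studios is dropped.
top_by_head = toy_counts.sort_values(ascending=False).head(2)

# nlargest(..., keep="all") retains every entry tied with the n-th value.
top_with_ties = toy_counts.nlargest(2, keep="all")

print(top_with_ties)
```

With `keep="all"` the result can contain more than the requested number of rows, which is the trade-off for not breaking ties arbitrarily.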

# Task 10: Filter top 500 movies by revenue and calculate measures of central tendency.

In [27]:
# Filter top 500 movies by revenue
filtered_data = movie_data.nlargest(500, 'revenue')

# Calculate measures of central tendency
budget_mean = filtered_data['budget'].mean()
budget_median = filtered_data['budget'].median()
budget_mode = filtered_data['budget'].mode().values[0]

revenue_mean = filtered_data['revenue'].mean()
revenue_median = filtered_data['revenue'].median()
revenue_mode = filtered_data['revenue'].mode().values[0]

runtime_mean = filtered_data['runtime'].mean()
runtime_median = filtered_data['runtime'].median()
runtime_mode = filtered_data['runtime'].mode().values[0]

# Print the results
print("Budget:")
print(f"Mean Budget: {budget_mean}")
print(f"Median Budget: {budget_median}")
print(f"Mode Budget: {budget_mode}")

print("\nRevenue:")
print(f"Mean Revenue: {revenue_mean}")
print(f"Median Revenue: {revenue_median}")
print(f"Mode Revenue: {revenue_mode}")

print("\nRuntime:")
print(f"Mean Runtime: {runtime_mean}")
print(f"Median Runtime: {runtime_median}")
print(f"Mode Runtime: {runtime_mode}")


Budget:
Mean Budget: 102803736.0
Median Budget: 95000000.0
Mode Budget: 150000000

Revenue:
Mean Revenue: 458722133.294
Median Revenue: 363001569.5
Mode Revenue: 219076518

Runtime:
Mean Runtime: 118.626
Median Runtime: 116.0
Mode Runtime: 115.0


# Interpretation:
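The central tendency measures (mean, median, mode) for budget, revenue, and runtime describe typical values among the top 500 movies by revenue. For both budget and revenue the mean exceeds the median, which indicates a right-skewed distribution in which a few very large values pull the mean upward.
# Method Used:
Calculating the mean, median, and mode of budget, revenue, and runtime.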
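One detail worth noting is that `mode()` returns every value tied for most frequent, so taking `.values[0]` reports only the smallest of them. The sketch below illustrates this on a made-up runtime series; `toy_runtime` is hypothetical and not drawn from the dataset.

```python
import pandas as pd

# Hypothetical runtimes in minutes; two values tie for most frequent.
toy_runtime = pd.Series([100.0, 115.0, 115.0, 120.0, 120.0, 194.0])

runtime_summary = {
    "mean": toy_runtime.mean(),
    "median": toy_runtime.median(),
    # mode() returns all tied values, sorted in ascending order.
    "modes": toy_runtime.mode().tolist(),
}
print(runtime_summary)
```

Here the single long runtime pulls the mean above the median, the same pattern of right skew that shows up when a mean is larger than its median.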


# Task 11: Display movie names and run times for movies with above-average runtime.

In [11]:
above_average_runtime = filtered_data[filtered_data['runtime'] > runtime_mean][['title', 'runtime']]
print(above_average_runtime)


                 title  runtime
0               Avatar    162.0
25             Titanic    194.0
16        The Avengers    143.0
28      Jurassic World    124.0
44           Furious 7    137.0
...                ...      ...
521       The Terminal    128.0
397   It's Complicated    121.0
1744        Knocked Up    129.0
717       Jack Reacher    130.0
714         Collateral    120.0

[234 rows x 2 columns]


# Interpretation:
This task identifies movies whose runtime exceeds the average runtime of the top 500 movies by revenue (118.626 minutes). Of those 500 movies, 234 run longer than this average.
# Method Used:
Comparing each movie's runtime to the average runtime and listing the movies that exceed it. "Titanic" has the highest runtime, 194.0 minutes.
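The printed table is truncated, so the longest movie cannot be read off it directly. A minimal sketch of the same boolean-mask filter, followed by a sort that puts the longest runtime first, is given below. The titles and runtimes in `toy_top` are placeholders.

```python
import pandas as pd

# Hypothetical titles and runtimes; placeholders, not dataset values.
toy_top = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "runtime": [90.0, 110.0, 130.0, 150.0],
})

mean_runtime = toy_top["runtime"].mean()

# Boolean mask: True for rows whose runtime is strictly above the mean.
is_long = toy_top["runtime"] > mean_runtime

# Keep only the two columns of interest and sort longest first.
long_movies = toy_top.loc[is_long, ["title", "runtime"]].sort_values(
    "runtime", ascending=False
)
longest_title = long_movies.iloc[0]["title"]
print(long_movies)
```

Sorting before printing makes the maximum visible even when pandas abbreviates the middle rows of a long result.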

# Conclusion:

The analysis of the movie dataset has provided valuable insights into the factors influencing a movie's commercial success. This analysis addressed the business objective of understanding movie performance in cinemas. Key findings include the identification of successful movie categories, the relationship between popularity and budget, and the recognition of top production companies in the industry. These insights can guide production companies in making informed decisions about investments, genre selection, and resource allocation, ultimately enhancing their ability to predict and achieve commercial success in the competitive film industry.

# Insights
The analysis of the movie dataset has successfully addressed the business objective of understanding movie performance in cinemas. Here are the key findings and insights:

# Successful Movie Categories:

Movies in the Action, Adventure, Comedy, Science Fiction, and Foreign categories have budgets greater than $220,000, which suggests significant production investment in these genres.

# High Budget Movies:

The Action, Adventure, Science Fiction, Animation, and Drama genres have budgets exceeding $961,000,000, indicating that these genres can carry high production costs.

# Top Revenue Movie:

"Avatar" holds the record for the highest revenue, generating $2,787,965,087 in earnings.

# Lowest Budget Movie:

"Modern Times" has the lowest reported budget, with a value of 1, which most likely reflects a data-entry issue rather than the actual production cost.

# Popularity vs. Budget:

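There is a moderate positive correlation of approximately 0.43 between movie popularity and budget, suggesting that movies with higher budgets tend to be more popular, although the correlation alone does not show that a larger budget causes higher popularity.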

# Leading Production Company:

"Paramount Pictures" stands out as the production company with the highest movie count, having produced 48 films.

# Average Metrics:

The average budget for the top 500 movies by revenue is $102,803,736.
The average revenue for these movies is $458,722,133.29.
The average runtime for these movies is 118.63 minutes.

# Longest Movie:

"Titanic" has the highest runtime, with a duration of 194.0 minutes.

These insights provide valuable information for production companies, helping them make informed decisions about genre selection, budget allocation, and resource management. By understanding the factors that contribute to a movie's commercial success, companies can enhance their ability to predict and achieve success in the competitive film industry.