**Introduction**

Movies are a popular form of entertainment, enjoyed by people all over the world. But what factors influence the success of a movie? Is it the budget, the rating, the genre, or the release date? To find out, I will analyze a dataset of movies and try to identify some of the factors that influence their success. I will look at the average budget of movies over time, the most common rating, the genres with the highest and lowest average scores, and the relationship between release date and gross earnings.

# Loading the Data

**We start by importing the libraries used in this project.**

In [None]:
# import the libraries used in this project
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


**Reading the data from a CSV (comma-separated values) file.**

In [None]:
# importing the dataset to Pandas DataFrame: movie_df
movie_df = pd.read_csv("/kaggle/input/movies/movies.csv")

A Pandas DataFrame is a two-dimensional data structure that stores data in a tabular format of rows and columns.

A pandas DataFrame object has attributes and methods.
Attributes represent characteristics of the DataFrame. Examples: shape, dtypes
Methods represent actions that can be carried out on the DataFrame. Examples: info(), describe()
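
To make the distinction concrete, here is a minimal sketch on a small made-up table (the `toy_df` name and its values are hypothetical, not part of the movies dataset). Attributes are read without parentheses, while methods are called with them.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({"title": ["A", "B", "C"], "score": [7.1, 6.4, 8.0]})

# Attributes: accessed without parentheses
n_rows, n_cols = toy_df.shape
column_types = toy_df.dtypes

# Methods: called with parentheses, optionally with arguments
first_two = toy_df.head(2)

print(n_rows, n_cols)
print(column_types)
print(first_two)
```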


pandas can read a variety of file formats, for example fixed-width formatted tables (read_fwf), Excel sheets (read_excel), HTML tables (read_html), and JSON files (read_json).
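
As a small sketch of how these readers share the same interface, the example below parses the same two made-up records once from CSV text and once from JSON text. The sample strings and variable names are assumptions for illustration; in-memory buffers stand in for real files.

```python
import io
import pandas as pd

# Hypothetical sample records, not taken from the movies dataset
csv_text = "name,year\nMovie A,1999\nMovie B,2005\n"
json_text = '[{"name": "Movie A", "year": 1999}, {"name": "Movie B", "year": 2005}]'

# Each reader returns a DataFrame, whatever the source format
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(json_df)
```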

In [None]:
#display first five rows of the DataFrame: movie_df
movie_df.head()

In [None]:
#display last five rows of the DataFrame: movie_df
movie_df.tail()

# Exploring the Data

In [None]:
# Check the number of rows and  number of columns in Pandas DataFrame: movie_df
movie_df.shape

In [None]:
# check the row index labels of Pandas DataFrame: movie_df
movie_df.index

In [None]:
# Check the column labels of Pandas DataFrame: movie_df
movie_df.columns

In [None]:
# Get the concise summary of DataFrame: movie_df
movie_df.info()   

In [None]:
# Get the statistical summary of numeric columns of DataFrame: movie_df
movie_df.describe()

The describe() method returns basic statistics for every numeric column of the DataFrame: count, mean, std, min, 25%, 50%, 75%, and max.
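
A minimal sketch on made-up data (the `toy_df` table is hypothetical) shows that describe() skips non-numeric columns by default and returns one row per statistic:

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [90, 100, 110, 120],
})

# Only the numeric "runtime" column appears in the result
stats = toy_df.describe()
print(stats)
```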

# Handling null and missing values

Datasets can contain missing values. These can be handled in two ways:

1. Remove the missing values. Rows or columns containing missing values can be removed.

   The pandas.DataFrame.dropna() method drops missing values from the DataFrame.
   It takes an axis parameter, which can be 0 (the default) or 1.

   If axis=0, it drops rows containing missing values.
   If axis=1, it drops columns containing missing values.


2. Replace the missing values with new values. This is also called imputation.
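
The two approaches can be sketched on a small made-up table. The `toy_df` values are hypothetical, and filling with the column median is just one possible imputation choice, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing budget
toy_df = pd.DataFrame({
    "budget": [10.0, np.nan, 30.0],
    "gross": [50.0, 60.0, 70.0],
})

# Option 1: remove missing values
rows_dropped = toy_df.dropna(axis=0)  # drops the row that contains NaN
cols_dropped = toy_df.dropna(axis=1)  # drops the column that contains NaN

# Option 2: imputation, here with the median of the observed budgets
imputed = toy_df.fillna({"budget": toy_df["budget"].median()})

print(rows_dropped)
print(cols_dropped)
print(imputed)
```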


In [None]:
# finding the null value using isna() and any() in the columns of dataframe: movie_df
movie_df.isna().any()

The pandas.DataFrame.isna() method detects missing values in the DataFrame. It returns a DataFrame of the same shape filled with boolean values.

Missing values such as NaN and None are mapped to True, and all other values are mapped to False.

pandas.DataFrame.any(axis=0) then checks, column by column, whether any element is True.

This reduces over the row index and returns a Series indexed by column name.
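
As a minimal sketch of the two steps (the `toy_df` table is made up for illustration), isna() first builds a boolean table, and any() then collapses it to one flag per column:

```python
import pandas as pd

# Hypothetical toy data with one missing rating
toy_df = pd.DataFrame({
    "rating": ["R", None, "PG"],
    "score": [6.5, 7.0, 8.1],
})

# Step 1: boolean DataFrame with the same shape as toy_df
mask = toy_df.isna()

# Step 2: one boolean per column, True if the column has any missing value
has_missing = mask.any()

print(mask)
print(has_missing)
```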

In [None]:
# finding the number of missing values in each column of dataframe: movie_df
movie_df.isnull().sum()

In [None]:
# Drop the rows with missing values in the specified columns of the dataframe: movie_df
movie_df.dropna(subset=['rating', 'released', 'director', 'writer', 'star', 'country', 'budget', 'gross', 'company', 'runtime'],inplace = True)

In [None]:
# checking for the missing value in the dataframe : movie_df
movie_df.isna().any()

# Handling Duplicates

In [None]:
# finding the number of duplicate records in dataframe: movie_df
movie_df.duplicated().sum()

The pandas.DataFrame.duplicated() method marks duplicate records with the boolean value True.
By default, the first occurrence of a record is marked False and every later occurrence is marked True.
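
A short sketch on made-up rows (the `toy_df` table is hypothetical) shows the default behaviour next to `keep=False`, which flags every copy of a repeated record, including the first:

```python
import pandas as pd

# Hypothetical toy data: the record ("A", 2000) appears three times
toy_df = pd.DataFrame({
    "name": ["A", "B", "A", "A"],
    "year": [2000, 2001, 2000, 2000],
})

# Default (keep="first"): the first occurrence is not flagged
default_flags = toy_df.duplicated()

# keep=False: every occurrence of a repeated record is flagged
all_flags = toy_df.duplicated(keep=False)

print(default_flags.tolist())
print(all_flags.tolist())
```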

In [None]:
# get the duplicate record from the dataframe : movie_df
movie_df[movie_df.duplicated()]

In [None]:
# dropping the duplicate records from the dataframe: movie_df
movie_df.drop_duplicates(inplace = True)
movie_df.duplicated().sum()

The pandas.DataFrame.drop_duplicates() method drops duplicate records and returns a DataFrame free of them (or modifies the DataFrame in place when inplace=True). By default, the first occurrence of each record is kept and the rest are dropped.
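
drop_duplicates() also accepts `subset` and `keep` arguments. The sketch below, on a made-up table, treats rows as duplicates when only the `name` column matches and keeps the last occurrence. This is an illustration of the parameters, not a step used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: "A" appears twice with different years
toy_df = pd.DataFrame({
    "name": ["A", "B", "A"],
    "year": [2000, 2001, 2010],
})

# Compare rows on "name" only and keep the last occurrence of each name
deduped = toy_df.drop_duplicates(subset=["name"], keep="last")

print(deduped)
```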

# Checking column data types

In [None]:
# get the data types of all columns of the DataFrame: movie_df
movie_df.dtypes

In [None]:
# setting data type of columns 
movie_df["budget"] = movie_df["budget"].astype("int64")
movie_df["gross"] = movie_df["gross"].astype("int64")
movie_df["votes"] = movie_df["votes"].astype("int64")
movie_df["runtime"] = movie_df["runtime"].astype("int64")
movie_df['genre'] = movie_df['genre'].astype('category')
movie_df['rating'] = movie_df['rating'].astype('category')

The astype() method casts DataFrame columns to the required type.
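
One reason the missing values are removed before this step is that a float column containing NaN cannot be cast to int64. The following minimal sketch uses a made-up `budgets` Series to show the failure and the cast that works once the NaN is gone:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing budget
budgets = pd.Series([1.0e6, 2.5e6, np.nan])

# Casting fails while NaN is present, because int64 has no missing-value marker
try:
    budgets.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# After dropping the NaN, the cast succeeds
clean_budgets = budgets.dropna().astype("int64")

print(cast_failed)
print(clean_budgets)
```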

In [None]:
# checking the changes in the data type of the column
movie_df.dtypes

In [None]:
movie_df

In [None]:
# creating a new column to store month
movie_df["released_month"] = movie_df["released"].astype(str).str[:3] 
movie_df


In [None]:
# Replace "Not Rated" with "Unrated" and merge "PG-13" into "PG" in the rating column
movie_df['rating'] = movie_df['rating'].replace({"Not Rated":"Unrated",
                                                 "PG-13":"PG"})


In [None]:
# dropping the column that I am not going to use
movie_df.drop(["released"],axis = 1,inplace = True)
movie_df

# Data Analysis

Q1 What are the top 10 highest-grossing movies of 2019-2020?


Q2 What is the relationship between budget and gross?


Q3 What is the most common rating for movies in the dataset? 


Q4 Which genres of movies tend to have the highest gross earnings?


Q5 How has the average budget of movies changed over time?


Q6 Do movies released in the summer months tend to have higher gross earnings than movies released in other months of the year?


Q7 What is the average score for each genre?

**Q1 What are the top 10 highest-grossing movies of 2019-2020?**

In [None]:
# Filter the data to only include movies released in 2019-2020
movie_data_2019_2020 = movie_df[movie_df['year'].between(2019, 2020)]

# Sort the data by gross in descending order
movie_data_2019_2020 = movie_data_2019_2020.sort_values(by='gross', ascending=False)

# Select the top 10 movies
top_10_movies = movie_data_2019_2020.head(10)

# Print the top 10 movies
top_10_movies


As you can see, the DataFrame above shows the top 10 highest-grossing movies of 2019-2020.

**Q2 What is the relationship between budget and gross?**

In [None]:
# **Relationship between budget and gross**

# Calculate the correlation between budget and gross
budget_gross_corr = movie_df['budget'].corr(movie_df['gross'])

# Print the correlation between budget and gross
print(budget_gross_corr)

# Create a scatter plot showing the relationship between budget and gross
plt.scatter(movie_df['budget'], movie_df['gross'])
plt.xlabel('budget')
plt.ylabel('gross')
plt.title('Relationship between budget and gross')
plt.show()

The scatter plot shows a positive relationship between budget and gross: as budget increases, gross also tends to increase. The printed correlation coefficient measures how strong this linear relationship is.
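
By default, Series.corr() computes the Pearson correlation coefficient, which is the covariance of the two columns divided by the product of their standard deviations. A minimal sketch on made-up numbers (the `budget` and `gross` values below are hypothetical) checks that the manual formula agrees with pandas:

```python
import pandas as pd

# Hypothetical toy values, not taken from the movies dataset
budget = pd.Series([10.0, 20.0, 30.0, 40.0])
gross = pd.Series([25.0, 35.0, 70.0, 80.0])

# Pearson r = cov(x, y) / (std(x) * std(y))
manual_r = budget.cov(gross) / (budget.std() * gross.std())

# The same coefficient computed by pandas
pandas_r = budget.corr(gross)

print(manual_r, pandas_r)
```

A value close to 1 indicates a strong positive linear relationship, a value close to 0 indicates little linear relationship, and a value close to -1 indicates a strong negative one.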

**Q3 What is the most common rating for movies in the dataset?**


In [None]:
# Create a pie chart of the rating column of the movie_df DataFrame, with the percentage of movies for each rating displayed on the chart
plt.figure(figsize=(10,5))
movie_df['rating'].value_counts().plot.pie(autopct="%1.1f%%", fontsize=14)

# Set the title of the chart and the font size of the title
plt.title('Rating', fontsize=25)

# Show the plot
plt.show()


The pie chart shows that the most common rating for movies in the dataset is PG. Because PG-13 was merged into PG during cleaning, this category covers both ratings, so the largest share of movies in the dataset is aimed at general audiences with parental guidance.


**Q4 Which genres of movies tend to have the highest gross earnings?**

In [None]:
# **Genres with the highest gross earnings**

# Group the data by genre and calculate the average gross earnings for each genre
genre_gross_earnings = movie_df.groupby('genre')['gross'].mean()

# Sort the genre gross earnings in descending order
genre_gross_earnings = genre_gross_earnings.sort_values(ascending=False)

# Create a bar chart showing the average gross earnings for each genre
plt.bar(genre_gross_earnings.index, genre_gross_earnings)

# Rotate the x-axis labels and increase their font size
# (done after plotting so the settings apply to the genre ticks)
plt.xticks(rotation=45, fontsize=12)

# Set the axis labels and title
plt.xlabel('Genre')
plt.ylabel('Average Gross Earnings')
plt.title('Average Gross Earnings by Genre')

# Show the plot
plt.show()

The graph shows the average gross earnings for each genre in the dataset. The genres with the highest average gross earnings are Family, Animation, and Action. The genres with the lowest average gross earnings are Sci-Fi, Romance, and Western.
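
The group-then-average pattern behind this chart can be sketched on a small made-up table (the `toy_df` values are hypothetical): groupby() splits the rows by genre, mean() averages each group, and sort_values() orders the result.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy_df = pd.DataFrame({
    "genre": ["Action", "Drama", "Action", "Drama", "Comedy"],
    "gross": [100, 20, 300, 40, 60],
})

# Average gross per genre, highest first
mean_gross = (
    toy_df.groupby("genre")["gross"]
    .mean()
    .sort_values(ascending=False)
)

print(mean_gross)
```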

**Q5 How has the average budget of movies changed over time?**

In [None]:
# **Average budget of movies over time**

# Group the data by year and calculate the average budget for each year
year_budget = movie_df.groupby('year')['budget'].mean()

# Create a line chart showing the average budget for each year
plt.plot(year_budget.index, year_budget)

# Rotate the x-axis labels and increase their font size
# (done after plotting so the settings apply to the year ticks)
plt.xticks(rotation=45, fontsize=12)
plt.xlabel('Year')
plt.ylabel('Average Budget')
plt.title('Average Budget over Time')
plt.show()

The line chart shows that the average budget for making a movie has increased since 1980. Possible reasons include the rising cost of filmmaking and the desire to create more visually appealing and complex movies.


**Q6 Do movies released in the summer months tend to have higher gross earnings than movies released in other months of the year?**

In [None]:
# **Gross earnings of movies released in the summer months**

# Filter the data to only include movies released in the summer months (June, July, and August)
summer_movies = movie_df[movie_df['released_month'].isin(['Jun', 'Jul', 'Aug'])]

# Calculate the average gross earnings for summer movies
summer_movies_gross_earnings = summer_movies['gross'].mean()

# Filter the data to only include movies released in months other than the summer months
non_summer_movies = movie_df[~movie_df['released_month'].isin(['Jun', 'Jul', 'Aug'])]

# Calculate the average gross earnings for non-summer movies
non_summer_movies_gross_earnings = non_summer_movies['gross'].mean()

# Create a bar chart showing the average gross earnings for summer movies and non-summer movies
plt.bar(['Summer Movies', 'Non-Summer Movies'], [summer_movies_gross_earnings, non_summer_movies_gross_earnings])
plt.xlabel('Type of Movie')
plt.ylabel('Average Gross Earnings')
plt.title('Average Gross Earnings by Type of Movie')
plt.show()

The graph above compares movies released in the summer with movies released in the rest of the year. It suggests that summer releases tend to have higher average gross earnings than movies released in other months, possibly because people are more inclined to stay at home during the winter.
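
A comparison like this depends on splitting the data into two complementary groups. As a minimal sketch on made-up numbers (the `toy_df` values are hypothetical), a boolean mask selects the summer rows and its negation (`~`) selects everything else:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the split
toy_df = pd.DataFrame({
    "released_month": ["Jun", "Dec", "Jul", "Mar"],
    "gross": [100, 40, 80, 20],
})

# True for rows released in June, July, or August
is_summer = toy_df["released_month"].isin(["Jun", "Jul", "Aug"])

# The mask and its negation cover every row exactly once
summer_mean = toy_df.loc[is_summer, "gross"].mean()
other_mean = toy_df.loc[~is_summer, "gross"].mean()

print(summer_mean, other_mean)
```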

**Q7 What is the average score for each genre?**

In [None]:
# **Average score for each genre**

# Group the data by genre and calculate the average score for each genre
genre_scores = movie_df.groupby('genre')['score'].mean()

# Sort the genre scores in descending order
genre_scores = genre_scores.sort_values(ascending=False)

# Create a bar chart showing the average score for each genre
plt.bar(genre_scores.index, genre_scores)

# Rotate the x-axis labels and increase their font size
# (done after plotting so the settings apply to the genre ticks)
plt.xticks(rotation=45, fontsize=12)
plt.xlabel('Genre')
plt.ylabel('Average Score')
plt.title('Average score by Genre')
plt.show()

The graph shows the average score for each genre of movie in the dataset. The genres with the highest average score are Biography, Drama and Animation. The genres with the lowest average score are Western, Thriller and Horror.

**Summary**

Summary of analysis

Based on the analysis we have done so far, we can draw the following conclusions:

The average budget of movies has increased significantly over time, likely due to factors such as the increasing cost of filmmaking, the desire to create more visually appealing and complex movies, the desire to attract a global audience, and the increasing competition from other forms of entertainment.


The most common rating for movies in the dataset is PG (which here includes PG-13), suggesting that the largest share of movies in the dataset is aimed at general audiences with parental guidance.


The genres with the highest average score in the dataset are Biography, Drama and Animation. The genres with the lowest average score are Western, Thriller and Horror.


Movies released in the summer tend to have higher gross earnings than movies released in other months. This could be due to a number of factors: people may be more likely to go to the movies during the summer, summer movies are often marketed towards families and children, and summer movies are often blockbusters.


Overall, the analysis suggests that a number of factors can influence the success of a movie, including its budget, rating, genre, and release date. However, it is important to note that there are always exceptions to the rule, and that there are many great movies in all genres and with all budgets.





**Discussion**

The results of our analysis raise a number of interesting questions. For example, why is the average budget of movies increasing over time? Is this a good thing or a bad thing? How does the increasing budget of movies affect the quality of movies? Additionally, why are certain genres of movies more popular than others? Does this have to do with the quality of the movies, or with other factors, such as marketing and distribution?

These are just a few of the questions that could be explored further. Additional research could be done to investigate these questions in more depth, and to identify other factors that may influence the success of a movie.