# **IMDb Movie Assignment**
Project Description: We have the data for the 100 top-rated movies from the past decade along with various pieces of information about the movie, its actors, and the voters who have rated these movies online. 

Tech-Stack Used: kaggle.com

Result: In this assignment, we find some interesting insights into a few movies released between 1916 and 2016, using Python.

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import the numpy, pandas, seaborn and matplotlib packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Task 1: Reading and Inspection

### Subtask 1.1: Import and read

Import and read the movie database. Store it in a variable called `movies`.

In [None]:
import os
print(os.listdir("../input"))

In [None]:
movies = pd.read_csv("../input/MovieAssignmentData.csv")
movies.head()

-  ### Subtask 1.2: Inspect the dataframe

Inspect the dataframe's columns, shapes, variable types etc.

In [None]:
movies.shape

In [None]:
# Concise summary of the Dataframe.
movies.info()

In [None]:
# Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution,
# excluding NaN values
movies.describe()

In [None]:
movies.color.describe()

In [None]:
movies.director_name.describe()

In [None]:
movies.genres.describe()

In [None]:
movies.movie_title.describe()

In [None]:
movies.actor_3_name.describe()

In [None]:
movies.plot_keywords.describe()

In [None]:
movies.movie_imdb_link.describe()

In [None]:
movies.language.describe()

In [None]:
movies.country.describe()

In [None]:
movies.content_rating.describe()

## Task 2: Cleaning the Data
Now that we have loaded and inspected the dataset, we can see that most of the data is in place. Before moving on to data manipulation, analysis, and visualisation, we check for missing values and drop the columns and rows that are not needed.
-  ### Subtask 2.1: Inspect Null values

Find out the number of Null values in all the columns and rows. Also, find the percentage of Null values in each column. Round off the percentages to two decimal places.

In [None]:
# Write your code for column-wise null count here
movies.isnull().sum()

In [None]:
# Write your code for row-wise null count here
movies.isnull().sum(axis = 1)

In [None]:
# Write your code for column-wise null percentages here
round((movies.isnull().sum()/len(movies.index))*100,2)

-  ### Subtask 2.2: Drop unnecessary columns

For this assignment, you will mostly be analyzing the movies with respect to ratings, gross collection, popularity, etc., so many of the columns in this dataframe are not required. It is advised to drop the following columns:
-  color
-  director_facebook_likes
-  actor_1_facebook_likes
-  actor_2_facebook_likes
-  actor_3_facebook_likes
-  actor_2_name
-  cast_total_facebook_likes
-  actor_3_name
-  duration
-  facenumber_in_poster
-  content_rating
-  country
-  movie_imdb_link
-  aspect_ratio
-  plot_keywords

In [None]:
# Drop the columns that are not needed for this analysis. It is advised to keep inspecting the dataframe after each set of operations
cols_to_drop = ["color", "director_facebook_likes", "actor_1_facebook_likes", "actor_2_facebook_likes",
                "actor_3_facebook_likes", "actor_2_name", "cast_total_facebook_likes", "actor_3_name",
                "duration", "facenumber_in_poster", "content_rating", "country", "movie_imdb_link",
                "aspect_ratio", "plot_keywords"]
movies = movies.drop(columns=cols_to_drop)
movies.head()

In [None]:
movies.columns

-  ### Subtask 2.3: Drop unnecessary rows using columns with high Null percentages

Now, on inspection, you might notice that some columns have a large percentage (greater than 5%) of Null values. Drop all the rows which have Null values in such columns.

In [None]:
# Re-check the column-wise null percentages to find the columns above the 5% threshold
round((movies.isnull().sum()/len(movies.index))*100,2)

In [None]:
# Drop the rows that have Null values in 'gross' or 'budget'
movies = movies.dropna(subset=["gross", "budget"])
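
The columns above the 5% threshold can also be picked out programmatically rather than by eye. The following is a minimal sketch on a made-up toy dataframe (`toy`, `high_null_cols`, and `toy_clean` are hypothetical names, and the values are invented for illustration), not the assignment's reference solution.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `movies`; the values are invented for illustration only
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "gross": [10.0, np.nan, 30.0, 40.0, 50.0],
    "budget": [5.0, 6.0, np.nan, 8.0, 9.0],
    "language": ["English"] * 5,
})

# Percentage of Null values per column
null_pct = toy.isnull().mean() * 100

# Names of the columns whose Null percentage exceeds 5%
high_null_cols = null_pct[null_pct > 5].index.tolist()

# Drop every row that is Null in at least one of those columns
toy_clean = toy.dropna(subset=high_null_cols)
print(high_null_cols)
print(toy_clean.shape)
```

Applied to `movies`, the same pattern would avoid hard-coding the column names, at the cost of dropping rows for any column that happens to cross the threshold.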

-  ### Subtask 2.4: Fill NaN values

You might notice that the `language` column has some NaN values. Here, on inspection, you will see that it is safe to replace all the missing values with `'English'`.

In [None]:
# Fill the NaN values in the 'language' column here
movies["language"] = movies.language.fillna("English")
movies.language.isnull().sum()

-  ### Subtask 2.5: Check the number of retained rows

You might notice that two of the columns, viz. `num_critic_for_reviews` and `actor_1_name`, have small percentages of NaN values left. You can leave these columns as they are for now. Check the number and percentage of the rows retained after completing all the tasks above.

In [None]:
# Number of rows retained, and the percentage relative to the 5043 rows in the original file
print(len(movies.index))
print(round(100 * len(movies.index) / 5043, 2))


**Checkpoint 1:** You might have noticed that we still have around `77%` of the rows!
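
The cell above divides by 5043, the row count of the original file. A less brittle option is to record the row count before any rows are dropped. The sketch below illustrates the idea on a made-up toy dataframe; `original_rows`, `retained_rows`, and `retained_pct` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `movies`; the values are invented for illustration only
toy = pd.DataFrame({
    "gross": [1.0, np.nan, 3.0, 4.0],
    "budget": [1.0, 2.0, np.nan, 4.0],
})

# Record the row count before any cleaning step
original_rows = len(toy)

# Cleaning step: drop rows with Null values in 'gross' or 'budget'
toy = toy.dropna(subset=["gross", "budget"])

# Number and percentage of retained rows
retained_rows = len(toy)
retained_pct = round(100 * retained_rows / original_rows, 2)
print(retained_rows, retained_pct)
```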

## Task 3: Data Analysis

-  ### Subtask 3.1: Change the unit of columns

Convert the unit of the `budget` and `gross` columns from `$` to `million $`.

In [None]:
# Write your code for unit conversion here
movies.budget = round(movies["budget"]/1000000,2)
movies.gross = round(movies["gross"]/1000000,2)

-  ### Subtask 3.2: Find the movies with highest profit

    1. Create a new column called `profit` which contains the difference of the two columns: `gross` and `budget`.
    2. Sort the dataframe using the `profit` column as reference.
    3. Extract the top ten profiting movies in descending order and store them in a new dataframe - `top10`

In [None]:
# Write your code for creating the profit column here
movies["profit"] = movies.gross - movies.budget

In [None]:
# Write your code for sorting the dataframe here
movies_by_profit = movies.sort_values("profit",ascending = False)

In [None]:
# Write code for profit vs budget plot here
plt.figure(figsize=[11,7])
plt.scatter(movies.budget,movies.profit)
plt.title("Profit vs Budget")
plt.xlabel("Budget (million $)")
plt.ylabel("Profit (million $)")
plt.show()

**My Observation: Movies with higher budgets are not necessarily profitable**

The dataset contains the 100 best-performing movies from 2010 to 2016. However, the scatter plot tells a different story: some movies have negative profit. Good movies do sometimes incur losses, but there appear to be quite a few such movies here. What could be the reason behind this? Let's take a closer look by finding the movies with negative profit.
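
One way to find the movies with negative profit is a boolean filter on the `profit` column. This is a minimal sketch on a made-up toy dataframe (`toy` and `losses` are hypothetical names, and the figures are invented), assuming `profit` is defined as `gross - budget` as in this subtask.

```python
import pandas as pd

# Toy stand-in for `movies`; the values are invented for illustration only
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "budget": [100.0, 50.0, 200.0, 10.0],
    "gross": [150.0, 20.0, 120.0, 40.0],
})
toy["profit"] = toy["gross"] - toy["budget"]

# Keep only the loss-making movies, with the biggest loss first
losses = toy[toy["profit"] < 0].sort_values("profit")
print(losses)
```

On the real dataframe, inspecting the largest losses this way may reveal rows that deserve a closer look.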

In [None]:
# Write your code to get the top 10 profiting movies here
top10 = movies_by_profit.head(10)
top10

-  ### Subtask 3.3: Drop duplicate values

After you found the top 10 profiting movies, you might have noticed a duplicate value. So, it seems like the dataframe has duplicate values as well. Drop the duplicate values from the dataframe and repeat `Subtask 3.2`.

In [None]:
# Write your code for dropping duplicate values here
movies = movies.drop_duplicates()
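
Before dropping, it can help to look at which rows are actually duplicated. The sketch below uses a made-up toy dataframe (`toy`, `dupes`, `deduped`, and `top` are hypothetical names). Note that `drop_duplicates()` keeps the existing row order, so a ranking by profit has to be computed again on the de-duplicated frame.

```python
import pandas as pd

# Toy stand-in for `movies`; the values are invented for illustration only
toy = pd.DataFrame({
    "movie_title": ["A", "B", "A", "C"],
    "profit": [5.0, 3.0, 5.0, 1.0],
})

# keep=False marks every copy of a duplicated row, not only the repeats
dupes = toy[toy.duplicated(keep=False)]

# Drop exact duplicate rows (the first occurrence is kept by default)
deduped = toy.drop_duplicates()

# Re-rank by profit on the de-duplicated frame
top = deduped.sort_values("profit", ascending=False).head(2)
print(dupes)
print(top)
```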

In [None]:
# Repeat Subtask 3.2: re-sort by profit after dropping duplicates, then take the top 10
top10 = movies.sort_values("profit", ascending=False).head(10)
top10

**Checkpoint 2:** You might spot two movies directed by `James Cameron` in the list.

-  ### Subtask 3.4: Find IMDb Top 250

    1. Create a new dataframe `IMDb_Top_250` and store the top 250 movies with the highest IMDb Rating (corresponding to the column: `imdb_score`). Also make sure that for all of these movies, the `num_voted_users` is greater than 25,000. Also add a `Rank` column containing the values 1 to 250 indicating the ranks of the corresponding films.
    2. Extract all the movies in the `IMDb_Top_250` dataframe which are not in the English language and store them in a new dataframe named `Top_Foreign_Lang_Film`.

In [None]:
# Write your code for extracting the top 250 movies as per the IMDb score here. Make sure that you store it in a new dataframe
# and name that dataframe as 'IMDb_Top_250'
IMDb_Top_250 = movies[movies["num_voted_users"] > 25000]
IMDb_Top_250 = IMDb_Top_250.sort_values("imdb_score", ascending = False).head(250)
# Add the 'Rank' column (1 for the highest-rated film); sized to the dataframe in case fewer than 250 rows qualify
IMDb_Top_250["Rank"] = list(range(1, len(IMDb_Top_250) + 1))
IMDb_Top_250

In [None]:
Top_Foreign_Lang_Film = IMDb_Top_250[IMDb_Top_250["language"] != "English"]
Top_Foreign_Lang_Film.head()

**Checkpoint 3:** Can you spot `Veer-Zaara` in the dataframe?

- ### Subtask 3.5: Find the best directors

    1. Group the dataframe using the `director_name` column.
    2. Find out the top 10 directors for whom the mean of `imdb_score` is the highest and store them in a new dataframe `top10director`. 

In [None]:
# Mean IMDb score per director, sorted from highest to lowest; keep the top 10 in 'top10director'
top10director = movies.pivot_table(values = 'imdb_score', index = 'director_name', aggfunc = 'mean')
top10director = top10director.sort_values(by = 'imdb_score', ascending = False)
top10director = top10director.head(10)
top10director
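
The task statement asks for a group-by on `director_name`, whereas the cell above uses `pivot_table`. The two give the same means; a `groupby` version might look like the sketch below, shown on a made-up toy dataframe (`toy` and `mean_scores` are hypothetical names, and the scores are invented).

```python
import pandas as pd

# Toy stand-in for `movies`; the values are invented for illustration only
toy = pd.DataFrame({
    "director_name": ["X", "X", "Y", "Z", "Z"],
    "imdb_score": [8.0, 9.0, 7.0, 9.0, 9.5],
})

# Mean score per director, highest first, keeping the top 2 for this toy example
mean_scores = (
    toy.groupby("director_name")["imdb_score"]
       .mean()
       .sort_values(ascending=False)
       .head(2)
       .reset_index()
)
print(mean_scores)
```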

**Checkpoint 4:** No surprises that `Damien Chazelle` (director of Whiplash and La La Land) is in this list.

-  ### Subtask 3.6: Find popular genres

You might have noticed the `genres` column in the dataframe with all the genres of the movies separated by a pipe (`|`). Out of all the movie genres, the first two are most significant for any film.

1. Extract the first two genres from the `genres` column and store them in two new columns: `genre_1` and `genre_2`. Some of the movies might have only one genre. In such cases, extract the single genre into both the columns, i.e. for such movies the `genre_2` will be the same as `genre_1`.
2. Group the dataframe using `genre_1` as the primary column and `genre_2` as the secondary column.
3. Find the 5 most popular combinations of genres by taking the mean of the `gross` column for each group, and store them in a new dataframe named `PopGenre`.

In [None]:
# Write your code for extracting the first two genres of each movie here

movies["genre_1"] = movies["genres"].str.split("|").str.get(0)
movies["genre_2"] = movies["genres"].str.split("|").str.get(1)
movies["genre_2"] = movies["genre_2"].fillna(movies["genre_1"])
movies.head()

In [None]:
# Write your code for grouping the dataframe here
movies_by_segment = movies.groupby(['genre_1','genre_2'])

In [None]:
# Write your code for getting the 5 most popular combo of genres here
PopGenre = pd.DataFrame(movies_by_segment.gross.mean()).sort_values("gross", ascending = False).head()
PopGenre


**Checkpoint 5:** Well, as it turns out, `Family + Sci-Fi` is the most popular combo of genres out there!

-  ### Subtask 3.7: Find the critic-favorite and audience-favorite actors

    1. Create three new dataframes namely, `Meryl_Streep`, `Leo_Caprio`, and `Brad_Pitt` which contain the movies in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Use only the `actor_1_name` column for extraction. Also, make sure that you use the names 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' for the said extraction.
    2. Append the rows of all these dataframes and store them in a new dataframe named `Combined`.
    3. Group the combined dataframe using the `actor_1_name` column.
    4. Find the mean of the `num_critic_for_reviews` and `num_user_for_reviews` columns and identify the actor who has the highest mean in each.

In [None]:
# Write your code for creating three new dataframes here

Meryl_Streep = movies[movies["actor_1_name"] == "Meryl Streep"]
Meryl_Streep

In [None]:
Leo_Caprio = movies[movies["actor_1_name"] == "Leonardo DiCaprio"]
Leo_Caprio

In [None]:
Brad_Pitt = movies[movies["actor_1_name"] == "Brad Pitt"]
Brad_Pitt.head()

**Checkpoint 6:** `Leonardo` has aced both the lists!

In [None]:
# Combine the three dataframes row-wise and store the result in 'Combined'
Combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt], axis = 0)
Combined

In [None]:
# Group the combined dataframe by lead actor
movies_by_lead = Combined.groupby('actor_1_name')

In [None]:
# Write the code for finding the mean of critic reviews and audience reviews here
print(movies_by_lead["num_critic_for_reviews"].mean())
print(movies_by_lead["num_user_for_reviews"].mean())
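
The printed means still have to be compared by eye to find the favorite actors. One possible shortcut is `idxmax`, which returns the group label with the highest value in each column. The sketch below uses a made-up toy dataframe; the review counts are invented, so it says nothing about the real result, and `toy`, `means`, and `favorites` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the combined dataframe; the counts are invented for illustration only
toy = pd.DataFrame({
    "actor_1_name": ["Meryl Streep", "Meryl Streep", "Leonardo DiCaprio",
                     "Leonardo DiCaprio", "Brad Pitt"],
    "num_critic_for_reviews": [100, 200, 300, 400, 250],
    "num_user_for_reviews": [500, 700, 900, 1100, 800],
})

# Mean of both review counts per lead actor
means = toy.groupby("actor_1_name")[["num_critic_for_reviews", "num_user_for_reviews"]].mean()

# For each column, the actor with the highest mean
favorites = means.idxmax()
print(means)
print(favorites)
```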