# Project: Investigate a Dataset - [tmdb-movies]

## Table of Contents

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

# <a id='intro'></a>
# Introduction

In this project, we analyze over 10,000 movies from The Movie Database (TMDb) that were released between 1960 and 2015. We look at the number of films released per year over the last 15 years, the top 15 directors by number of films, and the most common genres. We also look at how revenue, popularity, and budget vary by genre over the years.

In [10]:
pip install --upgrade setuptools

Note: you may need to restart the kernel to use updated packages.


In [7]:
# Import the libraries used in this project

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


<a id='wrangling'></a>
# Data Wrangling

1. First, we import the CSV file.
2. After importing the CSV file, we inspect the dataset with basic commands (column names, data types, null values, and shape).

In [None]:
# Import the CSV file and inspect it
mov_df = pd.read_csv('tmdb-movies.csv')

print(mov_df.head())  # first 5 rows (printed explicitly, since only a cell's last expression is displayed)
mov_df.info()         # column data types and non-null counts

In [None]:
# Number of rows and columns in the imported dataset
print(mov_df.shape)


## Data Cleaning
1. The previous section showed that several columns hold information we do not use in this analysis, so we remove them.
2. We then remove null and duplicate values.

## 1) Removing Columns

In [None]:
#Removing columns which we will not use in this analysis

mov_df.drop(columns=['id','imdb_id','budget_adj','revenue_adj','cast','homepage','tagline','overview','release_date','keywords'],inplace=True)
mov_df.dtypes

## 2) Removing Null & Duplicate Values

In [None]:
#Checking for null values and duplicated values
print("Total no of Null values: ",mov_df.isnull().sum().sum())
print("Total no of Duplicate values: ",mov_df.duplicated().sum())

#Removing Null & duplicate Values
mov_df.dropna(inplace=True)
mov_df.drop_duplicates(inplace=True)


# Verify that no null or duplicate values remain
print("Total no of Null values after removing all null values: ",mov_df.isnull().sum().sum())
print("Total no of Duplicate values after removing all duplicate values: ",mov_df.duplicated().sum())

#Remaining number of data available in dataset
print('Total no of Rows in Dataset: ',len(mov_df))
print('Total no of Columns in Dataset: ',len(mov_df.columns))
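
The cleaning step above can be tried out on a small frame first. The following is a minimal sketch on hypothetical toy data (not taken from `tmdb-movies.csv`); it also shows that `isnull().sum().sum()` counts missing cells, which can differ from the number of rows that `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy_df = pd.DataFrame({
    "original_title": ["A", "B", "B", "C", "D"],
    "director": ["X", "Y", "Y", np.nan, np.nan],
    "genres": ["Drama", "Comedy", "Comedy", np.nan, "Action"],
})

null_cells = int(toy_df.isnull().sum().sum())        # individual missing cells
null_rows = int(toy_df.isnull().any(axis=1).sum())   # rows with at least one missing cell
duplicate_rows = int(toy_df.duplicated().sum())      # rows that repeat an earlier row

# Same cleaning order as in the notebook: drop nulls, then drop duplicates
cleaned = toy_df.dropna().drop_duplicates()

print("Null cells:", null_cells)
print("Rows with nulls:", null_rows)
print("Duplicate rows:", duplicate_rows)
print("Rows after cleaning:", len(cleaned))
```

Here the row for film "C" has two missing cells but is dropped only once, so the cell count is higher than the row count.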

# <a id='eda'></a>
# Exploratory Data Analysis



## Research Question 1 (NUMBER OF MOVIES RELEASED IN LAST 15 YEARS)
> The first analysis is the total number of movies released in the last 15 years. We will graph these counts and determine which years had the lowest and highest movie counts in that period.

In [None]:
# Group by release year and count the movies in each year,
# then use tail() to keep only the most recent years.

release_count = mov_df.groupby('release_year').original_title.count()

# tail(16) keeps the 16 most recent release years in the data
# (2000 through 2015 when no year is missing), which is a 15-year span.
release_count_by_year = release_count.tail(16)
release_count_by_year
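
Because `tail()` selects by position, the window shifts if a year is missing from the data. As a hedged alternative, the sketch below selects the window by year value instead. The helper `last_n_years` and the toy counts are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy counts per release year (hypothetical values)
counts = pd.Series([5, 7, 6, 9, 8],
                   index=[2011, 2012, 2013, 2014, 2015],
                   name="films")

def last_n_years(series, n):
    """Keep entries whose year index falls within the last n calendar years."""
    latest = series.index.max()
    return series[series.index > latest - n]

recent = last_n_years(counts, 3)
print(recent)
```

With `n=3` and a latest year of 2015, only 2013, 2014, and 2015 are kept, regardless of how many rows precede them.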

In [None]:
#Showing Bar Chart of No of Movies Released in last 15 Years
# For color grid I referred https://matplotlib.org/stable/gallery/color/named_colors.html

plt.subplots(figsize=(15,8))
plt.bar(release_count_by_year.index,release_count_by_year,
        color=[ 'gold','peachpuff','lightcoral', 'cornflowerblue','greenyellow'],
        edgecolor='black',tick_label=release_count_by_year.index)
plt.title("NO OF MOVIES RELEASED IN LAST 15 YEARS",fontsize=18)
plt.xlabel("YEARS",fontsize=15)
plt.ylabel("COUNT",fontsize=15)

In [None]:
# Print the years with the most and the fewest movies released

print("In the last 15 years,",
      release_count_by_year.idxmax(),
      "had the most releases, with",
      release_count_by_year.max(),
      "films, and",
      release_count_by_year.idxmin(),
      "had the fewest, with just",
      release_count_by_year.min(),
      "films.")


## Research Question 2  (TOP 15 DIRECTORS BY NUMBER OF MOVIES)

> The second analysis determines the top 15 directors based on the total number of films in the dataset. For each of these directors, we count how many films they directed.

> We will present the result as a bar chart, and then print the names of the directors with the most and the fewest films among the top 15.

In [None]:
#For this Top 15 Director List we are using iloc command.

#Finding Top 15 Director List

top_directors = mov_df.director.value_counts().iloc[:15]
top_directors
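
Note that `iloc[:15]` cuts the list at position 15 even when several directors share the same count at the cutoff. A minimal sketch of a tie-aware selection follows, using `Series.nlargest` with `keep="all"` on hypothetical toy data; the director names here are placeholders.

```python
import pandas as pd

# Hypothetical director column: A has 3 films, B and C have 2 each, D has 1
directors = pd.Series(["A", "A", "A", "B", "B", "C", "C", "D"])
film_counts = directors.value_counts()

# Positional cut: exactly 2 entries, so one of the tied directors is left out
top_by_position = film_counts.iloc[:2]

# Tie-aware cut: keeps every director tied at the cutoff value
top_with_ties = film_counts.nlargest(2, keep="all")

# All directors that share the lowest count within the tie-aware selection
tied_at_minimum = top_with_ties[top_with_ties == top_with_ties.min()]

print(top_with_ties)
print(sorted(tied_at_minimum.index))
```

Whether ties should extend the list past 15 entries is a design choice; the positional cut keeps the chart size fixed, while the tie-aware cut avoids dropping a director arbitrarily.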

In [None]:
# Bar chart of the Top 15 Directors
plt.subplots(figsize=(18,10))
plt.bar(top_directors.index,top_directors,color=[ 'lightcoral', 'greenyellow', 'cornflowerblue', 'gold','peachpuff'],edgecolor='black',tick_label=top_directors.index)
plt.title("TOP 15 DIRECTORS",fontsize=18)
plt.xlabel("DIRECTORS",fontsize=15)
plt.ylabel("NO OF MOVIES",fontsize=15)
plt.xticks(rotation=60);

In [None]:
# Print the top director's name and number of movies
print(top_directors.idxmax(),
      'has directed the most films of any of the Top 15 filmmakers, with',
      top_directors.max(),
      "films.")

# Four directors share the lowest movie count in the Top 15, and we want to print all of them.
# idxmin() returns only one label, so we filter with a boolean mask instead:
# top_directors[top_directors == top_directors.min()] keeps every director at the minimum.
min_movie_dir_name = top_directors[top_directors == top_directors.min()]
print(min_movie_dir_name.index.values, 'have directed',
      top_directors.min(),
      "films, the fewest among the Top 15.")

## Research Question 3  (MOST COMMON GENRE IN FILM INDUSTRY)
> The next step is to determine which genre accounts for the most films in the dataset.

> The dataset does not store a single genre per film. When a film belongs to more than one genre, the genres are listed in one row, separated by a "|". So, before we move on, we split each such row into one row per genre.

In [None]:
mov_df.shape

In [None]:
# Reference used for this code
#https://programmer.ink/think/pandas-how-do-i-split-text-in-a-column-into-multiple-lines-python.html

mov_df=mov_df['genres'].str.split('|', expand=True).stack().reset_index(level=0).set_index('level_0').rename(columns={0:'genres'}).join(mov_df.drop('genres', axis=1))
print('Total no of Rows in Dataset: ',len(mov_df))
print('Total no of Columns in Dataset: ',len(mov_df.columns))
mov_df.head()
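
The same row-per-genre reshaping can also be sketched with `DataFrame.explode` (available in pandas 0.25 and later). This is an alternative to the chain above, not the approach the notebook uses, and the toy frame is hypothetical.

```python
import pandas as pd

# Toy frame with "|"-separated genres (hypothetical data)
toy_df = pd.DataFrame({
    "original_title": ["Film 1", "Film 2"],
    "genres": ["Action|Adventure", "Drama"],
    "revenue": [100, 40],
})

# Split each genres string into a list, then emit one row per list element
exploded = (
    toy_df.assign(genres=toy_df["genres"].str.split("|"))
          .explode("genres")
)

print(exploded)
print("Rows before:", len(toy_df), "Rows after:", len(exploded))
```

The other columns are repeated for every genre of a film, so any later sum over `revenue` or `budget` counts a multi-genre film once per genre.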

In [None]:
#Showing Bar Chart of Genre Production
plt.subplots(figsize=(18,10))
plt.bar(mov_df.genres.value_counts().index,
        mov_df.genres.value_counts(),
        color=[ 'greenyellow', 'cornflowerblue', 'gold','peachpuff','lightcoral'],
        edgecolor='black',
        tick_label=mov_df.genres.value_counts().index)
plt.title("Genre production",fontsize=18)
plt.xlabel('GENRES',fontsize=18);
plt.ylabel('TOTAL NO OF MOVIES',fontsize=18);
plt.xticks(rotation=60);

In [None]:
# Print the genres with the highest and the lowest number of movies.

print("The",
      mov_df.genres.value_counts().idxmax(),
      "genre has produced",
      mov_df.genres.value_counts().max(),
      "films, whereas the",
      mov_df.genres.value_counts().idxmin(),
      "genre has produced only",
      mov_df.genres.value_counts().min(),
      "films.")

## Research Question 4  (SHOWING MOVIE GENRES IN PIE CHART)

In [None]:
#Generating the pie Chart for Movie Genre

labels=mov_df.genres.value_counts().index
sizes=mov_df.genres.value_counts()
plt.figure(figsize = (25,25))
my_colors = ['lightblue','lightsteelblue','lightcoral', 'cornflowerblue', 'gold','peachpuff','#4F6272', '#B7C3F3', '#DD7596', '#8EB897']
my_explode = (0.05, 0.05,0.05,0,0,0, 0.08,0.08,0,0,0, 0,0,0,0,0, 0,0.3,0.2,0.3)
plt.pie(sizes, labels=labels, 
        wedgeprops = { 'linewidth' : 1.5, 'edgecolor' : 'white' },
        autopct='%.1f%%', 
        startangle=60, shadow = True,
        explode=my_explode,
        colors=my_colors,
        textprops={'fontsize': 23},
        labeldistance=0.7,
        rotatelabels =True)
plt.title("MOVIE GENRES PIE CHART",fontsize=25)
plt.legend(mov_df.genres.value_counts().index, loc = 'upper left',fontsize=16)
plt.tight_layout()
plt.axis('equal')
plt.show()


In [None]:
# Group the genres by their share (%) of all genre entries.

# Share of each genre in percent, rounded to one decimal place
perValue = round(mov_df.genres.value_counts(normalize=True) * 100.0, 1)

# Build one list per category with boolean masks
per10 = list(perValue[perValue >= 10.0].index)                      # 10% or more
per5 = list(perValue[(perValue >= 5.0) & (perValue <= 6.0)].index)  # between 5% and 6%
per1 = list(perValue[perValue < 1.0].index)                         # less than 1%

# Join every name in each list, so the output does not depend on
# a category holding exactly two or three genres.
print("According to the graph above, 10% or more of films are produced in the genres of", ", ".join(per10) + ".")
print("Between 5% and 6% of films are produced in the genres of", ", ".join(per5) + ".")
print("Less than 1% of films are produced in the genres of", ", ".join(per1) + ".")

## Research Question 5  (REVENUE, POPULARITY & BUDGET BY GENRE OVER THE YEARS)

## SHOWING 3 DIFFERENT CHARTS FOR REVENUE, BUDGET, POPULARITY & GENRE COMPARISON OVER YEARS

## 1) REVENUE & GENRE LINE CHART OVER YEAR

In [None]:
# Line chart of total revenue per genre over the years
label = sorted(mov_df.release_year.unique())  # one tick per release year
# Select the revenue column before summing, so only that column is aggregated
mov_df.groupby(["release_year","genres"])["revenue"].sum().unstack().plot(figsize=(20,15))
plt.title("REVENUE & GENRE GRAPH OVER YEARS",fontsize=18)
plt.xlabel('YEARS',fontsize=18);
plt.ylabel('REVENUE',fontsize=18);
plt.legend( loc = 'upper left',fontsize=20,title="Genre")
plt.xticks(label,rotation=60);

In [None]:
# idxmax() on the (release_year, genres) index returns a (year, genre) tuple
rev_year, rev_genre = mov_df.groupby(["release_year","genres"])["revenue"].sum().idxmax()
print("The revenue of all genres varies from year to year; the",
      rev_genre,
      "genre had the highest single-year revenue of any genre, in",
      rev_year)
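
The group, sum, and unstack pattern behind these charts can be checked on a small example. The sketch below uses hypothetical toy data; it shows that summing by `(release_year, genres)` gives a Series with a two-level index, that `unstack()` turns it into a year-by-genre table, and that `idxmax()` returns a `(year, genre)` tuple.

```python
import pandas as pd

# Toy data (hypothetical), already in one-row-per-genre form
toy_df = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015],
    "genres": ["Action", "Drama", "Action", "Action", "Drama"],
    "revenue": [10, 5, 20, 15, 40],
})

# Total revenue per (year, genre) pair
totals = toy_df.groupby(["release_year", "genres"])["revenue"].sum()

# Years as rows, genres as columns: the shape that .plot() draws as one line per genre
wide = totals.unstack()

# Label of the largest total is a (year, genre) tuple
top_year, top_genre = totals.idxmax()

print(wide)
print("Highest single-year total:", top_genre, "in", top_year)
```

Selecting the column before calling `sum()` keeps the aggregation limited to that one numeric column.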

## 2) POPULARITY & GENRE LINE CHART OVER YEAR

In [None]:
# Line chart of total popularity per genre over the years
label = sorted(mov_df.release_year.unique())  # one tick per release year
# Select the popularity column before summing, so only that column is aggregated
mov_df.groupby(["release_year","genres"])["popularity"].sum().unstack().plot(figsize=(20,15))
plt.title("POPULARITY & GENRE GRAPH OVER YEARS",fontsize=18)
plt.xlabel('YEARS',fontsize=18);
plt.ylabel('POPULARITY',fontsize=18);
plt.legend( loc = 'upper left',fontsize=20,title="Genre")
plt.xticks(label,rotation=60);

In [None]:
# idxmax() on the (release_year, genres) index returns a (year, genre) tuple
pop_year, pop_genre = mov_df.groupby(["release_year","genres"])["popularity"].sum().idxmax()
print("The popularity of each genre varies from year to year; the",
      pop_genre,
      "genre had the highest total popularity of any genre, in",
      pop_year)

## 3) BUDGET & GENRE LINE CHART OVER YEAR

In [None]:
# Line chart of total budget per genre over the years
label = sorted(mov_df.release_year.unique())  # one tick per release year
# Select the budget column before summing, so only that column is aggregated
mov_df.groupby(["release_year","genres"])["budget"].sum().unstack().plot(figsize=(20,15))
plt.title("BUDGET & GENRE GRAPH OVER YEARS",fontsize=18)
plt.xlabel('YEARS',fontsize=18);
plt.ylabel('BUDGET',fontsize=18);
plt.legend( loc = 'upper left',fontsize=20,title="Genre")
plt.xticks(label,rotation=60);

In [None]:
# idxmax() on the (release_year, genres) index returns a (year, genre) tuple
bud_year, bud_genre = mov_df.groupby(["release_year","genres"])["budget"].sum().idxmax()
print("The budget varies greatly from year to year; the",
      bud_genre,
      "genre had the largest total budget of any genre, in",
      bud_year)

# <a id='conclusions'></a>
# Conclusions

According to this data, audiences are most interested in the Adventure and Drama genres. More than 4000 films were released in the Drama genre, accounting for around 10% of all genre entries. Only about 5% of movies are in the Adventure genre, yet it generates a lot of revenue. In 2014, more than 600 films were released, compared to only 197 in 2000. Budgets change from year to year; in 2013, the most money was spent on Action films.


#### 1) There are 10866 rows and 21 columns in the original dataset.
#### 2) The year with the most movies released is 2014, with over 600 films.
#### 3) Drama is the genre with the most movies, followed by Comedy and Thriller; the Foreign, Western, and TV Movie genres have the fewest.
#### 4) With 45 films, 'Woody Allen' has directed more films than any other director in the dataset.

## Limitations
1) We had to remove unneeded columns and rows with null values before we could analyze the data, which resulted in a smaller dataset.

2) We had to split the genres values because the dataset does not assign a single genre to each film. According to our count, dataset values grew by 2-3% as a result, which affects the results, since a film with several genres is counted once per genre.

THANK YOU !!

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])