**Overview**

IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and online streaming content, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.

I am personally interested in this dataset as I am a movie fanatic and have always visited the IMDb website for suggestions, reviews and trivia about movies. I have watched all the movies in the IMDb Top 250 movies of all time! 

From this dataset, I am curious to find out:
* What are the top countries producing movie content?
* Which are the top languages in which movies are produced?
* Which genres are most movies made in?
* How has the number of movies released each year grown, and did it drop because of the COVID-19 pandemic?

I am looking forward to exploring various visualization techniques such as charts, graphs and word clouds.

**Data Profile**

The data source I want to use for this project is: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+movies.csv
The dataset has 4 different .csv files, out of which I will be focusing on 'IMDb movies.csv'.

This dataset has been explored by multiple users to find demographic information about movie releases from the 1900s to 2020. A few explorations done on this dataset are:
* This project explores various visualizations and statistical analysis of the PG ratings and viewership, as well as visualization of movies using a choropleth map. Link: https://www.kaggle.com/ritesh7355/netflix-eda-visualization-for-beginner
* This project contains visualizations, correlation and statistical analysis of various factors like highest-grossing movies, ratings, durations and income. Link: https://www.kaggle.com/nithinpolavarapu/data-visualization-recommendations


The data has been scraped from the publicly available website https://www.imdb.com. The data is a .csv file. It consists of fields like name, release year, genre, country, language, etc.
The dataset is fairly clean and has all the variables that I am interested in researching.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Analysis**

I will be analysing the dataset using simple visualizations to answer the questions listed in the Overview. To begin with, I am loading all the packages needed to visualize the data. The data contains the fields my research needs, such as year, genre, country and language.

In [None]:
#Importing all necessary packages required to visualize the data

import matplotlib.pyplot as plt
import plotly.express as px
import datetime
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

import warnings
warnings.filterwarnings("ignore")

#Importing the dataset

df = pd.read_csv("../input/imdb-extensive-dataset/IMDb movies.csv")
df.head(5)
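
Before charting, it can help to check how complete the columns of interest are. The sketch below is illustrative only: `missing_share` is a hypothetical helper, and the small `sample` frame (with made-up values) stands in for the real `df`. The column names are the ones used later in this notebook.

```python
import pandas as pd

def missing_share(frame, columns):
    """Return the fraction of missing values per column, largest first."""
    return frame[columns].isna().mean().sort_values(ascending=False)

# Tiny stand-in for the real data (values are made up for illustration)
sample = pd.DataFrame({
    "year": [1994, 2008, None, 2019],
    "genre": ["Drama", "Action, Crime", "Comedy", "Drama"],
    "country": ["USA", "USA, UK", None, None],
    "language": ["English", "English", "French", "Hindi"],
})

shares = missing_share(sample, ["year", "genre", "country", "language"])
print(shares)
```

On the real data this would be called as `missing_share(df, ['year', 'genre', 'country', 'language'])`.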

**Top 15 Countries Producing Movie Content**

I want to find out the top 15 countries that make the most movie content. I will begin by isolating the data and then visualize it as a pie chart.

In [None]:
#creating variable
df_country = df['country'].value_counts().sort_values(ascending=False)

#limiting content to top 15
top15countries = df_country.head(15)
top15countries

In [None]:
#Creating pie visualizations

Visualization = px.pie(values=top15countries, 
                       names=top15countries.index,title='Top 15 Countries Producing Movie Content')

Visualization.show()
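
One caveat with `value_counts()` on the raw column: if a film lists several countries in one cell (for example a co-production stored as `USA, UK`), each combination is counted as its own category. A minimal sketch of counting each country separately follows, assuming comma-separated cells. `top_values` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def top_values(frame, column, n=15, sep=","):
    """Count individual values in a column whose cells may hold several
    separator-delimited entries, and return the n most frequent."""
    values = (
        frame[column]
        .dropna()
        .str.split(sep)   # "USA, UK" -> ["USA", " UK"]
        .explode()        # one row per entry
        .str.strip()      # drop the space left after the separator
    )
    return values.value_counts().head(n)

# Made-up rows standing in for df['country']
sample = pd.DataFrame({"country": ["USA", "USA, UK", "India", "UK, France", None]})
counts = top_values(sample, "country", n=2)
print(counts)
```

The same helper would apply to the `language` column, for example `top_values(df, 'language')`, if its cells follow the same comma-separated layout.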

**Top 15 Languages for Movie Content**

Next, I want to find out the top 15 languages for movie content. I will begin by isolating the data and then visualize it as a funnel chart.

In [None]:
#creating variable
df_languages = df['language'].value_counts().sort_values(ascending=False)

#limiting content to top 15
top15languages = df_languages.head(15)
top15languages

In [None]:
#Creating funnel visualizations

Visualization = px.funnel(top15languages,title='Top 15 Languages for Movie Content')

Visualization.show()

**Genre Most Movies Are Made In**

I am using a word cloud to visualize the genres most movies are made in. Word clouds, while not precisely quantitative, help in quickly visualizing the most popular genres at a glance.

In [None]:
#splitting each comma-separated genre string into individual genres
genres = df['genre'].dropna().str.split(',').explode().str.strip()

#counting how often each genre occurs, so that word size reflects frequency
#(deduplicating the words first would give every genre the same weight)
genre_counts = genres.value_counts().to_dict()

plt.rcParams['figure.figsize'] = (13, 13)

#create variable to generate a wordcloud image from the genre frequencies
#hyphenated genres such as Sci-Fi stay whole here, so no stopwords are needed

wordcloud = WordCloud(background_color="white").generate_from_frequencies(genre_counts)

#show wordcloud image

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()
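
Because a word cloud only hints at relative size, a plain frequency table can sit alongside it as a quantitative check. The sketch below uses only the standard library and assumes each `genre` cell is a comma-separated string. `count_genres` is a hypothetical helper and the sample list is made up.

```python
from collections import Counter

def count_genres(genre_cells):
    """Count individual genres across comma-separated genre strings."""
    counts = Counter()
    for cell in genre_cells:
        if not isinstance(cell, str):
            continue  # skip missing values such as None or NaN
        counts.update(part.strip() for part in cell.split(",") if part.strip())
    return counts

# Made-up cells standing in for df['genre']
sample = ["Drama", "Drama, Romance", "Sci-Fi, Action", "Action, Drama", None]
genre_counts = count_genres(sample)
print(genre_counts.most_common(3))
```

With the real data this would be `count_genres(df['genre'])`, and `most_common(n)` gives the ranked list that the word cloud only suggests visually.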

**Rise in the Number of Movies Released Each Year**

Media content has been increasing throughout the years as newer platforms are created. Here, I want to visualize the rise in movie releases over the years and see how COVID-19 affected the release of movies in 2019 and 2020.

In [None]:
#Creating variables to plot graph

totalmovies=px.bar(df.groupby('year').size())

totalmovies.update_layout(title_text='Total Number of Movies Per Year', title_x=0.5, showlegend=False)
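
To read a year-over-year change as a number rather than from the bars, yearly counts can be compared directly. This sketch assumes the `year` column may contain non-numeric entries (hence the extraction and coercion steps). `movies_per_year` is a hypothetical helper and the sample values are invented.

```python
import pandas as pd

def movies_per_year(frame, column="year"):
    """Return the number of rows per year, ignoring entries with no 4-digit year."""
    # Pull the first 4-digit run out of each cell, e.g. "TV Movie 2019" -> "2019"
    years = frame[column].astype(str).str.extract(r"(\d{4})")[0]
    # Cells without a year become NaN and are dropped
    years = pd.to_numeric(years, errors="coerce").dropna().astype(int)
    return years.value_counts().sort_index()

# Made-up values standing in for df['year']
sample = pd.DataFrame({"year": [2018, 2019, "2019", "TV Movie 2019", 2020, None]})
per_year = movies_per_year(sample)
print(per_year)
print(per_year.pct_change())  # relative change from the previous year
```

On the real data, `movies_per_year(df)` could be passed straight to `px.bar`, and `pct_change()` would quantify any drop in the most recent years.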

**Conclusion**

Overall, this mini project provided me with some compelling insights and stats about movies released through history. Through this project, I could analyse the data and create visualizations that helped me answer a few curiosities that I had about movies.

Moving ahead, it would be fun to explore the possibility of implementing robust versions of these visualizations on the IMDb website for people to view. I would also be keen to take a deeper dive and analyse a few key events in the movie world, such as Netflix releasing its own content and the launch of new media platforms like Amazon Prime, Hulu, etc.