![image](http://1.bp.blogspot.com/-zFzkjyuD68g/VYGSFBAiSFI/AAAAAAAACXU/wW8i2VJ3lc0/s1600/00696e.png)

The dataset used in this notebook was collected from [data.world](https://data.world/). I have tried to answer a few obvious questions through visualization. To do that, I first had to handle the missing values.

<a id = 'back'></a>
* [Handling Missing Values](#missing)
* [Show all of the columns at a glance](#glance)
* [Which year produced the most successful movies?](#year_successful)
* [How does the rating vary with different columns?](#varies)
* [Which rated movies are the most popular?](#rated_popular)
* [Which is the best month to release the movie?](#released)
* [Which is the best month to release DVD?](#dvd)
* [Origin of the movie](#country)
* [Show the top 10 movies with corresponding genre and country.](#genre10)
* [Show the peak time.](#peak)
* [Which genre attracts the audience most?](#genre)
* [Movies with the longest and shortest runtime.](#runtime)
* [Oldest and the newest year.](#year)
* [Which rating got the highest votes?](#vote_highest)
* [Show the rating curve against the year.](#curve)
* [Holistic view of the given columns.](#holistic)

# Import Libraries

In [None]:
from collections import Counter
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling as pp
import plotly.express as px
import plotly.graph_objects as go
import warnings 

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# Fetch Dataset

In [None]:
on_the_fly = pd.read_csv('../input/imdb-top-250-movies/IMDB_Top250_Movies.csv')
df = on_the_fly.copy()
df.shape

# Checking For Null Values

In [None]:
plt.figure(figsize=(14,6))
sns.heatmap(df.isnull(),cmap='BuPu',cbar=False,yticklabels=False)

* I remove columns that do not convey much information.
* I fill the null values of columns that seem valuable for further inquiry.
* From tomatoMeter to tomatoURL, almost all of the rows are null, so these columns are dropped as irrelevant.
* After dropping the union of the columns listed in **another** and **c**, the data frame is much cleaner.
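
The choice of columns to drop can also be backed by numbers instead of reading the heatmap by eye. A minimal sketch, assuming a toy frame (`toy`) with made-up values and an arbitrary 80% threshold, neither of which comes from the original notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'Metascore': [80.0, np.nan, 90.0, 85.0, 70.0],
    'tomatoMeter': [np.nan, np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, between 0 and 1
null_share = toy.isnull().mean()

# Columns that are mostly empty (threshold picked only for illustration)
mostly_empty = null_share[null_share > 0.8].index.tolist()

toy_clean = toy.drop(columns=mostly_empty)
print(null_share)
print(toy_clean.columns.tolist())
```

Columns with a low share of nulls, such as `Metascore` here, are candidates for filling rather than dropping.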

<a id="missing"></a>

# Handling Missing Values

[Go Back](#back)

In [None]:
# columns scattered across the frame that add little to the analysis
another = ['BoxOffice', 'Website', 'Plot', 'Poster', 
           'Ratings.Source', 'imdbID', 'Type', 'Response']

# contiguous block of mostly empty columns
c = df.loc[:, 'tomatoMeter': 'tomatoURL'].columns

# drop the union of both sets
df.drop(c.union(another), axis=1, inplace = True)

df.shape

* The first column, whose name contains 'unnamed', contributes nothing to this notebook, so it is dropped.

In [None]:
df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
df.shape

* Show the heatmap of the data frame once more to see which columns still have missing values.

In [None]:
plt.figure(figsize=(14,6))
sns.heatmap(df.isnull(),cmap='BuPu',cbar=False,yticklabels=False)

* Split the date columns into day, month and year.

In [None]:
# For Released Date
df[['Released_Day','Released_Month','Released_Year']] = df['Released'].str.split(' ',expand=True)
df.drop('Released', axis = 1, inplace = True)

# For DVD
df[['DVD_Day','DVD_Month','DVD_Year']] = df['DVD'].str.split(' ',expand=True)
df.drop('DVD', axis = 1, inplace = True)

# For Runtime keep only the part before the first space
df['Runtime'] = df['Runtime'].str.split(' ').str[0]

# drop Ratings.Value as it duplicates imdbRating
df.drop('Ratings.Value', axis = 1, inplace = True)
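
As an alternative to splitting on spaces, the dates can be parsed as real datetimes. This is only a sketch: it assumes the dates look like `14 Oct 1994` (day, abbreviated English month, year), and the sample values in `dates` are made up.

```python
import pandas as pd

# Hypothetical release dates in the assumed 'day month year' layout
dates = pd.Series(['14 Oct 1994', '24 Mar 1972', None])

# Missing or unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%d %b %Y', errors='coerce')

parts = pd.DataFrame({
    'day': parsed.dt.day,
    'month': parsed.dt.month_name().str[:3],
    'year': parsed.dt.year,
})
print(parts)
```

The upside is that day and year come out numeric right away, so no later type cast from strings is needed.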

* Remove the commas from imdbVotes.

In [None]:
df['imdbVotes'] = df['imdbVotes'].str.replace(',', '')
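
To see what this step does, here is a small sketch on made-up vote strings (the values in `votes` are hypothetical). It also converts the cleaned strings to numbers in the same go:

```python
import pandas as pd

# Hypothetical vote counts stored as text with thousands separators
votes = pd.Series(['1,825,000', '1,250,000', '987'])

# Strip the commas literally (no regex), then convert to numbers
votes_num = pd.to_numeric(votes.str.replace(',', '', regex=False))
print(votes_num)
```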

* For the movie and DVD release dates, fill missing days and years with the median, and missing months with the most frequent month.

In [None]:
# day and year are still strings after the split; make them numeric so
# that the median can be computed
for col in ['Released_Day', 'Released_Year', 'DVD_Day', 'DVD_Year']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# fill missing release day and year with the median, month with the most frequent value
df.fillna({'Released_Day': df["Released_Day"].median()}, inplace=True)
df.fillna({'Released_Month': df["Released_Month"].value_counts().idxmax()}, inplace=True)
df.fillna({'Released_Year': df["Released_Year"].median()}, inplace=True)

# fill the null values of Writer and Awards with a space
df.fillna({'Writer': ' '}, inplace=True)
df.fillna({'Awards': ' '}, inplace=True)

# same strategy for the DVD date
df.fillna({'DVD_Day': df["DVD_Day"].median()}, inplace=True)
df.fillna({'DVD_Month': df["DVD_Month"].value_counts().idxmax()}, inplace=True)
df.fillna({'DVD_Year': df["DVD_Year"].median()}, inplace=True)

# fill missing Metascore values with the median
df.fillna({'Metascore': df['Metascore'].median()}, inplace=True)
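
The filling strategy is easier to follow on a tiny example. A minimal sketch, assuming a made-up frame `toy` with one numeric and one text column (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
toy = pd.DataFrame({
    'day': [3.0, np.nan, 10.0, 21.0],
    'month': ['Dec', 'Dec', None, 'Mar'],
})

day_median = toy['day'].median()           # median of 3, 10 and 21
month_mode = toy['month'].mode().iloc[0]   # most frequent month

# Numeric gaps get the median, text gaps get the most frequent value
filled = toy.fillna({'day': day_median, 'month': month_mode})
print(filled)
```

The median is less sensitive to outliers than the mean, which is why I prefer it for the numeric columns.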

* Voilà! There are no missing values left in my data frame.

In [None]:
plt.figure(figsize=(15,2))
sns.heatmap(df.isnull(),cmap='Blues',cbar=False,yticklabels=False)

### Rename

In [None]:
df = df.rename(columns={'Title': 'Name', 'imdbRating': 'Rating', 'imdbVotes' : 'Vote'})

In [None]:
df.info()

* Cast the numeric columns to integer and float types.

In [None]:
df['Runtime'] = df['Runtime'].astype(int)
df['Released_Day'] = df['Released_Day'].astype(int)
df['Released_Year'] = df['Released_Year'].astype(int)
df['DVD_Day'] = df['DVD_Day'].astype(int)
df['DVD_Year'] = df['DVD_Year'].astype(int)
    
df['Vote'] = df['Vote'].astype(float)

In [None]:
df.info()

<a id = 'glance'></a>
    
# Show all of the columns at a glance.
  
[Go Back](#back)

In [None]:
# Creating a histogram of all the features in dataframe.
df.hist(bins=30,figsize=(15,15),color='g')

In [None]:
scatterdata=df[['Year','Runtime','Metascore','Vote', 'Released_Day', 'Released_Year', 
                'DVD_Day', 'DVD_Year']]

sns.set(style="ticks")
sns.pairplot(scatterdata)

<a id = 'year_successful'></a>

# Which year produced the most successful movies?

[Go Back](#back)


* The early **1990s** have the highest frequency of top-rated movies.
    


In [None]:
plt.figure(figsize=(12,24))
sns.set(style="darkgrid")
ax = sns.countplot(y="Released_Year", data=df, palette="Set2", 
                   order=df['Released_Year'].value_counts().index[:])
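
Counting per single year can be noisy, so pooling the years into decades is a quick cross-check. A small sketch on made-up years (the values in `years` are hypothetical, not read from the dataset):

```python
import pandas as pd

# Hypothetical release years
years = pd.Series([1994, 1994, 1995, 1999, 1972, 2008, 2010])

# Map every year to the start of its decade, e.g. 1994 -> 1990
decades = (years // 10) * 10

# Number of movies per decade, in chronological order
per_decade = decades.value_counts().sort_index()
print(per_decade)
print(per_decade.idxmax())
```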

In [None]:
df.columns

<a id = 'varies'></a>
    
# How does the rating vary with different columns?
    
[Go Back](#back)

In [None]:
# against imdb rating

df_for_ML = df[['Year','Runtime','Metascore','Vote', 'Released_Day', 'Released_Year', 
                'DVD_Day', 'DVD_Year']]

In [None]:
df_for_ML.head(1)

In [None]:
for i in df_for_ML.columns:
    axis = df.groupby('Rating')[[i]].mean().plot(figsize=(10,5),marker='o',color='g')

<a id = 'rated_popular'></a>

# Which rated movies are the most popular?
    
[Go Back](#back)

* **R** contains the most successful movies. 
* **R**-rated movies are restricted for children, so it can be assumed that most of the voters are adults.

In [None]:
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(x="Rated", data= df, palette="Set2", 
                   order=df['Rated'].value_counts().index[0:15])

In [None]:
color = plt.cm.Greens(np.linspace(0, 1, 2))
df['Rated'].value_counts().plot.pie(colors = color, figsize = (10, 10), startangle = 75)

plt.title('Rated', fontsize = 20)
plt.axis('off')
plt.show()

<a id = 'released'></a>

# Which is the best month to release the movie?
    
[Go Back](#back)

* **December** turns out to be the most successful month for releasing a movie.

In [None]:
# Which month has produced the most top-rated releases?
# Count the movies per release month, with all years pooled together.

imdb_released = on_the_fly.copy()
imdb_released = imdb_released[['Released']].dropna()
imdb_released['month'] = imdb_released['Released'].apply(lambda x : x.lstrip().split(' ')[1])

# pool all of the years into a single group
imdb_released['year'] = ''

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'][::-1]
dfg = imdb_released.groupby('year')['month'].value_counts().unstack().fillna(0)[month_order].T
plt.figure(figsize=(10, 7), dpi=200)
plt.pcolor(dfg, cmap='PuRd', edgecolors='white', linewidths=2) # heatmap
plt.xticks(np.arange(0.5, len(dfg.columns), 1), dfg.columns, fontsize=7, fontfamily='serif')
plt.yticks(np.arange(0.5, len(dfg.index), 1), dfg.index, fontsize=7, fontfamily='serif')

plt.title('IMDB Released Month', fontsize=12, fontfamily='calibri', fontweight='bold', position=(0.20, 1.0+0.02))
cbar = plt.colorbar()

cbar.ax.tick_params(labelsize=8) 
cbar.ax.minorticks_on()
plt.show()
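
The same monthly counts can be read off without a plot. A minimal sketch, assuming a made-up series of month abbreviations (`months` is hypothetical data):

```python
import pandas as pd

# Hypothetical release months
months = pd.Series(['Dec', 'Mar', 'Dec', 'Jan', 'Dec', 'Mar'])

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Count per month in calendar order; months that never occur get 0
counts = months.value_counts().reindex(month_order, fill_value=0)

best_month = counts.idxmax()
print(counts)
print(best_month)
```

Using `reindex` with `fill_value=0` keeps months with no releases in the table, so the result always has twelve rows.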

<a id = 'dvd'></a>
# Best month to release DVD
    
[Go Back](#back)

* **March** is the best time to release the DVD version of the movie.

In [None]:
# Which month has seen the most DVD releases of top-rated movies?
# Count the movies per DVD release month, with all years pooled together.
# cmap variation is given on : https://gallantlab.github.io/pycortex/colormaps.html

imdb_dvd = on_the_fly.copy()
imdb_dvd = imdb_dvd[['DVD']].dropna()
imdb_dvd['month'] = imdb_dvd['DVD'].apply(lambda x : x.lstrip().split(' ')[1])

# pool all of the years into a single group
imdb_dvd['year'] = ''

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'][::-1]
dfg = imdb_dvd.groupby('year')['month'].value_counts().unstack().fillna(0)[month_order].T
plt.figure(figsize=(10, 7), dpi=200)
plt.pcolor(dfg, cmap='PuBuGn', edgecolors='white', linewidths=2) # heatmap
plt.xticks(np.arange(0.5, len(dfg.columns), 1), dfg.columns, fontsize=7, fontfamily='serif')
plt.yticks(np.arange(0.5, len(dfg.index), 1), dfg.index, fontsize=7, fontfamily='serif')

plt.title('DVD Released Month', fontsize=12, fontfamily='calibri', fontweight='bold', position=(0.20, 1.0+0.02))
cbar = plt.colorbar()

cbar.ax.tick_params(labelsize=8) 
cbar.ax.minorticks_on()
plt.show()

<a id = 'country'></a>

# Origin of the movie

[Go Back](#back)

* Most of the movies were produced in the **USA**.

In [None]:
country_count=df['Country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:11]
topcountries

In [None]:
# Count how often each country appears. Co-productions list several
# countries separated by commas, so each country is counted once per movie.
countries1 = Counter()
for entry in df['Country']:
    for country in entry.split(','):
        countries1[country.strip()] += 1

# Sort from the most to the least frequent country
countries_fin1 = dict(countries1.most_common())




# Set the width and height of the figure
plt.figure(figsize=(15,15))

# Add title
plt.title("Content creating countries")

# Horizontal bar chart of the number of movies per country
sns.barplot(y=list(countries_fin1.keys()), x=list(countries_fin1.values()))

# Add axis labels
plt.xlabel("Number of movies")
plt.ylabel("Country")
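
The counting logic is easy to get wrong for co-productions, so here is a small self-contained sketch of it. The `entries` list is made up for illustration and does not come from the dataset:

```python
from collections import Counter

# Hypothetical 'Country' values; co-productions are comma separated
entries = ['USA', 'USA, UK', 'Japan', 'UK, USA', 'New Zealand, USA']

# Split every entry on commas, trim the spaces and count each country
counts = Counter(
    country.strip()
    for entry in entries
    for country in entry.split(',')
)

print(counts.most_common(2))
```

Trimming with `strip()` keeps names like `New Zealand` readable while still merging `' UK'` and `'UK'` into one key.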

<a id = 'genre10'></a>
# Show the top 10 movies with corresponding genre and country.


    
[Go Back](#back)

In [None]:
# From https://plotly.com/python/sunburst-charts/
# assumes the rows are already ordered by rank, so the first 10 rows are the top 10
top_rated=df[0:10]
fig =px.sunburst(
    top_rated,
    path=['Name','Country', 'Genre'],
    values='Rating',
    color='Rating')
fig.show()

<a id = 'peak'></a>
# Show the peak time.

    
[Go Back](#back)

So, a good number of the movies on IMDb run between 100 and 150 minutes. That is understandable, since a fair share of the audience cannot watch a 3-hour movie in one sitting. Can you? :p

In [None]:
# duration of movies
sns.set(style="darkgrid")
sns.kdeplot(data=df['Runtime'], shade=True)

In [None]:
genres=list(df['Genre'])
gen=[]

for i in genres:
    i=list(i.split(','))
    for j in i:
        gen.append(j.replace(' ',""))
g=Counter(gen)


text = list(set(gen))
plt.rcParams['figure.figsize'] = (13, 13)
wordcloud = WordCloud(max_font_size=50, max_words=100,background_color="white").generate(' '.join(text))

plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

<a id = 'genre'></a>
# Which genre attracts the audience most?
    
[Go Back](#back)

* People like to watch **drama**, while movies tagged **music** are the least favored.

In [None]:
# sort the genres from the most to the least frequent
g={k: v for k, v in sorted(g.items(), key=lambda item: item[1], reverse= True)}

# create a single figure; a separate plt.figure() call would open an empty one
fig, ax = plt.subplots(figsize = (10, 10))
x=list(g.keys())
y=list(g.values())
ax.vlines(x, ymin=0, ymax=y, color='green')
ax.plot(x,y, "o", color='maroon')
ax.tick_params(axis='x', labelrotation=90)
ax.set_ylabel("Count of movies")
# set a title
ax.set_title("Genres");

<a id = 'runtime'></a>
# Movies with the longest and the shortest runtime

    
[Go Back](#back)

In [None]:
top=df[['Name','Runtime']].sort_values(by='Runtime', 
                                       ascending=False)[0:20].plot(kind='bar',
                                                                   x='Name',y='Runtime', 
                                                                   color='red')

* Shortest runtime

In [None]:
top=df[['Name','Runtime']].sort_values(by='Runtime', 
                                       ascending=True)[0:20].plot(kind='bar',
                                                                   x='Name',y='Runtime', 
                                                                   color='red')

In [None]:
# the 10 movies with the shortest runtime
bottom=df.sort_values(by='Runtime')[:10]

fig = go.Figure(data=[go.Table(header=dict(values=['Name', 'Runtime','Rating']),
                 cells=dict(values=[bottom['Name'],bottom['Runtime'], bottom['Rating']],fill_color='lavender'))
                     ])
fig.show()

In [None]:
# * 'The General' has the lowest runtime.
lowest = df.loc[df['Runtime'].idxmin()]
lowest.head()

In [None]:
# * 'Gone with the Wind' has the highest runtime.
highest = df.loc[df['Runtime'].idxmax()]
highest.head()
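
For reference, here is how `idxmin` and `idxmax` pick out rows, shown on a made-up frame (`toy` and its titles are hypothetical). `nlargest` is a handy shortcut when more than one row is wanted:

```python
import pandas as pd

# Hypothetical movies with their runtime in minutes
toy = pd.DataFrame({
    'Name': ['Short film', 'Mid film', 'Long film'],
    'Runtime': [67, 120, 238],
})

# idxmin / idxmax return the index label of the extreme value
shortest = toy.loc[toy['Runtime'].idxmin(), 'Name']
longest = toy.loc[toy['Runtime'].idxmax(), 'Name']

# The two longest movies, already sorted in descending order
top2 = toy.nlargest(2, 'Runtime')['Name'].tolist()
print(shortest, longest, top2)
```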

<a id = 'year'></a>
# Oldest and the Newest year.

    
[Go Back](#back)

**The Kid** is the oldest movie.

In [None]:
# the 20 oldest movies by release year
oldest_movies=df.sort_values(by='Released_Year')[0:20]

fig = go.Figure(data=[go.Table(header=dict(values=['Name', 'Released_Year'],fill_color='paleturquoise'),
                 cells=dict(values=[oldest_movies['Name'],oldest_movies['Released_Year']],
                            fill_color='pink'))
                     ])
fig.show()

* Newest movies.

In [None]:
# the 20 newest movies by release year
newest_movies=df.sort_values(by='Released_Year', ascending=False)[0:20]

fig = go.Figure(data=[go.Table(header=dict(values=['Name', 'Released_Year'],fill_color='white'),
                 cells=dict(values=[newest_movies['Name'],newest_movies['Released_Year']],
                            fill_color='pink'))
                     ])
fig.show()

<a id = 'vote_highest'></a>
# Which rating got the highest votes?
    
[Go Back](#back)

* Movies rated 8.1 collect the most votes in total. The top-rated movies collect the fewest, since very few movies fall in that (above 9) range.

In [None]:
# visualising the total number of votes per rating

plt.style.use('seaborn-dark')
plt.rcParams['figure.figsize'] = (18, 9)

x = pd.DataFrame(df.groupby(['Rating'])['Vote'].sum().reset_index())
x.sort_values(by = ['Rating'], ascending = False, inplace = True)

# pass the column names as keywords; newer seaborn versions reject positional data
sns.barplot(x = 'Rating', y = 'Vote', data = x, palette = 'afmhot')
plt.title('Distribution of Rated Movies', fontsize = 20)
plt.xlabel('rating')
plt.xticks(rotation = 90)
plt.ylabel('total votes')
plt.show()
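
Keep in mind that this chart sums the votes, so a rating shared by many movies wins almost by default. A small sketch on made-up numbers (`toy` is hypothetical) that puts the total next to the number of movies and the average per movie:

```python
import pandas as pd

# Hypothetical ratings and vote counts
toy = pd.DataFrame({
    'Rating': [8.1, 8.1, 8.1, 9.2, 8.5],
    'Vote': [500000.0, 400000.0, 300000.0, 1000000.0, 600000.0],
})

by_rating = toy.groupby('Rating')['Vote']
votes_per_rating = by_rating.sum()      # total votes per rating
movies_per_rating = by_rating.count()   # how many movies share the rating
mean_votes = by_rating.mean()           # average votes per movie

print(votes_per_rating.idxmax(), mean_votes.idxmax())
```

In this toy data the total is highest for 8.1 only because three movies share that rating, while the single 9.2 movie has the most votes on average.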

<a id = 'curve'></a>
# Show the rating curve against the year

[Go Back](#back)

* The curve shows how the rating rises and falls over the years. Undoubtedly, movies from the 1990s created a huge surge in the movie industry.

In [None]:
sns.set(style="darkgrid")
sns.set(rc={'figure.figsize':(15,10)})
# 4th-order polynomial fit of the rating against the year, with the years grouped into 8 bins
ax=sns.regplot(data=df, x='Year', y='Rating', x_jitter=0.2, order=4, x_bins=8)

<a id = 'holistic'></a>
# Holistic view of the given columns

    
[Go Back](#back)

In [None]:
on_the_fly.profile_report()

# <center>Thank you for your time. 😃</center>