# Analysis Summary and Presentation of the IMDb Dataset
 
> # Data Loading and Initial Exploration:

>>Various datasets from the IMDb Movies Schema were loaded using SQLite queries and stored in pandas DataFrames.
The structure and information of the "movies" DataFrame were examined, including numerical and categorical columns.
Summary statistics, such as mean and standard deviation, were calculated for numerical columns.
Null values were checked in the dataset, and the number of unique values in each column was determined.

> # Data Cleaning and Transformation:

>>The "year" column was cleaned by extracting the last element from the string values to remove unnecessary text.
Duplicated rows in the "movies" DataFrame were identified and removed if needed.
Skewness of the "num_votes" column was calculated, and log and square root transformations were applied to reduce skewness.
Transformed columns, such as "num_votes_log2," "num_votes_log10," and "num_votes_sqrt," were created.

> # Movies Analysis:

>>The count of movies released per year was plotted to visualize the distribution and identify the years with maximum releases.
A trend line chart was generated to analyze the trend of movies released over the years.
Average ratings per year were plotted to observe any trends or patterns.
The relationship between average ratings and the number of votes was explored through line plots.
The distribution of ratings was visualized using histograms and counts of movies within specific rating ranges.
Top movies per year and genre were printed to identify the most popular movies in recent years.
Box plots were used to visualize the distribution of ratings for recent years.

> # Genre Analysis:

>>The genre information was extracted from the "genre" and "m_genre" tables and merged with the "movies" DataFrame.
The count and distribution of movies for each genre were plotted using bar charts.
Average ratings for each genre were calculated and visualized through bar charts.
Top movies per genre were printed to identify the most highly rated movies.

> # Language Analysis:

>>The language information was extracted from the "m_language" and "lang" tables and merged with the "movies" DataFrame.
The distribution of movies by language was visualized using a bar chart.
Average ratings for each language were calculated and plotted using a bar chart.
The language diversity over the years was examined through a line plot.

> # Director Analysis:

>>The director information was extracted from the "m_director" and "person" tables and merged with the "movies" DataFrame.
The count of movies directed by each director was plotted using a bar chart.
The number of movies per genre was plotted for the 25 most prolific directors using bar charts.
The gender of directors was tabulated but not analyzed further, because many values are unknown.

> # Cast Analysis:

>>The cast information was extracted from the "m_cast" and "person" tables and merged with the "movies" DataFrame.
The count of movies in which each cast member has appeared was plotted using a bar chart.
The favorite cast members for each director were identified and visualized through bar charts.
The gender distribution of cast members was analyzed using a pie chart.

> # Producer Analysis:

>>The producer information was extracted from the "m_producer" and "person" tables and merged with the "movies" DataFrame.
The count of movies produced by each producer was plotted using a bar chart.
The favorite producers for each director were identified and visualized through bar charts.

> # Correlation Analysis:

>> The correlation matrix was calculated to determine the relationships between numerical variables in the "movies" DataFrame.
The correlation matrix was visualized using a heatmap to identify any significant correlations.

> # Additional Analysis:

>>Other analyses performed include analyzing the distribution of genres over the years, exploring the favorite cast members for each director, and examining the favorite producers for each director.
The provided code covers a wide range of analyses, including the distribution of movies over the years, average ratings, genre analysis, language analysis, director analysis, cast analysis, producer analysis, and correlation analysis. The plots and visualizations provide insights into the IMDb Movies Schema and can be used to draw meaningful conclusions for a blog post or further analysis.



# Let us Import All Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sqlite3
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
conn = sqlite3.connect("Db-IMDB-Assignment.db")

# Let us Load in all Datasets

# Data Tables

In [None]:
query = "SELECT * FROM MOVIE"
movies = pd.read_sql(query , conn)
#The stored "index" column carries no information, so drop it before previewing
movies = movies.drop("index" , axis = 1)
movies.head()

In [None]:
query = "SELECT * FROM PERSON"
person = pd.read_sql(query , conn)
person.head()

In [None]:
query = "SELECT * FROM GENRE"
genre = pd.read_sql(query , conn)
genre.head()

In [None]:
query = "SELECT * FROM LANGUAGE"
lang = pd.read_sql(query , conn)
lang.head()

In [None]:
query = "SELECT * FROM COUNTRY"
country = pd.read_sql(query , conn)
country.head()

In [None]:
query = "SELECT * FROM LOCATION"
location = pd.read_sql(query , conn)
location.head()

In [None]:
print("*"*50)


# Mapping Tables

In [None]:
query = "SELECT * FROM M_PRODUCER"
m_producer = pd.read_sql(query , conn)
m_producer.head()

In [None]:
query = "SELECT * FROM M_DIRECTOR"
m_director = pd.read_sql(query , conn)
m_director.head()

In [None]:
query = "SELECT * FROM M_CAST"
m_cast = pd.read_sql(query , conn)
m_cast.head()

In [None]:
query = "SELECT * FROM M_GENRE"
m_genre = pd.read_sql(query , conn)
m_genre.head()

In [None]:
query = "SELECT * FROM M_LANGUAGE"
m_language = pd.read_sql(query , conn)
m_language.head()

In [None]:
query = "SELECT * FROM M_COUNTRY"
m_country = pd.read_sql(query , conn)
m_country.head()

In [None]:
query = "SELECT * FROM M_LOCATION"
m_location = pd.read_sql(query , conn)
m_location.head()
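The load cells above all follow the same pattern, so they could be collapsed into one helper. The sketch below is only an illustration: `load_tables` is a hypothetical helper, the in-memory database is a stand-in for `Db-IMDB-Assignment.db`, and dropping the `index` column from every table is an assumption (the notebook only drops it for `movies`).

```python
import sqlite3
import pandas as pd

def load_tables(conn, table_names):
    """Read each named table into a DataFrame keyed by its lower-cased name."""
    frames = {}
    for name in table_names:
        df = pd.read_sql(f"SELECT * FROM {name}", conn)
        # Assumption: the stored "index" column is never needed downstream
        frames[name.lower()] = df.drop(columns="index", errors="ignore")
    return frames

# Tiny in-memory stand-in for the real database (hypothetical rows)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute('CREATE TABLE MOVIE ("index" INTEGER, MID TEXT, title TEXT)')
demo_conn.execute("INSERT INTO MOVIE VALUES (0, 'tt001', 'Demo Film')")
demo_conn.execute('CREATE TABLE GENRE ("index" INTEGER, GID INTEGER, Name TEXT)')
demo_conn.execute("INSERT INTO GENRE VALUES (0, 1, 'Drama')")
demo_conn.commit()

tables = load_tables(demo_conn, ["MOVIE", "GENRE"])
print(tables["movie"].columns.tolist())
```

With the real connection, the same call would take the full list of data and mapping table names.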

# Movies Analysis

In [None]:
movies.info()
# 3 Numerical and 3 Categorical Columns
#Index is of no use and can be dropped

In [None]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
#Using this code snippet to display numbers without scientific notation
movies.describe()


#Mean Rating of 6 with a standard deviation of 1.4
#Mean Votes of 4547 with a really high standard deviation - data is highly skewed

In [None]:
movies.isnull().sum()
#No Null values in the dataset

In [None]:
#Let us check the number of unique values in the dataset
movies.nunique()


In [None]:
movies['year'].unique()
#This column has some really uncleaned values which will affect the analysis
#Let us clean this data column

In [None]:
# The 'year' column is cleaned by extracting the last element from the string values
def clean_year(year):
    #Keep only the last space-separated token and convert it to an integer
    ans = str(year).split(" ")
    return int(ans[-1])

#Applying the function here
movies['year'] = movies['year'].apply(clean_year)
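Taking the last token works as long as every value ends in a year. A slightly more defensive variant, sketched below, pulls out the last four-digit number with a regular expression and returns `None` when there is none. The helper name `extract_year` and the sample strings are hypothetical and not taken from the dataset.

```python
import re

def extract_year(raw):
    """Return the last 4-digit number found in raw as an int, or None if absent."""
    matches = re.findall(r"\d{4}", str(raw))
    return int(matches[-1]) if matches else None

# Hypothetical examples of messy year strings
samples = ["2018", "I 2009", "XVII 2016", "unknown"]
cleaned = [extract_year(s) for s in samples]
print(cleaned)
```

Values that come back as `None` can then be inspected before deciding whether to drop or repair them.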

In [None]:
#Duplicated rows in the movies DataFrame are identified - there are no duplicates
movies[movies.duplicated()]

In [None]:
#The skewness of the 'num_votes' column is calculated, 
#Due to high skewness - log and square root transformations will be applied to visualize the data distribution.
movies['num_votes'].skew()

In [None]:
print(np.log(movies['num_votes']).skew())
np.log(movies['num_votes'])
#Reduced Skewness

In [None]:
print(np.power(movies['num_votes'],0.5).skew())
np.power(movies['num_votes'],0.5)
#Skewness remains high with the square-root transform

In [None]:
sns.kdeplot(np.log2(movies['num_votes']))


In [None]:
sns.kdeplot(np.log10(movies['num_votes']))


In [None]:
sns.kdeplot(np.power(movies['num_votes'],0.2))
#Let us choose the base-2 log transform for the number of votes

In [None]:
#Applying Transformations and generating new column values


movies['num_votes_log2'] = np.log2(movies['num_votes'])
movies['num_votes_log10'] = np.log10(movies['num_votes'])
movies['num_votes_sqrt'] = np.sqrt(movies['num_votes'])
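To make the comparison between transforms concrete, the sketch below computes the skewness of each one side by side. It runs on synthetic, log-normally distributed vote counts (an assumption used only for illustration), so the numbers will differ from those of the real `num_votes` column.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed vote counts (not the IMDb data)
rng = np.random.default_rng(0)
votes = pd.Series(rng.lognormal(mean=6, sigma=2, size=5000)).round().clip(lower=1)

# Skewness of the raw values and of each candidate transform
skews = {
    "raw": votes.skew(),
    "sqrt": np.sqrt(votes).skew(),
    "log2": np.log2(votes).skew(),
    "log1p": np.log1p(votes).skew(),  # safe choice if zero votes can occur
}
for name, value in skews.items():
    print(f"{name:>6}: {value:.3f}")
```

A skewness close to zero suggests a roughly symmetric distribution, which is why a log transform is usually preferred over the square root for count data like this.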

In [None]:
#Let Us Plot Count of Movies Released per year and check the Counts in Descending Order
sns.set_style("darkgrid")
plt.figure(figsize = (15,10))
movies['year'].value_counts(sort = True)[:50].plot(kind = "bar")
plt.xlabel("Year" , fontsize = 20)
plt.ylabel("Movies Released" , fontsize = 20)
plt.title("Maximum Releases Per Year" , fontsize = 30)
plt.xticks(fontsize = 14)
plt.show()

In [None]:
#Will use this to generate Trend Line Chart
movies['year'].value_counts().sort_index()

In [None]:

# Let us analyze the trend line chart for movies released with time

sns.set_style("darkgrid")
plt.figure(figsize = (15,10))
movies['year'].value_counts().sort_index().plot(kind = "line" , marker = "o" , color = "g" , mec = "r" , mfc = "black")
plt.xticks(np.arange(1920,2022,4) , rotation = 90 , fontsize = 15)
plt.xlabel("Year" , fontsize = 20)
plt.ylabel("Movies Released" , fontsize = 20)
plt.title("Trend Line - Movies Released Per Year" , fontsize = 30)
plt.show()


#Movies released show a dip in latest years - maybe we collected data only up to a certain date
#Certainly see more than 100 releases per year - movie business is booming


In [None]:
#Let us Check Average Ratings Per year -- this number is affected by number of votes.
#Let us analyze both votes and ratings simultaneously for a better picture


sns.set_style("darkgrid")
plt.figure(figsize = (15,10))
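#Select the rating column explicitly; averaging every column fails on text columns in recent pandas versions
movies.groupby("year")["rating"].mean().plot(kind = "line" , marker = "o" , color = "g" , mec = "r" , mfc = "black")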

plt.xticks(np.arange(1920,2022,2) , rotation = 90 , fontsize = 15)
plt.xlabel("Year", fontsize = 20)
plt.title("Avg Ratings per Year", fontsize = 30)
plt.ylabel("Avg Rating", fontsize = 20)
plt.show()
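Since the ratings are read together with the vote counts, it can help to compute several per-year statistics in one pass. The sketch below uses pandas named aggregation on a hypothetical miniature of the `movies` table; the output column names (`avg_rating`, `median_votes`, `n_movies`) are illustrative choices.

```python
import pandas as pd

# Hypothetical miniature of the movies table
demo = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    "rating": [6.0, 8.0, 7.0, 5.0, 9.0],
    "num_votes": [100, 300, 50, 150, 1000],
})

# One groupby producing the mean rating, median votes and movie count per year
yearly = demo.groupby("year").agg(
    avg_rating=("rating", "mean"),
    median_votes=("num_votes", "median"),
    n_movies=("rating", "size"),
)
print(yearly)
```

The median is used for votes here because the mean of a highly skewed column is dominated by a few very popular movies.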

In [None]:
#Analyzing Avg Ratings and Number of Votes 


sns.set_style("darkgrid")
plt.figure(figsize = (15,10))
movies.groupby("year")["rating"].mean().plot(kind = "line" , marker = "x" , color = "r" , mec = "black" , mfc = "white")
movies.groupby("year")["num_votes"].mean().plot(kind = "line" , marker = "o" , color = "g" , mec = "r" , mfc = "black")
plt.legend()
plt.xticks(np.arange(1920,2022,2) , rotation = 90 , fontsize = 15)
plt.xlabel("Year",fontsize = 20)
plt.title("Avg Ratings and Number of Votes per Year",fontsize = 30)
plt.ylabel("Avg Rating / Number of Votes(No Transformation)",fontsize = 20)
plt.show()


#Average ratings may be higher when the number of votes for a movie is lower
#Nothing can be deciphered here because the range of the number of votes is far larger than that of the ratings

In [None]:
#Analyzing Avg Ratings and Log Base2 Transformed Number of Votes


sns.set_style("darkgrid")
plt.figure(figsize = (15,10))
movies.groupby("year")["rating"].mean().plot(kind = "line" , marker = "x" , color = "r" , mec = "black" , mfc = "white")
movies.groupby("year")["num_votes_log2"].mean().plot(kind = "line" , marker = "o" , color = "g" , mec = "r" , mfc = "black")
plt.legend()
plt.xticks(np.arange(1920,2022,2) , rotation = 90 , fontsize = 15)
plt.xlabel("Year",fontsize = 20)
plt.title("Avg Ratings and Number of Votes per Year",fontsize = 30)
plt.ylabel("Avg Rating / Number of Votes(Base 2)",fontsize = 20)
plt.show()


#Average ratings may be higher when the number of votes for a movie is lower


In [None]:
#Analyzing Avg Ratings and Log Base10 Transformed Number of Votes


sns.set_style("darkgrid")
plt.figure(figsize = (15,10))
movies.groupby("year")["rating"].mean().plot(kind = "line" , marker = "x" , color = "r" , mec = "black" , mfc = "white")
movies.groupby("year")["num_votes_log10"].mean().plot(kind = "line" , marker = "o" , color = "g" , mec = "r" , mfc = "black")
plt.legend()
plt.xticks(np.arange(1920,2022,2) , rotation = 90 , fontsize = 15)
plt.xlabel("Year",fontsize = 20)
plt.title("Avg Ratings and Number of Votes per Year",fontsize = 30)
plt.ylabel("Avg Rating / Number of Votes(Base 10)",fontsize = 20)
plt.show()


#Average ratings may be higher when the number of votes for a movie is lower
# We can see that trend stabilizing over the years - aver

In [None]:
#Analyzing Avg Ratings and Square-Root Transformed Number of Votes
sns.set_style("darkgrid")
plt.figure(figsize = (15,10))
movies.groupby("year")["rating"].mean().plot(kind = "line" , marker = "x" , color = "r" , mec = "black" , mfc = "white")
movies.groupby("year")["num_votes_sqrt"].mean().plot(kind = "line" , marker = "o" , color = "g" , mec = "r" , mfc = "black")
plt.legend()
plt.xticks(np.arange(1920,2022,2) , rotation = 90 , fontsize = 15)
plt.xlabel("Year", fontsize = 20)
plt.title("Avg Ratings and Number of Votes per Year", fontsize = 30)
plt.ylabel("Avg Rating and Number of Votes (Square Root Transformed)", fontsize = 20)
plt.show()

In [None]:
#Let us check the distribution of num_votes feature


sns.kdeplot(movies['num_votes'], color = "r")
plt.show()
#Highly Skewed feature - definitely needs transformation

In [None]:

sns.histplot(movies['rating'] , kde = True , color = "r")
plt.title("Ratings Distribution")
plt.show()

#We can see counts of movies rated between 6-8 being the highest

In [None]:
#Let us check the frequency after bucketing the ratings
pd.cut(movies['rating'],[0,2,4,6,8 , 10]).value_counts(normalize = True)*100


In [None]:
#Let us Analyze Ratings by plotting a bar chart
pd.cut(movies['rating'],[0,2,4,6,8 , 10]).value_counts().plot(kind = "bar")
plt.show()


#Ratings between 6 and 8 are the most common (~1750 movies, roughly 50% of the data)
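One detail of `pd.cut` worth keeping in mind when reading these buckets: by default the intervals are closed on the right, so a rating of exactly 6.0 lands in `(4, 6]` and not in `(6, 8]`. The sketch below shows this on a handful of hypothetical ratings.

```python
import pandas as pd

# Hypothetical ratings, including values that sit exactly on a bin edge
ratings = pd.Series([2.0, 6.0, 6.1, 8.0, 9.5])
buckets = pd.cut(ratings, [0, 2, 4, 6, 8, 10])

# Percentage of ratings per bucket, kept in bin order instead of sorted by count
share = buckets.value_counts(normalize=True, sort=False) * 100
print(pd.DataFrame({"rating": ratings, "bucket": buckets}))
print(share)
```

Passing `right=False` to `pd.cut` would instead make the bins closed on the left, if that reading of the boundaries is preferred.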

In [None]:
#Creating a temporary DataFrame with rating >= 7.5 and number of votes >= 2000

movies['rating_avg'] = pd.cut(movies['rating'],[0,2,4,6,8,10])
#Each comparison needs its own parentheses because & binds more tightly than >=
temp = movies[(movies['rating']>=7.5)&(movies['num_votes']>=2000)].sort_values(["year","rating"] , ascending = [False , False])




In [None]:
temp.head()
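When combining conditions in a filter like this one, each comparison has to sit in its own parentheses, because `&` has higher precedence than `>=` in Python. The sketch below first shows the rule with plain Python values and then applies a correctly parenthesized mask to a small hypothetical DataFrame.

```python
import pandas as pd

# Plain-Python illustration of the precedence rule
without_parens = True & 5000 >= 2000   # parsed as (True & 5000) >= 2000
with_parens = True & (5000 >= 2000)    # the intended meaning
print(without_parens, with_parens)

# Hypothetical miniature of the movies table
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2019, 2019, 2018],
    "rating": [8.0, 8.5, 6.0],
    "num_votes": [100, 5000, 9000],
})

# Both conditions parenthesized, then sorted newest and highest-rated first
mask = (demo["rating"] >= 7.5) & (demo["num_votes"] >= 2000)
top = demo[mask].sort_values(["year", "rating"], ascending=[False, False])
print(top)
```

Only movie "B" passes both thresholds here; "A" has too few votes and "C" is rated too low.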

In [None]:
#Printing Top Movies by Year starting with recent years based on the temporary dataframe
year_var = 1990
for year in temp['year'].unique():
    if year >= year_var:
        print(f"{year}: Top Movies By Year {temp[temp['year']==year]['title'].values}")
        print("\n")
        print("--"*20,"\n\n")

In [None]:
#Let us Check the distribution of Ratings in Recent Years
sns.catplot(data = movies[movies['year']>2010] , x = "year" , y = "rating" , kind = "box" , height = 6 , aspect = 1.5)
plt.xlabel("Year",fontsize = 20)
plt.ylabel("Ratings",fontsize = 20)
plt.title("Distribution of Ratings For Recent Years",fontsize = 30)
plt.show()

In [None]:
#The same distribution for recent years, shown as a boxen (letter-value) plot
sns.catplot(data = movies[movies['year']>2010] , x = "year" , y = "rating" , kind = "boxen" , height = 6 , aspect = 1.5)
plt.xlabel("Year",fontsize = 20)
plt.ylabel("Ratings",fontsize = 20)
plt.title("Distribution of Ratings For Recent Years",fontsize = 30)
plt.show()

# Genre

In [None]:
query = "SELECT * FROM GENRE"
genre = pd.read_sql(query , conn)
genre

#We will have to split the Genres in order to analyze the counts

In [None]:
#Check Duplicates Values
genre[genre.duplicated()]

In [None]:
#Merge Genre on Movies dataframe
movies = movies.merge(m_genre[['MID',"GID"]] , on = "MID" , how = "left").merge(genre , on = "GID" , how = "left")

In [None]:
#Split Genre Column
movies = pd.concat((movies, movies['Name'].str.split(", ", expand = True )) , axis = 1)
movies.drop("Name" , axis = 1 , inplace=True)

In [None]:
#Rename Columns after splitting
movies = movies.rename(columns = { 0:"Genre 1",1:"Genre 2" , 2: "Genre 3"})


In [None]:
#Remove Spaces and Trim Genre
movies['Genre 1'] = movies['Genre 1'].apply(lambda x : str(x).split(" ")[0])
movies['Genre 2'] = movies['Genre 2'].apply(lambda x : str(x).split(" ")[0])
movies['Genre 3'] = movies['Genre 3'].apply(lambda x : str(x).split(" ")[0])

In [None]:
#Replace Null with NA
#Assign the result back; inplace=True on a column selection may not modify movies in recent pandas versions
genre_cols = ['Genre 1', 'Genre 2', 'Genre 3']
movies[genre_cols] = movies[genre_cols].replace("Null", "NA")

In [None]:
#Replace None Values with NA
#Assign the result back; inplace=True on a column selection may not modify movies in recent pandas versions
genre_cols = ['Genre 1', 'Genre 2', 'Genre 3']
movies[genre_cols] = movies[genre_cols].replace("None", "NA")
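An alternative to keeping three separate genre columns is a long format with one row per movie and genre, which lets a single `value_counts` or `groupby` cover every genre a movie belongs to. The sketch below is a minimal illustration on hypothetical rows; it assumes the combined genre string lives in a column called `Name`, as it does right after the merge.

```python
import pandas as pd

# Hypothetical rows: one comma-separated genre string per movie
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "Name": ["Drama, Romance", "Action", None],
})

# Split the string into a list, then give each genre its own row
long = demo.assign(genre=demo["Name"].str.split(",")).explode("genre")
# Trim stray spaces and label movies without any genre
long["genre"] = long["genre"].str.strip().fillna("NA")

counts = long["genre"].value_counts()
print(long[["title", "genre"]])
print(counts)
```

The trade-off is that movies with several genres appear on several rows, so per-movie statistics should be computed before exploding or after de-duplicating on the movie ID.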

In [None]:
#Let us Analyze the Counts per Genre for different genres
g = ['Genre 1',"Genre 2" , "Genre 3"]
c = ["r","b","g"]

#genre_col is used as the loop variable so the genre DataFrame is not overwritten
for  color , genre_col in enumerate(g) :
    movies[genre_col].value_counts().plot(kind = "bar" , color =c[color])
    plt.title(f"Movie Counts by {genre_col}")
    plt.show()

In [None]:
#Analyze Average Ratings per Genre


g = ['Genre 1',"Genre 2" , "Genre 3"]
c = ["r","b","g"]

for  color , genre_col in enumerate(g) :
    movies.groupby(genre_col)['rating'].mean().sort_values(ascending=False).plot(kind = "bar" , color =c[color])
    plt.title(f"Average Rating by {genre_col}")
    plt.show()

In [None]:
#Analyze Sum of Number of Votes per Genre


g = ['Genre 1',"Genre 2" , "Genre 3"]
c = ["r","b","g"]

for  color , genre_col in enumerate(g) :
    movies.groupby(genre_col)['num_votes'].sum().sort_values(ascending=False).plot(kind = "bar" , color =c[color])
    plt.title(f"Total Number of Votes by {genre_col}")
    plt.show()

In [None]:
#Analyze Average Number of Votes per Genre


g = ['Genre 1',"Genre 2" , "Genre 3"]
c = ["r","b","g"]

for  color , genre_col in enumerate(g) :
    movies.groupby(genre_col)['num_votes'].mean().sort_values(ascending=False).plot(kind = "bar" , color =c[color])
    plt.title(f"Average Number of Votes by {genre_col}")
    plt.show()

In [None]:
#Analyze Average Number of Votes (Log Transformed) per Genre


g = ['Genre 1',"Genre 2" , "Genre 3"]
c = ["r","b","g"]

for  color , genre_col in enumerate(g) :
    movies.groupby(genre_col)['num_votes_log2'].mean().sort_values(ascending=False).plot(kind = "bar" , color =c[color])
    plt.title(f"Average Log2 Number of Votes by {genre_col}")
    plt.show()

In [None]:
movies['num_votes'].median()
#Minimum number of votes used below when listing top movies per genre
var = 2000

In [None]:
for i in movies['Genre 1'].unique():
    print(f"Top Movies Per Genre: {i}") 
    print(f"{movies[(movies['Genre 1']==i)&(movies['rating']>=7.5)&(movies['num_votes']>=var)]['title'].values}")
    print("\n")
    print("--"*20,"\n\n")

In [None]:
for i in movies['Genre 2'].unique():
    print(f"Top Movies Per Genre {i}") 
    print(f"{movies[(movies['Genre 2']==i)&(movies['rating']>=7.5)]['title'].values}")
    print("--"*20,"\n\n")

In [None]:
#Let us Create a heatmap to check the correlation of features
#Restrict to numeric columns; text columns such as title cannot be correlated
corr = movies.corr(numeric_only=True)

sns.heatmap(corr , annot=True , annot_kws= {'size':8})
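Because the vote counts are heavily skewed, a rank-based correlation can be a useful companion to the default Pearson coefficient. The sketch below computes both on the numeric columns of a small hypothetical frame; it is only meant to show the calls, not to reproduce the heatmap values.

```python
import pandas as pd

# Hypothetical miniature of the movies table with one text column
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [5.0, 6.0, 7.0, 8.0],
    "num_votes": [10, 20, 400, 90000],
})

numeric = demo.select_dtypes("number")       # drops the text column
pearson = numeric.corr()                     # linear correlation
spearman = numeric.corr(method="spearman")   # rank correlation, robust to skew
print(pearson)
print(spearman)
```

Either matrix can be passed to `sns.heatmap` in the same way as `corr` above.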

In [None]:
#Let us check the Number of movies released per Year segregated by Genre

temp2 = movies.groupby(["year","Genre 1"]).count()['title'].reset_index()
year = 1980
#set_palette returns None, so it is called once here instead of being passed as cmap
sns.set_palette('bright')

for y in temp2['year'].unique():
    if y >= year:
        df = temp2[temp2['year']==y].sort_values('title',ascending=False)
        df.plot(x = "Genre 1", y = "title",kind = "bar")
        plt.title(f"Genres in Year: {y}")
        plt.show()

In [None]:
# Genre Distribution Over the Years
genre_years = movies.groupby(['year', 'Genre 1']).size().unstack()
genre_years.plot(kind='line', figsize=(15, 10))
plt.xlabel('Year')
plt.ylabel('Number of Movies')
plt.title('Genre Distribution Over the Years')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

# Language



In [None]:
#Analyses still to cover:
# - relation between actor/producer, actor/director and producers/directors
# - movies shot in different countries and locations
# - genre distribution by years
# - top 5 genres

In [None]:
lang_temp = movies.merge(m_language , how = "left" , on = 'MID').merge(lang , how = "left" , on = 'LAID')
lang_temp.head()

In [None]:
#Language Distribution
language_counts = lang_temp['Name'].value_counts()
language_counts.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Language')
plt.ylabel('Number of Movies')
plt.title('Distribution of Movies by Language')
plt.show()

In [None]:
# Average Ratings by Language
language_avg_ratings = lang_temp.groupby('Name')['rating'].mean().sort_values(ascending=False)
language_avg_ratings.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Language')
plt.ylabel('Average Rating')
plt.title('Average Ratings by Language')
plt.show()

In [None]:
# Language Diversity over the Years
language_years = lang_temp.groupby(['year', 'Name']).size().unstack()
language_years.plot(kind='line', figsize=(12, 6))
plt.xlabel('Year')
plt.ylabel('Number of Movies')
plt.title('Language Diversity Over the Years')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()

# Actor and Directors

In [None]:
name_map = dict(zip(person['PID'],person['Name']))
gender_map = dict(zip(person['PID'],person['Gender']))

In [None]:
movies_dir = movies.merge(m_director , how = "left" , on = "MID").merge(m_language , how = "left" , on = 'MID').merge(lang , how = "left" , on = 'LAID')
movies_dir['Director_Name'] = movies_dir['PID'].map(name_map)
movies_dir['Director_Gender'] = movies_dir['PID'].map(gender_map)

In [None]:
movies_dir

In [None]:
movies_dir.columns
#Select the columns and rename in one chained step to avoid renaming a slice in place
movies_dir = movies_dir[['MID', 'title', 'year', 'rating', 'num_votes', 'num_votes_log2',
       'num_votes_log10', 'num_votes_sqrt', 'rating_avg',
       'Genre 1', 'Name', 'Director_Name', 'Director_Gender']].rename(columns = {'Name':"Language"})

In [None]:
movies_dir

In [None]:


temp2 = movies_dir[['MID', 'title', 'year', 'rating', 'num_votes', 'num_votes_log2',
       'num_votes_log10', 'num_votes_sqrt', 'rating_avg',
       'Genre 1','Director_Name', 'Director_Gender']]

In [None]:
temp2['Director_Name'].value_counts()[:25].plot(kind = "bar" , figsize = (15,8) , color = 'b')
plt.xticks(fontsize = 20)
plt.yticks(fontsize = 18)
plt.xlabel('Director Name' , fontsize = 20)
plt.ylabel('Number of Movies Directed' , fontsize = 20)
plt.title('Top 25 Directors by Number of Movies Directed' , fontsize = 30)
plt.show()

In [None]:
dirc = list(movies_dir['Director_Name'].value_counts()[:25].index)
df2 = movies_dir[movies_dir['Director_Name'].isin(dirc)].groupby(['Director_Name','Genre 1']).size().unstack().fillna(0).astype(int)
for col in df2.columns:
    plt.title(f"Number of Movies directed by top Directors in Genre: {col}",fontsize = 20)
    plt.xticks(fontsize = 20)
    plt.yticks(fontsize = 18)
    df2[col].sort_values(ascending=False).plot(kind = 'bar' , figsize = (15,8))
    plt.show()
    print("*"*50)
    print("\n\n")

In [None]:
movies_dir['Director_Gender'].value_counts()
# Many Genders are unknown and imputation would be hard so we will not be analyzing this feature

# Cast

In [None]:
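#Take the token after the space, falling back to the raw value when there is no space
m_cast['PID'] = m_cast['PID'].apply(lambda x : str(x).split(" ")[1] if len(str(x).split(" "))>1 else x )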
m_cast = m_cast[['MID','PID']]

In [None]:
movies_cast = movies_dir.merge(m_cast , on = 'MID' , how = 'left')
movies_cast

In [None]:
movies_cast[movies_cast.duplicated()]

In [None]:
movies_cast['Cast_Name'] = movies_cast['PID'].map(name_map)
movies_cast['Cast_Gender'] = movies_cast['PID'].map(gender_map)

In [None]:
sns.set_palette("plasma")
movies_cast['Cast_Name'].value_counts()[:50].plot(kind = 'bar',figsize = (15,10))
plt.title(f"Top Cast Member Counts",fontsize = 30)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 18)
plt.xlabel("Cast Members",fontsize = 20)
plt.ylabel("#Number of Movies",fontsize = 20)
plt.show()
print("*"*50)
print("\n\n")

In [None]:
for n,d in enumerate(movies_cast['Director_Name'].unique()[:-1]):
    ser = movies_cast[(movies_cast['Director_Name']==d)]['Cast_Name'].value_counts()
    #Keep cast members with at least 8 appearances and skip directors who have none
    top = ser[ser>=8]
    if top.empty:
        continue
    top.plot(kind = 'bar' , color = plt.cm.Reds(n),fill = np.random.choice([True,False]) , edgecolor = np.random.choice(['r','b','g']))
    plt.title(f"Favourite Cast Members for Director:{d}",fontsize = 24)
    plt.xlabel("Favourite Cast Members" , fontsize = 12)
    plt.ylabel(f"# Of Movies Acted in for Director {d}",fontsize = 12)
    plt.show()
    print(">>*"*20)
    print("\n\n")
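The same director and cast pairings can also be counted in a single `groupby` instead of filtering the frame once per director. The sketch below is a hedged illustration on hypothetical rows; it first de-duplicates on movie, director and cast member, on the assumption that earlier joins may have repeated a credit (for example when a movie has several rows in a mapping table). The threshold `min_movies` is an illustrative parameter.

```python
import pandas as pd

# Hypothetical credits; the last two rows repeat the same credit on purpose
demo = pd.DataFrame({
    "MID": ["m1", "m1", "m2", "m2", "m3", "m3"],
    "Director_Name": ["Dir A", "Dir A", "Dir A", "Dir A", "Dir B", "Dir B"],
    "Cast_Name": ["Actor X", "Actor Y", "Actor X", "Actor Z", "Actor X", "Actor X"],
})

# Count each (movie, director, cast member) credit only once
unique_credits = demo.drop_duplicates(["MID", "Director_Name", "Cast_Name"])

# Number of distinct movies per director and cast member pair
pair_counts = (
    unique_credits.groupby(["Director_Name", "Cast_Name"])
    .size()
    .rename("n_movies")
    .reset_index()
)

min_movies = 2  # illustrative threshold; the loop above uses 8
favourites = pair_counts[pair_counts["n_movies"] >= min_movies]
print(favourites)
```

The resulting table can be filtered per director for plotting, which avoids re-scanning the full frame inside the loop.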

In [None]:
plt.figure(dpi = 250)
sns.set_palette("Set1")
movies_cast['Cast_Gender'].value_counts().plot(kind = 'pie' , autopct = "%0.1f%%" , explode =[0.2 , 0.6],shadow = True)
plt.title('Pie Chart for Gender Distribution')

plt.show()

# Directors and Producers

In [None]:
m_producer['PID']

In [None]:
m_producer['PID'] = m_producer['PID'].apply(lambda x : str(x).split(" ")[1] if len(str(x).split(" "))>1 else x )

In [None]:
movies_cast_prod = movies_cast.merge(m_producer[['MID','PID']],on = 'MID',how = 'left')
movies_cast_prod['Producer_Name'] = movies_cast_prod['PID_y'].map(name_map)
movies_cast_prod['Producer_Gender'] = movies_cast_prod['PID_y'].map(gender_map)

In [None]:
temp = movies_cast_prod.groupby(['Director_Name','Producer_Name']).count()['MID'].reset_index()


In [None]:
temp

In [None]:
sns.set_palette('magma')
plt.figure(figsize = (15,10))
movies_cast_prod['Producer_Name'].value_counts()[:50].plot(kind ='bar' , edgecolor = 'purple',fill=False)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel("Producer")
plt.ylabel("# Movies Produced")
plt.show()

In [None]:
for n,d in enumerate(temp['Director_Name'].unique()):
    temp[(temp['Director_Name']==d)].sort_values('MID',ascending=False)[:15].plot(kind = 'bar',x='Producer_Name',color = plt.cm.YlGn(n),fill = np.random.choice([True,False]) , edgecolor = np.random.choice(['r','b','g']))
    plt.title(f"Favourite Producer for Director:{d}",fontsize = 20)
    plt.xlabel("Favourite Producers" , fontsize = 12)
    plt.ylabel(f"# Of Movies Produced",fontsize = 9)
    plt.show()
    print(">>*"*20)
    print("\n\n")

In [None]:
sns.set_palette('magma')
plt.figure(figsize = (15,10))
movies_cast_prod['Producer_Name'].value_counts()[:50].plot(kind ='bar' , edgecolor = 'g',fill=False)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel("Producer")
plt.ylabel("# Movies Produced")
plt.show()

# Movies Shot in Different Countries and Locations

In [None]:
country_map = dict(zip(country['CID'],country['Name']))
location_map = dict(zip(location['LID'],location['Name']))

In [None]:
movies_country_loc = movies_dir.merge(m_country[['MID','CID']],how ='left' , on = 'MID').merge(m_location[['MID','LID']],how ='left',on='MID')
movies_country_loc.head()

In [None]:
movies_country_loc['Country_Name'] = movies_country_loc['CID'].map(country_map)
movies_country_loc['Location_Name'] = movies_country_loc['LID'].map(location_map)

In [None]:
plt.figure(figsize = (15,8))
movies_country_loc['Country_Name'].value_counts().plot(kind = 'bar', edgecolor ='black')
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel("Country")
plt.ylabel("# Number of Movies Shot")
plt.title("Number of Movies by Country")
plt.show()

In [None]:
plt.figure(figsize = (15,8))
movies_country_loc['Location_Name'].value_counts()[:50].plot(kind = 'bar', edgecolor ='black')
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel("Location")
plt.ylabel("# Number of Movies Shot")
plt.title("Top 50 Locations by Number of Movies")
plt.show()