# Verzeo - Minor Project (ML-JULY-B1)

### Done by : Sanjay Marreddi  
### Email Id  : sanjay.marreddi.19041@iitgoa.ac.in
    

#### First let us import the required Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#### Let us read the given Dataset

In [None]:
movies = pd.read_csv("tmdb-movies.csv")

## Cleaning the Data and Performing EDA

In [None]:
movies.head()

In [None]:
movies.shape

In [None]:
movies.columns.tolist() # Listing all the Column Names

In [None]:
movies.describe()

In [None]:
movies.info()

In [None]:
movies.isnull().sum() # Checking the Missing Values in each Column of the DataFrame

#### Note: The results show that every column containing NaN values is of type `object`
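
The note above can be checked directly by listing the dtypes of only those columns that contain missing values. The following is a minimal sketch on a small made-up frame; `toy` and its column values are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movies data
toy = pd.DataFrame({
    "budget": [100, 200, 300],                                      # numeric, no NaN
    "cast": pd.Series(["A|B", np.nan, "C"], dtype=object),          # text with a NaN
    "homepage": pd.Series([np.nan, "site", np.nan], dtype=object),  # text with NaNs
})

# Boolean mask over the columns: True where a column has at least one NaN
has_nan = toy.isnull().any()

# dtypes of only the columns that contain NaN
nan_dtypes = toy.dtypes[has_nan]
print(nan_dtypes)

# True when every NaN-containing column has object dtype
all_object = bool((nan_dtypes == object).all())
print(all_object)
```

Running the same three lines on `movies` instead of `toy` would confirm or refute the note for the actual data.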

## Data Visualisation with seaborn

#### Analysing `popularity` feature of the given Movies Dataset

In [None]:
# Plotting the Distribution Plot along with Gaussian Kernel Density Estimate 
sns.distplot(movies['popularity'],kde=True,bins=50,color="Orange",hist_kws=dict(edgecolor="b", linewidth=1.5))

#### Finding the width of each bar from the number of bins

In [None]:
print("Minimum popularity in the data is:",movies.popularity.min())
print("Maximum popularity in the data is:",movies.popularity.max())
print("Range of popularity is from {} to {}, value is {}".format(movies.popularity.min(),movies.popularity.max(),movies.popularity.max()-movies.popularity.min()))
print("I used 50 bins, so each bar spans a popularity width of:",(movies.popularity.max()-movies.popularity.min())/50)
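
The arithmetic above assumes equal-width bins stretching from the minimum to the maximum value. A small sketch with NumPy illustrates this; the array `values` is made up for the example, and `np.histogram` is used here only to expose the bin edges.

```python
import numpy as np

# Hypothetical values standing in for movies['popularity']
values = np.array([0.5, 1.0, 2.5, 4.0, 10.5])
n_bins = 50

# Width of each equal-width bin: (max - min) / number of bins
width = (values.max() - values.min()) / n_bins

# np.histogram returns the counts and the n_bins + 1 bin edges
counts, edges = np.histogram(values, bins=n_bins)

print(width)
print(np.diff(edges)[:3])  # spacing between consecutive edges
```

The spacing between consecutive edges equals the computed width, which is why dividing the range by the number of bins gives the span of one bar.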

In [None]:
# Plotting only the Histogram using Different Bin Size
sns.distplot(movies['popularity'],kde=False,hist=True,bins=20,color= "Orange",hist_kws=dict(edgecolor="b", linewidth=1.0))

In [None]:
# Plotting the Box Plot 
sns.boxplot(x=movies['popularity'],color ="Orange")

# Evaluating the Percentiles and Interquartile range (IQR)
Q3 = movies.popularity.quantile(.75)
Q1 = movies.popularity.quantile(.25)
IQR = Q3 - Q1

# Finding the Median
Median = movies.popularity.median()
print("Q1 Value:",Q1)
print("Median Value:",Median)
print("Q3 Value:",Q3)
print("Upper whisker limit:",(Q3 + 1.5*IQR))
print("Lower whisker limit:",(Q1 - 1.5*IQR))
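
The two limits printed above are the fences at 1.5 times the IQR. With the default settings, the whiskers drawn in a box plot stop at the most extreme observations that still lie inside those fences, so they can be shorter than the limits. A minimal sketch on a made-up Series `s` (an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample standing in for movies['popularity']
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences
lower_whisker = s[s >= lower_fence].min()
upper_whisker = s[s <= upper_fence].max()

# Points beyond the fences are the ones drawn as outliers
outliers = s[(s < lower_fence) | (s > upper_fence)]

print(lower_fence, upper_fence)
print(lower_whisker, upper_whisker)
print(outliers.tolist())
```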

In [None]:
# Checking the Symmetry of the popularity Feature
sns.violinplot(x='popularity',data=movies,color ="Orange")
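
Symmetry can also be summarised numerically with the sample skewness, which is close to 0 for a symmetric distribution and positive when there is a long right tail. The sketch below uses two made-up Series rather than the real feature.

```python
import pandas as pd

# Hypothetical samples: one symmetric, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 50])

print(symmetric.skew())     # close to 0 for a symmetric sample
print(right_skewed.skew())  # positive for a right-skewed sample

# A mean well above the median is another sign of a right tail
print(right_skewed.mean(), right_skewed.median())
```

Calling `movies['popularity'].skew()` would give the corresponding figure for the dataset.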

#### Analysing `budget` feature of the given Movies Dataset

In [None]:
movies = pd.read_csv("tmdb-movies.csv")
# Plotting the Distribution Plot along with Gaussian Kernel Density Estimate 
sns.distplot(movies['budget'],kde=True,bins=20,color="green")

#### Finding the width of each bar from the number of bins

In [None]:
print("Minimum budget in the data is:",movies.budget.min())
print("Maximum budget in the data is:",movies.budget.max())
print("Range of budget is from {} to {}, value is {}".format(movies.budget.min(),movies.budget.max(),movies.budget.max()-movies.budget.min()))
print("I used 20 bins, so each bar spans a budget width of:",(movies.budget.max()-movies.budget.min())/20)

In [None]:
# Plotting only the Histogram using Different Bin Size
sns.distplot(movies['budget'],kde=False,hist=True,bins=10,color= "Green",hist_kws=dict(edgecolor="g", linewidth=1.0))

In [None]:
# Plotting the Box Plot 
sns.boxplot(x=movies['budget'],color ="Green")

# Evaluating the Percentiles and Interquartile range (IQR)
Q3 = movies.budget.quantile(.75)
Q1 = movies.budget.quantile(.25)
IQR = Q3 - Q1

# Finding the Median
Median = movies.budget.median()
print("Q1 Value:",Q1)
print("Median Value:",Median)
print("Q3 Value:",Q3)
print("Upper whisker limit:",(Q3 + 1.5*IQR))
print("Lower whisker limit:",(Q1 - 1.5*IQR))

In [None]:
# Checking the Symmetry of the budget Feature
sns.violinplot(y='budget',data=movies,color="Green")

#### Combined Plots of `budget` and `popularity` Features of the given Movies Dataset

In [None]:
sns.jointplot(x="budget", y="popularity", data=movies,height=10) # `size` was renamed to `height` in seaborn 0.9

In [None]:
sns.lmplot(x='budget', y='popularity',data=movies,order=2)

### Heatmaps
**Correlation** measures how strong the linear relationship between two variables is.
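
As a rough illustration, the Pearson coefficient (the default method of `DataFrame.corr`) can be computed by hand and compared with NumPy. The arrays `x` and `y` are made-up values, not columns from the dataset.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: covariance scaled by the product of the standard deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The coefficient always lies between -1 and 1; values near 0 indicate little linear relationship.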

In [None]:
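movies.corr(numeric_only=True) # Restricting to numeric columns; recent pandas versions raise an error on text columns otherwise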

In [None]:
# HeatMap of the Correlation Matrix (numeric columns only)
sns.heatmap(movies.corr(numeric_only=True))

In [None]:
sns.pairplot(movies)

#### Preprocessing the Outliers in `budget` feature of the given DataSet

#### Using IQR

In [None]:
movies = pd.read_csv("tmdb-movies.csv")

In [None]:
movies.budget.median()

In [None]:
sns.boxplot(x = movies['budget'])

In [None]:
Q1 = movies.budget.quantile(0.25)
print(Q1)
Q3 = movies.budget.quantile(0.75)
print(Q3)
IQR = Q3 - Q1
print(IQR)
print(Q1 - (1.5 * IQR))
print(Q3 + (1.5 * IQR))

#### Imputation

In [None]:
movies[~((movies.budget < (Q1 - 1.5 * IQR)) |(movies.budget > (Q3 + 1.5 * IQR)))].budget.median()

In [None]:
movies.loc[movies['budget']< (Q3 + (1.5 * IQR)), 'budget'].median()

In [None]:
median = movies.loc[movies['budget']< (Q3 + (1.5 * IQR)), 'budget'].median()
movies.loc[movies.budget > (Q3 + (1.5 * IQR)) , 'budget'] = median
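
The imputation step can be wrapped in a small helper. This is a sketch under assumptions: `impute_upper_outliers` is a hypothetical name, the Series `budget` is made up, and values exactly equal to the upper limit are treated as inliers here.

```python
import pandas as pd

def impute_upper_outliers(s: pd.Series) -> pd.Series:
    """Replace values above Q3 + 1.5*IQR with the median of the remaining values."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper = q3 + 1.5 * (q3 - q1)
    is_outlier = s > upper
    # Median computed only from the values that are not outliers
    median_inliers = s[~is_outlier].median()
    # astype returns a new Series, so the caller's data is left untouched
    out = s.astype(float)
    out[is_outlier] = median_inliers
    return out

# Hypothetical budgets with one extreme value
budget = pd.Series([10, 20, 30, 40, 50, 1000])
cleaned = impute_upper_outliers(budget)
print(cleaned.tolist())
```

Returning a new Series instead of modifying in place makes it easy to compare the box plots before and after imputation.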

In [None]:
sns.boxplot(x = movies['budget'])

#### Replacing the NaN values with 0 in entire Dataset

In [None]:
# fillna returns a new DataFrame, so the result is assigned back to keep the change
movies = movies.fillna(0)

## Solving the Given Questions

### 1) Which are the movies with the third lowest and third highest budget?

In [361]:
movies = pd.read_csv("tmdb-movies.csv")

# Creating a Temporary Copy of the DataFrame
movies_copy =movies.copy()

# Making a sorted list of the distinct Budget Values
L = sorted(set(movies_copy.budget.tolist()))


print("The third lowest budget is ",L[2])
print("The movie with the third lowest budget is :- ", movies.loc[(movies["budget"]==L[2]),"original_title"].tolist()[0],".")

print("The third highest budget is ",L[-3])
print("The movie with the third highest budget is :- ",movies.loc[(movies["budget"]==L[-3]),"original_title"].tolist()[0],".")



The third lowest budget is  2
The movie with the third lowest budget is :-  Death Wish 2 .
The third highest budget is  300000000
The movie with the third highest budget is :-  Pirates of the Caribbean: At World's End .
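
Several movies can share the same budget, so it may be useful to list every title at the third lowest and third highest distinct values rather than only the first one. A minimal pandas sketch on a made-up frame (`toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical frame with repeated budget values
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "budget": [0, 0, 1, 2, 5, 300, 400, 400],
})

# Distinct budget values in ascending order
distinct = toy["budget"].drop_duplicates().sort_values()
third_lowest = distinct.iloc[2]
third_highest = distinct.iloc[-3]

# All titles that share each of those budget values
lowest_titles = toy.loc[toy["budget"] == third_lowest, "original_title"].tolist()
highest_titles = toy.loc[toy["budget"] == third_highest, "original_title"].tolist()

print(third_lowest, lowest_titles)
print(third_highest, highest_titles)
```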


### 2) What is the average number of words in movie titles between the year 2000-2005?

In [None]:
# Titles of movies released between 2000 and 2005, taken here as the years 2001 to 2004
titles = movies.loc[movies["release_year"].isin([2001,2002,2003,2004]),"original_title"].tolist()
# Counting words (whitespace-separated tokens) in each title; len(title) alone would count characters
A = [len(str(t).split()) for t in titles]

In [362]:
# Using Numpy to evaluate the Average
avg=np.mean(A)
print("The average number of words in movie titles between the year 2000-2005 is ", avg,".")


### 3) What is the most common Genre for Vin Diesel & Emma Watson movies?

In [None]:
# Initialise two empty Dictionaries 
vd={}
em={}

# Go through each row of DataFrame
for j in range(int(movies.shape[0])):
    
    # Ignore rows where the "cast" value is missing (a missing value is NaN, which is not a str)
    if type(movies.cast[j]) == str:
        
        # Creating a Dict that has Combined String of Genres related to movies in which "Vin Diesel" is present 
        if "Vin Diesel" in movies.cast[j] :
            if movies.genres[j] in vd:
                vd[movies.genres[j]]+=1
            else:
                vd[movies.genres[j]] = 1
        
        # Creating a Dict that has Combined String of Genres related to movies in which "Emma Watson" is present 
        if "Emma Watson" in movies.cast[j] :
            if movies.genres[j] in em:
                em[movies.genres[j]]+=1
            else:
                em[movies.genres[j]] = 1

        

V={}

# Counting each individual genre across the movies in which "Vin Diesel" is present,
# using the dictionary of combined genre strings built above

for k,v in vd.items():
    tem= k.split("|") # Splitting based on "|" as delimiter
    for ea in tem:
        if ea in V:
            V[ea]+=v
        else:
            V[ea]=v

E={}

# Counting each individual genre across the movies in which "Emma Watson" is present,
# using the dictionary of combined genre strings built above

for k2,v2 in em.items():
    tem2= k2.split("|") # Splitting based on "|" as delimiter
    for ea in tem2:
        if ea in E:
            E[ea]+=v2
        else:
            E[ea]=v2

In [363]:
# Finding the Keys with max value of Genre Count 
Vmax= max(V, key=V.get)
Emax= max(E, key=E.get)

print("The most common Genre for Vin Diesel :",Vmax)
print("The most common Genre for Emma Watson :",Emax)

The most common Genre for Vin Diesel : Action
The most common Genre for Emma Watson : Family
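
The same counting can be expressed with vectorised pandas string methods. This is a sketch of an alternative approach, not the method used above; `most_common_genre` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

# Hypothetical frame with pipe-separated cast and genres, including a missing cast
toy = pd.DataFrame({
    "cast": ["Vin Diesel|X", "Vin Diesel|Y", "Emma Watson|Z", "Emma Watson|W", None],
    "genres": ["Action|Thriller", "Action|Crime", "Family|Fantasy", "Family|Drama", "Drama"],
})

def most_common_genre(df, actor):
    # na=False treats rows with a missing cast as non-matches
    rows = df[df["cast"].str.contains(actor, na=False, regex=False)]
    # Split the combined genre strings, put one genre per row, then count
    counts = rows["genres"].dropna().str.split("|").explode().value_counts()
    return counts.idxmax()

print(most_common_genre(toy, "Vin Diesel"))
print(most_common_genre(toy, "Emma Watson"))
```

On ties, `idxmax` returns only one of the tied genres, so inspecting `counts` directly is safer when the top counts are close.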


### 4) Which are the movies with most and least earned revenue?

In [364]:
# Creating a Temporary Copy of the DataFrame
movies_copy =movies.copy()
s = set(movies_copy.revenue.tolist())
a= list(s)
a.sort()

print("The least earned revenue value is",a[0])
print("The movies with the least earned revenue are :-")

for i in movies.loc[(movies["revenue"]==a[0]),"original_title"].tolist():
    print(i)
    
print("-"*120,"\n")
print("The most earned revenue value is",a[-1])
print("The movie with the most earned revenue is : ", movies.loc[(movies["revenue"]==a[-1]),"original_title"].tolist()[0])

The least earned revenue value is 0
The movies with the least earned revenue are :-
Wild Card
Survivor
Mythica: The Darkspore
Me and Earl and the Dying Girl
Mythica: The Necromancer
Vice
Frozen Fever
High-Rise
Spooks: The Greater Good
The Scorpion King: The Lost Throne
Everly
Louder Than Bombs
Dragonheart 3: The Sorcerer's Curse
Brothers of the Wind
Bone Tomahawk
Pawn Sacrifice
Momentum
Pay the Ghost
The Voices
Il racconto dei racconti
Queen of the Desert
Miss You Already
Kung Fury
Kidnapping Mr. Heineken
The Ridiculous 6
Kill Me Three Times
45 Years
Jenny's Wedding
Descendants
Tumbledown
LEGO DC Comics Super Heroes: Justice League vs. Bizarro League
Accidental Love
Halo: The Fall of Reach
Into the Forest
The Benefactor
Open Season: Scared Silly
Mojave
Forsaken
Madame Bovary
The Hallow
Barbie in Princess Power
Howl
Man Up
Return to Sender
Fathers and Daughters
Knight of Cups
Mississippi Grind
400 Days
Lava
Air
Riot
Septembers of Shiraz
Extinction
The Childhood of a Leader
Clown
Batman 

Belphégor - Le fantôme du Louvre
Lost and Delirious
The Lost World
Manic
Chori Chori Chupke Chupke
The Luck of the Irish
Tanguy
Double Take
Rivers and Tides
Little Secrets
Slashers
Tinker Bell
Starship Troopers 3: Marauder
The Scorpion King: Rise of a Warrior
Frost/Nixon
High School Musical 3: Senior Year
The Reader
Diario de una ninfómana
Wild Child
Felon
Snow Buddies
Kung Fu Panda: Secrets of the Furious Five
Another Cinderella Story
Presto
The Duchess
Che: Part One
BURN•E
The Good Witch
LOL (Laughing Out Loud)
Stargate: Continuum
Rachel Getting Married
A Matter of Loaf and Death
The Other Boleyn Girl
Drillbit Taylor
Synecdoche, New York
My Sassy Girl
Joy Ride 2: Dead Ahead
Bronson
Max Manus
Dr. Dolittle: Tail to the Chief
Barbie Mariposa and Her Butterfly Fairy Friends
Zombie Strippers!
Niko: Lentäjän poika
100 Feet
The Burning Plain
Be Kind Rewind
Futurama: The Beast with a Billion Backs
The Andromeda Strain
The Ruins
Camp Rock
Ghost Town
Fifty Dead Men Walking
Dead Space: D

Stories We Tell
After Porn Ends
Hungry For Change
Good Vibrations
When the Lights Went Out
Vamps
Berberian Sound Studio
Political Animals
The Girl
Pokłosie
Among Friends
Scary or Die
Bad Kids Go to Hell
A Mother's Nightmare
Party Bercy
Artifact
Amy Schumer: Mostly Sex Stuff
Birdsong
Mr. Pip
Gayby
It's a SpongeBob Christmas!
Bill Burr: You People Are All The Same
Crawlspace
Talaash
Jurassic Shark
Fresh Guacamole
Free Angela and All Political Prisoners
Girls Against Boys
The Pact
Led Zeppelin: Celebration Day
Cousin Ben Troop Screening
Du vent dans mes mollets
Student of the Year
Underground: The Julian Assange Story
Beware of Mr. Baker
Craigslist Joe
How to Make Money Selling Drugs
Minecraft: The Story of Mojang
Sleepwalk With Me
The Millionaire Tour
Una pistola en cada mano
Museum Hours
Jim Gaffigan: Mr. Universe
Aziz Ansari: Dangerously Delicious
Fresh Meat
Just Like a Woman
The Sleeper
Belle du Seigneur
Freeloaders
Undercover Bridesmaid
West of Memphis
Inch'Allah
Sassy Pants
Steve J

After Sex
Delta Farce
The Ten
Rise: Blood Hunter
Teeth
The Comebacks
Pars vite et reviens tard
Stuck
The Poughkeepsie Tapes
Trade
Helvetica
Gekijouban Clannad
Arn: Tempelriddaren
The King of Kong
Louis C.K.: Shameless
Molière
Kickin' It Old Skool
Creepshow 3
Lezioni di cioccolato
Borderland
Urban Justice
Chill Out, Scooby-Doo!
Zeitgeist
Straightheads
The Brothers Solomon
Jesse Stone: Sea Change
The Color of Freedom
The Three Investigators and The Secret Of Skeleton Island
Jack Brooks: Monster Slayer
The Flock
Murder Party
Catacombs
Grace is Gone
Hellphone
Gabriel Iglesias: Hot and Fluffy
Fallen
Eichmann
Live!
The Haunting Hour: Don't Think About It
Oliver Twist
The Deaths of Ian Stone
Eagle vs Shark
Eden Log
Highlander: The Search for Vengeance
Angel
Chacun son cinema ou Ce petit coup au coeur quand la lumiere s'eteint et que le film commence
Irina Palm
End of the Line
La ragazza del lago
La Antena
A Dog's Breakfast
Heima
In Tranzit
Jump In!
The Beautiful Ordinary
Hounddog
Daniel Tosh

Follow That Camel
In Like Flint
The Sword in the Stone
Dinner for One
It's a Mad, Mad, Mad, Mad World
The Haunting
Lord of the Flies
Irma la Douce
Murder at the Gallop
The Day of the Triffids
Hud
Tom Jones
The Raven
Donovan's Reef
Shock Corridor
The Terror
Le Feu follet
The Nutty Professor
This Sporting Life
McLintock!
Jason and the Argonauts
X: The Man with the X-Ray Eyes
Billy Liar
The Haunted Palace
The Servant
Lilies of the Field
The Prize
The Kiss of the Vampire
La frusta e il corpo
Astérix chez les Bretons
Allan Quatermain and the Lost City of Gold
¡Three Amigos!
Nothing in Common
Dèmoni 2
Soul Man
Salvador
Jumpin' Jack Flash
The Big Easy
April Fool's Day
Mona Lisa
Down by Law
Henry: Portrait of a Serial Killer
Mala Noche
Heartburn
Tough Guys
Nan bei Shao Lin
Luxo Jr.
House
Clockwise
Class of Nuke 'Em High
Running Scared
One Crazy Summer
Witchboard
Round Midnight
Iron Eagle
Legal Eagles
Caravaggio
The Clan of the Cave Bear
Night of the Creeps
Chopping Mall
Macskafogó
Lucas
Wi

### 5) What is the average runtime of movies in the year 2006?

In [366]:
run = movies.loc[ (movies["release_year"]== 2006), "runtime" ].tolist()

In [367]:
avg= np.mean(run)

In [368]:
print("The average runtime of movies in the year 2006 is :",avg)

The average runtime of movies in the year 2006 is : 101.68382352941177


### 6) Name any 3 production companies which have invested money in worse revenue movies?

In [369]:
# Creating a Temporary Copy of the DataFrame
movies_copy =movies.copy()
s = set(movies_copy.revenue.tolist())
a= list(s)
a.sort()

# Selecting the movies whose Revenue is 0 (the lowest value) and whose Budget is above the 98th percentile,
# to find the companies which invested heavily in the worst revenue movies.
rev= movies.loc[(  (movies["revenue"]==a[0])  & ( movies["budget"] > movies.budget.quantile(0.98)  ) ),"production_companies"].tolist()

print("The 3 Production Companies which have invested money in Worse Revenue Movies are :-")

rev = rev[0].split("|")
for i in rev:
    print(i)

The 3 Production Companies which have invested money in Worse Revenue Movies are :-
Universal Pictures
Stuber Productions
Relativity Media
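
The cell above reads the companies from the first matching movie only. To collect the distinct companies across every matching movie, the pipe-separated strings can be split and exploded. A minimal sketch on made-up data follows; `toy` and the company names are assumptions, and the median is used as the budget threshold only because the toy frame is tiny.

```python
import pandas as pd

# Hypothetical frame: zero revenue marks the worst-earning movies
toy = pd.DataFrame({
    "revenue": [0, 0, 500, 0],
    "budget": [100, 5, 80, 90],
    "production_companies": ["P1|P2", "P3", "P4", "P2|P5"],
})

# Budget threshold (the notebook uses the 0.98 quantile on the real data)
threshold = toy["budget"].quantile(0.5)

# Movies with zero revenue and a budget above the threshold
flops = toy[(toy["revenue"] == 0) & (toy["budget"] > threshold)]

# One company per row, duplicates removed, original order kept
companies = (
    flops["production_companies"].dropna().str.split("|").explode().unique().tolist()
)
print(companies)
```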


### The End - Sanjay Marreddi
