<a href="https://colab.research.google.com/github/MinhVuong2000/Data-Science/blob/master/MovieReportProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

Using the TMDB movie dataset, I analyze:
* which production areas have the most influence on revenue
* how genre affects a movie's revenue and average score
* how release date affects revenue

The credits dataset contains the following features:
* movie_id - A unique identifier for each movie.
* cast - The name of lead and supporting actors.
* crew - The name of Director, Editor, Composer, Writer etc.

The movies dataset has the following features:
* budget - The budget in which the movie was made.
* genre - The genre of the movie: Action, Comedy, Thriller, etc.
* homepage - A link to the homepage of the movie.
* id - This is in fact the movie_id from the first dataset.
* keywords - The keywords or tags related to the movie.
* original_language - The language in which the movie was made.
* original_title - The title of the movie before translation or adaptation.
* overview - A brief description of the movie.
* popularity - A numeric quantity specifying the movie popularity.
* production_companies - The production house of the movie.
* production_countries - The country in which it was produced.
* release_date - The date on which it was released.
* revenue - The worldwide revenue generated by the movie.
* runtime - The running time of the movie in minutes.
* status - "Released" or "Rumored".
* tagline - Movie's tagline.
* title - Title of the movie.
* vote_average - The average rating the movie received.
* vote_count - The number of votes received.

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import f_oneway
import plotly.express as px
import json
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline

## Describing the data

Load the dataset and parse the JSON-formatted columns into lists of names.
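
The `genres` and `production_countries` columns store each cell as a JSON string holding a list of objects with a `name` key. As a minimal sketch, assuming a made-up sample string and a hypothetical helper called `names_from_json`, such a cell can be reduced to a plain list of names like this:

```python
import json

def names_from_json(cell):
    """Return the 'name' field of every object in a JSON-encoded list."""
    return [item["name"] for item in json.loads(cell)]

# Made-up cell in the same shape as the 'genres' column (illustration only)
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
genres = names_from_json(sample)
print(genres)
```

A cell that contains an empty JSON list yields an empty Python list, so a movie without genres does not contribute to any group.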

In [None]:
# download the dataset from Kaggle
!pip install kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d tmdb/tmdb-movie-metadata
!ls

In [None]:
# unzip the files downloaded from Kaggle
import zipfile
with zipfile.ZipFile('tmdb-movie-metadata.zip', 'r') as zip_ref:
    zip_ref.extractall('files')
!ls files

In [None]:
def load_tmdb_movies(path):
    df = pd.read_csv(path)
    # keep only the columns used in this analysis; copy to avoid SettingWithCopyWarning
    df = df[['genres', 'production_countries', 'revenue', 'release_date', 'vote_average', 'title']].copy()
    df['release_date'] = pd.to_datetime(df['release_date']).dt.date
    df.rename(columns={"production_countries": "areas"}, inplace=True)
    json_columns = ['genres', 'areas']
    for column in json_columns:
        # each cell is a JSON list of objects; keep only the 'name' of each object
        # (an empty JSON list becomes an empty Python list)
        df[column] = df[column].apply(lambda cell: [item['name'] for item in json.loads(cell)])
    return df

def load_tmdb_credits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

movies = load_tmdb_movies("files/tmdb_5000_movies.csv")
movies.head()

In [None]:
movies.shape

### Check for missing values

In [None]:
totalMissing = movies.isnull().sum()
# share of missing values per column, as a percentage
percentMissing = movies.isnull().mean() * 100
missing_data = pd.concat([totalMissing, percentMissing], axis=1, keys=['Total', 'Percent'])
missing_data

Based on the table above, there are few or no missing values, so we keep the data as is.
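
To make the missing-value table concrete, here is a small sketch on a made-up frame (the column values are invented for illustration). It uses `isnull().sum()` for the count and `isnull().mean()` for the share of missing entries per column:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing revenue and one missing title
toy = pd.DataFrame({
    "revenue": [100.0, np.nan, 300.0, 400.0],
    "title": ["a", "b", None, "d"],
    "vote_average": [6.1, 7.2, 5.5, 8.0],
})

summary = pd.DataFrame({
    "Total": toy.isnull().sum(),             # number of missing cells per column
    "Percent": toy.isnull().mean() * 100,    # share of missing cells, in percent
})
print(summary)
```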

### Descriptive statistics of the data

In [None]:
movies.describe()

In [None]:
movies.describe(include='O')

## Filtering data

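No filtering is applied. The helper below collects every distinct value of a list-valued column as a dictionary key.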

In [None]:
# build a dictionary with one empty list per distinct value of a list-valued column
def createDic(name):
    alist = {}
    for names in movies[name]:
        for n in names:
            if n not in alist:
                alist[n] = []
    return alist


## Visualization

Each plot is followed by a short explanation.

In [None]:
# correlation between the numeric columns (text and list columns are excluded)
g = sns.heatmap(movies.select_dtypes('number').corr(), annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.01)

The heatmap shows that vote_average is only weakly correlated with revenue.
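
The heatmap is built from pairwise Pearson correlations, which is the default of `DataFrame.corr`. The sketch below runs on made-up numbers and assumes that non-numeric columns should simply be dropped with `select_dtypes` before the correlation is computed:

```python
import pandas as pd

# Made-up data: four movies with a text column and two numeric columns
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "revenue": [10.0, 20.0, 30.0, 40.0],
    "vote_average": [5.0, 7.0, 6.0, 8.0],
})

# Keep numeric columns only, then compute the Pearson correlation matrix
corr = toy.select_dtypes("number").corr()
r = corr.loc["revenue", "vote_average"]
print(round(r, 2))
```

A value close to 0 means the two columns have little linear relationship, while values near 1 or -1 indicate a strong one.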

#### Effect of areas on revenue

In [None]:
# visualize the top 10 areas by number of movies
list1 = []
for i in movies['areas']:
    list1.extend(i)
top_areas = pd.Series(list1).value_counts()[:10].sort_values(ascending=True)
ax = top_areas.plot.barh(width=0.9, color=sns.color_palette('summer_r', 10))
for i, v in enumerate(top_areas.values):
    ax.text(.8, i, v, fontsize=12, color='white', weight='bold')
ax.patches[9].set_facecolor('r')  # highlight the top area
plt.title('Top 10 Areas')
plt.show()

The plot shows that the US appears in more movies than any other area.
Next, we test whether area has an effect on revenue.
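
The bar chart above counts how many movies list each area. As a small sketch with invented data, the same counting can be done with `collections.Counter`:

```python
from collections import Counter

# Made-up list of areas per movie (illustration only)
areas_per_movie = [["US", "UK"], ["US"], ["France", "US"], ["UK"]]

# Flatten the nested lists and count how often each area appears
counts = Counter(area for areas in areas_per_movie for area in areas)
top = counts.most_common(2)
print(top)
```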

## Analysis

##### ANALYSIS AREAS AND REVENUE

In [None]:
#list areas
areaDic = createDic('areas')
for i in range(movies.shape[0]):
    for j in range(len(movies.areas[i])):
        areaDic[movies['areas'][i][j]].append(movies['revenue'][i])
#areaDic
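
The grouping above can also be written with `DataFrame.explode`, which produces one row per (movie, area) pair. This is a sketch on a made-up frame, not a replacement for the cell above:

```python
import pandas as pd

# Made-up data: each movie lists one or more areas
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "areas": [["US", "UK"], ["US"], ["France"]],
    "revenue": [100, 50, 30],
})

# One row per (movie, area) pair, then collect the revenues of each area
revenue_by_area = (
    toy.explode("areas")
       .groupby("areas")["revenue"]
       .apply(list)
       .to_dict()
)
print(revenue_by_area)
```

A movie produced in several areas contributes its revenue to every one of them, exactly as in the loop-based version.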

In [None]:
# hypothesis test: revenue vs. areas
f_oneway(*(areaDic[t] for t in list(areaDic)))

###### With p-value < 0.05, we reject H0 and conclude that area has an effect on revenue, with the US standing out the most
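
In a one-way ANOVA, H0 states that all group means are equal. The F statistic compares the variance between group means with the variance inside the groups. The sketch below computes it by hand on made-up groups; `one_way_f` is a hypothetical helper written for illustration, not part of SciPy:

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group variance."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up groups: the third group has a clearly higher mean
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
print(f_stat)
```

A large F value means the group means differ more than the spread inside the groups would explain, which leads to a small p-value.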

##### ANALYSIS GENRES AND REVENUE

In [None]:
# create a dictionary mapping each genre to the revenues of its movies
GeRe=createDic('genres')
for i in range(movies.shape[0]):
    for j in range(len(movies.genres[i])):
        GeRe[movies['genres'][i][j]].append(movies['revenue'][i])
#GeRe

In [None]:
n_groups = len(GeRe)
sumRevenue =[]
averageRe =[]
for index,values in enumerate(GeRe):
    sumRevenue.append(sum(GeRe[values]))
    averageRe.append(int(sum(GeRe[values])/len(GeRe[values])))

# create plot
fig, ax = plt.subplots(figsize=(20, 10))
index = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8

rects1 = plt.bar(index, sumRevenue, bar_width,
alpha=opacity,
color='b',
label='sum of Revenue')

rects2 = plt.bar(index + bar_width, averageRe, bar_width,
alpha=opacity,
color='g',
label='average of Revenue')

plt.xlabel('Genres')
plt.ylabel('Revenue')
plt.title('Revenue by Genres')
plt.xticks(index + bar_width / 2, list(GeRe))  # center each tick between the two bars
plt.legend()

plt.tight_layout()
plt.show()

The plot shows differences in revenue between genres, so we test whether they are significant.

In [None]:
# hypothesis test: revenue vs. genres
f_oneway(*(GeRe[t] for t in list(GeRe)))

From the result above, we reject H0 and conclude that genre has an effect on revenue; Adventure and Action have the highest revenue.
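
The ANOVA only says that at least one group mean differs. To see which genres rank highest, the group means can be sorted. A minimal sketch with invented revenues:

```python
# Made-up revenues per genre (illustration only)
revenue_by_genre = {
    "Action": [300, 100],
    "Drama": [50, 70, 30],
    "Adventure": [400],
}

# Mean revenue per genre, then genres ordered from highest to lowest mean
mean_revenue = {g: sum(v) / len(v) for g, v in revenue_by_genre.items()}
ranking = sorted(mean_revenue, key=mean_revenue.get, reverse=True)
print(ranking)
```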

##### ANALYSIS GENRES AND SCORE

In [None]:
# create a dictionary mapping each genre to the vote_average of its movies
GeSc=createDic('genres')
for i in range(movies.shape[0]):
    for j in range(len(movies.genres[i])):
        GeSc[movies['genres'][i][j]].append(movies['vote_average'][i])
#GeSc

In [None]:
averageSc =[]
for index,values in enumerate(GeSc):
    averageSc.append(sum(GeSc[values])/len(GeSc[values]))

fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(list(GeSc),averageSc)
plt.xlabel('Genres')
plt.ylabel('Average Score')
plt.title('Average Score by Genres')
plt.show()

In [None]:
# hypothesis test: average score vs. genres
f_oneway(*(GeSc[t] for t in list(GeSc)))

From the result above, we reject H0 and conclude that genre has an effect on the average score; History and War have the highest average score.

##### ANALYSIS RELEASE_DATE AND REVENUE

In [None]:
# create a dictionary mapping each release month to the revenues of its movies
DateRe = {}
for i in range(movies.shape[0]):
    if pd.isnull(movies.release_date[i]):
        continue  # skip movies without a release date
    mo = movies.release_date[i].month
    if mo not in DateRe:
        DateRe[mo] = []
    DateRe[mo].append(movies['revenue'][i])
DateRe

In [None]:
# visualize the average revenue per release month
averageDS = []
for index, values in enumerate(DateRe):
    averageDS.append(sum(DateRe[values]) / len(DateRe[values]))
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(list(DateRe), averageDS)
plt.xlabel('Release Month')
plt.ylabel('Average Revenue')
plt.title('Average Revenue by Release Month')
plt.show()

In [None]:
# hypothesis test: revenue vs. release month
f_oneway(*(DateRe[t] for t in list(DateRe)))

From the result above, we reject H0 and conclude that the release date (by month) has an effect on revenue.
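
The month grouping used above can also be expressed with the pandas `.dt` accessor. This is a sketch on made-up rows; it assumes that movies without a release date should be dropped before grouping:

```python
import pandas as pd

# Made-up data: the last row has no release date
toy = pd.DataFrame({
    "release_date": ["2009-12-10", "2012-07-16", "2015-12-15", None],
    "revenue": [300, 150, 100, 50],
})
toy["release_date"] = pd.to_datetime(toy["release_date"])

# Drop rows without a date, then average revenue per calendar month
dated = toy.dropna(subset=["release_date"])
mean_by_month = dated.groupby(dated["release_date"].dt.month)["revenue"].mean()
print(mean_by_month)
```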

## Conclusion

Overall, we can be confident that Areas and Genres have an effect on Revenue. For Score_Average and Release_Date, the evidence is not strong enough to say that they influence Revenue.

# Open: explore other features