<h2><b>Netflix - TV Shows and Movies</b></h2>

This Jupyter Notebook was created as part of an initial Data Science study cycle.
Its main objective is to analyze the 'Netflix - TV Shows and Movies' dataset
using Python, pandas, and Plotly.

In [68]:
#Import libraries >>>
import pandas as pd
import plotly.express as px

In [69]:
#Importing the dataset >>>
df_titles = pd.read_csv('archive/titles.csv')
df_credits = pd.read_csv('archive/credits.csv')

In [70]:
# Show the titles dataset with display
display(df_titles.head(3))

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3


In [71]:
# Show the credits dataset with display
display(df_credits.head(5))

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR


<b>Relationship Between Genre and Score (TMDB and IMDB)</b>

<p>Objective: Discover if there is a correlation between genre and the perceived quality of the content.</p>

In [72]:
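# A title can have several genres, stored as a string such as "['drama', 'crime']".
# To score each genre separately, turn that string into one row per genre.

# Remove the brackets and quotes, then split on commas to get a list of genres
df_titles['genres'] = df_titles['genres'].str.strip('[]').str.replace("'", "").str.split(',')

# Explode the lists so that each row holds a single genre
df_titles = df_titles.explode('genres')

# Trim the space left after each comma
df_titles['genres'] = df_titles['genres'].str.strip()

# Titles listed with no genre ('[]') end up as an empty string; drop those rows
df_titles = df_titles[df_titles['genres'] != '']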
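
The string handling above relies on the genre names never containing quotes, brackets, or commas. A minimal alternative sketch, assuming every `genres` value is a valid Python list literal, parses the strings with `ast.literal_eval` before exploding. The `sample` frame and its values are hypothetical stand-ins for `df_titles`, not data from the dataset.

```python
import ast

import pandas as pd

# Hypothetical miniature of the titles table; values are illustrative only.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'crime']", "['documentation']", "[]"],
    "tmdb_score": [8.0, 7.0, 6.0],
})

# Parse each string into a real Python list instead of stripping characters.
sample["genres"] = sample["genres"].apply(ast.literal_eval)

# One row per (title, genre). explode turns an empty list into NaN,
# so dropna removes titles that have no genre at all.
exploded = (
    sample.explode("genres")
    .dropna(subset=["genres"])
    .reset_index(drop=True)
)

print(exploded)
```

With this approach no extra whitespace trimming is needed, because the parsed list items are already clean strings.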

In [73]:
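# Mean score per genre for each rating source, sorted in ascending order
mean_scores_t = df_titles.groupby('genres')['tmdb_score'].mean().sort_values().round(2)
mean_scores_i = df_titles.groupby('genres')['imdb_score'].mean().sort_values().round(2)

# X and Y values for the bar charts
x = mean_scores_t.index
y = mean_scores_t.values
# The two series are sorted independently, so reorder the IMDB means to follow x;
# otherwise the values would not line up with the genre labels
z = mean_scores_i.reindex(x).values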
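
Because the two series above are sorted independently, their positions do not refer to the same genres. One way to avoid this, sketched below under the assumption that both scores live in the same exploded table, is to aggregate both columns in a single `groupby` so they share one genre index. The `rows` frame and its scores are made up for illustration.

```python
import pandas as pd

# Hypothetical exploded rows; the scores are made up for illustration.
rows = pd.DataFrame({
    "genres": ["drama", "drama", "comedy", "comedy", "action"],
    "tmdb_score": [7.0, 8.0, 6.0, 6.5, 9.0],
    "imdb_score": [6.0, 7.0, 8.0, 7.0, 6.5],
})

# Aggregating both columns at once keeps them on the same genre index.
means = rows.groupby("genres")[["tmdb_score", "imdb_score"]].mean().round(2)

# Sorting the whole frame by one column reorders both columns together.
means = means.sort_values("tmdb_score")

# Number of rows behind each mean, useful for judging how reliable it is.
counts = rows.groupby("genres").size()

print(means)
print(counts)
```

The row counts are worth checking alongside the means, since a genre with very few titles can produce an extreme average.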

In [85]:
# Creating separate bar charts for TMDB and IMDB
fig = px.bar(x=mean_scores_t.index, y=mean_scores_t.values, text_auto=True, title='Relationship Between Genre and Score (TMDB)')
fig.update_traces(marker_color='#363636')
fig.update_layout(xaxis_title='Genres', yaxis_title='Mean Score (TMDB)')
fig.show()  # only the last expression of a cell is displayed automatically

# Take the labels from the IMDB series itself so that each bar matches its genre
fig_2 = px.bar(x=mean_scores_i.index, y=mean_scores_i.values, text_auto=True, title='Relationship Between Genre and Score (IMDB)')
fig_2.update_traces(marker_color='#DC143C')
fig_2.update_layout(xaxis_title='Genres', yaxis_title='Mean Score (IMDB)')


In [84]:
# Creating the bar chart for TMDB and IMDB together

mean_scores_t = df_titles.groupby('genres')['tmdb_score'].mean().sort_values().round(2)
mean_scores_i = df_titles.groupby('genres')['imdb_score'].mean().sort_values().round(2)

df_tmdb = mean_scores_t.reset_index()
df_tmdb['source'] = 'TMDB'
df_tmdb = df_tmdb.rename(columns={'tmdb_score': 'score'})

df_imdb = mean_scores_i.reset_index()
df_imdb['source'] = 'IMDB'
df_imdb = df_imdb.rename(columns={'imdb_score': 'score'})

combined_scores = pd.concat([df_tmdb, df_imdb])
#      genres  score source
# 0     drama   6.95   TMDB
# 1    comedy   7.63   TMDB
# 2    action   8.35   TMDB
# 0     drama   6.80   IMDB
# 1    comedy   7.13   IMDB
# 2    action   8.15   IMDB

fig = px.bar(
    combined_scores,
    x='genres',
    y='score',
    color='source', 
    barmode='group', 
    text_auto=True,
    title='Relationship Between Genre and Score (TMDB and IMDB)',
    color_discrete_map={'TMDB': '#363636', 'IMDB': '#DC143C'}
)

fig.update_layout(xaxis_title='Genres', yaxis_title='Mean Score')
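
The long-format table built above with `reset_index`, `rename`, and `pd.concat` could also be produced in one step with `DataFrame.melt`. The sketch below assumes the per-genre means are already in one wide table; the `means` frame and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical per-genre means in wide format; the numbers are illustrative only.
means = pd.DataFrame({
    "genres": ["drama", "comedy"],
    "tmdb_score": [6.9, 6.5],
    "imdb_score": [6.7, 6.3],
})

# Unpivot the two score columns into (genres, source, score) rows.
long_scores = means.melt(
    id_vars="genres",
    value_vars=["tmdb_score", "imdb_score"],
    var_name="source",
    value_name="score",
)

# Map the original column names to the labels used in the chart legend.
long_scores["source"] = long_scores["source"].map(
    {"tmdb_score": "TMDB", "imdb_score": "IMDB"}
)

print(long_scores)
```

The resulting frame has the same `genres`, `score`, and `source` columns as `combined_scores`, so it could be passed to `px.bar` with `color='source'` and `barmode='group'` in the same way.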
