
# ***THINGS TO KEEP IN MIND WHILE WALKING THROUGH THE REPORT***

1. Follow the code only if you are familiar with Python.
2. A simple explanation accompanies every visual or graph.
3. The report ends with a brief conclusion.
4. Each explanation appears below the output of the code it describes.

***Enjoy***

# THIS DATASET CONTAINS:
* TV shows and movies listed on Netflix.
* The titles are those available on Netflix as of 2019. The data was collected from Flixable, a third-party Netflix search engine.
* In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.
* Integrating this dataset with external datasets such as IMDb ratings or Rotten Tomatoes could also provide many interesting findings.

# !!!PREDICTING HIT FUTURE MOVIES!!!

# THIS NOTEBOOK ANSWERS THE BELOW QUESTION:

# WHICH FUTURE MOVIE SHOULD NETFLIX CREATE?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
!pip install pywaffle --quiet
from pywaffle import Waffle
from wordcloud import WordCloud

In [None]:
df= pd.read_csv("../input/netflix-shows/netflix_titles.csv")

In [None]:
df.describe()

In [None]:
## add new features in the dataset
df["date_added"] = pd.to_datetime(df['date_added'])
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month

df['season_count'] = df.apply(lambda x : x['duration'].split(" ")[0] if "Season" in x['duration'] else "", axis = 1)
df['duration'] = df.apply(lambda x : x['duration'].split(" ")[0] if "Season" not in x['duration'] else "", axis = 1)
df.head()
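
The feature cell above splits `duration` into minutes and season counts with row-wise lambdas, which would fail on a missing `duration`. The following is a minimal sketch of a vectorised, missing-value-safe variant. The toy rows are made up for illustration, and the column name `duration_min` is hypothetical (the notebook itself overwrites `duration`).

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons", None, "1 Season"],
})

# Pull the leading integer out of the duration text; missing values stay NaN.
number = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)
is_season = toy["duration"].str.contains("Season", na=False)

# Minutes for movies, season counts for TV shows, NaN everywhere else.
toy["duration_min"] = number.where(~is_season)
toy["season_count"] = number.where(is_season)
print(toy)
```

Keeping both columns numeric also removes the need to convert strings to integers later on.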

In [None]:
# share of movies vs. TV shows (plotly draws its own figure, so no plt.figure call is needed)
fig = px.pie(df, names='type', template='seaborn')
fig.update_traces(rotation=90, pull=[0.2,0.1], textinfo="percent+label")
fig.show()
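
The percentages drawn on the pie chart can also be read as plain numbers. A minimal sketch, using made-up toy rows rather than the real dataset:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie"]})

# Percentage share of each type, rounded to one decimal place.
share = toy["type"].value_counts(normalize=True).mul(100).round(1)
print(share)
```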

In [None]:
df['rating'] = df["rating"].fillna('TV-MA')

In [None]:
# share of each rating (plotly draws its own figure, so no plt.figure call is needed)
fig = px.pie(df, names='rating', template='seaborn')
fig.update_traces(rotation=90, pull=[0.2,0.1,0.2,0.1,0.1,0.1,0.1], textinfo="percent+label")
fig.show()

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x='rating',hue='type',data=df)
plt.title('comparing frequency between type and rating')
plt.show()

# MOVIES ==> TV-MA, TV-14
# TV SHOWS ==> TV-MA, TV-14

In [None]:
from collections import Counter
country_data = df['country']
country_data = country_data.astype(str)
country_counting = pd.Series(dict(Counter(','.join(country_data).replace(' ,',',').replace(', ',',').split(',')))).sort_values(ascending=False)
country_counting.drop(['nan'], axis=0, inplace=True)
tot = sum(country_counting)
top20 = sum(country_counting[:20]) # 20% of the countries would be about 22, but 20 is used to keep things simple

print(f'total : {tot}')
print(f'top 20 countries : {top20}')
top20_country = country_counting[:20]
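
The same per-country count can be sketched with pandas alone by splitting the comma-separated `country` field and exploding it into one row per country. The toy rows below are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the real column mixes ", " and "," separators.
toy = pd.DataFrame({
    "country": ["United States", "India, United States", None, "United Kingdom,India"],
})

country_counts = (
    toy["country"]
    .dropna()            # skip titles with no country listed
    .str.split(",")      # one list of names per title
    .explode()           # one row per (title, country) pair
    .str.strip()         # remove the space left after a comma
    .value_counts()
)
print(country_counts)
```

Because missing values are dropped before splitting, there is no `'nan'` entry to remove afterwards.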

In [None]:
top_countries = ['United States', 'India', 'United Kingdom', 'Japan', 'Canada', 'Spain']
# keep only titles whose country field is exactly one of the six most productive countries
top_productive_countries = df[df['country'].isin(top_countries)]
plt.figure(figsize=(10,8))
sns.countplot(x='country',hue='type',data=top_productive_countries)
plt.title('comparing between the types that the top countries produce')
plt.show()

# UNITED KINGDOM ==> TV-SHOWS,MOVIES
# UNITED STATES ==> MOVIES
# SPAIN ==> MOVIE
# INDIA ==> MOVIE
# CANADA ==> MOVIE
# JAPAN ==> TV-SHOWS

In [None]:
countries_0_7=['United States',"United Kingdom",'Spain','Japan','India','Canada']
def country_mov_dur(country):
    # durations of movies whose country field is exactly the given country
    netflix_country_dur=df.loc[(df.country==str(country))&(df.type=='Movie')].duration
    # for movies, 'duration' already holds only the number of minutes, so strip is just a safeguard
    netflix_country_dur=netflix_country_dur.apply(lambda x : int(x.strip(' minSeaso')))
    return netflix_country_dur


* TV-MA: This program is specifically designed to be viewed by adults and therefore may be unsuitable for children under 17.
* TV-14: This program contains some material that many parents would find unsuitable for children under 14 years of age.
* TV-PG: This program contains material that parents may find unsuitable for younger children.
* R: Under 17 requires accompanying parent or adult guardian. Parents are urged to learn more about the film before taking their young children with them.
* PG-13: Some material may be inappropriate for children under 13 (pre-teenagers). Parents are urged to be cautious.
* NR or UR: The film has not been submitted for a rating, or is an uncut version of a film that was submitted.
* PG: Some material may not be suitable for children. May contain some material parents might not like for their young children.
* TV-Y7: This program is designed for children age 7 and above.
* TV-G: This program is suitable for all ages.
* TV-Y: This program is designed to be appropriate for children of all ages. Its thematic elements are specifically designed for a very young audience, including children ages 2-6.
* TV-Y7-FV: Recommended for ages 7 and older, with the unique advisory that the program contains fantasy violence.
* G: All ages admitted. Nothing that would offend parents for viewing by children.
* NC-17: No one 17 and under admitted. Clearly adult. Children are not admitted.
In [None]:
f, axes = plt.subplots(6,1,figsize=(18,18))
for i in range(6):
    country_mov_duration=country_mov_dur(countries_0_7[i])
    # the keyword must be lowercase 'label'; 'Label' is not a valid argument
    sns.kdeplot(country_mov_duration,label='Movie Duration in '+countries_0_7[i],color='r',ax=axes[i])
    axes[i].legend()

# UNITED KINGDOM ==> 95 MINS
# UNITED STATES ==> 95 MINS
# SPAIN ==> 100 MINS
# INDIA ==> 120 MINS
# CANADA ==> 90 MINS
# JAPAN ==> 95 MINS
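
The durations above are read off the peaks of the density plots. As a numeric cross-check, a per-country median can be computed directly. This is only a sketch on made-up toy rows, and the median is a different summary from the peak of a density curve, so the two need not agree exactly. The column name `minutes` is hypothetical.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "Movie"],
    "country": ["India", "India", "Canada", "India", "Canada"],
    "duration": ["120", "130", "90", "", "94"],
})

movies = toy[toy["type"] == "Movie"].copy()
# Empty or non-numeric durations become NaN instead of raising an error.
movies["minutes"] = pd.to_numeric(movies["duration"], errors="coerce")

median_minutes = movies.groupby("country")["minutes"].median()
print(median_minutes)
```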

In [None]:
# keep only the first genre listed for each title
df['genre'] = df['listed_in'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')[0])
plt.figure(figsize=(20,10))
sns.countplot(y='genre',hue='type',data=df)
plt.title('comparing frequency between type and genre')
plt.show()

#  BEST RATING COUNTRY-WISE

In [None]:
for i in top_productive_countries['country'].unique():
    print(i)
    print(top_productive_countries[top_productive_countries['country']==i]['rating'].value_counts(normalize=True)*100)
    print('-'*10)

# UNITED KINGDOM ==> TV-MA
# UNITED STATES ==> TV-MA
# SPAIN ==> TV-MA
# INDIA ==> TV-14
# CANADA ==> TV-MA
# JAPAN ==> TV-14
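
Instead of reading the most common rating off each printed table, it can be extracted per country in one step. A minimal sketch on made-up toy rows:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "country": ["India", "India", "India", "Spain", "Spain"],
    "rating": ["TV-14", "TV-14", "TV-MA", "TV-MA", "TV-MA"],
})

# Percentage of each rating within each country.
rating_share = toy.groupby("country")["rating"].value_counts(normalize=True) * 100

# Most frequent rating per country.
top_rating = toy.groupby("country")["rating"].agg(lambda s: s.value_counts().idxmax())
print(top_rating)
```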

# BEST DIRECTOR COUNTRY-WISE

In [None]:
def country_top_dir(country):
    # movies whose country field is exactly the given country
    country_movies=df.loc[(df.country==str(country)) & (df.type=="Movie")]

    # split the comma-separated director field into individual names
    categories = ", ".join(country_movies["director"].fillna("")).split(", ")
    # drop the empty names left by missing directors instead of assuming they rank first
    directors=pd.Series([name for name in categories if name != ""])
    directors=directors.value_counts()[:10]

    trace=go.Bar(x=directors.values[::-1],y=directors.index[::-1],orientation='h',marker=dict(color='#a678de'))
    return trace
traces = []
titles = ["United States", "","India","", "United Kingdom", "Canada","", "Spain","", "Japan"]
for title in titles:
    if title != "":
        traces.append(country_top_dir(title))

fig = make_subplots(rows=2, cols=5, subplot_titles=titles)
fig.add_trace(traces[0], 1,1)
fig.add_trace(traces[1], 1,3)
fig.add_trace(traces[2], 1,5)
fig.add_trace(traces[3], 2,1)
fig.add_trace(traces[4], 2,3)
fig.add_trace(traces[5], 2,5)

fig.update_layout(height=1200, showlegend=False)
fig.show()

# UNITED KINGDOM ==> EDWARD COTTERILL
# UNITED STATES ==> JAY KARAS
# SPAIN ==> JAVIER RUIZ CALDERA
# INDIA ==> DAVID DHAWAN
# CANADA ==> JUSTIN G. DYCK
# JAPAN ==> MASAHIKO MURATA

In [None]:
from sklearn.preprocessing import LabelEncoder

# split each title's country field into a list of country names (missing values become ["nan"])
z1=[str(c).split(", ") for c in df["country"]]
d=z1

# fit the encoder on every distinct country name so that le.transform never meets an unseen label
le=LabelEncoder()
j=le.fit_transform(sorted({name for names in z1 for name in names}))
list_type=le.classes_.tolist()
# display(len(d),len(j))

# one row per title, one column per country; sized from the data instead of hard-coded
x=np.zeros((len(d),len(j)))


In [None]:
q=[]
for i in range(len(d)):
    u=d[i]
    u=le.transform(u)
    q.append(u)

In [None]:
# NOTE: only titles listing one or two countries are encoded here;
# titles with three or more countries keep an all-zero row
for i in range(len(q)):
    oi=q[i]
    if len(oi)<=2:
        for oi2 in oi:
            x[i][oi2]=1
x=x.astype(int)
cou=pd.DataFrame(x,columns=list_type)

In [None]:
cou

In [None]:
df_new=cou[["United States","India", "United Kingdom", "Canada", "Spain", "Japan"]]
df = df.join(df_new)
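
The country indicator columns built above can also be sketched with pandas' `Series.str.get_dummies`, which creates one 0/1 column per distinct value in a delimited string. Note that, unlike the loop above, this marks every listed country, including titles with three or more countries, so the resulting counts may differ from the ones used in this notebook. The toy rows are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States", "India, United States, Canada", None],
})

# One 0/1 column per country; a missing country gives an all-zero row.
flags = toy["country"].fillna("").str.get_dummies(sep=", ")

# Keep only the countries of interest, adding a zero column if one never occurs.
wanted = ["United States", "India", "Canada"]
toy = toy.join(flags.reindex(columns=wanted, fill_value=0))
print(toy)
```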

In [None]:
df

# BEST GENRE COUNTRY-WISE

In [None]:
cat=df[["genre", "United States"]].groupby(['genre'], as_index=False).sum().sort_values(by='United States', ascending=False)[:5]
plt.figure(figsize=(20,10))
sns.barplot(x='United States', y='genre', data=cat, orient = 'h')
plt.title('TOP 5 GENRES LOVED IN UNITED STATES')

In [None]:
cat=df[["genre", "United Kingdom"]].groupby(['genre'], as_index=False).sum().sort_values(by='United Kingdom', ascending=False)[:5]
plt.figure(figsize=(20,10))
sns.barplot(x='United Kingdom', y='genre', data=cat, orient = 'h')
plt.title('TOP 5 GENRES LOVED IN UNITED KINGDOM')

In [None]:
cat=df[["genre", "Japan"]].groupby(['genre'], as_index=False).sum().sort_values(by='Japan', ascending=False)[:3]
plt.figure(figsize=(20,10))
sns.barplot(x='Japan', y='genre', data=cat, orient = 'h')
plt.title('TOP 3 GENRES LOVED IN JAPAN')

In [None]:
cat=df[["genre", "Spain"]].groupby(['genre'], as_index=False).sum().sort_values(by='Spain', ascending=False)[:5]
plt.figure(figsize=(20,10))
sns.barplot(x='Spain', y='genre', data=cat, orient = 'h')
plt.title('TOP 5 GENRES LOVED IN SPAIN')

In [None]:
cat=df[["genre", "India"]].groupby(['genre'], as_index=False).sum().sort_values(by='India', ascending=False)[:5]
plt.figure(figsize=(20,10))
sns.barplot(x='India', y='genre', data=cat, orient = 'h')
plt.title('TOP 5 GENRES LOVED IN INDIA')

In [None]:
cat=df[["genre", "Canada"]].groupby(['genre'], as_index=False).sum().sort_values(by='Canada', ascending=False)[:5]
plt.figure(figsize=(20,10))
sns.barplot(x='Canada', y='genre', data=cat, orient = 'h')
plt.title('TOP 5 GENRES LOVED IN CANADA')

# UNITED KINGDOM ==> DOCUMENTARIES, DRAMAS
# UNITED STATES ==> DOCUMENTARIES, COMEDIES, DRAMAS
# SPAIN ==> COMEDIES, DRAMAS
# INDIA ==> DRAMAS, COMEDIES
# CANADA ==> CHILDREN & FAMILY MOVIES, DRAMAS, COMEDIES
# JAPAN ==> ACTION & ADVENTURE, ANIME SERIES
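
The six near-identical cells above could be folded into one helper. This is a sketch that assumes a frame shaped like `df` after the join, with a `genre` column and one 0/1 column per country. The helper name `top_genres` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "genre": ["Dramas", "Dramas", "Comedies", "Anime Series", "Dramas"],
    "India": [1, 1, 1, 0, 0],
    "Japan": [0, 0, 0, 1, 1],
})

def top_genres(frame, country, n=5):
    """Return the n genres with the most titles flagged for the given country."""
    return frame.groupby("genre")[country].sum().sort_values(ascending=False).head(n)

top_by_country = {c: top_genres(toy, c, n=2) for c in ["India", "Japan"]}
for country, genres in top_by_country.items():
    print(country, genres.to_dict())
```

Each returned series can be passed to `sns.barplot` in a loop to reproduce the individual charts.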

In [None]:
from sklearn.preprocessing import LabelEncoder

# split each title's cast field into a list of names (missing values become ["nan"])
z1=[str(c).split(", ") for c in df["cast"]]
d=z1

# fit the encoder on every distinct name, however long the cast list is
le=LabelEncoder()
j=le.fit_transform(sorted({name for names in z1 for name in names}))
list_type=le.classes_.tolist()
# display(len(d),len(j))

# one row per title, one column per cast member; sized from the data instead of hard-coded
# (this dense matrix is large: roughly 6,000 titles by 27,000 names)
x=np.zeros((len(d),len(j)))

In [None]:
display(len(d),len(j))

In [None]:
try:
    
    q=[]
    for i in range(len(d)):
        u=d[i]
        u=le.transform(u)
        q.append(u)
except Exception as e:
    print(e)

In [None]:
# NOTE: only titles listing one or two cast members are encoded here;
# titles with longer cast lists keep an all-zero row
for i in range(len(q)):
    oi=q[i]
    if len(oi)<=2:
        for oi2 in oi:
            x[i][oi2]=1
x=x.astype(int)
yoi=pd.DataFrame(x,columns=list_type)

In [None]:
for i in yoi.columns:
    a=yoi[i].sum()
    if a > 5:
        print(i)
        print(a)
    

# ACTORS/ACTRESSES WITH MORE THAN 5 TITLES (AMONG TITLES LISTING ONE OR TWO CAST MEMBERS)
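
Appearance counts can also be sketched without building a dense title-by-name matrix, by exploding the `cast` field into one row per name. This variant counts every cast member of every title, whereas the loop above only marks titles with one or two cast members, so on the real data it would likely produce a different list than the one reported here. The names and the threshold `min_titles` below are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the names are invented.
toy = pd.DataFrame({
    "cast": ["Ann Lee, Bo Chan", "Ann Lee", None, "Ann Lee, Bo Chan, Cy Diaz"],
})

# One row per (title, cast member) pair, then count titles per name.
appearances = toy["cast"].dropna().str.split(", ").explode().value_counts()

min_titles = 2  # the notebook uses 5; lowered here to suit the toy data
frequent = appearances[appearances > min_titles]
print(frequent)
```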

# ASSUMPTIONS ==>
**1. ASSUMING THAT EVERY SINGLE TITLE IS A HIT, I.E. HAS AN IMDB RATING OF 10.**

# PREDICTIONS ==>

# UNITED KINGDOM
# **The populace of the United Kingdom may prefer movies with a duration of about 95 minutes rated TV-MA. Moreover, a movie directed by Edward Cotterill in the genre of Documentaries or Dramas or both, with an exceptional cast such as Craig Sechler, David Attenborough, Jeff Dunham, Samuel West or Stephen Fry, will be a hoot and a half.**

# UNITED STATES
# **The populace of the United States may prefer movies with a duration of about 95 minutes rated TV-MA. Moreover, a movie directed by Jay Karas in the genre of Documentaries, Comedies or Dramas, with an exceptional cast such as Craig Sechler, David Attenborough, Jeff Dunham, Samuel West or Stephen Fry, will be a hoot and a half.**

# SPAIN
# **The populace of Spain may prefer movies with a duration of about 100 minutes rated TV-MA. Moreover, a movie directed by Javier Ruiz Caldera in the genre of Comedies or Dramas or both, with an exceptional cast such as Craig Sechler, David Attenborough, Jeff Dunham, Samuel West or Stephen Fry, will be a hoot and a half.**

# INDIA
# **The populace of India may prefer movies with a duration of about 120 minutes rated TV-14. Moreover, a movie directed by David Dhawan in the genre of Comedies or Dramas or both, with an exceptional cast such as Craig Sechler, David Attenborough, Jeff Dunham, Samuel West or Stephen Fry, will be a hoot and a half.**

# CANADA
# **The populace of Canada may prefer movies with a duration of about 90 minutes rated TV-MA. Moreover, a movie directed by Justin G. Dyck in the genre of Children & Family Movies, Dramas or Comedies, with an exceptional cast such as Craig Sechler, David Attenborough, Jeff Dunham, Samuel West or Stephen Fry, will be a hoot and a half.**

# JAPAN 
# **The populace of Japan may prefer movies with a duration of about 95 minutes rated TV-14. Moreover, a movie directed by Masahiko Murata in the genre of Action & Adventure or Anime or both, with an exceptional cast such as Craig Sechler, David Attenborough, Jeff Dunham, Samuel West or Stephen Fry, will be a hoot and a half.**
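
The predictions combine the per-country observations recorded earlier in the notebook (duration, rating, director, genres). As a sketch, those observations can be held in one dictionary and turned into a one-line summary per country. The values are copied from the observation lines of this notebook, and the names `findings` and `describe` are hypothetical.

```python
# Per-country observations as written in the notebook's observation lines.
findings = {
    "United Kingdom": {"minutes": 95, "rating": "TV-MA", "director": "Edward Cotterill",
                       "genres": ["Documentaries", "Dramas"]},
    "United States": {"minutes": 95, "rating": "TV-MA", "director": "Jay Karas",
                      "genres": ["Documentaries", "Comedies", "Dramas"]},
    "Spain": {"minutes": 100, "rating": "TV-MA", "director": "Javier Ruiz Caldera",
              "genres": ["Comedies", "Dramas"]},
    "India": {"minutes": 120, "rating": "TV-14", "director": "David Dhawan",
              "genres": ["Dramas", "Comedies"]},
    "Canada": {"minutes": 90, "rating": "TV-MA", "director": "Justin G. Dyck",
               "genres": ["Children & Family Movies", "Dramas", "Comedies"]},
    "Japan": {"minutes": 95, "rating": "TV-14", "director": "Masahiko Murata",
              "genres": ["Action & Adventure", "Anime Series"]},
}

def describe(country):
    """Build a one-line summary of the observations for one country."""
    f = findings[country]
    genres = " or ".join(f["genres"])
    return (f"{country}: about {f['minutes']} minutes, rated {f['rating']}, "
            f"directed by {f['director']}, genre {genres}")

for country in findings:
    print(describe(country))
```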

# THE END

In [None]:
plt.subplots(figsize=(20,35))
wordcloud = WordCloud(
                          background_color='Black',
                          width=1920,
                          height=1080
                         ).generate(" ".join(df.genre))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('cast.png')
plt.show()