<h1><center>Netflix: Movies and TV Shows</center></h1>

<center><img src="https://variety.com/wp-content/uploads/2020/05/netflix-logo.png"></center>

# **Introduction**
This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It is worth exploring what other insights the same dataset can provide.

Integrating this dataset with external datasets, such as IMDb ratings or Rotten Tomatoes scores, could also yield interesting findings.
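As a rough illustration of such an integration, the sketch below left-joins a ratings table onto a titles table with `pandas.merge`. Both frames are tiny made-up examples. The `imdb_rating` column and the normalised `title_key` join key are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the Netflix titles and an external ratings table
netflix = pd.DataFrame({
    "title": ["Example Movie", "Sample Show", "Unrated Title"],
    "release_year": [2018, 2019, 2017],
})
ratings = pd.DataFrame({
    "title": ["example movie ", "Sample Show"],
    "release_year": [2018, 2019],
    "imdb_rating": [7.1, 8.3],  # made-up values
})

def add_title_key(df):
    """Add a lowercase, whitespace-stripped copy of the title for joining."""
    return df.assign(title_key=df["title"].str.strip().str.lower())

merged = add_title_key(netflix).merge(
    add_title_key(ratings)[["title_key", "release_year", "imdb_rating"]],
    on=["title_key", "release_year"],  # the year helps tell remakes apart
    how="left",                        # keep every Netflix title, rated or not
)
print(merged[["title", "release_year", "imdb_rating"]])
```

Titles without a match keep a missing rating, so the share of unmatched rows is a quick check on how well the join key works.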

## Inspiration
Some interesting questions (tasks) that can be explored with this dataset:

1. Understanding what content is available in different countries
2. Identifying similar content by matching text-based features
3. Network analysis of actors and directors to find interesting insights
4. Has Netflix increasingly focused on TV shows rather than movies in recent years?
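The last question can be approached by comparing the share of each content type per year added. The following is a minimal sketch on a made-up sample that reuses the dataset's `type` and `date_added` column names. It assumes dates follow the "Month D, YYYY" pattern.

```python
import pandas as pd

# Tiny made-up sample using the dataset's `type` and `date_added` columns
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "date_added": ["January 1, 2017", "March 5, 2017", "July 9, 2017",
                   "May 2, 2018", "June 3, 2018", "August 8, 2018"],
})

year_added = pd.to_datetime(sample["date_added"], format="%B %d, %Y").dt.year

# Rows: year added; columns: content type; values: share within each year
share = pd.crosstab(year_added, sample["type"], normalize="index")
print(share)
```

A rising "TV Show" column across years would support the claim. Raw counts alone can mislead when the catalogue as a whole is growing.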


1. [Import data and Python packages](#t1.)
    * Import packages
    * Import data
    * Data shape and info
2. [Data visualization](#t2.)
    * Pie Plot
    * Histograms
    * Count Plots

<a id="t1."></a>
# 1. Import data and Python packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

import missingno as msno

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=RuntimeWarning)

In [None]:
data = pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")
data.head()

In [None]:
def getNames(arr):
    """Split a comma-separated string into a list of stripped, non-empty names."""
    # str.split also handles strings with no comma, which the manual scan missed
    return [name.strip() for name in arr.split(",") if name.strip()]
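For comparison, pandas can do the same splitting and counting without building one long string first. This is a sketch on a made-up `cast`-like Series, not a drop-in replacement for the helper above.

```python
import pandas as pd

# Made-up sample mimicking the comma-separated `cast` column
cast = pd.Series(["Ann Lee, Bob Ray", "Bob Ray", None, "Cy Dow, Ann Lee, Bob Ray"])

names = (
    cast.dropna()          # skip titles with no cast listed
        .str.split(",")    # one list of names per title
        .explode()         # one row per name
        .str.strip()       # remove the space that follows each comma
)
counts = names.value_counts()
print(counts)
```

Because missing values are dropped before splitting, no "nan" placeholder ends up among the counted names.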

In [None]:
def getMonth(arr):
    """Return the first space-delimited token of each entry ("?" is kept as-is)."""
    month = []
    for value in arr:
        if value == "?":
            month.append(value)
        else:
            # Strip first: some entries start with a space
            month.append(value.strip().split(" ")[0])
    return month

In [None]:
def getYear(arr):
    """Return the text after the first comma of each entry, e.g. the year in "Month D, YYYY"."""
    year = []
    for value in arr:
        if value == "?":
            year.append(value)
        else:
            year.append(value.split(",", 1)[1].strip())
    return year
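An alternative to parsing the date strings by hand is `pd.to_datetime`. The sketch below uses a made-up sample and assumes the "Month D, YYYY" pattern. Entries that do not parse become `NaT` instead of "?".

```python
import pandas as pd

# Made-up sample; the leading space and the missing value are deliberate
date_added = pd.Series([" September 9, 2019", "January 1, 2020", None])

parsed = pd.to_datetime(date_added.str.strip(), format="%B %d, %Y", errors="coerce")

month = parsed.dt.month_name()   # e.g. "September"; missing dates stay missing
add_year = parsed.dt.year        # float dtype because of the missing value
print(pd.DataFrame({"month": month, "add_year": add_year}))
```

Note that this yields numeric years, whereas the helper above returns strings such as "2019", so later comparisons would need to match the chosen type.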

In [None]:
def PlotBarH(df, title,figsize,color,fontsize):
    """Draw a horizontal bar chart of a Series and label each bar with its value."""
    df.plot(kind="barh",figsize=figsize,color=color)

    # barh draws the first row at y=0, so each value is labelled at its own position
    for y_,x in enumerate(df.values):
        plt.text(x=x,y=y_-0.2,s="{}".format(x),fontsize=10,color="black")
    plt.title(title,fontsize=fontsize)
    plt.show()

In [None]:
def list_hue(df,col):
    """Join the values of `col` into one comma-separated string per content type."""
    movies = df.loc[df["type"]=="Movie",col]
    shows  = df.loc[df["type"]=="TV Show",col]
    # Every value is followed by ", ", matching the format getNames expects
    arr_movie = "".join(str(value) + ", " for value in movies)
    arr_show  = "".join(str(value) + ", " for value in shows)

    return arr_movie,arr_show

In [None]:
def HUE(col):
    """Plot the 15 most frequent values of `col` separately for movies and TV shows."""
    # Note: dropna() discards every row that has a missing value in any column
    df = data.dropna()
    movie,show = list_hue(df,col)
    movie_list = pd.Series(getNames(movie))
    show_list  = pd.Series(getNames(show))
    
    df = movie_list.value_counts()[:15].sort_values()
    PlotBarH(df,"Movie",(10,6),"teal",11)
    
    df = show_list.value_counts()[:15].sort_values()
    PlotBarH(df,"TV Show",(10,6),"firebrick",11)
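The per-type counts can also be computed with `explode` and `groupby`. This is a sketch on made-up rows, and `top_values_by_type` is a hypothetical helper. Unlike `HUE`, it drops only rows where the chosen column is missing, which is an assumption about the desired behaviour.

```python
import pandas as pd

# Made-up rows using the dataset's `type` and `country` columns
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "country": ["India, France", "India", "Japan", "Japan, India", None],
})

def top_values_by_type(df, col, n=15):
    """Count the individual comma-separated values of `col` per content type."""
    exploded = df[["type", col]].dropna(subset=[col]).copy()
    exploded[col] = exploded[col].str.split(",")   # one list per title
    exploded = exploded.explode(col)               # one row per value
    exploded[col] = exploded[col].str.strip()
    counts = exploded.groupby("type")[col].value_counts()
    # value_counts sorts within each group, so head(n) keeps the n most frequent
    return counts.groupby(level="type").head(n)

top = top_values_by_type(sample, "country", n=1)
print(top)
```

The result is a Series indexed by (type, value), which can be unstacked or plotted per type.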

In [None]:
# Mark missing dates with "?" so the parsing helpers can skip them
data["date_added"] = data["date_added"].fillna("?")
data["month"] = getMonth(data["date_added"])

In [None]:
# Missing dates were already replaced with "?" in the previous cell
data["add_year"] = getYear(data["date_added"])

<a id="t2."></a>
# 2. Data visualization

In [None]:
msno.matrix(data)
plt.show()

In [None]:
labels  = data["type"].value_counts().index
sizes   = data["type"].value_counts().values
explode = (0,0.1)
plt.figure(figsize=(10,6))
plt.pie(x=sizes,explode=explode,autopct='%1.1f%%',shadow=True, startangle=90, frame=True,colors=["teal","firebrick"])
plt.axis('equal')
plt.legend(labels)
plt.title("Content Types",fontsize=15)
plt.xticks([])
plt.yticks([])
plt.show()

In [None]:
plt.figure(figsize=(10,6))
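# getMonth returns the first token of each entry, here the minutes in e.g. "90 min"
plt.hist(np.array(getMonth(list(data.loc[data["type"]=="Movie"]["duration"])),dtype=int),bins=20,edgecolor="black",
         color="teal",density=True)
plt.title("Histogram of Movie Durations",fontsize=20)
plt.xlabel("Duration (minutes)")
plt.ylabel("Density")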
plt.show()

In [None]:
plt.figure(figsize=(10,6))
# getMonth returns the first token of each entry, here the count in e.g. "2 Seasons"
plt.hist(np.array(getMonth(list(data.loc[data["type"]=="TV Show"]["duration"])),dtype=int),bins=15,
         edgecolor="black",color="firebrick",density=True)
plt.title("Histogram of TV Show Seasons",fontsize=20)
plt.xlabel("Number of seasons")
plt.ylabel("Density")
plt.show() 
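The leading number in `duration` can also be pulled out with a regular expression instead of reusing `getMonth`. The following is a sketch on made-up rows. The `duration_value` column name is an assumption for illustration.

```python
import pandas as pd

# Made-up rows in the dataset's `duration` format
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "1 Season", "123 min", "3 Seasons"],
})

# Capture the leading digits and convert them to integers
sample["duration_value"] = (
    sample["duration"].str.extract(r"^(\d+)", expand=False).astype(int)
)

movie_minutes = sample.loc[sample["type"] == "Movie", "duration_value"]
show_seasons = sample.loc[sample["type"] == "TV Show", "duration_value"]
print(movie_minutes.tolist(), show_seasons.tolist())
```

The unit differs by type (minutes for movies, seasons for TV shows), so the two groups should be plotted separately, as in the histograms above.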

In [None]:
plt.figure(figsize=(10,6))
plt.hist(np.array(data.loc[data["type"]=="Movie"]["release_year"],dtype=int),bins=20,edgecolor="black",color="teal",
         label="Movie",alpha=0.7,density=True)
plt.hist(np.array(data.loc[data["type"]=="TV Show"]["release_year"],dtype=int),bins=20,edgecolor="black",color="firebrick",
         label="TV Show",alpha=0.7,density=True)
plt.title("Histogram of Release Year",fontsize=20)
plt.xlabel("Release year")
plt.legend()
plt.ylabel("Density")
plt.show() 

In [None]:
plt.figure(figsize=(10,6))
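# Leave out titles whose date added is unknown ("?") rather than assigning them to a year
known_year = data.loc[data["add_year"]!="?"].sort_values("add_year",ascending=True)
sns.countplot(data=known_year, x="add_year", hue="type", 
              palette=["teal","firebrick"])
plt.xlabel("Added Year")
plt.title("Years when content was added to Netflix",fontsize=20)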
plt.legend(loc="upper left")
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data=data.sort_values("rating"), x="rating", hue="type", palette=["teal","firebrick"])
plt.xlabel("Rating")
plt.title("Count Plot for Rating",fontsize=20)
plt.legend(loc="upper left")
plt.show()

In [None]:
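# `listed_in` holds the full genre string, so each bar is a combination of genres
added_2019 = data.loc[data["add_year"]=="2019","listed_in"].value_counts()[:15].sort_values()
added_2019.plot(kind="barh",figsize=(10,6),edgecolor="black",color="firebrick",lw=1)
plt.title("Most common genre combinations added in 2019",fontsize=20)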
plt.show()

In [None]:
df = data.loc[data["release_year"]==2019,"cast"]
# Every value is followed by ", ", matching the format getNames expects
cast2019 = "".join(str(value) + ", " for value in df)

In [None]:
cast_list2019 = pd.Series(getNames(cast2019))

In [None]:
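# [1:11] skips the most frequent entry, expected to be the "nan" placeholder for titles without a cast
cast_list2019.value_counts()[1:11].sort_values().plot(kind="barh",figsize=(10,6),
                                                      edgecolor="black",color="teal",lw=1)
plt.title("Top 10 cast members in titles released in 2019",fontsize=20)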
plt.show()

In [None]:
# Exact match: only titles listed solely under "Comedies"; sort_index keeps the years in order
data.loc[data["listed_in"]=="Comedies"]["add_year"].value_counts().sort_index().plot(kind="area",figsize=(10,6),color="teal")
plt.title("Comedy movies added per year",fontsize=20)
plt.show()

In [None]:
# Exact match: only titles listed solely under "TV Comedies"; sort_index keeps the years in order
data.loc[data["listed_in"]=="TV Comedies"]["add_year"].value_counts().sort_index().plot(kind="area",figsize=(10,6),color="teal")
plt.title("TV comedies added per year",fontsize=20)
plt.show()

In [None]:
data.loc[data["country"]=="Turkey"]["add_year"].value_counts().sort_index().plot(kind="area",figsize=(10,6),
                                                                                 color="firebrick")
plt.title("Turkey: number of titles added per year",fontsize=20)
plt.show()

In [None]:
data.loc[data["country"]=="India"]["add_year"].value_counts().sort_index().plot(kind="area",figsize=(10,6),
                                                                                 color="firebrick")
plt.title("India: number of titles added per year",fontsize=20)
plt.show()

In [None]:
data.loc[data["country"]=="United States"]["add_year"].value_counts().sort_index().plot(kind="area",figsize=(10,6),
                                                                                 color="firebrick")
plt.title("United States: number of titles added per year",fontsize=20)
plt.show()
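The country filters above use exact equality, so a title listed as "United States, India" is not counted for either country. If co-productions should count, one option is to test membership in the split list. This is a sketch on made-up rows. `produced_in` is a hypothetical helper, and "Indiana Territory" is an invented value that shows why a plain substring test would be too loose.

```python
import pandas as pd

# Made-up rows using the dataset's `country` column and the derived `add_year`
sample = pd.DataFrame({
    "country": ["India", "United States, India", "Indiana Territory", None],
    "add_year": ["2018", "2019", "2019", "2019"],
})

def produced_in(df, country):
    """Boolean mask: True where `country` is one of the listed countries."""
    listed = df["country"].fillna("").str.split(",")
    return listed.apply(lambda names: country in [n.strip() for n in names])

mask = produced_in(sample, "India")
per_year = sample.loc[mask, "add_year"].value_counts().sort_index()
print(per_year)
```

With this mask, a co-production is counted once for each country it lists, so per-country totals no longer sum to the number of titles.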

## Count Plots

In [None]:
# Every value is followed by ", ", matching the format getNames expects
cast = "".join(str(value) + ", " for value in data["cast"])

In [None]:
cast_list = pd.Series(getNames(cast))

In [None]:
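# [1:16] skips the most frequent entry, expected to be the "nan" placeholder for titles without a cast
df = cast_list.value_counts()[1:16].sort_values()
PlotBarH(df,"Top 15 Actors and Actresses",(10,6),"darkgreen",20)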
HUE("cast")

In [None]:
# Every value is followed by ", ", matching the format getNames expects
director = "".join(str(value) + ", " for value in data["director"])

In [None]:
director_list = pd.Series(getNames(director))

In [None]:
# [1:16] skips the most frequent entry, expected to be the "nan" placeholder for titles without a director
df = director_list.value_counts()[1:16].sort_values()
PlotBarH(df,"Top 15 Directors",(10,6),"darkgreen",20)
HUE("director")

In [None]:
# Every value is followed by ", ", matching the format getNames expects
country = "".join(str(value) + ", " for value in data["country"])

In [None]:
country_list = pd.Series(getNames(country))

In [None]:
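# Missing countries were stringified to "nan", so filter on the string; dropna() would not remove them
df = country_list[country_list!="nan"].value_counts()[:15].sort_values()
PlotBarH(df,"Top 15 Countries with the Most Content",(10,6),"darkgreen",20)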
HUE("country")

In [None]:
# Every value is followed by ", ", matching the format getNames expects
category = "".join(str(value) + ", " for value in data["listed_in"])

In [None]:
category_list = pd.Series(getNames(category))

In [None]:
df = category_list.value_counts()[:15].sort_values()
PlotBarH(df,"Top 15 Categories",(10,6),"darkgreen",20)
HUE("listed_in")

In [None]:
month_list = pd.Series(getMonth(data["date_added"].dropna()))

In [None]:
df = month_list.value_counts().sort_values()
PlotBarH(df,"In which month is content added to Netflix?",(10,6),"darkgreen",20)
HUE("month")