# NETFLIX

Data Information: 

This dataset consists of TV shows and movies available on Netflix as of 2019. A report released in 2018 shows that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Tasks for this dataset:

1. Understanding what content is available in different countries
2. Identifying similar content by matching text-based features
3. Network analysis of actors and directors to find interesting insights
4. Is Netflix increasingly focusing on TV rather than movies in recent years?
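
One way to approach task 4 is to compare, for each year, how many titles of each type were added. The following is a minimal sketch on a small made-up frame: the rows are illustrative only and are not taken from the dataset, and the column names `type` and `year_added` are assumed to match the ones used later in this notebook.

```python
import pandas as pd

# Toy data for illustration only; these are not real Netflix figures.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year_added": [2018, 2018, 2018, 2019, 2019, 2019],
})

# Count titles per (year, type), then turn the counts into a per-year TV share.
counts = pd.crosstab(toy["year_added"], toy["type"])
tv_share = counts["TV Show"] / counts.sum(axis=1)
print(tv_share)
```

A rising `tv_share` over the years would suggest a shift toward TV shows; on the real data the same two lines can be applied to `df` once `year_added` exists.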

Content:

1. Data Analysis
2. Data Visualization


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pandas_profiling import ProfileReport

# seaborn
import seaborn as sns  # visualization tool

# plotly
from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px

#matplotlib
import matplotlib.pyplot as plt


# word cloud library
from wordcloud import WordCloud

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

        
import warnings
warnings.filterwarnings("ignore")
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Analysis

In [None]:
# Reading and Loading the data

df = pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
report = ProfileReport(df)
report

## Data Cleaning

In [None]:
# Checking for null values
df.isnull().sum()

We can fill in the null values of director, cast, country, date_added and rating in this dataset.

In [None]:
# rating: list the distinct rating values
df['rating'].unique()

In [None]:
df[df['rating'].isna()]

In [None]:
df['rating'].value_counts()

As we can see, TV-MA is the most common rating. Therefore, we can replace the NaN values with TV-MA.
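
Filling missing values with the most frequent category can also be done without hard-coding the value. A minimal sketch, assuming a toy `ratings` Series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy ratings for illustration only.
ratings = pd.Series(["TV-MA", "TV-14", np.nan, "TV-MA", "PG", np.nan], name="rating")

# mode() ignores NaN by default and returns a Series (ties yield several values),
# so the first entry is taken as the fill value.
most_common = ratings.mode().iloc[0]
filled = ratings.fillna(most_common)
print(most_common, filled.isna().sum())
```

Looking up the mode at run time keeps the fill value in sync with the data if the dataset is updated.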

In [None]:
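# replacing missing ratings with the most common value
# (assigning the result back avoids in-place modification of a column selection)
df['rating'] = df['rating'].fillna('TV-MA')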

In [None]:
df[df['rating'].isna()]

In [None]:
# determining the missing data for date_added
df[df['date_added'].isna()]

Finding the missing dates for this data is difficult. Hence, we can drop these rows, which won't affect our analysis much.
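
Dropping the rows and parsing the remaining dates can be combined in a few lines. The sketch below uses a toy frame and assumes that `date_added` is a string such as "September 9, 2019" and that some values may carry leading whitespace; both the rows and the whitespace are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "date_added": ["September 9, 2019", np.nan, " August 4, 2017"],
})

# Drop rows without a date; .copy() avoids chained-assignment warnings later.
toy = toy.dropna(subset=["date_added"]).copy()

# Strip stray whitespace, then parse "Month D, YYYY" strings into datetimes.
toy["date_added"] = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["year_added"] = toy["date_added"].dt.year
toy["month_added"] = toy["date_added"].dt.month_name()
print(toy)
```

Working with real datetimes makes later grouping by year or month more robust than splitting the raw strings.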

In [None]:
# dropping rows with a missing date_added (.copy() avoids SettingWithCopyWarning on later assignments)
df = df[df['date_added'].notna()].copy()

In [None]:
df[df['date_added'].isna()]  # verifying that no missing dates remain

In [None]:
# missing data for country
df[df['country'].isna()]

In [None]:
df['country'].value_counts()

When we examine the country data, the United States is the most common country. Therefore, we can replace the NaN values with United States.
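
Counting countries this way treats each distinct string as one country. If a `country` entry lists several countries separated by commas (an assumption about the data format), each combination is counted separately. A minimal sketch of splitting such entries into one row per country, on a toy frame with made-up rows:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", "United States"],
})

# Split on commas and give each country its own row.
per_country = toy.assign(country=toy["country"].str.split(",")).explode("country")

# Remove the whitespace left over after the split.
per_country["country"] = per_country["country"].str.strip()

country_counts = per_country["country"].value_counts()
print(country_counts)
```

The per-country counts also address task 1, since each title is attributed to every country it is listed under.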

In [None]:
# country: fill missing values with the most common country
df['country'] = df['country'].fillna('United States')

In [None]:
# checking whether the data is clean, except for director and cast, which have too many missing values to fill or drop
df.isna().sum()

# Data Visualizations

In [None]:
df.head()

In [None]:
df.type.unique()

In [None]:
# Movie vs Tv Shows
ax = sns.countplot(x="type", data=df)
plt.ylabel('Count')
plt.xlabel('Type')
plt.title('Analysis of Movies vs TV Shows');

In [None]:
df.country.value_counts()

In [None]:
# Most common Countries

country = df.country
plt.subplots(figsize =(8,8))
wordcloud = WordCloud(
                            background_color = 'white',
                            width = 512,
                            height = 384
                        ).generate(" ".join(country))

plt.imshow(wordcloud)
plt.axis("off")
plt.savefig('graph.png')

plt.show()

In [None]:
countries = df.country.value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=countries[:10].index,y=countries[:10].values)
plt.xticks(rotation=45)
plt.title('Top 10 Countries',color = 'blue',fontsize=20)
plt.show()

In [None]:
df.head()

In [None]:
# Adding year and month columns (strip stray whitespace so the first token is always the month)
df['year_added'] = df['date_added'].apply(lambda x: x.strip().split(" ")[-1])
df['month_added'] = df['date_added'].apply(lambda x: x.strip().split(" ")[0])
df.head(10)

In [None]:
bar, ax = plt.subplots(figsize = (10,10))
sns.barplot(x = df['release_year'].value_counts().index[:10], y = df['release_year'].value_counts()[:10])
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.title('Release Frequency over Years')
plt.show()

In [None]:
# Pie Plot - Movie and TV Shows

pie1 = df['type'].value_counts().values

labels = df['type'].value_counts().index

# figure
fig = {
  "data": [
    {
      "values": pie1,
      "labels": labels,
      "domain": {"x": [0, .3]},
      "name": "Content Type",
      "hoverinfo":"label+percent+name",
      "hole": .5,
      "type": "pie"
    },],
  "layout": {
        "title":"Content Type Rates",
        "annotations": [
            { "font": { "size": 20},
              "showarrow": False,
              "text": "Content Type",
                "x": 0.20,
                "y": 1
            },
        ]
    }
}
iplot(fig)

In [None]:
# Growth over the years in TV Shows and Movies 

movie = df[df['type'] == 'Movie']
tv = df[df['type'] == 'TV Show']

data = df[['type', 'release_year']]
# Count titles per (type, release_year), sorted by count in descending order;
# reset_index(name=...) names the count column regardless of pandas version
data = data.value_counts().reset_index(name='count')
data = pd.concat([data[data['type'] == 'Movie'][:10], data[data['type']== 'TV Show'][:10]])

sns.catplot(x = 'release_year', y = 'count', hue = 'type', data = data, kind = 'point')
plt.xlabel('Release Year')
plt.ylabel('Frequency')
plt.title('Growth of Movie/TV Show over Years', size=14)
plt.show()

In [None]:
# Rating Types 
plt.figure(figsize=(12,9))
plt.title("Rating Types", fontsize=30)
sns.countplot(x="rating",data=df,order= df['rating'].value_counts().index)
plt.show()

In [None]:
df.head()

In [None]:
df.duration

In [None]:
movie = df[df['type'] == 'Movie']
tv = df[df['type'] == 'TV Show']

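# 3D scatter of movies only: x, y and z must have the same length, so all three come from `movie`
trace1 = go.Scatter3d(
    x=movie.duration.str.extract(r'(\d+)')[0].astype(float),  # duration in minutes
    y=movie.year_added.astype(int),                           # year added to Netflix
    z=movie.release_year,
    mode='markers',
    marker=dict(
        size=10,
        color='rgb(110,56,186)',                # single color for all markers
    )
)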

data = [trace1]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0  
    )
    
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
movie = df[df['type'] == 'Movie']
tv = df[df['type'] == 'TV Show']

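# Durations are strings such as "90 min" or "2 Seasons"; extract the number before plotting
trace0 = go.Box(
    y = movie.duration.str.extract(r'(\d+)')[0].astype(float),
    name = "Duration of Movies (minutes)",
    marker = dict(
        color = 'rgb(12, 12, 140)',
    )
)

trace1 = go.Box(
    y = tv.duration.str.extract(r'(\d+)')[0].astype(float),
    name = "Duration of TV Shows (seasons)",
    marker = dict(
        color = 'rgb(12, 128, 128)',
    )
)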

data = [trace0,trace1]
iplot(data)

In [None]:
# create trace 1 that is 3d scatter
trace1 = go.Scatter3d(
    x=df.type,
    y=df.country,
    z=df.release_year,
    mode='markers',
    marker=dict(
        size=10,
        color='rgb(110,56,186)',                # set color to an array/list of desired values      
    )
)

data = [trace1]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0  
    )
    
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [None]:
# Famous Director
x = list()
clean_data = df.dropna()
clean_data.reset_index(inplace=True)
for ind, element in clean_data.iterrows():
    type_show = element['type']
    # A title can list several directors; strip the space left after each comma
    for director in str(element['director']).split(','):
        x.append([type_show, director.strip()])
director_data = pd.DataFrame(x, columns= ['type', 'director'])
director_data

In [None]:
# Count titles per (type, director); reset_index(name=...) names the count column
# 'number' regardless of pandas version
famous_director = director_data.value_counts().reset_index(name='number')
famous_director

In [None]:
x = famous_director.director.head(15)
y = famous_director.number.head(15)
plt.figure(figsize=(15,10))
ax= sns.barplot(x=x, y=y,palette = sns.cubehelix_palette(len(x)))
plt.xlabel('Name of directors')
plt.xticks(rotation=45)
plt.ylabel('Frequency')
plt.title('15 most famous directors')
plt.show()

In [None]:
# Famous Directors in Movies
famous_director_movie = famous_director[famous_director['type'] == 'Movie']


x = famous_director_movie.director.head(15)
y = famous_director_movie.number.head(15)
plt.figure(figsize=(15,10))
ax= sns.barplot(x=x, y=y,palette = sns.color_palette("pastel"))
plt.xlabel('Name of directors')
plt.xticks(rotation=45)
plt.ylabel('Frequency')
plt.title('15 most famous directors for movies')
plt.show()


In [None]:
# Famous Directors in TV Shows

famous_director_tv = famous_director[famous_director['type'] == 'TV Show']
x = famous_director_tv.director.head(15)
y = famous_director_tv.number.head(15)
plt.figure(figsize=(15,10))
ax= sns.barplot(x=x, y=y,palette = sns.color_palette())
plt.xlabel('Name of directors')
plt.xticks(rotation=45)
plt.ylabel('Frequency')
plt.title('15 most famous directors for TV Shows')
plt.show()