# Hackathon Notebook

In [1]:
#!py -m pip install missingno
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime, date
%matplotlib inline


In [2]:
df = pd.read_csv('data/netflix_titles.csv')

# Exploring the Netflix Dataset

Netflix is one of the world's largest streaming platforms, with millions of users streaming movies and TV shows every day. As a result, Netflix has accumulated a massive amount of data that can be used to gain insights into user behaviour, preferences, and more. In this project, students will explore a dataset containing information about movies and TV shows available on Netflix.

# Brief

The CEO of CS-Streaming has brought you and your team in to analyse Netflix data for their new streaming service. They are confident that your analysis will help them make decisions about their service. They need your expertise to uncover trends and insights from the Netflix data that will benefit their new product. They are eager to begin this project and therefore need a timely response. You are to report on how CS-Streaming can best optimise its services and compete with Netflix.

In [3]:
df.columns

In [4]:
'''
Key Columns in the Dataset:
1. show_id: An exclusive identifier for each show on Netflix.
2. type: Indicates whether the entry is a "Movie" or a "TV Show."
3. title: The name of the movie or TV show.
4. director: The director(s) responsible for the movie or TV show.
5. cast: The principal cast or actors featured in the movie or TV show.
6. country: The country or countries where the movie or TV show was produced.
7. date_added: The date when the movie or TV show was added to the Netflix catalogue.
8. release_year: The original release year of the movie or TV show.
9. rating: The content rating assigned to the movie or TV show (e.g., PG, TV-MA, etc.).
10. duration: For movies, the duration is in minutes; for TV shows, the number of seasons.
11. listed_in: The genre(s) or category(ies) classifying the movie or TV show.
12. description: A concise summary of the movie or TV show
'''

In [5]:
df.head()

In [6]:
df.tail()

In [7]:
df.shape

In [8]:
df.describe()

In [9]:
df.describe(include ='object')

In [10]:
df.info()

In [11]:
df.nunique()

In [12]:
df.isnull().sum()

In [13]:
plt.figure( figsize = (20,10))
sns.heatmap(df.isnull(), cmap ='winter')
plt.show()

The director, cast and country columns contain a large number of missing values. We handle these in the cleaning steps below.
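
The heatmap shows where values are missing, but not how much of each column is affected. A minimal sketch of how the missing share per column could be computed is given here. The frame `toy` and its values are made up for illustration and stand in for `df`.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are made up for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", "United States", None, "India"],
})

# isnull() marks missing cells; mean() of the booleans is the missing fraction per column.
missing_share = toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```

Running the same expression on `df` would rank the real columns by their percentage of missing values.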

In [14]:
df.dtypes

In [15]:
df['date_added']= df['date_added'].str.lstrip(" ")

In [16]:
df['date_added']= pd.to_datetime(df['date_added'], format ='%B %d, %Y') 
df.head()

In [17]:
df['date_added_day'] = df['date_added'].dt.day

In [18]:
df['date_added_month'] = df['date_added'].dt.month

In [19]:
df['date_added_year'] = df['date_added'].dt.year

In [20]:
df.drop(['date_added'], axis =1, inplace = True)
df.head()

In [21]:
df.dtypes

In [22]:
# drop_duplicates() returns a new frame, so assign the result back to keep it
df = df.drop_duplicates()

In [23]:
df.drop(['show_id'], axis =1, inplace = True)

In [24]:
df.rename(columns={'listed_in' : 'category'}, inplace=True)

In [25]:
# Standardize capitalization in categorical columns 'type' and 'rating'
# (note: capitalize() turns 'TV Show' into 'Tv show', the spelling used in later filters)
df['type'] = df['type'].str.capitalize()
df['rating'] = df['rating'].str.upper()
df.head()

In [26]:
#fill Nan Values
df['rating'] = df['rating'].fillna('Not Rated')
df['country'] = df['country'].fillna('Unknown')
df['duration'] = df['duration'].fillna('Unknown')
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')

In [27]:
#Fill NaN values in numeric columns with each column's mode
num_col = df.select_dtypes(include='number').columns
# mode() returns a DataFrame; take its first row so fillna matches on column names
num_mode = df[num_col].mode().iloc[0]
df[num_col] = df[num_col].fillna(num_mode)

In [28]:
df.dtypes

# EDA including Visualizations

In [29]:
types =  df['type'].value_counts()
types

In [30]:
plt.figure(figsize=(10,6))
plt.pie(types ,labels=['Movies', 'TV Shows'], autopct='%1.1f%%')
plt.title('Movies & TV Shows')
plt.legend()
plt.show()

Movies make up about 70% of the Netflix catalogue (6131 of 8807 titles). This points to a gap in TV show content that our streaming service could fill.
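
As a quick check on the split shown in the pie chart, the shares can be worked out from the counts quoted elsewhere in this notebook (8807 titles in total, 6131 movies). This is plain arithmetic on those quoted figures, not a new measurement, and the variable names are illustrative.

```python
# Counts quoted in this notebook: 8807 titles in total, 6131 of them movies.
total_titles = 8807
movie_titles = 6131
tv_titles = total_titles - movie_titles  # remaining titles are TV shows

movie_share = movie_titles / total_titles
tv_share = tv_titles / total_titles

print(f"Movies: {movie_share:.1%}, TV Shows: {tv_share:.1%}")
```

The same figures come straight from the data with `df['type'].value_counts(normalize=True)`.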

In [31]:
#Relation between type and rating

plt.figure(figsize=(10,8))
sns.countplot(x='rating',hue='type',data=df)
plt.title('Relation between Type and Rating')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

The two most common ratings are TV-MA (mature audiences) and TV-14 (parents strongly cautioned; may be unsuitable for children under 14).

In [32]:
'''assess director distribution by type'''
'''remove unknowns first'''
df2 = df.copy()
df2 = df2.loc[df2['director']!= 'Unknown']
types_director = df2.groupby('type')
types_directors = types_director['director'].value_counts()
types_directors

In [33]:
#Assess directors of movies
directors_movie = df['director'][df['type'] == 'Movie'].value_counts().sort_values(ascending=False).reset_index()
directors_movie

In [34]:
'''majority of directors of the movies are unknown, plot the known directors'''
directors_movie = df2['director'][df2['type'] == 'Movie'].value_counts().sort_values(ascending=False).reset_index()
directors_movie

In [35]:
top_10_directors_movie = directors_movie.head(10).copy()
# Name the columns directly so this works whatever names reset_index() produced
top_10_directors_movie.columns = ['Directors', 'Count']

In [36]:
fig = top_10_directors_movie.plot( kind ='bar', x = 'Directors',y = 'Count',figsize=(8, 6))
plt.title('Top 10 Directors In Netflix For Movies')
plt.xlabel('Movie Director')
plt.ylabel('Number Movies Directed')
plt.show()

In [37]:
'''most TV show directors are unknown, so rank only the known directors'''
directors_tvshow = df2['director'][df2['type'] == 'Tv show'].value_counts().sort_values(ascending=False).reset_index()
top_10_directors_tvshow = directors_tvshow.head(10).copy()
# Name the columns directly so this works whatever names reset_index() produced
top_10_directors_tvshow.columns = ['Director', 'Count']
top_10_directors_tvshow

In [38]:
# Plotting Top 10 Directors In Netflix For TV Shows
fig = top_10_directors_tvshow.plot( kind ='bar', x = 'Director',y = 'Count',figsize=(8, 6), color = 'red')
plt.title('Top 10 Directors In Netflix For TV Show')
plt.xlabel('TV Show Director')
plt.ylabel('Number TV Shows directed')
plt.show()

In [39]:
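plt.figure(figsize=(10, 6))
# Stack both top-10 tables with a type label so each type gets its own box
top_directors = pd.concat([
    top_10_directors_movie[['Count']].assign(Type='Movie'),
    top_10_directors_tvshow[['Count']].assign(Type='TV Show'),
], ignore_index=True)
sns.boxplot(data=top_directors, x='Type', y='Count')
plt.title('Director Distribution by type')
plt.xlabel('Type')
plt.ylabel('Number of titles directed')
plt.show()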

The top TV show directors show less variance in title counts than the top movie directors. The comparison is likely skewed, however, because the catalogue contains far more movies than TV shows.

In [40]:
#Assess the countries that have produced the content. Usually a country will produce for the audience of that country.
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
countries_ordered = df['country'].value_counts().sort_values(ascending=False).rename_axis('Country').reset_index(name='Count')
countries_ordered

In [41]:
plt.figure(figsize = (10,6))
sns.barplot(data = countries_ordered.head(10), x = 'Country' , y = 'Count')
plt.title('Distribution of countries where content has been produced')
plt.xlabel('Country')
plt.ylabel('Number of content has been produced')
plt.xticks(rotation = 90)
plt.show()

Of the 8807 titles (TV shows and movies), 2818 were produced in the USA, which is 32%. This suggests that the largest audience is consumers of American productions, followed by Indian productions. However, 9.4% of the titles have no recorded country.
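
One caveat on these counts: the `country` column stores co-productions as a single comma-separated string (the heatmap cells later split on `', '` for this reason), so counting raw strings treats "United States, India" as its own country. A minimal sketch of counting each country separately follows. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for df (values are made up for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States", "United States, India", "India", "Unknown"],
})

# Drop the 'Unknown' placeholder, split co-productions, and give each country its own row.
countries = toy.loc[toy["country"] != "Unknown", "country"].str.split(", ").explode()
country_counts = countries.value_counts()
print(country_counts)
```

With this approach a co-production counts once for every country involved, so the totals can exceed the number of titles.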

In [42]:
sns.violinplot(data = countries_ordered)

The data is clearly highly skewed: the major producers are outliers, while most countries have produced fewer than 200 titles. This supports the view that the content is aimed mainly at American and Indian consumers.

In [44]:
# Create a function to change column data from string to a list
df5 = df.copy()

def update_column_string_to_list(frame, column_list):
    # Split on the frame passed in, not on a global, so the caller's copy is updated
    for each in column_list:
        frame[each] = frame[each].str.split(', ')
        
#This function will create a heatmap when providing which type of data we are looking at.
def create_heatmap(name_df):
    df_heatmap = df5[df5["type"]==name_df].copy()
    update_column_string_to_list(df_heatmap, ['country', 'category'])
    # Explode both 'country' and 'category' columns
    df_exploded = df_heatmap.explode('country').explode('category')
    # Group by 'country' and 'category', then count occurrences
    df_counts = df_exploded.groupby(['country', 'category']).size().reset_index(name='count')
    # Only care about count >= mean 
    df_counts = df_counts[df_counts["count"] >= df_counts["count"].mean()]
    # Pivot the dataframe to create a matrix for heatmap (recent pandas requires keyword arguments)
    pivot_df = df_counts.pivot(index="category", columns="country", values="count")
    # Plot heatmap
    plt.figure(figsize=(20, 15))
    sns.heatmap(pivot_df, cmap="icefire", annot=False, linewidths=.5)
    plt.title('Counts of ' + name_df + ' Type by Country (>= mean)')
    plt.xlabel('Country')
    plt.ylabel('Category')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()
create_heatmap("Tv show")

The United States is the largest producer of TV Shows in the categories TV Comedies, Kids' TV and Docuseries, making it the largest producer of TV Shows overall. South Korea produces a large number of the TV Shows in the categories International TV Shows and Korean TV Shows.

In [45]:
#Assess the countries by content type
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
countries_tvshow = df['country'][df['type'] == 'Tv show'].value_counts().sort_values(ascending=False).rename_axis('Country').reset_index(name='Count')
countries_tvshow

In [46]:
plt.figure(figsize = (10,6))
sns.barplot(data = countries_tvshow.head(10), x = 'Country' , y = 'Count')
plt.title('Distribution of countries where TV Shows have been produced')
plt.xlabel('Country')
plt.ylabel('Number of TV Shows produced')
plt.xticks(rotation = 90)
plt.show()

Of the 2676 TV Shows, the USA produced 760, which is 28%. India is the second-largest producer overall but only the 6th-largest producer of TV Shows. If countries produce mainly for their home audience, this suggests that Indian consumers prefer movies. However, a large proportion of 'Unknown' entries remains in this subset and skews the picture.

In [47]:
# Create a function to change column data from string to a list
df4 = df.copy()

def update_column_string_to_list(frame, column_list):
    # Split on the frame passed in, not on a global, so the caller's copy is updated
    for each in column_list:
        frame[each] = frame[each].str.split(', ')
        
#This function will create a heatmap when providing which type of data we are looking at.
def create_heatmap(name_df):
    df_heatmap = df4[df4["type"]==name_df].copy()
    update_column_string_to_list(df_heatmap, ['country', 'category'])
    # Explode both 'country' and 'category' columns
    df_exploded = df_heatmap.explode('country').explode('category')
    # Group by 'country' and 'category', then count occurrences
    df_counts = df_exploded.groupby(['country', 'category']).size().reset_index(name='count')
    # Only care about count >= mean 
    df_counts = df_counts[df_counts["count"] >= df_counts["count"].mean()]
    # Pivot the dataframe to create a matrix for heatmap (recent pandas requires keyword arguments)
    pivot_df = df_counts.pivot(index="category", columns="country", values="count")
    # Plot heatmap
    plt.figure(figsize=(14, 10))
    sns.heatmap(pivot_df, cmap="icefire", annot=False, linewidths=.5)
    plt.title('Counts of ' + name_df + ' Type by Country (>= mean)')
    plt.xlabel('Country')
    plt.ylabel('Category')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()
create_heatmap("Movie")


In [48]:
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
countries_movie = df['country'][df['type'] == 'Movie'].value_counts().sort_values(ascending=False).rename_axis('Country').reset_index(name='Count')
countries_movie

In [49]:
plt.figure(figsize = (10,6))
sns.barplot(data = countries_movie.head(10), x = 'Country' , y = 'Count')
plt.title('Distribution of countries where movies were produced')
plt.xlabel('Country')
plt.ylabel('Number of movies produced')
plt.xticks(rotation = 90)
plt.show()

Of the 6131 movies on Netflix, the USA produced 2058, which is 34%. India follows as the second-largest movie producer with 15% of movies. Together, nearly 50% of the movies on Netflix appear to be aimed at American and Indian consumers.

In [50]:
df.columns

In [51]:
# Assess the major producing countries by category
df3= df.copy()

# -Create a function to change data from string to a list
def update_column_string_to_list(column_list):
    for each in column_list:
        df3[each] = df3[each].str.split(', ')
        
update_column_string_to_list(['country', 'category'])

# Explode both 'country' and 'Category' columns
df_exploded = df3.explode('country').explode('category')

# Group by 'country' and 'Category', then count occurrences
df_counts = df_exploded.groupby(['country', 'category']).size().reset_index(name='count')

# Keep only pairs with more than 16 titles (hard-coded threshold, intended as the mean count)
df_counts = df_counts[df_counts['count'] > 16]

# Pivot the dataframe to create a matrix for heatmap (recent pandas requires keyword arguments)
pivot_df = df_counts.pivot(index="category", columns="country", values="count")

# Plot heatmap with a different color palette
plt.figure(figsize=(14, 10))
sns.heatmap(pivot_df, cmap="RdYlBu", annot=False, linewidths=.5)
plt.title('Counts of Category by Country')
plt.xlabel('Country')
plt.ylabel('Category')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

It is evident that the largest producers are India and the United States, specifically in the categories Dramas, International Movies and Comedies.

In [52]:
#Assess the most common titles
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
top_titles = df['title'].value_counts().head(10).rename_axis('Title').reset_index(name='Count')
top_titles

In [53]:
plt.figure(figsize = (10,6))
sns.barplot(data = top_titles, x = 'Title' , y = 'Count', palette = 'Set2')
plt.title('Top 10 Most Common Titles')
plt.xlabel('Title')
plt.ylabel('Number of Appearances')
plt.xticks(rotation = 90)
plt.show()

In [54]:
#Assess the most common cast members
# Drop 'Unknown' first, then split the comma-separated cast lists so each member is counted individually
cast_members = df.loc[df['cast'] != 'Unknown', 'cast'].str.split(', ').explode()
top_cast = cast_members.value_counts().head(10).rename_axis('Cast').reset_index(name='Count')
top_cast

In [55]:
plt.figure(figsize = (10,6))
sns.barplot(data = top_cast, x = 'Cast' , y = 'Count', palette = 'Set3')
plt.title('Top 10 Most Common Cast Members')
plt.xlabel('Cast Member')
plt.ylabel('Number of Appearances')
plt.xticks(rotation = 90)
plt.show()

In [56]:
#Assess the year content is added to Netflix

plt.figure(figsize = (10,6))
sns.countplot(data = df, x = 'date_added_year' , order = df['date_added_year'].value_counts().index)
plt.title('Distribution of years content has been added to Netflix')
plt.xlabel('Year Added')
plt.ylabel('Quantity of Content')
plt.xticks(rotation = 90)
plt.show()

Most content was added to Netflix in 2019, so about a quarter of the catalogue is two years old. If our streaming service delivers more current content, we can attract and retain Netflix's audience as well as reach a wider new audience.

In [57]:
#Assess the month content is added to Netflix

plt.figure(figsize = (10,6))
sns.countplot(data = df, x = 'date_added_month' , order = df['date_added_month'].value_counts().index)
plt.title('Distribution of the month content has been added to Netflix')
plt.xlabel('Month Added')
plt.ylabel('Quantity of Content')
plt.xticks(rotation = 90)
plt.show()

Most content has been added in July, possibly in time for the school holidays to entertain children. The second most common month is December, the month of festivities and Christmas, a popular time for festive content to be added. February is the month in which Netflix adds the least content. Together with Valentine's Day, this gives our service an opening in the market to fill.
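
The busiest and quietest months can also be read off programmatically rather than from the chart. The sketch below uses a short made-up list of month numbers in place of `df['date_added_month']`. The values are illustrative only and were chosen so that the result mirrors the pattern described above.

```python
import calendar
from collections import Counter

# Made-up month numbers standing in for df['date_added_month'] (illustrative only).
months_added = [7, 7, 12, 12, 7, 2, 9, 12, 7, 9]

# Count how many titles were added in each month.
counts = Counter(months_added)
busiest = max(counts, key=counts.get)
quietest = min(counts, key=counts.get)

print("Busiest month:", calendar.month_name[busiest])
print("Quietest month:", calendar.month_name[quietest])
```

On the real column, `df['date_added_month'].value_counts()` with `idxmax()` and `idxmin()` gives the same two answers.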

In [58]:
sns.pairplot(df)

In [59]:
plt.figure(figsize = (10,6))
sns.histplot(data = df, x='release_year', bins=50)
plt.title('Distribution of release years, ordered by year')
plt.xlabel('Year Content was Released')
plt.ylabel('Quantity of Content')
plt.xticks(rotation = 90)
plt.show()


In [60]:
#Assess the year content was released to the public
plt.figure(figsize = (10,6))
sns.countplot(data = df, x = 'release_year' , order = df['release_year'].value_counts().index)
plt.title('Distribution of release years, ordered by amount of content released per year')
plt.xlabel('Year Content was Released')
plt.ylabel('Quantity of Content')
plt.xticks(rotation = 90)
plt.show()

Netflix contains some older content, released in 2000 or earlier, but most of its content was released in 2010 or later. This suggests that the audience is younger and prefers relevant, current content. Netflix's main content is rated TV-MA, however, which is another gap our service can fill.

In [61]:
type_release_year = df[['type', 'release_year']]
type_releaseyear =type_release_year.groupby(['type', 'release_year']).size().reset_index(name='Total Count').sort_values(by = 'Total Count', ascending = False)
type_releaseyear.head(20)

In [62]:
plt.figure(figsize = (10,6))
sns.lineplot(data = type_releaseyear.head(100), x = 'release_year', y = 'Total Count', hue = 'type')
plt.show()

In [63]:
plt.figure(figsize = (10,6))
sns.boxplot(data=df[['release_year']])
plt.show()

In [64]:
df['release_year'].describe()

As can be seen above, most releases fall between 2013 and 2019 (Q1 to Q3), so the titles in the long tail released before 2000 are treated as outliers.
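
For reference, a common convention places the outlier fences at 1.5 times the interquartile range beyond the quartiles. The sketch below applies that convention to the quartiles quoted above. The 1.5 × IQR rule is an assumption here, since the notebook does not state which rule its cut-off of 2000 is based on.

```python
# Quartiles quoted in the text above (from df['release_year'].describe()).
q1, q3 = 2013, 2019

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this count as low outliers
upper_fence = q3 + 1.5 * iqr    # values above this count as high outliers

def is_outlier(year):
    """Return True when a release year falls outside the 1.5 * IQR fences."""
    return year < lower_fence or year > upper_fence

print(iqr, lower_fence, upper_fence)
print(is_outlier(1999), is_outlier(2005))
```

Under this convention the lower fence sits at 2004, so the cut-off of 2000 used in this notebook is slightly more conservative: every title released before 2000 also lies below the fence.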

In [65]:
#The oldest released Titles on Netflix
outlier_release_year = df.sort_values("release_year", ascending = True)
outlier_release_year = outlier_release_year[outlier_release_year['duration'] != ""]
outlier_release_year = outlier_release_year[outlier_release_year["release_year"]<2000]
outlier_release_year[['title', "release_year"]]

In [66]:
outlier_release_year.shape

Consequently, 525 of the 8807 titles are outliers.

In [67]:
plt.figure(figsize = (10,6))
sns.boxplot(data=df, x= 'release_year', y='type', hue='type')

plt.show()

In [68]:
mov = df['release_year'][df['type'] == 'Movie']
tvs = df['release_year'][df['type'] == 'Tv show']
print (mov.describe(), tvs.describe())

In [69]:

''' can't use because it is not a normal distribution, therefore it doesn't comply with this method'''
'''tvs_mean = tvs.mean()
tvs_std = tvs.std()
tvs_q1 = tvs_mean - 3*tvs_std
tvs_q3 = tvs_mean + 3*tvs_std
print (tvs_mean, tvs_std, tvs_q1, tvs_q3)'''

As indicated in the box plot above, TV Shows on Netflix have release years closer to the present, making them more relevant. Most TV Shows were released between 2016 and 2020, with a few outliers before 2016. TV Shows vary less in release year, whilst movies are more spread out, as shown by the size of the boxes and whiskers in the box plot. Movies were most commonly released between 2012 and 2018, with outliers released between 1942 and 2012. The oldest title is a TV show released in 1925.

In [70]:
df3= df.copy()

# -Create a function to change data from string to a list
def update_column_string_to_list(column_list):
    for each in column_list:
        df3[each] = df3[each].str.split(', ')
        
update_column_string_to_list(['country', 'category'])

# Explode both 'country' and 'Category' columns
df_exploded = df3.explode('country').explode('category')

# Group by 'country' and 'Category', then count occurrences
df_counts = df_exploded.groupby(['country', 'category']).size().reset_index(name='count')

# Keep only pairs with more than 16 titles (hard-coded threshold, intended as the mean count)
df_counts = df_counts[df_counts['count'] > 16]

# Pivot the dataframe to create a matrix for heatmap (recent pandas requires keyword arguments)
pivot_df = df_counts.pivot(index="category", columns="country", values="count")

# Plot heatmap with a different color palette
plt.figure(figsize=(14, 10))
sns.heatmap(pivot_df, cmap="RdYlBu", annot=False, linewidths=.5)
plt.title('Counts of Category by Country')
plt.xlabel('Country')
plt.ylabel('Category')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

From the heat map it is evident that the most common content is International Movies and Dramas produced in India, and Dramas and Comedies produced in the United States. These countries produce the highest counts of this content. They are therefore proposed as the target audiences and producers.

In [71]:
#category distribution 
#df['category'] = df['category'].str.split(',')
genre_count_df = df3.explode('category').groupby('category')['title'].count().reset_index(name='count')
plt.figure(figsize=(10, 6))
sns.barplot(data=genre_count_df, x='category', y='count', color='blue')
plt.title('Category Count')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()


In [72]:
#Assess the most common categories
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
top_categories = df['category'].value_counts().head(10).rename_axis('Category').reset_index(name='Count')
top_categories

In [73]:
plt.figure(figsize = (10,6))
sns.barplot(data = top_categories, x = 'Category' , y = 'Count', palette = 'Set1')
plt.title('Top 10 Most Common Categories')
plt.xlabel('Category')
plt.ylabel('Number of Appearances')
plt.xticks(rotation = 90)
plt.show()

In [74]:
#Assess category by type
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
categories_movie = df['category'][df['type'] == 'Movie'].value_counts().sort_values(ascending=False).head(10).rename_axis('Category').reset_index(name='Count')
categories_movie

In [75]:
plt.figure(figsize = (10,6))
sns.barplot(data = categories_movie, x = 'Category' , y = 'Count', palette = 'Set1')
plt.title('Top 10 Most Common Categories for Movies')
plt.xlabel('Categories')
plt.ylabel('Number of Appearances')
plt.xticks(rotation = 90)
plt.show()

In [76]:
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions
categories_tvshow = df['category'][df['type'] == 'Tv show'].value_counts().sort_values(ascending=False).head(10).rename_axis('Category').reset_index(name='Count')
categories_tvshow

In [77]:
plt.figure(figsize = (10,6))
sns.barplot(data = categories_tvshow, x = 'Category' , y = 'Count', palette = 'Set1')
plt.title('Top 10 Most Common Categories for TV Shows')
plt.xlabel('Categories')
plt.ylabel('Number of Appearances')
plt.xticks(rotation = 90)
plt.show()

In [78]:

#Assess most common Words
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import string
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
 
nltk.download('stopwords')
 
all_descriptions = ' '.join(df3['description'])
 
words = word_tokenize(all_descriptions)
 
stop_words = set(stopwords.words('english'))
 
# Lower-case each kept word so that 'Life' and 'life' are counted together
filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]
 
word_freq = FreqDist(filtered_words)
 
common_words = word_freq.most_common()
common_words_filtered = [(word, freq) for word, freq in common_words if freq > 100]
 
print(common_words_filtered)


In [79]:
#display common words, sized by how often they occur
word_counts = {}
for word, freq in common_words:
    # merge counts that differ only by letter case
    word_counts[word.lower()] = word_counts.get(word.lower(), 0) + freq

# generate_from_frequencies keeps the counts; max_words limits the cloud to the top 100 words
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate_from_frequencies(word_counts)
plt.figure(figsize = (20,10))
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()

# Recommendations:

Audience: Netflix's largest audiences are based on United States and Indian productions. We recommend matching these audiences initially and then expanding as the brand becomes better known.

Most content carries a TV-MA or TV-14 rating, which suggests that the Netflix audience is older. Ideally, a further assessment that includes audience age is required.

One apparent gap in audience is that the largest category for short-form content (TV Shows) is Kids' TV. We recommend increasing TV-MA content in this format, which Netflix appears to lack.

Findings for each type of content:
* TV Shows
    * Most common producers: the United States is the largest producer of TV Shows, leading the categories TV Comedies, Kids' TV and Docuseries.
    * Most common categories: Kids' TV, International TV Shows and Dramas.
    * Most common release years: 2016 - 2020.
    * Most common year added: 2019.

* Movies
    * Most common producers: the United States and India.
    * Most common categories: Dramas, International Movies and Documentaries, followed by Stand-up Comedies.
    * Most common release years: 2012 - 2018.
    * Most common year added: 2019.

Launch with more recent releases than Netflix, which tends to add older releases. This will attract both Netflix's audience and a new audience.

Keep additions up to date to retain the audience. Netflix added a large amount of content in 2019, but additions in the following years were lower. If this continues, we will likely be able to win over the Netflix audience and keep them for longer.

Use a recommendation system to suggest titles to customers based on the common words and most common categories.
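
As a rough illustration of what a category-based recommender could look like, the sketch below ranks titles by how much their category sets overlap (Jaccard similarity). Everything in it is hypothetical: the titles, the tiny `catalogue` dictionary and the function names are placeholders, and a production system would likely also use the description words and viewing behaviour.

```python
def jaccard(a, b):
    """Share of categories two titles have in common (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical mini-catalogue: title -> set of categories.
catalogue = {
    "Title A": {"Dramas", "International Movies"},
    "Title B": {"Dramas", "Comedies"},
    "Title C": {"Kids' TV"},
    "Title D": {"Dramas", "International Movies", "Comedies"},
}

def recommend(liked, catalogue, top_n=2):
    """Return up to top_n titles whose categories overlap most with the liked title."""
    liked_cats = catalogue[liked]
    scores = [
        (title, jaccard(liked_cats, cats))
        for title, cats in catalogue.items()
        if title != liked
    ]
    # Highest overlap first; titles with no overlap are dropped.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, score in scores[:top_n] if score > 0]

print(recommend("Title A", catalogue))
```

For a viewer who liked "Title A", the sketch suggests "Title D" first and "Title B" second, because they share the most categories with it.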

Launch in February, when Netflix adds the least content.

By month, most content has been added in July, possibly in time for the school holidays to entertain children. The second most common month is December, the month of festivities and Christmas, a popular time for festive content to be added. We recommend adding content during the preceding months (June - July and November - December) to keep the audience for longer and to help them plan the use of your viewing system into this busy period. Let users get excited about and 'want' the content, so that they plan it into their lives.

We also recommend adding content around other festive periods, such as Halloween, to attract more viewing. Netflix does not appear to do this, so a larger audience could be attracted to your viewing system.
