# Exam Q3 Nadine Kanbier - 4283724
#### A television producer has approached you with the question whether they should release the new season of their show all at once, like Netflix does, or once a week. As their market research has shown that both release strategies will result in more or less the same ratings, they want to know which release strategy will engage their audiences more; which release strategy will result in more (valuable) discussions. Try to formulate a good operationalization of this question using the methods we discussed in the last three weeks, and argue why this operationalization would be suitable to formulate a substantiated advice for the television producer. Then implement your operationalization using discussions.p (where the column ‘type’ indicates whether a show is released all in once [value ’netflix’] or linearly [value ’linear’]). Try to formulate a substantiated advice for the television producer. If your method doesn’t produce meaningful results, try to formulate suggestions on how to improve the method you proposed instead. Note: you will not be graded on the extent to which your proposed method actually produces valuable results, but on your thought process and argumentation. Don’t try to fine-tune your method until it spits out something interesting.

##### Your answer must consist of the following:
##### • An operationalization of the question (ca. 350 words)
##### • The complete code to answer the question with a short comment for every step (max. 2 sentences per step)
##### • Interpretation and conclusion (ca. 200 words)

#### Operationalization:

Television and social media are deeply intertwined: social network sites such as Twitter or Reddit allow viewers to enjoy the communal experience of group viewing without being physically together. It is therefore important to find out how viewers use these sites while watching a TV show. For producers, the key question is whether the release strategy changes how viewers discuss the show.

Engagement is often measured by its volume and rates. We want to know which release strategy (all at once or one episode per week) produces more **valuable** discussions. So first, we have to define what a valuable discussion is.

A post like: *I love this show. But it's hard to argue against its pace. It's sooooooooooo slooooowww. The writing, the acting, the direction, the production, all of it top notch. That however, doesn't mean we can't complain it's "boring". It's a different kind of boring, but I totally get the critique and I feel the same way. The story takes way too long to develop. It's great to have slower paced TV, but I think this show over does it.* seems like more of a valuable discussion than a post like: *'lol'*.

Therefore, we will compare the number of words in each post per release strategy. We define a post as more valuable the more tokens it contains. Because we only want the meaningful words, we tokenize the posts, lemmatize the tokens, and remove punctuation and stop words.

After this, we will compare the results per release strategy. 
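
As a minimal sketch of this metric, the snippet below counts the non-stop-word tokens in a long and a short post. The tiny `STOP_WORDS` set and the regex tokenizer are simplifying assumptions for illustration only; the analysis below uses spaCy's tokenizer, lemmatizer and stop-word list instead.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch;
# the actual analysis relies on spaCy's built-in English stop words).
STOP_WORDS = {"i", "this", "but", "it", "its", "s", "to", "the", "a", "is", "and"}

def content_token_count(post):
    """Count the word tokens in a post that are not in STOP_WORDS."""
    tokens = re.findall(r"[a-z0-9]+", post.lower())
    return sum(1 for token in tokens if token not in STOP_WORDS)

long_post = "I love this show. But it's hard to argue against its pace."
short_post = "lol"

print(content_token_count(long_post))
print(content_token_count(short_post))
```

Under this definition the longer, more argumentative post scores higher than the one-word reaction, which is the behaviour the operationalization relies on.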

In [87]:
# import dataset
import pandas as pd
import spacy
import numpy as np

df = pd.read_csv('discussions_corrected.csv')
df.iloc[1]

title                                     Better Call Saul
type                                                linear
year                                                  2016
post     I love this show. But it's hard to argue again...
Name: 1, dtype: object

In [67]:
# average number of posts per 'linear' tv show
(df.post[df.type == 'linear'].count())/(df.title[df.type == 'linear'].nunique())

3526.9

In [68]:
# average number of posts per 'netflix' tv show
(df.post[df.type == 'netflix'].count())/(df.title[df.type == 'netflix'].nunique())

1473.1

##### Note: 
The average number of posts per show does not equal engagement per se: it is strongly influenced by the popularity of individual shows such as Game of Thrones, so it does not give a sensible answer on its own. We will therefore tokenize the posts and remove stop words. This way, we can see which release strategy produces longer (and, by our definition, more valuable) posts.
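
The skew described in this note can be checked by counting posts per show before averaging. The sketch below uses a small made-up DataFrame (the titles and counts are hypothetical, not taken from the dataset) to show how one very popular show pulls the mean up while the median is unaffected.

```python
import pandas as pd

# Hypothetical toy data: three linear shows, one of them far more popular.
toy = pd.DataFrame({
    "title": ["Show A"] * 2 + ["Show B"] * 3 + ["Show C"] * 10,
    "type": ["linear"] * 15,
    "post": ["some comment"] * 15,
})

# Number of posts for each show, grouped by release type.
posts_per_show = toy.groupby(["type", "title"])["post"].count()

# Mean and median number of posts per show for each release type.
summary = posts_per_show.groupby(level="type").agg(["mean", "median"])
print(summary)
```

Running the same two steps on `df` instead of `toy` would give this summary for both release types in the real data.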

In [69]:
# Tokenizing the text
import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(str_list):
    # Run every post through the spaCy pipeline.
    processed_texts = list(nlp.pipe(str_list))

    # Keep the lemma of every token that is neither punctuation nor a stop word.
    tokenized_texts = [[word.lemma_ for word in text
                        if not word.is_punct and not word.is_stop]
                       for text in processed_texts]

    return tokenized_texts

In [70]:
# tokenize all posts
tokenized_posts = spacy_tokenizer(df['post'])

In [71]:
df['tokens']=tokenized_posts # tokenized posts in dataframe
df['length'] = df['tokens'].apply(len) # length of tokenized post in dataframe
df.head()

|   | title | type | year | post | tokens | length |
|---|-------|------|------|------|--------|--------|
| 0 | Better Call Saul | linear | 2017 | Walter. And there the chain ends. | [Walter, chain, end] | 3 |
| 1 | Better Call Saul | linear | 2016 | I love this show. But it's hard to argue again... | [love, hard, argue, pace, sooooooooooo, sloooo... | 57 |
| 2 | Better Call Saul | linear | 2017 | What am I missing? A lot of reference to ribs... | [miss, , lot, reference, ribs, burger, Carls,... | 9 |
| 3 | Better Call Saul | linear | 2018 | Oh come on Mike, he's a good little boy. | [oh, come, Mike, good, little, boy] | 6 |
| 4 | Better Call Saul | linear | 2017 | Look again 👀 | [look, 👀] | 2 |


In [72]:
# subsetting
linear = df[df.type == 'linear']
netflix = df[df.type == 'netflix']

In [31]:
linear.length.mean() # average tokens per post when type is linear

10.04661317304148

In [32]:
netflix.length.mean() # average tokens per post when type is netflix

13.537098635530514
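
Because popular shows contribute many more posts, the pooled means above weight each show by its post count. One possible complement, sketched here on hypothetical toy values rather than the real data, is to average post length within each show first and then average those per-show means per release type, so that every show counts equally.

```python
import pandas as pd

# Hypothetical toy data: 'length' stands in for the token count per post.
toy = pd.DataFrame({
    "title": ["Show A", "Show A", "Show A", "Show A", "Show B", "Show C", "Show C"],
    "type": ["linear"] * 5 + ["netflix"] * 2,
    "length": [2, 4, 2, 4, 20, 10, 14],
})

# Pooled mean: every post counts equally (what the cells above compute).
pooled = toy.groupby("type")["length"].mean()

# Per-show mean first, then the mean of those means: every show counts equally.
per_show = toy.groupby(["type", "title"])["length"].mean()
macro = per_show.groupby(level="type").mean()

print(pooled)
print(macro)
```

In the toy data the two averages differ for the linear type because one show has many short posts. If the two averages also diverge on `df`, the pooled comparison is being driven by a few heavily discussed shows.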

#### Conclusion:

Results show that the linear shows in our dataset produce more posts per show on average (3526.9 versus 1473.1), which at first sight suggests more engagement. There is a limitation to this statement, however: the popularity of the TV shows in our dataset varies, and Game of Thrones greatly increased the number of posts. This is why we also looked at the number of words (tokens) in each post.

Our results show that Netflix shows produce more valuable (longer) discussions. The average number of tokens per post (not counting stop words) is higher for the Netflix strategy (about 13.5) than for the linear strategy (about 10.0).

Therefore, in answer to the question of which release strategy will result in more valuable discussions, I would recommend the Netflix strategy. The popularity of the recent show 'The Queen's Gambit' is a good example that further supports this claim: https://www.businessinsider.com/data-shows-netflix-queens-gambit-a-word-of-mouth-hit-2020-11