# All the information is on the task(master fan wiki)

In this notebook we use Python's BeautifulSoup module to scrape the wording of all the tasks in the popular TV show Taskmaster from its fan wiki:

https://taskmaster.fandom.com/wiki/Taskmaster_Wiki

with a view to analysing the wording of the tasks.

In [32]:
# Usual imports

from bs4 import BeautifulSoup
import requests
import numpy as np

from wordcloud import WordCloud
import plotly.express as px
from pprint import pprint
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import pandas as pd

## Series 16

As a first step, we scrape the tasks for series 16, which at the time of scraping is the most recent full series to be broadcast in the UK.

In [2]:
url =  'https://taskmaster.fandom.com/wiki/Series_16'

In [3]:
page = requests.get(url)

In [4]:
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
# prettify is a method, so it has to be called to get the formatted HTML
print(soup.prettify())

In [None]:
soup.title.string

In [None]:
soup.a

In [7]:
all_starts = list(soup.find_all("tr", class_='tmtablerow'))
len(all_starts)

58

In [None]:
list(all_starts[3])

## All the tasks in a single list

In [8]:
all_tasks = []
for row in all_starts:
    cells = list(row)
    # Rows with more than two child nodes keep the task cell at index 3;
    # shorter rows keep it at index 1
    all_tasks.append(str(cells[3] if len(cells) > 2 else cells[1]))

pprint(all_tasks)

["<td><b>Prize:</b> Most wonderful wooden thing that they've owned for a "
 'while.\n'
 '</td>',
 '<td>Build a tower out of the cans in the lab. You must put on your blindfold '
 'in this room and wear it properly for the rest of the task.\n'
 '</td>',
 '<td><b>Team:</b> Connect the most individual parts of one person to '
 'individual parts of another person. All members of your team must be '
 'connected.\n'
 '</td>',
 '<td><b>Team:</b> Cross the finish line with all connections still '
 'connected.\n'
 '</td>',
 '<td>Get the duck into the lake. You must not touch the beak. If the duck '
 'leaves the course, it must re-enter at the point it left the course. If your '
 'duck touches the boundary or a flamingo or a pineapple, one minute will be '
 'added to your time.\n'
 '</td>',
 '<td><b>Live:</b> Say whether you think the next item is heavier or lighter '
 'than the previous item. If you are wrong, you are eliminated.\n'
 '</td>',
 '<td><b>Prize:</b> Best sign.\n</td>',
 '<td>Pull t

In [9]:
# Drop the last 7 characters of each cell, normally the final full stop plus '\n</td>'

all_tasks = [task[:-7] for task in all_tasks]
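
The fixed slice `[:-7]` assumes that every cell ends with a full stop followed by `\n</td>`. As a hedged alternative, the sketch below removes the closing tag and then a final full stop only when one is present, so a cell holding something shorter keeps its content. The helper name `strip_cell_end` and the second sample string are illustrative assumptions, and `str.removesuffix` needs Python 3.9 or later.

```python
def strip_cell_end(cell: str) -> str:
    """Drop a trailing '</td>', the whitespace around it, and one final full stop."""
    text = cell.rstrip()               # trailing whitespace after the tag, if any
    text = text.removesuffix("</td>")  # the closing tag, if present
    text = text.rstrip()               # the newline that precedes the tag
    return text.removesuffix(".")      # a single final full stop, if present


samples = [
    "<td><b>Prize:</b> Best sign.\n</td>",
    "<td>-\n</td>",  # hypothetical placeholder cell with no full stop
]
cleaned = [strip_cell_end(cell) for cell in samples]
print(cleaned)
```

Because each step only removes a suffix that is really there, the result does not depend on every cell having the same ending.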

In [10]:
all_tasks

["<td><b>Prize:</b> Most wonderful wooden thing that they've owned for a while",
 '<td>Build a tower out of the cans in the lab. You must put on your blindfold in this room and wear it properly for the rest of the task',
 '<td><b>Team:</b> Connect the most individual parts of one person to individual parts of another person. All members of your team must be connected',
 '<td><b>Team:</b> Cross the finish line with all connections still connected',
 '<td>Get the duck into the lake. You must not touch the beak. If the duck leaves the course, it must re-enter at the point it left the course. If your duck touches the boundary or a flamingo or a pineapple, one minute will be added to your time',
 '<td><b>Live:</b> Say whether you think the next item is heavier or lighter than the previous item. If you are wrong, you are eliminated',
 '<td><b>Prize:</b> Best sign',
 '<td>Pull the sword from the stone. You may not force the sword or break the stone',
 '<td>Make a cheeky picture on this piece 

## Extracting the Prize Tasks

In [None]:
# Keep only the tasks labelled as prize tasks
prize_tasks = [task for task in all_tasks if 'Prize' in task]
        
prize_tasks

In [None]:
# Drop the 18-character prefix '<td><b>Prize:</b> '
prize_tasks = [task[18:] for task in prize_tasks]
prize_tasks

In [None]:
prize_task_lengths = [len(task.split()) for task in prize_tasks]

In [None]:
prize_task_lengths

In [None]:
np.array(prize_task_lengths).mean()

In [None]:
np.array(prize_task_lengths).var()
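
Note that `ndarray.var()` returns the population variance by default (`ddof=0`). A minimal sketch with the standard library's `statistics` module, using made-up word counts rather than the scraped values, shows how that differs from the sample variance:

```python
import statistics

# Hypothetical word counts, not the values scraped above
lengths = [2, 4, 4, 6]

mean_length = statistics.mean(lengths)
population_var = statistics.pvariance(lengths)  # divides by n, like numpy's default var()
sample_var = statistics.variance(lengths)       # divides by n - 1, like var(ddof=1)

print(mean_length, population_var, sample_var)
```

With only a handful of tasks per series the two variances can differ noticeably, so it is worth stating which one is being reported.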

## Live tasks

In [None]:
# Keep only the tasks labelled as live tasks (this also catches 'Team Live:')
live_tasks = [task for task in all_tasks if 'Live:' in task]
        
live_tasks

In [None]:
live_tasks[3][22:]

In [None]:
# Team live tasks start with '<td><b>Team Live:</b> ', which is 22 characters long
for i in range(len(live_tasks)):
    if 'Team Live:' in live_tasks[i]:
        live_tasks[i] = live_tasks[i][22:]
        
live_tasks

In [None]:
live_tasks[0][17:]

In [None]:
# The remaining live tasks start with '<td><b>Live:</b> ', which is 17 characters long
for i in range(len(live_tasks)):
    if 'Live:' in live_tasks[i]:
        live_tasks[i] = live_tasks[i][17:]
    
live_tasks

In [None]:
len(live_tasks)

In [None]:
live_task_lengths = [len(task.split()) for task in live_tasks]

In [None]:
live_task_lengths

In [None]:
np.array(live_task_lengths).mean()

In [None]:
np.array(live_task_lengths).var()
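
The Prize and Live sections repeat the same filter-and-slice pattern. As a sketch of one way to generalise it, the code below reads the bold label at the start of each cell with a regular expression and groups the task text by label. The pattern, the `'Solo'` key for unlabelled tasks and the third sample string are assumptions made for illustration; the input is expected to have had its trailing `\n</td>` removed already.

```python
import re
from collections import defaultdict

# An optional '<td>', then an optional bold label such as '<b>Team Live:</b> ', then the text
LABEL_RE = re.compile(
    r"^(?:<td>)?(?:<b>(?P<label>[^<:]+):</b>\s*)?(?P<text>.*)$", re.DOTALL
)

sample_tasks = [
    "<td><b>Prize:</b> Best sign",
    "<td>Pull the sword from the stone",
    "<td><b>Team Live:</b> Hypothetical team live task",
]

tasks_by_label = defaultdict(list)
for task in sample_tasks:
    match = LABEL_RE.match(task)
    label = match.group("label") or "Solo"  # made-up key for tasks without a label
    tasks_by_label[label].append(match.group("text"))

# Word count of each task, grouped by label
lengths_by_label = {
    label: [len(text.split()) for text in texts]
    for label, texts in tasks_by_label.items()
}
print(lengths_by_label)
```

Grouping once means the mean and variance of the word counts can be computed for every task type from the same dictionary.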

## Cleaning up the full list

In [None]:
all_tasks

In [11]:
all_tasks[0][4:]

"<b>Prize:</b> Most wonderful wooden thing that they've owned for a while"

In [12]:
# Getting rid of the <td>s from the beginnings

all_tasks = [task[4:] for task in all_tasks]
all_tasks

["<b>Prize:</b> Most wonderful wooden thing that they've owned for a while",
 'Build a tower out of the cans in the lab. You must put on your blindfold in this room and wear it properly for the rest of the task',
 '<b>Team:</b> Connect the most individual parts of one person to individual parts of another person. All members of your team must be connected',
 '<b>Team:</b> Cross the finish line with all connections still connected',
 'Get the duck into the lake. You must not touch the beak. If the duck leaves the course, it must re-enter at the point it left the course. If your duck touches the boundary or a flamingo or a pineapple, one minute will be added to your time',
 '<b>Live:</b> Say whether you think the next item is heavier or lighter than the previous item. If you are wrong, you are eliminated',
 '<b>Prize:</b> Best sign',
 'Pull the sword from the stone. You may not force the sword or break the stone',
 'Make a cheeky picture on this piece of wood using nails and one continuo

In [13]:
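# Strip the bold labels; each slice length is the length of the label markup,
# e.g. len('<b>Prize:</b> ') == 14 and len('<b>Team Live:</b> ') == 18
for i in range(len(all_tasks)):
    if 'Prize:' in all_tasks[i]:
        all_tasks[i] = all_tasks[i][14:]
    if 'Team Live' in all_tasks[i]:
        all_tasks[i] = all_tasks[i][18:]
    if 'Team' in all_tasks[i]:  # '<b>Team:</b> ' is 13 characters
        all_tasks[i] = all_tasks[i][13:]
    if 'Live' in all_tasks[i]:  # '<b>Live:</b> ' is 13 characters
        all_tasks[i] = all_tasks[i][13:]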
        
all_tasks

["Most wonderful wooden thing that they've owned for a while",
 'Build a tower out of the cans in the lab. You must put on your blindfold in this room and wear it properly for the rest of the task',
 'Connect the most individual parts of one person to individual parts of another person. All members of your team must be connected',
 'Cross the finish line with all connections still connected',
 'Get the duck into the lake. You must not touch the beak. If the duck leaves the course, it must re-enter at the point it left the course. If your duck touches the boundary or a flamingo or a pineapple, one minute will be added to your time',
 'Say whether you think the next item is heavier or lighter than the previous item. If you are wrong, you are eliminated',
 'Best sign',
 'Pull the sword from the stone. You may not force the sword or break the stone',
 'Make a cheeky picture on this piece of wood using nails and one continuous piece of wire. Also, if any egg timers stop, you must stare at t
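
The `in` checks in the loop above match a label anywhere in the task, and the four `if` statements run one after another, so a task whose wording happened to contain `Team` or `Live` would be sliced even without a label. A minimal sketch of an anchored version follows; the helper name `strip_label` and the third sample string are assumptions for illustration.

```python
# Label prefixes as they appear once the leading '<td>' has been removed
PREFIXES = ("<b>Prize:</b> ", "<b>Team Live:</b> ", "<b>Team:</b> ", "<b>Live:</b> ")


def strip_label(task: str) -> str:
    """Remove one known label from the start of a task, if there is one."""
    for prefix in PREFIXES:
        if task.startswith(prefix):
            return task[len(prefix):]
    return task


samples = [
    "<b>Prize:</b> Best sign",
    "<b>Team:</b> Cross the finish line with all connections still connected",
    "Hypothetical task that mentions Team and Live in its wording",
]
stripped = [strip_label(task) for task in samples]
print(stripped)
```

Taking the slice length from `len(prefix)` also removes the need to count characters by hand.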

In [14]:
all_tasks[55]

'Demonstrate the most effective high-intensity four-part exercise routine. Each of your four exercises must be original and must take place on this mat. The first must last eight seconds, the second four, the third two and the fourth one second long.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup'

In [15]:
# Weird issue with task 55: find every task that still contains markup

for i in range(len(all_tasks)):
    if '<' in all_tasks[i]:
        print(i)
    

55


In [16]:
# Drop the final 75 characters: the full stop and the footnote markup after it
all_tasks[55] = all_tasks[55][:-75]

In [17]:
all_tasks[55]

'Demonstrate the most effective high-intensity four-part exercise routine. Each of your four exercises must be original and must take place on this mat. The first must last eight seconds, the second four, the third two and the fourth one second long'
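
The slice `[:-75]` depends on the exact length of one footnote. A hedged alternative is to cut everything from the first `<sup` onwards with a regular expression, which also copes with the closing tag having lost its final `>` in the earlier slice. The pattern and the helper name `drop_footnote` are assumptions, and the sample is a shortened version of the task text.

```python
import re

# An optional full stop, then everything from the opening '<sup' to the end of the string
FOOTNOTE_RE = re.compile(r"\.?<sup\b.*$", re.DOTALL)


def drop_footnote(task: str) -> str:
    """Remove a trailing footnote marker such as '<sup ...>[2]</a></sup'."""
    return FOOTNOTE_RE.sub("", task)


sample = (
    'one second long.<sup class="reference" id="cite_ref-2">'
    '<a href="#cite_note-2">[2]</a></sup'
)
print(drop_footnote(sample))
print(drop_footnote("Best sign"))
```

Tasks without a footnote pass through unchanged, so the function can be applied to the whole list rather than to index 55 alone.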

In [18]:
for i in range(len(all_tasks)):
    if all_tasks[i] == '-':
        print(i)
    


In [19]:
all_tasks[32]

''

In [20]:
len(all_tasks)

58

## NLP cleaning up

In [21]:
tasks_split = [task.split() for task in all_tasks]
tasks_split

[['Most',
  'wonderful',
  'wooden',
  'thing',
  'that',
  "they've",
  'owned',
  'for',
  'a',
  'while'],
 ['Build',
  'a',
  'tower',
  'out',
  'of',
  'the',
  'cans',
  'in',
  'the',
  'lab.',
  'You',
  'must',
  'put',
  'on',
  'your',
  'blindfold',
  'in',
  'this',
  'room',
  'and',
  'wear',
  'it',
  'properly',
  'for',
  'the',
  'rest',
  'of',
  'the',
  'task'],
 ['Connect',
  'the',
  'most',
  'individual',
  'parts',
  'of',
  'one',
  'person',
  'to',
  'individual',
  'parts',
  'of',
  'another',
  'person.',
  'All',
  'members',
  'of',
  'your',
  'team',
  'must',
  'be',
  'connected'],
 ['Cross',
  'the',
  'finish',
  'line',
  'with',
  'all',
  'connections',
  'still',
  'connected'],
 ['Get',
  'the',
  'duck',
  'into',
  'the',
  'lake.',
  'You',
  'must',
  'not',
  'touch',
  'the',
  'beak.',
  'If',
  'the',
  'duck',
  'leaves',
  'the',
  'course,',
  'it',
  'must',
  're-enter',
  'at',
  'the',
  'point',
  'it',
  'left',
  'the',
 

In [22]:
task_words = []
for task in tasks_split:
    task_words += task

task_words

['Most',
 'wonderful',
 'wooden',
 'thing',
 'that',
 "they've",
 'owned',
 'for',
 'a',
 'while',
 'Build',
 'a',
 'tower',
 'out',
 'of',
 'the',
 'cans',
 'in',
 'the',
 'lab.',
 'You',
 'must',
 'put',
 'on',
 'your',
 'blindfold',
 'in',
 'this',
 'room',
 'and',
 'wear',
 'it',
 'properly',
 'for',
 'the',
 'rest',
 'of',
 'the',
 'task',
 'Connect',
 'the',
 'most',
 'individual',
 'parts',
 'of',
 'one',
 'person',
 'to',
 'individual',
 'parts',
 'of',
 'another',
 'person.',
 'All',
 'members',
 'of',
 'your',
 'team',
 'must',
 'be',
 'connected',
 'Cross',
 'the',
 'finish',
 'line',
 'with',
 'all',
 'connections',
 'still',
 'connected',
 'Get',
 'the',
 'duck',
 'into',
 'the',
 'lake.',
 'You',
 'must',
 'not',
 'touch',
 'the',
 'beak.',
 'If',
 'the',
 'duck',
 'leaves',
 'the',
 'course,',
 'it',
 'must',
 're-enter',
 'at',
 'the',
 'point',
 'it',
 'left',
 'the',
 'course.',
 'If',
 'your',
 'duck',
 'touches',
 'the',
 'boundary',
 'or',
 'a',
 'flamingo',
 'or',
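
Flattening with repeated `+=` works well at this size. For reference, `itertools.chain.from_iterable` does the same job in one expression; the sketch below uses two short stand-in lists rather than the real `tasks_split`.

```python
from itertools import chain

# Two short token lists standing in for tasks_split
tasks_split_sample = [["Best", "sign"], ["Pull", "the", "sword"]]

# chain.from_iterable walks each inner list in turn and yields one flat sequence
task_words_sample = list(chain.from_iterable(tasks_split_sample))
print(task_words_sample)
```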

In [23]:
tasks_lower = [task.lower() for task in task_words]
tasks_lower

['most',
 'wonderful',
 'wooden',
 'thing',
 'that',
 "they've",
 'owned',
 'for',
 'a',
 'while',
 'build',
 'a',
 'tower',
 'out',
 'of',
 'the',
 'cans',
 'in',
 'the',
 'lab.',
 'you',
 'must',
 'put',
 'on',
 'your',
 'blindfold',
 'in',
 'this',
 'room',
 'and',
 'wear',
 'it',
 'properly',
 'for',
 'the',
 'rest',
 'of',
 'the',
 'task',
 'connect',
 'the',
 'most',
 'individual',
 'parts',
 'of',
 'one',
 'person',
 'to',
 'individual',
 'parts',
 'of',
 'another',
 'person.',
 'all',
 'members',
 'of',
 'your',
 'team',
 'must',
 'be',
 'connected',
 'cross',
 'the',
 'finish',
 'line',
 'with',
 'all',
 'connections',
 'still',
 'connected',
 'get',
 'the',
 'duck',
 'into',
 'the',
 'lake.',
 'you',
 'must',
 'not',
 'touch',
 'the',
 'beak.',
 'if',
 'the',
 'duck',
 'leaves',
 'the',
 'course,',
 'it',
 'must',
 're-enter',
 'at',
 'the',
 'point',
 'it',
 'left',
 'the',
 'course.',
 'if',
 'your',
 'duck',
 'touches',
 'the',
 'boundary',
 'or',
 'a',
 'flamingo',
 'or',

In [24]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [25]:
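# Removing punctuation marks
# NB: str.replace treats string.punctuation as one 32-character substring,
# so this leaves tokens such as 'lab.' unchanged
tasks_no_punc = [word.replace(string.punctuation, '') for word in tasks_lower]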
tasks_no_punc

['most',
 'wonderful',
 'wooden',
 'thing',
 'that',
 "they've",
 'owned',
 'for',
 'a',
 'while',
 'build',
 'a',
 'tower',
 'out',
 'of',
 'the',
 'cans',
 'in',
 'the',
 'lab.',
 'you',
 'must',
 'put',
 'on',
 'your',
 'blindfold',
 'in',
 'this',
 'room',
 'and',
 'wear',
 'it',
 'properly',
 'for',
 'the',
 'rest',
 'of',
 'the',
 'task',
 'connect',
 'the',
 'most',
 'individual',
 'parts',
 'of',
 'one',
 'person',
 'to',
 'individual',
 'parts',
 'of',
 'another',
 'person.',
 'all',
 'members',
 'of',
 'your',
 'team',
 'must',
 'be',
 'connected',
 'cross',
 'the',
 'finish',
 'line',
 'with',
 'all',
 'connections',
 'still',
 'connected',
 'get',
 'the',
 'duck',
 'into',
 'the',
 'lake.',
 'you',
 'must',
 'not',
 'touch',
 'the',
 'beak.',
 'if',
 'the',
 'duck',
 'leaves',
 'the',
 'course,',
 'it',
 'must',
 're-enter',
 'at',
 'the',
 'point',
 'it',
 'left',
 'the',
 'course.',
 'if',
 'your',
 'duck',
 'touches',
 'the',
 'boundary',
 'or',
 'a',
 'flamingo',
 'or',

In [26]:
tasks_no_punc = []

for word in tasks_lower:
    tasks_no_punc.append(word.replace(string.punctuation, ''))
    
tasks_no_punc

['most',
 'wonderful',
 'wooden',
 'thing',
 'that',
 "they've",
 'owned',
 'for',
 'a',
 'while',
 'build',
 'a',
 'tower',
 'out',
 'of',
 'the',
 'cans',
 'in',
 'the',
 'lab.',
 'you',
 'must',
 'put',
 'on',
 'your',
 'blindfold',
 'in',
 'this',
 'room',
 'and',
 'wear',
 'it',
 'properly',
 'for',
 'the',
 'rest',
 'of',
 'the',
 'task',
 'connect',
 'the',
 'most',
 'individual',
 'parts',
 'of',
 'one',
 'person',
 'to',
 'individual',
 'parts',
 'of',
 'another',
 'person.',
 'all',
 'members',
 'of',
 'your',
 'team',
 'must',
 'be',
 'connected',
 'cross',
 'the',
 'finish',
 'line',
 'with',
 'all',
 'connections',
 'still',
 'connected',
 'get',
 'the',
 'duck',
 'into',
 'the',
 'lake.',
 'you',
 'must',
 'not',
 'touch',
 'the',
 'beak.',
 'if',
 'the',
 'duck',
 'leaves',
 'the',
 'course,',
 'it',
 'must',
 're-enter',
 'at',
 'the',
 'point',
 'it',
 'left',
 'the',
 'course.',
 'if',
 'your',
 'duck',
 'touches',
 'the',
 'boundary',
 'or',
 'a',
 'flamingo',
 'or',
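
`str.replace` searches for its first argument as one contiguous substring, so passing the whole of `string.punctuation` only matches a token that contains all 32 characters in that order. That is why tokens such as `lab.` come through unchanged in both versions above. Two standard-library ways of removing punctuation are sketched below on a few tokens from the list; which one suits depends on whether internal marks (the apostrophe in `they've`, the hyphen in `re-enter`) should survive.

```python
import string

tokens = ["lab.", "course,", "they've", "re-enter", "task"]

# Option 1: trim punctuation from both ends only, keeping internal marks
trimmed = [token.strip(string.punctuation) for token in tokens]

# Option 2: delete every punctuation character wherever it occurs
remove_all = str.maketrans("", "", string.punctuation)
deleted = [token.translate(remove_all) for token in tokens]

print(trimmed)
print(deleted)
```

Either option changes which tokens match the stop-word list, so the counts further down would differ from the ones recorded in this run.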

In [27]:
# Reduce the contraction "they've" to 'they' by hand
for i in range(len(tasks_no_punc)):
    if tasks_no_punc[i] == "they've":
        tasks_no_punc[i] = 'they'
        
tasks_no_punc

['most',
 'wonderful',
 'wooden',
 'thing',
 'that',
 'they',
 'owned',
 'for',
 'a',
 'while',
 'build',
 'a',
 'tower',
 'out',
 'of',
 'the',
 'cans',
 'in',
 'the',
 'lab.',
 'you',
 'must',
 'put',
 'on',
 'your',
 'blindfold',
 'in',
 'this',
 'room',
 'and',
 'wear',
 'it',
 'properly',
 'for',
 'the',
 'rest',
 'of',
 'the',
 'task',
 'connect',
 'the',
 'most',
 'individual',
 'parts',
 'of',
 'one',
 'person',
 'to',
 'individual',
 'parts',
 'of',
 'another',
 'person.',
 'all',
 'members',
 'of',
 'your',
 'team',
 'must',
 'be',
 'connected',
 'cross',
 'the',
 'finish',
 'line',
 'with',
 'all',
 'connections',
 'still',
 'connected',
 'get',
 'the',
 'duck',
 'into',
 'the',
 'lake.',
 'you',
 'must',
 'not',
 'touch',
 'the',
 'beak.',
 'if',
 'the',
 'duck',
 'leaves',
 'the',
 'course,',
 'it',
 'must',
 're-enter',
 'at',
 'the',
 'point',
 'it',
 'left',
 'the',
 'course.',
 'if',
 'your',
 'duck',
 'touches',
 'the',
 'boundary',
 'or',
 'a',
 'flamingo',
 'or',
 '

In [28]:
stop_words = set(stopwords.words('english'))

In [29]:
tasks_no_stop = [word for word in tasks_no_punc if word not in stop_words]

In [30]:
len(tasks_no_punc)

1347

In [31]:
len(tasks_no_stop)

722

## Word counts

We find the 10 most frequently used words.

In [33]:
df = pd.DataFrame(tasks_no_stop)
df

Unnamed: 0,0
0,wonderful
1,wooden
2,thing
3,owned
4,build
...,...
717,rope.
718,ball
719,ends
720,"bathtub,"


In [37]:
df[0].value_counts().nlargest(10)

must      43
one       21
may       13
touch      9
thing      8
make       8
best       8
choose     7
task       7
get        7
Name: 0, dtype: int64
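
The same ranking can be obtained without building a DataFrame. A minimal sketch with `collections.Counter`, using a short made-up token list in place of `tasks_no_stop`:

```python
from collections import Counter

# Hypothetical tokens standing in for tasks_no_stop
tokens = ["must", "one", "must", "touch", "must", "one"]

counts = Counter(tokens)
top_two = counts.most_common(2)  # (word, count) pairs, highest count first
print(top_two)
```

`most_common(10)` would give the counterpart of `value_counts().nlargest(10)`.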

## Attempted word cloud

We make a word cloud from the words that remain.

In [None]:
# numpy, pandas and WordCloud are already imported at the top of the notebook
import matplotlib.pyplot as plt

In [None]:
df = pd.DataFrame(tasks_no_stop)
df.head()

In [None]:
df[0].value_counts()

In [None]:
# Join the remaining words into one space-separated string for WordCloud
text = ' '.join(tasks_no_stop)


In [None]:
wordcloud = WordCloud(background_color="white").generate(text)

In [None]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()