# GYM EXERCISES AND WORKOUTS DATAFRAME BUILD

## THE BRIDGE DATA SCIENCE BOOTCAMP NOV 2021


Objective of the exercise:
  
  - Take data from at least 3 different sources, using at least 2 different methods, to build a database
  - The chosen methods/sources are:
      - API for exercises:   https://wger.de
      - Web scraping for exercises:   https://bodybuilding.com
      - Web scraping behind username/password for complete workouts:  https://bodybuilding.com (paid content)

## Considerations:

- This project is part of a bootcamp and was done to practise different data extraction and data cleaning methods, rather than as part of a data analysis exercise or a larger machine learning project.

- Therefore, the decisions on which data to take from each source are not always optimal for data analysis; they were made to try different methods and approaches to data extraction and cleaning.
    
 

## DATA SOURCE 1: WGER API

https://wger.de/en/software/api

Open-source fitness API with a catalogue of exercises

In [1]:
import requests as req # for requesting info to the API
import json            # we will get the info in json format
import pandas as pd    # for converting the data into dataframes and transforming it

In [2]:
#get all exercises in English (language=2; 1 is German), raising the results per page to 500
exercises_eng=req.get('https://wger.de/api/v2/exercise/?limit=500&language=2').json()
exercises_eng

{'count': 231,
 'next': None,
 'previous': None,
 'results': [{'id': 345,
   'uuid': 'c788d643-150a-4ac7-97ef-84643c6419bf',
   'name': '2 Handed Kettlebell Swing',
   'exercise_base': 9,
   'status': '2',
   'description': '<p>Two Handed Russian Style Kettlebell swing</p>',
   'creation_date': '2015-08-03',
   'category': 10,
   'muscles': [],
   'muscles_secondary': [],
   'equipment': [10],
   'language': 2,
   'license': 2,
   'license_author': 'deusinvictus',
   'variations': []},
  {'id': 227,
   'uuid': '53ca25b3-61d9-4f72-bfdb-492b83484ff5',
   'name': 'Arnold Shoulder Press',
   'exercise_base': 20,
   'status': '2',
   'description': '<p>Very common shoulder exercise.</p>\n<p>\xa0</p>\n<p>As shown here:\xa0https://www.youtube.com/watch?v=vj2w851ZHRM</p>',
   'creation_date': '2014-03-09',
   'category': 13,
   'muscles': [],
   'muscles_secondary': [],
   'equipment': [3],
   'language': 2,
   'license': 1,
   'license_author': 'trzr23',
   'variations': [227, 329, 229, 190, 

We got 231 exercises but we only need the "results" key which is a list of dictionaries

In [3]:
exercises_eng = exercises_eng['results']
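
Here `next` is `None` because the limit of 500 is above the 231 available entries. If a response were split across several pages, the `results` lists would have to be concatenated by following `next`. The sketch below illustrates that idea under assumptions: `collect_results`, `fetch` and the stand-in pages are hypothetical names and data, not part of the wger API or of this notebook.

```python
def collect_results(first_url, fetch):
    """Follow the 'next' links of a paginated response and gather every 'results' item.

    `fetch` is any callable that takes a URL and returns the decoded JSON page as a dict.
    """
    items = []
    url = first_url
    while url is not None:
        page = fetch(url)
        items.extend(page['results'])
        url = page['next']          # None on the last page
    return items


# Stand-in pages shaped like the response above (hypothetical URLs and a reduced set of keys)
fake_pages = {
    'page-1': {'count': 3, 'next': 'page-2', 'previous': None,
               'results': [{'id': 345}, {'id': 227}]},
    'page-2': {'count': 3, 'next': None, 'previous': 'page-1',
               'results': [{'id': 289}]},
}

all_items = collect_results('page-1', fake_pages.__getitem__)
print(len(all_items))
```

Against the real API, `fetch` could be something like `lambda url: req.get(url).json()`.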

In [4]:
exercises_eng

[{'id': 345,
  'uuid': 'c788d643-150a-4ac7-97ef-84643c6419bf',
  'name': '2 Handed Kettlebell Swing',
  'exercise_base': 9,
  'status': '2',
  'description': '<p>Two Handed Russian Style Kettlebell swing</p>',
  'creation_date': '2015-08-03',
  'category': 10,
  'muscles': [],
  'muscles_secondary': [],
  'equipment': [10],
  'language': 2,
  'license': 2,
  'license_author': 'deusinvictus',
  'variations': []},
 {'id': 227,
  'uuid': '53ca25b3-61d9-4f72-bfdb-492b83484ff5',
  'name': 'Arnold Shoulder Press',
  'exercise_base': 20,
  'status': '2',
  'description': '<p>Very common shoulder exercise.</p>\n<p>\xa0</p>\n<p>As shown here:\xa0https://www.youtube.com/watch?v=vj2w851ZHRM</p>',
  'creation_date': '2014-03-09',
  'category': 13,
  'muscles': [],
  'muscles_secondary': [],
  'equipment': [3],
  'language': 2,
  'license': 1,
  'license_author': 'trzr23',
  'variations': [227, 329, 229, 190, 119, 123, 152, 155]},
 {'id': 289,
  'uuid': '6add5973-86d0-4543-928a-6bb8b3f34efc',
  'na

In [5]:
wger_df=pd.DataFrame(exercises_eng) # to create a Pandas Data Frame

In [6]:
wger_df.head(20)

Unnamed: 0,id,uuid,name,exercise_base,status,description,creation_date,category,muscles,muscles_secondary,equipment,language,license,license_author,variations
0,345,c788d643-150a-4ac7-97ef-84643c6419bf,2 Handed Kettlebell Swing,9,2,<p>Two Handed Russian Style Kettlebell swing</p>,2015-08-03,10,[],[],[10],2,2,deusinvictus,[]
1,227,53ca25b3-61d9-4f72-bfdb-492b83484ff5,Arnold Shoulder Press,20,2,<p>Very common shoulder exercise.</p>\n<p> </p...,2014-03-09,13,[],[],[3],2,1,trzr23,"[227, 329, 229, 190, 119, 123, 152, 155]"
2,289,6add5973-86d0-4543-928a-6bb8b3f34efc,Axe Hold,31,2,<p>Grab dumbbells and extend arms to side and ...,2014-11-02,8,[],[],[3],2,1,GrosseHund,[]
3,637,0fd6154d-fb53-4b24-acc0-1c5c05b57ebc,Back Squat,34,2,<p>Place a barbell in a rack just below should...,2019-05-29,9,[],[],[],2,2,axel,[]
4,343,1b9dc5bc-790b-4e21-a55d-f8b3115e94c5,Barbell Ab Rollout,41,2,<p>Place a barbell on the floor at your feet.<...,2015-07-27,10,[14],[],[1],2,2,sevae,[]
5,407,1215dad0-b7e0-42c6-80d4-112f69acb68a,Barbell Hack Squats,43,2,<p>Perform leg squats with barbell behind your...,2016-07-30,9,[10],[8],[1],2,2,BePieToday,"[407, 342, 300, 191, 650, 389, 355, 160, 185, ..."
6,405,ae6a6c23-4616-49b7-a152-49d7461c2b7f,Barbell Lunges,46,2,<p>Put barbell on the back of your shoulders. ...,2016-07-30,9,[10],[8],[1],2,2,Mikko Ruohola,"[405, 112, 113]"
7,344,2cd5e256-20a7-4bc8-a7a8-d62bf8ce00cf,Barbell Triceps Extension,50,2,<p>Position barbell overhead with narrow overh...,2015-07-27,8,[5],"[2, 4]",[1],2,2,sevae,"[344, 274, 89, 90]"
8,307,1b8b1657-40fd-4e3b-97b7-1c79b1079f8e,Bear Walk,57,2,<p>-Rest your weight on your palms and the bal...,2015-02-06,11,"[2, 7, 4, 6, 3, 15, 5]","[8, 12, 14, 10, 9]",[7],2,2,nate303303,[]
9,192,5da6340b-22ec-4c1b-a443-eef2f59f92f0,Bench Press,73,2,"<p>Lay down on a bench, the bar should be dire...",2013-08-11,11,[4],"[2, 5]","[1, 8]",2,1,sistab2,"[192, 100, 101, 163, 210, 211, 270, 399]"


- We do not need all the columns of the resulting DataFrame for analysis, so we will drop some identification columns.

- Exercise variations are identified by a key, so we will keep the id column in order to match this information later

In [7]:
wger_df=wger_df.drop(['uuid',"exercise_base","status","creation_date","language","license","license_author"], axis=1)
wger_df

Unnamed: 0,id,name,description,category,muscles,muscles_secondary,equipment,variations
0,345,2 Handed Kettlebell Swing,<p>Two Handed Russian Style Kettlebell swing</p>,10,[],[],[10],[]
1,227,Arnold Shoulder Press,<p>Very common shoulder exercise.</p>\n<p> </p...,13,[],[],[3],"[227, 329, 229, 190, 119, 123, 152, 155]"
2,289,Axe Hold,<p>Grab dumbbells and extend arms to side and ...,8,[],[],[3],[]
3,637,Back Squat,<p>Place a barbell in a rack just below should...,9,[],[],[],[]
4,343,Barbell Ab Rollout,<p>Place a barbell on the floor at your feet.<...,10,[14],[],[1],[]
...,...,...,...,...,...,...,...,...
226,886,Weighted Butterfly Stretch,"<p>Seated with your back against a wall, put t...",9,[],[],[3],[]
227,320,Weighted Step,"<p>Box step-ups w/ barbell, 45's on each side</p>",9,[],[],[1],[]
228,321,Weighted Step-ups,<p>box step ups w/ barbell and 45's on each si...,9,[],[],[],[]
229,204,Wide-grip Pulldown,<p>Lat pulldowns with a wide grip on the bar.</p>,12,[12],[],[],"[213, 188, 187, 216, 215, 212, 424, 204]"


- Additionally, we see that the category, muscles and equipment of each exercise are also keys. The documentation shows that the API also provides separate key:value tables with this information.

- Let's get them and substitute the numbers in the DataFrame with the corresponding text

In [8]:
categories=req.get('https://wger.de/api/v2/exercisecategory').json()
categories=categories['results']
categories

[{'id': 10, 'name': 'Abs'},
 {'id': 8, 'name': 'Arms'},
 {'id': 12, 'name': 'Back'},
 {'id': 14, 'name': 'Calves'},
 {'id': 11, 'name': 'Chest'},
 {'id': 9, 'name': 'Legs'},
 {'id': 13, 'name': 'Shoulders'}]

In [9]:
#key-value dictionary (id -> name) used to substitute the ids in the dataframe (see next cell in the Notebook)
categories_d={x['id']: x['name'] for x in categories}
categories_d

{10: 'Abs',
 8: 'Arms',
 12: 'Back',
 14: 'Calves',
 11: 'Chest',
 9: 'Legs',
 13: 'Shoulders'}

In [10]:
index = 0
for x in wger_df['category']:
    wger_df['category'][index] = categories_d[(wger_df['category'][index])]
    index+=1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  wger_df['category'][index] = categories_d[(wger_df['category'][index])]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
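
The warning above comes from the chained indexing `wger_df['category'][index] = ...`, which may write to a temporary copy instead of the DataFrame itself. A minimal sketch of an alternative, assuming the ids are still in the column: `Series.map` with the dictionary replaces the whole column in one assignment. `demo_df` and its rows are made up for illustration.

```python
import pandas as pd

# Subset of the id -> name dictionary built above
categories_d = {10: 'Abs', 8: 'Arms', 9: 'Legs', 13: 'Shoulders'}

# Small stand-in for wger_df (hypothetical rows with the same column names)
demo_df = pd.DataFrame({
    'name': ['2 Handed Kettlebell Swing', 'Arnold Shoulder Press', 'Axe Hold', 'Back Squat'],
    'category': [10, 13, 8, 9],
})

# One vectorised assignment instead of a row-by-row loop
demo_df['category'] = demo_df['category'].map(categories_d)
print(demo_df['category'].tolist())
```

Ids that are missing from the dictionary become `NaN` with this approach, whereas the loop would raise a `KeyError`.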


In [11]:
wger_df['category'].head()

0          Abs
1    Shoulders
2         Arms
3         Legs
4          Abs
Name: category, dtype: object

In [12]:
equipment=req.get('https://wger.de/api/v2/equipment').json() 
equipment=equipment['results']
equipment

[{'id': 1, 'name': 'Barbell'},
 {'id': 8, 'name': 'Bench'},
 {'id': 3, 'name': 'Dumbbell'},
 {'id': 4, 'name': 'Gym mat'},
 {'id': 9, 'name': 'Incline bench'},
 {'id': 10, 'name': 'Kettlebell'},
 {'id': 7, 'name': 'none (bodyweight exercise)'},
 {'id': 6, 'name': 'Pull-up bar'},
 {'id': 5, 'name': 'Swiss Ball'},
 {'id': 2, 'name': 'SZ-Bar'}]

In [13]:
equipment_d={x['id']: x['name'] for x in equipment}
equipment_d

{1: 'Barbell',
 8: 'Bench',
 3: 'Dumbbell',
 4: 'Gym mat',
 9: 'Incline bench',
 10: 'Kettlebell',
 7: 'none (bodyweight exercise)',
 6: 'Pull-up bar',
 5: 'Swiss Ball',
 2: 'SZ-Bar'}

In [14]:
index_df = 0    #same approach as with categories, but with an inner loop over each list, since one exercise can use more than one piece of equipment
for elemento_lista in wger_df['equipment']:
    
    index_lista=0
    for x in elemento_lista: 
        wger_df['equipment'][index_df][index_lista] = equipment_d[(wger_df['equipment'][index_df][index_lista])]
        index_lista+=1
        
    index_df+=1

In [15]:
wger_df['equipment'].head(20)

0                     [Kettlebell]
1                       [Dumbbell]
2                       [Dumbbell]
3                               []
4                        [Barbell]
5                        [Barbell]
6                        [Barbell]
7                        [Barbell]
8     [none (bodyweight exercise)]
9                 [Barbell, Bench]
10               [Bench, Dumbbell]
11                [Barbell, Bench]
12                      [Dumbbell]
13                       [Barbell]
14                              []
15               [Bench, Dumbbell]
16                      [Dumbbell]
17                       [Barbell]
18                       [Barbell]
19                       [Barbell]
Name: equipment, dtype: object
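
The nested loop above rewrites each list element in place. The same substitution can be expressed as a small helper that builds a new list per row. This is only a sketch: `ids_to_names` and its `missing` fallback are hypothetical names introduced here, not part of the notebook.

```python
def ids_to_names(ids, lookup, missing='unknown'):
    """Return a new list with every id replaced by its name from `lookup`."""
    return [lookup.get(i, missing) for i in ids]


# Subset of the equipment dictionary built above
equipment_d = {1: 'Barbell', 8: 'Bench', 3: 'Dumbbell', 10: 'Kettlebell'}

print(ids_to_names([1, 8], equipment_d))
print(ids_to_names([], equipment_d))
print(ids_to_names([99], equipment_d))   # id not in the dictionary -> fallback value
```

Applied to the raw id column in place of the loop, this could look like `wger_df['equipment'].apply(lambda ids: ids_to_names(ids, equipment_d))`. The same helper would also cover the muscles, muscles_secondary and variations columns with their own dictionaries.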

In [16]:
muscles=req.get('https://wger.de/api/v2/muscle').json() #muscle id/name table (no limit or language parameters are passed here)
muscles=muscles['results']
muscles

[{'id': 2,
  'name': 'Anterior deltoid',
  'is_front': True,
  'image_url_main': '/static/images/muscles/main/muscle-2.svg',
  'image_url_secondary': '/static/images/muscles/secondary/muscle-2.svg'},
 {'id': 1,
  'name': 'Biceps brachii',
  'is_front': True,
  'image_url_main': '/static/images/muscles/main/muscle-1.svg',
  'image_url_secondary': '/static/images/muscles/secondary/muscle-1.svg'},
 {'id': 11,
  'name': 'Biceps femoris',
  'is_front': False,
  'image_url_main': '/static/images/muscles/main/muscle-11.svg',
  'image_url_secondary': '/static/images/muscles/secondary/muscle-11.svg'},
 {'id': 13,
  'name': 'Brachialis',
  'is_front': True,
  'image_url_main': '/static/images/muscles/main/muscle-13.svg',
  'image_url_secondary': '/static/images/muscles/secondary/muscle-13.svg'},
 {'id': 7,
  'name': 'Gastrocnemius',
  'is_front': False,
  'image_url_main': '/static/images/muscles/main/muscle-7.svg',
  'image_url_secondary': '/static/images/muscles/secondary/muscle-7.svg'},
 {'id

In [17]:
muscles_d={x['id']: x['name'] for x in muscles}
muscles_d

{2: 'Anterior deltoid',
 1: 'Biceps brachii',
 11: 'Biceps femoris',
 13: 'Brachialis',
 7: 'Gastrocnemius',
 8: 'Gluteus maximus',
 12: 'Latissimus dorsi',
 14: 'Obliquus externus abdominis',
 4: 'Pectoralis major',
 10: 'Quadriceps femoris',
 6: 'Rectus abdominis',
 3: 'Serratus anterior',
 15: 'Soleus',
 9: 'Trapezius',
 5: 'Triceps brachii'}

In [18]:
index_df = 0    #we proceed as we did with equipment
for elemento_lista in wger_df['muscles']:
    
    index_lista=0
    for x in elemento_lista: 
        wger_df['muscles'][index_df][index_lista] = muscles_d[(wger_df['muscles'][index_df][index_lista])]
        index_lista+=1
        
    index_df+=1

In [19]:
index_df = 0
for elemento_lista in wger_df['muscles_secondary']:
    
    index_lista=0
    for x in elemento_lista: 
        wger_df['muscles_secondary'][index_df][index_lista] = muscles_d[(wger_df['muscles_secondary'][index_df][index_lista])]
        index_lista+=1
        
    index_df+=1

In [20]:
wger_df['muscles'].head(20)

0                                                    []
1                                                    []
2                                                    []
3                                                    []
4                         [Obliquus externus abdominis]
5                                  [Quadriceps femoris]
6                                  [Quadriceps femoris]
7                                     [Triceps brachii]
8     [Anterior deltoid, Gastrocnemius, Pectoralis m...
9                                    [Pectoralis major]
10                                   [Pectoralis major]
11                                    [Triceps brachii]
12                                          [Trapezius]
13                                                   []
14                                   [Latissimus dorsi]
15                                                   []
16                                   [Anterior deltoid]
17                                   [Latissimus

In [21]:
wger_df['muscles_secondary'].head(20)

0                                                    []
1                                                    []
2                                                    []
3                                                    []
4                                                    []
5                                     [Gluteus maximus]
6                                     [Gluteus maximus]
7                  [Anterior deltoid, Pectoralis major]
8     [Gluteus maximus, Latissimus dorsi, Obliquus e...
9                   [Anterior deltoid, Triceps brachii]
10                  [Anterior deltoid, Triceps brachii]
11                 [Anterior deltoid, Pectoralis major]
12                                                   []
13                                                   []
14                                                   []
15                                                   []
16                                          [Trapezius]
17                   [Anterior deltoid, Biceps b

Now that the equipment, muscles and category columns hold their proper names, let's substitute the ids in the exercise variations with the exercise names, which we can look up through the id column

In [22]:
wger_df

Unnamed: 0,id,name,description,category,muscles,muscles_secondary,equipment,variations
0,345,2 Handed Kettlebell Swing,<p>Two Handed Russian Style Kettlebell swing</p>,Abs,[],[],[Kettlebell],[]
1,227,Arnold Shoulder Press,<p>Very common shoulder exercise.</p>\n<p> </p...,Shoulders,[],[],[Dumbbell],"[227, 329, 229, 190, 119, 123, 152, 155]"
2,289,Axe Hold,<p>Grab dumbbells and extend arms to side and ...,Arms,[],[],[Dumbbell],[]
3,637,Back Squat,<p>Place a barbell in a rack just below should...,Legs,[],[],[],[]
4,343,Barbell Ab Rollout,<p>Place a barbell on the floor at your feet.<...,Abs,[Obliquus externus abdominis],[],[Barbell],[]
...,...,...,...,...,...,...,...,...
226,886,Weighted Butterfly Stretch,"<p>Seated with your back against a wall, put t...",Legs,[],[],[Dumbbell],[]
227,320,Weighted Step,"<p>Box step-ups w/ barbell, 45's on each side</p>",Legs,[],[],[Barbell],[]
228,321,Weighted Step-ups,<p>box step ups w/ barbell and 45's on each si...,Legs,[],[],[],[]
229,204,Wide-grip Pulldown,<p>Lat pulldowns with a wide grip on the bar.</p>,Back,[Latissimus dorsi],[],[],"[213, 188, 187, 216, 215, 212, 424, 204]"


In [23]:
variations_d=dict(zip(wger_df['id'], wger_df['name'])) #id -> name dictionary built from the two dataframe columns
variations_d

{345: '2 Handed Kettlebell Swing',
 227: 'Arnold Shoulder Press',
 289: 'Axe Hold',
 637: 'Back Squat',
 343: 'Barbell Ab Rollout',
 407: 'Barbell Hack Squats',
 405: 'Barbell Lunges',
 344: 'Barbell Triceps Extension',
 307: 'Bear Walk',
 192: 'Bench Press',
 97: 'Benchpress Dumbbells',
 88: 'Bench Press Narrow Grip',
 268: 'Bent High Pulls',
 412: 'Bent Over Barbell Row',
 362: 'Bentover Dumbbell Rows',
 421: 'Bent-over Lateral Raises',
 919: 'Bent Over Laterals',
 109: 'Bent Over Rowing',
 110: 'Bent Over Rowing Reverse',
 74: 'Biceps Curls With Barbell',
 81: 'Biceps Curls With Dumbbell',
 80: 'Biceps Curls With SZ-bar',
 129: 'Biceps Curl With Cable',
 341: 'Body-Ups',
 342: 'Braced Squat',
 914: 'Bulgarian Split Squat',
 354: 'Burpees',
 98: 'Butterfly',
 99: 'Butterfly Narrow Grip',
 124: 'Butterfly Reverse',
 207: 'Cable Cross-over',
 265: 'Cable External Rotation',
 167: 'Cable Woodchoppers',
 308: 'Calf Press Using Leg Press Machine',
 776: 'Calf Raises',
 104: 'Calf Raises o

In [24]:
index_df = 0    #we proceed as we did with equipment
for elemento_lista in wger_df['variations']:

    index_lista=0
    for x in elemento_lista: 
        wger_df['variations'][index_df][index_lista] = variations_d[(wger_df['variations'][index_df][index_lista])]
        index_lista+=1
        
    index_df+=1

In [25]:
wger_df

Unnamed: 0,id,name,description,category,muscles,muscles_secondary,equipment,variations
0,345,2 Handed Kettlebell Swing,<p>Two Handed Russian Style Kettlebell swing</p>,Abs,[],[],[Kettlebell],[]
1,227,Arnold Shoulder Press,<p>Very common shoulder exercise.</p>\n<p> </p...,Shoulders,[],[],[Dumbbell],"[Arnold Shoulder Press, Diagonal Shoulder Pres..."
2,289,Axe Hold,<p>Grab dumbbells and extend arms to side and ...,Arms,[],[],[Dumbbell],[]
3,637,Back Squat,<p>Place a barbell in a rack just below should...,Legs,[],[],[],[]
4,343,Barbell Ab Rollout,<p>Place a barbell on the floor at your feet.<...,Abs,[Obliquus externus abdominis],[],[Barbell],[]
...,...,...,...,...,...,...,...,...
226,886,Weighted Butterfly Stretch,"<p>Seated with your back against a wall, put t...",Legs,[],[],[Dumbbell],[]
227,320,Weighted Step,"<p>Box step-ups w/ barbell, 45's on each side</p>",Legs,[],[],[Barbell],[]
228,321,Weighted Step-ups,<p>box step ups w/ barbell and 45's on each si...,Legs,[],[],[],[]
229,204,Wide-grip Pulldown,<p>Lat pulldowns with a wide grip on the bar.</p>,Back,[Latissimus dorsi],[],[],"[Close-grip Lat Pull Down, Lat Pull Down (Lean..."


Now let's clean the description text and remove the id column

In [26]:
wger_df=wger_df.drop(['id'], axis=1)

In [27]:
index=0
for x in wger_df['description']:
    wger_df['description'][index] = x.replace("<p>","").replace("</p>","").replace("\n","").replace("-","")
    index+=1

In [28]:
wger_df

Unnamed: 0,name,description,category,muscles,muscles_secondary,equipment,variations
0,2 Handed Kettlebell Swing,Two Handed Russian Style Kettlebell swing,Abs,[],[],[Kettlebell],[]
1,Arnold Shoulder Press,Very common shoulder exercise. As shown here: ...,Shoulders,[],[],[Dumbbell],"[Arnold Shoulder Press, Diagonal Shoulder Pres..."
2,Axe Hold,Grab dumbbells and extend arms to side and hol...,Arms,[],[],[Dumbbell],[]
3,Back Squat,Place a barbell in a rack just below shoulderh...,Legs,[],[],[],[]
4,Barbell Ab Rollout,Place a barbell on the floor at your feet.Bend...,Abs,[Obliquus externus abdominis],[],[Barbell],[]
...,...,...,...,...,...,...,...
226,Weighted Butterfly Stretch,"Seated with your back against a wall, put the ...",Legs,[],[],[Dumbbell],[]
227,Weighted Step,"Box stepups w/ barbell, 45's on each side",Legs,[],[],[Barbell],[]
228,Weighted Step-ups,box step ups w/ barbell and 45's on each side,Legs,[],[],[],[]
229,Wide-grip Pulldown,Lat pulldowns with a wide grip on the bar.,Back,[Latissimus dorsi],[],[],"[Close-grip Lat Pull Down, Lat Pull Down (Lean..."
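
Note that the chained `replace` calls also remove hyphens inside words ("step-ups" becomes "stepups" in the table above) and join sentences that were separated only by a tag. A possible alternative, sketched here with the standard library only, strips the tags and normalises whitespace while leaving hyphens alone. `clean_description` is a hypothetical helper, not the method used in this notebook.

```python
import html
import re


def clean_description(raw):
    """Strip HTML tags and collapse whitespace without touching hyphens inside words."""
    text = re.sub(r'<[^>]+>', ' ', raw)        # replace every tag with a space
    text = html.unescape(text)                 # turn entities such as &amp; into characters
    text = text.replace('\xa0', ' ')           # non-breaking spaces seen in the raw descriptions
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace


sample = "<p>Box step-ups w/ barbell, 45's on each side</p>"
print(clean_description(sample))
```

With pandas this could be applied as `wger_df['description'].apply(clean_description)` on the raw descriptions.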


## DATA SOURCE 2: BODYBUILDING.COM FITNESS EXERCISES WEB SCRAPING

For evaluation purposes it is not necessary to run this code, as a file with the scraped data is provided: bodybuildingcom2.csv

The code was working as of 30/11/2021 and will most probably still work if you run it

In [29]:
from selenium import webdriver 
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup as bs
import time 
import re 

In [33]:
class Br():
    
    def init(self):
        PATH=ChromeDriverManager().install()
        self.browser = webdriver.Chrome(PATH)
               
    def start(self,url):
        self.browser.get(url)
        time.sleep(1)

    def accept_cook(self):
        element = self.browser.find_element_by_xpath("//*[@id=\"onetrust-accept-btn-handler\"]")
        element.click()
        time.sleep(1)

    def remove_filter(self): #the full exercise list cannot be opened directly, so we start from the chest exercises; this function removes that filter to show all the exercises
        element = self.browser.find_element_by_xpath("/html/body/div[3]/main/div/div[1]/form/div/div[1]/section/ul/li[1]/label/span[2]")
        element.click()
        time.sleep(1)

    def load_more(self): #only 15 exercises are displayed per page; this function presses the "Load more" button
        element = self.browser.find_element_by_xpath('//div[@class="ExLoadMore"]//button')
        element.click()
        time.sleep(5)
       
    def close_out(self):
        self.browser.close()

In [4]:
b=Br()
b.init()
b.start("https://www.bodybuilding.com/exercises/finder/?muscle=chest")



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/96.0.4664.45/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\manue\.wdm\drivers\chromedriver\win32\96.0.4664.45]


In [6]:
b.accept_cook()



In [7]:
b.remove_filter()



In [8]:
while True:            #loop until there is no "Load more" button left
    try:
        b.load_more()
    except Exception:  #load_more fails once the button can no longer be found or clicked
        break
        



In [9]:
# parser with bs
soup=bs(b.browser.page_source, 'html.parser') #make a soup of the page with all exercises available in it

In [10]:
# get exercises cells
exercises_cells=soup.find_all('div',{'itemtype':'http://schema.org/ExerciseAction'}) #sub soup with the exercises cells in list format

Now let's create a list for each variable we want to save and loop over the exercise cells, extracting the desired data from each one

In [11]:
exercises_names=[]
muscles=[]
equipment=[]
thumbnails1=[]
thumbnails2=[]
exercises_links=[]
for x in exercises_cells:
    exercises_names.append(x.find('h3').text.strip())
    muscles.append(x.find('div', class_='ExResult-details ExResult-muscleTargeted').text.replace("Muscle Targeted:","").replace("\n","").strip())
    equipment.append(x.find('div', class_='ExResult-details ExResult-equipmentType').text.replace("Equipment Type:","").replace("\n","").strip())
    thumbnails=x.find_all('img')
    try:
        thumbnails1.append(thumbnails[0].get('src'))
    except IndexError:   #the cell has no first image
        thumbnails1.append("Unavailable")
    try:
        thumbnails2.append(thumbnails[1].get('src'))
    except IndexError:   #the cell has no second image
        thumbnails2.append("Unavailable")
    
    exercises_links.append("https://bodybuilding.com" + (x.find('a').get('href')))

In [15]:
data={'exercise':exercises_names, 'muscle':muscles, 'equipment':equipment, 'link':exercises_links,'tb1':thumbnails1, 'tb2':thumbnails2}
bbcom_df = pd.DataFrame(data)
bbcom_df

Unnamed: 0,exercise,muscle,equipment,link,tb1,tb2
0,Rickshaw Carry,Forearms,Other,https://bodybuilding.com/exercises/rickshaw-carry,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/exercises/exercis...
1,Single-Leg Press,Quadriceps,Machine,https://bodybuilding.com/exercises/single-leg-...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...
2,Landmine twist,Abdominals,Other,https://bodybuilding.com/exercises/landmine-180s,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...
3,Weighted pull-up,Lats,Other,https://bodybuilding.com/exercises/weighted-pu...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...
4,T-Bar Row with Handle,Middle Back,Other,https://bodybuilding.com/exercises/t-bar-row-w...,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/images/2020/octob...
...,...,...,...,...,...,...
3127,Band chest fly,Chest,Bands,https://bodybuilding.com/exercises/band-chest-fly,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...
3128,Band overhead squat,Quadriceps,Bands,https://bodybuilding.com/exercises/band-overhe...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...
3129,Band standing concentration curl,Biceps,Bands,https://bodybuilding.com/exercises/band-standi...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...
3130,Band seated row,Traps,Bands,https://bodybuilding.com/exercises/band-seated...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...


In [None]:
b.close_out()

In each exercise link there is still some info that we want to get:
- exercise type
- exercise alternatives
- exercise level

We will create 3 empty lists and populate them with data of each of the exercises

In [16]:
exercises_type=[]
exercises_alternatives=[]
exercises_level=[]
evolucion=0 # as the scraping takes a long time, we will print the number of each exercise as it is processed

In [17]:
b=Br()
b.init()

for url in exercises_links:
       
    b.start(url)

    soup=bs(b.browser.page_source, 'html.parser')

    try:           #try/except so that the scraping continues if the data is not available
        tipo=soup.find('a',itemprop="exerciseType").text.strip()
    
    except AttributeError:   #soup.find returned None: the page has no exercise type
        tipo="unavailable"
        
    exercises_type.append(tipo)
    print(tipo)

    try:           #try/except so that the scraping continues if the data is not available
        alternatives=soup.find_all('h3',class_="ExHeading ExResult-resultsHeading")
        alternatives=[x.text.strip() for x in alternatives]
    except:
        alternatives="unavailable"
                
    exercises_alternatives.append(alternatives)
    print(alternatives)
    
    level="unavailable"            #default when no list item mentions the level
    for x in soup.find_all('li'):  #the level is not inside any dedicated class, so we search the text of every list item for the word Level
        if "Level" in x.get_text():
            level=x.text.replace('Level:\n',"").strip() #delete the word level so that we keep only the difficulty
                
    exercises_level.append(level)
    print(level)
                
    evolucion+=1
    print(evolucion)



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [C:\Users\manue\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache


Strongman
["Dumbbell farmer's walk"]
Beginner
1
Strength
['Barbell Bulgarian split squat']
Intermediate
2
Strength
['Russian twist']
Intermediate
3
Strength
[]
Intermediate
4
Strength
['Seated Cable Rows']
Intermediate
5
Strength
['Palms-Down Dumbbell Wrist Curl Over A Bench', 'Seated palms-down wrist curl', 'Seated One-Arm Dumbbell Palms-Down Wrist Curl']
Intermediate
6
Strongman
['Atlas Stone Trainer']
Intermediate
7
Strength
[]
Intermediate
8
Olympic Weightlifting
['Power clean', 'Hang Clean']
Beginner
9
Strength
[]
Beginner
10
Strength
['Plate Pinch']
Intermediate
11
Powerlifting
['Glute bridge', 'Barbell Hip Thrust']
Intermediate
12
Strength
['Military press', 'Single-arm kettlebell push-press', 'Circus Bell']
Intermediate
13


KeyboardInterrupt: 

Now we update the DataFrame

In [None]:
bbcom_df['type']=exercises_type
bbcom_df['alternatives']=exercises_alternatives
bbcom_df['level']=exercises_level

In [None]:
bbcom_df

## Processing the extracted data

You can run the code again from this point

In [30]:
#bbcom_df.to_csv('bodybuildingcom2.csv', index=False)      Uncomment if you ran the scraping code and want to save the data

bbcom_df = pd.read_csv("bodybuildingcom2.csv")          #Load the saved csv 


In [31]:
bbcom_df

Unnamed: 0.1,Unnamed: 0,exercise,muscle,equipment,link,tb1,tb2,type,alternatives,level
0,0,Rickshaw Carry,Forearms,Other,https://bodybuilding.com/exercises/rickshaw-carry,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/exercises/exercis...,Strongman,"[""Dumbbell farmer's walk""]",Beginner
1,1,Single-Leg Press,Quadriceps,Machine,https://bodybuilding.com/exercises/single-leg-...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Barbell Bulgarian split squat'],Intermediate
2,2,Landmine twist,Abdominals,Other,https://bodybuilding.com/exercises/landmine-180s,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Russian twist'],Intermediate
3,3,Weighted pull-up,Lats,Other,https://bodybuilding.com/exercises/weighted-pu...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,[],Intermediate
4,4,T-Bar Row with Handle,Middle Back,Other,https://bodybuilding.com/exercises/t-bar-row-w...,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/images/2020/octob...,Strength,['Seated Cable Rows'],Intermediate
...,...,...,...,...,...,...,...,...,...,...
3127,3127,Band chest fly,Chest,Bands,https://bodybuilding.com/exercises/band-chest-fly,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate
3128,3128,Band overhead squat,Quadriceps,Bands,https://bodybuilding.com/exercises/band-overhe...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate
3129,3129,Band standing concentration curl,Biceps,Bands,https://bodybuilding.com/exercises/band-standi...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate
3130,3130,Band seated row,Traps,Bands,https://bodybuilding.com/exercises/band-seated...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate


In [32]:
bbcom_df=bbcom_df.drop(['Unnamed: 0'], axis=1) #we drop the 'Unnamed: 0' column: the provided file was saved with its index (without index=False)

In [33]:
bbcom_df

Unnamed: 0,exercise,muscle,equipment,link,tb1,tb2,type,alternatives,level
0,Rickshaw Carry,Forearms,Other,https://bodybuilding.com/exercises/rickshaw-carry,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/exercises/exercis...,Strongman,"[""Dumbbell farmer's walk""]",Beginner
1,Single-Leg Press,Quadriceps,Machine,https://bodybuilding.com/exercises/single-leg-...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Barbell Bulgarian split squat'],Intermediate
2,Landmine twist,Abdominals,Other,https://bodybuilding.com/exercises/landmine-180s,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Russian twist'],Intermediate
3,Weighted pull-up,Lats,Other,https://bodybuilding.com/exercises/weighted-pu...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,[],Intermediate
4,T-Bar Row with Handle,Middle Back,Other,https://bodybuilding.com/exercises/t-bar-row-w...,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/images/2020/octob...,Strength,['Seated Cable Rows'],Intermediate
...,...,...,...,...,...,...,...,...,...
3127,Band chest fly,Chest,Bands,https://bodybuilding.com/exercises/band-chest-fly,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate
3128,Band overhead squat,Quadriceps,Bands,https://bodybuilding.com/exercises/band-overhe...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate
3129,Band standing concentration curl,Biceps,Bands,https://bodybuilding.com/exercises/band-standi...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate
3130,Band seated row,Traps,Bands,https://bodybuilding.com/exercises/band-seated...,https://www.bodybuilding.com/images/2020/xdb/2...,https://www.bodybuilding.com/images/2020/xdb/2...,Strength,[],Intermediate


## Avoiding duplicate exercise entries: string similarity scoring methods

Now we have 2 different datasets with exercises. The bodybuilding.com (bbcom) one is larger and more complete, but we still want to get the descriptions from the wger dataset.

Most of the exercises in wger are probably already in bbcom, so we will try to match them. This is not an easy task, because the same exercise can be named in different ways, for example:
- 2 handed kettlebell swing == Kettlebell swing
- Arnold press == Arnold Shoulder press
- Cable external rotation == external rotation with cable

On the other hand, sumo deadlift, for example, IS NOT the same exercise as deadlift.

After some research in https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a we evaluated 2 different methods for scoring string similarity:

- Levenshtein distance
- Cosine similarity

Levenshtein distance counts how many single-character edits (insertions, deletions or substitutions) are needed to turn string A into string B. This method is more effective for detecting typos than for comparing 2 sentences that contain similar words in a different order.
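To make the idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance (the function name `levenshtein` is only illustrative, and this is the textbook dynamic-programming formulation rather than code from the article). It shows that a typo costs very few edits, while reordering the same words costs many:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                       # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]

typo_distance = levenshtein("kettlebell swing", "kettleble swing")
reorder_distance = levenshtein("cable external rotation", "external rotation cable")
print(typo_distance, reorder_distance)
```

The typo pair is only 2 edits apart, whereas the reordered pair is much further apart even though it uses exactly the same words.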

For the latter, cosine similarity works better. It measures the cosine of the angle between two non-zero vectors of an inner product space, so strings that contain the same words score high regardless of word order. Let's use it to compare our exercises.
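As a rough illustration of why cosine similarity ignores word order, the sketch below computes it on plain word counts using only the standard library. The helper name `cosine_similarity_words` is hypothetical and the whitespace tokenisation is a simplification, not the scikit-learn implementation:

```python
import math
from collections import Counter

def cosine_similarity_words(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two strings."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product: words missing from one string contribute 0.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:       # cosine is undefined for an all-zero vector
        return 0.0
    return dot / (norm_a * norm_b)

same_words = cosine_similarity_words("cable external rotation", "external rotation cable")
different = cosine_similarity_words("sumo deadlift", "bench press")
print(round(same_words, 3), round(different, 3))
```

Two strings with the same words in a different order score 1, and two strings with no words in common score 0.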

Before applying this technique to our dataset, we will work through a simpler example to see how it works. It follows the technique in https://towardsdatascience.com/calculating-string-similarity-in-python-276e18a7d33a, slightly modified so that we can apply it to our dataset later.

Some of the explanations in the article are summarized below to make the technique easier to follow.

In [34]:
# pip install nltk      # uncomment this if you need to install nltk (it provides the stopword list used below)

In [35]:
import nltk
nltk.download('stopwords')   # to download stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\manue\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [36]:
import string # to remove punctuations from the strings
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer # to convert the strings to numerical vectors
from nltk.corpus import stopwords # to remove non-significant words such as prepositions, pronouns etc.
stopwords = stopwords.words('english') # set stopwords to english
import numpy as np

In [37]:
def clean_string(text):              # clean a string so that only the relevant words and characters remain
    text = "".join([char for char in text if char not in string.punctuation])  # remove punctuation characters
    text = text.lower()
    text = " ".join([word for word in text.split() if word not in stopwords])  # remove English stopwords
    
    return text


In [38]:
sentences = [                                  # test
    'bench press with dumbells',
    'bench dumbell+ press',
    'barbell press',
    'Hi my friend'
]

In [39]:
cleaned = list(map(clean_string, sentences))       # we have a resulting cleaned list of words
cleaned

['bench press dumbells', 'bench dumbell press', 'barbell press', 'hi friend']

Now we apply the vectorizer to create an array with k vectors in n-dimensional space, where k is the number of sentences and n is the number of unique words in all sentences combined. Each value is the number of times the sentence contains the corresponding word, so in this example it is 1 if the sentence contains the word and 0 otherwise.

In [40]:
vectorizer = CountVectorizer().fit_transform(cleaned)
vectors=vectorizer.toarray()
vectors

array([[0, 1, 0, 1, 0, 0, 1],
       [0, 1, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 1, 0]], dtype=int64)
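To see which word each column stands for, the same matrix can be rebuilt by hand. This is a simplified pure-Python stand-in for what `CountVectorizer` does on this small input (one column per distinct word, in alphabetical order), not its actual implementation. The names `cleaned_demo`, `vocabulary` and `count_matrix` are illustrative:

```python
cleaned_demo = ['bench press dumbells', 'bench dumbell press', 'barbell press', 'hi friend']

# Vocabulary: every distinct word, sorted alphabetically (one column per word).
vocabulary = sorted({word for sentence in cleaned_demo for word in sentence.split()})

# One row per sentence; each entry counts how often the column's word occurs.
count_matrix = [[sentence.split().count(word) for word in vocabulary]
                for sentence in cleaned_demo]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Note that `dumbell` and `dumbells` end up in separate columns, so singular and plural forms do not count as a shared word.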

Now, to see how similar the vectors are to each other, we apply cosine similarity. This gives a matrix that crosses all vectors and returns the cosine similarity for each pair (which is why the diagonal is all 1s).

In [41]:
csim = cosine_similarity(vectors)
csim

array([[1.        , 0.66666667, 0.40824829, 0.        ],
       [0.66666667, 1.        , 0.40824829, 0.        ],
       [0.40824829, 0.40824829, 1.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        ]])

We see that the most similar sentences are the first and the second, with a cosine similarity of about 0.67.
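As a check, the 0.667 and 0.408 entries can be reproduced by hand from the count vectors above. Writing $A$, $B$ and $C$ for the vectors of the first, second and third sentences:

$$
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{2}{\sqrt{3}\,\sqrt{3}} = \frac{2}{3} \approx 0.667
$$

$$
\cos(A, C) = \frac{A \cdot C}{\lVert A \rVert \, \lVert C \rVert} = \frac{1}{\sqrt{3}\,\sqrt{2}} \approx 0.408
$$

The numerator counts the shared words (bench and press for the first pair, only press for the second), and each norm is the square root of the number of words in the sentence, because every count here is 0 or 1.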

Below is a function to calculate the similarity of two given vectors directly.

As cosine_similarity() expects 2D arrays and the input vectors are 1D arrays by default, we need to reshape them:

In [42]:
def cosine_sim_vectors(vec1, vec2):
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    
    return cosine_similarity(vec1, vec2)[0][0]

To better understand what the function does, see the example below.

In [43]:
vec1 = [0,1,0,1,0,0,1]
vec1=np.array(vec1)
vec1                    #return a 1D array

array([0, 1, 0, 1, 0, 0, 1])

In [44]:
vec1.reshape(1, -1)  # see that one dimension has been added (now we have 2 [[]] instead of []  - see below)

array([[0, 1, 0, 1, 0, 0, 1]])

Now let's see how the function works by taking the cosine similarity of the two most similar sentences in our example directly. It should again give about 0.67.

In [45]:
cosine_sim_vectors(vectors[0], vectors[1])

0.6666666666666669

Now, with our dataset in mind, let's see how we can compare several sentences and pick the most similar one according to cosine similarity.

In [46]:
similarities_list=[]
for i in range(len(sentences)):
    similarities_list.append(cosine_sim_vectors(vectors[0], vectors[i]))



In [47]:
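similarities_list[0]=0   # zero out the comparison of the first sentence with itself so that it is not picked as the maximum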
max_index=similarities_list.index(max(similarities_list))

In [48]:
sentences[max_index]

'bench dumbell+ press'

## Applying cosine similarity to our Data set

In [49]:
wger_lista = list(wger_df['name'])    # list of the exercise names in wger
bbcom_lista = list(bbcom_df['exercise']) # list of the exercise names in bbcom

sim_lista = []     # for each wger exercise, the most similar exercise found in bbcom_lista
score_lista = []   # the cosine similarity score of each of those matches

In [50]:
for x in wger_lista:           # iterates over wger_lista, len == 231
    x_clean = clean_string(x)                            # clean x
    comp_cleaned = list(map(clean_string, bbcom_lista))  # cleaned copy of the bbcom exercises to compare against
    
    vector_comp_list=[] # list to hold the similarity between x and each entry of comp_cleaned

    comp_cleaned.insert(0,x_clean)           # put x_clean in comp_cleaned as the first sentence, so len(comp_cleaned) == len(bbcom_lista) + 1

    vectorizer = CountVectorizer().fit_transform(comp_cleaned)      # apply CountVectorizer
    vectors=vectorizer.toarray()

    for i in range(len(comp_cleaned)):  
        vector_comp_list.append(cosine_sim_vectors(vectors[0], vectors[i]))     # cosine similarity of each vector against the first one (x)

    vector_comp_list[0]=-1    # set the similarity of x with itself to -1 so that the code below does not return x as its own best match
    max_index=vector_comp_list.index(max(vector_comp_list))    # index of the exercise with the maximum cosine similarity

    sim_lista.append(comp_cleaned[max_index])   # append to sim_lista the name of the exercise with the maximum cosine similarity
    score_lista.append(max(vector_comp_list))   # append to score_lista the value of that cosine similarity
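The loop above re-cleans and re-vectorizes the whole bbcom list for every wger exercise. A possible faster variant, sketched below under the assumption that scikit-learn is installed, fits the vocabulary once and compares all pairs in a single `cosine_similarity` call. The function name `best_matches` and the demo lists are illustrative, and the inputs are assumed to be already cleaned:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_matches(queries, candidates):
    """For each query string return (most similar candidate, cosine score)."""
    vectorizer = CountVectorizer().fit(queries + candidates)   # one shared vocabulary
    query_vecs = vectorizer.transform(queries)        # shape (len(queries), n_words)
    candidate_vecs = vectorizer.transform(candidates)  # shape (len(candidates), n_words)
    scores = cosine_similarity(query_vecs, candidate_vecs)  # shape (len(queries), len(candidates))
    best = scores.argmax(axis=1)                       # best candidate index per query
    return [(candidates[j], float(scores[i, j])) for i, j in enumerate(best)]

demo_wger = ["arnold shoulder press", "bench press"]
demo_bbcom = ["kettlebell swing", "arnold press", "bench press"]
matches = best_matches(demo_wger, demo_bbcom)
print(matches)
```

Because the queries and candidates are kept in separate matrices, there is no self-comparison to mask out with -1.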

In [51]:
len(sim_lista) # for debugging

231

In [52]:
len(wger_lista) # for debugging

231

In [53]:
len(score_lista) # for debugging

231

In [54]:
equivalences=list(zip(wger_lista, sim_lista, score_lista))   # to check the results, zip each wger exercise with its most similar bbcom exercise and their cosine similarity

In [55]:
equiv_df=pd.DataFrame(equivalences)
equiv_df= equiv_df.rename(columns={0: 'wger', 1: 'bbcom', 2: 'score'})

In [56]:
pd.set_option('display.max_rows', 231)

In [57]:
equiv_df

Unnamed: 0,wger,bbcom,score
0,2 Handed Kettlebell Swing,kettlebell swing,0.816497
1,Arnold Shoulder Press,arnold press,0.816497
2,Axe Hold,hollowbody hold,0.5
3,Back Squat,barbell back squat,0.816497
4,Barbell Ab Rollout,barbell ab rollout knees,0.866025
5,Barbell Hack Squats,barbell hack squat,0.666667
6,Barbell Lunges,barbell deadlift,0.5
7,Barbell Triceps Extension,standing barbell overhead triceps extension,0.774597
8,Bear Walk,yoke walk,0.5
9,Bench Press,bench press,1.0


*We observe that when no similar exercise is found (cosine similarity 0 for all candidates), the code returns the first exercise in the bbcom list as the most similar one, since the similarity of the exercise with itself was set to -1.*
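One way to avoid reporting an arbitrary "match" in that situation is to return nothing when the best score is not above a threshold. The sketch below is only an illustration: the helper `best_match` and its `min_score` parameter are hypothetical and not part of the notebook:

```python
def best_match(scores, candidates, min_score=0.0):
    """Return (candidate, score) for the highest score, or (None, score) if it is not above min_score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)  # first index of the maximum
    best_score = scores[best_index]
    if best_score <= min_score:
        return None, best_score          # nothing similar enough was found
    return candidates[best_index], best_score

demo_candidates = ["rickshaw carry", "single-leg press", "landmine twist"]
no_match = best_match([0.0, 0.0, 0.0], demo_candidates)
some_match = best_match([0.0, 0.5, 0.2], demo_candidates)
print(no_match, some_match)
```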

###  Evaluating the results and adjusting them 

It seems to work pretty decently, although some adjustment is needed.

After some trial and error we arrived at the code below for detecting identical exercises. It does not work 100% yet, but it now handles approximately 90% of the entries in wger correctly, and when it fails it returns a very similar exercise.

We will improve the algorithm's performance in a later version of the notebook.

**PS: Yonatan, I did not have time to get it perfect, but for now I am satisfied with the result:**
 - After applying the adjustments below, it gets all of the first 20 exercises right, although I estimate that it misses between 20 and 25 exercises out of 230.
 - It is hard to find the exact algorithm, but I have ideas to improve it to at least 95%:
     - strip the trailing s from words
     - build sets of words, count the number of differing words, and for the differing words count the letters that differ between them
     - add words used in the variations, such as cable, reverse, etc., that are mutually exclusive
 
 - For now I am moving on so that I can hand in the exercise, as I am running out of time
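The first of those ideas can be sketched quickly. The example below is a deliberately naive illustration, not the planned implementation: `strip_plural` and `word_cosine` are hypothetical helpers, and dropping a trailing "s" is a crude rule that would need care with words such as "triceps". It compares the "Barbell Hack Squats" pair, which scored 0.666667 in the table above:

```python
import math
from collections import Counter

def strip_plural(word: str) -> str:
    """Naive singularisation: drop one trailing 's' (but keep words ending in 'ss', e.g. 'press')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def word_cosine(a: str, b: str, normalise=lambda word: word) -> float:
    """Cosine similarity of word counts, after applying `normalise` to every word."""
    counts_a = Counter(normalise(w) for w in a.lower().split())
    counts_b = Counter(normalise(w) for w in b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm = (math.sqrt(sum(c * c for c in counts_a.values()))
            * math.sqrt(sum(c * c for c in counts_b.values())))
    return dot / norm if norm else 0.0

before = word_cosine("Barbell Hack Squats", "barbell hack squat")
after = word_cosine("Barbell Hack Squats", "barbell hack squat", normalise=strip_plural)
print(round(before, 3), round(after, 3))
```

With the plural stripped, the pair moves from the uncertain 0.6 to 0.8 band to a score of 1.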

In [58]:
equiv_df['similarity']="tbd"

# .loc[row, column] is used for every assignment, which avoids pandas' chained-assignment (SettingWithCopyWarning) problem
for index, x in enumerate(equiv_df['score']):
    wger_words = len(equiv_df.loc[index, 'wger'].split())     # number of words in the wger name
    bbcom_words = len(equiv_df.loc[index, 'bbcom'].split())   # number of words in the bbcom name

    if x>=0.8:                                 # 0.8 or more: it is the same exercise
        equiv_df.loc[index, 'similarity']=1
    elif x<=0.6:                               # 0.6 or less: it is a different exercise
        equiv_df.loc[index, 'similarity']=0
    # For the range 0.6 to 0.8 we consider them different if one name has at least twice as many words as the other, as this means it is a variation of the same exercise
    elif wger_words >= 2*bbcom_words or bbcom_words >= 2*wger_words:
        equiv_df.loc[index, 'similarity']=0
    else:
        equiv_df.loc[index, 'similarity']=1    # the rest we consider the same exercise (here is where I have more failures at the moment)

In [59]:
equiv_df.head(20)

Unnamed: 0,wger,bbcom,score,similarity
0,2 Handed Kettlebell Swing,kettlebell swing,0.816497,1
1,Arnold Shoulder Press,arnold press,0.816497,1
2,Axe Hold,hollowbody hold,0.5,0
3,Back Squat,barbell back squat,0.816497,1
4,Barbell Ab Rollout,barbell ab rollout knees,0.866025,1
5,Barbell Hack Squats,barbell hack squat,0.666667,1
6,Barbell Lunges,barbell deadlift,0.5,0
7,Barbell Triceps Extension,standing barbell overhead triceps extension,0.774597,1
8,Bear Walk,yoke walk,0.5,0
9,Bench Press,bench press,1.0,1
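The decision rule can also be written as a small pure function, which makes it easy to test on individual rows. This is a sketch of the same thresholds (0.8, 0.6 and the "twice as many words" rule); the function name `same_exercise` is illustrative:

```python
def same_exercise(wger_name: str, bbcom_name: str, score: float) -> int:
    """Return 1 if the pair is treated as the same exercise, else 0."""
    if score >= 0.8:
        return 1
    if score <= 0.6:
        return 0
    # Scores between 0.6 and 0.8: treat as different when one name has
    # at least twice as many words as the other (likely a variation).
    n_wger, n_bbcom = len(wger_name.split()), len(bbcom_name.split())
    if n_wger >= 2 * n_bbcom or n_bbcom >= 2 * n_wger:
        return 0
    return 1

rows = [
    ("2 Handed Kettlebell Swing", "kettlebell swing", 0.816497),
    ("Axe Hold", "hollowbody hold", 0.5),
    ("Barbell Hack Squats", "barbell hack squat", 0.666667),
    ("Barbell Triceps Extension", "standing barbell overhead triceps extension", 0.774597),
]
labels = [same_exercise(*row) for row in rows]
print(labels)
```

On a dataframe it could be applied row by row, for example with `equiv_df.apply(lambda r: same_exercise(r['wger'], r['bbcom'], r['score']), axis=1)`.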


In [60]:
bbcom_df['aux_match']= list(map(clean_string, list(bbcom_df['exercise']))) # temporary column with the cleaned exercise names, to compare against equiv_df['bbcom']
bbcom_df['description'] = 'unavailable'

new_rows = []   # wger exercises without a match in bbcom, to be added as new rows at the end
for index, x in enumerate(equiv_df['similarity']):
    if x==1:    # same exercise: copy the wger description to the matching bbcom row(s)
        bbcom_df.loc[bbcom_df['aux_match'] == equiv_df['bbcom'][index],'description'] = wger_df['description'][index]
        
    else:       # different exercise: build a new row from the wger data
        new_rows.append({'exercise': wger_df['name'][index],
                         'muscle':wger_df['muscles'][index],
                         'equipment':wger_df['equipment'][index],
                         'link':'unavailable',
                         'tb1':'unavailable',
                         'tb2':'unavailable',
                         'type':'unavailable',
                         'alternatives':wger_df['variations'][index],
                         'level':'unavailable',
                         'category':wger_df['category'][index],
                         'description':wger_df['description'][index],
                         })

# DataFrame.append was removed in pandas 2.0, so the new rows are added with a single pd.concat
bbcom_df = pd.concat([bbcom_df, pd.DataFrame(new_rows)], ignore_index=True)


In [61]:
bbcom_df

Unnamed: 0,exercise,muscle,equipment,link,tb1,tb2,type,alternatives,level,aux_match,description,category
0,Rickshaw Carry,Forearms,Other,https://bodybuilding.com/exercises/rickshaw-carry,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/exercises/exercis...,Strongman,"[""Dumbbell farmer's walk""]",Beginner,rickshaw carry,unavailable,
1,Single-Leg Press,Quadriceps,Machine,https://bodybuilding.com/exercises/single-leg-...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Barbell Bulgarian split squat'],Intermediate,singleleg press,unavailable,
2,Landmine twist,Abdominals,Other,https://bodybuilding.com/exercises/landmine-180s,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Russian twist'],Intermediate,landmine twist,unavailable,
3,Weighted pull-up,Lats,Other,https://bodybuilding.com/exercises/weighted-pu...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,[],Intermediate,weighted pullup,unavailable,
4,T-Bar Row with Handle,Middle Back,Other,https://bodybuilding.com/exercises/t-bar-row-w...,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/images/2020/octob...,Strength,['Seated Cable Rows'],Intermediate,tbar row handle,unavailable,
...,...,...,...,...,...,...,...,...,...,...,...,...
3246,Wall Pushup,"[Anterior deltoid, Pectoralis major, Triceps b...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[],unavailable,,Pushup against a wall,Arms
3247,Wall Slides,"[Biceps brachii, Biceps femoris, Pectoralis ma...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[],unavailable,,"Stand with heels, shoulders, back of head, a...",Back
3248,Weighted Butterfly Stretch,[],[Dumbbell],unavailable,unavailable,unavailable,unavailable,[],unavailable,,"Seated with your back against a wall, put the ...",Legs
3249,Weighted Step,[],[Barbell],unavailable,unavailable,unavailable,unavailable,[],unavailable,,"Box stepups w/ barbell, 45's on each side",Legs


In [62]:
bbcom_df.drop_duplicates(subset=['exercise'])   # only displays the de-duplicated frame; the result is not assigned back to bbcom_df

Unnamed: 0,exercise,muscle,equipment,link,tb1,tb2,type,alternatives,level,aux_match,description,category
0,Rickshaw Carry,Forearms,Other,https://bodybuilding.com/exercises/rickshaw-carry,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/exercises/exercis...,Strongman,"[""Dumbbell farmer's walk""]",Beginner,rickshaw carry,unavailable,
1,Single-Leg Press,Quadriceps,Machine,https://bodybuilding.com/exercises/single-leg-...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Barbell Bulgarian split squat'],Intermediate,singleleg press,unavailable,
2,Landmine twist,Abdominals,Other,https://bodybuilding.com/exercises/landmine-180s,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Russian twist'],Intermediate,landmine twist,unavailable,
3,Weighted pull-up,Lats,Other,https://bodybuilding.com/exercises/weighted-pu...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,[],Intermediate,weighted pullup,unavailable,
4,T-Bar Row with Handle,Middle Back,Other,https://bodybuilding.com/exercises/t-bar-row-w...,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/images/2020/octob...,Strength,['Seated Cable Rows'],Intermediate,tbar row handle,unavailable,
...,...,...,...,...,...,...,...,...,...,...,...,...
3246,Wall Pushup,"[Anterior deltoid, Pectoralis major, Triceps b...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[],unavailable,,Pushup against a wall,Arms
3247,Wall Slides,"[Biceps brachii, Biceps femoris, Pectoralis ma...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[],unavailable,,"Stand with heels, shoulders, back of head, a...",Back
3248,Weighted Butterfly Stretch,[],[Dumbbell],unavailable,unavailable,unavailable,unavailable,[],unavailable,,"Seated with your back against a wall, put the ...",Legs
3249,Weighted Step,[],[Barbell],unavailable,unavailable,unavailable,unavailable,[],unavailable,,"Box stepups w/ barbell, 45's on each side",Legs


In [63]:
bbcom_df=bbcom_df.fillna('unavailable')  # replace NaNs with 'unavailable'
bbcom_df

Unnamed: 0,exercise,muscle,equipment,link,tb1,tb2,type,alternatives,level,aux_match,description,category
0,Rickshaw Carry,Forearms,Other,https://bodybuilding.com/exercises/rickshaw-carry,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/exercises/exercis...,Strongman,"[""Dumbbell farmer's walk""]",Beginner,rickshaw carry,unavailable,unavailable
1,Single-Leg Press,Quadriceps,Machine,https://bodybuilding.com/exercises/single-leg-...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Barbell Bulgarian split squat'],Intermediate,singleleg press,unavailable,unavailable
2,Landmine twist,Abdominals,Other,https://bodybuilding.com/exercises/landmine-180s,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Russian twist'],Intermediate,landmine twist,unavailable,unavailable
3,Weighted pull-up,Lats,Other,https://bodybuilding.com/exercises/weighted-pu...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,[],Intermediate,weighted pullup,unavailable,unavailable
4,T-Bar Row with Handle,Middle Back,Other,https://bodybuilding.com/exercises/t-bar-row-w...,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/images/2020/octob...,Strength,['Seated Cable Rows'],Intermediate,tbar row handle,unavailable,unavailable
...,...,...,...,...,...,...,...,...,...,...,...,...
3246,Wall Pushup,"[Anterior deltoid, Pectoralis major, Triceps b...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[],unavailable,unavailable,Pushup against a wall,Arms
3247,Wall Slides,"[Biceps brachii, Biceps femoris, Pectoralis ma...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[],unavailable,unavailable,"Stand with heels, shoulders, back of head, a...",Back
3248,Weighted Butterfly Stretch,[],[Dumbbell],unavailable,unavailable,unavailable,unavailable,[],unavailable,unavailable,"Seated with your back against a wall, put the ...",Legs
3249,Weighted Step,[],[Barbell],unavailable,unavailable,unavailable,unavailable,[],unavailable,unavailable,"Box stepups w/ barbell, 45's on each side",Legs


In [64]:
def clean_una(column):               # replace empty lists with ["unavailable"]
    for index, x in enumerate(bbcom_df[column]):
        if len(x)==0:
            bbcom_df.at[index, column]=["unavailable"]  # kept in list format so that all columns contain the same data type; .at avoids chained assignment

In [65]:
clean_una('muscle')
clean_una('equipment')
clean_una('alternatives')

In [66]:
bbcom_df

Unnamed: 0,exercise,muscle,equipment,link,tb1,tb2,type,alternatives,level,aux_match,description,category
0,Rickshaw Carry,Forearms,Other,https://bodybuilding.com/exercises/rickshaw-carry,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/exercises/exercis...,Strongman,"[""Dumbbell farmer's walk""]",Beginner,rickshaw carry,unavailable,unavailable
1,Single-Leg Press,Quadriceps,Machine,https://bodybuilding.com/exercises/single-leg-...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Barbell Bulgarian split squat'],Intermediate,singleleg press,unavailable,unavailable
2,Landmine twist,Abdominals,Other,https://bodybuilding.com/exercises/landmine-180s,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,['Russian twist'],Intermediate,landmine twist,unavailable,unavailable
3,Weighted pull-up,Lats,Other,https://bodybuilding.com/exercises/weighted-pu...,https://www.bodybuilding.com/images/2020/xdb/c...,https://www.bodybuilding.com/images/2020/xdb/c...,Strength,[],Intermediate,weighted pullup,unavailable,unavailable
4,T-Bar Row with Handle,Middle Back,Other,https://bodybuilding.com/exercises/t-bar-row-w...,https://www.bodybuilding.com/exercises/exercis...,https://www.bodybuilding.com/images/2020/octob...,Strength,['Seated Cable Rows'],Intermediate,tbar row handle,unavailable,unavailable
...,...,...,...,...,...,...,...,...,...,...,...,...
3246,Wall Pushup,"[Anterior deltoid, Pectoralis major, Triceps b...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[unavailable],unavailable,unavailable,Pushup against a wall,Arms
3247,Wall Slides,"[Biceps brachii, Biceps femoris, Pectoralis ma...",[none (bodyweight exercise)],unavailable,unavailable,unavailable,unavailable,[unavailable],unavailable,unavailable,"Stand with heels, shoulders, back of head, a...",Back
3248,Weighted Butterfly Stretch,[unavailable],[Dumbbell],unavailable,unavailable,unavailable,unavailable,[unavailable],unavailable,unavailable,"Seated with your back against a wall, put the ...",Legs
3249,Weighted Step,[unavailable],[Barbell],unavailable,unavailable,unavailable,unavailable,[unavailable],unavailable,unavailable,"Box stepups w/ barbell, 45's on each side",Legs


Yonatan, more can be done to unify the data, such as extracting the equipment for every exercise, or the body part each one works. For now I am leaving it as is so that I have time to finish. The equipment and muscle data should also be converted to list format.

In [67]:
bbcom_df=bbcom_df.drop(['aux_match'], axis=1) # we drop the temporary column as we don't need it any more

In [68]:
bbcom_df.to_csv('complete_exercise_db.csv', index=False) 

## WEB BEHIND USER / PASSWORD SCRAPING - BODYBUILDING.COM WORKOUTS

Now that we have an exercise database, we will proceed to download all the workouts from Bodybuilding.com.

For evaluation purposes it is not necessary to execute the code below until further notice. The dictio.txt file has been provided with the text of the resulting dictionary, and the dictionary text can also be copied from the output cell below.
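A possible way to load the provided text back into a dictionary is `ast.literal_eval`, sketched below. This assumes that dictio.txt contains the printed Python dictionary exactly as shown in the output cell, which has not been verified here. The helper name `load_workouts` and the tiny demo file are illustrative stand-ins for the real file:

```python
import ast
import os
import tempfile

def load_workouts(path: str) -> dict:
    """Parse a text file containing a printed Python dictionary back into a dict."""
    with open(path, encoding="utf-8") as handle:
        return ast.literal_eval(handle.read())   # safely evaluates literals only, no arbitrary code

# Self-contained demo with a tiny stand-in file (not the real dictio.txt).
sample = {"demo plan": {1: {1: {"1 individual exercise": {"1. Bench Press": "3 sets"}},
                            2: "rest day"}}}
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "dictio_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(repr(sample))
    loaded = load_workouts(demo_path)

print(loaded == sample)
```

`ast.literal_eval` is used rather than `json` because the week and day keys are integers, and JSON would turn them into strings.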

In [69]:
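import time 
import re 
import json
# imports needed by the cells below (repeated here in case they were not loaded earlier in the notebook)
from bs4 import BeautifulSoup as bs                        # parses the page source
from selenium import webdriver                             # browser automation
from webdriver_manager.chrome import ChromeDriverManager   # downloads a matching chromedriver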

In [70]:
class Br():
    
    def init(self):
        PATH=ChromeDriverManager().install()
        self.browser = webdriver.Chrome(PATH)
               
    def start(self,url):             # Open a certain URL
        self.browser.get(url)
        time.sleep(1)

    def click(self,path):              # clicks in element by xpath
        element = self.browser.find_element_by_xpath(path)
        element.click()
        time.sleep(2)
       
    def close_out(self):
        self.browser.close()

In [None]:
b=Br()
b.init()
b.start("https://www.bodybuilding.com/combined-signin?referrer=https%3A%2F%2Fwww.bodybuilding.com%2Fworkout-plans%2Fgoal%2Fbuild-muscle%23&country=ES")

In [None]:
b.click("//*[@id=\"onetrust-accept-btn-handler\"]")
address = ""                 #email and password for login
password = ""

In [None]:
emailinput =b.browser.find_element_by_xpath('//*[@id="ispbxii_1"]') 
emailinput.send_keys(address) 

passwordinput =b.browser.find_element_by_xpath('//*[@id="root"]/div/div/div[2]/div[1]/form/div[2]/div/input') 
passwordinput.send_keys(password) 

time.sleep(1)

nextbutton = b.browser.find_element_by_xpath('//*[@id="root"]/div/div/div[2]/div[1]/form/button') 
nextbutton.click()

time.sleep(5)

In [None]:
d={}         # dictionary to hold all workouts
obj_links = ["build-muscle", "lose-weight", "gain-strength", "get-fit", "performance"] # goals used to build the URL of each workout category page (see below)
for obj in obj_links:      # first we iterate through the types of workouts
    print(f"currently downloading plans for {obj}")  # to follow the progress while scraping
    b.start("https://www.bodybuilding.com/workout-plans/goal/"+obj)  # go to type of workout main page
    soup=bs(b.browser.page_source, 'html.parser')
    work_outs_links=[]   # makes a list to insert all the links to each specific workout inside the category
    soup=soup.find_all(class_='plan__link')
    
    for x in soup:
        work_outs_links.append(x.get('href'))

    for link in work_outs_links: 

        
        name_workout = link.replace("https://www.bodybuilding.com/workout-plans/about/","").replace("-"," ") #take name from link
        print(f"currently downloading workout {name_workout}")
        b.start(link)
        b.click("//*[@id=\"js-go-to-plan-btn\"]/div") #go to plan
        b.click("//*[@id=\"PLAN_NAV_SCHEDULE\"]/span") #workout schedule

        soup=bs(b.browser.page_source, 'html.parser') #take soup from workout page
        soup=soup.find('select',{'id':'PLAN_WEEK_DROPDOWN'}) # create a list with the weeks inside the workout
        soup=soup.text
        list_week=[int(s) for s in soup.split() if s.isdigit()]
        list_week=sorted(set(list_week)) # remove repeated numbers; sorted so that list_week[-1] is really the last week (a set has no guaranteed order)

        b.click(f"//*[@id=\"PLAN_WEEK_DROPDOWN\"]/option[{list_week[-1]}]") # go to last week to take last day and see how many days the workout has
        soup=bs(b.browser.page_source, 'html.parser') 
        soup=soup.find_all('span',{'class':'sub-nav__button__day'}) 
        soup=soup[-1] # takes the soup where the last day is
        soup=soup.text # takes the text
        soup=[int(s) for s in soup.split() if s.isdigit()] # take the numbers
        soup=soup[0] # take only the first
        list_day=[i+1 for i in range(soup)] # make a list including all days of the workout

        d[name_workout]={} #creates workout key in dictionary
        print("currently downloading week...")

        for week in list_week:
            print(week)
            b.click(f"//*[@id=\"PLAN_WEEK_DROPDOWN\"]/option[{week}]") #go to week
            time.sleep(2)
            d[name_workout][week]={} #creates week key in dictionary

            for day in list_day[(week*7-7):(week*7)]: # so that it takes the corresponding days keys, ex. for week 2 (8,9,10,11,12,13,14) 
                b.click(f"//*[@id=\"PLAN_NAV_DAY{day}\"]/span[2]") #go to day
                time.sleep(2)
                soup=bs(b.browser.page_source, 'html.parser') #take soup from day page
                d[name_workout][week][day]={} #creates day key in dictionary

                if "rest" in soup.find('h1').text.lower():
                    d[name_workout][week][day]="rest day"
                    continue #move to next day if it is a rest day

                else:    #now we have to take the exercises types
                    exercise_type_soup = soup.find_all('div',class_='cms-article-list__content--container')


                exercise_number=0
                for exercise in exercise_type_soup:
                    exercise_number+=1
                    warm_check_list=["WARM", "warm", "Warm"] 
                    if "cms-article-list__content--group-title" in str(exercise): #if it has group title and "warm" it is a warmup, 
                        if  any(substring in str(exercise) for substring in warm_check_list): 
                            exercise_type = str(exercise_number) + " " + "Warm-up exercise"             
                        else:
                            exercise_type = str(exercise_number) + " " + "Compound exercise" # otherwise it is a compound exercise             


                    else: # if it doesn't have a group title then it is an individual exercise
                        exercise_type = str(exercise_number) + " " + "individual exercise"

                    d[name_workout][week][day][exercise_type]={} #creates exercise type key in dictionary

                    exercise_soup = exercise.find_all('div',class_='cms-article-list__content--container-left')

                    set_nr=0
                    for sett in exercise_soup:
                        set_nr+=1
                        sett_name = sett.find('a').text
                        sett_workload = sett.find('span').text
                        d[name_workout][week][day][exercise_type][str(set_nr)+ ". " + sett_name]=sett_workload # input exercises name and workload for each exercise

The code above managed to download 42 build-muscle workouts and one lose-weight workout. However, it broke on the second lose-weight workout, as that page includes some diet concepts and its structure is somewhat different.

For the purpose of this exercise we will continue with just the build-muscle workouts.
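One way to keep the download running when a plan page has a different structure is to wrap the per-plan work in a try/except and record the failures. This is only a sketch: `scrape_all`, `scrape_plan` and `fake_scrape_plan` are hypothetical names, and the per-plan scraping logic from the cell above would first need to be moved into a function.

```python
def scrape_all(plan_links, scrape_plan):
    """Scrape every plan, collecting failures instead of stopping at the first one."""
    results, failures = {}, {}
    for link in plan_links:
        try:
            results[link] = scrape_plan(link)
        except (AttributeError, IndexError, ValueError) as error:
            failures[link] = repr(error)     # remember why this plan was skipped
    return results, failures

def fake_scrape_plan(link):
    """Stand-in for the real scraper: fails on pages with an unexpected structure."""
    if "diet" in link:
        # the kind of error raised when soup.find(...) returns None
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return {1: {1: "rest day"}}

results, failures = scrape_all(["plan-a", "plan-with-diet", "plan-b"], fake_scrape_plan)
print(sorted(results), sorted(failures))
```

With Selenium in the loop, `selenium.common.exceptions.NoSuchElementException` would probably need to be added to the caught exceptions as well.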

In [72]:
d  # to restore d later, paste the output below back into the variable, or use the text in the provided dictio.txt file

{'jim stoppanis 12 week shortcut to size': {1: {1: {'1 Warm-up exercise': {'1. Barbell Bench Press - Medium Grip': '2 sets, 5-10 reps (rest 1 min.)'},
    '2 individual exercise': {'1. Barbell Bench Press - Medium Grip': '4 sets, 12-15 reps (rest 2 min.)'},
    '3 individual exercise': {'1. Barbell Incline Bench Press Medium-Grip': '3 sets, 12-15 reps (rest 2 min.)'},
    '4 individual exercise': {'1. Incline Dumbbell Flyes': '3 sets, 12-15 reps (rest 1 min.)'},
    '5 individual exercise': {'1. Cable Crossover': '3 sets, 12-15 reps (rest 1 min. )'},
    '6 individual exercise': {'1. Triceps Pushdown': '4 sets, 12-15 reps (rest 1 min.)'},
    '7 individual exercise': {'1. Dumbbell skullcrusher': '3 sets, 12-15 reps (rest 1 min.)'},
    '8 individual exercise': {'1. Low cable overhead triceps extension': '3 sets, 12-15 reps (rest 1 min. )'},
    '9 individual exercise': {'1. Standing Dumbbell Calf Raise': '3 sets, 25-30 reps (rest 1 min. )'},
    '10 individual exercise': {'1. Seated Ca
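
As an alternative to pasting the output back by hand, the saved text can be parsed directly. The sketch below assumes the file contains the plain Python literal printed above; `load_workout_dict` and the sample data are illustrative, and a temporary file is used instead of the real `dictio.txt`. `ast.literal_eval` is used rather than `json.loads` because the week and day keys are integers and the text uses single quotes, neither of which is valid JSON.

```python
import ast
import tempfile
from pathlib import Path


def load_workout_dict(path):
    """Read a text file holding a Python dict literal and return the dict."""
    return ast.literal_eval(Path(path).read_text(encoding="utf-8"))


# Small sample with the same shape as d: workout -> week -> day -> exercise type -> sets
sample = {
    "toy workout": {
        1: {
            1: {"1 individual exercise": {"1. Barbell Curl": "3 sets, 12-15 reps"}},
            2: "rest day",
        }
    }
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "dictio_sample.txt"
    sample_path.write_text(repr(sample), encoding="utf-8")
    restored = load_workout_dict(sample_path)

print(restored == sample)
```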

Next, we want a column in our combined exercises data frame that lists, for each exercise, all the workouts in which it is used.

The downloaded workout data contains a lot of information that could be used in further analysis another time, but for this exercise we will simplify the dictionary, as we only need the exercises used in each workout.

In [73]:
simplified_d = {}

for k, v in d.items():  # k == workout name
    exercises_list = []

    for k1, v1 in v.items():  # k1 == week
        for k2, v2 in v1.items():  # k2 == day
            if v2 == "rest day":
                continue

            for k3, v3 in v2.items():  # k3 == exercise type
                for k4, v4 in v3.items():  # k4 == "<set number>. <exercise name>"
                    # drop the "<number>. " prefix; splitting also works for numbers with two or more digits
                    exercises_list.append(k4.split(". ", 1)[1])

    simplified_d[k] = list(set(exercises_list))  # to avoid duplicates

In [74]:
simplified_d

{'jim stoppanis 12 week shortcut to size': ['Barbell Incline Bench Press Medium-Grip',
  'Standing Calf Raises',
  'Leg Extension',
  'Dumbbell skullcrusher',
  'Leg Press',
  'Single-Leg Leg Press',
  'Lying cable triceps extension',
  'Seated Cable Rows',
  'Seated Calf Raise',
  'Lying Leg Curls',
  'Cable Seated Lateral Raise',
  'Overhead cable curl',
  'Seated rear delt fly',
  'Barbell Curl',
  'Smith machine upright row',
  'Barbell back squat',
  'Close-grip bench press',
  'Standing lat pull-down',
  'Leg Extensions',
  'Behind-the-neck pull-down',
  'Barbell Squat',
  'Hanging leg raise',
  'Lying Leg Curl',
  'Dumbbell Flyes',
  'Single-Arm Smith Machine Shrug',
  'Cable Lateral Raise',
  'Triceps Pushdown',
  'Incline cable chest fly',
  'Dumbbell bent-over row',
  'Barbell Bench Press - Medium Grip',
  'Lat pull-down',
  'Dumbbell Crunch Isometric Hold',
  'Single-arm cable front raise',
  'Standing Cable Wood Chop',
  'Kneeling cable crunch',
  'Side Plank',
  'Cable Cro

In [75]:
bbcom_df["Workouts using the exercise"]="none" # creates the new column

Now, using the simplified dictionary, we will build for each exercise a list of all the workouts in which it is used and store it in the new column.

In [76]:
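workouts_per_exercise = []
for x in bbcom_df["exercise"]:
    in_workout_list = []
    for k, v in simplified_d.items():
        for v1 in v:
            if x.lower() in v1.lower():  # case-insensitive substring match
                in_workout_list.append(k)
    in_workout_list = list(set(in_workout_list))  # remove duplicate workout names

    if len(in_workout_list) == 0:
        in_workout_list.append("none")

    workouts_per_exercise.append(in_workout_list)

# assign the whole column at once instead of chained assignment (df[col][i] = ...),
# which pandas warns about and which may not write back to the data frame
bbcom_df["Workouts using the exercise"] = workouts_per_exercise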
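
The matching step can also be written as a small function, which makes its behavior easier to check on toy data. This is a sketch only: `workouts_using` and the toy dictionary are hypothetical names and values, and the function keeps the same case-insensitive substring rule as the loop above. Note that a substring rule also matches longer names, so "Barbell Curl" is found inside "Reverse Barbell Curl".

```python
def workouts_using(exercise, workouts_to_exercises):
    """Return the sorted workout names whose exercise list contains `exercise` as a substring."""
    needle = exercise.lower()
    matches = {
        workout
        for workout, names in workouts_to_exercises.items()
        if any(needle in name.lower() for name in names)
    }
    # fall back to ["none"] when no workout uses the exercise
    return sorted(matches) or ["none"]


# Toy stand-in for simplified_d (illustrative values)
toy_simplified = {
    "toy workout a": ["Barbell Curl", "Leg Press"],
    "toy workout b": ["Reverse Barbell Curl"],
}

curl_workouts = workouts_using("barbell curl", toy_simplified)
press_workouts = workouts_using("Leg Press", toy_simplified)
deadlift_workouts = workouts_using("Deadlift", toy_simplified)
print(curl_workouts)
print(press_workouts)
print(deadlift_workouts)
```

A function like this could be applied to the whole column with `bbcom_df["exercise"].apply(lambda x: workouts_using(x, simplified_d))`. If only exact name matches are wanted, the `in` test would need to be replaced by an equality check.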

In [77]:
bbcom_df["Workouts using the exercise"][3] # example

['get swole cory gregorys 20 week muscle building trainer',
 'maximum muscle 9 week advanced training for gains',
 'kris gethins 12 week muscle building trainer',
 'project mass jake wilsons 14 week muscle building trainer',
 '30 day back with abel albonetti']

In [78]:
bbcom_df.to_csv('final.csv')
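
One thing to keep in mind when reloading the file: a CSV cell can only hold text, so the lists in "Workouts using the exercise" are written as their string representation. The sketch below shows one possible way to get real lists back with a converter. It uses a toy data frame and an in-memory buffer instead of `final.csv`; the toy values are illustrative.

```python
import ast
import io

import pandas as pd

# Toy frame standing in for bbcom_df (illustrative values)
toy_df = pd.DataFrame({
    "exercise": ["Barbell Curl", "Deadlift"],
    "Workouts using the exercise": [["toy workout a", "toy workout b"], ["none"]],
})

buffer = io.StringIO()
toy_df.to_csv(buffer)  # list cells are written as text such as "['toy workout a', 'toy workout b']"
buffer.seek(0)

# Parse the text back into Python lists while reading
restored = pd.read_csv(
    buffer,
    index_col=0,
    converters={"Workouts using the exercise": ast.literal_eval},
)
first_cell = restored.loc[0, "Workouts using the exercise"]
print(type(first_cell), first_cell)
```

Without the converter, each cell would come back as a single string rather than a list.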