# Weighted Pull-up & Front Lever Survey

## Model Building

Models I am considering:
* Linear SVC
* Random Forest

I need to read more about both before implementing either.
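As a reading aid, here is a minimal sketch of how the two candidates could be tried side by side in scikit-learn. It runs on synthetic data from `make_classification`, not the survey, and the hyperparameters (`max_iter=5000`, `n_estimators=200`) are placeholder assumptions rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 4 numeric features, 3 classes (not the survey).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    # Linear SVMs are scale-sensitive, so standardize the features first.
    "Linear SVC": make_pipeline(StandardScaler(), LinearSVC(max_iter=5000)),
    # Tree ensembles do not need feature scaling.
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out split
    print(f"{name}: {scores[name]:.2f}")
```

Both estimators share the same `fit` / `predict` / `score` interface, so swapping one for the other later should only mean changing the model object.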

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import svm, metrics
from statistics import mean 

#### brainstorming

Split the FL stages into two groups: 0 (no FL, tuck, adv tuck) and 1 (half-lay straddle, straddle, full).

^ I don't think this captures the degree of difference between stages, especially between no FL and adv tuck, or between half-lay straddle and full.
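To make the concern about the binary grouping concrete, here is a small sketch comparing it with an ordinal rank that keeps the order of the stages. The stage names follow the labels seen in the survey output, except `"No lever"`, which is a hypothetical placeholder because the actual no-FL label does not appear in this notebook. The ordering is the one listed in the brainstorm above.

```python
# Stages ordered from easiest to hardest, following the grouping above.
FL_STAGES = [
    "No lever",  # assumption: placeholder, the real label is not shown here
    "Tuck lever",
    "Advanced Tuck lever",
    "Straddle Halflay lever",
    "Straddle lever",
    "Full lever",
]

# Ordinal encoding: each stage maps to its position in the progression.
FL_RANK = {stage: rank for rank, stage in enumerate(FL_STAGES)}


def to_binary(stage: str) -> int:
    """Binary grouping: 0 up to adv tuck, 1 from half-lay straddle onward."""
    return int(FL_RANK[stage] >= FL_RANK["Straddle Halflay lever"])


for stage in FL_STAGES:
    print(f"{stage:<24} rank={FL_RANK[stage]} binary={to_binary(stage)}")
```

The binary column puts no FL and adv tuck in the same bucket, while the rank column keeps them two steps apart. Whether equal spacing between ranks is reasonable is itself an assumption.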

Calculate an estimated 1RM for each athlete by taking the midpoint of their 1RM percentage range.

^ Ideally the 1RM data would be an exact percentage instead of a range, because binning reduces the precision of the data.

Going with the midpoint approach and an SVM.

In [2]:
raw = pd.read_csv("Weighted PU and FL cleaned.csv",encoding = "ISO-8859-1")
copy = raw.copy()
copy = copy.drop(["Other thoughts or comments? Helpful data","Timestamp"],axis=1)
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 609 entries, 0 to 608
Data columns (total 5 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Weighted Pullup 1RM (% of BW)                      609 non-null    object 
 1   Max Front Lever progression (3 seconds good form)  609 non-null    object 
 2   Bodyweight (kg)                                    590 non-null    float64
 3   Max pullups (endurance)                            586 non-null    float64
 4   Height (cm)                                        498 non-null    float64
dtypes: float64(3), object(2)
memory usage: 23.9+ KB


### data cleaning again + big assumption note
To take the midpoint of the over-90% group, I have to assume that none of those participants add more than 100% of their bodyweight to their weighted pull-up, so the group is treated as 90-100%. (Next time I should collect the raw number from participants instead of a range.)
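A small standalone sketch of the midpoint arithmetic shows how much the cap on the open-ended group matters. The helper `estimate_1rm_pct` and its `open_ended_cap` parameter are hypothetical names for illustration. The raw label formats (`"<10%"`, `"65-80%"`, `">90%"`) are inferred from the string replacements used in this notebook, not confirmed from the CSV.

```python
def estimate_1rm_pct(label: str, open_ended_cap: int = 100) -> float:
    """Return the midpoint of a survey bin such as "65-80%", "<10%" or ">90%"."""
    text = label.replace("%", "").strip()
    if text.startswith("<"):
        # "<10" is treated as the range 0-10.
        low, high = 0, int(text[1:])
    elif text.startswith(">"):
        # ">90" has no upper bound in the survey, so one must be assumed.
        low, high = int(text[1:]), open_ended_cap
    else:
        low, high = (int(part) for part in text.split("-"))
    return (low + high) / 2


print(estimate_1rm_pct("65-80%"))                     # 72.5
print(estimate_1rm_pct(">90%"))                       # 95.0 with the 100% cap
print(estimate_1rm_pct(">90%", open_ended_cap=120))   # 105.0 with a looser cap
```

Raising the assumed cap from 100 to 120 shifts every athlete in the top group by 10 percentage points, which is why the raw number would be better to collect.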

In [3]:
copy["Weighted Pullup 1RM (% of BW)"] = copy["Weighted Pullup 1RM (% of BW)"].apply(lambda x: x.replace("%","").replace("<10","0-10").replace(">90","90-100"))

In [4]:
copy.head()

Unnamed: 0,Weighted Pullup 1RM (% of BW),Max Front Lever progression (3 seconds good form),Bodyweight (kg),Max pullups (endurance),Height (cm)
0,0-10,Straddle lever,59.8,20.0,165.0
1,65-80,Full lever,72.0,21.0,178.0
2,50-65,Straddle Halflay lever,68.0,17.0,169.0
3,50-65,Straddle lever,62.0,20.0,142.0
4,30-50,Full lever,45.0,20.0,160.0


In [5]:
copy.tail()

Unnamed: 0,Weighted Pullup 1RM (% of BW),Max Front Lever progression (3 seconds good form),Bodyweight (kg),Max pullups (endurance),Height (cm)
604,90-100,Full lever,55.0,27.0,170.0
605,65-80,Full lever,72.0,22.0,175.0
606,10-30,Tuck lever,88.0,10.0,187.0
607,65-80,Straddle lever,65.0,35.0,170.0
608,50-65,Advanced Tuck lever,68.0,15.0,183.0


In [6]:
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 609 entries, 0 to 608
Data columns (total 5 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Weighted Pullup 1RM (% of BW)                      609 non-null    object 
 1   Max Front Lever progression (3 seconds good form)  609 non-null    object 
 2   Bodyweight (kg)                                    590 non-null    float64
 3   Max pullups (endurance)                            586 non-null    float64
 4   Height (cm)                                        498 non-null    float64
dtypes: float64(3), object(2)
memory usage: 23.9+ KB


In [7]:
# Thanks to Ken for the sample code from the data scientist salary project.
# Split each "low-high" range into its bounds, then take the midpoint.
copy["min %"] = copy["Weighted Pullup 1RM (% of BW)"].apply(lambda x: int(x.split('-')[0]))
copy["max %"] = copy["Weighted Pullup 1RM (% of BW)"].apply(lambda x: int(x.split('-')[1]))
copy["Estimated Weighted Pullup 1RM (% of BW)"] = (copy["min %"]+copy["max %"])/2

In [8]:
copy.head()

Unnamed: 0,Weighted Pullup 1RM (% of BW),Max Front Lever progression (3 seconds good form),Bodyweight (kg),Max pullups (endurance),Height (cm),min %,max %,Estimated Weighted Pullup 1RM (% of BW)
0,0-10,Straddle lever,59.8,20.0,165.0,0,10,5.0
1,65-80,Full lever,72.0,21.0,178.0,65,80,72.5
2,50-65,Straddle Halflay lever,68.0,17.0,169.0,50,65,57.5
3,50-65,Straddle lever,62.0,20.0,142.0,50,65,57.5
4,30-50,Full lever,45.0,20.0,160.0,30,50,40.0


In [9]:
copy = copy.drop(["Weighted Pullup 1RM (% of BW)","min %","max %"],axis=1)

In [10]:
copy.head()

Unnamed: 0,Max Front Lever progression (3 seconds good form),Bodyweight (kg),Max pullups (endurance),Height (cm),Estimated Weighted Pullup 1RM (% of BW)
0,Straddle lever,59.8,20.0,165.0,5.0
1,Full lever,72.0,21.0,178.0,72.5
2,Straddle Halflay lever,68.0,17.0,169.0,57.5
3,Straddle lever,62.0,20.0,142.0,57.5
4,Full lever,45.0,20.0,160.0,40.0


In [11]:
copy = copy[["Max Front Lever progression (3 seconds good form)","Bodyweight (kg)","Height (cm)","Estimated Weighted Pullup 1RM (% of BW)","Max pullups (endurance)"]]
copy.head()

Unnamed: 0,Max Front Lever progression (3 seconds good form),Bodyweight (kg),Height (cm),Estimated Weighted Pullup 1RM (% of BW),Max pullups (endurance)
0,Straddle lever,59.8,165.0,5.0,20.0
1,Full lever,72.0,178.0,72.5,21.0
2,Straddle Halflay lever,68.0,169.0,57.5,17.0
3,Straddle lever,62.0,142.0,57.5,20.0
4,Full lever,45.0,160.0,40.0,20.0


In [12]:
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 609 entries, 0 to 608
Data columns (total 5 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Max Front Lever progression (3 seconds good form)  609 non-null    object 
 1   Bodyweight (kg)                                    590 non-null    float64
 2   Height (cm)                                        498 non-null    float64
 3   Estimated Weighted Pullup 1RM (% of BW)            609 non-null    float64
 4   Max pullups (endurance)                            586 non-null    float64
dtypes: float64(4), object(1)
memory usage: 23.9+ KB


A lot of values are missing (111 of 609 heights, 23 max pull-up counts, 19 bodyweights), so I'm considering replacing them with the column averages. I'm not sure yet whether that is a good idea.

In [13]:
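# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about.
copy["Bodyweight (kg)"] = copy["Bodyweight (kg)"].fillna(round(copy["Bodyweight (kg)"].mean()))
copy["Height (cm)"] = copy["Height (cm)"].fillna(round(copy["Height (cm)"].mean()))
copy["Max pullups (endurance)"] = copy["Max pullups (endurance)"].fillna(round(copy["Max pullups (endurance)"].mean()))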
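One caveat with filling from the whole-column mean is that the fill value is computed from rows that later end up in the test split, so a little test information leaks into training. A possible alternative, sketched below on made-up toy values, is to fit scikit-learn's `SimpleImputer` on the training rows only and reuse it on the test rows. The median strategy is my own swap-in here (it is less sensitive to outliers), not something this notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for two survey features (values are made up).
toy = pd.DataFrame({
    "Bodyweight (kg)": [59.8, 72.0, np.nan, 62.0, 45.0, 88.0, 65.0, np.nan],
    "Height (cm)": [165.0, np.nan, 169.0, 142.0, 160.0, 187.0, np.nan, 183.0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the fill values from the training rows only.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)
# Reuse the training medians on the test rows (no refitting).
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print("Fill values learned from train:", imputer.statistics_)
print(test_filled)
```

With only 19 to 111 missing values per column the leak is probably small, but fitting on the training rows keeps the accuracy estimate honest.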

In [14]:
copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 609 entries, 0 to 608
Data columns (total 5 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Max Front Lever progression (3 seconds good form)  609 non-null    object 
 1   Bodyweight (kg)                                    609 non-null    float64
 2   Height (cm)                                        609 non-null    float64
 3   Estimated Weighted Pullup 1RM (% of BW)            609 non-null    float64
 4   Max pullups (endurance)                            609 non-null    float64
dtypes: float64(4), object(1)
memory usage: 23.9+ KB


### model building
The sample size is still fairly small (609 rows), so I'm going to use a 90/10 train/test split.
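With a 10% test set of roughly 61 rows, a purely random split can leave some FL stages over- or under-represented in the test data. A possible refinement, sketched here on made-up labels, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. The class counts below are invented for illustration and are not the survey's.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels standing in for the FL stages.
labels = ["Full lever"] * 50 + ["Straddle lever"] * 30 + ["Tuck lever"] * 20
features = [[i] for i in range(len(labels))]  # dummy single-column features

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=0
)

# The test split mirrors the 50/30/20 proportions of the full data.
print(Counter(y_test))
print(Counter(y_train))
```

Stratifying requires at least two rows per class, so very rare stages would need to be checked first.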

In [15]:
x = copy.drop(["Max Front Lever progression (3 seconds good form)"],axis=1)
y = copy["Max Front Lever progression (3 seconds good form)"]

In [16]:
x.head()

Unnamed: 0,Bodyweight (kg),Height (cm),Estimated Weighted Pullup 1RM (% of BW),Max pullups (endurance)
0,59.8,165.0,5.0,20.0
1,72.0,178.0,72.5,21.0
2,68.0,169.0,57.5,17.0
3,62.0,142.0,57.5,20.0
4,45.0,160.0,40.0,20.0


In [17]:
y.head()

0            Straddle lever
1                Full lever
2    Straddle Halflay lever
3            Straddle lever
4                Full lever
Name: Max Front Lever progression (3 seconds good form), dtype: object

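Test the accuracy of the model over 1000 random 90/10 splits. Note that `svm.SVC()` uses an RBF kernel by default, so this is not the Linear SVC listed at the top.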

In [18]:
from tqdm import tqdm

scoreList = []
for i in tqdm(range(1000)):
    # New random 90/10 split on every iteration (no fixed random_state)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
    classifier = svm.SVC()  # default RBF kernel
    classifier.fit(x_train, y_train)
    predict = classifier.predict(x_test)
    # Fraction of test rows whose predicted FL stage matches the true one
    score = metrics.accuracy_score(y_test, predict)
    scoreList.append(score)

100%|██████████| 1000/1000 [00:16<00:00, 59.56it/s]


In [21]:
print("Average score of 1000 tests is: ",round(mean(scoreList)*100),"%")

Average score of 1000 tests is:  46 %
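An average of 46% is hard to judge on its own with six FL stages, so it helps to compare it with a trivial baseline that always predicts the most common stage. The sketch below does that with `DummyClassifier`, and also standardizes the features before the SVC, since the survey features sit on different scales (kg, cm, %, reps) and SVMs are sensitive to that. It runs on synthetic data from `make_classification`, so the printed numbers say nothing about the survey. The 10-fold setup is an assumed stand-in for the 1000 random splits.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 4 features, 6 classes (mirrors the shape of the problem only).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
# Scaling lives inside the pipeline so it is fit on each training fold only.
model = make_pipeline(StandardScaler(), SVC())

baseline_scores = cross_val_score(baseline, X, y, cv=cv)
model_scores = cross_val_score(model, X, y, cv=cv)

print(f"Baseline accuracy:   {baseline_scores.mean():.2f}")
print(f"Scaled SVC accuracy: {model_scores.mean():.2f}")
```

On the real data, the same comparison would use `x` and `y` from above in place of the synthetic arrays. The gap between the two numbers is the part of the 46% that the model actually earned.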
