# Modelling

I will try to create a model to predict the sport that a person would be best at based on their attributes.

I will use athlete data from the Olympics to do this.

In [3]:
# For data usage
import pandas as pd
import numpy as np

# For visualisations
import seaborn as sns
import matplotlib.pyplot as plt

# For performance measures
from sklearn import metrics  # accuracy, recall, precision
import time   # measure time it takes to run a model (used as time.time() and time.sleep())
from sklearn.metrics import (confusion_matrix, accuracy_score)

# For modelling (Lin Regression)
import statsmodels.api as sm   # For the linear regression model
import statsmodels.tools       # For the evaluation of our model

# Stuff for modelling (Decision Tree)
from sklearn.tree import DecisionTreeClassifier  # the model itself
from sklearn import tree   # visualising the model
from sklearn.model_selection import train_test_split  # Train-test splitting
from sklearn.model_selection import GridSearchCV  # For Grid Search

# For the function
import time


| **Column**     | **Use for model?** | **Nulls?**        | **Dtype**         | **Feature Engineering** | **Notes** |
| -------------- | -------------------| ------------------| ------------------| ------------------------|-----------|
| athlete_id     | NO                 | NO                | integer           | Remove column           | Identification number is irrelevant for modelling. |
| name           | NO                 | NO                | object (string)   | Remove column           | Similarly name is irrelevant for modelling.        |
| height_cm      | YES                | YES               | float             | Remove null entries     | This is important for the model. |
| weight_kg      | MAYBE              | YES               | float             | Remove null entries     | This may be important for the model. However, weight can change over time, and someone entering their information is unlikely to weigh the same as an athlete. |
| born_date      | NO                 | NO                | object (string)   | Remove column           | We've got the age column so won't need born_date. |
| **age**        | YES                | NO                | float             | Possible target column  | If the discipline model is too complicated, using age as a target could be an alternative. Otherwise, age could also be a useful feature. |
| country_name   | MAYBE              | NO                | object (string)   | Requires OHE, may possibly require splitting countries into regions to make it simpler. | This may be too complicated since there are A LOT of countries. It also might make the model overfit. |
| country_code   | MAYBE              | NO                | object (string)   | Same as country_name. Would only use ONE of country_name and country_code. | Same as country_name. |
| born_city      | NO                 | YES               | object (string)   | Remove column           | City of birth is too specific for a model. |
| born_region    | NO                 | YES               | object (string)   | Remove column           | Region of birth is too specific for a model. |
| born_country   | PROBABLY NOT       | YES               | object (string)   | Requires OHE, may possibly require splitting countries into regions to make it simpler. | This may be too complicated since there are A LOT of countries. It would almost certainly be too much to have born_country AND country_name/code both as features. |
| year           | NO                 | NO                | integer           | Remove column           | Year is 2020 for all entries. |
| olympics_date  | NO                 | NO                | object (string)   | Remove column           | Date is the same for all entries. |
| **discipline** | MAYBE              | NO                | object (string)   | Possible target column  | Not sure if a target column with around 50 possible options is possible. I would either have to cut down on the possible options OR focus on athletics data OR find another target column. |
| event          | PROBABLY NOT       | NO                | object (string)   | May require OHE if I use it | I would only use this column if I end up focusing on athletics data only, in which case this column could be a possible target column. |
| position       | PROBABLY NOT       | YES               | float             | Either remove or edit null entries | This column is unlikely to be relevant for my model. |
| tied           | NO                 | NO                | bool              | Remove column           | Irrelevant information. |
| medal          | MAYBE              | YES               | object (string)   | Would convert to a bool column (True = Medal, False = No Medal) | This column is unlikely to be relevant for my model. However, I would use this ahead of the position column. |
| male           | YES                | NO                | bool              | May need to change True/False to 1/0 | This is important for the model. |
| age_rounded    | PROBABLY NOT       | NO                | float             | Possible scaling OR possible target column | This may actually be a better target column than age would be due to it being rounded. However, as a feature it would be less accurate than age. |
| physical       | MAYBE              | NO                | bool              | May need to change True/False to 1/0 | Unsure whether or not this will be relevant to the model. Either I use this column in my model OR remove it completely OR make my model only contain entries with physical = True. |
| team           | PROBABLY           | NO                | bool              | May need to change True/False to 1/0 | I think this is likely to be useful for my model. |

In [10]:
# Let's start by dropping columns that definitely won't be used
df.drop(columns = ['athlete_id', 'name', 'born_date', 'born_city', 'born_region', 'year', 'olympics_date', 'tied'], inplace = True)
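
The remaining feature-engineering steps listed in the table (booleans to 1/0, `medal` to a boolean, one-hot encoding of a country column) could look roughly like the sketch below. The `toy` dataframe and its values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy rows standing in for the real athlete data (illustrative values only)
toy = pd.DataFrame({
    "male": [True, False, False],
    "team": [False, True, False],
    "medal": ["Gold", None, "Bronze"],
    "country_code": ["NGR", "POL", "NGR"],
})

# medal -> bool: any non-null medal counts as True
toy["medal"] = toy["medal"].notnull()

# Booleans -> 1/0
bool_cols = ["male", "team", "medal"]
toy = toy.astype({col: int for col in bool_cols})

# One-hot encode the country code: one new 0/1 column per country
encoded = pd.get_dummies(toy, columns=["country_code"], dtype=int)
print(encoded)
```

With hundreds of countries in the real data, the one-hot step would add hundreds of columns, which is the overfitting concern raised in the table.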

In [12]:
df.head()

|   | height_cm | weight_kg | age | country_name | country_code | born_country | discipline | event | position | medal | male | age_rounded | physical | team |
|---|-----------|-----------|-----|--------------|--------------|--------------|------------|-------|----------|-------|------|-------------|----------|------|
| 0 | 167.0 | 59.0 | 45.735797 | Nigeria | NGR | NGR | Table Tennis | Singles, Women (Olympic) | 65.0 | NaN | False | 45.0 | False | False |
| 1 | 185.0 | 82.0 | 52.982888 | Poland | POL | POL | Archery | Individual, Men (Olympic) | 33.0 | NaN | True | 52.0 | False | False |
| 2 | 162.0 | 57.0 | 45.36345 | Hungary | HUN | HUN | Fencing | Foil, Team, Women (Olympic) | 7.0 | NaN | False | 45.0 | False | True |
| 3 | 162.0 | 53.0 | 43.389459 | Brazil | BRA | BRA | Football (Football) | Football, Women (Olympic) | 6.0 | NaN | False | 43.0 | True | True |
| 4 | 153.0 | 43.0 | 46.094456 | Germany Unified Team Uzbekistan | UZB | UZB | Artistic Gymnastics (Gymnastics) | Horse Vault, Women (Olympic) | 14.0 | NaN | False | 46.0 | True | False |


## Model Ideas

| Model                | Use Case |
|----------------------|----------|
| Linear Regression    | Target is continuous |
| Logistic Regression  | Target is discrete (2 possible options) |
| Decision Tree        | Target is discrete (2+ possible options) |

1. **Predict Olympic Discipline** using columns height, male, and probably age, weight, team, physical, and possibly country.
    - This model can be used as a recommendation of what Olympic sport is most suited to you.
    - PROBLEM: This may not be possible as there are 46 possible disciplines.
        - Could try reducing the number of disciplines to predict.
        - Could possibly group some disciplines (e.g. combine cycling track and cycling road etc.)
    - MODEL: Decision Tree.
2. **Predict Athletics Event** using columns height, male, and probably age, weight, team, physical, and possibly country.
    - This model can be used as a recommendation of what Athletics event is most suited to you.
    - PROBLEM: This may not be possible as there are 47 possible events (similar problem as above).
        - Could try reducing the number of events to predict.
        - Could possibly group some events (e.g. categories such as short distance, long distance, throwing, jumping).
    - MODEL: Decision Tree.
3. **Predict Age (rounded)** using columns height, weight, male, physical, team, discipline, possibly medal, position.
    - This model could return a prediction for the age of peak performance given the chosen discipline and other attributes.
    - PROBLEM: I don't think this model idea is as good as the other two.
        - Although, it is probably better linked to my EDA subject.
    - MODEL: Linear Regression.
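
The grouping idea from option 1 (for example, merging the cycling disciplines) can be sketched as a simple label mapping. The mapping below is hypothetical: the group names and most of the discipline labels are assumptions and would need to be checked against the values in the `discipline` column.

```python
import pandas as pd

# Hypothetical mapping: labels and group names are assumptions for illustration
DISCIPLINE_GROUPS = {
    "Cycling Track (Cycling)": "Cycling",
    "Cycling Road (Cycling)": "Cycling",
    "Diving (Aquatics)": "Aquatics",
    "Swimming (Aquatics)": "Aquatics",
}

def group_disciplines(disciplines: pd.Series) -> pd.Series:
    """Map each discipline to its group, leaving unmapped disciplines unchanged."""
    return disciplines.map(DISCIPLINE_GROUPS).fillna(disciplines)

sample = pd.Series([
    "Cycling Road (Cycling)",
    "Cycling Track (Cycling)",
    "Diving (Aquatics)",
    "Archery",
])
grouped = group_disciplines(sample)
print(grouped.nunique(), "groups from", sample.nunique(), "disciplines")
```

Fewer target classes should make the prediction problem easier, at the cost of a less specific recommendation.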

## Model: Predict Olympic Discipline

### Loading the Dataset for our model

- Import df2 from the EDA.
- Drop irrelevant columns.
- Only keep data entries from 2000-2020.
    - (so that the model won't be too outdated)
- Remove rows that contain null values in the `height_cm` or `weight_kg` columns.
- Remove football players from the dataframe.
    - (football data is skewed by the Olympic rules)
- Save the new model dataframe.

In [36]:
# Import df2 from the EDA
df = pd.read_csv('Updated Olympic Athletes Data.csv')

In [38]:
# Let's start by dropping columns that definitely won't be used
df.drop(columns = ['Unnamed: 0', 'athlete_id', 'name', 'born_date', 'born_city', 'born_region', 'olympics_date', 'tied'], inplace = True)

In [40]:
# I will use data from 2000-2020 so that the model won't be too outdated
df[df.year>=2000].info()

<class 'pandas.core.frame.DataFrame'>
Index: 75413 entries, 18 to 201217
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   height_cm     69068 non-null  float64
 1   weight_kg     67224 non-null  float64
 2   age           75413 non-null  float64
 3   country_name  75413 non-null  object 
 4   country_code  75413 non-null  object 
 5   born_country  64630 non-null  object 
 6   year          75413 non-null  int64  
 7   discipline    75413 non-null  object 
 8   event         75413 non-null  object 
 9   position      72450 non-null  float64
 10  medal         11296 non-null  object 
 11  male          75413 non-null  bool   
 12  age_rounded   75413 non-null  float64
 13  physical      75413 non-null  bool   
 14  team          75413 non-null  bool   
dtypes: bool(3), float64(5), int64(1), object(6)
memory usage: 7.7+ MB


In [42]:
# The model dataframe only contains data from 2000-2020
# Drop rows containing nulls in the height and weight columns
model_df = df.copy()
model_df = model_df[(model_df.year >=2000) & (model_df.height_cm.notnull()) & (model_df.weight_kg.notnull())].reset_index(drop=True)

In [49]:
# Remove Football from our model dataframe - because of the skew due to the Olympic rules
model_df = model_df[model_df.discipline != 'Football (Football)']

In [51]:
# Save the model_df (index=False avoids writing an extra 'Unnamed: 0' column)
model_df.to_csv('Model Data.csv', index=False)

### Train-test split and GridSearch

- Split the data into training and testing data.
    - **X = Features** = Height, Weight, Gender, Age, Physical, Team
    - **y = Target** = Discipline
- Perform a GridSearch to find the best possible hyperparameters for a Decision Tree model.
- Fit the Decision Tree with these hyperparameters to the training data.
- Check the accuracy of the model.

In [62]:
# Relevant feature columns that don't need OHE.
feature_cols = ['height_cm', 'weight_kg', 'male','age', 'physical', 'team']

X_train, X_test, y_train, y_test = train_test_split(model_df[feature_cols],  # X
                                                    model_df['discipline'],  # y
                                                    test_size = 0.3, # Hold out 30% of the rows for testing
                                                    random_state = 42)

In [64]:
# GridSearch to find the best hyperparameters
grid = GridSearchCV(estimator = DecisionTreeClassifier(),   # I want to use a Decision Tree!
                    param_grid = {'max_depth': [10, 20, 30, 50],  # 4 possible options
                                  'min_samples_split': [10, 20, 50, 100, 200],  # 5 possible options
                                  'min_samples_leaf': [2, 3, 5, 20, 50],  # 5 possible options
                                  'max_features': [5,6]},  # 2 possible options
                    cv = 10,   # How many folds we want -- i.e. the value of K: In our case 10-fold CV
                    refit = True, # Refit the best model on the whole training set after the search
                    verbose = 1, # How much you want the output to print out
                    scoring = 'accuracy')  # What metric do I prioritise?

In [119]:
# Take the gridsearch and fit it on the Training set

now = time.time()   # Start by saving the current time

# Fit the gridsearch on our training set
grid.fit(X_train,y_train)

print(f' Time in seconds: {time.time() - now}')   # Show the difference in time - i.e. how long this took

Fitting 10 folds for each of 200 candidates, totalling 2000 fits




KeyboardInterrupt: 
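
The run above was interrupted partway through its 2000 fits. If run time is a concern, one option (not the approach used in this notebook) is a randomized search that samples only some of the 200 combinations and uses fewer folds. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; the choices of `n_iter = 20` and `cv = 3` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (6 features, 5 classes)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_distributions={
        "max_depth": [10, 20, 30, 50],
        "min_samples_split": [10, 20, 50, 100, 200],
        "min_samples_leaf": [2, 3, 5, 20, 50],
        "max_features": [5, 6],
    },
    n_iter=20,        # try 20 of the 200 combinations
    cv=3,             # 3 folds instead of 10, so 60 fits in total
    scoring="accuracy",
    random_state=42,
    n_jobs=1,         # set to -1 to spread the fits over all CPU cores
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

A randomized search is not guaranteed to find the same hyperparameters as the full grid, so it trades some thoroughness for speed.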

In [66]:
# Grid best hyperparameters (ran the grid search multiple times)
grid1_params = {'max_depth': 30, 'max_features': 6, 'min_samples_leaf': 2, 'min_samples_split': 100}
grid2_params = {'max_depth': 50, 'max_features': 6, 'min_samples_leaf': 2, 'min_samples_split': 100}

In [68]:
# Best parameters found using a grid search
dt = DecisionTreeClassifier(max_depth = 50, max_features = 6, min_samples_leaf = 2, min_samples_split = 100)

In [70]:
# Fit the best Decision Tree to our training data
dt.fit(X_train,y_train)

In [74]:
# The accuracy scores of the model on the training and the testing data.
print(f'Score on training set: {dt.score(X_train,y_train)}')
print(f'Score on testing set: {dt.score(X_test, y_test)}')

Score on training set: 0.5015539929904997
Score on testing set: 0.452633203044641


In [76]:
# How important each feature was to the model
dt.feature_importances_
# [height_cm, weight_kg, male, age, physical, team]

array([0.24802098, 0.23569262, 0.09527753, 0.18069627, 0.12464437,
       0.11566823])
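
The raw array is easier to read when paired with the feature names. The sketch below re-uses the values printed above; with the fitted model in memory, `dt.feature_importances_` could be passed in directly instead of the copied list.

```python
import pandas as pd

feature_cols = ["height_cm", "weight_kg", "male", "age", "physical", "team"]

# Values copied from the output above
importances = [0.24802098, 0.23569262, 0.09527753, 0.18069627, 0.12464437, 0.11566823]

# Label each importance with its feature and sort from most to least important
ranked = pd.Series(importances, index=feature_cols).sort_values(ascending=False)
print(ranked.round(3))
```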

**Notes:**

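- The model is overfitting slightly: accuracy is about 50% on the training set against about 45% on the testing set.
- The accuracy on the testing set is 45%. With dozens of possible disciplines this is far better than random guessing, so I think it is good enough for a recommender model.
- Every feature contributes to the model: the importances range from about 9.5% (`male`) to about 25% (`height_cm`).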
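
One way to put the 45% figure in context is to compare it with a baseline that always predicts the most common class. The sketch below shows the comparison on synthetic data, since the real baseline depends on how the disciplines are distributed in `y_train`; the data and the `max_depth = 5` setting are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the athlete features and discipline labels
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, class_sep=2.0, random_state=42,
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
tree_model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(Xtr, ytr)

baseline_acc = baseline.score(Xte, yte)
tree_acc = tree_model.score(Xte, yte)
print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Decision Tree accuracy: {tree_acc:.3f}")
```

On the real data the same comparison would use `X_train`, `X_test`, `y_train` and `y_test` with the fitted `dt`.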


Now test the model on my own data and see what sport it recommends.

In [84]:
# Example input to the model using my own data entry
my_data = pd.DataFrame({'height_cm' : [170], 'weight_kg' : [70], 'male' : [True], 'age' : [23],
               'physical' : [False], 'team' : [True]})
my_data

|   | height_cm | weight_kg | male | age | physical | team |
|---|-----------|-----------|------|-----|----------|------|
| 0 | 170 | 70 | True | 23 | False | True |


In [86]:
# The results using my data
dt.predict(my_data)

array(['Diving (Aquatics)'], dtype=object)
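
Since test accuracy is below 50%, a recommender might be more useful if it returned several candidate sports rather than one. A possible sketch using `predict_proba` is shown below. The helper `top_k_classes` is hypothetical, and the toy tree with its three sports is made up so the example runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def top_k_classes(fitted_model, row, k=3):
    """Return the k most probable (label, probability) pairs for a single-row input."""
    probabilities = fitted_model.predict_proba(row)[0]
    best = np.argsort(probabilities)[::-1][:k]   # indices of the k largest probabilities
    return [(fitted_model.classes_[i], float(probabilities[i])) for i in best]

# Toy model: three made-up sports separated by height only
X_toy = np.array([[150], [155], [175], [180], [200], [205]])
y_toy = np.array(["Gymnastics", "Gymnastics", "Rowing", "Rowing", "Basketball", "Basketball"])
toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

recommendations = top_k_classes(toy_tree, np.array([[178]]), k=2)
print(recommendations)
```

With the real model the call would be `top_k_classes(dt, my_data)`.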

## Function

I created a function that asks the user for their features and returns the recommended Olympic sport.

In [1201]:
def predictor():
    # Introduction
    input('Welcome to the Olympic sport recommender!')
    print('')
    print('This application will ask you a few questions about yourself.')
    print('Then using data from real athletes, it will give you the Olympic sport that is most compatible to your choices and attributes.')
    print('')
    input('Press Enter to start.')

    x=0
    while x==0:
        try:
            height = float(input('Enter your height in cm: '))
            if 120 <= height <= 230:
                x=1
            else:
                print('Unfortunately the height you entered is out of range (120cm - 230cm). Please try again.')
        except ValueError:   # the input could not be converted to a number
            print('You did not enter a valid height. Please try again. ')
    print('')
    while x==1:
        try:
            weight = float(input('Enter your weight in kg: '))
            if 20 <= weight <= 250:
                x=2
            else:
                print('Unfortunately the weight you entered is out of range (20kg - 250kg). Please try again.')
        except ValueError:   # the input could not be converted to a number
            print('You did not enter a valid weight. Please try again. ')
    print('')
    while x==2:
        try:
            age = float(input('Enter your age in years: '))
            if 10 <= age <= 75:
                x=3
            else:
                print('Unfortunately the age you entered is out of range (10 - 75). Please try again.')
        except ValueError:   # the input could not be converted to a number
            print('You did not enter a valid age. Please try again. ')
    print('')
    while x==3:
        gender = input('Are you male or female? (Enter M or F): ').lower()
        if gender == 'm':
            x=4
        elif gender == 'f':
            x=4
        else:
            print('You did not enter either M or F. Please try again.')
            
    print('')
    print('Would you prefer a more physically challenging sport (e.g. Athletics), or a less physical sport that requires a specialised skill (e.g. Shooting)?')
    while x==4:
        physical = input('Enter P if you would prefer a physical sport. Enter N if not: ').lower()
        if physical == 'p':
            x=5
        elif physical == 'n':
            x=5
        else:
            print('You did not enter either P or N. Please try again.')

    while x==5:
        team = input('Would you prefer a team sport or an individual sport? Enter T for team. Enter I for individual: ').lower()
        if team == 't':
            x=6
        elif team == 'i':
            x=6
        else:
            print('You did not enter either T or I. Please try again.')
    print('\n')
    time.sleep(1)
    print('Thank you for answering the questions. Here are the details you entered:')
    print('')
    time.sleep(.5)
    print(f"Height : {height}cm")
    time.sleep(.5)
    print(f"Weight : {weight}kg")
    time.sleep(.5)
    print(f"Age : {age}")
    time.sleep(.5)
    if gender == 'm':
        print(f"Gender : Male")
        gender = True
    else:
        print(f"Gender : Female")
        gender = False
    time.sleep(.5)
    if physical == 'p':
        physical = True
        if team == 't':
            print(f"Preferred Sport Type : Physical and Team")
            team = True
        else:
            print(f"Preferred Sport Type : Physical and Individual")
            team = False
    else:
        physical = False
        if team == 't':
            print(f"Preferred Sport Type : Non-physical and Team")
            team = True
        else:
            print(f"Preferred Sport Type : Non-physical and Individual")
            team = False
    time.sleep(1)
    print('\n')
    input('Press Enter to view your results')
    
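    # Model
    # The columns must be in the same order as the features the tree was trained on
    data = pd.DataFrame({'height_cm' : [height], 'weight_kg' : [weight], 'male' : [gender], 'age' : [age],
               'physical' : [physical], 'team' : [team]})
    

    # dt is the Decision Tree fitted on the training data above
    result = dt.predict(data)[0]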
    return f"Based on the information that you have entered, the sport we think is most suited to you is:   {result}"

In [1213]:
predictor()

Welcome to the Olympic sport recommender! 



This application will ask you a few questions about yourself.
Then using data from real athletes, it will give you the Olympic sport that is most compatible to your choices and attributes.



Press Enter to start. 
Enter your height in cm:  170





Enter your weight in kg:  70





Enter your age in years:  23





Are you male or female? (Enter M or F):  m



Would you prefer a more physically challenging sport (e.g. Athletics), or a less physical sport that requires a specialised skill (e.g. Shooting)?


Enter P if you would prefer a physical sport. Enter N if not:  n
Would you prefer a team sport or an individual sport? Enter T for team. Enter I for individual:  t




Thank you for answering the questions. Here are the details you entered:

Height : 170.0cm
Weight : 70.0kg
Age : 23.0
Gender : Male
Preferred Sport Type : Non-physical and Team




Press Enter to view your results 


'Based on the information that you have entered, the sport we think is most suited to you is:   Diving (Aquatics)'
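
Because `predictor()` relies on `input()`, it is awkward to test or reuse. One possible refactor, sketched below, moves the range checks and the dataframe construction into a separate helper. The name `build_feature_row` is hypothetical; the limits are the ones used in the prompts above and the column order follows `feature_cols`.

```python
import pandas as pd

# Same order as the features used to train the Decision Tree
FEATURE_COLS = ["height_cm", "weight_kg", "male", "age", "physical", "team"]

# Accepted ranges, taken from the prompts in predictor()
LIMITS = {"height_cm": (120, 230), "weight_kg": (20, 250), "age": (10, 75)}

def build_feature_row(height_cm, weight_kg, age, male, physical, team):
    """Validate the inputs and return a one-row dataframe ready for prediction."""
    values = {
        "height_cm": float(height_cm),
        "weight_kg": float(weight_kg),
        "age": float(age),
        "male": bool(male),
        "physical": bool(physical),
        "team": bool(team),
    }
    for name, (low, high) in LIMITS.items():
        if not low <= values[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {values[name]}")
    # Reorder the columns to match the training data
    return pd.DataFrame([values])[FEATURE_COLS]

row = build_feature_row(170, 70, 23, male=True, physical=False, team=True)
print(row)
```

The interactive function could then collect the answers, call this helper, and pass the result to `dt.predict`.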