# Modelling

I will try to create a model to predict the sport that a person would be best at based on their attributes.

I will use athlete data from the Olympics to do this.

In [3]:
# For data usage
import pandas as pd
import numpy as np

# For visualisations
import seaborn as sns
import matplotlib.pyplot as plt

# For performance measures
from sklearn import metrics  # accuracy, recall, precision
from sklearn.metrics import (confusion_matrix, accuracy_score)

# For modelling (Linear Regression)
import statsmodels.api as sm   # For the linear regression model
import statsmodels.tools       # For the evaluation of our model

# Stuff for modelling (Decision Tree)
from sklearn.tree import DecisionTreeClassifier  # the model itself
from sklearn import tree   # visualising the model
from sklearn.model_selection import train_test_split  # Train-test splitting
from sklearn.model_selection import GridSearchCV  # For Grid Search

# For the function
import time


| **Column**     | **Use for model?** | **Nulls?**        | **Dtype**         | **Feature Engineering** | **Notes** |
| -------------- | -------------------| ------------------| ------------------| ------------------------|-----------|
| athlete_id     | NO                 | NO                | integer           | Remove column           | Identification number is irrelevant for modelling. |
| name           | NO                 | NO                | object (string)   | Remove column           | Similarly name is irrelevant for modelling.        |
| height_cm      | YES                | YES               | float             | Remove null entries     | This is important for the model. |
| weight_kg      | MAYBE              | YES               | float             | Remove null entries     | This may be important for the model. However, weight can change over time, and someone entering their own information is unlikely to have the weight of an athlete. |
| born_date      | NO                 | NO                | object (string)   | Remove column           | We've got the age column so won't need born_date. |
| **age**        | YES                | NO                | float             | Possible target column  | If the discipline model is too complicated, using age as a target could be an alternative. Otherwise, age could also be a useful feature. |
| country_name   | MAYBE              | NO                | object (string)   | Requires OHE, may possibly require splitting countries into regions to make it simpler. | This may be too complicated since there are A LOT of countries. It also might make the model overfit. |
| country_code   | MAYBE              | NO                | object (string)   | Same as country_name. Would only use ONE of country_name and country_code. | Same as country_name. |
| born_city      | NO                 | YES               | object (string)   | Remove column           | City of birth is too specific for a model. |
| born_region    | NO                 | YES               | object (string)   | Remove column           | Region of birth is too specific for a model. |
| born_country   | PROBABLY NOT       | YES               | object (string)   | Requires OHE, may possibly require splitting countries into regions to make it simpler. | This may be too complicated since there are A LOT of countries. It would almost certainly be too much to have born_country AND country_name/code both as features. |
| year           | NO                 | NO                | integer           | Remove column           | Year is 2020 for all entries. |
| olympics_date  | NO                 | NO                | object (string)   | Remove column           | Date is the same for all entries. |
| **discipline** | MAYBE              | NO                | object (string)   | Possible target column  | Not sure if a target column with around 50 possible options is possible. I would either have to cut down on the possible options OR focus on athletics data OR find another target column. |
| event          | PROBABLY NOT       | NO                | object (string)   | May require OHE if I use it | I would only use this column if I end up focusing on athletics data only, in which case this column could be a possible target column. |
| position       | PROBABLY NOT       | YES               | float             | Either remove or edit null entries | This column is unlikely to be relevant for my model. |
| tied           | NO                 | NO                | bool              | Remove column           | Irrelevant information. |
| medal          | MAYBE              | YES               | object (string)   | Would convert to a bool column (True = Medal, False = No Medal) | This column is unlikely to be relevant for my model. However, I would use this ahead of the position column. |
| male           | YES                | NO                | bool              | May need to change True/False to 1/0 | This is important for the model. |
| age_rounded    | PROBABLY NOT       | NO                | float             | Possible scaling OR possible target column | This may actually be a better target column than age would be due to it being rounded. However, as a feature it would be less accurate than age. |
| physical       | MAYBE              | NO                | bool              | May need to change True/False to 1/0 | Unsure whether or not this will be relevant to the model. Either I use this column in my model OR remove it completely OR make my model only contain entries with physical = True. |
| team           | PROBABLY           | NO                | bool              | May need to change True/False to 1/0 | I think this is likely to be useful for my model. |

In [3]:
# Import the df_2020 dataset from the previous notebook (as df)
df = pd.read_csv('2020 Olympic athletes data.csv')

In [5]:
# For some reason there is now an extra index column, so remove this
df = df.drop(columns = ['Unnamed: 0'])
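
The extra `Unnamed: 0` column most likely comes from how the CSV was saved in the previous notebook: `DataFrame.to_csv` writes the index as an unnamed first column unless told otherwise. Below is a minimal sketch of that behaviour, using a made-up two-row frame (`toy`) rather than the real dataset.

```python
import io
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({'height_cm': [167.0, 185.0], 'male': [False, True]})

# By default to_csv writes the index, which read_csv then loads as 'Unnamed: 0'
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Saving with index=False avoids the extra column
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Saving with `index=False` in the earlier notebook would make the `drop` step above unnecessary.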

In [7]:
# For modelling we will need to feature engineer all the relevant columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12920 entries, 0 to 12919
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   athlete_id     12920 non-null  int64  
 1   name           12920 non-null  object 
 2   height_cm      7124 non-null   float64
 3   weight_kg      5740 non-null   float64
 4   born_date      12920 non-null  object 
 5   age            12920 non-null  float64
 6   country_name   12920 non-null  object 
 7   country_code   12920 non-null  object 
 8   born_city      9474 non-null   object 
 9   born_region    9474 non-null   object 
 10  born_country   9474 non-null   object 
 11  year           12920 non-null  int64  
 12  olympics_date  12920 non-null  object 
 13  discipline     12920 non-null  object 
 14  event          12920 non-null  object 
 15  position       12400 non-null  float64
 16  tied           12920 non-null  bool   
 17  medal          2120 non-null   object 
 18  male  

In [10]:
# Let's start by dropping columns that definitely won't be used
df.drop(columns = ['athlete_id', 'name', 'born_date', 'born_city', 'born_region', 'year', 'olympics_date', 'tied'], inplace = True)

In [12]:
df.head()

Unnamed: 0,height_cm,weight_kg,age,country_name,country_code,born_country,discipline,event,position,medal,male,age_rounded,physical,team
0,167.0,59.0,45.735797,Nigeria,NGR,NGR,Table Tennis,"Singles, Women (Olympic)",65.0,,False,45.0,False,False
1,185.0,82.0,52.982888,Poland,POL,POL,Archery,"Individual, Men (Olympic)",33.0,,True,52.0,False,False
2,162.0,57.0,45.36345,Hungary,HUN,HUN,Fencing,"Foil, Team, Women (Olympic)",7.0,,False,45.0,False,True
3,162.0,53.0,43.389459,Brazil,BRA,BRA,Football (Football),"Football, Women (Olympic)",6.0,,False,43.0,True,True
4,153.0,43.0,46.094456,Germany Unified Team Uzbekistan,UZB,UZB,Artistic Gymnastics (Gymnastics),"Horse Vault, Women (Olympic)",14.0,,False,46.0,True,False


In [14]:
len(df.discipline.unique())

46

In [16]:
len(df[df.discipline == 'Athletics'].event.unique())

47

## Model Ideas

| Model                | Use Case |
|----------------------|----------|
| Linear Regression    | Target is continuous |
| Logistic Regression  | Target is discrete (2 possible options) |
| Decision Tree        | Target is discrete (2+ possible options) |

1. **Predict Olympic Discipline** using columns height, male, and probably age, weight, team, physical, and possibly country.
    - This model can be used as a recommendation of what Olympic sport is most suited to you.
    - PROBLEM: This may not be possible as there are 46 possible disciplines.
        - Could try reducing the number of disciplines to predict.
        - Could possibly group some disciplines (e.g. combine cycling track and cycling road etc.)
    - MODEL: Decision Tree.
2. **Predict Athletics Event** using columns height, male, and probably age, weight, team, physical, and possibly country.
    - This model can be used as a recommendation of what Athletics event is most suited to you.
    - PROBLEM: This may not be possible as there are 47 possible events (similar problem as above).
        - Could try reducing the number of events to predict.
        - Could possibly group some events (e.g. categories such as short distance, long distance, throwing, jumping).
    - MODEL: Decision Tree.
3. **Predict Age (rounded)** using columns height, weight, male, physical, team, discipline, possibly medal, position.
    - This model could return a prediction for the age of peak performance given the chosen discipline and other attributes.
    - PROBLEM: I don't think this model idea is as good as the other two.
        - Although, it is probably better linked to my EDA subject.
    - MODEL: Linear Regression.

## Model: Predict Olympic Discipline

In [623]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12920 entries, 0 to 12919
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   height_cm     7124 non-null   float64
 1   weight_kg     5740 non-null   float64
 2   age           12920 non-null  float64
 3   country_name  12920 non-null  object 
 4   country_code  12920 non-null  object 
 5   born_country  9474 non-null   object 
 6   discipline    12920 non-null  object 
 7   event         12920 non-null  object 
 8   position      12400 non-null  float64
 9   medal         2120 non-null   object 
 10  male          12920 non-null  bool   
 11  age_rounded   12920 non-null  float64
 12  physical      12920 non-null  bool   
 13  team          12920 non-null  bool   
dtypes: bool(3), float64(5), object(6)
memory usage: 1.1+ MB


In [625]:
# Create a copy of df for feature engineering
df1 = df.copy()

In [627]:
# Remove entries that have a null value in either height or weight column
df1 = df1[df1.height_cm.notnull() & df1.weight_kg.notnull()]

In [629]:
# Unfortunately we do lose over half of the data doing this
len(df1)

5740
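
To see where the rows are being lost, it can help to look at the share of nulls per column before filtering. This is a small sketch on made-up data (`toy` is hypothetical); `dropna(subset=...)` is used here as an equivalent of the `notnull() & notnull()` filter above.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in height and weight (values are made up)
toy = pd.DataFrame({
    'height_cm': [167.0, np.nan, 162.0, 180.0, np.nan],
    'weight_kg': [59.0, np.nan, np.nan, 75.0, 70.0],
})

# Share of null values in each column
null_share = toy[['height_cm', 'weight_kg']].isnull().mean()

# Keep only rows where both height and weight are present
kept = toy.dropna(subset=['height_cm', 'weight_kg'])
share_kept = len(kept) / len(toy)

print(null_share)
print(f'Rows kept: {len(kept)} of {len(toy)} ({share_kept:.0%})')
```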

In [26]:
#df1.medal = df1.medal.replace({'Gold' : True, 'Silver' : True, 'Bronze' : True})
#df1.medal = df1.medal.fillna(False)

In [631]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5740 entries, 0 to 12915
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   height_cm     5740 non-null   float64
 1   weight_kg     5740 non-null   float64
 2   age           5740 non-null   float64
 3   country_name  5740 non-null   object 
 4   country_code  5740 non-null   object 
 5   born_country  5339 non-null   object 
 6   discipline    5740 non-null   object 
 7   event         5740 non-null   object 
 8   position      5524 non-null   float64
 9   medal         1079 non-null   object 
 10  male          5740 non-null   bool   
 11  age_rounded   5740 non-null   float64
 12  physical      5740 non-null   bool   
 13  team          5740 non-null   bool   
dtypes: bool(3), float64(5), object(6)
memory usage: 554.9+ KB


In [633]:
# Relevant feature columns that don't need OHE.
feature_cols = ['height_cm', 'weight_kg', 'age', 'male', 'physical', 'team']

X_train, X_test, y_train, y_test = train_test_split(df1[feature_cols],  # X
                                                    df1['discipline'],  # y
                                                    test_size = 0.3, # Hold out 30% of the data for testing
                                                    random_state = 42)

In [635]:
grid = GridSearchCV(estimator = DecisionTreeClassifier(),   # I want to use a Decision Tree!
                    param_grid = {'max_depth': [5, 10, 20, 30],  # 4 possible options
                                  'min_samples_split': [5, 10, 15, 20, 50, 100],  # 6 possible options
                                  'min_samples_leaf': [2, 3, 5, 6, 20, 50],  # 6 possible options
                                  'max_features': [5,6]},  # 2 possible options
                    cv = 10,   # How many folds we want -- i.e. the value of K: In our case 10-fold CV
                    refit = True, # Refit the best model on the whole training set after the search
                    verbose = 1, # How much you want the output to print out
                    scoring = 'accuracy')  # What metric do I prioritise?

In [637]:
# Take the gridsearch and fit it on the Training set

now = time.time()   # Start by saving the current time

# Fit the gridsearch on our training set
grid.fit(X_train,y_train)

print(f' Time in seconds: {time.time() - now}')   # Show the difference in time - i.e. how long this took

Fitting 10 folds for each of 288 candidates, totalling 2880 fits




 Time in seconds: 56.098875522613525


In [639]:
grid.best_params_

{'max_depth': 10,
 'max_features': 5,
 'min_samples_leaf': 2,
 'min_samples_split': 10}

In [641]:
grid.best_score_

0.3782930732869319

In [643]:
model = grid.best_estimator_

In [645]:
# score - does a predict, and calculates score!
# increasing the max_depth increases the score
# However, increasing max_depth increases the chance of overfitting
print(f'Score on training set: {model.score(X_train,y_train)}')
print(f'Score on testing set: {model.score(X_test, y_test)}')

Score on training set: 0.5246391239422599
Score on testing set: 0.36875725900116146


In [647]:
model.feature_importances_
# [height_cm, weight_kg, age, male, physical, team]

array([0.28746801, 0.23109907, 0.22473836, 0.05470335, 0.0995134 ,
       0.10247781])

In [649]:
train_results = X_train.copy()
train_results['y_pred'] = model.predict(X_train)
train_results['y_real'] = y_train

train_results

Unnamed: 0,height_cm,weight_kg,age,male,physical,team,y_pred,y_real
481,172.0,56.0,29.147159,False,True,True,Football (Football),Swimming (Aquatics)
4672,183.0,73.0,32.219028,True,True,True,Cycling Track (Cycling),Athletics
4336,194.0,83.0,26.924025,True,True,False,Swimming (Aquatics),Cycling Track (Cycling)
4158,188.0,87.0,26.198494,True,False,False,Fencing,Diving (Aquatics)
1628,182.0,70.0,28.156057,True,False,False,Shooting,Sailing
...,...,...,...,...,...,...,...,...
3827,188.0,84.0,28.569473,True,True,False,Athletics,Athletics
7134,190.0,97.0,37.379877,True,True,True,Handball,Handball
7342,166.0,68.0,26.261465,False,True,True,Rugby Sevens (Rugby),Handball
8369,187.0,100.0,31.173169,False,True,True,Water Polo (Aquatics),Water Polo (Aquatics)


In [651]:
# Number of correct predictions (a bit more than half of the training set)
len(train_results[train_results['y_pred'] == train_results['y_real']])

2108

In [653]:
# The model doesn't even predict every sport
train_results.groupby('y_pred')['y_pred'].count()

y_pred
Archery                               67
Artistic Gymnastics (Gymnastics)     253
Artistic Swimming (Aquatics)          37
Athletics                           1189
Baseball (Baseball/Softball)           4
Basketball (Basketball)               57
Beach Volleyball (Volleyball)          6
Boxing                                 4
Canoe Slalom (Canoeing)                2
Canoe Sprint (Canoeing)               48
Cycling Mountain Bike (Cycling)       14
Cycling Road (Cycling)                14
Cycling Track (Cycling)               42
Diving (Aquatics)                    110
Fencing                              194
Football (Football)                  103
Golf                                   5
Handball                             360
Hockey                               138
Judo                                 159
Rhythmic Gymnastics (Gymnastics)      19
Rowing                               119
Rugby Sevens (Rugby)                  65
Sailing                               98
Shooting 
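
One way to list exactly which sports are never predicted is to compare the set of real labels with the set of predicted labels. A minimal sketch with made-up labels (`y_real_toy` and `y_pred_toy` are hypothetical stand-ins for the `y_real` and `y_pred` columns):

```python
import pandas as pd

# Made-up real and predicted labels
y_real_toy = pd.Series(['Athletics', 'Golf', 'Judo', 'Athletics', 'Surfing'])
y_pred_toy = pd.Series(['Athletics', 'Athletics', 'Judo', 'Athletics', 'Judo'])

# Sports that appear in the real labels but are never predicted
never_predicted = sorted(set(y_real_toy) - set(y_pred_toy))
print(never_predicted)
```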

In [655]:
test_results = X_test.copy()
test_results['y_pred'] = model.predict(X_test)
test_results['y_real'] = y_test

test_results

Unnamed: 0,height_cm,weight_kg,age,male,physical,team,y_pred,y_real
768,161.0,62.0,28.969199,True,True,False,Artistic Gymnastics (Gymnastics),Artistic Gymnastics (Gymnastics)
3029,186.0,80.0,28.052019,False,True,False,Swimming (Aquatics),Modern Pentathlon
4715,193.0,89.0,28.410678,True,True,False,Athletics,Athletics
3562,184.0,83.0,25.218344,True,True,True,Swimming (Aquatics),Swimming (Aquatics)
3435,204.0,98.0,28.008214,True,False,False,Sailing,Sailing
...,...,...,...,...,...,...,...,...
11679,193.0,85.0,22.910335,True,True,True,Swimming (Aquatics),Canoe Sprint (Canoeing)
3135,178.0,72.0,32.027379,False,True,True,Handball,Basketball (Basketball)
2350,169.0,49.0,22.540726,False,True,True,Swimming (Aquatics),Athletics
4582,162.0,62.0,25.681040,True,True,False,Artistic Gymnastics (Gymnastics),Artistic Gymnastics (Gymnastics)


In [657]:
# Number of correct predictions (about 37% of the testing set)
len(test_results[test_results['y_pred'] == test_results['y_real']])

635
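
An accuracy figure is easier to judge against a naive baseline, such as always predicting the most common class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; it is an illustration of the comparison, not a result from the Olympic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up, imbalanced labels: 6 Athletics, 3 Swimming, 1 Judo
y_toy = np.array(['Athletics'] * 6 + ['Swimming'] * 3 + ['Judo'] * 1)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Fitting the same baseline on `X_train, y_train` and scoring it on `X_test, y_test` would give the figure the decision tree needs to beat.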

In [816]:
# Example input to the model using my own data entry
my_data = pd.DataFrame({'height_cm' : [170], 'weight_kg' : [70], 'age' : [23], 'male' : [True],
               'physical' : [False], 'team' : [True]})
my_data

Unnamed: 0,height_cm,weight_kg,age,male,physical,team
0,170,70,23,True,False,True


In [818]:
# The results using my data
model.predict(my_data)

array(['Archery'], dtype=object)

### Loading the Dataset for our model

- Import df2 from the EDA.
- Drop irrelevant columns.
- Only keep data entries from 2000-2020.
    - (so that the model won't be too outdated)
- Remove rows that contain null values in the ```height_cm``` or ```weight_kg``` columns.
- Remove football players from the dataframe.
    - (football data is skewed)
- Save the new model dataframe.

In [36]:
# Import df2 from the EDA
df = pd.read_csv('Updated Olympic Athletes Data.csv')

In [38]:
# Let's start by dropping columns that definitely won't be used
df.drop(columns = ['Unnamed: 0', 'athlete_id', 'name', 'born_date', 'born_city', 'born_region', 'olympics_date', 'tied'], inplace = True)

In [40]:
# I will use data from 2000-2020 so that the model won't be too outdated
df[df.year>=2000].info()

<class 'pandas.core.frame.DataFrame'>
Index: 75413 entries, 18 to 201217
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   height_cm     69068 non-null  float64
 1   weight_kg     67224 non-null  float64
 2   age           75413 non-null  float64
 3   country_name  75413 non-null  object 
 4   country_code  75413 non-null  object 
 5   born_country  64630 non-null  object 
 6   year          75413 non-null  int64  
 7   discipline    75413 non-null  object 
 8   event         75413 non-null  object 
 9   position      72450 non-null  float64
 10  medal         11296 non-null  object 
 11  male          75413 non-null  bool   
 12  age_rounded   75413 non-null  float64
 13  physical      75413 non-null  bool   
 14  team          75413 non-null  bool   
dtypes: bool(3), float64(5), int64(1), object(6)
memory usage: 7.7+ MB


In [42]:
# The model dataframe only contains data from 2000-2020
# Drop rows containing nulls in the height and weight columns
model_df = df.copy()
model_df = model_df[(model_df.year >=2000) & (model_df.height_cm.notnull()) & (model_df.weight_kg.notnull())].reset_index(drop=True)

In [49]:
# Remove Football from our model dataframe - because of the skew due to the Olympic rules
model_df = model_df[model_df.discipline != 'Football (Football)']

In [51]:
# Save the model_df (index=False avoids an extra 'Unnamed: 0' column when the file is read back)
model_df.to_csv('Model Data.csv', index=False)

### Train-test split and GridSearch

- Split the data into training and testing data.
    - **X = Features** = Height, Weight, Gender, Age, Physical, Team
    - **y = Target** = Discipline
- Perform a GridSearch to find the best possible hyperparameters for a Decision Tree model.
- Fit the Decision Tree with these hyperparameters to the training data.
- Check the accuracy of the model.

In [62]:
# Relevant feature columns that don't need OHE.
feature_cols = ['height_cm', 'weight_kg', 'male','age', 'physical', 'team']

X_train, X_test, y_train, y_test = train_test_split(model_df[feature_cols],  # X
                                                    model_df['discipline'],  # y
                                                    test_size = 0.3, # Hold out 30% of the data for testing
                                                    random_state = 42)

In [64]:
# GridSearch to find the best hyperparameters
grid = GridSearchCV(estimator = DecisionTreeClassifier(),   # I want to use a Decision Tree!
                    param_grid = {'max_depth': [10, 20, 30, 50],  # 4 possible options
                                  'min_samples_split': [10, 20, 50, 100, 200],  # 5 possible options
                                  'min_samples_leaf': [2, 3, 5, 20, 50],  # 5 possible options
                                  'max_features': [5,6]},  # 2 possible options
                    cv = 10,   # How many folds we want -- i.e. the value of K: In our case 10-fold CV
                    refit = True, # Refit the best model on the whole training set after the search
                    verbose = 1, # How much you want the output to print out
                    scoring = 'accuracy')  # What metric do I prioritise?

In [119]:
# Take the gridsearch and fit it on the Training set

now = time.time()   # Start by saving the current time

# Fit the gridsearch on our training set
grid.fit(X_train,y_train)

print(f' Time in seconds: {time.time() - now}')   # Show the difference in time - i.e. how long this took

Fitting 10 folds for each of 200 candidates, totalling 2000 fits




KeyboardInterrupt: 
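
The full grid search had to be interrupted on the larger dataset. One option, sketched below on synthetic data, is `RandomizedSearchCV`, which samples a fixed number of combinations (`n_iter`) from the same grid instead of trying all 200. The values of `n_iter` and `cv` here are assumptions chosen to keep the example fast, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Olympic data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=42)

# Same hyperparameter values as the grid above
param_distributions = {'max_depth': [10, 20, 30, 50],
                       'min_samples_split': [10, 20, 50, 100, 200],
                       'min_samples_leaf': [2, 3, 5, 20, 50],
                       'max_features': [5, 6]}

search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                            param_distributions=param_distributions,
                            n_iter=10,        # try 10 sampled combinations, not all 200
                            cv=3,             # fewer folds also reduces the number of fits
                            scoring='accuracy',
                            random_state=42)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Both `GridSearchCV` and `RandomizedSearchCV` also accept `n_jobs=-1` to run the fits in parallel on all available cores, which may be enough on its own to let the full grid finish.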

In [66]:
# Grid best hyperparameters (ran the grid search multiple times)
grid1_params = {'max_depth': 30, 'max_features': 6, 'min_samples_leaf': 2, 'min_samples_split': 100}
grid2_params = {'max_depth': 50, 'max_features': 6, 'min_samples_leaf': 2, 'min_samples_split': 100}

In [68]:
# Best parameters found using a grid search
dt = DecisionTreeClassifier(max_depth = 50, max_features = 6, min_samples_leaf = 2, min_samples_split = 100)

In [70]:
# Fit the best Decision Tree to our training data
dt.fit(X_train,y_train)

In [74]:
# The accuracy scores of the model on the training and the testing data.
print(f'Score on training set: {dt.score(X_train,y_train)}')
print(f'Score on testing set: {dt.score(X_test, y_test)}')

Score on training set: 0.5015539929904997
Score on testing set: 0.452633203044641


In [76]:
# How important each feature was to the model
dt.feature_importances_
# [height_cm, weight_kg, male, age, physical, team]

array([0.24802098, 0.23569262, 0.09527753, 0.18069627, 0.12464437,
       0.11566823])

**Notes:**

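- The model is overfitting slightly: accuracy is about 50% on the training set and about 45% on the testing set.
- A testing accuracy of 45% is reasonable for a target with this many classes, and is good enough for a recommender model.
- Every feature has an importance of roughly 10% or more (the lowest is `male` at about 9.5%), so all of them contribute to the model.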


Now test the model on my own data and see what sport it recommends.

In [84]:
# Example input to the model using my own data entry
my_data = pd.DataFrame({'height_cm' : [170], 'weight_kg' : [70], 'male' : [True], 'age' : [23],
               'physical' : [False], 'team' : [True]})
my_data

Unnamed: 0,height_cm,weight_kg,male,age,physical,team
0,170,70,True,23,False,True


In [86]:
# The results using my data
dt.predict(my_data)

array(['Diving (Aquatics)'], dtype=object)
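
Since the model is meant as a recommender, it may be more useful to return the few most likely sports rather than a single prediction. The sketch below does this with `predict_proba`; `top_k_sports` is a hypothetical helper, and the data and sport labels are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

feature_cols = ['height_cm', 'weight_kg', 'male', 'age', 'physical', 'team']

# Synthetic stand-in data with made-up sport labels
X_arr, y_codes = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=4, random_state=0)
sports = np.array(['Archery', 'Diving', 'Judo', 'Rowing'])
X_toy = pd.DataFrame(X_arr, columns=feature_cols)
y_toy = sports[y_codes]

toy_tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X_toy, y_toy)


def top_k_sports(fitted_tree, row, k=3):
    """Return the k classes with the highest predicted probability for one row."""
    proba = fitted_tree.predict_proba(row)[0]
    ranked = pd.Series(proba, index=fitted_tree.classes_).sort_values(ascending=False)
    return ranked.head(k)


top3 = top_k_sports(toy_tree, X_toy.iloc[[0]], k=3)
print(top3)
```

For a decision tree, these probabilities are the class proportions in the leaf the row falls into, so a larger `min_samples_leaf` gives smoother rankings.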

## Function

I created a function that takes inputted features and returns the recommended Olympic sport.

In [1201]:
def predictor():
    # Introduction
    input('Welcome to the Olympic sport recommender!')
    print('')
    print('This application will ask you a few questions about yourself.')
    print('Then using data from real athletes, it will give you the Olympic sport that is most compatible to your choices and attributes.')
    print('')
    input('Press Enter to start.')

    x=0
    while x==0:
        try:
            height = float(input('Enter your height in cm: '))
        except ValueError:
            # float() raises ValueError when the input is not a number
            print('You did not enter a valid height. Please try again. ')
            continue
        if 120 <= height <= 230:
            x=1
        else:
            print('Unfortunately the height you entered is out of range (120cm - 230cm). Please try again.')
    print('')
    while x==1:
        try:
            weight = float(input('Enter your weight in kg: '))
        except ValueError:
            print('You did not enter a valid weight. Please try again. ')
            continue
        if 20 <= weight <= 250:
            x=2
        else:
            print('Unfortunately the weight you entered is out of range (20kg - 250kg). Please try again.')
    print('')
    while x==2:
        try:
            age = float(input('Enter your age in years: '))
        except ValueError:
            print('You did not enter a valid age. Please try again. ')
            continue
        if 10 <= age <= 75:
            x=3
        else:
            print('Unfortunately the age you entered is out of range (10 - 75). Please try again.')
    print('')
    while x==3:
        gender = input('Are you male or female? (Enter M or F): ').lower()
        if gender == 'm':
            x=4
        elif gender == 'f':
            x=4
        else:
            print('You did not enter either M or F. Please try again.')
            
    print('')
    print('Would you prefer a more physically challenging sport (e.g. Athletics), or a less physical sport that requires a specialised skill (e.g. Shooting)?')
    while x==4:
        physical = input('Enter P if you would prefer a physical sport. Enter N if not: ').lower()
        if physical == 'p':
            x=5
        elif physical == 'n':
            x=5
        else:
            print('You did not enter either P or N. Please try again.')

    while x==5:
        team = input('Would you prefer a team sport or an individual sport? Enter T for team. Enter I for individual: ').lower()
        if team == 't':
            x=6
        elif team == 'i':
            x=6
        else:
            print('You did not enter either T or I. Please try again.')
    print('\n')
    time.sleep(1)
    print('Thank you for answering the questions. Here are the details you entered:')
    print('')
    time.sleep(.5)
    print(f"Height : {height}cm")
    time.sleep(.5)
    print(f"Weight : {weight}kg")
    time.sleep(.5)
    print(f"Age : {age}")
    time.sleep(.5)
    if gender == 'm':
        print(f"Gender : Male")
        gender = True
    else:
        print(f"Gender : Female")
        gender = False
    time.sleep(.5)
    if physical == 'p':
        physical = True
        if team == 't':
            print(f"Preferred Sport Type : Physical and Team")
            team = True
        else:
            print(f"Preferred Sport Type : Physical and Individual")
            team = False
    else:
        physical = False
        if team == 't':
            print(f"Preferred Sport Type : Non-physical and Team")
            team = True
        else:
            print(f"Preferred Sport Type : Non-physical and Individual")
            team = False
    time.sleep(1)
    print('\n')
    input('Press Enter to view your results')
    
    # Model: build the row in the column order the final tree (dt) was trained on
    data = pd.DataFrame({'height_cm' : [height], 'weight_kg' : [weight], 'male' : [gender], 'age' : [age],
               'physical' : [physical], 'team' : [team]})

    result = dt.predict(data)[0]
    return f"Based on the information that you have entered, the sport we think is most suited to you is:   {result}"
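
The three numeric questions and the three letter questions in `predictor()` repeat the same ask, validate, retry pattern. The sketch below shows one way this could be factored out. `ask_float` and `ask_choice` are hypothetical helpers, and the `read` parameter is an assumption added so the functions can be tested without keyboard input.

```python
def ask_float(prompt, low, high, read=input):
    """Keep asking until the user enters a number between low and high (inclusive)."""
    while True:
        try:
            value = float(read(prompt))
        except ValueError:
            print('You did not enter a valid number. Please try again.')
            continue
        if low <= value <= high:
            return value
        print(f'The value you entered is out of range ({low} - {high}). Please try again.')


def ask_choice(prompt, options, read=input):
    """Keep asking until the user enters one of the allowed options (case-insensitive)."""
    while True:
        answer = read(prompt).strip().lower()
        if answer in options:
            return answer
        print(f"Please enter one of: {', '.join(options).upper()}.")


# Scripted answers stand in for keyboard input so the sketch runs unattended
scripted = iter(['abc', '500', '170'])
height = ask_float('Enter your height in cm: ', 120, 230, read=lambda _: next(scripted))
print(height)
```

In the notebook, `height = ask_float('Enter your height in cm: ', 120, 230)` would replace the first `while` loop, and `ask_choice('Are you male or female? (Enter M or F): ', ('m', 'f'))` the gender loop.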

In [1213]:
predictor()

Welcome to the Olympic sport recommender! 



This application will ask you a few questions about yourself.
Then using data from real athletes, it will give you the Olympic sport that is most compatible to your choices and attributes.



Press Enter to start. 
Enter your height in cm:  170





Enter your weight in kg:  70





Enter your age in years:  23





Are you male or female? (Enter M or F):  m



Would you prefer a more physically challenging sport (e.g. Athletics), or a less physical sport that requires a specialised skill (e.g. Shooting)?


Enter P if you would prefer a physical sport. Enter N if not:  n
Would you prefer a team sport or an individual sport? Enter T for team. Enter I for individual:  t




Thank you for answering the questions. Here are the details you entered:

Height : 170.0cm
Weight : 70.0kg
Age : 23.0
Gender : Male
Preferred Sport Type : Non-physical and Team




Press Enter to view your results 


'Based on the information that you have entered, the sport we think is most suited to you is:   Diving (Aquatics)'

In [535]:
df1 = df.copy()

In [537]:
# All the sports in the data split into categories
aquatics = ['Diving (Aquatics)', 'Marathon Swimming (Aquatics)', 'Swimming (Aquatics)', 'Water Polo (Aquatics)',
            'Artistic Swimming (Aquatics)']
athletics = ['Athletics']
ball_sports = ['Football (Football)', 'Basketball (Basketball)', 'Beach Volleyball (Volleyball)', 'Handball',
               'Hockey', 'Softball (Baseball/Softball)', 'Volleyball (Volleyball)', 'Baseball (Baseball/Softball)',
               'Rugby Sevens (Rugby)', '3x3 Basketball (Basketball)']
combat_sports = ['Judo', 'Wrestling', 'Boxing', 'Taekwondo', 'Karate']
cycling = ['Cycling Road (Cycling)', 'Cycling Track (Cycling)', 'Cycling Mountain Bike (Cycling)', 'Cycling BMX Racing (Cycling)',
           'Cycling BMX Freestyle (Cycling)']
gymnastics = ['Artistic Gymnastics (Gymnastics)', 'Trampolining (Gymnastics)', 'Rhythmic Gymnastics (Gymnastics)']
racquet_sports = ['Table Tennis', 'Tennis', 'Badminton']
shooting_sports = ['Archery', 'Shooting']
watercraft_sports = ['Rowing', 'Sailing', 'Canoe Sprint (Canoeing)', 'Canoe Slalom (Canoeing)']
strength_and_climbing = ['Weightlifting', 'Sport Climbing']
other_sports = ['Fencing', 'Golf', 'Modern Pentathlon', 'Triathlon', 'Skateboarding (Roller Sports)', 'Surfing']

In [539]:
# Check that I have done ALL the sports
len(aquatics + athletics + ball_sports + combat_sports + cycling + gymnastics + racquet_sports + shooting_sports 
    + watercraft_sports +  strength_and_climbing  + other_sports) - len(df1.discipline.unique().tolist())

0

In [541]:
categories = {'aquatics' : aquatics, 'athletics' : athletics, 'ball_sports' : ball_sports,
              'combat_sports' : combat_sports, 'cycling' : cycling, 'gymnastics' : gymnastics,
              'racquet_sports' : racquet_sports, 'shooting_sports' : shooting_sports,
              'watercraft_sports' : watercraft_sports,
              'strength_and_climbing' : strength_and_climbing,
              'other_sports' : other_sports}

In [543]:
# Create a dictionary to map each discipline to its category
mapping = {}
for i in df1.discipline.unique().tolist():
    for j in categories:
        if i in categories[j]:
            mapping[i] = j

print(mapping)

{'Table Tennis': 'racquet_sports', 'Archery': 'shooting_sports', 'Fencing': 'other_sports', 'Football (Football)': 'ball_sports', 'Artistic Gymnastics (Gymnastics)': 'gymnastics', 'Rowing': 'watercraft_sports', 'Shooting': 'shooting_sports', 'Diving (Aquatics)': 'aquatics', 'Sailing': 'watercraft_sports', 'Athletics': 'athletics', 'Canoe Sprint (Canoeing)': 'watercraft_sports', 'Marathon Swimming (Aquatics)': 'aquatics', 'Weightlifting': 'strength_and_climbing', 'Basketball (Basketball)': 'ball_sports', 'Swimming (Aquatics)': 'aquatics', 'Cycling Road (Cycling)': 'cycling', 'Triathlon': 'other_sports', 'Judo': 'combat_sports', 'Beach Volleyball (Volleyball)': 'ball_sports', 'Cycling Track (Cycling)': 'cycling', 'Tennis': 'racquet_sports', 'Wrestling': 'combat_sports', 'Cycling Mountain Bike (Cycling)': 'cycling', 'Handball': 'ball_sports', 'Hockey': 'ball_sports', 'Softball (Baseball/Softball)': 'ball_sports', 'Volleyball (Volleyball)': 'ball_sports', 'Water Polo (Aquatics)': 'aquatics
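
The same discipline-to-category mapping can also be built by inverting the `categories` dictionary directly with a dictionary comprehension. The sketch below uses a small subset of the category lists (`categories_toy` is for illustration only) and applies the mapping with `Series.map`, which returns NaN for any discipline missing from the mapping.

```python
import pandas as pd

# Small subset of the category lists, for illustration only
categories_toy = {'racquet_sports': ['Table Tennis', 'Tennis', 'Badminton'],
                  'shooting_sports': ['Archery', 'Shooting']}

# Invert the dictionary: each discipline points to its category
mapping_toy = {discipline: category
               for category, disciplines in categories_toy.items()
               for discipline in disciplines}

# map() leaves NaN where a discipline has no category, which makes gaps easy to spot
disciplines = pd.Series(['Tennis', 'Archery', 'Golf'])
mapped = disciplines.map(mapping_toy)
print(mapped)
```

By contrast, `replace` keeps the original value when a discipline is missing from the mapping, so an unmapped sport would silently become its own category.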

In [545]:
# Create a new column in df1 using this mapping
df1['category'] = df1.discipline.replace(mapping)
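
One thing to watch: `Series.replace` leaves any value that is not a key of `mapping` unchanged, so an unmapped discipline would silently become its own "category". A possible alternative, sketched here on toy data (the small `categories` dict and `disciplines` Series are illustrative, not the real ones), is `Series.map`, which returns NaN for unmapped values and makes gaps easy to find.

```python
import pandas as pd

# Toy version of the categories dictionary
categories = {
    'racquet_sports': ['Tennis', 'Table Tennis'],
    'combat_sports': ['Judo', 'Wrestling'],
}

# Invert {category: [disciplines]} into {discipline: category}
mapping = {d: cat for cat, members in categories.items() for d in members}

disciplines = pd.Series(['Tennis', 'Judo', 'Golf'])   # 'Golf' is deliberately unmapped

category = disciplines.map(mapping)                    # unmapped values become NaN
unmapped = disciplines[category.isna()].unique().tolist()
print(unmapped)
```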

In [547]:
df1.groupby('category')['category'].count()

category
aquatics                 1965
athletics                2082
ball_sports              2580
combat_sports            1115
cycling                   718
gymnastics                966
other_sports              780
racquet_sports            582
shooting_sports           620
strength_and_climbing     216
watercraft_sports        1296
Name: category, dtype: int64

In [549]:
# Remove entries with nulls in height or weight
df1 = df1[df1.height_cm.notnull() & df1.weight_kg.notnull()]

In [551]:
# Relevant feature columns that don't need OHE.
feature_cols = ['height_cm', 'weight_kg', 'age', 'male', 'physical', 'team']

# Creating X and y
X = df1[feature_cols]
y = df1['category']

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,  # X
                                                    y,  # y
                                                    test_size = 0.3, # Hold out 30% of the rows for testing
                                                    random_state = 42)
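
The categories are quite imbalanced (216 strength_and_climbing rows against 2,580 ball_sports rows in the counts above), so a plain random split can over- or under-represent the small ones. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. A minimal sketch on toy data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (not the Olympic data): 80 rows of class 'a', 20 rows of class 'b'
X_toy = [[i] for i in range(100)]
y_toy = ['a'] * 80 + ['b'] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,       # 30% of rows go to the test set
    random_state=42,
    stratify=y_toy,      # keep the a:b ratio the same in both splits
)

test_counts = Counter(y_te)
print(test_counts)
```

In the notebook this would amount to adding `stratify = y` to the existing call.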

In [555]:
grid = GridSearchCV(estimator = DecisionTreeClassifier(),   # I want to use a Decision Tree!
                    param_grid = {'max_depth': [5, 10, 20, 30],  # 4 possible options
                                  'min_samples_split': [5, 10, 15, 20, 50, 100],  # 6 possible options
                                  'min_samples_leaf': [2, 3, 4, 5, 6, 7],  # 6 possible options
                                  'max_features': [5,6]},  # 2 possible options
                    cv = 10,   # How many folds we want -- i.e. the value of K: In our case 10-fold CV
                    refit = True, # Refit the best parameter combination on the whole training set
                    verbose = 1, # How much you want the output to print out
                    scoring = 'accuracy')  # What metric do I prioritise?
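
As a sanity check on how much work the grid search does, the number of candidates is the product of the option counts, and each candidate is fitted once per fold. The quick calculation below just repeats the grid from the cell above; it agrees with the "288 candidates, totalling 2880 fits" line printed when the grid is fitted.

```python
from math import prod

param_grid = {'max_depth': [5, 10, 20, 30],
              'min_samples_split': [5, 10, 15, 20, 50, 100],
              'min_samples_leaf': [2, 3, 4, 5, 6, 7],
              'max_features': [5, 6]}
cv = 10

n_candidates = prod(len(values) for values in param_grid.values())   # 4 * 6 * 6 * 2
n_fits = n_candidates * cv                                           # one fit per fold
print(n_candidates, n_fits)   # 288 2880
```

With `refit = True` there is one further fit of the best candidate on the full training set.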

In [559]:
# Take the gridsearch and fit it on the Training set

# Imported here because the later `import time` shadows `from time import time`
from time import perf_counter

start = perf_counter()   # Start the timer

# Fit the gridsearch on our training set
grid.fit(X_train, y_train)

print(f' Time in seconds: {perf_counter() - start}')   # How long the fit took

Fitting 10 folds for each of 288 candidates, totalling 2880 fits
 Time in seconds: 48.696045875549316


In [571]:
grid.best_params_

{'max_depth': 30,
 'max_features': 5,
 'min_samples_leaf': 4,
 'min_samples_split': 20}

In [579]:
grid.best_score_

0.4534583938164539

In [581]:
model = grid.best_estimator_

In [585]:
# score() predicts on the given data and returns the accuracy.
# Increasing max_depth increases the training score,
# but it also increases the chance of overfitting.
print(f'Score on training set: {model.score(X_train,y_train)}')
print(f'Score on testing set: {model.score(X_test, y_test)}')


Score on training set: 0.6095072175211548
Score on testing set: 0.45005807200929154


The model overfits heavily: accuracy drops from 0.61 on the training set to 0.45 on the test set. I'm not sure there's a way to fix this without more detailed data.
- If I have time, try adding data from previous Olympic Games so that the training and testing sets are bigger.
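
To get a feel for how much of a train/test gap comes from tree depth alone, one option is to compare both scores at a few depths. The sketch below uses synthetic data from `make_classification` as a stand-in (it is NOT the Olympic data, and the depths are arbitrary choices), so the numbers say nothing about this model; it only shows the pattern of the comparison.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y)
X_syn, y_syn = make_classification(n_samples=1000, n_features=6, n_informative=4,
                                   n_classes=3, flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

scores = {}
for depth in [3, 10, None]:          # None lets the tree grow until its leaves are pure
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree_model.fit(X_tr, y_tr)
    train_acc = tree_model.score(X_tr, y_tr)
    test_acc = tree_model.score(X_te, y_te)
    scores[depth] = (train_acc, test_acc)
    print(f'max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}')
```

An unrestricted tree memorises the training rows, so its training accuracy says little about how it will do on new athletes.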

In [587]:
model.feature_importances_
# [height_cm, weight_kg, age, male, physical, team]

array([0.25205894, 0.21148655, 0.20939151, 0.0555302 , 0.10665625,
       0.16487656])
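
The raw array is easier to read when each value is paired with its feature name and sorted. A small sketch, with the values copied from the output above (in the notebook, `model.feature_importances_` would be used in place of the hard-coded list):

```python
feature_cols = ['height_cm', 'weight_kg', 'age', 'male', 'physical', 'team']
# Copied from the model.feature_importances_ output above
importances = [0.25205894, 0.21148655, 0.20939151, 0.0555302, 0.10665625, 0.16487656]

# Pair each feature with its importance and sort from most to least important
ranked = sorted(zip(feature_cols, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f'{name:<10} {value:.3f}')
```

Height, weight and age carry roughly two thirds of the total importance between them, while `male` contributes the least.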

In [589]:
model.predict(X_train)

array(['ball_sports', 'watercraft_sports', 'aquatics', ..., 'ball_sports',
       'aquatics', 'aquatics'], dtype=object)

In [591]:
train_results = X_train.copy()
train_results['y_pred'] = model.predict(X_train)
train_results['y_real'] = y_train

train_results

| | height_cm | weight_kg | age | male | physical | team | y_pred | y_real |
|---|---|---|---|---|---|---|---|---|
| 481 | 172.0 | 56.0 | 29.147159 | False | True | True | ball_sports | aquatics |
| 4672 | 183.0 | 73.0 | 32.219028 | True | True | True | watercraft_sports | athletics |
| 4336 | 194.0 | 83.0 | 26.924025 | True | True | False | aquatics | cycling |
| 4158 | 188.0 | 87.0 | 26.198494 | True | False | False | watercraft_sports | aquatics |
| 1628 | 182.0 | 70.0 | 28.156057 | True | False | False | watercraft_sports | watercraft_sports |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3827 | 188.0 | 84.0 | 28.569473 | True | True | False | athletics | athletics |
| 7134 | 190.0 | 97.0 | 37.379877 | True | True | True | ball_sports | ball_sports |
| 7342 | 166.0 | 68.0 | 26.261465 | False | True | True | ball_sports | ball_sports |
| 8369 | 187.0 | 100.0 | 31.173169 | False | True | True | aquatics | aquatics |


In [593]:
# Number of correct predictions (2,449 of 4,018 training rows, about 61%)
len(train_results[train_results['y_pred'] == train_results['y_real']])

2449

In [595]:
# How often each category is predicted (all 11 appear, but racquet_sports and strength_and_climbing only rarely)
train_results.groupby('y_pred')['y_pred'].count()

y_pred
aquatics                 762
athletics                796
ball_sports              815
combat_sports            267
cycling                  201
gymnastics               292
other_sports             212
racquet_sports            50
shooting_sports          229
strength_and_climbing     54
watercraft_sports        340
Name: y_pred, dtype: int64

In [597]:
test_results = X_test.copy()
test_results['y_pred'] = model.predict(X_test)
test_results['y_real'] = y_test

test_results

| | height_cm | weight_kg | age | male | physical | team | y_pred | y_real |
|---|---|---|---|---|---|---|---|---|
| 768 | 161.0 | 62.0 | 28.969199 | True | True | False | gymnastics | gymnastics |
| 3029 | 186.0 | 80.0 | 28.052019 | False | True | False | combat_sports | other_sports |
| 4715 | 193.0 | 89.0 | 28.410678 | True | True | False | aquatics | athletics |
| 3562 | 184.0 | 83.0 | 25.218344 | True | True | True | ball_sports | aquatics |
| 3435 | 204.0 | 98.0 | 28.008214 | True | False | False | other_sports | watercraft_sports |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11679 | 193.0 | 85.0 | 22.910335 | True | True | True | aquatics | watercraft_sports |
| 3135 | 178.0 | 72.0 | 32.027379 | False | True | True | ball_sports | ball_sports |
| 2350 | 169.0 | 49.0 | 22.540726 | False | True | True | athletics | athletics |
| 4582 | 162.0 | 62.0 | 25.681040 | True | True | False | combat_sports | gymnastics |


In [599]:
# Number of correct predictions (775 of 1,722 test rows, about 45%)
len(test_results[test_results['y_pred'] == test_results['y_real']])

775

In [601]:
# How often each category is predicted (all 11 appear, but racquet_sports and strength_and_climbing only rarely)
test_results.groupby('y_pred')['y_pred'].count()

y_pred
aquatics                 309
athletics                346
ball_sports              366
combat_sports            123
cycling                   86
gymnastics                94
other_sports              84
racquet_sports            28
shooting_sports          108
strength_and_climbing     17
watercraft_sports        161
Name: y_pred, dtype: int64
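
Prediction counts show which categories the model favours, but not how often each true category is recovered. A per-category recall gives that view. The sketch below does the calculation by hand on just the nine `test_results` rows displayed above, which is far too few to conclude anything from; on the full test set, `sklearn.metrics.classification_report(y_test, model.predict(X_test))` would report the same quantity for every category.

```python
from collections import Counter

# (y_pred, y_real) pairs copied from the nine displayed test_results rows only
pairs = [
    ('gymnastics', 'gymnastics'), ('combat_sports', 'other_sports'),
    ('aquatics', 'athletics'), ('ball_sports', 'aquatics'),
    ('other_sports', 'watercraft_sports'), ('aquatics', 'watercraft_sports'),
    ('ball_sports', 'ball_sports'), ('athletics', 'athletics'),
    ('combat_sports', 'gymnastics'),
]

actual = Counter(real for _, real in pairs)                        # rows per true category
correct = Counter(real for pred, real in pairs if pred == real)    # correctly predicted rows

# Recall = correctly predicted rows / all rows of that true category
recall = {cat: correct[cat] / n for cat, n in actual.items()}
for cat, value in sorted(recall.items()):
    print(f'{cat:<20} {value:.2f}')
```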

In [603]:
# Example input to the model using my own data entry
my_data = pd.DataFrame({'height_cm' : [170], 'weight_kg' : [70], 'age' : [23], 'male' : [True],
               'physical' : [False], 'team' : [False]})
my_data

| | height_cm | weight_kg | age | male | physical | team |
|---|---|---|---|---|---|---|
| 0 | 170 | 70 | 23 | True | False | False |


In [605]:
# The results using my data
model.predict(my_data)

array(['aquatics'], dtype=object)

**Pros**
- The results don't look too bad, and the prediction for my own data seems plausible.
- Could build further models that take the predicted category and then predict the specific discipline within it.

**Cons**
- 45% accuracy on the test set isn't great (although it might not be too bad considering there are 11 target categories)
- The model rarely predicts the smaller categories: of the 1,722 test predictions, only 28 are racquet_sports and 17 are strength_and_climbing
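
To put the accuracy in context, it helps to compare it with what naive guessing would achieve. The rough calculation below uses the category counts printed by the earlier `groupby`. Those counts were taken before rows with null height or weight were dropped, so the class shares in the data actually used for modelling may differ; treat the result as an approximate yardstick only.

```python
# Category counts from the earlier groupby output (BEFORE null rows were dropped)
category_counts = {
    'aquatics': 1965, 'athletics': 2082, 'ball_sports': 2580, 'combat_sports': 1115,
    'cycling': 718, 'gymnastics': 966, 'other_sports': 780, 'racquet_sports': 582,
    'shooting_sports': 620, 'strength_and_climbing': 216, 'watercraft_sports': 1296,
}

total = sum(category_counts.values())
uniform_guess = 1 / len(category_counts)                  # pick one of 11 categories at random
majority_guess = max(category_counts.values()) / total    # always predict the largest category

print(f'uniform guess:  {uniform_guess:.3f}')
print(f'majority guess: {majority_guess:.3f}')
```

On these counts, always guessing ball_sports would be right about 20% of the time and a uniform random guess about 9% of the time.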

### Basic Attempt

In [491]:
df3 = df.copy()

In [493]:
df3 = df3[df3.height_cm.notnull() & df3.weight_kg.notnull()]

In [507]:
# Convert boolean columns to integer (True = 1, False = 0)
df3[['male', 'physical', 'team']] = df3[['male', 'physical', 'team']].astype(int)

In [505]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5740 entries, 0 to 12915
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   height_cm     5740 non-null   float64
 1   weight_kg     5740 non-null   float64
 2   age           5740 non-null   float64
 3   country_name  5740 non-null   object 
 4   country_code  5740 non-null   object 
 5   born_country  5339 non-null   object 
 6   discipline    5740 non-null   object 
 7   event         5740 non-null   object 
 8   position      5524 non-null   float64
 9   medal         1079 non-null   object 
 10  male          5740 non-null   int32  
 11  age_rounded   5740 non-null   float64
 12  physical      5740 non-null   int32  
 13  team          5740 non-null   int32  
dtypes: float64(5), int32(3), object(6)
memory usage: 605.4+ KB


In [511]:
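def feature_eng(df):
    df = df.copy()     # Work on a copy so the original DataFrame is not modified

    # Making columns numerical
    # 1. OHE (the first discipline is dropped and acts as the baseline)
    df = pd.get_dummies(df, columns = ['discipline'], drop_first = True, dtype = int)

    # Add the constant (intercept) column for statsmodels
    df = sm.add_constant(df)
    return df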
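
One caveat with one-hot encoding inside a function like this: `get_dummies` only sees the frame it is given, so applying it separately to the test set can produce different columns (a discipline may be absent, or a different first level may be dropped). A possible workaround, sketched on toy frames (the data and variable names here are illustrative), is to encode the test set without `drop_first` and then reindex it to the training columns.

```python
import pandas as pd

# Toy frames: the test set has no 'Archery' rows
train = pd.DataFrame({'height_cm': [170.0, 180.0, 190.0],
                      'discipline': ['Archery', 'Judo', 'Rowing']})
test = pd.DataFrame({'height_cm': [175.0, 185.0],
                     'discipline': ['Judo', 'Rowing']})

train_fe = pd.get_dummies(train, columns=['discipline'], drop_first=True, dtype=int)

# Naive approach: same call on the test set drops 'Judo' instead of 'Archery'
test_naive = pd.get_dummies(test, columns=['discipline'], drop_first=True, dtype=int)

# Safer: encode every level, then keep exactly the training columns
# (dummies that never occur in the test set are filled with 0)
test_all = pd.get_dummies(test, columns=['discipline'], dtype=int)
test_aligned = test_all.reindex(columns=train_fe.columns, fill_value=0)

print(train_fe.columns.tolist())
print(test_naive.columns.tolist())
print(test_aligned.columns.tolist())
```

The constant column would then be added to both frames after alignment.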

In [531]:
feature_cols = ['height_cm', 'weight_kg', 'male', 'physical', 'team', 'discipline']

In [533]:
X = df3[feature_cols]
y = df3['age']

In [535]:
# Use the train-test split function from sklearn to do it
X_train, X_test, y_train, y_test = train_test_split(X,  # The features
                                                    y,  # The target
                                                    test_size = 0.3,    # What %  of the whole dataset to reserve for testing
                                                    random_state = 42)  # Add a random state

In [537]:
X_train_fe = feature_eng(X_train)
feature_cols = X_train_fe.columns.tolist()

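The results below are poor: an R-squared of 0.205 means the model explains only about 20% of the variance in age, so it is probably not worth pursuing. The very large condition number (3.13e+16) also suggests strong multicollinearity, possibly because `physical` and `team` overlap with the discipline dummies.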

In [539]:
lin_reg = sm.OLS(y_train, X_train_fe[feature_cols]) # First parameter: y, Second parameter: X
results = lin_reg.fit()
results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | age | R-squared: | 0.205 |
| Model: | OLS | Adj. R-squared: | 0.195 |
| Method: | Least Squares | F-statistic: | 21.29 |
| Date: | Tue, 28 Jan 2025 | Prob (F-statistic): | 1.65e-160 |
| Time: | 16:56:27 | Log-Likelihood: | -11577.0 |
| No. Observations: | 4018 | AIC: | 23250.0 |
| Df Residuals: | 3969 | BIC: | 23560.0 |
| Df Model: | 48 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 30.3963 | 1.822 | 16.686 | 0.000 | 26.825 | 33.968 |
| height_cm | -0.0343 | 0.013 | -2.629 | 0.009 | -0.060 | -0.009 |
| weight_kg | 0.0420 | 0.008 | 5.007 | 0.000 | 0.026 | 0.058 |
| male | 0.4678 | 0.188 | 2.485 | 0.013 | 0.099 | 0.837 |
| physical | -1.1869 | 2.392 | -0.496 | 0.620 | -5.877 | 3.503 |
| team | -0.2625 | 0.209 | -1.254 | 0.210 | -0.673 | 0.148 |
| discipline_Archery | 2.8995 | 0.565 | 5.129 | 0.000 | 1.791 | 4.008 |
| discipline_Artistic Gymnastics (Gymnastics) | -0.1860 | 2.549 | -0.073 | 0.942 | -5.183 | 4.811 |
| discipline_Artistic Swimming (Aquatics) | -0.5987 | 0.700 | -0.855 | 0.392 | -1.971 | 0.774 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 350.05 | Durbin-Watson: | 2.015 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 665.577 |
| Skew: | 0.593 | Prob(JB): | 2.96e-145 |
| Kurtosis: | 4.603 | Cond. No. | 3.13e+16 |
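
As a quick check on how the two R-squared figures in the summary relate, the adjusted value can be recomputed from the reported R-squared, the number of observations and the model degrees of freedom. The inputs below are the rounded numbers from the summary, so the result is only accurate to about three decimals.

```python
r_squared = 0.205   # R-squared from the summary
n_obs = 4018        # No. Observations
k = 48              # Df Model (predictors, excluding the constant)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)
print(round(adj_r_squared, 3))   # 0.195
```

With 4,018 observations the penalty for 48 predictors is small, so the adjusted figure stays close to the raw one.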
