## Sport predictor using Olympics Data

Using the Olympics data set from https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results?resource=download, this project uses an athlete's age, sex, height, weight and team to predict which sport they should compete in.

#### CSV file columns: 
ID - Unique number for each athlete
Name - Athlete's name
Sex - M or F
Age - Integer
Height - In centimeters
Weight - In kilograms
Team - Team name
NOC - National Olympic Committee 3-letter code
Games - Year and season
Year - Integer
Season - Summer or Winter
City - Host city
Sport - Sport
Event - Event
Medal - Gold, Silver, Bronze, or NaN

#### Imports 

In [195]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from patsy import dmatrices

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.tree import export_graphviz

import graphviz
from graphviz import Source

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Part 1: Cleaning and Exploring the Data


In [143]:
df = pd.read_csv('athlete_events.csv')
regions = pd.read_csv('noc_regions.csv')
df.shape

(271116, 15)

#### Check the values of the columns 

In [144]:
column_list = ["Name", "Sex", "Team", "NOC", "Games", "Season", "City", "Sport", "Event", "Medal", "Age", "Height", "Weight", "ID", "Year"]

# Print the unique values of every column
for col in column_list:
    unique_values = df[col].unique()
    print('Unique values from column ' + col + "\n", unique_values)

Unique values from column Name
 ['A Dijiang' 'A Lamusi' 'Gunnar Nielsen Aaby' ... 'Andrzej ya' 'Piotr ya'
 'Tomasz Ireneusz ya']
Unique values from column Sex
 ['M' 'F']
Unique values from column Team
 ['China' 'Denmark' 'Denmark/Sweden' ... 'Solos Carex' 'Dow Jones' 'Digby']
Unique values from column NOC
 ['CHN' 'DEN' 'NED' 'USA' 'FIN' 'NOR' 'ROU' 'EST' 'FRA' 'MAR' 'ESP' 'EGY'
 'IRI' 'BUL' 'ITA' 'CHA' 'AZE' 'SUD' 'RUS' 'ARG' 'CUB' 'BLR' 'GRE' 'CMR'
 'TUR' 'CHI' 'MEX' 'URS' 'NCA' 'HUN' 'NGR' 'ALG' 'KUW' 'BRN' 'PAK' 'IRQ'
 'UAR' 'LIB' 'QAT' 'MAS' 'GER' 'CAN' 'IRL' 'AUS' 'RSA' 'ERI' 'TAN' 'JOR'
 'TUN' 'LBA' 'BEL' 'DJI' 'PLE' 'COM' 'KAZ' 'BRU' 'IND' 'KSA' 'SYR' 'MDV'
 'ETH' 'UAE' 'YAR' 'INA' 'PHI' 'SGP' 'UZB' 'KGZ' 'TJK' 'EUN' 'JPN' 'CGO'
 'SUI' 'BRA' 'FRG' 'GDR' 'MON' 'ISR' 'URU' 'SWE' 'ISV' 'SRI' 'ARM' 'CIV'
 'KEN' 'BEN' 'UKR' 'GBR' 'GHA' 'SOM' 'LAT' 'NIG' 'MLI' 'AFG' 'POL' 'CRC'
 'PAN' 'GEO' 'SLO' 'CRO' 'GUY' 'NZL' 'POR' 'PAR' 'ANG' 'VEN' 'COL' 'BAN'
 'PER' 'ESA' 'PUR' 'UGA' 'HON' 'ECU

Unique values from column Year
 [1992 2012 1920 1900 1988 1994 1932 2002 1952 1980 2000 1996 1912 1924
 2014 1948 1998 2006 2008 2016 2004 1960 1964 1984 1968 1972 1936 1956
 1928 1976 2010 1906 1904 1908 1896]


#### Check and Deal with NaN values 

In [145]:
df.isnull().values.any()

True

In [146]:
# Replace all NaN values with 0 (this also sets missing Height, Weight and Medal values to 0)
df = df.fillna(0)

In [147]:
df.isnull().values.any()

False
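
Filling every NaN with 0 also turns missing heights and weights into 0 cm and 0 kg, which a model would treat as real measurements. As a hedged alternative (not what this notebook does), the sketch below counts missing values per column and fills numeric gaps with the column median. The `sample` frame is made up, and the choice of the median and of the `"No medal"` label are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of athlete_events.csv, for illustration only
sample = pd.DataFrame({
    "Age": [24.0, 23.0, np.nan, 34.0],
    "Height": [180.0, 170.0, np.nan, np.nan],
    "Weight": [80.0, 60.0, np.nan, 70.0],
    "Medal": [np.nan, np.nan, np.nan, "Gold"],
})

# Count missing values per column before changing anything
missing_counts = sample.isna().sum()
print(missing_counts)

# Fill numeric gaps with each column's median instead of 0
numeric_cols = ["Age", "Height", "Weight"]
filled = sample.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Label a missing medal explicitly rather than storing 0
filled["Medal"] = filled["Medal"].fillna("No medal")
print(filled)
```

Whether the median is a sensible fill depends on the sport and era, so dropping rows with missing measurements may be just as reasonable.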

#### Types of Data 

In [148]:
df.dtypes

ID          int64
Name       object
Sex        object
Age       float64
Height    float64
Weight    float64
Team       object
NOC        object
Games      object
Year        int64
Season     object
City       object
Sport      object
Event      object
Medal      object
dtype: object

In [149]:
categorical_columns = ["Name", "Sex", "Team", "NOC", "Games", "Season", "City", "Sport", "Event", "Medal"]
numerical_columns = ["Age", "Height", "Weight"]

# Store the text columns as pandas categories
for column in categorical_columns:
    df[column] = df[column].astype('category')

# NaN values were filled with 0 above, so the float columns can be cast to int
for column in numerical_columns:
    df[column] = df[column].astype('int')




#### Create a new column for possible use later

In [150]:

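# Note: missing medals were replaced with 0 above, so the isna() condition
# below never matches; those rows fall through to np.select's default (0).
conditions = [
    (df['Medal'] == 'Bronze'),
    (df['Medal'] == "Silver"),
    (df['Medal'] == 'Gold'),
    (df['Medal'].isna()),
    ]

#values = ['Winner3', 'Winner2', 'Winner1','Not qualified']
values2 = ['True', 'True', 'True', "False"]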


In [151]:
#df['Status'] = np.select(conditions, values)
df['Winner'] = np.select(conditions, values2)
# display updated DataFrame
df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,Winner
0,1,A Dijiang,M,24,180,80,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,0,0
1,2,A Lamusi,M,23,170,60,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,0,0
2,3,Gunnar Nielsen Aaby,M,24,0,0,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,0,0
3,4,Edgar Lindenau Aabye,M,34,0,0,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,True
4,5,Christine Jacoba Aaftink,F,21,185,82,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,0,0
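
Because the missing medals were replaced with 0 earlier, the `isna()` condition never matches, and non-medallists end up with `0` rather than `'False'` in the output above. A minimal sketch of one way to get a real boolean column instead, assuming a small made-up frame (`sample`) that mimics the `Medal` column after `fillna(0)`:

```python
import pandas as pd

# Hypothetical sample mimicking the Medal column after fillna(0)
sample = pd.DataFrame({"Medal": [0, "Gold", "Silver", 0, "Bronze"]})

# True for any medal, False otherwise; works whether missing medals are 0 or NaN
sample["Winner"] = sample["Medal"].isin(["Gold", "Silver", "Bronze"])
print(sample)
```

A boolean column avoids the mix of strings and integers that `np.select` produces when the default value is used.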


In [152]:
# Check for duplicate rows and columns; duplicates need to be dealt with
print('Number of duplicate rows in the table is: ', df.duplicated().sum())
print('Number of duplicate columns in the table is: ', df.columns.duplicated().sum())

Number of duplicate rows in the table is:  1385
Number of duplicate columns in the table is:  0
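
The notebook reports the duplicate rows but does not remove them. One possible way to deal with them is `drop_duplicates`, sketched here on a made-up frame (`sample` and its rows are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical sample with one fully repeated row
sample = pd.DataFrame({
    "Name": ["A Dijiang", "A Lamusi", "A Lamusi"],
    "Sport": ["Basketball", "Judo", "Judo"],
    "Year": [1992, 2012, 2012],
})

print("Duplicate rows before:", sample.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduped = sample.drop_duplicates().reset_index(drop=True)
print("Duplicate rows after:", deduped.duplicated().sum())
```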


Full list of unique NOC values (the printed output above is truncated):
['CHN' 'DEN' 'NED' 'USA' 'FIN' 'NOR' 'ROU' 'EST' 'FRA' 'MAR' 'ESP' 'EGY'
 'IRI' 'BUL' 'ITA' 'CHA' 'AZE' 'SUD' 'RUS' 'ARG' 'CUB' 'BLR' 'GRE' 'CMR'
 'TUR' 'CHI' 'MEX' 'URS' 'NCA' 'HUN' 'NGR' 'ALG' 'KUW' 'BRN' 'PAK' 'IRQ'
 'UAR' 'LIB' 'QAT' 'MAS' 'GER' 'CAN' 'IRL' 'AUS' 'RSA' 'ERI' 'TAN' 'JOR'
 'TUN' 'LBA' 'BEL' 'DJI' 'PLE' 'COM' 'KAZ' 'BRU' 'IND' 'KSA' 'SYR' 'MDV'
 'ETH' 'UAE' 'YAR' 'INA' 'PHI' 'SGP' 'UZB' 'KGZ' 'TJK' 'EUN' 'JPN' 'CGO'
 'SUI' 'BRA' 'FRG' 'GDR' 'MON' 'ISR' 'URU' 'SWE' 'ISV' 'SRI' 'ARM' 'CIV'
 'KEN' 'BEN' 'UKR' 'GBR' 'GHA' 'SOM' 'LAT' 'NIG' 'MLI' 'AFG' 'POL' 'CRC'
 'PAN' 'GEO' 'SLO' 'CRO' 'GUY' 'NZL' 'POR' 'PAR' 'ANG' 'VEN' 'COL' 'BAN'
 'PER' 'ESA' 'PUR' 'UGA' 'HON' 'ECU' 'TKM' 'MRI' 'SEY' 'TCH' 'LUX' 'MTN'
 'CZE' 'SKN' 'TTO' 'DOM' 'VIN' 'JAM' 'LBR' 'SUR' 'NEP' 'MGL' 'AUT' 'PLW'
 'LTU' 'TOG' 'NAM' 'AHO' 'ISL' 'ASA' 'SAM' 'RWA' 'DMA' 'HAI' 'MLT' 'CYP'
 'GUI' 'BIZ' 'YMD' 'KOR' 'THA' 'BER' 'ANZ' 'SCG' 'SLE' 'PNG' 'YEM' 'IOA'
 'OMA' 'FIJ' 'VAN' 'MDA' 'YUG' 'BAH' 'GUA' 'SRB' 'IVB' 'MOZ' 'CAF' 'MAD'
 'MAL' 'BIH' 'GUM' 'CAY' 'SVK' 'BAR' 'GBS' 'TLS' 'COD' 'GAB' 'SMR' 'LAO'
 'BOT' 'ROT' 'CAM' 'PRK' 'SOL' 'SEN' 'CPV' 'CRT' 'GEQ' 'BOL' 'SAA' 'AND'
 'ANT' 'ZIM' 'GRN' 'HKG' 'LCA' 'FSM' 'MYA' 'MAW' 'ZAM' 'RHO' 'TPE' 'STP'
 'MKD' 'BOH' 'TGA' 'LIE' 'MNE' 'GAM' 'COK' 'ALB' 'WIF' 'SWZ' 'BUR' 'NBO'
 'BDI' 'ARU' 'NRU' 'VNM' 'VIE' 'BHU' 'MHL' 'KIR' 'UNK' 'TUV' 'NFL' 'KOS'
 'SSD' 'LES']

## Part 2: Prepare for Machine Learning 

#### Drop columns
The aim of this project is to take someone's sex, age, height, weight and nationality (NOC) and train an ML model to predict their sport and, possibly later, their likelihood of winning a medal.

In [194]:
newdf = df.drop(columns=['ID','Name', 'Team', 'Year','Season','City','Event','Medal',"Games", "Winner"])
newdf.head(5)

Unnamed: 0,Sex,Age,Height,Weight,NOC,Sport
0,M,24,180,80,CHN,Basketball
1,M,23,170,60,CHN,Judo
2,M,24,0,0,DEN,Football
3,M,34,0,0,DEN,Tug-Of-War
4,F,21,185,82,NED,Speed Skating


In [196]:
#Change Sex to 0 and 1 
newdf['Sex'] = df['Sex'].map({'M': 1, 'F': 0})
newdf.head(5)

Unnamed: 0,Sex,Age,Height,Weight,NOC,Sport
0,1,24,180,80,CHN,Basketball
1,1,23,170,60,CHN,Judo
2,1,24,0,0,DEN,Football
3,1,34,0,0,DEN,Tug-Of-War
4,0,21,185,82,NED,Speed Skating


#### Change categorical data using OneHotEncoder
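
This step is left empty in the notebook. A minimal sketch of what it could look like, assuming the text column to encode is `NOC` and using a made-up frame (`sample`) shaped like `newdf`; the `NOC_` column prefix is also an assumption:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample shaped like newdf (Sex already mapped to 0/1)
sample = pd.DataFrame({
    "Sex": [1, 1, 0],
    "Age": [24, 23, 21],
    "Height": [180, 170, 185],
    "Weight": [80, 60, 82],
    "NOC": ["CHN", "CHN", "NED"],
})

# handle_unknown="ignore" encodes NOC codes unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
noc_matrix = encoder.fit_transform(sample[["NOC"]]).toarray()

# Build readable column names from the categories learned by the encoder
noc_columns = ["NOC_" + code for code in encoder.categories_[0]]
noc_df = pd.DataFrame(noc_matrix, columns=noc_columns, index=sample.index)

# Replace the text column with its one-hot columns
encoded = pd.concat([sample.drop(columns=["NOC"]), noc_df], axis=1)
print(encoded)
```

With roughly 230 NOC codes in the full data, this adds one column per code, so the encoded frame becomes much wider than `newdf`.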

## Part 3: Machine learning Models

#### Create Train and Test set 

In [None]:
#newdf = newdf.sample(frac = 1)

In [191]:
# The target y is the sport
y = newdf["Sport"]

# X is everything else
X = newdf.drop(columns=["Sport"])

# Split the dataset into two datasets: 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)





In [193]:
lr = LinearRegression()
lr.fit(X_train, y_train) 

ValueError: could not convert string to float: 'Ice Hockey'

ValueError: Expected 2D array, got 1D array instead:
array=[  0.  24. 180.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
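
The second error message above is the one scikit-learn raises when a single sample is passed as a 1D array. A hedged sketch, on made-up numbers, of reshaping one sample into the 2D shape that `predict` expects. The toy data and the use of `DecisionTreeClassifier` are assumptions, chosen only because that classifier is already imported:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training data: columns are Sex, Age, Height (values are made up)
X_toy = np.array([[0, 24, 180], [1, 23, 170], [0, 21, 185], [1, 30, 160]])
y_toy = np.array(["Basketball", "Judo", "Basketball", "Judo"])

clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# A single athlete as a 1D array has shape (3,), which predict() rejects
one_sample = np.array([0.0, 24.0, 180.0])

# reshape(1, -1) turns it into one row with three features: shape (1, 3)
one_sample_2d = one_sample.reshape(1, -1)
prediction = clf.predict(one_sample_2d)
print(one_sample_2d.shape, prediction)
```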

In [137]:
print("\nDescriptive features in X:\n", X_train.head(5))
print("\nTarget feature in y:\n", y_train.head(5))


Descriptive features in X:
        Sex   Age  Height  Weight           Team        Games Winner
122824   M  30.0   186.0    75.0         Greece  2016 Summer  False
24397    F  21.0     NaN     NaN      Romania-2  1992 Summer  False
268846   F  26.0   174.0    66.0          China  1996 Summer  False
155981   M  26.0   171.0    63.0  Cote d'Ivoire  1976 Summer  False
134742   M  29.0   172.0    68.0      Hong Kong  2012 Summer  False

Target feature in y:
 122824         Sailing
24397     Table Tennis
268846        Handball
155981       Athletics
134742         Archery
Name: Sport, dtype: category
Categories (66, object): ['Aeronautics', 'Alpine Skiing', 'Alpinism', 'Archery', ..., 'Volleyball', 'Water Polo', 'Weightlifting', 'Wrestling']


In [182]:
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
X_train.head(5)

Unnamed: 0,Sex,Age,Height,Weight,Team,Games,Winner
0,M,30.0,186.0,75.0,Greece,2016 Summer,False
1,F,21.0,,,Romania-2,1992 Summer,False
2,F,26.0,174.0,66.0,China,1996 Summer,False
3,M,26.0,171.0,63.0,Cote d'Ivoire,1976 Summer,False
4,M,29.0,172.0,68.0,Hong Kong,2012 Summer,False


## Linear Regression 

In [139]:
multiple_linreg = LinearRegression().fit(X_train, y_train)

ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
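
The string-conversion errors above share a cause: `LinearRegression` expects numeric inputs and a numeric target, while `Sport` is a category and `NOC` is text. Predicting a sport is a classification problem. The sketch below is one possible way forward, not a result from this notebook: it one-hot encodes `NOC` inside a pipeline and fits the `RandomForestClassifier` already imported above. The toy rows, `n_estimators=50` and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows shaped like newdf; the real notebook would use newdf itself
toy = pd.DataFrame({
    "Sex":    [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    "Age":    [24, 23, 21, 25, 30, 22, 27, 19, 26, 28],
    "Height": [200, 170, 185, 160, 198, 158, 172, 187, 203, 162],
    "Weight": [95, 60, 82, 50, 100, 48, 66, 80, 98, 52],
    "NOC":    ["CHN", "CHN", "NED", "USA", "USA", "CHN", "NED", "NED", "USA", "CHN"],
    "Sport":  ["Basketball", "Judo", "Basketball", "Judo", "Basketball",
               "Judo", "Judo", "Basketball", "Basketball", "Judo"],
})

X = toy.drop(columns=["Sport"])
y = toy["Sport"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

# One-hot encode NOC; pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    [("noc", OneHotEncoder(handle_unknown="ignore"), ["NOC"])],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=100)),
])
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Predicted sports:", list(predictions))
print("Accuracy on the toy test set:", model.score(X_test, y_test))
```

Keeping the encoder inside the pipeline means it is fitted on the training rows only, and the same encoding is reused when predicting on the test rows.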