# <font color=blue> **Olympics Project**</font>

### **Introduction**

In this project, we are attempting to count the number of time the US has won a golden medal in the Olympics using a regression model. The data set being used is derived from the [dataset_olympics.csv](https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data) found on the Kaggle.com. However, since our goal has narrowed down our data to be focused on the U.S., we used MySQL and filtered the data to only include information in regards to the U.S. We also feature-engineered a new column called Gold_count which includes a numeric representation of the golden medal count. Our final dataset used is called [Olympics_dataset.csv](https://docs.google.com/spreadsheets/d/1ROArs8QPH0nMEGD4ophEH5DgjqqBD-hSf0mlIpkP_Ds/edit?usp=sharing)

### Reading the Data Set 

In [1]:
#we will start by importing necessary libraries to implement our goal
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("Olympics_dataset.csv")

In [3]:
# a quick representation of the data set 
df. head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,Gold_Count
0,22700,James Brendan Bennet Connolly,M,27,175,72.0,United States,USA,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's High Jump,Silver,0
1,22700,James Brendan Bennet Connolly,M,27,175,72.0,United States,USA,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's Long Jump,Bronze,0
2,22700,James Brendan Bennet Connolly,M,27,175,72.0,United States,USA,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's Triple Jump,Gold,1
3,16616,"Thomas Edmund Tom"" Burke""",M,21,183,66.0,United States,USA,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's 100 metres,Gold,1
4,16616,"Thomas Edmund Tom"" Burke""",M,21,183,66.0,United States,USA,1896 Summer,1896,Summer,Athina,Athletics,Athletics Men's 400 metres,Gold,1


In [4]:
df.tail()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,Gold_Count
3851,8093,Danny Barrett,M,26,188,102.0,United States,USA,2016 Summer,2016,Summer,Rio de Janeiro,Rugby Sevens,Rugby Sevens Men's Rugby Sevens,,0
3852,8128,"Jennifer Mae Jenny"" Barringer-Simpson""",F,29,166,53.0,United States,USA,2016 Summer,2016,Summer,Rio de Janeiro,Athletics,"Athletics Women's 1,500 metres",Bronze,0
3853,8173,Thomas Barrows III,M,28,186,82.0,United States,USA,2016 Summer,2016,Summer,Rio de Janeiro,Sailing,Sailing Men's Skiff,,0
3854,6317,"Anthony Lawrence Tony"" Azevedo""",M,34,186,90.0,United States,USA,2016 Summer,2016,Summer,Rio de Janeiro,Water Polo,Water Polo Men's Water Polo,,0
3855,6911,Tavis Bailey,M,24,191,125.0,United States,USA,2016 Summer,2016,Summer,Rio de Janeiro,Athletics,Athletics Men's Discus Throw,,0


#### **-> In total there are 3857 records with 15 features (X) and one target variable (y)** <br>

### Splitting The Data  

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
#Specifying the features/independent variable (X) and target/dependet variable (y)

X = df[['ID','Name', 'Sex','Age','Height','Weight', 'Team', 'NOC', 'Games', 'Year','Season', 'City', 'Sport', 'Event','Medal']]
y = df['Gold_Count']

#Splitting X and y into training and temporary tuning sets using train-test split 
#Stratifying to ensure y is preserved

X_train, X_tun, y_train, y_tun = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

#Splitting the temporary tuning set into a validation set and a test set
X_valid, X_test, y_valid, y_test = train_test_split(X_tun, y_tun, test_size=0.5, random_state=42)


<br>Since our data set has a mix of both numerical and categorical variables, an issue will occur since our goal is to solve a regression question. So, we have to represent the categorical variables numerically using proper encoding. Therefore, we have to preprocess the data. <br>

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [8]:
#First, we define the relevant numerical features and regularize them 
numeric_features = ['Age', 'Height', 'Weight']
transform_num = StandardScaler()

#Second, we define the relevant categorical features and apply one hot encoding 

categorical_features = ['Sex', 'Season', 'City', 'Sport', 'Event']
transform_cat = OneHotEncoder(handle_unknown='ignore')

Now, we combine both steps to make sure our data that we are using is appropraitely combined <br>

In [9]:
combined_data = ColumnTransformer(transformers=[('num', transform_num, numeric_features),('cat', transform_cat, categorical_features)])

<br>

### Establishing a Baseline

Since we are solving for a regression problem, the best baseline to be established is a mean and median baseline. <br>

In [10]:
from sklearn.metrics import r2_score, mean_squared_error

In [12]:
# finding the Mean and Median of the target variable
y_median = y_train.median()
y_mean = y_train.mean()

#Predicting mean and median
y_pred_median = [y_median] * len(y_test)
y_pred_mean = [y_mean] * len(y_test)

#Evalutaing the Median Baseline
median_mse = mean_squared_error(y_test, y_pred_median)
median_rmse = np.sqrt(median_mse)
median_r2 = r2_score(y_test, y_pred_median)

#Evaluating the Mean Baseline
mean_mse = mean_squared_error(y_test, y_pred_mean)
mean_rmse = np.sqrt(mean_mse)
mean_r2 = r2_score(y_test, y_pred_mean)

# Displaying the Results
print("Performance of Median Baseline")
print(f"MSE: {median_mse:.4f}")
print(f"RMSE: {median_rmse:.4f}")
print(f"R² score: {median_r2:.4f}\n")

print("Performance of Mean Baseline")
print(f"MSE: {mean_mse:.4f}")
print(f"RMSE: {mean_rmse:.4f}")
print(f"R² score: {mean_r2:.4f}")

Performance of Median Baseline
MSE: 0.1520
RMSE: 0.3899
R² score: -0.1792

Performance of Mean Baseline
MSE: 0.1289
RMSE: 0.3590
R² score: -0.0000


### Model Implementation

Our Main goal is to utilize the Linear Regression model to predict the output of how many gold medals the U.S. won in the Olympics over the years. However, further research indicated that there could be more powerful models that might be more successful when predicting the same output. <br>
Therefore, we will test out other models in the process and compare the outcome to that of linear regression. <br> <br>
Our performance metrics will evaluated by the the measurements of MSE (Mean Square Error), RMSE (Root Mean Square Error), and $R^{2}$ (R-squared)

#### Linear Regression

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

In [14]:
#Defining the model using Pipelining to preserve the data transformation 

LR_model = Pipeline(steps=[('combined_data', combined_data),('Model', LinearRegression())])

#Training the model
LR_model.fit(X_train, y_train)


In [15]:
#Evaluating the Linear Regression Model 
y_pred = LR_model.predict(X_test)

#Finding MSE
val_MSE = mean_squared_error(y_test, y_pred)

#Finding RMSE
r_MSE = np.sqrt(val_MSE)

#Finding R^2 on both training and testing sets 
r2_train = LR_model.score(X_train, y_train)
r2 = r2_score(y_test, y_pred)

#Displaying the results
print("Performance of Linear Regression Model")
print(f"MSE: {val_MSE:.4f}")
print(f"RMSE: {r_MSE:.4f}" )
print(f"Testing Data R² score: {r2:.4f}" )

Performance of Linear Regression Model
MSE: 0.0929
RMSE: 0.3047
Testing Data R² score: 0.2795


#### KNN

In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

In [17]:
# Defining a range of k values
k = range(1,20)

# Using Cross Validation and GridSearch to find the best value for k 
#Defining the parameter grid
param_grid = {'Model__n_neighbors': k}

# Creating a KNN model that is based on Regression
model = KNeighborsRegressor()

knn_model = Pipeline(steps=[('combined_data', combined_data), ('Model', model)])

# Using GridSearch and cv of 5
grid_search = GridSearchCV(knn_model, param_grid, cv=5)

grid_search.fit(X_train, y_train)


In [18]:
#Evaluating the Knn model

#Finding MSE
val_MSE = grid_search.best_score_

#Finding RMSE
r_MSE = np.sqrt(val_MSE)

#Finding R^2 on both training and testing sets 
r2 = r2_score(y_test, y_pred)

#Displaying the results
print("Performance of KNN Model")
print(f"MSE: {val_MSE:.4f}")
print(f"RMSE: {r_MSE:.4f}" )
print(f"Testing Data R² score: {r2:.4f}" )


Performance of KNN Model
MSE: 0.2787
RMSE: 0.5279
Testing Data R² score: 0.2795


#### Decision Trees

In [21]:
from sklearn.tree import DecisionTreeRegressor

In [22]:
DT_model = Pipeline(steps=[('combined_data', combined_data),('Model', DecisionTreeRegressor())])

In [23]:
#Defining grid parameters with tree depth, number of examples for splitting, and number of examples for leaf nodes
param_grid = {
    'Model__max_depth': [3, 5, 7, None],  
    'Model__min_samples_split': [2, 5, 10], 
    'Model__min_samples_leaf': [1, 2, 4]  
}

# Using GridSearch and cv of 5
grid_search = GridSearchCV(DT_model, param_grid, cv=5)

#Training the Model
grid_search.fit(X_train, y_train)


In [24]:
#Evaluating the Decision Tree Model

#Finding MSE
val_MSE = grid_search.best_score_

#Finding RMSE
r_MSE = np.sqrt(val_MSE)

#Finding R^2 on both training and testing sets 
r2 = r2_score(y_test, y_pred)

#Displaying the results
print("Performance of Decision Tree Model")
print(f"MSE: {val_MSE:.4f}")
print(f"RMSE: {r_MSE:.4f}" )
print(f"Testing Data R² score: {r2:.4f}" )

Performance of Decision Tree Model
MSE: 0.3031
RMSE: 0.5505
Testing Data R² score: 0.2795


#### Random Forest

In [25]:
from sklearn.ensemble import RandomForestRegressor

In [26]:
RF_model = Pipeline(steps=[('combined_data', combined_data),('Model', RandomForestRegressor())])

# Training the model
RF_model.fit(X_train, y_train)


In [27]:
#Evaluating the Random Forest Model 
y_pred = RF_model.predict(X_test)

#Finding MSE
val_MSE = mean_squared_error(y_test, y_pred)

#Finding RMSE
r_MSE = np.sqrt(val_MSE)

#Finding R^2 on both training and testing sets 
r2_train = RF_model.score(X_train, y_train)
r2 = r2_score(y_test, y_pred)

#Displaying the results
print("Performance of Random Forest Model")
print(f"MSE: {val_MSE:.4f}")
print(f"RMSE: {r_MSE:.4f}" )
print(f"Testing Data R² score: {r2:.4f}" )

Performance of Random Forest Model
MSE: 0.0753
RMSE: 0.2745
Testing Data R² score: 0.4155


#### Artificial Neural Networks

In [31]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Input

In [None]:
X_train = combined_data.fit_transform(X_train)

# Transforming the data accordingly
X_valid= combined_data.transform(X_valid)
X_test = combined_data.transform(X_test)
shape = X_train.shape[1]

In [61]:
# Defining the Neural model
Neural_model = Sequential()
Neural_model.add(Dense(64, activation='relu', input_shape=(shape,)))
Neural_model.add(Dense(32, activation='relu'))
Neural_model.add(Dense(1))

# Compiling the Model
Neural_model.compile(optimizer='adam', loss='mean_squared_error')

# Training the Neural model
Neural_model.fit(X_train, y_train,validation_data=(X_valid, y_valid),epochs=100, batch_size=32)

# Evaluating the Neural Model
test = Neural_model.evaluate(X_test, y_test)
print('Test :', test)

y_pred = Neural_model.predict(X_test)

# Finding the MSE
val_MSE = mean_squared_error(y_test, y_pred)

# Finding RMSE
r_MSE = np.sqrt(val_MSE)

# Finding R^2
r2 = r2_score(y_test, y_pred)

print("Performance of Neural Network Model")
print(f"MSE: {val_MSE:.4f}")
print(f"RMSE: {r_MSE:.4f}" )
print(f"Testing Data R² score: {r2:.4f}" )

Epoch 1/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - loss: 0.1202 - val_loss: 0.0926
Epoch 2/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0854 - val_loss: 0.0820
Epoch 3/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0694 - val_loss: 0.0808
Epoch 4/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0517 - val_loss: 0.0781
Epoch 5/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0501 - val_loss: 0.0795
Epoch 6/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0421 - val_loss: 0.0744
Epoch 7/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0371 - val_loss: 0.0754
Epoch 8/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.0324 - val_loss: 0.0778
Epoch 9/100
[1m85/85[0m [32m━━━━━━━━━━━━━━━━━

### Summary

| Model              | MSE          | RMSE         | R2           |
|--------------------|--------------|--------------|--------------|
| Linear Regression  | 0.0929       | 0.3047       | 0.2795       |
| KNN                | 0.2787       | 0.5279       | 0.2795       |
| Decision Tree      | 0.2969       | 0.5449       | 0.2795       |
| Random Forest      | 0.0749       | 0.2737       | 0.4187       |
| Neural Network     | 0.0882       | 0.2970       | 0.3156       |


##### Conclusion

In order to evaluate each model's performance, generally lower MSE and RMSE values are preferred with a higher $R^{2}$ value. <br> Among the implemented models above, the Random Forest Model has outperformed the Linear regression model as well as all of the others because it has the lowest MSE and RMSE scores accompanied by the highest $R^{2}$ score. 
<br> The MSE score of 0.0749 indicates that the model's performance predictions are averagely close to the actual values. The RMSE score of 0.2737 indicates that the predictions of the Random Forest model are generally off that score from the actual values. The $R^{2}$ score of 0.4187 indicates a variance percentage of 41.87%, meaning that the model could detect that much resulted in variance in the data attributed to the independent variables. 
<br> In second place comes the Linear Regression model, which performed comparatively well. Then, the Neural networks mode