## Lab Assignment 3
### Data Mining 7331 Section 403
---
- Brian Coari
- Stephen Merritt
- Cory Thigpen
- Quentin Thomas

## Business Understanding 
---
- Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs? (10 points)

For our dataset this semester we selected the ["120 years of Olympic history: athletes and results"](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) dataset from Kaggle. This dataset tracks all Olympians' medal results from 1896-2016, as well as various physical attributes such as `gender`, `age`, `weight`, and `height`. This dataset was originally collected for various analyses, such as whether host countries' athletes win more medals, and to track the involvement of women in the Olympics from various countries over time.

Our intent this semester was to analyze the physical attributes of the Olympians over time, where they are available. We examined the trends in `gender`, `age`, `height`, and `weight` within each sport over time, and whether they appeared to be converging on some kind of "ideal ratio" for each sport. We eliminated linear regression as a viable option, since any non-zero slope trends toward positive or negative infinity, so long-term extrapolation quickly becomes absurd. We attempted quadratic, exponential, and logarithmic regression in order to derive meaningful insights from the data. To consider our analysis "successful", our target was to explain at least 80% of the variability in the data with our regression model (an $R^2$ of at least 0.80). We also looked at how these ratios compare to the average person in order to see which Olympic sports might be the most accessible from a body-size perspective. Finally, we constructed a means to analyze our team members' own body types and see for which Olympic sports we might be best suited, strictly from a `sex`, `age`, `weight`, and `height` perspective.
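As a minimal sketch of how the 80% target could be checked, the snippet below fits logarithmic and exponential curves by ordinary least squares on transformed data and computes $R^2$. The helper names and the sample series are hypothetical, and the notebook's actual fitting code may differ.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(ys, preds):
    """Share of the variance in ys that is explained by preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def fit_logarithmic(xs, ys):
    """Fit y = a + b*ln(x) by regressing y on ln(x) (requires x > 0)."""
    a, b = fit_line([math.log(x) for x in xs], ys)
    return lambda x: a + b * math.log(x)

def fit_exponential(xs, ys):
    """Fit y = a*exp(b*x) by regressing ln(y) on x (requires y > 0)."""
    log_a, b = fit_line(xs, [math.log(y) for y in ys])
    return lambda x: math.exp(log_a + b * x)

# Hypothetical series: index of the Games vs. mean athlete height (cm)
games = [1, 2, 3, 4, 5, 6, 7, 8]
mean_height = [170.0 + 4.0 * math.log(g) for g in games]

model = fit_logarithmic(games, mean_height)
score = r_squared(mean_height, [model(g) for g in games])
meets_target = score >= 0.80  # the 80% threshold stated above
```

Unlike a straight line, the logarithmic curve flattens as the Games index grows, which is the behavior we wanted for long-term trends.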

## Data Understanding 
---
- Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems? (10 pts)
- Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs. (10 pts)

A full description of the original dataset can be found [here](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/home). We focus only on the fields that are relevant and non-redundant for our analysis.

This dataset (athlete_events.csv) contains 271,116 rows. Each row corresponds to an individual athlete competing in an individual Olympic event. The columns are:

`ID` (Integer) - A unique number for each athlete.

`Name` (String) - The athlete's name.

`Sex` (Character) - The gender of the athlete, currently either "M" or "F" in this dataset.

`Age` (Integer) - The age of the athlete.

`Height` (Integer) - The height of the athlete in centimeters.

`Weight` (Integer) - The weight of the athlete in kilograms.

`Team` (String) - The team (mainly the country) for which the athlete is participating.

`Year` (Integer) - The year of Olympic games, ranging from 1896-2016 in this dataset.

`Season` (String) - The season of Olympic Games, either "Summer" or "Winter".

`City` (String) - The host city of Olympic Games.

`Sport` (String) - The sport in which the Athlete competed.

`Event` (String) - The event in which the Athlete competed.

`Medal` (String) - The medal obtained by the athlete for this row's data, "Gold", "Silver", "Bronze", or "NA".

About 89% of the rows are missing data in at least one of these columns. This is because records were not kept that far back in history, or were not kept for specific sports. We will exclude from our analysis any sport that did not track `height` and `weight` at all. For sports that are missing data for only some athletes, we will run our analysis two ways: by dropping the incomplete records entirely, and by imputing the missing values with the mean of the population we are analyzing. Dropping records shrinks the sample, while mean imputation keeps every row but understates the variance; we will use whichever approach yields the greater statistical benefit.
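The two strategies can be sketched as follows. This is an illustration in plain Python on hypothetical rows for a single sport, not the notebook's own code; the function names are assumptions, and `None` stands in for a missing measurement.

```python
from statistics import mean

# Hypothetical rows for one sport; None marks a missing measurement
rows = [
    {"Name": "A", "Height": 180, "Weight": 75},
    {"Name": "B", "Height": None, "Weight": 82},
    {"Name": "C", "Height": 172, "Weight": None},
    {"Name": "D", "Height": 188, "Weight": 90},
]
FIELDS = ("Height", "Weight")

def drop_incomplete(rows, fields=FIELDS):
    """Keep only the rows in which every field is present."""
    return [r for r in rows if all(r[f] is not None for f in fields)]

def impute_mean(rows, fields=FIELDS):
    """Fill each missing value with the mean of the observed values in this group."""
    fill = {f: mean(r[f] for r in rows if r[f] is not None) for f in fields}
    return [
        {**r, **{f: fill[f] if r[f] is None else r[f] for f in fields}}
        for r in rows
    ]

dropped = drop_incomplete(rows)  # two complete rows remain
imputed = impute_mean(rows)      # all four rows remain, gaps filled
```

`impute_mean` assumes that each field has at least one observed value in the group, which is why sports with no `height` or `weight` data at all are excluded first.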

Three variables were added to the data set: `Population_Prop`, `BMI`, and `Previous_Medals`. The `Population_Prop` variable represents the percentage of the overall world population accounted for by a given `Country`. Athlete `BMI` was added to the dataframe, where $\omega$ represents `Weight` in kilograms, $\eta$ represents `Height` in centimeters, and $C$ is a constant equal to `10,000` (which converts square centimeters to square meters):

$$
BMI = \left(\frac{\omega}{\eta^2}\right) C
$$
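A small sketch of the same formula as a function follows. The function name and the handling of missing inputs are assumptions made for illustration.

```python
def bmi(weight_kg, height_cm):
    """BMI = weight / height^2 * 10,000, with height given in centimeters."""
    if weight_kg is None or height_cm is None or height_cm <= 0:
        return None  # leave BMI missing when an input is missing or invalid
    return weight_kg / (height_cm * height_cm) * 10_000

example = bmi(80, 180)  # an 80 kg athlete who is 180 cm tall
```

For this example the result is about 24.7, the same value obtained from the familiar kilograms per square meter form (80 / 1.8²).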

`Previous_Medals` was created as a binary indicator of whether an athlete had won a medal at a previous Olympic Games (indicated by a `1`). The `Medal` variable was converted from ordinal to binary, where `Gold`, `Silver`, or `Bronze` were given a `1` and rows with no medal (`NA`) were given a `0`. `Season` and `Sex` were also converted to binary.
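One way these flags could be derived is sketched below. It assumes that "previous" means a strictly earlier `Year`, and it writes the binary medal value to a hypothetical `Medal_Won` field so that the original `Medal` stays visible in the example.

```python
MEDALS = ("Gold", "Silver", "Bronze")

def add_medal_flags(rows):
    """Add binary Medal_Won and Previous_Medals to each athlete-event row.

    Previous_Medals is 1 only if the athlete medaled in an earlier year,
    so medals won at the same Games do not count.
    """
    medal_years = {}  # athlete ID -> set of years with at least one medal
    for r in rows:
        if r["Medal"] in MEDALS:
            medal_years.setdefault(r["ID"], set()).add(r["Year"])

    flagged = []
    for r in rows:
        won = 1 if r["Medal"] in MEDALS else 0
        earlier = any(y < r["Year"] for y in medal_years.get(r["ID"], ()))
        flagged.append({**r, "Medal_Won": won, "Previous_Medals": int(earlier)})
    return flagged

# Hypothetical athlete-event rows
rows = [
    {"ID": 1, "Year": 2008, "Medal": "Gold"},
    {"ID": 1, "Year": 2012, "Medal": "NA"},
    {"ID": 2, "Year": 2012, "Medal": "Bronze"},
]
flagged = add_medal_flags(rows)
```

Comparing years strictly keeps the current row's own result out of `Previous_Medals`, which would otherwise leak the target into a feature.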

One-hot encoding was performed for binarization of the remaining nominal variables (`Country`, `Event`, and `Sport`) so they could be included as features to train the classification model.
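To illustrate what one-hot encoding does to a nominal column, here is a minimal plain-Python sketch on hypothetical rows. The `one_hot` helper and the `Sport_<value>` column naming are assumptions, not the notebook's code.

```python
def one_hot(rows, column):
    """Replace a nominal column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new_row)
    return encoded

# Hypothetical rows
rows = [
    {"Age": 24, "Sport": "Judo"},
    {"Age": 31, "Sport": "Rowing"},
    {"Age": 27, "Sport": "Judo"},
]
encoded = one_hot(rows, "Sport")
```

Each row ends up with exactly one `1` among the indicator columns for `Sport`. On a dataframe, `pandas.get_dummies` produces the same kind of expansion.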

The final data set contains the following data of different types:

| Nominal | Ordinal | Numeric         | Binary          |
| ------- | ------- | --------------- | --------------- | 
|         |         | Age             | Country         |
|         |         | BMI             | Event           |
|         |         | Height          | Medal           |
|         |         | Population_Prop | Previous_Medals |
|         |         | Weight          | Season          |
|         |         | Year            | Sex             |
|         |         |                 | Sport           |

### Next in this section we should go back to Lab 1 and pull in some graphics.

## Modeling and Evaluation
---
Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results. Each option is broken down by:
- Train and adjust parameters (10 pts)
- Evaluate and Compare (10 pts)
- Visualize Results (10 pts)
- Summarize the Ramifications (20 pts)

Cluster Analysis
- Train: Perform cluster analysis using several clustering methods (adjust parameters).
- Eval: Use internal and/or external validation measures to describe and compare the clusterings and the clusters— how did you determine a suitable number of clusters for each method?
- Visualize: Use tables/visualization to discuss the found results. Explain each visualization in detail.
- Summarize: Describe your results. What findings are the most interesting and why?

## Deployment 
---
- Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?  How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?  How would you deploy your model for interested parties?  What other data should be collected?  How often would the model need to be updated, etc.? (10 pts)

`Model Usability`


Our model can be made accessible via a simple web portal for users of all kinds. Our target audience is individuals who would like to determine which Olympic sport they would be best suited for, based on specific attributes that they give the web application. An aspiring athlete who would like to one day compete in the Olympics could go to our system to get a quick recommendation for the sport that best fits his or her body type. Unfortunately, our model takes up 2.4GB of memory. This makes it difficult to deploy on most machines, especially mobile devices. Any interested party would need a comprehensive plan to deploy this model at scale without sacrificing accuracy. The model's size could therefore limit its reach to a mass audience.


`The Goal`

The goal of the model is to accurately predict the Olympic sport that best fits the user. A user could act on this information by beginning to train for that sport in order to one day make the Olympics. This application cannot, however, guarantee that a user will make the Olympics, but it can give the user some insight into the sport that might be most suitable for his or her success if he or she ever decided to pursue the Olympic Games as a career option. Using this system is much better than shooting in the dark by selecting a sport by chance, without any data to back the decision. To determine whether or not we achieved our goal, we would need to study real-world use of the system on real people. Our success would be determined by how well those people performed in the sports chosen for them by the algorithm. We would be interested in whether the sport chosen by the model was indeed a good fit for the user and led to an overall successful performance in the Olympic Games.

`Usefulness of the model`

Our model could be most useful to future athletes. The age range that we look at is anyone between the ages of *10* and *90* (90 is not typical in the Olympic data, but there have been anomalies in this age group throughout Olympic history). Our interested parties would take the information from our model and hopefully develop a career based on our findings. Coaches and personal trainers could also use this model to help determine which Olympic sport to design exercises around for a specific client's body type. The applications for this model are far-reaching within the athletics arena.

Ideally this model would help world-class athletes transition between sports for optimal success. There is precedent: for example, American football and track star Sam McGuffie transitioned to the United States Olympic Men's Bobsleigh Team for the 2018 Olympic Games, where his physical attributes made him an ideal push crewman and brakeman.

`Deploying The Model`

Our model takes up about 2.4GB on disk. For a successful release, we would need to consider a cloud solution that gives access to users who want a consultation with the model. A good way to deploy this is to center the model around an Olympic sport consultancy business. This vertical would allow future athletes to come to the consultancy and give us their athletic data. This data could then be fed to the model, which could return a detailed report of the optimal sport for the user's success in the Olympics. Part of the business could be designed to follow up with all the athletes who had consultations. The follow-up data would then be fed back into the system to make better and better predictions for current and future athletes. Over time this defensible business could obtain proprietary Olympic data which no one else would have. A successful deployment of this model in the real world could win the business of the best athletes, and because of their success, more high-caliber athletes would consult with our model. This would in turn create a constant loop of successful Olympians, along with the data points that made those Olympians successful, and could lead to a viable Olympic consultancy unmatched by anyone.

`Data that should be collected`

The longer this model stays in production, the more data points we would have to continually improve the predictions. Some of the new data points we would be interested in collecting:

- Number of times appeared in the Olympics for those that used our model
- Average age of first Olympic appearance for those that used our model
- Average age of first Olympic podium finish for those that used our model
- Most recommended sports by our model? What pattern would emerge here over time?
- Average number of times our athlete quit after getting a recommendation from our model

Beyond these metrics, an attribute rating system by sport would need to be developed to match athlete skillsets to sport requirements. After collecting `gender`, `age`, `weight`, and `height`, the model would collect 1-10 ratings on `depth perception`, `hand-eye`, `jump`, and `stamina` per athlete and compare them against an ideal rating per sport. Consider the long jump, for example:

| Characteristic | Ideal Rating |
| ---------------- | -- | 
| Depth Perception | 5  |
| Hand-Eye         | 2  |
| Jump             | 10 |
| Stamina          | 6  |

These metrics would be good factors in measuring why our recommendations do or do not succeed. This feedback could be used to make our model's predictions better. Ultimately, we would like to be able to help all consulted athletes win gold in their events on a consistent basis.
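The comparison against an ideal rating is not specified further here, so the sketch below assumes one simple option: the mean absolute difference between an athlete's ratings and the long jump ideals from the table. The athlete's ratings and the function name are hypothetical.

```python
# Ideal ratings for the long jump, taken from the table above
IDEAL_LONG_JUMP = {"depth_perception": 5, "hand_eye": 2, "jump": 10, "stamina": 6}

def rating_gap(athlete, ideal):
    """Mean absolute difference between athlete and ideal ratings (0 = perfect match)."""
    return sum(abs(athlete[k] - ideal[k]) for k in ideal) / len(ideal)

# Hypothetical athlete ratings on the same 1-10 scale
athlete = {"depth_perception": 6, "hand_eye": 4, "jump": 8, "stamina": 6}
gap = rating_gap(athlete, IDEAL_LONG_JUMP)
```

A lower gap means a closer match, so sports could be ranked for an athlete by ascending gap. Weighting the characteristics differently per sport would be a natural refinement.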

`How often the model should be updated`

For best results, our model should be updated every 2 to 4 years, coinciding with the new Olympic data that is captured. We are fortunate to work in a domain where new data arrives infrequently, so there is no need for real-time updates.

## Exceptional Work 
---
- You have free rein to provide additional analyses or combine analyses. (10 pts)

Our exceptional work for this model was a production deployment to the web for a proof of concept. Our goal was to get a working Olympic sport prediction system out on the web. Below is a walkthrough for what the user could expect to find when they use our application.

![Home Page](images/home_example.png)

We spent time styling a user-friendly front-end application for the model, which gives interested parties an easy access point. The user can click the **Find Sport** link at the top right corner. After clicking the link, they can enter some information for the model to process.

![User Form](images/find_sport_example.png)

The form is set up to take in the following:

- `Gender`: Can be either male or female
- `Season`: Can be either Summer or Winter Olympic games
- `Age`: The age of the athlete in question
- `Weight`: The user's weight in pounds
- `Height`: The height of the user in centimeters

These factors are what the model uses in order to make an Olympic sport recommendation. Once the user enters their information, they can click **Get Sport** and the system will analyze their inputs and respond with a recommendation.
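Because the form takes weight in pounds while the dataset records `Weight` in kilograms, the inputs need converting before they reach the model. The sketch below shows one way to do that; the function name and the 0/1 encodings for `Sex` and `Season` are assumptions, not the deployed application's code.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound

def form_to_features(gender, season, age, weight_lb, height_cm):
    """Convert form values to the dataset's units and binary encodings."""
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")
    if season not in ("Summer", "Winter"):
        raise ValueError("season must be 'Summer' or 'Winter'")
    return {
        "Sex": 1 if gender == "male" else 0,       # assumed encoding
        "Season": 1 if season == "Summer" else 0,  # assumed encoding
        "Age": int(age),
        "Weight": round(float(weight_lb) * LB_TO_KG, 1),  # kilograms
        "Height": float(height_cm),                       # centimeters
    }

# Hypothetical submission: a 27-year-old woman, 150 lb, 170 cm, Winter Games
features = form_to_features("female", "Winter", 27, 150, 170)
```

Validating the two categorical inputs up front keeps a malformed request from silently becoming a `0`.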

![Recommendation](images/recommendation_example2.png)

Once the user submits the form, they will get an Olympic sport recommendation based on their inputs. Of course, there are many things we could do with the response that the user gets. In the future we could show interesting statistics for each recommended sport, so that the user can get an idea of why he or she received that recommendation.

You can experiment with the model here... [`Olympic Sport Recommender POC`](https://olympics-msds-7331.herokuapp.com/)