# Final Report
## NCAA Predictor
Brandon Clark and Ben Comer
Spring 2021, May 5
CPSC322 - Data Science Algorithms (Sprint)
Final Project

## Introduction

We used Men's Basketball NCAA statistics from the 2020-21 season to build classifiers that predict each team's winning percentage from its statistics in the areas studied.

We found several moderately strong indicators of winning percentage, particularly Rebound Margin and Effective Field Goal Percentage, which we cover further down in this report. Scoring Margin was a fantastic predictor as well, so much so that we dropped it as an attribute in order to find more interesting results.

## Data Analysis
### Preface
We'll be using the following statistics, each listed with its LaTeX notation:
- $PTS$: The number of points the team scored.
- $PTS_{opp}$: The number of points the team allowed.
- $3PM$: The number of 3-Point shots made by the team.
- $2PM$: The number of 2-Point shots made by the team.
- $FGA$: The number of Field Goals (3's and 2's) attempted by the team.
- $REB$: The number of rebounds the team recovered.
- $REB_{opp}$: The number of rebounds the team's opponent recovered.
- $SPG$: The number of steals per game the team got.
- $BPG$: The number of blocks per game the team achieved.
- $W$: The number of wins a team has on the season.
- $L$: The number of losses a team has on the season.
- $G$: The number of games a team played on the season.

Additionally, if you wish to see the data we used in a csv format, check out [NCAA_Statistics.csv](input_data/NCAA_Statistics.csv).

### Attribute Selection
There are four attributes used in the classification schemas, and one classification. They are as follows:

- _**Scoring Margin**_: $SCM = \frac{PTS}{PTS_{opp}}$
The ratio of a team's scored points to their opponents' scored points. The average should be 1.0 when weighting for points scored for the whole game (which we are not doing).

- _**Effective Field Goal Percentage**_: $EFG\% = \frac{3PM * 1.5 + 2PM}{FGA}$
A team's likelihood of making a given shot, based on historical data, with added weight on three-pointers (for their point value). Has no default average.

- _**Rebound Margin**_: $RBM = \frac{REB}{REB_{opp}}$
A team's ratio of rebounds taken versus their opponent. Average should be 1.0, given weight for number of rebounds recovered for the whole game.

- _**Steals Plus Blocks Per Game**_: $SPB = SPG + BPG$
The total number of steals and blocks a team gets in a game. No default average.

And the classification...
- _**Winning Percentage**_: $W\% = \frac{W}{G}$
The percentage of games a team played over the season that they won. Expressed as a value $x$ such that $0 \le x \le 1$.

You can find all of these data in [NCAA_Statistics_Parsed.csv](input_data/NCAA_Statistics_Parsed.csv), in the input_data folder. To see the code used, check out [data_parser.ipynb](data_parser.ipynb).
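As a quick illustration of the formulas above, here is a minimal sketch that computes the four attributes and the class value for one team. The function name `derive_attributes`, the dictionary keys, and the sample numbers are assumptions made for this example; they are not taken from data_parser.ipynb.

```python
def derive_attributes(row):
    """Compute SCM, EFG%, RBM, SPB and W% from one team's raw stats."""
    return {
        "SCM": row["PTS"] / row["PTS_opp"],
        "EFG%": (1.5 * row["3PM"] + row["2PM"]) / row["FGA"],
        "RBM": row["REB"] / row["REB_opp"],
        "SPB": row["SPG"] + row["BPG"],
        "W%": row["W"] / row["G"],
    }


# Hypothetical team, made up for illustration only.
team = {
    "PTS": 2400, "PTS_opp": 2000,
    "3PM": 200, "2PM": 600, "FGA": 1800,
    "REB": 1100, "REB_opp": 1000,
    "SPG": 6.5, "BPG": 3.5,
    "W": 24, "G": 30,
}
stats = derive_attributes(team)
print(stats)
```

For this made-up team, the three-pointers count for 300 "effective" makes, so the EFG% works out to 900 / 1800.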

### Normalization
We normalized each attribute using min-max scaling, from its observed minimum to its observed maximum. The exception was winning percentage, which we normalized along 0 to 1, the possible minimum and maximum for the value, both of which have been achieved historically.

You can see the result of this in [NCAA_Statistics_Normalized.csv](input_data/NCAA_Statistics_Normalized.csv).
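Min-max scaling as described above can be sketched as follows. This is a minimal illustration, not the code in norm_and_disc.ipynb; the function name and the optional `lo`/`hi` bounds (used here to pin winning percentage to the fixed 0 to 1 range) are assumptions.

```python
def min_max_normalize(values, lo=None, hi=None):
    """Scale values to [0, 1]; default bounds are the observed min and max."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Attribute scaled by its observed range.
print(min_max_normalize([2, 4, 6]))
# Winning percentage scaled by its possible range, so it is left unchanged.
print(min_max_normalize([0.25, 0.75], lo=0.0, hi=1.0))
```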

### Discretization
We checked a lot of splitting methods here. After looking at many distribution charts and generating many decision trees, we decided to split each attribute into 4 discrete labels. We also decided here to drop the Scoring Margin feature: we felt it was not helpful, since anyone with that information could trivially guess each team's classification. With this discretization schema, we were able to generate interesting data, as seen further down.

This discretization produced [NCAA_Statistics_44444.csv](input_data/NCAA_Statistics_44444.csv).

See how we did both of these tasks in [norm_and_disc.ipynb](norm_and_disc.ipynb).
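To make the 4-label idea concrete, here is a minimal sketch that maps a normalized value to a label from 1 to 4. It assumes equal-width bins over the 0 to 1 range, which is only one possible choice; the actual cutoffs live in norm_and_disc.ipynb and may differ. The function name `discretize` is hypothetical.

```python
def discretize(value, n_bins=4):
    """Map a normalized value in [0, 1] to a label 1..n_bins (equal-width bins)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("expected a normalized value in [0, 1]")
    # int() floors non-negative numbers; min() keeps 1.0 inside the top bin.
    return min(int(value * n_bins) + 1, n_bins)


print([discretize(v) for v in (0.0, 0.24, 0.25, 0.5, 0.99, 1.0)])
```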

### Summary Statistics
Using the data at the end of [EDA.ipynb](EDA.ipynb)...

These statistics are consistent with our earlier inferences.

### Distributions and Regression Analysis
We made several histograms for each attribute to gather the frequency between certain bounds, and linear regression plots to show our work. Rather than copy over the work, [here's a link to EDA.ipynb instead](EDA.ipynb).

Our analysis here is twofold. First, it's visible to the naked eye that these are roughly normal distributions, skewed as they may be; the bell curve shape is obvious enough. Second, we have one amazing correlative stat in Scoring Margin, with a correlation coefficient of around .9; two moderately good ones in Rebound Margin and Effective Field Goal Percentage, each with an r of around .6; and a poor one in Steals + Blocks, with an r of around .2.
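For reference, the r values quoted above are Pearson correlation coefficients between an attribute and winning percentage. A minimal sketch of that computation follows; the function name `pearson_r` is an assumption, and the sample lists are made up rather than taken from our data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: a perfectly linear and a perfectly inverse relationship.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```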

## Classification
You can see all of our classification work in [classification_eval.ipynb](classification_eval.ipynb), alongside the confusion matrices and the like listed below.

### kNN Classification
We used our standard kNN classification upon the dataset, with n_neighbors=10. The kNN performed without issue using stratified k-fold cross-validation with 10 folds (we'll use this for all our classifiers). We got the following info:

A 53% accuracy rate is nothing to scoff at with 4 possible classifications. Notably, this classifier was excellent at detecting bad teams; a 71% recognition rate is notable. The detection rates for 2's and 3's were also solid, but they were abysmal for 4's. It is tough to detect 4's in general, as there are few trees that lead there, but alas there should be some. Overall, for our worst classifier, kNN performed adequately.
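The stratified k-fold split mentioned above keeps each fold's label mix close to that of the whole dataset. Here is a minimal sketch of one way to build such folds, dealing each label's instances out round-robin. The function name, the `seed` parameter, and the toy labels are assumptions for illustration; the version we actually ran is in classification_eval.ipynb.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=10, seed=0):
    """Return n_splits lists of row indices with roughly equal label mixes."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_splits)]
    next_fold = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Keep dealing where the previous label stopped so fold sizes stay even.
        for index in indices:
            folds[next_fold % n_splits].append(index)
            next_fold += 1
    return folds


# Toy labels: 20 teams labeled 1, 20 labeled 2, 10 labeled 3.
toy_labels = [1] * 20 + [2] * 20 + [3] * 10
toy_folds = stratified_kfold_indices(toy_labels, n_splits=10)
```

Each fold serves once as the test set while the other nine are used for training.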

### Decision Tree Classification
Moving to decision trees, we decided to run two separate subprojects.

The first subproject (found in [decision_tree.ipynb](decision_tree.ipynb)) helped with testing, and let us flex our pruning muscles. We used it to decide on our splits and make them interesting, which is how we got the \_4444 splits. We manually pruned [this tree](tree_vis/_4444_tree.pdf) to get [this tree](tree_vis/_4444_tree_pruned.pdf); they make a good contrast. Because we used the whole dataset to build the former, we decided not to run tests on this tree, instead using unpruned trees for our testing. On a side note, some of the alternate trees we got with Scoring Margin were, um, [bad](tree_vis/24444_tree.pdf). It was a good call to drop Scoring Margin.

The second subproject involved actually testing out decision tree results using our defined function. Using the \_4444 format, we came out with these results:

Excuse the difference in formatting; it's a result of us splitting up the work, even for the most minor of problems.

Anyway, here we see that the classifier worked well at detecting 3's, which, per the distributions shown, were (ever so slightly) the most populous group, with 2's right behind. This suggests the classifier is good at detecting prevalent cases and terrible at pointing out fringe cases, as shown by the bad detection results above. To improve this in the future, we'd implement automatic pruning.
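The detection (recognition) rates discussed above are per-class rates read off a confusion matrix. As a minimal sketch, assuming rows are actual labels and columns are predicted labels, they can be computed as below. The function name is hypothetical and the counts are made up for illustration; they are not the results from this report.

```python
def accuracy_and_recognition(matrix):
    """Return overall accuracy and per-class recognition (row-wise recall)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    recognition = [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]
    return correct / total, recognition


# Made-up 4x4 confusion matrix for labels 1..4 (NOT our actual results).
toy_matrix = [
    [7, 3, 0, 0],
    [2, 5, 3, 0],
    [0, 3, 6, 1],
    [0, 0, 4, 1],
]
toy_accuracy, toy_recognition = accuracy_and_recognition(toy_matrix)
print(toy_accuracy, toy_recognition)
```

A class with few instances, like the last row here, can have a poor recognition rate without moving the overall accuracy much.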

### Random Forest Classification
On to our ensemble classifier, Random Forests. We implemented Forests with two varying aspects for randomization: Attributes used and subdataset used. Let's go over each.

For the problem of using attributes, I (Ben) decided to implement my own strategy to get a good mix, but with weight towards using all the attributes. It's the following, as mentioned in [myclassifiers.py](mysklearn/myclassifiers.py):

    # 1. Set n to num_atts
    # 2. If n == min_atts, return the current atts
    # 3. Flip a coin, heads or tails
    # 4. If heads, return the current attributes
    # 5. Otherwise:
    # 6.   Remove a random attribute from the list
    # 7.   n -= 1
    # 8.   Repeat from Step 2 onwards
    
Essentially, we wanted a procedure that, at each step, gives even odds of either keeping the current set as is or removing another attribute. This was successful, as we got a generally good number of removals when testing in our [testing file](random_forest.ipnyb) (about half of the runs included removals, as verified through prints).
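The steps listed above can be sketched in Python as follows. This is an illustrative rewrite of the commented steps, not a copy of the function in myclassifiers.py; the name `select_attributes`, the default `min_atts=2`, and the `rng` parameter are assumptions.

```python
import random


def select_attributes(attributes, min_atts=2, rng=random):
    """Drop random attributes, with a coin flip before each removal."""
    current = list(attributes)
    while len(current) > min_atts:
        if rng.random() < 0.5:
            # Heads: keep the current attributes as they are.
            break
        # Tails: remove one attribute at random and flip again.
        current.pop(rng.randrange(len(current)))
    return current


rng = random.Random(10)
print(select_attributes(["EFG%", "RBM", "SPB"], min_atts=1, rng=rng))
```

With this scheme, the full set survives the first flip half the time, a set with one attribute removed comes out a quarter of the time, and so on until `min_atts` is reached.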

As for the latter problem, we implemented bagging. Not much to see here, though it should be noted we stored validation accuracy, in order to implement optional weighted voting. In essence, if the user wanted to weight the votes of accurate "experts" over the common citizenry, they could set weighted=True, and each vote would be multiplied by the validation accuracy. Upon testing this, there wasn't a noticeable difference in accuracy, so we didn't use it for our testing.
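Here is a minimal sketch of the two pieces described above: a bootstrap sample with its leftover rows held out for validation, and a vote that can optionally be weighted by each tree's validation accuracy. The function names are hypothetical, and using the out-of-bag rows as the validation set is an assumption of this sketch rather than a statement about myclassifiers.py.

```python
import random
from collections import defaultdict


def bootstrap_split(n_rows, rng):
    """Sample n_rows indices with replacement; unsampled rows form validation."""
    train = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(train)
    validation = [i for i in range(n_rows) if i not in chosen]
    return train, validation


def forest_vote(predictions, accuracies=None, weighted=False):
    """Combine one prediction per tree into a single label."""
    tally = defaultdict(float)
    for i, label in enumerate(predictions):
        # Each tree counts once, or by its validation accuracy when weighted.
        tally[label] += accuracies[i] if weighted else 1.0
    # Ties go to the smallest label because max() keeps the first maximum.
    return max(sorted(tally), key=lambda label: tally[label])


# Three hypothetical trees: two weak ones say 3, one strong one says 2.
votes = [3, 3, 2]
tree_accuracies = [0.3, 0.3, 0.9]
print(forest_vote(votes))
print(forest_vote(votes, tree_accuracies, weighted=True))
```

In this toy case the plain majority picks 3, while the weighted vote sides with the single more accurate tree.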

We ran trials with seeds 10-15 and found Random Forests had an average accuracy of approximately 57%, including the following run on seed 15:

As you can probably see, this is very similar to the Decision Tree classifier's results, which makes a ton of sense! We also decided to use m=10 and n=50, giving a community-of-experts approach. This was our best result (though not by much), and we used it for our Heroku rollout.

Before that, a reminder that you can find all this work on [classification_eval.ipynb](classification_eval.ipynb).

### Heroku Rollout of Random Forests
Here's our Heroku web app. It hasn't got much in terms of bells and whistles, but it works! __TODO__

## Conclusion
We conclude that the attributes and advanced statistics we used were very helpful in finding classifications. While there is some degree of variability in a team's winning percentage, these 3 or 4 attributes really help explain that performance.

The data we used was from [the NCAA Statistics Page and API](http://stats.ncaa.org/rankings/change_sport_year_div). We simply moved the API data from the site to Excel sheets, which were converted into csv files via Windows Excel. From there, we did all the work explained above this section. There were no problems for classification.

For our approach and challenges, I have to say candidly, we crushed the splitting of work and accomplishing our set tasks. Each of us did an equal part and communicated when we were done, leading to a smooth rollout. Here's the list of who did what:

Brandon:
- Put together data from API
- Exploratory Data Analysis
- kNN implementation
- Heroku implementation and creation

Ben:
- Statistic Parsing
- Discretization and Normalization
- Decision Tree implementation
- Random Forest implementation
  
The challenges we ran into were technicalities, such as some bugs in our Heroku setup, struggles to normalize the data (given my (Ben's) faulty myutils functions up to that point), and git merges, though we were able to get through them fairly steadily.

To improve the performance, we think adding pruning to the decision trees and therefore the random forests would help a ton. Additionally, we could expand the scope of our data beyond the current season. This would have been fairly easy, though tedious in setup and in computational time.

If you have any questions, feel free to shoot either of us a message through your preferred service. Have a great summer!