# Machine Learning Basics

## Today we're going to discuss some ML concepts, using some FIFA data as a bit of a playground. You have a lot more in your toolkit at this point than just Machine Learning, so I will try to bring in some concepts from beyond the model world and the standard model > fit > predict paradigm that we often think of as the most important part of the job. At this point, many of you have realized that there are many other important pieces, and you've been spending time learning them. A lot of what matters in this business happens before the model, so always keep in mind that being able to bring the data to the table easily and seamlessly will make you a more valuable asset to your company. Today we'll try to discuss the following (although this will likely be split over 2 sessions)...

## Preparing data
- Nearly every tool you've learned thus far in the course goes into this. Unless you have a personal team of data engineers working for you, getting the horse to the trough is a daily part of the job, so we'll review these a bit.
- Standard Scaler
- One-Hot Encoding
- Why?

## Other Important notes
- Putting bias aside when building models (in your inputs, parameter choices, otherwise)
- Productized ML (EX: BigQuery custom sql)
- Commoditization of models that can perform consistently well (EX: automl)

## Some Model Building Concepts
- Supervised vs. Unsupervised learning
- Model -> Fit -> Predict Workflow
- Parameter vs HyperParameter Tuning
- Linear/Logistic Regression

## Next Steps
- KNN
- SVM
- Decision Trees (lone, aggregated, random forest, boosted trees)
- Neural Networks
- Grid Search
- automl

In [None]:
import tensorflow as tf
import keras
import pandas as pd
import sklearn as sk
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
df = pd.read_csv('fifa_player_data.csv')

display(df.head(5))

### Cleaning the data

#### We will not be focusing too much on data cleaning, and I'll only go over a few quick things here, but this can involve many, many steps. Keep in mind that if you end up in a role that is constantly building new models based on output from others, you may be taking several datasets and feeding them into multiple models (possibly even in multiple frameworks or in a distributed environment). Steps could involve
- Aggregating and disaggregating data
- Datetime handling (are timestamps correct? Different time zones?)
- Possibly manual tagging of data... If you are getting unprocessed data with no tags, it may fall on you to facilitate the tagging of that data (either you do it yourself, or you need to get the datastream sent to a service for tagging)
- Changing currency, applying usable labels
- Getting rid of bad data (bad sensors or corrupt data)
- Splitting Strings
- etc. etc. etc.

### Here, I will just be splitting off the height in centimeters from the height column (this may be useful to us later). I am also going to drop some of the more confusing columns to make the data slightly narrower, just so we can see a larger proportion of it for now. (Obviously, datasets that are fed into ML models can be several hundred or several thousand features wide, but let's start small.)

In [3]:
df.pop('weakfoot')
df.pop('work')
df.pop('version')
df.set_index('player_num', inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2760 entries, 0 to 2759
Data columns (total 47 columns):
name             2760 non-null object
rating           2760 non-null int64
position         2760 non-null object
price            2760 non-null int64
skills           2760 non-null int64
PACE             2760 non-null int64
SHOT             2760 non-null int64
PASS             2760 non-null int64
DRIBBLE          2760 non-null int64
DEFENSE          2760 non-null int64
PHYSICAL         2760 non-null int64
height           2760 non-null object
overall          2760 non-null int64
IGS              2760 non-null int64
acceleration     2760 non-null int64
aggression       2760 non-null int64
agility          2760 non-null int64
balance          2760 non-null int64
control          2760 non-null int64
crossing         2760 non-null int64
curve            2760 non-null int64
dribbling        2760 non-null int64
heading          2760 non-null int64
interceptions    2760 non-null int64
ju

In [4]:
display(df.height.head())
display(print('============'))
display(df.height[0])

player_num
0    183cm | 6'0"
1    173cm | 5'8"
2    191cm | 6'3"
3    187cm | 6'2"
4    173cm | 5'8"
Name: height, dtype: object



None

'183cm | 6\'0"'

### Quickly show how I would possibly clean/edit one of the columns
- This is just a very basic example. As I mentioned above, the parts before we get to the final cleaning for the model are extremely important. There aren't many real issues with this data, as it is pretty clean on the way in, so I just wanted to pull this out as a quick example.

In [5]:
def cm_getter(height):
    # Keep only the text before 'cm' (e.g. "183cm | ..." becomes "183") and convert it to a float
    return float(str(height).split('cm')[0])

df.height = df.height.apply(cm_getter)

df.height.head()

player_num
0    183.0
1    173.0
2    191.0
3    187.0
4    173.0
Name: height, dtype: float64
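As an alternative to `apply`, the same cleanup could be sketched with pandas' vectorized string methods. This is only an illustration on a toy Series (the names `heights` and `height_cm` are made up here, not part of the notebook), and it assumes every value follows the same "183cm | ..." pattern shown in the output above.

```python
import pandas as pd

# Toy stand-in for the height column, using the same string format as the real data
heights = pd.Series(['183cm | 6\'0"', '173cm | 5\'8"', '191cm | 6\'3"'])

# Capture the digits that sit directly before 'cm', then cast them to float
height_cm = heights.str.extract(r'(\d+)\s*cm', expand=False).astype(float)

print(height_cm.tolist())
```

One difference worth knowing: values that don't match the pattern come back as NaN instead of raising an error, which may or may not be what you want when hunting for bad data.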

### Applying one-hot encoding to our data for categorical data
- Luckily these days, a lot of things that may feel pretty complex can be handled with a quick bit of boilerplate code.
- The process here is that all numerical data stays the same, while all categorical data will add width to the data.
- To facilitate this, we will have to drop the Primary Key that I created in this dataset, because encoding it adds a ton of width (every unique value in a categorical column becomes its own column), and I will also drop the name column. Devang has often talked about changing the data in ways that'd make it nearly unreadable for us, but for a model to be able to crunch it, some things need to change... (NOTE: Not all models need data to be prepared this way)
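To make the "numerical stays the same, categorical adds width" idea concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is invented for illustration and is not the FIFA data).

```python
import pandas as pd

# Made-up data: one numerical column and one categorical column
toy = pd.DataFrame({
    'rating': [96, 90, 75],
    'position': ['ST', 'CB', 'ST'],
})

# dtype=int asks for 0/1 columns; newer pandas versions default to True/False
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The `rating` column passes through untouched, while `position` is replaced by one column per unique value. That is also why a column with thousands of unique values (like a key or a name) gets dropped first.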

In [6]:
df.pop('PKey')
df.pop('name')

dummies = pd.get_dummies(df)

display(dummies)

| player_num | rating | price | skills | PACE | SHOT | PASS | DRIBBLE | DEFENSE | PHYSICAL | height | ... | league_SAF | league_Saudi Professional League | league_Scottish Premiership | league_Serie A TIM | league_South African FL | league_Superliga | league_Süper Lig | league_Ukrayina Liha | league_Österreichische Fußball-Bundesliga | league_Česká Liga |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 96 | 6900000 | 5 | 97 | 95 | 81 | 95 | 45 | 76 | 183.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 98 | 4800000 | 5 | 95 | 96 | 93 | 96 | 60 | 76 | 173.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 90 | 3500000 | 4 | 88 | 88 | 88 | 87 | 80 | 87 | 191.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 94 | 3280000 | 5 | 90 | 93 | 81 | 90 | 35 | 79 | 187.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 95 | 3100000 | 5 | 96 | 93 | 90 | 95 | 56 | 75 | 173.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2755 | 76 | 400 | 4 | 61 | 74 | 75 | 75 | 65 | 72 | 188.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2756 | 75 | 400 | 4 | 70 | 71 | 73 | 77 | 61 | 63 | 171.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2757 | 75 | 400 | 3 | 81 | 68 | 71 | 80 | 27 | 51 | 176.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2758 | 75 | 400 | 4 | 82 | 72 | 68 | 80 | 26 | 66 | 181.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Splitting up data

- To train supervised models, we will need to split the data up into either 2 or 3 buckets: the training set, the test set, and the validation set.
- Training set - This is almost always going to be the largest slice of your data. As the name implies, you will be using this data to train your model. So theoretically, for all of the hours (or minutes) that you spend building your model and tuning all of the parameters, it should only be looking at the training set.
- Test set - When you have finished your work (or want to check how you're doing, though this can be controversial), you can apply your model to your test set to see how you've done. If you've underfit or overfit your training data, this step can help you see that.
- Validation set - If you ever plan to use your test set for periodic testing, you will definitely want to keep a validation set aside. This will typically be the smallest slice of the data, and you will use it to test your final product. Or, in the case of a competition, this data will be withheld from you and used to test your model by whoever is organizing the competition.
- A note on naming: many references use these two names the other way around, calling the slice used for periodic checks during tuning the "validation set" and the final, untouched hold-out the "test set". Whichever names your team uses, the idea is the same: keep one slice that never influences your modeling decisions until the very end.
- Cross-validation will also be a tool in our kit that we may use.
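A three-bucket split can be sketched by calling `train_test_split` twice. This is a minimal illustration on made-up arrays; the variable names (`X_holdout`, `X_check`, and so on) and the 60/20/20 proportions are assumptions for the example, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Step 1: carve off 20% as the final hold-out that we only touch at the very end
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: split the remaining 80% into training data and a slice for periodic checks
# (0.25 of the remaining 80% is 20% of the original data)
X_train_demo, X_check, y_train_demo, y_check = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_check), len(X_holdout))  # 60 20 20
```

Fixing `random_state` makes the split repeatable, which helps when you want to compare models on exactly the same rows.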


### For our scope here, we are just going to split into a training and a test set (a 70/30 split)
- Best practices or preferred training vs. test split sizes may change by team or organization, but leaving about this amount aside seems to be pretty standard. The sklearn default is a 75/25 split.
- Leaving aside a validation set (or having some new data at hand) can be helpful. Size is helpful too, but don't allow yourself to get bogged down by the size of the data you're building the initial model with. Start small and scale up as you can, as you get comfortable, or as needs increase. If there is a small ad-hoc project or piece of analysis you want to do that can be helped by applying machine learning, then feel free to start building on your laptop. But if the final implementation will be in a big data environment, you may want to get yourself a very large dataset (and possibly consider how it would be fed new data in production as well).

In [7]:
y = dummies.rating.values.reshape(-1, 1)
X = dummies.drop('rating', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, train_size = .7)

In [8]:
display(len(X_train))
display(len(y_train))
display(len(X_test))
display(len(y_test))

1931

1931

828

828

### Using Standard Scaler to scale your data.
- Another important element of preparing data for a model is scaling it. There are a few ways to scale data, but the built-in sklearn StandardScaler is one of the easiest and, again, is quick to apply.
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- Standardizing, or scaling, your data is an extremely important step for most models. Sometimes it will speed up your training, other times it is incredibly important for the results of the model themselves, and there are some models that don't really require any of these steps.
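To show what the scaler is doing under the hood, here is a small sketch on made-up heights: it subtracts the training mean and divides by the training standard deviation. The arrays and names are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (heights in cm)
train = np.array([[170.0], [180.0], [190.0]])
test = np.array([[185.0]])

# The scaler learns the mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)
scaled = scaler.transform(test)

# Same calculation by hand; np.std defaults to the population standard deviation,
# which is what StandardScaler uses
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled, manual)
```

Note that the test row is scaled with the training statistics, never its own. Fitting the scaler on the test data would let information about the test set leak into the preparation step.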

In [9]:
X_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

In [10]:
X_test_scaled

array([[-2.71468695e-01,  1.05881621e+00, -1.18338326e+00, ...,
        -1.02302125e-01, -3.94463762e-02,  0.00000000e+00],
       [-2.70842487e-01,  1.09551600e-03,  4.27523106e-01, ...,
        -1.02302125e-01, -3.94463762e-02,  0.00000000e+00],
       [-2.23000143e-01, -1.05662518e+00,  3.42738561e-01, ...,
        -1.02302125e-01, -3.94463762e-02,  0.00000000e+00],
       ...,
       [-2.72094904e-01,  1.09551600e-03, -5.89891438e-01, ...,
        -1.02302125e-01, -3.94463762e-02,  0.00000000e+00],
       [-2.71969662e-01,  1.09551600e-03,  8.83849248e-02, ...,
        -1.02302125e-01, -3.94463762e-02,  0.00000000e+00],
       [ 1.70258909e-01,  1.05881621e+00,  1.44493765e+00, ...,
        -1.02302125e-01, -3.94463762e-02,  0.00000000e+00]])

In [14]:
y_test[:5]

array([[75],
       [82],
       [79],
       [85],
       [89]])

In [15]:
y_test_scaled[:5]

array([[-1.08003737],
       [ 0.44847955],
       [-0.20659913],
       [ 1.10355823],
       [ 1.97699647]])

## Some more things to think about before we get to fitting models.

### One very important thing to consider when preparing data for machine learning models, as well as when you are training and testing them, is that you should leave any kind of bias on the shelf. Whether it is strictly a bias towards the columns you choose to train on (EX: "I'm SURE that square footage is the most important component in setting the price of a house or rental") or something else that reflects your views of the world, you will be introducing unnecessary bias into your models.

- Bias can be introduced in so many ways in machine learning, so we should do our best to introduce as little of it as possible when we are working with models.
- You can introduce bias with...
1. The data you bring in
2. The balance of your classes (EX: Thinking about red vs. blue marbles)
3. The columns you choose to train your model on
4. If you have data leakage (and many, many more)

### Something else to think about is figuring out which parameters (input features) matter. One fairly easy way of doing this is to build a random forest model and take a look at the feature importances, or you can build several models with different slices of the dataset available to them to understand which ones account for the most information gain. Starting with a wide dataset is great, but if there are too many dimensions that aren't helpful in getting our models to regress or classify correctly, we begin to play in a rapidly increasing space...

- The Curse of Dimensionality https://en.wikipedia.org/wiki/Curse_of_dimensionality
- Curse of Dimensionality + Over/Under Fitting Data https://builtin.com/data-science/curse-dimensionality
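The random forest approach mentioned above could look something like the following. This is a sketch on synthetic data where only the first column actually drives the target; the column names and sizes are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends only on the first of three columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# feature_importances_ has one value per input column, and the values sum to 1
for name, importance in zip(['useful', 'noise_1', 'noise_2'], forest.feature_importances_):
    print(f'{name}: {importance:.3f}')
```

Treat the importances as a rough guide rather than a verdict. They can be misleading when columns are strongly correlated with one another.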

### There are parts of this space that have been abstracted into data products, and with each passing day it seems like there is a more accessible way to work with ML models. On one hand that is great, because it makes them more approachable. On the other, if you allow yourself to think of too much of your model as a black box that you have no insight into, it will become increasingly hard to explain to a manager, CEO, layman, etc., and it can be considered more risky the more it is shrouded behind abstraction.

- One of the things that I have been taught time and time again is that getting off the ground with a "simple" model that you can explain to just about anybody, that took seconds to train, and that has no massive amount of black-box components is an incredibly valuable thing. Being able to quickly communicate to your team which parameters you expect may be going wrong or may have drifted, and get everybody up to speed, could become very important in certain situations, so definitely keep that in mind. Don't focus on making the best model no matter the cost. Weigh the cost of having a model that just makes you say "I dunno" if somebody asks you about it, and think about what that can mean when something needs to be troubleshot or ownership needs to be transferred to another member of your team.

#### Google BigQuery integration of linear regression (and other models)

- With more and more data products becoming available to more people daily, it is important to keep an eye on what the latest and greatest offerings are from the top tech companies. A fairly recent example of this was when Google started to include linear regression as a tool in BigQuery's SQL. This means that you can now apply a tried and true model to your data as it sits on a Google server, with all of the computations done in the cloud. I want to say they are still fairly limited in the models on offer, but moving the data out of a SQL environment into a Python (or other) environment to work on it is currently a huge part of the job. Keep these data products in mind when you are working and doing research.
- https://cloud.google.com/bigquery-ml/docs/bigqueryml-natality

#### H2O AutoML
- The task of building more complex models is also getting abstracted into a more occluded space every day. We are handing the task of building and training models (even very complex ones) back to the machines and allowing them to pick out the parameters that work best. A product like h2o automl is free to use and is like a "grid search of a grid search" in terms of complexity. It will allow you to test several different types of models against one another, all with varying hyperparameters, and it can be set up in a distributed environment these days...
- Think about what this means: you can arrive with your final dataset that is cut up, cleaned, etc. and go straight to a library or service like h2o automl, Google Cloud AutoML, AWS, or Azure and run N models on X machines. You can really see how the complexity can stack up in a properly built-out ML environment.
- https://www.youtube.com/watch?v=42Oo8TOl85I&t=453s
- https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

#### It really is no joke that this space has become more and more approachable for just about anybody over the last few years. 15-20 years ago, these models mostly had to be hand-built because the libraries that now exist weren't around. Who knows what will be available to us in another 5-10 years? Staying up to date on the different options in this space is why this is the type of job where you will learn every day of your life. You have to have good fundamentals in the underlying models and technology so that you can keep up with the deluge of new things that people are building every day. I had an instructor in a prior bootcamp, hands down one of the smartest people I've ever met, who would always laugh that he was forever destined to be "current up to about 2 years ago" because things were changing so quickly.


### Supervised vs. Unsupervised learning

#### Supervised learning occurs when we have tagged data so that the model understands the value that it is trying to regress or classify to. For example, in a regression model, we know that we will always be thinking about one specific number when analyzing the other ones. We want to know how changes in different features cause a change in this one variable. Because we have the number in hand, we will use it to help ourselves build a better regression or ML model.
- In a classification problem, this data is typically tagged in some way (EX: Photos that were tagged by humans with the tag "cat", "dog", "fish", "car", etc.)
- In cases where we are regressing, this will usually be a number of some importance for our business (EX: If a player is taller, heavier, and has been playing in the Center Back position, we would rate her a better defender)

#### Unsupervised learning - The most common intro concept in unsupervised learning is clustering, where you hand in a dataset without a specific target or column to end on. You are sort of asking the computer to group the data in some clever way that you may try to understand later, or you may even use a model to tag data for you (if you trust it enough).
- https://machinelearningmastery.com/clustering-algorithms-with-python/
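As a small illustration of clustering, here is a sketch using k-means on six made-up points that form two obvious groups. No target column is passed in; the model only sees the coordinates. The data and the choice of two clusters are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up points: three near (1, 1) and three near (8, 8)
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# We choose the number of clusters; the model decides which point goes where
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print(labels)
```

The cluster numbers themselves (0 or 1) are arbitrary. What matters is which points end up sharing a label, and it is still on you to work out what each group means.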

### The MODEL->FIT->PREDICT Workflow - This refers to the current paradigm for setting up models, and it seems to be the way that you work with models in most languages, packages, etc. It is obviously not the only way of working with models, but overall it seems to be the one that we are employing in a whole mess of different applications these days.

#### MODEL - This refers to building a model (we will be doing this a few times from within SKLearn). You can set up the model with some hyperparameters (these are specifically ones that are related to the model and not to the underlying data), name it, and then it is an object in memory.

#### FIT - This refers to the step of actually fitting your model to the data that you have in hand, what this actually results in is different for different ML models, but this is likely the process that will take the most time to crunch. This is when you can get up and stretch or grab a glass of water because you have set up the model hyperparameters, passed it the data, and let it do its thing.

#### PREDICT - At this point, you have a trained model based on all of the training data that you have sent in, now you can use the trained model with all of the different coefficients and components trained and see how it does on your testing data.

- You can think of the fit and predict portions of this workflow like training someone to pick out everybody wearing an orange shirt when they walk in through a doorway by sitting down with them and looking at pictures of people in different color shirts and picking out the orange ones (FIT). Then the following day you actually stand them in the doorway of your business and any time they see somebody wearing an orange shirt they hand them a coupon (PREDICT). Then if they didn't hand coupons to people wearing orange shirts, but instead handed them all to folks wearing green ones, you would have to go back to the drawing board and re-train this person (or in the case of a model that works in a similar way, you'd have to do the same thing)
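The three steps can be sketched end to end with a linear regression on synthetic data. The data here is made up (a noisy line with slope 2 and intercept 1), and the variable names are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 2 * x + 1, plus some noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] + 1.0 + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = LinearRegression()           # MODEL: create the object (hyperparameters would go here)
model.fit(X_tr, y_tr)                # FIT: learn the coefficients from the training data
predictions = model.predict(X_te)    # PREDICT: apply the trained model to unseen data

print(model.coef_, model.intercept_)
print(model.score(X_te, y_te))       # R^2 on the held-out data
```

Most scikit-learn estimators follow this same shape, so swapping in a different model is often a one-line change.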

### Parameters vs. Hyperparameters

#### Parameters are the space which your model can learn from, so this is going to be the actual data that is getting pumped into the model. (Strictly speaking, these input columns are usually called features, and "parameters" is reserved for the values a model learns during fitting, such as the coefficients of a linear regression.) You will have to consider how these inputs look: there may be a class imbalance to them, they may be full of dirty data, or they could be completely irrelevant to the model you build.

- EX IRRELEVANT: Although we could likely agree that an outsized portion of fast cars are yellow, we can also agree that a layer of paint has nothing to do with the horsepower, weight, acceleration, and top speed of a car. So if we are trying to figure out how fast we would expect a car to go, or how many G's it can pull around a corner, we may drop the column with the colors of the car because that data isn't really going to tell us much.
- EX CLASS IMBALANCE: If we want to understand which messages are spam, but we are expecting that 99.99% of messages will not be spam, we can have an algorithm set up in a way that basically wouldn't ever get penalized if it always just chose to let email through and marked it "NOT SPAM"
- EX CLASS IMBALANCE 2: Okay, so now we are looking at diagnosing a patient with a rare but deadly disease, we may also see a class imbalance in our training data, but it is important to note that if we make a mistake here, that it will be costly, so if the algorithm always guessed "DIAGNOSIS NEGATIVE" it would be a much bigger problem than the previous example. How can we adjust around that? We'll cover that later.
- EX DATA LEAKAGE: https://machinelearningmastery.com/data-leakage-machine-learning/
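The class imbalance problem is easy to see with a "model" that always predicts the majority class. This sketch uses made-up labels (10 positives out of 1,000) and scikit-learn's `DummyClassifier`; the numbers are chosen only to make the point.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 10 positive cases (spam, or a rare disease) out of 1,000
y_true = np.array([1] * 10 + [0] * 990)
X_dummy = np.zeros((1000, 1))  # the features are ignored by this baseline

# A baseline that always predicts the most common class
always_majority = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_true)
y_pred = always_majority.predict(X_dummy)

print(accuracy_score(y_true, y_pred))  # 0.99
print(recall_score(y_true, y_pred))    # 0.0
```

Accuracy looks great at 99%, yet recall is zero: not a single positive case was caught. That is why accuracy alone is a poor yardstick on imbalanced data, especially when a miss is costly.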

#### Hyperparameters are specific to a model, so in Random Forests (we'll also be covering this next week in a little bit more depth), we can say that the number of decision trees that will make up the forest is a good example of a hyperparameter, or the maximum depth of each tree is also a hyperparameter.

- These have nothing to do with the underlying data itself and are a function of the model, so each different machine learning model that you work with can have different hyperparameters. Trees will have similar ones, different regressors may have similarities, etc., but you will always want to think of a hyperparameter as existing outside of the space of your data. That is the most important distinction. These are the things that you use to actually tune the performance of the model. Hyperparameter tuning is a part of the job that may be getting abstracted into libraries and boilerplate code, but it still really helps to know what you're doing, and I recommend reading a bit about the most commonly tweaked hyperparameters of the models that you find yourself working with most often. They are fun to try to understand, and it is very rewarding to go from just chalking up what a hyperparameter does to "math" to actually understanding a little more about why it can affect your outcome.
- https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
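A small sketch of the distinction in scikit-learn terms: hyperparameters are set when the model object is created, while the values a model learns (for example, regression coefficients) only exist after fitting. The data is synthetic and the specific values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 1 * x0 - 2 * x1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1]

# Hyperparameters: chosen by us before the model has seen any data
forest = RandomForestRegressor(n_estimators=25, max_depth=3, random_state=0)
print(forest.get_params()['n_estimators'], forest.get_params()['max_depth'])

# Learned values: these only appear once the model has been fit to data
linear = LinearRegression().fit(X_demo, y_demo)
print(linear.coef_)
```

Changing `n_estimators` or `max_depth` changes how the forest is built. The coefficients, by contrast, are whatever the data says they should be.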


#### Grid Search

- I'm not going to spend a ton of time covering grid search today, but the basic idea is that it allows you to test several combinations of hyperparameters in your model. It is important to note that this is time- and processor-consuming, and that sometimes grid search will only give slightly better results than the out-of-the-box model, but when you are looking to tune those hyperparameters up very nicely, you can employ tools like grid search to get your model in tip-top shape. Keep in mind that a tool like h2o's AutoML that we talked about before is like an upgrade on top of this: instead of building and training one type of model with several different hyperparameters, you are allowing several different types of models to do this as well...
- https://scikit-learn.org/stable/modules/grid_search.html
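For a feel of what grid search looks like in practice, here is a minimal sketch with scikit-learn's `GridSearchCV` on synthetic data. The grid values are arbitrary picks for the example, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data: the target mostly depends on the first column
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=120)

# 2 x 2 = 4 hyperparameter combinations, each scored with 3-fold cross-validation
param_grid = {'n_estimators': [10, 30], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_['params']))  # 4
```

Even this tiny grid means 12 model fits (4 combinations times 3 folds) plus a final refit on all the data, which is why grids get expensive quickly as you add more values.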

### I want to cover some more about machine learning models and what is going on behind the scenes in them next week.