# BYU CS 280 Tool Lab 2: Machine Learning Tools

In [9]:
# Dependencies for the lab
import sklearn as sk
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ARDRegression
from sklearn import svm
import xgboost as xgb
import pandas as pd

## Introduction:

### Scikit-learn
Scikit-learn is a very popular library for machine learning in Python. You can think of it as an add-on to scipy/numpy with a very large number of implementations of common machine learning algorithms.

In general, the scikit-learn API can help you accomplish the following tasks:
* Preprocessing
* Dimensionality Reduction
* Clustering
* Classification
* Regression

### XGBoost
XGBoost stands for Extreme Gradient Boosting. It is a scalable, distributed gradient-boosted decision tree library that implements machine learning algorithms under the gradient boosting framework, and it is widely used for regression, classification, and ranking problems. It provides parallel tree boosting, so you can train your models using multiple processors and GPUs.
 

# Question 1: Preprocessing
Very rarely will you be given perfect data, and this lab is no exception to the rule. In this section you will preprocess and prepare the datasets for testing various algorithms.

You will need to implement a train/test split function that takes a percentage and returns four values: X_Train, X_Test, Y_Train, and Y_Test. The test set is x% of the original dataset, and the train set is the remaining (100 - x)%. For example, if the dataset has 1000 individual data points and we split it at 20%, the train set will have 800 data points and the test set will have 200 data points.

In [None]:
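def train_test_split(percent, X, y):
  # percent is the share of rows (0-100) that goes to the test set
  x_train = None  # TODO: training rows of X
  x_test = None  # TODO: test rows of X
  y_train = None  # TODO: training rows of y
  y_test = None  # TODO: test rows of y
  return ((x_train, x_test), (y_train, y_test))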
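
As a rough illustration of the split described above, here is one possible sketch. It assumes `X` is a pandas DataFrame and `y` is a pandas Series, and it shuffles the rows before splitting. The name `split_sketch`, the `seed` parameter, and the toy data are hypothetical, not part of the lab.

```python
import numpy as np
import pandas as pd

def split_sketch(percent, X, y, seed=0):
    """Shuffle the rows, then hold out `percent` percent of them as the test set."""
    n_rows = len(X)
    n_test = int(round(n_rows * percent / 100))
    # Shuffle row positions so the test set is a random sample
    order = np.random.default_rng(seed).permutation(n_rows)
    test_idx, train_idx = order[:n_test], order[n_test:]
    x_train, x_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    return ((x_train, x_test), (y_train, y_test))

# Toy data (hypothetical): 1000 rows split at 20%
toy_X = pd.DataFrame({"feature": range(1000)})
toy_y = pd.Series(range(1000), name="target")
(x_train, x_test), (y_train, y_test) = split_sketch(20, toy_X, toy_y)
print(len(x_train), len(x_test))
```

Using the same shuffled positions for `X` and `y` keeps each row paired with its own label.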

We will be using two different datasets in this lab, because we will be training both classifiers and regression algorithms. You'll be able to download them by running the following cell.

In [None]:
!wget https://raw.githubusercontent.com/michael-holland-dev/CS280/main/tool_labs/tool_lab_2/titanic_dataset.csv
!wget https://raw.githubusercontent.com/michael-holland-dev/CS280/main/tool_labs/tool_lab_2/videogamesales_dataset.csv

In [4]:
titanic_df = pd.read_csv("titanic_dataset.csv")
video_game_df = pd.read_csv("videogamesales_dataset.csv")

The video game dataframe has quite a few categorical variables, so you should use sklearn's [one hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) class. This lets you represent the categories numerically, which makes it easier for the regression algorithms to fit the data than it would be with raw text.

Using your function above, split the two datasets with a 75/25 split. (75% train, 25% test) 

# Question 2: Classification
XGBoost and scikit-learn both have classification algorithms. In this section of the lab, you will train these algorithms on the Titanic dataset and attempt to classify whether or not a given person survives based on their stats. You will use the XGBoost classifiers as well as a set of scikit-learn classifiers.

You can read the documentation for integrating xgboost with sklearn [here](https://xgboost.readthedocs.io/en/stable/python/examples/sklearn_parallel.html#sphx-glr-python-examples-sklearn-parallel-py)

In [None]:
#Initialize the XGBClassifier
xgboost_classifier = xgb.XGBClassifier()

#Fit the data using the XGBClassifier.fit(X,y) function

#Predict the classification using the XGBClassifier.predict(X_Test) function

#Calculate the percent correct by using the accuracy_score function

In [None]:
#Initialize the XGBRF classifier
xgboost_random_forest_classifier = xgb.XGBRFClassifier()

#Fit the data using the same .fit(X,y) function

#Predict the classification using the same .predict(X_Test) function 

#Calculate the percent correct by using the accuracy_score function

Support Vector Machine Classification [Documentation](https://scikit-learn.org/stable/modules/svm.html#classification)

In [10]:
#Initialize the SKLearn SVM classifier
supportVectorMachine = svm.SVC()

#Fit the data using the same .fit(X,y) function

#Predict the classification using the same .predict(X_Test) function 

#Calculate the percent correct by using the accuracy_score function
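
The fit / predict / accuracy steps in these cells can be sketched as follows. This assumes synthetic data from `make_classification` as a stand-in for the preprocessed Titanic features, and the variable names are illustrative only.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (a stand-in, not the Titanic dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

clf = svm.SVC()
clf.fit(X_train, y_train)            # learn from the training split
preds = clf.predict(X_test)          # predict labels for unseen rows
acc = accuracy_score(y_test, preds)  # fraction of correct predictions
print(f"accuracy: {acc:.3f}")
```

Note that `accuracy_score` returns a fraction between 0 and 1, so multiply by 100 if you want a percentage.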

KNeighbors Classification [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
* For this algorithm, do some hyperparameter tuning based on documentation above.
* We'd recommend varying the number of neighbors, and see if that increases the performance of this algorithm

In [None]:
#Initialize the SKLearn KNeighbors classifier
n = None  # TODO: set the number of neighbors
kneighbors = KNeighborsClassifier(n_neighbors=n)

#Fit the data using the same .fit(X,y) function

#Predict the classification using the same .predict(X_Test) function 

#Calculate the percent correct by using the accuracy_score function
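
One way to vary the number of neighbors is a simple loop, sketched here on synthetic data. The candidate values are arbitrary assumptions. In real work you would usually pick hyperparameters on a separate validation split rather than the test set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the Titanic features
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = X[:225], X[225:], y[:225], y[225:]

# Train one model per candidate n_neighbors and record its accuracy
scores = {}
for n in (1, 3, 5, 9, 15):
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(X_train, y_train)
    scores[n] = accuracy_score(y_test, model.predict(X_test))

best_n = max(scores, key=scores.get)
print(scores, "best:", best_n)
```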

Multi Layer Perceptron Classifier [Documentation](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#multi-layer-perceptron)
* For this algorithm, do some hyperparameter tuning based on documentation above.
* Vary `hidden_layer_sizes`, `random_state`, and `alpha` to see if that increases the performance of this algorithm.

In [None]:
custom_solver = None  # TODO: enter solver
custom_alpha = None  # TODO: enter alpha
custom_hidden_layer_sizes = None  # TODO: enter hidden_layer_sizes
custom_random_state = None  # TODO: enter random_state
multi_layer_perceptron_classifier = MLPClassifier(solver=custom_solver, alpha=custom_alpha, hidden_layer_sizes=custom_hidden_layer_sizes, random_state=custom_random_state)

#Fit the data using the same .fit(X,y) function

#Predict the classification using the same .predict(X_Test) function 

#Calculate the percent correct by using the accuracy_score function
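
If you would rather not tune by hand, scikit-learn's `GridSearchCV` is one option. The sketch below is an assumption about how you might set it up, not a required part of the lab: it cross-validates a small, arbitrary grid on synthetic training data and then scores the best model on the held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for the Titanic features
X, y = make_classification(n_samples=300, n_features=6, random_state=2)
X_train, X_test, y_train, y_test = X[:225], X[225:], y[:225], y[225:]

# Small illustrative grid; the values are arbitrary starting points
param_grid = {
    "hidden_layer_sizes": [(5,), (10, 5)],
    "alpha": [1e-5, 1e-2],
}
base = MLPClassifier(solver="lbfgs", random_state=1, max_iter=500)

# 3-fold cross-validation on the training split only
search = GridSearchCV(base, param_grid, cv=3)
search.fit(X_train, y_train)

test_acc = search.score(X_test, y_test)  # accuracy of the refit best model
print(search.best_params_, f"test accuracy: {test_acc:.3f}")
```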

Using the given empirical results, explain below which machine learning algorithm would be best suited for predicting whether a person would survive the Titanic.

(Enter your results here)

# Question 3: Regression
In addition to classification, XGBoost and scikit-learn also have regression algorithms. Regression algorithms are used to predict a continuous value, such as a temperature or the like count for a YouTube video. In this section of the lab, you will train these algorithms on the video game sales dataset to predict the sales of a given title. You may have to feature engineer the data using the skills you gained in CS 180 and 280 to predict the sales effectively. You will compare a few different sklearn regression algorithms in this problem, as well as the XGBoost regression algorithms.

In [None]:
#Initialize the XGBRegressor
xgboost_regressor = xgb.XGBRegressor()

#Fit the data using the .fit(X,y) function

#Predict the regression using the .predict(X_Test) function

#Calculate the R^2 score by using the .score(X,y) function

In [None]:
#Initialize the XGBRFRegressor
xgboost_rf_regressor = xgb.XGBRFRegressor()

#Fit the data using the .fit(X,y) function

#Predict the regression using the .predict(X_Test) function

#Calculate the R^2 score by using the .score(X,y) function

Isotonic Regression [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html#sklearn.isotonic.IsotonicRegression)

In [None]:
#Initialize the Isotonic Regressor
isotonic_regression = IsotonicRegression()

#Fit the data using the .fit(X,y) function

#Predict the regression using the .predict(X_Test) function

#Calculate the R^2 score by using the .score(X,y) function
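
Keep in mind that `IsotonicRegression` fits a monotonic function of a single feature, so it expects one input column rather than a full feature matrix. The sketch below uses synthetic one-dimensional data as an assumption. `out_of_bounds="clip"` is set so that test points outside the training range do not produce NaN predictions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic 1-D data: y increases with x, plus noise
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=100))
y = 2.0 * x + rng.normal(0, 1.0, size=100)

# Alternate rows between train and test
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(x_train, y_train)
preds = iso.predict(x_test)
r2 = iso.score(x_test, y_test)  # R^2 on the held-out points
print(f"R^2: {r2:.3f}")
```

For the video game data, this means choosing one numeric column as the input if you want to use this model.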

Linear Regression [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [None]:
#Initialize the Linear Regressor
linear_regression = LinearRegression()

#Fit the data using the .fit(X,y) function

#Predict the regression using the .predict(X_Test) function

#Calculate the R^2 score by using the .score(X,y) function
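
For regressors, `.score(X, y)` returns the coefficient of determination (R^2) rather than a percent correct. The sketch below checks this against `r2_score` on synthetic data; the coefficients and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic linear data with a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

reg = LinearRegression()
reg.fit(X_train, y_train)
preds = reg.predict(X_test)

score = reg.score(X_test, y_test)  # R^2 computed by the model
manual = r2_score(y_test, preds)   # same quantity from the predictions
print(f"score: {score:.4f}, r2_score: {manual:.4f}")
```

An R^2 of 1.0 means a perfect fit, and it can be negative when the model does worse than always predicting the mean.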

ARD Bayesian Regressor [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression)
* With this algorithm, do some hyperparameter tuning or a hyperparameter sweep to see which settings give the best approximation.
* There are too many parameters for this algorithm to list in this lab, so we recommend reviewing the documentation above to see which hyperparameters you can tune.

In [None]:
#Initialize the ARD Regressor
ard_regression = ARDRegression()

#Fit the data using the .fit(X,y) function

#Predict the regression using the .predict(X_Test) function

#Calculate the R^2 score by using the .score(X,y) function
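
A small hyperparameter sweep could look like the sketch below. It varies only `alpha_1` and `lambda_1` over arbitrary values on synthetic data, which is an assumption for illustration. The documentation lists the other parameters you can tune.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Synthetic data where two of the four features are irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.2, size=120)
X_train, X_test, y_train, y_test = X[:90], X[90:], y[:90], y[90:]

# Record the R^2 for each combination of the two hyperparameters
sweep = {}
for alpha_1 in (1e-6, 1e-3):
    for lambda_1 in (1e-6, 1e-3):
        model = ARDRegression(alpha_1=alpha_1, lambda_1=lambda_1)
        model.fit(X_train, y_train)
        sweep[(alpha_1, lambda_1)] = model.score(X_test, y_test)

best = max(sweep, key=sweep.get)
print(sweep, "best:", best)
```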

Using the given empirical results, explain below which machine learning algorithm would be best suited for predicting video game sales.

(Enter your answer here)

# Question 4: Write Up
Do some additional research on how these algorithms work. We don't go very deep into the machine learning algorithms in this class because there is going to be a new class, CS 270, which is an Introduction to Machine Learning, as well as CS 472, which is Machine Learning. Write 1-2 paragraphs explaining what you learned in this lab, as well as something you learned from your additional research.

(Enter your answer here)