# ASSIGNMENT 4 - PubG Finish Placement Prediction 

In this assignment, we shall be applying supervised learning techniques on a recent dataset released by [Kaggle](https://www.kaggle.com/c/pubg-finish-placement-prediction). Kaggle is a great platform for data scientists where one can get good and large datasets. Additionally, they host a lot of competitions where you can pit your models against everyone else. At the end of this assignment, you will (hopefully) participate in this competition and report your positions. Good Luck! :D



### Problem Description

We will be solving [this problem posted](https://www.kaggle.com/c/pubg-finish-placement-prediction) on Kaggle (Swags up for grabs). The dataset contains 4,357,336 rows of data, where each row corresponds to all the game statistics for a player (25 features) and the corresponding placement score they got (score between 0 and 1). **To download the data, you will need to register on the site and accept their competition rules and conditions.**

We will be employing a simple Linear Regression model to predict the placement score. However, we shall **not resort to SciKitLearn** for regression. Before solving the regression, we will perform PCA. In the end, we will train our models for various *hyperparameters*, like the learning rate and PCA / No PCA, and find the best combination. **Please submit the Notebook with all the outputs you have got for evaluation**. Some parts of the code and the basic structure are provided to you to make your job easier.

To sum up, following are the tasks required to complete this assignment:

-  Normalize and clean the data                                                                      10%
-  Perform PCA on the data                                                                           20%
-  Perform Linear Regression on the training data, with the help of Gradient Descent Method          40%
-  Run the Linear Regression for different learning rates, as well as the PCA-projected data         20%
-  Test the models on the test data and report the **BEST** model and its performance                10%

For Bonus marks:

-  Copy-paste the relevant parts of your code into a Kaggle kernel and execute it (Bonus 10%)
-  Report your Kaggle ranking (via a screenshot). Name your team MEC_<team_no.> (Bonus 15%)

You need to fill out the cells which have "*Write your code here*" mentioned.

Regular PubG players might have a better understanding of how PubG scores the players and might want to try their hands at [feature engineering](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b). In Kaggle, there are two types of winners, ones who run a very powerful algorithm on the huge dataset and tune all the parameters to perfection, and the others who come up with clever features using their domain knowledge. Given that you are beginners, you cannot hope to be the former yet, but regular gamers can attempt to be the latter.

Lastly, we hope that you will use this opportunity to participate in this Kaggle competition (but only use Linear Regression model with or without feature engineering). You can get 10 - 25% bonus marks for doing so. If you do, please share your score and ranking (Via a screenshot of Kaggle Leaderboard).

**Note -** This is a huge dataset, and using all of it might crash your computer or, more probably, give you a Memory Error. There are ways of handling such huge datasets, such as batch-wise gradient descent, which the more zealous amongst you could attempt. The more modest could simply reduce the dataset by *randomly* sampling only a few (order of 10,000) datapoints.
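
For the more zealous, the batch-wise idea can be sketched roughly as follows. This is only an illustration on synthetic data, using the squared-error gradient derived in the Linear Regression section; the function name `minibatch_gradient_descent` and its default values for `batch_size` and `n_epochs` are assumptions, not part of the assignment.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.1, batch_size=32, n_epochs=50, seed=0):
    """Update the weights from small random batches instead of the full data."""
    rng = np.random.default_rng(seed)
    M, n = X.shape
    w = np.zeros(n)
    for _ in range(n_epochs):
        order = rng.permutation(M)                  # reshuffle rows every epoch
        for start in range(0, M, batch_size):
            batch = order[start:start + batch_size]
            residuals = X[batch] @ w - y[batch]     # y_hat - y on this batch only
            w -= alpha * X[batch].T @ residuals / len(batch)
    return w

# Synthetic data: one feature plus a column of ones for the bias.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=1000), np.ones(1000)])
y = 2.0 * X[:, 0] - 1.0                             # noiseless line, slope 2, intercept -1
w = minibatch_gradient_descent(X, y)
print(w)
```

Only one batch of rows is used per update, so the same loop could in principle be fed from chunks read off disk rather than from an array held fully in memory.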

## Data Exploration

First, we shall load the data into a pandas dataframe (pandas is an indispensable Python library for data scientists). This will help us manipulate the data as we desire, using the nice built-in pandas methods.

In [2]:
import pandas as pd

# pandas displaying options. Not necessary, but needed for Data Exploration
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [5]:
number_of_data_points = 10000                # Increase this to have a better model
train_df_total = pd.read_csv("train.csv")     # This might take a few seconds
print(train_df_total.shape)
train_df = train_df_total.sample(n=number_of_data_points)    # Sampling 10,000 rows at random
train_df.head()



(4357336, 26)


Unnamed: 0,Id,groupId,matchId,assists,boosts,damageDealt,DBNOs,headshotKills,heals,killPlace,killPoints,kills,killStreaks,longestKill,maxPlace,numGroups,revives,rideDistance,roadKills,swimDistance,teamKills,vehicleDestroys,walkDistance,weaponsAcquired,winPoints,winPlacePerc
3457992,4924656,1247815,38378,0,0,54.18,0,0,0,63,1233,0,0,0.0,48,48,0,366.7,0,0.0,0,0,606.2,4,1533,0.5106
1014130,1442282,223952,15050,0,1,146.5,0,0,1,52,1000,0,0,0.0,27,25,0,0.0,0,0.0,0,0,1902.0,5,1500,0.6923
1290622,1840185,2619509,8143,0,0,0.0,0,0,0,69,1000,0,0,0.0,22,22,1,0.0,0,0.0,0,0,136.5,3,1500,0.0952
4280657,6111218,273154,16170,0,0,73.85,0,0,1,32,1000,1,1,54.58,27,25,0,0.0,0,0.0,0,1,1195.0,4,1500,0.4615
3666578,5225885,929599,31178,0,2,0.0,0,0,0,42,1017,0,0,0.0,48,44,0,0.0,0,0.0,0,0,3556.0,6,1508,0.8298


Now, we need to prepare our training data. For that, we will need to inspect the different columns and judge their usefulness. A priori, we do not know if any column is going to help us predict the finishing place. However, we can eliminate at least 3 columns. For example, the "Id" column is pretty useless as it cannot possibly contribute to the finishing place. So, we need to drop that column. Also find other columns that could be dropped from the analysis. (Use the train_df.drop() method to drop the columns.)

In [None]:
# Write your code here
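
A minimal sketch of dropping columns, shown on a toy dataframe rather than the real data. Treating `Id`, `groupId` and `matchId` as the identifier columns to drop is an assumption made for illustration; inspect the real columns before deciding.

```python
import pandas as pd

# Toy stand-in for train_df; the real frame comes from train.csv.
toy_df = pd.DataFrame({
    "Id": [1, 2, 3],
    "groupId": [10, 10, 11],
    "matchId": [100, 100, 100],
    "kills": [0, 2, 1],
    "walkDistance": [606.2, 1902.0, 136.5],
})

# Assumed identifier columns: they label rows rather than describe play.
id_columns = ["Id", "groupId", "matchId"]
toy_features = toy_df.drop(columns=id_columns)   # returns a new frame

print(list(toy_features.columns))
```

`drop` returns a new dataframe by default, so the result has to be assigned back if the original name should refer to the reduced frame.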



We also need to remove winPlacePerc from the dataframe as it is the column we are trying to predict. Write code to save that column as the variable 'target' and drop it from train_df.

In [None]:
# Write your code here
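
One possible way to separate the target, sketched on a toy dataframe. The names `toy_df` and `toy_target` are placeholders for this illustration only.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "kills": [0, 2, 1],
    "walkDistance": [606.2, 1902.0, 136.5],
    "winPlacePerc": [0.5106, 0.6923, 0.0952],
})

# pop() removes the column from the frame in place and returns it as a Series.
toy_target = toy_df.pop("winPlacePerc")

print(list(toy_df.columns))
print(toy_target.tolist())
```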



### Normalizing (or scaling the data)

First, we will need to scale every column of the data so that it has mean zero and standard deviation one.

To do this, we apply the following transformation to every value of a column $j$:

$z_j = \frac{x_j-\mu_j}{\sigma_j}$

where $\mu_j$ is the mean of column $j$ and $\sigma_j$ its standard deviation.

To do so, you will need to find the mean and standard deviation of every column, and then perform the transformation.

Going forward, we shall be using the normalised data for both PCA and Linear Regression.

In [None]:
# Write your code here
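
A small sketch of the transformation on a toy numpy array. The helper name `standardize` and the handling of constant columns (centering them without dividing) are assumptions of this sketch, not requirements of the assignment.

```python
import numpy as np

def standardize(X):
    """Return column-wise z-scores of X together with the means and stds used."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Assumption: constant columns (sigma == 0) are only centered,
    # which avoids a division by zero.
    safe_sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / safe_sigma, mu, sigma

X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 30.0, 5.0]])
X_std, mu, sigma = standardize(X)
print(X_std)
```

Note that numpy's `std` divides by the number of rows by default, whereas the pandas `std` method divides by one less; on a large sample the difference is negligible.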


## PCA

Now, we are left with 22 columns, which is an improvement from 26 columns, but maybe many of these columns are redundant? Let us find out by doing principal component analysis on the remaining data.

To do this, we will first calculate the covariance matrix for 22 columns using the numpy function cov. Then we do the eigenvalue decomposition of the covariance matrix.

The magnitude of each eigenvalue signifies the importance of its eigenvector in explaining the data. More precisely, eigenvector $i$ explains $\frac{\lambda_i}{\sum_{j=1}^N \lambda_j}$ of the variance (in proportion). Sort the eigenvalues obtained and **find how many Principal Components, i.e. eigenvectors, are needed to explain 90% of the variance.**



In [73]:
from numpy import cov            # To calculate covariance matrix
from numpy.linalg import eig     # To calculate eigenvalues and eigenvectors

train_array = train_df.values   # Returns numpy array of values of the dataframe

# Write your code here
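
The steps described above could look roughly like this on synthetic data. As an assumption of this sketch, it uses `numpy.linalg.eigh` (meant for symmetric matrices such as a covariance matrix) instead of the `eig` function imported in the cell; the helper name `components_for_variance` is hypothetical.

```python
import numpy as np

def components_for_variance(X_std, threshold=0.90):
    """Return sorted eigenvalues, eigenvectors, and the number of components
    needed to reach `threshold` of the total variance."""
    cov_matrix = np.cov(X_std, rowvar=False)        # columns are the features
    eigvals, eigvecs = np.linalg.eigh(cov_matrix)   # symmetric eigen-solver
    order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance ratio
    k = int(np.searchsorted(explained, threshold) + 1)
    return eigvals, eigvecs, k

# Synthetic data: four columns, two of which nearly duplicate the other two.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.01 * rng.normal(size=500),
    base[:, 1] + 0.01 * rng.normal(size=500),
])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs, k = components_for_variance(X_std)
print(eigvals / eigvals.sum(), k)
```

Because two columns are near copies of the other two, almost all the variance sits in two directions, which is exactly the kind of redundancy PCA is meant to reveal.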




### Projecting the original data on the PCA component space

We would now like to create our PCA data, which basically means transforming each row of data (22 features) into the number of components we have chosen to keep after performing PCA.

To do this, a neat trick is to multiply the scaled data matrix by the matrix of eigenvectors we chose to keep.

Let your data matrix be X (size m x n) and your eigenvector matrix be eig (size n x k), where k is the number you have determined in the previous cell. The projected data is then simply the matrix product X · eig, which one can easily verify is an m x k matrix. Implement this in the following cell.

In [None]:
# Write your code here
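
A short sketch of the projection on random toy data. The value `k = 2` and the variable names are assumptions for illustration, and `numpy.linalg.eigh` is used here in place of `eig` because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                     # m = 6 rows, n = 4 features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov_matrix = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov_matrix)
order = np.argsort(eigvals)[::-1]               # largest eigenvalue first

k = 2                                           # assumed number of components to keep
top_vecs = eigvecs[:, order[:k]]                # n x k, one eigenvector per column
X_pca = X_std @ top_vecs                        # (m x n) @ (n x k) -> m x k

print(X_pca.shape)
```

A useful sanity check is that the columns of the projected data are uncorrelated, with variances equal to the eigenvalues that were kept.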



Let us keep the PCA data for now. We will use it later to compare its performance against non-PCA fit data. For now, our aim is to predict the winPlacePerc, which we have previously saved as the variable 'target'.

## Linear Regression

Refer to your lecture notes for understanding Linear Regression. Here we will implement the various steps involved. Firstly, since we normalized the data before performing PCA, we need not repeat that step. Moreover, we have also stored it in numpy format, which is convenient for performing Gradient Descent.

For further convenience, and to not bother about the bias term separately, add another column of ones to train_std. To understand what the bias is, think of the y-intercept $c$ in the equation of a line $y = mx + c$. In higher dimensions, the line becomes a hyperplane $y = w^T x + c$, and the column of ones lets us treat $c$ as just one more component of the weight vector $w$.

In [None]:
# Write your code here - add another column of ones (representing the bias term) to train_std
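
Adding the column of ones could be done along these lines. The toy array `X_std` stands in for train_std, and placing the bias column last is an arbitrary choice of this sketch.

```python
import numpy as np

X_std = np.array([[0.5, -1.2],
                  [1.5, 0.3],
                  [-2.0, 0.9]])

ones = np.ones((X_std.shape[0], 1))     # one entry per row
X_bias = np.hstack([X_std, ones])       # bias column appended as the last feature

print(X_bias)
```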




Now, we shall divide our data into a training set and a test set. The proportion to use for the training set is your choice (and you could play around with it), but to start off, let's consider a 70:30 split.


In [80]:
target_np = target.values   # converting pandas df to numpy array

# Write your code here - Split into train and test set
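
A sketch of a shuffled 70:30 split in plain numpy. The helper name `train_test_split_np` and the fixed seed are assumptions of this illustration.

```python
import numpy as np

def train_test_split_np(X, y, train_fraction=0.7, seed=0):
    """Shuffle the row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    n_train = int(round(train_fraction * X.shape[0]))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data in which row i has features (2i, 2i + 1) and target i.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
X_train, X_test, y_train, y_test = train_test_split_np(X, y)

print(X_train.shape, X_test.shape)
```

Indexing `X` and `y` with the same index arrays keeps every row paired with its own target after shuffling.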



#### Cost Function

Now, to perform Linear regression, we will first need to define a Cost Function. A good cost function in this case would be the mean squared error (MSE), which as the name suggests is the mean of the errors squared. Mathematically :

$L(\hat{y},y) = \frac{1}{2M} \sum_{i=1}^M (\hat{y}_i - y_i)^2$

where $\hat{y}_i = w^T x_i$. Note there is no bias term since we shrewdly included bias in our $x_i$. The extra factor of $\frac{1}{2}$ is a common convention that cancels out when we differentiate the cost. We shall now code the cost function, which takes as arguments the weight vector, the data, and the values to be predicted, and returns the cost.

In [116]:
# Write your code here - Note: Don't worry about vectorising your code just yet. You can write a for loop.

# However, if you want much faster execution, vectorisation is a must. This might be needed for bonus marks.
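
For reference, the cost defined above can be sketched both with a for loop and in vectorised form. The function names `cost_loop` and `cost_vectorized` and the toy values are illustrative assumptions.

```python
import numpy as np

def cost_loop(w, X, y):
    """Half the mean squared error, computed one data point at a time."""
    M = X.shape[0]
    total = 0.0
    for i in range(M):
        y_hat_i = np.dot(w, X[i])          # prediction for data point i
        total += (y_hat_i - y[i]) ** 2
    return total / (2 * M)

def cost_vectorized(w, X, y):
    """Same quantity, computed with one matrix-vector product."""
    residuals = X @ w - y
    return residuals @ residuals / (2 * X.shape[0])

# Toy data: the last column of X is the bias column of ones.
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([2.0, 4.0, 6.0])
w = np.array([1.0, 0.0])

print(cost_loop(w, X, y), cost_vectorized(w, X, y))
```

Both versions compute the same number; the vectorised one simply pushes the loop into numpy, which matters once the number of rows grows.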




#### Gradient Descent

To perform gradient descent, we need to know how to update the weights $w$ to reduce the error slightly. We do this by simply finding the gradient of the loss function $L(\hat{y},y)$ with respect to the individual weight components $w_j$. More precisely, the update step is,

$w_j = w_j - \alpha \frac{\partial{L}}{\partial{w}_j}$

where $\alpha$ is the learning rate, which is a hyperparameter that you will play around with. Now, only one thing remains to be calculated. What is $\frac{\partial{L}}{\partial{w}_j}$ ? A bit of basic calculus will show that :

$\frac{\partial{L}}{\partial{w}_j} = \frac{1}{M} \sum_{i=1}^M (\hat{y}_i - y_i) x_i^j$

A bit of notation abuse : $x_i^j$ denotes the $j^{th}$ feature of the $i^{th}$ data point.

In [None]:
%%time   # Cell magic to time the execution of your cell
# Write your code for Gradient Descent here - Again, don't bother about vectorisation for now.
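
A compact sketch of the update rule above, written in vectorised form and run on synthetic data. The function name `gradient_descent`, the zero initialisation, and the default number of iterations are assumptions of this sketch.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=500):
    """Repeat w <- w - alpha * dL/dw and record the cost before each update."""
    M, n = X.shape
    w = np.zeros(n)
    history = []
    for _ in range(n_iters):
        residuals = X @ w - y                        # y_hat - y
        history.append(residuals @ residuals / (2 * M))
        gradient = X.T @ residuals / M               # dL/dw_j for every j at once
        w = w - alpha * gradient
    return w, history

# Synthetic data: one standardised feature plus the bias column of ones.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()
X = np.column_stack([x, np.ones_like(x)])
y = 3.0 * x + 0.5                                    # slope 3, intercept 0.5

w, history = gradient_descent(X, y, alpha=0.1, n_iters=500)
print(w, history[0], history[-1])
```

Returning the cost history alongside the weights makes it straightforward to compare runs with different learning rates afterwards.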



Plot how the cost function varies (**ON THE SAME GRAPH**) with the number of iterations for different values of learning rate $\alpha \in \{0.01,0.1,0.5,1\}$. Which learning rate is better and why?

**Also each time, store the weights as you will need to use them on the test set later.**

In [None]:
# Write your code for plotting here



## Test your model(s) on the test set

Calculate the cost of your different models on the test set.


In [None]:
# Write your code here
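
Comparing stored models on held-out data might look like the following. The weights in `weights_by_alpha` are made-up placeholders standing in for the weights saved during training, and the toy test set is likewise hypothetical.

```python
import numpy as np

def cost(w, X, y):
    """Half the mean squared error, as in the Cost Function section."""
    residuals = X @ w - y
    return residuals @ residuals / (2 * X.shape[0])

# Hypothetical weights, as if stored after training with two learning rates.
weights_by_alpha = {
    0.01: np.array([2.0, 0.0]),
    0.1: np.array([3.0, 0.5]),
}

# Toy test set; the last column is the bias column of ones.
X_test = np.array([[1.0, 1.0], [-1.0, 1.0]])
y_test = np.array([3.5, -2.5])

test_costs = {alpha: cost(w, X_test, y_test) for alpha, w in weights_by_alpha.items()}
best_alpha = min(test_costs, key=test_costs.get)     # lowest test cost wins

print(test_costs, best_alpha)
```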



## Create a model for the PCA data

Now that you have running code for the entire data, perform linear regression with the PCA data. Report what you find.

**Note** - Normally, with PCA data, prediction accuracy will be lower for the same number of iterations and learning rate. However, since the data is compressed (by a factor of 2 almost), you can use twice as much data for the same running time. So, it is not obvious which model will perform better.

In [115]:
%%time 
# Write your code here



## Time to test it out on Kaggle

Choose the best model you get after tuning (learning rate, training data size, and PCA/ No PCA), and apply it on the kaggle test set to find out your performance and get a rank. Submit it as MEC_<team_no.> on this competition page : https://www.kaggle.com/c/pubg-finish-placement-prediction

**Bonus 10 % for applying your model**

Note that this is a kernel based competition, so you will need to copy-paste the relevant parts of your code in a kernel there.

**Based on your ranking on Kaggle, extra 15% marks will be awarded**

Please refrain from using any advanced algorithm while doing so.

Most importantly, don't be disheartened if your score is not too great on Kaggle. You used the most basic Machine Learning algorithm, and coded it by hand at that. If you do get a good score nonetheless, it shows how easy it is to be a data scientist.