## KNN Regressor

As discussed in our lecture, we are now all familiar with **KNN** as a classification algorithm.

Interestingly, **KNN** can be used for regression as well, and in this exercise you will learn some of the concepts behind it. We will also discuss implementation details where they come up in our workflow.

### Idea 
The idea of the KNN Regressor is very simple: find the k training data points most similar to the query point and take the average of their target values.
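
The idea can be sketched in a few lines of plain Python. This is only an illustration, not the sklearn implementation: the helper name `knn_predict`, the choice of Euclidean distance, and the toy numbers are all assumptions made for the example.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a value for `query` as the mean target of its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(x, query) for x in train_X]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_X)), key=lambda i: distances[i])[:k]
    # Average the targets of those neighbors
    return sum(train_y[i] for i in nearest) / k

# Toy data (made up for illustration): (minutes played, field goals) -> points
train_X = [(10, 1), (12, 2), (30, 8), (32, 9), (35, 10)]
train_y = [3.0, 5.0, 20.0, 22.0, 27.0]

prediction = knn_predict(train_X, train_y, query=(31, 8), k=3)
print(prediction)  # mean of 20.0, 22.0 and 27.0 -> 23.0
```

The three closest points to the query are the last three rows, so the prediction is simply the mean of their targets.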

### Dataset
We will be working with the **NBA** dataset, which holds player statistics from the NBA basketball league. Our goal is to predict the points scored (**pts**) by each player from the other statistics recorded for them.

### Training
We will train sklearn's **KNeighborsRegressor** as the last step of a pipeline, after imputation and scaling.

### Evaluation Metric
R-squared
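
For reference, R-squared compares the model's squared errors with those of a baseline that always predicts the mean of the targets: it is 1 minus the ratio of the two sums. The sketch below computes it by hand on made-up numbers; the function name `r_squared` is hypothetical, and in the exercise itself you would use sklearn's `r2_score`, which computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared errors of the model
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Sum of squared errors of a baseline that always predicts the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0]                        # made-up actual values
perfect = r_squared(y_true, [10.0, 20.0, 30.0])    # 1.0: every prediction is exact
baseline = r_squared(y_true, [20.0, 20.0, 20.0])   # 0.0: no better than the mean
print(perfect, baseline)
```

A score close to 1 means the predictions explain most of the variation in the target; a score at or below 0 means the model does no better than predicting the mean.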

### Implementation Skills
1. Imputation
2. Grid Search
3. ML Pipeline


#### Importing and reading the dataset

In the following few cells, import and read the **nba_2013.csv** file from the **data** folder. Try to view some of its data, its shape, the null counts, the total number of rows and columns, etc.

The first step has been done for you.

In [None]:
# Import the data
import pandas as pd
nba = pd.read_csv('data/nba_2013.csv')
# Read the data head
nba.head()

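| Unnamed: 0 | player | pos | age | bref_team_id | g | gs | mp | fg | fga | fg. | ... | drb | trb | ast | stl | blk | tov | pf | pts | season | season_end |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Quincy Acy | SF | 23 | TOT | 63 | 0 | 847 | 66 | 141 | 0.468 | ... | 144 | 216 | 28 | 23 | 26 | 30 | 122 | 171 | 2013-2014 | 2013 |
| 1 | Steven Adams | C | 20 | OKC | 81 | 20 | 1197 | 93 | 185 | 0.503 | ... | 190 | 332 | 43 | 40 | 57 | 71 | 203 | 265 | 2013-2014 | 2013 |
| 2 | Jeff Adrien | PF | 27 | TOT | 53 | 12 | 961 | 143 | 275 | 0.52 | ... | 204 | 306 | 38 | 24 | 36 | 39 | 108 | 362 | 2013-2014 | 2013 |
| 3 | Arron Afflalo | SG | 28 | ORL | 73 | 73 | 2552 | 464 | 1011 | 0.459 | ... | 230 | 262 | 248 | 35 | 3 | 146 | 136 | 1330 | 2013-2014 | 2013 |
| 4 | Alexis Ajinca | C | 25 | NOP | 56 | 30 | 951 | 136 | 249 | 0.546 | ... | 183 | 277 | 40 | 23 | 46 | 63 | 187 | 328 | 2013-2014 | 2013 |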


View the column names, dimensions, and null counts.

This may also be a good point to look at some statistics about the columns.
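
One way to carry out these checks is sketched below on a small made-up table. The name `toy` and its values are placeholders, not the real NBA data; the same pandas calls should apply to the `nba` DataFrame.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the NBA table; the values are made up for illustration
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "pos": ["SF", "C", "PF", "SG"],
    "fg.": [0.468, np.nan, 0.520, 0.459],
    "pts": [171, 265, 362, 1330],
})

print(toy.columns.tolist())        # column names
print(toy.shape)                   # (number of rows, number of columns)
null_counts = toy.isnull().sum()   # missing values per column
print(null_counts)
print(toy.describe())              # summary statistics of the numeric columns
```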

### Preprocessing

Our preprocessing consists of three main parts:
1. **Removing unnecessary columns and data splitting**

    You may have noticed that several columns hold categorical data. Since we are handling a regression problem, you can assume these are of the least importance and remove them. Then we will split the data into predictors and target, and into training and testing sets.
    
    
2. **Imputation**

    You should also have noticed that we have a missing data problem. Since our dataset is not big, dropping the affected rows is not a good option, so we will impute the missing values with a statistical measure such as the mean or the median. (Hint: you can use the mean, as it is the simplest.)
    
3. **Scaling**

    KNN relies on distances between data points, so features measured on different ranges should be brought to a comparable scale. sklearn offers several scalers, such as **StandardScaler**, but for KNN especially, we prefer **MinMaxScaler**. Do you know **why**?
    
    

When applying these concepts, we will use the sklearn implementations, as they are well organized and robust to changes, and we will group them together into one full pipeline. Let's start!

#### Removing unnecessary columns and data splitting

Find the categorical columns and remove them, then split the data into predictors and target. After this, split the data into train and test chunks. The import statement below could be helpful.


In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split



# Remove unnecessary columns



# Split into target and predictors




# Split with 75% for training and 25% for testing respectively.
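
If you get stuck, the steps above could look roughly like the sketch below. It runs on a made-up table rather than the real file; the name `toy`, the use of `select_dtypes` to find non-numeric columns, and `random_state=42` are choices made for this illustration, not requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the NBA table
toy = pd.DataFrame({
    "player": list("ABCDEFGH"),
    "pos": ["SF", "C", "PF", "SG", "C", "PG", "SF", "PF"],
    "mp": [847, 1197, 961, 2552, 951, 1500, 300, 2000],
    "fg": [66, 93, 143, 464, 136, 250, 20, 400],
    "pts": [171, 265, 362, 1330, 328, 700, 55, 1100],
})

# Keep only the numeric columns (drops the text columns "player" and "pos")
numeric = toy.select_dtypes(include="number")

# Separate the target from the predictors
X = numeric.drop(columns="pts")
y = numeric["pts"]

# 75% training / 25% testing; random_state only makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```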




#### Imputation

Use the sklearn implementation **SimpleImputer**. The import statement below may help you. 

**Hint**: check the documentation to learn how **SimpleImputer** works.

In [None]:
# Import SimpleImputer
from sklearn.impute import SimpleImputer


# Instantiate SimpleImputer Instance
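
As a rough illustration of how **SimpleImputer** behaves, the sketch below fits it on a small made-up array and then applies it to unseen data. The arrays are invented for the example; in the exercise the imputer is only instantiated here and fitted later as part of the pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with one missing value in the second column
X_train = np.array([[1.0, 0.40],
                    [2.0, np.nan],
                    [3.0, 0.60]])
X_test = np.array([[4.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
# fit_transform() learns the column means from the training data and fills the gaps
X_train_filled = imputer.fit_transform(X_train)
# transform() reuses those training means on unseen data
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)   # per-column means learned from X_train
print(X_test_filled)
```

Note that the test row is filled with the training mean, so no information from the test set leaks into the preprocessing.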


#### Scaling

Use the sklearn implementation **MinMaxScaler**. The import statement below may help you. 

**Hint**: check the documentation to learn how **MinMaxScaler** works.


In [None]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler


# Instantiate MinMaxScaler Instance
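
The sketch below shows what **MinMaxScaler** does on a small made-up array, and why scaling matters for a distance-based model: before scaling, the distance between two rows is driven almost entirely by the feature with the larger range. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up features on very different scales: minutes played and shooting percentage
X = np.array([[300.0, 0.40],
              [1500.0, 0.50],
              [2700.0, 0.60]])

scaler = MinMaxScaler()              # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)   # each column is mapped to [0, 1]
print(X_scaled)

# Distance between the first two rows before and after scaling
raw_dist = np.linalg.norm(X[0] - X[1])
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(raw_dist, scaled_dist)
```

After scaling, both features contribute equally to the distance between the first two rows.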


### Modeling

Now that we have finished the preprocessing, we move on to the modeling part.

Note: we will not fit the imputer, the scaler, or the model just yet. We will fit the overall pipeline instead. As usual, the import statement has been done for you.

In [None]:
# Import KNeighborsRegressor 
from sklearn.neighbors import KNeighborsRegressor


# Now, instantiate a model instance (n_neighbors = 3)


### Pipeline

Now, let's combine all these steps into a full pipeline. All the steps should be ordered logically. The import statement has been done for you.

In [None]:
# Import pipeline
from sklearn.pipeline import Pipeline

# Let's organize a list of tuples with the required steps

steps = [(),                                                  # Add the imputation here
         (),                                                  # Add the scaling here
         ()]                                                  # Add the model here


# Instantiate the Pipeline instance


# Fit the pipeline on the training data


# Get predictions for the testing data 


# Evaluate your model using R-squared (imported for you)
from sklearn.metrics import r2_score


# Print the R-squared for the testing data
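
For comparison, a complete pipeline could be assembled along the lines of the sketch below. It uses synthetic data instead of the NBA table, and the step names (`"imputer"`, `"scaler"`, `"knn"`) are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (not the NBA table): the target is a simple function of two features
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(80, 2))
y = 2 * X[:, 0] + X[:, 1]
X[::10, 1] = np.nan                 # knock out a few values to exercise the imputer

X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

# Each step is a (name, estimator) tuple, listed in the order it should run
steps = [("imputer", SimpleImputer(strategy="mean")),
         ("scaler", MinMaxScaler()),
         ("knn", KNeighborsRegressor(n_neighbors=3))]
pipe = Pipeline(steps)

pipe.fit(X_train, y_train)          # fits the imputer, the scaler and the model in order
y_pred = pipe.predict(X_test)       # applies the already fitted steps to the test data
score = r2_score(y_test, y_pred)
print(score)
```

Fitting the pipeline as a whole means the imputer and the scaler only ever learn from the training data, which is one reason to prefer it over fitting each step by hand.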


### Hyperparameter Tuning

As you may have noticed, the pipeline we created has several hyperparameters, such as the number of neighbors and the distance measure.
Grid search is one of the easiest and quickest ways to decide on the best hyperparameters. Now, you will try to tune the following:

- n_neighbors
- distance measure (choose between cosine, manhattan, or euclidean)

GridSearchCV will be imported for you.
    

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV




# Create the parameters dictionary. (You may need to check the documentation to learn about it)



# Instantiate a GridSearchCV instance (Note: use cv=5)



# Print the best hyperparameters
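
A grid search over a pipeline could be set up roughly as in the sketch below. The data is synthetic, the imputation step is left out to keep the example short, and the candidate values for `n_neighbors` are arbitrary choices for illustration. The point to notice is the `<step name>__<parameter name>` pattern used for the dictionary keys.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (not the NBA table)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(100, 2))
y = 2 * X[:, 0] + X[:, 1]

pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("knn", KNeighborsRegressor())])

# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__metric": ["cosine", "manhattan", "euclidean"],
}

# 5-fold cross-validation, scored with R-squared
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_)   # best combination found on this synthetic data
print(search.best_score_)    # mean cross-validated R-squared of that combination
```

The fitted `search` object can be used like the pipeline itself: by default it is refit on the full training data with the best parameters, so `search.predict(...)` is available afterwards.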

## Thanks and all the best