# Recap

In [3]:
import pandas as pd

df = pd.read_csv('data.csv')

df.head()

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | True | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | False | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | False | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | False | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | False | northwest | 3866.8552 |


## Finding nearest neighbors

❓ I am 28 years old, I have a BMI of 30, and I don't smoke. Which person in the dataset is most like me, and how much do they pay?

Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor.kneighbors) for the `kneighbors` method.

In [4]:
from sklearn.neighbors import KNeighborsRegressor

# Prepare data
X = df[['age','bmi','smoker']]
y = df['charges']

# Instantiate model
knn = KNeighborsRegressor()

# Train model
knn.fit(X,y)

KNeighborsRegressor()

In [9]:
# Find the 10 closest neighbors (returns distances, then row positions)
knn.kneighbors([[28,30,False]], n_neighbors=10)

(array([[0.68493223, 0.74      , 0.875     , 1.0345168 , 1.04403065,
         1.06282642, 1.07703296, 1.08078675, 1.11803399, 1.12      ]]),
 array([[  63, 1006,  749,  143,  253,  291,  429,   76,  562,  205]]))

In [10]:
# Look up the closest observation by its row position
df.iloc[63]

age                28
sex            female
bmi           30.6849
children            1
smoker          False
region      northwest
charges       4133.64
Name: 63, dtype: object
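To make the return value of `kneighbors` easier to read, here is a minimal sketch on a tiny made-up dataset (the values in `X_toy` and `y_toy` are invented for illustration and are not rows from `data.csv`). It shows that the method returns a pair of arrays, distances first and row positions second, both sorted from nearest to farthest.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data, made up for illustration: columns are age, bmi, smoker (0/1)
X_toy = np.array([
    [25, 22.0, 0],
    [28, 30.5, 0],
    [40, 31.0, 1],
    [60, 27.0, 0],
])
y_toy = np.array([2000.0, 4000.0, 30000.0, 12000.0])

knn_toy = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)

# Both arrays have shape (n_queries, n_neighbors), sorted nearest first
distances, indices = knn_toy.kneighbors([[28, 30, 0]], n_neighbors=2)

nearest_position = indices[0, 0]    # row position of the closest sample
nearest_distance = distances[0, 0]  # its Euclidean distance to the query

print(nearest_position, nearest_distance)
```

The positions refer to rows of the data passed to `fit`, which is why `df.iloc` (positional lookup) is the matching accessor.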

❓Which person is least like me?

In [11]:
# Rank every observation by distance; the last one is the farthest
knn.kneighbors([[28,30,False]], n_neighbors=len(X))

(array([[ 0.68493223,  0.74      ,  0.875     , ..., 37.18936542,
         37.28391074, 37.49440492]]),
 array([[  63, 1006,  749, ...,  199,  768,  534]]))

In [12]:
df.iloc[534]

age                64
sex              male
bmi             40.48
children            0
smoker          False
region      southeast
charges       13831.1
Name: 534, dtype: object
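The distance reported for row 534 can be checked by hand: the gaps are 36 years of age and 10.48 BMI points, so the Euclidean distance is the square root of 1296 + 109.83, about 37.49. Age contributes almost all of it. The sketch below repeats that reasoning on a small made-up array (`X_toy` is invented, not taken from `data.csv`), computing the distances directly with NumPy instead of asking the model.

```python
import numpy as np

# Toy data, made up for illustration: columns are age, bmi, smoker (0/1)
X_toy = np.array([
    [25, 22.0, 0],
    [28, 30.5, 0],
    [40, 31.0, 1],
    [60, 27.0, 0],
], dtype=float)
query = np.array([28, 30, 0], dtype=float)

# Euclidean distance from the query to every row
dists = np.linalg.norm(X_toy - query, axis=1)

# The least similar row is the one with the largest distance
farthest_position = int(np.argmax(dists))

# Per-feature share of the squared distance for that row
squared_gaps = (X_toy[farthest_position] - query) ** 2

print(farthest_position, squared_gaps)
```

On unscaled features, the column with the widest numeric range decides who counts as "far", which is worth keeping in mind before reading the base model's score.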

## Base KNN

👇 Train and score a base KNN model with `age`, `bmi`, and `smoker` to predict `charges`.

In [14]:
from sklearn.model_selection import train_test_split

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Instantiate
knn = KNeighborsRegressor()

# Train
knn.fit(X_train,y_train)

# Score
knn.score(X_test,y_test)

0.13475584321848322
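For a regressor, `score` returns the coefficient of determination (R²) on the data it is given. The split above does not fix `random_state`, so the exact value changes between runs. The following sketch, on synthetic data generated only for this illustration, checks that `score` agrees with `sklearn.metrics.r2_score` and fixes the seed so the result is reproducible.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, generated only for this illustration
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(200, 2))
y_syn = 3 * X_syn[:, 0] + rng.normal(0, 0.1, size=200)

# A fixed random_state makes the split, and therefore the score, reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

model = KNeighborsRegressor().fit(X_tr, y_tr)

score = model.score(X_te, y_te)                    # R² via the estimator
manual_r2 = r2_score(y_te, model.predict(X_te))    # R² via the metric function

print(score, manual_r2)
```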

## Scaling

👇 Distance-based algorithms such as KNN are sensitive to the scale of features: a feature with a wide numeric range (like `age`) outweighs one with a narrow range (like `smoker`) when distances are computed. Go to [this link](https://www.codecademy.com/articles/normalization#:~:text=Min%2Dmax%20normalization%20is%20one,decimal%20between%200%20and%201.), read up to the part on Min-Max Normalization, and transform `X` according to the formula.

In [17]:
normalized_X = (X - X.min()) / (X.max() - X.min())

normalized_X.head()

| | age | bmi | smoker |
|---|---|---|---|
| 0 | 0.021739 | 0.321227 | 1.0 |
| 1 | 0.0 | 0.47915 | 0.0 |
| 2 | 0.217391 | 0.458434 | 0.0 |
| 3 | 0.326087 | 0.181464 | 0.0 |
| 4 | 0.304348 | 0.347592 | 0.0 |
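The manual formula above matches what scikit-learn's `MinMaxScaler` does with its default `feature_range=(0, 1)`. As a rough sketch, assuming a small made-up frame `X_toy` (not the real dataset, and with `smoker` encoded as 0/1), the two approaches can be compared side by side:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for illustration
X_toy = pd.DataFrame({
    "age": [18, 28, 64],
    "bmi": [20.0, 30.0, 40.0],
    "smoker": [0, 0, 1],
})

# Manual min-max normalization, column by column
manual = (X_toy - X_toy.min()) / (X_toy.max() - X_toy.min())

# Same transformation through scikit-learn
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(X_toy), columns=X_toy.columns)

same_result = np.allclose(manual.to_numpy(), scaled.to_numpy())
print(same_result)
```

A fitted scaler also remembers the minimum and maximum of each column, which is convenient when the same transformation has to be applied to new observations later.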


## KNN with scaled features

👇 Train and score a KNN model with the features you just scaled.

In [21]:
# Split
X_train, X_test, y_train, y_test = train_test_split(normalized_X, y, test_size=0.3, random_state=1)

# Instantiate
knn = KNeighborsRegressor()

# Train
knn.fit(X_train,y_train)

# Score
knn.score(X_test,y_test)

0.8352564952228109
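The jump in score comes from the distance computation: once every feature lives in [0, 1], a 0/1 feature such as `smoker` is no longer drowned out by a wide-range feature. The sketch below reproduces that effect on synthetic data invented for this purpose (one noisy wide-range column, one binary column that drives the target). It also wraps the scaler and the model in a pipeline, so the minimum and maximum are learned from the training split only. That is a design choice of this sketch, not what the cell above does, since `normalized_X` was computed on the full dataset before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, generated only for this illustration
rng = np.random.default_rng(1)
n = 300
wide_noise = rng.uniform(0, 1000, n)   # wide range, unrelated to the target
binary_signal = rng.integers(0, 2, n)  # 0/1 flag that drives the target
X_syn = np.column_stack([wide_noise, binary_signal])
y_syn = 20000 * binary_signal + rng.normal(0, 500, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1
)

# Unscaled: distances are dominated by the wide-range column
raw_score = KNeighborsRegressor().fit(X_tr, y_tr).score(X_te, y_te)

# Scaled inside a pipeline: the scaler is fitted on the training split only
pipe = make_pipeline(MinMaxScaler(), KNeighborsRegressor())
scaled_score = pipe.fit(X_tr, y_tr).score(X_te, y_te)

print(raw_score, scaled_score)
```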

## Predicting new data

👇 Using the model trained on scaled features, predict the charges I would pay as a 28-year-old non-smoker with a BMI of 30.

In [26]:
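# New observation, in the same column order as X: age, bmi, smoker
new_X = [28, 30, False]

# Scale it with the same min and max that were used to scale X
new_X_scaled = (new_X - X.min()) / (X.max() - X.min())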

new_X_scaled

age       0.217391
bmi       0.377724
smoker    0.000000
dtype: float64

In [27]:
knn.predict([new_X_scaled])

array([6810.60666])
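The key point in this step is that a new observation must be scaled with the minimum and maximum of the data the model was trained on, not with statistics of its own. A minimal sketch of the same idea with a fitted `MinMaxScaler`, on made-up data (`X_toy` and `y_toy` are invented, with `smoker` encoded as 0/1, and `n_neighbors=2` is chosen only because the toy set is tiny):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for illustration
X_toy = pd.DataFrame({
    "age": [18, 28, 40, 64],
    "bmi": [20.0, 30.5, 31.0, 40.0],
    "smoker": [0, 0, 1, 0],
})
y_toy = [2000.0, 4000.0, 30000.0, 12000.0]

# Learn min and max from the training data, then train on the scaled values
scaler = MinMaxScaler().fit(X_toy)
knn_toy = KNeighborsRegressor(n_neighbors=2).fit(scaler.transform(X_toy), y_toy)

# Reuse the fitted scaler for the new observation
new_person = pd.DataFrame([[28, 30, 0]], columns=X_toy.columns)
new_scaled = scaler.transform(new_person)

# The prediction is the mean charge of the 2 nearest neighbors
prediction = knn_toy.predict(new_scaled)
print(prediction)
```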

# 🏁