## TASK - TO FIT A KNN REGRESSION MODEL ON THE GIVEN DIAMONDS DATA SET

### KNN ALGORITHM:

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but it has a major drawback: prediction becomes significantly slower as the size of the data in use grows.

(or)

KNN regression is a non-parametric method that approximates, in an intuitive manner, the association between the independent variables and a continuous outcome by averaging the observations in the same neighbourhood.
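
As a sketch of the averaging rule described above, the prediction for a query point $x$ can be written as follows. The notation $N_K(x)$ is introduced here for illustration and denotes the indices of the $K$ training points closest to $x$; $y_i$ is the outcome of training point $i$.

$$
\hat{y}(x) = \frac{1}{K} \sum_{i \in N_K(x)} y_i
$$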


## Regression: 
Regression analysis consists of a set of machine learning methods that allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (x). Linear regression is the simplest and most popular technique for predicting a continuous variable.

### How does KNN work in regression?

- KNN algorithm can be used for both classification and regression problems.
- The KNN algorithm uses 'feature similarity' to predict the values of any new data points. 
- This means that the new point is assigned a value based on how closely it resembles the points in the training set. In regression, that value is the average of the target values of its K nearest neighbors.

### What is the purpose of KNN?

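- KNN is a non-parametric, lazy learning algorithm: it builds no explicit model and defers the computation to prediction time.
- In classification, its purpose is to use a database in which the data points are separated into several classes to predict the class of a new sample point. In regression, as in this task, it predicts a continuous value instead.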

### How is KNN calculated?
#### Here is a step-by-step description of how to compute the K-nearest neighbors (KNN) algorithm:
1. Determine the parameter K, the number of nearest neighbors.
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distances and take the K training samples with the smallest distances as the nearest neighbors.
4. For regression, predict the average of the target values of those K neighbors.
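
The steps above can be sketched on a tiny example. The training values and the query point below are made up purely for illustration; the final averaging line is the regression step that turns the selected neighbors into a prediction.

```python
import numpy as np

# Toy training set (made-up values, for illustration only)
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0]])
y_train = np.array([10.0, 20.0, 30.0, 60.0, 70.0])
query = np.array([2.5])

K = 3  # choose the number of nearest neighbors

# Euclidean distance from the query to every training sample
distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))

# indices of the K smallest distances
nearest = np.argsort(distances)[:K]

# regression: average the targets of those K neighbors
prediction = y_train[nearest].mean()
print(prediction)  # prints 20.0
```

Here the three closest training points are 1.0, 2.0 and 3.0, so the prediction is the mean of 10, 20 and 30.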

### IMPORTING THE NECESSARY LIBRARIES

In [1]:
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

## OWN MODEL

In [2]:
class K_Nearest_Neighbors_Regressor():
    """KNN regressor that predicts the mean target of the K nearest training rows."""

    def __init__(self, K):
        # K = number of nearest neighbors to average over
        self.K = K

    # Function to store training set
    def fit(self, X_train, Y_train):
        self.X_train = X_train
        self.Y_train = Y_train

        # no_of_training_examples, no_of_features
        self.m, self.n = X_train.shape

    # Function for prediction
    def predict(self, X_test):
        self.X_test = X_test

        # no_of_test_examples, no_of_features
        self.m_test, self.n = X_test.shape

        # initialize Y_predict
        Y_predict = np.zeros(self.m_test)

        for i in range(self.m_test):
            x = self.X_test[i]

            # targets of the K nearest neighbors of the current test example
            neighbors = self.find_neighbors(x)

            # the prediction is the mean of those K targets
            Y_predict[i] = np.mean(neighbors)

        return Y_predict

    # Function to find the K nearest neighbors to current test example
    def find_neighbors(self, x):
        # calculate the euclidean distance between the current test
        # example x and every row of X_train
        euclidean_distances = np.zeros(self.m)

        for i in range(self.m):
            euclidean_distances[i] = self.euclidean(x, self.X_train[i])

        # sort Y_train by distance and keep the targets of the K closest rows
        inds = euclidean_distances.argsort()
        Y_train_sorted = self.Y_train[inds]

        return Y_train_sorted[:self.K]

    # Function to calculate euclidean distance
    def euclidean(self, x, x_train):
        return np.sqrt(np.sum(np.square(x - x_train)))
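
The per-row Python loop in `find_neighbors` can be slow on large training sets. As a rough sketch of one alternative (not the implementation used in this notebook), the distances for several test rows can be computed at once with NumPy broadcasting. The function name `knn_predict_vectorized` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_test, k):
    """Predict by averaging the targets of the k nearest training rows.

    Hypothetical helper for illustration; not part of the class above.
    """
    # Pairwise squared distances via broadcasting:
    # (n_test, 1, n_features) - (1, n_train, n_features) -> (n_test, n_train)
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)

    # The square root is skipped because it does not change the ordering.
    # argpartition places the k smallest distances in the first k columns
    # without fully sorting each row.
    nearest = np.argpartition(sq_dists, k - 1, axis=1)[:, :k]

    return y_train[nearest].mean(axis=1)

# Toy data (made-up values)
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
X_test = np.array([[0.9], [9.0]])

preds = knn_predict_vectorized(X_train, y_train, X_test, k=2)
print(preds)
```

The intermediate array has one entry per (test row, training row, feature) triple, so on a data set the size of the diamonds data the test rows would need to be processed in batches to keep memory use manageable.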

In [3]:
df = pd.read_csv(r'C:\Users\admin\Downloads\DATASETS/diamonds.csv')
df

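|  | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |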


In [4]:
df.shape

(53940, 10)

### OBSERVATION: 

The diamonds data set has 53940 rows and 10 columns.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [5]:
df.describe()

|  | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


In [8]:
df.cut.value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

In [9]:
df.price.value_counts()

605      132
802      127
625      126
828      125
776      124
        ... 
13550      1
13014      1
6811       1
5354       1
11600      1
Name: price, Length: 11602, dtype: int64

In [10]:
df.carat.value_counts()

0.30    2604
0.31    2249
1.01    2242
0.70    1981
0.32    1840
        ... 
2.70       1
3.67       1
5.01       1
2.77       1
3.40       1
Name: carat, Length: 273, dtype: int64

In [11]:
df.depth.value_counts()

62.0    2239
61.9    2163
61.8    2077
62.2    2039
62.1    2020
        ... 
72.9       1
52.7       1
69.1       1
70.5       1
69.4       1
Name: depth, Length: 184, dtype: int64

### Some Advantages of KNN: 

1. No training phase – fitting only stores the data, so the model is ready immediately.
2. Simple algorithm – easy to interpret.
3. Versatile – useful for regression and classification.
4. Often accurate – competitive on many problems, although it should still be compared with other supervised learning models.
5. No assumptions about data – it does not assume a particular functional form or build an explicit model; only K and the distance measure need to be chosen. This makes it useful for nonlinear data.


### Some Disadvantages of KNN:

1. Accuracy depends on the quality of the data.
2. With large data, the prediction stage might be slow.
3. Sensitive to the scale of the data and irrelevant features.
4. Requires high memory – all of the training data has to be stored.
5. Because every prediction compares the query with all of the stored training data, it can be computationally expensive.
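
The sensitivity to feature scale can be illustrated with a small experiment. The sketch below uses synthetic data invented for this purpose (one informative feature on a small scale and one pure-noise feature on a much larger scale) and compares a 3-nearest-neighbor regressor with and without standardization. Wrapping the scaler in `make_pipeline` is one possible design choice; it means the scaler is fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up for illustration): the target depends only on
# the small-scale feature; the large-scale feature is pure noise.
rng = np.random.default_rng(0)
informative = rng.uniform(0.0, 1.0, size=500)
noise = rng.uniform(0.0, 1000.0, size=500)
X = np.column_stack([informative, noise])
y = 10.0 * informative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Without scaling, distances are dominated by the large-scale noise feature.
unscaled = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# With scaling, both features contribute to the distance on a comparable scale.
scaled = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=3)).fit(X_train, y_train)

# score() returns the R^2 score on the held-out rows
r2_unscaled = unscaled.score(X_test, y_test)
r2_scaled = scaled.score(X_test, y_test)
print(r2_unscaled, r2_scaled)
```

In this setup the scaled model should score clearly higher, because the unscaled model picks its neighbors almost entirely by the noise feature.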


## COMPARISON WITH THE SKLEARN MODEL

In [16]:
def main():

    # Importing dataset
    df = pd.read_csv(r'C:\Users\admin\Downloads\DATASETS/diamonds.csv', nrows=53940)

    # Encode the categorical columns as integers.
    # fit_transform refits the encoder each time, so one instance can be reused.
    number = LabelEncoder()
    df["cut"] = number.fit_transform(df["cut"].astype('str'))
    df["color"] = number.fit_transform(df["color"].astype('str'))
    df["clarity"] = number.fit_transform(df["clarity"].astype('str'))

    X = df.drop('price', axis=1).values
    Y = df['price'].values

    # Scaling: KNN is distance based, so the features are standardized.
    # Note: the scaler is fitted on all rows before the split; fitting it on the
    # training rows only would keep test-set statistics out of training.
    sc = StandardScaler()
    X = sc.fit_transform(X)

    # Splitting dataset into train and test set
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.25, random_state=0)

    # Model training
    model_own = K_Nearest_Neighbors_Regressor(K=3)
    model_own.fit(X_train, Y_train)

    model_sklearn = KNeighborsRegressor(n_neighbors=3)
    model_sklearn.fit(X_train, Y_train)

    # Prediction on test set
    Y_pred_own = model_own.predict(X_test)
    Y_pred_sklearn = model_sklearn.predict(X_test)

    # The value printed as 'accuracy' is the R^2 score (coefficient of determination)
    print('accuracy of Own Model', r2_score(Y_test, Y_pred_own))
    print('accuracy of Sklearn', r2_score(Y_test, Y_pred_sklearn))


if __name__ == "__main__":
    main()

accuracy of Own Model 0.9588484862268778
accuracy of Sklearn 0.9588455152840776


### OBSERVATION:

1. Our own model, written from scratch in Python, gives an R² score (printed as 'accuracy' in the output above) of 0.9588484862268778 on the test set.
2. The Sklearn library model gives an R² score of 0.9588455152840776 on the test set.

### CONCLUSION:

Our own model and the sklearn model give practically the same R² score, about 0.959, so both explain roughly 95.9% of the variance in price on the test set.
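
To make the meaning of the score concrete, here is a minimal sketch of how an R² score can be computed by hand, following the standard definition 1 - SS_res / SS_tot that `r2_score` reports. The helper name `r2_by_hand` and the price values are made up for illustration and are not taken from the diamonds data.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up prices, for illustration only
y_true = [300.0, 500.0, 700.0, 900.0]
y_pred = [320.0, 480.0, 710.0, 890.0]

score = r2_by_hand(y_true, y_pred)
print(score)
```

A score of 1 means the predictions match the true values exactly, while a score of 0 means the model does no better than always predicting the mean.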

## A Quick Summary of KNN Algorithm: 

- K is a positive integer.
- K has to be specified before a prediction can be made for a new sample.
- The K training points closest to the new sample are selected from the data set.
- KNN doesn’t learn an explicit model; it stores the training data.
- KNN makes predictions using the similarity between an input sample and each training instance.
- KNN is one of the most basic machine learning algorithms, and this notebook has covered its fundamentals.
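
Since K has to be specified by the user, one common way to pick it is to compare a few candidate values on held-out data. The sketch below is only an illustration under stated assumptions: the data are synthetic, the helper `knn_predict` and the candidate list are hypothetical, and the mean squared error on a simple hold-out split is used as the selection criterion.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Average the targets of the k nearest training rows (Euclidean distance)."""
    preds = np.empty(len(X_query))
    for i, x in enumerate(X_query):
        dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        preds[i] = y_train[np.argsort(dists)[:k]].mean()
    return preds

# Synthetic noisy data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=300)

# Hold out the last 100 rows for validation
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Mean squared validation error for each candidate K
errors = {}
for k in (1, 3, 5, 10, 20):
    errors[k] = np.mean((knn_predict(X_train, y_train, X_val, k) - y_val) ** 2)

best_k = min(errors, key=errors.get)
print(errors)
print("best K on the validation rows:", best_k)
```

A very small K tends to follow the noise in the data, while a very large K smooths over real structure, so the best value usually lies somewhere in between.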

### KNN is a great place to start when first learning to build models based on different data sets. 

### A data set with many different points and accurate information is the best place to begin with KNN.

## We should keep these 3 points in mind:

- A data set with lots of different points and labelled data is ideal to use.
- R and Python are both well suited to KNN; this notebook uses Python with scikit-learn.
- To get the most accurate results from your data set, you need to learn the correct practices for using this algorithm, such as scaling the features and choosing K carefully.