## - Linear regression for missing values

Group members:
- Name (ID): Charan K (0753844)
- Name (ID): Pradeep Reddy O (0755620)
- Name (ID): Sai Krishna V (0753617)

### Academic integrity statement

*Replace the underscores below with your names, acknowledging that you have read and understood the statement in the context of St. Clair College’s Academic Integrity policies.*

We, Charan K, Pradeep Reddy O, Sai Krishna V hereby state that in preparing this lab for submission for grading, we have abided by the College’s academic integrity policies, and that all work presented is our own.

### Overview

In this lab, the main objective is to use a linear regression model to make predictions that you can use to fill in missing values in a dataset. The procedure is the same as for any regression task; however, you use one of the features as the "target" instead of what you would normally think of as the target for that dataset. By the end of this lab you should have:

- gained experience manipulating dataframes with Pandas
- an initial understanding of how missing data is represented
- applied a linear regression model to fill in missing data   

### Grading

This lab will be graded as follows:
- 50% for comments/text
    - Half of the lab grade will come from an assessment of the comments/text included in your Jupyter notebook submission
        - The comments/text should explain clearly what you are doing and why it's necessary to achieve the objective
        - You should think of the comments/text as if you were creating a tutorial/blog to guide someone through your work 
- 50% for code
    - Half of the lab grade will come from an assessment of your code
        - The code in the notebook should use base python, NumPy, Pandas, sklearn, and/or matplotlib. 
        - All code cells should run error free
        - The code does not have to be optimized or pretty: it needs to be functional for the specific task

### What to submit

You should submit the following:
- a well-commented Jupyter notebook
- the original dataset used as a .csv file
    - if it did not come as a .csv file, you can write to a .csv from Pandas using `.to_csv()`

### Instructions

Please execute the following steps using a mixture of base python, NumPy, sklearn, Pandas, and/or matplotlib:

1. Find a dataset
    - I would suggest looking [here](https://archive.ics.uci.edu/ml/datasets.php) for **regression** datasets
    - The dataset for this lab does not have to be complicated, but it should meet the following criteria:
        - have at least 100 samples/rows
        - have at least 4 numeric features
    - if necessary, categorical features can simply be dropped from the dataset
2. Import the data as a Pandas dataframe
    - Depending on the data format, you may need to consult this [page](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)
3. Verify that your data has no missing values
    - If it does have missing values, drop them from the dataset but be sure that your dataset still meets the criteria of *Step 1* above
4. Choose a single, numeric feature (not the target)
    - Replace approximately 15% of the values of this feature with `nan`, which means "not a number" and is one way to represent missing data
5. Split your dataset into 2 dataframes
    - Dataframe 1: has all `nan` values for the feature chosen in *Step 4*
    - Dataframe 2: has no `nan` values for the feature chosen in *Step 4*
6. Use *Dataframe 2* to create a linear regression model to predict the feature chosen in *Step 4* (not the usual target)
    - Split the data
    - Scale the data
    - Create the model
    - evaluate the model on the train and test sets
7. Use the model you created in *Step 6* to predict the missing values in *Dataframe 1*
    - At the end of this step, *Dataframe 1* will have the `nan` values replaced with the predictions from the model you created in *Step 6*
8. Create a final dataframe by combining *Dataframe 1* and *Dataframe 2*
    - This dataframe should have no missing values
9. Create a k nearest neighbours regressor (`k = 3`) for the dataframe you created in *Step 8*
    - Follow the usual procedures
10. Create a k nearest neighbours regressor (`k = 3`) for the original dataframe (from *Step 2* and maybe *Step 3*)
    - Follow the usual procedures
11. Is there any significant performance difference between *Step 9* and *Step 10*?


### Visualizing the process

<img src="Lab_2_sequence.png" width=600 align="center">

### Standard package import

In [None]:
# Import the packages required to complete this lab.

import pandas as pd # pandas for data manipulation
import numpy as np # numpy for numerical calculations
import seaborn as sns # for visualization
import matplotlib.pyplot as plt # for visualization

from sklearn.model_selection import train_test_split # for splitting the dataset into train and test sets
from sklearn.preprocessing import StandardScaler # StandardScaler for feature scaling

from sklearn.linear_model import LinearRegression # to build the linear model
from sklearn.neighbors import KNeighborsRegressor # to build the k-nearest neighbours regressor
from sklearn.neighbors import KNeighborsClassifier # to build the k-nearest neighbours classifier

## About Dataset

We use the seeds dataset for this lab, taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/seeds).

It has 210 records and 8 columns (7 predictor variables and 1 target variable).

The data is stored as a tab-separated text file (.txt).

### Step 2 : Importing dataset

In [None]:
df_org = pd.read_csv('seeds_dataset.txt', header=None, delimiter='\t',
                     names=['Area','Perimeter','Compactness','Length Of Kernel','Width Of Kernel',
                            'Asymmetry Coefficient','Length Of Kernel Groove','Seed Type'])

# We read the data into a variable called df_org, which holds the original dataframe.

# The first argument is the dataset file name.
# By default pandas treats the first row as the header row, so we set header to None.
# The values in the text file are tab separated, so we set the delimiter to '\t'.
# The text file has no column names, so we pass the names described in the UCI repository.


df_org.head() # Displays the top 5 records of the dataframe

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove,Seed Type
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1


In [None]:
df_org.to_csv('Seeds_data3.csv',index=False) 

# The data file is in text format, so we save it in 'csv' format with the pandas 'to_csv' function, as the lab requires.

# The first argument is the file name, including its extension. Setting index to False keeps the index values out of the csv file.

In [None]:
df_org.shape # Checking the shape of dataframe df_org

(210, 8)

In [None]:
df = df_org.iloc[:,0:7].copy() # Subsetting df_org to remove the true target variable for the following steps

# iloc is used for integer-location based slicing.

# [:,0:7] has the form [rows, columns]: it takes all rows and the columns at positions 0 to 6. The slice is half-open, [start:end), so position 7 is excluded.

# In Python, indexing starts from 0, so the slice starts at 0.

# .copy() gives df its own data, so later changes to df do not affect df_org.

In [None]:
print(df) # Displays the dataframe df; it now has only 7 columns and 210 observations.

      Area  Perimeter  Compactness  Length Of Kernel  Width Of Kernel  \
0    15.26      14.84       0.8710             5.763            3.312   
1    14.88      14.57       0.8811             5.554            3.333   
2    14.29      14.09       0.9050             5.291            3.337   
3    13.84      13.94       0.8955             5.324            3.379   
4    16.14      14.99       0.9034             5.658            3.562   
..     ...        ...          ...               ...              ...   
205  12.19      13.20       0.8783             5.137            2.981   
206  11.23      12.88       0.8511             5.140            2.795   
207  13.20      13.66       0.8883             5.236            3.232   
208  11.84      13.21       0.8521             5.175            2.836   
209  12.30      13.34       0.8684             5.243            2.974   

     Asymmetry Coefficient  Length Of Kernel Groove  
0                    2.221                    5.220  
1              

In [None]:
df.head() # Checking the dataframe df after deleting Target variable.

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175


In [None]:
print("The shape of dataframe after Subsetting required columns from original dataframe and removing original target variable {}".format(df.shape))

The shape of dataframe after Subsetting required columns from original dataframe and removing original target variable (210, 7)


### Step 3 : Verify that your data has no missing values

In [None]:
df.info() 
# This command displays the class of the dataframe and its RangeIndex (the number of records).

# It also displays the column names, the non-null count and the datatype of each variable in tabular form.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Area                     210 non-null    float64
 1   Perimeter                210 non-null    float64
 2   Compactness              210 non-null    float64
 3   Length Of Kernel         210 non-null    float64
 4   Width Of Kernel          210 non-null    float64
 5   Asymmetry Coefficient    210 non-null    float64
 6   Length Of Kernel Groove  210 non-null    float64
dtypes: float64(7)
memory usage: 11.6 KB


##### None of the 7 variables has missing values. The Non-Null Count column shows 210 for every column, which means there are no null values.

### Step 4 : 

We choose the feature "Length Of Kernel Groove" and introduce NaN values into it.

In [None]:
# Code reference https://stackoverflow.com/questions/39059032/randomly-insert-nas-values-in-a-pandas-dataframe/39059033

df['Length Of Kernel Groove']= df['Length Of Kernel Groove'].mask(np.random.random(df['Length Of Kernel Groove'].shape) <= 0.15)

# We use the mask function to introduce NaN values into the 'Length Of Kernel Groove' column.

# 'np.random.random(df['Length Of Kernel Groove'].shape)' generates 210 random values between 0 and 1.

# mask() replaces a value with NaN wherever the condition is True, i.e. wherever the random value is less than or equal to 0.15 (about 15% of the rows).

# Where the condition is False, the original value is kept.

# The values are random and no seed is set, so the number and percentage of NaN values vary with each run.

In [None]:
df.info() # Checking for Non-null count of "Length Of Kernel Groove" after applying MASK.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Area                     210 non-null    float64
 1   Perimeter                210 non-null    float64
 2   Compactness              210 non-null    float64
 3   Length Of Kernel         210 non-null    float64
 4   Width Of Kernel          210 non-null    float64
 5   Asymmetry Coefficient    210 non-null    float64
 6   Length Of Kernel Groove  176 non-null    float64
dtypes: float64(7)
memory usage: 11.6 KB


In [None]:
Total_null_values = df['Length Of Kernel Groove'].isnull().sum() 

# Assigning sum of total null values to variable "Total_null_values"

In [None]:
Total_number_of_records = df['Length Of Kernel Groove'].shape[0]

# df['Length Of Kernel Groove'].shape[0] gives the total number of observations. A single column is a Series, so its shape is a one-element tuple (rows,); [0] picks the row count.

# We assign the total number of rows in the selected column to "Total_number_of_records".

In [None]:
print("The percentage of null values in the variable we selected is :{:.2f}%".format((Total_null_values/Total_number_of_records)*100))



# This value may change because we used random function for mask.

The percentage of null values in the variable we selected is :16.19%
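When an exact and repeatable share of missing values is wanted, one alternative to the random mask is to sample row labels with a fixed seed. The following is only a sketch on a hypothetical toy dataframe (`toy`, `feature` and the seed 42 are illustrative names and values, not part of the lab data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({"feature": np.arange(100, dtype=float)})

# Pick exactly 15% of the row labels with a fixed seed, then blank them out.
rows_to_blank = toy.sample(frac=0.15, random_state=42).index
toy.loc[rows_to_blank, "feature"] = np.nan

missing_fraction = toy["feature"].isnull().mean()
print(f"{missing_fraction:.2%} of the values are now NaN")
```

With this approach the percentage no longer changes between runs, at the cost of the rows being chosen by `sample` rather than by an independent draw per row.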


### Step 5 : Split your dataset into 2 dataframes
- Dataframe 1: has all `nan` values for the feature chosen in *Step 4*
- Dataframe 2: has no `nan` values for the feature chosen in *Step 4*


In [None]:
df1= df[df['Length Of Kernel Groove'].isnull()] 

# df1 now holds all the rows with NaN values.
# df['Length Of Kernel Groove'].isnull() checks whether each observation is NaN in the column 'Length Of Kernel Groove' and returns True/False values.
# df[df['Length Of Kernel Groove'].isnull()] returns only the rows where that check is True, i.e. all NaN observations are saved to dataframe df1.


df2= df[df['Length Of Kernel Groove'].notnull()] # df2 now holds all the rows with non-NaN values.

# .notnull() is the exact opposite of .isnull().

# It checks whether each observation is NaN and returns True if it is not NaN.
# We subset the rows on this condition and store all non-null observations in dataframe df2.

In [None]:
print("The shape of dataframe which has NULL values : {}".format(df1.shape)) # This command prints shape of df1

print("The shape of dataframe which has NON-NULL values : {}".format(df2.shape)) # This command prints shape of df2


The shape of dataframe which has NULL values : (34, 7)
The shape of dataframe which has NON-NULL values : (176, 7)


In [None]:
df1.isnull().sum() 

# .isnull() returns True/False for every value, and sum() counts the True values in each column.


Area                        0
Perimeter                   0
Compactness                 0
Length Of Kernel            0
Width Of Kernel             0
Asymmetry Coefficient       0
Length Of Kernel Groove    34
dtype: int64

In [None]:
df1.head() # Checking top 5 records of dataframe which has all NaN's.

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove
7,14.11,14.1,0.8911,5.42,3.302,2.7,
8,16.63,15.46,0.8747,6.053,3.465,2.04,
12,13.89,14.02,0.888,5.439,3.199,3.986,
29,13.45,14.02,0.8604,5.516,3.065,3.531,
35,16.12,15.0,0.9,5.709,3.485,2.27,


In [None]:
df2.isnull().sum() # Checking to see if dataframe 2 has any NULL values.

Area                       0
Perimeter                  0
Compactness                0
Length Of Kernel           0
Width Of Kernel            0
Asymmetry Coefficient      0
Length Of Kernel Groove    0
dtype: int64

In [None]:
df2.head() # Checking top 5 records of dataframe 2

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175


### Step 6 : Use *Dataframe 2* to create a linear regression model to predict the feature chosen in *Step 4* (not the usual target)

#### Splitting the data

In [None]:
X= df2.iloc[:,0:6].values 

# We select all the features of dataframe 2 except the target feature "Length Of Kernel Groove" and store them in variable X.
# .values converts the selected columns into a numpy array, which is convenient for the calculations that follow.


y=df2.iloc[:,-1].values

# We store the values of the target feature "Length Of Kernel Groove" in variable 'y'.
# Again .values converts them to a numpy array, which makes mathematical operations easy.

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1111) 

# train_test_split splits data into 75 - 25 ratio by default. i.e. 75% training data and 25% of test data.

In [None]:
print("Shape of training data is {}".format(X_train.shape))
print("Shape of test data is {}".format(X_test.shape))
print('\n')
print("Shape of training labels is {}".format(y_train.shape))
print("Shape of test labels is {}".format(y_test.shape))

Shape of training data is (132, 6)
Shape of test data is (44, 6)


Shape of training labels is (132,)
Shape of test labels is (44,)


##### Scaling data using Standard Scaler

In [None]:
scaling = StandardScaler() # StandardScaler was already imported from sklearn.preprocessing

# The features are measured on different scales, so we bring them to a common scale.

# Among the available scaling techniques we use StandardScaler(), which standardizes each feature by subtracting its mean and dividing by its standard deviation (unit variance).

scaling.fit(X_train) # the scaler learns the mean and standard deviation of each feature from the training data only

StandardScaler(copy=True, with_mean=True, with_std=True)

In [None]:
X_train_scaled = scaling.transform(X_train) 

# We scale the features with the '.transform' method. The scaled values of X_train are stored in "X_train_scaled".

X_test_scaled = scaling.transform(X_test)

# The scaled values of X_test are stored in "X_test_scaled". The test data is scaled with the mean and standard deviation learned from the training data.

In [None]:
X_train_scaled[0:3] # Checking first 3 elements of X_scaled train data

array([[-1.1403603 , -0.96228078, -2.27919666, -0.60210966, -1.57020088,
         0.18679429],
       [ 1.36015519,  1.25914658,  1.23470033,  1.23455871,  1.47417566,
        -0.41526406],
       [ 0.58568998,  0.598099  ,  0.51493263,  0.5602812 ,  0.57426561,
         1.1263121 ]])

In [None]:
X_test_scaled[0:3]  # Checking first 3 elements of X_scaled test data

array([[-0.97018633, -0.88541478, -1.30460436, -0.82005795, -1.1216135 ,
         0.7786266 ],
       [-0.16446467, -0.13212799,  0.08128376, -0.13442896, -0.2846151 ,
        -1.47365964],
       [ 1.42614102,  1.42825178,  0.54175627,  1.40710111,  1.36202882,
        -0.25676038]])
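To make explicit what `fit` and `transform` do, the sketch below compares `StandardScaler` with the manual calculation on synthetic data. The arrays `X_demo_train` and `X_demo_test` are made up for illustration and are not the lab data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the train and test features.
rng = np.random.default_rng(0)
X_demo_train = rng.normal(loc=10.0, scale=3.0, size=(50, 2))
X_demo_test = rng.normal(loc=10.0, scale=3.0, size=(10, 2))

# fit() learns the per-column mean and standard deviation of the training data.
demo_scaler = StandardScaler().fit(X_demo_train)

# Manual equivalent: subtract the training mean, divide by the training
# standard deviation (population form, ddof=0, which is NumPy's default).
manual_test = (X_demo_test - X_demo_train.mean(axis=0)) / X_demo_train.std(axis=0)

print(np.allclose(demo_scaler.transform(X_demo_test), manual_test))
```

The test data is transformed with the training statistics, which is why the scaler is fitted on the training set only.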

##### Creating the model

In [None]:
model = LinearRegression() 

# We first instantiate the LinearRegression model with default parameters and assign it to the variable 'model'.


model.fit(X_train_scaled,y_train) 

# We fit the model using X_train_scaled (the scaled values) and y_train.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

####  Checking model parameters


In [None]:
model_intercept = model.intercept_ # Assigning the value of the intercept to a variable named 'model_intercept'

print("The model intercept is: {}".format(model_intercept)) # Printing the model intercept.

The model intercept is: 5.395159090909093


In [None]:
model_coef_ = np.round(model.coef_,3) 

# rounding co-efficients of model to 3 decimals using numpy.round function and assigning to variable model_coef_

In [None]:
data={'Input Variables': ['Area','Perimeter','Compactness','Length Of Kernel','Width Of Kernel','Asymmetry Coefficient'],
       'Model_co-efficients': [model_coef_[0],model_coef_[1],model_coef_[2],model_coef_[3],model_coef_[4],model_coef_[5]]}

# We create a dictionary named "data" whose keys are the column names ('Input Variables', 'Model_co-efficients')
# and whose values are the feature names and the coefficient the model assigned to each feature.
# It is used to display the input variables and their coefficients as a dataframe.

In [None]:
co_efficients_df = pd.DataFrame(data) 

# We use the dictionary "data" created earlier to show the input variables and their coefficients in tabular form.

In [None]:
co_efficients_df

Unnamed: 0,Input Variables,Model_co-efficients
0,Area,1.225
1,Perimeter,-0.685
2,Compactness,-0.104
3,Length Of Kernel,0.283
4,Width Of Kernel,-0.325
5,Asymmetry Coefficient,0.058
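These coefficients belong to the standardized features, not to the raw measurements. If raw-unit coefficients are ever needed, they can be recovered from the scaler's `mean_` and `scale_` attributes. The sketch below shows the conversion on synthetic data; `X_raw`, `y_demo`, `sc` and `lr` are hypothetical names, not objects from this notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales.
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=[5.0, 50.0], scale=[1.0, 10.0], size=(80, 2))
y_demo = 2.0 * X_raw[:, 0] - 0.3 * X_raw[:, 1] + rng.normal(scale=0.1, size=80)

sc = StandardScaler().fit(X_raw)
lr = LinearRegression().fit(sc.transform(X_raw), y_demo)

# y = b + sum_j c_j * (x_j - mean_j) / scale_j
#   = (b - sum_j c_j * mean_j / scale_j) + sum_j (c_j / scale_j) * x_j
raw_coef = lr.coef_ / sc.scale_
raw_intercept = lr.intercept_ - np.sum(lr.coef_ * sc.mean_ / sc.scale_)

pred_scaled = lr.predict(sc.transform(X_raw))   # model applied to scaled inputs
pred_raw = X_raw @ raw_coef + raw_intercept     # converted formula on raw inputs
print(np.allclose(pred_scaled, pred_raw))
```

Either route gives the same predictions; what does not work is mixing the two, i.e. multiplying raw feature values by the scaled-space coefficients.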


##### Evaluating the model on the train and test sets

In [None]:
y_pred = model.predict(X_test_scaled) # 'predict' method, predicts the values of scaled X_test using linear model.

#print("The predicted values of model {}".format(y_pred))

In [None]:
print(y_pred) # Checking predicted values

[5.1262267  5.24413929 6.04899535 5.94718544 5.75596691 5.35559632
 4.82013892 6.48800905 5.74826887 6.3377064  5.90576682 5.27577736
 5.10152402 5.31094069 5.18705772 5.23999161 5.23299823 5.80599662
 4.97272321 5.25623613 5.91745018 4.77590744 4.97971362 5.06038864
 4.75703274 5.0255435  5.34761218 4.96248131 5.63490638 6.10226492
 5.22773563 5.18899435 4.73960687 6.31981287 5.0552103  4.82404564
 6.17730384 5.86825082 5.55927662 5.07256337 5.43187553 6.4992608
 5.49519096 4.79417967]


##### $R^2$ value for train and test

In [None]:
R2_train = model.score(X_train_scaled,y_train)  # R^2 score on the training data, stored in 'R2_train'

# 'score()' takes the features and the true target values and returns the coefficient of determination R^2 of the predictions.

R2_test = model.score(X_test_scaled,y_test) # R^2 score on the test data, stored in 'R2_test'

# R^2 tells us how much of the total variation in the target variable is explained by the explanatory variables.

In [None]:
print("R^2 value for Training set accuracy: {:.2f} %".format(R2_train*100)) # Printing Training accuracy (R^2) value of training set.


print("R^2 value for Testing set accuracy: {:.2f} %".format(R2_test*100)) # Printing Testing accuracy (R^2) value of test set.



R^2 value for Training set accuracy: 92.06 %
R^2 value for Testing set accuracy: 93.47 %


In [None]:
# Cross checking r2 _score using 'r2_score' function

from sklearn.metrics import r2_score

R2_method_2 = r2_score(y_test, y_pred) #r2_score takes arguments (true values, predicted values)
R2_method_2 



0.911415673068204
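For reference, the coefficient of determination can also be computed by hand as $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below checks this against `r2_score` on a small made-up example; the arrays are illustrative and unrelated to the seeds data:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Here the residual sum of squares is 0.75 and the total sum of squares is 20, so both values come out at 0.9625.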

### Step 7 : Use the model you created in *Step 6* to predict the missing values in *Dataframe 1*
- At the end of this step, *Dataframe 1* will have the `nan` values replaced with the predictions from the model you created in *Step 6*


In [None]:
# Code reference : https://stackoverflow.com/questions/44097633/imputing-missing-values-using-a-linear-regression-in-python

# The model was fitted on standardized features, so its coefficients apply to scaled inputs.
# Applying them to the raw, unscaled values gives predictions far outside the observed range of the column.
# We therefore scale the predictors of df1 with the scaler fitted in Step 6 before predicting.

X_missing = df1.iloc[:,0:6].values # the six predictor columns of the rows with NaN, as a numpy array

X_missing_scaled = scaling.transform(X_missing) # reuse the mean and standard deviation learned from the training data

df1 = df1.copy() # work on an explicit copy so that pandas does not raise a SettingWithCopyWarning

df1['Length Of Kernel Groove'] = model.predict(X_missing_scaled)

# model.predict() evaluates the fitted linear model for every row:
# y_hat = intercept + m1*X1 + m2*X2 + ... + m6*X6, where X1..X6 are the scaled predictors.

# Every row of df1 has NaN in 'Length Of Kernel Groove', so the whole column is replaced by the predictions.

In [None]:
df1 # Checking the dataframe df1 after imputing missing values using Linear model.

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove
5,14.38,14.21,0.8951,5.386,3.312,2.462,11.726548
13,13.78,14.06,0.8759,5.479,3.156,3.136,11.259294
20,14.16,14.4,0.8584,5.658,3.129,3.072,11.47769
22,15.88,14.9,0.8988,5.618,3.507,0.7651,12.823882
25,16.19,15.16,0.8849,5.833,3.421,0.903,13.052192
27,12.74,13.67,0.8564,5.395,2.956,2.504,10.375998
39,14.28,14.17,0.8944,5.397,3.298,6.685,11.858357
48,14.79,14.52,0.8819,5.545,3.291,2.704,12.008386
50,14.43,14.4,0.8751,5.585,3.272,3.975,11.779465
53,14.33,14.28,0.8831,5.504,3.199,3.328,11.710652


In [None]:
df1.isnull().sum()

# After the imputation we check for null values again and find that none remain in dataframe 1.

Area                       0
Perimeter                  0
Compactness                0
Length Of Kernel           0
Width Of Kernel            0
Asymmetry Coefficient      0
Length Of Kernel Groove    0
dtype: int64
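One way to make sure the same scaling is applied at fit time and at prediction time is to wrap the scaler and the model in a scikit-learn pipeline. The following is a sketch on a hypothetical toy dataframe (the column names `a`, `b` and `target_feature` are invented for illustration), not the code used in this lab:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame: 'target_feature' is an exact linear function of 'a' and 'b'.
rng = np.random.default_rng(2)
toy = pd.DataFrame({"a": rng.normal(size=60), "b": rng.normal(size=60)})
toy["target_feature"] = 1.5 * toy["a"] - 2.0 * toy["b"] + 4.0

# Blank out 15% of the values to imitate missing data.
toy.loc[toy.sample(frac=0.15, random_state=0).index, "target_feature"] = np.nan

missing = toy["target_feature"].isnull()
predictors = ["a", "b"]

# The pipeline scales and fits in one step, and scales again inside predict().
imputer = make_pipeline(StandardScaler(), LinearRegression())
imputer.fit(toy.loc[~missing, predictors], toy.loc[~missing, "target_feature"])

toy.loc[missing, "target_feature"] = imputer.predict(toy.loc[missing, predictors])
print(toy["target_feature"].isnull().sum())
```

Because the toy target is exactly linear, the imputed values match the values that were removed; on real data they would only approximate them.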

### Step 8 : Create a final dataframe by combining *Dataframe 1* and *Dataframe 2*

             - "final_df" should have same dimensions as dataframe "df"

In [None]:
final_df = pd.concat([df1,df2]) 

# pd.concat() stacks the two dataframes on top of each other; both have the same columns, so the rows of df1 are followed by the rows of df2.


In [None]:
print('The Dimensions of Final dataframe after combining df1 and df2 is : {}'.format(final_df.shape))
print('\n')
print('The Dimensions of Original dataframe when we loaded is {}'.format(df.shape))

The Dimensions of Final dataframe after combining df1 and df2 is : (210, 7)


The Dimensions of Original dataframe when we loaded is (210, 7)


In [None]:
# cor=final_df.corr()
# cor

In [None]:
# cor['Length Of Kernel Groove'][abs(cor['Length Of Kernel Groove']) > 0.1 ]

In [None]:
print(final_df.head(20)) # Displays the top 20 records of the final dataframe after combining dataframes 1 and 2.

# Because we concatenated two dataframes into "final_df", the index is no longer sorted.
# Hence we will sort the index first.
# final_df also lacks the original target variable, so we need to add the original target "Seed Type" as well.

      Area  Perimeter  Compactness  Length Of Kernel  Width Of Kernel  \
5    14.38      14.21       0.8951             5.386            3.312   
13   13.78      14.06       0.8759             5.479            3.156   
20   14.16      14.40       0.8584             5.658            3.129   
22   15.88      14.90       0.8988             5.618            3.507   
25   16.19      15.16       0.8849             5.833            3.421   
27   12.74      13.67       0.8564             5.395            2.956   
39   14.28      14.17       0.8944             5.397            3.298   
48   14.79      14.52       0.8819             5.545            3.291   
50   14.43      14.40       0.8751             5.585            3.272   
53   14.33      14.28       0.8831             5.504            3.199   
62   12.36      13.19       0.8923             5.076            3.042   
63   13.22      13.84       0.8680             5.395            3.070   
66   14.34      14.37       0.8726             5.63

In [None]:
final_df.sort_index(inplace=True) 

# This command sorts the dataframe by its index in ascending order. With inplace=True the dataframe is modified directly and nothing is returned.

In [None]:
final_df # displaying dataframe to see if index values are sorted. 

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220000
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956000
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825000
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805000
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175000
...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870000
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003000
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056000
208,11.84,13.21,0.8521,5.175,2.836,3.598,9.732637
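The effect of `pd.concat` followed by `sort_index` is easiest to see on a tiny example. The sketch below uses a made-up five-row dataframe (`toy`, `part_missing` and `part_complete` are illustrative names):

```python
import pandas as pd

# Made-up frame, split into two parts the way df1 and df2 were split.
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})
part_missing = toy.loc[[1, 3]]
part_complete = toy.loc[[0, 2, 4]]

# concat keeps the original row labels, in the order the parts were given.
combined = pd.concat([part_missing, part_complete])
index_before = list(combined.index)

# sort_index restores the original row order.
combined = combined.sort_index()
index_after = list(combined.index)

print(index_before, index_after)
```

Keeping the original labels matters later on, because assigning a column from another dataframe aligns rows by label.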


### Checking for null values in final dataframe.

In [None]:
final_df.isnull().sum() 

# This command checks for null values in the entire dataset. isnull() returns True for every null value.

# A Boolean True counts as 1, so sum() returns the number of null values present in each column.

# The output below shows that there are no null values in the dataframe.

Area                       0
Perimeter                  0
Compactness                0
Length Of Kernel           0
Width Of Kernel            0
Asymmetry Coefficient      0
Length Of Kernel Groove    0
dtype: int64

#### Checking our original dataframe where we have original target variable.

In [None]:
df_org # Displays original dataframe

# We use this original dataframe to extract target variable and add it to final_df (imputed dataframe)

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove,Seed Type
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1
...,...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870,3
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003,3
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056,3
208,11.84,13.21,0.8521,5.175,2.836,3.598,5.044,3


In [None]:
seed_typ = df_org['Seed Type'] 

# Assigning the values of the 'Seed Type' column of the original dataframe df_org to a variable 'seed_typ'


In [None]:
final_df['Seed Type'] = seed_typ

# This command creates a new column called "Seed Type" in the final dataframe. The values are matched by index, so every row receives its original target label.

In [None]:
final_df.head() 

# Checking the top 5 observations to see whether the new column and its values are displayed.

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove,Seed Type
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1
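Assigning a Series as a new column aligns the values by index label, not by position. The small sketch below illustrates this with made-up data (`features` and `labels` are hypothetical names):

```python
import pandas as pd

# The rows of 'features' are deliberately stored out of order.
features = pd.DataFrame({"x": [3.0, 1.0, 2.0]}, index=[2, 0, 1])
labels = pd.Series(["a", "b", "c"], index=[0, 1, 2])

# Each row receives the label that shares its index value.
features["label"] = labels
print(features)
```

The row with index 2 receives "c" although it is stored first, so the labels stay attached to the correct observations even when the dataframe has not been sorted.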


### Step 9 : Create a k nearest neighbours classifier (`k = 3`) for the dataframe you created in *Step 8*

The target of the seeds dataset, "Seed Type", is a class label, so we perform classification instead of regression.

##### Assigning input variables and target variable to X1 and Y1 respectively

In [None]:
X1= final_df.iloc[:,0:7].values # we used pandas iloc, integer based indexing for selecting predictor variables.


Y1 = final_df.iloc[:,-1].values # -1 selects the last column of the dataframe, which is target variable 'seed type'
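
The slicing rules behind `iloc[:,0:7]` and `iloc[:,-1]` can be checked on a small frame. This is a minimal sketch with hypothetical data (the frame `toy` and its column names are invented for illustration); the stop index of a slice is excluded, so `0:3` selects three columns just as `0:7` selects seven.

```python
import pandas as pd

# Toy frame with 3 feature columns and a label column (hypothetical data).
toy = pd.DataFrame({
    "f1": [1.0, 2.0],
    "f2": [3.0, 4.0],
    "f3": [5.0, 6.0],
    "label": [1, 2],
})

X = toy.iloc[:, 0:3].values   # columns at positions 0, 1, 2 (stop index 3 is excluded)
y = toy.iloc[:, -1].values    # last column only

print(X.shape)  # (2, 3)
print(y.shape)  # (2,)
```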

####  Split the data into train and test sets

In [None]:
X_final_train,X_final_test, y_final_train, y_final_test = train_test_split(X1,Y1,test_size= 0.28, random_state = 1111)

# We set test_size to 0.28, so the test set holds 28% of the data and the training set 72%. random_state is set to 1111; it can be any integer. Fixing it makes the shuffle reproducible, so the same rows land in the training and test sets on every run.
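
The effect of `random_state` can be seen on placeholder data. This sketch assumes nothing about the seeds data beyond its size (210 rows, 7 features); `X_demo` and `y_demo` are invented arrays, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows and features as the seeds data.
X_demo = np.arange(210 * 7).reshape(210, 7)
y_demo = np.arange(210) % 3 + 1

split_a = train_test_split(X_demo, y_demo, test_size=0.28, random_state=1111)
split_b = train_test_split(X_demo, y_demo, test_size=0.28, random_state=1111)
split_c = train_test_split(X_demo, y_demo, test_size=0.28, random_state=2222)

print(split_a[0].shape, split_a[1].shape)        # train and test feature shapes
print(np.array_equal(split_a[1], split_b[1]))    # same seed: identical test rows
print(np.array_equal(split_a[1], split_c[1]))    # different seed: compare the test rows
```

The first two splits are identical because they share a seed; the third uses another seed and is shuffled independently.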

In [None]:

print("Shape of training data is {}".format(X_final_train.shape)) # printing shape of train data
print("Shape of test data is {}".format(X_final_test.shape)) # printing shape of test data
print("\n") # Prints blank lines to separate the two groups of output.

print("Shape of training labels is {}".format(y_final_train.shape)) # printing shape of train data labels
print("Shape of test labels is {}".format(y_final_test.shape)) # printing shape of test data labels


Shape of training data is (151, 7)
Shape of test data is (59, 7)


Shape of training labels is (151,)
Shape of test labels is (59,)


#### Scaling the data

In [None]:
scaler_final = StandardScaler() # Initialise the standard scaler: for each feature it subtracts the mean and divides by the standard deviation.

scaler_final.fit(X_final_train) # Fit the scaler on the training data only, so the test data does not influence the mean and standard deviation.

X_final_train_scaled = scaler_final.transform(X_final_train) # Returns the scaled values of X_final_train.


X_final_test_scaled = scaler_final.transform(X_final_test) # Scales X_final_test using the statistics learned from the training data.
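
What the scaler computes can be sketched directly in NumPy. This is an illustration on random placeholder arrays (all names ending in `_demo` are hypothetical), assuming the default `StandardScaler` behaviour of centring on the mean and dividing by the population standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=3.0, size=(151, 7))  # made-up values
X_test_demo = rng.normal(loc=10.0, scale=3.0, size=(59, 7))

# Statistics come from the training rows only.
mean_ = X_train_demo.mean(axis=0)
std_ = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

X_train_demo_scaled = (X_train_demo - mean_) / std_
X_test_demo_scaled = (X_test_demo - mean_) / std_  # reuse the training statistics

print(np.allclose(X_train_demo_scaled.mean(axis=0), 0.0))
print(np.allclose(X_train_demo_scaled.std(axis=0), 1.0))
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected because its statistics were never used.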


#### Instantiating model and setting model parameters.


In [None]:
clf_final = KNeighborsClassifier(n_neighbors=3)

# We instantiate the model by assigning a KNeighborsClassifier to the variable "clf_final". We set the number of neighbours with n_neighbors=3, i.e. K=3, which is a hyperparameter of KNN.

#### Fitting the model.

In [None]:
clf_final.fit(X_final_train_scaled,y_final_train) # We fit the model to scaled X_final_train and y_final_train data.

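# KNN does not learn parameters; fit() stores (and indexes) the scaled training data so that neighbours can be looked up at prediction time.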

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
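
The settings shown above (`n_neighbors=3`, `metric='minkowski'` with `p=2`, `weights='uniform'`) amount to a majority vote among the three closest training points under Euclidean distance. A minimal NumPy sketch of that idea follows; `knn_predict` and the toy arrays are hypothetical, and the sketch does not reproduce scikit-learn's tie-breaking or its search structures.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance to every row
    nearest = np.argsort(distances)[:k]                         # indices of the k closest rows
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up, already scaled 2D points from two classes.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
y_toy = np.array([1, 1, 1, 3, 3, 3])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.1])))  # 1
print(knn_predict(X_toy, y_toy, np.array([1.8, 2.0])))   # 3
```

Because the vote depends on distances, features on larger scales would dominate without the scaling step performed earlier.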

#### Evaluating the model

In [None]:
acc_train_final = clf_final.score(X_final_train_scaled, y_final_train) # Mean accuracy on the training set
acc_test_final = clf_final.score(X_final_test_scaled, y_final_test) # Mean accuracy on the test set

# score() method returns the mean accuracy on the given data and labels.

print("Training set accuracy: {:.2f} %".format(acc_train_final*100))  # prints training set accuracy of Imputed_dataframe
print("Test set accuracy: {:.2f} %".format(acc_test_final*100))   # Prints test set accuracy of Imputed dataframe.

Training set accuracy: 94.04 %
Test set accuracy: 93.22 %
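
Mean accuracy is the fraction of predictions that match the true labels. The sketch below uses hypothetical label arrays (not the notebook's predictions) with 55 correct answers out of 59, which is the count that yields 93.22 % on a test set of this size:

```python
import numpy as np

def mean_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

# Hypothetical labels: 59 test samples with 4 mistakes.
y_true_demo = np.array([1] * 59)
y_pred_demo = np.array([1] * 55 + [2] * 4)

print("{:.2f} %".format(mean_accuracy(y_true_demo, y_pred_demo) * 100))  # 93.22 %
```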


### Step 10 : Create a k nearest neighbours Classifier (`k = 3`) for the original dataframe (from *Step 2* and maybe *Step 3*)

#### df_org contains the original dataframe data

In [None]:

# df_org contains the original dataframe

df_org.head() 

# Checking the top 5 observations of the original dataframe.

Unnamed: 0,Area,Perimeter,Compactness,Length Of Kernel,Width Of Kernel,Asymmetry Coefficient,Length Of Kernel Groove,Seed Type
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1


#### Assigning input and target variables to X2 and Y2

In [None]:
X2 = df_org.iloc[:,0:7].values # We use pandas iloc (integer-based indexing) to select the predictor variables; the slice 0:7 selects all 7 of them.

# .values returns the values as a NumPy array.

Y2 = df_org.iloc[:,-1].values 

# -1 selects the last column of the dataframe. Y2 contains the target variable values as a NumPy array.

#### Splitting the dataset into training and testing sets

In [None]:
X_org_train,X_org_test, y_org_train, y_org_test = train_test_split(X2,Y2,test_size= 0.28, random_state = 2222)


# Here the original dataframe is split into training and testing sets. We set test_size to 0.28, which means 28% of the data is used for testing and 72% for training the model. Note that random_state is 2222 here but 1111 in Step 9, so the two models are trained and tested on different rows.
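
The two splits use different seeds (1111 in Step 9, 2222 here), so the imputed and original models are tested on different rows. One way to make the comparison paired, sketched here as an alternative and not what this notebook does, is to draw the row indices once and reuse them for both arrays. All arrays below are random placeholders with hypothetical names.

```python
import numpy as np

n_rows, n_test = 210, 59
rng = np.random.default_rng(1111)
shuffled = rng.permutation(n_rows)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Placeholder arrays standing in for the imputed and original feature matrices.
X_imputed_demo = rng.normal(size=(n_rows, 7))
X_original_demo = rng.normal(size=(n_rows, 7))

# Both datasets are split on exactly the same rows.
X_imp_train, X_imp_test = X_imputed_demo[train_idx], X_imputed_demo[test_idx]
X_org_train_demo, X_org_test_demo = X_original_demo[train_idx], X_original_demo[test_idx]

print(X_imp_train.shape, X_imp_test.shape)
print(X_org_train_demo.shape, X_org_test_demo.shape)
```

With `train_test_split`, passing the same `random_state` to both calls has the same effect, provided both inputs have the same number of rows.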


In [None]:
print("Shape of training data is {}".format(X_org_train.shape)) # prints shape of original train data
print("Shape of test data is {}".format(X_org_test.shape)) # prints shape of Original test data
print("\n")

print("Shape of training labels is {}".format(y_org_train.shape)) # prints shape of Original train data labels
print("Shape of test labels is {}".format(y_org_test.shape)) # prints shape of Original test data labels

Shape of training data is (151, 7)
Shape of test data is (59, 7)


Shape of training labels is (151,)
Shape of test labels is (59,)


#### Scaling the data

In [None]:
scaler_org = StandardScaler() # We initialise StandardScaler
 
scaler_org.fit(X_org_train) # We fit the scaler on the X_org_train data

X_org_train_scaled = scaler_org.transform(X_org_train) # this command returns the scaled values of X_org_train


X_org_test_scaled = scaler_org.transform(X_org_test) # this command returns the scaled values of X_org_test

#### Instantiating model and setting model parameters.

In [None]:
clf_org = KNeighborsClassifier(n_neighbors=3)

# We instantiate the model by assigning a KNeighborsClassifier to the variable "clf_org". We set the number of neighbours with n_neighbors=3, i.e. K=3, which is a hyperparameter of KNN.


#### Fitting the model

In [None]:
clf_org.fit(X_org_train_scaled,y_org_train) 

# We fit the model by giving scaled X train and y train data.

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

#### Evaluating the model 

In [None]:
acc_train_org = clf_org.score(X_org_train_scaled, y_org_train) # Mean accuracy on the training set, assigned to the variable acc_train_org.
acc_test_org = clf_org.score(X_org_test_scaled, y_org_test) # Mean accuracy on the test set, assigned to the variable acc_test_org.

print("Training set accuracy: {:.2f} %".format(acc_train_org*100))  # prints Training set accuracy
print("Test set accuracy: {:.2f} %".format(acc_test_org*100))   # Prints testing set accuracy

Training set accuracy: 95.36 %
Test set accuracy: 89.83 %


### Step 11 :  Is there any significant performance difference between *Step 9* and *Step 10*?

In [None]:
data1 = {
    'Final_imputed_dataframe' : [acc_train_final*100, acc_test_final*100],
    'Original_dataframe' : [acc_train_org*100,acc_test_org*100 ]
}

# We create a dictionary whose keys (Final_imputed_dataframe, Original_dataframe) map to the training and
# testing accuracies obtained from the respective models.
#
# The dictionary lets us show the train and test accuracies of both the imputed and the original dataframes in tabular form.

In [None]:
accuracy_display = pd.DataFrame(data1,index=['Training accuracy (%)','Testing accuracy (%)'])

# The command above creates a dataframe from the dictionary passed as the first argument. We set the index to training and testing accuracy labels so the rows are clearly identified.

accuracy_display # displays the dataframe. 

Unnamed: 0,Final_imputed_dataframe,Original_dataframe
Training accuracy (%),94.039735,95.364238
Testing accuracy (%),93.220339,89.830508


#### Observation :

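1. From the table above, the test accuracy of "Final_imputed_dataframe" (93.22 %) is higher than that of "Original_dataframe" (89.83 %).

2. We don't see any large difference between the test accuracies of the original and imputed dataframes.

3. The KNN model on the imputed seeds dataframe gives a slightly higher test accuracy, by about 3.39 percentage points, which corresponds to 2 of the 59 test samples. Training accuracy is slightly higher for the original dataframe (95.36 % versus 94.04 %).

4. Overall, the dataset is small (210 observations, 59 of them in the test set) and the two models were split with different random_state values, so the model has limited data to learn from and a gap this small should not be read as a significant performance difference.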
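
A rough way to judge whether the gap matters is to compare it with the sampling noise of an accuracy measured on 59 samples. The sketch below assumes the test predictions behave like independent Bernoulli trials, which is an approximation the notebook does not verify. The counts 55 and 53 are the numbers of correct predictions implied by 93.22 % and 89.83 % on 59 test samples.

```python
import math

n_test = 59
correct_imputed, correct_original = 55, 53  # implied by 93.22 % and 89.83 % of 59

acc_imputed = correct_imputed / n_test
acc_original = correct_original / n_test

# Binomial standard error of an accuracy estimated from n_test samples.
se_imputed = math.sqrt(acc_imputed * (1 - acc_imputed) / n_test)

print("Difference: {:.2f} points".format((acc_imputed - acc_original) * 100))
print("Standard error: {:.2f} points".format(se_imputed * 100))
```

Under this approximation the difference is about the size of one standard error, which supports reading it as noise rather than as an effect of the imputation.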