##**Aim: Identify datasets related to the Robotics and Automation domain and apply data preprocessing techniques**##

##**Theory**
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first step in developing a machine learning model, and one of the most important.

In a machine learning project, the data we receive is rarely clean and well structured. It has to be cleaned and prepared before any operation is performed on it, and data preprocessing covers these tasks.

---
> **Need of Data Preprocessing**
---
Real-world data generally contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.

Data preprocessing covers the tasks required to clean the data and make it suitable for a machine learning model; it can also improve the accuracy and efficiency of the model.


To get better results from a model in a machine learning project, the data has to be in a proper format.

Some machine learning models need data in a specific format. For example, many Random Forest implementations do not accept null values, so the null values in the raw dataset must be handled before the algorithm can be run.


Another consideration is that the dataset should be formatted so that several machine learning and deep learning algorithms can be run on the same dataset and the best of them chosen.

---

> **Steps in Data Preprocessing**

---
1. Handling of missing values
2. Categorical-Encoding
3. Data Scaling etc.

In this experiment we will study how to handle missing values, how to perform categorical encoding and how to scale the data.

#**1. Handling of missing values**

**Import required libraries**

In [None]:
import pandas as pd
import numpy as np

**Import data "4_movie_scores.csv" (.csv form)** (dataset is uploaded in classroom)

In [None]:
df=pd.read_csv('/content/4_movie_scores.csv')
df

**Check the first five rows of the imported DataFrame**

In [None]:
df.head()

**Checking and Selecting Null Values**

In [None]:
df.isnull()                                       # null positions are marked True

In [None]:
df.notnull()                                         # non-null positions are marked True

In [None]:
df.columns

**Null and non-null values from a particular row or column**

In [None]:
df['first_name']

In [None]:
df[df['first_name'].isnull()]

In [None]:
df['first_name'].notnull()

In [None]:
df[df['first_name'].notnull()]

**Non-null and null values from multiple rows or columns**

In [None]:
df['pre_movie_score'].isnull()

In [None]:
df['Gender'].notnull()

In [None]:
df[(df['pre_movie_score'].isnull()) & df['Gender'].notnull()]
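The row selections above combine boolean masks. A minimal sketch of the same idea, using a small hypothetical table (the values are invented for illustration and are not the contents of "4_movie_scores.csv"):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the movie-score example.
toy = pd.DataFrame({
    "first_name": ["Ann", None, "Raj", "Lee"],
    "Gender": ["f", "m", None, "m"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0],
})

# Each condition is a boolean Series with one True/False per row.
score_missing = toy["pre_movie_score"].isnull()
gender_known = toy["Gender"].notnull()

# & keeps rows where both conditions hold, | keeps rows where either holds.
both = toy[score_missing & gender_known]
either = toy[score_missing | gender_known]
print(both)
print(either)
```

When the conditions are written inline rather than stored in variables, each one should be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators.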

**How to drop or replace null values?**

**Actual Dataset**

In [None]:
df

**count of missing values**

**Missing value count along Columns**

In [None]:
df.isna().sum()

**Missing value count of complete dataframe**

In [None]:
df.isna().sum().sum()
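The counts above follow from the fact that `True` is summed as 1. A small sketch on a hypothetical frame (invented values) shows the per-column, per-row and overall counts side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three missing entries.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
})

per_column = toy.isna().sum()          # one count per column
per_row = toy.isna().sum(axis=1)       # one count per row
total = int(toy.isna().sum().sum())    # single number for the whole frame

print(per_column)
print(per_row)
print(total)
```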

**Drop rows containing null values**

In [None]:
df

In [None]:
df.dropna()

Unnamed: 0,first_name,last_name,age,Gender,pre_movie_score,post_movie_score
0,Root,Joss,36.0,m,8.0,9.0
3,Sofie,Miller,39.0,f,7.0,8.0
4,Emma,Roy,84.0,f,6.0,8.0


In [None]:
df.dropna(axis=1)

0
1
2
3
4


**Keep rows containing at least a given number of non-null values (`thresh`)**

In [None]:
df

In [None]:
df.dropna(thresh=1)

In [None]:
df.dropna(thresh=2)

In [None]:
df.dropna(thresh=5)

**Drop columns containing null values**

In [None]:
df.dropna(axis=1)

**Keep columns containing at least four non-null values**

In [None]:
df.dropna(thresh=4,axis=1)
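The `thresh` argument is easy to misread: it is the minimum number of non-null values a row (or column) needs in order to be kept, not the number of nulls allowed. A minimal sketch on a hypothetical frame (invented values):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 2 is entirely empty.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
})

rows_any = toy.dropna()                      # drop rows with at least one null
rows_all = toy.dropna(how="all")             # drop only rows that are entirely null
rows_thresh2 = toy.dropna(thresh=2)          # keep rows with >= 2 non-null values
cols_thresh3 = toy.dropna(axis=1, thresh=3)  # keep columns with >= 3 non-null values

print(rows_any.index.tolist())
print(rows_all.index.tolist())
print(rows_thresh2.index.tolist())
print(cols_thresh3.columns.tolist())
```

None of these calls modifies `toy`; each returns a new dataframe, which is why the notebook cells above leave `df` unchanged.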

**Fill null values**

In [None]:
df.fillna("NEW")

In [None]:
df['age'].fillna(5)

In [None]:
df['first_name'].fillna("Empty")

In [None]:
df

In [None]:
df['first_name'] = df['first_name'].fillna("Empty")

In [None]:
df

**Fill 'pre_movie_score' with mean value of 'pre_movie_score'**

**Find mean value of 'pre_movie_score' column**

In [None]:
df['pre_movie_score'].mean()

In [None]:
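df['pre_movie_score'].fillna(df['pre_movie_score'].mean())   # fill only this column; df.fillna(...) would put the score mean into every column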

In [None]:
df['pre_movie_score'].mode()

0    6.0
1    7.0
2    8.0
dtype: float64

In [None]:
df['pre_movie_score'].median()
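Mean, median and mode can each be used as a fill value, but `mode()` returns a Series because several values may tie, as in the output above. A minimal sketch on a hypothetical score column (invented values), picking the first mode as one possible convention:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry; 6.0 and 7.0 tie for the mode.
scores = pd.Series([6.0, 6.0, 7.0, 7.0, 8.0, np.nan, 10.0])

filled_mean = scores.fillna(scores.mean())      # NaN is ignored when computing the mean
filled_median = scores.fillna(scores.median())  # median is less sensitive to the outlier 10.0
modes = scores.mode()                           # Series holding every tied mode, sorted
filled_mode = scores.fillna(modes[0])           # assumption: take the smallest mode

print(filled_mean[5], filled_median[5], filled_mode[5])
```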

##**Import dataset (Students_expenses.csv)**

##**Replace missing values with mean and median**

**Replacing with mean**

In [None]:
dz.mean()

In [None]:
dz.fillna(dz.mean())

**Replacing with median values**

In [None]:
dz.fillna(dz.median())

##**KNN imputer for filling missing values**

**Import "Before_imputation.csv" dataset**

In [None]:
Before_imputation=pd.read_csv('/content/Before_imputation.csv')
Before_imputation

##**Replace the missing values with KNNImputer**

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
After_imputation = imputer.fit_transform(Before_imputation)
After_imputation                                              # fit_transform returns a NumPy array, not a DataFrame

##**Convert the array into dataframe**

In [None]:
Before_imputation.columns

In [None]:
After_imputation=pd.DataFrame(After_imputation,columns=['Maths', 'Chemistry', 'Physics', 'Biology'])
After_imputation
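To see what `KNNImputer` does with a missing entry, it can help to use data small enough to check by hand. The sketch below uses hypothetical marks (invented values, not the contents of "Before_imputation.csv") and reuses the input's column names instead of typing them out:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical marks: rows 0-2 are strong students, rows 3-4 are weak ones.
toy = pd.DataFrame({
    "Maths":   [80.0, 82.0, 84.0, 40.0, 42.0],
    "Physics": [78.0, np.nan, 82.0, 45.0, 43.0],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy),  # returns a NumPy array
    columns=toy.columns,         # restore the original column names
    index=toy.index,
)

# Row 1 is closest to rows 0 and 2 on Maths, so its Physics value
# is the average of their Physics marks: (78 + 82) / 2.
print(filled.loc[1, "Physics"])
```

Passing `columns=toy.columns` avoids hard-coding the names, which is one way to keep the conversion step correct if the dataset's columns change.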

---
#**2. CATEGORICAL ENCODING**



**Import "Encoding_data.csv"**  

**Import label encoder**

In [None]:
from sklearn.preprocessing import LabelEncoder

**Check unique values in column country**

In [None]:
df.Country.unique()

In [None]:
z= df.Country.unique()

In [None]:
len(z)

**Encode labels in column 'Country'**

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Country_labels']= label_encoder.fit_transform(df['Country'])
df

In [None]:
df_new=df.drop(columns=['Country'])
df_new
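The integer codes produced by `LabelEncoder` follow the sorted order of the class names, and the mapping can be inspected and reversed. A minimal sketch on hypothetical country names (invented values, not the contents of "Encoding_data.csv"):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical country column.
countries = pd.Series(["India", "France", "Spain", "France", "India"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

print(encoder.classes_)                   # class names in sorted order; position = code
print(codes)                              # one integer code per row
decoded = encoder.inverse_transform(codes)  # map the codes back to the names
print(decoded)
```

Because the codes are ordered integers, a model may read Spain (2) as "larger" than France (0). That implied order is one reason one-hot encoding is often preferred for input features that have no natural ranking.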

**Importing one hot encoder**

In [None]:
from sklearn.preprocessing import OneHotEncoder

## **Reshape the 1-D Country array to 2-D, because fit_transform expects a 2-D input, then fit the encoder**

In [None]:
df.Country.ndim

In [None]:
df.Country.values

In [None]:
df.Country.values.reshape(-1,1)

In [None]:
df.Country.values.reshape(-1,1).ndim

In [None]:
onehotencoder = OneHotEncoder()
X = onehotencoder.fit_transform(df.Country.values.reshape(-1,1)).toarray()
X

**To add this back into the original dataframe**

In [None]:
X.shape

In [None]:

df_OneHot = pd.DataFrame(X, columns = ["Country_"+str(int(i)) for i in range(X.shape[1])])
df_OneHot

**Concatenate df_OneHot with the original dataframe**

In [None]:
df2_new = pd.concat([df, df_OneHot], axis=1)
df2_new

In [None]:
df2_new.drop(columns=['Country'],inplace=True)
df2_new

#**3. Data Scaling**

**import dataset "scaling_data.csv"**

**Graphical Visualization of data using matplotlib (x and y data)**

In [None]:
import matplotlib.pyplot as plt
plt.scatter( data.X, data.Y)
plt.axvline(x=0, c="red", label="x=0")
plt.axhline(y=0, c="yellow", label="y=0")
plt.show()

**Min-max scaler**

**Import Min-max Scaler (normalization_scaler)**

In [None]:
from sklearn.preprocessing import MinMaxScaler

**Define, fit and transform min-max scaler**

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()                         # y = (x - min) / (max - min), computed per column
min_max_scaled = scaler.fit_transform(data)

**Create dataframe of scaled data**

In [None]:
min_max_scaled = pd.DataFrame(min_max_scaled, columns= ['X','Y','Z'])
min_max_scaled
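The formula in the comment above can be checked by hand. The sketch below uses hypothetical X and Y values (invented, not the contents of "scaling_data.csv") and compares `MinMaxScaler` with the same formula written directly in pandas:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with two columns on very different ranges.
toy = pd.DataFrame({
    "X": [-10.0, 0.0, 10.0, 30.0],
    "Y": [100.0, 200.0, 300.0, 500.0],
})

scaled = MinMaxScaler().fit_transform(toy)

# Same formula applied column by column: (x - min) / (max - min).
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, every column runs from 0 to 1, so the two columns become directly comparable even though their original ranges differ by a factor of ten.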

**Scaled and unscaled data representation using matplotlib**

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
plt.subplot(1,2,1)
plt.scatter( data.X, data.Y)
plt.axvline(x=0, c="red", label="x=0")
plt.axhline(y=0, c="yellow", label="y=0")

plt.subplot(1,2,2)
plt.scatter(  min_max_scaled.X, min_max_scaled.Y)
plt.show()

**Data scaling using Standard scaler**

**Import Standard Scaler (standardization scaler)**

In [None]:
from sklearn.preprocessing import StandardScaler

**Apply StandardScaler on data**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler1 = StandardScaler()                   # X_new = (X - X_mean) / X_std, computed per column
std_scaled = scaler1.fit_transform(data)     # scale the X, Y, Z data, not the encoding dataframe df
std_scaled

**Create dataframe of scaled data**

In [None]:
std_scaled = pd.DataFrame(std_scaled,columns= ['X','Y','Z'] )
std_scaled
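One detail worth checking by hand is which standard deviation `StandardScaler` uses: it divides by the population standard deviation (`ddof=0`), whereas pandas `std()` defaults to the sample version (`ddof=1`). A minimal sketch on hypothetical values (invented, not the contents of "scaling_data.csv"):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two columns on different scales.
toy = pd.DataFrame({
    "X": [1.0, 2.0, 3.0, 4.0],
    "Y": [10.0, 20.0, 30.0, 40.0],
})

scaled = StandardScaler().fit_transform(toy)

# Same formula by hand; ddof=0 matches the scaler, the pandas default ddof=1 would not.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After standardization each column has mean 0 and standard deviation 1, which is why the scaled scatter plot is centred on the origin.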

**Scaled and unscaled data representation using matplotlib**

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 5))
plt.subplot(1,2,1)
plt.scatter( data.X, data.Y)
plt.axvline(x=0, c="red", label="x=0")
plt.axhline(y=0, c="yellow", label="y=0")

plt.subplot(1,2,2)
plt.scatter(  std_scaled.X, std_scaled.Y)
plt.axvline(x=0, c="red", label="x=0")
plt.axhline(y=0, c="yellow", label="y=0")
plt.show()

##**Complete each and every step of Task-1 and Task-2**


#**Task-1 (missing values)**
1.   **Import necessary libraries for kNN imputation**
2.   **Create dataframe X using following data with column names as 'Class A','Class B', 'Class C', 'Class D'**

---
                X = [[1, 3, np.nan, 4], [6, np.nan, 8, np.nan], [5, 4, 2, 3], [9, np.nan, 6, 8]]

---


3. **Find the number of missing values in each column**
4. **Find the total number of missing values in the dataframe**
5. **Find the percentage of missing values in the dataframe**
6. **Drop the rows that have missing values**
7. **Keep the rows that have at least 3 non-null values**
8. **Keep the columns that have at least 3 non-null values**
9. **Drop the columns that have missing values**
10. **Fill the missing values with mean values**
11. **Use the KNN imputer to impute the missing values**
12. **Save the imputed data in a dataframe as variable Y**  





#**Task-2 (KNN imputer and Categorical Encoding)**
1. Import the dataset (Loan_status.csv)
2. Calculate the number of missing values in each column
3. From the dataframe, keep only the categorical columns that contain missing values
4. Fill the null values with a categorical imputer
5. Create a new dataframe from the categorical columns that had no null values and the newly filled columns
6. Use the label encoder to encode the categorical data
7. Use the one-hot encoder to encode the "Property_Area" column
8. Use the min-max scaler and the standard scaler on the 'ApplicantIncome' column.