<div class="alert alert-block alert-success">
    <h1 align="center">Mohammad Amirifard</h1>
    <h1 align="center">Feature Scaling</h1>
</div>

Feature scaling is a technique for bringing the values of all the independent features of a dataset onto the same scale. It can help algorithms perform their calculations more quickly, and it is an important stage of data preprocessing.
<figure>
    <center> <img src="https://miro.medium.com/max/1200/1*yR54MSI1jjnf2QeGtt57PA.png"  style="width:600px;height:400px;" ></center>
</figure>


## Outline
- [1 Project detail](#top_1)
- [&nbsp;&nbsp;1.1 Used Notations](#top_1.1)
- [&nbsp;&nbsp;1.2 Used Equations](#top_1.2)
- [2 Our Program](#top_2)
- [&nbsp;&nbsp;2.1 Import libraries](#top_2.1)
- [&nbsp;&nbsp;2.2 Load data](#top_2.2)
- [&nbsp;&nbsp;2.3 EDA](#top_2.3)
- [&nbsp;&nbsp;2.4 Define Variables](#top_2.4)
- [&nbsp;&nbsp;2.5 Define Model and train it without feature scaling](#top_2.5)
- [&nbsp;&nbsp;2.6 Define Model and train it with feature scaling](#top_2.6)


<a name="top_1"></a>
# 1. Project detail
---------------
In this project we want to learn how to scale the features of our dataset to obtain a better result. Our dataset is about power consumption. The goal is to find the relationship between power consumption and the other features, so that we can predict power consumption from newly reported feature values.

Let's get started.

<a name="top_1.1"></a>
## 1.1 Notations used in this project
------------
Here is a summary of some of the notation you will encounter.  

|General <img width=70/> <br />  Notation  <img width=70/> | Description<img width=350/>| Python (if applicable) |
|:------------|:------------------------------------------------------------|:------------|
|  $\mathbf{x_{train}}$ | Training example feature values (in this project: 41932 items)  | `x_train` |
|  $\mathbf{y_{train}}$ | Training example targets (in this project: 41932 items)  | `y_train` |
|  $\mathbf{x_{test}}$ | Test example feature values (in this project: 10484 items)  | `x_test` |
|  $\mathbf{y_{test}}$ | Test example targets (in this project: 10484 items)  | `y_test` |
| m | Number of training examples | `m`|
| n | Number of features          | `n`|


<a name="top_1.2"></a>
## 1.2  Equations Used in this project
 
 
 
 
-----------------
There are several feature scaling methods; two of them are defined here.
### 1st type: Normalization

$$ X_{new}= \frac {X-X_{min}}{X_{max}-X_{min}} \tag{1}$$


### 2nd type: Standardization
$$ X_{new}= \frac {X-X_{mean}}{X_{std}} \tag{2}$$

Here $X_{std}$ is the standard deviation of the feature.
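Equations (1) and (2) can be sketched directly in NumPy. This is a minimal illustration and not part of the original program: the helper names `min_max_normalize` and `standardize` are hypothetical, and the sample values are the first five Temperature and Humidity readings of the dataset used in this project. Both helpers work column by column and assume that no column is constant (otherwise the denominator would be zero).

```python
import numpy as np


def min_max_normalize(x):
    """Equation (1): rescale each column to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # Assumes x_max != x_min for every column
    return (x - x_min) / (x_max - x_min)


def standardize(x):
    """Equation (2): shift each column to mean 0 and scale to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    # np.std uses the population standard deviation (ddof=0) by default
    return (x - x.mean(axis=0)) / x.std(axis=0)


# First five Temperature and Humidity values from the dataset
sample = np.array([
    [6.559, 73.8],
    [6.414, 74.5],
    [6.313, 74.5],
    [6.121, 75.0],
    [5.921, 75.7],
])

normalized = min_max_normalize(sample)
standardized = standardize(sample)

print('Normalized:\n', normalized)
print('Standardized:\n', standardized)
```

In scikit-learn, `MinMaxScaler` and `StandardScaler` implement these two transformations.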

<a name="top_2"></a>
# 2. Our Program
Here you can see the code to run.

<a name="top_2.1"></a>
## 2.1 Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

<a name="top_2.2"></a>
## 2.2 Load data

In [2]:
data = pd.read_csv('Feature_Scaling_data.csv')
data.head()

Unnamed: 0,Temperature,Humidity,Wind Speed,general diffuse flows,diffuse flows,Power Consumption
0,6.559,73.8,0.083,0.051,0.119,34055.6962
1,6.414,74.5,0.083,0.07,0.085,29814.68354
2,6.313,74.5,0.08,0.062,0.1,29128.10127
3,6.121,75.0,0.083,0.091,0.096,28228.86076
4,5.921,75.7,0.081,0.048,0.085,27335.6962


In [3]:
print(f'Number of rows:{data.shape[0]}\nNumber of columns:{data.shape[1]}')

Number of rows:52416
Number of columns:6


<a name="top_2.3"></a>
## 2.3 EDA

In [4]:
# Let's see some information about this dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52416 entries, 0 to 52415
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Temperature            52416 non-null  float64
 1   Humidity               52416 non-null  float64
 2   Wind Speed             52416 non-null  float64
 3   general diffuse flows  52416 non-null  float64
 4   diffuse flows          52416 non-null  float64
 5   Power Consumption      52416 non-null  float64
dtypes: float64(6)
memory usage: 2.4 MB


In [5]:
data.nunique()

Temperature               3437
Humidity                  4443
Wind Speed                 548
general diffuse flows    10504
diffuse flows            10449
Power Consumption        27709
dtype: int64

In [6]:
data.isnull().sum()

Temperature              0
Humidity                 0
Wind Speed               0
general diffuse flows    0
diffuse flows            0
Power Consumption        0
dtype: int64

<a name="top_2.4"></a>
## 2.4 Define Variables

In [7]:
# Set x and y 
x = np.array(data.iloc[:,:-1])
y = np.array(data.iloc[:,-1]).reshape(-1,1)

# Set x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Set m and n
m = x_train.shape[0]
n = x_train.shape[1]

# ----------------------------------------
print(f'Number of training items: {m}')
print(f'Number of features: {n}')
print('\nFive rows of x: \n',x_train[:5])
print('\nFive rows of y: \n',y_train[:5])

Number of training items: 41932
Number of features: 5

Five rows of x: 
 [[3.017e+01 5.203e+01 4.926e+00 4.321e+02 1.104e+02]
 [1.937e+01 9.060e+01 3.070e-01 7.300e-02 1.300e-01]
 [2.997e+01 3.398e+01 4.916e+00 7.070e+01 7.250e+01]
 [1.234e+01 8.530e+01 7.600e-02 5.500e-02 1.630e-01]
 [1.810e+01 6.242e+01 8.900e-02 7.380e+01 8.330e+01]]

Five rows of y: 
 [[35567.84053]
 [24352.56637]
 [41857.27575]
 [23272.85106]
 [35775.18987]]


<a name="top_2.5"></a>
## 2.5 Define Model and train it without feature scaling

In [8]:
# Our model to train is Linear Regression
Regressor = LinearRegression()
Regression = Regressor.fit(x_train,y_train)
print(f'List of w for the fit line:\n\n{Regression.coef_[0]}')

List of w for the fit line:

[ 5.34405683e+02 -5.57553242e+01 -1.38529003e+02 -1.61333391e+00
 -1.10103823e-01]


In [9]:
# Predict on test data
y_predict = Regression.predict(x_test)
y_predict

array([[36086.64216639],
       [32086.82092766],
       [30872.26086217],
       ...,
       [32935.8833043 ],
       [36604.64289327],
       [35598.2234164 ]])

In [10]:
y_test

array([[32985.14532],
       [34737.64259],
       [27894.68354],
       ...,
       [43473.83607],
       [27347.25664],
       [37641.37625]])

In [11]:
# See the score obtained from this model without feature scaling
# (mean_squared_error returns the mean squared error, stored here as RSE1)
RSE1 = mean_squared_error(y_test, y_predict)
print('RSE1 = ',RSE1)

RSE1 =  39842406.218287066
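Since `mean_squared_error` returns a squared quantity, the value printed as `RSE1` is in squared target units. A small sketch, not part of the original program, takes the square root so the error can be read in the same units as Power Consumption. The variable names `mse_unscaled` and `rmse_unscaled` are introduced here only for illustration.

```python
import math

# Value printed as RSE1 in the cell above
mse_unscaled = 39842406.218287066

# The square root of the mean squared error is in the units of the target
rmse_unscaled = math.sqrt(mse_unscaled)

print(f'RMSE = {rmse_unscaled:.1f}')
```

The result is a little over 6300, which can be compared directly with the power consumption values shown in `y_test`.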


<a name="top_2.6"></a>
## 2.6 Define Model and train it with feature scaling

In [12]:
# Here I use only normalization; feel free to try other methods.
# Note: fit_transform is applied to the whole DataFrame, so the target column is scaled too.
Scaler = MinMaxScaler()
data = pd.DataFrame(Scaler.fit_transform(data))

In [13]:
data.head()

Unnamed: 0,0,1,2,3,4,5
0,0.090091,0.748382,0.00513,4e-05,0.000115,0.526251
1,0.086146,0.75677,0.00513,5.7e-05,7.9e-05,0.415545
2,0.083399,0.75677,0.004663,5e-05,9.5e-05,0.397623
3,0.078176,0.762761,0.00513,7.5e-05,9.1e-05,0.374149
4,0.072736,0.771148,0.004819,3.8e-05,7.9e-05,0.350834
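The cell above scales all six columns at once, before the train/test split. A common alternative, sketched here under stated assumptions, is to split first, fit the scaler on the training features only, and leave the target in its original units. This is not the method used in this notebook: the data below is synthetic stand-in data, and the names `x_demo`, `y_demo`, `x_tr`, `x_te`, `y_tr`, `y_te` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical): five features on very different scales
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 1, size=(200, 5)) * np.array([40, 100, 5, 1000, 1000])
true_w = np.array([500.0, -50.0, -100.0, -1.0, -0.1])
y_demo = x_demo @ true_w + rng.normal(0, 10, size=200)

# Split first, so the test rows play no part in fitting the scaler
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

scaler = MinMaxScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learn min and max from training rows only
x_te_scaled = scaler.transform(x_te)      # reuse the training min and max on test rows

# The target is left unscaled, so both errors are in the same units
model_scaled = LinearRegression().fit(x_tr_scaled, y_tr)
mse_scaled_features = mean_squared_error(y_te, model_scaled.predict(x_te_scaled))

model_raw = LinearRegression().fit(x_tr, y_tr)
mse_raw_features = mean_squared_error(y_te, model_raw.predict(x_te))

print('MSE with scaled features:', mse_scaled_features)
print('MSE with raw features:   ', mse_raw_features)
```

With the target left in its original units, the two errors can be compared directly. For a plain least-squares fit with an intercept they should come out essentially equal, because min-max scaling is a linear change of units for each feature.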


In [14]:
# Again we must build our model and train it to see the effect of normalization on the result
# Set x and y 
x = np.array(data.iloc[:,:-1])
y = np.array(data.iloc[:,-1]).reshape(-1,1)

# Set x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Regressor = LinearRegression()
Regression = Regressor.fit(x_train,y_train)

# Predict on test data
y_predict = Regression.predict(x_test)


In [15]:
y_predict

array([[0.5792665 ],
       [0.47485624],
       [0.44315169],
       ...,
       [0.49701994],
       [0.59278825],
       [0.56651695]])

In [16]:
y_test

array([[0.49830586],
       [0.54405258],
       [0.36542581],
       ...,
       [0.77209983],
       [0.35113593],
       [0.61985086]])

In [17]:
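# See the score obtained from this model with feature scaling
# (mean_squared_error returns the mean squared error, stored here as RSE2)
RSE2 = mean_squared_error(y_test, y_predict)
print('RSE2 = ',RSE2)

print(f'As you can see, this RSE is much lower than the previous RSE, but it is measured on the scaled target (0 to 1), so the two are not in the same units. The previous RSE was:{RSE1}')


RSE2 =  0.02714880727370638
As you can see, this RSE is much lower than the previous RSE, but it is measured on the scaled target (0 to 1), so the two are not in the same units. The previous RSE was:39842406.218287066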
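Because the target column was min-max scaled together with the features, `RSE1` and `RSE2` are expressed in different units. A rough way to put them on the same footing is to multiply the scaled error by the square of the target range. The range of Power Consumption is not printed anywhere in this notebook, so the sketch below estimates it from rows 0 and 4 of the two `data.head()` outputs. Those printed values are rounded, so the result is approximate, and the variable names are introduced here only for illustration.

```python
# Power Consumption in rows 0 and 4, before and after min-max scaling
raw_row0, raw_row4 = 34055.6962, 27335.6962
scaled_row0, scaled_row4 = 0.526251, 0.350834

# Min-max scaling divides by (max - min), so the ratio of the two
# differences gives an estimate of the target range
target_range = (raw_row0 - raw_row4) / (scaled_row0 - scaled_row4)

rse1 = 39842406.218287066   # printed as RSE1 (original units, squared)
rse2 = 0.02714880727370638  # printed as RSE2 (scaled units, squared)

# A squared error scales with the square of the range
rse2_original_units = rse2 * target_range ** 2

print(f'Estimated target range: {target_range:.1f}')
print(f'RSE2 in original units: {rse2_original_units:.1f}')
print(f'RSE1:                   {rse1:.1f}')
```

Under these assumptions the converted value should land close to `RSE1`. This is what one would expect from a linear regression with an intercept, whose predictions do not change when features are rescaled linearly. Scaling tends to matter more for models that are sensitive to feature magnitudes, such as those trained with gradient descent, regularized models, or distance-based methods.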
