### SCALING DATA FEATURES

When features have very different ranges, or even different measurement units, they are hard to compare directly. How do kilograms compare to meters, or altitude to time?
The answer to this problem is scaling: we transform the data into new values that are easier to compare.

In cars.csv, a weight of 790 kg and a volume of 1000 ccm are somewhat comparable in magnitude. In **cars1.csv**, the same volume is recorded as 1 litre (1000 ccm), so the weight 790 and the volume 1.0 are no longer comparable. Here we need to scale the values to a comparable range.
Scaling can be done in many ways; here we use the **_standardization formula_**.

```
new val. = (original val. - mean) / std. dev.
scale weight value: (790 - 1292.23) / 238.74 = -2.1
scale volume value: (1.0 - 1.61) / 0.38 = -1.59
(mean and std. dev. are shown rounded; the rounded figures alone give about -1.61)
now we can compare -2.1 (weight) and -1.59 (volume)
```
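To make the formula concrete, here is a minimal sketch in plain Python. It standardizes only the first three Weight values that appear in this section, so its mean and standard deviation differ from the full-dataset figures (1292.23 and 238.74) used above. It assumes the population standard deviation (dividing by N), which is what scikit-learn's `StandardScaler` uses.

```python
import statistics

# First three Weight values shown in this section (not the full dataset).
weights = [790, 1160, 929]

mean = statistics.fmean(weights)
std_dev = statistics.pstdev(weights)  # population std. dev. (divides by N)

# new val. = (original val. - mean) / std. dev.
scaled = [(w - mean) / std_dev for w in weights]

print([round(z, 2) for z in scaled])
```

After standardization the values have mean 0 and standard deviation 1, which is what makes features measured in different units comparable.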


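```python
import pandas
from IPython.display import display  # implicit in Jupyter; imported so the code also runs as a script
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

df = pandas.read_csv("../data/cars1.csv")

X = df[["Weight", "Volume"]]  # independent variables
y = df[["CO2"]]  # dependent variable

print("unscaled independent variables(X)")
display(X.head(3))

print("dependent variable(y)")
display(y.head(3))
```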


unscaled independent variables(X)

| | Weight | Volume |
|---|---|---|
| 0 | 790 | 1.0 |
| 1 | 1160 | 1.2 |
| 2 | 929 | 1.0 |

dependent variable(y)

| | CO2 |
|---|---|
| 0 | 99 |
| 1 | 95 |
| 2 | 95 |


```python
scale = StandardScaler()
scaledX = scale.fit_transform(X.values)  # learn mean and std. dev. from X, then scale X

print("scaled independent variables(X)")
print(scaledX[:3])

regr = linear_model.LinearRegression()  # create the regression model
regr.fit(scaledX, y)

# scale the new values with the same fitted scaler (weight 2300, volume 1.3)
scaled = scale.transform([[2300, 1.3]])

predictedCO2 = regr.predict(scaled)
print("predicted CO2:", predictedCO2)
```


```
scaled independent variables(X)
[[-2.10389253 -1.59336644]
 [-0.55407235 -1.07190106]
 [-1.52166278 -1.59336644]]
predicted CO2: [[107.2087328]]
```
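The new values must go through the same fitted scaler as the training data before prediction. One way to keep those two steps together is scikit-learn's `make_pipeline`. The sketch below is illustrative only: the first three data rows match the table above, but the last two rows and the variable names are made up, so the predicted number will not match the result above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: first three rows match the table above,
# the last two rows are invented for this sketch.
X = np.array([[790, 1.0], [1160, 1.2], [929, 1.0], [1500, 1.6], [1300, 1.5]])
y = np.array([99, 95, 95, 105, 101])

# The pipeline scales the inputs and then fits the regression.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)

new_car = np.array([[2300, 1.3]])  # unscaled weight and volume
pipeline_prediction = model.predict(new_car)  # scaling is applied automatically

# Equivalent manual version: fit the scaler, then reuse it for new data.
scaler = StandardScaler().fit(X)
regr = LinearRegression().fit(scaler.transform(X), y)
manual_prediction = regr.predict(scaler.transform(new_car))

print(pipeline_prediction, manual_prediction)
```

Both approaches should give the same prediction; the pipeline simply removes the chance of forgetting to scale new inputs.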
