# Normalizing Features

You might have noticed that our previous model wasn't exactly stellar. This is because we missed a key technique used in machine learning: normalization. Normalizing features rescales our variables so that they share a similar scale and range. When we previously performed gradient descent, the budget and imdbVotes features had a much larger impact on our steps and the resulting output. That's not necessarily because those features are more predictive of gross domestic sales, but simply because their values are much larger than those of the imdbRating or Metascore features, which have much narrower ranges. To account for this, we'll normalize our data and transform back to the raw scale when needed.
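To make the scale effect concrete, here is a minimal sketch using made-up numbers (not the movie dataset) and a hypothetical helper, `mse_gradient`, for a plain linear model. It compares how lopsided the gradient components are before and after rescaling each column to the range 0 to 1.

```python
import numpy as np

# Toy data, made up for illustration: one large-scale and one small-scale feature.
budget = np.array([13e6, 45e6, 20e6, 61e6, 40e6])  # dollars
rating = np.array([6.8, 5.9, 8.1, 6.7, 7.5])       # roughly a 0-10 scale
X = np.column_stack([budget, rating])
y = np.array([25e6, 13e6, 53e6, 75e6, 95e6])

def mse_gradient(X, y, w):
    """Gradient of mean squared error for the linear model y_hat = X @ w."""
    residuals = X @ w - y
    return 2 * X.T @ residuals / len(y)

w = np.zeros(2)
raw_grad = mse_gradient(X, y, w)

# Rescale each column to [0, 1] and recompute the gradient at the same weights.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_grad = mse_gradient(X_scaled, y, w)

raw_ratio = abs(raw_grad[0] / raw_grad[1])
scaled_ratio = abs(scaled_grad[0] / scaled_grad[1])
print(raw_ratio)     # budget component vs. rating component, raw features
print(scaled_ratio)  # the same ratio after rescaling
```

With raw features the budget component of the gradient is larger than the rating component by several orders of magnitude, so a single learning rate cannot suit both weights. After rescaling, the two components are of comparable size.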


In [1]:
import pandas as pd
%matplotlib inline

In [2]:
df = pd.read_excel('movie_data_detailed_with_ols.xlsx')
df.head()

Unnamed: 0,budget,domgross,title,Response_Json,Year,imdbRating,Metascore,imdbVotes,Model
0,13000000,25682380,21 & Over,0,2008,6.8,48,206513,49127590.0
1,45658735,13414714,Dredd 3D,0,2012,0.0,0,0,226726.5
2,20000000,53107035,12 Years a Slave,0,2013,8.1,96,537525,162662400.0
3,61000000,75612460,2 Guns,0,2013,6.7,55,173726,77233810.0
4,40000000,95020213,42,0,2013,7.5,62,74170,41519580.0


### 1. Basic Norm function
Write a function norm(col) that takes in a pandas series, and rescales the data to have a minimum of zero and a maximum of 1. Think about how you can do this by simply using the minimum and maximum of the column.
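
One common way to write this rescaling (often called min-max scaling) is shown below; the symbol $x_{scaled}$ is just a label chosen here for illustration:

$x_{scaled} = \frac{x-min(x)}{max(x)-min(x)}$

Plugging in $x = min(x)$ gives $0$, and plugging in $x = max(x)$ gives $\frac{max(x)-min(x)}{max(x)-min(x)} = 1$, so every value lands between 0 and 1.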

In [3]:
def norm(col):
    # Min-max scaling: the minimum maps to 0 and the maximum maps to 1
    minimum = col.min()
    maximum = col.max()
    return (col-minimum)/(maximum-minimum)

### 2. Apply your norm function to the X feature columns

In [7]:
cols = ['budget',  'imdbRating', 'Metascore', 'imdbVotes']
for col in cols:
    df[col] = norm(df[col])
df.head()

Unnamed: 0,budget,domgross,title,Response_Json,Year,imdbRating,Metascore,imdbVotes,Model
0,0.034169,25682380,21 & Over,0,2008,0.839506,0.5,0.384192,49127590.0
1,0.182956,13414714,Dredd 3D,0,2012,0.0,0.0,0.0,226726.5
2,0.066059,53107035,12 Years a Slave,0,2013,1.0,1.0,1.0,162662400.0
3,0.252847,75612460,2 Guns,0,2013,0.82716,0.572917,0.323196,77233810.0
4,0.157175,95020213,42,0,2013,0.925926,0.645833,0.137984,41519580.0


In [8]:
X = df[['budget', 'imdbRating',
       'Metascore', 'imdbVotes']]
y = df['domgross']
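
The introduction mentions transforming back to the raw scale when needed. Because the loop above overwrites each column in place, the original minimum and maximum would have to be captured before normalizing. A minimal sketch of one way to do that, using made-up budgets and two hypothetical helpers, `norm_with_params` and `denorm`, that are not part of the original notebook:

```python
import pandas as pd

def norm_with_params(col):
    """Min-max scale a Series and also return the (min, max) needed to undo it."""
    minimum, maximum = col.min(), col.max()
    return (col - minimum) / (maximum - minimum), (minimum, maximum)

def denorm(scaled, params):
    """Map min-max scaled values back to the original units."""
    minimum, maximum = params
    return scaled * (maximum - minimum) + minimum

# Made-up budgets, for illustration only.
raw = pd.Series([13_000_000, 45_000_000, 20_000_000, 61_000_000], name='budget')
scaled, params = norm_with_params(raw)
restored = denorm(scaled, params)

print(scaled)
print(restored)
```

Keeping the parameters alongside the scaled column means the same `params` can later convert predictions or coefficients back into dollars.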

### 3. Try writing a slightly different normalization function: mean normalization
Here's how it's defined:  
mean_normalized_x = $\frac{x-mean(x)}{max(x)-min(x)}$

In [17]:
def norm(col):
    # Mean normalization: center on the column mean, then divide by the range
    minimum = col.min()
    maximum = col.max()
    return (col-col.mean())/(maximum-minimum)
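
As a quick sanity check on the definition, here is a small sketch on made-up ratings. The name `mean_norm` is chosen here only to avoid clashing with the min-max `norm` from earlier. After mean normalization, the values should average to about zero and span a range of one.

```python
import pandas as pd

def mean_norm(col):
    """Mean normalization: center on the mean, then divide by the range."""
    return (col - col.mean()) / (col.max() - col.min())

# Made-up ratings, for illustration only.
ratings = pd.Series([6.8, 5.9, 8.1, 6.7, 7.5])
normalized = mean_norm(ratings)

print(normalized.mean())                    # approximately 0
print(normalized.max() - normalized.min())  # width of the rescaled column
```

Unlike min-max scaling, the result is centered on zero, so values below the column mean come out negative.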