# <font color='#eb3483'> Pipeline Practice </font>


For this exercise we have to build a pipeline that processes the movies dataset. It's not a matter of copying and pasting code, but of deciding how to deal with each variable.

### <font color='#eb3483'> Movies Data </font>

In [1]:
import pandas as pd
import numpy as np

movies = pd.read_csv("data/movies.1.initial_process.csv")
movies = movies[movies.status=="Released"]
del movies["status"]
movies.head()

Unnamed: 0,belongs_to_collection,budget,genre,original_language,popularity,production_company,production_country,release_date,revenue,runtime,title,vote_average,vote_count
0,Father of the Bride Collection,,Comedy,en,8.387519,Sandollar Productions,United States of America,1995-02-10,76578911.0,106.0,Father of the Bride Part II,5.7,173.0
1,,,Drama,en,0.894647,Miramax,South Africa,1995-12-15,676525.0,106.0,"Cry, the Beloved Country",6.7,13.0
2,Friday Collection,3500000.0,Comedy,en,14.56965,New Line Cinema,United States of America,1995-04-26,28215918.0,91.0,Friday,7.0,513.0
3,,,Comedy,en,8.963037,Paramount Pictures,United States of America,1996-02-01,32.0,87.0,Black Sheep,6.0,124.0
4,,12000000.0,Comedy,en,9.592265,Universal Pictures,United States of America,1996-02-16,41205099.0,92.0,Happy Gilmore,6.5,767.0


In [10]:
target_var = "revenue"
numerical_col = movies.drop(columns = target_var).select_dtypes(np.number).columns
categorical_col = movies.drop(columns = numerical_col).drop(columns =['title', 'revenue', 'belongs_to_collection'])
categorical_col

Unnamed: 0,genre,original_language,production_company,production_country,release_date
0,Comedy,en,Sandollar Productions,United States of America,1995-02-10
1,Drama,en,Miramax,South Africa,1995-12-15
2,Comedy,en,New Line Cinema,United States of America,1995-04-26
3,Comedy,en,Paramount Pictures,United States of America,1996-02-01
4,Comedy,en,Universal Pictures,United States of America,1996-02-16
...,...,...,...,...,...
1344,Comedy,sv,Svensk Filmindustri (SF),Sweden,1975-03-17
1345,Comedy,fr,France 2 Cinéma,France,2003-11-12
1346,Comedy,it,,,2008-12-05
1347,,it,,Italy,2002-03-15
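
Note that `categorical_col` above holds a DataFrame (the table just shown), whereas column selectors typically expect column labels. Below is a minimal sketch of splitting the feature names by dtype. It runs on a tiny stand-in frame with values copied from the preview; `sample`, `text_cols`, `numerical_names` and `categorical_names` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the movies frame (values copied from the preview above).
sample = pd.DataFrame({
    "budget": [np.nan, 3500000.0, 12000000.0],
    "genre": ["Comedy", "Comedy", "Comedy"],
    "original_language": ["en", "en", "en"],
    "runtime": [106.0, 91.0, 92.0],
    "title": ["Father of the Bride Part II", "Friday", "Happy Gilmore"],
    "revenue": [76578911.0, 28215918.0, 41205099.0],
})

target_var = "revenue"
text_cols = ["title"]  # assumption: free text is handled separately

features = sample.drop(columns=[target_var])

# Keep column *names* (plain lists), not the data itself.
numerical_names = features.select_dtypes(include=np.number).columns.tolist()
categorical_names = (
    features.drop(columns=numerical_names + text_cols).columns.tolist()
)

print(numerical_names)
print(categorical_names)
```

Plain lists of names can be passed to a column selector and reused when the data is reloaded or split.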


### Create a pipeline that processes the dataset. Make sure you handle numerical, categorical and text variables appropriately.
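
The cells that follow cover the numerical and categorical variables. For the text variable (`title`), one possible approach is a bag-of-words encoding. The sketch below assumes scikit-learn's `TfidfVectorizer` and uses the five titles from the preview; it is one option, not necessarily the approach intended by the exercise.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Titles copied from the preview of the movies dataset.
titles = pd.Series([
    "Father of the Bride Part II",
    "Cry, the Beloved Country",
    "Friday",
    "Black Sheep",
    "Happy Gilmore",
])

# Each title becomes a sparse row of TF-IDF weights, one column per token.
vectorizer = TfidfVectorizer(lowercase=True)
title_features = vectorizer.fit_transform(titles)

print(title_features.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

The vectorizer expects a one-dimensional sequence of strings, so it has to receive the `title` column itself rather than a two-dimensional selection.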

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

#numerical pipeline
imputer = SimpleImputer()    #default strategy is mean
scaler = StandardScaler()

In [4]:
from mlxtend.feature_selection import ColumnSelector
numerical_col_selector = ColumnSelector(cols=numerical_col)
numerical_col_selector.fit_transform(movies)

numerical_pipeline = make_pipeline(
    numerical_col_selector,
    imputer,
    scaler
)
# The pipeline receives the whole dataset, but the selector keeps only the numerical columns

In [12]:
#categorical pipeline
from category_encoders import OneHotEncoder

categorical_pipeline = make_pipeline(
     ColumnSelector(cols=categorical_col),
     OneHotEncoder()
)
#categorical_pipeline

In [14]:
#pipeline union
from sklearn.pipeline import make_union
processing_pipeline = make_union(
    numerical_pipeline,
    categorical_pipeline
)

processing_pipeline

FeatureUnion(transformer_list=[('pipeline-1',
                                Pipeline(steps=[('columnselector',
                                                 ColumnSelector(cols=Index(['budget', 'popularity', 'runtime', 'vote_average', 'vote_count'], dtype='object'))),
                                                ('simpleimputer',
                                                 SimpleImputer()),
                                                ('standardscaler',
                                                 StandardScaler())])),
                               ('pipeline-2',
                                Pipeline(steps=[('columnselector',
                                                 ColumnSelector(cols=       genre original_language        pr...
1348     NaN                en                       NaN   

            production_country release_date  
0     United States of America   1995-02-10  
1                 South Africa   1995-12-15  
2     United States of America

### Transform the dataset

In [15]:
processing_pipeline.fit_transform(movies)

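KeyError: 0

The error most likely comes from the categorical branch: `categorical_col` is a DataFrame rather than a list of column names, so when `ColumnSelector(cols=categorical_col)` inspects the first entry of `cols`, pandas looks for a column labelled `0` and raises `KeyError: 0`. Passing `categorical_col.columns` to the selector would give it the column names it expects.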
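
For comparison, here is a minimal sketch of a numerical-plus-categorical preprocessing step that takes column names only. It swaps in scikit-learn's own `ColumnTransformer` and `OneHotEncoder` in place of the mlxtend selector and `category_encoders`, and runs on a tiny stand-in frame built from the preview rows. The names `sample`, `numerical_names`, `categorical_names` and `preprocessor` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in data: a few columns from the first five preview rows.
sample = pd.DataFrame({
    "budget": [np.nan, np.nan, 3500000.0, np.nan, 12000000.0],
    "runtime": [106.0, 106.0, 91.0, 87.0, 92.0],
    "genre": ["Comedy", "Drama", "Comedy", "Comedy", "Comedy"],
    "original_language": ["en", "en", "en", "en", "en"],
})
numerical_names = ["budget", "runtime"]
categorical_names = ["genre", "original_language"]

# Numerical branch: mean imputation followed by standardisation.
numerical_steps = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Categorical branch: fill gaps with the most frequent label, then one-hot encode.
categorical_steps = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Each branch is routed to its columns by name.
preprocessor = ColumnTransformer([
    ("numerical", numerical_steps, numerical_names),
    ("categorical", categorical_steps, categorical_names),
])

processed = preprocessor.fit_transform(sample)
print(processed.shape)
```

Here the two numerical columns plus the one-hot columns (two genres, one language) give five output columns. On the full dataset, high-cardinality columns such as `production_company` or `release_date` would produce far more.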

### Create a Ridge estimator to predict a movie's revenue based on the other features. What is the optimal value of alpha to minimize the RMSE? *Hint*: You can use validation curves to figure it out.
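
As a starting point for the hint, the sketch below shows how a validation curve over `alpha` can be computed and read in terms of RMSE. It uses synthetic data from `make_regression` instead of the processed movies features, so the numbers it prints say nothing about the actual dataset. The grid of alphas is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the processed features and the revenue target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate regularisation strengths on a log scale (arbitrary choice).
alphas = np.logspace(-3, 3, 7)

train_scores, valid_scores = validation_curve(
    Ridge(),
    X,
    y,
    param_name="alpha",
    param_range=alphas,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# scikit-learn maximises scores, so RMSE comes back negated; flip the sign.
valid_rmse = -valid_scores.mean(axis=1)
best_alpha = alphas[np.argmin(valid_rmse)]

print(best_alpha, valid_rmse.min())
```

With the real data, the Ridge estimator would be chained after the processing pipeline, and the parameter name would then carry the step prefix (for example `ridge__alpha` when built with `make_pipeline`).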

### Remember when we did exploratory data analysis and grouped the numerical variables into quintiles? That is a valid technique used in machine learning to expand a dataset; it is called [Binning or Bucketing](http://blog.yhat.com/tutorials/5-Feature-Engineering.html).

### Create your own transformer that, given a numerical variable and a number of buckets, returns the quantile bucket each observation falls into (so if we choose buckets = 4, it would return 1, 2, 3 or 4 depending on whether the observation is in the 1st, 2nd, 3rd or 4th quartile).
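
To make the expected output concrete, here is a small worked example of quantile bucketing with pandas `qcut`. The runtimes are made-up values chosen so that no observation sits exactly on a bucket edge.

```python
import pandas as pd

# Made-up runtimes in minutes (illustrative values only).
runtimes = pd.Series([87, 91, 92, 106, 110, 120, 95, 100])

# labels=False returns the zero-based bucket index; add 1 to get 1..4.
buckets = pd.qcut(runtimes, q=4, labels=False) + 1

print(buckets.tolist())  # [1, 1, 2, 3, 4, 4, 2, 3]
```

The cut points here are the 25th, 50th and 75th percentiles (91.75, 97.5 and 107), so each bucket receives two of the eight observations.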

### Try putting your bucket transformer into a pipeline to make sure it works, and check if it improves the performance of your model.

**Hint**:

Here is a custom scikit-learn transformer that selects columns (similar to the mlxtend selector). You can use it as a template to create your own transformer, and you could use pandas `qcut` to do the actual binning.

In [None]:
from sklearn.base import BaseEstimator

class ColumnSelector(BaseEstimator):
    """
    Custom transformer that selects specific columns from a pandas dataframe, similar to
    the mlxtend ColumnSelector
    """
    def __init__(self, cols=None, drop_axis=False):
        self.cols = cols
        self.drop_axis = drop_axis

    def fit_transform(self, X, y=None):
        return self.transform(X=X, y=y)

    def transform(self, X, y=None):
        if hasattr(X, 'loc'):
            # pandas dataframes expose the .loc indexer; numpy arrays do not
            t = X.loc[:, self.cols].values
        else:
            # it's a numpy array
            t = X[:, self.cols]

        if t.shape[-1] == 1 and self.drop_axis:
            t = t.reshape(-1)
        if len(t.shape) == 1 and not self.drop_axis:
            t = t[:, np.newaxis]
        return t

    def fit(self, X, y=None):
        return self
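
Following the same template, one possible shape for the bucket transformer is sketched below. `QuantileBucketer` is a hypothetical name. Instead of calling `qcut` inside `transform`, this version learns the cut points in `fit` with `np.quantile`, so that new data is bucketed with the edges seen during training. It assumes missing values were imputed earlier in the pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class QuantileBucketer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: map each numeric column to buckets 1..buckets."""

    def __init__(self, buckets=4):
        self.buckets = buckets

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Interior quantiles only, e.g. 25%, 50%, 75% for buckets=4.
        probs = np.linspace(0, 1, self.buckets + 1)[1:-1]
        # One row of cut points per quantile, one column per feature.
        self.edges_ = np.quantile(X, probs, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # side="left" makes buckets right-inclusive, matching pandas qcut.
        columns = [
            np.searchsorted(self.edges_[:, j], X[:, j], side="left") + 1
            for j in range(X.shape[1])
        ]
        return np.column_stack(columns)


# Made-up runtimes in minutes, as a single-column feature matrix.
runtimes = np.array([[87.0], [91.0], [92.0], [106.0], [110.0], [120.0], [95.0], [100.0]])

# Impute first, then bucket, as the transformer assumes complete data.
bucket_pipeline = make_pipeline(SimpleImputer(strategy="median"), QuantileBucketer(buckets=4))
bucketed = bucket_pipeline.fit_transform(runtimes)

print(bucketed.ravel())  # [1 1 2 3 4 4 2 3]
```

Inheriting from `TransformerMixin` supplies `fit_transform`, so only `fit` and `transform` need to be written. Whether bucketing improves the model still has to be checked against the validation RMSE.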