# Review of Regex and Pipelining, Column Transformer

## Regular Expressions

- Very powerful tool.
- Used to identify whether a pattern exists in a given sequence of characters (string) or not.
- Helpful in manipulating textual data, very important for text mining.
- Examples: validating the format of email addresses or passwords during registration, and parsing text files to find, replace, or delete certain strings.

In [37]:
import re
import pandas as pd
import numpy as np

In [2]:
#Ordinary characters
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
  print("Match!")
else: print("Not a match!")

Match!


The r prefix on the pattern above marks it as a **raw string literal**. It changes how the string literal is interpreted: the characters are stored exactly as they appear.

For example, \ is just a backslash when the string is prefixed with **r, rather than being interpreted as the start of an escape sequence**. You will see what this means with special characters. Regex syntax often involves backslash-escaped characters, and the raw r prefix prevents them from being interpreted as Python escape sequences. You don't actually need it for this example, but it is good practice to use it for consistency.
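As a small illustration of the difference, the sketch below compares a normal and a raw string. The variable names are made up for this example.

```python
import re

# In a normal string "\n" is one character (a newline);
# in a raw string r"\n" is two characters: a backslash and the letter n.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))

# Both spellings describe the same regex for a single digit,
# but the raw string avoids doubling every backslash.
digit_escaped = "\\d"
digit_raw = r"\d"
same_pattern = digit_escaped == digit_raw
print(same_pattern)

first_digit = re.search(digit_raw, "room 7").group()
print(first_digit)
```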

The **re.search()** method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a **match object**; otherwise it returns **None**. Therefore, the search is usually immediately followed by an if-statement to test whether it succeeded.

#basic format of search function
match = re.search(pat, str)

In [6]:
# named 'text' rather than 'str' to avoid shadowing the built-in str type
text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests if it succeeded
if match:
  print('found', match.group()) ## 'found word:cat'
else:
  print('did not find')

# match.group() is the matching text (e.g. 'word:cat')

found word:cat


## Basic Patterns
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

- **a, X, 9, <** -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: **. ^ $ * + ? { [ ] \ | ( )** (details below)

-  **. (a period)** -- matches any single character except newline '\n'
- **\w** -- (lowercase w) matches a "word" character: a letter or digit or underscore [a-zA-Z0-9_]. **Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word**. \W (upper case W) matches any non-word character.

- **\s** -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

- **\t, \n, \r** -- tab, newline, return

- **\d** -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

- **^ = start, $ = end** -- match the start or end of the string

- **\** -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
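The basic patterns listed above can be tried out on a single string. This is only a sketch: the sample sentence is invented for the illustration.

```python
import re

sample = "Order 66 shipped\tto site_A on 3.5.2021"

# \w matches one word character at a time, so five of them match "Order"
five_word_chars = re.search(r"\w\w\w\w\w", sample).group()

# \w also matches the underscore
with_underscore = re.search(r"site\w+", sample).group()

# \d matches a single digit
two_digits = re.search(r"\d\d", sample).group()

# \s matches the tab between "shipped" and "to"
around_tab = re.search(r"d\sto", sample).group()

# An unescaped dot matches any character; an escaped dot only a literal period
dot_any = re.search(r"3.5", "3x5")
dot_literal = re.search(r"3\.5", "3x5")

# ^ and $ anchor the pattern to the start and end of the string
starts_with_order = re.search(r"^Order", sample)
starts_with_shipped = re.search(r"^shipped", sample)
ends_with_year = re.search(r"2021$", sample)

print(five_word_chars, with_underscore, two_digits)
```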

## re.search vs re.match
- The match() function checks for a match only at the beginning of the string (by default) whereas the search() function checks for a match anywhere in the string.

- **search(pattern, string, flags=0)**
- With this function, you scan through the given string/sequence looking for the first location where the regular expression produces a match. It returns a corresponding match object if found, else returns None if no position in the string matches the pattern. Note that None is different from finding a zero-length match at some point in the string.
- **All of the pattern must be matched, but not all of the string**

In [7]:
pattern = "cookie"
sequence = "Cake and cookie"

re.search(pattern, sequence).group()

'cookie'

- **match(pattern, string, flags=0)**
- Returns a corresponding match object if zero or more characters at the beginning of string match the pattern. Else it returns None, if the string does not match the given pattern.

In [8]:
pattern = "C"
sequence1 = "IceCream"

# No match since "C" is not at the start of "IceCream"
re.match(pattern, sequence1)

In [9]:
sequence2 = "Cake"

re.match(pattern,sequence2).group()

'C'

In [10]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
match = re.search(r'igs', 'piiig') # not found, match == None

## . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!') # found, match.group() == "abc"

## Repetition
You use +, * and ? to specify repetition in the pattern

- "+" -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
- "*" -- 0 or more occurrences of the pattern to its left
- "?" -- match 0 or 1 occurrences of the pattern to its left

### Leftmost & Largest
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be **"greedy" - match as much as possible**).

In [None]:
## Repetition Examples

## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
## important example
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"

But what if you want to check for an exact number of repetitions?

For example, checking the validity of a phone number in an application. The re module handles this gracefully with the following quantifiers:

**{x} - Repeat exactly x times.**

**{x,} - Repeat at least x times.**

**{x,y} - Repeat at least x times but no more than y times (written without a space after the comma).**

In [26]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'
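The phone number check mentioned above could be sketched as follows. The ddd-ddd-dddd format and the helper name is_valid_phone are assumptions made for this illustration, not a general phone number rule.

```python
import re

# Hypothetical format: 3 digits, 3 digits, 4 digits, separated by dashes
PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

def is_valid_phone(text):
    # fullmatch() succeeds only if the whole string fits the pattern
    return re.fullmatch(PHONE_PATTERN, text) is not None

valid = is_valid_phone("555-123-4567")
invalid = is_valid_phone("555-12-4567")
print(valid, invalid)

# {x,} and {x,y} are greedy like + and *
at_least_two = re.search(r"a{2,}", "caaandy").group()
one_or_two = re.search(r"a{1,2}", "caaandy").group()

# With a space after the comma, Python's re module treats the braces
# as ordinary characters, so this is not a repetition any more
with_space = re.search(r"a{1, 2}", "caaandy")
print(at_least_two, one_or_two, with_space)
```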

## Greedy vs Non-Greedy Matching
When a quantifier such as * or + matches as much of the search sequence (string) as possible, it is said to be a "greedy match". This is the normal behavior of a regular expression, but sometimes it is not desired:

In [27]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()

'<h1>TITLE</h1>'

The pattern <.*> matched the whole string, right up to the second occurrence of >.

However, if you only wanted to match the first <h1> tag, you could have used the non-greedy qualifier *?, which matches as little text as possible.

Adding ? after the qualifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched. When you run <.*?>, you will only get a match with <h1>.

In [28]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()

'<h1>'

In [12]:
#Email Example

str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
  print (match.group())  ## 'b@google'


b@google


The search does not get the whole email address in this case because \w does not match the '-' or '.' in the address.

### Square Brackets
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one **exception that dot (.) just means a literal dot**. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [20]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print (match.group())  ## 'alice-b@google.com'

alice-b@google.com


You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. **To use a dash without indicating a range, put the dash last, e.g. [abc-].** **An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.**
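A short sketch of ranges, a trailing dash, and an inverted set. The sample text is invented for the illustration.

```python
import re

text = "Room B-12, floor 3"

# [a-z] is a range: runs of lowercase letters only
lowercase_runs = re.findall(r"[a-z]+", text)

# The dash is placed last, so it is a literal dash and not a range
upper_digit_dash = re.findall(r"[A-Z0-9-]+", text)

# ^ at the start inverts the set: anything that is neither a letter nor whitespace
not_letter_or_space = re.findall(r"[^a-zA-Z\s]+", text)

print(lowercase_runs)
print(upper_digit_dash)
print(not_letter_or_space)
```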

## Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. **To do this, add parentheses ( ) around the username and host in the pattern**, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

In [21]:
text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com


A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

## findall
- findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.
- So it will basically work like a search but it will find all the matching expressions

In [23]:
## Suppose we have a text with many email addresses
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['alice@google.com', 'bob@abc.com']

for email in emails:
    print(email)

alice@google.com
bob@abc.com


## findall With Files
- Feed the whole file text into findall() and it will return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

In [None]:
# Open the file; the with-statement closes it again when the block ends
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
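To make the file workflow runnable without an existing test.txt, the sketch below first writes a small temporary file. The file contents are made up for the illustration, and the email pattern is the one used earlier in this section.

```python
import os
import re
import tempfile

# Write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alice@google.com wrote to bob@abc.com\nno address on this line\n")
    path = tmp.name

# Read the whole file as one string and collect every match in a single step
with open(path, "r") as f:
    found = re.findall(r"[\w.-]+@[\w.-]+", f.read())

os.remove(path)  # clean up the temporary file
print(found)
```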

## findall and Groups

- The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each length 2 containing the username and host, e.g. ('alice', 'google.com').

In [24]:
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in pairs:
    print(pair[0])  ## username
    print(pair[1])  ## host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com
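How the return value of findall() depends on the number of groups can be seen side by side. This is a small sketch reusing the email text from above; note the case of exactly one group, where the result is a list of strings holding only that group.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No groups: a list of whole matches
no_group = re.findall(r'[\w.-]+@[\w.-]+', text)

# One group: a list of strings containing only the grouped part
one_group = re.findall(r'([\w.-]+)@[\w.-]+', text)

# Two groups: a list of tuples, one tuple per match
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(no_group)
print(one_group)
print(two_groups)
```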


## sub(pattern, repl, string, count=0, flags=0)

This is the substitute function. It returns the string obtained by replacing or substituting the leftmost non-overlapping occurrences of pattern in string by the replacement repl. **If the pattern is not found then the string is returned unchanged.**

In [29]:
email_address = "Please contact us at: xyz@datacamp.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', email_address)
print(new_email_address)

Please contact us at: support@datacamp.com
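The replacement string can also refer back to groups (\1, \2), and count limits how many occurrences are replaced. The sketch below illustrates both; the addresses and the example.com domain are placeholders chosen for this example.

```python
import re

text = "alice@google.com, bob@abc.com, carol@xyz.org"
pattern = r'([\w.-]+)@([\w.-]+)'

# \1 in the replacement is the text of group 1 (the username)
all_replaced = re.sub(pattern, r'\1@example.com', text)

# count=1 replaces only the leftmost occurrence
first_replaced = re.sub(pattern, r'\1@example.com', text, count=1)

# If the pattern is not found, the string comes back unchanged
unchanged = re.sub(r'\d+', '#', "no digits here")

print(all_replaced)
print(first_replaced)
print(unchanged)
```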


## compile(pattern, flags=0)
Compiles a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, using the compile() function to save the resulting regular expression object for reuse is more efficient. Note that the compiled versions of the most recent patterns passed to compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time need not worry about compiling them.

In [30]:
pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
pattern.search(sequence).group()

# This is equivalent to:
#re.search(pattern, sequence).group()

'cookie'

## Example


In [97]:
import re
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Cuts the text at the first occurrence of "II", so only the opening part of the book is kept
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    # Replaces every run of characters other than letters, digits and periods with one space, then lowercases
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
#print(processed_book)

In [32]:
#Count the occurrences of the article "the" in the corpus
#Note: this pattern also counts "the" inside longer words such as "there" or "other"
len(re.findall(r'the', processed_book))

302

In [98]:
#convert every stand-alone instance of 'i' to 'I' in the corpus. Make sure not to change an 'i' occurring inside a word:
#the pattern requires whitespace on both sides, so an 'i' directly followed by a period is not converted
processed_book = re.sub(r'\si\s', " I ", processed_book)
#print(processed_book)
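One possible refinement of the two steps above is the word boundary \b, which matches the position between a word character and a non-word character without consuming anything. The sketch below uses an invented sentence rather than the book text.

```python
import re

sample = "then i said the other thing. it was i."

# Plain 'the' also matches inside "then" and "other"
substring_count = len(re.findall(r'the', sample))

# \bthe\b only matches "the" as a whole word
whole_word_count = len(re.findall(r'\bthe\b', sample))

# Whitespace on both sides misses the 'i' that is followed by a period
with_whitespace = re.sub(r'\si\s', ' I ', sample)

# Word boundaries also catch the 'i' next to punctuation
with_boundaries = re.sub(r'\bi\b', 'I', sample)

print(substring_count, whole_word_count)
print(with_whitespace)
print(with_boundaries)
```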

In [34]:
#What are the words connected by '--' in the corpus?

re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',
 'I--',
 '--though',
 'crime--we',
 'or--judge',
 'gaiters--still',
 '--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--',
 't--not',
 'me--then',
 'perhaps--',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--she',
 'old--and',
 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--he',
 'now--I',
 'Lihachof--',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--',
 'do--by',
 'know--my',
 'il

# Pipeline and Column Transformer

- Used for applying different transformations to different features in a scikit-learn pipeline
- ColumnTransformer is available from scikit-learn release 0.20 onwards

- Real-world data often contains heterogeneous data types. When processing the data before applying the final prediction model, we typically want to use different preprocessing steps and transformations for those different types of columns.

- A simple example: we may want to scale the numerical features and one-hot encode the categorical features.
- Putting preprocessing steps in a scikit-learn Pipeline is important to avoid data leakage or to do a grid search over preprocessing parameters.
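Why a Pipeline helps against leakage can be sketched with cross-validation: the scaler is re-fitted on the training part of every fold, so statistics from the held-out part never reach the model. The data below is synthetic and generated only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 3 features, label depends on the first two
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Scaling and the classifier are one estimator, so cross_val_score
# fits the scaler only on the training portion of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)

step_names = list(pipe.named_steps)
print(step_names)
print(scores.mean())
```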

In [52]:
titanic = pd.read_csv("https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv")
#print (titanic.head().T)
#dropping the missing values for simplicity

titanic2 = titanic.dropna(subset=['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'])
titanic2.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [39]:
target = titanic2.survived.values
features = titanic2[['pclass', 'sex', 'age', 'fare', 'embarked']]
features.head()

Unnamed: 0,pclass,sex,age,fare,embarked
0,1,female,29.0,211.3375,S
1,1,male,0.9167,151.55,S
2,1,female,2.0,151.55,S
3,1,male,30.0,151.55,S
4,1,female,25.0,151.55,S


This dataset contains some categorical variables ("pclass", "sex" and "embarked"), and some numerical variables ("age" and "fare"). Note that the "pclass", although categorical, is already encoded as integers in the dataset. So let's use the ColumnTransformer to combine transformers for those two types of features:

In [40]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

#create a transformer for the numerical features and the categorical features
#recent scikit-learn releases expect each tuple as (transformer, columns)
preprocess = make_column_transformer(
    (StandardScaler(), ['age', 'fare']),
    (OneHotEncoder(), ['pclass', 'sex', 'embarked'])
)



The above creates a simple preprocessing pipeline (that will be combined in a full prediction pipeline below) to scale the numerical features and one-hot encode the categorical features.
We can check this is indeed working as expected by transforming the input data

In [42]:
#to check that the input data has been transformed, print the transformed version of the array
preprocess.fit_transform(features)[:5]

array([[-0.05663194,  3.13554913,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  1.        ],
       [-2.01237899,  2.06268333,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  1.        ],
       [-1.93693697,  2.06268333,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  1.        ],
       [ 0.01300899,  2.06268333,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  1.        ],
       [-0.33519565,  2.06268333,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  1.        ]])
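The width of the transformed array follows from the transformers: one column per scaled feature plus one column per category of each encoded feature. The sketch below checks this on a tiny hand-made frame (not the Titanic data). It assumes a scikit-learn release where each tuple is given as (transformer, columns), and passes sparse_threshold=0 to force a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hand-made toy data for illustration only
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "Q"],
})

toy_preprocess = make_column_transformer(
    (StandardScaler(), ["age", "fare"]),
    (OneHotEncoder(), ["sex", "embarked"]),
    sparse_threshold=0,  # always return a dense array
)
transformed = toy_preprocess.fit_transform(toy)

# 2 scaled columns + 2 categories of sex + 3 categories of embarked
print(transformed.shape)
# The scaled columns have (approximately) zero mean
print(transformed[:, :2].mean(axis=0))
```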

## Integrating in a full pipeline
Example to integrate the ColumnTransformer in a prediction pipeline.

In [43]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

In [44]:
#take the full dataset this time including the missing values:

target = titanic.survived.values
features = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']].copy()

# Filling missing values in embarked with its most frequent value, a simple way to handle categorical missing values

features['embarked'] = features['embarked'].fillna(features['embarked'].value_counts().index[0])

In [53]:
#let's split the data in training and testing data:

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=0)

In [55]:
#selecting numerical and categorical columns based on their data types
numerical_features = features.dtypes == 'float'
categorical_features = ~numerical_features

In [57]:
#pipeline inside pipeline

#for numerical data, make another pipeline for imputation and scaling and then put it into this pipeline
#for categorical data, directly apply one hot encoder

#SimpleImputer will do imputation of missing values
#StandardScaler will do scaling and standardization of feature values

#recent scikit-learn releases expect each tuple as (transformer, columns)
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), numerical_features),
    (OneHotEncoder(), categorical_features))



Now we can combine this preprocessing step based on the ColumnTransformer with a classifier in a Pipeline to predict whether passengers of the Titanic survived or not:

In [58]:
model = make_pipeline(
    preprocess,
    LogisticRegression())

model.fit(X_train, y_train)

print("logistic regression score: %f" % model.score(X_test, y_test))

logistic regression score: 0.786585




## Using our pipeline in a grid search

- Using grid search to find the best regularization parameter of the logistic regression and the imputer strategy

In [59]:
from sklearn.model_selection import GridSearchCV

#these are the parameters that we will be searching over
param_grid = {
    'columntransformer__pipeline__simpleimputer__strategy': ['mean', 'median'],
    'logisticregression__C': [0.1, 1.0],
    }

#Performing the grid search (the iid argument no longer exists in recent scikit-learn releases):

grid_clf = GridSearchCV(model, param_grid, cv=10)
grid_clf.fit(X_train, y_train)





In [60]:
grid_clf.best_params_

{'columntransformer__pipeline__simpleimputer__strategy': 'mean',
 'logisticregression__C': 0.1}

In [61]:
print("best logistic regression from grid search: %f" % grid_clf.best_estimator_.score(X_test, y_test))

best logistic regression from grid search: 0.792683


## Example - Column transformer and One Hot Encoder
The purpose of this example is to predict whether or not a loan application will be successful based on a number of customer features. The dataset contains both categorical and numerical variables.

In [63]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape, test.shape)
print(train.dtypes)
print (train.head())

(614, 13) (367, 12)
Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1

In [None]:
#drop loan id as it does not provide any useful information for creating features
train = train.drop(['Loan_ID'], axis=1)
test =  test.drop(['Loan_ID'], axis=1)

#Filling the missing values in each column with its most commonly occurring value

train = train.apply(lambda x:x.fillna(x.value_counts().index[0]))
test = test.apply(lambda x:x.fillna(x.value_counts().index[0]))

In [65]:
feature_set = train.drop(['Loan_Status'], axis=1)

X = feature_set.columns
y = 'Loan_Status'

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train[X], train[y], random_state=0)

- ColumnTransformer applies transformations to selected columns to prepare them for use in the classification model.
- Here the categorical columns are one-hot encoded and the numerical columns are passed through Normalizer. Note that Normalizer rescales each sample (row) to unit norm across the selected columns; it does not scale each column independently.
- The ColumnTransformer takes a list of tuples, each giving a name, a transformer, and the columns on which the transformation needs to be applied. The columns can either be entered as strings specifying the column names in a pandas data frame or, as in the code below, as integers which are interpreted as the column positions.

In [68]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

colT = ColumnTransformer(
    [("dummy_col", OneHotEncoder(categories=[['Male', 'Female'],
                                           ['Yes', 'No'],
                                            ['0','1', '2','3+'],
                                            ['Graduate', 'Not Graduate'],
                                            ['No', 'Yes'],
                                            ['Semiurban', 'Urban', 'Rural']]), [0,1,2,3,4,10]),
      ("norm", Normalizer(norm='l1'), [5,6,7,8,9])])

- The categories argument of OneHotEncoder takes a list of all possible categories in each column, as a list of lists. This produces one-hot encoded columns for all categories, even if a category does not occur in the column. The reason for doing this is that an encoder left to infer the categories only knows the values present in the data it was fitted on. If other data does not contain the same categories in each feature, the array produced will not have the same shape as the data used to train the model, or you will get an error.
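The effect of the categories argument can be sketched on a single made-up column. The toy values below reuse the Property_Area labels but are not taken from the loan data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training column in which 'Semiurban' never occurs
train_col = np.array([["Urban"], ["Rural"], ["Urban"]], dtype=object)
new_col = np.array([["Semiurban"], ["Rural"]], dtype=object)

# Categories inferred from the data: only two output columns
inferred = OneHotEncoder().fit(train_col)
inferred_width = inferred.transform(train_col).shape[1]

# With the default handle_unknown='error', an unseen category raises
try:
    inferred.transform(new_col)
    raised = False
except ValueError:
    raised = True

# Categories listed up front: always three output columns, in the given order
fixed = OneHotEncoder(categories=[["Semiurban", "Urban", "Rural"]])
fixed_train = fixed.fit_transform(train_col).toarray()
fixed_new = fixed.transform(new_col).toarray()

print(inferred_width, raised)
print(fixed_train.shape, fixed_new.shape)
```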

## Example - Toxic comments

In [87]:
import pandas as pd
import numpy as np
from scipy import sparse

from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import roc_auc_score

In [96]:
df = pd.read_csv('input/train.csv')
print (df.head())
x = df['comment_text'].values[:5000]
y = df['toxic'].values[:5000]

                 id                                       comment_text  toxic  \
0  0000997932d777bf  Explanation\nWhy the edits made under my usern...      0   
1  000103f0d9cfb60f  D'aww! He matches this background colour I'm s...      0   
2  000113f07ec002fd  Hey man, I'm really not trying to edit war. It...      0   
3  0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...      0   
4  0001d958c54c6e35  You, sir, are my hero. Any chance you remember...      0   

   severe_toxic  obscene  threat  insult  identity_hate  
0             0        0       0       0              0  
1             0        0       0       0              0  
2             0        0       0       0              0  
3             0        0       0       0              0  
4             0        0       0       0              0  


In [90]:
# default params that we will be using with different transformations later
scoring='roc_auc'
cv=3
n_jobs=-1
max_features = 2500

In [91]:
#Simple pipeline of the default sklearn TfidfVectorizer to prepare features and Logistic Regression to make predictions.

tfidf = TfidfVectorizer(max_features=max_features)
lr = LogisticRegression()
p = Pipeline([
    ('tfidf', tfidf),
    ('lr', lr)
])

cross_val_score(estimator=p, X=x, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)

array([0.91248134, 0.92889109, 0.92437274])

Let's create our own estimator. It subclasses scikit-learn's BaseEstimator and needs to have fit and transform methods. The Pipeline first calls fit to learn from the dataset and then calls transform to apply what was learned.

In [92]:
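class NBFeaturer(BaseEstimator):
    # Scales each feature by a Naive Bayes log-count ratio learned in fit()
    def __init__(self, alpha):
        self.alpha = alpha  # smoothing term added to the counts
    
    def preprocess_x(self, x, r):
        # element-wise product of the sparse features with the ratio vector
        return x.multiply(r)
    
    def pr(self, x, y_i, y):
        # smoothed per-feature sum over the rows whose label equals y_i
        p = x[y==y_i].sum(0)
        return (p+self.alpha) / ((y==y_i).sum()+self.alpha)

    def fit(self, x, y=None):
        # log ratio of the smoothed feature sums for class 1 vs class 0
        self._r = sparse.csr_matrix(np.log(self.pr(x,1,y) / self.pr(x,0,y)))
        return self
    
    def transform(self, x):
        x_nb = self.preprocess_x(x, self._r)
        return x_nb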

In [93]:
tfidf = TfidfVectorizer(max_features=max_features)
lr = LogisticRegression()
nb = NBFeaturer(1)
p = Pipeline([
    ('tfidf', tfidf),
    ('nb', nb),
    ('lr', lr)
])

cross_val_score(estimator=p, X=x, y=y, scoring=scoring, cv=cv, n_jobs=n_jobs)

array([0.91851711, 0.93572898, 0.91808511])
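The fit/transform contract described above can be seen in isolation with a much smaller transformer. ColumnCenterer is a hypothetical class written for this sketch; it also inherits from TransformerMixin, which supplies fit_transform for free.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class ColumnCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # learn one mean per column from the training data
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # apply what was learned, without looking at the data's own means
        return np.asarray(X) - self.means_

# Tiny made-up dataset
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([0, 0, 1, 1])

centered = ColumnCenterer().fit_transform(X)

# Inside a Pipeline, fit() runs during pipe.fit and transform() during pipe.predict
pipe = Pipeline([("center", ColumnCenterer()), ("lr", LogisticRegression())])
pipe.fit(X, y)
predictions = pipe.predict(X)

print(centered.mean(axis=0))
print(predictions.shape)
```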

## References
1. https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial
2. http://queirozf.com/entries/scikit-learn-pipeline-examples