### Text Features
Any text fed to a machine learning model must first be preprocessed and encoded as numbers. There are many techniques for doing this:
- Bag of Words (TF-IDF and N-Grams)
- Word Embeddings 
- Manually Defined Features
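
To make the first option concrete, here is a minimal pure-Python sketch of a bag of n-grams. It only produces raw counts; TF-IDF would additionally reweight these counts by how rare each term is across documents. The helper names `ngrams` and `bag_of_ngrams` and the two toy documents are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=2):
    """Count all 1..max_n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


docs = ['good film', 'not a good film']

# shared vocabulary: every n-gram seen in any document, in sorted order
vocabulary = sorted(set().union(*(bag_of_ngrams(d) for d in docs)))

# one row of counts per document, one column per vocabulary entry
matrix = [[bag_of_ngrams(d)[term] for term in vocabulary] for d in docs]

print(vocabulary)
print(matrix)
```

Each document becomes a fixed-length numeric row, which is the form a model expects.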

In [5]:
import pandas as pd
import numpy as np
from textblob import Word, TextBlob
import re 

import nltk
nltk.download('punkt')

from scipy.sparse import csr_matrix 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vlad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!



**Manual Feature Engineering**

A specific domain usually carries domain-specific information hidden in the data, and we want to extract as much of it as possible. For example, in a sentiment analysis task we may see texts like these:
- `"Average film, however, starring Matt Damon, 8/10"` -> 8/10 means positive sentiment
- `"2/10, there is nothing to add"` -> 2/10 means negative sentiment

**Token Based Features**
- positive/negative smileys (emoticons)
- numbers that might be related to rating

In [7]:
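def get_rate(text: str):
    # collect rating-like fractions such as "8/10" and return their mean (-1 if none found)
    rating_candidates = re.findall(r'(\d{1,3})/(\d{1,2})', text)
    rates = []
    for numerator, denominator in rating_candidates:
        if int(denominator) == 0:
            # a zero denominator (e.g. "5/0") is treated as rating 0
            return 0
        # parse the numbers explicitly instead of calling eval() on raw text
        rates.append(int(numerator) / int(denominator))
    return np.mean(rates) if rates else -1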

def get_positive_smiles():
    # emoticons are written with the ASCII hyphen-minus so they match typed text
    positive_smiles = set([
    ":-)",":)",":-]",":]",":-3",":3",":->",":>","8-)","8)",":-}",":}",":o)",":c)",":^)","=]","=)",":-D",":D","8-D","8D",
    "x-D","xD","X-D","XD","=D","=3","B^D",":-))",";-)",";)","*-)","*)",";-]",";]",";^)",":-,",";D",":-P",":P","X-P","XP",
    "x-p","xp",":-p",":p",":-Þ",":Þ",":-þ",":þ",":-b",":b","d:","=p",">:P", ":'-)", ":')",  ":-*", ":*", ":×"
    ])
    return positive_smiles

def get_negative_smiles():
    # emoticons are written with the ASCII hyphen-minus so they match typed text
    negative_smiles = set([
    ":-(",":(",":-c",":c",":-<",":<",":-[",":[",":-||",">:[",":{",":@",">:(","D-':","D:<","D:","D8","D;","D=","DX",":-/",
    ":/",":-.",'>:\\', ">:/", ":\\", "=/" ,"=\\", ":L", "=L",":S",":-|",":|","|-O","<:-|"
    ])
    return negative_smiles


def get_token_features(text, return_sparce=False):
    features_df = pd.DataFrame()
    positive_smiles = get_positive_smiles()
    negative_smiles = get_negative_smiles()

    features_df['rating'] = text.apply(get_rate).fillna(-1)
    features_df['positive_smiles'] = text.apply(lambda s: len([x for x in s.split() if x in positive_smiles]))
    features_df['negative_smiles'] = text.apply(lambda s: len([x for x in s.split() if x in negative_smiles]))

    if return_sparce:
        return csr_matrix(features_df.values)
    return features_df

**Sentence-based Features**
- Count Features
    - sentence len
    - exclamation mark, question mark, ...
    - uppercase word count

- Contrast
    - words like "instead", "on the contrary", ...

- First last sentence comparison:
    - polarity, subjectivity, purity of first/last sentence[s]
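
The "purity" feature is the least self-explanatory item in this list. In this notebook it is the signed sum of sentence polarities divided by the sum of their absolute values. A minimal sketch with made-up polarity values (not TextBlob outputs) shows the range of the score; the function name `purity` and the zero-division fallback are assumptions that mirror the `fillna(0)` used in the feature code.

```python
def purity(polarities):
    """Signed sum of polarities divided by the sum of their absolute values."""
    total_abs = sum(abs(p) for p in polarities)
    if total_abs == 0:
        # assumption: an all-neutral review gets purity 0
        return 0.0
    return sum(polarities) / total_abs


consistently_positive = purity([0.5, 0.25])    # every sentence is positive
consistently_negative = purity([-0.5, -0.25])  # every sentence is negative
mixed = purity([0.5, -0.5])                    # sentiment flips between sentences

print(consistently_positive, consistently_negative, mixed)
```

Values near 1 or -1 mean the review keeps one sentiment throughout, while values near 0 mean the sentences disagree with each other.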

In [9]:
def get_contras_words():
    contrast_conj = set([
        'alternatively','anyway','but','by contrast','differ from','elsewhere','even so','however','in contrast','in fact',
        'in other respects','in spite of','in that respect','instead','nevertheless','on the contrary','on the other hand',
        'rather','though','whereas','yet'
    ])
    return contrast_conj

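# review "purity": consistent sentiment across sentences (~1 or ~-1) vs. mixed sentiment (~0)
def get_purity(text: list):
    """
    Compute sentiment "purity" from the polarities of a list of sentences.
    Returns ~1 (consistently positive), ~-1 (consistently negative) or ~0 (mixed sentiment).
    The result is NaN when every sentence is neutral; the caller fills it with 0.
    """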
    polarities = np.array([TextBlob(x).sentiment.polarity for x in text])
    return polarities.sum() / np.abs(polarities).sum()


def get_text_features(text, return_sparce=False):
    features_df = pd.DataFrame()
    uppercase_pattern = re.compile(r'(\b[0-9]*[A-Z]+[0-9]*[A-Z]+[0-9]*\b)')
    sentence_splitter = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=[.!?])\s')
    contrast_words = get_contras_words()

    features_df['sentences'] = text.apply(lambda s: re.split(sentence_splitter, s))
    # NOTE: despite its name, this column holds the character length of the text, not a sentence count
    features_df['sentence_cnt'] = text.apply(len)
    features_df['exclamation_cnt'] = text.str.count('!')
    features_df['question_cnt'] = text.str.count(r'\?')
    features_df['upper_word_cnt'] = text.apply(lambda s: len(re.findall(uppercase_pattern, s)))
    features_df['contrast_conj_cnt'] = text.apply(lambda s: len([c for c in contrast_words if c in s]))

    features_df['polarity_1st_sent'] = features_df['sentences'].apply(lambda s: TextBlob(s[0]).sentiment.polarity)
    features_df['polarity_last_sent'] = features_df['sentences'].apply(lambda s: TextBlob(s[-1]).sentiment.polarity)
    features_df['subjectivity_1st_sent'] = features_df['sentences'].apply(lambda s: TextBlob(s[0]).sentiment.subjectivity)
    features_df['subjectivity_last_sent'] = features_df['sentences'].apply(lambda s: TextBlob(s[-1]).sentiment.subjectivity)
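    # NOTE: s[-1] is only the last character of the review, so this column is always 0.0;
    # use TextBlob(s) instead to get the polarity of the whole text
    features_df['polarity'] = text.apply(lambda s: TextBlob(s[-1]).sentiment.polarity)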
    features_df['purity'] = features_df['sentences'].apply(get_purity).fillna(0)

    if return_sparce:
        return csr_matrix(features_df[features_df.columns[1:]].values)
    return features_df

In [12]:
# let's test custom features:
reviews = pd.Series([
    "Waste of time :( 2/10 for the plot and 4/10 for acting!",
    'Awful film! Nobody can like it',
    'Wow! Am I impressed?? TOTALLY :D',
    '7/10'
])

# token-based
token_features = get_token_features(reviews)
token_features

| | rating | positive_smiles | negative_smiles |
|---|---|---|---|
| 0 | 0.3 | 0 | 1 |
| 1 | -1.0 | 0 | 0 |
| 2 | -1.0 | 1 | 0 |
| 3 | 0.7 | 0 | 0 |


In [13]:
# text-based
sentence_features = get_text_features(reviews)
sentence_features

| | sentences | sentence_cnt | exclamation_cnt | question_cnt | upper_word_cnt | contrast_conj_cnt | polarity_1st_sent | polarity_last_sent | subjectivity_1st_sent | subjectivity_last_sent | polarity | purity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [Waste of time :( 2/10 for the plot and 4/10 f... | 55 | 1 | 0 | 0 | 0 | -0.316667 | -0.316667 | 0.333333 | 0.333333 | 0.0 | -1.0 |
| 1 | [Awful film!, Nobody can like it] | 30 | 1 | 0 | 0 | 0 | -1.0 | 0.0 | 1.0 | 0.0 | 0.0 | -1.0 |
| 2 | [Wow!, Am I impressed??, TOTALLY :D] | 32 | 1 | 2 | 1 | 0 | 0.125 | 0.5 | 1.0 | 0.875 | 0.0 | 1.0 |
| 3 | [7/10] | 4 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### FeatureUnion. Glue Different Feature Bricks
Sometimes we need to "glue" different feature "bricks" into one feature matrix *X*. For example, we may want to combine BoW features with manually generated features. The easiest way to do this is `sklearn.pipeline.FeatureUnion`.

However, every feature block must be wrapped in an object that implements the `.fit()` and `.transform()` methods. This can be achieved in several ways:
- deriving a class from `BaseEstimator` and `TransformerMixin`
- writing a custom function and passing it to `FunctionTransformer`
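
The first option can be sketched as follows. This is a minimal illustration, assuming scikit-learn and NumPy are installed; the class names `TextLengthTransformer` and `ExclamationCountTransformer` are hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class TextLengthTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with the character length of each text."""

    def fit(self, X, y=None):
        # nothing to learn; return self so the object can be chained
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])


class ExclamationCountTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with the number of '!' in each text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[text.count('!')] for text in X])


union = FeatureUnion([
    ('length', TextLengthTransformer()),
    ('exclamations', ExclamationCountTransformer()),
])

texts = ['Awful film! Nobody can like it', '7/10']
X_features = union.fit_transform(texts)  # one row per text, one column per transformer
print(X_features)
```

A class is the better fit when a feature block needs state learned in `fit()`, such as a vocabulary; for stateless functions, `FunctionTransformer` is usually shorter.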


In [14]:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

In [16]:
# consider a simple example
def get_features_a(X):
    return X

def get_features_b(X):
    return X**2

features1 = FunctionTransformer(
    func=get_features_a,
    validate=False, # to silence many warnings
    accept_sparse=True # to use convenient sparse representations
)

features2 = FunctionTransformer(
    func=get_features_b,
    validate=False, # to silence many warnings
    accept_sparse=True # to use convenient sparse representations
)

features = FeatureUnion([
    ('f1', features1),
    ('f2', features2)
])

In [17]:
X = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])

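# FunctionTransformer is stateless, so fitting is a no-op; fit_transform keeps the call valid
# on scikit-learn versions that require a fitted FeatureUnion
features.fit_transform(X)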

array([[1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1]])
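
For dense inputs like this one, the union simply places the output columns of each block side by side. A small NumPy-only sketch, assuming the same `X` and the same two feature functions as above, reproduces the result by hand:

```python
import numpy as np

X = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])

# block "f1" returns X unchanged, block "f2" returns X squared element-wise;
# stacking them horizontally gives the same 3 x 6 matrix as the union
stacked = np.hstack([X, X ** 2])
print(stacked)
```

The row count stays the same and the column counts of the blocks add up, so every block must return one row per input sample.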