# Further Regression Analyses
## Summary and Initial Investigation Recap
The initial regression investigation found that the best results came from the non-resampled, stemmed reviews vectorized with the Term Frequency-Inverse Document Frequency (TF-IDF) method.

Regression was measured using the LinearRegression model only, and the best prediction on the test sample produced the following scores:
* Coefficient: 0.755
* R Squared Value: 0.612
* P Value: 0 to at least 5 decimal places
* Precision: 0.39
* F1 Score: 0.34
* Geometric Mean: 0.52

These scores are not good enough to provide an accurate sentiment analysis of the reviews, but the near-zero P Value is very promising.

The initial analysis also showed that the text data could be fitted to the overall star rating from 1 to 5. However, the predictions must be cut into integer values from 1 to 5; otherwise the fractional part of each float prediction skews metrics such as the geometric mean. The predicted values were also not confined to the 1-5 range, and a small number drifted beyond ±10.
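The cutting step described above can be sketched with NumPy alone. This is a minimal illustration rather than the notebook's own implementation: `to_star_rating` is a hypothetical helper, and the bin edges at 1.5, 2.5, 3.5 and 4.5 are assumed to be the intended cut points.

```python
import numpy as np

def to_star_rating(predictions):
    """Map continuous regression outputs onto integer ratings 1-5."""
    # Pull out-of-range predictions (e.g. -10.3 or 12.0) back into the 1-5 range
    clipped = np.clip(np.asarray(predictions, dtype=float), 1.0, 5.0)
    # right=True gives intervals of the form (1.5, 2.5], so 2.5 maps to rating 2
    return np.digitize(clipped, bins=[1.5, 2.5, 3.5, 4.5], right=True) + 1

# Hypothetical raw predictions, including values outside the valid range
raw = np.array([-10.3, 0.9, 1.49, 2.5, 3.2, 4.51, 12.0])
stars = to_star_rating(raw)
print(stars)  # [1 1 1 2 3 5 5]
```

Clipping first means that extreme predictions land in the outermost classes instead of producing values outside the rating scale.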

The data was imbalanced. Because of the number of reviews available, only undersampling was attempted on the training data, and the resampled data had a negative impact on all scores.

## Hypothesis
The P Value from the initial regression analysis indicates that there is a correlation between the review text and the 1-5 review rating. Because the linear regression model did not produce satisfactory scores, the relationship may not be linear.

Resampling did not help in the initial exploration, but the dataset is clearly imbalanced, with 71.9% of the ratings being either 1 or 5. Because of the amount of data, only undersampling was implemented; by reducing the size of the training set, other sampling methods become feasible. Different sampling methods may affect the data differently, so it may be necessary to both lemmatize and stem the original texts for each method. Similarly, different models may score better with different vectorization methods.
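As a rough illustration of how the imbalance can be measured and removed, the sketch below computes class shares and then under-samples every class to the size of the smallest one. The ratings are hypothetical toy values, and `class_shares` and `random_undersample` are illustrative helpers, not imblearn functions. Unlike `RandomUnderSampler(sampling_strategy='majority')`, which only shrinks the largest class, this stand-in cuts every class.

```python
import numpy as np

def class_shares(y):
    """Return {rating: share of records} for a vector of ratings."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): float(count) / len(y) for label, count in zip(labels, counts)}

def random_undersample(X, y, seed=0):
    """Illustrative stand-in: cut every class down to the size of the smallest."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
        for label in labels
    ])
    keep.sort()  # preserve the original record order
    return X[keep], y[keep]

# Hypothetical toy ratings, not the review data
y = np.array([5, 5, 5, 5, 1, 1, 1, 4, 3, 2])
X = np.arange(len(y)).reshape(-1, 1)

shares = class_shares(y)
extreme_share = shares[1] + shares[5]  # share of 1- and 5-star ratings
X_bal, y_bal = random_undersample(X, y)
balanced_shares = class_shares(y_bal)
```

The same share calculation applied to `y_train` before and after each sampler gives a quick check that a resampling step did what was intended.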

By implementing pipelines and GridSearchCV, several regression models can be tested quickly, and the best scores and parameters for each model can be returned. In this notebook the data will be lemmatized, stemmed, count-vectorized and TF-IDF vectorized as before. In addition, the training set will be halved, and RandomOverSampler, Synthetic Minority Over-sampling Technique (SMOTE) and ClusterCentroids resampling will be implemented alongside RandomUnderSampler. The raw data and the resampled data will then be dimensionally reduced using Lasso, Random Forest, XGBoost and Principal Component Analysis (PCA) methods. Finally, regression analyses will be performed by fitting and predicting with Lasso, Ridge, ElasticNet and GradientBoostingRegressor models.
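One way to compare alternative vectorizers and regressors in a single search is to let the parameter grid replace whole pipeline steps. The sketch below is a minimal example under stated assumptions: the corpus and ratings are invented toy data, scikit-learn's own `CountVectorizer` and `TfidfVectorizer` stand in for the notebook's helper functions, and only Ridge and Lasso are compared.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews and ratings, not the review data
texts = [
    "great cooker easy to use", "terrible pan broke quickly",
    "works fine nothing special", "love it best purchase",
    "awful quality waste of money", "good value heats fast",
    "average product does the job", "excellent pot very happy",
    "bad cord too short", "nice set easy to follow",
    "poor instructions hard to use", "happy with purchase works great",
]
ratings = [5, 1, 3, 5, 1, 4, 3, 5, 2, 4, 2, 5]

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("regressor", Ridge()),
])

# Each dict is searched separately; a list under a step name swaps the whole step
param_grid = [
    {
        "vectorizer": [CountVectorizer(), TfidfVectorizer()],
        "regressor": [Ridge()],
        "regressor__alpha": [0.1, 1.0, 10.0],
    },
    {
        "vectorizer": [CountVectorizer(), TfidfVectorizer()],
        "regressor": [Lasso(max_iter=10000)],
        "regressor__alpha": [0.001, 0.01],
    },
]

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3,
                    scoring="neg_mean_squared_error")
grid.fit(texts, ratings)

n_candidates = len(grid.cv_results_["params"])
print(n_candidates, grid.best_params_)
```

Keeping each stage as a single named step means only one alternative is fitted per candidate, which keeps the search space additive across alternatives rather than multiplicative.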

It may be that the data is better suited to classification than regression, so while this notebook analyses regression, a second notebook is being prepared to model the data with classification.

### Importing Libraries

In [61]:
from TextMiningProcesses import column_lemmatizer, column_stemmatizer, count_vectorize_data, tfidf_vectorize_data

from imblearn.metrics import classification_report_imbalanced
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids

from matplotlib import pyplot as plt

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_selection import RFE, SelectKBest, SelectFromModel
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, r2_score, mean_squared_error
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

from scipy.stats import linregress

import xgboost as xgb

import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline


#### Train - Test Preprocessing

In [53]:
# import data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# View length of data
print(f'Original number of Training data records: {len(train)}')
print(f'Original number of Test data records:{len(test)}')

# halve the amount of training data for sampling
half_train = train.iloc[:len(train) // 2, :]

# View length of data
print('Training data will be shortened to enable resampling within a reasonable timeframe')
print(f'New number of Training data records {len(half_train)}')

# Create X_train, X_test, y_train, y_test variables
X_train = half_train['reviewText']
y_train = half_train['overall']
X_test = test['reviewText']
y_test = test['overall']


Original number of Training data records: 43043
Original number of Test data records:18447
Training data will be shortened to enable resampling within a reasonable timeframe
New number of Training data records 21521


#### Text Processing pipeline

In [54]:
class Tokenizer(BaseEstimator, TransformerMixin):
    """Lemmatize the text when enabled=True, otherwise stem it."""
    def __init__(self, enabled=True):
        self.enabled = enabled

    def fit(self, X, y=None):
        # Nothing is learned here; y is optional so the step also works without targets
        return self

    def transform(self, X):
        if self.enabled:
            return column_lemmatizer(X)
        return column_stemmatizer(X)

tokens = Tokenizer(enabled=True)

text_pp_pipe = Pipeline([
    ('tokenized', tokens)
])

In [48]:
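# Testing text functions

# Rebinding `tokens` does not alter the Tokenizer instance already held by
# text_pp_pipe, so the transform below still lemmatizes (enabled=True).
# text_pp_pipe.set_params(tokenized__enabled=False) would switch the pipeline to stemming.
tokens = Tokenizer(enabled=False)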

processed_text = text_pp_pipe.transform(X_train)

print(processed_text[0])

bought cooker order able make chinese hot pot home instruction set easy follow even come cooking pot box included magnet make easy test pan pot see work cooker purchasing setting simple easy use cord nice enough length extend outlet kitchen table happy purchase wait find us 


#### Vectorization Pipeline

In [65]:
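class Vectorizer(BaseEstimator, TransformerMixin):
    """Count-vectorize the text when enabled=True, otherwise apply TF-IDF."""
    def __init__(self, enabled=True):
        self.enabled = enabled

    def fit(self, X, y=None):
        # Nothing is learned here; the helper functions are called at transform time
        return self

    def transform(self, X):
        # If the helpers fit a new vocabulary on every call, the training and
        # test matrices will not share the same columns
        if self.enabled:
            return count_vectorize_data(X)
        return tfidf_vectorize_data(X)

Vector = Vectorizer(enabled=True)

vector_pipe = Pipeline([
    ('Vector', Vector)
])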
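For the vectorizing step to produce matching columns on training and test data, the vocabulary has to be learned in `fit` and reused in `transform`. The sketch below shows that pattern with scikit-learn's own vectorizers; `SwitchableVectorizer` and its `use_counts` flag are hypothetical names, and the helpers in `TextMiningProcesses` are not shown in this notebook, so this is an assumption about how they could be wrapped rather than a description of them.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

class SwitchableVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: counts when use_counts=True, TF-IDF otherwise."""
    def __init__(self, use_counts=True):
        self.use_counts = use_counts

    def fit(self, X, y=None):
        # Learn the vocabulary once, from the training texts only
        self.vectorizer_ = CountVectorizer() if self.use_counts else TfidfVectorizer()
        self.vectorizer_.fit(X)
        return self

    def transform(self, X):
        # Reuse the fitted vocabulary so every matrix has the same columns
        return self.vectorizer_.transform(X)

# Hypothetical toy texts
train_texts = ["good pan heats fast", "bad pan broke fast", "good value"]
test_texts = ["good pan", "unknown words only"]

vec = SwitchableVectorizer(use_counts=False).fit(train_texts)
X_tr = vec.transform(train_texts)
X_te = vec.transform(test_texts)
print(X_tr.shape, X_te.shape)  # (3, 7) (2, 7)
```

Words that only occur in the test texts are ignored, so the test matrix keeps the seven training columns.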

In [64]:
variable = None

if variable:
    print(True)
else:
    print(False)

False


In [66]:
# Testing Functions

param_grid = {
    'text_preprocessing__tokenized__enabled': [True, False],
    'Vectorizing__Vector__enabled': [True, False]
}

lr = LinearRegression()

test_pipeline = Pipeline([
    ('text_preprocessing', text_pp_pipe),
    ('Vectorizing', vector_pipe),
    ('LR', lr)
])

grid = GridSearchCV(test_pipeline, param_grid=param_grid, cv=5)

grid.fit(X_train, y_train)

results_df = pd.DataFrame(grid.cv_results_)

results_df.head(10)

KeyboardInterrupt: 

#### Resampling pipeline

In [None]:
class cluster_sampler(BaseEstimator):
    """ClusterCentroids under-sampler with a configurable KMeans estimator."""
    def __init__(self, clusters=8):
        self.clusters = clusters

    def fit_resample(self, X_train, y_train):
        # A sampler changes the number of rows, so it exposes fit_resample
        # instead of fit/transform. Note that ClusterCentroids sets the
        # estimator's n_clusters itself, which may override this value.
        estimator = KMeans(n_clusters=self.clusters, random_state=69)
        sampler = ClusterCentroids(sampling_strategy='majority', random_state=69, estimator=estimator)
        X_undersampled, y_undersampled = sampler.fit_resample(X_train, y_train)
        return X_undersampled, y_undersampled

# The random samplers do not take an n_jobs argument
oversampler = RandomOverSampler(sampling_strategy='minority', random_state=69)
smotesampler = SMOTE(sampling_strategy='minority', random_state=69)
undersampler = RandomUnderSampler(sampling_strategy='majority', random_state=69)

# Candidate samplers, to be applied to the training data only
resample_pipe = ([
    ('RandomOverSampled', oversampler),
    ('SMOTE', smotesampler),
    ('RandomUnderSampled', undersampler),
    ('CC', cluster_sampler())
])

#### Dimensional Reduction Pipeline

In [None]:
class lasso_reduction(BaseEstimator, TransformerMixin):
    """Keep the features whose Lasso coefficients are not shrunk to (near) zero."""
    def __init__(self, alpha=1, max_iter=1000):
        self.alpha = alpha
        self.max_iter = max_iter

    def fit(self, X, y):
        self.lasso = Lasso(alpha=self.alpha, max_iter=self.max_iter)
        self.lasso.fit(X, y)
        return self

    def transform(self, X):
        sfm = SelectFromModel(self.lasso, prefit=True)
        reduced_X = sfm.transform(X)
        return reduced_X

class RandomForestReduction(BaseEstimator, TransformerMixin):
    """Keep the features a fitted random forest ranks above its default importance threshold."""
    def __init__(self, max_features=None, n_estimators=100, max_depth=None):
        self.max_features = max_features
        self.n_estimators = n_estimators
        self.max_depth = max_depth

    def fit(self, X, y):
        self.rf = RandomForestRegressor(max_features=self.max_features, n_estimators=self.n_estimators, max_depth=self.max_depth)
        self.rf.fit(X, y)
        return self

    def transform(self, X):
        sfm = SelectFromModel(self.rf, prefit=True)
        reduced_X = sfm.transform(X)
        return reduced_X

class XGBoostReduction(BaseEstimator, TransformerMixin):
    """Keep the features whose XGBoost importance reaches `threshold`."""
    def __init__(self, n_estimators=100, max_depth=None, learning_rate=0.1, threshold=5):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        # threshold is compared with feature_importances_, which are normalized fractions
        self.threshold = threshold

    def fit(self, X, y):
        self.boost_red = xgb.XGBRegressor(n_estimators=self.n_estimators, max_depth=self.max_depth, learning_rate=self.learning_rate)
        self.boost_red.fit(X, y)
        return self

    def transform(self, X):
        # prefit=True tells SelectFromModel that the estimator has already been fitted
        sfm = SelectFromModel(self.boost_red, threshold=self.threshold, prefit=True)
        reduced_X = sfm.transform(X)
        return reduced_X

# PCA may require dense input depending on the scikit-learn version
pca = PCA()

dim_red_pipe = Pipeline([
    ('lasso', lasso_reduction()),
    ('RF', RandomForestReduction()),
    ('XGBoost', XGBoostReduction()),
    ('PCA', pca)
])
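A compact version of the "select, then project" idea can be built from scikit-learn parts alone. The sketch below is illustrative only: the sparse matrix is random synthetic data standing in for a vectorized corpus, `TruncatedSVD` is swapped in for PCA because it accepts sparse input directly, and the `alpha` and `n_components` values are arbitrary assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Synthetic sparse features: only the first five columns drive the target
rng = np.random.default_rng(0)
X = sparse.random(200, 50, density=0.2, format="csr", random_state=0)
y = np.asarray(X[:, :5].sum(axis=1)).ravel() + 0.01 * rng.normal(size=200)

reducer = Pipeline([
    # Lasso zeroes out weak coefficients; SelectFromModel drops those columns
    ("select", SelectFromModel(Lasso(alpha=0.001, max_iter=10000))),
    # TruncatedSVD projects the surviving columns onto a few components
    ("svd", TruncatedSVD(n_components=3, random_state=0)),
])

X_reduced = reducer.fit_transform(X, y)
n_selected = int(reducer.named_steps["select"].get_support().sum())
print(n_selected, X_reduced.shape)
```

Because `SelectFromModel` fits its estimator inside `fit`, the selection is learned from the training fold only when this pipeline is cross-validated.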

#### Regression Pipeline

In [None]:
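# Candidate regressors. A Pipeline requires every step except the last to
# implement transform, so only one of these models can occupy the final step at a time.
regression_pipe = Pipeline([
    ('LinearRegression', LinearRegression()),
    ('Lasso', Lasso()),
    ('Ridge', Ridge()),
    ('ElasticNet', ElasticNet()),
    ('GBR', GradientBoostingRegressor())
])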

#### Cutting and Scoring


In [None]:
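class cut(BaseEstimator, TransformerMixin):
    """Cut continuous predictions into the integer ratings 1-5."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Open-ended outer bins keep the edges increasing even when the
        # predictions do not span the whole 1-5 range
        return pd.cut(x=X, bins=[-np.inf, 1.5, 2.5, 3.5, 4.5, np.inf], labels=[1, 2, 3, 4, 5])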
    

precision_scorer = make_scorer(precision_score, average='weighted')
recall_scorer = make_scorer(recall_score, average='weighted')
f1_scorer = make_scorer(f1_score, average='weighted')
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
    
scoring = {
    'accuracy': 'accuracy',
    'precision': precision_scorer,
    'recall': recall_scorer,
    'f1_score': f1_scorer,
    'r2_score': 'r2',
    'mse': mse_scorer
}
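Classification metrics such as F1 expect discrete labels, so a regressor's continuous output has to be cut to ratings before those metrics are computed. One possible arrangement, sketched below on synthetic data, does the cutting inside the scoring function. `to_stars` and `rounded_f1` are hypothetical helpers, and Ridge is used only as a placeholder model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_validate

def to_stars(y_pred):
    """Clip predictions to 1-5 and cut them at 1.5, 2.5, 3.5 and 4.5."""
    clipped = np.clip(y_pred, 1, 5)
    return np.digitize(clipped, [1.5, 2.5, 3.5, 4.5], right=True) + 1

def rounded_f1(y_true, y_pred):
    """Weighted F1 computed on the cut predictions."""
    return f1_score(y_true, to_stars(y_pred), average="weighted", zero_division=0)

scoring = {
    "f1_rounded": make_scorer(rounded_f1),
    "r2": "r2",                          # regression metrics use the raw predictions
    "neg_mse": "neg_mean_squared_error",
}

# Synthetic features and integer ratings, not the review data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.clip(np.rint(3 + X[:, 0] + 0.3 * rng.normal(size=100)), 1, 5).astype(int)

scores = cross_validate(Ridge(), X, y, cv=5, scoring=scoring)
print(sorted(k for k in scores if k.startswith("test_")))
```

With this arrangement the regression metrics and the classification-style metrics are reported side by side from the same fitted model.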

#### Parameter Grid

In [None]:
param_grid = {
    # the tokenizer step inside text_pp_pipe is named 'tokenized'
    'text_preprocessing__tokenized__enabled': [True, False],
    'Vectorizing__Vector__enabled': [True, False],
    # SMOTE takes no `kind` argument and RandomUnderSampler has no neighbour
    # setting, so only k_neighbors and the cluster count are tuned for resampling
    'resampling__SMOTE__k_neighbors': [2, 5, 10, 15, 25, 50, 100, 250, 500, 1000],
    'resampling__CC__clusters': [2, 5, 10, 15, 25, 50, 100, 250, 500, 1000],
    'dim_red__lasso__alpha': np.logspace(-4, 0, 8, endpoint=True, base=10),
    'dim_red__lasso__max_iter': [1000, 5000, 10000],
    # 1.0 uses all features, which is what 'auto' meant for a regressor
    'dim_red__RF__max_features': [1.0, 'log2'],
    'dim_red__RF__n_estimators': [50, 100, 250, 500, 1000, 2000, 4000],
    'dim_red__RF__max_depth': [10, 20, 30, None],
    'dim_red__XGBoost__learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'dim_red__XGBoost__n_estimators': [50, 100, 250, 500, 1000, 2000, 4000],
    'dim_red__XGBoost__max_depth': [10, 20, 30, None],
    'dim_red__XGBoost__threshold': [5, 10, 25, 50, 100, 250, 500, 1000, 1500, 2000, 2500, 5000],
    'regression__Lasso__alpha': np.logspace(-4, 0, 8, endpoint=True, base=10),
    'regression__Ridge__alpha': np.logspace(-4, 4, 9, endpoint=True, base=10),
    'regression__ElasticNet__alpha': np.logspace(-4, 4, 9, endpoint=True, base=10),
    'regression__ElasticNet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
    'regression__GBR__learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'regression__GBR__n_estimators': [50, 100, 250, 500, 1000, 2000, 4000],
    'regression__GBR__max_depth': [10, 20, 30, None]
}
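Because GridSearchCV fits every combination of the listed values, the number of fits is the product of the list lengths multiplied by the number of folds. The sketch below works this out for a three-entry excerpt of the grid above; `count_fits` is a hypothetical helper, and the excerpt is used only to keep the arithmetic easy to follow.

```python
from math import prod

def count_fits(param_grid, cv):
    """Return (candidate combinations, total model fits) for a single-dict grid."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

# Three entries taken from the grid above
excerpt = {
    'dim_red__RF__n_estimators': [50, 100, 250, 500, 1000, 2000, 4000],
    'dim_red__RF__max_depth': [10, 20, 30, None],
    'regression__ElasticNet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
}

n_candidates, n_fits = count_fits(excerpt, cv=5)
print(n_candidates, n_fits)  # 140 700
```

Every further entry multiplies the total again, so running the same check on the full grid before calling `fit` shows whether the search is feasible or whether a randomized or staged search is needed.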

#### Combined Pipeline

In [None]:
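# Note: sklearn's Pipeline does not call fit_resample, so the resampling stage
# needs imblearn.pipeline.Pipeline before this cell can apply the samplers
full_pipeline = Pipeline([
    ('text_preprocessing', text_pp_pipe),
    ('Vectorizing', vector_pipe),
    ('resampling', resample_pipe),
    ('dim_red', dim_red_pipe),
    ('regression', regression_pipe),
    ('cut', cut())
])

# With several metrics, GridSearchCV has to be told which one selects the best model
regression_grid = GridSearchCV(full_pipeline, param_grid=param_grid, scoring=scoring, cv=5, refit='r2_score')
regression_grid.fit(X_train, y_train)

results_df = pd.DataFrame(regression_grid.cv_results_)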

In [1]:
def over_sampler(X, y):
    """
    Resample the features (X) and target (y) using the RandomOverSampler from the imblearn
    over_sampling module. The sampler uses the minority sampling strategy to suit the Amazon
    review data. There are no further arguments to adjust.
    
    Args:
    Feature data, target data: These are expected to be tokenized, vectorized array data.
    Inputs should be training data only to avoid data leakage.
    
    Returns:
    Resampled feature data, resampled target data, as tokenized, vectorized arrays.

    Example:
    over_sampler(X_train, y_train)

    returns:
    X_train_oversampled, y_train_oversampled
    """
    # RandomOverSampler does not take an n_jobs argument
    sampler = RandomOverSampler(sampling_strategy='minority', random_state=69)
    X_oversampled, y_oversampled = sampler.fit_resample(X, y)
    return X_oversampled, y_oversampled

def smote_sampler(X, y, k, kind='regular'):
    """
    Resample the features (X) and target (y) using a SMOTE resampler from the imblearn
    over_sampling module. The sampler uses the minority sampling strategy to suit the Amazon
    review data. The model can be further optimized by cycling k and the kind of SMOTE
    ('regular', 'borderline' or 'svm') with a parameter grid.
    
    Args:
    Feature data, target data: These are expected to be tokenized, vectorized array data.
    Inputs should be training data only to avoid data leakage.
    k: number of nearest neighbours used to build the synthetic samples.
    kind: which SMOTE variant to use.
    
    Returns:
    Resampled feature data, resampled target data, as tokenized, vectorized arrays.

    Example:
    smote_sampler(X_train, y_train, k=1000, kind='regular')

    returns:
    X_train_oversampled, y_train_oversampled
    """
    # The SMOTE variants are separate classes rather than a `kind` argument
    from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE
    variants = {'regular': SMOTE, 'borderline': BorderlineSMOTE, 'svm': SVMSMOTE}
    sampler = variants[kind](sampling_strategy='minority', random_state=69, k_neighbors=k)
    X_oversampled, y_oversampled = sampler.fit_resample(X, y)
    return X_oversampled, y_oversampled

def under_sampler(X, y):
    """
    Resample the features (X) and target (y) using the RandomUnderSampler from the imblearn
    under_sampling module. The sampler uses the majority sampling strategy to suit the Amazon
    review data. RandomUnderSampler has no neighbour setting, so there are no further
    arguments to tune.
    
    Args:
    Feature data, target data: These are expected to be tokenized, vectorized array data.
    Inputs should be training data only to avoid data leakage.
    
    Returns:
    Resampled feature data, resampled target data, as tokenized, vectorized arrays.

    Example:
    under_sampler(X_train, y_train)

    returns:
    X_train_undersampled, y_train_undersampled
    """
    sampler = RandomUnderSampler(sampling_strategy='majority', random_state=69)
    X_undersampled, y_undersampled = sampler.fit_resample(X, y)
    return X_undersampled, y_undersampled

def cluster_sampler(X, y, k):
    """
    Resample the features (X) and target (y) using the ClusterCentroids resampler from the
    imblearn under_sampling module. To maintain simplicity, this function uses the KMeans
    cluster method only. The n_clusters value can be tuned to fit the data better, and further
    cluster methods can be explored if the resampler provides good scores.
    
    Args:
    Feature data, target data: These are expected to be tokenized, vectorized array data.
    Inputs should be training data only to avoid data leakage.
    k: number of clusters passed to KMeans.
    
    Returns:
    Resampled feature data, resampled target data, as tokenized, vectorized arrays.

    Example:
    cluster_sampler(X_train, y_train, k=5)

    returns:
    X_train_undersampled, y_train_undersampled
    """
    estimator = KMeans(n_clusters=k, random_state=69)
    # Parallelism, if wanted, is configured on the estimator rather than on ClusterCentroids
    sampler = ClusterCentroids(sampling_strategy='majority', random_state=69, estimator=estimator)
    X_undersampled, y_undersampled = sampler.fit_resample(X, y)
    return X_undersampled, y_undersampled