# Further Regression Analyses
## Summary and Initial Investigation Recap
The initial regression investigation found that the best results came from the non-resampled, stemmatized reviews vectorized with the Term Frequency-Inverse Document Frequency (TFIDF) method.

Only the LinearRegression model was used, and its best prediction on the test sample gave the following scores:
* Coefficient: 0.755
* R Squared Value: 0.612
* P Value: 0 (to at least 5 decimal places)
* Precision: 0.39
* F1 Score: 0.34
* Geometric Mean: 0.52

These scores are too low to provide an accurate sentiment analysis of the reviews, but the P Value is promising.

The initial analysis also showed that the text data can be fitted to the overall star rating from 1 to 5. However, the predictions must be cut into the integers 1 to 5, otherwise the fractional parts of the float predictions skew error metrics such as the geometric mean. The predicted values were also not confined to the 1-5 range, and a small number drifted beyond ±10.
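
One simple way to bring float predictions back onto the 1-5 star scale is to clip them to the valid range and round to the nearest integer. The sketch below uses made-up prediction values rather than results from this notebook, and `to_star_rating` is a hypothetical helper name, not something defined elsewhere in the project.

```python
import numpy as np

def to_star_rating(y_pred):
    """Clip float predictions to [1, 5] and round to the nearest star (hypothetical helper)."""
    clipped = np.clip(np.asarray(y_pred, dtype=float), 1.0, 5.0)
    # np.rint rounds to the nearest integer (exact halves go to the even neighbour)
    return np.rint(clipped).astype(int)

# Illustrative values only, including two that drift beyond ±10
raw_predictions = [-10.3, 0.4, 2.6, 3.2, 4.7, 12.9]
stars = to_star_rating(raw_predictions)
print(stars)
```

Clipping first means that extreme predictions fall into the 1-star or 5-star class instead of being lost.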

The data was imbalanced and, because of the quantity of reviews available, only undersampling was attempted on the training data. The resampled data, however, had a negative impact on all scores.

## Hypothesis
The P Value from the initial regression analysis indicates that there is indeed a correlation between the reviews and the review rating scores from 1 to 5. Since the linear regression model did not provide satisfactory scores, the correlation may not be linear.

The resampling was not sufficient in the initial exploration, but we can clearly see that the dataset is imbalanced, with 71.9% of the ratings being either 1 or 5. Because of the amount of data, only undersampling was implemented; by reducing the size of the training set, other sampling methods become feasible. Different sampling methods may affect the data differently, so it may be necessary to both lemmatize and stemmatize the original texts for each method. Similarly, different models may score better with different vectorization methods.
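
The imbalance figure can be checked directly from the rating column. The sketch below uses a toy series in place of the real `overall` column, so the printed share is illustrative and is not the 71.9% quoted above.

```python
import pandas as pd

# Toy ratings standing in for the 'overall' column; not the real data
ratings = pd.Series([5, 5, 1, 4, 5, 1, 3, 5, 2, 5], name='overall')

# Share of each star rating, ordered by rating
proportions = ratings.value_counts(normalize=True).sort_index()
print(proportions)

# Combined share of the two extreme classes
extreme_share = proportions.loc[[1, 5]].sum()
print(f'Share of 1- and 5-star reviews: {extreme_share:.1%}')
```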

By implementing pipelines and GridSearchCV, several regression models can be tested quickly, returning the best scores and parameters for each model. In this notebook the data will be lemmatized, stemmatized, CountVectorized and TFIDF vectorized as before. In addition, the training set will be halved, and RandomOverSampler, Synthetic Minority Over-sampling Technique (SMOTE) and ClusterCentroids resampling will be implemented alongside RandomUnderSampler. The raw data and the resampled data will be dimensionally reduced using Lasso, Random Forest, XGBoost and Principal Component Analysis (PCA) methods. Finally, regression analyses will be performed by fitting and predicting with Lasso, Ridge, ElasticNet and GradientBoostingRegressor models.

It may be that the data is better suited to classification than regression, so while this notebook analyses regression, a second notebook is being prepared to model the data with classification.

### Importing Libraries

In [2]:
from TextMiningProcesses import column_lemmatizer, column_stemmatizer, count_vectorize_data, tfidf_vectorize_data

from imblearn.metrics import classification_report_imbalanced
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids

from matplotlib import pyplot as plt

from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin
from sklearn.cluster import KMeans
from sklearn.compose import TransformedTargetRegressor
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_selection import RFE, SelectKBest, SelectFromModel
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, r2_score, mean_squared_error
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

from scipy.stats import linregress

import xgboost as xgb

import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline


0             good like 
1    happy best product 
2    good happy product 
3      bad stuff broken 
4        evil bad thing 
5          good product 
dtype: object
[[0.         0.         0.         0.         0.56921261 0.
  0.82219037 0.         0.         0.        ]
 [0.         0.68172171 0.         0.         0.         0.55902156
  0.         0.47196441 0.         0.        ]
 [0.         0.         0.         0.         0.54209195 0.64208461
  0.         0.54209195 0.         0.        ]
 [0.50161301 0.         0.61171251 0.         0.         0.
  0.         0.         0.61171251 0.        ]
 [0.50161301 0.         0.         0.61171251 0.         0.
  0.         0.         0.         0.61171251]
 [0.         0.         0.         0.         0.70710678 0.
  0.         0.70710678 0.         0.        ]]
1.0


#### Train - Test Preprocessing

In [3]:
# import data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# View length of data
print(f'Original number of Training data records: {len(train)}')
print(f'Original number of Test data records:{len(test)}')

# halve the amount of training data for sampling
half_train = train.iloc[:len(train) // 2, :]

# View length of data
print(f'Training data will be shortened to enable resampling within a reasonable timeframe')
print(f'New number of Training data records {len(half_train)}')

# Create X_train, X_test, y_train, y_test variables
X_train = half_train['reviewText']
y_train = half_train['overall']
X_test = test['reviewText']
y_test = test['overall']


Original number of Training data records: 43043
Original number of Test data records:18447
Training data will be shortened to enable resampling within a reasonable timeframe
New number of Training data records 21521


#### Cutting and Scoring


In [4]:
class CutTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y):
        return self
    
    def transform(self, X):
        X_cut = pd.cut(x=X, bins=[X.min(),1.5,2.5,3.5,4.5,X.max()], labels=[1,2,3,4,5], include_lowest=True)
        return X_cut
    

precision_scorer = make_scorer(precision_score, average='weighted')
recall_scorer = make_scorer(recall_score, average='weighted')
f1_scorer = make_scorer(f1_score, average='weighted')
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
    
scoring = {
    'accuracy': 'accuracy',
    'precision': precision_scorer,
    'recall': recall_scorer,
    'f1_score': f1_scorer,
    'r2_score': 'r2',
    'mse': mse_scorer
}
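
Because the outer bin edges in `CutTransformer` are taken from `X.min()` and `X.max()`, the edges only increase monotonically when the predictions reach below 1.5 and above 4.5. A minimal sketch of an alternative, assuming open-ended outer bins are acceptable, is shown below. `cut_to_stars` is a hypothetical name used for illustration.

```python
import numpy as np
import pandas as pd

def cut_to_stars(y_pred):
    """Bin float predictions into star ratings 1-5 with open-ended outer bins (hypothetical helper)."""
    bins = [-np.inf, 1.5, 2.5, 3.5, 4.5, np.inf]
    return pd.cut(np.asarray(y_pred, dtype=float), bins=bins, labels=[1, 2, 3, 4, 5])

# All values lie between 2 and 4, a case where data-dependent edges are not monotonic
narrow_predictions = np.array([2.1, 2.9, 3.4, 3.6, 3.9])
stars = cut_to_stars(narrow_predictions)
print(list(stars))
```

With fixed edges the binning no longer depends on the range of a particular batch of predictions.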

#### Text Processing

In [5]:
class Tokenizer(BaseEstimator, TransformerMixin):
    def __init__(self, lemmatizer=True):
        self.lemmatizer = lemmatizer

    def fit(self, X, y):
        return self
    
    def transform(self, X):
        if self.lemmatizer:
            tokenized_X = column_lemmatizer(X)
            print(f'Lemmatization Shape: {tokenized_X.shape}')
            return tokenized_X
        else:
            tokenized_X = column_stemmatizer(X)
            print(f'Stemmatization Shape: {tokenized_X.shape}')
            return tokenized_X
    
tokens = Tokenizer(lemmatizer=True)



In [6]:
# Testing text functions

tokens = Tokenizer(lemmatizer=True)

processed_text = tokens.transform(X_train)

print(processed_text[0])
print(len(X_train))
print(len(processed_text))

Lemmatization Shape: (21521,)
bought cooker order able make chinese hot pot home instruction set easy follow even come cooking pot box included magnet make easy test pan pot see work cooker purchasing setting simple easy use cord nice enough length extend outlet kitchen table happy purchase wait find us 
21521
21521


#### Vectorization

In [7]:
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, CountVector=True, max_features=None):
        self.CountVector = CountVector
        self.max_features = max_features

    def fit(self, X, y):
        return self
    
    def transform(self, X):
        if self.CountVector:
            vectorized_x = count_vectorize_data(X, max_features=self.max_features)
            print(f'CVector shape: {vectorized_x[0].shape}')
            return vectorized_x[0]
        else:
            vectorized_x = tfidf_vectorize_data(X, max_features=self.max_features)
            print(f'TFIDF shape: {vectorized_x[0].shape}')
            return vectorized_x[0]
        
Vector = Vectorizer(CountVector=True, max_features=8000)

In [8]:
# tokenizer, vectorizer, cutter test

tokens = Tokenizer(lemmatizer=True)

Vector = Vectorizer(CountVector=True, max_features=8000)

cutter = CutTransformer()

lr = LinearRegression()

X_tokens = tokens.transform(X_train)
X_vector = Vector.transform(X_tokens)
lr.fit(X_vector, y_train)
y_pred = lr.predict(X_vector)
y_cut = cutter.transform(y_pred)

y_cut.describe()

display(pd.crosstab(y_cut, y_train))

Lemmatization Shape: (21521,)
CVector shape: (21521, 8000)


| row_0 (predicted) \ overall (actual) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 2227 | 184 | 26 | 2 | 0 |
| 2 | 2037 | 635 | 255 | 62 | 30 |
| 3 | 1305 | 577 | 1004 | 699 | 768 |
| 4 | 112 | 93 | 382 | 1637 | 4196 |
| 5 | 0 | 3 | 32 | 527 | 4728 |
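
The crosstab can be summarised numerically. The sketch below copies the counts from the output above (rows are the cut predictions, columns the true `overall` rating) and derives the overall accuracy and the recall per rating. These are training-set figures, since the model predicted the same data it was fitted on, and just under half of the reviews land on the diagonal.

```python
import numpy as np

# Rows: predicted (cut) rating 1-5; columns: true 'overall' rating 1-5
confusion = np.array([
    [2227,  184,   26,    2,    0],
    [2037,  635,  255,   62,   30],
    [1305,  577, 1004,  699,  768],
    [ 112,   93,  382, 1637, 4196],
    [   0,    3,   32,  527, 4728],
])

# Correct predictions sit on the diagonal
accuracy = np.trace(confusion) / confusion.sum()
# Recall per true rating: diagonal count divided by each column total
recall_per_rating = np.diag(confusion) / confusion.sum(axis=0)

print(f'Training accuracy: {accuracy:.3f}')
for rating, recall in zip(range(1, 6), recall_per_rating):
    print(f'Recall for {rating} stars: {recall:.3f}')
```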


#### Resampling pipeline

In [9]:
class Sampler(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=5, state=0, k_neighbors=50):
        self.clusters = clusters
        self.estimator = KMeans(n_clusters=self.clusters, random_state=69)
        self.state = state
        self.k_neighbors = k_neighbors

    def fit(self, X_train, y_train):
        if self.state == 0:
            oversampler = RandomOverSampler(sampling_strategy='minority', random_state=69)
            X_sampled, y_sampled = oversampler.fit_resample(X_train, y_train)
            print(f'rOs size: {X_sampled.size, y_sampled.size}')
            return X_sampled, y_sampled
        
        elif self.state == 1:
            smotesampler = SMOTE(sampling_strategy='minority', random_state=69, k_neighbors=self.k_neighbors)
            X_sampled, y_sampled = smotesampler.fit_resample(X_train, y_train)
            print(f'smote size: {X_sampled.size, y_sampled.size}, k_neighbors ={self.k_neighbors}')
            return X_sampled, y_sampled
        
        elif self.state == 2:
            undersampler = RandomUnderSampler(sampling_strategy='majority', random_state=69)
            X_sampled, y_sampled = undersampler.fit_resample(X_train, y_train)
            print(f'rUs size: {X_sampled.size, y_sampled.size}')
            return X_sampled, y_sampled
        
        elif self.state == 3:
            ccsampler = ClusterCentroids(sampling_strategy='majority', random_state=69, estimator=self.estimator)
            X_sampled, y_sampled = ccsampler.fit_resample(X_train, y_train)
            print(f'Cluster size: {X_sampled.size, y_sampled.size}, clusters = {self.clusters}')
            return X_sampled, y_sampled
        
        else:
            raise ValueError('State outside 0-3')
    
    def transform(self, X):
        return X

sample = Sampler()

In [11]:
param_grid = {
    'Text_PP__lemmatizer': [True, False],
    'CVector__CountVector': [True, False]
    # 'Resample__state': [0, 1, 2, 3],
    # 'Resample__clusters': [5],
    # 'Resample__k_neighbors': [50]
}

Vector = Vectorizer(CountVector=True, max_features=None)

test_pipeline = Pipeline([
    ('Text_PP', tokens),
    ('CVector', Vector),
    ('LR', lr)
])

grid = GridSearchCV(test_pipeline, param_grid=param_grid, cv=2)

grid.fit(X_train, y_train)

results_df = pd.DataFrame(grid.cv_results_)

results_df.head()

Lemmatization Shape: (10760,)
CVector shape: (10760, 19833)
Lemmatization Shape: (10761,)
CVector shape: (10761, 18825)


Traceback (most recent call last):
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 810, in _score
    scores = scorer(estimator, X_test, y_test)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\metrics\_scorer.py", line 527, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\pipeline.py", line 760, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 760, in score
    y_pred = self.predict(X)
             ^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\sit

Lemmatization Shape: (10761,)
CVector shape: (10761, 18825)
Lemmatization Shape: (10760,)
CVector shape: (10760, 19833)


Stemmatization Shape: (10760,)
CVector shape: (10760, 15078)
Stemmatization Shape: (10761,)
CVector shape: (10761, 14091)


Stemmatization Shape: (10761,)
CVector shape: (10761, 14091)
Stemmatization Shape: (10760,)
CVector shape: (10760, 15078)


Lemmatization Shape: (10760,)
TFIDF shape: (10760, 19833)
Lemmatization Shape: (10761,)
TFIDF shape: (10761, 18825)


Lemmatization Shape: (10761,)
TFIDF shape: (10761, 18825)
Lemmatization Shape: (10760,)
TFIDF shape: (10760, 19833)


Stemmatization Shape: (10760,)
TFIDF shape: (10760, 15078)
Stemmatization Shape: (10761,)
TFIDF shape: (10761, 14091)


Stemmatization Shape: (10761,)
TFIDF shape: (10761, 14091)
Stemmatization Shape: (10760,)
TFIDF shape: (10760, 15078)


Lemmatization Shape: (21521,)
CVector shape: (21521, 27015)


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_CVector__CountVector | param_Text_PP__lemmatizer | params | split0_test_score | split1_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.226485 | 0.119513 | 1.267871 | 0.002429 | True | True | {'CVector__CountVector': True, 'Text_PP__lemma... | | | | | 1 |
| 1 | 10.956761 | 0.754733 | 2.929029 | 0.004717 | True | False | {'CVector__CountVector': True, 'Text_PP__lemma... | | | | | 1 |
| 2 | 19.012377 | 1.251978 | 1.261772 | 0.002349 | False | True | {'CVector__CountVector': False, 'Text_PP__lemm... | | | | | 1 |
| 3 | 6.398927 | 1.28302 | 2.921455 | 0.002443 | False | False | {'CVector__CountVector': False, 'Text_PP__lemm... | | | | | 1 |
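
The shape lines printed during the search show a different number of features for the two folds (for example 19833 against 18825). This suggests that the vocabulary is rebuilt on every `transform` call. A model fitted on one feature width cannot predict on another, which would be consistent with the empty test-score columns, although the truncated traceback does not confirm the cause. The sketch below shows the usual arrangement, where the vocabulary is learned once in `fit` and reused in `transform`. It uses scikit-learn's own `CountVectorizer` in place of `count_vectorize_data`, whose internals are not shown in this notebook, and toy reviews rather than the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy reviews and ratings for illustration only
train_text = ['good like', 'happy best product', 'bad stuff broken', 'evil bad thing']
train_stars = [5, 5, 1, 1]
unseen_text = ['good product but broken box']

pipe = Pipeline([
    # fit() learns the vocabulary once; transform() reuses it for any new text
    ('vectorizer', CountVectorizer()),
    ('LR', LinearRegression()),
])
pipe.fit(train_text, train_stars)

vectorizer = pipe.named_steps['vectorizer']
n_train_features = vectorizer.transform(train_text).shape[1]
n_unseen_features = vectorizer.transform(unseen_text).shape[1]
print(n_train_features, n_unseen_features)
print(pipe.predict(unseen_text))
```

Words that were not seen during fitting are simply ignored at prediction time, so the feature width stays fixed across folds.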


In [23]:
# Sample test

tokens = Tokenizer(lemmatizer=True)

Vector = Vectorizer(CountVector=True, max_features=8000)

cutter = CutTransformer()

sample=Sampler(state=3)

lr = LinearRegression()

X_tokens = tokens.transform(X_train)
X_vector = Vector.transform(X_tokens)
X_sampled, y_sampled = sample.fit(X_vector, y_train)
lr.fit(X_sampled, y_sampled)
y_pred = lr.predict(X_sampled)
y_cut = cutter.transform(y_pred)

y_cut.describe()

Lemmatization Shape: (21521,)
CVector shape: (21521, 8000)


  super()._check_params_vs_input(X, default_n_init=10)


Cluster size: (672359, 13291), clusters = 5


| categories | counts | freqs |
|---|---|---|
| 1 | 3938 | 0.296291 |
| 2 | 3550 | 0.267098 |
| 3 | 2473 | 0.186066 |
| 4 | 1920 | 0.144459 |
| 5 | 1410 | 0.106087 |
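
In the output above, the first number printed for `Cluster size` (672359) is far larger than the number of resampled targets (13291). A likely explanation is that `X_sampled` is a SciPy sparse matrix, whose `.size` reports the number of stored entries rather than the number of rows. The small example below, on a made-up matrix, shows the difference between `.size` and `.shape`.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
])
sparse = csr_matrix(dense)

print(sparse.shape)     # (rows, columns)
print(sparse.size)      # number of stored non-zero entries
print(sparse.shape[0])  # number of samples, comparable with len(y)
```

Printing `X_sampled.shape` would therefore give a figure that can be compared directly with the length of `y_sampled`.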


#### Dimensional Reduction Pipeline

In [12]:
class lasso_reduction(BaseEstimator, TransformerMixin):
        def __init__(self, alpha=1, max_iter=1000):
            self.alpha = alpha
            self.max_iter = max_iter
        
        def fit(self, X, y):
            self.lasso = Lasso(alpha=self.alpha, max_iter=self.max_iter)
            self.lasso.fit(X, y)
            return self

        def transform(self, X):
            sfm = SelectFromModel(self.lasso, prefit=True)
            reduced_X = sfm.transform(X)
            return reduced_X

class RandomForestReduction(BaseEstimator, TransformerMixin):
    def __init__(self, max_features=None, n_estimators=100, max_depth=None):
        self.max_features = max_features
        self.n_estimators = n_estimators
        self.max_depth = max_depth

    def fit(self, X, y):
        self.rf = RandomForestRegressor(max_features=self.max_features, n_estimators=self.n_estimators, max_depth=self.max_depth)
        self.rf.fit(X, y)
        return self

    def transform(self, X):
        sfm = SelectFromModel(self.rf, prefit=True)
        reduced_X = sfm.transform(X)
        return reduced_X

class XGBoostReduction(BaseEstimator, TransformerMixin):
    def __init__(self, n_estimators=100, max_depth=None, learning_rate=0.1, threshold=5):
        self.n_estimators= n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.threshold = threshold

    def fit(self, X, y):
        self.boost_red = xgb.XGBRegressor(n_estimators=self.n_estimators, max_depth=self.max_depth, learning_rate=self.learning_rate)
        self.boost_red.fit(X, y)
        return self

    def transform(self, X):
        sfm = SelectFromModel(self.boost_red, threshold=self.threshold)
        reduced_X = sfm.transform(X)
        return reduced_X

pca = PCA()

dim_red_pipe = Pipeline([
    ('lasso', lasso_reduction()),
    ('RF', RandomForestReduction()),
    ('XGBoost', XGBoostReduction()),
    ('PCA', pca)
])
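
As an alternative to wrapping each estimator in a custom class, scikit-learn's `SelectFromModel` can be placed in a pipeline directly, since it fits the wrapped estimator itself during `fit`. The sketch below demonstrates this with Lasso on synthetic data from `make_regression`. The data and the `alpha` value are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, only 3 of which carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=0.1, random_state=0)

pipe = Pipeline([
    # Lasso is fitted inside SelectFromModel.fit; near-zero coefficients are dropped
    ('lasso_select', SelectFromModel(Lasso(alpha=1.0, max_iter=5000))),
    ('LR', LinearRegression()),
])
pipe.fit(X, y)

X_reduced = pipe.named_steps['lasso_select'].transform(X)
print(f'Features kept: {X_reduced.shape[1]} of {X.shape[1]}')
```

In a parameter grid the Lasso strength would then be addressed as `lasso_select__estimator__alpha`.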

In [42]:
# Test Dimensional Reduction

tokens = Tokenizer(lemmatizer=True)

Vector = Vectorizer(CountVector=True, max_features=8000)

cutter = CutTransformer()

sample=Sampler(state=0)

lasso = lasso_reduction()
rf = RandomForestReduction()
xbg = XGBoostReduction()


lr = LinearRegression()

X_tokens = tokens.transform(X_train)
X_vector = Vector.transform(X_tokens)
X_sampled, y_sampled = sample.fit(X_vector.todense(), y_train)
lasso.fit(X_sampled, y_sampled)
X_reduced = lasso.transform(X_sampled)

print(X_reduced.size)



# test_pipeline = Pipeline([
#     ('dim_red', dim_red_pipe),
#     ('LR', transformed_lr_regressor)
# ])

# param_grid = {
#     'dim_red__lasso__alpha': [0.5],
#     'dim_red__lasso__max_iter': [5000],
#     'dim_red__RF__max_features': ['sqrt'],
#     'dim_red__RF__n_estimators': [250],
#     'dim_red__RF__max_depth': [10],
#     'dim_red__XGBoost__learning_rate': [0.1],
#     'dim_red__XGBoost__n_estimators': [250],
#     'dim_red__XGBoost__max_depth': [10],
#     'dim_red__XGBoost__threshold': [25]
# }

# grid = GridSearchCV(test_pipeline, param_grid=param_grid, cv=2)

# grid.fit(X_sampled, y_sampled)

Lemmatization Shape: (21521,)
CVector shape: (21521, 8000)


TypeError: np.matrix is not supported. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
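
The error message presumably stems from the `.todense()` call, which returns a `numpy.matrix` for a SciPy sparse matrix. The sketch below, on a tiny made-up matrix, shows two ways of obtaining a plain array instead: `.toarray()` and `np.asarray`.

```python
import numpy as np
from scipy.sparse import csr_matrix

sparse = csr_matrix(np.array([[0, 1], [2, 0]]))

as_matrix = sparse.todense()       # numpy.matrix, the type named in the error
as_array = sparse.toarray()        # plain numpy.ndarray
converted = np.asarray(as_matrix)  # conversion suggested by the error message

print(type(as_matrix).__name__, type(as_array).__name__, type(converted).__name__)
```

Note that a dense copy of a large document-term matrix can be very large in memory, so keeping the data sparse where the estimator allows it may be preferable.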

#### Regression Pipeline

In [38]:
class CustomRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, regressor=LinearRegression(), transformer=None):
        self.regressor = regressor
        self.transformer = transformer

    def fit(self, X, y):
        self.regressor.fit(X, y)
        return self
    
    def transform(self, X):
        y_pred = self.regressor.predict(X)
        y_pred_cut = self.transformer.transform(y_pred)
        return y_pred_cut
    
    def predict(self, X):
        y_pred = self.regressor.predict(X)
        y_pred_cut = self.transformer.transform(y_pred)
        return y_pred_cut
    
    def score(self, X, y, custom_scoring='accuracy'):
        y_pred = self.regressor.predict(X)
        y_pred_cut = self.transformer.transform(y_pred)
        if custom_scoring == 'accuracy':
            return accuracy_score(y, y_pred_cut)
        elif custom_scoring == 'precision':
            return precision_score(y, y_pred_cut)
        elif custom_scoring == 'recall':
            return recall_score(y, y_pred_cut)
        elif custom_scoring == 'f1_score':
            return f1_score(y, y_pred_cut)
        elif custom_scoring == 'r2_score':
            return r2_score(y, y_pred_cut)
        elif custom_scoring == 'mse':
            return mean_squared_error(y, y_pred_cut)
        else:
            raise ValueError("Invalid custom scoring metric")
        

custom_reg = CustomRegressor(regressor=lr, transformer=cutter)

custom_reg.fit(X_sampled, y_sampled)

custom_reg.score(X_sampled, y_sampled)


0.664133624257016
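
The precision, recall and F1 branches of `CustomRegressor.score` call the metric functions without an `average` argument. The default, `'binary'`, is intended for two-class targets, whereas the scorers defined earlier in this notebook use `average='weighted'`. A short sketch with toy ratings shows the weighted form for a five-class target.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy true and predicted star ratings, for illustration only
y_true = [1, 1, 2, 3, 4, 5, 5, 5]
y_hat = [1, 2, 2, 3, 5, 5, 5, 4]

weighted_scores = {}
for name, metric in [('precision', precision_score),
                     ('recall', recall_score),
                     ('f1_score', f1_score)]:
    # average='weighted' weights each class by its number of true samples
    weighted_scores[name] = metric(y_true, y_hat, average='weighted', zero_division=0)
    print(f'{name}: {weighted_scores[name]:.3f}')
```

Weighted recall equals plain accuracy, which makes it a convenient sanity check.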

In [39]:
param_grid = {'Lasso__regressor': [Lasso()],
    'Lasso__regressor__alpha': [0.1],
    'Ridge__regressor': [Ridge()],
    'Ridge__regressor__alpha': [0.1],
    'ElasticNet__regressor': [ElasticNet()],
    'ElasticNet__regressor__alpha': [0.1],
    'ElasticNet__regressor__l1_ratio': [0.7],
    'GBR__regressor': [GradientBoostingRegressor()],
    'GBR__regressor__learning_rate': [0.1],
    'GBR__regressor__n_estimators': [250],
    'GBR__regressor__max_depth': [10]
}

test_pipeline = Pipeline([
    ('Lasso', custom_reg),
    ('Ridge', custom_reg),
    ('ElasticNet', custom_reg),
    ('GBR', custom_reg)
])

grid = GridSearchCV(test_pipeline, param_grid=param_grid, cv=2)

grid.fit(X_sampled, y_sampled)

results_df = pd.DataFrame(grid.cv_results_)

ValueError: 
All the 2 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\pipeline.py", line 423, in fit
    Xt = self._fit(X, y, **fit_params_steps)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\pipeline.py", line 377, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\memory.py", line 353, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\pipeline.py", line 959, in _fit_transform_one
    res = transformer.fit(X, y, **fit_params).transform(X)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Scratticus\AppData\Local\Temp\ipykernel_30396\3126852135.py", line 12, in transform
    y_pred_cut = self.transformer.transform(y_pred)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\_set_output.py", line 157, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Scratticus\AppData\Local\Temp\ipykernel_30396\109930978.py", line 6, in transform
    X_cut = pd.cut(x=X, bins=[X.min(),1.5,2.5,3.5,4.5,X.max()], labels=[1,2,3,4,5], include_lowest=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Scratticus\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\reshape\tile.py", line 291, in cut
    raise ValueError("bins must increase monotonically.")
ValueError: bins must increase monotonically.
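
In a scikit-learn `Pipeline` every step except the last is treated as a transformer, so chaining four regressors runs them one after another rather than comparing them. The traceback shows the first step's `transform` reaching `pd.cut`. A common pattern for comparing models is a single final step whose estimator is swapped by the grid. The sketch below assumes synthetic data and illustrative parameter values, and leaves out `GradientBoostingRegressor` only to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

# One placeholder step; each grid entry swaps in a different regressor
pipe = Pipeline([('regressor', Ridge())])

param_grid = [
    {'regressor': [Lasso(max_iter=5000)], 'regressor__alpha': [0.1, 1.0]},
    {'regressor': [Ridge()], 'regressor__alpha': [0.1, 1.0]},
    {'regressor': [ElasticNet(max_iter=5000)], 'regressor__alpha': [0.1],
     'regressor__l1_ratio': [0.3, 0.7]},
]

grid = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring='r2')
grid.fit(X, y)

best_regressor = grid.best_estimator_.named_steps['regressor']
print(type(best_regressor).__name__)
print(len(grid.cv_results_['params']))
```

Passing a list of dictionaries keeps each model's parameters separate, so Ridge is never paired with an `l1_ratio` it does not accept.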


#### Parameter Grid

In [None]:
param_grid = {
    'text_preprocessing__lemmatized__enabled': [True, False],
    'resampling__SMOTE__k_neighbors': [2, 5, 10, 15, 25, 50, 100, 250, 500, 1000],
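    # 'resampling__SMOTE__kind': ['deprecated', 'regular', 'svm'],  # SMOTE no longer accepts `kind` in current imblearn releases
    # 'resampling__RandomUnderSampled__n_neighbors': [2, 5, 10, 15, 25, 50, 100, 250, 500, 1000],  # RandomUnderSampler has no n_neighbors parameter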
    'resampling__CC__clusters': [2, 5, 10, 15, 25, 50, 100, 250, 500, 1000],
    'dim_red__lasso__alpha': np.logspace(-4, 0, 8, endpoint=True, base=10),
    'dim_red__lasso__max_iter': [1000, 5000, 10000],
    'dim_red__RF__max_features': ['sqrt', 'log2'],  # 'auto' is no longer accepted by recent scikit-learn releases
    'dim_red__RF__n_estimators': [50, 100, 250, 500, 1000, 2000, 4000],
    'dim_red__RF__max_depth': [10, 20, 30, None],
    'dim_red__XGBoost__learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'dim_red__XGBoost__n_estimators': [50, 100, 250, 500, 1000, 2000, 4000],
    'dim_red__XGBoost__max_depth': [10, 20, 30, None],
    'dim_red__XGBoost__threshold': [5, 10, 25, 50, 100, 250, 500, 1000, 1500, 2000, 2500, 5000],
    'regression__Lasso__alpha': np.logspace(-4, 0, 8, endpoint=True, base=10),
    'regression__Ridge__alpha': np.logspace(-4, 4, 9, endpoint=True, base=10),
    'regression__ElasticNet__alpha': np.logspace(-4, 4, 9, endpoint=True, base=10),
    'regression__ElasticNet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
    'regression__GBR__max_learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'regression__GBR__n_estimators': [50, 100, 250, 500, 1000, 2000, 4000],
    'regression__GBR__max_depth': [10, 20, 30, None]
}
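
A single dictionary makes `GridSearchCV` cross every list with every other list, including parameters of alternatives that never run together (for example SMOTE settings with ClusterCentroids settings). A sketch of how the number of candidates can be checked with `ParameterGrid`, and how a list of dictionaries keeps alternatives apart, is shown below. The shortened grids are hypothetical and only reuse a few key names from the grid above.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical, shortened grid: every key is crossed with every other key.
single_grid = {
    'resampling__SMOTE__k_neighbors': [2, 5, 10],
    'resampling__CC__clusters': [2, 5, 10],
    'regression__Ridge__alpha': [0.1, 1.0],
}

# The same values split by resampler: each dictionary is expanded on its own.
split_grid = [
    {'resampling__SMOTE__k_neighbors': [2, 5, 10],
     'regression__Ridge__alpha': [0.1, 1.0]},
    {'resampling__CC__clusters': [2, 5, 10],
     'regression__Ridge__alpha': [0.1, 1.0]},
]

n_single = len(ParameterGrid(single_grid))  # 3 * 3 * 2 combinations
n_split = len(ParameterGrid(split_grid))    # (3 * 2) + (3 * 2) combinations
print(n_single, n_split)
```

Each candidate is fitted once per cross-validation fold, so the count should be multiplied by `cv` to estimate the total number of fits.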

#### Combined Pipeline

In [None]:
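# imblearn's Pipeline is assumed here rather than scikit-learn's, because it
# applies resampling steps during fit only and skips them at predict time.
from imblearn.pipeline import Pipeline

# Every step before the last must implement transform, so regression_pipe has
# to expose its predictions through transform for 'cut' to receive them.
full_pipeline = Pipeline([
    ('text_preprocessing', text_pp_pipe),
    ('Vectorizing', vector_pipe),
    ('resampling', resample_pipe),
    ('dim_red', dim_red_pipe),
    ('regression', regression_pipe),
    ('cut', cut)
])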

regression_grid = GridSearchCV(full_pipeline, param_grid=param_grid, scoring=scoring, cv=5)
regression_grid.fit(X_train, y_train)

results_df = pd.DataFrame(regression_grid.cv_results_)
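
An alternative to placing the cut as a separate final step is to fold it into the regressor's `predict`, so that the last pipeline step is still an estimator that `GridSearchCV` can score. The sketch below illustrates that idea with scikit-learn only. `CutRegressor`, the toy reviews and the toy ratings are all hypothetical and are not the notebook's own objects.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class CutRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper: fits a regressor, then bins predictions to 1-5."""

    def __init__(self, regressor=None):
        self.regressor = regressor

    def fit(self, X, y):
        base = Ridge() if self.regressor is None else self.regressor
        self.regressor_ = clone(base).fit(X, y)
        return self

    def predict(self, X):
        raw = self.regressor_.predict(X)
        # Same edges as the cut step: (-inf, 1.5] -> 1, ..., (4.5, inf) -> 5.
        return np.digitize(raw, [1.5, 2.5, 3.5, 4.5], right=True) + 1


# Toy data for illustration only.
toy_reviews = [
    "terrible product broke immediately", "bad quality would not buy",
    "average nothing special", "okay for the price",
    "good value works well", "great product very happy",
    "awful waste of money", "excellent love it",
]
toy_ratings = [1, 2, 3, 3, 4, 5, 1, 5]

toy_pipeline = Pipeline([
    ('vectorizing', TfidfVectorizer()),
    ('regression', CutRegressor(Ridge())),
])
toy_grid = GridSearchCV(
    toy_pipeline,
    param_grid={'regression__regressor__alpha': [0.1, 1.0]},
    cv=2,
)
toy_grid.fit(toy_reviews, toy_ratings)
toy_pred = toy_grid.predict(toy_reviews)
```

Because the binning happens inside `predict`, the scores recorded in `cv_results_` are computed on the integer ratings rather than on the raw floats.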

In [1]:
from imblearn.over_sampling import RandomOverSampler


def over_sampler(X, y):
    """
    Resample the features (X) and target (y) using the RandomOverSampler from the
    imblearn.over_sampling module. The sampler uses the 'minority' sampling strategy
    to suit the Amazon review data. There are no further arguments to adjust.

    Args:
    X, y: feature data and target data. These are expected to be tokenized,
    vectorized array data. Pass training data only to avoid data leakage.

    Returns:
    Resampled feature data and resampled target data, as tokenized vectorized arrays.

    Example:
    X_train_oversampled, y_train_oversampled = over_sampler(X_train, y_train)
    """
    # RandomOverSampler does not accept an n_jobs argument.
    sampler = RandomOverSampler(sampling_strategy='minority', random_state=69)
    X_oversampled, y_oversampled = sampler.fit_resample(X, y)
    return X_oversampled, y_oversampled

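from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE


def smote_sampler(X, y, k, kind='regular'):
    """
    Resample the features (X) and target (y) using a SMOTE resampler from the
    imblearn.over_sampling module. The sampler uses the 'minority' sampling strategy
    to suit the Amazon review data. The model can be further optimized by cycling
    the kind of SMOTE and the number of neighbours with a parameter grid.

    Args:
    X, y: feature data and target data. These are expected to be tokenized,
    vectorized array data. Pass training data only to avoid data leakage.
    k: number of nearest neighbours used to build synthetic samples. It must be
    smaller than the number of samples in the minority class.
    kind: 'regular', 'borderline' or 'svm'.

    Returns:
    Resampled feature data and resampled target data, as tokenized vectorized arrays.

    Example:
    X_train_oversampled, y_train_oversampled = smote_sampler(X_train, y_train, k=1000, kind='regular')
    """
    # Current imblearn releases have no `kind` argument on SMOTE; the variants
    # are separate classes, so `kind` selects the class instead.
    samplers = {'regular': SMOTE, 'borderline': BorderlineSMOTE, 'svm': SVMSMOTE}
    sampler = samplers[kind](sampling_strategy='minority', random_state=69, k_neighbors=k)
    X_oversampled, y_oversampled = sampler.fit_resample(X, y)
    return X_oversampled, y_oversampled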

from imblearn.under_sampling import RandomUnderSampler


def under_sampler(X, y, k=None):
    """
    Resample the features (X) and target (y) using the RandomUnderSampler from the
    imblearn.under_sampling module. The sampler uses the 'majority' sampling strategy
    to suit the Amazon review data.

    Args:
    X, y: feature data and target data. These are expected to be tokenized,
    vectorized array data. Pass training data only to avoid data leakage.
    k: ignored. RandomUnderSampler drops samples at random and has no neighbour
    parameter to tune; the argument is kept only so existing calls still work.

    Returns:
    Resampled feature data and resampled target data, as tokenized vectorized arrays.

    Example:
    X_train_undersampled, y_train_undersampled = under_sampler(X_train, y_train)
    """
    # RandomUnderSampler accepts neither n_jobs nor n_neighbors.
    sampler = RandomUnderSampler(sampling_strategy='majority', random_state=69)
    X_undersampled, y_undersampled = sampler.fit_resample(X, y)
    return X_undersampled, y_undersampled

from sklearn.cluster import KMeans
from imblearn.under_sampling import ClusterCentroids


def cluster_sampler(X, y, k):
    """
    Resample the features (X) and target (y) using the ClusterCentroids resampler from
    the imblearn.under_sampling module. To maintain simplicity, this function uses the
    KMeans cluster method only. Further cluster methods can be explored if the
    resampler provides good scores.

    Args:
    X, y: feature data and target data. These are expected to be tokenized,
    vectorized array data. Pass training data only to avoid data leakage.
    k: n_clusters passed to KMeans. Recent imblearn releases appear to reset
    n_clusters to the number of samples required for the resampled class, so
    tuning k may have no effect; check this against the installed version.

    Returns:
    Resampled feature data and resampled target data, as tokenized vectorized arrays.

    Example:
    X_train_undersampled, y_train_undersampled = cluster_sampler(X_train, y_train, k=5)
    """
    estimator = KMeans(n_clusters=k, random_state=69)
    # ClusterCentroids does not accept n_jobs in current imblearn releases.
    sampler = ClusterCentroids(sampling_strategy='majority', random_state=69, estimator=estimator)
    X_undersampled, y_undersampled = sampler.fit_resample(X, y)
    return X_undersampled, y_undersampled
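
The samplers above use the `'minority'` and `'majority'` strategies, which in imblearn resample only the single smallest or single largest class. With five rating classes this leaves the remaining classes untouched. The sketch below imitates the `'minority'` behaviour with plain NumPy to make that visible. It is a hand-written stand-in and not imblearn's implementation, and the label counts are hypothetical.

```python
from collections import Counter

import numpy as np


def oversample_minority_only(y, random_state=69):
    """Return row indices after duplicating only the smallest class."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    counts = Counter(y.tolist())
    minority = min(counts, key=counts.get)   # label of the smallest class
    target = max(counts.values())            # size of the largest class
    minority_idx = np.flatnonzero(y == minority)
    # Draw extra minority rows with replacement until the class reaches target.
    extra = rng.choice(minority_idx, size=target - counts[minority], replace=True)
    return np.concatenate([np.arange(len(y)), extra])


# Hypothetical five-class ratings, dominated by 1 and 5 stars.
toy_y = np.array([1] * 40 + [2] * 5 + [3] * 10 + [4] * 15 + [5] * 30)
idx = oversample_minority_only(toy_y)
resampled_counts = Counter(toy_y[idx].tolist())
print(resampled_counts)
```

Only the 2-star class is brought up to 40 samples here, while the 3-star and 4-star classes stay at 10 and 15. If every class needs balancing, imblearn's `'not majority'` (over-sampling) or `'not minority'` (under-sampling) strategies may be worth testing instead.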