## Data Preparation (huge dataset)

### Way 1
1. Read the data with pandas in chunks (`chunksize`).
2. `del` unnecessary data and call `gc.collect()`.

This approach still ran out of memory:

---> RuntimeError: [enforce fail at alloc_cpu.cpp:114] data. DefaultCPUAllocator: not enough memory: you tried to allocate 4374528000 bytes.
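To put the failed allocation in perspective, here is a rough back-of-the-envelope sketch. It assumes (this is not stated in the notes) that the allocation was a dense fp32 matrix with 1024 values per row; the helper name `embedding_bytes` is made up for illustration.

```python
def embedding_bytes(n_rows, dim=1024, bytes_per_value=4):
    """Memory needed for a dense (n_rows, dim) matrix; fp32 uses 4 bytes per value."""
    return n_rows * dim * bytes_per_value


failed_allocation = 4_374_528_000  # bytes, taken from the RuntimeError message

# Assumption: if this were a 1024-dim fp32 matrix, how many rows would it hold?
rows_if_fp32_1024 = failed_allocation // (1024 * 4)
print(rows_if_fp32_1024)               # 1068000
print(failed_allocation / 2**30)       # size in GiB, a little over 4

# For comparison, 10,000 rows of 1024-dim fp32 embeddings:
print(embedding_bytes(10_000) / 2**20)  # 39.0625 (MiB)
```

Under that assumption, the request matches about 1.07 million rows held in memory at once, while a 10,000-row chunk needs only about 39 MiB.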

### Way 2

1. Load and process the data in chunks.
2. Embed the reviews using the UAE-Large-V1 model.
3. Save each chunk's embeddings and ratings to a temporary file.
4. Combine all temporary files into a single NumPy array at the end.
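The four steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function `embed_in_chunks`, the column names `review` and `rating`, and the stand-in `fake_embed` are all assumptions. In practice `embed_fn` would wrap the real embedding model.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def embed_in_chunks(csv_path, embed_fn, out_dir, chunksize=1000,
                    text_col="review", rating_col="rating"):
    """Embed a CSV chunk by chunk; return an array with rating in column 0."""
    chunk_files = []
    # Step 1: read the file in chunks so only one chunk is in memory at a time.
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
        # Step 2: embed the review texts of this chunk.
        vectors = np.asarray(embed_fn(chunk[text_col].tolist()), dtype=np.float32)
        ratings = chunk[rating_col].to_numpy(dtype=np.float32).reshape(-1, 1)
        # Step 3: save rating + embedding for this chunk to a temporary file.
        path = os.path.join(out_dir, f"chunk_{i:05d}.npy")
        np.save(path, np.hstack([ratings, vectors]))
        chunk_files.append(path)
    # Step 4: combine all temporary files into a single NumPy array.
    return np.vstack([np.load(p) for p in chunk_files])


def fake_embed(texts, dim=4):
    # Hypothetical stand-in for the real model: small deterministic vectors.
    return np.array([[len(t) % (k + 2) for k in range(dim)] for t in texts],
                    dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reviews.csv")
    pd.DataFrame({
        "review": [f"review number {i}" for i in range(25)],
        "rating": [(i % 5) + 1 for i in range(25)],
    }).to_csv(csv_path, index=False)
    combined = embed_in_chunks(csv_path, fake_embed, tmp, chunksize=10)

print(combined.shape)  # (25, 5): 1 rating column + 4 embedding columns
```

The final `np.vstack` still needs enough memory for the combined array, so this helps when the raw text and intermediate model tensors are the bottleneck rather than the finished embeddings.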


## Embedding Generation
Candidates looked into, from the MTEB Leaderboard (Massive Text Embedding Benchmark):
### UAE-Large-V1 (Universal AnglE Embedding)
1. Dimensions : 1024
2. Time : took 1 hr for 3k records.
3. Model Size : 1.25 GB (fp32)
4. Context length (tokens) : 512
5. price : open source

### Voyage-large-2-instruct
1. Dimensions : 1024
2. Time : 
3. Model Size : 
4. Context length (tokens) : 16k
5. price : $ 0.12 / 1 million tokens

### text-embedding-3-small
1. Dimensions : 1536
2. Time : 
3. Model Size : 
4. Context length (tokens) : 8191
5. price : $ 0.02 / 1 million tokens
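The prices and the timing above can be turned into a rough estimate for a given dataset. The sketch below is only illustrative: the helper names are made up, the average of 200 tokens per review is an assumed figure, and the local timing simply extrapolates the "1 hr for 3k records" measurement linearly.

```python
def api_cost_usd(n_records, avg_tokens_per_record, usd_per_million_tokens):
    """Estimated API cost for embedding n_records texts."""
    total_tokens = n_records * avg_tokens_per_record
    return total_tokens / 1_000_000 * usd_per_million_tokens


def local_hours(n_records, seconds_per_record=3600 / 3000):
    """Estimated local run time, assuming 1 hr per 3k records scales linearly."""
    return n_records * seconds_per_record / 3600


n_records = 10_000
avg_tokens = 200  # assumption: average review length in tokens

print(f"voyage-large-2-instruct: ${api_cost_usd(n_records, avg_tokens, 0.12):.2f}")
print(f"text-embedding-3-small:  ${api_cost_usd(n_records, avg_tokens, 0.02):.2f}")
print(f"UAE-Large-V1 (local):    {local_hours(n_records):.2f} hours")
```

Under these assumptions, cost is negligible at this scale for either API, so the choice depends more on latency, context length, and embedding quality.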



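### Trade-offs

1. Latency vs. performance: latency is the delay between a user's action and the system's response. Larger models tend to produce better embeddings but take longer per request.
2. Capturing the complexity of the data vs. operational efficiency: higher-dimensional embeddings can capture more detail but cost more to store and process.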

## How to train your own embedding model?



In [1]:
# load .npy file

import numpy as np
file=r"D:\projects_llm\review_prediction_using_llm_embeddings\temp_chunks\ratings_embeddings.npy"
data=np.load(file)

In [2]:
print(type(data))
data.shape

<class 'numpy.ndarray'>


(10000, 1025)

In [3]:
data[0].shape

(1025,)

In [4]:
data[1]

array([ 4.        , -0.03100718, -0.00820842, ...,  0.01742723,
        0.03722996,  0.02347247])

## ML Techniques

In [5]:
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LinearRegression,Ridge,Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error,r2_score

In [6]:
ratings = data[:, 0]      # first column is the rating, shape (10000,)
embeddings = data[:, 1:]  # remaining columns are the embedding, shape (10000, 1024)

In [7]:
x_train,x_test,y_train,y_test=train_test_split(embeddings,ratings,test_size=0.25,random_state=42)

In [8]:

## 'Ridge Regression'
# alpha: regularization strength; larger values mean stronger regularization.
# Trying several values lets the search find a suitable level of regularization.

## 'Lasso Regression'
# alpha: as in Ridge Regression, controls the regularization strength (here with an L1 penalty).
# The values provided let the search explore different regularization strengths.

## 'Random Forest Regressor'
# n_estimators: the number of trees in the forest; the grid tries 100 and 200.
# max_depth: the maximum depth of each tree; limiting the depth can reduce overfitting.
#   None means nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.

## 'SVM Regressor'
# C: regularization parameter; the regularization strength is inversely proportional to C, so smaller values mean stronger regularization.
# epsilon: width of the epsilon-tube within which errors carry no penalty in the training loss.
# kernel: the kernel type ('linear' or 'rbf'); it is not included in the grid, so SVR uses its default, 'rbf'.

In [12]:

# List of regressors and their parameter grids for hyperparameter tuning
models = {
    'Linear Regression': (LinearRegression(), {}),
    'Ridge Regression': (Ridge(), {'alpha': [0.1, 1, 10]}),
    'Lasso Regression': (Lasso(), {'alpha': [0.1, 1, 10]}),
    'Random Forest Regressor': (RandomForestRegressor(), {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}),
    'SVM Regressor': (SVR(), {'C': [0.1, 1, 10], 'epsilon': [0.01, 0.1, 0.2]}) # kernel : linear, RBF
}
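The trailing comment on the SVR entry mentions the `linear` and `rbf` kernels, which are not part of the grid. As a hedged sketch of what including them would mean for run time, `ParameterGrid` can count the combinations before any model is trained. The name `svr_grid_with_kernel` is made up for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical extension of the SVR grid that also searches over the kernel.
svr_grid_with_kernel = {
    'C': [0.1, 1, 10],
    'epsilon': [0.01, 0.1, 0.2],
    'kernel': ['linear', 'rbf'],
}

n_combinations = len(ParameterGrid(svr_grid_with_kernel))
n_fits = n_combinations * 5  # each combination is fitted once per fold with cv=5

print(n_combinations)  # 18
print(n_fits)          # 90
```

Adding the kernel doubles the search from 9 to 18 combinations, which is worth knowing up front because SVR training on 1024-dimensional inputs is comparatively slow.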

## Components
1. GridSearchCV: a scikit-learn class that performs an exhaustive search over specified parameter values for an estimator. It helps find the best combination of hyperparameters for a model. Parameters:

   a. model: the machine learning model/estimator to be optimized (e.g., LinearRegression(), Ridge(), SVR(), etc.).

   b. params: a dictionary where keys are hyperparameter names and values are lists of values to try. For example, {'alpha': [0.1, 1, 10]} for Ridge Regression.

   c. cv=5: sets up cross-validation with 5 folds. The data is split into 5 parts, and the model is trained and validated 5 times, each time using a different part as the validation set and the remaining parts as the training set.

   d. scoring='neg_mean_squared_error': the metric used to evaluate predictions. The negative MSE is used because GridSearchCV always maximizes the score; maximizing the negative MSE is the same as minimizing the MSE.

   e. n_jobs=-1: uses all available CPU cores to run the search in parallel, speeding up the process.

2. Fit the grid search: grid_search.fit(x_train, y_train) performs the grid search on the training data. It tries all combinations of the parameters in params using cross-validation and evaluates them with the scoring method. The best combination is selected based on cross-validation performance.

3. Best estimator: best_model = grid_search.best_estimator_. After fitting, grid_search.best_estimator_ contains the model with the best combination of hyperparameters found during the search, refit on the whole training set (the default refit=True).

4. Store the best model: best_models[name] = best_model stores the best model found for the current regressor in the best_models dictionary, using the name of the regressor as the key.
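To make the sign convention of `neg_mean_squared_error` concrete, here is a small sketch on synthetic data (the data and variable names are made up for illustration; they are not the review embeddings). `best_score_` is negative, and negating it recovers the cross-validated MSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

grid_search = GridSearchCV(Ridge(), {'alpha': [0.1, 1, 10]}, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(X, y)

# best_score_ is the mean negative MSE over the 5 folds for the best alpha.
best_cv_mse = -grid_search.best_score_
best_cv_rmse = np.sqrt(best_cv_mse)

print(grid_search.best_params_)
print(f'CV MSE: {best_cv_mse:.4f}, CV RMSE: {best_cv_rmse:.4f}')
```

The cross-validated MSE is computed on the training folds only, so it can differ from the test-set MSE reported by `mean_squared_error(y_test, y_pred)`.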

In [13]:
best_models = {}
from joblib import dump
# Train, tune, and evaluate each model
for name, (model, params) in models.items():
    grid_search = GridSearchCV(model, params, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
    grid_search.fit(x_train, y_train)
    
    best_model = grid_search.best_estimator_
    best_models[name] = best_model
    
    y_pred = best_model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f'{name} - Best Parameters: {grid_search.best_params_}')
    print(f'{name} - Mean Squared Error: {mse}')
    print(f'{name} - R2 Score: {r2}')

# Save the best model for each regressor
for name, model in best_models.items():
    dump(model, f'{name.replace(" ", "_").lower()}_best_model.joblib')

Linear Regression - Best Parameters: {}
Linear Regression - Mean Squared Error: 0.609811512037517
Linear Regression - R2 Score: 0.6989985696544256
Ridge Regression - Best Parameters: {'alpha': 1}
Ridge Regression - Mean Squared Error: 0.5744906312561676
Ridge Regression - R2 Score: 0.7164328676733807
Lasso Regression - Best Parameters: {'alpha': 0.1}
Lasso Regression - Mean Squared Error: 2.0261129777777778
Lasso Regression - R2 Score: -8.427573817559875e-05
Random Forest Regressor - Best Parameters: {'max_depth': 20, 'n_estimators': 100}
Random Forest Regressor - Mean Squared Error: 0.6321252902811235
Random Forest Regressor - R2 Score: 0.6879845447710674
SVM Regressor - Best Parameters: {'C': 1, 'epsilon': 0.2}
SVM Regressor - Mean Squared Error: 0.5456886564468697
SVM Regressor - R2 Score: 0.7306494500816224
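The saved `.joblib` files can later be reloaded with `joblib.load` and used for prediction without retraining. The sketch below illustrates the round trip on synthetic data in a temporary directory; the data is made up, and only the file name follows the naming pattern used in the save loop.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import Ridge

# Synthetic data standing in for the embeddings and ratings.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # Same naming pattern as the save loop: name.replace(" ", "_").lower()
    path = os.path.join(tmp, 'ridge_regression_best_model.joblib')
    dump(model, path)
    restored = load(path)

# The restored model should give the same predictions as the original.
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)  # True
```

joblib files are pickle-based, so they are best loaded with the same scikit-learn version that wrote them and only from trusted sources.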
