# **Content-Based Filtering**

## Strategies

1. TF-IDF:
    - TF-IDF + LogisticRegression - `MAE: 0.65`
    - TF-IDF + RandomForestRegressor - `MAE: 0.82`
    - TF-IDF + XGBoost [TODO] - `MAE: -`
2. Doc2Vec + LogisticRegression - `MAE: 1.24`
3. [TODO]
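
The scores listed above are mean absolute errors (MAE) between predicted and true star ratings, so an MAE of 0.65 means predictions are off by roughly two thirds of a star on average. As a minimal sketch of the metric in plain Python (the ratings are made up and the helper name `mean_absolute_error_manual` is hypothetical, not part of this notebook):

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |true - predicted| over all reviews."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ratings (hypothetical): two exact hits, one miss by 1 star, one by 2 stars
true_stars = [5.0, 1.0, 4.0, 5.0]
pred_stars = [5.0, 2.0, 4.0, 3.0]

print(mean_absolute_error_manual(true_stars, pred_stars))  # 0.75
```

The cells below compute the same quantity with `sklearn.metrics.mean_absolute_error`.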

In [10]:
%pip install nltk --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances


df_train = pd.read_csv("train_reviews.csv", sep="," , index_col="review_id")

In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 967784 entries, ZZO43qKB-s65zplC8RfJqw to auSo_fXuICntO1hLC68tTg
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   user_id      967784 non-null  object 
 1   business_id  967784 non-null  object 
 2   stars        967784 non-null  float64
 3   useful       967784 non-null  int64  
 4   funny        967784 non-null  int64  
 5   cool         967784 non-null  int64  
 6   text         967784 non-null  object 
 7   date         967784 non-null  object 
dtypes: float64(1), int64(3), object(4)
memory usage: 66.5+ MB


In [6]:
df_train.head(5)

| review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|
| ZZO43qKB-s65zplC8RfJqw | -1BSu2dt_rOAqllw9ZDXtA | smkZq4G1AOm4V6p3id5sww | 5.0 | 0 | 0 | 0 | Fantastic fresh food. The greek salad is amazi... | 2016-09-30 15:49:32 |
| vojXOF_VOgvuKD95gCO8_Q | xpe178ng_gj5X6HgqtOing | 96_c_7twb7hYRZ9HHrq01g | 1.0 | 2 | 0 | 1 | Been a patient at Largo Med/Diagnostic Clinic ... | 2020-12-09 14:39:51 |
| KwxdbiseRlIRNzpgvyjY0Q | axbaerf2Fk92OB4b9_peVA | e0AYjKfSF0DL-5C1CpOq6Q | 4.0 | 0 | 0 | 0 | The location is convenient to my campus so I d... | 2013-09-04 16:19:51 |
| 3mwoBcTy-2gMh0L91uaIeA | _GOiybb0rImYKJfwyxEaGg | vF-uptiQ34pVLHJKzPHUlA | 5.0 | 0 | 0 | 0 | I agree with all the other compliments posted ... | 2019-03-02 12:24:14 |
| XfWf7XsBWs3kYyYq7Ns1ZQ | ojWKg3B5pH3ncAsxun3kUw | X28XK71RuEXPapeyUOwNzg | 5.0 | 10 | 4 | 7 | Wanting to help out the local economy, I thoug... | 2020-04-23 18:26:29 |


In [2]:
df_train.reset_index(inplace=True)

In [8]:
df_train['text'].isna().value_counts()

text
False    967784
Name: count, dtype: int64

In [3]:
df_test = pd.read_csv("test_reviews.csv", sep=",")

### **TFIDF**

In [10]:
# -- TfidfVectorizer removes English stop words (stop_words='english')
vectorizer = TfidfVectorizer(stop_words='english', max_features=200, ngram_range=(1, 2))

X_train_tfidf = vectorizer.fit_transform(df_train['text'])
X_test_tfidf = vectorizer.transform(df_test['text'])
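
To make the weighting concrete, the sketch below computes TF-IDF by hand for a toy corpus, assuming scikit-learn's documented defaults: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is deliberately simplified (whitespace tokenization, no stop-word removal, unigrams only, no `max_features` cap), so it illustrates the idea rather than reproducing the vectorizer above. The corpus and the function name `tfidf_rows` are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(corpus):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Toy corpus (hypothetical)
corpus = ["great food great service", "terrible service", "great food"]
vocab, rows = tfidf_rows(corpus)

# Weights of the second review: the rarer word "terrible" outweighs "service"
for term, weight in zip(vocab, rows[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that appear in fewer reviews get a larger IDF, which is why a distinctive word such as "terrible" carries more weight than a common one.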

In [None]:
y = df_train['stars']

X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, y, test_size=0.2, random_state=42)

# model = RandomForestRegressor(n_estimators=100, random_state=42)
model = LogisticRegression(max_iter=1000)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error (Train Test Split): {mae}')

y_pred_test = model.predict(X_test_tfidf)

Mean Absolute Error (Train Test Split): 0.6478453375491457


In [12]:
submission_df = pd.DataFrame({
    'review_id': df_test['review_id'],
    'stars': y_pred_test
})

submission_df.to_csv('prediction_tfidf_logisticReg.csv', index=False)

**GridSearch** with LogisticRegression

In [14]:
y = df_train['stars']
X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, y, test_size=0.2, random_state=42)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],     # Inverse regularization strength
    'solver': ['liblinear', 'saga'],  # Optimization algorithms
    'max_iter': [100, 500, 1000],     # Maximum number of iterations
}

model = LogisticRegression(random_state=42)

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
best_params = grid_search.best_params_
y_pred_best = best_model.predict(X_test)
mae_best = mean_absolute_error(y_test, y_pred_best)

print(f"Best parameters: {best_params}")
print(f"Mean Absolute Error (best model): {mae_best}")

# Compare the best model against a default (untuned) one
model_default = LogisticRegression(max_iter=1000, random_state=42)
model_default.fit(X_train, y_train)
y_pred_default = model_default.predict(X_test)
mae_default = mean_absolute_error(y_test, y_pred_default)
print(f"Mean Absolute Error (untuned model): {mae_default}")

### Predictions
y_pred_test = best_model.predict(X_test_tfidf)

submission_df = pd.DataFrame({
    'review_id': df_test['review_id'],
    'stars': y_pred_test
})

submission_df.to_csv('prediction_tfidf_logisticReg_gridSearch.csv', index=False)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best parameters: {'C': 100, 'max_iter': 100, 'solver': 'saga'}
Mean Absolute Error (best model): 0.6476800115728183
Mean Absolute Error (untuned model): 0.6478453375491457
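
The `Fitting 5 folds for each of 30 candidates, totalling 150 fits` line follows from the grid size: 5 values of `C` × 2 solvers × 3 `max_iter` values give 30 candidates, and 5-fold cross-validation trains each one 5 times. A small sketch that reproduces the count with the standard library only (it mirrors `param_grid` above and does not call scikit-learn):

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 500, 1000],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(f"{len(candidates)} candidates, {total_fits} fits")  # 30 candidates, 150 fits
```

For this feature set the search barely pays off: the tuned model lowers the MAE from 0.6478 to 0.6477.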


In [15]:
submission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414765 entries, 0 to 414764
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   review_id  414765 non-null  object 
 1   stars      414765 non-null  float64
dtypes: float64(1), object(1)
memory usage: 6.3+ MB


**With a RandomForest Regressor**

In [4]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=100, ngram_range=(1, 2))

X_train_tfidf = vectorizer.fit_transform(df_train['text'])
X_test_tfidf = vectorizer.transform(df_test['text'])

In [5]:
### With RandomForest
y = df_train['stars']

X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=10, random_state=42, n_jobs=-1)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error (Train Test Split): {mae}')

y_pred_test = model.predict(X_test_tfidf)

submission_df = pd.DataFrame({
    'review_id': df_test['review_id'],
    'stars': y_pred_test
})

submission_df.to_csv('prediction_tfidf_randomForest.csv', index=False)

Mean Absolute Error (Train Test Split): 0.824586461717978


### **Doc2Vec**
Link: https://spotintelligence.com/2023/09/06/doc2vec/

In [None]:
# Needed to fix: ImportError: cannot import name 'triu' from 'scipy.linalg' (c:\Users\34627\anaconda3\Lib\site-packages\scipy\linalg\__init__.py)
%pip install scipy==1.12

In [4]:
%pip install gensim --quiet

import re
import joblib
import nltk
import pandas as pd
from joblib import Parallel, delayed
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
#nltk.download('all')

Note: you may need to restart the kernel to use updated packages.


In [None]:
#######################
#### PREPROCESSING ####
#######################
import os

print("--- Starting preprocessing...")

# Tokenizer: lowercase the text and keep purely alphabetic words
def preprocess_text_parallel(text):
    return re.findall(r'\b[a-zA-Z]+\b', text.lower())

df_train['tokens'] = Parallel(n_jobs=-1)(delayed(preprocess_text_parallel)(text) for text in df_train['text'])
df_test['tokens'] = Parallel(n_jobs=-1)(delayed(preprocess_text_parallel)(text) for text in df_test['text'])

# Wrap each training review as a TaggedDocument tagged with its row position
tagged_train = [TaggedDocument(words=tokens, tags=[str(i)]) for i, tokens in enumerate(df_train['tokens'])]

#######################
#### DOC2VEC MODEL ####
#######################

# Initialize the Doc2Vec model
model = Doc2Vec(vector_size=50,   # Dimensionality of the document vectors
                window=2,         # Maximum distance between the current and predicted word within a sentence
                min_count=1,      # Ignores all words with total frequency lower than this
                workers=os.cpu_count() or 1,  # Worker threads; must be positive (with -1 gensim starts no training threads)
                epochs=2)         # Number of training epochs

model.build_vocab(tagged_train)
model.train(tagged_train, total_examples=len(tagged_train), epochs=model.epochs)

# Infer a document vector for every review
df_train['vector'] = df_train['tokens'].apply(lambda x: model.infer_vector(x))
df_test['vector'] = df_test['tokens'].apply(lambda x: model.infer_vector(x))


X_test = list(df_test['vector'])

##########################
#### CLASSIFIER MODEL ####
##########################

X_list_train = list(df_train['vector'])
y_train = df_train['stars']


classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_list_train, y_train)

predicted_stars = classifier.predict(X_test)

submission_df = pd.DataFrame({
    'review_id': df_test['review_id'],
    'stars': predicted_stars
})
submission_df.to_csv('prediction_doc2vec_logreg.csv', index=False)

--- Starting preprocessing...
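
To see what the tokenizer in the cell above keeps, here is a minimal sketch that applies the same regular expression to a made-up sentence (the function name `tokenize` and the sample text are hypothetical). Because of the `\b` word boundaries, digits are dropped, contractions are split at the apostrophe, and tokens that mix letters and digits, such as `top10`, disappear entirely.

```python
import re


def tokenize(text):
    """Lowercase the text and return its purely alphabetic words."""
    return re.findall(r'\b[a-zA-Z]+\b', text.lower())


sample = "Fantastic fresh food! 5 stars, don't miss the top10 list."
tokens = tokenize(sample)

print(tokens)
# ['fantastic', 'fresh', 'food', 'stars', 'don', 't', 'miss', 'the', 'list']
```

Stop words such as "the" survive this step, unlike in the TF-IDF pipeline, where `stop_words='english'` removes them.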
