<a href="https://colab.research.google.com/github/ElvisG2003/Sentiment-API-Hackathon/blob/main/Sentiment_API_Hackathon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📊 Project 1: SentimentAPI - Sentiment Analysis of Customer Feedback

**Sector:** Customer service / Marketing / Operations  
**Objective:** Classify comments as **positive / negative** and generate actionable insights.  
**Team:** Data Science (Colab + Python)  
**Repository:** (repository URL)

## ✅ Expected outcome

- Reproducible end-to-end notebook (EDA → model → evaluation → export)
- Baseline model: **TF-IDF + Logistic Regression** (scikit-learn)
- Metrics: Accuracy / Precision / Recall / F1-score + confusion matrix
- Pipeline export: `joblib.dump(...)` or `pickle.dump(...)`
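
As a rough illustration of the deliverables listed above, the following sketch trains a TF-IDF + Logistic Regression pipeline, computes the listed metrics, and exports the fitted pipeline with `joblib`. The toy sentences, the hyperparameters, and the file name `sentiment_pipeline.joblib` are assumptions made for the example, not the project's final choices.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)
from sklearn.pipeline import Pipeline

# Toy data for illustration only (not the project dataset).
train_texts = [
    "great value for money",
    "very satisfied with the quality",
    "fast delivery and great packaging",
    "very disappointed with the quality",
    "not worth the price",
    "late delivery and poor packaging",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative
test_texts = ["great packaging", "poor quality, very disappointed"]
test_labels = [1, 0]

# Baseline: TF-IDF features followed by a logistic regression classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_texts, train_labels)
pred = pipe.predict(test_texts)

# Metrics listed in the expected outcome.
acc = accuracy_score(test_labels, pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, pred, average="binary", zero_division=0
)
cm = confusion_matrix(test_labels, pred, labels=[0, 1])
print(f"accuracy={acc:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(cm)

# Export the fitted pipeline and reload it to confirm the round trip works.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, model_path)
    restored = joblib.load(model_path)
```

Exporting the whole pipeline, rather than only the classifier, keeps the vectorizer's vocabulary together with the model, so raw text can be passed directly to `restored.predict(...)`.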


## 🔁 Team conventions

- Each section states **what we will do**, **why**, and **the result**.
- Do not edit “LOCK” cells without notifying the team (use comments or a PR).
- All new code must carry a marker: `# TODO(author): ...` or `# NOTE: ...`

## 👥 Collaborative work (quick rules)

**Suggested roles**
- DS Lead: reviews approach, metrics, baseline
- Data Engineer: cleaning / quality / pipeline
- NLP Engineer: features and models
- QA/Reviewer: reproducibility and validation

**Good practices**
- Use consistent variable names: `df_raw`, `df`, `X_train`, `y_train`, `pipe`.
- Document cleaning decisions (why records are removed or transformed).
- Save outputs to `/content/drive/MyDrive/<proyecto>/`.
- Version the notebook: `sentiment_v1.ipynb`, `sentiment_v2.ipynb`.

## 📂 Installation/Imports

In [None]:
# =========================
# SETUP - Imports & Config
# =========================

import pandas as pd

## 🏭 Data loading

Ways to load the dataset:
1) Upload a CSV manually  
2) Google Drive
3) Public URL

**Minimum expected dataset format:**
- `text`: comment / review
- `label`: sentiment (positive / negative) or (0/1)



In [None]:
# Option A) Upload manually
# from google.colab import files
# uploaded = files.upload()

# Option B) Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# path = "/content/drive/MyDrive/<proyecto>/data/dataset.csv"

# Option C) URL
# url = "https://....csv"

url = "https://raw.githubusercontent.com/ElvisG2003/Sentiment-API-Hackathon/refs/heads/main/data-science/Customer_Sentiment.csv"

In [None]:
DF = pd.read_csv(url, sep=',')


In [None]:
DF.head()

Unnamed: 0,customer_id,gender,age_group,region,product_category,purchase_channel,platform,customer_rating,review_text,sentiment,response_time_hours,issue_resolved,complaint_registered
0,1,male,60+,north,automobile,online,flipkart,1,very disappointed with the quality.,negative,46,yes,yes
1,2,other,46-60,central,books,online,swiggy instamart,5,fast delivery and great packaging.,positive,5,yes,no
2,3,female,36-45,east,sports,online,facebook marketplace,1,very disappointed with the quality.,negative,38,yes,yes
3,4,female,18-25,central,groceries,online,zepto,2,product stopped working after few days.,negative,16,yes,yes
4,5,female,18-25,east,electronics,online,croma,3,neutral about the quality.,neutral,15,yes,no


In [None]:
DF.tail()

Unnamed: 0,customer_id,gender,age_group,region,product_category,purchase_channel,platform,customer_rating,review_text,sentiment,response_time_hours,issue_resolved,complaint_registered
24995,24996,female,36-45,south,beauty,online,lenskart,1,very disappointed with the quality.,negative,40,yes,yes
24996,24997,other,60+,central,automobile,online,flipkart,5,"amazing experience, highly recommend!",positive,25,yes,no
24997,24998,male,18-25,south,beauty,online,ajio,4,fast delivery and great packaging.,positive,9,yes,no
24998,24999,female,26-35,central,automobile,online,snapdeal,5,great value for money.,positive,65,no,no
24999,25000,male,46-60,central,travel,online,lenskart,3,"product is okay, nothing special.",neutral,67,no,no


In [None]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   customer_id           25000 non-null  int64 
 1   gender                25000 non-null  object
 2   age_group             25000 non-null  object
 3   region                25000 non-null  object
 4   product_category      25000 non-null  object
 5   purchase_channel      25000 non-null  object
 6   platform              25000 non-null  object
 7   customer_rating       25000 non-null  int64 
 8   review_text           25000 non-null  object
 9   sentiment             25000 non-null  object
 10  response_time_hours   25000 non-null  int64 
 11  issue_resolved        25000 non-null  object
 12  complaint_registered  25000 non-null  object
dtypes: int64(3), object(10)
memory usage: 2.5+ MB


In [None]:
DF.shape

(25000, 13)

In [None]:
DF.columns

Index(['customer_id', 'gender', 'age_group', 'region', 'product_category',
       'purchase_channel', 'platform', 'customer_rating', 'review_text',
       'sentiment', 'response_time_hours', 'issue_resolved',
       'complaint_registered'],
      dtype='object')

## 🧾 Data dictionary and initial validation

Here we validate:
- minimum required columns
- unique values of `label`
- null counts
- duplicates
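
The four checks above could be bundled into one helper, sketched below under a few assumptions: the function name `validate_dataset` and the toy DataFrame are hypothetical, and the column names are passed in as arguments because the project dataset uses `review_text` and `sentiment` rather than `text` and `label`. Duplicates are counted on the text and label columns only, since a whole-row check cannot flag repeats when a unique id column is present.

```python
import pandas as pd


def validate_dataset(df, text_col="text", label_col="label"):
    """Return a small validation report (hypothetical helper, for illustration)."""
    # 1) Minimum required columns.
    missing = [c for c in (text_col, label_col) if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    return {
        "n_rows": len(df),
        # 2) Unique label values.
        "labels": sorted(df[label_col].dropna().unique().tolist()),
        # 3) Null counts per required column.
        "nulls": df[[text_col, label_col]].isnull().sum().to_dict(),
        # 4) Rows that repeat an earlier (text, label) pair.
        "duplicates": int(df.duplicated(subset=[text_col, label_col]).sum()),
    }


# Toy data for illustration only.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "review_text": [
        "great value for money.",
        "not worth the price.",
        "not worth the price.",
        None,
    ],
    "sentiment": ["positive", "negative", "negative", "neutral"],
})
report = validate_dataset(sample, text_col="review_text", label_col="sentiment")
print(report)
```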

In [None]:
DF_modificado = DF[['customer_id','purchase_channel','customer_rating','review_text','sentiment']]
DF_modificado

Unnamed: 0,customer_id,purchase_channel,customer_rating,review_text,sentiment
0,1,online,1,very disappointed with the quality.,negative
1,2,online,5,fast delivery and great packaging.,positive
2,3,online,1,very disappointed with the quality.,negative
3,4,online,2,product stopped working after few days.,negative
4,5,online,3,neutral about the quality.,neutral
...,...,...,...,...,...
24995,24996,online,1,very disappointed with the quality.,negative
24996,24997,online,5,"amazing experience, highly recommend!",positive
24997,24998,online,4,fast delivery and great packaging.,positive
24998,24999,online,5,great value for money.,positive


Data dictionary


*   `customer_id`: unique identification number of each customer
*   `purchase_channel`: purchase channel
*   `customer_rating`: rating given by the customer
*   `review_text`: text of the customer's review
*   `sentiment`: type of opinion expressed


In [None]:
DF_modificado.nunique()

Unnamed: 0,0
customer_id,25000
purchase_channel,1
customer_rating,5
review_text,15
sentiment,3


In [None]:
for col in DF_modificado.columns:
    print(f"Unique values for column {col}:")
    print(DF_modificado[col].unique())
    print("-" * 80)

Unique values for column customer_id:
[    1     2     3 ... 24998 24999 25000]
--------------------------------------------------------------------------------
Unique values for column purchase_channel:
['online']
--------------------------------------------------------------------------------
Unique values for column customer_rating:
[1 5 2 3 4]
--------------------------------------------------------------------------------
Unique values for column review_text:
['very disappointed with the quality.'
 'fast delivery and great packaging.'
 'product stopped working after few days.' 'neutral about the quality.'
 'amazing experience, highly recommend!' 'great value for money.'
 'excellent product! exceeded expectations.'
 'product is okay, nothing special.' 'not worth the price.'
 'customer service was unhelpful.' 'late delivery and poor packaging.'
 'average experience overall.' 'works fine but could be better.'
 'very satisfied with the quality.'
 'delivery was 

In [None]:
print("Null values")
DF_modificado.isnull().sum()

Null values


Unnamed: 0,0
customer_id,0
purchase_channel,0
customer_rating,0
review_text,0
sentiment,0


In [None]:
print(f"Duplicates: {DF_modificado.duplicated().sum()}")

Duplicates: 0


## 🧹 Data cleaning

Goals:
- Normalize labels
- Clean text (whitespace, empty strings)
- Remove nulls/duplicates
- Leave a dataset ready for modeling
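
A minimal sketch of these cleaning goals is shown below. The helper name `clean_reviews`, the toy rows, and the specific choices (lowercasing labels, collapsing repeated whitespace, dropping only exact whole-row duplicates) are assumptions for illustration; the cells that follow in this notebook take a different route and only rename columns and binarize the label.

```python
import pandas as pd


def clean_reviews(df, text_col="review_text", label_col="sentiment"):
    """Hypothetical cleaning helper: normalize labels, tidy text, drop bad rows."""
    out = df.copy()

    # Normalize labels: trim surrounding whitespace and lowercase.
    out[label_col] = out[label_col].str.strip().str.lower()

    # Clean text: collapse runs of whitespace and trim the ends.
    out[text_col] = out[text_col].str.replace(r"\s+", " ", regex=True).str.strip()

    # Drop rows with missing or empty text, or a missing label.
    keep = out[text_col].notna() & (out[text_col] != "") & out[label_col].notna()

    # Drop exact whole-row duplicates, then renumber the index.
    return out[keep].drop_duplicates().reset_index(drop=True)


# Toy data for illustration only.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "review_text": [
        "  great   value for money. ",
        "not worth the price.",
        "not worth the price.",
        "   ",
        None,
    ],
    "sentiment": ["Positive ", "negative", "negative", "neutral", "negative"],
})
clean = clean_reviews(raw)
print(clean)
```

Only whole-row duplicates are removed here. Deduplicating on the review text alone would collapse this dataset heavily, because `review_text` has just 15 distinct values across 25,000 rows.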

Export a CSV with the selected columns so that its labels can be normalized.
This CSV will be uploaded to the Data Science folder alongside the original, under the name "Customer_Sentiment_modificado".


In [None]:
DF_modificado.to_csv("Customer_Sentiment_modificado.csv", index=False)


Load the modified CSV file to normalize its labels

In [None]:
url = "https://raw.githubusercontent.com/ElvisG2003/Sentiment-API-Hackathon/refs/heads/main/data-science/Customer_Sentiment_modificado.csv"
DF_modificado = pd.read_csv(url, sep=',')
DF_modificado


Unnamed: 0,customer_id,purchase_channel,customer_rating,review_text,sentiment
0,1,online,1,very disappointed with the quality.,negative
1,2,online,5,fast delivery and great packaging.,positive
2,3,online,1,very disappointed with the quality.,negative
3,4,online,2,product stopped working after few days.,negative
4,5,online,3,neutral about the quality.,neutral
...,...,...,...,...,...
24995,24996,online,1,very disappointed with the quality.,negative
24996,24997,online,5,"amazing experience, highly recommend!",positive
24997,24998,online,4,fast delivery and great packaging.,positive
24998,24999,online,5,great value for money.,positive


Normalizing column labels

In [None]:
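# Rename the columns to the project's Spanish snake_case names.
DF_modificado.rename(columns={'customer_id':'id_cliente','purchase_channel':'canal_de_compra','customer_rating':'calificacion','review_text':'texto_de_review','sentiment':'opinion'},inplace=True)
# Normalize column names: lowercase, with dots and spaces replaced by underscores.
# regex=False treats '.' literally (as a regex it would match every character).
DF_modificado.columns = DF_modificado.columns.str.lower()
DF_modificado.columns = DF_modificado.columns.str.replace('.', '_', regex=False)
DF_modificado.columns = DF_modificado.columns.str.replace(' ', '_', regex=False)
DF_modificado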

Unnamed: 0,id_cliente,canal_de_compra,calificacion,texto_de_review,opinion
0,1,online,1,very disappointed with the quality.,negative
1,2,online,5,fast delivery and great packaging.,positive
2,3,online,1,very disappointed with the quality.,negative
3,4,online,2,product stopped working after few days.,negative
4,5,online,3,neutral about the quality.,neutral
...,...,...,...,...,...
24995,24996,online,1,very disappointed with the quality.,negative
24996,24997,online,5,"amazing experience, highly recommend!",positive
24997,24998,online,4,fast delivery and great packaging.,positive
24998,24999,online,5,great value for money.,positive


Data dictionary (after renaming)


*   `id_cliente`: unique identification number of each customer
*   `canal_de_compra`: purchase channel
*   `calificacion`: rating given by the customer
*   `texto_de_review`: text of the customer's review
*   `opinion`: type of opinion expressed


Convert the `opinion` column from negative/positive/neutral to a binary value: 0 (negative/neutral) or 1 (positive). The result is stored in the DataFrame `DF_modificado_OB`.

In [None]:
DF_modificado_OB = DF_modificado.copy()

DF_modificado_OB["opinion"] = DF_modificado_OB["opinion"].map(
    {"positive": 1, "neutral": 0, "negative": 0}
)


In [None]:
DF_modificado_OB

Unnamed: 0,id_cliente,canal_de_compra,calificacion,texto_de_review,opinion
0,1,online,1,very disappointed with the quality.,0
1,2,online,5,fast delivery and great packaging.,1
2,3,online,1,very disappointed with the quality.,0
3,4,online,2,product stopped working after few days.,0
4,5,online,3,neutral about the quality.,0
...,...,...,...,...,...
24995,24996,online,1,very disappointed with the quality.,0
24996,24997,online,5,"amazing experience, highly recommend!",1
24997,24998,online,4,fast delivery and great packaging.,1
24998,24999,online,5,great value for money.,1


Verify that the conversion was applied correctly

In [None]:
print(DF_modificado["opinion"].value_counts())
print(DF_modificado_OB["opinion"].value_counts())


opinion
positive    9978
negative    9937
neutral     5085
Name: count, dtype: int64
opinion
0    15022
1     9978
Name: count, dtype: int64


In [None]:
comparacion = DF_modificado[["opinion"]].join(
    DF_modificado_OB["opinion"],
    lsuffix="_original",
    rsuffix="_binaria"
)

print(comparacion.head(15))


   opinion_original  opinion_binaria
0          negative                0
1          positive                1
2          negative                0
3          negative                0
4           neutral                0
5          positive                1
6          positive                1
7          positive                1
8           neutral                0
9          positive                1
10         negative                0
11         negative                0
12         positive                1
13         positive                1
14         negative                0


Export the final modified CSV file.
This CSV will be uploaded to the Data Science folder alongside the original, under the name "Customer_Sentiment_final".


In [None]:
DF_modificado_OB.to_csv("Customer_Sentiment_final.csv", index=False)

## 🔎 EDA (Exploratory data analysis)

We review:
- class distribution
- text length
- examples per class
- possible bias / class imbalance
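
One possible way to run these checks is sketched below. It assumes the renamed columns `texto_de_review` and `opinion` from the cleaning step. The toy rows reuse a few review texts shown earlier in this notebook, and the variable names (`class_share`, `n_words`, `imbalance_ratio`) are illustrative.

```python
import pandas as pd

# Toy data for illustration only; column names follow the cleaned dataset.
df = pd.DataFrame({
    "texto_de_review": [
        "very disappointed with the quality.",
        "fast delivery and great packaging.",
        "neutral about the quality.",
        "great value for money.",
        "not worth the price.",
    ],
    "opinion": [0, 1, 0, 1, 0],
})

# Class distribution as proportions.
class_share = df["opinion"].value_counts(normalize=True)

# Text length, measured in whitespace-separated words.
df = df.assign(n_words=df["texto_de_review"].str.split().str.len())
length_by_class = df.groupby("opinion")["n_words"].describe()

# Up to two distinct example texts per class.
examples = df.groupby("opinion")["texto_de_review"].apply(
    lambda s: s.drop_duplicates().head(2).tolist()
)

# Simple imbalance indicator: majority share divided by minority share.
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(length_by_class)
print(examples)
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

On the real data, the binarized label has 15,022 rows in class 0 and 9,978 in class 1, so a ratio of roughly 1.5 would be expected from the same calculation.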

## ✂️ Train/Test Split

We use stratification to preserve the class proportions in both splits.
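
A stratified split could look like the sketch below, which uses scikit-learn's `train_test_split` with the `stratify` argument. The toy DataFrame, the 75/25 split, and `random_state=42` are assumptions chosen for the example; the column names follow the cleaned dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 8 positive (1) and 12 non-positive (0) rows.
df = pd.DataFrame({
    "texto_de_review": [f"review {i}" for i in range(20)],
    "opinion": [1] * 8 + [0] * 12,
})

# stratify=df["opinion"] keeps the 40/60 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    df["texto_de_review"],
    df["opinion"],
    test_size=0.25,
    stratify=df["opinion"],
    random_state=42,  # fixed seed so the split is reproducible
)

print("train positive share:", y_train.mean())
print("test positive share:", y_test.mean())
```

Without `stratify`, a random split of a small or imbalanced dataset can give the test set a noticeably different class mix from the training set, which makes the metrics harder to interpret.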

In [None]:
# code goes here