<a href="https://colab.research.google.com/github/LuisHiram99/trabajos_PLN/blob/main/Tareas/TC-07.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!gdown 14nqDnZ3oDXqRtIcrpB6IP1f_NZP_Gb08

Downloading...
From: https://drive.google.com/uc?id=14nqDnZ3oDXqRtIcrpB6IP1f_NZP_Gb08
To: /content/YoutubeCommentsDataSet.csv
  0% 0.00/2.57M [00:00<?, ?B/s]100% 2.57M/2.57M [00:00<00:00, 40.2MB/s]


In [45]:
import pandas as pd
from collections import Counter


In [2]:


df = pd.read_csv('YoutubeCommentsDataSet.csv',index_col=0)

# Trabajo en clase

En esta práctica, exploraremos dos métodos comunes para vectorizar textos en el NLP: **Bag of Words (BOW)** y **TF-IDF**. Utilizaremos implementaciones de `scikit-learn` para llevar a cabo la vectorización y entrenaremos un clasificador para realizar análisis de sentimientos. El objetivo es comparar el rendimiento de ambos métodos bajo diferentes configuraciones.

## Instrucciones

1. **Preprocesamiento y análisis exploratorio**:
   - Realiza la limpieza de los textos que consideres necesario.
   - Elimina stopwords si lo consideras necesario.
   - Realiza un análisis exploratorio de los datos: distribución de clases, longitud de textos, palabras más frecuentes (puedes usar una nube de palabras o mostrar las palabras más frecuentes por clase).

2. **Vectorización de textos**:
   - **BOW**: Usa `CountVectorizer` de `scikit-learn` para vectorizar los textos. Prueba con tres valores de `max_features` en órdenes de magnitud distintos.
   - **TF-IDF**: Usa `TfidfVectorizer` de `scikit-learn` para vectorizar los textos. Prueba con los mismos tres valores de `max_features` que usaste para BOW.

3. **Entrenamiento del clasificador**: Elige un clasificador (por ejemplo, `LogisticRegression`, `DecisionTree`, `Naive Bayes`, etc.) y entrénalo utilizando los datos vectorizados tanto con BOW como con TF-IDF. Cuida los aspectos relacionados con el *data leakage*.

4. **Evaluación y reporte**: Para cada combinación de método de vectorización (BOW y TF-IDF) y valor de `max_features`, calcula el F1-score. Reporta los resultados en una tabla como la siguiente:

| Método  | max_features | F1-score |
|---------|--------------|----------|
| BOW     | valor1          |          |
| BOW     | valor2         |          |
| BOW     | valor3        |          |
| TF-IDF  | valor1          |          |
| TF-IDF  | valor2         |          |
| TF-IDF  | valor3        |          |

5. **Conclusiones**: Responde las siguientes preguntas en una celda de texto:
   - ¿Cuál método de vectorización (BOW o TF-IDF) obtuvo mejores resultados en general? ¿qué combinación de vectorización y valor de `max_features` produjo el mejor resultado.
   - ¿Cómo afecta el valor de `max_features` al rendimiento del modelo?
   - ¿Qué estrategias adicionales consideras que podrían mejorar el rendimiento de tu modelo? Describe dos de estas estrategias. **Importante**: Cada estrategia debe ser con BOW/TF-IDF.  
   - Describe el proprocesamiento que hiciste en el paso 1.

In [19]:
df

Unnamed: 0,Comment,Sentiment
1,here in nz 50 of retailers don’t even have con...,negative
2,i will forever acknowledge this channel with t...,positive
3,whenever i go to a place that doesn’t take app...,negative
4,apple pay is so convenient secure and easy to ...,positive
6,we only got apple pay in south africa in 20202...,positive
...,...,...
18401,i come from a physics background and usually w...,positive
18403,i really like the point about engineering tool...,positive
18404,i’ve just started exploring this field and thi...,positive
18406,hey daniel just discovered your channel a coup...,positive


## #1 Preprocesamiento


In [20]:
print(df.shape)
print(df["Sentiment"].value_counts())

(13770, 2)
Sentiment
positive    11432
negative     2338
Name: count, dtype: int64


In [21]:
df.isnull().sum()

Unnamed: 0,0
Comment,31
Sentiment,0


In [22]:
df.dropna(axis=0, how="any", inplace=True)


In [24]:
df.isnull().sum()

Unnamed: 0,0
Comment,0
Sentiment,0


### Separando datos en train y test

In [57]:
from sklearn.model_selection import train_test_split
x = df['Comment']
y = df['Sentiment']

X_train,X_test, Y_train, Y_test = train_test_split(x,y, test_size=0.2, random_state=42, stratify=y)

In [64]:
print(Y_train.value_counts())
print(Y_test.value_counts())

Sentiment
positive    9121
negative    1870
Name: count, dtype: int64
Sentiment
positive    2281
negative     467
Name: count, dtype: int64


### Quitando stop words

In [65]:
import spacy
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

X_train = X_train.reset_index(drop=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [67]:
X_train[0]

'extremely helpful great overall video thank you i even love how your matched with some of the colors on the slide very coordinated just subscribed and will probably be binge watching your videos for the rest of the day i know the effort that it takes to produce these tutorials so much appreciated'

In [69]:
for i in range(0,X_train.size):
  X_train.iloc[i] = ' '.join([word for word in X_train.iloc[i].split() if word.lower() not in stop_words])

X_train.head()

Unnamed: 0,Comment
0,extremely helpful great overall video thank ev...
1,computer science gives language communicate te...
2,android user 10 plus years especially samsung ...
3,ants truly insane know wood ants even sacrific...
4,sir lot respect excellent session


In [70]:
X_train[0]

'extremely helpful great overall video thank even love matched colors slide coordinated subscribed probably binge watching videos rest day know effort takes produce tutorials much appreciated'

In [41]:
df_clean = X_train.copy()
df_pos = df_clean[df_clean['Sentiment']=="positive"]
df_neg = df_clean[df_clean['Sentiment']=="negative"]

In [42]:
df_pos

Unnamed: 0,Comment,Sentiment
0,extremely helpful great overall video thank ev...,positive
1,computer science gives language communicate te...,positive
2,android user 10 plus years especially samsung ...,positive
4,sir lot respect excellent session,positive
5,quick resume game changer found completely fli...,positive
...,...,...
10986,love guys hope know guys brighten day massive ...,positive
10987,thank much sis tshepi work amazing future work,positive
10988,useful thank,positive
10989,pretty good one relaxing day lot thank much cr...,positive


In [49]:
def word_frecuencies(df):
  list_of_tokens = []
  for i in range(0,df["Comment"].size):
    doc = df["Comment"][i]
    words = [t.lower() for t in nltk.word_tokenize(doc) if t.isalpha() and t.lower() not in stop_words]
    list_of_tokens.extend(words)
  return Counter(list_of_tokens)

In [50]:
word_frecuencies(df_pos)

KeyError: 3

## Usando CountVectorizer:


In [82]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def vectorizer_accuracy(max_features):
  vectorizer = CountVectorizer(max_features=max_features)

  X_train_counts = vectorizer.fit_transform(X_train)
  X_test_counts = vectorizer.fit_transform(X_test)

  model = MultinomialNB()
  model.fit(X_train_counts, Y_train)

  y_pred = model.predict(X_test_counts)

  accuracy = accuracy_score(Y_test, y_pred)
  print(f"Accuracy: {accuracy:.2f}")

In [91]:
vectorizer_accuracy(50)

Accuracy: 0.80


In [87]:
vectorizer_accuracy(100)

Accuracy: 0.79


In [88]:
vectorizer_accuracy(10)

Accuracy: 0.79


In [89]:
vectorizer_accuracy(1000)

Accuracy: 0.48


In [90]:
vectorizer_accuracy(500)

Accuracy: 0.64


### Usando Tf-Idf

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [94]:
def tfidf_accuracy(max_features):
  vectorizer = TfidfVectorizer(max_features=100)
  X_train_counts = vectorizer.fit_transform(X_train)
  X_test_counts = vectorizer.fit_transform(X_test)

  model = MultinomialNB()
  model.fit(X_train_counts, Y_train)

  y_pred = model.predict(X_test_counts)

  accuracy = accuracy_score(Y_test, y_pred)
  print(f"Accuracy: {accuracy:.2f}")

In [95]:
tfidf_accuracy(50)

Accuracy: 0.83


In [96]:
tfidf_accuracy(100)

Accuracy: 0.83


In [98]:
tfidf_accuracy(500)

Accuracy: 0.83


In [99]:
tfidf_accuracy(1000)

Accuracy: 0.83


In [100]:
tfidf_accuracy(2000)

Accuracy: 0.83
