<a href="https://colab.research.google.com/github/RafaGusmao/Analise-de-Sentimento/blob/main/An%C3%A1lise_de_Sentimento.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:


# Now that we know where the files are, we can load the data
train_file = '/content/train.txt.txt'
test_file = '/content/test.txt.txt'

# Load reviews and labels from a file with one "__label__N <review text>" entry per line
def load_data(file_path):
    reviews = []
    labels = []

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # Skip blank lines, which would break the unpacking below
            label, review = line.split(' ', 1)  # Split the label from the review text
            labels.append(label.strip())  # Strip whitespace; the '__label__' prefix is kept
            reviews.append(review.strip())  # Review text without surrounding whitespace

    return reviews, labels

# Load the training and test data
train_reviews, train_labels = load_data(train_file)
test_reviews, test_labels = load_data(test_file)

# Print the first few entries to check that loading worked
print(train_reviews[:5])
print(train_labels[:5])


['Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^', "The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.", 'Amazing!: This soundtrack is my favorite music of all t
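
Each line of the input files starts with a tag (`__label__1` or `__label__2`), followed by a space and the review text. The following is a minimal sketch of parsing one such line, assuming that layout. `parse_line` is a hypothetical helper that is not part of the notebook, and the sample string is invented for illustration:

```python
PREFIX = "__label__"

def parse_line(line):
    """Split '__label__N <text>' into (N, text); return None for a blank line."""
    line = line.strip()
    if not line:
        return None
    # partition() splits on the first space only, so spaces in the review survive
    tag, _, text = line.partition(" ")
    if not tag.startswith(PREFIX):
        raise ValueError(f"unexpected label tag: {tag!r}")
    return int(tag[len(PREFIX):]), text.strip()

# Invented sample line, not taken from the dataset
sample = "__label__2 Great album: I keep coming back to it.\n"
print(parse_line(sample))  # (2, 'Great album: I keep coming back to it.')
```

Unlike `load_data`, this variant returns the numeric part of the tag, which would make the later conversion to 0/1 a simple comparison against `2`.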

In [3]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from multiprocessing import Pool

# Download the required resources (only needed once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab') # Download the missing punkt_tab data package

# Load the stopword list and the lemmatizer once
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize the text
    # Keep only purely alphabetic tokens and drop stopwords
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]  # Lemmatization
    return " ".join(tokens)

# Apply the preprocessing in parallel across worker processes
def parallel_preprocess(reviews, num_workers=4):
    with Pool(num_workers) as pool:
        return pool.map(preprocess_text, reviews)

# Preprocess the training and test reviews
processed_train_reviews = parallel_preprocess(train_reviews)
processed_test_reviews = parallel_preprocess(test_reviews)

# Print the first 5 processed reviews
print(processed_train_reviews[:5])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['stuning even sound track beautiful paint senery mind well would recomend even people hate vid game music played game chrono cross game ever played best music back away crude keyboarding take fresher step grate guitar soulful orchestra would impress anyone care listen', 'best soundtrack ever anything reading lot review saying best soundtrack figured write review disagree bit opinino yasunori mitsuda ultimate masterpiece music timeless listening year beauty simply refuse price tag pretty staggering must say going buy cd much money one feel would worth every penny', 'amazing soundtrack favorite music time hand intense sadness prisoner fate mean played game hope distant promise girl stole star important inspiration personally throughout teen year higher energy track like chrono cross time time dreamwatch chronomantique indefinably remeniscent chrono trigger absolutely superb soundtrack amazing music probably best composer work heard xenogears soundtrack ca say sure even never played game
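
The processed output above shows the effect of the filter in `preprocess_text`: `non-gamer` from the first review disappears entirely, because `str.isalpha()` is false for any token that contains a hyphen, digit, or apostrophe. The following is a minimal sketch of that filtering step in isolation, with no NLTK dependency. It assumes the tokenizer emits `non-gamer` as a single token; the token list is hand-written to resemble tokenizer output, and `STOP_WORDS` is a tiny stand-in rather than NLTK's English list:

```python
# Tiny stand-in stopword set (an assumption for illustration, not NLTK's list)
STOP_WORDS = {"for", "the", "this", "was"}

def filter_tokens(tokens, stop_words):
    """Keep tokens that are purely alphabetic and not stopwords."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]

# Hand-written tokens resembling what a tokenizer might return
tokens = ["stuning", "even", "for", "the", "non-gamer", ":",
          "this", "sound", "track", "was", "beautiful", "!"]

print(filter_tokens(tokens, STOP_WORDS))
# ['stuning', 'even', 'sound', 'track', 'beautiful']
```

The same rule explains fragments such as `ca` in the third review: a contraction split into two tokens keeps only the alphabetic half.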

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000)  # Adjust the number of features as needed

# Turn the reviews into numeric vectors (fit on the training set only)
X_train = vectorizer.fit_transform(processed_train_reviews)
X_test = vectorizer.transform(processed_test_reviews)

print(X_train.shape)  # Show the dimensions of the TF-IDF matrix


(3600000, 5000)
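
The shape `(3600000, 5000)` means one row per training review and one column per retained term. The following is a pure-Python sketch of how each entry is weighted, assuming scikit-learn's documented defaults: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows. `tfidf` is a hypothetical helper that splits on whitespace and omits the `max_features` cut, so it approximates `TfidfVectorizer` rather than reproducing it, and the three documents are invented:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["best soundtrack ever", "best music", "worst purchase ever"]
vocab, rows = tfidf(docs)
print(vocab)  # ['best', 'ever', 'music', 'purchase', 'soundtrack', 'worst']
print([round(v, 3) for v in rows[1]])
```

In the second document, `music` receives a larger weight than `best` because it occurs in fewer documents.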


In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# The labels are in the form '__label__1' and '__label__2', so convert them to 0 and 1
y_train = [1 if label == '__label__2' else 0 for label in train_labels]  # Positive = 1, negative = 0
y_test = [1 if label == '__label__2' else 0 for label in test_labels]  # Same mapping for the test set

# Create and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")

# Show a detailed performance report
print(classification_report(y_test, y_pred))


Model accuracy: 88.71%
              precision    recall  f1-score   support

           0       0.89      0.88      0.89    200000
           1       0.88      0.89      0.89    200000

    accuracy                           0.89    400000
   macro avg       0.89      0.89      0.89    400000
weighted avg       0.89      0.89      0.89    400000
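
The per-class columns in the report are derived from counts of true positives, false positives, and false negatives. The following is a small sketch of those formulas on a toy example. `prf` is a hypothetical helper, and the six labels are invented rather than taken from the test set:

```python
def prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented toy labels: 2 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1]
print(prf(y_true_toy, y_pred_toy))
```

Calling `prf(..., positive=0)` gives the row for the negative class, which is how the report arrives at two separate lines.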



In [6]:
print("First predictions:", y_pred[:5])

First predictions: [1 1 0 0 1]
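
Each entry of `y_pred` is a class decision for one test review. For a binary logistic regression, that decision amounts to checking the sign of a linear score, which is equivalent to asking whether the estimated probability of class 1 exceeds 0.5. The following is a sketch with made-up numbers. `predict_one`, the weights, and the intercept are hypothetical and are not taken from the trained model:

```python
import math

def predict_one(weights, bias, features):
    """Classify one sparse row given as {column_index: tfidf_value}."""
    score = bias + sum(weights[i] * v for i, v in features.items())
    prob = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the linear score
    label = 1 if score > 0 else 0          # same as prob > 0.5
    return label, prob

weights = [1.5, -2.0, 0.3]  # hypothetical coefficients, one per term
bias = -0.1                 # hypothetical intercept

label, prob = predict_one(weights, bias, {0: 0.8, 2: 0.6})
print(label, round(prob, 3))
```

With these numbers the score is positive, so the row is assigned class 1. A row dominated by the negatively weighted term (index 1) would be assigned class 0.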
