## Team members:
1. Camila Coltriani
2. Luis Dartayet
3. Irania Fuentes
4. Jonathan Fichelson
5. Ornella Cevoli
# Practical assignment 4: Sentiment analysis on Twitter

# Introduction
Social networks such as Twitter have proven to be excellent sources of information about events happening around the world. They can shift the opinions of millions of people, which makes them especially useful for influencing the public in areas such as political campaigns, cryptocurrency prices, and sales advertising.
With this in mind, we set the following objective.

# Objective:
Analyze the sentiment of tweets in order to predict people's behavior and propagate changes in real time as the event under study unfolds.

## Source:
Kaggle dataset: https://www.kaggle.com/code/paoloripamonti/twitter-sentiment-analysis/input

In [1]:
# # Libraries
## import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
## from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# from sklearn.metrics import accuracy_score,plot_confusion_matrix,roc_auc_score, classification_report, confusion_matrix, precision_recall_curve, auc
# from sklearn.naive_bayes import GaussianNB
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn import tree
# from scipy.stats import mode
# import seaborn as sns
## import re

# # nltk
# import nltk
# from nltk.tokenize import word_tokenize, sent_tokenize
## from nltk.corpus import stopwords
## from  nltk.stem import SnowballStemmer
## from nltk.stem import WordNetLemmatizer
## from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# from sklearn.feature_extraction.text import TfidfVectorizer


# from sklearn.naive_bayes import MultinomialNB
# from sklearn.model_selection import GridSearchCV,StratifiedKFold,train_test_split
# from sklearn.metrics import accuracy_score
## from sklearn.pipeline import Pipeline

# from sklearn.pipeline import make_pipeline, make_union
# from sklearn.preprocessing import StandardScaler
# from sklearn.impute import SimpleImputer
## from sklearn.base import BaseEstimator, TransformerMixin




## Loading the data

In [1]:
import pandas as pd

In [2]:
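# Load and read the dataset.
# 'ANSI' is a Windows-only codec alias; 'latin-1' is a portable alternative.
# The file appears to have no header row, so read_csv uses the first tweet as
# column names (hence 1,599,999 rows below); header=None with names=[...] avoids that.
data = pd.read_csv('./data/twitter.csv', encoding='ANSI')
data.columns = ['target', 'id', 'date', 'flag', 'user', 'text']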
print(data.shape)
data.head()

(1599999, 6)


Unnamed: 0,target,id,date,flag,user,text
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew


## Exploratory data analysis

In [4]:
# Check nulls and duplicates
print('Nulls: ', data.isnull().sum())
print('Duplicates: ', data.duplicated().sum())

Nulls:  target    0
id        0
date      0
flag      0
user      0
text      0
dtype: int64
Duplicates:  0


In [3]:
# Remove useless columns
data.drop(['id', 'date', 'flag', 'user'], axis=1, inplace=True)

In [4]:
# Rename target
data['target'] = data['target'].map({0: 'negative', 4: 'positive'})

In [5]:
# Check target distribution
print('Target distribution: ', data.target.value_counts())

Target distribution:  positive    800000
negative    799999
Name: target, dtype: int64


In [6]:
# Assign the X and y variables to model
X = data.text
y = data.target

In [7]:
from sklearn.model_selection import train_test_split

## Data preprocessing

### Splitting the dataset into train and test

In [8]:
# Split into train and test sets (stratified by class, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y,stratify=y, test_size=0.3)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1119999,)
(480000,)
(1119999,)
(480000,)
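
The `stratify=y` argument keeps the class proportions the same in the train and test subsets. The idea can be sketched with the standard library alone; `stratified_split` is a hypothetical helper written for illustration (not scikit-learn's implementation), and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # same fraction per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy data: 10 negative and 10 positive labels.
labels = ["negative"] * 10 + ["positive"] * 10
train_idx, test_idx = stratified_split(labels)
```

With this toy input, the test subset holds 3 indices from each class, so both subsets stay balanced like the full dataset.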


In [9]:
import re

from nltk.corpus import stopwords
from  nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer

### Corpus cleaning: a function that removes stop words, optionally applies stemming, and converts text to lowercase

In [10]:
# Regular expression that strips punctuation, @mentions, URLs and numbers from the corpus
limpieza_re = r"\d+[^0-9]|@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]"

# Requires the NLTK stop word list: nltk.download('stopwords')
stop_words = set(stopwords.words("english"))  # set for fast membership tests
stemmer = SnowballStemmer("english")

# Function that cleans the corpus
def preprocess(text, stem=False):
    # Remove link,user and special characters
    text = re.sub(limpieza_re, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)
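
To see what the cleaning pattern does in practice, the sketch below applies the same regular expression to a few made-up strings. `clean_tokens` is a hypothetical stand-in for `preprocess` that skips NLTK, so the stop word set passed to it is an assumption for illustration only.

```python
import re

# Same pattern as limpieza_re, written as a raw string.
pattern = r"\d+[^0-9]|@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]"

def clean_tokens(text, stop_words=frozenset()):
    """Lowercase, replace pattern matches with spaces, drop stop words."""
    cleaned = re.sub(pattern, " ", str(text).lower()).strip()
    return [tok for tok in cleaned.split() if tok not in stop_words]

sample = "@user Check https://t.co/abc NOW!!! 100% great"

# The mention, the URL, the punctuation and "100%" are all removed.
tokens = clean_tokens(sample)

# A tiny made-up stop word set standing in for NLTK's English list.
filtered = clean_tokens(sample, {"now"})

# Quirk: \d+[^0-9] also consumes the character right after the digits,
# while digits at the very end of the text are kept.
quirk = clean_tokens("see you 2day")
trailing = clean_tokens("top 10")
```

The last two calls show a side effect worth knowing about: "2day" loses its "d" along with the digit, and a number at the end of a tweet survives the cleaning.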



In [11]:
from sklearn.base import BaseEstimator, TransformerMixin

In [12]:
# Text-processing transformer class, usable as a pipeline step
class TextProcessing(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.apply(preprocess)

In [13]:
from sklearn.feature_extraction.text import CountVectorizer


## Pipeline configuration and models used for sentiment prediction

In [14]:
common_steps = [
            ('text_processing', TextProcessing()),
            ('vectorizer', CountVectorizer()),
            ]
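
The vectorizer step turns each cleaned tweet into a vector of token counts over a shared vocabulary. A minimal standard-library sketch of that bag-of-words idea follows; `bag_of_words` is a hypothetical helper, and it splits on whitespace only, which is a simplification compared with `CountVectorizer`'s default tokenization.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Made-up documents, already cleaned and lowercased.
docs = ["good good day", "bad day"]
vocabulary, vectors = bag_of_words(docs)
```

Each column of `vectors` corresponds to one vocabulary term, which is the representation the classifiers in the pipeline receive.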


In [15]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV,StratifiedKFold

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree

In [19]:
# Models to use

MNB_pipe = [Pipeline( common_steps + [('mnb', MultinomialNB())]), {'mnb__alpha':[1, 10, 20]}]
KNN_pipe = [Pipeline( common_steps + [('knn', KNeighborsClassifier())]), {'knn__n_neighbors':[5]}]
LR_pipe = [Pipeline( common_steps + [('lr', LogisticRegression())]), {'lr__C':[0.1,1,10], 'lr__penalty':['l2']}] # l1 removed: it raised an error because the default lbfgs solver only supports l2
DT_pipe = [Pipeline( common_steps + [('dt', tree.DecisionTreeClassifier())]), {'dt__max_depth':[2, 4, 8] }]

skf=StratifiedKFold(n_splits=3,random_state=0,shuffle=True)
models = []


In [20]:
# Grid search for the logistic regression model to find its best parameters
pipelines = [ LR_pipe]

for pipe in pipelines:
    GS_CV=GridSearchCV(pipe[0],pipe[1],cv=skf,verbose=10,n_jobs=3);
    GS_CV.fit(X_train, y_train);
    models.append(GS_CV)
    print('best score:',GS_CV.best_score_)
    print('best params:',GS_CV.best_params_) 

Fitting 3 folds for each of 3 candidates, totalling 9 fits
best score: 0.775319442249502
best params: {'lr__C': 0.1, 'lr__penalty': 'l2'}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
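
The log line "Fitting 3 folds for each of 3 candidates, totalling 9 fits" is the number of parameter combinations multiplied by the number of folds. The sketch below reproduces that count for the logistic regression grid; `expand_grid` is a hypothetical helper for illustration, not part of scikit-learn.

```python
from itertools import product

def expand_grid(param_grid):
    """List every parameter combination in a grid, in sorted key order."""
    keys = sorted(param_grid)
    value_lists = [param_grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]

lr_grid = {'lr__C': [0.1, 1, 10], 'lr__penalty': ['l2']}
n_splits = 3  # matches StratifiedKFold(n_splits=3)

candidates = expand_grid(lr_grid)
total_fits = len(candidates) * n_splits
```

By default `GridSearchCV` then refits the best candidate once more on the full training set, which is why `predict` can be called on the search object afterwards.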


In [21]:
# Grid search for the decision tree model to find its best parameters
pipelines = [ DT_pipe]

for pipe in pipelines:
    GS_CV=GridSearchCV(pipe[0],pipe[1],cv=skf,verbose=10,n_jobs=3);
    GS_CV.fit(X_train, y_train);
    models.append(GS_CV)
    print('best score:',GS_CV.best_score_)
    print('best params:',GS_CV.best_params_) 

Fitting 3 folds for each of 3 candidates, totalling 9 fits
best score: 0.5620621089840259
best params: {'dt__max_depth': 8}


In [22]:
# Grid search for the Multinomial Naive Bayes model to find its best parameters
pipelines = [ MNB_pipe]

for pipe in pipelines:
    GS_CV=GridSearchCV(pipe[0],pipe[1],cv=skf,verbose=10,n_jobs=3);
    GS_CV.fit(X_train, y_train);
    models.append(GS_CV)
    print('best score:',GS_CV.best_score_)
    print('best params:',GS_CV.best_params_) 

Fitting 3 folds for each of 3 candidates, totalling 9 fits
best score: 0.7666203273395779
best params: {'mnb__alpha': 10}


In [23]:
# ran for 30 min without finishing
# pipelines = [ KNN_pipe]

# for pipe in pipelines:
#     GS_CV=GridSearchCV(pipe[0],pipe[1],cv=skf,verbose=10,n_jobs=3);
#     GS_CV.fit(X_train, y_train);
#     models.append(GS_CV)
#     print('best score:',GS_CV.best_score_)
#     print('best params:',GS_CV.best_params_) 

Fitting 3 folds for each of 1 candidates, totalling 3 fits


In [28]:
models[0].predict(X_test)

array(['negative', 'positive', 'negative', ..., 'positive', 'positive',
       'negative'], dtype=object)

>> Are these parameters for the first model in the pipeline? MNB?

In [30]:
## Model evaluation
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not used here, so it is not imported.
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix, precision_recall_curve, auc

y_pred = models[0].predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Classification report: ', classification_report(y_test, y_pred))
print('Confusion matrix: ', confusion_matrix(y_test, y_pred))


Accuracy:  0.7765958333333334
Classification report:                precision    recall  f1-score   support

    negative       0.79      0.75      0.77    240000
    positive       0.76      0.81      0.78    240000

    accuracy                           0.78    480000
   macro avg       0.78      0.78      0.78    480000
weighted avg       0.78      0.78      0.78    480000

Confusion matrix:  [[179404  60596]
 [ 46638 193362]]
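
The figures in the classification report can be recomputed directly from the confusion matrix, which makes the relationship between the two outputs explicit. This is a minimal sketch using the matrix printed above; `per_class_metrics` is a hypothetical helper name.

```python
# Confusion matrix printed above: rows are true classes, columns are
# predictions, in the order [negative, positive].
cm = [[179404, 60596],
      [46638, 193362]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class index k."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, true other
    fn = sum(cm[k]) - tp                             # true k, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
neg = per_class_metrics(cm, 0)
pos = per_class_metrics(cm, 1)
```

Rounded to two decimals, `neg` and `pos` match the precision, recall and f1-score rows of the report, and `accuracy` matches the printed accuracy.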


In [40]:
# Predict new data
new_data = ['supercalifragilisticexpialidocious']

new_data = pd.Series(new_data)
new_data = new_data.apply(preprocess)  # optional: the pipeline's text_processing step applies preprocess again

models[0].predict(new_data)



array(['positive'], dtype=object)

>> Does this part stay?

In [62]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\LuisD\AppData\Roaming\nltk_data...


True

In [63]:
sentimentIntensityAnalyzer = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(sentence):
    score = sentimentIntensityAnalyzer.polarity_scores(sentence)
    return score

data['sentiment'] = data['text'].apply(sentiment_analyzer_scores)
data.head()

Unnamed: 0,target,text,sentiment
0,negative,is upset that he can't update his Facebook by ...,"{'neg': 0.303, 'neu': 0.697, 'pos': 0.0, 'comp..."
1,negative,@Kenichan I dived many times for the ball. Man...,"{'neg': 0.0, 'neu': 0.833, 'pos': 0.167, 'comp..."
2,negative,my whole body feels itchy and like its on fire,"{'neg': 0.321, 'neu': 0.5, 'pos': 0.179, 'comp..."
3,negative,"@nationwideclass no, it's not behaving at all....","{'neg': 0.241, 'neu': 0.759, 'pos': 0.0, 'comp..."
4,negative,@Kwesidei not the whole crew,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
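
The `sentiment` column holds a dictionary per tweet, so comparing it with `target` needs a single label per row. One possible approach, sketched below, thresholds the `compound` value. The ±0.05 cut-off is a convention commonly used with VADER and is an assumption here, `label_from_compound` is a hypothetical helper, and the score dictionaries are made up.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER score dict to a label using its 'compound' value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical score dicts in the shape returned by polarity_scores.
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.7},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_from_compound(s) for s in examples]
```

Applied as `data['sentiment'].apply(label_from_compound)`, this would give a column that can be compared with `target`, keeping in mind that `target` has no neutral class.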
