<a name="top"></a> <h1>01. ML Models</h1>

<p>Sentiment analysis: Tweets<br />
<strong>Master's Thesis</strong><br />
<strong>Master's Degree in Data Science</strong></p>

<p>&nbsp;</p>

<p style="text-align:right">V&iacute;ctor Viloria V&aacute;zquez (<em>victor.viloria@cunef.edu</em>)</p>


<hr style="border:1px solid gray">

### Structure

[1. Libraries and functions used](#librerias)

[2. Reading the dataframe and preparing the data](#lectura)

   - Reading the dataframe
   - Data preparation

[3. Model evaluation](#modelos)

   - Support Vector Classifier
   - Multinomial Naive Bayes
   - XGBoost
   - Random forest
   - Logistic regression

<hr style="border:1px solid gray">

# <a name="librerias"></a> 1. Libraries and functions used


We import the libraries used for vectorizing the text and for training and evaluating the models:

In [16]:
# Import basic libraries.

import pandas as pd
import numpy as np
import string
import pickle
import warnings
warnings.filterwarnings('ignore')

# Import ML libraries.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import metrics 

# Import libraries for evaluation.

from sklearn.model_selection import GridSearchCV
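
`GridSearchCV` is imported above for model evaluation. As a minimal sketch of how it could be combined with a TF-IDF pipeline like the ones in this notebook, the example below tunes two parameters on a hypothetical toy corpus. The corpus, the parameter grid and `cv=3` are illustrative assumptions, not the settings of this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = positive, 0 = negative.
texts = ["love it", "so happy", "great day", "really good", "nice one", "what fun",
         "hate it", "so sad", "bad day", "really awful", "poor show", "what pain"]
labels = [1] * 6 + [0] * 6

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so parameters are addressed as "<step>__<parameter>".
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "logisticregression__C": [0.1, 1.0, 10.0],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all the data with the best parameter combination, so it can be used for prediction like any of the pipelines below.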

# <a name="lectura"></a> 2. Reading the dataframe and preparing the data


## 2.1. Reading the dataframe

In [5]:
#Import parquet file.

tweets = pd.read_parquet('../../data/processed/tweets.parquet')

# Show the head of the dataframe.

tweets.head()

| Unnamed: 0 | text | sentiment | SentimentText_clean |
|---|---|---|---|
| 0 | id have responded if i were going | 0 | id have responded if i were going |
| 1 | sooo sad i will miss you here in san diego | 2 | sooo sad i will miss you here in san diego |
| 2 | my boss is bullying me | 2 | my boss is bullying me |
| 3 | what interview leave me alone | 2 | what interview leave me alone |
| 4 | sons of why couldnt they put them on the rel... | 2 | sons of why couldnt they put them on the rele... |


## 2.2. Data preparation

We prepare the data for the models, placing the cleaned tweet text in X and the sentiment labels in y, and hold out 20% of the rows as a test set.

In [9]:
# Define X and y.

X = tweets.SentimentText_clean
y = tweets.sentiment

# Split into train and test.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
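
The split above does not pass `stratify`, so the class proportions in train and test are left to chance. A minimal sketch, on a hypothetical toy dataframe that only mimics the column names used here, of how `stratify` keeps those proportions identical in both partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the tweets dataframe (hypothetical data, three classes).
toy = pd.DataFrame({
    "SentimentText_clean": [f"tweet {i}" for i in range(100)],
    "sentiment": [0] * 50 + [1] * 30 + [2] * 20,
})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy.SentimentText_clean,
    toy.sentiment,
    test_size=0.2,
    random_state=1234,
    stratify=toy.sentiment,  # preserve the 50/30/20 class mix in both parts
)

# Class counts per partition, ordered by label.
train_counts = y_tr.value_counts().sort_index().tolist()
test_counts = y_te.value_counts().sort_index().tolist()
print(train_counts, test_counts)  # [40, 24, 16] [10, 6, 4]
```

Whether stratifying would change the results reported below is not something this sketch can show; it only illustrates the mechanism.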

# <a name="modelos"></a> 3. Model evaluation


## 3.1. Support Vector Classifier

In [11]:
# Set up the pipeline.

tfidf = TfidfVectorizer(stop_words="english",max_df=0.99,min_df=0.01)
svc = SVC(probability=True)
model = make_pipeline(tfidf, svc)

# Training the model.

model.fit(x_train, y_train)

# Prediction.
preds= model.predict(x_test)
predict_probabilities = model.predict_proba(x_test)
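
The vectorizer above is configured with `min_df=0.01` and `max_df=0.99`, which drop terms that appear in fewer than 1% or in more than 99% of the documents. A minimal sketch of that document-frequency filter on a hypothetical four-document corpus, using a much larger `min_df` so the effect is visible at this scale:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration.
docs = [
    "good morning everyone",
    "good night",
    "good coffee this morning",
    "rainy night",
]

# No frequency filter: every token of two or more characters is kept.
loose = TfidfVectorizer().fit(docs)

# min_df=0.5 keeps only terms present in at least half of the documents.
strict = TfidfVectorizer(min_df=0.5).fit(docs)

print(len(loose.vocabulary_))     # 7
print(sorted(strict.vocabulary_)) # ['good', 'morning', 'night']
```

The same inspection, `len(tfidf.vocabulary_)` after fitting, would show how many terms survive the 1% and 99% thresholds on the real tweets.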

In [12]:
# Evaluation
print(classification_report(y_test,preds))
print(metrics.accuracy_score(y_test,preds))

              precision    recall  f1-score   support

           0       0.52      0.78      0.63      2252
           1       0.73      0.57      0.64      1719
           2       0.60      0.31      0.41      1526

    accuracy                           0.59      5497
   macro avg       0.62      0.55      0.56      5497
weighted avg       0.61      0.59      0.57      5497

0.5852283063489175
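
The report shows a recall of 0.78 for class 0 but only 0.31 for class 2, which says that many class 2 tweets are missed but not which class they are assigned to. `confusion_matrix`, imported in section 1, answers that question. A minimal sketch on hypothetical toy labels (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight tweets.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 0, 0, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Recall per class is the diagonal divided by the row totals.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall.round(2))
```

In this toy case the last row is `[2, 0, 1]`: two of the three class 2 examples are predicted as class 0. Running `confusion_matrix(y_test, preds)` in the notebook would show whether the real model behaves similarly.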


## 3.2. Multinomial Naive Bayes

In [13]:
# Set up the pipeline.
clf = MultinomialNB()

model = make_pipeline(tfidf, clf)
# Training the model.
model.fit(x_train, y_train)

# Prediction on the test set.
preds = model.predict(x_test)


In [14]:
# Evaluation
print(classification_report(y_test,preds))
print(metrics.accuracy_score(y_test,preds))

              precision    recall  f1-score   support

           0       0.50      0.86      0.63      2252
           1       0.75      0.51      0.61      1719
           2       0.70      0.24      0.35      1526

    accuracy                           0.58      5497
   macro avg       0.65      0.53      0.53      5497
weighted avg       0.64      0.58      0.55      5497

0.5752228488266327
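
The `macro avg` and `weighted avg` rows of the report differ only in how the per-class scores are combined: the macro average treats the three classes equally, while the weighted average weights each class by its support. As a worked check, the F1 averages can be recomputed from the per-class values printed above (the inputs are rounded to two decimals, so the result is approximate):

```python
# Per-class F1 and support, copied from the report above.
f1 = {0: 0.63, 1: 0.61, 2: 0.35}
support = {0: 2252, 1: 1719, 2: 1526}

total = sum(support.values())

# Macro average: plain mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(total)                                     # 5497
print(round(macro_f1, 2), round(weighted_f1, 2)) # 0.53 0.55
```

The weighted figure is higher than the macro figure because the largest class (0) is also one of the best-scoring ones, while the weakest class (2) has the smallest support.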


## 3.3. XGBoost

In [17]:
# Set up the pipeline.

xgb = XGBClassifier()

model = make_pipeline(tfidf, xgb)

# Training the model.

model.fit(x_train, y_train)

# Prediction on the test set.

preds = model.predict(x_test)

## 3.4. Random forest

In [40]:
# Set up the pipeline (RandomForestClassifier is imported in section 1).

rf = RandomForestClassifier()

model = make_pipeline(tfidf, rf)

# Training the model.

model.fit(x_train, y_train)

# Prediction on the test set.

preds = model.predict(x_test)

# Accuracy on the test set.

np.mean(preds == y_test)

0.5723121702746953
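
This cell reports `np.mean(preds == y_test)` where the earlier cells used `metrics.accuracy_score`. Both compute the fraction of correct predictions, so the figures are directly comparable. A minimal check on hypothetical toy labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: four of the six predictions are correct.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 0, 2, 1, 1])

manual = np.mean(y_pred == y_true)        # mean of a boolean array
library = accuracy_score(y_true, y_pred)  # scikit-learn equivalent

print(np.isclose(manual, library))  # True
```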

## 3.5. Logistic regression

In [43]:
# Set up the pipeline (LogisticRegression is imported in section 1).

lr = LogisticRegression()

model = make_pipeline(tfidf, lr)

# Training the model.

model.fit(x_train, y_train)

# Prediction on the test set.

preds = model.predict(x_test)

# Accuracy on the test set.

np.mean(preds == y_test)



0.585955975986902
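
The accuracies printed above can be gathered for a side-by-side comparison. The sketch below only re-uses the numbers already shown; XGBoost is left out because its cell does not print a score.

```python
# Test-set accuracies copied from the outputs above.
accuracies = {
    "Support Vector Classifier": 0.5852283063489175,
    "Multinomial Naive Bayes": 0.5752228488266327,
    "Random forest": 0.5723121702746953,
    "Logistic regression": 0.585955975986902,
}

# Sort from best to worst accuracy.
ranking = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<28}{acc:.4f}")

best_name, best_acc = ranking[0]
spread = best_acc - ranking[-1][1]
print(best_name, round(spread, 4))  # Logistic regression 0.0136
```

All four models fall within about 1.4 accuracy points of each other on this split, with logistic regression and the support vector classifier essentially tied at the top.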