### Delivery 2. Text classification
---

by Jaime de Cecilio 


#### Objective
The objective of this practice is to classify a set of tweets collected in the file tweets.txt. The file uses the colon character ":" as its delimiter, but it can be treated as a CSV file whose column separator is "::::". Read this way, the first column contains the tweet text and the last column contains the label: positive, negative, neutral or None (stored in the file as positivo, negativo, neutro and None).

#### Activity script
1. Read the contents of the txt file into a DataFrame. The pandas.read_csv function is suggested.
2. Perform whatever pre-processing you consider necessary. You may use functions from the NLTK or spaCy libraries. We recommend writing the code modularly, so that you can later test whether results improve when removing stop words, when lemmatizing (extracting canonical forms), and so on.
3. Split the document set into a training subset and an evaluation subset.
4. Convert the document corpus into a TF-IDF matrix. The most convenient tool is TfidfVectorizer, which is part of sklearn. Does the maximum number of features influence the final result?
5. Train at least a Naive Bayes classifier and a support vector machine (SVM). Obtain the classification accuracy and the confusion matrix for both models.
6. Comment on the results. Which factors are involved? Are the results what you initially expected, and why? Consider the quality of the data set you are working with.
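
Step 2 asks for modular pre-processing, so that individual steps can be toggled and compared. The following is a minimal sketch of that idea using only the standard library. The function names, the flags and the tiny `DEMO_STOPWORDS` set are illustrative assumptions, not part of the assignment; in practice the NLTK Spanish stop-word list loaded below would take its place.

```python
import re
import unicodedata

# Tiny illustrative stop-word set (an assumption; NLTK's Spanish list is much longer).
DEMO_STOPWORDS = {"de", "la", "que", "el", "en", "y", "a", "los", "se", "un"}


def strip_accents(text):
    """Remove diacritics, e.g. 'día' -> 'dia'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text, lowercase=True, remove_mentions=True, remove_urls=True,
               fold_accents=False, stopwords=None):
    """Clean one tweet; each step can be switched on or off independently."""
    if remove_urls:
        text = re.sub(r"https?://\S+", " ", text)
    if remove_mentions:
        text = re.sub(r"@\w+", " ", text)
    if lowercase:
        text = text.lower()
    if fold_accents:
        text = strip_accents(text)
    tokens = re.findall(r"\w+", text)  # keeps letters, digits and underscores
    if stopwords:
        tokens = [tok for tok in tokens if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("@marodriguezb Gracias MAR"))
# gracias mar
print(preprocess("Salgo de #VeoTV , que día más largoooooo...",
                 stopwords=DEMO_STOPWORDS))
# salgo veotv día más largoooooo
```

A function like this can be applied to the text column before vectorizing, which makes it easy to rerun the experiment with different flag combinations.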

In [1]:
## Library imports

import nltk
# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once if it is missing.
stopwords = nltk.corpus.stopwords.words('spanish')

import csv
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings("ignore")


In [3]:
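## Read the data

# A multi-character separator is interpreted as a regex, which only the Python parsing engine supports.
tweets = pd.read_csv("/Users/jaime2/Desktop/DATA SCIENCE/5. ANALISIS DE INFORMACION NO ESTRUCTURADA/ENTREGAS/ENTREGA 2/tweets.txt", sep="::::", encoding="utf-8", header=None, engine="python")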
tweets.columns = ["text", "label"]
tweets.head()

| | text | label |
|---|---|---|
| 0 | Salgo de #VeoTV , que día más largoooooo... | |
| 1 | @PauladeLasHeras No te libraras de ayudar me/n... | neutro |
| 2 | @marodriguezb Gracias MAR | positivo |
| 3 | Off pensando en el regalito Sinde, la que se v... | negativo |
| 4 | Conozco a alguien q es adicto al drama! Ja ja ... | positivo |


In [5]:
## How many examples there are of each class

tweets["label"].value_counts()

positivo     2883
negativo     2182
None         1482
neutro        670
:positivo       1
:negativo       1
Name: label, dtype: int64

In [6]:
# Map ":negativo" to "negativo" and ":positivo" to "positivo"

print(type(tweets["label"]))

# Assign the result back rather than using inplace=True on a column selection
tweets["label"] = tweets["label"].replace({":positivo": "positivo", ":negativo": "negativo"})
tweets["label"].value_counts()


<class 'pandas.core.series.Series'>


positivo    2884
negativo    2183
None        1482
neutro       670
Name: label, dtype: int64
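
One plausible explanation for the stray `:positivo` and `:negativo` labels (an assumption, since the raw file is not shown here) is that those two lines contain five colons instead of four. A regex split on "::::" consumes the first four and leaves the fifth attached to the label. The sketch below reproduces that behaviour on a hypothetical line and shows that stripping leading colons normalizes the label without listing each malformed value.

```python
import re

# Hypothetical raw line with five colons before the label.
raw_line = "Gracias MAR:::::positivo"

# The split matches the first four colons, so one colon stays with the label.
text, label = re.split("::::", raw_line)
print(repr(text), repr(label))
# 'Gracias MAR' ':positivo'

# Stripping leading colons repairs any label of this kind in one step.
clean_label = label.lstrip(":")
print(clean_label)
# positivo
```

On the DataFrame, the equivalent would be `tweets["label"].str.lstrip(":")`.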

In [7]:
# Should we drop the items without a category?

# tweets.drop(tweets[tweets.label=="None"].index, inplace=True)
# tweets["label"].value_counts()

In [8]:
# Separate the documents from the categories

docs = tweets.iloc[:,0] # extract the column with the tweet text
categs = tweets.iloc[:,-1] # extract the column with the sentiment label

## Obtaining the TF-IDF matrix

In [9]:
# Tokenize the documents and convert them into a TF-IDF matrix

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=200)
# vectorizer = TfidfVectorizer(max_features=200, lowercase=True, strip_accents="ascii", stop_words = stopwords)

docs_tfidf = vectorizer.fit_transform(docs)
docs_tfidf

<7219x200 sparse matrix of type '<class 'numpy.float64'>'
	with 51257 stored elements in Compressed Sparse Row format>
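
For reference, with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`) `TfidfVectorizer` weights a term $t$ in a document $d$ as follows, where $n$ is the number of documents (7219 here), $\mathrm{df}(t)$ is the number of documents containing $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in $d$:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
w_{d,t} = \mathrm{tf}(t,d)\,\mathrm{idf}(t), \qquad
\hat{w}_{d,t} = \frac{w_{d,t}}{\sqrt{\sum_{t'} w_{d,t'}^{2}}}
$$

The 51257 stored elements spread over 7219 rows mean that, on average, only about 7.1 of the 200 retained features are non-zero per tweet.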

## Preparing the training and test subsets

In [10]:
# Split with train_test_split, holding out 25% for testing

from sklearn.model_selection import train_test_split
docs_train, docs_test, categs_train, categs_test = train_test_split(docs_tfidf, categs, test_size = 0.25, 
                                                                    random_state = 50)
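
In the cells above the vectorizer is fitted on the whole corpus before the split, so the vocabulary and idf statistics also see the test tweets. A common alternative, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and classifier in a pipeline that is fitted on the training part only. The same helper can be called with several `max_features` values to explore the question from step 4. The helper name `holdout_accuracy` and the toy corpus are invented for illustration and are not taken from tweets.txt.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def holdout_accuracy(docs, labels, max_features, seed=50):
    """Split first, then fit TF-IDF and Naive Bayes on the training part only."""
    d_train, d_test, y_train, y_test = train_test_split(
        docs, labels, test_size=0.25, random_state=seed
    )
    model = make_pipeline(
        TfidfVectorizer(max_features=max_features),
        MultinomialNB(),
    )
    model.fit(d_train, y_train)
    return model.score(d_test, y_test)


# Toy corpus, invented for illustration only.
toy_docs = ["me encanta", "qué buen día", "gracias por todo", "muy feliz hoy",
            "qué mal día", "odio esto", "muy triste hoy", "todo va mal"]
toy_labels = ["positivo"] * 4 + ["negativo"] * 4

# Compare several vocabulary sizes; None keeps every term.
for n_features in (5, 50, None):
    print(n_features, holdout_accuracy(toy_docs, toy_labels, n_features))
```

With the real data, `docs` and `categs` would be passed in place of the toy lists. The accuracies printed for the toy corpus are not meaningful in themselves.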

## Naive Bayes classifier

In [11]:
# Train the NB classifier

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

clf.fit(docs_train, categs_train)

MultinomialNB()

## Support Vector Machine classifier

In [12]:
# Train the SVM classifier

from sklearn.svm import SVC
svm = SVC(kernel='linear')
# svm = SVC(kernel='poly')
# svm = SVC(kernel='rbf')
# svm = SVC(kernel='sigmoid')


svm.fit(docs_train, categs_train)

SVC(kernel='linear')

## Evaluating the Naive Bayes model

In [13]:
# Predict on the test set

categs_pred = clf.predict(docs_test)

In [14]:
# Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(categs_test, categs_pred)
cm

array([[ 91,  82,   0, 176],
       [ 29, 340,   0, 175],
       [  4,  92,   0,  72],
       [ 52, 153,   0, 539]])

In [15]:
acc_train = clf.score(docs_train, categs_train)
acc_test = clf.score(docs_test, categs_test)

print("Training accuracy: ", acc_train)
print("Test accuracy: ", acc_test)
print("Reliability (test/train): ", acc_test / acc_train)  

Training accuracy:  0.5615072035463613
Test accuracy:  0.5373961218836565
Reliability (test/train):  0.9570600670651698
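
The test accuracy can be recovered directly from the confusion matrix, which also exposes how each class fares. The sketch below recomputes it with plain Python. The label order is an assumption: scikit-learn sorts the labels, and "None" sorts before the lowercase names.

```python
# Naive Bayes confusion matrix reported above
# (rows: true class, columns: predicted class).
cm = [[91, 82, 0, 176],
      [29, 340, 0, 175],
      [4, 92, 0, 72],
      [52, 153, 0, 539]]
# Assumed label order (sorted, uppercase before lowercase).
labels = ["None", "negativo", "neutro", "positivo"]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
print(f"{correct}/{total} = {correct / total:.4f}")
# 970/1805 = 0.5374

# Recall per class: correctly classified items divided by the row total.
recalls = {name: cm[i][i] / sum(cm[i]) for i, name in enumerate(labels)}
for name, recall in recalls.items():
    print(f"{name:9s} recall = {recall:.3f}")
```

Under that label order, the third column is entirely zero: the model never predicts neutro, so its recall for that class is 0.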


## Evaluating the SVM model

In [16]:
# Predict on the test set

categs_pred = svm.predict(docs_test)

In [17]:
# Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(categs_test, categs_pred)
cm

array([[142,  73,   0, 134],
       [ 54, 345,   1, 144],
       [ 11,  99,   0,  58],
       [ 98, 168,   0, 478]])

In [18]:
acc_train = svm.score(docs_train, categs_train)
acc_test = svm.score(docs_test, categs_test)

print("Training accuracy: ", acc_train)
print("Test accuracy: ", acc_test)
print("Reliability (test/train): ", acc_test / acc_train)  

Training accuracy:  0.583671961581086
Test accuracy:  0.5346260387811634
Reliability (test/train):  0.9159700550510187
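
To put both accuracies in context, it helps to compare them with a trivial baseline that always predicts the most frequent class. The sketch below uses the row sums of the confusion matrices above as test-set class counts, again assuming the sorted label order None, negativo, neutro, positivo. Strictly speaking, the majority class should be chosen from the training set; positivo is also the most frequent label in the full corpus.

```python
# Test-set class counts: row sums of the confusion matrices above
# (label order is an assumption).
support = {"None": 349, "negativo": 544, "neutro": 168, "positivo": 744}
total = sum(support.values())

# Accuracy obtained by always predicting the most frequent class.
majority_label = max(support, key=support.get)
baseline = support[majority_label] / total

# Test accuracies reported in the notebook output.
reported = {"Naive Bayes": 0.5373961218836565,
            "SVM (linear)": 0.5346260387811634}

print(f"baseline ({majority_label}): {baseline:.3f}")
for name, acc in reported.items():
    print(f"{name}: {acc:.3f} (+{acc - baseline:.3f} over baseline)")
```

Both models beat the baseline by roughly 12 percentage points, and they differ from each other by only 5 correctly classified tweets out of 1805 (970 versus 965).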
