## [75.06 / 95.58] Organización de Datos
## Trabajo Práctico 2: Competencia de Machine Learning
### Grupo 18: DATAVID-20

* 102732 - Bilbao, Manuel
* 101933 - Karagoz, Filyan
* 98684 - Markarian, Darío
* 100901 - Stroia, Lautaro

### Importación general de librerias y set-up de datos.

In [1]:
import pandas as pd
import numpy as np
import os
import re

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn import svm
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

#!pip3 install tensorflow_text
import tensorflow as tf
import tensorflow_text
import tensorflow_hub as hub 


import gensim
from gensim.parsing.preprocessing import remove_stopwords

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_rows = None #mostrar todas las filas del df
%matplotlib inline
pd.options.display.float_format = '{:20,.2f}'.format # suprimimos la notacion cientifica en los outputs

### Set-up y limpieza de datos.

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

#Eliminar numeros de un texto
def eliminar_numeros(text):
    return re.sub("\d+", "",text)

#Eliminar puntuacion
def eliminar_puntuacion(text):
    return re.sub(r'[^\w\s]','',text)

#Pasar letras a minusculas
def minusculas(text):
    return text.lower()

#Eliminar caracteres especiales
def eliminar_caracteres(text):
    return re.sub('[^a-zA-Z0-9 \n\.]', '',text)

#Eliminar urls
def eliminar_url(text):
    url_reg = re.compile(r'https?://\S+|www\.\S+')
    return url_reg.sub(r'',text)

In [3]:
for data in [test,train]:
    data['text'] = data['text'].apply(lambda x: eliminar_puntuacion(x))
    data['text'] = data['text'].apply(lambda x: minusculas(x))
    data['text'] = data['text'].apply(lambda x: eliminar_numeros(x))
    data['text'] = data['text'].apply(lambda x: eliminar_caracteres(x))
    data['text'] = data['text'].apply(lambda x: remove_stopwords(x))
    data['text'] = data['text'].apply(lambda x: eliminar_url(x))    

## 1. SVM - Classification (hiperparametros default)

Este algoritmo usa "kernels", los que transforman los datos de entrada en un formato especifico: toman el input de baja dimension y lo transforma en datos de una dimension mayor. Esto ayuda a mejorar el accuracy del clasificador.

### 1.1. Tokenizacion y Split de datos

In [4]:
#Universal Sentence Encoder
encoder = hub.load('https://tfhub.dev/google/universal-sentence-encoder-large/5')

#hago reshape de los vectores encodeados y les doy formato unidimensional con el reshape [-1]
train_tokens = [tf.reshape(encoder([line]), [-1]).numpy() for line in train.text]
test_tokens = [tf.reshape(encoder([line]), [-1]).numpy() for line in test.text]

#Divido el set de train en 80% training y 20% test
X_train, X_test, y_train, y_test = train_test_split(train_tokens, train.target, test_size = 0.2,
                                                   random_state = 139)

### 1.2. Generar, entrenar y evaluar modelo

In [5]:
#Generamos un clasificador SVC con un kernel lineal
classifier = svm.SVC(kernel='linear', probability=True)

#Entrenamos
classifier.fit(X_train, y_train)

#Predecimos
preds = classifier.predict(X_test)
print('Accuracy score', accuracy_score(y_test, preds))
print('Precision score', sklearn.metrics.precision_score(y_test,preds))
print('Recall score', sklearn.metrics.recall_score(y_test, preds))
print('f1_score', f1_score(y_test, preds))

Accuracy score 0.8168089297439265
Precision score 0.8422876949740035
Recall score 0.7210682492581603
f1_score 0.7769784172661871


### 1.3. Prediccion y submit de kaggle

In [6]:
test_pred = classifier.predict(test_tokens) #0.805 KAGGLE
submit = pd.DataFrame(test['id'])
submit['target'] = test_pred
#submit.to_csv('SUBMITS/submission-svm.csv',index=False)

## 2. SVM - Classification (con RandomizedSearchCV)

Con RandomizedSearchCV buscamos los mejores valores para los hiperparametros del SVM

### 2.1. Tokenizacion y Split de datos

In [5]:
#Universal Sentence Encoder
encoder2 = hub.load('https://tfhub.dev/google/universal-sentence-encoder-large/5')

#hago reshape de los vectores encodeados y les doy formato unidimensional con el reshape [-1]
train_tokens2 = [tf.reshape(encoder2([line]), [-1]).numpy() for line in train.text]
test_tokens2 = [tf.reshape(encoder2([line]), [-1]).numpy() for line in test.text]

#Divido el set de train en 80% training y 20% test
X_train2, X_test2, y_train2, y_test2 = train_test_split(train_tokens2, train.target, test_size = 0.2,
                                                   random_state = 1)

### 2.2. Generar, entrenar y evaluar modelo

In [13]:
#Generamos un clasificador SVC
classifier2 = svm.SVC()

#Hiperparametros
parameter_space= {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['linear','sigmoid','poly','rbf']}  

#RandomizedSearch para hiperparametros
random_params = RandomizedSearchCV(classifier2, parameter_space, refit = True)

#entrenamiento
model = random_params.fit(X_train2, y_train2)

#Vemos cuales son los hiperparametros que mejores resultaados dan
model.best_params_


In [9]:
#Predecimos
preds = model.predict(X_test2)
print('Accuracy score', accuracy_score(y_test2, preds))
print('Precision score', sklearn.metrics.precision_score(y_test2,preds))
print('Recall score', sklearn.metrics.recall_score(y_test2, preds))
print('f1_score', f1_score(y_test2, preds))

Accuracy score 0.8194353250164149
Precision score 0.857421875
Recall score 0.6848673946957878
f1_score 0.761491760624458


### 2.3. Submit Kaggle

In [12]:
test_pred2 = model.predict(test_tokens2) #0.8109 en kaggle
submit = pd.DataFrame(test['id'])
submit['target'] = test_pred2
#submit.to_csv('SUBMITS/submission-svm-gridsearch.csv',index=False)

## 3. SVM con un poquito de feature engineering 

### 3.1. Tokenizacion y split de datos

In [4]:
train_features = train.copy()
test_features = test.copy()

test_features['keyword'] = test_features['keyword'].fillna('unknown').apply(lambda x: re.sub(r'%20',' ', str(x)))
train_features['keyword'] = train_features['keyword'].fillna('unknown').apply(lambda x: re.sub(r'%20',' ', str(x)))

for data in [test_features, train_features]:
    data['tweet_len'] = data['text'].str.len()
    data['qty_strings'] = data['text'].apply(lambda x: len(str(x).split()))
    data['len_gt_mean'] = (data['tweet_len'] > data['tweet_len'].mean()).astype(int)
    data['has_keyword_notempty'] = (data['keyword'] != 'unknown').astype(int)
    data['qty_keywords'] = data['keyword'].apply(lambda x: len(str(x).split()))
    data['qty_urls'] = data['text'].apply(lambda x: x.count('http'))
    data['has_location_notempty'] = (data['location'] != 'unknown').astype(int)
    

In [5]:
#Universal Sentence Encoder
encoder3 = hub.load('https://tfhub.dev/google/universal-sentence-encoder-large/5')

#hago reshape de los vectores encodeados y les doy formato unidimensional con el reshape [-1]
train_tokens3 = [tf.reshape(encoder3([line]), [-1]).numpy() for line in train_features.text]
test_tokens3 = [tf.reshape(encoder3([line]), [-1]).numpy() for line in test_features.text]

train_tokens3 = pd.DataFrame(train_tokens3)
test_tokens3 = pd.DataFrame(test_tokens3)

train_tokens3['tweet_len'] = train_features['tweet_len']
train_tokens3['qty_strings'] = train_features['qty_strings']
train_tokens3['len_gt_mean'] = train_features['len_gt_mean']
train_tokens3['has_keyword_notempty'] = train_features['has_keyword_notempty']
train_tokens3['qty_keywords'] = train_features['qty_keywords']
train_tokens3['qty_urls'] = train_features['qty_urls']
train_tokens3['has_location_notempty'] = train_features['has_location_notempty']

test_tokens3['tweet_len'] = test_features['tweet_len']
test_tokens3['qty_strings'] = test_features['qty_strings']
test_tokens3['len_gt_mean'] = test_features['len_gt_mean']
test_tokens3['has_keyword_notempty'] = test_features['has_keyword_notempty']
test_tokens3['qty_keywords'] = test_features['qty_keywords']
test_tokens3['qty_urls'] = test_features['qty_urls']
test_tokens3['has_location_notempty'] = test_features['has_location_notempty']
        

#Divido el set de train en 80% training y 20% test
X_train3, X_test3, y_train3, y_test3 = train_test_split(train_tokens3, train_features.target, test_size = 0.2,
                                                   random_state = 1)

### 3.2. Generar, entrenar y evaluar modelo

In [23]:
#Generamos un clasificador SVC 
classifier3 = svm.SVC(C=0.1, gamma=0.0001,kernel='linear')

#Hiperparametros
#params = {'C': [0.1, 1, 10, 100, 1000],  
#              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
#              'kernel': ['linear','sigmoid','rbf']}  
#Entrenamos
classifier3.fit(X_train3, y_train3)

SVC(C=10, gamma=0.0001, kernel='linear')

In [24]:
#Predecimos
preds3 = classifier3.predict(X_test3)
print('Accuracy score', accuracy_score(y_test3, preds3))
print('Precision score', sklearn.metrics.precision_score(y_test3,preds3))
print('Recall score', sklearn.metrics.recall_score(y_test3, preds3))
print('f1_score', f1_score(y_test3, preds3))

Accuracy score 0.8082731451083388
Precision score 0.8003442340791739
Recall score 0.7254290171606864
f1_score 0.7610474631751228


### 3.3. Submit de Kaggle

In [27]:
test_pred3 = classifier3.predict(test_tokens3) #0.80324 en kaggle
submit = pd.DataFrame(test['id'])
submit['target'] = test_pred3
#submit.to_csv('SUBMITS/submission-svm-features.csv',index=False)

#DIO MENOR SCORE EN KAGGLE QUE USANDO SOLO COLUMNA DE TEXT

## 4. SVM + Keyword + Tuneo de hiperparametros

**Features a usar:**keywords.
**Tunear hiperparametros con RandomizedSearchCV**

### 4.1. Tokenizacion y split de datos

In [17]:
train4 = train.copy()
test4 = test.copy()
test4['keyword'] = test4['keyword'].fillna('unknown').apply(lambda x: re.sub(r'%20',' ', str(x)))
train4['keyword'] = train4['keyword'].fillna('unknown').apply(lambda x: re.sub(r'%20',' ', str(x)))

train4['combined_text'] = train4['text']+';'+train4['keyword']
test4['combined_text'] = test4['text']+';'+test4['keyword']

#Universal Sentence Encoder
encoder4 = hub.load('https://tfhub.dev/google/universal-sentence-encoder-large/5')

#hago reshape de los vectores encodeados y les doy formato unidimensional con el reshape [-1]
train_tokens4 = [tf.reshape(encoder4([line]), [-1]).numpy() for line in train4.combined_text]
test_tokens4 = [tf.reshape(encoder4([line]), [-1]).numpy() for line in test4.combined_text]

In [18]:
train_tokens4 = pd.DataFrame(train_tokens4)
test_tokens4 = pd.DataFrame(test_tokens4)

train_tokens4['tweet_len'] = train4['text'].str.len()
test_tokens4['tweet_len'] = test4['text'].str.len()
#Divido el set de train en 80% training y 20% test
X_train4, X_test4, y_train4, y_test4 = train_test_split(train_tokens4, train4.target, test_size = 0.2,
                                                   random_state = 1)

In [None]:
#Generamos un clasificador SVC
classifier4 = svm.SVC()

#Hiperparametros
parameter_space= {'C': [0.1, 1, 10, 100, 1000],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['linear','sigmoid','poly','rbf']}  

#RandomizedSearch para hiperparametros
random_params = RandomizedSearchCV(classifier4, parameter_space, refit = True, verbose = 3)

#Estuvo mas de 8 horas corriendo y faltando 5 fits tiraba warning de memoria ram no disponible. Viendo
#los mensajes de cada score, el mejor fue "kernel=linear, gamma=0.01, C=1, score=0.818"
#Asi que entreno con ese score a mano.
#entrenamiento
model4 = random_params.fit(X_train4, y_train4)

#Vemos cuales son los hiperparametros que mejores resultaados dan
model4.best_params_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] kernel=sigmoid, gamma=1, C=1 ....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ........ kernel=sigmoid, gamma=1, C=1, score=0.568, total=  15.0s
[CV] kernel=sigmoid, gamma=1, C=1 ....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   15.0s remaining:    0.0s


[CV] ........ kernel=sigmoid, gamma=1, C=1, score=0.568, total=  14.5s
[CV] kernel=sigmoid, gamma=1, C=1 ....................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   29.6s remaining:    0.0s


[CV] ........ kernel=sigmoid, gamma=1, C=1, score=0.568, total=  14.6s
[CV] kernel=sigmoid, gamma=1, C=1 ....................................
[CV] ........ kernel=sigmoid, gamma=1, C=1, score=0.567, total=  14.0s
[CV] kernel=sigmoid, gamma=1, C=1 ....................................
[CV] ........ kernel=sigmoid, gamma=1, C=1, score=0.568, total=  15.3s
[CV] kernel=rbf, gamma=0.0001, C=0.1 .................................
[CV] ..... kernel=rbf, gamma=0.0001, C=0.1, score=0.607, total=  15.2s
[CV] kernel=rbf, gamma=0.0001, C=0.1 .................................
[CV] ..... kernel=rbf, gamma=0.0001, C=0.1, score=0.598, total=  15.3s
[CV] kernel=rbf, gamma=0.0001, C=0.1 .................................
[CV] ..... kernel=rbf, gamma=0.0001, C=0.1, score=0.605, total=  15.3s
[CV] kernel=rbf, gamma=0.0001, C=0.1 .................................
[CV] ..... kernel=rbf, gamma=0.0001, C=0.1, score=0.589, total=  15.1s
[CV] kernel=rbf, gamma=0.0001, C=0.1 .................................
[CV] .

In [16]:
#Predecimos
preds4 = model4.predict(X_test4)
print('Accuracy score', accuracy_score(y_test4, preds4))
print('Precision score', sklearn.metrics.precision_score(y_test4,preds4))
print('Recall score', sklearn.metrics.recall_score(y_test4, preds4))
print('f1_score', f1_score(y_test4, preds4))

Accuracy score 0.8023637557452397
Precision score 0.8079710144927537
Recall score 0.6957878315132605
f1_score 0.7476948868398995


In [None]:
Accuracy score 0.8023637557452397
Precision score 0.8079710144927537
Recall score 0.6957878315132605
f1_score 0.7476948868398995