## [75.06 / 95.58] Organización de Datos
## Practical Assignment 2: Machine Learning Competition
### Group 18: DATAVID-20

* 102732 - Bilbao, Manuel
* 101933 - Karagoz, Filyan
* 98684 - Markarian, Darío
* 100901 - Stroia, Lautar

### General library imports and data set-up.

In [1]:
import pandas as pd
import numpy as np
import os
import re

# Install tensorflow
#!pip3 install tensorflow
import tensorflow
from tensorflow import keras 
from tensorflow.keras.preprocessing.text import Tokenizer

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim.parsing.preprocessing import remove_stopwords

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_rows = None # show all rows of the df
%matplotlib inline
pd.options.display.float_format = '{:20,.2f}'.format # suppress scientific notation in the outputs

### Data set-up.

In [13]:
# Load the train and test files.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Replace missing values with the placeholder string 'empty'
train = train.fillna('empty')
test = test.fillna('empty')

## 1. Logistic Regression without extra features.

We select the features used to train the logistic regression model. In this first evaluation, we use only the 'text' column as a feature.

In [46]:
X = train['text'] # features
y = train['target'] # target variable

### 1.1. Data split.

In [35]:
# Keep 75% of the set for training and the remaining 25% for validation
X_train,X_valid,y_train,y_valid = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(X_valid.shape)

(5709,)
(1904,)
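
As a quick sanity check on the shapes printed above, the split sizes can be reproduced by hand. This is only a sketch: it assumes the training set has 7613 rows (the sum of the two printed sizes) and that a fractional `test_size` is rounded up, with the remainder going to training, which is our understanding of how `train_test_split` behaves.

```python
import math

n_samples = 5709 + 1904   # total rows, inferred from the shapes printed above
test_size = 0.25

# Assumption: the validation share is rounded up and the rest goes to training.
n_valid = math.ceil(test_size * n_samples)
n_train = n_samples - n_valid

print(n_train, n_valid)
```

Under these assumptions the two values agree with the printed shapes.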


### 1.2. Model development and prediction.

In [39]:
# Tokenize the text into a bag-of-words count matrix.
# Note: the vocabulary is fitted on the train and test text together.
vector = CountVectorizer()
vector.fit(pd.concat([train.text,test.text]))
X_train_vec = vector.transform(X_train)
X_valid_vec = vector.transform(X_valid)

# Instantiate the classifier with default parameters
logistic_reg = LogisticRegression()

# Train
logistic_reg.fit(X_train_vec, y_train)

# Predict on the validation set. predict() already returns 0/1 class labels,
# so the threshold below does not change them (it only casts them to booleans).
y_pred = logistic_reg.predict(X_valid_vec)
y_pred = (y_pred > 0.5)

print("Accuracy:",accuracy_score(y_valid, y_pred))
print("Precision:", sklearn.metrics.precision_score(y_valid,y_pred))
print("Recall:", sklearn.metrics.recall_score(y_valid,y_pred))
print("F1:",f1_score(y_valid, y_pred))


Accuracy: 0.8109243697478992
Precision: 0.8150208623087621
Recall: 0.7207872078720787
F1: 0.7650130548302871
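
The F1 score reported above is the harmonic mean of precision and recall, so it can be re-derived from the other two printed values. A small sketch that uses only the numbers from this output:

```python
precision = 0.8150208623087621  # printed above
recall = 0.7207872078720787     # printed above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f1)
```

The result agrees with the printed F1 up to floating-point rounding, which confirms that the four metrics describe the same set of predictions.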


### 1.3. Kaggle submission.

In [45]:
# Predictions on the TEST set and submission -> Kaggle score 0.79252
test_vec = vector.transform(test.text)
y_pred = logistic_reg.predict(test_vec)

submit = pd.DataFrame(test['id'])
submit['target'] = y_pred
#submit.to_csv('SUBMITS/submission-logreg.csv',index=False)

## 2. Logistic Regression with TfidfVectorizer

In [51]:
vector2 = TfidfVectorizer()
vector2.fit(pd.concat([train.text,test.text]))
X_train_vec2 = vector2.transform(X_train)
X_valid_vec2 = vector2.transform(X_valid)

logistic_reg2 = LogisticRegression()
logistic_reg2.fit(X_train_vec2,y_train)

y_pred2 = logistic_reg2.predict(X_valid_vec2)
#y_pred2 = (y_pred2 > 0.5 )

print("Accuracy:",accuracy_score(y_valid, y_pred2))
print("Precision:", sklearn.metrics.precision_score(y_valid,y_pred2))
print("Recall:", sklearn.metrics.recall_score(y_valid,y_pred2))
print("F1:",f1_score(y_valid, y_pred2))

Accuracy: 0.8061974789915967
Precision: 0.8523809523809524
Recall: 0.6605166051660517
F1: 0.7442827442827442


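With TfidfVectorizer the results are worse overall: precision rises (0.852 vs. 0.815), but recall drops (0.661 vs. 0.721), and both accuracy (0.806 vs. 0.811) and F1 (0.744 vs. 0.765) end up lower than with CountVectorizer.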
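
To see what changes between the two vectorizers, here is a simplified pure-Python sketch on a made-up three-sentence corpus (not the competition data). It assumes the smoothed IDF formula and L2 normalization that `TfidfVectorizer` applies by default; the whitespace tokenizer and the helper names `count_vector` and `tfidf_vector` are illustrative only and are not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "fire in the forest",
    "forest fire near the city",
    "i love the city",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents in which each token appears.
df = Counter(tok for doc in tokenized for tok in set(doc))

def count_vector(doc):
    """Raw term counts, as in a bag-of-words representation."""
    return dict(Counter(doc))

def tfidf_vector(doc):
    """Term counts scaled by smoothed IDF, then L2-normalized."""
    term_counts = Counter(doc)
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, c in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

counts = count_vector(tokenized[0])
tfidf = tfidf_vector(tokenized[0])
print(counts)
print(tfidf)
```

In the count representation "the" and "in" weigh the same, while TF-IDF gives "the" a lower weight because it appears in every document. Whether that reweighting helps depends on the data; on this validation split it did not.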

## 3. Logistic Regression using more features

### 3.1. Using 'text' and 'keyword' as features.

In [3]:
# Data cleaning

# Remove digits from a text
def eliminar_numeros(text):
    return re.sub(r"\d+", "",text)

# Remove punctuation
def eliminar_puntuacion(text):
    return re.sub(r'[^\w\s]','',text)

# Convert letters to lowercase
def minusculas(text):
    return text.lower()

# Remove special characters (anything other than ASCII letters, digits, spaces, newlines and periods)
def eliminar_caracteres(text):
    return re.sub(r'[^a-zA-Z0-9 \n\.]', '',text)

# Remove URLs
def eliminar_url(text):
    url_reg = re.compile(r'https?://\S+|www\.\S+')
    return url_reg.sub(r'',text)

In [14]:
for data in [test,train]:
    data['text'] = data['text'].apply(eliminar_puntuacion)
    data['text'] = data['text'].apply(minusculas)
    data['text'] = data['text'].apply(eliminar_numeros)
    data['text'] = data['text'].apply(eliminar_caracteres)
    data['text'] = data['text'].apply(remove_stopwords)
    # Note: by this point the punctuation is already gone, so the URL pattern
    # ('://', 'www.') can no longer match; moving this step first may change
    # the scores reported below
    data['text'] = data['text'].apply(eliminar_url)
    # Keywords encode spaces as '%20'; restore them
    data['keyword'] = data['keyword'].apply(lambda x: re.sub(r'%20',' ', str(x)))
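
In the cleaning loop above, `eliminar_url` runs after `eliminar_puntuacion`, so the `://` and `www.` that its pattern looks for have already been stripped. The sketch below copies the two helpers and applies them in both orders to a made-up tweet (the sample text and the variable names `punct_first` and `url_first` are hypothetical) to show the difference.

```python
import re

def eliminar_puntuacion(text):
    return re.sub(r'[^\w\s]', '', text)

def eliminar_url(text):
    url_reg = re.compile(r'https?://\S+|www\.\S+')
    return url_reg.sub('', text)

# Invented example, not taken from the dataset.
sample = "Forest fire near La Ronge http://t.co/abc123"

# Order used in the notebook: punctuation first, then URLs.
punct_first = eliminar_url(eliminar_puntuacion(sample))

# Alternative order: URLs first, then punctuation.
url_first = eliminar_puntuacion(eliminar_url(sample))

print(repr(punct_first))
print(repr(url_first))
```

With the notebook's order the URL survives as a single glued token, which then becomes noise in the vocabulary. Applying `eliminar_url` first removes it, although doing so would mean re-running the model, and the scores reported in this section would no longer apply.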

### 3.1.1. Data split

In [30]:
X3 = train['text'] + ' ' + train['keyword'] # features
y3 = train.target # target variable

# Keep 75% of the set for training and the remaining 25% for validation
X_train3,X_valid3,y_train3,y_valid3 = train_test_split(X3, y3, test_size=0.25, random_state=42)
print(X_train3.shape)
print(X_valid3.shape)


#data = pd.concat([train.text,train.keyword,test.text,test.keyword])


(5709,)
(1904,)


### 3.1.2. Model development and prediction


In [32]:
# Tokenize the text into a bag-of-words count matrix.
# Note: the vocabulary is fitted on the train and test text and keywords together.
vector3 = CountVectorizer()
vector3.fit(pd.concat([train.text,test.text,train.keyword,test.keyword]))
X_train_vec3 = vector3.transform(X_train3)
X_valid_vec3 = vector3.transform(X_valid3)

# Instantiate the classifier with default parameters
logistic_reg3 = LogisticRegression()

# Train
logistic_reg3.fit(X_train_vec3, y_train3)

# Predict on the validation set. predict() already returns 0/1 class labels,
# so the threshold below does not change them (it only casts them to booleans).
y_pred3 = logistic_reg3.predict(X_valid_vec3)
y_pred3 = (y_pred3 > 0.5)

print("Accuracy:",accuracy_score(y_valid3, y_pred3))
print("Precision:", sklearn.metrics.precision_score(y_valid3,y_pred3))
print("Recall:", sklearn.metrics.recall_score(y_valid3,y_pred3))
print("F1:",f1_score(y_valid3, y_pred3))


Accuracy: 0.7956932773109243
Precision: 0.7826666666666666
Recall: 0.7220172201722017
F1: 0.7511196417146513


### 3.1.3. Kaggle submission

In [35]:
# Predictions on the TEST set and submission -> Kaggle score 0.78915
test_vec = vector3.transform(test['text']+' '+test['keyword'])
y_pred3 = logistic_reg3.predict(test_vec)

submit = pd.DataFrame(test['id'])
submit['target'] = y_pred3
#submit.to_csv('SUBMITS/submission-logreg-keywords.csv',index=False)

This Kaggle score (0.78915) is lower than the one obtained with the plain logistic regression (0.79252), which used only 'text' as a feature and no data cleaning.