<img src="http://www.exalumnos.usm.cl/wp-content/uploads/2015/06/Isotipo-Negro.gif" title="Title text" width="20%" height="20%" />


<hr style="height:2px;border:none"/>
<h1 align='center'> Tarea 3 - Ensamblados y modelos avanzados</h1>

<H3 align='center'> <i>Felipe Olavarria, Rol:201673606-9</i> </H3>
<H3 align='center'> <i>Jean Aravena, Rol:201673573-9</i> </H3>
<hr style="height:2px;border:none"/>

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import re, time
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer, word_tokenize
from  sklearn.metrics import f1_score

import warnings
warnings.filterwarnings('ignore')

## 2. Detección de acoso en *Twitter*
---
En las redes sociales muchas veces se encuentra con un cierto comportamiento indeseable para los usuarios, tal como racismo, misógeno, grupos de odio o *trolls*. El poder detectar de manera automática ciertos patrones en el comportamiento para tomar una acción debe ser crucial para reducir el tiempo y esfuerzo humano. En esta actividad se trabajará sobre *tweets* la red social de *twitter* para detectar comportamiento *online* de acoso (*harassment*), que por lo general, incluye *flaming* como lenguaje abusivo o insultos, *doxing* como mostrar la información personal de una mujer, por ejemplo el domicilio o número de teléfono, la suplantación o la vergüenza pública por destruir la reputación de las personas.

<img src="https://kidshelpline.com.au/sites/default/files/bdl_image/header-T-OH.png" title="Title text" width="45%"  />

En algunos problemas como este, el comportamiento a detectar puede ser asociado a una anomalía (*outlier*) del comportamiento normal de los usuarios en las redes sociales. Esto es una de las causas de la dificultad del problema, puesto que es **altamente desbalanceado**, donde aproximadamente un 10% de los *tweets* corresponden a acoso (*harassment*).

Los datos trabajados corresponderan a *tweets* etiquetados como *harassment* (con valor 1) o no (con valor 0) -- la tarea a detectar--. Además si desea utilizar, se incluye la información del tipo de *harassment* en el conjunto de entrenamiento como atributos extras. El conjunto de pruebas solo contiene los *tweets* a ser etiquetados.

In [2]:
#Train df
df_train = pd.read_csv("Train_data.csv")

df_train_text = df_train.tweet_content

labels_train_indirect = df_train.IndirectH
labels_train_physical = df_train.PhysicalH
labels_train_sexual = df_train.SexualH

labels_train = df_train.harassment
#Test df
df_test = pd.read_csv("Test_input.csv")
df_test_text = df_test.tweet_content
df_train.head(5)

Unnamed: 0,id,tweet_content,harassment,IndirectH,PhysicalH,SexualH
0,9565,also released this video of photos voyager too...,0,0,0,0
1,6794,Yeah sexting older games until x89 teach doug...,0,0,0,0
2,4337,ava There s likely hundreds of stories like t...,0,0,0,0
3,6621,Wonder if there is significance to having Ava ...,0,0,0,0
4,3289,i m a slut for guacamole an avocadhoe if you will,0,0,0,0


In [3]:
print(df_train[df_train['IndirectH'] == 1].shape[0])
print(df_train[df_train['PhysicalH'] == 1].shape[0])
print(df_train[df_train['SexualH'] == 1].shape[0])

126
112
311


In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5703 entries, 0 to 5702
Data columns (total 6 columns):
id               5703 non-null int64
tweet_content    5703 non-null object
harassment       5703 non-null int64
IndirectH        5703 non-null int64
PhysicalH        5703 non-null int64
SexualH          5703 non-null int64
dtypes: int64(5), object(1)
memory usage: 267.4+ KB


***Tenemos 5703 datos diferentes en el dataset, cada uno con 6 atributos. Se muestra el resumen del dataframe, en el cual no existen registros nulos para ninguna de las 6 columnas.
Además se informa el tipo de dato que presenta cada una de los atributos y la memoria utilizada.***

In [5]:
from sklearn.model_selection import train_test_split

c = df_train.shape[0]
d = int(c * 0.2)

df_train_text,df_val_text,labels_train_indirect,labels_val_indirect,labels_train_physical,labels_val_physical,labels_train_sexual,labels_val_sexual,labels_train,labels_val  = train_test_split(df_train_text, labels_train_indirect,labels_train_physical,labels_train_sexual,labels_train, test_size = d, random_state=0)


***El tamaño de entrenamiento correspondera al 80% de los datos.***

> Se realiza un pre-procesamiento a los textos para normalizar un poco su estructura, donde se pasa el texto a minúsculas (lower-casing), se reducen las mútliples letras, se eliminan palabras sin significados como artículos, pronombres y preposiciones (stop word removal), además de pasar las palabras a su tronco léxico con la técnica de lemmatizer

In [6]:
def base_word(word):
    wordlemmatizer = WordNetLemmatizer()
    return wordlemmatizer.lemmatize(word) 
def word_extractor(text):
    commonwords = stopwords.words('english')
    text = re.sub(r'([a-z])\1+', r'\1\1',text) #substitute multiple letter by two
    words = ""
    wordtokens = [ base_word(word.lower()) for word in word_tokenize(text) ]
    for word in wordtokens:
        if word not in commonwords: #delete stopwords
            words+=" "+word
    return words

texts_train = [word_extractor(text) for text in df_train_text]

texts_val = [word_extractor(text) for text in df_val_text]

texts_test = [word_extractor(text) for text in df_test_text]

> Se construye una representación vectorial a los textos de entrada para poder ser manejados y clasificados por los modelos de aprendizaje. 

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(binary=False, ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, norm='l2', use_idf=True, sublinear_tf=False)
vectorizer.fit(texts_train)


features_train = vectorizer.transform(texts_train)
features_val = vectorizer.transform(texts_val)
features_test = vectorizer.transform(texts_test)


vocab = vectorizer.get_feature_names()

***La manera de trabajar el problema sera haciendo modelos especializados en clasificar cada tipos de acoso, para luego juntar estas predicciones en una gran clase de acoso. Por lo que se entrenaran a parte cada modelo y se búscaran hiper-párametros que maximizen el F-score. Esto se repitira para una serie de modelos estudiados durante el curso.***

# Regresión logística

In [8]:
from sklearn.linear_model import LogisticRegression

def do_logit(x_,y_indirect,y_physical,y_sexual,features_val):
    Cs = [10**i for i in range(-4,7)]
    Ss = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
    f_max=-1
    for s in Ss:
        for i in Cs:
            model_1 = LogisticRegression(multi_class='auto',penalty='l2',max_iter=1000,solver = s)
            model_1.set_params(C=i)
            model_1.fit(x_,y_indirect)
            pred_i = model_1.predict(features_val)
            f_i = f1_score(labels_val_indirect, pred_i, average='binary')
            if f_i > f_max:
                print(f_i, 'indirect',s,i)
                f_max = f_i
                pred_1 = pred_i
    print(' ')
    f_max=-1
    for s in Ss:
        for i in Cs:
            model_2 = LogisticRegression(multi_class='auto',penalty='l2',max_iter=1000,solver = s)
            model_2.set_params(C=i)
            model_2.fit(x_,y_physical)
            pred_i = model_2.predict(features_val)
            f_i = f1_score(labels_val_physical, pred_i, average='binary')
            if f_i > f_max:
                print(f_i, 'physical',s,i)
                f_max = f_i
                pred_2 = pred_i
    print(' ')
    f_max=-1
    for s in Ss:
        for i in Cs:
            model_3 = LogisticRegression(multi_class='auto',penalty='l2',max_iter=1000,solver = s)
            model_3.set_params(C=i)
            model_3.fit(x_,y_sexual)
            pred_i = model_3.predict(features_val)
            f_i = f1_score(labels_val_sexual, pred_i, average='binary')
            if f_i > f_max:
                print(f_i, 'sexual',s,i)
                f_max = f_i
                pred_3 = pred_i

    return pred_1,pred_2,pred_3
    
y_pred_logit_1,y_pred_logit_2,y_pred_logit_3 = do_logit(features_train,labels_train_indirect,labels_train_physical,labels_train_sexual,features_val)

y_pred_logit = y_pred_logit_1 | y_pred_logit_2 | y_pred_logit_3 
f1_score(labels_val, y_pred_logit, average='binary')

0.0 indirect newton-cg 0.0001
0.13333333333333333 indirect newton-cg 10
0.16666666666666666 indirect lbfgs 1000000
 
0.0 physical newton-cg 0.0001
0.12121212121212122 physical newton-cg 10
0.2702702702702703 physical newton-cg 100
0.3157894736842105 physical newton-cg 1000000
 
0.0 sexual newton-cg 0.0001
0.14492753623188404 sexual newton-cg 1
0.4742268041237113 sexual newton-cg 10
0.4807692307692307 sexual newton-cg 100000
0.4952380952380952 sexual newton-cg 1000000
0.5 sexual liblinear 100000
0.505050505050505 sexual saga 1000


0.4277456647398844

# Super vector machine

In [9]:
from sklearn.svm import SVC as SVM #SVC is for classification
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def do_svm(x_,y_indirect,y_physical,y_sexual,features_val):
    Cs = [10**i for i in range(-4,7)]
    Ks = ['linear', 'poly', 'rbf', 'sigmoid']
    f_max=-1
    for k in Ks:
        for i in Cs:
            model_1 = SVM(kernel=k,gamma='scale')
            model_1.set_params(C=i)
            model_1.fit(x_,y_indirect)
            pred_i = model_1.predict(features_val)
            f_i = f1_score(labels_val_indirect, pred_i, average='binary')
            if f_i > f_max:
                f_max = f_i
                pred_1 = pred_i
                print(f_i, 'indirect',k,i)
    print(' ')
    f_max=-1
    for k in Ks:
        for i in Cs:
            model_2 = SVM(kernel=k,gamma='scale')
            model_2.set_params(C=i)
            model_2.fit(x_,y_physical)
            pred_i = model_2.predict(features_val)
            f_i = f1_score(labels_val_physical, pred_i, average='binary')
            if f_i > f_max:
                f_max = f_i
                pred_2 = pred_i
                print(f_i, 'physical',k,i)
    print(' ')
    f_max=-1
    for k in Ks:
        for i in Cs:
            model_3 = SVM(kernel=k,gamma='scale')
            model_3.set_params(C=i)
            model_3.fit(x_,y_sexual)
            pred_i = model_3.predict(features_val)
            f_i = f1_score(labels_val_sexual, pred_i, average='binary')
            if f_i > f_max:
                f_max = f_i
                pred_3 = pred_i
                print(f_i, 'sexual',k,i)
    
    return pred_1,pred_2,pred_3
    
y_pred_svm_1,y_pred_svm_2,y_pred_svm_3 = do_svm(features_train,labels_train_indirect,labels_train_physical,labels_train_sexual,features_val)

y_pred_svm = y_pred_svm_1 | y_pred_svm_2 | y_pred_svm_3

f1_score(labels_val, y_pred_svm, average='binary')

0.0 indirect linear 0.0001
0.21052631578947367 indirect linear 10
 
0.0 physical linear 0.0001
0.125 physical linear 1
0.3157894736842105 physical linear 10
 
0.0 sexual linear 0.0001
0.3777777777777777 sexual linear 1
0.47368421052631576 sexual linear 10


0.4526315789473685

# Árbol regularizado

In [15]:
from sklearn.tree import DecisionTreeClassifier as Tree
def do_Tree(x_,y_indirect,y_physical,y_sexual,features_val):
    Ds = np.arange(1, 25, 1)
    Ss = np.arange(2, 40, 1)
    f_max=-1
    for d in Ds:
        for s in Ss:
            model_1= Tree()
            model_1.set_params(max_depth=d, min_samples_split=s) 
            model_1.fit(x_,y_indirect)
            pred_i = model_1.predict(features_val)
            f_i = f1_score(labels_val_indirect, pred_i, average='binary')
            if f_i > f_max:
                f_max = f_i
                pred_1 = pred_i
                print(f_i, 'indirect',d,s)     
    print(' ')           
    f_max=-1
    for d in Ds:
        for s in Ss:
            model_2= Tree()
            model_2.set_params(max_depth=d, min_samples_split=s) 
            model_2.fit(x_,y_physical)
            pred_i = model_2.predict(features_val)
            f_i = f1_score(labels_val_physical, pred_i, average='binary')
            if f_i > f_max:
                f_max = f_i
                pred_2 = pred_i
                print(f_i, 'physical',d,s)  
    print(' ')
    f_max=-1
    for d in Ds:
        for s in Ss:
            model_3= Tree()
            model_3.set_params(max_depth=d, min_samples_split=s) 
            model_3.fit(x_,y_indirect)
            pred_i = model_3.predict(features_val)
            f_i = f1_score(labels_val_sexual, pred_i, average='binary')
            if f_i > f_max:
                f_max = f_i
                pred_3 = pred_i
                print(f_i, 'sexual',d,s)  
            
    return pred_1,pred_2,pred_3

y_pred_tree_1,y_pred_tree_2,y_pred_tree_3 = do_Tree(features_train,labels_train_indirect,labels_train_physical,labels_train_sexual,features_val)

y_pred_tree  = y_pred_tree_1 | y_pred_tree_2 | y_pred_tree_3 

f1_score(labels_val, y_pred_tree, average='binary')

0.0 indirect 1 2
0.18750000000000003 indirect 5 2
0.2222222222222222 indirect 9 6
0.2702702702702703 indirect 15 11
 
0.12903225806451613 physical 1 2
0.16666666666666666 physical 3 3
0.21052631578947364 physical 8 8
0.24390243902439027 physical 14 12
 
0.0 sexual 1 2
0.03125 sexual 7 2
0.031746031746031744 sexual 7 3
0.03278688524590164 sexual 7 13
0.061538461538461535 sexual 8 32
0.0625 sexual 11 35


0.18840579710144928

# Random forest

In [11]:
from sklearn.ensemble import RandomForestClassifier

def do_rf(x_,y_indirect,y_physical,y_sexual,features_val):
    Cs = [2**i for i in range(8)]
    f_max=-1
    for i in Cs:
        model_1 = RandomForestClassifier(n_estimators=i,max_depth=12,min_samples_split=10,random_state=0,n_jobs=-1)
        model_1.fit(x_,y_indirect)
        pred_i = model_1.predict(features_val)
        f_i = f1_score(labels_val_indirect, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_1 = pred_i
            print(f_i, 'indirect',i)
    print(' ')    
    f_max=-1
    for i in Cs:
        model_2 = RandomForestClassifier(n_estimators=i,max_depth=7,min_samples_split=40,random_state=0,n_jobs=-1)
        model_2.fit(x_,y_physical)
        pred_i = model_2.predict(features_val)
        f_i = f1_score(labels_val_physical, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_2 = pred_i
            print(f_i, 'physical',i)
    print(' ')
    f_max=-1
    for i in Cs:
        model_3 = RandomForestClassifier(n_estimators=i,max_depth=13,min_samples_split=25,random_state=0,n_jobs=-1)
        model_3.fit(x_,y_sexual)
        pred_i = model_3.predict(features_val)
        f_i = f1_score(labels_val_sexual, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_3 = pred_i
            print(f_i, 'sexual',i)
    
    return pred_1,pred_2,pred_3
    
y_pred_rf_1,y_pred_rf_2,y_pred_rf_3 = do_rf(features_train,labels_train_indirect,labels_train_physical,labels_train_sexual,features_val)

y_pred_rf = y_pred_rf_1 | y_pred_rf_2 | y_pred_rf_3

f1_score(labels_val, y_pred_rf, average='binary')

0.0 indirect 1
 
0.06451612903225806 physical 1
 
0.08571428571428572 sexual 1


0.10144927536231883

# Gradient boost

In [13]:
from sklearn.ensemble import GradientBoostingClassifier

def do_gb(x_,y_indirect,y_physical,y_sexual,features_val):
    Cs = [2**i for i in range(8)]
    f_max=-1
    for i in Cs:
        model_1 = GradientBoostingClassifier(n_estimators=i, learning_rate=1.0,max_depth=12,min_samples_split=10, random_state=0)
        model_1.fit(x_,y_indirect)
        pred_i = model_1.predict(features_val)
        f_i = f1_score(labels_val_indirect, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_1 = pred_i
            print(f_i, 'indirect',i)
    print(' ')
    f_max=-1
    for i in Cs:
        model_2 = GradientBoostingClassifier(n_estimators=i, learning_rate=1.0,max_depth=7,min_samples_split=40, random_state=0)
        model_2.fit(x_,y_physical)
        pred_i = model_2.predict(features_val)
        f_i = f1_score(labels_val_physical, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_2 = pred_i
            print(f_i, 'physical',i)
    print(' ')
    f_max=-1
    for i in Cs:
        model_3 = GradientBoostingClassifier(n_estimators=i, learning_rate=1.0,max_depth=13,min_samples_split=25, random_state=0)
        model_3.fit(x_,y_sexual)
        pred_i = model_3.predict(features_val)
        f_i = f1_score(labels_val_sexual, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_3 = pred_i
            print(f_i, 'sexual',i)
    
    return pred_1,pred_2,pred_3
    
y_pred_gb_1,y_pred_gb_2,y_pred_gb_3 = do_gb(features_train,labels_train_indirect,labels_train_physical,labels_train_sexual,features_val)

y_pred_gb = y_pred_gb_1 | y_pred_gb_2 | y_pred_gb_3

f1_score(labels_val, y_pred_gb, average='binary')

0.19047619047619047 indirect 1
 
0.16216216216216214 physical 1
0.21052631578947364 physical 2
 
0.5542168674698795 sexual 1


0.5560165975103736

# AdaBoost

In [14]:
from sklearn.ensemble import AdaBoostClassifier

def do_ab(x_,y_indirect,y_physical,y_sexual,features_val,labels_val):
    Cs = [2**i for i in range(8)]
    f_max=-1
    for i in Cs:
        model_1 = AdaBoostClassifier(base_estimator=Tree(max_depth=12,min_samples_split=10),n_estimators=i,learning_rate=1,random_state=0)
        model_1.fit(x_,y_indirect)
        pred_i = model_1.predict(features_val)
        f_i = f1_score(labels_val_indirect, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_1 = pred_i
            print(f_i, 'indirect',i)
    print(' ')
    f_max=-1
    for i in Cs:
        model_2 = AdaBoostClassifier(base_estimator=Tree(max_depth=7,min_samples_split=40),n_estimators=i,learning_rate=1,random_state=0)
        model_2.fit(x_,y_physical)
        pred_i = model_2.predict(features_val)
        f_i = f1_score(labels_val_physical, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_2 = pred_i
            print(f_i, 'physical',i)
    print(' ')
    f_max=-1
    for i in Cs:
        model_3 = AdaBoostClassifier(base_estimator=Tree(max_depth=13,min_samples_split=25),n_estimators=i,learning_rate=1,random_state=0)
        model_3.fit(x_,y_sexual)
        pred_i = model_3.predict(features_val)
        f_i = f1_score(labels_val_sexual, pred_i, average='binary')
        if f_i > f_max:
            f_max = f_i
            pred_3 = pred_i
            print(f_i, 'sexual',i)
    
    return pred_1,pred_2,pred_3
    
y_pred_ab_1,y_pred_ab_2,y_pred_ab_3 = do_ab(features_train,labels_train_indirect,labels_train_physical,labels_train_sexual,features_val,labels_val)


y_pred_ab = y_pred_ab_1 | y_pred_ab_2 | y_pred_ab_3

f1_score(labels_val, y_pred_ab, average='binary')
#0.3855421686746988

0.17142857142857143 indirect 1
 
0.16666666666666666 physical 1
 
0.43478260869565216 sexual 1


0.3726708074534162

# Intercalar y combinar modelos

### Rendimiento:

#### Regresión logística:
0.167 indirect lbfgs c = 1000000
 
0.316 physical newton-cg c = 1000000
 
0.505 sexual saga c = 1000

overall: 0.4277456647398844

#### Super vector machine:

0.211 indirect linear c = 10
 
0.316 physical linear c = 10
 
0.474 sexual linear c = 10

overall: 0.4526315789473685

#### Árbol regularizado

0.270 indirect d = 15 s = 11
 
0.244 physical d = 14 s = 12
 
0.062 sexual d = 11 s = 35

overall: 0.18840579710144928

#### Random forest 

0.0 indirect n = 1
 
0.064 physical n = 1
 
0.086 sexual n = 1

overall: 0.10144927536231883

#### Gradient boost

0.190 indirect n = 1
 
0.211 physical n = 2
 
0.554 sexual n = 1

overall: 0.5560165975103736

#### Adaboost

0.171 indirect n = 1
 
0.167 physical n = 1
 
0.435 sexual n =1

overall: 0.3726708074534162

***Clase 1: TREE > SVM > GB > AB > RL > RF***

***Clase 2: RL = SVM > TREE > GB > AB > RF***

***Clase 3: GB > RL > SVM > AB > RF > TREE***

### Calcular predcciones en test:

In [71]:
# rl
model_1 = LogisticRegression(multi_class='auto',penalty='l2',max_iter=1000,C=1000000,solver = 'lbfgs')
model_1.fit(features_train,labels_train_indirect)
y_test_logit_1 = model_1.predict(features_test)

model_2 = LogisticRegression(multi_class='auto',penalty='l2',max_iter=1000,C=1000000,solver = 'newton-cg')
model_2.fit(features_train,labels_train_physical)
y_test_logit_2 = model_2.predict(features_test)

model_3 = LogisticRegression(multi_class='auto',penalty='l2',max_iter=1000,C=1000,solver = 'saga')
model_3.fit(features_train,labels_train_sexual)
y_test_logit_3 = model_3.predict(features_test)


In [72]:
# svm
model_1 = SVM(kernel='linear',gamma='scale',C=10)
model_1.fit(features_train,labels_train_indirect)
y_test_svm_1 = model_1.predict(features_test)

model_2 = SVM(kernel='linear',gamma='scale',C=10)
model_2.fit(features_train,labels_train_physical)
y_test_svm_2 = model_2.predict(features_test)

model_3 = SVM(kernel='linear',gamma='scale',C=10)
model_3.fit(features_train,labels_train_sexual)
y_test_svm_3 = model_3.predict(features_test)


In [73]:
# tree
model_1 = Tree(max_depth=15, min_samples_split=11)
model_1.fit(features_train,labels_train_indirect)
y_test_tree_1 = model_1.predict(features_test)

model_2 = Tree(max_depth=14, min_samples_split=12)
model_2.fit(features_train,labels_train_physical)
y_test_tree_2 = model_2.predict(features_test)

model_3 = Tree(max_depth=11, min_samples_split=35)
model_3.fit(features_train,labels_train_sexual)
y_test_tree_3 = model_3.predict(features_test)


In [74]:
# rf
model_1 = RandomForestClassifier(n_estimators=1,max_depth=15,min_samples_split=11,random_state=0,n_jobs=-1)
model_1.fit(features_train,labels_train_indirect)
y_test_rf_1 = model_1.predict(features_test)

model_2 = RandomForestClassifier(n_estimators=1,max_depth=14,min_samples_split=12,random_state=0,n_jobs=-1)
model_2.fit(features_train,labels_train_physical)
y_test_rf_2 = model_2.predict(features_test)

model_3 = RandomForestClassifier(n_estimators=1,max_depth=11,min_samples_split=35,random_state=0,n_jobs=-1)
model_3.fit(features_train,labels_train_sexual)
y_test_rf_3 = model_3.predict(features_test)


In [75]:
# gb
model_1 = GradientBoostingClassifier(n_estimators=1, learning_rate=1.0,max_depth=12,min_samples_split=10, random_state=0)
model_1.fit(features_train,labels_train_indirect)
y_test_gb_1 = model_1.predict(features_test)

model_2 = GradientBoostingClassifier(n_estimators=2, learning_rate=1.0,max_depth=7,min_samples_split=40, random_state=0)
model_2.fit(features_train,labels_train_physical)
y_test_gb_2 = model_2.predict(features_test)

model_3 = GradientBoostingClassifier(n_estimators=1, learning_rate=1.0,max_depth=13,min_samples_split=25, random_state=0)
model_3.fit(features_train,labels_train_sexual)
y_test_gb_3 = model_3.predict(features_test)


In [76]:
#ab
model_1 = AdaBoostClassifier(base_estimator=Tree(max_depth=12,min_samples_split=10),n_estimators=1,learning_rate=1,random_state=0)
model_1.fit(features_train,labels_train_indirect)
y_test_ab_1 = model_1.predict(features_test)

model_2 = AdaBoostClassifier(base_estimator=Tree(max_depth=7,min_samples_split=40),n_estimators=1,learning_rate=1,random_state=0)
model_2.fit(features_train,labels_train_physical)
y_test_ab_2 = model_2.predict(features_test)

model_3 = AdaBoostClassifier(base_estimator=Tree(max_depth=13,min_samples_split=25),n_estimators=1,learning_rate=1,random_state=0)
model_3.fit(features_train,labels_train_sexual)
y_test_ab_3 = model_3.predict(features_test)


## Mejores predicciones por clase (validación):

In [54]:
y_pred_best = y_pred_tree_1 | y_pred_logit_2 | y_pred_gb_3
f1_score(labels_val, y_pred_best, average='binary')

0.5726495726495726

## Votación mayoria por clase

In [47]:
#probamos 6, por simpleza se removera el con peor fscore por clase para evitar empates

clase_1 = list(zip(y_pred_logit_1,y_pred_svm_1,y_pred_tree_1,y_pred_gb_1,y_pred_ab_1))
clase_2 = list(zip(y_pred_logit_2,y_pred_svm_2,y_pred_tree_2,y_pred_gb_2,y_pred_ab_2))
clase_3 = list(zip(y_pred_logit_3,y_pred_svm_3,y_pred_rf_3,y_pred_gb_3,y_pred_ab_3))

def votation(clase):
    y_pred = list()
    for pred in clase:
        #print(pred,max(set(pred), key = pred.count))
        y_pred.append(max(set(pred), key = pred.count))
    return np.array(y_pred)

In [51]:
y_pred_mayority_1 = votation(clase_1)
y_pred_mayority_2 = votation(clase_2)
y_pred_mayority_3 = votation(clase_3)

y_pred_mayority = y_pred_mayority_1|y_pred_mayority_2|y_pred_mayority_3
f1_score(labels_val, y_pred_mayority, average='binary')

0.4550898203592814

## Votación ponderada

In [52]:
#sera un ranking donde el mejor score vale 5 votos y el peor vale 1 voto

weighted_1 = list(zip(y_pred_logit_1,y_pred_svm_1,y_pred_svm_1,y_pred_svm_1,y_pred_svm_1,y_pred_tree_1,y_pred_tree_1,y_pred_tree_1,y_pred_tree_1,y_pred_tree_1,y_pred_gb_1,y_pred_gb_1,y_pred_gb_1,y_pred_ab_1,y_pred_ab_1))
weighted_2 = list(zip(y_pred_logit_2,y_pred_logit_2,y_pred_logit_2,y_pred_logit_2,y_pred_logit_2,y_pred_svm_2,y_pred_svm_2,y_pred_svm_2,y_pred_svm_2,y_pred_svm_2,y_pred_tree_2,y_pred_tree_2,y_pred_tree_2,y_pred_tree_2,y_pred_gb_2,y_pred_gb_2,y_pred_gb_2,y_pred_ab_2,y_pred_ab_2))
weighted_3 = list(zip(y_pred_logit_3,y_pred_logit_3,y_pred_logit_3,y_pred_logit_3,y_pred_svm_3,y_pred_svm_3,y_pred_svm_3,y_pred_rf_3,y_pred_gb_3,y_pred_gb_3,y_pred_gb_3,y_pred_gb_3,y_pred_gb_3,y_pred_ab_3,y_pred_ab_3))

y_pred_weighted_1 = votation(weighted_1)
y_pred_weighted_2 = votation(weighted_2)
y_pred_weighted_3 = votation(weighted_3)

y_pred_weighted = y_pred_weighted_1|y_pred_weighted_2|y_pred_weighted_3
f1_score(labels_val, y_pred_weighted, average='binary')

0.5139664804469273

> El archivo de submission debe contener las predicciones de harassment (0 o 1) a cada dato de pruebas, además de la columna de id asociado al dato, iniciando en 1. Si leyó de manera ordenada el archivo de pruebas, se puede generar de la siguiente manera:

In [77]:
clase_1 = list(zip(y_test_logit_1,y_test_svm_1,y_test_tree_1,y_test_gb_1,y_test_ab_1))
clase_2 = list(zip(y_test_logit_2,y_test_svm_2,y_test_tree_2,y_test_gb_2,y_test_ab_2))
clase_3 = list(zip(y_test_logit_3,y_test_svm_3,y_test_rf_3,y_test_gb_3,y_test_ab_3))
y_test_mayority_1 = votation(clase_1)
y_test_mayority_2 = votation(clase_2)
y_test_mayority_3 = votation(clase_3)




weighted_1 = list(zip(y_test_logit_1,y_test_svm_1,y_test_svm_1,y_test_svm_1,y_test_svm_1,y_test_tree_1,y_test_tree_1,y_test_tree_1,y_test_tree_1,y_test_tree_1,y_test_gb_1,y_test_gb_1,y_test_gb_1,y_test_ab_1,y_test_ab_1))
weighted_2 = list(zip(y_test_logit_2,y_test_logit_2,y_test_logit_2,y_test_logit_2,y_test_logit_2,y_test_svm_2,y_test_svm_2,y_test_svm_2,y_test_svm_2,y_test_svm_2,y_test_tree_2,y_test_tree_2,y_test_tree_2,y_test_tree_2,y_test_gb_2,y_test_gb_2,y_test_gb_2,y_test_ab_2,y_test_ab_2))
weighted_3 = list(zip(y_test_logit_3,y_test_logit_3,y_test_logit_3,y_test_logit_3,y_test_svm_3,y_test_svm_3,y_test_svm_3,y_test_rf_3,y_test_gb_3,y_test_gb_3,y_test_gb_3,y_test_gb_3,y_test_gb_3,y_test_ab_3,y_test_ab_3))

y_test_weighted_1 = votation(weighted_1)
y_test_weighted_2 = votation(weighted_2)
y_test_weighted_3 = votation(weighted_3)



y_test_best = y_test_tree_1 | y_test_logit_2 | y_test_gb_3
y_test_mayority = y_test_mayority_1|y_test_mayority_2|y_test_mayority_3
y_test_weighted = y_test_weighted_1|y_test_weighted_2|y_test_weighted_3



df_aux = pd.DataFrame()
df_aux["id"] = np.arange(1, 1+y_test_best.shape[0])
df_aux["harassment"] = y_test_best.astype('int')
df_aux.to_csv("test_estimation_best.csv", index=False)

df_aux = pd.DataFrame()
df_aux["id"] = np.arange(1, 1+y_test_mayority.shape[0])
df_aux["harassment"] = y_test_mayority.astype('int')
df_aux.to_csv("test_estimation_mayority.csv", index=False)

df_aux = pd.DataFrame()
df_aux["id"] = np.arange(1, 1+y_test_weighted.shape[0])
df_aux["harassment"] = y_test_weighted.astype('int')
df_aux.to_csv("test_estimation_weighted.csv", index=False)