# Datamining Project: determine the gender of Reddit authors using their comments


### Dipartimento di Fisica, UniTo
#### Carola Maria Caivano, student ID: 867290

## Abstract
The goal of the project is to predict the gender of Reddit users from a dataset of about 290,000 posts written by 5000 different authors.
Several classification models were trained on the training set, using both the subreddits the authors posted in and the text of their posts. The final prediction is a weighted combination of the individual models' predictions, learned by a stacked ensemble.


### Contents
1. Importing the training data
2. Preprocessing
  *    Feature extraction from the subreddits
  *    Feature extraction from the posts
  *    Creation of the training, validation and ensemble sets
3. Importing the test data and extracting the features
4. Model selection
  * Models on the subreddits
  * Models on the posts
5. Ensemble Model

## 1. Importing the data

The dataset contains 289608 Reddit posts written by 5000 different authors across 3468 subreddits.

In [None]:
%pylab inline
import pandas as pd

In [None]:
train_data = pd.read_csv("../input/datamining2022/train_data.csv", encoding="utf8")

In [None]:
display(train_data)

In [None]:
target = pd.read_csv("../input/datamining2022/train_target.csv")

In [None]:
target['gender'].value_counts()  # count the number of males (0) and females (1)

# 2. Preprocessing

This section covers feature extraction for both the subreddits and the posts. The posts are represented with a bag-of-words model.

## 2.1 Subreddit Extraction

In [None]:
# collect the distinct subreddits
subreddits = train_data.subreddit.unique() 

# map each subreddit to an integer index
subreddits_map = pd.Series(index=subreddits, data=arange(subreddits.shape[0])) 

In [None]:
from scipy import sparse  # sparse matrix representations

In [None]:
# a group holds all the rows of one author; its subreddits are encoded as a multi-hot row vector
def extract_features(group):
    group_subreddits = group['subreddit']
    group_subreddits = group_subreddits[group_subreddits.isin(subreddits_map.index)].values
    idxs = subreddits_map.loc[group_subreddits].values
    v = sparse.dok_matrix((1, subreddits.shape[0]))
    for idx in idxs:
        if not np.isnan(idx):
            v[0, idx] = 1
    return v.tocsr()

extract_features(train_data[train_data.author=='RedThunder90'])
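
The encoding produced by `extract_features` can be illustrated on a toy vocabulary. The following is a minimal sketch in plain Python, not the notebook's sparse implementation; the helper `multi_hot` and the three-subreddit vocabulary are hypothetical names introduced only for this example.

```python
def multi_hot(author_subreddits, subreddit_index):
    """Return a 0/1 list with a 1 at the index of every known subreddit."""
    row = [0] * len(subreddit_index)
    for name in author_subreddits:
        idx = subreddit_index.get(name)  # None for subreddits not in the vocabulary
        if idx is not None:
            row[idx] = 1  # presence only: repeated posts do not increase the value
    return row

# hypothetical vocabulary of three subreddits
subreddit_index = {"AskReddit": 0, "gaming": 1, "knitting": 2}
row = multi_hot(["gaming", "AskReddit", "gaming", "unknown_sub"], subreddit_index)
print(row)  # [1, 1, 0]
```

As in the notebook, the feature records only whether an author posted in a subreddit, not how often, and subreddits outside the training vocabulary are ignored.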

In [None]:
features_dict = {}  # maps each author to the feature vector of the subreddits they posted in

for author, group in train_data.groupby('author'):
    features_dict[author] = extract_features(group)

In [None]:
X_train_subreddit = sparse.vstack([features_dict[author] for author in target.author])

In [None]:
y_train = target.gender

## 2.2. Text Extraction 

In [None]:
# all the text written by an author is concatenated into a single string
def extract_text(group):
    group_text = group['body'].values
    return ''.join(group_text)

In [None]:
text_dict = {}

for author, group in train_data.groupby('author'):
    text_dict[author] = extract_text(group)

In [None]:
author_text = [text_dict[author] for author in target.author]
authors_text=author_text

### Bag of words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [None]:
pattern = '(?u)\\b[A-Za-z]{3,}'
# {3,}: match only words with at least 3 characters
# [A-Za-z]: allowed characters are the letters A-Z and a-z

# words we do not want to count; passed as a list, which recent scikit-learn versions require
stop_words = list(ENGLISH_STOP_WORDS) + ['test']

vec = CountVectorizer(token_pattern=pattern, stop_words=stop_words, ngram_range=(1, 1))
C = vec.fit_transform(authors_text)  # one document per author: all of that author's posts
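
To see which tokens the pattern keeps, it can be tried directly with the standard `re` module. This is only a sketch of the tokenisation step: the sample sentence and the three-word stop list are made up for illustration, and the lowercasing mimics the default behaviour of `CountVectorizer`.

```python
import re

pattern = r'(?u)\b[A-Za-z]{3,}'       # same pattern as in the notebook
stop_words = {'the', 'and', 'test'}   # tiny illustrative stop list

text = "I saw 2 cats, it's an A-OK test!"
tokens = re.findall(pattern, text.lower())  # lowercase first, then tokenise
kept = [t for t in tokens if t not in stop_words]

print(tokens)  # ['saw', 'cats', 'test']
print(kept)    # ['saw', 'cats']
```

Digits, punctuation and words shorter than three letters are dropped before the stop list is even applied.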

In [None]:
C

In [None]:
# TF-IDF weighting
tfidf = TfidfTransformer()  # default settings: smoothed idf and L2-normalised rows

# compute the TF-IDF features and build the training set
X_train_text = tfidf.fit_transform(C)
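
The weighting applied by `TfidfTransformer` with its default settings can be reproduced by hand on a tiny example. The sketch below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row; the function `tfidf_rows` and the two toy documents are hypothetical and are not part of the notebook.

```python
import math

def tfidf_rows(counts):
    """counts: list of {term: count} dicts, one per document."""
    n_docs = len(counts)
    # document frequency: in how many documents each term appears
    df = {}
    for doc in counts:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in counts:
        raw = {t: c * idf[t] for t, c in doc.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0  # guard for empty documents
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

counts = [{"cat": 2, "dog": 1}, {"cat": 1}]
rows = tfidf_rows(counts)
print(rows)
```

"dog" occurs in only one of the two documents, so its weight relative to "cat" in the first row is larger than the raw count ratio of 0.5.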

## 2.3. Creation of the training, validation and ensemble sets
The dataset was split into three parts: a training set, used to train the individual models; a validation set, used to estimate $E_{out}$; and an ensemble set, used for the final model.
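
Two consecutive 80/20 splits leave 64% of the authors for training, 16% for the ensemble set and 20% for validation. The sketch below checks this arithmetic with a hand-rolled stand-in for `train_test_split`; the helper `split_indices` is hypothetical and its shuffling will not reproduce scikit-learn's.

```python
import random

def split_indices(items, test_fraction, seed):
    """Shuffle a copy of `items` and split off the last `test_fraction` of it."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

authors = list(range(100))                     # 100 hypothetical author ids
rest, vald = split_indices(authors, 0.2, seed=6)
train, ens = split_indices(rest, 0.2, seed=6)  # second split applied to the remaining 80

print(len(train), len(ens), len(vald))  # 64 16 20
```

Because the same seed and the same number of rows are used for the subreddit and the text features, both feature matrices are split into the same groups of authors in the notebook.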

### Subreddits data

In [None]:
rnd=6

In [None]:
from sklearn.model_selection import train_test_split
X_train_subreddit, X_vald_subreddit, y_train_subreddit, y_vald_subreddit=train_test_split(X_train_subreddit, y_train, test_size=0.2, random_state=rnd)

In [None]:
X_train_subreddit, X_ens_subreddit, y_train_subreddit, y_ens=train_test_split(X_train_subreddit, y_train_subreddit, test_size=0.2, random_state=rnd)

### Text data

In [None]:
from sklearn.model_selection import train_test_split
X_train_text, X_vald_text, y_train_text, y_vald_text=train_test_split(X_train_text, y_train, test_size=0.2, random_state=rnd)

In [None]:
X_train_text, X_ens_text, y_train_text, y_ens=train_test_split(X_train_text, y_train_text, test_size=0.2, random_state=rnd)

# 3. Importing the test data and extracting the features

The test data are imported and the features are extracted for both the posts and the subreddits, with the same methods used above for the training data.

In [None]:
test_data = pd.read_csv("../input/datamining2022/test_data.csv", encoding="utf8")
test_y=pd.read_csv("../input/datamining2022/sample.csv", encoding="utf8")

### 3.1. Subreddit Extraction

In [None]:
features_test_dict = {}

for author, group in test_data.groupby('author'):
    features_test_dict[author] = extract_features(group)

In [None]:
X_test_subreddit = sparse.vstack([features_test_dict[author] for author in test_data.author.unique()])
X_test_subreddit

### 3.2. Text Extraction

In [None]:
# getting index positions of bad data in order to adjust it
def get_index_positions(list_of_elems, element):
    '''Return the indexes of all occurrences of the given element
    in the list list_of_elems.'''
    return [i for i, elem in enumerate(list_of_elems) if elem == element]

In [None]:
# some authors may happen to have null body text, and this results as an error in the text_test_dict cell
author_test_array = test_data['author'].values
author_test_list = author_test_array.tolist()

In [None]:
index_post_list = get_index_positions(author_test_list,'SketchingShibe')
print('Indexes of all occurrences of {} in the list are : '.format('SketchingShibe'), index_post_list)

In [None]:
test_data = test_data.replace({np.nan: ','})

In [None]:
print(len(index_post_list))
test_data['body'].iloc[1063323]

In [None]:
text_test_dict = {}

for author, group in test_data.groupby('author'):
    text_test_dict[author] = extract_text(group)

In [None]:
author_test_text = [text_test_dict[author] for author in test_y.author]
authors_test_text=author_test_text 

In [None]:
C_test = vec.transform(authors_test_text)

In [None]:
X_test_text=tfidf.transform(C_test)

# 4. Model selection
Nine models were used in total:
   * MultinomialNB classifier trained on the subreddits
   * ComplementNB classifier trained on the subreddits
   * linear SVM trained on the subreddits
   * RBF SVM trained on the subreddits
   * Logistic Regression trained on the subreddits
   * MultinomialNB classifier trained on the posts
   * ComplementNB classifier trained on the posts
   * RBF SVM trained on the posts
   * Logistic Regression trained on the posts


### Libraries

In [None]:
from sklearn import svm, model_selection
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, learning_curve
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

# 4.1 Classification models on the subreddits

## 4.1.1. Multinomial Naive Bayes on the subreddits
The first model is Multinomial Naive Bayes. The model was first trained on the training set with the default settings; the smoothing hyperparameter alpha was then tuned by cross-validation. The model with the best alpha was finally used to predict the gender of the authors in the validation set, giving a first estimate of the score. The results are shown below.
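
The role of alpha is easiest to see on a single class. In additive (Laplace/Lidstone) smoothing the estimated probability of feature i is (N_i + alpha) / (N + alpha * n), where N_i is the count of feature i in the class, N the total count and n the number of features. The sketch below is plain Python with made-up counts, not the scikit-learn implementation.

```python
def smoothed_probs(counts, alpha):
    """Additive smoothing of per-feature counts for one class."""
    total = sum(counts)
    n_features = len(counts)
    return [(c + alpha) / (total + alpha * n_features) for c in counts]

counts = [3, 1, 0]  # hypothetical feature counts; the third feature was never seen

strong = smoothed_probs(counts, alpha=1.0)   # equals [4/7, 2/7, 1/7]
weak = smoothed_probs(counts, alpha=0.01)

print(strong)
print(weak)
```

A larger alpha pulls the estimates towards the uniform distribution, while a very small alpha leaves unseen features with a probability close to zero; this is the trade-off explored by the search over alpha.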

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
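mNB = MultinomialNB()  # alpha=1.0 by default
mNB.fit(X_train_subreddit, y_train_subreddit)
# ROC AUC needs scores rather than hard labels: use the probability of class 1
y_proba_train = mNB.predict_proba(X_train_subreddit)[:, 1]

print("Trained MultinomialNB Classifier")
# coef_ and intercept_ are no longer available on the naive Bayes classes in recent scikit-learn releases
print("Feature log-probabilities: %s ..." % (str(mNB.feature_log_prob_)[:70]))
print("   Class log-priors: %s" % (str(mNB.class_log_prior_)))
print('  \tROC-Score: ', round(roc_auc_score(y_train_subreddit, y_proba_train), 3))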

#### Best alpha for the MultinomialNB model

In [None]:
alphas=np.logspace(-4,0.5,20)
scores=[]

for alpha in alphas:
        mNB=MultinomialNB(alpha=alpha)
        cv=KFold(n_splits=10, shuffle=True, random_state=0)
        scores_model=cross_val_score(mNB,X_train_subreddit, y_train_subreddit, cv=cv)
        scores.append(np.mean(scores_model))
        

In [None]:
plt.figure(figsize=(8,6))
plt.semilogx(alphas, scores)
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')

print (np.max(scores))
print ('Best alpha:', alphas[np.argmax(scores)])

#### GridSearch mNB

In [None]:
hyprm_alphas = np.logspace(-10,5,30)

model=MultinomialNB()
param_grid = {'alpha': hyprm_alphas}
gs = model_selection.GridSearchCV(model, param_grid)
gs.fit(X_train_subreddit, y_train_subreddit)
print(gs.best_params_) 

In [None]:
mNB_subreddit = gs.best_estimator_

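#### Solution on the validation set

In [None]:
y_pred_vald = mNB_subreddit.predict(X_vald_subreddit)  # hard labels, used for the confusion matrix
y_proba_vald = mNB_subreddit.predict_proba(X_vald_subreddit)[:, 1]  # class-1 scores for the ROC AUC
print("mNB - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_subreddit,y_proba_vald),4))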
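
The ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so it depends on whether scores or thresholded labels are passed to it. The sketch below uses a pairwise pure-Python definition (not scikit-learn's implementation) and made-up scores to show the difference.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 1]
proba = [0.1, 0.6, 0.55, 0.7, 0.9]            # hypothetical class-1 probabilities
hard = [1 if p >= 0.5 else 0 for p in proba]  # thresholded labels

auc_proba = pairwise_auc(y_true, proba)  # 5 of 6 pairs ranked correctly
auc_hard = pairwise_auc(y_true, hard)    # ties between equal labels count as half
print(auc_proba, auc_hard)
```

Thresholding discards the ranking information among examples with the same label, so the two values generally differ.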

#### Confusion Matrix

In [None]:
conf_matrix=pd.DataFrame(confusion_matrix(y_vald_subreddit, y_pred_vald), index=['actual 0', 'actual 1'], columns=['pred 0', 'pred 1'])
display(conf_matrix)
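
The layout of the table above (rows are actual classes, columns are predicted classes) can be checked on a small example. This is a plain-Python sketch with made-up labels; `confusion_counts` is a hypothetical helper, not the scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix of counts: rows = actual class, columns = predicted class."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

m = confusion_counts(y_true, y_pred)
accuracy = (m[0][0] + m[1][1]) / len(y_true)  # correct predictions lie on the diagonal
print(m)         # [[2, 1], [1, 2]]
print(accuracy)
```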

### Learning Curves

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [None]:
title="Learning curve mNB subreddit"
plot_learning_curve(mNB_subreddit, title, X_train_subreddit, y_train_subreddit, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5))

### Solutions for mNB on the subreddits

In [None]:
y_pred= mNB_subreddit.predict_proba(X_test_subreddit)[:,1]

In [None]:
solution_mNB_subreddit = pd.DataFrame({"author":test_data.author.unique(), "gender":y_pred})
solution_mNB_subreddit.head()

In [None]:
solution_mNB_subreddit.to_csv("solution_mNB_subreddit.csv", index=False) 

## 4.1.2. Complement Naive Bayes

The second model is Complement Naive Bayes; here too the hyperparameter alpha was tuned.

In [None]:
cNB = ComplementNB() 

cNB.fit(X_train_subreddit, y_train_subreddit)
y_proba_train = cNB.predict_proba(X_train_subreddit)[:, 1]  # class-1 scores for the ROC AUC

print("Trained ComplementNB Classifier")
print('  \tROC-Score: ',round(roc_auc_score(y_train_subreddit,y_proba_train),3))

### GridSearch cNB

In [None]:
hyprm_alphas = np.logspace(-10,5,30)

model=ComplementNB()
param_grid = {'alpha': hyprm_alphas}
gs = model_selection.GridSearchCV(model, param_grid)
gs.fit(X_train_subreddit, y_train_subreddit)
print(gs.best_params_) 

In [None]:
cNB_subreddit=gs.best_estimator_

In [None]:
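# solution on validation set
y_pred_vald = cNB_subreddit.predict(X_vald_subreddit)  # hard labels, used for the confusion matrix
y_proba_vald = cNB_subreddit.predict_proba(X_vald_subreddit)[:, 1]  # class-1 scores for the ROC AUC
print("cNB - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_subreddit,y_proba_vald),4))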

### Confusion Matrix

In [None]:
conf_matrix=pd.DataFrame(confusion_matrix(y_vald_subreddit, y_pred_vald), index=['actual 0', 'actual 1'], columns=['pred 0', 'pred 1'])
display(conf_matrix)

### Learning Curves

In [None]:
title="Learning curve cNB subreddit"
plot_learning_curve(cNB_subreddit, title, X_train_subreddit, y_train_subreddit, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5))

### Solution on cNB subreddit features

In [None]:
y_pred= cNB_subreddit.predict_proba(X_test_subreddit)[:,1]

In [None]:
solution_cNB_subreddit = pd.DataFrame({"author":test_data.author.unique(), "gender":y_pred})
solution_cNB_subreddit.head()

In [None]:
solution_cNB_subreddit.to_csv("solution_cNB_subreddit.csv", index=False) 

## 4.1.3. Linear Support Vector Machine on the subreddits

The third model is an SVM with a linear kernel. Here too the hyperparameters were tuned. The parameter C controls the trade-off between a soft and a hard margin: small values tolerate more margin violations, large values penalise them more strongly. The parameter gamma only affects non-linear kernels such as RBF, where large values increase the complexity of the model, with the risk of overfitting, and small values may lead to underfitting; it is ignored by a linear kernel.

In [None]:
C=10.0
gamma=0.1  # has no effect with kernel='linear'
SVM = svm.SVC(kernel='linear', gamma=gamma, C=C,probability=True)
SVM.fit(X_train_subreddit, y_train_subreddit)
y_pred_train=SVM.predict(X_train_subreddit)
y_proba_train=SVM.predict_proba(X_train_subreddit)[:,1]  # class-1 scores for the ROC AUC
print('C={} \t gamma={}'.format(C, gamma))
print('accuracy_score',round(accuracy_score(y_train_subreddit,y_pred_train),3))
print('roc_auc_score',round(roc_auc_score(y_train_subreddit,y_proba_train),3))

### Grid Search linear SVM

In [None]:
param_C=np.linspace(5,10,10)

# gamma is ignored by a linear kernel, so only C is searched
param_grid={'C': param_C}
model=svm.SVC(kernel='linear', probability=True)
gs = model_selection.GridSearchCV(model, param_grid)
gs.fit(X_train_subreddit, y_train_subreddit)
print(gs.best_params_)

In [None]:
linear_SVM_subreddit=gs.best_estimator_

In [None]:
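# solution on validation set
y_pred_vald = linear_SVM_subreddit.predict(X_vald_subreddit)  # hard labels, used for the confusion matrix
y_proba_vald = linear_SVM_subreddit.predict_proba(X_vald_subreddit)[:, 1]  # class-1 scores for the ROC AUC
print("SVC - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_subreddit,y_proba_vald),4))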

### Confusion Matrix

In [None]:
conf_matrix=pd.DataFrame(confusion_matrix(y_vald_subreddit, y_pred_vald), index=['actual 0', 'actual 1'], columns=['pred 0', 'pred 1'])
display(conf_matrix)

### Learning curves

In [None]:
title="Learning curve linear SVM subreddits"
plot_learning_curve(linear_SVM_subreddit, title, X_train_subreddit, y_train_subreddit, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5))

### Solution for the linear SVM

In [None]:
y_pred= linear_SVM_subreddit.predict_proba(X_test_subreddit)[:,1]

In [None]:
solution_linear_SVM_subreddit = pd.DataFrame({"author":test_data.author.unique(), "gender":y_pred})
solution_linear_SVM_subreddit.head()

In [None]:
solution_linear_SVM_subreddit.to_csv("solution_linear_SVM_subreddit.csv", index=False)

## 4.1.4. RBF Support Vector Machine on the subreddits
The results of the SVM model with an RBF kernel are reported below.
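
In the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), gamma sets how quickly the similarity between two points decays with their distance. The sketch below evaluates the kernel in plain Python on two made-up binary vectors; `rbf_kernel` here is a hypothetical helper written for this example.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1, 0, 1], [0, 0, 1]  # two authors whose subreddit vectors differ in one entry

for gamma in (0.01, 0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x, y, gamma))
```

With a large gamma even close points look dissimilar, so each training point influences only a small neighbourhood and the decision boundary can become very irregular; with a small gamma the kernel is almost constant and the model becomes too smooth.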

In [None]:
C=10.0
gamma=0.1
SVM = svm.SVC(kernel='rbf', gamma=gamma, C=C,probability=True)
SVM.fit(X_train_subreddit, y_train_subreddit)
y_pred_train=SVM.predict(X_train_subreddit)
y_proba_train=SVM.predict_proba(X_train_subreddit)[:,1]  # class-1 scores for the ROC AUC
print('C={} \t gamma={}'.format(C, gamma))
print('accuracy_score',round(accuracy_score(y_train_subreddit,y_pred_train),3))
print('roc_auc_score',round(roc_auc_score(y_train_subreddit,y_proba_train),3))

### Search for the best SVM parameters

In [None]:
gammas=np.logspace(-5,-0.5,20)
scores=[]

for gamma in gammas:
    SVM=svm.SVC(kernel='rbf', gamma=gamma, C=C)
    cv=model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
    scores_model=model_selection.cross_val_score(SVM,X_train_subreddit,y_train_subreddit,cv=cv)
    scores.append(np.mean(scores_model))

In [None]:
print ('Best gamma:', gammas[np.argmax(scores)])
best_gamma=gammas[np.argmax(scores)]
print ('Best score:', scores[np.argmax(scores)])

plt.semilogx(gammas, scores)
plt.xlabel('gamma')
plt.ylabel('Score (accuracy)')

In [None]:
Cs=np.linspace(5,20,10)
scores=[]

for C in Cs:
    SVM=svm.SVC(kernel='rbf', gamma=best_gamma, C=C)
    cv=model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
    scores_model=model_selection.cross_val_score(SVM,X_train_subreddit,y_train_subreddit,cv=cv)
    scores.append(np.mean(scores_model))

In [None]:
print ('Best C:', Cs[np.argmax(scores)])
print ('Best score:', scores[np.argmax(scores)])

plt.semilogx(Cs, scores)
plt.xlabel('C')
plt.ylabel('Score (accuracy)')

### Grid Search SVM

In [None]:
param_C=np.linspace(1,5,10)
param_gamma=np.linspace(0.019,0.03,10)

param_grid={'C': param_C,
               'gamma': param_gamma
              }
model=svm.SVC(kernel='rbf', probability=True)
gs = model_selection.GridSearchCV(model, param_grid)
gs.fit(X_train_subreddit, y_train_subreddit)
print(gs.best_params_)

In [None]:
SVM_subreddit=gs.best_estimator_

In [None]:
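# solution on validation set
y_pred_vald = SVM_subreddit.predict(X_vald_subreddit)  # hard labels, used for the confusion matrix
y_proba_vald = SVM_subreddit.predict_proba(X_vald_subreddit)[:, 1]  # class-1 scores for the ROC AUC
print("SVC - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_subreddit,y_proba_vald),4))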

### Confusion Matrix

In [None]:
conf_matrix=pd.DataFrame(confusion_matrix(y_vald_subreddit, y_pred_vald), index=['actual 0', 'actual 1'], columns=['pred 0', 'pred 1'])
display(conf_matrix)

### Learning curves

In [None]:
title="Learning curve SVM subreddits"
plot_learning_curve(SVM_subreddit, title, X_train_subreddit, y_train_subreddit, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5))

### Solution for the RBF SVM on the subreddits

In [None]:
y_pred= SVM_subreddit.predict_proba(X_test_subreddit)[:,1]

In [None]:
solution_SVM_subreddit = pd.DataFrame({"author":test_data.author.unique(), "gender":y_pred})
solution_SVM_subreddit.head()

In [None]:
solution_SVM_subreddit.to_csv("solution_SVM_subreddit.csv", index=False)

## 4.1.5. Logistic Regression on the subreddits
The last model is logistic regression. Here the best value of the hyperparameter C, the inverse of the regularisation strength, was searched for. The results are shown below.
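
For reference, with the default L2 penalty the objective minimised by logistic regression can be written in the following form (a sketch based on the usual textbook formulation, with the labels recoded as $y_i \in \{-1, +1\}$; the symbols $w$, $b$ and $N$ are introduced here and do not appear elsewhere in the notebook):

$$\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{N}\log\left(1+e^{-y_i\,(w^\top x_i+b)}\right)$$

A small C gives more weight to the penalty term and therefore a simpler model; a large C gives more weight to fitting the training data.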

In [None]:
lr = LogisticRegression()

Cs = np.logspace(-1, 1, 20)
cv = model_selection.KFold(n_splits=10, shuffle=True,random_state=0)
gs_bt = model_selection.GridSearchCV(lr,param_grid={"C": Cs},cv=cv, n_jobs=7,scoring='roc_auc')
gs_bt.fit(X_train_subreddit, y_train_subreddit)
print ('Best parameters:', gs_bt.best_params_)
print ('Best score:', gs_bt.best_score_)

# take the estimator from this grid search (gs_bt), already refit on the whole training set;
# `gs` still refers to the SVM grid search above
lr_subreddit=gs_bt.best_estimator_

In [None]:
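y_pred_vald = lr_subreddit.predict(X_vald_subreddit)
y_proba_vald = lr_subreddit.predict_proba(X_vald_subreddit)[:, 1]  # class-1 scores for the ROC AUC
print("LR - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_subreddit,y_proba_vald),4))

## Solutions for Logistic Regression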

In [None]:
y_pred= lr_subreddit.predict_proba(X_test_subreddit)[:,1]
solution_lr_subreddit = pd.DataFrame({"author":test_data.author.unique(), "gender":y_pred})
solution_lr_subreddit.head()
solution_lr_subreddit.to_csv("solution_lr_subreddit.csv", index=False) 

# 4.2 Classification models on the posts
The same models used above are now trained on the features extracted from the posts.

## 4.2.1. Multinomial Naive Bayes on the posts

In [None]:
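mNB = MultinomialNB()  # alpha=1.0 by default
mNB.fit(X_train_text, y_train_text)
# ROC AUC needs scores rather than hard labels: use the probability of class 1
y_proba_train = mNB.predict_proba(X_train_text)[:, 1]

print("Trained MultinomialNB Classifier")
# coef_ and intercept_ are no longer available on the naive Bayes classes in recent scikit-learn releases
print("Feature log-probabilities: %s ..." % (str(mNB.feature_log_prob_)[:70]))
print("   Class log-priors: %s" % (str(mNB.class_log_prior_)))
print('  \tROC-Score: ', round(roc_auc_score(y_train_text, y_proba_train), 3))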

### Best alpha for the MultinomialNB model

In [None]:
alphas=np.logspace(-10,5,30)
scores=[]

for alpha in alphas:
        mNB=MultinomialNB(alpha=alpha)
        cv=KFold(n_splits=10, shuffle=True, random_state=0)
        scores_model=cross_val_score(mNB,X_train_text, y_train_text, cv=cv)
        scores.append(np.mean(scores_model))

In [None]:
plt.figure(figsize=(8,6))
plt.semilogx(alphas, scores)
plt.ylabel('CV score')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')

print (np.max(scores))
print ('Best alpha:', alphas[np.argmax(scores)])

### GridSearch mNB

In [None]:
param_alphas = np.logspace(-10,5,30)

model=MultinomialNB()
param_grid = {'alpha': param_alphas}
gs = model_selection.GridSearchCV(model, param_grid)
gs.fit(X_train_text, y_train_text)
print(gs.best_params_) 

In [None]:
mNB_text = gs.best_estimator_

In [None]:
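# solution on validation set
y_pred_vald = mNB_text.predict(X_vald_text)  # hard labels, used for the confusion matrix
y_proba_vald = mNB_text.predict_proba(X_vald_text)[:, 1]  # class-1 scores for the ROC AUC
print("mNB - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_text,y_proba_vald),4))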

### Confusion Matrix

In [None]:
conf_matrix=pd.DataFrame(confusion_matrix(y_vald_text, y_pred_vald), index=['actual 0', 'actual 1'], columns=['pred 0', 'pred 1'])
display(conf_matrix)

### Learning Curves

In [None]:
title="Learning curve mNB on text features"
plot_learning_curve(mNB_text, title, X_train_text, y_train_text, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5))

### Solutions for mNB on the posts

In [None]:
y_pred= mNB_text.predict_proba(X_test_text)[:,1]

In [None]:
# rows of X_test_text follow test_y.author, so that is the matching author column
solution_mNB_text = pd.DataFrame({"author":test_y.author, "gender":y_pred})
solution_mNB_text.head()

In [None]:
solution_mNB_text.to_csv("solution_mNB_posts.csv", index=False)

## 4.2.2. Complement Naive Bayes on the posts

In [None]:
cNB = ComplementNB() 
cNB.fit(X_train_text, y_train_text)
y_proba_train = cNB.predict_proba(X_train_text)[:, 1]  # class-1 scores for the ROC AUC

print('  \tROC-Score: ',round(roc_auc_score(y_train_text,y_proba_train),3))

### GridSearch cNB

In [None]:
param_alphas = np.logspace(-10,5,30)

model=ComplementNB()
param_grid = {'alpha': param_alphas}
gs = model_selection.GridSearchCV(model, param_grid)
gs.fit(X_train_text, y_train_text)
print(gs.best_params_) 

In [None]:
cNB_text = gs.best_estimator_

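### Solution on the validation set

In [None]:
y_pred_vald = cNB_text.predict(X_vald_text)  # hard labels, used for the confusion matrix
y_proba_vald = cNB_text.predict_proba(X_vald_text)[:, 1]  # class-1 scores for the ROC AUC
print("cNB - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_text,y_proba_vald),4))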

### Confusion Matrix

In [None]:
conf_matrix=pd.DataFrame(confusion_matrix(y_vald_text, y_pred_vald), index=['actual 0', 'actual 1'], columns=['pred 0', 'pred 1'])
display(conf_matrix)

### Learning curves

In [None]:
title="Learning curve cNB on text features"
plot_learning_curve(cNB_text, title, X_train_text, y_train_text, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5))

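### Solution of the cNB model

In [None]:
y_pred= cNB_text.predict_proba(X_test_text)[:,1]

In [None]:
# rows of X_test_text follow test_y.author, so that is the matching author column
solution_cNB_text = pd.DataFrame({"author":test_y.author, "gender":y_pred})
solution_cNB_text.head()

In [None]:
# save the cNB predictions (the original cell saved the mNB solution again)
solution_cNB_text.to_csv("solution_cNB_posts.csv", index=False)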

## 4.2.3. Support Vector Machine on the posts

In [None]:
C=10.0
gamma=0.1
SVM = svm.SVC(kernel='rbf', gamma=gamma, C=C,probability=True)
SVM.fit(X_train_text, y_train_text)
y_pred_train=SVM.predict(X_train_text)
y_proba_train=SVM.predict_proba(X_train_text)[:,1]  # class-1 scores for the ROC AUC
print('C={} \t gamma={}'.format(C, gamma))
print('accuracy_score',round(accuracy_score(y_train_text,y_pred_train),3))
print('roc_auc_score',round(roc_auc_score(y_train_text,y_proba_train),3))

### Best gamma for the SVM model

In [None]:
gammas=np.logspace(-5,-0.5,5)
scores=[]

for gamma in gammas:
    SVM=svm.SVC(kernel='rbf', gamma=gamma, C=C)
    cv=model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
    scores_model=model_selection.cross_val_score(SVM,X_train_text,y_train_text,cv=cv)
    scores.append(np.mean(scores_model))

In [None]:
print ('Best gamma:', gammas[np.argmax(scores)])
best_gamma=gammas[np.argmax(scores)]
print ('Best score:', scores[np.argmax(scores)])

plt.semilogx(gammas, scores)
plt.xlabel('gamma')
plt.ylabel('Score (accuracy)')

### Grid Search SVM

In [None]:
param_C=np.linspace(5,10,10)
param_gamma=np.linspace(0.019,0.03,10)

param_grid={'C': param_C,
               'gamma': param_gamma
              }
model=svm.SVC(kernel='rbf', probability=True)
gs = model_selection.GridSearchCV(model, param_grid)
gs.fit(X_train_text, y_train_text)
print(gs.best_params_)

In [None]:
SVM_text=gs.best_estimator_

In [None]:
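# solution on validation set: use the model and the validation data of the posts
y_pred_vald = SVM_text.predict(X_vald_text)
y_proba_vald = SVM_text.predict_proba(X_vald_text)[:, 1]  # class-1 scores for the ROC AUC
print("SVC - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_text,y_proba_vald),4))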

### Solution for the SVM on the posts

In [None]:
y_pred= SVM_text.predict_proba(X_test_text)[:,1]

In [None]:
# rows of X_test_text follow test_y.author, so that is the matching author column
solution_SVM_text = pd.DataFrame({"author":test_y.author, "gender":y_pred})
solution_SVM_text.head()

In [None]:
solution_SVM_text.to_csv("solution_SVM_text.csv", index=False)

## 4.2.4. Logistic Regression on the posts

In [None]:
lr = LogisticRegression()

Cs = np.logspace(-1, 1, 20)
cv = model_selection.KFold(n_splits=10, shuffle=True,random_state=0)
gs_bt = model_selection.GridSearchCV(lr,param_grid={"C": Cs},cv=cv, n_jobs=7,scoring='roc_auc')
gs_bt.fit(X_train_text, y_train_text)
print ('Best parameters:', gs_bt.best_params_)
print ('Best score:', gs_bt.best_score_)

# take the estimator from this grid search (gs_bt), already refit on the whole training set;
# `gs` still refers to the SVM grid search above
lr_text=gs_bt.best_estimator_

In [None]:
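y_pred_vald = lr_text.predict(X_vald_text)
y_proba_vald = lr_text.predict_proba(X_vald_text)[:, 1]  # class-1 scores for the ROC AUC
print("LR - Estimate of E_out")
print('ROC-Score: ',round(roc_auc_score(y_vald_text,y_proba_vald),4))

In [None]:
y_pred= lr_text.predict_proba(X_test_text)[:,1]
# rows of X_test_text follow test_y.author, so that is the matching author column
solution_lr_text = pd.DataFrame({"author":test_y.author, "gender":y_pred})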
solution_lr_text.head()
solution_lr_text.to_csv("solution_lr_text.csv", index=False) 

# 5. Ensemble Model
The last step towards the final model is ensemble learning: the best set of weights $w_k$ is sought such that the final hypothesis is $g_f= \sum_{k=1}^n w_k h_k$, where $h_k$ are the predictions on the ensemble set of the models described above. Ensemble learning is carried out below separately for the models trained on the subreddits and for the models trained on the posts.
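
The weighted combination can be illustrated for a single author. The sketch below assumes a logistic meta-classifier, which passes the weighted sum through a sigmoid so that the result stays between 0 and 1; the base predictions, weights and intercept are made-up numbers, and `combine` is a hypothetical helper rather than part of `mlxtend`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combine(h, w, b=0.0):
    """Final score for one author: sigmoid(b + sum_k w_k * h_k)."""
    return sigmoid(b + sum(wk * hk for wk, hk in zip(w, h)))

h = [0.9, 0.8, 0.4]   # hypothetical P(gender = 1) from three base models
w = [1.5, 1.0, 0.5]   # hypothetical weights learned by the meta-classifier
b = -1.5              # hypothetical intercept

g = combine(h, w, b)
print(g)
```

Base models with larger weights move the final score more; in practice the weights are fitted by the meta-classifier rather than chosen by hand.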

In [None]:
from mlxtend.classifier import StackingClassifier

In [None]:
lr = LogisticRegression()

## Ensemble model for the subreddits

In [None]:
clf_stack_subreddit = StackingClassifier(classifiers =[mNB_subreddit, cNB_subreddit,lr_subreddit], meta_classifier = lr, use_probas = True, use_features_in_secondary = True)

In [None]:
model_stack_subreddit = clf_stack_subreddit.fit(X_train_subreddit, y_train_subreddit)   
proba_stack_subreddit = model_stack_subreddit.predict_proba(X_ens_subreddit)[:, 1]  # class-1 scores for the ROC AUC
print('ROC-Score: ',round(roc_auc_score(y_ens,proba_stack_subreddit),4))

## Solution of the ensemble model on the subreddits

In [None]:
y_pred_stack_subreddit = model_stack_subreddit.predict_proba(X_test_subreddit)[:,1]
solution_ens_subreddit = pd.DataFrame({"author":test_data.author.unique(), "gender":y_pred_stack_subreddit})
solution_ens_subreddit.head()
solution_ens_subreddit.to_csv("solution_ens_subreddit.csv", index=False)

## Ensemble model for the posts

In [None]:
clf_stack_text = StackingClassifier(classifiers =[mNB_text, cNB_text,lr_text], meta_classifier = lr, use_probas = True, use_features_in_secondary = True)
model_stack_text = clf_stack_text.fit(X_train_text, y_train_text)   
proba_stack_text = model_stack_text.predict_proba(X_ens_text)[:, 1]  # class-1 scores for the ROC AUC
print('ROC-Score: ',round(roc_auc_score(y_ens,proba_stack_text),4))

## Solution of the ensemble model on the posts

In [None]:
y_pred_stack_text = model_stack_text.predict_proba(X_test_text)[:,1]
# rows of X_test_text follow test_y.author, so that is the matching author column
solution_ens_text = pd.DataFrame({"author":test_y.author, "gender":y_pred_stack_text})
solution_ens_text.head()
solution_ens_text.to_csv("solution_ens_text.csv", index=False)