# Challenge - Tweets analysis

![](https://images.unsplash.com/photo-1569285645462-a3f9c6332d56?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

This data originally came from Crowdflower's [Data for Everyone library](http://www.crowdflower.com/data-for-everyone).

As the original source says,

> We looked through tens of thousands of tweets about the early August GOP debate in Ohio and asked contributors to do both sentiment analysis and data categorization. Contributors were asked which candidate was mentioned, and what the sentiment was for a given tweet. 

In this exercise, you will use your NLP and ML tools to **predict the sentiment of a tweet**. This is a project where you are free to explore and use the techniques that you know on the given dataset.

Some potential guidelines:
- Be careful to thoroughly explore the data and gain insights about it.
- Practice some topic modelling on the text of the tweets.
- Use the NLP tools that you know on the text of each tweet to make predictions about the sentiment of the tweet.
- Try some feature engineering to improve the performance of your model.

The dataset is located in the `input` folder.

### Useful references:
Trump_VMalara.ipynb\
https://towardsdatascience.com/natural-language-processing-on-multiple-columns-in-python-554043e05308

In [166]:
# Imports
import pandas as pd
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [167]:
# read data
df = pd.read_csv('../input/Sentiment.csv',on_bad_lines='skip')

In [168]:
df.columns

Index(['Unnamed: 0', 'id', 'candidate', 'sentiment', 'name', 'retweet_count',
       'text', 'tweet_coord', 'tweet_created', 'tweet_id', 'tweet_location',
       'user_timezone'],
      dtype='object')

In [169]:
df.head(5)

| | Unnamed: 0 | id | candidate | sentiment | name | retweet_count | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | No candidate mentioned | Neutral | I_Am_Kenzi | 5 | RT @NancyLeeGrahn: How did everyone feel about... | NaN | 2015-08-07 09:54:46 -0700 | 629697200650592256 | NaN | Quito |
| 1 | 1 | 2 | Scott Walker | Positive | PeacefulQuest | 26 | RT @ScottWalker: Didn't catch the full #GOPdeb... | NaN | 2015-08-07 09:54:46 -0700 | 629697199560069120 | NaN | NaN |
| 2 | 2 | 3 | No candidate mentioned | Neutral | PussssyCroook | 27 | RT @TJMShow: No mention of Tamir Rice and the ... | NaN | 2015-08-07 09:54:46 -0700 | 629697199312482304 | NaN | NaN |
| 3 | 3 | 4 | No candidate mentioned | Positive | MattFromTexas31 | 138 | RT @RobGeorge: That Carly Fiorina is trending ... | NaN | 2015-08-07 09:54:45 -0700 | 629697197118861312 | Texas | Central Time (US & Canada) |
| 4 | 4 | 5 | Donald Trump | Positive | sharonDay5 | 156 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | NaN | 2015-08-07 09:54:45 -0700 | 629697196967903232 | NaN | Arizona |


In [170]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13871 entries, 0 to 13870
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      13871 non-null  int64 
 1   id              13871 non-null  int64 
 2   candidate       13775 non-null  object
 3   sentiment       13871 non-null  object
 4   name            13871 non-null  object
 5   retweet_count   13871 non-null  int64 
 6   text            13871 non-null  object
 7   tweet_coord     21 non-null     object
 8   tweet_created   13871 non-null  object
 9   tweet_id        13871 non-null  int64 
 10  tweet_location  9959 non-null   object
 11  user_timezone   9468 non-null   object
dtypes: int64(4), object(8)
memory usage: 1.3+ MB


In [171]:
print(df.tweet_coord.isna().sum())
print(df.tweet_location.isna().sum())

13850
3912


# EDA

In [172]:
# drop the columns: 'Unnamed: 0', 'id', 'name', 'tweet_coord', 'tweet_location', 'tweet_created', 'tweet_id', 'user_timezone'
df.drop(columns=['Unnamed: 0','id','name','tweet_coord','tweet_location','tweet_created','tweet_id','user_timezone'], inplace=True)
df.head(5)

| | candidate | sentiment | retweet_count | text |
|---|---|---|---|---|
| 0 | No candidate mentioned | Neutral | 5 | RT @NancyLeeGrahn: How did everyone feel about... |
| 1 | Scott Walker | Positive | 26 | RT @ScottWalker: Didn't catch the full #GOPdeb... |
| 2 | No candidate mentioned | Neutral | 27 | RT @TJMShow: No mention of Tamir Rice and the ... |
| 3 | No candidate mentioned | Positive | 138 | RT @RobGeorge: That Carly Fiorina is trending ... |
| 4 | Donald Trump | Positive | 156 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... |


In [173]:
# dataset label distribution
df['sentiment'].value_counts()

Negative    8493
Neutral     3142
Positive    2236

In [174]:
## map the target to {0, 1, 2}
df['sentiment'] = df['sentiment'].replace({'Negative': 0, 'Neutral': 1, 'Positive':2})
df.sentiment.value_counts()

0    8493
1    3142
2    2236
Name: sentiment, dtype: int64
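
The classes are imbalanced (8493 / 3142 / 2236), so a trivial baseline helps put later scores in context. Below is a minimal sketch of the weighted F1-score obtained by always predicting the most frequent class. The helper name `majority_baseline_weighted_f1` is hypothetical and is not part of the notebook.

```python
def majority_baseline_weighted_f1(class_counts):
    """Weighted F1 of a classifier that always predicts the most frequent class.

    `class_counts` maps each label to its number of samples.
    """
    total = sum(class_counts.values())
    majority_count = max(class_counts.values())
    precision = majority_count / total   # every sample is predicted as the majority class
    recall = 1.0                         # so every majority sample is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    # The other classes are never predicted, so their F1 is 0;
    # only the majority class contributes, weighted by its support.
    return (majority_count / total) * f1_majority


# Label counts taken from the value_counts() output above.
counts = {0: 8493, 1: 3142, 2: 2236}
baseline_f1 = majority_baseline_weighted_f1(counts)
print(f"Majority-class weighted F1: {baseline_f1:.3f}")
```

With these counts the baseline comes out at roughly 0.465, which gives a floor that any trained model should exceed.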

### The data is labelled
I will build a supervised benchmark model,
then test supervised ML models (Logistic Regression, XGBoost) with tuned hyperparameters,
and, as in the course, maximize the F1-score by tuning the threshold.
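
The plan above mentions maximizing the F1-score over a threshold. The notebook does not say which quantity the threshold applies to, so this is only a sketch under one assumption: a binary (one-vs-rest) setting where the threshold is applied to a predicted probability. The data is invented and the helper names `f1_binary` and `best_threshold` are hypothetical.

```python
def f1_binary(y_true, y_pred):
    """F1-score of the positive class (label 1) for two equal-length lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probas, thresholds):
    """Return the (threshold, f1) pair with the highest F1 among the candidates."""
    scored = []
    for thr in thresholds:
        y_pred = [1 if p >= thr else 0 for p in probas]
        scored.append((thr, f1_binary(y_true, y_pred)))
    return max(scored, key=lambda pair: pair[1])


# Toy data (invented): 1 = tweet of the class of interest, 0 = any other class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probas = [0.9, 0.4, 0.45, 0.8, 0.55, 0.1, 0.35, 0.3]
thresholds = [round(0.1 * k, 1) for k in range(1, 10)]

thr, score = best_threshold(y_true, probas, thresholds)
print(f"best threshold = {thr}, F1 = {score:.3f}")
```

In practice the threshold would be chosen on a validation split rather than on the test set, so that the reported test score stays unbiased.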

# 1. Sentiment Analysis using Machine Learning

### Preprocess the 'text' feature -> 'tokens' (from text to a list of tokens for each tweet)

In [175]:
## TODO: see how to exclude strings of the form '@string ' with a regex
def filter_out_after_at(tokens):
    """Drop every token that immediately follows an '@' token (a Twitter handle)."""
    new_tokens = []
    prev_token = None  # no previous token before the first one
    for t in tokens:
        if prev_token != '@':
            new_tokens.append(t)
        prev_token = t
    return new_tokens
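
The comment above asks how a regex could remove `@handle` strings. Here is a minimal sketch using the standard `re` module, applied to the raw text before tokenization. The helper `strip_mentions` and the example tweet are invented for illustration. The pattern assumes handles consist of letters, digits, and underscores.

```python
import re

# "@" followed by one or more word characters (letters, digits, underscore),
# plus an optional trailing colon as in "RT @user: ...".
MENTION_RE = re.compile(r"@\w+:?")


def strip_mentions(text):
    """Remove Twitter-style @mentions and collapse the leftover whitespace."""
    without_mentions = MENTION_RE.sub(" ", text)
    return " ".join(without_mentions.split())


example = "RT @some_user: Great answer from @another_user tonight #GOPDebate"
print(strip_mentions(example))
```

Unlike the token-based filter, this approach would also alter strings such as e-mail addresses, so it may need refinement for other corpora.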
        

In [176]:
def preprocessing(text):
    """Turn a raw tweet into a list of lowercased, stemmed, alphabetic tokens."""
    tokens = word_tokenize(text)
    # remove tokens following a '@' (Twitter handles)
    tokens = filter_out_after_at(tokens)
    # keep alphabetic tokens only, lowercased
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # drop English stop words
    tokens = [t for t in tokens if t not in stop_words]
    # stem with the Porter stemmer
    tokens = [stemmer.stem(t) for t in tokens]
    return tokens

In [177]:
df["tokens"] = df.text.apply(preprocessing)
df.head()

| | candidate | sentiment | retweet_count | text | tokens |
|---|---|---|---|---|---|
| 0 | No candidate mentioned | 1 | 5 | RT @NancyLeeGrahn: How did everyone feel about... | [rt, everyon, feel, climat, chang, question, l... |
| 1 | Scott Walker | 2 | 26 | RT @ScottWalker: Didn't catch the full #GOPdeb... | [rt, catch, full, gopdeb, last, night, scott, ... |
| 2 | No candidate mentioned | 1 | 27 | RT @TJMShow: No mention of Tamir Rice and the ... | [rt, mention, tamir, rice, gopdeb, held, cleve... |
| 3 | No candidate mentioned | 2 | 138 | RT @RobGeorge: That Carly Fiorina is trending ... | [rt, carli, fiorina, trend, hour, debat, men, ... |
| 4 | Donald Trump | 2 | 156 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | [rt, gopdeb, deliv, highest, rate, histori, pr... |


# VERSION.1: supervised model using only the tokens column

In [178]:
# Build X and y
## J.30/05: replace tokens with text in X
y = df['sentiment'].to_numpy()
X = df['tokens'] ## alternative: df.drop(columns=['sentiment','text'], axis=1)
##X = df['text'] ## - fine: F1-score LTR = 0.54 vs 0.54
print(f'y.shape: {y.shape}')
print(f'X.shape: {X.shape}')

y.shape: (13871,)
X.shape: (13871,)


In [179]:
# split train - test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'X_train.shape: {X_train.shape}')
print(f'X_test.shape: {X_test.shape}')
print(f'y_train.shape: {y_train.shape}')
print(f'y_test.shape: {y_test.shape}')

X_train.shape: (11096,)
X_test.shape: (2775,)
y_train.shape: (11096,)
y_test.shape: (2775,)


# TFIDF with Sklearn

### VERSION.1: supervised model using only the 'tokens' column -> TF-IDF

In [180]:
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x)
tf_idf_train = vectorizer.fit_transform(X_train).toarray()
tf_idf_train

array([[0.        , 0.58122002, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.45069387, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.59526437, 0.25224846, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.60074772, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.48884862, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.53889059, 0.        , ..., 0.        , 0.        ,
        0.        ]])
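
To make the weights in such a matrix less opaque, here is a minimal pure-Python sketch of the computation that `TfidfVectorizer` documents for its default settings: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. The helper `tfidf_rows` and the toy token lists are invented for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF for tokenized documents, following scikit-learn's documented defaults:
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, then L2-normalize each row."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # document frequency: number of documents containing each token
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, rows


# Toy token lists (invented), in the same shape as the `tokens` column.
docs = [["rt", "gopdeb", "trump"], ["rt", "debat", "debat"], ["gopdeb"]]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Tokens that appear in fewer documents receive a larger weight, which is why rare stems dominate a row when they occur.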

In [181]:
# for display, and to allow several columns in the features
## TRAIN
df_train = pd.DataFrame(tf_idf_train, columns=vectorizer.get_feature_names_out())

In [182]:
## TEST: transform only on Test
tf_idf_test = vectorizer.transform(X_test).toarray()
df_test = pd.DataFrame(tf_idf_test, columns=vectorizer.get_feature_names_out())
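
Calling `.toarray()` stores every zero of the TF-IDF matrix explicitly. A back-of-the-envelope estimate, using the shape `(11096, 7144)` that the notebook prints for `df_train` and assuming 8-byte floats, gives an idea of the cost:

```python
n_train_rows = 11096      # rows in the training split
n_features = 7144         # TF-IDF vocabulary size, from df_train.shape
bytes_per_float64 = 8     # assumption: values are stored as float64

dense_bytes = n_train_rows * n_features * bytes_per_float64
print(f"Dense float64 matrix: {dense_bytes / 1e6:.0f} MB")
```

Since scikit-learn's `LogisticRegression` accepts SciPy sparse input, keeping the output of `fit_transform` sparse is one possible way to reduce memory use when the dense form is not needed for display.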

In [142]:
print(f'df_train: {df_train}')
print(f'df_test: {df_test}')
print(f'df_train.shape: {df_train.shape}')
print(f'df_test.shape: {df_test.shape}')

df_train:        aaaaand  aaaand  aaand  aaron  abandon  abbott  abc  abe  abhorr  abil  \
0          0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
1          0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
2          0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
3          0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
4          0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
...        ...     ...    ...    ...      ...     ...  ...  ...     ...   ...   
11091      0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
11092      0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
11093      0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
11094      0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   
11095      0.0     0.0    0.0    0.0      0.0     0.0  0.0  0.0     0.0   0.0   

       ...  zirco

# Model.1: Logistic Regression

In [183]:
from sklearn.linear_model import LogisticRegression

In [184]:
lr = LogisticRegression(max_iter=30)
lr.fit(tf_idf_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [185]:
# predictions on train and test
y_pred_train = lr.predict(tf_idf_train)
y_pred_test  = lr.predict(tf_idf_test)

### F1-score of the logistic regression with 'tokens' as the only predictor of 'sentiment'

In [186]:
from sklearn.metrics import f1_score,recall_score,precision_score
# F1-score on train
print(f" F1-score sur Train: {f1_score(y_train, y_pred_train, average='weighted')}")
# F1-score on test
print(f" F1-score sur Test: {f1_score(y_test, y_pred_test, average='weighted')}")
# Scores noted from a previous run of this cell:
#  F1-score sur Train: 0.7208759545410538
#  F1-score sur Test: 0.6484711971542514

 F1-score sur Train: 0.555721165447757
 F1-score sur Test: 0.5451023810786118
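
For reference, `average='weighted'` means the per-class F1-scores are averaged with weights proportional to each class's support. A minimal pure-Python sketch of that definition, on invented labels (the helper `weighted_f1` is hypothetical):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_label in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_label - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += (n_label / total) * f1
    return score


# Toy labels (invented) using the notebook's encoding: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 2, 2]
toy_score = weighted_f1(y_true_toy, y_pred_toy)
print(f"weighted F1 = {toy_score:.4f}")
```

Because the Negative class holds most of the tweets, it dominates this average. A macro average would weight the three classes equally instead.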

### Model.2: XGBoost

### ref: Medium: xgboost-classification-in-python

In [147]:
from xgboost import XGBClassifier

In [148]:
pd.DataFrame(y_train).value_counts()

0    6771
1    2530
2    1795
dtype: int64

In [149]:
pd.DataFrame(y_test).value_counts()

0    1722
1     612
2     441
dtype: int64

In [150]:
# XGBoost (different learning rates)
# Both experiments are kept commented out, together with the scores recorded when they were run.

# --- Experiment 1: a single learning rate ---
# xgb_classifier = XGBClassifier(eta = 0.1)
# xgb_classifier.fit(tf_idf_train, y_train)
#
# # predictions on train and test
# y_pred_train = xgb_classifier.predict(tf_idf_train)
# y_pred_test  = xgb_classifier.predict(tf_idf_test)
#
# # F1 scores
# print(f" F1-score sur Train: {f1_score(y_train, y_pred_train, average='weighted')}")
# print(f" F1-score sur Test: {f1_score(y_test, y_pred_test, average='weighted')}")
#
# Recorded scores:
#  F1-score sur Train: 0.6677388710919745
#  F1-score sur Test: 0.6172217719137684

# --- Experiment 2: sweep over several learning rates ---
# learning_rate_range = np.arange(0.15, 0.65, 0.15)
# test_XG  = {}
# train_XG = {}
#
# for learn_rate in learning_rate_range:
#     xgb_classifier = XGBClassifier(eta = learn_rate)
#     xgb_classifier.fit(tf_idf_train, y_train)
#
#     y_pred_train = xgb_classifier.predict(tf_idf_train)
#     y_pred_test  = xgb_classifier.predict(tf_idf_test)
#
#     train_XG[str(learn_rate)] = f1_score(y_train, y_pred_train, average='weighted')
#     test_XG[str(learn_rate)]  = f1_score(y_test, y_pred_test, average='weighted')
#
# Recorded scores:
# train_XG: {'0.15': 0.6942408701011298, '0.3': 0.7495637334755164, '0.44999999999999996': 0.7859795832618688, '0.6': 0.8117905777217034}
# test_XG: {'0.15': 0.6264465950017845, '0.3': 0.6426191392431214, '0.44999999999999996': 0.6410016721381415, '0.6': 0.6551265266919923}

In [151]:
#print(f'train_XG: {train_XG}')
#print(f'test_XG: {test_XG}')
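
The recorded results contain the key `'0.44999999999999996'`, a floating-point artifact of building the grid with `np.arange` and a fractional step. One possible way to get clean keys, sketched here in plain Python, is to build the grid from integer steps and round each value:

```python
# Build the grid from integer steps, then round, so the keys print cleanly.
learning_rates = [round(0.15 * k, 2) for k in range(1, 5)]
keys = [str(rate) for rate in learning_rates]
print(keys)
```

The rounded values can be passed to the classifier as before; only the dictionary keys change.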

### XGBoost on a DataFrame combining the TF-IDF above with other features

In [152]:
df.columns

Index(['candidate', 'sentiment', 'retweet_count', 'text', 'tokens'], dtype='object')

In [153]:
df['candidate'].value_counts()

No candidate mentioned    7491
Donald Trump              2813
Jeb Bush                   705
Ted Cruz                   637
Ben Carson                 404
Mike Huckabee              393
Chris Christie             293
Marco Rubio                275
Rand Paul                  263
Scott Walker               259
John Kasich                242
Name: candidate, dtype: int64

### get_dummies on candidate!

In [154]:
df = pd.get_dummies(df, columns=['candidate'])
### !!! caution: not ideal; OneHotEncoder would be better !!!
# get_dummies is applied up front, on the whole df,
# but in a pipeline approach -> one-hot encoding (OneHotEncoder)

In [155]:
df.columns

Index(['sentiment', 'retweet_count', 'text', 'tokens', 'candidate_Ben Carson',
       'candidate_Chris Christie', 'candidate_Donald Trump',
       'candidate_Jeb Bush', 'candidate_John Kasich', 'candidate_Marco Rubio',
       'candidate_Mike Huckabee', 'candidate_No candidate mentioned',
       'candidate_Rand Paul', 'candidate_Scott Walker', 'candidate_Ted Cruz'],
      dtype='object')

In [156]:
cols_DF2 = ['retweet_count', 'candidate_Ben Carson',
       'candidate_Chris Christie', 'candidate_Donald Trump',
       'candidate_Jeb Bush', 'candidate_John Kasich', 'candidate_Marco Rubio',
       'candidate_Mike Huckabee', 'candidate_No candidate mentioned',
       'candidate_Rand Paul', 'candidate_Scott Walker', 'candidate_Ted Cruz']
DF2 = df[cols_DF2]
DF2
## Problem in Trump_VM__.ipynb: the vectorizer is fitted on the whole df, which is data leakage; work out how to do it properly!

Unnamed: 0,retweet_count,candidate_Ben Carson,candidate_Chris Christie,candidate_Donald Trump,candidate_Jeb Bush,candidate_John Kasich,candidate_Marco Rubio,candidate_Mike Huckabee,candidate_No candidate mentioned,candidate_Rand Paul,candidate_Scott Walker,candidate_Ted Cruz
0,5,0,0,0,0,0,0,0,1,0,0,0
1,26,0,0,0,0,0,0,0,0,0,1,0
2,27,0,0,0,0,0,0,0,1,0,0,0
3,138,0,0,0,0,0,0,0,1,0,0,0
4,156,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
13866,7,0,0,0,0,0,0,0,1,0,0,0
13867,1,0,0,0,0,0,0,1,0,0,0,0
13868,67,0,0,0,0,0,0,0,0,0,0,1
13869,149,0,0,1,0,0,0,0,0,0,0,0


In [157]:
df_train

Unnamed: 0,aaaaand,aaaand,aaand,aaron,abandon,abbott,abc,abe,abhorr,abil,...,zirconia,zog,zombi,zombiereagan,zombiereaganris,zone,zoo,zoomph,zuckerberg,ツ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11092,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [158]:
X_train

9205     [rt, american, peopl, pick, next, presid, unit...
6779     [rt, serious, wish, one, tonight, moder, foxne...
4148     [trump, get, shit, rais, hand, last, night, ma...
4436     [rt, marcorubio, lie, said, never, advoc, exce...
13218    [rt, jebbi, talk, chang, tax, code, fix, job, ...
                               ...                        
5191     [love, gopdeb, take, hair, look, great, big, s...
13418    [rt, whererwomen, cairo, often, rail, vs, miso...
5390     [interest, everi, one, gop, candid, hate, amp,...
860      [chri, christi, tri, wrap, gopdeb, lie, twice,...
7270     [rt, ohhhh, bencarson, honesti, beauti, soul, ...
Name: tokens, Length: 11096, dtype: object

In [159]:
X2 = DF2
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.2, random_state=42)
X_train2 # ok, the indices line up with X_train


Unnamed: 0,retweet_count,candidate_Ben Carson,candidate_Chris Christie,candidate_Donald Trump,candidate_Jeb Bush,candidate_John Kasich,candidate_Marco Rubio,candidate_Mike Huckabee,candidate_No candidate mentioned,candidate_Rand Paul,candidate_Scott Walker,candidate_Ted Cruz
9205,142,0,0,0,0,0,0,0,1,0,0,0
6779,48,0,0,0,0,0,0,0,1,0,0,0
4148,0,0,0,1,0,0,0,0,0,0,0,0
4436,2,0,0,0,0,0,1,0,0,0,0,0
13218,4,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
5191,0,0,0,1,0,0,0,0,0,0,0,0
13418,11,0,0,0,0,0,0,0,1,0,0,0
5390,0,0,0,0,0,0,0,0,1,0,0,0
860,0,0,1,0,0,0,0,0,0,0,0,0


### !!!! see how to merge the TF-IDF with other columns of the df !!!!

In [160]:
df_train.index = X_train2.index ## to be checked
## better: set the index when the DataFrame is created
df_train

Unnamed: 0,aaaaand,aaaand,aaand,aaron,abandon,abbott,abc,abe,abhorr,abil,...,zirconia,zog,zombi,zombiereagan,zombiereaganris,zone,zoo,zoomph,zuckerberg,ツ
9205,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6779,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4436,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5390,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
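
The note "set the index when the DataFrame is created" can be illustrated with a small sketch. `pd.concat(..., axis=1)` aligns rows on the index, so a frame built from a bare array (index 0..n-1) will not line up with a shuffled split. The values below are toy stand-ins, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: a shuffled index like the one train_test_split returns,
# and a small dense matrix playing the role of the TF-IDF array.
shuffled_index = pd.Index([9205, 6779, 4148])
tfidf_values = np.array([[0.0, 0.5], [0.7, 0.0], [0.0, 1.0]])
other_features = pd.DataFrame({"retweet_count": [142, 48, 0]}, index=shuffled_index)

# Without an explicit index the frame gets 0..n-1 and concat misaligns the rows.
misaligned = pd.concat([pd.DataFrame(tfidf_values, columns=["a", "b"]), other_features], axis=1)

# Passing the index at creation keeps both frames aligned.
tfidf_frame = pd.DataFrame(tfidf_values, index=shuffled_index, columns=["a", "b"])
aligned = pd.concat([tfidf_frame, other_features], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

The misaligned version doubles the number of rows and fills the gaps with NaN, which is the symptom to watch for after a concat.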


In [161]:
## Is a CONCAT of the TF-IDF frame and X_train2 possible?
#X_train3 = pd.concat([X_train, X_train2], axis = 1)
print(df_train.shape)
print(X_train2.shape)
X_train3 = pd.concat([df_train, X_train2], axis = 1) ## check at the point where the dfs are created
X_train3

(11096, 7144)
(11096, 12)


Unnamed: 0,aaaaand,aaaand,aaand,aaron,abandon,abbott,abc,abe,abhorr,abil,...,candidate_Chris Christie,candidate_Donald Trump,candidate_Jeb Bush,candidate_John Kasich,candidate_Marco Rubio,candidate_Mike Huckabee,candidate_No candidate mentioned,candidate_Rand Paul,candidate_Scott Walker,candidate_Ted Cruz
9205,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
6779,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
4148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,0,0,0,0,0,0,0
4436,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,0,0,0,0
13218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,0,0,0,0,0,0,0
13418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
5390,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,0,0,0,0
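
An alternative to concatenating frames by hand is scikit-learn's `ColumnTransformer` inside a `Pipeline`, which fits every transformer on the training rows only and so avoids the leakage concern raised earlier. This is a sketch on an invented toy frame, not the notebook's method: it vectorizes the raw `text` column with the default `TfidfVectorizer` tokenizer instead of the custom `tokens` column, and uses `LogisticRegression` as a stand-in classifier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (invented rows) with the same column names as the notebook's df.
toy = pd.DataFrame({
    "text": ["great debate tonight", "terrible answer", "great answer",
             "terrible debate", "great night", "terrible night"],
    "candidate": ["Donald Trump", "Jeb Bush", "Jeb Bush",
                  "Donald Trump", "Ted Cruz", "Ted Cruz"],
    "retweet_count": [5, 0, 12, 3, 7, 1],
    "sentiment": [2, 0, 2, 0, 2, 0],
})
train_rows, test_rows = toy.iloc[:4], toy.iloc[4:]

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),                                 # text column -> TF-IDF
    ("candidate", OneHotEncoder(handle_unknown="ignore"), ["candidate"]), # categories -> one-hot
    ("counts", "passthrough", ["retweet_count"]),                         # numeric column unchanged
])
model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])

# fit() learns the vocabulary and the categories from the training rows only.
model.fit(train_rows.drop(columns="sentiment"), train_rows["sentiment"])
predictions = model.predict(test_rows.drop(columns="sentiment"))
print(predictions)
```

With this layout there is no index to realign, and the same fitted object can be applied to any new frame that has the three feature columns.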


In [162]:
df_test.index = X_test2.index
df_test

Unnamed: 0,aaaaand,aaaand,aaand,aaron,abandon,abbott,abc,abe,abhorr,abil,...,zirconia,zog,zombi,zombiereagan,zombiereaganris,zone,zoo,zoomph,zuckerberg,ツ
5040,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1078,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13717,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7320,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4785,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [163]:
X_test2

Unnamed: 0,retweet_count,candidate_Ben Carson,candidate_Chris Christie,candidate_Donald Trump,candidate_Jeb Bush,candidate_John Kasich,candidate_Marco Rubio,candidate_Mike Huckabee,candidate_No candidate mentioned,candidate_Rand Paul,candidate_Scott Walker,candidate_Ted Cruz
5040,0,0,0,0,0,0,0,0,1,0,0,0
1078,79,0,0,1,0,0,0,0,0,0,0,0
13717,0,0,0,0,0,1,0,0,0,0,0,0
3527,1,1,0,0,0,0,0,0,0,0,0,0
3404,1,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
7320,23,0,0,0,0,0,0,0,0,0,0,0
6017,1,0,0,0,0,0,0,0,1,0,0,0
1374,0,0,0,0,0,0,0,0,1,0,0,0
4785,11,0,0,0,0,0,0,0,1,0,0,0


In [164]:
#####X_test3 = pd.concat([X_test, X_test2], axis = 1)
X_test3 = pd.concat([df_test, X_test2], axis = 1)
X_test3
## Caution: filter out the NaNs!

Unnamed: 0,aaaaand,aaaand,aaand,aaron,abandon,abbott,abc,abe,abhorr,abil,...,candidate_Chris Christie,candidate_Donald Trump,candidate_Jeb Bush,candidate_John Kasich,candidate_Marco Rubio,candidate_Mike Huckabee,candidate_No candidate mentioned,candidate_Rand Paul,candidate_Scott Walker,candidate_Ted Cruz
5040,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
1078,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,0,0,0,0,0,0,0
13717,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,1,0,0,0,0,0,0
3527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7320,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
6017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
1374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
4785,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0


### XGBoost on X with the features: tokens, candidate dummies, retweet_count

In [165]:
xgb_classifier = XGBClassifier(eta = 0.7)
xgb_classifier.fit(X_train3, y_train)

# predictions on train and test
y_pred_train = xgb_classifier.predict(X_train3)
y_pred_test  = xgb_classifier.predict(X_test3)

# F1 scores
# F1-score on train
print(f" F1-score sur Train3: {f1_score(y_train, y_pred_train, average='weighted')}")
# F1-score on test
print(f" F1-score sur Test3: {f1_score(y_test, y_pred_test, average='weighted')}")

 F1-score sur Train3: 0.8351536411527715
 F1-score sur Test3: 0.6597017814134338

In [82]:
xgb_classifier = XGBClassifier(eta = 1.0)
xgb_classifier.fit(X_train3, y_train)

# predictions on train and test
y_pred_train = xgb_classifier.predict(X_train3)
y_pred_test  = xgb_classifier.predict(X_test3)

# F1 scores
# F1-score on train
print(f" F1-score sur Train3: {f1_score(y_train, y_pred_train, average='weighted')}")
# F1-score on test
print(f" F1-score sur Test3: {f1_score(y_test, y_pred_test, average='weighted')}")


 F1-score sur Train3: 0.8652616881937858
 F1-score sur Test3: 0.6540031695210006
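
One simple way to read the two runs above is to compare the gap between train and test scores. The sketch below only restates the recorded numbers and computes the difference:

```python
# Weighted F1-scores copied from the two runs above
# (TF-IDF + candidate dummies + retweet_count).
runs = {
    0.7: {"train": 0.8351536411527715, "test": 0.6597017814134338},
    1.0: {"train": 0.8652616881937858, "test": 0.6540031695210006},
}

gaps = {eta: scores["train"] - scores["test"] for eta, scores in runs.items()}
for eta, gap in gaps.items():
    print(f"eta={eta}: train-test gap = {gap:.3f}")
```

Going from `eta = 0.7` to `eta = 1.0` raises the train score while the test score drops slightly, which suggests the larger learning rate mostly adds overfitting on this split.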
