# NLP 4: Bag of Words

Practice classifying tweets as positive or negative. Such a classifier could be a very useful tool for streamlining a community manager's work.

1.   Import the tweet dataset [**train.csv**](https://github.com/Lit3C/machine-learning/blob/main/Twitter-Sentiment_Analysis/train.csv) into a DataFrame.
2.   Keep only the positive and negative tweets (so exclude the `neutral` ones). What percentage of the tweets is positive, and what percentage is negative?
3.   Copy the `text` column into a Series `X` and the `sentiment` column into a Series `y`. Apply a train/test split with `random_state = 32` and a `train_size` of 0.75.
4.   Create a `vectorizer` model with scikit-learn using the `CountVectorizer` class. Fit it on `X_train`, then create the feature matrix `X_train_CV`. Create the matrix `X_test_CV` without refitting the model. `X_test_CV` should have shape `4091x15806` with `44633 stored elements`.
5.   Now train a logistic regression with the default parameters. You should get the following accuracy scores: `0.966` on the training set and `0.877` on the test set.
6.   Bonus step: try displaying 10 tweets that were mispredicted (false positives or false negatives). Would you have done better than the algorithm?

## 1. Import

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

In [3]:
df_raw = pd.read_csv('train.csv')
display(df_raw.head())

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


## 2. Keep Only Positive and Negative Tweets

In [4]:
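# Drop the neutral tweets: keep only rows labelled negative or positive
df = df_raw[df_raw['sentiment'].isin(['negative', 'positive'])]
display(df.head())

# Share of each class, as a percentage rounded to 2 decimals
sentiment_percentage = round(df['sentiment'].value_counts(normalize=True) * 100, 2)
print(sentiment_percentage)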

Unnamed: 0,textID,text,selected_text,sentiment
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive


sentiment
positive    52.45
negative    47.55
Name: proportion, dtype: float64
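
To make the filtering and percentage step concrete, here is a minimal sketch on a hand-made column. The `toy` DataFrame and its eight labels are invented for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical toy labels standing in for the `sentiment` column.
toy = pd.DataFrame({
    "sentiment": ["positive", "negative", "neutral", "positive",
                  "neutral", "positive", "negative", "positive"]
})

# Keep only the two classes of interest (6 of the 8 rows remain).
toy_binary = toy[toy["sentiment"].isin(["negative", "positive"])]

# normalize=True returns relative frequencies instead of raw counts;
# multiplying by 100 turns them into percentages.
toy_percentage = (toy_binary["sentiment"].value_counts(normalize=True) * 100).round(2)
print(toy_percentage)
```

With 4 positive and 2 negative rows left after filtering, the shares come out at roughly two thirds and one third.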


## 3. Train Test Split

In [5]:
X = df['text']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   train_size=0.75,
                                                   random_state=32)
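
As a small illustration of what the split does, the sketch below applies the same `train_size` and `random_state` to an invented eight-row Series. The names `toy_X` and `toy_y` are hypothetical, and the point is only to show the resulting sizes, the aligned indices, and the reproducibility that a fixed `random_state` gives.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 short texts with alternating labels.
toy_X = pd.Series([f"tweet {i}" for i in range(8)])
toy_y = pd.Series(["positive", "negative"] * 4)

# Same parameters as in the notebook: 75% of the rows go to the training set.
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=32
)
print(len(tX_train), len(tX_test))

# Features and labels are shuffled together, so their indices stay aligned.
print(tX_train.index.equals(ty_train.index))

# Calling the function again with the same random_state gives the same split.
again_train, _, _, _ = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=32
)
print(again_train.index.equals(tX_train.index))
```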

## 4. Vectorizer Model

In [6]:
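vectorizer = CountVectorizer()
# Learn the vocabulary on the training tweets only, then count words
X_train_CV = vectorizer.fit_transform(X_train)

# Reuse the learned vocabulary: transform only, no refitting on the test set
X_test_CV = vectorizer.transform(X_test)
X_test_CV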

<4091x15806 sparse matrix of type '<class 'numpy.int64'>'
	with 44633 stored elements in Compressed Sparse Row format>
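
To see why the test matrix has exactly as many columns as the training vocabulary, here is a minimal sketch on invented sentences. The documents and the `toy_` names are hypothetical; the `CountVectorizer` defaults are used, which lowercase the text and ignore one-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the tweet dataset.
train_docs = ["I love this song", "I hate this rain", "love love love"]
test_docs = ["I love sunshine"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and counts words in one pass.
toy_train = toy_vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; unseen words ("sunshine") are dropped.
toy_test = toy_vectorizer.transform(test_docs)

# vocabulary_ maps each word to its column index; sort the words by column.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_train.toarray())
print(toy_test.toarray())
```

Each row is a document and each column a vocabulary word, so the test row only has a count in the `love` column: word order is lost and words absent from the training set leave no trace.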

## 5. Logistic Regression

In [7]:
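# max_iter=1000 raises the iteration cap above scikit-learn's default of 100;
# every other parameter is left at its default value
log_model = LogisticRegression(max_iter=1000).fit(X_train_CV, y_train)

# score() returns the mean accuracy on the given data and labels
print('accuracy score on train set :', round(log_model.score(X_train_CV, y_train), 3))
print('accuracy score on test set :', round(log_model.score(X_test_CV, y_test), 3))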

accuracy score on train set : 0.966
accuracy score on test set : 0.877
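
Accuracy alone does not say whether the mistakes are false positives or false negatives. One possible way to break the errors down is a confusion matrix, sketched here on invented texts. All `toy_` names and sentences are hypothetical, and treating `positive` as the positive class is an assumption made for the sake of the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical toy data, not taken from the tweet dataset.
toy_train_texts = ["love it", "so happy today", "great day",
                   "hate it", "so sad today", "awful day"]
toy_train_labels = ["positive"] * 3 + ["negative"] * 3
toy_test_texts = ["happy day", "sad and awful", "love this great song", "hate rain"]
toy_test_labels = ["positive", "negative", "positive", "negative"]

toy_vec = CountVectorizer()
toy_clf = LogisticRegression(max_iter=1000).fit(
    toy_vec.fit_transform(toy_train_texts), toy_train_labels
)
toy_test_matrix = toy_vec.transform(toy_test_texts)
toy_pred = toy_clf.predict(toy_test_matrix)

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
labels = ["negative", "positive"]
cm = confusion_matrix(toy_test_labels, toy_pred, labels=labels)
false_positives = cm[0, 1]  # actual negative, predicted positive
false_negatives = cm[1, 0]  # actual positive, predicted negative

# Accuracy is the share of predictions on the diagonal.
accuracy = cm.trace() / cm.sum()
print(cm)
print(false_positives, false_negatives, accuracy)
```

The same call could be applied to `y_test` and the model's predictions on `X_test_CV` to count each kind of error before looking at individual tweets.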


## 6. Bonus: 10 Mispredicted Tweets

In [8]:
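y_pred = log_model.predict(X_test_CV)

# Positions (not index labels) of the test tweets the model got wrong
errors = np.where(y_pred != y_test.to_numpy())[0]

# Sample up to 10 distinct errors; replace=False keeps a tweet from being drawn twice
random_numbers = np.random.choice(errors, min(10, len(errors)), replace=False)

for i in random_numbers:
    print(f"Tweet : {X_test.iloc[i]}")
    print(f"Actual: {y_test.iloc[i]}")
    print(f"Predicted: {y_pred[i]}\n")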

Tweet : Just woke up & can`t go back to sleep. Had a txt from the bff sayin if you`re up call me, sounds important but that was 4 hours ago
Actual: positive
Predicted: negative

Tweet :  It wasn`t the best flick, to be sure.  I`m just ready for 'Star Trek' now
Actual: negative
Predicted: positive

Tweet :  Hey babe, nothing much tryin to see what imma do at work today lol, look like the load isnt so bad.
Actual: positive
Predicted: negative

Tweet : _xo  hey chick u alryt u at dads tmoro we sud do sumin aen like last week we neva dun oot this week lol  missed you ha bye hun ****
Actual: negative
Predicted: positive

Tweet :  I only saw urs by chance. Who else would have that name! Think I`ll b missing 2nite too.  Thank god for YouTube.
Actual: positive
Predicted: negative

Tweet : damnnn this day came to fast, but i cherished all the moment i had
Actual: positive
Pre