# Lab - Detecting bad buzz on social media


<img src='https://metier-outsourcer.com/wp-content/uploads/2019/04/gestion-bad-buzz-2.png'>


Using the dataset provided, build a supervised classification model that detects bad buzz on social media.

The dataset is the file [`data.csv`](https://drive.google.com/file/d/10DoCuqttlxqDlsc1NUptMVqMCIqZqcSv/view?usp=sharing), which contains tweets labelled "positive" or "negative" according to their sentiment.

1. In the first part, you will explore the data and get familiar with the dataset.

2. In the second part, you will clean the dataset by removing missing values, converting the text data to numeric data (**TF-IDF** for the tweets and 1 or 0 for the labels), and then splitting the data into a training set and a test set.

3. In the third part, you will build a supervised classification model, train it on the training set, evaluate its performance on the test set, and display the results.

4. In the fourth part, you will connect your model to the OpenAI API to reply, via the `ChatCompletion` module, to the negative tweets your model detects.

5. Bonus: identify the most discussed topics in the negative tweets using topic modeling methods.


---
**[Useful resources](https://drive.google.com/file/d/12sKr9R0A8lq2hcWUJDIx3SuyOOl_4hiA/view?usp=sharing)**

Data analysis:
- [Lab 1: Matrix handling with NumPy](https://drive.google.com/file/d/1snqYVzZcfxvKjr1zwB_l2oCk8HYmZUgM/view?usp=sharing)
- [Lab 2: Dataset handling with Pandas](https://drive.google.com/file/d/15nsJksMowqjrEgBQd8RXv3O_ITKUsjUF/view?usp=sharing)
- [Lab 3: Plotting data with Matplotlib](https://drive.google.com/file/d/11NQxpVv_iw_5PoFgMP-imNbDibAi9yDd/view?usp=sharing)
- [Lab 4: Classification models with Scikit-learn](https://drive.google.com/file/d/1_8VVw1-tHQwJPIVoC_5sldu8h_HfpAxa/view?usp=sharing)
- [Lab 5: Training different supervised classification models](https://drive.google.com/file/d/1BnfCMuZDqHXZBzzXaYwI9fXxNi7jSL3V/view?usp=sharing)

Natural Language Processing:
- [Lab 3: Natural language processing](https://drive.google.com/file/d/1GI9_wTJlb3_38kK_S2MTv8jttmos6ysd/view?usp=sharing)
- [Lab 4: TF-IDF & text similarity](https://drive.google.com/file/d/1zRsc3h8-h_PKG4qnl-T7XdhtwC9bvzki/view?usp=sharing)
- [Lab 5: Topic Modeling](https://drive.google.com/file/d/1SdLt2Xbiz20kca1bJtD8T27TEeDPFT1a/view?usp=sharing)

---

## 1. Exploring the data

Let's explore the data with the pandas library.

Run some of the following commands to get familiar with the dataset, then write down your observations.

- `import pandas as pd`

- `df = pd.read_csv('data.csv')`

- `df.head()`

- `df.info()`

- `df.describe()`

- `df['label'].value_counts()`

- `df['label'].value_counts().plot(kind='bar')`

- `df['text'].value_counts()`

- `df['text'].value_counts().plot(kind='bar')`

- `df['text'].value_counts().plot(kind='hist')`

- `df['text'].value_counts().plot(kind='box') `
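
As a complement to the commands above, here is a minimal sketch on a small made-up DataFrame. The rows and the `text_length` column are illustrative assumptions, not taken from `data.csv`. It looks at the class balance, the missing values, and a summary of tweet lengths, which tends to be easier to read than a chart with one bar per distinct tweet.

```python
import pandas as pd

# Toy data standing in for the real file (hypothetical rows).
toy = pd.DataFrame({
    "label": ["Positive", "Negative", "Negative", "Neutral"],
    "text": ["great game", "this update is broken", None, "patch notes are out"],
})

# Class balance: number of rows per label.
label_counts = toy["label"].value_counts()

# Missing values per column.
missing = toy.isna().sum()

# Tweet length in characters (NaN where the text is missing).
toy["text_length"] = toy["text"].str.len()
length_summary = toy["text_length"].describe()

print(label_counts)
print(missing)
print(length_summary)
```

On the real dataset the same calls would apply to `df`, assuming the column names have been cleaned first.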

In [76]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [77]:
df = pd.read_csv('twitter.csv')
df

Unnamed: 0,id,game,label,text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74680,9200,Nvidia,Positive,Just realized between the windows partition of...


In [78]:
df.head()

Unnamed: 0,id,game,label,text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      74682 non-null  int64 
 1    game   74682 non-null  object
 2    label  74682 non-null  object
 3    text   73996 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.3+ MB


In [80]:
df.describe()

Unnamed: 0,id
count,74682.0
mean,6432.586165
std,3740.42787
min,1.0
25%,3195.0
50%,6422.0
75%,9601.0
max,13200.0


In [81]:
df.value_counts()

id     game                               label       text
3018   Dota2                              Positive    Wow.      5
4050   CS-GO                              Positive    Wow       5
10181  PlayerUnknownsBattlegrounds(PUBG)  Irrelevant  Really    5
8808   Nvidia                             Positive    Wow       5
6928   johnson&johnson                    Negative    "

In [82]:
df.shape

(74682, 4)

In [9]:
df.isna().sum()


id          0
 game       0
 label      0
 text     686
dtype: int64

## 2. Cleaning the data

Let's clean the data by removing missing values, converting the text data to numeric data (TF-IDF for the tweets and 1 or 0 for the labels), and then splitting the data into a training set and a test set.

- Remove the missing values.

- Use the `catégorical` function to convert the labels to 1 or 0.

- Use `TfidfVectorizer` to convert the tweets to TF-IDF vectors.

- Split the data into a training set and a test set.


**Splitting the data into a training set and a test set:**

`from sklearn.model_selection import train_test_split`

`X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)`
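
The cleaning steps listed above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the rows are made up, the `target` column is a hypothetical name, and encoding "Negative" as 1 and everything else as 0 is one possible reading of the "1 or 0" instruction. The vectorizer is fitted on the training split only, then applied to the test split.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical rows); one text is missing on purpose.
toy = pd.DataFrame({
    "label": ["Negative", "Positive", "Negative", "Positive", "Negative",
              "Positive", "Negative", "Positive", "Negative", "Positive"],
    "text": ["bad update", "love this game", None, "great patch", "servers are down",
             "nice skin", "worst launch ever", "so much fun", "game keeps crashing", "best dlc"],
})

# 1. Remove rows with missing text (9 rows remain).
toy = toy.dropna(subset=["text"]).copy()

# 2. Binary target: 1 for a negative tweet, 0 otherwise (assumed convention).
toy["target"] = (toy["label"] == "Negative").astype(int)

# 3. Split the raw text first, so the test set stays unseen by the vectorizer.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    toy["text"], toy["target"], test_size=0.2, random_state=42
)

# 4. Fit TF-IDF on the training text, then reuse the same vocabulary on the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

print(X_train.shape, X_test.shape)
```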

In [10]:
# Note: 'nom_de_la_colone' / 'nom_de_la_colonne' are placeholders for a real column name;
# running this cell as-is raises the KeyError shown below.

# Remove the rows that contain missing values from a DataFrame df.
# To keep the change, reassign the variable: df = df.dropna()
df.dropna()


# Select a column
df['nom_de_la_colone']
    

# Remove a column from the DataFrame df.
# To keep the change, reassign the variable df.
df.drop(['nom_de_la_colone'], axis=1)


# Show the distribution of the values of the column named in brackets.
df['nom_de_la_colone'].value_counts()


# Replace qualitative values with quantitative ones:
# each value is mapped to its position in value_counts() (most frequent first).
def catégorical(df, column):
    liste_ = list(df[column].value_counts().index)
    df[column] = df[column].apply(lambda x: liste_.index(x))
    return df

# To keep the change, reassign the variable:
# df = catégorical(df, 'nom_de_la_colonne')
catégorical(df, 'nom_de_la_colonne')


KeyError: 'nom_de_la_colone'

In [83]:
df

Unnamed: 0,id,game,label,text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74680,9200,Nvidia,Positive,Just realized between the windows partition of...


In [84]:
# Note: the result is not assigned, so df itself is unchanged.
df.drop_duplicates()

Unnamed: 0,id,game,label,text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74680,9200,Nvidia,Positive,Just realized between the windows partition of...


In [85]:
df = df.dropna()
df

Unnamed: 0,id,game,label,text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74680,9200,Nvidia,Positive,Just realized between the windows partition of...


In [86]:
df[' game']

0        Borderlands
1        Borderlands
2        Borderlands
3        Borderlands
4        Borderlands
            ...     
74677         Nvidia
74678         Nvidia
74679         Nvidia
74680         Nvidia
74681         Nvidia
Name:  game, Length: 73996, dtype: object

In [87]:
def clean_column_name(name):
    return name.strip().lower().replace(' ', '_').replace('#', '')

df = df.rename(columns=clean_column_name)
df.head()

Unnamed: 0,id,game,label,text
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


In [88]:
df.isna().sum()

id       0
game     0
label    0
text     0
dtype: int64

In [89]:
df['label'].value_counts()

Negative      22358
Positive      20655
Neutral       18108
Irrelevant    12875
Name: label, dtype: int64

In [90]:
def catégorical(df, column):
    liste_ = list(df[column].value_counts().index)
    df[column] = df[column].apply(lambda x: liste_.index(x))
    return df

In [91]:
df = catégorical(df,'label')
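
`catégorical` gives each label the index of its position in `value_counts()`, which is sorted by descending frequency. With the counts shown above, that means Negative becomes 0, Positive 1, Neutral 2 and Irrelevant 3. The sketch below reproduces the same idea on a toy Series with an explicit dictionary (`label_to_id` is a hypothetical name), which makes the mapping easy to inspect and reuse.

```python
import pandas as pd

labels = pd.Series(["Negative", "Positive", "Negative", "Neutral", "Negative", "Positive"])

# value_counts() sorts by descending frequency, so the most frequent label gets 0.
ordered = list(labels.value_counts().index)
label_to_id = {name: i for i, name in enumerate(ordered)}

# Apply the mapping to every row.
encoded = labels.map(label_to_id)

print(label_to_id)       # {'Negative': 0, 'Positive': 1, 'Neutral': 2}
print(encoded.tolist())  # [0, 1, 0, 2, 0, 1]
```

Because the codes depend on class frequencies, they can change if the data changes; keeping the dictionary around avoids guessing which integer means "negative" later on.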

In [92]:
df = df.drop(['game'], axis=1)

In [93]:
df

Unnamed: 0,id,label,text
0,2401,1,im getting on borderlands and i will murder yo...
1,2401,1,I am coming to the borders and I will kill you...
2,2401,1,im getting on borderlands and i will kill you ...
3,2401,1,im coming on borderlands and i will murder you...
4,2401,1,im getting on borderlands 2 and i will murder ...
...,...,...,...
74677,9200,1,Just realized that the Windows partition of my...
74678,9200,1,Just realized that my Mac window partition is ...
74679,9200,1,Just realized the windows partition of my Mac ...
74680,9200,1,Just realized between the windows partition of...


In [94]:
!pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer






In [95]:
text = df['text']
text

0        im getting on borderlands and i will murder yo...
1        I am coming to the borders and I will kill you...
2        im getting on borderlands and i will kill you ...
3        im coming on borderlands and i will murder you...
4        im getting on borderlands 2 and i will murder ...
                               ...                        
74677    Just realized that the Windows partition of my...
74678    Just realized that my Mac window partition is ...
74679    Just realized the windows partition of my Mac ...
74680    Just realized between the windows partition of...
74681    Just like the windows partition of my Mac is l...
Name: text, Length: 73996, dtype: object

In [98]:
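def tfidf_sklearn(text):
    # Fit TF-IDF on the tweets, ignoring English stop words.
    vectorizer_sk = TfidfVectorizer(stop_words='english')
    # Caution: .toarray() builds a dense array (the cause of the MemoryError below),
    # and the function only prints the result, so it returns None.
    print(vectorizer_sk.fit_transform(text).toarray())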

In [99]:
BOW = tfidf_sklearn(text)
BOW

MemoryError: Unable to allocate 17.0 GiB for an array with shape (73996, 30764) and data type float64
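
The error comes from `.toarray()`, which turns the sparse TF-IDF matrix into a dense `float64` array of 73996 × 30764 values. A possible way around it, sketched here on a toy corpus, is to keep the matrix sparse and return it from the function. `tfidf_sparse` is a hypothetical helper name, and the three sentences are made up.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_sparse(texts):
    """Return the fitted vectorizer and the TF-IDF matrix, kept in sparse form."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)  # scipy sparse matrix, no .toarray()
    return vectorizer, matrix


texts = [
    "the servers are down again",
    "great update today",
    "servers lag after the update",
]
vectorizer, matrix = tfidf_sparse(texts)
print(sparse.issparse(matrix), matrix.shape)

# Size of the dense array from the traceback: rows * columns * 8 bytes, in GiB.
dense_gib = 73996 * 30764 * 8 / 2**30
print(round(dense_gib, 1))  # 17.0
```

Returning the vectorizer as well lets the same vocabulary be applied to new tweets with `vectorizer.transform(...)`.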

In [101]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test

(49647    Last Weekend league for Fifa 20 Glad I could f...
 43676             omg i'm so excited to watch dk play pubg
 55915                    all others who have problems with
 14927                                                   in
 44039    minho, felix de jeongin sucked at pubg pretty ...
                                ...                        
 37541                                                  you
 6332                I'm not even going to show a 7-2 loss.
 55392                      Fuck this call of duty update..
 864      I should get up & feed my dogs & such that way...
 15956                        Welcome to The International!
 Name: text, Length: 59196, dtype: object,
 61413    Looks to me like he failed to check out the wa...
 44887    Wow, it takes all sorts of crazy people out th...
 73662    Nvidia Unveils The World’s Fastest Gaming Moni...
 36694    Huge radio play here. Reinvention / Corporate ...
 2308                            SO I HAPPY WHO ABOUT THI

In [102]:
df

Unnamed: 0,id,label,text
0,2401,1,im getting on borderlands and i will murder yo...
1,2401,1,I am coming to the borders and I will kill you...
2,2401,1,im getting on borderlands and i will kill you ...
3,2401,1,im coming on borderlands and i will murder you...
4,2401,1,im getting on borderlands 2 and i will murder ...
...,...,...,...
74677,9200,1,Just realized that the Windows partition of my...
74678,9200,1,Just realized that my Mac window partition is ...
74679,9200,1,Just realized the windows partition of my Mac ...
74680,9200,1,Just realized between the windows partition of...


## 3. Building the supervised classification model

Let's build a supervised classification model, train it on the training set, evaluate its performance on the test set, and display the results.

- Use the following supervised classification models:

    - [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
    - [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
    - [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
    - [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
    - [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
    - (Bonus) [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
    


- Train the model on the training set.

- Evaluate the model's performance on the test set.

- Display the results.
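
One way to follow these steps for several models at once is to wrap each classifier in a pipeline with the vectorizer and loop over them. The sketch below assumes a tiny made-up corpus with 1 for positive and 0 for negative, so the scores it prints say nothing about the real dataset. `GaussianNB` is left out here because it needs dense input, which brings back the memory issue seen earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy corpus (hypothetical): 1 = positive, 0 = negative.
train_texts = [
    "I love this game", "great update, well done", "this patch is awesome",
    "best release so far", "the servers are down again", "worst update ever",
    "the game keeps crashing", "terrible lag tonight",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["awesome game", "servers keep crashing"]
test_labels = [1, 0]

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "SVC": SVC(),
}

scores = {}
for name, clf in models.items():
    # The pipeline fits TF-IDF on the training text only, then the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    scores[name] = pipeline.score(test_texts, test_labels)  # accuracy on the test set

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```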

In [103]:
df[:500]

Unnamed: 0,id,label,text
0,2401,1,im getting on borderlands and i will murder yo...
1,2401,1,I am coming to the borders and I will kill you...
2,2401,1,im getting on borderlands and i will kill you ...
3,2401,1,im coming on borderlands and i will murder you...
4,2401,1,im getting on borderlands 2 and i will murder ...
...,...,...,...
496,2484,3,"@Joltzdude139 Hey Joltz, im a big fan and seei..."
497,2484,3,"@Joltzdude139 v Joltz, im a big fan out seeing..."
498,2486,2,"Guns, Love, and Tentacles is out now, and here..."
499,2486,2,"Guns, Love, and Tentacles is out now, and here..."


In [104]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=42)

vectorizer = CountVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vect, y_train)

y_pred = model.predict(X_test_vect)

from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.845


In [105]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=42)

# Convert the tweets to numeric feature vectors
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the KNN model on the training data
k = 5  # number of neighbours to consider
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Evaluate the model's performance on the test data
accuracy = knn.score(X_test, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.8656756756756757


In [106]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7493243243243243
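
Accuracy alone does not show which classes get confused with each other. As a hedged illustration, the sketch below computes a confusion matrix and a per-class report on made-up predictions, assuming the 0 to 3 label encoding used in this notebook (0 = Negative, 1 = Positive, 2 = Neutral, 3 = Irrelevant). The `y_true` and `y_hat` values are invented for the example.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_hat = [0, 1, 1, 1, 2, 0, 3, 3]

class_ids = [0, 1, 2, 3]
class_names = ["Negative", "Positive", "Neutral", "Irrelevant"]

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=class_ids)

# Precision, recall and F1 for each class.
report = classification_report(
    y_true, y_hat, labels=class_ids, target_names=class_names, zero_division=0
)

print("Accuracy:", acc)  # 0.75
print(cm)
print(report)
```

For bad-buzz detection, the recall of the Negative row is probably the number to watch, since a missed negative tweet is a missed alert.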


## 4. Connecting to the `openai` API

Let's connect our model to the `openai` API to reply, via the `ChatCompletion` module, to the negative tweets our model detects.

- API key: do not hard-code it in the notebook; store it in the `OPENAI_API_KEY` environment variable.

In [107]:
import os
import openai
# Read the key from the environment instead of committing it to the notebook.
openai.api_key = os.environ['OPENAI_API_KEY']

In [143]:
def rep_tweet(n):
    """Print tweet n, classify it, and ask the chat model for a reply if it is negative."""
    tweet = df.text.values[n]
    print(tweet)
    # Reuse the vectorizer fitted together with `model` (cell In [106]);
    # refitting a new one here would change the feature space.
    prediction = model.predict(vectorizer.transform([tweet]))[0]
    reponse = None  # stays None unless the tweet is classified as negative
    if prediction == 1:
        print("Tweet is positive")
    if prediction == 0:
        print("Tweet is negative")
        # Legacy ChatCompletion interface (openai < 1.0); the keyword is `messages`.
        reponse = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "I am an advisor on comment results."},
                {"role": "user", "content": "The biggest disappointment of my life came a year ago."},
                {"role": "system", "content": "I determine that this comment is negative."},
                {"role": "user", "content": tweet},
            ],
        )
    return reponse
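
Keeping the prompt construction separate from the API call makes it possible to check the payload without a network connection or an API key. The sketch below is one way to do that; `build_messages`, its `brand_tone` parameter and the prompt wording are all hypothetical, not part of the assignment or of the `openai` library.

```python
def build_messages(tweet, brand_tone="polite and helpful"):
    """Build the chat messages used to reply to one negative tweet (hypothetical helper)."""
    if not tweet or not tweet.strip():
        raise ValueError("tweet must be a non-empty string")
    system_prompt = (
        "You are a community manager. Reply to the customer's negative tweet "
        f"in a {brand_tone} tone, in at most two sentences."
    )
    # The chat format is a list of {"role", "content"} dictionaries.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": tweet.strip()},
    ]


messages = build_messages("  The servers are down again.  ")
for message in messages:
    print(message["role"], "->", message["content"])
```

The returned list has the shape expected by the `messages` parameter of a chat completion request.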

## 5. Bonus: Topic Modeling

Identify the most discussed topics in the negative tweets using topic modeling methods.

In [149]:
!pip install gensim
!pip install spacy
!python -m spacy download fr_core_news_sm






In [150]:
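import pandas as pd
import spacy
import gensim

# Note: the tweets are in English, so an English pipeline (e.g. en_core_web_sm) would
# likely suit them better; the French model is kept here as in the original run.
nlp = spacy.load('fr_core_news_sm')
# Keep only the negative tweets (label 0 after the encoding done in section 2).
text = df.loc[df['label'] == 0, 'text'].tolist()

# Tokenization: lemmatize, lowercase, drop stop words and non-alphabetic tokens
text_tokens = []
for doc in nlp.pipe(text):
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and token.is_alpha]
    text_tokens.append(tokens)

# Build the dictionary (token <-> id mapping)
dictionary = gensim.corpora.Dictionary(text_tokens)

# Build the bag-of-words corpus
corpus = [dictionary.doc2bow(tokens) for tokens in text_tokens]

# Train the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=10)

# Display the topics
topics = lda_model.print_topics(num_topics=10, num_words=10)
for topic in topics:
    print(topic)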

KeyboardInterrupt: 