# Sentiment Analysis using Forbes

#### WHAT IS KAFKA ?

Event streaming is the practice of capturing data in real-time from event sources like databases, sensors,
mobile devices, cloud services, and software applications in the form of streams of events; storing these
event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in
real-time as well as retrospectively; and routing the event streams to different destination technologies
as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right
information is at the right place, at the right time.

#### EVENTS

An event records the fact that “something happened” in the world or in your business. It is also called record
or message in the documentation. When you read or write data to Kafka, you do this in the form of events.
Conceptually, an event has a key, value, timestamp, and optional metadata headers. Here’s an example
event:

- Event key : " Alice "
- Event value : " Made a payment of $200 to Bob "
- Event timestamp : " Jun . 25 , 2020 at 2:06 p . m ."

#### PRODUCERS / CONSUMERS

Producers are those client applications that publish (write) events to Kafka, and consumers are those that
subscribe to (read and process) these events. In Kafka, producers and consumers are fully decoupled and
agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for.
For example, producers never need to wait for consumers.

#### TOPICS

Events are organized and durably stored in topics. Topics in Kafka are always multi-producer and multisubscriber: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many
consumers that subscribe to these events. Topics in Kafka are always multi-producer and multi-subscriber: a
topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers
that subscribe to these events. Events in a topic can be read as often as needed, you define for how long
Kafka should retain your events through a per-topic configuration setting, after which old events will be
discarded.

![Illustrations](image_kafka2.png)

#### PROJECT SUMMARY

The goal of this project is to create an end-to-end Machine Learning project, including :
- extract tweets of specifics topics from Twitter, in real-time using Apache Kafka
- transform, using you trained-model for sentiments analysis classification
- load data into a data-warehouse using PostgreSQL
- real-time dashboard, to monitor the results for each topics using PowerBI 
Each parts can be start independently.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
time.sleep(5) # Pause de 5 secondes après le chargement de la page

import pandas as pd
import warnings
import nltk
warnings.filterwarnings("ignore")
import numpy as np
import matplotlib.pyplot as plt 
from confluent_kafka import Producer, Consumer
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from io import StringIO
from xgboost import plot_importance
from sklearn.metrics import precision_score, recall_score, confusion_matrix

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/macbookpro/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


#### EXTRACT

First you need to install Kafka (and Zookeeper to manage it) on your system. At least 4GB of RAM is needed.
You can use this tutorial : https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafkaon-ubuntu-18-04
You will use the website forbes.com and try to make your code generic in a way that it caters to multiple themes and publish them into the Kafka server using topics. You can have multiple #keyword to monitor for one specific show.

In [2]:
def get_data(query):
    driver = webdriver.Chrome()
    driver.get(f"https://www.forbes.com/search/?q={query}")


    # temps pour que e boutton "More articles" soit cliquable
    more_articles_button = WebDriverWait(driver, 30).until(
        EC.element_to_be_clickable((By.CLASS_NAME, 'search-more'))
    )

    # Cliquer sur le bouton "More articles" jusqu'à ce qu'il n'y en ait plus
    while more_articles_button:
        try:
            more_articles_button.click()
            more_articles_button = WebDriverWait(driver, 30).until(
                EC.element_to_be_clickable((By.CLASS_NAME, 'search-more'))
            )
        except:
            break

    # récupérez le contenu dès que tous les articles sont chargés, 
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    all_elt = soup.find_all("article", class_="stream-item et-promoblock-removeable-item et-promoblock-star-item")
    titles = []
    dates = []
    url_articles = []
    contents = []

    # Récupération des contenus de chaque élément
    for elt in all_elt:
        title = elt.find("a", class_="stream-item__title").text

        #vérifier si l'élément a été trouvé
        url_article_elt = elt.find("a", class_="stream-item__image ratio16x9")
        
        if url_article_elt:
            url_article = url_article_elt["href"]
        else : 
            url_article = "Nan"
            
        content = elt.find("div", class_="stream-item__description").text
        date = elt.find("div", class_="stream-item__date").text

        # Ajout des contenus à la liste
        titles.append(title)
        dates.append(date)
        contents.append(content)
        url_articles.append(url_article)

    # Création du DataFrame
    data = pd.DataFrame({"Title": titles, "Date": dates, "Urls": url_articles, "Content": contents})
    
    # Fermer le navigateur après avoir terminé
    driver.quit()
    
    return data

# Appeler la fonction avec l'URL openAI
query = "Finance"
df = get_data(query)
df


TimeoutException: Message: 
Stacktrace:
0   chromedriver                        0x0000000102a0ae18 chromedriver + 4627992
1   chromedriver                        0x0000000102a02b43 chromedriver + 4594499
2   chromedriver                        0x0000000102600e4a chromedriver + 392778
3   chromedriver                        0x000000010264c41d chromedriver + 701469
4   chromedriver                        0x000000010264c5b1 chromedriver + 701873
5   chromedriver                        0x0000000102690214 chromedriver + 979476
6   chromedriver                        0x000000010266e89d chromedriver + 841885
7   chromedriver                        0x000000010268d6d9 chromedriver + 968409
8   chromedriver                        0x000000010266e613 chromedriver + 841235
9   chromedriver                        0x000000010263f3da chromedriver + 648154
10  chromedriver                        0x000000010263fd1e chromedriver + 650526
11  chromedriver                        0x00000001029ca890 chromedriver + 4364432
12  chromedriver                        0x00000001029cfc41 chromedriver + 4385857
13  chromedriver                        0x00000001029afb2e chromedriver + 4254510
14  chromedriver                        0x00000001029d0969 chromedriver + 4389225
15  chromedriver                        0x00000001029a1e69 chromedriver + 4197993
16  chromedriver                        0x00000001029f1b78 chromedriver + 4524920
17  chromedriver                        0x00000001029f1d57 chromedriver + 4525399
18  chromedriver                        0x0000000102a02783 chromedriver + 4593539
19  libsystem_pthread.dylib             0x00007ff8061141d3 _pthread_start + 125
20  libsystem_pthread.dylib             0x00007ff80610fbd3 thread_start + 15


#### TRANSFORM

First you need to create a Sentiment Analysis model with the IMDB database using Scikit-learn (or XGBoost).

Dataset :  imdb.csv 
Using python, create a Kafka “consumer” with your trained-model to classify each articles from Kafka.

In [None]:
def read_csv_with_encoding(file_path, encodings=['utf-8', 'latin-1', 'ISO-8859-1']):
    for encoding in encodings:
        try:
            with open(file_path, 'rb') as f:
                content = f.read()
                decoded_content = content.decode(encoding, errors='replace')
            return pd.read_csv(StringIO(decoded_content))
        except UnicodeDecodeError:
            continue
    raise UnicodeDecodeError(f"Unable to decode the file {file_path} with the provided encodings.")

# Exemple d'utilisation
file_path = "./imdb.csv"
data = read_csv_with_encoding(file_path)


# Supposons que df est votre DataFrame
label_encoder = LabelEncoder()

# Appliquer LabelEncoder aux colonnes d'objets
for column in data.select_dtypes(include=['object']).columns:
    data[column] = label_encoder.fit_transform(data[column])

df = df.dropna(subset=['label'])
df

In [None]:
def data_categorization(data):
    """
    Cette fonction prend la donnée d'origine (data) et renvoie une version catégorisée de data.
    Les classes sont représentées par des entiers.
    """
    new_data = data.copy()
    class_mapping = {"neg": 0, "pos": 1, "unsup": 2}
    new_data['label'] = new_data['label'].map(class_mapping)
    return new_data


In [None]:
def get_split_data(data, label_column="label", feature_columns=["type", "review", "file"]):
    """
    Cette fonction prend la donnée d'origine (data) et crée une version catégorisée de cette donnée.
    Elle renvoie les données fractionnées (données d'entraînement et données de test).
    """
    new_data = data_categorization(data)
    X = new_data[feature_columns]
    y = new_data[label_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test


In [None]:
X_train,X_test,y_train,y_test = get_split_data(data_imdb)
X_train,X_test,y_train,y_test

In [None]:
def get_model(X_train, y_train):
    """
    Cette fonction prend X_train et y_train et renvoie un modèle XGBoost entraîné sur ces derniers.
    """
    model = XGBClassifier()
    model.fit(X_train, y_train)
    return model

In [None]:
model_set = get_model(X_train,y_train)

In [None]:
def data_standardization(X_train, X_test, model_type="XGBoost"):
    """
    Cette fonction prend en paramètres X_train et X_test et retourne X_train et X_test standardisés.
    Pour XGBoost, elle renvoie simplement les données telles quelles, car XGBoost n'exige généralement pas la standardisation.
    """
    if model_type == "XGBoost":
        return X_train, X_test
    else:
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        return X_train_scaled, X_test_scaled


In [None]:
X_train_scaled,X_test_scaled = data_standardization(X_train,X_test)

In [None]:
def plot_xgboost_importance(X_train, y_train, X_test, y_test, params=None, num_boost_round=100):
    """
    Cette fonction prend X_train, y_train, X_test et y_test comme données d'entraînement et de test,
    et affiche l'importance des fonctionnalités à l'aide de XGBoost ainsi que les indicateurs de performance.

    Parameters:
    - X_train: DataFrame, features d'entraînement.
    - y_train: Series, labels d'entraînement.
    - X_test: DataFrame, features de test.
    - y_test: Series, labels de test.
    - params: dict, paramètres du modèle XGBoost (par défaut, utilisera des paramètres par défaut).
    - num_boost_round: int, nombre d'itérations d'entraînement du modèle XGBoost.

    Returns:
    - None (affiche le graphique d'importance des fonctionnalités et les indicateurs de performance).
    """
    if params is None:
        # Utilisez des paramètres par défaut si aucun paramètre n'est fourni
        params = {'objective': 'multi:softmax', 'num_class': len(set(y_train)), 'eval_metric': 'mlogloss'}

    # Créer un objet DMatrix pour les données XGBoost
    dtrain = xgb.DMatrix(X_train, label=y_train)

    # Entraîner le modèle XGBoost
    model = xgb.train(params, dtrain, num_boost_round=num_boost_round)

    # Afficher l'importance des fonctionnalités
    plot_importance(model)
    plt.show()

    # Calculer et afficher les indicateurs de performance
    y_pred = model.predict(xgb.DMatrix(X_test))
    precision = precision_score(y_test, y_pred, average='macro')
    recall = recall_score(y_test, y_pred, average='macro')
    cm = confusion_matrix(y_test, y_pred)
    # Afficher les indicateurs de performance
    print('\nPerformance Indicators')
    print("==========================================")
    print(f'Precision: {precision:.2f}')
    print(f'Recall: {recall:.2f}')
    print(f'Confusion Matrix:\n{cm}')


In [None]:
plot_xgboost_importance(X_train, y_train, X_test, y_test)

#### LOAD

Make another Kafka consumer to export articles and their label inside a PostgreSQL.

#### REAL-TIME DASHBOARD

Create one PowerBI dashboard connected on the Kafka in order to monitor in real-time the number of
articles coming by topics
Create another PowerBI conencted to the PostgreSQL database to monitor the results of your classifier.