# Live sentiment analysis - Demo day

This notebook is the main entry point to run the live sentiment analysis by using the Twitter API and the dashboarding tools selected so far, as well as a messages broker to process the incoming Twitter data.

The diagram below shows the overall architecture of the services to run.

![Architecture diagram](./imgs/architecture.png "General Architecture")

Major services will be hosted in the cloud with managed alternatives:

* Kafka: Hosted by Confluent cloud under the free tier.
* Elasticsearch: Hosted by Elastic under the free tier.

Both hosted providers make the development easier by giving guidance on how to connect and setup clients for their respective platforms.

## Python producer

A Python producer was created with the [onboarding steps](https://docs.confluent.io/platform/current/tutorials/examples/clients/docs/python.html?utm_source=github&utm_medium=demo&utm_campaign=ch.examples_type.community_content.clients-ccloud) obtained in Confluent Cloud.

> Special steps were followed to install the `confluent-kafka` python library in an M1 machine, see [this issue's comment](https://github.com/confluentinc/confluent-kafka-python/issues/1190#issuecomment-1195952767).

Once the steps were followed in the above linked resources, the first messages were seen in Confluent cloud, these messages were not related to Twitter's data, but to simple integer messages.

![Messages produced via CLI](./imgs/cli-sample-producer.png)

The image above shows the CLI sample producer obtained from the Confluent Cloud setup page, it shows the application working properly with the installed dependencies. This base code will be employed to build the actual producer to send Tweet data into the topic.

![Messages in Confluent](./imgs/confluent-messages.png "Messages in confluent cloud")

The above image shows the produces messages in the Confluent Cloud UI.


### Producing Tweets

By using the Twitter client seen during the Project update N. 2 ([notebook](./hugging-face.ipynb)) the code below retrieves sample tweet data and produces it into the Kafka client.

The code below expects a configuration file to provide the Kafka provider with the right parameters to connect to the Confluent cluster. The `ccloud_lib` utilities file was borrowed from the original sample code from Confluent.

In [32]:
from confluent_kafka import Producer, KafkaError
import ccloud_lib
import json

In [43]:
class BaseProducer:
    """Defines the basic connectivity to reach Kafka instance hosted in Confluent Cloud"""
    def __init__(self, config_file, topic):
        """Creates a BaedProducer with the provided configuration and topic
        """

        conf = ccloud_lib.read_ccloud_config(config_file)

        producer_conf = ccloud_lib.pop_schema_registry_params_from_config(conf)
        self.producer = Producer(producer_conf)

        self.topic = topic
        ccloud_lib.create_topic(conf, topic)

        self.delivered_records = 0

    def acked(self, err, msg):
        """Delivery report handler called on
        successful or failed delivery of message
        """
        if err is not None:
            print("Failed to deliver message: {}".format(err))
        else:
            self.delivered_records += 1
            print("Produced record to topic {} partition [{}] @ offset {}".format(msg.topic(), msg.partition(), msg.offset()))

    def produce(self, values):
        for v in values:
            record_key = str(v['id'])
            record_value = json.dumps(v)
            
            print("Producing record: {}: {}".format(record_key, record_value))
            
            self.producer.produce(self.topic, key=record_key, value=record_value, on_delivery=self.acked)
            # p.poll() serves delivery reports (on_delivery)
            # from previous produce() calls.
            self.producer.poll(0)

        self.producer.flush()

        print("{} messages were produced to topic {}!".format(self.delivered_records, self.topic))


Before continuing, the `BaseProducer` needs to be somehow tested, so that it can be fixed in case we are not able to produce data. Let's start by loading some of the available sample data we have.

In [5]:
import pandas as pd

In [37]:
def sample_data(airline, filename, sample_size=10):
    aircanada_df = pd.read_csv(filename)
    aircanada_sample_df = aircanada_df.sample(sample_size)

    for i in range(aircanada_sample_df.iloc[:, 1:].shape[0]):
        row_json = aircanada_df.iloc[i, 1:].to_dict()
        row_json['airline'] = airline
        yield row_json

In [44]:
gen = sample_data('aircanada', './fresh_data/aircanada_sample_07072022_210411.csv')

In [45]:
producer = BaseProducer('./config.properties', 'twitter-data-test')

%4|1658881047.827|CONFWARN|rdkafka#producer-9| [thrd:app]: Configuration property session.timeout.ms is a consumer property and will be ignored by this producer instance
%4|1658881047.832|CONFWARN|rdkafka#producer-10| [thrd:app]: Configuration property session.timeout.ms is a consumer property and will be ignored by this producer instance


In [46]:
producer.produce(gen)

Producing record: 1545212028247220225: {"created_at": "2022-07-08T01:03:50.000Z", "id": 1545212028247220225, "text": "RT @mcdowell_norm: @OmarAlghabra @AirCanada Best thing you can do is RESIGN!", "airline": "aircanada"}
Producing record: 1545212013227479040: {"created_at": "2022-07-08T01:03:47.000Z", "id": 1545212013227479040, "text": "@OmarAlghabra @AirCanada End the masks. End the mandates. End the ArriveCAN app. Problem solved", "airline": "aircanada"}
Producing record: 1545211993564610561: {"created_at": "2022-07-08T01:03:42.000Z", "id": 1545211993564610561, "text": "@OmarAlghabra @AirCanada Remember you recently said out -of-practice travellers  are causing delays at security checkpoints ?  I do", "airline": "aircanada"}
Producing record: 1545211956495130625: {"created_at": "2022-07-08T01:03:33.000Z", "id": 1545211956495130625, "text": "@OmarAlghabra @AirCanada You have to return to the operations that were in place in 2019 and prior. Eliminate all mandates, apps, and restriction

With the above running we can see how the twitter data becomes available inside the Kafka Topic.

![Twitter data Kafka test](./imgs/twitter-data-test.png "Twitter Kafka test")