In [None]:
# Stream Processing 2nd Assignment

In [None]:
!pip install river fastapi kaleido python-multipart uvicorn scikit-learn pandas streamlit altair



In [None]:
!pip install -U accelerate

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.26.1


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Training Data - Predict if text is hardware or software
data = [("my unit test failed", "software"),
        ("tried the program, but it was buggy", "software"),
        ("i need a new power supply", "hardware"),
        ("the drive has a 2TB capacity", "hardware"),
        ("unit-tests", "software"),
        ("program", "software"),
        ("power supply", "hardware"),
        ("drive", "hardware"),
        ("it needs more memory", "hardware"),
        ("check the API", "software"),
        ("design the API", "software"),
        ("they need more CPU", "hardware"),
        ("code", "software"),
        ("i found some bugs in the code", "software"),
        ("i swapped the memory", "hardware"),
        ("i tested the code", "software")]

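# Duplicating the data to increase its size.
# Caveat: the split below runs after duplication, so copies of the same sentence
# end up in both the training and validation sets and the accuracy is optimistic.
data *= 5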

# Split the data into training and validation sets
train_data, valid_data = train_test_split(data, test_size=0.2, random_state=42)

# Create a pipeline with CountVectorizer and Multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))

# Training the model
train_texts, train_labels = zip(*train_data)
model.fit(train_texts, train_labels)

# Validation
valid_texts, valid_labels = zip(*valid_data)
predictions_valid = model.predict(valid_texts)

# Calculate accuracy on the validation set
accuracy = accuracy_score(valid_labels, predictions_valid)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")


Validation Accuracy: 100.00%
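
Because the examples are duplicated before the split, identical sentences appear in both the training and validation sets, which makes a perfect score easy to reach. A minimal sketch of one alternative, assuming a shortened copy of the data and hypothetical names (`unique_data`, `leak_free_model`): split the unique sentences first, then duplicate only the training portion.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A shortened copy of the labelled examples above (illustrative subset).
unique_data = [
    ("my unit test failed", "software"),
    ("i found some bugs in the code", "software"),
    ("check the API", "software"),
    ("i tested the code", "software"),
    ("i need a new power supply", "hardware"),
    ("the drive has a 2TB capacity", "hardware"),
    ("it needs more memory", "hardware"),
    ("i swapped the memory", "hardware"),
]

# Split the unique sentences first, so no sentence can land in both sets.
train_unique, valid_unique = train_test_split(
    unique_data, test_size=0.25, random_state=42
)

# Duplicate only the training portion, if more training rows are wanted.
train_dup = train_unique * 5

leak_free_model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
dup_texts, dup_labels = zip(*train_dup)
leak_free_model.fit(dup_texts, dup_labels)

held_out_texts, held_out_labels = zip(*valid_unique)
leak_free_accuracy = accuracy_score(
    held_out_labels, leak_free_model.predict(held_out_texts)
)
print(f"Leak-free validation accuracy: {leak_free_accuracy * 100:.2f}%")
```

With so few held-out sentences the score is noisy, but it is no longer inflated by sentences the model has already seen.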



**Advantages:**

1. **Simplicity and Speed:**
   - The model is simple to understand and implement, making it easy for quick development and deployment.
   - Training and prediction times are generally fast, making it efficient for processing large volumes of text.

2. **Interpretability:**
   - The model's decisions are relatively interpretable: it learns a per-class log-probability for each word in the bag-of-words vocabulary (`feature_log_prob_` in scikit-learn), so analysts can see which words push a text toward a particular class.

3. **Works Well with Limited Data:**
   - Naive Bayes models, including Multinomial Naive Bayes, often perform well even with limited amounts of training data.

4. **Scalability:**
   - The bag-of-words approach can be scalable to handle large datasets and is often used in real-world applications.

5. **Applicability to Multiple Categories:**
   - Multinomial Naive Bayes is natively multiclass, so more categories can be handled simply by supplying training examples with the additional labels.

**Limitations:**

1. **Bag-of-Words Simplification:**
   - The bag-of-words representation discards word order and grammar, losing some context and potentially important information. It treats each word independently.

2. **Limited Semantic Understanding:**
   - The model may struggle to capture the semantic meaning of words or phrases, leading to less nuanced understanding of the text.

3. **Handling Out-of-Vocabulary Words:**
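   - Words that were not present in the training data (out-of-vocabulary words) are silently dropped by `CountVectorizer` at prediction time. A text made up entirely of unseen words is therefore classified from the class priors alone, so the model does not generalize to new terms.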

4. **Sensitive to Noisy Data:**
   - The model is sensitive to noisy or irrelevant features. Words that are common but not discriminative might influence the classification.

5. **Not Suitable for Complex Relationships:**
   - Naive Bayes assumes that features are conditionally independent given the class, which rarely holds for words in natural language. This can limit its performance in capturing complex relationships within the data.

6. **Limited in Handling Negations and Word Order:**
   - Negations (e.g., "not good") and word order are not explicitly captured, potentially leading to misclassifications in cases where the order of words is crucial.

7. **Difficulty with Polysemy:**
   - Words with multiple meanings (polysemy) can pose challenges. The model might struggle to discern the correct meaning based on the context.

8. **Dependence on Feature Quality:**
   - The performance heavily relies on the quality of the features (words) selected. Stop words and common terms might dominate the feature space.

9. **Imbalance Handling:**
   - Imbalances in class distribution can affect the model's ability to generalize well, especially if one class has significantly fewer instances than the others.

10. **Lack of Memory:**
    - The model classifies each text in isolation and keeps no state between predictions, making it less effective for tasks where context over longer sequences is essential.
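
Two of the limitations above, the loss of word order and the handling of out-of-vocabulary words, can be checked directly on the vectorizer. A small sketch with made-up sentences (the names `bow`, `same_vector` and `all_zero` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer()
bow.fit(["the drive needs more memory", "the code needs more tests"])

# Same words in a different order: the count vectors are identical.
a = bow.transform(["the code needs more memory"]).toarray()
b = bow.transform(["more memory needs the code"]).toarray()
same_vector = bool((a == b).all())

# A text made only of unseen words maps to an all-zero vector.
unseen = bow.transform(["gpu overheating"]).toarray()
all_zero = bool(unseen.sum() == 0)

print(same_vector, all_zero)
```

Both checks come out true: the classifier cannot tell the two orderings apart, and it receives no evidence at all from the unseen words.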

**Creative Twist:**
   - Imagine this model as a detective solving a case by analyzing a series of witness statements (words). The detective relies on individual words (clues) without considering the overall narrative structure, potentially missing nuances in the story. While effective for quick analyses, it might struggle with complex cases that require a deeper understanding of the context and relationships between elements.




---



### **Model Two**

In [None]:
!pip install -q opendatasets

import opendatasets as od
od.download('https://www.kaggle.com/datasets/kazanova/sentiment140') # insert your Kaggle username and key

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: ankshukray
Your Kaggle Key: ··········
Downloading sentiment140.zip to ./sentiment140


100%|██████████| 80.9M/80.9M [00:00<00:00, 193MB/s]





In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load Sentiment140 dataset (download it and upload to Colab)
url = '/content/sentiment140/training.1600000.processed.noemoticon.csv'
df = pd.read_csv(url, encoding='latin-1', header=None, names=['target', 'id', 'date', 'flag', 'user', 'text'])

# Let's focus on the text and target columns
df = df[['text', 'target']]

# Select a smaller subset of the data (adjust the number as needed)
df_subset = df.sample(frac=0.1, random_state=42)

# Split the data
train_data, test_data, train_labels, test_labels = train_test_split(df_subset['text'], df_subset['target'], test_size=0.2, random_state=42)

# Stream the data.
# stream_data is a generator that yields one (tweet, label) pair at a time, so the
# training loop below handles a single example per step instead of a whole batch.
def stream_data(data, labels):
    for tweet, label in zip(data, labels):  # pair each tweet with its label
        yield tweet, label  # yield makes this function a generator

# Set up the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer,
# dropping English stop words and keeping the 5000 most frequent terms
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Fit the vectorizer on the training data: this learns the vocabulary and the IDF
# weights; the text is converted to TF-IDF vectors later, by transform()
vectorizer.fit(train_data)

# Initialize the Naive Bayes classifier
model = MultinomialNB()
# Training the model in a streaming fashion
for tweet, label in stream_data(train_data, train_labels):
    # Transform the text data
    tweet_vectorized = vectorizer.transform([tweet])

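    # Update the model. Sentiment140 labels negative tweets 0 and positive tweets 4,
    # so the declared classes must be [0, 4]. With [0, 1] the label 4 is undeclared
    # and the positive class cannot be learned, which is consistent with the
    # near-chance accuracy recorded below.
    model.partial_fit(tweet_vectorized, [label], classes=[0, 4])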

# Now, let's test the model on the test set
test_data_vectorized = vectorizer.transform(test_data)
predictions = model.predict(test_data_vectorized)
print(predictions)

# Evaluate the accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 0.4961875
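
The loop above still fits the vectorizer on the whole training set before streaming begins. A minimal sketch of a variant that avoids that step, assuming made-up tweets and hypothetical names (`toy_stream`, `mini_batches`, `stream_model`): `HashingVectorizer` is stateless, so each mini-batch can be transformed without a prior pass over the data. This swaps TF-IDF weighting for plain hashed counts, so it is a different feature representation from the one used above.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up tweets; Sentiment140 uses 0 for negative and 4 for positive.
toy_stream = [
    ("i love this phone", 4), ("worst day ever", 0),
    ("what a great morning", 4), ("i hate waiting in line", 0),
    ("so happy with the result", 4), ("this is awful", 0),
]

def mini_batches(pairs, batch_size):
    """Yield (texts, labels) lists of at most batch_size items."""
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        texts, labels = zip(*chunk)
        yield list(texts), list(labels)

# Stateless vectorizer: there is no fit step.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
hasher = HashingVectorizer(n_features=2**12, alternate_sign=False, norm=None)
stream_model = MultinomialNB()

for texts, labels in mini_batches(toy_stream, batch_size=2):
    X_batch = hasher.transform(texts)
    # classes must list every label that can appear anywhere in the stream.
    stream_model.partial_fit(X_batch, labels, classes=[0, 4])

stream_predictions = stream_model.predict(
    hasher.transform(["i love mornings", "awful waiting"])
)
print(stream_predictions)
```

Mini-batches also cut the per-call overhead of updating the model one tweet at a time.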


**Advantages:**

1. **Real-Time Sentiment Analysis Extravaganza:**
   Our model thrives in the fast-paced realm of real-time sentiment analysis. With its streaming data capabilities, it can adapt on the fly, capturing the heartbeat of sentiments as they surge through the digital landscape.

2. **Streaming Elegance:**
   By embracing the beauty of streaming data, our model stays nimble and responsive. It doesn't just analyze static snapshots; it dances with the dynamic rhythm of the ever-changing sentiment streams, capturing the essence of evolving opinions.

3. **Memory-Friendly Charisma:**
   Because `partial_fit` updates the model one tweet at a time, training never has to hold the full TF-IDF matrix in memory. Our creation is a minimalist virtuoso with a frugal appetite, making it an ideal companion even in resource-constrained environments like Google Colab.

4. **Adaptable to Diverse Sentiment Symphony:**
   This sentiment maestro hears both sides of the story: it is trained on the negative and the positive tweets that make up Sentiment140, which span a wide range of everyday language.

**Limitations:**

1. **Miniature Training Terrain:**
   Our model, though swift, has a limitation: it has been trained on only a 10% sample of the Sentiment140 dataset (`frac=0.1`), of which 80% is used for training. While it flaunts its agility, it might not capture the entire sentiment spectrum present in the vast oceans of social media.

2. **Streaming but Sequential:**
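   The model learns one tweet at a time, like reading a book one page at a time. For Multinomial Naive Bayes this costs little, because it only accumulates per-class word counts and reaches the same result as a single batch fit. The real catch is the `TfidfVectorizer`: it is fitted on the whole training set before streaming begins, so its vocabulary and IDF weights are frozen and the pipeline is not streaming end to end.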

3. **Limited Sentiment Dimensions:**
   Our creation, while proficient in distinguishing between positive and negative sentiments, doesn't delve into the rich tapestry of subtle emotions. It's like a painter using a limited palette; the shades of sentiment it can discern are bold but might lack nuance.

4. **The Echo Chamber Conundrum:**
   Since our model is trained on a subset, it might echo the sentiments prevalent in that smaller space. It's like a philosopher who, having read a limited set of books, might offer profound insights but could miss the broader philosophical discourse.
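
One way to probe the "Streaming but Sequential" concern is to compare a model updated one example at a time with a model fitted on the same data in a single batch. The sketch below uses made-up sentences and plain word counts rather than the notebook's TF-IDF features; the names `batch_model` and `incremental_model` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["i love this", "this is awful", "great stuff", "i hate this", "love it"]
labels = [4, 0, 4, 0, 4]

# Fixed features, so both models see exactly the same inputs.
X = CountVectorizer().fit_transform(texts)

# One model sees everything at once.
batch_model = MultinomialNB().fit(X, labels)

# The other is updated one example at a time.
incremental_model = MultinomialNB()
for i in range(X.shape[0]):
    incremental_model.partial_fit(X[i], [labels[i]], classes=[0, 4])

same_word_stats = bool(np.allclose(
    batch_model.feature_log_prob_, incremental_model.feature_log_prob_
))
same_priors = bool(np.allclose(
    batch_model.class_log_prior_, incremental_model.class_log_prior_
))
print(same_word_stats, same_priors)
```

Under these assumptions the two models end up with the same parameters, which suggests that for this classifier the order and granularity of updates matter less than how the features are built.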

In this grand theatre of sentiment analysis, our model dons the hat of a dynamic performer, enchanting the audience with its streaming elegance. Yet, like any artistic creation, it has its nuances and limitations that add to its unique character. 🎭🌟