# Phase 4 Project — NLP Sentiment on Apple vs Google Tweets
_Group 1 Project — Moringa School_  


---

### Abstract

This project applies Natural Language Processing (NLP) to analyze public sentiment on Apple and Google products using a dataset of approximately 9,000 tweets collected by CrowdFlower and hosted on data.world. Each tweet is human‑labeled as **positive**, **negative**, or **neutral**, making it well‑suited for a supervised sentiment classification task. Our stakeholder is a technology market research team seeking actionable insights into how consumers perceive competing products in real time. Accurate sentiment classification allows the team to monitor brand perception, identify reputational risks, and adapt communication strategies quickly.

Data preparation addressed the short, noisy nature of tweets. We removed links and @mentions, normalized casing, preserved meaningful hashtag words, and stripped punctuation and HTML artifacts. We experimented with keeping vs. removing stopwords to gauge impact on sentiment detection. Feature extraction included **TF‑IDF** n‑grams for baseline models. Key libraries: **scikit‑learn** (modeling & evaluation), **NLTK** (cleaning), and **Hugging Face Transformers** (advanced modeling).
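The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual cleaning code; the function name `clean_tweet` and the regular expressions are assumptions made for the example.

```python
import html
import re
import string

# Illustrative patterns (assumptions, not taken from the project code).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
TAG_RE = re.compile(r"<[^>]+>")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Hypothetical helper: normalize one raw tweet into a cleaned string."""
    text = html.unescape(text)           # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)         # drop leftover HTML tags
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove @mentions
    text = HASHTAG_RE.sub(r"\1", text)   # keep the hashtag word, drop '#'
    text = text.lower()                  # normalize casing
    text = text.translate(PUNCT_TABLE)   # strip punctuation
    return " ".join(text.split())        # collapse repeated whitespace


example = "@sxsw Loving the new #iPad 2! &amp; more http://t.co/abc <b>wow</b>"
print(clean_tweet(example))
```

One consequence of stripping punctuation this way is that contractions are split (for example, "isn't" becomes "isn t"), which may matter when stopwords are kept.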

We followed an iterative strategy. First, a **binary proof‑of‑concept** (positive vs. negative) using Logistic Regression and Multinomial Naive Bayes demonstrated feasibility. Next, we expanded to **multiclass** (positive, neutral, negative) and tuned a **TF‑IDF + LinearSVC** pipeline with cross‑validation optimized for macro‑F1.
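A tuned **TF‑IDF + LinearSVC** pipeline of this kind might look like the sketch below. The helper name `build_search`, the parameter grid values, and the toy corpus are assumptions chosen for illustration; they are not the grid or data used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_search(seed=42, n_splits=5):
    """Hypothetical helper: grid search over a TF-IDF + LinearSVC pipeline."""
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LinearSVC(random_state=seed)),
    ])
    # Illustrative grid only (assumed values).
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scoring="f1_macro" selects the setting with the best macro-averaged F1.
    return GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)


# Toy corpus (made up) so the sketch runs end to end.
texts = [
    "love my new ipad", "great google maps update",
    "awesome iphone app", "really enjoy the android phone",
    "iphone battery is terrible", "google circles looks awful",
    "hate the ipad keyboard", "android app keeps crashing badly",
    "apple opens a store downtown", "google talk at noon today",
    "ipad launch is on friday", "android session in room five",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

search = build_search(n_splits=2)
search.fit(texts, labels)
print(search.best_params_)
```

Stratified folds keep the class proportions similar in every fold, which matters when one class is much rarer than the others.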

Evaluation used stratified train/validation/test splits. We reported accuracy, precision, recall, **macro‑F1**, and confusion matrices to reflect balanced performance across classes. We added model interpretability via top n‑grams and optional **LIME** explanations for selected tweets. We conclude with stakeholder recommendations for brand monitoring, negative‑spike alerts, and campaign impact analysis.
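To illustrate why macro‑F1 reflects balanced performance, the sketch below computes it by hand on made-up labels. The function names and the toy predictions are assumptions for the example; in practice scikit‑learn's metrics would be used.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 for each label, using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up imbalanced example: a model that always predicts "neutral".
y_true = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
y_pred = ["neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)                  # 0.6
print("macro-F1:", macro_f1(y_true, y_pred))  # 0.25
```

The always-neutral model reaches 60% accuracy but only 0.25 macro‑F1, because the positive and negative classes each contribute an F1 of zero.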

## 1. Business & Data Understanding

**Stakeholder:** Product marketing & market research teams tracking brand health for Apple vs Google products.

**Business problems we are solving:**
1. **Brand Monitoring:** Quantify daily/weekly sentiment for Apple vs Google to see shifts early.  
2. **Reputation Management:** Surface spikes in **negative** sentiment so PR/Support can respond quickly.  
3. **Product Insights:** Identify common praise/complaints by inspecting influential words for each sentiment.  
4. **Campaign Measurement:** Compare sentiment **before vs after** launches/announcements.

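**Dataset:** 9,093 tweets with human‑labeled sentiment. The raw labels are strings such as `Positive emotion` and `Negative emotion`, which we treat as the classes `positive`, `negative`, and `neutral`. Columns:  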
- `tweet_text` — the raw tweet  
- `emotion_in_tweet_is_directed_at` — brand/product mentioned (often Apple/Google terms)  
- `is_there_an_emotion_directed_at_a_brand_or_product` — sentiment label

We iterate **from simple to advanced** as recommended:  
- **Step 1 (Binary PoC):** Positive vs Negative only (fast baseline).  
- **Step 2 (Multiclass):** Add Neutral to model full business reality.  
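The two steps above imply a mapping from the raw label strings to short class names, plus a filter for the binary case. The sketch below is one possible way to do this. Only `Positive emotion` and `Negative emotion` appear in the data preview; the neutral label string and the decision to drop any other value are assumptions, and the helper names are hypothetical.

```python
LABEL_MAP = {
    "Positive emotion": "positive",
    "Negative emotion": "negative",
    # Assumed raw value for the neutral class (not shown in the preview).
    "No emotion toward brand or product": "neutral",
}


def normalize_label(raw):
    """Return 'positive', 'negative', or 'neutral'; None for anything unmapped."""
    if not isinstance(raw, str):
        return None  # e.g. a missing value read as NaN
    return LABEL_MAP.get(raw.strip())


def binary_subset(rows):
    """Step 1: keep only (text, label) pairs that are positive or negative."""
    kept = []
    for text, raw in rows:
        label = normalize_label(raw)
        if label in ("positive", "negative"):
            kept.append((text, label))
    return kept


rows = [
    ("great stuff", "Positive emotion"),
    ("battery died", "Negative emotion"),
    ("store opens today", "No emotion toward brand or product"),
]
print(binary_subset(rows))
```

With pandas, the same mapping could be applied to the label column with `Series.map(LABEL_MAP)`, followed by dropping rows where the result is missing.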


```python
# Reproducibility & Imports
import os, re, warnings
from pathlib import Path
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    classification_report, ConfusionMatrixDisplay, confusion_matrix
)

warnings.filterwarnings("ignore")
SEED = 42
np.random.seed(SEED)

def print_versions():
    import platform
    import sklearn, pandas, numpy, matplotlib
    print("Python:", platform.python_version())
    print("pandas:", pandas.__version__)
    print("numpy:", numpy.__version__)
    print("scikit-learn:", sklearn.__version__)
    print("matplotlib:", matplotlib.__version__)

print_versions()
```

```text
Python: 3.8.5
pandas: 1.4.4
numpy: 1.23.0
scikit-learn: 0.23.2
matplotlib: 3.3.1
```


We loaded the libraries for data work, modeling, and plotting, set a random seed so results are repeatable, and printed package versions for reproducibility.

## 2. Load the Data

Read the CSV, falling back to Latin‑1 encoding if the default UTF‑8 decoding fails, then preview the first rows.

```python
# Update this path if running locally
DATA_PATH = "judge-1377884607_tweet_product_company.csv"

def read_csv_robust(path):
    try:
        return pd.read_csv(path)
    except UnicodeDecodeError:
        return pd.read_csv(path, encoding="latin1")

df_raw = read_csv_robust(DATA_PATH)
print("Shape:", df_raw.shape)
df_raw.head(5)
```

```text
Shape: (9093, 3)
```


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |
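After previewing the rows, a few quick sanity checks can help before any cleaning. The sketch below is one possible check, run here on a tiny made-up frame with the same column names; the helper `summarize` is hypothetical and not part of the notebook.

```python
import pandas as pd

TEXT_COL = "tweet_text"
LABEL_COL = "is_there_an_emotion_directed_at_a_brand_or_product"


def summarize(df):
    """Hypothetical helper: row count, missing/duplicate texts, label counts."""
    return {
        "rows": len(df),
        "missing_text": int(df[TEXT_COL].isna().sum()),
        "duplicate_text": int(df[TEXT_COL].duplicated().sum()),
        "label_counts": df[LABEL_COL].value_counts(dropna=False).to_dict(),
    }


# Made-up rows for illustration; on the real data, pass df_raw instead.
toy = pd.DataFrame({
    TEXT_COL: ["great app", "great app", None, "battery died"],
    LABEL_COL: [
        "Positive emotion", "Positive emotion",
        "Negative emotion", "Negative emotion",
    ],
})
summary = summarize(toy)
print(summary)
```

The label counts show how imbalanced the classes are, which is the reason for stratified splits and a macro‑averaged metric later on.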
