# NLP Twitter Sentiments about Apple and Google Products

## 1. Project Overview

This project aims to develop a Natural Language Processing (NLP) model to analyze sentiment in Tweets related to Apple and Google products. By classifying the sentiment of these Tweets as positive, negative, or neutral, the model will provide valuable insights into public perception, aiding businesses in marketing strategies and product development.


**Business Problem:**

In an era dominated by social media, brands must continuously track customer sentiments expressed online. Twitter, in particular, has become a critical platform where users voice their opinions about products and brands. However, the vast volume and rapid pace of tweets make it impractical for businesses to manually analyze these opinions for insights. To address this, a Natural Language Processing (NLP) model needs to be developed to automatically classify the sentiment of tweets and determine which brand or product is the target of those sentiments.

The dataset from CrowdFlower includes over 9,000 tweets that have been evaluated for sentiment (positive, negative, or no emotion) and tagged with the associated brand or product. The goal is to build an NLP model that can accurately and efficiently:

1. **Classify Sentiments**: Identify whether a tweet expresses positive, negative, or no emotion.
2. **Identify Brand/Product**: Recognize which brand or product is being referred to in the tweet.
3. **Handle Ambiguity**: Deal with tweets that might reference multiple brands or unclear sentiments.

Key challenges include:

- **Textual Variations**: Dealing with informal language, abbreviations, emojis, and slang used on social media.
- **Context Understanding**: Ensuring the model understands subtle and implicit expressions of sentiment.
- **Real-Time Processing**: Building a scalable solution that can process large volumes of data in real time for timely insights.

Solving this problem will help brands enhance their reputation management, respond promptly to consumer feedback, and optimize their marketing strategies based on real-time sentiment analysis.
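
The textual variations listed among the challenges above (mentions, hashtags, link placeholders, punctuation) can be reduced with a small normalization step before any modeling. The following is a minimal sketch using only the standard library; the regular expressions and the `normalize_tweet` helper are assumptions made for illustration, not the preprocessing used later in this project.

```python
import re

# Hypothetical patterns for this sketch; the project's own preprocessing may differ.
URL_RE = re.compile(r"https?://\S+|\{link\}")   # URLs and the dataset's {link} placeholder
MENTION_RE = re.compile(r"@\w+")                # @username mentions
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")        # '#', emojis, punctuation


def normalize_tweet(text: str) -> str:
    """Lowercase a tweet and strip links, mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


print(normalize_tweet("Ipad everywhere. #SXSW {link}"))  # ipad everywhere sxsw
```

Stripping emojis and hashtags outright is a simplification: both can carry sentiment, so a fuller pipeline might map them to tokens instead of discarding them.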

#### Project Objectives

1. **Binary Classification Model:** The first objective is to develop a binary classification model to classify tweets as either positive or negative. Using Logistic Regression, this model aims to achieve a benchmark accuracy of 85%, serving as a proof of concept.

2. **Multiclass Classification Expansion:** After establishing a successful binary classification, we will develop a multiclass classifier to include neutral sentiments. This will provide a more comprehensive understanding of consumer sentiment and will be built using models like XGBoost and Multinomial Naive Bayes, with a target accuracy of 70%.

3. **Sentiment Comparison Between Apple and Google Products:** A final objective is to compare sentiment across the brands by analyzing the distribution of sentiments in tweets mentioning Apple, Google, and other products. This comparison will provide valuable insights for stakeholders to refine their strategies.
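
The brand comparison in the third objective amounts to a normalized cross-tabulation of brand against sentiment label. The sketch below shows one way to compute it with `pd.crosstab`. The rows are toy data, and the `to_brand` grouping rule is a hypothetical helper; only the column names and label strings come from the dataset.

```python
import pandas as pd

PRODUCT_COL = "emotion_in_tweet_is_directed_at"
LABEL_COL = "is_there_an_emotion_directed_at_a_brand_or_product"

# Toy rows for illustration only; these are not results from the real dataset.
toy = pd.DataFrame({
    PRODUCT_COL: ["iPhone", "iPad", "Google", "iPad or iPhone App", "Google"],
    LABEL_COL: [
        "Negative emotion", "Positive emotion", "Positive emotion",
        "Positive emotion", "Negative emotion",
    ],
})


def to_brand(product) -> str:
    """Hypothetical rule that groups product names into brands."""
    name = str(product).lower()
    if "google" in name:
        return "Google"
    if "apple" in name or "ipad" in name or "iphone" in name:
        return "Apple"
    return "Other"


toy["brand"] = toy[PRODUCT_COL].map(to_brand)

# normalize="index" turns counts into per-brand shares, so each row sums to 1.
share = pd.crosstab(toy["brand"], toy[LABEL_COL], normalize="index")
print(share)
```

Comparing shares rather than raw counts keeps the comparison fair when one brand is mentioned far more often than the other.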

#### Stakeholders

Key stakeholders who would benefit from this project include:

**1. Product Managers** at Apple and Google, who can use the insights to tweak product features based on consumer sentiment.

**2. Marketing Teams** looking to assess the effectiveness of campaigns or brand perception.

**3. Customer Support Teams**, who can use the analysis to proactively address negative sentiment or capitalize on positive feedback.

**4. Consultants and market analysts** seeking to provide data-driven advice to tech companies on consumer perceptions.

#### Methodology Overview


1. **Data Understanding**
   - Reviewed the structure of the dataset (e.g., columns like `tweet_text`, `emotion_in_tweet_is_directed_at`, and `is_there_an_emotion_directed_at_a_brand_or_product`).
   - Identified initial anomalies (e.g., missing values, duplicates) that need cleaning.

2. **Data Preparation**
   - Preprocessing steps:
     - Removed duplicate entries.
     - Addressed missing values by populating the `emotion_in_tweet_is_directed_at` column with "none" and removing entries lacking `tweet_text`.
     - Applied text preprocessing techniques: tokenization, lowercasing, stopword removal, and lemmatization.

3. **Modeling**
   - Utilized key libraries: NLTK (for tokenization, stopword removal, lemmatization), sklearn's `CountVectorizer` (for vectorization), and pandas (for data handling).
   - Developed a logistic regression model for binary classification (positive/negative sentiment), aiming for 70% accuracy.
   - Expanded to a multiclass classifier to include neutral sentiments.

4. **Evaluation**
   - Accuracy was used as the main evaluation metric, measuring the model's overall ability to classify sentiments correctly.
   - While the model performed satisfactorily, missing values and data quality issues are potential limitations.
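
The modeling and evaluation steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on invented toy tweets, assuming a `CountVectorizer` feeding a `LogisticRegression`; the hyperparameters, split sizes, and data are assumptions, and the accuracy it prints says nothing about performance on the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented toy tweets, for illustration only (1 = positive, 0 = negative).
texts = [
    "love my new ipad", "great iphone app", "awesome google maps",
    "love this launch", "great design awesome battery",
    "iphone battery is terrible", "hate the new update",
    "google app keeps crashing", "terrible crashing keyboard",
    "hate this awful queue",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Stratify so both classes appear in the held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Bundling vectorizer and classifier ensures the vocabulary is learned
# from the training split only.
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy on toy data: {accuracy:.2f}")
```

Keeping the vectorizer inside the pipeline also makes it straightforward to swap in `TfidfVectorizer` or `MultinomialNB` for the multiclass stage without changing the surrounding code.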

#### Data Understanding

We are using a dataset sourced from **CrowdFlower via Data.world**, containing approximately 9,000 tweets expressing sentiments about Apple and Google products. This dataset includes columns such as `tweet_text`, `emotion_in_tweet_is_directed_at`, and `is_there_an_emotion_directed_at_a_brand_or_product`. The main objective is to accurately classify each tweet into one of three sentiment categories: positive, negative, or neutral.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.tokenize import RegexpTokenizer, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

from sklearn.model_selection import train_test_split, cross_validate
from numpy import array
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


nltk.download('punkt')  # Download the Punkt tokenizer data used by word_tokenize
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to c:\Users\Augustine
[nltk_data]     Wanyonyi\anaconda3\envs\learn-env\lib\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
# Load the dataset (forward slash avoids the invalid '\S' escape and works on any OS)
df = pd.read_csv('Data/Sentiments_analysis.csv', encoding='unicode_escape')
df

|  | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
| --- | --- | --- | --- |
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |
| ... | ... | ... | ... |
| 9088 | Ipad everywhere. #SXSW {link} | iPad | Positive emotion |
| 9089 | Wave, buzz... RT @mention We interrupt your re... |  | No emotion toward brand or product |
| 9090 | Google's Zeiger, a physician never reported po... |  | No emotion toward brand or product |
| 9091 | Some Verizon iPhone customers complained their... |  | No emotion toward brand or product |
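
The label strings visible in this preview are long, so it can help to map them to short class names before modeling. The sketch below does this on a toy frame; the short names in `LABEL_MAP` and the `sentiment` column are assumptions, and the map covers only the labels shown in the preview above.

```python
import pandas as pd

LABEL_COL = "is_there_an_emotion_directed_at_a_brand_or_product"

# Hypothetical short class names keyed by the label strings seen in the preview.
LABEL_MAP = {
    "Positive emotion": "positive",
    "Negative emotion": "negative",
    "No emotion toward brand or product": "neutral",
}

# Toy frame standing in for df; only the label column is needed here.
preview = pd.DataFrame({LABEL_COL: [
    "Negative emotion", "Positive emotion", "Positive emotion",
    "No emotion toward brand or product",
]})

# Labels missing from LABEL_MAP become NaN, which makes them easy to spot.
preview["sentiment"] = preview[LABEL_COL].map(LABEL_MAP)

# The binary task keeps only the positive and negative rows.
binary = preview[preview["sentiment"].isin(["positive", "negative"])]

print(preview["sentiment"].value_counts())
print(len(binary))
```

Running `value_counts()` on the real `df` in the same way would also show how imbalanced the classes are, which matters when accuracy is the main metric.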


In [9]:
# Summarize the data: structure, duplicates, and null values
def data_summary(df):
    # Print the DataFrame info (df.info() prints directly and returns None,
    # so it should not be wrapped in print())
    df.info()
    print("-" * 20)

    # Print the total number of duplicated rows
    print('Total duplicated rows')
    print(df.duplicated().sum())
    print("-" * 20)

    # Print the total number of null values in each column
    print('Total null values')
    print(df.isna().sum())

data_summary(df)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB
--------------------
Total duplicated rows
22
--------------------
Total null values
tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64
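
The summary reports 22 duplicated rows, one missing `tweet_text`, and 5,802 missing values in `emotion_in_tweet_is_directed_at`. The cleaning steps described in the methodology (drop duplicates, drop rows without tweet text, fill the missing target with "none") could look roughly like the sketch below. It runs on a toy frame with invented rows, and the name `cleaned` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

TARGET_COL = "emotion_in_tweet_is_directed_at"

# Toy frame with one duplicated row, one missing tweet, and missing targets.
toy = pd.DataFrame({
    "tweet_text": [
        "Ipad everywhere. #SXSW {link}",
        "Ipad everywhere. #SXSW {link}",
        np.nan,
        "Long queue at the pop-up store",
    ],
    TARGET_COL: ["iPad", "iPad", np.nan, np.nan],
})

cleaned = (
    toy.drop_duplicates()                 # remove exact duplicate rows
       .dropna(subset=["tweet_text"])     # a row without text cannot be classified
       .fillna({TARGET_COL: "none"})      # keep tweets with no tagged product
       .reset_index(drop=True)
)
print(cleaned)
```

Filling the target column instead of dropping those rows preserves the majority of the dataset, since most tweets have no tagged brand or product.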
