
# SENTIMENT ANALYSIS

## 1. INTRODUCTION
Sentiment analysis, also known as opinion mining, is the process of analyzing and categorizing opinions expressed in text data to determine the writer's attitude toward a particular subject, product, or service. In the context of business and marketing, this process is vital for understanding customer sentiment from publicly available data, such as social media posts. Twitter, in particular, is a valuable source of real-time user-generated content that can provide deep insights into customer perceptions of brands, products, and services.

In this project, we aim to perform sentiment analysis on tweets to classify them as expressing positive, negative, or neutral sentiment about different products or companies. This will help businesses monitor customer feedback, improve product offerings, and manage brand perception.

### Problem Statement
With the growing volume of user-generated content on platforms like Twitter, businesses struggle to keep up with real-time customer feedback. Manually categorizing thousands of tweets is inefficient and time-consuming. Automated sentiment analysis offers an efficient solution for determining whether customers express positive or negative sentiments about a product or service.

The challenge is to build a machine learning model that can accurately classify the sentiment of tweets related to various products and companies.

### Objective
The objective of this project is to develop a machine learning model that can automatically classify the sentiment of tweets as positive, negative, or neutral. The model will be built using appropriate machine learning techniques, evaluated with standard metrics (such as accuracy, precision, recall, and F1 score), and applied to unseen data. The project will also include model explainability through techniques like LIME to ensure that predictions can be interpreted by users.
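
As a reference for the evaluation metrics named above, here is a minimal sketch of how accuracy, precision, recall, and F1 can be computed per class in a three-class setting. The labels are hypothetical and `per_class_metrics` is an illustrative helper, not part of this project; scikit-learn's `classification_report` reports the same quantities.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treated as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, for illustration only (not results from this project)
y_true = ["positive", "negative", "neutral", "positive", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "positive", "positive", "negative"]

# Accuracy is the share of tweets whose predicted label matches the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")

for label in ("positive", "negative", "neutral"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting the metrics per class matters when classes are imbalanced, because a high overall accuracy can hide poor recall on a minority class.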

### Data Sources
The dataset originates from CrowdFlower via data.world. Contributors evaluated tweets related to various brands and products. Specifically:

- Each tweet was labeled as expressing positive, negative, or no emotion toward a brand or product.
- If emotion was expressed, contributors specified which brand or product was the target.

### Project Workflow
1. Data Loading and Understanding
2. Data Cleaning
3. Exploratory Data Analysis (EDA)
4. Data Preprocessing (for NLP tasks)
5. Modeling
6. Evaluation and Model Explainability

## 2. DATA LOADING & DATA UNDERSTANDING





In [1]:
!pip install wordcloud



In [2]:
# Data manipulation
import pandas as pd
import numpy as np

# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# nltk
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer


# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# scikit-learn
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, recall_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Class imbalance (imbalanced-learn)
from imblearn.over_sampling import SMOTE

# Sparse matrices
from scipy.sparse import csr_matrix, issparse

# pickle
import pickle

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
class DataUnderstanding():
    """Class that gives the data understanding of a dataset"""
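    def __init__(self, data=None):
        """Initialisation"""
        self.df = data

    def load_data(self, path):
        """Load the data from `path` unless a DataFrame was already supplied"""
        # `is None` avoids the ambiguous elementwise comparison that
        # `self.df == 'None'` triggers when self.df is a DataFrame
        if self.df is None:
            self.df = pd.read_csv(path, encoding='latin-1')
        return self.df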

    def understanding(self):
        # Info
        print("""INFO""")
        print("-"*4)
        self.df.info()

        # Shape
        print("""\n\nSHAPE""")
        print("-"*5)
        print(f"Records in dataset are {self.df.shape[0]} with {self.df.shape[1]} columns.")

        # Columns
        print("\n\nCOLUMNS")
        print("-"*6)
        print("Columns in the dataset are:")
        for idx in self.df.columns:
            print(f"- {idx}")

        # Unique Values
        print("\n\nUNIQUE VALUES")
        print("-"*12)
        for col in self.df.columns:
            print(f"Column *{col}* has {self.df[col].nunique()} unique values")
            if self.df[col].nunique() < 12:
                print(f"Top unique values in the *{col}* include:")
                for idx in self.df[col].value_counts().index:
                    print(f"- {idx}")
            print("")

        # Missing or Null Values
        print("""\nMISSING VALUES""")
        print("-"*15)
        for col in self.df.columns:
            print(f"Column *{col}* has {self.df[col].isnull().sum()} missing values.")

        # Duplicate Values
        print("""\n\nDUPLICATE VALUES""")
        print("-"*16)
        print(f"The dataset has {self.df.duplicated().sum()} duplicated records.")

In [5]:
# Loading the dataset

data = DataUnderstanding()

df = data.load_data(path="judge-1377884607_tweet_product_company.csv")

# First five rows of dataset
df.head()


FileNotFoundError: [Errno 2] No such file or directory: 'judge-1377884607_tweet_product_company.csv'
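
The `FileNotFoundError` above means the CSV was not in the notebook's working directory when the cell ran (on Colab, files typically have to be uploaded or mounted again for each new runtime). A minimal sketch of a loader that fails early with a clearer message follows; `load_tweets` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_tweets(path):
    """Read the tweet CSV, failing early with a clearer message if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected dataset at {csv_path.resolve()}; "
            "upload or copy the CSV there before loading."
        )
    # latin-1 matches the encoding used by DataUnderstanding.load_data
    return pd.read_csv(csv_path, encoding="latin-1")
```

With the file in place, `df = load_tweets("judge-1377884607_tweet_product_company.csv")` would return the same DataFrame as `data.load_data(...)`.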

In [13]:
#looking into the dataset
data.understanding()

INFO
----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


SHAPE
-----
Records in dataset are 9093 with 3 columns.


COLUMNS
------
Columns in the dataset are:
- tweet_text
- emotion_in_tweet_is_directed_at
- is_there_an_emotion_directed_at_a_brand_or_product


UNIQUE VALUES
------------
Column *tweet_text* has 9065 unique values

Column *emotion_in_tweet_is_directed_at* has 9 unique values
Top unique values in the *emotion_in_tweet_is_directed_at* include:
- iPad
- Apple
- iPad or iPhone App
-

### Data Understanding Summary
Our dataset consists of 9,093 records and 3 columns.
The dataset primarily contains information from tweets, along with the emotions expressed toward specific brands or products.

1. Columns in the Dataset
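- tweet_text: The text of the tweet. This column contains 9,092 non-null values, meaning 1 record has a missing tweet text. There are 9,065 unique tweets, so 27 non-null entries repeat an earlier tweet's text; this count is based on the text alone and is separate from the 22 fully duplicated rows reported under Duplicate Values.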

- emotion_in_tweet_is_directed_at: This column specifies the brand or product the emotion in the tweet is directed at. It has 3,291 non-null values, meaning that 5,802 records have missing values in this column. There are 9 unique values, with the most common brands/products being iPad, Apple, Google, iPhone, and other Google/Apple products or services.

- is_there_an_emotion_directed_at_a_brand_or_product: This column describes whether the tweet expresses an emotion toward a brand or product. It has 4 unique values:

    - "No emotion toward brand or product"

    - "Positive emotion"

    - "Negative emotion"

    - "I can't tell"

  This column contains no missing values.
    
2. Missing Values
- tweet_text: 1 missing value, which needs to be handled.
- emotion_in_tweet_is_directed_at: 5,802 missing values. This large number of missing values may require imputation or exclusion, depending on the importance of this column to the analysis.

3. Duplicate Values
The dataset contains 22 duplicate records, which need to be addressed to ensure data integrity.
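
A count of duplicated records depends on whether whole rows or only the tweet text are compared. The sketch below uses a hypothetical toy frame (not the real data) to show both counts, assuming duplicates are defined as rows identical in every column, which is what `DataFrame.duplicated()` checks by default.

```python
import pandas as pd

# Hypothetical toy data that reuses the real column names
toy = pd.DataFrame({
    "tweet_text": [
        "love my iPad", "love my iPad", "love my iPad", "maps app crashed", None,
    ],
    "is_there_an_emotion_directed_at_a_brand_or_product": [
        "Positive emotion",
        "Positive emotion",
        "No emotion toward brand or product",
        "Negative emotion",
        "No emotion toward brand or product",
    ],
})

# Rows that are identical to an earlier row in every column
full_row_dupes = toy.duplicated().sum()

# Non-null tweets whose text repeats an earlier tweet, regardless of label
repeated_texts = toy["tweet_text"].notna().sum() - toy["tweet_text"].nunique()

# Dropping full-row duplicates keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(full_row_dupes, repeated_texts, len(deduplicated))
```

Tweets that share text but carry different labels survive `drop_duplicates()`, so they may need a separate decision during cleaning.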


## 3. DATA CLEANING

### Handling missing values
- Column (tweet_text): There is 1 missing value in this column. Since this is the key feature for sentiment analysis, we will drop the row containing the missing tweet.
- Column (emotion_in_tweet_is_directed_at): This column has 5,802 missing values. Since it records the product or brand mentioned in the tweet, we will impute missing values with "Unknown"; the column is important, and dropping it would mean losing a lot of valuable data and insights.



In [None]:
# Drop rows with missing tweet text
# `df` is the DataFrame returned by load_data; `data` is the DataUnderstanding wrapper
data_cleaned = df.dropna(subset=['tweet_text'])
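
Both steps described in this section (dropping the row with no tweet text, then filling missing brand or product targets with "Unknown") can be sketched on a hypothetical toy frame. This is a minimal illustration under those assumptions, not the notebook's final cleaning code.

```python
import pandas as pd

# Hypothetical toy data that reuses the real column names
toy = pd.DataFrame({
    "tweet_text": ["love my iPad", None, "great keynote"],
    "emotion_in_tweet_is_directed_at": ["iPad", "Google", None],
})

# 1. Drop rows with no tweet text; .copy() avoids chained-assignment warnings
cleaned = toy.dropna(subset=["tweet_text"]).copy()

# 2. Fill missing brand/product targets with the placeholder "Unknown"
cleaned["emotion_in_tweet_is_directed_at"] = (
    cleaned["emotion_in_tweet_is_directed_at"].fillna("Unknown")
)

print(cleaned)
```

Using an explicit placeholder keeps those rows available for sentiment modeling while still marking that no target was recorded.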
