In [7]:
# Import all necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy.stats import skew, kurtosis

%matplotlib inline
warnings.filterwarnings("ignore")

# Data Understanding

### Features
1. **tweet_text (Text)**: The text of the tweet. This feature contains the actual tweet content posted by users.
2. **emotion_in_tweet_is_directed_at (Categorical)**: The brand or product that the emotion in the tweet is directed at. This feature identifies which brand or product is the target of the emotion expressed in the tweet.
3. **is_there_an_emotion_directed_at_a_brand_or_product (Categorical)**: Indicates whether there is an emotion directed at a brand or product. The possible values include "Positive emotion", "Negative emotion", and "No emotion toward brand or product".

Source: https://data.world/crowdflower/brands-and-product-emotions


In [8]:
# Read data

def read_data(file_path):
    """Read the dataset from the given filepath.
    Parameters: 
        - file_path (str): The path to the csv file containing the dataset.
    Returns:
        - DataFrame: The DataFrame containing the dataset.
    """
    # Specify the encoding
    try:
        df = pd.read_csv(file_path, encoding='utf-8')
    except UnicodeDecodeError:
        # If utf-8 fails, try 'latin1' encoding
        df = pd.read_csv(file_path, encoding='latin1')
    
    return df

# Specify the file_path
file_path = "judge-1377884607_tweet_product_company.csv"
df = read_data(file_path)


In [9]:
# Explore the Data and get familiar with all the variables
def Explore_data(df):
    """Explore the dataset to get familiar with all the features
    and the summary statistics of its columns.

    Parameters:
        - df (DataFrame): The DataFrame containing the dataset.

    Returns:
        - None
    """
    # i. Data Importation and Inspection
    print("Data Importation and Inspection:")
    print("The first 5 rows of the dataset:")
    print(df.head())

    # ii. Basic data information
    print("\nBasic data information")
    df.info()  # info() prints directly and returns None, so it needs no print() wrapper

    # iii. Shape of the dataset
    print("\nThe shape of the dataset:")
    print(df.shape)

    # iv. Data Types
    print("\nData Types:")
    print(df.dtypes)

    # v. Summary statistics
    print("\nData Summary:")
    print(df.describe())




Explore_data(df)

Data Importation and Inspection:
The first 5 rows of the dataset:
                                          tweet_text  \
0  .@wesley83 I have a 3G iPhone. After 3 hrs twe...   
1  @jessedee Know about @fludapp ? Awesome iPad/i...   
2  @swonderlin Can not wait for #iPad 2 also. The...   
3  @sxsw I hope this year's festival isn't as cra...   
4  @sxtxstate great stuff on Fri #SXSW: Marissa M...   

  emotion_in_tweet_is_directed_at  \
0                          iPhone   
1              iPad or iPhone App   
2                            iPad   
3              iPad or iPhone App   
4                          Google   

  is_there_an_emotion_directed_at_a_brand_or_product  
0                                   Negative emotion  
1                                   Positive emotion  
2                                   Positive emotion  
3                                   Negative emotion  
4                                   Positive emotion  

Basic data information
<class 'pandas.core.

# Data Importation and Inspection

### First 5 Rows:
The dataset comprises tweets about multiple brands and products, along with emotions directed towards them.

### Basic Data Information:
The dataset has 3 columns:
1. `tweet_text`: Contains the text of the tweet.
2. `emotion_in_tweet_is_directed_at`: Indicates the brand or product the emotion is directed at.
3. `is_there_an_emotion_directed_at_a_brand_or_product`: Specifies whether there is a positive, negative, or no emotion directed at the brand or product.

- The dataset has a total of 9093 entries.
- The `tweet_text` column has 9092 non-null values, so 1 row has no tweet text.
- The `emotion_in_tweet_is_directed_at` column has 3291 non-null values, so 5802 rows have no brand or product recorded.
- The `is_there_an_emotion_directed_at_a_brand_or_product` column has 9093 non-null values.
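
The non-null counts above can be turned into an explicit missing-value table. The following is a minimal sketch rather than part of the original notebook: the helper name `missing_summary` and the toy DataFrame are assumptions made for illustration, and the same function could be applied to `df`.

```python
import pandas as pd

def missing_summary(df):
    """Return non-null counts, missing counts, and missing percentage per column."""
    summary = pd.DataFrame({
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
    })
    # Share of rows with a missing value in each column, in percent
    summary["missing_pct"] = (summary["missing"] / len(df) * 100).round(2)
    return summary

# Toy data (hypothetical) that mimics the structure of the real dataset
toy = pd.DataFrame({
    "tweet_text": ["great iPad", None, "slow phone", "nice talk"],
    "emotion_in_tweet_is_directed_at": ["iPad", None, None, "Google"],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

Calling `missing_summary(df)` on the real data should reproduce the counts reported by `df.info()`, with the missing share of each column added.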

### Data Shape:
The dataset consists of 9093 rows and 3 columns, indicating there are 9093 observations and 3 features.

### Data Types:
- All three columns (`tweet_text`, `emotion_in_tweet_is_directed_at`, and `is_there_an_emotion_directed_at_a_brand_or_product`) are of object type, indicating they contain categorical or text data.
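
Because a text-typed column can hold either free text or a small set of labels, the number of distinct values is one way to tell the two apart. The sketch below is an illustration under assumptions: the helper name `cast_low_cardinality`, the `max_unique` threshold of 20, and the toy data are hypothetical and do not come from the original notebook.

```python
import pandas as pd

def cast_low_cardinality(df, max_unique=20):
    """Return a copy in which non-numeric columns with few distinct values use the 'category' dtype."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # leave numeric columns untouched
        if out[col].nunique(dropna=True) <= max_unique:
            out[col] = out[col].astype("category")
    return out

# Toy data (hypothetical): 30 distinct texts and 2 distinct labels
toy_types = pd.DataFrame({
    "tweet_text": [f"tweet number {i}" for i in range(30)],
    "label": ["Positive emotion", "Negative emotion"] * 15,
})
converted = cast_low_cardinality(toy_types)
print(converted.dtypes)
```

With this threshold, a label column such as the sentiment column would become categorical, while a column of mostly unique tweets would keep its original dtype.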

### Summary Statistics:
Because every column is of object type, `describe()` reports the non-null count, the number of unique values, the most frequent value, and its frequency for each column.

- **tweet_text**:
  - **Count:** 9092 (non-null observations)
  - **Unique:** 9065 (unique tweets)
  - **Top:** "RT @mention Marissa Mayer: Google Will Connect You With The Future!" (most frequent tweet)
  - **Frequency:** 5 (occurrences of the most frequent tweet)

- **emotion_in_tweet_is_directed_at**:
  - **Count:** 3291 (non-null observations)
  - **Unique:** 9 (unique brands/products)
  - **Top:** "iPad" (most frequent brand/product)
  - **Frequency:** 946 (occurrences of the most frequent brand/product)

- **is_there_an_emotion_directed_at_a_brand_or_product**:
  - **Count:** 9093 (total observations)
  - **Unique:** 4 (unique sentiment categories)
  - **Top:** "No emotion toward brand or product" (most frequent sentiment)
  - **Frequency:** 5389 (occurrences of the most frequent sentiment)

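These statistics point to several items for further investigation: `emotion_in_tweet_is_directed_at` is missing for 5802 of the 9093 rows, one row has no `tweet_text`, 27 non-null tweets repeat an earlier tweet (9092 non-null values but only 9065 unique), and the sentiment column has four categories although the feature description names only three. The classes are also imbalanced: "No emotion toward brand or product" accounts for 5389 of the 9093 rows (about 59%). Once these points are addressed, the dataset can support an analysis of public sentiment towards brands and products based on Twitter data.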
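
The unique counts and top frequencies reported by `describe()` can be checked directly by counting labels and repeated tweets. This is a minimal sketch, not the notebook's own method: the helper name `label_and_duplicate_report` and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd

def label_and_duplicate_report(df, text_col, label_col):
    """Return label counts, label shares, and the number of repeated non-null texts."""
    label_counts = df[label_col].value_counts(dropna=False)
    label_share = (label_counts / len(df)).round(3)
    # duplicated() flags every occurrence after the first one
    n_repeats = int(df[text_col].dropna().duplicated().sum())
    return label_counts, label_share, n_repeats

# Toy data (hypothetical) with one repeated tweet and one missing tweet
toy_labels = pd.DataFrame({
    "tweet_text": ["love it", "love it", "hate it", None],
    "sentiment": [
        "Positive emotion",
        "Positive emotion",
        "Negative emotion",
        "No emotion toward brand or product",
    ],
})
counts, shares, repeats = label_and_duplicate_report(toy_labels, "tweet_text", "sentiment")
print(counts)
print(shares)
print("Repeated tweets:", repeats)
```

On the real data, the call would be `label_and_duplicate_report(df, "tweet_text", "is_there_an_emotion_directed_at_a_brand_or_product")`, which lists every sentiment label along with its share of the rows.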
