# PROBLEM DEFINITION AND DESIGN THINKING DOCUMENT

## PROBLEM STATEMENT

The problem at hand involves performing sentiment analysis on customer
feedback to gain valuable insights into competitor products. By understanding customer sentiments,
companies can identify strengths and weaknesses in competing products, thereby improving their
own offerings. This project requires leveraging various Natural Language Processing (NLP) methods to
extract meaningful insights from customer feedback.

## DESIGN THINKING APPROACH

To effectively solve the problem of performing sentiment analysis on customer feedback, we will follow a structured design thinking approach. This approach involves several key steps, as outlined below:

##### DATA COLLECTION

OBJECTIVE: Identify a dataset containing customer reviews and sentiments about competitor products.

DATA SOURCES: Identify and gather relevant datasets from sources such as online review platforms, social media, or customer feedback forms.

DATA VOLUME: Ensure the dataset is sufficiently large to provide meaningful insights and a representative sample of customer sentiments.

DATA QUALITY: Verify the data quality by checking for missing values, duplicates, and inconsistencies.
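
The quality checks above can be sketched as follows. This is a minimal illustration using only the standard library; the `quality_report` helper, the `text` field name, and the sample records are assumptions made for the example, not part of the project dataset.

```python
from collections import Counter


def quality_report(records, text_field="text"):
    """Count missing and duplicate review texts in a list of dicts."""
    # A review counts as missing if the field is absent, None, or blank.
    present = [r[text_field] for r in records if (r.get(text_field) or "").strip()]
    missing = len(records) - len(present)
    # Normalize case and whitespace so trivially different copies match.
    normalized = Counter(" ".join(text.lower().split()) for text in present)
    duplicates = sum(n - 1 for n in normalized.values() if n > 1)
    return {"rows": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical sample records, for illustration only.
sample = [
    {"text": "Great flight, friendly crew"},
    {"text": "great flight,  friendly crew"},
    {"text": ""},
    {"text": None},
    {"text": "Lost my luggage"},
]
report = quality_report(sample)
print(report)
```

On real data the same counts would more likely come from `pandas`, but the logic is the same: count empty entries, then count repeats after light normalization.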

##### DATA PREPROCESSING

OBJECTIVE: Clean and preprocess the textual data for analysis.

TEXT CLEANING: Remove any special characters, punctuation, and irrelevant information that may not contribute to sentiment analysis.

TOKENIZATION: Split the text into individual words or tokens for further analysis.

STOPWORD REMOVAL: Eliminate common stopwords (e.g., “and,” “the,” “is”) to reduce noise in the data.

NORMALIZATION: Normalize text by converting it to lowercase to ensure consistency.
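
A minimal sketch of these four steps in one function is shown here. The stopword list is a small hypothetical sample, and treating URLs and `@mentions` as "irrelevant information" is an assumption that suits tweet-style reviews. Lowercasing is applied first so that the later steps only have to handle one letter case.

```python
import re

# Tiny illustrative stopword list (an assumption); a real project would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "was", "to", "of", "in", "on", "my"}


def preprocess(text):
    """Lowercase, clean, tokenize, and remove stopwords from one review."""
    text = text.lower()                              # normalization
    text = re.sub(r"https?://\S+|@\w+", " ", text)   # drop URLs and @mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)         # drop punctuation and special characters
    tokens = text.split()                            # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS] # stopword removal


tokens = preprocess(
    "@VirginAmerica The flight was GREAT, and the crew is friendly! http://t.co/abc"
)
print(tokens)  # ['flight', 'great', 'crew', 'friendly']
```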

##### SENTIMENT ANALYSIS TECHNIQUES

OBJECTIVE: Employ different NLP techniques like Bag of Words, Word Embeddings, or Transformer models for sentiment analysis.

BAG OF WORDS (BOW): Create a BoW representation of the text data, which counts the frequency of words in each document.

WORD EMBEDDINGS: Utilize pre-trained word embeddings to capture semantic meaning and relationships between words.

TRANSFORMER MODELS: Leverage advanced transformer-based models for deep contextualized sentiment analysis.
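
Of the three techniques, Bag of Words is simple enough to illustrate in a few lines. The sketch below builds a vocabulary and a count vector per document from already tokenized text. The helper names `build_vocabulary` and `bow_vector` are hypothetical, and a library vectorizer would normally replace them in practice.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def bow_vector(tokens, vocab):
    """Return the term counts of one document as a list aligned with vocab."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vec[vocab[tok]] = n
    return vec


# Two hypothetical tokenized reviews.
docs = [["flight", "late", "late"], ["great", "flight"]]
vocab = build_vocabulary(docs)
matrix = [bow_vector(d, vocab) for d in docs]
print(vocab)   # {'flight': 0, 'great': 1, 'late': 2}
print(matrix)  # [[1, 0, 2], [1, 1, 0]]
```

Each row of `matrix` is one document and each column is one word, so word order is discarded and only frequency remains.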

##### FEATURE EXTRACTION

OBJECTIVE: Extract features and sentiments from the text data.

SENTIMENT SCORING: Assign sentiment scores (positive, negative, neutral) to each customer review using the chosen sentiment analysis technique.

FEATURE EXTRACTION: Extract relevant features from the text, such as product mentions, key phrases, or specific attributes that customers mention in their feedback.
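
To make the two steps concrete, the sketch below uses a simple lexicon lookup as a stand-in for whichever sentiment technique is eventually chosen. The word lists, the aspect mapping, and the function names are all hypothetical and exist only for illustration.

```python
# Hypothetical mini-lexicons; a trained model or a full lexicon would replace them.
POSITIVE = {"great", "good", "friendly", "comfortable", "thanks"}
NEGATIVE = {"late", "delayed", "lost", "rude", "cancelled"}

# Hypothetical mapping from keywords to the product attribute they refer to.
ASPECTS = {
    "crew": "service",
    "staff": "service",
    "luggage": "baggage",
    "bag": "baggage",
    "delayed": "punctuality",
    "late": "punctuality",
}


def sentiment_label(tokens):
    """Label a tokenized review by comparing positive and negative word counts."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def extract_aspects(tokens):
    """Return the sorted set of attributes mentioned in a tokenized review."""
    return sorted({ASPECTS[t] for t in tokens if t in ASPECTS})


review = ["flight", "delayed", "crew", "rude"]
print(sentiment_label(review))  # negative
print(extract_aspects(review))  # ['punctuality', 'service']
```

Keeping the label and the mentioned attributes side by side is what later allows a statement such as "negative reviews mostly concern punctuality".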

##### VISUALIZATION

OBJECTIVE: Create visualizations to depict the sentiment distribution and analyze trends.

SENTIMENT DISTRIBUTION: Visualize the distribution of sentiment scores using bar charts, histograms, or pie charts to understand the overall sentiment of customer feedback.

TEMPORAL ANALYSIS: Track sentiment trends over time to identify patterns or changes in customer perceptions.

WORD CLOUDS: Generate word clouds to highlight frequently mentioned words or phrases in customer reviews.
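
Each of these charts is driven by a small aggregate. The sketch below computes the numbers behind a sentiment bar chart and a monthly trend line, leaving the actual drawing to a plotting library. The function names, the ISO date format, and the sample rows are assumptions made for the example.

```python
from collections import Counter, defaultdict


def sentiment_distribution(labels):
    """Count reviews per sentiment class, in a fixed order suited to a bar chart."""
    counts = Counter(labels)
    return {label: counts.get(label, 0) for label in ("negative", "neutral", "positive")}


def monthly_negative_share(rows):
    """rows: (iso_date, label) pairs. Returns {YYYY-MM: share of negative reviews}."""
    totals, negatives = defaultdict(int), defaultdict(int)
    for iso_date, label in rows:
        month = iso_date[:7]                        # 'YYYY-MM' prefix of an ISO date
        totals[month] += 1
        negatives[month] += int(label == "negative")
    return {m: negatives[m] / totals[m] for m in sorted(totals)}


# Hypothetical labeled reviews with dates.
rows = [
    ("2015-02-17", "negative"),
    ("2015-02-18", "positive"),
    ("2015-02-20", "negative"),
    ("2015-03-01", "neutral"),
]
distribution = sentiment_distribution(label for _, label in rows)
trend = monthly_negative_share(rows)
print(distribution)  # {'negative': 2, 'neutral': 1, 'positive': 1}
print(trend)
```

The keys and values of `distribution` map directly onto the bars of a bar chart, and `trend` gives one point per month for a line plot.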

##### INSIGHTS GENERATION

OBJECTIVE: Extract meaningful insights from the sentiment analysis results to guide business decisions.

COMPETITOR ANALYSIS: Compare the sentiment scores of competitor products to identify strengths and weaknesses.

CUSTOMER FEEDBACK TRENDS: Identify recurring themes or issues in customer feedback that require attention.

RECOMMENDATIONS: Provide actionable recommendations for product improvement or marketing strategies based on the insights gained.
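
One possible way to compare competitors is a net sentiment score per product, defined here as (positive - negative) / total reviews. This metric is an assumption chosen for illustration rather than a method prescribed by the project, and the product names and rows are hypothetical.

```python
from collections import defaultdict


def net_sentiment_by_product(rows):
    """rows: (product, label) pairs. Returns {product: net score in [-1, 1]}, best first."""
    tally = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for product, label in rows:
        tally[product][label] += 1
    scores = {
        product: (c["positive"] - c["negative"]) / sum(c.values())
        for product, c in tally.items()
    }
    # Sort from the most favorably reviewed product to the least.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical labeled reviews for two products.
rows = [
    ("Airline A", "positive"),
    ("Airline A", "positive"),
    ("Airline A", "negative"),
    ("Airline B", "negative"),
    ("Airline B", "neutral"),
]
ranking = net_sentiment_by_product(rows)
print(ranking)
```

A score near 1 indicates mostly positive feedback and a score near -1 mostly negative, which gives a quick ranking before looking at the themes behind each score.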

The dataset consists of reviews submitted by individuals who traveled with various airlines.

Each review in the dataset has been assigned to one of three categories: Positive, Negative, or Neutral.

PURPOSE: Build a model that, once trained on the tweets (reviews), predicts whether a given tweet expresses a positive, negative, or neutral response.

Reviews labeled positive describe a good experience the traveler had with the airline.

Reviews labeled negative describe difficulties the traveler faced.

Neutral responses are those that are not specific enough to be classified as positive or negative.

OUTCOME: This will help airlines identify their deficiencies so that they can work on them.


## Exploratory Analysis
To begin this exploratory analysis, first import the required libraries and define functions for plotting the data with `matplotlib`.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [None]:
print(os.listdir('../input'))

The next code cells define functions for plotting data.

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # integer (ceiling) division: plt.subplot needs an int row count
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()


In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.dropna(axis='columns') # drop columns with NaN (axis must be passed by keyword in recent pandas)
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
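    corr = df.select_dtypes(include=[np.number]).corr() # correlate numeric columns only; text columns would raise in recent pandas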
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()


In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()


### Let's check 1st file: ../input/Tweets.csv

In [None]:
nRowsRead = 1000 # specify 'None' if want to read whole file
# Tweets.csv has 14640 rows in reality, but we are only loading/previewing the first 1000 rows
df1 = pd.read_csv('../input/Tweets.csv', delimiter=',', nrows = nRowsRead)
df1.dataframeName = 'Tweets.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')

Let's take a quick look at what the data looks like:

In [None]:
df1.head(5)

Distribution graphs (histogram/bar graph) of sampled columns:

In [None]:
plotPerColumnDistribution(df1, 10, 5)

## Conclusion

This design thinking approach outlines a structured methodology for tackling the problem of performing sentiment analysis on customer feedback to gain insights into competitor products. By following these steps, we aim to extract valuable information from textual data, visualize trends, and generate actionable insights that can drive informed business decisions and enhance the company’s competitive edge in the market.