# Sentiment Analysis with Python Project

###  Amazon - Review Sentiment Analysis 👀 🐳 🧐

#### Dataset Story:
This dataset of Amazon product data includes product categories and various metadata. It contains the user ratings and comments for the most-reviewed product in the electronics category.

The following cell installs the Python packages used in this notebook with pip:

1. `!pip install` runs pip to install a Python package in the notebook environment

2. `nltk` is the Natural Language Toolkit; it provides the `SentimentIntensityAnalyzer` class used for sentiment analysis (the analyzer is a class inside NLTK, not a separate package)

3. `chart_studio` is the Chart Studio package for working with Plotly visualizations

4. `textblob` is a text processing and NLP library with sentiment analysis

5. `plotly` is an interactive visualization library

6. `wordcloud` is a word cloud generation package

7. `cufflinks` is a binding between Plotly and pandas for data visualization

In summary, these statements install the data analysis, visualization, and NLP packages needed for text mining and sentiment analysis in a Jupyter Notebook.

The `!` prefix lets you run shell commands such as `pip install` without leaving the notebook.

In [None]:
# Install the packages using pip in a Jupyter Notebook
!pip install nltk
!pip install chart_studio
!pip install textblob
!pip install plotly
!pip install wordcloud
!pip install cufflinks
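
After installing, it can help to confirm that the modules are actually importable before running the rest of the notebook. The following is a minimal sketch, not part of the original notebook: `missing_packages` is a hypothetical helper, and the list of import names is an assumption based on the imports used later.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names assumed from the notebook's import cell; adjust to your setup.
required = ["nltk", "textblob", "wordcloud", "plotly", "cufflinks", "chart_studio"]
print("Missing:", missing_packages(required) or "none")
```

Note that the import name can differ from the spelling used on the pip command line, which is why the check uses the lowercase module names.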

The following imports provide the key Python packages used for data importing, cleaning, analysis, visualization, and NLP tasks in a sentiment analysis project. Together they set up the notebook with all the module dependencies it needs.

In [None]:
# import necessary dependencies

# pandas as pd - Imports the popular Pandas data analysis library for working with data frames, renamed to pd for easier usage
import pandas as pd

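# nltk - Imports the Natural Language Toolkit; SentimentIntensityAnalyzer is its VADER sentiment scorer
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; download it once if it is not already installed
nltk.download("vader_lexicon", quiet=True)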

# re - Imports Python's regular expression module for complex text pattern matching
import re

# TextBlob - Imports the TextBlob NLP library which has sentiment analysis capabilities
from textblob import TextBlob

# WordCloud - Enables generating word cloud visualizations
from wordcloud import WordCloud

# numpy - Imports NumPy for numerical and scientific computing
import numpy as np

# matplotlib.pyplot - Static plotting library (seaborn is optional and left commented out)
# import seaborn as sns
import matplotlib.pyplot as plt

# cufflinks, plotly - Interactive visualization libraries for creating charts
# %matplotlib inline - Ensures matplotlib plots are shown in the notebook
# init_notebook_mode, cf.go_offline - Enable offline Plotly and cufflinks rendering in the notebook
%matplotlib inline
import cufflinks as cf
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from plotly.subplots import make_subplots

init_notebook_mode(connected=True)
cf.go_offline()

# warnings.filterwarnings("ignore") - Suppresses warning messages
# warnings.warn(...) - Demonstrates that a warning raised after the filter is not displayed
# pd.set_option('display.max_columns', None) - Ensures pandas shows all columns when printing data frames
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")
pd.set_option('display.max_columns', None)

# 👀 Reading the Dataset and an Overview 👀 🧐

The next cell reads a CSV file from a relative path into a pandas DataFrame named `df_`. This makes the Amazon review data available in the notebook for inspection with `df_.head()`, cleaning, preprocessing, and further analysis.

If the file loads correctly, `df_.head()` shows the first rows of a table with 12 columns: an index column (`Unnamed: 0`), `reviewerName`, `overall`, `reviewText`, `reviewTime`, `day_diff`, `helpful_yes`, `helpful_no`, `total_vote`, `score_pos_neg_diff`, `score_average_rating`, and `wilson_lower_bound`.

In [None]:
# Load the CSV dataset into a DataFrame
df_ = pd.read_csv("amazon_reviews.csv")

In [None]:
# Work on a copy so the original DataFrame (df_) stays unchanged
df = df_.copy()

In [None]:
# Sort the DataFrame (df) by the 'wilson_lower_bound' column in descending order
df = df.sort_values("wilson_lower_bound", ascending=False)

# Drop the 'Unnamed: 0' column from the DataFrame
df.drop('Unnamed: 0', inplace=True, axis=1)

# Display the first 5 rows of the DataFrame after sorting and dropping
df.head()
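
An `Unnamed: 0` column usually appears when a DataFrame was written to CSV together with its index. Whether `amazon_reviews.csv` was produced that way is an assumption; the sketch below uses made-up toy data to show the effect and two ways of handling it.

```python
import io
import pandas as pd

# Toy data standing in for amazon_reviews.csv (values are made up).
toy = pd.DataFrame({"overall": [5.0, 1.0], "reviewText": ["great", "bad"]})

# Writing with the default index=True stores the index as an unnamed first column.
csv_text = toy.to_csv()

# Reading it back naively yields an extra 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# Option 1: drop the column afterwards, as the notebook does.
dropped = naive.drop(columns="Unnamed: 0")

# Option 2: tell pandas that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```

Both options end with the same set of data columns; `index_col=0` additionally restores the original row labels as the index.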

The code that follows defines two custom functions that analyze a pandas DataFrame and identify data quality issues.

`missing_values_analysis()`:

1. Finds columns with missing values

2. Calculates the number and ratio of missing values

3. Returns a DataFrame summarizing the missing value details

`check_dataframe()`:

1. Prints the overall shape and data types

2. Calls `missing_values_analysis()` and prints its output

3. Prints the number of duplicate rows

4. Prints quantiles for a statistical overview

Together these functions inspect a DataFrame for missing values, duplicated rows, quantiles and outliers, and general properties such as data types and size. They automate the check for common data quality issues before analysis and modelling.

To use them, define your DataFrame as `df` and call `check_dataframe(df)`. The function prints the statistics and checks listed above.



In [None]:
# Finds columns with missing values
# Calculates number and ratio of missing values
# Returns dataframe summarizing missing value details
def missing_values_analysis(df):
    na_columns_ = [col for col in df.columns if df[col].isnull().sum() > 0]
    n_miss = df[na_columns_].isnull().sum().sort_values(ascending=True)
    ratio_ = (df[na_columns_].isnull().sum() / df.shape[0] * 100).sort_values(ascending=True)
    missing_df = pd.concat([n_miss, np.round(ratio_, 2)], axis=1, keys=['Total Missing Values', 'Ratio'])
    missing_df = pd.DataFrame(missing_df)
    return missing_df

# Prints overall shape, data types
# Calls missing_values_analysis and prints output
# Prints number of duplicate rows
# Prints quantiles for statistical overview
def check_dataframe(df, head=5, tail = 5):
    
    print(" SHAPE ".center(82,'~'))
    print('Rows: {}'.format(df.shape[0]))
    print('Columns: {}'.format(df.shape[1]))
    print(" TYPES ".center(82,'~'))
    print(df.dtypes)
    print("".center(82,'~'))
    print(missing_values_analysis(df))
    print(' DUPLICATED VALUES '.center(83,'~'))
    print(df.duplicated().sum())
    print(" QUANTILES ".center(82,'~'))
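    # Restrict to numeric columns; quantiles are not defined for text columns
    print(df.select_dtypes(include="number").quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)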

# print various stats and checks
check_dataframe(df)

The output will be similar to this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SHAPE ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rows: 4915
Columns: 11
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TYPES ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reviewerName             object
overall                 float64
reviewText               object
reviewTime               object
day_diff                  int64
helpful_yes               int64
helpful_no                int64
total_vote                int64
score_pos_neg_diff        int64
score_average_rating    float64
wilson_lower_bound      float64
dtype: object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
              Total Missing Values  Ratio
reviewerName                     1   0.02
reviewText                       1   0.02
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ DUPLICATED VALUES ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ QUANTILES ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                       0.00  0.05   0.50        0.95       0.99         1.00
overall                 1.0   2.0    5.0    5.000000    5.00000     5.000000
day_diff                1.0  98.0  431.0  748.000000  943.00000  1064.000000
helpful_yes             0.0   0.0    0.0    1.000000    3.00000  1952.000000
helpful_no              0.0   0.0    0.0    0.000000    2.00000   183.000000
total_vote              0.0   0.0    0.0    1.000000    4.00000  2020.000000
score_pos_neg_diff   -130.0   0.0    0.0    1.000000    2.00000  1884.000000
score_average_rating    0.0   0.0    0.0    1.000000    1.00000     1.000000
wilson_lower_bound      0.0   0.0    0.0    0.206549    0.34238     0.957544
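
To make the individual checks easier to follow, here is a minimal sketch of the same ideas on a tiny, made-up DataFrame. The column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame with one missing value and one duplicated row (made-up data).
toy = pd.DataFrame({
    "reviewerName": ["ann", None, "bob", "bob"],
    "overall": [5.0, 4.0, 1.0, 1.0],
    "reviewText": ["good", "ok", "bad", "bad"],
})

# Count and percentage of missing values, keeping only columns that have any.
n_miss = toy.isnull().sum()
n_miss = n_miss[n_miss > 0]
ratio = (n_miss / len(toy) * 100).round(2)
missing = pd.concat([n_miss, ratio], axis=1, keys=["Total Missing Values", "Ratio"])
print(missing)

# Fully duplicated rows (the first occurrence is not counted).
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# Quantiles only make sense for numeric columns, so select them explicitly.
quantiles = toy.select_dtypes(include="number").quantile([0, 0.5, 1]).T
print(quantiles)
```

Selecting numeric columns before calling `quantile` avoids errors on text columns such as `reviewText`.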

The following function, `check_class()`, counts the distinct values (classes) in each column of a DataFrame and returns the result as a new DataFrame. Here is what it does:

1. Takes a pandas DataFrame as input

2. Loops through the columns

3. Gets the number of unique values for each column using `.nunique()`

4. Stores the column names and unique counts in a new DataFrame

5. Sorts that DataFrame by the 'Classes' column in descending order

6. Resets the index

7. Returns the resulting analysis DataFrame

The output summarizes each variable and its number of distinct classes/categories.


In [None]:
# Count the distinct values in each column and return the result as a DataFrame
def check_class(dataframe):
    nunique_df = pd.DataFrame({'Variable': dataframe.columns,
                               'Classes': [dataframe[i].nunique()
                                           for i in dataframe.columns]})

    nunique_df = nunique_df.sort_values('Classes', ascending=False)
    nunique_df = nunique_df.reset_index(drop = True)
    return nunique_df

check_class(df)

The output is a table similar to the one below:

    Variable              Classes
0   reviewText               4912
1   reviewerName             4594
2   reviewTime                690
3   day_diff                  690
4   wilson_lower_bound         40
5   score_average_rating       28
6   score_pos_neg_diff         27
7   total_vote                 26
8   helpful_yes                23
9   helpful_no                 17
10  overall                     5
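
The same summary can be sketched more compactly with `DataFrame.nunique()`, which counts the distinct non-null values of every column in one call. This is an alternative formulation on made-up toy data, not the notebook's own function.

```python
import pandas as pd

# Toy data (made up) with a different number of distinct values per column.
toy = pd.DataFrame({
    "overall": [5, 5, 1, 4],
    "reviewText": ["a", "b", "c", "d"],
    "helpful_yes": [0, 0, 0, 2],
})

# nunique() returns a Series indexed by column name; reshape it into
# the same Variable/Classes layout used by check_class().
classes = (toy.nunique()
              .sort_values(ascending=False)
              .rename_axis("Variable")
              .reset_index(name="Classes"))
print(classes)
```

A column with very few classes, such as `overall`, is a natural candidate for the categorical summary that follows.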

Next we use `categorical_variable_summary()`, which generates an interactive Plotly visualization to analyze and summarize a categorical variable in a pandas DataFrame.

Here is what it does:

1. Accepts a dataframe `df` and categorical column name as input

2. Creates a plotly figure with countplot and pie chart subplots 

3. The countplot shows the counts of each unique category value as a bar chart

4. The pie chart displays the percentage split of categories

5. Applies some styling customizations to the plots

6. Sets overall figure title and template

7. Displays interactive figure

To use it:

```
categorical_variable_summary(df, 'column_name')
```

This provides a quick yet comprehensive summary of a categorical variable's distribution - both counts and percentages in an easy to interpret visual format.

The call on the sample dataframe `categorical_variable_summary(df,'overall')` passes the 'overall' column into the function to generate the analysis.



In [None]:
# categorical variable analysis ---> overall

constraints = ['#581845','#C70039','#2E4053','#1ABC9C','#7F8C8D']

def categorical_variable_summary(df, column_name):
    fig = make_subplots(rows=1,cols=2,
                        subplot_titles=('Countplot','Percentages'),
                        specs=[[{"type": "xy"}, {'type':'domain'}]])

    fig.add_trace(go.Bar( y = df[column_name].value_counts().values.tolist(), 
                          x = [str(i) for i in df[column_name].value_counts().index], 
                          text = df[column_name].value_counts().values.tolist(),
                          textfont = dict(size=15),
                          name = column_name,
                          textposition = 'auto',
                          showlegend=False,
                          marker=dict(color = constraints,
                                      line=dict(color='#DBE6EC',
                                                width=1))),
                  row = 1, col = 1)
    
    fig.add_trace(go.Pie(labels= df[column_name].value_counts().keys(),
                         values= df[column_name].value_counts().values,
                         textfont = dict(size = 20),
                         textposition='auto',
                         showlegend = False,
                         name = column_name,
                         marker=dict(colors=constraints)),
                  row = 1, col = 2)
    
    fig.update_layout(title={'text': column_name,
                             'y':0.9,
                             'x':0.5,
                             'xanchor': 'center',
                             'yanchor': 'top'},
                      template='plotly_white')
    
    iplot(fig)

In [None]:
categorical_variable_summary(df, 'overall')

The result is a figure with two panels: a bar chart showing the count of each `overall` rating and a pie chart showing the corresponding percentages.
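
The numbers behind both panels come from `value_counts()`. As a minimal sketch on made-up ratings (not the real dataset), the counts feed the bar chart and the normalized counts feed the pie chart:

```python
import pandas as pd

# Made-up ratings standing in for the 'overall' column.
ratings = pd.Series([5, 5, 5, 4, 1, 5, 4, 5], name="overall")

counts = ratings.value_counts()                      # bar heights
shares = ratings.value_counts(normalize=True) * 100  # pie slices, in percent

summary = pd.DataFrame({"count": counts, "percent": shares.round(1)})
print(summary)
```

Inspecting this table first is a quick way to sanity-check the chart before rendering it.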

📌 Our goal is to rank the comments by sentiment, so we will not dwell on other details.


In [None]:
# sample for cleaning
df.reviewText.head()

The output will be similar to this:

2031    [[ UPDATE - 6/19/2014 ]]So my lovely wife boug...
3449    I have tested dozens of SDHC and micro-SDHC ca...
4212    NOTE:  please read the last update (scroll to ...
317     If your card gets hot enough to be painful, it...
4672    Sandisk announcement of the first 128GB micro ...
Name: reviewText, dtype: object

In [None]:

example_review = df.reviewText[2031]
example_review

the output will look like this:'[[ UPDATE - 6/19/2014 ]]So my lovely wife bought me a Samsung Galaxy Tab 4 for Father\'s Day and I\'ve been loving it ever since.  Just as other with Samsung products, the Galaxy Tab 4 has the ability to add a microSD card to expand the memory on the device.  Since it\'s been over a year, I decided to do some more research to see if SanDisk offered anything new.  As of 6/19/2014, their product lineup for microSD cards from worst to best (performance-wise) are the as follows:SanDiskSanDisk UltraSanDisk Ultra PLUSSanDisk ExtremeSanDisk Extreme PLUSSanDisk Extreme PRONow, the difference between all of these cards are simply the speed in which you can read/write data to the card.  Yes, the published rating of most all these cards (except the SanDisk regular) are Class 10/UHS-I but that\'s just a rating... Actual real world performance does get better with each model, but with faster cards come more expensive prices.  Since Amazon doesn\'t carry the Ultra PLUS model of microSD card, I had to do direct comparisons between the SanDisk Ultra ($34.27), Extreme ($57.95), and Extreme PLUS ($67.95).As mentioned in my earlier review, I purchased the SanDisk Ultra for my Galaxy S4.  My question was, did I want to pay over $20 more for a card that is faster than the one I already owned?  Or I could pay almost double to get SanDisk\'s 2nd-most fastest microSD card.The Ultra works perfectly fine for my style of usage (storing/capturing pictures & HD video and movie playback) on my phone.  So in the end, I ended up just buying another SanDisk Ultra 64GB card.  I use my cell phone *more* than I do my tablet and if the card is good enough for my phone, it\'s good enough for my tablet.  
I don\'t own a 4K HD camera or anything like that, so I honestly didn\'t see a need to get one of the faster cards at this time.I am now a proud owner of 2 SanDisk Ultra cards and have absolutely 0 issues with it in my Samsung devices.[[ ORIGINAL REVIEW - 5/1/2013 ]]I haven\'t had to buy a microSD card in a long time. The last time I bought one was for my cell phone over 2 years ago. But since my cellular contract was up, I knew I would have to get a newer card in addition to my new phone, the Samsung Galaxy S4. Reason for this is because I knew my small 16GB microSD card wasn\'t going to cut it.Doing research on the Galaxy S4, I wanted to get the best card possible that had decent capacity (32 GB or greater). This led me to find that the Galaxy S4 supports the microSDXC Class 10 UHS-I card, which is the fastest possible given that class. Searching for that specifically on Amazon gave me results of only 3 vendors (as of April) that makes these microSDXC Class 10 UHS-1 cards. They are Sandisk (the majority), Samsung and Lexar. Nobody else makes these that are sold on Amazon.Seeing how SanDisk is a pretty good name out of the 3 (I\'ve used them the most), I decided upon the SanDisk because Lexar was overpriced and the Samsung one was overpriced (as well as not eligible for Amazon Prime).But the scary thing is that when you filter by the SanDisk, you literally get DOZENS of options. All of them have different model numbers, different sizes, etc. Then there\'s that confusion of what\'s the difference between SDHC & SDXC?SDHC vs SDXC:SDHC stand for "Secure Digital High Capacity" and SDXC stands for "Secure Digital eXtended Capacity". Essentially these two cards are the same with the exception that SDHC only supports capcities up to 32GB and is formated with the FAT32 file system. The SDXC cards are formatted with the exFAT file system. 
If you use an SDXC card in a device, it must support that file system, otherwise it may not be recognizable and/or you have to reformat the card to FAT32.FAT32 vs exFAT:The differences between the two file systems means that FAT32 has a maximum file size of 4GB, limited by that file system. exFAT on the otherhand, supports file sizes up to 2TB (terabytes). The only thing you need to know here really is that it\'s possible your device doesn\'t support exFAT. If that\'s the case, just reformat it to FAT32. REMEMBER FORMATTING ERASES ALL DATA!To clarify the model numbers, I I hopped over to the SanDisk official webpage. What I found there is that they offer two "highspeed" options for SanDisk cards. These are SanDisk Extreme Pro and SanDisk Ultra. SanDisk Extreme Pro is a line that supports read speeds up to 95MB/sec, however they are SDHC only. To make things worse, they are currently only available in 16GB & 8GB capacities. Since one of my requirements was to have a lot of storage, I ruled these out.The remaining devices listed on Amazon\'s search were the SanDisk Ultra line. But here, confusion sets in because SanDisk separates these cards to two different devices. Cameras & mobile devices. Is there a real difference between the two or is this just a marketing stunt? Unfortunately I\'m not sure but I do know the price difference between the two range from a couple cents to a few dollars. Since I wasn\'t sure, I opted for the one specifically targeted for mobile devices (just in case there is some kind of compatibility issue). To find the exact model number, I would go to Sandisk\'s webpage (sandisk.com) and compare their existing product lineup. From there, you get exact model numbers and you can then search Amazon for these model numbers. 
That is how I got mine (SDSDQUA-064G).As for speed tests, I haven\'t run any specific testing, but copying 8 GB worth of data from my PC to the card literally took just a few minutes.One last note is that Amazon attaches additional characters to the end (for example SDSDQUA-064G-AFFP-A vs SDSDQUA-064G-U46A). The difference between the two is that the "AFFP-A" means "Amazon Frustration Free Packaging". Other than that, these are exactly the same.  If you\'re wondering what I got (and want to use it in your Galaxy S4), I got the SDSDQUA-064G-u46A and it works like charm.'

In [None]:
# Remove punctuation and numbers with a regular expression: every character that is not a letter becomes a space
example_review = re.sub("[^a-zA-Z]",' ',example_review)
example_review

'   UPDATE               So my lovely wife bought me a Samsung Galaxy Tab   for Father s Day and I ve been loving it ever since   Just as other with Samsung products  the Galaxy Tab   has the ability to add a microSD card to expand the memory on the device   Since it s been over a year  I decided to do some more research to see if SanDisk offered anything new   As of            their product lineup for microSD cards from worst to best  performance wise  are the as follows SanDiskSanDisk UltraSanDisk Ultra PLUSSanDisk ExtremeSanDisk Extreme PLUSSanDisk Extreme PRONow  the difference between all of these cards are simply the speed in which you can read write data to the card   Yes  the published rating of most all these cards  except the SanDisk regular  are Class    UHS I but that s just a rating    Actual real world performance does get better with each model  but with faster cards come more expensive prices   Since Amazon doesn t carry the Ultra PLUS model of microSD card  I had to do direct comparisons between the SanDisk Ultra           Extreme           and Extreme PLUS          As mentioned in my earlier review  I purchased the SanDisk Ultra for my Galaxy S    My question was  did I want to pay over     more for a card that is faster than the one I already owned   Or I could pay almost double to get SanDisk s  nd most fastest microSD card The Ultra works perfectly fine for my style of usage  storing capturing pictures   HD video and movie playback  on my phone   So in the end  I ended up just buying another SanDisk Ultra   GB card   I use my cell phone  more  than I do my tablet and if the card is good enough for my phone  it s good enough for my tablet   I don t own a  K HD camera or anything like that  so I honestly didn t see a need to get one of the faster cards at this time I am now a proud owner of   SanDisk Ultra cards and have absolutely   issues with it in my Samsung devices    ORIGINAL REVIEW              I haven t had to buy a microSD card in a long 
time  The last time I bought one was for my cell phone over   years ago  But since my cellular contract was up  I knew I would have to get a newer card in addition to my new phone  the Samsung Galaxy S   Reason for this is because I knew my small   GB microSD card wasn t going to cut it Doing research on the Galaxy S   I wanted to get the best card possible that had decent capacity     GB or greater   This led me to find that the Galaxy S  supports the microSDXC Class    UHS I card  which is the fastest possible given that class  Searching for that specifically on Amazon gave me results of only   vendors  as of April  that makes these microSDXC Class    UHS   cards  They are Sandisk  the majority   Samsung and Lexar  Nobody else makes these that are sold on Amazon Seeing how SanDisk is a pretty good name out of the    I ve used them the most   I decided upon the SanDisk because Lexar was overpriced and the Samsung one was overpriced  as well as not eligible for Amazon Prime  But the scary thing is that when you filter by the SanDisk  you literally get DOZENS of options  All of them have different model numbers  different sizes  etc  Then there s that confusion of what s the difference between SDHC   SDXC SDHC vs SDXC SDHC stand for  Secure Digital High Capacity  and SDXC stands for  Secure Digital eXtended Capacity   Essentially these two cards are the same with the exception that SDHC only supports capcities up to   GB and is formated with the FAT   file system  The SDXC cards are formatted with the exFAT file system  If you use an SDXC card in a device  it must support that file system  otherwise it may not be recognizable and or you have to reformat the card to FAT   FAT   vs exFAT The differences between the two file systems means that FAT   has a maximum file size of  GB  limited by that file system  exFAT on the otherhand  supports file sizes up to  TB  terabytes   The only thing you need to know here really is that it s possible your device doesn t support 
exFAT  If that s the case  just reformat it to FAT    REMEMBER FORMATTING ERASES ALL DATA To clarify the model numbers  I I hopped over to the SanDisk official webpage  What I found there is that they offer two  highspeed  options for SanDisk cards  These are SanDisk Extreme Pro and SanDisk Ultra  SanDisk Extreme Pro is a line that supports read speeds up to   MB sec  however they are SDHC only  To make things worse  they are currently only available in   GB    GB capacities  Since one of my requirements was to have a lot of storage  I ruled these out The remaining devices listed on Amazon s search were the SanDisk Ultra line  But here  confusion sets in because SanDisk separates these cards to two different devices  Cameras   mobile devices  Is there a real difference between the two or is this just a marketing stunt  Unfortunately I m not sure but I do know the price difference between the two range from a couple cents to a few dollars  Since I wasn t sure  I opted for the one specifically targeted for mobile devices  just in case there is some kind of compatibility issue   To find the exact model number  I would go to Sandisk s webpage  sandisk com  and compare their existing product lineup  From there  you get exact model numbers and you can then search Amazon for these model numbers  That is how I got mine  SDSDQUA    G  As for speed tests  I haven t run any specific testing  but copying   GB worth of data from my PC to the card literally took just a few minutes One last note is that Amazon attaches additional characters to the end  for example SDSDQUA    G AFFP A vs SDSDQUA    G U  A   The difference between the two is that the  AFFP A  means  Amazon Frustration Free Packaging   Other than that  these are exactly the same   If you re wondering what I got  and want to use it in your Galaxy S    I got the SDSDQUA    G u  A and it works like charm '
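
The effect of the pattern is easier to see on a short string. The sketch below uses a made-up sentence; the second substitution, which collapses runs of spaces, is an optional extra step that the notebook itself does not perform.

```python
import re

# Made-up example text containing digits, symbols and a contraction.
sample = "Bought 6/19/2014 for $34.27, it's GREAT!"

# Replace every character that is not an ASCII letter with a space.
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# Optional extra step (not in the notebook): collapse runs of whitespace.
collapsed = re.sub(r"\s+", " ", letters_only).strip()
print(collapsed)  # Bought for it s GREAT
```

Because the apostrophe is also replaced, contractions such as "it's" split into "it" and "s", which is why single-letter tokens appear in the cleaned review.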

📌 I will now convert the text to lowercase and split it into a list of words. Text-processing and machine learning tools generally treat tokens as case-sensitive, so a capitalized word and its lowercase form would be counted as two different words. Lowercasing everything makes them identical.
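
A minimal sketch of why this matters, using a made-up sentence and Python's `collections.Counter` (an illustration only, not part of the notebook's pipeline):

```python
from collections import Counter

# Made-up text with three spellings of the same brand name.
text = "SanDisk makes cards  Sandisk cards are fast  sandisk"

# Without lowercasing, each spelling is counted as a separate token.
raw_counts = Counter(text.split())

# str.lower() merges the variants; str.split() with no arguments
# splits on any run of whitespace, so double spaces are harmless.
tokens = text.lower().split()
lower_counts = Counter(tokens)

print(raw_counts["SanDisk"], raw_counts["Sandisk"], raw_counts["sandisk"])  # 1 1 1
print(lower_counts["sandisk"])  # 3
```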

In [None]:
example_review = example_review.lower().split()
example_review

['update',
 'so',
 'my',
 'lovely',
 'wife',
 'bought',
 'me',
 'a',
 'samsung',
 'galaxy',
 'tab',
 'for',
 'father',
 's',
 'day',
 'and',
 'i',
 've',
 'been',
 'loving',
 'it',
 'ever',
 'since',
 'just',
 'as',
 'other',
 'with',
 'samsung',
 'products',
 'the',
 'galaxy',
 'tab',
 'has',
 'the',
 'ability',
 'to',
 'add',
 'a',
 'microsd',
 'card',
 'to',
 'expand',
 'the',
 'memory',
 'on',
 'the',
 'device',
 'since',
 'it',
 's',
 'been',
 'over',
 'a',
 'year',
 'i',
 'decided',
 'to',
 'do',
 'some',
 'more',
 'research',
 'to',
 'see',
 'if',
 'sandisk',
 'offered',
 'anything',
 'new',
 'as',
 'of',
 'their',
 'product',
 'lineup',
 'for',
 'microsd',
 'cards',
 'from',
 'worst',
 'to',
 'best',
 'performance',
 'wise',
 'are',
 'the',
 'as',
 'follows',
 'sandisksandisk',
 'ultrasandisk',
 'ultra',
 'plussandisk',
 'extremesandisk',
 'extreme',
 'plussandisk',
 'extreme',
 'pronow',
 'the',
 'difference',
 'between',
 'all',
 'of',
 'these',
 'cards',
 'are',
 'simply',
 'the',
 'speed',
 'in',
 'which',
 'you',
 'can',
 'read',
 'write',
 'data',
 'to',
 'the',
 'card',
 'yes',
 'the',
 'published',
 'rating',
 'of',
 'most',
 'all',
 'these',
 'cards',
 'except',
 'the',
 'sandisk',
 'regular',
 'are',
 'class',
 'uhs',
 'i',
 'but',
 'that',
 's',
 'just',
 'a',
 'rating',
 'actual',
 'real',
 'world',
 'performance',
 'does',
 'get',
 'better',
 'with',
 'each',
 'model',
 'but',
 'with',
 'faster',
 'cards',
 'come',
 'more',
 'expensive',
 'prices',
 'since',
 'amazon',
 'doesn',
 't',
 'carry',
 'the',
 'ultra',
 'plus',
 'model',
 'of',
 'microsd',
 'card',
 'i',
 'had',
 'to',
 'do',
 'direct',
 'comparisons',
 'between',
 'the',
 'sandisk',
 'ultra',
 'extreme',
 'and',
 'extreme',
 'plus',
 'as',
 'mentioned',
 'in',
 'my',
 'earlier',
 'review',
 'i',
 'purchased',
 'the',
 'sandisk',
 'ultra',
 'for',
 'my',
 'galaxy',
 's',
 'my',
 'question',
 'was',
 'did',
 'i',
 'want',
 'to',
 'pay',
 'over',
 'more',
 'for',
 'a',
 'card',
 'that',
 'is',
 'faster',
 'than',
 'the',
 'one',
 'i',
 'already',
 'owned',
 'or',
 'i',
 'could',
 'pay',
 'almost',
 'double',
 'to',
 'get',
 'sandisk',
 's',
 'nd',
 'most',
 'fastest',
 'microsd',
 'card',
 'the',
 'ultra',
 'works',
 'perfectly',
 'fine',
 'for',
 'my',
 'style',
 'of',
 'usage',
 'storing',
 'capturing',
 'pictures',
 'hd',
 'video',
 'and',
 'movie',
 'playback',
 'on',
 'my',
 'phone',
 'so',
 'in',
 'the',
 'end',
 'i',
 'ended',
 'up',
 'just',
 'buying',
 'another',
 'sandisk',
 'ultra',
 'gb',
 'card',
 'i',
 'use',
 'my',
 'cell',
 'phone',
 'more',
 'than',
 'i',
 'do',
 'my',
 'tablet',
 'and',
 'if',
 'the',
 'card',
 'is',
 'good',
 'enough',
 'for',
 'my',
 'phone',
 'it',
 's',
 'good',
 'enough',
 'for',
 'my',
 'tablet',
 'i',
 'don',
 't',
 'own',
 'a',
 'k',
 'hd',
 'camera',
 'or',
 'anything',
 'like',
 'that',
 'so',
 'i',
 'honestly',
 'didn',
 't',
 'see',
 'a',
 'need',
 'to',
 'get',
 'one',
 'of',
 'the',
 'faster',
 'cards',
 'at',
 'this',
 'time',
 'i',
 'am',
 'now',
 'a',
 'proud',
 'owner',
 'of',
 'sandisk',
 'ultra',
 'cards',
 'and',
 'have',
 'absolutely',
 'issues',
 'with',
 'it',
 'in',
 'my',
 'samsung',
 'devices',
 'original',
 'review',
 'i',
 'haven',
 't',
 'had',
 'to',
 'buy',
 'a',
 'microsd',
 'card',
 'in',
 'a',
 'long',
 'time',
 'the',
 'last',
 'time',
 'i',
 'bought',
 'one',
 'was',
 'for',
 'my',
 'cell',
 'phone',
 'over',
 'years',
 'ago',
 'but',
 'since',
 'my',
 'cellular',
 'contract',
 'was',
 'up',
 'i',
 'knew',
 'i',
 'would',
 'have',
 'to',
 'get',
 'a',
 'newer',
 'card',
 'in',
 'addition',
 'to',
 'my',
 'new',
 'phone',
 'the',
 'samsung',
 'galaxy',
 's',
 'reason',
 'for',
 'this',
 'is',
 'because',
 'i',
 'knew',
 'my',
 'small',
 'gb',
 'microsd',
 'card',
 'wasn',
 't',
 'going',
 'to',
 'cut',
 'it',
 'doing',
 'research',
 'on',
 'the',
 'galaxy',
 's',
 'i',
 'wanted',
 'to',
 'get',
 'the',
 'best',
 'card',
 'possible',
 'that',
 'had',
 'decent',
 'capacity',
 'gb',
 'or',
 'greater',
 'this',
 'led',
 'me',
 'to',
 'find',
 'that',
 'the',
 'galaxy',
 's',
 'supports',
 'the',
 'microsdxc',
 'class',
 'uhs',
 'i',
 'card',
 'which',
 'is',
 'the',
 'fastest',
 'possible',
 'given',
 'that',
 'class',
 'searching',
 'for',
 'that',
 'specifically',
 'on',
 'amazon',
 'gave',
 'me',
 'results',
 'of',
 'only',
 'vendors',
 'as',
 'of',
 'april',
 'that',
 'makes',
 'these',
 'microsdxc',
 'class',
 'uhs',
 'cards',
 'they',
 'are',
 'sandisk',
 'the',
 'majority',
 'samsung',
 'and',
 'lexar',
 'nobody',
 'else',
 'makes',
 'these',
 'that',
 'are',
 'sold',
 'on',
 'amazon',
 'seeing',
 'how',
 'sandisk',
 'is',
 'a',
 'pretty',
 'good',
 'name',
 'out',
 'of',
 'the',
 'i',
 've',
 'used',
 'them',
 'the',
 'most',
 'i',
 'decided',
 'upon',
 'the',
 'sandisk',
 'because',
 'lexar',
 'was',
 'overpriced',
 'and',
 'the',
 'samsung',
 'one',
 'was',
 'overpriced',
 'as',
 'well',
 'as',
 'not',
 'eligible',
 'for',
 'amazon',
 'prime',
 'but',
 'the',
 'scary',
 'thing',
 'is',
 'that',
 'when',
 'you',
 'filter',
 'by',
 'the',
 'sandisk',
 'you',
 'literally',
 'get',
 'dozens',
 'of',
 'options',
 'all',
 'of',
 'them',
 'have',
 'different',
 'model',
 'numbers',
 'different',
 'sizes',
 'etc',
 'then',
 'there',
 's',
 'that',
 'confusion',
 'of',
 'what',
 's',
 'the',
 'difference',
 'between',
 'sdhc',
 'sdxc',
 'sdhc',
 'vs',
 'sdxc',
 'sdhc',
 'stand',
 'for',
 'secure',
 'digital',
 'high',
 'capacity',
 'and',
 'sdxc',
 'stands',
 'for',
 'secure',
 'digital',
 'extended',
 'capacity',
 'essentially',
 'these',
 'two',
 'cards',
 'are',
 'the',
 'same',
 'with',
 'the',
 'exception',
 'that',
 'sdhc',
 'only',
 'supports',
 'capcities',
 'up',
 'to',
 'gb',
 'and',
 'is',
 'formated',
 'with',
 'the',
 'fat',
 'file',
 'system',
 'the',
 'sdxc',
 'cards',
 'are',
 'formatted',
 'with',
 'the',
 'exfat',
 'file',
 'system',
 'if',
 'you',
 'use',
 'an',
 'sdxc',
 'card',
 'in',
 'a',
 'device',
 'it',
 'must',
 'support',
 'that',
 'file',
 'system',
 'otherwise',
 'it',
 'may',
 'not',
 'be',
 'recognizable',
 'and',
 'or',
 'you',
 'have',
 'to',
 'reformat',
 'the',
 'card',
 'to',
 'fat',
 'fat',
 'vs',
 'exfat',
 'the',
 'differences',
 'between',
 'the',
 'two',
 'file',
 'systems',
 'means',
 'that',
 'fat',
 'has',
 'a',
 'maximum',
 'file',
 'size',
 'of',
 'gb',
 'limited',
 'by',
 'that',
 'file',
 'system',
 'exfat',
 'on',
 'the',
 'otherhand',
 'supports',
 'file',
 'sizes',
 'up',
 'to',
 'tb',
 'terabytes',
 'the',
 'only',
 'thing',
 'you',
 'need',
 'to',
 'know',
 'here',
 'really',
 'is',
 'that',
 'it',
 's',
 'possible',
 'your',
 'device',
 'doesn',
 't',
 'support',
 'exfat',
 'if',
 'that',
 's',
 'the',
 'case',
 'just',
 'reformat',
 'it',
 'to',
 'fat',
 'remember',
 'formatting',
 'erases',
 'all',
 'data',
 'to',
 'clarify',
 'the',
 'model',
 'numbers',
 'i',
 'i',
 'hopped',
 'over',
 'to',
 'the',
 'sandisk',
 'official',
 'webpage',
 'what',
 'i',
 'found',
 'there',
 'is',
 'that',
 'they',
 'offer',
 'two',
 'highspeed',
 'options',
 'for',
 'sandisk',
 'cards',
 'these',
 'are',
 'sandisk',
 'extreme',
 'pro',
 'and',
 'sandisk',
 'ultra',
 'sandisk',
 'extreme',
 'pro',
 'is',
 'a',
 'line',
 'that',
 'supports',
 'read',
 'speeds',
 'up',
 'to',
 'mb',
 'sec',
 'however',
 'they',
 'are',
 'sdhc',
 'only',
 'to',
 'make',
 'things',
 'worse',
 'they',
 'are',
 'currently',
 'only',
 'available',
 'in',
 'gb',
 'gb',
 'capacities',
 'since',
 'one',
 'of',
 'my',
 'requirements',
 'was',
 'to',
 'have',
 'a',
 'lot',
 'of',
 'storage',
 'i',
 'ruled',
 'these',
 'out',
 'the',
 'remaining',
 'devices',
 'listed',
 'on',
 'amazon',
 's',
 'search',
 'were',
 'the',
 'sandisk',
 'ultra',
 'line',
 'but',
 'here',
 'confusion',
 'sets',
 'in',
 'because',
 'sandisk',
 'separates',
 'these',
 'cards',
 'to',
 'two',
 'different',
 'devices',
 'cameras',
 'mobile',
 'devices',
 'is',
 'there',
 'a',
 'real',
 'difference',
 'between',
 'the',
 'two',
 'or',
 'is',
 'this',
 'just',
 'a',
 'marketing',
 'stunt',
 'unfortunately',
 'i',
 'm',
 'not',
 'sure',
 'but',
 'i',
 'do',
 'know',
 'the',
 'price',
 'difference',
 'between',
 'the',
 'two',
 'range',
 'from',
 'a',
 'couple',
 'cents',
 'to',
 'a',
 'few',
 'dollars',
 'since',
 'i',
 'wasn',
 't',
 'sure',
 'i',
 'opted',
 'for',
 'the',
 'one',
 'specifically',
 'targeted',
 'for',
 'mobile',
 'devices',
 'just',
 'in',
 'case',
 'there',
 'is',
 'some',
 'kind',
 'of',
 'compatibility',
 'issue',
 'to',
 'find',
 'the',
 'exact',
 'model',
 'number',
 'i',
 'would',
 'go',
 'to',
 'sandisk',
 's',
 'webpage',
 'sandisk',
 'com',
 'and',
 'compare',
 'their',
 'existing',
 'product',
 'lineup',
 'from',
 'there',
 'you',
 'get',
 'exact',
 'model',
 'numbers',
 'and',
 'you',
 'can',
 'then',
 'search',
 'amazon',
 'for',
 'these',
 'model',
 'numbers',
 'that',
 'is',
 'how',
 'i',
 'got',
 'mine',
 'sdsdqua',
 'g',
 'as',
 'for',
 'speed',
 'tests',
 'i',
 'haven',
 't',
 'run',
 'any',
 'specific',
 'testing',
 'but',
 'copying',
 'gb',
 'worth',
 'of',
 'data',
 'from',
 'my',
 'pc',
 'to',
 'the',
 'card',
 'literally',
 'took',
 'just',
 'a',
 'few',
 'minutes',
 'one',
 'last',
 'note',
 'is',
 ...]

The next set of statements performs text preprocessing on the `reviewText` column of the dataframe `df`.

Here is a breakdown:

```python
rt = lambda x: re.sub("[^a-zA-Z]",' ',str(x))
```

- Defines a lambda function `rt` that takes text `x` as input
- Uses a regular expression to replace every character that is not an ASCII letter (a-z, A-Z) with a space
- Converts `x` to a string first, so non-string values do not raise an error

```python 
df["reviewText"] = df["reviewText"].map(rt)
```

- Applies the lambda function `rt` to every value in the `reviewText` column
- This replaces all non-letter characters with spaces, leaving only words (contractions such as "doesn't" are split into "doesn" and "t")
  
```python
df["reviewText"] = df["reviewText"].str.lower()
```  

- Converts all remaining text in the `reviewText` column to lowercase

```python
df.head(10)
```

- Displays the first 10 rows to inspect the preprocessing result

The overall effect is to standardize the review text by:

1. Replacing punctuation, digits, and other non-letter characters with spaces
2. Converting to lowercase

This cleans the text in preparation for further analysis such as sentiment scoring.

In [None]:
rt = lambda x: re.sub("[^a-zA-Z]",' ',str(x))
df["reviewText"] = df["reviewText"].map(rt)
df["reviewText"] = df["reviewText"].str.lower()
df.head(10)

The output is an updated version of the previous table, with 11 columns and 12 rows.
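To make the effect of the cleaning step concrete, here is a minimal sketch on a toy dataframe. The sample sentence and the `sample` and `cleaned` names are made up for illustration; only `rt` and the `reviewText` column name come from the notebook.

```python
import re
import pandas as pd

# Toy data: one made-up review and one missing value (not from the real dataset)
sample = pd.DataFrame({"reviewText": ["Doesn't work with my 64GB card!", float("nan")]})

# Same cleaning function as in the notebook
rt = lambda x: re.sub("[^a-zA-Z]", " ", str(x))

cleaned = sample["reviewText"].map(rt).str.lower()

# The apostrophe, digits and "!" all become spaces, so "doesn't" splits in two
print(cleaned.iloc[0].split())

# str(x) turns the missing value into the literal text "nan"
print(repr(cleaned.iloc[1]))
```

Two side effects are worth keeping in mind: contractions are split into separate tokens, and missing reviews become the word "nan" rather than staying empty, so it may be worth dropping or filling missing values before this step.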

Next, we perform sentiment analysis on the `reviewText` column of the dataframe `df`.

1. Apply TextBlob to generate polarity and subjectivity scores:

```python
df[['polarity', 'subjectivity']] = df['reviewText'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))
```

2. Loop through each review text (`Series.items()` is used here because `iteritems()` was removed in pandas 2.0):

```python
for index, row in df['reviewText'].items():
```

3. Get the positive, negative, and neutral sentiment scores using VADER:

```python
score = SentimentIntensityAnalyzer().polarity_scores(row)
```

4. Compare the positive and negative scores to label the overall sentiment of the review as "Positive", "Negative" or "neutral":

```python
if neg > pos:
  df.loc[index, 'sentiment'] = "Negative" 
elif pos > neg:
  df.loc[index, 'sentiment'] = "Positive"
else:
  df.loc[index, 'sentiment'] = "neutral"
```

This adds `polarity` and `subjectivity` columns computed by TextBlob, plus a `sentiment` label column derived from the VADER scores.
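The comparison rule in step 4 can be isolated into a small function, which makes it easier to check in isolation. This is only a sketch: `label_from_scores` is a hypothetical helper name, and the dictionaries below are hand-written to mimic the keys returned by `polarity_scores`; they are not real VADER output.

```python
def label_from_scores(score: dict) -> str:
    """Map a VADER-style score dict to the labels used in this notebook."""
    if score["neg"] > score["pos"]:
        return "Negative"
    if score["pos"] > score["neg"]:
        return "Positive"
    # Equal positive and negative scores (including both zero)
    return "neutral"

# Hand-written examples with the same keys as polarity_scores(); values are made up
toy_scores = [
    {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.7},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

labels = [label_from_scores(s) for s in toy_scores]
print(labels)
```

Note that with this rule a review is labeled "neutral" only when the positive and negative scores are exactly equal, so even a slight difference is enough to tip a review into "Positive" or "Negative".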


In [None]:
'''
# Sentiment analysis
# TextBlob returns polarity and subjectivity.
# Polarity indicates the sentiment of the text, that is, whether it is positive or negative.
# It is a value between -1 and 1. The closer to 1 the more positive, the closer to -1 the more negative.
# Subjectivity is a value between 0 (objective) and 1 (subjective).
'''

df[['polarity', 'subjectivity']] = df['reviewText'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))

# Create the VADER analyzer once instead of on every iteration
analyzer = SentimentIntensityAnalyzer()

for index, row in df['reviewText'].items():

    score = analyzer.polarity_scores(row)

    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    if neg > pos:
        df.loc[index, 'sentiment'] = "Negative"
    elif pos > neg:
        df.loc[index, 'sentiment'] = "Positive"
    else:
        df.loc[index, 'sentiment'] = "neutral"

In [None]:
# Now that each comment is labeled as positive, negative or neutral, view the top 5 positive comments ranked by wilson_lower_bound.

df[df["sentiment"] == "Positive"].sort_values("wilson_lower_bound", ascending=False).head(5)

In [None]:
# Let's see if the sentiment classes are imbalanced
categorical_variable_summary(df,'sentiment')
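`categorical_variable_summary` is a custom helper from this notebook. As a rough cross-check, the class balance can also be inspected with plain pandas. The sketch below uses a made-up toy dataframe and does not reproduce the helper itself.

```python
import pandas as pd

# Toy labels for illustration only (not the real review data)
toy = pd.DataFrame(
    {"sentiment": ["Positive", "Positive", "Negative", "neutral", "Positive"]}
)

# Absolute counts and relative shares per sentiment class
counts = toy["sentiment"].value_counts()
shares = toy["sentiment"].value_counts(normalize=True).round(2)

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

If one class dominates the shares, the dataset is imbalanced.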

In [None]:
# Let's see whether the average rating differs between sentiment classes
df.groupby(["sentiment"])[['overall']].mean()
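A group mean is easier to interpret next to the group size, because a mean over a handful of reviews is far less reliable than one over thousands. A minimal sketch on made-up data, assuming the same `sentiment` and `overall` column names:

```python
import pandas as pd

# Toy ratings for illustration only (not the real review data)
toy = pd.DataFrame(
    {
        "sentiment": ["Positive", "Positive", "Positive", "Negative", "neutral"],
        "overall": [5, 4, 3, 1, 3],
    }
)

# Mean star rating and number of reviews per sentiment class
by_sentiment = toy.groupby("sentiment")["overall"].agg(["mean", "count"])
print(by_sentiment)
```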

## SUMMARY
In summary, we try to provide a solution to one of the most important problems in e-commerce: the correct calculation of the ratings given to products after purchase. Solving this problem means more customer satisfaction for the e-commerce site, product prominence for sellers, and a seamless shopping experience for buyers. Another problem is the correct ordering of reviews.

When misleading comments become prominent among the comments on a product, the result is financial loss and loss of customers. That is why Amazon wants to rank reviews and, when ordering them, to take their sentiment (positive / negative) into account.

By solving these two basic problems with data analysis tools, the e-commerce site and its sellers can increase their sales, and customers can enjoy a hassle-free purchasing journey. This project uses custom functions to explore the categorized sentiment column and get quick counts of reviews for each predicted category.