# **Task**
Use the UMAP dimensionality reduction technique in Python and conduct a visual analysis of the result.

**Dataset: SpotifyReviews.csv** \
This dataset contains 35,306 customer reviews of the Spotify app, taken from the Google Play store. It also contains labels indicating whether customers recommended the app to others or not.

In [None]:
#Download the dataset from the cloud
!gdown 1zFhsASPRBcmHtxBy0COxs3vaG8Al9mOW

Downloading...
From: https://drive.google.com/uc?id=1zFhsASPRBcmHtxBy0COxs3vaG8Al9mOW
To: /content/SpotifyReviews.csv
100% 5.61M/5.61M [00:00<00:00, 36.1MB/s]


In [None]:
!pip install umap-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting umap-learn
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
     |████████████████████████████████| 88 kB 4.6 MB/s
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.7.tar.gz (1.1 MB)
     |████████████████████████████████| 1.1 MB 63.1 MB/s
Building wheels for collected packages: umap-learn, pynndescent
  Building wheel for umap-learn (setup.py) ... done
  Created wheel for umap-learn: filename=umap_learn-0.5.3-py3-none-any.whl size=82829 sha256=e990e1b4738b8afb9cb56f223320a6d4b24ecca9f0a0b9b9daf63a42fac9aa0a
  Stored in directory: /root/.cache/pip/wheels/b3/52/a5/1fd9e3e76a7ab34f134c07469cd6f16e27ef3a37aeff1fe821
  Building wheel for pynndescent (setup.py) ... done
  Created wheel for pynndescent: filename=pynndescent-0.5.7-py3-none-any.whl size=54286 sha256=a7ab51501f026d17fcefd7c51fd08831bfa28d7c66c473f2f25614f9b4cc6d8b
  Stored in directo

In [None]:
#Import Required Libraries
import re, nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import umap
import plotly.graph_objs as go
import plotly.figure_factory as ff


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
pd.set_option('display.max_colwidth', None) # Setting this so we can see the full content of cells
pd.set_option('display.max_columns', None) # to make sure we can see all the columns in output window

In [None]:
#Read Dataset
df = pd.read_csv('SpotifyReviews.csv')
df.head(5)

Unnamed: 0,Review,Recommend
0,"Great music service, the audio is high quality and the app is easy to use. Also very quick and friendly support.",Yes
1,Please ignore previous negative rating. This app is super great. I give it five stars+,Yes
2,Really buggy and terrible to use as of recently,No
3,Dear Spotify why do I get songs that I didn't put on my playlist??? And why do we have shuffle play?,No
4,I love the selection and the lyrics are provided with the song you're listening to!,Yes


In [None]:
# Check for null values
df.isnull().sum()

Review       0
Recommend    0
dtype: int64

In [None]:
# Map the target labels from strings to integers so plotly can use them as numeric marker colors
df['Recommend'] = df['Recommend'].map({'Yes':1, 'No':0})

In [None]:
df.head()

Unnamed: 0,Review,Recommend
0,"Great music service, the audio is high quality and the app is easy to use. Also very quick and friendly support.",1
1,Please ignore previous negative rating. This app is super great. I give it five stars+,1
2,Really buggy and terrible to use as of recently,0
3,Dear Spotify why do I get songs that I didn't put on my playlist??? And why do we have shuffle play?,0
4,I love the selection and the lyrics are provided with the song you're listening to!,1


In [None]:
# Clean a single review and return its lemmatized tokens
def cleaner(summary):
    # Strip HTML markup and decode entities such as '&amp;', '&quot;', '&gt;'.
    # 'lxml' is the HTML parser; install it with 'pip install lxml' if it is missing.
    soup = BeautifulSoup(summary, 'lxml')
    souped = soup.get_text()
    # Replace each run of one or more non-alphabetic characters with a single space
    re1 = re.sub("[^A-Za-z]+", " ", souped)

    # Tokenize and lowercase
    tokens = nltk.word_tokenize(re1)
    lower_case = [t.lower() for t in tokens]

    # Drop English stopwords
    stop_words = set(stopwords.words('english'))
    filtered_result = [t for t in lower_case if t not in stop_words]

    # Lemmatize the remaining tokens (WordNetLemmatizer treats words as nouns by default)
    wordnet_lemmatizer = WordNetLemmatizer()
    return [wordnet_lemmatizer.lemmatize(t) for t in filtered_result]
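
To make the regex step in `cleaner` concrete, here is a minimal standard-library sketch of what the substitution does to one of the sample reviews. It covers only the substitution and lowercasing: a plain whitespace split stands in for `nltk.word_tokenize`, and the stopword and lemmatization steps are left out. The names `sample`, `letters_only`, and `tokens` are introduced here for illustration only.

```python
import re

# One of the sample reviews shown in the dataset preview
sample = "Dear Spotify why do I get songs that I didn't put on my playlist???"

# Every run of non-letters (spaces, apostrophes, punctuation, digits) becomes one space
letters_only = re.sub("[^A-Za-z]+", " ", sample)

# A whitespace split stands in for nltk.word_tokenize in this sketch
tokens = letters_only.lower().split()

print(letters_only)
print(tokens)
```

Note that the contraction `didn't` is split into the fragments `didn` and `t`. Neither fragment appears in the cleaned reviews in this notebook, which is consistent with the stopword filter removing them.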

In [None]:
# Apply Cleaner Function to clean the reviews
df['Cleaned_Review'] = df.Review.apply(cleaner)

In [None]:
# Remove rows whose cleaned review is empty (if any); .copy() avoids a SettingWithCopyWarning on later assignments
df = df[df['Cleaned_Review'].map(len) > 0].copy()

In [None]:
# Show the first rows of the original and cleaned reviews
df[['Review','Cleaned_Review']].head()

Unnamed: 0,Review,Cleaned_Review
0,"Great music service, the audio is high quality and the app is easy to use. Also very quick and friendly support.","[great, music, service, audio, high, quality, app, easy, use, also, quick, friendly, support]"
1,Please ignore previous negative rating. This app is super great. I give it five stars+,"[please, ignore, previous, negative, rating, app, super, great, give, five, star]"
2,Really buggy and terrible to use as of recently,"[really, buggy, terrible, use, recently]"
3,Dear Spotify why do I get songs that I didn't put on my playlist??? And why do we have shuffle play?,"[dear, spotify, get, song, put, playlist, shuffle, play]"
4,I love the selection and the lyrics are provided with the song you're listening to!,"[love, selection, lyric, provided, song, listening]"


In [None]:
# The cleaned reviews are lists of separate tokens, so join each list back into a single string
df['Cleaned_Review'] = [" ".join(row) for row in df.Cleaned_Review.values]

In [None]:
df['Cleaned_Review'].head()

0    great music service audio high quality app easy use also quick friendly support
1              please ignore previous negative rating app super great give five star
2                                                 really buggy terrible use recently
3                                    dear spotify get song put playlist shuffle play
4                                       love selection lyric provided song listening
Name: Cleaned_Review, dtype: object

In [None]:
# Make input data
data = df['Cleaned_Review']

In [None]:
# min_df=.0005 keeps an n-gram (unigram, bigram, or trigram) only if it appears in at least
# 0.05% of the documents. With about 35,300 reviews that is roughly 18 documents
# (35,300 * .0005 ≈ 17.7). This prunes rare n-grams and keeps the vocabulary manageable.
tfidf = TfidfVectorizer(min_df=.0005, ngram_range=(1,3))

tfidf.fit(data) # learn the vocabulary from the entire dataset
data_tfidf = tfidf.transform(data) # compute the TF-IDF values

In [None]:
print(tfidf.get_feature_names_out())
print("Shape of tfidf matrix: ", data_tfidf.shape)

['aap' 'ability' 'ability play' ... 'yt music' 'zero' 'zero star']
Shape of tfidf matrix:  (35295, 5057)
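
As a rough check on how a fractional `min_df` translates into a document count, the sketch below does the arithmetic for the 35,295 documents reported in the matrix shape above. It relies on scikit-learn's documented behaviour that a float `min_df` is a proportion of documents; the helper name `min_document_count` is hypothetical and not part of scikit-learn.

```python
import math

def min_document_count(min_df: float, n_docs: int) -> int:
    """Smallest whole number of documents that satisfies a fractional min_df."""
    # A term is kept when its document frequency is >= min_df * n_docs,
    # so the smallest integer count that passes is the ceiling of that product.
    return math.ceil(min_df * n_docs)

n_docs = 35295  # number of rows in the TF-IDF matrix reported above
threshold = min_document_count(0.0005, n_docs)
print(threshold)
```

Under these assumptions an n-gram has to occur in at least 18 reviews to enter the vocabulary.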


In [None]:
# Project the TF-IDF matrix onto two dimensions with UMAP for visualization
u = umap.UMAP(n_components=2, n_neighbors=150, min_dist=0.4, metric='euclidean')
x_umap = u.fit_transform(data_tfidf)

In [None]:
recommend = list(df['Recommend'])
reviews = list(df['Review'])

# One marker per review, colored by the Recommend label; hovering shows the label and the review text
data_ = [go.Scatter(x=x_umap[:,0], y=x_umap[:,1], mode='markers',
                    marker=dict(color=recommend, colorscale='Rainbow', opacity=0.5),
                    text=[f'Recommend: {a}<br>Review: {b}' for a, b in zip(recommend, reviews)],
                    hoverinfo='text')]

layout = go.Layout(title='UMAP Dimensionality Reduction', width=1200, height=1200,
                   xaxis=dict(title='First Dimension'),
                   yaxis=dict(title='Second Dimension'))
fig = go.Figure(data=data_, layout=layout)
fig.show()


In [None]:
import plotly
plotly.offline.plot(fig, filename='clusters.html')

'clusters.html'

**Question 1.** Can you identify two clusters of customer reviews based on whether the app was recommended to others or not? If yes, can you identify any sub-clusters within these two clusters? \
**Answer:** Looking closely at the visualization, I think the points can be divided into two clusters: the upper region corresponds mostly to reviews that do not recommend the app, and the lower region mostly to reviews that do. Some recommending reviews may also fall in the not-recommended region, but judging by the dense parts of the plot, the reviews in the lower region are in favour of the app.
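
The upper/lower reading above is a visual impression, and one way it could be checked numerically is to split the points at the median of the second dimension and compare the share of recommending reviews in each half. The sketch below is only an illustration under that assumption: the function name `recommend_share_by_half` is hypothetical, and the toy coordinates and labels are made up, not taken from the real embedding.

```python
from statistics import median

def recommend_share_by_half(y_coords, labels):
    """Share of recommended (label 1) points above and at-or-below the median y."""
    cut = median(y_coords)
    upper = [lab for y, lab in zip(y_coords, labels) if y > cut]
    lower = [lab for y, lab in zip(y_coords, labels) if y <= cut]

    def share(group):
        # Fraction of 1-labels in the group; NaN when the group is empty
        return sum(group) / len(group) if group else float("nan")

    return {"upper": share(upper), "lower": share(lower)}

# Made-up coordinates and labels, NOT taken from the real embedding
toy_y = [2.0, 1.5, 1.0, -0.5, -1.0, -2.0]
toy_labels = [0, 0, 1, 1, 1, 1]
shares = recommend_share_by_half(toy_y, toy_labels)
print(shares)
```

In the notebook, the same function could be called as `recommend_share_by_half(x_umap[:, 1].tolist(), df['Recommend'].tolist())`. A clearly higher share in the lower half would support the reading given in the answer.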

**Question 2.** What does this visualization tell you about the possible reasons a customer may or may not recommend the app to others? \
**Answer:** The visualization tells us that most customers recommend the app to others: the recommended portion of the plot is densely populated, which indicates that most customers like the app and want to recommend it.
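
Because marker density in a scatter plot can be distorted by overlapping points, the claim that most customers recommend the app could be cross-checked against the label counts directly. The sketch below shows one way to do that with the standard library; the helper `label_shares` is hypothetical and the toy labels are invented for illustration.

```python
from collections import Counter

def label_shares(labels):
    """Map each label to its fraction of all reviews."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Invented labels for illustration: 1 = recommended, 0 = not recommended
toy_labels = [1, 1, 0, 1]
print(label_shares(toy_labels))
```

In the notebook this could be run as `label_shares(df['Recommend'].tolist())`, which gives the actual proportion of recommending reviews to compare against the visual impression.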