**Dataset**
labeled datasset collected from Spotify (Assignment 1 - Spotify Reviews Rating)

**Objective**
classify Review to a category from 1 to 5. <br>

**Total Estimated Time = 90-120 Mins**

**Evaluation metric**
macro f1 score

### Import used libraries

In [41]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 500)

### Load Dataset

In [89]:
data = pd.read_csv("/content/Assignment 1 - Spotify Reviews Rating.csv")
data.head(20)

Unnamed: 0,Time_submitted,Review,Rating
0,7/9/2022 15:00,"Great music service, the audio is high quality and the app is easy to use. Also very quick and friendly support.",5
1,7/9/2022 14:21,Please ignore previous negative rating. This app is super great. I give it five stars+,5
2,7/9/2022 13:27,"This pop-up ""Get the best Spotify experience on Android 12"" is too annoying. Please let's get rid of this.",4
3,7/9/2022 13:26,Really buggy and terrible to use as of recently,1
4,7/9/2022 13:20,Dear Spotify why do I get songs that I didn't put on my playlist??? And why do we have shuffle play?,1
5,7/9/2022 13:20,The player controls sometimes disappear for no reason. App restart forgets what I was playing but fixes the issue.,3
6,7/9/2022 13:19,I love the selection and the lyrics are provided with the song you're listening to!,5
7,7/9/2022 13:17,"Still extremely slow when changing storage to external sd card.. I'm convinced this is done on purpose, spotify knows of this issue and has done NOTHING to solve it! Over time I have changed sd cards, each being faster in read, write speeds(all samsung brand). And please add ""don't like song"" so it will never appear again in my searches or playlists.",3
8,7/9/2022 13:16,"It's a great app and the best mp3 music app I have ever used but there is one problem that, why can't we play some songs or find some songs? despite this the app is wonderful I recommend it. it's just the best.",5
9,7/9/2022 13:11,"I'm deleting this app, for the following reasons: This app now has a failing business model. Whether streaming services like it, or not: the consumer doesn't want to pay for music they can't fully own, and 6 ads successively, upon logging in, before a single song, is too much. Closed the app during ad number 6, and I'm more patient than most. If those are the only ways you can profit: you've already peaked. All that's left is your decline.",1


In [90]:
data.shape

(61594, 3)

In [91]:
data.drop(columns = 'Time_submitted', inplace = True)

### Data splitting

It is a good practice to split the data before EDA helps maintain the integrity of the machine learning process, prevents data leakage, simulates real-world scenarios more accurately, and ensures reliable model performance evaluation on unseen data.

In [7]:
data.columns

Index(['Review', 'Rating'], dtype='object')

In [92]:
from sklearn.model_selection import train_test_split

x = data[['Review']]
y = data[['Rating']]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [93]:
x_train.shape

(49275, 1)

In [94]:
y_train.shape

(49275, 1)

### EDA on training data

- check NaNs

In [95]:
training_data = pd.concat([x_train, y_train], axis = 1)

In [96]:
training_data.isnull().sum()

Review    0
Rating    0
dtype: int64

- check duplicates

In [97]:
training_data.duplicated().sum()

133

In [98]:
training_data.drop_duplicates(inplace = True)

- show a representative sample of data texts to find out required preprocessing steps

In [99]:
sample_tweets = training_data['Review'].sample(n=150)
for tweet in sample_tweets:
    print(tweet)
    print("-" *100)


Forces you to subscribe via their website, no option to use Google Play subscriptions while all other alternatives offer it.
----------------------------------------------------------------------------------------------------
I like it and although many people are complaining it works just fine the playlist will be "blocked" sometimes on some devices because of updates but if you make sure it is updated it is fine.
----------------------------------------------------------------------------------------------------
Love it. Love the music!
----------------------------------------------------------------------------------------------------
Constantly crashes the UI but keeps playing. Must be force closed.
----------------------------------------------------------------------------------------------------
I feel like the only problem with this app is the non member access being unusable.
----------------------------------------------------------------------------------------------------
T

- check dataset balancing

In [100]:
class_counts = training_data['Rating'].value_counts()
print("Class Counts:")
print(class_counts)

Class Counts:
Rating
5    17539
1    14102
4     6302
2     5730
3     5469
Name: count, dtype: int64




*   will use macro-avg to counter the imbalance



- Cleaning and Preprocessing are:
    - tokenization
    - lemmetizing
    - emojis, punctuation, symbols and non words chars in general
    - slang (bcuz ui app .. )
    - contractions (it's ..)
    - digits
    - stop words
    - lowercasing
    - embedding
  

### Cleaning and Preprocessing

In [15]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [16]:
 nltk.download('punkt') #tokenizing

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin
import re

In [18]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [20]:
import contractions

In [30]:
import spacy

In [47]:
nlp = spacy.load('en_core_web_md')

In [102]:
training_data.shape

(49142, 2)

In [126]:
class CustomTransformer(BaseEstimator, TransformerMixin):
      def preprocess(self,review):
        review = review.lower() #lower case
        review = review.replace("bcuz", "because")
        review = review.replace("ui", "user interface")
        review = review.replace("app", "application")
        review = review.replace("apps", "applications")
        words = nltk.word_tokenize(review)
        transformed_words = [contractions.fix(word) for word in words]  #remove contractions
        transformed_words = [lemmatizer.lemmatize(word) for word in transformed_words if word.lower() not in stop_words] #lemm and remove stop words
        transformed_words = ' '.join(transformed_words)             #transformed_words

        transformed_words = re.sub(r'[^\w\s]', '', transformed_words) #remove non-word characters except space
        transformed_words = re.sub(r'\d+', '', transformed_words) #remove digits
        transformed_words = re.sub("[ ñ·ï§]", "", transformed_words)
        transformed_words = re.sub(r'\s+', ' ', transformed_words).strip()

        return transformed_words

      def get_embdng(self,data):
        embndg_mtx = np.zeros((len(data), 300))
        for i, doc in enumerate(nlp.pipe(data)):
            embndg_mtx[i, :] = doc.vector
        return embndg_mtx


      def transform(self, X):
        transformed_X = X.copy()
        transformed_X['Review'] = transformed_X['Review'].apply(self.preprocess)
        embndg_mtx = self.get_embdng(transformed_X['Review'].values.tolist())
        return embndg_mtx


In [127]:
x_train = CustomTransformer().transform(training_data)

In [103]:
y_train = training_data["Rating"]

In [128]:
x_train.shape

(49142, 300)

In [129]:
y_train.shape

(49142,)

In [134]:
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()

In [131]:
logistic_model.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


#### Evaluation

**Evaluation metric:**
macro f1 score

Macro F1 score is a useful metric in scenarios where you want to evaluate the overall performance of a multi-class classification model, **particularly when the classes are imbalanced**

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)

In [113]:
y_test.shape

(12319, 1)

In [114]:
x_test.shape

(12319, 1)

In [115]:
test_data = pd.concat([x_test, y_test], axis = 1)

In [121]:
x_test = CustomTransformer().transform(test_data)

In [132]:
y_pred = model.predict(x_test)

In [133]:
from sklearn.metrics import f1_score

macro_f1 = f1_score(test_data['Rating'], y_pred, average='macro')
macro_f1

0.35671702501618324