**Dataset**
Labeled dataset collected from Spotify (Spotify Reviews Rating)

**Objective**
Classify each review into a rating category from 1 to 5.


**Evaluation metric**
Macro F1 score

### Import used libraries

In [None]:
! pip install kaggle

In [None]:
! pip install emoji
! pip install contractions
! python -m spacy download en_core_web_md

In [None]:
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

In [None]:
import nltk
import contractions
import re
import string
import emoji
import unicodedata

In [None]:
from nltk.tokenize import word_tokenize
import spacy
from gensim.models import Word2Vec, FastText, KeyedVectors
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json

User uploaded file "kaggle.json" with length 70 bytes


In [None]:
%%capture
! kaggle datasets download -d "leadbest/googlenewsvectorsnegative300"
! unzip /content/googlenewsvectorsnegative300.zip

### Load Dataset

In [None]:
# Load the CSV file
data = pd.read_csv('/content/Assignment 1 - Spotify Reviews Rating.csv')

# Display the first few rows of the DataFrame
print(data.head())

   Time_submitted                                             Review  Rating
0  7/9/2022 15:00  Great music service, the audio is high quality...       5
1  7/9/2022 14:21  Please ignore previous negative rating. This a...       5
2  7/9/2022 13:27  This pop-up "Get the best Spotify experience o...       4
3  7/9/2022 13:26    Really buggy and terrible to use as of recently       1
4  7/9/2022 13:20  Dear Spotify why do I get songs that I didn't ...       1


In [None]:
data.drop('Time_submitted', axis=1, inplace=True)

In [None]:
data.shape

(61594, 2)

### Data splitting

It is good practice to split the data before EDA. Doing so keeps the test set out of every decision made during exploration, which prevents data leakage, simulates real-world conditions more faithfully, and keeps the performance estimate on unseen data reliable.

In [None]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

### EDA on training data

- check NaNs

In [None]:
train_data.isna().sum()

Review    0
Rating    0
dtype: int64

- check duplicates

In [None]:
train_data[train_data.duplicated()]

Unnamed: 0,Review,Rating
28167,Great variety of music,5
15697,Best app for music listening,5
1940,Great music collection,5
4651,Great selection of music and podcasts,5
9051,#NAME?,2
...,...,...
16241,Way too many ads,4
27384,Very good music app,5
23355,Great music selection,5
21243,Best app for listening music,5


There are 133 duplicated rows in the training data, so we should remove them.

In [None]:
train_data = train_data.drop_duplicates()

In [None]:
# check again
train_data[train_data.duplicated()]

Unnamed: 0,Review,Rating


In [None]:
# Unpacking train_data and test_data into X_train, y_train, X_test, y_test
X_train, y_train = train_data['Review'], train_data['Rating']
X_test, y_test = test_data['Review'], test_data['Rating']

- check dataset balancing

In [None]:
train_data['Rating'].value_counts()

Rating
5    17539
1    14102
4     6302
2     5730
3     5469
Name: count, dtype: int64
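
To put the imbalance into numbers, the counts printed above can be turned into class shares. This is only a small sketch: the dictionary is copied by hand from the `value_counts()` output, and the names `rating_counts`, `shares` and `imbalance_ratio` are illustrative rather than part of the notebook.

```python
# Counts copied by hand from the value_counts() output above
rating_counts = {5: 17539, 1: 14102, 4: 6302, 2: 5730, 3: 5469}

total = sum(rating_counts.values())

# Share of the training set that falls into each rating, sorted by rating
shares = {rating: count / total for rating, count in sorted(rating_counts.items())}
for rating, share in shares.items():
    print(f"rating {rating}: {share:.1%}")

# Ratio between the largest and the smallest class
imbalance_ratio = max(rating_counts.values()) / min(rating_counts.values())
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

Ratings 1 and 5 together make up well over half of the training rows, which is why a metric that treats every class equally is used later on.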

- show a representative sample of review texts to identify the required preprocessing steps

In [None]:
sample_data = train_data.sample(n=20)
print(sample_data)

                                                  Review  Rating
17802                        Love love love Spotify!!!!!       5
34845  Please please please please please just fix it...       2
52579  Constantly tells me no internet connection on ...       3
33227  I see I don't need to go into detail about the...       1
44654  I don't like how you have to subscribe to be a...       1
40913  Love this app!! Great selection of music and p...       5
59107  So many oldies but goodies have returned - eve...       4
61254  Easy way to listen to music, but I don't want ...       2
17613  The new update sucks I have been using this ap...       1
18230                         Too Many Ads Once Time😑😑😑😑       5
13843  Excellent it has all the music I want ,very af...       5
23681  So Spotify it's been a few months now, so do y...       1
44124  Option to play/pause and skip keeps disappeari...       3
42935  unless you plan on getting premium, it's not w...       4
34529  For

## According to our data, I need to do:
- some cleaning and preprocessing:
    - Lowercasing
    - Remove digits
    - Remove constructions such as `RT` (retweet markers)
    - Remove emojis
    - Remove non-ASCII characters
    - Remove punctuation
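
As a quick illustration of what these steps do to a single review, here is a minimal standalone sketch that uses only the standard library. The helper name `clean_text` is hypothetical, the separate emoji step is folded into the non-ASCII filter (emojis are non-ASCII characters), and the final whitespace normalisation is an extra assumption that is not in the list above.

```python
import re
import string


def clean_text(text: str) -> str:
    """Apply the cleaning steps listed above to one review."""
    text = text.lower()                                    # lowercasing
    text = re.sub(r"\d+", "", text)                        # remove digits
    text = re.sub(r"\brt\b", "", text)                     # remove the 'rt' construction
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII, emojis included
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(text.split())                          # assumption: collapse extra spaces


# Two reviews taken from the sample printed above
print(clean_text("Too Many Ads Once Time😑😑😑😑"))  # too many ads once time
print(clean_text("Love love love Spotify!!!!!"))     # love love love spotify
```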
   

### Cleaning and Preprocessing

In [None]:
# Load nltk resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

spacy.load('en_core_web_md')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...

<spacy.lang.en.English at 0x7ab25eb0ae30>

### TextPreprocessor Transformer

In [None]:
class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, use_stemming=False, use_lemmatization=False):
        self.use_stemming = use_stemming
        self.use_lemmatization = use_lemmatization
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        # Stateless transformer: nothing is learned, so fit just returns self
        return self

    def transform(self, X, y=None):
        # Iterate over each text in X and preprocess it
        preprocessed_texts = []
        for text in X:
            # Lowercasing
            text = text.lower()

            # Remove digits
            text = re.sub(r'\d', '', text)

            # Remove construction (e.g., RT for retweets); text is already lowercase
            text = re.sub(r'\brt\b', '', text)

            # Remove emojis
            text = ''.join(char for char in text if not emoji.is_emoji(char))

            # Remove non-ASCII characters
            text = text.encode('ascii', 'ignore').decode('ascii')

            # Remove punctuation
            text = text.translate(str.maketrans('', '', string.punctuation))

            # Apply stemming or lemmatization word by word
            # (both NLTK methods expect a single token, not a whole sentence)
            if self.use_stemming:
                text = ' '.join(self.stemmer.stem(word) for word in text.split())
            elif self.use_lemmatization:
                text = ' '.join(self.lemmatizer.lemmatize(word) for word in text.split())

            preprocessed_texts.append(text)
        return preprocessed_texts

    # fit_transform is inherited from TransformerMixin (fit, then transform)


In [None]:
# # check transformer
# tx=TextPreprocessor()
# xx=tx.fit_transform(X_train)
# xx

### Word embedding transformers

#### Word2Vec


In [None]:
!pip install fastcore





In [None]:
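# Must run before the model is loaded. Raises if no GPU is available;
# spacy.prefer_gpu() falls back to the CPU instead.
spacy.require_gpu()

class SpaCy_word2vec(BaseEstimator, TransformerMixin):
    def __init__(self, model_name='en_core_web_md'):
        self.model_name = model_name
        self.nlp = spacy.load(model_name)

    def fit(self, X, y=None):
        # This transformer does not require fitting, so we just return self
        return self

    def transform(self, X):
        # en_core_web_md ships 300-dimensional vectors
        embeddings = np.zeros((len(X), 300))
        for i, doc in enumerate(self.nlp.pipe(X)):
            vector = doc.vector
            # On GPU the vector may be a CuPy array; copy it back to host memory
            embeddings[i, :] = vector.get() if hasattr(vector, 'get') else vector
        return embeddings

    def fit_transform(self, X, y=None):
        return self.transform(X)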

In [None]:
EMBEDDING_FILE = '/content/GoogleNews-vectors-negative300.bin.gz'
word_vectors = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
len(word_vectors['adds'])

300

In [None]:
from tqdm import tqdm  # progress bar used in transform

class Gensim_word2vec(BaseEstimator, TransformerMixin):

    def __init__(self, word_vectors):
        self.word_vectors = word_vectors
        self.vector_size = word_vectors.vector_size  # Assuming `word_vectors` is a KeyedVectors object

    def fit(self, X, y=None):
        # This transformer does not require fitting, so we just return self
        return self

    def transform(self, X):
        embeddings = np.zeros((len(X), self.vector_size))
        for i, review in tqdm(enumerate(X), total=len(X)):
            # Tokenize the review and get the average word vector
            words = review.split()
            vectors = [self.word_vectors[word] for word in words if word in self.word_vectors]
            if vectors:
                embeddings[i, :] = np.mean(vectors, axis=0)
        return embeddings

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)
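
The averaging done in `transform` can be seen on a toy vocabulary. The sketch below is only an illustration: `mean_pool` and `toy_vectors` are hypothetical names, the vectors are made up, and plain lists stand in for the `KeyedVectors` object so that it runs without gensim.

```python
def mean_pool(tokens, vectors, dim):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known token at all: fall back to the zero vector
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(component) / len(known) for component in zip(*known)]


# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {"great": [1.0, 0.0, 2.0], "music": [3.0, 2.0, 0.0]}

print(mean_pool("great music app".split(), toy_vectors, 3))  # [2.0, 1.0, 1.0]
print(mean_pool("zzz".split(), toy_vectors, 3))              # [0.0, 0.0, 0.0]
```

Note that "app" is out of vocabulary here, so it does not affect the average, and a review with no known word at all is mapped to the zero vector, just as in the transformer above.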


**You are doing great so far!**

### Modelling

#### spaCy

In [None]:

# # Create the pipeline
# pipeline = Pipeline(steps=[
#     ('preprocessing', TextPreprocessor()),
#     ('Vectorizing', SpaCy_word2vec()),
#     ('model',LinearSVC())
# ])

# # Training the model
# pipeline.fit(X_train, y_train)


#### Evaluation

**Evaluation metric:**
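Macro F1 score

Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many samples it has. This makes it a useful metric for evaluating the overall performance of a multi-class classification model, **particularly when the classes are imbalanced**.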

![Calculation](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/639c3d934e82c1195cdf3c60_macro-f1.webp)
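
The calculation can also be sketched in a few lines of plain Python. This is a minimal illustration on made-up labels, not the scikit-learn implementation: the helper names `f1_for_class` and `macro_f1` are hypothetical, and a class with no true or predicted samples is assumed to score 0.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 of one class, treating that class as the positive one."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up example: the minority class 3 is never predicted
y_true = [1, 1, 1, 1, 5, 5, 5, 5, 3, 3]
y_pred = [1, 1, 1, 5, 5, 5, 5, 5, 1, 5]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")                   # accuracy: 0.700
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")   # macro F1: 0.517
```

Accuracy looks acceptable here, while macro F1 is pulled down by the class that the model never predicts.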

In [None]:
# Evaluating the model (the spaCy pipeline above must be uncommented and fitted first)
y_pred = pipeline.predict(X_test)

# Evaluate the model using macro-averaged metrics
report = classification_report(y_test, y_pred, digits=3)

# Print the classification report
print(report)

#### Gensim

In [None]:
from tqdm import tqdm

In [None]:
# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessing', TextPreprocessor()),
    ('vectorizing', Gensim_word2vec(word_vectors)),
    ('model',LinearSVC())
])

# Training the model
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

# Evaluate the model using macro-averaged metrics
report = classification_report(y_test, y_pred, digits=3)

# Print the classification report
print(report)

100%|██████████| 49142/49142 [00:04<00:00, 11029.55it/s]
100%|██████████| 12319/12319 [00:01<00:00, 11860.94it/s]
              precision    recall  f1-score   support

           1      0.523     0.902     0.662      3524
           2      0.237     0.010     0.019      1385
           3      0.278     0.016     0.030      1412
           4      0.399     0.172     0.240      1532
           5      0.719     0.876     0.790      4466

    accuracy                          0.600     12319
   macro avg      0.431     0.395     0.348     12319
weighted avg      0.519     0.600     0.511     12319
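
As a sanity check, the two averaged F1 values in the report can be recomputed from the per-class rows. The numbers below are copied by hand from the report above (already rounded to three digits), and the variable names are illustrative.

```python
# (f1-score, support) per rating, copied from the classification report above
per_class = {
    1: (0.662, 3524),
    2: (0.019, 1385),
    3: (0.030, 1412),
    4: (0.240, 1532),
    5: (0.790, 4466),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every class counts equally
macro_avg = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support
weighted_avg = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f"macro avg F1:    {macro_avg:.3f}")     # macro avg F1:    0.348
print(f"weighted avg F1: {weighted_avg:.3f}")  # weighted avg F1: 0.511
```

The gap between the two averages comes from ratings 2, 3 and 4: their F1 scores are low, but they carry less weight in the weighted average because they have fewer samples.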





