
## Import libraries

In [1]:
!pip install textblob_fr
!pip install langdetect


In [109]:
import re
import os
import io
from tqdm import tqdm
import warnings

import numpy as np
import pandas as pd
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier, XGBRegressor

from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Dense, Embedding, concatenate, Dropout, Input
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

import nltk
import string
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from textblob import TextBlob
from textblob_fr import PatternTagger, PatternAnalyzer
from langdetect import detect

import matplotlib.pyplot as plt
%matplotlib inline
import plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Pandas presentation options
pd.options.display.max_colwidth = 150   # show more of each feedback's content
pd.options.display.width = 200          # don't break columns

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Exploratory Data Analysis 

### Explore the dataframe



In [4]:
# Read data 
data = pd.read_csv('gdrive/My Drive/Groover/groover_dataset_challenge.csv')
# Drop the redundant index column ('Unnamed: 0') and the 'id' column.
# The id column is only useful if we want to merge our dataset with other ones; otherwise the default sorted index is enough.
data = data.drop(['Unnamed: 0', 'id'], axis=1)
data.head()

Unnamed: 0,band_id,influencer_id,feedback,score
0,24665,1642,"Bonjour Diogo Ramos,\n\n\nMerci pour le partage.\nActuellement le titre ne correspond pas à notre ligne de programmation.\n\nN'hésitez pas à nous ...",0.0
1,24665,118,"Un message touchant, développé sur une onde musicale légère et pleine d'oxygène. Direction musicale trop folklorique cependant bien que des pistes...",0.0
2,24665,226,Bonjour ! Merci beaucoup pour l'envoi ! On n'est pas totalement séduit par le morceau mais une prochaine fois peut-être :),0.0
3,24665,1603,Bonjour. Merci pour cette fraîcheur et cet hymne à la liberté en ces moments si incertains. Musicalement c(est parfait; orchestration et interprét...,1.0
4,24665,111,"Salut Diogo, alors c'est un peu éloigné de ce qu'on a l'habitude de partager mais j'ai été séduit par votre univers et le timbre de la voix, on va...",1.0


In [5]:
# Check the type of each column and if there are null values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   band_id        10000 non-null  int64  
 1   influencer_id  10000 non-null  int64  
 2   feedback       10000 non-null  object 
 3   score          10000 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 312.6+ KB


In [6]:
# Convert feedback column to string
data['feedback'] = data['feedback'].astype(str)

In [7]:
# Let's visualize the distribution of the label 'score'
def score_bars(data):
  # Count how many times each score appears, sorted by score
  score_counts = data.score.value_counts().sort_index()
  # Temporary dataframe used for plotting the score frequency
  temp_df = pd.DataFrame()
  # Cast scores to strings so that Plotly uses discrete colors
  temp_df['Score'] = [str(score) for score in score_counts.index]
  temp_df['Frequency'] = list(score_counts)
  fig = px.bar(temp_df, x='Score', y='Frequency', color='Score', title="Score Distribution")
  return fig
fig = score_bars(data)
fig.show()

It looks like many bands get a score of 0.0. I assume that most bands are not much appreciated by the influencers, which is consistent with the mean score of 1.3 over all bands.  
If we treat the label as a set of discrete values (0.25, 0.5, 0.75, 1.0), we have a class imbalance problem to solve. We will work on this later.
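
One common way to handle this kind of imbalance is to weight each class inversely to its frequency, which is the heuristic behind scikit-learn's `class_weight='balanced'` option. The sketch below is only an illustration in plain Python: the helper `balanced_class_weights` and the toy labels are assumptions, not part of the original notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical toy labels, dominated by 0.0 like the bar chart above
toy_scores = [0.0] * 6 + [0.5] * 2 + [1.0] * 2
weights = balanced_class_weights(toy_scores)
for label in sorted(weights):
    print(label, round(weights[label], 3))
```

Rare classes receive a larger weight than the dominant one, so a model trained with these weights is penalized more for ignoring them.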

In [35]:
# Count the top 5 bands with the highest number of ratings. 
print("Band_id count : \n")
print(data.band_id.value_counts()[:5])
# Count the top 5 influencers who gave the highest number of ratings. 
print("\nInfluencer_id count : \n")
print(data.influencer_id.value_counts()[:5])

Band_id count : 

603      155
4942     148
3949     129
32787    120
29227    111
Name: band_id, dtype: int64

Influencer_id count : 

2064    77
110     64
1642    58
784     54
1297    53
Name: influencer_id, dtype: int64


Bands and influencers are not unique in the dataframe: some bands were rated by several influencers, and some influencers rated several bands.

In [36]:
# Check if the band with the highest number of ratings has been rated by different influencers
print(data[data['band_id']==603]['influencer_id'].value_counts())
# Check if the influencer that gave the highest number of ratings has rated different bands
print(data[data['influencer_id']==3592]['band_id'].value_counts())

2046    1
1610    1
2992    1
596     1
1875    1
       ..
2968    1
1174    1
2453    1
1683    1
1280    1
Name: influencer_id, Length: 155, dtype: int64
33533    1
19386    1
1859     1
21702    1
455      1
20232    1
32522    1
32204    1
32397    1
18609    1
18357    1
33612    1
7257     1
32794    1
33499    1
15132    1
25757    1
18409    1
24932    1
21605    1
26535    1
33600    1
33962    1
33646    1
6375     1
29260    1
24886    1
33216    1
Name: band_id, dtype: int64


The band with the highest number of ratings was rated by 155 distinct influencers, each of them rating it exactly once. The same goes for the influencer checked above, who rated each band only once.

In [37]:
# Find duplicated row based on feedback column
duplicated_data = data[data.duplicated(['feedback'])]
# Let's visualize the distribution of score on duplicated feedbacks
fig = score_bars(duplicated_data)
fig.show()

I made this visualization to see whether duplicated rows are mostly scored 0.0. If so, removing the duplicated data could reduce the class imbalance.

In [38]:
# Example of the most repeated feedback (32 times)
most_repeated = duplicated_data["feedback"].value_counts().index[0]
# Show the number of influencers who gave this feedback
print("Number of influencers who gave this feedback : {}".format(data[data["feedback"] == most_repeated]['influencer_id'].nunique()))
# Show the number of bands that received this feedback
print("Number of bands that received this feedback : {}".format(data[data["feedback"] == most_repeated]['band_id'].nunique()))

IndexError: ignored

From this, we can see that this influencer gave the exact same feedback and score to all the bands that they rated.

### Visualization

We can take advantage of the non-uniqueness of bands and influencers to visualize the ratings (scores) for each of them: the average rating per band or per influencer vs. the number of ratings received by the band or given by the influencer.
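
The aggregation behind this plot is a group-by followed by a mean and a count. As a rough illustration, here is the same idea in plain Python on made-up rows; the helper `mean_and_count` and the toy ratings are hypothetical and only mirror the column names of the dataset.

```python
from collections import defaultdict

def mean_and_count(rows, key):
    """Group rows by `key` and return {id: (mean score, number of ratings)}."""
    scores = defaultdict(list)
    for row in rows:
        scores[row[key]].append(row["score"])
    return {k: (sum(v) / len(v), len(v)) for k, v in scores.items()}

# Hypothetical toy ratings (not taken from the dataset)
toy_rows = [
    {"band_id": 1, "influencer_id": 10, "score": 0.0},
    {"band_id": 1, "influencer_id": 11, "score": 1.0},
    {"band_id": 2, "influencer_id": 10, "score": 0.0},
    {"band_id": 2, "influencer_id": 12, "score": 0.5},
    {"band_id": 2, "influencer_id": 11, "score": 1.0},
]
per_band = mean_and_count(toy_rows, "band_id")
per_influencer = mean_and_count(toy_rows, "influencer_id")
print(per_band)
print(per_influencer)
```

Each point of the scatter plots corresponds to one such (mean, count) pair.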

In [None]:
# Let's visualize the average rating of each band and the total number of ratings

def avg_nb_score(data):
  df_plot1 = pd.DataFrame()
  df_plot2 = pd.DataFrame()
  df_plot1['Average rating'] = data.groupby('band_id')['score'].mean().values
  df_plot1['Number of ratings'] = data.groupby('band_id')['score'].count().values
  df_plot2['Average rating'] = data.groupby('influencer_id')['score'].mean().values
  df_plot2['Number of ratings'] = data.groupby('influencer_id')['score'].count().values

  trace1 = go.Scatter(
                  x=df_plot1["Average rating"],
                  y=df_plot1["Number of ratings"],
                  name="For Bands",
                  mode='markers',
                  marker=dict(
                      color='rgb(34,163,192)'
                      )
                  )
  trace2 = go.Scatter( 
                  x=df_plot2["Average rating"],
                  y=df_plot2["Number of ratings"],
                  name="By Influencers",
                  mode='markers',
                  marker=dict(
                      color='rgb(160,65,32)'
                      )
                  )
  fig = make_subplots(rows=2, cols=1)
  fig.add_trace(trace1, row=1, col=1)
  fig.add_trace(trace2, row=2, col=1)

  fig.update_layout(width=1000, height=600, xaxis=dict(tickangle=90), 
                    title_text="Average Rating vs. Number of Ratings.",
                    xaxis_title="Average score",
                    yaxis_title="Score count",)
  return fig

avg_nb_score(data).show()


* The first figure shows the average rating received by each band vs. its total number of ratings. The points are concentrated in the low-average-score / low-number-of-ratings part of the figure. Bands that received average scores higher than 0.6 have not been rated often enough for those averages to be considered reliable. 
*  The second figure shows the average rating given by each influencer vs. their total number of ratings. This time, the influencers who rated the most have an average rating of 0.0 or 5.0 (the two points at the top left and top right of the figure). From this, I assume that these influencers always give the same rating (a score of 5.0 to all 90 bands, or a score of 0.0 to all 80 bands), and that they are the same influencers who wrote the duplicated feedbacks discussed before.

I suggest that we remove the rows with duplicated feedback, both to remove these outliers and because duplicated text will not help us in the predictive modeling phase.
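
Note that pandas' `drop_duplicates(..., keep=False)` drops every copy of a duplicated value, including the first occurrence. The plain-Python sketch below illustrates that behaviour on made-up rows; `drop_all_duplicates` is a hypothetical helper, not a pandas function.

```python
from collections import Counter

def drop_all_duplicates(rows, key="feedback"):
    """Keep only rows whose `key` value occurs exactly once (like keep=False)."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] == 1]

# Hypothetical toy rows: the first two share the same feedback text
toy_rows = [
    {"influencer_id": 1, "feedback": "Merci, bonne continuation", "score": 0.0},
    {"influencer_id": 1, "feedback": "Merci, bonne continuation", "score": 0.0},
    {"influencer_id": 2, "feedback": "Great track, adding it to my playlist", "score": 1.0},
]
unique_rows = drop_all_duplicates(toy_rows)
print(len(unique_rows))
```

This is stricter than keeping one copy of each feedback, which suits the goal of removing copy-pasted answers entirely.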



#### Remove duplicated feedbacks

In [None]:
data = data.drop_duplicates(subset ="feedback", 
                     keep = False) 

# Visualize the score distribution after removing duplicated feedbacks
score_bars(data).show()
# Visualize the average rating of each band and the total number of ratings after removing duplicated feedbacks
avg_nb_score(data).show()

* From the first figure, we can see that the score column is still imbalanced even after removing duplicates.
* The second figure (average rating received by each band vs. its total number of ratings) did not change much compared to the one before removing duplicated feedbacks.
* However, in the third figure (average rating given by each influencer vs. their total number of ratings), we can clearly see that the influencers who rated all bands with the same score have been removed. 

#### Visualize feedback length

In [None]:
# Visualize the length of feedback for each score
data_len = data.copy()
# Convert columns
data_len.score = data_len.score.astype('float')
data_len.feedback = data_len.feedback.astype('str')
# Feedback length column (number of characters)
data_len['feedback_len'] = [len(feedback) for feedback in data_len['feedback']]

def data_len_x(data, score):
  """
  Return the feedback lengths for a specific score.
  """
  # Filter on the dataframe passed as argument, not on the global data_len
  return data[data.score==score]['feedback_len']

# Plot histogram of feedback length for each score (0.0 or 1.0)
fig = go.Figure()
fig.add_trace(go.Histogram(x=np.log(data_len_x(data_len, 0.0)),
                           name = 'Score = 0.0'))
fig.add_trace(go.Histogram(x=np.log(data_len_x(data_len, 1.0)),
                           name = 'Score = 1.0'))

# Overlay both histograms (the x axis is the log of the feedback length)
fig.update_layout(
    barmode='overlay',
    xaxis_title = 'log(feedback length)',
    yaxis_title = 'Frequency'
)

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.show()

In [None]:
print("Statistical measures of feedback length related to the score 0.0 : \n{}".format(data_len_x(data_len, 0.0).describe()))
print("Statistical measures of feedback length related to the score 1.0 : \n{}".format(data_len_x(data_len, 1.0).describe()))

We can tell from the graph and the statistics that the length of a feedback does not appear to be related to its score. 


## Leveraging Sentiment Analysis and Emotion Recognition in EDA

### Cleaning text data

In [None]:
# Copy data
data_SA = data.copy()
data_SA.score = data_SA.score.astype(float)

In [None]:
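def clean_text(t: str) -> str:
  """
  Clean a feedback by removing links, punctuation and stopwords.
  """
  # Remove trailing newlines (inner ones are handled by the split/join below)
  t = t.rstrip('\n')

  # Remove links (the dot after www is escaped so it only matches a literal dot)
  t = re.sub(r'http\S+', '', t)
  t = re.sub(r'www\.\S+', '', t)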
  
  # Removing the punctuations
  t = [i for i in t if i not in string.punctuation]
  t = "".join(t)
  
  # Converting the text to lower
  t = t.lower()
  
  # Removing stop words
  t = ' '.join([word for word in t.split() if word not in stopwords.words('french')])
  
  # Cleaning the whitespaces
  t = re.sub(r'\s+', ' ', t).strip()
  
  return t  

In [18]:
# Create new column 'cleaned_feedback' which contains cleaned text in the feedback column
%timeit data_SA['cleaned_feedback'] = list(map(clean_text, data_SA['feedback']))
data_SA.head()

1 loop, best of 3: 34.8 s per loop


Unnamed: 0,band_id,influencer_id,feedback,score,cleaned_feedback
0,24665,1642,"Bonjour Diogo Ramos,\n\n\nMerci pour le partage.\nActuellement le titre ne correspond pas à notre ligne de programmation.\n\nN'hésitez pas à nous ...",0.0,bonjour diogo ramos merci partage actuellement titre correspond ligne programmation nhésitez envoyer prochains sons bonne continuation andré labo
1,24665,118,"Un message touchant, développé sur une onde musicale légère et pleine d'oxygène. Direction musicale trop folklorique cependant bien que des pistes...",0.0,message touchant développé onde musicale légère pleine doxygène direction musicale trop folklorique cependant bien pistes attachantes résultent ça...
3,24665,1603,Bonjour. Merci pour cette fraîcheur et cet hymne à la liberté en ces moments si incertains. Musicalement c(est parfait; orchestration et interprét...,1.0,bonjour merci cette fraîcheur cet hymne liberté moments si incertains musicalement cest parfait orchestration interprétation top diffuse manuel
4,24665,111,"Salut Diogo, alors c'est un peu éloigné de ce qu'on a l'habitude de partager mais j'ai été séduit par votre univers et le timbre de la voix, on va...",1.0,salut diogo alors cest peu éloigné quon a lhabitude partager jai séduit univers timbre voix va partager playlist jai liker commenter youtube
5,24665,2024,Bonjour\nMerci le titre est très frais! Mais malheureusement ce n'est pas la ligne editoriale de notre magazine.\nBonne continuation,0.0,bonjour merci titre très frais malheureusement nest ligne editoriale magazine bonne continuation
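
Most of the 34.8 s is probably spent in `stopwords.words('french')`, which the list comprehension evaluates again for every single word. A possible speed-up, sketched below under assumptions, is to build the stopword set once and test membership against it. The tiny `TOY_STOPWORDS` set and the name `clean_text_fast` are hypothetical; in the notebook the set would be built from nltk's French stopwords.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration; the notebook would use
# set(stopwords.words('french')) from nltk instead.
TOY_STOPWORDS = frozenset({"le", "la", "les", "de", "pour", "est", "pas", "ne"})
# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_fast(t, stop_words=TOY_STOPWORDS):
    """Remove links and punctuation, lowercase, then drop stopwords."""
    t = re.sub(r"http\S+|www\.\S+", "", t)
    t = t.translate(PUNCT_TABLE).lower()
    # split() without arguments also collapses newlines and repeated spaces
    return " ".join(word for word in t.split() if word not in stop_words)

cleaned = clean_text_fast("Merci pour le partage.\nLe titre ne correspond pas, voir www.example.com")
print(cleaned)
```

Set membership is a constant-time lookup, whereas the original version rebuilds and scans a list for each word.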


- I realized that the feedback is not only written in French, but also in English, Italian and other languages.  
- langdetect is a library used to detect the language of a given text. It is a Python port of the language-detection library written in Java.


In [19]:
# Detect feedback language using langdetect library
%timeit data_SA['lang'] = data_SA['cleaned_feedback'].apply(detect)

1 loop, best of 3: 31.4 s per loop


In [20]:
# Display detected languages and corresponding count
data_SA['lang'].value_counts()

en    4284
fr    3922
it     341
pt      91
de      24
es      11
ca       3
sv       3
Name: lang, dtype: int64

In [21]:
# Create french and english dataframes
data_fr = data_SA[data_SA['lang']=='fr']
data_en = data_SA[data_SA['lang']=='en']

In [22]:
# Visualize score distribution for each language (fr and en)
trace1 = go.Bar(
                x=list(data_fr.score.value_counts().sort_index().keys()),
                y=list(data_fr.score.value_counts().sort_index()),
                name="French feedbacks",
                )
trace2 = go.Bar( 
                x=list(data_en.score.value_counts().sort_index().keys()),
                y=list(data_en.score.value_counts().sort_index()),
                name="English feedbacks",
                )
fig = make_subplots(rows=1, cols=2)
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=1, col=2)

fig.update_layout(width=1000, height=600, xaxis=dict(tickangle=90), 
                  title_text="Score distribution by language",
                  xaxis_title="Score",
                  yaxis_title="Frequency",)
fig.show()

We can see that they both have almost the same distribution: regardless of the language of the feedback, the distribution of scores stays the same. 


In [23]:
# Count of feedback of the other languages : it, es, pt and de
data_SA[(data_SA['lang']!='en') & (data_SA['lang']!='fr')]['feedback'].shape

(473,)


### Compute Sentiment : Polarity and Subjectivity

Let's compute the polarity and subjectivity of each feedback based on its language (French and English only). The other languages represent only about 5% of the feedbacks (473 out of 8,679), so we can remove them, since the sentiment analyzers used below (TextBlob and textblob_fr) only handle English and French.

In [24]:
# Delete lines with languages other than fr and en
data_SA = data_SA[(data_SA['lang']=='en') | (data_SA['lang']=='fr')]

In [25]:
# Modify the clean_text function to handle English and French
def clean_text(t: str) -> str:
  """
  Clean a feedback by removing links, punctuation and stopwords.
  """
  # Remove trailing newlines (inner ones are handled by the split/join below)
  t = t.rstrip('\n')

  # Remove links (the dot after www is escaped so it only matches a literal dot)
  t = re.sub(r'http\S+', '', t)
  t = re.sub(r'www\.\S+', '', t)
  
  # Removing the punctuations
  t = [i for i in t if i not in string.punctuation]
  t = "".join(t)
  
  # Converting the text to lower
  t = t.lower()
  
  # Removing stop words
  all_stopwords = stopwords.words('french') + stopwords.words('english')
  t = ' '.join([word for word in t.split() if word not in all_stopwords])

  # Cleaning the whitespaces
  t = re.sub(r'\s+', ' ', t).strip()
  
  return t

In [26]:
# Create new column 'cleaned_feedback' which contains cleaned text in the feedback column
%timeit data_SA['cleaned_feedback'] = list(map(clean_text, data_SA['feedback']))
# Create tokenized cleaned feedback
%timeit data_SA['token_feedback'] = list(map(lambda x: x.split(" "), data_SA['cleaned_feedback']))
data_SA.head()

1 loop, best of 3: 2.98 s per loop
10 loops, best of 3: 26 ms per loop


Unnamed: 0,band_id,influencer_id,feedback,score,cleaned_feedback,lang,token_feedback
0,24665,1642,"Bonjour Diogo Ramos,\n\n\nMerci pour le partage.\nActuellement le titre ne correspond pas à notre ligne de programmation.\n\nN'hésitez pas à nous ...",0.0,bonjour diogo ramos merci partage actuellement titre correspond ligne programmation nhésitez envoyer prochains sons bonne continuation andré labo,fr,"[bonjour, diogo, ramos, merci, partage, actuellement, titre, correspond, ligne, programmation, nhésitez, envoyer, prochains, sons, bonne, continua..."
1,24665,118,"Un message touchant, développé sur une onde musicale légère et pleine d'oxygène. Direction musicale trop folklorique cependant bien que des pistes...",0.0,message touchant développé onde musicale légère pleine doxygène direction musicale trop folklorique cependant bien pistes attachantes résultent ça...,fr,"[message, touchant, développé, onde, musicale, légère, pleine, doxygène, direction, musicale, trop, folklorique, cependant, bien, pistes, attachan..."
3,24665,1603,Bonjour. Merci pour cette fraîcheur et cet hymne à la liberté en ces moments si incertains. Musicalement c(est parfait; orchestration et interprét...,1.0,bonjour merci cette fraîcheur cet hymne liberté moments si incertains musicalement cest parfait orchestration interprétation top diffuse manuel,fr,"[bonjour, merci, cette, fraîcheur, cet, hymne, liberté, moments, si, incertains, musicalement, cest, parfait, orchestration, interprétation, top, ..."
4,24665,111,"Salut Diogo, alors c'est un peu éloigné de ce qu'on a l'habitude de partager mais j'ai été séduit par votre univers et le timbre de la voix, on va...",1.0,salut diogo alors cest peu éloigné quon lhabitude partager jai séduit univers timbre voix va partager playlist jai liker commenter youtube,fr,"[salut, diogo, alors, cest, peu, éloigné, quon, lhabitude, partager, jai, séduit, univers, timbre, voix, va, partager, playlist, jai, liker, comme..."
5,24665,2024,Bonjour\nMerci le titre est très frais! Mais malheureusement ce n'est pas la ligne editoriale de notre magazine.\nBonne continuation,0.0,bonjour merci titre très frais malheureusement nest ligne editoriale magazine bonne continuation,fr,"[bonjour, merci, titre, très, frais, malheureusement, nest, ligne, editoriale, magazine, bonne, continuation]"


In [27]:
# Define the polarity and subjectivity function
def feedback_polarity(f, lang=None):
    """Return [polarity, subjectivity] for a feedback based on its language (en or fr)."""
    if lang is None:
      lang = detect(f)
    if lang == 'fr':
      # French feedback: use the textblob_fr tagger and analyzer (computed once)
      sentiment = TextBlob(f, pos_tagger=PatternTagger(), analyzer=PatternAnalyzer()).sentiment
      return [sentiment[0], sentiment[1]]
    elif lang == 'en':
      # English feedback: use TextBlob's default analyzer
      sentiment = TextBlob(f).sentiment
      return [sentiment.polarity, sentiment.subjectivity]
    else:
      # Unsupported language
      return None

In [28]:
# Compute sentiment
polar_subj = np.array(list(data_SA.apply(lambda x: feedback_polarity(x['cleaned_feedback'], x['lang']), axis=1)))
data_SA['Polarity'] = [polar for polar in polar_subj[:,0]]
data_SA['Subjectivity'] = [subj for subj in polar_subj[:,1]]
data_SA.head()

Unnamed: 0,band_id,influencer_id,feedback,score,cleaned_feedback,lang,token_feedback,Polarity,Subjectivity
0,24665,1642,"Bonjour Diogo Ramos,\n\n\nMerci pour le partage.\nActuellement le titre ne correspond pas à notre ligne de programmation.\n\nN'hésitez pas à nous ...",0.0,bonjour diogo ramos merci partage actuellement titre correspond ligne programmation nhésitez envoyer prochains sons bonne continuation andré labo,fr,"[bonjour, diogo, ramos, merci, partage, actuellement, titre, correspond, ligne, programmation, nhésitez, envoyer, prochains, sons, bonne, continua...",0.333333,0.3
1,24665,118,"Un message touchant, développé sur une onde musicale légère et pleine d'oxygène. Direction musicale trop folklorique cependant bien que des pistes...",0.0,message touchant développé onde musicale légère pleine doxygène direction musicale trop folklorique cependant bien pistes attachantes résultent ça...,fr,"[message, touchant, développé, onde, musicale, légère, pleine, doxygène, direction, musicale, trop, folklorique, cependant, bien, pistes, attachan...",0.074286,0.385714
3,24665,1603,Bonjour. Merci pour cette fraîcheur et cet hymne à la liberté en ces moments si incertains. Musicalement c(est parfait; orchestration et interprét...,1.0,bonjour merci cette fraîcheur cet hymne liberté moments si incertains musicalement cest parfait orchestration interprétation top diffuse manuel,fr,"[bonjour, merci, cette, fraîcheur, cet, hymne, liberté, moments, si, incertains, musicalement, cest, parfait, orchestration, interprétation, top, ...",0.166667,0.6
4,24665,111,"Salut Diogo, alors c'est un peu éloigné de ce qu'on a l'habitude de partager mais j'ai été séduit par votre univers et le timbre de la voix, on va...",1.0,salut diogo alors cest peu éloigné quon lhabitude partager jai séduit univers timbre voix va partager playlist jai liker commenter youtube,fr,"[salut, diogo, alors, cest, peu, éloigné, quon, lhabitude, partager, jai, séduit, univers, timbre, voix, va, partager, playlist, jai, liker, comme...",-0.0375,-0.075
5,24665,2024,Bonjour\nMerci le titre est très frais! Mais malheureusement ce n'est pas la ligne editoriale de notre magazine.\nBonne continuation,0.0,bonjour merci titre très frais malheureusement nest ligne editoriale magazine bonne continuation,fr,"[bonjour, merci, titre, très, frais, malheureusement, nest, ligne, editoriale, magazine, bonne, continuation]",0.25,0.3625


In [29]:
# Visualize feedbacks Polarity and Subjectivity and the corresponding score.
# For better plotting
data_SA.score = data_SA.score.astype(str)
data_SA.sort_values('score', ascending=False, inplace=True)
fig = px.scatter(data_SA, x="Polarity", y="Subjectivity",
                title="Score vs. Polarity.", color = 'score')

fig.update_layout(width=1200, height=500, xaxis=dict(tickangle=90), 
                  xaxis_title="<---- Negative -- Polarity -- Positive ---->",
                  yaxis_title="<---- Fact   -- Subjectivity -- Opinion --->",)
fig.show()

I noticed that some French feedbacks get a subjectivity > 1, which should not happen since subjectivity is meant to lie between 0 and 1. After some research in the source code of the textblob_fr library, I noticed that the word 'plein' has a subjectivity of 35.0 instead of 0.35 (mistakes happen). Let's check whether all feedbacks with subjectivity > 1 contain this word.
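
To see how a single mis-scaled lexicon entry can push a score out of range, consider a deliberately simplified model in which the subjectivity of a text is the average of the subjectivity of its known words. This is only a toy sketch, not the actual textblob_fr algorithm: `mean_subjectivity` is a hypothetical helper, and all lexicon values except the 35.0 reported above for 'plein' are made up.

```python
def mean_subjectivity(text, lexicon):
    """Average the subjectivity of the words found in `lexicon` (0.0 if none)."""
    scores = [lexicon[word] for word in text.split() if word in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical lexicon values, except 'plein': 35.0 which mirrors the entry described above
toy_lexicon = {"parfait": 1.0, "frais": 0.5, "plein": 35.0}

in_range = mean_subjectivity("titre très frais", toy_lexicon)
out_of_range = mean_subjectivity("parfait plein énergie", toy_lexicon)
print(in_range, out_of_range)
```

Under this simplification, any text containing 'plein' ends up well above 1, while texts without it stay in range.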







In [30]:
all('plein' in text for text in data_SA[data_SA['Subjectivity']>1]['cleaned_feedback'].astype(str))

True

So we can remove these rows since they are considered as outliers in the subjectivity/polarity visualization.

In [31]:
# Filter the dataset: keep rows with a valid subjectivity (at most 1)
# .copy() avoids modifying a view of data_SA in the lines below
df_copy = data_SA.loc[data_SA['Subjectivity']<=1].copy()

# Convert the score to string so that it is plotted as a discrete color
df_copy.score = df_copy.score.astype(str)
df_copy.sort_values('score', ascending=False, inplace=True)
fig = px.scatter(df_copy, x="Polarity", y="Subjectivity",
                title="Polarity vs. Subjectivity for each feedback and the corresponding given score.", color = 'score')

fig.update_layout(width=1200, height=500, xaxis=dict(tickangle=90), 
                  xaxis_title="<---- Negative -- Polarity -- Positive ---->",
                  yaxis_title="<---- Fact   -- Subjectivity -- Opinion --->",)
fig.show()

- According to the visualization of polarity vs. subjectivity, the influencers' feedbacks are globally positive opinions (high subjectivity / positive polarity). In addition, opinions (high subjectivity) with a score of 5.0 (in blue) are more concentrated in the positive polarity area. Opinions with a score of 0.0 are in the polarity range of [0.2,0.4], which means that, even if the influencers are not interested in the band, their feedback is constructive and contains positive words.

Let's visualize some of these words 

In [32]:
data_SA.score = data_SA.score.astype(float)

In [33]:
# Visualize word cloud based on score
from wordcloud import WordCloud

def word_cloud(data, score):

  text = " ".join(feedback for feedback in data[data.score==score].cleaned_feedback)
  # len(text) counts characters, not words
  print ("There are {} characters in the combination of all feedbacks that are associated with score {}.".format(len(text), score))

  # Generate a word cloud image
  wordcloud = WordCloud(background_color="white").generate(text)

  fig = px.imshow(wordcloud)
  fig.update_xaxes(showticklabels=False)
  fig.update_yaxes(showticklabels=False)
  #fig.update_layout(width=800, height=800)
  fig.show()

print("Word cloud associated with the score 0.0 : ")
word_cloud(data_SA, 0.0)
print("Word cloud associated with the score 1.0 : ")
word_cloud(data_SA, 1.0)

Word cloud associated with the score 0.0 : 
There are 982305 characters in the combination of all feedbacks that are associated with score 0.0.


Word cloud associated with the score 1.0 : 
There are 340892 characters in the combination of all feedbacks that are associated with score 1.0.


We can clearly see that some characteristic words are associated with each score and could be used to tell them apart ("bonne continuation" for score 0.0, for example). I assume that a TF-IDF approach will give us acceptable results.
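
The same observation can be checked numerically by counting the most frequent words per score. The sketch below does this in plain Python on made-up cleaned feedbacks; `top_words` and the toy texts are assumptions used only for illustration.

```python
from collections import Counter

def top_words(feedbacks, n=3):
    """Return the n most frequent words across a list of cleaned feedbacks."""
    counts = Counter(word for text in feedbacks for word in text.split())
    return counts.most_common(n)

# Hypothetical cleaned feedbacks grouped by score
toy = {
    0.0: ["merci bonne continuation", "titre correspond ligne bonne continuation"],
    1.0: ["jai séduit univers playlist", "parfait playlist"],
}
for score, texts in toy.items():
    print(score, top_words(texts, n=2))
```

Words that rank high for one score but not for the other are the ones a bag-of-words model can exploit.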

## Predictive modeling

### TF-IDF Approach

In [106]:
# Copy data
tfidf_data = data_SA.copy()

#### Regression 

In this section, I'm going to predict the score as it is. To do so, I will be considering this problem as a regression problem. The approach used in this first section is TF-IDF followed by different regression models to compare between them.

In [107]:
# Define features and labels
X = tfidf_data['cleaned_feedback'].astype(str)
Y = tfidf_data['score'].astype(float)

# Split train and test data
x_train, x_test, y_train, y_test = \
train_test_split(X, Y, test_size=0.3, random_state= 1)

Pipeline used for the TF-IDF approach: 
- CountVectorizer : Tokenizes each document (feedback) and counts the occurrences of each word in it.
- TfidfTransformer : Transforms a count matrix into a tf-idf representation (term frequency x inverse document frequency), which gives a weight to each feature. The weight of a word increases with its frequency in a document, but is offset by the number of documents in the corpus that contain it. TF-IDF therefore encodes which words are more important and which are less important.
- Regression : I only tested 3 models to compare them :
    - XGBoost regressor (XGBRegressor) : the tf-idf representation and the score related to each document (feedback) are fed to an XGBoost model for regression.
    - Linear regression (LinearRegression)
    - SVM regression (SVR)
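
To make the weighting concrete, here is a small plain-Python sketch of the computation, assuming the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which corresponds to the default settings of scikit-learn's `TfidfTransformer`. The function `tfidf`, the whitespace tokenization and the two toy documents are simplifications for illustration; `CountVectorizer` uses its own token pattern.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights using smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    df = Counter(word for doc in tokenized for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        # Guard against an all-zero row (empty document)
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

docs = ["bonne continuation", "bonne playlist playlist"]
vocab, rows = tfidf(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(x, 3) for x in row])
```

In the first document, "continuation" gets a higher weight than "bonne" because "bonne" appears in both documents and is therefore less discriminative.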


In [123]:
from sklearn.svm import SVR
# Linear regression
pipeline_linear = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', LinearRegression()),
])

#XGBoost()
pipeline_xgboost = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', XGBRegressor()),
])

#SVM
pipeline_svm = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', SVR(C=1.0, epsilon=0.2)),
])

In [124]:
# Test the linear regression model on test data
pipeline_linear.fit(x_train,y_train)
test_predictions_1 = pipeline_linear.predict(x_test)
# Compute the mean squared error (MSE) on test data
print("MSE : %.5f" % mean_squared_error(y_test , test_predictions_1))

# Test the xgboost model on test data
pipeline_xgboost.fit(x_train,y_train)
test_predictions_2 = pipeline_xgboost.predict(x_test)
# Compute the mean squared error (MSE) on test data
print("MSE : %.5f" % mean_squared_error(y_test , test_predictions_2))

# Test the SVM model on test data
pipeline_svm.fit(x_train,y_train)
test_predictions_3 = pipeline_svm.predict(x_test)
# Compute the mean squared error (MSE) on test data
print("MSE : %.5f" % mean_squared_error(y_test , test_predictions_3))

MSE : 0.31641
MSE : 0.09013
MSE : 0.08467
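
The values printed above come from mean_squared_error, so they are mean squared errors and should not be read as mean absolute errors. The short sketch below contrasts the two metrics (plus RMSE) on made-up values; the numbers are illustrative only and are not taken from the feedback dataset.

```python
# Illustrative values only (not from the feedback dataset)
y_true = [0.0, 0.25, 0.5, 1.0]
y_pred = [0.1, 0.25, 0.9, 0.5]

errors = [p - t for p, t in zip(y_pred, y_true)]
mse = sum(e ** 2 for e in errors) / len(errors)   # what mean_squared_error computes
mae = sum(abs(e) for e in errors) / len(errors)   # what mean_absolute_error computes
rmse = mse ** 0.5                                 # back in the unit of the score

print("MSE : %.5f" % mse)
print("MAE : %.5f" % mae)
print("RMSE: %.5f" % rmse)
```

Because errors are squared before averaging, MSE penalizes a few large misses more heavily than MAE does, which matters when comparing models on a score bounded between 0 and 1.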



#### Classification

In [85]:
# Let's simplify our problem and make it a classification problem of good rating (score = 1.) and bad rating (score = 0.)
tfidf_data['score'] = [1 if score > 0.5 else 0 for score in tfidf_data['score']]
# Define features and labels
X = tfidf_data['cleaned_feedback']
Y = tfidf_data['score']
# Split train and test data
x_train, x_test, y_train, y_test = \
train_test_split(X, Y, test_size=0.3, random_state= 1)

In [95]:
#LogisticRegression()
pipeline_logistic = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', LogisticRegression()),
])

#XGBoost()
pipeline_xgboost = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', XGBClassifier()),
])

Pipeline(memory=None,
         steps=[('bow',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling

In [103]:
# Test the logistic regression model on test data
pipeline_logistic.fit(x_train,y_train)
test_predictions_1 = pipeline_logistic.predict(x_test)
# Compute accuracy on test data
print("Accuracy : %.5f" % accuracy_score(y_test , test_predictions_1))
# Test the xgboost model on test data
pipeline_xgboost.fit(x_train,y_train)
test_predictions_2 = pipeline_xgboost.predict(x_test)
# Compute accuracy on test data
print("Accuracy : %.5f" % accuracy_score(y_test , test_predictions_2))

Accuracy : 0.88180
Accuracy : 0.87490


With imbalanced data, accuracy is not a good evaluation metric: it counts every sample equally, so it is dominated by the majority class, and a model can score well while performing poorly on the minority class.

Let's use another evaluation setup: k-fold cross-validation with ROC AUC, which tells us how well the model distinguishes between the two classes (0 and 1).
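
To illustrate the point, the sketch below scores a trivial model that gives every sample the same low score, and therefore always predicts the majority class. The label counts are made up (80 negatives, 20 positives), and roc_auc is a hypothetical helper that computes AUC as the fraction of positive/negative pairs ranked correctly, with ties counted as one half.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up imbalanced labels: 80 negatives, 20 positives
y_true = [0] * 80 + [1] * 20
# A useless model: same score (and same predicted class 0) for every sample
y_score = [0.1] * 100
y_pred = [1 if s > 0.5 else 0 for s in y_score]

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print("Accuracy: %.2f" % accuracy)                  # high despite learning nothing
print("ROC AUC : %.2f" % roc_auc(y_true, y_score))  # chance level
```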

In [104]:
from sklearn.model_selection import cross_val_score

# Define evaluation procedure
# Repeat the k-fold cross-validation several times to reduce the variance of the estimate (this gives a more reliable metric)
# This is affordable for linear models and small to moderate datasets (computation time), which is our case
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
# Compute the ROC AUC of every fit
scores = cross_val_score(pipeline_logistic, X, Y, scoring='roc_auc', cv=cv, n_jobs=-1)
# Summarize performance with the mean over all fits
print('Mean ROC AUC: %.5f' % np.mean(scores))

Mean ROC AUC: 0.95500
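
As a rough illustration of what this evaluation procedure does, the sketch below runs RepeatedStratifiedKFold on synthetic labels (80 zeros and 20 ones, invented for the example) and checks how many fits are produced and how the positives are spread across the test folds.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic, imbalanced labels; the features are placeholders
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
splits = list(cv.split(X, y))

# 10 folds repeated 5 times gives 50 train/test splits, hence 50 scores
print("Number of fits:", len(splits))

# Stratification keeps the share of positives in each test fold close to the global 20%
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in splits]
print("Positives per test fold:", sorted(set(positives_per_fold)))
```

Averaging over the 50 scores smooths out the effect of any single lucky or unlucky split.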


In [105]:
# The same approach for XGBoost model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_val_score(pipeline_xgboost, X, Y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.5f' % np.mean(scores))

Mean ROC AUC: 0.91892


Logistic regression outperformed XGBoost at distinguishing good (score = 1) from bad (score = 0) feedback, with a mean ROC AUC of 0.95500 versus 0.91892.

### Word embedding approach

In [82]:
# copy
data_Pred = data_SA.copy()
data_Pred.head()

Unnamed: 0,band_id,influencer_id,feedback,score,cleaned_feedback,lang,Polarity,Subjectivity
9999,34059,386,"Yo c'est pas mal ce que tu fais !\nLes prod c'est pas trop mon style, j'aimerais bien à l'occase essayer te faire posé sur des riddims un peu diff...",1.0,yo cest mal fais prod cest trop style jaimerais bien loccase essayer faire posé riddims peu différents si tai chaud jaimerais bien reste contact b...,fr,-0.025,0.370312
4789,32787,3141,"Hello,\nthe music and the voice are terrific, well done ! And the sound is really good.\nWe will broadcast this song for 3 months, starting next m...",1.0,hello music voice terrific well done sound really good broadcast song 3 months starting next monday please let know releases regards,en,0.22,0.42
1485,33042,1462,"Alright this is top notch quality, loved it. Will be sharing it today. Thanks for sending it to us. Great job!",1.0,alright top notch quality loved sharing today thanks sending us great job,en,0.55,0.5625
4765,32787,2956,"Second time around here, one more and you can ask for music friends hahahahaha, jokes aside, let's get to what matters. It's a type of Pop, a ligh...",1.0,second time around one ask music friends hahahahaha jokes aside lets get matters type pop light rock something different styles light aggressive s...,en,0.2875,0.6
4766,32787,2965,Excelente. Rock + Blues com ótima temática. Muito bom\n\nGreat. Rock + Blues with great theme. Very good,1.0,excelente rock blues com ótima temática muito bom great rock blues great theme good,en,0.766667,0.7


In [204]:
data_Pred['score'] = data_Pred['score'].astype(float)
data_Pred['feedback'] = data_Pred['feedback'].astype(str)

In [205]:
data_Pred['score'] = [1 if s > 0.5 else 0 for s in data_Pred['score']]

In [206]:
data_Pred['score'].value_counts()

0    7353
1    2172
Name: score, dtype: int64
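
These counts show an imbalance of roughly 3.4 to 1. As a hedged sketch, here are two common ways to turn such counts into class weights: the plain majority/minority ratio, and the "balanced" heuristic n_samples / (n_classes * count) that scikit-learn documents for class_weight='balanced'. The variable names are illustrative.

```python
# Class counts as printed by value_counts()
counts = {0: 7353, 1: 2172}
n_samples = sum(counts.values())
n_classes = len(counts)

# Option 1: weight the minority class by the majority/minority ratio
ratio_weights = {0: 1.0, 1: counts[0] / counts[1]}

# Option 2: "balanced" heuristic, n_samples / (n_classes * count)
balanced_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}

print(ratio_weights)
print(balanced_weights)
```

Both options give class 1 the same weight relative to class 0; they differ only by a constant factor, which scales the overall loss.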

In [207]:
class Embeddings:
    """
    Loads pretrained word vectors from a text file (one word followed by its coefficients per line)
    """
    
    def __init__(self, embed_path, embed_dim):
        self.embed_path, self.embed_dim = embed_path, embed_dim
        
    def get_coefs(self, word, *arr): 
        return word, np.asarray(arr, dtype='float32')

    def get_embedding_index(self):
        # Map each word to its vector; the file is closed once it has been read
        with open(self.embed_path, errors='ignore') as f:
            embeddings_index = dict(self.get_coefs(*o.split(" ")) for o in f)
        return embeddings_index

    def create_embedding_matrix(self, tokenizer, max_features):
        """
        A method to create the embedding matrix
        """
        model_embed = self.get_embedding_index()

        # Row 0 is reserved for padding; words without a usable pretrained vector keep a zero row
        embedding_matrix = np.zeros((max_features + 1, self.embed_dim))
        for word, index in tokenizer.word_index.items():
            if index > max_features:
                break
            try:
                embedding_matrix[index] = model_embed[word]
            except (KeyError, ValueError):
                # Word missing from the embedding file, or vector of the wrong size
                continue
        return embedding_matrix
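
For reference, the file read by this class is expected to hold one word per line followed by its space-separated coefficients. The sketch below applies the same idea to a tiny in-memory stand-in (3-dimensional vectors and a hand-written word index, both invented for illustration) rather than a real GloVe file or a Keras tokenizer.

```python
import io
import numpy as np

# Tiny stand-in for an embedding file: "<word> <v1> <v2> <v3>" per line
fake_file = io.StringIO("great 0.1 0.2 0.3\nmusic 0.4 0.5 0.6\n")
embeddings_index = {}
for line in fake_file:
    word, *coefs = line.split()
    embeddings_index[word] = np.asarray(coefs, dtype="float32")

# Hypothetical word index, in the style of a Keras tokenizer (indices start at 1)
word_index = {"great": 1, "riddims": 2, "music": 3}
embed_dim = 3

# Row 0 stays zero for padding; unknown words ("riddims") also keep a zero row
embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, index in word_index.items():
    if word in embeddings_index:
        embedding_matrix[index] = embeddings_index[word]

print(embedding_matrix)
```

Words such as "riddims" that are absent from the pretrained vocabulary start from a zero vector, which is worth keeping in mind for feedback written in French or in slang.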

In [312]:
class Classifier:
    """
    A whole machine learning pipeline for feedbacks classification using word embeddings and LSTM
    """
    # init method
    def __init__(self, X: list, Y: list, embed_path: str, embed_dim: int, epochs=10, batch_size=256):
        
        self.X = X
        self.Y = Y
        self.embed_path = embed_path
        self.embed_dim = embed_dim
        self.epochs = epochs
        self.batch_size = batch_size
        
    def preprocess(self):
        
        # Split 
        X_train, X_test, Y_train, Y_test = train_test_split(
            self.X, self.Y, test_size=0.3, random_state=42)
        
        # Preprocessing the text
        X_train = list(map(self.clean_text, X_train))
        X_test = list(map(self.clean_text, X_test))
        Y_train = np.asarray(Y_train)
        Y_test = np.asarray(Y_test)
        
        # Tokenizing the text
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(X_train)
        self.tokenizer = tokenizer

        # Creating the embedding matrix
        embedding = Embeddings(self.embed_path, self.embed_dim)
        self.embedding_matrix = embedding.create_embedding_matrix(tokenizer, len(tokenizer.word_counts))

        # Creating the padded input for the deep learning model
        self.max_len = np.max([len(text.split()) for text in X_train])
        X_train = self.string_to_tensor(X_train, self.tokenizer, self.max_len)
        X_test = self.string_to_tensor(X_test, self.tokenizer, self.max_len)
        self.X_train, self.X_test, self.Y_train, self.Y_test = X_train, X_test, Y_train, Y_test
    
    # Train rnn model
    def train(self):

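        # Weight the minority class (1) by the class ratio observed in value_counts()
        class_weight = {0: 1.,
                1: 7353./2172.}

        # val_loss should decrease, so both callbacks use mode='min'
        lr_reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.3, patience=3, verbose=2, mode='min')
        early_stop = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=4, mode='min')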
        
        model = self.Rnn_model(
            embedding_matrix=self.embedding_matrix, 
            embedding_dim=self.embed_dim, 
            max_len=self.max_len
        )
        model.fit(
            self.X_train,
            self.Y_train, 
            validation_data=(self.X_test, self.Y_test),
            batch_size=self.batch_size, 
            epochs=self.epochs,
            callbacks=[early_stop, lr_reduce],
            class_weight=class_weight
        )
        self.model = model
        return self.model
    
    # Model architecture 
    def Rnn_model(self, embedding_matrix, embedding_dim, max_len):
        """
        Recurrent neural network. The embedding layer is supposed 
        to take an embedding matrix for pretrained weights
        """

        inp1 = Input(shape=(max_len,))
        x = Embedding(embedding_matrix.shape[0], embedding_dim, weights=[embedding_matrix])(inp1)
        x = Bidirectional(LSTM(256, return_sequences=True))(x)
        x = Bidirectional(LSTM(150))(x)
        x = Dense(128, activation="relu")(x)
        x = Dropout(0.1)(x)
        x = Dense(64, activation="relu")(x)
        # For classification
        x = Dense(1, activation="sigmoid")(x)    
        # for regression
        #x = Dense(1)(x)    
        model = Model(inputs=inp1, outputs=x)
        # binary_crossentropy if we are dealing with a classification problem (mean_squared_error if regression)
        model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
        return model
    
    def predict(self, text: list):
        
        text = list(map(self.clean_text, text))
        text = self.string_to_tensor(text, self.tokenizer, self.max_len)
        
        train_pred = [1 if x[0]>0.5 else 0 for x in self.model.predict(text)]
        
        return train_pred

    def predict_classes(self, text: list):
        
        # predict_classes was only available on Sequential models in older Keras versions,
        # not on a functional Model, so threshold the sigmoid output instead
        return self.predict(text)
    
    def evaluate(self):
        
        # If X_test is provided we make predictions with the created model
        if len(self.X_test)>0:
            #X_test = [self.clean_text(text) for text in self.X_test]
            #X_test = self.string_to_tensor(self.X_test, self.tokenizer, self.max_len)
            #test_pred = [x[0] for x in self.model.predict(self.X_test).tolist()]
            #train_pred = [x[0] for x in self.model.predict(self.X_train).tolist()]
            yhat = [x[0] for x in self.model.predict(self.X_test).tolist()]
            self.yhat = yhat

            # If true labels are provided we calculate the accuracy of the model
            if len(self.Y_test)>0:
              # Classification
              self.acc = accuracy_score(self.Y_test, [1 if x > 0.5 else 0 for x in yhat])
              return self.acc
              # Regression
              #self.trainScore = np.sqrt(mean_squared_error(self.Y_train, train_pred))
              #print('Train Score: %.2f RMSE' % (self.trainScore))
              #self.testScore = np.sqrt(mean_squared_error(self.Y_test, test_pred))
              #print('Test Score: %.2f RMSE' % (self.testScore))
              #return self.trainScore, self.testScore
                

    def string_to_tensor(self, string_list: list, tokenizer, max_len) -> list:
        """
        A method to convert a string list to a tensor for a deep learning model
        """    
        string_list = tokenizer.texts_to_sequences(string_list)
        string_list = pad_sequences(string_list, maxlen=max_len)

        return string_list
    
    def clean_text(self, t: str) -> str:
        """
        A method to clean a feedback text: removes line breaks, punctuation and stopwords.
        Relies on the string module and on nltk's stopwords corpus being imported.
        """
        # Replace line breaks with spaces
        t = t.replace('\n', ' ')

        # Removing the punctuation
        t = [i for i in t if i not in string.punctuation]
        t = "".join(t)

        # Converting the text to lower case
        t = t.lower()

        # Removing stop words (build the set once instead of once per word)
        stop_words = set(stopwords.words('french') + stopwords.words('english'))
        t = ' '.join([word for word in t.split() if word not in stop_words])

        # Cleaning the whitespace
        t = re.sub(r'\s+', ' ', t).strip()

        return t       
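
The string_to_tensor step relies on the tokenizer to map words to integer indices and on pad_sequences to bring every sequence to the same length. As a rough pure-Python sketch of that behaviour, assuming the Keras defaults of padding and truncating at the start of the sequence (the helper pad_pre and the toy vocabulary are hypothetical):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Toy word index; 0 is reserved for padding
word_index = {"great": 1, "song": 2, "voice": 3}
texts = ["great song", "great voice great song voice"]

# Words missing from the index are dropped
sequences = [[word_index[w] for w in t.split() if w in word_index] for t in texts]
print(pad_pre(sequences, maxlen=4))
```

The short text is padded with zeros on the left, while the long one loses its first token, so every row has the fixed length the model's input layer expects.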

In [313]:
# Unprocessed feedback texts
X = data_Pred['feedback'].tolist()

Y = data_Pred['score'].tolist()

embed_path="gdrive/My Drive/Groover/glove.6B.300d.txt"
embed_dim=300

classifier = Classifier(X, Y, embed_path, embed_dim, epochs=15, batch_size=256)

In [314]:
# Prepare training data
classifier.preprocess()

In [315]:
# Start training 
model = classifier.train()

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15


### Regression

In [184]:
# Evaluate the model
# (the scores below come from a run with the regression branch of evaluate() enabled)
acc = classifier.evaluate()
#print(acc)

Train Score: 0.05 RMSE
Test Score: 0.24 RMSE


In [185]:
X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, test_size=0.3, random_state=42)

test_pred = classifier.predict(X_test)

[0.35213151574134827, 0.005995337851345539, 0.3452686071395874, 0.6661107540130615, 0.0008919974789023399, 0.32609161734580994, 0.0036737127229571342, 0.8986443877220154, 0.012182272039353848, 0.0024429745972156525]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 1.0, 0.0, 0.25]


In [188]:
print(test_pred[1000:])
print(Y_test[1000:])

[0.8388172388076782, 0.015686824917793274, 0.04143678769469261, 0.03445472940802574, 0.9471743702888489, -0.016922608017921448, 0.01856742799282074, -0.009018362499773502, 0.007100113667547703, 0.008718720637261868, 0.9150923490524292, 0.29085955023765564, 0.0009605316445231438, 0.12653028964996338, 0.4373140335083008, 0.00039001554250717163, 0.2302893102169037, 0.20923274755477905, -0.0017916737124323845, 0.9313439726829529, -0.007770218886435032, 0.9354633092880249, -0.00010526087135076523, 0.38314110040664673, -0.008796275593340397, 0.06650802493095398, 0.006974558345973492, 0.039260298013687134, 0.7406441569328308, 0.9370160698890686, -0.0033549489453434944, 0.008317219093441963, 0.9310141801834106, 0.007367917336523533, 0.01063526887446642, -0.006682419218122959, 0.123704694211483, 0.08592072129249573, 0.10243625938892365, 0.9560524225234985, 0.9537553191184998, 0.03149787709116936, 0.9392101764678955, 0.063140869140625, 0.30221834778785706, -0.0037855850532650948, 0.2506701946258

### Classification

In [316]:
# Evaluate the model
acc = classifier.evaluate()
print(acc)

0.91777466759972


In [317]:
X_train, X_test, Y_train, Y_test = train_test_split(
            X, Y, test_size=0.3, random_state=42)

test_pred = classifier.predict(X_test)

In [318]:
print(test_pred[:50])
print(Y_test[:50])

[0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
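
The two printed lists can be compared directly. Here is a small sketch that tallies the confusion matrix for these 50 samples and derives precision and recall for class 1. The lists are copied from the output above; the variable names are illustrative.

```python
# First 50 predictions and true labels, copied from the printed output
pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
true = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
        1, 0, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(true, pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # good feedback correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # bad feedback flagged as good
fn = sum(t == 1 and p == 0 for t, p in pairs)  # good feedback missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # bad feedback correctly rejected

print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
print("Precision (class 1): %.2f" % (tp / (tp + fp)))
print("Recall    (class 1): %.2f" % (tp / (tp + fn)))
```

On this small slice the model misses more positives than it wrongly flags, so recall for class 1 is lower than precision. This is only a 50-sample view and need not match the metrics over the full test set.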


In [296]:
# Classification report on the test dataset
from sklearn.metrics import classification_report
print(classification_report(Y_test, test_pred))

              precision    recall  f1-score   support

           0       0.96      0.92      0.94      2194
           1       0.77      0.86      0.81       664

    accuracy                           0.91      2858
   macro avg       0.86      0.89      0.87      2858
weighted avg       0.91      0.91      0.91      2858



In [319]:
# Classification report on the test dataset
from sklearn.metrics import classification_report
print(classification_report(Y_test, test_pred))

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      2194
           1       0.82      0.83      0.82       664

    accuracy                           0.92      2858
   macro avg       0.88      0.89      0.89      2858
weighted avg       0.92      0.92      0.92      2858

