# EDA

Inspiration: https://medium.com/deepdatascience/feature-extraction-from-text-text-data-preprocessing-594b11af19f5

In [5]:
!pip install textblob



In [61]:
# Time and files
import os
import warnings
import time
from datetime import timedelta

# Data manipulation
import pandas as pd
import numpy as np

# Text
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
import string
import re
import spacy 
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

# Modelling
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import LinearSVC
# The parameter keys can be listed with final_pipeline.get_params().keys()
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

from xgboost import XGBClassifier


# Evaluation
from sklearn.metrics import f1_score, confusion_matrix


# Visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px


# Experiment tracking
import mlflow
import mlflow.sklearn

# TfidfVectorizer is already imported in the modelling block above
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [62]:
!pip freeze > /mnt/docker/requirements.txt

## Using the package

In [8]:
# This cell makes the packaged version of the project available and reloads it automatically before its functions are called
%load_ext autoreload
%autoreload 2

In [9]:
from dsa_sentiment.scripts.make_dataset import load_data
from dsa_sentiment.scripts.evaluate import eval_metrics
from dsa_sentiment.scripts.make_dataset import Preprocess_StrLower, Preprocess_transform_target

## Configuring the MLflow experiment

In [10]:
mlflow.tracking.get_tracking_uri()

'/mnt/experiments'

## Description of the strategy

The approach consists in analysing English-language tweets and predicting the sentiment they convey: `{negative: -1, neutral: 0, positive: 1}`

English makes the exercise easier, since many pretrained models exist for this language.

The difficulty comes from the source of the data: tweets.
Classical approaches rely on:
- lowercasing, whereas in tweets words written in **UPPERCASE** signal a **strong emotion**
- relatively correct linguistic structures, reinforced by lemmatisation / tokenisation, whereas tweets are full of **spelling mistakes and abbreviations** (e.g. `thx`)
- removing punctuation, whereas here it can form an **emoticon** `;-)` or mark a **strong emotion** `!!!`
- a literal reading of the text, whereas **humour** and **euphemisms** are very common on Twitter and models struggle with them because they require contextual understanding.
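To make these points concrete, the short sketch below applies a naive normalisation (lowercasing plus punctuation stripping) to a tweet from the training set. The helper `naive_normalise` is hypothetical and not part of the project; it only illustrates which signals such a clean-up throws away.

```python
import string

def naive_normalise(text):
    """Lowercase the text and strip ASCII punctuation (a classic NLP clean-up)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

tweet = "Sooo SAD I will miss you here in San Diego!!!"
# The capitalised "SAD" and the "!!!" emphasis both disappear.
print(naive_normalise(tweet))

# The emoticon is made of punctuation only, so it vanishes entirely.
print(repr(naive_normalise("thx ;-)")))
```

This is why the features built later in this notebook are computed on the raw text, before any normalisation.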

Beyond the assignment itself, this project was an opportunity to build skills in (hopefully) good coding practices and in MLOps techniques.

The code of this project is organised with the open source framework [**orbyter**](https://github.com/manifoldai/orbyter-cookiecutter) from [Manifold.ai](https://www.manifold.ai/project-orbyter). This framework encourages a standardised code structure through `cookiecutter` and promotes development in a Dockerised environment from the start:

![structure](https://www.manifold.ai/hubfs/Torus.png) 

The recommended development workflow is described [here](https://cdn2.hubspot.net/hubfs/4584542/Conference%20Slides/2019StrataNY_EfficientMLengineering.pdf)

Several changes had to be made to the `docker-compose` settings to give the container access to GPU resources.

The code is under version control and available on [GitHub](https://github.com/Fabien-DS/DSA_Sentiment)

### Loading the data

In [11]:
!pwd

/mnt/notebooks/EDA


In [12]:
data_folder = os.path.join('/mnt', 'data', 'raw')
all_raw_files = [os.path.join(data_folder, fname)
                    for fname in os.listdir(data_folder)]
all_raw_files

['/mnt/data/raw/sample_submission.csv',
 '/mnt/data/raw/test.csv',
 '/mnt/data/raw/train.csv']

In [13]:
random_state=42

Text cannot be imputed the way numerical fields can, so empty tweets are dropped (`dropNA=True`).

In addition, 30% of the training set is set aside to build a validation set.
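The internals of `load_data` are not shown in this notebook. The sketch below is only an assumption of what such a step could look like: it drops rows with missing values and holds out the last 30% of the rows, which would be consistent with the validation index starting right after the training rows further down. The function name `split_train_val` and the toy data are hypothetical; scikit-learn's `train_test_split` with `shuffle=False` would be another way to obtain an ordered split.

```python
import math
import pandas as pd

def split_train_val(df, target_col, test_size=0.3):
    """Drop rows with missing values, then hold out the last `test_size` share."""
    df = df.dropna()
    n_val = math.ceil(len(df) * test_size)
    cut = len(df) - n_val
    train, val = df.iloc[:cut], df.iloc[cut:]
    X_train, y_train = train.drop(columns=[target_col]), train[[target_col]]
    X_val, y_val = val.drop(columns=[target_col]), val[[target_col]]
    return X_train, y_train, X_val, y_val

# Toy data with one empty tweet (hypothetical, for illustration only)
toy = pd.DataFrame({
    "text": ["good", None, "bad", "ok", "great", "awful", "fine", "meh", "nice"],
    "sentiment": ["positive", "neutral", "negative", "neutral", "positive",
                  "negative", "neutral", "neutral", "positive"],
})
X_tr, y_tr, X_va, y_va = split_train_val(toy, "sentiment")
print(len(X_tr), len(X_va))
```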

In [14]:
X_train, y_train, X_val, y_val = load_data(all_raw_files[2], split=True, test_size=0.3, random_state=random_state, dropNA=True)

In [15]:
X_train.head()

Unnamed: 0,textID,text,selected_text
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going"
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD
2,088c60f138,my boss is bullying me...,bullying me
3,9642c003ef,what interview! leave me alone,leave me alone
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,"


In [16]:
print(f"The initial training set contains {X_train.shape[0] + X_val.shape[0]} rows")

The initial training set contains 27480 rows


In [17]:
y_train.head()

Unnamed: 0,sentiment
0,neutral
1,negative
2,negative
3,negative
4,negative


In [18]:
X_test, y_test = load_data(all_raw_files[1], split=False, random_state=random_state, dropNA=True)

In [19]:
X_test.head()

Unnamed: 0,textID,text
0,f87dea47db,Last session of the day http://twitpic.com/67ezh
1,96d74cb729,Shanghai is also really exciting (precisely -...
2,eee518ae67,"Recession hit Veronique Branquinho, she has to..."
3,01082688c6,happy bday!
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!


In [20]:
print(f"The test set contains {X_test.shape[0]} rows")

The test set contains 3534 rows


### Initial data transformation

This part only selects the columns used in the rest of the notebook and recodes the target into the required format.

In [21]:
# Only the `text` field is used in this project; the double brackets keep the result a pandas DataFrame
X_train = X_train[['text']]
X_val = X_val[['text']]
X_test = X_test[['text']]

In [22]:
X_train.head()

Unnamed: 0,text
0,"I`d have responded, if I were going"
1,Sooo SAD I will miss you here in San Diego!!!
2,my boss is bullying me...
3,what interview! leave me alone
4,"Sons of ****, why couldn`t they put them on t..."


We start by transforming the targets to comply with the instructions

In [23]:
y_train = Preprocess_transform_target(y_train, columns_to_process=['sentiment'])
y_train.head()

Unnamed: 0,sentiment
0,0
1,-1
2,-1
3,-1
4,-1


In [24]:
y_val = Preprocess_transform_target(y_val, ['sentiment'])
y_val.head()

Unnamed: 0,sentiment
19236,0
19237,-1
19238,1
19239,-1
19240,1


In [25]:
y_test = Preprocess_transform_target(y_test, ['sentiment'])
y_test.head()

Unnamed: 0,sentiment
0,0
1,1
2,-1
3,1
4,1
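`Preprocess_transform_target` lives in the `dsa_sentiment` package and its code is not reproduced in this notebook. As a minimal sketch, assuming it boils down to a dictionary lookup, the encoding seen in the outputs above could be obtained as follows; `encode_sentiment` and `SENTIMENT_CODES` are hypothetical names.

```python
import pandas as pd

# Mapping stated in the strategy description: negative -1, neutral 0, positive 1
SENTIMENT_CODES = {"negative": -1, "neutral": 0, "positive": 1}

def encode_sentiment(y, columns_to_process):
    """Return a copy of `y` whose listed columns are mapped to -1 / 0 / 1."""
    y = y.copy()
    for col in columns_to_process:
        y[col] = y[col].map(SENTIMENT_CODES)
    return y

labels = pd.DataFrame({"sentiment": ["neutral", "negative", "positive"]})
encoded = encode_sentiment(labels, ["sentiment"])
print(encoded["sentiment"].tolist())
# [0, -1, 1]
```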


# EDA

Among the tweet-specific elements that may matter for the rest of the analysis:

 - keywords marked with a `#`
 - usernames starting with an `@`
 - numerical values mentioned
 - the number of words in UPPERCASE

In [26]:
def count_hashtags(df, text_field):
    '''
    count the number of keywords marked by a '#'
    
    inputs : 
        df : a dataframe
        text_field : the name of the text column to analyse
    
    returns :
        a copy of the dataframe df augmented by an additional column 'hashtags_count'
    
    '''
    df = df.copy()  # work on a copy so the caller's dataframe is left untouched, as documented
    df['hashtags_count'] = df[text_field].apply(lambda text: sum(word.startswith('#') for word in text.split()))
    return df

In [27]:
X_train = count_hashtags(X_train, 'text')
X_val = count_hashtags(X_val, 'text')
X_test = count_hashtags(X_test, 'text')
X_train.head(5)

Unnamed: 0,text,hashtags_count
0,"I`d have responded, if I were going",0
1,Sooo SAD I will miss you here in San Diego!!!,0
2,my boss is bullying me...,0
3,what interview! leave me alone,0
4,"Sons of ****, why couldn`t they put them on t...",0


In [28]:
help(count_hashtags)

Help on function count_hashtags in module __main__:

count_hashtags(df, text_field)
    count the number of keywords marked by a '#'
    
    inputs : 
        df : a dataframe
        text_field : the name of the text column to analyse
    
    returns :
        a copy of the dataframe df augmented by an additional column 'hashtags_count'



In [29]:
X_train.describe()

Unnamed: 0,hashtags_count
count,19236.0
mean,0.021574
std,0.168488
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,5.0


In [30]:
def count_usernames(df, text_field):
    '''
    count the number of users marked by a '@'
    
    inputs : 
        df : a dataframe
        text_field : the name of the text column to analyse
    
    returns :
        a copy of the dataframe df augmented by an additional column 'users_tagged'
    
    '''
    df = df.copy()  # work on a copy so the caller's dataframe is left untouched, as documented
    df['users_tagged'] = df[text_field].apply(lambda text: sum(word.startswith('@') for word in text.split()))
    return df

In [31]:
X_train = count_usernames(X_train, 'text')
X_val = count_usernames(X_val, 'text')
X_test = count_usernames(X_test, 'text')

In [32]:
X_train.head()

Unnamed: 0,text,hashtags_count,users_tagged
0,"I`d have responded, if I were going",0,0
1,Sooo SAD I will miss you here in San Diego!!!,0,0
2,my boss is bullying me...,0,0
3,what interview! leave me alone,0,0
4,"Sons of ****, why couldn`t they put them on t...",0,0


In [33]:
X_train.describe()

Unnamed: 0,hashtags_count,users_tagged
count,19236.0,19236.0
mean,0.021574,0.010241
std,0.168488,0.103734
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.0,0.0
max,5.0,2.0


In [34]:
def count_numerical_values(df, text_field):
    '''
    count the number of numerical values in a text
    
    inputs : 
        df : a dataframe
        text_field : the name of the text column to analyse
    
    returns :
        a copy of the dataframe df augmented by an additional column 'number_num_val'
    
    '''
    df = df.copy()  # work on a copy so the caller's dataframe is left untouched, as documented
    df['number_num_val'] = df[text_field].apply(lambda text: sum(word.isdigit() for word in text.split()))
    return df

In [35]:
X_train = count_numerical_values(X_train, 'text')
X_val = count_numerical_values(X_val, 'text')
X_test = count_numerical_values(X_test, 'text')

In [36]:
X_train.describe()

Unnamed: 0,hashtags_count,users_tagged,number_num_val
count,19236.0,19236.0,19236.0
mean,0.021574,0.010241,0.085881
std,0.168488,0.103734,0.335115
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,0.0
75%,0.0,0.0,0.0
max,5.0,2.0,6.0
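Note that `str.isdigit` is only true for tokens made solely of digits, so values such as `3.50` or a number followed by a comma are not counted. The sketch below shows one possible alternative based on a regular expression; the pattern and the name `count_numbers` are assumptions, not the project's method.

```python
import re

# A run of digits, optionally followed by a decimal part (3.50 or 3,50)
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)?")

def count_numbers(text):
    """Count numeric values anywhere in the text, not only standalone tokens."""
    return len(NUMBER_PATTERN.findall(text))

sample = "Woke up at 7, paid 3.50 for coffee, 2 hours late!!"
# Token-based count as in count_numerical_values: only "2" qualifies
print(sum(token.isdigit() for token in sample.split()))
# Regex-based count: "7", "3.50" and "2"
print(count_numbers(sample))
```

The regex is more permissive: it would also pick up digits embedded in URLs or usernames, so those may need to be removed first.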


In [37]:
def count_upper(df, text_field):
    '''
    count the number of upper case words in a text
    
    inputs : 
        df : a dataframe
        text_field : the name of the text column to analyse
    
    returns :
        a copy of the dataframe df augmented by an additional column 'num_upper'
    
    '''
    df = df.copy()  # work on a copy so the caller's dataframe is left untouched, as documented
    df['num_upper'] = df[text_field].apply(lambda text: sum(word.isupper() for word in text.split()))
    return df

In [38]:
X_train = count_upper(X_train, 'text')
X_val = count_upper(X_val, 'text')
X_test = count_upper(X_test, 'text')

In [39]:
X_train.describe()

Unnamed: 0,hashtags_count,users_tagged,number_num_val,num_upper
count,19236.0,19236.0,19236.0,19236.0
mean,0.021574,0.010241,0.085881,0.605635
std,0.168488,0.103734,0.335115,1.268571
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,1.0
max,5.0,2.0,6.0,27.0
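One caveat with `str.isupper`: it is also true for the one-letter pronoun `I`, which is very frequent in tweets and therefore inflates `num_upper`. A possible refinement, sketched here with the hypothetical helper `count_shouted_words`, is to require at least two letters per token.

```python
def count_shouted_words(text, min_letters=2):
    """Count upper case tokens that contain at least `min_letters` letters."""
    count = 0
    for token in text.split():
        n_letters = sum(ch.isalpha() for ch in token)
        if n_letters >= min_letters and token.isupper():
            count += 1
    return count

tweet = "Sooo SAD I will miss you here in San Diego!!!"
# Plain isupper() counts both "SAD" and "I"
print(sum(token.isupper() for token in tweet.split()))
# With the length filter only "SAD" remains
print(count_shouted_words(tweet))
```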


**WARNING**: the emoticon count is still missing
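As a minimal sketch of how this missing feature could be added, the block below counts western-style emoticons with a small hand-written regular expression. Both the pattern and the name `count_emoticons` are assumptions made for illustration; a fuller version could rely on the `EMOTICONS` dictionary from `emot`.

```python
import re

# Eyes, optional nose, mouth: matches e.g. :-) ;) :D =( :P
# The lookarounds avoid false positives such as the ":/" in "http://".
EMOTICON_PATTERN = re.compile(r"(?<!\w)[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|](?!\w)")

def count_emoticons(text):
    """Count simple western-style emoticons in a text."""
    return len(EMOTICON_PATTERN.findall(text))

print(count_emoticons("thx ;-) see you :D"))
print(count_emoticons("http://twitpic.com/4w75p :("))
```

Following the pattern of the other counters, it could be stored with `df[text_field].apply(count_emoticons)`.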

Inspiration: https://towardsdatascience.com/text-preprocessing-for-data-scientist-3d2419c8199d

In [42]:
# Checking the first 10 most frequent words
from collections import Counter


In [45]:
def count_most_common_words(df, text_field, nb=10):
    '''
    count the most common words
    
    inputs : 
        df : a dataframe
        text_field : the name of the text column to analyse
        nb : the number of words to return (default 10)
    
    returns :
        a list of tuples containing the most common words and their respective number of occurrences
    
    '''
    cnt = Counter()
    for text in df[text_field].values:
        for word in text.split():
            cnt[word] += 1
        
    return cnt.most_common(nb)

In [47]:
count_most_common_words(X_train, 'text', 20)

[('to', 6927),
 ('I', 6075),
 ('the', 5959),
 ('a', 4507),
 ('my', 3422),
 ('and', 3313),
 ('i', 3031),
 ('you', 2667),
 ('is', 2563),
 ('for', 2516),
 ('in', 2474),
 ('of', 2159),
 ('it', 2053),
 ('on', 1824),
 ('have', 1620),
 ('that', 1527),
 ('me', 1526),
 ('so', 1514),
 ('with', 1387),
 ('be', 1365)]
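The ranking above is dominated by stop words, and `I` and `i` are counted separately because the count is case-sensitive. The sketch below lowercases the tokens and filters out stop words before counting. To stay self-contained it uses a small hand-written set, `BASIC_STOPWORDS`, which is hypothetical; in this notebook `stopwords.words('english')` from `nltk` would be the natural replacement.

```python
from collections import Counter

# Hypothetical minimal stop word list, taken from the top of the ranking above
BASIC_STOPWORDS = {"to", "i", "the", "a", "my", "and", "you", "is", "for", "in",
                   "of", "it", "on", "have", "that", "me", "so", "with", "be"}

def count_content_words(texts, nb=10, stop_words=BASIC_STOPWORDS):
    """Return the `nb` most common lowercased words that are not stop words."""
    counter = Counter()
    for text in texts:
        counter.update(word for word in text.lower().split()
                       if word not in stop_words)
    return counter.most_common(nb)

tweets = ["I love the sun", "i love my dog", "The dog is happy", "love it"]
print(count_content_words(tweets, nb=2))
# [('love', 3), ('dog', 2)]
```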

In [49]:
!pip install emot

Collecting emot
  Downloading emot-2.1-py3-none-any.whl (27 kB)
Installing collected packages: emot
Successfully installed emot-2.1


In [50]:
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

In [58]:
# Converting emojis to words
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    return text

# Converting emoticons to words
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text


In [59]:
text = "Hello :-) :-)"
convert_emoticons(text)


'Hello Happy_face_smiley Happy_face_smiley'

In [60]:
text1 = "Hilarious 😂"
convert_emojis(text1)

'Hilarious face_with_tears_of_joy'
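`convert_emoticons` runs one `re.sub` per dictionary entry for every text, which may become slow on a full dataframe. One possible alternative is to compile a single pattern once and replace every match in one pass. The sketch below assumes the emoticons are available as literal strings; `TOY_EMOTICONS` and `build_emoticon_converter` are hypothetical, and the `emot` keys, which are used as regex patterns in the cell above, would need adapting.

```python
import re

# Hypothetical toy mapping; the real EMOTICONS dictionary is much larger
TOY_EMOTICONS = {":-)": "Happy face smiley", ":-(": "Frown sad", ";-)": "Wink"}

def build_emoticon_converter(mapping):
    """Return a function that replaces every known emoticon in one regex pass."""
    # Longest keys first so that a short emoticon never shadows a longer one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))

    def convert(text):
        # Replace each match by its description, words joined with underscores
        return pattern.sub(lambda m: "_".join(mapping[m.group(0)].split()), text)

    return convert

convert_toy_emoticons = build_emoticon_converter(TOY_EMOTICONS)
print(convert_toy_emoticons("Hello :-) :-)"))
# Hello Happy_face_smiley Happy_face_smiley
```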