# <center>CLASSIFICATION ON TWEETS</center>

The goal is to apply the Bag of Words and TF-IDF representations to the disaster tweets dataset.

Then, for each representation, train 4 models and evaluate their performance.

Finally, run a cross-validated hyperparameter search and compare the results.
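
To make the two representations concrete, here is a minimal plain-Python sketch on a hypothetical three-tweet corpus. It assumes the smoothed IDF formula that scikit-learn's TfidfVectorizer applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it skips the final L2 normalization for brevity. The corpus and the helper name `idf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already cleaned and lower-cased).
docs = [
    "forest fire near town",
    "fire evacuation order",
    "sunny day near beach",
]

tokenized = [doc.split() for doc in docs]

# Bag of words: one raw term-count mapping per document.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

# TF-IDF before normalization: term count multiplied by IDF.
tfidf = [{term: count * idf(term) for term, count in bag.items()} for bag in bags]

print(dict(bags[0]))
print({term: round(weight, 3) for term, weight in tfidf[0].items()})
```

In the first document, "forest" and "fire" have the same count, but "fire" also appears in the second document, so TF-IDF gives it a lower weight than "forest".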

## Import modules

We import the modules needed to run the notebook.

In [1]:
# DATA MANIPULATION
import numpy as np
import pandas as pd

# ML
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer # BAG OF WORDS
from sklearn.feature_extraction.text import TfidfVectorizer # TFIDF


from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# DATA VIZ
import matplotlib.pyplot as plt

# NLP
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fisma\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Import Dataset

We load the dataset from a CSV file, then take a quick look at it with head and shape.

In [2]:
df = pd.read_csv('../TextFiles/tweet_train.csv')
df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
df.shape

(7613, 5)

We then drop the columns we do not need.

In [4]:
df.drop(['id','keyword', 'location'], axis = 1, inplace = True) 

## CHECK FOR MISSING AND BLANK VALUES

We now look at the missing values.

In [5]:
df.isnull().sum()

text      0
target    0
dtype: int64

In [6]:
blanks = []  # start with an empty list

for i, text, target in df.itertuples():  # each row is (index, text, target)
    if isinstance(text, str):            # skip NaN values
        if text.isspace():               # tweet made only of whitespace
            blanks.append(i)             # record the matching row index
        
print(len(blanks), 'blanks: ', blanks)

0 blanks:  []


No values are missing, and no tweet consists only of whitespace.

## PREPROCESS DATA

In [7]:
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # load the stop-word list once, not once per token

def preprocess_data(data):
    review = re.sub(r'https?://\S+|www\.\S+', ' ', data)  # remove URLs
    review = re.sub(r'<.*?>', ' ', review)  # remove HTML tags
    review = re.sub("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols and pictographs
                           u"\U0001F680-\U0001F6FF"  # transport and map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", ' ', review)
    review = re.sub('[^a-zA-Z]', ' ', review)  # keep letters only
    review = review.lower()  # lower-case
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]  # drop stop words, then stem
    review = [i for i in review if len(i) > 3]  # drop tokens of 3 characters or fewer
    review = ' '.join(review)
    return review

df["c_text"] = df["text"].apply(preprocess_data)

We clean the text with our preprocess_data function.
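
The cleaning steps can be followed on a single string with the standard library alone. The sketch below is a simplified stand-in, not the notebook's function: it uses a hypothetical eight-word stop list instead of NLTK's English list, skips stemming, and runs on a made-up sample tweet.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"our", "are", "the", "of", "this", "may", "us", "all"}

def clean_tweet(text):
    """Simplified version of the cleaning steps, without stemming."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"<.*?>", " ", text)                   # remove HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()       # keep letters only, lower-case
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    tokens = [t for t in tokens if len(t) > 3]           # drop tokens of 3 characters or fewer
    return " ".join(tokens)

# Made-up sample combining a hashtag, an HTML tag and a URL.
sample = "Our Deeds are the Reason of this #earthquake <b>NOW</b> http://t.co/abc123"
print(clean_tweet(sample))  # deeds reason earthquake
```

The length filter is strict: "now" has exactly three letters and is dropped, which is also why short words disappear from the c_text column.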

In [8]:
df.head()

Unnamed: 0,text,target,c_text
0,Our Deeds are the Reason of this #earthquake M...,1,deed reason earthquak allah forgiv
1,Forest fire near La Ronge Sask. Canada,1,forest fire near rong sask canada
2,All residents asked to 'shelter in place' are ...,1,resid shelter place notifi offic evacu shelter...
3,"13,000 people receive #wildfires evacuation or...",1,peopl receiv wildfir evacu order california
4,Just got sent this photo from Ruby #Alaska as ...,1,sent photo rubi alaska smoke wildfir pour school


We create a new dataset with the cleaned text.

In [9]:
c_df = df[['c_text','target']]
c_df.head()

Unnamed: 0,c_text,target
0,deed reason earthquak allah forgiv,1
1,forest fire near rong sask canada,1
2,resid shelter place notifi offic evacu shelter...,1
3,peopl receiv wildfir evacu order california,1
4,sent photo rubi alaska smoke wildfir pour school,1


## Dataset split (target/feature)

In [10]:
X = c_df['c_text']
y = c_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
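
As a quick sanity check on this split, the arithmetic below reproduces the test-set size of 2513 that appears as the total support in the classification reports further down. It assumes the test share is rounded up and the remainder goes to training. The majority-class baseline uses the class supports from those reports (1446 tweets of class 0 out of 2513).

```python
import math

n_samples = 7613      # rows reported by df.shape
test_size = 0.33

# Assumption: the test share is rounded up, the rest is used for training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 5100 2513

# Accuracy of a model that always predicts the majority class (class 0).
baseline = 1446 / n_test
print(round(baseline, 3))  # 0.575
```

Any model scoring close to 0.575 on the test set is therefore doing no better than always predicting "not a disaster".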

## BUILD ALGO FUNCTION

In [11]:
def build_pipeline_and_process(vector,clf):
    # Chain the vectorizer and the classifier so both are fitted on the training set only
    text_clf = Pipeline(
        [
            ('vector', vector),
            ('clf', clf),
        ]
    )

    text_clf.fit(X_train, y_train)

    predictions = text_clf.predict(X_test)
    accuracy = metrics.accuracy_score(y_test,predictions)  # computed once, printed and returned

    print("CONFUSION MATRIX :\n",metrics.confusion_matrix(y_test,predictions))

    print("METRICS REPORT :\n",metrics.classification_report(y_test,predictions))

    print("SCORE :",accuracy)

    return accuracy

## SIMPLE BAG OF WORDS

We start by using the build_pipeline_and_process function, without cross-validation or hyperparameter tuning.

In [12]:
build_pipeline_and_process(CountVectorizer(),MultinomialNB())

CONFUSION MATRIX :
 [[1219  227]
 [ 288  779]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.81      0.84      0.83      1446
           1       0.77      0.73      0.75      1067

    accuracy                           0.80      2513
   macro avg       0.79      0.79      0.79      2513
weighted avg       0.79      0.80      0.79      2513

SCORE : 0.7950656585754079


0.7950656585754079
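
The figures in the report can be recomputed by hand from the confusion matrix printed above, which helps when reading the later outputs. This sketch uses the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 1219, 227   # true class 0
fn, tp = 288, 779    # true class 1 (disaster)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted disasters that are real
recall = tp / (tp + fn)      # share of real disasters that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.7951 0.77 0.73 0.75
```

The values match the class 1 row of the report and the printed SCORE.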

In [13]:
build_pipeline_and_process(CountVectorizer(),LinearSVC())

CONFUSION MATRIX :
 [[1194  252]
 [ 333  734]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.78      0.83      0.80      1446
           1       0.74      0.69      0.72      1067

    accuracy                           0.77      2513
   macro avg       0.76      0.76      0.76      2513
weighted avg       0.77      0.77      0.77      2513

SCORE : 0.7672105053720653


0.7672105053720653

In [14]:
build_pipeline_and_process(CountVectorizer(),DecisionTreeClassifier())

CONFUSION MATRIX :
 [[1161  285]
 [ 357  710]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.76      0.80      0.78      1446
           1       0.71      0.67      0.69      1067

    accuracy                           0.74      2513
   macro avg       0.74      0.73      0.74      2513
weighted avg       0.74      0.74      0.74      2513

SCORE : 0.7445284520493434


0.7445284520493434

In [15]:
build_pipeline_and_process(CountVectorizer(),RandomForestClassifier())

CONFUSION MATRIX :
 [[1254  192]
 [ 363  704]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.78      0.87      0.82      1446
           1       0.79      0.66      0.72      1067

    accuracy                           0.78      2513
   macro avg       0.78      0.76      0.77      2513
weighted avg       0.78      0.78      0.78      2513

SCORE : 0.7791484281734978


0.7791484281734978

We get the following scores (values taken from the cell outputs above):

- BW_MNB = 0.7950656585754079
- BW_SVC = 0.7672105053720653
- BW_DTC = 0.7445284520493434
- BW_RFC = 0.7791484281734978

With bag of words, without cross-validation or hyperparameter tuning, the best performer is MultinomialNB, followed by RandomForest, then SVC, and finally DecisionTree. The DecisionTree and RandomForest scores vary slightly from run to run because no random_state is set.
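
Ranking by eye is error-prone with four long decimals, so a small sketch like the one below can do it instead. The accuracies are copied from the four cell outputs above; the dictionary name is illustrative.

```python
# Accuracies copied from the four bag-of-words cell outputs above.
bow_scores = {
    "MultinomialNB": 0.7950656585754079,
    "LinearSVC": 0.7672105053720653,
    "DecisionTreeClassifier": 0.7445284520493434,
    "RandomForestClassifier": 0.7791484281734978,
}

# Sort from best to worst accuracy.
ranking = sorted(bow_scores.items(), key=lambda item: item[1], reverse=True)

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

Since build_pipeline_and_process returns the accuracy, the same dictionary could be filled directly from its return values rather than by copying numbers.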

## SIMPLE TFIDF

We start by using the build_pipeline_and_process function, without cross-validation or hyperparameter tuning.

In [16]:
build_pipeline_and_process(TfidfVectorizer(),MultinomialNB())

CONFUSION MATRIX :
 [[1280  166]
 [ 348  719]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.79      0.89      0.83      1446
           1       0.81      0.67      0.74      1067

    accuracy                           0.80      2513
   macro avg       0.80      0.78      0.78      2513
weighted avg       0.80      0.80      0.79      2513

SCORE : 0.7954635893354556


0.7954635893354556

In [17]:
build_pipeline_and_process(TfidfVectorizer(),LinearSVC())

CONFUSION MATRIX :
 [[1213  233]
 [ 316  751]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.79      0.84      0.82      1446
           1       0.76      0.70      0.73      1067

    accuracy                           0.78      2513
   macro avg       0.78      0.77      0.77      2513
weighted avg       0.78      0.78      0.78      2513

SCORE : 0.7815360127337844


0.7815360127337844

In [18]:
build_pipeline_and_process(TfidfVectorizer(),DecisionTreeClassifier())

CONFUSION MATRIX :
 [[1114  332]
 [ 358  709]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.76      0.77      0.76      1446
           1       0.68      0.66      0.67      1067

    accuracy                           0.73      2513
   macro avg       0.72      0.72      0.72      2513
weighted avg       0.72      0.73      0.72      2513

SCORE : 0.7254277755670513


0.7254277755670513

In [19]:
build_pipeline_and_process(TfidfVectorizer(),RandomForestClassifier())

CONFUSION MATRIX :
 [[1232  214]
 [ 331  736]]
METRICS REPORT :
               precision    recall  f1-score   support

           0       0.79      0.85      0.82      1446
           1       0.77      0.69      0.73      1067

    accuracy                           0.78      2513
   macro avg       0.78      0.77      0.77      2513
weighted avg       0.78      0.78      0.78      2513

SCORE : 0.7831277357739753


0.7831277357739753

We get the following scores (values taken from the cell outputs above):

- TF_MNB = 0.7954635893354556
- TF_SVC = 0.7815360127337844
- TF_DTC = 0.7254277755670513
- TF_RFC = 0.7831277357739753

With TF-IDF, without cross-validation or hyperparameter tuning, the best performer is MultinomialNB, followed by RandomForest, then SVC, and finally DecisionTree.

## BUILD TUNING ALGO FUNCTION

In [20]:
def build_tuning_pipeline_and_process(vector,clf,grid_params):
    pipeline = Pipeline(
        [
            ('vector', vector),
            ('clf', clf),
        ]
    )

    # Exhaustive search over grid_params with 5-fold cross-validation on the training set
    grid = GridSearchCV(pipeline, grid_params, cv = 5)
    grid.fit(X_train, y_train)

    # best_score_ is the mean cross-validated accuracy of the best parameter combination
    print("Best Score: ", grid.best_score_)
    print("Best Params: ", grid.best_params_)
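
The cost of a grid search grows with the product of the value lists, which the sketch below makes explicit using only itertools. It enumerates the grid later used for MultinomialNB with bag of words (6 alpha values, 2 prior settings). In each key, the part before the double underscore names the pipeline step ('clf' or 'vector') and the part after it names that step's parameter.

```python
from itertools import product

# Same grid as the MultinomialNB bag-of-words tuning cell, written out as plain lists.
grid_params = {
    "clf__alpha": [0.5, 0.7, 0.9, 1.1, 1.3, 1.5],
    "clf__fit_prior": [True, False],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
names = list(grid_params)
candidates = [dict(zip(names, values)) for values in product(*grid_params.values())]

print(len(candidates), "candidates,", len(candidates) * cv_folds, "fits")
# 12 candidates, 60 fits
print(candidates[0])
```

With the default refit setting, GridSearchCV adds one last fit of the best candidate on the whole training set. The same count explains why the RandomForest grids take much longer to run.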

## TUNING BAG OF WORDS

In [21]:
build_tuning_pipeline_and_process(CountVectorizer(),MultinomialNB(),{
  'clf__alpha': np.linspace(0.5, 1.5, 6),
  'clf__fit_prior': [True, False]
})

Best Score:  0.7845098039215685
Best Params:  {'clf__alpha': 1.5, 'clf__fit_prior': True}


In [22]:
build_tuning_pipeline_and_process(CountVectorizer(),LinearSVC(),{
    'clf__C':np.arange(0.01,100,10)
})



Best Score:  0.7825490196078431
Best Params:  {'clf__C': 0.01}




In [23]:
build_tuning_pipeline_and_process(CountVectorizer(),DecisionTreeClassifier(),{
    'clf__criterion':['gini','entropy'],
    'clf__max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]
})

Best Score:  0.763921568627451
Best Params:  {'clf__criterion': 'entropy', 'clf__max_depth': 70}


In [24]:
build_tuning_pipeline_and_process(CountVectorizer(),RandomForestClassifier(),{ 
    'clf__n_estimators': [200, 500],
    'clf__max_features': ['auto', 'sqrt', 'log2'],  # 'auto' behaves like 'sqrt' for classifiers; deprecated in scikit-learn 1.1, removed in 1.3
    'clf__max_depth' : [4,5,6,7,8],
    'clf__criterion' :['gini', 'entropy']
})

Best Score:  0.6384313725490196
Best Params:  {'clf__criterion': 'gini', 'clf__max_depth': 8, 'clf__max_features': 'auto', 'clf__n_estimators': 200}


We get the following scores (values taken from the cell outputs above):

- BW_MNB = 0.7845098039215685 {'clf__alpha': 1.5, 'clf__fit_prior': True}
- BW_SVC = 0.7825490196078431 {'clf__C': 0.01}
- BW_DTC = 0.763921568627451 {'clf__criterion': 'entropy', 'clf__max_depth': 70}
- BW_RFC = 0.6384313725490196 {'clf__criterion': 'gini', 'clf__max_depth': 8, 'clf__max_features': 'auto', 'clf__n_estimators': 200}

With tuned bag of words, the best performer is MultinomialNB, followed by SVC, then DecisionTree, and finally RandomForest. These are mean 5-fold cross-validation accuracies on the training set, so they are not directly comparable with the test-set accuracies of the untuned models.

## TUNING TFIDF

In [25]:
build_tuning_pipeline_and_process(TfidfVectorizer(),MultinomialNB(),{
  'clf__alpha': np.linspace(0.5, 1.5, 6),
  'clf__fit_prior': [True, False],
  'vector__max_df': np.linspace(0.1, 1, 10),
  'vector__binary': [True, False],
  'vector__norm': [None, 'l1', 'l2'], 
})

Best Score:  0.7945098039215687
Best Params:  {'clf__alpha': 1.3, 'clf__fit_prior': True, 'vector__binary': True, 'vector__max_df': 0.1, 'vector__norm': 'l2'}
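
The vector__max_df parameter tuned above removes terms that occur in too many documents. The plain-Python sketch below illustrates the idea on a hypothetical four-tweet corpus, assuming a float max_df is read as a share of documents and that a term is dropped when its document frequency is strictly above that share.

```python
from collections import Counter

# Hypothetical cleaned tweets.
docs = [
    "fire forest evacuation",
    "fire storm warning",
    "fire concert tonight",
    "storm flood warning",
]
max_df = 0.5  # keep terms present in at most 50% of the documents

# Number of documents in which each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc.split()))
limit = max_df * len(docs)

vocabulary = sorted(term for term, freq in doc_freq.items() if freq <= limit)
dropped = sorted(term for term, freq in doc_freq.items() if freq > limit)

print(vocabulary)
print(dropped)  # ['fire']
```

Here "fire" appears in three of four tweets and is removed, while "storm" and "warning" sit exactly at the limit and are kept.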


In [27]:
build_tuning_pipeline_and_process(TfidfVectorizer(),LinearSVC(),{
    'clf__C':np.arange(0.01,100,10),
    'vector__max_df': np.linspace(0.1, 1, 10),
    'vector__binary': [True],
    'vector__norm': [None], 
})



Best Score:  0.7645098039215685
Best Params:  {'clf__C': 0.01, 'vector__binary': True, 'vector__max_df': 0.1, 'vector__norm': None}




In [28]:
build_tuning_pipeline_and_process(TfidfVectorizer(),DecisionTreeClassifier(),{
    'clf__criterion':['gini','entropy'],
    'clf__max_depth':[4,5,6,7,8,9,10],
    'vector__max_df': np.linspace(0.1, 1, 10),
    'vector__binary': [True],
    'vector__norm': [None], 
})

Best Score:  0.6627450980392157
Best Params:  {'clf__criterion': 'gini', 'clf__max_depth': 10, 'vector__binary': True, 'vector__max_df': 0.1, 'vector__norm': None}


In [30]:
build_tuning_pipeline_and_process(TfidfVectorizer(),RandomForestClassifier(),{ 
    'clf__n_estimators': [200, 500],
    'clf__max_features': ['sqrt', 'log2'],
    'clf__max_depth' : [4,5],
    'clf__criterion' :['gini', 'entropy'],
    'vector__max_df': np.linspace(0.1, 1, 10),
    'vector__binary': [True],
    'vector__norm': [None], 
})

Best Score:  0.603921568627451
Best Params:  {'clf__criterion': 'entropy', 'clf__max_depth': 5, 'clf__max_features': 'sqrt', 'clf__n_estimators': 200, 'vector__binary': True, 'vector__max_df': 0.6, 'vector__norm': None}


We get the following scores:

- TF_MNB = 0.7945098039215687 {'clf__alpha': 1.3, 'clf__fit_prior': True, 'vector__binary': True, 'vector__max_df': 0.1, 'vector__norm': 'l2'}
- TF_SVC = 0.7645098039215685 {'clf__C': 0.01, 'vector__binary': True, 'vector__max_df': 0.1, 'vector__norm': None}
- TF_DTC = 0.6627450980392157 {'clf__criterion': 'gini', 'clf__max_depth': 10, 'vector__binary': True, 'vector__max_df': 0.1, 'vector__norm': None}
- TF_RFC = 0.603921568627451  {'clf__criterion': 'entropy', 'clf__max_depth': 5, 'clf__max_features': 'sqrt', 'clf__n_estimators': 200, 'vector__binary': True, 'vector__max_df': 0.6, 'vector__norm': None}

With tuned TF-IDF, the best performer is MultinomialNB, followed by SVC, then DecisionTree, and finally RandomForest.