<a href="https://colab.research.google.com/github/Remydeme/Descarte/blob/master/TFIDF_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install seaborn
!pip install spacy
!python3 -m spacy download en_core_web_sm
!pip install pyldavis
!pip install bokeh
!pip install tqdm
!pip install eli5
!pip install lime
!pip install skater
!pip install --upgrade gensim
!pip install shap
!pip install plotly
!pip install chart_studio

# Introduction 


In this notebook we study TF-IDF in depth, working on the 20 Newsgroups dataset.

Our goals:

* Prepare the dataset for training.

* Study the dataset vocabulary.

* See how to configure the TfidfVectorizer.

* In the training part we will see:

    * The influence of the ngram_range parameter on machine learning model performance.
    * The improvement of the f1-score when we merge the TF-IDF matrices of two vectorizers, one configured for word analysis and the other for character analysis.
    * The impact of the max_features parameter.

* Validate our assumptions on the **FakeNews** dataset.

**If you are already familiar with TF-IDF and with preparing a text dataset, I advise you to jump to the TF-IDF training section.**

**Note**: This notebook is rich in content. You can use the summary to go directly to the conclusions.

---

**Results**: on the **Kaggle FakeNews dataset** our best f1-score is 99.2%. This allows us to rank first in this Kaggle competition.


# TF-IDF 

## How it works

### TF-IDF 

Here TF means **Term Frequency** and IDF means **Inverse Document Frequency**. TF is defined as in the **bag of words** (BOW) model.

By multiplying by the **inverse document frequency**, the TF-IDF vectorizer gives more weight to rare words.


* **TF**: number of times the word appears in the text / number of words in the text

![Alternative text…](https://miro.medium.com/proxy/1*HM0Vcdrx2RApOyjp_ZeW_Q.png)

* **IDF**: log(number of documents / number of documents in which the word appears)

![Alternative text…](https://miro.medium.com/proxy/1*A5YGwFpcTd0YTCdgoiHFUw.png)

* **TF-IDF = TF * IDF**

![Alternative text…](https://miro.medium.com/proxy/1*nSqHXwOIJ2fa_EFLTh5KYw.png)

* **TF**: counts how often each word appears in the text, relative to the length of the text. It therefore measures the importance of the word within the text (its empirical probability in the text).

* **IDF**: measures how rare each word is across the whole corpus. It answers the question: is this word specific to a few documents, or does it appear almost everywhere?

These two terms combined weight the importance of each word with respect to the document it is in.
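
The definitions above can be checked by hand on a toy corpus. The following is a minimal sketch in plain Python, assuming the textbook formulas given above; the three tokenized documents and the helper names `tf`, `idf` and `tfidf` are hypothetical. scikit-learn's TfidfVectorizer uses raw counts, a smoothed IDF and L2 normalization by default, so its values differ from these.

```python
import math

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["the", "virus", "infects", "the", "dog"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
]

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    """TF-IDF score of `term` in `doc`, relative to the corpus `docs`."""
    return tf(term, doc) * idf(term, docs)

# Score three words of the first document.
for term in ["the", "dog", "virus"]:
    print(term, tfidf(term, docs[0], docs))
```

A word present in every document ("the") has an IDF of log(1) = 0, so its TF-IDF is 0 however often it appears, while the rarest word ("virus") gets the highest score.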

## Goal

1. Determine whether coupling two TF-IDF vectorizers, one for characters and another for words, improves the accuracy of our model.

2. Determine how to properly configure our **TF-IDF**. We will focus on **ngram_range** and **max_features**.

3. Study the impact of vocabulary size.





# Dataset


In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

def twenty_newsgroup_to_csv():
    newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

    df = pd.DataFrame([newsgroups_train.data, newsgroups_train.target.tolist()]).T
    df.columns = ['text', 'target']

    targets = pd.DataFrame( newsgroups_train.target_names)
    targets.columns=['title']

    out = pd.merge(df, targets, left_on='target', right_index=True)
    out['date'] = pd.to_datetime('now')
    return out 

In [None]:
news = twenty_newsgroup_to_csv()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [None]:
news = news[['text', 'target', 'title']]

In [None]:
distribution = news.title.value_counts()
distribution = pd.DataFrame({'Category' : distribution.index, 'Frequency' : distribution.values})
distribution

| | Category | Frequency |
|---|---|---|
| 0 | rec.sport.hockey | 600 |
| 1 | soc.religion.christian | 599 |
| 2 | rec.motorcycles | 598 |
| 3 | rec.sport.baseball | 597 |
| 4 | sci.crypt | 595 |
| 5 | sci.med | 594 |
| 6 | rec.autos | 594 |
| 7 | sci.space | 593 |
| 8 | comp.windows.x | 593 |
| 9 | sci.electronics | 591 |


### Text frequency analysis 

In [None]:
import plotly.express as px 

px.bar(distribution, x="Frequency", y='Category', color='Category', orientation='h', labels={'Category': 'Theme', 'Frequency' : 'Number of texts'})

In [None]:
distribution.describe()

| | Frequency |
|---|---|
| count | 20.0 |
| mean | 565.7 |
| std | 58.251813 |
| min | 377.0 |
| 25% | 574.5 |
| 50% | 591.0 |
| 75% | 594.25 |
| max | 600.0 |


Our dataset contains 20 different classes and is fairly well balanced: each class has about 566 texts on average, with a standard deviation of 58 texts. The least represented classes are, in order: religion, politics and atheism.

## Cleaning 


Before training our model on the texts, we have to clean them and convert them into a format (tokens and vectors) that the model can process.

Here are some small standard cleaning functions. We use the spaCy tokenizer and lemmatizer.

In [None]:
import spacy
import en_core_web_sm

print(f'Spacy version {spacy.__version__}')
nlp = en_core_web_sm.load()
stop_words = spacy.lang.en.STOP_WORDS
punctuations = spacy.lang.punctuation.LIST_PUNCT

Spacy version 2.2.4


Prepare text:

1. Lowercase all characters.

2. Apply lemmatization: replace each inflected word (verb, adjective, noun) with its root form, the lemma.

3. Remove the frequent words ("stop words") of the English vocabulary.

4. Remove unnecessary characters (work done by hand).

See: [About lemmatization](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python?utm_source=adwords_ppc&utm_campaignid=9942305733&utm_adgroupid=100189364546&utm_device=c&utm_keyword=&utm_matchtype=b&utm_network=g&utm_adpostion=&utm_creative=332602034352&utm_targetid=aud-763347114660:dsa-929501846124&utm_loc_interest_ms=&utm_loc_physical_ms=9056441&gclid=CjwKCAjwltH3BRB6EiwAhj0IUGeOuxWdfLC-fyH9hxPsjdJJms60PPpkPv-WZCJKR5TwL9QtVkrPAhoCuLYQAvD_BwE)

In [None]:
def prepareText(text, punctuation=True, lemming=True, stop_word=True):
    """
    Tokenize a text with spaCy, then optionally lemmatize it and remove
    stop words and non-alphabetic tokens.
    :param text: raw text to clean
    :param punctuation: if True, keep only alphabetic tokens
    :param lemming: if True, replace each token with its lowercased lemma
    :param stop_word: if True, remove English stop words
    :return: list of cleaned tokens
    """
    doc = nlp(text)

    if lemming:
        # Lowercase the lemma. spaCy 2.x maps every pronoun to the placeholder
        # "-PRON-", so for pronouns we keep the lowercased word itself.
        clean_text = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
    else:
        # Without lemmatization, still work on lowercased strings
        # (the filters below expect strings, not spaCy tokens)
        clean_text = [word.lower_ for word in doc]

    # remove stop words
    if stop_word:
        clean_text = [word for word in clean_text if word not in stop_words]
    # remove punctuation and any other non-alphabetic token
    if punctuation:
        clean_text = [word for word in clean_text if word.isalpha()]

    # remove single-character tokens, except 'a'
    clean_text = [word for word in clean_text if len(word) > 1 or word == 'a']

    return clean_text

This function removes URLs, e-mail domains and special characters.

In [None]:
def standardize_text(df, text_field):
    # regex=True is passed explicitly: since pandas 2.0, str.replace treats
    # the pattern as a literal string by default.
    df[text_field] = df[text_field].str.replace(r"http\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"http", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"@\S+", "", regex=True)
    df[text_field] = df[text_field].str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", regex=True)
    df[text_field] = df[text_field].str.replace(r"@", "at", regex=True)
    return df
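
To see what these patterns do, here is a minimal sketch that applies the same regular expressions, in the same order, to a single string with the standard `re` module. The helper name `standardize` and the sample sentences are hypothetical and only serve the illustration.

```python
import re

def standardize(text):
    """Apply the same substitutions as standardize_text, on one string."""
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"http", "", text)      # drop any leftover "http"
    text = re.sub(r"@\S+", "", text)      # drop "@" followed by non-space characters
    text = re.sub(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", text)  # other symbols become spaces
    text = re.sub(r"@", "at", text)       # an isolated "@" becomes "at"
    return text

print(repr(standardize("Mail bob@example.com or see https://example.com/faq now!")))
print(repr(standardize("meet @ noon, version 1.2")))
```

Note that the part of an e-mail address before the "@" is kept, and that the final substitution only fires for an "@" that is not followed by a non-space character.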

## Vocabulary

Let's study the composition of our dataset's texts in detail. We restrict this study to 3 categories: religion, autos and hockey.


1. Size of the vocabulary 

2. Study of the distribution (# words per text)

In [None]:
news  = [ news[news["title"] == "rec.sport.hockey"], news[news["title"] == "soc.religion.christian"], news[news["title"] == "rec.autos"] ]
news = pd.concat(news)

### Size of the vocabulary 

In [None]:
text_stack = " "
for text in news.text:
  text_stack += text

We have merged all of our texts into one. The goal is to determine the size of our vocabulary.

In [None]:
splited_text = text_stack.split(' ')

In [None]:
print(f"The corpus of texts is made up of : {len(splited_text)} words")

The corpus of texts is made up of : 410409 words


To determine the size of our vocabulary we will use the CountVectorizer object. It analyzes our texts and builds a dictionary containing the vocabulary.

We pass it our **"prepareText"** function as the tokenizer parameter. This function reduces the size of our texts by removing irrelevant elements (frequent words, punctuation ...).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cvect = CountVectorizer(tokenizer=prepareText, ngram_range=(1,1))
cvect_3 = CountVectorizer(tokenizer=prepareText, ngram_range=(1,3))

In [None]:
news = standardize_text(df=news, text_field='text')

In [None]:
corpus = news.text.to_list()
cvect.fit(corpus)
print("")

In [None]:
cvect_3.fit(corpus)
print("")

In [None]:
print(f"Our vocabulary contains:  {len(cvect.vocabulary_)} words.")

Our vocabulary contains:  15487 words.


In [None]:
print(f"The vocabulary with ngram_range = (1,3) contains :  {len(cvect_3.vocabulary_)} words")

The vocabulary with ngram_range = (1,3) contains :  259269 words


In [None]:
(259269) / 15487

16.74107315813263

The vocabulary size increases considerably with the use of n-grams. Here, with ngram_range = (1,3), the size of our vocabulary is multiplied by **16.7**.
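
The growth can be reproduced on a tiny scale. The sketch below builds word n-grams by hand in plain Python. It is a simplified stand-in for what CountVectorizer does, without any cleaning, and the two-sentence corpus and the helper names `word_ngrams` and `vocabulary_size` are hypothetical.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def vocabulary_size(corpus, max_n):
    """Number of distinct n-grams (n from 1 to max_n) found in the corpus."""
    vocabulary = set()
    for text in corpus:
        vocabulary.update(word_ngrams(text.split(), 1, max_n))
    return len(vocabulary)

corpus = ["the cat sat on the mat", "the dog sat on the log"]

for max_n in (1, 2, 3):
    print(f"ngram_range=(1, {max_n}): {vocabulary_size(corpus, max_n)} entries")
```

Even on two short sentences the vocabulary goes from 7 unigrams to 22 entries with ngram_range = (1,3), because most bigrams and trigrams are unique.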

In [None]:
cvect.get_feature_names()[10:30]

['abbie',
 'abbreviation',
 'abc',
 'abe',
 'abhorent',
 'abhorrent',
 'abide',
 'abideth',
 'abiding',
 'ability',
 'abiogenesis',
 'able',
 'ablility',
 'ably',
 'aboard',
 'abode',
 'abolish',
 'abolition',
 'abomination',
 'abort']

## Word distribution

We want to count the number of words per text for each category, to determine whether the length of the texts has an impact on the classification made by our model.

 

In [None]:
def lenText(text):
  """
    Count the number of token present in a text. 
    @params text : text that we want to analyse 
  """
  return len(text.split(' '))

In [None]:
text_size_df = news.text.apply(lenText)

In [None]:
news['text_size'] = text_size_df

In [None]:
news.head()

Unnamed: 0,text,target,title,text_size
21,\nI think that Mike Foligno was the captain of...,10,rec.sport.hockey,90
35,\nFunny you should mention this one time on H...,10,rec.sport.hockey,71
57,\nNo no no!!! It's a squid! Keep the traditi...,10,rec.sport.hockey,23
88,\n \...,10,rec.sport.hockey,75
113,\n\nWell I don't see any smileys here I am t...,10,rec.sport.hockey,44


In [None]:
import plotly.express as px

px.box(news, x='text_size', y='title', color='title',  orientation='h', category_orders={'title' : ['rec.autos', 'soc.religion.christian', 'rec.sport.hockey']})

Here we can observe that the texts of the different categories have very different lengths.

* The autos category has short texts compared to the other categories. **75%** of its texts are less than 160 tokens long (before cleaning).

* Texts about religion are generally longer in this corpus. **50% of them are longer than 160 tokens**.


So our texts, before we applied the cleaning, have on average 269 tokens. Let's see after cleaning.

In [None]:
cleaned_text = news.text.apply(prepareText)

In [None]:
news['cleaned_text'] = cleaned_text

In [None]:
cleaned_text_size = [len(tokens) for tokens in news.cleaned_text]

In [None]:
news['cleaned_text_size'] = cleaned_text_size

In [None]:
news.cleaned_text_size.describe()

count    1793.000000
mean       79.447295
std       190.363558
min         0.000000
25%        18.000000
50%        41.000000
75%        82.000000
max      6158.000000
Name: cleaned_text_size, dtype: float64

In [None]:
import plotly.express as px

px.box(news, x='cleaned_text_size', y='title', color='title',  orientation='h', category_orders={'title' : ['rec.autos', 'soc.religion.christian', 'rec.sport.hockey']}, labels={'title' : 'Theme', 'cleaned_text_size' : 'Number of words per text'})

In [None]:
for category in ['rec.autos', 'soc.religion.christian', 'rec.sport.hockey']:
  print(str(20 * '-') + category + str(20 * '-'))
  print(news[news['title'] == category].describe())

--------------------rec.autos--------------------
         text_size  cleaned_text_size
count   594.000000         594.000000
mean    146.872054          49.422559
std     335.808029         110.693148
min       1.000000           0.000000
25%      41.000000          15.000000
50%      78.000000          28.000000
75%     159.750000          55.750000
max    6004.000000        2064.000000
--------------------soc.religion.christian--------------------
         text_size  cleaned_text_size
count   599.000000         599.000000
mean    307.662771         103.799666
std     429.153452         126.694983
min       1.000000           0.000000
25%      86.500000          31.000000
50%     176.000000          63.000000
75%     374.500000         125.500000
max    5921.000000        1117.000000
--------------------rec.sport.hockey--------------------
          text_size  cleaned_text_size
count    600.000000         600.000000
mean     353.473333          84.860000
std     1369.577197         2

**After cleaning, the mean text length is 79 tokens. The cleaning removes a little more than 2/3 of the tokens.**

* After the cleaning:
   * Autos has lost 2/3 of its tokens
   * Religion has lost 2/3 of its tokens
   * Hockey has lost 3/4 of its tokens

* After cleaning, the texts of the **Christian** category contain on average about twice as many words as those of the autos class (104 vs 49) and about 20% more words than those of the hockey class (104 vs 85). Our classes do not have texts of the same length.



**1. Does the number of words in a text have an impact on its classification ?** 

The TF-IDF algorithm first builds a vocabulary from all of the texts. A TF-IDF score is then computed for each word of the vocabulary in each text; this identifies the important words, and therefore the "themes". When we make a prediction, our model uses those words (which represent themes) to classify the text in the right category. The prediction of our model therefore depends on the words present in the text.

The longer a text, the more words it contains, probably including words from several themes. Our model therefore has more evidence to classify it, but the probability of **overlap** (several themes being found in the same text) is also higher.

# TF-IDF configuration

With scikit-learn, there are two ways to perform a TF-IDF analysis:

1. Apply the **CountVectorizer**, which builds a vocabulary from the text corpus, applies a cleaning function, transforms the words into tokens and builds a term-frequency matrix for the texts passed to the fit_transform method.

Then apply the TfidfTransformer, which multiplies this matrix by the IDF (**inverse document frequency**).

2. Use the TfidfVectorizer, which performs those two operations at once with the **fit_transform** method.

We use the second method.
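
The equivalence of the two approaches can be checked on a small example. This is a minimal sketch with default parameters; the three-sentence `toy_corpus` is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus
toy_corpus = [
    "the dog chases the cat",
    "the cat sleeps",
    "a virus infects the dog",
]

# Way 1: raw term counts first, then IDF weighting and normalization
counts = CountVectorizer().fit_transform(toy_corpus)
two_steps = TfidfTransformer().fit_transform(counts)

# Way 2: both steps at once
one_step = TfidfVectorizer().fit_transform(toy_corpus)

same_result = np.allclose(two_steps.toarray(), one_step.toarray())
print(same_result)
```

Both matrices have one row per text and one column per vocabulary term. The two-step version is mostly useful when the raw counts are also needed.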


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


Let's see how to configure the TfidfVectorizer. We apply it to 2 short medical texts selected on the web.

In this part we will analyze the configuration of the TfidfVectorizer. We will focus on three parameters.


* ngram_range : {tuple} (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

* max_features : {int}, default=None

If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.


* analyzer : {‘word’, ‘char’, ‘char_wb’} or callable, default=’word’

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

There are other parameters that can be adjusted. [More details](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
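
To make the difference between the three analyzers concrete, the sketch below asks a vectorizer for its analyzer function with `build_analyzer()` and applies it to one short string. The sample text and the helper name `ngrams_for` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "spike protein"  # hypothetical sample

def ngrams_for(analyzer, ngram_range):
    """Return the n-grams that a vectorizer with this configuration extracts from `text`."""
    analyze = TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range).build_analyzer()
    return analyze(text)

word_grams = ngrams_for("word", (1, 2))        # unigrams and bigrams of words
char_grams = ngrams_for("char", (3, 3))        # character trigrams over the whole string
char_wb_grams = ngrams_for("char_wb", (3, 3))  # character trigrams inside word boundaries

print(word_grams)
print(char_grams)
print(char_wb_grams)
```

With 'char', a trigram such as 'e p' spans the two words. With 'char_wb' it does not exist; instead each word is padded with spaces, which produces trigrams such as ' sp' and 'in '.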

In [None]:
import numpy as np 

tfidf_words_3 = TfidfVectorizer(tokenizer=prepareText, max_features=20,  ngram_range=(1,3), analyzer='word', dtype=np.float32)
tfidf_words_2 = TfidfVectorizer(tokenizer=prepareText, max_features=20, ngram_range=(1,2) , analyzer='word', dtype=np.float32)
tfidf_words_1 = TfidfVectorizer(tokenizer=prepareText, max_features=20, ngram_range=(1,1) , analyzer='word', dtype=np.float32)

* **tokenizer**: we apply our tokenization function **prepareText**.
* **max_features**: limits our vocabulary to the 20 most frequent terms of the corpus.
* **ngram_range**: with (1,3) we build n-grams of 1 to 3 words (analyzer = 'word').

In [None]:
texte1 = """ A DNA molecule encoding a polypeptide having at least one immunogenic determinants of the CCV spike protein, said CCV spike protein having an amino acid sequence shown in SEQ ID No. 2, 4, or 6, said polypeptide being capable of eliciting a protective immune response in a dog against CCV infection or disease."""

In [None]:
texte2 = """Dr Saif said that the veterinary community has a long experience with coronaviruses causing severe disease in domestic animals and can therefore provide assistance in the understanding the epidemiology of the disease, development of models, pathogenicity studies, and mechanisms of prevention and control for SARS."""

In [None]:
tfidf_words_3.fit([texte1, texte2])
tfidf_words_2.fit([texte1, texte2])
tfidf_words_1.fit([texte1, texte2])

In [None]:
tfidf_words_3.vocabulary_

{'ccv': 0,
 'ccv spike': 1,
 'ccv spike protein': 2,
 'disease': 3,
 'infection disease': 4,
 'pathogenicity study mechanism': 5,
 'polypeptide': 6,
 'polypeptide capable': 7,
 'polypeptide capable elicit': 8,
 'polypeptide immunogenic': 9,
 'polypeptide immunogenic determinant': 10,
 'prevention': 11,
 'prevention control': 12,
 'prevention control sars': 13,
 'protective': 14,
 'protective immune': 15,
 'protective immune response': 16,
 'protein': 17,
 'spike': 18,
 'spike protein': 19}

In [None]:
print(f'The vocabulary size for our example is {len(tfidf_words_3.vocabulary_)} words')

The vocabulary size for our example is 20 words



1. Our model has built the vocabulary from word unigrams, bigrams and trigrams, for example:

```

 'ccv': 0,
 'ccv spike': 1,
 'ccv spike protein': 2,

```




#### TFIDF applied on Text 1

In [None]:
transformed_text_3 =  tfidf_words_3.transform([texte1])
transformed_text_2 =  tfidf_words_2.transform([texte1])
transformed_text_1 =  tfidf_words_1.transform([texte1])

In [None]:
words_3 = transformed_text_3.toarray()
words_2 = transformed_text_2.toarray()
words_1 = transformed_text_1.toarray()

In [None]:
max_3 = np.argmax(words_3[0], axis=0)
max_2 = np.argmax(words_2[0], axis=0)
max_1 = np.argmax(words_1[0], axis=0)

In [None]:
(max_3, max_2, max_1)

(0, 0, 1)

We get the index of the most important term for each of our three vectorizers.

In [None]:
(tfidf_words_1.get_feature_names()[max_1], tfidf_words_2.get_feature_names()[max_2], tfidf_words_3.get_feature_names()[max_3])

('ccv', 'ccv', 'ccv')

For the vectorizers **ngram_range = (1,1), (1,2) and (1,3)** the most important word in the vocabulary is **CCV**, a word meaning: "a virus of the genus Coronavirus".
In the first text the word ccv is mentioned 3 times, it is indeed the subject of the text. 

#### TFIDF applied on Text 2

In [None]:
transformed_text_3 =  tfidf_words_3.transform([texte2])
transformed_text_2 =  tfidf_words_2.transform([texte2])
transformed_text_1 =  tfidf_words_1.transform([texte2])
words_3 = transformed_text_3.toarray()
words_2 = transformed_text_2.toarray()
words_1 = transformed_text_1.toarray()
max_3 = np.argmax(words_3[0], axis=0)
max_2 = np.argmax(words_2[0], axis=0)
max_1 = np.argmax(words_1[0], axis=0)
(max_3, max_2, max_1)

(3, 2, 2)

In [None]:
(tfidf_words_1.get_feature_names()[max_1], tfidf_words_2.get_feature_names()[max_2], tfidf_words_3.get_feature_names()[max_3])

('disease', 'disease', 'disease')

TF-IDF gives us the word **disease** as the theme of text number 2. This text is about coronaviruses causing severe disease in domestic animals.


**Note: the ngram_range parameter increased the size of our vocabulary. With a range wider than (1,1), i.e. (1, n) with *n > 1*, our TF-IDF also evaluates the importance of these *n-grams* in the texts.**

### Char Vectorizer 

* We configure our TF-IDF with analyzer = 'char'. The vectorizer builds its vocabulary from character n-grams, with the ranges (2,6) and (3,6), using any type of character.

* We limit the vocabulary to the 20 most frequent n-grams.

In [None]:
vect_char_3 = TfidfVectorizer(max_features=20,  lowercase=True, analyzer='char', stop_words= 'english',ngram_range=(3,6),dtype=np.float32)
vect_char_2 = TfidfVectorizer(max_features=20, lowercase=True, analyzer='char', stop_words= 'english',ngram_range=(2,6),dtype=np.float32)

In [None]:
vect_char_3.fit([texte1, texte2])
vect_char_2.fit([texte1, texte2])


The parameter 'stop_words' will not be used since 'analyzer' != 'word'


The parameter 'stop_words' will not be used since 'analyzer' != 'word'



TfidfVectorizer(analyzer='char', binary=False, decode_error='strict',
                dtype=<class 'numpy.float32'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=20,
                min_df=1, ngram_range=(2, 6), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [None]:
transformed_text_char_3 = vect_char_3.transform([texte1])
transformed_text_char_2 = vect_char_2.transform([texte1])

In [None]:
vect_char_3.vocabulary_

{' a ': 0,
 ' an': 1,
 ' in': 2,
 ' of': 3,
 ' of ': 4,
 ' pr': 5,
 ' sa': 6,
 ' th': 7,
 ' the': 8,
 ' the ': 9,
 'e i': 10,
 'g a': 11,
 'he ': 12,
 'id ': 13,
 'in ': 14,
 'ing': 15,
 'ing ': 16,
 'ng ': 17,
 'the': 18,
 'the ': 19}

In [None]:
vect_char_2.vocabulary_

{' a': 0,
 ' c': 1,
 ' d': 2,
 ' i': 3,
 ' o': 4,
 ' p': 5,
 ' s': 6,
 'an': 7,
 'd ': 8,
 'de': 9,
 'e ': 10,
 'g ': 11,
 'id': 12,
 'in': 13,
 'n ': 14,
 'ng': 15,
 'ng ': 16,
 'se': 17,
 'th': 18,
 'ti': 19}

**Vectorizer with ngram_range = (3,6)**

In [None]:
words_3 = transformed_text_char_3.toarray()

In [None]:
vect_char_3.get_feature_names()[np.argmax(words_3[0])]

'g a'

For this vectorizer the most important n-gram is **'g a'**.

**Vectorizer with ngram_range = (2,6)**


In [None]:
words_2 = transformed_text_char_2.toarray()

In [None]:
vect_char_2.get_feature_names()[np.argmax(words_2[0])]

'in'

For our second vectorizer the most important n-gram is **in**.

The char vectorizer builds tokens from groups of characters.

So in both cases we end up with a vocabulary of character n-grams, whose lengths range from 2 to 6 characters for one vectorizer and from 3 to 6 for the other.


**By doing this, we capture important groups of characters that our vectorizer configured with analyzer = 'word' could not have detected (due to the tokenization method used).**

**Thus, by merging the TF-IDF matrix produced by this vectorizer with the one produced by our word-level vectorizer, we add information that our model can then exploit to make better predictions.**
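
As a minimal sketch of this merge, the two matrices can be stacked side by side with `scipy.sparse.hstack`, so that each text is described by its word features followed by its character features. The `toy_corpus` and `new_text` below are hypothetical, and the vectorizers use default cleaning rather than our own tokenizer.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_corpus = [
    "the dog chases the cat",
    "the cat sleeps",
    "a virus infects the dog",
]

word_vect = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_vect = TfidfVectorizer(analyzer="char", ngram_range=(2, 6))

word_matrix = word_vect.fit_transform(toy_corpus)
char_matrix = char_vect.fit_transform(toy_corpus)

# Same rows (texts), columns of both vocabularies placed side by side
merged = sparse.hstack([word_matrix, char_matrix]).tocsr()
print(word_matrix.shape, char_matrix.shape, merged.shape)

# New texts must go through the two already fitted vectorizers, in the same order
new_text = ["the dog sleeps"]
merged_new = sparse.hstack(
    [word_vect.transform(new_text), char_vect.transform(new_text)]
).tocsr()
```

The merged matrix stays sparse, which matters because the character vocabulary is usually much larger than the word vocabulary.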

# **Training**

We have observed the results of these two types of vectorizer. We will now apply the combination of the two vectorizers to our dataset and study the performance of our models.


1. Compare the **tfidf** with **analyzer 'word'** alone vs **tfidf with an analyzer 'word' + tfidf with an analyzer 'char'**.

2. Determine the impact of **ngram_range** on the f1-score.

3. Determine the impact of **max_features** on the f1-score.

### Impact of ngram parameter on model accuracy 

In [None]:
corpus = news.text
target = news.target.to_list()

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(corpus, target, test_size=0.3, random_state=42, stratify=target)

* stratify: ensures that we get the same class proportions in the test set and in the train set.
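
The effect of `stratify` is easy to verify on unbalanced labels. The sketch below uses hypothetical placeholder texts and a 60/30/10 class split, and counts the classes on each side of the split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced dataset: 60% hockey, 30% autos, 10% religion
labels = ["hockey"] * 60 + ["autos"] * 30 + ["religion"] * 10
texts = [f"doc {i}" for i in range(len(labels))]

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```

Both subsets keep the 60/30/10 proportions. Without `stratify`, a rare class could end up under-represented in the test set purely by chance.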

**The impact of NGRAM on performance**

In this section we will study the impact of ngram_range on model performance. The objective is to determine with which value of ngram_range we obtain the best f1-score.

We will test the TF-IDF alone and the TF-IDF coupled to the tf-idf-char with the 3 values of ngram_range below : 

* (1,1)
* (1,2)
* (1,3)


We will use a LinearSVC as our model. It provides very good results on text classification problems.

**Note: our goal is not to study the SVC, so we won't do any feature engineering to improve our machine learning model.**

In [None]:
import tqdm
from scipy import sparse
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.model_selection import cross_val_score

In [None]:
vect_char = TfidfVectorizer(analyzer='char', stop_words='english',  max_features=40000, ngram_range=(2,6), lowercase=True, dtype=np.float32)

In [None]:
vect_corpus_char = vect_char.fit_transform(x_train)

In [None]:
vect_char_vocab_size = len(vect_char.get_feature_names())

In [None]:
ngram_range = [(1,1), (1, 2), (1, 3)]
key = ['(1,1)', '(1,2)', '(1,3)']
vocab_size = []
vector_words_dico = {}
for ngram in tqdm.tqdm_notebook(ngram_range):
  vect_word = TfidfVectorizer(analyzer='word', tokenizer=prepareText, ngram_range=ngram, dtype=np.float32)
  vect_corpus_word = vect_word.fit_transform(x_train)
  vector_words_dico[str(ngram)] = vect_corpus_word
  vect_corpus_wc = sparse.hstack([vect_corpus_word , vect_corpus_char])
  vector_words_dico[str(ngram)+'+tfidf_char'] = (vect_corpus_wc)
  vect_word_vocab_size = len(vect_word.vocabulary_)
  vocab_size.append(vect_word_vocab_size)
  vocab_size.append(vect_char_vocab_size + vect_word_vocab_size)


This function will be removed in tqdm==5.0.0
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`



HBox(children=(FloatProgress(value=0.0, max=3.0), HTML(value='')))




In [None]:
keys = []
scores = [] 
for key in vector_words_dico:
  svc =  LinearSVC()
  score = cross_val_score(svc, vector_words_dico[key], y_train,scoring='accuracy')
  keys.append(key)
  scores.append(np.mean(score))


In [None]:
scores_df = pd.DataFrame({'config' : keys, 'score' : scores, 'vocab_size' : vocab_size})

In [None]:
scores_df.head()

| | config | score | vocab_size |
|---|---|---|---|
| 0 | (1, 1) | 0.92749 | 12781 |
| 1 | (1, 1)+tfidf_char | 0.933068 | 52781 |
| 2 | (1, 2) | 0.928287 | 92307 |
| 3 | (1, 2)+tfidf_char | 0.930677 | 132307 |
| 4 | (1, 3) | 0.926693 | 181964 |


In [None]:
px.scatter(scores_df, x='config',y='score', color='config', size='vocab_size', labels={'config' : "TF-IDF configuration"}, title='Influence of ngram_range on model accuracy')

With the word TF-IDF alone, the accuracy barely changes between ngram_range (1,1) and (1,2) (92.7% vs 92.8%), and (1,3) gives the lowest score (92.7%, slightly below (1,1)). Increasing the ngram value increases the size of our vocabulary, and we can see that this increase does not improve the score of our SVC trained with the default parameters.

On the other hand, in the configurations shown, coupling the two TF-IDF provides better results than the word TF-IDF alone.

#### **Conclusion**

* With the word TF-IDF alone, we get the best performance with the (1,2) ngram_range.

* Coupling the two TF-IDF increases the accuracy by 0.2 to 0.6 points in the configurations shown above.

The **ngram_range** parameter enriches the vocabulary by creating new tokens. But beyond (1,2) this increase in vocabulary size reduces the model accuracy. We hypothesize that our model overfits when the number of features grows.

Pros : 
  - Enriches the vocabulary

Cons : 
  - The model loses accuracy when the vocabulary size increases.


---


The TF-IDF **"coupling technique"** improves the model recall and accuracy.

Pros : 
  - Better recall and accuracy

Cons : 
  - Increases the training time
  - The explanations produced by the LIME and SHAP text analyzers become unreadable.


In order to improve the performance of our algorithm with n-grams we need to do **feature engineering** to adapt the model to the vocabulary size.


## Max_feature impact 


We will study in detail the impact of this parameter on performance. It limits the vocabulary to the **max_features** terms with the highest term frequency across the corpus.
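
According to the scikit-learn parameter description quoted earlier, the terms kept by max_features are ranked by their term frequency across the corpus, not by their TF-IDF score. The sketch below illustrates this on a hypothetical three-text corpus where "dog" appears in every text, and so has the lowest IDF, but is still kept because it is frequent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus. Corpus-wide counts: virus=4, dog=3, cat=1, bird=1
toy_corpus = ["virus virus virus dog", "virus dog cat", "dog bird"]

full_vect = TfidfVectorizer().fit(toy_corpus)
limited_vect = TfidfVectorizer(max_features=2).fit(toy_corpus)

full_vocabulary = sorted(full_vect.vocabulary_)
limited_vocabulary = sorted(limited_vect.vocabulary_)

print(full_vocabulary)
print(limited_vocabulary)
```

The rare terms "cat" and "bird" are dropped, although rare terms are precisely the ones that receive a high IDF.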

In [None]:
vect_word_max_feature = TfidfVectorizer(analyzer='word', tokenizer=prepareText, ngram_range=(1,3), dtype=np.float32)

In [None]:
vocab_size_prop = [0.01, 0.03, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

This list contains the different proportions of the vocabulary size that we will test. We will build and train a TF-IDF with each of those sizes.

In [None]:
vect_word_max_feature.fit(x_train)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'



TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float32'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 3), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function prepareText at 0x7f77fdf88158>,
                use_idf=True, vocabulary=None)

In [None]:
vocab_size = len(vect_word_max_feature.vocabulary_)

In [None]:
import math

In [None]:
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

#### **Cross validation**

Let's validate our results with cross-validation
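
One point to keep in mind: when the vectorizer is fitted on the whole training set before cross-validation, the vocabulary and IDF weights also see the validation folds. A possible way to avoid this, sketched below under the assumption of a hypothetical two-class toy dataset, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vectorizer is refitted on each training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 10 short texts per class
toy_texts = [f"hockey puck ice goal game {i}" for i in range(10)] + \
            [f"car engine wheel road drive {i}" for i in range(10)]
toy_labels = [0] * 10 + [1] * 10

# The vectorizer is part of the model, so it is fitted inside each fold
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=20),
    LinearSVC(),
)

toy_scores = cross_val_score(model, toy_texts, toy_labels, scoring="accuracy", cv=5)
print(toy_scores.mean())
```

This is a sketch of the pattern, not the procedure used for the results reported in this notebook.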

In [None]:
from datetime import datetime

from sklearn.linear_model import LogisticRegression

In [None]:
%%time 
import numpy as np

svc = LinearSVC()
report_max_feature = []
report_f1_score = []
report_time = []
report_vect_shape = []

for prop in vocab_size_prop:
  max_feature = math.ceil(vocab_size * prop)
  vect = TfidfVectorizer(analyzer='word', tokenizer=prepareText, max_features=max_feature, ngram_range=(1,3), dtype=np.float32)
  start = datetime.now()
  vect_corpus = vect.fit_transform(x_train)
  scores = cross_val_score(svc, vect_corpus, y_train,scoring='accuracy')
  end = datetime.now()
  report_time.append(end-start)
  report_vect_shape.append(str(vect_corpus.shape))
  report_f1_score.append(np.mean(scores))
  report_max_feature.append(max_feature)

CPU times: user 10min 57s, sys: 4.61 s, total: 11min 2s
Wall time: 11min 2s


In [None]:
report_svc = {'max feature' : report_max_feature, 'proportion' : vocab_size_prop, 'duration' : report_time, 'vector shape' : report_vect_shape,'f1-score' : report_f1_score}

In [None]:
df_svc = pd.DataFrame(report_svc)

In [None]:
df_svc

| | max feature | proportion | duration | vector shape | f1-score |
|---|---|---|---|---|---|
| 0 | 1820 | 0.01 | 00:00:51.953483 | (1255, 1820) | 0.919522 |
| 1 | 5459 | 0.03 | 00:00:51.574035 | (1255, 5459) | 0.933865 |
| 2 | 9099 | 0.05 | 00:00:51.235053 | (1255, 9099) | 0.930677 |
| 3 | 14558 | 0.08 | 00:00:51.164784 | (1255, 14558) | 0.931474 |
| 4 | 18197 | 0.1 | 00:00:50.982480 | (1255, 18197) | 0.933068 |
| 5 | 36393 | 0.2 | 00:00:50.929798 | (1255, 36393) | 0.931474 |
| 6 | 54590 | 0.3 | 00:00:51.139359 | (1255, 54590) | 0.930677 |
| 7 | 72786 | 0.4 | 00:00:50.962535 | (1255, 72786) | 0.930677 |
| 8 | 90982 | 0.5 | 00:00:50.331637 | (1255, 90982) | 0.92988 |
| 9 | 109179 | 0.6 | 00:00:50.560483 | (1255, 109179) | 0.92749 |


In [None]:
px.scatter(df_svc,x='max feature',y='f1-score', color='proportion', size='proportion', labels={'max feature' : "Vocabulary Size"}, title='Influence of vocabulary size on accuracy')

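With 1% of the vocabulary, or 1,820 terms, we get a score of 91.9%. By adding 2% of the "top words" to our vocabulary (about 3,600 terms) we gain 1.4 points of accuracy. Beyond that, increasing the size of the vocabulary does not improve the performance of our model; on the contrary, we observe a slight drop in performance. Note that the column labelled f1-score contains the mean cross-validated accuracy, since cross_val_score is called with scoring='accuracy'.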

## **Conclusion max_feature**

Using the max_features parameter has the following pros and cons.

Pros:

* A gain in accuracy (the ideal value remains to be determined).
* A smaller vocabulary, and therefore less data for the model to learn from.
* Removal of non-essential data during training.
* Potential savings in training time (to be studied in detail: the durations measured above stay around 51 s for every setting).

Cons:

* The right value has to be searched for. I think it depends on the data, but a value between 5K and 25K words seems to be a good starting point.

The max_features parameter limits our vocabulary to the top N terms. In scikit-learn, these N terms are the ones with the highest term frequency across the corpus.

It is therefore possible to obtain very good predictions while using only a small percentage of our vocabulary, restricted to these top terms.
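
As a minimal sketch of how the vocabulary cut works, the toy corpus below (made up for illustration, not taken from the datasets used in this notebook) shows which terms scikit-learn keeps when the vocabulary is capped at two terms. The kept terms are the ones with the highest total count across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "apple" appears 4 times, "banana" 2 times, "cherry" once
toy_docs = ["apple banana apple", "apple cherry", "banana apple"]

# Cap the vocabulary at the 2 most frequent terms of the corpus
toy_vect = TfidfVectorizer(max_features=2)
toy_matrix = toy_vect.fit_transform(toy_docs)

kept_terms = sorted(toy_vect.vocabulary_)
print(kept_terms)        # terms that survived the cut
print(toy_matrix.shape)  # (number of documents, number of kept terms)
```

The rarest term, "cherry", is the one that gets dropped, and the matrix has exactly two columns.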

---

Dropping the lower-ranked terms improves the performance of our model (LinearSVC) and can reduce its training time.


**Note: When we tested the impact of ngram_range on the performance of our model, we noticed that we lost accuracy as we widened the range. With the optimal max_features (3% of the vocabulary), ngram_range=(1,3) gives a better score than before: 93.3% vs 92.6%.**


## **Validation**

We want to validate the assumptions made in the previous part on another dataset: the FakeNews dataset, downloaded from Kaggle.

Our assumptions are:

* **Limiting the size of the vocabulary with max_features improves the performance of the model by keeping only the most important features. This lets us use a wide ngram_range, which adds information to our dataset.**

* **A good max_features value lies between 5K and 25K words. We are going to use 20K words.**

* **Coupling two tfidf, one configured with analyzer='word' and the other with analyzer='char', increases the f1-score.**
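
To make the third assumption concrete, here is a minimal sketch of what "coupling" means in practice: the word-level and char-level matrices are stacked column-wise, so every document keeps one row but gains the features of both vectorizers. The three sentences and the `toy_` names are made up for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ["breaking news today", "fake news spreads", "real news matters"]

toy_word = TfidfVectorizer(analyzer='word')
toy_char = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))

word_matrix = toy_word.fit_transform(toy_docs)
char_matrix = toy_char.fit_transform(toy_docs)

# Stack the columns side by side: same rows (documents), more features
coupled = sparse.hstack([word_matrix, char_matrix]).tocsr()
print(word_matrix.shape, char_matrix.shape, coupled.shape)
```

The number of columns of the coupled matrix is the sum of the two vocabulary sizes.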



###  Dataset 

We will use the Fake News dataset to validate our arguments. It is a corpus of texts consisting of **"fake news"** and real press articles. 

**Note: You must first upload the FakeNews dataset (train.csv) to your Colab space.**

In [None]:
import tqdm
import numpy as np
import pandas as pd
import plotly.express as px
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

In [None]:
train_df = pd.read_csv('/content/train.csv')

### Data preparation

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


For rows where the author or the title is missing, we replace the **NaN** value with the placeholder **"unknow"**.

In [None]:
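# Assign the result back: calling fillna(inplace=True) on a column accessor is unreliable in recent pandas
train_df['author'] = train_df['author'].fillna("unknow")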

In [None]:
train_df['title'] = train_df['title'].fillna("unknow")

We then drop the rows that still contain missing values, i.e. the rows without text.

In [None]:
train_df.dropna(inplace=True)
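
The effect of these two steps can be checked on a small example. This is a sketch on a made-up three-row DataFrame (`toy_df` is hypothetical), using the same column names as the FakeNews file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing title, one missing author, one missing text
toy_df = pd.DataFrame({
    'title':  ['A title', np.nan, 'Another title'],
    'author': [np.nan, 'Jane', 'John'],
    'text':   ['some text', 'more text', np.nan],
})

# Fill the optional columns with a placeholder instead of losing the rows
toy_df['title'] = toy_df['title'].fillna('unknow')
toy_df['author'] = toy_df['author'].fillna('unknow')

# Only the row without text still contains a NaN, so only that row is dropped
toy_df = toy_df.dropna()
print(toy_df)
```

Filling first and dropping second keeps every article that has a body, which is why the real dataset ends up with as many rows as there are non-null `text` values.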

In [None]:
x_train = train_df.text.to_list()

In [None]:
titles = train_df.title.to_list()

In [None]:
target = train_df.label.to_list()

In [None]:
len(x_train)

20761

### TF-IDF 

In order to see the improvements brought by the tf-idf configuration, we will train our SVC with:

* the default tfidf
* tfidf ngram_range=(1,3) and max_features=20,000
* tfidf ngram_range=(1,3) and max_features=80% of the vocabulary

Note: for each configuration we also compute the f1-score of the word tfidf coupled with a tfidf using analyzer='char'.

In [None]:
vector_words_dico = {}

In [None]:
# stop_words only applies when analyzer='word', so it is omitted for the char vectorizer
vect_char = TfidfVectorizer(analyzer='char', max_features=40000, ngram_range=(2,6), lowercase=True, dtype=np.float32)

In [None]:
vect_word = TfidfVectorizer(analyzer='word', tokenizer=prepareText, ngram_range=(1,3), dtype=np.float32)

In [None]:
vect_corpus_word = vect_word.fit_transform(x_train)

In [None]:
vect_corpus_char = vect_char.fit_transform(x_train)

In [None]:
vect_char_vocab_size = len(vect_char.vocabulary_)
vect_word_vocab_size = len(vect_word.vocabulary_)

In [None]:
vocab_size = [vect_word_vocab_size, vect_word_vocab_size + 40000, 20000, 20000 + 40000, vect_word_vocab_size * 0.8, vect_word_vocab_size * 0.8 + 40000]

In [None]:
vocab_size

[10159379, 10199379, 20000, 60000, 8127503.2, 8167503.2]

In [None]:
config = [((1,1), None), ((1,3), 20000), ((1,3), int(vect_word_vocab_size * 0.8))]
keys = ['defaut', 'ngram (1,3) + mf=20000 words', 'ngram (1,3) + mf=80%ofwords']
for key, (ngram, max_feat) in zip(keys, config):
  # Fit a word-level tfidf for this configuration
  vect_word = TfidfVectorizer(analyzer='word', tokenizer=prepareText, ngram_range=ngram, max_features=max_feat, dtype=np.float32)
  vect_corpus_word = vect_word.fit_transform(x_train)
  # Couple it with the char-level matrix by stacking the columns side by side
  vect_corpus_wc = sparse.hstack([vect_corpus_word, vect_corpus_char])
  vector_words_dico[key] = vect_corpus_word
  vector_words_dico[key + ' coupled with char'] = vect_corpus_wc

Add size of the coupled tfidf.

In [None]:
keys = []
accuracy_scores = [] 
recall_scores = []
for key in vector_words_dico:
  svc =  LinearSVC()
  accuracy = cross_val_score(svc, vector_words_dico[key], target ,scoring='accuracy')
  recall = cross_val_score(svc, vector_words_dico[key], target ,scoring='recall')
  keys.append(key)
  accuracy_scores.append(np.mean(accuracy))
  recall_scores.append(np.mean(recall))

In [None]:
f1_score = 2 * (np.array(recall_scores) * np.array(accuracy_scores)) / (np.array(recall_scores) + np.array(accuracy_scores))
f1_score

array([0.95850486, 0.99091738, 0.96469957, 0.99142275, 0.96432462,
       0.9925307 ])

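We compute an F-score for each model. Note that the value above is the harmonic mean of the cross-validated **accuracy** and **recall**. The standard F1-score is the harmonic mean of **precision** and **recall**, so the figures reported here are a proxy for it rather than the exact F1.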
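
For reference, here is a minimal sketch of the standard F1 computation on made-up labels (`y_true` and `y_pred` are hypothetical, not results from this notebook). It compares the value given by `sklearn.metrics.f1_score` with the harmonic mean of accuracy and recall.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 3 true positives, 1 false negative, 2 false positives, 4 true negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

# Standard F1: harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

# Harmonic mean of accuracy and recall: a different quantity
acc_recall_mean = 2 * accuracy * recall / (accuracy + recall)

print(f1_manual, f1_sklearn, acc_recall_mean)
```

With `cross_val_score`, passing `scoring='f1'` returns the standard F1 per fold for a binary target such as `label`.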

In [None]:
scores_df = pd.DataFrame({'config' : keys, 'accuracy' : accuracy_scores, 'recall' : recall_scores , 'vocab_size' : vocab_size, 'f1-score' : f1_score})
scores_df.head(6)

| | config | accuracy | recall | vocab_size | f1-score |
|---|---|---|---|---|---|
| 0 | defaut | 0.955301 | 0.96173 | 10159379.0 | 0.958505 |
| 1 | defaut coupled with char | 0.990126 | 0.99171 | 10199379.0 | 0.990917 |
| 2 | ngram (1,3) + mf=15000 words | 0.961996 | 0.967418 | 20000.0 | 0.9647 |
| 3 | ngram (1,3) + mf=15000 words coupled with char | 0.990367 | 0.992481 | 60000.0 | 0.991423 |
| 4 | ngram (1,3) + mf=80%ofwords | 0.957661 | 0.971082 | 8127503.2 | 0.964325 |
| 5 | ngram (1,3) + mf=80%ofwords coupled with char | 0.99133 | 0.993734 | 8167503.2 | 0.992531 |


In [None]:
fig = px.scatter(scores_df, x='config',y='f1-score', size='recall', color='config', labels={'config' : "TF-IDF configuration"}, title='F1-score on FakeNews dataset', hover_data=['recall', 'accuracy'], width=700, height=700,)
fig.update_layout(showlegend=False)

### Interpretation of results

To keep the discussion short, we use the following abbreviations:

* tfidf(default_params) = "default"
* tfidf(max_features=20k, ngram=(1,3)) = "tfidf20"
* tfidf(max_features=80% vocab, ngram=(1,3)) = "tfidf80"


* **With default parameters our f1-score is already very high: 95.8%. It rises by about 3 points (to 99.1%) when the model is coupled with tfidf-char.**

* **Models with ngram_range=(1,3) score better than the default ngram_range=(1,1). Both recall and accuracy improve, although modestly here (less than 1 point). This supports our hypothesis that a wider ngram_range improves performance when combined with max_features.**

* **Without the char tfidf, tfidf20, the model with the lowest max_features, obtains the best f1-score (96.47% vs 96.43% for tfidf80), although its recall is about 0.4 points lower. Once coupled with the char tfidf, tfidf80 comes out slightly ahead (99.25% vs 99.14%).**

* **Coupling the tfidf(analyzer=word) with the tfidf(analyzer=char) increases the f1-score by roughly 3 points in every configuration. The information added by the char tfidf improves both the accuracy and the recall of our model.**

It is therefore clear that these two parameters, max_features and ngram_range, improve the performance of our model. max_features reduces the complexity of the data the model trains on by limiting the vocabulary to the top-ranked terms, so the model bases its predictions on these terms alone. ngram_range, meanwhile, adds information by highlighting groups of tokens that act as themes.
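
To illustrate what ngram_range adds, the sketch below lists every feature extracted from a single made-up sentence with ngram_range=(1,3). The sentence and the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical one-sentence corpus
toy_doc = ["fake news spreads fast"]

toy_vect = TfidfVectorizer(analyzer='word', ngram_range=(1, 3))
toy_vect.fit(toy_doc)

# Sort by n-gram length, then alphabetically, to group unigrams, bigrams and trigrams
ngrams = sorted(toy_vect.vocabulary_, key=lambda term: (len(term.split()), term))
for term in ngrams:
    print(term)
```

Four words produce four unigrams, three bigrams and two trigrams. This also shows why the vocabulary grows so quickly on a real corpus, and why capping it becomes useful.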

---

Note: *The min_df and max_df parameters can also be tuned. They discard terms that appear in fewer than **min_df** documents or in more than **max_df** documents. Each can be given as an absolute document count or as a proportion of the documents.*
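
A minimal sketch of this filtering on a made-up four-document corpus (the documents and thresholds are chosen for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "the" is in all 4 documents, "cat" and "sat" in 2, the rest in 1
toy_docs = ["the cat sat", "the cat ran", "the dog sat", "the bird flew"]

# min_df=2: drop terms present in fewer than 2 documents
# max_df=0.75: drop terms present in more than 75% of the documents
toy_vect = TfidfVectorizer(min_df=2, max_df=0.75)
toy_vect.fit(toy_docs)

kept_terms = sorted(toy_vect.vocabulary_)
print(kept_terms)
```

Only the terms whose document frequency lies between the two thresholds remain: the ubiquitous "the" and the one-off words are removed.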


## Test best model and submit 

We will submit our predictions to Kaggle to see the score we get with our best model.

In [None]:
from sklearn.pipeline import FeatureUnion, make_pipeline
tfidf_word_svc = TfidfVectorizer(tokenizer=prepareText, analyzer='word', max_features=20000, ngram_range=(1,3))
# stop_words only applies when analyzer='word', so it is omitted for the char vectorizer
tfidf_char_svc = TfidfVectorizer(analyzer='char', max_features=40000, ngram_range=(2,6))
tfidf_svc = FeatureUnion([('word', tfidf_word_svc), ('char', tfidf_char_svc)])

In [None]:
svc_pip = make_pipeline(tfidf_svc, svc)
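
To show how such a pipeline behaves end to end, here is a self-contained sketch on a made-up four-document corpus. The texts, labels and `toy_` names are hypothetical, and the default tokenizer is used instead of `prepareText`. `FeatureUnion` fits both vectorizers on the raw text and concatenates their outputs, so the classifier receives the coupled matrix directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = fake, 0 = real
toy_texts = ["shocking secret cure revealed", "miracle pill doctors hate",
             "parliament votes on budget", "central bank holds rates"]
toy_labels = [1, 1, 0, 0]

toy_union = FeatureUnion([
    ('word', TfidfVectorizer(analyzer='word', ngram_range=(1, 3))),
    ('char', TfidfVectorizer(analyzer='char', ngram_range=(2, 6))),
])
toy_pipe = make_pipeline(toy_union, LinearSVC())

# One call vectorizes the raw text with both tfidf and trains the classifier
toy_pipe.fit(toy_texts, toy_labels)

toy_features = toy_pipe[0].transform(toy_texts)  # coupled word + char matrix
toy_predictions = toy_pipe.predict(toy_texts)
print(toy_features.shape, toy_predictions)
```

Wrapping the vectorizers in the pipeline means new text can be passed to `predict` without repeating the vectorization steps by hand.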

In [None]:
%%time
svc_pip.fit(x_train, target)

In [None]:
predictions_svc = svc_pip.predict(test_corpus)

### Submit

In [None]:
submit_svc = pd.DataFrame({'id': test_df.id, 'label' : predictions_svc})

In [None]:
submit_svc.to_csv('submission.csv', index=False)

In [None]:
!kaggle competitions submit -c fake-news -f submission.csv -m "SVC with default config and TFIDF"

![Alt text…](https://i.ibb.co/qnp0y0q/Capture-d-cran-2020-05-16-08-05-59.png)

![Alt text…](https://i.ibb.co/YRjWg5S/kaggle-result.png)

![Alt text…](https://i.ibb.co/WHVyk53/Capture-d-cran-2020-05-16-08-08-40.png)

We get a private score of 99.2% and a public score of 99.1%.

Note: Using tfidf with the char analyzer can make the interpretation of predictions less readable.

See: https://colab.research.google.com/drive/14-sxbLTVi3MG-xFeywv-zpcmFc7QxkkN#scrollTo=3_ylaFCDwgLn