<h6><center>Introduction to Data Sciences</center></h6>


## Lab 4: Text Mining

Text mining is the process of automatically extracting "high-quality" information from text. High-quality information is typically derived by transforming the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.

Text mining identifies facts, relationships and assertions that would otherwise remain buried in the mass of textual big data. Once extracted, this information is converted into a structured form that can be further analyzed, or presented directly using clustered HTML tables, mind maps, charts, etc. Text mining employs a variety of methodologies to process the text, one of the most important of these being Natural Language Processing (NLP).

Typical text mining applications include:
- Text classification
    - e.g. spam email detection
- Text clustering
    - e.g. document retrieval, topic extraction
- Sentiment analysis
    - e.g. detecting emotions such as "angry", "sad", and "happy" in a customer message
- Named entity recognition, etc.
    - e.g. detecting person names, organizations, locations in texts

## In this lab we will learn:
1. Preprocessing: textual normalization, simple tokenization, stopword removal
2. Converting documents into feature sets: Tf-Idf Vectorizer

---
## Text Normalization and Preprocessing

Text normalization is the process of transforming text into a single canonical form.

#### Progress bar

In [3]:
def print_progress_bar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█', printEnd = "\r"):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
        printEnd    - Optional  : end character (e.g. "\r", "\r\n") (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print(f'\r{prefix} |{bar}| {percent}% {suffix}', end = printEnd)
    # Print New Line on Complete
    if iteration == total:
        print()

### Lower case
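For most text mining tasks, letter case carries little information: `Chat` and `chat` should count as the same word. Lowercasing the text merges such variants.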

In [4]:
# An example string:
raw_1 = "La pie niche haut. L’oie niche bas. Où l’hibou niche-t-il ?L’hibou niche ni haut ni bas. L'hibou niche pas."

# Write code to lower case the string
s = raw_1.lower()

print(s)

la pie niche haut. l’oie niche bas. où l’hibou niche-t-il ?l’hibou niche ni haut ni bas. l'hibou niche pas.


### Handling Accented Characters

Diacritics or accents on characters in English have a fairly marginal status, and we might well want `cliché` and `cliche` to match, or `naive` and `naïve`. This can be done by normalizing tokens to remove diacritics. In many other languages, diacritics are a regular part of the writing system and distinguish different sounds. Occasionally words are distinguished only by their accents. For instance, in French, `pêche` means `fishing` while `péché` means `sin`.

Nevertheless, the important question is usually not prescriptive or linguistic but is a question of how users are likely to write queries for these words. In many cases, users will enter queries for words without diacritics, whether for reasons of speed, laziness, limited software, or habits born of the days when it was hard to use non-ASCII text on many computer systems. In these cases, it might be best to equate all words to a form without diacritics.

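We will simply list common French accented characters and use `str.replace` to substitute each one with its unaccented base letter. The list below is not exhaustive (for instance, `à` and uppercase accented letters are left untouched), so it works best on text that has already been lowercased.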

In [5]:
def normalize_accent(string):
    string = string.replace('á', 'a')
    string = string.replace('â', 'a')

    string = string.replace('é', 'e')
    string = string.replace('è', 'e')
    string = string.replace('ê', 'e')
    string = string.replace('ë', 'e')

    string = string.replace('î', 'i')
    string = string.replace('ï', 'i')

    string = string.replace('ö', 'o')
    string = string.replace('ô', 'o')
    string = string.replace('ò', 'o')
    string = string.replace('ó', 'o')

    string = string.replace('ù', 'u')
    string = string.replace('û', 'u')
    string = string.replace('ü', 'u')

    string = string.replace('ç', 'c')

    return string

In [6]:
print(normalize_accent(raw_1))

La pie niche haut. L’oie niche bas. Ou l’hibou niche-t-il ?L’hibou niche ni haut ni bas. L'hibou niche pas.
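Listing every accented character by hand is easy to get wrong. As an alternative sketch (not the method used in this lab), Python's standard `unicodedata` module can decompose each character into a base letter plus combining marks, which can then be dropped. The helper name `strip_accents` is ours, chosen for illustration.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    # NFD splits 'é' into 'e' followed by a combining acute accent (category 'Mn').
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("Où l’hibou niche-t-il ? À Noël, ça dépend."))
```

This also covers uppercase letters and accents missing from a hand-written list. Ligatures such as `œ` have no canonical decomposition and are left unchanged.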


### Tokenization: spaCy
[spaCy](https://spacy.io/usage) is a platform to work with natural language data using Python.

We will work with French data, so **you need to install the proper language package for French**. The complete installation instructions are [available in this link](https://spacy.io/usage).



As usual, we will first convert everything to lowercase and normalize accents.

In [7]:
raw_2 = "Latte ôtée, mur gâté, trou s’y fit, rat s’y mit, chat l’y vit, chat l’y prit."

# Write code here to convert everything to lower case and to normalize accents.
# Lowercase first: normalize_accent only handles lowercase accented characters.
s = raw_2.lower()
s = normalize_accent(s)

print(s)

latte otee, mur gate, trou s’y fit, rat s’y mit, chat l’y vit, chat l’y prit.


`spaCy` already provides us with modules to easily tokenize the text.

For that, first spaCy needs to be loaded with the required language. Then we can use the tokenizer.

In [6]:
!python3 -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [7]:
import spacy

# Load spaCy for french
spacy_nlp = spacy.load("fr_core_news_sm")

[Here is a tutorial](https://towardsdatascience.com/a-short-introduction-to-nlp-in-python-with-spacy-d0aa819af3ad) for basic usages of spaCy.

Running spaCy on a string returns a `Doc` object, which is a sequence of `Token` objects. We only need the string form of each token (its orthographic form, i.e. the verbatim text). We will use `token.orth_` to get it.

In [8]:
# Tokenize
spacy_tokens = spacy_nlp(s)
string_tokens = [token.orth_ for token in spacy_tokens]

print(string_tokens)

['latte', 'otee', ',', 'mur', 'gate', ',', 'trou', 's’', 'y', 'fit', ',', 'rat', 's’', 'y', 'mit', ',', 'chat', 'l’', 'y', 'vit', ',', 'chat', 'l’', 'y', 'prit', '.']


### Handling Punctuation

Punctuation is mostly noise when processing text. We are more interested in the words themselves.

spaCy recognizes punctuation and splits punctuation tokens from word tokens. We can use the `token.is_punct` attribute to identify a punctuation token.

In [9]:
# Remove punctuation tokens
string_tokens = [token.orth_ for token in spacy_tokens if not token.is_punct]

print(string_tokens)

['latte', 'otee', 'mur', 'gate', 'trou', 's’', 'y', 'fit', 'rat', 's’', 'y', 'mit', 'chat', 'l’', 'y', 'vit', 'chat', 'l’', 'y', 'prit']


### Stop word filtering

Stop words are words that are filtered out before or after processing natural language data (text). There is no universal stop word list. Such lists often include short function words, such as "the", "is", "at", "which", "on" in English and "le", "la", "et", "à", "qui" in French. Removing stop words has been shown to improve performance on tasks such as search.

Conveniently, spaCy can also tell whether a given token is a stop word. Its stop word detection depends on the language model we loaded, so in our case it will detect only French stop words.

We can use the `token.is_stop` attribute to identify a stop word.

In [10]:
# Remove stop words
string_tokens = [token.orth_ for token in spacy_tokens if not token.is_punct and not token.is_stop]

print(string_tokens)

['latte', 'otee', 'mur', 'gate', 'trou', 'fit', 'rat', 'mit', 'chat', 'vit', 'chat', 'prit']


### Lastly, recombining tokens into a string

Many applications expect a string as input, not a list of tokens. So we merge the tokens back into a single string. This gives us the **clean** string that results from preprocessing the raw string.

In [11]:
# Combining list of tokens into a single string
clean_2 = " ".join(string_tokens)

print(clean_2)

latte otee mur gate trou fit rat mit chat vit chat prit
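To see what the steps above amount to without any NLP library, here is a rough standard-library sketch. It swaps spaCy's tokenizer for a regular expression and uses a three-word stop list, both of which are simplifying assumptions. `simple_clean` and `TINY_STOP_WORDS` are hypothetical names that do not appear elsewhere in this lab.

```python
import re

# Hypothetical, deliberately tiny stop list; spaCy's French list is much longer.
TINY_STOP_WORDS = {"s", "l", "y"}


def simple_clean(text, stop_words=TINY_STOP_WORDS):
    """Tokenize on word characters, drop stop words, and rejoin with spaces."""
    # \w+ keeps runs of letters and digits, so punctuation and apostrophes vanish.
    tokens = re.findall(r"\w+", text.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(kept)


sample = "latte otee, mur gate, trou s’y fit, rat s’y mit, chat l’y vit, chat l’y prit."
print(simple_clean(sample))
```

On this sentence the sketch yields the same clean string as the spaCy pipeline. On real text the two will differ, because spaCy handles elisions, hyphens and a full stop word list.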


## Let's combine everything: write a function
Using the steps above, we will now write a function called **raw_to_tokens**. It takes a raw text string and returns the cleaned text as a single string of space-separated tokens. We also pass in the spaCy object for French that we loaded earlier, which avoids reloading the model on every call. The function performs these steps:
1. lower case
2. normalize accents (use the function we created before)
3. tokenize
4. remove punctuation tokens and stop words
5. join the tokens back into a single string

In [28]:
def raw_to_tokens(raw_string, spacy_nlp):

    # Write code for lower-casing
    raw_string = raw_string.lower()

    # Write code to normalize the accents
    raw_string = normalize_accent(raw_string)

    # Write code to tokenize
    spacy_tokens = spacy_nlp(raw_string)

    # Write code to remove punctuation and stop words tokens and create string tokens
    string_tokens =  [tokens.orth_ for tokens in spacy_tokens if not tokens.is_punct if not tokens.is_stop]

    # Write code to join the tokens back into a single string
    clean_string = " ".join(string_tokens)

    return clean_string

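Before testing the function on real data, here is a short aside on NLTK, another Python NLP library. Its `word_tokenize` and `sent_tokenize` functions split a text into words and sentences, shown below on a short English example.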

In [13]:
import nltk

# Download the punkt tokenizer models
nltk.download('punkt')

# Now you can use the word_tokenize and sent_tokenize functions
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Hello world. This is a test sentence."

# Tokenizing into words
words = word_tokenize(text)
print(words)

# Tokenizing into sentences
sentences = sent_tokenize(text)
print(sentences)


['Hello', 'world', '.', 'This', 'is', 'a', 'test', 'sentence', '.']
['Hello world.', 'This is a test sentence.']


[nltk_data] Downloading package punkt to /Users/welto/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Something bigger
We will use the [**EU Parliament Statements**](https://www.statmt.org/europarl/index.html) dataset. You have been supplied with a small extract of French statements for this lab. Here is an example:
>Les critères de choix et les activités subventionnées dans le cadre de Leader atténuent, dans le meilleur des cas, une partie des problèmes de l'espace rural d' importance secondaire, tandis que dans le pire des cas, dégénèrent en un affaiblissement des relations publiques et une corruption des consciences.

In this hands-on we will use 10,000 French documents extracted from the English-French bilingual dataset.

The supplied file **corpus.txt** contains these 10,000 documents, one per line.

Now we will:
   1. Load the documents as a list
   2. Pre-process the documents
   

In [55]:
"""
Write code to load documents as a list

Hint 1: open the file using open()
Hint 2: use read() to load the content
Hint 3: use splitlines() to get separate documents

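This will give us a list of strings; each string is a document.
"""

# A with-block closes the file automatically once the content is read.
# encoding="utf-8" is an assumption about how corpus.txt was saved.
with open("/Users/welto/Library/CloudStorage/OneDrive-CentraleSupelec/2A/CASA/Data Science/TD:TP/corpus.txt", encoding="utf-8") as file:
    docs_raw = file.read().splitlines()

print("Loaded " + str(len(docs_raw)) + " documents.")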

Loaded 10000 documents.


In [56]:
print(docs_raw[1])

Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.


In [39]:
"""
Write code to create a list of pre-processed documents
Hint: use raw_to_tokens function in a loop for each document

Note: this might take a few minutes.
"""
docs_clean = []

n_docs = len(docs_raw)
print_progress_bar(0, n_docs, prefix='Progress:', suffix='Complete', length=50)

for k, doc in enumerate(docs_raw):
    docs_clean.append(raw_to_tokens(doc, spacy_nlp))
    # Update the progress bar
    print_progress_bar(k + 1, n_docs, prefix='Progress:', suffix='Complete', length=50)

Progress: |██████████████████████████████████████████████████| 100.0% Complete


In [57]:
# Print sample documents
print("Raw document: ", docs_raw[4938])
print("Preprocessed document: ", docs_clean[4938])

Raw document:  Et ceci aussi parce que je sais qu'outre les experts économiques et financiers, nombre de sociaux-démocrates de cette Assemblée partagent notre critique.
Preprocessed document:  sais experts economiques financiers nombre sociaux democrates assemblee partagent critique


## Vector Space Model
We are interested in using this data to build statistical models. So, we now need to **vectorize** this data. The goal is to find a way to represent the data so that the computer can understand it.

### Bag of words
A bag of words represents each document in a corpus as a series of features. Most commonly, the features are the collection of all unique words in the vocabulary of the entire corpus. The values are usually the count of the number of times that word appears in the document, i.e. **term frequency**.

A document $d$ is represented by a weight vector $v_d=[w_{1,d} , w_{2,d},\ldots, w_{N,d}]$ where $w_{t,d} = tf_{t,d}$, the term frequency of word $t$ in document $d$.

A corpus is then represented as a matrix with one row per document and one column per unique word.
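As a minimal sketch of this representation, the snippet below builds a count matrix by hand with `collections.Counter`. The three short documents are illustrative inputs and are assumed to be already preprocessed.

```python
from collections import Counter

docs = ["chat vit rot", "rot tenta chat", "chat mit patte rot"]

# Vocabulary: every unique word in the corpus, sorted for a stable column order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are term counts.
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is the vector $v_d$ of one document. Most entries are zero, which is why libraries usually store such matrices in a sparse format.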

### Scikit-Learn
[Scikit-learn](http://scikit-learn.org/stable/) is a machine learning library for the Python programming language. It features a wide range of machine learning algorithms for classification, regression and clustering. It also provides supporting tools such as cross-validation and text vectorizers. Scikit-learn is designed to interoperate with Python's numerical and scientific libraries, such as [NumPy](http://www.numpy.org/).

Simple to use: import the required module and call it.

## Vectorizer
To build our initial bag of words count matrix, we will use scikit-learn's **CountVectorizer** class to transform our corpus into a bag of words representation. CountVectorizer expects as input a list of raw strings containing the documents in the corpus. It tabulates occurrence counts per document for each feature.

In [58]:
import numpy as np

docs_raw_sample = ["Chat vit rôt.",
                   "Rôt tenta chat.",
                   "Chat mit patte à rôt.",
                   "Rôt trop chaud !",
                   "Rôt brûla patte à chat.",
                   "Chat quitta rôt."]

# Write code to import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Write code to preprocess each document (raw_to_tokens returns a cleaned string).
docs_raw_sample_tokens = [raw_to_tokens(doc, spacy_nlp) for doc in docs_raw_sample]

# Write code to create a CountVectorizer
vectorizer = CountVectorizer()

# Write code to vectorize the preprocessed sample text
X_sample = vectorizer.fit_transform(docs_raw_sample_tokens)

# The matrix is to be converted to dense matrix to print it
print("Count Matrix:")
print(X_sample.todense())
print("\nWords in vocabulary:")
print(vectorizer.get_feature_names_out())

Count Matrix:
[[0 1 0 0 0 0 1 0 0 1]
 [0 1 0 0 0 0 1 1 0 0]
 [0 1 0 1 1 0 1 0 0 0]
 [0 0 1 0 0 0 1 0 1 0]
 [1 1 0 0 1 0 1 0 0 0]
 [0 1 0 0 0 1 1 0 0 0]]

Words in vocabulary:
['brula' 'chat' 'chaud' 'mit' 'patte' 'quitta' 'rot' 'tenta' 'trop' 'vit']


## TF-IDF Weighting Scheme
The tf-idf weighting scheme is an improvement over the simple term count or term frequency scheme we just saw. It is frequently used in text mining applications and has been shown to be effective. It combines two term statistics components:
1. **Local component**: term count or term frequency (tf) reflects how important a word is to a document locally. For more details you can refer to [this link](https://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html).
2. **Global component**: inverse document frequency (idf) of a word reflects how important the word is to the entire corpus or collection of documents. _Document frequency_ (df) of a word is the number of documents in the corpus where the word appears. A term with higher $df$ is a common term, thus carries less importance. $idf$ is an inverse function of $df$. So higher $idf$ means higher importance of the term globally. For more details you can refer to [this link](https://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html).

The weight vector for document $d$ under the tf-idf scheme is $v_d=[w_{1,d} , w_{2,d},\ldots, w_{N,d}]$ where $w_{t,d}=tf_{t,d}\times\log\frac{Card(D)}{Card(\{d'\in D \mid t\in d'\}) + 1}$. Adding 1 to the denominator avoids division by zero; this is called smoothing.
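The formula above can be evaluated by hand on a toy corpus. The sketch below assumes the natural logarithm (the base is not specified in the formula) and three illustrative, already preprocessed documents; `tfidf_weight` is a hypothetical helper name.

```python
import math
from collections import Counter

docs = ["chat vit rot", "rot tenta chat", "chat mit patte rot"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term at least once.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_weight(term, tokens):
    """w_{t,d} = tf_{t,d} * log(Card(D) / (df_t + 1)), as in the formula above."""
    tf = tokens.count(term)
    return tf * math.log(n_docs / (doc_freq[term] + 1))


print(tfidf_weight("vit", tokenized[0]))   # rare term: positive weight
print(tfidf_weight("chat", tokenized[0]))  # term in every document: negative weight
```

Note a side effect of the `+ 1` in the denominator: a term that occurs in every document has a ratio below 1, so its weight becomes slightly negative rather than zero.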

Scikit-learn already provides the **TfidfVectorizer** class to compute the TF-IDF matrix.

**Note**:
1. Scikit-learn uses a slightly different formula from the one presented in the lecture: by default, a smoothed idf, $\ln\frac{1+n}{1+df}+1$, followed by L2 normalization of each row. You can refer to the [corresponding documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to know more.
2. Do not forget about preprocessing and tokenization before doing vectorization.
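To make the first note concrete, here is a standard-library sketch of scikit-learn's default weighting (smoothed idf, then L2 normalization per document) as described in the `TfidfTransformer` documentation. It is an illustration, not scikit-learn's actual implementation. The six documents are the sample sentences as we expect them after preprocessing, and `sklearn_style_row` is a hypothetical helper name.

```python
import math
from collections import Counter

# The six sample documents after preprocessing (lowercased, accents removed,
# punctuation and stop words dropped).
docs = [
    "chat vit rot",
    "rot tenta chat",
    "chat mit patte rot",
    "rot trop chaud",
    "rot brula patte chat",
    "chat quitta rot",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def sklearn_style_row(tokens):
    """Smoothed idf, ln((1 + n) / (1 + df)) + 1, then L2 normalization."""
    counts = Counter(tokens)
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row = sklearn_style_row(tokenized[0])
for term in sorted(row):
    print(term, round(row[term], 8))
```

With this variant every weight stays positive, and each document vector has unit length, so documents of different lengths become comparable.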

In [59]:
docs_raw_sample = ["Chat vit rôt.",
                   "Rôt tenta chat.",
                   "Chat mit patte à rôt.",
                   "Rôt trop chaud !",
                   "Rôt brûla patte à chat.",
                   "Chat quitta rôt."]

# Write code to import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Write code to preprocess each document (raw_to_tokens returns a cleaned string).
docs_raw_sample_tfidf = [raw_to_tokens(doc, spacy_nlp) for doc in docs_raw_sample]

# Write code to create a TfidfVectorizer object
tfidf = TfidfVectorizer()

# Write code to vectorize the sample text
X_tfidf_sample = tfidf.fit_transform(docs_raw_sample_tfidf)

print("Shape of the TF-IDF Matrix:")
print(X_tfidf_sample.shape)
print("TF-IDF Matrix:")
print(X_tfidf_sample.todense())
print(tfidf.get_feature_names_out())

Shape of the TF-IDF Matrix:
(6, 10)
TF-IDF Matrix:
[[0.         0.42407356 0.         0.         0.         0.
  0.36743345 0.         0.         0.82774046]
 [0.         0.42407356 0.         0.         0.         0.
  0.36743345 0.82774046 0.         0.        ]
 [0.         0.35088001 0.         0.68487548 0.56160769 0.
  0.30401578 0.         0.         0.        ]
 [0.         0.         0.67465286 0.         0.         0.
  0.29947796 0.         0.67465286 0.        ]
 [0.68487548 0.35088001 0.         0.         0.56160769 0.
  0.30401578 0.         0.         0.        ]
 [0.         0.42407356 0.         0.         0.         0.82774046
  0.36743345 0.         0.         0.        ]]
['brula' 'chat' 'chaud' 'mit' 'patte' 'quitta' 'rot' 'tenta' 'trop' 'vit']


## From Documents to Features
TF-IDF basically transforms a set of documents into a set of features, which can be directly used in machine learning tasks.

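Let's now convert the EU Parliament documents into TF-IDF vectors.

A correct conversion will generate a matrix with 10,000 rows and roughly 13,400 columns, for example (10000, 13429). The exact number of columns may vary with the spaCy model version, since tokenization and the stop word list affect the vocabulary.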

In [63]:
# Write code to convert raw documents into TF-IDF matrix.
"""
Hint: - create a TfidfVectorizer for clean EU Parliament documents
      - use fit_transform to vectorize docs_clean
"""
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs_clean)

print("Shape of the TF-IDF Matrix:")
print(X_tfidf.shape)

Shape of the TF-IDF Matrix:
(10000, 13467)


## Course Project: Text Classification with Rakuten France Product Data

The project focuses on large-scale text classification of product type codes, where the goal is to predict each product’s type code as defined in the catalog of Rakuten France. This project is derived from a data challenge proposed by Rakuten Institute of Technology, Paris. Details of the data challenge are [available in this link](https://challengedata.ens.fr/challenges/35).

The above data challenge focuses on multimodal product type code classification using text and image data. **For this project we will work with only text part of the data.**

Please read carefully the description of the challenge provided in the above link. **You can disregard any information related to the image part of the data.**

### To obtain the data
You have to register [on the challenge page](https://challengedata.ens.fr/challenges/35) to get access to the data.

For this project you will only need the text data. Download the training files `x_train` and `y_train`, containing the item texts, and the corresponding product type code labels.

### Pandas for handling the data
The files you obtained are in CSV format. We strongly suggest using the Python Pandas package to load and visualize the data. [Here is a basic tutorial](https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/) on how to handle data in CSV files using Pandas.

If you open the `x_train` dataset using Pandas, you will find that it contains the following columns:
1. an integer ID for the product
2. **designation** - The product title
3. description
4. productid
5. imageid

For this project we will only need the integer ID and the designation. You can [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the other columns.

The training output file `y_train.csv` contains the **prdtypecode**, the target/output variable for the classification task, for each integer id in the training input file `X_train.csv`.

### Task for the break
1. Register yourself and download the training and test files for the text data. You do not need the `supplementary files` for this project.
2. Load the data using pandas and disregard unnecessary columns as mentioned above.
3. On the **designation** column, apply the preprocessing techniques.

### Task for the end of the course
After this preprocessing step, you now have access to a TF-IDF matrix that constitutes the data set for the final evaluation project. The project guidelines are:
1. Apply all appropriate approaches taught in the course and practiced in the lab sessions to this data set. The goal is to predict the target variable (prdtypecode).
2. Compare the performance of these models in terms of their weighted F1 scores.
3. Conclude which approach is most appropriate for the predictive task on this data set.
4. Write a report that addresses all these guidelines, with a maximum of 5 pages (including figures, tables and references). We will take into account the quality of writing and presentation of the report.
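Since the comparison in guideline 2 relies on the weighted F1 score, here is a small standard-library sketch of how that metric is usually defined: the per-class F1 scores averaged with weights proportional to each class's number of true instances. The function name and the labels below are hypothetical and serve only to exercise the computation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Hypothetical labels, only to exercise the function.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(weighted_f1(y_true, y_pred))
```

In practice, scikit-learn's `f1_score` with `average="weighted"` computes this quantity, so the sketch is mainly useful for understanding what the reported number means.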