<img src="images/ds.jpg">

---
# Exploratory Data Analysis

Welcome to this Exploratory Data Analysis notebook for Text Data ! :)

<br/>

This notebook aims to facilitate the understanding of your data through various analyses of the texts.

<br/>

Python cells are to be executed sequentially, and some blocks must be modified according to your project.


---
# Prerequisites


<br/>


- **A working python 3.8 virtual environment, with your project installed** (in develop mode)


- Your virtual env must have been added as an ipython kernel, and defined as the currently used kernel !

### Adding your virtual env


<br/>


**This should be done for the proper functioning of this notebook !**


<br/>

By default, python notebooks use your main python installation.   
In this step, we detail how to add your virtual environment as an ipython kernel:


~ Open a command prompt

~ Activate your virtual environment (e.g. `venv_awesome`)

~ Install `ipykernel` -> `pip install ipykernel`

~ Add your virtual environment as an ipython kernel (e.g. `python -m ipykernel install --user --name=venv_awesome`)

<br>

Help:

- The `jupyter kernelspec list` command lists all the installed kernels<br>
<br>
- The `jupyter kernelspec uninstall venv_awesome` command uninstalls the `venv_awesome` kernel

You can then use this notebook with your virtual env:


~ Change your kernel (e.g. `venv_awesome`):

<img src="images/kernel.jpg">
<br>
<img src="images/kernel2.jpg">

_info : you may need to restart your jupyter notebook_
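
To confirm that the notebook is really running on your virtual environment's kernel, a quick check along these lines can be run in a cell. This is a minimal sketch using only the standard library; `in_virtualenv` is a hypothetical helper name, not part of the project, and this test does not detect conda environments.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv / virtualenv."""
    # Inside a virtual environment, sys.prefix points to the environment
    # while sys.base_prefix still points to the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(f"Interpreter used by this kernel : {sys.executable}")
print(f"Running inside a virtual env : {in_virtualenv()}")
```

If the interpreter path does not point to your virtual environment (e.g. `venv_awesome`), the kernel selection step above probably needs to be redone.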

### Important

In this notebook, we "hide" all the function definitions so as not to clutter the reading.   

---

<br>

---

<br>

---
# Imports and local variables

In [None]:
import warnings
warnings.filterwarnings('ignore')

import os
import re
import random
import dill as pickle
import nltk
import numpy as np
import pandas as pd
from PIL import Image
from collections import Counter


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # Should be installed through words_n_fun

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML, Javascript, clear_output

%matplotlib inline

from {{package_name}} import utils

# Increase pandas display size
pd.set_option('display.max_rows', 500)

# Set default figure size & font
plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 60
plt.rcParams['font.size'] = 14

# Center figures
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

In [None]:
{% raw %}# From https://stackoverflow.com/questions/31517194/how-to-hide-one-specific-cell-input-or-output-in-ipython-notebook
def hide_toggle(for_next=False, text_display='Toggle show/hide'):
    '''Function to hide a notebook cell'''
    this_cell = """$('div.cell.code_cell.rendered.selected')"""
    target_cell = this_cell  # target cell to control with toggle
    js_f_name = 'code_toggle_{}'.format(str(random.randint(1,2**64)))
    html = f"""
        <script>
            function {js_f_name}() {{
                {target_cell}.find('div.input').toggle();
            }}

        </script>

        <a href="javascript:{js_f_name}()" id="{js_f_name}">{text_display}</a>
    """
    js = f'''
            var output_area = this;
            var cell_element = output_area.element.parents('.cell');
            var cell_idx = Jupyter.notebook.get_cell_elements().index(cell_element);
            var current_cell = Jupyter.notebook.get_cell(cell_idx);
            $(current_cell.element[0]).find('div.input').toggle();
            Jupyter.notebook.select(cell_idx +  1);
            Jupyter.notebook.focus_cell();
         '''
    display(HTML(html))
    display(Javascript(js))

{% endraw %}

hide_toggle(text_display='Toggle show/hide --- function hide_toggle')

In [None]:
max_samples = 20_000  # Integer number of rows to load (passed to the nrows argument of pd.read_csv)
filename = 'filename.csv'  # FILE NAME TO BE MODIFIED !
text_col = 'sentence'  # TEXT COLUMN NAME TO BE MODIFIED !
y_cols = ['col1', 'col2']  # TARGET(S) COLUMNS NAMES TO BE MODIFIED ! - can be empty

---

<br>

---

<br>

---
# Data loading and first analysis

In [None]:
data_path = utils.get_data_path()
file_path = os.path.join(data_path, filename)

In [None]:
df = pd.read_csv(file_path, sep='{{default_sep}}', encoding='{{default_encoding}}', nrows=max_samples)

In [None]:
# Display shape
n_rows = df.shape[0]
n_columns = df.shape[1]
print(f"Number of lines : {n_rows}")
print(f"Number of columns : {n_columns}")

In [None]:
# Display 10 random lines (or fewer if the dataset is smaller)
df.sample(min(10, n_rows))

In [None]:
# Check input & target(s) columns presence
assert text_col in df.columns, f"Column '{text_col}' is not present in the loaded DataFrame"
for col in y_cols:
    assert col in df.columns, f"Column '{col}' is not present in the loaded DataFrame"

In [None]:
# Missing values
for col in [text_col] + y_cols:
    nb_missing = df[col].isna().sum()
    print(f"Missing values for column \033[1m{col}\033[0m : {nb_missing} -> {round(nb_missing / n_rows * 100, 2)} %")

In [None]:
# Duplicates text_col
n_duplicates = n_rows - len(df[text_col].drop_duplicates())
print(f"Number of duplicates for column \033[1m{text_col}\033[0m : {n_duplicates} -> {round(n_duplicates / n_rows * 100, 2)} %")

In [None]:
nltk.download('punkt')

# Number of unique words in the corpus
corpus_lower = [[_.lower() for _ in word_tokenize(token)] for token in df[text_col]]
flat_corpus_lower = [item for sublist in corpus_lower for item in sublist]
unique_words = set([token for token in flat_corpus_lower if token.isalpha()])
print(f"Number (approximate) of unique words in the corpus : {len(unique_words)}")

In [None]:
# Simple target(s) analysis
for col in y_cols:
    ax = sns.countplot(x=df[col])
    for p in ax.patches:
        ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width()/2., 1.01 * p.get_height()), ha='center', color='black', size=18)
    plt.show(block=False)

---

<br>

---

<br>

---
# Statistics

In [None]:
punct_regex = r"(?!_)[\w\s]|((?<=\w)'(?=\w))"
# [\w\s] -> match all "not special" characters, but includes '_'
# (?!_) -> do not capture '_'
# ((?<=\w)'(?=\w)) -> we also match apostrophes that match French words (e.g. "c'est", "t'ai", etc.)

df['word_count'] = df[text_col].apply(lambda x : len(x.split()))
df['char_count'] = df[text_col].apply(lambda x : len(x.replace(" ", ""))) # Not 100 % exact, but whatever
df['word_density'] = df['word_count'] / (df['char_count'] + 1)
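df['punct_count'] = df[text_col].str.replace(punct_regex, '', regex=True).apply(lambda x : len(x))  # regex=True is required with recent pandas versions
df['uppercase_ratio'] = df[text_col].apply(lambda x: len([c for c in x if c.isupper()]) / max(len(x), 1))  # max(..., 1) avoids a division by zero on empty texts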

hide_toggle(text_display='Toggle show/hide --- Add columns for stats')
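
To see what `punct_regex` removes and what is left to be counted as punctuation, here is a small standalone check with the standard `re` module. The sample sentences and the `count_punct` helper are made up for illustration; only the regex comes from the cell above.

```python
import re

punct_regex = r"(?!_)[\w\s]|((?<=\w)'(?=\w))"


def count_punct(text: str) -> int:
    """Count the characters left once letters, digits, whitespace
    and in-word apostrophes have been removed."""
    return len(re.sub(punct_regex, '', text))


sample = "C'est vrai, n'est-ce pas ?!"
print(re.sub(punct_regex, '', sample))  # Characters counted as punctuation
print(count_punct(sample))
print(count_punct("snake_case"))  # The underscore is counted as punctuation
print(count_punct("'quoted'"))  # Apostrophes at word boundaries are counted too
```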

In [None]:
# Get stats & display
df_stats = pd.DataFrame(columns=['Min', 'Quantile 0.05', 'Quantile 0.125', 'Quantile 0.25', 'Mean', 'Median', 'Quantile 0.75', 'Quantile 0.875', 'Quantile 0.95', 'Max'])
for col in ['word_count', 'char_count', 'word_density', 'punct_count', 'uppercase_ratio']:
    df_stats.loc[col] = [df[col].min(),
                         df[col].quantile(0.05),
                         df[col].quantile(0.125),
                         df[col].quantile(0.25),
                         df[col].mean(),
                         df[col].median(),
                         df[col].quantile(0.75),
                         df[col].quantile(0.875),
                         df[col].quantile(0.95),
                         df[col].max()]

hide_toggle(text_display='Toggle show/hide --- Add statistics')

In [None]:
# display stats
df_stats.style.background_gradient(axis=1).format("{:.2f}")  # set_precision was removed in pandas 2.0

In [None]:
# Lines are to be uncommented at the user's choice

# Violin plots
ax = sns.violinplot(x=df["word_count"], inner='quartile', color=sns.color_palette("muted", 8)[0])
plt.show(block=False)
# ax = sns.violinplot(x=df["char_count"], inner='quartile', color=sns.color_palette("muted", 8)[1])
# plt.show(block=False)
# ax = sns.violinplot(x=df["word_density"], inner='quartile', color=sns.color_palette("muted", 8)[2])
# plt.show(block=False)
# ax = sns.violinplot(x=df["punct_count"], inner='quartile', color=sns.color_palette("muted", 8)[3])
# plt.show(block=False)
# ax = sns.violinplot(x=df["uppercase_ratio"], inner='quartile', color=sns.color_palette("muted", 8)[4])
# plt.show(block=False)

# Example with target comparison (WARNING, must be used with a unique boolean target)
# (small trick on y to be able to use side by side comparison)
# ax = sns.violinplot(x=df["word_count"], y=['' for _ in range(n_rows)],  hue=df[y_cols[0]], inner='quartile', palette="Set2", split=True)
# plt.show(block=False)

In [None]:
def display_outliers(col: str, text: str):
    '''Displays outliers on a given col (min & max included)'''
    # Outliers
    outliers_min = df[df[col] <= df[col].quantile(0.02)][text_col]
    outliers_min = outliers_min.sample(min(5, len(outliers_min)))  # At most 5 examples
    outlier_min = df[df[col] == df[col].min()][text_col].iloc[0]
    if outlier_min not in outliers_min.values:
        # Series.append was removed in pandas 2.0, pd.concat works with all versions
        outliers_min = pd.concat([pd.Series([outlier_min]), outliers_min], ignore_index=True)
    outliers_max = df[df[col] >= df[col].quantile(0.98)][text_col]
    outliers_max = outliers_max.sample(min(5, len(outliers_max)))  # At most 5 examples
    outlier_max = df[df[col] == df[col].max()][text_col].iloc[0]
    if outlier_max not in outliers_max.values:
        outliers_max = pd.concat([pd.Series([outlier_max]), outliers_max], ignore_index=True)
    print(text)
    print('')
    for i, sentence in enumerate(outliers_min):
        print(f'---- Outlier min - text {i + 1} ----')
        print('')
        print('\t', sentence)
        print('')
    for i, sentence in enumerate(outliers_max):
        print(f'---- Outlier max - text {i + 1} ----')
        print('')
        print('\t', sentence)
        print('')
    print('')

hide_toggle(text_display='Toggle show/hide --- function display_outliers')

In [None]:
# Lines are to be uncommented at the user's choice

display_outliers('word_count', 'Outliers number of words:')
# display_outliers('char_count', 'Outliers number of text characters:')
# display_outliers('word_density', 'Outliers words density:')
# display_outliers('punct_count', 'Outliers number of punctuations:')
# display_outliers('uppercase_ratio', 'Outliers uppercase ratio:')

---

<br>

---

<br>

---
# Topic modelling

Warning: this part requires gensim and pyLDAvis to be installed.

In [None]:
!pip install gensim==3.8.3
!pip install pyLDAvis==3.2.2
clear_output()

### Latent Dirichlet Allocation

In [None]:
import pyLDAvis.gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
pyLDAvis.enable_notebook()
clear_output()

In [None]:
# Retrieve Bag of Words corpus
# You can easily edit the first line to add a filter on a class (for example)
corpus_lower = [[_.lower() for _ in word_tokenize(token)] for token in df[text_col]]
french_stopwords = set(stopwords.words('french'))  # Loaded once (requires the NLTK 'stopwords' resource)
corpus_no_stops = [[t for t in doc if t not in french_stopwords] for doc in corpus_lower]
corpus_alphas = [[token for token in doc if token.isalpha()] for doc in corpus_no_stops]
dictionary = corpora.Dictionary(corpus_alphas)
corpus_bow = [dictionary.doc2bow(t) for t in corpus_alphas]

In [None]:
%%time
num_topics = 5  # To be modified according to your project (should not be too big, otherwise the graphic will take too long to set up)
lda_model = LdaModel(corpus=corpus_bow, id2word=dictionary, num_topics=num_topics, random_state=42)

In [None]:
pyLDAvis.gensim.prepare(lda_model, corpus_bow, dictionary)

---

<br>

---

<br>

---
# N-grams counts

In [None]:
def ngrams_count(corpus, ngram_range, n=None):
    '''Counts N-grams and returns the n most frequent ones (all of them if n is None)'''
    # Using CountVectorizer to build a bag of words using the given corpus
    vectorizer = CountVectorizer(stop_words=stopwords.words('french'), ngram_range=ngram_range).fit(corpus)
    bag_of_words = vectorizer.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    total_list = words_freq[:n]
    
    # Returning a DataFrame with the ngrams count
    count_df = pd.DataFrame(total_list, columns=['ngram', 'count'])
    return count_df


hide_toggle(text_display='Toggle show/hide --- function ngrams_count')
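
For intuition about what the n-gram size controls, the counting itself can be sketched with the standard library only. This is a simplified illustration, not what `CountVectorizer` does internally: it splits on whitespace and applies no stop-word filtering. `simple_ngrams_count` and the toy corpus are hypothetical.

```python
from collections import Counter


def simple_ngrams_count(corpus, n_size, top=None):
    """Count n-grams of size `n_size` over a list of texts (whitespace tokenization)."""
    counts = Counter()
    for text in corpus:
        tokens = text.lower().split()
        # Slide a window of n_size tokens over the document
        for i in range(len(tokens) - n_size + 1):
            counts[" ".join(tokens[i:i + n_size])] += 1
    # most_common(None) returns every n-gram, sorted by decreasing count
    return counts.most_common(top)


toy_corpus = ["le chat dort", "le chat mange", "le chien dort"]
print(simple_ngrams_count(toy_corpus, 1, top=1))  # Most frequent unigram
print(simple_ngrams_count(toy_corpus, 2, top=1))  # Most frequent bigram
print(simple_ngrams_count(toy_corpus, 3))  # All trigrams
```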

In [None]:
def format_spines(ax, right_border=True):
    '''Sets up borders from an axis and personalize colors
    Args:
        ax: figure axis
        right_border: flag to determine if the right border will be visible or not
    '''
    # Setting up colors
    ax.spines['bottom'].set_color('#CCCCCC')
    ax.spines['left'].set_color('#CCCCCC')
    ax.spines['top'].set_color('#FFFFFF')
    if right_border:
        ax.spines['right'].set_color('#CCCCCC')
    else:
        ax.spines['right'].set_color('#FFFFFF')
    ax.patch.set_facecolor('#FFFFFF')

def plot_ngrams(pos_corpus, neg_corpus, col, negative_val, positive_val):
    '''Plots ngrams counts'''
    # Extracting the top 10 unigrams
    unigrams_pos = ngrams_count(pos_corpus, (1, 1), 10)
    unigrams_neg = ngrams_count(neg_corpus, (1, 1), 10)

    # Extracting the top 10 bigrams
    bigrams_pos = ngrams_count(pos_corpus, (2, 2), 10)
    bigrams_neg = ngrams_count(neg_corpus, (2, 2), 10)

    # Extracting the top 10 trigrams
    trigrams_pos = ngrams_count(pos_corpus, (3, 3), 10)
    trigrams_neg = ngrams_count(neg_corpus, (3, 3), 10)
    
    # Joining everything in a python dictionary to make the plots easier
    ngram_dict_plot = {
        f'Top Unigrams on {col} = {negative_val}': unigrams_neg,
        f'Top Unigrams on {col} = {positive_val}': unigrams_pos,
        f'Top Bigrams on {col} = {negative_val}': bigrams_neg,
        f'Top Bigrams on {col} = {positive_val}': bigrams_pos,
        f'Top Trigrams on {col} = {negative_val}': trigrams_neg,
        f'Top Trigrams on {col} = {positive_val}': trigrams_pos,
    }

    # Plotting the ngrams analysis
    fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(15, 18))
    i, j = 0, 0
    colors = ['Blues_d', 'Reds_d']
    for title, ngram_data in ngram_dict_plot.items():
        ax = axs[i, j]
        sns.barplot(x='count', y='ngram', data=ngram_data, ax=ax, palette=colors[j])

        # Customizing plots
        format_spines(ax, right_border=False)
        ax.set_title(title, size=14)
        ax.set_ylabel('')
        ax.set_xlabel('')

        # Incrementing the index
        j += 1
        if j == 2:
            j = 0
            i += 1
    plt.tight_layout()
    plt.show()

hide_toggle(text_display='Toggle show/hide --- function plot_ngrams')

In [None]:
# Corpus choice
if len(y_cols) != 0:
    col = y_cols[0]  # CHOOSE YOUR TARGET HERE
    negative_val = 0  # CORRESPONDING NEGATIVE VALUE - MIGHT NEED TO BE CHANGED (or to adapt if multiclass)
    positive_val = 1  # CORRESPONDING POSITIVE VALUE - MIGHT NEED TO BE CHANGED (or to adapt if multiclass)
    neg_corpus = df[df[col] == negative_val][text_col]
    pos_corpus = df[df[col] == positive_val][text_col]

In [None]:
# Plot ngrams
if len(y_cols) != 0:
    plot_ngrams(pos_corpus, neg_corpus, col, negative_val, positive_val)

---

<br>

---

<br>

---
# Word clouds

Warning: this part requires wordcloud to be installed.

In [None]:
!pip install wordcloud==1.8.1
clear_output()

### Example positive / negative

In [None]:
from wordcloud import WordCloud

In [None]:
def plot_wordclouds(pos_corpus, neg_corpus, col, negative_val, positive_val):
    # Prepare format
    neg_corpus_lower = [[_.lower() for _ in word_tokenize(token)] for token in neg_corpus]
    pos_corpus_lower = [[_.lower() for _ in word_tokenize(token)] for token in pos_corpus]
    french_stopwords = set(stopwords.words('french'))  # Loaded once instead of once per token
    neg_corpus_no_stops = [[t for t in doc if t not in french_stopwords] for doc in neg_corpus_lower]
    pos_corpus_no_stops = [[t for t in doc if t not in french_stopwords] for doc in pos_corpus_lower]
    neg_corpus_alphas = [[token for token in doc if token.isalpha()] for doc in neg_corpus_no_stops]
    pos_corpus_alphas = [[token for token in doc if token.isalpha()] for doc in pos_corpus_no_stops]
    negative_words = [item for sublist in neg_corpus_alphas for item in sublist]
    positive_words = [item for sublist in pos_corpus_alphas for item in sublist]
    
    # Reading and preparing a mask for serving as wordcloud background
    pos_img = Image.open("./images/positive.png")
    neg_img = Image.open("./images/negative.png")

    # Transforming the positive image into a mask
    pos_mask = np.array(pos_img)

    # Transforming the negative image into a mask
    neg_mask = np.array(neg_img)

    # Using Counter for creating a dictionary counting
    positive_dict = Counter(positive_words)
    negative_dict = Counter(negative_words)

    # Generating wordclouds for both positive and negative comments
    positive_wc = WordCloud(width=1280, height=720, collocations=False, random_state=42, mask=pos_mask,
                          colormap='Blues', background_color='white', max_words=50).generate_from_frequencies(positive_dict)
    negative_wc = WordCloud(width=1280, height=720, collocations=False, random_state=42, mask=neg_mask,
                          colormap='Reds', background_color='white', max_words=50).generate_from_frequencies(negative_dict)

    # Visualizing the WC created
    fig, axs = plt.subplots(1, 2, figsize=(20, 20))
    ax1 = axs[0]
    ax2 = axs[1]

    ax1.imshow(negative_wc)
    ax1.axis('off')
    ax1.set_title(f'WordCloud for values equal to {negative_val}, for target column {col}', size=18, pad=20)

    ax2.imshow(positive_wc)
    ax2.axis('off')
    ax2.set_title(f'WordCloud for values equal to {positive_val}, for target column {col}', size=18, pad=20)

    plt.show()
    
hide_toggle(text_display='Toggle show/hide --- function plot_wordclouds')

In [None]:
# Corpus choice
if len(y_cols) != 0:
    col = y_cols[0]  # CHOOSE YOUR TARGET HERE
    negative_val = 0  # CORRESPONDING NEGATIVE VALUE - MIGHT NEED TO BE CHANGED (or to adapt if multiclass)
    positive_val = 1  # CORRESPONDING POSITIVE VALUE - MIGHT NEED TO BE CHANGED (or to adapt if multiclass)
    neg_corpus = df[df[col] == negative_val][text_col]
    pos_corpus = df[df[col] == positive_val][text_col]

In [None]:
# Plot word clouds
if len(y_cols) != 0:
    plot_wordclouds(pos_corpus, neg_corpus, col, negative_val, positive_val)

---

<br>

---

<br>

---
# Corpus comparison


This part compares the importance of words in the studied corpus with their importance in a reference corpus (the French Wikipedia dataset, for example). It highlights words that are abnormally over- or under-represented.

This part requires a .pkl file containing a dictionary of IDF values (word -> IDF) computed over the reference corpus.
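
If you do not have such a file yet, one possible way to build it is sketched below on a toy reference corpus. The formula is the smoothed IDF, `ln((1 + n_docs) / (1 + doc_freq)) + 1`, which matches the default of scikit-learn's `TfidfVectorizer`. The `build_idf` helper, the toy documents and the file name are assumptions for illustration. The standard `pickle` module is used to keep the example self-contained, and the reference corpus should be preprocessed the same way as the studied corpus.

```python
import math
import os
import pickle
import tempfile
from collections import Counter


def build_idf(documents):
    """Compute a smoothed IDF for each word of a list of texts."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Each word counts once per document, hence the set
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log((1 + n_docs) / (1 + freq)) + 1 for word, freq in doc_freq.items()}


reference_docs = ["le chat dort", "le chien dort", "le chat mange"]  # Toy reference corpus
reference_idf = build_idf(reference_docs)

# Save the dictionary as a .pkl file (temporary folder here, use your data folder in practice)
example_path = os.path.join(tempfile.mkdtemp(), "example_idf.pkl")
with open(example_path, "wb") as f:
    pickle.dump(reference_idf, f)

# Reload it to check the content
with open(example_path, "rb") as f:
    reloaded_idf = pickle.load(f)
print(reloaded_idf)
```

Rare words get a higher IDF than frequent ones, which is what the comparison below relies on.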

In [None]:
data_path = utils.get_data_path()
idf_name = "wikipedia_idf.pkl"  # TO BE CHANGED WITH YOUR IDF FILE
idf_path = os.path.join(data_path, idf_name)
if not os.path.exists(idf_path):
    raise FileNotFoundError(f"Can't find file {idf_path}")
with open(idf_path, "rb") as f:  # The file is closed automatically after loading
    idf = pickle.load(f)

In [None]:
# Start by normalizing the IDF dictionary
max_idf = np.max([idf[x] for x in idf.keys()])
for x in idf.keys():
    idf[x] /= max_idf

# Then, we fit a TFIDF on our corpus
# The corpus should be preprocessed the same way as the other corpus when you created the IDF file
# Usually we apply "lower" and "remove stopwords" steps
tfidf = TfidfVectorizer(min_df=100)
french_stopwords = set(stopwords.words('french'))  # Loaded once instead of once per token
tfidf.fit_transform(df[text_col].dropna().apply(lambda x: " ".join([t for t in re.sub('[^A-Za-zÀ-ÖØ-öø-ÿ-+]+', ' ', x.lower()).strip().split()
                                                                    if t not in french_stopwords])))

# Retrieve the IDF and normalize it
idf_corpus = {}
for x in tfidf.vocabulary_.keys():
    idf_corpus[x] = tfidf.idf_[tfidf.vocabulary_[x]]
max_idf_corpus = np.max([idf_corpus[x] for x in idf_corpus.keys()])
for x in idf_corpus.keys():
    idf_corpus[x] /= max_idf_corpus

# Only consider the words in common between our corpus and the generic corpus
intersect = idf.keys() & idf_corpus.keys()

# Evaluate the differences between the IDF of our corpus and the IDF of the generic corpus
data_plot = {}
for x in intersect:
    data_plot[x] = (idf_corpus[x] - idf[x]) / idf_corpus[x]

# Get everything in a DataFrame and sort by values
data_plot = pd.DataFrame.from_dict(data_plot, orient='index', columns=["value"])
data_plot = data_plot.sort_values(by="value", ascending=False)

In [None]:
n = 10  # Number of words to be displayed for both the most positive and the most negative values
plt.barh(list(data_plot.index[:n]) + list(data_plot.index[-n:]), list(data_plot["value"][:n]) + list(data_plot["value"][-n:]))
plt.title("Over-represented words (<0) and under-represented words (>0) compared to a generic corpus.")
plt.ylabel("Words")
plt.xlabel("IDF difference")
plt.show()
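
As a reading aid for the chart, here is how the plotted value behaves on made-up normalized IDF values (a higher IDF means a rarer word). The numbers and the `relative_idf_gap` helper are hypothetical; only the formula comes from the notebook.

```python
def relative_idf_gap(idf_corpus_value: float, idf_reference_value: float) -> float:
    """Same formula as in the notebook: (idf_corpus - idf_reference) / idf_corpus."""
    return (idf_corpus_value - idf_reference_value) / idf_corpus_value


# Word frequent in our corpus (low IDF) but rare in the reference corpus (high IDF)
over_represented = relative_idf_gap(0.2, 0.8)
print(over_represented)  # Negative value -> over-represented in our corpus

# Word rare in our corpus (high IDF) but frequent in the reference corpus (low IDF)
under_represented = relative_idf_gap(0.8, 0.2)
print(under_represented)  # Positive value -> under-represented in our corpus
```

Since the value equals `1 - idf_reference / idf_corpus`, it stays below 1 when both IDFs are positive, while it has no lower bound. The negative bars can therefore be much longer than the positive ones.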