# **Automatically categorize questions**

## Part 1/8: exploratory analysis

### <br> Exploration and preprocessing notebook for the questions: univariate and multivariate analysis, question cleaning, and bag-of-words feature engineering with dimensionality reduction (of the vocabulary and of the tags)

<br>


## 1.1 Library imports, settings


In [43]:
import sys
import numpy as np
import random
from zipfile import ZipFile
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

# NLP
from bs4 import BeautifulSoup
import nltk
nltk.download('punkt')
import string
import spacy

print('Python version ' + sys.version)
print('\npandas version ' + pd.__version__)
print('sns version ' + sns.__version__)

plt.style.use('ggplot')
pd.set_option('display.max_columns', 200)
sns.set(font_scale=1)


Python version 3.11.4 (main, Jul  5 2023, 14:15:25) [GCC 11.2.0]

pandas version 2.1.1
sns version 0.12.2


[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1.2 Functions


In [44]:
def quick_look(df, miss=True):
    """
    Display a quick overview of a DataFrame, including shape, head, tail, unique values, and duplicates.

    Args:
        df (pandas.DataFrame): The input DataFrame to inspect.
        miss (bool, optional): Whether to check and display missing values (default is True).

    The function provides a summary of the DataFrame, including its shape, the first and last rows, the count of unique values per column, and the number of duplicates.
    If `miss` is set to True, it also displays missing value information.
    """
    print(f'shape : {df.shape}')

    display(df.head())
    display(df.tail())

    print('uniques :')
    display(df.nunique())

    print('Doublons ? ', df.duplicated(keep='first').sum(), '\n')

    if miss:
        display(get_missing_values(df))


def lerp(a, b, t):
    """
    Linear interpolation between two values 'a' and 'b' at a parameter 't'.
    A very useful little function, used here to position annotations in plots.
    Got it coding with Radu :)

    Given two values 'a' and 'b', and a parameter 't',
    this function calculates the linear interpolation between 'a' and 'b' at 't'.

    Parameters:
    a (float or int): The start value.
    b (float or int): The end value.
    t (float): The interpolation parameter (typically in the range [0, 1], but can be outside).

    Returns:
    float or int: The interpolated value at parameter 't'.
    """
    return a + (b - a) * t


def generate_random_pastel_colors(n):
    """
    Generates a list of n random pastel colors, represented as RGBA tuples.

    Parameters:
    n (int): The number of pastel colors to generate.

    Returns:
    list: A list of RGBA tuples representing random pastel colors.

    Example (colors are random, so the values differ on every call):
    >>> generate_random_pastel_colors(2)
    [(0.749, 0.827, 0.886, 1.0), (0.886, 0.749, 0.827, 1.0)]
    """
    colors = []
    for _ in range(n):
        # Generate random pastels
        red = round(random.randint(150, 250) / 255.0, 3)
        green = round(random.randint(150, 250) / 255.0, 3)
        blue = round(random.randint(150, 250) / 255.0, 3)

        # Create an RGBA color tuple (alpha fixed at 1.0) and add it to the list
        color = (red, green, blue, 1.0)
        colors.append(color)

    return colors

print(generate_random_pastel_colors(2))


def get_missing_values(df):
    """Generates a DataFrame containing the count and proportion of missing values for each feature.

    Args:
        df (pandas.DataFrame): The input DataFrame to analyze.

    Returns:
        pandas.DataFrame: A DataFrame with columns for the feature name, count of missing values,
        count of non-missing values, proportion of missing values, and data type for each feature.
    """
    # Count the missing values for each column
    missing = df.isna().sum()

    # Calculate the percentage of missing values
    percent_missing = df.isna().mean() * 100

    # Create a DataFrame to store the results
    missings_df = pd.DataFrame({
        'column_name': df.columns,
        'missing': missing,
        'present': df.shape[0] - missing,  # Count of non-missing values
        'percent_missing': percent_missing.round(2),  # Rounded to 2 decimal places
        'type': df.dtypes
    })

    # Sort the DataFrame by the count of missing values
    missings_df.sort_values('missing', inplace=True)

    return missings_df

# with pd.option_context('display.max_rows', 1000):
#   display(get_missing_values(df))


# my original function (not cleaned up)
def hist_distrib(dataframe, feature, bins, r, density=True):
    """
    Display a histogram to visualize the empirical distribution of a variable.
    Arguments: df, numerical feature
    """
    # central tendencies:
    mode =  str(round(dataframe[feature].mode()[0], r))
    # the mode is often zero, so check whether the column has any non-zero values
    if (dataframe[feature] != 0).any():
        mode_non_nul = str(round(dataframe.loc[dataframe[feature] != 0, feature].mode()[0], r))
    else:
        mode_non_nul = "N/A"
    mediane = str(round(dataframe[feature].median(), r))
    moyenne = str(round(dataframe[feature].mean(), r))
    # dispersion:
    var_emp = str(round(dataframe[feature].var(ddof=0), r))
    # coefficient of variation = empirical standard deviation / mean
    coeff_var = str(round(dataframe[feature].std(ddof=0) / dataframe[feature].mean(), r))
    # shape
    skewness = str(round(dataframe[feature].skew(), 2))
    kurtosis = str(round(dataframe[feature].kurtosis(), 2))

    fig, ax = plt.subplots(figsize=(12, 5))
    dataframe[feature].hist(density=density, bins=bins, ax=ax)
    yt = plt.yticks()
    y = lerp(yt[0][0], yt[0][-1], 0.8)
    t = y/20
    xt = plt.xticks()
    x = lerp(xt[0][0], xt[0][-1], 0.7)
    plt.title(feature, pad=20, fontsize=18)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    fs =13
    plt.annotate('Mode : ' + mode, xy = (x, y), fontsize = fs, xytext = (x, y), color = 'g')
    plt.annotate('Mode + : ' + mode_non_nul, xy = (x, y-t), fontsize = fs, xytext = (x, y-t), color = 'g')
    plt.annotate('Median : ' + mediane, xy = (x, y-2*t), fontsize = fs, xytext = (x, y-2*t), color = 'g')
    plt.annotate('Mean : ' + moyenne, xy = (x, y-3*t), fontsize = fs, xytext = (x, y-3*t), color = 'g')

    plt.annotate('Var emp : ' + var_emp, xy = (x, y-5*t), fontsize = fs, xytext = (x, y-5*t), color = 'g')
    plt.annotate('Coeff var : ' + coeff_var, xy = (x, y-6*t), fontsize = fs, xytext = (x, y-6*t), color = 'g')

    plt.annotate('Skewness : ' + skewness, xy = (x, y-8*t), fontsize = fs, xytext = (x, y-8*t), color = 'g')
    plt.annotate('Kurtosis : ' + kurtosis, xy = (x, y-9*t), fontsize = fs, xytext = (x, y-9*t), color = 'g')
    plt.show()

    return float(skewness) # for a possible log transform later

# cleaned-up version
def hist_distrib(dataframe, feature, bins, decimal_places, density=True):
    """
    Visualize the empirical distribution of a numerical feature using a histogram.
    Computes the main indicators of central tendency, dispersion and shape.

    Args:
        dataframe (pandas.DataFrame): The input DataFrame containing the feature.
        feature (str): The name of the numerical feature to visualize.
        bins (int): The number of bins for the histogram.
        decimal_places (int): The number of decimal places for rounding numeric values.
        density (bool, optional): Whether to display the histogram as a density plot (default is True).

    Returns:
        float: The skewness of the feature's distribution.

    The function generates a histogram of the feature, displays various statistics, and returns the skewness of the distribution.
    """
    # Calculate central tendencies and dispersion
    mode_value = round(dataframe[feature].mode()[0], decimal_places)
    mode_non_zero = "N/A"
    if (dataframe[feature] != 0).any():
        mode_non_zero = round(dataframe.loc[dataframe[feature] != 0, feature].mode()[0], decimal_places)
    median_value = round(dataframe[feature].median(), decimal_places)
    mean_value = round(dataframe[feature].mean(), decimal_places)

    # Calculate dispersion
    var_emp = round(dataframe[feature].var(ddof=0), decimal_places)
    # Coefficient of variation = empirical standard deviation / mean
    coeff_var = round(dataframe[feature].std(ddof=0) / dataframe[feature].mean(), decimal_places)

    # Calculate shape indicators
    skewness_value = round(dataframe[feature].skew(), 2)
    kurtosis_value = round(dataframe[feature].kurtosis(), 2)

    # Create the plot
    fig, ax = plt.subplots(figsize=(12, 5))
    dataframe[feature].hist(density=density, bins=bins, ax=ax)

    # Adjust placement for annotations
    yt = plt.yticks()
    y_position = lerp(yt[0][0], yt[0][-1], 0.8)
    y_increment = y_position / 20
    xt = plt.xticks()
    x_position = lerp(xt[0][0], xt[0][-1], 0.7)

    # Add annotations with horizontal and vertical alignment
    annotation_fs = 13
    color = 'g'
    ax.annotate(f'Mode: {mode_value}', xy=(x_position, y_position), fontsize=annotation_fs,
                xytext=(x_position, y_position), color=color, ha='left', va='bottom')
    ax.annotate(f'Mode +: {mode_non_zero}', xy=(x_position, y_position - y_increment), fontsize=annotation_fs,
                xytext=(x_position, y_position - y_increment), color=color, ha='left', va='bottom')
    ax.annotate(f'Median: {median_value}', xy=(x_position, y_position - 2 * y_increment), fontsize=annotation_fs,
                xytext=(x_position, y_position - 2 * y_increment), color=color, ha='left', va='bottom')
    ax.annotate(f'Mean: {mean_value}', xy=(x_position, y_position - 3 * y_increment), fontsize=annotation_fs,
                xytext=(x_position, y_position - 3 * y_increment), color=color, ha='left', va='bottom')
    ax.annotate(f'Var Emp: {var_emp}', xy=(x_position, y_position - 5 * y_increment), fontsize=annotation_fs,
                xytext=(x_position, y_position - 5 * y_increment), color=color, ha='left', va='bottom')
    ax.annotate(f'Coeff Var: {coeff_var}', xy=(x_position, y_position - 6 * y_increment), fontsize=annotation_fs,
                xytext=(x_position, y_position - 6 * y_increment), color=color, ha='left', va='bottom')
    ax.annotate(f'Skewness: {skewness_value}', xy=(x_position, y_position - 8 * y_increment), fontsize=annotation_fs,
                xytext=(x_position, y_position - 8 * y_increment), color=color, ha='left', va='bottom')
    ax.annotate(f'Kurtosis: {kurtosis_value}', xy=(x_position, y_position - 9 * y_increment), fontsize=annotation_fs,
                xytext=(x_position, y_position - 9 * y_increment), color=color, ha='left', va='bottom')

    # Label the x-axis and y-axis
    ax.set_xlabel(feature, fontsize=12)
    ax.set_ylabel('Density' if density else 'Frequency', fontsize=12)

    # Show the plot
    plt.title(f'Distribution of {feature}', pad=20, fontsize=18)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.show()

    return skewness_value


def boxplot_distrib(dataframe, feature):
    """
    Display a box plot to visualize the central tendencies and dispersion of a variable.

    Args:
        dataframe (pandas.DataFrame): The input DataFrame containing the feature.
        feature (str): The name of the numerical feature to visualize.

    The function generates a box plot of the feature to display central tendencies (median and mean) and dispersion.
    """
    fig, ax = plt.subplots(figsize=(10, 4))

    medianprops = {'color':"blue"}
    meanprops = {'marker':'o', 'markeredgecolor':'black',
            'markerfacecolor':'firebrick'}

    dataframe.boxplot(feature, vert=False, showfliers=False, medianprops=medianprops, patch_artist=True, showmeans=True, meanprops=meanprops)

    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.show()


def courbe_lorenz(dataframe, feature):
    """
    Display a Lorenz curve to visualize the concentration of a variable,
    and compute the Gini index.

    Args:
        dataframe (pandas.DataFrame): The input DataFrame containing the feature.
        feature (str): The name of the numerical feature to visualize.

    The function generates a Lorenz curve to assess the concentration of the feature and calculates the Gini coefficient.
    """
    fig, ax = plt.subplots(figsize=(12, 5))
    values = dataframe.loc[dataframe[feature].notna(), feature].values
    # print(values)
    n = len(values)
    lorenz = np.cumsum(np.sort(values)) / values.sum()
    lorenz = np.append([0],lorenz) # The Lorenz curve starts at 0

    xaxis = np.linspace(0-1/n,1+1/n,n+1)
    # There is one segment per individual (n in total), plus 1 extra segment with ordinate 0.
    # The first segment starts at 0-1/n, and the last one ends at 1+1/n.
    plt.plot(xaxis,lorenz,drawstyle='steps-post')
    plt.plot(np.arange(2),[x for x in np.arange(2)])
    # Compute the Gini index.
    # AUC is the area under the Lorenz curve. The first segment (lorenz[0]) lies half below 0,
    # so it is cut in half; the same goes for the last segment (lorenz[-1]), which lies half above 1.
    AUC = (lorenz.sum() -lorenz[-1]/2 -lorenz[0]/2)/n
    S = 0.5 - AUC # area between the first bisector (line of equality) and the Lorenz curve
    gini = 2*S
    plt.annotate('gini =  ' + str(round(gini, 2)), xy = (0.04, 0.88), fontsize = 13, xytext = (0.04, 0.88), color = 'g')
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.show()


def graphs_analyse_uni(dataframe, feature, bins=50, r=5, density=True):
    """
    Display a histogram, a box plot and a Lorenz curve.

    Args:
        dataframe (pandas.DataFrame): The input DataFrame containing the feature.
        feature (str): The name of the numerical feature to analyze.
        bins (int, optional): The number of bins for the histogram (default is 50).
        r (int, optional): The number of decimal places for rounding numeric values (default is 5).
        density (bool, optional): Whether to display the histogram as a density plot (default is True).

    The function generates and displays an analysis of the given numerical feature, including a histogram, a box plot, and a Lorenz curve.
    """
    hist_distrib(dataframe, feature, bins, r, density)
    boxplot_distrib(dataframe, feature)
    courbe_lorenz(dataframe, feature)


def shape_head(df, nb_rows=5):
    """
    Display the dimensions and the first rows of a DataFrame.

    Args:
        df (pandas.DataFrame): The input DataFrame to display.
        nb_rows (int, optional): The number of rows to display (default is 5, max is 60).

    The function prints the dimensions of the DataFrame and displays the first few rows.
    """
    print(df.shape)
    display(df.head(nb_rows))


def doughnut(df, feature, title, width=10, height=10):
    """
    Display the distribution of a feature as a doughnut chart.
    Colors are random.

    Args:
        df (pandas.DataFrame): The input DataFrame containing the feature.
        feature (str): The name of the feature to visualize.
        title (str): The title for the doughnut chart.
        width (int, optional): The width of the chart (default is 10).
        height (int, optional): The height of the chart (default is 10).

    The function creates a doughnut chart to visualize the distribution of the specified feature.
    If you don't like the colors, try running it again :)
    """
    colors = generate_random_pastel_colors(20)

    grouped_df = df.groupby(feature).size().to_frame("count_per_type").reset_index()
    pie = grouped_df.set_index(feature).copy()

    fig, ax = plt.subplots(figsize=(width, height))

    patches, texts, autotexts = plt.pie(x=pie['count_per_type'], autopct='%1.1f%%',
        startangle=-30, labels=pie.index, textprops={'fontsize':11, 'color':'#000'},
        labeldistance=1.25, pctdistance=0.85, colors=colors)

    plt.title(
    label=title,
    fontdict={"fontsize":17},
    pad=20
    )

    for text in texts:
        # text.set_fontweight('bold')
        text.set_horizontalalignment('center')

    # Customize percent labels
    for autotext in autotexts:
        autotext.set_horizontalalignment('center')
        autotext.set_fontstyle('italic')
        autotext.set_fontsize('10')

    #draw circle
    centre_circle = plt.Circle((0,0),0.7,fc='white')
    fig = plt.gcf()
    fig.gca().add_artist(centre_circle)

    plt.show()


def get_non_null_values(df):
    """
    Generate a DataFrame containing the count and proportion of non-null (non-zero) values for each feature.

    Args:
        df (pandas.DataFrame): The input DataFrame to analyze.

    The function calculates and returns a DataFrame with the count and percentage of non-null (non-zero) values for each feature.
    """
    non_null_counts = df.ne(0).sum()
    percent_non_null = (non_null_counts / df.shape[0]) * 100
    non_null_values_df = pd.DataFrame({'column_name': df.columns,
                                       'non_null_count': non_null_counts,
                                       'percent_non_null': percent_non_null.round(2),
                                       'type': df.dtypes})
    non_null_values_df.sort_values('non_null_count', inplace=True)
    return non_null_values_df


def get_colors(n=7):
    """
    Generate a list of random colors from multiple colormaps.

    Args:
        n (int, optional): The number of colors to sample from each colormap (default is 7).

    Returns:
        list: A list of random colors sampled from different colormaps.
    """
    num_colors_per_colormap = n
    colormaps = [plt.cm.Pastel2, plt.cm.Set1, plt.cm.Paired]
    all_colors = []

    for colormap in colormaps:
        colors = colormap(np.linspace(0, 1, num_colors_per_colormap))
        all_colors.extend(colors)

    np.random.shuffle(all_colors)

    return all_colors


[(0.725, 0.949, 0.737, 1.0), (0.616, 0.6, 0.792, 1.0)]
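
As a quick sanity check of the Gini computation in `courbe_lorenz`, the sketch below reuses the same formula without any plotting. The helper name `gini_from_values` is an assumption introduced for illustration and is not part of the notebook. With identical values the index should be 0, and with all the mass on one of n items it should be 1 - 1/n.

```python
import numpy as np

def gini_from_values(values):
    """Gini index computed with the same formula as courbe_lorenz (no plot).

    Hypothetical helper, for illustration only.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Cumulative share of the total, with the curve starting at 0
    lorenz = np.append([0], np.cumsum(np.sort(values)) / values.sum())
    # Area under the step curve; the first and last steps count for half
    auc = (lorenz.sum() - lorenz[-1] / 2 - lorenz[0] / 2) / n
    # Gini = twice the area between the line of equality and the Lorenz curve
    return 2 * (0.5 - auc)

print(round(gini_from_values([1, 1, 1, 1]), 2))   # perfectly equal values
print(round(gini_from_values([0, 0, 0, 10]), 2))  # all the mass on one item, n = 4
```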


## 1.3 Importing the raw data


In [45]:
# The data is compressed, otherwise the file exceeds GitHub's per-object size limit (50 MB)

# path to the zip file
zip_file_path = './../data/raw_data/QueryResults.zip'

# directory where you want to extract the contents
extract_to_dir = './../data/raw_data'

# Open the zip file
with ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the specified directory
    zip_ref.extractall(extract_to_dir)

# The encoding is indeed UTF-8 (checked by opening the .csv in VS Code)
raw_data = pd.read_csv('./../data/raw_data/QueryResults.csv', sep=',')

quick_look(raw_data)


shape : (50000, 9)


Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate
0,ImportError: cannot import name 'url_decode' f...,<p>I am building a webapp using Flask. I impor...,<python><flask><importerror><flask-login><werk...,77215107,13,14443,,5,2023-10-02 11:07:45
1,Compilation error after upgrading to JDK 21 - ...,"<p>After upgrading to JDK 21, I have the follo...",<spring-boot><compiler-errors><upgrade><lombok...,77171270,55,36788,,3,2023-09-25 09:05:11
2,Differences between Langchain & LlamaIndex,<p>I'm currently working on developing a chatb...,<chatbot><openai-api><langchain><large-languag...,76990736,28,10433,,2,2023-08-28 07:22:32
3,session not created: This version of ChromeDri...,<p>I am running a Docker image from a Docker c...,<python><amazon-web-services><docker><google-c...,76909437,14,14969,,8,2023-08-15 22:21:03
4,Spring security method cannot decide pattern i...,<p>When I try to run an application it fails t...,<java><spring-boot><eclipse><spring-security><...,76809698,27,18943,,8,2023-08-01 08:16:21


Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate
49995,How can I send a file document to the printer ...,<p>Here's the basic premise:</p>\n\n<p>My user...,<c#><winforms><pdf><.net-4.0><printing>,6103705,91,215784,0.0,12,2011-05-23 22:22:56
49996,CA1014 Mark 'some.dll' with CLSCompliant(true)...,"<p>When I run StyleCop, I got this error messa...",<visual-studio><visual-studio-2010><dll><style...,6103133,17,11024,0.0,2,2011-05-23 21:15:51
49997,How to change a text file's name in C++?,<p>I would like to change a <code>txt</code> f...,<c++><algorithm><file><directory><file-rename>,6103036,16,37118,0.0,3,2011-05-23 21:05:59
49998,php implode (101) with quotes,<p>Imploding a simple array </p>\n\n<p>would ...,<php><arrays><string><csv><implode>,6102398,156,141141,0.0,16,2011-05-23 20:06:35
49999,What characters are allowed in a iOS file name?,<p>I'm looking for a way to make sure a string...,<ios><file><filenames><character-encoding><nsf...,6102333,29,26085,0.0,10,2011-05-23 20:00:57


uniques :


Title            49999
Body             50000
Tags             48252
Id               50000
Score              761
ViewCount        36831
FavoriteCount        2
AnswerCount         64
CreationDate     49994
dtype: int64

Doublons ?  0 



Unnamed: 0,column_name,missing,present,percent_missing,type
Title,Title,0,50000,0.0,object
Body,Body,0,50000,0.0,object
Tags,Tags,0,50000,0.0,object
Id,Id,0,50000,0.0,int64
Score,Score,0,50000,0.0,int64
ViewCount,ViewCount,0,50000,0.0,int64
AnswerCount,AnswerCount,0,50000,0.0,int64
CreationDate,CreationDate,0,50000,0.0,object
FavoriteCount,FavoriteCount,1827,48173,3.65,float64
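
The cell above extracts the archive to disk before reading the CSV. As an alternative sketch, a CSV member can be read straight from the zip with `ZipFile.open`, which avoids keeping an uncompressed copy. The helper name `read_csv_from_zip` and the toy archive are assumptions made for illustration; the notebook itself uses `extractall`.

```python
import io
import zipfile

import pandas as pd

def read_csv_from_zip(zip_path, member_name, **read_csv_kwargs):
    """Load one CSV member of a zip archive into a DataFrame without extracting it.

    Hypothetical helper, for illustration only.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as handle:
            return pd.read_csv(handle, **read_csv_kwargs)

# Self-contained demo on an in-memory archive (toy data, not the real dataset)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('toy.csv', 'Id,Score\n1,10\n2,20\n')
buffer.seek(0)

toy = read_csv_from_zip(buffer, 'toy.csv')
print(toy.shape)
```

With the real data, the call would presumably look like `read_csv_from_zip(zip_file_path, 'QueryResults.csv', sep=',')`, assuming the archive member carries that name.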


In [46]:
# Only 2 identical titles out of 50,000 rows
# Id is indeed a primary key (Body too)

# The types look correct,
# apart from the dates of course

raw_data['CreationDate'] = pd.to_datetime(raw_data['CreationDate'])
# We will probably never use these dates, but just in case.

raw_data.describe()

# With our criteria (see the SQL query at the end of notebook 2), we had to go back to May 2011
# to collect 50,000 questions. FavoriteCount is very low here, close to 0.
# In our corpus, the mean score is about 50, the mean view count about 66,000
# and the mean number of answers above 5.


Unnamed: 0,Id,Score,ViewCount,FavoriteCount,AnswerCount,CreationDate
count,50000.0,50000.0,50000.0,48173.0,50000.0,50000
mean,28782030.0,49.97398,66128.81,8.3e-05,5.38448,2015-02-23 09:48:58.933900032
min,6102333.0,11.0,10002.0,0.0,2.0,2011-05-23 20:00:57
25%,14131520.0,15.0,18420.5,0.0,3.0,2013-01-03 00:42:30.500000
50%,25516820.0,24.0,32106.0,0.0,4.0,2014-08-26 23:41:22
75%,41338350.0,45.0,64467.25,0.0,6.0,2016-12-27 02:23:45
max,77215110.0,27153.0,10639240.0,1.0,87.0,2023-10-02 11:07:45
std,16874040.0,168.429881,142456.6,0.009112,4.622519,
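
The claim that `Id` (and `Body`) can serve as a primary key is read off the `nunique` counts above. A minimal sketch of an explicit check follows, on toy data. The helper name `is_key` is hypothetical and not part of the notebook.

```python
import pandas as pd

def is_key(df, column):
    """Return True if `column` has no missing values and no repeated values.

    Hypothetical helper, for illustration only.
    """
    return bool(df[column].notna().all() and df[column].is_unique)

# Toy data: 'Id' is unique, 'Title' contains one repeated value
toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'Title': ['a', 'b', 'a'],
})

print(is_key(toy, 'Id'))
print(is_key(toy, 'Title'))
```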


In [47]:
raw_questions_tags = raw_data[['Body', 'Tags']].copy()

# Rename columns
raw_questions_tags = raw_questions_tags.rename(columns={'Body': 'questions', 'Tags': 'tags'})



## 1.5 Removing HTML tags


In [48]:
raw_questions_tags['questions'] = raw_questions_tags['questions'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
# The warning does not matter here: text that contains no HTML tags
# is not modified by BeautifulSoup.get_text()

quick_look(raw_questions_tags)


shape : (50000, 2)


Unnamed: 0,questions,tags
0,I am building a webapp using Flask. I imported...,<python><flask><importerror><flask-login><werk...
1,"After upgrading to JDK 21, I have the followin...",<spring-boot><compiler-errors><upgrade><lombok...
2,I'm currently working on developing a chatbot ...,<chatbot><openai-api><langchain><large-languag...
3,I am running a Docker image from a Docker cont...,<python><amazon-web-services><docker><google-c...
4,When I try to run an application it fails to s...,<java><spring-boot><eclipse><spring-security><...


Unnamed: 0,questions,tags
49995,Here's the basic premise:\nMy user clicks some...,<c#><winforms><pdf><.net-4.0><printing>
49996,"When I run StyleCop, I got this error message ...",<visual-studio><visual-studio-2010><dll><style...
49997,"I would like to change a txt file's name, but ...",<c++><algorithm><file><directory><file-rename>
49998,Imploding a simple array \nwould look like th...,<php><arrays><string><csv><implode>
49999,I'm looking for a way to make sure a string ca...,<ios><file><filenames><character-encoding><nsf...


uniques :


questions    50000
tags         48252
dtype: int64

Doublons ?  0 



Unnamed: 0,column_name,missing,present,percent_missing,type
questions,questions,0,50000,0.0,object
tags,tags,0,50000,0.0,object


## 1.6 Tokenization, lowercasing, punctuation


In [49]:
def preprocess_text_1(text):
    # Tokenize and convert to lowercase
    tokens = nltk.word_tokenize(text.lower())

    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]
    # Alternatively, using nltk (RegexpTokenizer.tokenize expects a string, not a token list)
    # tokenizer = nltk.RegexpTokenizer(r'\w+')
    # tokens = tokenizer.tokenize(text.lower())

    return tokens

# Apply the preprocessing function to the 'questions' column
raw_questions_tags['tokens'] = raw_questions_tags['questions'].apply(preprocess_text_1)

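# Note: the 'tokens' column holds lists, which are unhashable, so the
# df.nunique() call inside quick_look() raises the TypeError shown below
quick_look(raw_questions_tags)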


shape : (50000, 3)


Unnamed: 0,questions,tags,tokens
0,I am building a webapp using Flask. I imported...,<python><flask><importerror><flask-login><werk...,"[i, am, building, a, webapp, using, flask, i, ..."
1,"After upgrading to JDK 21, I have the followin...",<spring-boot><compiler-errors><upgrade><lombok...,"[after, upgrading, to, jdk, 21, i, have, the, ..."
2,I'm currently working on developing a chatbot ...,<chatbot><openai-api><langchain><large-languag...,"[i, 'm, currently, working, on, developing, a,..."
3,I am running a Docker image from a Docker cont...,<python><amazon-web-services><docker><google-c...,"[i, am, running, a, docker, image, from, a, do..."
4,When I try to run an application it fails to s...,<java><spring-boot><eclipse><spring-security><...,"[when, i, try, to, run, an, application, it, f..."


Unnamed: 0,questions,tags,tokens
49995,Here's the basic premise:\nMy user clicks some...,<c#><winforms><pdf><.net-4.0><printing>,"[here, 's, the, basic, premise, my, user, clic..."
49996,"When I run StyleCop, I got this error message ...",<visual-studio><visual-studio-2010><dll><style...,"[when, i, run, stylecop, i, got, this, error, ..."
49997,"I would like to change a txt file's name, but ...",<c++><algorithm><file><directory><file-rename>,"[i, would, like, to, change, a, txt, file, 's,..."
49998,Imploding a simple array \nwould look like th...,<php><arrays><string><csv><implode>,"[imploding, a, simple, array, would, look, lik..."
49999,I'm looking for a way to make sure a string ca...,<ios><file><filenames><character-encoding><nsf...,"[i, 'm, looking, for, a, way, to, make, sure, ..."


uniques :


TypeError: unhashable type: 'list'
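
The `TypeError` comes from `df.nunique()` inside `quick_look`: counting unique values requires hashing them, and Python lists are not hashable. One possible workaround, sketched below on toy data, is to convert list cells to tuples (and set cells to frozensets) before counting. The helper name `nunique_safe` is an assumption introduced here, not a pandas function.

```python
import pandas as pd

def nunique_safe(df):
    """Count unique values per column, tolerating list or set cells.

    Hypothetical helper, for illustration only.
    """
    def to_hashable(value):
        if isinstance(value, list):
            return tuple(value)       # keeps token order
        if isinstance(value, set):
            return frozenset(value)   # order is irrelevant for sets
        return value

    return df.apply(lambda col: col.map(to_hashable).nunique())

# Toy data: two identical token lists and one different list
toy = pd.DataFrame({
    'questions': ['q1', 'q2', 'q3'],
    'tokens': [['a', 'b'], ['a', 'b'], ['c']],
})
counts = nunique_safe(toy)
print(counts)
```

The same conversion would be needed before `df.duplicated()`, which also relies on hashing.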

## 1.7 Unique tokens


In [None]:
def preprocess_text_2(liste_tokens):
    tokens_uniques = set(liste_tokens)

    return tokens_uniques

# Apply the preprocessing function
raw_questions_tags['tokens_uniques'] = raw_questions_tags['tokens'].apply(preprocess_text_2)

raw_questions_tags.describe()


Unnamed: 0,questions,tags,tokens,tokens_uniques
count,50000,50000,50000,50000
unique,50000,48252,50000,50000
top,I am building a webapp using Flask. I imported...,<javascript><jquery><html><css><twitter-bootst...,"[i, am, building, a, webapp, using, flask, i, ...","{passwordfield, recent, your, but, datetime, /..."
freq,1,49,1,1
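
The token sets above inherit whatever the punctuation filter in `preprocess_text_1` lets through. Since `string.punctuation` is a single string, `token not in string.punctuation` is a substring test: a multi-character token is dropped only if its characters appear contiguously in that string (for example `'()'`), while tokens such as `'...'` survive. The sketch below compares it with a stricter filter. The name `drop_punctuation_tokens` is hypothetical, and the token list is hand-written rather than produced by `nltk.word_tokenize`.

```python
import string

PUNCTUATION = set(string.punctuation)

def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one non-punctuation character.

    Hypothetical helper, for illustration only.
    """
    return [tok for tok in tokens if not all(ch in PUNCTUATION for ch in tok)]

# Hand-written tokens, including several punctuation-only ones
tokens = ['i', "'m", 'using', 'flask', '...', '()', '``', 'c++', '.']

# Filter used in preprocess_text_1: substring test against one string
substring_filter = [tok for tok in tokens if tok not in string.punctuation]
print(substring_filter)

# Stricter filter: drops every punctuation-only token, keeps "'m" and 'c++'
print(drop_punctuation_tokens(tokens))
```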


In [None]:
raw_questions_tags['nb_mots'] = raw_questions_tags['tokens'].apply(len)
raw_questions_tags['nb_mots_uniques'] = raw_questions_tags['tokens_uniques'].apply(len)

raw_questions_tags[['nb_mots', 'nb_mots_uniques']].describe()


Unnamed: 0,nb_mots,nb_mots_uniques
count,50000.0,50000.0
mean,180.64882,95.94906
std,189.014106,67.644658
min,6.0,6.0
25%,77.0,53.0
50%,128.0,79.0
75%,216.0,118.0
max,3700.0,2083.0
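
The two columns summarized above can be read as properties of a per-question bag of words: `nb_mots` is the total of the token counts and `nb_mots_uniques` is the number of distinct tokens. The sketch below illustrates this on a hand-written token list using `collections.Counter`. It is an illustration under that assumption, not the feature engineering used later in the project.

```python
from collections import Counter

# Hand-written tokens for one toy question
tokens = ['i', 'am', 'building', 'a', 'webapp', 'using', 'flask', 'i', 'imported', 'flask']

bag = Counter(tokens)        # token -> number of occurrences in this question

nb_mots = sum(bag.values())  # same as len(tokens)
nb_mots_uniques = len(bag)   # same as len(set(tokens))

print(nb_mots, nb_mots_uniques)
print(bag.most_common(2))
```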
