# **Automatically categorize questions**

## Part 3/8: Tag prediction, unsupervised approach

### <br> Keyword suggestion using LDA, with a 2D visualization of the topics

<br>


## Importing libraries, settings


In [1]:
import os, sys, random
# from zipfile import ZipFile
import numpy as np
import pandas as pd
from pandarallel import pandarallel

# Visualisation
import matplotlib.pyplot as plt
# import seaborn as sns
import plotly.express as px

# Feature engineering
# from sklearn.model_selection import train_test_split
# from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora
from gensim.models import LdaModel

# Modify if necessary
num_cores = os.cpu_count()
print(f"\nNumber of CPU cores: {num_cores}")
pandarallel.initialize(progress_bar=False, nb_workers=6)



Number of CPU cores: 8
INFO: Pandarallel will run on 6 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


### Functions


In [2]:
def get_missing_values(df):
    """Generates a DataFrame containing the count and proportion of missing values for each feature.

    Args:
        df (pandas.DataFrame): The input DataFrame to analyze.

    Returns:
        pandas.DataFrame: A DataFrame with columns for the feature name, count of missing values,
        count of non-missing values, proportion of missing values, and data type for each feature.
    """
    # Count the missing values for each column
    missing = df.isna().sum()

    # Calculate the percentage of missing values
    percent_missing = df.isna().mean() * 100

    # Create a DataFrame to store the results
    missings_df = pd.DataFrame({
        'column_name': df.columns,
        'missing': missing,
        'present': df.shape[0] - missing,  # Count of non-missing values
        'percent_missing': percent_missing.round(2),  # Rounded to 2 decimal places
        'type': df.dtypes
    })

    # Sort the DataFrame by the count of missing values
    missings_df.sort_values('missing', inplace=True)

    return missings_df

# with pd.option_context('display.max_rows', 1000):
#   display(get_missing_values(df))


def quick_look(df, miss=True):
    """
    Display a quick overview of a DataFrame, including shape, head, tail, unique values, and duplicates.

    Args:
        df (pandas.DataFrame): The input DataFrame to inspect.
        miss (bool, optional): Whether to display missing-value information (default is True).

    The function provides a summary of the DataFrame, including its shape, the first and last rows, the count of unique values per column, and the number of duplicates.
    If `miss` is True, it also displays missing value information.
    """
    print(f'shape : {df.shape}')

    display(df.head())
    display(df.tail())

    print('uniques :')
    display(df.nunique())

    print('Doublons ? ', df.duplicated(keep='first').sum(), '\n')

    if miss:
        display(get_missing_values(df))
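
As a quick sanity check, the two quantities at the core of `get_missing_values` can be computed on a toy DataFrame. This is only a sketch: the frame `toy` and its columns are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "a" has 1 missing value out of 4, column "b" has 2
toy = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": ["x", "y", None, None]})

# Same expressions as in get_missing_values
missing_count = toy.isna().sum()
missing_percent = (toy.isna().mean() * 100).round(2)

print(missing_count)
print(missing_percent)
```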



### End of preprocessing


In [3]:
# import

train = pd.read_csv('./../data/cleaned_data/2_bow_uniques/train_bow_uniques_title_nltk.csv', sep=',')
test = pd.read_csv('./../data/cleaned_data/2_bow_uniques/test_bow_uniques_title_nltk.csv', sep=',')

quick_look(train)


shape : (43016, 6753)


Unnamed: 0,CreationDate,title,body,all_tags,title_nltk,body_nltk,title_spacy,body_spacy,__attribute__,__bridge,...,zone,zoneddatetime,zoneid,zookeeper,zoom,zooming,zsh,zshrc,zuul,zxing
0,2013-08-23 23:28:22,How to implement a ViewPager with different Fr...,When I start an activity which implements view...,"['android', 'android-layout', 'android-fragmen...",implement viewpager fragment layout,implement viewpager fragment layouts start act...,implement,start activity implement viewpager create frag...,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2015-04-26 06:13:36,Cannot subscript a value of [AnyObject]? with ...,This is in a class extending PFQueryTableViewC...,"['ios', 'xcode', 'swift', 'parse-platform', 'x...",subscript value anyobject index type int,subscript value anyobject index type int class...,subscript value index type,class extend follow error row cast way subscri...,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2014-08-06 12:33:53,Equivalent to java packages in C#,"I have been looking for a way to make a ""packa...","['java', 'c#', 'eclipse', 'visual-studio-2013'...",equivalent java package c,equivalent java package c look way make folder...,package c,look way package folder studio express know pr...,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2014-06-05 18:35:37,How to use UIVisualEffectView to Blur Image?,Could someone give a small example of applying...,"['ios', 'objective-c', 'uiview', 'uikit', 'uiv...",use blur image,use blur image someone give example apply try ...,use uivisualeffectview,example apply blur image try figure code uivis...,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2013-06-28 11:53:56,How can I sort arrays and data in PHP?,\nThis question is intended as a reference for...,"['php', 'arrays', 'sorting', 'object', 'spl']",sort array data php,sort array data php question intend reference ...,sort array datum,question intend reference sort array think cas...,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,CreationDate,title,body,all_tags,title_nltk,body_nltk,title_spacy,body_spacy,__attribute__,__bridge,...,zone,zoneddatetime,zoneid,zookeeper,zoom,zooming,zsh,zshrc,zuul,zxing
43011,2017-02-24 13:38:36,How to fully dump / print variable to console ...,Hey there I am searching for a function which ...,"['javascript', 'dart', 'debugging', 'console',...",dump print console dart language,dump print console dart language hey search fu...,dump print variable console language,search function print variable console languag...,0,0,...,0,0,0,0,0,0,0,0,0,0
43012,2011-10-20 07:21:34,Is there a way to make a method which is not a...,Is there any way of forcing child classes to o...,"['java', 'inheritance', 'overriding', 'abstrac...",way make method,way make method override force child class nee...,way method,way force child class override method need cre...,0,0,...,0,0,0,0,0,0,0,0,0,0
43013,2012-09-11 11:34:25,Can I incorporate both SignalR and a RESTful API?,I have a single page web app developed using A...,"['asp.net', 'rest', 'web-applications', 'asp.n...",incorporate signalr api,incorporate signalr api page web app develop u...,incorporate signalr api,page web app develop convert method push base ...,0,0,...,0,0,0,0,0,0,0,0,0,0
43014,2021-03-23 19:24:04,How can i use php8 attributes instead of annot...,This is what I would like to use:\n#[ORM\Colum...,"['php', 'symfony', 'doctrine-orm', 'doctrine',...",use attribute annotation doctrine,use attribute annotation doctrine like column ...,use attribute annotation doctrine,like use string error annotate support miss,0,0,...,0,0,0,0,0,0,0,0,0,0
43015,2016-03-19 18:27:38,Localizing string resources added via build.gr...,This is in continuation to an answer which hel...,"['android', 'android-studio', 'android-gradle-...",localize string resource add build gradle use,localize string resource add build gradle use ...,localize string resource add build.gradle,continuation answer help post add string resou...,0,0,...,0,0,0,0,0,0,0,0,0,0


uniques :


CreationDate    43012
title           43015
body            43016
all_tags        41627
title_nltk      42537
                ...  
zooming             2
zsh                 2
zshrc               2
zuul                2
zxing               2
Length: 6753, dtype: int64

Doublons ?  0 



Unnamed: 0,column_name,missing,present,percent_missing,type
CreationDate,CreationDate,0,43016,0.00,object
priority,priority,0,43016,0.00,int64
printwriter,printwriter,0,43016,0.00,int64
printstacktrace,printstacktrace,0,43016,0.00,int64
println,println,0,43016,0.00,int64
...,...,...,...,...,...
function,function,0,43016,0.00,int64
gae,gae,0,43016,0.00,int64
zxing,zxing,0,43016,0.00,int64
title_nltk,title_nltk,1,43015,0.00,object


In [4]:
missing = train.loc[train['title_nltk'].isna() | train['title_spacy'].isna(), :]

print(missing.index)
display(missing)


Index([4532, 8280, 12992, 14957, 22934, 24964, 25950], dtype='int64')


Unnamed: 0,CreationDate,title,body,all_tags,title_nltk,body_nltk,title_spacy,body_spacy,__attribute__,__bridge,...,zone,zoneddatetime,zoneid,zookeeper,zoom,zooming,zsh,zshrc,zuul,zxing
4532,2014-10-13 16:31:47,Laravel Eloquent OR WHERE IS NOT NULL,I am using the Laravel Administrator package f...,"['php', 'sql', 'laravel', 'eloquent', 'adminis...",eloquent,eloquent use laravel administrator package sto...,,package story run issue display result delete ...,0,0,...,0,0,0,0,0,0,0,0,0,0
8280,2015-04-22 11:41:34,Why is FusedLocationApi.getLastLocation null,I am trying to get location by using FusedLoca...,"['android', 'android-4.4-kitkat', 'android-loc...",,null try get location use permission file andr...,,try location permission file use android reque...,0,0,...,0,0,0,0,0,0,0,0,0,0
12992,2013-08-09 14:16:44,Using IS NULL and COALESCE in OrderBy Doctrine...,I basically have the following (My)SQL-Query\n...,"['mysql', 'symfony', 'doctrine-orm', 'doctrine...",use coalesce doctrine,use coalesce doctrine query select order compa...,,follow address order company job target | doct...,0,0,...,0,0,0,0,0,0,0,0,0,0
14957,2016-08-17 23:26:55,Spring Boot multipartfile always null,I am using Spring Boot version = '1.4.0.RC1' w...,"['java', 'spring-mvc', 'spring-boot', 'retrofi...",spring boot multipartfile,spring boot multipartfile use version rc1 try ...,,version try use file upload controller info re...,0,0,...,0,0,0,0,0,0,0,0,0,0
22934,2014-03-27 21:18:08,Sqlite NULL and unique?,I noticed that I can have NULL values in colum...,"['sql', 'sqlite', 'null', 'unique', 'unique-co...",null unique,null notice value column constraint col genera...,,notice value column constraint generate issue ...,0,0,...,0,0,0,0,0,0,0,0,0,0
24964,2014-02-24 20:47:00,MVC HttpPostedFileBase always null,I have this controller and what I am trying to...,"['c#', 'asp.net', 'asp.net-mvc', 'asp.net-mvc-...",httppostedfilebase,httppostedfilebase controller try send image b...,,controller try send image byte product content...,0,0,...,0,0,0,0,0,0,0,0,0,0
25950,2012-11-28 01:42:30,Android Notification PendingIntent Extras null,I am trying to send information from notificat...,"['android', 'android-intent', 'bundle', 'andro...",notification pendingintent extra null,notification pendingintent try send informatio...,,try send information notification activity cod...,0,0,...,0,0,0,0,0,0,0,0,0,0
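
All seven titles above contain the word "null". A likely cause (an assumption, not something the notebook confirms) is that `pandas.read_csv` treats the literal strings `null` and `NULL` as missing values by default, so a cleaned title reduced to the single token `null` is read back as NaN. The sketch below reproduces this on a made-up two-row CSV and shows that `keep_default_na=False` keeps the string as-is.

```python
import io
import pandas as pd

# Hypothetical CSV: the first row's title_spacy is the single token "null"
csv_text = "title_nltk,title_spacy\nnull unique,null\nsort array,sort array\n"

# Default parsing: the lone "null" becomes NaN
default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False, no string is converted to NaN
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_default_missing = int(default["title_spacy"].isna().sum())
n_literal_missing = int(literal["title_spacy"].isna().sum())
print(n_default_missing, n_literal_missing)
```

Note that with `keep_default_na=False`, genuinely empty fields are read as empty strings rather than NaN, so this option changes how real gaps appear as well.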


In [6]:
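def fix_false_null_values(df):
    """Restore titles that consist of the single token 'null'.

    pandas.read_csv parses the literal string 'null' as NaN by default, so these
    titles show up as missing after loading. Modifies df in place.
    """
    df.loc[df['title_nltk'].isna(), 'title_nltk'] = 'null'
    df.loc[df['title_spacy'].isna(), 'title_spacy'] = 'null'


fix_false_null_values(train)
fix_false_null_values(test)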

# Check for null values in the entire DataFrame
null_values = train[train.isnull().any(axis=1)]

# Print the rows with null values
print(null_values)


Empty DataFrame
Columns: [CreationDate, title, body, all_tags, title_nltk, body_nltk, title_spacy, body_spacy, __attribute__, __bridge, ...]

In [7]:
quick_look(train)


shape : (43016, 6753)


Unnamed: 0,CreationDate,title,body,all_tags,title_nltk,body_nltk,title_spacy,body_spacy,__attribute__,__bridge,...,zone,zoneddatetime,zoneid,zookeeper,zoom,zooming,zsh,zshrc,zuul,zxing
0,2013-08-23 23:28:22,How to implement a ViewPager with different Fr...,When I start an activity which implements view...,"['android', 'android-layout', 'android-fragmen...",implement viewpager fragment layout,implement viewpager fragment layouts start act...,implement,start activity implement viewpager create frag...,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2015-04-26 06:13:36,Cannot subscript a value of [AnyObject]? with ...,This is in a class extending PFQueryTableViewC...,"['ios', 'xcode', 'swift', 'parse-platform', 'x...",subscript value anyobject index type int,subscript value anyobject index type int class...,subscript value index type,class extend follow error row cast way subscri...,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2014-08-06 12:33:53,Equivalent to java packages in C#,"I have been looking for a way to make a ""packa...","['java', 'c#', 'eclipse', 'visual-studio-2013'...",equivalent java package c,equivalent java package c look way make folder...,package c,look way package folder studio express know pr...,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2014-06-05 18:35:37,How to use UIVisualEffectView to Blur Image?,Could someone give a small example of applying...,"['ios', 'objective-c', 'uiview', 'uikit', 'uiv...",use blur image,use blur image someone give example apply try ...,use uivisualeffectview,example apply blur image try figure code uivis...,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2013-06-28 11:53:56,How can I sort arrays and data in PHP?,\nThis question is intended as a reference for...,"['php', 'arrays', 'sorting', 'object', 'spl']",sort array data php,sort array data php question intend reference ...,sort array datum,question intend reference sort array think cas...,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,CreationDate,title,body,all_tags,title_nltk,body_nltk,title_spacy,body_spacy,__attribute__,__bridge,...,zone,zoneddatetime,zoneid,zookeeper,zoom,zooming,zsh,zshrc,zuul,zxing
43011,2017-02-24 13:38:36,How to fully dump / print variable to console ...,Hey there I am searching for a function which ...,"['javascript', 'dart', 'debugging', 'console',...",dump print console dart language,dump print console dart language hey search fu...,dump print variable console language,search function print variable console languag...,0,0,...,0,0,0,0,0,0,0,0,0,0
43012,2011-10-20 07:21:34,Is there a way to make a method which is not a...,Is there any way of forcing child classes to o...,"['java', 'inheritance', 'overriding', 'abstrac...",way make method,way make method override force child class nee...,way method,way force child class override method need cre...,0,0,...,0,0,0,0,0,0,0,0,0,0
43013,2012-09-11 11:34:25,Can I incorporate both SignalR and a RESTful API?,I have a single page web app developed using A...,"['asp.net', 'rest', 'web-applications', 'asp.n...",incorporate signalr api,incorporate signalr api page web app develop u...,incorporate signalr api,page web app develop convert method push base ...,0,0,...,0,0,0,0,0,0,0,0,0,0
43014,2021-03-23 19:24:04,How can i use php8 attributes instead of annot...,This is what I would like to use:\n#[ORM\Colum...,"['php', 'symfony', 'doctrine-orm', 'doctrine',...",use attribute annotation doctrine,use attribute annotation doctrine like column ...,use attribute annotation doctrine,like use string error annotate support miss,0,0,...,0,0,0,0,0,0,0,0,0,0
43015,2016-03-19 18:27:38,Localizing string resources added via build.gr...,This is in continuation to an answer which hel...,"['android', 'android-studio', 'android-gradle-...",localize string resource add build gradle use,localize string resource add build gradle use ...,localize string resource add build.gradle,continuation answer help post add string resou...,0,0,...,0,0,0,0,0,0,0,0,0,0


uniques :


CreationDate    43012
title           43015
body            43016
all_tags        41627
title_nltk      42538
                ...  
zooming             2
zsh                 2
zshrc               2
zuul                2
zxing               2
Length: 6753, dtype: int64

Doublons ?  0 



Unnamed: 0,column_name,missing,present,percent_missing,type
CreationDate,CreationDate,0,43016,0.0,object
priority,priority,0,43016,0.0,int64
printwriter,printwriter,0,43016,0.0,int64
printstacktrace,printstacktrace,0,43016,0.0,int64
println,println,0,43016,0.0,int64
...,...,...,...,...,...
functools,functools,0,43016,0.0,int64
functionality,functionality,0,43016,0.0,int64
function,function,0,43016,0.0,int64
gauge,gauge,0,43016,0.0,int64


In [8]:
index = [4532, 8280, 12992, 14957, 22934, 24964, 25950]

display(train.loc[train.index.isin(index), :])

# OK


Unnamed: 0,CreationDate,title,body,all_tags,title_nltk,body_nltk,title_spacy,body_spacy,__attribute__,__bridge,...,zone,zoneddatetime,zoneid,zookeeper,zoom,zooming,zsh,zshrc,zuul,zxing
4532,2014-10-13 16:31:47,Laravel Eloquent OR WHERE IS NOT NULL,I am using the Laravel Administrator package f...,"['php', 'sql', 'laravel', 'eloquent', 'adminis...",eloquent,eloquent use laravel administrator package sto...,,package story run issue display result delete ...,0,0,...,0,0,0,0,0,0,0,0,0,0
8280,2015-04-22 11:41:34,Why is FusedLocationApi.getLastLocation null,I am trying to get location by using FusedLoca...,"['android', 'android-4.4-kitkat', 'android-loc...",,null try get location use permission file andr...,,try location permission file use android reque...,0,0,...,0,0,0,0,0,0,0,0,0,0
12992,2013-08-09 14:16:44,Using IS NULL and COALESCE in OrderBy Doctrine...,I basically have the following (My)SQL-Query\n...,"['mysql', 'symfony', 'doctrine-orm', 'doctrine...",use coalesce doctrine,use coalesce doctrine query select order compa...,,follow address order company job target | doct...,0,0,...,0,0,0,0,0,0,0,0,0,0
14957,2016-08-17 23:26:55,Spring Boot multipartfile always null,I am using Spring Boot version = '1.4.0.RC1' w...,"['java', 'spring-mvc', 'spring-boot', 'retrofi...",spring boot multipartfile,spring boot multipartfile use version rc1 try ...,,version try use file upload controller info re...,0,0,...,0,0,0,0,0,0,0,0,0,0
22934,2014-03-27 21:18:08,Sqlite NULL and unique?,I noticed that I can have NULL values in colum...,"['sql', 'sqlite', 'null', 'unique', 'unique-co...",null unique,null notice value column constraint col genera...,,notice value column constraint generate issue ...,0,0,...,0,0,0,0,0,0,0,0,0,0
24964,2014-02-24 20:47:00,MVC HttpPostedFileBase always null,I have this controller and what I am trying to...,"['c#', 'asp.net', 'asp.net-mvc', 'asp.net-mvc-...",httppostedfilebase,httppostedfilebase controller try send image b...,,controller try send image byte product content...,0,0,...,0,0,0,0,0,0,0,0,0,0
25950,2012-11-28 01:42:30,Android Notification PendingIntent Extras null,I am trying to send information from notificat...,"['android', 'android-intent', 'bundle', 'andro...",notification pendingintent extra null,notification pendingintent try send informatio...,,try send information notification activity cod...,0,0,...,0,0,0,0,0,0,0,0,0,0


## LDA


In [9]:
from gensim import corpora
from gensim.matutils import Sparse2Corpus
from scipy.sparse import csr_matrix
import numpy as np

# Identify numeric columns (assuming all numeric columns, and only those, hold BoW counts)
numeric_columns = train.select_dtypes(include=np.number).columns

# Extract only numeric columns
bow_train = train[numeric_columns]

print(bow_train.shape)

# Convert the DataFrame to a Gensim corpus: Sparse2Corpus expects a SciPy sparse matrix,
# and rows are documents here, hence documents_columns=False
corpus_train = Sparse2Corpus(csr_matrix(bow_train.values), documents_columns=False)

# Build the Gensim Dictionary from the column names (column position = token id).
# corpora.Dictionary(...) expects lists of string tokens, not (token, count) tuples.
dictionary = corpora.Dictionary.from_corpus(corpus_train, id2word=dict(enumerate(numeric_columns)))


(43016, 6745)

In [None]:
# Identify numeric columns (assuming all numeric columns, and only those, hold BoW counts)
numeric_columns = train.select_dtypes(include='number').columns

# Extract only numeric columns
bow_train = train[numeric_columns]
bow_test = test[numeric_columns]

print(bow_train.shape)
print(bow_test.shape)


In [None]:
from scipy.sparse import csr_matrix

# MmCorpus reads a Matrix Market file from disk, so wrap the in-memory matrix with Sparse2Corpus instead.
# Rows are documents, hence documents_columns=False.
corpus_test = Sparse2Corpus(csr_matrix(bow_test.values), documents_columns=False)


In [None]:
# Build the Gensim Dictionary from the BoW column names (column position = token id)
dictionary_train = corpora.Dictionary.from_corpus(corpus_train, id2word=dict(enumerate(numeric_columns)))

# Train the LDA model on the training set; num_topics=5 is a placeholder to tune
lda_model = LdaModel(corpus=corpus_train, id2word=dictionary_train, num_topics=5)

# The test set is already vectorized with the same vocabulary (corpus_test),
# so no further preprocessing is needed before evaluation.
# log_perplexity returns a per-word likelihood bound (higher is better).
test_log_perplexity = lda_model.log_perplexity(corpus_test)
print(f"Per-word likelihood bound on test set: {test_log_perplexity}")

# Get the topics
topics = lda_model.print_topics()

# Print the topics
for topic in topics:
    print(topic)