# Notebook Predictions

Contexte du projet : 

Votre objectif dans ce projet est de créer un outil qui utilise des techniques de traitement de texte pour répondre aux besoin de votre client. 

Projet : 

- Visualize and predict clusters from historical search trends
- Goal : Build a model able to perform clusters from trending topics
- Difficulty : use word embedding and t-SNE

In [56]:
# Import des librairies 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

# plot en 3D 
import plotly.graph_objs as go
import plotly.express as px

# preprocessing data
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from gensim.models import Word2Vec

# clustering models
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

# metrics
from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples

In [57]:
# Import des données
df = pd.read_csv('data/gsearch_jobs.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,index,title,company_name,location,via,description,extensions,job_id,thumbnail,...,commute_time,salary_pay,salary_rate,salary_avg,salary_min,salary_max,salary_hourly,salary_yearly,salary_standardized,description_tokens
0,0,0,Data Analyst (Risk Adjustment Consulting Resea...,"Cambia Health Solutions, Inc",United States,via Datafloq,Are you looking for a new job? Check out this ...,"['3 hours ago', 'Full-time', 'No degree mentio...",eyJqb2JfdGl0bGUiOiJEYXRhIEFuYWx5c3QgKFJpc2sgQW...,,...,,,,,,,,,,[]
1,1,1,DATA ANALYST II,Lumen,United States,via ComputerJobs.com,About Lumen\nLumen is guided by our belief tha...,"['17 hours ago', 'Full-time', 'No degree menti...",eyJqb2JfdGl0bGUiOiJEQVRBIEFOQUxZU1QgSUkiLCJodG...,,...,,,,,,,,,,"['excel', 'sql', 'powerpoint', 'power_bi', 'sh..."
2,2,2,Data Analyst - Swisslog,Swisslog,United States,via Swisslog,"Data Analyst Mason, Ohio With guidance from se...","['4 hours ago', 'Full-time', 'Health insurance...",eyJqb2JfdGl0bGUiOiJEYXRhIEFuYWx5c3QgLSBTd2lzc2...,https://encrypted-tbn0.gstatic.com/images?q=tb...,...,,,,,,,,,,"['python', 'r', 'sql', 'powerpoint', 'word', '..."
3,3,3,Data Analyst - Secret clearance - Remote Remot...,General Dynamics Information Technology,Anywhere,via Clearance Jobs,REQ#: RQ135670 Travel Required: None Public Tr...,"['11 hours ago', 'Work from home', 'Full-time'...",eyJqb2JfdGl0bGUiOiJEYXRhIEFuYWx5c3QgLSBTZWNyZX...,https://encrypted-tbn0.gstatic.com/images?q=tb...,...,,,,,,,,,,"['t-sql', 'pl/sql', 'sql']"
4,4,4,Collections Data Analyst (921071),Purpose Financial,United States,via Jobs At Purpose Financial / Advance Americ...,"Address : 135 N Church Street, Spartanburg, So...","['20 hours ago', 'Full-time', 'Health insuranc...",eyJqb2JfdGl0bGUiOiJDb2xsZWN0aW9ucyBEYXRhIEFuYW...,,...,,,,,,,,,,"['python', 'r', 'sas', 'sql']"


## EDA

### Basics infos

In [58]:
df.shape

(17977, 27)

In [59]:
df.columns

Index(['Unnamed: 0', 'index', 'title', 'company_name', 'location', 'via',
       'description', 'extensions', 'job_id', 'thumbnail', 'posted_at',
       'schedule_type', 'work_from_home', 'salary', 'search_term', 'date_time',
       'search_location', 'commute_time', 'salary_pay', 'salary_rate',
       'salary_avg', 'salary_min', 'salary_max', 'salary_hourly',
       'salary_yearly', 'salary_standardized', 'description_tokens'],
      dtype='object')

### NaN

In [60]:
def count_nan(df):
    
    nan_counts = df.isna().sum() # compte le nombre de NaN pour chaque colonne
    total_counts = len(df) # compte le nombre total de données dans le dataframe
    nan_percentages = (nan_counts / total_counts) * 100 # calcule le pourcentage de NaN pour chaque colonne
    result_df = pd.concat([nan_counts, nan_percentages], axis=1) # combine les deux séries en un dataframe
    result_df.columns = ['NaN Count', 'NaN Percentage'] # renomme les colonnes du nouveau dataframe
    return result_df

In [61]:
df_NaN = count_nan(df)
df_NaN = df_NaN.sort_values(by = ['NaN Count'], ascending = False)
# df_NaN = df_NaN.loc[df_NaN['NaN Count'] != 0]
df_NaN

Unnamed: 0,NaN Count,NaN Percentage
commute_time,17977,100.0
salary_yearly,16525,91.923013
salary_hourly,15966,88.813484
salary_max,14720,81.882405
salary_min,14720,81.882405
salary,14508,80.703121
salary_standardized,14508,80.703121
salary_avg,14508,80.703121
salary_rate,14508,80.703121
salary_pay,14508,80.703121


Observations : 
- 27 colonnes 
- 17977 rows 
- 
- 12 colonnes ont plus de 50% de NaN

Conclusion : 
- Enlever les colonnes qui ont plus de 50% de NaN
- Faire un dropna() pour enlevr les lignes restantes qui contiennent des NaN

In [62]:
# supprimer les colonnes qui ont trop de NaN
def no_NaN(df, treshold):
    
    nan_counts = df.isna().sum() # compte le nombre de NaN pour chaque colonne
    total_counts = len(df) # compte le nombre total de données dans le dataframe
    nan_percentages = (nan_counts / total_counts) * 100 # calcule le pourcentage de NaN pour chaque colonne
    nan_treshold = nan_percentages[nan_percentages.values < treshold]
    
    return df[nan_treshold.index]

In [63]:
# df version 2
df_v2 = no_NaN(df, 50)
df_v2 = df_v2.dropna()
df_v2.isnull().sum()

Unnamed: 0            0
index                 0
title                 0
company_name          0
location              0
via                   0
description           0
extensions            0
job_id                0
posted_at             0
schedule_type         0
search_term           0
date_time             0
search_location       0
description_tokens    0
dtype: int64

In [64]:
df_v2.shape

(17847, 15)

Observations : 
- df_v2 a 15 colonnes
- et 17847 lignes

### duplicates

In [65]:
df_v2.duplicated().sum()

0

Observation : 
- pas de duplicates

### Features

L'objectif ici est d'avoir un meilleur aperçu des features du dataset pour pouvoir sélectionner les plus pertinentes

In [66]:
df_v2.head()

Unnamed: 0.1,Unnamed: 0,index,title,company_name,location,via,description,extensions,job_id,posted_at,schedule_type,search_term,date_time,search_location,description_tokens
0,0,0,Data Analyst (Risk Adjustment Consulting Resea...,"Cambia Health Solutions, Inc",United States,via Datafloq,Are you looking for a new job? Check out this ...,"['3 hours ago', 'Full-time', 'No degree mentio...",eyJqb2JfdGl0bGUiOiJEYXRhIEFuYWx5c3QgKFJpc2sgQW...,3 hours ago,Full-time,data analyst,2023-01-02 04:00:10.087160,United States,[]
1,1,1,DATA ANALYST II,Lumen,United States,via ComputerJobs.com,About Lumen\nLumen is guided by our belief tha...,"['17 hours ago', 'Full-time', 'No degree menti...",eyJqb2JfdGl0bGUiOiJEQVRBIEFOQUxZU1QgSUkiLCJodG...,17 hours ago,Full-time,data analyst,2023-01-02 04:00:12.552732,United States,"['excel', 'sql', 'powerpoint', 'power_bi', 'sh..."
2,2,2,Data Analyst - Swisslog,Swisslog,United States,via Swisslog,"Data Analyst Mason, Ohio With guidance from se...","['4 hours ago', 'Full-time', 'Health insurance...",eyJqb2JfdGl0bGUiOiJEYXRhIEFuYWx5c3QgLSBTd2lzc2...,4 hours ago,Full-time,data analyst,2023-01-02 04:00:12.552732,United States,"['python', 'r', 'sql', 'powerpoint', 'word', '..."
3,3,3,Data Analyst - Secret clearance - Remote Remot...,General Dynamics Information Technology,Anywhere,via Clearance Jobs,REQ#: RQ135670 Travel Required: None Public Tr...,"['11 hours ago', 'Work from home', 'Full-time'...",eyJqb2JfdGl0bGUiOiJEYXRhIEFuYWx5c3QgLSBTZWNyZX...,11 hours ago,Full-time,data analyst,2023-01-02 04:00:12.552732,United States,"['t-sql', 'pl/sql', 'sql']"
4,4,4,Collections Data Analyst (921071),Purpose Financial,United States,via Jobs At Purpose Financial / Advance Americ...,"Address : 135 N Church Street, Spartanburg, So...","['20 hours ago', 'Full-time', 'Health insuranc...",eyJqb2JfdGl0bGUiOiJDb2xsZWN0aW9ucyBEYXRhIEFuYW...,20 hours ago,Full-time,data analyst,2023-01-02 04:00:14.406611,United States,"['python', 'r', 'sas', 'sql']"


In [67]:
df_v2.columns

Index(['Unnamed: 0', 'index', 'title', 'company_name', 'location', 'via',
       'description', 'extensions', 'job_id', 'posted_at', 'schedule_type',
       'search_term', 'date_time', 'search_location', 'description_tokens'],
      dtype='object')

In [68]:
# suppression de Unnamed:0
df_v2 = df_v2.drop(["Unnamed: 0"], axis = 1)
df_v2.shape

(17847, 14)

In [69]:
# fonction qui print les valeurs unique pour toutes chaque features 
colonnes = ['index', 'title', 'company_name', 'location', 'via',
       'description', 'extensions', 'posted_at', 'schedule_type',
       'search_term', 'date_time', 'search_location', 'description_tokens']
for i in colonnes:
    print(f"************************{i}************************")
    print(len(df_v2[i].unique()))

************************index************************
3172
************************title************************
6971
************************company_name************************
4888
************************location************************
491
************************via************************
379
************************description************************
12635
************************extensions************************
3271
************************posted_at************************
75
************************schedule_type************************
4
************************search_term************************
1
************************date_time************************
1877
************************search_location************************
1
************************description_tokens************************
3999


In [70]:
colonnes_petit = ["schedule_type", "search_term", "search_location"]
for i in colonnes_petit:
    print(f"************************{i}************************")
    print(df_v2[i].unique())

************************schedule_type************************
['Full-time' 'Internship' 'Contractor' 'Part-time']
************************search_term************************
['data analyst']
************************search_location************************
['United States']


Observation :
- schedule_type : type de contrat ('Full-time', 'Internship', 'Contractor')
- Search_term : la recherche qui a été faite pour accéder aux offres de poste de Data Analyse
- search_location : la recherche qui a été faite pour accéder aux offre de poste aux Etats Unis

In [71]:
# colonne description_tokens avec les valeurs pour les 20 premières lignes
df_v2["description_tokens"].unique()[:20]

array(['[]', "['excel', 'sql', 'powerpoint', 'power_bi', 'sharepoint']",
       "['python', 'r', 'sql', 'powerpoint', 'word', 'power_bi', 'excel', 'tableau']",
       "['t-sql', 'pl/sql', 'sql']", "['python', 'r', 'sas', 'sql']",
       "['go', 'excel', 'word', 'python', 'r']",
       "['excel', 'snowflake', 'tableau', 'python', 'r', 'azure', 'sql', 'power_bi']",
       "['excel', 'word', 'javascript', 'spss', 'jira', 'sas', 'sql', 'powerpoint', 'mysql']",
       "['sas', 'sql', 'power_bi']",
       "['excel', 'word', 'tableau', 'spss', 'sas', 'sql', 'powerpoint']",
       "['excel', 'tableau', 'python', 'mysql']", "['excel', 'power_bi']",
       "['go']", "['excel', 'tableau', 'spss']",
       "['sas', 'spreadsheet', 'r', 'sql', 'power_bi', 'excel', 'snowflake', 'tableau']",
       "['sql', 'excel', 'tableau']", "['excel', 'tableau']",
       "['python', 'r', 'sql', 'looker', 'tableau']",
       "['c', 'python', 'sql']",
       "['excel', 'c', 'dax', 'tableau', 'python', 'r', 'alteryx

In [72]:
# Colonne description avec les valeurs pour les 5 premières lignes
df_v2["description"].unique()[2]

'Data Analyst Mason, Ohio With guidance from senior Design and Consulting team members develop data driven solutions based on Swisslog products using client data. Interpret client’s data to give key insights into the day to day needs of their warehouses and distribution centers. Make an impact With guidance from senior Design and Consulting team members develop data driven solutions based on... Swisslog products using client data. Interpret client’s data to give key insights into the day to day needs of their warehouses and distribution centers.\n\nData-analysis using tools like Excel, Python, Power BI, Tableau, SQL\n\nMathematical formulas commonly used in material handling business\n\nDeveloping analytical tools and processes for efficient data analysis\n\nUnderstand basics in material handling systems (will get training\n\nUnderstand typical warehouse operations (will get training)\n\nBring to the team Bachelor’s in Business, Applied Mathematics, Engineering or equivalent.\n\nExperi

Observations : 
- description_tokens : tous les outils nécessaires pour l'offre de poste, qui seront utilisé par le futur salarié
- description : offre de poste qui a été posté, en texte

Conclusion : 
- Pour le model on va garder ces colonnes : ["description", "schedule_type", "Search_term", "search_location", "description_tokens", "date_time"]
- Toutes ces colonnes sont sous forme de liste 
- Preparer ces colonnes pour quelles soient prête à être envoyé au modèle
- Peut-être séparer date_time en 3 colonnes : ["YEAR", "MONTH", "DAY"]

### Représentation des données avec T-SNE

In [73]:
# # Construction des phrases
# sentences = [description.split() for description in df_v2['description']]

# # Entraînement du modèle Word2Vec
# model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# # Obtenir les vecteurs d'embedding pour chaque mot
# word_vectors = model.wv

# # Réduction de dimension avec t-SNE
# tsne = TSNE(n_components=3, random_state=42)
# vectors_3d = tsne.fit_transform(word_vectors.vectors)

# # Création des données pour le plot 3D interactif
# trace = go.Scatter3d(
#     x=vectors_3d[:, 0],
#     y=vectors_3d[:, 1],
#     z=vectors_3d[:, 2],
#     mode='markers',
#     text=list(word_vectors.key_to_index.keys()),
#     hoverinfo='text',
# )

# data = [trace]

# # Configuration du layout
# layout = go.Layout(
#     margin=dict(l=0, r=0, b=0, t=0),
#     hovermode='closest',
# )

# # Création de la figure
# fig = go.Figure(data=data, layout=layout)

# # Affichage du plot interactif en 3D
# fig.show()

### Prétraitement des données

In [74]:
# Conversion de la colonne 'date_time' en datetime
df_v2['date_time'] = pd.to_datetime(df_v2['date_time'])

In [75]:
# Extraction de l'année, du mois et du jour dans des colonnes séparées
df_v2['YEAR'] = df_v2['date_time'].dt.year
df_v2['MONTH'] = df_v2['date_time'].dt.month
df_v2['DAY'] = df_v2['date_time'].dt.day

In [76]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/selmane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/selmane/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [77]:
# Prétraitement des données
def preprocess_text(text):
    # Suppression des caractères spéciaux et de la ponctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Conversion en minuscules
    text = text.lower()
    
    # Suppression des mots vides
    stop_words = set(stopwords.words('english'))  # Remplacez 'your_language' par votre langue (par exemple, 'english' pour l'anglais)
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatisation
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Rejoindre les tokens prétraités en une seule chaîne
    processed_text = ' '.join(tokens)
    
    return processed_text

In [78]:
df_v2["description_tokens"] = df_v2["description_tokens"].apply(preprocess_text)

In [79]:
# Appliquer la fonction de prétraitement à la colonne 'description'
df_v2['description'] = df_v2['description'].apply(preprocess_text)

In [80]:
df_v2 = df_v2.to_csv("data/df_v2.csv", index=False)

In [81]:
# sample of df_v2
df_v3 = df_v2.sample(n=3000, random_state=42)
df_v3.shape

AttributeError: 'NoneType' object has no attribute 'sample'

In [None]:
# Préparation des données
text_data = df_v3['description'].values
schedule_type_data = df_v3['schedule_type'].values
search_term_data = df_v3['search_term'].values
search_location_data = df_v3['search_location'].values
description_tokens_data = df_v3['description_tokens'].values
year_data = df_v3['YEAR'].values
month_data = df_v3['MONTH'].values

# Préparation des autres features
text_data = np.array(text_data)
schedule_type_data = np.array(schedule_type_data)
search_term_data = np.array(search_term_data)
search_location_data = np.array(search_location_data)
description_tokens_data = np.array(description_tokens_data)
year_data = np.array(year_data)
month_data = np.array(month_data)

### Deep Learning

#### 1 MODEL 

L'objectif ici est de créer un modèle capable de générer du texte.

En effet, l'utilisateur entrera des informations spécifiques (ex: outils qu'il connait, préférence de contrat...) et notre modèle écrira et lui présentera une offre d'emploie typique du marché qui lui correspondra. 

#### MODEL 2

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.metrics import Mean
import pickle

In [82]:
# Preprocessing
text_data = df_v3['description'].values
input_sequences = []
target_sequences = []

# Prepare input and target sequences
for (schedule_type, search_term, search_location, description_tokens, year, month, job_offer) in zip(schedule_type_data, search_term_data, search_location_data, description_tokens_data, year_data, month_data, text_data):
    input_sequence = f"{schedule_type} {search_term} {search_location} {description_tokens} {str(year)} {str(month)}"
    target_sequence = f"<start> {job_offer} <end>"
    input_sequences.append(input_sequence)
    target_sequences.append(target_sequence)

# Initialize Tokenizer
tokenizer = Tokenizer()

# Fit Tokenizer on data
tokenizer.fit_on_texts(input_sequences + target_sequences)

# Convert sequences to tokenized sequences
input_sequences = tokenizer.texts_to_sequences(input_sequences)
target_sequences = tokenizer.texts_to_sequences(target_sequences)

# Find maximum lengths
max_sequence_length = max([len(seq) for seq in input_sequences + target_sequences])

# Pad sequences
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='post')
target_sequences = pad_sequences(target_sequences, maxlen=max_sequence_length, padding='post')

# Get the vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# Define the input layers
encoder_inputs = Input(shape=(max_sequence_length,))
decoder_inputs = Input(shape=(max_sequence_length-1,))

# Define model architecture
latent_dim = 6 # Dimensionality of the latent space
embedding = Embedding(vocab_size, latent_dim)

encoder_embedding = embedding(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

decoder_embedding = embedding(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Create the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Train the model
target_sequences_input = target_sequences[:, :-1]
target_sequences_output = target_sequences[:, 1:]
model.fit([input_sequences, target_sequences_input], target_sequences_output, epochs=5, batch_size=8)

2023-05-25 23:48:26.416761: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-05-25 23:48:26.420548: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-05-25 23:48:26.422738: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

Epoch 1/5


2023-05-25 23:48:27.440305: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-05-25 23:48:27.443555: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-05-25 23:48:27.445993: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f97db431f70>

## Tokenize save

In [86]:
# download with pickle

# open a file, where you ant to store the data
file_1 = open("tokenizer_2.pkl", "wb")

# dump information to that file
pickle.dump(tokenizer, file_1)

# close the file
file_1.close()

## Model save

- 1st model : Epoch 2/2, loss: 8.6932
- 2nd model : Epoch 5/5, loss: 0.9620

In [85]:
# download with pickle

# open a file, where you ant to store the data
file = open("model_2.pkl", "wb")

# dump information to that file
pickle.dump(model, file)

# close the file
file.close()

In [None]:
# # open a file, where you stored the pickled data
# file = open('important', 'rb')

# # dump information to that file
# data = pickle.load(file)

# # close the file
# file.close()

In [46]:
# Generate text
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

def generate_text(input_sequence):
    states_value = encoder_model.predict(input_sequence)
    target_sequence = np.zeros((1, 1))  # Start with empty target sequence
    confidence_threshold = 0.00011

    stop_condition = False
    generated_text = []

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_sequence] + states_value)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])

        # Check if the sampled token index exists in the vocabulary
        if sampled_token_index in tokenizer.index_word:
            sampled_word = tokenizer.index_word[sampled_token_index]
            generated_text.append(sampled_word)
            print(np.max(output_tokens))
        else:
            # Handle the case when the token index is not found
            remaining_indices = set(range(len(tokenizer.index_word))) - {0}  # Exclude the unknown token
            sampled_token_index = np.random.choice(list(remaining_indices))
            sampled_word = tokenizer.index_word[sampled_token_index]
            generated_text.append(sampled_word)
            print(np.max(output_tokens))

        # Update the stop condition based on the generated token
#       if sampled_word == '<end>' or np.max(output_tokens) < confidence_threshold:
        if sampled_word == '<end>' or len(generated_text) > 100:
            stop_condition = True
            
        target_sequence = np.array([[sampled_token_index]])  # Update the target sequence
        states_value = [h, c]

    return ' '.join(generated_text)

2023-05-25 22:48:29.243973: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-05-25 22:48:29.270287: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-05-25 22:48:29.285224: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

{schedule_type} {search_term} {search_location} {description_tokens} {str(year)} {str(month)}

In [48]:
# Get user inputs
feature_1 = input("Enter schedule type: ")
feature_2 = input("Enter search term: ")
feature_3 = input("Enter search location: ")
feature_4 = input("Enter description tokens: ")
feature_5 = input("Enter year: ")
feature_6 = input("Enter month: ")

Enter schedule type: full-time
Enter search term: data analysis
Enter search location: United-State
Enter description tokens: python, scala, excel 
Enter year: 2022
Enter month: 12


In [49]:
# Combine features into input sequence
input_sequence = f"{feature_1} {feature_2} {feature_3} {feature_4} {feature_5} {feature_6}"

# Convert string features to tokenized sequences
input_sequence_sequence = tokenizer.texts_to_sequences([input_sequence])

# Pad the sequence
input_sequence_padded = pad_sequences(np.array(input_sequence_sequence), maxlen=max_sequence_length, padding='post')

# Generate text
generated_text = generate_text(input_sequence_padded)
print(generated_text)

2023-05-25 22:50:34.016200: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-05-25 22:50:34.023515: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-05-25 22:50:34.029114: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus



2023-05-25 22:50:36.822279: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-05-25 22:50:36.831025: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-05-25 22:50:36.836514: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

3.2794276e-05


2023-05-25 22:50:38.239937: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_2_grad/concat/split_2/split_dim' with dtype int32
	 [[{{node gradients/split_2_grad/concat/split_2/split_dim}}]]
2023-05-25 22:50:38.247372: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'gradients/split_grad/concat/split/split_dim' with dtype int32
	 [[{{node gradients/split_grad/concat/split/split_dim}}]]
2023-05-25 22:50:38.252326: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You mus

3.2778455e-05
3.2770247e-05
3.278526e-05
3.2775406e-05
3.2777818e-05
3.2805237e-05
3.2792304e-05
3.278955e-05
3.2786866e-05
3.278119e-05
3.2773387e-05
3.27761e-05
3.276312e-05
3.2789285e-05
3.2777538e-05
3.2761887e-05
3.277477e-05
3.278235e-05
3.2803393e-05
3.279483e-05
3.2785185e-05
3.2782616e-05
3.279701e-05
3.2811313e-05
3.281602e-05
3.2796557e-05
3.2787753e-05
3.278954e-05
3.27942e-05
3.279025e-05
3.2777458e-05
3.277526e-05
3.2777232e-05
3.280401e-05
3.2799224e-05
3.2793294e-05
3.278876e-05
3.2774027e-05
3.2765103e-05
3.2782846e-05
3.2778666e-05
3.278987e-05
3.279754e-05
3.2794855e-05
3.2773383e-05
3.278163e-05
3.279609e-05
3.2811222e-05
3.2789925e-05
3.2776123e-05
3.2782074e-05
3.2783086e-05
3.2775813e-05
3.276838e-05
3.2777283e-05
3.2791366e-05
3.2775417e-05
3.2777843e-05
3.2797714e-05
3.2782667e-05
3.278e-05
3.278574e-05
3.2780208e-05
3.2791286e-05
3.2811007e-05
3.2808435e-05
3.280981e-05
3.2795328e-05
3.27917e-05
3.2787204e-05
3.2795735e-05
3.2796674e-05
3.278511e-05
3.2785574e

In [51]:
# Define the input layer for the encoder
encoder_states

[<KerasTensor: shape=(None, 6) dtype=float32 (created by layer 'lstm')>,
 <KerasTensor: shape=(None, 6) dtype=float32 (created by layer 'lstm')>]

In [52]:
max_sequence_length

2786