NLP-Project

Welcome to the NLP Project repository!
This project was developed as part of the Natural Language Processing course at Politecnico di Milano.

Subject: 088946 - Natural Language Processing

Professor: Carman Mark James

Academic Year: 2024/2025

Overview

This repository contains code, notebooks, demos, and resources related to various Natural Language Processing (NLP) tasks and techniques explored throughout the course. Key topics include:

  • Text Classification
  • Clustering & Topic Modeling
  • Audio Processing
  • Generative Models
  • Retrieval-Augmented Generation (RAG)
  • Dataset Analysis & Visualization

A major focus of the project was the analysis of a large collection of recipes. We built and evaluated multiple models capable of:

  • Retrieving recipes based on ingredients, cuisine, or dish name
  • Generating new recipes using advanced generative techniques
  • Exploring different approaches to recipe recommendation and content synthesis

Interface screenshots

[Screenshot: CLIP chatbot]

[Screenshot: Generative chatbot]

Project Structure

NLP-Project/
├── demo/               # Python demo scripts for generative and RAG models
│   ├── Demo_Generative.py
│   └── Demo_RAG.py
│
├── notebooks/          # Main Jupyter notebooks for each topic
│   ├── Audio_Processing.ipynb
│   ├── Classification.ipynb
│   ├── Clustering_and_Topic_Modeling.ipynb
│   └── ...
│
├── test_notebooks/     # Supplementary notebooks for testing and analysis
│   ├── chatbot_generator.ipynb
│   ├── dataset_analysis.ipynb
│   └── ...
│
├── HTML/               # HTML exports of the main notebooks
│   ├── Audio_Processing.html
│   ├── Classification.html
│   └── ...
│
├── videos/             # Demo videos showcasing project features
│   ├── Demo_Gen.mp4
│   ├── Demo_RAG.mp4
│   └── NLP_Project_final.mp4
│
└── README.md           # Project documentation (this file)

Getting Started

  1. Clone the repository:

    git clone <repository-url>
    cd NLP-Project
  2. Set up your environment: It is recommended to use a virtual environment (e.g., venv or conda). Install the required dependencies as specified in the relevant notebooks or scripts.

  3. Explore the notebooks: Open the Jupyter notebooks in the notebooks/ folder to explore the main topics and experiments.

  4. Run the demos: The demo/ folder contains Python scripts demonstrating generative and retrieval-augmented generation models.

Features

  • Interactive Notebooks: Each topic is explored through well-documented Jupyter notebooks.
  • Demo Scripts: Ready-to-run Python scripts for key models and pipelines.
  • Exported HTML: Notebooks are also available as HTML files for easy viewing.
  • Videos: Demo videos showcase the interface and results of the project.

Description of the Notebooks

There are 8 notebooks, each highlighting a particular analysis of the dataset.

Preliminary Analysis

This notebook presents some preliminary analyses of the RecipeNLG dataset, to understand how it is structured, which tasks can be performed on it, and how.
The dataset contains a large number of recipes, with features describing what each recipe is made of and how it can be cooked. It is a very versatile dataset: we identified many tasks it can support, such as classification, NER (Named Entity Recognition), clustering, RAG, and generative chatbots.
The notebook is Preliminary Analysis.
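
As a flavor of what the notebook does, here is a minimal pandas sketch of a first look at the data. The file name and column names are assumptions based on the public RecipeNLG release and may not match the notebook exactly.

    # Minimal first-pass inspection of RecipeNLG.
    # "full_dataset.csv" and the column names below are assumptions.
    import pandas as pd

    df = pd.read_csv("full_dataset.csv")

    print(df.shape)                     # number of recipes and columns
    print(df.columns.tolist())          # e.g. title, ingredients, directions, link, source, NER
    print(df["source"].value_counts())  # how many recipes come from each source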

[Screenshots: preliminary analysis of the dataset]

Classification

We perform a classification task on the dataset. In particular, it is a binary classification task: we classify each recipe according to the "source" column, i.e. whether the recipe was gathered from the web or taken from the Recipe1M dataset.
To do this task, we use several Machine Learning methods and compare them to highlight which performs best, showing the confusion matrix for each and using a train-validation-test pipeline.
The notebook is Classification.
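
A minimal sketch of such a pipeline, assuming a DataFrame df with the recipe text in a "directions" column and the label in "source"; the notebook compares several classifiers, and logistic regression here is just one illustrative choice.

    # Hypothetical minimal train-validation-test pipeline over TF-IDF features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    # hold out a test set, then carve a validation set from the remainder
    X_tmp, X_test, y_tmp, y_test = train_test_split(
        df["directions"], df["source"], test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_tmp, y_tmp, test_size=0.25, random_state=42)

    vec = TfidfVectorizer(max_features=50000)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(X_train), y_train)

    print("validation accuracy:", clf.score(vec.transform(X_val), y_val))
    print(confusion_matrix(y_test, clf.predict(vec.transform(X_test))))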

[Screenshot: comparison of ML models]

[Screenshot: PCA]

Text Search

We perform Text Search: we build indexes over the dataset and exploit them to answer queries. Indexes are useful because they speed up the search for the documents relevant to a given query.
In general, an index maps each word to the documents in which it appears, in some cases also recording the position of the word inside each document. The aim is to index the recipes so that we can search for specific keywords, such as ingredients, quickly.
The notebook is Text Search.
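
To make the idea concrete, here is a toy inverted index; real search libraries add ranking, stemming, and compression on top, and the notebook's actual implementation may differ.

    # Toy inverted index: word -> list of (document id, position) pairs.
    from collections import defaultdict

    def build_index(docs):
        index = defaultdict(list)
        for doc_id, text in enumerate(docs):
            for pos, word in enumerate(text.lower().split()):
                index[word].append((doc_id, pos))
        return index

    recipes = ["Beat eggs with sugar", "Mix flour and sugar", "Whisk the eggs"]
    index = build_index(recipes)

    # documents containing both "eggs" and "sugar"
    hits = {d for d, _ in index["eggs"]} & {d for d, _ in index["sugar"]}
    print(hits)  # {0}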

[Screenshot: comparison of text search models]

Clustering and Topic Modeling

We perform clustering of the dataset, which consists of assigning each document to a cluster identified by a specific clustering method.
There are different clustering methods; here we use K-Means and Mini-Batch K-Means. Alternatives include Hierarchical Agglomerative Clustering, which is computationally expensive, and algorithms like DBSCAN, which can find odd-shaped clusters without the number of clusters being specified in advance. We chose K-Means because it is the most widely used clustering technique and is fast to train compared to the others.
In Topic Modeling, by contrast, each recipe can belong to several topics at the same time. The notebook is Clustering&Topic.
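
A minimal sketch of this kind of clustering, using Mini-Batch K-Means over TF-IDF features; the texts, the number of clusters, and the hyperparameters are illustrative, not the ones used in the notebook.

    # Cluster recipe texts with Mini-Batch K-Means over TF-IDF vectors.
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["beat eggs with sugar", "mix flour and sugar",
             "grill the chicken", "roast chicken with herbs"]
    X = TfidfVectorizer().fit_transform(texts)

    km = MiniBatchKMeans(n_clusters=2, random_state=42, n_init=3)
    labels = km.fit_predict(X)
    print(labels)  # cluster id assigned to each recipe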

[Screenshot: 3D LDA]

[Screenshot: 2D t-SNE LDA]

Sequence Labelling

We explore Sequence Labeling, a fundamental task in Natural Language Processing (NLP), where each element in a sequence (typically words in a sentence) is assigned a label. Sequence labeling is essential in many NLP tasks such as Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and syntactic parsing.
We utilize SpaCy, a powerful NLP library, to perform a series of linguistic analyses on textual data. The analysis includes extracting syntactic and semantic features from recipe directions, which helps in understanding the structure and meaning of text.
Additionally, in the final section, we implement a custom Named Entity Recognition (NER) model using SpaCy that learns to identify domain-specific entities in recipe data, such as ingredients.
Finally, we apply the textual features from recipe instructions to perform binary text classification tasks using deep learning models in TensorFlow/Keras. These tasks include:

  • Vegetarian vs. Non-Vegetarian classification, where we predict the dietary category based on recipe directions.
  • Gathered vs. Recipe1M source classification, where we predict whether a recipe was gathered via web scraping or came from the structured Recipe1M+ dataset.

The notebook is Sequence Labelling.
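
A minimal SpaCy sketch of the per-token labelling described above, using the standard English pipeline (training the custom NER model is more involved and is covered in the notebook):

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Preheat the oven to 350 degrees and bake the chicken for 20 minutes.")

    for token in doc:
        print(token.text, token.pos_, token.dep_)  # word, POS tag, dependency label
    for ent in doc.ents:
        print(ent.text, ent.label_)                # built-in entities, e.g. QUANTITY, TIME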

[Screenshot: custom NER]

[Screenshot: standard NER]

Audio Processing

We work with audio signals, representing them with spectrograms and Mel spectrograms. We made some recordings to be used for speech recognition, and we perform the Text-to-Speech task by writing some recipes and letting the model transform them into audio. Moreover, we transcribe the audio back to text.
To perform the Text-to-Speech task, we use the Tacotron model, an encoder-decoder model based on LSTMs. From an input text, it produces the corresponding Mel spectrogram. To obtain audio we can actually play and hear, we then need a so-called vocoder, in this case WaveGlow, which transforms the Mel spectrogram into a playable waveform.
The notebook is Audio Processing.
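
As a small illustration of the representation step, here is how a Mel spectrogram can be computed with librosa; the file name is a placeholder, and the notebook may use a different library or parameters.

    # Compute and plot a Mel spectrogram ("recipe.wav" is a placeholder file).
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    y, sr = librosa.load("recipe.wav")
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    S_db = librosa.power_to_db(S, ref=np.max)  # convert power to decibels

    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
    plt.colorbar(format="%+2.0f dB")
    plt.show()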

[Screenshot: Mel spectrogram]

[Screenshot: synthesized audio signal]

RAG and Word2Vec

We explore word embeddings by training a Word2Vec model, comparing it to some pretrained ones, and finally building a simple Top-K recommender.
Then, we build a Retrieval-Augmented Generation (RAG) chatbot, also integrating CLIP so that the chatbot works with images as well (both raw images and image URLs).
The notebook is RAG.
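
A minimal gensim sketch of the Word2Vec training step; the corpus and hyperparameters below are illustrative, not the ones used in the notebook.

    # Train a small Word2Vec model and query nearest neighbours.
    from gensim.models import Word2Vec

    texts = ["beat eggs with sugar", "mix flour and sugar", "whisk the eggs well"]
    sentences = [t.lower().split() for t in texts]

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=20)
    print(model.wv.most_similar("eggs", topn=3))  # nearest words in embedding space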

[Screenshot: CLIP chatbot]

Generative chatbots

We aim to answer questions about the dataset using the following approach: instead of fine-tuning a model, we use a general-purpose model and apply one-shot learning by prompting it with the question along with the most relevant recipes.

  1. The recipes whose embeddings best match the embedding of the question are retrieved using a similarity search.
  2. The question and the selected recipes are provided as input to a T5 model, which is prompted to answer the question based on the provided recipes; a minimal sketch of this step follows below.
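
A minimal sketch of this retrieve-then-prompt step with FLAN-T5 via Hugging Face transformers; the retrieval is stubbed with a fixed recipe for brevity, and the model size is an arbitrary choice.

    # Prompt FLAN-T5 with a question plus a (stubbed) retrieved recipe.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

    retrieved = "Pancakes: mix flour, milk and eggs; fry in a hot buttered pan."
    question = "What do I need to make pancakes?"
    prompt = (f"Answer the question using the recipes.\n"
              f"Recipes: {retrieved}\nQuestion: {question}")

    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))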

We chose FLAN-T5, an encoder-decoder model pre-trained for a wide range of tasks such as translation, question answering, and summarization. FLAN-T5 is designed to perform well in zero-shot and few-shot settings, making it a suitable choice for this kind of task without the need for task-specific fine-tuning.

GPT-2 (Generative Pretrained Transformer 2) is a large language model designed to generate text given an input prompt. In this project, the model is fine-tuned (i.e. further trained) on the recipe dataset to create a GPT2_fineTuned model capable of generating new recipes. We have also integrated text-to-speech (TTS) and speech-to-text (STT) systems.

Then, we explore how prompt granularity and example-based context affect image synthesis with a pretrained Stable Diffusion model. We compare three regimes: zero-shot, one-shot, and few-shot. Our goal was to see how much guidance the model needs to generate realistic and semantically accurate food photographs.
The notebook is Generative models.
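
For the image-synthesis comparison, a zero-shot generation looks roughly like the sketch below, using the diffusers library; the checkpoint name and prompt are illustrative, and one-shot/few-shot variants prepend example descriptions to the prompt.

    # Zero-shot food-image generation with a pretrained Stable Diffusion model.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",  # illustrative checkpoint
        torch_dtype=torch.float16).to("cuda")

    prompt = "a photo of a plate of spaghetti carbonara, food photography"
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save("carbonara.png")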

[Screenshot: generative CLIP]

Authors

This project was developed by:
