This project aims to develop a machine learning model to detect and censor disinformation using large language models (LLMs). The model is trained on a dataset containing text data and associated labels (e.g., disinformation or not disinformation). The project includes data preprocessing, model training, fine-tuning, and evaluation.
- Clone the repository:

  git clone https://github.com/your_username/your_project.git
  cd your_project
- Create a Python virtual environment:

  python3 -m venv venv
  source venv/bin/activate
- Install required packages:

  pip install -r requirements.txt
- 'pandas' and 'numpy' for data manipulation and numerical computing.
- 'matplotlib' and 'seaborn' for data visualization.
- 'torch' for working with PyTorch, a popular machine learning library.
- 'transformers' for using large language models (LLMs) such as GPT and BERT from the Hugging Face Transformers library.
- 'datasets' for accessing and managing datasets from the Hugging Face Datasets library.
- 'PyYAML' for working with YAML configuration files.
- 'wordcloud' for generating word clouds from text data.
- Data Preprocessing: Run the data preprocessing script to clean and prepare the dataset:
python src/data_preprocessing.py
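A preprocessing script of this kind typically lowercases text, strips URLs, collapses whitespace, and maps string labels to integers. The sketch below is illustrative only; the function names and label strings are assumptions, not taken from the actual `data_preprocessing.py`:

```python
import re

# Assumed label names; the real dataset's labels may differ.
LABEL_MAP = {"not_disinformation": 0, "disinformation": 1}

def clean_text(text: str) -> str:
    """Lowercase, remove URLs, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # strip URLs
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

def preprocess(rows):
    """Clean each (text, label) pair and encode the label as an int."""
    return [(clean_text(text), LABEL_MAP[label]) for text, label in rows]
```

For example, `preprocess([("Check THIS http://x.io  out", "disinformation")])` yields `[("check this out", 1)]`.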
- Training the Model: Use the training notebook (train_notebook.ipynb) to train the model:
jupyter notebook notebooks/train_notebook.ipynb
- Fine-Tuning the Model: Use the fine-tuning script to fine-tune the model on additional data:
python src/fine_tune.py
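At a high level, a fine-tuning script like this follows the standard loop: iterate over epochs and examples, compute a loss, and update parameters. The toy sketch below illustrates that loop structure with a single-feature logistic model in plain Python; the real `fine_tune.py` would instead update a pretrained transformer via `torch`, so everything here (names, data, hyperparameters) is an assumption for illustration:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(data, epochs=200, lr=0.5):
    """Toy gradient-descent loop over (feature, label) pairs."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w * x + b)   # forward pass: predicted probability
            grad = p - y             # dLoss/dz for log loss
            w -= lr * grad * x       # parameter updates
            b -= lr * grad
    return w, b

# Linearly separable toy data: positive feature -> label 1.
data = [(-2, 0), (-1, 0), (1, 1), (2, 1)]
w, b = fine_tune(data)
```

After training, the model assigns probability above 0.5 to positive-feature inputs and below 0.5 to negative ones.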
- Evaluating the Model: Evaluate the model's performance using the evaluation script:
python src/evaluation.py --model-path models/trained_model --dataset-path data/processed/test_data.csv
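The metrics this project reports (accuracy, precision, recall, F1-score) all reduce to counts of true/false positives and negatives on the test set. A minimal sketch of how they are computed (the actual `evaluation.py` interface may differ):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For instance, with `y_true = [1, 1, 0, 0]` and `y_pred = [1, 0, 0, 1]`, every metric comes out to 0.5.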
- data/: Raw and processed data.
- models/: Trained models.
- notebooks/: Jupyter notebooks for exploratory analysis and model training.
- src/: Source directory containing scripts for data preprocessing, model training, fine-tuning, and evaluation.
- config/: Configuration file (config.yaml).
- utils/: Utility functions for data and model handling.
- requirements.txt: File listing project dependencies.
The project uses a configuration file (config/config.yaml) to manage parameters such as data paths, model hyperparameters, and evaluation settings. Customize this file according to your requirements.
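As an illustration, a config.yaml for this kind of project might look like the fragment below. The keys and values are assumptions for the sake of example, not the project's actual configuration:

```yaml
data:
  raw_path: data/raw/dataset.csv
  processed_path: data/processed/
model:
  base_model: bert-base-uncased
  max_length: 256
training:
  batch_size: 16
  learning_rate: 2e-5
  epochs: 3
evaluation:
  test_path: data/processed/test_data.csv
```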
The dataset consists of text samples labeled as disinformation or not disinformation. The data is preprocessed and split into training, validation, and test sets.
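A common way to produce the splits described above is a deterministic shuffle followed by slicing. The sketch below is illustrative; the project's actual split ratios and code are assumptions here:

```python
import random

def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically, then slice into train/validation/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed -> reproducible split
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

With 100 rows and the default fractions, this yields an 80/10/10 split, and every row lands in exactly one partition.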
The results of model training and evaluation, including metrics such as accuracy, precision, recall, and F1-score, are reported by the corresponding scripts and notebooks.