This tutorial will show you how to use ProTaska-GPT to get recommendations for machine learning and data science tasks for a dataset containing tweets classified by their sentiment. We'll use the `mteb/tweet_sentiment_extraction` dataset available on [Hugging Face](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction).  
  
To follow this tutorial, you will need Python 3 and an OpenAI API account.

**Step 1:** Install ProTaska-GPT with all its dependencies. 

In [1]:
!pip install ProTaska-GPT

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ProTaska-GPT
  Downloading ProTaska_GPT-0.0.12-py3-none-any.whl (10 kB)
Collecting langchain (from ProTaska-GPT)
  Downloading langchain-0.0.202-py3-none-any.whl (1.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 20.5 MB/s eta 0:00:00
Collecting colorama (from ProTaska-GPT)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting wikipedia (from ProTaska-GPT)
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting openai (from ProTaska-GPT)
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73.6/73.6 kB 9.0 MB/s eta 0:00:00
Collecting datasets (from ProTaska-GPT)
  Downloading datasets-2.13.0-py3-none-any.whl (485 kB)

**Step 2:** Import the ProTaska-GPT functions that generate a description of the dataset and recommend tasks based on it.

In [2]:
from protaska.describer import describe_dataset
from protaska.ideate import main as chatbot

**Step 3:** Specify the values for the following variables. They will serve as parameters for the functions you imported in the previous step.
* `openai_key` (str) - Your OpenAI API key. ProTaska-GPT needs it to leverage GPT-3.5 capabilities. For more information on how to get the API key, refer to the [OpenAI documentation](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key).
* `importer_type` (str) - ProTaska-GPT comes with three importer classes, which can preprocess datasets available on Hugging Face (`HuggingFaceDatasetImporter`), on Kaggle (`KaggleDatasetImporter`), or in your local filesystem (`LocalDatasetImporter`). In this case, we supply `HuggingFaceDatasetImporter` to load the dataset from Hugging Face.
* `dataset_key` (str) - Name of the dataset to be used. Specify only for datasets imported from Hugging Face or Kaggle. The datasets stored in your local filesystem will be identified by their path (see below). Our dataset is called `mteb/tweet_sentiment_extraction`.
> For an example using a dataset stored in the local filesystem, check out our [Housing example](https://github.com/AmanPriyanshu/ProTaska-GPT/blob/main/Tutorials/Housing_data_Example.ipynb).
* `destination_path` (str) - Path in your local filesystem where the dataset is stored, or will be stored once downloaded from Hugging Face or Kaggle. We want to save the downloaded dataset to the `downloaded_data` directory.


In [3]:
openai_key = 'Enter secret key'  # replace with your OpenAI API key
importer_type = "HuggingFaceDatasetImporter"
dataset_key = 'mteb/tweet_sentiment_extraction'
destination_path = './downloaded_data/'
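
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. One alternative is sketched below, assuming the key has been exported in an environment variable named `OPENAI_API_KEY`. Both the variable name and the `load_openai_key` helper are illustrative choices for this tutorial and are not part of ProTaska-GPT.

```python
import os
from getpass import getpass


def load_openai_key(env_var="OPENAI_API_KEY", environ=None, prompt=getpass):
    """Return the API key from the environment, or ask for it interactively.

    `env_var` and this helper are assumptions made for illustration only.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        # Fall back to a hidden prompt so the key is not echoed in the output.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

With this helper defined, `openai_key = load_openai_key()` can stand in for the hardcoded string above.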

**Step 4:** Now, let's describe the dataset using the `describe_dataset` function.  
  
This function first uses the dataset importer (identified by `importer_type`) to load, pre-process, and store the dataset. Then it calls the dataset ingestor, which provides a framework for GPT to understand and summarize the dataset.  
  
The function returns the dataset description as a string and an object with additional metadata, which will be used to suggest tasks in the next step.
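
Since the importer is selected by a plain string, a small pre-flight check can catch a misspelled importer name before any download starts. The sketch below relies only on the parameter rules listed in Step 3. `check_describe_args` is a hypothetical helper written for this tutorial, not a ProTaska-GPT function.

```python
# Importer names as listed in Step 3 of this tutorial.
KNOWN_IMPORTERS = {
    "HuggingFaceDatasetImporter",
    "KaggleDatasetImporter",
    "LocalDatasetImporter",
}


def check_describe_args(importer_type, destination_path, dataset_key=None):
    """Hypothetical pre-flight check; not part of the ProTaska-GPT API."""
    if importer_type not in KNOWN_IMPORTERS:
        raise ValueError(f"Unknown importer_type: {importer_type!r}")
    if not destination_path:
        raise ValueError("destination_path must be a non-empty path")
    # Per Step 3, remote datasets are identified by name, local ones by path.
    if importer_type != "LocalDatasetImporter" and not dataset_key:
        raise ValueError(f"{importer_type} requires a dataset_key")
    return importer_type, destination_path, dataset_key
```

Calling `check_describe_args(importer_type, destination_path, dataset_key)` with the values from Step 3 returns them unchanged, while a typo such as `"HuggingFaceImporter"` raises a `ValueError` immediately.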

In [4]:
description, dataloader_obj = describe_dataset(openai_key, importer_type, destination_path, dataset_key)
description

Downloading readme:   0%|          | 0.00/22.0 [00:00<?, ?B/s]

Downloading and preparing dataset json/mteb--tweet_sentiment_extraction to /root/.cache/huggingface/datasets/mteb___json/mteb--tweet_sentiment_extraction-0669dffec9427684/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/3.63M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/465k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/mteb___json/mteb--tweet_sentiment_extraction-0669dffec9427684/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Saving the dataset (0/1 shards):   0%|          | 0/27481 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/3534 [00:00<?, ? examples/s]

"Description: This dataset contains tweets with sentiment labels (neutral, negative, positive) for training and testing purposes. The data is in the form of lists of tweet IDs, text, labels, and label text. The task for this dataset would be sentiment analysis, and it belongs to the field of natural language processing. The dataset has 27481 rows for training and 3534 rows for testing. The features include 'id', 'text', 'label', and 'label_text'."

**Step 5:** Finally, turn your data into insights by calling the `chatbot` function.  
  
This function prompts GPT, which has access to Wikipedia, to suggest various machine learning and data science tasks such as text classification, reinforcement learning, anomaly detection, or clustering. GPT also names the Python packages needed to perform those tasks, along with instructions on how to install them.  
  
You can ask follow-up questions, which are answered using a summary of the previous conversation stored in the buffer memory, or end the session by typing `break` or `exit`.
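
To make the interaction pattern concrete, here is a rough sketch of such a question loop. It is a simplified stand-in rather than ProTaska-GPT's actual implementation: it keeps the last few raw turns in a bounded buffer instead of a GPT-written summary, and `chat_loop`, `respond`, and `memory_size` are hypothetical names introduced for this example.

```python
from collections import deque

STOP_WORDS = {"break", "exit"}  # the two stop commands named in this tutorial


def chat_loop(respond, read=input, write=print, memory_size=5):
    """Hypothetical question loop; ProTaska-GPT's real code may differ.

    `respond(question, history)` stands in for the GPT call and receives the
    buffered list of earlier (question, answer) pairs.
    """
    history = deque(maxlen=memory_size)  # oldest turns are dropped first
    while True:
        question = read("User:\t ").strip()
        if question.lower() in STOP_WORDS:
            break
        answer = respond(question, list(history))
        history.append((question, answer))
        write(f"ProTaska:\t {answer}")
    return list(history)
```

Passing `read` and `write` as parameters keeps the loop easy to exercise without a live session; in a notebook the defaults (`input` and `print`) behave like an interactive prompt.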

In [6]:
chatbot(openai_key, description, dataloader_obj.superficial_meta_data, agent_verbose=False)

ProTaska:	 Hello! Based on the description of your dataset, there are several ML/DS tasks that you can perform. Here are some of them:

1. Sentiment Analysis - This is the main task for your dataset. It involves analyzing the text of each tweet and determining whether the sentiment expressed is neutral, negative, or positive. You can use various ML algorithms such as Naive Bayes, Logistic Regression, or Support Vector Machines to perform this task. To get started, you can use libraries such as NLTK or Scikit-learn. To install NLTK, you can use the command "pip install nltk" in your terminal.

2. Text Classification - This task involves categorizing the tweets into different classes based on their content. For example, you can classify tweets into categories such as politics, sports, entertainment, etc. To perform this task, you can use ML algorithms such as Decision Trees, Random Forests, or Neural Networks. To get started, you can use libraries such as Scikit-learn or Keras. To instal