<a href="https://colab.research.google.com/github/Rishav-hub/Auto-ViML/blob/main/05_AutoVIML_for_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installation
```pip install git+https://github.com/AutoViML/Auto_ViML.git```

or 

```pip install autoviml```

In [None]:
!pip install autoviml

## Implementation with Auto-ViML

In [None]:
from autoviml.Auto_NLP import Auto_NLP

## Load Dataset

In [None]:
import tensorflow_datasets as tfds
dataset, info = tfds.load('amazon_us_reviews/Personal_Care_Appliances_v1_00', with_info=True, batch_size=-1)

In [None]:
train_dataset = dataset['train']

## Convert dataset to array


In [None]:
import numpy as np
dataset=tfds.as_numpy(train_dataset)

In [None]:
dataset

## Extracting important columns

The dataset has several columns. We are interested in four of them: helpful_votes, review_headline, review_body, and star_rating.

- star_rating: The 1-5 star rating the reviewer gave the purchased product.

- helpful_votes: The number of helpful votes the review received.

- review_headline: The title of the product review.

- review_body: The full text of the review.

In [None]:
helpful_votes=dataset['data']['helpful_votes']
review_headline=dataset['data']['review_headline']
review_body=dataset['data']['review_body']
rating=dataset['data']['star_rating']

## Creating a data frame

In [None]:
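import pandas as pd
# tfds returns text as bytes, so decode it to str before building the frame.
headline, body = (pd.Series(a).str.decode('utf-8') for a in (review_headline, review_body))
reviews_df = pd.DataFrame({'votes': helpful_votes, 'headline': headline, 'reviews': body, 'rating': rating})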


In [None]:
# Define the datatype of each column

convert_dict = {'votes': int,
                'headline': str,
                'reviews': str,
                'rating': int}
reviews_df = reviews_df.astype(convert_dict)


In [None]:
reviews_df

## Adding the target column
A review counts as positive when its star_rating is 4 or higher, and as negative when it is below 4.

The code below adds the target column: reviews rated 4 or 5 are labeled 1, and reviews rated 1 to 3 are labeled 0.

In [None]:
reviews_df["target"] = reviews_df["rating"].apply(lambda x: 1 if x >= 4 else 0)
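
As a small illustration of the labeling rule, the sketch below applies the same threshold to a hypothetical `demo_df` (made-up ratings, not taken from the dataset) using a vectorized comparison instead of `apply`, and then counts the reviews in each class. Both forms should give the same labels; the vectorized one is simply a common pandas idiom.

```python
import pandas as pd

# Hypothetical ratings, for illustration only (not taken from the dataset).
demo_df = pd.DataFrame({"rating": [1, 2, 3, 4, 5, 5]})

# Same rule as above: 4 or 5 stars -> 1 (positive), 1 to 3 stars -> 0 (negative).
demo_df["target"] = (demo_df["rating"] >= 4).astype(int)

# Count the reviews per class to see how balanced the labels are.
class_counts = demo_df["target"].value_counts().to_dict()
print(demo_df)
print(class_counts)
```

Checking the class counts on the real `reviews_df` in the same way shows whether positive reviews dominate the data.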

In [None]:
reviews_df

## Split the Dataset

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(reviews_df, test_size=0.25)
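
The split above is random, so it changes on every run. As an optional variation, the sketch below uses a hypothetical toy frame `demo_df` in place of `reviews_df` and passes `random_state` and `stratify` to `train_test_split`, so the split is reproducible and both parts keep the same share of positive and negative reviews. The seed value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for reviews_df (four reviews per class).
demo_df = pd.DataFrame({
    "reviews": [f"review {i}" for i in range(8)],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
train_demo, test_demo = train_test_split(
    demo_df, test_size=0.25, random_state=42, stratify=demo_df["target"]
)

print(len(train_demo), len(test_demo))
print(test_demo["target"].value_counts().to_dict())
```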

## Initiating Auto NLP

- **nlp_column**: The name of the text column that the model uses as its input.

- **target**: The name of the column that holds the labels the model learns to predict.

- **train**: The split of the dataset that the model uses during training.

- **test**: The split of the dataset that the model uses during testing.

- **score_type**='balanced_accuracy': The metric used to score the model. Balanced accuracy is the average of the recall on each class, so it stays informative when one class is much larger than the other.

- **modeltype**='Classification': It specifies the type of model we are building, which here is a classification model.

- **top_num_features**=50: It specifies the number of most important features the model uses during training. Features are the important attributes found in the dataset.

- **build_model**=True: It tells the Auto_NLP function to build the model. Auto_NLP then uses the key AutoVIML features to produce an optimized model.
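
To make the `balanced_accuracy` score type more concrete, here is a minimal sketch of how the metric is computed, written in plain Python on hypothetical labels. The helper `balanced_accuracy` is illustrative and not part of AutoViML; scikit-learn's `balanced_accuracy_score` computes the same quantity.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the recall obtained on each class present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        # Positions whose true class is `label`.
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 positive reviews and 2 negative ones.
y_true = [1] * 8 + [0] * 2
# A model that always predicts "positive".
y_pred = [1] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-positive model looks good on plain accuracy but scores no better than chance on balanced accuracy, which is why the latter suits review data where one class may dominate.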

In [None]:
nlp_column = 'reviews'
target = 'target'

In [None]:
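# Auto_NLP returns the transformed train and test sets, the fitted pipeline, and the test predictions.
train_nlp, test_nlp, nlp_transformer, preds = Auto_NLP(
    nlp_column, train, test, target,
    score_type='balanced_accuracy', modeltype='Classification',
    top_num_features=50, verbose=2, build_model=True)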

## Making predictions

In [None]:
nlp_transformer.predict(test[nlp_column])