
In [None]:
!pip install transformers -q


# **Transfer Learning part 1**

First, we will look at the Hugging Face library, which offers a large collection of pretrained models that are easy to install and use. Hugging Face hosts many NLP (Natural Language Processing) models, and also offers models for audio and vision processing. Check out their site for all the available models:
https://huggingface.co/models

Below is an example using a model named "bert-base-NER". bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition (NER) and achieves state-of-the-art performance on that task. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER) and miscellaneous (MISC).

For more information on named entity recognition, see this paper: https://aclanthology.org/W03-0419.pdf

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "I'm Dorian from Utrecht and I work for Fruitpunch AI."

ner_results = nlp(example)
print(ner_results)
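
The pipeline above returns one dictionary per token, tagged with a `B-` (begin) or `I-` (inside) prefix in front of the entity type. The sketch below shows one way such token-level tags could be merged into whole entities. The `sample` list is a hand-written, hypothetical stand-in that only mimics the `entity` and `word` keys of the pipeline output; it is not real model output, and `group_entities` is a helper name introduced here for illustration. The pipeline also has an `aggregation_strategy` argument intended for this kind of grouping.

```python
def group_entities(tokens):
    """Merge token-level B-/I- tags into (entity_type, text) spans."""
    spans = []
    for tok in tokens:
        # "B-ORG" -> prefix "B", entity type "ORG"
        prefix, _, ent_type = tok["entity"].partition("-")
        word = tok["word"]
        continues = (
            bool(spans)
            and spans[-1][0] == ent_type
            and (prefix == "I" or word.startswith("##"))
        )
        if continues:
            last_type, last_text = spans[-1]
            if word.startswith("##"):
                # WordPiece continuation: attach without a space
                joined = last_text + word[2:]
            else:
                joined = last_text + " " + word
            spans[-1] = (last_type, joined)
        else:
            spans.append((ent_type, word))
    return spans


# Hypothetical token-level output (only the keys used above are included)
sample = [
    {"entity": "B-PER", "word": "Dorian"},
    {"entity": "B-LOC", "word": "Utrecht"},
    {"entity": "B-ORG", "word": "Fruit"},
    {"entity": "I-ORG", "word": "##punch"},
    {"entity": "I-ORG", "word": "AI"},
]
print(group_entities(sample))
```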

Before we really start with transfer learning, we want to show you how useful pre-trained models can be and how much time they can save you. For the first exercise, we'll take a model from Hugging Face and see how well it performs against models we build from scratch. Please note: this is not yet transfer learning, because we are not re-training a model.

In [None]:
# Load the dataset
!git clone https://github.com/fruitpunch-ai-code/epoch-14.git



In [None]:
#required libraries
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn import model_selection
import nltk
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
#Set Random seed
np.random.seed(500)

df = pd.read_csv('/content/epoch-14/Challenges/reviews.csv', encoding='latin-1')
df.head()

Now we will train an NLP model from scratch to recognise the sentiment in Amazon reviews. The model has to determine whether a review is positive or negative.


In [None]:
# Step 1: Data Pre-processing - This will help in getting better results through the classification algorithms
Corpus = df.copy()
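# Step 1a : Remove rows with missing text, if any, and renumber the rows
# so that the positional index used in the loop below matches the row labels.
Corpus = Corpus.dropna(subset=['text']).reset_index(drop=True)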

# Step - 1b : Change all the text to lower case. This is required as python interprets 'dog' and 'DOG' differently
Corpus['text'] = [entry.lower() for entry in Corpus['text']]

# Step - 1c : Tokenization : Each entry in the corpus is broken into a list of words
Corpus['text']= [word_tokenize(entry) for entry in Corpus['text']]

# Step - 1d : Remove stop words and non-alphabetic tokens, and perform word lemmatization.

# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Build the stop-word set and the lemmatizer once, instead of once per word or review
stop_words = set(stopwords.words('english'))
word_Lemmatized = WordNetLemmatizer()

for index,entry in enumerate(Corpus['text']):
    # List to store the words that pass the filters in this step
    Final_words = []
    # pos_tag provides the 'tag', i.e. whether the word is a noun (N), a verb (V) or something else
    for word, tag in pos_tag(entry):
        # Keep only alphabetic tokens that are not stop words
        if word not in stop_words and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The processed words of each review are stored, as a string, in 'text_final'
    Corpus.loc[index,'text_final'] = str(Final_words)

# Step 2: Split the data into Train and Test sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=0.3)

# Step 3: Label encode the target variable  - This is done to transform Categorical data of string type in the data set into numerical values
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
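# Reuse the mapping learned on the training labels; refitting on the test labels could assign different integers
Test_Y = Encoder.transform(Test_Y)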

# Step 4: Vectorize the words by using TF-IDF Vectorizer - This is done to find how important a word in document is in comparison to the corpus
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
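
To make Step 4 more concrete, here is a small pure-Python sketch of how a TF-IDF weight can be computed. It assumes whitespace tokenisation and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn documents as the default for `TfidfVectorizer`. The three example reviews and the function name `tfidf` are made up for illustration, and this is not a replacement for the vectorizer used above.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = ["great product love it", "bad product broke fast", "love love this"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, "->", {term: round(w, 3) for term, w in row.items()})
```

In the first review, "great" receives a higher weight than "product" because "product" also occurs in another review and is therefore less distinctive.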

In [None]:
# Step 5: Now run ML algorithm to classify the text

# Classifier - Algorithm - Naive Bayes
from sklearn import naive_bayes
from sklearn.metrics import accuracy_score
# fit the training dataset on the classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

# Classifier - Algorithm - SVM
from sklearn import svm
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

# Classifier - Algorithm - Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
# fit the training dataset on the classifier
clf = RandomForestClassifier(n_estimators=400, max_depth=20, random_state=0)
clf.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_clf = clf.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Random Forest Classifier Accuracy Score -> ",accuracy_score(predictions_clf, Test_Y)*100)

# Classifier - Algorithm - AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
# fit the training dataset on the classifier
adaclf = AdaBoostClassifier(n_estimators=800, random_state=0)
adaclf.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_adaclf = adaclf.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("AdaBoost Classifier Accuracy Score -> ",accuracy_score(predictions_adaclf, Test_Y)*100)

# Classifier - Algorithm - Linear SVC
from sklearn.svm import LinearSVC
# fit the training dataset on the classifier
svc = LinearSVC(random_state=0, tol=1e-5)
svc.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_svc = svc.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Linear SVC Accuracy Score -> ",accuracy_score(predictions_svc, Test_Y)*100)

# Classifier - Algorithm - Logistic Regression
from sklearn.linear_model import LogisticRegression
# fit the training dataset on the classifier
lr = LogisticRegression()
lr.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_lr = lr.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Logistic Regression Accuracy Score -> ",accuracy_score(predictions_lr, Test_Y)*100)

# Classifier - Algorithm - MLP Classifier
from sklearn.neural_network import MLPClassifier
# fit the training dataset on the classifier
mlp = MLPClassifier(hidden_layer_sizes=(13,13,13),max_iter=500)
mlp.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_nn = mlp.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("MLPClassifier Accuracy Score -> ",accuracy_score(predictions_nn, Test_Y)*100)

Naive Bayes Accuracy Score ->  81.96666666666667
SVM Accuracy Score ->  84.36666666666667
Random Forest Classifier Accuracy Score ->  81.63333333333334
AdaBoost Classifier Accuracy Score ->  79.76666666666667
Linear SVC Accuracy Score ->  83.43333333333334
Logistic Regression Accuracy Score ->  85.0
MLPClassifier Accuracy Score ->  81.39999999999999


The best model reaches an accuracy of 85%. Let's see if we can top that with a model from Hugging Face.

### **Assignment 1**

A) Search the Hugging Face model hub for a model that can classify the reviews in this dataset as positive or negative, and run it on the data.

B) Evaluate your transferred model. Does it outperform the models built from scratch?

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = "aychang/roberta-base-imdb"

# Sentiment analysis is a sequence classification task, so load the checkpoint with its classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device=0 selects the first GPU; -1 falls back to the CPU
device = 0 if torch.cuda.is_available() else -1
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device=device)

In [None]:
# The pipeline accepts a single string or a list of strings, so convert the text column to a list
list_of_text = df.text.tolist()

In [None]:
# To speed things up a little, classify only a subset of the data.
# GPU placement is set through the `device` argument of pipeline(...);
# a TensorFlow device context does not affect this PyTorch model.
results = nlp(list_of_text[0:1000])
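
If you later want to classify more than a subset, one option is to feed the texts to the classifier in chunks and concatenate the results. The sketch below shows the idea. `chunks`, `classify_in_chunks` and `fake_classifier` are hypothetical names introduced here; `fake_classifier` is only a stand-in that mimics the list-of-dicts output format so the example runs without downloading a model.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_chunks(classify, texts, size=100):
    """Run `classify` on each chunk and concatenate the per-text results."""
    results = []
    for chunk in chunks(texts, size):
        results.extend(classify(chunk))
    return results


# Hypothetical stand-in for the pipeline: labels a text 'pos' if it contains "good"
def fake_classifier(texts):
    return [{"label": "pos" if "good" in text else "neg"} for text in texts]


reviews = ["good value", "broke in a week", "really good", "not for me", "good"]
results = classify_in_chunks(fake_classifier, reviews, size=2)
print([r["label"] for r in results])
```

With the real pipeline, the call would look like `classify_in_chunks(nlp, list_of_text)`, assuming `nlp` and `list_of_text` are defined as in the cells above.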

In [None]:
# Extract the predictions from the results: 'pos' -> 1, 'neg' -> 0
label_to_id = {'pos': 1, 'neg': 0}
pred = [label_to_id[result['label']] for result in results]

In [None]:
# Calculate the accuracy score, but first transform the labels to numbers
df['label'] = Encoder.fit_transform(df['label'])
print("Transferred model Accuracy Score -> ",accuracy_score(pred, df['label'][0:1000])*100)

Transferred model Accuracy Score ->  92.80000000000001
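
Accuracy is a single number and does not show which class the errors fall in. As a possible next step for part B of the assignment, the sketch below counts true and false positives and negatives by hand. The label lists are made-up toy values (1 = positive, 0 = negative), not results from the dataset, and the function names are introduced here for illustration only.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) label pairs for binary labels 0/1."""
    pairs = Counter(zip(y_true, y_pred))
    return {
        "tn": pairs[(0, 0)],
        "fp": pairs[(0, 1)],
        "fn": pairs[(1, 0)],
        "tp": pairs[(1, 1)],
    }


def accuracy(counts):
    """Fraction of predictions that match the true label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Toy labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
print("accuracy:", accuracy(counts))
```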
