# Transformers with Pipeline

Transformers are a type of NLP model that forms the basis of many state-of-the-art models, such as BERT and GPT-2.

This is the starter file for downloading Transformers to a local environment. It has limited information, because I stopped working on it when we decided to shift to using a Web API for this stage of the project. For a detailed tutorial, check out the official guide on [Pipelines for inference](https://huggingface.co/docs/transformers/en/pipeline_tutorial).

We may come back to working with models locally for the fine-tuning and training stage of the project, so it is good to keep this file around.



The [pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) is a wrapper function in Hugging Face's Transformers library that abstracts away the complex process of preprocessing input data, loading the model, making a prediction, and post-processing the output into a single function call. This leaves us with an interface that needs only a couple of lines of code to make a prediction.
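To make the abstraction concrete, here is a rough sketch of the steps a text-classification pipeline performs for us. It is an approximation under stated assumptions, not the library's actual implementation: `classify_manually` and `postprocess` are hypothetical helper names, and the sketch assumes PyTorch is installed alongside Transformers.

```python
def postprocess(probabilities, id2label):
    """Turn per-class probabilities into pipeline-style result dicts."""
    results = []
    for row in probabilities:
        # Pick the index of the most likely class for this input
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results


def classify_manually(texts, model_id="finiteautomata/bertweet-base-sentiment-analysis"):
    """Hypothetical helper: preprocess, predict, and post-process by hand."""
    # Imported here because these imports are slow and need both libraries installed
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # 1. Load the tokenizer and the model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    # 2. Preprocess: turn raw strings into padded tensors of token IDs
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

    # 3. Predict: run the model without tracking gradients
    with torch.no_grad():
        logits = model(**inputs).logits

    # 4. Post-process: convert logits to probabilities, then to labels
    probabilities = torch.softmax(logits, dim=-1).tolist()
    return postprocess(probabilities, model.config.id2label)
```

Calling `classify_manually(["I love you", "I hate you"])` should return a list of dicts in the same shape as the pipeline output, although the first call downloads the model weights.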

In [2]:
from transformers import pipeline # This will take some time

The important parameters to specify in the Hugging Face pipeline are:

Task: A string that specifies the task to perform, such as sentiment analysis, question answering, summarization, or translation.

Model: A model ID for a specific transformer model.

If either of these is missing, it defaults based on the other parameter.

For example, if Model is left blank, the default model for the specified task is selected. If Task is left blank, the default task for the specified model is selected.

Note: Some models can perform multiple tasks.
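The three ways of combining Task and Model can be sketched as follows. `build_pipeline` is a hypothetical wrapper written for this note, not part of Transformers, and raising `ValueError` when both arguments are missing is a choice made here for illustration.

```python
def build_pipeline(task=None, model=None):
    """Hypothetical wrapper: create a pipeline from a task, a model ID, or both."""
    if task is None and model is None:
        raise ValueError("Specify at least a task or a model ID.")

    # Imported lazily because loading Transformers takes some time
    from transformers import pipeline

    # Only pass the arguments that were given, so the library fills in the rest
    kwargs = {}
    if task is not None:
        kwargs["task"] = task
    if model is not None:
        kwargs["model"] = model
    return pipeline(**kwargs)


# Task only: the library picks a default model for the task
# classifier = build_pipeline(task="sentiment-analysis")

# Model only: the task is inferred from the model
# classifier = build_pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")

# Both: nothing is left to defaults
# classifier = build_pipeline(task="sentiment-analysis",
#                             model="finiteautomata/bertweet-base-sentiment-analysis")
```

Specifying both is usually the safest option for reproducibility, since library defaults can change between versions.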

In [5]:
sentiment_pipeline = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)


emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0


[{'label': 'POS', 'score': 0.9916695356369019},
 {'label': 'NEG', 'score': 0.9806600213050842}]

Exploring the transformer with RMP comments.

In [7]:
import pandas as pd

Predict the sentiment of each comment and check the actual comments.


In [16]:
rating = pd.read_csv("../data/clean_ratings.csv")
# Take the first 5 comments. The .copy() call gives an independent DataFrame,
# so adding a column does not trigger pandas' SettingWithCopyWarning.
first_five = rating[:5].copy()
first_five["score"] = sentiment_pipeline(first_five["comment"].tolist())
first_five


| Unnamed: 0 | profID | attendanceMandatory | class | comment | date | difficultyRating | grade | helpfulRating | isForCredit | isForOnlineClass | ratingTags | wouldTakeAgain | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7964 | False | ANTHRCUL101 | Fricke is the man. Entire class probably took ... | 2019-04-28 17:13:12 | 1.0 | A | 5.0 | False | False | ['Respected', 'Inspirational', 'Amazing Lectur... | True | {'label': 'POS', 'score': 0.9641988277435303} |
| 1 | 7964 | False | ANTHRO101 | Tom Fricke is one of those professors you will... | 2019-01-08 18:41:24 | 1.0 | A+ | 5.0 | False | False | ['Accessible Outside Class', 'Hilarious', 'Ama... | True | {'label': 'POS', 'score': 0.9928324222564697} |
| 2 | 7964 | False | ANTHRCUL101 | Prof. Fricke is amazing. He is hilarious and t... | 2018-12-16 03:11:18 | 1.0 | A | 5.0 | False | False | ['Hilarious', 'Graded By Few Things', 'Caring'] | True | {'label': 'POS', 'score': 0.9918490648269653} |
| 3 | 7964 | False | CULTANTHRO101 | Such an easy class. Exams were exactly like th... | 2018-12-12 10:03:19 | 1.0 | A | 5.0 | False | False | ['Accessible Outside Class', 'Graded By Few Th... | True | {'label': 'POS', 'score': 0.9914464950561523} |
| 4 | 7964 | False | ANTHRCUL101 | Easiest class i have taken at UM. The exams to... | 2018-12-11 16:33:00 | 1.0 | A+ | 5.0 | False | False | ['Respected', 'Hilarious', 'Amazing Lectures'] | True | {'label': 'POS', 'score': 0.9928606152534485} |
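In the table, the `score` column holds the whole result dict for each comment, which is awkward to filter or sort on. A minimal sketch of one way to split it into two flat columns is shown here; `split_results` and the column names `label` and `confidence` are hypothetical names chosen for illustration.

```python
def split_results(results):
    """Split pipeline-style result dicts into parallel label and score lists."""
    labels = [result["label"] for result in results]
    scores = [result["score"] for result in results]
    return labels, scores


# Usage with the notebook's objects (assumes `sentiment_pipeline` and `first_five` exist):
# labels, scores = split_results(sentiment_pipeline(first_five["comment"].tolist()))
# first_five = first_five.assign(label=labels, confidence=scores)
```

`DataFrame.assign` returns a new DataFrame rather than modifying a slice in place, so this form also sidesteps the view-versus-copy ambiguity in pandas.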


In [6]:
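# Iterate over the comment column; iterating over the DataFrame itself yields column names
for comment in first_five["comment"]:
    print(comment)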

Fricke is the man. Entire class probably took five hours of studying for the entire semester. I think he may be retiring, but if not, take this class if you can fit it in.
Tom Fricke is one of those professors you will never, ever forget. I found myself coming to every single lecture even though none of them were mandatory simply because he is quite possibly the most entertaining person on the planet. Tom cares so much about his students and I became quite good friends with him through office hours. 100/10 recommend
Prof. Fricke is amazing. He is hilarious and tells great, interesting stories. Practice exams are the actual exams, but Fricke doesn't give the answers directly (he will if you ask him). If you have the option, TAKE THIS CLASS!
Such an easy class. Exams were exactly like the practice exam given and I didn't have to show up to lecture. He's also a great guy and I went to office hours just to chat with him.
Easiest class i have taken at UM. The exams took a majority of the cl

In [7]:
# Select the EECS280 comments and drop the placeholder used for empty comments
eecs280 = rating[rating["class"] == "EECS280"]["comment"]
eecs280 = eecs280[eecs280 != "No Comments"]
results = sentiment_pipeline(eecs280.tolist())

# Each score is the model's confidence in the label it predicted
positive_scores = [result["score"] for result in results if result["label"] == "POS"]
negative_scores = [result["score"] for result in results if result["label"] == "NEG"]

# Guard against division by zero when a label never occurs
average_positive_score = sum(positive_scores) / len(positive_scores) if positive_scores else 0
average_negative_score = sum(negative_scores) / len(negative_scores) if negative_scores else 0

print(f"Average positive score: {average_positive_score}")
print(f"Average negative score: {average_negative_score}")

Average positive score: 0.9592511071978023
Average negative score: 0.8555095936570849


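The model is very confident in its positive labels (about 0.96 on average) and somewhat less confident in its negative labels (about 0.86). Note that each score is the model's confidence in the predicted label, not a measure of how strongly positive or negative the comment is.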
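Average scores alone do not show how many comments received each label, which matters when judging the class overall. The sketch below groups results by whatever labels the model returns and reports a count and a mean score for each. `summarize_by_label` is a hypothetical helper, and it assumes results are dicts with `label` and `score` keys as in the output above.

```python
from collections import defaultdict


def summarize_by_label(results):
    """Return {label: {"count": n, "mean_score": m}} for pipeline-style results."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["label"]].append(result["score"])

    # Every label in `grouped` has at least one score, so the division is safe
    return {
        label: {"count": len(scores), "mean_score": sum(scores) / len(scores)}
        for label, scores in grouped.items()
    }


# Usage with the notebook's objects (assumes `results` exists):
# for label, stats in summarize_by_label(results).items():
#     print(label, stats["count"], round(stats["mean_score"], 3))
```

Because the labels are not hard-coded, any additional label the model emits (for example a neutral class, if it has one) is included automatically.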

Average positivity using the comments

Summarize the class (regardless of professor) using a summarizer.

In [8]:
# Group the comments into batches and join each batch into one string of text.
comment_batches = []

# Summarization models cap the number of input tokens, so we summarize 10 comments at a time
batch_size = 10
for start in range(0, len(eecs280), batch_size):
    batch = ''.join(eecs280.iloc[start:start + batch_size])
    comment_batches.append(batch)
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
summary = summarizer(comment_batches[0], max_length=100, min_length=50, do_sample=False)
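A fixed number of comments per batch does not bound the input length, since comments vary widely in size. As a rough alternative, the sketch below batches by a word budget and then summarizes every batch. Both helpers are hypothetical, the `max_words=400` default is an assumed value rather than a documented limit, and a word count is only a loose proxy for the model's token count.

```python
def batch_by_word_budget(comments, max_words=400):
    """Greedily group comments so each batch stays within a word budget."""
    batches, current, current_words = [], [], 0
    for comment in comments:
        n_words = len(comment.split())
        # Start a new batch when adding this comment would exceed the budget
        if current and current_words + n_words > max_words:
            batches.append(" ".join(current))
            current, current_words = [], 0
        current.append(comment)
        current_words += n_words
    if current:
        batches.append(" ".join(current))
    return batches


def summarize_batches(summarizer, batches, max_length=100, min_length=50):
    """Run a summarization callable over every batch and collect the text."""
    summaries = []
    for batch in batches:
        output = summarizer(batch, max_length=max_length,
                            min_length=min_length, do_sample=False)
        summaries.append(output[0]["summary_text"].strip())
    return summaries


# Usage with the notebook's objects (assumes `eecs280` and `summarizer` exist):
# batches = batch_by_word_budget(eecs280.tolist())
# summaries = summarize_batches(summarizer, batches)
```

A single comment longer than the budget still ends up in a batch of its own, so very long comments may need separate handling.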


In [9]:
summary[0]["summary_text"]

" He knows OOP, if you pay attention he will teach you. Take advantage of office hours! Conveniently posts lecture notes online so that students can focus on the material and not worry about scribbling down every last bit of code. He's an akward, pacing computer dork."

In [10]:
# Try another batch:
summary = summarizer(comment_batches[1], max_length = 100, min_length = 50, do_sample=False )
summary[0]["summary_text"]

" Do not waist your money and your time by taking a class that she teaches . She speaks too fast and with a heavy accent, and goes through work way too quickly . Exams are killers - very unreasonable . She's actually nice to talk to if you make an effort to go to her office hours ."