## NLP (NATURAL LANGUAGE PROCESSING)
 - NLP is a field of linguistics and machine learning focused on understanding everything related to human language.
 - The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.
 - The following is a list of common NLP tasks, with some examples of each:

1. `Classifying whole sentences`: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not

2. `Classifying each word in a sentence`: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)

3. `Generating text content`: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words

4. `Extracting an answer from a text`: Given a question and a context, extracting the answer to the question based on the information provided in the context

5. `Generating a new sentence from an input text`: Translating a text into another language, summarizing a text

NLP isn’t limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.



In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")


classifier("I love you")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
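
The log above warns that no model name or revision was supplied and that the model stays on the CPU. A minimal sketch of one way to address both warnings follows. The model name and revision are copied from that log message; the helper name `load_sentiment_classifier` and the dictionary `PINNED_SENTIMENT` are made up for this illustration, and the import is done lazily so the configuration can be inspected without loading the model.

```python
# Model name and revision copied from the log message above.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def load_sentiment_classifier(device=-1):
    """Build the pipeline with a pinned model and an explicit device.

    device=-1 keeps the model on the CPU; device=0 selects the first GPU.
    (Hypothetical helper, not part of the original notebook.)
    """
    from transformers import pipeline  # imported lazily; downloads weights on first use

    return pipeline(device=device, **PINNED_SENTIMENT)
```

Calling `load_sentiment_classifier(device=0)("I love you")` on a machine with a GPU should behave like the cell above, without the two warnings about the unpinned model and the unused accelerator.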

In [9]:
# Zero-shot classification
# Helps us classify text that has not been labelled
zero_shot_classifier = pipeline("zero-shot-classification")
zero_shot_classifier("This course is about DINOSEURS", candidate_labels=["education", "science", "history"])

# This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it.
# It can directly return probability scores for any list of labels you want!

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'sequence': 'This course is about DINOSEURS',
 'labels': ['history', 'education', 'science'],
 'scores': [0.5254783034324646, 0.3027774691581726, 0.17174425721168518]}
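
The three scores above sum to roughly 1, which fits how an NLI-based zero-shot classifier is usually described: each candidate label is turned into a hypothesis sentence, the model scores how strongly the text entails it, and the entailment scores are normalized across labels. The sketch below illustrates only that normalization step. The hypothesis template is an assumption about the pipeline default, the entailment logits are invented numbers, and `rank_labels` is a hypothetical helper, so the resulting scores will not match the real output.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(sequence, candidate_labels, entailment_logits,
                template="This example is {}."):
    """Rank labels by normalized entailment score (hypothetical helper)."""
    hypotheses = [template.format(label) for label in candidate_labels]
    scores = softmax(entailment_logits)
    ranked = sorted(zip(candidate_labels, hypotheses, scores),
                    key=lambda item: item[2], reverse=True)
    return {
        "sequence": sequence,
        "labels": [label for label, _, _ in ranked],
        "hypotheses": [hyp for _, hyp, _ in ranked],
        "scores": [score for _, _, score in ranked],
    }


# Invented entailment logits, one per label, in the order the labels are given.
result = rank_labels(
    "This course is about DINOSEURS",
    ["education", "science", "history"],
    entailment_logits=[0.5, -0.1, 1.0],
)
```

Because the scores are normalized across the candidate list, adding or removing a label changes every other label's score.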

In [18]:
# Text generation
# This is similar to the predictive-text feature found on many phones.
# num_return_sequences controls how many different sequences are generated,
# and max_length caps the total length of the output (prompt plus generated tokens).
text_generator = pipeline("text-generation")

print(text_generator("I would like to thank Mr."))

print(text_generator("hello Boys, how", max_length=10))  # output is capped at 10 tokens, prompt included

print(text_generator("hello Boys, how", num_return_sequences=5))  # returns a list of 5 generated sequences

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'I would like to thank Mr. Fassbender and his staff for their hard work. Please also let me know if your restaurant ever loses customers. I would also like to thank Mr. Pankaja and his staff for helping with my food request'}]


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "hello Boys, how's your record going? I"}]
[{'generated_text': 'hello Boys, how does he know you\'ve been playing his fantasy football game this season?"\n\n"We do know this."\n\nThat comment, for many fans, didn\'t quite make it inside the room until the last hour of the game'}, {'generated_text': 'hello Boys, how we got together and started on her family, we\'ll definitely be back next year.\n\n"We\'re all very happy. We\'re so well. We\'re all trying to pull it together."'}, {'generated_text': 'hello Boys, how much do you love to play hockey? This year there\'s an interesting team called "Guns of Wobblie" that is a bunch of small-town kids who play as \'Boys.\' They play on the main'}, {'generated_text': "hello Boys, how are we holding on? The best thing to say is that as soon as there's a change in perspective, there's not necessarily a huge problem, especially if you're young. You need to make a conscious effort to have any"}, {'generated_text': 'hello Boys, ho
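
To make the roles of `max_length` and `num_return_sequences` concrete, here is a toy sketch of sampling-based generation. It is not how GPT-2 works internally: whole words stand in for tokens, and the `NEXT_WORDS` table is made up, whereas a real model scores its entire vocabulary at every step. The sketch only shows that the length cap counts the prompt and that each returned sequence is an independent sample.

```python
import random

# Made-up next-word table; a real language model predicts a probability
# for every token in its vocabulary instead of using a lookup like this.
NEXT_WORDS = {
    "how": ["are", "is"],
    "are": ["you", "we"],
    "is": ["it", "everyone"],
    "you": ["doing", "today"],
    "we": ["doing", "today"],
    "it": ["going"],
    "everyone": ["doing"],
    "doing": ["today"],
}


def generate(prompt, max_length=10, num_return_sequences=1, seed=0):
    """Sample continuations word by word (toy stand-in for text generation)."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_return_sequences):
        words = prompt.split()
        # max_length bounds the whole sequence, prompt included.
        while len(words) < max_length:
            candidates = NEXT_WORDS.get(words[-1])
            if not candidates:
                break  # nothing known to follow this word; stop early
            words.append(rng.choice(candidates))
        results.append({"generated_text": " ".join(words)})
    return results


samples = generate("hello boys how", max_length=5, num_return_sequences=5)
```

With a three-word prompt and `max_length=5`, each sample can grow by at most two words, mirroring how the `max_length=10` call above stops shortly after the prompt.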

In [19]:
# Use a specific model from the Hugging Face Hub by passing `model` to the pipeline function
text_generator = pipeline("text-generation", model="distilgpt2")
text_generator("Hello Boss, how can i")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "Hello Boss, how can i tell about it? It's a puzzle that you've always played in a computer game, but then you're all here, you can go. It's really a game of playing and solving puzzles. It looks like your"}]

In [26]:
# Fill-mask: predict the token hidden behind <mask>
# top_k sets how many candidate tokens are returned
unmask_it = pipeline("fill-mask")
unmask_it("hello buddy , you are <mask> nice today", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `P

[{'score': 0.17569412291049957,
  'token': 182,
  'token_str': ' very',
  'sequence': 'hello buddy , you are very nice today'},
 {'score': 0.1327628642320633,
  'token': 546,
  'token_str': ' looking',
  'sequence': 'hello buddy , you are looking nice today'}]
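
The shape of this output can be reproduced with a small sketch of the top-k step: given a score for each candidate token, keep the `top_k` best and substitute each into the masked sentence. The scores for `' very'` and `' looking'` are copied from the output above; the other candidates and their scores are invented, and `fill_mask` is a hypothetical helper rather than the pipeline's own code.

```python
import heapq

# First two scores are copied from the output above; the rest are invented.
candidate_scores = {
    " very": 0.17569412291049957,
    " looking": 0.1327628642320633,
    " so": 0.09,       # hypothetical
    " really": 0.07,   # hypothetical
    " not": 0.02,      # hypothetical
}


def fill_mask(template, scores, top_k=2, mask_token="<mask>"):
    """Return the top_k highest-scoring fillings of the mask (hypothetical helper)."""
    best = heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])
    return [
        {
            "score": score,
            "token_str": token,
            # Candidate tokens carry their own leading space, so the space
            # before the mask is replaced together with the mask itself.
            "sequence": template.replace(" " + mask_token, token, 1),
        }
        for token, score in best
    ]


predictions = fill_mask("hello buddy , you are <mask> nice today", candidate_scores)
```

Note that the mask token depends on the model: the default checkpoint here uses `<mask>`, while other checkpoints may expect a different string.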

In [28]:
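# Named entity recognition (NER)
# Tags spans of text as persons (PER), locations (LOC), or organizations (ORG)
# grouped_entities=True merges the tokens of one entity into a single span;
# newer transformers releases prefer aggregation_strategy="simple" for this
ner = pipeline("ner", grouped_entities=True)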
ner("My name is Sachin and I work at Axomiym Labs in Kochi")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is

[{'entity_group': 'PER',
  'score': 0.9989223,
  'word': 'Sachin',
  'start': 11,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.9930562,
  'word': 'Axomiym Labs',
  'start': 32,
  'end': 44},
 {'entity_group': 'LOC',
  'score': 0.9909345,
  'word': 'Kochi',
  'start': 48,
  'end': 53}]
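
The `start` and `end` values are character offsets into the input string, so the entities can be recovered by slicing the original text. The sketch below post-processes the output shown above (scores and offsets copied verbatim) into a dictionary keyed by entity type. The `min_score` threshold and the helper name `group_entities` are illustrative assumptions.

```python
from collections import defaultdict

text = "My name is Sachin and I work at Axomiym Labs in Kochi"

# Copied from the pipeline output above.
entities = [
    {"entity_group": "PER", "score": 0.9989223, "word": "Sachin", "start": 11, "end": 17},
    {"entity_group": "ORG", "score": 0.9930562, "word": "Axomiym Labs", "start": 32, "end": 44},
    {"entity_group": "LOC", "score": 0.9909345, "word": "Kochi", "start": 48, "end": 53},
]


def group_entities(source, found, min_score=0.9):
    """Collect entity strings by type, slicing them out of the source text.

    min_score is a hypothetical confidence cut-off, not a pipeline default.
    """
    grouped = defaultdict(list)
    for ent in found:
        if ent["score"] < min_score:
            continue
        grouped[ent["entity_group"]].append(source[ent["start"]:ent["end"]])
    return dict(grouped)


by_type = group_entities(text, entities)
```

Slicing by offset rather than reading `word` keeps the original casing and spacing of the input, which can matter when the tokenizer normalizes text.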

In [31]:
# Question Answering
# The question-answering pipeline answers questions using information from a given context:

qr = pipeline("question-answering")
qr(question="Where do i work", context="My name is Sachin and works at Axomium albs")


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'score': 0.9574574828147888, 'start': 31, 'end': 43, 'answer': 'Axomium albs'}
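
The pipeline does not generate an answer; it extracts a span of the context, which is why the result carries `start` and `end` offsets. A common way to describe extractive question answering is that the model assigns each token a score for being the start of the answer and a score for being the end, and the best valid pair is selected. The sketch below shows that selection step only. The scores are invented, whitespace-separated words stand in for tokens, and `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_scores, end_scores, max_answer_len=4):
    """Return (start, end) indices maximizing start + end score with start <= end."""
    best = None
    for i, start_score in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            total = start_score + end_scores[j]
            if best is None or total > best[0]:
                best = (total, i, j)
    return best[1], best[2]


context = "My name is Sachin and works at Axomium albs"
tokens = context.split()

# Invented per-token scores; a real model produces these from the question and context.
start_scores = [0.1, 0.0, 0.0, 1.2, 0.0, 0.1, 0.3, 4.5, 0.5]
end_scores = [0.0, 0.1, 0.0, 0.9, 0.0, 0.2, 0.1, 1.0, 4.8]

first, last = best_span(start_scores, end_scores)
answer = " ".join(tokens[first:last + 1])
```

In the real output above, slicing the context with the returned offsets, `context[31:43]`, gives the same `'Axomium albs'` string.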

In [32]:
# Summarization
summar = pipeline("summarization")
summar(""" America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'summary_text': ' America has changed dramatically during recent years . There are declining offerings in engineering subjects dealing with infrastructure, the environment, and related issues . There is greater concentration on high-tech subjects, largely supporting increasingly complex scientific developments . While the latter is important, it should not be at the expense of more traditional engineering .'}]
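
The summary text comes back with a leading space and a space before each full stop. If the summary is shown to readers, a small clean-up step can help. The sketch below is one possible approach using a regular expression; `tidy_summary` is a hypothetical helper, and the sample string is the first two sentences of the output above.

```python
import re


def tidy_summary(text):
    """Remove whitespace before punctuation and trim the ends (hypothetical helper)."""
    return re.sub(r"\s+([.,;:!?])", r"\1", text).strip()


# First two sentences of the summary output above, copied verbatim.
raw = (
    " America has changed dramatically during recent years . "
    "There are declining offerings in engineering subjects dealing with "
    "infrastructure, the environment, and related issues ."
)
clean = tidy_summary(raw)
```

This only touches spacing; it does not change the wording the model produced.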