# **NLP pipeline:**


  


  **1. Tokenize -> 2. Preprocess -> 3. Featurize -> 4. Model**


  

---


## 1. **Tokenize**

Tokenization: break sentences down into words, characters, or sequences of characters (n-grams).
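
As a rough illustration of the three granularities mentioned above, the sketch below uses only the Python standard library. The helper names (`word_tokens`, `char_ngrams`) and the regular expression are assumptions made for this example and are not part of NLTK.

```python
import re

def word_tokens(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches runs of letters/digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "Sugar Gliders, Sugar Gliders"
words = word_tokens(sample)
chars = list("Sugar")
trigrams = char_ngrams("Sugar", 3)

print(words)     # ['Sugar', 'Gliders', ',', 'Sugar', 'Gliders']
print(chars)     # ['S', 'u', 'g', 'a', 'r']
print(trigrams)  # ['Sug', 'uga', 'gar']
```

A dedicated tokenizer such as `nltk.word_tokenize` handles many more edge cases (contractions, abbreviations) than this simple pattern.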

In [None]:
# Let's import the libraries for tokenization. There are plenty of libraries you can use, such as NLTK, spaCy, and Gensim, but we will focus on NLTK as one of the most popular and easiest to use.
import nltk
nltk.download('punkt')
text = ' Sugar Gliders active at night, they are very social animals and live in family groups. Sugar Gliders, Sugar Gliders '

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Sugar', 'Gliders', 'active', 'at', 'night', ',', 'they', 'are', 'very', 'social', 'animals', 'and', 'live', 'in', 'family', 'groups', '.', 'Sugar', 'Gliders', ',', 'Sugar', 'Gliders']


## 2. **Text Processing**

Turn raw data into a format suitable for analysis or modeling.

Whenever we have textual data, we need to apply several processing and pre-processing steps to the data to transform words into numerical features that work with machine learning algorithms.


* Normalization (lowercasing, handling punctuation, HTML tags, extra spaces).
* Stop word removal (words such as 'the', 'is', 'a' carry little value).
* Stemming & Lemmatization ('swimming' -> 'swim').

**NOTE: you don't need to use stemming and lemmatization at the same time ☝**

The decision depends on the specific NLP task you're performing.

* If accuracy and valid dictionary forms are crucial, lemmatization is preferred (e.g., sentiment analysis, where over-aggressive stemming can merge words with different meanings).
* If capturing the general meaning is sufficient and processing speed is a concern, stemming might be a good option (e.g., topic modeling, where the exact grammatical form is less important).
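
The normalization step from the list above (lowercasing, punctuation, HTML tags, extra spaces) can be sketched with the standard library alone. The function name `normalize` and the exact cleaning rules are assumptions for illustration; real projects often need task-specific rules.

```python
import html
import re
import string

def normalize(raw):
    """Lowercase text and strip HTML tags, punctuation, and extra spaces."""
    text = html.unescape(raw)             # turn entities such as &amp; into characters
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags (a simple pattern, not a full parser)
    text = text.lower()
    # Remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

cleaned = normalize("<p>Sugar   Gliders are <b>very</b> social &amp; active at night!</p>")
print(cleaned)  # sugar gliders are very social active at night
```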

In [None]:
# import needed libraries for the Text Processing step
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Lowercase the text
text = text.lower()

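# Remove stop words
# Note: str.split() separates on whitespace only, so punctuation stays attached
# to words (e.g. 'night,'), which is visible in the Bag-of-Words output.
stop_words = stopwords.words('english')
text_tokens = [word for word in text.split() if word not in stop_words]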

# Stemming
ps = PorterStemmer()
stemmed_tokens = [ps.stem(token) for token in text_tokens]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 3. **Featurize (Vectorization)**

* *Goal*: Transform raw text data into numerical features suitable for machine learning algorithms.

* Vectorization is one technique for feature engineering: it represents a text document or token (word) as a fixed-length vector of numbers. There are several vectorization methods.


**1. Bag-of-Words (BoW):**

Creates a "bag" of unique words in a document, ignoring word order.
Each word is represented by a binary value (1 if present, 0 if absent) or its frequency in the document.
Simple and efficient, but doesn't capture word order or relationships.



In [None]:
# Bag of Words: binary or frequency representation, stored as a dictionary {word: count}

# Create the Bag of Words model by counting each stemmed token
word2count = {}
for word in stemmed_tokens:
    word2count[word] = word2count.get(word, 0) + 1

print(word2count)

{'sugar': 3, 'glider': 2, 'activ': 1, 'night,': 1, 'social': 1, 'anim': 1, 'live': 1, 'famili': 1, 'groups.': 1, 'gliders,': 1}
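
To make the "fixed-length vector" idea concrete, the sketch below turns two tiny, already-tokenized documents into count and binary vectors over a shared vocabulary. The documents and variable names are made up for this example.

```python
from collections import Counter

# Two toy documents, already tokenized and stemmed (hypothetical data)
docs = [
    ["sugar", "glider", "activ", "night"],
    ["sugar", "glider", "sugar", "glider"],
]

# The vocabulary fixes the vector length and the meaning of each position
vocab = sorted({token for doc in docs for token in doc})

def count_vector(doc):
    """Return the term counts of one document in vocabulary order."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

count_vectors = [count_vector(doc) for doc in docs]
binary_vectors = [[1 if c > 0 else 0 for c in vec] for vec in count_vectors]

print(vocab)           # ['activ', 'glider', 'night', 'sugar']
print(count_vectors)   # [[1, 1, 1, 1], [0, 2, 0, 2]]
print(binary_vectors)  # [[1, 1, 1, 1], [0, 1, 0, 1]]
```

Every document maps to a vector of the same length (the vocabulary size), which is what most machine learning models expect as input.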


**2. TF-IDF (Term Frequency-Inverse Document Frequency):**

Improves on BoW by considering both the frequency of a word in a document (TF) and its rarity across all documents (IDF).
Words that are frequent in a document but rare overall get a higher weight, emphasizing their importance.
More informative than BoW, as it considers both document-specific and global word importance.
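
A small worked computation may help here. The sketch below computes TF-IDF by hand using the smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization, which corresponds to the default settings of scikit-learn's `TfidfVectorizer`. The toy documents and function names are assumptions for this example.

```python
import math
from collections import Counter

# Toy corpus: two tokenized documents (hypothetical data)
docs = [
    ["sugar", "glider", "activ", "night"],
    ["sugar", "glider", "sugar", "glider"],
]
n_docs = len(docs)
vocab = sorted({t for doc in docs for t in doc})

# Document frequency: in how many documents does each term occur?
df = Counter(t for doc in docs for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Weight raw term counts by IDF, then scale the vector to unit length."""
    tf = Counter(doc)
    weights = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(doc) for doc in docs]
for vec in vectors:
    print([round(w, 3) for w in vec])
```

In the first document the rarer terms `activ` and `night` end up with a larger weight than `sugar` and `glider`, which occur in both documents.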


In [None]:
!pip install nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Imports needed by this cell (duplicates removed, missing ones added)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Tokenize text into sentences
sentences = sent_tokenize(text)

# 2. Initialize stemmer and stop words
ps = PorterStemmer()
stop_words = set(stopwords.words("english"))

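# 3. Create TF-IDF vectorizer with a custom tokenizer:
#    tokenize, lowercase, drop stop words, then stem.
#    Punctuation tokens such as ',' and '.' are kept as features.
def stem_tokenizer(doc):
    return [ps.stem(word.lower()) for word in word_tokenize(doc)
            if word.lower() not in stop_words]

tfidf_vectorizer = TfidfVectorizer(tokenizer=stem_tokenizer)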

# 4. Fit and transform text to TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)


print(tfidf_matrix)

  (0, 1)	0.32412344955584815
  (0, 6)	0.32412344955584815
  (0, 4)	0.32412344955584815
  (0, 7)	0.32412344955584815
  (0, 3)	0.32412344955584815
  (0, 9)	0.32412344955584815
  (0, 0)	0.23061650387901603
  (0, 8)	0.32412344955584815
  (0, 2)	0.32412344955584815
  (0, 5)	0.23061650387901603
  (0, 10)	0.23061650387901603
  (1, 0)	0.3333333333333333
  (1, 5)	0.6666666666666666
  (1, 10)	0.6666666666666666


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




**3. Word Embedding:**

A more sophisticated approach that captures semantic relationships between words.
Words are mapped to vectors in a high-dimensional space, where similar words are positioned close together.
Techniques like Word2Vec or GloVe use neural networks to learn these word embeddings from large text corpora.
Captures semantic meaning and relationships beyond simple word presence or frequency.
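
"Close together" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings from Word2Vec or GloVe are learned from data and have many more dimensions.

```python
import math

# Hand-made toy vectors for illustration only (not learned embeddings)
embeddings = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "train": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_train = cosine_similarity(embeddings["cat"], embeddings["train"])

print(f"cat vs dog:   {sim_cat_dog:.3f}")
print(f"cat vs train: {sim_cat_train:.3f}")
```

With these toy values, "cat" and "dog" score much higher than "cat" and "train", which mirrors how related words sit near each other in a learned embedding space.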


**Which one should you use?** 🔅

It depends on your specific NLP task and data! 👍

**BoW**: A good starting point for simple classification tasks where word order might not be crucial.

**TF-IDF**: Useful for tasks where capturing the importance of words within a document and across the corpus is important (e.g., document similarity, information retrieval).

**Word Embedding**: Powerful for tasks that require understanding semantic relationships between words (e.g., sentiment analysis, machine translation).

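## 4. **Model**

## **Text Summarization with a Transformer Architecture Model (Hugging Face)** 🤗

1. Import the dependencies
2. Build a pipeline
3. Summarize the text

We pass the article as the argument and set `max_length`, `min_length`, and `do_sample`. Setting `do_sample=False` tells the summarizer to decode deterministically (greedy or beam search, depending on the model's generation settings) instead of sampling tokens at random.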
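
To illustrate what `do_sample` controls: without sampling, the decoder deterministically favors the highest-scoring tokens, whereas with sampling it draws tokens at random according to their probabilities. The next-token distribution below is invented for this sketch; a real model scores tens of thousands of candidate tokens at every step.

```python
import random

# Invented next-token probabilities for one decoding step
next_token_probs = {"companies": 0.6, "founders": 0.3, "bananas": 0.1}

def greedy_pick(probs):
    """Deterministic: always return the most probable token."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Stochastic: draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy_choice = greedy_pick(next_token_probs)
sampled = [sample_pick(next_token_probs, rng) for _ in range(5)]

print(greedy_choice)  # companies
print(sampled)
```

Deterministic decoding gives the same summary on every run, which is usually what you want for summarization.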

In [1]:
# Install the Hugging Face library
!pip install transformers



In [3]:
# We download a model that someone else trained on a large dataset
from transformers import pipeline

In [4]:
# Load the summarization pipeline
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
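
The warning above suggests naming the model explicitly. A minimal sketch of that follows, using the model name and revision reported in the log output. The helper `build_summarizer` is a hypothetical wrapper introduced for this example, and `transformers` is imported inside it so that nothing is downloaded until the function is called.

```python
# Model name and revision taken from the warning message above
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"
MODEL_REVISION = "a4f8f3e"

def build_summarizer():
    """Create a summarization pipeline pinned to a specific model and revision."""
    from transformers import pipeline  # imported here so the download only happens on use
    return pipeline("summarization", model=MODEL_NAME, revision=MODEL_REVISION)

# Usage (downloads the model weights on the first call):
# summarizer = build_summarizer()
```

Pinning the model and revision keeps results reproducible even if the library's default summarization model changes later.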



In [7]:
article  = ''' As you'll know by now, I write on all sorts of entrepreneurial topics. The newsletter has covered everything from consumer perception to imposter syndrome; we've talked breaking industry rules, finding your product market fit, and what it means to be an effective leader. And I've loved it.

But this is what I know to be true about entrepreneurship: it starts with the person. Before you get down to the gritty stuff – like finding your market and building a product – you have to first ask yourself: are you, as a person, ready for the life that comes with being an entrepreneur?

This is a topic I haven't really touched on, or not as much as I'd like. So, let's talk about it.

A Quick Intro
Many of us are obsessed with the idea of creating something cool and being our own boss; so obsessed, though, that we often forget to ask ourselves the tough questions. Do I suit the role of an entrepreneur? Have I done the groundwork (and inner work) to be ready for this?

And please, don't misread me here. I do believe that any one of us can create something and become the boss of it. But are we all ready for the entrepreneurial lifestyle, just as we are? Without any self-assessment and preparation, I'm not so sure.

Jared Yellin is my inspo for today's newsletter, and for good reason. He's a non-technical tech founder who found a way to scale his original tech company with zero dollars invested to over 40,000 paying users from around the world.

After exiting his first venture, he partnered with Grant Cardone to launch the 10X Incubator, with the goal of building, scaling, and selling 10,000 tech companies in 10 years. The 10X Incubator, he believes, will democratize the tech industry by creating an even playing field for all.

Since Jared's in the business of identifying entrepreneurs that have real potential, I figured he's one of the best guys to grill on the topic of entrepreneurial preparedness. What does he look for? What traits should we exhibit?

I hope you'll find his responses useful – I certainly did. Let's dive in!

Why Can't We All Be Entrepreneurs?
I'll get to Jared's awesome insights in just a minute. First, though, let's talk about this. Why can't we all be entrepreneurs? And what's so special about the people who can make it happen?

It's important to note – as I've noted in many of my newsletters now – that most entrepreneurs fail. That's just the facts of the industry. Ten percent fail in their first year, and 70 percent fail between years two and five. So no, we can't all be entrepreneurs (even though we might try).

But here's the light at the end of the tunnel. Pretty much everything that makes an entrepreneur successful – the characteristics and traits that allow them to reach that success – can be learned and honed. Entrepreneurs aren't always born.

What Makes An Entrepreneur?
So then, what are the traits that successful entrepreneurs need to possess? Looking online for resources pertaining to this subject is pretty interesting. I started with a questionnaire written way back in 2010; according to its author, Daniel Isenberg, agreeing with many of the following statements indicates a high chance of entrepreneurial readiness:'''

In [9]:
summarizer(article, max_length=130, min_length=30, do_sample=False)

[{'summary_text': " Jared Yellin, a non-technical tech founder, launched the 10X Incubator, with the goal of building, scaling, and selling 10,000 tech companies in 10 years . Entrepreneurs aren't always born, and the traits that make them successful can be learned and honed, he says . Jared's answers come from a questionnaire written by Daniel Isenberg in 2010 ."}]