# Feature Vectorization
- Feature vectorization converts preprocessed text into numerical features.
- **Types of Feature Vectorization**
  1. Heuristic
     - re
     - wordnet
  2. ML
     - One hot Encoding
     - Index Based
     - BOW(Bag Of Words)
     - TF-IDF
  3. DL Approach
     - Word2Vec
     - fastText
     - GPT
     - LLMs, etc.
## One Hot Encoding
- OHE represents each document as a binary vector: every word in the vocabulary gets one column, and the value marks whether that word occurs in the document.
- 0 -> Absent | 1 -> Present
- Words -> Columns | Documents -> Rows
- **Steps to implement OHE:**
      1. Preprocess the text.
      2. Tokenize the text data.
      3. Create the vocabulary.
      4. Assign a binary value to each word: 0 -> Absent | 1 -> Present.
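
The steps above can be sketched in plain Python. This is a minimal illustration, not a production encoder: the three-document corpus is made up, the text is assumed to be preprocessed already, and tokenization is a simple whitespace split.

```python
# Hypothetical toy corpus, assumed to be preprocessed already (step 1).
docs = ["data analyst", "sql engineer", "data engineer"]

# Step 2: tokenize each document on whitespace.
tokenized = [doc.split() for doc in docs]

# Step 3: build the vocabulary as a sorted list of unique words.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 4: one row per document, one column per word (1 = present, 0 = absent).
one_hot = [[1 if word in tokens else 0 for word in vocabulary]
           for tokens in tokenized]

print(vocabulary)  # ['analyst', 'data', 'engineer', 'sql']
for row in one_hot:
    print(row)
# [1, 1, 0, 0]
# [0, 0, 1, 1]
# [0, 1, 1, 0]
```

Each row has as many entries as the vocabulary has words, which is why the input shape grows with the vocabulary.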
#### Pros / Cons of applying OHE
- **Pros**
  - Intuitive.
  - Simple to understand.
- **Cons**
  - Input shape is not constant.
    - The input shape changes with the size of the vocabulary.
  - Order/sequence is missing.
    - Contextual meaning is lost.
  - Sparse matrix.
    - A sparse matrix contains mostly zeros. (The opposite is a **dense matrix**, which contains few zeros.)
    - Multicollinearity: columns that carry almost the same data are strongly related to each other.
    - ML models struggle to learn patterns from such data.
  - OOV (out of vocabulary).
    - Words outside the set of unique words seen during training.
    - The ML algorithm is trained on the vocabulary of the training data; if a new word that is not in the vocabulary appears during testing, the model cannot represent it.
  - Lack of semantics.
    - The relationship between words is not captured.

## Index Based Encoding
- The idea behind index-based encoding is to map each word to one index, i.e., a number. The first step is to create a dictionary that maps words to indexes.
- **Pros**
  - Intuitive.
  - Simple to understand.
  - Preserves some contextual meaning.
- **Cons**
  - Input shape is not constant.
  - Order/sequence is missing.
  - Sparse matrix.
  - OOV (out of vocabulary).
  - Lack of semantics.
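
A minimal sketch of the word-to-index dictionary described above. The reserved `<pad>` and `<unk>` entries, the `encode` helper, and the `max_len` parameter are assumptions added for illustration; they are one common way to handle variable-length input and unseen words, not the only one.

```python
docs = ["i am data analyst", "i am sql engineer"]  # hypothetical preprocessed corpus

# Assumed convention: index 0 is padding, index 1 stands for unknown (OOV) words.
word_to_index = {"<pad>": 0, "<unk>": 1}
for doc in docs:
    for word in doc.split():
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)

def encode(sentence, max_len=5):
    """Map each word to its index, then pad or truncate to max_len."""
    ids = [word_to_index.get(word, word_to_index["<unk>"])
           for word in sentence.split()]
    ids = ids[:max_len]
    return ids + [word_to_index["<pad>"]] * (max_len - len(ids))

print(word_to_index)
print(encode("i am data engineer"))  # [2, 3, 4, 7, 0]
print(encode("i am jobless"))        # [2, 3, 1, 0, 0] -> "jobless" is OOV
```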
 
## BOW(Bag Of Words)
- BOW converts preprocessed text into a structured, numerical representation.
- BOW represents each document by <ins>counts</ins>, i.e., the number of occurrences of each word in the <ins>document (row/cell)</ins>.
- BOW gives importance to words within a document: the more often a word occurs, the larger its value.
-  **Steps to implement BOW:**
      1. Preprocessing text.
      2. Tokenize the text data.
      3. Create vocabulary.
      4. Assign values based on the number of occurrences.
- **PROS**
  - Intuitive
  - Simple to understand.
  - Importance to words.
- **Cons**
  - Input shape is not constant.
    - Solution: max_features
  - Order/sequence is missing.
    - Solution: N-grams.
  - Sparse matrix.
  - OOV (out of vocabulary).
  - Lack of semantics.
- **N-grams** -> sequences of N consecutive words.
  - Capture some contextual meaning.
  - **Unigram** (1,1) - single words.
  - **Bigram** (2,2) - combinations of two consecutive words.
  - **Trigram** (3,3) - combinations of three consecutive words.
  - **Range** (1,2) - unigrams and bigrams together.
    - Ex: Biryani is good, tasty and cheap. (after removing the stop words "is" and "and")
      - Unigram => Biryani, good, tasty, cheap
      - Bigram => Biryani good, good tasty, tasty cheap
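
The Biryani example can be reproduced with a short sketch. The `ngrams` helper is a hypothetical name, and the token list is assumed to be lowercased with stop words already removed.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# "Biryani is good, tasty and cheap." after lowercasing and stop-word removal.
tokens = ["biryani", "good", "tasty", "cheap"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
range_1_2 = unigrams + bigrams  # the features a (1, 2) range would produce

print(unigrams)  # ['biryani', 'good', 'tasty', 'cheap']
print(bigrams)   # ['biryani good', 'good tasty', 'tasty cheap']
```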

## TF-IDF(Term Frequency - Inverse Document Frequency)
- TF-IDF converts preprocessed text into a structured, numerical representation.
- The main advantage of TF-IDF is that it gives importance/priority to words at the corpus level: words that are frequent in a document but rare across the corpus get the highest weight.
- TF-IDF represents each word by <ins>TF * IDF</ins>.
- **Term Frequency**
  - Measured at the document level.
  - Term Frequency = No. of occurrences of the word in the document / Total no. of words in the document.
- **Inverse Document Frequency**
  - Measured at the corpus level.
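  - Inverse Document Frequency, where $N$ is the total number of documents in the corpus and $n$ is the number of documents that contain the word:
$$
\text{IDF} = \log_{e}\left(\frac{N+1}{n+1}\right) + 1
$$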
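
The two quantities can be computed by hand. The sketch below assumes the smoothed IDF form ln((N + 1) / (n + 1)) + 1 and the TF definition given above; the function names and the tokenized corpus are made up for illustration.

```python
import math

# Hypothetical tokenized corpus (stop words already removed).
corpus = [
    ["data", "analyst", "data", "scientist"],
    ["sql", "developer"],
    ["ml", "engineer", "data"],
]

def term_frequency(term, document):
    """Occurrences of the term divided by the total words in the document."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, corpus):
    """Smoothed IDF: ln((N + 1) / (n + 1)) + 1."""
    n_docs = len(corpus)                                  # N
    docs_with_term = sum(term in doc for doc in corpus)   # n
    return math.log((n_docs + 1) / (docs_with_term + 1)) + 1

tf = term_frequency("data", corpus[0])            # 2 / 4 = 0.5
idf = inverse_document_frequency("data", corpus)  # ln(4 / 3) + 1
print(tf, round(idf, 4), round(tf * idf, 4))
```

A word found in fewer documents gets a larger IDF: here "sql" (one document) scores higher than "data" (two documents).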

In [1]:
import pandas as pd

In [16]:
df = pd.DataFrame([["I am data analyst, data scientist and data engineer"],
                   ["I am powerBI devloper"],
                   ["I am SQL Engineer"],
                   ["i am jobless"]], columns=["Text"])

In [17]:
df

Unnamed: 0,Text
0,"I am data analyst, data scientist and data eng..."
1,I am powerBI devloper
2,I am SQL Engineer
3,i am jobless


In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
bow = CountVectorizer(stop_words="english")
pd.DataFrame(bow.fit_transform(df["Text"]).toarray(), columns=bow.get_feature_names_out())

Unnamed: 0,analyst,data,devloper,engineer,jobless,powerbi,scientist,sql
0,1,3,0,1,0,0,1,0
1,0,0,1,0,0,1,0,0
2,0,0,0,1,0,0,0,1
3,0,0,0,0,1,0,0,0


In [20]:
bow.vocabulary_

{'data': 1,
 'analyst': 0,
 'scientist': 6,
 'engineer': 3,
 'powerbi': 5,
 'devloper': 2,
 'sql': 7,
 'jobless': 4}

### N grams

In [21]:
bow = CountVectorizer(stop_words="english",ngram_range=(1,2))
pd.DataFrame(bow.fit_transform(df["Text"]).toarray(), columns=bow.get_feature_names_out())

Unnamed: 0,analyst,analyst data,data,data analyst,data engineer,data scientist,devloper,engineer,jobless,powerbi,powerbi devloper,scientist,scientist data,sql,sql engineer
0,1,1,3,1,1,1,0,1,0,0,0,1,1,0,0
1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1
3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [23]:
bow = CountVectorizer(stop_words="english",ngram_range=(1,1),binary=True)
pd.DataFrame(bow.fit_transform(df["Text"]).toarray(), columns=bow.get_feature_names_out())

Unnamed: 0,analyst,data,devloper,engineer,jobless,powerbi,scientist,sql
0,1,1,0,1,0,0,1,0
1,0,0,1,0,0,1,0,0
2,0,0,0,1,0,0,0,1
3,0,0,0,0,1,0,0,0


### Vocabulary
- Pass a `vocabulary` mapping (word -> column index) to set the column order ourselves.

In [28]:
order = {'data': 6,
     'analyst': 7,
     'scientist': 5,
     'engineer': 0,
     'powerbi': 1,
     'devloper': 4,
     'sql': 2,
     'jobless': 3}

bow = CountVectorizer(stop_words="english",vocabulary=order)
pd.DataFrame(bow.fit_transform(df["Text"]).toarray(), columns=bow.get_feature_names_out())

Unnamed: 0,engineer,powerbi,sql,jobless,devloper,scientist,data,analyst
0,1,0,0,0,0,1,3,1
1,0,1,0,0,1,0,0,0
2,1,0,1,0,0,0,0,0
3,0,0,0,1,0,0,0,0



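### max_df / min_df (Document Frequency)
- Both filter terms by document frequency, i.e., the number (int) or proportion (float) of documents in the corpus that contain the term.
- With `max_df=2`, terms that appear in more than 2 documents are dropped. No term in this corpus does, so the output below is unchanged.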

In [29]:
bow = CountVectorizer(stop_words="english", max_df=2)
pd.DataFrame(bow.fit_transform(df["Text"]).toarray(), columns=bow.get_feature_names_out())

Unnamed: 0,analyst,data,devloper,engineer,jobless,powerbi,scientist,sql
0,1,3,0,1,0,0,1,0
1,0,0,1,0,0,1,0,0
2,0,0,0,1,0,0,0,1
3,0,0,0,0,1,0,0,0


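### BOW Parameter definitions / usage
- strip_accents=None, -- removes accents when set to 'ascii' or 'unicode' (e.g., café -> cafe)
- lowercase=True, -- converts text to lowercase before tokenizing
- preprocessor=None, -- custom text-preprocessing function; it replaces the built-in strip_accents and lowercase step
- stop_words=None, -- 'english' or a custom list of stop words to remove
- ngram_range=(1, 1) -- lower and upper bound of the n-gram sizes to extract
- max_df=1.0, -- ignore terms whose document frequency is above this threshold (float = proportion, int = count)
- min_df=1, -- ignore terms whose document frequency is below this threshold (float = proportion, int = count)
- max_features=None, -- keep only the top N terms by corpus frequency, giving a fixed number of columns
- vocabulary=None, -- user-supplied mapping of word -> column index (sets the index/order)
- binary=False, -- if True, every non-zero count becomes 1 (present/absent)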

# TF-IDF(Term Frequency - Inverse Document Frequency)

Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

In [41]:
df = pd.DataFrame([["I am data analyst, data scientist and data engineer"],
                   ["I am sql developer"],
                   ["I am ML engineer and AI engineer"]], columns=['Text'])
df

Unnamed: 0,Text
0,"I am data analyst, data scientist and data eng..."
1,I am sql developer
2,I am ML engineer and AI engineer


In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [43]:
tf = TfidfVectorizer(stop_words="english")

In [44]:
pd.DataFrame(tf.fit_transform(df['Text']).toarray(),columns = tf.get_feature_names_out())

Unnamed: 0,ai,analyst,data,developer,engineer,ml,scientist,sql
0,0.0,0.293884,0.881652,0.0,0.223506,0.0,0.293884,0.0
1,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.707107
2,0.481482,0.0,0.0,0.0,0.732359,0.481482,0.0,0.0


-----------------
------------

In [7]:
text_data = [
    'café',
    'résumé',
    'naïve',
    'coöperate',
    'cooperate'
]

In [8]:
text = pd.DataFrame(text_data,columns=["text"])
text

Unnamed: 0,text
0,café
1,résumé
2,naïve
3,coöperate
4,cooperate


In [9]:
import autocorrect
import emoji
import re
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


In [10]:
chat = {
    "brb": "be right back",
    "btw": "by the way",
    "lol": "laugh out loud",
    "omg": "oh my god",
    "ttyl": "talk to you later",
    "idk": "I don't know",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "fyi": "for your information",
    "smh": "shaking my head",
    "np": "no problem",
    "tbh": "to be honest",
    "wbu": "what about you",
    "bc": "because",
    "afaik": "as far as I know",
    "asap": "as soon as possible",
    "atm": "at the moment",
    "bbl": "be back later",
    "bfn": "bye for now",
    "bff": "best friends forever",
    "cu": "see you",
    "cya": "see you",
    "dm": "direct message",
    "fb": "Facebook",
    "ftw": "for the win",
    "gg": "good game",
    "gr8": "great",
    "gtg": "got to go",
    "hbu": "how about you",
    "ily": "I love you",
    "jk": "just kidding",
    "lmao": "laughing my ass off",
    "lmk": "let me know",
    "nvm": "never mind",
    "omw": "on my way",
    "plz": "please",
    "ppl": "people",
    "rofl": "rolling on the floor laughing",
    "thx": "thanks",
    "u": "you",
    "ur": "your",
    "yolo": "you only live once",
    "yw": "you're welcome",
    "ty": "thank you",
    "abt" : "about"
}

In [11]:
def text_preprocess(text):
    """Clean one raw document (a string) and return the cleaned string."""
    speller = autocorrect.Speller()
    stem = PorterStemmer()
    lemma = WordNetLemmatizer()
    stop_words = set(stopwords.words('english')) # build the stop-word set once, not once per token

    text = text.lower() # convert to lowercase for uniformity
    text = speller.autocorrect_sentence(text) # correct spelling mistakes
    text = emoji.demojize(text).replace(':',' ') # convert emojis to their text names
    text = re.sub(r'www\.\S+|https?://\S+',' ',text) # remove URLs
    text = re.sub(r'<[^>]+>'," ",text) # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9']",' ',text) # remove punctuation (this also drops non-ASCII letters such as é)
    text = ' '.join(chat.get(i, i) for i in text.split()) # expand chat words (before digits are removed, so "gr8" still matches)
    text = re.sub(r'[0-9]+',' ',text) # remove numbers
    text = word_tokenize(text) # word tokenization
    text = [stem.stem(i) for i in text] # stemming
    text = [lemma.lemmatize(i) for i in text] # lemmatization
    text = [i for i in text if i not in stop_words] # remove stop words
    return ' '.join(text) # a CountVectorizer preprocessor must return a string

In [None]:
bow = CountVectorizer(stop_words="english",preprocessor=text_preprocess)
# Pass the text column (a Series of strings), not the whole DataFrame.
pd.DataFrame(bow.fit_transform(text["text"]).toarray(), columns=bow.get_feature_names_out())