### Intro to TextBlob

TextBlob is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks, such as
- part-of-speech tagging,
- noun phrase extraction,
- sentiment analysis,
- classification,
- translation, and more.

TextBlob is built on top of NLTK and pattern, making it a beginner-friendly tool for NLP.

**Key features of TextBlob**

1. Sentiment Analysis:
    - Provides the polarity of a text, a float from -1.0 (negative) to 1.0 (positive) with 0.0 as neutral, and its subjectivity, a float from 0.0 (objective) to 1.0 (subjective).

In [1]:
from textblob import TextBlob

In [2]:
text1 = TextBlob("TextBlob is am amazing library.")
text2 = TextBlob("I am not feeling well.")
text3 = TextBlob("Very bad and inapropriate arrangements")

print(text1.sentiment)
print(text2.sentiment)
print(text3.sentiment)

Sentiment(polarity=0.6000000000000001, subjectivity=0.9)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.9099999999999998, subjectivity=0.8666666666666667)
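
The polarity score can be reduced to a coarse label when only the direction of the sentiment matters. The sketch below shows one possible convention and is not part of TextBlob: the helper name `label_polarity` and the `threshold` cutoff are assumptions chosen for illustration, and the scores are copied from the output above.

```python
def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label.

    The threshold is an illustrative choice, not a TextBlob default.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity values copied from the three sentiment outputs above.
scores = {
    "text1": 0.6000000000000001,
    "text2": 0.0,
    "text3": -0.9099999999999998,
}

labels = {name: label_polarity(score) for name, score in scores.items()}
print(labels)
# {'text1': 'positive', 'text2': 'neutral', 'text3': 'negative'}
```

The "neutral" label for text2 only reflects its score of 0.0; a reader would probably call "I am not feeling well." negative, so labels derived this way are worth spot-checking.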


2. Part-of-Speech (POS) tagging:
    - Label each word in a sentence with its grammatical role.

In [3]:
text4 = TextBlob("A quick brown fox jump over a lazy dog")
print(text4.tags)

[('A', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'JJ'), ('jump', 'NN'), ('over', 'IN'), ('a', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
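
The two-letter codes in the output are Penn Treebank part-of-speech tags. As a reading aid, the sketch below maps the four tags that occur in this output to plain descriptions. The dictionary name `PENN_TAGS` is an assumption for illustration and covers only these four tags, not the full tag set.

```python
# Descriptions of the Penn Treebank tags that appear in the output above.
# This is a partial, hand-written lookup, not something TextBlob provides.
PENN_TAGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
}

# Tagged tokens copied from the output of text4.tags.
tags = [
    ("A", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "JJ"), ("jump", "NN"),
    ("over", "IN"), ("a", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]

# Pair every word with a readable description of its tag.
described = [(word, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tags]
for word, description in described:
    print(f"{word:<6} {description}")
```

Read this way, the output shows that the tagger labelled "brown" and "jump" as nouns and "fox" as an adjective, so tags are best treated as predictions rather than ground truth.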


3. Tokenization
   - Split the text into words or sentences.

In [11]:
text5 = 'TextBlob is used for tokenization. It provides a simple way to analyze the sentiment of text.'
blob = TextBlob(text5)

In [12]:
print(blob.words)

['TextBlob', 'is', 'used', 'for', 'tokenization', 'It', 'provides', 'a', 'simple', 'way', 'to', 'analyze', 'the', 'sentiment', 'of', 'text']


In [13]:
print(blob.sentences)

[Sentence("TextBlob is used for tokenization."), Sentence("It provides a simple way to analyze the sentiment of text.")]


4. Noun phrase extraction
   - Extract noun phrases from the text.

In [14]:
print(blob.noun_phrases)

['textblob', 'simple way']


5. Word inflection and lemmatization
   - Change word forms (for example, plural to singular) or get the base form of a word.

In [18]:
word = TextBlob("apples bananas boxes flies")
for wr in word.words:
    print(wr.singularize())

apple
banana
box
fly


6. Spelling correction
   - Correct spelling errors in text.

In [21]:
blb = TextBlob('I noot good very good at speling!')
print(blb.correct())

I not good very good at spelling!


7. Language translation
   - Translate text into another language.
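   - Translation in TextBlob relied on an external translation API (Google Translate). The translate() method was deprecated and then removed in recent releases (0.18.0 and later), so current versions need a separate translation library or service.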

### How to use CountVectorizer?

- It is a class in scikit-learn that transforms a collection of text documents into a numerical matrix of word or token counts.
- It is typically used as a preprocessing step that turns raw text into numeric features for NLP models.
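
To make the idea of a count matrix concrete, here is a plain-Python sketch of the bag-of-words counting described above. It is not scikit-learn's implementation: the two sample messages are hypothetical, and the `tokenize` helper only imitates two of CountVectorizer's defaults (lowercasing and keeping tokens of at least two word characters). The real class returns a SciPy sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Hypothetical sample documents, used only for illustration.
docs = ["Free entry to win a free cup", "Ok, I will text you later"]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Count the tokens of each document separately.
counts_per_doc = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all tokens; its order fixes the columns.
vocabulary = sorted(set().union(*counts_per_doc))

# One row per document, one column per vocabulary word.
matrix = [[counts[word] for word in vocabulary] for counts in counts_per_doc]

print(vocabulary)
for row in matrix:
    print(row)
# ['cup', 'entry', 'free', 'later', 'ok', 'text', 'to', 'will', 'win', 'you']
# [1, 1, 2, 0, 0, 0, 1, 0, 1, 0]
# [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
```

Most entries are zero even in this tiny example, which is why the real matrix is stored in a sparse format.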

In [2]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
with open('SMSSpamCollection_.csv', 'r', encoding='utf-8', errors='ignore') as file:
    data = pd.read_csv(file, header=None, names=["labels", "texts"])

# Remove unwanted tab spaces from the 'texts' column and the leading and trailing spaces.
data['texts'] = data['texts'].str.replace('\t', '', regex=True).str.strip()
data.head(10)


| | labels | texts |
|---|---|---|
| 0 | ham | Go until jurong point crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... hamOk lar... Jok... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


**Create a basic sparse matrix**

In [42]:
vectorizer = CountVectorizer(stop_words='english',max_features=50)
matrix = vectorizer.fit_transform(data.texts)

- By default, CountVectorizer converts the text to lowercase and decodes the input as UTF-8.

In [43]:
# Visualize the sparse matrix as a DataFrame, one column per vocabulary word
df = pd.DataFrame(data=matrix.toarray(), columns=vectorizer.get_feature_names_out())

In [44]:
df  # Each row represents an individual text from the dataset

Unnamed: 0,claim,come,da,day,dear,did,don,dont,free,going,good,got,great,gt,hey,hi,home,hope,just,know,later,like,ll,lor,love,lt,mobile,msg,need,new,night,oh,ok,phone,pls,reply,send,sorry,stop,tell,text,think,time,today,txt,ur,want,wat,week,ã¼
0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5569,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
5571,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5572,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [33]:
# Extract feature names
vectorizer.get_feature_names_out()  # Returns the vocabulary terms, ordered by their column index in the sparse matrix

array(['00', '000', '008704050406', ..., 'ãœ', 'éˆ', 'œharry'],
      shape=(8719,), dtype=object)

- Note: this output lists 8,719 terms, and the vocabulary shown next includes stop words such as 'in' and 'to'. These two cells therefore appear to have been run on a vectorizer fitted with default settings, not with stop_words='english' and max_features=50.

In [34]:
# Get the column index of each feature name
vectorizer.vocabulary_

{'go': 3564,
 'until': 8085,
 'jurong': 4374,
 'point': 5954,
 'crazy': 2335,
 'available': 1309,
 'only': 5568,
 'in': 4110,
 'bugis': 1762,
 'great': 3648,
 'world': 8551,
 'la': 4501,
 'buffet': 1760,
 'cine': 2058,
 'there': 7694,
 'got': 3608,
 'amore': 1075,
 'wat': 8325,
 'ok': 5535,
 'lar': 4537,
 'joking': 4342,
 'wif': 8453,
 'oni': 5564,
 'hamok': 3732,
 'free': 3368,
 'entry': 2960,
 'wkly': 8509,
 'comp': 2175,
 'to': 7806,
 'win': 8466,
 'fa': 3097,
 'cup': 2393,
 'final': 3217,
 'tkts': 7793,
 '21st': 410,
 'may': 4958,
 '2005': 401,
 'text': 7643,
 '87121': 789,
 'receive': 6338,
 'question': 6229,
 'std': 7279,
 'txt': 7986,
 'rate': 6281,
 'apply': 1162,
 '08452810075over18': 76,
 'dun': 2812,
 'say': 6677,
 'so': 7073,
 'early': 2833,
 'hor': 3948,
 'already': 1047,
 'then': 7688,
 'nah': 5267,
 'don': 2720,
 'think': 7709,
 'he': 3799,
 'goes': 3572,
 'usf': 8131,
 'lives': 4691,
 'around': 1213,
 'here': 3851,
 'though': 7729,
 'freemsg': 3375,
 'hey': 3861,
 'darl

- Note that vocabulary_ does not return frequency counts. It maps each word to its column index in the feature matrix.
- This mapping tells the vectorizer which column of the transformed feature matrix corresponds to which word.
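
The difference between a column index and a frequency count can be shown with a toy example. The sketch below uses small hypothetical stand-ins for `vectorizer.vocabulary_` and `matrix.toarray()`. It inverts the word-to-index mapping, then sums each column to get the actual corpus-wide counts.

```python
# Hypothetical toy data standing in for vectorizer.vocabulary_ (word -> column index)
# and matrix.toarray() (one row per document, one column per word).
vocabulary = {"free": 1, "win": 2, "call": 0}
rows = [
    [0, 2, 1],  # document 0: "free" twice, "win" once
    [1, 0, 0],  # document 1: "call" once
    [1, 1, 0],  # document 2: "call" once, "free" once
]

# Invert the mapping so a column index can be turned back into its word.
index_to_word = {index: word for word, index in vocabulary.items()}

# Sum each column to get the total count of every word across all documents.
totals = {
    index_to_word[col]: sum(row[col] for row in rows)
    for col in sorted(index_to_word)
}
print(totals)
# {'call': 2, 'free': 3, 'win': 1}
```

With the real objects, summing the sparse matrix over its rows (for example `matrix.sum(axis=0)`) should give the same per-column totals without converting to a dense array.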

**Refine your matrix with parameters**

- Remove stop words
  - **CountVectorizer(stop_words='english')**
  - Stop words typically carry little meaning and add little value in classification tasks.
  - Examples include "the", "or", and "is".
  - To remove these words, pass the stop_words parameter.
- Set maximum and minimum document-frequency thresholds
  - **CountVectorizer(max_df=0.80, min_df=0.20)**
  - We can set thresholds to remove words that appear in too many texts or in too few.
  - This example removes any word that appears in fewer than 20% or in more than 80% of the texts in the corpus.
- Limit the number of features
  - **CountVectorizer(max_features=50)**
  - To limit the size of the vocabulary, keep only the max_features most frequent words across the corpus (here, the top 50).
- Create n-grams
  - **CountVectorizer(ngram_range=(2,2))**
  - An n-gram is a contiguous sequence of n items (words, characters, etc.) from a given sample of text or speech.
      - Example: "I love machine learning"
      - Unigrams (1-grams): ["I", "love", "machine", "learning"]
      - Bigrams (2-grams): ["I love", "love machine", "machine learning"]
  - The ngram_range argument specifies the smallest and largest n-gram sizes to extract.
      - ngram_range=(2,2) means that the vectorizer generates only bigrams (sequences of exactly 2 words) from the input text.
  - Applications:
      - Text classification: N-grams are used to create features that help classify text into categories.
      - Language modeling: N-grams are used to predict the next word in a sequence, as in text generation models.
      - Speech recognition: N-grams are used to recognize and predict sequences of words based on their occurrence patterns.
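
The n-gram definition above can be sketched in a few lines of plain Python. The function name `ngrams` is an assumption for illustration, and the sentence is split on whitespace only.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return every contiguous run of n tokens, joined with single spaces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love machine learning".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['I', 'love', 'machine', 'learning']
print(bigrams)   # ['I love', 'love machine', 'machine learning']
```

CountVectorizer would not reproduce this list exactly. With its default settings it lowercases the text and drops single-character tokens such as "I", so ngram_range=(2,2) on the same sentence would yield only "love machine" and "machine learning".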
  
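
The max_df and min_df thresholds are proportions of documents, not of words within a single text. The plain-Python sketch below illustrates that filtering rule on six hypothetical, already tokenized messages. The function name `filter_by_document_frequency` is an assumption, and this is a simplified imitation of the behaviour, not scikit-learn's code.

```python
from collections import Counter

# Hypothetical, already tokenized documents used only for illustration.
docs = [
    ["call", "free", "win"],
    ["call", "free", "now"],
    ["call", "win", "prize"],
    ["call", "free", "now"],
    ["call", "text", "free"],
    ["call", "later", "now"],
]


def filter_by_document_frequency(
    docs: list[list[str]], min_df: float = 0.20, max_df: float = 0.80
) -> list[str]:
    """Keep words whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    # Count each word once per document, however often it occurs inside it.
    doc_counts = Counter(word for doc in docs for word in set(doc))
    return sorted(
        word
        for word, count in doc_counts.items()
        if min_df <= count / n_docs <= max_df
    )


kept = filter_by_document_frequency(docs)
print(kept)  # ['free', 'now', 'win']
```

Here "call" is dropped because it occurs in every document, while "prize", "text", and "later" are dropped because each occurs in only one of six. In CountVectorizer, passing integers instead of floats for these parameters is interpreted as absolute document counts.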
