# Converting text data to its numerical form  

In [50]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Count Vectorizer

CountVectorizer is a pre-processing technique that converts text data into numerical form. It builds a bag of words: each unique word becomes a separate feature, and the value of that feature for a given document is the number of times the word occurs in it.

Let's understand with the help of some examples:

### Example 1: Using only Count Vectorizer

In [51]:
sentence_1 = "The cat sat"
sentence_2 = "The cat sat in the hat"
sentence_3 = "The cat with the hat"

document1 = [sentence_1, sentence_2, sentence_3]
document1

['The cat sat', 'The cat sat in the hat', 'The cat with the hat']

| Word| Frequency |
| :--- | :---: |       
| the | 5 |
| cat | 3 |
| sat | 2 |
| in | 1 |
| hat | 2 |
| with | 1 |

| Sentences | the | cat | sat | in | hat | with |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |        
| sentence_1 | 1 | 1 | 1 | 0 | 0 | 0 |
| sentence_2 | 2 | 1 | 1 | 1 | 1 | 0 |
| sentence_3 | 2 | 1 | 0 | 0 | 1 | 1 |
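
The two tables above can also be reproduced by hand with the standard library. The following is a minimal sketch that assumes plain whitespace tokenization after lowercasing; this is a simplification and not CountVectorizer's actual tokenizer, although it gives the same counts for these three sentences. The names `tokenized`, `corpus_counts`, `vocabulary` and `rows` are illustrative only.

```python
from collections import Counter

sentences = ["The cat sat", "The cat sat in the hat", "The cat with the hat"]

# lowercase and split on whitespace (simplified tokenization)
tokenized = [s.lower().split() for s in sentences]

# corpus-wide word frequencies (first table)
corpus_counts = Counter(word for tokens in tokenized for word in tokens)

# per-sentence counts over a fixed, alphabetically sorted vocabulary (second table)
vocabulary = sorted(corpus_counts)
rows = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(dict(corpus_counts))
print(vocabulary)
for row in rows:
    print(row)
```

Sorting the vocabulary alphabetically is a deliberate choice here: it puts the columns in the same order that `get_feature_names_out()` reports, which makes the two results easy to compare.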

Let's verify this using CountVectorizer:

In [52]:
count_vectorizer = CountVectorizer() # CountVectorizer has 'lowercase = True' as a default parameter
count_vectorizer.fit(document1)
count_array = count_vectorizer.transform(document1).toarray()
count_array

array([[1, 0, 0, 1, 1, 0],
       [1, 1, 1, 1, 2, 0],
       [1, 1, 0, 0, 2, 1]])

In [53]:
df = pd.DataFrame(count_array, columns=count_vectorizer.get_feature_names_out())
df

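|   | cat | hat | in | sat | the | with |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 2 | 0 |
| 2 | 1 | 1 | 0 | 0 | 2 | 1 |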


In the example above, CountVectorizer produced a document-term matrix: each row corresponds to a sentence, each column corresponds to a unique word in the corpus, and each cell holds the number of times that word occurs in that sentence. The columns are sorted alphabetically, so they appear in a different order from the hand-built table, but the counts agree. Note that `transform` returns a sparse matrix; we called `.toarray()` to view it as a dense array. Because the values are raw counts, every occurrence of every word carries the same weight, regardless of how informative the word is. Common words such as "the" therefore get high counts even though they carry little information.

### Example 2: Using Lemmatization and Count Vectorizer

In this next example, we will remove stop words and apply lemmatization to our data before creating a bag of words with CountVectorizer.

In [54]:
sentences = ["The quick brown fox jumps over the lazy dog",
             "She sells seashells by the seashore",
             "All work and no play makes Jack a dull boy",
             "I love the smell of fresh flowers in the spring",
             "The early bird catches the worm",
             "Actions speak louder than words",
             "Honesty is the best policy",
             "The sun is shining brightly outside",
             "A picture is worth a thousand words",
             "An apple a day keeps the doctor away"]


data1 = pd.DataFrame(sentences, columns=["sentences"])
data1

|   | sentences |
| :--- | :--- |
| 0 | The quick brown fox jumps over the lazy dog |
| 1 | She sells seashells by the seashore |
| 2 | All work and no play makes Jack a dull boy |
| 3 | I love the smell of fresh flowers in the spring |
| 4 | The early bird catches the worm |
| 5 | Actions speak louder than words |
| 6 | Honesty is the best policy |
| 7 | The sun is shining brightly outside |
| 8 | A picture is worth a thousand words |
| 9 | An apple a day keeps the doctor away |


In [55]:
# converting to lowercase
data1["sentences"] = data1["sentences"].str.lower()

# Removing stopwords from the data
stop_words = stopwords.words("english")
data1["sentences"] = data1["sentences"].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))

# applying lemmatization
wnl = WordNetLemmatizer()
data1["sentences"] = data1["sentences"].apply(lambda x: " ".join(wnl.lemmatize(word, "v") for word in x.split()))

data1

|   | sentences |
| :--- | :--- |
| 0 | quick brown fox jump lazy dog |
| 1 | sell seashells seashore |
| 2 | work play make jack dull boy |
| 3 | love smell fresh flower spring |
| 4 | early bird catch worm |
| 5 | action speak louder word |
| 6 | honesty best policy |
| 7 | sun shin brightly outside |
| 8 | picture worth thousand word |
| 9 | apple day keep doctor away |


In [56]:
count_vectorizer.fit(data1["sentences"])
count_array = count_vectorizer.transform(data1["sentences"]).toarray()

In [57]:
df1 = pd.DataFrame(count_array, columns=count_vectorizer.get_feature_names_out())
df1

Unnamed: 0,action,apple,away,best,bird,boy,brightly,brown,catch,day,...,shin,smell,speak,spring,sun,thousand,word,work,worm,worth
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
5,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,1
9,0,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### Example 3: Data Cleaning and using Lemmatization and Count Vectorizer

In [58]:
twitter_data = pd.read_csv("hf://datasets/PrkhrAwsti/Twitter_Sentiment_3M/twitter_dataset.csv")
twitter_data.head()

Unnamed: 0.1,Unnamed: 0,tweet,sentiment
0,0,is upset that he can't update his Facebook by ...,0.0
1,1,@Kenichan I dived many times for the ball. Man...,0.0
2,2,my whole body feels itchy and like its on fire,0.0
3,3,"@nationwideclass no, it's not behaving at all....",0.0
4,4,@Kwesidei not the whole crew,0.0


Keep only the first 1000 rows:

In [59]:
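# .copy() creates an independent DataFrame, which avoids pandas' SettingWithCopyWarning on later assignments
twitter_data = twitter_data.iloc[:1000].copy()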

In [60]:
len(twitter_data)

1000

In [61]:
# counting the columns that contain NaN values
twitter_data.isnull().any().sum()

0

In [62]:
# dropping rows with NaN values (a no-op here, since no column contains any)
twitter_data = twitter_data.dropna(axis=0)


In [63]:
twitter_data.isnull().any().sum()

0

Clean the data

In [64]:
# converting text to lowercase
twitter_data["tweet"] = twitter_data["tweet"].str.lower()

# Removing stopwords from the data
stop_words = stopwords.words("english")
twitter_data["tweet"] = twitter_data["tweet"].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))

# removing links
twitter_data["tweet"] = twitter_data["tweet"].apply(lambda x: re.sub(r"http\S+|www\.\S+", "", x))

# removing email addresses
twitter_data["tweet"] = twitter_data["tweet"].apply(lambda x: re.sub(r"\w+@\w+\.com", "", x))

# removing punctuation marks
twitter_data["tweet"] = twitter_data["tweet"].apply(lambda x: re.sub(r"[.,;:!\?\"'`]", "", x))

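# removing special characters
# note: the unescaped "-" in "\+-_" forms a range from "+" to "_", so digits and other
# characters in that range are stripped as well; write "\-" to match only a literal hyphen
twitter_data["tweet"] = twitter_data["tweet"].apply(lambda x: re.sub(r"[@#\$%^&*\(\)\\/\+-_=\[\]\{\}<>]", "", x))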

# removing unnecessary characters
twitter_data["tweet"] = twitter_data["tweet"].apply(lambda x: re.sub(r"½m|½s|½t|½ï", "", x))

twitter_data.head()

Unnamed: 0.1,Unnamed: 0,tweet,sentiment
0,0,upset cant update facebook texting it might cr...,0.0
1,1,kenichan dived many times ball managed save r...,0.0
2,2,whole body feels itchy like fire,0.0
3,3,nationwideclass no behaving all im mad here ca...,0.0
4,4,kwesidei whole crew,0.0
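
One detail of the cleaning step is worth isolating. The special-character pattern contains `\+-_`, and inside a character class an unescaped hyphen between two characters defines a range, here from `+` (ASCII 43) to `_` (ASCII 95). That range includes the digits, which is a likely reason why numbers disappear from the cleaned tweets. The snippet below is a small sketch on a made-up string; `sample`, `range_pattern` and `literal_pattern` are illustrative names, not part of the notebook.

```python
import re

sample = "april 7th"

# unescaped "-" between "+" and "_" defines a range (ASCII 43 to 95), which covers digits
range_pattern = r"[\+-_]"
# escaped "-" is a literal hyphen, so only "+", "-" and "_" match
literal_pattern = r"[\+\-_]"

with_range = re.sub(range_pattern, "", sample)
with_literal = re.sub(literal_pattern, "", sample)

print(repr(with_range))
print(repr(with_literal))
```

If digits should be kept, escaping the hyphen (or moving it to the end of the character class) is enough; if they should be removed, an explicit `\d` states that intent more clearly.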


Apply Lemmatization

In [65]:
wnl = WordNetLemmatizer()
twitter_data["tweet"] = twitter_data["tweet"].apply(lambda x: " ".join(wnl.lemmatize(word, "v") for word in x.split()))

twitter_data[["tweet"]].head()

Unnamed: 0,tweet
0,upset cant update facebook texting it might cr...
1,kenichan dive many time ball manage save rest ...
2,whole body feel itchy like fire
3,nationwideclass no behave all im mad here cant...
4,kwesidei whole crew


Apply CountVectorizer

In [66]:
count_vectorizer.fit(twitter_data["tweet"])
count_array = count_vectorizer.transform(twitter_data["tweet"]).toarray()

twitter_cv = pd.DataFrame(count_array, columns=count_vectorizer.get_feature_names_out())
twitter_cv.head()


## TfidfVectorizer

TfidfVectorizer is another pre-processing technique for converting text data into numerical form. Instead of using raw counts, it weights each word by how often it occurs in a document (term frequency, TF) and by how rare it is across the corpus (inverse document frequency, IDF). Words that are frequent in a document but rare elsewhere receive high weights, while words that appear in most documents receive low weights.

### Example 1

In [67]:
sentence_1 = "It is going to rain today"
sentence_2 = "I am not going to office today"
sentence_3 = "I am going to watch a football match"

document2 = [sentence_1, sentence_2, sentence_3]
document2

['It is going to rain today',
 'I am not going to office today',
 'I am going to watch a football match']

In [68]:
data2 = pd.DataFrame(document2, columns=["document"])
data2

|   | document |
| :--- | :--- |
| 0 | It is going to rain today |
| 1 | I am not going to office today |
| 2 | I am going to watch a football match |


In [69]:
# converting to lowercase
data2["document"] = data2["document"].str.lower()

# Removing stopwords from the data
stop_words = stopwords.words("english")
data2["document"] = data2["document"].apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))

# applying lemmatization
wnl = WordNetLemmatizer()
data2["document"] = data2["document"].apply(lambda x: " ".join(wnl.lemmatize(word, "v") for word in x.split()))

data2

|   | document |
| :--- | :--- |
| 0 | go rain today |
| 1 | go office today |
| 2 | go watch football match |


Here is the frequency of each word across the three sentences:

| Word | Frequency |
| :--- | :---: |
| go | 3 |
| rain | 1 |
| today | 2 |
| office | 1 |
| watch | 1 |
| football | 1 |
| match | 1 |

Term Frequency (TF)

$$
\mathrm{TF} = \frac{\text{Number of occurrences of the word in the sentence}}{\text{Number of words in the sentence}}
$$

| Word | TF of Sentence 1 | TF of Sentence 2 | TF of Sentence 3 |
| :--- | :---: | :---: | :---: |
| go | $\frac{1}{3}$ | $\frac{1}{3}$ | $\frac{1}{4}$ |
| rain | $\frac{1}{3}$ | 0 | 0 |
| today | $\frac{1}{3}$ | $\frac{1}{3}$ | 0 |
| office | 0 | $\frac{1}{3}$ | 0 |
| watch | 0 | 0 | $\frac{1}{4}$ |
| football | 0 | 0 | $\frac{1}{4}$ |
| match | 0 | 0 | $\frac{1}{4}$ |

Inverse Document Frequency (IDF)

$$
\mathrm{IDF} = \log_{10}\left(\frac{\text{Number of sentences in the corpus}}{\text{Number of sentences containing the word}}\right)
$$

| Word | IDF |
| :---: | :---: |
| go | $\log_{10}(3/3) = 0.000$ |
| rain | $\log_{10}(3/1) = 0.477$ |
| today | $\log_{10}(3/2) = 0.176$ |
| office | $\log_{10}(3/1) = 0.477$ |
| watch | $\log_{10}(3/1) = 0.477$ |
| football | $\log_{10}(3/1) = 0.477$ |
| match | $\log_{10}(3/1) = 0.477$ |

TF-IDF is the product of TF and IDF:

$$
\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}
$$

| Sentences | go | rain | today | office | watch | football | match |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Sentence 1 | 1/3 x 0 | 1/3 x 0.477 | 1/3 x 0.176 | 0 x 0.477 | 0 x 0.477 | 0 x 0.477 | 0 x 0.477 |
| Sentence 2 | 1/3 x 0 | 0 x 0.477 | 1/3 x 0.176 | 1/3 x 0.477 | 0 x 0.477 | 0 x 0.477 | 0 x 0.477 |
| Sentence 3 | 1/4 x 0 | 0 x 0.477 | 0 x 0.176 | 0 x 0.477 | 1/4 x 0.477 | 1/4 x 0.477 | 1/4 x 0.477 |

| Sentences | go | rain | today | office | watch | football | match |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Sentence 1 | 0 | 0.159 | 0.059 | 0 | 0 | 0 | 0 |
| Sentence 2 | 0 | 0 | 0.059 | 0.159 | 0 | 0 | 0 |
| Sentence 3 | 0 | 0 | 0 | 0 | 0.119 | 0.119 | 0.119 |
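
The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch that follows the textbook formulas exactly as written in this section (TF divided by sentence length, base-10 logarithm for IDF); the names `tokenized`, `vocabulary`, `idf` and `tfidf_rows` are illustrative.

```python
import math

docs = ["go rain today", "go office today", "go watch football match"]
tokenized = [d.split() for d in docs]
vocabulary = ["go", "rain", "today", "office", "watch", "football", "match"]

n_docs = len(tokenized)

# IDF = log10(number of sentences / number of sentences containing the word)
idf = {
    word: math.log10(n_docs / sum(word in tokens for tokens in tokenized))
    for word in vocabulary
}

# TF = occurrences of the word in the sentence / number of words in the sentence
tfidf_rows = [
    [round(tokens.count(word) / len(tokens) * idf[word], 3) for word in vocabulary]
    for tokens in tokenized
]

for row in tfidf_rows:
    print(row)
```

The rows printed by this sketch match the final table, including the zero weight for "go", which occurs in every sentence.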

In [70]:
tfidf = TfidfVectorizer()
tfidf.fit(data2["document"])
tfidf_array = tfidf.transform(data2["document"]).toarray()

df2 = pd.DataFrame(tfidf_array, columns = tfidf.get_feature_names_out())
df2

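|   | football | go | match | office | rain | today | watch |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0.0 | 0.425441 | 0.0 | 0.0 | 0.720333 | 0.547832 | 0.0 |
| 1 | 0.0 | 0.425441 | 0.0 | 0.720333 | 0.0 | 0.547832 | 0.0 |
| 2 | 0.546454 | 0.322745 | 0.546454 | 0.0 | 0.0 | 0.0 | 0.546454 |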


### Example 2

In [71]:
# using TfidfVectorizer with all default parameters except 'max_features = 2500'
tfidf_1 = TfidfVectorizer(input = "content", encoding = "utf-8", decode_error = "strict",
                          strip_accents = None, lowercase = True, preprocessor = None,
                          tokenizer = None, analyzer = "word", stop_words = None,
                          token_pattern = r"(?u)\b\w\w+\b", ngram_range = (1,1),
                          max_df = 1.0, min_df = 1, max_features = 2500, vocabulary = None,
                          binary = False, dtype = np.float64, norm = "l2", use_idf = True,
                          smooth_idf = True, sublinear_tf = False)

tfidf_1.fit(twitter_data["tweet"])
tfidf_1_array = tfidf_1.transform(twitter_data["tweet"]).toarray()

twitter_tfidf1 = pd.DataFrame(tfidf_1_array, columns = tfidf_1.get_feature_names_out())
twitter_tfidf1.head()

Unnamed: 0,aaaaand,aaronrva,aaw,able,absolutely,abt,accent,accept,access,accident,...,yrs,yu,yucky,yup,zac,zaydia,zero,zip,zone,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
# keep at most 2000 bi-gram (two-word) features, ranked by frequency across the corpus
tfidf_2 = TfidfVectorizer(max_features=2000, ngram_range = (2,2))
tfidf_2.fit(twitter_data["tweet"])
tfidf_2_array = tfidf_2.transform(twitter_data["tweet"]).toarray()

twitter_tfidf2 = pd.DataFrame(tfidf_2_array, columns = tfidf_2.get_feature_names_out())
twitter_tfidf2.head()

Unnamed: 0,aaaaand back,all im,already go,amp iz,annoy easily,another day,anybody know,april th,assignment finish,back work,...,work hard,work im,work today,work tomorrow,work uk,work well,world live,write paper,yawn yawn,year old
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.437543,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [73]:
# keep at most 1500 tri-gram (three-word) features, ranked by frequency across the corpus
tfidf_3 = TfidfVectorizer(max_features=1500, ngram_range = (3,3))
tfidf_3.fit(twitter_data["tweet"])
tfidf_3_array = tfidf_3.transform(twitter_data["tweet"]).toarray()

twitter_tfidf3 = pd.DataFrame(tfidf_3_array, columns = tfidf_3.get_feature_names_out())
twitter_tfidf3.head()

Unnamed: 0,aaaaand back literature,annoy easily today,april th come,bah car start,band fun lead,bday know do,brent praise band,cant find it,car start wait,check update make,...,velvet amp eat,veneia not yet,verdict today apparently,via sharethis oh,via still wait,victoria follow unpleasant,victory th consecutive,wednesday bday know,wish could go,world live in
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


CountVectorizer is the simpler method and works well for some applications, but it treats every occurrence of every word equally. TfidfVectorizer also accounts for how common a word is across the corpus, so it down-weights words that appear in most documents and often captures the informative terms better. Words with low weights can then be removed, which reduces the input dimensions and simplifies model building.