# Definition

- Vectorization in NLP: 📊 Transforming text data into numerical vectors for machine learning algorithms to process.
- Count Vectorizer: 📏 Converts text documents into a matrix where each row represents a document and each column represents a unique word, with values indicating the frequency of each word.
  [Learn more about Count Vectorizer](https://www.youtube.com/watch?v=NF_DhVH_I-E)
- Hashing Vectorizer: 🔍 Converts text into fixed-size vectors by applying a hashing function to tokenized words.
  [Learn more about Hashing Vectorizer](https://www.youtube.com/watch?v=NF_DhVH_I-E)

- Benefits: 💡 Efficiently represents text data for machine learning models, enabling analysis and prediction tasks.
- Tf-idf Vectorizer: 📈 Converts text documents into numerical vectors based on term frequency-inverse document frequency, emphasizing rare terms that are important in distinguishing documents.
  [Learn more about Tf-Idf](https://www.youtube.com/watch?v=D2V1okCEsiE)

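- Hashing Vectorizer vs Tf-idf Vectorizer: 🔄 Hashing Vectorizer is memory-efficient and faster because it stores no vocabulary and maps each token straight to a column index. The trade-off is that hash collisions can merge distinct terms and columns cannot be mapped back to words. Tf-idf Vectorizer keeps a vocabulary in memory, which costs more on large corpora but gives interpretable, weighted features.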
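
The three vectorizers defined above can be compared side by side on a tiny corpus. This is a minimal sketch, not part of the original notebook: the `docs` list is a made-up toy corpus, and `n_features=16`, `alternate_sign=False` and `norm=None` are chosen only to keep the hashed output small and easy to inspect.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, not taken from the chat log used later.
docs = ["the cat sat", "the dog sat", "the dog barked loudly"]

# CountVectorizer learns a vocabulary, so the column count depends on the data.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
vocab = sorted(count_vec.vocabulary_)

# HashingVectorizer is stateless: the width is fixed by n_features (assumed 16 here).
hash_vec = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)
hashed = hash_vec.transform(docs)

# TfidfVectorizer learns a vocabulary and down-weights terms shared by many documents.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

print(vocab)         # ['barked', 'cat', 'dog', 'loudly', 'sat', 'the']
print(counts.shape)  # (3, 6)
print(hashed.shape)  # (3, 16)
print(tfidf.shape)   # (3, 6)
```

In the first document, "cat" receives a higher Tf-idf weight than "the", because "the" appears in every document and "cat" in only one.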
    



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Import Libraries

In [5]:
# hashing vectorizer
from sklearn.feature_extraction.text import HashingVectorizer

from sklearn.feature_extraction.text import TfidfTransformer


In [8]:
# Load the chat log as a list of lines; each line is treated as one document

with open("/content/drive/MyDrive/CyberSecurity with AI /anonops_short.txt", encoding="utf8") as f:
    annops_chat_log = f.readlines()



In [9]:
# Read the file again as a single string to inspect its content
file_path = '/content/drive/MyDrive/CyberSecurity with AI /anonops_short.txt'
with open(file_path, 'r', encoding='utf8') as file:
    data = file.read()

# Print only a short preview: printing the whole file exceeds the
# notebook's IOPub data rate limit and the output gets suppressed
print(data[:500])



In [12]:
# Define a HashingVectorizer over unigrams and bigrams (default 2**20 hashed features)
hash_vector = HashingVectorizer(input='content', ngram_range=(1, 2))

# Hash each chat line into a sparse row; the vectorizer is stateless, so fitting learns nothing
X_train_counts = hash_vector.fit_transform(annops_chat_log)

# Learn inverse document frequencies from the hashed matrix
tf_transform = TfidfTransformer(use_idf=True).fit(X_train_counts)

# Reweight the hashed matrix into its TF-IDF representation
X_train_tf = tf_transform.transform(X_train_counts)

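**📌note:** We set `ngram_range=(1, 2)`, so the hashing vectorizer hashes both single words and pairs of consecutive words into the same feature space. We then applied `TfidfTransformer` to reweight those hashed term frequencies by inverse document frequency. The negative values in the output come from HashingVectorizer's default `alternate_sign=True`, which flips the sign of some features to reduce the effect of hash collisions.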
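
The n-gram and reweighting steps from the note can be tried on a few lines before running them on the full log. This is a minimal sketch under stated assumptions: `sample_lines` is a hypothetical stand-in for the chat log, and chaining the two steps with `make_pipeline` is one convenient option, not what the notebook itself does.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical sample lines standing in for the chat log.
sample_lines = ["hello big world", "hello again world", "goodbye big world"]

# The analyzer shows which unigrams and bigrams get hashed for one line.
analyzer = HashingVectorizer(ngram_range=(1, 2)).build_analyzer()
ngrams = analyzer("hello big world")
print(ngrams)  # ['hello', 'big', 'world', 'hello big', 'big world']

# Chain hashing and TF-IDF reweighting in a single estimator.
hash_tfidf = make_pipeline(
    HashingVectorizer(ngram_range=(1, 2)),
    TfidfTransformer(use_idf=True),
)
X_sample = hash_tfidf.fit_transform(sample_lines)

print(X_sample.shape)  # (3, 1048576): one row per line, 2**20 hashed columns

# TfidfTransformer L2-normalizes each row by default, so every norm is close to 1.
row_norms = np.sqrt(X_sample.multiply(X_sample).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

The column count stays at 2**20 regardless of how many distinct n-grams the text contains, which is why the indices in the sparse output are so large.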

In [13]:
print(f"sparse matrix representation: \n{X_train_tf}")

sparse matrix representation: 
  (0, 938273)	0.10023429482560929
  (0, 871172)	-0.33044470291777067
  (0, 755834)	-0.2806123960092745
  (0, 556974)	-0.2171490773135763
  (0, 548264)	-0.09851435603064428
  (0, 531189)	-0.2566310842337745
  (0, 522961)	-0.3119912982467716
  (0, 514190)	-0.2527659565181208
  (0, 501800)	-0.33044470291777067
  (0, 499727)	-0.18952297847436425
  (0, 488876)	0.13502094828386488
  (0, 377854)	0.22710724511856722
  (0, 334594)	-0.25581186158424035
  (0, 256577)	0.20949022238574433
  (0, 197273)	-0.30119674850360456
  (0, 114899)	0.09713499033205285
  (0, 28523)	-0.3060506288368513
  (1, 960098)	0.09780838928665199
  (1, 955748)	-0.2747271490090429
  (1, 952302)	0.26070217969901804
  (1, 938273)	0.12095603891963835
  (1, 937092)	-0.2947114257264502
  (1, 927866)	0.21727726371674563
  (1, 820768)	-0.11065660403137358
  (1, 772066)	-0.14344517367198276
  :	:
  (180828, 329790)	0.06808618130417012
  (180828, 312887)	-0.08249409552977467
  (180828, 209871)	0.1768592