
In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

In [2]:
# Load the dataset
data = pd.read_csv("data.csv")

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11914 non-null  object 
 1   Model              11914 non-null  object 
 2   Year               11914 non-null  int64  
 3   Engine Fuel Type   11911 non-null  object 
 4   Engine HP          11845 non-null  float64
 5   Engine Cylinders   11884 non-null  float64
 6   Transmission Type  11914 non-null  object 
 7   Driven_Wheels      11914 non-null  object 
 8   Number of Doors    11908 non-null  float64
 9   Market Category    8172 non-null   object 
 10  Vehicle Size       11914 non-null  object 
 11  Vehicle Style      11914 non-null  object 
 12  highway MPG        11914 non-null  int64  
 13  city mpg           11914 non-null  int64  
 14  Popularity         11914 non-null  int64  
 15  MSRP               11914 non-null  int64  
dtypes: float64(3), int64(5), object(8)

In [4]:
# Extract the 'Market Category' column as the text corpus (missing values are still NaN here)
text_data = data['Market Category'].tolist()

In [6]:
# Replace missing values with empty strings so the vectorizers receive only strings
text_data = data['Market Category'].fillna('').tolist()

**Bag of Words**

The Bag-of-Words (BoW) approach represents each document as a vector of word counts over a fixed vocabulary, ignoring word order. This numerical representation can be used as input for machine learning algorithms, which cannot work with raw text directly.

In [7]:
# Bag-of-Words (BoW)
# Count Occurrence
count_vectorizer = CountVectorizer()
bow_count = count_vectorizer.fit_transform(text_data)
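
To make the count matrix concrete, here is a minimal plain-Python sketch of the same idea. The toy corpus is hypothetical, and the tokenizer (lowercase, runs of two or more word characters) only approximates what `CountVectorizer` does by default; it is an illustration, not a reimplementation of the library.

```python
import re
from collections import Counter

# Hypothetical toy corpus (not taken from data.csv)
docs = ["Luxury,Performance", "Luxury", "Hatchback,Luxury,Performance"]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in docs]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term
bow_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each column corresponds to one vocabulary term, so the column index in a sparse entry identifies the word being counted.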

**Normalized Count Occurrence**

Normalized count occurrence, or term frequency (TF), is the relative frequency of a term within an individual document: its count divided by the total number of terms in that document.

In [8]:
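# Binary occurrence: binary=True records only whether a term is present (0 or 1);
# it does not divide counts by document length, so this is not a true normalized count
count_vectorizer_normalized = CountVectorizer(binary=True)
bow_normalized_count = count_vectorizer_normalized.fit_transform(text_data)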
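
Note that `CountVectorizer(binary=True)` produces 0/1 presence indicators rather than relative frequencies. The sketch below shows, in plain Python, what a length-normalized term frequency looks like; the helper `term_frequencies` and the sample tokens are hypothetical names introduced for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return each token's count divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # Empty document: no terms, so no frequencies
        return {}
    return {term: count / total for term, count in counts.items()}


# Hypothetical document in which "luxury" appears twice
tf = term_frequencies(["luxury", "high", "performance", "luxury"])
print(tf)
```

In scikit-learn, one way to obtain a comparable result should be `TfidfVectorizer(use_idf=False, norm='l1')`, which scales each row so that its entries sum to 1.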

**Term Frequency-Inverse Document Frequency**

Inverse document frequency (IDF) measures the discriminative power of a term across the entire document set.

Discriminative power is a term's ability to distinguish between documents in a collection. In natural language processing (NLP) and text analysis, terms with high discriminative power are those that are characteristic of specific documents or classes of documents. A term that appears in nearly every document has low discriminative power and receives a low IDF weight. TF-IDF multiplies a term's frequency within a document by its IDF.

In [9]:
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(text_data)
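
To make the IDF weighting concrete, the sketch below computes it by hand for a hypothetical four-document corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is what the scikit-learn documentation describes for `smooth_idf=True`; other libraries and textbooks use slightly different variants.

```python
import math

# Hypothetical tokenized documents (not taken from data.csv)
docs = [
    ["luxury", "performance"],
    ["luxury"],
    ["luxury", "hatchback"],
    ["exotic", "performance"],
]

n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents does each term occur?
document_frequency = {
    term: sum(1 for doc in docs if term in doc) for term in vocabulary
}


def smoothed_idf(term):
    # Assumed formula: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + document_frequency[term])) + 1


idf = {term: smoothed_idf(term) for term in vocabulary}

# Print terms from least to most discriminative
for term in sorted(idf, key=idf.get):
    print(f"{term:12s} df={document_frequency[term]} idf={idf[term]:.3f}")
```

The term that occurs in most documents ("luxury" here) gets the lowest weight, while terms confined to a single document get the highest.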

**Word2Vec**

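Word2Vec learns a dense vector (embedding) for each word from the contexts in which the word appears. Words used in similar contexts end up close together in the continuous vector space, so the embeddings capture semantic meanings and relationships between words.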
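
Closeness in the vector space is usually measured with cosine similarity. The sketch below shows the computation on three hypothetical 3-dimensional vectors invented for illustration; real Word2Vec vectors are learned from data and, in this notebook, have 100 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity with a zero vector is undefined; return 0.0 by convention
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, not produced by any trained model
vectors = {
    "luxury": [0.9, 0.1, 0.0],
    "exotic": [0.8, 0.3, 0.1],
    "diesel": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(vectors["luxury"], vectors["exotic"])
sim_far = cosine_similarity(vectors["luxury"], vectors["diesel"])
print(f"luxury vs exotic: {sim_close:.3f}")
print(f"luxury vs diesel: {sim_far:.3f}")
```

With a trained gensim model, the equivalent query should be available through `word2vec_model.wv.similarity(word_a, word_b)`.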

In [10]:
# Word2Vec
# Tokenize text data
tokenized_data = [text.split() for text in text_data]
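
One thing worth checking is that `text.split()` splits on whitespace only, so it may produce different tokens than the vectorizers above. The sketch below compares three tokenizations on a hypothetical value, assuming that "Market Category" entries are comma-separated labels; the sample string is an assumption, not a row quoted from the dataset.

```python
import re

# Hypothetical value in the assumed comma-separated style
sample = "Factory Tuner,Luxury,High-Performance"

# Whitespace split, as in the cell above: commas and hyphens stay inside tokens
whitespace_tokens = sample.split()

# Regex split approximating CountVectorizer's default word pattern
regex_tokens = re.findall(r"\b\w\w+\b", sample.lower())

# Split on commas: one token per category label
category_tokens = [part.strip() for part in sample.split(",")]

print(whitespace_tokens)
print(regex_tokens)
print(category_tokens)
```

Which tokenization is appropriate depends on whether each category label or each individual word should receive its own embedding.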

In [11]:
# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)

In [12]:
# Embeddings: represent each document by the average of its word vectors
import numpy as np

word_embeddings = []
for tokens in tokenized_data:
    vectors = [word2vec_model.wv[word] for word in tokens if word in word2vec_model.wv]
    if vectors:
        word_embeddings.append(np.mean(vectors, axis=0))
    else:
        # No known words (e.g. an empty Market Category): fall back to a zero vector
        word_embeddings.append(np.zeros(word2vec_model.wv.vector_size))


In [13]:
# Print the processed output
print("Bag-of-Words (Count Occurrence):\n", bow_count)
print("\nBag-of-Words (Normalized Count Occurrence):\n", bow_normalized_count)
print("\nTF-IDF:\n", tfidf)
print("\nWord2Vec Embeddings:\n", word_embeddings)


Bag-of-Words (Count Occurrence):
   (0, 3)	1
  (0, 11)	1
  (0, 9)	1
  (0, 7)	1
  (0, 10)	1
  (1, 9)	1
  (1, 10)	1
  (2, 9)	1
  (2, 7)	1
  (2, 10)	1
  (3, 9)	1
  (3, 10)	1
  (4, 9)	1
  (5, 9)	1
  (5, 10)	1
  (6, 9)	1
  (6, 10)	1
  (7, 9)	1
  (7, 7)	1
  (7, 10)	1
  (8, 9)	1
  (9, 9)	1
  (10, 9)	1
  (10, 7)	1
  (10, 10)	1
  :	:
  (11905, 7)	1
  (11905, 10)	1
  (11905, 2)	1
  (11906, 9)	1
  (11906, 6)	1
  (11906, 0)	1
  (11907, 9)	1
  (11907, 6)	1
  (11907, 0)	1
  (11908, 9)	1
  (11908, 6)	1
  (11908, 0)	1
  (11909, 9)	1
  (11909, 6)	1
  (11909, 0)	1
  (11910, 9)	1
  (11910, 6)	1
  (11910, 0)	1
  (11911, 9)	1
  (11911, 6)	1
  (11911, 0)	1
  (11912, 9)	1
  (11912, 6)	1
  (11912, 0)	1
  (11913, 9)	1

Bag-of-Words (Normalized Count Occurrence):
   (0, 3)	1
  (0, 11)	1
  (0, 9)	1
  (0, 7)	1
  (0, 10)	1
  (1, 9)	1
  (1, 10)	1
  (2, 9)	1
  (2, 7)	1
  (2, 10)	1
  (3, 9)	1
  (3, 10)	1
  (4, 9)	1
  (5, 9)	1
  (5, 10)	1
  (6, 9)	1
  (6, 10)	1
  (7, 9)	1
  (7, 7)	1
  (7, 10)	1
  (8, 9)	1
  (9, 9)	1
 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
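
Each printed sparse entry above has the form `(row, column) value`. The sketch below shows how such entries can be mapped back to words. The entries are copied from the first two rows of the count output, but the vocabulary list is an assumption made for illustration; in the notebook, the actual column-to-term mapping should be available from `count_vectorizer.get_feature_names_out()`.

```python
from collections import defaultdict

# Assumed vocabulary (hypothetical): index i holds the term for column i
vocabulary = [
    "crossover", "diesel", "exotic", "factory", "flex", "fuel",
    "hatchback", "high", "hybrid", "luxury", "performance", "tuner",
]

# ((row, column), value) entries copied from the printed count output
entries = [
    ((0, 3), 1), ((0, 11), 1), ((0, 9), 1), ((0, 7), 1), ((0, 10), 1),
    ((1, 9), 1), ((1, 10), 1),
]

# Group entries by document row and translate column indices to terms
rows = defaultdict(dict)
for (row, col), value in entries:
    rows[row][vocabulary[col]] = value

decoded = dict(rows)
for row, terms in decoded.items():
    print(row, terms)
```

Printing a decoded sample like this, or just the matrix shape, is also much smaller than printing every entry, which helps avoid the data rate limit reported above.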

