## Exploratory Data Analysis (EDA) - Initial Data Scraping and Observations

This notebook first walks through the data scraping process and the insights gained from the dataset. This initial step is crucial for understanding the dataset, its metadata, and the relationships between the data and its features.

**Data Overview:**

To get an overview of the dataset, I used the Python libraries `pyspark` and `pandas` to load and explore the data.

**Data Preprocessing:**

Data preprocessing is a crucial step to ensure that the data is in a usable format. This involves handling missing values, encoding categorical variables, and scaling or normalizing features as needed. I performed the following preprocessing steps:

1. Data Loading: I used PySpark and Pandas to load the dataset.
2. Handling Missing Values: Checked for missing values and decided whether to impute or remove them.
3. Encoding Categorical Variables: Encoded categorical variables when necessary.
4. Feature Scaling/Normalization: Scaled or normalized features as required.
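
The steps above can be sketched on a tiny in-memory table. This is a minimal illustration using only the standard library, not the notebook's actual PySpark code; the rows and the `RECEIPTS` values are made up, and the derived column names `CATEGORY_ID` and `RECEIPTS_SCALED` are hypothetical.

```python
# Hypothetical sample rows; column names mirror brand_category.csv,
# the values are invented for illustration.
rows = [
    {"BRAND": "DAWN", "BRAND_BELONGS_TO_CATEGORY": "Bath & Body", "RECEIPTS": 40},
    {"BRAND": None, "BRAND_BELONGS_TO_CATEGORY": "Beer", "RECEIPTS": 10},
    {"BRAND": "KROGER", "BRAND_BELONGS_TO_CATEGORY": "Bakery", "RECEIPTS": 25},
]

# Step 2: count missing values per column, then fill them with a placeholder.
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}
filled = [{k: ("NA" if v is None else v) for k, v in r.items()} for r in rows]

# Step 3: label-encode a categorical column (sorted for a stable mapping).
categories = sorted({r["BRAND_BELONGS_TO_CATEGORY"] for r in filled})
cat_to_id = {c: i for i, c in enumerate(categories)}

# Step 4: min-max scale a numeric column into [0, 1].
# (Assumes the column is not constant; otherwise hi - lo would be zero.)
lo = min(r["RECEIPTS"] for r in filled)
hi = max(r["RECEIPTS"] for r in filled)
for r in filled:
    r["CATEGORY_ID"] = cat_to_id[r["BRAND_BELONGS_TO_CATEGORY"]]
    r["RECEIPTS_SCALED"] = (r["RECEIPTS"] - lo) / (hi - lo)

print(missing)
print(filled)
```

Whether to impute or drop a missing value depends on the column; the placeholder fill shown here matches the `fillna("NA")` approach used later in the notebook.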

**Initial Observations:**

After loading and preprocessing the data, I conducted an initial analysis to gain insights into the dataset:

- Data Size: The datasets contain different numbers of rows and columns; for example, `describe()` reports a count of 9,906 for `brand_category.csv` and 118 for `categories.csv`.
- Feature Types: The datasets include features of different types, such as categorical and text.
- Missing Values: I checked for missing values in the dataset. If any were found, I determined whether to impute missing data or remove instances with missing values.
- Descriptive Statistics: I calculated basic summary statistics (mean, median, standard deviation, and quartiles) for the feature counts.
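
As a rough illustration of the last bullet, the sketch below counts rows per category and summarizes those counts. It uses the standard library rather than PySpark, and the category labels are invented stand-ins for one column of the dataset.

```python
from collections import Counter
import statistics

# Hypothetical category labels standing in for one column of the dataset.
labels = ["Beer", "Bakery", "Beer", "Yogurt", "Beer", "Bakery", "Candy"]

# Rows per category, analogous to groupBy(...).count() in the notebook.
counts = Counter(labels)
values = sorted(counts.values())

# Summary statistics over the per-category counts.
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),               # sample standard deviation
    "quartiles": statistics.quantiles(values, n=4),  # three cut points
}

print(counts.most_common(3))
print(summary)
```

Summarizing counts rather than raw values makes sense here because the columns are categorical or text, so a mean of the values themselves is not meaningful.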

**Further Steps:**

Based on these initial observations, I planned to establish how the features Product, Retailer, and Brand relate to OFFER, which may involve feature engineering and deeper exploratory analysis.

By conducting this initial data scraping and observation, I have laid the foundation for a more comprehensive analysis of the dataset. Understanding the dataset and its metadata is essential for making informed decisions throughout the data analysis and modeling process. Next, I will proceed with more in-depth exploratory analysis and feature engineering to extract valuable insights from the data.



In [1]:
!pip install pyspark
!pip install pyspark[sql]
!pip install pandas



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Fetch").getOrCreate()



In [3]:
data_df = spark.read.csv("C:/Users/karth/Desktop/Fetch_project/Data/brand_category.csv", header=True, inferSchema=True)

print(data_df)

# fillna returns a new DataFrame, so the result must be assigned
data_df = data_df.fillna("NA")

data = data_df.drop("RECEIPTS")

data.describe().show()

data.filter(data["BRAND"] == "NA").show()

data.filter(data["BRAND_BELONGS_TO_CATEGORY"] == "NA").show()

data.show()

unique_brands = data.select("BRAND").distinct()
unique_brands.show()

category_count = data.groupBy("BRAND").count()
category_count.show()

unique_categories = data.select("BRAND_BELONGS_TO_CATEGORY").distinct()
unique_categories.show()

category_counts = data.groupBy("BRAND_BELONGS_TO_CATEGORY").count()
category_counts.show()

DataFrame[BRAND: string, BRAND_BELONGS_TO_CATEGORY: string, RECEIPTS: int]
+-------+-----------------+-------------------------+
|summary|            BRAND|BRAND_BELONGS_TO_CATEGORY|
+-------+-----------------+-------------------------+
|  count|             9906|                     9906|
|   mean|          1000.45|                     null|
| stddev|841.6671097792823|                     null|
|    min|                1|       Adult Incontinence|
|    max|    breath savers|                   Yogurt|
+-------+-----------------+-------------------------+

+-----+-------------------------+
|BRAND|BRAND_BELONGS_TO_CATEGORY|
+-----+-------------------------+
|   NA|                     Beer|
+-----+-------------------------+

+-----+-------------------------+
|BRAND|BRAND_BELONGS_TO_CATEGORY|
+-----+-------------------------+
+-----+-------------------------+

+------------------+-------------------------+
|             BRAND|BRAND_BELONGS_TO_CATEGORY|
+------------------+----------------

In [4]:
data_df1 = spark.read.csv("C:/Users/karth/Desktop/Fetch_project/Data/categories.csv", header=True, inferSchema=True)

data_df1.filter(data_df1['CATEGORY_ID'].isNull()).show()

data_df1.filter(data_df1['PRODUCT_CATEGORY'].isNull()).show()

data_df1.filter(data_df1['IS_CHILD_CATEGORY_TO'].isNull()).show()

data_df1.describe().show()

data1 = data_df1.drop("CATEGORY_ID")

data1.describe().show()

unique_PRODUCT_CATEGORY = data1.select("PRODUCT_CATEGORY").distinct()
unique_PRODUCT_CATEGORY.show()

PRODUCT_CATEGORY_count = data1.groupBy("PRODUCT_CATEGORY").count()
PRODUCT_CATEGORY_count.show()

IS_CHILD_CATEGORY_TO_categories = data1.select("IS_CHILD_CATEGORY_TO").distinct()
IS_CHILD_CATEGORY_TO_categories.show()

IS_CHILD_CATEGORY_TO_counts = data1.groupBy("IS_CHILD_CATEGORY_TO").count()
IS_CHILD_CATEGORY_TO_counts.show()

+-----------+----------------+--------------------+
|CATEGORY_ID|PRODUCT_CATEGORY|IS_CHILD_CATEGORY_TO|
+-----------+----------------+--------------------+
+-----------+----------------+--------------------+

+-----------+----------------+--------------------+
|CATEGORY_ID|PRODUCT_CATEGORY|IS_CHILD_CATEGORY_TO|
+-----------+----------------+--------------------+
+-----------+----------------+--------------------+

+-----------+----------------+--------------------+
|CATEGORY_ID|PRODUCT_CATEGORY|IS_CHILD_CATEGORY_TO|
+-----------+----------------+--------------------+
+-----------+----------------+--------------------+

+-------+--------------------+------------------+--------------------+
|summary|         CATEGORY_ID|  PRODUCT_CATEGORY|IS_CHILD_CATEGORY_TO|
+-------+--------------------+------------------+--------------------+
|  count|                 118|               118|                 118|
|   mean|                null|              null|                null|
| stddev|         

In [5]:
data_df2 = spark.read.csv("C:/Users/karth/Desktop/Fetch_project/Data/offer_retailer.csv", header=True, inferSchema=True)

data_df2.filter(data_df2["OFFER"].isNull()).show()

data_df2.filter(data_df2["RETAILER"].isNull()).show()

data_df2.filter(data_df2["BRAND"].isNull()).show()

data2 = data_df2.fillna("NA")

data2.filter(data2["RETAILER"] == "NA").show()

data2.describe().show()

unique_OFFER = data2.select("OFFER").distinct()
unique_OFFER.show()

OFFER_count = data2.groupBy("OFFER").count()
OFFER_count.show()

unique_RETAILER = data2.select("RETAILER").distinct()
unique_RETAILER.show()

RETAILER_count = data2.groupBy("RETAILER").count()
RETAILER_count.show()

BRAND_categories = data2.select("BRAND").distinct()
BRAND_categories.show()

BRAND_counts = data2.groupBy("BRAND").count()
BRAND_counts.show()

+-----+--------+-----+
|OFFER|RETAILER|BRAND|
+-----+--------+-----+
+-----+--------+-----+

+--------------------+--------+--------------------+
|               OFFER|RETAILER|               BRAND|
+--------------------+--------+--------------------+
|Beyond Meat® Plan...|    null|         BEYOND MEAT|
|Good Humor Vienne...|    null|          GOOD HUMOR|
|Emmy's Organics® ...|    null|        EMMYS POP UP|
|Barilla® Pesto Sauce|    null|             BARILLA|
|Any General Mills...|    null|                null|
|Good Rewards Memb...|    null|ANNIES HOMEGROWN ...|
|DOVE® Chocolate, ...|    null|      DOVE CHOCOLATE|
|Hellmann's® OR Be...|    null|HELLMANNS BEST FOODS|
|M&M'S®, select si...|    null|                M&MS|
|GATORLYTE® OR GAT...|    null|            GATORADE|
|Red Gold Tomato K...|    null|            RED GOLD|
|Gillette Venus ® ...|    null|      GILLETTE VENUS|
|Simply Spiked™ Le...|    null|       SIMPLY SPIKED|
|Tru-Ray® Premium ...|    null|              TRURAY|
|CESAR

In [7]:
# Previewing the three dataframes loaded above
data.show()
data1.show()
data2.show()

# Combining the dataframes vertically
dff = data.unionByName(data1, allowMissingColumns=True).unionByName(data2, allowMissingColumns=True)

# Viewing the merged dataframe
dff.describe().show()

df = dff.fillna("NA")

+------------------+-------------------------+
|             BRAND|BRAND_BELONGS_TO_CATEGORY|
+------------------+-------------------------+
|  CASEYS GEN STORE|         Tobacco Products|
|  CASEYS GEN STORE|                   Mature|
|            EQUATE|             Hair Removal|
|         PALMOLIVE|              Bath & Body|
|              DAWN|              Bath & Body|
|          BARBASOL|             Hair Removal|
|            KROGER|                   Bakery|
|        SKINTIMATE|             Hair Removal|
|     DAWN PLATINUM|              Bath & Body|
|KIRKLAND SIGNATURE|              Bath & Body|
|          RED BULL|     Carbonated Soft D...|
|         COCA-COLA|     Carbonated Soft D...|
|              AJAX|              Bath & Body|
|            EQUATE|              Bath & Body|
|         DR PEPPER|     Carbonated Soft D...|
|        BODYCOLOGY|          Body Fragrances|
|           KINDERS|             Dips & Salsa|
|    BODY FANTASIES|          Body Fragrances|
|      MOUNTA

In [8]:
dff.show()

+------------------+-------------------------+----------------+--------------------+-----+--------+
|             BRAND|BRAND_BELONGS_TO_CATEGORY|PRODUCT_CATEGORY|IS_CHILD_CATEGORY_TO|OFFER|RETAILER|
+------------------+-------------------------+----------------+--------------------+-----+--------+
|  CASEYS GEN STORE|         Tobacco Products|            null|                null| null|    null|
|  CASEYS GEN STORE|                   Mature|            null|                null| null|    null|
|            EQUATE|             Hair Removal|            null|                null| null|    null|
|         PALMOLIVE|              Bath & Body|            null|                null| null|    null|
|              DAWN|              Bath & Body|            null|                null| null|    null|
|          BARBASOL|             Hair Removal|            null|                null| null|    null|
|            KROGER|                   Bakery|            null|                null| null|    null|


In [None]:
import pandas as pd

# Convert PySpark DataFrame to Pandas DataFrame
pandas_df = df.toPandas()

# Write Pandas DataFrame to save as CSV file
pandas_df.to_csv('file.csv', index=False)

Up to this point, I performed the data scraping and analysis and combined the different datasets into one. After this, I unified the data by mapping the datasets to each other.
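
To make the vertical combination concrete, here is a minimal pure-Python sketch of what a union by column name does when some columns are missing from a table. The helper `union_by_name` is hypothetical and only imitates the behavior of `unionByName(..., allowMissingColumns=True)`; the two sample rows are taken from the outputs shown earlier.

```python
def union_by_name(*tables):
    """Stack lists of dict rows, padding columns a table lacks with None."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [{name: row.get(name) for name in columns}
            for table in tables for row in table]

brands = [{"BRAND": "DAWN", "BRAND_BELONGS_TO_CATEGORY": "Bath & Body"}]
offers = [{"OFFER": "Barilla® Pesto Sauce", "RETAILER": None, "BRAND": "BARILLA"}]

merged = union_by_name(brands, offers)
for row in merged:
    print(row)
```

Because a union only stacks rows, each merged row carries values from a single source file and nulls everywhere else. Linking an offer to a category therefore needs a key-based match, which is the mapping step mentioned above.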
 
**Text Data Preprocessing using PySpark**

Now, I focus on preparing the text data within the unified dataset. Our goal is to clean and preprocess the text features using PySpark.

**Data Deduplication**
Duplicate rows in the dataset are removed to ensure data integrity.

**Text Cleaning**
Text data is cleaned by converting it to lowercase and removing special characters.

**Text Tokenization**
The text is tokenized, breaking it into individual words, and stored in a new "words" column.

**Stop Word Removal**
Common stop words are removed from the tokenized words.

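**Word Embedding with Word2Vec (Example)**
Word2Vec converts the filtered words into numerical vectors. Strictly speaking, this is word embedding rather than lemmatization, since the words themselves are not reduced to base forms. The vectors are kept in case they prove useful in the future.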


This preprocessing prepares the text data for analysis and machine learning tasks.
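
The cleaning steps described above can be sketched without Spark. This is a rough standard-library analogue, not the notebook's pipeline: the stop-word list is a small illustrative assumption (Spark's `StopWordsRemover` uses its own default list), and the second sample offer is a hypothetical raw form.

```python
import re

# Small illustrative stop-word list (an assumption, not Spark's default list).
STOP_WORDS = {"a", "an", "and", "at", "or", "the", "to"}

def clean_text(text):
    """Lowercase, then drop everything except letters, digits, and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text.lower())

def tokenize(text):
    """Split on runs of whitespace."""
    return text.split()

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

raw_offers = [
    "Barilla® Pesto Sauce",
    "Barilla® Pesto Sauce",  # exact duplicate, removed below
    "Beyond Steak™ Plant-Based Seared Tips, 10 ounce at Target",  # illustrative
]

# dict.fromkeys keeps the first occurrence of each string, in order.
unique_offers = list(dict.fromkeys(raw_offers))

processed = [remove_stop_words(tokenize(clean_text(o))) for o in unique_offers]
for tokens in processed:
    print(tokens)
```

Note that stripping punctuation fuses hyphenated words, so "Plant-Based" becomes the single token `plantbased`, which matches the cleaned offers that appear in the search results later in the notebook.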

 

In [11]:
!pip install pyspark
!pip install torch
!pip install pandas
!pip install numpy 
!pip install sentence-transformers 
!pip install elasticsearch
!pip install tqdm
!pip install faiss-cpu
!pip install scikit-learn

Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Using cached urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.18
    Uninstalling urllib3-1.26.18:
      Successfully uninstalled urllib3-1.26.18
Successfully installed urllib3-1.25.11


ERROR: conda 4.13.0 requires ruamel_yaml_conda>=0.11.14, which is not installed.
ERROR: selenium 4.3.0 has requirement urllib3[secure,socks]~=1.26, but you'll have urllib3 1.25.11 which is incompatible.
ERROR: elastic-transport 8.4.1 has requirement urllib3<2,>=1.26.2, but you'll have urllib3 1.25.11 which is incompatible.


Collecting urllib3<2,>=1.26.2
  Using cached urllib3-1.26.18-py2.py3-none-any.whl (143 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.25.11
    Uninstalling urllib3-1.25.11:
      Successfully uninstalled urllib3-1.25.11
Successfully installed urllib3-1.26.18


ERROR: conda 4.13.0 requires ruamel_yaml_conda>=0.11.14, which is not installed.
ERROR: requests 2.22.0 has requirement urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 1.26.18 which is incompatible.




ERROR: Could not find a version that satisfies the requirement faiss (from versions: none)
ERROR: No matching distribution found for faiss




In [12]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, regexp_replace, col
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.ml.feature import Word2Vec

# Initializing a Spark session
spark = SparkSession.builder.appName("TextProcessing").getOrCreate()

# Loading the dataset
unified_df = spark.read.option("header", "true").csv("C:/Users/karth/Desktop/Fetch_project/unified.csv")

# Remove exact duplicate rows
unified_df = unified_df.dropDuplicates()

# Lowercase every string column and strip characters other than letters, digits, and whitespace
for col_name in unified_df.columns:
    if unified_df.schema[col_name].dataType.typeName() == 'string':
        unified_df = unified_df.withColumn(col_name, lower(col(col_name)))
        unified_df = unified_df.withColumn(col_name, regexp_replace(col(col_name), "[^a-zA-Z0-9\\s]", ""))


# Tokenizing the text
tokenizer = Tokenizer(inputCol="OFFER", outputCol="words")
text_df = tokenizer.transform(unified_df)

# Remove stop words, then embed the remaining words with Word2Vec
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
text_df = remover.transform(text_df)
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="filtered_words", outputCol="features")
model = word2Vec.fit(text_df)
text_df = model.transform(text_df)

# Displaying the processed DataFrame
text_df.show(truncate=False)

# Convert the PySpark DataFrame to a Pandas DataFrame and save it as a CSV file
import pandas as pd
pandas_df = text_df.toPandas()
pandas_df.to_csv('unified_test.csv', index=False)

# Stop the Spark session
spark.stop()


+-------------------------------------------------------------------------------+----------------------+------------------------------------------+----------------+-------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+------------------------------------------------------------------+
|OFFER                                                                          |RETAILER_mapped       |BRAND                                     |PRODUCT_CATEGORY|words                                                                                      |filtered_words                                                                 |features                                                          |
+-------------------------------------------------------------------------------+----------------------+------------------------------------------+----------------+----------------------------

PermissionError: [Errno 13] Permission denied: 'unified_test.csv'

**I used Hugging Face's Sentence Transformers to find semantic matches in the dataset. The code encodes offers into embeddings, indexes them with FAISS, and performs a K-nearest neighbors (KNN) search on L2 distance. An example query, "TARGET," is used to find and display the top 10 similar offers.**
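
For intuition, the sketch below shows what an exact (flat) nearest-neighbor search does: compare the query with every stored vector and keep the `k` smallest distances. It is a pure-Python stand-in, not FAISS itself; the function names and the toy 2-D vectors are made up. Like `IndexFlatL2`, it reports squared L2 distances, which is why the scores printed for this search are large compared with cosine distances.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(query, vectors, k):
    """Return the k (index, squared distance) pairs closest to query."""
    scored = ((i, squared_l2(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Toy 2-D "embeddings" (made up); real sentence embeddings have hundreds of dimensions.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [3.0, 3.0]]

results = flat_l2_search([1.0, 1.0], vectors, k=2)
print(results)  # [(1, 1.0), (0, 2.0)]
```

A brute-force scan like this is exact but costs one distance computation per stored vector, which is acceptable for a dataset of this size.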

In [13]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import faiss

# Initializing a SentenceTransformer model on Hugging Face
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


df = pd.read_csv('C:/Users/karth/Desktop/Fetch_project/unified_test.csv')

offer_embeddings = model.encode(df['OFFER'].tolist())

# Build an index for KNN search
index = faiss.IndexFlatL2(offer_embeddings.shape[1])
index.add(offer_embeddings)

# Function to perform K-nearest neighbors (KNN) search
def knn_search(input_query, index, k=10):
    query_embedding = model.encode([input_query])
    distances, indices = index.search(query_embedding, k)
    
    search_results = []
    for i in range(k):
        search_results.append((indices[0][i], distances[0][i]))
    
    return search_results

# Input query
input_query = "TARGET"

# Perform KNN semantic search and get top 10 results
top_results = knn_search(input_query, index, k=10)

# Print the list of offers with distance scores
# (row_idx avoids shadowing the FAISS `index` object defined above)
for i, (row_idx, distance) in enumerate(top_results):
    print(f"Result {i + 1} - Distance: {distance:.4f}")
    print(f"Offer: {df.iloc[row_idx]['OFFER']}\n")





Result 1 - Distance: 43.3433
Offer: arber at target

Result 2 - Distance: 63.1514
Offer: loral paris true match foundation at target

Result 3 - Distance: 63.4933
Offer: loreal paris true match foundation at target

Result 4 - Distance: 68.3754
Offer: beyond steak plantbased seared tips 10 ounce buy 2 at target

Result 5 - Distance: 68.3754
Offer: beyond steak plantbased seared tips 10 ounce buy 2 at target

Result 6 - Distance: 68.3754
Offer: beyond steak plantbased seared tips 10 ounce buy 2 at target

Result 7 - Distance: 74.9721
Offer: beyond steak plantbased seared tips 10 ounce at target

Result 8 - Distance: 74.9721
Offer: beyond steak plantbased seared tips 10 ounce at target

Result 9 - Distance: 74.9721
Offer: beyond steak plantbased seared tips 10 ounce at target

Result 10 - Distance: 75.2401
Offer: back to the roots grow seed starting pots or germination trays at walmart or target



**After observing the above model's performance, I again used Hugging Face's Sentence Transformers to encode offers into embeddings and perform a K-nearest neighbors (KNN) search. This time I employed scikit-learn's NearestNeighbors to find the top 20 offers most similar to the example query "Target" based on cosine distance. The results are displayed with their distances.**
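
The switch from L2 to cosine distance changes what "similar" means. The sketch below, written with the standard library and made-up 2-D vectors, shows that cosine distance (1 minus cosine similarity) ignores vector length while squared L2 does not, and that the two agree up to a factor of 2 once vectors are normalized to unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    return 1.0 - dot(a, b) / (norm(a) * norm(b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(a):
    length = norm(a)
    return [x / length for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]
scaled_a = [10 * x for x in a]  # same direction, ten times longer

# Cosine distance ignores vector length; squared L2 does not.
print(cosine_distance(a, b), cosine_distance(scaled_a, b))
print(squared_l2(a, b), squared_l2(scaled_a, b))

# On unit-length vectors the two agree up to a factor of 2.
ua, ub = normalize(a), normalize(b)
print(math.isclose(squared_l2(ua, ub), 2 * cosine_distance(ua, ub)))
```

If the embeddings are not unit length, the two metrics can rank neighbors differently. That is one possible reason, alongside the different query casing, why the cosine results are not ordered exactly like the L2 results.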

In [14]:
from sentence_transformers import SentenceTransformer
import pandas as pd
from sklearn.neighbors import NearestNeighbors


model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

df = pd.read_csv('C:/Users/karth/Desktop/Fetch_project/unified_test.csv')

offer_embeddings = model.encode(df['OFFER'].tolist())

# Building an index for KNN search using scikit-learn's NearestNeighbors
nn = NearestNeighbors(n_neighbors=10, metric='cosine', n_jobs=-1)
nn.fit(offer_embeddings)

# Function to perform K-nearest neighbors (KNN) search
def knn_search(input_query, nn, k=10):
    query_embedding = model.encode([input_query])
    distances, indices = nn.kneighbors(query_embedding, n_neighbors=k)
    
    # Converted the results to a list format
    search_results = []
    for i in range(k):
        result = {
            "index": indices[0][i],
            "distance": distances[0][i],
            "offer": df.iloc[indices[0][i]]['OFFER']
        }
        search_results.append(result)
    
    return search_results

input_query = "Target"

top_results = knn_search(input_query, nn, k=20)

for i, result in enumerate(top_results):
    print(f"Result {i + 1} - Distance: {result['distance']:.4f}")
    print(f"Offer: {result['offer']}\n")


Result 1 - Distance: 0.3363
Offer: arber at target

Result 2 - Distance: 0.5424
Offer: loreal paris true match foundation at target

Result 3 - Distance: 0.5498
Offer: loral paris true match foundation at target

Result 4 - Distance: 0.5825
Offer: beyond steak plantbased seared tips 10 ounce buy 2 at target

Result 5 - Distance: 0.5825
Offer: beyond steak plantbased seared tips 10 ounce buy 2 at target

Result 6 - Distance: 0.5825
Offer: beyond steak plantbased seared tips 10 ounce buy 2 at target

Result 7 - Distance: 0.6139
Offer: talenti mini bars

Result 8 - Distance: 0.6139
Offer: talenti mini bars

Result 9 - Distance: 0.6204
Offer: beyond steak plantbased seared tips 10 ounce at target

Result 10 - Distance: 0.6204
Offer: beyond steak plantbased seared tips 10 ounce at target

Result 11 - Distance: 0.6204
Offer: beyond steak plantbased seared tips 10 ounce at target

Result 12 - Distance: 0.6276
Offer: loral paris makeup spend 30 at target

Result 13 - Distance: 0.6509
Offer: lo

**Both the code above and the code below use Sentence Transformers to encode offers and find similar offers based on cosine distance. However, the code below introduces a search function that identifies the type of the input text (retailer, brand, or product category) and performs a tailored search, providing more context-specific results.**
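
The type-detection idea can be isolated from the embedding step. The sketch below is a simplified stand-in that only decides which rows would be passed to the semantic ranking; the rows are hypothetical, and the helper names `normalize_query`, `detect_search_type`, and `candidate_rows` are not part of the notebook. It also cleans the query the same way the dataset was cleaned, on the assumption that an exact match against lowercased, punctuation-free columns would otherwise miss inputs such as "Sam's Club".

```python
import re

# Hypothetical rows shaped like the unified dataset (values already cleaned).
rows = [
    {"OFFER": "tyson products select varieties spend 20 at sams club",
     "RETAILER_mapped": "sams club", "BRAND": "tyson",
     "PRODUCT_CATEGORY": "packaged meat"},
    {"OFFER": "loreal paris true match foundation at target",
     "RETAILER_mapped": "target", "BRAND": "loreal paris",
     "PRODUCT_CATEGORY": "cosmetics"},
    {"OFFER": "talenti mini bars",
     "RETAILER_mapped": None, "BRAND": "talenti",
     "PRODUCT_CATEGORY": "frozen desserts"},
]

def normalize_query(text):
    """Apply the dataset's cleaning: lowercase and strip punctuation."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).strip()

def detect_search_type(query, rows):
    """Return the first column containing the query exactly, else 'OFFER'."""
    for column in ("PRODUCT_CATEGORY", "BRAND", "RETAILER_mapped"):
        if any(row[column] == query for row in rows):
            return column
    return "OFFER"

def candidate_rows(raw_query, rows):
    """Pick the rows that the semantic ranking should consider."""
    query = normalize_query(raw_query)
    column = detect_search_type(query, rows)
    if column == "OFFER":
        return column, rows  # no filter; rank every offer semantically
    return column, [r for r in rows if r[column] == query]

for text in ("Sam's Club", "TARGET", "cheese"):
    column, candidates = candidate_rows(text, rows)
    print(text, "->", column, len(candidates))
```

The column order encodes a precedence (category, then brand, then retailer), mirroring the `if`/`elif` chain in the notebook's `search` function.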

In [15]:
from sentence_transformers import SentenceTransformer
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Load the paraphrase model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

df = pd.read_csv('C:/Users/karth/Desktop/Fetch_project/unified_test.csv')

offer_embeddings = model.encode(df['OFFER'].tolist())

# Define k, assuming that the dataset can provide at least 10 offers for the given query
k = 10 

nn = NearestNeighbors(n_neighbors=k, metric='cosine', n_jobs=-1)
nn.fit(offer_embeddings)

# Function to perform K-nearest neighbors (KNN) search
def knn_search(input_query, nn, k):
    query_embedding = model.encode([input_query])
    
    # Perform a KNN search
    distances, indices = nn.kneighbors(query_embedding, n_neighbors=k)
    
    # Converted the results to a list format
    search_results = []
    for i in range(k):
        result = {
            "index": indices[0][i],
            "distance": distances[0][i],
            "offer": df.iloc[indices[0][i]]['OFFER'],
            "retailer": df.iloc[indices[0][i]]['RETAILER_mapped'],
            "brand": df.iloc[indices[0][i]]['BRAND'],
            "product_category": df.iloc[indices[0][i]]['PRODUCT_CATEGORY'],
        }
        search_results.append(result)
    
    return search_results

def search(input_text, nn, k):
    # Determine the type of search based on input_text
    search_type = "offer"
    if input_text in df['PRODUCT_CATEGORY'].values:
        search_type = "PRODUCT_CATEGORY"
    elif input_text in df['BRAND'].values:
        search_type = "BRAND"
    elif input_text in df['RETAILER_mapped'].values:
        search_type = "RETAILER_mapped"
    
    if search_type == "offer":
        # Perform KNN semantic search for offers
        results = knn_search(input_text, nn, k)
    else:
        # Filter the dataset based on the search type
        filtered_df = df[df[search_type] == input_text]
        if not filtered_df.empty:
            # Encode the filtered offers and perform KNN search
            query_embedding = model.encode([input_text])
            filtered_offer_embeddings = model.encode(filtered_df['OFFER'].tolist())
            # Define k based on the number of samples in the filtered dataset
            k = min(k, len(filtered_offer_embeddings))
            filtered_nn = NearestNeighbors(n_neighbors=k, metric='cosine', n_jobs=-1)
            filtered_nn.fit(filtered_offer_embeddings)
            distances, indices = filtered_nn.kneighbors(query_embedding, n_neighbors=k)
            results = []
            for i in range(k):
                result = {
                    "index": indices[0][i],
                    "distance": distances[0][i],
                    "offer": filtered_df.iloc[indices[0][i]]['OFFER'],
                    "retailer": filtered_df.iloc[indices[0][i]]['RETAILER_mapped'],
                    "brand": filtered_df.iloc[indices[0][i]]['BRAND'],
                    "product_category": filtered_df.iloc[indices[0][i]]['PRODUCT_CATEGORY'],
                }
                results.append(result)
        else:
            results = []

    return results

# input query
input_query = "sams club"

# Results are now limited to the rows that match the query's detected type
search_results = search(input_query, nn, k)

# Print the list of offers with distance scores
for result in search_results:
    print("Offer:", result["offer"])
    print("Retailer:", result["retailer"])
    print("Brand:", result["brand"])
    print("Product Category:", result["product_category"])
    print("Distance Score:", result["distance"])
    print()

Offer: tyson products select varieties spend 20 at sams club
Retailer: sams club
Brand: frozen beef
Product Category: nan
Distance Score: 0.5882034

Offer: tyson products select varieties spend 20 at sams club
Retailer: sams club
Brand: packaged meat
Product Category: nan
Distance Score: 0.5882034

Offer: spend 50 on a fullpriced new club membership
Retailer: sams club
Brand: nan
Product Category: nan
Distance Score: 0.60120666

Offer: georges farmers market chicken wings at sams club
Retailer: sams club
Brand: nan
Product Category: nan
Distance Score: 0.6272481

Offer: spend 110 on a fullpriced new plus membership and receive an additional 10000 points
Retailer: sams club
Brand: nan
Product Category: nan
Distance Score: 0.7857358



***CONCLUSION***

**In summary**, the analysis begins with data scraping and initial observations, proceeds to data preprocessing and text preprocessing, and finally demonstrates semantic search using Sentence Transformers. The last semantic search code is the most advanced, providing **context-specific results** based on the type of input text. Overall, this analysis and search capability improve understanding and use of the data.