# **Documentation**

## Dependencies
The code relies on the following libraries and modules:

- `csv`: Provides functionality for reading and writing CSV files.
- `chardet`: Enables character encoding detection.
- `re`: Offers regular expression operations.
- `pandas`: A powerful library for data manipulation and analysis.
- `numpy`: Provides support for mathematical operations on arrays and matrices.
- `sklearn`: A comprehensive machine learning library.
- `nltk`: A natural language processing library.

## Functionality

1. Reading CSV Files

  The code utilizes the `csv` module to read CSV files. It allows for the extraction of data from these files for further analysis.

2. Character Encoding Detection

  The `chardet` library is used to detect the character encoding of the text data. It helps ensure that the data is properly decoded and processed.

3. Regular Expression Operations

  The `re` module enables the use of regular expressions. Regular expressions are powerful tools for pattern matching and text manipulation.

4. Data Manipulation with pandas and numpy

  The code uses the `pandas` library for data manipulation and analysis, which makes structured data such as CSV files easy to handle. The `numpy` library is used for mathematical operations on arrays and matrices.

5. Cosine Similarity Calculation

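  The code employs the `cosine_similarity` function from the `sklearn.metrics.pairwise` module to compute the cosine similarity between vectors. Cosine similarity is the cosine of the angle between two non-zero vectors: 1 means they point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.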

6. Natural Language Processing (NLP) with nltk

  The code uses the `nltk` library for natural language processing tasks. It specifically utilizes the stopwords corpus from the `nltk.corpus` module. Stopwords are common words that are often excluded from text analysis as they typically do not carry important meaning.
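
To make item 5 concrete, here is a minimal sketch of the cosine similarity computation written with plain NumPy rather than `sklearn`. The helper name `cosine` and the toy vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors chosen for illustration only
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 1.0], [-1.0, -1.0])       # opposite directions
```

Because the score depends only on direction, rescaling a vector by a positive factor does not change it.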

In [1]:
import csv
import chardet
import re
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
libraries = {
    'csv': csv,
    'chardet': chardet,
    're': re,
    'pandas': pd,
    'numpy': np,
    'sklearn': sklearn,
    'nltk': nltk,
}

for library_name, library in libraries.items():
    try:
        version = library.__version__
        print(f'{library_name}: {version}')
    except AttributeError:
        print(f'{library_name}: version not found')

csv: 1.0
chardet: 4.0.0
re: 2.2.1
pandas: 1.5.3
numpy: 1.22.4
sklearn: 1.2.2
nltk: 3.8.1


# Library Imports

The following library is imported in the code:

- `google.colab`: A library for working with Google Colab, a cloud-based Jupyter notebook environment.

## Functionality

1. The code snippet uses the `google.colab` library to mount Google Drive in the Colab runtime. This makes the files and directories in Google Drive available inside the Colab environment.

2. The `drive.mount()` function is called with `/content/drive` as the mount point. Once Google Drive is mounted, its files and directories can be accessed with standard file I/O operations.


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Reading the Word Embedding File

The code reads the word embedding file in the `GloVe` format. The file path is specified as `/content/drive/MyDrive/For Capstone/Tensorflow Words Embedding/glove.6B.100d.txt`. The file is opened in read mode with the specified encoding of `utf-8`.

## Creating the Embeddings Index
The code iterates over each line in the file. Each line represents a word and its corresponding embedding vector. The line is split into a list of values based on whitespace.

- `word`: The first value in the line represents the word itself.
- `coefs`: The remaining values in the line represent the embedding vector as a sequence of floating-point numbers.

The code converts the embedding vector values to a NumPy array of type `float32` and assigns it to the variable `coefs`.

The word and its embedding vector are then added as a key-value pair to the `embeddings_index` dictionary.

In [6]:
# Step 1: Load GloVe word embeddings
embeddings_index = {}
with open('/content/drive/MyDrive/For Capstone/Tensorflow Words Embedding/glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
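
The parsing logic in the cell above can be tried without the full GloVe file. The sketch below runs the same split-and-convert steps on an in-memory sample; the three-dimensional vectors and the `parse_glove` helper are made up for illustration (the file used in the notebook has 100 values per word).

```python
import io
import numpy as np

def parse_glove(lines):
    """Build a word -> vector dict from GloVe-style text lines (hypothetical helper)."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        # First field is the word, the rest are the vector components
        index[values[0]] = np.asarray(values[1:], dtype='float32')
    return index

# Made-up sample with 3 dimensions per word
sample = io.StringIO("coffee 0.1 0.2 0.3\ntea 0.4 0.5 0.6\n")
toy_index = parse_glove(sample)
```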


## Setting the Embedding Dimension
The code initializes the variable `embedding_dim` with a value of 100. This variable represents the dimensionality of the word embeddings.

## Creating the Embedding Matrix
- The code initializes a zero-filled NumPy array called `embedding_matrix` with shape `(len(embeddings_index), embedding_dim)`. This matrix will store the word embeddings for the given vocabulary.

- The code also initializes an empty list called `vocab` to store the words corresponding to each row of the embedding matrix.

- The code then iterates over the words in the `embeddings_index` dictionary using the `enumerate` function, which provides an index `i` and the corresponding word at each iteration.

- For each word, the code retrieves the embedding vector from the `embeddings_index` dictionary using the word as the key.

- The embedding vector is assigned to the `embedding_vector` variable.

- The `embedding_vector` is then assigned to row `i` of the `embedding_matrix`, storing the embedding vector in the matrix.

- The `word` is appended to the `vocab` list.

In [None]:
# Step 2: Create embedding matrix and vocabulary
embedding_dim = 100  # Dimensionality of the word embeddings
embedding_matrix = np.zeros((len(embeddings_index), embedding_dim))
vocab = []

for i, word in enumerate(embeddings_index):
    embedding_vector = embeddings_index[word]
    embedding_matrix[i] = embedding_vector
    vocab.append(word)
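
As a small illustration of the loop above, the sketch below builds a matrix and vocabulary from a toy dictionary so that row `i` of the matrix holds the vector of the word at position `i`. The toy words, the 2-dimensional vectors, and the `word_to_row` lookup are assumptions for the example.

```python
import numpy as np

# Toy embeddings index (made-up values)
toy_index = {
    "cozy": np.array([1.0, 0.0], dtype="float32"),
    "cheap": np.array([0.0, 1.0], dtype="float32"),
    "quiet": np.array([0.5, 0.5], dtype="float32"),
}
toy_dim = 2

toy_matrix = np.zeros((len(toy_index), toy_dim))
toy_vocab = []
for i, word in enumerate(toy_index):
    toy_matrix[i] = toy_index[word]  # copy the vector into row i
    toy_vocab.append(word)           # remember which word row i belongs to

# Map a word back to its row number
word_to_row = {word: i for i, word in enumerate(toy_vocab)}
```

Python dictionaries preserve insertion order, so the rows follow the order in which the words were added to the index.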

## Creating the Search Data Array

- The code initializes an empty list called `search_data` to store the embedding vectors for the words in the `vocab` list.

- The code then iterates over each word in the `vocab` list.

- For each word, the code retrieves the corresponding embedding vector from the `embeddings_index` dictionary using the word as the key.

- The embedding vector is appended to the `search_data` list.

- After iterating through all the words, the `search_data` list is converted to a NumPy array using `np.array(search_data)`. Row `i` of this array is the embedding of `vocab[i]`.

In [None]:
# Step 6: Prepare the search data
search_data = []
for word in vocab:
    search_data.append(embeddings_index[word])

search_data = np.array(search_data)


In [None]:
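# Step 7: Define the search function
def search(query, top_k=5):
    """Return the top_k vocabulary words closest to each query word."""
    top_words_list = []
    for query_word in query:
        query_tokens = query_word.split()
        # Keep only tokens that have a GloVe vector
        token_vectors = [embeddings_index[token] for token in query_tokens if token in embeddings_index]
        if not token_vectors:
            # No known token: keep an empty result so indices stay aligned with the query
            top_words_list.append([])
            continue
        # Average the token vectors into a single query embedding
        query_embedding = np.mean(token_vectors, axis=0)
        # Cosine similarity between the query embedding and every vocabulary vector
        similarity_scores = cosine_similarity([query_embedding], search_data).reshape(-1)
        # Indices of the top_k highest scores, best first
        top_indices = similarity_scores.argsort()[-top_k:][::-1]
        top_words = [vocab[i] for i in top_indices]
        top_words_list.append(top_words)
    return top_words_list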
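
The ranking step inside `search` relies on `argsort`: sorting in ascending order, taking the last `top_k` indices, and reversing them puts the highest scores first. A minimal sketch with made-up scores and words:

```python
import numpy as np

# Made-up vocabulary and similarity scores, for illustration only
toy_vocab = ["cozy", "comfy", "cramped", "noisy", "warm"]
toy_scores = np.array([0.90, 0.75, 0.40, 0.10, 0.60])

top_k = 3
# argsort is ascending, so the last top_k entries are the largest scores
top_indices = toy_scores.argsort()[-top_k:][::-1]
top_words = [toy_vocab[i] for i in top_indices]
```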

## Stop Words
The code imports the stopwords corpus from the `nltk` library, specifically the English stop words. Stop words are common words that are often excluded from text analysis as they typically do not carry important meaning.

## Additional Keywords
The code defines a list called `additional_keywords` that contains additional words to be removed from the search query. These words are "caffe", "place", "coffee", "nan", and "cafe".

## Preprocessing the Search Query
The code splits the user input into individual words using the `split` method and stores them in `words`; this list is not used in the later steps.

The search query is then preprocessed by converting it to lowercase with the `lower` method and extracting word tokens with `re.findall` and the pattern `\b\w+\b`. This drops punctuation and other symbols, keeping only runs of letters, digits, and underscores.

## Removing Stop Words and Additional Keywords
The code removes stop words and additional keywords from the search keywords list using a list comprehension. A word is kept only if it is in neither the `stop_words` set nor the `additional_keywords` list.

## Returning the Search Keywords
The code returns the preprocessed search keywords as a list.

In [None]:
def input_keyword(user_input):
    stop_words = set(stopwords.words('english'))

    # Words to remove
    additional_keywords = ["caffe", "place", "coffee", "nan", "cafe"]

    # Split the input into individual words
    words = user_input.split()

    # Preprocess the search query by splitting and removing symbols
    search_keywords = re.findall(r'\b\w+\b', user_input.lower())

    # Remove stop words and additional keywords from the search keywords
    search_keywords = [word for word in search_keywords if word not in stop_words and word not in additional_keywords]

    return search_keywords

# Asking User Input
user_input = input("Search: ")
result = input_keyword(user_input)
print("Filtered Input:", result)

Search: caffe that has a affordable and cozy place
Filtered Input: ['affordable', 'cozy']
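
The preprocessing steps can be reproduced without downloading the NLTK corpus. The sketch below uses a tiny hand-written stop-word set as a stand-in for `stopwords.words('english')`; that set, the sample sentence, and the `filter_query` name are assumptions for illustration.

```python
import re

# Stand-in for the NLTK English stop words (deliberately tiny)
TOY_STOP_WORDS = {"that", "has", "a", "an", "and", "the"}
# Same domain-specific words the notebook removes
EXTRA_KEYWORDS = {"caffe", "place", "coffee", "nan", "cafe"}

def filter_query(text):
    """Lowercase, tokenize on word boundaries, then drop unwanted words."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [t for t in tokens if t not in TOY_STOP_WORDS and t not in EXTRA_KEYWORDS]

filtered = filter_query("Caffe that has an affordable, cozy place!")
```

Using sets for both word lists keeps each membership test constant-time, which matters little here but is the idiomatic choice.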


## Retrieving Top Words for Each Query Word
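The code calls the `search` function with `result` as the query and `top_k=10` to retrieve the top words for each query word. The `search` function, defined in an earlier cell, averages the GloVe vectors of each query word's tokens and returns the `top_k` vocabulary words with the highest cosine similarity to that average.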

The resulting top words for each query word are stored in the `top_words_list` variable.

## Combining Top Words into a Single List
The code initializes an empty list called `list_of_words` to store the combined top words.

The code then iterates over the query words in the `result` list using the `enumerate` function, which provides an index `i` and the corresponding `query_word` at each iteration.

For each query word, the code prints a message indicating the closest meanings for that query word and prints the corresponding top words from the `top_words_list`. The top words for the query word are also appended to the `list_of_words`.

## Flattening the List of Lists
After iterating through all the query words, the code uses a list comprehension to flatten `list_of_words`: it iterates over each sublist and then over each word in that sublist.

The resulting flattened list of words is stored in the `list_of_words` variable.

In [None]:
top_words_list = search(result, top_k=10)

list_of_words = []
for i, query_word in enumerate(result):
    print(f"Closest meanings for query '{query_word}':")
    print(top_words_list[i])
    list_of_words.append(top_words_list[i])
    print()
list_of_words  = [word for sublist in list_of_words for word in sublist]
# list_of_words = list_of_words[0:len(query)]
print('List closest keyword from query: \n', list_of_words)

Closest meanings for query 'affordable':
['affordable', 'inexpensive', 'cheap', 'expensive', 'cheaper', 'efficient', 'unaffordable', 'accessible', 'priced', 'environmentally']

Closest meanings for query 'cozy':
['cozy', 'cosy', 'comfy', 'homey', 'cramped', 'dingy', 'spacious', 'clubby', 'rustic', 'shabby']

List closest keyword from query: 
 ['affordable', 'inexpensive', 'cheap', 'expensive', 'cheaper', 'efficient', 'unaffordable', 'accessible', 'priced', 'environmentally', 'cozy', 'cosy', 'comfy', 'homey', 'cramped', 'dingy', 'spacious', 'clubby', 'rustic', 'shabby']
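
The flattening step can also be written with `itertools.chain.from_iterable`, which gives the same result as the nested list comprehension. A short sketch with shortened sublists (the variable names are illustrative):

```python
from itertools import chain

# Shortened example of a list of per-query word lists
nested = [["affordable", "cheap"], ["cozy", "comfy"]]

# Nested list comprehension, as used in the notebook
flat_comprehension = [word for sublist in nested for word in sublist]

# Equivalent standard-library alternative
flat_chain = list(chain.from_iterable(nested))
```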


## Reading the CSV File
The code opens the specified CSV file in read mode using the `open` function. It creates a CSV reader object using `csv.reader` to read the file contents.

The code initializes an empty list called `rows` to store the rows of the CSV file.

It then iterates over each row in the CSV file using a `for` loop. Each row, including the header, is appended to the `rows` list.

## Searching for Keywords
The code initializes an empty list called `row_numbers` to store the row numbers where the keywords are found.

The code iterates over the rows in the `rows` list with `enumerate`, which yields each row together with its index, and checks every keyword in the `keywords` list against that row.

For each row and keyword, the code checks whether the keyword is a substring of the specified column of the row. If it is, the row's index is appended to the `row_numbers` list. Because the header row is part of `rows`, index 0 refers to the header and the first data row has index 1. A row that matches several keywords is appended once per match.

## Returning the Row Numbers
The code returns the `row_numbers` list, which contains the row numbers where the keywords were found.

In [None]:
def search_keywords(csv_file, keywords, column):
  # Read every row (including the header) into memory
  with open(csv_file, 'r') as f:
    reader = csv.reader(f)
    rows = []
    for row in reader:
      rows.append(row)

  row_numbers = []
  # enumerate gives the row index directly; rows.index(row) would return
  # the first identical row and rescan the list on every match
  for row_number, row in enumerate(rows):
    for keyword in keywords:
      if keyword in row[column]:
        row_numbers.append(row_number)

  return row_numbers


if __name__ == '__main__':
    csv_file = '/content/drive/MyDrive/For Capstone/Collecting data/Place Detail (Scored + Keyword 1 & 2 Extracted  + Additional Feature (longlang, contact etc)).csv'
    keywords = list_of_words
    column = 13

    row_numbers = search_keywords(csv_file, keywords, column)
    unique_list = list(set(row_numbers))
    sorted_list = sorted(unique_list)
    Place_list= sorted_list[:20]

print(Place_list)

[2, 7, 9, 11, 12, 14, 15, 19, 21, 22, 23, 25, 29, 30, 34, 35, 36, 37, 38, 45]
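
The row-matching logic can be exercised on an in-memory CSV. The sketch below uses `enumerate` to track row numbers and `any` to record each matching row only once; the sample data, the column index, and the `find_rows` name are assumptions for illustration.

```python
import csv
import io

def find_rows(csv_text, keywords, column):
    """Return indices of rows whose given column contains any keyword."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    matches = []
    for row_number, row in enumerate(rows):
        # any() stops at the first keyword found in the cell
        if any(keyword in row[column] for keyword in keywords):
            matches.append(row_number)
    return matches

# Made-up data; row 0 is the header
sample_csv = "name,tags\nCafe A,cozy quiet\nCafe B,loud\nCafe C,cheap cozy\n"
matched = find_rows(sample_csv, ["cozy", "cheap"], column=1)
```

Since each row is recorded at most once and in file order, the result needs no separate de-duplication or sorting.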


## Columns to Extract
The code initializes a list called `columns_to_extract` that contains the indices of the columns to be extracted from the CSV file. The specific columns to be extracted are specified as `[0, 2, 4, 5, 14]`.

## Extracting Data from CSV
The code initializes an empty list called `output` to store the extracted data.

The code opens the CSV file specified by the variable `csv_file` in read mode using the `open` function. It creates a CSV reader object using `csv.reader` to read the file contents.

The code skips the header row of the CSV file by calling `next` on the reader, which advances it to the first data row.

A nested function called `get_data` is defined to extract the data from the CSV file based on the provided `row_numbers` and `column_numbers`.

Within the `get_data` function, the code iterates over each remaining row in the CSV file using a `for` loop and the `enumerate` function. For each row, it checks whether `i + 1` is in the `row_numbers` list. The offset is needed because the header has already been skipped, so `enumerate` starts at 0 for the first data row, while the row numbers produced by `search_keywords` count the header as row 0. If the row is wanted, the code extracts the values from the specified `column_numbers` and appends them to the `data` list.

The `get_data` function returns the extracted data as a list.

The code calls the `get_data` function with `Place_list` (the row numbers) and `columns_to_extract`, and assigns the returned list to the variable `data`.

## Creating Output Dictionary
The code iterates over each row in the `data` list using a `for` loop. For each row, it creates a dictionary with the keys `"name"`, `"address"`, `"rating"`, `"total_review"`, and `"url_photo"`. The values are taken from the corresponding elements in the row, with `rating` converted to `float` and `total_review` converted to `int`.

Each dictionary is appended to the output list.

## Returning the Output
The code returns the output list, which contains a list of dictionaries representing the extracted data from the CSV file.
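
The extraction and dictionary-building steps described above can be sketched on a small in-memory CSV. The sample rows, the column positions, and the `extract_places` name are hypothetical; the `i + 1` offset mirrors the header being skipped before enumeration.

```python
import csv
import io

def extract_places(csv_text, row_numbers, column_numbers):
    """Pick the given rows and columns and turn each row into a dict."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    wanted = set(row_numbers)
    output = []
    for i, row in enumerate(reader):
        # i starts at 0 for the first data row, which is row number 1
        if i + 1 in wanted:
            name, rating, total_review = (row[col] for col in column_numbers)
            output.append({
                "name": name,
                "rating": float(rating),
                "total_review": int(total_review),
            })
    return output

# Made-up data for illustration
sample_csv = (
    "name,city,rating,total_review\n"
    "Cafe A,Malang,4.5,120\n"
    "Cafe B,Sleman,4.1,45\n"
    "Cafe C,Malang,4.8,300\n"
)
places = extract_places(sample_csv, row_numbers=[1, 3], column_numbers=[0, 2, 3])
```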

In [None]:
def caffe_result(Place_list):
  columns_to_extract = [0, 2, 4, 5, 14]
  output = []

  with open(csv_file, 'r') as file:
      reader = csv.reader(file)
      next(reader)  # Skip the header row

      def get_data(row_numbers, column_numbers):
          data = []
          for i, row in enumerate(reader):
              if i + 1 in row_numbers:
                  row_data = [row[col] for col in column_numbers]
                  data.append(row_data)
          return data

      data = get_data(Place_list, columns_to_extract)
      for row in data:
          output.append({
                "name": row[0],
                "address": row[1],
                "rating": float(row[2]),
                "total_review": int(row[3]),
                "url_photo": row[4]
            })

      return output

# print(caffe_result(Place_list))


results = caffe_result(Place_list)
for result in results:
    print(result)

print(type(result))

{'name': 'Filosofi Kopi Jogja', 'address': 'Jl. Pandhawa No.001/17, Tegal Rejo, Sariharjo, Kec. Ngaglik, Kabupaten Sleman, Daerah Istimewa Yogyakarta 55581, Indonesia', 'rating': 4.5, 'total_review': 10643, 'url_photo': 'https://lh3.googleusercontent.com/places/ANJU3DtZv8Cgp2loEb8jQsfDS56fNN3N26tqc7y7Xk10jj6kXtIQ_wblkMeleGlljFV7jyGajr69EfbkU3GDlVo69D22-GMRnj84KNY=s1600-w400'}
{'name': 'Cokelat Klasik Cafe', 'address': 'Jalan Joyo Agung, Merjosari, Lowokwaru, Tlogomas, Kec. Lowokwaru, Kota Malang, Jawa Timur 65144, Indonesia', 'rating': 4.4, 'total_review': 6741, 'url_photo': 'https://lh3.googleusercontent.com/places/ANJU3Dvw5URkLOzQ3skAAAqK9jbsVKzBlPvIIeaXG5pe0EbOjuj1KBANlozk-azxZEEKOA4tA-88XbICHMdARdbrDU4yltuOtluyUg=s1600-w400'}
{'name': 'Bukit Delight', 'address': 'Jl. Joyo Agung No.1, Tlogomas, Kec. Lowokwaru, Kota Malang, Jawa Timur 65144, Indonesia', 'rating': 4.5, 'total_review': 5333, 'url_photo': 'https://lh3.googleusercontent.com/places/ANJU3Dt2mpfqwMdzCKBski9jEu0drc192U1XoO-f