In [9]:
import pandas as pd
df = pd.read_csv('data/data.csv')
print("dataset length: ", len(df))
print(df[:10])

dataset length:  10017
   Sl no                                             Tweets     Search key  \
0      1   #1: @fe ed "RT @MirayaDizon1: Time is ticking...  happy moments   
1      2   #2: @蓮花 &はすか ed "RT @ninjaryugo: ＃コナモンの日 だそうで...  happy moments   
2      3   #3: @Ris ♡ ed "Happy birthday to one smokin h...  happy moments   
3      4   #4: @월월 [씍쯴사랑로봇] jwinnie is the best, cheer u...  happy moments   
4      5   #5: @Madhurima wth u vc♥ ed "Good morning dea...  happy moments   
5      6   #6: @Jeinalís Ramos ed "Happy moments 🙏🏽 http...  happy moments   
6      7   #7: @Eric Rogers ed "@CaitlinUnruh The movie ...  happy moments   
7      8   #8: @Yanny Sandal ed "I don’t give two shits ...  happy moments   
8      9   #9: @daynada ed "my beautiful barbie bride an...  happy moments   
9     10   #10: @ß🌪 ed "Someone Great has been one of th...  happy moments   

  Feeling  
0   happy  
1   happy  
2   happy  
3   happy  
4   happy  
5   happy  
6   happy  
7   happy  
8   happy 

Building a search engine for emojis

1. Index the corpus

term - token

term-emoji incidence matrix - a sparse Boolean matrix, true where the emoji appears with the term
inverted index - dictionary of terms, each with a list of its appearances (emojis)

Building index:
1. collect documents (sentences with emojis)
2. tokenize the documents
3. preprocess the tokens: lowercase, clean up, keep English words
4. Index documents with inverted index
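
A minimal sketch of the four steps above on a few in-memory sentences. The helper names (split_emojis, build_index) and the sample sentences are made up for illustration, and the emoji detection is only a heuristic: it treats any character in the Unicode category "So" (Symbol, other) as an emoji.

```python
import re
import unicodedata
from collections import defaultdict


def split_emojis(text):
    """Return (words, emojis) for one sentence.

    Heuristic assumption: a character counts as an emoji when its
    Unicode category is 'So' (Symbol, other).
    """
    emojis = [ch for ch in text if unicodedata.category(ch) == "So"]
    # Tokenize and preprocess: lowercase, keep alphabetic runs only
    words = re.findall(r"[a-z]+", text.lower())
    return words, emojis


def build_index(documents):
    """Inverted index: term -> set of emojis seen in the same sentence."""
    index = defaultdict(set)
    for text in documents:
        words, emojis = split_emojis(text)
        for word in words:
            index[word].update(emojis)
    return index


docs = ["Happy moments 🙏", "so sad today 😢", "happy and sad 😂"]
index = build_index(docs)
print(index["happy"])
```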

Each emoji has a unique ID
Maintain a dictionary and postings
dictionary - emoji and a pointer to the documents it comes from
postings - inverted index entries [emoji, frequency in doc, [docID1, docID2]]
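
One possible shape for the postings described above, as a sketch. It assumes "frequency" means the total number of occurrences across all documents, and that each document has already been reduced to its list of emojis. build_postings and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_postings(doc_emojis):
    """doc_emojis: {doc_id: [emoji, ...]} -> {emoji: [frequency, [doc IDs]]}."""
    frequency = Counter()
    doc_ids = defaultdict(set)
    for doc_id, emojis in doc_emojis.items():
        for emoji in emojis:
            frequency[emoji] += 1        # count every occurrence
            doc_ids[emoji].add(doc_id)   # each document is recorded once
    # Doc IDs are kept sorted so postings lists can be merged later
    return {e: [frequency[e], sorted(doc_ids[e])] for e in frequency}


doc_emojis = {1: ["🙏"], 2: ["😂", "😂"], 3: ["🙏", "😂"]}
postings = build_postings(doc_emojis)
print(postings)
```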


Boolean query: Happy AND Sad
Answer set: rank emojis that appear with both happy and sad first, then those with only happy, then those with only sad, each group ordered by frequency.
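
A sketch of that ranking, assuming the index stores a count per emoji for each term (term -> {emoji: count}). The function name and the counts are invented for the example. Ties are broken by the emoji itself so the order is deterministic.

```python
def rank_and_query(term_a, term_b, index):
    """Rank emojis for 'term_a AND term_b'.

    Order: emojis seen with both terms, then only term_a, then only term_b.
    Within each group, higher frequency comes first.
    """
    a = index.get(term_a, {})
    b = index.get(term_b, {})
    both = sorted(a.keys() & b.keys(), key=lambda e: (-(a[e] + b[e]), e))
    only_a = sorted(a.keys() - b.keys(), key=lambda e: (-a[e], e))
    only_b = sorted(b.keys() - a.keys(), key=lambda e: (-b[e], e))
    return both + only_a + only_b


# Hypothetical counts, for illustration only
index = {
    "happy": {"😂": 5, "🙏": 3, "😊": 7},
    "sad": {"😂": 2, "😢": 9},
}
ranked = rank_and_query("happy", "sad", index)
print(ranked)
```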

Tokenization
- lowercasing might be bad for emoji search because names need to stay distinct from ordinary words (General Motors vs. general motors)
- stemming and lemmatization - Porter algorithm

The intersection algorithm for Happy AND Sad is O(n+m), where n and m are the lengths of the two sorted postings lists
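
A sketch of the linear-time merge. It relies on both postings lists being sorted in ascending order. The doc IDs are sample values.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(n + m) steps."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the list with the smaller doc ID
        else:
            j += 1
    return result


happy = [1, 3, 5, 8, 13]
sad = [2, 3, 8, 21]
print(intersect(happy, sad))
```

Each step advances at least one of the two pointers, so the loop runs at most n + m times.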

Tolerant retrieval
Wildcard searches like re*val need terms that start with re AND end with val. For those searches,
a k-gram index would help
phonetic correction
Levenshtein distance
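
A sketch of two of the ideas above: a 3-gram index for single-wildcard queries and a Levenshtein edit distance for spelling correction. The function names and the vocabulary are made up. The wildcard lookup assumes exactly one `*` in the pattern, and it post-filters the candidates because sharing all k-grams does not guarantee a match.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(term, k=3):
    """All k-grams of a term, with '$' marking the start and end."""
    padded = f"${term}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def build_kgram_index(vocabulary, k=3):
    """k-gram -> set of terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index


def wildcard_lookup(pattern, index, k=3):
    """Terms matching a pattern with exactly one '*' (assumption)."""
    prefix, suffix = pattern.split("*")
    grams = set()
    for part in ("$" + prefix, suffix + "$"):
        grams.update(part[i:i + k] for i in range(len(part) - k + 1))
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Pattern too short to yield any k-gram: fall back to all terms
        candidates = set().union(*index.values())
    # Post-filter against the original pattern
    return sorted(t for t in candidates if fnmatchcase(t, pattern))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


vocabulary = ["retrieval", "removal", "revival", "rival", "renewal", "real"]
kgram_index = build_kgram_index(vocabulary)
print(wildcard_lookup("re*val", kgram_index))
print(levenshtein("hapy", "happy"))
```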


Index compression
Possibly 75% less storage
Allow use of caching frequently used terms and 
Rule of 30 - the 30 most common words account for 30% of the tokens in text. 
In the postings list, the stored term (here, the emoji) takes the most space. Instead of storing the emoji itself, store a pointer to it
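
A sketch of the pointer idea: keep each emoji once in a table and store only its integer position in the postings. Names and data are hypothetical, and whether this actually saves space depends on how the integers and the table are encoded on disk.

```python
def compress_postings(term_to_emojis):
    """Replace emojis in postings with integer IDs into a shared table."""
    emoji_table = []   # ID -> emoji (each emoji stored exactly once)
    emoji_ids = {}     # emoji -> ID
    compressed = {}
    for term, emojis in term_to_emojis.items():
        ids = []
        for emoji in emojis:
            if emoji not in emoji_ids:
                emoji_ids[emoji] = len(emoji_table)
                emoji_table.append(emoji)
            ids.append(emoji_ids[emoji])
        compressed[term] = sorted(ids)
    return emoji_table, compressed


term_to_emojis = {"happy": ["😂", "🙏"], "sad": ["😢", "😂"]}
emoji_table, compressed = compress_postings(term_to_emojis)

# Decoding follows the pointers back into the table
decoded_sad = [emoji_table[i] for i in compressed["sad"]]
print(emoji_table, compressed, decoded_sad)
```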


Scoring, term weighting, vector space model 
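
A small sketch of one common weighting scheme for the vector space model: log-scaled term frequency times inverse document frequency, compared with cosine similarity. This is one variant among several, and the function names and sample documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists -> list of {term: tf-idf weight}."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({
            term: (1 + math.log10(count)) * math.log10(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


docs = [["happy", "moments"], ["happy", "birthday"], ["sad", "day"]]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```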


In [29]:
%pip install Unidecode nltk

Collecting Unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
     ---------------------------------------- 0.0/235.9 kB ? eta -:--:--
     -------------------------------------  235.5/235.9 kB 7.3 MB/s eta 0:00:01
     -------------------------------------- 235.9/235.9 kB 7.3 MB/s eta 0:00:00
Installing collected packages: Unidecode
Successfully installed Unidecode-1.3.6
Note: you may need to restart the kernel to use updated packages.





In [30]:
import os
import csv
import json
import nltk
import unidecode
import re
from nltk.corpus import words

# Download words corpus if not done before
nltk.download('words')

# Set of all English words
english_words = set(words.words())

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\carde\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


In [55]:
# Clean and tokenize a tweet
def clean_and_tokenize(text):
    # Transliterate to ASCII. This removes diacritics, but emoji characters
    # are not preserved as emoji either.
    text = unidecode.unidecode(text)
    # Remove URLs (http\S+ also covers https)
    text = re.sub(r"http\S+|www\S+", '', text)
    # Tokenize and drop punctuation-only tokens
    tokens = nltk.wordpunct_tokenize(text)
    tokens = [token for token in tokens if token.isalnum()]
    # Keep only tokens found in the NLTK English word list (case-sensitive match)
    tokens = [token for token in tokens if token in english_words]
    return tokens

def index_csv_file(csv_path, dictionary, postings):
    # dictionary: token -> set of IDs of the tweets containing it
    # postings:   token -> total number of occurrences across all tweets
    with open(csv_path, 'r', encoding='utf-8', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header
        for row in reader:
            # Columns: Sl no, Tweets, Search key, Feeling
            tweet_id, tweet, _, _ = row
            for token in clean_and_tokenize(tweet):
                dictionary.setdefault(token, set()).add(tweet_id)
                postings[token] = postings.get(token, 0) + 1


def index_data(csv_folder):
    dictionary = {}
    postings = {}
    for csv_file in os.listdir(csv_folder):
        if csv_file.endswith('.csv'):
            index_csv_file(os.path.join(csv_folder, csv_file), dictionary, postings)
    # Sets are not JSON-serializable, so write each one as a sorted list
    with open('dictionary.json', 'w', encoding='utf-8') as f:
        json.dump({k: sorted(v) for k, v in dictionary.items()}, f)
    with open('postings.json', 'w', encoding='utf-8') as f:
        json.dump(postings, f)

In [56]:
index_data('data')
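
A sketch of how the saved index could be queried, assuming dictionary.json has the shape written by index_data (token -> list of tweet IDs). load_index and search_and are hypothetical helpers, not part of the cells above.

```python
import json


def load_index(path="dictionary.json"):
    """Load token -> set of tweet IDs from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return {token: set(ids) for token, ids in json.load(f).items()}


def search_and(index, *terms):
    """IDs of tweets containing every one of the given terms."""
    sets = [index.get(term, set()) for term in terms]
    if not sets:
        return []
    return sorted(set.intersection(*sets))
```

For example, search_and(load_index(), "happy", "birthday") would return the IDs of tweets containing both tokens. Tokens are matched exactly as stored, so the query terms need the same casing as the indexed tokens.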

