## A brief overview of the data structures

### For all purposes
```python
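# one element per token; 'lang' holds one of the codes listed below
element = {
    'word': 'bruh',
    'lang': 'e',
    'pos': 'X'
}
sentence = [element]  # a sentence is a list of elements
data = [sentence]     # the data is a list of sentences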
```

- word: the token (the "word") itself
- lang: e, h, b, t, n
    - e: english
    - h: hindi
    - b: bengali
    - t: tamil
    - n: none of the above (everything else)
- pos: the part-of-speech tag
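
To make the layout concrete, here is a tiny hand-written sample. The words and tags in `toy_data` are made up for illustration (the `pos` values in particular are assumed, not taken from the dataset's tag set); only the nesting and the three keys follow the description above.

```python
# Hypothetical toy data: a list of sentences, each a list of element dicts.
toy_data = [
    [
        {'word': 'yeh', 'lang': 'h', 'pos': 'DET'},
        {'word': 'movie', 'lang': 'e', 'pos': 'NOUN'},
        {'word': 'mast', 'lang': 'h', 'pos': 'ADJ'},
        {'word': 'hai', 'lang': 'h', 'pos': 'VERB'},
    ],
    [
        {'word': 'nice', 'lang': 'e', 'pos': 'ADJ'},
        {'word': 'one', 'lang': 'e', 'pos': 'NOUN'},
    ],
]

# Pull a single field out of every element of the first sentence.
first_sentence_langs = [element['lang'] for element in toy_data[0]]
print(first_sentence_langs)  # ['h', 'e', 'h', 'h']
```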



In [1]:
import regex
import numpy as np
import matplotlib.pyplot as plt
import json

In [7]:
with open('hi_en_dataset.json') as f:
    json_data = json.load(f)

In [8]:
data = []
for key, value in json_data.items():
    data.append(value['sentence'])
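
The loop above implies that the JSON file is a dictionary whose values each carry a `'sentence'` list. A minimal sketch of that assumed layout, using an inline string instead of the real file (the ids `"0"` and `"1"` and all field values are hypothetical):

```python
import json

# Hypothetical stand-in for the file contents; only the shape
# (id -> {'sentence': [elements]}) is inferred from the loading loop.
toy_json = json.loads('''
{
  "0": {"sentence": [{"word": "nice", "lang": "e", "pos": "ADJ"}]},
  "1": {"sentence": [{"word": "hai", "lang": "h", "pos": "VERB"}]}
}
''')

# Same extraction as the loop above, written as a comprehension.
toy_sentences = [value['sentence'] for value in toy_json.values()]
print(len(toy_sentences), toy_sentences[0][0]['word'])  # 2 nice
```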

In [18]:
def grammify(data, n, item):
    """
    Builds the gram_data list.
    inputs:
        data: a list of sentences; each sentence is a list of dicts
            (elements) holding 'word', 'lang' and 'pos'
        n: the n in n-gram
        item: the dictionary key to build the grams from
    outputs:
        gram_data:
            a list of (frequency, gram) tuples, sorted by
            descending frequency. each gram is the n values
            of element[item] joined by single spaces
    """
    counts = {}
    for sentence in data:
        # slide a window of n elements over the sentence
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i+n]
            key = " ".join(element[item] for element in window)
            counts[key] = counts.get(key, 0) + 1

    gram_data = [(v, k) for k, v in counts.items()]
    return sorted(gram_data, reverse=True)
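
For comparison, the same sliding-window count can be sketched with `collections.Counter`. This is an alternative under assumptions, not the notebook's function: `count_ngrams` and `toy_data` are hypothetical names, and `most_common` returns `(gram, count)` pairs, the reverse of the `(count, gram)` order that `grammify` produces.

```python
from collections import Counter

def count_ngrams(data, n, item):
    """Count n-grams of element[item] over a list of sentences (hypothetical helper)."""
    counts = Counter()
    for sentence in data:
        values = [element[item] for element in sentence]
        # slide a window of n values over the sentence
        for i in range(len(values) - n + 1):
            counts[" ".join(values[i:i + n])] += 1
    return counts

# Made-up language tags for two short sentences.
toy_data = [
    [{'lang': 'h'}, {'lang': 'e'}, {'lang': 'h'}, {'lang': 'h'}],
    [{'lang': 'e'}, {'lang': 'h'}],
]

lang_bigrams = count_ngrams(toy_data, 2, 'lang')
print(lang_bigrams.most_common(1))  # [('e h', 2)]
```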

In [19]:
def plot_gram_data(gram_data, language, gram, item):
    """
    Plots the gram data.
    inputs:
        gram_data:
            a list of tuples of (value, key) pairs.
            value as the frequency of the gram,
            given by the key
        
        language:
            the name of the data in concern
        gram:
            the n of the data in concern
        item:
            the dictionary key the grams were built from
    """
    n = len(gram_data)
    labels = [gram for _, gram in gram_data]
    heights = [count for count, _ in gram_data]
    plt.figure(figsize=(20,10))
    plt.bar(x=np.arange(n), height=heights, tick_label=labels)
    plt.title(f"{language}: for n={gram} and parameter={item}")
    plt.xticks(rotation=90)
    plt.show()

In [58]:
def find_gram(data, query, item, is_regex):
    """
    Finds the requested query in the data. 
    Could be plain search or regex search.
    Helpful for getting samples for analysis.
    
    inputs:
        data:
            the data where the query is searched
        query:
            the request on what to search. space separated for bigger n
        item:
            the dictionary key to search in. could be 'word', 'pos', 'lang'
        is_regex: 
            if the query is regex or not
    outputs:
        returns a list of all the sentences containing the query
    """
    # stores the final results of the data
    query_results = []

    focused_sentences = []
    for sentence in data:
        # join the chosen field of every element into one searchable string
        extracted_string = " ".join(element[item] for element in sentence)
        focused_sentences.append((sentence, extracted_string))
    print("done reformatting the data")

    if is_regex:
        pattern = regex.compile(query)
        query_results = [
            sentence for sentence, extracted_string in focused_sentences 
            if pattern.search(extracted_string)
        ]
    else:
        query_results = [
            sentence for sentence, extracted_string in focused_sentences
            if query in extracted_string
        ]
    print(f"Found {len(query_results)} matches.")
    return query_results
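
One thing worth keeping in mind when picking samples: a plain search is a substring test on the joined string, so a query like `of` also hits words such as `office`, while a regex with word boundaries does not. A small sketch of the difference, assuming the standard-library `re` module instead of `regex`; `sentences_matching` and `toy_data` are hypothetical names used only here.

```python
import re

def sentences_matching(data, query, item, is_regex):
    """Return the sentences whose joined item values contain the query (hypothetical helper)."""
    pattern = re.compile(query) if is_regex else None
    results = []
    for sentence in data:
        text = " ".join(element[item] for element in sentence)
        if is_regex:
            found = pattern.search(text) is not None
        else:
            found = query in text  # substring test, not whole-word
        if found:
            results.append(sentence)
    return results

# Made-up sentences: one contains the word "of", the other only "office".
toy_data = [
    [{'word': 'kind'}, {'word': 'of'}, {'word': 'fun'}],
    [{'word': 'office'}, {'word': 'hai'}],
]

plain_hits = sentences_matching(toy_data, "of", "word", is_regex=False)
whole_word_hits = sentences_matching(toy_data, r"\bof\b", "word", is_regex=True)
print(len(plain_hits), len(whole_word_hits))  # 2 1
```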

In [77]:
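def get_extracted_data(data, item):
    """
    Flattens each sentence into one numbered string.
    inputs:
        data: a list of sentences
        item: the dictionary key to extract, e.g. 'word'
    outputs:
        a list of strings of the form "<i>. <values joined by spaces>",
        where i is the sentence's position in data
    """
    extracted_data = []
    for i, sentence in enumerate(data):
        values = " ".join(element[item] for element in sentence)
        extracted_data.append(f"{i}. {values}")
    return extracted_data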

In [78]:
# gram = 2
# item = 'pos'
# plot_gram_data(grammify(data, gram, item)[:20], language='codemix', gram=gram, item=item)

In [79]:
# find_gram(data, query="what", item="word", is_regex=False)

In [80]:
get_extracted_data(find_gram(data, query="(of)", item="word", is_regex=True), "word")

done reformatting the data
Found 178 matches.


he insti I feel bad for you dogs come on open your eyes you can do much better than a random slut who takes advantage of your tharkiness Rise people Rise',
 '110. Such men deserve the kind of women they get',
 '111. Back in the in my 2nd yr 1st sem we put a spoof ad in posing as a modeling agency and provided a mobile number and email address Got emails phone numbers sizzling portfolio photos etc from wannabe models and hotties all over Bombay And everyone always wondered how I was never single',
 '112. To all those calling this fake this was a true story which happened in Yes we had cellphones and yes email was popular There was no high speed internet except for a few labs You guys are sounding like a bunch of unimaginative slackers and lazy wannabes Our reality was so far from yours that we imagine how you guys live on campus any more In fact I truly believe that all you kids do these days is watch porn stalk chicks on FB and jerk off alone Ed',
 '113. My best moments at IITB were wh