## INPUT PIPELINE

Assigning the proper tag to each question is crucial for the model to perform well. Since each sample carries tags for both the data structures and the algorithms used to solve it, this overlap has to be separated. <br>

It is also essential that no sample has more than one target tag. Whenever targets are created, verify that the target tags do not overlap.<br>
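
One way to run that check is sketched below. The helper `single_target` is hypothetical (it is not part of this notebook), and the sketch assumes that a sample's tags arrive as a plain list of strings.

```python
def single_target(tags, targets):
    """Return the one target tag found in `tags`, or None if there are zero or several."""
    matches = set(tags) & set(targets)
    return next(iter(matches)) if len(matches) == 1 else None


targets = {"array", "string", "graph"}
print(single_target(["array", "dp", "No tag"], targets))  # array
print(single_target(["array", "graph"], targets))         # None (two targets overlap)
print(single_target(["dp", "greedy"], targets))           # None (no target present)
```

A sample for which the helper returns `None` would be dropped rather than given an arbitrary target.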

This notebook serves as the **input pipeline** for the question tagging notebook. Every variation of the original dataset, with whichever target tags are to be fed into the model, is prepared here.

In [1]:
import pandas as pd
df = pd.read_csv("../text_cleaning_and_processing_1.csv")

In [2]:
data_structures = {
    "array", "string", "graph"
}

In [3]:
from collections import defaultdict

In [4]:
# Count how often two data-structure tags appear together on the same question.
pair_freqs = defaultdict(int)
for row in range(df.shape[0]):
    tags = list(df.iloc[row, 1:8].values)  # columns 1 to 7 are read as the tags
    for i in range(len(tags)):
        # Scanning stops at the first tag that is not a data structure.
        if tags[i] == "No tag" or tags[i] not in data_structures:
            break
        for j in range(i + 1, len(tags)):
            if tags[j] == "No tag" or tags[j] not in data_structures:
                break
            # Sort the pair so that (a, b) and (b, a) share one key.
            pair_freqs[tuple(sorted((tags[i], tags[j])))] += 1

pair_freqs

defaultdict(int,
            {('array', 'array'): 48,
             ('array', 'string'): 57,
             ('array', 'graph'): 106,
             ('graph', 'graph'): 82,
             ('graph', 'string'): 7,
             ('string', 'string'): 6})
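
A co-occurrence count of this kind can also be sketched with `itertools.combinations` and `collections.Counter`. This is a variant, not the cell above: it counts every pair of distinct data-structure tags in a row regardless of position, so its totals would not necessarily match the output shown. The function name `count_pairs` and the sample rows are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def count_pairs(rows, targets):
    """Count co-occurring pairs of distinct target tags across rows of tags."""
    counts = Counter()
    for tags in rows:
        # De-duplicate and sort so each pair always has the same key.
        present = sorted(set(tags) & set(targets))
        counts.update(combinations(present, 2))
    return counts


rows = [
    ["array", "graph", "dp"],
    ["string", "No tag"],
    ["graph", "array", "string"],
]
print(count_pairs(rows, {"array", "string", "graph"}))
```

Any pair with a non-zero count marks samples that carry more than one candidate target.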

In [6]:
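drop_inds = []
qns = df[["stop_words_removed_qns", "tag1"]]
for i in range(df.shape[0]):
    tags = set(df.iloc[i, 1:8].values)
    # Keep a row only if exactly one data structure is among its tags.
    if len(data_structures.intersection(tags)) != 1:
        drop_inds.append(i)
    else:
        # The single matching data structure becomes the target tag.
        qns.iat[i, 1] = list(data_structures.intersection(tags))[0]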

print(len(drop_inds))

7293


In [7]:
# Reassign instead of using inplace=True, which raised a
# SettingWithCopyWarning because qns is a column subset of df.
qns = qns.drop(drop_inds, axis=0).reset_index(drop=True)


In [8]:
qns["tag1"].value_counts()

graph     981
array     944
string    843
Name: tag1, dtype: int64

In [9]:
qns.to_csv("input_to_tagging.csv", index=False)
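
Before the file is handed to the tagging notebook, the export can be sanity-checked. The sketch below rests on a few assumptions: `check_targets` is a hypothetical helper, and it reads from any file-like object so that it can be tried on an in-memory sample. The column name `tag1` matches the frame written above.

```python
import csv
import io
from collections import Counter


def check_targets(handle, allowed, column="tag1"):
    """Count target tags in a CSV and raise if any value is outside `allowed`."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if row[column] not in allowed:
            raise ValueError(f"unexpected target tag: {row[column]!r}")
        counts[row[column]] += 1
    return counts


# Made-up two-row sample standing in for the exported file.
sample = io.StringIO(
    "stop_words_removed_qns,tag1\n"
    "find shortest path,graph\n"
    "reverse words,string\n"
)
print(check_targets(sample, {"array", "string", "graph"}))
```

To check the real export, pass `open("input_to_tagging.csv", newline="")` in place of the in-memory sample.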