# Search Engine

This is a small experiment that searches text stored in a pandas DataFrame. The goal is to return relevant results in a reasonable time.

## The Solution

### Summary
The program builds a TF-IDF representation of every row in the data frame, scoring every word (or n-gram) in the text. This produces a sparse matrix with around 1 million rows and n columns, where n is the number of distinct n-grams found in the entire df. It then builds the equivalent TF-IDF vector for a given query and computes the cosine similarity between that vector and every row of the matrix. This cosine similarity is the relevance score for each row in the df. Finally, the program returns the 10 most relevant news headlines in the df for the query.
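As a rough illustration of the idea, the sketch below scores three made-up headlines against a query. It uses sklearn's `TfidfVectorizer` and `cosine_similarity` for brevity; both the headlines and that shortcut are assumptions of this example, not the exact pipeline of the class described in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the real headlines
headlines = [
    "global warming talks stall",
    "cricket team wins series",
    "scientists warn on global warming",
]

# Learn the vocabulary and IDF weights from the corpus
vectoriser = TfidfVectorizer()
doc_matrix = vectoriser.fit_transform(headlines)

# Represent the query in the same vector space (one sparse row)
query_matrix = vectoriser.transform(["global warming"])

# One cosine similarity per headline; higher means more relevant
scores = cosine_similarity(doc_matrix, query_matrix).ravel()
best = int(scores.argmax())
print(headlines[best], scores)
```

The headline that shares no words with the query gets a score of exactly zero, which is why such rows can be dropped from the results.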

### Implementation
The code consists of a single class called 'SearchEngine'. This class performs two main tasks:
1. Prepare the data set and build a 'model' based on the entire df (method: fit).
2. Given a query, return the most relevant rows in the df that match this query (method: get_results).

### Method *fit*
This method will perform the following actions: 

1. Prepare the data: It takes the text column (by default 'name'), converts it to lower case, applies a small set of character and word replacements, removes stop words and optionally stems the words using the Porter stemmer (the Porter2 variant from the `stemming` package).

2. Calculate a TF-IDF matrix: It calculates the TF-IDF matrix for the entire catalog using n-grams of several lengths, by default 1 to 3 words. More information on TF-IDF (Term frequency – Inverse document frequency) can be found on Wikipedia: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. More information about n-grams can be found here: https://en.wikipedia.org/wiki/N-gram
For this task the program uses the sklearn package.

3. Save the objects: It saves the sparse matrix for the catalog and the fitted sklearn models as attributes of the class.
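Step 2 can be sketched on a toy corpus. The two documents below are made up for illustration; the sketch only shows which n-grams a `CountVectorizer` with `ngram_range=(1, 3)` extracts and that `TfidfTransformer` L2-normalises each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical two-document corpus
docs = ["global warming talks", "global warming protest"]

# Count unigrams, bigrams and trigrams
vectoriser = CountVectorizer(ngram_range=(1, 3))
counts = vectoriser.fit_transform(docs)

# The learned vocabulary maps each n-gram to a column index
ngrams = sorted(vectoriser.vocabulary_)
print(ngrams)

# Re-weight the counts with TF-IDF (default norm is 'l2')
tfidf = TfidfTransformer().fit_transform(counts)

# Each row has unit Euclidean length after the default normalisation
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(counts.shape, row_norms)
```

The number of columns grows quickly with the n-gram range, which is the main cost of using longer n-grams on a large catalog.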


### Method *get_results*
This method receives a single query as a parameter and searches the catalog, returning up to 10 results ordered by ranking score. The ranking score is obtained as follows:

1. Preprocess the query using the same transformations used in the 'fit' method for the entire df (convert the text to lower case, remove stop words, stem words if enabled).

2. Obtain a TF-IDF representation of the query. This is a sparse matrix with only 1 row.

3. Calculate the cosine similarity between the TF-IDF representation of the query and the entire catalog matrix. The resulting vector holds one ranking score per row of the catalog.

4. Drop the rows with a score of zero, sort the remaining rows by ranking score (in descending order) and return the top 10 most relevant rows for the query.
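The filtering and sorting in step 4 can be sketched with plain NumPy. The helper name `top_k` and the scores are hypothetical and only illustrate the idea; the class below does the equivalent work on a DataFrame.

```python
import numpy as np

def top_k(scores, k=10):
    """Return the indices of the k highest positive scores, best first."""
    scores = np.asarray(scores, dtype=float)
    # Sort indices by descending score
    order = np.argsort(-scores, kind="stable")
    # Keep only rows that share at least one term with the query
    order = order[scores[order] > 0]
    return order[:k]

# Hypothetical ranking scores for five rows
scores = np.array([0.0, 0.42, 0.91, 0.0, 0.13])
best = top_k(scores, k=2)
print(best)
```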

## Code

In [None]:
# -*- coding: utf-8 -*-
import copy
import pandas as pd
import numpy as np
import sys
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from stemming.porter2 import stem

data_path = '../input/abcnews-date-text.csv'
        
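class SearchEngine:
    # Substitutions applied to both the documents and the query before vectorising
    replace_words = {'&': '_and_', 'unknown': ' '}

    def __init__(self, text_column='name', id_column='id'):
        # Column holding the text to search and column used to identify each row
        self.text_column = text_column
        self.id_column = id_column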
    
    def fit(self, df, ngram_range=(1,3), perform_stem=True):
        self.df = df
        self.perform_stem = perform_stem
        doc_df = self.preprocess(df)
        stop_words = stopwords.words('english')
        # Count every n-gram in the range, skipping English stop words
        self.vectoriser = CountVectorizer(stop_words=stop_words, ngram_range=ngram_range)
        train_vectorised = self.vectoriser.fit_transform(doc_df)
        # Re-weight the raw counts with TF-IDF; rows are L2-normalised by default
        self.transformer = TfidfTransformer()
        self.fitted_tfidf = self.transformer.fit_transform(train_vectorised)

    def preprocess(self, df):
        result = df[self.text_column]
        # np.char exposes the same vectorised string helpers as np.core.defchararray
        result = np.char.lower(result.values.astype(str))
        for word in self.replace_words:
            result = np.char.replace(result, word, self.replace_words[word])
        if self.perform_stem:
            result = self.stem_array(result)
        return result

    def preprocess_query(self, query):
        result = query.lower()
        for word in self.replace_words:
            result = result.replace(word, self.replace_words[word])
        if self.perform_stem:
            result = self.stem_document(result)
        return result

    def stem_array(self, v):
        result = np.array([self.stem_document(document) for document in v])
        return result
    
    def stem_document(self, text):
        result = [stem(word) for word in text.split(" ")]
        result = ' '.join(result)
        return result
    
    def get_results(self, query, max_rows=10):
        score = self.get_score(query)
        results_df = copy.deepcopy(self.df)
        results_df['ranking_score'] = score
        results_df = results_df.loc[score>0]
        results_df = results_df.iloc[np.argsort(-results_df['ranking_score'].values)]
        results_df = results_df.head(max_rows)
        self.print_results(results_df, query)
        return results_df        
        
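    def get_score(self, query):
        # Apply the same preprocessing to the query as was applied to the documents
        query = self.preprocess_query(query)
        query_vectorised = self.vectoriser.transform([query])
        query_tfidf = self.transformer.transform(query_vectorised)
        # TF-IDF rows are L2-normalised, so the linear kernel (dot product) equals the cosine similarity
        cosine_similarities = linear_kernel(self.fitted_tfidf, query_tfidf).flatten()
        return cosine_similarities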
    
    def print_results(self, df, query):
        print("---------")
        print('results for "{}"'.format(query))
        for i, row in df.iterrows():
            print('{}, {}, {}'.format(
                    row['ranking_score'],
                    row[self.id_column],
                    row[self.text_column]))
    
def load_data():
    df = pd.read_csv(data_path)
    return df

### Examples
Here you can play with different queries, use more or fewer n-grams, and enable stemming (which increases the processing time).

In [None]:
queries = [
    'global warming',
    'how can I win kaggle competitions from my cell phone',
    'what is the meaning of life',
    'donald trump riding an skate board',
    'some people like weird things, like pizza with pineapple',
    'I dont like cricket, I love it'
    ]

df = load_data()
model = SearchEngine(text_column='headline_text',  id_column='publish_date')
model.fit(df, perform_stem=False)


In [None]:
# Getting results 
for query in queries:
    model.get_results(query)