The Modern Information Retrieval course @ Sharif University of Technology --- Spring 2020
P1: A complete information retrieval system over Persian Wikipedia pages. First, the data was normalized, tokenized, and stemmed. Second, the data was indexed: both positional indexing and bigram indexing (for spell checking) were implemented. Third, a vector space approach was employed using tf-idf weighting (both the ltn-lnn and ltc-lnc schemes). Phrasal search was also implemented. Several evaluation metrics (F1, Precision, Recall, MAP, and NDCG) were implemented for testing.
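To illustrate the positional indexing and phrasal search described above, here is a minimal sketch (the function names and data layout are illustrative assumptions, not the project's actual code):

```python
from collections import defaultdict

def build_positional_index(docs):
    # docs: {doc_id: [token, token, ...]} after normalization/tokenization/stemming
    index = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    # Return the set of doc_ids containing the phrase's tokens consecutively.
    terms = phrase.split()
    if any(t not in index for t in terms):
        return set()
    # Candidate docs must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    result = set()
    for doc_id in candidates:
        # Check that some occurrence of the first term is followed by the rest.
        for p in index[terms[0]][doc_id]:
            if all(p + i in index[t][doc_id] for i, t in enumerate(terms)):
                result.add(doc_id)
                break
    return result
```

A phrase matches only when the term positions line up consecutively, which is exactly what distinguishes positional indexing from a plain inverted index.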
P2: On a subset of the AG News dataset, several classification algorithms were implemented and tested. kNN and Naive Bayes were implemented from scratch on the vector space created by tf-idf (ntn weighting). kNN was implemented with both Cosine Similarity and Euclidean distance, and k values of 1, 3, and 5 were tested. Naive Bayes was used with smoothing, and its parameter search was done on a validation set. Next, nltk was used for stemming and stopword removal, and its effect was examined with evaluation metrics: macro-averaged F1, Precision, Recall, and the Confusion Matrix. Using sklearn, Random Forests and SVM were also investigated, with hyperparameter search done on a validation set. Finally, t-SNE was used to visualize other vectorization techniques such as Word2Vec, and k-Means was used to cluster the data; its results under different parameters were inspected.
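The from-scratch kNN over ntn tf-idf vectors can be sketched as follows (a minimal illustration under assumed names and data shapes, not the project's actual implementation):

```python
import math
from collections import Counter

def ntn_vector(tokens, df, n_docs):
    # ntn weighting: natural term frequency x idf, no normalization.
    tf = Counter(tokens)
    return {t: c * math.log10(n_docs / df[t]) for t, c in tf.items() if t in df}

def cosine(u, v):
    # u, v: sparse vectors as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(train, query_vec, k=3):
    # train: list of (vector, label); majority vote among the k most similar docs.
    neighbors = sorted(train, key=lambda p: cosine(p[0], query_vec), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Swapping `cosine` for a negated Euclidean distance gives the second kNN variant mentioned above, and k in {1, 3, 5} is just the slice size.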
P3: In the first part, a scrapy crawler crawls SemanticScholar to find papers, which are saved in JSON format. Second, they are indexed in an Elasticsearch server. In the third part, PageRank is calculated for the pages. Then, different search scenarios are implemented. In the last part, the HITS algorithm is implemented to rank the authors of the papers.
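The HITS power iteration used for ranking can be sketched like this (the graph representation and function name are illustrative assumptions; in P3 the nodes would be authors rather than generic pages):

```python
import math

def hits(graph, iters=50):
    # graph: {node: [nodes it links to]}; returns (hub, authority) score dicts.
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority update: sum of hub scores of nodes linking in.
        auth = {n: 0.0 for n in nodes}
        for u, outs in graph.items():
            for v in outs:
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub update: sum of authority scores of nodes linked out to.
        hub = {u: sum(auth[v] for v in graph.get(u, [])) for u in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth
```

Normalizing after each update keeps the scores bounded, and the mutual recursion (good hubs point to good authorities, and vice versa) converges to the principal eigenvectors of the link matrix.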