DSA Final Project Report

Team ID: team33 Members: B09902004, B09902011, B09902100

Refs: https://en.wikipedia.org/wiki/Shunting-yard_algorithm

Data Structure & Algorithms

`Find-Similar`

We use a hash table with tokens as keys to store mail ids that have that token. Mail ids are stored in both a linked list and an array, so that knowing both whether a given mail has this token can be done in $O(1)$ and getting all mail ids with this token can be done in $O(mails\ with\ this\ token)$. Building this hash table takes $O(mails \times token\ per\ mail)$.

As this hash table is being built, we also store the unique tokens of first $1000$ mails. Then the intersection table is built by going through the unique tokens of each mail, and check for any other mails which also have this token by going through the linked lists in the hash table. This way we can calculate the number of intersecting tokens of each mail pair.

We then build a $1000 \times 10000$ table (offline) to store $|A \cap B|$ using the data computed above, and put it in the source code. In the code submitted to the judge, we simply loop over the table and calculate the similarity to all other mails. We then can respond the query in $O(mails)$ time.

`Expression-Match`

We first convert the infix expression to postfix in $O(length)$ time, then evaluate the expression in $O(length \times mails)$ time.

Infix to postfix is done using the Shunting-Yard algorithm. To evaluate the the postfix expression, we use a stack to store boolean values. If we get a token, push the mail id array. If we get a operator, pop the values, apply operation, then push the result back.

`Group-Analyse`

In this part, we apply a disjoint set model to solve this type of query.

First, we analyse the users in given mails, and find a method to hash these users' name that any user won't collide with others.

In main.c, for each GROUP-ANALYSE query, we will initialize an empty disjoint set at the begining. Then, we can get the information of From and To from given mid. For each user that hasn't appeared in this disjoint set $n_g$ will add 1. If From and To are not in the same disjoint set, we will merge them together and $n_g$ will minus 1. However, there is a special case that From equals to To. In this type of mail, $n_g$ will not add 1 since this node cannot be a group. For each union operation, it will return the size of the disjoint set, and $l_g$ will be substitute if the return value is bigger than $l_g$. At the end, $n_g$ and $l_g$ will be submitted through api.h.

Cost Estimation of Queries

`Find-Similar`	`Expression-Match`	`Group-Analyse`
$O(mails)$	$O(length \times mails)$	$O(mails \times \alpha(mails))$

Scheduling Strategy

Because Find-Similar on average has $80$ times the reward for each answer ~~, which is perfectly balanced,~~ we decide to tackle this problem first. And since the source code file size is limited, we only want to answer queries with mid less than $1000$. This gave us a glorious score of about $600000$.

We tried only answering Expression-Match, but got merely $5000$ points and was far from finishing all the Expression-Match queries. Answering only Group-Analyse gets us $50000$ points but doesn't seems to finish all queries.

Therefore, we decide to finish all Find-Similar queies first, then answer Group-Analyse queries with above average rewards. This gives as $700000$ points in total.

Additional Notes

For Find-Similar, after some offline tests, it turns out that calculating all the $10000 \times 10000$ similarity in the program is simply unviable. So we figure we can try and precompute some part of it, and surprisingly it works great.

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
docs		docs
extracted_query_datas		extracted_query_datas
hash research		hash research
test_env		test_env
.gitignore		.gitignore
README.md		README.md
api.h		api.h
expr_match.c		expr_match.c
get_token.h		get_token.h
group_analyse.h		group_analyse.h
hash_table.h		hash_table.h
main_query1.c		main_query1.c
main_query2.c		main_query2.c
main_query3.c		main_query3.c
main_ultimate.c		main_ultimate.c
report.pdf		report.pdf
similar_table.h		similar_table.h
stupid_perfect_hash.c		stupid_perfect_hash.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DSA Final Project Report

Data Structure & Algorithms

`Find-Similar`

`Expression-Match`

`Group-Analyse`

Cost Estimation of Queries

Scheduling Strategy

Additional Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DSA Final Project Report

Data Structure & Algorithms

Find-Similar

Expression-Match

Group-Analyse

Cost Estimation of Queries

Scheduling Strategy

Additional Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`Find-Similar`

`Expression-Match`

`Group-Analyse`

Packages