-
Notifications
You must be signed in to change notification settings - Fork 0
/
a2_sql-queries.tex
230 lines (197 loc) · 8.61 KB
/
a2_sql-queries.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
\subsection {Number of comments by a specific user}\label{subsec:comments-by-user}
\begin{verbatim}
SELECT COUNT(*)
FROM schema.comments_constrained
WHERE author='username';
\end{verbatim}
All usernames can be queried from the schema.redditors table.
Showing one example of running this code as a stored function in figure~\ref{fig:number-of-comments-figure} below.
\begin{figure}[h]
\includegraphics[width=400px] {number_of_posts}\label{fig:number-of-comments-figure}
\caption {The query was run in about 44ms}
\end{figure}
\subsection{Comments per day in subreddit}\label{subsec:comments-per-day}
First, I define a SQL function to translate the subreddit\_id to subreddit\_name for prettier presentation.
I also define basically the same function but in the other direction if it should be needed.
\begin{verbatim}
CREATE FUNCTION
get_subreddit_name(parameter_subreddit_id VARCHAR(10))
RETURNS VARCHAR(24)
READS SQL DATA
BEGIN
RETURN
(SELECT subreddit_name
FROM schema.subreddits_constrained
WHERE subreddit_id = parameter_subreddit_id);
END;
\end{verbatim}
A function to translate epoch time to Dates is also defined.
\begin{verbatim}
CREATE FUNCTION
epoch_to_date(parameter_epoch INT)
RETURNS INT
READS SQL DATA
BEGIN
RETURN (DATE(FROM_UNIXTIME(parameter_epoch)));
END;
\end{verbatim}
To simplify getting the desired data I also create a view in the database using the queries below.
\begin{verbatim}
CREATE VIEW schema.comment_date_view AS
SELECT
id,
get_subreddit_name(schema.comments_constrained.subreddit_id)
AS subreddit_name,
epoch_to_date(schema.comments_constrained.created_utc)
AS posted_date
FROM schema.comments_constrained;
\end{verbatim}
With this in place, I define a stored procedure:
\begin{verbatim}
CREATE FUNCTION posts_per_day (
p_subreddit_name VARCHAR(24)
)
RETURNS FLOAT(10, 3)
READS SQL DATA
BEGIN
RETURN
(SELECT AVG(posts_in_day) AS avg_per_day FROM (
SELECT
COUNT(posted_date) AS posts_in_day
FROM test_db.comment_date_view
WHERE subreddit_name = p_subreddit_name
GROUP BY posted_date)
AS posts_per_day);
END;
\end{verbatim}
Selecting from this function as below will then give average posts per day for the passed subreddit name.
\begin{verbatim}
SELECT posts_per_day ('subreddit_name');
\end{verbatim}
The results of executing this are shown in figure~\ref{fig:sub_ppd}.
\begin{figure}[h]
\includegraphics[width=400px] {function_avg_p_day}\label{fig:sub_ppd}
\caption {The query was run in about 4min 20s}
\end{figure}
\subsection{How many comments include the word 'lol'}\label{subsec:how-many-comments-include}
Start by defining a procedure that stores the number of comments with the desired word in the out parameter.
\begin{verbatim}
CREATE FUNCTION posts_containing (
p_term VARCHAR(50)
)
RETURNS INT
READS SQL DATA
BEGIN
RETURN (SELECT COUNT(*) FROM
(SELECT * FROM test_db.comments_constrained
WHERE body LIKE CONCAT('%', p_term, '%')) as contains_term);
END;
\end{verbatim}
Then it can easily be called as
\begin{verbatim}
SELECT posts_containing ('lol');
\end{verbatim}
The execution results are shown in figure~\ref{fig:sub_posts_containing}.
\begin{figure}[h]
\includegraphics[width=400px] {posts_containing}\label{fig:sub_posts_containing}
\caption {The query was run in about 1min 10s}
\end{figure}
Whether the method should demand spaces around the search phrase is debatable.
In the case of lol, the word is oftentimes used as one part of a longer word, so it would miss some counts there.
On the other hand, it is likely also including some entries that wouldn't fit what the user of the procedure likely intended.
Due to this, I think letting the caller of the method put in whitespaces themselves if they want stricter results is appropriate, and I will leave it as is.
\subsection{Commenters of thread T have participated in what subreddits?}\label{subsec:what-subreddits}
In MariaDB it is not possible to return multiple values from stored procedures or functions, so the stored procedure cannot return the results it found for further processing.
\begin{verbatim}
CREATE PROCEDURE subreddits_from_link_id (
IN p_link_id VARCHAR(10)
)
READS SQL DATA
BEGIN
SELECT DISTINCT get_subreddit_name(subreddit_id) AS subreddit
FROM test_db.comments_constrained
WHERE author IN
(SELECT DISTINCT author
FROM test_db.comments_constrained
WHERE link_id = p_link_id
AND NOT author='[deleted]'
);
END;
\end{verbatim}
The results from me running this procedure is shown in figure~\ref{fig:subreddits_from_link} below.
\begin{figure}[h]
\includegraphics[width=400px] {subreddits_from_link}\label{fig:subreddits_from_link}
\caption {The query was run in about 9s}
\end{figure}
\subsection{Which users have the highest and lowest scores?} \label{subsec:user-high-low-scores}
These two question are basically the same, the only difference is if it is ORDER BY \ldots ASC or ORDER BY \ldots DESC.
So I will create a view that queries the total scores of each user, and then it can be simply ordered by the total\_score column.
Querying this view is very slow \lparen see fig~\ref{fig:user_total_score}\rparen as the GROUP BY author call is very expensive.
It would likely be beneficial to store this as its own table to avoid having to re-compute it on every call.
It would obviously require more storage space as what is essentially more copies of data need to be stored.
But, considering that the computation was taking over an hour for me, it is likely worth the cost.
Adding a trigger rule in the database that updates the values when a new value is inserted after this extra table is set up would help with keeping the database easy to use so the user won't forget updating total scores.
\begin{figure}[h]
\includegraphics[width=400px] {TODO}\label{fig:user_total_score}
\caption {The query was run in about 1h 20min}
\end{figure}
\begin{verbatim}
CREATE user_total_scores AS
SELECT author, SUM(score) AS total_score
FROM comments_constrained
WHERE NOT author='[deleted]'
GROUP BY author;
\end{verbatim}
\subsection{Which subreddits have highest and lowest scores?} \label{subsec:subreddit-high-low-scores}
This is basically the same as the previous question about user high scores and lowest scores, but querying for subreddits instead.
Showing a stored procedure below.
\begin{verbatim}
CREATE PROCEDURE subreddit_lowscores (
)
READS SQL DATA
BEGIN
SELECT
get_subreddit_name(subreddit_id) AS subreddit,
SUM(score) AS total_score
FROM comments_constrained
GROUP BY subreddit
ORDER BY total_score;
END;
\end{verbatim}
\subsection{All users someone might have interacted with} \label{subsec:potential-interactions}
This means getting all links that a user has posted in, getting all comments on those links and selecting distinct authors.
\begin{verbatim}
CREATE PROCEDURE get_interactions(
IN p_redditor VARCHAR(20)
)
READS SQL DATA
BEGIN
SELECT DISTINCT author
FROM test_db.comments_constrained
WHERE link_id IN
(
SELECT link_id
FROM test_db.comments_constrained
WHERE author=p_redditor
)
AND author NOT IN ('[deleted]', p_redditor);
END;
\end{verbatim}
\subsection{All users who have only posted in one subreddit} \label{subsec:only-one-subreddit}
This procedure gets the author and subreddit\_id columns from the comments table, and uses select distinct to
then be able to count the number of rows grouped by author.
Then getting those with only one subreddit is as easy as querying for where count=1.
\begin{verbatim}
CREATE PROCEDURE only_n_subreddits(IN p_n INT)
READS SQL DATA
BEGIN
SELECT *
FROM (SELECT author, COUNT(*) AS `subreddit_count` FROM
(SELECT DISTINCT author, subreddit_id
FROM comments_constrained
WHERE NOT author='[deleted]')
AS author_reddits
GROUP BY author) AS `only_one_reddit`
WHERE subreddit_count<=p_n;
END
\end{verbatim}