# Part 2: QUERYING

## Writing queries

In [1]:
import pymysql

# Connect to the database
# PyMySQL takes the port as a separate integer argument, not as part of the host
connection = pymysql.connect(host='localhost',
                             port=8000,
                             user='user',
                             password='password',
                             database='database',
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)

### Questions

**1 - Users with highest scores over time**
- Implement a query that returns the users with the highest aggregate scores (over all their posts) for the whole dataset. Restrict the results to users whose aggregate score is above 10,000 points, sorted in descending order. Return two columns: `username` and `aggr_scores`.


In [2]:
def users_with_best_scores():
    with connection.cursor() as cur:
        q = """
            SELECT user_name AS username,sum(score) AS aggr_scores
            FROM post
            GROUP BY user_name
            HAVING aggr_scores > 10000
            ORDER BY aggr_scores DESC
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [3]:
users_with_best_scores()

[{'username': 'DaFunkJunkie', 'aggr_scores': Decimal('250374')},
 {'username': 'None', 'aggr_scores': Decimal('218846')},
 {'username': 'SUPERGUESSOUS', 'aggr_scores': Decimal('211610')},
 {'username': 'jigsawmap', 'aggr_scores': Decimal('210823')},
 {'username': 'hildebrand_rarity', 'aggr_scores': Decimal('122463')},
 {'username': 'iSlingShlong', 'aggr_scores': Decimal('118595')},
 {'username': 'tefunka', 'aggr_scores': Decimal('79560')},
 {'username': 'chrisdh79', 'aggr_scores': Decimal('60373')},
 {'username': 'JLBesq1981', 'aggr_scores': Decimal('58235')},
 {'username': 'rspix000', 'aggr_scores': Decimal('57106')},
 {'username': 'Wagamaga', 'aggr_scores': Decimal('47988')},
 {'username': 'stem12345679', 'aggr_scores': Decimal('47455')},
 {'username': 'TheJeck', 'aggr_scores': Decimal('26057')},
 {'username': 'TheGamerDanYT', 'aggr_scores': Decimal('25357')},
 {'username': 'TrumpSharted', 'aggr_scores': Decimal('21153')},
 {'username': 'NotsoPG', 'aggr_scores': Decimal('18518')},
 {
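
The 10,000-point cutoff can also be passed as a bound parameter instead of being written into the SQL text. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the MySQL server and a few invented rows. `users_above` and `demo` are hypothetical names, and SQLite's `?` placeholder corresponds to `%s` in PyMySQL.

```python
import sqlite3


def users_above(conn, threshold):
    """Return (username, aggr_scores) rows whose summed score exceeds threshold."""
    q = """
        SELECT user_name AS username, SUM(score) AS aggr_scores
        FROM post
        GROUP BY user_name
        HAVING SUM(score) > ?
        ORDER BY aggr_scores DESC
    """
    return conn.execute(q, (threshold,)).fetchall()


# Invented rows, only to exercise the query
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE post (user_name TEXT, score INTEGER)")
demo.executemany(
    "INSERT INTO post VALUES (?, ?)",
    [("alice", 7000), ("alice", 4000), ("bob", 9000), ("carol", 12000)],
)
top_scores = users_above(demo, 10000)
```

Here `bob` is filtered out because his single post stays below the cutoff, while `alice` qualifies only through the sum of her two posts.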

**2 - Most active users**
- Implement a query that returns the top 10 users in terms of the number of subreddits they have posted in. Because several users have posted in the same number of subreddits, order the results first by the number of active subreddits per user and then alphabetically by username. The alphabetical order should place digits first, then A-Z (irrespective of case). The query should return two columns: `username` and `numb_subs`.

In [4]:
def most_active_users():
    with connection.cursor() as cur:
        q = """
            SELECT username,count(active_sub) AS numb_subs 
            FROM (SELECT user_name AS username, subreddit_name AS active_sub FROM post
            GROUP BY username, active_sub) AS TEMP
            GROUP BY username
            ORDER BY numb_subs DESC,username ASC LIMIT 10
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [5]:
most_active_users()

[{'username': 'Kinglens311', 'numb_subs': 15},
 {'username': 'AutoModerator', 'numb_subs': 9},
 {'username': 'Ford456fgfd', 'numb_subs': 8},
 {'username': 'accappatoiviola', 'numb_subs': 6},
 {'username': 'dunkin1980', 'numb_subs': 6},
 {'username': 'giveawayguy99', 'numb_subs': 6},
 {'username': 'Kindy0', 'numb_subs': 6},
 {'username': 'PerfctSmile', 'numb_subs': 6},
 {'username': 'BrightscapesArt', 'numb_subs': 5},
 {'username': 'checkmak01', 'numb_subs': 5}]
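
The same ranking can be written without the inner `GROUP BY` by counting distinct subreddits per user. The sketch below assumes an in-memory SQLite database with invented rows as a stand-in for MySQL. SQLite compares text case-sensitively by default, so the tie-break uses `COLLATE NOCASE`; in MySQL the behaviour depends on the column's collation, and a case-insensitive (`_ci`) collation would make that clause unnecessary.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE post (user_name TEXT, subreddit_name TEXT)")
demo.executemany("INSERT INTO post VALUES (?, ?)", [
    ("bob", "a"), ("bob", "a"), ("bob", "b"),   # the repeated post in "a" counts once
    ("Alice", "a"), ("Alice", "b"),
    ("9lives", "a"), ("9lives", "c"),
    ("zed", "a"),
])

q = """
    SELECT user_name AS username,
           COUNT(DISTINCT subreddit_name) AS numb_subs
    FROM post
    GROUP BY user_name
    ORDER BY numb_subs DESC, user_name COLLATE NOCASE ASC
    LIMIT 10
"""
ranking = demo.execute(q).fetchall()
```

With these rows the three users tied on two subreddits come out as `9lives`, `Alice`, `bob`: digits first, then letters regardless of case.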

**3 - Awarded posts**
- Implement a query that returns the number of posts that have received at least two awards.

In [6]:
def awarded_posts():
    with connection.cursor() as cur:
        q = """
            SELECT COUNT(*) 
            FROM post
            WHERE total_awards_received >= 2
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [7]:
awarded_posts()

[{'COUNT(*)': 52}]

**4 - Find Covid subreddits in name and description.**
- Implement a query that retrieves the name and description of all subreddits where the name starts with _covid_ or _corona_ and the description contains _covid_ anywhere. The returned table should have two columns: `name` and `description`.

In [8]:
def covid_subreddits():
    with connection.cursor() as cur:
        q = """
            SELECT subreddit_name AS name,subr_description AS description 
            FROM subreddit
            WHERE subreddit_name LIKE 'corona%'
            OR subreddit_name LIKE 'covid%' 
            OR subr_description LIKE '%covid%'
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [9]:
covid_subreddits()

[{'name': 'Coronavirus',
  'description': 'Place to discuss all things COVID-related'},
 {'name': 'CoronavirusUK',
  'description': 'Spreading news, advice and media following the UK’s spread of the virus.'},
 {'name': 'CoronavirusUS',
  'description': 'USA/Canada specific information on the coronavirus (SARS-CoV-2) that causes coronavirus disease 2019 (COVID-19)'},
 {'name': 'China_Flu',
  'description': 'COVID-19 (2019-nCoV) Wuhan Coronavirus Information'},
 {'name': 'CoronavirusCA',
  'description': 'Tracking the Coronavirus/Covid-19 outbreak in California'},
 {'name': 'LockdownSkepticism',
  'description': 'Examining the empirical basis for mandatory lockdown policies in both the physical and social sciences. We are concerned about the impact of COVID-19 lockdowns/quarantines on our freedoms, human rights, physical and mental health, and economy. We are skeptical of ongoing lockdowns as an effective way to manage the coronavirus pandemic. This is a non-partisan, non-racist, multidi
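
The task statement joins the two name prefixes with "or" and the description test with "and", whereas a `WHERE` clause built only from `OR` terms also returns rows that satisfy a single condition (for example `China_Flu` in the output above). Because `AND` binds more tightly than `OR` in SQL, the stricter reading needs parentheses around the name tests. The sketch below assumes an in-memory SQLite database and invented rows as a stand-in for MySQL.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE subreddit (subreddit_name TEXT, subr_description TEXT)")
demo.executemany("INSERT INTO subreddit VALUES (?, ?)", [
    ("Coronavirus", "All things COVID-related"),  # prefix and description match
    ("CovidNews", "Daily virus updates"),          # prefix only
    ("Lockdowns", "Impact of covid rules"),        # description only
    ("Gardening", "Plants and soil"),              # neither
])

q = """
    SELECT subreddit_name AS name, subr_description AS description
    FROM subreddit
    WHERE (subreddit_name LIKE 'corona%' OR subreddit_name LIKE 'covid%')
      AND subr_description LIKE '%covid%'
"""
strict_matches = [row[0] for row in demo.execute(q)]
```

Only the row that meets both the name condition and the description condition is kept.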

**5 - Find users in haystack**
- Implement a query that retrieves _only the names_ of those users who have at least 3 posts whose score equals their number of comments and whose username contains the string _meme_ anywhere. The returned table should contain only one column: `username`.

In [10]:
def haystack():
    with connection.cursor() as cur:
        q = """
            SELECT user_name AS username
            FROM post
            WHERE user_name LIKE '%meme%'
            AND num_comments = score
            GROUP BY user_name
            HAVING count(*)>=3
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [11]:
haystack()

[{'username': 'MemeWarriors'}, {'username': 'PublicMemeResource'}]

**6 - Subreddits with the highest average upvote ratio**
- Implement a query that shows the top 10 subreddits in terms of the average upvote ratio of the users that posted in them. Return two columns: `subr_name` and `avg_upv_ratio`.

In [12]:
def avg_upvote_ratio_per_subreddit():
    with connection.cursor() as cur:
        q = """
            SELECT subreddit_name AS subr_name,avg(user_upvote_ratio) AS avg_upv_ratio FROM
            (SELECT p.subreddit_name,u.user_upvote_ratio
            FROM post p
            JOIN user u
            ON p.user_name=u.user_name
            GROUP BY p.user_name
            ) AS TEMP
            GROUP BY subr_name
            ORDER BY avg_upv_ratio DESC
            LIMIT 10
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [13]:
avg_upvote_ratio_per_subreddit()

[{'subr_name': 'virginvschad', 'avg_upv_ratio': 0.3233547709882},
 {'subr_name': 'MensLib', 'avg_upv_ratio': 0.2891344204545},
 {'subr_name': 'opensource', 'avg_upv_ratio': 0.2762725353241},
 {'subr_name': 'razer', 'avg_upv_ratio': 0.2254116237164},
 {'subr_name': 'CovIdiots', 'avg_upv_ratio': 0.2043045498431},
 {'subr_name': 'CoronavirusUS', 'avg_upv_ratio': 0.1591350896792},
 {'subr_name': 'NoNewNormal', 'avg_upv_ratio': 0.1516660974982},
 {'subr_name': 'sportsbook', 'avg_upv_ratio': 0.1502002626657},
 {'subr_name': 'worldbuilding', 'avg_upv_ratio': 0.1298482173256},
 {'subr_name': 'wicked_edge', 'avg_upv_ratio': 0.1229753224179}]
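
The inner `GROUP BY p.user_name` selects `p.subreddit_name` without aggregating it, so each user ends up attributed to one arbitrary subreddit, and MySQL rejects such a query when `ONLY_FULL_GROUP_BY` is enabled. One alternative reading, sketched here under the assumption that a user should count once in every subreddit they posted in, starts from distinct (subreddit, user) pairs. The SQLite database and its rows are invented stand-ins.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE user (user_name TEXT PRIMARY KEY, user_upvote_ratio REAL);
    CREATE TABLE post (user_name TEXT, subreddit_name TEXT);
    INSERT INTO user VALUES ('ann', 0.5), ('ben', 1.0);
    INSERT INTO post VALUES
        ('ann', 'python'), ('ann', 'python'), ('ann', 'sql'),
        ('ben', 'python');
""")

q = """
    SELECT pairs.subreddit_name AS subr_name,
           AVG(u.user_upvote_ratio) AS avg_upv_ratio
    FROM (SELECT DISTINCT subreddit_name, user_name FROM post) AS pairs
    JOIN user u ON u.user_name = pairs.user_name
    GROUP BY pairs.subreddit_name
    ORDER BY avg_upv_ratio DESC
    LIMIT 10
"""
averages = demo.execute(q).fetchall()
```

`ann` posted twice in `python` but contributes her ratio to that subreddit only once, and she is also counted in `sql`.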

**7 - What are the chances**
- Implement a query that finds those posts whose length (in number of characters) is exactly the same as the length of the description of the subreddit in which they were posted. Retrieve the following columns: `subreddit_name`, `posting_user`, `user_registered_at`, `post_full_text` (which can be the `title`, the `selftext` or a concatenation of both), `subreddit_description` and `dif` (which should show the difference in characters between the subreddit description and the post).

In [14]:
def what_are_the_chances():
    with connection.cursor() as cur:
        q = """
            SELECT p.title AS post_full_text,
                   s.subreddit_name,
                   s.subr_description AS subreddit_description,
                   u.user_name AS posting_user,
                   u.user_registered_at,
                   ABS(LENGTH(p.title) - LENGTH(s.subr_description)) AS dif
            FROM post p
            JOIN subreddit s
            ON p.subreddit_name=s.subreddit_name
            JOIN user u
            ON p.user_name=u.user_name
            WHERE length(title)=length(subr_description)
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [15]:
what_are_the_chances()

[{'post_full_text': "Essential Politics: Trump and Newsom's quiet cooperation",
  'subreddit_name': 'CoronavirusCA',
  'subreddit_description': 'Tracking the Coronavirus/Covid-19 outbreak in California',
  'posting_user': 'a_real_live_alien',
  'user_registered_at': datetime.datetime(2011, 3, 13, 0, 0),
  'dif': 0},
 {'post_full_text': "Counter-protestors ('defending' statues) in London clashed with police on Bridge street next to Big Ben.",
  'subreddit_name': 'PublicFreakout',
  'subreddit_description': 'A subreddit dedicated to people freaking out, melting down, losing their cool, or being weird in public.',
  'posting_user': 'Al-Andalusia',
  'user_registered_at': datetime.datetime(2011, 2, 4, 0, 0),
  'dif': 0},
 {'post_full_text': "according to u/iffywolf dms might be fake cause pm isn't capalized who knows",
  'subreddit_name': 'playboicarti',
  'subreddit_description': 'A subreddit dedicated to the discussion of hip-hop/trap artist Playboi Carti',
  'posting_user': 'aliiiiiiiii
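
The question measures length in characters. In MySQL, `LENGTH()` counts bytes while `CHAR_LENGTH()` counts characters, so under `utf8mb4` the two differ for any text that contains non-ASCII characters, such as a typographic apostrophe. The following Python sketch only illustrates that distinction and does not query the database; `char_and_byte_length` is a hypothetical helper.

```python
def char_and_byte_length(text):
    """Return (characters, UTF-8 bytes), mirroring CHAR_LENGTH() vs LENGTH() in MySQL."""
    return len(text), len(text.encode("utf-8"))


# Same phrase with a plain apostrophe and with U+2019 (3 bytes in UTF-8)
ascii_counts = char_and_byte_length("UK's spread")
curly_counts = char_and_byte_length("UK\u2019s spread")
```

Both strings have 11 characters, but the second occupies 13 bytes, so a byte-based comparison could pair texts whose character counts differ.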

**8 - Most active August 2020 days.**
- Write a query that retrieves _only_ a ranked list of the most prolific days in August 2020, prolific measured in number of posts per day. Your query should return those days in a single-column table (column name `post_day`) in the format `YYYY-MM-DD`.

In [16]:
def most_prolific_days():
    with connection.cursor() as cur:
        q = """
            SELECT date_format(posted_at,'%Y-%m-%d') AS post_day FROM post
            WHERE posted_at >= '2020-08-01 00:00:00' AND posted_at < '2020-09-01 00:00:00'
            GROUP BY post_day
            ORDER BY count(posted_at) DESC
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [17]:
most_prolific_days()

[{'post_day': '2020-08-10'},
 {'post_day': '2020-08-12'},
 {'post_day': '2020-08-14'},
 {'post_day': '2020-08-04'},
 {'post_day': '2020-08-07'},
 {'post_day': '2020-08-21'},
 {'post_day': '2020-08-26'},
 {'post_day': '2020-08-24'},
 {'post_day': '2020-08-17'},
 {'post_day': '2020-08-06'},
 {'post_day': '2020-08-19'},
 {'post_day': '2020-08-20'},
 {'post_day': '2020-08-13'},
 {'post_day': '2020-08-03'},
 {'post_day': '2020-08-27'},
 {'post_day': '2020-08-05'},
 {'post_day': '2020-08-18'},
 {'post_day': '2020-08-29'},
 {'post_day': '2020-08-31'},
 {'post_day': '2020-08-15'},
 {'post_day': '2020-08-09'},
 {'post_day': '2020-08-08'},
 {'post_day': '2020-08-16'},
 {'post_day': '2020-08-25'},
 {'post_day': '2020-08-23'},
 {'post_day': '2020-08-11'},
 {'post_day': '2020-08-28'},
 {'post_day': '2020-08-30'},
 {'post_day': '2020-08-22'},
 {'post_day': '2020-08-02'},
 {'post_day': '2020-08-01'}]

**9 - Top 'covid'-mentioning users.**
- Retrieve the top 5 users in terms of how often they have mentioned the term 'covid' in their posts. Return two columns: `username` and `total_count`. Consider an occurrence of the word 'covid' only when it appears before and after a whitespace (i.e., `<space>covid<space>`) and irrespective of case (both `<space>Covid<space>` and `<space>covid<space>` would be valid hits).

In [18]:
def count_covid():
    with connection.cursor() as cur:
        q = """
            SELECT user_name AS username,count(*) AS total_count FROM post
            WHERE title LIKE '% covid %' OR selftext LIKE '% covid %'
            GROUP BY user_name
            ORDER BY total_count DESC
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [19]:
count_covid()

[{'username': 'Kalepa', 'total_count': 6},
 {'username': '_nutri_', 'total_count': 4},
 {'username': 'Skullzrulerz', 'total_count': 4},
 {'username': 'Pessimist2020', 'total_count': 4},
 {'username': 'Beliavsky', 'total_count': 3},
 {'username': 'Mrexreturns', 'total_count': 3},
 {'username': 'Tha_NerdHerd', 'total_count': 3},
 {'username': 'PerriX2390', 'total_count': 3},
 {'username': 'Kinglens311', 'total_count': 3},
 {'username': 'Hundsheimer_Berge', 'total_count': 3},
 {'username': 'ajariax', 'total_count': 3},
 {'username': 'north0east', 'total_count': 3},
 {'username': 'geocentrist', 'total_count': 3},
 {'username': 'nutlikew', 'total_count': 3},
 {'username': 'johnslegers', 'total_count': 3},
 {'username': 'Bferw', 'total_count': 3},
 {'username': 'jsinkwitz', 'total_count': 3},
 {'username': 'noel_edmondso', 'total_count': 3},
 {'username': 'IanMazgelis', 'total_count': 3},
 {'username': 'pothead218', 'total_count': 3},
 {'username': 'grupal', 'total_count': 2},
 {'username': 
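
`COUNT(*)` combined with `LIKE` counts the posts that contain a mention, not the mentions themselves, and the task asks for five rows only. One way to count occurrences is to compare the text length before and after removing the space-padded term. The sketch below assumes an in-memory SQLite database and invented rows as a stand-in; `LOWER`, `REPLACE` and `LENGTH` are also available in MySQL.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE post (user_name TEXT, title TEXT, selftext TEXT)")
demo.executemany("INSERT INTO post VALUES (?, ?, ?)", [
    ("ann", "New Covid rules", "is covid over? and covid tests"),
    ("ben", "A covid update", None),
    ("cat", "covid", "covid-19 stats"),  # no space-delimited mention
])

q = """
    SELECT user_name AS username,
           SUM((LENGTH(txt) - LENGTH(REPLACE(txt, ' covid ', ''))) / 7) AS total_count
    FROM (SELECT user_name, LOWER(title) AS txt FROM post
          UNION ALL
          SELECT user_name, LOWER(selftext) AS txt FROM post) AS t
    GROUP BY user_name
    HAVING total_count > 0
    ORDER BY total_count DESC, username ASC
    LIMIT 5
"""
mention_counts = demo.execute(q).fetchall()
```

The divisor 7 is the length of `' covid '`. Because `REPLACE` removes non-overlapping matches, two mentions that share a single space between them would be counted once.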

**10 - Users with mean high score for their posts.**
- Retrieve the number of users (ignoring the username 'None') with an average score for their posts which is higher than the average score for the posts in our dataset. Return only one result, under the column `result`.

In [20]:
def users_score_above_mean():
    with connection.cursor() as cur:
        q = """
            SELECT count(*) AS result FROM
            (SELECT user_name,avg(score) AS avg_score FROM post
            GROUP BY user_name
            HAVING avg_score > (SELECT avg(score) FROM post)) AS TEMP
            ;
        """
        cur.execute(q)
        results = cur.fetchall()
    return results

In [21]:
users_score_above_mean()

[{'result': 70}]
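
The task asks to ignore the username 'None'. The sketch below shows one reading, in which those rows are dropped from the per-user averages while the dataset-wide mean still covers every post; whether the overall mean should exclude them as well is not specified in the task. The in-memory SQLite database and its rows are invented stand-ins.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE post (user_name TEXT, score INTEGER)")
demo.executemany("INSERT INTO post VALUES (?, ?)", [
    ("ann", 10), ("ann", 30),   # user mean 20
    ("ben", 2),                 # user mean 2
    ("None", 100),              # placeholder username, skipped
    ("cat", 58),                # user mean 58
])

q = """
    SELECT COUNT(*) AS result
    FROM (SELECT user_name
          FROM post
          WHERE user_name <> 'None'
          GROUP BY user_name
          HAVING AVG(score) > (SELECT AVG(score) FROM post)) AS above_mean
"""
result = demo.execute(q).fetchone()[0]
```

The overall mean of these rows is 40, so only `cat` qualifies; without the `WHERE` filter the 'None' placeholder would be counted as a second user.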