<font size="6">Building Fast Queries on a CSV</font>

Imagine that we are researchers monitoring the discourse about climate change on Reddit to find out what people think and discuss on this topic.

The goal of this project is to create a class that represents the Reddit comments about climate change. The methods of that class will implement the queries that we want to answer about our dataset. We will also preprocess the data to make those queries run faster.

In [1]:
from csv import reader

# Read the whole file into memory; the context manager closes the file afterwards.
with open('the-reddit-climate-change-dataset-comments.csv', encoding='UTF-8') as opened_file:
    rows = list(reader(opened_file))
header = rows[0]
rows = rows[1:]
print(header)
for i in range(2):
    print(rows[i])

['type', 'id', 'subreddit.id', 'subreddit.name', 'subreddit.nsfw', 'created_utc', 'permalink', 'body', 'sentiment', 'score']
['comment', 'imlddn9', '2qh3l', 'news', 'false', '1661990368', 'https://old.reddit.com/r/news/comments/x2cszk/us_life_expectancy_down_for_secondstraight_year/imlddn9/', 'Yeah but what the above commenter is saying is their base doesn’t want any of that. They detest all of those things, even the small gradual changes. Investing in nuclear energy is a tacit acknowledgement of man made climate change. Any acknowledgement or concession and they will be primaried out in a minute', '0.5719', '2']
['comment', 'imldbeh', '2qn7b', 'ohio', 'false', '1661990340', 'https://old.reddit.com/r/Ohio/comments/x2awnp/state_government_may_soon_kill_a_solar_project_in/imldbeh/', "Any comparison of efficiency between solar and fossil fuels is nonsensical at best and intentionally misleading at worst. In no universe is light -&gt; photovoltaic cell -&gt; electricity less efficient than

<font size="5">Reddit Class</font>

Let's start by implementing the constructor. It takes the name of the CSV file as an argument and reads the header into self.header and the remaining rows into self.rows.

In [4]:
class Reddit():
    def __init__(self, csv_filename):
        # Read the whole file into memory; the context manager closes it afterwards.
        with open(csv_filename, encoding='UTF-8') as opened_file:
            rows = list(reader(opened_file))
        self.header = rows[0]
        self.rows = rows[1:]
            
reddit = Reddit('the-reddit-climate-change-dataset-comments.csv')
print(reddit.header)
print(len(reddit.rows))

['type', 'id', 'subreddit.id', 'subreddit.name', 'subreddit.nsfw', 'created_utc', 'permalink', 'body', 'sentiment', 'score']
4600698


<font size="5">Finding a Comment From the Id</font>

Throughout this project, we will make several improvements to the Reddit class.<br>
The first thing that we will implement is a way to look up a comment from a given identifier. This lets us retrieve and analyze any comment that we consider relevant.

The idea is to preprocess the data into a dictionary where the keys are the IDs and the values are the rows. A dictionary lookup takes constant time on average, whereas scanning the list of rows takes time proportional to the number of rows.

In [2]:
class Reddit():
    def __init__(self, csv_filename):
        # Read the whole file into memory; the context manager closes it afterwards.
        with open(csv_filename, encoding='UTF-8') as opened_file:
            rows = list(reader(opened_file))
        self.header = rows[0]
        self.rows = rows[1:]
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[1]] = row
            
    def get_comment_from_id(self, comment_id):
        if comment_id in self.id_to_row:
            return self.id_to_row[comment_id]
        return None

<font size="4">Test the code:</font>

In [3]:
reddit = Reddit('the-reddit-climate-change-dataset-comments.csv')
print(reddit.get_comment_from_id('imlddn9'))
print(reddit.get_comment_from_id('imlddn8'))

['comment', 'imlddn9', '2qh3l', 'news', 'false', '1661990368', 'https://old.reddit.com/r/news/comments/x2cszk/us_life_expectancy_down_for_secondstraight_year/imlddn9/', 'Yeah but what the above commenter is saying is their base doesn’t want any of that. They detest all of those things, even the small gradual changes. Investing in nuclear energy is a tacit acknowledgement of man made climate change. Any acknowledgement or concession and they will be primaried out in a minute', '0.5719', '2']
None
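
To see why the preprocessing pays off, here is a minimal sketch that compares a plain scan with the dictionary lookup. It assumes a made-up three-row sample with the same columns as the dataset; the ids, URLs, and the helper `find_by_scan` are hypothetical and only serve the illustration.

```python
from csv import reader
from io import StringIO

# Hypothetical three-row sample with the same column layout as the dataset header.
sample_csv = StringIO(
    "type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,permalink,body,sentiment,score\n"
    'comment,aaa1,2qh3l,news,false,1661990368,https://example.invalid/aaa1,"first, with a comma",0.5,2\n'
    "comment,bbb2,2qn7b,ohio,false,1661990340,https://example.invalid/bbb2,second,-0.25,7\n"
    "comment,ccc3,2qh3l,news,false,1661990300,https://example.invalid/ccc3,third,0.0,-1\n"
)
rows = list(reader(sample_csv))[1:]  # drop the header row


def find_by_scan(rows, comment_id):
    # Linear scan: checks every row until the id matches.
    for row in rows:
        if row[1] == comment_id:
            return row
    return None


# Preprocessing: one pass builds the index, then each lookup is a single dict access.
id_to_row = {row[1]: row for row in rows}

print(find_by_scan(rows, "bbb2") == id_to_row.get("bbb2"))  # same row either way
print(id_to_row.get("zzz9"))  # unknown id
```

Both approaches return the same row, but the scan repeats its work on every call, while the dictionary is built once and reused for every query.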


<font size="5">Analyzed Sentiment</font>

Our dataset includes the analyzed sentiment of each comment, a value between -1 (the lower limit) and 1 (the upper limit). These sentiments indicate whether a comment was written in a negative, neutral, or positive tone, which can tell us how its author views climate change.

We will write a method that, given an interval between -1 and 1, returns the comments whose sentiment falls inside it. The idea is to preprocess the data into a dictionary where the keys are the sentiment values and the values are the rows with that sentiment.

In [None]:
class Reddit():
    def __init__(self, csv_filename):
        # Read the whole file into memory; the context manager closes it afterwards.
        with open(csv_filename, encoding='UTF-8') as opened_file:
            rows = list(reader(opened_file))
        self.header = rows[0]
        self.rows = rows[1:]
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[1]] = row
        # Group rows by sentiment; several comments can share the same value.
        self.sentiment_to_rows = {}
        for row in self.rows:
            try:
                sentiment = float(row[8])
            except ValueError:
                continue  # skip rows whose sentiment is missing or not numeric
            self.sentiment_to_rows.setdefault(sentiment, []).append(row)

    def get_comment_from_id(self, comment_id):
        if comment_id in self.id_to_row:
            return self.id_to_row[comment_id]
        return None

    def analized_sentiment(self, lim_inf, lim_sur):
        # Intervals that extend outside [-1, 1] are rejected.
        if lim_inf < -1 or lim_sur > 1:
            return None
        # Collect every row whose sentiment lies strictly between the two limits.
        list_sentiment = []
        for sentiment, rows in self.sentiment_to_rows.items():
            if lim_inf < sentiment < lim_sur:
                list_sentiment.extend(rows)
        return list_sentiment

<font size="4">Test the code:</font>

In [None]:
reddit = Reddit('the-reddit-climate-change-dataset-comments.csv')
print(reddit.analized_sentiment(-0.5,0.5))
print(reddit.analized_sentiment(-2,-1))
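
A dictionary answers exact-match lookups quickly, but a range query still has to visit every distinct key. As a possible alternative (not the approach used in the class above), the sketch below sorts the sentiments once and uses the standard `bisect` module to locate both ends of the interval. The sample pairs and the function name `ids_in_open_interval` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical (sentiment, comment id) pairs standing in for the dataset rows.
comments = [(0.57, "a"), (-0.8, "b"), (0.0, "c"), (0.3, "d"), (-0.2, "e"), (0.5, "f")]

# Preprocessing: sort once by sentiment and keep the keys in a parallel list.
comments.sort(key=lambda pair: pair[0])
sentiments = [pair[0] for pair in comments]


def ids_in_open_interval(low, high):
    # First position with sentiment > low, and first position with sentiment >= high.
    start = bisect_right(sentiments, low)
    end = bisect_left(sentiments, high)
    return [comment_id for _, comment_id in comments[start:end]]


print(ids_in_open_interval(-0.5, 0.5))  # ['e', 'c', 'd']
```

The bounds are exclusive here, matching the strict comparisons in `analized_sentiment`, so the comment with sentiment exactly 0.5 is left out. Each query costs two binary searches plus the size of the result.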

<font size="5">Comment Score</font>

With the score attribute, we can determine which comments are the most highly rated and, consequently, the most important ones.

We will write a method that, given a target score, returns two comments whose values in the "score" column sum to that target.

In [None]:
class Reddit():
    def __init__(self, csv_filename):
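        # Read the whole file into memory; the context manager closes it afterwards.
        with open(csv_filename, encoding='UTF-8') as opened_file:
            rows = list(reader(opened_file))
        self.header = rows[0]
        self.rows = rows[1:]
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[1]] = row
        # Group rows by sentiment; several comments can share the same value.
        self.sentiment_to_rows = {}
        for row in self.rows:
            try:
                sentiment = float(row[8])
            except ValueError:
                continue  # skip rows whose sentiment is missing or not numeric
            self.sentiment_to_rows.setdefault(sentiment, []).append(row)
        # Group rows by score so the complement of a score can be looked up directly.
        self.score_to_rows = {}
        for row in self.rows:
            try:
                row_score = int(row[-1])
            except ValueError:
                continue  # skip rows whose score is missing or not an integer
            self.score_to_rows.setdefault(row_score, []).append(row)

    def get_comment_from_id(self, comment_id):
        if comment_id in self.id_to_row:
            return self.id_to_row[comment_id]
        return None

    def analized_sentiment(self, lim_inf, lim_sur):
        # Intervals that extend outside [-1, 1] are rejected.
        if lim_inf < -1 or lim_sur > 1:
            return None
        # Collect every row whose sentiment lies strictly between the two limits.
        list_sentiment = []
        for sentiment, rows in self.sentiment_to_rows.items():
            if lim_inf < sentiment < lim_sur:
                list_sentiment.extend(rows)
        return list_sentiment

    def score(self, score):
        # Return two comments whose scores add up to `score`, or -1 if no pair exists.
        for first, rows in self.score_to_rows.items():
            second = score - first
            if second == first:
                # Both comments share the same score, so two distinct rows are needed.
                if len(rows) >= 2:
                    return rows[:2]
            elif second in self.score_to_rows:
                return [rows[0], self.score_to_rows[second][0]]
        return -1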
        

In [None]:
reddit = Reddit('the-reddit-climate-change-dataset-comments.csv')
print(reddit.score(15))
print(reddit.score(0))
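
The pair search can also be sketched on a handful of values to make the idea easier to follow. This is a minimal illustration, assuming made-up (id, score) pairs; the function `find_pair_with_total` is hypothetical and, unlike the class method, returns ids instead of full rows and `None` when no pair exists.

```python
# Hypothetical (comment id, score) pairs standing in for the dataset rows.
comments = [("a", 2), ("b", 7), ("c", -1), ("d", 8), ("e", 13)]


def find_pair_with_total(comments, target):
    # Map each score seen so far to the id of a comment that has it.
    seen = {}
    for comment_id, score in comments:
        complement = target - score
        if complement in seen:
            # An earlier comment has exactly the missing amount.
            return seen[complement], comment_id
        seen[score] = comment_id
    return None


print(find_pair_with_total(comments, 15))   # ('b', 'd') because 7 + 8 = 15
print(find_pair_with_total(comments, 100))  # None, no two scores reach 100
```

Because each comment is checked against the scores stored before it, a comment is never paired with itself, and the whole search needs a single pass over the data.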