# Sentiment Analysis on Road Bike Brands

I am trying to figure out which road bike brand is the favorite among Redditors.

For this, I am analyzing the sentiment of comments about bike brands from the subreddits **r/cycling**, **r/bicycling**, and **r/RoadBikes**.

### What is <span style="color:#FF5700">Reddit</span>?

Reddit is a website containing numerous forums about different topics. Those forums are called _Subreddits_.

Each subreddit can be sorted by **new**, **hot** (i.e., recent and heavily visited) and **top** (*of the day/the week/the month/the year/all time*).

### Motivation

I am hugely into cycling. Even though I don't have the money to spend on a new high-end bike, I like to dream and look at lots of different road bikes from various brands.

This project stemmed from my curiosity about which brands might be the most favored among people.

*Credit where credit is due*: I got the idea for this project from this [post](https://www.reddit.com/r/bicycling/comments/14ffxuy/i_analyzed_200k_comments_to_find_reddits_favorite/), but I'm putting my own twist on it.

### Which brands am I looking for?

- <span style="color:#bd82d9">Argon 18</span>
- <span style="color:#bd82d9">Bianchi</span>, <span style="color:#bd82d9">BMC</span>
- <span style="color:#bd82d9">Cannondale</span>, <span style="color:#bd82d9">Canyon</span>, <span style="color:#bd82d9">Cervelo</span>, <span style="color:#bd82d9">Cinelli</span>, <span style="color:#bd82d9">Colnago</span>, <span style="color:#bd82d9">Cube</span>
- <span style="color:#bd82d9">Giant</span>
- <span style="color:#bd82d9">Merida</span>
- <span style="color:#bd82d9">Orbea</span>
- <span style="color:#bd82d9">Pinarello</span>
- <span style="color:#bd82d9">Ridley</span>, <span style="color:#bd82d9">Rose</span>
- <span style="color:#bd82d9">Scott</span>, <span style="color:#bd82d9">Specialized</span>
- <span style="color:#bd82d9">Trek</span>
- <span style="color:#bd82d9">Ventum</span>
- <span style="color:#bd82d9">Wilier</span>


I originally had **Time**, **Look** and **Felt** on my list as well, but because those are common English words, filtering for them wasn't feasible within this timeframe.

*Spoiler*: **Argon 18** and **Ventum** didn't have enough entries, so they were filtered out as well.

---



## Gathering Data

Since the beginning of 2024, Reddit only provides API access for a fee. I didn't want to pay that fee, so I wrote my own scraper on top of Reddit's public `.json` endpoints:

```python
import random
import time
from datetime import datetime, timezone

import httpx
import pandas as pd


def get_posts_from_2024(endpoint, category='/hot', last_after=None, onlyId=False):
    '''
    This function gathers up to 1000 posts from a specified subreddit endpoint on Reddit.
    It keeps only posts created on or after 01.01.2024 (UTC) and discards older ones.
    The function can optionally return only the IDs of the posts.

    :param endpoint: String representing the subreddit endpoint in the format '/r/subreddit_name'.
    :param category: Optional string specifying the category of posts to retrieve ('/new', '/hot', or '/top', default is '/hot').
    :param last_after: Optional string indicating the after_post_id to continue scraping from a specific point.
    :param onlyId: Optional boolean. If True, returns a DataFrame with only the post IDs.
    :return: A pandas DataFrame containing all post information or only the IDs if onlyId is True,
             or None if no posts could be fetched.
    '''

    base_url = 'https://www.reddit.com'
    url = base_url + endpoint + category + '.json'
    if category == 'top/?t=year':
        url = base_url + endpoint + '/top/' + '.json?t=year'
    dataset = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36'
    }

    for i in range(10):  # 10 pages of up to 100 posts each = the 1000-post limit
        params = {
            'limit': 100,  # Max. number of items per request, limited by the official endpoint
            't': 'year',  # Only get posts that have been made during the last year (starting at runtime)
            'after': last_after  # after_post_id for the next page (without 'limit', a page has only about 25 items)
        }
        try:
            response = httpx.get(url, params=params, headers=headers)
            response.raise_for_status()
            json_data = response.json()
            dataset.extend(rec['data'] for rec in json_data['data']['children'])
            last_after = json_data['data']['after']
            print(f'Fetched {len(dataset)} posts :)')
            if last_after is None:
                break  # No further pages available
        except (httpx.HTTPError, KeyError, ValueError) as error:
            print(f'Failed to fetch posts :( ({error})')

        sleeptime = float(random.randrange(2, 5))  # random sleeptime to be less suspicious
        time.sleep(sleeptime)

    if not dataset:
        return None

    # Filtering out all posts made before 01.01.2024
    df = pd.DataFrame(dataset)
    start_date = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
    df['created'] = df['created'].astype(float)
    df = df[df['created'] >= start_date]
    if onlyId:
        df = df[['id']]

    return df
```

This code returns up to the last 1000 posts from a given category, keeping only those created in 2024 or later. The 1000-post limit is set by Reddit's endpoint and can't be easily bypassed.

This code snippet is followed by another one that scrapes all the top-level comments of each post using its id.
- Using these two functions, I managed to scrape about **38,000** comments.
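
The comment scraper itself is not shown here. As a rough sketch of its parsing step, assume that the public `/comments/<post_id>.json` endpoint returns a two-element list (the post listing followed by the comment listing) and that comments are children of kind `t1` with their text in `body`. The function name `extract_top_level_comments` and the sample payload are hypothetical and not taken from *reddit_scraper.py*.

```python
def extract_top_level_comments(payload):
    """Return the text of every top-level comment in a comments payload.

    Assumes `payload` is the parsed JSON of '/comments/<post_id>.json':
    a list whose second element is the comment listing.
    """
    if not isinstance(payload, list) or len(payload) < 2:
        return []
    children = payload[1].get('data', {}).get('children', [])
    comments = []
    for child in children:
        # 't1' marks a comment; 'more' entries are placeholders for collapsed threads
        if child.get('kind') != 't1':
            continue
        body = child.get('data', {}).get('body')
        if body and body not in ('[deleted]', '[removed]'):
            comments.append(body)
    return comments


# Hand-written payload in the assumed shape (not real Reddit data)
sample_payload = [
    {'kind': 'Listing', 'data': {'children': [{'kind': 't3', 'data': {'id': 'abc123'}}]}},
    {'kind': 'Listing', 'data': {'children': [
        {'kind': 't1', 'data': {'body': 'Love my Canyon!', 'replies': ''}},
        {'kind': 't1', 'data': {'body': '[deleted]', 'replies': ''}},
        {'kind': 'more', 'data': {'children': ['xyz']}},
    ]}},
]

print(extract_top_level_comments(sample_payload))  # ['Love my Canyon!']
```

Replies are ignored on purpose, since only top-level comments were scraped.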

In addition to my own scraping, I obtained a dataset of all comments from the three subreddits from 06.2005 - 12.2023 from this [torrent](https://academictorrents.com/details/56aa49f9653ba545f48df2e33679f014d2829c10).
- This provided me with about **5,300,000** additional comments.

<span style="color:#FF8C00">**With that, I have a dataset that spans roughly from June 2005 to June 2024 and contains about 5,350,000 comments.**</span>

---

## Data Cleaning

After loading/scraping all data, I formatted it as | subreddit | comment |, discarding all other information from the JSON, and combined all dataframes into one large dataframe.

<u>Then, I began the actual cleaning process in my module *reddit_scraper.py*:</u>
- Converting Unicode codes into their respective characters.
- Removing URLs.
- Removing special characters (e.g., formatting characters).
- Filtering out comments that do not mention any of the targeted bike brands using *TheFuzz*, which allows for typos.
    - If a comment mentions multiple brands, it is listed multiple times, once for each brand mentioned in that comment.
    - *Code for that looks like this*:
```python
from thefuzz import fuzz


def contains_keyword(subreddit, comment, keywords):
    '''
    Recursively searches for all keywords in a given comment.
    Allows for typos and is not case-sensitive.

    :param subreddit: String representing the subreddit the comment is taken from.
    :param comment: String to be parsed for keywords.
    :param keywords: List of keywords to search for.
    :return: List of dictionaries in JSON-like format with the shape 
             [{'subreddit': subreddit, 'keyword': keyword, 'matched_word': matched_word, 'comment': comment}]. 
             One dictionary for each keyword found.
    '''
    MIN_SCORE = 85
    words = comment.lower().split()
    best_score = 0
    best_match = None
    matched_word = None

    # Find the keyword/word pair with the highest similarity above the threshold
    for keyword in keywords:
        for word in words:
            score = fuzz.ratio(word, keyword.lower())
            if score > MIN_SCORE and score > best_score:
                best_score = score
                best_match = keyword
                matched_word = word

    if best_match:
        # Search the remaining keywords without mutating the caller's list,
        # so the same keyword list can be reused for the next comment
        remaining = [keyword for keyword in keywords if keyword != best_match]

        # Recursively call contains_keyword to find further matches
        further_matches = contains_keyword(subreddit, comment, remaining)

        # Prepare data for DataFrame processing
        match_data = [{'subreddit': subreddit, 'keyword': best_match, 'matched_word': matched_word, 'comment': comment}]
        match_data.extend(further_matches)

        return match_data

    return []
```
- Filtering out certain brands when written in lowercase, as they were often just common English words:
    - These brands are: Giant, Cube, Rose
- Removing brands with fewer than 100 comments.
- Adding a column 'multiple' to mark comments that contain more than one brand and therefore appear more than once in the DataFrame.
- The DataFrame now looks like this:

| subreddit   | keyword       | matched_word   | comment                                                        | multiple |
|-------------|---------------|----------------|----------------------------------------------------------------|----------|
| "bicycling" | "Trek"        | "trek"         | "I love Trek and Specialized. I am not a fan of Gient though!" | true     |
| "bicycling" | "Specialized" | "specialized"  | "I love Trek and Specialized. I am not a fan of Gient though!" | true     |
| "bicycling" | "Giant"       | "gient"        | "I love Trek and Specialized. I am not a fan of Gient though!" | true     |
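
The first three cleaning steps listed above (decoding Unicode codes, removing URLs, removing special characters) can be sketched with the standard library alone. This is an illustration, not the code from *reddit_scraper.py*: the function name `clean_comment`, the regular expressions, and the assumption that "Unicode codes" means HTML entities such as `&amp;` and literal `\uXXXX` sequences are all mine.

```python
import html
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
UNICODE_ESCAPE_PATTERN = re.compile(r'\\u([0-9a-fA-F]{4})')
# Keep letters, digits, whitespace and basic punctuation; drop everything else
SPECIAL_CHAR_PATTERN = re.compile(r"[^\w\s.,!?'\-]")


def clean_comment(text):
    """Apply the three text-cleaning steps to a single comment."""
    # 1. Decode HTML entities and literal \uXXXX sequences
    text = html.unescape(text)
    text = UNICODE_ESCAPE_PATTERN.sub(lambda m: chr(int(m.group(1), 16)), text)
    # 2. Remove URLs
    text = URL_PATTERN.sub(' ', text)
    # 3. Remove special characters such as Markdown formatting
    text = SPECIAL_CHAR_PATTERN.sub(' ', text)
    # Collapse the whitespace left behind by the removals
    return ' '.join(text.split())


raw = 'I **love** my Trek &amp; my Giant! See https://example.com/bikes'
print(clean_comment(raw))  # I love my Trek my Giant! See
```

URLs are removed before the special characters so that the pieces of a link (slashes, colons) do not end up as stray words in the comment.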


<br>
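
The lowercase filter and the `multiple` column from the list above could look roughly like the following. It is a dependency-free sketch that works on plain records (lists of dictionaries) instead of a DataFrame. The helper names `is_capitalized_in_comment` and `filter_and_flag` and the sample rows are hypothetical.

```python
from collections import Counter

# Brands that double as common English words (from the list above)
CASE_SENSITIVE_BRANDS = {'Giant', 'Cube', 'Rose'}


def is_capitalized_in_comment(row):
    """True if the matched word is written with an uppercase first letter."""
    for token in row['comment'].split():
        if token.lower() == row['matched_word']:
            # Ignore leading punctuation such as '(' or '*' for the case check
            first_letter = token.lstrip('("\'*_')[:1]
            if first_letter.isupper():
                return True
    return False


def filter_and_flag(rows):
    """Drop lowercase mentions of ambiguous brands, then add the 'multiple' flag."""
    kept = [
        row for row in rows
        if row['keyword'] not in CASE_SENSITIVE_BRANDS or is_capitalized_in_comment(row)
    ]
    # A comment that occurs more than once mentions more than one brand
    occurrences = Counter(row['comment'] for row in kept)
    return [dict(row, multiple=occurrences[row['comment']] > 1) for row in kept]


rows = [
    {'keyword': 'Trek', 'matched_word': 'trek', 'comment': 'I love Trek and Giant bikes'},
    {'keyword': 'Giant', 'matched_word': 'giant', 'comment': 'I love Trek and Giant bikes'},
    {'keyword': 'Giant', 'matched_word': 'giant', 'comment': 'That hill is a giant pain'},
    {'keyword': 'Rose', 'matched_word': 'rose', 'comment': 'Rose makes solid frames'},
]

for row in filter_and_flag(rows):
    print(row['keyword'], row['multiple'])
```

A capitalization check like this still lets through common words at the start of a sentence ("Rose petals..."), which is one reason further filtering was needed afterwards.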

Since certain brands are very similar to common English vocabulary, I had to filter out comments that falsely passed as being about a bike brand when they were not. I originally planned to use **NER** (Named Entity Recognition) for this, but after testing, I found that it handled most bike brands poorly.

As it wasn't within my timeframe to create a new dataset and train the model on bike brands, I decided to use a more traditional approach: keyword-based segmentation. This method trims each comment to the part that is about the matched brand, keeping as much context as possible while removing as many mentions of other brands as possible.
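
To illustrate the idea, here is a minimal sketch of keyword-based segmentation. It assumes that a segment is simply every sentence that contains the matched word. The actual segmentation rules in my module may differ, and `segment_comment` is a hypothetical name.

```python
import re

# Split after sentence-ending punctuation that is followed by whitespace
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')


def segment_comment(comment, matched_word):
    """Keep only the sentences of `comment` that contain `matched_word`."""
    sentences = SENTENCE_BOUNDARY.split(comment.strip())
    relevant = [s for s in sentences if matched_word.lower() in s.lower()]
    # Fall back to the full comment if no sentence contains the match
    return ' '.join(relevant) if relevant else comment


comment = 'I love Trek and Specialized. I am not a fan of Gient though!'
print(segment_comment(comment, 'trek'))   # I love Trek and Specialized.
print(segment_comment(comment, 'gient'))  # I am not a fan of Gient though!
```

Because two brands can share a sentence, a segment may still mention another brand, as in the first output line.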

The DataFrame now looks like this:

| subreddit   | keyword       | matched_word   | comment                          | multiple |
|-------------|---------------|----------------|----------------------------------|----------|
| "bicycling" | "Trek"        | "trek"         | "I love Trek and Specialized."   | true     |
| "bicycling" | "Specialized" | "specialized"  | "I love Trek and Specialized."   | true     |
| "bicycling" | "Giant"       | "gient"        | "I am not a fan of Gient though!" | true    |

**These are the results on how many times each brand is mentioned:**

| Brand       | Count  |
|-------------|--------|
| Trek        | 43380  |
| Specialized | 38637  |
| Giant       | 20464  |
| Cannondale  | 19046  |
| Canyon      | 11746  |
| Bianchi     | 8563   |
| Cervelo     | 5782   |
| Scott       | 4639   |
| BMC         | 2813   |
| Pinarello   | 2784   |
| Colnago     | 2563   |
| Cinelli     | 2136   |
| Orbea       | 1733   |
| Cube        | 1589   |
| Merida      | 1481   |
| Ridley      | 1225   |
| Wilier      | 830    |
| Rose        | 761    |

<u>I got an overall accuracy on 'direct-brand-hits' of 67.5%.</u>

---

## Sentiment Analysis

I performed sentiment analysis using a Hugging Face pipeline with the [RoBERTa model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest). I changed some of the default settings to run the model on my GPU for faster computation and adjusted the batch size to optimize performance.

After conducting sentiment analysis, I filtered out comments that received a **neutral sentiment** or had a **score below 0.5** to exclude uncertain results.

As a final step, I evaluated the number of comments for each brand. Brands with fewer than 100 comments were removed from further analysis due to insufficient data for meaningful insights.

<br/>

This is the Python code used for the sentiment analysis:

```python 
import pandas as pd
from tqdm import tqdm
from transformers import pipeline

# remove_brand_by_threshold is defined elsewhere in the project (not shown here)


def analyse_sentiment(data):
    '''
    Analyze sentiment of comments using a pre-trained model from Hugging Face.

    :param data: DataFrame containing comments.
    :return: DataFrame with sentiment analysis results.
    '''
    batch_size = 256
    results = []

    for i in tqdm(range(0, len(data), batch_size)):
        batch = data['comment'].iloc[i:i+batch_size].tolist()
        # sentiment_task is the pipeline created in the __main__ block below
        batch_results = sentiment_task(batch)
        results.extend(batch_results)

    analytics = data.copy()  # Avoid modifying the caller's DataFrame
    analytics['sentiment'] = [result['label'] for result in results]
    analytics['score'] = [result['score'] for result in results]
    
    return analytics


def filter_sentiment(data):
    '''
    Filter sentiment analysis results based on score and neutrality.

    :param data: DataFrame with sentiment analysis results.
    :return: Filtered DataFrame.
    '''
    # Keep only confident, non-neutral predictions
    filtered = data[(data['score'] >= 0.500) & (data['sentiment'] != 'neutral')]
    filtered = remove_brand_by_threshold(filtered)
    return filtered


if __name__ == '__main__':
    model_path = 'cardiffnlp/twitter-roberta-base-sentiment-latest'
    # device=0 selects the first GPU; long comments are truncated to 512 tokens
    sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path, device=0, max_length=512, truncation=True)
        
    data = pd.read_json(r'data/segmented.json', orient='records', lines=True)
    sentiment = analyse_sentiment(data)

    filtered = filter_sentiment(sentiment)
    filtered.to_json(r'data/sentiment_filtered.json', orient='records', lines=True)
```
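
`filter_sentiment` above calls `remove_brand_by_threshold`, which is not shown in this README. The idea behind it (dropping brands with fewer than 100 comments) can be sketched as follows. To stay dependency-free, the sketch works on plain records instead of a DataFrame, and `drop_rare_brands` is a hypothetical stand-in, not the real helper.

```python
from collections import Counter


def drop_rare_brands(records, min_comments=100):
    """Keep only records whose brand has at least `min_comments` records."""
    counts = Counter(record['keyword'] for record in records)
    return [record for record in records if counts[record['keyword']] >= min_comments]


# Made-up example: 100 comments about one brand, 99 about another
records = [{'keyword': 'Trek'}] * 100 + [{'keyword': 'Ventum'}] * 99
kept = drop_rare_brands(records)
print(len(kept))  # 100
```

On a DataFrame, the same filter can be written with `df.groupby('keyword')['keyword'].transform('size') >= 100` as a boolean mask.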

---

## Showcasing Results

**I visualized the main result as a bar chart using the respective logos of each brand, similar to the post that inspired this project.**

In addition to that, I have visualized the following three results:

- **Top 3 Liked Brands of Each Subreddit**
- **Top 3 Most 'Hated' Brands of Each Subreddit**
- **Most Controversial Brands Based on Controversy Score**

$ \text{Controversy Score} = \frac{\text{Bad Comments}}{\text{Good Comments} + \text{Bad Comments}} \times \left(1 - \frac{\text{Total Comments}}{\text{Maximum Comments}}\right) $

<small>

- <b>Bad Comments</b>: Number of negative or unfavorable comments about the brand.
- <b>Good Comments</b>: Number of positive or favorable comments about the brand.
- <b>Total Comments</b>: Total number of comments (both good and bad) about the brand.
- <b>Maximum Comments</b>: A hypothetical maximum number of comments used to scale the influence of total comments on the controversy score.
</small>
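
As a worked example of the formula above, the following sketch computes the score for made-up counts. The function name `controversy_score` and the numbers are illustrative only. With 300 good and 100 bad comments and a maximum of 1000, the share of bad comments is 0.25 and the volume factor is 1 - 400/1000 = 0.6, so the score is roughly 0.15. Note that a brand whose total equals the maximum always scores 0, regardless of how negative its comments are.

```python
def controversy_score(good, bad, max_comments):
    """Share of bad comments, scaled down the closer the total is to the maximum."""
    total = good + bad
    if total == 0 or max_comments <= 0:
        return 0.0  # No comments (or no valid maximum): nothing to score
    negative_share = bad / total
    volume_factor = 1 - total / max_comments
    return negative_share * volume_factor


# Made-up counts, not results from the dataset
print(controversy_score(good=300, bad=100, max_comments=1000))  # roughly 0.15
```
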
---

<p align="center">
    <img src="./plots/vis1.png" alt="Image 4">
</p>

---

<p align="center">
    <img src="./plots/vis2.png" alt="Image 3">
</p>

---

<div align="center">
  <img src="./plots/vis31.png" alt="Image 1" style="max-width: 45%; margin: 5px;">
  <img src="./plots/vis32.png" alt="Image 2" style="max-width: 45%; margin: 5px;">
</div>


---

## The End
```
 o__         __o        ,__o        __o           __o
 ,>/_       -\<,      _-\_<,       _`\<,_       _ \<_
(*)`(*).....O/ O.....(*)/'(*).....(*)/ (*).....(_)/(_)
```