# <font color = 'Maroon'> Project 3 - Web APIs & NLP Analysis on Dota 2 vs League of Legends Subreddits
---

## <font color = 'Maroon'> Background

Reddit is the most popular discussion-based social media platform in the world and is ranked as the 8th most popular website globally [(*source*)](https://www.socialmediatoday.com/news/13-fascinating-facts-about-reddit-infographic/523516/). Many people rely on Reddit to learn more about a particular topic through its subreddits. Reddit keeps gaining momentum thanks to a community that actively contributes information and opinions to the subreddits it follows. Subreddits cover a wide range of topics, from finance and politics to gaming and memes. Users can subscribe to subreddits to be continuously updated with discussions on those topics. Subreddits are also maintained by moderators, who ensure that posts are appropriate and free of bullying or harassment.

This project aims to understand how two gaming subreddits differ, using Natural Language Processing (NLP). The two subreddits are [DotA2](https://www.reddit.com/r/DotA2/) and [League of Legends](https://www.reddit.com/r/leagueoflegends/).

## <font color = 'Maroon'> Problem Statement

Due to the Covid-19 pandemic, many people have shifted to online activities to maintain relationships with their loved ones. One of the most popular online activities is gaming, which has since garnered more attention and become the core of many communities. Two of the forerunners among popular online games are Dota2 and League of Legends (LOL), both of which are free to play and require extensive teamwork to win. The games have been around since 2011 and 2009 respectively, with Dota2 continuing Dota (released in 2003) on the back of its success. Even in 2022, both games are still among the top 20 most popular PC games in the world [(source)](https://newzoo.com/insights/rankings/top-20-pc-games).

Both Dota2 and LOL are Multiplayer Online Battle Arena (MOBA) games in which each teammate plays a character in a certain role (top laner, jungler, mid laner, bottom laner or support). Each laner can also build their character as a carry, core, ganker, etc., depending on item purchases and the character's natural abilities.

Because of the many complex character abilities and strategies, the Dota2 and LOL fan communities like to discuss their builds, items, skins and match highlights, as well as plays by top competitive teams, on Reddit. Fun fact: Dota2's highest level of competition (The International 2021) had the largest prize pool of any esports tournament worldwide, at \\$40 million, compared with \\$15 million for the Fortnite World Cup Finals 2019.

As a junior data scientist at Twitch, I was tasked with building a feature for an ambitious internal project. The team has been developing a news feed in Twitch, and one of its categories is gaming. I was asked to provide insights into two similar games and to create a machine learning classification model for future posts, so that Twitch streamers and users get a personalised news feed with correctly tagged posts.

Observing the popularity of Dota2, I decided to compare Dota2 posts vs League of Legends posts as both games are similar in nature.

## <font color = 'Maroon'> Approach

Below is the proposed approach for this project:

1. Data Collection
    - Scrape posts from the two subreddits using the Pushshift API
    - Combine the posts after each API pull


2. Data Cleaning and EDA
    - Clean the selftext and title fields using regex, remove punctuation and stop words, then tokenize, stem and lemmatize the text
    - Use Count Vectorizer and TF-IDF Vectorizer to analyse word frequency in both subreddits
   
   
3. Machine Learning Modelling & Selection
    - Use pipelines to orchestrate the machine learning operations, chaining a series of data transformations into a measurable modelling process
    - Develop different classification models (Logistic Regression, Multinomial Naive Bayes, K-Nearest Neighbors and Random Forest Classifier) and compare them to choose the best one
    - Tune pipeline hyperparameters with GridSearchCV to improve model accuracy
    - Identify the most important words for each machine learning model
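
The cleaning and word-frequency steps in the approach can be sketched with the standard library alone. This is a minimal illustration rather than the project's actual cleaning code: the `STOP_WORDS` set, `clean_text` and `word_counts` are hypothetical names, the stop-word list is deliberately tiny, and the `Counter` tally is only a simplified stand-in for what Count Vectorizer produces. Stemming and lemmatization are left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "it", "my"}

def clean_text(text):
    """Lowercase, drop URLs and non-letters, and remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation and digits
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def word_counts(docs):
    """Count token occurrences across all documents (a bag-of-words tally)."""
    counts = Counter()
    for doc in docs:
        counts.update(clean_text(doc))
    return counts

# Three post titles taken from the DotA2 pull in this notebook
sample_titles = [
    "Null Talisman meta is getting out of hand",
    "M1 Mac crashing",
    "A TI in SEA is almost confirmed!",
]
counts = word_counts(sample_titles)
print(counts.most_common(3))
```

Comparing such tallies between the two subreddits is the idea behind the word-frequency analysis; TF-IDF additionally down-weights words that appear in most posts.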
    

## <font color = 'Maroon'> Data Scraping with PushShift API

Data is collected by using the PushShift API to pull posts from the DotA2 and LOL subreddits.

In [59]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
import requests
import warnings
warnings.filterwarnings('ignore')

In [60]:
#Define function to get the API parameters, consisting of subreddit, size of posts and posted time
def get_params(baseposts_df, subreddit):
    params = {
        'subreddit': subreddit, 
        'size': 100, 
        'before': baseposts_df.loc[(baseposts_df.shape[0] - 1), 'created_utc'] 
    }
    return params

In [61]:
#Define function to get the posts from the API and parse the JSON response. Posts are stored under the 'data' key
def get_posts(params, baseurl='https://api.pushshift.io/reddit/search/submission'):
    res = requests.get(baseurl, params=params)
    if res.status_code != 200:
        #Report the failure and return an empty list so callers can still build a DataFrame
        print(f'Error! Status code: {res.status_code}')
        return []
    data = res.json()
    posts = data['data']
    return posts
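
The status-code check above matters because a long run of pulls can hit transient failures. One possible way to make a pull more robust is to retry a few times before giving up. The sketch below is an assumption, not part of the original notebook: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the `fetch` argument is a stand-in callable returning `(status_code, payload)` so the logic can be run without network access.

```python
import time

def fetch_with_retry(fetch, params, retries=3, delay=1.0):
    """Call `fetch(params)` until it returns status 200 or retries run out.

    `fetch` is a hypothetical callable returning (status_code, payload),
    so the retry logic can be exercised without network access.
    """
    for attempt in range(1, retries + 1):
        status, payload = fetch(params)
        if status == 200:
            return payload.get("data", [])
        if attempt < retries:
            time.sleep(delay * attempt)  # simple linear backoff between attempts
    return []  # give up and return no posts, so callers always receive a list

# Simulated endpoint that fails twice before succeeding
calls = {"n": 0}

def flaky_fetch(params):
    calls["n"] += 1
    if calls["n"] < 3:
        return 429, {}
    return 200, {"data": [{"title": "M1 Mac crashing"}]}

posts = fetch_with_retry(flaky_fetch, {"subreddit": "DotA2", "size": 100}, delay=0)
print(len(posts), calls["n"])
```

Always returning a list keeps the downstream `pd.DataFrame(posts)` call well defined even when every attempt fails.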

In [62]:
#Define function to convert the list of posts into a DataFrame
def create_new_df(posts):
    return pd.DataFrame(posts)

### <font color = 'Maroon'> Retrieving DotA2 posts

In [63]:
params_dota = {
    'subreddit': 'DotA2', 
    'size': 100
}

In [64]:
posts_dota = get_posts(params_dota)

In [65]:
dota_df = create_new_df(posts_dota)

In [66]:
dota_df.shape

(99, 83)

In [67]:
dota_df.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | author_flair_text_color | removed_by_category | media_metadata | discussion_type | suggested_sort | poll_data | crosspost_parent | crosspost_parent_list | gallery_data | is_gallery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | HoggiQQ |  | [] |  | text | t2_e4tqz | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | [] | False | StartingFrom-273 |  | [] |  | text | t2_cy6t9qbb | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | [] | False | lvndrs | marci | `[{'a': ':marci:', 'e': 'emoji', 'u': 'https://...` | `:marci:` | richtext | t2_p1au3 | False | False | ... | dark |  |  |  |  |  |  |  |  |  |
| 3 | [] | False | tmjm |  | [] |  | text | t2_c88ee | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | [] | False | AnomaLuna | luna | `[{'a': ':luna:', 'e': 'emoji', 'u': 'https://e...` | `:luna:` | richtext | t2_6jnv0rni | False | False | ... | dark |  |  |  |  |  |  |  |  |  |


In [68]:
dota_df[['subreddit', 'selftext', 'title', 'created_utc']].head()

| | subreddit | selftext | title | created_utc |
|---|---|---|---|---|
| 0 | DotA2 | Hey, got a Saturday ticket for the Major in th... | Can't make it to the Major, giving away ticket. | 1653116935 |
| 1 | DotA2 |  | [Dota2]Eat some snacks. Don't miss the sights.... | 1653116783 |
| 2 | DotA2 |  | A TI in SEA is almost confirmed! | 1653116690 |
| 3 | DotA2 | Hey,\n\nGot base model max studio (m1 max , 32... | M1 Mac crashing | 1653115690 |
| 4 | DotA2 |  | Null Talisman meta is getting out of hand | 1653115596 |


### <font color = 'Maroon'> Updating Dota2 Posts to retrieve 4200 posts

In [69]:
#Define function to pull the next batch of posts older than the last post collected so far and append it to the DataFrame
def update_df(baseposts_df, subreddit):
    params = get_params(baseposts_df, subreddit)
    posts = get_posts(params)
    df2 = create_new_df(posts)
    updated = pd.concat([baseposts_df, df2], axis=0, ignore_index=True, sort=True)
    return updated
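
The pagination idea behind `update_df` can be illustrated offline: each pull asks only for posts older than the oldest `created_utc` already collected, so successive batches walk backwards in time. The sketch below is a simplified model under stated assumptions. `fake_fetch` and `collect` are hypothetical names, the archive is synthetic, and `before` is assumed to be an exclusive filter that returns newest posts first.

```python
def fake_fetch(all_posts, before=None, size=100):
    """Hypothetical stand-in for the API: newest-first posts older than `before`."""
    older = [p for p in all_posts if before is None or p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]

def collect(all_posts, pages, size=100):
    """Pull `pages` batches, using the oldest timestamp seen so far as the cursor."""
    collected = []
    before = None
    for _ in range(pages):
        batch = fake_fetch(all_posts, before=before, size=size)
        if not batch:          # nothing older is left, so stop early
            break
        collected.extend(batch)
        before = batch[-1]["created_utc"]  # oldest post in this batch
    return collected

# 1,000 synthetic posts, one per second
archive = [{"id": i, "created_utc": 1_653_000_000 + i} for i in range(1000)]
posts = collect(archive, pages=5, size=100)
print(len(posts), len({p["id"] for p in posts}))
```

One caveat of this scheme, if the real filter is exclusive as assumed here: posts that share the exact cursor timestamp across a page boundary could be skipped, so checking for gaps or duplicates by post `id` after collection is a reasonable precaution.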

In [70]:
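#Loop the update_df function 41 times; together with the initial pull this gives roughly 4,200 posts
for i in range(41):
    dota_df = update_df(dota_df, 'DotA2')
    #Print the shape at a few checkpoints to track progress (the final shape is displayed after the loop)
    if i in [10, 20, 30]:
        print(dota_df.shape)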

dota_df.shape

(1199, 85)
(2199, 85)
(3199, 87)


(4199, 87)

### <font color = 'Maroon'> Retrieving League of Legends Posts

In [71]:
params_lol = {
    'subreddit': 'leagueoflegends', 
    'size': 100
}

In [72]:
posts_lol = get_posts(params_lol)
lol_df = create_new_df(posts_lol)
lol_df.shape

(100, 76)

In [73]:
lol_df.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | author_flair_text_color | media_metadata | media | media_embed | secure_media | secure_media_embed | suggested_sort | author_flair_background_color | crosspost_parent | crosspost_parent_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | arisasam |  | [] |  | text | t2_13ubtt | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | [] | False | Massive_Dependent_63 |  | [] |  | text | t2_dn33zpat | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | [] | False | culaina001 |  | [] |  | text | t2_nffafzna | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | [] | False | Mertcun |  | [] |  | text | t2_7eds41ke | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | [] | False | SolubilityRules |  | [] |  | text | t2_2lolx4d | False | False | ... |  |  |  |  |  |  |  |  |  |  |


In [74]:
lol_df[['subreddit', 'selftext', 'title', 'created_utc']].head()

| | subreddit | selftext | title | created_utc |
|---|---|---|---|---|
| 0 | leagueoflegends | [removed] | My friend and I got fresh level 30 accts; afte... | 1653117360 |
| 1 | leagueoflegends | So in my gold tier tokens I had death incarnat... | Question about tokens (the little badges you c... | 1653117306 |
| 2 | leagueoflegends |  | Road to 1 month | 1653117055 |
| 3 | leagueoflegends |  | Any explain? | 1653117019 |
| 4 | leagueoflegends | I'm a T1 fan, and I get it. The team dominated... | T1 members proclaiming victory even before get... | 1653116857 |


### <font color = 'Maroon'> Updating League of Legends Posts to retrieve 4200 posts

In [75]:
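#Loop the update_df function 41 times; together with the initial pull this gives roughly 4,200 posts
for i in range(41):
    lol_df = update_df(lol_df, 'leagueoflegends')
    #Print the shape at a few checkpoints to track progress (the final shape is displayed after the loop)
    if i in [10, 20, 30]:
        print(lol_df.shape)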

lol_df.shape

(1200, 80)
(2199, 80)
(3199, 80)


(4199, 80)

### <font color = 'Maroon'> Saving the DataFrames to csv file

In [76]:
dota_df.to_csv('dota.csv')
lol_df.to_csv('lol.csv')