# Project 3 - Reddit classification challenge

## Part 1 - Web scraping and cleaning

### Contents:
- [Problem statement and objectives](#Problem-statement-and-objectives)
- [Web scraping](#Web-scraping)
- [Scraped data cleaning](#Scraped-data-cleaning)
- [Data storage](#Data-storage)

### Problem statement and objectives

**Problem statement**

- In recent years, forums, discussions and comment threads have surged across social media channels, covering a wide range of topics.
- As the variety of discussion topics has grown, the data science team has realised that compiled discussions and comments are more meaningful when they are broken down into sub-topics. For example, rather than categorising diet-related topics broadly as "diet", a finer breakdown such as 'plant-based', 'kosher' or 'paleo' is more useful.

**Objectives**

- To develop a classification model that can assign netizens' comments and discussions to subtopics.

- The model should be able to distinguish between discussions and comments on closely related topics. For this reason, the data science team chose two streaming services, Netflix and AmazonPrimeVideo, to train the initial model. Data is extracted from subreddits because Reddit has an active user base.

- Candidate models will be shortlisted on considerations such as accuracy score, strengths and limitations, and the selected model will be trained further to cover more subtopics.

- Once deployed, the model may help any user, such as a data scientist or a marketing team, to see which subtopic a discussion or comment belongs to.




In [1]:
# basics
import pandas as pd
import numpy as np

# web scraping
import requests
import json
from bs4 import BeautifulSoup

# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# modelling
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score,mean_squared_error,confusion_matrix,accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import time
import re
import warnings
warnings.simplefilter(action = 'ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)



### Web scraping

In [2]:
url = "https://api.pushshift.io/reddit/search/submission"

Create a function to scrape submissions from a subreddit via the Pushshift API

In [3]:
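def get_content(subreddit, n_iter):
    """Scrape up to n_iter pages of 100 submissions, paging backwards in time."""
    df_list = []
    current_time = 1623491323  # Unix timestamp (2021-06-12 UTC) used as the starting cut-off

    for _ in range(n_iter):
        res = requests.get(
            url,
            params={
                "subreddit": subreddit,
                "size": 100,
                "before": current_time
            }
        )
        res.raise_for_status()  # stop on an HTTP error instead of parsing an error response
        time.sleep(3)  # pause between requests
        df = pd.DataFrame(res.json()['data'])
        if df.empty:  # no older submissions left
            break
        df_list.append(df)
        # the oldest submission on this page becomes the cut-off for the next request
        current_time = df.created_utc.min()

    return pd.concat(df_list, axis=0)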
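
The paging logic in `get_content` can be exercised without network access by separating it from the HTTP call. The sketch below is illustrative only: `paginate_backwards`, `fake_fetch` and `FAKE_POSTS` are hypothetical names, not part of the notebook or of any API, and the fake data simply mimics records that carry a `created_utc` field.

```python
from typing import Callable, Dict, List


def paginate_backwards(
    fetch_page: Callable[[int], List[Dict]],
    start_time: int,
    n_iter: int,
) -> List[Dict]:
    """Collect up to n_iter pages, walking backwards in time.

    fetch_page(before) is assumed to return records created strictly
    before the given Unix timestamp, each with a 'created_utc' key.
    """
    records: List[Dict] = []
    before = start_time
    for _ in range(n_iter):
        page = fetch_page(before)
        if not page:  # nothing older is available, so stop early
            break
        records.extend(page)
        # The oldest timestamp on this page becomes the next cut-off.
        before = min(item["created_utc"] for item in page)
    return records


# Hypothetical in-memory stand-in for the HTTP request.
FAKE_POSTS = [{"title": f"post {t}", "created_utc": t} for t in range(1, 8)]


def fake_fetch(before: int, size: int = 3) -> List[Dict]:
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    # Newest first, at most `size` records per page.
    return sorted(older, key=lambda p: p["created_utc"], reverse=True)[:size]


posts = paginate_backwards(fake_fetch, start_time=100, n_iter=10)
print(len(posts))
```

Stopping on an empty page means the loop ends cleanly when a subreddit has fewer posts than `n_iter` pages would cover.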

Scraping contents from Netflix subreddit

In [4]:
netflix_df = get_content('netflix', 10)[['title','selftext','subreddit']]

In [5]:
# test

netflix_df.head()

Unnamed: 0,title,selftext,subreddit
0,Heman,,netflix
1,Help me find a episode,I'm looking for something I saw many years ago...,netflix
2,Netflix won’t work,I have noticed that Netflix refuses to launch ...,netflix
3,Army of The Dead 2,army of the dead 2 will be set in new york cit...,netflix
4,Believe me: The Abduction of Lisa Mcvey,Sorry little rant but holy shit those 2 female...,netflix


Preliminary cleaning: replace NaN with empty strings

In [6]:
netflix_df['title'] = netflix_df['title'].fillna('')

In [7]:
netflix_df['selftext'] = netflix_df['selftext'].fillna('')
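
Both text columns can also be filled in a single step. A minimal sketch on a made-up two-row frame (`demo_df` and its contents are illustrative, not taken from the scrape):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the scraped frame.
demo_df = pd.DataFrame(
    {
        "title": ["Heman", np.nan],
        "selftext": [np.nan, "some body text"],
        "subreddit": ["netflix", "netflix"],
    }
)

text_cols = ["title", "selftext"]
# Assigning the result back avoids relying on in-place modification
# of a column selection.
demo_df[text_cols] = demo_df[text_cols].fillna("")

print(demo_df.isna().sum().sum())
```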

Scraping contents from AmazonPrimeVideo subreddit

In [8]:
az_df = get_content('AmazonPrimeVideo', 10)[['title','selftext','subreddit']]

In [9]:
az_df.head(10)

Unnamed: 0,title,selftext,subreddit
0,Any Stack TV subscribers here?,[removed],AmazonPrimeVideo
1,chromecast,[removed],AmazonPrimeVideo
2,Using digital reward credit for channel subscr...,I've got some digital rewards for no rush ship...,AmazonPrimeVideo
3,Infinite loading during ads.,[removed],AmazonPrimeVideo
4,Infinite loading screen whenever an ad tries t...,,AmazonPrimeVideo
5,Anyone Know What Time Clarkson's Farm On Prime?,Hi anyone know what time (UK or EST) Clarkson'...,AmazonPrimeVideo
6,Confused,"So, I'm a bit confused here. I am watching YuG...",AmazonPrimeVideo
7,"So, channel subscriptions do not include all t...",I've been rewatching the Mythbusters series an...,AmazonPrimeVideo
8,Why I Have Mixed Feelings Over The Family Man ...,,AmazonPrimeVideo
9,The creepiest movie from my childhood is on Pr...,It's called The Adventures of Mark Twain. It's...,AmazonPrimeVideo


In [10]:
az_df['title'] = az_df['title'].fillna('')

In [11]:
az_df['selftext'] = az_df['selftext'].fillna('')

### Scraped data cleaning

Create a function for further cleaning, i.e. removing symbols, numbers, etc.

In [12]:
def clean_text(text):
    """Lower-case the text and reduce it to space-separated alphabetic tokens."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)         # bracketed text, e.g. [removed], [deleted]
    text = re.sub(r'\w*\d\w*', '', text)        # any token containing a digit
    text = re.sub(r'[".]', '', text)            # double quotes and full stops
    text = re.sub(r'\n', ' ', text)             # newlines become spaces so adjacent words are not fused
    text = re.sub(r'[^a-zA-Z0-9]+', ' ', text)  # every remaining run of symbols becomes one space
    text = text.replace('http', '')             # leftover URL scheme prefix

    return text
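
To see what the main substitutions do, the sketch below traces a made-up sample string through a subset of the steps in `clean_text`. The `STEPS` list, `sample` and `trace` are names introduced here for illustration only.

```python
import re

# Patterns taken from clean_text above, written as raw strings.
STEPS = [
    ("drop bracketed text", r"\[.*?\]", ""),
    ("drop tokens containing a digit", r"\w*\d\w*", ""),
    ("collapse other characters to a space", r"[^a-zA-Z0-9]+", " "),
]

sample = "Army of The Dead 2 [removed] Won't work!"
text = sample.lower()
trace = [text]
for label, pattern, repl in STEPS:
    text = re.sub(pattern, repl, text)
    trace.append(text)
    print(f"{label!r}: {text!r}")
```

The apostrophe is replaced by a space in the last step, which is why contractions appear split ("won t") in the cleaned tables.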

In [13]:
netflix_df['selftext'] = netflix_df['selftext'].apply(clean_text)

In [14]:
netflix_df['title'] = netflix_df['title'].apply(clean_text)

In [15]:
az_df['selftext'] = az_df['selftext'].apply(clean_text)

In [16]:
az_df['title'] = az_df['title'].apply(clean_text)

In [17]:
#check

netflix_df.head()

Unnamed: 0,title,selftext,subreddit
0,heman,,netflix
1,help me find a episode,i m looking for something i saw many years ago...,netflix
2,netflix won t work,i have noticed that netflix refuses to launch ...,netflix
3,army of the dead,army of the dead will be set in new york city ...,netflix
4,believe me the abduction of lisa mcvey,sorry little rant but holy shit those female p...,netflix


In [18]:
az_df.head()

Unnamed: 0,title,selftext,subreddit
0,any stack tv subscribers here,,AmazonPrimeVideo
1,chromecast,,AmazonPrimeVideo
2,using digital reward credit for channel subscr...,i ve got some digital rewards for no rush ship...,AmazonPrimeVideo
3,infinite loading during ads,,AmazonPrimeVideo
4,infinite loading screen whenever an ad tries t...,,AmazonPrimeVideo


After cleaning, concatenate title and selftext

In [19]:
netflix_df['title_text'] = netflix_df['title'] + ' ' + netflix_df['selftext']
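
When `selftext` is empty, this concatenation leaves a trailing space (for example `'heman '`). If that matters downstream, one option is to strip the result. A small sketch on made-up rows, where `demo_df` is an illustrative stand-in for the cleaned frame:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "title": ["heman", "netflix won t work"],
        "selftext": ["", "i have noticed that netflix refuses to launch"],
    }
)

# Join the two columns with a space, then trim leading/trailing whitespace.
demo_df["title_text"] = (
    demo_df["title"].str.cat(demo_df["selftext"], sep=" ").str.strip()
)

print(demo_df["title_text"].tolist())
```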

In [20]:
netflix_df.head()

Unnamed: 0,title,selftext,subreddit,title_text
0,heman,,netflix,heman
1,help me find a episode,i m looking for something i saw many years ago...,netflix,help me find a episode i m looking for somethi...
2,netflix won t work,i have noticed that netflix refuses to launch ...,netflix,netflix won t work i have noticed that netflix...
3,army of the dead,army of the dead will be set in new york city ...,netflix,army of the dead army of the dead will be set...
4,believe me the abduction of lisa mcvey,sorry little rant but holy shit those female p...,netflix,believe me the abduction of lisa mcvey sorry l...


In [21]:
az_df['title_text'] = az_df['title'] + ' ' + az_df['selftext']

In [22]:
az_df.head()

Unnamed: 0,title,selftext,subreddit,title_text
0,any stack tv subscribers here,,AmazonPrimeVideo,any stack tv subscribers here
1,chromecast,,AmazonPrimeVideo,chromecast
2,using digital reward credit for channel subscr...,i ve got some digital rewards for no rush ship...,AmazonPrimeVideo,using digital reward credit for channel subscr...
3,infinite loading during ads,,AmazonPrimeVideo,infinite loading during ads
4,infinite loading screen whenever an ad tries t...,,AmazonPrimeVideo,infinite loading screen whenever an ad tries t...


### Data storage

Save the cleaned DataFrames as CSV files

In [23]:
#save the cleaned-up r/netflix file 

netflix_df.to_csv(r'C:\Users\Leemei\Data Science\GA\projects\project_3\datasets\netflix(cleanup).csv', index = False)

In [24]:
#save the cleaned-up r/AmazonPrimeVideo file 

az_df.to_csv(r'C:\Users\Leemei\Data Science\GA\projects\project_3\datasets\amazonprimevideo(cleanup).csv', index = False)
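
One thing to watch when these files are read back in a later notebook: pandas writes empty strings as empty fields and, by default, reads empty fields back as NaN. The sketch below uses a temporary directory and a made-up file name (`demo.csv`) to show the round trip; passing `keep_default_na=False` is one way to keep the empty strings, at the cost of also keeping literal values such as "NA" as text.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo_df = pd.DataFrame(
    {"title": ["chromecast"], "selftext": [""], "subreddit": ["AmazonPrimeVideo"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo.csv"  # hypothetical file name
    demo_df.to_csv(csv_path, index=False)

    # Default behaviour: the empty selftext field comes back as NaN.
    default_read = pd.read_csv(csv_path)
    # With keep_default_na=False the empty field stays an empty string.
    kept_empty = pd.read_csv(csv_path, keep_default_na=False)

print(default_read["selftext"].isna().tolist())
print(kept_empty["selftext"].tolist())
```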