# <u>Project 3: Text Classification of Real Estate vs Stock Investments</u>

## Problem Statement

This project aims to build a text classifier that assigns a given post to one of two categories: Real Estate or Stocks investment. We will explore 2 vectorisers (CountVectorizer and TF-IDF) to create a feature vocabulary and evaluate 3 classifiers: Logistic Regression, Multinomial Naive Bayes and Support Vector Classifier. We will use the Matthews Correlation Coefficient (MCC) as the primary metric to select the best model, i.e. the one that misclassifies the fewest posts.
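
As a rough illustration of the Matthews Correlation Coefficient, the sketch below computes it directly from the four cells of a binary confusion matrix. The function name and the counts are hypothetical and are not results from this project; the formula is the standard definition, with 0.0 returned by convention when the denominator is zero.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal total is zero
    return numerator / denominator if denominator else 0.0

# Hypothetical counts, for illustration only
perfect = mcc_from_counts(tp=50, tn=50, fp=0, fn=0)    # every post classified correctly
inverted = mcc_from_counts(tp=0, tn=0, fp=50, fn=50)   # every post classified wrongly
mixed = mcc_from_counts(tp=45, tn=40, fp=10, fn=5)     # 15 of 100 posts misclassified

print(perfect, inverted, round(mixed, 3))
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and it uses all four cells of the confusion matrix rather than only the correct predictions.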

## Executive Summary

Launched in 2016, Seedly helps users make smarter financial decisions with its expense tracking app which allows users to sync their financial accounts and better manage their cash-flow.

Over the years, we've introduced a community feature which allows users to crowdsource knowledge from peers before making a financial decision; an unbiased reviews platform for a myriad of products ranging from travel insurance to robo-advisors; as well as comparison tools for the open electricity market and SIM-only mobile plans. As the number of topics expands in line with the number of readers, there is a growing need to classify the questions and display them in the most appropriate section within the forum.

The data science team at Seedly will address this challenge by building a binary classification model. Focusing on Seedly's core offerings, we will classify posts as either Real Estate or Stocks investments. Because the data is easy to extract through an API, we will mine the 2 relevant subreddits to create a corpus to train our model on. We will then parse and lemmatize the data before creating the feature vocabulary. After fitting and tuning the model, we will evaluate the trained model on actual blind queries from the Seedly.SG discussion forums. Finally, we will create a web app to test out the delivery of the model.

In [27]:
# Importing the standard libraries
import pandas as pd
import numpy as np
import requests
import time
import random

## Data Collection


In [28]:
# Assign the URLs to the 2 subreddits
url_fin = 'https://www.reddit.com/r/stocks.json'
url_prop = 'https://www.reddit.com/r/realestateinvesting.json'

In [29]:
# Function that pages through a subreddit's JSON listing and saves the posts to a CSV file
def reddit_grabber(url_to_use, csv_filename):
    posts = []
    after = None
    total_post_counter = 0
    num_of_cycles = 40

    for a in range(num_of_cycles):
        # The first request has no cursor; later requests pass the previous page's 'after' token
        if after is None:
            current_url = url_to_use
        else:
            current_url = url_to_use + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'orange black'})

        # Stop scraping on any non-OK response
        if res.status_code != 200:
            print('Status error: ', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        total_post_counter += len(current_posts)
        print("Progess @ cycle",a+1,"/",num_of_cycles,':',total_post_counter,"posts")
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # Checkpoint to CSV after every cycle.
        # Note: `posts` is cumulative, so from the second cycle onwards this
        # re-appends every post scraped so far to the rows already on disk.
        # The resulting duplicate rows are dropped after the scrape.
        if a > 0:
            prev_posts = pd.read_csv('data/' + csv_filename +'.csv')
            current_df = pd.DataFrame(posts)
            pd.concat([prev_posts, current_df], axis=0).to_csv('data/' + csv_filename +'.csv', index = False)

        else:
            pd.DataFrame(posts).to_csv('data/' + csv_filename +'.csv', index = False)

        # Wait a random 3 to 9 seconds before the next request
        rand_time = random.randint(3,9)
        time.sleep(rand_time)

        # A missing 'after' token means the last page has been reached
        if after is None:
            print("-- END OF SCRAPE --")
            break
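
The cursor-based paging used by `reddit_grabber` can also be sketched without touching the network. The version below is a minimal illustration, not the function used in this notebook: `scrape_listing`, the `fetch_page` callable, the `example.com` URLs and the fake pages are all hypothetical, and it assumes each post carries an `id` field. It follows the `after` token from page to page and keeps each post only once.

```python
def scrape_listing(base_url, fetch_page, max_pages=40):
    """Collect posts page by page, keeping each post exactly once.

    fetch_page is a hypothetical callable that takes a URL and returns the
    parsed listing as a dict: {'data': {'children': [...], 'after': ...}}.
    """
    posts = []
    seen_ids = set()
    after = None
    for _ in range(max_pages):
        # First request has no cursor; later ones pass the 'after' token
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch_page(url)["data"]
        for child in listing["children"]:
            post = child["data"]
            if post["id"] not in seen_ids:  # skip posts already collected
                seen_ids.add(post["id"])
                posts.append(post)
        after = listing["after"]
        if after is None:  # no cursor means this was the last page
            break
    return posts

# Fake two-page listing (made-up data); post 'a2' appears on both pages
FAKE_PAGES = {
    "https://example.com/r/demo.json": {
        "data": {
            "children": [{"data": {"id": "a1"}}, {"data": {"id": "a2"}}],
            "after": "t3_a2",
        }
    },
    "https://example.com/r/demo.json?after=t3_a2": {
        "data": {
            "children": [{"data": {"id": "a2"}}, {"data": {"id": "a3"}}],
            "after": None,
        }
    },
}

demo_posts = scrape_listing("https://example.com/r/demo.json", FAKE_PAGES.__getitem__)
print([p["id"] for p in demo_posts])
```

With a real subreddit, `fetch_page` could wrap `requests.get(url, headers=...).json()`, and the collected list could be written to CSV once at the end instead of after every cycle.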
        

In [30]:
# Mining the stocks subreddit
reddit_grabber(url_fin,'fin')

https://www.reddit.com/r/stocks.json
Progess @ cycle 1 / 40 : 27 posts
https://www.reddit.com/r/stocks.json?after=t3_ikk5ly
Progess @ cycle 2 / 40 : 52 posts
https://www.reddit.com/r/stocks.json?after=t3_ikm4el
Progess @ cycle 3 / 40 : 77 posts
https://www.reddit.com/r/stocks.json?after=t3_ikq0r7
Progess @ cycle 4 / 40 : 102 posts
https://www.reddit.com/r/stocks.json?after=t3_ikn7jh
Progess @ cycle 5 / 40 : 127 posts
https://www.reddit.com/r/stocks.json?after=t3_ik1emy
Progess @ cycle 6 / 40 : 152 posts
https://www.reddit.com/r/stocks.json?after=t3_ikeek2
Progess @ cycle 7 / 40 : 177 posts
https://www.reddit.com/r/stocks.json?after=t3_ijvnu5
Progess @ cycle 8 / 40 : 202 posts
https://www.reddit.com/r/stocks.json?after=t3_ika3y6
Progess @ cycle 9 / 40 : 227 posts
https://www.reddit.com/r/stocks.json?after=t3_ik66f0
Progess @ cycle 10 / 40 : 252 posts
https://www.reddit.com/r/stocks.json?after=t3_ijnu0w
Progess @ cycle 11 / 40 : 277 posts
https://www.reddit.com/r/stocks.json?after=t3_ijm

In [31]:
# Mining the real estate subreddit
reddit_grabber(url_prop,'prop')

https://www.reddit.com/r/realestateinvesting.json
Progess @ cycle 1 / 40 : 25 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_iki91a
Progess @ cycle 2 / 40 : 50 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_ik7r2i
Progess @ cycle 3 / 40 : 75 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_ij5php
Progess @ cycle 4 / 40 : 100 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_iiwklm
Progess @ cycle 5 / 40 : 125 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_iidhkn
Progess @ cycle 6 / 40 : 150 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_ihvt4y
Progess @ cycle 7 / 40 : 175 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_igyvgj
Progess @ cycle 8 / 40 : 200 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_igjlce
Progess @ cycle 9 / 40 : 225 posts
https://www.reddit.com/r/realestateinvesting.json?after=t3_ig2ipw
Progess @ cycle 10 / 40 : 250 posts
https://

#### The saved CSV files contain far more rows than the number of posts scraped.

This happens because, from the second cycle onwards, `reddit_grabber` appends the full cumulative list of posts to the rows already saved, so earlier posts are written to the file repeatedly. We will remove the duplicate rows (matching on the title of each post) to clean up the dataframes.
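
The inflated row counts can be checked with a little arithmetic. The sketch below counts the rows that end up on disk when every cycle appends the cumulative list of posts to the file. The helper name and the page sizes are assumptions: the log shows 25 posts per page for the real estate subreddit, and 32 full pages plus a final page of 19 is one combination consistent with the shape of `data/prop.csv`.

```python
def rows_on_disk(page_sizes):
    """Rows written to the CSV when each cycle appends the cumulative post list."""
    scraped_so_far = 0
    rows = 0
    for size in page_sizes:
        scraped_so_far += size   # posts held in memory after this cycle
        rows += scraped_so_far   # all of them are appended to the file again
    return rows

# Assumed page sizes for the real estate scrape (not taken from the full log)
prop_pages = [25] * 32 + [19]

print(sum(prop_pages), rows_on_disk(prop_pages))  # prints: 819 14019
```

Under these assumptions the scrape yields 819 posts but 14,019 rows, which matches the shape read from `data/prop.csv` and explains why dropping duplicates shrinks the dataframe so sharply.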

In [170]:
df = pd.read_csv('data/fin.csv')
df.shape

(8806, 104)

In [171]:
df.drop_duplicates(subset=['title'],inplace=True)
df.shape

(630, 104)

In [172]:
df2 = pd.read_csv('data/prop.csv')
df2.shape

(14019, 112)

In [173]:
df2.drop_duplicates(subset=['title'],inplace=True)
df2.shape

(818, 112)

## Pre-cleaning

In [183]:
new_df1 = df.loc[:,['selftext','title','subreddit']]
print(new_df1.shape)
new_df1.head()

(630, 3)


| | selftext | title | subreddit |
|---|---|---|---|
| 0 | Please use this thread to discuss your portfol... | Rate My Portfolio - r/Stocks Quarterly Thread ... | stocks |
| 1 | This is the daily discussion, so anything stoc... | r/Stocks Daily Discussion &amp; Technicals Tue... | stocks |
| 2 | It shall be interesting to see how the market ... | Tesla to sell up to $5 billion in stock amid r... | stocks |
| 3 | Zoom has added close to **$40 billion** to its... | Zoom's market cap dashes past Lowes, Phillip M... | stocks |
| 4 | \n- TSLA: up more than 10x in the past 12 mont... | Which stocks do you think will have explosive ... | stocks |


##### There are more posts in the Real Estate category (818 vs 630). To keep the dataset balanced, we will keep only the Real Estate posts whose titles are longer than 30 characters.

In [184]:
new_df2 = df2.loc[df2['title'].apply(lambda x: len(x) > 30),['selftext','title','subreddit']]
print(new_df2.shape)
new_df2.head()

(623, 3)


| | selftext | title | subreddit |
|---|---|---|---|
| 0 | [https://www.businessinsider.com/trump-admini... | The Trump administration is moving to implemen... | realestateinvesting |
| 2 | Hello everyone, first time poster here. I'm lo... | Unconventional idea feedback: Investing in Sma... | realestateinvesting |
| 3 | I know this question gets asked a lot, but I'l... | Has anyone had experience with buying land and... | realestateinvesting |
| 5 | let's say you're in NY or NJ or Cali would you... | would you move to a lower cost of state for re... | realestateinvesting |
| 7 | Hello, \n\n&amp;#x200B;\n\nI am a relative new... | Separate Bank Account Question? | realestateinvesting |


### Combine the text columns into one column and label-encode the subreddits
We will then encode **'stocks'** posts as **1** and **'realestateinvesting'** posts as **0**. The assignment is arbitrary, as neither class needs to be treated as the primary (positive) class in our case.

In [185]:
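# combine the cols into a 'rawtext' col
# (fillna('') keeps posts with an empty selftext, which would otherwise give a NaN rawtext)
new_df1['rawtext'] = new_df1.selftext.fillna('') + ' ' + new_df1.title
new_df2['rawtext'] = new_df2.selftext.fillna('') + ' ' + new_df2.title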

# label the subreddits as 1 and 0
new_df1.subreddit = new_df1.subreddit.map({'stocks':1})
new_df2.subreddit = new_df2.subreddit.map({'realestateinvesting':0})

In [186]:
# drop the unwanted cols
new_df1.drop(columns=['selftext','title'],inplace=True)
new_df2.drop(columns=['selftext','title'],inplace=True)

### Combine both dataframes and save the result to a separate CSV file for modelling
[Click here to access the modelling notebook](02_Modelling.ipynb)


In [192]:
pd.concat([new_df1,new_df2], axis=0).to_csv('data/final.csv', index=False)