<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP 
#### Binary Classification of Subreddit Posts

 - [Problem Statement](#Problem-Statement)
 - [Background Information](#Background-Information)
 - [Methodology](#Methodology)
 - [Datasets](#Datasets)
 - [Data Import from Reddit & Cleaning](#Data-Import-from-Reddit-&-Cleaning)
 - [Exploratory Data Analysis](#Exploratory-Data-Analysis)
 - [Modeling & Feature Engineering](#Modeling-&-Feature-Engineering)
 - [Significant Findings](#Significant-Findings)
 - [Conclusion](#Conclusion)
 - [Recommendations](#Recommendations)
 - [Citations and Sources](#Citations-and-Sources)

## Problem Statement

As data scientists engaged by Propnex, we are tasked with developing an accurate model to predict house sale prices in Ames, Iowa, based on the given fixed house characteristics. We will approach this analysis by first modeling the sale prices of houses with the default features from the dataset, and then performing feature selection to optimize model performance.


## Background Information

Many real estate organizations have traditionally relied on a mix of intuition and traditional, retrospective statistics to make pricing decisions ([*source1*](https://www.mckinsey.com/industries/real-estate/our-insights/getting-ahead-of-the-market-how-big-data-is-transforming-real-estate)). Today, a slew of new variables, such as proximity to points of interest and the existence of environmental pollution, allow for more vivid depictions of a location's future risks and benefits ([*source2*](https://www.redirectconsulting.com/blog/big-data-in-real-estate-3-important-non-traditional-data-sets-to-consider)).

The sweet spot between density and proximity to community amenities differs by American city and even by neighborhood, and it is obscured by an ever-growing volume of data that is increasingly difficult to manage. For instance, while the impact of proximity to places of interest on property price is obvious, housing values are also influenced by the number, mix, and quality of the community amenities that surround them ([*source1*](https://www.mckinsey.com/industries/real-estate/our-insights/getting-ahead-of-the-market-how-big-data-is-transforming-real-estate)).

These nonlinear interactions can be found in a variety of American cities. Machine learning is therefore one way to stitch these data together through sophisticated analytics, making complex relationships substantially easier to comprehend.

A successful data-driven approach can yield powerful insights. In this analysis, we will attempt to develop an accurate and reliable machine learning model to forecast housing prices in Ames, Iowa, based on the given fixed house characteristics.

## Methodology 

Following Blitzstein & Pfister’s workflow ([*source3*](https://github.com/cs109/2015/blob/master/Lectures/01-Introduction.pdf)), a five-step framework was implemented to conduct this analysis. The five steps are:

**Step 1: Ask an interesting question**
- Define a clear and concise problem statement.

**Step 2: Get the data**
- Import and clean the raw data to ensure that all datatypes are accurate and any other errors are fixed
- Explore the best method to fill null values, if applicable

**Step 3: Explore the data**
- Differentiate numerical and categorical features in the dataset
- For categorical features, analyze if they are nominal or ordinal features
- Transform ordinal features to numerical ranks
- Perform exploratory data analysis to determine any meaningful correlations
- Deal with outliers
- Perform feature engineering
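
As an illustration of the ordinal-to-rank step above, here is a minimal sketch using a plain dictionary mapping in pandas. The column name `quality` and the label order in `quality_rank` are assumptions made up for this example, not taken from the project data.

```python
import pandas as pd

# Hypothetical ordinal feature; the labels and their order are assumed.
df = pd.DataFrame({"quality": ["Fa", "TA", "Gd", "Ex", "TA"]})

# Explicit rank mapping from worst (1) to best (5).
quality_rank = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Replace each label with its numerical rank.
df["quality_rank"] = df["quality"].map(quality_rank)
print(df)
```

An explicit dictionary keeps the intended order visible and avoids relying on alphabetical ordering of the labels.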

**Step 4: Model the data**
- Create a baseline model with Linear Regression
- Perform feature selection and feature engineering to optimize model performance
- Select the best machine learning model for submission
- Data Visualization
  - subplots
  - histograms
  - scatterplots
  - boxplots

**Step 5: Communicate and visualize the results**
- Present findings to a non-technical audience and provide recommendations

## Datasets

* [`real_advise.csv`](../datasets/real_advise.csv): This dataset contains real relationship advice. It will be split for training and testing purposes.
* [`useless_advise.csv`](../datasets/useless_advise.csv): This dataset contains useless advice. It will be split for training and testing purposes.
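
A minimal sketch of how the two datasets could be combined, labeled, and split is shown below. The rows are hypothetical stand-ins for the CSV contents, and the `label` column, the 25% test size, and the random seed are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for real_advise.csv and useless_advise.csv.
real = pd.DataFrame({
    "subreddit": ["relationship_advice"] * 8,
    "title": [f"real post {i}" for i in range(8)],
})
useless = pd.DataFrame({
    "subreddit": ["LearnUselessTalents"] * 8,
    "title": [f"useless post {i}" for i in range(8)],
})

# Stack both sources and derive a binary target from the subreddit name.
posts = pd.concat([real, useless], ignore_index=True)
posts["label"] = (posts["subreddit"] == "relationship_advice").astype(int)

# Stratify so both classes keep the same proportion in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    posts["title"],
    posts["label"],
    test_size=0.25,
    stratify=posts["label"],
    random_state=42,
)
print(len(X_train), len(X_test))
```

Stratifying on the label is one reasonable choice here because it keeps the class balance comparable between the training and testing sets.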

## Data Import from Reddit & Cleaning

#### Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import requests
import datetime as dt
from pmaw import PushshiftAPI

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, mean_squared_error
from sklearn.pipeline import Pipeline

#### Using Pushshift API to extract 8000 posts from subreddit r/relationship_advice

In [4]:
# api = PushshiftAPI()

# subreddit = 'relationship_advice'
# limit = 8000
# before = int(dt.datetime(2022,1,1,0,0).timestamp())
# after = int(dt.datetime(2010,1,1,0,0).timestamp())

# real_advise_data = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after)
# print(f'Retrieved {len(real_advise_data)} submissions from Pushshift')

Retrieved 0 submissions from Pushshift
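
One detail worth noting about the `before` and `after` values: calling `.timestamp()` on a naive `datetime` interprets it in the local timezone of the machine, so the query window can shift between environments. A minimal sketch that pins the boundaries to UTC follows; the helper name `utc_epoch` is hypothetical and not part of `pmaw`.

```python
import datetime as dt

def utc_epoch(year, month, day):
    """Return the Unix timestamp (whole seconds) for midnight UTC on a date."""
    aware = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(aware.timestamp())

# Same window as the cell above, expressed in UTC.
before = utc_epoch(2022, 1, 1)
after = utc_epoch(2010, 1, 1)
print(after, before)
```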


In [6]:
real_advise_df = pd.DataFrame(real_advise_data)

In [93]:
real_advise_df = real_advise_df[['subreddit', 'selftext', 'title']]

In [94]:
real_advise_df.shape

(10000, 3)

In [95]:
real_advise_df.isnull().sum()

subreddit     0
selftext     18
title         0
dtype: int64

In [96]:
real_advise_df.to_csv('real_advise.csv')

#### Using Pushshift API to extract 8000 posts from subreddit r/LearnUselessTalents

In [7]:
api = PushshiftAPI()

subreddit = 'LearnUselessTalents'
limit = 100
before = int(dt.datetime(2021,2,1,0,0).timestamp())
after = int(dt.datetime(2020,12,1,0,0).timestamp())


useless_advise_data = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after)
print(f'Retrieved {len(useless_advise_data)} submissions from Pushshift')

Retrieved 0 submissions from Pushshift


In [123]:
useless_advise_df = pd.DataFrame(useless_advise_data)

In [124]:
useless_advise_df

In [102]:
useless_advise_df = useless_advise_df[['subreddit', 'selftext', 'title']]

KeyError: "None of [Index(['subreddit', 'selftext', 'title'], dtype='object')] are in the [columns]"

In [103]:
useless_advise_df.shape

(0, 0)

In [104]:
useless_advise_df.isnull().sum()

Series([], dtype: float64)

In [105]:
useless_advise_df.to_csv('useless_advise.csv')
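
The `KeyError` above occurs because the request returned no submissions, so the DataFrame has no columns to select. A minimal sketch of a guard is shown below, assuming the API response can be iterated as a sequence of dictionaries; the helper name `to_post_frame` and the sample record are hypothetical.

```python
import pandas as pd

EXPECTED_COLUMNS = ["subreddit", "selftext", "title"]

def to_post_frame(records):
    """Build a DataFrame with the expected columns, even if records is empty."""
    frame = pd.DataFrame(list(records))
    # reindex adds any missing column (filled with NaN) instead of raising KeyError.
    return frame.reindex(columns=EXPECTED_COLUMNS)

# An empty response yields an empty frame that still has the three columns.
empty = to_post_frame([])
print(empty.shape)  # (0, 3)

# Skip the export when nothing was retrieved, rather than writing an empty file.
if empty.empty:
    print("No submissions retrieved; skipping CSV export")

# Hypothetical record with one extra field, which reindex drops.
sample = to_post_frame([
    {"subreddit": "LearnUselessTalents", "selftext": "", "title": "example", "score": 1}
])
print(sample.columns.tolist())
```

Checking `frame.empty` before calling `to_csv` avoids overwriting a previously saved dataset with an empty file when the API returns nothing.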