# Project 3: Web APIs & NLP

### By Alex Lau

## 1. Problem Statement

## 2. Executive Summary

## 3. Table of Contents

1. [Problem Statement](#1.-Problem-Statement)
2. [Executive Summary](#2.-Executive-Summary)
3. [Table of Contents](#3.-Table-of-Contents)
4. [Loading Libraries & Data](#4.-Loading-Libraries-&-Data)
5. [Preliminary EDA](#5.-Preliminary-EDA)
   <br>5.1 [High-Level Checks](#5.1-High-Level-Checks)
   <br>5.2 [Investigating Target Variable](#5.2-Investigating-Target-Variable)
   <br>5.3 [Investigating Features](#5.3-Investigating-Features)
   <br>5.4 [Visualizing Top 5 Features by Correlation](#5.4-Visualizing-Top-5-Features-by-Correlation)
   <br>5.5 [Handling Outliers](#5.5-Handling-Outliers)
6. [Data Cleaning](#6.-Data-Cleaning)
   <br>6.1 [Converting Ordinal Features to Numbers](#6.1-Converting-Ordinal-Features-to-Numbers)
7. [Exploratory Data Analysis (EDA)](#7.-Exploratory-Data-Analysis-(EDA))
8. [Feature Engineering](#8.-Feature-Engineering)
    <br>8.1 [Reviewing Correlations](#8.1-Reviewing-Correlations)
9. [Model Preparation (Preprocessing)](#9.-Model-Preparation-(Preprocessing))
10. [Modeling](#10-Modeling)
    <br>10.1 [Logistic Regression](#10.1-Logistic-Regression)
    <br>10.2 [Gaussian Naive Bayes](#10.2-Gaussian-Naive-Bayes)
11. [Model Selection](#11.-Model-Selection)
12. [Model Evaluation](#12.-Model-Evaluation)
13. [Conclusions and Evaluation](#13.-Conclusions-and-Evaluation)

## 3. Exploratory Data Analysis

## 4. Loading Libraries & Data

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import requests
import datetime as dt
import time

from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
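# The sklearn.feature_extraction.stop_words module was removed in scikit-learn 0.24;
# the English stop-word list is importable from the text module instead
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS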
from sklearn.naive_bayes import GaussianNB, MultinomialNB


In [3]:
# ## Save url as string
# BASE_URL = 'http://www.mediaeater.com/cameras/'
# com_board_1 = 'info/cb-01.html'

# url = BASE_URL + com_board_1

In [13]:
kind = 'submission'
subreddit_1 = 'teslamotors'
subreddit_2 = 'AndrewyangUBI'

# Request 500 posts per call, since the API may not return more than that in a single response
stem = "https://api.pushshift.io/reddit/search/{}?subreddit={}&size=500".format(kind, subreddit_1)
stem

'https://api.pushshift.io/reddit/search/submission?subreddit=teslamotors&size=500'

In [14]:
URL = '{}&after={}d'.format(stem, 30)
URL

'https://api.pushshift.io/reddit/search/submission?subreddit=teslamotors&size=500&after=30d'
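
The query string above is assembled with `str.format`. As a hedged alternative, the sketch below builds the same URL from a parameter dictionary with the standard library's `urllib.parse.urlencode`, which also escapes any special characters. The helper name `build_pushshift_url` and its arguments are hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

BASE = "https://api.pushshift.io/reddit/search"  # endpoint root used in this notebook


def build_pushshift_url(subreddit, kind="submission", size=500, after_days=None):
    """Hypothetical helper: assemble a Pushshift search URL from keyword arguments."""
    params = {"subreddit": subreddit, "size": size}
    if after_days is not None:
        params["after"] = "{}d".format(after_days)  # relative offset such as "30d"
    return "{}/{}?{}".format(BASE, kind, urlencode(params))


url = build_pushshift_url("teslamotors", after_days=30)
print(url)
```

Keeping the parameters in a dictionary makes it easier to add or drop a filter without editing a format string by hand.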

In [15]:
response = requests.get(URL)

In [16]:
response

<Response [200]>

In [17]:
assert response.status_code == 200

In [18]:
mine = response.json()['data']

In [19]:
response.json()

{'data': [{'all_awardings': [],
   'allow_live_comments': False,
   'author': 'importxero',
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_text': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_xgbog',
   'author_patreon_flair': False,
   'author_premium': False,
   'awarders': [],
   'can_mod_post': False,
   'contest_mode': False,
   'created_utc': 1577553597,
   'domain': 'i.redd.it',
   'full_link': 'https://www.reddit.com/r/teslamotors/comments/egtbzj/husband_brought_home_5_cybertrucks/',
   'gildings': {},
   'id': 'egtbzj',
   'is_crosspostable': False,
   'is_meta': False,
   'is_original_content': False,
   'is_reddit_media_domain': True,
   'is_robot_indexable': False,
   'is_self': False,
   'is_video': False,
   'link_flair_background_color': '#dadee2',
   'link_flair_richtext': [{'e': 'text', 't': 'Media/Image'}],
   'link_flair_template_id': '6be18cbc-53c6-11e9-92f3-0e03190c749e',
   'link_flair_text': 'Media/Image',
  

In [20]:
mine

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'importxero',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_xgbog',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1577553597,
  'domain': 'i.redd.it',
  'full_link': 'https://www.reddit.com/r/teslamotors/comments/egtbzj/husband_brought_home_5_cybertrucks/',
  'gildings': {},
  'id': 'egtbzj',
  'is_crosspostable': False,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': True,
  'is_robot_indexable': False,
  'is_self': False,
  'is_video': False,
  'link_flair_background_color': '#dadee2',
  'link_flair_richtext': [{'e': 'text', 't': 'Media/Image'}],
  'link_flair_template_id': '6be18cbc-53c6-11e9-92f3-0e03190c749e',
  'link_flair_text': 'Media/Image',
  'link_flair_text_color': 'dark',
  ...

In [21]:
df = pd.DataFrame.from_dict(mine)

In [22]:
df.head()

|  | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_patreon_flair | author_premium | ... | media_embed | post_hint | preview | secure_media | secure_media_embed | thumbnail_height | thumbnail_width | media_metadata | author_cakeday | suggested_sort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | importxero |  | [] |  | text | t2_xgbog | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | [] | False | MooseAMZN | 2 oi | [{'e': 'text', 't': 'None'}] |  | richtext | t2_ure9f | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | [] | False | backstreetatnight |  | [{'e': 'text', 't': 'Tri-Motor Cybertruck \| P3... | Tri-Motor Cybertruck \| P3D | richtext | t2_2qvn2dtd | False | True | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | [] | False | backstreetatnight |  | [{'e': 'text', 't': 'Tri-Motor Cybertruck \| P3... | Tri-Motor Cybertruck \| P3D | richtext | t2_2qvn2dtd | False | True | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | [] | False | backstreetatnight |  | [{'e': 'text', 't': 'Tri-Motor Cybertruck \| P3... | Tri-Motor Cybertruck \| P3D | richtext | t2_2qvn2dtd | False | True | ... |  |  |  |  |  |  |  |  |  |  |


In [23]:
df.shape

(500, 73)

In [24]:
subfield = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self']

In [25]:
df = df[subfield]
df.head(10)

|  | title | selftext | subreddit | created_utc | author | num_comments | score | is_self |
|---|---|---|---|---|---|---|---|---|
| 0 | Husband brought home 5 Cybertrucks. |  | teslamotors | 1577553597 | importxero | 0 | 1 | False |
| 1 | Model Y pics... VIN 594. Think they've made 60... |  | teslamotors | 1577553663 | MooseAMZN | 148 | 1 | False |
| 2 | Tesla Model 3 interior compared to the Mercede... |  | teslamotors | 1577554656 | backstreetatnight | 2 | 1 | False |
| 3 | Tesla Patents New Chemistry for Better Batteries |  | teslamotors | 1577554707 | backstreetatnight | 1 | 1 | False |
| 4 | Tesla killers that failed miserably. |  | teslamotors | 1577554911 | backstreetatnight | 0 | 1 | False |
| 5 | Model 3 Mid-Range battery degradation of 5% af... | I've had my premium mid-range for 10 months an... | teslamotors | 1577555366 | pedals2paddles | 11 | 1 | True |
| 6 | I can vouch for the accuracy of this statement. |  | teslamotors | 1577556863 | deemon999 | 0 | 1 | False |
| 7 | Non English voice commands and GUI questions | I live in the US, but have a bilingual househo... | teslamotors | 1577557374 | furuike | 17 | 1 | True |
| 8 | "Increase temperature by a degree." | [removed] | teslamotors | 1577557641 | HarryJoy | 0 | 1 | True |
| 9 | "Increase temperature by a degree." | [removed] | teslamotors | 1577558275 | HarryJoy | 0 | 1 | True |


In [26]:
df = df.drop_duplicates()

In [27]:
df.shape

(500, 8)
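
The shape is unchanged because `drop_duplicates()` with no arguments compares every column, and reposts such as rows 8 and 9 in the output above differ in `created_utc`. The sketch below shows one possible way to collapse such reposts by comparing only a subset of columns. The choice of `title`, `selftext`, and `author` as the identifying columns is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Two rows copied from the output above: same post, different created_utc
posts = pd.DataFrame({
    "title": ['"Increase temperature by a degree."'] * 2,
    "selftext": ["[removed]", "[removed]"],
    "author": ["HarryJoy", "HarryJoy"],
    "created_utc": [1577557641, 1577558275],
})

exact = posts.drop_duplicates()  # compares all columns, so both rows survive
by_content = posts.drop_duplicates(subset=["title", "selftext", "author"], keep="first")

print(len(exact), len(by_content))
```

With `keep="first"`, the earliest of the matching rows is retained.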

In [28]:
created = df.loc[0, 'created_utc']

In [30]:
dt.date.fromtimestamp(created)

datetime.date(2019, 12, 28)

In [31]:
created

1577553597

In [32]:
_timestamp = df["created_utc"].apply(lambda x: dt.date.fromtimestamp(x))

In [33]:
df['timestamp'] = _timestamp  # the leading underscore marks _timestamp as a temporary variable
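
Note that `dt.date.fromtimestamp` interprets the epoch value in the local time zone of the machine running the notebook, so a post created close to midnight UTC can land on a different date depending on where the code runs. The sketch below pins the conversion to UTC using only the standard library. The function name `utc_date` is hypothetical.

```python
import datetime as dt


def utc_date(created_utc):
    """Convert a Unix epoch (in seconds) to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc).date()


print(utc_date(1577553597))  # created_utc of the first row shown earlier
```

It could be applied in the same way as the lambda above, for example `df["created_utc"].apply(utc_date)`.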

In [34]:
df

|  | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Husband brought home 5 Cybertrucks. |  | teslamotors | 1577553597 | importxero | 0 | 1 | False | 2019-12-28 |
| 1 | Model Y pics... VIN 594. Think they've made 60... |  | teslamotors | 1577553663 | MooseAMZN | 148 | 1 | False | 2019-12-28 |
| 2 | Tesla Model 3 interior compared to the Mercede... |  | teslamotors | 1577554656 | backstreetatnight | 2 | 1 | False | 2019-12-28 |
| 3 | Tesla Patents New Chemistry for Better Batteries |  | teslamotors | 1577554707 | backstreetatnight | 1 | 1 | False | 2019-12-28 |
| 4 | Tesla killers that failed miserably. |  | teslamotors | 1577554911 | backstreetatnight | 0 | 1 | False | 2019-12-28 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | Why were so many Supercharger openings delayed... | I live in Arkansas. Tesla had 7 supercharger l... | teslamotors | 1577857992 | logeeny | 99 | 1 | True | 2020-01-01 |
| 496 | Feature Request: Wiper Service Mode on App | Was reading over the cold weather tips for my ... | teslamotors | 1577860159 | gdukin | 14 | 1 | True | 2020-01-01 |
| 497 | PSA: The Tesla app seems to be experiencing is... | [removed] | teslamotors | 1577861009 | iLoveCalculus314 | 1 | 1 | True | 2020-01-01 |
| 498 | 341 Sentry Events in a couple of hours? Either... |  | teslamotors | 1577861740 | Abandonedpools | 0 | 1 | False | 2020-01-01 |


In [35]:
# Query the Pushshift API, which serves archived Reddit submissions and comments
def query_pushshift(subreddit, kind='submission', skip=30, times=5,
                    subfield = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments',
                                'score', 'is_self'],
                    comfields = ['body', 'score', 'created_utc']):
    # Note: comfields is accepted for comment queries but is not used yet
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)

    mylist = []

    # One request per window: posts created after skip, 2 * skip, ... days ago
    for x in range(1, times + 1):

        URL = "{}&after={}d".format(stem, skip * x)
        print(URL)
        response = requests.get(URL)
        assert response.status_code == 200
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        mylist.append(df)
        time.sleep(2)  # pause between requests to avoid overloading the API

    full = pd.concat(mylist, sort=False)

    if kind == "submission":

        # Keep only the columns of interest
        full = full[subfield]

        # Drop rows that are identical in every column
        full = full.drop_duplicates()

        # Keep self (text) posts only; copy so the column assignment below is safe
        full = full.loc[full['is_self'] == True].copy()

    def get_date(created):
        # Convert a Unix epoch to a calendar date (local time zone)
        return dt.date.fromtimestamp(created)

    _timestamp = full["created_utc"].apply(get_date)

    full['timestamp'] = _timestamp
    print(full.shape)

    return full

In [36]:
sub_1_query = query_pushshift(subreddit_1)

https://api.pushshift.io/reddit/search/submission/?subreddit=teslamotors&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=teslamotors&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=teslamotors&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=teslamotors&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=teslamotors&size=500&after=150d
(1250, 9)
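
The five requests above use fixed offsets (30d, 60d, and so on), each returning at most 500 posts. In the earlier sample, 500 posts covered roughly four days (2019-12-28 to 2020-01-01), so each 30-day window is only partially sampled. One alternative is to page forward by using the newest `created_utc` seen so far as the next `after` value. The sketch below shows only that cursor logic: `paginate` and `fetch_page` are hypothetical names, `fetch_page` stands in for whatever function performs the HTTP request, and the in-memory `ARCHIVE` is made-up data. It assumes the API returns posts oldest first and accepts an epoch value for `after`, which should be checked against the Pushshift documentation.

```python
def paginate(fetch_page, start_after, max_pages=5):
    """Collect posts page by page, advancing the cursor to the newest post seen.

    fetch_page(after) is a hypothetical callable returning a list of dicts,
    each with a 'created_utc' key, sorted oldest first.
    """
    posts = []
    after = start_after
    for _ in range(max_pages):
        page = fetch_page(after)
        if not page:  # no more results
            break
        posts.extend(page)
        after = max(p["created_utc"] for p in page)  # next page starts after this
    return posts


# Stand-in for the API: a made-up archive served three posts at a time
ARCHIVE = [{"id": i, "created_utc": 1577553597 + 60 * i} for i in range(8)]


def fake_fetch(after, size=3):
    newer = [p for p in ARCHIVE if p["created_utc"] > after]
    return newer[:size]


collected = paginate(fake_fetch, start_after=0)
print(len(collected))
```

A limitation of this sketch is that posts sharing the exact `created_utc` of the last item on a page could be skipped, so a real implementation may need to deduplicate on `id` instead.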


In [37]:
sub_1_query.shape

(1250, 9)

In [38]:
sub_1_query.head()

|  | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Considerations when charging at a v3 charger? | So a brand new v3 charger went in nearby, whic... | teslamotors | 1577559649 | Thegeobeard | 46 | 1 | True | 2019-12-28 |
| 2 | 2020 Tesla Roadster - New feature I haven't se... | Hello,\n\nI appear to have found my way to the... | teslamotors | 1577559740 | Phantasm22 | 0 | 1 | True | 2019-12-28 |
| 4 | Expected depreciation | I'm looking at buying a model 3 in the UK in a... | teslamotors | 1577559979 | AcesFullOfKings | 12 | 1 | True | 2019-12-28 |
| 8 | Cybertruck Tri Motor AWD | [removed] | teslamotors | 1577562440 | blacksnake29 | 0 | 1 | True | 2019-12-28 |
| 10 | PSA: Check your tire inflator kit. | I purchased a Tesla tire inflator kit from the... | teslamotors | 1577563014 | williamwashere | 40 | 1 | True | 2019-12-28 |


In [39]:
sub_1_query.to_csv('./sub_1_query.csv', index = True)

In [40]:
query_load = pd.read_csv('./sub_1_query.csv', index_col = 0)
query_load.head()

|  | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Considerations when charging at a v3 charger? | So a brand new v3 charger went in nearby, whic... | teslamotors | 1577559649 | Thegeobeard | 46 | 1 | True | 2019-12-28 |
| 2 | 2020 Tesla Roadster - New feature I haven't se... | Hello,\n\nI appear to have found my way to the... | teslamotors | 1577559740 | Phantasm22 | 0 | 1 | True | 2019-12-28 |
| 4 | Expected depreciation | I'm looking at buying a model 3 in the UK in a... | teslamotors | 1577559979 | AcesFullOfKings | 12 | 1 | True | 2019-12-28 |
| 8 | Cybertruck Tri Motor AWD | [removed] | teslamotors | 1577562440 | blacksnake29 | 0 | 1 | True | 2019-12-28 |
| 10 | PSA: Check your tire inflator kit. | I purchased a Tesla tire inflator kit from the... | teslamotors | 1577563014 | williamwashere | 40 | 1 | True | 2019-12-28 |
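
Writing to CSV and reading it back preserves the values but not every dtype: the `timestamp` column comes back as plain strings unless it is parsed on load. The sketch below illustrates this with a one-row frame and an in-memory buffer in place of a file. The frame contents are a small stand-in for the real query result, and `parse_dates` is a standard `read_csv` parameter.

```python
import io
import datetime as dt
import pandas as pd

# One-row stand-in for the query result, with a date column and a non-default index
frame = pd.DataFrame(
    {"title": ["Expected depreciation"], "timestamp": [dt.date(2019, 12, 28)]},
    index=[4],
)

buffer = io.StringIO()
frame.to_csv(buffer, index=True)

buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)  # timestamp is read back as text

buffer.seek(0)
parsed = pd.read_csv(buffer, index_col=0, parse_dates=["timestamp"])  # parsed as datetimes

print(plain["timestamp"].dtype, parsed["timestamp"].dtype)
```

Parsing on load keeps later date arithmetic or filtering from silently operating on strings.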


## 6. Data Cleaning

## 8. Feature Engineering

## 9. Model Preparation (Preprocessing)

## 10 Modeling

### 10.1 Logistic Regression

### 10.2 Gaussian Naive Bayes

## 11. Model Selection

## 12. Model Evaluation

## 13. Conclusions and Evaluation