# Project 3 Web APIs & NLP - Florian Combelles

In week four we've learned about a few different classifiers. In week five we'll learn about web scraping, APIs, and Natural Language Processing (NLP). This project will put those skills to the test.

For project 3, your goal is two-fold:
1. Using [Pushshift's](https://github.com/pushshift/api) API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

###### Requirements

- Gather and prepare your data using the `requests` library.
- **Create and compare two models**. One of these must be a Naive Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of your results.
- A short presentation outlining your process and findings for a semi-technical audience.

## Problem Statement and background:

###### Problem Statement

During Covid-19, the number of new pet owners in Singapore increased.
These inexperienced owners lack resources on how to care for their new pets.

As a result, they over-rely on veterinarians and pet stores to provide information and answer their queries.

This influx of inexperienced pet owners reduces the work efficiency of vets and pet stores and distracts them from their main responsibilities.


##### Background

We are working for a company called Pet Smart.

We are releasing a new mobile app that includes two features:
* A chat box where you can ask cat- or dog-related questions and get answers from a team of experts.
* Articles that provide information and tips on how to care for your pets.


## Part 1: Data Acquisition

For this project, we will be scraping [Reddit](https://www.reddit.com), in particular two similar subreddits:
* [DogAdvice](https://www.reddit.com/r/DogAdvice/)
* [CatAdvice](https://www.reddit.com/r/CatAdvice/)

###### These two subreddits are a place where people can post and look for information/advice regarding their dogs and cats.

In order to acquire information from these two subreddits, we will be using the [Pushshift](https://github.com/pushshift/api) API.

This API was designed to provide enhanced functionality and search capabilities for Reddit comments and submissions.

It provides full functionality for searching Reddit data and can also create powerful data aggregations.

##### Importing Libraries

For our data acquisition we only need to import two libraries:
* Requests for our data scraping
* Pandas to transform this data into DataFrames

In [None]:
import pandas as pd
import requests

### Web Scraping

##### Setting up the url to retrieve subreddit content

Pushshift provides two options when scraping Reddit:
* Submissions - Scrape post content
* Comments - Scrape comments on posts

For this project, we decided to focus on submissions, as we want to identify which subreddit a particular submission belongs to.

In [None]:
# Indicating base url to retrieve Reddit Posts

url = 'https://api.pushshift.io/reddit/search/submission/'

##### Creating parameters variables

We are specifying the subreddits we want to collect information from (CatAdvice, DogAdvice).

We are also creating a variable that will help us automate data collection beyond the limit of 250 posts per request.
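To see how such parameters end up in a request, here is a minimal sketch that builds the query string offline with the standard library. The parameter values are illustrative only, and it is an assumption (worth checking against the `requests` documentation) that `requests` encodes a list value as one repeated key per item, the way `urlencode` does with `doseq=True`.

```python
from urllib.parse import urlencode

# Base endpoint for submissions, as used in this notebook
base_url = 'https://api.pushshift.io/reddit/search/submission/'

# Illustrative parameters: one subreddit, the page size, and two fields
example_params = {'subreddit': 'CatAdvice',
                  'size': '250',
                  'fields': ['title', 'selftext']}

# doseq=True expands a list into one key=value pair per item
query_string = urlencode(example_params, doseq=True)
example_request_url = f'{base_url}?{query_string}'

print(example_request_url)
```

Printing the URL this way is a cheap way to inspect a request before sending it.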

In [None]:
# Creating a list of subreddits to scrape
# Creating an upper bound for the request counter (since we have a limit of 250 posts per request):
# the counter starts at 1 and stops before max_posts, so 5 means 4 requests

subreddits = ['CatAdvice', 'DogAdvice']
max_posts = 5

##### Scraping Reddit content and saving results to a CSV

We are looking to get 1000 posts from each subreddit.
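As a quick check, this target follows from the settings used in this notebook: the request counter starts at 1 and the loop runs while the counter is below `max_posts`, so there is one request fewer than `max_posts`. The names `page_size`, `n_requests` and `expected_posts` are introduced here for illustration only.

```python
# Values taken from this notebook
max_posts = 5      # upper bound for the request counter
page_size = 250    # maximum number of posts per request

# The counter starts at 1 and the loop stops once it reaches max_posts
n_requests = len(range(1, max_posts))
expected_posts = n_requests * page_size

print(n_requests, expected_posts)
```

This is an upper bound: a subreddit with fewer matching posts would return less.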

In order to do so, we are creating two loops.
* A for loop that goes through each subreddit and sets which information (parameters) we want to retrieve.

In this case, we will be collecting the text and title of each post, its subreddit, the time of post creation, the domain (is the post new content or is it shared from another site?), the `is_robot_indexable` flag and the type of content that the post includes (photo, video...).
We also restrict the search to posts that were not removed/deleted.

* A while loop

The while loop will look for the posts that match our criteria and retrieve their content as JSON.
Each JSON response will then be converted into a DataFrame, and these are combined into one DataFrame that contains all the posts for each subreddit.
Finally, because of the limit of 250 posts per request, we added a counter that allows us to repeat the process as many times as we need.
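
The pagination idea described above can be tried without calling the API. The sketch below is a simplified stand-in, not the scraper used in this notebook: `fake_fetch` and `collect_posts` are hypothetical helpers, the posts are made-up dictionaries, and the page size is reduced to 3 to keep the example small. It assumes that posts come back newest first and that `before` only returns posts strictly older than the given timestamp.

```python
# Made-up posts with decreasing timestamps (newest first), standing in for the API
FAKE_POSTS = [{'title': f'post {n}', 'created_utc': 1000 - n} for n in range(8)]


def fake_fetch(params):
    """Hypothetical stand-in for the API call: return one page of posts."""
    before = params.get('before')
    older = [p for p in FAKE_POSTS
             if before is None or p['created_utc'] < before]
    return older[:params['size']]


def collect_posts(fetch, size, max_requests):
    """Request pages until max_requests is reached or a page comes back empty."""
    params = {'size': size}
    collected = []
    for _ in range(max_requests):
        page = fetch(params)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # Next request: only posts older than the oldest one seen so far
        params['before'] = page[-1]['created_utc']
    return collected


all_posts = collect_posts(fake_fetch, size=3, max_requests=4)
print(len(all_posts))
```

Stopping on an empty page matters because there is no "oldest post" to read a timestamp from once the data runs out.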

In [None]:
# Creating a for loop to go through our subreddits and set the request parameters

for i in subreddits:
    params = {'subreddit': i,
              'size': '250',
              # Adding a criterion, selftext:not '[removed]', to filter out the removed posts
              'selftext:not': '[removed]',
              'fields': ['selftext', 'title', 'subreddit', 'created_utc',
                         'domain', 'is_robot_indexable', 'post_hint']}

    count = 1
    total_submissions = []

    # Creating a while loop to get the JSON responses and save them in DataFrames
    # Add a new 'before' parameter with the date of the oldest post scraped
    # Increment the counter and start scraping again from the oldest post

    while count < max_posts:
        response = requests.get(url, params=params)
        resp_json = response.json()
        posts = resp_json['data']
        # Stop early if no posts are returned, otherwise the 'created_utc' lookup below would fail
        if not posts:
            break
        submissions = pd.DataFrame(posts)
        total_submissions.append(submissions)
        params['before'] = submissions['created_utc'].iloc[-1]
        count += 1

    # Concatenate all the submissions into a single DataFrame
    # Save each subreddit into its own CSV file (the data/ folder must already exist)

    final_submissions = pd.concat(total_submissions)
    final_submissions.to_csv(f'data/{i}.csv', index=False)
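
Before moving on, it can be worth checking what was collected. The sketch below uses a small made-up DataFrame rather than the saved CSV files, and `summarise_submissions` is a hypothetical helper that is not part of the original notebook. It assumes `created_utc` is a Unix timestamp in seconds.

```python
import pandas as pd


def summarise_submissions(df):
    """Return row count, duplicate count and date range for scraped posts."""
    # A post is treated as a duplicate if title and creation time both repeat
    n_duplicates = int(df.duplicated(subset=['title', 'created_utc']).sum())
    # Assumption: created_utc holds seconds since the Unix epoch
    created = pd.to_datetime(df['created_utc'], unit='s', utc=True)
    return {'rows': len(df),
            'duplicates': n_duplicates,
            'oldest': created.min(),
            'newest': created.max()}


# Made-up rows standing in for pd.read_csv('data/CatAdvice.csv')
sample = pd.DataFrame({'title': ['a', 'b', 'b'],
                       'created_utc': [1609459200, 1609545600, 1609545600]})
summary = summarise_submissions(sample)
print(summary)
```

A row count below the target or a non-zero duplicate count would be a signal to revisit the scraping step before cleaning.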

### Conclusion

Now that we have managed to automate acquisition and retrieve the data from our two subreddits, we can move on to Data Cleaning and EDA.