<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

---
## Problem Statement
You are a data scientist at a well-known cosmetics company. The company has an online store and discussion forums for all of its product categories. The department in charge of the website and forums has complained that customers are posting feedback and discussions about Makeup and Perfumes in the wrong forums. As a result, your colleagues have to comb through the discussion forums manually to identify misplaced posts. They find this time-consuming and have turned to the data science department for a better solution.

Your direct supervisor has tasked you with creating a classification model that identifies whether a post belongs in the Makeup or the Perfumes forum. At the same time, you are to use the analysis to identify insights that the marketing department can use in its marketing efforts.

### Contents:
- [Datasets Used](#Datasets-Used)
- [Extraction of Data](#Extraction-of-Data)

## Datasets Used

As the forums were newly created, there isn't sufficient data for us to analyse. As such, we have extracted posts from the Makeup and Perfumes subreddits to form our dataset.

The datasets include the title and selftext of each post. Please refer to the data dictionary for more information on the columns extracted.

## Extraction of Data

**Install `psaw` library**

In [1]:
# pip install psaw

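Use the above to install the `psaw` library if it is not available in your environment. Note that the extraction code below does not import `psaw`; it queries the Pushshift API directly through `requests`.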

**1. Importing of libraries**

In [2]:
# Import libraries
import requests
import pandas as pd
import datetime as dt 
import time
import random

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

**2. Extraction of data using Pushshift.io and putting data into dataframe**

Our goal is to get about 3,000 posts from each subreddit. We will create a function and use it to extract posts from both subreddits. We will start from the most recent posts and work our way back until we have at least 3,000 posts from each subreddit. The data was extracted on 27 August 2022.

In [3]:
def scrap(subreddit, n, days = 30):
    """Collect posts from `subreddit` via Pushshift until more than `n` rows are gathered."""
    
    # Pushshift submission search endpoint, requesting up to 500 posts per call
    base_url = 'https://api.pushshift.io/reddit/search/submission'
    full_url = f'{base_url}?subreddit={subreddit}&size=500'
    
    # List to store one dataframe per successful request
    posts = []
    
    # Modify the `after` parameter of the url on each iteration
    for i in range(1, n+1):
        urlmod = '{}&after={}d'.format(full_url, days*i)
        
        # Skip this iteration if the request fails, so that one bad
        # response does not stop the whole run
        try:
            res = requests.get(urlmod)
        except requests.exceptions.RequestException:
            continue
        if res.status_code != 200:
            continue
        
        # Extract the list of posts from the json response
        extracted = res.json()['data']
        # Constructing a dataframe from the extracted posts
        df = pd.DataFrame.from_dict(extracted)
        # Adding the df to the posts list (created above)
        posts.append(df)
        
        # Total number of posts scraped so far (before duplicates are dropped)
        total_scraped = sum(len(x) for x in posts)
        
        # Stop once more than n posts have been collected
        if total_scraped > n:
            break
        
        # Sleep for a random duration between requests to seem like a human user
        sleep_duration = random.randint(1,9)
        time.sleep(sleep_duration)
    
    # Features of interest that we will be using
    features_of_interest = ['subreddit', 'title', 'selftext']
    
    # Combine all iterations into 1 dataframe
    final_df = pd.concat(posts, sort=False)
    # Keep only the required columns
    final_df = final_df[features_of_interest]
    # Drop duplicates (reassigning avoids modifying a slice in place)
    final_df = final_df.drop_duplicates()
    return final_df.reset_index(drop=True)
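
To make the request pattern easier to inspect, the sketch below builds the URLs that the function would request on its first few iterations, without sending anything over the network. `build_urls` is a hypothetical helper introduced only for illustration; it assumes the same endpoint and the same `subreddit`, `size` and `after` query parameters used in the function above.

```python
from urllib.parse import urlencode

# Same endpoint as in the function above
BASE_URL = 'https://api.pushshift.io/reddit/search/submission'


def build_urls(subreddit, iterations, days=30, size=500):
    """Return the URL for each iteration (hypothetical helper, no requests sent)."""
    urls = []
    for i in range(1, iterations + 1):
        # `after` grows by `days` on every iteration: 30d, 60d, 90d, ...
        query = urlencode({
            'subreddit': subreddit,
            'size': size,
            'after': f'{days * i}d',
        })
        urls.append(f'{BASE_URL}?{query}')
    return urls


# Print the first three URLs for the Makeup subreddit
for url in build_urls('Makeup', 3):
    print(url)
```

Printing the URLs before running the full extraction is a cheap way to check that the `after` value changes as intended from one iteration to the next.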

In [4]:
submissions_perfumes_df = scrap('Perfumes', 3000)
submissions_makeup_df = scrap('Makeup', 3000)

print(f"Retrieved {len(submissions_perfumes_df)} submissions on 'Perfumes' from Pushshift")
print(f"Retrieved {len(submissions_makeup_df)} submissions on 'Makeup' from Pushshift")

Retrieved 3002 submissions on 'Perfumes' from Pushshift
Retrieved 3220 submissions on 'Makeup' from Pushshift


We have managed to extract more than 3,000 non-duplicate posts from each subreddit. We will export the datasets to csv files and proceed with the cleaning and analysis.
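
The extraction function checks its stopping condition on the raw row count, before duplicates are dropped, which is why the final counts are worth confirming as done above. The sketch below illustrates that difference on toy data; the two batches are hypothetical stand-ins for overlapping API responses, not real posts.

```python
import pandas as pd

# Hypothetical toy batches standing in for two overlapping API responses
batch_1 = pd.DataFrame({
    'subreddit': ['Makeup', 'Makeup'],
    'title': ['Post A', 'Post B'],
    'selftext': ['text a', 'text b'],
})
batch_2 = pd.DataFrame({
    'subreddit': ['Makeup', 'Makeup'],
    'title': ['Post B', 'Post C'],   # 'Post B' repeats from batch_1
    'selftext': ['text b', 'text c'],
})

# Raw count: what the stopping condition in the loop would see
combined = pd.concat([batch_1, batch_2], sort=False)
raw_count = len(combined)

# Unique count: what remains after dropping exact duplicate rows
deduped = combined.drop_duplicates().reset_index(drop=True)
unique_count = len(deduped)

print(f'{raw_count} rows collected, {unique_count} unique rows kept')
```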

**3. Exporting of data to csv**

In [5]:
submissions_perfumes_df.to_csv('../datasets/perfumes_df.csv')
submissions_makeup_df.to_csv('../datasets/makeup_df.csv')
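
Because `to_csv` is called here without `index=False`, the row index is written as an extra first column. The sketch below shows what that looks like when the file is read back, using an in-memory buffer and a hypothetical two-row dataframe instead of the real files. Passing `index_col=0` to `read_csv` is one way to restore the index rather than picking up an extra column.

```python
import io

import pandas as pd

# Hypothetical two-row dataframe with the same columns as the extracted data
toy_df = pd.DataFrame({
    'subreddit': ['Makeup', 'Perfumes'],
    'title': ['Post A', 'Post B'],
    'selftext': ['text a', 'text b'],
})

# Write to an in-memory buffer the same way as above (index included)
buffer = io.StringIO()
toy_df.to_csv(buffer)

# Reading back naively turns the saved index into an 'Unnamed: 0' column
buffer.seek(0)
naive_df = pd.read_csv(buffer)

# Reading back with index_col=0 uses the first column as the index instead
buffer.seek(0)
restored_df = pd.read_csv(buffer, index_col=0)

print(list(naive_df.columns))
print(list(restored_df.columns))
```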

We will continue the analysis in a separate notebook. Please refer to **"2. Analysis of Datasets"** for the analysis and recommendations.