<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

---
## Problem Statement
You are a data scientist in a well-known real estate company located in Ames. In a bid to boost sales, the Board of Directors wants to provide a free self-service platform that informs clients of the potential value of their homes. They would also like to identify factors that might affect sale prices, as higher sale prices equate to higher commission income.

You have been tasked by your direct supervisor to create a regression model to predict the prices of houses in Ames, so that these prices can be included in the platform. You will also need to identify factors affecting sale prices and make recommendations on what could be done to improve sales income.

### Contents:
- [Background](#Background)
- [Datasets Used](#Datasets-Used)
- [Extraction of Data](#Extraction-of-Data)

## Background

Ames is a city in Story County, Iowa, United States, located approximately 30 miles (48 km) north of Des Moines in central Iowa ([*source*](https://en.wikipedia.org/wiki/Ames,_Iowa)). With a population of more than 65,000, Ames offers cultural, recreational, educational, business, and entertainment amenities more common in bigger metros. As a growing city, Ames continues to focus on building a strong community filled with opportunities for all ([*source*](https://www.cityofames.org/about-ames)).

## Datasets Used

For this analysis, we are provided with the `train` and `test` datasets. The `train` dataset contains sale prices of houses in Ames, along with related information, from 2006 to 2010. We will use this dataset to build the model. The `test` dataset contains another set of Ames house sales but does not include the sale prices. We will predict the sale prices for this dataset instead.

The `train` dataset includes information such as the sale price, building class, pool, basement, neighbourhood, garage, and overall quality of the house. The full information can be found in the data dictionary below.

The `test` dataset contains the same fields as the `train` dataset, except for the sale prices.

## Extraction of Data

**Install `psaw` library**

In [1]:
# pip install psaw

Uncomment and run the line above to install the `psaw` library if it is not already available in your notebook environment.

**1. Importing of libraries**

In [2]:
# Import libraries
import requests
import pandas as pd
import datetime as dt 
import time
import random

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

**2. Define the date range for the extraction**

The `before` argument in `pmaw` only accepts dates in epoch time format, which is the number of seconds that have elapsed since 00:00:00 UTC on 1 January 1970. We therefore use the expression below to convert 15$^{th}$ August 2022 to epoch time.

In [3]:
before = int(dt.datetime(2022,8,15,0,0).timestamp())
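
One detail worth noting: a `datetime` created without a time zone is interpreted in the machine's local time zone when `.timestamp()` is called, so the value above can differ between machines. A minimal sketch of a UTC-anchored conversion follows; the name `before_utc` is only illustrative and is not used elsewhere in this notebook.

```python
import datetime as dt

# Attach UTC explicitly so the epoch value does not depend on the local time zone
before_utc = int(dt.datetime(2022, 8, 15, 0, 0, tzinfo=dt.timezone.utc).timestamp())
print(before_utc)  # 1660521600

# Convert back to confirm the round trip
print(dt.datetime.fromtimestamp(before_utc, tz=dt.timezone.utc).isoformat())
```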

**3. Extraction of data using Pushshift.io and putting data into dataframe**

Our goal is to get about 3,000 posts from each subreddit. We will create a function and use it to extract posts from both subreddits.

In [4]:
def scrap(subreddit, n, days = 30):
    
    # Url
    base_url = 'https://api.pushshift.io/reddit/search/submission'
    full_url = f'{base_url}?subreddit={subreddit}&size=500'
    #print(full_url)
    
    # Creating an empty list to store the posts
    posts = []
    
    # Iterations to modify the url after each iteration
    for i in range(1, n+1):
        urlmod = '{}&after={}d'.format(full_url, days*i)
        #print URL used and days
        #print(f'Url: {urlmod}')
        #print(f'Days: {days*i}')
        # Skip this iteration if the request fails or returns a non-200 status,
        # so that a single error does not stop the whole extraction
        try:
            res = requests.get(urlmod)
            assert res.status_code == 200
        except (requests.RequestException, AssertionError):
            continue
        
        # Converting to json
        extracted = res.json()['data']
        # Constructing a dataframe from dict
        df = pd.DataFrame.from_dict(extracted)
        # Adding the df to post list(created on top)
        posts.append(df)
        
        
        total_scraped = sum(len(x) for x in posts)
        # Print total posts scrapped to see how many posts the function has scrapped
        #print(f'total_scraped: {total_scraped}')
        
        
        # Stop once more than n posts have been collected
        if total_scraped > n:
            break
        
        # Generate a random sleep duration to seem like a human user
        sleep_duration = random.randint(1,9)
        #print(f'sleep_duration: {sleep_duration}')
        time.sleep(sleep_duration)
            
    
    # Creating a list of features of interest that we will be using
    features_of_interest = ['subreddit', 'title', 'selftext']
    
    # Combine all iterations into 1 dataframe
    final_df = pd.concat(posts, sort=False)
    # Keep only the features of interest
    final_df = final_df[features_of_interest]
    # Dropping any duplicates
    final_df.drop_duplicates(inplace=True)
    # Display final shape to double check the columns created
    #print(f'final_df.shape: {final_df.shape}')
    return final_df.reset_index(drop=True)
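
To make the windowing in the loop easier to follow, the sketch below rebuilds the request URLs without sending any requests. `window_urls` and `n_windows` are hypothetical names introduced only for this illustration; the URL pattern is copied from the function above.

```python
def window_urls(subreddit, n_windows, days=30, size=500):
    """Return the URLs the loop would request, without calling the API."""
    base_url = 'https://api.pushshift.io/reddit/search/submission'
    full_url = f'{base_url}?subreddit={subreddit}&size={size}'
    # Each iteration moves the `after` offset a further `days` back in time
    return ['{}&after={}d'.format(full_url, days * i) for i in range(1, n_windows + 1)]

for url in window_urls('Perfumes', 3):
    print(url)
```

As far as the `after` parameter is understood here, each later window (60d, 90d, ...) also covers the period of the earlier ones, so overlapping posts between iterations are expected. This is why duplicates are dropped before the dataframe is returned.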

In [5]:
submissions_perfumes_df = scrap('Perfumes', 3000)
submissions_makeup_df = scrap('Makeup', 3000)

print(f'Retrieved {len(submissions_perfumes_df)} submissions on \'Perfumes\' from Pushshift')
print(f'Retrieved {len(submissions_makeup_df)} submissions on \'Makeup\' from Pushshift')

We have managed to extract more than 3,000 non-duplicate posts for both subreddits. We will export the dataframes to CSV and proceed with the cleaning and analysis of the datasets.

**4. Exporting of data to CSV**

In [None]:
submissions_perfumes_df.to_csv('../datasets/perfumes_df.csv')
submissions_makeup_df.to_csv('../datasets/makeup_df.csv')
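
By default, `to_csv` also writes the dataframe index as an unnamed first column, which then reappears as an extra column when the file is read back. A minimal sketch of an export without the index is shown below. It uses a small made-up dataframe (`sample_df`) and an in-memory buffer instead of the real files, purely for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for the scraped submissions
sample_df = pd.DataFrame({
    'subreddit': ['Perfumes', 'Makeup'],
    'title': ['first title', 'second title'],
    'selftext': ['some text', 'more text'],
})

# Write without the index, then read the CSV back from the same buffer
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded_df = pd.read_csv(buffer)

print(list(reloaded_df.columns))
```

Note that empty `selftext` values are read back as `NaN` by default, which the cleaning step would need to handle.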

We will continue the rest of the analysis in a separate notebook. Please refer to **"2. Analysis of Datasets"** for the analysis and recommendations.