<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

---
## Problem Statement
You are a data scientist at a well-known real estate company located in Ames. In a bid to boost sales, the Board of Directors wants to provide a free self-service platform that informs clients of the potential value of their homes. They would also like to identify factors that might affect sale prices, as higher sale prices equate to higher commission income.

You have been tasked by your direct supervisor to create a regression model to predict the price of houses in Ames, so that these prices can be included in the platform. You will also need to identify factors affecting sales price and make recommendations on what could be done to improve sales income.

### Contents:
- [Background](#Background)
- [Datasets Used](#Datasets-Used)
- [Extraction of Data](#Extraction-of-Data)

## Background

Ames is a city in Story County, Iowa, United States, located approximately 30 miles (48 km) north of Des Moines in central Iowa ([*source*](https://en.wikipedia.org/wiki/Ames,_Iowa)). With a population of more than 65,000, Ames offers cultural, recreational, educational, business, and entertainment amenities more common in bigger metros. As a growing city, Ames continues to focus on building a strong community filled with opportunities for all ([*source*](https://www.cityofames.org/about-ames)).

## Datasets Used

For this analysis, we are provided with the `train` and `test` datasets. The `train` dataset contains sale prices and related information for houses sold in Ames from 2006 to 2010. We will use this dataset to build the model. The `test` dataset contains another set of Ames house sales but does not include the sale prices. We will predict the sale prices for this dataset instead.

The `train` dataset includes information such as the sale price, building class, pool, basement, neighbourhood, garage and overall quality of each house. The full list of fields can be found in the data dictionary below.

The `test` dataset contains the same fields as the `train` dataset, except for the sale prices.

## Extraction of Data

**Install `pmaw` library**

In [1]:
# pip install pmaw

Uncomment and run the cell above to install the `pmaw` library if it is not already available in your environment.

**1. Importing of libraries**

In [2]:
# Import libraries
import requests
import pandas as pd
from pmaw import PushshiftAPI
import datetime as dt 

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

**2. Define the date range for the extraction**

The `before` argument in `pmaw` only accepts dates in epoch time format, which is the number of seconds that have elapsed since 00:00:00 UTC on 1 January 1970. We therefore use the expression below to convert 15$^{th}$ August 2022 to epoch time.

In [3]:
before = int(dt.datetime(2022,8,15,0,0).timestamp())
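One detail worth noting about the conversion above: a `datetime` created without a time zone is interpreted by `.timestamp()` in the local time zone of the machine running the code, so the resulting cutoff can shift by a few hours from one machine to another. The sketch below is not part of the original extraction; it only illustrates, under the assumption that a UTC cutoff is wanted, how a timezone-aware `datetime` gives the same epoch value everywhere. The names `before_local`, `before_utc` and `offset_hours` are illustrative.

```python
import datetime as dt

# Naive datetime: .timestamp() interprets it in the machine's local time zone.
before_local = int(dt.datetime(2022, 8, 15, 0, 0).timestamp())

# Timezone-aware datetime: always midnight UTC, wherever the code runs.
before_utc = int(dt.datetime(2022, 8, 15, 0, 0, tzinfo=dt.timezone.utc).timestamp())

# Difference in hours between the two cutoffs (0 if the local zone is UTC).
offset_hours = (before_local - before_utc) / 3600

print(before_utc, offset_hours)
```

For a 3,000-post sample a difference of a few hours is unlikely to matter, but pinning the time zone makes the extraction easier to reproduce.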

**3. Extraction of data using Pushshift API**

Our goal is to get about 3,000 posts from each of the two subreddits.

In [4]:
# Instantiate the Pushshift API client
api = PushshiftAPI()

# Define the parameters that we need
subreddit_1 = 'Perfumes'
subreddit_2 = 'Makeup'
limit = 3000

# Retrieve posts from Pushshift
submissions_perfumes = api.search_submissions(subreddit = subreddit_1, limit = limit, before = before, stickied = False)
submissions_makeup = api.search_submissions(subreddit = subreddit_2, limit = limit, before = before, stickied = False)

print(f"Retrieved {len(submissions_perfumes)} submissions on 'Perfumes' from Pushshift")
print(f"Retrieved {len(submissions_makeup)} submissions on 'Makeup' from Pushshift")

Not all PushShift shards are active. Query results may be incomplete.
Not all PushShift shards are active. Query results may be incomplete.


Retrieved 3000 submissions on 'Perfumes' from Pushshift
Retrieved 3000 submissions on 'Makeup' from Pushshift


Note that the output includes the warning "Not all PushShift shards are active. Query results may be incomplete." This means that some posts may not have been returned. In this case, however, completeness is not a concern: our goal is to extract 3,000 posts from each subreddit, which the code above managed to do, so we will continue with the extracted dataset.

**4. Putting extracted data into a DataFrame**

In [6]:
# Put the extracted data into DataFrames
submissions_perfumes_df = pd.DataFrame(submissions_perfumes)
submissions_makeup_df = pd.DataFrame(submissions_makeup)

# preview the datasets
print('Perfumes dataset:')
display(submissions_perfumes_df.head(5))
print('\n')

print('Makeup dataset:')
display(submissions_makeup_df.head(5))

Perfumes dataset:


| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | secure_media | secure_media_embed | removed_by_category | author_cakeday | author_flair_background_color | author_flair_template_id | author_flair_text_color | poll_data | crosspost_parent | crosspost_parent_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | rosecreme21 | | [] | | text | t2_bcxpc1je | False | False | ... | | | | | | | | | | |
| 1 | [] | False | JV_peaches | | [] | | text | t2_bvyd7dqy | False | False | ... | | | | | | | | | | |
| 2 | [] | False | That-Target-3086 | | [] | | text | t2_blwdt3d2 | False | False | ... | | | | | | | | | | |
| 3 | [] | False | JoulesR95 | | [] | | text | t2_69b40rhh | False | False | ... | | | | | | | | | | |
| 4 | [] | False | devanderleej | | [] | | text | t2_l0xhioxk | False | False | ... | | | | | | | | | | |




Makeup dataset:


| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | crosspost_parent | crosspost_parent_list | url_overridden_by_dest | author_flair_background_color | author_flair_text_color | author_cakeday | author_flair_template_id | thumbnail_height | thumbnail_width | edited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | Jin3092 | | [] | | text | t2_n683mlf | False | False | ... | | | | | | | | | | |
| 1 | [] | False | ThisisanotherTA0 | | [] | | text | t2_5if9x2aq | False | False | ... | | | | | | | | | | |
| 2 | [] | False | DwightsMegaDesk | | [] | | text | t2_dkuit5yx | False | False | ... | | | | | | | | | | |
| 3 | [] | False | melissajackson07 | | [] | | text | t2_1pqcatke | False | False | ... | | | | | | | | | | |
| 4 | [] | False | BRB092021 | | [] | | text | t2_p4pnp6zd | False | False | ... | | | | | | | | | | |
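The two previews end in different columns (for example, `poll_data` appears only in the Perfumes preview and `edited` only in the Makeup preview), which suggests the two DataFrames do not share an identical set of fields. As a minimal sketch, assuming the two datasets will later be combined, the snippet below shows one way to find the shared columns and stack the frames on them. The toy frames and their values are made up for illustration; only the column names come from the previews above.

```python
import pandas as pd

# Toy stand-ins for the two extracted DataFrames (values are made up).
perfumes_toy = pd.DataFrame(
    {"author": ["a1"], "author_fullname": ["t2_x1"], "poll_data": [None]}
)
makeup_toy = pd.DataFrame(
    {"author": ["a2"], "author_fullname": ["t2_x2"], "edited": [None]}
)

# Columns present in both frames.
common_cols = perfumes_toy.columns.intersection(makeup_toy.columns)

# Columns that only one of the frames has.
only_perfumes = perfumes_toy.columns.difference(makeup_toy.columns)
only_makeup = makeup_toy.columns.difference(perfumes_toy.columns)

# Stack the two frames on the shared columns only.
combined = pd.concat(
    [perfumes_toy[common_cols], makeup_toy[common_cols]], ignore_index=True
)

print(list(only_perfumes), list(only_makeup), combined.shape)
```

Restricting to shared columns avoids a combined frame padded with columns that are empty for one of the subreddits.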


**5. Exporting of data to CSV**

In [7]:
submissions_perfumes_df.to_csv('../datasets/perfumes_df.csv')
submissions_makeup_df.to_csv('../datasets/makeup_df.csv')
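By default, `to_csv` also writes the row index, so reading the file back with `pd.read_csv` produces an extra `Unnamed: 0` column. The sketch below illustrates this on a toy frame using in-memory buffers rather than the real files. Passing `index=False` is one option, not necessarily what the analysis notebook expects. Reading with `index_col=0` is another.

```python
import io

import pandas as pd

# Toy frame standing in for the extracted submissions (values are made up).
toy = pd.DataFrame({"author": ["a1", "a2"], "author_flair_type": ["text", "text"]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))  # ['Unnamed: 0', 'author', 'author_flair_type']
print(list(reloaded_clean.columns))    # ['author', 'author_flair_type']
```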

We will continue the rest of the analysis in a separate notebook. Please refer to **"2. Analysis of Datasets"** for the analysis and recommendations.