<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Project 3: Web APIs & NLP

---
## Problem Statement
You are a data scientist in a well known real estate company located in Ames. In a bid to boost sales, the Board of Directors wants to provide free self-served platform to inform clients of the potential value of their homes. They would also like to find identify factors that might affect sale prices as higher sale prices equate to higher commission income. 

You have been tasked by your direct supervisor to create a regression model to predict the price of houses in Ames, so that these prices can be included in the platform. You will also need to identify factors affecting sales price and make recommendations on what could be done to improve sales income.

### Contents:
- [Background](#Background)
- [Datasets Used](#Datasets-Used)
- [Extraction of Data](#Extraction-of-Data)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Data Dictionary](#Data-Dictionary)
- [Exploratory Data Analysis and Feature Engineering](#Exploratory-Data-Analysis-and-Feature-Engineering)
- [Initial Modelling](#Initial-Modelling)
- [Further EDA and Feature Engineering](#Further-EDA-and-Feature-Engineering)
- [Regularisation Regression](#Regularisation-Regression)
- [Model Test Results](#Model-Test-Results)
- [Predicting Sale Prices in Test Dataset](#Predicting-Sale-Prices-in-Test-Dataset)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Background

Ames is a city in Story County, Iowa, United States, located approximately 30 miles (48 km) north of Des Moines in central Iowa. ([*source*](https://en.wikipedia.org/wiki/Ames,_Iowa)). With a population of more than 65,000, Ames offers cultural, recreational, educational, business, and entertainment amenities more common in bigger metros. As a growing city, Ames continues to focus on building a strong community filled with opportunities for all. ([*source*](https://www.cityofames.org/about-ames))

## Datasets Used

For the purpose of the analysis, we are provided with the `train` and `test` datasets. The `train` dataset contains Ames' housing sales prices and their relevant information from 2006 to 2010. We will be using this dataset for model building purposes. The `test` dataset contains another set of Ames' housing sale price, but does not include the sale prices. We predicting the sale prices found in this dataset instead.

Information found in the `train` datasets includes information suchs as the sale prices, building class, information on the pool, basement, neighbourhood, garage and overall quality of the house. The full information could be found in the data dictionary below.

Information found in the `test` datasets contains the same fields as those found in thte `train` dataset, except for the sale prices.

## Extraction of Data

Please refer to **"1. Extraction of Data"** for the steps done for the extraction of data from Reddit.

## Data Import and Cleaning

**1. Importing of libraries**

In [1]:
# Import libraries
import requests
import pandas as pd
from pmaw import PushshiftAPI
import datetime as dt 

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

**2. Importing of datasets being used**

We have imported the `perfumes_df` and `makeup_df` datasets which we have extracted in **"1. Extraction of Data"**.

In [2]:
# Import datasets:
perfumes_df = pd.read_csv("../datasets/perfumes_df.csv")
makeup_df = pd.read_csv("../datasets/makeup_df.csv")

**3. Display datasets**

Display the first 5 rows of the imported datasets

In [3]:
# Setting to display all the columns
pd.set_option("display.max_columns", None)

# Display first 5 rows of the datasets
print("First 5 rows of the \"Perfume\" dataset:")
display(perfumes_df.head())
print("")
print("First 5 rows of the \"Makeup\" dataset:")
display(makeup_df.head())

First 5 rows of the "Perfume" dataset:


Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_cakeday,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,thumbnail_height,thumbnail_width,url_overridden_by_dest,discussion_type,suggested_sort,gallery_data,is_gallery,media_metadata,media,media_embed,secure_media,secure_media_embed,poll_data,author_flair_background_color,author_flair_template_id,author_flair_text_color,crosspost_parent,crosspost_parent_list
0,0,[],False,kontrlino,True,,[],,text,t2_chi8dfh0,False,False,False,[],False,False,1654170459,self.Perfumes,https://www.reddit.com/r/Perfumes/comments/v37...,{},v370pw,False,True,False,False,False,True,True,False,#0079d3,"[{'e': 'text', 't': 'Discussion '}, {'a': ':sn...",692f1192-c21f-11ea-a4f8-0e90b6858235,Discussion :snoo_tableflip::snoo_thoughtful:,light,richtext,False,False,True,0,0,False,all_ads,/r/Perfumes/comments/v370pw/what_to_wear_on_my...,False,6,1654170470,1,"I am wearing a suit on both occasions, the dis...",True,False,False,Perfumes,t5_2s8y4,13664,public,self,What to wear on my university graduation and g...,0,[],1.0,https://www.reddit.com/r/Perfumes/comments/v37...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
1,1,[],False,kontrlino,True,,[],,text,t2_chi8dfh0,False,False,False,[],False,False,1654170451,self.Perfumes,https://www.reddit.com/r/Perfumes/comments/v37...,{},v370n6,False,True,False,False,False,True,True,False,#0079d3,"[{'e': 'text', 't': 'Discussion '}, {'a': ':sn...",692f1192-c21f-11ea-a4f8-0e90b6858235,Discussion :snoo_tableflip::snoo_thoughtful:,light,richtext,False,False,True,0,0,False,all_ads,/r/Perfumes/comments/v370n6/what_to_wear_on_my...,False,6,1654170461,1,"I am wearing a suit on both occasions, the dis...",True,False,False,Perfumes,t5_2s8y4,13664,public,self,What to wear on my university graduation and g...,0,[],1.0,https://www.reddit.com/r/Perfumes/comments/v37...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
2,2,[],False,ThBars,,,[],,text,t2_ng3mm0vg,False,False,False,[],False,False,1654163618,self.Perfumes,https://www.reddit.com/r/Perfumes/comments/v35...,{},v359wp,False,True,False,False,False,True,True,False,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/Perfumes/comments/v359wp/rose_kabuki_christ...,False,6,1654163629,1,Does any of you own Rose Kabuki by Dior and co...,True,False,False,Perfumes,t5_2s8y4,13661,public,self,Rose kabuki Christian Dior,0,[],1.0,https://www.reddit.com/r/Perfumes/comments/v35...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
3,3,[],False,plur_nerd,,,[],,text,t2_6827e,False,False,False,[],False,False,1654152961,self.Perfumes,https://www.reddit.com/r/Perfumes/comments/v32...,{},v32v05,False,True,False,False,False,True,True,False,#ea0027,"[{'e': 'text', 't': 'Help '}, {'a': ':table:',...",4a663a4c-c21f-11ea-a4f8-0e90b6858235,Help :table::snoo_shrug:,light,richtext,False,False,False,0,0,False,all_ads,/r/Perfumes/comments/v32v05/looking_for_a_perf...,False,6,1654152971,1,"I know this is a long shot, but I’m absolutely...",True,False,False,Perfumes,t5_2s8y4,13660,public,self,Looking for a perfume that smells like Melissa...,0,[],1.0,https://www.reddit.com/r/Perfumes/comments/v32...,all_ads,6,,,,,,,,,,,,,,,,,,,,,
4,4,[],False,truperfumesaustalia,,,[],,text,t2_cpqitg13,False,False,False,[],False,False,1654145937,truperfume.fashion.blog,https://www.reddit.com/r/Perfumes/comments/v31...,{},v31644,False,False,False,False,False,False,False,False,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/Perfumes/comments/v31644/celebrity_perfumes...,False,6,1654145948,1,,True,False,False,Perfumes,t5_2s8y4,13657,public,https://a.thumbs.redditmedia.com/uMJ7DxSaYConQ...,CELEBRITY PERFUMES ARE A DELIGHT,0,[],1.0,https://truperfume.fashion.blog/2022/06/02/get...,all_ads,6,link,"{'enabled': False, 'images': [{'id': 'thjZVxBh...",reddit,59.0,140.0,https://truperfume.fashion.blog/2022/06/02/get...,,,,,,,,,,,,,,,



First 5 rows of the "Makeup" dataset:


Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,author_flair_template_id,author_flair_text_color,author_flair_background_color,author_cakeday,thumbnail_height,thumbnail_width,edited
0,0,[],False,Jin3092,,[],,text,t2_n683mlf,False,False,False,[],False,False,1657337407,self.Makeup,https://www.reddit.com/r/Makeup/comments/vusi6...,{},vusi6k,False,True,False,False,False,True,True,False,,"[{'e': 'text', 't': '[Makeup Help]'}]",0eb429a6-3b52-11e3-ad3c-12313d184137,[Makeup Help],dark,richtext,False,False,True,0,0,False,all_ads,/r/Makeup/comments/vusi6k/where_do_i_ask_for_t...,False,6,1657337418,1,"I'm new to this subreddit, is there a specific...",True,False,False,Makeup,t5_2qrwc,1285287,public,top,self,where do i ask for tips and advices?,0,[],1.0,https://www.reddit.com/r/Makeup/comments/vusi6...,all_ads,6,,,,,,,,,,,,,
1,1,[],False,ThisisanotherTA0,,[],,text,t2_5if9x2aq,False,False,False,[],False,False,1657336967,self.Makeup,https://www.reddit.com/r/Makeup/comments/vusdl...,{},vusdlb,False,True,False,False,False,True,True,False,,"[{'e': 'text', 't': '[Makeup Help]'}]",0eb429a6-3b52-11e3-ad3c-12313d184137,[Makeup Help],dark,richtext,False,False,False,0,0,False,all_ads,/r/Makeup/comments/vusdlb/foundation_recommend...,False,6,1657336979,1,Hi everyone! I have really oily skin and lots ...,True,False,False,Makeup,t5_2qrwc,1285275,public,top,self,Foundation recommendations: Really oily skin w...,0,[],1.0,https://www.reddit.com/r/Makeup/comments/vusdl...,all_ads,6,,,,,,,,,,,,,
2,2,[],False,DwightsMegaDesk,,[],,text,t2_dkuit5yx,False,False,False,[],False,False,1657336542,self.Makeup,https://www.reddit.com/r/Makeup/comments/vus92...,{},vus92i,False,True,False,False,False,True,True,False,,"[{'e': 'text', 't': '[Makeup Help]'}]",0eb429a6-3b52-11e3-ad3c-12313d184137,[Makeup Help],dark,richtext,False,False,True,0,0,False,all_ads,/r/Makeup/comments/vus92i/my_eyeshadow_is_magn...,False,6,1657336552,1,"I’m not talking about the pans, I’m talking ab...",True,False,False,Makeup,t5_2qrwc,1285260,public,top,self,My eyeshadow is magnetic?,0,[],1.0,https://www.reddit.com/r/Makeup/comments/vus92...,all_ads,6,,,,,,,,,,,,,
3,3,[],False,melissajackson07,,[],,text,t2_1pqcatke,False,False,False,[],False,False,1657336154,self.Makeup,https://www.reddit.com/r/Makeup/comments/vus4y...,{},vus4yk,False,True,False,False,False,True,True,False,,"[{'e': 'text', 't': '[Makeup Help]'}]",0eb429a6-3b52-11e3-ad3c-12313d184137,[Makeup Help],dark,richtext,False,False,True,0,0,False,all_ads,/r/Makeup/comments/vus4yk/small_palette_shade_...,False,6,1657336165,1,"[Pictures are available on my profile, if you ...",True,False,False,Makeup,t5_2qrwc,1285244,public,top,self,Small Palette Shade Review • E.L.F. Cosmetics:...,0,[],1.0,https://www.reddit.com/r/Makeup/comments/vus4y...,all_ads,6,,,,,,,,,,,,,
4,4,[],False,BRB092021,,[],,text,t2_p4pnp6zd,False,False,False,[],False,False,1657335397,self.Makeup,https://www.reddit.com/r/Makeup/comments/vurwk...,{},vurwkj,False,True,False,False,False,True,True,False,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/Makeup/comments/vurwkj/creases_under_eyes_a...,False,6,1657335407,1,"Hope someone can help, I’ll try keep it short ...",True,False,False,Makeup,t5_2qrwc,1285223,public,top,self,Creases under eyes and flare ups,0,[],0.99,https://www.reddit.com/r/Makeup/comments/vurwk...,all_ads,6,,,,,,,,,,,,,


**4. Extract required columns**

Extract the `subreddit`, `title`, `selftext` columns from both datasets.

In [4]:
# Extract 'subreddit', 'title', 'selftext' columns from both datasets
perfumes_df = perfumes_df[['subreddit', 'title', 'selftext']]
makeup_df = makeup_df[['subreddit', 'title', 'selftext']]

# Diplay the simplified datasets
print("First 5 rows of the \"Perfume\" dataset:")
display(perfumes_df.head())
print("")
print("First 5 rows of the \"Makeup\" dataset:")
display(makeup_df.head())

First 5 rows of the "Perfume" dataset:


Unnamed: 0,subreddit,title,selftext
0,Perfumes,What to wear on my university graduation and g...,"I am wearing a suit on both occasions, the dis..."
1,Perfumes,What to wear on my university graduation and g...,"I am wearing a suit on both occasions, the dis..."
2,Perfumes,Rose kabuki Christian Dior,Does any of you own Rose Kabuki by Dior and co...
3,Perfumes,Looking for a perfume that smells like Melissa...,"I know this is a long shot, but I’m absolutely..."
4,Perfumes,CELEBRITY PERFUMES ARE A DELIGHT,



First 5 rows of the "Makeup" dataset:


Unnamed: 0,subreddit,title,selftext
0,Makeup,where do i ask for tips and advices?,"I'm new to this subreddit, is there a specific..."
1,Makeup,Foundation recommendations: Really oily skin w...,Hi everyone! I have really oily skin and lots ...
2,Makeup,My eyeshadow is magnetic?,"I’m not talking about the pans, I’m talking ab..."
3,Makeup,Small Palette Shade Review • E.L.F. Cosmetics:...,"[Pictures are available on my profile, if you ..."
4,Makeup,Creases under eyes and flare ups,"Hope someone can help, I’ll try keep it short ..."


**5.1. Check datasets for null values**

In [5]:
# Check null values found in both datasets
print('Null values in "Perfume" dataset:')
print(perfumes_df.isnull().sum().to_string())
print("\n")
print('Null values in "Makeup" dataset:')
print(makeup_df.isnull().sum().to_string())

Null values in "Perfume" dataset:
subreddit       0
title           0
selftext     1275


Null values in "Makeup" dataset:
subreddit      0
title          0
selftext     211


Since our purpose is to analyse the texts found in the posts, null values are not going to be useful in our analysis as there are no texts ar all. Let's drop these rows as the rows since they are not going to be useful in the analysis.

**5.2. Drop null values found in datasets.**

In [6]:
# Drop rows with null values found in the 'selftext' column
perfumes_df = perfumes_df.dropna()
makeup_df = makeup_df.dropna()

# Check null values for the dataset after dropping the null rows
print('Null values in "Perfume" dataset:')
print(perfumes_df.isnull().sum().to_string())
print("\n")
print('Null values in "Makeup" dataset:')
print(makeup_df.isnull().sum().to_string())

Null values in "Perfume" dataset:
subreddit    0
title        0
selftext     0


Null values in "Makeup" dataset:
subreddit    0
title        0
selftext     0


All the null values have now been removed. We have noted that there are some posts that are have been removed or deleted are still found in the datasets as well. Lets take a look at them.

**6.1. Check the number of "[removed]" and "[deleted]" posts.**

In [7]:
print("Number of removed posts found in the perfume dataset:")
print(perfumes_df[perfumes_df['selftext'] == '[removed]'].count().to_string())
print("")

print("Number of deleted posts found in the perfume dataset:")
print(perfumes_df[perfumes_df['selftext'] == '[deleted]'].count().to_string())
print("")

print("Number of removed posts found in the makeup dataset:")
print(makeup_df[makeup_df['selftext'] == '[deleted]'].count().to_string())
print("")
print("Number of deleted posts found in the makeup dataset:")
print(makeup_df[makeup_df['selftext'] == '[deleted]'].count().to_string())
print("")

Number of removed posts found in the perfume dataset:
subreddit    31
title        31
selftext     31

Number of deleted posts found in the perfume dataset:
subreddit    1
title        1
selftext     1

Number of removed posts found in the makeup dataset:
subreddit    1
title        1
selftext     1

Number of deleted posts found in the makeup dataset:
subreddit    1
title        1
selftext     1



Similar to posts with no texts, the removed and deleted posts are of no use to our analysis. Thus, lets remove them from our datasets as well.

**6.2. Drop "[removed]" and "[deleted]" posts.**

In [8]:
# Drop rows with removed or deleted posts found in the 'selftext' column
perfumes_df = perfumes_df[perfumes_df['selftext'] != '[removed]']
perfumes_df = perfumes_df[perfumes_df['selftext'] != '[deleted]']
makeup_df = makeup_df[makeup_df['selftext'] != '[removed]']
makeup_df = makeup_df[makeup_df['selftext'] != '[deleted]']

# Diplay the datasets after removing the rows with removed or deleted posts
print("First 5 rows of the \"Perfume\" dataset:")
display(perfumes_df.head())
print("")
print("First 5 rows of the \"Makeup\" dataset:")
display(makeup_df.head())

First 5 rows of the "Perfume" dataset:


Unnamed: 0,subreddit,title,selftext
0,Perfumes,What to wear on my university graduation and g...,"I am wearing a suit on both occasions, the dis..."
1,Perfumes,What to wear on my university graduation and g...,"I am wearing a suit on both occasions, the dis..."
2,Perfumes,Rose kabuki Christian Dior,Does any of you own Rose Kabuki by Dior and co...
3,Perfumes,Looking for a perfume that smells like Melissa...,"I know this is a long shot, but I’m absolutely..."
7,Perfumes,Help me out!,I’m a student from Poland and me and my group ...



First 5 rows of the "Makeup" dataset:


Unnamed: 0,subreddit,title,selftext
0,Makeup,where do i ask for tips and advices?,"I'm new to this subreddit, is there a specific..."
1,Makeup,Foundation recommendations: Really oily skin w...,Hi everyone! I have really oily skin and lots ...
2,Makeup,My eyeshadow is magnetic?,"I’m not talking about the pans, I’m talking ab..."
3,Makeup,Small Palette Shade Review • E.L.F. Cosmetics:...,"[Pictures are available on my profile, if you ..."
4,Makeup,Creases under eyes and flare ups,"Hope someone can help, I’ll try keep it short ..."


**6.3 Check the number of remaining rows in the dataset**

In [9]:
print('Info of the Perfumes Dataset:')
print(perfumes_df.info())
print("")
print('Info of the Makeup Dataset:')
print(makeup_df.info())
print("")

Info of the Perfumes Dataset:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1193 entries, 0 to 2499
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1193 non-null   object
 1   title      1193 non-null   object
 2   selftext   1193 non-null   object
dtypes: object(3)
memory usage: 37.3+ KB
None

Info of the Makeup Dataset:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1990 entries, 0 to 2499
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1990 non-null   object
 1   title      1990 non-null   object
 2   selftext   1990 non-null   object
dtypes: object(3)
memory usage: 62.2+ KB
None



After removing the null, removed and deleted posts from the datasets, there are 1,586 rows in the `perfumes` dataset and 1,996 rows in the `makeup` dataset.