# Guided Project: Web Data Pipeline

Goal: practice what you have learned in the Intermediate Python and Data Engineering chapter.

Get Started:
    
* Find a dataset to process. You can use one you are already familiar with, such as 
  https://www.kaggle.com/teajay/global-shark-attacks/version/1, or find another. 
  Good places to start looking are https://github.com/awesomedata/awesome-public-datasets 
  and https://www.kaggle.com/datasets.
* Examine the data and come up with a deliverable before diving in and applying any methods to it.
* Import the data and use your newly acquired skills to build a data pipeline that processes it 
  and produces a result.
* You must construct a data pipeline with the majority of your code wrapped in functions.
* Break the project down into steps. Use the stages of the data pipeline covered in the pipelines
  lesson, and answer the appropriate questions for each stage.
* Each data pipeline stage should be covered: acquisition, wrangling, analysis, and reporting.
* You must demonstrate all the topics we covered in the chapter (functions, list comprehensions, string operations, and error handling) in your processing of the data.
* Save your code in an executable Python file (.py), your data in a folder named data, and your results in a folder named output.
* You should also include a README.md file that describes the steps you took and your thought process as you built your data pipeline.
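
The four stages listed above can be laid out as one function per stage. The following is only a minimal sketch of that shape: the function names, the choice of a group-wise mean as the analysis, and the column arguments are all illustrative assumptions, not part of the assignment.

```python
import pandas as pd


def acquire(path):
    """Acquisition: load the raw CSV, with a clearer error if the file is missing."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError as err:
        raise FileNotFoundError(f"Input file not found: {path}") from err


def wrangle(df):
    """Wrangling: drop duplicate rows and tidy the column names."""
    df = df.drop_duplicates()
    # List comprehension + string operations on the column labels.
    df.columns = [col.strip().lower() for col in df.columns]
    return df


def analyze(df, group_col, value_col):
    """Analysis (hypothetical example): mean of value_col per group_col."""
    return df.groupby(group_col)[value_col].mean().reset_index()


def report(result, out_path):
    """Reporting: write the result table to the given path."""
    result.to_csv(out_path, index=False)


def run_pipeline(in_path, out_path, group_col, value_col):
    """Run the four stages in order and return the final table."""
    raw = acquire(in_path)
    clean = wrangle(raw)
    result = analyze(clean, group_col, value_col)
    report(result, out_path)
    return result
```

Keeping each stage in its own function makes it possible to test the stages separately and to call `run_pipeline` from an `if __name__ == "__main__":` block in the .py file.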


In [73]:
import pandas as pd
import numpy as np

In [77]:
products = pd.read_csv('/Users/laumaustupa/Desktop/Ironhack/Cours/data-labs/module-1/pipelines-project/data/DatafinitiElectronicsProductsPricingData.csv', encoding='latin1')
products.head()


Unnamed: 0,id,prices.amountMax,prices.amountMin,prices.availability,prices.condition,prices.currency,prices.dateSeen,prices.isSale,prices.merchant,prices.shipping,prices.sourceURLs,asins,brand,categories,dateAdded,dateUpdated,ean,imageURLs,keys,manufacturer,manufacturerNumber,name,primaryCategories,sourceURLs,upc,weight,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,AVphzgbJLJeJML43fA0o,104.99,104.99,Yes,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,http://www.bestbuy.com/site/sanus-tv-wall-moun...,B00C78VIUE,Sanus,"Audio & Video Accessories,TV Mounts,TV Accesso...",2015-04-13T12:00:51Z,2018-05-12T18:59:48Z,,https://images-na.ssl-images-amazon.com/images...,sanusvlf410b110inchsuperslimfullmotionmountfor...,,VLF410B1,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,793796000000.0,32.8 pounds,,,,,
1,AVpgMuGwLJeJML43KY_c,69.0,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,https://www.walmart.com/ip/BOYTONE-BT210FB-BLA...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds,,,,,
2,AVpgMuGwLJeJML43KY_c,69.0,69.0,In Stock,New,USD,2017-09-08T05:00:00Z,False,Walmart.com,Expedited,https://www.walmart.com/ip/Boytone-BT-210F-Blu...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds,,,,,
3,AVpgMuGwLJeJML43KY_c,69.99,69.99,Yes,New,USD,2017-10-10T05:00:00Z,False,Bestbuy.com,,https://www.bestbuy.com/site/boytone-2500w-2-1...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds,,,,,
4,AVpgMuGwLJeJML43KY_c,66.99,66.99,Yes,New,USD,2017-08-28T07:00:00Z,False,Bestbuy.com,,https://www.bestbuy.com/site/boytone-2500w-2-1...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds,,,,,


In [78]:
#Remove duplicate rows:
before = len(products)
products = products.drop_duplicates()
after = len(products)

print('Number of duplicate records dropped: ', str(before - after))

Number of duplicate records dropped:  13


In [79]:
#Row count:
print('Number of total rows: ', products.shape[0])

#Missing Values:
null_cols = products.isnull().sum()
null_cols[null_cols > 0]

Number of total rows:  14579


prices.availability     1916
prices.condition        1394
prices.merchant          761
prices.shipping         5786
ean                    10372
manufacturer            6971
Unnamed: 26            14540
Unnamed: 27            14561
Unnamed: 28            14573
Unnamed: 29            14567
Unnamed: 30            14573
dtype: int64

In [80]:
#Drop the last five columns, as they are almost entirely empty:
products = products.drop(['Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30'], axis=1)
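
As an alternative to listing the columns by hand, the same step could be wrapped in a function that drops any column whose share of missing values exceeds a threshold. This is a sketch only: the function name and the 0.95 default are assumptions. With the null counts printed above, that threshold would appear to select the same five `Unnamed` columns while keeping `ean` and `manufacturer`.

```python
import pandas as pd


def drop_sparse_columns(df, max_null_fraction=0.95):
    """Drop columns whose fraction of missing values exceeds max_null_fraction.

    Returns the reduced DataFrame and the list of dropped column names.
    The 0.95 default is an illustrative assumption, not a recommended value.
    """
    null_fraction = df.isnull().mean()  # per-column share of missing values
    sparse = [col for col in df.columns if null_fraction[col] > max_null_fraction]
    return df.drop(columns=sparse), sparse
```

Usage would look like `products, dropped = drop_sparse_columns(products)`, after which `dropped` can be printed or logged for the README.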


In [81]:
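#Remove the 'prices.' prefix from column names (regex=False so the dot is matched literally).
#Note: 'prices.sourceURLs' becomes 'sourceURLs', which duplicates an existing column name.
products.columns = products.columns.str.replace('prices.', '', regex=False)
products.head()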

Unnamed: 0,id,amountMax,amountMin,availability,condition,currency,dateSeen,isSale,merchant,shipping,sourceURLs,asins,brand,categories,dateAdded,dateUpdated,ean,imageURLs,keys,manufacturer,manufacturerNumber,name,primaryCategories,sourceURLs.1,upc,weight
0,AVphzgbJLJeJML43fA0o,104.99,104.99,Yes,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,http://www.bestbuy.com/site/sanus-tv-wall-moun...,B00C78VIUE,Sanus,"Audio & Video Accessories,TV Mounts,TV Accesso...",2015-04-13T12:00:51Z,2018-05-12T18:59:48Z,,https://images-na.ssl-images-amazon.com/images...,sanusvlf410b110inchsuperslimfullmotionmountfor...,,VLF410B1,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,793796000000.0,32.8 pounds
1,AVpgMuGwLJeJML43KY_c,69.0,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,https://www.walmart.com/ip/BOYTONE-BT210FB-BLA...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
2,AVpgMuGwLJeJML43KY_c,69.0,69.0,In Stock,New,USD,2017-09-08T05:00:00Z,False,Walmart.com,Expedited,https://www.walmart.com/ip/Boytone-BT-210F-Blu...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
3,AVpgMuGwLJeJML43KY_c,69.99,69.99,Yes,New,USD,2017-10-10T05:00:00Z,False,Bestbuy.com,,https://www.bestbuy.com/site/boytone-2500w-2-1...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
4,AVpgMuGwLJeJML43KY_c,66.99,66.99,Yes,New,USD,2017-08-28T07:00:00Z,False,Bestbuy.com,,https://www.bestbuy.com/site/boytone-2500w-2-1...,"B018K251JE,B00VILQKQ8",Boytone,"Stereos,Portable Bluetooth Speakers,TV, Video ...",2015-05-18T14:14:56Z,2018-06-13T19:39:02Z,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds


In [82]:
set(products['availability'])

{'32 available',
 '7 available',
 'Discontinued',
 'FALSE',
 'In Stock',
 'Limited Stock',
 'More Coming Soon',
 'More on the Way',
 'No',
 'Out Of Stock',
 'Retired',
 'Special Order',
 'TRUE',
 'Yes',
 nan,
 'sold',
 'undefined',
 'yes'}

In [83]:
#Map the raw availability labels onto a small set of consistent categories:
availability_map = {
    'No': 'Out Of Stock',
    'sold': 'Out Of Stock',
    'FALSE': 'Out Of Stock',
    '32 available': 'In Stock',
    '7 available': 'Limited Stock',
    'TRUE': 'In Stock',
    'yes': 'In Stock',
    'Yes': 'In Stock',
    'More Coming Soon': 'Soon Back In Stock',
    'More on the Way': 'Soon Back In Stock',
    'Special Order': 'Available On Request',
    'undefined': np.nan,  # treat 'undefined' as missing
}
products['availability'] = products['availability'].replace(availability_map)

set(products['availability'])

{'Available On Request',
 'Discontinued',
 'In Stock',
 'Limited Stock',
 'Out Of Stock',
 'Retired',
 'Soon Back In Stock',
 nan}
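
The label clean-up above could also be expressed as a single function, which is one way to bring string operations into the wrangling stage. The sketch below is an assumption-laden variant, not the notebook's method: it lowercases and strips each label before lookup, and it treats any "N available" label as 'Limited Stock' when N is below a hypothetical threshold of 10 (chosen only because it reproduces the 7 / 32 split used above).

```python
import re

import numpy as np
import pandas as pd

# Lowercased raw label -> cleaned category, mirroring the replacements above.
LABELS = {
    'yes': 'In Stock',
    'true': 'In Stock',
    'in stock': 'In Stock',
    'no': 'Out Of Stock',
    'false': 'Out Of Stock',
    'sold': 'Out Of Stock',
    'out of stock': 'Out Of Stock',
    'more coming soon': 'Soon Back In Stock',
    'more on the way': 'Soon Back In Stock',
    'special order': 'Available On Request',
    'undefined': np.nan,
}


def normalize_availability(value, low_stock_threshold=10):
    """Return a cleaned availability label; unknown labels pass through unchanged."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip().lower()
    match = re.fullmatch(r'(\d+) available', text)
    if match:
        # low_stock_threshold is an illustrative assumption.
        count = int(match.group(1))
        return 'Limited Stock' if count < low_stock_threshold else 'In Stock'
    return LABELS.get(text, value)
```

It would be applied with `products['availability'].map(normalize_availability)`.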

In [84]:
products['availability'].value_counts()

In Stock                12037
Out Of Stock              267
Soon Back In Stock        150
Available On Request      146
Limited Stock               8
Discontinued                5
Retired                     1
Name: availability, dtype: int64

In [88]:
products.columns

Index(['id', 'amountMax', 'amountMin', 'availability', 'condition', 'currency',
       'dateSeen', 'isSale', 'merchant', 'shipping', 'sourceURLs', 'asins',
       'brand', 'categories', 'dateAdded', 'dateUpdated', 'ean', 'imageURLs',
       'keys', 'manufacturer', 'manufacturerNumber', 'name',
       'primaryCategories', 'sourceURLs', 'upc', 'weight'],
      dtype='object')
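
The column index above lists 'sourceURLs' twice, because stripping the 'prices.' prefix made 'prices.sourceURLs' collide with the existing 'sourceURLs' column. One possible way to avoid this, sketched here with a hypothetical helper name, is to keep the prefixed name whenever the shortened name is already taken.

```python
def strip_prefix(columns, prefix='prices.'):
    """Remove prefix from column names unless the short name already exists."""
    existing = set(columns)
    renamed = []
    for col in columns:
        short = col[len(prefix):] if col.startswith(prefix) else col
        # Keep the original (prefixed) name when shortening would create a duplicate.
        collides = short != col and short in existing
        renamed.append(col if collides else short)
    return renamed
```

With this helper the rename would read `products.columns = strip_prefix(list(products.columns))`, leaving 'prices.sourceURLs' distinct from 'sourceURLs'.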

In [87]:
products['primaryCategories'].value_counts()

Electronics              14469
Electronics,Furniture       61
 Apple CarPlay              27
 Intel Celeron              12
 Siri Eyes Free              6
Electronics,Media            4
Name: primaryCategories, dtype: int64
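
Three of the values above (' Apple CarPlay', ' Intel Celeron', ' Siri Eyes Free') start with a space and do not look like categories, which may mean those rows were misaligned when the CSV was parsed. That is a guess, not something the output confirms. A small helper along these lines could separate such rows for inspection; the function name and the 'Electronics' prefix check are assumptions.

```python
import pandas as pd


def split_suspect_rows(df, column='primaryCategories', expected_prefix='Electronics'):
    """Split df into rows whose column value starts with expected_prefix and the rest."""
    # Missing values are treated as not matching the expected prefix.
    is_expected = df[column].fillna('').str.startswith(expected_prefix)
    return df[is_expected], df[~is_expected]
```

Calling `ok_rows, suspect_rows = split_suspect_rows(products)` would make it possible to look at the suspect rows before deciding whether to repair or drop them.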

In [70]:
products['currency'].value_counts()

USD    14496
CAD       53
EUR       33
SGD        8
GBP        2
Name: currency, dtype: int64

In [None]:
#Create a new column that converts all prices to USD
#Calculate the mean price
#Or create binning for price ranges
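
The three ideas noted above could be sketched as follows. Everything specific here is an assumption made for illustration: the exchange rates are placeholder numbers and not real market rates, the new column names `amountMinUSD` and `priceRange` are hypothetical, and the bin edges are arbitrary.

```python
import pandas as pd

# PLACEHOLDER rates (USD per one unit of each currency). Illustrative values
# only, not real exchange rates; replace them before drawing any conclusions.
USD_PER_UNIT = {'USD': 1.0, 'CAD': 0.75, 'EUR': 1.10, 'SGD': 0.70, 'GBP': 1.25}


def add_usd_price(df, rates=USD_PER_UNIT):
    """Add an 'amountMinUSD' column converting 'amountMin' via the 'currency' column."""
    out = df.copy()
    # Currencies missing from rates map to NaN, so their USD price is NaN too.
    out['amountMinUSD'] = out['amountMin'] * out['currency'].map(rates)
    return out


def add_price_range(df, edges=(0, 50, 200, 1000, float('inf'))):
    """Add a 'priceRange' column that bins 'amountMinUSD' into labelled ranges."""
    labels = ['<50', '50-200', '200-1000', '1000+']
    out = df.copy()
    # right=False makes each bin include its lower edge: [0, 50), [50, 200), ...
    out['priceRange'] = pd.cut(out['amountMinUSD'], bins=list(edges),
                               labels=labels, right=False)
    return out
```

The mean price would then be `add_usd_price(products)['amountMinUSD'].mean()`, and `value_counts()` on `priceRange` would give the number of listings per range.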

In [91]:
#Rows with a missing manufacturer, alongside their condition and brand:
null_manufacturer = products[products['manufacturer'].isnull()]
null_manufacturer = null_manufacturer[['manufacturer', 'condition', 'brand']]
null_manufacturer

Unnamed: 0,manufacturer,condition,brand
0,,New,Sanus
18,,New,DreamWave
19,,New,DreamWave
20,,new,DreamWave
21,,new,DreamWave
22,,New,DreamWave
28,,New,Universal Remote Control
29,,New,Universal Remote Control
30,,new,Universal Remote Control
31,,New,Universal Remote Control


In [None]:
#Fill in the missing manufacturers (for Samsung etc.)?
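
One possible answer to the question above is to fall back on the brand when the manufacturer is missing. This rests on the assumption that brand and manufacturer are usually the same company, which the data shown here neither confirms nor rules out. The function name is hypothetical.

```python
import pandas as pd


def fill_manufacturer_from_brand(df):
    """Return a copy of df where missing 'manufacturer' values are taken from 'brand'."""
    out = df.copy()
    # Assumption: the brand is an acceptable stand-in for a missing manufacturer.
    out['manufacturer'] = out['manufacturer'].fillna(out['brand'])
    return out
```

Rows where both columns are missing stay missing, so `isnull().sum()` can be re-run afterwards to see how many gaps remain.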