# Guided Project: Web Data Pipeline

Goal: Practice what you have learned in the Intermediate Python and Data Engineering chapter.

Get Started:
    
* Find a dataset to process. You can use one you are already familiar with, such as
  https://www.kaggle.com/teajay/global-shark-attacks/version/1, or find another.
  Good places to start looking are https://github.com/awesomedata/awesome-public-datasets
  and https://www.kaggle.com/datasets.
* Examine the data and come up with a deliverable before diving in and applying any methods to it.
* Import the data and use your newly acquired skills to build a data pipeline that processes it
  and produces a result.
* You must construct a data pipeline with the majority of your code wrapped in functions.
* Break the project down into steps: follow the stages of the data pipeline covered in the pipelines
  lesson and answer the appropriate questions for each stage.
* Each data pipeline stage should be covered: acquisition, wrangling, analysis, and reporting.
* You must demonstrate all the topics we covered in the chapter (functions, list comprehensions, string operations, and error handling) in your processing of the data.
* Save your code in an executable Python file (.py), your data in a folder named data, and your results in a folder named output.
* You should also include a README.md file that describes the steps you took and your thought process as you built your data pipeline.
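
As a rough illustration of the four stages listed above, the sketch below wraps each stage in its own function and chains them. The sample CSV, the column names (`name`, `price`, `availability`), and the function names are all hypothetical placeholders, not part of the assignment; a real project would read from the data folder and write to the output folder.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for a real CSV file in data/.
SAMPLE_CSV = """name,price,availability
Speaker A,69.99,Yes
Speaker B,not available,In Stock
Speaker A,69.99,Yes
"""


def acquire(source):
    """Acquisition: load the raw data from a path or file-like object."""
    return pd.read_csv(source)


def wrangle(df):
    """Wrangling: drop duplicate rows and keep only rows with a numeric price."""
    df = df.drop_duplicates().copy()
    # errors='coerce' turns unparseable prices into NaN instead of raising.
    df['price'] = pd.to_numeric(df['price'], errors='coerce')
    return df.dropna(subset=['price'])


def analyze(df):
    """Analysis: average price per availability label."""
    return df.groupby('availability')['price'].mean()


def report(result):
    """Reporting: render the result as plain text."""
    return result.to_string()


def run_pipeline(source):
    """Chain the four stages in order."""
    return report(analyze(wrangle(acquire(source))))


summary = run_pipeline(io.StringIO(SAMPLE_CSV))
print(summary)
```

Keeping each stage in a separate function makes it possible to test the stages one at a time and to swap the data source without touching the rest.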


In [8]:
import pandas as pd
import numpy as np

In [28]:
products = pd.read_csv('/Users/laumaustupa/Desktop/Ironhack/Cours/data-labs/module-1/pipelines-project/data/DatafinitiElectronicsProductsPricingData.csv', encoding='latin1')
products

Unnamed: 0,id,prices.amountMax,prices.amountMin,prices.availability,prices.condition,prices.currency,prices.dateSeen,prices.isSale,prices.merchant,prices.shipping,...,name,primaryCategories,sourceURLs,upc,weight,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,AVphzgbJLJeJML43fA0o,104.99,104.99,Yes,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,...,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,7.93796E+11,32.8 pounds,,,,,
1,AVpgMuGwLJeJML43KY_c,69.00,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
2,AVpgMuGwLJeJML43KY_c,69.00,69.00,In Stock,New,USD,2017-09-08T05:00:00Z,False,Walmart.com,Expedited,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
3,AVpgMuGwLJeJML43KY_c,69.99,69.99,Yes,New,USD,2017-10-10T05:00:00Z,False,Bestbuy.com,,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
4,AVpgMuGwLJeJML43KY_c,66.99,66.99,Yes,New,USD,2017-08-28T07:00:00Z,False,Bestbuy.com,,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
5,AVpgMuGwLJeJML43KY_c,66.00,66.00,In Stock,New,USD,2017-10-24T04:00:00Z,False,Walmart.com,Expedited,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
6,AVpgMuGwLJeJML43KY_c,74.99,74.99,In Stock,New,USD,2017-08-15T15:00:00Z,False,California Electronics,Value,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
7,AVpgMuGwLJeJML43KY_c,65.99,65.99,Yes,New,USD,2017-08-12T09:00:00Z,False,Bestbuy.com,,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
8,AVpgMuGwLJeJML43KY_c,69.99,69.99,In Stock,New,USD,2017-09-14T21:00:00Z,False,Walmart.com,Expedited,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,
9,AVpgMuGwLJeJML43KY_c,69.00,64.00,In Stock,New,USD,2018-01-21T08:00:00Z,True,Walmart.com,Standard,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,6.42015E+11,14 pounds,,,,,


In [29]:
#Remove duplicate rows:
before = len(products)
products = products.drop_duplicates()
after = len(products)

print('Number of duplicate records dropped: ', str(before - after))

Number of duplicate records dropped:  0


In [30]:
#Row count:
print('Number of total rows: ', products.shape[0])

#Missing Values:
null_cols = products.isnull().sum()
null_cols[null_cols > 0]

Number of total rows:  3500


prices.shipping    1540
ean                2901
manufacturer       1934
Unnamed: 26        3500
Unnamed: 27        3494
Unnamed: 28        3494
Unnamed: 29        3500
Unnamed: 30        3494
dtype: int64

In [31]:
#Drop the last five columns, as they are empty or almost empty:
products = products.drop(['Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30'], axis=1)
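
The same step could also be written as a function that picks the columns by their share of missing values instead of by name. This is only a sketch: the helper name `drop_sparse_columns`, the `max_missing` threshold, and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.99):
    """Return (frame without sparse columns, list of dropped column names).

    A column counts as sparse when its share of missing values is above max_missing.
    """
    missing_share = df.isnull().mean()
    sparse = [col for col in df.columns if missing_share[col] > max_missing]
    return df.drop(columns=sparse), sparse


# Toy frame; the values are made up for the example.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'shipping': ['Standard', np.nan, np.nan, 'Value'],
    'Unnamed: 26': [np.nan, np.nan, np.nan, np.nan],
})

cleaned, dropped = drop_sparse_columns(toy, max_missing=0.9)
print(dropped)                 # ['Unnamed: 26']
print(list(cleaned.columns))   # ['name', 'shipping']
```

With the counts shown above (3494 or 3500 missing out of 3500 rows), a threshold of 0.99 would select the five `Unnamed` columns and keep `ean`, which is missing in 2901 rows.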


In [32]:
#Remove the 'prices.' prefix from column names (literal match, not a regex):
products.columns = products.columns.str.replace('prices.', '', regex=False)
products.head()

Unnamed: 0,id,amountMax,amountMin,availability,condition,currency,dateSeen,isSale,merchant,shipping,...,ean,imageURLs,keys,manufacturer,manufacturerNumber,name,primaryCategories,sourceURLs,upc,weight
0,AVphzgbJLJeJML43fA0o,104.99,104.99,Yes,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,...,,https://images-na.ssl-images-amazon.com/images...,sanusvlf410b110inchsuperslimfullmotionmountfor...,,VLF410B1,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,793796000000.0,32.8 pounds
1,AVpgMuGwLJeJML43KY_c,69.0,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
2,AVpgMuGwLJeJML43KY_c,69.0,69.0,In Stock,New,USD,2017-09-08T05:00:00Z,False,Walmart.com,Expedited,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
3,AVpgMuGwLJeJML43KY_c,69.99,69.99,Yes,New,USD,2017-10-10T05:00:00Z,False,Bestbuy.com,,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
4,AVpgMuGwLJeJML43KY_c,66.99,66.99,Yes,New,USD,2017-08-28T07:00:00Z,False,Bestbuy.com,,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds


In [33]:
#check all column names:
products.columns

Index(['id', 'amountMax', 'amountMin', 'availability', 'condition', 'currency',
       'dateSeen', 'isSale', 'merchant', 'shipping', 'sourceURLs', 'asins',
       'brand', 'categories', 'dateAdded', 'dateUpdated', 'ean', 'imageURLs',
       'keys', 'manufacturer', 'manufacturerNumber', 'name',
       'primaryCategories', 'sourceURLs', 'upc', 'weight'],
      dtype='object')

In [34]:
#Check categories for availability:
set(products['availability'])

{'FALSE',
 'In Stock',
 'More on the Way',
 'No',
 'Out Of Stock',
 'Retired',
 'Special Order',
 'TRUE',
 'Yes',
 'sold',
 'undefined',
 'yes'}

In [35]:
#Clean up: map the inconsistent labels onto a small set of categories
availability_map = {
    'No': 'Out Of Stock',
    'sold': 'Out Of Stock',
    'FALSE': 'Out Of Stock',
    'TRUE': 'In Stock',
    'yes': 'In Stock',
    'Yes': 'In Stock',
    'More on the Way': 'Soon Back In Stock',
    'Special Order': 'Available On Request',
    'undefined': np.nan,
}
products['availability'] = products['availability'].replace(availability_map)

set(products['availability'])

{'Available On Request',
 'In Stock',
 'Out Of Stock',
 'Retired',
 'Soon Back In Stock',
 nan}
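
A function-based variant of this clean-up could use string operations so that spelling variants such as 'Yes' and 'yes' need only one entry. The sketch below is an assumption-laden alternative, not the method used in the notebook: `AVAILABILITY_MAP` and `normalize_availability` are hypothetical names, and unlike the cell above it turns every unknown label into NaN rather than leaving it unchanged.

```python
import numpy as np
import pandas as pd

# Keys are lowercase; the categories match the cleaned labels shown above.
AVAILABILITY_MAP = {
    'no': 'Out Of Stock',
    'sold': 'Out Of Stock',
    'false': 'Out Of Stock',
    'out of stock': 'Out Of Stock',
    'true': 'In Stock',
    'yes': 'In Stock',
    'in stock': 'In Stock',
    'more on the way': 'Soon Back In Stock',
    'special order': 'Available On Request',
    'retired': 'Retired',
}


def normalize_availability(value):
    """Map a raw label to its category; unknown or missing labels become NaN."""
    if not isinstance(value, str):
        return np.nan
    return AVAILABILITY_MAP.get(value.strip().lower(), np.nan)


raw = pd.Series(['Yes', 'yes', 'TRUE', 'sold', 'undefined', 'Special Order', None])
clean = raw.map(normalize_availability)
print(clean.tolist())
```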

In [36]:
#Count number per category:
products['availability'].value_counts()

In Stock                3303
Out Of Stock              68
Available On Request      52
Soon Back In Stock        45
Retired                    1
Name: availability, dtype: int64

In [37]:
#Count products per primary category:
products['primaryCategories'].value_counts()

Electronics        3494
 Siri Eyes Free       6
Name: primaryCategories, dtype: int64

In [38]:
#Count products per currency:
products['currency'].value_counts()

USD    3500
Name: currency, dtype: int64

In [39]:
# Check if all values in amountMax and amountMin columns are the same
set(products['amountMax'] == products['amountMin'])

{False, True}
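
The set above only shows that both cases occur. To see how many rows differ, the boolean comparison can be counted directly. A minimal sketch on a made-up three-row frame (the prices are illustrative, not taken from the full dataset):

```python
import pandas as pd

# Toy prices; one of the three rows has a lower minimum than maximum.
prices = pd.DataFrame({
    'amountMax': [104.99, 69.00, 69.00],
    'amountMin': [104.99, 64.99, 69.00],
})

same = prices['amountMax'] == prices['amountMin']

print(same.all())       # False
print((~same).sum())    # 1
```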

In [40]:
#Fill in missing manufacturer with brand:
products['manufacturer'] = products['manufacturer'].fillna(products['brand'])
set(products.manufacturer)

{'5 Years',
 'AOC',
 'ASUS',
 'AUDIOQUEST',
 'Acer',
 'Actiontec',
 'AfterShokz',
 'Aftershokz',
 'Aiwa',
 'Alienware',
 'Allround Software',
 'Alpine',
 'Alpine Electronics USA',
 'Anker',
 'Antec',
 'Apple',
 'Arris',
 'Atrend',
 'AudioQuest',
 'Audioquest',
 'BOSE',
 'Belkin',
 'Bose',
 'Bowers & Wilkins',
 'Boytone',
 'CORSAIR',
 'Canon',
 'Canon Cameras US',
 'Case Logic',
 'Cerwin Vega',
 'Cerwin-Vega',
 'Chief Manufacturing',
 'Cooks International LLC (Former Coshell Holdings LLC)',
 'Coolermaster',
 'Corsair',
 'Cybertronpc',
 'Dantona',
 'Definitive Technology',
 'Dell',
 'Denaq',
 'Denon',
 'Digipower',
 'DreamWave',
 'Dual',
 'ECOXGEAR',
 'ELITE SCREENS DIRECTSHIP',
 'Elite Screens',
 'Elitescreens',
 'Emerge',
 'Epson',
 'Escort',
 'Fitbit',
 'G-Technology',
 'GEKO',
 'Garmin',
 'GoPro',
 'Goal Zero',
 'Grace Digital',
 'Grace Digital Audio',
 'HP',
 'Hauppauge',
 'Hisense',
 'House Of Marley',
 'House of Marley',
 'Hoya',
 'IOGEAR',
 'ISIMPLE',
 'Incipio',
 'Innovative Tec

In [41]:
null_cols = products.isnull().sum()
null_cols[null_cols > 0]

availability      31
shipping        1540
ean             2901
dtype: int64

In [43]:
#Replace the incorrect value '5 Years' in manufacturer with the brand from the same row:
products['manufacturer'] = products['manufacturer'].mask(products['manufacturer'] == '5 Years', products['brand'])

In [44]:
set(products.manufacturer)

{'AOC',
 'ASUS',
 'AUDIOQUEST',
 'Acer',
 'Actiontec',
 'AfterShokz',
 'Aftershokz',
 'Aiwa',
 'Alienware',
 'Allround Software',
 'Alpine',
 'Alpine Electronics USA',
 'Anker',
 'Antec',
 'Apple',
 'Arris',
 'Atrend',
 'AudioQuest',
 'Audioquest',
 'BOSE',
 'Belkin',
 'Bose',
 'Bowers & Wilkins',
 'Boytone',
 'CORSAIR',
 'Canon',
 'Canon Cameras US',
 'Case Logic',
 'Cerwin Vega',
 'Cerwin-Vega',
 'Chief Manufacturing',
 'Cooks International LLC (Former Coshell Holdings LLC)',
 'Coolermaster',
 'Corsair',
 'Cybertronpc',
 'Dantona',
 'Definitive Technology',
 'Dell',
 'Denaq',
 'Denon',
 'Digipower',
 'DreamWave',
 'Dual',
 'ECOXGEAR',
 'ELITE SCREENS DIRECTSHIP',
 'Elite Screens',
 'Elitescreens',
 'Emerge',
 'Epson',
 'Escort',
 'Fitbit',
 'G-Technology',
 'GEKO',
 'Garmin',
 'GoPro',
 'Goal Zero',
 'Grace Digital',
 'Grace Digital Audio',
 'HP',
 'Hauppauge',
 'Hisense',
 'House Of Marley',
 'House of Marley',
 'Hoya',
 'IOGEAR',
 'ISIMPLE',
 'Incipio',
 'Innovative Technology',
 '
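
The listing still contains names that differ only in capitalization, such as 'BOSE' and 'Bose'. One possible way to find such groups, sketched here with a hypothetical helper `find_case_variants` and a few names copied from the output above, is to group the names by a case-insensitive key. Whether and how to merge the variants is a separate decision that the notebook does not make.

```python
from collections import defaultdict


def find_case_variants(names):
    """Group names that are equal when case is ignored; return groups with more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


sample = ['BOSE', 'Bose', 'AUDIOQUEST', 'AudioQuest', 'Audioquest', 'Apple']
variants = find_case_variants(sample)
print(variants)
```

Names that differ by more than case, such as 'Cerwin Vega' and 'Cerwin-Vega', are not caught by this check.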

In [49]:
products.head()

Unnamed: 0,id,amountMax,amountMin,availability,condition,currency,dateSeen,isSale,merchant,shipping,...,ean,imageURLs,keys,manufacturer,manufacturerNumber,name,primaryCategories,sourceURLs,upc,weight
0,AVphzgbJLJeJML43fA0o,104.99,104.99,In Stock,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,...,,https://images-na.ssl-images-amazon.com/images...,sanusvlf410b110inchsuperslimfullmotionmountfor...,Sanus,VLF410B1,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,793796000000.0,32.8 pounds
1,AVpgMuGwLJeJML43KY_c,69.0,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
2,AVpgMuGwLJeJML43KY_c,69.0,69.0,In Stock,New,USD,2017-09-08T05:00:00Z,False,Walmart.com,Expedited,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
3,AVpgMuGwLJeJML43KY_c,69.99,69.99,In Stock,New,USD,2017-10-10T05:00:00Z,False,Bestbuy.com,,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
4,AVpgMuGwLJeJML43KY_c,66.99,66.99,In Stock,New,USD,2017-08-28T07:00:00Z,False,Bestbuy.com,,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642015000000.0,14 pounds
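
The `weight` column holds text such as '32.8 pounds', so it cannot be used in calculations as it is. The sketch below shows one way a string operation and error handling could be combined to convert it. The helper `parse_pounds` and the list of accepted unit spellings are assumptions; only the two weight strings visible in the preview are known, and other rows may use formats this function returns None for.

```python
def parse_pounds(text):
    """Return the weight in pounds as a float, or None if the text cannot be parsed."""
    try:
        number, unit = text.strip().split(maxsplit=1)
        if unit.lower() not in ('pound', 'pounds', 'lb', 'lbs'):
            return None
        return float(number)
    except (AttributeError, ValueError):
        # AttributeError: text is not a string (for example None or NaN).
        # ValueError: the unit is missing or the number is not numeric.
        return None


samples = ['32.8 pounds', '14 pounds', '14', 'heavy pounds', None]
weights = [parse_pounds(sample) for sample in samples]
print(weights)
```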


In [50]:
#Export csv:
products.to_csv('../output/electronics_clean.csv', index=False)
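
The export fails if the output folder does not exist yet. A small wrapper can create the folder first. This is a sketch under assumptions: `export_csv` is a hypothetical helper, and the example writes a made-up two-row frame into a temporary directory so that it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_csv(df, path):
    """Write df to path as CSV, creating the parent folder first if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Toy frame written to a temporary folder and read back to check the result.
toy = pd.DataFrame({'id': ['a', 'b'], 'amountMax': [69.0, 104.99]})

with tempfile.TemporaryDirectory() as tmp:
    target = export_csv(toy, Path(tmp) / 'output' / 'electronics_clean.csv')
    round_trip = pd.read_csv(target)

print(round_trip.shape)
```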