# Extracting Product Information from the Amazon Website Using Scrapy
# Objectives
This project uses Python web scraping tools to extract information about data science books listed on Amazon: titles, authors, review counts, prices, and image URLs. The data is then analyzed to explore relationships between columns (for example, price and review count) and to surface useful insights. The collected data is stored in CSV and JSON format for later use or further analysis.

# Contents
    ∎ Scraping multiple pages of data from the Amazon website using Scrapy, CSS selectors, and VS Code
    ∎ Importing the necessary Python libraries to crawl and analyze the data
    ∎ Importing & inspecting the raw data table structure
    ∎ Detecting nulls
    ∎ Handling nulls & duplicated values
    ∎ Renaming & rearranging columns
    ∎ Converting dtype 'object' to 'int' to analyze the reviews & price columns as numerical values
    ∎ Top 10 books with a title containing "Data Science", ranked by highest reviews and lowest price
    ∎ Book titles containing "Data Analytics"
    ∎ Books having >=200 reviews
    ∎ Grouping books by author & maximum reviews
    ∎ Highest, lowest, and average purchase prices
    ∎ Which book has the most reviews?
    ∎ Which books have the fewest reviews?
    ∎ Do higher-priced books have more reviews?

# 1. Importing necessary Python libraries

In [2]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd

# 2. Import & Inspect raw data table structure

In [3]:
data = pd.read_csv(r'C:\Users\user\Desktop\Amazon Data Extraction Project Multiple Pages\AmazonScrapy\products.csv')
data

Unnamed: 0,author,product_name,product_price,product_reviews,product_url
0,Alex J. Gutman,"Becoming a Data Head: How to Think, Speak and ...",24,209,https://m.media-amazon.com/images/I/51dY5RkvW6...
1,,The Four Agreements (Illustrated Edition) A Pr...,27,88941,https://m.media-amazon.com/images/I/91ryfmRier...
2,Hal Harvey,,28,18,https://m.media-amazon.com/images/I/716QwZRD3e...
3,Ryan O'Hanlon,Net Gains: Inside the Beautiful Game’s Analyti...,27,27,https://m.media-amazon.com/images/I/61k00WsIsj...
4,Stanley Chiang,,26,118,https://m.media-amazon.com/images/I/61AzwN+jyw...
...,...,...,...,...,...
784,Rick Sherman,Business Intelligence Guidebook: From Data Int...,31,98,https://m.media-amazon.com/images/I/51BvOBcIDV...
785,Kenneth Cukier,,18,143,https://m.media-amazon.com/images/I/71UZqltaRa...
786,Stefan Papp,Data Scientist Lined Notebook Data Science Jou...,7,1,https://m.media-amazon.com/images/I/61nQ117vbz...
787,David Spiegelhalter,The Art of Statistics: How to Learn from Data,18,2906,https://m.media-amazon.com/images/I/91AxFQKFxW...


# 3. Detecting nulls

In [4]:
data.isnull().sum()

author             490
product_name       576
product_price      428
product_reviews    429
product_url        428
dtype: int64

# 4. Handling nulls & duplicated values

In [5]:
data.dropna(inplace = True)
data

Unnamed: 0,author,product_name,product_price,product_reviews,product_url
0,Alex J. Gutman,"Becoming a Data Head: How to Think, Speak and ...",24,209,https://m.media-amazon.com/images/I/51dY5RkvW6...
3,Ryan O'Hanlon,Net Gains: Inside the Beautiful Game’s Analyti...,27,27,https://m.media-amazon.com/images/I/61k00WsIsj...
5,Wayne Winston,Microsoft Excel Data Analysis and Business Mod...,10,423,https://m.media-amazon.com/images/I/81FoeAfUIX...
6,Nick Singh,Ace the Data Science Interview: 201 Real Inter...,38,695,https://m.media-amazon.com/images/I/61Yd6YXufG...
7,Matt Goldwasser,SQL for Data Analytics: Perform fast and effic...,27,234,https://m.media-amazon.com/images/I/719zBDnZtX...
...,...,...,...,...,...
781,Nicholas Kelly,Delivering Data Analytics: A Step-By-Step Guid...,34,19,https://m.media-amazon.com/images/I/81E6AdHoVx...
784,Rick Sherman,Business Intelligence Guidebook: From Data Int...,31,98,https://m.media-amazon.com/images/I/51BvOBcIDV...
786,Stefan Papp,Data Scientist Lined Notebook Data Science Jou...,7,1,https://m.media-amazon.com/images/I/61nQ117vbz...
787,David Spiegelhalter,The Art of Statistics: How to Learn from Data,18,2906,https://m.media-amazon.com/images/I/91AxFQKFxW...
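
The cell above removes rows with missing values but leaves repeated listings in place (the later outputs show the same book at index 6 and 400, for example). A minimal sketch of dropping nulls and duplicates together is shown below. The `sample` frame and its values are hypothetical stand-ins for the scraped table, and treating rows with the same `author` and `product_name` as duplicates is an assumption about what "duplicated" should mean here.

```python
import pandas as pd

# Hypothetical sample mirroring the scraped columns; values are illustrative only.
sample = pd.DataFrame({
    "author": ["Nick Singh", "Nick Singh", None, "Wayne Winston"],
    "product_name": ["Book A", "Book A", "Book B", "Book C"],
    "product_price": ["38", "38", "27", "10"],
    "product_reviews": ["695", "695", "12", "423"],
})

# Drop rows with any missing field, then keep the first copy of each listing.
cleaned = (
    sample.dropna()
    .drop_duplicates(subset=["author", "product_name"], keep="first")
    .reset_index(drop=True)
)
print(cleaned)
```

Resetting the index is optional; it only makes the row labels contiguous again after rows are removed.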


# 5. Renaming columns

In [6]:
data = data.rename(columns={'author': 'Author','product_name': 'Title','product_price': 'Price','product_reviews': 'Reviews', 'product_url': 'Url'})
ColumnsList = ['Title','Author','Price','Reviews','Url']  # intended column order; not applied to the DataFrame in this cell
data.head()

Unnamed: 0,Author,Title,Price,Reviews,Url
0,Alex J. Gutman,"Becoming a Data Head: How to Think, Speak and ...",24,209,https://m.media-amazon.com/images/I/51dY5RkvW6...
3,Ryan O'Hanlon,Net Gains: Inside the Beautiful Game’s Analyti...,27,27,https://m.media-amazon.com/images/I/61k00WsIsj...
5,Wayne Winston,Microsoft Excel Data Analysis and Business Mod...,10,423,https://m.media-amazon.com/images/I/81FoeAfUIX...
6,Nick Singh,Ace the Data Science Interview: 201 Real Inter...,38,695,https://m.media-amazon.com/images/I/61Yd6YXufG...
7,Matt Goldwasser,SQL for Data Analytics: Perform fast and effic...,27,234,https://m.media-amazon.com/images/I/719zBDnZtX...
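
The output above still lists `Author` before `Title`, so the order stored in `ColumnsList` is not reflected in the table. If rearranging is wanted, one way is to index the DataFrame with that list. The sketch below uses a hypothetical one-row frame named `renamed` with a placeholder URL; only `ColumnsList` comes from the notebook.

```python
import pandas as pd

# Hypothetical one-row frame with the renamed columns; values are placeholders.
renamed = pd.DataFrame({
    "Author": ["Example Author"],
    "Title": ["Example Title"],
    "Price": [24],
    "Reviews": [209],
    "Url": ["https://example.com/cover.jpg"],
})

ColumnsList = ["Title", "Author", "Price", "Reviews", "Url"]

# Selecting with a list of column names returns the columns in that order.
reordered = renamed[ColumnsList]
print(list(reordered.columns))
```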


# 6. Convert dtype 'object' to 'int' to analyze reviews & price column as a numerical value

In [7]:
data.dtypes

Author     object
Title      object
Price      object
Reviews    object
Url        object
dtype: object

In [8]:
# Coerce unparseable values to NaN, replace them with 0, and store as integers.
data["Reviews"] = pd.to_numeric(data["Reviews"], errors='coerce').fillna(0).astype('int64')
data["Price"] = pd.to_numeric(data["Price"], errors='coerce').fillna(0).astype('int64')
data.dtypes

Author     object
Title      object
Price       int64
Reviews     int64
Url        object
dtype: object
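
Note that `errors='coerce'` silently turns any value it cannot parse into 0 after the fill. The raw table lists 2906 reviews for "The Art of Statistics", yet that book shows up with 0 reviews in the later sections. One possible cause, which is an assumption and not something the source confirms, is a thousands separator such as `2,906` in the CSV. A minimal sketch of stripping separators before converting follows; `raw` and `to_int_column` are hypothetical names and the values are illustrative.

```python
import pandas as pd

# Hypothetical raw strings as they might arrive from the scraper.
raw = pd.DataFrame({
    "Reviews": ["209", "2,906", None, "n/a"],
    "Price": ["24", "18", "27", "1,031"],
})

def to_int_column(series: pd.Series) -> pd.Series:
    """Remove thousands separators, then convert; unparseable values become 0."""
    without_commas = series.str.replace(",", "", regex=False)
    numbers = pd.to_numeric(without_commas, errors="coerce")
    return numbers.fillna(0).astype("int64")

reviews = to_int_column(raw["Reviews"])
price = to_int_column(raw["Price"])
print(reviews.tolist(), price.tolist())
```

Counting how many values were coerced (for example with `numbers.isna().sum()` before the fill) is a cheap way to check whether the conversion is discarding real data.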

# 7. Top 10 books with a title containing "Data Science" including the highest reviews and lowest price

In [9]:
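# Filter titles containing "Data Science" (case-sensitive), then rank by Reviews,
# using Price as the tie-breaker; nlargest sorts both columns in descending order.
DataScience_Books = data[data['Title'].str.contains('Data Science')]
Top_DataScience_Books = DataScience_Books.nlargest(10, ['Reviews', 'Price'])
Top_DataScience_Books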

Unnamed: 0,Author,Title,Price,Reviews,Url
6,Nick Singh,Ace the Data Science Interview: 201 Real Inter...,38,695,https://m.media-amazon.com/images/I/61Yd6YXufG...
400,Nick Singh,Ace the Data Science Interview: 201 Real Inter...,38,695,https://m.media-amazon.com/images/I/61Yd6YXufG...
10,Dursun Delen,"Business Intelligence, Analytics, and Data Sci...",33,280,https://m.media-amazon.com/images/I/91Zx6bc7y2...
0,Alex J. Gutman,"Becoming a Data Head: How to Think, Speak and ...",24,209,https://m.media-amazon.com/images/I/51dY5RkvW6...
395,Alex J. Gutman,"Becoming a Data Head: How to Think, Speak and ...",24,209,https://m.media-amazon.com/images/I/51dY5RkvW6...
416,Chris Fregly,"Data Science on AWS: Implementing End-to-End, ...",31,159,https://m.media-amazon.com/images/I/81k33dFBLt...
413,Julian James McKinnon,Computer Programming Crash Course: 7 Books in ...,33,134,https://m.media-amazon.com/images/I/71zkPeFMA4...
686,(13 used & new offers),"Introducing Data Science: Big Data, Machine Le...",42,98,https://m.media-amazon.com/images/I/71zG6q0qpi...
8,Yoon Hyup Hwang,Hands-On Data Science for Marketing: Improve y...,45,68,https://m.media-amazon.com/images/I/719TiXxn3r...
434,Nina Zumel,Practical Data Science with R,49,64,https://m.media-amazon.com/images/I/71v3KRNJq3...
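
`nlargest` orders both columns from high to low, so among books with equal review counts the more expensive one comes first. To rank by highest reviews and then lowest price, as the section title suggests, `sort_values` with a per-column direction is one option. The sketch below runs on a small hypothetical frame; the titles and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical books; two share the same review count but differ in price.
books = pd.DataFrame({
    "Title": ["Data Science A", "Data Science B", "Data Science C", "Statistics D"],
    "Price": [45, 30, 20, 10],
    "Reviews": [500, 500, 120, 900],
})

data_science = books[books["Title"].str.contains("Data Science", na=False)]

# Reviews descending, Price ascending: the cheaper book wins a tie on reviews.
top = data_science.sort_values(["Reviews", "Price"], ascending=[False, True]).head(10)
print(top)
```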


# 8. Book title contains "Data Analytics"

In [10]:
data[data['Title'].str.contains('Data Analytics',case = False)]

Unnamed: 0,Author,Title,Price,Reviews,Url
7,Matt Goldwasser,SQL for Data Analytics: Perform fast and effic...,27,234,https://m.media-amazon.com/images/I/719zBDnZtX...
272,(10 used & new offers),Big Data Analytics for Time-Critical Mobility ...,167,1,https://m.media-amazon.com/images/I/616CrRKfua...
309,Herbert Jones,Data Analytics: An Essential Beginner’s Guide ...,13,17,https://m.media-amazon.com/images/I/71cbRSQW63...
337,Seth Stephens-Davidowitz,Big Data Analytics,7,1,https://m.media-amazon.com/images/I/61PsVFO9WY...
351,Chanchal Chatterjee,Introduction to Data Analytics for Accounting ...,45,2,https://m.media-amazon.com/images/I/718JDFKNAB...
361,Jonathan Schwabish,Fundamentals of Data Analytics: With a View to...,21,263,https://m.media-amazon.com/images/I/51BOxIeRTD...
392,Ann Jackson,"Tableau Strategies: Solving Real, Practical Pr...",29,32,https://m.media-amazon.com/images/I/81nxllSFh2...
407,Ann Jackson,"Tableau Strategies: Solving Real, Practical Pr...",29,32,https://m.media-amazon.com/images/I/81nxllSFh2...
445,Paul Cerrato,Reinventing Clinical Decision Support: Data An...,69,9,https://m.media-amazon.com/images/I/81fowytG5q...
449,Vijay Kotu,"DATA MINING, BIG DATA ANALYTICS AND DEEP LEARN...",4,2,https://m.media-amazon.com/images/I/71VwzlIHU+...


# 9. Books having >=200 reviews

In [11]:
data[data['Reviews']>=200]['Title']

0      Becoming a Data Head: How to Think, Speak and ...
5      Microsoft Excel Data Analysis and Business Mod...
6      Ace the Data Science Interview: 201 Real Inter...
7      SQL for Data Analytics: Perform fast and effic...
10     Business Intelligence, Analytics, and Data Sci...
19     Small Data: The Tiny Clues That Uncover Huge T...
55                                   Thinking Basketball
266    Competing on Analytics: The New Science of Win...
283                           SQL All-in-One For Dummies
287                           SQL All-in-One For Dummies
322                                        R For Dummies
361    Fundamentals of Data Analytics: With a View to...
391    SQL Cookbook: Query Solutions and Techniques f...
395    Becoming a Data Head: How to Think, Speak and ...
400    Ace the Data Science Interview: 201 Real Inter...
408    Information Dashboard Design: Displaying Data ...
432    The Ethics of Artificial Intelligence in Educa...
742      Qualitative Data Analy

# 10. Grouping books by author & maximum reviews

In [12]:
data.groupby('Author')['Reviews'].max().sort_values(ascending=False)

Author
Allen G. Taylor        790
Nick Singh             695
Andrie de Vries        493
Wayne Winston          423
Martin Lindstrom       347
                      ... 
Stefan Papp              1
Christian Mayer          1
Andrew Nguyen            1
Thomas Mailund           0
David Spiegelhalter      0
Name: Reviews, Length: 150, dtype: int64

# 11. Highest, minimum, and average purchase prices

In [13]:
data['Price'].max()

179

In [14]:
data['Price'].min()

4

In [15]:
data['Price'].mean()

38.18888888888889
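
The three statistics above can also be computed in a single call with `agg`. This is a small sketch on a hypothetical `prices` Series, not the notebook's data.

```python
import pandas as pd

# Hypothetical prices for illustration.
prices = pd.Series([4, 18, 24, 38, 179], name="Price")

# One call returns a Series indexed by the statistic names.
summary = prices.agg(["max", "min", "mean"])
print(summary)
```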

# 12. Which book has the most reviews?

In [16]:
print(data['Title'][data['Reviews'] == data['Reviews'].max()].nunique())
print(data['Title'][data['Reviews'] == data['Reviews'].max()].unique())

1
['SQL All-in-One For Dummies']


# 13. Which books have the fewest reviews?

In [17]:
print(data['Title'][data['Reviews'] == data['Reviews'].min()].nunique())
print(data['Title'][data['Reviews'] == data['Reviews'].min()].unique())

2
['Beginning Data Science in R 4: Data Analysis, Visualization, and Modelling for the Data Scientist'
 'The Art of Statistics: How to Learn from Data']
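
The lookups in sections 12 and 13 can also be written with `nlargest` and `nsmallest`, where `keep="all"` retains every row tied for the extreme value. The sketch below uses a hypothetical `books` frame with placeholder titles.

```python
import pandas as pd

# Hypothetical frame; "B" appears twice to mimic a repeated listing.
books = pd.DataFrame({
    "Title": ["A", "B", "B", "C"],
    "Reviews": [10, 790, 790, 0],
})

# keep="all" returns all rows tied for first place; unique() collapses repeats.
most_reviewed = books.nlargest(1, "Reviews", keep="all")["Title"].unique().tolist()
least_reviewed = books.nsmallest(1, "Reviews", keep="all")["Title"].unique().tolist()
print(most_reviewed, least_reviewed)
```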


# 14. Do higher-priced books have more reviews?

In [19]:
px.scatter(data, x ='Price', y='Reviews')
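
A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson and a rank-based (Spearman-style) correlation on a hypothetical frame. The values are illustrative and do not come from the scraped data, and the rank-based figure is obtained by correlating ranks, which avoids needing any library beyond pandas.

```python
import pandas as pd

# Hypothetical price and review counts for illustration.
books = pd.DataFrame({
    "Price": [4, 18, 24, 38, 179],
    "Reviews": [2, 2906, 209, 695, 1],
})

# Pearson measures linear association and is sensitive to outliers.
pearson = books["Price"].corr(books["Reviews"])

# Correlating the ranks gives Spearman's coefficient, which is more robust
# to skewed data such as a few books with very large review counts.
spearman = books["Price"].rank().corr(books["Reviews"].rank())

print(pearson, spearman)
```

A coefficient near zero would suggest that price says little about review count; either way, correlation alone does not indicate that one drives the other.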