# This exploratory data analysis is based on the product data provided by Amazon.

In [21]:
#importing the libraries that will be used in this project
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from termcolor import colored

In [22]:
#reading the data into a dataframe
df = pd.read_csv("../sample.csv")
#Initial shape of the dataframe
print("Initial shape of the dataframe is : ", df.shape , "\n")
#Printing the first 2 rows of the dataframe
print("The first 2 rows of the dataframe are-", "\n")
df.head(2)

Initial shape of the dataframe is :  (7249, 31) 

The first 2 rows of the dataframe are- 



Unnamed: 0,id,prices.amountMax,prices.amountMin,prices.availability,prices.condition,prices.currency,prices.dateSeen,prices.isSale,prices.merchant,prices.shipping,...,name,primaryCategories,sourceURLs,upc,weight,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,AVphzgbJLJeJML43fA0o,104.99,104.99,Yes,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,...,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,794000000000.0,32.8 pounds,,,,,
1,AVpgMuGwLJeJML43KY_c,69.0,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,...,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642000000000.0,14 pounds,,,,,


As can be seen from the dataframe, it has several "Unnamed" columns that contain mostly null values.
Let's check the complete data for null values.

In [23]:
df.isnull().sum().sort_values(ascending = False).to_frame().T

Unnamed: 0,Unnamed: 30,Unnamed: 28,Unnamed: 29,Unnamed: 27,Unnamed: 26,ean,manufacturer,prices.shipping,weight,upc,...,asins,prices.sourceURLs,prices.merchant,prices.isSale,prices.dateSeen,prices.currency,prices.condition,prices.availability,prices.amountMin,dateUpdated
0,7243,7243,7237,7231,7210,5706,4014,2972,0,0,...,0,0,0,0,0,0,0,0,0,0
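Raw null counts are easier to judge as a share of the total number of rows. The following is a minimal sketch of that idea on a small made-up frame; the name `toy` and its values are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the product data (all values are made up)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "weight": ["1 lb", None, "3 lb", "4 lb"],
    "Unnamed: 26": [np.nan, np.nan, np.nan, "x"],
})

# isnull() yields booleans; their mean is the fraction of missing values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real dataframe the same expression would be `df.isnull().mean()`.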


This confirms that all the "Unnamed" columns are almost entirely null (between 7210 and 7243 of the 7249 rows).
Several other columns, such as ean (5706), manufacturer (4014) and prices.shipping (2972), also have a large number
of missing values, but we will deal with those after understanding a bit more about the data.
For now, let's delete the "Unnamed" columns.

In [24]:
df.drop(["Unnamed: 26", "Unnamed: 27", "Unnamed: 28", "Unnamed: 29", "Unnamed: 30"], axis = 1, inplace = True)
df.head(2)

Unnamed: 0,id,prices.amountMax,prices.amountMin,prices.availability,prices.condition,prices.currency,prices.dateSeen,prices.isSale,prices.merchant,prices.shipping,...,ean,imageURLs,keys,manufacturer,manufacturerNumber,name,primaryCategories,sourceURLs,upc,weight
0,AVphzgbJLJeJML43fA0o,104.99,104.99,Yes,New,USD,"2017-03-30T06:00:00Z,2017-03-10T22:00:00Z,2017...",False,Bestbuy.com,,...,,https://images-na.ssl-images-amazon.com/images...,sanusvlf410b110inchsuperslimfullmotionmountfor...,,VLF410B1,Sanus VLF410B1 10-Inch Super Slim Full-Motion ...,Electronics,https://www.amazon.com/Sanus-VLF410B1-10-Inch-...,794000000000.0,32.8 pounds
1,AVpgMuGwLJeJML43KY_c,69.0,64.99,In Stock,New,USD,2017-12-14T06:00:00Z,True,Walmart.com,Expedited,...,,https://images-na.ssl-images-amazon.com/images...,boytone2500w21chhometheatersystemblackdiamond/...,Boytone,BT-210F,Boytone - 2500W 2.1-Ch. Home Theater System - ...,Electronics,http://reviews.bestbuy.com/3545/4784804/review...,642000000000.0,14 pounds
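Listing each "Unnamed" column by hand works, but the columns can also be selected by name pattern. This is a minimal sketch on a hypothetical frame (`toy` and its contents are made up), not a replacement for the cell above.

```python
import pandas as pd

# Toy frame with two regular columns and two "Unnamed" ones (values are made up)
toy = pd.DataFrame({
    "id": ["a", "b"],
    "name": ["x", "y"],
    "Unnamed: 26": [None, None],
    "Unnamed: 27": [None, "z"],
})

# Boolean mask: True for every column whose name does not start with "Unnamed"
keep = ~toy.columns.str.startswith("Unnamed")
cleaned = toy.loc[:, keep]
print(list(cleaned.columns))
```

The mask approach avoids hard-coding column numbers, which helps if the export ever produces a different number of empty columns.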




Here is a list of the remaining columns -


In [30]:
df.columns

Index(['id', 'prices.amountMax', 'prices.amountMin', 'prices.availability',
       'prices.condition', 'prices.currency', 'prices.dateSeen',
       'prices.isSale', 'prices.merchant', 'prices.shipping',
       'prices.sourceURLs', 'asins', 'brand', 'categories', 'dateAdded',
       'dateUpdated', 'ean', 'imageURLs', 'keys', 'manufacturer',
       'manufacturerNumber', 'name', 'primaryCategories', 'sourceURLs', 'upc',
       'weight'],
      dtype='object')

The list of columns shows that several columns carry a "prices." prefix. Let's remove it.



In [35]:
#regex=False matches the "." in "prices." literally instead of as a wildcard
df.columns = df.columns.str.replace("prices.", "", regex=False)
df.columns
    



Index(['id', 'amountMax', 'amountMin', 'availability', 'condition', 'currency',
       'dateSeen', 'isSale', 'merchant', 'shipping', 'sourceURLs', 'asins',
       'brand', 'categories', 'dateAdded', 'dateUpdated', 'ean', 'imageURLs',
       'keys', 'manufacturer', 'manufacturerNumber', 'name',
       'primaryCategories', 'sourceURLs', 'upc', 'weight'],
      dtype='object')
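Depending on the pandas version, `str.replace` may treat its pattern as a regular expression, in which case the `.` in "prices." matches any character. The sketch below, on a hypothetical list of names, shows a literal replacement followed by a check for duplicate column names; the name "pricesXtra" is made up to illustrate the difference.

```python
import pandas as pd

# Hypothetical column names; "pricesXtra" is invented for illustration
cols = pd.Index(["id", "prices.amountMax", "prices.sourceURLs",
                 "sourceURLs", "pricesXtra"])

# regex=False treats "prices." as literal text, so "pricesXtra" is left alone
renamed = cols.str.replace("prices.", "", regex=False)
print(list(renamed))

# duplicated() flags names that already appeared earlier in the index
dupes = renamed[renamed.duplicated()]
print(list(dupes))
```

Duplicate names matter because selecting such a column by name returns a dataframe with both columns rather than a single series.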

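The column list looks cleaner. Note that "sourceURLs" now appears twice, because both "prices.sourceURLs" and "sourceURLs" existed before the prefix was removed.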
##### Most of the column names are self-explanatory. The "not so simple" columns are - 
* **availability** - In stock or not.
* **condition** - New, used, etc.
* **asins** - Amazon Standard Identification Number
* **ean** - European Article Number


In [38]:
from termcolor import colored