# Walmart EDA
![image_name](../../assets/walmart-image.jpg "alt-image_name!")


## About Author
Author: Joshua Farara

Project: Walmart EDA
### Contact Info
Click a link below to contact, follow, or correct me:

Email: joshua.farara@gmail.com

[LinkedIn](https://www.linkedin.com/in/joshuafarara/)

[Facebook](https://www.facebook.com/josh.farara/)

[Twitter](https://x.com/FararaTheArtist)

[Github](https://github.com/JoshuaFarara)


## About Data

Title: Walmart Product Data 2019

Dataset: [Link](https://www.kaggle.com/datasets/promptcloud/walmart-product-data-2019/data)

Description: PromptCloud extracted a sample of 30,000 records from Walmart to capture the reviews, pricing, and categories of products available on the site.



### Dataset Column Names

Features:

`Uniq Id`, `Crawl Timestamp`, `Product Url`, `Product Name`, `Description`, `List Price`, `Sale Price`, `Brand`, `Item Number`, `Gtin`, `Package Size`, `Category`, `Postal Code`, `Available`



### Metadata

Author: Vimal, Tarun

Source: PromptCloud (Owner)

Collection Methodology

License:


### Task

This dataset can help customers, clients, and anyone else interested in product data produce a case study, an article, or any other analysis that the general public can use and that competitors can use to position themselves strongly in the market.



### Objective

We saw a large need for datasets covering different types of products across the largest eCommerce websites around the globe. Although Amazon is the largest eCommerce site in the world and the most searched, Walmart makes a strong case for itself in eCommerce, as it has been catering to average American families daily for years.

### Kernel Version Used

Python==3.11.7


## Import Libraries
We will use the following libraries:
1. Pandas: Data manipulation and analysis
2. NumPy: Numerical operations and calculations
3. Matplotlib: Data visualization and plotting
4. Seaborn: Enhanced data visualization and statistical graphics
5. SciPy: Scientific computing and advanced mathematical operations


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import os
import sys
# Render plots inline in the Jupyter notebook instead of opening a new window
%matplotlib inline

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


## Data Loading and Exploration | Cleaning

### Load the CSV file and create a DataFrame

In [2]:
# Kaggle Notebook
# df = pd.read_csv('/kaggle/input/coffee-sales/index.csv')

# Local Machine Notebook
df = pd.read_csv('../../data/walmart_data.csv')


### Set the display options to show all columns and rows

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
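
Setting `display.max_rows` to `None` applies to the whole session, so printing a large frame in full can flood the notebook. As a minimal sketch, assuming a small stand-in frame (`demo` is hypothetical, not the Walmart data), the same options can be scoped to a single block with `pd.option_context`:

```python
import pandas as pd

# Hypothetical stand-in frame; the notebook's real frame is `df`.
demo = pd.DataFrame({"a": range(100), "b": range(100)})

# Remember the current setting so we can confirm it is restored afterwards.
default_rows = pd.get_option("display.max_rows")

# Options set here apply only inside the `with` block.
with pd.option_context("display.max_columns", None, "display.max_rows", None):
    inside_rows = pd.get_option("display.max_rows")
    full_text = repr(demo)  # the untruncated text a notebook would print

# On exit, the previous setting is back in place.
restored_rows = pd.get_option("display.max_rows")
print(inside_rows, restored_rows == default_rows)
```

This keeps full-width, full-length output available for the few cells that need it without changing how every later cell renders.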


### Get a sneak peek of data
The purpose of a sneak peek is to get a quick overview of the data and identify any potential problems or areas of interest.


In [4]:
df.head(5)

| | Uniq Id | Crawl Timestamp | Product Url | Product Name | Description | List Price | Sale Price | Brand | Item Number | Gtin | Package Size | Category | Postal Code | Available |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 019b67ef7f01103d8fb0a53e4c36daa7 | 2019-12-18 10:20:52 +0000 | https://www.walmart.com/ip/La-Costena-Chipotle... | La Costena Chipotle Peppers, 7 OZ (Pack of 12) | We aim to show you accurate product informati... | 31.93 | 31.93 | La Costeï¿½ï¿½a | NaN | 139941530 | NaN | Food \| Meal Solutions, Grains & Pasta \| Canned... | NaN | True |
| 1 | 3a4ff306dcc8a6e2bf720964d29b84c3 | 2019-12-18 17:21:48 +0000 | https://www.walmart.com/ip/Equate-Triamcinolon... | Equate Triamcinolone Acetonide Nasal Allergy S... | We aim to show you accurate product informati... | 10.48 | 10.48 | Equate | 569045548.0 | 632775553 | NaN | Health \| Equate \| Equate Allergy \| Equate Sinu... | NaN | True |
| 2 | 80090549d7d176327b186353c7b28ca4 | 2019-12-18 17:46:41 +0000 | https://www.walmart.com/ip/AduroSmart-ERIA-Sof... | AduroSmart ERIA Soft White Smart A19 Light Bul... | We aim to show you accurate product informati... | 10.99 | 10.99 | AduroSmart ERIA | 568068849.0 | 281487005 | NaN | Electronics \| Smart Home \| Smart Energy and Li... | NaN | True |
| 3 | 151ee1c61a29bacfedb01cd500494b2f | 2019-12-18 22:14:22 +0000 | https://www.walmart.com/ip/24-Classic-Adjustab... | 24" Classic Adjustable Balloon Fender Set Chro... | We aim to show you accurate product informati... | 38.59 | 38.59 | lowrider | NaN | 133714060 | NaN | Sports & Outdoors \| Bikes \| Bike Accessories \|... | NaN | True |
| 4 | 7b2ef8d41f65df121f6b4b9828cf8dad | 2019-12-18 06:56:02 +0000 | https://www.walmart.com/ip/Elephant-Shape-Sili... | Elephant Shape Silicone Drinkware Portable Sil... | We aim to show you accurate product informati... | 5.81 | 5.81 | Anself | NaN | 104042139 | NaN | Baby \| Feeding \| Sippy Cups: Alternatives to P... | NaN | True |


### Let's see the column names

In [5]:
df.columns

Index(['Uniq Id', 'Crawl Timestamp', 'Product Url', 'Product Name',
       'Description', 'List Price', 'Sale Price', 'Brand', 'Item Number',
       'Gtin', 'Package Size', 'Category', 'Postal Code', 'Available'],
      dtype='object')

### Let's look at the shape of the dataset

In [6]:
print(f"The number of rows is {df.shape[0]}, and the number of columns is {df.shape[1]}.")

The number of rows is 30000, and the number of columns is 14.


### Let's look at the columns and their data types using the detailed `info()` function

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Uniq Id          30000 non-null  object 
 1   Crawl Timestamp  30000 non-null  object 
 2   Product Url      30000 non-null  object 
 3   Product Name     30000 non-null  object 
 4   Description      29947 non-null  object 
 5   List Price       30000 non-null  float64
 6   Sale Price       30000 non-null  float64
 7   Brand            29436 non-null  object 
 8   Item Number      8875 non-null   float64
 9   Gtin             30000 non-null  int64  
 10  Package Size     0 non-null      float64
 11  Category         29983 non-null  object 
 12  Postal Code      0 non-null      float64
 13  Available        30000 non-null  bool   
dtypes: bool(1), float64(5), int64(1), object(7)
memory usage: 3.0+ MB
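
The summary above lists `Crawl Timestamp` as `object`, meaning the values are stored as plain strings. As a minimal sketch, assuming every value follows the `2019-12-18 10:20:52 +0000` pattern seen in the preview, the column could be parsed into timezone-aware datetimes with `pd.to_datetime`. The `sample` frame is a hypothetical stand-in built from two previewed values, and the format string is an assumption about the rest of the column:

```python
import pandas as pd

# Hypothetical stand-in built from two values shown in the preview.
sample = pd.DataFrame({
    "Crawl Timestamp": [
        "2019-12-18 10:20:52 +0000",
        "2019-12-18 17:21:48 +0000",
    ]
})

# Assumed format: date, time, then a numeric UTC offset such as +0000.
# utc=True yields a timezone-aware datetime column;
# errors="coerce" turns any string that does not match into NaT instead of raising.
sample["Crawl Timestamp"] = pd.to_datetime(
    sample["Crawl Timestamp"],
    format="%Y-%m-%d %H:%M:%S %z",
    utc=True,
    errors="coerce",
)

print(sample.dtypes)
print(sample["Crawl Timestamp"].dt.hour.tolist())
```

Once parsed, the `.dt` accessor exposes components such as the hour or the date, which string values do not support.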


### Count the missing values

In [8]:
df.isnull().sum()

Uniq Id                0
Crawl Timestamp        0
Product Url            0
Product Name           0
Description           53
List Price             0
Sale Price             0
Brand                564
Item Number        21125
Gtin                   0
Package Size       30000
Category              17
Postal Code        30000
Available              0
dtype: int64
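
Raw counts are easier to compare when expressed as a share of all rows. As a minimal sketch, assuming a small hypothetical frame (`toy`) that mimics the pattern above, the columns can be ranked by percentage of missing values and the entirely empty ones listed:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in that mimics the missing-value pattern above.
toy = pd.DataFrame({
    "Brand": ["Equate", None, "Anself", "lowrider"],
    "Item Number": [569045548.0, np.nan, np.nan, np.nan],
    "Package Size": [np.nan] * 4,
    "Available": [True, True, True, True],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))

# Columns with no values at all carry no information for the analysis.
empty_cols = toy.columns[toy.isnull().all()].tolist()
print(empty_cols)
```

Applied to the counts above, `Item Number` is missing in about 70.4% of rows (21,125 of 30,000), while `Package Size` and `Postal Code` are missing in all 30,000 rows.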

## Cleaning Data

### First Step of Cleaning

### Second Step of Cleaning

### Third Step of Cleaning

### Fourth Step of Cleaning

### Fifth Step of Cleaning

### Restructure DataFrame order

In [None]:
# Fill in the restructured DataFrame format for the desired look of the data.

## Analytical Questions

Analysis Subject 1

1. 

2. 

3. 

4. 

5. 

Analysis Subject 2

6. 

7. 

8. 

9. 

10. 


## Summary

## Contact Information

Click a link below to contact, follow, or correct me:

Email: joshua.farara@gmail.com

[LinkedIn](https://www.linkedin.com/in/joshuafarara/)

[Facebook](https://www.facebook.com/josh.farara/)

[Twitter](https://x.com/FararaTheArtist)

[Github](https://github.com/JoshuaFarara)
