## Exploratory Data Analysis I

Let's begin our analysis by exploring both the univariate and bivariate characteristics of this dataset. The general goals of this step include: 

* finding outliers & distributions through univariate visualizations
* finding trends & patterns through bivariate visualizations

When you are done with this section of the project, validate that your output matches the screenshot provided in the `docs/part1.md` file and answer the questions located underneath `Exploratory Data Analysis II` in your own words.

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# TODO: load `data/raw/shopping.csv` as a pandas dataframe
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

df = pd.read_csv("data/raw/shopping.csv")

In [None]:
# TODO: print out the first 5 rows for display
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html
df.head()

In [None]:
# TODO: Print out summary statistics for all numeric columns
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

df.describe()

## Univariate Analysis

Let's generate visualizations for each numeric variable to get an idea of the outliers & distributions present in our dataset.

In addition, let's also visualize the frequency-count of qualitative variables to get an understanding of the composition of our dataset. 
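
A histogram shows outliers and skew visually; the same ideas can also be checked numerically. The following is a minimal sketch, assuming a made-up `ages` series rather than the real `Age` column, and using Tukey's 1.5 × IQR rule as one common (not the only) definition of an outlier.

```python
import pandas as pd

# Hypothetical stand-in for one numeric column; the values are made up.
ages = pd.Series([18, 22, 25, 27, 30, 31, 33, 35, 38, 40, 95], name="Age")

# Interquartile range: the spread of the middle 50% of the values.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

# Positive skew means a longer right tail, negative a longer left tail.
skewness = ages.skew()

print("outliers:", outliers.tolist())
print("skewness:", skewness)
```

On the real dataset, the same steps could be applied to `df["Age"]` or any other numeric column and compared against what the histogram suggests.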

In [None]:
# TODO: plot a seaborn histogram for the "Age" column
# Documentation: https://seaborn.pydata.org/generated/seaborn.histplot.html

sns.histplot(df["Age"])

In [None]:
# TODO: plot a seaborn histogram for the "Purchase Amount (USD)" column

sns.histplot(df["Purchase Amount (USD)"])

In [None]:
# TODO: plot a seaborn histogram for the "Review Rating" column

sns.histplot(df["Review Rating"])

In [None]:
# TODO: plot a seaborn histogram for the "Previous Purchases" column

sns.histplot(df["Previous Purchases"])

In [None]:
# TODO: count the frequency of unique values in the "Gender" column, save this value into a new dataframe named "gender_counts"
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html

gender_counts = df.value_counts("Gender")

In [None]:
# TODO: plot a matplotlib barplot for the gender_counts dataframe
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html

plt.bar(height=gender_counts.values, x=gender_counts.index)

In [None]:
# TODO: count the frequency of unique values in the "Season" column, save this value into a new dataframe named "season_counts"

season_counts = df.value_counts("Season")

In [None]:
# TODO: plot a matplotlib barplot for the season_counts dataframe

plt.bar(height=season_counts.values, x=season_counts.index)

In [None]:
# TODO: count the frequency of unique values in the "Shipping Type" column, save this value into a new dataframe named "ship_counts"

ship_counts = df.value_counts("Shipping Type")

In [None]:
# TODO: plot a matplotlib barplot for the ship_counts dataframe

plt.barh(width= ship_counts.values, y = ship_counts.index)

In [None]:
# TODO: count the frequency of unique values in the "Promo Code Used" column, save this value into a new dataframe named "promo_counts"

promo_counts = df.value_counts("Promo Code Used")

In [None]:
# TODO: plot a matplotlib barplot for the promo_counts dataframe

plt.bar(height=promo_counts.values, x=promo_counts.index)

In [None]:
# TODO: count the frequency of unique values in the "Payment Method" column, save this value into a new dataframe named "pay_counts"

pay_counts = df.value_counts("Payment Method")

In [None]:
# TODO: plot a matplotlib barplot for the pay_counts dataframe

plt.bar(height= pay_counts.values, x= pay_counts.index)

In [None]:
# TODO: count the frequency of unique values in the "Frequency of Purchases" column, save this value into a new dataframe named "purch_counts"

purch_counts = df.value_counts("Frequency of Purchases")

In [None]:
# TODO: plot a matplotlib barplot for the purch_counts dataframe

plt.bar(height= purch_counts.values, x = purch_counts.index)

In [None]:
# TODO: count the frequency of unique values in the "Location" column, save this value into a new dataframe named "loc_counts"

loc_counts = df.value_counts("Location")

In [None]:
# TODO: plot a horizontal barplot for the loc_counts dataframe
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.barh.html
# Hint: resize the figure using "plt.figure(figsize=(10,10))" to "unsquish" your visualization

plt.figure(figsize=(10,10))
plt.barh(width= loc_counts.values, y = loc_counts.index)

## Bivariate Analysis

Let's generate visualizations for relationships between pairs of variables, both numeric and categorical, to get an idea of patterns and clusters that might be present in our dataset.
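
A boxplot per group draws the minimum, quartiles, median, and maximum of a numeric column for each category. As a rough numeric counterpart, the sketch below computes those same statistics with `groupby(...).describe()`. The `toy` dataframe and its values are made up for illustration and are not taken from `shopping.csv`.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Purchase Amount (USD)": [20, 40, 60, 30, 50, 100],
})

# One row of summary statistics per category, the same numbers a boxplot draws.
summary = toy.groupby("Gender")["Purchase Amount (USD)"].describe()

print(summary[["min", "25%", "50%", "75%", "max"]])
```

Comparing the `50%` (median) column across groups gives a quick check on whether the boxes in the plot really differ or only appear to.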

In [None]:
# TODO: Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Gender" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data=df, x="Gender", y="Purchase Amount (USD)")

In [None]:
# TODO: Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Season" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data=df, x="Season", y="Purchase Amount (USD)")

In [None]:
# TODO: Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Review Rating" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data=df, x="Review Rating", y="Purchase Amount (USD)")

In [None]:
# TODO: Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Promo Code Used" 
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data=df, x="Promo Code Used", y="Purchase Amount (USD)")

In [None]:
# TODO: Create a boxplot that reveals the range of "Purchase Amount (USD)" for each "Payment Method"
# Documentation: https://seaborn.pydata.org/generated/seaborn.boxplot.html

sns.boxplot(data=df, x="Payment Method", y="Purchase Amount (USD)")

In [1]:
# TODO: plot a grid of diagrams on all numeric columns where the upper-half of the grid are scatter-plots
# the bottom-half are kde-plots
# and the diagonal is a histplot
# Documentation: https://seaborn.pydata.org/tutorial/axis_grids.html
# Hint: This might take a few seconds to load
# Hint: to read the kde diagrams in the bottom-half check out https://www.greenbelly.co/pages/contour-lines

g = sns.PairGrid(df)
g.map_diag(sns.histplot)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)

## Exploratory Data Analysis II

In the next section, answer a few questions regarding your dataset using the visualizations you've generated.
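
Answers about the "most" and "least" frequent category can be read off a bar plot, but they can also be double-checked directly from the counts. A minimal sketch, assuming a made-up `toy` dataframe in place of the real data:

```python
import pandas as pd

# Hypothetical stand-in for a categorical column; the values are invented.
toy = pd.DataFrame(
    {"Season": ["Winter", "Spring", "Winter", "Fall", "Spring", "Winter"]}
)

# Frequency of each unique value, sorted from most to least common.
season_counts = toy["Season"].value_counts()

# idxmax / idxmin return the label with the highest / lowest count.
most_common = season_counts.idxmax()
least_common = season_counts.idxmin()

print(most_common, least_common)
```

The same pattern should carry over to the `*_counts` objects built earlier, for example `loc_counts.idxmax()`.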

### Q1

Which state contains the most shoppers? Which state contains the fewest?

California has the most, while Hawaii has the fewest.

### Q2

Which season has the largest number of purchases?

Winter

### Q3

What is the most popular form of payment for our customers in the US? What is the least popular form of payment?

Credit card is most popular while cash is the least popular.

### Q4

What is the most popular form of shipping for our customers in the US? What is the least popular form of shipping?

Standard shipping is most popular and Store Pick Up is the least popular.

### Q5

What kind of distribution do we observe for our `Age` column? What does this tell us about the typical shopper in the US?

The distribution is right-skewed, which suggests that younger shoppers account for more of the purchases than older shoppers.

### Q6

What kind of distribution do we observe for our `Purchase Amount (USD)` column? Why might this be? Take a look at the boxplots that you've generated to help answer this question.

It is a bimodal distribution: two different purchase amounts occur with greater frequency than the rest.