# Exploratory vs Confirmatory Data Analysis
---

In this project, we are going to learn about two important data analysis methods: **EDA** (Exploratory Data Analysis) and **CDA** (Confirmatory Data Analysis).


### Exploratory Data Analysis (EDA)


Importing Modules

In [None]:
# Install and import necessary libraries
!pip install plotly --upgrade

# Pandas Module
import pandas as pd

# Data Visualization Module
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo
%matplotlib inline

# Setting some default settings
pd.set_option('mode.chained_assignment',None)
#pyo.init_notebook_mode()

Let's load our dataset

In [None]:
# Read the dataset into a DataFrame
data = pd.read_csv("/kaggle/input/profit-dataset/dataset.csv", engine="python", encoding='latin1')


If reading the dataset with `pd.read_csv("/content/dataset.csv", engine="python")` raises a `UnicodeDecodeError`:

The error means that pandas tried to read the CSV file with the default utf-8 encoding but encountered a byte sequence (0xa0 in this case) that it cannot decode. This usually happens when the file was saved in a different encoding.
One way to solve this error is to pass the `encoding='latin1'` argument, as in the cell above.
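
The fallback described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `read_csv_with_fallback`, the list of encodings to try, and the small temporary demo file are all made up for the example and are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Hypothetical helper: try each encoding in turn.

    Returns the DataFrame together with the encoding that worked.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, engine="python", encoding=enc), enc
        except UnicodeDecodeError as err:
            # Remember the failure and try the next encoding
            last_error = err
    raise last_error


# Demo file (an assumption for this sketch): the byte 0xa0 is a
# non-breaking space in latin1 but is not valid on its own in utf-8.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"Product,Profit\nDesk\xa0Lamp,12.5\n")
    demo_path = tmp.name

demo_df, used_encoding = read_csv_with_fallback(demo_path)
os.remove(demo_path)

print(used_encoding)
print(demo_df)
```

Note that latin1 can decode any byte sequence, so it never raises an error, but it may show the wrong characters if the file was actually saved in some other encoding.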

Checking the data size

In [None]:
data.shape

Checking the data

In [None]:
data.head()

### EDA - Roadmap

In this task, we are going to talk about how to start our exploration.

    Different column data types
    How the columns are related
    What different kinds of information are in our data
    Make a list of the information and start from the first item



Now let's start with checking the column data types

In [None]:
data.dtypes

Now let's talk about what types of information we have in this data

### In our data, we have the following information:
    
    Time Information (Order Date)
    Customer Information (Customer Name)
    Place Information (State)
    Hierarchical Information about the products (Category, Sub-Category, Product Name)
    Sale Information (Sales, Profit, Quantity)

Now let's start our exploration

### Task 3: Data Exploration: Time Information

What is the timespan of our data?

In [None]:
data["Order Date"] = pd.to_datetime(data['Order Date'])
from_date = data["Order Date"].min()
to_date = data["Order Date"].max()
print(f"We have sales information from {from_date} to {to_date}")

Now let's sort our data by the date

In [None]:
data = data.sort_values(by="Order Date")
data.head()

Some data preparation: let's extract year, month, and day from the Order Date column

In [None]:
data["Year"] = data["Order Date"].dt.year
data["Month"] = data["Order Date"].dt.month
data["Day"] = data["Order Date"].dt.day
data.head()

Show the Profit gained over time by different product categories

In [None]:
data_time_profit = data.groupby(["Year", "Category"])["Profit"].sum().reset_index()
data_time_profit.head()

Visualizing the results using a line chart

In [None]:
# Line chart of data_time_profit: x is Year, y is Profit, and color is Category

fig = px.line(data_time_profit, x="Year", y="Profit", color="Category", title="Profit Gained Over Time by Product Category")
fig.show()

# If the code above does not show the plot, run !pip install plotly --upgrade, re-run the cells, and remove the pyo.init_notebook_mode() call

Here, we can see that the Technology category always had the highest profit

Analyse the monthly profits gained from sales of different product categories. Visualize your results using a line chart.

In [None]:
data_time_month_profit = data.groupby(["Year", "Month", "Category"])["Profit"].sum().reset_index()
data_time_month_profit["Date"] = data_time_month_profit["Year"].astype(str) + "-" + data_time_month_profit["Month"].astype(str)
data_time_month_profit.head()

In [None]:
px.line(data_time_month_profit, x="Date", y="Profit", color="Category", title="Profit Gained Over Time by Product Category")
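
The `Date` column built above is a string such as "2014-1" or "2014-10". As an alternative sketch, the month can be kept as a real datetime value so that the x-axis stays chronological however the rows are ordered. The toy DataFrame below is invented for the illustration; only the `Order Date` and `Profit` column names come from the notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2014-01-15", "2014-10-03", "2014-02-20", "2014-10-25"]
    ),
    "Profit": [10.0, 5.0, -2.0, 7.5],
})

# Truncate every date to the first day of its month
toy["Date"] = toy["Order Date"].dt.to_period("M").dt.to_timestamp()

# Sum the profit per month; groupby sorts the datetime keys chronologically
monthly = toy.groupby("Date")["Profit"].sum().reset_index()
print(monthly)
```

The resulting `monthly` frame could be passed to `px.line` with `x="Date"` in the same way as the cell above.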

### Data Exploration: Customer Aspect

Let's see how many unique customers we have

In [None]:
len(data["Customer Name"].unique())

Let's see the yearly change in the number of unique customers

In [None]:
customer_data = data.groupby("Year")["Customer Name"].nunique().reset_index()
customer_data

Visualizing the result

In [None]:
px.line(customer_data, x="Year", y="Customer Name", title="Yearly Change in Number of Unique Customers")

We can see that the number of unique customers has increased over the years, which suggests that the business was successful
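
To put a number on the growth, the year-over-year change can be computed from a table shaped like `customer_data`. The sketch below uses invented counts purely for illustration; the real values come from the cell above.

```python
import pandas as pd

# Hypothetical yearly counts shaped like customer_data (values are made up)
toy_customers = pd.DataFrame({
    "Year": [2014, 2015, 2016, 2017],
    "Customer Name": [100, 110, 121, 133],
})

# Difference to the previous year; the first year has no predecessor (NaN)
toy_customers["Change"] = toy_customers["Customer Name"].diff()

# True if the count grew in every year after the first
grew_every_year = bool((toy_customers["Change"].dropna() > 0).all())
print(toy_customers)
print(grew_every_year)
```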

Top 10 customers who brought the highest profit

In [None]:
top_ten_customers = data.groupby("Customer Name")["Profit"].sum().reset_index().sort_values(by="Profit", ascending=False).head(10)
top_ten_customers

In [None]:
px.bar(top_ten_customers, x="Customer Name", y="Profit", title="Top 10 Customers Who Brought the Highest Profit")

### Task 4: Data Exploration: Place (location) Aspect

Let's analyze the profits gained in different states in the US

In [None]:
geo_data = data.groupby("State")["Profit"].sum().reset_index()
geo_data

### Let's create a choropleth map
Plotly uses two-letter postal abbreviations for state locations, so we need a dictionary that converts the full state names into their abbreviations.

In [None]:
state_codes = {
        'Alabama': 'AL',
        'Alaska': 'AK',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'District of Columbia': 'DC',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'
}

Let's map the two-letter postal codes to the State column

In [None]:
geo_data.State = geo_data.State.map(state_codes)

In [None]:
geo_data
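
One thing to keep in mind with `Series.map` and a dictionary: any state name that is not a key in the dictionary becomes `NaN`, and that row cannot be placed on the map. A small sketch of how to check for this is shown below. The toy data and the shortened `toy_codes` dictionary are assumptions for the example.

```python
import pandas as pd

# Shortened dictionary and toy data, both invented for this sketch
toy_codes = {"California": "CA", "New York": "NY"}
toy_geo = pd.DataFrame({
    "State": ["California", "New York", "Puerto Rico"],
    "Profit": [100.0, 80.0, 5.0],
})

# Names missing from the dictionary are mapped to NaN
toy_geo["Code"] = toy_geo["State"].map(toy_codes)

# List the names that could not be converted
unmapped = toy_geo.loc[toy_geo["Code"].isna(), "State"].tolist()
print(unmapped)
```

If the list is empty, every state was converted successfully.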

In [None]:
px.choropleth(geo_data, locations="State", color="Profit", locationmode="USA-states", scope= "usa", color_continuous_scale = "Blugrn", title="Profit Gained by State")
# color_continuous_scale="Blugrn" sets the color scale of the map

The darker the color, the higher the profit. We can see that California and New York have the highest profit

Exercise: Create a choropleth map to visualize the profit gained by selling technology products (Category = Technology) in different states.

In [None]:
ex_data = data[data["Category"] == "Technology"]
ex_geo_data = ex_data.groupby("State")["Profit"].sum().reset_index()
px.choropleth(ex_geo_data,
              locations=ex_geo_data.State.map(state_codes), color="Profit",
              locationmode="USA-states", scope= "usa",
              color_continuous_scale = "Pubu", title="Technology Profit Gained by State")

New York has the highest profit from selling technology products

### Task 5: Data Exploration - Hierarchical Information about the products

In [None]:
product_data = data.groupby(["Category", "Sub-Category"])["Profit"].sum().reset_index()
product_data = product_data[product_data.Profit > 0]
product_data["Sales"] = "Any"
product_data

In [None]:
px.sunburst(product_data, path=["Category", "Sub-Category"], values="Profit", title="Profit Gained by Product Category")

In [None]:
px.sunburst(product_data, path=["Sales","Category", "Sub-Category"], values="Profit", title="Profit Gained by Product Category (All Sales)")

Here we see that different product categories are mapped to different colors.
We have Technology, Office Supplies, and Furniture. The size of each arc represents the amount of profit gained for each category or sub-category.
We can see that, for example, Technology has the highest amount of profit.

By clicking on a category, we get more information about its sub-categories and can see the different hierarchical levels of the data. The inner circle represents the profit gained from every kind of product sold, counting only the sub-categories with a positive profit.
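
The size of an arc cannot represent a negative value, which is why only rows with a positive profit are kept before plotting. A quick way to see what that filter leaves out is sketched below on toy data. The values are invented; only the column names follow the notebook.

```python
import pandas as pd

# Toy aggregated data shaped like product_data (values are made up)
toy_products = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Technology"],
    "Sub-Category": ["Chairs", "Tables", "Phones"],
    "Profit": [250.0, -180.0, 400.0],
})

# Rows that the "Profit > 0" filter keeps and drops
kept = toy_products[toy_products["Profit"] > 0]
dropped = toy_products[toy_products["Profit"] <= 0]

print(dropped)
print(kept["Profit"].sum())
```

Because loss-making sub-categories are left out, the totals shown in the chart are higher than the true net profit.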

In [None]:
# making tree map
px.treemap(product_data, path=["Sales","Category", "Sub-Category"], values="Profit", title="Profit Gained by Product Category")


The treemap uses the same idea, but it uses rectangles to represent values. The small rectangles represent the sub-categories, the bigger ones represent the categories, and the biggest one represents every kind of product that we have.
By clicking on a category, we can look at its sub-categories

### Task 6: Data Exploration: Product Sales information (Sales, Quantity, Profit)

In [None]:
data.head()

Distribution Analysis on **Quantity** column

Let's check the statistical summary of the column

In [None]:
data.Quantity.describe()

Here we can see that with mean = 3.7 and standard deviation = 2.7, most of the values in the Quantity column are around the mean.
Looking at the maximum value, we can say that there are outliers, because it is far from the mean.


In [None]:
px.histogram(data, x="Quantity", title="Quantity Distribution")

Here we can see that most of the sales records have a quantity of two: there are 2409 records with that quantity.
Three and two are the most common values in the column.
We can also see a tail on the right side of the histogram: the distribution of this column is right-skewed, which means that there are some outliers on the right side.
Looking at the tail of the graph, we can see that some values in the Quantity column are very rare. For example, quantity 14 occurs 29 times and quantity 13 occurs 27 times.
The values in the tail are very rare in our distribution, so we can call them outliers.
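
The counts read off the histogram can also be computed directly. The sketch below uses `value_counts` and `skew` on a small invented series, so the numbers do not match the dataset. On the real data the same calls would be applied to `data.Quantity`.

```python
import pandas as pd

# Toy quantities (made up) with a long right tail
toy_quantity = pd.Series([2, 3, 2, 3, 5, 2, 1, 4, 3, 2, 14, 9], name="Quantity")

# How often each quantity occurs, ordered by quantity
counts = toy_quantity.value_counts().sort_index()

# The most frequent quantity
most_common = counts.idxmax()

# A positive skewness indicates a longer tail on the right side
skewness = toy_quantity.skew()

print(counts)
print(most_common, skewness)
```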

Now let's apply distribution analysis to the **Quantity** column using a box plot.

In [None]:
px.box(data, y="Quantity", title="Quantity Distribution")

Here we can see the statistical values, like the max value, which is 14, and the min value, which is 0.
It also shows us the median = 3 and the lower and upper quartiles.
The box plot shows that most of the values are between 2 and 5 and that the data points at the top are outliers, as we saw in the histogram.
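
A common rule behind the points drawn as outliers in a box plot is the 1.5 × IQR rule: values more than 1.5 times the interquartile range below the lower quartile or above the upper quartile are flagged. The sketch below applies that rule with pandas on invented data. The quartiles pandas computes may differ slightly from the ones Plotly draws, since the two can use different quartile methods.

```python
import pandas as pd

# Toy quantities (made up for this sketch)
toy_quantity = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 5, 14])

# Lower and upper quartiles and the interquartile range
q1 = toy_quantity.quantile(0.25)
q3 = toy_quantity.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are treated as outliers
outliers = toy_quantity[(toy_quantity < lower_fence) | (toy_quantity > upper_fence)]
print(q1, q3, iqr)
print(outliers.tolist())
```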

In [None]:
px.box(data, y="Quantity", x="Year", title="Quantity Distribution by Year")

Here we see four different box plots related to the sales in the years 2014, 2015, 2016, and 2017.
So for each year we have a separate box plot for the Quantity column.


In [None]:
px.box(data, y="Quantity", x="Category", color="Year", title="Quantity Distribution")

Here we see a separate box plot for each product category and year, with a different color for each year.
This shows the distribution by year and by category at the same time.


Task: Apply distribution analysis to the Profit column, using a statistical summary and a box plot

In [None]:
data.Profit.describe()

Here we can see that:
- We have negative values (the min value = -6599)
- The mean value is 28.65
- The max value is 8399
- There are outliers on both sides

In [None]:
px.box(data, y="Profit", title="Profit Distribution")

Here we can see that the plot looks very weird and that there are many outliers

### Task 7: What Is Confirmatory Data Analysis (CDA)?

By definition, Confirmatory Data Analysis is the process of using statistical summaries and graphical representations to evaluate the validity of an assumption about the data at hand.

We have the following assumptions about our data, and we are going to use the different exploration techniques we learned in the previous tasks to validate them.

    Assumption 1 - Every summer, technology products have the highest sale quantity compared to other product categories.
    Assumption 2 - In New York, there are many big companies; therefore, office supplies products have
    the highest sale quantity compared to other big states such as Texas, Illinois, and California.


Assumption 1 - Every summer, technology products have the highest sale quantity compared to other product categories.

In [None]:
seasons = {
    1 : "Winter",
    2 : "Spring",
    3 : "Summer",
    4 : "Fall"
}

Creating **Season** column

In [None]:
data["Season"] = data.Month.astype(int) % 12 // 3 + 1
data.Season=data.Season.map(seasons)
data.head()
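
The expression `Month % 12 // 3 + 1` may look cryptic, so here is a small sketch that spells it out for every month. `% 12` turns December (12) into 0 so that it lands in the same group as January and February, `// 3` splits the result into blocks of three months, and `+ 1` shifts the block number to the keys 1 to 4 of the `seasons` dictionary. The function name `month_to_season` is an assumption for the example.

```python
# Same dictionary as in the notebook
seasons = {
    1: "Winter",
    2: "Spring",
    3: "Summer",
    4: "Fall",
}


def month_to_season(month):
    """Hypothetical helper: map a month number (1-12) to a season name."""
    return seasons[month % 12 // 3 + 1]


# Print the season for every month of the year
for month in range(1, 13):
    print(month, month_to_season(month))
```

With this mapping, December, January, and February are Winter, and June, July, and August are Summer.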

Extracting data related to summer every year

In [None]:
summer_data = data[data.Season == "Summer"]
summer_data.head()

Aggregating data based on Year, Category, and Season columns and summing up the Quantity

In [None]:
summer_data_agg = summer_data.groupby(["Year", "Category", "Season"])["Quantity"].sum().reset_index()
summer_data_agg

Let's visualize our result using a grouped bar chart

In [None]:
px.bar(summer_data_agg, x="Year", y="Quantity", color="Category", barmode="group", title="Summer Sales by Product Category")

The bar chart shows the quantity of each product category sold in the summer of each year, with a different bar color for each category.

We assumed that in the summer of each year, the Technology category has the highest number of sales.
However, we can see that Office Supplies always has the highest number of sales during the summer.
So Assumption 1 is invalid.
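
The same check can be done without reading the chart, by picking the category with the largest summer quantity in each year. The sketch below uses invented numbers in a table shaped like `summer_data_agg`, so the result only illustrates the technique and says nothing about the real data.

```python
import pandas as pd

# Toy table shaped like summer_data_agg (quantities are made up)
toy_summer = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015, 2015, 2015],
    "Category": ["Furniture", "Office Supplies", "Technology"] * 2,
    "Quantity": [120, 350, 110, 150, 400, 130],
})

# For each year, keep the row with the largest quantity
top_rows = toy_summer.loc[toy_summer.groupby("Year")["Quantity"].idxmax()]

# The assumption holds only if Technology leads in every year
assumption_holds = bool((top_rows["Category"] == "Technology").all())

print(top_rows)
print(assumption_holds)
```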

Exercise: Use the analytical techniques that you've learned during the course to validate the following assumption:

        Assumption 2 - In New York, there are many big companies; therefore, office supplies
        products have the highest sale quantity compared to other big states such as Texas, Illinois, and California.

In [None]:
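data_office_supplies = data[data['Category'] == 'Office Supplies']
data_office_supplies_states = data_office_supplies[data_office_supplies['State'].isin(['New York', 'Texas', 'Illinois', 'California'])]
# The assumption is about sale quantity, so sum the Quantity column per state
office_supplies_quantity = data_office_supplies_states.groupby('State')['Quantity'].sum().reset_index()
px.bar(office_supplies_quantity, x='State', y='Quantity', title='Office Supplies Sale Quantity in Selected States')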
