# Module 2: Exploratory Data Analysis

In this module, we will explore the concept of Exploratory Data Analysis (EDA) and its importance in understanding the Walmart dataset. EDA helps us uncover patterns, relationships, and anomalies in the data, allowing us to make informed decisions during the analysis and modeling process.

## The Concept of EDA

EDA is an essential step in any data analysis or machine learning project. It involves exploring the data visually and numerically to gain insights into its characteristics. EDA helps us understand the data distribution, identify outliers and missing values, detect patterns and relationships between variables, and validate assumptions.

## Correlations and Patterns

During the EDA process, we will also examine correlations between variables. Correlation analysis helps us understand the strength and direction of relationships between different features in the dataset. By identifying correlations, we can uncover meaningful associations and potentially use them to make predictions or gain deeper insights into the data.
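As a minimal sketch of correlation analysis, the snippet below builds a small synthetic table and computes its Pearson correlation matrix with pandas. The column names (`price`, `units_sold`, `noise`) and the data are hypothetical and are not taken from the Walmart dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: these columns are illustrative, not Walmart columns
rng = np.random.default_rng(0)
price = rng.uniform(1.0, 10.0, size=200)
toy_df = pd.DataFrame({
    "price": price,
    # Constructed so that units sold fall as the price rises
    "units_sold": 100 - 8 * price + rng.normal(0, 5, size=200),
    # Generated independently of the other two columns
    "noise": rng.normal(0, 1, size=200),
})

# Pearson correlation matrix: values near +1 or -1 indicate a strong
# linear relationship, values near 0 indicate little linear relationship
corr_matrix = toy_df.corr()
print(corr_matrix.round(2))
```

In this toy example the `price` and `units_sold` entry should be strongly negative, while the entries involving `noise` should sit close to zero.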

In addition to correlations, we will focus on identifying patterns within the Walmart dataset. Time series data often exhibits trends, seasonality, and cyclicality, as discussed earlier. By detecting these patterns, we can better understand the underlying factors that drive sales and make more accurate predictions.

## The Importance of Visualization

Visualization plays a crucial role in EDA. Visual representations of data help us perceive patterns, trends, and distributions more easily. We will utilize various types of visualizations, including histograms, scatter plots, line plots, and box plots, to explore the Walmart dataset. These visualizations will enable us to gain a deeper understanding of the data and communicate our findings effectively.

During this module, we will demonstrate an EDA on the Walmart dataset, showcasing the visualizations and insights generated during our project. Participants will then have the opportunity to perform their own EDA on the dataset, guided by a checklist of tasks based on our project.




## How EDA Helps Researchers

Exploratory Data Analysis (EDA) is a powerful tool that can greatly benefit researchers in various fields. Here are some ways in which EDA can help researchers:

1. **Understanding Data Characteristics**: EDA allows researchers to gain a comprehensive understanding of the dataset they are working with. By exploring the data visually and numerically, researchers can identify data distribution, variability, and outliers, which are crucial for drawing meaningful conclusions and making accurate predictions.

2. **Identifying Patterns and Trends**: EDA helps researchers uncover patterns, trends, and relationships within the data. By visualizing the data and examining various statistical measures, researchers can identify recurring patterns, seasonality, and other temporal trends that may exist in their dataset. This information can be invaluable for developing hypotheses, understanding underlying mechanisms, and making informed decisions.

3. **Detecting Anomalies and Outliers**: EDA enables researchers to identify anomalies and outliers in the data. These could be data points that deviate significantly from the expected patterns or show unusual behavior. By detecting and understanding these anomalies, researchers can investigate the causes behind them and assess their impact on the analysis or modeling process.

4. **Informing Feature Engineering and Modeling Decisions**: EDA provides researchers with valuable insights to guide feature engineering and modeling decisions. By understanding the distribution, range, and characteristics of the variables, researchers can select appropriate features, transformations, and modeling techniques for their analysis.


5. **Validating Assumptions and Hypotheses**: EDA helps researchers validate assumptions and hypotheses about their data. By examining the relationships between variables, researchers can assess whether their initial assumptions hold true or need to be revised. This iterative process of hypothesis testing and validation is crucial for building reliable models and drawing accurate conclusions.

6. **Communication and Collaboration**: EDA facilitates effective communication and collaboration among researchers. Visualizations generated during the EDA process can effectively communicate complex findings and insights to peers, supervisors, or stakeholders. Moreover, EDA can serve as a starting point for collaborative discussions, enabling researchers to brainstorm ideas, seek feedback, and refine their research approach.

By utilizing EDA techniques, researchers can gain deeper insights into their data, develop more robust research methodologies, and enhance the overall quality of their findings. EDA serves as a powerful tool in the research process, empowering researchers to explore, analyze, and interpret their data with confidence.
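To make the outlier-detection point above concrete, here is a minimal sketch of one common rule of thumb, the 1.5 × IQR rule. It is only one of several possible approaches, and the sales figures are made up for illustration.

```python
import pandas as pd

# Hypothetical daily unit sales containing one unusually large value
sales = pd.Series([12, 15, 14, 13, 16, 15, 14, 120, 13, 15], name="units_sold")

# Interquartile range (IQR): the spread of the middle 50% of the data
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

# Flag points more than 1.5 * IQR below Q1 or above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = sales[(sales < lower) | (sales > upper)]
print(outliers)
```

Flagged points are candidates for investigation rather than automatic removal; in sales data an extreme value may be a genuine event such as a holiday.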

Let's dive into the exciting world of Exploratory Data Analysis!

## EDA Demonstration: Visualizations with Python

In this part of the workshop, we will demonstrate how to perform Exploratory Data Analysis (EDA) on the Walmart dataset using Python. We will utilize various libraries that are widely used for data analysis and visualization. Let's start by introducing these libraries:

1. **Pandas**: Pandas is a powerful library for data manipulation and analysis. It provides flexible data structures, such as the DataFrame, and a wide range of functions to explore and transform the data (as discussed in the previous module).

2. **Matplotlib**: Matplotlib is a popular data visualization library that allows us to create a wide variety of static, animated, and interactive plots. It provides a MATLAB-like interface for creating plots and can be used to generate informative visualizations.

Here are some commonly used functions from the `matplotlib.pyplot` module (imported as `plt`):

- `plt.figure()`: Creates a new figure or activates an existing figure for plotting.
- `plt.plot()`: Plots data points on a graph with lines connecting them.
- `plt.scatter()`: Creates a scatter plot to visualize the relationship between two variables.
- `plt.bar()`: Creates vertical bar plots to compare categorical data.
- `plt.hist()`: Creates histograms to show the distribution of numerical data.
- `plt.title()`: Sets the title of the plot.
- `plt.xlabel()`: Sets the label for the x-axis.
- `plt.ylabel()`: Sets the label for the y-axis.
- `plt.legend()`: Adds a legend to the plot.
- `plt.show()`: Displays the plot.

3. **Seaborn**: Seaborn is a high-level data visualization library built on top of Matplotlib. It offers a simpler interface and provides a wide range of attractive and informative statistical graphics.

- `sns.lineplot()`: Creates a line plot to visualize the relationship between two variables over time.
- `sns.scatterplot()`: Creates a scatter plot to visualize the relationship between two variables.
- `sns.barplot()`: Creates bar plots that show an aggregate (the mean by default) of a numerical variable for each category.
- `sns.histplot()`: Creates histograms to show the distribution of numerical data.
- `sns.boxplot()`: Creates box plots to analyze the distribution and outliers of numerical data.
- `sns.heatmap()`: Draws a matrix of values as a color-coded grid, commonly used to display the correlation matrix between variables.
- `sns.pairplot()`: Creates a grid of scatter plots for pairwise relationships in the dataset.

These are just a few examples of the functions available in Matplotlib and Seaborn. Feel free to explore their documentation for more functions and customization options.

- https://pandas.pydata.org/docs/
- https://matplotlib.org/stable/users/index.html
- https://seaborn.pydata.org/
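As a small illustration of how several of the Matplotlib functions listed above fit together, the sketch below draws one line and one scatter series on the same figure. The store names and weekly figures are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical weekly sales figures for two made-up stores
weeks = [1, 2, 3, 4, 5, 6]
store_a = [20, 22, 25, 24, 28, 30]
store_b = [15, 18, 17, 21, 20, 23]

fig = plt.figure(figsize=(8, 4))                # start a new figure
plt.plot(weeks, store_a, label="Store A")       # line plot
plt.scatter(weeks, store_b, label="Store B")    # scatter plot on the same axes
plt.title("Weekly Sales by Store")
plt.xlabel("Week")
plt.ylabel("Units Sold")
plt.legend()                                    # uses the label= values above
plt.tight_layout()
plt.show()
```

The same pattern (create a figure, draw, label, show) is used in the plots later in this module.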


Now, let's move on to the practical demonstration and perform EDA on the Walmart dataset using these libraries.

To begin, install the necessary libraries by running the following commands in a Jupyter Notebook cell (the leading `!` tells the notebook to run the line as a shell command):

- `!pip install pandas`
- `!pip install matplotlib`
- `!pip install seaborn`



Once the libraries are installed, we can import them in our Python script or Jupyter Notebook using the following import statements:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


We have already done the very basic data exploration in the previous module. For example, we:

- Displayed the first few rows of the DataFrame (typically with `head()`)
- Got a concise summary of the DataFrame (typically with `info()`)
- Displayed the statistical summary (typically with `describe()`)
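As a quick refresher, the three basic checks mentioned above can be sketched on a tiny stand-in table. The DataFrame `demo_df` and its values are hypothetical; in the workshop the same calls would be made on the real sales DataFrame.

```python
import pandas as pd

# Hypothetical miniature stand-in for the sales table
demo_df = pd.DataFrame({
    "item_id": ["A", "B", "C", "D"],
    "state_id": ["CA", "TX", "WI", "CA"],
    "d_1": [3, 0, 1, 5],
    "d_2": [2, 1, 0, 4],
})

print(demo_df.head(3))        # first three rows
demo_df.info()                # column names, dtypes, and non-null counts
summary = demo_df.describe()  # count, mean, std, min, quartiles, max (numeric columns)
print(summary)
```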

### Advanced Data Exploration

Now we will move on to more advanced data exploration. For instance, suppose we want to answer the following questions:

- What is the pattern of overall sales across all states and stores on a daily basis?

In [None]:
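# Python code to visualize the overall daily sales trend
# Sum unit sales over all rows (items and stores) for each day column d_1 ... d_1913
daily_overallsales = sales_df.loc[:, 'd_1':'d_1913'].sum()
# The calendar may cover more days than the sales columns, so keep only the matching dates
dates = pd.to_datetime(calendar_df['date']).iloc[:len(daily_overallsales)]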

# Plotting the daily total sales
plt.plot(dates, daily_overallsales)
plt.title('Overall Daily Sales Trend')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


Explanation:

- We calculate the total sales for each day by summing the sales across all states and stores.
- The dates are extracted from the calendar data and converted to a datetime format.
- The daily total sales are plotted using plt.plot() from Matplotlib.
- We set the title, x-axis label, and y-axis label, and rotate the x-axis tick labels for better readability.

Observations:

- The plot shows the overall daily sales trend across all states and stores.
- The x-axis represents the date, and the y-axis represents the total sales.
- Over time, there is a general trend of increasing sales, indicating growth in Walmart's overall sales.

## Accessing and Modifying Data with .loc

The `.loc` indexer in pandas is used to access and modify specific rows and columns of a DataFrame or Series. It allows label-based indexing, enabling you to select data based on labels or conditions.

### Syntax:
```python
df.loc[row_labels, column_labels]
```
- row_labels: Specify the rows you want to select or modify. It can be a single label, a list of labels, a slice object, or a boolean array.
- column_labels: Specify the columns you want to select or modify. It can be a single label, a list of labels, a slice object, or a boolean array.


### Examples:

1. Accessing specific rows and columns:
```python
df.loc[3]  # Selects row with index label 3
df.loc[1:5]  # Selects rows with index labels from 1 to 5 (both endpoints included)
df.loc[:, 'column_label']  # Selects all rows for a specific column
df.loc[[1, 3, 5], ['column1', 'column2']]  # Selects specific rows and columns
```
2. Modifying values:

```python
df.loc[2, 'column_label'] = 10  # Modifies a specific value
df.loc[df['column_label'] > 5, 'column_label'] = 0  # Modifies values based on a condition
```

3. Conditional selection:

```python
df.loc[df['column_label'] > 5]  # Selects rows based on a condition
df.loc[(df['column1'] > 3) & (df['column2'] == 'value')]  # Selects rows based on multiple conditions
```

In summary, `.loc` provides a powerful way to select and modify data in pandas DataFrames or Series using label-based indexing.
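The `.loc` patterns above use a placeholder `df`. The following runnable sketch applies them to a small, hypothetical DataFrame so the results can be inspected directly.

```python
import pandas as pd

# Hypothetical store-level data with integer index labels 1..4
demo_df = pd.DataFrame(
    {"store": ["CA_1", "TX_1", "WI_1", "CA_2"], "units": [10, 4, 7, 12]},
    index=[1, 2, 3, 4],
)

# Label slices include both endpoints: rows labelled 1, 2 and 3 are returned
first_three = demo_df.loc[1:3]

# Boolean mask for the rows, column label for the column
busy_stores = demo_df.loc[demo_df["units"] > 5, "store"]

# Conditional assignment: set 'units' to 0 wherever it is below 5
demo_df.loc[demo_df["units"] < 5, "units"] = 0

print(first_three)
print(busy_stores.tolist())
print(demo_df)
```

Note the difference from ordinary Python slicing, where the end position is excluded; with `.loc` the end label is part of the result.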

Before moving further, let's look at how to get help on anything in Python.


### Accessing Help for Functions in Python

Python provides built-in documentation and help functionality that allows you to access information about functions and their usage. This can be useful when you need to understand how a function works, its parameters, and its return values.

1. Using the help() function:
Python has a built-in help() function that displays the documentation string (docstring) of a function and provides information on its usage. To use it, simply pass the function name as an argument to help().

Example:

```python
help(print)
```
Output:

```python
Help on built-in function print in module builtins:
print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file: a file-like object (stream); defaults to the current sys.stdout.
    sep: string inserted between values, default a space.
    end: string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
```

2. Using ? in Jupyter Notebook or IPython:
If you are using Jupyter Notebook or IPython, you can simply append a question mark (?) to the end of a function to display its docstring and usage information.

Example:
```python
print?
```

Output:
```python
Signature: print(*args, sep=' ', end='\n', file=None, flush=False)
Docstring:
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method
```
3. Online documentation and resources:
Python has extensive online documentation available at docs.python.org that provides detailed information about various modules, functions, and their usage. You can search for the specific function or topic you need help with and find comprehensive documentation.

Additionally, there are many online resources, forums, and communities where you can ask questions and seek help from the Python community. Some popular platforms include Stack Overflow, the official Python discussion forum (discuss.python.org), and various Python-related subreddits.

By utilizing these methods, you can easily access help and documentation to better understand the functionality and usage of Python functions.
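The same help tools also work on functions you write yourself, as long as they have a docstring. The sketch below uses a hypothetical helper, `add_tax`, together with the standard-library `inspect` module to read a signature programmatically.

```python
import inspect

def add_tax(price, rate=0.08):
    """Return price with sales tax applied."""
    return price * (1 + rate)

help(add_tax)                      # prints the signature and the docstring
print(inspect.signature(add_tax))  # (price, rate=0.08)
print(add_tax.__doc__)             # the raw docstring text
```

Writing a short docstring for your own analysis helpers means `help()` and `?` stay useful throughout a notebook.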


Let us go back to EDA

- What is the pattern of total sales for each year separately?

In [None]:
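# daily_overallsales starts out as a Series indexed by d_1 ... d_1913; convert it
# once into a DataFrame with explicit 'date' and 'Total_Sales' columns
if isinstance(daily_overallsales, pd.Series):
    n_days = len(daily_overallsales)
    daily_overallsales = pd.DataFrame({
        'date': pd.to_datetime(calendar_df['date']).iloc[:n_days].to_numpy(),
        'Total_Sales': daily_overallsales.to_numpy(),
    })

# Group the daily totals by calendar year
daily_overallsales['year'] = daily_overallsales['date'].dt.year
daily_overallsales_grp = daily_overallsales.groupby('year')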

plt.figure(figsize=(12, 8))
for year, data in daily_overallsales_grp:
    plt.plot(data['date'], data['Total_Sales'], label=year)

plt.title('Total Sales Pattern by Year')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.legend()
plt.tight_layout()
plt.show()


Explanation:

- Plotting total sales for each year separately reveals a consistent pattern across the years. The sales data spans from 2011 to 2016, and over this period total sales trend upward, yet the shape of the sales curve within each year is similar.

- This indicates yearly seasonality in sales, where the same pattern repeats from year to year. The similar patterns suggest that external factors or events influence customer purchasing behavior at specific times of the year.

- Additionally, within each year the series looks roughly stationary around its seasonal pattern: the mean and variance do not appear to drift much over the year. A stationary time series is one whose statistical properties, such as the mean and variance, do not change over time. Note that the full 2011 to 2016 series, with its upward trend and seasonality, is not stationary as a whole, and the within-year impression is a visual one, so it should be confirmed (for example with rolling statistics or a unit-root test) before being relied on for forecasting.

Observations:

- There is a similar pattern of overall sales across the years, indicating yearly seasonality.
- Within each year, the series appears roughly stationary, and the same pattern repeats from year to year.
- The sales pattern of each year follows a consistent shape, with slight variations.
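One simple, informal way to check the stationarity impression is to look at rolling statistics: if the rolling mean and rolling standard deviation stay within a narrow band, the series is plausibly stationary. The sketch below runs this check on a synthetic series (a weekly cycle plus noise around a constant level); the data, the 28-day window, and the variable names are assumptions for illustration, not results from the Walmart data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
days = pd.date_range("2015-01-01", periods=365, freq="D")

# Hypothetical series: constant level of 100, a weekly cycle, and random noise
weekly_cycle = 10 * np.sin(2 * np.pi * np.arange(365) / 7)
series = pd.Series(100 + weekly_cycle + rng.normal(0, 3, size=365), index=days)

# A 28-day window spans four full weekly cycles, so the cycle averages out
rolling_mean = series.rolling(window=28).mean()
rolling_std = series.rolling(window=28).std()

# Narrow min/max ranges suggest the mean and variance are not drifting
print(rolling_mean.dropna().agg(["min", "max"]).round(1))
print(rolling_std.dropna().agg(["min", "max"]).round(1))
```

Applied to real sales data, a rolling mean that climbs steadily would point to a trend and therefore to a non-stationary series. A formal unit-root test can complement this visual check.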

- What is the monthly seasonality of total sales across all years over all stores?

In [None]:
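import calendar  # standard library; provides month names for the tick labels

# Sum only the sales column for each calendar month (1-12) across all years
monthly_overallsales = daily_overallsales.groupby(daily_overallsales['date'].dt.month)[['Total_Sales']].sum()

plt.figure(figsize=(12, 6))
plt.plot(monthly_overallsales.index, monthly_overallsales['Total_Sales'])
plt.xticks(range(1, 13), calendar.month_name[1:13], rotation=45)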
plt.title('Total Sales Monthly Seasonality')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.tight_layout()
plt.show()


Explanation:

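- Analyzing the monthly seasonality of total sales reveals interesting patterns. The highest sales occur in March, while the lowest are observed in November. This suggests that customers make more purchases in certain months, possibly due to seasonal factors, promotions, or holidays. One caveat: the totals are summed over all years, so if the data starts or ends partway through a year, months that appear in more years accumulate larger totals. Comparing average daily sales per month is a useful cross-check.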

- The dip in sales during the middle of the year, followed by a gradual increase, and another dip towards the end of the year suggests a cyclic pattern in customer behavior. These fluctuations in sales can be attributed to various factors such as changing consumer preferences, seasonal demand, or economic factors influencing purchasing power.

- Understanding the monthly seasonality of sales can help businesses plan their inventory, marketing campaigns, and promotional strategies to align with the periods of high customer demand.

Observations:


- Sales are highest in the month of March and lowest in November.
- There is a dip in sales during the middle of the year, followed by a gradual increase and another dip towards the end of the year.
- This monthly seasonality pattern suggests that customers tend to make more purchases during certain months.
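When monthly totals are summed over several years, it is worth checking whether every calendar month is covered equally often. The sketch below uses a deliberately flat synthetic series (10 units every day) over an illustrative date range that starts and ends partway through a year; the range and values are assumptions chosen only to expose the effect.

```python
import pandas as pd

# Hypothetical flat series: exactly 10 units sold every single day
dates = pd.date_range("2011-01-29", "2016-04-24", freq="D")
flat = pd.DataFrame({"date": dates, "Total_Sales": 10})

by_month = flat.groupby(flat["date"].dt.month)["Total_Sales"]
monthly_sum = by_month.sum()    # depends on how many days each month contributes
monthly_mean = by_month.mean()  # average daily sales, independent of coverage

print(monthly_sum)
print(monthly_mean)
```

Even though daily sales never change here, the summed totals differ between months because some months occur in more years than others, while the monthly means are identical. On real data, plotting the mean alongside the sum helps separate genuine seasonality from uneven coverage.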

- What is the proportion of sales across the states?

In [None]:
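# Select the day columns, total them per state, then add up all days for each state
day_cols = list(sales_df.loc[:, 'd_1':'d_1913'].columns)
state_sales = sales_df.groupby('state_id')[day_cols].sum().sum(axis=1)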

plt.figure(figsize=(8, 6))
plt.pie(state_sales, labels=state_sales.index, autopct='%1.1f%%', shadow=True)
plt.title('Sales Proportion by State')
plt.tight_layout()
plt.show()


Explanation:

- Analyzing the proportion of sales across different states reveals the distribution of sales volume among California (CA), Texas (TX), and Wisconsin (WI). The analysis shows that California has the highest proportion of sales, followed by Texas and Wisconsin.

- This distribution may reflect differences in market demand and customer behavior between the states. California being the most populous state in the dataset might contribute to its higher sales proportion, and differences in the number of stores per state could also play a role. It's important for businesses to understand the regional variations in sales to effectively allocate resources, plan expansion strategies, and tailor marketing efforts to specific states.

Observations:

- California (CA) has the highest proportion of sales, followed by Texas (TX) and Wisconsin (WI).
- The sales proportion reflects the sales volume and market demand in each state.
- The distribution of sales across states can provide insights for allocation of resources and targeting specific regions for marketing strategies.
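The per-state aggregation behind the pie chart can be checked on a tiny example. The sketch below uses a hypothetical four-row table with two day columns and computes each state's share of total sales; the values are made up.

```python
import pandas as pd

# Hypothetical miniature sales table: one row per item, one column per day
toy_sales = pd.DataFrame({
    "state_id": ["CA", "CA", "TX", "WI"],
    "d_1": [4, 2, 3, 1],
    "d_2": [6, 0, 1, 3],
})

# Pick out the day columns, sum them within each state, then across days
day_cols = [col for col in toy_sales.columns if col.startswith("d_")]
state_totals = toy_sales.groupby("state_id")[day_cols].sum().sum(axis=1)

# Convert the totals into proportions that add up to 1
state_share = state_totals / state_totals.sum()
print(state_share)
```

Summing with `axis=1` after the group-by is what yields one total per state. Without it, the second `sum()` would collapse the states and return one total per day instead.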

Overall, understanding the patterns and variations in sales across different time periods and regions can provide valuable insights for businesses to optimize their operations, forecast sales, and devise effective strategies to meet customer demand.