### Step 0: Installation and Library Import
To ensure the proper functionality of the code, it's essential to install and import the required libraries.

In [None]:
# Installation: use the following command to install the necessary libraries with pip.
# In Jupyter notebooks, prefix the pip command with !.
# All four libraries imported in the next cell are installed here.

!pip install -U numpy pandas matplotlib seaborn

In [None]:
# Import: In your Python script or notebook, import the installed libraries using the import statement
''' 
NumPy is a fundamental library for scientific computing with Python. 
It provides support for arrays and matrices, along with mathematical functions to operate on these arrays. 
NumPy is widely used for numerical computations and data manipulation, particularly in tasks involving large datasets or mathematical operations.
'''
import numpy as np 

''' 
Pandas is a popular library for data manipulation and analysis. 
It provides data structures like DataFrame and Series, which are powerful tools for handling and organizing data. 
Pandas is frequently used for data cleaning, transformation, and exploration, making it an essential library for data science and analysis tasks
'''
import pandas as pd 

'''
Matplotlib is a versatile library for creating static, animated, or interactive visualizations in Python. 
The pyplot module within Matplotlib is commonly used for creating various types of plots, charts, and graphs. 
It's an excellent choice for data visualization and presentation of your analysis results.
'''
import matplotlib.pyplot as plt

'''
The %matplotlib inline command is a Jupyter Notebook-specific magic command. 
It ensures that the Matplotlib plots are displayed directly in the notebook rather than in separate windows.
'''
%matplotlib inline

'''
Seaborn is a data visualization library built on top of Matplotlib. 
It offers a high-level interface for drawing informative and attractive statistical graphics.
'''
import seaborn as sns

## Part 1: Data Cleaning

In [None]:
# import csv file
# creating dataframe by importing the csv file using pandas 'read_csv' function.
df = pd.read_csv('retail_sales_dataset.csv', encoding= 'unicode_escape')

In [None]:
# dataframe shape: give tuple of Rows x Columns count.
count_rows, count_columns = df.shape

# Printing with message
print(f"No. of Rows = {count_rows}")
print(f"No. of Columns = {count_columns}")

In [None]:
# dataframe head: return data from first 5 rows.
df.head()

In [None]:
# Create a new column and initialize with default values
df['Age_Group'] = ''

# Apply condition and update values in the new column
df.loc[df['Age'] <= 18, 'Age_Group'] = '0 - 18'
df.loc[(df['Age'] > 18) & (df['Age'] <= 30), 'Age_Group'] = '18 - 30'
df.loc[(df['Age'] > 30) & (df['Age'] <= 45), 'Age_Group'] = '30 - 45'
df.loc[(df['Age'] > 45) & (df['Age'] <= 60), 'Age_Group'] = '45 - 60'
df.loc[(df['Age'] > 60), 'Age_Group'] = '60+'

# Print the DataFrame to verify the changes
print(df.head())
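
The same binning can be sketched more compactly with `pd.cut`. This is a minimal illustration on hypothetical toy ages, not the notebook's data; the bin edges and labels are copied from the cell above, and `include_lowest=True` is an assumption made so that an age of 0 would still land in the first group.

```python
import pandas as pd

# Toy data standing in for the 'Age' column (hypothetical values)
ages = pd.DataFrame({'Age': [18, 19, 30, 31, 45, 46, 60, 61]})

# Bins are right-inclusive by default, so (18, 30] matches Age > 18 and Age <= 30
bins = [0, 18, 30, 45, 60, float('inf')]
labels = ['0 - 18', '18 - 30', '30 - 45', '45 - 60', '60+']

# include_lowest=True makes the first bin include its left edge (0)
ages['Age_Group'] = pd.cut(ages['Age'], bins=bins, labels=labels, include_lowest=True)
print(ages)
```

Note that `pd.cut` returns a categorical column, so the groups keep their natural order when sorted or plotted.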

In [None]:
# dataframe tail: return data from last 5 rows.
df.tail()

In [None]:
# display in-between rows: iloc slices by position, here rows 3 to 7.
# (head(3).tail(5) would only return the first 3 rows.)
df.iloc[3:8]

In [None]:
# dataframe info: show the schema of the table columns.
# helpful to show the null value count in some columns.
df.info()

In [None]:
# drop unrelated/blank columns
# 'axis=1' means the label refers to a column (axis=0 would refer to a row).
# 'inplace=True' modifies df directly; without it, drop() returns a new DataFrame and df stays unchanged.

df.drop(['Customer ID'], axis=1, inplace=True)

In [None]:
# pandas isnull: check presence of null values
# sum (): return count
pd.isnull(df).sum()

In [None]:
# dataframe dropna: drop rows containing the null values
df.dropna(inplace=True)
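
As a small sketch of what the two cells above do, the example below counts and drops missing values on a hypothetical toy DataFrame (the column values are made up for illustration). Comparing the row count before and after `dropna` is a quick way to see how much data was removed.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (hypothetical data)
toy = pd.DataFrame({
    'Age': [25, np.nan, 40],
    'Total Amount': [100.0, 250.0, np.nan],
})

# Number of missing values per column
null_counts = toy.isnull().sum()

# dropna() without inplace returns a new DataFrame; rows with any NaN are removed
cleaned = toy.dropna()

print(null_counts)
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```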

In [None]:
# dataframe astype: change data type
df['Total Amount'] = df['Total Amount'].astype('int')    # using variable to assign changes in same entity.

# pandas to_datetime: convert the 'Date' column to datetime. One call is enough;
# pass format=... only if the date layout in the file is known.
df['Date'] = pd.to_datetime(df['Date'])
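
When the date layout is known, passing `format` makes the parsing explicit. The sketch below assumes day-month-year strings, which may not match the actual file, and uses `errors='coerce'` so that unparseable entries become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw strings in day-month-year order, plus one invalid entry
raw = pd.Series(['24-11-2023', '05-01-2023', 'not a date'])

# Explicit format avoids day/month ambiguity; invalid values are coerced to NaT
parsed = pd.to_datetime(raw, format='%d-%m-%Y', errors='coerce')

print(parsed)
print(f"Unparseable entries: {parsed.isna().sum()}")
```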

In [None]:
# dataframe dtypes: check the data type of the columns
df.dtypes

# df['Total Amount'].dtypes        # check for single column

In [None]:
# dataframe columns: return the names of all columns
df.columns

In [None]:
# dataframe rename: rename a column. pass a dictionary mapping the old name to the new name.
# shown without inplace=True, so df keeps 'Product Category', which the later cells rely on.
df.rename(columns={'Product Category': 'product_category'}).head()

In [None]:
# dataframe describe: returns statistical description of the data in the DataFrame (i.e. count, mean, std, etc)

df.describe()        # for all numeric columns

In [None]:
# use describe() for specific columns
# df[['Age', 'Quantity', 'Total Amount']].describe()

## PART 2: Exploratory Data Analysis

### Gender

In [None]:
# plotting a bar chart for Gender and its count

ax = sns.countplot(x = 'Gender',data = df)
# seaborn countplot: using a plot that will show the count. passing which value will be on x-axis and from where to read the data.

for bars in ax.containers:
    ax.bar_label(bars)
# ax.containers holds one group of bars per plotted series. iterating over them to display each bar's count above it.

# Label the x and y axes
ax.set_xlabel('Gender')
ax.set_ylabel('Number of People')
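
To make the role of `ax.containers` concrete, here is a minimal sketch using plain Matplotlib with hypothetical counts. The `Agg` backend line is only there so the example runs without a display; it can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(['Female', 'Male'], [5, 3])  # hypothetical counts, not the dataset's values

# Each call to ax.bar adds one container; bar_label returns the text labels it created
label_texts = []
for container in ax.containers:
    label_texts.extend(t.get_text() for t in ax.bar_label(container))

print(len(ax.containers), label_texts)
```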

In [None]:
# plotting a bar chart for gender vs total amount

# dataframe groupby: similar concept as in SQL. 
# grouping by Gender means values in gender will be unique and 2nd column is its data against the values.
sales_gen = df.groupby(['Gender'], as_index=False)['Total Amount'].sum().sort_values(by='Total Amount', ascending=False)

# seaborn barplot: plot a bar graph. mention x and y data and data source.
ax = sns.barplot(x = 'Gender',y= 'Total Amount' ,data = sales_gen)

## Summary
*From the graphs above, we can see that most buyers are female, and that female buyers also account for a higher total amount than male buyers.*

### Age

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Define the order of categories based on the count
order = df['Age_Group'].value_counts().index

# Set the figure size
plt.figure(figsize=(20, 6))

# Create the countplot with hue
ax = sns.countplot(x='Age_Group', hue='Gender', data=df, order=order)

# Add labels to the bars
for bars in ax.containers:
    ax.bar_label(bars)

# Set x-axis label
ax.set_xlabel('Age Group')

# Set y-axis label
ax.set_ylabel('Count')

# Show the plot
plt.show()


In [None]:
# Total Amount vs Age Group
sales_age = df.groupby(['Age_Group'], as_index=False)['Total Amount'].sum()
sales_age = sales_age.sort_values(by = ['Total Amount', 'Age_Group'], ascending = True)

sns.barplot(x = 'Age_Group', y= 'Total Amount', data = sales_age)

In [None]:
# same chart as above, with an additional hue on the Gender column.

# Group the data by 'Age_Group' and 'Gender' and calculate the sum of 'Total Amount' for each group
sales_age = df.groupby(['Age_Group', 'Gender'], as_index=False)['Total Amount'].sum()

# Sort the data to display in a particular order (if needed)
sales_age = sales_age.sort_values(by=['Total Amount', 'Age_Group', 'Gender'], ascending=True)

# Create the barplot with 'Age_Group' on the x-axis and 'Total Amount' on the y-axis, using 'Gender' as hue
sns.barplot(x='Age_Group', y='Total Amount', hue='Gender', data=sales_age)

*From the graphs above, we can see that most buyers are women in the 26-35 age group.*

### Order Count for Product Categories

In [None]:
# total number of orders from the Product Categories


''' In the context of a groupby operation in pandas, the as_index parameter is used to specify 
whether the grouping columns should become the index of the resulting DataFrame or not '''

# as_index=False keeps 'Product Category' as a regular column, which sns.barplot needs below
sales_pc = df.groupby(['Product Category'], as_index=False)['Quantity'].sum()
sales_pc = sales_pc.sort_values(by='Quantity', ascending=False).head(10)

# set the width of graph. rc is run time parameter.
sns.set(rc={'figure.figsize':(15,5)})

sns.barplot(x = 'Product Category',y= 'Quantity', data = sales_pc)
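
The effect of `as_index` is easiest to see side by side. This is a minimal sketch on hypothetical toy data: with the default `as_index=True` the group keys become the index of a Series, while `as_index=False` returns a DataFrame whose columns can be passed straight to a plotting call.

```python
import pandas as pd

# Toy order data (hypothetical values)
toy = pd.DataFrame({
    'Product Category': ['Beauty', 'Clothing', 'Beauty', 'Electronics'],
    'Quantity': [1, 2, 3, 4],
})

# Default (as_index=True): result is a Series indexed by 'Product Category'
as_series = toy.groupby('Product Category')['Quantity'].sum()

# as_index=False: result is a DataFrame with 'Product Category' as a regular column
as_frame = toy.groupby('Product Category', as_index=False)['Quantity'].sum()

print(type(as_series).__name__, list(as_series.index))
print(type(as_frame).__name__, list(as_frame.columns))
```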

In [None]:
# Create a new column to store the month names
df['Month'] = df['Date'].dt.strftime('%B')

# Print the DataFrame to verify the changes
print(df.head())
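
Month names stored as plain strings sort alphabetically (April, August, ...), which can scramble a chart's x-axis. One possible fix, sketched below on hypothetical dates, is an ordered categorical. The sketch uses `dt.month_name()` rather than `strftime('%B')` because it returns English names by default regardless of locale; the `month_order` list is an assumption introduced for illustration.

```python
import pandas as pd

# Toy dates deliberately out of calendar order (hypothetical values)
toy = pd.DataFrame({'Date': pd.to_datetime(['2023-03-15', '2023-01-10', '2023-02-20'])})
toy['Month'] = toy['Date'].dt.month_name()

# Calendar order for the categories
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# An ordered categorical sorts by calendar position instead of alphabetically
toy['Month'] = pd.Categorical(toy['Month'], categories=month_order, ordered=True)
sorted_months = toy.sort_values('Month')['Month'].astype(str).tolist()
print(sorted_months)
```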

In [None]:
# total amount/sales per month, split by product category

sales_month = df.groupby(['Month', 'Product Category'], as_index=False)['Total Amount'].sum()
sales_month = sales_month.sort_values(by=['Total Amount', 'Product Category'], ascending=False)

sns.set(rc={'figure.figsize':(15,5)})
plt.title('Total Amount by Product Category per Month')
sns.barplot(data = sales_month, x = 'Month',y= 'Total Amount', hue='Product Category')

*From above graphs we can see that most of the orders & total sales/amount are*


### Product Category

In [None]:
## count of Product Category orders.

sns.set(rc={'figure.figsize':(20,5)})
ax = sns.countplot(data = df, x = 'Product Category')

for bars in ax.containers:
    ax.bar_label(bars)

In [None]:
# total amount per Product Category.

sales_cat = df.groupby(['Product Category'], as_index=False)['Total Amount'].sum()
sales_cat = sales_cat.sort_values(by='Total Amount', ascending=False).head()

sns.set(rc={'figure.figsize':(20,5)})
sns.barplot(data = sales_cat, x = 'Product Category',y= 'Total Amount')

*From above graphs we can see that most of the sold products are from Electronics, followed by Clothing and Beauty products.*

### Conclusion

*From the analysis it can be concluded that -*

Complete project on GitHub: https://github.com/ShivamH75/Python_DA_NBs

Thank you!