## **Preprocessing steps for the Diwali Sales dataset**

---



### **Import Python libraries required for analysis**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### **Load the data file from Google Drive**

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Specify the path to the CSV file in your Drive
path = '/content/drive/My Drive/Diwali_Sales_Data.csv'

# Read the CSV file into a DataFrame, using the first CSV column as the row index
df = pd.read_csv(path, index_col=0, encoding='windows-1252')
print(df)

## **Explore the Data**

### **1. Display Top 5 Rows of The Dataset**

In [None]:
df.head()

### **2. Display columns of The Dataset**

In [None]:
df.columns

### **3. Check The Last 5 Rows of The Dataset**

In [None]:
df.tail()

### **4. Shape of the Dataset (Number of Rows * Number of Columns)**

In [None]:
df.shape

### **5. Get Information about the Dataset**

`df.info()` reports:
* Total number of rows
* Total number of columns
* Datatype of each column
* Memory usage

In [None]:
df.info()

### **6. Check for Null Values In The Dataset**

In [None]:
df.isnull().sum()
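
Raw null counts are easier to judge next to the size of the table. The sketch below turns them into percentages on a small made-up frame; `toy` and its values are hypothetical stand-ins, not the Diwali data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 250.0, 80.0],
    "Zone": ["Central", "Southern", None, "Western"],
    "Status": [np.nan, np.nan, np.nan, np.nan],
})

# Number of nulls in each column
null_counts = toy.isnull().sum()

# Share of nulls in each column, as a percentage of all rows
null_share = toy.isnull().mean() * 100

print(null_counts)
print(null_share)
```

A column whose share is 100 holds no data at all, which matters for any later step that drops rows with nulls.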

### **7. Check For any Duplicate Data in the dataset and Drop Them**

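In [None]:
print(df.duplicated().any())      # True if any row is an exact duplicate of an earlier row
df.drop_duplicates(inplace=True)  # drop the duplicates, keeping the first occurrence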

### **8. Get Overall Statistics About The Dataset**

In [None]:
df.describe()

### **9. Drop the rows that contain null values**

In [None]:
# axis=0 drops every row that has at least one null value
df.dropna(axis=0, inplace=True)
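
Whether `dropna` removes rows or columns depends on `axis`, and a single all-null column is enough for `axis=0` to remove every row. A minimal sketch on a made-up frame shows the difference; `toy` and its values are hypothetical, not taken from the Diwali data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Status" is entirely null, "Amount" has one null
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 250.0],
    "Zone": ["Central", "Southern", "Western"],
    "Status": [np.nan, np.nan, np.nan],
})

# axis=0 drops rows with at least one null; here that is every row
rows_dropped = toy.dropna(axis=0)

# axis=1 with how="all" drops only the columns that are entirely null
empty_cols_dropped = toy.dropna(axis=1, how="all")

# Dropping the empty columns first leaves only the truly incomplete rows to remove
cleaned = empty_cols_dropped.dropna(axis=0)

print(rows_dropped.shape)
print(cleaned)
```

It may be worth comparing `df.shape` before and after the drop to confirm that the number of removed rows is plausible.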

### **9.5. Ensure the relevant columns exist**

In [None]:
# Ensure the relevant columns exist
required_columns = ['User_ID', 'Cust_name', 'Product_ID', 'Gender', 'Age Group', 'Age',
                    'Marital_Status', 'State', 'Zone', 'Occupation', 'Product_Category',
                    'Orders', 'Amount', 'Status', 'unnamed1']
# read_csv was called with index_col=0, so the first CSV column is the index, not a regular column
available_columns = set(df.columns) | {df.index.name}
for col in required_columns:
    if col not in available_columns:
        raise ValueError(f"Column '{col}' is missing from the dataset.")
    print(col)



### **10. Explore the relationship between 2 features using groupby clause with aggregate functions**

In [None]:
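# Group the rows by product category; head(10) returns the first 10 rows of each group, not an aggregate
category_sales = df.groupby(['Product_Category'])
print(category_sales.head(10))

In [None]:
# Select the Amount column within each Product_ID group and show the first 10 values per group
category_sales = df.groupby('Product_ID')['Amount']
print(category_sales.head(10))

In [None]:
# Total sales amount per product category
category_sales = df.groupby('Product_Category')['Amount'].sum()
print(category_sales.head(20))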

In [None]:
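# Sum only the numeric columns for each (Product_ID, Amount) pair; text columns are skipped
category_sales = df.groupby(['Product_ID', 'Amount']).sum(numeric_only=True)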
print(category_sales.head(10))
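
Several aggregate functions can be applied in one pass with `agg`. The sketch below uses pandas named aggregation on a small made-up frame; `toy`, its values, and the output column names (`total_amount`, `mean_amount`, `total_orders`) are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Product_Category": ["Food", "Food", "Clothing", "Clothing", "Food"],
    "Orders": [1, 2, 1, 3, 2],
    "Amount": [100.0, 200.0, 500.0, 300.0, 50.0],
})

# One row per category, one output column per (input column, function) pair
summary = (
    toy.groupby("Product_Category")
    .agg(
        total_amount=("Amount", "sum"),
        mean_amount=("Amount", "mean"),
        total_orders=("Orders", "sum"),
    )
    .sort_values("total_amount", ascending=False)
)

print(summary)
```

Sorting by the total puts the highest-selling category first, which is usually the row of interest.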

***Find how much was sold in each product category within each age group***

In [None]:
category_sales = df.groupby(['Age Group','Product_Category'])['Amount'].sum()
print(category_sales.head(20))
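
A two-level grouped result is often easier to read as a table with one row per age group and one column per category. A minimal sketch with `unstack`, assuming a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age Group": ["18-25", "18-25", "26-35", "26-35", "26-35"],
    "Product_Category": ["Food", "Clothing", "Food", "Food", "Clothing"],
    "Amount": [100.0, 200.0, 300.0, 50.0, 400.0],
})

# Series indexed by (Age Group, Product_Category)
grouped = toy.groupby(["Age Group", "Product_Category"])["Amount"].sum()

# Move the inner index level (Product_Category) into the columns;
# combinations with no sales are filled with 0
table = grouped.unstack(fill_value=0)

print(table)
```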

### **14. Plot a graph using the seaborn library**

### **Hypothesis 1: Customers in the Central zone generate higher sales than customers in any other zone.**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# countplot shows the number of rows recorded for each zone
sns.countplot(x='Zone', data=df, palette="Set1")
plt.xticks(rotation=85)
plt.show()
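
A count plot shows how many rows fall in each zone, which is not the same as how much was sold there. The sketch below compares the two rankings on a made-up frame; `toy` and its values are hypothetical and chosen so that the rankings differ.

```python
import pandas as pd

# Hypothetical toy data: Central has the most rows, Southern the largest total
toy = pd.DataFrame({
    "Zone": ["Central", "Central", "Central", "Southern", "Western"],
    "Amount": [100.0, 150.0, 50.0, 900.0, 200.0],
})

# Number of rows per zone (what a count plot displays)
rows_per_zone = toy["Zone"].value_counts()

# Total sales amount per zone (what a sales hypothesis is about)
sales_per_zone = toy.groupby("Zone")["Amount"].sum()

top_by_count = rows_per_zone.idxmax()
top_by_sales = sales_per_zone.idxmax()

print("Most rows:", top_by_count)
print("Highest total amount:", top_by_sales)
```

On the real data the two rankings may well agree, but a hypothesis about sales is best checked against the summed `Amount`.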

### **15. Plot a graph using matplotlib library functions**


In [None]:
# Create a new figure for the plot
plt.figure(figsize=(10, 8))

# Total sales amount per zone, drawn as a bar chart with the tick labels rotated by 45 degrees
zone_sales = df.groupby('Zone')['Amount'].sum()
zone_sales.plot(kind='bar', rot=45)

# Show the grid lines
plt.grid()

# Set the title of the graph and the labels of the y and x axes
plt.title('Sales Contribution by Zone')
plt.ylabel('Total Sales')
plt.xlabel('Zone')

# Adjust the layout and display the plot
plt.tight_layout()
plt.show()
