## Create a dataset in CSV format with the following fields
(product_id, product_name, category, price, qty)
Perform the following operations on the data:
1. Load the CSV file into pandas.
2. Check for and handle any missing values or inconsistencies in the dataset (data cleaning).
3. Calculate the sum of qty, the average of price, the total sales, and the top-selling products [sum & mean].

In [1]:
#1. Import Libraries:
#This imports the random module for generating random numbers and the pandas library as pd for working with data in tabular form.

import random
import pandas as pd

# Generate random product IDs
#2. Define Function to Generate Product ID:
#This function generate_product_id() generates a random product ID consisting of uppercase letters and digits. 
#It uses random.choices() to select characters from the given set and join() to concatenate them into a single string.

def generate_product_id():
    return ''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', k=6))

# Sample product names, categories, and prices
#3. Define Sample Product Information:
#Lists are defined containing sample product names (product_names), categories (categories), 
#and a list of 100 random prices between 100 and 1000 (prices).
#The random.uniform() function generates random floating-point numbers within the specified range

product_names = ['Laptop', 'Smartphone', 'Headphones', 'Tablet', 'Camera']
categories = ['Electronics', 'Clothing', 'Appliances', 'Books', 'Sports']
prices = [random.uniform(100, 1000) for _ in range(100)]

#4. Generate dataset
#A loop iterates 100 times to generate a dataset. Within each iteration, a product ID is generated using generate_product_id(),
#and random values for product name, category, price, and quantity are chosen.
#These values are then appended to the data list as a sublist.

data = []
for _ in range(100):
    product_id = generate_product_id()
    product_name = random.choice(product_names)
    category = random.choice(categories)
    price = random.choice(prices)
    qty = random.randint(1, 10)
    data.append([product_id, product_name, category, price, qty])

# 5. Introduce inconsistencies and missing values
# Ten distinct records are picked at random with random.sample(). For every picked record:
# - the price is made negative by multiplying it by -1,
# - the quantity is set to 0,
# - the product name and the category are each replaced by None with a probability of 20%.
# Distinct indices matter: if the same record were picked twice, its price would be negated twice and end up positive again.

for idx in random.sample(range(len(data)), 10):
    # Make the price negative
    data[idx][3] *= -1
    # Set the quantity to 0
    data[idx][4] = 0
    # Introduce missing values for product name and category (20% chance each)
    if random.random() < 0.2:
        data[idx][1] = None
    if random.random() < 0.2:
        data[idx][2] = None

# 6.Convert to DataFrame
#The pd.DataFrame() function converts the data list of lists into a pandas DataFrame. 
#Column names are specified as ['product_id', 'product_name', 'category', 'price', 'qty'].
df = pd.DataFrame(data, columns=['product_id', 'product_name', 'category', 'price', 'qty'])

#7. Save to CSV
#Finally, the DataFrame is saved to a CSV file named product_data.csv without including the index column.
#This CSV file will contain the generated product data.
df.to_csv('product_data.csv', index=False)
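
Because every value comes from the random module, each run of the cell above writes a different file. A minimal sketch, assuming repeatable output is wanted: seeding the generator before generating makes the sequence reproducible. The seed value 42 and the names first_run and second_run are illustrative and not part of the original notebook.

```python
import random

def generate_product_id():
    # Same helper as in the cell above: six random uppercase letters or digits.
    return ''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', k=6))

random.seed(42)  # hypothetical seed; any fixed integer works
first_run = [generate_product_id() for _ in range(3)]

random.seed(42)  # re-seeding restarts the same pseudo-random sequence
second_run = [generate_product_id() for _ in range(3)]

print(first_run == second_run)  # True
```

Calling random.seed() once at the top of the generation cell would be enough to make the whole dataset repeatable.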


In [6]:
# Display the first 10 rows of the DataFrame df.
# This is a quick way to inspect the structure and contents of a DataFrame, especially a large one.
df.head(10)

   product_id  product_name     category       price  qty
0      B2C581    Headphones  Electronics   279.57188    8
1      VJJAB4    Smartphone        Books  488.333717    1
2      4DW8NN        Laptop      Unknown  276.123473    1
3      XNXPF4        Camera     Clothing  332.976865    1
4      G7CL4P       Unknown  Electronics  113.644223    1
5      R9XJWC        Camera     Clothing  488.333717    9
6      MB0J39        Tablet  Electronics  332.976865    3
7      RPO8HC    Headphones        Books  845.807407    1
8      QDIKKM        Laptop        Books  726.158934    7
9      HVKYYO    Smartphone       Sports  519.737809    9


In [5]:
# Display the last 10 rows of the DataFrame df.
# This is useful for checking that the end of the DataFrame looks as expected.

df.tail(10)

    product_id  product_name     category       price  qty
90      OZMTVW    Headphones        Books  793.719948    4
91      QAGPEP        Tablet       Sports  470.935596   10
92      PFQH92        Tablet     Clothing  140.495291    2
93      27Y948        Camera  Electronics  430.407343    8
94      1Y0ZO7        Tablet        Books  200.005992    9
95      E0HNJH        Camera   Appliances  718.711228    1
96      CMH8HS    Smartphone  Electronics  825.350233    5
97      ABL8Q2        Tablet       Sports  694.629059    4
98      BZPFBJ        Tablet        Books  239.551888    3
99      9TMZLX    Headphones       Sports  678.793903    8


In [2]:
#Load csv file into pandas
#1. Import pandas Library:
#This line imports the pandas library and aliases it as pd, which is a common convention.

import pandas as pd

#2.Read Data from CSV into DataFrame:
#The pd.read_csv() function reads the data from the CSV file named 'product_data.csv' into a pandas DataFrame called df. 
#The data in the CSV file is expected to be in tabular format.

df=pd.read_csv('product_data.csv')

# 3.Check for missing values
#The .isnull() method of the DataFrame df returns a DataFrame of the same shape, where each element is True if the corresponding element in df is NaN (missing), and False otherwise.
#The .sum() method is then used to sum up the number of True values (missing values) along each column axis.
#This results in a pandas Series object (missing_values) where the index represents the column names, and the values represent the count of missing values in each column.

missing_values = df.isnull().sum()

#4. Print Missing Values:
#This code prints out a header "Missing Values:" to indicate that the following output represents the count of missing values.
#It then prints the missing_values Series, which summarizes the number of missing values
#in each column of the DataFrame df.
print("Missing Values:")
print(missing_values)

Missing Values:
product_id      0
product_name    2
category        2
price           0
qty             0
dtype: int64
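
The isnull() count only finds empty cells. The negative prices and zero quantities introduced earlier are valid numbers, so they do not appear in it. A minimal sketch of checking for both kinds of problem, on a small made-up frame (the rows and the names sample, missing_counts, negative_prices and zero_quantities are illustrative, not taken from the generated file):

```python
import pandas as pd

# Hypothetical rows containing the same kinds of problems as the generated data
sample = pd.DataFrame({
    'product_name': ['Laptop', None, 'Camera', 'Tablet'],
    'category': ['Electronics', 'Books', None, 'Sports'],
    'price': [250.0, -120.5, 300.0, 99.9],
    'qty': [2, 0, 5, 0],
})

# Missing cells per column
missing_counts = sample.isnull().sum()

# Inconsistent values: a boolean mask is summed to count the True entries
negative_prices = (sample['price'] < 0).sum()
zero_quantities = (sample['qty'] == 0).sum()

print(int(negative_prices), int(zero_quantities))  # 1 2
```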


In [3]:
# Check for and handle any missing values or inconsistencies in the dataset (data cleaning): handle missing values
#1. Fill Missing Values in 'product_name' Column:
#This line fills missing values in the 'product_name' column of the DataFrame df with the string 'Unknown'.
#The fillna() method returns a copy of the column in which missing values (NaN) are replaced with the specified value ('Unknown').
#The result is assigned back to df['product_name']. Calling fillna(..., inplace=True) on a selected column triggers a warning
#in recent pandas versions and does not update df when copy-on-write is enabled.
df['product_name'] = df['product_name'].fillna('Unknown')

#2. Fill Missing Values in 'category' Column:
#Similar to the previous line, this line fills missing values in the 'category' column of the DataFrame df with the string 'Unknown'.

df['category'] = df['category'].fillna('Unknown')

#Convert Negative Prices to Positive Values:
#This line ensures that all prices in the 'price' column of the DataFrame df are converted to positive values by taking the absolute value of each price.
#The .abs() method is applied to the 'price' column, which returns the absolute value of each element in the column.
df['price'] = df['price'].abs()

#Replace Zero Quantity Values with One:
#This line replaces any zero quantity values in the 'qty' column of the DataFrame df with the value 1.
#The .replace() method is used to replace all occurrences of the specified value (0) with another value (1) in the 'qty' column.

df['qty'] = df['qty'].replace(0, 1)  
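
The cleaning steps can be checked on a small hand-made frame where the expected result is obvious. A minimal sketch with made-up rows (the name sample is illustrative); each step writes its result back by assignment:

```python
import pandas as pd

# Two hypothetical records, each with one missing label and one inconsistent number
sample = pd.DataFrame({
    'product_name': ['Laptop', None],
    'category': [None, 'Books'],
    'price': [-250.0, 120.5],
    'qty': [0, 3],
})

sample['product_name'] = sample['product_name'].fillna('Unknown')  # missing name -> 'Unknown'
sample['category'] = sample['category'].fillna('Unknown')          # missing category -> 'Unknown'
sample['price'] = sample['price'].abs()                            # -250.0 -> 250.0
sample['qty'] = sample['qty'].replace(0, 1)                        # 0 -> 1

print(sample['price'].tolist())  # [250.0, 120.5]
print(sample['qty'].tolist())    # [1, 3]
```

Taking the absolute value and replacing 0 with 1 both assume the original values were entry errors. If that assumption does not hold for a real dataset, dropping or flagging those rows may be the safer choice.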

In [8]:
#Calculate sum of qty, average of price, total sales and top selling products [Sum & mean]
#Calculate Sum of Quantities:
#This line calculates the sum of all values in the 'qty' column of the DataFrame df.
#df['qty'] selects the 'qty' column from the DataFrame df.
#The .sum() method calculates the sum of all values in the selected column.
sum_of_quantities=df['qty'].sum()

#Print Sum of Quantities:
#This line prints the calculated sum of quantities.
#The message "Sum of Quantities:" is printed first as a header to indicate what the following number represents.
#sum_of_quantities contains the calculated sum, which is then printed along with the message.
print("Sum of Quantities:",sum_of_quantities)

Sum of Quantities: 509


In [9]:
#Calculate Average Price:
#This line calculates the mean (average) of all values in the 'price' column of the DataFrame df.
#df['price'] selects the 'price' column from the DataFrame df.
#The .mean() method calculates the mean (average) of all values in the selected column.
average_price=df['price'].mean()

#Print Average Price:
#This line prints the calculated average price.
#The message "Average Price:" is printed first as a header to indicate what the following number represents.
#average_price contains the calculated average, which is then printed along with the message.
print("Average Price:",average_price)

Average Price: 500.8243240883149
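
The mean above counts every row once, regardless of how many units were sold. A minimal sketch on made-up numbers, contrasting it with a quantity-weighted average (the frame and the names average_price and weighted_average_price are illustrative; the weighted figure is not computed in the original notebook):

```python
import pandas as pd

sample = pd.DataFrame({
    'price': [100.0, 200.0, 300.0],
    'qty': [1, 2, 3],
})

# Unweighted mean: (100 + 200 + 300) / 3
average_price = sample['price'].mean()

# Quantity-weighted mean: (100*1 + 200*2 + 300*3) / (1 + 2 + 3) = 1400 / 6
weighted_average_price = (sample['price'] * sample['qty']).sum() / sample['qty'].sum()

print(average_price)                     # 200.0
print(round(weighted_average_price, 2))  # 233.33
```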


In [16]:
#Calculate Total Sales:
#This line calculates the total sales by first performing an element-wise multiplication between the 'price' column and the 'qty' column of the DataFrame df.
#(df['price'] * df['qty']) generates a new pandas Series resulting from the element-wise multiplication of corresponding elements in the 'price' and 'qty' columns.
#The .sum() method then calculates the sum of all values in this new Series, resulting in the total sales.


total_sales=(df['price']*df['qty']).sum()

#Print Total Sales:
#This line prints the calculated total sales.
#The message "Total Sales:" is printed first as a header to indicate what the following number represents.
#total_sales contains the calculated total sales.
#It is printed using the format() function with the format specifier ".2f",
#which formats the value as a floating-point number with two decimal places.
print("Total Sales:",format(total_sales,".2f"))

Total Sales: 251030.41
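
The element-wise multiplication is easiest to follow on a few rows. A minimal sketch with made-up prices and quantities (the names sample and line_totals are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'price': [1999.99, 5.50, 100.00],
    'qty': [3, 10, 1],
})

# One product per row: 1999.99*3, 5.50*10, 100.00*1
line_totals = sample['price'] * sample['qty']

# Adding the row totals gives the overall sales figure
total_sales = line_totals.sum()

print(format(total_sales, ".2f"))  # 6154.97
print(f"{total_sales:,.2f}")       # 6,154.97
```

The second print uses the "," option of the format specifier, which adds thousands separators and can make large totals easier to read.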


In [17]:
#Group Data by Product Name and Sum Quantities:
#This line groups the data in the DataFrame df by the 'product_name' column.
#For each unique product name, it sums up the quantities ('qty') sold.
#The result is a pandas Series where each unique product name is the index, and the corresponding values represent the total quantity sold for that product.

#Select Top 5 Selling Products:
#The .nlargest(5) method selects the 5 elements of the Series with the largest values,
#in this case the 5 products with the highest total quantity sold.
#The result (a pandas Series) is stored in the variable top_selling_products.

top_selling_products = df.groupby('product_name')['qty'].sum().nlargest(5)

#Display Top Selling Products:
top_selling_products

product_name
Tablet        139
Camera        121
Headphones    111
Laptop         68
Smartphone     68
Name: qty, dtype: int64
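
The group-then-rank step can be traced on a handful of rows. A minimal sketch with made-up data (the names sample, totals, top_two and named_only are illustrative). It also shows that rows whose name was filled with 'Unknown' form a group of their own, which can be filtered out first if they should not be ranked:

```python
import pandas as pd

sample = pd.DataFrame({
    'product_name': ['Tablet', 'Camera', 'Tablet', 'Laptop', 'Camera', 'Unknown'],
    'qty': [5, 3, 2, 6, 1, 1],
})

# Total quantity per product: Camera 4, Laptop 6, Tablet 7, Unknown 1
totals = sample.groupby('product_name')['qty'].sum()

# Keep the two largest totals, sorted from highest to lowest
top_two = totals.nlargest(2)
print(top_two.index.tolist())  # ['Tablet', 'Laptop']
print(top_two.tolist())        # [7, 6]

# Optional: leave out the placeholder group before ranking
named_only = sample[sample['product_name'] != 'Unknown']
```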

In [None]:
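# Save the cleaned DataFrame to a new CSV file.
# index=False keeps the row index out of the file, as in the first cell.
df.to_csv('product_data_modified.csv', index=False)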
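
Whether the row index is written makes a difference when the file is read back. A minimal sketch using in-memory buffers instead of files (io.StringIO and the two-row frame are illustrative choices, not part of the original notebook):

```python
import io
import pandas as pd

sample = pd.DataFrame({'product_id': ['A1', 'B2'], 'qty': [3, 4]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'product_id', 'qty']
print(list(reloaded_without_index.columns))  # ['product_id', 'qty']
```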