# Data Wrangling for **sales.csv**

**Tasks performed in this file:**

Data gathering -> Data assessment -> Data cleaning -> Creation of cleaned_sales.csv

## Dependencies.

Python modules and libraries required for the analysis.

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Data Gathering.

The raw sales.csv is provided by the company.

The dataset is imported as a pandas DataFrame for further analysis.

In [5]:
sales = pd.read_csv('../raw_datasets/sales.csv')

## Data Assessment.

**Objectives**: To understand the data's structure, its content, and its quality.

### Structure of data

The dataset is rectangular: data arranged in rows and columns.

In [6]:
print(f"In sales.csv, number of Rows are: {sales.shape[0]} and number of Columns are: {sales.shape[1]}")

In sales.csv, number of Rows are: 15000 and number of Columns are: 8


### Content of data.

This section briefly describes each feature in the dataset. The first rows of the dataset are displayed in the quality assessment section.

#### **Feature description of sales.csv**

**Order_ID**: Generated when a new order is placed. Unique to every order: two orders placed by the same customer are assigned two different Order_IDs, so it should be unique across the whole dataset.

**Customer_ID**: A unique ID assigned to every customer who has placed at least one order. One Customer_ID can be associated with more than one Order_ID.

**Product_ID**: A unique ID assigned to every product the company sells.

**Quantity**: Number of products ordered by the customer.

**Discount**: Discount given to customer on order (in rupees).

**Order_Value**: Total amount paid by the customer.

**Profit**: Profit earned by the company on the order.

**Order_Date**: Date on which the order was placed.
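The feature descriptions above imply a few rules that can be checked mechanically. The following is a minimal sketch, not the method used in this notebook: the `toy` DataFrame holds made-up rows standing in for sales.csv, the helper `feature_checks` is hypothetical, and the rules themselves (Quantity greater than zero, non-negative Discount, dates in `YYYY-MM-DD` form) are assumptions drawn from the descriptions rather than confirmed business rules.

```python
import pandas as pd

# Made-up rows standing in for sales.csv; values are illustrative only.
toy = pd.DataFrame({
    "Order_ID": ["O000001", "O000002", "O000003"],
    "Customer_ID": ["C00001", "C00001", "C00002"],
    "Quantity": [2, 0, 1],
    "Discount": [0, 5, 10],
    "Order_Date": ["2023-01-05", "2023-02-30", "2023-03-01"],
})


def feature_checks(df):
    """Count rows that contradict the (assumed) feature rules."""
    # Invalid or malformed dates become NaT instead of raising an error.
    dates = pd.to_datetime(df["Order_Date"], format="%Y-%m-%d", errors="coerce")
    # Number of distinct orders placed by each customer.
    orders_per_customer = df.groupby("Customer_ID")["Order_ID"].nunique()
    return {
        "non_positive_quantity": int((df["Quantity"] <= 0).sum()),
        "negative_discount": int((df["Discount"] < 0).sum()),
        "unparseable_dates": int(dates.isna().sum()),
        "customers_with_multiple_orders": int((orders_per_customer > 1).sum()),
    }


result = feature_checks(toy)
print(result)
```

Calling `feature_checks(sales)` on the real DataFrame would give counts to carry into the quality assessment, provided the assumed rules match how the company defines these columns.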

### Quality Assessment of the data.

A quick look at quality issues in the data, noting them so they can be fixed in the data cleaning step.

In [40]:
# Preview the first rows of the dataset.
sales.head(20)

| | Order_ID | Customer_ID | Product_ID | Quantity | Discount | Order_Value | Profit | Order_Date |
|---|---|---|---|---|---|---|---|---|
| 0 | O141598 | C01653 | P0340 | 5 | 0 | 572.349964 | 46.194363 | 2024-01-09 |
| 1 | O569509 | C01284 | P0299 | 0 | 5 | 105.50105 | 18.342919 | 2023-02-16 |
| 2 | O240973 | C01082 | P0479 | 1 | 10 | 501.322948 | 123.010896 | 2024-02-17 |
| 3 | O914001 | C01470 | P0052 | 3 | 15 | 802.281732 | 44.784786 | 2023-12-17 |
| 4 | O614116 | C01187 | P0242 | 4 | 20 | 1198.218159 | 263.222539 | 2023-12-31 |
| 5 | O446261 | C01864 | P0271 | 1 | 20 | 1321.914855 | 329.483198 | 2023-08-29 |
| 6 | O190141 | C01167 | P0326 | 5 | 0 | 1429.937013 | 359.499365 | 2023-09-26 |
| 7 | O741507 | C00667 | P0073 | 5 | 10 | 333.940123 | 19.484839 | 2023-05-29 |
| 8 | O245250 | C00390 | P0476 | 2 | 20 | 1218.550864 | 171.275965 | 2023-03-27 |
| 9 | O691063 | C00423 | P0320 | 2 | 0 | 483.67589 | 32.066841 | 2023-04-05 |


#### **Checking for missing values in dataset.**

In [16]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Order_ID     15000 non-null  object 
 1   Customer_ID  15000 non-null  object 
 2   Product_ID   15000 non-null  object 
 3   Quantity     15000 non-null  int64  
 4   Discount     15000 non-null  int64  
 5   Order_Value  15000 non-null  float64
 6   Profit       14801 non-null  float64
 7   Order_Date   15000 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 937.6+ KB


There are a total of **199** missing values in the **Profit** feature (15000 rows, 14801 non-null). No other feature has missing values.
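The missing-value count can also be read directly per column instead of being derived from the `info()` output. The sketch below shows one way to do this with `isna()`. The `toy` DataFrame is made-up data, not rows from sales.csv, and the `report` layout is only an illustration. For the real dataset, 199 missing values out of 15000 rows is roughly 1.3% of the Profit column.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for sales.csv; values are illustrative only.
toy = pd.DataFrame({
    "Order_ID": ["O1", "O2", "O3", "O4"],
    "Order_Value": [100.0, 250.0, 80.0, 40.0],
    "Profit": [10.0, np.nan, 8.0, np.nan],
})

# Count of missing values per column.
missing = toy.isna().sum()

# Share of missing values per column, as a percentage.
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only the columns that actually have missing values.
report = pd.DataFrame({"missing": missing, "percent": missing_pct})
report = report[report["missing"] > 0]

print(report)
```

Replacing `toy` with `sales` would produce the same kind of report for the real data.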

#### **Checking for duplicates in dataset.**

In [37]:
sales['Order_ID'].duplicated().sum()

132

The **132** duplicated values in **Order_ID** are important to highlight, because Order_ID must be unique. The underlying reason, and how to fix it, need to be investigated in the data cleaning step.

Other columns can legitimately hold duplicated values.
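Before deciding how to fix the duplicated Order_IDs, it may help to separate two cases: rows that are exact copies of each other, and rows that share an Order_ID but differ elsewhere. The sketch below is one possible way to do that and is not the approach confirmed by this notebook. The `toy` DataFrame is made-up data, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for sales.csv; values are illustrative only.
toy = pd.DataFrame({
    "Order_ID": ["O1", "O1", "O2", "O2", "O3"],
    "Customer_ID": ["C1", "C1", "C2", "C3", "C4"],
    "Order_Value": [100.0, 100.0, 50.0, 75.0, 20.0],
})

# Every row whose Order_ID appears more than once (keep=False marks all copies).
dup_rows = toy[toy["Order_ID"].duplicated(keep=False)].sort_values("Order_ID")

# Rows that repeat an earlier row in every column.
exact_copies = int(toy.duplicated().sum())

# Order_IDs that still have more than one row after dropping exact copies,
# meaning the rows disagree in at least one other column.
rows_per_id = toy.drop_duplicates().groupby("Order_ID").size()
conflicting_ids = rows_per_id[rows_per_id > 1].index.tolist()

print(dup_rows)
print("exact copies:", exact_copies)
print("conflicting Order_IDs:", conflicting_ids)
```

Exact copies can usually be dropped safely, while conflicting rows need a decision about which record to trust, so the two counts suggest different cleaning steps.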