# Tasks
* What is the best-selling book?
* Visualize order status frequency.
* Find a correlation between date and time with order status.
* Find a correlation between city and order status.
* Find any hidden patterns that are counter-intuitive for a layman.
* Can we predict number of orders, or book names in advance?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# !pip3 install arabic-reshaper
# import arabic_reshaper
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
sales_data = pd.read_csv('/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 4.csv', encoding = 'utf-8') # read a comma-separated file
sales_data.head(50) # Show top 50 entries

In [None]:
print('Rows: ', sales_data.shape[0])
print('Cols: ', sales_data.shape[1])

In [None]:
new_cols = ['order_num','order_status','book_name','order_date','billing_city']
sales_data.columns = new_cols
sales_data.head(20)

# Data consistency
In this stage we are going to look for missing, incorrect or irrelevant data in our data frame.

In [None]:
# Check for missing data
sales_data.isnull().sum()

We have missing values in book_name and billing_city.

In [None]:
# Show the rows where book_name is missing
sales_data[sales_data['book_name'].isna()]

In [None]:
# Show the rows where billing_city is missing
sales_data[sales_data['billing_city'].isna()]

In [None]:
# Let's check data consistency
sales_data['order_status'].unique()

In [None]:
sales_data[sales_data['order_num'].duplicated()]

There are duplicate values in "order_num".

In [None]:
sales_data['book_name'].nunique()

In [None]:
sales_data['book_name'].head(20)

Some orders contain multiple books in a single book_name entry. Such entries are separated by '/', e.g. row 11.

In [None]:
sales_data['billing_city'].nunique()

That is far more distinct values than there are cities in Pakistan.

In [None]:
sales_data['billing_city'].unique()

This does not look good. There is white space around some city names, some rows have question marks in billing_city, and others contain non-alphabetic characters such as full stops and commas. For example:

In [None]:
# Check irregular data in billing_city column.
sales_data[sales_data.billing_city.str.contains(r'[^\w\s]',na=False)]

In [None]:
# Check irregular data in book_name column.
sales_data[sales_data.book_name.str.contains(r'[^\w\s]',na=False)]

# PreProcessing

We have seen multiple problems with our data in the previous examples, so we need to clean (pre-process) it before EDA.
For an accurate analysis we will perform the following steps:
 
* Drop rows with missing data.
* Remove duplicates from the order_num column.
* Remove leading and trailing white-spaces from entries.
* Convert categorical data to UPPER case.
* Drop entries with (????) in the billing_city column.
* Split multiple-book entries into one book per row.
* Solve the city name problem, e.g. Shorkot appears as:
 * Shorkot
 * Shorkot Cantt
 * Shorkot Cantt.

***I am looking for a solution to the last problem: will update soon.***

In [None]:
# We cannot replace this missing data.
# So we are going to drop rows with Na values.
sales_data = sales_data.dropna()
# Let's see how many rows and columns we have now.
print('Rows: ', sales_data.shape[0])
print('Cols: ', sales_data.shape[1])


In [None]:
sales_data.isnull().any()

NaN/null values have been removed.

In [None]:
# Drop fully duplicated rows, then check that no order_num value is repeated.
sales_data = sales_data.drop_duplicates()
sales_data[sales_data['order_num'].duplicated()]

We are good to go. Duplicates removed.
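
Note that `drop_duplicates()` without arguments only removes rows that repeat in every column. A minimal sketch on a made-up frame (the values are illustrative, not from the dataset) shows how it differs from deduplicating on `order_num` alone:

```python
import pandas as pd

# Toy data (hypothetical values): order 2 appears twice as an exact copy,
# order 3 appears twice with different book names.
toy = pd.DataFrame({
    "order_num": [1, 2, 2, 3, 3],
    "book_name": ["A", "B", "B", "C", "D"],
})

# Drops only rows that are identical in every column.
full_row = toy.drop_duplicates()

# Keeps the first row for each order_num, whatever the other columns hold.
by_order = toy.drop_duplicates(subset="order_num", keep="first")

print(len(full_row), full_row["order_num"].duplicated().any())
print(len(by_order), by_order["order_num"].duplicated().any())
```

Which variant is appropriate depends on whether a repeated order number with a different book is a genuine second line item or a data-entry error.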

In [None]:
# Remove leading and trailing white spaces
sales_data['order_status'] = sales_data['order_status'].str.strip()
sales_data['book_name'] = sales_data['book_name'].str.strip()
sales_data['billing_city'] = sales_data['billing_city'].str.strip()
sales_data.head(50)

In [None]:
# Convert categorical data to upper case
sales_data['order_status'] = sales_data['order_status'].str.upper()
sales_data['book_name'] = sales_data['book_name'].str.upper()
sales_data['billing_city'] = sales_data['billing_city'].str.upper()
sales_data.head(50)

In [None]:
# Find location of entries with (?) in billing_city column and drop such entries
# because we have no data to replace with.
sales_data.loc[sales_data['billing_city'].str.contains('^[?]'), 'billing_city'] = np.nan
sales_data = sales_data.dropna()

In [None]:
# Let's check the number of cities now
sales_data['billing_city'].nunique()

So we have reduced unique billing_city entries from 4082 to 3441. But this is still far from perfect.

In [None]:
# Split multi-book entries so that each row holds one book.
print('Before split:',sales_data.shape[0])
# Split book_name on '/' and expand each part into its own row
sales_data = (sales_data.set_index(['order_num', 'order_date', 'order_status', 'billing_city'])
   .apply(lambda x: x.str.split('/').explode())
   .reset_index())
print('After split: ',sales_data.shape[0])

Now we can see our rows have increased from 19184 to 33091.
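
To make the effect of the split easier to see, here is a small sketch on a made-up frame. It uses `DataFrame.explode` directly rather than the index-based approach above; the column values are hypothetical.

```python
import pandas as pd

# Toy orders (hypothetical values): the first order holds two books.
toy = pd.DataFrame({
    "order_num": [101, 102],
    "book_name": ["BOOK A/BOOK B", "BOOK C"],
    "billing_city": ["LAHORE", "KARACHI"],
})

# Turn each book_name into a list, then give every list element its own row.
exploded = (
    toy.assign(book_name=toy["book_name"].str.split("/"))
       .explode("book_name")
       .reset_index(drop=True)
)

# Splitting can leave stray spaces around the separator, so strip again.
exploded["book_name"] = exploded["book_name"].str.strip()
print(exploded)
```

The other columns are repeated for every book, which is why the row count grows after the split.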

In [None]:
# We noticed that some book names are repeated with additional text to their name
sales_data[sales_data['book_name'].str.match('^R KA.*')]

Data cleaning is a mechanical process, so we checked for more data inconsistencies manually and replaced them with the correct values as required.

In [None]:
# Let's replace these book names.
urd = [sales_data['book_name'][32717], sales_data['book_name'][31235], sales_data['book_name'][31239]]
print(urd)
# book_name was converted to upper case earlier, so every key must be upper case to match.
# The result is assigned back instead of relying on inplace=True on a column selection.
sales_data['book_name'] = sales_data['book_name'].replace({
    urd[0]: "R KA TAARUF",
    "PYTHON PROGRAMMING- RELEASE DATE: AUGUST 14, 2020": "PYTHON PROGRAMMING",
    "Linux - An Introduction  (Release Data - October 3, 2020)".upper(): "LINUX - AN INTRODUCTION",
    urd[1]: "(C++)",
    "BOOK BAND KAMRON KI MUHABBAT": "BAND KAMRON KI MUHABBAT",
    urd[2]: "MOLO MASALI"})

In [None]:
# Found another issue of multiple books on single row.
sales_data['book_name'][12729]

In [None]:
# Solve aforementioned issue
print('Before split:',sales_data.shape[0])
# Pre-processing
sales_data = (sales_data.set_index(['order_num', 'order_date', 'order_status', 'billing_city'])
   .apply(lambda x: x.str.split('؟-').explode())
   .reset_index())
print('After split: ',sales_data.shape[0])

# TODO
* Solve city name problem: Shorkot
 * Shorkot
 * Shorkot Cantt
 * Shorkot Cantt.

***I am looking for a solution to this problem: will update soon.***
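
One possible approach, sketched under assumptions: normalize each name (upper case, punctuation removed, whitespace collapsed) and then map known variants to a canonical name through a hand-maintained alias table. The helper names `normalize_city` and `canonical_city` and the `CITY_ALIASES` table are hypothetical, and whether "Shorkot Cantt" should be merged into "Shorkot" is a judgment call.

```python
import re

def normalize_city(name: str) -> str:
    """Upper-case, replace punctuation with spaces, collapse whitespace."""
    # In Python 3, \w matches Unicode letters and digits, not only ASCII.
    cleaned = re.sub(r"[^\w\s]", " ", name.upper())
    return re.sub(r"\s+", " ", cleaned).strip()

# Hypothetical alias table: extend it by hand as more variants are found.
CITY_ALIASES = {"SHORKOT CANTT": "SHORKOT"}

def canonical_city(name: str) -> str:
    """Normalize a city name, then map known variants to one spelling."""
    key = normalize_city(name)
    return CITY_ALIASES.get(key, key)

variants = ["Shorkot", " shorkot cantt", "Shorkot Cantt."]
print([canonical_city(v) for v in variants])
```

On the data frame this could be applied with `sales_data['billing_city'].map(canonical_city)`.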

Now we are done with pre-processing and our data is in good shape for EDA.

# EDA

**Task 1. What is the best-selling book?**


In [None]:
# Set the value of N to return the top-N
N = 5
# Best-selling book
topn_best_sell = sales_data[sales_data['order_status'] == 'COMPLETED']['book_name'].value_counts(ascending=False).nlargest(N).to_frame()
print('The best-selling book is %s with %d sales'%(topn_best_sell.index[0], topn_best_sell.iloc[0, 0]))

It is important to notice here that we have dropped the billing_city entries with (????). Moreover we are selecting the best-selling based on COMPLETED order_status.

In [None]:
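# Plot Top-N cities which ordered the best-selling book
# topn_best_selling_city is computed here so that this cell runs in order (a later cell computes it again).
topn_best_selling_city = sales_data[(sales_data['book_name'] == topn_best_sell.index[0]) & (sales_data['order_status'] == 'COMPLETED')]['billing_city'].value_counts().nlargest(N).to_frame()
cmap = [['C%d'%(d) for d in range(N)]]
ax = topn_best_selling_city.plot.bar(figsize=(12,8), width=0.8, color=cmap, legend=False,title='Top-%d cities which ordered best selling %s book'%(N, topn_best_sell.index[0]))
ax.set_xlabel("Billing Cities")
ax.set_ylabel("Number of Orders")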

**Task 2. Visualize order status frequency.**

In [None]:
print(sales_data.order_status.value_counts())
# Keep the returned Axes so the labels are set on this plot rather than on the previous one.
ax = sales_data.order_status.value_counts().to_frame().plot.bar(figsize=(12,8), width=0.5, legend=False, color=[['C0','C1','C2']], title='Order Status Frequency')
ax.set_xlabel("Order Status")
ax.set_ylabel("Number of Orders")

**Task 3. Find a correlation between date and time with order status.**
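
One possible starting point for this task, sketched on made-up rows: parse `order_date`, extract the hour, and cross-tabulate it against `order_status`. The timestamp format and the values below are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values and date format).
toy = pd.DataFrame({
    "order_date": ["2020-10-03 09:15", "2020-10-03 21:40",
                   "2020-10-04 10:05", "2020-10-04 22:30"],
    "order_status": ["COMPLETED", "RETURNED", "COMPLETED", "CANCELED"],
})

# Unparseable timestamps become NaT instead of raising an error.
toy["order_date"] = pd.to_datetime(toy["order_date"], errors="coerce")
toy["hour"] = toy["order_date"].dt.hour

# Share of each status within every hour of the day (each row sums to 1).
status_by_hour = pd.crosstab(toy["hour"], toy["order_status"], normalize="index")
print(status_by_hour)
```

The same cross-tabulation can be built on `dt.day_name()` or `dt.month` to look for weekday or seasonal effects.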

**Task 4. Find a correlation between city and order status.**
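
Both columns are categorical, so a plain correlation coefficient does not apply directly. A minimal sketch of one alternative, assuming a contingency table plus Cramér's V as the association measure (computed by hand with NumPy, on made-up rows):

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values).
toy = pd.DataFrame({
    "billing_city": ["LAHORE"] * 4 + ["KARACHI"] * 4,
    "order_status": ["COMPLETED", "COMPLETED", "COMPLETED", "RETURNED",
                     "COMPLETED", "RETURNED", "RETURNED", "RETURNED"],
})

# Contingency table: rows are cities, columns are order statuses.
counts = pd.crosstab(toy["billing_city"], toy["order_status"])

# Chi-square statistic: compare observed counts with the counts expected
# if city and status were independent.
observed = counts.to_numpy(dtype=float)
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2 = ((observed - expected) ** 2 / expected).sum()

# Cramér's V rescales chi-square to the range [0, 1].
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))
print(counts)
print(round(cramers_v, 3))
```

With the real data, cities that have only a handful of orders make the expected counts very small, so it may be sensible to restrict the table to cities above some minimum number of orders first.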

**Task 5. Find any hidden patterns that are counter-intuitive for a layman.**

In [None]:
# Some more insights into data
print('The best selling book %s has %d returned orders.'%(topn_best_sell.index[0], sales_data[(sales_data['order_status'] == 'RETURNED') & (sales_data['book_name'] ==  topn_best_sell.index[0])].shape[0]))
print('The best selling book %s has %d cancelled orders.'%(topn_best_sell.index[0], sales_data[(sales_data['order_status'] == 'CANCELED') & (sales_data['book_name'] ==  topn_best_sell.index[0])].shape[0]))
print('\n')
# Most returned book
topn_most_returned = sales_data[sales_data['order_status'] == 'RETURNED']['book_name'].value_counts(ascending = False).nlargest(N).to_frame()
print('The most returned book is %s with %d returns.'%(topn_most_returned.index[0], topn_most_returned.iloc[0, 0]))
print('\n')
# Book with most cancelled orders
topn_most_cancelled = sales_data[sales_data['order_status'] == 'CANCELED']['book_name'].value_counts(ascending=False).nlargest(N).to_frame()
print('The book having most cancelled orders is %s with %d cancelled orders.'%(topn_most_cancelled.index[0], topn_most_cancelled.iloc[0, 0]))



In [None]:
# Some more insights on best-selling book
# Find city with most completed orders of best-selling book.
topn_best_selling_city = sales_data[(sales_data['book_name'] == topn_best_sell.index[0]) & (sales_data['order_status'] == 'COMPLETED')]['billing_city'].value_counts().nlargest(N).to_frame()
print('Best-Selling book (%s) has: \nmost COMPLETED orders from: %s (%d orders)'%(topn_best_sell.index[0],topn_best_selling_city.index[0], topn_best_selling_city.iloc[0, 0]))

# Find city with most returned orders of best-selling book
topn_most_returned_city = sales_data[(sales_data['book_name'] == topn_best_sell.index[0]) & (sales_data['order_status'] == 'RETURNED')]['billing_city'].value_counts().nlargest(N).to_frame()
print('most RETURNED orders from: %s (%d orders)'%(topn_most_returned_city.index[0], topn_most_returned_city.iloc[0, 0]))

# Find city with most cancelled orders of best-selling book
topn_most_cancelled_city = sales_data[(sales_data['book_name'] == topn_best_sell.index[0]) & (sales_data['order_status'] == 'CANCELED')]['billing_city'].value_counts().nlargest(N).to_frame()
print('most CANCELLED orders from: %s (%d orders)'%(topn_most_cancelled_city.index[0], topn_most_cancelled_city.iloc[0, 0]))

In [None]:
# Plot Top-N best selling books
cmap = [['C%d'%(d) for d in range(N)]]
ax = topn_best_sell.plot.bar(figsize=(12,8), width=0.8, color=cmap, legend=False,title='Top-%d selling books: COMPLETED order_status'%(N))
ax.set_xlabel("Book Name")
ax.set_ylabel("Number of Orders")

In [None]:
# Plot Top-N returned books
cmap = [['C%d'%(d) for d in range(N)]]
ax = topn_most_returned.plot.bar(figsize=(12,8), width=0.8, color=cmap, legend=False,title='Top-%d RETURNED order_status books'%(N))
ax.set_xlabel("Book Name")
ax.set_ylabel("Number of Orders")

In [None]:
# Plot Top-N cancelled order books
cmap = [['C%d'%(d) for d in range(N)]]
ax = topn_most_cancelled.plot.bar(figsize=(12,8), width=0.8, color=cmap, legend=False,title='Top-%d CANCELLED order_status books'%(N))
ax.set_xlabel("Book Name")
ax.set_ylabel("Number of Orders")

**Task 6. Can we predict number of orders, or book names in advance?**
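
As a first step toward this task, one could aggregate orders per day and compare any model against a naive baseline. The sketch below uses made-up timestamps and a simple moving average; the `window` parameter and the values are assumptions, and this is a baseline rather than a validated forecast.

```python
import pandas as pd

# Toy orders (hypothetical values): one row per order.
orders = pd.DataFrame({
    "order_num": [1, 2, 3, 4, 5, 6],
    "order_date": pd.to_datetime([
        "2020-10-01", "2020-10-01", "2020-10-02",
        "2020-10-03", "2020-10-03", "2020-10-03",
    ]),
})

# Number of orders per calendar day; days without orders are counted as 0.
daily = orders.set_index("order_date")["order_num"].resample("D").count()

# Naive baseline: predict the next day as the mean of the last `window` days.
window = 3
forecast = daily.tail(window).mean()
print(daily.tolist(), forecast)
```

Any more elaborate model should be evaluated on held-out dates and should at least beat this baseline before its predictions are trusted.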