# Dataset Overview
The dataset contains sales records of *Gufhtugu Publications* (GP) from January 2019 to January 2021.  
GP is an emerging startup in Pakistan's e-commerce market.  

## In this report:
We will try to answer the following questions.
1. What are the top 10 best-selling books?
1. Which cities have the most customers?
1. What is the best seller in each of the top-selling cities?

In particular, we will:
- Split orders with multiple books
- Standardize language of book titles
- Extract city name from detailed address


In [None]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# plotting and coding utils
plt.style.use('fivethirtyeight')
%config IPCompleter.use_jedi = False
%matplotlib inline
plt.rcParams["figure.figsize"] = (12, 8)

In [None]:
df = pd.read_csv('/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv', parse_dates= ['Order Date & Time'], encoding = 'utf-8')
df.head()

In [None]:

df.columns = [c.lower().replace(" ","_") for c in df.columns]
dups = df.duplicated().sum()
rows, cols = df.shape
nans = df.isna().sum()
print(f"Dataset includes {rows} rows and {cols} feature columns.")
print(f"Among these records, {dups} are duplicated.\n")
for c,n in nans.items():
    if n>0:
        print(f"Column {c} contains {n} missing values")


**Only a few records have missing values.  
Since we are not interested in payment_method for now, we will drop only the rows with missing values in the 'city' and 'book_name' columns.**
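
As a minimal sketch of this step on made-up data (the toy frame and its values are hypothetical, not taken from the dataset), `dropna` with `subset` removes only the rows that are missing a value in the named columns and leaves gaps in other columns alone:

```python
import pandas as pd

# Hypothetical toy orders; values are invented for illustration only.
toy = pd.DataFrame({
    "book_name": ["python programming", None, "data science", "machine learning"],
    "city": ["Lahore", "Karachi", None, "Multan"],
    "payment_method": ["cod", "cod", "cod", None],
})

# Drop a row only when book_name or city is missing.
toy_clean = toy.dropna(subset=["book_name", "city"])

# Count what is still missing per column after the drop.
remaining_nans = toy_clean.isna().sum()
```

Here the row with a missing `payment_method` survives, which matches the intent of keeping records that are incomplete only in columns we do not analyse.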

In [None]:
df.dropna(subset = ["book_name","city"], inplace= True)
df.isna().sum()

In [None]:
# Type casting
df["book_name"] = df.book_name.astype(str)
df["city"] = df.city.astype(str)
df.dtypes

# Top selling books
**We are given a list of sold books in string format, where each book title is separated by '/'.**


In [None]:
pd.set_option('max_colwidth', 100)
df.book_name.sample(10)

Books and orders have a many-to-many relationship. We will transform the dataset so that each row contains exactly one book.  
In other words, orders with multiple items are split into multiple rows (along axis 0).
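
The idea can be sketched on a tiny hypothetical frame (the order numbers and titles are invented for illustration). This sketch uses `str.split` followed by `DataFrame.explode`, which is one possible way to get one book per row:

```python
import pandas as pd

# Hypothetical toy orders; titles are joined by '/' as in the dataset.
toy = pd.DataFrame({
    "order_number": [101, 102],
    "book_name": ["python programming/data science", "machine learning"],
})

# Split each title string into a list, then give every list element its own row.
toy_books = (
    toy.assign(book_name=toy.book_name.str.split("/"))
       .explode("book_name")
       .reset_index(drop=True)
)
```

Order 101 now appears twice, once per book, while order 102 stays on a single row.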

In [None]:
# Split each order on '/' and give every book its own row, keyed by order number
books_df = df[["order_number", "book_name"]].copy()
books_df["book_name"] = books_df.book_name.str.split("/")
books_df = books_df.explode("book_name").reset_index(drop=True)
books_df.columns = ["order_id", "book_name"]
books_df["book_name"] = books_df.book_name.str.lower()
books_df.head()

## Books with titles in both English and Urdu
Books whose titles are written only in Urdu are:

In [None]:
# True for titles that contain at least one Latin letter
mask = books_df.book_name.str.contains("[a-zA-Z]")
books_df[~mask].book_name.value_counts()

Instead of renaming all these records, we will rename only those titles which are good candidates for top-10 best sellers.  
First, we find the 30 top selling books, and then we manually merge their names into one standard language.
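
A small sketch of this merging step, assuming a toy series of titles (the series is hypothetical; the Urdu/English pair is one of the entries from this notebook's renaming dictionary). Titles without any Latin letter are flagged, then mapped onto a single standard title before counting:

```python
import pandas as pd

# Hypothetical toy titles, already lower-cased.
titles = pd.Series(["data science", "ڈیٹا سائنس", "machine learning", "ڈیٹا سائنس"])

# Titles without any Latin letter are treated as Urdu-only.
has_latin = titles.str.contains("[a-zA-Z]")
urdu_only = titles[~has_latin]

# Map the Urdu spelling onto one standard English title, then count sales.
standard = titles.replace({"ڈیٹا سائنس": "data science"})
counts = standard.value_counts()
```

Without the mapping, the same book would be counted under two different names and could drop out of the top 10.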

In [None]:
books_df.book_name.value_counts()[:30]

In [None]:
# for the top 30, look up the matching English titles
mask = books_df.book_name.str.contains("Internet|Data|Machine|algo|bit", case=False)
books_df[mask].book_name

In [None]:
renaming_dict = {
    "انٹرنیٹ سے پیسہ کمائیں": "earn from internet",
    "ڈیٹا سائنس": "data science",
    "مشین لرننگ": "machine learning",
    "ڈیٹا سائنس ۔ ایک تعارف": "data science",
    "ایک تھا الگورتھم": "ek tha algorithm",
    "انٹرنیٹ سے پیسہ کمائیں؟- مستحقین زکواة": "earn from internet",
    "(c++)": "cpp",
    "سی": 'cpp',
    "(c++) ++سی":"cpp",
    "Bit Coin Block Chain Aur Crypto Currency بٹ کوائن، بلاک چین اور کرپٹو کرنسی":"blockchain, cryptocurrency and bitcoin",
    'بلاک چین اور کرپٹو کرنسی': "blockchain, cryptocurrency and bitcoin",
    'python programming- release date: august 14, 2020': 'python programming'
}

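# book_name was lower-cased earlier, so lower-case the keys too; assign the result instead of replacing in place
books_df["book_name"] = books_df.book_name.replace({k.lower(): v for k, v in renaming_dict.items()})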
books_df["book_name"] = books_df.book_name.str.title()

print("top-10 best sellers:")
top_books = books_df.book_name.value_counts()[:10].reset_index()
top_books.columns = ["book_name", "total_sales"]
top_books = top_books.sort_values("total_sales", ascending = False)
top_books

In [None]:
plt.figure(figsize= (14,8))
sns.barplot(y = top_books.book_name, x= top_books.total_sales, palette='flare_r')

N = df.shape[0]
i = 0
for _, v in top_books.total_sales.items():
    plt.text(v + 15, i - .1, f"{v/N *100:.1f}%", color='#555c63', fontweight='bold', fontsize = 12, va = 'center')
    i += 1
    
plt.title("Top 10 Bestsellers", fontdict={'fontsize': "30", "fontweight":"heavy"}, loc = 'left')
plt.grid(True, axis = 'x')
plt.yticks(fontsize= 18)
plt.xticks(fontsize= 14)

plt.xlabel("Sales", fontdict={'fontsize': "20"})
plt.ylabel("")
plt.text(-1200,10,"Book Title", fontdict= {"fontsize":16,"fontweight":"heavy"})
plt.show()

**Remarks**
- During the 2020 pandemic, people were forced to reevaluate how they earn a living. It is no wonder that many tried to learn about earning online.
- Python, Data Science, AI, and Blockchain have become core topics in information technology. They are likely to stay at the top in the coming years.

# Top selling cities
To find the cities with the most orders, we use the original dataset.  
Detailed addresses are reduced to the city name.
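
The reduction can be sketched on a few hypothetical address strings (all values are invented for illustration). The most frequent values are assumed to be clean city names, and any entry that mentions one of them is replaced by that name:

```python
import pandas as pd

# Hypothetical toy values of the city column.
toy_city = pd.Series([
    "Karachi", "Karachi", "Lahore", "Lahore",
    "House 12, Gulshan, Karachi", "Dha Phase 5 Lahore",
])

# Take the most frequent values as the list of known city names.
known = toy_city.value_counts()[:2].index

# Any entry that mentions a known city is replaced by that city name alone.
for name in known:
    mentions = toy_city.str.contains(name, case=False, regex=False)
    toy_city.loc[mentions] = name

city_counts = toy_city.value_counts().to_dict()
```

Plain substring matching is a simplification: an address that mentions two city names ends up with whichever name is processed last, so the result is best treated as a first pass.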

In [None]:
df["city"] = df.city.str.title()
# 100 most frequent values in the city column
cities = df.city.value_counts()[:100].index
for city in cities:
    # Rows holding a detailed address are replaced with the city name only
    mask = df.city.str.contains(city, case=False, regex=False)
    df.loc[mask, "city"] = city
    
top_cities = df.city.value_counts()[:10]
print("Top 10 selling cities:\n" , top_cities, sep="\n")

In [None]:
sns.barplot(y= top_cities.index, x = top_cities, palette='flare_r')

N = df.shape[0]
i = 0
for _, v in top_cities.items():
    plt.text(v + 15, i - .1, f"{v/N *100:.1f}%", color='#555c63', fontweight='bold', va = 'center')
    i += 1

plt.title("Cities with most sales", fontdict={'fontsize': "30","fontweight":"heavy"}, loc = 'left')
plt.grid(True, axis = 'x')

plt.yticks(fontsize= 18)
plt.xticks(fontsize= 14)

plt.xlabel("Sales",fontdict={'fontsize': "20"})
plt.ylabel("")
plt.show()

The distribution of orders appears to follow city population: the largest cities place the most orders.  
About 50% of all orders come from these top-10 cities. Karachi and Lahore alone account for 28% of the total.
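
Shares like these are plain ratios of order counts to the total. A minimal sketch with invented counts (the numbers below are hypothetical and are not the dataset's):

```python
import pandas as pd

# Hypothetical order counts per city; the numbers are invented for illustration only.
orders = pd.Series({"Karachi": 40, "Lahore": 30, "Multan": 20, "Quetta": 10})

total = orders.sum()
share = orders / total * 100          # percentage of all orders per city
top2_share = share.nlargest(2).sum()  # combined share of the two largest cities
```

On the real data the same ratio would be taken against the total number of orders in `df`, for example `top_cities.sum() / len(df)` for the top-10 share.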

# Best seller in major cities
We would like to find out the best seller in the major cities of Pakistan.  
To extract these statistics, we will merge the books dataframe with the source dataframe.  
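
Once every row holds one book and one city, the best seller per city is a count per (city, book) pair followed by a per-city maximum. A sketch on hypothetical rows (cities and titles are invented for illustration):

```python
import pandas as pd

# Hypothetical one-book-per-row sales.
sales = pd.DataFrame({
    "city": ["Lahore", "Lahore", "Lahore", "Karachi", "Karachi", "Karachi"],
    "book_name": ["Python", "Python", "Data Science",
                  "Data Science", "Data Science", "Python"],
})

# Count sales for every (city, book) pair.
counts = sales.groupby(["city", "book_name"]).size().reset_index(name="sales")

# For each city, keep the row with the highest count.
best = counts.loc[counts.groupby("city")["sales"].idxmax()].set_index("city")
```

Selecting the `sales` column before calling `idxmax` keeps the comparison on the numeric counts rather than on the title strings.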

In [None]:
# merge two dataframes on order id
cols = ["order_id", "book_name_x", "city"]
city_book_df = pd.merge(books_df, df, left_on= 'order_id', right_on= 'order_number')[cols]
city_book_df.rename(columns={"book_name_x":"book_name"}, inplace=True)

# only top 10 cities
mask = city_book_df.city.isin(top_cities.index)
# group by to find number of sales for each city and each book
group_cols = ["city","book_name"]
grouped = city_book_df.loc[mask].groupby(group_cols)
tmp = grouped.count().reset_index()
tmp.rename(columns={"order_id":"sales"}, inplace = True)
# find best seller for each city
top_bookPerCity = tmp.loc[tmp.groupby("city")["sales"].idxmax()]
top_bookPerCity = top_bookPerCity.sort_values(by='sales', ascending=False)
top_bookPerCity

In [None]:
ax = sns.barplot(x= 'city', y = 'sales', hue='book_name', data = top_bookPerCity, dodge=False)

plt.title("Top selling book per city", fontdict={'fontsize': "30"})
plt.yticks(fontsize= 18)
plt.xticks(fontsize= 14, ha = 'center')
plt.legend(loc = 'upper right', prop = {'size':15})
plt.xlabel("City",fontdict={'fontsize': "24"})
plt.ylabel("Sales", fontdict={'fontsize': "20"})


ticks_and_labels = plt.xticks(range(len(top_bookPerCity)), top_bookPerCity.city, rotation=0)
for i, label in enumerate(ticks_and_labels[1]):
    label.set_y(label.get_position()[1] - (i % 2) * 0.05)
    
    
plt.show()

The book on earning money online is the best seller in 9 of the 10 major cities.  
Sialkot is known worldwide for its exports, which may explain why customers there bought 'Product Management' the most.

## Upcoming
Even after extracting city names from detailed addresses, a large number of distinct city values remain.  In the next step, we will clean the city attribute in more detail.  


# Any comments are most welcome.  
## If you liked the analysis, please upvote ☝ to show your support 👍