## Table of Contents

**[1. Introduction](#introduction)**

**[2. Exploring data (Data preprocessing)](#body1)**

  * [2.1. How many rows and columns are there?](#subbody1)
  * [2.2. What is the meaning of each column?](#subbody2)
  * [2.3. What is the meaning of each row?](#subbody3)
  * [2.4. Are there any duplicated rows?](#subbody4)
  * [2.5. What is the current data type of each column?](#subbody5)
  * [2.6. Distribution of numerical attributes](#subbody6)
  * [2.7. Distribution of categorical attributes](#subbody7)
  
**[3. Ask meaningful questions](#body2)**
  * [**Question 1:** What makes a category become a best-selling category?](#subbody3-1)
  * [**Question 2:** Do readers prefer foreign authors over Vietnamese authors? In which categories do foreign authors outnumber Vietnamese authors, and vice versa?](#subbody3-2)
  * [**Question 3:** Given the title of a book, how can we recommend 5 related books that the user would enjoy?](#subbody3-3)
  * [**Question 4:** Using both positive and negative user feedback, how can we determine what users like and dislike about the service and its books?](#subbody3-4)

**[4. Answer questions](#body3)**
  
**[5. Reflection](#body4)**
  * [5.1. Trần Quang An Quốc](#subbody4-1)
  * [5.2. Đỗ Đạt Thành](#subbody4-2)
  
**[6. References](#reference)**

# 1. Data collection <a name="introduction"></a>

# 2. Exploring data (Data preprocessing) <a name="body1"></a>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 

In [2]:
df = pd.read_csv('shopping_trends_updated.csv')

In [3]:
df.head()

| | Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Payment Method | Frequency of Purchases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Express | Yes | Yes | 14 | Venmo | Fortnightly |
| 1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Express | Yes | Yes | 2 | Cash | Fortnightly |
| 2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly |
| 3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | Next Day Air | Yes | Yes | 49 | PayPal | Weekly |
| 4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Free Shipping | Yes | Yes | 31 | PayPal | Annually |


In [10]:
df.dtypes

Customer ID                 int64
Age                         int64
Gender                     object
Item Purchased             object
Category                   object
Purchase Amount (USD)       int64
Location                   object
Size                       object
Color                      object
Season                     object
Review Rating             float64
Subscription Status        object
Shipping Type              object
Discount Applied           object
Promo Code Used            object
Previous Purchases          int64
Payment Method             object
Frequency of Purchases     object
dtype: object
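
Several of the `object` columns listed above hold Yes/No flags or a small set of labels (see the `df.head()` preview). The sketch below shows one possible way to give such columns more specific types. It runs on a small hypothetical frame named `toy`, not on the real dataset, and the choice of which columns to convert is an assumption rather than a step from the original notebook.

```python
import pandas as pd

# Hypothetical toy data that mimics two of the object columns.
toy = pd.DataFrame({
    "Subscription Status": ["Yes", "No", "Yes"],
    "Season": ["Winter", "Spring", "Winter"],
})

# Map Yes/No strings to booleans so they can be summed or averaged.
toy["Subscription Status"] = (
    toy["Subscription Status"].map({"Yes": True, "No": False}).astype(bool)
)

# Store repeated labels as a categorical column.
toy["Season"] = toy["Season"].astype("category")

print(toy.dtypes)
```

Boolean columns make rates easy to compute (for example with `.mean()`), and categorical columns usually take less memory than plain strings when the labels repeat often.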

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             3900 non-null   int64  
 1   Age                     3900 non-null   int64  
 2   Gender                  3900 non-null   object 
 3   Item Purchased          3900 non-null   object 
 4   Category                3900 non-null   object 
 5   Purchase Amount (USD)   3900 non-null   int64  
 6   Location                3900 non-null   object 
 7   Size                    3900 non-null   object 
 8   Color                   3900 non-null   object 
 9   Season                  3900 non-null   object 
 10  Review Rating           3900 non-null   float64
 11  Subscription Status     3900 non-null   object 
 12  Shipping Type           3900 non-null   object 
 13  Discount Applied        3900 non-null   object 
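 14  Promo Code Used         3900 non-null   object 
 15  Previous Purchases      3900 non-null   int64  
 16  Payment Method          3900 non-null   object 
 17  Frequency of Purchases  3900 non-null   object 
dtypes: float64(1), int64(4), object(13)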

In [26]:
df.describe()

| | Customer ID | Age | Purchase Amount (USD) | Review Rating | Previous Purchases |
|---|---|---|---|---|---|
| count | 3900.0 | 3900.0 | 3900.0 | 3900.0 | 3900.0 |
| mean | 1950.5 | 44.068462 | 59.764359 | 3.749949 | 25.351538 |
| std | 1125.977353 | 15.207589 | 23.685392 | 0.716223 | 14.447125 |
| min | 1.0 | 18.0 | 20.0 | 2.5 | 1.0 |
| 25% | 975.75 | 31.0 | 39.0 | 3.1 | 13.0 |
| 50% | 1950.5 | 44.0 | 60.0 | 3.7 | 25.0 |
| 75% | 2925.25 | 57.0 | 81.0 | 4.4 | 38.0 |
| max | 3900.0 | 70.0 | 100.0 | 5.0 | 50.0 |


### 2.1 Are there any duplicated rows? <a name="subbody4"></a>

In [19]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
3895    False
3896    False
3897    False
3898    False
3899    False
Length: 3900, dtype: bool

In [22]:
df[df.duplicated()].count()

Customer ID               0
Age                       0
Gender                    0
Item Purchased            0
Category                  0
Purchase Amount (USD)     0
Location                  0
Size                      0
Color                     0
Season                    0
Review Rating             0
Subscription Status       0
Shipping Type             0
Discount Applied          0
Promo Code Used           0
Previous Purchases        0
Payment Method            0
Frequency of Purchases    0
dtype: int64
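
Every column count is 0, so the dataset contains no fully duplicated rows. As a shorter alternative, the boolean mask from `duplicated()` can be summed to get a single number. The sketch below uses a small hypothetical frame named `toy` (its values are made up for illustration and are not from the dataset) so that the duplicate case is visible.

```python
import pandas as pd

# Hypothetical toy data with one repeated row.
toy = pd.DataFrame({
    "Customer ID": [1, 2, 2, 3],
    "Category": ["Clothing", "Footwear", "Footwear", "Clothing"],
})

# duplicated() flags each repeat of an earlier row; sum() counts the flags.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicated rows: {n_duplicates}")
print(f"Rows after dropping duplicates: {len(deduplicated)}")
```

On the real data, `df.duplicated().sum()` should return 0, in line with the per-column counts above.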

### 2.2 Are there any null values in the columns?

In [24]:
df.isnull().sum()

Customer ID               0
Age                       0
Gender                    0
Item Purchased            0
Category                  0
Purchase Amount (USD)     0
Location                  0
Size                      0
Color                     0
Season                    0
Review Rating             0
Subscription Status       0
Shipping Type             0
Discount Applied          0
Promo Code Used           0
Previous Purchases        0
Payment Method            0
Frequency of Purchases    0
dtype: int64
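
No column has missing values, so no imputation is needed here. For datasets where that is not the case, the same `isnull()` mask can also report the share of missing values per column. The sketch below is illustrative only: the frame `toy` and its gaps are hypothetical and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Age": [55, np.nan, 50],
    "Review Rating": [3.1, 3.1, np.nan],
    "Gender": ["Male", "Male", None],
})

# Count of missing values per column.
missing_count = toy.isnull().sum()

# Percentage of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean() * 100

# Single flag: does the frame contain any missing value at all?
has_missing = bool(toy.isnull().values.any())

print(missing_count)
print(missing_share.round(2))
print(f"Any missing values: {has_missing}")
```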

### 2.3 How many rows and how many columns? <a name="subbody1"></a>

In [29]:
print(f'The number of rows is {df.shape[0]}')
print(f'The number of columns is {df.shape[1]}')

The number of rows is 3900
The number of columns is 18

