
In [1]:
# Data Set Information:

# This is a transnational data set containing all the transactions that occurred between 01/12/2010 and 09/12/2011 for a
# UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts.
# Many of the company's customers are wholesalers.


# Attribute Information:

# 1.	InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
# 2.	StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
# 3.	Description: Product (item) name. Nominal.
# 4.	Quantity: The quantities of each product (item) per transaction. Numeric.
# 5.	InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
# 6.	UnitPrice: Unit price. Numeric, Product price per unit in sterling.
# 7.	CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
# 8.	Country: Country name. Nominal, the name of the country where each customer resides.

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv(r"C:\Users\Rishi\Downloads\OnlineRetail.csv", encoding='latin1')
# If reading with the default encoding raises a UnicodeDecodeError, pass encoding='latin1';
# encoding='unicode_escape' is an alternative.
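The encoding note above can be sketched as a small fallback loop. This is only an illustration on in-memory bytes: `read_with_fallback` and the sample data are hypothetical names made up for the example, not part of the notebook or of pandas.

```python
import io

import pandas as pd

# Hypothetical sample: byte 0xA3 is the pound sign in Latin-1 and is not valid UTF-8 here.
raw = "Description,UnitPrice\nGIFT \xa3 TAG,1.25\n".encode("latin1")


def read_with_fallback(data, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return (DataFrame, encoding that worked)."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


sample, used = read_with_fallback(raw)
print(used)
print(sample)
```

Latin-1 maps every byte to a character, so it never fails to decode; that also means it can silently produce wrong characters if the file was actually written in another encoding.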

In [4]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [7]:
# Amount of each invoice line: quantity times unit price
tot = df['Quantity'] * df['UnitPrice']

In [8]:
tot.mean()

17.98779487699964

In [10]:
# 1. What is the unique count of the invoice numbers?
# A.34000
# B.26200
# C.25700
# D.25900

# unq_invc_cnt = df['InvoiceNo'].nunique()
# print(unq_invc_cnt)
len(df['InvoiceNo'].unique())

25900
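The two ways of counting used in this cell, `len(...unique())` and `nunique()`, agree here because `InvoiceNo` has no missing values. A minimal sketch on made-up data shows where they differ, and how cancelled invoices could be separated out, assuming cancellations are marked by a leading "c"/"C" as the attribute description says. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical invoice numbers: one repeated, one cancellation, one missing.
toy = pd.DataFrame({"InvoiceNo": ["536365", "536365", "C536379", "536380", np.nan]})

n_with_nan = len(toy["InvoiceNo"].unique())  # unique() keeps NaN as a value
n_without_nan = toy["InvoiceNo"].nunique()   # nunique() drops NaN by default

# Cancellations: invoice numbers starting with the letter C (case-insensitive).
is_cancel = toy["InvoiceNo"].str.upper().str.startswith("C", na=False)
n_cancelled = toy.loc[is_cancel, "InvoiceNo"].nunique()

print(n_with_nan, n_without_nan, n_cancelled)
```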


In [11]:
# 2. How many people bought “PARTY BUNTING”?
# A.1727
# B.2343
# C.2159
# D.2200
# This counts invoice lines with this description, which is what the options reflect.
# For distinct buyers, use:
# df.loc[df['Description'] == 'PARTY BUNTING', 'CustomerID'].nunique()
(df['Description'] == 'PARTY BUNTING').sum()

1727

In [12]:
# 3. How many people bought “WHITE HANGING HEART T-LIGHT HOLDER”?
# A.2387
# B.2364
# C.2369
# D.2360
WHH = df[df['Description'] == 'WHITE HANGING HEART T-LIGHT HOLDER'].shape[0]
print(WHH)

2369
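Questions 2 and 3 ask about "people", but the code counts matching invoice lines. The difference can be sketched on a few made-up rows; the `toy` frame is hypothetical and only illustrates the two counts.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: customer 1 buys the item twice, one purchase has no CustomerID.
toy = pd.DataFrame({
    "Description": ["PARTY BUNTING", "PARTY BUNTING", "PARTY BUNTING",
                    "PARTY BUNTING", "WHITE METAL LANTERN"],
    "CustomerID": [1.0, 1.0, 2.0, np.nan, 3.0],
})

mask = toy["Description"] == "PARTY BUNTING"
line_count = int(mask.sum())                                # invoice lines with the item
distinct_buyers = toy.loc[mask, "CustomerID"].nunique()     # unique known customers

print(line_count, distinct_buyers)
```

Rows with a missing `CustomerID` count as lines but cannot be attributed to a buyer, so the two numbers can differ considerably on the real data.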

In [13]:
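# 4. How many unique people are from “UNITED KINGDOM”?
# A.342679
# B.361878
# C.361874
# D.361468

# Distinct CustomerIDs among the UK rows. The result matches none of the options,
# which are on the scale of row counts rather than customer counts.
uk_customers = df[df['Country'] == "United Kingdom"]['CustomerID'].nunique()
print(uk_customers)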

3950


In [10]:
df.head(2)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [11]:
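# 5. Which customer (CustomerID) has the highest transaction total in the dataset?
# A.14578
# B.14567    
# C.14646
# D.14687

# Caution: this sums UnitPrice per customer and ignores Quantity, so the printed ID
# is not among the options. Ranking by amount spent would instead use
# (df['Quantity'] * df['UnitPrice']).groupby(df['CustomerID']).sum().idxmax()
high_trans = df.groupby('CustomerID')['UnitPrice'].sum().idxmax()
print(high_trans)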

14096.0
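Summing `UnitPrice` alone and summing `Quantity * UnitPrice` can rank customers differently. A minimal sketch on made-up data, assuming "highest transaction" means the largest total amount spent; the `toy` frame and the `Amount` column are hypothetical.

```python
import pandas as pd

# Hypothetical data: customer 1 buys a few expensive items,
# customer 2 buys cheap items in bulk.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 2.0, 3.0],
    "Quantity": [1, 1, 1000, 500, 2],
    "UnitPrice": [50.0, 60.0, 0.5, 0.4, 10.0],
})

# Ranking by summed unit price ignores how many units were bought.
by_unit_price = toy.groupby("CustomerID")["UnitPrice"].sum().idxmax()

# Ranking by amount spent uses quantity times unit price per line.
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]
by_amount = toy.groupby("CustomerID")["Amount"].sum().idxmax()

print(by_unit_price, by_amount)
```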


In [12]:
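# 6. What is the average amount of transactions?
# A.19
# B.17.98
# C.34.78
# D.none of the above
# The amount of a line is Quantity * UnitPrice; averaging UnitPrice alone gives 4.61.
# This is the same value as tot.mean() above (17.9878), which corresponds to option B.
avg_transaction = (df['Quantity'] * df['UnitPrice']).mean()
print(avg_transaction.round(2))

17.99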


In [13]:
# 7. What is the most frequently bought item in the dataset?
# A.REGENCY CAKESTAND 3 TIER
# B. JUMBO BAG RED RETROSPOT
# C. PARTY BUNTING
# D.WHITE HANGING HEART T-LIGHT HOLDER
freq_bought = df['Description'].value_counts().idxmax()
print(freq_bought)

WHITE HANGING HEART T-LIGHT HOLDER


In [14]:
# 8. Create a DataFrame with the unique description of each item, the count of the item,
# and the total amount paid for the item. Then get the least purchased item.
# A.FILIGREE DIAMANTE CHAIN
# B.GLASS BELL JAR SMALL
# C.GLASS BELL JAR LARGE
# D.All the above
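Question 8 has no code in the notebook. One possible approach is sketched below on made-up data, assuming "count" means the number of invoice lines per description and "total amount" means the sum of `Quantity * UnitPrice`. The `toy` frame and the `Amount`, `Count` and `TotalAmount` names are hypothetical.

```python
import pandas as pd

# Hypothetical invoice lines.
toy = pd.DataFrame({
    "Description": ["BELL JAR", "BUNTING", "BUNTING", "CAKESTAND",
                    "CAKESTAND", "CAKESTAND", "CHAIN"],
    "Quantity": [1, 4, 2, 1, 2, 1, 3],
    "UnitPrice": [5.0, 4.95, 4.95, 12.75, 12.75, 10.95, 1.5],
})
toy["Amount"] = toy["Quantity"] * toy["UnitPrice"]

# One row per description: number of lines and total amount paid.
item_summary = (
    toy.groupby("Description")
       .agg(Count=("Quantity", "size"), TotalAmount=("Amount", "sum"))
       .reset_index()
)

# Several items can tie for the lowest count, so keep all of them.
min_count = item_summary["Count"].min()
least_purchased = item_summary.loc[
    item_summary["Count"] == min_count, "Description"
].tolist()

print(item_summary)
print(least_purchased)
```

Keeping every item that shares the minimum count matters here: if several descriptions appear only once, a single `idxmin()` would report just one of them.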

In [15]:
# 9. Which algorithms are used for solving classification problems?
# A.K-Means Clustering
# B.Decision Tree         ans
# C.SVD Algorithm
# D.Linear Regression


In [16]:
# 10. How should outliers be handled if the column is numeric?
# A.Replace Outliers by mode  
# B. Replace Outliers by median                ans
# C.Replace Outliers by variance
# D.All of the Above
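Replacing outliers by the median can be sketched as follows. The 1.5 × IQR rule used to flag outliers is an assumption made for this example (the question does not say how outliers are detected), and the values are made up.

```python
import pandas as pd

# Hypothetical numeric column with one extreme value.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0])

# Flag values outside 1.5 * IQR from the quartiles (a common convention, assumed here).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Replace flagged values by the median, which is itself robust to the outlier.
cleaned = values.mask(is_outlier, values.median())

print(cleaned.tolist())
```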

In [17]:
# 11. Which of the following is also called the Normal Distribution?
# A.Bernoulli Distribution
# B.Poisson Distribution
# C.Gaussian Distribution       ans
# D.Exponential Distribution

In [18]:
# 12. Which of the following metrics are used to evaluate classification models?
# A. Area under the ROC curve
# B. F1 score
# C. Confusion matrix
# D. All of the above       ans


In [19]:
# 13. Which function is used for calculating the precision of the model?
# A.  accuracy_score()
# B. confusion_matrix()
# C.classification_report()                 ans
# D.None of the above
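For reference, precision is TP / (TP + FP). The sketch below computes it by hand on made-up labels; the `precision` helper is written for this example only. In scikit-learn, `classification_report` prints precision per class, and `precision_score` in `sklearn.metrics` returns it directly.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical labels: 4 predicted positives, 3 of them correct.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

prec = precision(y_true, y_pred)
print(prec)
```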

In [20]:
# 14. Which metrics are used to measure the performance of a regression model?
# A.adjusted r2_score()
# B.map
# C. rmse value            ans
# D.All of the above
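Two of the metrics named in the options, RMSE and adjusted R², can be sketched with NumPy as below. The helper functions and the sample values are made up for illustration; `n_features` is the number of predictors in the fitted model.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def adjusted_r2(y_true, y_pred, n_features):
    """R² penalised for the number of predictors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1))


# Hypothetical targets and predictions from a one-feature model.
y_true = [3, 5, 7, 9, 11]
y_pred = [2, 5, 8, 9, 11]

model_rmse = rmse(y_true, y_pred)
model_adj_r2 = adjusted_r2(y_true, y_pred, n_features=1)
print(model_rmse, model_adj_r2)
```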

In [21]:
# 15. Random forest model uses:
# A.Bagging                     ans
# B.Boosting
# C.Both A and B
# D.none of the above