
# <b><u> Project Title : Identification of major customer segments on a transnational dataset. </u></b>

## <b> Problem Description </b>

### In this project, your task is to identify major customer segments in a transnational dataset that contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

## <b> Data Description </b>

### <b>Attribute Information: </b>

* ### InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
* ### StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
* ### Description: Product (item) name. Nominal.
* ### Quantity: The quantities of each product (item) per transaction. Numeric.
* ### InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
* ### UnitPrice: Unit price. Numeric, Product price per unit in sterling.
* ### CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* ### Country: Country name. Nominal, the name of the country where each customer resides.

# **Data Preparation**

## **Importing and Inspecting Dataset**

In [None]:
# Importing required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
%matplotlib inline
from scipy.cluster.hierarchy import dendrogram,linkage
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Defining url of saved csv file
url ='/content/drive/MyDrive/CAPSTONE PROJECT SUBMISSION FOLDER/CLUSTERING ANALYSIS ON UK -BASED INDUSTRIES - SUMIT YADAV/Online Retail.csv'

# Importing dataset to create a dataframe
df = pd.read_csv(url, encoding="unicode_escape", parse_dates=["InvoiceDate"])
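As a minimal sketch of what the load step does, the snippet below reads a hypothetical three-row extract from an in-memory string instead of the Drive file. The sample rows, the ISO date format, and the explicit `dtype` mapping are assumptions made for illustration and are not taken from the real CSV.

```python
import io

import pandas as pd

# Hypothetical three-row extract; the real file is far larger and its
# date format may differ from the ISO timestamps used here.
csv_text = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123,GIFT BOX,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
C536379,22960,JAM JAR,-1,2010-12-01 09:41:00,27.50,14527,United Kingdom
536380,22961,TEA SET,24,2010-12-01 09:41:00,1.45,,France
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"InvoiceNo": str, "StockCode": str},  # keep codes as text
    parse_dates=["InvoiceDate"],                 # parse timestamps on load
)

# InvoiceDate is a datetime column; CustomerID becomes float because of the gap.
print(sample.dtypes)
```

Reading `InvoiceNo` as text keeps the 'C' prefix of cancelled invoices intact, and parsing `InvoiceDate` on load makes the later date-based features straightforward.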

In [None]:
# Checking shape of dataframe
df.shape

In [None]:
# Checking top 5 records
df.head()

In [None]:
# Checking bottom 5 records
df.tail()

In [None]:
# Checking all the columns present in the dataset
df.columns

In [None]:
# Basic Info of the dataset
df.info()

In [None]:
# Descriptive Statistics
df.describe()

In [None]:
# Checking number of unique values in each column
for col in df.columns:
  print(col,':',df[col].nunique())

## **Feature Engineering**

In [None]:
# Missing data counts and percentage
missing = df.columns[df.isnull().any()].tolist()

print('Missing Data Count')
print(df[missing].isnull().sum().sort_values(ascending = False))
print('--'*12)
print('Missing Data Percentage')
print(round(df[missing].isnull().sum().sort_values(ascending = False)/len(df)*100,2))
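The same summary can be built as a single table. The sketch below uses a hypothetical four-row frame (`toy` and its values are invented for illustration) and shows counts and percentages side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
toy = pd.DataFrame(
    {
        "Description": ["MUG", None, "LAMP", "CLOCK"],
        "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
        "Quantity": [6, 2, 1, 3],
    }
)

missing_count = toy.isnull().sum()
summary = pd.DataFrame(
    {
        "count": missing_count,
        "percent": (missing_count / len(toy) * 100).round(2),
    }
)

# Keep only columns that actually contain nulls, largest gap first.
summary = summary[summary["count"] > 0].sort_values("count", ascending=False)
print(summary)
```

Keeping both figures in one frame makes it easier to decide whether dropping rows or imputing values is the more reasonable choice for a given column.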

In [None]:
# Dropping the rows with nulls
df.dropna(inplace=True)

In [None]:
# Checking duplicates
print(df.duplicated().sum())

In [None]:
# Dropping duplicate rows
df.drop_duplicates(inplace=True)

In [None]:
# New Shape
df.shape

In [None]:
df.info()

In [None]:
# Creating new features from the datetime column InvoiceDate using the vectorized .dt accessor
df["year"] = df["InvoiceDate"].dt.year
df["Month"] = df["InvoiceDate"].dt.month_name()
df["Day"] = df["InvoiceDate"].dt.day_name()
df["hour"] = df["InvoiceDate"].dt.hour

In [None]:
# Creating a new feature 'TotalAmount' by multiplying Quantity and UnitPrice
df['TotalAmount'] = df['Quantity']*df['UnitPrice']

In [None]:
# Creating a new feature 'TimeType' from the hour of the invoice:
# Morning (06-11), Afternoon (12-17), Evening (all remaining hours, including night)
df['TimeType'] = np.where((df["hour"]>5)&(df["hour"]<18), np.where(
                           df["hour"]<12, 'Morning','Afternoon'),'Evening')
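The bucketing rule can also be written with `np.select`, which some readers find easier to follow than nested `np.where` calls. This is a sketch on a hypothetical array of hours, assuming the same boundaries as the cell above; it is an alternative formulation, not the notebook's own code.

```python
import numpy as np

# Hypothetical hours chosen to sit on each boundary of the rule.
hours = np.array([0, 5, 6, 11, 12, 17, 18, 23])

conditions = [
    (hours > 5) & (hours < 12),    # 06:00 to 11:59
    (hours >= 12) & (hours < 18),  # 12:00 to 17:59
]
labels = ["Morning", "Afternoon"]

# Every hour that matches no condition falls back to "Evening".
time_type = np.select(conditions, labels, default="Evening")

# Cross-check against the nested np.where formulation.
nested = np.where((hours > 5) & (hours < 18),
                  np.where(hours < 12, "Morning", "Afternoon"), "Evening")
assert (time_type == nested).all()

print(list(zip(hours.tolist(), time_type.tolist())))
```

Note that hours 0 to 5 are labelled "Evening" under this rule, so the category covers late-night and early-morning orders as well.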

In [None]:
# Checking the number of cancellations by each customer. An InvoiceNo starting with 'C' represents a cancellation
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
cancellations = df[df['InvoiceNo'].str.startswith('C')].groupby('CustomerID')[['InvoiceNo']].count()

# Renaming the columns and checking top 5 cancellations
cancellations.rename(columns={'InvoiceNo': 'Cancellations'}, inplace=True)
cancellations.sort_values(by=['Cancellations'], ascending=False).head()

In [None]:
# Dropping cancellations (InvoiceNo starting with 'C') from the main dataframe
df = df[~df['InvoiceNo'].str.startswith('C')]
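To make the cancellation handling concrete, here is a minimal sketch on a hypothetical five-row frame (`toy` and its values are invented for illustration). It counts cancelled invoice lines per customer and then keeps only the regular purchases, relying on the 'C' prefix described in the attribute information.

```python
import pandas as pd

# Hypothetical invoice lines; a leading 'C' marks a cancellation.
toy = pd.DataFrame(
    {
        "InvoiceNo": ["536365", "C536379", "536380", "C536391", "C536392"],
        "CustomerID": [17850, 14527, 17850, 14527, 17850],
        "Quantity": [6, -1, 24, -12, -2],
    }
)

# Boolean mask reused for both the count and the filter.
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")

# Cancelled lines per customer, most cancellations first.
cancellations = (
    toy[is_cancelled]
    .groupby("CustomerID")["InvoiceNo"]
    .count()
    .rename("Cancellations")
    .sort_values(ascending=False)
)

# Regular purchases only.
purchases = toy[~is_cancelled]

print(cancellations)
print(purchases)
```

Computing the mask once keeps the count and the filter consistent with each other.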