# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**

This project aims to analyze a retail transaction dataset to derive actionable insights that can enhance business strategies, optimize sales processes, and improve operational efficiency. The dataset includes critical information such as invoice numbers, product details, quantities, prices, customer IDs, and country of origin, offering a comprehensive view of each transaction.

The primary objective of the analysis is to identify key trends in sales, such as seasonal fluctuations, best-selling products, and customer purchasing behavior. By analyzing the InvoiceDate and Quantity, we can uncover patterns that help forecast demand, optimize inventory levels, and improve stock management strategies.

In addition to product performance, the analysis will focus on CustomerID and Country to segment customers based on their purchasing habits, loyalty, and geographic location. This will allow the business to develop targeted marketing campaigns, tailor product offerings to specific regions, and foster stronger customer relationships.

By evaluating UnitPrice and total invoice value, we can also identify high-value transactions and uncover pricing strategies that maximize revenue. Furthermore, insights into customer behavior and purchasing frequency will enable personalized promotions and discounts to increase customer retention and lifetime value.

Ultimately, this analysis will empower the business to make data-driven decisions that improve profitability, streamline operations, and strengthen its competitive position in the retail market.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Analyze a retail transaction dataset to identify trends in sales, customer behavior, product performance, and geographical patterns. Key goals include understanding sales trends, identifying top-performing products, segmenting customers, and calculating invoice values to optimize business strategies and improve decision-making.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.

*   Explain the ML model used and its performance using an evaluation metric score chart.

*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First Look
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Online Retail.xlsx - Online Retail.csv")

In [None]:
df

In [None]:
df.columns

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# df was already loaded in the "Dataset First View" cell, so it is reused here

# Get the number of rows and columns
rows, columns = df.shape
print(f"Rows: {rows}, Columns: {columns}")


### Dataset Information

In [None]:
# Dataset Info
# 1. Column names, dtypes, and non-null counts
df.info()

# 2. Summary statistics for numerical columns
print("\nSummary Statistics:")
print(df.describe())

# 3. Missing data
print("\nMissing Data:")
print(df.isnull().sum())

# 4. Unique values in each column
print("\nUnique Values in Each Column:")
print(df.nunique())

# 5. First few rows (sample data)
print("\nFirst Few Rows:")
print(df.head())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicates = df[df.duplicated()]
print(f"Number of duplicate rows: {len(duplicates)}")

# Display duplicates (if any)
print(duplicates)


#### Missing Values/Null Values

In [None]:

missing_values_count = df.isnull().sum()

# Display the missing values count for each column
print("Missing values count in each column:")
print(missing_values_count)

**Visualizing the missing values**

In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt

# Create a heatmap to visualize missing values (df was loaded in an earlier cell)
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', annot=False, xticklabels=df.columns, yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()


**Count the number of missing values per column**

In [None]:
# Count the number of missing values per column
missing_values_count = df.isnull().sum()

# Plot the missing values as a bar plot
plt.figure(figsize=(10, 6))
missing_values_count.plot(kind='bar', color='salmon')
plt.title('Missing Values Count per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()


### What did you know about your dataset?

**Transaction-Based:**

Each row represents a line item in an invoice. Multiple rows share the same InvoiceNo when a customer purchases several items in one transaction.

**Sales Information:**

The StockCode and Description identify the product being purchased.
Quantity and UnitPrice allow us to calculate the total sales value for each item purchased.

**Customer and Country Data:**

CustomerID helps identify the customer making the purchase.
Country indicates where the customer is from, which is useful for regional analysis.

**Time-Based Analysis:**

InvoiceDate allows time-based analysis, such as identifying peak shopping times, seasonality, or trends in sales.
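
As a minimal sketch of the line-item structure described here, the snippet below computes a per-line value and a per-invoice total. The rows are made up for illustration (they are not taken from the dataset), and the names `items`, `LineTotal`, and `invoice_totals` are assumptions introduced only for this example.

```python
import pandas as pd

# Toy line items (made-up values) that mirror the dataset's column names
items = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2"],
    "StockCode": ["S1", "S2", "S1"],
    "Quantity": [2, 1, 5],
    "UnitPrice": [2.5, 4.0, 2.5],
})

# Value of each line item = quantity x unit price
items["LineTotal"] = items["Quantity"] * items["UnitPrice"]

# Value of each invoice = sum of its line items
invoice_totals = items.groupby("InvoiceNo")["LineTotal"].sum()
print(invoice_totals)
```

The same two steps should carry over to the real DataFrame, provided Quantity and UnitPrice are numeric.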




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**InvoiceNo:**

A unique identifier for each invoice or order. Each transaction or purchase is associated with an invoice number.

**StockCode:**

A unique identifier for each product (SKU). It helps track individual items sold in the store.

**Description:**

A textual description of the product. This provides information about what the product is, like "WHITE HANGING HEART T-LIGHT HOLDER" or "RED WOOLLY HOTTIE WHITE HEART."

**Quantity:**

The number of units of the product purchased in that particular transaction. Can be positive (purchase) or negative (return/cancellation).

**InvoiceDate:**

The exact date and time when the transaction occurred. This helps track when a sale or purchase was made, typically formatted as "day/month/year hour:minute."

**UnitPrice:**

The price of one unit of the product at the time of purchase. This varies between products and can sometimes be negative, indicating price errors or refunds.

**CustomerID:**

A unique identifier for the customer who made the purchase. It helps track individual customers and analyze purchasing behavior. Some transactions may lack a customer ID (indicating an anonymous or one-time customer).

**Country:**

The country where the customer is located or where the transaction was made. It shows the geographical distribution of your customers, e.g., "United Kingdom" or "France."
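
Since Quantity can be positive (purchase) or negative (return/cancellation), it can help to separate the two before computing sales figures. The following is a small sketch on made-up rows; `rows`, `purchases`, and `returns` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy rows (made-up values): two purchases and one return
rows = pd.DataFrame({
    "InvoiceNo": ["1001", "1002", "1003"],
    "Quantity": [6, 3, -6],
    "UnitPrice": [2.55, 3.39, 2.55],
})

# A negative quantity marks a return or cancellation
is_return = rows["Quantity"] < 0
purchases = rows[~is_return]
returns = rows[is_return]

print(f"Purchases: {len(purchases)}, returns: {len(returns)}")
```

Keeping the returns in a separate frame, rather than discarding them, leaves room to analyze return rates later.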

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# List the unique values for each column
print("\nUnique Values in Each Column:")
for column in df.columns:
    print(f"\nUnique values in '{column}':")
    print(df[column].unique())


## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Online Retail.xlsx - Online Retail.csv')

# 1. Inspect the dataset
df.info()         # Check data types, missing values, and structure (prints directly)
print(df.head())  # Display first 5 rows

# 2. Handle missing values
# - CustomerID: fill missing values with the placeholder 'Unknown'.
#   The result is assigned back, because an in-place fill on a column
#   selection is not guaranteed to modify df.
df['CustomerID'] = df['CustomerID'].astype('object').fillna('Unknown')
# - Description: drop rows with missing descriptions
#   (dropna on the column alone would not remove rows from df)
df = df.dropna(subset=['Description']).copy()

# 3. Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')  # Converts invalid dates to NaT

# 4. Remove duplicates
df = df.drop_duplicates()

# 5. Handle non-positive values in 'Quantity' and 'UnitPrice'
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# 6. Handle categorical columns (e.g., StockCode, Country)
# Keep the country names in 'Country', since later charts and tests filter on them,
# and store the label-encoded version (for machine learning purposes) separately
df['CountryCode'] = df['Country'].astype('category').cat.codes

# You can also perform one-hot encoding for columns like 'StockCode' if needed
# df = pd.get_dummies(df, columns=['StockCode'], drop_first=True)

# 7. Feature creation: extract the year and month from 'InvoiceDate'
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month

# 8. Check for any remaining missing values
print(df.isnull().sum())

# 9. Save the cleaned data to a new CSV (optional)
df.to_csv('cleaned_invoice_data.csv', index=False)

# The dataset is now ready for analysis!


**What all manipulations have you done and insights you found?**

**Handling Missing Values:**

Filled missing CustomerID values with 'Unknown' to retain all records.
Dropped rows with missing Description values, assuming descriptions are essential for analysis.

**Datetime Conversion:**

Converted InvoiceDate to a datetime format to facilitate time-based analysis.

**Removing Duplicates:**

Eliminated duplicate rows to ensure data integrity.

**Handling Negative Values:**

Removed rows where Quantity or UnitPrice were zero or negative, as these may indicate errors.

**Categorical Encoding:**

Label-encoded the Country values as numeric codes for analysis.

**Feature Engineering:**

Extracted Year and Month from InvoiceDate to enable time-based analysis.

These steps are standard in data preprocessing to ensure the dataset is clean and suitable for analysis.
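
The cleaning steps listed above can also be wrapped in one function so that the notebook runs end to end without repeating code. This is only a sketch: `clean_transactions` is a hypothetical helper, and the rows in `raw_demo` are made up to exercise each rule.

```python
import numpy as np
import pandas as pd

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the transactions (hypothetical helper)."""
    cleaned = raw.copy()
    # Fill missing customer IDs with a placeholder
    cleaned["CustomerID"] = cleaned["CustomerID"].astype("object").fillna("Unknown")
    # Drop rows without a product description
    cleaned = cleaned.dropna(subset=["Description"])
    # Parse dates; unparseable values become NaT
    cleaned["InvoiceDate"] = pd.to_datetime(cleaned["InvoiceDate"], errors="coerce")
    # Remove exact duplicate rows
    cleaned = cleaned.drop_duplicates()
    # Keep only rows with positive quantity and price
    keep = (cleaned["Quantity"] > 0) & (cleaned["UnitPrice"] > 0)
    cleaned = cleaned[keep].copy()
    # Derive year and month for time-based analysis
    cleaned["Year"] = cleaned["InvoiceDate"].dt.year
    cleaned["Month"] = cleaned["InvoiceDate"].dt.month
    return cleaned.reset_index(drop=True)

# Made-up rows: a duplicate, a missing description, a return,
# a zero price, and a missing customer ID
raw_demo = pd.DataFrame({
    "Description": ["MUG", "MUG", None, "BOWL", "PLATE", "CUP"],
    "Quantity": [2, 2, 1, -1, 4, 3],
    "UnitPrice": [1.5, 1.5, 2.0, 3.0, 0.0, 2.0],
    "InvoiceDate": ["2011-01-05 10:00", "2011-01-05 10:00", "2011-01-06 11:00",
                    "2011-02-01 09:30", "2011-02-02 12:00", "2011-03-10 14:15"],
    "CustomerID": [17850.0, 17850.0, 13047.0, np.nan, 12583.0, np.nan],
})

cleaned_demo = clean_transactions(raw_demo)
print(cleaned_demo)
```

Working on a copy keeps the raw DataFrame intact, which makes it easier to compare row counts before and after cleaning.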

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
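# Chart - 1 visualization code
import matplotlib.pyplot as plt

# Count the number of distinct invoices per country
# (value_counts() on 'Country' would count line items, not invoices)
invoice_counts = df.groupby('Country')['InvoiceNo'].nunique().sort_values(ascending=False)

# Plot
invoice_counts.plot(kind='bar', figsize=(12, 6))
plt.title('Number of Invoices per Country')
plt.xlabel('Country')
plt.ylabel('Number of Invoices')
plt.xticks(rotation=90)
plt.show()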


##### 1. Why did you pick the specific chart?

I chose a bar chart because it effectively compares the number of invoices across different countries. It’s clear, easy to interpret, and works well for categorical data like country names.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals which countries have the highest and lowest number of invoices, helping identify key markets and areas with lower sales activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help target high-performing markets and improve underperforming regions, driving growth. However, over-reliance on a few countries could risk negative impact if those markets face disruptions.





#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Convert 'InvoiceDate' to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Group by date and sum 'Quantity'
daily_sales = df.groupby(df['InvoiceDate'].dt.date)['Quantity'].sum()

# Plot
daily_sales.plot(kind='line')
plt.title('Daily Sales Quantity')
plt.xlabel('Date')
plt.ylabel('Quantity Sold')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A line chart was chosen because it effectively shows trends and patterns in time-series data (daily sales), making it easy to track changes over time.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals sales trends, peak days, and patterns, helping optimize inventory, marketing, and sales forecasting.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help optimize marketing and inventory for growth. However, consistent sales dips could indicate issues that, if not addressed, may lead to negative growth.
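
Daily totals tend to be noisy, so trends and dips are often easier to judge after smoothing or aggregating. The sketch below uses a made-up daily series (`daily`) and shows a centred rolling mean and weekly totals; the window size and weekly frequency are assumptions chosen for illustration.

```python
import pandas as pd

# Toy daily totals (made-up values) indexed by date
dates = pd.date_range("2011-01-01", periods=10, freq="D")
daily = pd.Series([5, 7, 6, 20, 4, 5, 6, 7, 30, 5], index=dates, name="Quantity")

# 3-day centred rolling mean smooths single-day spikes
smoothed = daily.rolling(window=3, center=True).mean()

# Weekly totals give a coarser view of the same series
weekly = daily.resample("W").sum()

print(smoothed)
print(weekly)
```

For the notebook's `daily_sales`, the index would first need converting with `pd.to_datetime`, because grouping by `.dt.date` yields plain date objects.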

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Count the number of distinct invoices per country
# (value_counts() on 'Country' would count line items, not invoices)
invoice_counts = df.groupby('Country')['InvoiceNo'].nunique()

# Plot
invoice_counts.plot(kind='pie', autopct='%1.1f%%', figsize=(8, 8))
plt.title('Invoice Distribution by Country')
plt.ylabel('')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pie chart because it's ideal for showing proportions and relative shares of invoices from different countries, making it easy to compare their contributions visually.





##### 2. What is/are the insight(s) found from the chart?

The chart reveals dominant and underrepresented countries in invoice distribution, helping identify key markets and potential areas for expansion.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help focus on key markets for growth and target underrepresented countries. However, over-reliance on a few countries could risk negative growth if those markets face issues.
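
When one country dominates, a pie chart with every country as its own slice becomes hard to read. One option, sketched below, is to keep the largest categories and merge the rest into an "Other" slice. `top_n_with_other` is a hypothetical helper and the counts are made up.

```python
import pandas as pd

def top_n_with_other(counts: pd.Series, n: int = 5) -> pd.Series:
    """Keep the n largest categories and sum the rest into 'Other' (hypothetical helper)."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n]
    remainder = ordered.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({"Other": remainder})])
    return top

# Made-up invoice counts per country
counts = pd.Series({
    "United Kingdom": 900,
    "Germany": 40,
    "France": 35,
    "Spain": 15,
    "Norway": 10,
})

grouped = top_n_with_other(counts, n=3)
print(grouped)
```

The grouped series can then be passed to `.plot(kind='pie', autopct='%1.1f%%')` in the same way as `invoice_counts`.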





#### Chart - 4

In [None]:
# Chart - 4 visualization code
import seaborn as sns

# Plot
sns.scatterplot(data=df, x='Quantity', y='UnitPrice')
plt.title('Quantity vs UnitPrice')
plt.xlabel('Quantity')
plt.ylabel('UnitPrice')
plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot shows the relationship between Quantity sold and UnitPrice, helping identify if there's any correlation between the number of items sold and their price.





##### 2. What is/are the insight(s) found from the chart?

The chart shows the relationship between quantity sold and price, highlighting pricing strategies, sales patterns, and potential outliers.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can optimize pricing strategies for better sales. However, underpricing or poor sales of high-priced items could lead to negative growth if not addressed.





#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Plot
df['Quantity'].plot(kind='hist', bins=50, alpha=0.7)
plt.title('Distribution of Quantity')
plt.xlabel('Quantity')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

  I chose a histogram to visualize the distribution of quantities sold, making it easy to identify patterns, frequency, and potential outliers in the sales data.

##### 2. What is/are the insight(s) found from the chart?

The histogram reveals the most common quantities sold, any skew in sales, and potential outliers, helping optimize inventory and pricing strategies.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize inventory and sales strategies, but low sales of larger quantities may require adjustments to avoid negative growth.
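
If a few very large orders stretch the x-axis, most of the histogram collapses into the first bin. Two common remedies are capping values at a high percentile or using a log scale. The sketch below shows both on made-up quantities; the 90th-percentile cap is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up quantities with one extreme value
quantity = pd.Series([1, 1, 2, 2, 3, 4, 6, 12, 24, 5000], name="Quantity")

# Option 1: cap values at a chosen percentile before plotting
cap = quantity.quantile(0.90)
clipped = quantity.clip(upper=cap)

# Option 2: plot on a log scale (valid here because all quantities are positive)
log_quantity = np.log10(quantity)

print(f"Cap: {cap}, max after clipping: {clipped.max()}")
```

Either `clipped` or `log_quantity` can be plotted with `.plot(kind='hist', bins=50)` to show the bulk of the distribution.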

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Plot
sns.boxplot(data=df, x='Country', y='UnitPrice')
plt.title('UnitPrice Distribution by Country')
plt.xlabel('Country')
plt.ylabel('UnitPrice')
plt.xticks(rotation=90)
plt.show()





##### 1. Why did you pick the specific chart?

I chose a boxplot to compare unit price distributions across countries, showing the spread, median, and outliers in the data for each country.





##### 2. What is/are the insight(s) found from the chart?

The boxplot reveals price variation, outliers, and consistency in pricing across countries, helping optimize pricing strategies.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can optimize pricing strategies and target premium markets. However, inconsistent or extreme prices may harm customer trust and lead to negative growth.





#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Group by date and sum 'Quantity'
daily_sales = df.groupby(df['InvoiceDate'].dt.date)['Quantity'].sum()

# Plot
daily_sales.plot(kind='area', alpha=0.5)
plt.title('Daily Sales Quantity')
plt.xlabel('Date')
plt.ylabel('Quantity Sold')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

The area chart shows the total quantity sold each day, highlighting trends and fluctuations in sales over time.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals sales trends, peak sales periods, and sales dips, helping guide marketing and inventory decisions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize strategies during peak sales, but consistent sales dips may indicate issues that could lead to negative growth if not addressed.





#### Chart - 8

In [None]:
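# Chart - 8 visualization code
# Pivot table for heatmap: aggregate by month, because using raw InvoiceDate
# timestamps as columns would create one column per distinct timestamp
monthly = df.assign(InvoiceMonth=df['InvoiceDate'].dt.to_period('M'))
pivot_table = monthly.pivot_table(values='Quantity', index='Country', columns='InvoiceMonth', aggfunc='sum')

# Plot
plt.figure(figsize=(12, 10))
sns.heatmap(pivot_table, cmap='YlGnBu', annot=False)
plt.title('Heatmap of Sales Quantity by Country and Month')
plt.xlabel('Month')
plt.ylabel('Country')
plt.show()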


##### 1. Why did you pick the specific chart?

I chose a heatmap to easily visualize sales patterns across countries and dates, highlighting trends and variations through color intensity.





##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals sales peaks, seasonal trends, and country-specific patterns, helping to optimize sales strategies and inventory planning.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize sales strategies and focus on regions with high demand. However, irregular sales in some countries could lead to challenges, affecting growth if not managed properly.





#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Group by 'StockCode' and sum 'Quantity'
product_sales = df.groupby('StockCode')['Quantity'].sum().reset_index()

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(data=product_sales, x='StockCode', y='Quantity', s=product_sales['Quantity'] / 10, alpha=0.5)
plt.title('Product Sales Quantity')
plt.xlabel('StockCode')
plt.ylabel('Quantity Sold')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a scatter plot to visualize total sales by product and highlight high-selling items using point size for clarity.





##### 2. What is/are the insight(s) found from the chart?

The scatter plot reveals top-selling products, low-selling products, and the sales distribution across items, helping with inventory and marketing decisions.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize inventory and marketing for top-selling products. However, low-selling products need attention to avoid wasted resources and losses.





#### Chart - 10

In [None]:
# Chart - 10 visualization code
import numpy as np

# Example data
categories = ['Quantity', 'UnitPrice']
values = [df['Quantity'].mean(), df['UnitPrice'].mean()]

# Number of variables
num_vars = len(categories)

# Compute angle for each axis
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()

# Complete the loop
values += values[:1]
angles += angles[:1]

# Plot
fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
ax.fill(angles, values, color='red', alpha=0.25)
ax.plot(angles, values, color='red', linewidth=2)
ax.set_yticklabels([])
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
plt.title('Radar Chart of Quantity and UnitPrice')
plt.show()


##### 1. Why did you pick the specific chart?

  I chose a radar chart to easily compare the average values of Quantity and UnitPrice on a common scale, making their relationship visually clear.





##### 2. What is/are the insight(s) found from the chart?

The radar chart compares the average values of Quantity and UnitPrice, showing their relative balance and helping inform pricing and sales strategies.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize pricing for better sales. However, selling high quantities at low prices could erode profit margins, potentially harming growth.





#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Group by 'InvoiceDate' and 'Country' and sum 'Quantity'
stacked_data = df.groupby([df['InvoiceDate'].dt.date, 'Country'])['Quantity'].sum().unstack()

# Plot
stacked_data.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Stacked Bar Chart of Sales Quantity by Country and Date')
plt.xlabel('Date')
plt.ylabel('Quantity Sold')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a stacked bar chart to compare sales quantities across countries over time, making it easy to track contributions and trends.





##### 2. What is/are the insight(s) found from the chart?

The insights from the stacked bar chart could include:

*   Sales Distribution by Country: It shows how each country's sales contribute to the total sales on specific dates.
*   Trends Over Time: Identifying which countries have increasing or decreasing sales over time.
*   Top Performing Countries: Spotting which countries consistently perform well and which are lagging behind.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize marketing and resource allocation for top-performing countries. However, underperforming countries may indicate issues that need addressing to avoid negative growth.





#### Chart - 12

In [None]:
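# Chart - 12 visualization code
import matplotlib.pyplot as plt
import numpy as np

# Example data (illustrative values, not computed from the dataset):
# sequential changes applied to a starting value of 0
steps = ['Sales', 'Returns', 'Discounts']
changes = [100, -20, -10]

# Running total after each step; each bar starts where the previous one ended
running_total = np.cumsum(changes)
bottoms = running_total - np.array(changes)

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot floating bars for the changes and a full bar for the end total
colors = ['green' if change >= 0 else 'red' for change in changes]
ax.bar(steps, changes, bottom=bottoms, color=colors)
ax.bar(['End'], [running_total[-1]], color='purple')

# Add labels: the signed change on each step, the total on the end bar
for i, (change, total) in enumerate(zip(changes, running_total)):
    ax.text(i, max(total, total - change), f'{change:+d}', ha='center', va='bottom')
ax.text(len(steps), running_total[-1], f'{running_total[-1]}', ha='center', va='bottom')

# Set titles and labels
ax.set_title('Waterfall Chart of Sales Changes')
ax.set_xlabel('Category')
ax.set_ylabel('Value')

plt.show()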


##### 1. Why did you pick the specific chart?

I chose a waterfall chart to clearly show how each factor (sales, returns, discounts) sequentially impacts the overall result.





##### 2. What is/are the insight(s) found from the chart?

The insights from the waterfall chart could include:

*   Positive and Negative Contributions: It highlights how each factor (sales, returns, discounts) affects the final outcome, showing where gains and losses occur.
*   Impact of Returns and Discounts: Identifying the impact of negative factors like returns and discounts on the final sales value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize sales strategies and improve customer satisfaction. However, high returns and discounts can negatively impact profitability and growth if not managed well.





#### Chart - 13

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'df' is your DataFrame
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Country', y='UnitPrice')
plt.title('Distribution of UnitPrice by Country')
plt.xlabel('Country')
plt.ylabel('UnitPrice')
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a violin plot to show the distribution and spread of UnitPrice across countries, providing more detailed insights than a box plot.





##### 2. What is/are the insight(s) found from the chart?

The violin plot shows price variation, identifies outliers, and highlights skewness in UnitPrice across countries, helping to adjust pricing strategies.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights help optimize pricing and target markets better. However, extreme price variations could lead to customer dissatisfaction and negatively impact growth.





#### Chart - 14 - Correlation Heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
correlation_matrix = df[['Quantity', 'UnitPrice']].corr()

# Plot
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Quantity and UnitPrice')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a correlation heatmap to visually show the relationship between Quantity and UnitPrice, making it easy to identify the strength and direction of their correlation.

##### 2. What is/are the insight(s) found from the chart?

Answer Here.





#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns

# Assuming 'df' is your DataFrame
sns.pairplot(df[['Quantity', 'UnitPrice']])
plt.suptitle('Pair Plot of Quantity and UnitPrice', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pair plot to visualize the relationship between Quantity and UnitPrice and their individual distributions in a clear, simple way.





##### 2. What is/are the insight(s) found from the chart?

The pair plot shows the relationship between Quantity and UnitPrice, their distributions, and potential outliers.





## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H₀):**
There is no significant difference in the average 'Quantity' purchased between customers from different countries.

**Alternative Hypothesis (H₁):**
There is a significant difference in the average 'Quantity' purchased between customers from different countries.

#### 2. Perform an appropriate statistical test.

**One-Way Analysis of Variance (ANOVA)**

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy import stats

# Assuming 'df' is your DataFrame
# Grouping data by 'Country' and extracting 'Quantity' for each group
countries = df['Country'].unique()
quantities_by_country = [df[df['Country'] == country]['Quantity'] for country in countries]

# Performing One-Way ANOVA
f_stat, p_value = stats.f_oneway(*quantities_by_country)
print(f"ANOVA F-statistic: {f_stat}, p-value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

One-Way Analysis of Variance (ANOVA)

##### Why did you choose the specific statistical test?

ANOVA is appropriate when comparing the means of three or more independent groups to determine whether at least one group mean differs from the others. In this case, the groups are customers from different countries, and the objective is to assess whether their average 'Quantity' purchased differs significantly.
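
One-way ANOVA assumes roughly normal groups with similar variances, which purchase quantities may not satisfy. As a hedged cross-check, the Kruskal-Wallis H-test is a rank-based alternative. The sketch below runs both tests on synthetic, right-skewed data for three hypothetical groups; it is not run on the project dataset, and the group names and distribution parameters are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed "quantities" for three hypothetical countries
group_a = rng.exponential(scale=5.0, size=200)
group_b = rng.exponential(scale=5.0, size=150)
group_c = rng.exponential(scale=9.0, size=100)

# Parametric test: compares group means
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Rank-based test: compares distributions without assuming normality
h_stat, p_kruskal = stats.kruskal(group_a, group_b, group_c)

print(f"ANOVA: F={f_stat:.3f}, p={p_anova:.4f}")
print(f"Kruskal-Wallis: H={h_stat:.3f}, p={p_kruskal:.4f}")
```

If the two tests lead to the same decision at the chosen significance level, the conclusion is less likely to be an artifact of the normality assumption.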

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H₀):**
There is no significant relationship between 'UnitPrice' and 'Quantity' purchased.

**Alternative Hypothesis (H₁):**
There is a significant relationship between 'UnitPrice' and 'Quantity' purchased.

#### 2. Perform an appropriate statistical test.

**Pearson Correlation Coefficient**

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# Calculating Pearson Correlation Coefficient
correlation, p_value = stats.pearsonr(df['UnitPrice'], df['Quantity'])
print(f"Pearson Correlation: {correlation}, p-value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Coefficient

##### Why did you choose the specific statistical test?

The Pearson Correlation Coefficient measures the strength and direction of the linear relationship between two numerical variables. Since both 'UnitPrice' and 'Quantity' are numerical, this test is suitable for assessing their linear association.
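
Pearson's coefficient only captures linear association, so a monotonic but curved relationship can be understated. Spearman's rank correlation is a common complement. The sketch below compares the two on synthetic data in which quantity falls non-linearly as price rises; the data-generating formula is an assumption made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data: quantity decreases non-linearly with price, plus noise
unit_price = rng.uniform(0.5, 20.0, size=300)
quantity = np.round(100 / unit_price + rng.normal(0, 2, size=300))

# Linear association
pearson_r, pearson_p = stats.pearsonr(unit_price, quantity)

# Monotonic (rank-based) association
spearman_rho, spearman_p = stats.spearmanr(unit_price, quantity)

print(f"Pearson r: {pearson_r:.3f}, Spearman rho: {spearman_rho:.3f}")
```

Reporting both coefficients makes it clearer whether a weak Pearson value reflects no relationship or merely a non-linear one.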


### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H₀):**
The average 'UnitPrice' of products purchased by UK customers is equal to that of French customers.

**Alternative Hypothesis (H₁):**
The average 'UnitPrice' of products purchased by UK customers is different from that of French customers.

#### 2. Perform an appropriate statistical test.

**Independent Two-Sample T-Test**

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# Extracting 'UnitPrice' for UK and France customers
uk_prices = df[df['Country'] == 'United Kingdom']['UnitPrice']
france_prices = df[df['Country'] == 'France']['UnitPrice']

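# Performing Independent Two-Sample T-Test
# equal_var=False selects Welch's variant, which does not assume equal group variances
t_stat, p_value = stats.ttest_ind(uk_prices, france_prices, equal_var=False)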
print(f"T-statistic: {t_stat}, p-value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

Independent Two-Sample T-Test

##### Why did you choose the specific statistical test?

The Independent Two-Sample T-Test compares the means of two independent groups to determine if there is a statistically significant difference between them. Here, the groups are customers from the UK and France, and the goal is to compare their average 'UnitPrice' purchases.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
import pandas as pd

# Load your dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Online Retail.xlsx - Online Retail.csv')

# Check for missing values
missing_data = df.isnull().sum()
print(missing_data)


#### What all missing value imputation techniques have you used and why did you use those techniques?

Mean/Median Imputation: Replace missing numerical values with the mean or median of the column. This method is simple and effective for normally distributed data.


Mode Imputation: Replace missing categorical values with the mode (most frequent value) of the column. This approach is straightforward and works well for categorical data.


Forward Fill and Backward Fill: Use the previous (forward fill) or next (backward fill) available value to fill missing entries. This technique is useful for time-series data but assumes that the data points are related.
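
The three techniques above can be sketched as follows. The frame `sample` and its values are hypothetical and only loosely mirror the retail columns; which technique suits which column of the real dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with gaps; values are invented for illustration
sample = pd.DataFrame({
    'Quantity': [6.0, np.nan, 8.0, 2.0, np.nan],
    'Country': ['United Kingdom', None, 'France', 'United Kingdom', 'United Kingdom'],
    'UnitPrice': [2.55, 3.39, np.nan, np.nan, 7.65],
})

imputed = sample.copy()

# Median imputation for a numeric column (less sensitive to skew than the mean)
imputed['Quantity'] = imputed['Quantity'].fillna(imputed['Quantity'].median())

# Mode imputation for a categorical column
imputed['Country'] = imputed['Country'].fillna(imputed['Country'].mode()[0])

# Forward fill, then backward fill to cover any gaps at the start
imputed['UnitPrice'] = imputed['UnitPrice'].ffill().bfill()

print(imputed)
```

Working on a copy keeps the original values available for comparison after imputation.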

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import numpy as np
import pandas as pd

# Calculate Z-scores for 'Quantity' (swap in any numeric column of interest)
z_scores = (df['Quantity'] - df['Quantity'].mean()) / df['Quantity'].std()

# Identify outliers: rows more than 3 standard deviations from the mean
threshold = 3
outliers = df[np.abs(z_scores) > threshold]
print(outliers)


##### What all outlier treatment techniques have you used and why did you use those techniques?

1. Z-Score Method: Identifies data points that are a specified number of standard deviations away from the mean.

2. Interquartile Range (IQR) Method: Detects outliers based on the spread of the middle 50% of the data.

3. Visualization Techniques: Utilizes box plots and scatter plots to visually identify outliers.

4. Removal of Outliers: Eliminates rows containing outliers to prevent them from skewing analysis results.

5. Capping (Winsorizing): Limits outlier values to a specified range to reduce their impact.

6. Transformation: Applies mathematical transformations, such as logarithmic or square root transformations, to reduce the effect of outliers.

7. Imputation: Replaces outliers with statistical measures like the mean or median to minimize their effect.

8. Robust Models: Utilizes algorithms less sensitive to outliers, such as Random Forests or robust regression techniques.
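
As a minimal sketch of the IQR method and capping from the list above, the example below flags values outside the conventional 1.5 × IQR fences and then clips them to those fences. The `quantity` series is invented for illustration, and clipping at the IQR fences is used here as a simple stand-in for winsorizing.

```python
import pandas as pd

# Invented quantities with two extreme values, for illustration only
quantity = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 6, 500, -300], name='Quantity')

# IQR fences: 1.5 * IQR below the first quartile and above the third quartile
q1, q3 = quantity.quantile(0.25), quantity.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Detection: values outside the fences
outliers = quantity[(quantity < lower) | (quantity > upper)]
print("Outliers:", outliers.tolist())

# Capping: pull extreme values back to the fences instead of dropping rows
capped = quantity.clip(lower=lower, upper=upper)
print("Capped range:", capped.min(), "to", capped.max())
```

Capping keeps every row, which matters when the outlying transactions carry other useful fields such as 'CustomerID' or 'Country'.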

### 3. Categorical Encoding

#### **Label Encoding**

In [None]:
# Encode your categorical columns
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data (kept in its own variable so the retail `df` is not overwritten)
data = {'Country': ['United Kingdom', 'France', 'United Kingdom', 'France']}
sample_df = pd.DataFrame(data)

# Initialize LabelEncoder
le = LabelEncoder()

# Apply label encoding
sample_df['Country_encoded'] = le.fit_transform(sample_df['Country'])
print(sample_df)


#### **Target Encoding**

In [None]:
import pandas as pd

# Sample data (kept in its own variable so the retail `df` is not overwritten)
data = {'Country': ['United Kingdom', 'France', 'United Kingdom', 'France'],
        'Sales': [100, 200, 150, 250]}
sample_df = pd.DataFrame(data)

# Calculate mean sales per country
mean_sales = sample_df.groupby('Country')['Sales'].mean()

# Map the means to the original data
sample_df['Country_encoded'] = sample_df['Country'].map(mean_sales)
print(sample_df)


#### **One Hot Encoding**

In [None]:
import pandas as pd

# Sample data (kept in its own variable so the retail `df` is not overwritten)
data = {'Country': ['United Kingdom', 'France', 'United Kingdom', 'France']}
sample_df = pd.DataFrame(data)

# Apply one-hot encoding
df_encoded = pd.get_dummies(sample_df, columns=['Country'])
print(df_encoded)


#### What all categorical encoding techniques have you used & why did you use those techniques?

1. Label Encoding: Assigns a unique integer to each category. It's suitable for ordinal variables where the order matters.

2. One-Hot Encoding: Creates a binary column for each category. It's ideal for nominal variables without inherent order.

3. Target Encoding: Replaces each category with the mean of the target variable for that category. It's effective for high-cardinality features.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
%pip install contractions


In [None]:
import contractions

# Sample text with contractions
text = "I can't believe it's already 5 o'clock. She won't  be here until 6 ."

# Expand contractions
expanded_text = contractions.fix(text)

print(expanded_text)


#### 2. Lower Casing

In [None]:
# Lower Casing
# Sample text
text = "Hello, World! This is a Sample Text."

# Convert text to lowercase
lowercase_text = text.lower()

print(lowercase_text)


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Sample text
text = "Hello, World! This is a sample text."

# Create a translation table that maps each punctuation character to None
translation_table = str.maketrans('', '', string.punctuation)

# Remove punctuation
text_without_punctuation = text.translate(translation_table)

print(text_without_punctuation)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words that contain digits
import re

# Sample text
text = "Visit our website at https://www.example.com for more information on item A12 and 2 offers."

# Regular expression pattern to match URLs (scheme followed by non-whitespace characters)
url_pattern = r'https?://\S+'

# Remove URLs
text_without_urls = re.sub(url_pattern, '', text)

# Remove words that contain at least one digit
text_without_digit_words = re.sub(r'\b\w*\d\w*\b', '', text_without_urls)

# Collapse the extra spaces left behind by the removals
cleaned_text = re.sub(r'\s+', ' ', text_without_digit_words).strip()

print(cleaned_text)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
%pip install nltk


In [None]:
import nltk
nltk.download('stopwords')


In [None]:
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Sample text
text = "This is a sample sentence demonstrating stopword removal."

# Process the text
doc = nlp(text)

# Remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop and not token.is_punct]

print(filtered_words)


In [None]:
# Remove White spaces
text = "  Hello,   World! \n\t  "
# Collapse runs of whitespace into single spaces and trim both ends
cleaned_text = " ".join(text.split())
print(cleaned_text)  # Output: "Hello, World!"


#### 6. Rephrase Text

In [None]:
# Rephrase Text

In [None]:
%pip install transformers torch


In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer


In [None]:
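# Note: the stock "t5-base" checkpoint is not fine-tuned for a "paraphrase:" task, so the output
# may be poor or may echo the input; a paraphrase-tuned checkpoint is likely needed for real use
model_name = "t5-base"  # You can also use "t5-large" or "t5-small" based on your requirements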
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)


In [None]:
def paraphrase(text):
    # Prepend the text with a task-specific prefix
    input_text = f"paraphrase: {text}"
    # Tokenize the input text
    input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    # Generate paraphrased text
    outputs = model.generate(input_ids, max_length=512, num_beams=5, num_return_sequences=1, early_stopping=True)
    # Decode the generated text
    paraphrased_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return paraphrased_text


In [None]:
original_text = "Your original text goes here."
paraphrased_text = paraphrase(original_text)
print("Original Text:", original_text)
print("Paraphrased Text:", paraphrased_text)


#### 7. Tokenization

In [None]:
# Tokenization
%pip install nltk


In [None]:
import nltk
nltk.download('punkt')


In [None]:
from nltk.tokenize import word_tokenize

text = "Hello, world! Welcome to NLP with Python."
words = word_tokenize(text)
print(words)


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

In [None]:
nltk.download('punkt_tab')


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Sample text
text = "The cats are running quickly through the streets."

# Step 1: Lowercase the text
text = text.lower()

# Step 2: Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Step 3: Tokenize the text (still necessary to split words for stopword removal and stemming/lemmatization)
tokens = word_tokenize(text)

# Step 4: Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

# Step 5: Apply Stemming (PorterStemmer)
porter_stemmer = PorterStemmer()
stemmed_tokens = [porter_stemmer.stem(word) for word in filtered_tokens]

# Step 6: Apply Lemmatization (using WordNet Lemmatizer)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word, pos='v') for word in filtered_tokens]  # Lemmatize verbs

# Display the results (`text` has already been lowercased and stripped of punctuation)
print("Cleaned Text:", text)
print("Filtered Tokens (No Stopwords):", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)



##### Which text normalization technique have you used and why?



1. Lowercasing:

   Definition: Lowercasing is the process of converting all characters in the text to lowercase.

   Why: It ensures that words like "The" and "the" are treated as identical. This matters in NLP tasks because words that differ only in case are usually the same word. For instance, "running" and "Running" should be recognized as the same word during analysis.

2. Tokenization:

   Definition: Tokenization splits the raw text into smaller units called tokens. These tokens can be words, sub-words, or sentences.

   Why: Raw text in the form of a long sentence is hard to analyze directly. Once the sentence is split into individual tokens, further processing such as part-of-speech tagging, named entity recognition, or sentiment analysis can be applied.

3. Removing Punctuation:

   Definition: Removing punctuation marks such as commas, periods, quotation marks, and exclamation marks from the text.

   Why: Punctuation often contributes little to the meaning in many NLP tasks. For example, in sentiment analysis, the words themselves usually hold more meaning than the punctuation surrounding them. Removing punctuation therefore simplifies the text and reduces noise.

4. Stopword Removal:

   Definition: Stopwords are common words such as "is", "the", "and", "in", and "of", which carry little useful information for tasks like classification or information retrieval.

   Why: These words appear frequently but do not help differentiate between topics or sentiments. Removing them improves efficiency and keeps the focus on the more meaningful content. For example, in a document classification task, the word "the" would not help distinguish between topics.

5. Stemming:

   Definition: Stemming reduces words to a root form by chopping off affixes (e.g., removing "-ing", "-ly", "-ed").

   Why: Stemming allows different forms of a word to be treated as the same base word. For example, "running" and "runs" are both reduced to "run", simplifying analysis. However, stemming is a heuristic process and may produce non-dictionary forms (e.g., "quickli" instead of "quickly").

6. Lemmatization:

   Definition: Lemmatization also reduces words to a base form, but it considers the word's part of speech (e.g., whether it is a verb, noun, or adjective). Unlike stemming, it uses vocabulary and grammatical analysis so that the result is a valid dictionary form.

   Why: Lemmatization is generally more accurate than stemming because it returns the actual base form of a word. For example, "running" becomes "run" when treated as a verb, and "better" becomes "good" when treated as an adjective. The reduced form is always a meaningful word.

#### 9. Part of Speech Tagging

In [None]:
# POS Tagging
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer and tagger resources. Resource names differ across NLTK
# versions, so both the older and the newer identifiers are requested.
for resource in ['punkt', 'punkt_tab',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng']:
    nltk.download(resource, quiet=True)

# Sample text
text = "The cats are running quickly through the streets."

# Tokenize, then tag each token with its part of speech
tagged_tokens = nltk.pos_tag(word_tokenize(text))
print(tagged_tokens)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = ["I love programming", "Programming is fun", "I love Python programming"]

# Create the CountVectorizer (BoW)
vectorizer = CountVectorizer()

# Fit and transform the data into the BoW model
X = vectorizer.fit_transform(documents)

# Display the vocabulary and the corresponding vectors
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", X.toarray())


##### Which text vectorization technique have you used and why?

 I used the Bag of Words (BoW) technique because it's simple, fast, and effective for text classification tasks. It focuses on the frequency of words in documents, which is useful when the order of words isn't important. However, it doesn't capture word relationships or semantic meaning.





### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
df = pd.get_dummies(df, columns=['StockCode', 'Country'], drop_first=True)


In [None]:
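# Manipulate Features to minimize feature correlation and create new features
import numpy as np

# Calculate the correlation matrix over numeric columns only
corr_matrix = df.corr(numeric_only=True)
# Keep the upper triangle so each pair appears once and the diagonal is excluded
upper = corr_matrix.abs().where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
# List feature pairs with absolute correlation higher than 0.9
high_corr = upper.stack()
high_corr = high_corr[high_corr > 0.9]
print(high_corr)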


In [None]:
# Drop one feature from each highly correlated pair, if the check above flags any.
# 'UnitPrice' is only an example; the result goes to a new frame so `df` keeps all columns.
df_no_price = df.drop(columns=['UnitPrice'])


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = df[['Quantity', 'UnitPrice']].dropna()  # Example subset of numeric features
X = add_constant(X)  # Adds constant column for intercept
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

In [None]:
import pandas as pd
import numpy as np

# Sample dataset with correlated features (kept in its own variable so the retail `df` is not overwritten)
data = {
    'A': np.random.rand(100),
    'B': np.random.rand(100),
    'C': np.random.rand(100),
}

sample_df = pd.DataFrame(data)

# Introducing correlation between features
sample_df['B'] = sample_df['A'] * 0.9 + np.random.rand(100) * 0.1  # Making B correlated with A

# Compute the absolute correlation matrix
correlation_matrix = sample_df.corr().abs()

# Keep only the upper triangle so the diagonal (always 1.0) and mirrored pairs are ignored
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))

# Set a threshold for removing highly correlated features
threshold = 0.9

# Identify one feature from each highly correlated pair
highly_correlated_features = [column for column in upper.columns if (upper[column] > threshold).any()]

# Remove correlated features
df_reduced = sample_df.drop(columns=highly_correlated_features)

print("Dataset after removing highly correlated features:\n", df_reduced.head())


**Feature Importance using Random Forest:**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train a random forest model
rf = RandomForestClassifier()
rf.fit(X, y)

# Get feature importances
feature_importances = rf.feature_importances_

# Display feature importances
print("Feature Importances:\n", feature_importances)

# Select features with importance above a certain threshold (e.g., 0.2)
selected_features = X.columns[feature_importances > 0.2]
X_selected = X[selected_features]

print("\nDataset after feature selection:\n", X_selected.head())


**Recursive Feature Elimination (RFE):**

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Create a logistic regression model
model = LogisticRegression(max_iter=200)

# Initialize RFE and select top 2 features
rfe = RFE(estimator=model, n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

# Display the selected features
selected_columns = X.columns[rfe.support_]
print("Selected Features: ", selected_columns)

# Create a new DataFrame with selected features
X_selected = pd.DataFrame(X_rfe, columns=selected_columns)
print("\nDataset after RFE:\n", X_selected.head())


**Feature Selection using Lasso Regularization (L1):**

In [None]:
from sklearn.linear_model import Lasso
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train a Lasso model (L1 regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Get the feature coefficients
coefficients = lasso.coef_

# Identify the features with non-zero coefficients
selected_features = X.columns[coefficients != 0]
X_selected = X[selected_features]

print("Selected Features using Lasso: ", selected_features)
print("\nDataset after Lasso feature selection:\n", X_selected.head())


**Using Cross-Validation to Evaluate Feature Selection:**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Create a logistic regression model
model = LogisticRegression(max_iter=200)

# Evaluate the model using 10-fold cross-validation
scores = cross_val_score(model, X, y, cv=10)

print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", scores.mean())


##### What all feature selection methods have you used  and why?

Removing Highly Correlated Features:

Why?: Redundant features increase model complexity and the risk of overfitting.

How?: I computed the correlation matrix and removed one feature from each highly correlated pair.

Feature Importance using Random Forest:

Why?: It identifies which features contribute most to predictions, helping reduce unnecessary complexity.

How?: I used RandomForestClassifier to rank features by importance and kept those above an importance threshold.

Recursive Feature Elimination (RFE):

Why?: It repeatedly removes the least important feature until the requested number of features remains.

How?: I used RFE with a Logistic Regression model to rank and remove the less important features.

Lasso Regularization (L1):

Why?: The L1 penalty shrinks coefficients and drives those of less relevant features to exactly zero.

How?: I applied Lasso regression and selected the features with non-zero coefficients.

##### Which all features you found important and why?

Quantity: Directly impacts sales and helps understand demand.

UnitPrice: Affects revenue and profit margins.

CustomerID: Enables customer segmentation and retention strategies.

Country: Provides insights into geographical sales patterns.

InvoiceDate: Helps track trends over time and forecast sales.

StockCode: Crucial for analyzing product-level performance.

These features drive business decisions related to sales, customer behavior, inventory management, and market expansion.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

The cell below reloads the dataset and checks each column for missing values using `.isnull().sum()`.

In [None]:
import pandas as pd

# Load your dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Online Retail.xlsx - Online Retail.csv')

# Check for missing values
missing_data = df.isnull().sum()
print(missing_data)


### 6. Data Scaling

In [None]:
print(df.columns)


In [None]:
# Create 'TotalSales' from 'Quantity' and 'UnitPrice' (the price column in this dataset)
if 'Quantity' in df.columns and 'UnitPrice' in df.columns:
    df['TotalSales'] = df['Quantity'] * df['UnitPrice']
else:
    print("Required columns for 'TotalSales' not found")


In [None]:
from sklearn.preprocessing import StandardScaler

# List of columns to scale (use the actual columns you have)
numeric_columns = ['Quantity']  # You can add more columns if necessary, like 'TotalSales' if created

# Initialize the StandardScaler
scaler = StandardScaler()

# Apply scaling to the selected columns
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

# Check the scaled data
print(df.head())


##### Which method have you used to scale your data and why?

I used StandardScaler to scale the data because it standardizes features by removing the mean and scaling them to unit variance (mean = 0, standard deviation = 1). This is ideal for algorithms that are sensitive to feature scaling, like distance-based models or gradient-based models. It ensures that no single feature dominates due to differences in scale, helping improve model performance and convergence speed.
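
As a small check of what standardization does, the sketch below reproduces `StandardScaler` by hand on an invented column. One detail worth noting: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the two can differ slightly on small data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented single-column data, for illustration only
quantity = np.array([[1.0], [2.0], [6.0], [12.0], [24.0]])

# Library result
scaled = StandardScaler().fit_transform(quantity)

# Same result by hand: subtract the mean, divide by the population standard deviation
manual = (quantity - quantity.mean(axis=0)) / quantity.std(axis=0)

print("Scaled mean:", scaled.mean())
print("Scaled std:", scaled.std())
print("Matches manual computation:", np.allclose(scaled, manual))
```

In a modelling workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that test statistics do not influence the transformation.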

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


In [None]:
# ML Model - 1
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Online Retail.xlsx - Online Retail.csv')

# Create a 'TotalPrice' column by multiplying 'Quantity' and 'UnitPrice'
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Select features and target
X = df[['Quantity', 'UnitPrice']]  # Replace with your feature columns
y = df['TotalPrice']  # Replace with your target column



# Fit the Algorithm

# Predict on the model

**Split the Data into Training and Testing Sets**

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


**Initialize and Fit the Model**

In [None]:
# Initialize the model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)


**Make Predictions**

In [None]:
# Make predictions
y_pred = model.predict(X_test)


**Evaluate the Model**

In [None]:
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Calculate Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2}')

# Calculate Root Mean Squared Error
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')



**Visualizing Performance with a Score Chart:**

In [None]:
import matplotlib.pyplot as plt

# Metrics
metrics = ['MSE', 'MAE', 'R²', 'RMSE']
values = [mse, mae, r2, rmse]

# Create bar chart
plt.bar(metrics, values, color=['blue', 'green', 'red', 'purple'])
plt.title('Model Performance Metrics')
plt.ylabel('Score')
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

**Implementing Cross-Validation:**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)

# Print the cross-validation scores
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean CV Score: {cv_scores.mean()}')


 **Hyperparameter Tuning**

**GridSearchCV**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
model = SVC()

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters and score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Score: {grid_search.best_score_}')


**RandomizedSearchCV**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from scipy.stats import uniform
import numpy as np

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
model = SVC()

# Define the parameter distributions
param_dist = {
    'C': uniform(0.1, 9.9),  # Uniform distribution between 0.1 and 10 (loc=0.1, scale=9.9)
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'] + list(np.logspace(-3, 3, 50))
}

# Initialize RandomizedSearchCV
randomized_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', random_state=42)

# Fit the model
randomized_search.fit(X_train, y_train)

# Print the best parameters and score
print(f'Best Parameters: {randomized_search.best_params_}')
print(f'Best Score: {randomized_search.best_score_}')


**BayesianOptimization**

In [None]:
%pip install bayesian-optimization


In [None]:
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the function to optimize: mean cross-validated accuracy on the training set
# (scoring on the test set here would leak test data into the tuning loop)
def svm_evaluate(C, gamma):
    model = SVC(C=C, gamma=gamma)
    return cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()

# Define the parameter bounds
pbounds = {'C': (0.1, 10), 'gamma': (0.001, 1)}

# Initialize Bayesian Optimization
optimizer = BayesianOptimization(
    f=svm_evaluate,
    pbounds=pbounds,
    random_state=42,
)

# Perform optimization
optimizer.maximize(init_points=5, n_iter=25)

# Print the best parameters and score
print(f'Best Parameters: {optimizer.max["params"]}')
print(f'Best Score: {optimizer.max["target"]}')


In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

In [None]:
df.columns

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV: Tests all combinations of hyperparameters. Use when the search space is small and you need exhaustive testing, but it can be computationally expensive.

RandomizedSearchCV: Samples random combinations of hyperparameters. Use when the search space is large, and you want a faster, less computationally intensive option.

Bayesian Optimization: Uses past results to intelligently select the next set of hyperparameters to test. Use when you want to optimize efficiently with fewer evaluations, especially in large or complex search spaces.
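
The tuning examples above use an SVC on the Iris dataset. For a regression setup closer to ML Model 1, a grid search could look like the sketch below. Plain `LinearRegression` has no regularization strength to tune, so `Ridge` is swapped in here as an assumption; `X_demo` and `y_demo` are synthetic stand-ins, not the retail data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the retail features (invented values)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Exhaustive search over a small grid of regularization strengths
search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring='r2',
)
search.fit(X_tr, y_tr)

print('Best alpha:', search.best_params_['alpha'])
print('Mean CV R²:', round(search.best_score_, 3))
print('Held-out R²:', round(search.score(X_te, y_te), 3))
```

The held-out score is computed only once, after the search, so the test split plays no part in choosing `alpha`.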

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
df


In [None]:
# Visualizing evaluation Metric Score chart


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Online Retail.xlsx - Online Retail.csv')

# Create a 'TotalPrice' column by multiplying 'Quantity' and 'UnitPrice'
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Drop rows with missing Quantity or UnitPrice, then keep only rows where both are positive
df = df.dropna(subset=['Quantity', 'UnitPrice'])
df = df[df['Quantity'] > 0]
df = df[df['UnitPrice'] > 0]

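# Select features and target
# Note: TotalPrice = Quantity * UnitPrice, so these two features determine the target exactly (target leakage)
X = df[['UnitPrice', 'TotalPrice']]  # Features: 'UnitPrice' and 'TotalPrice'
y = df['Quantity']  # Target: Predict 'Quantity'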

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (important for SVR)
scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)
y_train_scaled = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1))

# Initialize the Support Vector Regression model
svr_model = SVR(kernel='rbf')  # Radial basis function kernel

# Train the model
svr_model.fit(X_train_scaled, y_train_scaled.ravel())

# Make predictions
y_pred_svr_scaled = svr_model.predict(X_test_scaled)
y_pred_svr = scaler_y.inverse_transform(y_pred_svr_scaled.reshape(-1, 1))

# Evaluate the model
mse_svr = mean_squared_error(y_test, y_pred_svr)
mae_svr = mean_absolute_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)
rmse_svr = np.sqrt(mse_svr)

# Metrics for visualization
metrics = ['MSE', 'MAE', 'R²', 'RMSE']
values = [mse_svr, mae_svr, r2_svr, rmse_svr]

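# Plot the evaluation metrics as a bar chart
plt.figure(figsize=(8, 5))
plt.bar(metrics, values, color=['tab:blue', 'tab:orange', 'tab:green', 'tab:red'])
plt.title('SVR Evaluation Metrics')
plt.xlabel('Metric')
plt.ylabel('Value')
plt.show()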


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV: Ideal when the hyperparameter space is small, and computational resources are sufficient.

RandomizedSearchCV: Suitable for larger hyperparameter spaces where a balance between exploration and computational efficiency is needed.

Bayesian Optimization: Best for scenarios where each evaluation is expensive, and the goal is to find the optimal hyperparameters with minimal evaluations.
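As a rough illustration of the first option applied to the SVR model from this section, the sketch below tunes an SVR with GridSearchCV. It is only a sketch under stated assumptions: the data is synthetic (not the retail dataset), and the names `X_demo`, `y_demo`, `svr_pipeline`, `svr_param_grid`, and `svr_search`, as well as the grid values, are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 300 rows, 2 features, noisy nonlinear target
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling sits inside the pipeline so each CV fold is scaled using only its own training part
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel='rbf'))

# Small illustrative grid; make_pipeline names the SVR step 'svr'
svr_param_grid = {
    'svr__C': [0.1, 1, 10],
    'svr__epsilon': [0.01, 0.1],
    'svr__gamma': ['scale', 0.1],
}

# 5-fold CV, scored by negative RMSE (higher is better)
svr_search = GridSearchCV(svr_pipeline, svr_param_grid, cv=5,
                          scoring='neg_root_mean_squared_error')
svr_search.fit(X_tr, y_tr)

print(f'Best Parameters: {svr_search.best_params_}')
print(f'Best CV RMSE: {-svr_search.best_score_:.4f}')
print(f'Test R²: {r2_score(y_te, svr_search.predict(X_te)):.4f}')
```

On a dataset with hundreds of thousands of rows, an RBF-kernel SVR can be slow to fit, so running the search on a random subsample first may be a reasonable starting point.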

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

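Accuracy Score: Measures the proportion of correct predictions.

Precision and Recall: Evaluate the model's performance in terms of false positives and false negatives.

F1-Score: The harmonic mean of precision and recall, providing a balance between the two.

ROC-AUC Curve: Illustrates the trade-off between true positive rate and false positive rate.

These four metrics apply to classification models. ML Model - 2 is a regressor (SVR), so any improvement from tuning should be judged on the MSE, MAE, RMSE, and R² values computed for it above.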
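For reference, the metrics listed above can be computed with scikit-learn as in the minimal sketch below. The labels and scores are a hand-made binary toy example (an assumption, not output from this project), and the names `y_true_demo`, `y_pred_demo`, and `y_score_demo` are hypothetical.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy binary classification results (assumption: invented for illustration)
y_true_demo = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual classes
y_pred_demo = [0, 1, 1, 1, 0, 0, 1, 0]                    # predicted classes
y_score_demo = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]   # predicted probability of class 1

acc = accuracy_score(y_true_demo, y_pred_demo)      # share of correct predictions
prec = precision_score(y_true_demo, y_pred_demo)    # TP / (TP + FP)
rec = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)
f1 = f1_score(y_true_demo, y_pred_demo)             # harmonic mean of precision and recall
auc = roc_auc_score(y_true_demo, y_score_demo)      # area under the ROC curve

print(f'Accuracy: {acc}, Precision: {prec}, Recall: {rec}, F1: {f1}, ROC-AUC: {auc}')
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so accuracy, precision, recall, and F1 all equal 0.75, while the ROC-AUC is 0.9375 because 15 of the 16 positive/negative pairs are ranked correctly by the scores.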

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Online Retail.xlsx - Online Retail.csv')

# Create a 'TotalPrice' column by multiplying 'Quantity' and 'UnitPrice'
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Select features and target
X = df[['Quantity', 'UnitPrice']]  # Replace with your feature columns
y = df['TotalPrice']

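# Fit the Algorithm
# Assumption: Linear Regression with an 80/20 split, matching the model named in the final-model explanation
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)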

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Calculate Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2}')

# Calculate Root Mean Squared Error
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')



#### 2. Cross- Validation & Hyperparameter Tuning

**Cross- Validation**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)

# Print the cross-validation scores
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean CV Score: {cv_scores.mean()}')


**GridSearchCV**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
model = SVC()

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters and score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Score: {grid_search.best_score_}')


**RandomizedSearchCV**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
from scipy.stats import uniform

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model
model = SVC()

# Define the parameter distributions
param_dist = {
    'C': uniform(loc=0.1, scale=9.9),  # Uniform distribution on [0.1, 10]: scipy uses loc and scale (width)
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'] + list(np.logspace(-3, 3, 50))
}

# Initialize RandomizedSearchCV
randomized_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', random_state=42)

# Fit the model
randomized_search.fit(X_train, y_train)

# Print the best parameters and score
print(f'Best Parameters: {randomized_search.best_params_}')
print(f'Best Score: {randomized_search.best_score_}')

**BayesianOptimization**

In [None]:
from bayes_opt import BayesianOptimization
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the function to optimize: mean 5-fold CV accuracy on the training set,
# so the test set stays untouched during tuning
def svm_evaluate(C, gamma):
    model = SVC(C=C, gamma=gamma)
    return cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()

# Define the parameter bounds
pbounds = {'C': (0.1, 10), 'gamma': (0.001, 1)}

# Initialize Bayesian Optimization
optimizer = BayesianOptimization(
    f=svm_evaluate,
    pbounds=pbounds,
    random_state=42,
)

# Perform optimization
optimizer.maximize(init_points=5, n_iter=25)

# Print the best score and parameters found
print(f'Best Result: {optimizer.max}')

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV: Tests all combinations of hyperparameters. Use when the search space is small and you need exhaustive testing, but it can be computationally expensive.

RandomizedSearchCV: Samples random combinations of hyperparameters. Use when the search space is large, and you want a faster, less computationally intensive option.

Bayesian Optimization: Uses past results to intelligently select the next set of hyperparameters to test. Use when you want to optimize efficiently with fewer evaluations, especially in large or complex search spaces.
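To make the cost difference concrete, the sketch below counts how many model fits the grid and randomized searches would need, using search spaces similar to the ones in the cells above. The variable names (`n_grid`, `grid_fits`, `random_fits`, and so on) are hypothetical, and the counts exclude the final refit on the full training set.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

cv_folds = 5  # same number of folds as in the searches above

# Grid search: every combination is evaluated
demo_param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}
n_grid = len(ParameterGrid(demo_param_grid))   # 3 * 2 * 2 combinations
grid_fits = n_grid * cv_folds                  # one fit per combination per fold

# Randomized search: a fixed budget of sampled combinations, however large the space
demo_param_dist = {
    'C': uniform(loc=0.1, scale=9.9),          # continuous range [0.1, 10]
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
}
sampled = list(ParameterSampler(demo_param_dist, n_iter=20, random_state=42))
random_fits = len(sampled) * cv_folds

print(f'Grid search: {n_grid} combinations, {grid_fits} fits')
print(f'Randomized search: {len(sampled)} samples, {random_fits} fits')
```

Adding one more value to any grid dimension multiplies the number of grid fits, whereas the randomized budget stays at `n_iter * cv_folds` regardless of how many parameters are searched.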

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

ML Model - 1, because it is easy to analyze and gives a clear, understandable result.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In this case, I used Linear Regression to predict TotalPrice from the features Quantity and UnitPrice. Because the model is linear, its feature importance can be derived directly from the coefficients. A dedicated model explainability tool is mainly needed for more complex models such as a Random Forest Regressor.
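A minimal sketch of reading feature importance from Linear Regression coefficients is shown below. It assumes synthetic data with the same column names as the project (the values are randomly generated, not taken from the retail dataset), and the names `demo`, `lin_model`, and `importance` are hypothetical. The features are standardized first so that the coefficient magnitudes are comparable across features with different units.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): random quantities and prices
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Quantity': rng.integers(1, 50, size=500),
    'UnitPrice': rng.uniform(0.5, 20.0, size=500),
})
demo['TotalPrice'] = demo['Quantity'] * demo['UnitPrice']

features = ['Quantity', 'UnitPrice']

# Standardize so each coefficient reflects the effect of a one-standard-deviation change
X_std = StandardScaler().fit_transform(demo[features])
lin_model = LinearRegression().fit(X_std, demo['TotalPrice'])

# Rank features by absolute standardized coefficient
importance = pd.Series(np.abs(lin_model.coef_), index=features).sort_values(ascending=False)
print(importance)
```

The ranking produced here depends on the spread of each feature in the data, so the result on the synthetic sample says nothing about which feature matters more in the real transactions.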

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
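
The two steps above (saving the model, then reloading it for a sanity check) could look roughly like the sketch below. It assumes a Linear Regression model trained on synthetic data. The file name `retail_model_demo.joblib` and the variables `fitted`, `reloaded`, and `X_unseen` are hypothetical and would be replaced by the project's own model and data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in model (assumption): two features, target is their product
rng = np.random.default_rng(1)
X_demo = rng.uniform(1, 20, size=(200, 2))
y_demo = X_demo[:, 0] * X_demo[:, 1]
fitted = LinearRegression().fit(X_demo, y_demo)

# Save the fitted model to disk (hypothetical file name in the temp directory)
model_path = os.path.join(tempfile.gettempdir(), 'retail_model_demo.joblib')
joblib.dump(fitted, model_path)

# Load it back and predict on rows the model has not seen
reloaded = joblib.load(model_path)
X_unseen = np.array([[3.0, 2.5], [10.0, 1.25]])
same = np.allclose(fitted.predict(X_unseen), reloaded.predict(X_unseen))

print(f'Reloaded model reproduces the original predictions: {same}')
```

If preprocessing such as scaling is part of the workflow, saving a single fitted pipeline rather than the bare model keeps the preprocessing and the model together at load time.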

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In summary, we used Linear Regression and Random Forest Regressor to predict TotalPrice based on Quantity and UnitPrice. Linear Regression provided simple feature importance based on coefficients, while Random Forest gave more robust insights using feature importance derived from decision trees.

For model explainability, we discussed tools like SHAP and LIME, which help understand how individual features influence predictions, especially for more complex models like Random Forest.

From a business perspective, Quantity plays a more significant role in predicting TotalPrice than UnitPrice, which can inform pricing and sales strategies. The combination of accurate models and explainability tools allows for data-driven decision-making and enhances business insights.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***