<a href="https://colab.research.google.com/github/SSubhashReddy/AI-ML-project/blob/main/Copy_of_Sample_ML_Submission_Template_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Local Food Wastage Management System



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team
##### **Team Member 1 -** S.Venkata Subhash Reddy
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

The Local Food Wastage Management System is designed to minimize food waste by creating an efficient channel for surplus food collection, redistribution, and disposal. It connects restaurants, hotels, households, supermarkets, and event organizers with NGOs, food banks, and needy communities to ensure edible surplus food reaches beneficiaries instead of ending up in landfills.

The system operates through a centralized digital platform (web or mobile app) where donors can register surplus food details, including type, quantity, freshness, and pickup time. NGOs or collection agents are notified in real time, enabling quick allocation and transport to targeted locations. GPS integration helps track pickup and delivery, ensuring transparency and accountability. Donors can receive updates on the status of their contributions, and NGOs can manage their requests and delivery schedules effectively.

To maintain hygiene and safety, the system follows standard food handling guidelines, ensuring the collected food is fit for consumption. Inedible food waste is diverted for composting or bioenergy production, promoting environmental sustainability.

**Key features include:**

*   User Registration & Authentication for donors, NGOs, and administrators.
*   Food Donation Scheduling with automated matching based on location and urgency.
*   Real-Time Tracking & Notifications to ensure timely collection and delivery.
*   Analytics Dashboard for tracking total donations, beneficiaries served, and waste reduced.
*   Waste-to-Energy/Compost Integration for inedible food management.
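
The automated matching feature could be sketched roughly as follows. This is a minimal illustration, not the system's actual logic: the `Donation` and `Receiver` classes, their fields, and the rule "same city, soonest expiry first" are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Donation:
    # Hypothetical fields; the real schema may differ.
    food_name: str
    city: str
    expires_at: datetime


@dataclass
class Receiver:
    # Hypothetical fields; the real schema may differ.
    name: str
    city: str


def match_donations(donations, receivers):
    """Pair each donation with a receiver in the same city, most urgent first."""
    # Index receivers by city so each lookup is a dictionary access.
    by_city = {}
    for receiver in receivers:
        by_city.setdefault(receiver.city, []).append(receiver)

    matches = []
    # "Urgency" here simply means "expires soonest".
    for donation in sorted(donations, key=lambda d: d.expires_at):
        candidates = by_city.get(donation.city, [])
        if candidates:
            # Rotate the list so receivers in a city take turns.
            receiver = candidates.pop(0)
            candidates.append(receiver)
            matches.append((donation.food_name, receiver.name))
    return matches


donations = [
    Donation("Rice", "Pune", datetime(2025, 3, 2, 18, 0)),
    Donation("Soup", "Pune", datetime(2025, 3, 1, 12, 0)),
    Donation("Bread", "Delhi", datetime(2025, 3, 1, 9, 0)),
]
receivers = [Receiver("Food Bank A", "Pune"), Receiver("Shelter B", "Pune")]

matches = match_donations(donations, receivers)
print(matches)
```

Donations with no receiver in the same city (the Delhi item here) are left unmatched; a real system would presumably queue or escalate them.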

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Food wastage is a significant issue: many households and restaurants discard surplus food while many people face food insecurity. This project aims to develop a Local Food Wastage Management System in which:

*   Restaurants and individuals can list surplus food.
*   NGOs or individuals in need can claim the food.
*   A SQL database stores available food details and locations.
*   A Streamlit app enables interaction, filtering, CRUD operations, and visualization.
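
The list/claim workflow maps naturally onto SQL create, read, update, and delete statements. Below is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the `food_listings` table and its columns are hypothetical and chosen only for illustration, and the project's actual schema may differ.

```python
import sqlite3

# In-memory database for illustration; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE food_listings (
        food_id   INTEGER PRIMARY KEY,
        food_name TEXT NOT NULL,
        quantity  INTEGER NOT NULL,
        location  TEXT NOT NULL,
        claimed   INTEGER NOT NULL DEFAULT 0
    )
    """
)

# Create: providers list surplus food.
conn.executemany(
    "INSERT INTO food_listings (food_name, quantity, location) VALUES (?, ?, ?)",
    [("Rice", 20, "Pune"), ("Soup", 8, "Delhi")],
)

# Read: filter unclaimed listings by location, as an app filter might.
available = conn.execute(
    "SELECT food_name, quantity FROM food_listings "
    "WHERE location = ? AND claimed = 0",
    ("Pune",),
).fetchall()

# Update: mark a listing as claimed.
conn.execute("UPDATE food_listings SET claimed = 1 WHERE food_name = ?", ("Rice",))

# Delete: remove a listing that was withdrawn.
conn.execute("DELETE FROM food_listings WHERE food_name = ?", ("Soup",))

remaining = conn.execute("SELECT food_name, claimed FROM food_listings").fetchall()
conn.close()

print(available)
print(remaining)
```

Parameterised queries (the `?` placeholders) keep user input from the app out of the SQL text itself.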


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure each algorithm answers the following format.

*   Explain the ML model used and its performance using an evaluation metric score chart.
*   Cross-Validation & Hyperparameter Tuning
*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.
*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

file_path = '/content/drive/MyDrive/food.csv.xlsx'

try:
    # Use read_excel for .xlsx files
    df = pd.read_excel(file_path)
    print("File loaded successfully!")
    print(df.head())  # Preview first few rows
except FileNotFoundError:
    print(f"File not found at {file_path}. Please check the path.")
except Exception as e:
    print("An error occurred:", e)
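
Because later cells load the same file repeatedly, one option is to wrap loading in a small helper. The sketch below is an assumption about how that might look, not code from the original notebook: `load_table` is a hypothetical name, and it raises on a missing file rather than printing a message, so later cells never run against an undefined `df`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Load a CSV or Excel file into a DataFrame, chosen by file extension."""
    path = Path(path)
    if not path.exists():
        # Fail loudly so downstream cells do not run without data.
        raise FileNotFoundError(f"No file at {path}")
    suffix = path.suffix.lower()
    if suffix == ".csv":
        df = pd.read_csv(path)
    elif suffix in {".xlsx", ".xls"}:
        df = pd.read_excel(path)
    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    # Normalise column names once, at load time.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    return df


# Demonstration on a throwaway CSV file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("Food Name,Quantity\nRice,10\nSoup,5\n")
    demo = load_table(demo_path)

print(demo.columns.tolist())
```

With such a helper, each chart cell could call `load_table(file_path)` instead of repeating the extension check and the column clean-up.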


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows = df.shape[0]
num_columns = df.shape[1]

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cmap='viridis', cbar=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

File type & format → CSV, Excel, JSON, etc.

Number of rows & columns → shape of the data.

Column names & data types → numeric, text, date, etc.

Missing values → how much data is incomplete.

Basic statistics → min, max, mean, counts.

Sample records → first few rows for a quick look.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
column_names = df.columns
print("Column Names:")
print(column_names)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Variable → column name

DataType → numeric, object (text), datetime, etc.

NonNullCount → how many non-missing values are there

MissingCount → how many are missing

UniqueValues → number of distinct values in the column
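
The fields listed above can be assembled into a single summary table. The following sketch uses a small synthetic DataFrame in place of the real dataset; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the real dataset.
sample = pd.DataFrame(
    {
        "food_name": ["Rice", "Soup", None, "Rice"],
        "quantity": [10, 5, 7, np.nan],
    }
)

# One row per variable, one column per property described above.
summary = pd.DataFrame(
    {
        "DataType": sample.dtypes.astype(str),
        "NonNullCount": sample.notna().sum(),
        "MissingCount": sample.isna().sum(),
        "UniqueValues": sample.nunique(),
    }
)
summary.index.name = "Variable"
print(summary)
```

Replacing `sample` with the loaded `df` would produce the same table for the actual data.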

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
print("Unique Values for Each Variable:")
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import pandas as pd

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# 1. Load dataset (auto detect CSV/Excel)
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# 2. Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# 3. Remove duplicate rows
df.drop_duplicates(inplace=True)

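# 4. Handle missing values
#    - Numeric columns: fill with mean
#    - Other columns: fill with mode (most frequent value)
for col in df.columns:
    if df[col].isna().all():
        continue  # nothing to impute from
    if pd.api.types.is_numeric_dtype(df[col]):
        # Assign back: inplace=True on df[col] is deprecated and
        # stops updating df under pandas copy-on-write
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# 5. Convert non-numeric columns that hold dates (if possible)
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        continue  # pd.to_datetime would read numbers as epoch offsets
    try:
        df[col] = pd.to_datetime(df[col])
    except (ValueError, TypeError):
        pass  # Not a date column; leave unchanged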

# 6. Encode categorical variables (optional: for ML tasks)
df = pd.get_dummies(df, drop_first=True)

# 7. Final check
print("Dataset is ready for analysis!")
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
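
One thing to watch when converting columns to datetime automatically: `pd.to_datetime` accepts integers and interprets them as nanoseconds since 1970-01-01, so a numeric column such as a quantity is converted without raising an error. The sketch below demonstrates this on synthetic data and shows one possible guard; the column names and values are made up.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "quantity": [10, 25],
        "timestamp": ["2025-03-01 10:00", "2025-03-02 15:30"],
        "food_name": ["Rice", "Soup"],
    }
)

# Integers are read as nanoseconds since the epoch, so this "succeeds".
as_dates = pd.to_datetime(sample["quantity"])
print(as_dates.dt.year.tolist())

# Guard: skip numeric columns and keep a conversion only if every value parsed.
converted = sample.copy()
for col in converted.columns:
    if pd.api.types.is_numeric_dtype(converted[col]):
        continue
    parsed = pd.to_datetime(converted[col], errors="coerce")
    if parsed.notna().all():
        converted[col] = parsed

print(converted.dtypes)
```

Only `timestamp` ends up as a datetime column; `quantity` stays numeric and `food_name` stays text.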


### What all manipulations have you done and insights you found?

Data Shape: e.g., “Dataset has 5,000 rows and 12 columns.”

Top Contributors to Food Waste: e.g., “Restaurants account for 45% of surplus food donations.”

Time Patterns: e.g., “Peak donations occur between 7 PM and 9 PM.”

Food Type Trends: e.g., “Bakery items form the largest share (35%) of donations.”

Waste Reduction Potential: e.g., “If redistribution improves by 15%, landfill waste could be cut by 2 tons/month.”

Geographical Insights: e.g., “Zone A contributes more fresh produce, while Zone C has more packaged goods.”

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset (auto detect CSV/Excel)
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Example: Top 10 most donated food items
plt.figure(figsize=(10,6))
sns.countplot(
    data=df,
    y='food_name',  # food item column
    order=df['food_name'].value_counts().head(10).index,
    palette='viridis'
)
plt.title("Top 10 Most Donated Food Items", fontsize=16)
plt.xlabel("Number of Donations")
plt.ylabel("Food Item")
plt.show()

##### 1. Why did you pick the specific chart?

Horizontal bar charts clearly display ranked categorical data, making it easy to compare donation counts across food items.

##### 2. What is/are the insight(s) found from the chart?

Rice and Soup are the most donated items, while Fruits are the least among the top 10.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — helps plan storage, logistics, and targeted donation drives for high-demand items.

No negative growth — all items are receiving donations, just in varying amounts.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Ensure the date column is in datetime format
# ('timestamp' is the date column in this dataset)
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

# Drop rows without valid dates
df = df.dropna(subset=['timestamp'])

# Group by date and count donations
daily_donations = df.groupby(df['timestamp'].dt.date).size()

# Plot time-series
plt.figure(figsize=(12,6))
sns.lineplot(x=daily_donations.index, y=daily_donations.values, marker='o', color='orange')
plt.title("Daily Food Donations Trend", fontsize=16)
plt.xlabel("Date")
plt.ylabel("Number of Donations")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To show donation trends over time, highlighting peaks and dips clearly.

##### 2. What is/are the insight(s) found from the chart?

Donations fluctuate significantly, with notable peaks around March, July, and October.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — helps schedule campaigns during high-donation months.

Negative — sharp dips suggest potential supply issues or donor disengagement during certain periods.
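
Grouping by `dt.date` only returns dates that have at least one donation, so days with none do not appear as zeros in the line. A sketch of an alternative on synthetic timestamps is shown below: `resample("D")` keeps empty days, and a rolling mean smooths the series. The 2-day window is an arbitrary choice for this tiny example.

```python
import pandas as pd

# Synthetic timestamps standing in for the real 'timestamp' column.
stamps = pd.to_datetime(
    ["2025-03-01 09:00", "2025-03-01 17:30", "2025-03-03 12:00", "2025-03-04 08:15"]
)
events = pd.Series(1, index=stamps)

# Resampling by calendar day keeps days with zero donations in the series.
daily_counts = events.resample("D").sum()

# A rolling mean smooths day-to-day noise.
smoothed = daily_counts.rolling(window=2, min_periods=1).mean()

print(daily_counts.tolist())
print(smoothed.tolist())
```

Here 2 March has no donations and shows up as an explicit zero, which makes dips easier to tell apart from gaps in the data.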

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# 'type' holds the donor category in this dataset
donor_counts = df['type'].value_counts()

# Plot pie chart
plt.figure(figsize=(8,8))
plt.pie(donor_counts, labels=donor_counts.index, autopct='%1.1f%%', startangle=140)
plt.title("Donations by Donor Category", fontsize=16)
plt.axis('equal')  # Equal aspect ratio ensures the pie is a circle
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is ideal to compare proportional contributions from each donor category in a visually clear way.

##### 2. What is/are the insight(s) found from the chart?

Supermarkets contribute the highest share of donations (26.2%).

Grocery Stores (25.6%) and Restaurants (24.6%) follow closely.

Catering Services contribute the least (23.6%), but still make up a significant share.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Highlights balanced contributions among donor categories, reducing over-reliance on a single source.

Negative: Slightly lower contributions from Catering Services might indicate untapped potential for increasing supply.
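
Since the category shares are close to each other, it can help to tabulate them rather than read them off the pie. A minimal sketch on made-up category labels follows; the values are synthetic and do not reproduce the percentages reported above.

```python
import pandas as pd

# Synthetic donor categories standing in for the real 'type' column.
provider_types = pd.Series(
    ["Supermarket", "Supermarket", "Restaurant", "Grocery Store"], name="type"
)

# Percentage share per category, largest first.
shares = provider_types.value_counts(normalize=True).mul(100).round(1)

# Gap between the largest and smallest share, in percentage points.
spread = shares.max() - shares.min()

print(shares)
print(spread)
```

A small spread supports the reading that contributions are balanced across categories.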

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# 'timestamp' is the datetime column in this dataset
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

# Drop rows without valid datetime
df = df.dropna(subset=['timestamp'])

# Extract weekday and hour
df['weekday'] = df['timestamp'].dt.day_name()
df['hour'] = df['timestamp'].dt.hour

# Create pivot table
pivot_data = df.pivot_table(index='weekday', columns='hour', values='food_name', aggfunc='count')

# Reorder weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pivot_data = pivot_data.reindex(weekday_order)

# Plot heatmap
plt.figure(figsize=(12,6))
sns.heatmap(pivot_data, cmap='YlOrRd', linewidths=0.5, annot=True, fmt='.0f')
plt.title("Donations Heatmap by Weekday and Hour", fontsize=16)
plt.xlabel("Hour of Day")
plt.ylabel("Day of Week")
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is the best choice for spotting patterns across two dimensions (weekday and hour) simultaneously. It helps identify peak donation times quickly.

##### 2. What is/are the insight(s) found from the chart?

The highest donation count (16) occurs on Thursday at 15:00.

Other notable peaks: Monday at 9:00 & 10:00, Thursday at 3:00 and 8:00, Friday at 4:00, 15:00, and 21:00.

Donations are generally lower in the early morning (2:00–6:00) and late night (21:00–23:00), except for Friday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Knowing peak hours can help allocate volunteers and transportation more efficiently, reducing waste from delayed pickups.

Negative: If storage is inadequate, high peak donations (like Thursday afternoons) could lead to spoilage before redistribution.
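
Peak slots can also be read programmatically instead of from the heatmap annotations. The sketch below counts donations per weekday and hour on synthetic timestamps and picks the busiest slot; the data is made up for illustration.

```python
import pandas as pd

# Synthetic timestamps: two on a Thursday at 15h, one Friday, one Monday.
stamps = pd.to_datetime(
    ["2025-03-06 15:10", "2025-03-06 15:45", "2025-03-07 09:00", "2025-03-10 15:20"]
)
frame = pd.DataFrame({"weekday": stamps.day_name(), "hour": stamps.hour})

# Count donations per (weekday, hour) pair, busiest first.
slot_counts = frame.groupby(["weekday", "hour"]).size().sort_values(ascending=False)

# idxmax returns the (weekday, hour) label of the largest count.
peak_slot = slot_counts.idxmax()

print(peak_slot, slot_counts.max())
```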

#### Chart - 5

In [None]:
# Chart - 5 code: previews the data and column names; no chart is drawn in this cell
import pandas as pd

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

print("First 5 rows:\n", df.head(), "\n")
print("Column names:\n", df.columns.tolist())


##### 1. Why did you pick the specific chart?

To quickly spot relationships between numerical variables like Quantity, provider info, and meal details.

##### 2. What is/are the insight(s) found from the chart?

Strong link between provider and provider type.

Certain meal/food types may have higher quantities.

IDs have little predictive value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Helps target donation drives and plan supply better.

Negative: Over-reliance on few providers, risk of food wastage if expiry dates are close.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import pandas as pd

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

print("This visualization cannot be generated as the dataset does not contain latitude and longitude columns.")

##### 1. Why did you pick the specific chart?

Originally chosen to show geographical distribution of food providers/receivers for location-based analysis.

##### 2. What is/are the insight(s) found from the chart?

Not possible to generate since the dataset lacks latitude/longitude.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: If location data is added, mapping could optimize delivery routes and resource allocation.

Negative: Current lack of location data limits ability to make location-based decisions, potentially causing inefficiencies.
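
If latitude and longitude columns were added, a first step toward location-based decisions could be computing distances between providers and receivers. The sketch below uses the haversine formula with only the standard library; the coordinates and receiver names are hypothetical, since the dataset has no location data.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (
        sin(dlat / 2) ** 2
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    )
    return 2 * earth_radius_km * asin(sqrt(a))


# Hypothetical coordinates, chosen so the result is easy to check by hand.
provider = (0.0, 0.0)
receivers = {"Receiver A": (0.0, 1.0), "Receiver B": (0.0, 2.0)}

# Pick the receiver closest to the provider.
nearest = min(receivers, key=lambda name: haversine_km(*provider, *receivers[name]))

print(nearest, round(haversine_km(*provider, *receivers[nearest]), 1))
```

Straight-line distance is only a rough proxy for travel time, so route planning would still need road data.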

#### Chart - 7

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Replace with your actual column names
date_col = 'timestamp'
category_col = 'type'

# Ensure date is in datetime format
df[date_col] = pd.to_datetime(df[date_col], errors='coerce')

# Drop rows without valid date or category
df = df.dropna(subset=[date_col, category_col])

# Create month-year column
df['month_year'] = df[date_col].dt.to_period('M').astype(str)

# Group and pivot data for stacked bar
pivot_df = df.groupby(['month_year', category_col]).size().unstack(fill_value=0)

# Plot stacked bar chart
pivot_df.plot(kind='bar', stacked=True, figsize=(12,6))
plt.title("Donations by Donor Type Over Time", fontsize=16)
plt.xlabel("Month-Year")
plt.ylabel("Number of Donations")
plt.xticks(rotation=45)
plt.legend(title="Donor Type")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart clearly shows the volume of donations over time while also comparing contributions by donor type in each month.

##### 2. What is/are the insight(s) found from the chart?

March 2025 shows an extreme spike in donations across all donor types, especially catering services and grocery stores.

Other months have relatively stable, lower donation levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Identifying donation peaks helps in planning storage, distribution, and manpower needs. The March spike may correspond to seasonal events or campaigns worth repeating.

Negative: High dependency on irregular spikes could lead to inconsistent supply. If demand remains steady, months with low donations may risk shortages.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Replace with your actual column names
donor_col = 'name'       # column containing donor names
quantity_col = 'quantity'      # column containing donation quantity

# Drop rows with missing donor or quantity
df = df.dropna(subset=[donor_col, quantity_col])

# Convert quantity to numeric if needed
df[quantity_col] = pd.to_numeric(df[quantity_col], errors='coerce')

# Group by donor and get top 10
top_donors = df.groupby(donor_col)[quantity_col].sum().sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(10,6))
top_donors.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title("Top 10 Donors by Quantity Donated", fontsize=16)
plt.xlabel("Donor Name")
plt.ylabel("Total Quantity")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing quantities across a small number of categories—in this case, the top 10 donors. It clearly shows the ranking and scale of donations from each donor.

##### 2. What is/are the insight(s) found from the chart?

Williams PLC and Miller Ltd are the top two donors, with quantities significantly higher than the rest.

The contribution gap between the top donors and the bottom donors (e.g., Bowman LLC) is notable.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Identifying key donors allows targeted engagement strategies, loyalty programs, and recognition campaigns to maintain their high contribution levels.

Negative: Heavy reliance on a few top donors could pose a risk—if they stop contributing, total donations could drop significantly.
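
The reliance on top donors can be quantified as a cumulative share of total quantity. A minimal sketch on synthetic data follows; the donor names and quantities are made up.

```python
import pandas as pd

# Synthetic donations standing in for the real 'name' and 'quantity' columns.
donations = pd.DataFrame(
    {"name": ["A", "A", "B", "C", "D"], "quantity": [40, 20, 25, 10, 5]}
)

# Total quantity per donor, largest first.
totals = donations.groupby("name")["quantity"].sum().sort_values(ascending=False)

# Running share of the overall total as donors are added in rank order.
cumulative_share = totals.cumsum() / totals.sum()
top2_share = cumulative_share.iloc[1]

print(cumulative_share)
```

In this toy example the top two donors account for 85% of the total; the same calculation on the real data would show how concentrated donations actually are.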

#### Chart - 9

In [None]:
# Chart - 9 visualization code
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Replace with your actual column names
food_col = 'food_name'   # column containing food item names
quantity_col = 'quantity'  # column containing donation quantity

# Drop missing values for these columns
df = df.dropna(subset=[food_col, quantity_col])

# Convert quantity to numeric if needed
df[quantity_col] = pd.to_numeric(df[quantity_col], errors='coerce')

# Group by food item and get top 10
top_foods = df.groupby(food_col)[quantity_col].sum().sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(10,6))
top_foods.plot(kind='bar', color='lightgreen', edgecolor='black')
plt.title("Top 10 Food Items Donated", fontsize=16)
plt.xlabel("Food Item")
plt.ylabel("Total Quantity")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is effective for comparing donation quantities across different food items. It clearly highlights which items are donated in the largest total quantity.

##### 2. What is/are the insight(s) found from the chart?

Rice is the top donated item, followed by Soup and Dairy.

Items like Fish and Fruits have lower donation quantities compared to staples like Rice and Bread.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: High donations of staple foods like Rice ensure basic food needs are met consistently.

Negative: Lower donations of proteins (e.g., Fish, Chicken) and fresh produce (Fruits, Vegetables) may limit nutritional diversity for recipients, suggesting a need for targeted donation drives.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Replace with your actual column names
date_col = 'timestamp'   # column with donation dates
quantity_col = 'quantity'    # column with donation quantities

# Drop missing values for these columns
df = df.dropna(subset=[date_col, quantity_col])

# Ensure date column is datetime
df[date_col] = pd.to_datetime(df[date_col], errors='coerce')

# Convert quantity to numeric
df[quantity_col] = pd.to_numeric(df[quantity_col], errors='coerce')

# Remove rows with NaT in date
df = df.dropna(subset=[date_col])

# Group by exact timestamp (rows sharing a timestamp are summed);
# group by df[date_col].dt.date instead for true daily totals
daily_donations = df.groupby(date_col)[quantity_col].sum()

# Plot
plt.figure(figsize=(12,6))
plt.plot(daily_donations.index, daily_donations.values, marker='o', linestyle='-', color='blue')
plt.title("Quantity Donated Over Time", fontsize=16)
plt.xlabel("Date")
plt.ylabel("Total Quantity")
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A time series line plot is ideal for tracking donation quantity patterns over time, making it easy to spot trends, fluctuations, and seasonal spikes.

##### 2. What is/are the insight(s) found from the chart?

Donations fluctuate significantly, with occasional large peaks reaching around 90–95 units.

There is no consistent upward or downward trend, but periodic spikes suggest seasonal or event-driven donations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Identifying peak donation periods can help schedule campaigns during high-engagement months to maximize collection.

Negative: Inconsistent donations may create supply shortages in certain periods, affecting recipients; this suggests the need for targeted outreach in low-donation months.
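
Note that grouping on a full timestamp and grouping on the calendar date give different series, which affects how the peaks should be read. The sketch below shows the difference on synthetic data; the values are made up.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2025-03-01 09:00", "2025-03-01 17:30", "2025-03-02 12:00"]
        ),
        "quantity": [10, 15, 7],
    }
)

# Grouping on the full timestamp: one point per distinct date-and-time.
per_timestamp = sample.groupby("timestamp")["quantity"].sum()

# Grouping on the normalised (midnight) date: one point per calendar day.
per_day = sample.groupby(sample["timestamp"].dt.normalize())["quantity"].sum()

print(len(per_timestamp), per_timestamp.tolist())
print(len(per_day), per_day.tolist())
```

The two donations on 1 March stay separate in the first series and are combined into a single daily total of 25 in the second.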

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Replace with your actual column names
date_col = 'timestamp'   # column with donation dates
quantity_col = 'quantity'    # column with donation quantities

# Drop missing values
df = df.dropna(subset=[date_col, quantity_col])

# Ensure correct data types
df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
df[quantity_col] = pd.to_numeric(df[quantity_col], errors='coerce')

# Remove invalid rows
df = df.dropna(subset=[date_col])

# Extract the day of the week as a name (e.g. 'Monday')
df['day_of_week'] = df[date_col].dt.day_name()

# Group by day of week
donations_by_day = df.groupby('day_of_week')[quantity_col].sum()

# Reorder days
days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
donations_by_day = donations_by_day.reindex(days_order)

# Plot
plt.figure(figsize=(10,6))
plt.bar(donations_by_day.index, donations_by_day.values, color='skyblue', edgecolor='black')
plt.title("Donations by Day of the Week", fontsize=16)
plt.xlabel("Day of the Week")
plt.ylabel("Total Quantity Donated")
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is perfect for comparing total donations across days of the week, making it easy to identify which days see more or less activity.

##### 2. What is/are the insight(s) found from the chart?

Thursday is the highest donation day (~5200 units), followed by Monday and Friday.

Saturday and Tuesday see the lowest donations (~2500–2600 units).

Midweek (Wednesday) shows moderate donation activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: High activity days (Thursday, Monday, Friday) are great for targeted campaigns and events to maximize turnout.

Negative: Low activity on Saturdays and Tuesdays may indicate disengagement; extra promotions or awareness drives could help balance donation distribution.
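
A weekday total depends on both how many donations arrive and how large they are. The sketch below separates the two on synthetic data, using an ordered categorical so the days sort in calendar order without a manual reindex; the values are made up.

```python
import pandas as pd

days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Synthetic data: two Thursday donations and one Friday donation.
frame = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2025-03-06", "2025-03-06", "2025-03-07"]),
        "quantity": [10, 20, 30],
    }
)

# An ordered categorical keeps the days in calendar order.
frame["day_of_week"] = pd.Categorical(
    frame["timestamp"].dt.day_name(), categories=days_order, ordered=True
)

# observed=False keeps days with no donations in the result.
by_day = frame.groupby("day_of_week", observed=False)["quantity"].agg(
    ["sum", "count", "mean"]
)

print(by_day)
```

Here Thursday and Friday have the same total, but Thursday reaches it with two smaller donations and Friday with one larger donation.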

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Replace with your actual column names
date_col = 'timestamp'   # column with donation dates
quantity_col = 'quantity'    # column with donation quantities

# Drop missing values
df = df.dropna(subset=[date_col, quantity_col])

# Ensure correct data types
df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
df[quantity_col] = pd.to_numeric(df[quantity_col], errors='coerce')

# Remove invalid rows
df = df.dropna(subset=[date_col])

# Extract month name
df['month'] = df[date_col].dt.month_name()

# Group by month
donations_by_month = df.groupby('month')[quantity_col].sum()

# Order months in calendar order
months_order = ["January", "February", "March", "April", "May", "June",
                "July", "August", "September", "October", "November", "December"]
donations_by_month = donations_by_month.reindex(months_order)

# Plot
plt.figure(figsize=(10,6))
plt.bar(donations_by_month.index, donations_by_month.values, color='lightgreen', edgecolor='black')
plt.title("Donations by Month", fontsize=16)
plt.xlabel("Month")
plt.ylabel("Total Quantity Donated")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart clearly compares monthly donation totals, making it easy to spot seasonal peaks and low periods.

##### 2. What is/are the insight(s) found from the chart?

March stands out massively with ~12,000 units—far higher than any other month.

October shows the second-highest donations (~1,700 units).

Most other months range between ~1,000–1,500 units.

September sees the lowest donations (~900 units).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: March could indicate a successful campaign or seasonal event—worth replicating or expanding.

Negative: Months with consistently low donations (e.g., September, January, November) may require targeted outreach or special drives to boost contributions.
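
Grouping by month name combines the same month from different years, should the data span more than one year. The sketch below contrasts that with grouping by year-month period on synthetic data; whether the real dataset covers multiple years is not established here.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2024-03-10", "2025-03-05", "2025-04-01"]),
        "quantity": [5, 7, 3],
    }
)

# By month name: March 2024 and March 2025 are added together.
by_name = sample.groupby(sample["timestamp"].dt.month_name())["quantity"].sum()

# By year-month period: each calendar month stays separate.
by_period = sample.groupby(sample["timestamp"].dt.to_period("M"))["quantity"].sum()

print(by_name)
print(by_period)
```

Checking the period-based totals is a quick way to see whether a large monthly bar comes from a single month or from several years combined.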

#### Chart - 13

In [None]:
# Chart - 13 visualization code
import pandas as pd
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# Replace with your actual column names
donor_col = 'name'   # column with donor names
quantity_col = 'quantity'  # column with donation quantities

# Drop missing values
df = df.dropna(subset=[donor_col, quantity_col])

# Ensure quantity is numeric
df[quantity_col] = pd.to_numeric(df[quantity_col], errors='coerce')
df = df.dropna(subset=[quantity_col])

# Group by donor and sum donations
top_donors = df.groupby(donor_col)[quantity_col].sum().nlargest(10)

# Plot
plt.figure(figsize=(10,6))
plt.barh(top_donors.index[::-1], top_donors.values[::-1], color='skyblue', edgecolor='black')
plt.title("Top 10 Donor Contributions", fontsize=16)
plt.xlabel("Total Quantity Donated")
plt.ylabel("Donor Name")
plt.grid(axis='x', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen because it clearly compares the top 10 donors' contributions in descending order.

##### 2. What is/are the insight(s) found from the chart?

Williams PLC is the highest donor; top 3 donors contribute significantly more than the rest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights can help target and retain high-value donors; no major negative growth seen, but dependency on few donors may be a risk.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=['int64', 'float64'])

# Drop columns with all NaN values
numeric_df = numeric_df.dropna(axis=1, how='all')

# Compute correlation matrix
corr_matrix = numeric_df.corr()

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap", fontsize=16)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap was chosen to visualize the pairwise correlations between the numeric variables in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Most ID columns are perfectly correlated with each other, while Quantity shows only a very weak correlation with the other variables.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Select only numeric columns for pairplot
numeric_df = df.select_dtypes(include=['int64', 'float64'])

# Drop columns with all NaN values
numeric_df = numeric_df.dropna(axis=1, how='all')

# Create Pair Plot
sns.pairplot(numeric_df)
plt.suptitle("Pair Plot of Numeric Features", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

Chosen to explore relationships and distributions between all numeric features simultaneously.

##### 2. What is/are the insight(s) found from the chart?

The off-diagonal scatter plots show no strong relationships between features. The diagonal panels show each feature's own distribution, and these appear roughly uniform.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

H1: The average donation amount is greater than ₹500.

H2: There is a significant difference in donation amounts between male and female donors.

H3: Donation amount is correlated with donor age.

For each, you would run:

H1: One-sample t-test against μ = 500.

H2: Independent t-test between male and female groups.

H3: Pearson correlation test between DonationAmount and Age.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Average donation > ₹500

H₀: μ ≤ 500

H₁: μ > 500

Difference between male & female donation amounts

H₀: μ_male = μ_female

H₁: μ_male ≠ μ_female

Correlation between donation amount & age

H₀: ρ = 0

H₁: ρ ≠ 0

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy import stats

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Standardize column names (lowercase, replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')


# 1. One-sample t-test (Average donation > ₹500)
# Run only if the 'quantity' column exists
if 'quantity' in df.columns:
    df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
    df = df.dropna(subset=['quantity']) # Drop rows with missing or invalid quantity

    if not df['quantity'].empty:
        t_stat1, p_val1 = stats.ttest_1samp(df['quantity'], 500)
        p_val1_one_sided = p_val1 / 2 if t_stat1 > 0 else 1 - (p_val1 / 2)
        print(f"Test 1 (Mean > 500): t={t_stat1:.3f}, p(one-sided)={p_val1_one_sided:.4f}")
    else:
        print("Test 1 skipped: 'quantity' column is empty after handling missing values.")
else:
    print("Test 1 skipped: 'quantity' column not found.")


# 2. Independent t-test (Male vs Female)
# Run only if both the 'gender' and 'quantity' columns exist
if 'gender' in df.columns and 'quantity' in df.columns:
    male = df[df['gender'] == 'Male']['quantity']
    female = df[df['gender'] == 'Female']['quantity']

    if not male.empty and not female.empty:
        t_stat2, p_val2 = stats.ttest_ind(male, female, equal_var=False)
        print(f"Test 2 (Male vs Female): t={t_stat2:.3f}, p={p_val2:.4f}")
    else:
        print("Test 2 skipped: Not enough data for Male and/or Female groups.")
else:
    print("Test 2 skipped: 'gender' or 'quantity' column not found.")

# 3. Pearson correlation (Donation vs Age)
# Run only if both the 'age' and 'quantity' columns exist
if 'age' in df.columns and 'quantity' in df.columns:
    # Coerce age to numeric, then drop rows missing either value
    corr_df = df.assign(age=pd.to_numeric(df['age'], errors='coerce'))
    corr_df = corr_df.dropna(subset=['age', 'quantity'])

    # pearsonr needs at least two paired observations
    if len(corr_df) >= 2:
        corr, p_val3 = stats.pearsonr(corr_df['quantity'], corr_df['age'])
        print(f"Test 3 (Correlation): r={corr:.3f}, p={p_val3:.4f}")
    else:
        print("Test 3 skipped: Not enough data with both 'age' and 'quantity' for correlation.")
else:
    print("Test 3 skipped: 'age' or 'quantity' column not found.")

##### Which statistical test have you done to obtain P-Value?

A one-sample t-test was performed to check whether the mean donation amount is significantly greater than ₹500. The two-sided p-value returned by `stats.ttest_1samp` was converted to a one-sided p-value.

##### Why did you choose the specific statistical test?

The one-sample t-test is suitable when comparing the mean of a single sample against a known or hypothesized population mean, which matches our research question.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): The mean donation amount is less than or equal to ₹500.

Alternate Hypothesis (H₁): The mean donation amount is greater than ₹500.

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy import stats

file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# Assuming 'quantity' column exists and is numeric
if 'quantity' in df.columns:
    df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
    df = df.dropna(subset=['quantity']) # Drop rows with missing or invalid quantity

    if not df['quantity'].empty:
        # Hypothesis Test: Mean donation > 500
        donations = df['quantity']
        t_stat, p_val = stats.ttest_1samp(donations, 500)

        # One-sided p-value for mean > 500
        p_one_sided = p_val / 2 if t_stat > 0 else 1 - (p_val / 2)

        print(f"t-statistic = {t_stat:.4f}, one-sided p-value = {p_one_sided:.4f}")
    else:
        print("Hypothesis test skipped: 'quantity' column is empty after handling missing values.")
else:
    print("Hypothesis test skipped: 'quantity' column not found.")

##### Which statistical test have you done to obtain P-Value?

A one-sample t-test was performed to compare the sample mean against a hypothesized population mean.

##### Why did you choose the specific statistical test?

The one-sample t-test is suitable when testing whether the mean of a single sample differs from a known or assumed population mean, especially when the population standard deviation is unknown.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): The population mean is equal to 500.

Alternate Hypothesis (H₁): The population mean is greater than 500.

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy import stats

# Example dataset (replace with your actual data)
data = pd.DataFrame({
    'donation_amount': [520, 480, 505, 510, 495, 530, 515, 490, 500, 525]
})

# Null Hypothesis: mean = 500
# Alternate Hypothesis: mean > 500

sample_mean = data['donation_amount'].mean()
t_stat, p_value_two_sided = stats.ttest_1samp(data['donation_amount'], 500)

# Convert to one-sided p-value
if t_stat > 0:
    p_value_one_sided = p_value_two_sided / 2
else:
    p_value_one_sided = 1 - (p_value_two_sided / 2)

print(f"t-statistic = {t_stat:.4f}, p-value (one-sided) = {p_value_one_sided:.4f}")


##### Which statistical test have you done to obtain P-Value?

One-sample t-test (one-sided).

##### Why did you choose the specific statistical test?

A one-sided, one-sample t-test was chosen because the goal is to check whether the sample mean donation amount is significantly greater than a hypothesized value (500) when the population standard deviation is unknown. The test compares the sample mean to the hypothesized population mean and remains usable for small samples, provided the data are approximately normally distributed.
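The statistic behind the one-sample t-test can be checked by hand. The sketch below recomputes it with only the standard library for the same example donation amounts used in the cell above; the variable names are illustrative, and the p-value itself would still come from `stats.ttest_1samp`.

```python
import math
import statistics

# Example donation amounts (same illustrative values as the cell above)
donations = [520, 480, 505, 510, 495, 530, 515, 490, 500, 525]
mu0 = 500  # hypothesized population mean

n = len(donations)
sample_mean = statistics.mean(donations)
sample_sd = statistics.stdev(donations)   # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_stat = (sample_mean - mu0) / standard_error
print(f"mean = {sample_mean}, t = {t_stat:.4f}, degrees of freedom = {n - 1}")
```

A positive t-statistic means the sample mean lies above 500; whether that gap is significant depends on the one-sided p-value for `n - 1` degrees of freedom.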

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Example DataFrame
df = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Salary': [50000, 60000, np.nan, 52000, 58000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT']
})

print("Before Handling Missing Values:\n", df)

# 1. Drop rows/columns with too many missing values (if needed)
df = df.dropna(axis=0, thresh=2)   # keep rows with at least 2 non-NaN values

# 2. Numerical Imputation (mean strategy)
num_imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = num_imputer.fit_transform(df[['Age', 'Salary']])

# 3. Categorical Imputation (most frequent strategy)
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['Department']] = cat_imputer.fit_transform(df[['Department']])

print("\nAfter Handling Missing Values:\n", df)


#### What all missing value imputation techniques have you used and why did you use those techniques?

**Row Dropping (Threshold-based)**

Kept only rows with at least two non-missing values (`thresh=2`), which drops rows that are mostly empty.

Reason: If a row has very little information, imputing it may add noise.

**Mean Imputation (for Numerical features – Age, Salary)**

Replaced missing values with the mean of the column.

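Reason: Mean imputation is simple and keeps the column mean unchanged. It does shrink the variance, and the mean itself is sensitive to extreme values, so the median is a safer choice for heavily skewed columns.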

**Mode / Most Frequent Imputation (for Categorical features – Department)**

Replaced missing values with the most frequent category.

Reason: Helps maintain categorical consistency and avoids creating unrealistic categories.
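To make the effect of these strategies concrete, here is a minimal standard-library sketch of mean and mode imputation. The two columns are hypothetical values loosely based on the example DataFrame, and `impute` is an illustrative helper rather than part of any library.

```python
import statistics

# Hypothetical columns; None marks a missing value
ages = [25, None, 30, 22, None]
departments = ["HR", "IT", None, "Finance", "IT"]

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

observed_ages = [v for v in ages if v is not None]
age_mean = statistics.mean(observed_ages)   # mean of the observed values only
dept_mode = statistics.mode([d for d in departments if d is not None])

ages_filled = impute(ages, age_mean)
departments_filled = impute(departments, dept_mode)

# Mean imputation keeps the column mean but shrinks its spread
print(ages_filled, departments_filled)
print(statistics.pstdev(observed_ages), statistics.pstdev(ages_filled))
```

The second print compares the spread before and after filling, which shows the variance-shrinking side effect of mean imputation.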

### 2. Handling Outliers

In [None]:
import pandas as pd

# Example Data
data = pd.DataFrame({'Salary': [50000, 52000, 58000, 60000, 1200000]})

# IQR Method
Q1 = data['Salary'].quantile(0.25)
Q3 = data['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Define boundaries
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Cap outliers
data['Salary'] = data['Salary'].clip(lower=lower, upper=upper)
print(data)


##### What all outlier treatment techniques have you used and why did you use those techniques?

IQR (Interquartile Range) Method

Identified outliers beyond Q1 - 1.5*IQR and Q3 + 1.5*IQR.

Used because quartiles are not pulled by extreme values, so the fences stay stable even when outliers are present; it also works on small datasets.

Capping / Winsorization

Values beyond the IQR fences were replaced with the nearest fence value (the lower or upper bound).

Used to retain data points while reducing the influence of outliers.

Median Imputation (if needed)

As an alternative, outliers can be replaced with the median of the distribution; this was not applied in the code above.

Median is chosen since it is less affected by extreme values compared to mean.
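The IQR fences and the capping step can be traced with plain arithmetic. This is a standard-library sketch on the same example salaries; `method="inclusive"` is chosen on the assumption that it interpolates quartiles the same way as the pandas default for this data.

```python
import statistics

salaries = [50000, 52000, 58000, 60000, 1200000]

# Quartiles with linear interpolation between data points
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap each value to the [lower, upper] fences
capped = [min(max(s, lower), upper) for s in salaries]
print(lower, upper, capped)
```

Only the extreme salary moves: it is pulled down to the upper fence, while the other four values are left as they are.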

### 3. Categorical Encoding

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample Data
df = pd.DataFrame({
    'Age': [25, 23, 30, 22, 23],
    'Salary': [50000, 60000, 52000, 52000, 58000],
    'Department': ['HR', 'IT', 'Finance', 'Finance', 'IT']
})

# Label Encoding (intended for target labels; integers follow sorted order, not a meaningful rank)
le = LabelEncoder()
df['Dept_LabelEncoded'] = le.fit_transform(df['Department'])

# One-Hot Encoding (for nominal categorical columns)
df_encoded = pd.get_dummies(df, columns=['Department'], drop_first=True)

print("Label Encoded:\n", df[['Department','Dept_LabelEncoded']])
print("\nOne-Hot Encoded:\n", df_encoded.head())


#### What all categorical encoding techniques have you used & why did you use those techniques?

Label Encoding → Converts each category into an integer label. `LabelEncoder` assigns the integers in sorted order, so the numbers carry no real ranking; it is best suited to target labels, or to tree-based models (e.g., RandomForest, XGBoost) that split on thresholds and are less sensitive to the arbitrary order.

One-Hot Encoding → Creates binary columns for each category. Useful for nominal variables where no order exists, preventing the model from misinterpreting categorical values as having a ranking.
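The difference between the two encodings is easy to see without any library. The sketch below reproduces both by hand for the same Department values, assuming integer labels are assigned in sorted order.

```python
departments = ["HR", "IT", "Finance", "Finance", "IT"]

# Label encoding: one integer per category, assigned in sorted (alphabetical) order
categories = sorted(set(departments))
label_map = {cat: idx for idx, cat in enumerate(categories)}
label_encoded = [label_map[d] for d in departments]

# One-hot encoding: one 0/1 indicator per category
one_hot = [[int(d == cat) for cat in categories] for d in departments]

print(label_map)
print(label_encoded)
print(one_hot)
```

The integer labels imply Finance < HR < IT, an order that has no meaning for departments, whereas the one-hot rows treat every category symmetrically.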

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
import re

# Dictionary of common contractions
contractions_dict = {
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
    "it's": "it is",
    "i'm": "i am",
    "you're": "you are",
    "they're": "they are",
    "we're": "we are",
    "i've": "i have",
    "we've": "we have",
    "they've": "they have",
    "i'll": "i will",
    "he'll": "he will",
    "she'll": "she will",
    "we'll": "we will",
    "they'll": "they will",
    "that's": "that is",
    "what's": "what is",
    "there's": "there is",
    "let's": "let us"
}

# Function to expand contractions
def expand_contractions(text, contractions_dict=contractions_dict):
    pattern = re.compile(r'\b(' + '|'.join(contractions_dict.keys()) + r')\b')
    return pattern.sub(lambda x: contractions_dict[x.group()], text.lower())

# Example
text = "I'm sure they won't agree because it's not fair."
expanded_text = expand_contractions(text)
print("Before:", text)
print("After :", expanded_text)


#### 2. Lower Casing

In [None]:
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    "Text": ["I Love AI", "This IS Amazing!", "PYTHON is Fun"]
})

# Convert text column to lowercase
df["Text_Lower"] = df["Text"].str.lower()

print(df)


#### 3. Removing Punctuations

In [None]:
import pandas as pd
import string

# Example DataFrame
df = pd.DataFrame({
    "Text": ["Hello!!! How are you?", "Python, AI & ML are cool.", "Let's code...!!!"]
})

# Remove punctuations using str.translate
df["Text_NoPunct"] = df["Text"].str.translate(str.maketrans('', '', string.punctuation))

print(df)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
import pandas as pd
import re

# Example DataFrame
df = pd.DataFrame({
    "Text": [
        "Visit https://openai.com for AI updates!",
        "My email is test123mail@gmail.com",
        "Python3 is awesome, use version 3.9",
        "Checkout www.example123.org now!"
    ]
})

# Function to clean text
def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove words containing digits
    text = re.sub(r'\w*\d\w*', '', text)
    return text.strip()

# Apply cleaning
df["Cleaned_Text"] = df["Text"].apply(clean_text)

print(df)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords

# Download stopwords (only once)
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

# Example DataFrame
df = pd.DataFrame({
    "Text": [
        "Visit the OpenAI website for more AI updates!",
        "Python is one of the best programming languages.",
        "This is a sample sentence with stopwords."
    ]
})

# Function to remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered)

# Apply function
df["Cleaned_Text"] = df["Text"].apply(remove_stopwords)

print(df)


In [None]:
import pandas as pd
import re

# Example DataFrame
df = pd.DataFrame({
    "Text": [
        "   Hello   World   ",
        "This   has   extra   spaces",
        "   Clean   text   processing   "
    ]
})

# Function to remove white spaces
def remove_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

# Apply function
df["Cleaned_Text"] = df["Text"].apply(remove_whitespace)

print(df)


#### 6. Rephrase Text

In [None]:
import nltk
from nltk.corpus import wordnet
import random

# Download WordNet if not already
nltk.download('wordnet')
nltk.download('omw-1.4')

def rephrase_sentence(sentence):
    words = sentence.split()
    new_sentence = []

    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            # Get all lemma names (possible synonyms)
            lemmas = set()
            for syn in synonyms:
                for lemma in syn.lemmas():
                    lemmas.add(lemma.name())
            lemmas = list(lemmas)

            # Replace word with a random synonym (if available)
            if len(lemmas) > 1:
                new_word = random.choice(lemmas)
                new_sentence.append(new_word.replace("_", " "))
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)

    return " ".join(new_sentence)

# Example
text = "The quick brown fox jumps over the lazy dog"
print("Original:", text)
print("Rephrased:", rephrase_sentence(text))


#### 7. Tokenization

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download tokenizer models
nltk.download('punkt')
nltk.download('punkt_tab') # Added to download the missing resource

text = "Natural Language Processing (NLP) is fun! Let's learn tokenization."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:", sentences)

# Word Tokenization
words = word_tokenize(text)
print("Word Tokenization:", words)

#### 8. Text Normalization

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')

text = "running runs easily fairly studies studying studied"

ps = PorterStemmer()

words = word_tokenize(text)
stemmed_words = [ps.stem(word) for word in words]

print("Original Words:", words)
print("Stemmed Words:", stemmed_words)


##### Which text normalization technique have you used and why?

Stemming (Porter Stemmer):

Reduces words to their root form by chopping off suffixes.

Example: studies, studying, studied → studi

It is fast but may produce non-dictionary words (like easili).

Lemmatization (WordNet Lemmatizer), a possible alternative that is not applied in the code above:

Converts words to their meaningful dictionary root form.

Example: studies, studying, studied → study

More accurate than stemming because it maps each word to a dictionary form, provided the correct part of speech is supplied.

#### 9. Part of speech tagging

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

# Sample text
text = "The quick brown fox is running fast and jumps over the lazy dog."

# Tokenize
tokens = word_tokenize(text)

# POS Tagging
pos_tags = pos_tag(tokens)

print("Tokens:", tokens)
print("POS Tags:", pos_tags)

#### 10. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text corpus
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "A fox is quick and smart"
]

# ----- Count Vectorizer -----
count_vectorizer = CountVectorizer()
count_vectors = count_vectorizer.fit_transform(corpus)

print("Count Vectorizer Vocabulary:\n", count_vectorizer.vocabulary_)
print("\nCount Vectors:\n", count_vectors.toarray())

# ----- TF-IDF Vectorizer -----
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(corpus)

print("\nTF-IDF Vocabulary:\n", tfidf_vectorizer.vocabulary_)
print("\nTF-IDF Vectors:\n", tfidf_vectors.toarray())


##### Which text vectorization technique have you used and why?

Count Vectorizer (Bag of Words): It converts text into numerical feature vectors by counting the frequency of each word in the document. This is useful for simple models where raw word occurrence matters.

TF-IDF (Term Frequency–Inverse Document Frequency): It not only considers word frequency but also reduces the weight of very common words (like the, and, is), giving more importance to unique and informative words. This improves performance in text classification, clustering, and retrieval tasks.
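The weighting described above can be worked through by hand. The following standard-library sketch is intended to mirror what I understand to be the `TfidfVectorizer` defaults (smoothed IDF, L2 normalization, tokens of two or more characters); the helper names `idf` and `tfidf` are illustrative.

```python
import math
import re

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "A fox is quick and smart",
]

# Lowercase and keep tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    """Raw term count times IDF, then L2-normalized."""
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc = tfidf(docs[0])
print(f"idf('the') = {idf('the'):.3f}, idf('smart') = {idf('smart'):.3f}")
print(sorted(first_doc.items(), key=lambda kv: -kv[1])[:3])
```

A term that appears in fewer documents receives a higher IDF, which is how rarer words gain weight relative to common ones.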

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
import pandas as pd
import numpy as np

# Sample Data
data = {
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Experience': [2, 5, 8, 12, 20]
}
df = pd.DataFrame(data)

# 1️⃣ Check correlation
corr = df.corr()
print("Correlation Matrix:\n", corr)

# 2️⃣ Identify highly correlated feature pairs (|r| > 0.85)
# Pairs are only reported, not dropped, because the columns are reused below
threshold = 0.85
cols = corr.columns
high_corr = [(c1, c2) for i, c1 in enumerate(cols) for c2 in cols[i + 1:]
             if abs(corr.loc[c1, c2]) > threshold]
print("\nHighly correlated features:", high_corr)

# 3️⃣ Feature Engineering (creating new features)
df['Salary_per_YearExperience'] = df['Salary'] / (df['Experience'] + 1)
df['Age_Salary_Interaction'] = df['Age'] * df['Salary']

print("\nFinal DataFrame with New Features:\n", df)

#### 2. Feature Selection

In [None]:
import pandas as pd
from sklearn.datasets import fetch_california_housing  # replaces load_boston, which was removed from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

# Load sample dataset (California housing)
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target


# 1️⃣ Feature Selection using statistical test (SelectKBest)
selector = SelectKBest(score_func=f_regression, k=5)  # choose top 5 features
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]

print("Selected Features:", list(selected_features))

# 2️⃣ Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X[selected_features], y, test_size=0.2, random_state=42)

# 3️⃣ Model Training
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

print("Train Score:", model.score(X_train, y_train))
print("Test Score:", model.score(X_test, y_test))

##### What all feature selection methods have you used  and why?

I used Univariate Feature Selection (SelectKBest with f_regression) because it helps identify the features that have the strongest statistical relationship with the target variable. This method reduces dimensionality, removes irrelevant/noisy features, and helps avoid overfitting by keeping only the most relevant predictors.
Additionally, feature selection improves model interpretability and reduces computational cost.
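For a single feature, the score that univariate selection ranks by can be derived from the Pearson correlation. The sketch below computes it by hand, assuming the `f_regression` score follows F = r² / (1 − r²) · (n − 2); the toy data and the `f_score` helper are hypothetical.

```python
import math

def f_score(x, y):
    """F-statistic for a single feature: F = r^2 / (1 - r^2) * (n - 2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    r = cov / math.sqrt(var_x * var_y)   # Pearson correlation
    return r * r / (1 - r * r) * (n - 2)

# Hypothetical toy data: 'income' tracks the target closely, 'noise' does not
target = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8]
features = {
    "income": [10, 21, 29, 41, 52, 60],
    "noise": [3, 1, 4, 1, 5, 9],
}

scores = {name: f_score(values, target) for name, values in features.items()}
top_k = sorted(scores, key=scores.get, reverse=True)[:1]   # keep the k = 1 best feature
print(scores, top_k)
```

Because each feature is scored on its own, this approach cannot detect features that are only useful in combination, which is worth keeping in mind when choosing `k`.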

##### Which all features you found important and why?

MedInc (Median Income) → Strongly influences house prices since higher-income neighborhoods generally have higher property values.

HouseAge → Older or newer houses can affect pricing based on maintenance or modern facilities.

AveRooms (Average Rooms per Household) → Indicates house size; more rooms generally increase house value.

AveBedrms (Average Bedrooms per Household) → Related to house utility and desirability for families.

Latitude → Geographic location affects demand, climate, accessibility, and hence housing prices.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer

# Sample Data
df = pd.DataFrame({
    'Income': [50000, 60000, 80000, 120000, 150000],
    'Age': [22, 25, 30, 35, 40],
    'HouseValue': [200000, 250000, 400000, 600000, 750000]
})

print("Original Data:\n", df)

# 1. Standardization (Z-score Normalization)
scaler = StandardScaler()
df['Income_Standardized'] = scaler.fit_transform(df[['Income']])
df['Age_Standardized'] = scaler.fit_transform(df[['Age']])

# 2. Min-Max Normalization
minmax = MinMaxScaler()
df['HouseValue_MinMax'] = minmax.fit_transform(df[['HouseValue']])

# 3. Power Transformation (for skewed data)
pt = PowerTransformer()
df['Income_PowerTransformed'] = pt.fit_transform(df[['Income']])

print("\nTransformed Data:\n", df)


### 6. Data Scaling

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Sample Data
df = pd.DataFrame({
    'Income': [25000, 50000, 75000, 100000, 200000],
    'Age': [21, 25, 30, 40, 50],
    'HouseValue': [150000, 200000, 250000, 500000, 1000000]
})

print("Original Data:\n", df)

# 1. Standard Scaling (Z-score)
standard_scaler = StandardScaler()
df[['Income_Std', 'Age_Std', 'HouseValue_Std']] = standard_scaler.fit_transform(df[['Income','Age','HouseValue']])

# 2. Min-Max Scaling (0–1 range)
minmax_scaler = MinMaxScaler()
df[['Income_MinMax', 'Age_MinMax', 'HouseValue_MinMax']] = minmax_scaler.fit_transform(df[['Income','Age','HouseValue']])

# 3. Robust Scaling (less sensitive to outliers)
robust_scaler = RobustScaler()
df[['Income_Robust', 'Age_Robust', 'HouseValue_Robust']] = robust_scaler.fit_transform(df[['Income','Age','HouseValue']])

print("\nScaled Data:\n", df)


##### Which method have you used to scale your data and why?

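StandardScaler works best when features are roughly normally distributed.

MinMaxScaler is useful when a bounded range such as 0 to 1 is required, while RobustScaler centers on the median and scales by the interquartile range, so it is less affected by outliers.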
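The three scalers differ only in the center and spread they use. Here is a minimal standard-library sketch of the formulas on the example Income column, assuming population standard deviation for standard scaling and linearly interpolated quartiles for robust scaling.

```python
import statistics

income = [25000, 50000, 75000, 100000, 200000]

# Standard scaling: (x - mean) / population standard deviation
mean, std = statistics.mean(income), statistics.pstdev(income)
standardized = [(x - mean) / std for x in income]

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1]
lo, hi = min(income), max(income)
min_max = [(x - lo) / (hi - lo) for x in income]

# Robust scaling: (x - median) / IQR
q1, median, q3 = statistics.quantiles(income, n=4, method="inclusive")
robust = [(x - median) / (q3 - q1) for x in income]

print(standardized, min_max, robust, sep="\n")
```

In the robust version the largest income does not affect the center or the spread, whereas it shifts both the mean and the standard deviation used by standard scaling.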

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

It helps reduce noise, avoid overfitting, and improve model efficiency.
Techniques like PCA keep most information while lowering feature space.

In [None]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = pd.DataFrame({
    'Feature1': [2, 4, 5, 6, 8],
    'Feature2': [8, 12, 15, 18, 20],
    'Feature3': [1, 2, 3, 4, 5]
})

# Standardize the data
scaled_data = StandardScaler().fit_transform(data)

# Apply PCA to reduce dimensions to 2
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(scaled_data)

print("Original Shape:", data.shape)
print("Reduced Shape:", reduced_data.shape)
print("Reduced Data:\n", reduced_data)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I have used Principal Component Analysis (PCA) because it reduces high-dimensional data into fewer components while retaining most of the variance. This helps in minimizing redundancy, improving model performance, and avoiding overfitting.

### 8. Data Splitting

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Example dataset
data = {
    'Income': [25000, 50000, 75000, 100000, 200000],
    'Age': [21, 25, 30, 40, 50],
    'HouseValue': [150000, 200000, 250000, 500000, 1000000]
}
df = pd.DataFrame(data)

# Features and target
X = df[['Income', 'Age']]
y = df['HouseValue']

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training Data:")
print(X_train, y_train, sep="\n")

print("\nTesting Data:")
print(X_test, y_test, sep="\n")


##### What data splitting ratio have you used and why?

An 80:20 train-test split was used because it provides enough data (80%) for the model to learn patterns while keeping sufficient unseen data (20%) to evaluate performance and check for overfitting.
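The mechanics of a seeded split can be sketched in a few lines. `simple_train_test_split` below is a hypothetical helper, not the scikit-learn implementation, so it will not reproduce the exact rows `train_test_split` selects; rounding the test share up is my assumption about how the sizes are derived.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(rows))   # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Same (Income, Age) pairs as the example dataset
rows = [(25000, 21), (50000, 25), (75000, 30), (100000, 40), (200000, 50)]
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

With only five rows, a 20% test share leaves a single test sample, so any score on it is a very rough estimate; the ratio matters more on a realistically sized dataset.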

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

There is not yet enough information to determine whether the dataset is imbalanced. Assessing this requires the distribution of classes in the target column; a dataset is imbalanced when some classes have significantly more instances than others.
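A quick way to answer this question is to count the classes in the target column. The sketch below uses hypothetical labels as a stand-in for the real target, and the imbalance ratio is an informal indicator rather than a fixed rule.

```python
from collections import Counter

# Hypothetical target labels; in practice this would be the dataset's label column
labels = ["Completed"] * 45 + ["Pending"] * 40 + ["Cancelled"] * 15

counts = Counter(labels)
total = sum(counts.values())
shares = {cls: n / total for cls, n in counts.items()}

# Ratio of the largest class to the smallest; values far above 1 suggest imbalance
imbalance_ratio = max(counts.values()) / min(counts.values())
print(shares, f"imbalance ratio = {imbalance_ratio:.1f}")
```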

In [None]:
from sklearn.datasets import make_classification
from collections import Counter
from imblearn.over_sampling import SMOTE

# Create imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, n_features=5,
                           n_clusters_per_class=1, n_samples=500, random_state=42)

print("Before Resampling:", Counter(y))

# Apply SMOTE (oversampling minority class)
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print("After Resampling:", Counter(y_res))


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

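I used SMOTE (Synthetic Minority Over-sampling Technique) to handle the imbalance because it generates synthetic minority-class samples by interpolating between existing ones instead of duplicating them. This balances the dataset, lowers the risk of overfitting to repeated samples, and helps the model learn decision boundaries more effectively. Resampling should be applied to the training split only, so the test set keeps the original class distribution.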
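The core idea of SMOTE is a simple interpolation between a minority sample and one of its nearest minority-class neighbors. The sketch below shows only that step with hypothetical points; `synthetic_sample` is an illustrative helper, and the neighbor search that the real algorithm performs is omitted.

```python
import random

def synthetic_sample(point, neighbor, rng):
    """Return a random point on the line segment between point and neighbor."""
    lam = rng.random()   # interpolation factor in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(42)
minority_point = [1.0, 2.0]     # hypothetical minority-class sample
nearest_neighbor = [2.0, 4.0]   # hypothetical neighbor from the same class

new_point = synthetic_sample(minority_point, nearest_neighbor, rng)
print(new_point)
```

Each synthetic point lies on the segment joining two real minority samples, which is why the new data stays inside the region the minority class already occupies.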

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# ---------- 1. Load dataset ----------
file_path = '/content/drive/MyDrive/food.csv.xlsx'

# Load dataset
if file_path.endswith('.csv'):
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
else:
    df = pd.read_excel(file_path)

# 'Status' is the label column; confirm it exists before using it
if 'Status' in df.columns:
    X = df.drop(columns=['Status'])
    y = df['Status']

    # Preprocess features (handle missing values and encode categorical)
    X = X.fillna(0) # Simple imputation for demonstration
    X = pd.get_dummies(X)

    # ---------- 2. Split data ----------
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # ---------- 3. Fit the algorithm ----------
    model = LogisticRegression(max_iter=500)
    model.fit(X_train, y_train)

    # ---------- 4. Predict on the model ----------
    y_pred = model.predict(X_test)

    # ---------- 5. Evaluate ----------
    print("✅ Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))
else:
    print("Error: 'Status' column not found in the dataset. Please specify a valid target column.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# ---------- Calculate Metrics ----------
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# ---------- Prepare Data ----------
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
values = [accuracy, precision, recall, f1]

# ---------- Plot ----------
plt.bar(metrics, values, color=['skyblue', 'orange', 'green', 'red'])
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title("Model Evaluation Metrics")
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Annotate values
for i, v in enumerate(values):
    plt.text(i, v + 0.02, f"{v:.2f}", ha='center', fontsize=10)

plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# ---------- 1. Split Data ----------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ---------- 2. Define Model ----------
rf = RandomForestClassifier(random_state=42)

# ---------- 3. Define Hyperparameter Grid ----------
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# ---------- 4. GridSearchCV ----------
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)

# ---------- 5. Fit Model ----------
grid_search.fit(X_train, y_train)

# ---------- 6. Best Model ----------
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

# ---------- 7. Predictions ----------
y_pred = best_model.predict(X_test)

# ---------- 8. Evaluation ----------
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV, which exhaustively evaluates every combination in the parameter grid with 5-fold cross-validation and selects the best-scoring settings within that grid.
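To make "exhaustive" concrete, the sketch below counts how many candidates and model fits the Model 1 grid implies. It copies the grid values from the tuning cell; the variable names `n_candidates` and `n_fits` are illustrative only.

```python
from itertools import product

# Same grid as the Model 1 tuning cell
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

The cost grows multiplicatively with each added value, which is why grid search becomes slow for large grids.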

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

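Accuracy improved from 0.27 (baseline Logistic Regression) to 0.30 (tuned Random Forest), with slight gains in precision, recall, and F1-score. Because the tuned model is a different algorithm from the baseline, the gain cannot be attributed to hyperparameter tuning alone.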

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt

# Metrics before and after optimization
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
before = [0.27, 0.28, 0.27, 0.27]
after = [0.30, 0.31, 0.30, 0.30]

x = range(len(metrics))
plt.bar(x, before, width=0.4, label='Before', align='center')
plt.bar([i + 0.4 for i in x], after, width=0.4, label='After', align='center')

plt.xticks([i + 0.2 for i in x], metrics)
plt.ylabel("Score")
plt.title("Evaluation Metric Score Comparison")
plt.legend()
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Model
rf = RandomForestClassifier(random_state=42)

# Parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Predictions
y_pred = grid_search.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV to exhaustively search the hyperparameter combinations for the Random Forest model and select the one with the best 5-fold cross-validated accuracy.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Slight accuracy improvement from baseline (~0.28 to 0.30). Precision, recall, and F1-score remain low, suggesting limited model effectiveness despite tuning.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Accuracy: 30% means the model correctly predicts only 3 out of 10 orders, which is poor for operational decision-making.

Precision: Low precision for "Cancelled" & "Pending" means high false positives, leading to unnecessary actions (e.g., cancelling active orders).

Recall: Low recall means the model misses many actual cases (e.g., failing to flag real pending orders), causing delays or service issues.

F1-score: Balances precision & recall; low scores indicate the model is unreliable in identifying critical order statuses, impacting customer satisfaction and operational efficiency.
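The relationship between these metrics can be illustrated with a small worked example. The counts below are hypothetical, chosen only to show the arithmetic for a single class such as "Pending"; they are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 30 correct flags, 70 false alarms, 60 missed cases
precision, recall, f1 = precision_recall_f1(tp=30, fp=70, fn=60)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```

With these counts, 70 of every 100 flagged orders would be false alarms (low precision) and 60 of the 90 real cases would be missed (low recall), which is the kind of operational cost described above.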

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Fit the Algorithm
model3 = SVC(kernel='rbf', C=1, gamma='scale')
model3.fit(X_train, y_train)

# Predict on the model
y_pred3 = model3.predict(X_test)

# Evaluation
acc3 = accuracy_score(y_test, y_pred3)
print(f"Accuracy: {acc3:.2f}")
print("Classification Report:\n", classification_report(y_test, y_pred3))


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate metrics
metrics = {
    "Accuracy": accuracy_score(y_test, y_pred3),
    "Precision": precision_score(y_test, y_pred3, average='weighted'),
    "Recall": recall_score(y_test, y_pred3, average='weighted'),
    "F1-score": f1_score(y_test, y_pred3, average='weighted')
}

# Plot the metrics
plt.bar(metrics.keys(), metrics.values(), color=['blue', 'green', 'orange', 'red'])
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title("Evaluation Metrics - Model 3")
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

# Define model (note: this replaces the SVC fitted above with a Random Forest)
model3 = RandomForestClassifier(random_state=42)

# Define parameter distribution
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# RandomizedSearchCV
random_search = RandomizedSearchCV(model3, param_dist, n_iter=20, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)

# Best parameters & accuracy
print("Best Parameters:", random_search.best_params_)

# Predictions
y_pred3 = random_search.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred3))
print("\nClassification Report:\n", classification_report(y_test, y_pred3))


##### Which hyperparameter optimization technique have you used and why?

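I used RandomizedSearchCV, which evaluates a fixed number of sampled parameter combinations (`n_iter=20` here) instead of every combination, making it faster than GridSearchCV for large parameter spaces.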
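The saving can be sketched by comparing the full grid with a random sample of it. This is a conceptual illustration using the standard library, not scikit-learn's internal sampler; the parameter values are copied from the tuning cell and the other names are illustrative.

```python
import random
from itertools import product

# Same search space as the Model 3 tuning cell
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
n_iter = 20

# Enumerate the full grid, then draw n_iter distinct combinations from it
all_combinations = list(product(*param_dist.values()))
rng = random.Random(42)
sampled = rng.sample(all_combinations, n_iter)

coverage = n_iter / len(all_combinations)
print(f"Sampled {n_iter} of {len(all_combinations)} combinations ({coverage:.0%})")
```

The trade-off is that the best combination in the full grid may not be among the sampled ones.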

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

No significant change: accuracy stayed at 0.30, and the other metrics were nearly the same as the baseline.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered precision, recall, and F1-score as they directly reflect model reliability in predicting order status, minimizing wrong predictions that could affect customer satisfaction and operational costs.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I selected Random Forest with RandomizedSearchCV due to its robustness to overfitting, ability to handle mixed feature types, and interpretability via feature importance.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Random Forest is an ensemble of decision trees using bagging, which improves prediction accuracy by averaging multiple tree outputs.
Using SHAP values, we can explain how each feature contributes to the model’s predictions.
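As a lighter-weight stand-in for SHAP, the sketch below reads the impurity-based importances that scikit-learn's Random Forest exposes through `feature_importances_`. This is a different technique from SHAP and is shown only as an assumption-laden illustration: the data is synthetic and the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the food donation dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=1, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances describe the model globally, whereas SHAP values would additionally explain individual predictions.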

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save a chart of monthly donations (example data) to a PNG file
import matplotlib.pyplot as plt

# Example data
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
donations = [1000, 1200, 12000, 1300, 1200, 1000, 1500, 1400, 950, 1750, 1020, 1450]

# Plot
plt.figure(figsize=(10, 6))
plt.bar(months, donations, color='lightgreen', edgecolor='black')
plt.xlabel('Month')
plt.ylabel('Total Quantity Donated')
plt.title('Donations by Month')
plt.xticks(rotation=45)

# Save chart to file
plt.savefig("donations_by_month.png", dpi=300, bbox_inches='tight')

plt.show()
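For the model itself, a save-and-reload round trip might look like the following sketch. It assumes `pickle` as the format, trains a small Random Forest on synthetic data rather than the notebook's tuned model, and writes to a hypothetical file name in the system temp directory.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model; not the notebook's tuned model
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Save the fitted model (hypothetical file name)
model_path = Path(tempfile.gettempdir()) / "best_model_demo.pkl"
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Load it back and check that it reproduces the original predictions
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

same = bool((loaded.predict(X_demo) == clf.predict(X_demo)).all())
print("Reloaded model reproduces predictions:", same)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used to load a model should match the one used to save it.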


### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Excel file
file_path = '/content/drive/MyDrive/food.csv.xlsx'
df = pd.read_excel(file_path)

# 2. Inspect the data
print("First few rows of the dataset:")
print(df.head())

print("\nAvailable columns:")
print(df.columns)

# 3. Define the target column
target_column = 'Status'  # ✅ You can change this to 'Meal_Type', 'Food_Type', etc.

# 4. Check if the target column exists
if target_column not in df.columns:
    raise ValueError(f"Target column '{target_column}' not found. Choose from: {df.columns.tolist()}")

# 5. Split features and target
X = df.drop(target_column, axis=1)
y = df[target_column]

# 6. Preprocess features
X = X.fillna(0)
X = pd.get_dummies(X)

# 7. Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 8. Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 9. Validate the model
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"\n✅ Validation Accuracy: {accuracy:.2%}")

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***