# Day 1: Introduction

Task 1: Data Collection and Cleaning

1. Problem: You are given a dataset containing customer information for a retail company. The dataset includes missing values, outliers, and incorrect data entries. Clean the data by identifying and addressing the missing or erroneous values.

2. Task:

Load the dataset and identify any missing or incorrect data.

Use techniques such as imputation, outlier removal, or data correction to clean the data.

Document the methods you used for cleaning the data and explain why you chose them.



- Problem Statement

Given a dataset containing customer information for a retail company.

The dataset contains:

Missing values

Outliers

Incorrect or inconsistent data entries

Step 1: Load the Dataset & Identify Issues

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading dataset
df = pd.read_csv("customer_data.csv")

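# Displaying first few rows (print, since only a cell's last expression is shown)
print(df.head())

# Checking dataset structure (info() prints directly)
df.info()

# Checking missing values per column
print(df.isnull().sum())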


Step 2: Handling Missing Values

Numerical Columns

Using mean/median imputation

Median is preferred when data contains outliers

In [None]:
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Annual_Income'] = df['Annual_Income'].fillna(df['Annual_Income'].median())


Categorical Columns

Use mode

In [None]:
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['City'] = df['City'].fillna(df['City'].mode()[0])


Step 3: Detecting and Handling Outliers

Using IQR Method

In [None]:
Q1 = df['Annual_Income'].quantile(0.25)
Q3 = df['Annual_Income'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Removing outliers
df = df[(df['Annual_Income'] >= lower_bound) & (df['Annual_Income'] <= upper_bound)]
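
Removing rows is not the only option once the IQR bounds are known. The sketch below, on a small made-up income series (the values are assumptions, not taken from the dataset), compares flagging outliers with capping them at the bounds, which keeps every row.

```python
import pandas as pd

# Hypothetical incomes with one extreme value (illustrative only)
income = pd.Series([32_000, 38_000, 41_000, 45_000, 47_000, 52_000, 400_000])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: flag outliers so they can be reviewed or dropped
is_outlier = ~income.between(lower, upper)

# Option 2: cap values at the IQR bounds, keeping all rows
capped = income.clip(lower=lower, upper=upper)

print("Outliers flagged:", int(is_outlier.sum()))
print("Max before/after capping:", income.max(), capped.max())
```

Capping preserves the sample size but distorts the tail, so it suits cases where the extreme rows are real customers whose other columns are still useful.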


Step 4: Correcting Incorrect Data Entries

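Removing non-positive age values

Fixing inconsistent gender labels

In [None]:
# Removing invalid ages; .copy() avoids a possible SettingWithCopyWarning below
df = df[df['Age'] > 0].copy()

# Standardizing gender values (trim whitespace, lowercase, expand abbreviations)
df['Gender'] = df['Gender'].str.strip().str.lower()
df['Gender'] = df['Gender'].replace({'m': 'male', 'f': 'female'})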


Step 5: Verify Cleaned Data

In [None]:
df.describe()

df.isnull().sum()


3. Critical Thinking:

Evaluate the impact of missing data on the results. How would different methods of dealing with missing data affect the analysis?

Impact of Missing Data on Results

Missing data can:

Bias statistical measures

Reduce model accuracy

Lead to incorrect conclusions

Different handling methods affect results differently:

Deletion may reduce dataset size and lose information

Mean/median imputation preserves size but may reduce variance

Advanced imputation can improve accuracy but adds complexity
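
The trade-offs above can be seen on a toy example. This is a minimal sketch with made-up income values (an assumption, not the assignment data) that compares deletion, mean imputation, and median imputation on the same series.

```python
import numpy as np
import pandas as pd

# Hypothetical income values with two missing entries and one outlier
income = pd.Series([30_000, 35_000, 40_000, np.nan, 45_000, 250_000, np.nan, 50_000])

dropped = income.dropna()                        # deletion
mean_filled = income.fillna(income.mean())       # mean imputation
median_filled = income.fillna(income.median())   # median imputation

summary = pd.DataFrame(
    {
        "n": [len(dropped), len(mean_filled), len(median_filled)],
        "mean": [dropped.mean(), mean_filled.mean(), median_filled.mean()],
        "std": [dropped.std(), mean_filled.std(), median_filled.std()],
    },
    index=["deletion", "mean imputation", "median imputation"],
)
print(summary.round(1))
```

Both imputation methods keep all eight rows but shrink the standard deviation, because the filled values sit near the center of the distribution. Median imputation also pulls the mean down here, since the median ignores the outlier.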

4. Higher-Order Thinking

Suggest additional methods for improving data quality beyond what was discussed in the assignment.

Additional Methods to Improve Data Quality
1. Automated Validation Rules

Range checks

2. Advanced Imputation

KNN Imputation

Regression-based imputation

3. Data Consistency Checks

Cross-column validation

4. Duplicate Detection

Remove repeated customer records

5. Data Versioning & Logging

Track changes made during cleaning
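
Several of these ideas can be combined in a few lines of pandas. The sketch below is illustrative only: the column names (`Customer_ID`, `Signup_Year`, `Last_Purchase_Year`), the 0 to 120 age range, and the sample rows are all assumptions.

```python
import pandas as pd

# Hypothetical customer records (names and values are illustrative)
customers = pd.DataFrame({
    "Customer_ID": [1, 2, 2, 3, 4],
    "Age": [34, 29, 29, 150, 41],
    "Signup_Year": [2019, 2021, 2021, 2020, 2030],
    "Last_Purchase_Year": [2023, 2022, 2022, 2021, 2024],
})

cleaning_log = []  # simple record of what each rule flagged

# Duplicate detection: keep the first record per customer
dupes = customers.duplicated(subset="Customer_ID", keep="first")
cleaning_log.append(("duplicate Customer_ID", int(dupes.sum())))
customers = customers[~dupes].copy()

# Range check: ages outside an assumed plausible range of 0 to 120
bad_age = ~customers["Age"].between(0, 120)
cleaning_log.append(("Age out of range", int(bad_age.sum())))

# Cross-column check: a purchase cannot precede signup
bad_dates = customers["Last_Purchase_Year"] < customers["Signup_Year"]
cleaning_log.append(("purchase before signup", int(bad_dates.sum())))

# Flag rows for review instead of silently changing them
customers["Needs_Review"] = bad_age | bad_dates
print(customers)
print(cleaning_log)
```

Flagging rather than deleting keeps the decision reversible, and the log gives a minimal audit trail of what each rule caught.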

# Day 2: Data Visualization and Interpretation

Task 2: Create Visualizations

Problem: Given a sales dataset, create at least three different types of visualizations that provide insights into sales trends, customer preferences, or regional performance.

Task:
> Use tools like Excel, Python (matplotlib or seaborn), or Tableau to create the visualizations.

> Interpret the findings from each visualization. For example, how do trends change over time? Which regions have the highest sales?

- Problem Statement

Given a sales dataset, create at least three different visualizations to provide insights into:

Sales trends over time

Customer preferences

Regional performance

Step 1: Loading Required Libraries and Dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading dataset
sales_df = pd.read_csv("sales_data.csv")

# Previewing data
sales_df.head()

# Dataset overview
sales_df.info()


Assumed Dataset Columns

Date

Region

Product_Category

Sales

Customer_Type

In [None]:
# Converting Date column to datetime
sales_df['Date'] = pd.to_datetime(sales_df['Date'])


Visualization 1: Line Chart – Sales Trend Over Time

In [None]:
sales_over_time = sales_df.groupby('Date')['Sales'].sum()

plt.figure(figsize=(10,5))
plt.plot(sales_over_time.index, sales_over_time.values)
plt.title("Sales Trend Over Time")
plt.xlabel("Date")
plt.ylabel("Total Sales")
plt.show()


Interpretation

The line graph shows how sales change over time.

A sustained upward trend suggests growing sales, although a short rise can also be seasonal.

Sudden drops may indicate seasonal effects or operational issues.
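
Daily totals are often too noisy to read a trend from directly. A minimal sketch, using synthetic daily sales (the data and the 7-day window are assumptions), of two common ways to smooth the series before plotting it:

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for the first half of 2024 (illustrative only)
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", "2024-06-30", freq="D")
daily = pd.Series(rng.integers(80, 120, size=len(dates)), index=dates, name="Sales")

# Aggregate to one point per month ("MS" labels each bin by month start)
monthly_total = daily.resample("MS").sum()

# 7-day moving average; the first 6 values are NaN until the window fills
rolling_7d = daily.rolling(window=7).mean()

print(monthly_total)
print(rolling_7d.dropna().head())
```

Either series can be passed to `plt.plot` in place of the raw daily totals. Monthly totals suit long-term comparisons, while the moving average keeps day-level timing.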

Visualization 2: Bar Chart – Sales by Region

In [None]:
region_sales = sales_df.groupby('Region')['Sales'].sum()

plt.figure(figsize=(8,5))
plt.bar(region_sales.index, region_sales.values)
plt.title("Total Sales by Region")
plt.xlabel("Region")
plt.ylabel("Sales")
plt.show()


Interpretation

Regions with taller bars contribute more to overall revenue.

Helps identify top-performing and underperforming regions.

Useful for regional strategy planning.

Visualization 3: Bar Chart – Customer Preferences

In [None]:
category_sales = sales_df.groupby('Product_Category')['Sales'].sum()

plt.figure(figsize=(9,5))
sns.barplot(x=category_sales.index, y=category_sales.values)
plt.title("Sales by Product Category")
plt.xlabel("Product Category")
plt.ylabel("Sales")
plt.xticks(rotation=45)
plt.show()


Interpretation

Highlights which product categories are most popular.

Helps inventory and marketing teams focus on high-demand products.

3. Critical Thinking:
> What visualization would best convey trends in sales over time? Why?
> How could you improve these visualizations for a non-technical audience?

1. Best Visualization for Sales Trends Over Time

Answer:

A line chart is the best visualization for sales trends over time because:

It clearly shows increases and decreases

Trends and seasonality are easy to observe

Time-based comparisons are intuitive

2. Improving Visualizations for a Non-Technical Audience

Use simple titles and labels

Add annotations

Use consistent colors

Avoid excessive data points

Use summary charts instead of raw scatter plots

4. Higher-Order Thinking:
> Compare and contrast two visualizations of the same data. Which one provides clearer insights and why?

Compare Two Visualizations of the Same Data

Line Chart vs Scatter Plot

| Aspect           | Line Chart       | Scatter Plot      |
| ---------------- | ---------------- | ----------------- |
| Trend visibility | Very clear       | Less clear        |
| Noise handling   | Smooth           | Noisy             |
| Best use case    | Long-term trends | Outlier detection |


# Day 3: Introduction to Predictive Analytics

Task 3: Build a Predictive Model
1. Problem: Use a dataset of past customer purchases to build a predictive model that forecasts whether a new customer will make a purchase in the future.

2. Task:

Split the dataset into training and test sets.

Choose a simple algorithm (e.g., logistic regression or decision tree) and build the model.

Evaluate the model using appropriate metrics (e.g., accuracy, precision).

- Problem Statement

Using past customer purchase data, build a model that predicts whether a new customer will make a purchase in the future.

Target Variable:

Purchased → 1 (Yes), 0 (No)

Step 1: Import Libraries & Load Dataset

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Loading dataset
df = pd.read_csv("customer_purchase_data.csv")

df.head()


Assumed Dataset Columns

Age

Annual_Income

Spending_Score

Purchased

Step 2: Feature Selection & Data Preparation

In [None]:
X = df[['Age', 'Annual_Income', 'Spending_Score']]
y = df['Purchased']


Step 3: Train–Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


Step 4: Feature Scaling

In [None]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


Step 5: Build the Predictive Model

In [None]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)


Step 6: Making Predictions

In [None]:
y_pred = model.predict(X_test_scaled)


Step 7: Model Evaluation

In [None]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

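# Printing the metrics explicitly (a bare tuple mid-cell is not displayed)
print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")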

print(classification_report(y_test, y_pred))


Interpretation of Results

Accuracy: Overall model correctness

Precision: How many predicted buyers actually purchased

Recall: How many actual buyers were correctly identified

F1 Score: Balanced performance metric

✔ Logistic regression is a common baseline for binary classification, and its coefficients are easy to interpret.

3. Critical Thinking:
How would you handle an imbalanced dataset (e.g., more customers who do not make a purchase than those who do)?
How can you improve the predictive power of your model?

1. Handling Imbalanced Datasets

If most customers do not purchase, the dataset is imbalanced, and a model that always predicts "no purchase" can still score high accuracy.

Solutions

Resampling

Oversample minority class

Undersample majority class

Class Weights

Penalize wrong predictions of minority class

Use better metrics

Precision, Recall, F1 instead of Accuracy
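
Two of these remedies can be sketched in scikit-learn: a stratified split and class weights. The data below is synthetic (generated with `make_classification` at roughly a 90/10 class ratio, an assumption), so the numbers only illustrate the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 10% positives (illustrative only)
X, y = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

# stratify=y keeps the class ratio the same in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same model with and without class weighting
plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"Minority recall, unweighted: {recall_plain:.2f}")
print(f"Minority recall, balanced weights: {recall_weighted:.2f}")
```

Class weighting usually raises recall on the minority class at the cost of precision, so both metrics should be checked together.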

2. Improving Predictive Power

Add more relevant features

Feature engineering

Try advanced models:

Decision Tree

Random Forest

Gradient Boosting

Hyperparameter tuning

Increase dataset size
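
Hyperparameter tuning can be sketched with a cross-validated grid search. This example uses synthetic data and tunes only the regularization strength `C` of logistic regression. The grid values and the F1 scoring choice are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (illustrative only)
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=42
)

# Scaling inside the pipeline is refit on each CV fold, avoiding leakage
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Parameter names are prefixed with the lowercase step name
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

The same pattern works for a decision tree or random forest by swapping the final pipeline step and the grid keys.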

4. Higher-Order Thinking:
Discuss the potential ethical implications of predictive models in business decisions. How could bias impact these models?

Ethical Implications of Predictive Models

Predictive models can strongly influence business decisions such as:

Who receives discounts

Who gets targeted ads

Who is denied offers

Potential Ethical Risks

Bias in data

Historical bias can disadvantage certain groups

Unfair targeting

Excluding customers based on predictions

Privacy concerns

Misuse of personal data

Mitigation Strategies

Regular bias audits

Transparent models

Ethical data collection

Human oversight in decision-making
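
A basic bias audit compares model behavior across groups. The sketch below uses a tiny hand-made table (the groups, outcomes, and predictions are all hypothetical) to compute how often each group is selected and how often its true buyers are found.

```python
import pandas as pd

# Hypothetical audit table: true outcome, model prediction, group attribute
audit = pd.DataFrame({
    "group":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "actual":    [1, 1, 0, 0, 1, 1, 0, 0],
    "predicted": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Selection rate: share of each group predicted as buyers
selection_rate = audit.groupby("group")["predicted"].mean()

# Recall: share of actual buyers in each group that the model identified
recall = audit[audit["actual"] == 1].groupby("group")["predicted"].mean()

rates = pd.DataFrame({"selection_rate": selection_rate, "recall": recall})
print(rates)
```

Large gaps between groups on either column do not prove unfairness by themselves, but they are a signal to inspect the training data and features before the model drives offers or exclusions.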

# Day 4: Advanced Data Analytics Techniques

Task 4: Customer Segmentation Using Clustering
1. Problem:

You have a dataset of customer behavior (e.g., purchase history, browsing behavior). Use clustering techniques to group customers into segments that share similar characteristics.

2. Task:

Apply a clustering algorithm (e.g., K-means) on the dataset.

Identify meaningful customer segments and interpret their characteristics (e.g., high-value customers, frequent shoppers).

- Problem Statement

You are given a dataset containing customer behavior such as:

Purchase history

Browsing behavior

Spending patterns

Your task is to group customers into meaningful segments using clustering.

Step 1: Import Libraries & Load Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Loading dataset
df = pd.read_csv("customer_behavior_data.csv")

df.head()


Assumed Dataset Columns

Annual_Income

Spending_Score

Purchase_Frequency

Step 2: Feature Selection

In [None]:
X = df[['Annual_Income', 'Spending_Score', 'Purchase_Frequency']]


Step 3: Feature Scaling

K-means relies on Euclidean distance, so without scaling, features with larger numeric ranges (such as income) would dominate the clusters.

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


Step 4: Choosing the Optimal Number of Clusters

In [None]:
wcss = []

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init for consistent results across scikit-learn versions
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8,5))
plt.plot(range(1, 11), wcss, marker='o')
plt.title("Elbow Method for Optimal K")
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS")
plt.show()


Interpretation

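The “elbow point” is where adding more clusters stops reducing WCSS substantially; it balances fit against simplicity.

K = 4 is assumed here for illustration; the actual value depends on the elbow curve for your data.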

Step 5: Apply K-Means Clustering

In [None]:
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # explicit n_init for consistent results across scikit-learn versions
df['Cluster'] = kmeans.fit_predict(X_scaled)

df.head()


Step 6: Visualize Customer Segments

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(df['Annual_Income'], df['Spending_Score'], c=df['Cluster'])
plt.title("Customer Segments")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.show()


Step 7: Interpreting Customer Segments

In [None]:
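# Averaging only the clustering features; a bare .mean() fails if df has non-numeric columns
df.groupby('Cluster')[['Annual_Income', 'Spending_Score', 'Purchase_Frequency']].mean()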


Cluster Interpretation (illustrative; cluster numbers are arbitrary and the profiles depend on the fitted data)

| Cluster | Characteristics                  | Customer Type        |
| ------- | -------------------------------- | -------------------- |
| 0       | High income, high spending       | High-value customers |
| 1       | Low income, low spending         | Low-value customers  |
| 2       | High income, low spending        | Potential customers  |
| 3       | Moderate income, frequent buyers | Loyal customers      |


3. Critical Thinking:
What business strategies could be implemented for different customer segments?
What challenges did you face when choosing the right number of clusters?

1. Business Strategies for Different Segments

High-value customers

Loyalty programs

Premium offers

Frequent shoppers

Subscription plans

Personalized discounts

Low-engagement customers

Targeted promotions

Re-engagement campaigns

Potential customers

Personalized recommendations

Incentives to increase spending

2. Challenges in Choosing the Right Number of Clusters

Elbow point not always clear

Too many clusters → overfitting

Too few clusters → oversimplification

Business interpretability vs mathematical optimality

✔ Cluster choice should balance analytics + business understanding
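
When the elbow is ambiguous, the silhouette score is a common second opinion. The sketch below runs it over a range of K on synthetic blobs. The data, the range of K, and `n_init=10` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for scaled customer features (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette needs at least 2 clusters, so the range starts at 2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("Highest silhouette at K =", best_k)
```

A higher silhouette means tighter, better-separated clusters, but the highest-scoring K is only a candidate. It still has to produce segments the business can act on.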

4. Higher-Order Thinking:
How would you use clustering to improve customer retention strategies in a business?


Using Clustering to Improve Customer Retention

Clustering can significantly enhance customer retention by:

Identifying customers at risk of churn

Creating personalized retention strategies

Detecting declining engagement early

Tailoring communication and offers

# Day 5: Decision Making and Optimization

Task 5: Optimization of Marketing Campaign

1. Problem: You are working as a data analyst for a company launching a new marketing campaign. Use data analytics to identify the best strategy to maximize return on investment (ROI).

2. Task:

Analyze customer demographics, previous campaign performance, and engagement metrics.

Recommend an optimal marketing strategy that targets the most relevant customer segments.

Explain how data supports your decision-making process.

- Problem Statement

You are working as a data analyst for a company launching a new marketing campaign. Your goal is to use data analytics to identify the best strategy to maximize ROI.

Step 1: Loading Libraries & Dataset

In [None]:
import pandas as pd
import numpy as np

# Load marketing dataset
df = pd.read_csv("marketing_campaign_data.csv")

df.head()


Assumed Dataset Columns

Customer_Age

Income

Customer_Segment

Marketing_Channel

Campaign_Cost

Revenue_Generated

Engagement_Score

Step 2: Calculating ROI

ROI = (Revenue_Generated - Campaign_Cost) / Campaign_Cost


In [None]:
df['ROI'] = (df['Revenue_Generated'] - df['Campaign_Cost']) / df['Campaign_Cost']

df[['Marketing_Channel', 'Campaign_Cost', 'Revenue_Generated', 'ROI']].head()


Step 3: Analyzing Performance by Marketing Channel

In [None]:
channel_performance = df.groupby('Marketing_Channel')[['Campaign_Cost', 'Revenue_Generated', 'ROI']].mean()
channel_performance


Interpretation

Higher ROI → better return for money spent

Channels with high revenue but low ROI may be inefficient
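
One caveat when averaging ROI per channel: the mean of per-campaign ROI treats a small campaign and a large one equally. The sketch below, on made-up numbers, contrasts that with a pooled ROI computed from each channel's total cost and revenue.

```python
import pandas as pd

# Hypothetical campaigns (values chosen to make the contrast visible)
campaigns = pd.DataFrame({
    "Marketing_Channel": ["Email", "Email", "Social", "Social"],
    "Campaign_Cost": [100.0, 10_000.0, 2_000.0, 3_000.0],
    "Revenue_Generated": [400.0, 11_000.0, 3_000.0, 4_500.0],
})
campaigns["ROI"] = (
    campaigns["Revenue_Generated"] - campaigns["Campaign_Cost"]
) / campaigns["Campaign_Cost"]

# Mean of per-campaign ROI: every campaign counts equally
mean_roi = campaigns.groupby("Marketing_Channel")["ROI"].mean()

# Pooled ROI: total profit over total spend, so large campaigns weigh more
totals = campaigns.groupby("Marketing_Channel")[["Campaign_Cost", "Revenue_Generated"]].sum()
pooled_roi = (totals["Revenue_Generated"] - totals["Campaign_Cost"]) / totals["Campaign_Cost"]

print(pd.DataFrame({"mean_roi": mean_roi, "pooled_roi": pooled_roi}).round(3))
```

Here Email looks best by mean ROI because of one tiny, very profitable campaign, while Social returns more per unit of total spend. Reporting both avoids over-reading either figure.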

Step 4: Customer Segment Analysis

In [None]:
segment_performance = df.groupby('Customer_Segment')['ROI'].mean().sort_values(ascending=False)
segment_performance


Insights

Certain customer segments consistently generate higher ROI

Marketing efforts should focus on high-ROI segments

Step 5: Engagement-Based Optimization

In [None]:
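# Mean engagement and mean ROI per channel, side by side, to check whether they move together
channel_engagement = df.groupby('Marketing_Channel')[['Engagement_Score', 'ROI']].mean()
channel_engagement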


Insight

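If channels with higher engagement also show higher ROI, engagement can serve as an early indicator of performance; verify this in the data rather than assuming it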

Engagement metrics help validate marketing effectiveness

How Data Supports Decision-Making

ROI identifies profitable strategies

Segmentation reveals where marketing works best

Engagement metrics validate customer interest

Historical performance reduces guesswork


3. Critical Thinking:
What factors could lead to an ineffective marketing strategy even if the data suggests it's the best option?
How would you handle uncertain or incomplete data when making recommendations?

1. Why a Data-Suggested Strategy Might Still Fail

Market conditions change

Customer preferences evolve

Data may be outdated

External factors (e.g., competitor actions, economic shifts)

Poor campaign execution despite good strategy

2. Handling Uncertain or Incomplete Data

State assumptions explicitly and report estimates with confidence intervals

Rely on multiple metrics, not one

Run pilot campaigns or A/B tests

Continuously update models as new data arrives
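
A confidence interval makes the uncertainty in a small sample explicit. The sketch below bootstraps a 95% interval for the mean ROI of a channel. The eight ROI values and the 5000 resamples are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ROI observations from eight past campaigns on one channel
roi_samples = np.array([0.12, 0.35, -0.05, 0.50, 0.20, 0.08, 0.41, 0.15])

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(roi_samples, size=len(roi_samples), replace=True).mean()
    for _ in range(5000)
])

# Percentile bootstrap interval for the mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean ROI: {roi_samples.mean():.3f}")
print(f"95% bootstrap CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

A wide interval is a reason to run a pilot or A/B test before committing budget, rather than acting on the point estimate alone.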

4. Higher-Order Thinking:
Propose a data-driven approach to optimize the allocation of marketing resources across different channels (e.g., social media, email marketing, direct mail).

Optimizing Marketing Resource Allocation Across Channels

Proposed Data-Driven Approach

Budget Allocation by ROI

Assign higher budget to channels with higher ROI

Multi-Channel Optimization

Combine channels for different customer segments

A/B Testing

Test variations in messaging and channel mix

Continuous Feedback Loop

Monitor performance in real time

Reallocate budget dynamically
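
One simple way to turn the ROI-based rule into numbers is proportional allocation with a minimum share per channel. The sketch below is only one possible scheme: the ROI figures, the total budget, and the 10% floor are all assumptions.

```python
import pandas as pd

# Hypothetical pooled ROI per channel from past campaigns
channel_roi = pd.Series({"Social Media": 0.60, "Email": 0.90, "Direct Mail": -0.10})

total_budget = 100_000
min_share = 0.10  # assumed floor so every channel keeps generating data

# Channels with negative ROI get weight 0 (they receive only the floor).
# Note: this divides by zero if no channel has positive ROI.
weights = channel_roi.clip(lower=0)

# Split the budget left after the floors in proportion to ROI
flexible = 1 - min_share * len(channel_roi)
shares = min_share + flexible * weights / weights.sum()
allocation = (shares * total_budget).round(2)

print(allocation)
```

Keeping a floor for weak channels means their ROI can still be measured next period, which supports the feedback loop. The allocation can be recomputed whenever the ROI estimates are refreshed.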