**3 Tasks - To - Do:**

**3.1 Problem - 1: Getting Started with Data Exploration - Some Warm-up Exercises:**

**1. Data Exploration and Understanding:**

**• Dataset Overview:**
1. Load the dataset and display the first 10 rows.
2. Identify the number of rows and columns in the dataset.
3. List all the columns and their data types.

In [12]:
import pandas as pd

# Load the dataset
dataset_path = '/content/drive/MyDrive/WHR-2024-5CS037.csv'
df = pd.read_csv(dataset_path)

# 1. Display the first 10 rows
print("First 10 rows of the dataset:")
print(df.head(10))

# 2. Identify the number of rows and columns
num_rows, num_columns = df.shape
print(f"\nNumber of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

# 3. List all columns and their data types
print("\nColumns and their data types:")
print(df.dtypes)


First 10 rows of the dataset:
  Country name  score  Log GDP per capita  Social support  \
0      Finland  7.741               1.844           1.572   
1      Denmark  7.583               1.908           1.520   
2      Iceland  7.525               1.881           1.617   
3       Sweden  7.344               1.878           1.501   
4       Israel  7.341               1.803           1.513   
5  Netherlands  7.319               1.901           1.462   
6       Norway  7.302               1.952           1.517   
7   Luxembourg  7.122               2.141           1.355   
8  Switzerland  7.060               1.970           1.425   
9    Australia  7.057               1.854           1.461   

   Healthy life expectancy  Freedom to make life choices  Generosity  \
0                    0.695                         0.859       0.142   
1                    0.699                         0.823       0.204   
2                    0.718                         0.819       0.258   
3         

Import Libraries: The pandas library is imported to handle and analyze the dataset.

Load Dataset: The pd.read_csv() function loads the dataset from a specified file path into a DataFrame (df).

Display Rows: df.head(10) prints the first 10 rows to give a quick view of the data.

Dataset Dimensions: df.shape retrieves the number of rows and columns in the dataset.

Column Details: df.dtypes lists all column names and their respective data types (e.g., integer, float, string).

**• Basic Statistics:**
1. Calculate the mean, median, and standard deviation for the Score column.
2. Identify the country with the highest and lowest happiness scores.

In [None]:
# Calculate the mean, median, and standard deviation for the 'score' column
# (the column name is lowercase 'score' in this file, as the df.head(10) output shows)
mean_score = df['score'].mean()  # Mean of the score column
median_score = df['score'].median()  # Median of the score column
std_dev_score = df['score'].std()  # Sample standard deviation (pandas default, ddof=1)

# Print the results
print(f"Mean Score: {mean_score}")
print(f"Median Score: {median_score}")
print(f"Standard Deviation of Score: {std_dev_score}")
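
The second item in the task (the countries with the highest and lowest scores) could be handled with `idxmax` and `idxmin`. This is a minimal sketch, assuming the column names shown in the `df.head(10)` output (`Country name`, `score`). The small `sample` frame is a stand-in that reuses three rows from that output; on the real data, `df` would take its place.

```python
import pandas as pd

# Stand-in for df: three rows copied from the head(10) output above
sample = pd.DataFrame({
    "Country name": ["Finland", "Denmark", "Iceland"],
    "score": [7.741, 7.583, 7.525],
})

# idxmax/idxmin return the index label of the row holding the extreme value
happiest = sample.loc[sample["score"].idxmax(), "Country name"]
least_happy = sample.loc[sample["score"].idxmin(), "Country name"]

print(f"Highest score: {happiest}")
print(f"Lowest score: {least_happy}")
```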


**Missing Values:**
1. Check if there are any missing values in the dataset. If so, display the total count for each column.

In [6]:
# Check for missing values in the dataset
missing_values = df.isnull().sum()

print("Missing values in each column:")
print(missing_values)


Missing values in each column:
Country name                    0
score                           0
Log GDP per capita              3
Social support                  3
Healthy life expectancy         3
Freedom to make life choices    3
Generosity                      3
Perceptions of corruption       3
Dystopia + residual             3
dtype: int64


df.isnull():
Identifies all missing values in the DataFrame. Each cell is marked as True if it contains a missing value (NaN) and False otherwise.

sum():
Sums up the True values for each column, giving the total count of missing values in that column.
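
To make the two steps concrete, here is a small sketch on a hypothetical frame (`toy` is an illustrative name, not part of the dataset). It shows the Boolean mask produced by `isnull()` and the per-column counts produced by `sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries in column "b"
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [np.nan, 5.0, np.nan]})

mask = toy.isnull()               # Boolean frame: True where a value is missing
missing_per_column = mask.sum()   # True counts as 1, so this counts NaNs per column

print(mask)
print(missing_per_column)
```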

**Filtering and Sorting:**
1. Filter the dataset to show only the countries with a Score greater than 7.5.
2. For the filtered dataset - Sort the dataset by GDP per Capita in descending order and display the
top 10 rows.

In [None]:
# Filter the dataset to include only countries with score > 7.5
# (column names follow the df.head(10) output: 'score', 'Log GDP per capita')
filtered_df = df[df['score'] > 7.5]

# Sort the filtered dataset by Log GDP per capita in descending order
sorted_df = filtered_df.sort_values(by='Log GDP per capita', ascending=False)

# Display the top 10 rows
top_10_sorted = sorted_df.head(10)
print("Top 10 countries with score > 7.5, sorted by Log GDP per capita:")
print(top_10_sorted)


**Adding New Columns:**
1. Create a new column called Happiness Category that categorizes countries into three categories
based on their Score:

Low − (Score < 4)

Medium − (4 ≤ Score ≤ 6)

High − (Score > 6)

In [None]:
# Define a function to categorize countries based on their Score
def categorize_happiness(score):
    if score < 4:
        return 'Low'
    elif 4 <= score <= 6:
        return 'Medium'
    else:
        return 'High'

# Apply the function to create the 'Happiness Category' column
df['Happiness Category'] = df['score'].apply(categorize_happiness)

# Display the updated DataFrame with the new column
print("\nDataset with the new 'Happiness Category' column:")
print(df[['Country name', 'score', 'Happiness Category']].head())
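
The same three-way split can also be written without calling a Python function once per row. The following is a minimal sketch using `np.select`, run on hypothetical scores chosen to sit on the category boundaries; `demo_scores` and `categories` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical scores placed on and around the boundaries 4 and 6
demo_scores = pd.Series([3.9, 4.0, 6.0, 6.1])

# Conditions are checked in order; the first match wins, 'High' is the fallback
conditions = [demo_scores < 4, demo_scores <= 6]
choices = ["Low", "Medium"]
categories = pd.Series(np.select(conditions, choices, default="High"),
                       index=demo_scores.index)

print(categories.tolist())
```

Because the conditions are evaluated in order, the second one only needs `<= 6`: any score below 4 has already been labelled Low.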


**2. Data Visualizations:**

• Bar Plot: Plot the top 10 happiest countries by Score using a bar chart.

• Line Plot: Plot the top 10 unhappiest countries by Score using a line chart.

• Histogram: Plot a histogram for the Score column to show its distribution, and interpret it.

• Scatter Plot: Plot a scatter plot between GDP per Capita and Score to visualize their relationship.
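
One way to approach the four plots listed above is sketched here. It assumes the column names from the `df.head(10)` output (`Country name`, `score`, `Log GDP per capita`). The helper `plot_overview` and the synthetic `demo` frame are hypothetical; the synthetic values are random and are NOT the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_overview(data):
    """Draw the four requested plots on one 2x2 figure and return it."""
    top10 = data.nlargest(10, "score")
    bottom10 = data.nsmallest(10, "score")
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Bar plot: the ten highest scores
    axes[0, 0].bar(top10["Country name"], top10["score"], color="skyblue")
    axes[0, 0].set_title("Top 10 happiest countries")
    axes[0, 0].tick_params(axis="x", labelrotation=45)

    # Line plot: the ten lowest scores
    axes[0, 1].plot(bottom10["Country name"], bottom10["score"], marker="o")
    axes[0, 1].set_title("Top 10 unhappiest countries")
    axes[0, 1].tick_params(axis="x", labelrotation=45)

    # Histogram: distribution of the score column
    axes[1, 0].hist(data["score"].dropna(), bins=10, edgecolor="black")
    axes[1, 0].set_title("Distribution of score")

    # Scatter plot: Log GDP per capita against score
    axes[1, 1].scatter(data["Log GDP per capita"], data["score"], alpha=0.7)
    axes[1, 1].set_xlabel("Log GDP per capita")
    axes[1, 1].set_ylabel("score")
    axes[1, 1].set_title("score vs Log GDP per capita")

    fig.tight_layout()
    return fig, axes

# Synthetic stand-in data (random values, not the real dataset)
rng = np.random.default_rng(0)
gdp = rng.uniform(0.5, 2.0, size=30)
demo = pd.DataFrame({
    "Country name": [f"Country {i:02d}" for i in range(30)],
    "Log GDP per capita": gdp,
    "score": 3.0 + 2.0 * gdp + rng.normal(0, 0.3, size=30),
})
fig, axes = plot_overview(demo)
```

In the notebook, calling `plot_overview(df)` followed by `plt.show()` would draw the same four panels for the real data. The interpretation of the histogram still has to be written from what the real distribution looks like.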



**3.2 Problem - 2 - Some Advanced Data Exploration Tasks:**

**Task - 1 - Setup Task - Preparing the South-Asia Dataset:**

Steps:
1. Define the countries in South Asia with a list, for example:
south_asian_countries = ["Afghanistan", "Bangladesh", "Bhutan", "India",
"Maldives", "Nepal", "Pakistan", "Sri Lanka"]
2. Use the list from step 1 to filter the dataset (i.e., keep only the rows whose country appears in the list).
3. Save the filtered DataFrame as a separate CSV file for future use.

In [None]:
# Step 1: Define the list of South Asian countries
south_asian_countries = ["Afghanistan", "Bangladesh", "Bhutan", "India", "Maldives", "Nepal", "Pakistan", "Sri Lanka"]

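# Step 2: Filter the dataset to include only countries in the South Asia list
# ('Country name' is the column shown in the df.head(10) output; .copy() gives an
# independent DataFrame, so adding columns later does not raise SettingWithCopyWarning)
south_asia_df = df[df['Country name'].isin(south_asian_countries)].copy()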

# Step 3: Save the filtered DataFrame to a separate CSV file
south_asia_df.to_csv('south_asia_dataset.csv', index=False)

# Print a message to confirm
print("South Asia dataset has been saved as 'south_asia_dataset.csv'.")


**Task - 2 - Composite Score Ranking:**

Tasks:
1. Using the SouthAsia DataFrame, create a new column called Composite Score that combines the
following metrics:

Composite Score = 0.40 × GDP per Capita + 0.30 × Social Support + 0.30 × Healthy Life Expectancy
2. Rank the South Asian countries based on the Composite Score in descending order.
3. Visualize the top 5 countries using a horizontal bar chart showing the Composite Score.
4. Discuss whether the rankings based on the Composite Score align with the original Score - support your
discussion with some visualization plot.

In [1]:
#Step 1: Create the Composite Score Column
# Column names follow the dataset: 'Log GDP per capita', 'Social support', 'Healthy life expectancy'
south_asia_df['Composite Score'] = (
    0.40 * south_asia_df['Log GDP per capita'] +
    0.30 * south_asia_df['Social support'] +
    0.30 * south_asia_df['Healthy life expectancy']
)


In [2]:
#Step 2: Rank the South Asian Countries
south_asia_df = south_asia_df.sort_values(by='Composite Score', ascending=False).reset_index(drop=True)


In [3]:
#Step 3: Visualize the Top 5 Countries
import matplotlib.pyplot as plt

# Select the top 5 countries
top_5_countries = south_asia_df.head(5)

# Plot the horizontal bar chart
plt.barh(top_5_countries['Country name'], top_5_countries['Composite Score'], color='skyblue')
plt.xlabel('Composite Score')
plt.ylabel('Country')
plt.title('Top 5 South Asian Countries by Composite Score')
plt.gca().invert_yaxis()  # Invert y-axis for better visualization
plt.show()


In [None]:
#Step 4: Compare Composite Score Rankings with Original Score
# Top 5 countries by original score
top_5_original = south_asia_df.nlargest(5, 'score')

# Plot comparison
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Composite Score Bar Chart
ax[0].barh(top_5_countries['Country name'], top_5_countries['Composite Score'], color='skyblue')
ax[0].set_title('Top 5 by Composite Score')
ax[0].set_xlabel('Composite Score')
ax[0].invert_yaxis()

# Original Score Bar Chart
ax[1].barh(top_5_original['Country name'], top_5_original['score'], color='lightgreen')
ax[1].set_title('Top 5 by Original Happiness Score')
ax[1].set_xlabel('Happiness Score')
ax[1].invert_yaxis()

plt.tight_layout()
plt.show()


**Task - 3 - Outlier Detection:**

Tasks:
1. Identify outlier countries in South Asia based on their Score and GDP per Capita.
2. Define outliers using the 1.5 × IQR rule.
3. Create a scatter plot with GDP per Capita on the x-axis and Score on the y-axis, highlighting outliers
in a different color.
4. Discuss the characteristics of these outliers and their potential impact on regional averages.

In [None]:
#Step 1: Identify Outliers Based on Score and GDP per Capita
def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Identify outliers for score
score_outliers, score_lower, score_upper = detect_outliers(south_asia_df, 'score')

# Identify outliers for Log GDP per capita
gdp_outliers, gdp_lower, gdp_upper = detect_outliers(south_asia_df, 'Log GDP per capita')


In [None]:
#Step 2: Highlight Outliers
import matplotlib.pyplot as plt

# Mark all data points
plt.scatter(south_asia_df['Log GDP per capita'], south_asia_df['score'], label='Data Points', alpha=0.7)

# Highlight score outliers
plt.scatter(score_outliers['Log GDP per capita'], score_outliers['score'],
            color='red', label='Score Outliers', edgecolor='black')

# Highlight Log GDP per capita outliers
plt.scatter(gdp_outliers['Log GDP per capita'], gdp_outliers['score'],
            color='orange', label='GDP Outliers', edgecolor='black')

# Labels and legend
plt.xlabel('Log GDP per capita')
plt.ylabel('Happiness Score')
plt.title('Outlier Detection in South Asia')
plt.legend()
plt.grid(alpha=0.5)
plt.show()


Step 3: Discussion on Outliers

Characteristics of Outliers:

• Score Outliers: Countries with exceptionally high or low happiness scores compared to others, for example a country whose score lies significantly below or above the regional average.

• GDP Outliers: Countries with extremely high or low GDP per capita, for example much wealthier or much poorer nations with large disparities in economic conditions.

Potential Impact on Regional Averages:

• Influence on Metrics: Outliers can skew the average, creating a distorted perception of the overall region's performance. For instance, a single high-GDP country can raise the average GDP per capita, masking challenges faced by lower-performing countries.

• Policy Implications: Identifying outliers helps policymakers understand deviations and allocate resources or interventions appropriately.
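
The effect on averages can be illustrated with a small worked example. The values below are hypothetical (not taken from the dataset) and the variable names are illustrative. The sketch flags outliers with the 1.5 × IQR rule and compares the mean with and without them.

```python
import pandas as pd

# Hypothetical GDP-like values; the last one is deliberately extreme
values = pd.Series([1.0, 1.1, 1.2, 1.3, 5.0])

# 1.5 x IQR rule
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

mean_all = values.mean()                     # pulled upwards by the extreme value
mean_without = values[~is_outlier].mean()    # mean of the remaining values
median_all = values.median()                 # barely affected by the extreme value

print(f"Mean with outlier:    {mean_all:.2f}")
print(f"Mean without outlier: {mean_without:.2f}")
print(f"Median (all values):  {median_all:.2f}")
```

Here the single extreme value moves the mean from 1.15 to 1.92, while the median stays at 1.20. This is why a median is often reported alongside the mean for small regional groups.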

**Task - 4 - Exploring Trends Across Metrics:**

Tasks:
1. Choose two metrics (e.g., Freedom to Make Life Choices and Generosity) and calculate their correlation
{pearson correlation} with the Score for South Asian countries.
2. Create scatter plots with trendlines for these metrics against the Score.
3. Identify and discuss the strongest and weakest relationships between these metrics and the Score for
South Asian countries.

In [None]:
#Step 1: Calculate Pearson Correlation
# Calculate Pearson correlation
# Column names follow the dataset ('Freedom to make life choices', 'score')
freedom_corr = south_asia_df['Freedom to make life choices'].corr(south_asia_df['score'])
generosity_corr = south_asia_df['Generosity'].corr(south_asia_df['score'])

print(f"Correlation with Freedom to Make Life Choices: {freedom_corr:.2f}")
print(f"Correlation with Generosity: {generosity_corr:.2f}")


In [None]:
#Step 2: Scatter Plots with Trendlines
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot for Freedom to Make Life Choices
plt.figure(figsize=(10, 5))
sns.regplot(x='Freedom to make life choices', y='score', data=south_asia_df,
            scatter_kws={'color': 'blue', 'alpha': 0.6}, line_kws={'color': 'red'})
plt.title('Score vs Freedom to Make Life Choices')
plt.xlabel('Freedom to Make Life Choices')
plt.ylabel('Happiness Score')
plt.grid(alpha=0.3)
plt.show()

# Scatter plot for Generosity
plt.figure(figsize=(10, 5))
sns.regplot(x='Generosity', y='score', data=south_asia_df,
            scatter_kws={'color': 'green', 'alpha': 0.6}, line_kws={'color': 'red'})
plt.title('Score vs Generosity')
plt.xlabel('Generosity')
plt.ylabel('Happiness Score')
plt.grid(alpha=0.3)
plt.show()


Step 3: Discussion of Results

Strongest Relationship:

• The metric with the highest absolute correlation with the Score indicates the strongest relationship.

• For instance, if Freedom to Make Life Choices has a higher correlation (e.g., 0.65), it suggests that countries with higher freedom tend to have higher happiness scores.

Weakest Relationship:

• The metric with the lower absolute correlation reflects a weaker association with the Score.

• If Generosity has a lower correlation (e.g., 0.15), it implies that generosity has a limited direct association with happiness scores in the region.

Insights:

• Freedom to Make Life Choices: A stronger correlation suggests that empowering individuals with choices is closely linked to perceived happiness.

• Generosity: A weaker correlation might indicate cultural, economic, or social factors that moderate its impact on happiness.
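
For reference, the Pearson coefficient discussed above is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on hypothetical values (`x` and `y` are illustrative, not dataset columns) and checks the result against `Series.corr`, whose default method is Pearson.

```python
import numpy as np
import pandas as pd

x = pd.Series([0.2, 0.4, 0.5, 0.7, 0.9])   # hypothetical metric values
y = pd.Series([3.1, 4.0, 4.2, 5.1, 5.6])   # hypothetical scores

# r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # method="pearson" is the default

print(f"Manual r: {r_manual:.4f}")
print(f"pandas r: {r_pandas:.4f}")
```

With at most eight South Asian countries in the list, a coefficient like this rests on very few points, so it is better read as a rough indication than as a precise estimate.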

**Task - 5 - Gap Analysis:**

Tasks:
1. Add a new column, GDP-Score Gap, which is the difference between GDP per Capita and the Score
for each South Asian country.
2. Rank the South Asian countries by this gap in both ascending and descending order.
3. Highlight the top 3 countries with the largest positive and negative gaps using a bar chart.
4. Analyze the reasons behind these gaps and their implications for South Asian countries.

In [None]:
#Step 1: Add a New Column for GDP-Score Gap
# Add a new column for GDP-Score Gap
south_asia_df['GDP-Score Gap'] = south_asia_df['Log GDP per capita'] - south_asia_df['score']

# Display the updated DataFrame
south_asia_df[['Country name', 'Log GDP per capita', 'score', 'GDP-Score Gap']].head()


In [None]:
#Step 2: Rank Countries by GDP-Score Gap
# Sort the DataFrame by GDP-Score Gap in ascending and descending order
ascending_gap = south_asia_df.sort_values(by='GDP-Score Gap', ascending=True)
descending_gap = south_asia_df.sort_values(by='GDP-Score Gap', ascending=False)

# Display the top 3 countries with the largest positive and negative gaps
top_positive_gaps = descending_gap[['Country name', 'GDP-Score Gap']].head(3)
top_negative_gaps = ascending_gap[['Country name', 'GDP-Score Gap']].head(3)

print("Top 3 Countries with Largest Positive Gaps:")
print(top_positive_gaps)
print("\nTop 3 Countries with Largest Negative Gaps:")
print(top_negative_gaps)


In [None]:
#Step 3: Visualize Top 3 Countries with Largest Positive and Negative Gaps
import matplotlib.pyplot as plt

# Bar chart for largest positive gaps
plt.figure(figsize=(10, 5))
plt.bar(top_positive_gaps['Country name'], top_positive_gaps['GDP-Score Gap'], color='green')
plt.title('Top 3 Countries with Largest Positive GDP-Score Gaps')
plt.xlabel('Country')
plt.ylabel('GDP-Score Gap')
plt.grid(alpha=0.3)
plt.show()

# Bar chart for largest negative gaps
plt.figure(figsize=(10, 5))
plt.bar(top_negative_gaps['Country name'], top_negative_gaps['GDP-Score Gap'], color='red')
plt.title('Top 3 Countries with Largest Negative GDP-Score Gaps')
plt.xlabel('Country')
plt.ylabel('GDP-Score Gap')
plt.grid(alpha=0.3)
plt.show()


Step 4: Analysis of Gaps and Implications

Analysis:

• Positive Gaps: Countries with high GDP per Capita but relatively lower happiness scores. A possible reason is that economic growth may not translate into perceived well-being because of inequality, lack of social support, or environmental challenges. Implication: policymakers should focus on improving factors like social support and personal freedom to align economic prosperity with happiness.

• Negative Gaps: Countries with lower GDP per Capita but relatively high happiness scores. Possible reasons are strong social support networks, cultural factors, or efficient use of resources. Implication: this highlights the importance of non-economic factors in driving happiness and can serve as a model for other nations in the region.

Key Insights:

• The gap analysis provides actionable insights for governments in South Asia to prioritize policies addressing well-being disparities.

• Positive gaps indicate opportunities to improve social cohesion and address income inequality.

• Negative gaps showcase how social and cultural resilience can drive happiness even in economically constrained settings.
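
One caveat when reading the gap: in the `df.head(10)` output shown earlier, the `Log GDP per capita` values are around 2 while the `score` values are around 7, so for those rows a raw difference is negative and largely reflects the difference in scale. A possible alternative, which is an assumption on my part and not part of the task statement, is to standardise both columns before subtracting. The sketch below uses hypothetical countries and values; `zscore` is an illustrative helper.

```python
import pandas as pd

demo = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],          # hypothetical countries
    "Log GDP per capita": [0.6, 1.0, 1.2, 1.4],    # hypothetical values
    "score": [4.5, 3.9, 5.2, 4.0],                 # hypothetical values
})

def zscore(col):
    """Standardise a column to mean 0 and sample standard deviation 1."""
    return (col - col.mean()) / col.std()

# Raw gap: dominated by the different scales of the two columns
demo["Raw Gap"] = demo["Log GDP per capita"] - demo["score"]

# Standardised gap: positive where GDP ranks higher than score within the group
demo["Standardised Gap"] = zscore(demo["Log GDP per capita"]) - zscore(demo["score"])

print(demo.sort_values("Standardised Gap", ascending=False))
```

In this toy example every raw gap is negative, whereas the standardised gap has both signs and therefore separates "GDP ahead of happiness" from "happiness ahead of GDP".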

**3.3 Problem - 3 - Comparative Analysis:**

**Task - 1 - Setup Task - Preparing the Middle Eastern Dataset:**

Tasks:
1. Similar to Task - 1 of Problem 2, create a DataFrame for Middle Eastern countries. As a hint, use the
following list:
middle_east_countries = ["Bahrain", "Iran", "Iraq", "Israel", "Jordan",
"Kuwait", "Lebanon", "Oman", "Palestine", "Qatar", "Saudi Arabia", "Syria",
"United Arab Emirates", "Yemen"]

Complete the following task:
1. Descriptive Statistics:
• Calculate the mean and standard deviation of the Score for both South Asia and the Middle East.
• Which region has higher happiness Scores on average?
2. Top and Bottom Performers:
• Identify the top 3 and bottom 3 countries in each region based on the Score.
• Plot bar charts comparing these countries.
3. Metric Comparisons:
• Compare key metrics like GDP per Capita, Social Support, and Healthy Life Expectancy
between the regions using grouped bar charts.
• Which metrics show the largest disparity between the two regions?
4. Happiness Disparity:
• Compute the range (max - min) and coefficient of variation (CV) for Score in both regions.
• Which region has greater variability in happiness?
5. Correlation Analysis:
• Analyze the correlation of Score with other metrics, such as Freedom to Make Life Choices and
Generosity, within each region.
• Create scatter plots to visualize and interpret the relationships.
6. Outlier Detection:
• Identify outlier countries in both regions based on Score and GDP per Capita.
• Plot these outliers and discuss their implications.
7. Visualization:
• Create boxplots comparing the distribution of Score between South Asia and the Middle East.
• Interpret the key differences in distribution shapes, medians, and outliers.
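
For item 4 in the list above, the range is the maximum minus the minimum, and the coefficient of variation (CV) is the standard deviation divided by the mean, which makes variability comparable between groups with different average levels. This is a minimal sketch on hypothetical scores; `scores`, `score_range` and `cv` are illustrative names.

```python
import pandas as pd

scores = pd.Series([4.0, 5.0, 6.0])  # hypothetical scores

score_range = scores.max() - scores.min()   # max - min
cv = scores.std() / scores.mean()           # pandas .std() is the sample std (ddof=1)

print(f"Range: {score_range:.2f}")
print(f"CV: {cv:.2f}")
```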

In [None]:
#Task 1: Preparing the Middle Eastern Dataset
#Step 1: Create a DataFrame for Middle Eastern Countries
# Define Middle Eastern countries
middle_east_countries = ["Bahrain", "Iran", "Iraq", "Israel", "Jordan",
                         "Kuwait", "Lebanon", "Oman", "Palestine", "Qatar",
                         "Saudi Arabia", "Syria", "United Arab Emirates", "Yemen"]

# Filter dataset for Middle Eastern countries
# (df is the DataFrame loaded in Problem 1; 'Country name' is its country column)
middle_east_df = df[df['Country name'].isin(middle_east_countries)].copy()

# Display the filtered dataset
print(middle_east_df)


In [None]:
#1. Descriptive Statistics
# Calculate mean and standard deviation for South Asia and Middle East
south_asia_mean = south_asia_df['score'].mean()
south_asia_std = south_asia_df['score'].std()

middle_east_mean = middle_east_df['score'].mean()
middle_east_std = middle_east_df['score'].std()

# Display results
print(f"South Asia - Mean Score: {south_asia_mean}, Standard Deviation: {south_asia_std}")
print(f"Middle East - Mean Score: {middle_east_mean}, Standard Deviation: {middle_east_std}")

# Determine which region has higher scores on average
region_comparison = "South Asia" if south_asia_mean > middle_east_mean else "Middle East"
print(f"The region with higher happiness scores on average is: {region_comparison}")


In [None]:
#2. Top and Bottom Performers
# South Asia top 3 and bottom 3
south_asia_top = south_asia_df.nlargest(3, 'score')[['Country name', 'score']]
south_asia_bottom = south_asia_df.nsmallest(3, 'score')[['Country name', 'score']]

# Middle East top 3 and bottom 3
middle_east_top = middle_east_df.nlargest(3, 'score')[['Country name', 'score']]
middle_east_bottom = middle_east_df.nsmallest(3, 'score')[['Country name', 'score']]

print("South Asia Top Performers:\n", south_asia_top)
print("South Asia Bottom Performers:\n", south_asia_bottom)
print("Middle East Top Performers:\n", middle_east_top)
print("Middle East Bottom Performers:\n", middle_east_bottom)


In [None]:
import matplotlib.pyplot as plt

# Plot South Asia Top and Bottom Performers
plt.figure(figsize=(10, 5))
plt.bar(south_asia_top['Country name'], south_asia_top['score'], color='blue', label='Top Performers')
plt.bar(south_asia_bottom['Country name'], south_asia_bottom['score'], color='red', label='Bottom Performers')
plt.title('South Asia: Top and Bottom Performers')
plt.ylabel('Score')
plt.legend()
plt.show()

# Plot Middle East Top and Bottom Performers
plt.figure(figsize=(10, 5))
plt.bar(middle_east_top['Country name'], middle_east_top['score'], color='blue', label='Top Performers')
plt.bar(middle_east_bottom['Country name'], middle_east_bottom['score'], color='red', label='Bottom Performers')
plt.title('Middle East: Top and Bottom Performers')
plt.ylabel('Score')
plt.legend()
plt.show()


In [None]:
#3. Metric Comparisons
import numpy as np

# Calculate mean for each metric in both regions (column names as in the dataset)
metrics = ['Log GDP per capita', 'Social support', 'Healthy life expectancy']
south_asia_means = south_asia_df[metrics].mean()
middle_east_means = middle_east_df[metrics].mean()

# Create grouped bar chart
x = np.arange(len(metrics))
width = 0.35

plt.figure(figsize=(10, 6))
plt.bar(x - width/2, south_asia_means, width, label='South Asia', color='blue')
plt.bar(x + width/2, middle_east_means, width, label='Middle East', color='orange')
plt.xticks(x, metrics)
plt.ylabel('Mean Value')
plt.title('Comparison of Key Metrics Between South Asia and Middle East')
plt.legend()
plt.show()

# Identify the metric with largest disparity
disparity = abs(south_asia_means - middle_east_means)
largest_disparity_metric = disparity.idxmax()
print(f"Metric with largest disparity: {largest_disparity_metric}")


In [None]:
#4. Happiness Disparity
# Compute range and CV for both regions
south_asia_range = south_asia_df['score'].max() - south_asia_df['score'].min()
middle_east_range = middle_east_df['score'].max() - middle_east_df['score'].min()

south_asia_cv = south_asia_std / south_asia_mean
middle_east_cv = middle_east_std / middle_east_mean

print(f"South Asia - Range: {south_asia_range}, CV: {south_asia_cv}")
print(f"Middle East - Range: {middle_east_range}, CV: {middle_east_cv}")

# Determine region with greater variability
greater_variability = "South Asia" if south_asia_cv > middle_east_cv else "Middle East"
print(f"The region with greater variability in happiness is: {greater_variability}")


In [None]:
#5. Correlation Analysis
# Calculate correlations (column names as in the dataset)
metrics_to_analyze = ['Freedom to make life choices', 'Generosity']
south_asia_corr = south_asia_df[metrics_to_analyze + ['score']].corr()['score']
middle_east_corr = middle_east_df[metrics_to_analyze + ['score']].corr()['score']

print("South Asia Correlations:\n", south_asia_corr)
print("Middle East Correlations:\n", middle_east_corr)

# Plot scatter plots with trendlines
import seaborn as sns

for metric in metrics_to_analyze:
    plt.figure(figsize=(8, 6))
    sns.regplot(data=south_asia_df, x=metric, y='score', label='South Asia', color='blue')
    sns.regplot(data=middle_east_df, x=metric, y='score', label='Middle East', color='orange')
    plt.title(f'Correlation of {metric} with Score')
    plt.legend()
    plt.show()


In [None]:
#6. Outlier Detection
# Define outliers
def detect_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

# Detect outliers in both regions on both columns named in the task
# (score and Log GDP per capita); a country flagged on either column is kept once
def region_outliers(data):
    return pd.concat([detect_outliers(data, 'score'),
                      detect_outliers(data, 'Log GDP per capita')]).drop_duplicates()

south_asia_outliers = region_outliers(south_asia_df)
middle_east_outliers = region_outliers(middle_east_df)

print("South Asia Outliers:\n", south_asia_outliers)
print("Middle East Outliers:\n", middle_east_outliers)

# Scatter plot highlighting outliers
plt.figure(figsize=(10, 6))
plt.scatter(south_asia_df['Log GDP per capita'], south_asia_df['score'], label='South Asia', color='blue')
plt.scatter(middle_east_df['Log GDP per capita'], middle_east_df['score'], label='Middle East', color='orange')
plt.scatter(south_asia_outliers['Log GDP per capita'], south_asia_outliers['score'], label='South Asia Outliers', color='red')
plt.scatter(middle_east_outliers['Log GDP per capita'], middle_east_outliers['score'], label='Middle East Outliers', color='green')
plt.xlabel('Log GDP per capita')
plt.ylabel('Score')
plt.title('Outlier Detection')
plt.legend()
plt.show()


In [None]:
#7. Visualization
# Create boxplots
plt.figure(figsize=(8, 6))
sns.boxplot(data=[south_asia_df['score'], middle_east_df['score']], palette=['blue', 'orange'])
plt.xticks([0, 1], ['South Asia', 'Middle East'])
plt.title('Distribution of Score Between South Asia and Middle East')
plt.ylabel('Score')
plt.show()
