<a href="https://colab.research.google.com/github/ChoudharyImran/Lab-5-Exploring-the-Dataset/blob/main/Guided_Practice_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import pandas as pd

def load_and_display_csv(url):
    """
    Reads a CSV file from a URL into a pandas DataFrame and prints its first 5 rows.

    Args:
        url (str): The URL of the CSV file.
    """
    try:
        # 1. Read the CSV file into a pandas DataFrame.
        # pandas.read_csv() can directly handle URLs.
        # It assumes the first row is the header by default.
        df = pd.read_csv(url)

        # 2. Print the first 5 rows of the DataFrame.
        # The .head() method returns the first n rows (default is 5).
        print("Successfully loaded the DataFrame. Here are the first 5 rows:")
        print(df.head())
        return df

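    except OSError as e:
        # Network and HTTP failures (urllib's URLError/HTTPError) and
        # FileNotFoundError are all subclasses of OSError.
        print(f"Error: The file at the URL could not be read: {e}")
        return None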
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# URL of the 2016 World Happiness Report CSV
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0272EN-SkillsNetwork/labs/dataset/2016.csv"

# Run the function with the specified URL and assign the returned DataFrame to df
df = load_and_display_csv(file_url)

Successfully loaded the DataFrame. Here are the first 5 rows:
       Country          Region  Happiness Rank  Happiness Score  \
0      Denmark  Western Europe               1            7.526   
1  Switzerland  Western Europe               2            7.509   
2      Iceland  Western Europe               3            7.501   
3       Norway  Western Europe               4            7.498   
4      Finland  Western Europe               5            7.413   

   Lower Confidence Interval Upper Confidence Interval  \
0                      7.460                     7.592   
1                      7.428                      7.59   
2                      7.333                     7.669   
3                      7.421                     7.575   
4                      7.351                     7.475   

  Economy (GDP per Capita)   Family Health (Life Expectancy)  Freedom  \
0                  1.44178  1.16374                  0.79504  0.57941   
1                  1.52733  1.14524     

In [4]:
# Check the data types of the columns
print("\nData types of the columns:")
print(df.dtypes)


Data types of the columns:
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval         object
Economy (GDP per Capita)          object
Family                           float64
Health (Life Expectancy)          object
Freedom                           object
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object
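Several columns that hold numbers are reported as `object` above, which usually means some entries are not cleanly parseable as numbers. A minimal sketch of how the offending raw values could be listed before any conversion is shown here. The `sample` frame and the helper name `non_numeric_entries` are hypothetical and only mimic the kind of problem seen in this dataset.

```python
import pandas as pd

# Hypothetical sample mimicking a numeric column that was read as text.
sample = pd.DataFrame({
    "Freedom": ["0.57941", " 0.58557 ", "", "n/a"],
    "Family": [1.16374, 1.14524, 1.18326, 1.1269],
})


def non_numeric_entries(frame: pd.DataFrame) -> dict:
    """Map each non-numeric column to the raw values that fail numeric parsing."""
    problems = {}
    for column in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[column]):
            continue  # already numeric, nothing to inspect
        # Strip surrounding whitespace, then try to parse; failures become NaN.
        stripped = frame[column].astype(str).str.strip()
        parsed = pd.to_numeric(stripped, errors="coerce")
        bad_values = frame.loc[parsed.isna(), column].tolist()
        if bad_values:
            problems[column] = bad_values
    return problems


print(non_numeric_entries(sample))
```

Genuine text columns such as 'Country' and 'Region' would be reported in full by this helper, so in practice it is best applied to the columns that are expected to be numeric.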


In [18]:
# 1. Identify columns with missing values.
print("\nMissing values before handling:")
print(df.isnull().sum())

# 2. Replace the missing values with the mean values of the column.
# Iterate through columns with missing values and fill them with the mean.
for column in df.columns:
    if df[column].isnull().any():
        # Calculate the mean of the column, excluding NaN values.
        mean_value = df[column].mean()
        # Fill the missing values with the calculated mean.
        df[column] = df[column].fillna(mean_value)

print("\nMissing values after handling:")
print(df.isnull().sum())

# Display the first few rows to see the changes
print("\nFirst 5 rows after handling missing values:")
display(df.head())


Missing values before handling:
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        0
Upper Confidence Interval        0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

Missing values after handling:
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        0
Upper Confidence Interval        0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


## World Happiness Report Dashboard Narrative

Welcome to this dashboard presenting an analysis of the World Happiness Report data. We have explored various factors contributing to happiness across different countries and regions, visualized through a series of interactive charts.

**1. Correlation Heatmap:**

Our analysis begins with a correlation heatmap showing the relationships between several key attributes and 'Happiness Score'. The heatmap shows strong positive correlations between 'Happiness Score' and 'Economy (GDP per Capita)', 'Family', and 'Health (Life Expectancy)'. This suggests that countries with higher GDP per capita, stronger social support, and better health outcomes tend to report higher happiness levels. 'Freedom' and 'Trust (Government Corruption)' also correlate positively with 'Happiness Score', though more weakly than the economic, social, and health factors. 'Generosity' shows only a very weak positive correlation with happiness in this dataset.
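The numbers behind a heatmap like this come from a correlation matrix. A minimal sketch of how the correlations with 'Happiness Score' could be computed and ranked is shown here. The `toy` values are made up for illustration and are not taken from the dataset, and `correlations_with` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "Happiness Score": [7.5, 6.8, 5.9, 5.1, 4.2],
    "Economy (GDP per Capita)": [1.45, 1.30, 1.05, 0.80, 0.45],
    "Generosity": [0.30, 0.22, 0.35, 0.18, 0.27],
    "Region": ["A", "A", "B", "B", "C"],
})


def correlations_with(frame: pd.DataFrame, target: str = "Happiness Score") -> pd.Series:
    """Pearson correlation of every numeric column with `target`, strongest first."""
    # numeric_only=True skips text columns such as 'Region'.
    matrix = frame.corr(numeric_only=True)
    return matrix[target].drop(target).sort_values(ascending=False)


ranking = correlations_with(toy)
print(ranking)
```

The full matrix from `frame.corr(numeric_only=True)` is also what a heatmap function such as Plotly Express's `px.imshow` would take as input.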

**2. Happiness Score vs. Economy (GDP per Capita) Scatter Plot:**

Moving on to the scatter plot, we visualize the relationship between 'Happiness Score' and 'Economy (GDP per Capita)', with data points colored by 'Region'. This plot clearly illustrates the positive trend: generally, countries with higher GDP per capita tend to have higher happiness scores. The coloring by region allows us to observe how this relationship varies across different parts of the world and whether there are regional clusters with distinct patterns.
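One way to check whether the relationship really differs by region is to compute the correlation separately within each region. The following is a sketch under assumed data: the `toy` frame is invented so that the two regions show opposite trends, and `correlation_by_region` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical data: region A trends upward, region B trends downward.
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B", "B", "B"],
    "Economy (GDP per Capita)": [1.0, 1.2, 1.4, 0.3, 0.5, 0.7],
    "Happiness Score": [6.0, 6.5, 7.0, 5.0, 4.5, 4.0],
})


def correlation_by_region(frame: pd.DataFrame,
                          x: str = "Economy (GDP per Capita)",
                          y: str = "Happiness Score") -> dict:
    """Pearson correlation between x and y, computed within each region."""
    return {
        region: group[x].corr(group[y])
        for region, group in frame.groupby("Region")
    }


per_region = correlation_by_region(toy)
print(per_region)
```

Regions with very few countries give unstable values, so the per-region figures are best read alongside the scatter plot rather than on their own.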

**3. Average Happiness Score by Region Pie Chart:**

The pie chart compares average Happiness Scores across regions. Each slice represents a region, and its size is that region's average score as a share of the sum of all regional averages. This helps us quickly identify which regions tend to have the highest and lowest average happiness levels in this dataset.
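To make the slice sizes concrete: a pie chart sizes each slice by its value divided by the sum of all values. A small sketch with made-up numbers shows the regional averages and the share each slice would occupy. The `toy` frame and the region names are hypothetical.

```python
import pandas as pd

# Hypothetical scores for two invented regions.
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Happiness Score": [7.0, 6.0, 4.0, 3.0],
})

# Mean score per region, as in the pie chart's input.
region_avg = toy.groupby("Region")["Happiness Score"].mean()

# Fraction of the pie each region's slice would occupy.
share = region_avg / region_avg.sum()

summary = (
    pd.DataFrame({"average": region_avg, "share": share})
    .sort_values("average", ascending=False)
)
print(summary)
```

Sorting by the average makes the ranking of regions readable directly from the table, which can be easier than comparing slice angles.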

**4. GDP per Capita and Healthy Life Expectancy by Country Map:**

Finally, the world map visualizes 'Economy (GDP per Capita)' by country, with 'Health (Life Expectancy)' included in the tooltip that appears when you hover over a country. This map allows for a geographical exploration of economic prosperity and health outcomes. You can visually identify regions or countries with high or low GDP per capita and simultaneously see their corresponding healthy life expectancy, providing a spatial context to the other findings.

In summary, this dashboard provides a multi-faceted view of the World Happiness Report data, highlighting the significant roles of economic factors, social support, and health in contributing to national happiness, while also allowing for regional and geographical comparisons.

In [28]:
import plotly.offline as offline

# Generate an HTML <div> for each figure. The first div loads plotly.js from
# the CDN at the version bundled with the installed Plotly; the others reuse it.
# (The plotly-latest.min.js URL no longer tracks new plotly.js releases.)
html_fig2 = offline.plot(fig2, include_plotlyjs='cdn', output_type='div')
html_fig3 = offline.plot(fig3, include_plotlyjs=False, output_type='div')
html_fig4 = offline.plot(fig4, include_plotlyjs=False, output_type='div')
html_fig5 = offline.plot(fig5, include_plotlyjs=False, output_type='div')

# Combine the HTML outputs
combined_html = f"""
<html>
<head>
<meta charset="utf-8">
<title>Plotly Dashboard</title>
</head>
<body>
<h1>Dashboard</h1>
<h2>Correlation Heatmap</h2>
{html_fig2}
<h2>Happiness Score vs. Economy (GDP per Capita)</h2>
{html_fig3}
<h2>Average Happiness Score by Region</h2>
{html_fig4}
<h2>GDP per Capita and Healthy Life Expectancy by Country</h2>
{html_fig5}
</body>
</html>
"""

# Write the combined HTML to a file
with open("dashboard.html", "w", encoding="utf-8") as f:
    f.write(combined_html)

print("Dashboard saved to dashboard.html")

Dashboard saved to dashboard.html


**Reasoning**:
Generate HTML for each of the selected Plotly figures and combine them into a single HTML file.

## Save figures to HTML

### Subtask:
Save four of the Plotly figures to a single HTML file.
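The combining step can also be factored into a small helper that takes already-rendered HTML fragments and wraps them in one page. This is a sketch using only the standard library; `build_dashboard_html` is a hypothetical name, and the placeholder `<div>` strings stand in for the output of the Plotly rendering step.

```python
from html import escape


def build_dashboard_html(title: str, sections: dict) -> str:
    """Wrap pre-rendered HTML fragments in a single page.

    `sections` maps a heading to an HTML fragment (for example a figure div).
    Headings and the title are escaped; fragments are inserted unchanged.
    """
    body = "\n".join(
        f"<h2>{escape(heading)}</h2>\n{fragment}"
        for heading, fragment in sections.items()
    )
    return (
        "<html>\n<head>\n<meta charset=\"utf-8\">\n"
        f"<title>{escape(title)}</title>\n</head>\n<body>\n"
        f"<h1>{escape(title)}</h1>\n{body}\n</body>\n</html>\n"
    )


# Placeholder fragments; real ones would be the rendered figure divs.
page = build_dashboard_html(
    "Dashboard",
    {
        "Correlation Heatmap": "<div id='fig-a'></div>",
        "Scores & GDP": "<div id='fig-b'></div>",
    },
)
print(page)
```

The helper does not load plotly.js itself, so one of the fragments (or the page head) still has to include the script for the figures to render.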

In [27]:
import plotly.express as px

# Create a world map named fig5
fig5 = px.choropleth(df,
                     locations="Country",
                     locationmode='country names',
                     color="Economy (GDP per Capita)",
                     hover_name="Country",
                     hover_data=["Health (Life Expectancy)", "Economy (GDP per Capita)"],
                     title='GDP per Capita and Healthy Life Expectancy by Country')

# Display the map
fig5.show()

**Reasoning**:
Generate a world map named fig5 using Plotly to display GDP per capita of countries and include Healthy Life Expectancy in the tooltip.

## Create World Map

### Subtask:
Generate a world map named `fig5` using Plotly to display GDP per capita of countries and include Healthy Life Expectancy in the tooltip.

In [26]:
import plotly.express as px

# Group data by 'Region' and calculate the mean 'Happiness Score' for each region
region_happiness = df.groupby('Region')['Happiness Score'].mean().reset_index()

# Create a pie chart named fig4
fig4 = px.pie(region_happiness,
              values='Happiness Score',
              names='Region',
              title='Average Happiness Score by Region')

# Display the pie chart
fig4.show()

**Reasoning**:
Group the data by 'Region', calculate the mean 'Happiness Score' for each region, and then generate a pie chart using Plotly to visualize the distribution.

## Create pie chart

### Subtask:
Generate a pie chart named `fig4` using Plotly to present the average 'Happiness Score' by 'Region'.

In [25]:
import plotly.express as px

# Create a scatter plot named fig3
fig3 = px.scatter(df,
                  x='Economy (GDP per Capita)',
                  y='Happiness Score',
                  color='Region',
                  title='Happiness Score vs. Economy (GDP per Capita) by Region')

# Display the scatter plot
fig3.show()

**Reasoning**:
Generate a scatter plot named fig3 using Plotly to show the relationship between 'Happiness Score' and 'Economy (GDP per Capita)', colored by 'Region'.

## Create scatter plot

### Subtask:
Generate a scatter plot named `fig3` using Plotly to show the relationship between 'Happiness Score' and 'Economy (GDP per Capita)', colored by 'Region'.

## Summary:

### Data Analysis Key Findings

* The initial data types were examined to identify columns requiring cleaning and type conversion.
* The 'Upper Confidence Interval', 'Economy (GDP per Capita)', 'Health (Life Expectancy)', and 'Freedom' columns were cleaned by stripping leading and trailing whitespace, replacing empty strings with NaN, and converting the values to a numeric data type (float64).
* After cleaning and conversion, the data types were verified, confirming that the target columns now have a numeric data type.
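The same three cleaning steps were applied to each column in turn, so they can be expressed once as a loop. A minimal sketch follows; `clean_numeric_columns` is a hypothetical helper name and the `raw` frame is invented for illustration. The explicit empty-string replacement is left out here because `pd.to_numeric(..., errors="coerce")` already turns empty or unparseable strings into NaN.

```python
import pandas as pd


def clean_numeric_columns(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the given columns stripped and converted to numbers."""
    cleaned = frame.copy()
    for column in columns:
        # Work on strings so .str.strip() is valid even for mixed columns.
        text = cleaned[column].astype(str).str.strip()
        # Anything that is not a valid number (including "") becomes NaN.
        cleaned[column] = pd.to_numeric(text, errors="coerce")
    return cleaned


# Hypothetical raw data with stray whitespace and a blank entry.
raw = pd.DataFrame({
    "Freedom": [" 0.57941", "0.58557 ", " ", "0.56624"],
    "Country": ["Denmark", "Switzerland", "Iceland", "Norway"],
})

tidy = clean_numeric_columns(raw, ["Freedom"])
print(tidy.dtypes)
print(tidy)
```

Returning a copy keeps the raw frame available for comparison, which makes it easy to see which entries were coerced to NaN.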

### Insights or Next Steps

* The data is now in a cleaner format, making it suitable for numerical analysis and calculations.
* Further analysis can be performed on the cleaned numerical columns to explore relationships and trends.

In [17]:
# 1. Print the data types of all columns in the DataFrame df using the .dtypes attribute.
print("\nData types after cleaning and conversion:")
print(df.dtypes)

# 2. In the output, 'Upper Confidence Interval', 'Economy (GDP per Capita)',
#    'Health (Life Expectancy)', and 'Freedom' should now be numeric (float64).

# 3. Any other column still shown as object, apart from the text columns
#    'Country' and 'Region', would need further cleaning or conversion.


Data types after cleaning and conversion:
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object


**Reasoning**:
Print the data types of all columns to verify the cleaning and conversion steps.

## Inspect cleaned data types

### Subtask:
Verify the data types after cleaning and conversion.

In [16]:
# 1. Convert the column to string type before removing leading and trailing whitespaces.
df['Freedom'] = df['Freedom'].astype(str).str.strip()

# 2. Replace any empty strings within the 'Freedom' column with NaN values.
df['Freedom'] = df['Freedom'].replace('', pd.NA)

# 3. Convert the 'Freedom' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Freedom'] = pd.to_numeric(df['Freedom'], errors='coerce')

# 4. Print the data types of the DataFrame to confirm the change.
print("\nData types after cleaning 'Freedom':")
print(df.dtypes)

# 5. Display the first few rows of the DataFrame to see the cleaned column.
print("\nFirst 5 rows after cleaning 'Freedom':")
display(df.head())


Data types after cleaning 'Freedom':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Freedom':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


## Clean 'Freedom' column

### Subtask:
Clean the 'Freedom' column by stripping leading and trailing whitespace, replacing empty strings with NaN, and converting it to a numeric type.

In [15]:
# 1. Convert the column to string type before removing leading and trailing whitespaces.
df['Health (Life Expectancy)'] = df['Health (Life Expectancy)'].astype(str).str.strip()

# 2. Replace any empty strings within the 'Health (Life Expectancy)' column with NaN values.
df['Health (Life Expectancy)'] = df['Health (Life Expectancy)'].replace('', pd.NA)

# 3. Convert the 'Health (Life Expectancy)' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Health (Life Expectancy)'] = pd.to_numeric(df['Health (Life Expectancy)'], errors='coerce')

# 4. Print the data types of the DataFrame to confirm the change.
print("\nData types after cleaning 'Health (Life Expectancy)':")
print(df.dtypes)

# 5. Display the first few rows of the DataFrame to see the cleaned column.
print("\nFirst 5 rows after cleaning 'Health (Life Expectancy)':")
display(df.head())


Data types after cleaning 'Health (Life Expectancy)':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Health (Life Expectancy)':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


## Clean 'Health (Life Expectancy)' column

### Subtask:
Clean the 'Health (Life Expectancy)' column by stripping leading and trailing whitespace, replacing empty strings with NaN, and converting it to a numeric type.

In [14]:
# 1. Convert the column to string type before removing leading and trailing whitespaces.
df['Economy (GDP per Capita)'] = df['Economy (GDP per Capita)'].astype(str).str.strip()

# 2. Replace any empty strings within the 'Economy (GDP per Capita)' column with NaN values.
df['Economy (GDP per Capita)'] = df['Economy (GDP per Capita)'].replace('', pd.NA)

# 3. Convert the 'Economy (GDP per Capita)' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Economy (GDP per Capita)'] = pd.to_numeric(df['Economy (GDP per Capita)'], errors='coerce')

# Display the data types again to confirm the change
print("\nData types after cleaning 'Economy (GDP per Capita)':")
print(df.dtypes)

# Display the first few rows to see the cleaned column
print("\nFirst 5 rows after cleaning 'Economy (GDP per Capita)':")
display(df.head())


Data types after cleaning 'Economy (GDP per Capita)':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Economy (GDP per Capita)':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


## Clean 'Economy (GDP per Capita)' column

### Subtask:
Clean the 'Economy (GDP per Capita)' column by stripping leading and trailing whitespace, replacing empty strings with NaN, and converting it to a numeric type.

In [12]:
# 1. Convert the column to string type before removing leading and trailing whitespaces.
df['Upper Confidence Interval'] = df['Upper Confidence Interval'].astype(str).str.strip()

# 2. Replace any empty strings within the 'Upper Confidence Interval' column with NaN values.
df['Upper Confidence Interval'] = df['Upper Confidence Interval'].replace('', pd.NA)

# 3. Convert the 'Upper Confidence Interval' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Upper Confidence Interval'] = pd.to_numeric(df['Upper Confidence Interval'], errors='coerce')

# Display the data types again to confirm the change
print("\nData types after cleaning 'Upper Confidence Interval':")
print(df.dtypes)

# Display the first few rows to see the cleaned column
print("\nFirst 5 rows after cleaning 'Upper Confidence Interval':")
display(df.head())


Data types after cleaning 'Upper Confidence Interval':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Upper Confidence Interval':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


## Clean 'Upper Confidence Interval' column

### Subtask:
Clean the 'Upper Confidence Interval' column by stripping leading and trailing whitespace, replacing empty strings with NaN, and converting it to a numeric type.

In [10]:
# 1. Identify columns with missing values.
print("\nMissing values before handling:")
print(df.isnull().sum())

# 2. Replace the missing values with the mean values of the column.
# Iterate through columns with missing values and fill them with the mean.
for column in df.columns:
    if df[column].isnull().any():
        # Calculate the mean of the column, excluding NaN values.
        mean_value = df[column].mean()
        # Fill the missing values with the calculated mean.
        df[column].fillna(mean_value, inplace=True)

print("\nMissing values after handling:")
print(df.isnull().sum())

# Display the first few rows to see the changes
print("\nFirst 5 rows after handling missing values:")
display(df.head())


Missing values before handling:
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        4
Upper Confidence Interval        3
Economy (GDP per Capita)         2
Family                           0
Health (Life Expectancy)         3
Freedom                          1
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

Missing values after handling:
Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  0
Lower Confidence Interval        0
Upper Confidence Interval        0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
dtype: int64

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[column].fillna(mean_value, inplace=True)


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


## Handle Missing Values

### Subtask:
Identify columns with missing values and replace them with the mean of the column.
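Mean imputation can also be written without a loop, which sidesteps the chained `df[column].fillna(..., inplace=True)` pattern altogether. The following is a sketch on invented data: the `toy` frame is hypothetical, and only numeric columns are filled because a mean is not defined for text.

```python
import pandas as pd

# Hypothetical frame with one gap in each numeric column.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Freedom": [0.5, None, 0.7],
    "Family": [1.0, 1.2, None],
})

# Per-column means of the numeric columns; NaN values are skipped by default.
column_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value under its label.
filled = toy.fillna(column_means)

print(toy.isnull().sum())
print(filled)
```

Because `fillna` returns a new DataFrame here, the original frame keeps its gaps, which is useful when the imputed and raw values need to be compared.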

In [9]:
# 1. Print the data types of all columns in the DataFrame df using the .dtypes attribute.
print("\nData types after cleaning and conversion:")
print(df.dtypes)

# 2. In the output, 'Upper Confidence Interval', 'Economy (GDP per Capita)',
#    'Health (Life Expectancy)', and 'Freedom' should now be numeric (float64).

# 3. Any other column still shown as object, apart from the text columns
#    'Country' and 'Region', would need further cleaning or conversion.


Data types after cleaning and conversion:
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object


In [8]:
# 1. Remove leading and trailing whitespaces from the 'Freedom' column.
df['Freedom'] = df['Freedom'].str.strip()

# 2. Replace any empty strings within the 'Freedom' column with NaN values.
df['Freedom'] = df['Freedom'].replace('', pd.NA)

# 3. Convert the 'Freedom' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Freedom'] = pd.to_numeric(df['Freedom'], errors='coerce')

# 4. Print the data types of the DataFrame to confirm the change.
print("\nData types after cleaning 'Freedom':")
print(df.dtypes)

# 5. Display the first few rows of the DataFrame to see the cleaned column.
print("\nFirst 5 rows after cleaning 'Freedom':")
display(df.head())


Data types after cleaning 'Freedom':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Freedom':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


In [7]:
# 1. Remove leading and trailing whitespaces from the 'Health (Life Expectancy)' column.
df['Health (Life Expectancy)'] = df['Health (Life Expectancy)'].str.strip()

# 2. Replace any empty strings within the 'Health (Life Expectancy)' column with NaN values.
df['Health (Life Expectancy)'] = df['Health (Life Expectancy)'].replace('', pd.NA)

# 3. Convert the 'Health (Life Expectancy)' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Health (Life Expectancy)'] = pd.to_numeric(df['Health (Life Expectancy)'], errors='coerce')

# 4. Print the data types of the DataFrame to confirm the change.
print("\nData types after cleaning 'Health (Life Expectancy)':")
print(df.dtypes)

# 5. Display the first few rows of the DataFrame to see the cleaned column.
print("\nFirst 5 rows after cleaning 'Health (Life Expectancy)':")
display(df.head())


Data types after cleaning 'Health (Life Expectancy)':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                           object
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Health (Life Expectancy)':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


In [6]:
# 1. Remove leading and trailing whitespaces from the 'Economy (GDP per Capita)' column.
df['Economy (GDP per Capita)'] = df['Economy (GDP per Capita)'].str.strip()

# 2. Replace any empty strings within the 'Economy (GDP per Capita)' column with NaN values.
df['Economy (GDP per Capita)'] = df['Economy (GDP per Capita)'].replace('', pd.NA)

# 3. Convert the 'Economy (GDP per Capita)' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Economy (GDP per Capita)'] = pd.to_numeric(df['Economy (GDP per Capita)'], errors='coerce')

# Display the data types again to confirm the change
print("\nData types after cleaning 'Economy (GDP per Capita)':")
print(df.dtypes)

# Display the first few rows to see the cleaned column
print("\nFirst 5 rows after cleaning 'Economy (GDP per Capita)':")
display(df.head())


Data types after cleaning 'Economy (GDP per Capita)':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)          object
Freedom                           object
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Economy (GDP per Capita)':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


In [5]:
# 1. Remove leading and trailing whitespaces from the 'Upper Confidence Interval' column.
df['Upper Confidence Interval'] = df['Upper Confidence Interval'].str.strip()

# 2. Replace any empty strings within the 'Upper Confidence Interval' column with NaN values.
df['Upper Confidence Interval'] = df['Upper Confidence Interval'].replace('', pd.NA)

# 3. Convert the 'Upper Confidence Interval' column to a numeric data type, coercing any values that cannot be converted into NaN.
df['Upper Confidence Interval'] = pd.to_numeric(df['Upper Confidence Interval'], errors='coerce')

# Display the data types again to confirm the change
print("\nData types after cleaning 'Upper Confidence Interval':")
print(df.dtypes)

# Display the first few rows to see the cleaned column
print("\nFirst 5 rows after cleaning 'Upper Confidence Interval':")
display(df.head())


Data types after cleaning 'Upper Confidence Interval':
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)          object
Family                           float64
Health (Life Expectancy)          object
Freedom                           object
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object

First 5 rows after cleaning 'Upper Confidence Interval':


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596


# Task
Write Python code that performs the following tasks using the latest pandas conventions:

1. Remove leading and trailing whitespace from the values in a column.
2. Clean a column in a DataFrame by replacing empty strings with NaN values.
3. Change the data type of each column to an appropriate type.

Note that for the upcoming pandas 3, the recommended form is `df.method({col: value}, inplace=True)` instead of `df[col].method(value, inplace=True)`. You may see a warning about this when the generated code runs in the notebook.
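
The note about pandas 3 can be illustrated with a minimal sketch. The `raw` frame and its values below are hypothetical, invented only for illustration. It shows two ways to avoid calling an in-place method on a selected column: assigning the result back to the column, or calling the method on the DataFrame with a per-column mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
raw = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Freedom": [" 0.58 ", "", "0.57"],
})

# Discouraged pattern: raw["Freedom"].replace("", np.nan, inplace=True)
# It acts on an intermediate object and may leave `raw` unchanged.

# Option 1: assign the result back to the column.
raw["Freedom"] = raw["Freedom"].str.strip()

# Option 2: call the method on the DataFrame with a per-column mapping.
raw = raw.replace({"Freedom": {"": np.nan}})

# Convert to a numeric dtype once the strings are clean.
raw["Freedom"] = pd.to_numeric(raw["Freedom"], errors="coerce")
print(raw.dtypes)
```

Both options leave `raw` itself holding the cleaned values, which is the point of the recommendation.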

## Inspect data types

### Subtask:
Re-examine the data types to confirm which columns need cleaning and type conversion.
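
One way to sketch this inspection step is to list the non-numeric columns and check how many of their values would parse as numbers. The `sample` frame and the 0.5 threshold are assumptions made for this example and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data; "Freedom" was read in as text.
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness Score": [7.5, 7.4, 7.3],
    "Freedom": ["0.58", " 0.57", ""],
})

print(sample.dtypes)

# For every non-numeric column, measure the share of values that parse as numbers.
numeric_share = {}
for column in sample.columns:
    if pd.api.types.is_numeric_dtype(sample[column]):
        continue
    parsed = pd.to_numeric(sample[column].str.strip(), errors="coerce")
    numeric_share[column] = parsed.notna().mean()

# Columns that are mostly numeric text are candidates for cleaning.
# The 0.5 cut-off is an arbitrary choice for this sketch.
candidates = [name for name, share in numeric_share.items() if share > 0.5]
print(candidates)
```

Columns such as 'Country' parse to all-NaN and are therefore left alone.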


## Clean 'Upper Confidence Interval' column

### Subtask:
Clean the 'Upper Confidence Interval' column by removing leading/trailing whitespaces, replacing empty strings with NaN, and converting it to a numeric type.


**Reasoning**:
The values are strings, so the column has dtype `object`. Calling `.str.strip()`, then `.replace('', pd.NA)`, then `pd.to_numeric(..., errors='coerce')` yields a `float64` column in which unparseable entries become NaN.



## Clean 'Economy (GDP per Capita)' column

### Subtask:
Clean the 'Economy (GDP per Capita)' column by removing leading/trailing whitespaces, replacing empty strings with NaN, and converting it to a numeric type.


**Reasoning**:
'Economy (GDP per Capita)' is still `object` after the previous step. The same strip, replace, and `pd.to_numeric` sequence converts it to `float64`.



## Clean 'Health (Life Expectancy)' column

### Subtask:
Clean the 'Health (Life Expectancy)' column by removing leading/trailing whitespaces, replacing empty strings with NaN, and converting it to a numeric type.


**Reasoning**:
'Health (Life Expectancy)' is still `object` at this point. The same strip, replace, and `pd.to_numeric` sequence converts it to `float64`.



## Clean 'Freedom' column

### Subtask:
Clean the 'Freedom' column by removing leading/trailing whitespaces, replacing empty strings with NaN, and converting it to a numeric type.


**Reasoning**:
'Freedom' is the last numeric column still stored as `object`. The same strip, replace, and `pd.to_numeric` sequence converts it to `float64`.
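
Because the four cleaning subtasks repeat the same three operations, they could be folded into one helper. The sketch below is an illustration and not the notebook's own code: `clean_numeric_columns` and the `messy` frame are hypothetical names and data. Casting to `str` first is a design choice so that the helper also works on a column that is already numeric.

```python
import numpy as np
import pandas as pd

def clean_numeric_columns(frame, columns):
    """Return a copy of `frame` with the given columns cleaned and made numeric."""
    result = frame.copy()
    for column in columns:
        # Cast to str so the helper also accepts columns that are already numeric.
        text = result[column].astype(str).str.strip()
        # Turn empty strings into missing values.
        text = text.replace("", np.nan)
        # Anything that still cannot be parsed becomes NaN.
        result[column] = pd.to_numeric(text, errors="coerce")
    return result

# Hypothetical messy input, invented for illustration.
messy = pd.DataFrame({
    "Upper Confidence Interval": ["7.592", " 7.59 ", ""],
    "Freedom": [" 0.57941", "0.58557", "n/a"],
})
columns_to_clean = ["Upper Confidence Interval", "Freedom"]
cleaned = clean_numeric_columns(messy, columns_to_clean)
print(cleaned.dtypes)
```

Since the helper returns a new frame and tolerates numeric input, re-running a cell that calls it does not fail the way a second `.str.strip()` on a `float64` column would.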



## Inspect cleaned data types

### Subtask:
Verify the data types after cleaning and conversion.


**Reasoning**:
Print the data types of all columns to verify the cleaning and conversion steps.
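
Beyond printing `df.dtypes`, the verification could be made explicit with a small check. This is a sketch under assumptions: `find_non_numeric` and the `checked` frame are hypothetical, and one column is deliberately left as text to show what a failed conversion looks like.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def find_non_numeric(frame, columns):
    """Return the columns from `columns` that do not have a numeric dtype."""
    return [column for column in columns if not is_numeric_dtype(frame[column])]

# Hypothetical frame: "Freedom" is intentionally still stored as text.
checked = pd.DataFrame({
    "Upper Confidence Interval": [7.592, 7.59],
    "Freedom": ["0.57941", "0.58557"],
})

remaining = find_non_numeric(checked, ["Upper Confidence Interval", "Freedom"])
print(remaining)
```

An empty list would confirm that every target column was converted.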



## Summary:

### Data Analysis Key Findings

*   The initial data types were examined to identify columns requiring cleaning and type conversion.
*   The 'Upper Confidence Interval', 'Economy (GDP per Capita)', 'Health (Life Expectancy)', and 'Freedom' columns were successfully cleaned by removing leading/trailing whitespaces, replacing empty strings with NaN values, and converting them to a numeric data type (float64).
*   After cleaning and conversion, the data types were verified, confirming that the target columns now have a numeric data type.

### Insights or Next Steps

*   The data is now in a cleaner format, making it suitable for numerical analysis and calculations.
*   Further analysis can be performed on the cleaned numerical columns to explore relationships and trends.


# Task
Write Python code that identifies the GDP per capita and Healthy Life Expectancy of the top 10 countries and creates a bar chart named fig1 that shows these two values for those countries using Plotly.

## Identify top 10 countries

### Subtask:
Determine the top 10 countries based on 'Happiness Rank'.


**Reasoning**:
Sort the dataframe by 'Happiness Rank' and select the top 10 countries.



In [19]:
# Sort the DataFrame df in ascending order based on the 'Happiness Rank' column.
# Select the first 10 rows of the sorted DataFrame to get the top 10 countries.
top_10_countries = df.sort_values(by='Happiness Rank').head(10)

# Display the top 10 countries
print("Top 10 countries based on Happiness Rank:")
display(top_10_countries)

Top 10 countries based on Happiness Rank:


Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Lower Confidence Interval,Upper Confidence Interval,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Denmark,Western Europe,1,7.526,7.46,7.592,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,2.73939
1,Switzerland,Western Europe,2,7.509,7.428,7.59,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,2.69463
2,Iceland,Western Europe,3,7.501,7.333,7.669,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,2.83137
3,Norway,Western Europe,4,7.498,7.421,7.575,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,2.66465
4,Finland,Western Europe,5,7.413,7.351,7.475,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,2.82596
5,Canada,North America,6,7.404,7.335,7.473,1.44015,1.0961,0.8276,0.5737,0.31329,0.44834,2.70485
6,Netherlands,Western Europe,7,7.339,7.284,7.394,1.46468,1.02912,0.81231,0.55211,0.29927,0.47416,2.70749
7,New Zealand,Australia and New Zealand,8,7.334,7.264,7.404,1.36066,1.17278,0.83096,0.58147,0.41904,0.49401,2.47553
8,Australia,Australia and New Zealand,9,7.313,7.241,7.385,1.44443,1.10476,0.8512,0.56837,0.32331,0.47407,2.5465
9,Sweden,Western Europe,10,7.291,7.227,5.472753,1.45181,1.08764,0.83121,0.58218,0.40867,0.38254,2.54734
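
As an aside, the same selection can be written with `DataFrame.nsmallest`, which picks the rows with the lowest values of a column without an explicit sort. The `ranks` frame below is hypothetical, deliberately shuffled, and shortened to a top 3 for brevity.

```python
import pandas as pd

# Hypothetical, deliberately unsorted data.
ranks = pd.DataFrame({
    "Country": ["C", "A", "E", "B", "D"],
    "Happiness Rank": [3, 1, 5, 2, 4],
})

# Rows with the three lowest rank values, in ascending order.
top_3 = ranks.nsmallest(3, "Happiness Rank")

# The sort-then-head form used in the notebook selects the same rows.
top_3_sorted = ranks.sort_values(by="Happiness Rank").head(3)
print(top_3)
```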


## Select relevant data

### Subtask:
Extract the 'Country', 'Economy (GDP per Capita)', and 'Health (Life Expectancy)' for these top 10 countries.


**Reasoning**:
Extract the specified columns for the top 10 countries and display the result.



In [20]:
# Create a new DataFrame by selecting the 'Country', 'Economy (GDP per Capita)', and 'Health (Life Expectancy)' columns from the top_10_countries DataFrame.
top_10_data = top_10_countries[['Country', 'Economy (GDP per Capita)', 'Health (Life Expectancy)']]

# Display the first few rows of the newly created DataFrame to verify the selection.
print("Selected columns for the top 10 countries:")
display(top_10_data.head())

Selected columns for the top 10 countries:


Unnamed: 0,Country,Economy (GDP per Capita),Health (Life Expectancy)
0,Denmark,1.44178,0.79504
1,Switzerland,1.52733,0.86303
2,Iceland,1.42666,0.86733
3,Norway,1.57744,0.79579
4,Finland,1.40598,0.81091


## Create bar chart

### Subtask:
Generate a bar chart using Plotly to compare GDP per capita and Healthy Life Expectancy for the top 10 countries.


**Reasoning**:
Add one `go.Bar` trace per metric to a single `go.Figure`, using 'Country' on the x-axis, so that both values appear for each country.



In [21]:
import plotly.graph_objects as go

# Create a figure object
fig1 = go.Figure()

# Add a bar trace for 'Economy (GDP per Capita)'
fig1.add_trace(go.Bar(
    x=top_10_data['Country'],
    y=top_10_data['Economy (GDP per Capita)'],
    name='GDP per Capita'
))

# Add a bar trace for 'Health (Life Expectancy)'
fig1.add_trace(go.Bar(
    x=top_10_data['Country'],
    y=top_10_data['Health (Life Expectancy)'],
    name='Health (Life Expectancy)'
))

# Update the layout of the figure
fig1.update_layout(
    title='GDP per Capita and Healthy Life Expectancy for Top 10 Countries',
    xaxis_title='Country',
    yaxis_title='Score'
)

## Summary:

### Data Analysis Key Findings

*   The analysis identified the top 10 countries based on their 'Happiness Rank'.
*   The data for 'Country', 'Economy (GDP per Capita)', and 'Health (Life Expectancy)' was successfully extracted for these top 10 countries.
*   A bar chart (`fig1`) was successfully created using Plotly, visualizing the GDP per capita and Healthy Life Expectancy for the top 10 countries.

### Insights or Next Steps

*   The bar chart allows for a visual comparison of GDP per capita and Healthy Life Expectancy among the top 10 happiest countries.
*   The next step could be to display the generated bar chart (`fig1`).
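
An alternative way to prepare the data for a grouped bar chart is to reshape it into long format, with one row per country and metric. The sketch below uses the first three rows shown earlier; the names `top`, `long_form`, `Metric`, and `Score` are assumptions chosen for this example.

```python
import pandas as pd

# First three rows of the selected columns, taken from the output above.
top = pd.DataFrame({
    "Country": ["Denmark", "Switzerland", "Iceland"],
    "Economy (GDP per Capita)": [1.44178, 1.52733, 1.42666],
    "Health (Life Expectancy)": [0.79504, 0.86303, 0.86733],
})

# One row per (country, metric) pair instead of one column per metric.
long_form = top.melt(id_vars="Country", var_name="Metric", value_name="Score")
print(long_form)
```

A frame in this shape could be passed to `plotly.express.bar` with `x="Country"`, `y="Score"`, `color="Metric"`, and `barmode="group"`. To render `fig1` explicitly, call `fig1.show()`.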


# Task
Create a sub-dataset containing the Economy (GDP per Capita), Family, Health (Life Expectancy), Freedom, Trust (Government Corruption), Generosity, and Happiness Score attributes from the dataframe (df). Compute the correlation between the attributes in the sub-dataset and show it as a heatmap named fig2 using Plotly, with a width of 800 and a height of 600.

## Create sub-dataset

### Subtask:
Create a sub-dataset including the specified attributes from the dataframe `df`.


**Reasoning**:
Create a new DataFrame with the specified columns and display the first few rows.



In [22]:
# Create a new DataFrame called sub_df by selecting the specified columns from the DataFrame df.
sub_df = df[['Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Happiness Score']]

# Display the first 5 rows of the new DataFrame sub_df to verify the selection.
print("First 5 rows of the sub-dataset:")
display(sub_df.head())

First 5 rows of the sub-dataset:


Unnamed: 0,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Happiness Score
0,1.44178,1.16374,0.79504,0.57941,0.44453,0.36171,7.526
1,1.52733,1.14524,0.86303,0.58557,0.41203,0.28083,7.509
2,1.42666,1.18326,0.86733,0.56624,0.14975,0.47678,7.501
3,1.57744,1.1269,0.79579,0.59609,0.35776,0.37895,7.498
4,1.40598,1.13464,0.81091,0.57104,0.41004,0.25492,7.413


## Calculate correlation

### Subtask:
Find the correlation between the attributes in the sub-dataset.


**Reasoning**:
Calculate the correlation matrix for the sub_df DataFrame.



In [23]:
# Calculate the correlation matrix for the sub_df DataFrame.
correlation_matrix = sub_df.corr()

# Print the correlation matrix.
print("\nCorrelation Matrix:")
print(correlation_matrix)


Correlation Matrix:
                               Economy (GDP per Capita)    Family  \
Economy (GDP per Capita)                       1.000000  0.668984   
Family                                         0.668984  1.000000   
Health (Life Expectancy)                       0.825699  0.586163   
Freedom                                        0.360998  0.450194   
Trust (Government Corruption)                  0.286069  0.213561   
Generosity                                    -0.022072  0.089629   
Happiness Score                                0.789625  0.739252   

                               Health (Life Expectancy)   Freedom  \
Economy (GDP per Capita)                       0.825699  0.360998   
Family                                         0.586163  0.450194   
Health (Life Expectancy)                       1.000000  0.348028   
Freedom                                        0.348028  1.000000   
Trust (Government Corruption)                  0.261928  0.502030   
Generosity  

## Create heatmap

### Subtask:
Generate a heatmap named `fig2` using Plotly to show the correlation, with a width of 800 and height of 600.


**Reasoning**:
Pass the correlation matrix to `px.imshow` with `text_auto=True` so each cell is annotated with its value, then set the title, width (800), and height (600) through `update_layout`.



In [24]:
import plotly.express as px

fig2 = px.imshow(correlation_matrix, text_auto=True)

fig2.update_layout(
    title='Correlation Heatmap of Selected Attributes',
    width=800,
    height=600
)

## Summary:

### Data Analysis Key Findings

*   A sub-dataset containing 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', and 'Happiness Score' was successfully created.
*   The correlation matrix for the selected attributes in the sub-dataset was calculated.
*   A heatmap named `fig2` was generated using Plotly to visualize the calculated correlation matrix with a width of 800 and a height of 600.

### Insights or Next Steps

*   Analyze the correlation values in the heatmap to identify which attributes have the strongest positive or negative relationships with 'Happiness Score' and with each other.
*   Consider exploring regression analysis to quantify the impact of each selected attribute on 'Happiness Score'.
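
Both next steps can be sketched on synthetic data. Everything in the example is an assumption made for illustration: the `synthetic` frame is randomly generated with known coefficients (2.0 and 0.5), so the numbers say nothing about the real dataset. It ranks attributes by the absolute value of their correlation with 'Happiness Score', then fits an ordinary least-squares model with `numpy.linalg.lstsq`.

```python
import numpy as np
import pandas as pd

# Synthetic data with known coefficients; not the real dataset.
rng = np.random.default_rng(0)
n = 200
economy = rng.normal(size=n)
generosity = rng.normal(size=n)
score = 2.0 * economy + 0.5 * generosity + rng.normal(scale=0.1, size=n)

synthetic = pd.DataFrame({
    "Economy (GDP per Capita)": economy,
    "Generosity": generosity,
    "Happiness Score": score,
})

# Rank attributes by the strength of their correlation with the target.
ranked = (
    synthetic.corr()["Happiness Score"]
    .drop("Happiness Score")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)

# Ordinary least squares: intercept plus one coefficient per attribute.
features = np.column_stack([np.ones(n), economy, generosity])
coefficients, *_ = np.linalg.lstsq(features, score, rcond=None)
print(coefficients)
```

On the real data, `sub_df` would take the place of `synthetic`. Rows with NaN values would need to be dropped first, because `lstsq` does not handle missing values.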
