# What makes people happy? Can you find Dystopia?

The `World Happiness Report` is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. In recent years, the `World Happiness Report` has focused on happiness and the community: how happiness has evolved over the past dozen years, with a focus on the technologies, social norms, conflicts and government policies that have driven those changes.

<img src="https://allthatsinteresting.com/wordpress/wp-content/uploads/2016/03/giphy-4.gif" width="700px">

**Dataset information**

    The information in the datasets is based on answers to the main life evaluation question asked in the survey. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale.


The Happiness Score is explained by the following factors:

- `Overall rank`: happiness rank of each country
- `Country or region`
- `Score`: national average of the responses to the main life evaluation question asked in the Gallup World Poll (GWP), which uses the Cantril ladder. Ranges from 0 to 10
- `GDP per capita`
- `Healthy life expectancy`: score from 0 to 1, where 1 indicates the most confidence in living a healthy life
- `Social support`: how people rate the social support provided by governments; ranges from 0 to 2
- `Freedom to make life choices`: score from 0 to 1, where 1 indicates the people who feel most free
- `Generosity`: score from 0 to 1, where 1 indicates the most generosity
- `Perceptions of corruption`: perception of corruption in the country, from 0 to 1. The higher the value, the lower the perceived corruption
- `year`


>dataset: `happiness_score.csv`

In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

import warnings
warnings.filterwarnings('ignore')

In [48]:
# Import functions
import sys
sys.path.insert(0, '../_functions_')

import functions_EDA as feda

### Exercise 1. Tell a story with your dataset to try to answer the following question:

>"Which factors are more important to live a happier life? As a result, people and countries can focus on the more significant factors to achieve a higher happiness level."

To achieve this goal, use the different functionalities of the visualization libraries that you have seen in the module.

## EDA

### Load data and inspect raw structure: shape, columns, head/tail, nulls, data types, descriptive stats (.describe()).

In [83]:
df = pd.read_csv('../data/raw/happiness_score.csv')

In [84]:
df.shape

(312, 12)

In [85]:
df.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'Overall rank', 'Country or region',
       'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'year'],
      dtype='object')

In [86]:
df.dtypes

Unnamed: 0.1                      int64
Unnamed: 0                        int64
Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
year                              int64
dtype: object

In [87]:
df.describe()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,year
count,312.0,312.0,312.0,312.0,312.0,312.0,7.0,312.0,312.0,311.0,312.0
mean,155.5,77.5,78.5,5.391506,0.898298,1.211026,0.038571,0.423538,0.182926,0.111299,2018.5
std,90.210864,45.104737,45.104737,1.114631,0.394592,0.30031,0.035213,0.156024,0.096739,0.095365,0.500803
min,0.0,0.0,1.0,2.853,0.0,0.0,0.0,0.0,0.0,0.0,2018.0
25%,77.75,38.75,39.75,4.51425,0.6095,1.05575,0.005,0.3225,0.10875,0.05,2018.0
50%,155.5,77.5,78.5,5.3795,0.96,1.2655,0.048,0.4495,0.1755,0.082,2018.5
75%,233.25,116.25,117.25,6.176,1.2195,1.4575,0.066,0.54025,0.245,0.1405,2019.0
max,311.0,155.0,156.0,7.769,2.096,1.644,0.08,0.724,0.598,0.457,2019.0


In [88]:
df.head()


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,year
0,0,0,1,Finland,7.632,1.305,1.592,,0.681,0.202,0.393,2018
1,1,1,2,Norway,7.594,1.456,1.582,,0.686,0.286,0.34,2018
2,2,2,3,Denmark,7.555,1.351,1.59,,0.683,0.284,0.408,2018
3,3,3,4,Iceland,7.495,1.343,1.644,,0.677,0.353,0.138,2018
4,4,4,5,Switzerland,7.487,1.42,1.549,,0.66,0.256,0.357,2018


In [89]:
df.tail()


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,year
307,307,151,152,Rwanda,3.334,0.359,0.711,,0.555,0.217,0.411,2019
308,308,152,153,Tanzania,3.231,0.476,0.885,,0.417,0.276,0.147,2019
309,309,153,154,Afghanistan,3.203,0.35,0.517,,0.0,0.158,0.025,2019
310,310,154,155,Central African Republic,3.083,0.026,0.0,,0.225,0.235,0.035,2019
311,311,155,156,South Sudan,2.853,0.306,0.575,,0.01,0.202,0.091,2019


In [90]:
df.isna().any()

Unnamed: 0.1                    False
Unnamed: 0                      False
Overall rank                    False
Country or region               False
Score                           False
GDP per capita                  False
Social support                  False
Healthy life expectancy          True
Freedom to make life choices    False
Generosity                      False
Perceptions of corruption        True
year                            False
dtype: bool

In [91]:
df.isnull().sum()

Unnamed: 0.1                      0
Unnamed: 0                        0
Overall rank                      0
Country or region                 0
Score                             0
GDP per capita                    0
Social support                    0
Healthy life expectancy         305
Freedom to make life choices      0
Generosity                        0
Perceptions of corruption         1
year                              0
dtype: int64

In [92]:
df.isnull().mean() * 100

Unnamed: 0.1                     0.000000
Unnamed: 0                       0.000000
Overall rank                     0.000000
Country or region                0.000000
Score                            0.000000
GDP per capita                   0.000000
Social support                   0.000000
Healthy life expectancy         97.756410
Freedom to make life choices     0.000000
Generosity                       0.000000
Perceptions of corruption        0.320513
year                             0.000000
dtype: float64

In [93]:
#Examples of countries/rows where data is incomplete.

df[df.isnull().any(axis=1)].head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption,year
0,0,0,1,Finland,7.632,1.305,1.592,,0.681,0.202,0.393,2018
1,1,1,2,Norway,7.594,1.456,1.582,,0.686,0.286,0.34,2018
2,2,2,3,Denmark,7.555,1.351,1.59,,0.683,0.284,0.408,2018
3,3,3,4,Iceland,7.495,1.343,1.644,,0.677,0.353,0.138,2018
4,4,4,5,Switzerland,7.487,1.42,1.549,,0.66,0.256,0.357,2018


In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Unnamed: 0.1                  312 non-null    int64  
 1   Unnamed: 0                    312 non-null    int64  
 2   Overall rank                  312 non-null    int64  
 3   Country or region             312 non-null    object 
 4   Score                         312 non-null    float64
 5   GDP per capita                312 non-null    float64
 6   Social support                312 non-null    float64
 7   Healthy life expectancy       7 non-null      float64
 8   Freedom to make life choices  312 non-null    float64
 9   Generosity                    312 non-null    float64
 10  Perceptions of corruption     311 non-null    float64
 11  year                          312 non-null    int64  
dtypes: float64(7), int64(4), object(1)
memory usage: 29.4+ KB


!! `Healthy life expectancy` is almost empty: only 7 of 312 rows have a value (97.8% NaN). `Perceptions of corruption` has a single NaN.

### Standardize and clean column names (consistent naming).

In [95]:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

In [96]:
df.columns


Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'year'],
      dtype='object')

In [97]:
df.shape

(312, 10)
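
The `Unnamed: 0` and `Unnamed: 0.1` columns most likely come from the file having been saved with its index and read back more than once; that is an assumption about how `happiness_score.csv` was produced, not something the file itself confirms. A minimal sketch on a toy frame (illustrative values only) shows how such a column appears and how the filter used above removes it:

```python
import io
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({"Score": [7.6, 7.5], "year": [2018, 2018]})

# Saving with the default index=True writes the index as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Reading it back without index_col turns that column into 'Unnamed: 0'
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())

# Same filter as in the notebook: keep columns whose name does not start with 'Unnamed'
tidy = reloaded.loc[:, ~reloaded.columns.str.contains(r"^Unnamed")]
print(tidy.columns.tolist())
```

Writing with `to_csv(..., index=False)` avoids the extra column in the first place.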

In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  312 non-null    int64  
 1   Country or region             312 non-null    object 
 2   Score                         312 non-null    float64
 3   GDP per capita                312 non-null    float64
 4   Social support                312 non-null    float64
 5   Healthy life expectancy       7 non-null      float64
 6   Freedom to make life choices  312 non-null    float64
 7   Generosity                    312 non-null    float64
 8   Perceptions of corruption     311 non-null    float64
 9   year                          312 non-null    int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 24.5+ KB


In [99]:
df_cleaned = feda.clean_column_names(df)
df_cleaned

Unnamed: 0,overall_rank,country_or_region,score,gdp_per_capita,social_support,healthy_life_expectancy,freedom_to_make_life_choices,generosity,perceptions_of_corruption,year
0,1,Finland,7.632,1.305,1.592,,0.681,0.202,0.393,2018
1,2,Norway,7.594,1.456,1.582,,0.686,0.286,0.340,2018
2,3,Denmark,7.555,1.351,1.590,,0.683,0.284,0.408,2018
3,4,Iceland,7.495,1.343,1.644,,0.677,0.353,0.138,2018
4,5,Switzerland,7.487,1.420,1.549,,0.660,0.256,0.357,2018
...,...,...,...,...,...,...,...,...,...,...
307,152,Rwanda,3.334,0.359,0.711,,0.555,0.217,0.411,2019
308,153,Tanzania,3.231,0.476,0.885,,0.417,0.276,0.147,2019
309,154,Afghanistan,3.203,0.350,0.517,,0.000,0.158,0.025,2019
310,155,Central African Republic,3.083,0.026,0.000,,0.225,0.235,0.035,2019
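
`feda.clean_column_names` lives in the local `functions_EDA` module, whose source is not shown in this notebook. Judging only from the output above (lowercase names, underscores instead of spaces), it could look roughly like the sketch below. The function name `clean_column_names_sketch` and its body are hypothetical, not the actual implementation:

```python
import pandas as pd

def clean_column_names_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for feda.clean_column_names: copy with snake_case column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()                        # drop surrounding whitespace
        .str.lower()                                   # lowercase everything
        .str.replace(r"[^0-9a-z]+", "_", regex=True)   # runs of other characters become '_'
        .str.strip("_")                                # no leading or trailing underscores
    )
    return out

example = pd.DataFrame(columns=["Overall rank", "Country or region", "GDP per capita"])
print(clean_column_names_sketch(example).columns.tolist())
```

Returning a copy keeps the original `df` untouched, so the raw column names stay available for comparison.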


### Review and correct data types (e.g., dates, categories, floats).

In [100]:
df_cleaned.dtypes

overall_rank                      int64
country_or_region                object
score                           float64
gdp_per_capita                  float64
social_support                  float64
healthy_life_expectancy         float64
freedom_to_make_life_choices    float64
generosity                      float64
perceptions_of_corruption       float64
year                              int64
dtype: object

In [101]:
df_cleaned.head(5)

Unnamed: 0,overall_rank,country_or_region,score,gdp_per_capita,social_support,healthy_life_expectancy,freedom_to_make_life_choices,generosity,perceptions_of_corruption,year
0,1,Finland,7.632,1.305,1.592,,0.681,0.202,0.393,2018
1,2,Norway,7.594,1.456,1.582,,0.686,0.286,0.34,2018
2,3,Denmark,7.555,1.351,1.59,,0.683,0.284,0.408,2018
3,4,Iceland,7.495,1.343,1.644,,0.677,0.353,0.138,2018
4,5,Switzerland,7.487,1.42,1.549,,0.66,0.256,0.357,2018


In [102]:
df_cleaned['country_or_region'].value_counts()

country_or_region
Finland            2
Norway             2
Denmark            2
Iceland            2
Switzerland        2
                  ..
Sudan              1
North Macedonia    1
Gambia             1
Swaziland          1
Comoros            1
Name: count, Length: 160, dtype: int64

In [103]:

df_cleaned['country_or_region'] = df_cleaned['country_or_region'].astype('category')

In [104]:
df_cleaned.dtypes

overall_rank                       int64
country_or_region               category
score                            float64
gdp_per_capita                   float64
social_support                   float64
healthy_life_expectancy          float64
freedom_to_make_life_choices     float64
generosity                       float64
perceptions_of_corruption        float64
year                               int64
dtype: object

In [105]:
df_cleaned['country_or_region'].cat.categories

Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       ...
       'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=160)

In [106]:
print(df_cleaned['country_or_region'].cat.categories)

Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       ...
       'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=160)


In [107]:
# Get value counts efficiently
df_cleaned['country_or_region'].value_counts()

country_or_region
Afghanistan        2
Albania            2
Algeria            2
Argentina          2
Australia          2
                  ..
Gambia             1
Macedonia          1
North Macedonia    1
Sudan              1
Swaziland          1
Name: count, Length: 160, dtype: int64
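
The counts above show that some countries appear only once, for example `Macedonia` and `North Macedonia`. A small sketch, using a toy frame with the same two column names, lists which countries are missing from one of the years. Whether two such labels should be merged into one country is a judgement call that the data alone does not settle:

```python
import pandas as pd

# Toy data; the real frame is df_cleaned with the same two column names
toy = pd.DataFrame({
    "country_or_region": ["Finland", "Finland", "Macedonia", "North Macedonia"],
    "year": [2018, 2019, 2018, 2019],
})

# One row per country, one column per year, True where that country-year exists
coverage = pd.crosstab(toy["country_or_region"], toy["year"]).astype(bool)

# Keep only countries missing from at least one year
incomplete = coverage[~coverage.all(axis=1)]
print(incomplete)
```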

### Handle or drop NaNs before converting to non-null numeric types (like int).

In [108]:
df_cleaned.sample(15)

Unnamed: 0,overall_rank,country_or_region,score,gdp_per_capita,social_support,healthy_life_expectancy,freedom_to_make_life_choices,generosity,perceptions_of_corruption,year
55,56,Jamaica,5.89,0.819,1.493,,0.575,0.096,0.031,2018
88,89,Macedonia,5.185,0.959,1.239,,0.394,0.173,0.052,2018
62,63,Estonia,5.739,1.2,1.532,,0.553,0.086,0.174,2018
93,94,Mongolia,5.125,0.914,1.517,,0.395,0.253,0.032,2018
252,97,Bulgaria,5.011,1.092,1.513,,0.311,0.081,0.004,2019
9,10,Australia,7.272,1.34,1.573,,0.647,0.361,0.302,2018
106,107,Ivory Coast,4.671,0.541,0.872,0.08,0.467,0.146,0.103,2018
87,88,Tajikistan,5.199,0.474,1.166,,0.292,0.187,0.034,2018
148,149,Liberia,3.495,0.076,0.858,,0.419,0.206,0.03,2018
275,120,Gambia,4.516,0.308,0.939,,0.382,0.269,0.167,2019


In [109]:
(df_cleaned.isnull().mean() * 100).sort_values(ascending=False)

healthy_life_expectancy         97.756410
perceptions_of_corruption        0.320513
country_or_region                0.000000
overall_rank                     0.000000
gdp_per_capita                   0.000000
score                            0.000000
social_support                   0.000000
freedom_to_make_life_choices     0.000000
generosity                       0.000000
year                             0.000000
dtype: float64
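
With one column almost entirely empty, the decision to drop it can be written as a rule instead of a hard-coded name. The sketch below is one way to do that; the helper name `drop_sparse_columns` and the 50% cut-off are assumptions chosen for illustration, and the toy values are made up:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Return a copy without columns whose share of NaNs exceeds max_missing."""
    missing_share = df.isnull().mean()
    keep = missing_share[missing_share <= max_missing].index
    return df.loc[:, keep].copy()

# Toy frame: one mostly-empty column, one with a single NaN
toy = pd.DataFrame({
    "score": [7.6, 7.5, 7.4, 7.3],
    "healthy_life_expectancy": [np.nan, np.nan, np.nan, 0.08],
    "perceptions_of_corruption": [0.39, np.nan, 0.41, 0.14],
})
print(drop_sparse_columns(toy).columns.tolist())
```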

In [112]:
# errors='ignore' keeps this cell safe to re-run once the column is already gone
df_cleaned = df_cleaned.drop(columns=['healthy_life_expectancy'], errors='ignore')

In [113]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   overall_rank                  312 non-null    int64   
 1   country_or_region             312 non-null    category
 2   score                         312 non-null    float64 
 3   gdp_per_capita                312 non-null    float64 
 4   social_support                312 non-null    float64 
 5   freedom_to_make_life_choices  312 non-null    float64 
 6   generosity                    312 non-null    float64 
 7   perceptions_of_corruption     311 non-null    float64 
 8   year                          312 non-null    int64   
dtypes: category(1), float64(6), int64(2)
memory usage: 25.6 KB


In [114]:
# Raw count of NaNs per column
print(df_cleaned.isnull().sum())

overall_rank                    0
country_or_region               0
score                           0
gdp_per_capita                  0
social_support                  0
freedom_to_make_life_choices    0
generosity                      0
perceptions_of_corruption       1
year                            0
dtype: int64


In [79]:
print((df_cleaned.isnull().mean() * 100).round(2))

overall_rank                    0.00
country_or_region               0.00
score                           0.00
gdp_per_capita                  0.00
social_support                  0.00
freedom_to_make_life_choices    0.00
generosity                      0.00
perceptions_of_corruption       0.32
year                            0.00
dtype: float64


In [80]:
df_cleaned[df_cleaned.isnull().any(axis=1)].head()

Unnamed: 0,overall_rank,country_or_region,score,gdp_per_capita,social_support,freedom_to_make_life_choices,generosity,perceptions_of_corruption,year
19,20,United Arab Emirates,6.774,2.096,0.776,0.284,0.186,,2018


Since only 1 row out of 312 is affected, the cleanest approach is to drop it.

In [81]:
df_cleaned = df_cleaned.dropna(subset=['perceptions_of_corruption'])

In [82]:
print(df_cleaned.isnull().sum())

overall_rank                    0
country_or_region               0
score                           0
gdp_per_capita                  0
social_support                  0
freedom_to_make_life_choices    0
generosity                      0
perceptions_of_corruption       0
year                            0
dtype: int64


## Handle Missing Values

In [37]:
(df_cleaned.isnull().mean() * 100).sort_values(ascending=False)

overall_rank                    0.0
country_or_region               0.0
score                           0.0
gdp_per_capita                  0.0
social_support                  0.0
freedom_to_make_life_choices    0.0
generosity                      0.0
perceptions_of_corruption       0.0
year                            0.0
dtype: float64

Missingness = negligible

Missingness quantified (only 0.32% in perceptions_of_corruption, i.e. 1 row out of 312)

The NaN looks MCAR (just one missing entry, with no visible dependency on other factors).

✅ Solution applied: drop the one row. The dataset is now complete.
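
Dropping the row also removes one country-year from the analysis. An alternative, not used in this notebook, is to fill the gap from the same country's other year. The sketch below shows the idea on a toy frame; the numeric values are made up and do not come from the dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country_or_region": ["United Arab Emirates", "United Arab Emirates", "Finland", "Finland"],
    "year": [2018, 2019, 2018, 2019],
    "perceptions_of_corruption": [np.nan, 0.2, 0.4, 0.4],  # made-up values
})

target = "perceptions_of_corruption"

# Mean of the available values per country, aligned to the original rows (NaNs are skipped)
country_mean = toy.groupby("country_or_region")[target].transform("mean")

# Fill only the missing entries; existing values are left as they are
filled = toy.assign(**{target: toy[target].fillna(country_mean)})
print(filled)
```

With only two years per country this simply copies the other year's value, so it is worth doing only if keeping the row matters more than the small distortion it introduces.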

## Data Consistency & Duplicates

We don’t want the same (country, year) repeated unless justified.

In [38]:
# Check total duplicates (full row)
print("Total duplicate rows:", df_cleaned.duplicated().sum())

Total duplicate rows: 0


In [39]:
# Check duplicates by key (country, year)
print("Country-Year duplicates:", df_cleaned[['country_or_region', 'year']].duplicated().sum())

Country-Year duplicates: 0


In [40]:
# Check Impossible / Illogical Values

# Define expected ranges
checks = {
    'score': (0, 10),
    'gdp_per_capita': (0, 2),
    'social_support': (0, 2),
    'freedom_to_make_life_choices': (0, 1),
    'generosity': (0, 1),
    'perceptions_of_corruption': (0, 1)
}

# Spot out-of-range values
for col, (low, high) in checks.items():
    bad_vals = df_cleaned[(df_cleaned[col] < low) | (df_cleaned[col] > high)]
    if not bad_vals.empty:
        print(f"⚠️ Column {col} has values outside {low}–{high}:")
        print(bad_vals[[col, 'country_or_region', 'year']])

Duplicates: none found → every (country, year) is unique.

Consistency checks: all numeric values fall within expected ranges → no illogical or impossible values detected.

👉 This means the dataset is clean, non-redundant, and consistent.
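
The range check above prints nothing when every value is in range, which makes a clean result hard to tell apart from a cell that did not run. A possible variant returns a small report instead. The helper name `range_report` is hypothetical, and the toy frame contains one deliberately invalid score:

```python
import pandas as pd

def range_report(df: pd.DataFrame, checks: dict) -> pd.DataFrame:
    """Count values outside [low, high] for each column listed in checks."""
    rows = []
    for col, (low, high) in checks.items():
        # Same comparison as in the notebook; NaNs compare False and are not counted
        outside = (df[col] < low) | (df[col] > high)
        rows.append({"column": col, "low": low, "high": high, "n_outside": int(outside.sum())})
    return pd.DataFrame(rows).set_index("column")

toy = pd.DataFrame({"score": [7.6, 11.0, 3.2], "generosity": [0.2, 0.3, 0.1]})
report = range_report(toy, {"score": (0, 10), "generosity": (0, 1)})
print(report)
```

A column of zeros in `n_outside` then documents that the check ran and passed.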

## Descriptive Statistics

In [41]:
df_cleaned.describe()

Unnamed: 0,overall_rank,score,gdp_per_capita,social_support,freedom_to_make_life_choices,generosity,perceptions_of_corruption,year
count,311.0,311.0,311.0,311.0,311.0,311.0,311.0,311.0
mean,78.688103,5.387061,0.894447,1.212424,0.423987,0.182916,0.111299,2018.501608
std,45.054689,1.113653,0.389311,0.299774,0.156074,0.096895,0.095365,0.500803
min,1.0,2.853,0.0,0.0,0.0,0.0,0.0,2018.0
25%,40.0,4.5125,0.608,1.057,0.3255,0.1085,0.05,2018.0
50%,79.0,5.373,0.96,1.266,0.45,0.175,0.082,2019.0
75%,117.5,6.1735,1.2145,1.458,0.5405,0.245,0.1405,2019.0
max,156.0,7.769,1.684,1.644,0.724,0.598,0.457,2019.0


In [42]:
# Select only numeric columns before computing stats
num_df = df_cleaned.select_dtypes(include=['number'])

# Basic descriptive stats
summary_stats = num_df.describe().T  # transpose for readability

# Add median, skewness, kurtosis
summary_stats['median']   = num_df.median()
summary_stats['skewness'] = num_df.skew()
summary_stats['kurtosis'] = num_df.kurt()

print(summary_stats)

                              count         mean        std       min  \
overall_rank                  311.0    78.688103  45.054689     1.000   
score                         311.0     5.387061   1.113653     2.853   
gdp_per_capita                311.0     0.894447   0.389311     0.000   
social_support                311.0     1.212424   0.299774     0.000   
freedom_to_make_life_choices  311.0     0.423987   0.156074     0.000   
generosity                    311.0     0.182916   0.096895     0.000   
perceptions_of_corruption     311.0     0.111299   0.095365     0.000   
year                          311.0  2018.501608   0.500803  2018.000   

                                    25%       50%        75%       max  \
overall_rank                    40.0000    79.000   117.5000   156.000   
score                            4.5125     5.373     6.1735     7.769   
gdp_per_capita                   0.6080     0.960     1.2145     1.684   
social_support                   1.0570     1.
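
When reading the `skewness` and `kurtosis` columns, keep in mind that pandas' `kurt()` reports excess kurtosis (Fisher's definition), so a normal distribution scores 0 rather than 3. A short sketch on two toy series illustrates the signs to expect:

```python
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tail = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])

# Skewness: about 0 for a symmetric sample, positive when a few high values stretch the right tail
print("skew symmetric :", symmetric.skew())
print("skew right tail:", right_tail.skew())

# Excess kurtosis: negative for a flat, evenly spread sample like this one
print("kurt symmetric :", symmetric.kurt())
```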

In [43]:
numeric_cols = df_cleaned.select_dtypes(include=['number']).columns.tolist()
print(numeric_cols)

['overall_rank', 'score', 'gdp_per_capita', 'social_support', 'freedom_to_make_life_choices', 'generosity', 'perceptions_of_corruption', 'year']


In [44]:
print("🔍 KEY INSIGHTS FROM THE STATISTICS:")
print("=" * 80)

# Analyze key patterns
for col in numeric_cols:
    if col in summary_stats.index:
        skew = summary_stats.loc[col, 'skewness']
        mean_val = summary_stats.loc[col, 'mean']
        median_val = summary_stats.loc[col, 'median']
        
        print(f"\n{col.upper()}:")
        print(f"  • Range: {summary_stats.loc[col, 'min']:.3f} to {summary_stats.loc[col, 'max']:.3f}")
        print(f"  • Mean: {mean_val:.3f}, Median: {median_val:.3f}")
        
        if abs(skew) < 0.5:
            print(f"  • Distribution: Fairly symmetric (skew: {skew:.3f})")
        elif skew >= 0.5:
            print(f"  • Distribution: Right-skewed (skew: {skew:.3f}) - few high values")
        else:
            print(f"  • Distribution: Left-skewed (skew: {skew:.3f}) - few low values")

print(f"\n📈 Dataset shape: {df_cleaned.shape[0]} countries/observations, {df_cleaned.shape[1]} variables")
print(f"📅 Years covered: {df_cleaned['year'].min()} to {df_cleaned['year'].max()}")

🔍 KEY INSIGHTS FROM THE STATISTICS:

OVERALL_RANK:
  • Range: 1.000 to 156.000
  • Mean: 78.688, Median: 79.000
  • Distribution: Fairly symmetric (skew: -0.005)

SCORE:
  • Range: 2.853 to 7.769
  • Mean: 5.387, Median: 5.373
  • Distribution: Fairly symmetric (skew: 0.019)

GDP_PER_CAPITA:
  • Range: 0.000 to 1.684
  • Mean: 0.894, Median: 0.960
  • Distribution: Fairly symmetric (skew: -0.377)

SOCIAL_SUPPORT:
  • Range: 0.000 to 1.644
  • Mean: 1.212, Median: 1.266
  • Distribution: Left-skewed (skew: -1.116) - few low values

FREEDOM_TO_MAKE_LIFE_CHOICES:
  • Range: 0.000 to 0.724
  • Mean: 0.424, Median: 0.450
  • Distribution: Left-skewed (skew: -0.639) - few low values

GENEROSITY:
  • Range: 0.000 to 0.598
  • Mean: 0.183, Median: 0.175
  • Distribution: Right-skewed (skew: 0.803) - few high values

PERCEPTIONS_OF_CORRUPTION:
  • Range: 0.000 to 0.457
  • Mean: 0.111, Median: 0.082
  • Distribution: Right-skewed (skew: 1.658) - few high values

YEAR:
  • Range: 2018.000 to 201

In [115]:
# For reproducibility
summary_stats.to_csv("descriptive_stats.csv")

In [116]:
plt.savefig(f'../visualizations/univariate_plots/dist_{col}.png', dpi=300, bbox_inches='tight')

<Figure size 640x480 with 0 Axes>

In [118]:
summary_stats.to_csv("../data/processed/descriptive_stats.csv")
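
The `plt.savefig` cell above reports a figure with 0 Axes: nothing was drawn before saving, so the PNG is blank. A minimal sketch of one way to draw and save one histogram per column follows. It uses a toy frame and a temporary folder as stand-ins for `df_cleaned` and `../visualizations/univariate_plots`, and plain matplotlib histograms are an assumption about the intended plot type:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for df_cleaned (illustrative values only)
toy = pd.DataFrame({"score": [7.6, 5.4, 3.2, 6.1], "generosity": [0.2, 0.3, 0.1, 0.25]})
out_dir = tempfile.mkdtemp()  # stand-in for the notebook's output folder

saved = []
for col in toy.columns:
    fig, ax = plt.subplots()
    ax.hist(toy[col], bins=10)            # draw first ...
    ax.set_title(f"Distribution of {col}")
    path = os.path.join(out_dir, f"dist_{col}.png")
    fig.savefig(path, dpi=300, bbox_inches="tight")  # ... then save that specific figure
    plt.close(fig)                        # free memory before the next column
    saved.append(path)

print(saved)
```

Calling `fig.savefig` on the figure that was just drawn avoids depending on whichever figure happens to be current.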