# Introduction


---


## Overview of data

This report uses the dataset `happiness.csv`, which contains a list of countries ranked 1 to 147 based on a ladder score feature that represents a country's happiness. This score is rated on a scale of 0 to 10, where 10 is the happiest. It is derived from people's overall life satisfaction and will be the main focus of this report.

Furthermore, the dataset contains additional features. These hold various data points that may correlate with a country's overall happiness and can be used as part of my exploratory data analysis.

These features include:
- **Unnamed 0**

	- An undefined column with no title.

- **Log GDP per capita**

	- A logarithmic value that represents a country's national income or economic state.

- **Social support**

	- The number of people who feel like they have some form of social support.

- **Healthy life expectancy**

	- An estimate of the average number of years a person in a given country can expect to live in good health.

- **Freedom to make life choices**

	- The number of people who feel they have the freedom to make life decisions, including things like career and living conditions.

- **Generosity**

	- A measure of people's generosity, for example charitable donations or helping other people.

- **Perceptions of corruption**

	- A lower value equates to lower trust. This is derived from questions about the level of corruption within business and/or government.

- **Dystopia + residual**

	- Any part of a country's happiness that is not explained by the other features within this dataset.

This dataset can provide interesting insights into why a certain country may or may not be happier than another, based on some of the features listed above.

---

## Common issues within data

Data in the real world can have various issues, which can cause problems when trying to analyse it. We have to think about how we handle these issues to provide unbiased, accurate analysis. These issues include, but are not limited to, duplicates, missing data, inconsistent data, outliers and human error.

---

### Duplicates
It's important to identify duplicate records, as duplicate entries can lead to inaccurate results by inflating certain features, such as frequency columns, or by causing bias, e.g. a duplicated salary. We must identify these duplicates and either remove them from the dataset or address them appropriately.
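
As a minimal sketch of the idea above, duplicates can be flagged and removed with pandas. The `records` frame and its values are hypothetical and exist only to illustrate how a duplicated salary inflates an average.

```python
import pandas as pd

# Hypothetical toy records; the repeated "Bob" row is made up for illustration.
records = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Cara"],
    "salary": [50_000, 62_000, 62_000, 48_000],
})

# Flag rows that exactly repeat an earlier row.
duplicate_mask = records.duplicated()
print(f"Duplicate rows found: {duplicate_mask.sum()}")

# Keep the first occurrence of each record and drop the repeats.
deduplicated = records.drop_duplicates().reset_index(drop=True)

# The duplicated salary shifts the mean until the repeat is removed.
print("Mean with duplicate:   ", records["salary"].mean())
print("Mean without duplicate:", deduplicated["salary"].mean())
```

Whether dropping is the right choice depends on the data: two identical rows can also be two genuine observations, so it is worth checking before removing them.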

---

### Missing data
Missing data is where values are missing within a given record. For example, the Social support feature might contain NA, NaN or simply a blank for a given entry.

Missing data comes in various types, and each carries a risk of bias, meaning it can lead to biased analysis. For example, if people with high salaries don't report them, the dataset becomes biased towards low earners:

- MCAR (Missing completely at random)
  - There is no pattern within the data (on any feature) as to why the data is missing.
  - Bias risk: Low

- MAR (Missing at random)
  - The missingness is related to other observed data, e.g. whether a value is missing depends on the value in another column.
  - Bias risk: Moderate

- MNAR (Missing not at random)
  - The missingness is related to the feature the value is missing from, for example people might not want to report their age or income.
  - Bias risk: High (can greatly affect overall data accuracy)
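
A rough way to start investigating which of these types applies is to count the missing values and compare how often they occur across groups. The sketch below uses a hypothetical `survey` frame with invented values; a large difference between groups suggests the data is not MCAR, but this check alone cannot separate MAR from MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; values are invented for illustration only.
survey = pd.DataFrame({
    "age_group": ["18-30", "18-30", "31-50", "31-50", "51+", "51+"],
    "income": [28_000, np.nan, 45_000, 52_000, np.nan, np.nan],
})

# Count and proportion of missing values per column.
missing_counts = survey.isna().sum()
missing_share = survey.isna().mean()

# Share of missing income within each age group.
missing_by_group = survey["income"].isna().groupby(survey["age_group"]).mean()

print(missing_counts, missing_share, missing_by_group, sep="\n\n")
```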

When we have missing data, we need to examine it and decide how best to handle it. Typically we handle missing data using a technique called imputation, which fills in the missing values. There are various forms of imputation:

- Mean imputation - Fill with the average value across a certain feature.
- Median imputation - Fill the missing value with the middle value of the feature once its values are sorted.
- Mode imputation - Use the most frequent value from the feature.
- Forward fill - Use the last observed value before the gap.
- Predictive - Use a model to predict the value.
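
The simpler forms of imputation listed above can be sketched with pandas as follows. The `scores` series is hypothetical, and predictive imputation is left out because it depends on the choice of model.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with two gaps; values are invented for illustration.
scores = pd.Series([1.0, 2.0, np.nan, 2.0, 10.0, np.nan], name="score")

# Each statistic is computed from the observed (non-missing) values only.
mean_filled = scores.fillna(scores.mean())
median_filled = scores.fillna(scores.median())
mode_filled = scores.fillna(scores.mode().iloc[0])

# Forward fill carries the previous observed value into the gap.
forward_filled = scores.ffill()

comparison = pd.DataFrame({
    "original": scores,
    "mean": mean_filled,
    "median": median_filled,
    "mode": mode_filled,
    "ffill": forward_filled,
})
print(comparison)
```

In this toy example the single large value pulls the mean well above the median, which is one reason median imputation is often preferred for skewed features.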

---

### Inconsistent Data
Inconsistent data is a common issue in real-world data. Inconsistent formatting, e.g. of dates or number representation, can cause various problems during later analysis, such as parsing issues, unreliable data and incorrect data types. For example:

- Dates
  - mm/dd/yyyy vs dd-mm-yyyy (can cause parsing issues)
- Number format
  - 10K vs 10,000 vs 10_000 vs 10000
- Data Type
  - "10" vs 10
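
The examples above could be standardised along the following lines. This is a sketch only: `parse_amount` is a hypothetical helper, and the date formats are assumed to be known for each source rather than guessed.

```python
import pandas as pd


def parse_amount(text: str) -> int:
    """Convert strings such as '10K', '10,000', '10_000' or '10000' to an int."""
    cleaned = text.strip().replace(",", "").replace("_", "")
    if cleaned.upper().endswith("K"):
        return int(float(cleaned[:-1]) * 1_000)
    return int(cleaned)


# Number formats: all four spellings become the same integer.
amounts = pd.Series(["10K", "10,000", "10_000", "10000"]).map(parse_amount)

# Dates: parse each source with its own explicit format.
us_dates = pd.to_datetime(pd.Series(["03/25/2024"]), format="%m/%d/%Y")
eu_dates = pd.to_datetime(pd.Series(["25-03-2024"]), format="%d-%m-%Y")

# Data types: convert numeric strings such as "10" to a numeric dtype.
counts = pd.to_numeric(pd.Series(["10", "12"]))

print(amounts.tolist(), us_dates.iloc[0], eu_dates.iloc[0], counts.dtype)
```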

We should address these types of inconsistent data through data cleansing to provide more accurate results, as standardised data equates to better insights.

---

### Outliers
Outliers are values that deviate drastically from the "normal" values within a dataset, e.g. an average salary of 50k and an outlier of 500k.

Outliers can potentially skew data or cause bias. However, they can be useful if used correctly and can provide very valuable insights.

We can detect outliers using various techniques:

- Visual inspection - we can use things like boxplots or histograms to show outliers.
- Z-score - measures how many standard deviations a particular data point is from the mean. We can then set a threshold for outliers.
- Interquartile range (IQR) - measures the spread of the middle 50% of the data. Values more than 1.5 × IQR below the first quartile or above the third quartile are typically treated as outliers.
- Domain knowledge - if you know a particular domain, e.g. how much a hospital should charge for a procedure in the US, you can set a custom threshold.
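
The z-score and IQR techniques above can be sketched as follows. The `salaries` values are hypothetical, and the z-score threshold of 2 is an assumed choice for this very small sample (3 is also common on larger datasets).

```python
import pandas as pd

# Hypothetical salaries in thousands; the 500 is the made-up outlier.
salaries = pd.Series([45, 48, 50, 50, 52, 55, 500], name="salary_k")

# Z-score: distance of each value from the mean, in standard deviations.
z_scores = (salaries - salaries.mean()) / salaries.std()
z_outliers = salaries[z_scores.abs() > 2]  # assumed threshold

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = salaries[(salaries < lower) | (salaries > upper)]

print("Z-score outliers:", z_outliers.tolist())
print("IQR bounds:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
```

Because the mean and standard deviation are themselves pulled by extreme values, the IQR rule tends to be the more robust of the two on small or skewed samples.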

---

### Human error within outliers
Human error is another issue we must look for within real data, for example typos in numbers causing outliers, or typos affecting mode imputation. We have to be aware of the data we are working with.

---

## Purpose
The purpose of this report is to explore the `happiness.csv` dataset to better understand how the features listed within the dataset relate to a country's overall happiness.

Throughout this report I aim to show how certain factors, such as economic stance, freedom and social support, impact a country's overall happiness.

---

## Hypotheses
In this report I aim to test three hypotheses by analysing the happiness dataset and providing insights:

1. **Freedom & Corruption**
   - Question: Does having greater freedom and lower corruption result in higher overall happiness?
   - Hypothesis: Greater freedom and lower corruption have a positive impact on a country's happiness score.
2. **Economic stance**
   - Question: Does a weaker economic state result in lower happiness scores?
   - Hypothesis: Countries in a weaker economic state have a lower happiness score.
3. **Social Support**
   - Question: Do countries whose people have better social support have higher overall happiness?
   - Hypothesis: Countries that have a higher social support score are happier.
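
One possible first check for hypotheses of this kind is the correlation between each feature and the ladder score. The sketch below uses a hypothetical `toy` frame whose values are made up and are not taken from `happiness.csv`; only the column names follow the dataset. A strong correlation would be consistent with a hypothesis but would not by itself show that one factor causes the other.

```python
import pandas as pd

# Made-up values purely for illustration; NOT taken from happiness.csv.
toy = pd.DataFrame({
    "Ladder score": [7.5, 6.8, 6.0, 5.1, 4.2],
    "Social support": [1.6, 1.5, 1.3, 1.1, 0.9],
    "Log GDP per capita": [1.8, 1.7, 1.4, 1.2, 0.8],
})

# Pearson correlation of every feature with the ladder score.
correlations = toy.corr()["Ladder score"].drop("Ladder score")
print(correlations.sort_values(ascending=False))
```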





# Data Loading and Inspection

```python
import pandas as pd

# Read in the CSV
df = pd.read_csv('Happiness.csv')

# Remove the last 5 blank columns
df = df.iloc[:, :-5]

# Report the number of rows and columns
df_shape = df.shape

print("DF Shape")
print("=============")
print(f"Rows: {df_shape[0]}, Columns: {df_shape[1]}")
print("=============\n")

# Report the data type of each column
df_data_types = df.dtypes

print("DF DataTypes")
print("=============")
print(df_data_types)
print("=============\n")

# Preview the first 5 rows
print("DF Head")
print("=============")
print(df.head(5))
print("=============\n")
```



```text
DF Shape
Rows: 148, Columns: 12

DF DataTypes
Year                              int64
Rank                              int64
Country name                     object
Ladder score                    float64
Unnamed 0                       float64
Log GDP per capita              float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
Dystopia + residual             float64
dtype: object

DF Head
   Year  Rank Country name  Ladder score  Unnamed 0  Log GDP per capita  \
0  2024     1      Finland         7.736      7.810               1.749   
1  2024     2      Denmark         7.521      7.611               1.825   
2  2024     3      Iceland         7.515      7.606               1.799   
3  2024     4       Sweden         7.345      7.427               1.783   
4  2024     5  Netherlands         7.306      7.372               1.822   
```
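
The loading step removes the blank features by position, which assumes they are always exactly the last five columns. A hedged alternative is to drop any column that is entirely empty, as sketched below. The `raw` frame is a hypothetical stand-in for the CSV, and the names of its two empty columns are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw CSV: two real columns plus two
# trailing columns (assumed names) that contain no data at all.
raw = pd.DataFrame({
    "Country name": ["Finland", "Denmark"],
    "Ladder score": [7.736, 7.521],
    "Unnamed: 12": [np.nan, np.nan],
    "Unnamed: 13": [np.nan, np.nan],
})

# Drop only the columns in which every value is missing.
cleaned = raw.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

This keeps working if the number of blank columns changes, although it would also drop a genuine feature that happens to be completely empty.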

