## Data Project

In this project we focus on importing and cleaning data and on testing data structuring methods. In doing so, we apply data analysis methods known from the course Samfundsbeskrivelse.
We have gathered data from the website https://ourworldindata.org/, from which we use the following two datasets: "Share of adults who drank alcohol in last year, 2016" and "life expectancy of women vs life expectancy of men".

In [None]:
# Importing modules and packages

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import os 
# Using assert to check that paths exist on computer.
assert os.path.isdir('data/')
assert os.path.isfile('data/lifeexp.xlsx')
assert os.path.isfile('data/alconsp.xlsx')

# List everything in data
os.listdir('data/')

In [None]:
# Reading in Excel file from Our World in Data
lifeexp = pd.read_excel('data/lifeexp.xlsx')
print(lifeexp.columns)
lifeexp.head()

Since we are only interested in the data for the year 2016, we have to reduce the dataset.
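A minimal sketch of restricting a frame to a single year, in case the Excel file contains more than one. The toy frame and its values are hypothetical; only the `Year` column name is taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the frame read from data/lifeexp.xlsx
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B', 'B'],
    'Year': [2015, 2016, 2015, 2016],
    'Female_le': [80.0, 80.5, 70.0, 70.4],
})

# Keep only the rows for 2016; .copy() gives an independent frame
# so that later column assignments do not touch the original
toy_2016 = toy.loc[toy['Year'] == 2016].copy()
print(toy_2016)
```

If the downloaded file is already limited to 2016, this step changes nothing and can be skipped.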

In [None]:
#dropping columns
lifeexp.drop(['Continent', 'Code'], axis=1, inplace=True) 

# renaming the columns using columns dictionary
columns_dict={}
columns_dict['Entity'] = 'Country'
columns_dict['ratio'] = 'FM_ratio'
columns_dict['Population (historical estimates)'] = 'Population'
columns_dict['Life expectancy - Sex: female - Age: at birth - Variant: estimates']= 'Female_le'
columns_dict['Life expectancy - Sex: male - Age: at birth - Variant: estimates'] = 'Male_le'
lifeexp.rename(columns=columns_dict,inplace=True)

# dropping the NaN values
lifeexp = lifeexp.dropna(subset=['Female_le', 'Male_le'])

#calculating female male life expectancy ratio
lifeexp['FM_le_ratio'] = lifeexp['Female_le']/lifeexp['Male_le']

lifeexp.head(20)
lifeexp.describe()

The Year column has a constant value of 2016. The Female_le and Male_le columns have mean life expectancies of 75.47 and 70.24 years, respectively. The Population column has a mean of 193.56 million, but a very large standard deviation of 855.16 million, indicating that there is a wide range of population sizes in the data. Lastly, the FM_le_ratio column has a mean of 1.07, indicating that, on average, female life expectancy is about 7% higher than male life expectancy.

In [None]:
# Extracting population data from the dataset
pop = lifeexp['Population']

# Storing pop as a numpy array scaled to millions: np_pop1
np_pop1 = np.array(pop) / 1000000
# Format population numbers to display in millions
pop_labels = [f'{p:.2f} million' for p in np_pop1]

# Set the figure size before plotting so that it applies to this chart
plt.figure(figsize=(15, 13))

# Create scatter plot
plt.scatter(lifeexp['Male_le'], lifeexp['Female_le'], s=np_pop1, alpha=0.5, c=np_pop1, cmap='rainbow')

# Set the x- and y-axis limits
plt.xlim(50, 90)
plt.ylim(50, 90)
# Add axis labels
plt.xlabel('Male life expectancy')
plt.ylabel('Female life expectancy')
# Add colorbar legend
plt.colorbar(label='Population (millions)')

# Add title
plt.title('Female vs. male life expectancy in 2016')

# Add 45 degree line
plt.plot([0, 100], [0, 100], linestyle='--', color='grey', alpha=0.5)

# Add country labels (the Entity column was renamed to Country above)
for i, row in lifeexp.iterrows():
    plt.annotate(row['Country'], xy=(row['Male_le'], row['Female_le']), xytext=(5, 5), textcoords='offset points', fontsize=8)

# After customizing, display the plot
plt.show()

In [None]:
# Reading in the second Excel file from Our World in Data
alconsp = pd.read_excel('data/alconsp.xlsx')
alconsp.head(5)

print(alconsp.columns)

In [None]:
# Renaming columns in the dataset using a dictionary
columns_dict={}
columns_dict['Entity'] = 'Country'
columns_dict['Indicator:Alcohol, consumers past 12 months (%) - Sex:Male'] = 'Male_alc'
columns_dict['Indicator:Alcohol, consumers past 12 months (%) - Sex:Female']= 'Female_alc'
columns_dict['Population (historical estimates)'] = 'Population'
alconsp.rename(columns=columns_dict,inplace=True)

#dropping columns continent and code
alconsp.drop(['Continent','Code'],axis=1,inplace=True)

# dropping the NaN values
alconsp = alconsp.dropna(subset=['Female_alc', 'Male_alc'])

#calculating female male alcohol consumption ratio
alconsp['FM_alc_ratio'] = alconsp['Female_alc']/alconsp['Male_alc']

alconsp.head(15)

In [None]:
# Extracting population data from the dataset
pop = alconsp['Population']

# Storing pop as a numpy array scaled to millions: np_pop
np_pop = np.array(pop) / 1000000
# Format population numbers to display in millions
pop_labels = [f'{p:.2f} million' for p in np_pop]

# Set the figure size before plotting so that it applies to this chart
plt.figure(figsize=(10, 8))

# Create scatter plot
plt.scatter(alconsp['Female_alc'], alconsp['Male_alc'], s=np_pop, alpha=0.5, c=np_pop, cmap='rainbow')

# Add axis labels
plt.xlabel('Female alcohol consumption')
plt.ylabel('Male alcohol consumption')
# Add colorbar legend
plt.colorbar(label='Population (millions)')

# Add title
plt.title('Female vs. male alcohol consumption in 2016')

# Add 45 degree line
plt.plot([0, 100], [0, 100], linestyle='--', color='black', alpha=0.5)

# After customizing, display the plot
plt.show()

Here we can see that alcohol consumption, measured as the share of adults who drank alcohol in the last year, is higher for males than for females in all included countries. The bubble chart shows female alcohol consumption on the x-axis and male alcohol consumption on the y-axis, so every point above the 45 degree line is a country where the male share is higher. The size and color of the bubbles indicate the population size of each country.
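The claim that the male share is higher in every country can also be checked directly rather than read off the chart. The sketch below uses a hypothetical toy frame; the column names `Country`, `Male_alc` and `Female_alc` follow the renaming done in the notebook, and the same comparison could be run on `alconsp`.

```python
import pandas as pd

# Hypothetical toy data; the real values live in alconsp
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Male_alc': [60.0, 35.0, 80.0],
    'Female_alc': [40.0, 10.0, 70.0],
})

# Boolean Series: True where the male share exceeds the female share
male_higher = toy['Male_alc'] > toy['Female_alc']

# True only if the comparison holds for every row
all_higher = bool(male_higher.all())

# Countries where the claim does not hold (empty list if it always holds)
exceptions = toy.loc[~male_higher, 'Country'].tolist()
print(all_higher, exceptions)
```

Listing the exceptions is usually more informative than the single boolean, since it names the countries that would sit on or below the 45 degree line.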

In [None]:
#Descriptive statistics
alconsp.describe()

The table above shows summary statistics on male and female alcohol consumption, population, and the female-to-male alcohol consumption ratio for 188 countries in the year 2016.

The table shows that, on average, males consume more alcohol than females: the mean female-to-male alcohol consumption ratio is 0.52. The standard deviation of male alcohol consumption is higher than that of females. The population ranges from 1,883 to about 1.40 billion, with a mean of about 39.4 million.

These statistics can provide insight into alcohol consumption patterns and their potential impacts on health and society in each of the countries.

In [None]:
# Keeping only the country, year and ratio columns before merging
alconsp.drop(['Male_alc', 'Female_alc', 'Population'], axis=1, inplace=True)
lifeexp.drop(['Male_le', 'Female_le', 'Population'], axis=1, inplace=True)

combined_df = pd.merge(lifeexp,alconsp, on= 'Country', how = 'inner')
combined_df.head(15)

In [None]:
combined_df.drop( 'Year_y', axis=1, inplace=True)
combined_df = combined_df.rename(columns= {'Year_x': 'Year'})
combined_df = combined_df.dropna()

combined_df.head(10)

From the analysis of the data, we can see that there is a negative correlation between alcohol consumption and life expectancy for both men and women. The data also suggests that women tend to have a longer life expectancy than men on average, and they consume less alcohol than men.

The combined data on alcohol consumption and life expectancy for 188 countries in 2016 showed a moderate negative correlation between alcohol consumption and life expectancy for both males and females. The mean alcohol consumption, measured as the share of adults who drank alcohol in the last year, was 49.44% for males and 29.35% for females, with standard deviations of 25.74 and 21.32 percentage points, respectively. The mean female-to-male alcohol ratio was 0.52, indicating that, on average, males consumed more alcohol than females.
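The notebook cells do not show how the correlation was computed. One way it could be checked on the merged frame is a Pearson correlation between the two ratio columns, sketched here on hypothetical toy values. The column names `FM_le_ratio` and `FM_alc_ratio` come from the notebook; the numbers do not.

```python
import pandas as pd

# Hypothetical toy data with the same columns as combined_df
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'FM_le_ratio': [1.05, 1.07, 1.10, 1.04],
    'FM_alc_ratio': [0.70, 0.50, 0.30, 0.80],
})

# Series.corr defaults to the Pearson correlation coefficient
r = toy['FM_le_ratio'].corr(toy['FM_alc_ratio'])
print(round(r, 2))
```

Note that this measures the association between the two female-to-male ratios. A correlation between alcohol consumption and life expectancy for each sex separately would need the `Male_alc`, `Female_alc`, `Male_le` and `Female_le` columns, which are dropped before the merge.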

After filtering out the countries that were not used, we combined the two datasets.
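An inner merge silently drops countries that appear in only one dataset. A minimal sketch of how the dropped countries could be listed, using the `indicator` option of `pd.merge` on two hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames standing in for lifeexp and alconsp
left = pd.DataFrame({'Country': ['A', 'B', 'C'],
                     'FM_le_ratio': [1.05, 1.07, 1.10]})
right = pd.DataFrame({'Country': ['B', 'C', 'D'],
                      'FM_alc_ratio': [0.5, 0.3, 0.8]})

# An outer merge keeps every country; the _merge column records
# whether a row came from 'left_only', 'right_only' or 'both'
check = pd.merge(left, right, on='Country', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['Country', '_merge']]
print(unmatched)

# The inner merge used in the notebook keeps only the matched countries
inner = pd.merge(left, right, on='Country', how='inner')
print(inner)
```

Inspecting the unmatched rows can reveal countries that were lost only because of different spellings in the two sources.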

It is important to note that correlation does not imply causation, and other factors besides alcohol consumption, such as wealth, may contribute to life expectancy. Additionally, the data only covers the year 2016 and may not be representative of overall trends.
