# Bad habits vs Education

This guided project aims to explore whether there is a connection between the student population and "bad" habits or behaviors that may negatively impact people's health. The habits to study are:

- Coffee addiction.
- Smoking.
- Video game addiction.

Understanding the demographic factors behind these habits in a culture, and their connection with the student population, could help society understand whether these bad or misused habits are directly related to the stress faced by higher education students. That, in turn, could suggest a need to restructure higher education programs in places where these addictions are more prevalent. It is important to highlight that, due to the lack of recent research and measurements on these topics, the data sources date from 2019 and 2020.

*Press the '↓' key* to see which Python libraries we will use for this project.

In [None]:
import pandas as pd  # For reading and handling CSV files.
import numpy as np  # For numerical operations and handling missing values.
import matplotlib.pyplot as plt  # For creating basic visualizations.
import seaborn as sns  # For advanced data visualization used in the infographic.

## Content:
1. Introduction.
2. Load Data.

Feel free to visit **https://github.com/DefoNotGus/DV_assesment** to find the project's notebook.

## Load Data

### The datasets used to study the habits are:

1. **Coffee Consumption Dataset**: Lists coffee consumption by country with extensive coverage.
- **Source**: [Kaggle](https://www.kaggle.com/datasets/nurielreuven/coffee-consumption-by-country-2022/data)  
2. **Smoking Rates Dataset**: Provides a chronological overview of smoking rates in many countries, sourced via a Google search.
- **Source**: [World Population Review](https://worldpopulationreview.com/country-rankings/smoking-rates-by-country)  
3. **Gamers Market Dataset**: A self-created CSV that compiles a 2019 overview of the gaming market in many countries, chosen to align chronologically with the other datasets.
- **Source**: [Allcorrect Games](https://allcorrectgames.com/insights/a-global-research-of-2019-games-market/)  

*Press the '↓' key*

### The datasets used to analyze students and enrollment data are:  

1. **Education Statistics Dataset**: A massive dataset with student enrollment data by region, sourced using the World Bank DataBank tool.  
- **Source**: [World Bank](https://databank.worldbank.org/indicator/)  
2. **Students Dataset**: Provides country-specific enrollment data filtered to align with the habits datasets, sourced from the OECD Data Explorer.  
- **Source**: [OECD Data Explorer](https://data-explorer.oecd.org/)  

Datasets are loaded into variables using `pd.read_csv()` from the Pandas library.
*Press the '↓' key*

In [None]:
# Importing the CSV files
coffee_df = pd.read_csv('Datasets/coffee.csv')
smoking_df = pd.read_csv('Datasets/smoking.csv')
gamers_df = pd.read_csv('Datasets/gamers.csv')
edstats_df = pd.read_csv('Datasets/edstats.csv')
students_df = pd.read_csv('Datasets/students.csv')


## Visualization Plan.

Before going to the next step, it's important to display the head of each dataset (its first 2 rows), its size using the "shape" attribute, and its features or column headers, in order to explain the aim of each dataset.
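
Since the same three displays are repeated for every dataset, they could be wrapped in a small helper. The following is only a sketch: the function name `summarize` and the toy data are assumptions for illustration and are not part of the project notebook.

```python
import pandas as pd

def summarize(df: pd.DataFrame, name: str, n: int = 2) -> dict:
    """Print the first n rows, the shape and the column names of df."""
    print(f"{name} DataFrame:\n{df.head(n)}\n")
    rows, cols = df.shape
    print(f"Size:\nRows: {rows}, Columns: {cols}")
    print(f"{name} columns: {df.columns.tolist()}\n")
    return {"rows": rows, "columns": df.columns.tolist()}

# Toy data standing in for one of the project datasets (values are made up)
toy_df = pd.DataFrame({"country": ["A", "B", "C"], "value": [1.0, 2.5, 4.0]})
info = summarize(toy_df, "Toy")
```

With such a helper, each dataset section would reduce to a single call, for example `summarize(coffee_df, "Coffee")`.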

### Coffee_df:
This dataset contains information on coffee consumption per capita (in kilograms) for 183 countries in the years 2020 and 2016. It is useful for identifying countries with high coffee consumption. Additional details about this dataset and others can be found in the reference section.

In [None]:
# Display head (first 2 rows)
print("Coffee DataFrame:",coffee_df.head(2), "\n")

# Size and features display
print(f"Size:\nRows: {coffee_df.shape[0]}, Columns: {coffee_df.shape[1]}")
print("Coffee columns:", coffee_df.columns.tolist(), "\n")

### Smoking_df:
This dataset provides smoking rates (as a percentage of the population) for 164 countries in the years 2020, 2021, and 2022. It also includes data broken down by gender (male and female). The dataset can be used to compare smoking rates across countries and analyze gender differences in cigarette consumption.

In [None]:
# Display head (2 first rows)
print("Smoking DataFrame:", smoking_df.head(2), "\n")
# Size and features display
print(f"Size:\nRows: {smoking_df.shape[0]}, Columns: {smoking_df.shape[1]}")
print("Smoking columns:", smoking_df.columns.tolist(), "\n")

### Gamers_df:
This dataset, which I created myself from a website, contains data for 29 countries, including key metrics such as market revenue in the video games industry (in millions of dollars), internet penetration (percentage), number of gamers (in millions), mobile market revenue (in millions of dollars), yearly spending on mobile games per user (in dollars), and English proficiency (average level of English speakers). While the dataset is rich in information, it is limited in the number of countries covered. Nevertheless, it is valuable for analyzing the gaming population across different countries.

In [None]:
# Display head (2 first rows)
print("Gamers DataFrame:", gamers_df.head(2), "\n")
# Size display and  features
print(f"Size:\nRows: {gamers_df.shape[0]}, Columns: {gamers_df.shape[1]}")
print("Gamers columns:", gamers_df.columns.tolist(), "\n")

### Edstats_df:
This dataset contains extensive data collected in series, not limited to country-level information, making it suitable for analyzing specific details by region. For this project, we will primarily focus on analyzing the higher education population per country, filtering and selecting the most relevant series for our analysis.

In [None]:
# Display head (2 first rows)
print("EdStats DataFrame:", edstats_df.head(2), "\n")
# Size display and features
print(f"Size:\nRows: {edstats_df.shape[0]}, Columns: {edstats_df.shape[1]}")
print("EdStats columns:", edstats_df.columns.tolist(), "\n")

### Students_df
This dataset has information relevant to the number of enrolled students per country and includes various metrics related to education, such as enrollment rates, education levels, and potentially regional or demographic breakdowns. 

In [None]:
# Display head (2 first rows)
print("Students Enrolment DataFrame:", students_df.head(2), "\n")

# Size display and features
print(f"Size:\nRows: {students_df.shape[0]}, Columns: {students_df.shape[1]}")
print("Student enrolment columns:", students_df.columns.tolist(), "\n")

## Processing Data

We will progressively remove or handle missing values, correct or convert data types, filter out unnecessary rows or columns, and transform the data by normalizing, scaling, or aggregating it to ensure consistency and usability.
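
As a rough sketch of what these steps look like in pandas, the toy table below (made-up values and hypothetical column names, not taken from the project data) goes through type conversion, removal of missing values and indexing by country:

```python
import numpy as np
import pandas as pd

# Toy table with the kinds of problems described above (values are made up)
raw = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "rate": ["12.5", "", "n/a", "0"],
})

clean = raw.copy()
# Convert text to numbers; anything that cannot be parsed becomes NaN
clean["rate"] = pd.to_numeric(clean["rate"], errors="coerce")
# Treat zeros as missing, then drop incomplete rows
clean["rate"] = clean["rate"].replace(0, np.nan)
clean = clean.dropna().set_index("country")
print(clean)
```

Converting to a numeric type before filtering matters here: while the column still holds strings, a comparison against the number `0` does not match the text `"0"`.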




**Coffee_df:**

Now to process this dataset, we will do the following in a new dataset called **coffee2020_df**:

1. Remove unnecessary features, specifically the 2016 data.
2. Rename columns to give them clearer titles.
3. Remove missing values.
4. Change the data type where needed, since columns are often read as strings by default.
5. Set 'Country' as the dataset index.

The main reason for keeping only the 2020 information is to perform a country-to-country analysis within the same year.


In [None]:
# Copy the coffee_df to a new DataFrame for processing
coffee2020_df = coffee_df.copy()
# 1. Remove the "coffeeConsumptionByCountry_perCapitaCons2016" feature
coffee2020_df = coffee2020_df.drop(columns=['coffeeConsumptionByCountry_perCapitaCons2016'])
# 2. Rename columns for clarity
coffee2020_df = coffee2020_df.rename(columns={
    'coffeeConsumptionByCountry_perCapitaCons2020': 'Coffee per capita in 2020 (KG)',
    'country': 'Country'
})
# 3. Remove rows with missing values or "0"
coffee2020_df = coffee2020_df.dropna()
coffee2020_df = coffee2020_df[coffee2020_df['Coffee per capita in 2020 (KG)'] != 0]
# 4. Convert the 'Coffee per capita in 2020 (KG)' column to float
coffee2020_df['Coffee per capita in 2020 (KG)'] = coffee2020_df['Coffee per capita in 2020 (KG)'].astype(float)
# 5. Set 'Country' as the dataset index
coffee2020_df = coffee2020_df.set_index('Country')

We will now verify that the data has been processed correctly by checking for missing values and printing the new features. We can also print how many rows and columns have been removed by subtracting the size of the new dataframe from the size of the original, as shown below:

In [None]:
# Check for missing (NaN) or empty (blank) values
empty_values = (coffee2020_df == '').sum().sum()  # Count empty strings
missing_or_empty_values = coffee2020_df.isnull().sum().sum() + empty_values

print(f"Total missing or empty values: {missing_or_empty_values}")

# Print the features (columns) of the processed dataset
print("Features in coffee2020_df:", coffee2020_df.columns.tolist())


# Calculate and display the difference in rows and columns
row_diff, col_diff = coffee_df.shape[0] - coffee2020_df.shape[0], coffee_df.shape[1] - coffee2020_df.shape[1]
print(f"Difference in rows: {row_diff}\nDifference in columns: {col_diff}")


We can conclude that this dataframe has been processed properly and is now ready for merging or visualization in the next step.
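
Because the same check is repeated for each processed table, it could be collected in one function. This is a sketch under assumptions: `verify_processed` is a hypothetical helper name and the data is made up.

```python
import pandas as pd

def verify_processed(original: pd.DataFrame, processed: pd.DataFrame) -> dict:
    """Count missing/empty cells and report how many rows and columns were removed."""
    missing = int(processed.isnull().sum().sum())
    empty = int((processed == "").sum().sum())
    return {
        "missing_or_empty": missing + empty,
        "rows_removed": original.shape[0] - processed.shape[0],
        "cols_removed": original.shape[1] - processed.shape[1],
    }

# Toy example (made-up values): one row is dropped and one column becomes the index
before = pd.DataFrame({"Country": ["A", "B", "C"], "x": [1.0, None, 3.0]})
after = before.dropna().set_index("Country")
report = verify_processed(before, after)
print(report)
```

Note that the column difference also counts 'Country': once it is moved to the index it no longer appears in `shape[1]`, even though no information was lost.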

**Smoking_df:**

Similarly, we need to create a new dataframe called **smoking2020_df**:

1. Remove unnecessary features, all the data from 2022 and 2021.
2. Rename columns to give them clearer titles.
3. Convert the rate columns to a numeric type.
4. Remove missing values or empty strings per row.
5. Set 'Country' as the dataset index.

In [None]:
# 1. Create a new DataFrame with only 2020 data
smoking2020_df = smoking_df[['country', 'smokingRatesByCountry_rateBothPct2020', 
                               'smokingRatesByCountry_rateMalePct2020', 
                               'smokingRatesByCountry_rateFemalePct2020']]

# 2. Rename the columns for clarity
smoking2020_df = smoking2020_df.rename(columns={
    'country': 'Country',
    'smokingRatesByCountry_rateBothPct2020': 'Smoking rate in 2020(%)',
    'smokingRatesByCountry_rateMalePct2020': 'Male smoking rate in 2020(%)',
    'smokingRatesByCountry_rateFemalePct2020': 'Female smoking rate in 2020(%)'
})

# 3. Convert the columns to numeric values, coercing errors to NaN
smoking2020_df['Smoking rate in 2020(%)'] = pd.to_numeric(smoking2020_df['Smoking rate in 2020(%)'], errors='coerce')
smoking2020_df['Male smoking rate in 2020(%)'] = pd.to_numeric(smoking2020_df['Male smoking rate in 2020(%)'], errors='coerce')
smoking2020_df['Female smoking rate in 2020(%)'] = pd.to_numeric(smoking2020_df['Female smoking rate in 2020(%)'], errors='coerce')

# 4. Remove missing or unwanted data
smoking2020_df = smoking2020_df.replace('', np.nan)  # Replace empty strings with NaN
smoking2020_df = smoking2020_df.replace(0, np.nan)  # Replace zeros with NaN (the rate columns are numeric after step 3)
smoking2020_df = smoking2020_df.dropna()  # Drop rows with NaN values

# 5. Set 'Country' as the dataset index
smoking2020_df = smoking2020_df.set_index('Country')


We follow the same procedure to verify the processing of this table and conclude that the data processing has been successful, as shown below.

In [None]:
# Check for missing (NaN) or empty (blank) values
empty_values = (smoking2020_df == '').sum().sum()  # Count empty strings
missing_or_empty_values = smoking2020_df.isnull().sum().sum() + empty_values

print(f"Total missing or empty values: {missing_or_empty_values}")

# Print the features (columns) of the processed dataset
print("Features in smoking2020_df:", smoking2020_df.columns.tolist())


# Calculate and display the difference in rows and columns
row_diff, col_diff = smoking_df.shape[0] - smoking2020_df.shape[0], smoking_df.shape[1] - smoking2020_df.shape[1]
print(f"Difference in rows: {row_diff}\nDifference in columns: {col_diff}")


**Gamers_df:**

We will process the gamers_df dataset similarly, creating a new dataframe, *gamers2019_df*, and performing the following tasks:

1. Remove unnecessary columns.
2. Rename columns to give them clearer titles.
3. Convert relevant columns to the correct data type.
4. Remove rows with missing or incorrect data.
5. Set 'Country' as the dataset index.

In [None]:
# 1. Create a new DataFrame with only the 'Country' and 'Number of Gamers (millions)' columns
gamers2019_df = gamers_df[['Country', 'Number of Gamers (millions)']]

# 2. Rename the columns for clarity
gamers2019_df = gamers2019_df.rename(columns={'Number of Gamers (millions)': 'Gamers in 2019(MM)'})

# 3. Convert the 'Gamers in 2019(MM)' column to numeric, coercing errors to NaN
gamers2019_df['Gamers in 2019(MM)'] = pd.to_numeric(gamers2019_df['Gamers in 2019(MM)'], errors='coerce')

# 4. Remove rows with missing values (NaN)
gamers2019_df = gamers2019_df.dropna()

# 5. Set 'Country' as the dataset index
gamers2019_df = gamers2019_df.set_index('Country')




Now we verify the dataframe and changes made:

In [None]:
# Check for missing (NaN) or empty (blank) values in gamers2019_df
empty_values = (gamers2019_df == '').sum().sum()  # Count empty strings
missing_or_empty_values = gamers2019_df.isnull().sum().sum() + empty_values

print(f"Total missing or empty values: {missing_or_empty_values}")

# Print the features (columns) of the processed dataset
print("Features in gamers2019_df:", gamers2019_df.columns.tolist())

# Calculate and display the difference in rows and columns between original and processed DataFrame
row_diff, col_diff = gamers_df.shape[0] - gamers2019_df.shape[0], gamers_df.shape[1] - gamers2019_df.shape[1]
print(f"Difference in rows: {row_diff}\nDifference in columns: {col_diff}")


**EdStats:**

This dataframe requires a somewhat more complex approach, since we first need to find which "Series" we want to keep before following our usual processing steps. We create an *"enrolment_df"* by using the "str.contains()" method to select every entry in the Series column that contains the word "enrolment".
We then display the distinct titles with the "unique()" method, which returns each unique value once.

In [None]:
# Select the 'Series' values that contain the word 'enrolment' (case-insensitive)
enrolment_df = edstats_df['Series'][edstats_df['Series'].str.contains('enrolment', case=False, na=False)]

# Get the unique values among those matches
unique_enrolment_df = enrolment_df.unique()

# Display the unique series names containing 'enrolment'
print(unique_enrolment_df)



From this information we can conclude that we want the following series only: 'Enrolment in tertiary education, all programmes, both sexes (number)'
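
To make the filtering step concrete, here is a minimal sketch on a toy 'Series' column. Apart from the tertiary enrolment title quoted above, the labels are made up for illustration.

```python
import pandas as pd

# Toy 'Series' column; only the first label is taken from the text above
toy = pd.DataFrame({"Series": [
    "Enrolment in tertiary education, all programmes, both sexes (number)",
    "Enrolment in primary education, both sexes (number)",
    "Population, total",
    None,
    "Enrolment in primary education, both sexes (number)",
]})

# case=False ignores capitalisation; na=False treats missing labels as non-matches
mask = toy["Series"].str.contains("enrolment", case=False, na=False)
matches = toy.loc[mask, "Series"].unique()
print(matches)
```

Without `na=False`, the missing label would produce a missing value in the mask, which cannot be used directly for boolean filtering.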

Now we proceed to:
1. Remove unnecessary rows (all the rows with irrelevant information).
2. Rename columns to give them clearer titles.
3. Remove unnecessary columns (we keep only the region name and the years 2019 and 2020).
4. Remove rows with missing or incorrect data.
5. Convert relevant columns to the correct data type.
6. Create a new column that combines both years (2019 and 2020), prioritizing the most recent.
7. Drop rows that have numeric data for neither 2019 nor 2020.
8. Set 'Region' as the dataset index.
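
The step that prefers the most recent year can be sketched with `combine_first` on toy data. The counts below are made up, and the column names simply mirror the ones used in this section.

```python
import pandas as pd

# Made-up enrolment counts; <NA> marks a year with no reported value
toy = pd.DataFrame({
    "Region": ["A", "B", "C"],
    "Students in 2020": pd.array([100, None, None], dtype="Int64"),
    "Students in 2019": pd.array([90, 80, None], dtype="Int64"),
})

# Take the 2020 value when present, otherwise fall back to 2019
toy["Students in 2019-20"] = toy["Students in 2020"].combine_first(toy["Students in 2019"])
# Rows with no value in either year are dropped
toy = toy.dropna(subset=["Students in 2019-20"]).set_index("Region")
print(toy["Students in 2019-20"])
```

Region A keeps its 2020 value, region B falls back to 2019, and region C is removed because neither year has data.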


In [None]:
# Filter edstats_df for rows where 'Series' matches the specified series exactly
edstats2020_df = edstats_df[edstats_df['Series'] == 'Enrolment in tertiary education, all programmes, both sexes (number)']

# Rename columns in edstats2020_df
edstats2020_df = edstats2020_df.rename(columns={'Country Name': 'Region', 
                                                '2020 [YR2020]': 'Students in 2020',
                                                '2019 [YR2019]': 'Students in 2019'
                                                })

# Keep only 'Region', 'Students in 2019' and 'Students in 2020' columns (copy to avoid SettingWithCopyWarning)
edstats2020_df = edstats2020_df[['Region', 'Students in 2020','Students in 2019']].copy()

# Remove non-numeric characters but keep the decimal point, so that '123.5' does not become 1235
edstats2020_df['Students in 2020'] = edstats2020_df['Students in 2020'].replace(r'[^0-9.]', '', regex=True)
edstats2020_df['Students in 2019'] = edstats2020_df['Students in 2019'].replace(r'[^0-9.]', '', regex=True)

# Convert both columns to numbers (NaN on failure), round, and store as nullable integers
edstats2020_df['Students in 2020'] = pd.to_numeric(edstats2020_df['Students in 2020'], errors='coerce').round().astype('Int64')
edstats2020_df['Students in 2019'] = pd.to_numeric(edstats2020_df['Students in 2019'], errors='coerce').round().astype('Int64')

# Create 'Students in 2019-20' column with preference for 'Students in 2020' values
edstats2020_df['Students in 2019-20'] = edstats2020_df['Students in 2020'].combine_first(edstats2020_df['Students in 2019'])

# Keep only 'Region' and 'Students in 2019-20' columns
edstats2020_df = edstats2020_df[['Region', 'Students in 2019-20']]

# Drop rows where 'Students in 2019-20' is NaN
edstats2020_df = edstats2020_df.dropna(subset=['Students in 2019-20'])

# Set 'Region' as the index
edstats2020_df = edstats2020_df.set_index('Region')


# Size display
print(f"Size:\nRows: {edstats2020_df.shape[0]}, Columns: {edstats2020_df.shape[1]}")

Something peculiar about this World Bank dataset is that it provides not only countries but also regions. We therefore list all the 'regions' available and then build a mapping after carefully choosing the most accurate equivalent for each country. This is explained further in the merging section.

In [None]:
# Display the unique region names in the index of edstats2020_df
unique_regions = edstats2020_df.index.unique()
print("Unique regions in edstats2020_df:\n", unique_regions)

**Merging tables**

The last step to follow is to merge "coffee2020_df", "smoking2020_df" and "gamers2019_df" into one single data frame called **"habits2020_df"**.

In [None]:
# Merge the DataFrames on 'Country' without including edstats2020_df
habits2020_df = smoking2020_df.join(gamers2019_df, how='inner', rsuffix='_gamers')
habits2020_df = habits2020_df.join(coffee2020_df, how='inner', rsuffix='_coffee')

# Display the size of the resulting DataFrame
print("Size of habits2020_df:", habits2020_df.shape)
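
An inner join keeps only the countries present in every table, so it can be worth checking what gets discarded. The sketch below uses made-up tables and hypothetical column names to show the effect.

```python
import pandas as pd

# Made-up tables indexed by country; only 'B' appears in all three
smoking = pd.DataFrame({"smoking": [20.0, 25.0]}, index=pd.Index(["A", "B"], name="Country"))
gamers = pd.DataFrame({"gamers": [5.0, 7.0]}, index=pd.Index(["B", "C"], name="Country"))
coffee = pd.DataFrame({"coffee": [1.2, 3.4]}, index=pd.Index(["B", "D"], name="Country"))

# how='inner' keeps only index labels present on both sides of each join
joined = smoking.join(gamers, how="inner").join(coffee, how="inner")
print(joined)

# An outer join shows which countries the inner join discarded
all_countries = smoking.join(gamers, how="outer").join(coffee, how="outer")
dropped = sorted(set(all_countries.index) - set(joined.index))
print("Dropped:", dropped)
```

In the project data the smallest table (the gamers dataset) limits how many countries can survive the inner join.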



As mentioned before, edstats2020_df does not have a 'Country' column but a 'Region' column instead. To merge these two datasets smoothly into a single "HABITS VS EDUCATION" dataframe, we need to add a 'Region' column to habits2020_df and assign sensible values from the "unique_regions" list created before.

To achieve this we built a map manually. When possible we use the existing country entry, for example China and South Africa; otherwise we assign the country to a region by simple proximity.

In [None]:
# Creating a mapping from habits2020_df to regions in edstats2020_df
country_to_region_mapping = {
    'China': 'China',  # Direct match
    'United States': 'North America',  # United States in North America
    'Indonesia': 'East Asia & Pacific',  # Indonesia in Southeast Asia
    'Brazil': 'Latin America & Caribbean',  # Brazil in Latin America
    'Russia': 'Europe & Central Asia',  # Russia in Eastern Europe
    'Japan': 'East Asia & Pacific',  # Japan in East Asia
    'Philippines': 'East Asia & Pacific',  # Philippines in Southeast Asia
    'Vietnam': 'East Asia & Pacific',  # Vietnam in Southeast Asia
    'Iran': 'Middle East & North Africa',  # Iran in the Middle East
    'Turkey': 'Europe & Central Asia',  # Turkey in Eastern Europe / Middle East
    'Germany': 'Europe & Central Asia',  # Germany in Europe
    'Thailand': 'East Asia & Pacific',  # Thailand in Southeast Asia
    'United Kingdom': 'Europe & Central Asia',  # UK in Europe
    'France': 'Europe & Central Asia',  # France in Europe
    'South Africa': 'South Africa',  # Direct match
    'Italy': 'Europe & Central Asia',  # Italy in Europe
    'South Korea': 'East Asia & Pacific',  # South Korea in East Asia
    'Spain': 'Europe & Central Asia',  # Spain in Europe
    'Canada': 'North America',  # Canada in North America
    'Poland': 'Europe & Central Asia',  # Poland in Europe
    'Saudi Arabia': 'Middle East & North Africa',  # Saudi Arabia in the Middle East
    'Australia': 'East Asia & Pacific',  # Australia in Oceania
    'United Arab Emirates': 'Middle East & North Africa'  # UAE in the Middle East
}

Now we use the map to create our 'Region' column in a new dataframe called "habits2020_REG":

In [None]:
# Work on a copy so that habits2020_df is left unchanged
habits2020_REG = habits2020_df.copy()

# Now, map the 'Country' (index) to 'Region' using the country_to_region_mapping
habits2020_REG['Region'] = habits2020_REG.index.map(country_to_region_mapping)


# Group by 'Region' and compute the mean for each region ('Region' becomes the index)
habits2020_REG = habits2020_REG.groupby('Region').mean()

# Print or inspect the new DataFrame
print(habits2020_REG)

# Display the size of the resulting DataFrame
print("Size of habits2020_REG:", habits2020_REG.shape)
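
One thing to watch with a hand-made map is that any country missing from it receives NaN and silently disappears in the `groupby`. The following sketch, with made-up values and a deliberately incomplete hypothetical map, shows one way to list such countries first.

```python
import pandas as pd

# Made-up habits table indexed by country
habits = pd.DataFrame(
    {"rate": [10.0, 20.0, 30.0, 40.0]},
    index=pd.Index(["Japan", "Vietnam", "France", "Atlantis"], name="Country"),
)
# Hypothetical subset of a country-to-region map; 'Atlantis' is deliberately missing
region_map = {
    "Japan": "East Asia & Pacific",
    "Vietnam": "East Asia & Pacific",
    "France": "Europe & Central Asia",
}

habits["Region"] = habits.index.map(region_map)
# Countries without a mapping get NaN and would be dropped by groupby
unmapped = habits[habits["Region"].isna()].index.tolist()
print("Unmapped:", unmapped)

# Unweighted mean of the country values within each region
regional = habits.groupby("Region")[["rate"]].mean()
print(regional)
```

The regional figure is an unweighted average of the countries assigned to it, so it describes the sampled countries rather than the whole region.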


Now we will merge habits2020_REG with edstats2020_df, finally achieving our target, habits_vs_education_df.

***habits_vs_education_df:***

In [None]:
# Merge habits2020_REG and edstats2020_df on 'Region'
habits_vs_education_df = habits2020_REG.merge(edstats2020_df, left_index=True, right_index=True, how='inner')

# Display the merged DataFrame to verify the result
print(habits_vs_education_df.head(1))

# Display the size of the resulting DataFrame
print("Size of habits_vs_education_df:", habits_vs_education_df.shape)

unique_index_values = habits_vs_education_df.index.unique()
print(unique_index_values)



## Data Visualization

We will now display some of the processed and arranged data. From habits_vs_education_df we collect the data needed for each figure, as listed below:

- figure1 displays the contrast in smoking rates by gender for each region or country, using "habits_vs_education_df" since it holds the smoking rates by gender for each region.
- figure2 displays the 'habits' by region.
- figure3 displays the contrast between habits and students.

### Comparison of smoking rates by gender per region

In order to organize this data we first create a table called 'fig1_df' which holds only *Male smoking rate in 2020(%)* and *Female smoking rate in 2020(%)*. Then we simplify the column names, create a figure, and stack the 'Female Smoking Rate' bar at the end of the 'Male Smoking Rate' bar to create the visual effect of a cigarette and its filter.

In [None]:
# Prepare the data for the table
fig1_df = habits_vs_education_df[['Male smoking rate in 2020(%)', 'Female smoking rate in 2020(%)']]
fig1_df.index = habits_vs_education_df.index  # Ensure Region is set as the index
fig1_df = fig1_df.rename(columns={
    'Male smoking rate in 2020(%)': 'Male Smoking Rate (%)',
    'Female smoking rate in 2020(%)': 'Female Smoking Rate (%)'
})

# Create a horizontal bar chart for Female smoking rate over Male smoking rate
fig, ax = plt.subplots(figsize=(12, 8))

# Plot Male smoking rate bars
male_bars = ax.barh(fig1_df.index, fig1_df['Male Smoking Rate (%)'], color='#fafafa', label='Male Smoking Rate')

# Plot Female smoking rate bars over Male smoking rate
female_bars = ax.barh(fig1_df.index, fig1_df['Female Smoking Rate (%)'], 
                      left=fig1_df['Male Smoking Rate (%)'], color='#ff7f0e', label='Female Smoking Rate')

# Customize the chart
ax.set_title('Female vs. Male Smoking Rates by Region (2020)', fontsize=16, weight='bold', color='white')
ax.set_xlabel('Smoking Rate (%)', fontsize=14, weight='bold', color='white')
ax.set_ylabel('Region', fontsize=14, weight='bold', color='white')
ax.set_facecolor('#0c0c0c')  # Inner chart background color
fig.patch.set_facecolor('#0c0c1b')  # Outer background color
ax.tick_params(colors='white')  # Set tick label color to white

# Add legend to the right of the chart with customized font color and background
legend = ax.legend(title='Gender', fontsize=12, title_fontsize=13, loc='center left', 
                   bbox_to_anchor=(1.05, 0.5), frameon=True)  # Enable legend frame

# Customize legend background and font colors
legend.get_frame().set_facecolor('#2f2f2f')  # Set background color
legend.get_frame().set_edgecolor('white')    # Set edge color (optional)
legend.get_frame().set_linewidth(1)          # Adjust frame thickness
legend.get_title().set_color('white')        # Set title font color
for text in legend.get_texts():
    text.set_color('white')  # Set label font color

# Adjust layout to prevent clipping
plt.tight_layout()

# Display the chart
plt.show()

### Habit rates across regions

Using the smoking, gamers and coffee data, we normalize each column by dividing it by its maximum, so that the highest value in each column is one and the three habits can be compared on the same scale.
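
As a minimal sketch of this max normalization, x_norm = x / max(x), the toy table below uses made-up values and hypothetical column names in three different units:

```python
import pandas as pd

# Made-up values in three different units
toy = pd.DataFrame(
    {
        "smoking_pct": [10.0, 20.0, 40.0],
        "gamers_mm": [5.0, 50.0, 25.0],
        "coffee_kg": [2.0, 1.0, 4.0],
    },
    index=["Region A", "Region B", "Region C"],
)

# Divide every column by its own maximum so the largest value in each column is 1
normalized = toy / toy.max()
print(normalized)
```

This scaling fixes the maximum at 1 but does not force the minimum to 0; for non-negative data the values fall between 0 and 1 and keep their original ratios.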

In [None]:
# Normalize data for comparison
normalized_smoking = habits_vs_education_df['Smoking rate in 2020(%)'] / habits_vs_education_df['Smoking rate in 2020(%)'].max()
normalized_gamers = habits_vs_education_df['Gamers in 2019(MM)'] / habits_vs_education_df['Gamers in 2019(MM)'].max()
normalized_coffee = habits_vs_education_df['Coffee per capita in 2020 (KG)'] / habits_vs_education_df['Coffee per capita in 2020 (KG)'].max()

# Plot multi-line chart
fig, ax = plt.subplots(figsize=(12, 8))

# Plot the lines
ax.plot(habits_vs_education_df.index, normalized_smoking, label='Smoking Rate', marker='o', color='#1f77b4')
ax.plot(habits_vs_education_df.index, normalized_gamers, label='Gamers', marker='s', color='#ff69b4')
ax.plot(habits_vs_education_df.index, normalized_coffee, label='Coffee Consumption', marker='^', color='#2ca02c')

# Customize chart
ax.set_xlabel('Region', fontsize=12, color='white')  # Font color for x-axis
ax.set_ylabel('Normalized Values of the Habits', fontsize=12, color='white')  # Font color for y-axis
ax.set_title('Trends in Smoking, Gamers, and Coffee Consumption by Region', fontsize=16, color='white')  # Font color for title
ax.set_xticks(range(len(habits_vs_education_df.index)))
ax.set_xticklabels(habits_vs_education_df.index, rotation=45, ha='right', color='white')  # Font color for xticklabels

# Set background color (both inner and outer)
ax.set_facecolor('#0c0c0c')  # Inner background color
fig.patch.set_facecolor('#0c0c1b')  # Outer background color
ax.tick_params(colors='white')  # Set tick label color to white

# Set the color of the axis lines (spines) to white
ax.spines['top'].set_color('white')
ax.spines['bottom'].set_color('white')
ax.spines['left'].set_color('white')
ax.spines['right'].set_color('white')

# Move legend to the right of the plot
legend = ax.legend(loc='upper left', bbox_to_anchor=(1.05, 1), title='Habits', fontsize=12, title_fontsize=13, frameon=True)

# Customize legend background and font colors
legend.get_frame().set_facecolor('#2f2f2f')  # Set background color for legend box
legend.get_frame().set_edgecolor('white')    # Set edge color for the legend box
legend.get_frame().set_linewidth(1)          # Set line width for the legend box
legend.get_title().set_color('white')        # Set legend title font color
for text in legend.get_texts():
    text.set_color('white')  # Set legend labels font color

# Adjust layout to prevent clipping
plt.tight_layout()

# Display the plot
plt.show()

### Habits by region vs students by region

For our last plot we normalize the data once more, this time using a different scale for the student counts. Because the habits are measured in different units (rate, millions and kg), they all have to be normalized before consumption can be contrasted across regions.

In [None]:
# Normalize each habit column by its maximum so that the largest value is 1
normalized_smoking = habits_vs_education_df['Smoking rate in 2020(%)'] / habits_vs_education_df['Smoking rate in 2020(%)'].max()
normalized_gamers = habits_vs_education_df['Gamers in 2019(MM)'] / habits_vs_education_df['Gamers in 2019(MM)'].max()
normalized_coffee = habits_vs_education_df['Coffee per capita in 2020 (KG)'] / habits_vs_education_df['Coffee per capita in 2020 (KG)'].max()

# Scale 'Students in 2019-20' so that the largest value is 3, the upper bound of three stacked habit bars
normalized_students = habits_vs_education_df['Students in 2019-20'] / habits_vs_education_df['Students in 2019-20'].max() * 3

# Create the new DataFrame fig3_df with the normalized values
fig3_df = pd.DataFrame({
    'Normalized Smoking Rate': normalized_smoking,
    'Normalized Gamers': normalized_gamers,
    'Normalized Coffee Consumption': normalized_coffee,
    'Normalized Students': normalized_students
})

# Merge 'China' with 'East Asia & Pacific' by adding their values
fig3_df.loc['East Asia & Pacific'] = fig3_df.loc['China'] + fig3_df.loc['East Asia & Pacific']
fig3_df = fig3_df.drop('China', axis=0)

# Create the positions for the bars (the x-axis for the bars)
x = np.arange(len(fig3_df))  # Positions for the bars
width = 0.35  # Width of the bars

# Plotting the stacked bar chart
fig, ax = plt.subplots(figsize=(12, 6))

# Plot each of the habits on top of each other
ax.bar(x, fig3_df['Normalized Smoking Rate'], width, label='Smoking Rate', color='#1f77b4')
ax.bar(x, fig3_df['Normalized Gamers'], width, bottom=fig3_df['Normalized Smoking Rate'], label='Gamers', color='#ff69b4')
ax.bar(x, fig3_df['Normalized Coffee Consumption'], width, bottom=fig3_df['Normalized Smoking Rate'] + fig3_df['Normalized Gamers'], label='Coffee Consumption', color='#2ca02c')

# Plot the normalized students next to the stacked habits bar
ax.bar(x + width, fig3_df['Normalized Students'], width, label='Students', color='#d62728')

# Customize chart
ax.set_xlabel('Region', fontsize=12, color='white')
ax.set_ylabel('Normalized Values', fontsize=12, color='white')
ax.set_title('Comparison of Habits and Students by Region', fontsize=16, color='white')
ax.set_xticks(x + width / 2)  # Position the labels between bars
ax.set_xticklabels(fig3_df.index, rotation=45, ha='right', color='white')
ax.tick_params(axis='x', colors='white')  # Set the x-axis tick color to white
ax.tick_params(axis='y', colors='white')  # Set the y-axis tick color to white

# Set background color
ax.set_facecolor('#0c0c0c')  # Inner background color
fig.patch.set_facecolor('#0c0c1b')  # Outer background color

# Add a legend to the right of the plot
ax.legend(loc='upper left', bbox_to_anchor=(1.05, 1), title='Categories', fontsize=12, title_fontsize=13)

# Adjust layout to avoid clipping
plt.tight_layout()

# Display the plot
plt.show()
