
In [None]:
# Download the Kaggle dataset with kagglehub so the notebook also runs outside Kaggle.
# Note: this environment differs from Kaggle's Python environment, so some
# libraries used by the notebook may be missing.
import kagglehub
ernestitus_2024_olympics_medals_vs_gdp_path = kagglehub.dataset_download('ernestitus/2024-olympics-medals-vs-gdp')

print('Data source import complete.')


In the world of sports, the Olympics is the ultimate stage where nations showcase their athletic prowess. But have you ever wondered if there's a correlation between a country's economic power and its Olympic success? Let's dive into the data from the 2024 Olympics and see what insights we can uncover.

## Introduction

This notebook explores the relationship between a country's GDP and its performance in the 2024 Olympics. We'll analyze the data to see if wealthier countries tend to win more medals, and if so, how strong that correlation is. We'll also look at other factors such as population size and regional differences.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

In [None]:
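# Load the dataset from the directory returned by kagglehub in the first cell
# (the hard-coded /kaggle/input/... path exists only inside Kaggle).
import os

file_path = os.path.join(ernestitus_2024_olympics_medals_vs_gdp_path, 'olympics.csv')
df = pd.read_csv(file_path)
df.head()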

## Data Overview

Let's take a closer look at the data to understand its structure and the types of information it contains.

In [None]:
df.info()

## Descriptive Statistics

Before diving into the analysis, let's get some basic statistics to understand the distribution of the data.

In [None]:
df.describe()
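
`df.info()` and `df.describe()` report non-null counts, but a direct missing-value summary is easier to read before modeling, because scikit-learn's `LinearRegression` does not accept NaN inputs. The sketch below uses a small hypothetical table (`toy`) in place of the real data; whether the actual dataset contains missing values is not known from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror those used later in this notebook.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'gdp': [10.0, np.nan, 30.0, 40.0],
    'population': [1.0, 2.0, np.nan, 4.0],
    'total': [3, 0, 5, 1],
})

def missing_summary(frame):
    """Count missing values per column and express them as a share of all rows."""
    counts = frame.isna().sum()
    return pd.DataFrame({'missing': counts, 'share': counts / len(frame)})

summary = missing_summary(toy)
print(summary)

# Keep only rows where both model features are present.
complete = toy.dropna(subset=['gdp', 'population'])
print(len(complete), 'complete rows out of', len(toy))
```

On the real data, the same helper could be called as `missing_summary(df)`.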

## Correlation Analysis

To understand the relationships between numeric variables, we'll create a correlation heatmap. This will help us identify any strong correlations between GDP, population, and medal counts.

In [None]:
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=[np.number])

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
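
The heatmap above uses Pearson correlation, the default for `DataFrame.corr()`, which measures linear association and can be dominated by a few extreme values. As a rough cross-check, a rank-based Spearman correlation can be computed alongside it. The sketch below runs on a hypothetical toy table, and the helper name `correlation_with_target` is an illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data; 'gdp' has one extreme value to mimic a skewed column.
toy = pd.DataFrame({
    'gdp': [1.0, 2.0, 3.0, 4.0, 100.0],
    'population': [5.0, 3.0, 8.0, 1.0, 9.0],
    'total': [1, 2, 3, 4, 5],
})

def correlation_with_target(frame, target='total'):
    """Return Pearson and Spearman correlations of each numeric column with `target`."""
    numeric = frame.select_dtypes(include='number')
    pearson = numeric.corr(method='pearson')[target].drop(target)
    spearman = numeric.corr(method='spearman')[target].drop(target)
    return pd.DataFrame({'pearson': pearson, 'spearman': spearman})

comparison = correlation_with_target(toy)
print(comparison)
```

In the toy data, `gdp` rises with `total` in every row, so its Spearman coefficient is 1 while the Pearson coefficient is lower because of the outlier. A large gap between the two on the real data would hint that a few countries drive the linear correlation.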

## GDP vs. Total Medals

Let's visualize the relationship between GDP and the total number of medals won by each country.

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='gdp', y='total', hue='region', size='population', sizes=(20, 200), alpha=0.7)
plt.title('GDP vs. Total Medals')
plt.xlabel('GDP (in billions)')
plt.ylabel('Total Medals')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

## Population vs. Total Medals

Another factor to consider is the population size of each country. Let's see if there's a relationship between population and medal counts.

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x='population', y='total', hue='region', size='gdp', sizes=(20, 200), alpha=0.7)
plt.title('Population vs. Total Medals')
plt.xlabel('Population (in millions)')
plt.ylabel('Total Medals')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
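
GDP and population figures often span several orders of magnitude, in which case most countries cluster near the origin of a linear-scale scatter plot. One common remedy, assuming that is the case here, is to plot log-transformed values. The sketch below adds `log_` columns to a hypothetical toy table; `add_log_columns` is an illustrative helper, and `np.log1p` is used so that zero values stay finite.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the wide value ranges typical of country-level figures.
toy = pd.DataFrame({
    'gdp': [0.5, 20.0, 400.0, 25000.0],
    'population': [0.1, 5.0, 60.0, 1400.0],
    'total': [0, 2, 10, 90],
})

def add_log_columns(frame, columns=('gdp', 'population')):
    """Return a copy of `frame` with a log1p-transformed column per input column."""
    out = frame.copy()
    for col in columns:
        out[f'log_{col}'] = np.log1p(out[col])
    return out

toy_log = add_log_columns(toy)
print(toy_log[['log_gdp', 'log_population', 'total']].round(2))
```

The transformed columns could then be passed to the plots above, for example with `x='log_gdp'` in `sns.scatterplot`, with the axis label updated to match.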

## Predictive Modeling

Given the data, we can attempt to predict the total number of medals a country might win based on its GDP and population. We'll use a simple linear regression model for this task.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Features and target variable
X = df[['gdp', 'population']]
y = df['total']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, r2
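
A single 80/20 split yields one MSE and one R-squared value, and on a dataset with one row per country those numbers can shift noticeably with the choice of `random_state`. A minimal sketch of k-fold cross-validation, which averages over several splits, is shown below. It runs on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical), so its scores say nothing about the Olympic dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data standing in for df[['gdp', 'population']] and df['total'].
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(40, 2))
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=40)

# Shuffle before splitting so fold membership does not depend on row order.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring='r2')

# scikit-learn reports error metrics negated so that higher is always better.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv,
                          scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-neg_mse)

print('R^2 per fold: ', np.round(r2_scores, 3))
print('RMSE per fold:', np.round(rmse_scores, 3))
```

RMSE is in the same unit as the target (medals), which tends to be easier to interpret than MSE. On the real data, `X` and `y` from the cell above could replace the synthetic arrays.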

## Discussion

Our analysis explores how a country's GDP and population relate to its Olympic success. The correlation heatmap suggests that GDP and population are both positively correlated with the total number of medals won, although the two correlations differ in strength. Because Pearson correlation is sensitive to extreme values, a handful of very large economies or populations can drive much of the apparent relationship.

The scatter plots show the same relationships country by country, with color separating regions and marker size encoding population in the first plot and GDP in the second.

The predictive model, while simple, offers a starting point for estimating a country's medal count from economic and demographic factors. Its mean squared error and R-squared value indicate room for improvement, and both come from a single 80/20 split, so they depend on which countries fall into the test set. Future analyses could explore more complex models or additional features to improve predictive accuracy.




## Credits

This notebook was created with the help of the [Devra AI data science assistant](https://devra.ai/ref/kaggle).