# Squid Game Effect
<a id="toc"></a>
Table of Contents
* [1.0 Introduction & Methodology](#section_01)
* [2.0 Data Sources & ETL Process](#section_02)
* [3.0 Part 1 The Mystery - Unexpected Global Interest](#section_03)
* [4.0 Part 2 The Correlation Evidence](#section_04)
* [5.0 Part 3 The Smoking Gun - Squid Game Experiments](#section_05)
* [6.0 Part 4 The Geographic Proof](#section_06)
* [7.0 Part 5 The Sustained Wave - Beyond Viral Moments](#section_07)
* [8.0 Limitations & Conclusions](#section_08)
* [9.0 Credits](#section_09)

--------------------------------

<a id="section_01"></a>
## 1.0 Introduction
For decades, language learning was predictable: English, Spanish and French were the languages of choice. However, in 2018, Duolingo's data detectives spotted an anomaly: interest in Korean culture and language was surging in countries with no historical ties to Korea.

Their hypothesis? The K-pop and K-drama effect. Duolingo began weaving famous K-drama lines into lessons. Its latest campaign, "Learn Korean or Else", was a partnership with Netflix in late 2024 tied to the release of "Squid Game Season 2".

This project follows the data trail to test whether this is merely corporate intuition built on a fad, or whether binge-watching K-dramas is indeed becoming the world's newest Korean classroom.

###### [↩️ Back to Table of Contents](#toc)

### 1.1 Get all the necessary libraries

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid') # sets a white background with grid lines 
import plotly.express as px

------------------------

<a id="section_02"></a>
## 2.0 Data Sources & ETL Process


### 2.1 Duolingo 2024 Language Report
* [Duolingo 2024 Report](https://docs.google.com/spreadsheets/d/1CndYC5ZovYfmPuMN9T9Jxfa4CQXOzZrfQ2kAUaWG1ZU/edit?ref=blog.duolingo.com&gid=532174835#gid=532174835)
* Data Source: [Google Docs](https://docs.google.com/spreadsheets/d/1CndYC5ZovYfmPuMN9T9Jxfa4CQXOzZrfQ2kAUaWG1ZU/edit?ref=blog.duolingo.com&gid=532174835#gid=532174835)

In [None]:
duolingo_filepath = '../data/raw/Duolingo Language Report [2020-2024]_ Public data.xlsx'
duolingo_df = pd.read_excel(
        duolingo_filepath, 
        sheet_name='Data by country', 
        skiprows=1)
# Rename the first column to "country"
duolingo_df = duolingo_df.rename(columns={duolingo_df.columns[0]: 'country'})


In [71]:
# Display the first few rows to understand the structure
print("DataFrame shape:", duolingo_df.shape)
print("\nFirst few rows:")
print(duolingo_df.head())

# Check column names
print("\nColumn names:")
print(duolingo_df.columns.tolist())


DataFrame shape: (193, 11)

First few rows:
       country pop1_2020 pop2_2020 pop1_2021 pop2_2021 pop1_2022 pop2_2022  \
0  Afghanistan   English   Spanish   English   Turkish    German   English   
1      Albania    German   English    German   English    German   English   
2      Algeria   English    French   English    French   English    French   
3      Andorra   English    French   English    French   English    French   
4       Angola   English    French   English    French   English    French   

  pop1_2023 pop2_2023 pop1_2024 pop2_2024  
0   English    German   English    German  
1    German   English   English    German  
2   English    French   English    French  
3   English    French   English   Spanish  
4   English    French   English    French  

Column names:
['country', 'pop1_2020', 'pop2_2020', 'pop1_2021', 'pop2_2021', 'pop1_2022', 'pop2_2022', 'pop1_2023', 'pop2_2023', 'pop1_2024', 'pop2_2024']
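The wide layout above (one `popN_YEAR` column per rank and year) is awkward to filter or group by year. One possible way to tidy it, sketched here on two rows copied from the preview and trimmed to two years, is to melt the frame into one row per country, year and rank. The names `wide`, `long_df` and `rank_year` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Two rows copied from the preview above, trimmed to 2020-2021 for brevity.
wide = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "pop1_2020": ["English", "German"],
    "pop2_2020": ["Spanish", "English"],
    "pop1_2021": ["English", "German"],
    "pop2_2021": ["Turkish", "English"],
})

# Melt to one row per (country, original column) pair.
long_df = wide.melt(id_vars="country", var_name="rank_year", value_name="language")

# Split names such as "pop1_2020" into an integer rank and an integer year.
long_df[["rank", "year"]] = (
    long_df["rank_year"].str.extract(r"pop(\d)_(\d{4})").astype(int)
)

long_df = (
    long_df.drop(columns="rank_year")
    .sort_values(["country", "year", "rank"])
    .reset_index(drop=True)
)

print(long_df)
```

In this long form, a filter such as `long_df[long_df["language"] == "Korean"]` would return the matching country, year and rank directly.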


In [72]:
# Get all columns that contain language data (pop1 and pop2 for each year)
year_columns = [col for col in duolingo_df.columns if col != 'country']

In [73]:
# Create an empty set to store all unique values
unique_languages = set()

# Loop through columns and add unique values
for col in year_columns:
        unique_in_col = set(duolingo_df[col].dropna().unique())
        unique_languages.update(unique_in_col)
        

print("All unique languages across all year columns:")
sorted(unique_languages)


All unique languages across all year columns:


['Arabic',
 'Chinese',
 'Danish',
 'English',
 'Finnish',
 'French',
 'German',
 'Guarani',
 'Hebrew',
 'Hindi',
 'Irish',
 'Italian',
 'Japanese',
 'Korean',
 'Norwegian',
 'Portuguese',
 'Russian',
 'Spanish',
 'Swahili',
 'Swedish',
 'Turkish']

#### EDA of Duolingo's Data

In [96]:
# Filter rows where any language column contains "Korean"
korean_countries_df = duolingo_df[duolingo_df[year_columns].apply(
    lambda x: x.str.contains('Korean').any(), axis=1)]

tot_duolingo_country = duolingo_df['country'].value_counts().sum()
tot_korean_country = korean_countries_df['country'].value_counts().sum()
print(f"* There are {tot_duolingo_country} countries in Duolingo's data")
print(f"* However only {tot_korean_country/tot_duolingo_country * 100:.2f}% of them or {tot_korean_country} countries have Korean as one of the top 2 most popular languages.")
print("* These countries with Korean are as follows:")
korean_countries_df['country'].value_counts()

* There are 193 countries in Duolingo's data
* However only 7.25% of them or 14 countries have Korean as one of the top 2 most popular languages.
* These countries with Korean are as follows:


country
Bangladesh     1
Bhutan         1
Brunei         1
Indonesia      1
Japan          1
Kiribati       1
Malaysia       1
Mongolia       1
Myanmar        1
Nepal          1
Pakistan       1
Philippines    1
South Korea    1
Thailand       1
Name: count, dtype: int64

<font color="#A8DADC">**EDA Conclusion**:
* The 14 countries with Korean among their top 2 languages to learn are mostly in Asia. Hence, the data does not give the impression that learning Korean is a "global" phenomenon.
* This may be because Duolingo's data shows only the top 2 most popular languages in each country, which ignores countries where Korean is the 3rd or 4th most popular language to learn.
* Google Trends regional data may show whether it is indeed a global phenomenon.</font>
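The filter above counts a country if Korean appears in any year, so it does not show whether the number of countries is growing. A minimal sketch of a per-year count follows, assuming the same `popN_YEAR` column layout. The frame `toy` uses placeholder country names and made-up values, not Duolingo data, and `count_language_by_year` is a hypothetical helper.

```python
import pandas as pd

# Toy frame with the same column layout as duolingo_df.
# Country names and values are placeholders, NOT Duolingo data.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "pop1_2023": ["English", "English", "Korean"],
    "pop2_2023": ["Spanish", "Korean", "English"],
    "pop1_2024": ["English", "English", "Korean"],
    "pop2_2024": ["Korean", "Korean", "English"],
})


def count_language_by_year(df, language):
    """Return {year: number of countries listing `language` in any pop*_year column}."""
    years = sorted({col.split("_")[1] for col in df.columns if col.startswith("pop")})
    counts = {}
    for year in years:
        cols = [col for col in df.columns if col.endswith(f"_{year}")]
        # A country counts once per year, whichever rank the language holds.
        counts[int(year)] = int(df[cols].eq(language).any(axis=1).sum())
    return counts


korean_by_year = count_language_by_year(toy, "Korean")
print(korean_by_year)
```

Run against `duolingo_df`, the same helper would give one count per year from 2020 to 2024.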

-----------------

### 2.2 Google Trends 5-year Timelines & Regional Data
* _Data Collected on 1 Nov. 2025_

### 2.3 Google Trends Timeline & Region Data For Squid Game 1 & Squid Game 2
* One-year window for each show, covering 9 months before its release and 3 months after
    * Squid Game 1 was released on Netflix on 17 September 2021
    * Squid Game 2 was released on Netflix on 26 December 2024
* _Data Collected on 1 Nov. 2025_
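The window described above can be derived from each release date. The sketch below uses a hypothetical helper, `release_window`, and computes exact-date windows. The Squid Game 2 file names used in this notebook suggest that export ran from 1 April 2024 to 31 March 2025, so the windows actually downloaded may have been aligned to month boundaries and need not match these dates exactly.

```python
import pandas as pd


def release_window(release_date, months_before=9, months_after=3):
    """Return (start, end) timestamps around a release date."""
    release = pd.Timestamp(release_date)
    start = release - pd.DateOffset(months=months_before)
    end = release + pd.DateOffset(months=months_after)
    return start, end


# Release dates as stated in the list above.
sg1_start, sg1_end = release_window("2021-09-17")
sg2_start, sg2_end = release_window("2024-12-26")

print("Squid Game 1 window:", sg1_start.date(), "to", sg1_end.date())
print("Squid Game 2 window:", sg2_start.date(), "to", sg2_end.date())
```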


In [99]:
# create the file paths dictionary
google_filepaths={
    'kdrama': {
        'timeline': '../data/raw/googletrends_01_multitimeline_kdrama_5years.csv',
        'geo': '../data/raw/googletrends_02_geomap_kdrama_5years.csv',
        'squidgame1_timeline': '../data/raw/googletrends_01_multitimeline_kdrama_squidgame01_2021.csv',
        'squidgame1_geo': '../data/raw/googletrends_02_geomap_kdrama_squidgame01_2021',
        'squidgame2_timeline': '../data/raw/googletrends_01_multitimeline_kdrama_squidgame02_20240401_20250331.csv',
        'squidgame2_geo': '../data/raw/googletrends_02_geomap_kdrama_squidgame02_20240401_20250331.csv',
    },
    'learn_korean': {
        'timeline': '../data/raw/googletrends_01_multitimeline_learnkorean_5years.csv',
        'geo': '../data/raw/googletrends_02_geomap_learnkorean_5years.csv',
        'squidgame1_timeline': '../data/raw/googletrends_01_multitimeline_learnkorean_squidgame01_2021.csv',
        'squidgame1_geo': '../data/raw/googletrends_02_geomap_learnkorean_squidgame01_2021',
        'squidgame2_timeline': '../data/raw/googletrends_01_multitimeline_learnkorean_squidgame02_20240401_20250331.csv',
        'squidgame2_geo': '../data/raw/googletrends_02_geomap_learnkorean_squidgame02_20240401_20250331.csv',
    }
}
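With this many paths, a quick existence check before loading can catch a mistyped file name early. The following is a sketch of such a check, not part of the original pipeline. `find_missing_files` and `example_paths` are hypothetical names, and the example path is deliberately one that should not exist.

```python
from pathlib import Path


def find_missing_files(paths):
    """Walk a nested dict of file paths and return those that are not existing files."""
    missing = []
    for value in paths.values():
        if isinstance(value, dict):
            # Recurse into nested groups such as 'kdrama' or 'learn_korean'.
            missing.extend(find_missing_files(value))
        elif not Path(value).is_file():
            missing.append(value)
    return missing


# Hypothetical stand-in with the same nesting as google_filepaths.
example_paths = {"kdrama": {"timeline": "does_not_exist_timeline.csv"}}
missing = find_missing_files(example_paths)
print(missing)
```

Calling `find_missing_files(google_filepaths)` would list any entry that does not resolve to a file, for instance the two `squidgame1_geo` paths if the files on disk carry a `.csv` extension that the dictionary entries lack.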

In [119]:
def load_timeline_data(filepath, search_term):
    """Load a Google Trends timeline CSV into a DataFrame indexed by week.

    Args:
        filepath (str): Path to the Google Trends timeline CSV export.
        search_term (str): Name to give the interest column.

    Returns:
        pd.DataFrame: Interest values with a sorted DatetimeIndex named 'Week'.
    """
    # Skip the two lines that precede the header row in the export
    df = pd.read_csv(filepath, skiprows=2)
    # Rename the second column (the interest values) to the search term
    df = df.rename(columns={df.columns[1]: search_term})
    df['Week'] = pd.to_datetime(df['Week'])
    # Set as index
    df = df.set_index('Week')

    is_sorted = df.index.is_monotonic_increasing
    print(f"📈 {search_term}'s Index DateTime Column sorted: {is_sorted}")
    if not is_sorted:
        print("   Sorting Index DateTime Column ...")
        df = df.sort_index()
        print("   ... Index DateTime Column Sorted!")
    return df

kdrama_timeline = load_timeline_data(google_filepaths['kdrama']['timeline'], "kdrama")
print(kdrama_timeline.dtypes)
print(kdrama_timeline.head())


📈 kdrama's Index DateTime Column sorted: True
kdrama    int64
dtype: object
            kdrama
Week              
2020-11-01       5
2020-11-08       5
2020-11-15       6
2020-11-22       7
2020-11-29       6
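The `kdrama` column loads as `int64` here, but Google Trends exports can contain the string `<1` for weeks with very low interest, which turns the column into text. A minimal sketch of one way to handle this follows. The CSV content is made up for illustration, `coerce_trends_values` is a hypothetical helper, and mapping `<1` to 0.5 is an arbitrary assumption (0 would be an equally defensible choice).

```python
import io

import pandas as pd

# Made-up export in the same layout as a Google Trends timeline CSV.
raw = io.StringIO(
    "Category: All categories\n"
    "\n"
    "Week,learn korean: (Worldwide)\n"
    "2020-11-01,<1\n"
    "2020-11-08,3\n"
)
df = pd.read_csv(raw, skiprows=2)


def coerce_trends_values(series, below_one=0.5):
    """Convert a Trends column to numbers, mapping '<1' to `below_one`."""
    cleaned = series.astype(str).str.strip().replace("<1", str(below_one))
    # Anything that still fails to parse becomes NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


values = coerce_trends_values(df.iloc[:, 1])
print(values.tolist())
```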


In [107]:
# Quick preview function
def preview_timeline_data(df, title):
    """Quick preview of timeline data"""
    if df is not None:
        print(f"\n📊 {title} Preview:")
        print(f"   Shape: {df.shape}")
        print(f"   Date range: {df.index.min()} to {df.index.max()}")
        print(f"   First 5 rows:")
        print(df.head())
        print("-" * 50)

preview_timeline_data(kdrama_timeline, "K-drama 5-year Timeline")


📊 K-drama 5-year Timeline Preview:
   Shape: (261, 2)
   Date range: 0 to 260
   First 5 rows:
         Week  kdrama
0  2020-11-01       5
1  2020-11-08       5
2  2020-11-15       6
3  2020-11-22       7
4  2020-11-29       6
--------------------------------------------------


### 2.4 Netflix Movies & TV Shows Dataset
* source: [HQ DATA PROFILER on Kaggle](https://www.kaggle.com/datasets/hqdataprofiler/cleaned-netflix-movies-and-tv-shows)

### 2.5 My Drama List
* source: [REDHATA on Kaggle](https://www.kaggle.com/datasets/redhata/korean-drama-dataset-2010-2025-2600-titles)

###### [↩️ Back to Table of Contents](#toc)

--------------------------------

<a id="section_03"></a>
## 3.0 Part 1 The Mystery - Unexpected Global Interest

###### [↩️ Back to Table of Contents](#toc)

--------------------------------

<a id="section_04"></a>
## 4.0 Part 2 The Correlation Evidence

###### [↩️ Back to Table of Contents](#toc)

--------------------------------

<a id="section_05"></a>
## 5.0 Part 3 The Smoking Gun - Squid Game Experiments

In [97]:
# Important dates for the analysis
SQUID_GAME_RELEASE = '2021-09-17'
analysis_periods = {
    'pre_squid_game': ('2019-01-01', '2021-09-16'),
    'post_squid_game': ('2021-09-17', '2024-01-01'),
    'full_period': ('2019-01-01', '2024-01-01')
}
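One simple way to use period boundaries like these is to slice a weekly interest series by each period and compare the means. The sketch below assumes a DatetimeIndex like the one produced by `load_timeline_data`. The series `interest` is synthetic, and the shortened `periods` dictionary is illustrative, chosen only so the toy data covers both sides of the release date.

```python
import pandas as pd

# Synthetic weekly series for illustration only; NOT Google Trends data.
idx = pd.date_range("2021-08-01", periods=12, freq="7D")
interest = pd.Series([10] * 7 + [20] * 5, index=idx, name="learn_korean")

# Shortened illustrative periods around the 2021-09-17 release date.
periods = {
    "pre_squid_game": ("2021-08-01", "2021-09-16"),
    "post_squid_game": ("2021-09-17", "2021-10-31"),
}

# Label-based slicing on a DatetimeIndex includes both endpoints.
period_means = {
    name: interest.loc[start:end].mean() for name, (start, end) in periods.items()
}
lift = period_means["post_squid_game"] / period_means["pre_squid_game"]

print(period_means)
print(f"Post/pre ratio: {lift:.2f}")
```

A higher post-release mean is consistent with a release effect but does not establish one on its own, since other events in the same window could also move search interest.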

###### [↩️ Back to Table of Contents](#toc)

--------------------------------

<a id="section_06"></a>
## 6.0 Part 4 The Geographic Proof

###### [↩️ Back to Table of Contents](#toc)

--------------------------------

<a id="section_07"></a>
## 7.0 Part 5 The Sustained Wave - Beyond Viral Moments

###### [↩️ Back to Table of Contents](#toc)

--------------------------------

<a id="section_08"></a>
## 8.0 Limitations & Conclusions

###### [↩️ Back to Table of Contents](#toc)

--------------------------------

<a id="section_09"></a>
## 9.0 Credits

###### [↩️ Back to Table of Contents](#toc)

--------------------------------