<a href="https://colab.research.google.com/github/Haikoo96/Kpop-Trend-Analysis/blob/master/scraping%20and%20preprocess/group_names_extract%26process_regex%26JSON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Objective**
1. Use the JSON files that map K-pop idol group names to their member lists.
2. Match the keywords extracted by NER (Named Entity Recognition) models against the group names in the JSON.
3. Extract group names directly from article titles using regex (regular expressions).
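
The code in this notebook reads the JSON as a nested mapping: a top-level key (a letter, or `#` for special-character names), then group names, then member lists. A minimal sketch of that assumed layout follows. The member entries are placeholders, not real data, and `group_names` is a hypothetical variable name.

```python
import string

# Hypothetical miniature of the letter-keyed file (placeholder members).
json_a = {letter: {} for letter in string.ascii_uppercase}
json_a["D"]["DAY6"] = ["member 1", "member 2"]
json_a["T"]["TREASURE"] = ["member 1", "member 2"]
json_a["T"]["THE BOYZ"] = ["member 1"]

# Hypothetical miniature of the special-character file.
json_b = {"#": {"(G)I-DLE": ["member 1", "member 2"]}}

# Flatten both mappings into a single list of group names.
group_names = list(json_b["#"])
for letter in string.ascii_uppercase:
    group_names.extend(json_a[letter])

print(group_names)
```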

## **What is TheFuzz library?**
- TheFuzz is a Python library for fuzzy string matching: it compares two strings and returns a similarity score from 0 (nothing in common) to 100 (identical).
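
As a rough illustration of these scores, the sketch below compares a few group names with `fuzz.ratio`, the scorer used later in this notebook. The `similarity` wrapper is a hypothetical helper. The `difflib` fallback is an assumption added so the snippet also runs without TheFuzz installed, and its scores may differ slightly.

```python
try:
    from thefuzz import fuzz

    def similarity(a: str, b: str) -> int:
        """Similarity score between 0 and 100 computed by TheFuzz."""
        return fuzz.ratio(a, b)
except ImportError:
    # Assumption: standard-library stand-in when thefuzz is unavailable.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> int:
        """Approximate stand-in; scores can differ slightly from TheFuzz."""
        return round(100 * SequenceMatcher(None, a, b).ratio())

print(similarity("treasure", "treasure"))  # identical strings score 100
print(similarity("the boyz", "the boys"))  # one differing character: high score
print(similarity("day6", "treasure"))      # unrelated names: low score
```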


In [1]:
!pip install thefuzz



In [2]:
import pandas as pd
import json
import os
import glob
import re
from thefuzz import fuzz
import numpy as np

In [3]:
# Collect the paths of all CSV files in the dataset directory
dir_path = 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw'
csv_paths = []  # stays empty if the directory is missing
if os.path.exists(dir_path):
  csv_paths = glob.glob(os.path.join(dir_path, '*.csv'))
csv_paths

['drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/kpop_news_trends.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/augmented_dataset.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/kpop_augmented_dataset.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/kpop_augmented_dataset_v2.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/kpop_augmented_dataset_v4.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/kpop_augmented_dataset_v6.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/kpop_modelcompare_dataset.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/temp.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/temp - temp.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/group_added.csv',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/entire_group.csv',
 'drive/MyDrive/Colab Note

In [4]:
# Load CSV
df = pd.read_csv(csv_paths[-5])
df.drop('Unnamed: 0', axis=1, inplace=True)
df

Unnamed: 0,title,category,author_name,dates,num comments,num views,PER,LOC,ORG,MISC
0,Ryu Jun Yeol back in Korea; confirms dating Ha...,News,Alec06,3/18/2024,5,3197,"Ryu Jun Yeol, Han So Hee","Korea, Hawaii",,
1,Ryu Jun Yeol and Han So Hee's unwelcomed start...,News,Alec06,3/18/2024,14,7556,"Ryu Jun Yeol, Han So Hee",Hawaii,,
2,DAY6's Dowoon talks post-military growth and n...,News,Alec06,3/18/2024,0,332,,,"DAY6, Dowoon",Fourever
3,"Park Shin Hye reflects on 'Dr. Slump' finale, ...",News,Alec06,3/18/2024,1,2018,Park Shin Hye,,,Dr. Slump
4,"Han Ye Seul to host 'SNL Korea' season 5, spar...",News,Alec06,3/18/2024,2,1006,Han Ye Seul,,SN,##L Korea
...,...,...,...,...,...,...,...,...,...,...
1875,THEBLACKLABEL responds to rumors that their ne...,Rumors,Sophie-Ha,02/06/2024,41,38809,,,Shinsegae Chaebol Family,
1876,The Most Infuriating Villains in K-Dramas that...,Original Content,ean1994,02/06/2024,7,6224,,,,"Most Inf, ##ting Villains in K - Dramas that y..."
1877,The preliminary audition for selecting new FIF...,News,Sophie-Ha,02/06/2024,21,9237,,Singapore,FIFTY FIFTY,
1878,TREASURE and their remarkable stage presence,Original Content,Rika-go,02/06/2024,9,8249,,,TREASURE,


In [5]:
# Load JSON file paths
json_paths = glob.glob(os.path.join(dir_path, '*.json'))
json_paths

['drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/normal_char_idol.json',
 'drive/MyDrive/Colab Notebooks/kpop trend analysis/dataset_raw/special_char_idol.json']

In [6]:
# Load the two JSON files
with open(json_paths[0], 'r') as file: # groups keyed by first letter (A-Z)
  json_a = json.load(file)

with open(json_paths[1], 'r') as file_: # groups whose names start with a special character
  json_b = json.load(file_)

In [7]:
# Fix json_b: the list stored under the placeholder key '#' is re-keyed
# as '(G)I-DLE' (skipping its first element), then the placeholder is removed
json_b['#']['(G)I-DLE'] = json_b['#']['#'][1:]
del json_b['#']['#']

In [8]:
# Collect every idol group name from both JSON files into one list
import string

group_idol_lst = list(json_b['#'].keys())

for letter in string.ascii_uppercase:
  group_idol_lst.extend(json_a[letter].keys())

## **Group Name Assignment with Fuzzy Matching**

- Iterates through a list of NER entity columns (PER, ORG, MISC).

- For each entity column:
  - Assigns None to the group name column for rows with missing values and moves on.
  - Splits each remaining row's comma-separated entities into words and:
    - Computes the case-insensitive fuzzy similarity between each word and every group name in the combined group list.
    - Assigns the first group whose score exceeds the threshold (85 out of 100 in this case) to the DataFrame's group name column and stops searching that row.
  - Assigns None to the group name column if none of the row's words match.
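
The per-row logic above can be sketched as a standalone helper. `match_group`, `similarity`, and the short `groups` list are hypothetical names and sample data for illustration. The `difflib` fallback is an assumption so the snippet runs without TheFuzz, and its scores may differ slightly.

```python
import math

try:
    from thefuzz import fuzz

    def similarity(a, b):
        return fuzz.ratio(a, b)
except ImportError:
    # Assumption: standard-library stand-in when thefuzz is unavailable.
    from difflib import SequenceMatcher

    def similarity(a, b):
        return round(100 * SequenceMatcher(None, a, b).ratio())


def match_group(entities, groups, threshold=85):
    """Return the first group scoring above `threshold` for any entity, else None."""
    # Treat None and NaN (how pandas represents empty CSV cells) as missing.
    if entities is None or (isinstance(entities, float) and math.isnan(entities)):
        return None
    for word in str(entities).split(", "):
        for group in groups:
            if similarity(word.lower(), group.lower()) > threshold:
                return group
    return None


# Small stand-in for the full list of group names.
groups = ["DAY6", "THE BOYZ", "TREASURE", "FIFTY FIFTY", "BABYMONSTER"]

print(match_group("DAY6, Dowoon", groups))
print(match_group("BABYMONST", groups))  # truncated NER output can still match
print(match_group(float("nan"), groups))
```

Returning on the first hit mirrors the early exit in the loop: a row gets at most one group per entity column, so the order of the entities and of the group list decides ties.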

In [9]:
ent_lst = ['PER', 'ORG', 'MISC']

for ent in ent_lst:
    for i, words in enumerate(df[ent]):
        # Empty cells are read from the CSV as NaN, so test with pd.isna
        if pd.isna(words):
            df.at[i, f'group_{ent}'] = None
            continue

        split_words = str(words).split(', ')
        found_match = False  # Reset for each row

        for word in split_words:
            if found_match:
                break  # Stop at the first word that matched a group
            for group in group_idol_lst:
                sim_num = fuzz.ratio(word.lower(), group.lower())
                if sim_num > 85:
                    df.at[i, f'group_{ent}'] = group  # Keep the first group above the threshold
                    found_match = True
                    break  # Break out of the group_idol_lst loop

        if not found_match:
            df.at[i, f'group_{ent}'] = None  # No word in this row matched any group


In [10]:
df[['group_PER', 'group_ORG', 'group_MISC']]

Unnamed: 0,group_PER,group_ORG,group_MISC
0,,,
1,,,
2,,DAY6,
3,,,
4,,,
...,...,...,...
1875,,,
1876,,,
1877,,FIFTY FIFTY,
1878,,TREASURE,


## **Extract & Validate Idol Group Names (Regex & JSON)**
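- Function: Extracts idol group names from titles using regex and validates them against the JSON data. Returns the matched name (or None) and the row index.

- Process: Tries each pattern in turn, using a pattern only if it matches at the start of the title. All of that pattern's matches in the title are joined into one string, and the first character of that string is looked up as a top-level key in the JSON. If the key exists, the function returns the first group name under that key that appears in the joined string, along with the index. If the key does not exist, it prints a message and returns None and the index.

- Usage: Iterates through the titles, calls the function, and assigns the matched name (if any) to a new column.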
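
To make the two patterns concrete, the sketch below shows which substrings they pull out of a few titles shortened from this dataset. `candidates` is a hypothetical helper that reproduces only the regex step, not the JSON lookup. Inside the character class of the first pattern, `!-(` is a range (the characters from `!` to `(`) rather than three separate literals.

```python
import re

PATTERN_A = r"[A-Z]+\s[A-Za-z!-()]+"  # uppercase word, whitespace, then a second word
PATTERN_B = r"[A-Z0-9\s]{3,}"         # run of 3+ uppercase letters, digits, or whitespace


def candidates(title):
    """Return all matches of the first pattern that matches at the start of the title."""
    for pattern in (PATTERN_A, PATTERN_B):
        if re.match(pattern, title):
            return re.findall(pattern, title)
    return []


titles = [
    "TREASURE and their remarkable stage presence",
    "DAY6's Dowoon talks post-military growth",
    "The Boyz drops new album",
]
for title in titles:
    print(title, "->", candidates(title))
```

Because `re.match` anchors at the start of the string, a title such as "The Boyz drops new album" yields no candidates: neither pattern matches a mixed-case first word.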

In [11]:
def process_title(title, json_a, idx):
    # Uppercase word, whitespace, then a second word. Note that `!-(` inside the
    # character class is a range (characters from '!' to '('), not three literals.
    pattern_a = r"[A-Z]+\s[A-Za-z!-()]+"
    # Run of three or more uppercase letters, digits, or whitespace characters
    pattern_b = r"[A-Z0-9\s]{3,}"
    patterns = [pattern_a, pattern_b]

    for pattern in patterns:
        # Only use a pattern if it matches at the very start of the title
        if re.match(pattern, title):
            matches = re.findall(pattern, title)
            string_ver = ', '.join(matches)
            # The first character of the joined matches selects the letter key in json_a
            if string_ver[0] in json_a:
                idol_names = json_a[string_ver[0]].keys()

                for name in idol_names:
                    if name in string_ver:
                        return name, idx
            else:
                print(f'At {idx} No corresponding key in json_a for: {string_ver}')
                return None, idx

    # No pattern produced a known group name
    return None, idx

# Apply to every title and store the matched group name (None if no match)
for idx, title in enumerate(df['title']):
    name, index = process_title(title, json_a, idx)
    df.at[index, 'group_title_a'] = name

At 234 No corresponding key in json_a for: 5 B
At 404 No corresponding key in json_a for: 2008
At 464 No corresponding key in json_a for: 7 K
At 469 No corresponding key in json_a for: 5 M
At 648 No corresponding key in json_a for: 8 TWICE S
At 794 No corresponding key in json_a for: 2AM
At 806 No corresponding key in json_a for: 15 I
At 859 No corresponding key in json_a for: 7 K
At 1049 No corresponding key in json_a for: 10 K
At 1089 No corresponding key in json_a for: 7 S
At 1134 No corresponding key in json_a for: 5 S
At 1246 No corresponding key in json_a for: 10 R
At 1250 No corresponding key in json_a for: 8 P
At 1289 No corresponding key in json_a for: 2PM
At 1308 No corresponding key in json_a for: 7 M
At 1462 No corresponding key in json_a for: 8 B,  BTS,  RM 
At 1571 No corresponding key in json_a for: 6 K
At 1632 No corresponding key in json_a for: 10 L
At 1695 No corresponding key in json_a for: 2PM
At 1732 No corresponding key in json_a for: 6 S


In [13]:
df.head(10)

Unnamed: 0,title,category,author_name,dates,num comments,num views,PER,LOC,ORG,MISC,group_PER,group_ORG,group_MISC,group_title_a
0,Ryu Jun Yeol back in Korea; confirms dating Ha...,News,Alec06,3/18/2024,5,3197,"Ryu Jun Yeol, Han So Hee","Korea, Hawaii",,,,,,
1,Ryu Jun Yeol and Han So Hee's unwelcomed start...,News,Alec06,3/18/2024,14,7556,"Ryu Jun Yeol, Han So Hee",Hawaii,,,,,,
2,DAY6's Dowoon talks post-military growth and n...,News,Alec06,3/18/2024,0,332,,,"DAY6, Dowoon",Fourever,,DAY6,,DAY6
3,"Park Shin Hye reflects on 'Dr. Slump' finale, ...",News,Alec06,3/18/2024,1,2018,Park Shin Hye,,,Dr. Slump,,,,
4,"Han Ye Seul to host 'SNL Korea' season 5, spar...",News,Alec06,3/18/2024,2,1006,Han Ye Seul,,SN,##L Korea,,,,
5,The Boyz drops new album 'PHANTASY' Pt.3 and p...,News,Alec06,3/18/2024,0,571,,,The Boyz,PHANTASY,,THE BOYZ,,
6,"Nam Bo Ra expresses love for siblings, marks 1...",News,Alec06,3/18/2024,2,989,Nam Bo Ra,,,,,,,
7,Candy Shop unveils mini album 'Hashtag#' track...,News,Alec06,3/18/2024,1,684,,New York,Candy Shop,Hashtag,,Candy Shop,HashTag,
8,"BABYMONSTER ready for comeback, set for summer...",News,Alec06,3/18/2024,7,2415,,,BABYMONST,,,BABYMONSTER,,BABYMONSTER
9,Jun Ji Hyun and Son Heung Min grace Harper's B...,News,Alec06,3/18/2024,0,2783,"Jun Ji Hyun, Son Heung Min",,,Harper ' s Bazaar Korea,,,,


In [None]:
df.to_csv(os.path.join(dir_path, 'test_b.csv'))