# Reporter Preprocessing 3

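### CSV files
- bylines.csv: original reporter bylines
- names.csv: three-character reporter names
  - Manually normalized; organization names removed
- names_newline.csv: reporter names that contain a newline character
  - Manually normalized
- names_others.csv
  - `OOO 기자` ("OOO, reporter") pattern normalized
  - `OOO` manually normalized
- names_others1.csv
  - Remaining names manually normalized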
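The notebook does not show how the `OOO 기자` pattern was normalized for names_others.csv. As a rough sketch only, one way to pull a name out of such a byline is a regular expression that captures the Hangul syllables directly before `기자`. The pattern, the 2 to 4 syllable limit, and the helper name `extract_reporter_name` are assumptions made for illustration, not the method actually used.

```python
import re
from typing import Optional

# Assumed pattern: 2-4 Hangul syllables immediately before the title "기자".
REPORTER_PATTERN = re.compile(r"([가-힣]{2,4})\s*기자")


def extract_reporter_name(byline: str) -> Optional[str]:
    """Return the Hangul name preceding '기자', or None when nothing matches."""
    match = REPORTER_PATTERN.search(byline)
    return match.group(1) if match else None


# Sample bylines taken from the names_others.csv preview in this notebook.
samples = [
    "/ 이준헌 기자",
    "/인천=장현일 기자 hichang@sedaily.com",
    "2006022;2021005 기자",
]
extracted = [extract_reporter_name(s) for s in samples]
print(extracted)
```

Bylines that return `None`, such as the numeric ID row, would still need manual handling.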

In [1]:
from pathlib import Path
import os

os.chdir(Path(os.getcwd()).parent)

In [2]:
from os.path import join

import pandas as pd
import numpy as np

In [3]:
root_path = os.getcwd()
data_path = join(root_path, "data", "preprocessing")
byline_path = join(data_path, "bylines")  # directory with the per-step mapping CSVs
reporter_path = join(data_path, "bylines.csv")  # master file, read here and overwritten in step 6

## 1. bylines → reporters
- Manual copy

In [4]:
reporter_df = pd.read_csv(reporter_path)

In [5]:
reporter_df.head()

Unnamed: 0,이름
0,(中）/[번역]강지혜\n[번역]강지혜
1,(강원)강대웅·위준휘\n위준휘
2,(경주) 최주호\n최주호
3,(과천) 박재천\n박재천
4,(광주)박승호\n박승호


In [6]:
reporter_df.shape

(4554, 1)

In [7]:
reporter_df["정규화"] = np.nan

In [8]:
reporter_df = reporter_df.astype({"정규화": "object"})
reporter_df.dtypes

이름     object
정규화    object
dtype: object

## 2. names → reporters

In [9]:
names_df = pd.read_csv(join(byline_path, "names.csv"))

In [10]:
names_df.head()

Unnamed: 0,이름,정규화
0,FTV,
1,KBS,
2,KNN,
3,TBC,
4,UBC,


In [11]:
for _, row in names_df.iterrows():
    name, new_name = row["이름"], row["정규화"]
    # Skip rows with no normalized value; an identity check against np.nan is unreliable here.
    if pd.notna(new_name):
        reporter_df.loc[reporter_df["이름"] == name, "정규화"] = new_name
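The row-by-row loop above can also be written as a vectorized lookup. The sketch below is an alternative, not the notebook's own approach: the helper name `apply_mapping` and the demo frames are hypothetical, and it assumes both frames have the `이름` and `정규화` columns seen in this notebook. A mapped value overwrites the existing one, and rows with no mapping keep what they had, which mirrors the loop's behavior.

```python
import pandas as pd


def apply_mapping(target: pd.DataFrame, mapping_df: pd.DataFrame) -> pd.DataFrame:
    """Copy non-null 정규화 values from mapping_df into target, matched on 이름."""
    lookup = (
        mapping_df.dropna(subset=["정규화"])           # ignore rows without a normalized name
        .drop_duplicates(subset="이름", keep="last")    # last entry wins, as in the loop
        .set_index("이름")["정규화"]
    )
    mapped = target["이름"].map(lookup)
    result = target.copy()
    # Prefer the mapped value; fall back to whatever was already there.
    result["정규화"] = mapped.combine_first(target["정규화"])
    return result


# Hypothetical miniature data for illustration only.
demo_reporters = pd.DataFrame(
    {"이름": ["KBS", "(경주) 최주호", "G1 박성준"], "정규화": [None, None, "박성준"]},
    dtype="object",
)
demo_names = pd.DataFrame({"이름": ["KBS", "(경주) 최주호"], "정규화": [None, "최주호"]})
demo_result = apply_mapping(demo_reporters, demo_names)
print(demo_result)
```

With only a few thousand rows the loop is fast enough, so this is mainly a readability choice.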

## 3. names_newline → reporters

In [12]:
names_newline_df = pd.read_csv(join(byline_path, "names_newline.csv"))

In [13]:
names_newline_df.head()

Unnamed: 0,이름,정규화
0,(中）/[번역]강지혜\n[번역]강지혜,강지혜
1,(강원)강대웅·위준휘\n위준휘,강대웅||위준휘
2,(경주) 최주호\n최주호,최주호
3,(과천) 박재천\n박재천,박재천
4,(광주)박승호\n박승호,박승호


In [14]:
for _, row in names_newline_df.iterrows():
    name, new_name = row["이름"], row["정규화"]
    # Skip rows with no normalized value.
    if pd.notna(new_name):
        reporter_df.loc[reporter_df["이름"] == name, "정규화"] = new_name

## 4. names_others → reporters

In [15]:
others_df = pd.read_csv(join(byline_path, "names_others.csv"))

In [16]:
others_df.head()

Unnamed: 0,이름,정규화
0,.,
1,/ 이준헌 기자,
2,/인천=장현일 기자 hichang@sedaily.com,
3,2006022;2021005 기자,
4,2016004;2020021 기자,


In [17]:
for _, row in others_df.iterrows():
    name, new_name = row["이름"], row["정규화"]
    # Skip rows with no normalized value.
    if pd.notna(new_name):
        reporter_df.loc[reporter_df["이름"] == name, "정규화"] = new_name

## 5. names_others1 → reporters

In [18]:
others1_df = pd.read_csv(join(byline_path, "names_others1.csv"))

In [19]:
others1_df.head()

Unnamed: 0,이름,정규화
0,.,
1,/ 이준헌 기자,이준헌
2,/인천=장현일 기자 hichang@sedaily.com,장현일
3,G1 박성준,박성준
4,Hoàng Phương Ly,Hoàng Phương Ly


In [20]:
for _, row in others1_df.iterrows():
    name, new_name = row["이름"], row["정규화"]
    # Skip rows with no normalized value.
    if pd.notna(new_name):
        reporter_df.loc[reporter_df["이름"] == name, "정규화"] = new_name
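Once all four mapping files have been applied, it may be worth checking how many bylines still have no normalized value before saving. The sketch below is one possible check; the helper name `normalization_coverage` and the demo frame are hypothetical, and the column names are assumed to match `reporter_df`.

```python
import pandas as pd


def normalization_coverage(df: pd.DataFrame, source_col: str = "이름", target_col: str = "정규화") -> dict:
    """Summarize how many rows have a normalized name and list the ones that do not."""
    missing_mask = df[target_col].isna()
    return {
        "total": len(df),
        "normalized": int((~missing_mask).sum()),
        "missing": int(missing_mask.sum()),
        "missing_names": df.loc[missing_mask, source_col].tolist(),
    }


# Hypothetical miniature data for illustration only.
demo_df = pd.DataFrame(
    {"이름": ["KBS", "/ 이준헌 기자", "G1 박성준"], "정규화": [None, "이준헌", "박성준"]}
)
report = normalization_coverage(demo_df)
print(report)
```

Some rows, such as organization names, are presumably left empty on purpose, so a non-zero `missing` count is not necessarily an error.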

## 6. Save

In [21]:
reporter_df.shape

(4554, 2)

In [22]:
reporter_df.head()

Unnamed: 0,이름,정규화
0,(中）/[번역]강지혜\n[번역]강지혜,강지혜
1,(강원)강대웅·위준휘\n위준휘,강대웅||위준휘
2,(경주) 최주호\n최주호,최주호
3,(과천) 박재천\n박재천,박재천
4,(광주)박승호\n박승호,박승호
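In the preview above, a byline with two reporters is stored as `강대웅||위준휘`, which suggests `||` is used as a separator for multiple names. Assuming that reading is correct, downstream code could expand such rows into one row per reporter roughly as follows. The helper name `explode_reporters` is hypothetical.

```python
import pandas as pd


def explode_reporters(df: pd.DataFrame, column: str = "정규화", sep: str = "||") -> pd.DataFrame:
    """Split multi-reporter values on sep and return one row per reporter."""
    out = df.copy()
    # Plain str.split avoids any regex interpretation of the '|' characters.
    out[column] = out[column].map(lambda v: v.split(sep) if isinstance(v, str) else v)
    return out.explode(column).reset_index(drop=True)


# Rows taken from the preview above.
demo_df = pd.DataFrame(
    {
        "이름": ["(강원)강대웅·위준휘\n위준휘", "(경주) 최주호\n최주호"],
        "정규화": ["강대웅||위준휘", "최주호"],
    }
)
exploded = explode_reporters(demo_df)
print(exploded["정규화"].tolist())
```

Missing values pass through unchanged, so rows without a normalized name survive as a single row.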


In [23]:
reporter_df.to_csv(reporter_path, index=False, encoding="utf-8-sig")
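The `utf-8-sig` encoding writes a byte order mark at the start of the file, which typically helps spreadsheet tools such as Excel detect UTF-8 and display Hangul correctly. The sketch below is a small round-trip check on hypothetical demo data in a temporary directory, so it does not touch the real `bylines.csv`. It shows that the mark is written and that names containing a newline survive a save and reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature data, including a name with an embedded newline.
demo_df = pd.DataFrame(
    {
        "이름": ["(경주) 최주호\n최주호", "KBS"],
        "정규화": ["최주호", None],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "bylines_demo.csv"
    demo_df.to_csv(out_path, index=False, encoding="utf-8-sig")
    raw_bytes = out_path.read_bytes()
    restored_df = pd.read_csv(out_path, encoding="utf-8-sig")

has_bom = raw_bytes.startswith(b"\xef\xbb\xbf")  # UTF-8 byte order mark
same_names = restored_df["이름"].tolist() == demo_df["이름"].tolist()
print(has_bom, same_names)
```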