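### Data Visualization ( https://flourish.studio/examples/ )
- Data visualization is the process of presenting analysis results visually so that they are easy to understand and communicate
- Whether the task is exploratory data analysis, data processing, or prediction, visualization is essential for making the results readable
- Of the many visualization techniques available, this notebook walks through one of the most recent and interesting ones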

In [1]:
# Import required libraries
import os
import pandas as pd

In [2]:
filePath = "D:/myAnalyze/PANDASPLOTLY_FUNCODING_FULLDATA_20240601/00_Material(Uploaded)/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"
doc_1 = pd.read_csv(filePath + "04-01-2020.csv", encoding="utf-8-sig")
doc_1.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
0,45001.0,Abbeville,South Carolina,US,2020-04-01 21:58:49,34.223334,-82.461707,4,0,0,4,"Abbeville, South Carolina, US"
1,22001.0,Acadia,Louisiana,US,2020-04-01 21:58:49,30.295065,-92.414197,47,1,0,46,"Acadia, Louisiana, US"
2,51001.0,Accomack,Virginia,US,2020-04-01 21:58:49,37.767072,-75.632346,7,0,0,7,"Accomack, Virginia, US"
3,16001.0,Ada,Idaho,US,2020-04-01 21:58:49,43.452658,-116.241552,195,3,0,192,"Ada, Idaho, US"
4,19001.0,Adair,Iowa,US,2020-04-01 21:58:49,41.330756,-94.471059,1,0,0,1,"Adair, Iowa, US"


In [3]:
doc_2 = pd.read_csv(filePath + "03-01-2020.csv", encoding="utf-8-sig")
doc_2.head()

Unnamed: 0,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
0,Hubei,Mainland China,2020-03-01T10:13:19,66907,2761,31536,30.9756,112.2707
1,,South Korea,2020-03-01T23:43:03,3736,17,30,36.0,128.0
2,,Italy,2020-03-01T23:23:02,1694,34,83,43.0,12.0
3,Guangdong,Mainland China,2020-03-01T14:13:18,1349,7,1016,23.3417,113.4244
4,Henan,Mainland China,2020-03-01T14:13:18,1272,22,1198,33.882,113.614
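
The two previews above use different header styles. A minimal sketch of how the difference could be checked programmatically follows; the empty frames `doc_new` and `doc_old` are hypothetical stand-ins that only carry the column names copied from the outputs above, not the real data.

```python
import pandas as pd

# Stand-ins for the 04-01-2020 and 03-01-2020 reports (column names only).
doc_new = pd.DataFrame(columns=["FIPS", "Admin2", "Province_State", "Country_Region",
                                "Last_Update", "Lat", "Long_", "Confirmed", "Deaths",
                                "Recovered", "Active", "Combined_Key"])
doc_old = pd.DataFrame(columns=["Province/State", "Country/Region", "Last Update",
                                "Confirmed", "Deaths", "Recovered", "Latitude", "Longitude"])

# Set arithmetic on the column names shows what each layout has that the other lacks.
only_in_new = sorted(set(doc_new.columns) - set(doc_old.columns))
only_in_old = sorted(set(doc_old.columns) - set(doc_new.columns))
shared = sorted(set(doc_new.columns) & set(doc_old.columns))

print("only in 04-01:", only_in_new)
print("only in 03-01:", only_in_old)
print("shared:", shared)
```

Only `Confirmed`, `Deaths`, and `Recovered` are spelled identically in both layouts, which is why the country column needs special handling.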


#### (1) The field names differ between files (Country_Region vs. Country/Region, Province_State vs. Province/State)
#### (2) Some rows have an empty Confirmed value; these are simply dropped with dropna(subset=["Confirmed"])
#### (3) Confirmed may be stored as object or float, so it is cast to int64
#### (4) To fetch a flag per country code, a field is created that holds a URL of the form https://flagpedia.net/data/flags/w580/countryCode.png
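
Points (1) to (3) can be sketched as one small helper. This is only an illustration under stated assumptions: `COLUMN_ALIASES` and `normalize_report` are hypothetical names, and the sample frame (including its missing value) is made up.

```python
import pandas as pd

# Hypothetical mapping from the older header style to the newer one.
COLUMN_ALIASES = {
    "Country/Region": "Country_Region",
    "Province/State": "Province_State",
}

def normalize_report(df):
    """Return Country_Region and Confirmed with unified names and an int64 count."""
    # rename() ignores aliases that are absent, so both header styles are accepted.
    out = df.rename(columns=COLUMN_ALIASES)[["Country_Region", "Confirmed"]]
    # Drop rows without a count, then cast the remaining values to int64.
    out = out.dropna(subset=["Confirmed"]).copy()
    out["Confirmed"] = out["Confirmed"].astype("int64")
    return out

# Made-up sample in the older header style, with one missing count.
old_style = pd.DataFrame({
    "Country/Region": ["Mainland China", "South Korea", "Italy"],
    "Confirmed": [66907.0, None, 1694.0],
})
clean = normalize_report(old_style)
print(clean)
print(clean.dtypes)
```

Renaming up front avoids a `try`/`except KeyError` branch for each header style, at the cost of keeping the alias table up to date.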




#### TIP 1. lambda functions

`lambda` is a Python feature for creating anonymous (unnamed) functions. Here it is used to convert a file name into a `datetime` object.

```python
lambda x: datetime.strptime(x, '%m-%d-%Y.csv')
```
- `lambda` functions are used for short function definitions.
- Unlike a regular `def`, a `lambda` is written as a single expression on one line, introduced by the `lambda` keyword.
- Example:
  ```python
  def add(a, b):
      return a + b

  # The function above can be written as the following lambda.
  add = lambda a, b: a + b
  ```
- Here, the lambda takes `x` as input and returns `datetime.strptime(x, '%m-%d-%Y.csv')`.
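
A minimal sketch of why the lambda matters as a sort key; the file names below are hypothetical but follow the same `MM-DD-YYYY.csv` pattern as the daily reports.

```python
from datetime import datetime

# Hypothetical file names spanning two years.
file_names = ["02-01-2020.csv", "01-22-2021.csv", "12-31-2020.csv", "01-22-2020.csv"]

# Plain string sorting compares the month first, so the years get interleaved.
by_text = sorted(file_names)

# Sorting with a lambda key compares the parsed dates instead.
by_date = sorted(file_names, key=lambda x: datetime.strptime(x, "%m-%d-%Y.csv"))

print(by_text)
print(by_date)
```

With text sorting, `01-22-2021.csv` lands right after `01-22-2020.csv`; with the date key it moves to the end, after `12-31-2020.csv`.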

#### TIP 2. The datetime.strptime function

- `datetime.strptime` converts a string into a `datetime` object.
- The first argument is the string to convert; the second is a format string that describes its layout.
- Main format codes:
  - `%m`: month (01 to 12)
  - `%d`: day (01 to 31)
  - `%Y`: four-digit year (e.g., 2021)
- Example:
  ```python
  from datetime import datetime

  date_str = '01-01-2021.csv'
  date_obj = datetime.strptime(date_str, '%m-%d-%Y.csv')  # datetime(2021, 1, 1, 0, 0)
  ```

In [4]:
import os
import pandas as pd
from datetime import datetime

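# Folder that contains the daily-report CSV files
filePath = "D:/myAnalyze/PANDASPLOTLY_FUNCODING_FULLDATA_20240601/00_Material(Uploaded)/COVID-19-master/csse_covid_19_data/csse_covid_19_daily_reports/"

# Collect the CSV files in the folder; filtering by extension skips any non-CSV entries
# without relying on their position in the listing
dataFolder = [f for f in os.listdir(filePath) if f.endswith(".csv")]

# Sort by the date encoded in each file name (with ".csv" stripped)
dataFolder.sort(key=lambda x: datetime.strptime(x.replace(".csv", ""), "%m-%d-%Y"))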

# Accumulator for the merged result
raw_data = None

# Process each file in date order
for v in dataFolder:
    print(f"Processing file: {v}")

    # Read the CSV file
    csv_file = pd.read_csv(filePath + v, encoding="utf-8-sig")

    # Keep only the needed columns (country and confirmed count);
    # older files use "Country/Region" instead of "Country_Region"
    try:
        csv_file = csv_file[["Country_Region", "Confirmed"]]
    except KeyError:
        csv_file = csv_file[["Country/Region", "Confirmed"]]
        csv_file.columns = ["Country_Region", "Confirmed"]

    # Drop rows with a missing count, then cast the column to int64
    csv_file = csv_file.dropna(subset=["Confirmed"])
    csv_file["Confirmed"] = csv_file["Confirmed"].astype("int64")

    # Derive the date column name from the file name, e.g. "01/22/2020"
    date_column = v.replace(".csv", "").replace("-", "/")
    csv_file.columns = ["Country_Region", date_column]

    # Sum the counts per country
    csv_file = csv_file.groupby("Country_Region").sum()

    # Keep the first file as-is; outer-merge every later file onto it
    # (files are processed in ascending date order, so the columns stay in date order)
    if raw_data is None:
        raw_data = csv_file
    else:
        raw_data = raw_data.merge(csv_file, on="Country_Region", how="outer")

Processing file: 01-22-2020.csv
Processing file: 01-23-2020.csv
Processing file: 01-24-2020.csv
Processing file: 01-25-2020.csv
Processing file: 01-26-2020.csv
Processing file: 01-27-2020.csv
Processing file: 01-28-2020.csv
Processing file: 01-29-2020.csv
Processing file: 01-30-2020.csv
Processing file: 01-31-2020.csv
Processing file: 02-01-2020.csv
Processing file: 02-02-2020.csv
Processing file: 02-03-2020.csv
Processing file: 02-04-2020.csv
Processing file: 02-05-2020.csv
Processing file: 02-06-2020.csv
Processing file: 02-07-2020.csv
Processing file: 02-08-2020.csv
Processing file: 02-09-2020.csv
Processing file: 02-10-2020.csv
Processing file: 02-11-2020.csv
Processing file: 02-12-2020.csv
Processing file: 02-13-2020.csv
Processing file: 02-14-2020.csv
Processing file: 02-15-2020.csv
Processing file: 02-16-2020.csv
Processing file: 02-17-2020.csv
Processing file: 02-18-2020.csv
Processing file: 02-19-2020.csv
Processing file: 02-20-2020.csv
Processing file: 02-21-2020.csv
...
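
The loop above merges one file at a time. As an alternative sketch (not the notebook's method), the per-country totals can be collected as one Series per date and aligned in a single `pd.concat` call. The `daily` dictionary and its values are made up for illustration.

```python
import pandas as pd

# Made-up per-day frames standing in for the cleaned daily reports.
daily = {
    "01/22/2020": pd.DataFrame({"Country_Region": ["A", "A", "B"], "Confirmed": [1, 2, 5]}),
    "01/23/2020": pd.DataFrame({"Country_Region": ["A", "C"], "Confirmed": [4, 7]}),
}

# One Series of per-country totals per date, named after the date.
columns = [
    frame.groupby("Country_Region")["Confirmed"].sum().rename(date)
    for date, frame in daily.items()
]

# Align all Series on the country index at once (outer join by default).
wide = pd.concat(columns, axis=1)
print(wide)
```

Countries missing from a given day show up as NaN, just as with the outer merge, but the table is assembled in one step instead of growing by one merge per file.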

In [5]:
raw_data

Country_Region,01/22/2020,01/23/2020,01/24/2020,01/25/2020,01/26/2020,01/27/2020,01/28/2020,01/29/2020,01/30/2020,01/31/2020,...,02/28/2023,03/01/2023,03/02/2023,03/03/2023,03/04/2023,03/05/2023,03/06/2023,03/07/2023,03/08/2023,03/09/2023
Azerbaijan,,,,,,,,,,,...,,,,,,,,,,
Afghanistan,,,,,,,,,,,...,209322.0,209340.0,209358.0,209362.0,209369.0,209390.0,209406.0,209436.0,209451.0,209451.0
Albania,,,,,,,,,,,...,334391.0,334408.0,334408.0,334427.0,334427.0,334427.0,334427.0,334427.0,334443.0,334457.0
Algeria,,,,,,,,,,,...,271441.0,271448.0,271463.0,271469.0,271469.0,271477.0,271477.0,271490.0,271494.0,271496.0
Andorra,,,,,,,,,,,...,47866.0,47875.0,47875.0,47875.0,47875.0,47875.0,47875.0,47875.0,47890.0,47890.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Winter Olympics 2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,535.0,535.0,535.0,535.0,535.0,535.0,535.0,535.0,535.0,535.0
Yemen,,,,,,,,,,,...,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0
Zambia,,,,,,,,,,,...,343012.0,343012.0,343079.0,343079.0,343079.0,343135.0,343135.0,343135.0,343135.0,343135.0
Zimbabwe,,,,,,,,,,,...,263921.0,264127.0,264127.0,264127.0,264127.0,264127.0,264127.0,264127.0,264276.0,264276.0


In [7]:
raw_data = raw_data.reset_index()

In [8]:
# Replace NaN values with 0
raw_data = raw_data.fillna(0)
raw_data

Unnamed: 0,index,Country_Region,01/22/2020,01/23/2020,01/24/2020,01/25/2020,01/26/2020,01/27/2020,01/28/2020,01/29/2020,...,02/28/2023,03/01/2023,03/02/2023,03/03/2023,03/04/2023,03/05/2023,03/06/2023,03/07/2023,03/08/2023,03/09/2023
0,0,Azerbaijan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Afghanistan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,209322.0,209340.0,209358.0,209362.0,209369.0,209390.0,209406.0,209436.0,209451.0,209451.0
2,2,Albania,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,334391.0,334408.0,334408.0,334427.0,334427.0,334427.0,334427.0,334427.0,334443.0,334457.0
3,3,Algeria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,271441.0,271448.0,271463.0,271469.0,271469.0,271477.0,271477.0,271490.0,271494.0,271496.0
4,4,Andorra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,47866.0,47875.0,47875.0,47875.0,47875.0,47875.0,47875.0,47875.0,47890.0,47890.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,245,Winter Olympics 2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,535.0,535.0,535.0,535.0,535.0,535.0,535.0,535.0,535.0,535.0
246,246,Yemen,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0,11945.0
247,247,Zambia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,343012.0,343012.0,343079.0,343079.0,343079.0,343135.0,343135.0,343135.0,343135.0,343135.0
248,248,Zimbabwe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,263921.0,264127.0,264127.0,264127.0,264127.0,264127.0,264127.0,264127.0,264276.0,264276.0
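
After `fillna(0)` the counts are still floats (for example `209322.0`), because the columns held NaN before. A minimal sketch of casting them back to integers, assuming every non-country column is a date column; the small frame `wide` is made up.

```python
import pandas as pd

# Made-up wide table with one missing value.
wide = pd.DataFrame({
    "Country_Region": ["A", "B"],
    "01/22/2020": [float("nan"), 3.0],
    "01/23/2020": [2.0, 5.0],
})

# Every column except the country name is treated as a date column (assumption).
date_cols = [c for c in wide.columns if c != "Country_Region"]

# Fill the gaps first, then cast only the date columns to int64.
filled = wide.fillna(0)
filled = filled.astype({col: "int64" for col in date_cols})
print(filled.dtypes)
```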


### Country code data: used to obtain a flag link for each country

In [9]:
# Use forward slashes throughout; "\P" is an invalid escape sequence in a regular string
country_data = pd.read_csv("D:/myAnalyze/PANDASPLOTLY_FUNCODING_FULLDATA_20240601/00_Material(Uploaded)/COVID-19-master/country_region_flag.csv")

country_data = country_data[["iso2", "Country_Region"]]

country_data = country_data.drop_duplicates()


In [None]:
country_data

In [None]:
country_data["Country_Region"].value_counts()

In [11]:
country_data.loc[country_data["Country_Region"] == "US", "iso2"] = "US"

In [None]:
country_data.loc[country_data["Country_Region"] == "US", "iso2"]

In [None]:
country_data["Country_Region"].value_counts()

In [13]:
code_lower_link = list()
for v in range(country_data.shape[0]):
    lower_code = str(country_data.iloc[v,0]).lower()
    lower_link = f"https://flagpedia.net/data/flags/w580/{lower_code}.png"
    code_lower_link.append(lower_link)
country_data["Country_link"] = code_lower_link

country_data = country_data[["iso2", "Country_Region", "Country_link"]]

In [14]:
country_data = country_data[["Country_Region", "Country_link"]]
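
The row-by-row loop in cell 13 can also be written as a single vectorized string operation. The sketch below uses a made-up `codes` frame whose `iso2` and `Country_Region` columns mirror the ones used above; `base_url` is an illustrative name.

```python
import pandas as pd

# Made-up country codes for illustration.
codes = pd.DataFrame({
    "iso2": ["US", "KR", "IT"],
    "Country_Region": ["US", "Korea, South", "Italy"],
})

base_url = "https://flagpedia.net/data/flags/w580/"

# Lower-case every code at once and wrap it in the URL pattern.
codes["Country_link"] = base_url + codes["iso2"].astype(str).str.lower() + ".png"
print(codes["Country_link"].tolist())
```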

#### Two datasets obtained from preprocessing so far: daily COVID-19 confirmed-case data (raw_data) and per-country flag link data (country_data)

In [None]:
print(raw_data.shape)
raw_data.head()

In [None]:
print(country_data.shape)
country_data.head()

In [None]:
country_data["Country_Region"].value_counts()

In [None]:
country_data.to_csv("hi.csv")
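
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. A small sketch of the difference, using in-memory buffers instead of a real file; the one-row `flags` frame is made up.

```python
import io
import pandas as pd

flags = pd.DataFrame({
    "Country_Region": ["US"],
    "Country_link": ["https://flagpedia.net/data/flags/w580/us.png"],
})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
flags.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
flags.to_csv(without_index, index=False)

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```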

#### Handling NaN for Province_State in the daily confirmed-case data (raw_data) using JSON data and the apply function

In [None]:
import json

with open("D:/myAnalyze/PANDASPLOTLY_FUNCODING_FULLDATA_20240601/00_Material(Uploaded)/COVID-19-master/csse_covid_19_data/country_convert.json") as json_file:
    myJson = json.load(json_file)
    print(myJson.keys())
    print(myJson.values())

#### Check the Country_Region column and rename a country to the designated name only when it is written differently

In [None]:
# Define the function: rows whose country name is a key in myJson get the mapped name
def notNaN(x):
    if x["Country_Region"] in myJson:
        x["Country_Region"] = myJson[x["Country_Region"]]
    return x

raw_data = raw_data.apply(notNaN, axis=1)
raw_data.head()
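
The row-wise `apply` above could also be expressed as a column-level `replace`. The sketch below is only an illustration: `name_map` is a hypothetical mapping (the real contents of `country_convert.json` are not shown in this notebook), and the counts are made up.

```python
import pandas as pd

# Hypothetical mapping from alternative spellings to the designated names.
name_map = {"Korea, South": "South Korea", "Mainland China": "China"}

frame = pd.DataFrame({
    "Country_Region": ["Korea, South", "Italy", "Mainland China"],
    "01/22/2020": [1, 0, 2],
})

# Names found in the mapping are rewritten; all other names are left unchanged.
frame["Country_Region"] = frame["Country_Region"].replace(name_map)
print(frame)
```

If two spellings of the same country both appear in the table, renaming leaves two rows with the same name, so a follow-up aggregation may be needed.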

#### Add each country's flag link to raw_data

In [42]:
final_data = pd.merge(raw_data, country_data, on="Country_Region", how="left")

In [None]:
final_data
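
With a left merge, countries that have no match in the flag table keep their rows but get NaN in `Country_link`. One way to list them is the `indicator` option of `pd.merge`; the frames below are made up, and "Atlantis" is a deliberately fictional entry.

```python
import pandas as pd

cases = pd.DataFrame({
    "Country_Region": ["US", "Italy", "Atlantis"],
    "01/22/2020": [1, 0, 0],
})
flags = pd.DataFrame({
    "Country_Region": ["US", "Italy"],
    "Country_link": ["https://flagpedia.net/data/flags/w580/us.png",
                     "https://flagpedia.net/data/flags/w580/it.png"],
})

# indicator=True adds a "_merge" column that records where each row came from.
merged = pd.merge(cases, flags, on="Country_Region", how="left", indicator=True)

# Rows marked "left_only" exist in the case table but have no flag link.
missing = merged.loc[merged["_merge"] == "left_only", "Country_Region"].tolist()
print(missing)
```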

In [44]:
# Reorder columns: move the last column (Country_link) to position 1
edit_column = list(final_data.columns)
edit_column.insert(1, edit_column[-1])
del edit_column[-1]

final_data = final_data[edit_column]
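
The same reordering can be sketched with `DataFrame.pop` and `DataFrame.insert`, which move a column in place without building a column list by hand. The one-row `table` is made up for illustration.

```python
import pandas as pd

table = pd.DataFrame({
    "Country_Region": ["US"],
    "01/22/2020": [1],
    "Country_link": ["https://flagpedia.net/data/flags/w580/us.png"],
})

# Remove the column, then put it back at position 1.
link = table.pop("Country_link")
table.insert(1, "Country_link", link)
print(list(table.columns))
```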

In [None]:
final_data.dropna(subset=["Country_link"])

In [None]:
final_data["Country_Region"].value_counts()

In [None]:
final_data[final_data["Country_Region"]=="US"]