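Data preprocessing

1. Load the data from the specified file
2. Add a column marking each row as a response
3. Convert the timestamps to datetime objects
4. Split the country into region and subregion (adds columns)
5. Convert columns to the category dtype
6. Add a flag for whether a comment is present (adds a column)
7. (Add further steps as needed)
8. Write the file out as CSV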
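
Steps 6 and 8 do not appear in the cells shown here. A minimal sketch of how they might look, assuming a comment counts as present when the free-text cell is non-empty. The `add_comment_flags` helper and the `_has_comment` suffix are hypothetical names chosen for illustration, and the sample rows are made up:

```python
import io
import pandas as pd

# Free-text columns, as listed in the sentiment-analysis step of this notebook.
FREE_TEXT_COLUMNS = ["q15", "q16", "q18", "q20", "q21", "q22"]


def add_comment_flags(df: pd.DataFrame, columns=FREE_TEXT_COLUMNS) -> pd.DataFrame:
    """Return a copy of df with one boolean <column>_has_comment flag per column."""
    out = df.copy()
    for col in columns:
        if col in out.columns:
            # Assumption: a missing value or a whitespace-only string means "no comment".
            text = out[col].fillna("").astype(str).str.strip()
            out[f"{col}_has_comment"] = text != ""
    return out


# Tiny stand-in for the survey data.
sample = pd.DataFrame({"q15": ["Great session", None, "  "], "q16": [None, "ok", "fine"]})
flagged = add_comment_flags(sample)

# Step 8: write the result as CSV (an in-memory buffer here; a file path works the same way).
buffer = io.StringIO()
flagged.to_csv(buffer, index=False)
```

Keeping the flags as separate boolean columns leaves the original text untouched for the later sentiment analysis.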

In [1]:
from pathlib import Path
import pandas as pd
import altair as alt
import textblob as tb
import titanite as ti

print(f"Altair {alt.__version__}")
print(f"Pandas {pd.__version__}")
print(f"Textblob {tb.__version__}")
print(f"Titanite {ti.__version__}")

Altair 5.0.1
Pandas 2.0.3
Textblob 0.17.1
Titanite 0.1.2


Load the configuration file

In [2]:
fname = "../sandbox/config.toml"
config = ti.Config(fname=fname)
config.load()
# config.questions
# config.choices
# dir(config)

Load the data

In [2]:
fname = Path("../data/raw_data/20230715_icrc2023_diversity_presurvey_answers.csv")
data = pd.read_csv(fname, skiprows=1)
data["response"] = 1
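
`skiprows=1` drops the first physical line of the file before the header is read, which helps when an export carries two header rows. A sketch on made-up CSV text; that the first row of the real file holds the full question text is an assumption:

```python
import io
import pandas as pd

# Made-up two-header export: question text first, short column names second.
csv_text = "Timestamp,What is your age?\ntimestamp,q1\n2023-07-15 10:30:00,30s\n"

df = pd.read_csv(io.StringIO(csv_text), skiprows=1)
df["response"] = 1  # constant column: summing it later counts the responses in a group
```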

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   timestamp          260 non-null    object 
 1   q1                 260 non-null    object 
 2   q2                 260 non-null    object 
 3   q3                 260 non-null    object 
 4   q4                 260 non-null    object 
 5   q5                 260 non-null    object 
 6   q6                 260 non-null    object 
 7   q7                 260 non-null    object 
 8   q8                 260 non-null    object 
 9   q9                 260 non-null    object 
 10  q10                260 non-null    float64
 11  q11                260 non-null    object 
 12  q12_genderbalance  260 non-null    object 
 13  q12_diversity      260 non-null    object 
 14  q12_equity         260 non-null    object 
 15  q12_inclusion      260 non-null    object 
 16  q13                260 non

Convert the date column to ``datetime`` objects

In [4]:
data["timestamp"] = pd.to_datetime(data["timestamp"])
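
If some timestamps fail to parse, `errors="coerce"` turns them into `NaT` instead of raising, which makes the bad rows easy to find. A minimal sketch on made-up values (the timestamp strings and their format are illustrative, not taken from the survey file):

```python
import pandas as pd

raw = pd.Series(["2023-07-15 10:30:00", "2023-07-16 08:05:12", "not a date"])

# An explicit format avoids format guessing; unparseable values become NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_failed = int(parsed.isna().sum())  # number of rows that could not be parsed
days = parsed.dt.date                # the .dt accessor works once the dtype is datetime64
```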

Replace and tidy column values
- If "regional" is "Prefer not to answer", set "subregional" to "Prefer not to answer" as well
- If "regional" is "Oceania", set "subregional" to "Oceania" as well
- Change "No Interest" to "No interest"

In [5]:
# Answers without a subregion: repeat the value so the later "/" split yields two parts
geo_fix = {
    "Prefer not to answer": "Prefer not to answer / Prefer not to answer",
    "Oceania": "Oceania / Oceania",
}
data["q3"] = data["q3"].replace(geo_fix)
data["q4"] = data["q4"].replace(geo_fix)
# Normalize the capitalization
data["q14"] = data["q14"].replace({"No Interest": "No interest"})

In [6]:
_q3 = data["q3"].str.split("/", expand=True)
_q3[0] = _q3[0].str.strip()
_q3[1] = _q3[1].str.strip()
_q3 = _q3.rename(columns={0: "q3_regional", 1: "q3_subregional"})

_q4 = data["q4"].str.split("/", expand=True)
_q4[0] = _q4[0].str.strip()
_q4[1] = _q4[1].str.strip()
_q4 = _q4.rename(columns={0: "q4_regional", 1: "q4_subregional"})

In [7]:
data = pd.concat([data, _q3], axis=1)
data = pd.concat([data, _q4], axis=1)
#data
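
The two splitting cells repeat the same steps for `q3` and `q4`. One possible refactoring is a small helper. `split_geoscheme` is a hypothetical name, and the sketch assumes every answer has the form `region / subregion` once the replacements have been applied:

```python
import pandas as pd


def split_geoscheme(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Split "region / subregion" strings into two stripped columns."""
    # n=1 splits on the first "/" only, so there are never more than two columns.
    parts = series.str.split("/", n=1, expand=True)
    parts = parts.apply(lambda col: col.str.strip())
    parts.columns = [f"{prefix}_regional", f"{prefix}_subregional"]
    return parts


# Made-up answers for illustration.
answers = pd.Series(["Asia / Eastern Asia", "Oceania / Oceania"], name="q3")
geo = split_geoscheme(answers, "q3")
```

The result can be attached with `pd.concat([data, split_geoscheme(data["q3"], "q3")], axis=1)`, in the same way as `_q3` and `_q4`.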

In [9]:
data.columns

Index(['timestamp', 'q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9',
       'q10', 'q11', 'q12_genderbalance', 'q12_diversity', 'q12_equity',
       'q12_inclusion', 'q13', 'q14', 'q15', 'q16', 'q17_genderbalance',
       'q17_diversity', 'q17_equity', 'q17_inclusion', 'q18', 'q19', 'q20',
       'q21', 'q22', 'response', 'q3_regional', 'q3_subregional',
       'q4_regional', 'q4_subregional'],
      dtype='object')

In [9]:
data["q10"].unique()

array([ 0. ,  7. ,  1. ,  4. ,  2. ,  3. ,  6. , 10. , -1. ,  8. ,  9. ,
        0.5,  5. , 12. , 14. ,  0.1, 24. ])

In [82]:
d = pd.cut(
    data["q10"],
    range(-1, 26),
    right=False,  # bins are [-1, 0), [0, 1), ..., [24, 25)
    labels=["NA"] + [str(i) for i in range(25)],
)
alt.Chart(d.reset_index()).mark_bar().encode(x="q10", y="count()")

In [80]:
#data["q13"].value_counts()
#d = pd.cut(data["q13"], [-1, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], right=False, labels=["NA", "0", "10", "20", "30", "40", "50", "60", "70", "80", "90"])
d = pd.cut(
    data["q13"],
    [-1] + list(range(0, 101, 5)),
    # Bins are [-1, 0), [0, 5), ..., [95, 100); a value of exactly 100 would become NaN
    right=False,
    labels=["NA"] + [str(i) for i in range(0, 100, 5)],
)
alt.Chart(d.reset_index()).mark_bar().encode(x="q13", y="count()")
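
With `right=False` each bin is closed on the left and open on the right, so `-1` lands in the `"NA"` bin and a value equal to the last edge falls outside every bin. A small sketch on made-up numbers (not the survey answers) shows both effects:

```python
import pandas as pd

values = pd.Series([-1, 0, 4.9, 5, 100])
binned = pd.cut(
    values,
    bins=[-1, 0, 5, 10, 100],
    right=False,  # intervals are [-1, 0), [0, 5), [5, 10), [10, 100)
    labels=["NA", "0", "5", "10"],
)
# -1 -> "NA", 0 and 4.9 -> "0", 5 -> "5", 100 -> NaN (outside the last half-open bin)
counts = binned.value_counts(sort=False)
```

If the top value should be kept, extending the last edge slightly beyond the maximum is one option.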

Convert to the category dtype

In [10]:
category = config.categories()
data["q1"] = data["q1"].astype(category["age"])
data["q2"] = data["q2"].astype(category["gender"])
data["q3"] = data["q3"].astype(category["geoscheme"])
data["q3_regional"] = data["q3_regional"].astype(category["regional"])
data["q3_subregional"] = data["q3_subregional"].astype(category["subregional"])
data["q4"] = data["q4"].astype(category["geoscheme"])
data["q4_regional"] = data["q4_regional"].astype(category["regional"])
data["q4_subregional"] = data["q4_subregional"].astype(category["subregional"])
data["q5"] = data["q5"].astype(category["job_title"])
data["q6"] = data["q6"].astype(category["research_group"])
data["q7"] = data["q7"].astype(category["research_field"])
data["q8"] = data["q8"].astype(category["research_years"])
data["q9"] = data["q9"].astype(category["yes_no"])
# data["q10"]
data["q11"] = data["q11"].astype(category["yes_no"])
data["q12_genderbalance"] = data["q12_genderbalance"].astype(category["good_poor"])
data["q12_diversity"] = data["q12_diversity"].astype(category["good_poor"])
data["q12_equity"] = data["q12_equity"].astype(category["good_poor"])
data["q12_inclusion"] = data["q12_inclusion"].astype(category["good_poor"])
# data["q13"]
data["q14"] = data["q14"].astype(category["good_poor"])
# data["q15"]
# data["q16"]
data["q17_genderbalance"] = data["q17_genderbalance"].astype(category["agree_disagree"])
data["q17_diversity"] = data["q17_diversity"].astype(category["agree_disagree"])
data["q17_equity"] = data["q17_equity"].astype(category["agree_disagree"])
data["q17_inclusion"] = data["q17_inclusion"].astype(category["agree_disagree"])
# data["q18"]
data["q19"] = data["q19"].astype(category["school"])
# data["q20"]
# data["q21"]
# data["q22"]
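
The assignments above follow one pattern, so they could also be driven by a mapping from column name to category key. The sketch below uses a hand-built `CategoricalDtype` as a stand-in for the dtypes returned by `config.categories()`, whose exact contents are not shown here. The `column_to_key` mapping and the category values are illustrative assumptions. Values missing from the category list become `NaN`, which is presumably why the "No Interest" spelling was normalized before this step.

```python
import pandas as pd

# Stand-in for config.categories(): category key -> CategoricalDtype (values are made up).
category = {
    "yes_no": pd.CategoricalDtype(["Yes", "No", "Prefer not to answer"]),
}
# Hypothetical mapping from column name to category key.
column_to_key = {"q9": "yes_no", "q11": "yes_no"}

df = pd.DataFrame({"q9": ["Yes", "No"], "q11": ["No", "yes"]})
for column, key in column_to_key.items():
    df[column] = df[column].astype(category[key])

# "yes" (lowercase) is not a listed category, so it silently becomes NaN.
unmatched = int(df["q11"].isna().sum())
```

Checking the count of `NaN` values before and after the conversion is a cheap way to catch spellings that do not match the configured categories.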

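Run sentiment analysis on the free-text answers

- Target columns: ``q15`` / ``q16`` / ``q18`` / ``q20`` / ``q21`` / ``q22``
- Analyze each answer and classify it as positive, neutral, or negative
  - Caveat: the analysis does not take context into account: https://qiita.com/K_Nemoto/items/28a817d57706d536d625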
```console
$ python -m textblob.download_corpora
[nltk_data] Downloading package brown to /Users/shotakaha/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /Users/shotakaha/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/shotakaha/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/shotakaha/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/shotakaha/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/shotakaha/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.
```

In [21]:
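from textblob import TextBlob
import numpy as np


def sentiment_polarity(text):
    """Return the polarity of text in [-1, 1], or NaN if text is not a string."""
    try:
        return TextBlob(text).sentiment.polarity
    except TypeError:
        # Unanswered cells are float NaN, which TextBlob rejects
        return np.nan


def sentiment_subjectivity(text):
    """Return the subjectivity of text in [0, 1], or NaN if text is not a string."""
    try:
        return TextBlob(text).sentiment.subjectivity
    except TypeError:
        return np.nan


def translation(text):
    """Translate text from English to Japanese, or return NaN on failure."""
    try:
        blob = TextBlob(text)
        # translate() returns a TextBlob, not a str (deprecated since TextBlob 0.16.0)
        return blob.translate(from_lang="en", to="ja")
    except (TypeError, AttributeError) as e:
        print(e)
        return np.nan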
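
To turn a polarity score into the three classes mentioned above, one simple option is a symmetric threshold around zero. This is a sketch, not part of TextBlob: the `classify_polarity` helper and the `0.1` threshold are assumptions that would need tuning against real answers.

```python
import math


def classify_polarity(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score in [-1, 1] to 'positive', 'neutral', 'negative', or 'missing'."""
    if polarity is None or math.isnan(polarity):
        return "missing"  # no text was given, so there is no score
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


labels = [classify_polarity(p) for p in (0.8, 0.05, -0.4, float("nan"))]
```

Applied to the survey, this could be chained as `data["q15"].apply(sentiment_polarity).apply(classify_polarity)`.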
    

In [22]:
data["q15"][255]
# data["q15"].apply(sentiment_polarity)[255]
# data["q15"].apply(sentiment_subjectivity)[255]
data["q15"].apply(translation)

The `text` argument passed to `__init__(text)` must be a string, not <class 'float'>

0                                                    NaN
1                                                    NaN
2                                                    NaN
3                                                    NaN
4                                                    NaN
                             ...                        
255    (D, e, ＆, I, は, マ, ル, ク, ス, 主, 義, の, ツ, ー, ル, ...
256                                               (議, 論)
257                                                  NaN
258                                                  NaN
259    (こ, れ, に, 対, す, る, 具, 体, 的, な, ア, ク, テ, ィ, ビ, ...
Name: q15, Length: 260, dtype: object

In [40]:
("").join(list(("D", "e", "&", "I", "は",)))

'De&Iは'
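
The per-character tuples most likely appear because pandas displays the returned `TextBlob` objects as sequences. Converting each result to a plain string avoids the manual join. A minimal sketch, where `to_plain_text` is a hypothetical helper and the tuple stands in for a translated blob:

```python
import math


def to_plain_text(value):
    """Return plain text for text-like results and pass missing values through."""
    if isinstance(value, float) and math.isnan(value):
        return value           # keep NaN for unanswered rows
    if isinstance(value, (tuple, list)):
        return "".join(value)  # characters that were already split apart
    return str(value)          # e.g. str() of a TextBlob gives its raw text


joined = to_plain_text(("D", "e", "&", "I", "は"))
```

With this helper, `data["q15"].apply(translation).apply(to_plain_text)` would give a column of ordinary strings.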