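Issues
- ~~Words are not fully reduced to their base forms~~
- Check whether translation can be restricted to a specified set of words
- ~~Proper nouns are flagged as unknown words~~
    - ~~Use named entity recognition: https://qiita.com/Hironsan/items/a5acf1d121926666907b~~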

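Things to verify
- Check whether collecting the p tags is enough to get the body text from most URLs
- Decide how to handle the payment API
    - PayPal, or apparently there is also a type where you only embed a URL
    - Personal account or business account?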
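
The p-tag check above could be prototyped with the standard library alone. The sketch below is only an illustration: `ParagraphExtractor` and `extract_paragraphs` are hypothetical names, the HTML string is made up, and real pages may need more handling (scripts, nested markup, text outside p tags).

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text that appears inside <p> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_p = False
        self._chunks: list[str] = []
        self.paragraphs: list[str] = []

    def flush(self) -> None:
        # Join the collected chunks and normalize whitespace
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly ends an unclosed previous one
            self.flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html: str) -> list[str]:
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs


# Made-up sample page, not taken from a real URL
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>nav</div>"
    "<p>Second\n paragraph.</p>"
    "</body></html>"
)
paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

Running this on a handful of real article pages and counting how often the result is empty or mostly navigation text would give a rough answer to the "most URLs" question.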

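Spec
- If users run into a word they do not know, they can add it to their own word list
- Ignore idioms; users will presumably look up anything they do not understand on their own
- Instruct the translation to be produced in English without quotation marks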

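Lemmatizing with spaCy is reportedly the recommended approach.

Goal: improve the accuracy of lemmatization (conversion to base forms).

spaCy also looks like a good choice for named entity extraction.
### Confirmed below that it performs well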

In [1]:
import csv
import re
import nltk
from nltk.stem import WordNetLemmatizer
import spacy

# You may need to run `python3 -m spacy download en_core_web_sm` first
nlp = spacy.load("en_core_web_sm")  # or other models

# # Convert uppercase to lowercase and reduce words to their base forms
# def pre_process(text: str) -> set[str]:
#     text = text.replace('.', '')  # remove periods
#     text = text.replace(',', '')  # remove commas
#     text = re.sub(r'\d+', '', text)
#     text_lower = text.lower()
#     # Split the text into words
#     words_in_text = text_lower.split()
#     # Initialize the lemmatizer
#     lemmatizer = WordNetLemmatizer()

#     # Get the base form of each word in the text
#     lemmatized_words_in_text = [lemmatizer.lemmatize(word) for word in words_in_text]

#     return set(lemmatized_words_in_text)

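# Compare the preprocessed word set against the set of words to learn for the exam
def check_none_testwords_in_text(file_path: str, text: set[str]) -> set[str]:
    # Set that holds the exam vocabulary
    words_for_test = set()

    # Open the CSV file
    with open(file_path, newline="") as file:
        reader = csv.reader(file)

        # Read each row and add its word to the set
        for row in reader:
            # Skip blank rows, which csv.reader yields as empty lists
            if not row:
                continue
            # row is a list, so take its first element
            word = row[0]
            words_for_test.add(word)

    # Pick out the words in text that are not in words_for_test
    not_included = text - words_for_test
    return not_included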
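
The comparison is a plain set difference, which can be tried without the real word lists. The following is a minimal sketch under assumptions: `load_known_words` is a hypothetical helper, the in-memory CSV and its second column are made up for illustration, and lowercasing the CSV entries is an extra normalization step that the function above does not perform.

```python
import csv
import io


def load_known_words(csv_file) -> set[str]:
    """Read the first column of each non-empty row into a set."""
    known = set()
    for row in csv.reader(csv_file):
        if row and row[0].strip():
            # Assumption: normalize entries so they match lowercased lemmas
            known.add(row[0].strip().lower())
    return known


# Made-up stand-in for a word list CSV (word in the first column)
sample_csv = io.StringIO("Apple,fruit\nrun,verb\n\nbook,noun\n")
known_words = load_known_words(sample_csv)

# Words from an already preprocessed text
text_words = {"apple", "run", "elicit"}

# Everything in the text that the word list does not cover
unknown = text_words - known_words
print(unknown)
```

Because both sides are sets, the lookup cost does not grow with the order of the words, and duplicates in either the text or the CSV are harmless.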


Open question: in what order should case conversion and proper-noun removal be done?

In [2]:
# Extract proper nouns from the text, lowercase the words, and reduce them to base forms
def pre_process(text: str) -> set[str]:
    # Process the text
    doc = nlp(text)

    # Extract proper nouns
    proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]

    # Convert to base forms; the lemmas come back already lowercased,
    # except proper nouns, which keep their capitalization
    lemmatized_sentence = [token.lemma_ for token in doc if token.is_alpha]
    lemma_extract_propn = (
        set(lemmatized_sentence)
        - set(proper_nouns)
        - {word.lower() for word in proper_nouns}
    )
    return lemma_extract_propn
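
Why the function subtracts both the proper nouns and their lowercased forms can be seen with a toy example. The token list below is written by hand for illustration (in the notebook these values come from spaCy), and the lowercase `daigo` entry mirrors the lemma output further down, which contains both `Daigo` and `daigo`, presumably because one occurrence was not tagged as a proper noun.

```python
# Hand-written (lemma, part-of-speech) pairs; hypothetical stand-ins for spaCy tokens
tokens = [
    ("Daigo", "PROPN"),
    ("burst", "VERB"),
    ("into", "ADP"),
    ("laughter", "NOUN"),
    ("daigo", "NOUN"),  # same name, lowercased and not tagged PROPN
]

lemmas = {lemma for lemma, _ in tokens}
proper_nouns = {lemma for lemma, pos in tokens if pos == "PROPN"}

# Removing only the exact proper nouns leaves the lowercase variant behind
only_exact = lemmas - proper_nouns

# Removing the lowercased forms as well catches it
also_lowercased = only_exact - {word.lower() for word in proper_nouns}

print(sorted(only_exact))
print(sorted(also_lowercased))
```

This also suggests an answer to the ordering question: detecting proper nouns on the original text and lowercasing afterwards keeps the capitalization cue available to the tagger, although that has not been measured here.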


In [3]:
# Combine the two functions above
def set_none_words(file_path: str, text: str) -> set[str]:
    text_pre_process = pre_process(text)
    answer = check_none_testwords_in_text(file_path, text_pre_process)
    return answer

In [4]:
# Paths to the CSV files
file_path_high = "../word_list/csv_folder/word_list_j_high_school.csv"
file_path_uni = "../word_list/csv_folder/word_list_2zi_test.csv"

# text = "Professional baseball team managers and players have their schedules set months before the season starts. This includes game dates and times for flights. The season begins in April, but they have camps in February and open games in March. So, from February to about October, their schedules are mostly planned. I usually take a few days off in December and do my own training in January. On a game day, like when we play at Jingu Stadium, I wake up at 10 AM, get to the stadium after noon, start practice around 2 PM, and the game begins at 6 PM. After the game, I discuss it with the coaches and plan for the next game. Then, I do some training and go home. I usually get home by midnight and sleep at 3 AM. We have six games a week and travel too, so we only get about two days off each month."
text = "In explaining why he predicted 3rd place for Daigo, he mentioned there might be opinions like it's troublesome if a boss gets drunk. He then revealed that recently, an acquaintance of mine was passing by the police box at Shinagawa Station late at night. He saw someone collapsed there, and a policewoman was asking if they were alright. When he looked closely, it turned out to be Daigo. He's still doing things like that.Daigo burst into laughter at this unexpected revelation. The co-performers couldn't hide their surprise, and Hakata Daikichi asked about the timing of the incident, thinking it was a story from years ago. But Yamauchi immediately replied that it was just two weeks ago. Daigo then elicited laughter by explaining that he is a bit conscious that if he's going to collapse, it's better near a police box where someone will help him."

print(
    "Compared with the high school entrance exam word set",
    len(set_none_words(file_path_high, text)),
    set_none_words(file_path_high, text),
)
print(
    "Compared with the university entrance exam word set",
    len(set_none_words(file_path_uni, text)),
    set_none_words(file_path_uni, text),
)

Compared with the high school entrance exam word set 17 {'conscious', 'drunk', 'alright', 'incident', 'troublesome', 'timing', 'elicit', 'burst', 'co', 'boss', 'policewoman', 'reveal', 'mention', 'collapse', 'revelation', 'acquaintance', 'predict'}
Compared with the university entrance exam word set 8 {'drunk', 'alright', 'timing', 'elicit', 'co', 'boss', 'policewoman', 'troublesome'}


In [5]:
def pick_propm(text: str) -> set[str]:
    # Process the text
    doc = nlp(text)
    # Extract proper nouns
    proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]
    return set(proper_nouns)

print(pick_propm(text))


{'Hakata', 'Daigo', 'Shinagawa', 'Yamauchi', 'Station', 'Daikichi'}


In [6]:
def lem(text: str) -> set[str]:
    doc = nlp(text)
    lemmatized_sentence = [token.lemma_ for token in doc]
    return set(lemmatized_sentence)


print(lem(text))

{'alright', 'burst', 'box', 'revelation', 'might', 'he', 'performer', 'just', 'Yamauchi', 'acquaintance', 'unexpected', '-', 'explain', 'see', 'recently', 'police', 'night', 'like', 'late', 'an', 'week', 'it', 'when', 'where', 'from', 'about', 'mine', 'conscious', 'thing', 'will', 'mention', 'Daikichi', 'two', 'incident', 'help', 'near', 'if', '.', 'think', 'place', ',', 'a', 'boss', 'surprise', 'be', 'immediately', 'pass', 'Station', 'the', 'they', 'do', 'Shinagawa', 'then', 'predict', 'their', 'drunk', 'timing', 'into', 'year', 'reveal', 'and', 'go', 'this', 'story', 'collapse', 'still', 'in', 'for', '3rd', 'co', 'Hakata', 'of', 'elicit', 'that', 'why', 'well', 'ask', 'could', 'ago', 'reply', 'opinion', 'someone', 'at', 'get', 'closely', 'there', 'Daigo', 'daigo', 'by', 'to', 'laughter', 'troublesome', 'turn', 'look', 'out', 'not', 'policewoman', 'hide', 'bit', 'but'}
