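## Initial Data Retrieval

### Fetching the HTML
URLs containing the keyword "ford" were recorded beforehand in ford-news-source.db. This step downloads the HTML file for each URL for initial processing.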

In [19]:
import os
import sqlite3
import urllib.parse

import requests

# Path to the SQLite database
db_path = './ford-news-source.db'

# Folder where the downloaded HTML files are saved
save_folder = './data-get/html/'
os.makedirs(save_folder, exist_ok=True)

# Connect to the database
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

# Read each row's rowid together with its Real_URL value,
# so the later UPDATE targets the correct row even if rowids have gaps
cursor.execute("SELECT rowid, Real_URL FROM DATA ORDER BY rowid")
rows = cursor.fetchall()

# Iterate over every URL
for i, url in rows:
    #if i >= 2:
    #  break  # process only the first row

    try:
        # Skip rows whose URL is empty
        if not url:
            print(f"Row {i}: URL is empty, skipping")
            continue

        # Optional: skip rows whose HTML_File_ID is already filled in
        #cursor.execute("SELECT HTML_File_ID FROM DATA WHERE rowid=?", (i,))
        #result = cursor.fetchone()
        #if result and result[0]:
        #    print(f"Row {i}: already downloaded, skipping")
        #    continue

        # Parse the URL and skip it if it has no scheme
        parsed_url = urllib.parse.urlparse(url)
        if not parsed_url.scheme:
            print(f"Row {i}: invalid URL, skipping: {url}")
            continue

        # Send a GET request for the page (the timeout keeps a dead host from hanging the loop)
        response = requests.get(url, timeout=30)

        # Check whether the request succeeded
        if response.status_code == 200:
            # Build the file name
            file_id = f"data_file_{i}"
            file_name = file_id + ".html"

            # Save the file to the target folder as UTF-8
            file_path = os.path.join(save_folder, file_name)
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(response.text)

            print(f"Downloaded: {file_name}")

            # Record the file ID in the HTML_File_ID column
            cursor.execute("UPDATE DATA SET HTML_File_ID=? WHERE rowid=?", (file_id, i))
            conn.commit()

        else:
            print(f"Request failed: {url}")

    except Exception as e:
        print(f"Error: {e}")

# Close the database connection
conn.close()


Downloaded: data_file_1.html
Downloaded: data_file_2.html
Downloaded: data_file_3.html
Downloaded: data_file_4.html
Downloaded: data_file_5.html
Downloaded: data_file_6.html
Downloaded: data_file_7.html
Downloaded: data_file_8.html
Row 9: URL is empty, skipping
Row 10: URL is empty, skipping
Row 11: URL is empty, skipping
Row 12: URL is empty, skipping
Row 13: URL is empty, skipping
Row 14: URL is empty, skipping
Row 15: URL is empty, skipping
Row 16: URL is empty, skipping
Row 17: URL is empty, skipping
Row 18: URL is empty, skipping
Downloaded: data_file_19.html
Downloaded: data_file_20.html
Downloaded: data_file_21.html
Downloaded: data_file_22.html
Downloaded: data_file_23.html
Downloaded: data_file_24.html
Downloaded: data_file_25.html
Downloaded: data_file_26.html
Downloaded: data_file_27.html
Downloaded: data_file_28.html
Downloaded: data_file_29.html
Downloaded: data_file_30.html
Downloaded: data_file_31.html
Downloaded: data_file_32.html
Downloaded: data_file_33.html
Downloaded: data_file_34.html
Downloaded: data_file_35.html
Downloaded: data_file_36.html
Downloaded: data_file_37.html
Downloaded: data_file_38.html
Downloaded: data_file_39.html
Downloaded: data_file_40.html
Downloaded: data_file_41.html
Downloaded: data_file_42.html
Downloaded: data_file_43.html
Downloaded: data_file_44.html
Downloa
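
The fetch step depends on matching each URL to its `rowid` and on skipping rows that need no download. The following is a minimal sketch of how those checks could be gathered into one helper. It assumes the `DATA` table has the `Real_URL` and `HTML_File_ID` columns used in this notebook. The helper name `pending_rows` and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3
from urllib.parse import urlparse


def pending_rows(conn):
    """Return (rowid, url) pairs that have a fetchable URL and no HTML file yet."""
    cursor = conn.execute(
        "SELECT rowid, Real_URL, HTML_File_ID FROM DATA ORDER BY rowid"
    )
    pending = []
    for rowid, url, file_id in cursor:
        if not url or file_id:
            continue  # empty URL, or a file ID is already recorded
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # no usable scheme or host
        pending.append((rowid, url))
    return pending


# Hypothetical sample data in an in-memory database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE DATA (Real_URL TEXT, HTML_File_ID TEXT)")
demo.executemany(
    "INSERT INTO DATA (Real_URL, HTML_File_ID) VALUES (?, ?)",
    [
        ("https://example.com/a", None),           # still needs fetching
        ("", None),                                # empty URL
        ("https://example.com/b", "data_file_3"),  # already fetched
        ("example.com/no-scheme", None),           # no scheme
    ],
)
print(pending_rows(demo))  # [(1, 'https://example.com/a')]
```

Selecting `rowid` explicitly means the later `UPDATE ... WHERE rowid=?` does not rely on the rows being numbered 1, 2, 3 without gaps.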

### Removing Unusable Data
Pages from some sites cannot be used, so they are removed.


In [35]:
import sqlite3
import os
import shutil

# Path to the SQLite database
db_path = './ford-news-source.db'

# Folder paths (os.path.join avoids backslash sequences such as '\d' inside string literals)
html_folder = os.path.join('.', 'data-get', 'html')
del_folder = os.path.join(html_folder, 'del')

# Sources whose pages should be removed; add other sources as needed
target_sources = ['Facebook', 'YOUTUBE']

# Connect to the database
conn = sqlite3.connect(db_path)
cursor = conn.cursor()

# Read the Source and HTML_File_ID columns
cursor.execute("SELECT Source, HTML_File_ID FROM DATA")
data = cursor.fetchall()

# Iterate over every row
for row in data:
    source = row[0]
    html_file_id = row[1]

    try:
        # Only handle rows whose Source is in the target list
        if source in target_sources:
            # Build the full paths of the HTML file and the TXT file (the TXT path is currently unused).
            # An empty HTML_File_ID yields a path ending in '.html', which is reported as not found.
            html_file_id_html = html_file_id + ".html"
            html_file_path = os.path.join(html_folder, html_file_id_html)
            txt_file_path = os.path.join(html_folder, os.path.splitext(html_file_id)[0] + '.txt')

            # Check whether the HTML file exists
            if os.path.exists(html_file_path):
                # Create the del folder if it does not exist
                os.makedirs(del_folder, exist_ok=True)

                # Move the file into the del folder
                shutil.move(html_file_path, del_folder)

                # Reset the row's HTML_File_ID to an empty string
                cursor.execute("UPDATE DATA SET HTML_File_ID='' WHERE HTML_File_ID=?", (html_file_id,))
                conn.commit()

                print(f"Moved file {html_file_id} to {del_folder}")
            else:
                print(f"HTML file not found: {html_file_path}")
        else:
            print(f"Skipping non-target source: {source}")
    except Exception as e:
        print(f"Error processing {html_file_id}: {str(e)}")


# Close the database connection
conn.close()


HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
Skipping non-target source: 動腦
Skipping non-target source: 一手車訊
Skipping non-target source: 鏡週刊-別冊
Skipping non-target source: 中國時報北市A
Skipping non-target source: 中國時報北市A
Skipping non-target source: 今周刊
Skipping non-target source: 工商時報
Skipping non-target source: 工商時報
Skipping non-target source: 人間福報
Skipping non-target source: 自由時報北市A
HTML file not found: .\data-get\html\.html
HTML file not found: .\data-get\html\.html
Skipping non-target source: U-CAR
Skipping non-target source: U-CAR
Skipping non-target source: U-CAR
Skipping non-target source: U-CAR
Moved file data_file_25 to .\data-get\html\del
Skipping non-target source: 台灣雅虎奇摩
Skipping non-t
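
The repeated `HTML file not found: .\data-get\html\.html` lines suggest that some target rows have an empty `HTML_File_ID`, so the path collapses to `.html`. One possible way to avoid those lookups is to let the query itself select only target rows that still reference a file. This is a sketch under that assumption, not the notebook's own method. The helper name `file_ids_to_remove` and the sample rows are hypothetical.

```python
import sqlite3


def file_ids_to_remove(conn, target_sources):
    """Return HTML_File_ID values for target sources that still reference a file."""
    placeholders = ",".join("?" for _ in target_sources)
    query = (
        "SELECT HTML_File_ID FROM DATA "
        f"WHERE Source IN ({placeholders}) "
        "AND HTML_File_ID IS NOT NULL AND HTML_File_ID <> '' "
        "ORDER BY rowid"
    )
    return [row[0] for row in conn.execute(query, list(target_sources))]


# Hypothetical sample data in an in-memory database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE DATA (Source TEXT, HTML_File_ID TEXT)")
demo.executemany(
    "INSERT INTO DATA (Source, HTML_File_ID) VALUES (?, ?)",
    [
        ("Facebook", ""),             # target source, file ID already cleared
        ("U-CAR", "data_file_22"),    # not a target source
        ("YOUTUBE", "data_file_25"),  # target source with a file
        ("Facebook", None),           # target source, never downloaded
    ],
)
print(file_ids_to_remove(demo, ["Facebook", "YOUTUBE"]))  # ['data_file_25']
```

Only the source names are bound as parameters; the number of `?` placeholders is derived from the length of the list.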

## Data Preprocessing

### Converting HTML to TXT
BeautifulSoup parses each HTML file and extracts its text content.

In [36]:
import os
from bs4 import BeautifulSoup

# Folder that holds the HTML files
html_folder = './data-get/html'

# Walk through every .html file in the folder
for file_name in os.listdir(html_folder):
    if file_name.endswith('.html'):
        # Full path of the HTML file
        html_file_path = os.path.join(html_folder, file_name)

        # Full path of the TXT file to produce
        txt_file_path = os.path.join(html_folder, os.path.splitext(file_name)[0] + '.txt')

        # Skip the file if its TXT version already exists
        if os.path.exists(txt_file_path):
            print(f'Skipped existing TXT file: {txt_file_path}')
            continue

        try:
            # Read the HTML content as UTF-8
            with open(html_file_path, 'r', encoding='utf-8') as file:
                html_content = file.read()

            # Parse the HTML with BeautifulSoup
            soup = BeautifulSoup(html_content, 'html.parser')

            # Extract the text content
            text = soup.get_text()

            # Write the text to the TXT file
            with open(txt_file_path, 'w', encoding='utf-8') as file:
                file.write(text)

            print(f'Saved text content as {txt_file_path}')
        except Exception as e:
            print(f'Error processing {html_file_path}: {str(e)}')


Saved text content as ./data-get/html\data_file_1000.txt
Saved text content as ./data-get/html\data_file_1002.txt
Saved text content as ./data-get/html\data_file_1007.txt
Saved text content as ./data-get/html\data_file_1009.txt
Saved text content as ./data-get/html\data_file_101.txt
Saved text content as ./data-get/html\data_file_1011.txt
Saved text content as ./data-get/html\data_file_1012.txt
Saved text content as ./data-get/html\data_file_1017.txt
Saved text content as ./data-get/html\data_file_1019.txt
Saved text content as ./data-get/html\data_file_102.txt
Saved text content as ./data-get/html\data_file_1020.txt
Saved text content as ./data-get/html\data_file_1022.txt
Saved text content as ./data-get/html\data_file_1023.txt
Saved text content as ./data-get/html\data_file_1024.txt
Saved text content as ./data-get/html\data_file_1025.txt
Saved text content as ./data-get/html\data_file_1026.txt
Saved text content as ./data-get/html\data_file_1027.txt
Saved text content as ./data-get/
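
News pages usually carry inline scripts and stylesheets alongside the article text. As an illustration of the same HTML-to-text idea, here is a minimal sketch that uses only the standard library's `html.parser` (swapped in for BeautifulSoup so the example runs without extra packages) and explicitly ignores the contents of `<script>` and `<style>`. The class `TextExtractor`, the function `html_to_text`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><h1>Ford news</h1><script>var x = 1;</script>"
    "<p>New model announced.</p></body></html>"
)
print(html_to_text(sample))
```

For the sample markup this prints the heading and the paragraph on separate lines, without the CSS rule or the JavaScript statement.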

### TXT Cleanup 1: Removing Blank Lines


In [37]:
import os

# Folder that holds the TXT files (the same folder as the HTML files)
folder_path = './data-get/html'

# Walk through every file in the folder and pick the .txt files
for file_name in os.listdir(folder_path):
    if file_name.endswith('.txt'):
        # Full path of the TXT file (built before the try block so the error message can use it)
        file_path = os.path.join(folder_path, file_name)
        try:
            # Read the content
            with open(file_path, 'r', encoding='utf-8') as file:
                lines = file.readlines()

            # Drop blank lines and trim surrounding whitespace from the remaining ones
            lines = [line.strip() for line in lines if line.strip()]

            # Write the result back to the same file
            with open(file_path, 'w', encoding='utf-8') as file:
                file.write('\n'.join(lines))

            print(f'Cleared empty lines in {file_path}')
        except Exception as e:
            print(f'Error processing {file_path}: {str(e)}')

Cleared empty lines in ./data-get/html\data_file_1000.txt
Cleared empty lines in ./data-get/html\data_file_1002.txt
Cleared empty lines in ./data-get/html\data_file_1007.txt
Cleared empty lines in ./data-get/html\data_file_1009.txt
Cleared empty lines in ./data-get/html\data_file_101.txt
Cleared empty lines in ./data-get/html\data_file_1011.txt
Cleared empty lines in ./data-get/html\data_file_1012.txt
Cleared empty lines in ./data-get/html\data_file_1017.txt
Cleared empty lines in ./data-get/html\data_file_1019.txt
Cleared empty lines in ./data-get/html\data_file_102.txt
Cleared empty lines in ./data-get/html\data_file_1020.txt
Cleared empty lines in ./data-get/html\data_file_1022.txt
Cleared empty lines in ./data-get/html\data_file_1023.txt
Cleared empty lines in ./data-get/html\data_file_1024.txt
Cleared empty lines in ./data-get/html\data_file_1025.txt
Cleared empty lines in ./data-get/html\data_file_1026.txt
Cleared empty lines in ./data-get/html\data_file_1027.txt
Cleared empty li
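
To make the effect of this cleanup concrete, the same filter can be written as a small function and applied to a short string. This is only an illustrative sketch. The name `strip_blank_lines` and the sample text are hypothetical, and `splitlines()` is used here in place of `readlines()` because the input is a string rather than a file.

```python
def strip_blank_lines(text):
    """Trim each line and drop the lines that are empty afterwards."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


# Lines made only of spaces, tabs, or a full-width space (\u3000) count as blank
sample = "  Ford news \n\n\t\n\u3000\n New model announced.\n\n"
print(repr(strip_blank_lines(sample)))  # 'Ford news\nNew model announced.'
```

Note that `strip()` also removes leading indentation, so any layout that depended on indentation is lost after this step.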

### TXT Cleanup 2: Removing Non-Sentence Text
This step experiments with NLTK and jieba.

In [5]:
%pip install -U nltk
%pip install -U jieba


Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)


In [9]:
import os
import nltk
import jieba  # imported for the jieba experiments; not used in this cell
nltk.download('punkt')  # download the resources required by NLTK's sentence tokenizer

from nltk.tokenize import sent_tokenize

# Input and output folder paths
input_folder = "./data-txt"
output_folder = "./data-txt/sentence"

# Create the output folder if it does not exist
os.makedirs(output_folder, exist_ok=True)

# Process every .txt file in the input folder
for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        # Full path of the input file
        input_file_path = os.path.join(input_folder, filename)

        # Full path of the output file
        output_filename = os.path.splitext(filename)[0] + "_nltk.txt"
        output_file_path = os.path.join(output_folder, output_filename)

        # Read the input file
        with open(input_file_path, "r", encoding="utf-8") as input_file:
            text = input_file.read()

        # Split the text into sentences (sent_tokenize defaults to language='english')
        sentences = sent_tokenize(text)

        # Join the sentences, one per line, and save them to the new location
        output_text = "\n".join(sentences)
        with open(output_file_path, "w", encoding="utf-8") as output_file:
            output_file.write(output_text)


[nltk_data] Downloading package punkt to C:\Users\hardy-
[nltk_data]     game\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
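
NLTK's `sent_tokenize` uses its English Punkt model by default, so it may not treat full-width marks such as `。` as sentence boundaries in Chinese news text. As a possible alternative to compare against, the sketch below splits on Chinese and ASCII terminal punctuation with a regular expression, then optionally keeps only fragments that end with such a mark. This is a heuristic swapped in for illustration, not part of NLTK or jieba. The names `split_sentences` and `keep_sentences` and the sample text are hypothetical.

```python
import re

# Terminal punctuation: full-width Chinese marks plus ASCII ! and ?
# (the ASCII period is left out so decimals and URLs are not split)
TERMINALS = "。！？!?"
_SENTENCE = re.compile(rf"[^{TERMINALS}]+[{TERMINALS}]*")


def split_sentences(text):
    """Split text into fragments that end at terminal punctuation or at a line break."""
    sentences = []
    for line in text.splitlines():
        for match in _SENTENCE.findall(line):
            sentence = match.strip()
            if sentence:
                sentences.append(sentence)
    return sentences


def keep_sentences(fragments):
    """Keep only fragments that end with terminal punctuation (a rough sentence test)."""
    return [f for f in fragments if f[-1] in TERMINALS]


sample = "福特推出新車。價格多少？\n很便宜！聯絡我們"
fragments = split_sentences(sample)
print(fragments)
print(keep_sentences(fragments))
```

On the sample, the trailing fragment `聯絡我們` has no terminal punctuation, so `keep_sentences` drops it, which is the kind of menu or footer text this cleanup step aims to remove. Real pages will need a more careful rule.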
