<a href="https://colab.research.google.com/github/LC1332/Luotuo-Chinese-LLM/blob/main/notebook/MMC4clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Cleaning up MMC4

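Downloading LAION from Hugging Face (hf) was giving me a real headache, and I found that MMC4 is organized more cleanly.

Let's download it and take a look.

The plan is to build a download script for MMC4: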

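+ Each iteration downloads 10 files.
+ Extract the high-similarity entries, sort them, and keep 10,000 per output file.
+ Keep the similarity, url, and caption fields.
+ Save the result as a new file.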
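
The filtering step in the list above can be sketched on its own. This is a minimal illustration, not the script used below: the helper name `top_k_by_similarity` and the toy rows are hypothetical, and it uses `heapq.nlargest` in place of a full sort. It keeps only the three fields named in the list.

```python
import heapq

def top_k_by_similarity(rows, k=10000):
    """Return the k rows with the largest 'sim', highest first, keeping three fields."""
    best = heapq.nlargest(k, rows, key=lambda row: row["sim"])
    return [{"sim": r["sim"], "url": r["url"], "caption": r["caption"]} for r in best]

# Hypothetical toy rows, not real MMC4 data.
rows = [
    {"sim": 0.18, "url": "https://example.com/a.jpg", "caption": "a", "image_name": "a.jpg"},
    {"sim": 0.33, "url": "https://example.com/b.jpg", "caption": "b", "image_name": "b.jpg"},
    {"sim": 0.29, "url": "https://example.com/c.jpg", "caption": "c", "image_name": "c.jpg"},
]

for row in top_k_by_similarity(rows, k=2):
    print(row)
```

For a fixed cut-off such as 10,000 rows, `heapq.nlargest` avoids sorting the full list, although a plain sort followed by a slice gives the same rows.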

Once this dataset has been translated, we can use it for self-learning or for text-only learning.

In [1]:
# The dataset is published as one zip file per shard, at URLs of the form
# https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_{$SHARD}_v2.jsonl.zip
# where SHARD ranges from 0 to 23098.

# Download a batch of shards into ./temp and save the filtered result into /content/save
import os
import urllib.request
import zipfile
import json


user = 'others'  # set to 'lengziang' to write the alternative output schema

# create the temp directory if it doesn't exist
os.makedirs("temp", exist_ok=True)


save_folder = '/content/save'

# create save_folder if it doesn't exist
os.makedirs(save_folder, exist_ok=True)


# default batch: shards 0 to 9 (end_index is exclusive)

start_index = 0
end_index = 10

# index of the last shard (inclusive)
max_share = 23098

def download_and_extract(start_index, end_index):
    datas = []

    print('{} to {} start job'.format(start_index, end_index))

    missing_shards = [3218,3267,5064,5146,7119,8991,9750,11899,15127,15252,16996,17369,17499,17818]

    for shard in range(start_index, end_index):

        if shard in missing_shards:
            continue

        if shard > max_share:
            break
        
        url = f"https://storage.googleapis.com/ai2-jackh-mmc4-public/data/docs_no_face_shard_{shard}_v2.jsonl.zip"
        filename = f"temp/docs_no_face_shard_{shard}_v2.jsonl.zip"
        urllib.request.urlretrieve(url, filename)
        
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall("temp")

        # the extracted jsonl has the same base name as the zip file; append every record to datas
        with open(f"temp/docs_no_face_shard_{shard}_v2.jsonl", "r", encoding="utf-8") as f:
            for line in f:
                json_format = json.loads(line)
                datas.append(json_format)

    saved_data = []

    for data in datas:
        # data['image_info'] is a list with one dict per image, for example:
        #   {'image_name': '3dfe7f1b8889.jpg', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/10/light-681540_640.jpg', 'matched_text_index': 3, 'matched_sim': 0.18006281554698944}
        #   {'image_name': '63ebb64fea64.jpg', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/08/tools-15539_640.jpg', 'matched_text_index': 1, 'matched_sim': 0.21386536955833435}
        #   {'image_name': 'da29310f2040.jpg', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/05/Vouchercloud-summer-ready.jpg', 'matched_text_index': 2, 'matched_sim': 0.29001814126968384}
        #   {'image_name': '8ef4a00d9e7c.jpg', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/03/KindnessonSocial-Media.jpg', 'matched_text_index': 7, 'matched_sim': 0.22565722465515137}
        #   {'image_name': '275fa8a92ec7.png', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/03/Bloggers-Random.png', 'matched_text_index': 8, 'matched_sim': 0.19497966766357422}
        #   {'image_name': 'd1a15d0b3c3b.jpg', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/06/You-Will-Never-Be-Everyones-Cup-of-Teaand.jpg', 'matched_text_index': 4, 'matched_sim': 0.33313706517219543}
        #   {'image_name': '76b7018f1332.jpg', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/02/dice-18208_1280.jpg', 'matched_text_index': 9, 'matched_sim': 0.22152987122535706}
        #   {'image_name': '15ea30341800.jpg', 'raw_url': 'https://www.acornishmum.com/wp-content/uploads/2018/09/danger-of-believing.jpg', 'matched_text_index': 0, 'matched_sim': 0.20117005705833435}
        text_list = data['text_list']
        sim_list = data['similarity_matrix']
        img_infos = data['image_info']

        # for each image, read image_name and raw_url, and use matched_text_index to look up its caption text
        for img_info in img_infos:
            image_name = img_info['image_name']
            raw_url = img_info['raw_url']
            matched_text_index = img_info['matched_text_index']
            matched_sim = img_info['matched_sim']
            text = text_list[matched_text_index]

            if user == 'lengziang':
                # Leng Ziang asked for this schema: dict_keys(['image_id', 'id', 'caption', 'chinese_caption', 'en_embedding', 'zh_embedding'])
                # keep empty strings for the embedding fields even though no embeddings are computed here
                saved_data.append({'image_id': image_name, 'id': raw_url, 'caption': text,
                    'chinese_caption': '', 'en_embedding': '', 'zh_embedding': '', 'sim': matched_sim})
            else:
                saved_data.append({'image_name': image_name, 'url': raw_url, 'caption': text, 'sim': matched_sim})

    print('{} to {} append all data'.format(start_index, end_index))

    # sort saved_data by sim in descending order
    saved_data.sort(key=lambda x: x['sim'], reverse=True)

    # keep only the top 10000 elements with largest sim values
    saved_data = saved_data[:10000]

    # organize save_name like /content/save/saved_data_{start_index}_{end_index}.jsonl
    save_name = f"{save_folder}/saved_data_{start_index}_{end_index}.jsonl"

    # save saved_data into jsonl file
    with open(save_name, "w") as f:
        for data in saved_data:
            json.dump(data, f)
            f.write("\n")

    print('{} to {} done'.format(start_index, end_index))
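
The inner loop of `download_and_extract` can be tried in isolation on a single record. The sketch below is only an illustration: the helper name `flatten_document` is hypothetical, the two image entries are copied from the sample in the code comment, and the sentences in `text_list` are made-up placeholders because the sample does not include the real text.

```python
def flatten_document(doc):
    """Turn one MMC4 document record into a list of image-caption rows."""
    text_list = doc["text_list"]
    rows = []
    for info in doc["image_info"]:
        rows.append({
            "image_name": info["image_name"],
            "url": info["raw_url"],
            # matched_text_index points at the sentence paired with this image
            "caption": text_list[info["matched_text_index"]],
            "sim": info["matched_sim"],
        })
    return rows

# Toy document: image entries from the sample comment, placeholder sentences.
doc = {
    "text_list": ["sentence 0", "sentence 1", "sentence 2", "sentence 3"],
    "image_info": [
        {"image_name": "3dfe7f1b8889.jpg",
         "raw_url": "https://www.acornishmum.com/wp-content/uploads/2018/10/light-681540_640.jpg",
         "matched_text_index": 3, "matched_sim": 0.18006281554698944},
        {"image_name": "63ebb64fea64.jpg",
         "raw_url": "https://www.acornishmum.com/wp-content/uploads/2018/08/tools-15539_640.jpg",
         "matched_text_index": 1, "matched_sim": 0.21386536955833435},
    ],
}

for row in flatten_document(doc):
    print(row["image_name"], row["caption"], row["sim"])
```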


Let's try running it in parallel with a pool of worker processes

In [2]:
import multiprocessing

# round the shard count (max_share + 1, because shard indices are inclusive) up to a multiple of 10
max_n = ((max_share + 10) // 10) * 10

pool = multiprocessing.Pool(6)

for i in range(0, max_n, 10):
    start = i
    end = i + 10
    pool.apply_async(download_and_extract, args=(start, end))
pool.close()
pool.join()

40 to 50 start job
30 to 40 start job
10 to 20 start job
0 to 10 start job
20 to 30 start job
50 to 60 start job
0 to 10 append all data
0 to 10 done
60 to 70 start job
10 to 20 append all data
10 to 20 done
70 to 80 start job
50 to 60 append all data
30 to 40 append all data
40 to 50 append all data
30 to 40 done
20 to 30 append all data
50 to 60 done
80 to 90 start job
40 to 50 done
90 to 100 start job
20 to 30 done
100 to 110 start job
110 to 120 start job
60 to 70 append all data
60 to 70 done
70 to 80 append all data
110 to 120 append all data
70 to 80 done
90 to 100 append all data
100 to 110 append all data
80 to 90 append all data
110 to 120 done
90 to 100 done

100 to 110 done
80 to 90 done


This test went quite well.
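
One thing worth knowing about `apply_async`: an exception raised inside a worker (a failed download, for example) is only re-raised when `.get()` is called on the returned result object, so a batch can fail without printing anything. The sketch below shows one possible way to collect such failures. It is an assumption-laden illustration, not the notebook's code: `run_batches` and `fake_job` are hypothetical names, and it uses `multiprocessing.pool.ThreadPool` (same API as `Pool`, but threads) so that it runs anywhere without pickling concerns.

```python
from multiprocessing.pool import ThreadPool

def run_batches(job, last_index, batch_size=10, workers=6):
    """Run job(start, end) for every batch and return {(start, end): error} for failures."""
    failures = {}
    with ThreadPool(workers) as pool:
        pending = {}
        for start in range(0, last_index + 1, batch_size):
            end = start + batch_size
            pending[(start, end)] = pool.apply_async(job, args=(start, end))
        for batch, result in pending.items():
            try:
                result.get()  # re-raises any exception raised inside the worker
            except Exception as exc:
                failures[batch] = repr(exc)
    return failures

# Hypothetical stand-in for download_and_extract that fails on one batch.
def fake_job(start, end):
    if start == 20:
        raise IOError("simulated download failure")

print(run_batches(fake_job, last_index=45))
```

The returned dictionary lists the batches that need to be retried.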

I'll have Leng Ziang run the large-scale job on a server.

On my side, I'll collect a small-scale set on Colab and translate a subset first to see how it looks.
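
A small subset could be assembled from the saved files along these lines. This is a rough sketch under stated assumptions: the helper name `load_subset` and the limit of 1000 rows are hypothetical, and it assumes the files follow the `saved_data_{start_index}_{end_index}.jsonl` naming used by the script, with one JSON object per line containing a `sim` field.

```python
import glob
import json
import os

def load_subset(folder, limit=1000):
    """Merge saved_data_*.jsonl files in folder and return the `limit` rows with the highest sim."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "saved_data_*.jsonl"))):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                if line.strip():  # skip blank lines
                    rows.append(json.loads(line))
    rows.sort(key=lambda row: row["sim"], reverse=True)
    return rows[:limit]

# Returns an empty list if the folder has no saved files yet.
subset = load_subset("/content/save", limit=1000)
print(len(subset))
```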