[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10IXpo-HOHTym_9zJGibz1OyUYP9_qhLd?usp=sharing)

In [None]:
!pip install -qU datasets==2.12.0 cohere

In [None]:
!pip install -qU apache_beam==2.44.0 mwparserfromhell

In [None]:
from datasets import load_dataset

en = load_dataset("Cohere/wikipedia-22-12", "en", streaming=True)
yo = load_dataset("wikipedia", language="yo", date="20221220", beam_runner='DirectRunner')
ig = load_dataset("wikipedia", language="ig", date="20221220", beam_runner='DirectRunner')
ha = load_dataset("wikipedia", language="ha", date="20221220", beam_runner='DirectRunner')

In [None]:
!pip install -qU qdrant-client

In [4]:
print(len(yo["train"]), len(ig["train"]), len(ha["train"]))

32394 14899 21029


In [22]:
yo["train"][0]

{'id': '598',
 'url': 'https://yo.wikipedia.org/wiki/A',
 'title': 'A',
 'text': 'el: άλφα\n yo: a (ah)\n\naa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az\n\na (1A) kan: I gave him a dog. (Mo fun un ní ajá kan).'}

In [4]:
def add_lang(example, lang):
    # return the original fields plus a "lang" tag
    return {'id': example["id"],
            'url': example["url"],
            'title': example["title"],
            'text': example["text"],
            "lang": lang}

# use the map() method to apply the function to each item in the dataset
yo = yo["train"].map(lambda example: add_lang(example, "yo"))
ig = ig["train"].map(lambda example: add_lang(example, "ig"))
ha = ha["train"].map(lambda example: add_lang(example, "ha"))

# print the first item in the dataset to verify that the "lang" key has been added
print(yo[0])



{'id': '598', 'url': 'https://yo.wikipedia.org/wiki/A', 'title': 'A', 'text': 'el: άλφα\n yo: a (ah)\n\naa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az\n\na (1A) kan: I gave him a dog. (Mo fun un ní ajá kan).', 'lang': 'yo'}


In [10]:
[(i,x) for i,x in enumerate(yo) if x["url"] == "https://yo.wikipedia.org/wiki/L%C3%ADt%C3%ADr%C3%A9%E1%B9%A3%E1%BB%8D%CC%80"]

[(177,
  {'id': '2821',
   'url': 'https://yo.wikipedia.org/wiki/L%C3%ADt%C3%ADr%C3%A9%E1%B9%A3%E1%BB%8D%CC%80',
   'title': 'Lítíréṣọ̀',
   'text': 'Litireso tabi iṣẹ́ọnàmọ̀ọ́kọmọ̀ọ́kà\n\nOHUN TÍ LÍTÍRÉṢỌ̀ JẸ́\n\t\nEléyìí náà kò ṣàì ní orírun tirẹ̀ láti inú èdè Látìn “LITERE” Èyí ni àwọn gẹ́ẹ́sì yá wọ inú èdè wọn tí wọ́n ń pè ni lete rature”\nÌtumọ̀ tí a fún lítíréṣọ̀ máa ń yípadà láti ọ̀dọ̀ ẹ̀yà ènìyàn kan sí èkejì láti ìgbà dé ìgbà. Lítíréṣọ̀ kò dúró sójú kan.\nBabalọlá (1986), ṣàpèjúwe lítíréṣọ̀ pé: \n\n“Àkójọpọ̀ ìjìnlẹ̀ ọ̀rọ̀ ní èdè kan tàbí òmíràn tó jásí ewì, ìtàn àlọ́, ìyànjú, eré onítàn, ìròyìn àti eré akọ́nilọ́gbọ́n lórí ìtàgé”\n\nTí a bá wo òde òní, ìtumọ̀ tí a fún lítíréṣọ̀ tún yàtọ̀, fún àpẹẹrẹ a máa ń ṣàkíyèsí ìlò èdè tí wọ́n fi kọ ìwé kan yàtọ̀ sí èyí, a tún ka lítíréṣọ̀ kùn iṣẹ́ òǹkọ̀wé alátinúdá tàbí iṣẹ́ tó ní í ṣe pẹ̀lú ọ̀rọ̀ ìlò ojú inú gẹ́gẹ́ bí ewì, ìtàn àròkọ àti eré oníṣe. A ó sì rí i pe awẹ́ tàbí ẹ̀yà lítíréṣọ̀ tí a mẹ̀nubà wọ̀nyí kó púpọ̀ nínú ìmọ̀ ìgbé ẹ̀dá l

In [10]:
next(iter(en['train']))

{'id': 0,
 'title': 'Deaths in 2022',
 'text': 'The following notable deaths occurred in 2022. Names are reported under the date of death, in alphabetical order. A typical entry reports information in the following sequence:',
 'url': 'https://en.wikipedia.org/wiki?curid=69407798',
 'wiki_id': 69407798,
 'views': 5674.4492597435465,
 'paragraph_id': 0,
 'langs': 38}

In [12]:
import os
import cohere
from qdrant_client import QdrantClient
from qdrant_client import models
from qdrant_client.http import models as rest

In [13]:
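# load API keys from environment variables (fall back to empty strings)

QDRANT_API_KEY = os.environ.get("QDRANT_API_KEY", "")
COHERE_API_KEY = os.environ.get("COHERE_API_KEY", "")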

In [14]:
co = cohere.Client(COHERE_API_KEY)

In [81]:
embeddings = co.embed(
    texts=["A test sentence"],
    model="multilingual-22-12",
)
vector_size = len(embeddings.embeddings[0])
vector_size

768

In [15]:
qdrant_client = QdrantClient(
    url="https://abe27875-57da-4af8-85f9-2a50b4aaaf19.us-east-1-0.aws.cloud.qdrant.io:6333", 
    api_key=QDRANT_API_KEY,
    prefer_grpc=True,
)

In [90]:
# qdrant_client.recreate_collection(
#     collection_name="wiki-embed",
#     vectors_config=models.VectorParams(
#         size=vector_size, 
#         distance=rest.Distance.DOT
#     ),
# )

True

In [157]:
import itertools
import datasets

en_slice = itertools.islice(en["train"], 2500)
en_2500_list = list(en_slice)

# copy the fields of each record and tag it with its language
en_fields = ["id", "title", "text", "url", "wiki_id", "views", "paragraph_id", "langs"]
my_list_dicts = [{**{key: item[key] for key in en_fields}, "lang": "en"} for item in en_2500_list]

# create a new Dataset object from the list of dictionaries
en_2500 = datasets.Dataset.from_list(my_list_dicts)



In [91]:
yo_2500 = yo.select(range(2500))
ig_2500 = ig.select(range(2500))
ha_2500 = ha.select(range(2500))

In [111]:
len(ig_2500["text"])

2500

In [112]:
ig_2500[0]

{'id': '918',
 'url': 'https://ig.wikipedia.org/wiki/Chineke',
 'title': 'Chineke',
 'text': 'Chineke bụ aha ọzọ ndi Igbo n\' omenala Igbo kpọrọ Chukwu. Mgbe ndị bekee bịara, ha mee ya nke ndi okpukpere ụka. N\'echiche ndi okpukpere chi nke Omenala Ndi Igbo, Christianity, Judaism, ma Islam, Chineke nwere ọtụtụ utu aha, ma nwee nanị otu aha. Ụzọ abụọ e si akpọ aha ahụ bụ "Jehovah maọbụ Yahweh". N\' ọtụtụ Akwụkwọ Nsọ, e wepụla aha Chineke ma jiri utu aha bụ "Onyenwe Anyị" maọbụ "Chineke" dochie ya. Ma mgbe e dere akwụkwọ nsọ, aha ahụ bụ Jehova pụtara n’ime ya, ihe dịka ugboro pụkụ asaa(7,000).\n\nN\'ime akwụkwọ nsọ nke ọhụrụ nke edere n\'asụsụ ndi girik, aha Chineke, dika Jehova na Yaweh apụtaghi, kama aha ya n\'ọnọdụ dika "Onyenwe Anyi" ya na "Chineke" pụtara nke ụkwụ. N\'eto nụ aha nsọ Chineke nna anyi n\'asụsụ ọbula n\'ime mmụọ n\' eziokwụ. Maka na amara ya gbara ndi n\'eso ụzọ ya gburu gburu.\n\nLeekwa\n Chukwu\n\nÉféfé',
 'lang': 'ig'}

In [62]:
type(yo)

datasets.arrow_dataset.Dataset

In [18]:
MLLM_MODEL = "multilingual-22-12"

In [110]:
def embed_dataset(dataset):
  # embed the "text" column with the multilingual model set in MLLM_MODEL
  docs = dataset["text"]

  doc_response = co.embed(
      texts=docs,
      model=MLLM_MODEL,
  )
  # cast to plain floats and use each article's integer id as the point id
  vectors = [list(map(float, vector)) for vector in doc_response.embeddings]
  ids = [int(x) for x in dataset["id"]]

  return vectors, ids
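
The embed call above receives every text of the dataset in one list. A minimal sketch of chunking the inputs first, assuming a per-request cap on the number of texts; the `chunked` helper and its `batch_size=96` default are illustrative assumptions, not part of the notebook:

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int = 96) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements.

    `batch_size=96` is an assumed per-request cap, not a value from the notebook.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Example: 250 placeholder documents split into batches of at most 96.
docs = [f"doc {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(docs)]
print(batch_sizes)  # [96, 96, 58]
```

Each batch could then be passed to `co.embed(texts=batch, model=MLLM_MODEL)` and the returned embeddings concatenated in order.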



In [97]:
def upsert_embeddings(dataset, vectors, ids):
  # store each vector with its full record as the payload
  qdrant_client.upsert(
      collection_name="wiki-embed",
      points=rest.Batch(
          ids=ids,
          vectors=vectors,
          payloads=list(dataset),
      ),
  )


In [156]:
en_2500[0].keys()

dict_keys(['id', 'title', 'text', 'url', 'wiki_id', 'views', 'paragraph_id', 'langs'])

In [159]:
vectors, ids = embed_dataset(en_2500)

In [160]:
upsert_embeddings(en_2500, vectors, ids)
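
The four datasets number their articles independently, so the same integer `id` can appear in more than one language, and upserting a point whose ID already exists overwrites the stored point. A minimal sketch of deriving a distinct, repeatable ID per (language, article) pair; `make_point_id` and the namespace string are hypothetical, and the sketch relies on Qdrant accepting UUID strings as point IDs:

```python
import uuid

# Hypothetical namespace for this collection; any fixed UUID would work.
WIKI_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "wiki-embed")


def make_point_id(lang: str, doc_id) -> str:
    """Return a deterministic UUID string unique to a (lang, doc_id) pair."""
    return str(uuid.uuid5(WIKI_NAMESPACE, f"{lang}:{doc_id}"))


# The same numeric id in two languages maps to two different point IDs,
# while repeated calls for the same pair give the same ID.
yo_id = make_point_id("yo", 598)
en_id = make_point_id("en", 598)
print(yo_id != en_id, yo_id == make_point_id("yo", "598"))  # True True
```

With this scheme the `ids` list returned by `embed_dataset` would be built from `make_point_id(row["lang"], row["id"])` instead of `int(row["id"])`.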

In [19]:
def query_knowledge_base(question, lang = None):
  query_embeddings = co.embed(
    texts=[question],
    model=MLLM_MODEL,
  )
  query_filter = None


  if lang:
    query_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="lang",
                match=models.MatchValue(
                    value=lang,
                ),
            )
        ]
    )

  result = qdrant_client.search(
    collection_name="wiki-embed",
    query_filter = query_filter, 
    search_params=models.SearchParams(
        hnsw_ef=128,
        exact=False
    ),
    query_vector=query_embeddings.embeddings[0],
    limit=10,
  )

  return result
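
To make the search semantics concrete, here is a brute-force, in-memory sketch of what a language-filtered dot-product search computes: keep the points whose payload matches the language, score them by dot product with the query vector, and return the top hits. It uses made-up 3-dimensional vectors and is only an illustration; it is not how Qdrant runs its approximate HNSW search, and `filtered_dot_search` is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[List[float], Dict[str, str]]  # (vector, payload)


def filtered_dot_search(
    query: List[float],
    points: List[Point],
    lang: Optional[str] = None,
    limit: int = 10,
) -> List[Tuple[float, Dict[str, str]]]:
    """Score points by dot product, optionally keeping one language only."""
    scored = []
    for vector, payload in points:
        if lang is not None and payload.get("lang") != lang:
            continue  # filtered out before scoring
        score = sum(q * v for q, v in zip(query, vector))
        scored.append((score, payload))
    # highest dot product first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


# Toy data: two Yoruba points and one English point.
points = [
    ([1.0, 0.0, 0.0], {"lang": "yo", "title": "A"}),
    ([0.5, 0.5, 0.0], {"lang": "en", "title": "B"}),
    ([0.0, 1.0, 0.0], {"lang": "yo", "title": "C"}),
]
hits = filtered_dot_search([1.0, 0.2, 0.0], points, lang="yo", limit=2)
print([payload["title"] for _, payload in hits])  # ['A', 'C']
```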




In [20]:
result = query_knowledge_base("What is Literature")

In [39]:
# https://yo.wikipedia.org/wiki/%C3%8Ct%C3%A0n_il%E1%BA%B9%CC%80_N%C3%A0%C3%ACj%C3%ADr%C3%AD%C3%A0

In [40]:
# Who is Samuel Ajayi Crowther
# Tell me about Chinua Achebe and his works
# Who was president between 2015 and ----
# I want to know the past presidents of America

In [21]:
for hit in result:
  print(hit.payload["url"])

https://yo.wikipedia.org/wiki/L%C3%ADt%C3%ADr%C3%A9%E1%B9%A3%E1%BB%8D%CC%80
https://en.wikipedia.org/wiki?curid=31717
https://yo.wikipedia.org/wiki/%C3%8Cw%C3%A9%20%E1%BA%B8%CC%81sr%C3%A0
https://ha.wikipedia.org/wiki/Littafi
https://yo.wikipedia.org/wiki/K%C3%AD%20ni%20f%C3%AD%C3%ACm%C3%B9%3F
https://yo.wikipedia.org/wiki/%C3%8Ct%C3%A0n
https://yo.wikipedia.org/wiki/%C3%8Cw%C3%A9%20%E1%BA%B8%CC%81st%C3%A9r%C3%AC
https://en.wikipedia.org/wiki?curid=31750
https://ig.wikipedia.org/wiki/Ak%E1%BB%A5k%E1%BB%8Dnd%E1%BB%A5%CC%80onye
https://ha.wikipedia.org/wiki/%D0%9B
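
The URLs above are percent-encoded, which hides the Yoruba and Igbo titles. A small sketch for turning a result URL into a readable title with the standard library; `readable_title` is a hypothetical helper, and URLs that carry the article in a query string (the `?curid=` form of the English dataset) are returned unchanged:

```python
from urllib.parse import unquote, urlsplit


def readable_title(url: str) -> str:
    """Decode the last path segment of a Wikipedia URL into a title."""
    parts = urlsplit(url)
    if parts.query:
        # e.g. https://en.wikipedia.org/wiki?curid=31717 has no title in its path
        return url
    last_segment = parts.path.rsplit("/", 1)[-1]
    return unquote(last_segment).replace("_", " ")


for url in [
    "https://yo.wikipedia.org/wiki/%C3%8Ct%C3%A0n",
    "https://ha.wikipedia.org/wiki/Littafi",
    "https://en.wikipedia.org/wiki?curid=31717",
]:
    print(readable_title(url))
```

In the loop over `result`, `readable_title(hit.payload["url"])` could be printed alongside the raw URL.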


In [90]:
text = """
Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that, through cellular respiration, can later be released to fuel the organism's activities. Some of this chemical energy is stored in carbohydrate molecules, such as sugars and starches, which are synthesized from carbon dioxide and water – hence the name photosynthesis, from the Greek phōs (φῶς), "light", and synthesis (σύνθεσις), "putting together".[1][2][3] Most plants, algae, and cyanobacteria perform photosynthesis; such organisms are called photoautotrophs. Photosynthesis is largely responsible for producing and maintaining the oxygen content of the Earth's atmosphere, and supplies most of the energy necessary for life on Earth.[4]

Although photosynthesis is performed differently by different species, the process always begins when energy from light is absorbed by proteins called reaction centers that contain green chlorophyll (and other colored) pigments/chromophores. In plants, these proteins are held inside organelles called chloroplasts, which are most abundant in leaf cells, while in bacteria they are embedded in the plasma membrane. In these light-dependent reactions, some energy is used to strip electrons from suitable substances, such as water, producing oxygen gas. The hydrogen freed by the splitting of water is used in the creation of two further compounds that serve as short-term stores of energy, enabling its transfer to drive other reactions: these compounds are reduced nicotinamide adenine dinucleotide phosphate (NADPH) and adenosine triphosphate (ATP), the "energy currency" of cells.

In plants, algae and cyanobacteria, sugars are synthesized by a subsequent sequence of light-independent reactions called the Calvin cycle. In the Calvin cycle, atmospheric carbon dioxide is incorporated into already existing organic carbon compounds, such as ribulose bisphosphate (RuBP).[5] Using the ATP and NADPH produced by the light-dependent reactions, the resulting compounds are then reduced and removed to form further carbohydrates, such as glucose. In other bacteria, different mechanisms such as the reverse Krebs cycle are used to achieve the same end.
"""

In [92]:
response = co.summarize(
    text= text,
    model='summarize-xlarge',
    length='medium',
    format='bullets',
    extractiveness='high',
    temperature=0.3,
    additional_command=None,
)

In [93]:
response.summary

'- Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that, through cellular respiration, can later be released to fuel the organism\'s activities.\n- Some of this chemical energy is stored in carbohydrate molecules, such as sugars and starches, which are synthesized from carbon dioxide and water – hence the name photosynthesis, from the Greek phōs (φῶς), "light", and synthesis (σύνθεσις), "putting together".\n- Most plants, algae, and cyanobacteria perform photosynthesis; such organisms are called photoautotrophs.\n- Photosynthesis is largely responsible for producing and maintaining the oxygen content of the Earth\'s atmosphere, and supplies most of the energy necessary for life on Earth.'

---