In [2]:
data_science_document = [
    
    "Performing data science work is often an iterative process, where the data scientist needs to return to earlier steps if they run into challenges.",

    "There are many ways to categorize the data science process, but it often includes: Data collection, Data exploration, Data modeling, Model evaluation, and Model deployment and monitoring.",

    "Data Collection: Data collection and preprocessing involves gathering data from various sources (such as databases, APIs, and web scraping), then cleaning and transforming the data to prepare it for analysis. This step involves dealing with missing, inconsistent, or noisy data and converting it into a structured format. Depending on the organization, a team of data engineers may support this step; however, it is common for the data scientist to manage this process as well. This requires intimate knowledge of data sources and the ability to write SQL queries, database code, or custom tools such as web scrapers.",

    "Data Exploration: Data exploration involves conducting exploratory data analysis (EDA) to better understand the data, detect anomalies, and identify relationships between variables. The key to this step is to look for correlations and understand the distribution of the data. This involves using descriptive statistics and visualization techniques to summarize the data and gain insights. A data scientist should be able to use summary statistics, create descriptive visualizations, or use reporting tools such as Power BI or Tableau.",

    "Data Modeling: Using what was learned in the data exploration step, data modeling is the step when the data scientist builds predictive or descriptive models using machine learning and statistical techniques that identify patterns and relationships in the data. Here, the data scientist selects appropriate algorithms, trains models on historical data, and validates their performance.",

    "Model Evaluation: Model evaluation and optimization involves assessing the performance of models using metrics such as accuracy, RMSE, precision, recall, AUC, or F1-score. Based on these evaluations, data scientists may refine models or try alternative algorithms to improve performance. Understanding the reasons behind model predictions is crucial for trust and alignment with domain knowledge. The data scientist must ensure the model solves the organizational/business goal and communicate findings to both technical and non-technical stakeholders.",

    "Model Deployment and Monitoring: Model deployment and monitoring involves implementing models in real-world applications, monitoring performance, and maintaining them to ensure continued accuracy and relevance. Data scientists may work with data engineering teams or use tools such as containers to implement models. After deployment, they may develop dashboards to monitor model performance and alert stakeholders if performance goes outside expected ranges.",

    "Data science is a profession that incorporates many data-related tasks, particularly those involving acquisition, preparation, and delivery of data. While modeling gets most of the attention, the majority of the work (around 80%) often comes from data preparation, exploration, and operational tasks. This does not include other responsibilities like stakeholder communication, gathering requirements, debugging software, emails, and research.",

    "Now that we understand the common tasks associated with the job, we can explore the different types (or flavors) of data science.",

    "Dissecting the Flavors of Data Science",

    "The role of a data scientist often covers many different skills. Data scientists are frequently asked to perform tasks such as designing database tables, programming ML algorithms, understanding statistics, and creating visuals to explain findings. However, it is difficult for one person to master every skill area.",

    "Therefore, many data scientists become particularly skilled in one or two areas and maintain basic competencies in others. Their talents can be considered T-shaped: broad proficiency across many areas (horizontal line of the T), with deep expertise in a few areas (vertical line of the T).",

    "A data scientist’s competencies are often aligned with their unique experiences or interests. For example, a statistics major may naturally excel in ML, while a former BI engineer with strong ETL experience may quickly grasp data engineering concepts.",

    "It is important to master the fundamentals, but most people will develop their own T of Competencies — a combination of top skill sets that shapes their identity in the data science space.",

    "Some of the most common flavors of data science include: Data Engineer, Dashboarding and Visual Specialist, ML Specialist, and Domain Expert.",

    "Data Engineer: Data engineering is a crucial aspect of the data science process involving data collection, storage, processing, and management. It focuses on designing, developing, and maintaining scalable data infrastructure and ensuring the availability of high-quality data for analysis and modeling. Data engineers are most known for managing ETL pipelines and data workflows.",

    "In smaller organizations, data engineering responsibilities may fall under the data science team. A data scientist specializing in this area supports projects by collecting and storing data, and structuring it so it can be efficiently fed into ML or deep learning algorithms.",

    "Common tools for Data Engineers include: Programming languages (Python, SQL, Scala, R, C++), Data storage systems (MySQL, PostgreSQL, Oracle, MongoDB, Cassandra, DynamoDB, Snowflake, Redshift, BigQuery, HDFS), Data processing tools (Spark, Flink, Storm, Beam, MapReduce, Hadoop, Hive, Kafka, Kinesis), ETL tools (NiFi, Talend, Airflow, AWS Glue, Google Cloud Dataflow, dbt), Version control (Git, GitHub, GitLab, Bitbucket, Azure DevOps), Visualization tools (Tableau, Power BI, Looker, QlikView, Domo), Cloud platforms (Azure, GCP, AWS), and containerization tools (Docker, Kubernetes).",

    "Dashboarding and Visual Specialist: Data visualization is the graphical representation of data using charts, graphs, and maps. It helps stakeholders understand complex patterns, trends, and relationships, supports decision-making, and communicates insights effectively. Strong visualization combined with storytelling can drive organizational action, and many news organizations hire data scientists specifically for visualization skills.",

    "Dashboarding and visual specialists may be called BI engineers, data analysts, visualization experts, or data storytellers. They typically have strong skills in descriptive statistics, storytelling, and KPI development.",

    "Common tools for Dashboarding and Visual Specialists include: Programming languages (Python, SQL, R, JavaScript), Data storage systems (MySQL, PostgreSQL, Oracle, MongoDB, Cassandra, DynamoDB, Snowflake, Redshift, BigQuery), Frameworks (Dask, Plotly, ggplot2, Shiny, Matplotlib, Seaborn, D3.js), BI tools (Tableau, Power BI, Looker, QlikView, Domo, Funnel, Excel), and cloud platforms (Azure, GCP, AWS).",

    "ML Specialist: ML specialists focus on designing and implementing machine learning algorithms. They build models that allow computers to learn from experience without explicit programming. Their work involves analyzing data, identifying patterns, and making predictions or decisions. They are skilled in selecting appropriate algorithms and tuning parameters to achieve the best results.",

    "ML specialists often stay current through research and are experienced in model development, deployment, and maintenance. Many have strong backgrounds in statistics, operations research, computer science, or information systems.",

    "Common tools for ML Specialists include: Programming languages (Python, SQL, R, Java, C++), ML frameworks (TensorFlow, Keras, scikit-learn, PyTorch, H2O, Hugging Face), Data storage systems (MySQL, PostgreSQL, Oracle, MongoDB, Cassandra, DynamoDB, Snowflake, Redshift, BigQuery, HDFS), Processing tools (Spark, Flink, Storm, Beam, MapReduce, Kafka), ETL tools (NiFi, Talend, Airflow, AWS Glue, Google Cloud Dataflow), Version control (Git, GitHub, GitLab, Bitbucket), Cloud platforms (Azure, GCP, AWS), and deployment tools (Docker, Kubernetes, Flask).",

    "Domain Expert: Domain experts are data scientists with deep knowledge in a specific domain, either technical (such as computer vision or natural language processing) or business-related (such as marketing, aviation, finance, healthcare, etc.). They use their domain expertise to build customized models and analysis methods tailored to domain-specific problems.",

    "Non-technical domain experts may have an advantage in specialized data science roles because they understand industry challenges and workflows. For example, a digital marketing professional may excel in attribution modeling, while someone with aviation experience may excel in route optimization problems."]

In [3]:
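def chunk_text(text, chunk_size=80, overlap=20):
    """Split text into fixed-size character windows that share `overlap` characters."""
    # Note: overlap must be smaller than chunk_size, otherwise the loop never advances.
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Move forward by the stride (chunk_size - overlap) so neighbours overlap.
        start += chunk_size - overlap
    return chunks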

chunks = []
for doc in data_science_document:
    chunks.extend(chunk_text(doc))
print('Total chunks:', len(chunks))
for c in chunks:
    print(c)
    

Total chunks: 162
Performing data science work is often an iterative process, where the data scien
where the data scientist needs to return to earlier steps if they run into chall
 they run into challenges.
There are many ways to categorize the data science process, but it often include
but it often includes: Data collection, Data exploration, Data modeling, Model e
ta modeling, Model evaluation, and Model deployment and monitoring.
toring.
Data Collection: Data collection and preprocessing involves gathering data from 
gathering data from various sources (such as databases, APIs, and web scraping),
, and web scraping), then cleaning and transforming the data to prepare it for a
 to prepare it for analysis. This step involves dealing with missing, inconsiste
 missing, inconsistent, or noisy data and converting it into a structured format
 a structured format. Depending on the organization, a team of data engineers ma
of data engineers may support this step; however, it is common for th
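
The output above contains short fragments such as `toring.` that sit entirely inside the overlap of the chunk before them, so they add vectors without adding new text. The cell below is a minimal sketch of one way to avoid that: stop as soon as a window reaches the end of the text. The name `chunk_text_no_tail` is hypothetical and not part of the original notebook, and using it would change the chunk count reported above.

```python
def chunk_text_no_tail(text, chunk_size=80, overlap=20):
    """Fixed-size character chunking that stops once a window reaches the end of the text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # stride between chunk starts
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            # This window already covers the tail; a further chunk would only repeat it.
            break
        start += step
    return chunks


sample = "x" * 127
print([len(c) for c in chunk_text_no_tail(sample)])  # prints [80, 67]
```

With the original `chunk_text`, the same 127-character input would yield a third chunk of 7 characters that is already contained in the second one.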

In [4]:
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks)
print(embeddings.shape)

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


(162, 384)


In [5]:
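import numpy as np
import faiss

# Embedding width (384 for all-MiniLM-L6-v2, as shown by the shape printed above)
dimension = embeddings.shape[1]
# Flat (exhaustive) index that ranks vectors by L2 distance
index = faiss.IndexFlatL2(dimension)
# FAISS expects float32 input
index.add(np.asarray(embeddings, dtype="float32"))
print("vector stored", index.ntotal)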

vector stored 162
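
For intuition, a flat L2 index compares the query against every stored vector and keeps the `k` closest. `IndexFlatL2` reports squared Euclidean distances. The cell below is a pure-Python sketch of that idea on toy two-dimensional vectors. It is not how FAISS is implemented internally, and `squared_l2` and `flat_l2_search` are hypothetical names used only for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Exhaustive search: score every vector, then keep the k smallest distances."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(toy_distances, toy_indices)  # prints [1.0, 2.0] [1, 0]
```

Because every vector is scored, the cost grows linearly with the number of stored chunks, which is fine for the 162 vectors in this notebook.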


In [12]:
# Step 3: query
query = "what is Data Engineer ?"
query_embedding = embedder.encode([query]).astype("float32")

distances, indices = index.search(query_embedding, k=2)

retrieved_chunks = [chunks[i] for i in indices[0]]

print("\nRetrieved context:")
for chunk in retrieved_chunks:
    print("-", chunk)


Retrieved context:
- Data Engineer: Data engineering is a crucial aspect of the data science process 
- Data science is a profession that incorporates many data-related tasks, particul
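
The search returns distances along with the indices, but the cell above only uses the indices. As a hedged sketch, the helper below pairs each hit with its distance, skips the `-1` placeholder that FAISS uses when fewer than `k` results exist, and optionally drops hits beyond a cutoff. The name `collect_hits` and the `max_distance` value are assumptions for illustration; a sensible cutoff depends on the embedding model and data.

```python
def collect_hits(distances, indices, chunks, max_distance=None):
    """Pair each retrieved chunk with its distance, dropping empty or distant hits."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:
            # FAISS fills missing results with index -1
            continue
        if max_distance is not None and dist > max_distance:
            continue
        hits.append((float(dist), chunks[idx]))
    return hits


toy_chunks = ["chunk a", "chunk b", "chunk c"]
toy_hits = collect_hits([0.4, 1.7, 3.4e38], [2, 0, -1], toy_chunks, max_distance=1.0)
print(toy_hits)  # prints [(0.4, 'chunk c')]
```

In this notebook it could be called as `collect_hits(distances[0], indices[0], chunks)`, since the search results are indexed by query row.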


In [15]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
context = "".join(retrieved_chunks)
prompt = f"""
Answer the question using only the context below

Context
{context}

Question:
{query}
"""
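# Tokenize the prompt; truncation keeps it within the model's input limit
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
# A short answer is expected, so a modest generation budget is enough
outputs = model.generate(**inputs, max_new_tokens=128)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Answer:", answer)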



Answer: Data engineer is a profession that incorporates many data-related tasks, particul
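
The answer above ends in `particul`, the same cut-off as the second retrieved chunk, which suggests the model is largely echoing an 80-character fragment. One thing to try is to assemble the prompt in a small function so that chunks are separated by newlines and the section labels are consistent. This is a sketch only: `build_prompt` is a hypothetical helper, the wording is an assumption, and it may or may not change the model's answer.

```python
def build_prompt(context_chunks, query):
    """Assemble a retrieval-augmented prompt with one retrieved chunk per line."""
    context = "\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n"
    )


demo_prompt = build_prompt(
    ["Data engineers manage ETL pipelines. ", " They maintain data infrastructure."],
    "what is Data Engineer ?",
)
print(demo_prompt)
```

Keeping the prompt logic in one place also makes it easier to compare different chunk sizes or values of `k` against the same question.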
