<a href="https://colab.research.google.com/github/Ashish-Soni08/Playground/blob/main/DTC_zoomcamp_Q%26A_Challenge_Ashish_Soni.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

**DataTalks.Club**, a hub for data science learning, generates a significant amount of Q&A data through its courses. This competition aims to harness this data to create models that can automate the matching of questions to answers, **enhancing educational resources and learner experiences**.

## Objective
The primary goal is to develop a model that can accurately match a given question to its correct answer using the provided dataset.

## Impact
Successful models from this competition could change how **DataTalks.Club** manages its content, making it easier for learners to find answers.

## Evaluation
**Accuracy**: The accuracy metric is the proportion of correctly matched question-answer pairs. Higher accuracy indicates a better-performing model.
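
As a minimal sketch of this metric (the function name `matching_accuracy` and the toy IDs are illustrative and not part of the competition code), accuracy is the share of questions whose predicted `answer_id` equals the true one:

```python
from typing import Sequence


def matching_accuracy(true_ids: Sequence[int], predicted_ids: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted answer_id is correct."""
    if len(true_ids) != len(predicted_ids):
        raise ValueError("true_ids and predicted_ids must have the same length")
    if len(true_ids) == 0:
        raise ValueError("at least one prediction is required")
    # Count positions where the predicted ID equals the true ID.
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return correct / len(true_ids)


# Toy example: 3 of the 4 predictions match, so the score is 0.75.
toy_true = [156400, 634887, 954016, 3699]
toy_pred = [156400, 634887, 954016, 141765]
print(matching_accuracy(toy_true, toy_pred))
```

`sklearn.metrics.accuracy_score`, which the notebook imports, computes the same quantity for label arrays.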

## Instructions
- Harness your data science skills to explore, analyze, and create models that excel in question-answer pairing.
- Start from a baseline that uses the `all-MiniLM-L6-v2` sentence transformer and cosine similarity to find the matching `answer_id` for each question.
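
The matching step of such a baseline can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `cosine` and `best_answer_id` are hypothetical, and the 3-dimensional toy vectors stand in for real sentence embeddings, which in the notebook would come from a model such as `all-MiniLM-L6-v2`.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)


def best_answer_id(question_vec: Sequence[float],
                   candidate_vecs: Dict[int, Sequence[float]]) -> int:
    """Return the candidate answer_id whose vector is most similar to the question."""
    return max(candidate_vecs, key=lambda aid: cosine(question_vec, candidate_vecs[aid]))


# Toy vectors (made up) standing in for embeddings of one question
# and three of its candidate answers.
question = [0.9, 0.1, 0.0]
candidates = {
    156400: [0.8, 0.2, 0.1],
    754877: [0.0, 1.0, 0.0],
    912439: [0.1, 0.0, 0.9],
}
print(best_answer_id(question, candidates))
```

Restricting the search to each question's own candidate answers, rather than all answers, keeps the comparison small and matches how the dataset lists candidates per question.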

# Install

In [1]:
!pip install -qqq kaggle langchain langchain-openai langchain-together langchain_mistralai llama-index Pillow sentence-transformers transformers fastapi kaleido python-multipart uvicorn cohere


## Imports

In [2]:
import os

from google.colab import data_table

import numpy as np

import pandas as pd

import torch

from tqdm.auto import tqdm

from sentence_transformers import SentenceTransformer, util

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
# Embeddings from different Providers through the Langchain framework
## Langchain
from langchain_community.embeddings import CohereEmbeddings
from langchain_community.embeddings import JinaEmbeddings
from langchain_mistralai import MistralAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_together.embeddings import TogetherEmbeddings
from langchain_community.embeddings import VoyageEmbeddings
## LlamaIndex
from llama_index.embeddings import OpenAIEmbedding

from typing import Dict, List

import warnings
warnings.filterwarnings("ignore")

In [3]:
pd.set_option('display.max_rows', None)  # Shows all rows
pd.set_option('display.max_columns', None)  # Shows all columns
pd.set_option('display.width', None)  # Auto-detect the display width
pd.set_option('display.max_colwidth', None)  # Display full width of columns

In [4]:
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Available device: {device}")

Available device: cuda


In [5]:
from google.colab import userdata

# Cohere
COHERE_API_KEY = userdata.get('cohere_api')

# Jina
JINA_API_KEY = userdata.get('jina_embed_api')

# Kaggle
KAGGLE_USERNAME = userdata.get('kaggle_username')
KAGGLE_API_KEY = userdata.get('kaggle_api_key')

# MISTRAL
MISTRAL_API_KEY = userdata.get('mistral_api')

# OPENAI
OPENAI_API_KEY = userdata.get('openai_api')

# Together
TOGETHER_API_KEY = userdata.get('togetherai_api')

# Voyage
VOYAGE_API_KEY = userdata.get('voyage_api')

print("All APIs and ENDPOINTS are available for Access! Let's get Started :)")

All APIs and ENDPOINTS are available for Access! Let's get Started :)


# DATA

In [None]:
from google.colab import files

# Kaggle Setup

# os.path does not expand "~", so resolve the home directory explicitly
kaggle_file = os.path.expanduser("~/.kaggle/kaggle.json")

if not os.path.exists(os.path.expanduser('~/.kaggle')):
  !mkdir -p ~/.kaggle

if not os.path.isfile(kaggle_file):
  files.upload() # produces a prompt
  # After uploading kaggle.json
  !cp kaggle.json ~/.kaggle/
  !chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [None]:
# Check if it works
!kaggle datasets list

ref                                                                title                                               size  lastUpdated          downloadCount  voteCount  usabilityRating  
-----------------------------------------------------------------  -------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
nelgiriyewithana/apple-quality                                     Apple Quality                                      170KB  2024-01-11 14:31:07           6150        144  1.0              
hummaamqaasim/jobs-in-data                                         Jobs and Salaries in Data Science                   76KB  2023-12-25 20:03:32          12504        228  1.0              
emirhanai/city-happiness-index-2024                                City Happiness Index - 2024                          8KB  2024-01-22 00:41:50           1475         32  1.0              
shiivvvaam/revenue-of-top-companies-in-india      

In [None]:
# Download Data from Kaggle
if not os.path.exists('/content/data/train_questions.csv'):
  !kaggle competitions download -c dtc-zoomcamp-qa-challenge
  !mkdir -p data
  !unzip dtc-zoomcamp-qa-challenge.zip -d data > /dev/null
  !rm dtc-zoomcamp-qa-challenge.zip

Downloading dtc-zoomcamp-qa-challenge.zip to /content
  0% 0.00/5.15M [00:00<?, ?B/s] 97% 5.00M/5.15M [00:00<00:00, 47.5MB/s]
100% 5.15M/5.15M [00:00<00:00, 48.6MB/s]


In [None]:
!ls data

 attachments		      test_answers.csv	   train_answers.csv
'sample_submission (1).csv'   test_questions.csv   train_questions.csv


## LOAD DATA

In [None]:
DATA_PATH = '/content/data'

train_questions_df = pd.read_csv(f'{DATA_PATH}/train_questions.csv')
train_answers_df = pd.read_csv(f'{DATA_PATH}/train_answers.csv')
test_questions_df = pd.read_csv(f'{DATA_PATH}/test_questions.csv')
test_answers_df = pd.read_csv(f'{DATA_PATH}/test_answers.csv')

## EDA for Training Data

### Observations after EDA:

1. Duplicate Question IDs: The dataset contains duplicate `question_id` entries, specifically 3 instances where a `question_id` is repeated.

2. Column Uniqueness:

- The columns `course_question`, `year_question`, `course_answer`, and `year_answer` each have only 2 unique values: the courses `Machine Learning Zoomcamp` and `Data Engineering Zoomcamp`, and the years `2021` and `2022`.

- The `attachments_files` column has 25 unique values but also a large number of missing values (374 out of 399), suggesting limited utility in analysis.

3. Duplicate Rows: There are duplicate rows in the dataset, particularly where `question_id` is `647840`, `172850`, or `631555`. These duplicates seem to represent the same question being asked or answered multiple times, with minor variations in the candidate answers and answers.
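
One way to surface such duplicates is pandas' `duplicated` with `keep=False`, which flags every row that shares its key with another row. The sketch below runs on a made-up toy frame (`toy_merged_df` and its values are illustrative, not the competition data):

```python
import pandas as pd

# Toy frame standing in for the merged training data (values are made up).
toy_merged_df = pd.DataFrame({
    "question_id": [1, 2, 2, 3],
    "answer": ["a", "b (short)", "b (longer)", "c"],
})

# keep=False marks every row whose question_id also appears in another row.
duplicate_rows = toy_merged_df[toy_merged_df.duplicated(subset="question_id", keep=False)]
duplicate_ids = duplicate_rows["question_id"].unique().tolist()

print(duplicate_ids)
print(duplicate_rows)
```

Printing the flagged rows side by side makes it easy to compare the variants of each repeated question.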

### Decisions for Data Preprocessing
Based on the observations, the following decisions were made for data preprocessing:

1. Remove the `attachments_files` Column: This column will be removed because most of its values are missing and it adds little information to the dataset.

2. Resolve Duplicate `question_id` Entries: For each duplicate `question_id`, only the last row will be kept. This decision is based on the observation that the last entry tends to have the most complete and relevant answer, which is especially noticeable for `question_id` `647840`. This approach keeps the most comprehensive information available for each question.

- Specifically for `question_id` `647840`, the last row was kept because it contained a more comprehensive answer than the previous entry.
- The same approach will be applied to `question_id` `172850` and `631555`.
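
These two decisions could be expressed in pandas roughly as follows. This is a sketch on a made-up toy frame (`toy_raw_df` and its answer texts are illustrative); it assumes that row order in the real data reflects which entry is "last".

```python
import pandas as pd

# Toy frame with one repeated question_id (values are made up).
toy_raw_df = pd.DataFrame({
    "question_id": [647840, 647840, 172850],
    "answer": ["short answer", "more complete answer", "another answer"],
    "attachments_files": [None, None, "pic.jpg"],
})

cleaned_df = (
    toy_raw_df
    .drop(columns=["attachments_files"])                  # decision 1: drop the sparse column
    .drop_duplicates(subset="question_id", keep="last")   # decision 2: keep the last row per question
    .reset_index(drop=True)
)
print(cleaned_df)
```

`keep="last"` relies purely on position, so any reordering of the frame before this step would change which duplicate survives.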

In [None]:
def column_analysis(df: pd.DataFrame, column_name: str) -> None:
    """
    Performs basic analysis on a specified column of a pandas DataFrame and prints the results.

    Args:
    df (pd.DataFrame): The DataFrame containing the column to be analyzed.
    column_name (str): The name of the column to analyze.

    The function prints:
      - 'column_name': The name of the column.
      - 'type': The data type of the column.
      - 'unique_count': The number of unique values in the column.
      - 'missing_values': The number of missing (NaN, None, etc.) values in the column.
      - 'duplicate_values': The number of duplicate values in the column.
    If the specified column name does not exist in the DataFrame, a message indicating this is printed.
    """
    if column_name in df.columns:
        analysis = {
            'column_name': column_name,
            'type': df[column_name].dtype,
            'unique_count': df[column_name].nunique(),
            'missing_values': df[column_name].isnull().sum(),
            'duplicate_values': df[column_name].shape[0] - df[column_name].nunique()
        }
        for key, value in analysis.items():
            print(f"{key}: {value}")
    else:
        print(f"The column '{column_name}' does not exist in the DataFrame.")

In [None]:
def print_column_values(df: pd.DataFrame, column_name: str) -> None:
    """
    Prints the unique values of a specified column in a pandas DataFrame.

    Args:
    df (pd.DataFrame): The DataFrame containing the column.
    column_name (str): The name of the column for which to print unique values.

    Returns:
    None. Prints the unique values of the column to the console.
    """
    if column_name in df.columns:
        unique_values = df[column_name].unique()
        unique_values_list = unique_values.tolist()
        print(f"Unique values in '{column_name}': {unique_values_list}")
    else:
        print(f"The column '{column_name}' does not exist in the DataFrame.")

### Training Data

In [None]:
print(f"Rows and Columns -> {train_questions_df.shape}")
print(f"Rows and Columns -> {train_answers_df.shape}")

Rows and Columns -> (397, 6)
Rows and Columns -> (397, 5)


In [None]:
data_table.DataTable(train_questions_df, include_index=False, num_rows_per_page=10)

Unnamed: 0,question_id,question,course,year,candidate_answers,answer_id
0,79062,"For categorical target set, where the distribution is imbalanced (for example, 90/10) what approach should be used?",Machine Learning Zoomcamp,2021,156400754877105368643810912439,156400
1,468946,"Is there anything that we are not allowed to use? For example, if you want to use hyper-parameter tuning, are we allowed to use GridSearchCV, RandomizedSearchCV?",Machine Learning Zoomcamp,2021,641330634887912439425941642829,634887
2,968800,I have been catching up and have been doing homeworks but not tuning in. Will I be able to turn in the final project?,Data Engineering Zoomcamp,2022,9540161678567591936798838013,954016
3,688404,Could you please explain what code we should load to GitHub?,Data Engineering Zoomcamp,2022,1986616298986865773699141765,3699
4,63921,Is it just me or does the model have really bad accuracy and generates random images? I tested and identified cats as dogs.,Machine Learning Zoomcamp,2021,754877604487912439858915425941,858915
5,521365,Are we distant from what a data engineer’s day-to-day work looks like? What are the main differences?,Data Engineering Zoomcamp,2022,75919830918337445449040131069,830918
6,74423,How is the data engineering market in Berlin? How can someone land a DE job in Berlin?,Data Engineering Zoomcamp,2022,159394243026838013478083296080,243026
7,425250,"We used tf.keras.layers.GlobalAveragePooling2D in our model. What is the difference in using GlobalAveragePooling1D, 2D, and 3D in pooling layers?",Machine Learning Zoomcamp,2021,200161152693478055347048912439,347048
8,278678,"If you pickle load an object, are all the methods associated with that object also loaded?",Machine Learning Zoomcamp,2021,772478890166214199912439655484,772478
9,184224,How's ranking different from the regression model?,Machine Learning Zoomcamp,2021,764907548146643810934691642829,934691


In [None]:
data_table.DataTable(train_answers_df, include_index=False, num_rows_per_page=3)

Unnamed: 0,answer_id,answer,course,year,attachments_files
0,156400,"Alexey\nShould we use something non-standard there or can we just go with the usual things we learned in the course?\nHamed\nYou just need to test different strategies. Something I noticed – if you have so many parse subclasses in your categorical [inaudible], you should be careful about using one-hot encoding. You might say you can use ordinal encoding, if your data in nature had some order. It will be useful. In my particular data, I couldn't have domain knowledge. I didn't know what the subclasses were, so I couldn't decide which strategy I should choose. But if you have the domain knowledge, that’s the key here, I think.",Machine Learning Zoomcamp,2021,
1,634887,"No, I don't think there is anything you cannot use. I just want to ask you to document it, if there is something you use that is not a part of the course. For somebody who will be reviewing your project, it will be new information. So just give them some context and explain, “I used GridSearchCV because this is easier and this is what you can do.” Then you describe it, and you give your peers a chance to learn from what you’re doing. This way, they will not be lost and will be able to grade your homework. If you do that, you can use whatever you want. \nFrom the materials in the course, like if you use logistic regression, or rich regression, or XGBoost, or random forest, then you don't need to really go into details when documenting because I would expect that people understand that without much explanation. Maybe it's also a good idea if somebody, let's say, wants to use not XGBoost, but LightGBM, or CatBoost, which are different implementations of gradient boosting – you're free to use them in your project as well. You can use FastAPI instead of Flask, for example. You can use Poetry instead of Pipenv. You can use Conda instead of Pipenv. You're basically free to explore to use what you want and play with different tools, just be sure to document that.",Machine Learning Zoomcamp,2021,
2,954016,"Alexey\nYes, you will be. You can submit the project. As we promised at the very beginning, we will give certificates based on project completion, not based on the homework. So if you're catching up right now, and you didn't do all the homework, it's fine. If you are caught up and you know what to do, you can take part in a project.",Data Engineering Zoomcamp,2022,
3,3699,"Alexey\nI think the question refers to the homework form where we asked you to submit your code. Here, you can just create a repo in GitHub, create a folder there, homework 1, put your SQL files there, and just leave a link to this file when you submit. It can be anything else. If you are more of a GitLab kind of person, you can create a report in GitLab. But it should be something publicly accessible. Put your SQL queries there and submit it with the form.",Data Engineering Zoomcamp,2022,
4,858915,"Dmitry\nIt's fine, because this is the showcase purpose. We said to optimize the number of books, for example, the validation steps. Also, the architecture can be a bit changed. For sure, we can create a much better model.\nAlexey\nFor this model, I guess, if we wanted to build a cat vs dog classifier that works well, we would use a pre-trained neural network and fine tune it, right?\nDmitry\nYeah. That’s one of the options.\nAlexey\nOr just get a lot of different pictures of cats and dogs and train from scratch. Are there any other options?\nDmitry\nOther options would be to try to tweak the parameters and all those things.\nAlexey\nSo you think you can train a model from scratch – the one we had – without using a pre-trained neural network and then have a decent accuracy?\nDmitry\nYeah, but it's also the question of what “decent” is. \nAlexey\nAt least 80%, for example.\nDmitry\nAround 80, I think yes. But we need to remember that the pretrained will always have the benefit.\nAlexey\nIf we think about this – right now, this model that we have, has an accuracy of 65% on validation, which is just slightly better than a random guess. Well, it's not slightly better, but it's 10-15% better. There are chances that, you pick a few random images and in both of these cases, it will be incorrect just because it's 65%. Right? \nDmitry\nYeah, but we didn't do any…\nAlexey\nYeah, I'm just saying why this can happen. This could be the reason – 65% is not the best accuracy in the world.",Machine Learning Zoomcamp,2021,
5,830918,"Alexey\nI guess the question is asking about the difference between what we are doing here in the course and the actual work of a data engineer.\nVictoria\nMore problems. [chuckles] More troubleshooting, probably.\nAnkush\nI think the course is made in a way to set you up for becoming a data engineer or tackling all the problems of a data engineer. Mostly it will revolve around the technologies that we are talking about.\nAlexey\nSo what are the main differences, except for more troubleshooting? I imagined that there are analysts who come to you or other people with ad hoc queries, saying, “Hey, where is this data? Where is this table?” And then you have to help them. What are the other main differences?\nAnkush\nI would say complexity. Definitely.\nAlexey\nYeah, we have a relatively simple case, right? We only need to do one join. Well, two joins. We have a bunch of tables, but we don't need to join these big tables with each other. We only do a join with the location table, which is pretty small. But in practice, we often have two big tables that we need to join.\nVictoria\nYeah, we don't have complex business logic. Plus, you don't really do the setup all the time. You set up BigQuery once and then you maintain it, but you probably won’t be dealing with the service account that much.\nAlexey\nIdeally, there may be a team who deals with that. For example, at our work, we have a team who manages Airflow, so I don't need to worry about this. All I do is just write DAGs, commit them, and that's it. I never had to set up Airflow locally and worry about these things because it's managed. I think it’s the same with the data warehouse and other things.\nAnkush\nI think one more thing that everybody with data does is basically communicate outwards what the meaning of different columns is or what this ratio is versus that ratio. All that. I have to personally do a lot of that. 
What's your experience like?\nVictoria\nI would say the same, especially for business stakeholders. But DBT also has this data catalog part. So in the data team that helps quite a lot. Also we look at and work on the selection of an actual data catalog. So hopefully, there’s not too many questions.\nAlexey\nIn our case, for these questions we will have a catalog team. We have an internal catalog tool and they get all these questions, not data engineers.",Data Engineering Zoomcamp,2022,
6,243026,"Victoria\nI think there's a lot. It's a little bit hard to get a data engineer, and that's also why I would assume there are a lot of openings, and you will have a lot of options. How can someone land a job? If you go to meetups or have contacts, that's probably a good way. Or just apply.\nSejal\nI agree. Actually, there are plenty of data engineering openings. When I was a data engineer in my previous role, I used to get invitations for job interviews almost every week from recruiters. There's an abundance for this role in the market. Definitely try it out.\nAlexey\nSince you changed your title to ML engineer, do you now get fewer or more invitations? You don't get anything? \nSejal\nOddly. [chuckles] \nAlexey\n[chuckles] Okay, so we see that being a data engineer is actually better, in terms of market demand.\nSejal\nYeah, I think engineering skills are way more in demand than data science and machine learning skills are.\nAlexey\nYeah, that's interesting. What happened to “the sexiest job of the 21st century”? Nobody wants to hire them now. [chuckles] We had an interesting discussion in one of the podcasts here with Ellen. Ellen was a data scientist and she became a data engineer. She is from Berlin and I think she also talks about how to get a job as a data engineer in Berlin specifically. I think the recommendation was to talk to consulting companies. \nThey have some sort of programs for coaching juniors, because the business model is to hire as many juniors as possible because they're cheap, and then you sell them to their client for a lot of money. So you need to have training programs so that the juniors are effective. Her recommendation was to try to find such a consultancy company and learn there. I hope it's an accurate summary of the episodes. 
Maybe I'm misinterpreting something.\nVictoria\nYeah, I think if you come from a non-data background, and you're starting with the course and all that, just bear in mind that you'll have to apply a lot and there will probably be a lot of rejections. At least that's how it was for me at the very beginning when I moved to Berlin. Because everything was kind of new and all that, I didn't have a job at the time, because I just moved to Berlin and then looked for one. \nTry to get as many interviews as possible, and then do the challenges and all that. That will train you a lot and will also help you know where the benchmark is, kind of. But yeah, there are a lot of offers, so just start applying. Start applying, start talking to other people, probably to other data engineers, as well. That's also good. About the consultancy, I don't know. I've never used anything like that.\nAlexey\nI worked at an outsourcing company, but it was before I moved to Berlin. But I can confirm that in the company where I worked, they did have quite a good training process. They would have a coach that was assigned to me who would help me with everything. There was also some sort of bootcamp program. Then after that, they would try to sell me to a client and they would coach me on how to pass an interview with a client. Once a client likes me, I just start working. This is quite common, I think, in outsourcing and consulting companies. For me, that was a very long time ago, so I don't know how that actually works now. [chuckles]\nSejal\nThat's actually an interesting model. It's good that companies are actually putting that much effort into coaching the employees to be honed to certain projects. That's nice.\nAlexey\nYeah, but the thing is, the business model is buying somebody for cheap and selling them for a lot of money and then taking the margin. So, unfortunately, sometimes (at least this was in my case) the salary was pretty low. 
I remember that they didn't have much left after paying for my flat. So be careful. [chuckles] But at least that was a good experience. After one year, I could sell this experience for “market standards,” let's say. So depending on how young you are, how much you want to get this experience, I think sometimes it's worth actually agreeing to a lower salary in exchange for experience. Then you get this experience, and then your value on the market becomes higher. \nAnother shameless plug, but this is something we actually talked about with Juan Pablo. He suggested looking for small gigs that don't pay a lot, but this is experience you can sell afterwards. This is somewhat similar to what we mentioned, even though he's more into analytics than data engineering, but I think the tips that he shared when he was looking for a job – I think they're pretty universal. There was also a funny story that he was actually driving an Uber to be able to survive while he was trying to study things in order to switch. That's a cool one as well.",Data Engineering Zoomcamp,2022,
7,347048,"Let's say we have a two dimensional thing (2D) and we want to turn it into 1D. For this case, we need to use Pooling2D. I'm not sure right now and I want to quickly check it. This is the reason. Remember, when I was showing you how I usually go about defining the model? I build it sort of layer by layer and every time I add one more layer, I do a model that predicts to see what the output is. And then based on that, I see what kind of pooling layer, for example, I need. Let's say if we want to turn 2D into 1D, I think we need to use 2D pooling. I'm not exactly sure whether it's 2D or 1D. I think 1D pooling is needed when we have something one dimensional and we want to turn it into just one value, then we use 1D pooling. And then 3D pooling is when we have a three dimensional thing and we want to turn it into a one-dimensional thing. [image 2] These things always confuse me, to be honest. \nThat's why I follow this step by step and then I just try different poolings. Then I want to make sure that, if I convert an image into a vector presentation, I have something one-dimensional. Usually the size is the number of images times something. So it's more like a 2D array. For each image, I have a one-dimensional vector. Then based on that, I try different poolings. Sometimes, you can also just flatten. Flatten takes whatever D – let's say you have KD – and you want to turn it into 1D, you use flatten. There are many different options. I think it's clear what the difference is. I might not remember exactly when to use 1D, 2D, and so on, but the difference is what kind of input they take and what kind of output they produce. If it's just a cube, then it's 2D, if it's a hypercube with three dimensions, that it's something else.",Machine Learning Zoomcamp,2021,85248_qa9_ml_zoomcamp_office_hours__week_11_pic2.jpg
8,772478,"Kind of. Actually, pickle expects that you have the code that, let's say, if your pickle and object are of a particular class. Let's say this class could be part of the package Scikit Learn linear (the class is logistic regression). When pickle loads an object, it expects that this class is present in your Python. It loads all these methods from that code – from that module. It expects that the code will be there and it just basically loads the data and the behavior is stored in the code. So you have to have the code.",Machine Learning Zoomcamp,2021,
9,934691,"I think in one of the videos, I talked about the three different subtypes of supervised learning, integration, classification, and tracking. About ranking – [image for reference] for regression, Let’s say you have a car and then we extract your feature matrix from the car. You apply the formula (the g function that you trained) and you get a prediction, like price. You do this for one object. You get one car and you do a prediction for one car. \nWhen it comes to ranking, you don't have one object. Of course, you can have multiple cars and apply this to multiple cars. But the core difference with ranking is – in ranking, let's say you have some results from Google. So you have some query and you have some results from Google. There could be, let's say, results from 0 to 99. And you need to apply a function to each of the elements. This is a group. And you need to apply this model that you have (the model g) to each element of this group. Then it produces a ranked list. Let's say, you apply g(x0), and you apply g(x99) this x99 is the row here – the results of the query. Then what you do is rerank the output – you rerank all these results using this function. \nWhat I'm trying to say is, here, you look at the group and you try to see how good the ranking is within the group. While in the case of simple regression, you have more standalone objects, sort of. But with ranking, it always has to be a group. Of course, when it comes to ranking, this g can also be a regression or it can also be classification. But then you always need to think about the other elements in the group. I hope that answers the question. \nThis is not something we will go into detail about – not in this course, for sure. This is just for you to know that it's a little bit different. Maybe this is something you want to do for your project or explore for the article. This is totally fine.",Machine Learning Zoomcamp,2021,966609_qa6_01.jpg


In [None]:
# Let's merge the train questions and answers in one dataframe
train_merged_df = pd.merge(
    train_questions_df, train_answers_df, on='answer_id', how='inner', suffixes=('_question', '_answer')
)
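
The effect of `suffixes` in this merge can be seen on a small made-up example (the frames `toy_questions` and `toy_answers` are illustrative). The `validate="many_to_one"` argument is an optional addition, not used in the notebook's own call, that makes pandas raise a `MergeError` if an `answer_id` appears more than once in the answers frame:

```python
import pandas as pd

# Toy frames (made up) that share the column names "course" and "year".
toy_questions = pd.DataFrame({
    "question_id": [1, 2],
    "answer_id": [10, 20],
    "course": ["ML", "DE"],
    "year": [2021, 2022],
})
toy_answers = pd.DataFrame({
    "answer_id": [10, 20],
    "answer": ["first answer", "second answer"],
    "course": ["ML", "DE"],
    "year": [2021, 2022],
})

toy_merged = pd.merge(
    toy_questions, toy_answers,
    on="answer_id", how="inner",
    suffixes=("_question", "_answer"),  # disambiguates the overlapping column names
    validate="many_to_one",             # assumption: each answer_id is unique in the answers frame
)
print(toy_merged.columns.tolist())
```

Columns present in both frames, other than the join key, receive the suffixes, which is where `course_question`, `year_question`, `course_answer`, and `year_answer` come from.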

In [None]:
data_table.DataTable(train_merged_df, include_index=False, num_rows_per_page=3)

Unnamed: 0,question_id,question,course_question,year_question,candidate_answers,answer_id,answer,course_answer,year_answer,attachments_files
0,79062,"For categorical target set, where the distribution is imbalanced (for example, 90/10) what approach should be used?",Machine Learning Zoomcamp,2021,156400754877105368643810912439,156400,"Alexey\nShould we use something non-standard there or can we just go with the usual things we learned in the course?\nHamed\nYou just need to test different strategies. Something I noticed – if you have so many parse subclasses in your categorical [inaudible], you should be careful about using one-hot encoding. You might say you can use ordinal encoding, if your data in nature had some order. It will be useful. In my particular data, I couldn't have domain knowledge. I didn't know what the subclasses were, so I couldn't decide which strategy I should choose. But if you have the domain knowledge, that’s the key here, I think.",Machine Learning Zoomcamp,2021,
1,468946,"Is there anything that we are not allowed to use? For example, if you want to use hyper-parameter tuning, are we allowed to use GridSearchCV, RandomizedSearchCV?",Machine Learning Zoomcamp,2021,641330634887912439425941642829,634887,"No, I don't think there is anything you cannot use. I just want to ask you to document it, if there is something you use that is not a part of the course. For somebody who will be reviewing your project, it will be new information. So just give them some context and explain, “I used GridSearchCV because this is easier and this is what you can do.” Then you describe it, and you give your peers a chance to learn from what you’re doing. This way, they will not be lost and will be able to grade your homework. If you do that, you can use whatever you want. \nFrom the materials in the course, like if you use logistic regression, or rich regression, or XGBoost, or random forest, then you don't need to really go into details when documenting because I would expect that people understand that without much explanation. Maybe it's also a good idea if somebody, let's say, wants to use not XGBoost, but LightGBM, or CatBoost, which are different implementations of gradient boosting – you're free to use them in your project as well. You can use FastAPI instead of Flask, for example. You can use Poetry instead of Pipenv. You can use Conda instead of Pipenv. You're basically free to explore to use what you want and play with different tools, just be sure to document that.",Machine Learning Zoomcamp,2021,
2,968800,I have been catching up and have been doing homeworks but not tuning in. Will I be able to turn in the final project?,Data Engineering Zoomcamp,2022,9540161678567591936798838013,954016,"Alexey\nYes, you will be. You can submit the project. As we promised at the very beginning, we will give certificates based on project completion, not based on the homework. So if you're catching up right now, and you didn't do all the homework, it's fine. If you are caught up and you know what to do, you can take part in a project.",Data Engineering Zoomcamp,2022,
3,688404,Could you please explain what code we should load to GitHub?,Data Engineering Zoomcamp,2022,1986616298986865773699141765,3699,"Alexey\nI think the question refers to the homework form where we asked you to submit your code. Here, you can just create a repo in GitHub, create a folder there, homework 1, put your SQL files there, and just leave a link to this file when you submit. It can be anything else. If you are more of a GitLab kind of person, you can create a report in GitLab. But it should be something publicly accessible. Put your SQL queries there and submit it with the form.",Data Engineering Zoomcamp,2022,
4,63921,Is it just me or does the model have really bad accuracy and generates random images? I tested and identified cats as dogs.,Machine Learning Zoomcamp,2021,754877604487912439858915425941,858915,"Dmitry\nIt's fine, because this is the showcase purpose. We said to optimize the number of books, for example, the validation steps. Also, the architecture can be a bit changed. For sure, we can create a much better model.\nAlexey\nFor this model, I guess, if we wanted to build a cat vs dog classifier that works well, we would use a pre-trained neural network and fine tune it, right?\nDmitry\nYeah. That’s one of the options.\nAlexey\nOr just get a lot of different pictures of cats and dogs and train from scratch. Are there any other options?\nDmitry\nOther options would be to try to tweak the parameters and all those things.\nAlexey\nSo you think you can train a model from scratch – the one we had – without using a pre-trained neural network and then have a decent accuracy?\nDmitry\nYeah, but it's also the question of what “decent” is. \nAlexey\nAt least 80%, for example.\nDmitry\nAround 80, I think yes. But we need to remember that the pretrained will always have the benefit.\nAlexey\nIf we think about this – right now, this model that we have, has an accuracy of 65% on validation, which is just slightly better than a random guess. Well, it's not slightly better, but it's 10-15% better. There are chances that, you pick a few random images and in both of these cases, it will be incorrect just because it's 65%. Right? \nDmitry\nYeah, but we didn't do any…\nAlexey\nYeah, I'm just saying why this can happen. This could be the reason – 65% is not the best accuracy in the world.",Machine Learning Zoomcamp,2021,
5,521365,Are we distant from what a data engineer’s day-to-day work looks like? What are the main differences?,Data Engineering Zoomcamp,2022,75919830918337445449040131069,830918,"Alexey\nI guess the question is asking about the difference between what we are doing here in the course and the actual work of a data engineer.\nVictoria\nMore problems. [chuckles] More troubleshooting, probably.\nAnkush\nI think the course is made in a way to set you up for becoming a data engineer or tackling all the problems of a data engineer. Mostly it will revolve around the technologies that we are talking about.\nAlexey\nSo what are the main differences, except for more troubleshooting? I imagined that there are analysts who come to you or other people with ad hoc queries, saying, “Hey, where is this data? Where is this table?” And then you have to help them. What are the other main differences?\nAnkush\nI would say complexity. Definitely.\nAlexey\nYeah, we have a relatively simple case, right? We only need to do one join. Well, two joins. We have a bunch of tables, but we don't need to join these big tables with each other. We only do a join with the location table, which is pretty small. But in practice, we often have two big tables that we need to join.\nVictoria\nYeah, we don't have complex business logic. Plus, you don't really do the setup all the time. You set up BigQuery once and then you maintain it, but you probably won’t be dealing with the service account that much.\nAlexey\nIdeally, there may be a team who deals with that. For example, at our work, we have a team who manages Airflow, so I don't need to worry about this. All I do is just write DAGs, commit them, and that's it. I never had to set up Airflow locally and worry about these things because it's managed. 
I think it’s the same with the data warehouse and other things.\nAnkush\nI think one more thing that everybody with data does is basically communicate outwards what the meaning of different columns is or what this ratio is versus that ratio. All that. I have to personally do a lot of that. What's your experience like?\nVictoria\nI would say the same, especially for business stakeholders. But DBT also has this data catalog part. So in the data team that helps quite a lot. Also we look at and work on the selection of an actual data catalog. So hopefully, there’s not too many questions.\nAlexey\nIn our case, for these questions we will have a catalog team. We have an internal catalog tool and they get all these questions, not data engineers.",Data Engineering Zoomcamp,2022,
6,74423,How is the data engineering market in Berlin? How can someone land a DE job in Berlin?,Data Engineering Zoomcamp,2022,159394243026838013478083296080,243026,"Victoria\nI think there's a lot. It's a little bit hard to get a data engineer, and that's also why I would assume there are a lot of openings, and you will have a lot of options. How can someone land a job? If you go to meetups or have contacts, that's probably a good way. Or just apply.\nSejal\nI agree. Actually, there are plenty of data engineering openings. When I was a data engineer in my previous role, I used to get invitations for job interviews almost every week from recruiters. There's an abundance for this role in the market. Definitely try it out.\nAlexey\nSince you changed your title to ML engineer, do you now get fewer or more invitations? You don't get anything? \nSejal\nOddly. [chuckles] \nAlexey\n[chuckles] Okay, so we see that being a data engineer is actually better, in terms of market demand.\nSejal\nYeah, I think engineering skills are way more in demand than data science and machine learning skills are.\nAlexey\nYeah, that's interesting. What happened to “the sexiest job of the 21st century”? Nobody wants to hire them now. [chuckles] We had an interesting discussion in one of the podcasts here with Ellen. Ellen was a data scientist and she became a data engineer. She is from Berlin and I think she also talks about how to get a job as a data engineer in Berlin specifically. I think the recommendation was to talk to consulting companies. \nThey have some sort of programs for coaching juniors, because the business model is to hire as many juniors as possible because they're cheap, and then you sell them to their client for a lot of money. So you need to have training programs so that the juniors are effective. Her recommendation was to try to find such a consultancy company and learn there. I hope it's an accurate summary of the episodes. 
Maybe I'm misinterpreting something.\nVictoria\nYeah, I think if you come from a non-data background, and you're starting with the course and all that, just bear in mind that you'll have to apply a lot and there will probably be a lot of rejections. At least that's how it was for me at the very beginning when I moved to Berlin. Because everything was kind of new and all that, I didn't have a job at the time, because I just moved to Berlin and then looked for one. \nTry to get as many interviews as possible, and then do the challenges and all that. That will train you a lot and will also help you know where the benchmark is, kind of. But yeah, there are a lot of offers, so just start applying. Start applying, start talking to other people, probably to other data engineers, as well. That's also good. About the consultancy, I don't know. I've never used anything like that.\nAlexey\nI worked at an outsourcing company, but it was before I moved to Berlin. But I can confirm that in the company where I worked, they did have quite a good training process. They would have a coach that was assigned to me who would help me with everything. There was also some sort of bootcamp program. Then after that, they would try to sell me to a client and they would coach me on how to pass an interview with a client. Once a client likes me, I just start working. This is quite common, I think, in outsourcing and consulting companies. For me, that was a very long time ago, so I don't know how that actually works now. [chuckles]\nSejal\nThat's actually an interesting model. It's good that companies are actually putting that much effort into coaching the employees to be honed to certain projects. That's nice.\nAlexey\nYeah, but the thing is, the business model is buying somebody for cheap and selling them for a lot of money and then taking the margin. So, unfortunately, sometimes (at least this was in my case) the salary was pretty low. 
I remember that they didn't have much left after paying for my flat. So be careful. [chuckles] But at least that was a good experience. After one year, I could sell this experience for “market standards,” let's say. So depending on how young you are, how much you want to get this experience, I think sometimes it's worth actually agreeing to a lower salary in exchange for experience. Then you get this experience, and then your value on the market becomes higher. \nAnother shameless plug, but this is something we actually talked about with Juan Pablo. He suggested looking for small gigs that don't pay a lot, but this is experience you can sell afterwards. This is somewhat similar to what we mentioned, even though he's more into analytics than data engineering, but I think the tips that he shared when he was looking for a job – I think they're pretty universal. There was also a funny story that he was actually driving an Uber to be able to survive while he was trying to study things in order to switch. That's a cool one as well.",Data Engineering Zoomcamp,2022,
7,425250,"We used tf.keras.layers.GlobalAveragePooling2D in our model. What is the difference in using GlobalAveragePooling1D, 2D, and 3D in pooling layers?",Machine Learning Zoomcamp,2021,200161152693478055347048912439,347048,"Let's say we have a two dimensional thing (2D) and we want to turn it into 1D. For this case, we need to use Pooling2D. I'm not sure right now and I want to quickly check it. This is the reason. Remember, when I was showing you how I usually go about defining the model? I build it sort of layer by layer and every time I add one more layer, I do a model that predicts to see what the output is. And then based on that, I see what kind of pooling layer, for example, I need. Let's say if we want to turn 2D into 1D, I think we need to use 2D pooling. I'm not exactly sure whether it's 2D or 1D. I think 1D pooling is needed when we have something one dimensional and we want to turn it into just one value, then we use 1D pooling. And then 3D pooling is when we have a three dimensional thing and we want to turn it into a one-dimensional thing. [image 2] These things always confuse me, to be honest. \nThat's why I follow this step by step and then I just try different poolings. Then I want to make sure that, if I convert an image into a vector presentation, I have something one-dimensional. Usually the size is the number of images times something. So it's more like a 2D array. For each image, I have a one-dimensional vector. Then based on that, I try different poolings. Sometimes, you can also just flatten. Flatten takes whatever D – let's say you have KD – and you want to turn it into 1D, you use flatten. There are many different options. I think it's clear what the difference is. I might not remember exactly when to use 1D, 2D, and so on, but the difference is what kind of input they take and what kind of output they produce. 
If it's just a cube, then it's 2D, if it's a hypercube with three dimensions, that it's something else.",Machine Learning Zoomcamp,2021,85248_qa9_ml_zoomcamp_office_hours__week_11_pic2.jpg
8,278678,"If you pickle load an object, are all the methods associated with that object also loaded?",Machine Learning Zoomcamp,2021,772478890166214199912439655484,772478,"Kind of. Actually, pickle expects that you have the code that, let's say, if your pickle and object are of a particular class. Let's say this class could be part of the package Scikit Learn linear (the class is logistic regression). When pickle loads an object, it expects that this class is present in your Python. It loads all these methods from that code – from that module. It expects that the code will be there and it just basically loads the data and the behavior is stored in the code. So you have to have the code.",Machine Learning Zoomcamp,2021,
9,184224,How's ranking different from the regression model?,Machine Learning Zoomcamp,2021,764907548146643810934691642829,934691,"I think in one of the videos, I talked about the three different subtypes of supervised learning, integration, classification, and tracking. About ranking – [image for reference] for regression, Let’s say you have a car and then we extract your feature matrix from the car. You apply the formula (the g function that you trained) and you get a prediction, like price. You do this for one object. You get one car and you do a prediction for one car. \nWhen it comes to ranking, you don't have one object. Of course, you can have multiple cars and apply this to multiple cars. But the core difference with ranking is – in ranking, let's say you have some results from Google. So you have some query and you have some results from Google. There could be, let's say, results from 0 to 99. And you need to apply a function to each of the elements. This is a group. And you need to apply this model that you have (the model g) to each element of this group. Then it produces a ranked list. Let's say, you apply g(x0), and you apply g(x99) this x99 is the row here – the results of the query. Then what you do is rerank the output – you rerank all these results using this function. \nWhat I'm trying to say is, here, you look at the group and you try to see how good the ranking is within the group. While in the case of simple regression, you have more standalone objects, sort of. But with ranking, it always has to be a group. Of course, when it comes to ranking, this g can also be a regression or it can also be classification. But then you always need to think about the other elements in the group. I hope that answers the question. \nThis is not something we will go into detail about – not in this course, for sure. This is just for you to know that it's a little bit different. Maybe this is something you want to do for your project or explore for the article. 
This is totally fine.",Machine Learning Zoomcamp,2021,966609_qa6_01.jpg


In [None]:
print(f"Rows and Columns -> {train_merged_df.shape}")

Rows and Columns -> (399, 10)


In [None]:
for column in train_merged_df.columns:
  print()
  column_analysis(train_merged_df, column)
  print()
  print(25 * '*')


column_name: question_id
type: int64
unique_count: 396
missing_values: 0
duplicate_values: 3

*************************

column_name: question
type: object
unique_count: 396
missing_values: 0
duplicate_values: 3

*************************

column_name: course_question
type: object
unique_count: 2
missing_values: 0
duplicate_values: 397

*************************

column_name: year_question
type: int64
unique_count: 2
missing_values: 0
duplicate_values: 397

*************************

column_name: candidate_answers
type: object
unique_count: 397
missing_values: 0
duplicate_values: 2

*************************

column_name: answer_id
type: int64
unique_count: 396
missing_values: 0
duplicate_values: 3

*************************

column_name: answer
type: object
unique_count: 396
missing_values: 0
duplicate_values: 3

*************************

column_name: course_answer
type: object
unique_count: 2
missing_values: 0
duplicate_values: 397

*************************

column_name: year_answ
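
The per-column summary printed above can be reproduced with a few pandas calls. The helper below is a hypothetical stand-in, not the notebook's own `column_analysis`; the name `column_analysis_sketch` and the toy frame are assumptions for illustration. It counts a value as a duplicate when it repeats an earlier row, which matches the figures above (399 rows, 396 unique, 3 duplicates).

```python
import pandas as pd


def column_analysis_sketch(df: pd.DataFrame, column: str) -> dict:
    """Hypothetical helper: summarise one column in the same shape as the output above."""
    series = df[column]
    summary = {
        "column_name": column,
        "type": str(series.dtype),
        "unique_count": int(series.nunique()),              # distinct non-missing values
        "missing_values": int(series.isna().sum()),         # NaN / None entries
        "duplicate_values": int(series.duplicated().sum()), # rows repeating an earlier value
    }
    for key, value in summary.items():
        print(f"{key}: {value}")
    return summary


# Toy data (made up): five rows, two distinct years
toy_df = pd.DataFrame({"year_question": [2021, 2021, 2022, 2022, 2021]})
result = column_analysis_sketch(toy_df, "year_question")
```

With two distinct values in five rows, three rows repeat an earlier value, so `duplicate_values` is rows minus uniques whenever nothing is missing.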

In [None]:
columns_list = ['course_question', 'year_question', 'course_answer', 'year_answer']
for column in columns_list:
  print_column_values(train_merged_df, column)
  print()

Unique values in 'course_question': ['Machine Learning Zoomcamp', 'Data Engineering Zoomcamp']

Unique values in 'year_question': [2021, 2022]

Unique values in 'course_answer': ['Machine Learning Zoomcamp', 'Data Engineering Zoomcamp']

Unique values in 'year_answer': [2021, 2022]



In [None]:
# Rows whose question_id repeats an earlier row (the first occurrence is not flagged)
duplicate_rows = train_merged_df[train_merged_df.duplicated('question_id')]

# Number of repeated question_id rows
num_duplicate_rows = duplicate_rows.shape[0]
num_duplicate_rows

3

In [None]:
# Find duplicate question_ids
duplicate_question_df = train_merged_df[train_merged_df.duplicated('question_id', keep=False)]

# Display the duplicate question_ids
data_table.DataTable(duplicate_question_df, include_index=False, num_rows_per_page=10)

Unnamed: 0,question_id,question,course_question,year_question,candidate_answers,answer_id,answer,course_answer,year_answer,attachments_files
72,647840,How many submissions were there this week?,Machine Learning Zoomcamp,2021,912439425941105368276086604487,276086,"I think around 200? 204, I think. Something like this.",Machine Learning Zoomcamp,2021,
159,647840,How many submissions were there this week?,Machine Learning Zoomcamp,2021,912439105368964992604487425941,964992,"Alexey\n246, which is 150 less than last week. Last week, it was 404. I don’t know if this homework was maybe a bit more difficult. But the next one is also fun. I think if you liked the homework we prepared for this week, the next one will also be fun. Right, Dmitry? \nDmitry\nYeah, for sure.",Machine Learning Zoomcamp,2021,
261,172850,Can we use PyTorch instead of TensorFlow?,Machine Learning Zoomcamp,2021,774098912439214199641330425941,774098,"Yes, you can.",Machine Learning Zoomcamp,2021,
262,172850,Can we use PyTorch instead of TensorFlow?,Machine Learning Zoomcamp,2021,774098912439214199641330425941,774098,"Yes, you can.",Machine Learning Zoomcamp,2021,
263,631555,For rating data (number of stars) can we use a linear regression model?,Machine Learning Zoomcamp,2021,641330912439827511774098764907,774098,"Yes, you can.",Machine Learning Zoomcamp,2021,
264,631555,For rating data (number of stars) can we use a linear regression model?,Machine Learning Zoomcamp,2021,641330912439827511774098764907,774098,"Yes, you can.",Machine Learning Zoomcamp,2021,
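
The count above is 3 while the table lists 6 rows because the two cells use different `keep` settings of `DataFrame.duplicated`. A minimal sketch on a toy frame (the values are made up) shows the difference:

```python
import pandas as pd

# Toy frame with one repeated question_id (hypothetical values)
toy_df = pd.DataFrame({
    "question_id": [10, 20, 20, 30],
    "answer_id": [1, 2, 3, 4],
})

# Default keep='first': only the later occurrences are flagged
later_copies = toy_df.duplicated("question_id")

# keep=False: every row that shares its question_id with another row is flagged
all_copies = toy_df.duplicated("question_id", keep=False)

print(int(later_copies.sum()))  # rows that would be dropped
print(int(all_copies.sum()))    # rows worth inspecting side by side
```

The default is handy for counting what a later `drop_duplicates` would remove; `keep=False` is better for eyeballing the full groups.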


In [None]:
# Drop Column -> attachments_files
train_dataset = train_merged_df.drop(columns=['attachments_files'])

print(f"Rows and Columns -> {train_dataset.shape}")

Rows and Columns -> (399, 9)


In [None]:
train_dataset.columns

Index(['question_id', 'question', 'course_question', 'year_question', 'candidate_answers',
       'answer_id', 'answer', 'course_answer', 'year_answer'],
      dtype='object')

In [None]:
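# Drop rows with a repeated question_id, keeping the last occurrence of each.
# Note: question_id 647840 appears with two different answers (276086 and 964992);
# keep='last' retains the 964992 row.
train_dataset = train_dataset.drop_duplicates(subset='question_id', keep='last')

print(f"Shape of the Dataset after dropping duplicates -> {train_dataset.shape}")  # expect 396 rows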

Shape of the Dataset after dropping duplicates -> (396, 9)
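
Before dropping duplicates it can be worth checking whether the repeated rows agree on the answer, since `keep` then decides which label survives. The sketch below uses the `question_id` and `answer_id` pairs from the duplicate table shown earlier; the frame name `toy_df` is an assumption for illustration.

```python
import pandas as pd

# question_id / answer_id pairs copied from the duplicate rows displayed earlier
toy_df = pd.DataFrame({
    "question_id": [647840, 647840, 172850, 172850],
    "answer_id": [276086, 964992, 774098, 774098],
})

# Number of distinct answer_ids attached to each question_id
answers_per_question = toy_df.groupby("question_id")["answer_id"].nunique()

# question_ids whose duplicate rows disagree on the answer
conflicting_ids = answers_per_question[answers_per_question > 1].index.tolist()
print(conflicting_ids)

# keep='last' retains the final row for each question_id
deduplicated = toy_df.drop_duplicates(subset="question_id", keep="last")
print(deduplicated)
```

Only 647840 has conflicting answers here, so for that question the choice between `keep='first'` and `keep='last'` changes the training label.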


In [None]:
# Turn the comma-separated candidate IDs into a list of ID strings per row
train_dataset["candidate_answers"] = train_dataset["candidate_answers"].str.split(",")

In [None]:
# Note: list cells are written to CSV as their string representation
train_dataset.to_csv("training_dataset.csv", index=False)
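
One thing to keep in mind when saving after the split: CSV has no list type, so each `candidate_answers` cell comes back as a plain string on reload. A minimal sketch of the round trip, assuming the raw column holds comma-separated IDs (inferred from the `str.split(",")` call) and using an in-memory buffer in place of `training_dataset.csv`:

```python
import ast
import io

import pandas as pd

# One toy row; the comma-separated raw format is an assumption
toy_df = pd.DataFrame({
    "question_id": [468946],
    "candidate_answers": ["641330,634887,912439,425941,642829"],
})
toy_df["candidate_answers"] = toy_df["candidate_answers"].str.split(",")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# Each cell is now a string that merely looks like a list
print(type(reloaded.loc[0, "candidate_answers"]))

# Parse the string back into a real Python list of ID strings
reloaded["candidate_answers"] = reloaded["candidate_answers"].apply(ast.literal_eval)
print(reloaded.loc[0, "candidate_answers"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing. The IDs stay strings after the split, so cast them with `int` if they need to be compared against the integer `answer_id` column.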

In [None]:
data_table.DataTable(train_dataset, include_index=False, num_rows_per_page=2)

Unnamed: 0,question_id,question,course_question,year_question,candidate_answers,answer_id,answer,course_answer,year_answer
0,79062,"For categorical target set, where the distribution is imbalanced (for example, 90/10) what approach should be used?",Machine Learning Zoomcamp,2021,"[156400, 754877, 105368, 643810, 912439]",156400,"Alexey\nShould we use something non-standard there or can we just go with the usual things we learned in the course?\nHamed\nYou just need to test different strategies. Something I noticed – if you have so many parse subclasses in your categorical [inaudible], you should be careful about using one-hot encoding. You might say you can use ordinal encoding, if your data in nature had some order. It will be useful. In my particular data, I couldn't have domain knowledge. I didn't know what the subclasses were, so I couldn't decide which strategy I should choose. But if you have the domain knowledge, that’s the key here, I think.",Machine Learning Zoomcamp,2021
1,468946,"Is there anything that we are not allowed to use? For example, if you want to use hyper-parameter tuning, are we allowed to use GridSearchCV, RandomizedSearchCV?",Machine Learning Zoomcamp,2021,"[641330, 634887, 912439, 425941, 642829]",634887,"No, I don't think there is anything you cannot use. I just want to ask you to document it, if there is something you use that is not a part of the course. For somebody who will be reviewing your project, it will be new information. So just give them some context and explain, “I used GridSearchCV because this is easier and this is what you can do.” Then you describe it, and you give your peers a chance to learn from what you’re doing. This way, they will not be lost and will be able to grade your homework. If you do that, you can use whatever you want. \nFrom the materials in the course, like if you use logistic regression, or rich regression, or XGBoost, or random forest, then you don't need to really go into details when documenting because I would expect that people understand that without much explanation. Maybe it's also a good idea if somebody, let's say, wants to use not XGBoost, but LightGBM, or CatBoost, which are different implementations of gradient boosting – you're free to use them in your project as well. You can use FastAPI instead of Flask, for example. You can use Poetry instead of Pipenv. You can use Conda instead of Pipenv. You're basically free to explore to use what you want and play with different tools, just be sure to document that.",Machine Learning Zoomcamp,2021
2,968800,I have been catching up and have been doing homeworks but not tuning in. Will I be able to turn in the final project?,Data Engineering Zoomcamp,2022,"[954016, 167856, 75919, 36798, 838013]",954016,"Alexey\nYes, you will be. You can submit the project. As we promised at the very beginning, we will give certificates based on project completion, not based on the homework. So if you're catching up right now, and you didn't do all the homework, it's fine. If you are caught up and you know what to do, you can take part in a project.",Data Engineering Zoomcamp,2022
3,688404,Could you please explain what code we should load to GitHub?,Data Engineering Zoomcamp,2022,"[198661, 629898, 686577, 3699, 141765]",3699,"Alexey\nI think the question refers to the homework form where we asked you to submit your code. Here, you can just create a repo in GitHub, create a folder there, homework 1, put your SQL files there, and just leave a link to this file when you submit. It can be anything else. If you are more of a GitLab kind of person, you can create a report in GitLab. But it should be something publicly accessible. Put your SQL queries there and submit it with the form.",Data Engineering Zoomcamp,2022
4,63921,Is it just me or does the model have really bad accuracy and generates random images? I tested and identified cats as dogs.,Machine Learning Zoomcamp,2021,"[754877, 604487, 912439, 858915, 425941]",858915,"Dmitry\nIt's fine, because this is the showcase purpose. We said to optimize the number of books, for example, the validation steps. Also, the architecture can be a bit changed. For sure, we can create a much better model.\nAlexey\nFor this model, I guess, if we wanted to build a cat vs dog classifier that works well, we would use a pre-trained neural network and fine tune it, right?\nDmitry\nYeah. That’s one of the options.\nAlexey\nOr just get a lot of different pictures of cats and dogs and train from scratch. Are there any other options?\nDmitry\nOther options would be to try to tweak the parameters and all those things.\nAlexey\nSo you think you can train a model from scratch – the one we had – without using a pre-trained neural network and then have a decent accuracy?\nDmitry\nYeah, but it's also the question of what “decent” is. \nAlexey\nAt least 80%, for example.\nDmitry\nAround 80, I think yes. But we need to remember that the pretrained will always have the benefit.\nAlexey\nIf we think about this – right now, this model that we have, has an accuracy of 65% on validation, which is just slightly better than a random guess. Well, it's not slightly better, but it's 10-15% better. There are chances that, you pick a few random images and in both of these cases, it will be incorrect just because it's 65%. Right? \nDmitry\nYeah, but we didn't do any…\nAlexey\nYeah, I'm just saying why this can happen. This could be the reason – 65% is not the best accuracy in the world.",Machine Learning Zoomcamp,2021
5,521365,Are we distant from what a data engineer’s day-to-day work looks like? What are the main differences?,Data Engineering Zoomcamp,2022,"[75919, 830918, 337445, 449040, 131069]",830918,"Alexey\nI guess the question is asking about the difference between what we are doing here in the course and the actual work of a data engineer.\nVictoria\nMore problems. [chuckles] More troubleshooting, probably.\nAnkush\nI think the course is made in a way to set you up for becoming a data engineer or tackling all the problems of a data engineer. Mostly it will revolve around the technologies that we are talking about.\nAlexey\nSo what are the main differences, except for more troubleshooting? I imagined that there are analysts who come to you or other people with ad hoc queries, saying, “Hey, where is this data? Where is this table?” And then you have to help them. What are the other main differences?\nAnkush\nI would say complexity. Definitely.\nAlexey\nYeah, we have a relatively simple case, right? We only need to do one join. Well, two joins. We have a bunch of tables, but we don't need to join these big tables with each other. We only do a join with the location table, which is pretty small. But in practice, we often have two big tables that we need to join.\nVictoria\nYeah, we don't have complex business logic. Plus, you don't really do the setup all the time. You set up BigQuery once and then you maintain it, but you probably won’t be dealing with the service account that much.\nAlexey\nIdeally, there may be a team who deals with that. For example, at our work, we have a team who manages Airflow, so I don't need to worry about this. All I do is just write DAGs, commit them, and that's it. I never had to set up Airflow locally and worry about these things because it's managed. 
I think it’s the same with the data warehouse and other things.\nAnkush\nI think one more thing that everybody with data does is basically communicate outwards what the meaning of different columns is or what this ratio is versus that ratio. All that. I have to personally do a lot of that. What's your experience like?\nVictoria\nI would say the same, especially for business stakeholders. But DBT also has this data catalog part. So in the data team that helps quite a lot. Also we look at and work on the selection of an actual data catalog. So hopefully, there’s not too many questions.\nAlexey\nIn our case, for these questions we will have a catalog team. We have an internal catalog tool and they get all these questions, not data engineers.",Data Engineering Zoomcamp,2022
6,74423,How is the data engineering market in Berlin? How can someone land a DE job in Berlin?,Data Engineering Zoomcamp,2022,"[159394, 243026, 838013, 478083, 296080]",243026,"Victoria\nI think there's a lot. It's a little bit hard to get a data engineer, and that's also why I would assume there are a lot of openings, and you will have a lot of options. How can someone land a job? If you go to meetups or have contacts, that's probably a good way. Or just apply.\nSejal\nI agree. Actually, there are plenty of data engineering openings. When I was a data engineer in my previous role, I used to get invitations for job interviews almost every week from recruiters. There's an abundance for this role in the market. Definitely try it out.\nAlexey\nSince you changed your title to ML engineer, do you now get fewer or more invitations? You don't get anything? \nSejal\nOddly. [chuckles] \nAlexey\n[chuckles] Okay, so we see that being a data engineer is actually better, in terms of market demand.\nSejal\nYeah, I think engineering skills are way more in demand than data science and machine learning skills are.\nAlexey\nYeah, that's interesting. What happened to “the sexiest job of the 21st century”? Nobody wants to hire them now. [chuckles] We had an interesting discussion in one of the podcasts here with Ellen. Ellen was a data scientist and she became a data engineer. She is from Berlin and I think she also talks about how to get a job as a data engineer in Berlin specifically. I think the recommendation was to talk to consulting companies. \nThey have some sort of programs for coaching juniors, because the business model is to hire as many juniors as possible because they're cheap, and then you sell them to their client for a lot of money. So you need to have training programs so that the juniors are effective. Her recommendation was to try to find such a consultancy company and learn there. I hope it's an accurate summary of the episodes. 
Maybe I'm misinterpreting something.\nVictoria\nYeah, I think if you come from a non-data background, and you're starting with the course and all that, just bear in mind that you'll have to apply a lot and there will probably be a lot of rejections. At least that's how it was for me at the very beginning when I moved to Berlin. Because everything was kind of new and all that, I didn't have a job at the time, because I just moved to Berlin and then looked for one. \nTry to get as many interviews as possible, and then do the challenges and all that. That will train you a lot and will also help you know where the benchmark is, kind of. But yeah, there are a lot of offers, so just start applying. Start applying, start talking to other people, probably to other data engineers, as well. That's also good. About the consultancy, I don't know. I've never used anything like that.\nAlexey\nI worked at an outsourcing company, but it was before I moved to Berlin. But I can confirm that in the company where I worked, they did have quite a good training process. They would have a coach that was assigned to me who would help me with everything. There was also some sort of bootcamp program. Then after that, they would try to sell me to a client and they would coach me on how to pass an interview with a client. Once a client likes me, I just start working. This is quite common, I think, in outsourcing and consulting companies. For me, that was a very long time ago, so I don't know how that actually works now. [chuckles]\nSejal\nThat's actually an interesting model. It's good that companies are actually putting that much effort into coaching the employees to be honed to certain projects. That's nice.\nAlexey\nYeah, but the thing is, the business model is buying somebody for cheap and selling them for a lot of money and then taking the margin. So, unfortunately, sometimes (at least this was in my case) the salary was pretty low. 
I remember that they didn't have much left after paying for my flat. So be careful. [chuckles] But at least that was a good experience. After one year, I could sell this experience for “market standards,” let's say. So depending on how young you are, how much you want to get this experience, I think sometimes it's worth actually agreeing to a lower salary in exchange for experience. Then you get this experience, and then your value on the market becomes higher. \nAnother shameless plug, but this is something we actually talked about with Juan Pablo. He suggested looking for small gigs that don't pay a lot, but this is experience you can sell afterwards. This is somewhat similar to what we mentioned, even though he's more into analytics than data engineering, but I think the tips that he shared when he was looking for a job – I think they're pretty universal. There was also a funny story that he was actually driving an Uber to be able to survive while he was trying to study things in order to switch. That's a cool one as well.",Data Engineering Zoomcamp,2022
7,425250,"We used tf.keras.layers.GlobalAveragePooling2D in our model. What is the difference in using GlobalAveragePooling1D, 2D, and 3D in pooling layers?",Machine Learning Zoomcamp,2021,"[200161, 152693, 478055, 347048, 912439]",347048,"Let's say we have a two dimensional thing (2D) and we want to turn it into 1D. For this case, we need to use Pooling2D. I'm not sure right now and I want to quickly check it. This is the reason. Remember, when I was showing you how I usually go about defining the model? I build it sort of layer by layer and every time I add one more layer, I do a model that predicts to see what the output is. And then based on that, I see what kind of pooling layer, for example, I need. Let's say if we want to turn 2D into 1D, I think we need to use 2D pooling. I'm not exactly sure whether it's 2D or 1D. I think 1D pooling is needed when we have something one dimensional and we want to turn it into just one value, then we use 1D pooling. And then 3D pooling is when we have a three dimensional thing and we want to turn it into a one-dimensional thing. [image 2] These things always confuse me, to be honest. \nThat's why I follow this step by step and then I just try different poolings. Then I want to make sure that, if I convert an image into a vector presentation, I have something one-dimensional. Usually the size is the number of images times something. So it's more like a 2D array. For each image, I have a one-dimensional vector. Then based on that, I try different poolings. Sometimes, you can also just flatten. Flatten takes whatever D – let's say you have KD – and you want to turn it into 1D, you use flatten. There are many different options. I think it's clear what the difference is. I might not remember exactly when to use 1D, 2D, and so on, but the difference is what kind of input they take and what kind of output they produce. 
If it's just a cube, then it's 2D, if it's a hypercube with three dimensions, that it's something else.",Machine Learning Zoomcamp,2021
8,278678,"If you pickle load an object, are all the methods associated with that object also loaded?",Machine Learning Zoomcamp,2021,"[772478, 890166, 214199, 912439, 655484]",772478,"Kind of. Actually, pickle expects that you have the code that, let's say, if your pickle and object are of a particular class. Let's say this class could be part of the package Scikit Learn linear (the class is logistic regression). When pickle loads an object, it expects that this class is present in your Python. It loads all these methods from that code – from that module. It expects that the code will be there and it just basically loads the data and the behavior is stored in the code. So you have to have the code.",Machine Learning Zoomcamp,2021
9,184224,How's ranking different from the regression model?,Machine Learning Zoomcamp,2021,"[764907, 548146, 643810, 934691, 642829]",934691,"I think in one of the videos, I talked about the three different subtypes of supervised learning, integration, classification, and tracking. About ranking – [image for reference] for regression, Let’s say you have a car and then we extract your feature matrix from the car. You apply the formula (the g function that you trained) and you get a prediction, like price. You do this for one object. You get one car and you do a prediction for one car. \nWhen it comes to ranking, you don't have one object. Of course, you can have multiple cars and apply this to multiple cars. But the core difference with ranking is – in ranking, let's say you have some results from Google. So you have some query and you have some results from Google. There could be, let's say, results from 0 to 99. And you need to apply a function to each of the elements. This is a group. And you need to apply this model that you have (the model g) to each element of this group. Then it produces a ranked list. Let's say, you apply g(x0), and you apply g(x99) this x99 is the row here – the results of the query. Then what you do is rerank the output – you rerank all these results using this function. \nWhat I'm trying to say is, here, you look at the group and you try to see how good the ranking is within the group. While in the case of simple regression, you have more standalone objects, sort of. But with ranking, it always has to be a group. Of course, when it comes to ranking, this g can also be a regression or it can also be classification. But then you always need to think about the other elements in the group. I hope that answers the question. \nThis is not something we will go into detail about – not in this course, for sure. This is just for you to know that it's a little bit different. 
Maybe this is something you want to do for your project or explore for the article. This is totally fine.",Machine Learning Zoomcamp,2021


## EDA for Test Data


`year`: 2022, 2023

`course`: Data Engineering Zoomcamp, Machine Learning Zoomcamp



### Test Data

In [None]:
print(f"Test questions: rows and columns -> {test_questions_df.shape}")
print(f"Test answers: rows and columns -> {test_answers_df.shape}")

Test questions: rows and columns -> (516, 5)
Test answers: rows and columns -> (516, 5)


In [None]:
## Test Dataset
data_table.DataTable(test_questions_df, include_index=False, num_rows_per_page=10)

Unnamed: 0,question_id,question,course,year,candidate_answers
0,707,How much of an effort would it be to use AWS instead of GCP for assignments?,Data Engineering Zoomcamp,2023,"336232,337669,258304,47681,767296"
1,534450,Can you talk about linear regression and regularization?,Machine Learning Zoomcamp,2022,"231208,282072,86769,573165,138373"
2,996163,"Can you please explain the Python Black setup in Visual Studio Code? Also, can you explain good Python coding standards as you write docstrings and type strings?",Data Engineering Zoomcamp,2023,"571892,816559,47681,337669,336232"
3,860215,How many portfolio projects apart from the course are needed for getting a job?,Machine Learning Zoomcamp,2022,"643931,988549,918931,235894,608866"
4,980124,Can you talk more about the final project? What should we be thinking about now to prepare us?,Data Engineering Zoomcamp,2023,"384381,337669,258304,47681,747722"
5,715450,"Is RMSE (vs MAPE or MAE, etc.) a common metric to use for regression models? Would it be a good metric for optimization in retail store planning forecasts?",Machine Learning Zoomcamp,2022,"917151,520235,33221,383875,159619"
6,83137,7 points for not posting on social media is harsh. I'm recovering from mental health issues and try to avoid social media if I can. Please reconsider.,Machine Learning Zoomcamp,2022,"270092,514296,231208,282072,957623"
7,492004,"With evaluating our model in validation, what to do if our validation model is high accuracy and validation on test is very low?",Machine Learning Zoomcamp,2022,"465132,520235,235894,918931,967449"
8,396112,"I read a book that said the data engineering lifecycle is generation, storage, ingestion, transforming, and serving data. Do we do ingestion because we use an open dataset?",Data Engineering Zoomcamp,2023,"125560,742225,47681,8144,876269"
9,950910,"If we consider professional certification for getting into the industry, which one would you recommend?",Data Engineering Zoomcamp,2023,"337669,998582,179159,419700,985321"


In [None]:
for column in test_questions_df.columns:
  print()
  column_analysis(test_questions_df, column)
  print()
  print(25 * '*')


column_name: question_id
type: int64
unique_count: 514
missing_values: 0
duplicate_values: 2

*************************

column_name: question
type: object
unique_count: 514
missing_values: 0
duplicate_values: 2

*************************

column_name: course
type: object
unique_count: 2
missing_values: 0
duplicate_values: 514

*************************

column_name: year
type: int64
unique_count: 2
missing_values: 0
duplicate_values: 514

*************************

column_name: candidate_answers
type: object
unique_count: 516
missing_values: 0
duplicate_values: 0

*************************


In [None]:
test_columns = ['course', 'year']
for column in test_columns:
  print_column_values(test_questions_df, column)
  print()

Unique values in 'course': ['Data Engineering Zoomcamp', 'Machine Learning Zoomcamp']

Unique values in 'year': [2023, 2022]



In [None]:
data_table.DataTable(test_answers_df, include_index=False, num_rows_per_page=2)

Unnamed: 0,answer_id,answer,course,year,attachments_files
0,767296,"Alexey\nProbably more than you want to put in. I mean, if you have time, why not? But we will not be able to support you. I struggle to come up with an estimate. Does anyone here on this call have an estimate?\nVictoria\nI think the hardest part is – you shouldn't be a data engineer if you’re taking this course. You shouldn't have the knowledge that we're teaching. And if you don't have the knowledge, you're trying to learn it, and then everything is showing you something else. On top of that, you want to learn something new on your own that you won't have support for. That's going to be really hard. If you really want AWS because they use AWS at your work, they're going to help you with that one. It's probably not going to be worth it. It's going to be very, very stressful. I would add at least two hours on top of the normal hours.",Data Engineering Zoomcamp,2023,
1,573165,"Yes, I can. There is actually an entire module about that. You probably mean something specific that you didn't understand. Maybe ask about that in Slack.",Machine Learning Zoomcamp,2022,
2,571892,"Jeff\nI can try. I like Black a lot. If you just Google “Python Black” and go to the GitHub readme, it’s become super popular in the last couple of years for automatically formatting your code. It is really nice. Mine is set up in Visual Studio Code, so when I hit “save,” it automatically formats. As it says, it’s an uncompromising code formatter. I think it says here about asking Henry Ford for color back in the day. Is that right? It's colored cars or color telephones from Alexander Graham Bell or somebody, and it's like, “They can have any color they want as long as it's black.” He wasn't going to compromise. You don't have to think about it now. It's easy. Just go right ahead and get your gear code formatted automatically. So it’s things like two lines after function or before – that kind of stuff just happens automatically with it. Getting it set up in VS Code can be a little tricky sometimes. But there are guides to that. Googling is what I do for that, usually.\nAlexey\nI just wanted to do a shameless plug, because we have another course called MLOps Zoomcamp. By the way, there is also a Prefect model there. One of the things there is best practices. In best practices, we have this video called Code quality: linting and formatting. It does not show how to integrate Black with VS Code, I think. I don't remember if we do this. But if you want to learn more about testing, and Black, and other things like pre-commit hooks, make files, and so on – you can check this out.\nJeff\nYeah, it's not too tricky to set up. It's like pip install a package in your environment and then you have to give it a path maybe or set one setting in VS code. If all goes well, cross your fingers, it should just work after you reload it. Not always the case. If you have trouble, a lot of people use it, so there are a lot of good resources online. Black is great. This is an awesome set of some resources here on best practices. Type hints are just getting more and more popular. 
They're very helpful, so that people know what kind – especially with autocomplete and little pop-up type – things like that in Visual Studio Code and other code editors. You can see what kind of argument type you should put in and then Prefect uses that information as well, in our flows, for example, to make sure that if it's a block in the UI or a parameter in the UI – it'll be smart. It'll be like, “Oh, is this a number? Okay.” It will give you options to put in numbers. “Is this a different kind of form field?” It'll have different options. It also can then do some validation to make sure that people actually put in something that conforms to that type-in. Python is slowly getting more and more smart about how it handles typing and newer versions keep adding more functionality. Type hints are nice to use. It takes a little bit of writing, but it makes your docstring shorter. The last thing that was asked about here was docstrings. It's great to have in every function to tell people what it's about. It's something that maybe you don't always do if you're in a hurry, but you should do it, especially if other people are going to read the code. Code is read like 20 times more often than it's written, or however you translate that – some stat. So do it. It's so helpful for you in the future and it's helpful for other people in the future, who are going to read your code to see, “What were you thinking? What is the purpose of this function?” Keep your function small, explain it in your doc string – it's good stuff. Then it shows up in your code editor, if you're lucky (if you have a good code editor). That's all to say about that.\nAlexey\nDo you know any resources where people can learn about setting up? Or learning more about these things, like good Python coding standards? What I showed is obviously a good resource, but it does not cover all these things that this question asks about.\nJeff\nIt's a good question. 
I do have a link to Google Style, or there are a couple different styles of docstrings. It seems like they're a little bit much these days, maybe. But there are links for different ways to do type hinting. I do have a few things if I look around for them. I don't have them at my fingertips right now. But Michael looks like maybe he's got one he shared there.\nMichael\nYes, this one is a little bit older, but it is great. It goes into using virtual environments, Poetry… there's a lot to unpack, but I think that's still pretty much the standard best practice at the moment.",Data Engineering Zoomcamp,2023,
3,988549,"Again, you’ll probably hate me soon for saying this, but the answer is “it depends”. Maybe it's zero, maybe it's one, maybe it's two. You never know. You just need to start interviewing and in parallel to that, get projects done. Maybe you’ll get lucky and get hired from the first interview. Probably not, but you will already start learning what companies need. Then, at the same time, you try to implement this and you see, “Okay. This is what companies care about. Let me use some of the technologies they want to see that I’ve used.” You can look at the job descriptions to figure out what is important. You can talk to them when you have interviews and you can ask them, “Hey, what kind of technologies should I use for my portfolio projects to be a good fit for this position?” for example. \nIt never hurts to ask. And keep doing this. Maybe you will get a job on the third project, maybe to be on project zero, when you haven't even started doing this. I think two or three should be enough, but yeah. When I got my first job in data science, I had been working back then as a freelancer already in data science. I had some projects from past clients that I showed. But in that interview for the job I got, most of the time was spent talking about my Master’s thesis, which was about processing Wikipedia data. I was working with mathematics there and the interviewer for that job was really interested in this. So we spent most of the time talking about that project. So maybe a good answer to this question would be to have one project that is relevant for the company you interview, and then you will just spend most of the time of the interview discussing this project.",Machine Learning Zoomcamp,2022,
4,384381,"Alexey\nThe first thing about the dataset – what kind of dataset do you want to use? Or what kind of problem do you want to solve? Once you figure this out, then you're basically ready to start working on a project. Then in the project, you need to decide if you want to do streaming or batch. For batch, it's using things like Prefect, Spark, or DBT. For streaming, it’s using the materials from the last lecture (week 2). Once you decide that, you will just implement this and you will find all the information you need here in the week 7 project repo. Just go through this and if you have any questions left, let us know. Keep in mind that these are the criteria that other people (your peers) will use when evaluating your project. Perhaps you can already think about that and how you want to implement your project in such a way that you maximize the score you get from these criteria.",Data Engineering Zoomcamp,2023,
5,159619,"RMSE and all these metrics are good. I am not an expert in that. I would suggest going to our YouTube channel, where we recently had a talk just a few weeks ago called Probabilistic Demand - Forecasting at Scale by Hagop Dippel. Check it out. He also talks about metrics there and you will see what exactly to use – what kind of metrics to use to evaluate your models for this specific case.",Machine Learning Zoomcamp,2022,
6,270092,"I'm very sorry to hear about your mental health problems and I want to remind you that posting in social media is not required. You don't have to do this. In fact, many of the students from the previous iteration did not post anything on social media and still were able graduate from the course with a certificate. Since they did the projects, they ended up quite high on the leaderboard, which allowed them to be on the page with top 100 names. If this is what you're after – if you want to end up on that page – just keep on working, don't post on social media, and you'll be fine. \nDon't worry about the points because, again, nobody knows which of the hashes is you (only you know) and these points are virtual. So don't… it's not required to post on social media. If you don't feel like doing this, then don't. But I think it will be valuable for you. Maybe, after some time, when it becomes easier for you, I do recommend taking a look at social media and posting there. Not right now, but later it will be very valuable for your career.",Machine Learning Zoomcamp,2022,
7,967449,"When you have a very good score on validation, but a very low score on the test, it means your model became lucky. Remember, we had this explanation at the very first module – we had a lesson about model selection. Sometimes, the model can be lucky and it doesn't necessarily translate well to the test. It happens. What can also happen is that sometimes, validation and test datasets are different. For example, if you speed split your dataset by time – so you have a dataset for one year, and then for training, you use everything from January till September, then validation is October, November, and test is December. So you evaluate your model in validation and October, November looks great – but then we know that in December, it's Christmas time and many models that you trained during the normal months are different. For them, this December could be a surprise. So maybe you have something like that in the test data. It's normal. For these cases, you just need to think about how exactly you can build a validation dataset and training dataset in such a way that they are similar. For example, if you have a lot of data, then maybe you can use the previous year for validation or for testing. Something like this. It's all problem-specific. Usually it's an example of overfitting.",Machine Learning Zoomcamp,2022,
8,876269,"I don't understand the question, to be honest. We don't have the generation part. The generation part is what was done already for us by the New York Taxi Limousine Company. Storage – yeah, they also store it. They host the data. And yeah, we cover ingestion, transform, and serving with the dashboard.",Data Engineering Zoomcamp,2023,
9,179159,I would not recommend any certificates. Just focus on projects.,Data Engineering Zoomcamp,2023,


In [None]:
for column in test_answers_df.columns:
  print()
  column_analysis(test_answers_df, column)
  print()
  print(25 * '*')


column_name: answer_id
type: int64
unique_count: 515
missing_values: 0
duplicate_values: 1

*************************

column_name: answer
type: object
unique_count: 515
missing_values: 0
duplicate_values: 1

*************************

column_name: course
type: object
unique_count: 2
missing_values: 0
duplicate_values: 514

*************************

column_name: year
type: int64
unique_count: 2
missing_values: 0
duplicate_values: 514

*************************

column_name: attachments_files
type: object
unique_count: 8
missing_values: 508
duplicate_values: 508

*************************


In [None]:
for column in test_columns:
  print_column_values(test_answers_df, column)
  print()

Unique values in 'course': ['Data Engineering Zoomcamp', 'Machine Learning Zoomcamp']

Unique values in 'year': [2023, 2022]



In [None]:
# Checking for duplicate rows in the DataFrame
duplicate_rows = test_questions_df[test_questions_df.duplicated('question_id')]

# Number of duplicate rows
num_duplicate_rows = duplicate_rows.shape[0]
num_duplicate_rows

2

In [None]:
# Find duplicate question_ids
duplicate_question_test_df = test_questions_df[test_questions_df.duplicated('question_id', keep=False)]

# Display the duplicate question_ids
data_table.DataTable(duplicate_question_test_df, include_index=False, num_rows_per_page=10)

Unnamed: 0,question_id,question,course,year,candidate_answers
122,502581,I missed the last two weeks. What would you recommend I do in order to get up to speed with the course?,Data Engineering Zoomcamp,2023,47681336232337669724503258304
224,385696,How can we contribute to the course?,Data Engineering Zoomcamp,2023,85612547681258304125560630862
369,502581,I missed the last two weeks. What would you recommend I do in order to get up to speed with the course?,Data Engineering Zoomcamp,2023,33623225830447681547470337669
511,385696,How can we contribute to the course?,Data Engineering Zoomcamp,2023,25830463086285612547681366701


In [None]:
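# Drop duplicates; .copy() gives an independent frame, so later column
# assignments do not raise SettingWithCopyWarning
test_questions_dataset = test_questions_df.drop_duplicates(subset='question_id', keep='last').copy()
print(f"Shape of the Dataset after dropping duplicates -> {test_questions_dataset.shape}") # 514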

Shape of the Dataset after dropping duplicates -> (514, 5)
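
The two duplicated `question_id`s shown above come with different `candidate_answers` values, so `keep='last'` quietly chooses one candidate list over the other. The sketch below uses a toy frame (the IDs and candidate strings are made up, and `toy_df` is a hypothetical stand-in for `test_questions_df`) to show one way of flagging such conflicts before dropping rows.

```python
import pandas as pd

# Toy stand-in for test_questions_df; all values here are invented.
toy_df = pd.DataFrame({
    "question_id": [1, 2, 1, 3],
    "question": ["q1", "q2", "q1", "q3"],
    "candidate_answers": ["10,11", "20,21", "11,12", "30,31"],
})

# All rows whose question_id occurs more than once (every copy is kept).
dupes = toy_df[toy_df.duplicated("question_id", keep=False)]

# For each duplicated ID, count how many distinct candidate lists it carries.
distinct_lists = dupes.groupby("question_id")["candidate_answers"].nunique()
conflicting_ids = distinct_lists[distinct_lists > 1].index.tolist()

# keep='last' retains the later row for each duplicated ID.
deduped = toy_df.drop_duplicates(subset="question_id", keep="last")
```

If `conflicting_ids` is non-empty, it may be worth checking which candidate list actually contains the correct answer before deciding between `keep='first'` and `keep='last'`.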


In [None]:
test_questions_dataset["candidate_answers"] = test_questions_dataset["candidate_answers"].str.split(",")

In [None]:
data_table.DataTable(test_questions_dataset, include_index=False, num_rows_per_page=3)

Unnamed: 0,question_id,question,course,year,candidate_answers
0,707,How much of an effort would it be to use AWS instead of GCP for assignments?,Data Engineering Zoomcamp,2023,"[336232, 337669, 258304, 47681, 767296]"
1,534450,Can you talk about linear regression and regularization?,Machine Learning Zoomcamp,2022,"[231208, 282072, 86769, 573165, 138373]"
2,996163,"Can you please explain the Python Black setup in Visual Studio Code? Also, can you explain good Python coding standards as you write docstrings and type strings?",Data Engineering Zoomcamp,2023,"[571892, 816559, 47681, 337669, 336232]"
3,860215,How many portfolio projects apart from the course are needed for getting a job?,Machine Learning Zoomcamp,2022,"[643931, 988549, 918931, 235894, 608866]"
4,980124,Can you talk more about the final project? What should we be thinking about now to prepare us?,Data Engineering Zoomcamp,2023,"[384381, 337669, 258304, 47681, 747722]"
5,715450,"Is RMSE (vs MAPE or MAE, etc.) a common metric to use for regression models? Would it be a good metric for optimization in retail store planning forecasts?",Machine Learning Zoomcamp,2022,"[917151, 520235, 33221, 383875, 159619]"
6,83137,7 points for not posting on social media is harsh. I'm recovering from mental health issues and try to avoid social media if I can. Please reconsider.,Machine Learning Zoomcamp,2022,"[270092, 514296, 231208, 282072, 957623]"
7,492004,"With evaluating our model in validation, what to do if our validation model is high accuracy and validation on test is very low?",Machine Learning Zoomcamp,2022,"[465132, 520235, 235894, 918931, 967449]"
8,396112,"I read a book that said the data engineering lifecycle is generation, storage, ingestion, transforming, and serving data. Do we do ingestion because we use an open dataset?",Data Engineering Zoomcamp,2023,"[125560, 742225, 47681, 8144, 876269]"
9,950910,"If we consider professional certification for getting into the industry, which one would you recommend?",Data Engineering Zoomcamp,2023,"[337669, 998582, 179159, 419700, 985321]"


In [None]:
test_questions_dataset.to_csv('test_questions_dataset.csv', index=False)

In [None]:
# Checking for duplicate rows in the DataFrame
duplicate_rows = test_answers_df[test_answers_df.duplicated('answer_id')]

# Number of duplicate rows
num_duplicate_rows = duplicate_rows.shape[0]
num_duplicate_rows

1

In [None]:
# Find duplicate answer_ids
duplicate_answers_test_df = test_answers_df[test_answers_df.duplicated('answer_id', keep=False)]

# Display the duplicate answer_ids
data_table.DataTable(duplicate_answers_test_df, include_index=False, num_rows_per_page=10)

Unnamed: 0,answer_id,answer,course,year,attachments_files
37,774098,"Yes, you can.",Data Engineering Zoomcamp,2023,
73,774098,"Yes, you can.",Data Engineering Zoomcamp,2023,


In [None]:
# Drop Duplicates
test_answers_dataset = test_answers_df.drop_duplicates(subset = 'answer_id', keep = 'last')
print(f"Shape of the Dataset after dropping duplicates -> {test_answers_dataset.shape}") # 515

Shape of the Dataset after dropping duplicates -> (515, 5)


In [None]:
# Drop Column -> attachments_files
test_answers_dataset = test_answers_dataset.drop(columns=['attachments_files'])

print(f"Rows and Columns -> {test_answers_dataset.shape}")

Rows and Columns -> (515, 4)


In [None]:
test_answers_dataset.to_csv('test_answers_dataset.csv', index=False)

In [6]:
train_dataset = pd.read_csv("/content/training_dataset.csv")

In [11]:
train_dataset.shape

(396, 9)

In [8]:
test_questions_dataset = pd.read_csv("/content/test_questions_dataset.csv")

In [12]:
test_questions_dataset.shape

(514, 5)

In [9]:
test_answers_dataset = pd.read_csv("/content/test_answers_dataset.csv")

In [10]:
test_answers_dataset.shape

(515, 4)
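
One caveat with reloading from CSV: the `candidate_answers` column was split into Python lists before saving, and `to_csv` writes each list as its string representation. After `pd.read_csv`, the cells are therefore strings rather than lists, and iterating over one would yield single characters. The sketch below reproduces this on a toy frame (illustrative values, in-memory buffer instead of a file) and restores the lists with `ast.literal_eval`. It assumes the conversion is not already handled elsewhere in the notebook.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column (illustrative values only).
original = pd.DataFrame({
    "question_id": [707, 534450],
    "candidate_answers": [["336232", "47681"], ["231208", "86769"]],
})

# Round-trip through CSV in memory instead of writing a file.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each cell is the string "['336232', '47681']", not a list.
cell_type_after_reload = type(reloaded.loc[0, "candidate_answers"])

# ast.literal_eval safely parses that string back into a Python list.
reloaded["candidate_answers"] = reloaded["candidate_answers"].apply(ast.literal_eval)
```

The same step would apply to `train_dataset` if its `candidate_answers` column was saved in the same way.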

# MODELS

In [None]:
# Models to get Embeddings

## Hugging Face
# Baseline - MiniLM
MINILM = 'all-MiniLM-L6-v2'
minilm_model = SentenceTransformer(MINILM)
print(f"{MINILM} -> model is Ready")
print(30 * '#')

# BGE
BGE = 'BAAI/bge-large-en-v1.5'
bge_model = SentenceTransformer(BGE)
print(f"{BGE} -> model is Ready")
print(30 * '#')

all-MiniLM-L6-v2 -> model is Ready
##############################


BAAI/bge-large-en-v1.5 -> model is Ready
##############################


In [None]:
%%time
minilm_model.to(device) # Move model to GPU

print("Training Dataset")
minilm_train_q_emb = minilm_model.encode(train_dataset["question"].values, show_progress_bar = True)
print("Created question embeddings using Mini-LM for Training Dataset")
print(30 * '#')

minilm_train_ans_emb = minilm_model.encode(train_dataset["answer"].values, show_progress_bar = True)
print("Created answer embeddings using Mini-LM for Training Dataset")
print(30 * '#')

print(f"Shape of Embeddings for Questions using Mini-LM model in Training Dataset -> {minilm_train_q_emb.shape}")
print(f"Shape of Embeddings for Answers using Mini-LM model in Training Dataset -> {minilm_train_ans_emb.shape}")
print()
print(50 * '#')

print("Test Dataset")
minilm_test_q_emb = minilm_model.encode(test_questions_dataset["question"].values, show_progress_bar = True)
print("Created question embeddings using Mini-LM for Test Dataset")
print(30 * '#')

minilm_test_ans_emb = minilm_model.encode(test_answers_dataset["answer"].values, show_progress_bar = True)
print("Created answer embeddings using Mini-LM for Test Dataset")
print(30 * '#')


print(f"Shape of Embeddings for Questions using Mini-LM model in Test Dataset -> {minilm_test_q_emb.shape}")
print(f"Shape of Embeddings for Answers using Mini-LM model in Test Dataset -> {minilm_test_ans_emb.shape}")

Training Dataset


Batches:   0%|          | 0/13 [00:00<?, ?it/s]

Created question embeddings using Mini-LM for Training Dataset
##############################


Batches:   0%|          | 0/13 [00:00<?, ?it/s]

Created answer embeddings using Mini-LM for Training Dataset
##############################
Shape of Embeddings for Questions using Mini-LM model in Training Dataset -> (396, 384)
Shape of Embeddings for Answers using Mini-LM model in Training Dataset -> (396, 384)

##################################################
Test Dataset


Batches:   0%|          | 0/17 [00:00<?, ?it/s]

Created question embeddings using Mini-LM for Test Dataset
##############################


Batches:   0%|          | 0/17 [00:00<?, ?it/s]

Created answer embeddings using Mini-LM for Test Dataset
##############################
Shape of Embeddings for Questions using Mini-LM model in Test Dataset -> (514, 384)
Shape of Embeddings for Answers using Mini-LM model in Test Dataset -> (515, 384)
CPU times: user 2.98 s, sys: 23.3 ms, total: 3 s
Wall time: 2.71 s


In [None]:
type(minilm_train_q_emb)

numpy.ndarray

In [None]:
%%time
bge_model.to(device) # Move model to GPU
print("Training Dataset")
bge_train_q_embed = bge_model.encode(train_dataset["question"].values, show_progress_bar = True)
print("Created question embeddings using BGE")
print(30 * '#')

bge_train_ans_embed = bge_model.encode(train_dataset["answer"].values, show_progress_bar = True)
print("Created answer embeddings using BGE")
print(30 * '#')


print(f"Shape of Embeddings for Questions using BGE model in Training Dataset -> {bge_train_q_embed.shape}")
print(f"Shape of Embeddings for Answers using BGE model in Training Dataset -> {bge_train_ans_embed.shape}")
print()
print(50 * '#')

print("Test Dataset")
bge_test_q_emb = bge_model.encode(test_questions_dataset["question"].values, show_progress_bar = True)
print("Created question embeddings using BGE for Test Dataset")
print(30 * '#')

bge_test_ans_emb = bge_model.encode(test_answers_dataset["answer"].values, show_progress_bar = True)
print("Created answer embeddings using BGE for Test Dataset")
print(30 * '#')


# Labels corrected: these are BGE embeddings, not Mini-LM
print(f"Shape of Embeddings for Questions using BGE model in Test Dataset -> {bge_test_q_emb.shape}")
print(f"Shape of Embeddings for Answers using BGE model in Test Dataset -> {bge_test_ans_emb.shape}")

Training Dataset


Batches:   0%|          | 0/13 [00:00<?, ?it/s]

Created question embeddings using BGE
##############################


Batches:   0%|          | 0/13 [00:00<?, ?it/s]

Created answer embeddings using BGE
##############################
Shape of Embeddings for Questions using BGE model in Training Dataset -> (396, 1024)
Shape of Embeddings for Answers using BGE model in Training Dataset -> (396, 1024)

##################################################
Test Dataset


Batches:   0%|          | 0/17 [00:00<?, ?it/s]

Created question embeddings using BGE for Test Dataset
##############################


Batches:   0%|          | 0/17 [00:00<?, ?it/s]

Created answer embeddings using BGE for Test Dataset
##############################
Shape of Embeddings for Questions using BGE model in Test Dataset -> (514, 1024)
Shape of Embeddings for Answers using BGE model in Test Dataset -> (515, 1024)
CPU times: user 46 s, sys: 77.4 ms, total: 46 s
Wall time: 46 s


In [30]:
def get_train_predictions(train_df: pd.DataFrame,
                    question_embeddings: np.ndarray,
                    ans_embeddings: np.ndarray) -> List:
    """
    Get predictions by finding the candidate answer with the highest cosine similarity to each question.

    Args:
    train_df (pd.DataFrame): DataFrame containing the questions, candidate answers, and answer IDs.
    question_embeddings (np.ndarray): Array of embeddings for each question in `train_df`.
    ans_embeddings (np.ndarray): Array of embeddings for each answer in `train_df`.

    Returns:
    List: A list of predicted answers, where each element is the candidate answer with the highest cosine similarity.
    """
    # Create the train_ans_dict from train_df DataFrame and ans_embeddings
    train_ans_dict = {}
    for idx, row in enumerate(tqdm(train_df.iterrows(), total=len(train_df))):
        train_ans_dict[str(row[1]['answer_id'])] = ans_embeddings[idx]

    # Predictions
    preds = []
    for idx, row in enumerate(tqdm(train_df.iterrows(), total=len(train_df))):
        sim = []
        for ca in row[1]["candidate_answers"]:  # Accessing the candidate_answers in the row
            cos_sim = util.cos_sim(question_embeddings[idx], train_ans_dict.get(str(ca), np.zeros_like(ans_embeddings[0])))  # Calculate cosine similarity
            sim.append(cos_sim.item())

        aidx = np.argmax(np.array(sim))  # Find the index of the candidate answer with the highest similarity
        preds.append(row[1]["candidate_answers"][aidx])  # Append the candidate answer with the highest similarity

    return preds
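
The selection rule inside `get_train_predictions` can be checked in isolation on toy vectors. Below is a minimal sketch in plain Python, not a replacement for `util.cos_sim`. The helper names `cosine` and `pick_best_candidate` and the hand-made `answers` dictionary are assumptions made for illustration. The sketch mirrors the zero-vector fallback used above for candidate IDs that have no embedding.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_best_candidate(question_vec: Sequence[float],
                        candidates: List,
                        answer_vectors: Dict[str, Sequence[float]]):
    """Return the candidate ID whose answer vector is most similar to the question."""
    zero = [0.0] * len(question_vec)  # fallback for IDs without an embedding
    scores = [cosine(question_vec, answer_vectors.get(str(c), zero)) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)  # first maximum, like np.argmax
    return candidates[best]


# Toy data (hypothetical): two known answers, one unknown candidate ID (30)
answers = {"10": [1.0, 0.0], "20": [0.0, 1.0]}
best_id = pick_best_candidate([0.9, 0.1], [20, 10, 30], answers)
print(best_id)
```

Candidate `30` has no embedding, so it scores 0.0 and can only win when every other candidate scores at or below zero.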

In [None]:
%%time
minilm_predictions = get_train_predictions(train_dataset, minilm_train_q_emb, minilm_train_ans_emb)


CPU times: user 360 ms, sys: 2.63 ms, total: 363 ms
Wall time: 381 ms


In [None]:
print(type(minilm_predictions))

len(minilm_predictions)

<class 'list'>


396

In [None]:
%%time
bge_predictions = get_train_predictions(train_dataset, bge_train_q_embed, bge_train_ans_embed)


CPU times: user 349 ms, sys: 3.75 ms, total: 353 ms
Wall time: 369 ms


In [None]:
minilm_preds = np.array(minilm_predictions)
acc = accuracy_score(train_dataset.answer_id.values.ravel(), minilm_preds.astype(int).ravel())
print(acc)

0.9292929292929293


In [None]:
bge_preds = np.array(bge_predictions)
acc = accuracy_score(train_dataset.answer_id.values.ravel(), bge_preds.astype(int).ravel())
print(acc)

0.9292929292929293
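
Mini-LM and BGE report the same accuracy here, but equal accuracy does not by itself mean the two models chose the same answers. One way to check is to compare the two prediction lists element by element. This is a minimal sketch. The helper name `compare_predictions` is hypothetical, and IDs are compared as strings because the prediction lists may hold either ints or strings.

```python
from typing import List, Sequence, Tuple


def compare_predictions(preds_a: Sequence, preds_b: Sequence) -> Tuple[float, List[int]]:
    """Return (agreement rate, positions where the two prediction lists differ)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if len(preds_a) == 0:
        raise ValueError("prediction lists must not be empty")
    differing = [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if str(a) != str(b)]
    agreement = 1.0 - len(differing) / len(preds_a)
    return agreement, differing


# Toy example: the lists disagree only at position 2
agreement, differing = compare_predictions([1, 2, 3, 4], [1, 2, 9, 4])
print(agreement, differing)
```

In this notebook the call would be `compare_predictions(minilm_predictions, bge_predictions)`. A non-empty list of differing positions would show that the models make different mistakes despite the matching score.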


In [29]:
def get_test_predictions(test_qs: pd.DataFrame,
                         test_ans: pd.DataFrame,
                         q_emb: np.ndarray,
                         ans_emb: np.ndarray) -> List:
    """
    Get test predictions by finding the candidate answer with the highest cosine similarity to each question.

    Args:
    test_qs (pd.DataFrame): DataFrame containing the test questions and candidate answers.
    test_ans (pd.DataFrame): DataFrame containing the test answers and their IDs.
    q_emb (np.ndarray): Array of embeddings for each question in `test_qs`.
    ans_emb (np.ndarray): Array of embeddings for each answer in `test_ans`.

    Returns:
    List: A list of predicted answers, where each element is the candidate answer with the highest cosine similarity.
    """
    # Create the test_ans_dict from test_ans DataFrame and ans_emb
    test_ans_dict = {}
    for idx, row in enumerate(tqdm(test_ans.iterrows(), total=len(test_ans))):
        test_ans_dict[str(row[1]['answer_id'])] = ans_emb[idx]

    # Predictions
    preds = []
    for idx, row in enumerate(tqdm(test_qs.iterrows(), total=len(test_qs))):
        sim = []
        for ca in row[1]["candidate_answers"]:  # Accessing the candidate_answers in the row
            cos_sim = util.cos_sim(q_emb[idx], test_ans_dict.get(str(ca), np.zeros_like(ans_emb[0])))  # Calculate cosine similarity
            sim.append(cos_sim.item())

        aidx = np.argmax(np.array(sim))  # Find the index of the candidate answer with the highest similarity
        preds.append(row[1]["candidate_answers"][aidx])  # Append the candidate answer with the highest similarity

    return preds

In [None]:
%%time
minilm_test_predictions = get_test_predictions(test_questions_dataset, test_answers_dataset, minilm_test_q_emb, minilm_test_ans_emb)


CPU times: user 456 ms, sys: 5.88 ms, total: 462 ms
Wall time: 481 ms


In [None]:
print(type(minilm_test_predictions))

len(minilm_test_predictions)

<class 'list'>


514

In [None]:
bge_test_predictions = get_test_predictions(test_questions_dataset, test_answers_dataset, bge_test_q_emb, bge_test_ans_emb)


In [None]:
len(bge_test_predictions)

514

In [None]:
bge_preds = np.array(bge_test_predictions).astype(int)
bge_preds.shape

(514,)

In [None]:
test_questions_dataset['predicted_answer_id'] = bge_preds

In [None]:
test_questions_dataset.head()

| | question_id | question | course | year | candidate_answers | predicted_answer_id |
|---|---|---|---|---|---|---|
| 0 | 707 | How much of an effort would it be to use AWS instead of GCP for assignments? | Data Engineering Zoomcamp | 2023 | [336232, 337669, 258304, 47681, 767296] | 767296 |
| 1 | 534450 | Can you talk about linear regression and regularization? | Machine Learning Zoomcamp | 2022 | [231208, 282072, 86769, 573165, 138373] | 231208 |
| 2 | 996163 | Can you please explain the Python Black setup in Visual Studio Code? Also, can you explain good Python coding standards as you write docstrings and type strings? | Data Engineering Zoomcamp | 2023 | [571892, 816559, 47681, 337669, 336232] | 571892 |
| 3 | 860215 | How many portfolio projects apart from the course are needed for getting a job? | Machine Learning Zoomcamp | 2022 | [643931, 988549, 918931, 235894, 608866] | 988549 |
| 4 | 980124 | Can you talk more about the final project? What should we be thinking about now to prepare us? | Data Engineering Zoomcamp | 2023 | [384381, 337669, 258304, 47681, 747722] | 384381 |


In [None]:
test_questions_dataset[['question_id', 'predicted_answer_id']].to_csv('BGE_submission.csv', index=False)
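
The submission file needs only two columns, so it can also be written without adding a `predicted_answer_id` column to `test_questions_dataset` and dropping it again before the next model. This is a minimal sketch using only the standard library. The helper name `write_submission` is hypothetical, and the two sample rows are taken from the table above.

```python
import csv
import io
from typing import Iterable, TextIO


def write_submission(question_ids: Iterable, predicted_ids: Iterable, out: TextIO) -> int:
    """Write `question_id,predicted_answer_id` rows to `out`; return the number of rows."""
    q_list = list(question_ids)
    p_list = list(predicted_ids)
    if len(q_list) != len(p_list):
        raise ValueError("need exactly one prediction per question")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["question_id", "predicted_answer_id"])
    for qid, pid in zip(q_list, p_list):
        writer.writerow([int(qid), int(pid)])  # int() also accepts NumPy integers
    return len(q_list)


# Demo with an in-memory buffer instead of a file on disk
buffer = io.StringIO()
rows_written = write_submission([707, 534450], [767296, 231208], buffer)
print(buffer.getvalue())
```

To produce a real file, pass a handle opened with `open("BGE_submission.csv", "w", newline="")` in place of the buffer.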

In [13]:
## Langchain
# Cohere embeddings

cohere_model = CohereEmbeddings(cohere_api_key = COHERE_API_KEY, model = "embed-english-v3.0")

print(f"Cohere -> model is Ready")
print(30 * '#')

# Jina embeddings
jina_model = JinaEmbeddings(jina_api_key = JINA_API_KEY, model_name = "jina-embeddings-v2-base-en")
print(f"Jina -> model is Ready")
print(30 * '#')

# Mistral embeddings
mistral_model = MistralAIEmbeddings(mistral_api_key = MISTRAL_API_KEY, model = "mistral-embed")
print(f"Mistral -> model is Ready")
print(30 * '#')

# Together embeddings
together_model = TogetherEmbeddings(together_api_key = TOGETHER_API_KEY, model = "togethercomputer/m2-bert-80M-8k-retrieval")
print(f"Together -> model is Ready")
print(30 * '#')

# Voyage embeddings
voyage_model = VoyageEmbeddings(voyage_api_key = VOYAGE_API_KEY, model = "voyage-01")
print(f"Voyage -> model is Ready")
print(30 * '#')

## LlamaIndex
# OpenAI embeddings
openai_model = OpenAIEmbeddings(openai_api_key = OPENAI_API_KEY, model = "text-embedding-3-large")
print(f"OPENAI-> model is Ready")
print(30 * '#')

print("All Embeddings Models Ready!")

Cohere -> model is Ready
##############################
Jina -> model is Ready
##############################
Mistral -> model is Ready
##############################
Together -> model is Ready
##############################
Voyage -> model is Ready
##############################
OPENAI-> model is Ready
##############################
All Embeddings Models Ready!


In [14]:
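def get_embedding(model, text: str) -> list:
    """Embed a single text via the model's `embed_query` method and return the vector."""
    embedding = model.embed_query(text)  # one API request per call
    return embedding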
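
`get_embedding` sends one request per row, which is slow over a network and more likely to hit provider rate limits. LangChain embedding classes also expose `embed_documents`, which accepts a list of texts. The sketch below groups texts into batches before calling it. The helper name `embed_in_batches` and the default `batch_size=64` are assumptions, and each provider sets its own batch limit. Some providers embed queries and documents differently, so the vectors may not match `embed_query` output exactly.

```python
from typing import List, Sequence


def embed_in_batches(model, texts: Sequence[str], batch_size: int = 64) -> List[List[float]]:
    """Embed `texts` by calling `model.embed_documents` once per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(model.embed_documents(batch))  # one request per batch
    return vectors


class FakeModel:
    """Stand-in for an embedding client; counts how many requests were made."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.calls += 1
        return [[float(len(t))] for t in texts]


fake = FakeModel()
vectors = embed_in_batches(fake, ["a", "bb", "ccc"], batch_size=2)
print(vectors, fake.calls)
```

With a real client the call might look like `embed_in_batches(jina_model, train_dataset["question"].tolist())`. Passing a plain list avoids any ambiguity about how a pandas Series is sliced.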

## COHERE - DID NOT WORK

In [53]:
%%time
cohere_train_q_embed = train_dataset['question'].apply(lambda x: get_embedding(cohere_model, x)).values

KeyboardInterrupt: 

In [None]:
%%time
cohere_train_ans_embed = train_dataset['answer'].apply(lambda x: get_embedding(cohere_model, x)).values

In [None]:
%%time
cohere_test_q_emb = test_questions_dataset['question'].apply(lambda x: get_embedding(cohere_model, x)).values

In [None]:
%%time
cohere_test_ans_emb = test_answers_dataset['answer'].apply(lambda x: get_embedding(cohere_model, x)).values

In [52]:
print(f"Shape of Embeddings for Questions using Cohere model in Training Dataset -> {cohere_train_q_embed.shape}, {len(cohere_train_q_embed[0])}")
print(f"Shape of Embeddings for Answers using Cohere model in Training Dataset -> {cohere_train_ans_embed.shape}, {len(cohere_train_ans_embed[0])}")
print(f"Shape of Embeddings for Questions using Cohere model in Test Dataset -> {cohere_test_q_emb.shape}, {len(cohere_test_q_emb[0])}")
print(f"Shape of Embeddings for Answers using Cohere model in Test Dataset -> {cohere_test_ans_emb.shape}, {len(cohere_test_ans_emb[0])}")

NameError: name 'cohere_train_q_embed' is not defined

In [None]:
cohere_predictions = get_train_predictions(train_dataset, cohere_train_q_embed, cohere_train_ans_embed)

In [None]:
cohere_preds = np.array(cohere_predictions)
acc = accuracy_score(train_dataset.answer_id.values.ravel(), cohere_preds.astype(int).ravel())
print(acc)

In [None]:
%%time
cohere_test_predictions = get_test_predictions(test_questions_dataset, test_answers_dataset, cohere_test_q_emb, cohere_test_ans_emb)

In [None]:
cohere_preds = np.array(cohere_test_predictions).astype(int)
cohere_preds.shape

In [None]:
test_questions_dataset.columns

Index(['question_id', 'question', 'course', 'year', 'candidate_answers', 'predicted_answer_id'], dtype='object')

In [None]:
test_questions_dataset = test_questions_dataset.drop(columns='predicted_answer_id')

In [None]:
test_questions_dataset.head()

| | question_id | question | course | year | candidate_answers |
|---|---|---|---|---|---|
| 0 | 707 | How much of an effort would it be to use AWS instead of GCP for assignments? | Data Engineering Zoomcamp | 2023 | [336232, 337669, 258304, 47681, 767296] |
| 1 | 534450 | Can you talk about linear regression and regularization? | Machine Learning Zoomcamp | 2022 | [231208, 282072, 86769, 573165, 138373] |
| 2 | 996163 | Can you please explain the Python Black setup in Visual Studio Code? Also, can you explain good Python coding standards as you write docstrings and type strings? | Data Engineering Zoomcamp | 2023 | [571892, 816559, 47681, 337669, 336232] |
| 3 | 860215 | How many portfolio projects apart from the course are needed for getting a job? | Machine Learning Zoomcamp | 2022 | [643931, 988549, 918931, 235894, 608866] |
| 4 | 980124 | Can you talk more about the final project? What should we be thinking about now to prepare us? | Data Engineering Zoomcamp | 2023 | [384381, 337669, 258304, 47681, 747722] |


In [None]:
test_questions_dataset['predicted_answer_id'] = cohere_preds

In [None]:
test_questions_dataset[['question_id', 'predicted_answer_id']].to_csv('Cohere_submission.csv', index=False)

## JINA

In [None]:
%%time
jina_train_q_embed = train_dataset['question'].apply(lambda x: get_embedding(jina_model, x)).values



CPU times: user 938 ms, sys: 89 ms, total: 1.03 s
Wall time: 1min 41s


In [None]:
%%time
jina_train_ans_embed = train_dataset['answer'].apply(lambda x: get_embedding(jina_model, x)).values

CPU times: user 795 ms, sys: 80.7 ms, total: 876 ms
Wall time: 1min 41s


In [None]:
%%time
jina_test_q_emb = test_questions_dataset['question'].apply(lambda x: get_embedding(jina_model, x)).values

CPU times: user 1.08 s, sys: 109 ms, total: 1.19 s
Wall time: 2min 10s


In [None]:
%%time
jina_test_ans_emb = test_answers_dataset['answer'].apply(lambda x: get_embedding(jina_model, x)).values

CPU times: user 1.05 s, sys: 104 ms, total: 1.15 s
Wall time: 2min 12s


In [None]:
print(f"Shape of Embeddings for Questions using Jina model in Training Dataset -> {jina_train_q_embed.shape}, {len(jina_train_q_embed[0])}")
print(f"Shape of Embeddings for Answers using Jina model in Training Dataset -> {jina_train_ans_embed.shape}, {len(jina_train_ans_embed[0])}")
print(f"Shape of Embeddings for Questions using Jina model in Test Dataset -> {jina_test_q_emb.shape}, {len(jina_test_q_emb[0])}")
print(f"Shape of Embeddings for Answers using Jina model in Test Dataset -> {jina_test_ans_emb.shape}, {len(jina_test_ans_emb[0])}")

Shape of Embeddings for Questions using Jina model in Training Dataset -> (396,), 768
Shape of Embeddings for Answers using Jina model in Training Dataset -> (396,), 768
Shape of Embeddings for Questions using Jina model in Test Dataset -> (514,), 768
Shape of Embeddings for Answers using Jina model in Test Dataset -> (515,), 768
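
The shapes above read `(396,)` rather than `(396, 768)` because `Series.apply(...).values` returns a one-dimensional object array whose elements are Python lists. The prediction functions work around this by converting one row at a time. A possible alternative is to stack the rows into a regular 2-D float array once. This is a minimal sketch, with the helper name `to_matrix` and the tiny 3-dimensional vectors chosen for illustration.

```python
import numpy as np


def to_matrix(embeddings) -> np.ndarray:
    """Stack a sequence of equal-length vectors into a 2-D float32 array."""
    # np.stack raises ValueError if the vectors do not all have the same length
    return np.stack([np.asarray(vec, dtype=np.float32) for vec in embeddings])


# Mimic what `.apply(...).values` yields: an object array holding Python lists
object_array = np.empty(2, dtype=object)
object_array[0] = [1.0, 2.0, 3.0]
object_array[1] = [4.0, 5.0, 6.0]

matrix = to_matrix(object_array)
print(object_array.shape, matrix.shape, matrix.dtype)
```

Applied to the Jina arrays, this should turn `jina_train_q_embed` into a `(396, 768)` array with a numeric dtype.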


In [32]:
def get_train_predictions(train_df: pd.DataFrame,
                          question_embeddings: np.ndarray,
                          ans_embeddings: np.ndarray) -> List:
    """
    Get predictions by finding the candidate answer with the highest cosine similarity to each question.

    Args:
    train_df (pd.DataFrame): DataFrame containing the questions, candidate answers, and answer IDs.
    question_embeddings (np.ndarray): Array of embeddings for each question in `train_df`.
    ans_embeddings (np.ndarray): Array of embeddings for each answer in `train_df`.

    Returns:
    List: A list of predicted answers, where each element is the candidate answer with the highest cosine similarity.
    """
    # Create the train_ans_dict from train_df DataFrame and ans_embeddings
    train_ans_dict = {}
    for idx, row in enumerate(tqdm(train_df.iterrows(), total=len(train_df))):
        train_ans_dict[str(row[1]['answer_id'])] = torch.tensor(ans_embeddings[idx], dtype=torch.float32)  # Convert to tensor with dtype float32

    # Predictions
    preds = []
    for idx, row in enumerate(tqdm(train_df.iterrows(), total=len(train_df))):
        sim = []
        question_embed = torch.tensor(question_embeddings[idx], dtype=torch.float32)  # Convert to tensor with dtype float32
        for ca in row[1]["candidate_answers"]:  # Accessing the candidate_answers in the row
            answer_embed = train_ans_dict.get(str(ca), torch.zeros_like(question_embed))  # Get the answer embedding
            cos_sim = util.cos_sim(question_embed, answer_embed)  # Calculate cosine similarity
            sim.append(cos_sim.item())

        aidx = np.argmax(np.array(sim))  # Find the index of the candidate answer with the highest similarity
        preds.append(row[1]["candidate_answers"][aidx])  # Append the candidate answer with the highest similarity

    return preds

In [None]:
# For training dataset
jina_predictions = get_train_predictions(train_dataset, jina_train_q_embed, jina_train_ans_embed)
jina_preds = np.array(jina_predictions)
jina_acc = accuracy_score(train_dataset.answer_id.values.ravel(), jina_preds.astype(int).ravel())
print('Jina Model Training Accuracy:', jina_acc)


Jina Model Training Accuracy: 0.9242424242424242


In [None]:
# For test dataset
jina_test_predictions = get_test_predictions(test_questions_dataset, test_answers_dataset, jina_test_q_emb, jina_test_ans_emb)
jina_test_preds = np.array(jina_test_predictions).astype(int)
print('Jina Test Predictions Shape:', jina_test_preds.shape)


Jina Test Predictions Shape: (514,)


In [None]:
test_questions_dataset['predicted_answer_id'] = jina_test_preds

In [None]:
test_questions_dataset[['question_id', 'predicted_answer_id']].to_csv('Jina_submission.csv', index=False)

## TOGETHER AI - DID NOT WORK

In [54]:
%%time
together_train_q_embed = train_dataset['question'].apply(lambda x: get_embedding(together_model, x)).values

RateLimitError: Too many requests received. Please pace your requests.
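
A common response to a rate-limit error is to retry with an exponentially growing pause between attempts. The sketch below is one way to do that with the standard library and was not used in this notebook. The helper name `call_with_backoff`, the default of five attempts, and the one-second base delay are assumptions. It catches every `Exception` for brevity. In practice the `except` clause would be narrowed to the provider's rate-limit exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`; on failure wait base_delay * 2**attempt seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the rate-limit error in real use
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Demo: a function that fails twice before succeeding; delays are recorded, not slept
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("Too many requests")
    return "ok"


delays = []
result = call_with_backoff(flaky, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Inside `get_embedding`, the call could be wrapped as `call_with_backoff(lambda: model.embed_query(text))`.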

In [None]:
%%time
together_train_ans_embed = train_dataset['answer'].apply(lambda x: get_embedding(together_model, x)).values

In [None]:
%%time
together_test_q_emb = test_questions_dataset['question'].apply(lambda x: get_embedding(together_model, x)).values

In [None]:
%%time
together_test_ans_emb = test_answers_dataset['answer'].apply(lambda x: get_embedding(together_model, x)).values

In [None]:
print(f"Shape of Embeddings for Questions using Together model in Training Dataset -> {together_train_q_embed.shape}, {len(together_train_q_embed[0])}")
print(f"Shape of Embeddings for Answers using Together model in Training Dataset -> {together_train_ans_embed.shape}, {len(together_train_ans_embed[0])}")
print(f"Shape of Embeddings for Questions using Together model in Test Dataset -> {together_test_q_emb.shape}, {len(together_test_q_emb[0])}")
print(f"Shape of Embeddings for Answers using Together model in Test Dataset -> {together_test_ans_emb.shape}, {len(together_test_ans_emb[0])}")

## VOYAGE AI

In [18]:
%%time
voyage_train_q_embed = train_dataset['question'].apply(lambda x: get_embedding(voyage_model, x)).values

CPU times: user 27.7 s, sys: 468 ms, total: 28.2 s
Wall time: 2min 59s


In [19]:
%%time
voyage_train_ans_embed = train_dataset['answer'].apply(lambda x: get_embedding(voyage_model, x)).values

CPU times: user 27.8 s, sys: 472 ms, total: 28.3 s
Wall time: 3min 10s


In [20]:
%%time
voyage_test_q_emb = test_questions_dataset['question'].apply(lambda x: get_embedding(voyage_model, x)).values

CPU times: user 35.3 s, sys: 530 ms, total: 35.8 s
Wall time: 3min 50s


In [21]:
%%time
voyage_test_ans_emb = test_answers_dataset['answer'].apply(lambda x: get_embedding(voyage_model, x)).values

CPU times: user 36.6 s, sys: 714 ms, total: 37.3 s
Wall time: 4min 3s


In [23]:
print(f"Shape of Embeddings for Questions using Voyage model in Training Dataset -> {voyage_train_q_embed.shape}, {len(voyage_train_q_embed[0])}")
print(f"Shape of Embeddings for Answers using Voyage model in Training Dataset -> {voyage_train_ans_embed.shape}, {len(voyage_train_ans_embed[0])}")
print(f"Shape of Embeddings for Questions using Voyage model in Test Dataset -> {voyage_test_q_emb.shape}, {len(voyage_test_q_emb[0])}")
print(f"Shape of Embeddings for Answers using Voyage model in Test Dataset -> {voyage_test_ans_emb.shape}, {len(voyage_test_ans_emb[0])}")

Shape of Embeddings for Questions using Voyage model in Training Dataset -> (396,), 1024
Shape of Embeddings for Answers using Voyage model in Training Dataset -> (396,), 1024
Shape of Embeddings for Questions using Voyage model in Test Dataset -> (514,), 1024
Shape of Embeddings for Answers using Voyage model in Test Dataset -> (515,), 1024


In [36]:
import json

def get_train_predictions(train_df: pd.DataFrame,
                          question_embeddings: np.ndarray,
                          ans_embeddings: np.ndarray) -> List:
    """
    Get predictions by finding the candidate answer with the highest cosine similarity to each question.

    Args:
    train_df (pd.DataFrame): DataFrame containing the questions, candidate answers, and answer IDs.
    question_embeddings (np.ndarray): Array of embeddings for each question in `train_df`.
    ans_embeddings (np.ndarray): Array of embeddings for each answer in `train_df`.

    Returns:
    List: A list of predicted answers, where each element is the candidate answer with the highest cosine similarity.
    """
    # Create the train_ans_dict from train_df DataFrame and ans_embeddings
    train_ans_dict = {}
    for idx, row in enumerate(tqdm(train_df.iterrows(), total=len(train_df))):
        train_ans_dict[str(row[1]['answer_id'])] = torch.tensor(ans_embeddings[idx], dtype=torch.float32)  # Convert to tensor with dtype float32

    # Predictions
    preds = []
    for idx, row in enumerate(tqdm(train_df.iterrows(), total=len(train_df))):
        sim = []
        question_embed = torch.tensor(question_embeddings[idx], dtype=torch.float32)  # Convert to tensor with dtype float32
        candidate_answers = row[1]["candidate_answers"]

        # Ensure candidate_answers is a list (convert if it's a string representation of a list)
        if isinstance(candidate_answers, str):
            candidate_answers = json.loads(candidate_answers.replace("'", '"'))  # Convert string to list

        for ca in candidate_answers:  # Accessing the candidate_answers in the row
            answer_embed = train_ans_dict.get(str(ca), torch.zeros_like(question_embed))  # Get the answer embedding
            cos_sim = util.cos_sim(question_embed, answer_embed)  # Calculate cosine similarity
            sim.append(cos_sim.item())

        aidx = np.argmax(np.array(sim))  # Find the index of the candidate answer with the highest similarity
        preds.append(candidate_answers[aidx])  # Append the candidate answer with the highest similarity

    return preds
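
The string branch above swaps single quotes for double quotes so that `json.loads` can parse the list. The standard library's `ast.literal_eval` parses Python literals directly and handles both quoted and unquoted IDs. This is a minimal sketch with a hypothetical helper name, `parse_candidates`. It returns integers, which still works with the `str(ca)` dictionary lookup used in the prediction functions.

```python
import ast
from typing import List, Union


def parse_candidates(value: Union[str, list, tuple]) -> List[int]:
    """Return candidate answer IDs as ints, whether given as a list or its string form."""
    if isinstance(value, str):
        value = ast.literal_eval(value)  # safely evaluates literals such as "['1', '2']"
    return [int(item) for item in value]


print(parse_candidates("['336232', '337669', '258304']"))
print(parse_candidates("[336232, 47681]"))
print(parse_candidates([571892, "816559"]))
```

Parsing the column once after loading, for example with `train_dataset["candidate_answers"].apply(parse_candidates)`, would remove the need for the `isinstance` check inside the prediction loop.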


In [37]:
# For training dataset
voyage_predictions = get_train_predictions(train_dataset, voyage_train_q_embed, voyage_train_ans_embed)
voyage_preds = np.array(voyage_predictions)
voyage_acc = accuracy_score(train_dataset.answer_id.values.ravel(), voyage_preds.astype(int).ravel())
print('Voyage Model Training Accuracy:', voyage_acc)


Voyage Model Training Accuracy: 0.9242424242424242


In [41]:
import json
from typing import List
import numpy as np
from sentence_transformers import util
from tqdm import tqdm
import pandas as pd
import torch

def get_test_predictions(test_qs: pd.DataFrame,
                         test_ans: pd.DataFrame,
                         q_emb: np.ndarray,
                         ans_emb: np.ndarray) -> List:
    """
    Get test predictions by finding the candidate answer with the highest cosine similarity to each question.

    Args:
    test_qs (pd.DataFrame): DataFrame containing the test questions and candidate answers.
    test_ans (pd.DataFrame): DataFrame containing the test answers and their IDs.
    q_emb (np.ndarray): Array of embeddings for each question in `test_qs`.
    ans_emb (np.ndarray): Array of embeddings for each answer in `test_ans`.

    Returns:
    List: A list of predicted answers, where each element is the candidate answer with the highest cosine similarity.
    """
    # Create the test_ans_dict from test_ans DataFrame and ans_emb
    test_ans_dict = {}
    for idx, row in enumerate(tqdm(test_ans.iterrows(), total=len(test_ans))):
        test_ans_dict[str(row[1]['answer_id'])] = torch.tensor(ans_emb[idx], dtype=torch.float32)  # Convert to tensor with dtype float32

    # Predictions
    preds = []
    for idx, row in enumerate(tqdm(test_qs.iterrows(), total=len(test_qs))):
        sim = []
        question_embed = torch.tensor(q_emb[idx], dtype=torch.float32).unsqueeze(0)  # Convert to 2D tensor with dtype float32
        candidate_answers = row[1]["candidate_answers"]

        # If candidate_answers is a string representation of a list, convert it to an actual list
        if isinstance(candidate_answers, str):
            candidate_answers = json.loads(candidate_answers.replace("'", '"'))  # Convert string to list

        for ca in candidate_answers:  # Accessing the candidate_answers in the row
            answer_embed = test_ans_dict.get(str(ca), torch.zeros_like(question_embed))  # Get the answer embedding
            cos_sim = util.cos_sim(question_embed, answer_embed)  # Calculate cosine similarity
            sim.append(cos_sim.item())

        aidx = np.argmax(np.array(sim))  # Find the index of the candidate answer with the highest similarity

        # Ensure we are appending individual integer IDs or strings that represent integers
        selected_answer = candidate_answers[aidx]
        if isinstance(selected_answer, str) and selected_answer.isdigit():
            selected_answer = int(selected_answer)  # Convert to int if it's a digit string

        preds.append(selected_answer)

    return preds

In [42]:
# For test dataset
voyage_test_predictions = get_test_predictions(test_questions_dataset, test_answers_dataset, voyage_test_q_emb, voyage_test_ans_emb)
voyage_test_preds = np.array(voyage_test_predictions).astype(int)
print('Voyage Test Predictions Shape:', voyage_test_preds.shape)

100%|██████████| 515/515 [00:00<00:00, 5362.77it/s]
100%|██████████| 514/514 [00:00<00:00, 1557.89it/s]

Voyage Test Predictions Shape: (514,)





In [43]:
test_questions_dataset['predicted_answer_id'] = voyage_test_preds
test_questions_dataset[['question_id', 'predicted_answer_id']].to_csv('Voyage_submission.csv', index=False)

In [45]:
test_questions_dataset.head()

| | question_id | question | course | year | candidate_answers | predicted_answer_id |
|---|---|---|---|---|---|---|
| 0 | 707 | How much of an effort would it be to use AWS instead of GCP for assignments? | Data Engineering Zoomcamp | 2023 | ['336232', '337669', '258304', '47681', '767296'] | 767296 |
| 1 | 534450 | Can you talk about linear regression and regularization? | Machine Learning Zoomcamp | 2022 | ['231208', '282072', '86769', '573165', '138373'] | 231208 |
| 2 | 996163 | Can you please explain the Python Black setup in Visual Studio Code? Also, can you explain good Python coding standards as you write docstrings and type strings? | Data Engineering Zoomcamp | 2023 | ['571892', '816559', '47681', '337669', '336232'] | 571892 |
| 3 | 860215 | How many portfolio projects apart from the course are needed for getting a job? | Machine Learning Zoomcamp | 2022 | ['643931', '988549', '918931', '235894', '608866'] | 988549 |
| 4 | 980124 | Can you talk more about the final project? What should we be thinking about now to prepare us? | Data Engineering Zoomcamp | 2023 | ['384381', '337669', '258304', '47681', '747722'] | 384381 |


In [46]:
test_questions_dataset = test_questions_dataset.drop(columns='predicted_answer_id')

In [47]:
test_questions_dataset.head()

| | question_id | question | course | year | candidate_answers |
|---|---|---|---|---|---|
| 0 | 707 | How much of an effort would it be to use AWS instead of GCP for assignments? | Data Engineering Zoomcamp | 2023 | ['336232', '337669', '258304', '47681', '767296'] |
| 1 | 534450 | Can you talk about linear regression and regularization? | Machine Learning Zoomcamp | 2022 | ['231208', '282072', '86769', '573165', '138373'] |
| 2 | 996163 | Can you please explain the Python Black setup in Visual Studio Code? Also, can you explain good Python coding standards as you write docstrings and type strings? | Data Engineering Zoomcamp | 2023 | ['571892', '816559', '47681', '337669', '336232'] |
| 3 | 860215 | How many portfolio projects apart from the course are needed for getting a job? | Machine Learning Zoomcamp | 2022 | ['643931', '988549', '918931', '235894', '608866'] |
| 4 | 980124 | Can you talk more about the final project? What should we be thinking about now to prepare us? | Data Engineering Zoomcamp | 2023 | ['384381', '337669', '258304', '47681', '747722'] |


## OPENAI

In [None]:
%%time

openai_train_q_embed = train_dataset['question'].apply(lambda x: get_embedding(openai_model, x)).values

In [25]:
%%time

openai_train_ans_embed = train_dataset['answer'].apply(lambda x: get_embedding(openai_model, x)).values



CPU times: user 6.13 s, sys: 283 ms, total: 6.42 s
Wall time: 1min 20s


In [26]:
%%time

openai_test_q_emb = test_questions_dataset['question'].apply(lambda x: get_embedding(openai_model, x)).values




CPU times: user 4.87 s, sys: 309 ms, total: 5.18 s
Wall time: 1min 39s


In [27]:
%%time

openai_test_ans_emb = test_answers_dataset['answer'].apply(lambda x: get_embedding(openai_model, x)).values



CPU times: user 7.03 s, sys: 351 ms, total: 7.38 s
Wall time: 1min 45s


In [28]:
print(f"Shape of Embeddings for Questions using OpenAI model in Training Dataset -> {openai_train_q_embed.shape}, {len(openai_train_q_embed[0])}")
print(f"Shape of Embeddings for Answers using OpenAI model in Training Dataset -> {openai_train_ans_embed.shape}, {len(openai_train_ans_embed[0])}")
print(f"Shape of Embeddings for Questions using OpenAI model in Test Dataset -> {openai_test_q_emb.shape}, {len(openai_test_q_emb[0])}")
print(f"Shape of Embeddings for Answers using OpenAI model in Test Dataset -> {openai_test_ans_emb.shape}, {len(openai_test_ans_emb[0])}")

Shape of Embeddings for Questions using OpenAI model in Training Dataset -> (396,), 3072
Shape of Embeddings for Answers using OpenAI model in Training Dataset -> (396,), 3072
Shape of Embeddings for Questions using OpenAI model in Test Dataset -> (514,), 3072
Shape of Embeddings for Answers using OpenAI model in Test Dataset -> (515,), 3072


In [48]:
# For training dataset
openai_predictions = get_train_predictions(train_dataset, openai_train_q_embed, openai_train_ans_embed)
openai_preds = np.array(openai_predictions)
openai_acc = accuracy_score(train_dataset.answer_id.values.ravel(), openai_preds.astype(int).ravel())
print('OpenAI Model Training Accuracy:', openai_acc)

100%|██████████| 396/396 [00:00<00:00, 2810.65it/s]
100%|██████████| 396/396 [00:00<00:00, 1175.38it/s]

OpenAI Model Training Accuracy: 0.9393939393939394
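
With only 396 labelled questions, the accuracy figures are easier to compare as counts of correct matches. The short sketch below converts the accuracies printed in this notebook back into counts. The helper name `correct_count` is hypothetical, and the values are copied from the cell outputs above.

```python
def correct_count(accuracy: float, total: int) -> int:
    """Recover the number of correct predictions from an accuracy and a sample size."""
    return round(accuracy * total)


TRAIN_SIZE = 396  # rows in the labelled dataset, per the embedding shapes above

reported_accuracy = {
    "MiniLM": 0.9292929292929293,
    "BGE": 0.9292929292929293,
    "Jina": 0.9242424242424242,
    "Voyage": 0.9242424242424242,
    "OpenAI": 0.9393939393939394,
}

counts = {name: correct_count(acc, TRAIN_SIZE) for name, acc in reported_accuracy.items()}
for name, n_correct in counts.items():
    print(f"{name}: {n_correct}/{TRAIN_SIZE} correct")
```

The gap between the best and the lowest score here is six questions out of 396, so small differences between models on this set are best read with caution.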





In [49]:
# For test dataset
openai_test_predictions = get_test_predictions(test_questions_dataset, test_answers_dataset, openai_test_q_emb, openai_test_ans_emb)
openai_test_preds = np.array(openai_test_predictions).astype(int)
print('OpenAI Test Predictions Shape:', openai_test_preds.shape)

100%|██████████| 515/515 [00:00<00:00, 2964.16it/s]
100%|██████████| 514/514 [00:00<00:00, 1175.34it/s]

OpenAI Test Predictions Shape: (514,)





In [50]:
test_questions_dataset['predicted_answer_id'] = openai_test_preds

In [51]:
test_questions_dataset[['question_id', 'predicted_answer_id']].to_csv('Openai_submission.csv', index=False)

## TESTING APIs

In [None]:
# text_cohere = get_embedding(cohere_model, "How are you?")
# print(f"Embedding Dimension using Cohere Model: {len(text_cohere)}")

Embedding Dimension using Cohere Model: 1024


In [None]:
text_jina = get_embedding(jina_model, "How are you?")
print(f"Embedding Dimension using Jina Model: {len(text_jina)}")

Embedding Dimension using Jina Model: 768


In [None]:
# text_mistral = get_embedding(mistral_model, "How are you?")
# print(f"Embedding Dimension using Mistral Model: {len(text_mistral)}") # DID NOT WORK -> HTTP 429 (Too Many Requests)

In [None]:
text_openai = get_embedding(openai_model, "How are you?")
print(f"Embedding Dimension using OpenAI Model: {len(text_openai)}")



Embedding Dimension using OpenAI Model: 3072


In [None]:
text_together = get_embedding(together_model, "How are you?")
print(f"Embedding Dimension using Together Model: {len(text_together)}")

Embedding Dimension using Together Model: 768


In [None]:
text_voyage = get_embedding(voyage_model, "How are you?")
print(f"Embedding Dimension using Voyage Model: {len(text_voyage)}")

Embedding Dimension using Voyage Model: 1024
