In [1]:
from langchain.document_loaders import WebBaseLoader
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts.prompt import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema import runnable


In [4]:
# Web-based loading (alternative, currently disabled)
# loader = WebBaseLoader("/Users/abhinay/Downloads/titanic-data-science-solutions.html")

# PDF loading
loader = PyPDFLoader("/Users/abhinay/Downloads/titanic-data-science-solutions.pdf")
pages = loader.load_and_split()

In [5]:
# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(loader.load())
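
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sketch of fixed-size character chunking. It is not LangChain's recursive algorithm, which also prefers to break on separators such as paragraphs and newlines; the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 1200
no_overlap = chunk_text(sample, chunk_size=500, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=500, chunk_overlap=100)
print([len(c) for c in no_overlap])
print([len(c) for c in with_overlap])
```

With `chunk_overlap=0`, as in the cell above, a sentence that straddles a chunk boundary is cut in two; a small overlap is a common way to reduce that effect.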

In [6]:
import os

# Read the API key from the environment instead of hard-coding it in the notebook
key = os.environ["OPENAI_API_KEY"]

In [7]:
# Embed and store splits
embeddings  = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=key)
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

In [20]:
# Define a valid namespace with a prompt template
prompt_namespace = {
    'id': ['langchain', 'prompts', 'prompt', 'PromptTemplate'],
    'lc': 1,
    'type': 'constructor',
    'kwargs': {
        'template': "You are an assistant to get the required steps and code any to build Model based on given artile. Use the following pieces of retrieved context to answer the question if it is code and not asked for steps just give code nothing else also go to new line as it is mentioned in the article. If you don't know the answer, just say that you don't know. Use maximum of 10 sentence if needed more sentences use it.\nQuestion: {question} \nContext: {context} \nAnswer:",
        'input_variables': ['question', 'context'],
        'template_format': 'f-string'
    }
}

# Create an instance of the prompt template
prompt_template = PromptTemplate(**prompt_namespace['kwargs'])


In [21]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, openai_api_key=key, verbose=True)
rag_prompt = PromptTemplate(**prompt_namespace['kwargs'])
rag_chain = {"context": retriever, "question": runnable.RunnablePassthrough()} | rag_prompt | llm
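
The chain above maps the incoming question into a dict with `context` (from the retriever) and `question` (passed through unchanged), fills the prompt, and sends it to the model. A plain-Python sketch of that data flow follows. `toy_retriever` and `toy_llm` are hypothetical stand-ins, not LangChain APIs, the example chunks are made up for illustration, and the word-overlap scoring is only a placeholder for embedding similarity.

```python
TEMPLATE = "Question: {question} \nContext: {context} \nAnswer:"

# Made-up example chunks standing in for the vector store contents.
CHUNKS = [
    "The KNN model score was 84.74.",
    "The Random Forest model score was 86.76.",
    "The Title feature was extracted from the Name column.",
]


def toy_retriever(question, chunks=CHUNKS, k=2):
    """Rank chunks by the number of words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def toy_llm(prompt):
    """Stand-in for the chat model: only reports the prompt size."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def rag_chain_sketch(question):
    # Step 1: build the prompt inputs, like the dict at the head of the chain.
    inputs = {
        "context": "\n".join(toy_retriever(question)),
        "question": question,
    }
    # Step 2: fill the template with the retrieved context and the question.
    prompt = TEMPLATE.format(**inputs)
    # Step 3: pass the filled prompt to the model.
    return prompt, toy_llm(prompt)


prompt, reply = rag_chain_sketch("what was the score of KNN model")
print(prompt)
print(reply)
```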

In [10]:
def format_output(summary_text, words_per_line=8):
    """Re-wrap text to a fixed number of words per line.

    Existing line breaks in summary_text are not preserved.
    """
    words = summary_text.split()
    lines = [words[i:i+words_per_line] for i in range(0, len(words), words_per_line)]
    formatted_text = '\n'.join([' '.join(line) for line in lines])
    return formatted_text
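
Because `format_output` splits on all whitespace, any line breaks in the model's answer are discarded, which is why the table and code answers in the cells that follow come out as run-together lines. A possible alternative, sketched here with the standard-library `textwrap` module, wraps each original line separately. The name `format_output_keep_lines` and the `width` default are assumptions, not part of the original notebook.

```python
import textwrap


def format_output_keep_lines(text, width=100):
    """Wrap long lines at `width` characters but keep existing line breaks."""
    wrapped = []
    for line in text.splitlines():
        if not line.strip():
            wrapped.append("")  # keep blank lines as paragraph separators
        else:
            # textwrap.wrap splits one long line into several shorter ones
            wrapped.extend(textwrap.wrap(line, width=width))
    return "\n".join(wrapped)


answer = "Model | Score\nRandom Forest | 86.76\nKNN | 84.74"
print(format_output_keep_lines(answer))
```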

In [None]:
summary = rag_chain.invoke("what was the score of KNN model")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

In [13]:
summary = rag_chain.invoke("give me the list of models and their scores on tabular form")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The list of models and their scores in tabular form is as follows: Model
Score ------------------------- Random Forest 86.76 Decision Tree 86.76 KNN 84.74 Support Vector Machines 83.84
Logistic Regression 80.36 Linear SVC 79.12 Stochastic Gradient Decent 78.56 Perceptron 78.00 Naive Bayes
72.28


In [23]:
summary = rag_chain.invoke("what was the code to implement random forest model ")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

random_forest = RandomForestClassifier(n_estimators=100) random_forest.fit(X_train, Y_train) Y_pred = random_forest.predict(X_test) random_forest.score(X_train, Y_train) acc_random_forest = round(random_forest.score(X_train, Y_train)
* 100, 2) acc_random_forest


In [15]:
summary = rag_chain.invoke("what were all the python libraries that was imported")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The Python libraries that were imported in the given article are: - pandas -
sklearn.linear_model.LogisticRegression - sklearn.svm.SVC - sklearn.svm.LinearSVC - sklearn.ensemble.RandomForestClassifier - sklearn.neighbors.KNeighborsClassifier - sklearn.naive_bayes.GaussianNB - sklearn.linear_model.Perceptron -
sklearn.linear_model.SGDClassifier - sklearn.tree.DecisionTreeClassifier


In [12]:
summary = rag_chain.invoke("what was the aim of this paper or what they are trying to achieve")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The aim of this paper is to explain the step-by-step process of building an
end-to-end data pipeline for the Kaggle Titanic competition. The author wants to help readers
understand the structured approach of a Machine Learning Engineer or Data Scientist in data
cleaning and feature engineering. The paper also aims to provide insights on how to
improve the model and make it more difficult for future participants to stand out.
The author suggests exploring more data imputation possibilities, trying different classifiers such as XGBoost,
and combining several classifiers using a voting pipeline. Overall, the goal is to achieve
a high score and rank among the top 3% in the competition.


In [15]:
summary = rag_chain.invoke("what are main headings in the article")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The main headings in the article are: 1. Count plot for titles after feature
engineering 2. Count plot after grouping titles to frequent groups 3. Bar plot survival
rate vs title 4. Conclusion


In [18]:
summary = rag_chain.invoke("what were the key take aways from the exploratory data analysis give detailed explaination")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The key takeaways from the exploratory data analysis are: 1. Exploratory data analysis is
time-consuming but essential in understanding the dataset and asking the right questions. 2. The
analysis helps in identifying weak features that can be eliminated and focusing on meaningful
features. 3. The cabin feature in the Titanic dataset can be improved by considering
the number of cabins booked instead of just extracting the first two letters. 4.
Balancing the dataset and trying different metrics can be explored to improve the model.
5. The thought process of a Machine Learning Engineer/Data Scientist in data cleaning and
feature engineering is crucial for achieving top scores. 6. The complete code for the
analysis can be found on the GitHub repository provided. 7. Getting the data and
importing necessary modules for handling tabular data is the first step in the process.
8. Exploratory data analysis involves investigating the data and understanding the available features. 9.
Not all featu

In [19]:
summary = rag_chain.invoke("based on exploratory data anaylyis what are the key features that can be used to trian the model")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The key features that can be used to train the model based on exploratory
data analysis are not explicitly mentioned in the given context. However, the context suggests
that the analysis aims to reduce the number of weak features and identify meaningful
features. It also mentions the possibility of extracting the number of cabins booked as
a potential feature. Additionally, it emphasizes the importance of exploring which features should be
considered and which should not. Therefore, further analysis and investigation of the available features
are required to determine the specific key features for training the model.


In [20]:
summary = rag_chain.invoke("based on feature engineering what are the key features that can be used to trian the model")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

Based on the given context, the key features that can be used to train
the model are: 1. Number of cabins booked: Instead of simply extracting the first
two letters of the cabin feature, the number of cabins booked by a passenger
can be considered as a possible feature. 2. Balancing the dataset: Ways to balance
the dataset can be explored to improve the model's performance. 3. Different metric: Trying
a different metric for evaluation can be considered to enhance the model's accuracy. 4.
Exploratory data analysis: Conducting exploratory data analysis can help in reducing the number of
weak features and identifying meaningful features. 5. Feature engineering: Creating new features such as
extracting the number of cabins used and engineering ticket and cabin features can be
experimented with. 6. Family size: Combining the features Parch and SibSp to create a
new feature that captures the information of both can be useful. 7. Bar plot:
Analyzing the bar plot of family size vs survival can

In [22]:
summary = rag_chain.invoke("what were the new features that were created in feature engineering section")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The new features that were created in the feature engineering section are: 1. Number
of cabins used 2. Cabin letter 3. Ticket and Cabin engineered new features


In [23]:
summary = rag_chain.invoke("what were the final features that were used in the training a classifier section")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The final features that were used in the training a classifier section were 'Pclass',
'Fare', 'Title', 'Embarked', 'Fam_type', 'Ticket_len', and 'Ticket_2letter'. The 'Cabin' feature was not used, and
the relevant information about the feature 'age' was already encoded in the 'title' feature.
The 'Sex' feature was also not used to avoid confusing the classifier.


In [24]:
summary = rag_chain.invoke("what were the steps involved in creating pipeline")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The steps involved in creating the pipeline for the model based on the given
article are as follows: 1. Getting the data: The first step is to obtain
the Titanic dataset, which is the dataset used in the Kaggle competition. This dataset
will be used for training and testing the model. 2. Importing necessary modules: The
required modules for handling tabular data and performing machine learning tasks are imported. These
modules will provide the necessary functions and methods for data preprocessing, feature engineering, and
model training. 3. Data cleaning and feature engineering: The thought process of a Machine
Learning Engineer or Data Scientist in data cleaning and feature engineering is discussed. This
step involves handling missing values, transforming categorical variables, creating new features, and preparing the
data for model training. 4. Building the model: The specific model used in the
article is not mentioned, but it is suggested that the Random Forest (RF) classifier
is u

In [25]:
summary = rag_chain.invoke("what were the comments in the pipline picture")
formatted_summary = format_output(summary.content, words_per_line=14)
print(formatted_summary)

The comments in the pipeline picture were not mentioned in the given context.


In [17]:
# Our evaluation metric is how close we get to the final goal stated in the article,
# so the first step should be to get the final goal of the paper. This question is fairly standard, so we could even
# make it a constant prompt.
# We can feed the model all the available Titanic articles.
# RAG is not perfect and may not be completely reliable all the time.
# Instruction fine-tuning is the path we need to take, based on @daniel's analysis.
# To develop instruction fine-tuning we need a dataset in the format below.

# prompt: What are the steps I need to build an ML model based on the article mentioned below?
# response: We need to ask the following questions:
# Q1: What is this paper trying to achieve, or what are the final outcomes?
# Q2: What is in the introduction?
# Q3: What is the first step taken? (This can be data cleaning and managing data imbalances.)
# Q4: How is the first step implemented? (follow-up question)
# Q5: What was done after the first step, or what is step 2? (This can be exploratory data analysis.)
# Q6: What was interpreted from step 2? (follow-up question)
# Q7: What was done after step 2, or what is step 3? (This can be feature engineering; this step might not be
#       present in all papers.)
# Q8: Did step 2 (EDA) have an effect on step 3? (This can be checked by verifying whether new features were
#       introduced compared to the initial features.)
# Q9: What are the outcomes of step 3? (final features from step 3)
# Q10: What was done after step 3, or what is step 4? (This can be the baseline model selected, or the models they
#       experimented with before going to the final model.)
# Q11: What are the evaluation metrics? (This might not be a step; it can be asked at any point, preferably before
#       the baseline model.)
# Q12: What is the main model they focused on, or step 5? (At this point I am not sure whether to call it step 5,
#       but let's keep it that way.)
# Q13: How is the model tuned, or which parameters were tuned?
# Q14: What were the final tuned parameters?
# Q15: What is the accuracy of the final model?
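
The prompt/response format described in the comments above could be stored as one JSON object per line (JSONL), a common layout for instruction-tuning data. The sketch below is an assumption about how such a record might be built: the helper name `build_record`, the `prompt` and `response` field names, the placeholder article text, and the shortened question list are illustrative, not a fixed schema.

```python
import json

# First three questions from the list above; the rest would follow the same pattern.
QUESTIONS = [
    "What is this paper trying to achieve, or what are the final outcomes?",
    "What is in the introduction?",
    "What is the first step taken?",
]


def build_record(article_text, questions):
    """Return one training example as a dict with 'prompt' and 'response' keys."""
    prompt = (
        "What are the steps I need to build an ML model based on the "
        "article below?\n\n" + article_text
    )
    # Number the questions Q1, Q2, ... as in the notes.
    numbered = "\n".join(f"Q{i}: {q}" for i, q in enumerate(questions, start=1))
    response = "We need to ask the following questions:\n" + numbered
    return {"prompt": prompt, "response": response}


record = build_record("<article text goes here>", QUESTIONS)
line = json.dumps(record)  # one JSON object per line in a .jsonl file
print(line)
```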


In [None]:
# NOTES
# If we are trying to implement the paper, we should not worry about how much data we have, because whoever worked
# on that paper would have taken care of it.
# We need to focus on how they implemented the strategies to make use of that data, or how they processed it (which
#   is a kind of abstract view of the data preparation step).

**Template:**
You are an assistant to get the required steps to build Model based on given artile. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use maximum of 10 sentence if needed more sentences use it

**Question:**
what are the things that I need to do in Exploratory data analysis ?

**Answer:**
In exploratory data analysis, there are several steps that need to be taken. First, you need to import the necessary modules for handling tabular data. Then, you should read the data CSV files and convert them into pandas dataframes. After that, you can investigate the data to gain a better understanding of the available features. It is not necessary to use every available feature, so you should identify the relevant ones and discard the irrelevant ones. This can help reduce the number of weak features and focus on creating meaningful features. Finally, you can perform data cleaning and feature engineering to improve the quality of the data and create new features that can enhance the performance of the model.

**Template:**
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

**Question:**
what are the things that I need to do in Exploratory data analysis ?

**Answer:**
In exploratory data analysis, you need to ask the right questions and spend time
searching for better features. The goal is to reduce weak features and create meaningful
ones. Preprocessing the data and understanding the available features are also important steps in
EDA.
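
The two runs above differ only in the instruction part of the prompt. One way to make such comparisons repeatable is to keep the instruction texts in a dict and build each prompt from the same question and context. This is a sketch: the dict keys `detailed` and `concise`, the helper `build_prompt`, and the placeholder context string are hypothetical, the detailed instruction is lightly abridged with spelling corrected, and the `Question/Context/Answer` tail is assumed to match the notebook's prompt template.

```python
INSTRUCTIONS = {
    "detailed": (
        "You are an assistant to get the required steps to build a model "
        "based on the given article. Use the following pieces of retrieved "
        "context to answer the question. If you don't know the answer, just "
        "say that you don't know."
    ),
    "concise": (
        "You are an assistant for question-answering tasks. Use the following "
        "pieces of retrieved context to answer the question. If you don't know "
        "the answer, just say that you don't know. Use three sentences maximum "
        "and keep the answer concise."
    ),
}


def build_prompt(style, question, context):
    """Combine one instruction variant with the shared question and context."""
    return (
        f"{INSTRUCTIONS[style]}\n"
        f"Question: {question} \nContext: {context} \nAnswer:"
    )


question = "what are the things that I need to do in Exploratory data analysis ?"
prompts = {
    style: build_prompt(style, question, "<retrieved chunks>")
    for style in INSTRUCTIONS
}
for style, text in prompts.items():
    print(f"--- {style} ---\n{text}\n")
```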