## Project setup

**Setting up a clean environment and all the required dependencies**

**Requirements**

- Python 3.9 or later (we used Python 3.10 in our new environment).
- A code editor or notebook environment such as VS Code, PyCharm, or Jupyter Notebook (optional but helpful).
- Git, if you want version control.
- An OpenAI account and API key.
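
Before running the cells below, it can help to confirm that the interpreter and packages are in place. The following is a minimal sketch, not part of the original notebook: `missing_packages` is a hypothetical helper, and the import names `dotenv`, `pandas`, and `langchain` are assumed from the imports used in the later cells.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 9)  # minimum version from the requirements list


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


if sys.version_info < MIN_PYTHON:
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {sys.version.split()[0]}")

# Import names assumed from the cells in this notebook
needed = ["dotenv", "pandas", "langchain"]
print("Missing packages:", missing_packages(needed) or "none")
```

Anything reported as missing can then be installed into the environment before continuing.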

In [1]:
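from dotenv import load_dotenv
import os

# Read variables from a local .env file into the process environment
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")

# Print only a short prefix so the full key never appears in the output
print("API key loaded:", api_key[:5] + "…")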

API key loaded: sk-pr…


### Loading the data set


In [6]:
# Load the dataset and preview it
import pandas as pd
df = pd.read_csv("large_weather.csv")
print("Dataset loaded with shape:", df.shape)
# Display the first few rows of the dataframe
df.head()



Dataset loaded with shape: (70, 5)


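|   | City | Date | Temperature | Humidity | Condition |
|---|------|------|-------------|----------|-----------|
| 0 | Nairobi | 2026-01-19 | 21 | 54 | Partly Cloudy |
| 1 | Nairobi | 2026-01-20 | 33 | 57 | Cloudy |
| 2 | Nairobi | 2026-01-21 | 30 | 70 | Thunderstorm |
| 3 | Nairobi | 2026-01-22 | 30 | 68 | Foggy |
| 4 | Nairobi | 2026-01-23 | 26 | 68 | Rainy |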


In [9]:
# Data cleaning steps
df = df.dropna()           # drop rows with any missing value
df = df.drop_duplicates()  # drop exact duplicate rows
print("Dataset after cleaning shape:", df.shape)
df.head()


Dataset after cleaning shape: (70, 5)


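|   | City | Date | Temperature | Humidity | Condition |
|---|------|------|-------------|----------|-----------|
| 0 | Nairobi | 2026-01-19 | 21 | 54 | Partly Cloudy |
| 1 | Nairobi | 2026-01-20 | 33 | 57 | Cloudy |
| 2 | Nairobi | 2026-01-21 | 30 | 70 | Thunderstorm |
| 3 | Nairobi | 2026-01-22 | 30 | 68 | Foggy |
| 4 | Nairobi | 2026-01-23 | 26 | 68 | Rainy |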
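
The shape is unchanged after cleaning, which indicates that no rows had missing values or duplicates. To make that explicit before dropping anything, one option is to count the affected rows first. This is a sketch only: `cleaning_report` is a hypothetical helper, and the small frame below is made-up toy data, not rows from `large_weather.csv`.

```python
import pandas as pd


def cleaning_report(frame):
    """Count the rows that dropna() and drop_duplicates() would each remove."""
    return {
        "rows": len(frame),
        "rows_with_missing": int(frame.isna().any(axis=1).sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy data for illustration: one duplicated row and one row with a missing city
toy = pd.DataFrame(
    {"City": ["A", "A", "B", None], "Temperature": [1, 1, 2, 3]}
)
print(cleaning_report(toy))
```

Calling `cleaning_report(df)` before the cleaning cell would show how many rows each step is about to remove.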


In [10]:
# Data info: column names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   City         70 non-null     object
 1   Date         70 non-null     object
 2   Temperature  70 non-null     int64 
 3   Humidity     70 non-null     int64 
 4   Condition    70 non-null     object
dtypes: int64(2), object(3)
memory usage: 2.9+ KB
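
The `df.info()` output shows `Date` stored as `object`, that is, as plain strings. If date-based filtering or sorting is needed later, the column could be parsed into datetimes. A minimal sketch, assuming the ISO `YYYY-MM-DD` format seen in the preview rows; `parse_date_column` is a hypothetical helper, and the two sample rows are copied from the `df.head()` output.

```python
import pandas as pd


def parse_date_column(frame, column="Date"):
    """Return a copy of `frame` with `column` parsed from ISO strings to datetimes."""
    out = frame.copy()
    # errors="raise" makes malformed dates fail loudly instead of silently becoming NaT
    out[column] = pd.to_datetime(out[column], format="%Y-%m-%d", errors="raise")
    return out


sample = pd.DataFrame(
    {"City": ["Nairobi", "Nairobi"], "Date": ["2026-01-19", "2026-01-20"]}
)
parsed = parse_date_column(sample)
print(parsed.dtypes)
```

The helper returns a copy, so the original frame keeps its string column until the result is assigned back.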


In [13]:
# #Convert the table to RAG format

# #embedding text snippets for easy retrieval
# from langchain.docstore.document import Document
# from langchain.embeddings.openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(openai_api_key=api_key)
# from langchain.vectorstores import FAISS
# docs = [Document(page_content=row.to_json()) for _, row in df.iterrows()]
# vectorstore = FAISS.from_documents(docs, embeddings)
# print("Vectorstore created with", len(docs), "documents.")


In [12]:
from langchain_core.documents import Document

# Turn each row into a short text snippet that can be embedded and retrieved
docs = []
for _, row in df.iterrows():
    text = f"City: {row['City']}, Date: {row['Date']}, Temp: {row['Temperature']}, Condition: {row['Condition']}"
    docs.append(Document(page_content=text))
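
The text built in this cell leaves out `Humidity`, so humidity values are not searchable in the resulting documents. If that is not intended, one option is to format every column generically. A sketch under that assumption: `record_to_text` is a hypothetical helper that works on plain dicts, and the example row is copied from the `df.head()` output.

```python
def record_to_text(record):
    """Format one row (a dict of column name -> value) as 'Column: value' pairs."""
    return ", ".join(f"{column}: {value}" for column, value in record.items())


# One row copied from the preview of the dataset
row = {
    "City": "Nairobi",
    "Date": "2026-01-19",
    "Temperature": 21,
    "Humidity": 54,
    "Condition": "Partly Cloudy",
}
print(record_to_text(row))
# City: Nairobi, Date: 2026-01-19, Temperature: 21, Humidity: 54, Condition: Partly Cloudy
```

In the notebook, the rows could be supplied with `df.to_dict("records")` and each resulting string passed to `Document(page_content=...)`.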


In [18]:
# Vectorstore creation
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))
vectorstore = FAISS.from_documents(docs, embeddings)


ModuleNotFoundError: Module langchain_community.embeddings not found. Please install langchain-community to access this module. You can install it using `pip install -U langchain-community`
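
The error suggests that the installed LangChain release has moved these integrations out of the core `langchain` package. Below is a minimal sketch of one way to resolve it, assuming `pip install -U langchain-community langchain-openai faiss-cpu` has been run. Only `langchain-community` is named in the error message; `langchain-openai` and `faiss-cpu` are assumptions about what this environment also needs. `build_vectorstore` is a hypothetical helper, not part of the original notebook.

```python
def build_vectorstore(docs, api_key):
    """Embed `docs` with OpenAI embeddings and index them in a FAISS vector store."""
    # Imports are deferred so a missing package surfaces here with an actionable message
    try:
        from langchain_openai import OpenAIEmbeddings
        from langchain_community.vectorstores import FAISS
    except ImportError as exc:
        raise ImportError(
            "Missing LangChain integration packages. Try: "
            "pip install -U langchain-community langchain-openai faiss-cpu"
        ) from exc

    embeddings = OpenAIEmbeddings(api_key=api_key)
    # This step calls the OpenAI API, so it needs network access and a valid key
    return FAISS.from_documents(docs, embeddings)
```

With the packages installed, the cell could then call `vectorstore = build_vectorstore(docs, os.getenv("OPENAI_API_KEY"))`.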