This project demonstrates a complete data pipeline — from loading raw Twitter data to performing similarity search using a vector database. The dataset used is a small Twitter dataset (~200 rows) that initially contained highly inconsistent and messy information.
The main focus of this project is data cleaning, where major issues such as inconsistent usernames, unnormalized follower counts, irregular date-time formats, mixed sentiments, and emoji handling were addressed.
After cleaning, the data was formatted, feature-engineered, and then converted into embeddings stored inside a vector database to enable semantic similarity search — forming a mini end-to-end RAG (Retrieval-Augmented Generation) system.
Data Loading
Loaded a small, unclean Twitter dataset (~200 rows) for analysis.
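A minimal sketch of this loading step with pandas. The column names and sample rows below are illustrative stand-ins for the real file (the actual CSV is not reproduced here), so the data is inlined to keep the example self-contained:

```python
from io import StringIO

import pandas as pd

# A few rows in the same messy shape as the raw dataset
# (column names and values are illustrative, not the actual file).
raw_csv = StringIO("""username,followers,timestamp,sentiment,text
@Alice ,1.2K,2023/01/05,Positive,Loving this!
bob_99,3400,05-01-2023,NEG,worst day ever
@Carol,2M,Jan 5 2023,positive ,great stuff
""")

df = pd.read_csv(raw_csv)
print(df.shape)   # (3, 5) for this toy sample
print(df.dtypes)  # note: followers loads as object, not numeric
```

In the project itself the raw CSV would be read with `pd.read_csv("path/to/raw.csv")`; the point is that messy values like `1.2K` arrive as strings and must be cleaned before analysis.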
Exploratory Data Analysis (EDA)
Explored the dataset's structure, missing values, and inconsistencies, identifying major issues with usernames, follower counts, sentiment labels, and text content.
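The EDA checks above can be sketched in a few pandas calls. The toy frame below assumes columns like `username` and `sentiment`; the real dataset's schema may differ:

```python
import pandas as pd

# Toy frame standing in for the raw tweets (assumed columns).
df = pd.DataFrame({
    "username": ["@Alice ", "bob_99", None],
    "followers": ["1.2K", "3400", "2M"],
    "sentiment": ["Positive", "NEG", "positive "],
})

# Structure and dtypes: everything arrives as object/string.
print(df.dtypes)

# Missing values per column.
print(df.isna().sum())

# Spot inconsistent categorical labels before cleaning.
print(df["sentiment"].unique())
```

Listing `unique()` values per categorical column is what surfaces problems like `"Positive"`, `"NEG"`, and `"positive "` all meaning slightly different things.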
Data Cleaning
- Removed emojis and unwanted characters using Python's `emoji` library.
- Normalised inconsistent date and time formats into a unified standard.
- Standardised numeric features such as follower counts and other metrics.
- Unified and cleaned mixed sentiment labels so each row has one consistent sentiment.
- Saved both raw (messy) and cleaned CSV files for reference and comparison.
- This stage took ~90% of the total project time, highlighting the importance of thorough data cleaning in real-world workflows.
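The cleaning steps above can be sketched as follows. This is a simplified stand-in, not the project's exact code: a Unicode-range regex replaces the `emoji` library for a dependency-free example, and the suffix table, sentiment map, and column names are assumptions. Parsing mixed date formats with `format="mixed"` requires pandas 2.0+.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "text": ["Loving this! \U0001F680", "great stuff \U0001F60A", "worst day ever"],
    "followers": ["1.2K", "2M", "3400"],
    "sentiment": ["Positive", "positive ", "NEG"],
    "timestamp": ["2023/01/05", "Jan 5 2023", "05 January 2023"],
})

# 1) Strip emojis. The project uses the `emoji` library; a regex over the
# common emoji Unicode blocks gives an equivalent result here.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]")
df["text"] = df["text"].str.replace(EMOJI_RE, "", regex=True).str.strip()

# 2) Normalise follower counts like "1.2K" / "2M" into plain integers.
def parse_count(value: str) -> int:
    value = value.strip().upper()
    scale = {"K": 1_000, "M": 1_000_000}
    if value[-1] in scale:
        return int(float(value[:-1]) * scale[value[-1]])
    return int(value)

df["followers"] = df["followers"].map(parse_count)

# 3) Unify mixed sentiment labels onto one consistent vocabulary.
SENTIMENT_MAP = {"positive": "positive", "pos": "positive",
                 "negative": "negative", "neg": "negative"}
df["sentiment"] = df["sentiment"].str.strip().str.lower().map(SENTIMENT_MAP)

# 4) Parse irregular date strings into a single datetime dtype.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="mixed")

print(df)
```

Saving the frame before and after these steps (`df.to_csv(...)`) gives the raw/cleaned pair kept for comparison.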
Data Formatting & Feature Engineering
Converted textual and categorical data into a structured, machine-readable format and prepared features for embedding generation.
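One way this step can look (a sketch with assumed column names, not the project's exact feature set): one-hot encoding the categorical sentiment column, and composing a single self-describing string per row for the embedding model:

```python
import pandas as pd

df = pd.DataFrame({
    "username": ["alice", "bob"],
    "sentiment": ["positive", "negative"],
    "text": ["loving this product", "worst day ever"],
})

# One-hot encode the categorical sentiment column for tabular use.
features = pd.get_dummies(df, columns=["sentiment"], prefix="sent")

# Compose one document string per row so the embedding carries
# author and sentiment context alongside the tweet text.
df["doc"] = "@" + df["username"] + " (" + df["sentiment"] + "): " + df["text"]
print(df["doc"].tolist())
```

Each `doc` string is what gets passed to the embedding model in the next stage.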
Vector Database & Similarity Search
Generated embeddings from the cleaned data. Stored them in a vector database (FAISS/Chroma) and performed semantic similarity search to retrieve related content efficiently.
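The retrieval step can be illustrated without the heavy dependencies. The sketch below uses hand-picked toy vectors and NumPy in place of Sentence Transformers and FAISS; normalising rows so the dot product equals cosine similarity mirrors how FAISS's inner-product index is typically used:

```python
import numpy as np

# Toy 4-dimensional "embeddings" for three cleaned tweets; in the project
# these come from Sentence Transformers and are indexed with FAISS/Chroma.
docs = ["loving this product", "worst day ever", "great experience overall"]
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.2],   # positive
    [0.0, 0.9, 0.3, 0.0],   # negative
    [0.8, 0.2, 0.1, 0.3],   # positive, close to the first
], dtype=np.float32)

# L2-normalise rows so a dot product is cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 2) -> list:
    """Return the k documents most similar to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# A query vector near the "positive" region retrieves the positive tweets.
query = np.array([0.85, 0.15, 0.05, 0.25], dtype=np.float32)
print(search(query))
```

Swapping the toy vectors for real model outputs and the NumPy scan for a FAISS or Chroma index is what turns this into the project's semantic search stage.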
Key Learnings
- Data cleaning is the most time-consuming yet most crucial stage of any data project.
- Emoji handling and normalization of mixed data types can be simplified with the right tools.
- Once clean, the same dataset can power advanced applications like RAG pipelines and semantic search.
- Keeping both raw and cleaned datasets allows for full reproducibility and comparison.
Tech Stack
- Python
- Pandas, NumPy
- Emoji (for cleaning text)
- Datetime, Regex
- FAISS / Chroma (for vector database)
- Sentence Transformers
Highlights
- Included both raw and cleaned datasets for comparison.
- Detailed cleaning process notebook (ideal for beginners learning data preprocessing).
- End-to-end flow from messy tweets to vector-based semantic search.
- Demonstrates handling real-world messy data, normalization, feature engineering, and RAG integration.