This project demonstrates a complete data pipeline — from loading raw Twitter data to performing similarity search using a vector database. The dataset used is a small Twitter dataset (~200 rows) that initially contained highly inconsistent and messy information.
The main focus of this project is data cleaning, where major issues such as inconsistent usernames, unnormalized follower counts, irregular date-time formats, mixed sentiments, and emoji handling were addressed.
After cleaning, the data was formatted, feature-engineered, and then converted into embeddings stored inside a vector database to enable semantic similarity search — forming a mini end-to-end RAG (Retrieval-Augmented Generation) system.
Data Loading
Loaded a small, unclean Twitter dataset (~200 rows) for analysis.
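A minimal sketch of this loading step with pandas. The column names and sample rows below are illustrative stand-ins for the real file (the actual CSV is not reproduced here), so the data is inlined to keep the example self-contained:

```python
from io import StringIO

import pandas as pd

# A few rows in the same messy shape as the raw dataset
# (column names and values are illustrative, not the actual file).
raw_csv = StringIO("""username,followers,timestamp,sentiment,text
@Alice ,1.2K,2023/01/05,Positive,Loving this!
bob_99,3400,05-01-2023,NEG,worst day ever
@Carol,2M,Jan 5 2023,positive ,great stuff
""")

df = pd.read_csv(raw_csv)
print(df.shape)   # (3, 5) for this toy sample
print(df.dtypes)  # note: followers loads as object, not numeric
```

In the project itself the raw CSV would be read with `pd.read_csv("path/to/raw.csv")`; the point is that messy values like `1.2K` arrive as strings and must be cleaned before analysis.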
Exploratory Data Analysis (EDA)
Explored the dataset's structure, missing values, and inconsistencies, identifying major issues with usernames, follower counts, sentiment labels, and text content.
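The EDA checks above can be sketched in a few pandas calls. The toy frame below assumes columns like `username` and `sentiment`; the real dataset's schema may differ:

```python
import pandas as pd

# Toy frame standing in for the raw tweets (assumed columns).
df = pd.DataFrame({
    "username": ["@Alice ", "bob_99", None],
    "followers": ["1.2K", "3400", "2M"],
    "sentiment": ["Positive", "NEG", "positive "],
})

# Structure and dtypes: everything arrives as object/string.
print(df.dtypes)

# Missing values per column.
print(df.isna().sum())

# Spot inconsistent categorical labels before cleaning.
print(df["sentiment"].unique())
```

Listing `unique()` values per categorical column is what surfaces problems like `"Positive"`, `"NEG"`, and `"positive "` all meaning slightly different things.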
Data Cleaning
- Removed emojis and unwanted characters using Python's `emoji` library.
- Normalised inconsistent date and time formats into a unified standard.
- Standardised numeric features such as follower counts and other metrics.
- Unified and cleaned mixed sentiment labels so each row has one consistent sentiment.
- Saved both raw (messy) and cleaned CSV files for reference and comparison.
- This stage took ~90% of the total project time, highlighting the importance of thorough data cleaning in real-world workflows.
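The cleaning steps above can be sketched as follows. This is a simplified stand-in, not the project's exact code: a Unicode-range regex replaces the `emoji` library for a dependency-free example, and the suffix table, sentiment map, and column names are assumptions. Parsing mixed date formats with `format="mixed"` requires pandas 2.0+.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "text": ["Loving this! \U0001F680", "great stuff \U0001F60A", "worst day ever"],
    "followers": ["1.2K", "2M", "3400"],
    "sentiment": ["Positive", "positive ", "NEG"],
    "timestamp": ["2023/01/05", "Jan 5 2023", "05 January 2023"],
})

# 1) Strip emojis. The project uses the `emoji` library; a regex over the
# common emoji Unicode blocks gives an equivalent result here.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]")
df["text"] = df["text"].str.replace(EMOJI_RE, "", regex=True).str.strip()

# 2) Normalise follower counts like "1.2K" / "2M" into plain integers.
def parse_count(value: str) -> int:
    value = value.strip().upper()
    scale = {"K": 1_000, "M": 1_000_000}
    if value[-1] in scale:
        return int(float(value[:-1]) * scale[value[-1]])
    return int(value)

df["followers"] = df["followers"].map(parse_count)

# 3) Unify mixed sentiment labels onto one consistent vocabulary.
SENTIMENT_MAP = {"positive": "positive", "pos": "positive",
                 "negative": "negative", "neg": "negative"}
df["sentiment"] = df["sentiment"].str.strip().str.lower().map(SENTIMENT_MAP)

# 4) Parse irregular date strings into a single datetime dtype.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="mixed")

print(df)
```

Saving the frame before and after these steps (`df.to_csv(...)`) gives the raw/cleaned pair kept for comparison.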
Data Formatting & Feature Engineering
Converted textual and categorical data into a structured, machine-readable format and prepared features for embedding generation.
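One way this step can look (a sketch with assumed column names, not the project's exact feature set): one-hot encoding the categorical sentiment column, and composing a single self-describing string per row for the embedding model:

```python
import pandas as pd

df = pd.DataFrame({
    "username": ["alice", "bob"],
    "sentiment": ["positive", "negative"],
    "text": ["loving this product", "worst day ever"],
})

# One-hot encode the categorical sentiment column for tabular use.
features = pd.get_dummies(df, columns=["sentiment"], prefix="sent")

# Compose one document string per row so the embedding carries
# author and sentiment context alongside the tweet text.
df["doc"] = "@" + df["username"] + " (" + df["sentiment"] + "): " + df["text"]
print(df["doc"].tolist())
```

Each `doc` string is what gets passed to the embedding model in the next stage.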
Vector Database & Similarity Search
Generated embeddings from the cleaned data. Stored them in a vector database (FAISS/Chroma) and performed semantic similarity search to retrieve related content efficiently.
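The retrieval step can be illustrated without the heavy dependencies. The sketch below uses hand-picked toy vectors and NumPy in place of Sentence Transformers and FAISS; normalising rows so the dot product equals cosine similarity mirrors how FAISS's inner-product index is typically used:

```python
import numpy as np

# Toy 4-dimensional "embeddings" for three cleaned tweets; in the project
# these come from Sentence Transformers and are indexed with FAISS/Chroma.
docs = ["loving this product", "worst day ever", "great experience overall"]
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.2],   # positive
    [0.0, 0.9, 0.3, 0.0],   # negative
    [0.8, 0.2, 0.1, 0.3],   # positive, close to the first
], dtype=np.float32)

# L2-normalise rows so a dot product is cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 2) -> list:
    """Return the k documents most similar to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = embeddings @ q
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# A query vector near the "positive" region retrieves the positive tweets.
query = np.array([0.85, 0.15, 0.05, 0.25], dtype=np.float32)
print(search(query))
```

Swapping the toy vectors for real model outputs and the NumPy scan for a FAISS or Chroma index is what turns this into the project's semantic search stage.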
Key Learnings
- Data cleaning is the most time-consuming yet most crucial stage of any data project.
- Emoji handling and normalization of mixed data types can be simplified with the right tools.
- Once clean, the same dataset can power advanced applications like RAG pipelines and semantic search.
- Keeping both raw and cleaned datasets allows for full reproducibility and comparison.
Tech Stack
- Python
- Pandas, NumPy
- Emoji (for cleaning text)
- Datetime, Regex
- FAISS / Chroma (for vector database)
- Sentence Transformers
Highlights
- Included both raw and cleaned datasets for comparison.
- Detailed cleaning process notebook (ideal for beginners learning data preprocessing).
- End-to-end flow from messy tweets to vector-based semantic search.
- Demonstrates handling real-world messy data, normalization, feature engineering, and RAG integration.