This repository contains the code for a personalized recommendation system API. The system recommends text posts to users based on user data, and the API returns the recommended posts as JSON on request.
The repository consists of the following main files:
- features_uploader.ipynb: This Jupyter notebook is used for gathering and transforming data, which is then uploaded to an SQL server for model training. It includes methods for data extraction, transformation, and loading (ETL).
- training.ipynb: This Jupyter notebook contains the code for training the recommendation model. It includes data preprocessing, model training, and model evaluation steps. Various machine learning algorithms and techniques are used in this process.
- service.py: This is the main service file for the API. It handles API requests and responses, and uses the trained model to generate recommendations. It includes methods for handling HTTP requests, processing data, and returning responses.
The data used in this project consists of three tables:
- user_data: Contains information about users such as age, city, country, experience group, gender, operating system, and user ID. This table has 163,205 entries.
- post_text_df: Contains information about text posts such as post ID, text content, and topic. This table has 7,023 entries.
- feed_data: Contains information about user-to-post interactions such as timestamp, user ID, post ID, action, and target. This table has 76,892,800 entries.
Features were built from both user and post data. Techniques included cumulative like counts on posts, time decomposition, Truncated Singular Value Decomposition (Truncated SVD), and category encoding. A preliminary CatBoost model was trained, and its built-in feature importances were used to select the best features. Features with potential information leakage were removed.
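Two of the techniques above, time decomposition and cumulative likes, can be sketched with the standard library alone. This is a minimal illustration and not the code from features_uploader.ipynb: the function names, the event tuple layout, and the choice of calendar fields are assumptions. The like count is taken before each event so that the current interaction's target does not leak into its own feature.

```python
from collections import defaultdict
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Decompose a timestamp into simple calendar features (assumed set of fields)."""
    return {"hour": ts.hour, "weekday": ts.weekday(), "month": ts.month}


def cumulative_likes(events):
    """Return (timestamp, post_id, prior_likes) for each event.

    events: iterable of (timestamp, post_id, liked) tuples, liked being 0 or 1.
    prior_likes counts only likes that happened strictly earlier in time order,
    which keeps the current target out of the feature.
    """
    counts = defaultdict(int)
    result = []
    for ts, post_id, liked in sorted(events, key=lambda e: e[0]):
        result.append((ts, post_id, counts[post_id]))
        counts[post_id] += int(liked)
    return result


# Toy interactions, invented for illustration only.
events = [
    (datetime(2021, 1, 3, 14), 345, 1),
    (datetime(2021, 1, 3, 15), 345, 0),
    (datetime(2021, 1, 4, 9), 345, 1),
    (datetime(2021, 1, 4, 10), 134, 1),
]
features = cumulative_likes(events)
```

On the full feed_data table the same idea would more likely be expressed as a grouped cumulative sum in SQL or pandas; the loop here only shows the logic.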
The CatBoost model was trained with 'AUC' and 'NDCG' as evaluation metrics, chosen to align more closely with the Hitrate@5 metric used for the final evaluation. Hitrate@5 is not part of the CatBoost library, so a custom implementation was used.
The Hitrate@5, which shows the likelihood of at least one of the 5 recommended posts being liked by the user, was 0.63 on the test set.
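A Hitrate@5 computation might look like the sketch below. It is not the implementation from training.ipynb: the function name, the input layout (dictionaries keyed by user ID), and averaging per user are assumptions made for illustration.

```python
def hitrate_at_k(recommended, liked, k=5):
    """Share of users with at least one liked post among their top-k recommendations.

    recommended: dict mapping user_id -> list of post IDs, best first
    liked: dict mapping user_id -> set of post IDs the user actually liked
    """
    if not recommended:
        return 0.0
    hits = sum(
        1
        for user_id, posts in recommended.items()
        if set(posts[:k]) & liked.get(user_id, set())
    )
    return hits / len(recommended)


# Toy data, invented for illustration only.
recommended = {
    1: [10, 11, 12, 13, 14],
    2: [20, 21, 22, 23, 24],
    3: [30, 31, 32, 33, 34, 35],
}
liked = {1: {12}, 2: {99}, 3: {35}}

# User 1 is a hit; user 2 has no liked post; user 3's liked post is ranked 6th.
score = hitrate_at_k(recommended, liked)
```

If the evaluation is done per request rather than per user, the dictionaries would be keyed by request instead of by user ID.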
The FastAPI service accepts the following parameters via the GET method:
- id: The ID of the user for whom posts are requested.
- time: A datetime object, e.g., datetime.datetime(year=2021, month=1, day=3, hour=14).
- limit: The number of posts to return for the user.
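A request with these parameters could be assembled as in the following sketch. The host, port, and endpoint path are hypothetical, since this README does not state them; check service.py for the actual route. The sketch also assumes the service accepts the time value as an ISO 8601 string, which is how FastAPI usually parses datetime query parameters.

```python
from datetime import datetime
from urllib.parse import urlencode

# Hypothetical address and route; replace with the one defined in service.py.
BASE_URL = "http://localhost:8000/post/recommendations/"


def build_request_url(user_id: int, time: datetime, limit: int = 5) -> str:
    """Build the GET URL with the id, time, and limit query parameters."""
    query = urlencode({"id": user_id, "time": time.isoformat(), "limit": limit})
    return f"{BASE_URL}?{query}"


url = build_request_url(200, datetime(year=2021, month=1, day=3, hour=14))
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client once the service is running.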
The API returns 5 posts in the following format:
[
{
"id": 345,
"text": "COVID-19 runs wild....",
"topic": "news"
},
{
"id": 134,
"text": "Chelsea FC wins UEFA..",
"topic": "news"
},
...
]
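On the client side, a response in this format can be turned into typed objects. The sketch below is an illustration only: the Post class and parse_posts function are hypothetical names, and the field set is taken from the example response above.

```python
import json
from dataclasses import dataclass


@dataclass
class Post:
    """One recommended post, with the fields shown in the example response."""
    id: int
    text: str
    topic: str


def parse_posts(body: str) -> list:
    """Parse the JSON array returned by the API into Post objects."""
    return [Post(**item) for item in json.loads(body)]


# Sample body built from the two posts in the example response.
body = (
    '[{"id": 345, "text": "COVID-19 runs wild....", "topic": "news"},'
    ' {"id": 134, "text": "Chelsea FC wins UEFA..", "topic": "news"}]'
)
posts = parse_posts(body)
```

Constructing Post with keyword arguments raises a TypeError if the service ever returns an unexpected or missing field, which surfaces format changes early.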
- Python 3.x
- Jupyter Notebook
- Required Python libraries: (list any libraries that are not included in the standard library)
- Clone this repository to your local machine.
- Install the required Python libraries.
- Run the features_uploader.ipynb notebook to gather and transform data.
- Run the training.ipynb notebook to train the model.
- Run service.py to start the API.
Here is a diagram that illustrates the workflow of the recommendation system:
In this workflow:
- User data is extracted, transformed, and loaded (ETL) in the features_uploader.ipynb notebook.
- The transformed data is uploaded to an SQL server.
- The training.ipynb notebook trains the recommendation model using the uploaded data.
- The service.py file handles API requests and uses the trained model to generate recommendations.
- The recommendations are returned to the user.
Contributions are welcome! Please read the contributing guidelines before getting started.
This project is licensed under the terms of the (your license) - see the LICENSE file for details.